Python improve import time of data

I have a data file, somefile.py, that contains the following:
somename = [ [1,2,3,4,5,6], ...plus other such elements, making a 60 MB file ]
In my Python script (in the same folder as the data) I have only this, along with the appropriate shebang:
from somefile import somename
This took almost 20 minutes to complete. How can such an import be sped up?
I'm using Python 3.7 on macOS 10.13.
python
asked Jan 2 at 12:31
Geoff
Why not use something like a JSON file for static data, and have it loaded when it is needed?
– 422_unprocessable_entity
Jan 2 at 12:37
@BhathiyaPerera Many thanks for the suggestion.
– Geoff
Jan 2 at 12:46
From your question it seems your entire file is raw data and not actual Python code. Is this the case? If so, treating it as a text file and reading its lines could dramatically improve performance.
– yuvgin
Jan 2 at 13:17
Quite simply: do not try to use a Python module as data storage for such volumes - use a proper datastore (a database, whatever).
– bruno desthuilliers
Jan 2 at 13:25
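To make the JSON suggestion from the comments concrete, a minimal sketch would be to pay the slow import once, convert the data to JSON, and then load the JSON file at runtime. The file name somename.json is only illustrative:
# convert_once.py - run a single time to turn the module data into JSON
import json

from somefile import somename   # the slow import, paid only once here

with open("somename.json", "w") as fd:
    json.dump(somename, fd)

# in the actual script, replace the import with a fast JSON load
import json

with open("somename.json") as fd:
    somename = json.load(fd)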
2 Answers
Loading files as "Python source code" will always be relatively slow, but 20 minutes to load a 60 MB file seems far too slow. Python uses a full lexer/parser and, among other things, tracks source locations for accurate error reporting. Its grammar is deliberately simple, which makes parsing relatively fast, but it is still going to be much slower than other file formats.
I'd go with one of the other suggestions, but I thought it would be interesting to compare timings across different file formats.
First I generate some data:
somename = [list(range(6)) for _ in range(100_000)]
This takes my computer 152 ms. I can then save it as a "Python source file" with:
with open('data.py', 'w') as fd:
    fd.write(f'somename = {somename}')
which takes 84.1 ms. Reloading it using:
from data import somename
takes 1.40 seconds. I tried some other sizes and the scaling seems linear in array length, which I find impressive. I then started to play with different file formats. JSON:
import json

with open('data.json', 'w') as fd:
    json.dump(somename, fd)

with open('data.json') as fd:
    somename = json.load(fd)
Here saving took 787 ms and loading took 131 ms. Next, CSV:
import csv

with open('data.csv', 'w') as fd:
    out = csv.writer(fd)
    out.writerows(somename)

with open('data.csv') as fd:
    inp = csv.reader(fd)
    somename = [[int(v) for v in row] for row in inp]
Saving took 114 ms while loading took 329 ms (down to 129 ms if the strings aren't converted to ints). Next I tried musbur's suggestion of pickle:
import pickle  # no need for `cPickle` in Python 3

with open('data.pck', 'wb') as fd:
    pickle.dump(somename, fd)

with open('data.pck', 'rb') as fd:
    somename = pickle.load(fd)
Saving took 49.1 ms and loading took 128 ms.
The take-home message seems to be that loading data stored as Python source code takes about 10 times as long as the other formats, but I'm not sure how it's taking your computer 20 minutes!
answered Jan 2 at 13:56
Sam Mason
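The answer does not say exactly how the timings were taken, but numbers like these could be reproduced with a small harness along the following lines, using time.perf_counter; the file names match the illustrative ones above:
import importlib
import json
import pickle
import sys
import time

somename = [list(range(6)) for _ in range(100_000)]

def timed(label, fn):
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.3f} s")
    return result

# write the three files once
with open("data.py", "w") as fd:
    fd.write(f"somename = {somename}")
with open("data.json", "w") as fd:
    json.dump(somename, fd)
with open("data.pck", "wb") as fd:
    pickle.dump(somename, fd)

def load_py():
    # force a fresh import; note that a cached .pyc can still speed up re-runs
    sys.modules.pop("data", None)
    importlib.invalidate_caches()
    return importlib.import_module("data").somename

def load_json():
    with open("data.json") as fd:
        return json.load(fd)

def load_pickle():
    with open("data.pck", "rb") as fd:
        return pickle.load(fd)

timed("import data.py", load_py)
timed("json.load", load_json)
timed("pickle.load", load_pickle)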
The somefile.py file is obviously created by some piece of software. If it is re-created regularly (i.e., it changes often), that other piece of software should be rewritten to produce data in a format that is more easily loaded in Python (such as tabular text data, JSON, YAML, ...). If it is static data that never changes, do this once:
import pickle  # Python 2's cPickle is simply pickle in Python 3

from somefile import somename

with open("data.pck", "wb") as fh:
    pickle.dump(somename, fh)
This will serialize your data into a file "data.pck", from which it can be re-loaded very quickly.
answered Jan 2 at 13:37
musbur
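The re-loading step the answer refers to would then look like this (a sketch; "data.pck" is the file written above):
import pickle

with open("data.pck", "rb") as fh:
    somename = pickle.load(fh)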