Python improve import time of data

I have a data file, somefile.py, that contains the following:
somename = [ [1,2,3,4,5,6], ...plus other such elements, making a 60 MB file ]
In my Python script (in the same folder as the data) I have only this, along with the appropriate shebang:
from somefile import somename
This took almost 20 minutes to complete. How can such an import be sped up?
I'm using Python 3.7 on macOS 10.13.
python
asked Jan 2 at 12:31
Geoff
Why not use something like a JSON file for static data, and have it loaded when it is needed?
– 422_unprocessable_entity
Jan 2 at 12:37
@BhathiyaPerera Many thanks for the suggestion.
– Geoff
Jan 2 at 12:46
From your question it seems your entire file is raw data and not actual Python code. Is this the case? If so, treating it as a text file and reading its lines could dramatically improve performance.
– yuvgin
Jan 2 at 13:17
Quite simply: do not try to use a Python module as data storage for such volumes - use a proper datastore (a database, whatever).
– bruno desthuilliers
Jan 2 at 13:25
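To make the JSON suggestion from the comments concrete, a minimal sketch would be to pay the slow import once, convert the data to JSON, and then load the JSON file at runtime. The file name somename.json is only illustrative:
# convert_once.py - run a single time to turn the module data into JSON
import json

from somefile import somename   # the slow import, paid only once here

with open("somename.json", "w") as fd:
    json.dump(somename, fd)

# in the actual script, replace the import with a fast JSON load
import json

with open("somename.json") as fd:
    somename = json.load(fd)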
2 Answers
Loading files as "Python source code" will always be relatively slow, but 20 minutes to load a 60 MB file seems far too slow. Python uses a full lexer/parser and, among other things, tracks source locations for accurate error reporting. Its grammar is deliberately simple, which makes parsing relatively fast, but it is still going to be much slower than other file formats.
I'd go with one of the other suggestions, but I thought it would be interesting to compare timings across different file formats.
First I generate some data:
somename = [list(range(6)) for _ in range(100_000)]
This takes my computer 152 ms. I can then save it as a "Python source file" with:
with open('data.py', 'w') as fd:
    fd.write(f'somename = {somename}')
which takes 84.1 ms. Reloading it using:
from data import somename
takes 1.40 seconds. I tried some other sizes and the scaling seems linear in array length, which I find impressive. I then started to play with different file formats. JSON:
import json

with open('data.json', 'w') as fd:
    json.dump(somename, fd)

with open('data.json') as fd:
    somename = json.load(fd)
Here saving took 787 ms and loading took 131 ms. Next, CSV:
import csv

with open('data.csv', 'w') as fd:
    out = csv.writer(fd)
    out.writerows(somename)

with open('data.csv') as fd:
    inp = csv.reader(fd)
    somename = [[int(v) for v in row] for row in inp]
Saving took 114 ms while loading took 329 ms (down to 129 ms if the strings aren't converted to ints). Next I tried musbur's suggestion of pickle:
import pickle  # no need for `cPickle` in Python 3

with open('data.pck', 'wb') as fd:
    pickle.dump(somename, fd)

with open('data.pck', 'rb') as fd:
    somename = pickle.load(fd)
Saving took 49.1 ms and loading took 128 ms.
The take-home message seems to be that loading data stored as Python source code takes about 10 times as long as the other formats, but I'm not sure how it's taking your computer 20 minutes!
answered Jan 2 at 13:56
Sam Mason
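The answer does not say exactly how the timings were taken, but numbers like these could be reproduced with a small harness along the following lines, using time.perf_counter; the file names match the illustrative ones above:
import importlib
import json
import pickle
import sys
import time

somename = [list(range(6)) for _ in range(100_000)]

def timed(label, fn):
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.3f} s")
    return result

# write the three files once
with open("data.py", "w") as fd:
    fd.write(f"somename = {somename}")
with open("data.json", "w") as fd:
    json.dump(somename, fd)
with open("data.pck", "wb") as fd:
    pickle.dump(somename, fd)

def load_py():
    # force a fresh import; note that a cached .pyc can still speed up re-runs
    sys.modules.pop("data", None)
    importlib.invalidate_caches()
    return importlib.import_module("data").somename

def load_json():
    with open("data.json") as fd:
        return json.load(fd)

def load_pickle():
    with open("data.pck", "rb") as fd:
        return pickle.load(fd)

timed("import data.py", load_py)
timed("json.load", load_json)
timed("pickle.load", load_pickle)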
The somefile.py file is obviously created by some piece of software. If it is re-created regularly (i.e., it changes often), that other piece of software should be rewritten to produce data in a format that is more easily loaded in Python (such as tabular text data, JSON, YAML, ...). If it is static data that never changes, do this once:
import pickle  # Python 2's cPickle is simply pickle in Python 3

from somefile import somename

with open("data.pck", "wb") as fh:
    pickle.dump(somename, fh)
This will serialize your data into a file "data.pck", from which it can be re-loaded very quickly.
answered Jan 2 at 13:37
musbur
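The re-loading step the answer refers to would then look like this (a sketch; "data.pck" is the file written above):
import pickle

with open("data.pck", "rb") as fh:
    somename = pickle.load(fh)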