Is there a faster way to store a big dictionary, than pickle or regular Python file? [closed]
I want to store a dictionary which only contains data in the following format:
{
"key1" : True,
"key2" : True,
.....
}
In other words, just a quick way to check if a key is valid or not. I can do this by storing a dict called foo
in a file called bar.py
, and then in my other modules, I can import it as follows:
from bar import foo
Or, I can save it in a pickle file called bar.pickle
, and import it at the top of the file as follows:
import pickle

with open('bar.pickle', 'rb') as f:
    foo = pickle.load(f)
Which would be the ideal and faster way to do this?
python pickle
closed as primarily opinion-based by Engineero, Patrick Artner, eyllanesc, Jean-François Fabre, Paul Roub Jan 2 at 22:58
Many good questions generate some degree of opinion based on expert experience, but answers to this question will tend to be almost entirely based on opinions, rather than facts, references, or specific expertise. If this question can be reworded to fit the rules in the help center, please edit the question.
pickle is just a binary format which can help you save a dict more efficiently. Python definitely needs to spend a little more effort to read/parse a binary format than a plain-text format.
– Windchill
Jan 2 at 19:24
there are lots of alternatives here, what's best/ideal/fastest almost certainly depends on much more information than you've given. i.e. how often does this data change, who changes it, how do they change it, do you want to keep track of these changes, do you care about portability to different systems/versions of Python… and many more
– Sam Mason
Jan 2 at 19:25
You could also consider dumping it as a json, since the format is essentially the same, unless your dictionary has other Python objects inside, that is.
– Idlehands
Jan 2 at 19:31
If you don't have any need to distinguish between False and varieties of N/A, you can just create a list of the keys with True values and write that to file. Then instead of doing if my_dict[key], you can do if key in my_list.
– Acccumulation
Jan 2 at 19:35
@Acccumulation but that would be trading an O(1) membership test for an O(N) membership test.
– juanpa.arrivillaga
Jan 2 at 19:38
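As a sketch of the variant the last two comments converge on (filenames are hypothetical): storing just the valid keys as a set, rather than a list, keeps the O(1) membership test while dropping the redundant True values.

```python
import pickle

# Keep only the keys whose value is True. A set gives an O(1) membership
# test, unlike the O(N) scan a list would require.
valid_keys = {"key1", "key2"}

with open("bar_keys.pickle", "wb") as f:
    pickle.dump(valid_keys, f)

with open("bar_keys.pickle", "rb") as f:
    loaded = pickle.load(f)

print("key1" in loaded)  # True
print("nope" in loaded)  # False
```

pickle serializes sets natively, so this round-trips without any conversion.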
asked Jan 2 at 19:21 by darkhorse, edited Jan 2 at 20:16 by spectras
2 Answers
Python File
Using a Python file makes caching easy: if you "import" it multiple times, it only has to be parsed once. However, Python syntax is complicated, so the parser that loads the file may not be well optimized for the limited complexity of the data you're saving (unless you're including arbitrary Python objects and code). It's easy to view, edit, and use, but it's not easy to transport.
EDIT: to clarify, raw Python files are easy for a human to modify, but very hard for a computer to edit. If your code edits the data and you ever want that to be reflected in the dictionary, you're pretty much up a creek: instead, use one of the methods below.
Pickle File
If you use a pickle file, you'd either re-load the file each time you use it, or need some management code to cache the file after reading it the first time. Like raw Python files, pickle files can store almost arbitrary Python objects, so the loader might not be optimized for your particular data types. They're hard for a regular human to view or edit, they're only readable from Python, and you might encounter portability issues if you move the data around. You also need to consider pickle's security implications: loading a pickle file can execute arbitrary code, so it should only be done with trusted files.
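The "management code to cache the file" can be as little as a module-level cache around the load; a minimal sketch (the file name is hypothetical):

```python
import pickle

_cache = None

def load_foo(path="bar.pickle"):
    """Load the pickled dict from disk once, then reuse it on later calls."""
    global _cache
    if _cache is None:
        with open(path, "rb") as f:
            _cache = pickle.load(f)
    return _cache
```

Every caller after the first gets the already-loaded dict back without touching the disk, which is roughly what importing a Python module gives you for free.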
JSON File
If all you're storing is simple objects (dictionaries, lists, strings, booleans, numbers), consider using the JSON file format. Python has a built-in json module that's just as easy to use as pickle, so there's no added complexity. These files are easy to store, view, edit, and compress (if desired), and look almost exactly like a Python dictionary. JSON is highly portable (most common languages support reading/writing it these days), and if you need to improve file loading speed, the ujson module is a faster, drop-in replacement for the standard json module. Since the JSON format is fairly restricted, I'd expect its parsers and writers to be quite a bit faster than the regular Python or pickle parsers (especially using ujson).
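A minimal round trip with the built-in json module, mirroring the pickle usage in the question (the file name is hypothetical). Note that json.dump writes text, so the file is opened in text mode rather than binary:

```python
import json

foo = {"key1": True, "key2": True}

# json.dump produces str output, so open in text mode ('w'), not 'wb'.
with open("bar.json", "w") as f:
    json.dump(foo, f)

with open("bar.json") as f:
    loaded = json.load(f)

print(loaded == foo)  # True
```

The on-disk file is also readable by any other language with a JSON parser, which is the portability advantage over pickle.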
Yes, yes and yes, please use a common format such as json. It's not just faster, it's also safer, portable, human-readable, re-usable…
– spectras
Jan 2 at 19:58
I wouldn't consider json exactly human-readable (beyond the literal definition)... after all, that's why hjson exists. But I do agree on using a common format like it.
– Idlehands
Jan 2 at 20:06
YMMV, it's reasonably human-readable while reasonably fast. Requirements will certainly shift the cursor. I wouldn't use json for any file intended to be written by a human. Nor would I use it for performance-critical serialization. Probably I'd use YAML for the former and FlatBuffers for the latter, or something along those lines. JSON is a good in-the-middle general-purpose data format :)
– spectras
Jan 2 at 20:12
Agreed, YAML is a great format for human readability, and custom formats like FlatBuffers and Protobuf are great for performance and serialization size. However, support for YAML is a bit more spotty (even Python needs a third party module to read it), and custom binary formats often involve more setup (e.g. defining and compiling the data structure in advance). JSON is a great compromise, being extremely easy to use in general, and (if formatted properly) easy to read/write. It's a good next step for people who don't want to get swamped in all the possible serialization methods.
– scnerd
Jan 2 at 21:35
To add to @scnerd's comment, here are the timings in IPython for different load situations.
Here we create a dictionary and write it to 3 formats:
import random
import json
import pickle

letters = 'abcdefghijklmnopqrstuvwxyz'
d = {''.join(random.choices(letters, k=6)): random.choice([True, False])
     for _ in range(100000)}

# write a python file
with open('mydict.py', 'w') as fp:
    fp.write('d = {\n')
    for k, v in d.items():
        fp.write(f"'{k}': {v},\n")
    fp.write('None: False}')

# write a pickle file
with open('mydict.pickle', 'wb') as fp:
    pickle.dump(d, fp)

# write a json file (text mode, since json.dump writes str)
with open('mydict.json', 'w') as fp:
    json.dump(d, fp)
Python file:
# on first import the file will be cached.
%%timeit -n1 -r1
from mydict import d
644 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
# after creating the __pycache__ folder, import is MUCH faster
%%timeit
from mydict import d
1.37 µs ± 54.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
pickle file:
%%timeit
with open('mydict.pickle', 'rb') as fp:
    pickle.load(fp)
52.4 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
json file:
%%timeit
with open('mydict.json') as fp:
    json.load(fp)
81.3 ms ± 2.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# here is the same test with ujson
import ujson
%%timeit
with open('mydict.json') as fp:
    ujson.load(fp)
51.2 ms ± 304 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Since I did specifically mention ujson to improve json speed, could you add a timing with that as well?
– scnerd
Jan 2 at 21:31
Also, for the Python file, you don't need the None: False at the end; at least in Python 3.6+ (probably any 3.x, not sure), it's valid syntax to have a trailing comma at the end of a dictionary literal.
– scnerd
Jan 2 at 21:32
Oh! Did not know that.
– James
Jan 2 at 21:35
Added the ujson test. It looks comparable to using pickle.
– James
Jan 2 at 21:41
answered Jan 2 at 19:35 by scnerd, edited Jan 2 at 21:45
Yes, yes and yes, please use a common format such as json. It's not just faster, it's also safer, portable, human-readable, re-usable…
– spectras
Jan 2 at 19:58
I wouldn't considerjson
exactly human-readable (beyond the literal definition)... after all that's why hjson exists. But I do agree on using a common format like itself.
– Idlehands
Jan 2 at 20:06
YMMV, it's reasonably human-readable while reasonably fast. Requirements will certainly shift the cursor. I wouldn't use json for any file intended to be written by a human. Nor would I use it for performance-critical serialization. Probably I'd use YAML for the former and FlatBuffers for the latter, or something along those lines. JSON is a good in-the-middle general-purpose data format :)
– spectras
Jan 2 at 20:12
Agreed, YAML is a great format for human readability, and custom formats like FlatBuffers and Protobuf are great for performance and serialization size. However, support for YAML is a bit more spotty (even Python needs a third party module to read it), and custom binary formats often involve more setup (e.g. defining and compiling the data structure in advance). JSON is a great compromise, being extremely easy to use in general, and (if formatted properly) easy to read/write. It's a good next step for people who don't want to get swamped in all the possible serialization methods.
– scnerd
Jan 2 at 21:35
add a comment |
Yes, yes and yes, please use a common format such as json. It's not just faster, it's also safer, portable, human-readable, re-usable…
– spectras
Jan 2 at 19:58
I wouldn't considerjson
exactly human-readable (beyond the literal definition)... after all that's why hjson exists. But I do agree on using a common format like itself.
– Idlehands
Jan 2 at 20:06
YMMV, it's reasonably human-readable while reasonably fast. Requirements will certainly shift the cursor. I wouldn't use json for any file intended to be written by a human. Nor would I use it for performance-critical serialization. Probably I'd use YAML for the former and FlatBuffers for the latter, or something along those lines. JSON is a good in-the-middle general-purpose data format :)
– spectras
Jan 2 at 20:12
Agreed, YAML is a great format for human readability, and custom formats like FlatBuffers and Protobuf are great for performance and serialization size. However, support for YAML is a bit more spotty (even Python needs a third party module to read it), and custom binary formats often involve more setup (e.g. defining and compiling the data structure in advance). JSON is a great compromise, being extremely easy to use in general, and (if formatted properly) easy to read/write. It's a good next step for people who don't want to get swamped in all the possible serialization methods.
– scnerd
Jan 2 at 21:35
Yes, yes and yes, please use a common format such as json. It's not just faster, it's also safer, portable, human-readable, re-usable…
– spectras
Jan 2 at 19:58
Yes, yes and yes, please use a common format such as json. It's not just faster, it's also safer, portable, human-readable, re-usable…
– spectras
Jan 2 at 19:58
I wouldn't consider
json
exactly human-readable (beyond the literal definition)... after all that's why hjson exists. But I do agree on using a common format like itself.– Idlehands
Jan 2 at 20:06
I wouldn't consider
json
exactly human-readable (beyond the literal definition)... after all that's why hjson exists. But I do agree on using a common format like itself.– Idlehands
Jan 2 at 20:06
YMMV, it's reasonably human-readable while reasonably fast. Requirements will certainly shift the cursor. I wouldn't use json for any file intended to be written by a human. Nor would I use it for performance-critical serialization. Probably I'd use YAML for the former and FlatBuffers for the latter, or something along those lines. JSON is a good in-the-middle general-purpose data format :)
– spectras
Jan 2 at 20:12
YMMV, it's reasonably human-readable while reasonably fast. Requirements will certainly shift the cursor. I wouldn't use json for any file intended to be written by a human. Nor would I use it for performance-critical serialization. Probably I'd use YAML for the former and FlatBuffers for the latter, or something along those lines. JSON is a good in-the-middle general-purpose data format :)
– spectras
Jan 2 at 20:12
Agreed, YAML is a great format for human readability, and custom formats like FlatBuffers and Protobuf are great for performance and serialization size. However, support for YAML is a bit more spotty (even Python needs a third party module to read it), and custom binary formats often involve more setup (e.g. defining and compiling the data structure in advance). JSON is a great compromise, being extremely easy to use in general, and (if formatted properly) easy to read/write. It's a good next step for people who don't want to get swamped in all the possible serialization methods.
– scnerd
Jan 2 at 21:35
To add to @scnerd's comment, here are the timings in IPython for different load situations.
Here we create a dictionary and write it to 3 formats:
import random
import json
import pickle
letters = 'abcdefghijklmnopqrstuvwxyz'
d = {''.join(random.choices(letters, k=6)): random.choice([True, False])
     for _ in range(100000)}
# write a python file
with open('mydict.py', 'w') as fp:
    fp.write('d = {\n')
    for k, v in d.items():
        fp.write(f"'{k}': {v},\n")
    fp.write('None: False}')
# write a pickle file
with open('mydict.pickle', 'wb') as fp:
    pickle.dump(d, fp)
# write a json file (json.dump writes str, so the file must be opened in text mode)
with open('mydict.json', 'w') as fp:
    json.dump(d, fp)
Python file:
# on first import the file will be cached.
%%timeit -n1 -r1
from mydict import d
644 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
# after creating the __pycache__ folder, import is MUCH faster
%%timeit
from mydict import d
1.37 µs ± 54.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
pickle file:
%%timeit
with open('mydict.pickle', 'rb') as fp:
    pickle.load(fp)
52.4 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
json file:
%%timeit
with open('mydict.json', 'rb') as fp:
    json.load(fp)
81.3 ms ± 2.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# here is the same test with ujson
import ujson
%%timeit
with open('mydict.json', 'rb') as fp:
    ujson.load(fp)
51.2 ms ± 304 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
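Outside IPython, the same comparison can be sketched with the standard timeit module (my addition, not part of the original answer; absolute numbers will vary by machine):

```python
import json
import pickle
import random
import timeit

# recreate the dictionary and the two files from the answer above
letters = 'abcdefghijklmnopqrstuvwxyz'
d = {''.join(random.choices(letters, k=6)): random.choice([True, False])
     for _ in range(100000)}

with open('mydict.pickle', 'wb') as fp:
    pickle.dump(d, fp)
with open('mydict.json', 'w') as fp:
    json.dump(d, fp)

def load_pickle():
    with open('mydict.pickle', 'rb') as fp:
        return pickle.load(fp)

def load_json():
    with open('mydict.json') as fp:
        return json.load(fp)

# average time per load over 10 runs
for name, fn in [('pickle', load_pickle), ('json', load_json)]:
    per_load = timeit.timeit(fn, number=10) / 10
    print(f'{name}: {per_load * 1000:.1f} ms per load')
```

Both loaders return a dict equal to the original, so this also doubles as a round-trip sanity check.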
Since I did specifically mention ujson to improve json speed, could you add a timing with that as well?
– scnerd
Jan 2 at 21:31
Also, for the Python file, you don't need the None: False at the end; at least in Python 3.6+ (probably any 3.X, not sure), it's valid syntax to have an extra comma at the end of a dictionary literal.
– scnerd
Jan 2 at 21:32
Oh! Did not know that.
– James
Jan 2 at 21:35
Added the ujson test. It looks comparable to using pickle.
– James
Jan 2 at 21:41
answered Jan 2 at 20:02, edited Jan 2 at 21:40
– James
pickle is just a binary format which can help you to save a dict more efficiently. Python definitely needs to spend a little bit more effort to read/parse a binary format than a plain text format.
– Windchill
Jan 2 at 19:24
there are lots of alternatives here, what's best/ideal/fastest almost certainly depends on much more information than you've given. i.e. how often does this data change, who changes it, how do they change it, do you want to keep track of these changes, do you care about portability to different systems/versions of Python… and many more
– Sam Mason
Jan 2 at 19:25
You could also consider dumping it as a json since the format is essentially the same, unless your dictionary has other Python objects inside, that is.
– Idlehands
Jan 2 at 19:31
If you don't have any need to distinguish between False and varieties of N/A, you can just create a list of the keys with True values and write that to file. Then instead of doing if my_dict[key], you can do if key in my_list.
– Acccumulation
Jan 2 at 19:35
@Acccumulation but that would be trading an O(1) membership test for an O(N) membership test.
– juanpa.arrivillaga
Jan 2 at 19:38
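Following up on the last two comments: if only the keys matter, a set (rather than a list) keeps the O(1) membership test and pickles just as easily. A minimal sketch of that idea (my addition, not from the thread; the key names are made up):

```python
import pickle

# keep only the keys whose value is True
d = {'key1': True, 'key2': True, 'key3': False}
valid_keys = {k for k, v in d.items() if v}

# a set pickles like any other built-in container
with open('valid_keys.pickle', 'wb') as fp:
    pickle.dump(valid_keys, fp)

with open('valid_keys.pickle', 'rb') as fp:
    loaded = pickle.load(fp)

# membership test is O(1), unlike `key in my_list` which is O(N)
print('key1' in loaded)  # True
print('key3' in loaded)  # False
```

This keeps the file smaller too, since the True values are implied by membership.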