Python improve import time of data

I have a data file that contains the following:

somename = [ [1,2,3,4,5,6], ...plus other such elements, making a 60 MB file called somefile.py ]

In my Python script (in the same folder as the data) I have only this, along with the appropriate shebang:

from somefile import somename

This took almost 20 minutes to complete. How can such an import be improved?

I'm using Python 3.7 on macOS 10.13.

asked Jan 2 at 12:31 by Geoff

  • why not use something like a JSON file for static data? and have it loaded when it is needed. – 422_unprocessable_entity, Jan 2 at 12:37

  • @BhathiyaPerera Many thanks for the suggestion. – Geoff, Jan 2 at 12:46

  • From your question it seems your entire file is raw data and not actual Python implementations. Is this the case? If so, treating it as a text file and reading its lines could dramatically improve performance. – yuvgin, Jan 2 at 13:17

  • Quite simply: do not try to use a Python module as data storage for such volumes; use a proper datastore (database, whatever). – bruno desthuilliers, Jan 2 at 13:25
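To make yuvgin's suggestion concrete, here is a minimal sketch (my illustration, not from the thread) that treats somefile.py as plain text and parses the literal directly, skipping the import machinery; it assumes the file holds a single `somename = [...]` assignment made up of pure literals:

import ast

# Read the data file as text rather than importing it as a module.
with open("somefile.py") as fd:
    source = fd.read()

# Everything after the first '=' is the list literal; ast.literal_eval
# parses it safely without compiling or executing any code.
_, _, rhs = source.partition("=")
somename = ast.literal_eval(rhs.strip())

And a sketch of bruno desthuilliers' datastore suggestion using the standard library's sqlite3 (the database filename and the fixed six-column layout are assumptions based on the question's example rows):

import sqlite3

# One-time conversion: pay the slow import once, then store the rows.
from somefile import somename

conn = sqlite3.connect("somedata.db")
conn.execute("CREATE TABLE IF NOT EXISTS somename (a, b, c, d, e, f)")
conn.executemany("INSERT INTO somename VALUES (?, ?, ?, ?, ?, ?)", somename)
conn.commit()

# Later runs query only what they need instead of loading all 60 MB.
rows = conn.execute("SELECT * FROM somename LIMIT 10").fetchall()
conn.close()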














2 Answers

Answer by Sam Mason (score 3), answered Jan 2 at 13:56:

Loading files as Python source code will always be relatively slow, but 20 minutes to load a 60 MiB file seems far too slow. Python uses a full lexer/parser and does things like tracking source locations for accurate error reporting, amongst other things. Its grammar is deliberately simple, which makes parsing relatively fast, but it is still going to be much slower than other file formats.



I'd go with one of the other suggestions, but I thought it would be interesting to compare timings across different file formats.



First, I generate some data:



somename = [list(range(6)) for _ in range(100_000)]


This takes my computer 152 ms. I can then save it in a Python source file with:



with open('data.py', 'w') as fd:
    fd.write(f'somename = {somename}')


which takes 84.1 ms. Reloading it with:



from data import somename


takes 1.40 seconds. I tried some other sizes and the scaling seems linear in the array length, which I find impressive. I then started to play with different file formats. First, JSON:



import json

with open('data.json', 'w') as fd:
    json.dump(somename, fd)

with open('data.json') as fd:
    somename = json.load(fd)


Here saving took 787 ms and loading took 131 ms. Next, CSV:



import csv

with open('data.csv', 'w', newline='') as fd:  # newline='' as the csv docs recommend
    out = csv.writer(fd)
    out.writerows(somename)

with open('data.csv', newline='') as fd:
    inp = csv.reader(fd)
    somename = [[int(v) for v in row] for row in inp]


Saving took 114 ms while loading took 329 ms (down to 129 ms if the strings aren't converted to ints). Next I tried musbur's suggestion of pickle:



import pickle  # no need for `cPickle` in Python 3

with open('data.pck', 'wb') as fd:
    pickle.dump(somename, fd)

with open('data.pck', 'rb') as fd:
    somename = pickle.load(fd)


Saving took 49.1 ms and loading took 128 ms.



The take-home message seems to be that loading data from source code takes about ten times as long as the alternatives, but I'm not sure why it's taking your computer 20 minutes!
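For reference, the timings above in one place (100,000 six-element rows, on the answerer's machine):

Format           save       load
Python source    84.1 ms    1.40 s
JSON             787 ms     131 ms
CSV              114 ms     329 ms
pickle           49.1 ms    128 ms

One caveat if you try to reproduce the import timing (my note, not part of the original answer): Python caches imported modules in sys.modules and also writes precompiled bytecode to __pycache__, so `from data import somename` is only slow the first time. A sketch of one way to measure it, using a fresh interpreter per run:

import shutil
import subprocess
import time

# Remove cached bytecode so the parser actually runs, then time the
# import in a brand-new interpreter (start-up overhead is included).
shutil.rmtree("__pycache__", ignore_errors=True)
start = time.perf_counter()
subprocess.run(["python3", "-c", "from data import somename"], check=True)
print(f"import took {time.perf_counter() - start:.2f} s")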






Answer by musbur (score 0), answered Jan 2 at 13:37:

The somefile.py file is obviously created by some piece of software. If it is re-created regularly (i.e., it changes often), that software should be rewritten to produce data in a format that is more easily loaded into Python (such as tabular text data, JSON, YAML, ...). If it is static data that never changes, do this:



import pickle  # Python 3: the old cPickle is now just `pickle`
from somefile import somename  # pay the slow import one last time

with open("data.pck", "wb") as fh:
    pickle.dump(somename, fh)


    This will serialize your data into a file "data.pck" from which it can be re-loaded very quickly.
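For completeness, the matching load step, which is what every subsequent run would use (a small sketch to pair with the dump above):

import pickle

# Fast path for later runs: read the serialized list back in directly.
with open("data.pck", "rb") as fh:
    somename = pickle.load(fh)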





