Guide for efficient usage of HDFStore

In my data processing application, around 80% of the processing time is spent in the function pandas.HDFStore.put. Although there are various SO questions about similar problems, I haven't found a definitive guide on how to use HDFStore in the most efficient way.



What options do I have to reduce the writing time?



My data consists only of float64 columns and a few int columns. It may contain duplicated indices and/or column names, and it is not sorted a priori. The data will be collected over decades (at second-to-minute resolution), so the solution should be scalable.



My basic use cases are the following:



# 1. Store creation
store = pd.HDFStore(pro['hdf_path'], complevel=7,
                    complib='blosc', fletcher32=True)

# 2. Iterative addition of new data
store.put('/table/T1', data, format='table', data_columns=True,
          append=True, index=False)

# 3. Basic queries of certain columns (I only need 'index' in 'where')
store.select('/table/T1', columns=['A', 'B', ...],
             where='index>="{}" & index<{sign}"{}"'.format(_t1, _t2))

# 4. Retrieving a tree with all tables and all column
#    names in that table (without loading it)
for path, groups, leaves in store.walk():
    ...
    for lv in sorted(leaves):
        _item_path = '/'.join([path, lv])
        columns = store.get_node('{}/table'.format(_item_path)).description._v_names
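
Use case 2 writes with index=False, which leaves the table unindexed. Below is a minimal sketch of how the index could then be built once, after all chunks have been appended, via create_table_index. It is an illustration under assumptions, not a measured recommendation: the helper names write_chunks_then_index and make_chunk, the sample data, and the choice of optlevel=6 with kind='medium' are hypothetical, and PyTables must be installed.

```python
import numpy as np
import pandas as pd


def write_chunks_then_index(path, chunks, key="/table/T1"):
    """Append every chunk without indexing, then build the index once.

    Returns the number of rows stored under `key`.
    """
    with pd.HDFStore(path, mode="w", complevel=7, complib="blosc") as store:
        for chunk in chunks:
            # index=False skips index maintenance on each append.
            store.append(key, chunk, format="table",
                         data_columns=True, index=False)
        # Build the index a single time over the complete table.
        # Only 'index' is indexed here because it is the only field
        # used in the 'where' clause of the queries.
        store.create_table_index(key, columns=["index"],
                                 optlevel=6, kind="medium")
        return store.get_storer(key).nrows


def make_chunk(start, periods, seed):
    """Hypothetical sample data: two float64 columns on a 1-second index."""
    rng = np.random.default_rng(seed)
    index = pd.date_range(start, periods=periods, freq="s")
    return pd.DataFrame({"A": rng.random(periods),
                         "B": rng.random(periods)}, index=index)
```

Whether deferring the index actually reduces the total write time would need to be timed on real data; the sketch only shows the order of the calls.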


Concretely, I would like to know how to tune the following to optimize the writing time:




  • 'complib', 'complevel' in store creation

  • making the indexing more efficient (maybe calling create_table_index only at the end?)

  • parameters of store.put/store.append

  • I read something about index levels, like ('medium', 6); would that have an influence?

  • Could I reduce the index from datetime (which is then stored as Int64) to something more efficient, e.g. by dropping milliseconds?
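
On the last point, a small check of what dropping milliseconds does on the pandas side. This is a sketch with made-up sample timestamps: it only shows that a floored DatetimeIndex is still backed by 64-bit integers, so the per-value width does not change. Any gain would have to come from compression, which is not measured here.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: timestamps with millisecond detail.
idx = pd.DatetimeIndex(["2018-12-31 12:09:00.123",
                        "2018-12-31 12:09:01.987"])

# Drop everything below one second.
floored = idx.floor("s")

# Both indexes expose their values as 64-bit integers.
raw = idx.asi8
raw_floored = floored.asi8
print(raw.dtype, raw.itemsize)
print(raw_floored.dtype, raw_floored.itemsize)
```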


(Reading is less of a problem, as store.select with where=... is quite efficient.)



Thank you for any help, it is very much appreciated!

python pandas hdfstore

asked Dec 31 '18 at 12:09

Snow bunting