Python pandas groupby object apply method duplicates first group

My first SO question:
I am confused by the behavior of the apply method of a groupby object in pandas (0.12.0-4): it appears to apply the function TWICE to the first row of a data frame. For example:



>>> from pandas import Series, DataFrame
>>> import pandas as pd
>>> df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
>>> print(df)
  class  count
0     A      1
1     B      0
2     C      2


I first check that the groupby function works ok, and it seems to be fine:



>>> for group in df.groupby('class', group_keys=True):
...     print(group)
...
('A',   class  count
0     A      1)
('B',   class  count
1     B      0)
('C',   class  count
2     C      2)


Then I try to do something similar using apply on the groupby object and I get the first row output twice:



>>> def checkit(group):
...     print(group)
...
>>> df.groupby('class', group_keys=True).apply(checkit)
  class  count
0     A      1
  class  count
0     A      1
  class  count
1     B      0
  class  count
2     C      2


Any help would be appreciated! Thanks.



Edit: @Jeff provides the answer below. I am dense and did not understand it immediately, so here is a simple example to show that, despite the double printout of the first group in the example above, the result of apply reflects only a single application of the function to the first group, and apply does not mutate the original data frame:



>>> def addone(group):
...     group['count'] += 1
...     return group
...
>>> df.groupby('class', group_keys=True).apply(addone)
>>> print(df)
  class  count
0     A      1
1     B      0
2     C      2


But by assigning the return of the method to a new object, we see that it works as expected:






>>> df2 = df.groupby('class', group_keys=True).apply(addone)
>>> print(df2)
  class  count
0     A      2
1     B      1
2     C      3






python-2.7 pandas group-by














edited Jun 17 '16 at 12:48
Merlin

asked Jan 27 '14 at 19:37
NC maize breeding Jim








  • This is checking whether you are mutating the data in the apply. If you are, then it has to take a slower path than otherwise. It doesn't change the results.

    – Jeff
    Jan 27 '14 at 19:40

  • @Jeff: Could the result of the first call be saved so it is not called again? This might help if the function called by apply takes a long time... (along with being more intuitive, since this question comes up a lot.)

    – unutbu
    Jan 27 '14 at 19:43

  • @Jeff: Or maybe the function could be wrapped in a memoizer...

    – unutbu
    Jan 27 '14 at 19:48

  • It's actually tricky; the fast path is in Cython (usually), so right now it doesn't pass the result back to Python space (it could, I suppose). transform DOES do this, however (it chooses a path and then uses that result to move on). It's just a little bit tricky in code. You're welcome to do a PR!

    – Jeff
    Jan 27 '14 at 19:51

  • Wouldn't it make more sense to just bite the bullet and make an explicit mutating/non-mutating parameter, defaulting to non-mutating? [Somewhat silly additional comment deleted, but my first question stands.]

    – DSM
    Jan 27 '14 at 19:54
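
The memoizer idea raised in the comments above could be sketched roughly like this. It is only an illustration, not something pandas provides: `memoize_by_index` and `slow_total` are hypothetical names, and keying the cache on the group's row labels assumes that every group has a distinct set of index labels.

```python
import pandas as pd


def memoize_by_index(func):
    """Cache func's result per group, keyed by the group's row labels.

    Hypothetical helper for illustration; it assumes each group has a
    distinct set of index labels and that func's result is safe to reuse.
    """
    cache = {}

    def wrapper(group):
        key = tuple(group.index)
        if key not in cache:
            cache[key] = func(group)
        return cache[key]

    return wrapper


calls = []  # records each real (uncached) invocation


def slow_total(group):
    # Stand-in for an expensive computation.
    calls.append(tuple(group.index))
    return group['count'].sum()


df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})

# Selecting [['count']] keeps the grouping column out of each group.
totals = df.groupby('class')[['count']].apply(memoize_by_index(slow_total))
```

With the wrapper in place, `slow_total` runs once per group even if apply happens to call the wrapped function more than once on the first group.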






























2 Answers
This is by design, as described here and here.

The apply function needs to know the shape of the returned data to intelligently figure out how it will be combined. To do this, it calls the function (checkit in your case) twice on the first group.
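
One way to see how often apply really invokes your function is to count the calls per group. This is a sketch under assumptions: `tally` and `calls` are names made up for this example, and the counts you get depend on the installed pandas version (the question was asked against 0.12; other releases may not repeat the first call).

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
calls = Counter()


def tally(group):
    # Identify the group by its row labels and record the invocation.
    calls[tuple(group.index)] += 1
    return group['count'].sum()


# Selecting [['count']] keeps the grouping column out of each group.
totals = df.groupby('class')[['count']].apply(tally)

for labels, n in sorted(calls.items()):
    print(labels, 'called', n, 'time(s)')
```

If the first group reports two calls, your version still makes the extra call described above; either way, `totals` holds one value per group.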



Depending on your actual use case, you can replace the call to apply with aggregate, transform or filter, as described in detail here. These functions require the return value to be a particular shape, and so don't call the function twice.
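
As a rough illustration of those three alternatives on the question's data frame (the particular columns and lambdas are my own choices for the example, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
grouped = df.groupby('class')

# aggregate: reduces each group to a single value
totals = grouped['count'].agg('sum')

# transform: returns a result with the same length and index as the input
plus_one = grouped['count'].transform(lambda s: s + 1)

# filter: keeps or drops whole groups based on a boolean test
nonzero = grouped.filter(lambda g: g['count'].sum() > 0)

print(totals)
print(plus_one)
print(nonzero)
```

Which one fits depends on the shape you want back: one row per group, one row per input row, or a subset of the original rows.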



However, if the function you are calling has no side effects, it most likely does not matter that it is called twice on the first group.

















answered Sep 8 '14 at 1:39
Zero

























You can use a for loop over the groups to avoid the duplicated first group that groupby.apply produces.

log_sample.csv:

guestid,keyword
1,null
2,null
2,null
3,null
3,null
3,null
4,null
4,null
4,null
4,null

My code snippet:

import pandas as pd

df = pd.read_csv("log_sample.csv")
grouped = df.groupby("guestid")

# Each group is visited exactly once.
for guestid, df_group in grouped:
    print(list(df_group['guestid']))

df.head(100)

Output:

[1]
[2, 2]
[3, 3, 3]
[4, 4, 4, 4]
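
If you need a combined result rather than printed output, the same loop can collect one value per group. A minimal sketch, assuming the sample data above is supplied inline instead of read from log_sample.csv; `rows_per_guest` and `result` are illustrative names:

```python
import io

import pandas as pd

# Inline copy of the sample data so the example does not need a file on disk.
csv_text = """guestid,keyword
1,null
2,null
2,null
3,null
3,null
3,null
4,null
4,null
4,null
4,null
"""
df = pd.read_csv(io.StringIO(csv_text))

rows_per_guest = {}
for guestid, df_group in df.groupby('guestid'):
    # The loop body runs exactly once per group.
    rows_per_guest[int(guestid)] = len(df_group)

# Turn the collected values into a Series indexed by guestid.
result = pd.Series(rows_per_guest, name='rows')
print(result)
```

The trade-off is that you combine the per-group results yourself instead of letting apply infer the output shape.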
















answered Apr 4 '18 at 3:17
geosmart





























