Python pandas groupby object apply method duplicates first group

My first SO question:
I am confused by the behavior of the apply method of a groupby object in pandas (0.12.0-4): it appears to apply the function TWICE to the first group of a data frame (here, the first row). For example:

>>> from pandas import Series, DataFrame
>>> import pandas as pd
>>> df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
>>> print(df)
  class  count
0     A      1
1     B      0
2     C      2

I first check that the groupby function works ok, and it seems to be fine:

>>> for group in df.groupby('class', group_keys=True):
...     print(group)
...
('A',   class  count
0     A      1)
('B',   class  count
1     B      0)
('C',   class  count
2     C      2)

Then I try to do something similar using apply on the groupby object and I get the first row output twice:

>>> def checkit(group):
...     print(group)
...
>>> df.groupby('class', group_keys=True).apply(checkit)
  class  count
0     A      1
  class  count
0     A      1
  class  count
1     B      0
  class  count
2     C      2

Any help would be appreciated! Thanks.

Edit: @Jeff provides the answer below. I am dense and did not understand it immediately, so here is a simple example showing that, despite the double printout of the first group above, the result of apply reflects only one application of the function to the first group, and the original data frame is not mutated:

>>> def addone(group):
...     group['count'] += 1
...     return group
...
>>> df.groupby('class', group_keys=True).apply(addone)
>>> print(df)
  class  count
0     A      1
1     B      0
2     C      2

But by assigning the return of the method to a new object, we see that it works as expected:

>>> df2 = df.groupby('class', group_keys=True).apply(addone)
>>> print(df2)
  class  count
0     A      2
1     B      1
2     C      3
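If you would rather not depend on how pandas hands each group to the function, a minimal sketch is to copy the group explicitly before modifying it. The name addone_copy is made up for this illustration, and group_keys=False is my choice here, not something the question used:

```python
import pandas as pd

df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})

def addone_copy(group):
    # Work on an explicit copy so the caller's data is never touched,
    # regardless of what pandas passes in.
    group = group.copy()
    group['count'] += 1
    return group

# Recent pandas versions may emit a warning about grouping columns here.
df2 = df.groupby('class', group_keys=False).apply(addone_copy)

print(df['count'].tolist())   # original values
print(df2['count'].tolist())  # incremented values
```

As in the example above, the incremented values only exist in the object returned by apply, so the result has to be assigned to something.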

edited Jun 17 '16 at 12:48 by Merlin

asked Jan 27 '14 at 19:37 by NC maize breeding Jim


This is checking whether you are mutating the data in the apply. If you are, then it has to take a slower path than otherwise. It doesn't change the results.

– Jeff
Jan 27 '14 at 19:40

@Jeff: Could the result of the first call be saved so it is not called again? This might help if the function called by apply takes a long time... (along with being more intuitive, since this question comes up a lot.)

– unutbu
Jan 27 '14 at 19:43

@Jeff: Or maybe the function could be wrapped in a memoizer...

– unutbu
Jan 27 '14 at 19:48

It's actually tricky; the fast path is in Cython (usually), so right now it doesn't pass the result back to Python space (it could, I suppose). transform DOES do this, however (it chooses a path and then uses that result to move on). It's just a little bit tricky in code. Welcome to do a PR!

– Jeff
Jan 27 '14 at 19:51

Wouldn't it make more sense to just bite the bullet and make an explicit mutating/non-mutating parameter, defaulting to non-mutating? [Somewhat silly additional comment deleted, but my first question stands.]

– DSM
Jan 27 '14 at 19:54

python-2.7 pandas group-by


2 Answers

This is by design, as described here and here

The apply function needs to know the shape of the returned data to figure out how the results will be combined. To do this, it calls the function (checkit in your case) twice on the first group.
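One way to see this for yourself is to count the calls instead of printing. This is only a sketch: the calls list and the record function are names made up for illustration, and the count you get depends on the pandas version. On the 0.12 release from the question the first group is seen twice; as far as I know, releases from 0.25 on evaluate the first group only once.

```python
import pandas as pd

df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})

calls = []  # illustrative: one entry per invocation of record()

def record(group):
    # Remember which group this call received, identified by its 'count' value.
    calls.append(int(group['count'].iloc[0]))
    return group

# Recent pandas versions may emit a warning about grouping columns here.
df.groupby('class').apply(record)

# Four entries would mean the first group was passed in twice; three means once.
print(len(calls), calls)
```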

Depending on your actual use case, you can replace the call to apply with aggregate, transform or filter, as described in detail here. These functions require the return value to be a particular shape, and so don't call the function twice.
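As a rough sketch of what those alternatives look like on the data frame from the question (the lambdas are my own illustrations, not taken from the linked docs, and I am not making any claim here about how often each method calls them in a given pandas version):

```python
import pandas as pd

df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
grouped = df.groupby('class')

# transform: returns a result aligned with the original rows
plus_one = grouped['count'].transform(lambda s: s + 1)

# aggregate: returns one value per group
totals = grouped['count'].agg('sum')

# filter: keeps or drops whole groups based on a boolean per group
nonzero = grouped.filter(lambda g: g['count'].sum() > 0)

print(plus_one.tolist())
print(totals.to_dict())
print(nonzero['class'].tolist())
```

Which one fits depends on whether you need a value per row, a value per group, or a subset of the original rows.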

However, if the function you are calling has no side effects, it most likely does not matter that it is called twice on the first group.

answered Sep 8 '14 at 1:39 by Zero


You can use a for loop over the groups to avoid the duplicated first group that groupby.apply produces.

log_sample.csv:

guestid,keyword
1,null
2,null
2,null
3,null
3,null
3,null
4,null
4,null
4,null
4,null

My code snippet:

import pandas as pd

df = pd.read_csv("log_sample.csv")
grouped = df.groupby("guestid")

# Iterating over the groupby object yields each (key, group) pair once
for guestid, df_group in grouped:
    print(list(df_group['guestid']))

df.head(100)

Output:

[1]
[2, 2]
[3, 3, 3]
[4, 4, 4, 4]
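If you need to keep per-group results rather than just print them, the same loop can collect them. A small sketch, with the guestid values built inline so it runs without the CSV file; visited and sizes are names made up for this example:

```python
import pandas as pd

# Same values as the guestid column of log_sample.csv, built inline.
df = pd.DataFrame({'guestid': [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]})

visited = []  # group keys in the order the loop sees them
sizes = {}    # number of rows per group

for guestid, df_group in df.groupby('guestid'):
    visited.append(int(guestid))
    sizes[int(guestid)] = len(df_group)

print(visited)
print(sizes)
```

Each key shows up in visited exactly once, because plain iteration just yields the groups and never calls a user function on them.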

answered Apr 4 '18 at 3:17 by geosmart
