Python pandas groupby object apply method duplicates first group
My first SO question:
I am confused by the behavior of the apply method of groupby in pandas (0.12.0-4): it appears to apply the function TWICE to the first row of a data frame. For example:
    >>> from pandas import Series, DataFrame
    >>> import pandas as pd
    >>> df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
    >>> print(df)
      class  count
    0     A      1
    1     B      0
    2     C      2
I first check that the groupby function works ok, and it seems to be fine:
    >>> for group in df.groupby('class', group_keys = True):
    ...     print(group)
    ...
    ('A',   class  count
    0     A      1)
    ('B',   class  count
    1     B      0)
    ('C',   class  count
    2     C      2)
Then I try to do something similar using apply on the groupby object and I get the first row output twice:
    >>> def checkit(group):
    ...     print(group)
    ...
    >>> df.groupby('class', group_keys = True).apply(checkit)
      class  count
    0     A      1
      class  count
    0     A      1
      class  count
    1     B      0
      class  count
    2     C      2
Any help would be appreciated! Thanks.
Edit: @Jeff provides the answer below. I am dense and did not understand it immediately, so here is a simple example showing that, despite the double printout of the first group in the example above, the function's effect shows up only once for the first group in the result, and the original data frame is not mutated:
    >>> def addone(group):
    ...     group['count'] += 1
    ...     return group
    ...
    >>> df.groupby('class', group_keys = True).apply(addone)
    >>> print(df)
      class  count
    0     A      1
    1     B      0
    2     C      2
But by assigning the return of the method to a new object, we see that it works as expected:
    >>> df2 = df.groupby('class', group_keys = True).apply(addone)
    >>> print(df2)
      class  count
    0     A      2
    1     B      1
    2     C      3
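Printing inside the function mixes the diagnostic with the result. As a minimal sketch of another way to see how often the function is actually invoked, each call can be recorded in a list. The names `calls` and `record` are hypothetical, and how many times the first group shows up depends on the pandas version (twice on the 0.12 release used above; other releases may differ).

```python
import pandas as pd

df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})

calls = []  # hypothetical log: first index label of every group the function sees

def record(group):
    # Remember which group was passed in, then return one scalar per group.
    calls.append(group.index[0])
    return len(group)

sizes = df.groupby('class').apply(record)

print(calls)           # a repeated first entry means the first group was evaluated twice
print(sizes.tolist())  # one result per group, however many calls were made
```

Whatever the call count, the combined result holds exactly one entry per group.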
python-2.7 pandas group-by
edited Jun 17 '16 at 12:48 by Merlin
asked Jan 27 '14 at 19:37 by NC maize breeding Jim
This is checking whether you are mutating the data in the apply. If you are then it has to take a slower path than otherwise. It doesn't change the results.
– Jeff
Jan 27 '14 at 19:40
@Jeff: Could the result of the first call be saved so it is not called again? This might help if the function called by apply takes a long time... (along with being more intuitive, since this question comes up a lot.)
– unutbu
Jan 27 '14 at 19:43
@Jeff: Or maybe the function could be wrapped in a memoizer...
– unutbu
Jan 27 '14 at 19:48
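The memoizer idea can be sketched in user code. This is only an illustration under stated assumptions, not something pandas provides: `memoize_by_group` and `slow_total` are hypothetical names, the cache key is the tuple of the group's index labels (which assumes those labels identify a group uniquely), and the wrapped function must be pure, since caching a mutating function would hide exactly what the check is looking for.

```python
import pandas as pd

def memoize_by_group(func):
    """Cache func's result per group, keyed by the group's index labels."""
    cache = {}

    def wrapper(group):
        key = tuple(group.index)
        if key not in cache:
            cache[key] = func(group)  # the expensive call happens once per group
        return cache[key]

    return wrapper

calls = []  # records every real invocation of the expensive function

def slow_total(group):
    calls.append(tuple(group.index))
    return group['count'].sum()

df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
totals = df.groupby('class').apply(memoize_by_group(slow_total))

print(len(calls))       # real invocations, at most one per group
print(totals.tolist())
```

Even if apply invokes the wrapper twice for the first group, the second invocation is served from the cache.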
It's actually tricky; the fast path is in Cython (usually), so right now it doesn't pass the result back to Python space (it could, I suppose). transform DOES do this, however (it chooses a path and then uses that result to move on). It's just a little bit tricky in the code. You're welcome to do a PR!
– Jeff
Jan 27 '14 at 19:51
Wouldn't it make more sense to just bite the bullet and make an explicit mutating/non-mutating parameter, defaulting to non-mutating? [Somewhat silly additional comment deleted, but my first question stands.]
– DSM
Jan 27 '14 at 19:54
2 Answers
This is by design, as described here and here.
The apply function needs to know the shape of the returned data to figure out how the results will be combined. To find out, it calls the function (checkit in your case) twice on the first group.
Depending on your actual use case, you can replace the call to apply with aggregate, transform or filter, as described in detail here. These functions require the return value to be a particular shape, and so don't call the function twice.
However, if the function you are calling does not have side effects, it most likely does not matter that it is called twice on the first group.
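As a rough sketch of what those alternatives look like on the data frame from the question (the variable names `totals`, `plus_one` and `nonzero` are made up for illustration, and which of the three fits depends on the shape of result you need):

```python
import pandas as pd

df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
grouped = df.groupby('class')

# aggregate: one value per group, indexed by the group key
totals = grouped['count'].agg('sum')

# transform: a result with the same index as the original frame
plus_one = grouped['count'].transform(lambda s: s + 1)

# filter: keep only the rows of groups for which the predicate is True
nonzero = grouped.filter(lambda g: g['count'].sum() > 0)

print(totals.to_dict())
print(plus_one.tolist())
print(nonzero['class'].tolist())
```

Here `plus_one` reproduces the `addone` result from the question without going through apply.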
answered Sep 8 '14 at 1:39 by Zero
You can use a for loop over the groups to avoid groupby.apply evaluating the first group twice.
log_sample.csv:
    guestid,keyword
    1,null
    2,null
    2,null
    3,null
    3,null
    3,null
    4,null
    4,null
    4,null
    4,null
My code snippet:
    import pandas as pd

    df = pd.read_csv("log_sample.csv")
    grouped = df.groupby("guestid")
    # Iterating yields each (key, sub-frame) pair exactly once
    for guestid, df_group in grouped:
        print(list(df_group['guestid']))
    df.head(100)
Output:
    [1]
    [2, 2]
    [3, 3, 3]
    [4, 4, 4, 4]
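The same loop can also collect results rather than only print them. A minimal sketch, reusing the data frame and `addone` from the question and assuming the per-group results are frames that `pd.concat` can stitch back together; the explicit `copy()` is an addition here so the function never touches the caller's data:

```python
import pandas as pd

df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})

def addone(group):
    group = group.copy()  # work on a copy so the original frame stays untouched
    group['count'] += 1
    return group

# Call the function exactly once per group, then combine the pieces.
pieces = [addone(group) for _, group in df.groupby('class')]
df2 = pd.concat(pieces)

print(df2['count'].tolist())
print(df['count'].tolist())
```

The trade-off is that the loop runs entirely in Python, so it gives up whatever fast path apply might have taken.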
answered Apr 4 '18 at 3:17 by geosmart