Drop duplicates, but keep rows with highest value including ties
I want to drop duplicate values for col1
, saving only rows with the highest value in col2
. Example df:
df1 = pd.DataFrame({'col1': ['a', 'a', 'b', 'b', 'c'],
'col2': [5, 5, 10, 15, 20]})
I know df1.drop_duplicates
will get rid of duplicate values, but how do I ensure that it is the highest value that is saved (or multiple values if there are values tied for highest)?
Desired output:
dfoutput = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'],
'col2': [5, 5, 15, 20]})
python pandas dataframe duplicates
add a comment |
I want to drop duplicate values for col1
, saving only rows with the highest value in col2
. Example df:
df1 = pd.DataFrame({'col1': ['a', 'a', 'b', 'b', 'c'],
'col2': [5, 5, 10, 15, 20]})
I know df1.drop_duplicates
will get rid of duplicate values, but how do I ensure that it is the highest value that is saved (or multiple values if there are values tied for highest)?
Desired output:
dfoutput = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'],
'col2': [5, 5, 15, 20]})
python pandas dataframe duplicates
add a comment |
I want to drop duplicate values for col1
, saving only rows with the highest value in col2
. Example df:
df1 = pd.DataFrame({'col1': ['a', 'a', 'b', 'b', 'c'],
'col2': [5, 5, 10, 15, 20]})
I know df1.drop_duplicates
will get rid of duplicate values, but how do I ensure that it is the highest value that is saved (or multiple values if there are values tied for highest)?
Desired output:
dfoutput = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'],
'col2': [5, 5, 15, 20]})
python pandas dataframe duplicates
I want to drop duplicate values for col1
, saving only rows with the highest value in col2
. Example df:
df1 = pd.DataFrame({'col1': ['a', 'a', 'b', 'b', 'c'],
'col2': [5, 5, 10, 15, 20]})
I know df1.drop_duplicates
will get rid of duplicate values, but how do I ensure that it is the highest value that is saved (or multiple values if there are values tied for highest)?
Desired output:
dfoutput = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'],
'col2': [5, 5, 15, 20]})
python pandas dataframe duplicates
python pandas dataframe duplicates
edited Jan 2 at 6:10
coldspeed
134k23145230
134k23145230
asked Jan 2 at 4:49
LiquidityLiquidity
266212
266212
add a comment |
add a comment |
3 Answers
3
active
oldest
votes
Start by sorting your DataFrame in descending order. Next, compute two masks, one for figuring out what rows are the maximum in their group, and the other to figure out what rows are duplicated.
We can then combine these masks to determine what rows are duplicated and not the maximum in their respective groups, and perform one final filtering step.
v = df1.sort_values('col2', ascending=False)
m1 = v['col2'] == v.groupby('col1', sort=False)['col2'].transform('max')
m2 = v.duplicated('col1')
v[~(m2 & ~m1)].sort_index() # v[~m2 | m1] - DeMorgan's Law
col1 col2
0 a 5
1 a 5
3 b 15
4 c 20
I like the first one better, it gives the output that the OP wished for. :)
– anky_91
Jan 2 at 5:56
1
@anky_91 Thanks! Didn't realise there was a discrepancy in the result! +1'd yours as well.
– coldspeed
Jan 2 at 6:08
add a comment |
You can use rank function of Pandas:
Assign the rank to each row of the group. If the values are same, rows will have same rank. Something like below:
In [126]: df1['rnk'] = df1.groupby('col1')['col2'].rank()
In [127]: df1
Out[127]:
col1 col2 rnk
0 a 5 1.5
1 a 5 1.5
2 b 10 1.0
3 b 15 2.0
4 c 20 1.0
Then use query method to filter only ranks less than 2.0
:
In [129]: df1.query('rnk < 2.0').drop('rnk',1)
Out[129]:
col1 col2
0 a 5
1 a 5
2 b 10
4 c 20
Can combine both the above commands to get a 1-line solution:
In [130]: df1[df1.groupby('col1')['col2'].rank() < 2]
Out[130]:
col1 col2
0 a 5
1 a 5
2 b 10
4 c 20
Interesting solution!
– coldspeed
Jan 2 at 5:07
Hope you liked it. An upvote would be good.
– Mayank Porwal
Jan 2 at 5:08
Actually, there is no need for the extra column, just doingdf1[df1.groupby('col1')['col2'].rank() < 2]
will work.
– coldspeed
Jan 2 at 5:08
Yes, I know there's no need of the extra column. Just put it there for the understanding of OP.
– Mayank Porwal
Jan 2 at 5:10
add a comment |
another way which I found :
get duplicates and append it with deduplicated values after sorting in decending
order , then get rid of the duplicated index.
dfoutput = df1[df1.duplicated(keep=False)].append(df1.sort_values(['col1','col2'],ascending=False).drop_duplicates(['col1']))
dfoutput[~dfoutput.index.duplicated()].sort_index()
col1 col2
0 a 5
1 a 5
3 b 15
4 c 20
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54001339%2fdrop-duplicates-but-keep-rows-with-highest-value-including-ties%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
Start by sorting your DataFrame in descending order. Next, compute two masks, one for figuring out what rows are the maximum in their group, and the other to figure out what rows are duplicated.
We can then combine these masks to determine what rows are duplicated and not the maximum in their respective groups, and perform one final filtering step.
v = df1.sort_values('col2', ascending=False)
m1 = v['col2'] == v.groupby('col1', sort=False)['col2'].transform('max')
m2 = v.duplicated('col1')
v[~(m2 & ~m1)].sort_index() # v[~m2 | m1] - DeMorgan's Law
col1 col2
0 a 5
1 a 5
3 b 15
4 c 20
I like the first one better, it gives the output that the OP wished for. :)
– anky_91
Jan 2 at 5:56
1
@anky_91 Thanks! Didn't realise there was a discrepancy in the result! +1'd yours as well.
– coldspeed
Jan 2 at 6:08
add a comment |
Start by sorting your DataFrame in descending order. Next, compute two masks, one for figuring out what rows are the maximum in their group, and the other to figure out what rows are duplicated.
We can then combine these masks to determine what rows are duplicated and not the maximum in their respective groups, and perform one final filtering step.
v = df1.sort_values('col2', ascending=False)
m1 = v['col2'] == v.groupby('col1', sort=False)['col2'].transform('max')
m2 = v.duplicated('col1')
v[~(m2 & ~m1)].sort_index() # v[~m2 | m1] - DeMorgan's Law
col1 col2
0 a 5
1 a 5
3 b 15
4 c 20
I like the first one better, it gives the output that the OP wished for. :)
– anky_91
Jan 2 at 5:56
1
@anky_91 Thanks! Didn't realise there was a discrepancy in the result! +1'd yours as well.
– coldspeed
Jan 2 at 6:08
add a comment |
Start by sorting your DataFrame in descending order. Next, compute two masks, one for figuring out what rows are the maximum in their group, and the other to figure out what rows are duplicated.
We can then combine these masks to determine what rows are duplicated and not the maximum in their respective groups, and perform one final filtering step.
v = df1.sort_values('col2', ascending=False)
m1 = v['col2'] == v.groupby('col1', sort=False)['col2'].transform('max')
m2 = v.duplicated('col1')
v[~(m2 & ~m1)].sort_index() # v[~m2 | m1] - DeMorgan's Law
col1 col2
0 a 5
1 a 5
3 b 15
4 c 20
Start by sorting your DataFrame in descending order. Next, compute two masks, one for figuring out what rows are the maximum in their group, and the other to figure out what rows are duplicated.
We can then combine these masks to determine what rows are duplicated and not the maximum in their respective groups, and perform one final filtering step.
v = df1.sort_values('col2', ascending=False)
m1 = v['col2'] == v.groupby('col1', sort=False)['col2'].transform('max')
m2 = v.duplicated('col1')
v[~(m2 & ~m1)].sort_index() # v[~m2 | m1] - DeMorgan's Law
col1 col2
0 a 5
1 a 5
3 b 15
4 c 20
edited Jan 2 at 6:08
answered Jan 2 at 4:57
coldspeedcoldspeed
134k23145230
134k23145230
I like the first one better, it gives the output that the OP wished for. :)
– anky_91
Jan 2 at 5:56
1
@anky_91 Thanks! Didn't realise there was a discrepancy in the result! +1'd yours as well.
– coldspeed
Jan 2 at 6:08
add a comment |
I like the first one better, it gives the output that the OP wished for. :)
– anky_91
Jan 2 at 5:56
1
@anky_91 Thanks! Didn't realise there was a discrepancy in the result! +1'd yours as well.
– coldspeed
Jan 2 at 6:08
I like the first one better, it gives the output that the OP wished for. :)
– anky_91
Jan 2 at 5:56
I like the first one better, it gives the output that the OP wished for. :)
– anky_91
Jan 2 at 5:56
1
1
@anky_91 Thanks! Didn't realise there was a discrepancy in the result! +1'd yours as well.
– coldspeed
Jan 2 at 6:08
@anky_91 Thanks! Didn't realise there was a discrepancy in the result! +1'd yours as well.
– coldspeed
Jan 2 at 6:08
add a comment |
You can use rank function of Pandas:
Assign the rank to each row of the group. If the values are same, rows will have same rank. Something like below:
In [126]: df1['rnk'] = df1.groupby('col1')['col2'].rank()
In [127]: df1
Out[127]:
col1 col2 rnk
0 a 5 1.5
1 a 5 1.5
2 b 10 1.0
3 b 15 2.0
4 c 20 1.0
Then use query method to filter only ranks less than 2.0
:
In [129]: df1.query('rnk < 2.0').drop('rnk',1)
Out[129]:
col1 col2
0 a 5
1 a 5
2 b 10
4 c 20
Can combine both the above commands to get a 1-line solution:
In [130]: df1[df1.groupby('col1')['col2'].rank() < 2]
Out[130]:
col1 col2
0 a 5
1 a 5
2 b 10
4 c 20
Interesting solution!
– coldspeed
Jan 2 at 5:07
Hope you liked it. An upvote would be good.
– Mayank Porwal
Jan 2 at 5:08
Actually, there is no need for the extra column, just doingdf1[df1.groupby('col1')['col2'].rank() < 2]
will work.
– coldspeed
Jan 2 at 5:08
Yes, I know there's no need of the extra column. Just put it there for the understanding of OP.
– Mayank Porwal
Jan 2 at 5:10
add a comment |
You can use rank function of Pandas:
Assign the rank to each row of the group. If the values are same, rows will have same rank. Something like below:
In [126]: df1['rnk'] = df1.groupby('col1')['col2'].rank()
In [127]: df1
Out[127]:
col1 col2 rnk
0 a 5 1.5
1 a 5 1.5
2 b 10 1.0
3 b 15 2.0
4 c 20 1.0
Then use query method to filter only ranks less than 2.0
:
In [129]: df1.query('rnk < 2.0').drop('rnk',1)
Out[129]:
col1 col2
0 a 5
1 a 5
2 b 10
4 c 20
Can combine both the above commands to get a 1-line solution:
In [130]: df1[df1.groupby('col1')['col2'].rank() < 2]
Out[130]:
col1 col2
0 a 5
1 a 5
2 b 10
4 c 20
Interesting solution!
– coldspeed
Jan 2 at 5:07
Hope you liked it. An upvote would be good.
– Mayank Porwal
Jan 2 at 5:08
Actually, there is no need for the extra column, just doingdf1[df1.groupby('col1')['col2'].rank() < 2]
will work.
– coldspeed
Jan 2 at 5:08
Yes, I know there's no need of the extra column. Just put it there for the understanding of OP.
– Mayank Porwal
Jan 2 at 5:10
add a comment |
You can use rank function of Pandas:
Assign the rank to each row of the group. If the values are same, rows will have same rank. Something like below:
In [126]: df1['rnk'] = df1.groupby('col1')['col2'].rank()
In [127]: df1
Out[127]:
col1 col2 rnk
0 a 5 1.5
1 a 5 1.5
2 b 10 1.0
3 b 15 2.0
4 c 20 1.0
Then use query method to filter only ranks less than 2.0
:
In [129]: df1.query('rnk < 2.0').drop('rnk',1)
Out[129]:
col1 col2
0 a 5
1 a 5
2 b 10
4 c 20
Can combine both the above commands to get a 1-line solution:
In [130]: df1[df1.groupby('col1')['col2'].rank() < 2]
Out[130]:
col1 col2
0 a 5
1 a 5
2 b 10
4 c 20
You can use rank function of Pandas:
Assign the rank to each row of the group. If the values are same, rows will have same rank. Something like below:
In [126]: df1['rnk'] = df1.groupby('col1')['col2'].rank()
In [127]: df1
Out[127]:
col1 col2 rnk
0 a 5 1.5
1 a 5 1.5
2 b 10 1.0
3 b 15 2.0
4 c 20 1.0
Then use query method to filter only ranks less than 2.0
:
In [129]: df1.query('rnk < 2.0').drop('rnk',1)
Out[129]:
col1 col2
0 a 5
1 a 5
2 b 10
4 c 20
Can combine both the above commands to get a 1-line solution:
In [130]: df1[df1.groupby('col1')['col2'].rank() < 2]
Out[130]:
col1 col2
0 a 5
1 a 5
2 b 10
4 c 20
edited Jan 2 at 5:12
answered Jan 2 at 5:06
Mayank PorwalMayank Porwal
4,9702724
4,9702724
Interesting solution!
– coldspeed
Jan 2 at 5:07
Hope you liked it. An upvote would be good.
– Mayank Porwal
Jan 2 at 5:08
Actually, there is no need for the extra column, just doingdf1[df1.groupby('col1')['col2'].rank() < 2]
will work.
– coldspeed
Jan 2 at 5:08
Yes, I know there's no need of the extra column. Just put it there for the understanding of OP.
– Mayank Porwal
Jan 2 at 5:10
add a comment |
Interesting solution!
– coldspeed
Jan 2 at 5:07
Hope you liked it. An upvote would be good.
– Mayank Porwal
Jan 2 at 5:08
Actually, there is no need for the extra column, just doingdf1[df1.groupby('col1')['col2'].rank() < 2]
will work.
– coldspeed
Jan 2 at 5:08
Yes, I know there's no need of the extra column. Just put it there for the understanding of OP.
– Mayank Porwal
Jan 2 at 5:10
Interesting solution!
– coldspeed
Jan 2 at 5:07
Interesting solution!
– coldspeed
Jan 2 at 5:07
Hope you liked it. An upvote would be good.
– Mayank Porwal
Jan 2 at 5:08
Hope you liked it. An upvote would be good.
– Mayank Porwal
Jan 2 at 5:08
Actually, there is no need for the extra column, just doing
df1[df1.groupby('col1')['col2'].rank() < 2]
will work.– coldspeed
Jan 2 at 5:08
Actually, there is no need for the extra column, just doing
df1[df1.groupby('col1')['col2'].rank() < 2]
will work.– coldspeed
Jan 2 at 5:08
Yes, I know there's no need of the extra column. Just put it there for the understanding of OP.
– Mayank Porwal
Jan 2 at 5:10
Yes, I know there's no need of the extra column. Just put it there for the understanding of OP.
– Mayank Porwal
Jan 2 at 5:10
add a comment |
another way which I found :
get duplicates and append it with deduplicated values after sorting in decending
order , then get rid of the duplicated index.
dfoutput = df1[df1.duplicated(keep=False)].append(df1.sort_values(['col1','col2'],ascending=False).drop_duplicates(['col1']))
dfoutput[~dfoutput.index.duplicated()].sort_index()
col1 col2
0 a 5
1 a 5
3 b 15
4 c 20
add a comment |
another way which I found :
get duplicates and append it with deduplicated values after sorting in decending
order , then get rid of the duplicated index.
dfoutput = df1[df1.duplicated(keep=False)].append(df1.sort_values(['col1','col2'],ascending=False).drop_duplicates(['col1']))
dfoutput[~dfoutput.index.duplicated()].sort_index()
col1 col2
0 a 5
1 a 5
3 b 15
4 c 20
add a comment |
another way which I found :
get duplicates and append it with deduplicated values after sorting in decending
order , then get rid of the duplicated index.
dfoutput = df1[df1.duplicated(keep=False)].append(df1.sort_values(['col1','col2'],ascending=False).drop_duplicates(['col1']))
dfoutput[~dfoutput.index.duplicated()].sort_index()
col1 col2
0 a 5
1 a 5
3 b 15
4 c 20
another way which I found :
get duplicates and append it with deduplicated values after sorting in decending
order , then get rid of the duplicated index.
dfoutput = df1[df1.duplicated(keep=False)].append(df1.sort_values(['col1','col2'],ascending=False).drop_duplicates(['col1']))
dfoutput[~dfoutput.index.duplicated()].sort_index()
col1 col2
0 a 5
1 a 5
3 b 15
4 c 20
edited Jan 2 at 5:52
answered Jan 2 at 5:45
anky_91anky_91
7,1372621
7,1372621
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54001339%2fdrop-duplicates-but-keep-rows-with-highest-value-including-ties%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown