Drop duplicates, but keep rows with highest value including ties



























I want to drop duplicate values for col1, saving only rows with the highest value in col2. Example df:



df1 = pd.DataFrame({'col1': ['a', 'a', 'b', 'b', 'c'],
                    'col2': [5, 5, 10, 15, 20]})


I know df1.drop_duplicates will get rid of duplicate values, but how do I ensure that the highest value is the one kept (or multiple rows, if there is a tie for highest)?



Desired output:



dfoutput = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'],
                         'col2': [5, 5, 15, 20]})









python pandas dataframe duplicates

asked Jan 2 at 4:49 by Liquidity, edited Jan 2 at 6:10 by coldspeed
          3 Answers


















          2














          Start by sorting your DataFrame in descending order. Next, compute two masks: one marking the rows that are the maximum of their group, and one marking the rows whose col1 has already been seen (the duplicates).

          We can then combine these masks to drop the rows that are duplicated but not the maximum of their group, in one final filtering step.



          v = df1.sort_values('col2', ascending=False)
          m1 = v['col2'] == v.groupby('col1', sort=False)['col2'].transform('max')
          m2 = v.duplicated('col1')

          v[~(m2 & ~m1)].sort_index() # v[~m2 | m1] - DeMorgan's Law

            col1  col2
          0    a     5
          1    a     5
          3    b    15
          4    c    20
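          Worth noting (an adaptation, not part of the original answer): since every tied row equals its group's maximum, the two masks collapse into a single comparison against the group max, with no sorting needed — a minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['a', 'a', 'b', 'b', 'c'],
                    'col2': [5, 5, 10, 15, 20]})

# Keep every row whose col2 equals its group's maximum; ties survive
# because each tied row compares equal to the max.
out = df1[df1['col2'] == df1.groupby('col1')['col2'].transform('max')]
print(out)
```

          This also preserves the original row order, so no final sort_index is required.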





          answered Jan 2 at 4:57 by coldspeed, edited Jan 2 at 6:08


























          • I like the first one better, it gives the output that the OP wished for. :)

            – anky_91
            Jan 2 at 5:56











            @anky_91 Thanks! Didn't realise there was a discrepancy in the result! +1'd yours as well.

            – coldspeed
            Jan 2 at 6:08



















          3














          You can use the rank function of pandas:

          Assign a rank to each row within its group; tied values share the same rank. Something like below:



          In [126]: df1['rnk'] = df1.groupby('col1')['col2'].rank()

          In [127]: df1
          Out[127]:
            col1  col2  rnk
          0    a     5  1.5
          1    a     5  1.5
          2    b    10  1.0
          3    b    15  2.0
          4    c    20  1.0


          Then use the query method to keep only rows with rank below 2.0:



          In [129]: df1.query('rnk < 2.0').drop(columns='rnk')
          Out[129]:
            col1  col2
          0    a     5
          1    a     5
          2    b    10
          4    c    20


          Both steps can be combined into a one-line solution:



          In [130]: df1[df1.groupby('col1')['col2'].rank() < 2]
          Out[130]:
            col1  col2
          0    a     5
          1    a     5
          2    b    10
          4    c    20





          answered Jan 2 at 5:06 by Mayank Porwal, edited Jan 2 at 5:12


























          • Interesting solution!

            – coldspeed
            Jan 2 at 5:07











          • Hope you liked it. An upvote would be good.

            – Mayank Porwal
            Jan 2 at 5:08











          • Actually, there is no need for the extra column, just doing df1[df1.groupby('col1')['col2'].rank() < 2] will work.

            – coldspeed
            Jan 2 at 5:08











          • Yes, I know there's no need of the extra column. Just put it there for the understanding of OP.

            – Mayank Porwal
            Jan 2 at 5:10



















          1














          Another way I found:

          Get the fully duplicated rows, append the deduplicated rows (after sorting in descending order), then drop the entries with a duplicated index.



          dfoutput = df1[df1.duplicated(keep=False)].append(
              df1.sort_values(['col1', 'col2'], ascending=False).drop_duplicates(['col1']))
          dfoutput[~dfoutput.index.duplicated()].sort_index()

            col1  col2
          0    a     5
          1    a     5
          3    b    15
          4    c    20
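          Note that DataFrame.append was removed in pandas 2.0; an equivalent sketch using pd.concat (my adaptation, not the original answer's code):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['a', 'a', 'b', 'b', 'c'],
                    'col2': [5, 5, 10, 15, 20]})

# Fully duplicated rows (the ties) plus the top row of each group
ties = df1[df1.duplicated(keep=False)]
tops = df1.sort_values(['col1', 'col2'], ascending=False).drop_duplicates(['col1'])

# Stack them, then keep each original index only once and restore order
dfoutput = pd.concat([ties, tops])
dfoutput = dfoutput[~dfoutput.index.duplicated()].sort_index()
print(dfoutput)
```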





          answered Jan 2 at 5:45 by anky_91, edited Jan 2 at 5:52
























