Drop duplicates, but keep rows with highest value including ties



























I want to drop duplicate values for col1, saving only rows with the highest value in col2. Example df:



df1 = pd.DataFrame({'col1': ['a', 'a', 'b', 'b', 'c'],
                    'col2': [5, 5, 10, 15, 20]})


I know df1.drop_duplicates will get rid of duplicate values, but how do I ensure that the highest value is the one kept (or multiple rows, if there is a tie for highest)?



Desired output:



dfoutput = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'],
                         'col2': [5, 5, 15, 20]})









python pandas dataframe duplicates

asked Jan 2 at 4:49 by Liquidity, edited Jan 2 at 6:10 by coldspeed
          3 Answers


















          2














          Start by sorting your DataFrame in descending order. Next, compute two masks: one marking the rows that are the maximum of their group, and one marking the rows whose col1 has already been seen (the duplicates).

          We can then combine these masks to drop the rows that are duplicated but not the maximum of their group, in one final filtering step.



          v = df1.sort_values('col2', ascending=False)
          m1 = v['col2'] == v.groupby('col1', sort=False)['col2'].transform('max')
          m2 = v.duplicated('col1')

          v[~(m2 & ~m1)].sort_index() # v[~m2 | m1] - DeMorgan's Law

            col1  col2
          0    a     5
          1    a     5
          3    b    15
          4    c    20
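          Worth noting (an adaptation, not part of the original answer): since every tied row equals its group's maximum, the two masks collapse into a single comparison against the group max, with no sorting needed — a minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['a', 'a', 'b', 'b', 'c'],
                    'col2': [5, 5, 10, 15, 20]})

# Keep every row whose col2 equals its group's maximum; ties survive
# because each tied row compares equal to the max.
out = df1[df1['col2'] == df1.groupby('col1')['col2'].transform('max')]
print(out)
```

          This also preserves the original row order, so no final sort_index is required.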





          answered Jan 2 at 4:57 by coldspeed, edited Jan 2 at 6:08


























          • I like the first one better, it gives the output that the OP wished for. :)

            – anky_91
            Jan 2 at 5:56











            @anky_91 Thanks! Didn't realise there was a discrepancy in the result! +1'd yours as well.

            – coldspeed
            Jan 2 at 6:08



















          3














          You can use the rank function of pandas:

          Assign a rank to each row within its group; tied values share the same rank. Something like below:



          In [126]: df1['rnk'] = df1.groupby('col1')['col2'].rank()

          In [127]: df1
          Out[127]:
            col1  col2  rnk
          0    a     5  1.5
          1    a     5  1.5
          2    b    10  1.0
          3    b    15  2.0
          4    c    20  1.0


          Then use the query method to keep only rows with rank below 2.0:



          In [129]: df1.query('rnk < 2.0').drop(columns='rnk')
          Out[129]:
            col1  col2
          0    a     5
          1    a     5
          2    b    10
          4    c    20


          Both steps can be combined into a one-line solution:



          In [130]: df1[df1.groupby('col1')['col2'].rank() < 2]
          Out[130]:
            col1  col2
          0    a     5
          1    a     5
          2    b    10
          4    c    20





          answered Jan 2 at 5:06 by Mayank Porwal, edited Jan 2 at 5:12


























          • Interesting solution!

            – coldspeed
            Jan 2 at 5:07











          • Hope you liked it. An upvote would be good.

            – Mayank Porwal
            Jan 2 at 5:08











          • Actually, there is no need for the extra column, just doing df1[df1.groupby('col1')['col2'].rank() < 2] will work.

            – coldspeed
            Jan 2 at 5:08











          • Yes, I know there's no need of the extra column. Just put it there for the understanding of OP.

            – Mayank Porwal
            Jan 2 at 5:10



















          1














          Another way I found:

          Get the fully duplicated rows, append the deduplicated rows (after sorting in descending order), then drop the entries with a duplicated index.



          dfoutput = df1[df1.duplicated(keep=False)].append(
              df1.sort_values(['col1', 'col2'], ascending=False).drop_duplicates(['col1']))
          dfoutput[~dfoutput.index.duplicated()].sort_index()

            col1  col2
          0    a     5
          1    a     5
          3    b    15
          4    c    20
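          Note that DataFrame.append was removed in pandas 2.0; an equivalent sketch using pd.concat (my adaptation, not the original answer's code):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['a', 'a', 'b', 'b', 'c'],
                    'col2': [5, 5, 10, 15, 20]})

# Fully duplicated rows (the ties) plus the top row of each group
ties = df1[df1.duplicated(keep=False)]
tops = df1.sort_values(['col1', 'col2'], ascending=False).drop_duplicates(['col1'])

# Stack them, then keep each original index only once and restore order
dfoutput = pd.concat([ties, tops])
dfoutput = dfoutput[~dfoutput.index.duplicated()].sort_index()
print(dfoutput)
```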





          answered Jan 2 at 5:45 by anky_91, edited Jan 2 at 5:52
























