Combine pyspark dataframes [duplicate]



























This question already has an answer here:

• What are the various join types in Spark? (3 answers)




Suppose I have two Spark dataframes:



df1 = spark.createDataFrame([(1,2),(2,3)],['col1','col2']) 
df2 = spark.createDataFrame([(1,2),(4,3)],['col1','col3'])


I want to combine the dataframes to get the result below:



+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|   2|
|   2|   3|  NA|
|   4|  NA|   3|
+----+----+----+


How can I get this result?










apache-spark pyspark

asked Dec 28 '18 at 7:56 by kirti dahiya, edited Dec 28 '18 at 8:02 by Sotos
marked as duplicate by pault and eliasah, Jan 3 at 7:01

3 Answers
































You need an outer join, i.e.

df1.join(df2, 'col1', 'outer').show()

which gives:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|   2|
|   2|   3|null|
|   4|null|   3|
+----+----+----+
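Note that Spark shows missing values as null rather than the NA in the question's expected output. If you literally need the string "NA", a minimal sketch (assuming all-string output is acceptable) is:

from pyspark.sql.functions import col

joined = df1.join(df2, 'col1', 'outer')
# Cast every column to string so the literal "NA" can replace nulls;
# df.na.fill('NA') only applies to string-typed columns.
joined.select([col(c).cast('string') for c in joined.columns]).na.fill('NA').show()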






answered Dec 28 '18 at 8:01 by Sotos





















• Can we get the desired dataframe without using a join operation? – kirti dahiya, Dec 28 '18 at 10:33










• No, you will have to join. The alternative would be to match values, but you would have to use a loop since you need to match more than one column, which defeats the purpose of Spark. – Sotos, Dec 28 '18 at 13:12

































# Given DataFrames
df1 = spark.createDataFrame([(1,2),(2,3)],['col1','col2'])
df2 = spark.createDataFrame([(1,2),(4,3)],['col1','col3'])

We can use two methods to get this full outer join.

# Method 1: join on the column name, which keeps a single col1
df = df1.join(df2, ['col1'], how='full')
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|   2|
|   2|   3|null|
|   4|null|   3|
+----+----+----+

# Method 2: join on an expression; both col1 columns are kept,
# so F.coalesce picks whichever side is non-null as the key
import pyspark.sql.functions as F
df = df1.join(df2, df1.col1 == df2.col1, how='full').select(
    F.coalesce(df1.col1, df2.col1).alias('col1'), df1.col2, df2.col3
)
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|   2|
|   2|   3|null|
|   4|null|   3|
+----+----+----+





answered Dec 28 '18 at 10:09 by cph_sto











































The operation you are referring to here is called an outer join, or full outer join, on col1.

In a full outer join, we get every record where the join column matches in both tables, which yields Row(1, 2, 2). We also get all the remaining records from the left table, with null for the right table's column, which yields Row(2, 3, null). Similarly, we get all the remaining records from the right table, with null for the left table's column, which yields Row(4, null, 3).

For comparison, here are the results of the other join types (a runnable sketch follows the list):

1. Inner join: all the records where the value of the join column matches in both tables.

   Result: [Row(1, 2, 2)]

2. Left outer join: everything from the inner join, plus the remaining records from the left table, with null for the right table's column.

   Result: [Row(1, 2, 2), Row(2, 3, null)]

3. Right outer join: everything from the inner join, plus the remaining records from the right table, with null for the left table's column.

   Result: [Row(1, 2, 2), Row(4, null, 3)]

4. Full outer join: explained above.

   Result: [Row(1, 2, 2), Row(2, 3, null), Row(4, null, 3)]
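A minimal sketch to reproduce these results with the question's df1 and df2; the join-type names are the strings PySpark's DataFrame.join accepts for its how argument:

# Show the result of each join type on the question's dataframes.
for how in ['inner', 'left', 'right', 'full']:
    print('--- {} join ---'.format(how))
    df1.join(df2, 'col1', how).show()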




You can refer to the join syntax for PySpark here.

Hope this explanation helps.






answered Dec 28 '18 at 10:26 by neeraj bhadani