Combine pyspark dataframes [duplicate]
This question already has an answer here:
What are the various join types in Spark? (3 answers)
Suppose I have two Spark DataFrames:
df1 = spark.createDataFrame([(1, 2), (2, 3)], ['col1', 'col2'])
df2 = spark.createDataFrame([(1, 2), (4, 3)], ['col1', 'col3'])
I want to combine them to get the result below (NA denotes a missing value; Spark will display it as null):
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|   2|
|   2|   3|  NA|
|   4|  NA|   3|
+----+----+----+
How can I get this result?
apache-spark pyspark
asked Dec 28 '18 at 7:56 by kirti dahiya; edited Dec 28 '18 at 8:02 by Sotos
marked as duplicate by pault, eliasah, Jan 3 at 7:01
This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.
3 Answers
You need an outer join, i.e.
df1.join(df2, 'col1', 'outer').show()
which gives,
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|   2|
|   2|   3|null|
|   4|null|   3|
+----+----+----+
answered Dec 28 '18 at 8:01 by Sotos
Can we get the desired dataframe without using join operation ?
– kirti dahiya
Dec 28 '18 at 10:33
No, you will have to join. The alternative would be to match, but you would have to use a loop since you need to match more than one column, which defeats the purpose of Spark.
– Sotos
Dec 28 '18 at 13:12
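For completeness: a join can technically be avoided with a union plus an aggregation, although that still shuffles much like a join, so it buys little. A minimal sketch, assuming col1 is unique within each DataFrame and Spark 2.3+ for unionByName:
import pyspark.sql.functions as F
# Pad each DataFrame with the column it lacks, stack them, then collapse by col1.
u = (df1.withColumn('col3', F.lit(None).cast('long'))
        .unionByName(df2.withColumn('col2', F.lit(None).cast('long'))))
u.groupBy('col1').agg(
    F.first('col2', ignorenulls=True).alias('col2'),
    F.first('col3', ignorenulls=True).alias('col3'),
).show()  # row order after groupBy is not guaranteed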
# Given DataFrames
df1 = spark.createDataFrame([(1, 2), (2, 3)], ['col1', 'col2'])
df2 = spark.createDataFrame([(1, 2), (4, 3)], ['col1', 'col3'])
We can use two methods to get this full outer join.
# Method 1
df = df1.join(df2, ['col1'], how='full')
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|   2|
|   2|   3|null|
|   4|null|   3|
+----+----+----+
# Method 2
import pyspark.sql.functions as F
df = df1.join(df2, df1.col1 == df2.col1, how='full').select(
    F.coalesce(df1.col1, df2.col1).alias('col1'), df1.col2, df2.col3
)
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|   2|
|   2|   3|null|
|   4|null|   3|
+----+----+----+
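Why the coalesce in Method 2: joining on an expression (rather than on a list of column names, as in Method 1) keeps both col1 columns in the result, so F.coalesce merges them into one. A quick sketch to see this, using the same df1 and df2:
df1.join(df2, df1.col1 == df2.col1, how='full').columns
# ['col1', 'col2', 'col1', 'col3']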
answered Dec 28 '18 at 10:09 by cph_sto
The operation you are referring to here is called an outer join, or full outer join, on col1.
In a full outer join, we get all the records where the join-column values match in both tables, which gives Row(1, 2, 2). We also get all remaining records from the left table, with null for the right table's column, which gives Row(2, 3, null). Similarly, we get all remaining records from the right table, with null for the left table's column, which gives Row(4, null, 3).
For comparison, here are the results of the other joins (a runnable sketch follows the list):
- Inner join: all the records where the join-column values match.
  Result: [Row(1, 2, 2)]
- Left outer join: the inner-join result plus all remaining records from the left table, with null for the right table's column.
  Result: [Row(1, 2, 2), Row(2, 3, null)]
- Right outer join: the inner-join result plus all remaining records from the right table, with null for the left table's column.
  Result: [Row(1, 2, 2), Row(4, null, 3)]
- Full outer join: explained above.
  Result: [Row(1, 2, 2), Row(2, 3, null), Row(4, null, 3)]
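A minimal sketch to reproduce all four results, assuming df1, df2, and an active SparkSession are defined as in the question:
# Joining on the column name 'col1' coalesces the join key into a single column.
for how in ['inner', 'left', 'right', 'full']:
    print(how)
    df1.join(df2, 'col1', how).show()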
You can refer to the JOIN syntax documentation for PySpark.
Hope this explanation helps.
Regards,
Neeraj
answered Dec 28 '18 at 10:26 by neeraj bhadani