Combine pyspark dataframes [duplicate]
This question already has an answer here:
What are the various join types in Spark? (3 answers)
Suppose I have two Spark DataFrames:
df1 = spark.createDataFrame([(1, 2), (2, 3)], ['col1', 'col2'])
df2 = spark.createDataFrame([(1, 2), (4, 3)], ['col1', 'col3'])
I want to combine them to get the result below (NA denotes a missing value; Spark will display it as null):
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|   2|
|   2|   3|  NA|
|   4|  NA|   3|
+----+----+----+
How can I get this result?
apache-spark pyspark
asked Dec 28 '18 at 7:56 by kirti dahiya; edited Dec 28 '18 at 8:02 by Sotos
marked as duplicate by pault, eliasah, Jan 3 at 7:01
This question has been asked before and already has an answer. If those answers do not fully address your question, please ask a new question.
3 Answers
You need an outer join, i.e.
df1.join(df2, 'col1', 'outer').show()
which gives,
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|   2|
|   2|   3|null|
|   4|null|   3|
+----+----+----+
answered Dec 28 '18 at 8:01 by Sotos
Can we get the desired dataframe without using join operation ?
– kirti dahiya
Dec 28 '18 at 10:33
No, you will have to join. The alternative would be to match, but you would have to use a loop since you need to match more than one column, which defeats the purpose of Spark.
– Sotos
Dec 28 '18 at 13:12
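For completeness: a join can technically be avoided with a union plus an aggregation, although that still shuffles much like a join, so it buys little. A minimal sketch, assuming col1 is unique within each DataFrame and Spark 2.3+ for unionByName:
import pyspark.sql.functions as F
# Pad each DataFrame with the column it lacks, stack them, then collapse by col1.
u = (df1.withColumn('col3', F.lit(None).cast('long'))
        .unionByName(df2.withColumn('col2', F.lit(None).cast('long'))))
u.groupBy('col1').agg(
    F.first('col2', ignorenulls=True).alias('col2'),
    F.first('col3', ignorenulls=True).alias('col3'),
).show()  # row order after groupBy is not guaranteed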
# Given DataFrames
df1 = spark.createDataFrame([(1, 2), (2, 3)], ['col1', 'col2'])
df2 = spark.createDataFrame([(1, 2), (4, 3)], ['col1', 'col3'])
We can use two methods to get this full outer join.
# Method 1
df = df1.join(df2, ['col1'], how='full')
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|   2|
|   2|   3|null|
|   4|null|   3|
+----+----+----+
# Method 2
import pyspark.sql.functions as F
df = df1.join(df2, df1.col1 == df2.col1, how='full').select(
    F.coalesce(df1.col1, df2.col1).alias('col1'), df1.col2, df2.col3
)
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   2|   2|
|   2|   3|null|
|   4|null|   3|
+----+----+----+
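Why the coalesce in Method 2: joining on an expression (rather than on a list of column names, as in Method 1) keeps both col1 columns in the result, so F.coalesce merges them into one. A quick sketch to see this, using the same df1 and df2:
df1.join(df2, df1.col1 == df2.col1, how='full').columns
# ['col1', 'col2', 'col1', 'col3']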
answered Dec 28 '18 at 10:09 by cph_sto
The operation you are referring to here is called an outer join, or full outer join, on col1.
In a full outer join, we get all the records where the join-column values match in both tables, which gives Row(1, 2, 2). We also get all remaining records from the left table, with null for the right table's column, which gives Row(2, 3, null). Similarly, we get all remaining records from the right table, with null for the left table's column, which gives Row(4, null, 3).
For comparison, here are the results of the other joins (a runnable sketch follows the list):
- Inner join: all the records where the join-column values match.
  Result: [Row(1, 2, 2)]
- Left outer join: the inner-join result plus all remaining records from the left table, with null for the right table's column.
  Result: [Row(1, 2, 2), Row(2, 3, null)]
- Right outer join: the inner-join result plus all remaining records from the right table, with null for the left table's column.
  Result: [Row(1, 2, 2), Row(4, null, 3)]
- Full outer join: explained above.
  Result: [Row(1, 2, 2), Row(2, 3, null), Row(4, null, 3)]
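A minimal sketch to reproduce all four results, assuming df1, df2, and an active SparkSession are defined as in the question:
# Joining on the column name 'col1' coalesces the join key into a single column.
for how in ['inner', 'left', 'right', 'full']:
    print(how)
    df1.join(df2, 'col1', how).show()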
You can refer to the JOIN syntax documentation for PySpark.
Hope this explanation helps.
Regards,
Neeraj
answered Dec 28 '18 at 10:26 by neeraj bhadani