Calculate values from two dataframes in PySpark
I'm trying to group and sum a PySpark (2.4) DataFrame, but I can only get the values one by one.
I have the following dataframe:
data.groupBy("card_scheme", "failed").count().show()
+----------------+------+------+
| card_Scheme|failed| count|
+----------------+------+------+
| jcb| false| 4|
|american express| false| 22084|
| AMEX| false| 4|
| mastercard| true| 1122|
| visa| true| 1975|
| visa| false|126372|
| CB| false| 6|
| discover| false| 2219|
| maestro| false| 2|
| VISA| false| 13|
| mastercard| false| 40856|
| MASTERCARD| false| 9|
+----------------+------+------+
I'm trying to calculate the formula X = false / (false + true)
for each card_scheme and still get one dataframe in the end.
I'm expecting something like:
| card_scheme | X      |
|-------------|--------|
| jcb         | 1      |
| ...         | ...    |
| visa        | 0.9846 |
| ...         | ...    |
(where 0.9846 = 126372 / (126372 + 1975))
Tags: python, apache-spark, pyspark
asked Jan 3 at 10:22 by LaSul, edited Jan 4 at 6:40 by Shaido
5 Answers
Creating the dataset
myValues = [('jcb',False,4),('american express', False, 22084),('AMEX',False,4),('mastercard',True,1122),('visa',True,1975),('visa',False,126372),('CB',False,6),('discover',False,2219),('maestro',False,2),('VISA',False,13),('mastercard',False,40856),('MASTERCARD',False,9)]
df = sqlContext.createDataFrame(myValues,['card_Scheme','failed','count'])
df.show()
+----------------+------+------+
| card_Scheme|failed| count|
+----------------+------+------+
| jcb| false| 4|
|american express| false| 22084|
| AMEX| false| 4|
| mastercard| true| 1122|
| visa| true| 1975|
| visa| false|126372|
| CB| false| 6|
| discover| false| 2219|
| maestro| false| 2|
| VISA| false| 13|
| mastercard| false| 40856|
| MASTERCARD| false| 9|
+----------------+------+------+
Method 1: This method will be slower, as it involves a transpose via pivot.
from pyspark.sql.functions import col, when

df = df.groupBy("card_Scheme").pivot("failed").sum("count")
df = df.withColumn("X", when(col("true").isNotNull(), col("false") / (col("false") + col("true"))).otherwise(1))
df = df.select("card_Scheme", "X")
df.show()
+----------------+------------------+
| card_Scheme| X|
+----------------+------------------+
| VISA| 1.0|
| jcb| 1.0|
| MASTERCARD| 1.0|
| maestro| 1.0|
| AMEX| 1.0|
| mastercard|0.9732717137548239|
|american express| 1.0|
| CB| 1.0|
| discover| 1.0|
| visa|0.9846120283294506|
+----------------+------------------+
Method 2: Use SQL. You can do the same via a window function, which will be a lot faster.
from pyspark.sql.window import Window
from pyspark.sql.functions import col, sum as spark_sum

# Assumes df is the original (card_Scheme, failed, count) dataframe
df = (df.groupBy("card_scheme", "failed").agg(spark_sum("count"))
        .withColumn("X", col("sum(count)") / spark_sum("sum(count)").over(Window.partitionBy(col("card_scheme"))))
        .where(col("failed") == False)
        .drop("failed", "sum(count)"))
df.show()
+----------------+------------------+
| card_scheme| X|
+----------------+------------------+
| VISA| 1.0|
| jcb| 1.0|
| MASTERCARD| 1.0|
| maestro| 1.0|
| AMEX| 1.0|
| mastercard|0.9732717137548239|
|american express| 1.0|
| CB| 1.0|
| discover| 1.0|
| visa|0.9846120283294506|
+----------------+------------------+
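For completeness, since Method 2 mentions SQL, the same window-function logic can also be written as an actual Spark SQL query. This is only a sketch: it assumes a SparkSession named spark, an illustrative view name, and that df has one row per (card_Scheme, failed) pair as above.

# Sketch: the window-function ratio expressed in Spark SQL (names are illustrative)
df.createOrReplaceTempView("scheme_counts")
spark.sql("""
    SELECT card_Scheme, X
    FROM (
        SELECT card_Scheme, failed,
               `count` / SUM(`count`) OVER (PARTITION BY card_Scheme) AS X
        FROM scheme_counts
    ) t
    WHERE failed = false
""").show()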
Really helpful ! Thanks a lot :)
– LaSul
Jan 3 at 13:18
First, split the root dataframe into two dataframes:
df_true = data.filter(data.failed == True).alias("df1")
df_false = data.filter(data.failed == False).alias("df2")
Then, with a full outer join, we can get the final result:
from pyspark.sql.functions import col, when

df_result = (df_true.join(df_false, df_true.card_scheme == df_false.card_scheme, "outer")
    .select(when(col("df1.card_scheme").isNotNull(), col("df1.card_scheme")).otherwise(col("df2.card_scheme")).alias("card_scheme"),
            when(col("df1.failed").isNotNull(), col("df2.count") / (col("df1.count") + col("df2.count"))).otherwise(1).alias("X")))
No need to do a groupBy, just two extra dataframes and a join.
Your code will work perfectly and will solve the problem at hand. I just have a small remark to make: joins are quite costly operations if the data is stored on multiple partitions, as they involve a shuffle, whereas a groupBy may do the grouping on local partitions first and then do the shuffle. So the number of rows shuffled will be a lot less, making it more efficient to work with. Regards,
– cph_sto
Jan 4 at 8:35
Yes. I agree :)
– Md Shihab Uddin
Jan 4 at 8:39
data.groupBy("card_scheme").pivot("failed").agg(count("card_scheme"))
should work. I am not sure about the agg(count(any_column))
, but the clue is pivot
function. In result you'll get two new columns: false
and true
. Then you can easily calculate the x = false / (false + true)
.
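A minimal sketch of what the full pivot-based computation could look like, passing explicit pivot values and replacing missing counts with 0 so the ratio never comes out null (variable names are illustrative, not part of the original answer):

# Sketch: pivot on the boolean column, then compute the ratio with nulls handled
from pyspark.sql.functions import count, coalesce, col, lit

pivoted = data.groupBy("card_scheme").pivot("failed", [True, False]).agg(count("card_scheme"))
result = (pivoted
          .withColumn("true", coalesce(col("true"), lit(0)))     # schemes with no failed rows
          .withColumn("false", coalesce(col("false"), lit(0)))   # schemes with only failed rows
          .withColumn("X", col("false") / (col("false") + col("true")))
          .select("card_scheme", "X"))
result.show()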
Well, this is really close to what I want, but it sometimes gives me "null" values
– LaSul
Jan 3 at 13:01
A simple solution would be to do a second groupby:
val grouped_df = data.groupBy("card_scheme", "failed").count() // your dataframe
val with_countFalse = grouped_df.withColumn("countfalse", when($"failed" === "false", $"count").otherwise(lit(0)))
with_countFalse.groupBy("card_scheme").agg((sum($"countfalse") / sum($"count")).alias("X")).show()
The idea is that you can create a second column which holds the count when failed = false and 0 otherwise. This means that the sum of the count column gives you false + true, while the sum of countfalse gives just the false. Then simply do a second groupBy.
Note: Some of the other answers use pivot. I believe the pivot solution would be slower (it does more); however, if you do choose to use it, add the specific values to the pivot call, i.e. pivot("failed", ["true", "false"]), to improve performance. Otherwise Spark would have to do two passes (the first to find the values).
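Since the question is tagged pyspark and the snippet above is in Scala, a rough PySpark equivalent of the same conditional-sum idea might look like this (a sketch with illustrative names, not the answer author's code):

from pyspark.sql.functions import when, col, sum as spark_sum

grouped_df = data.groupBy("card_scheme", "failed").count()
# countfalse holds the count only for rows where failed == False, and 0 otherwise
with_countfalse = grouped_df.withColumn(
    "countfalse", when(col("failed") == False, col("count")).otherwise(0))
result = (with_countfalse.groupBy("card_scheme")
          .agg((spark_sum("countfalse") / spark_sum("count")).alias("X")))
result.show()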
from pyspark.sql import functions as func
from pyspark.sql.functions import col
data = data.groupby("card_scheme", "failed").count()
Create 2 new dataframes:
a = data.filter(col("failed") == "false").groupby("card_scheme").agg(func.sum("count").alias("num"))
b = data.groupby("card_scheme").agg(func.sum("count").alias("den"))
Join both the dataframes:
c = a.join(b, a.card_scheme == b.card_scheme).drop(b.card_scheme)
Divide one column with another:
c.withColumn('X', c.num/c.den)
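One small note: withColumn returns a new dataframe rather than modifying c in place, so capture the result and keep only the final columns, for example:

# Capture the new dataframe and keep just the scheme and its ratio
result = c.withColumn('X', c.num / c.den).select('card_scheme', 'X')
result.show()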
Works perfectly ! Didn't need to change anything :)
– LaSul
Jan 3 at 13:02