Calculate values from two dataframes in PySpark
I'm trying to group and sum a PySpark (2.4) DataFrame, but I can only get the values one by one.
I have the following dataframe:
data.groupBy("card_scheme", "failed").count().show()
+----------------+------+------+
| card_Scheme|failed| count|
+----------------+------+------+
| jcb| false| 4|
|american express| false| 22084|
| AMEX| false| 4|
| mastercard| true| 1122|
| visa| true| 1975|
| visa| false|126372|
| CB| false| 6|
| discover| false| 2219|
| maestro| false| 2|
| VISA| false| 13|
| mastercard| false| 40856|
| MASTERCARD| false| 9|
+----------------+------+------+
I'm trying to calculate the formula X = false / (false + true)
for each card_scheme and still get one dataframe in the end.
I'm expecting something like:
| card_scheme | X      |
|-------------|--------|
| jcb         | 1      |
| ...         | ...    |
| visa        | 0.9846 |
| ...         | ...    |
(where 0.9846 = 126372 / (126372 + 1975))
Tags: python, apache-spark, pyspark
asked Jan 3 at 10:22 by LaSul, edited Jan 4 at 6:40 by Shaido
5 Answers
Creating the dataset
myValues = [('jcb',False,4),('american express', False, 22084),('AMEX',False,4),('mastercard',True,1122),('visa',True,1975),('visa',False,126372),('CB',False,6),('discover',False,2219),('maestro',False,2),('VISA',False,13),('mastercard',False,40856),('MASTERCARD',False,9)]
df = sqlContext.createDataFrame(myValues,['card_Scheme','failed','count'])
df.show()
+----------------+------+------+
| card_Scheme|failed| count|
+----------------+------+------+
| jcb| false| 4|
|american express| false| 22084|
| AMEX| false| 4|
| mastercard| true| 1122|
| visa| true| 1975|
| visa| false|126372|
| CB| false| 6|
| discover| false| 2219|
| maestro| false| 2|
| VISA| false| 13|
| mastercard| false| 40856|
| MASTERCARD| false| 9|
+----------------+------+------+
Method 1: This method will be slower, as it involves a transpose via pivot.
from pyspark.sql.functions import col, when

df = df.groupBy("card_Scheme").pivot("failed").sum("count")
df = df.withColumn("X", when(col("true").isNotNull(), col("false") / (col("false") + col("true"))).otherwise(1))
df = df.select("card_Scheme", "X")
df.show()
+----------------+------------------+
| card_Scheme| X|
+----------------+------------------+
| VISA| 1.0|
| jcb| 1.0|
| MASTERCARD| 1.0|
| maestro| 1.0|
| AMEX| 1.0|
| mastercard|0.9732717137548239|
|american express| 1.0|
| CB| 1.0|
| discover| 1.0|
| visa|0.9846120283294506|
+----------------+------------------+
Method 2: Use SQL. You can do the same via a window function, which will be a lot faster.
from pyspark.sql.window import Window
from pyspark.sql.functions import col, sum as spark_sum

# Assumes df is the original (card_Scheme, failed, count) dataframe
df = (df.groupBy("card_scheme", "failed").agg(spark_sum("count"))
        .withColumn("X", col("sum(count)") / spark_sum("sum(count)").over(Window.partitionBy(col("card_scheme"))))
        .where(col("failed") == False)
        .drop("failed", "sum(count)"))
df.show()
+----------------+------------------+
| card_scheme| X|
+----------------+------------------+
| VISA| 1.0|
| jcb| 1.0|
| MASTERCARD| 1.0|
| maestro| 1.0|
| AMEX| 1.0|
| mastercard|0.9732717137548239|
|american express| 1.0|
| CB| 1.0|
| discover| 1.0|
| visa|0.9846120283294506|
+----------------+------------------+
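For completeness, since Method 2 mentions SQL, the same window-function logic can also be written as an actual Spark SQL query. This is only a sketch: it assumes a SparkSession named spark, an illustrative view name, and that df has one row per (card_Scheme, failed) pair as above.

# Sketch: the window-function ratio expressed in Spark SQL (names are illustrative)
df.createOrReplaceTempView("scheme_counts")
spark.sql("""
    SELECT card_Scheme, X
    FROM (
        SELECT card_Scheme, failed,
               `count` / SUM(`count`) OVER (PARTITION BY card_Scheme) AS X
        FROM scheme_counts
    ) t
    WHERE failed = false
""").show()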
Really helpful ! Thanks a lot :)
– LaSul
Jan 3 at 13:18
First, split the root dataframe into two dataframes:
df_true = data.filter(data.failed == True).alias("df1")
df_false = data.filter(data.failed == False).alias("df2")
Then, with a full outer join, we can get the final result:
from pyspark.sql.functions import col, when

df_result = (df_true.join(df_false, df_true.card_scheme == df_false.card_scheme, "outer")
    .select(when(col("df1.card_scheme").isNotNull(), col("df1.card_scheme")).otherwise(col("df2.card_scheme")).alias("card_scheme"),
            when(col("df1.failed").isNotNull(), col("df2.count") / (col("df1.count") + col("df2.count"))).otherwise(1).alias("X")))
No need to do a groupBy, just two extra dataframes and a join.
Your code will work perfectly and will solve the problem at hand. I just have a small remark to make: joins are quite costly operations if the data is stored on multiple partitions, as they involve a shuffle, whereas a groupBy may do the grouping on local partitions first and then do the shuffle. So the number of rows shuffled will be a lot less, making it more efficient to work with. Regards,
– cph_sto
Jan 4 at 8:35
Yes. I agree :)
– Md Shihab Uddin
Jan 4 at 8:39
data.groupBy("card_scheme").pivot("failed").agg(count("card_scheme"))
should work. I am not sure about the agg(count(any_column))
, but the clue is pivot
function. In result you'll get two new columns: false
and true
. Then you can easily calculate the x = false / (false + true)
.
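A minimal sketch of what the full pivot-based computation could look like, passing explicit pivot values and replacing missing counts with 0 so the ratio never comes out null (variable names are illustrative, not part of the original answer):

# Sketch: pivot on the boolean column, then compute the ratio with nulls handled
from pyspark.sql.functions import count, coalesce, col, lit

pivoted = data.groupBy("card_scheme").pivot("failed", [True, False]).agg(count("card_scheme"))
result = (pivoted
          .withColumn("true", coalesce(col("true"), lit(0)))     # schemes with no failed rows
          .withColumn("false", coalesce(col("false"), lit(0)))   # schemes with only failed rows
          .withColumn("X", col("false") / (col("false") + col("true")))
          .select("card_scheme", "X"))
result.show()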
Well, this is really close to what I want, but it sometimes gives me "null" values
– LaSul
Jan 3 at 13:01
A simple solution would be to do a second groupby:
val grouped_df = data.groupBy("card_scheme", "failed").count() // your dataframe
val with_countFalse = grouped_df.withColumn("countfalse", when($"failed" === "false", $"count").otherwise(lit(0)))
with_countFalse.groupBy("card_scheme").agg((sum($"countfalse") / sum($"count")).alias("X")).show()
The idea is that you can create a second column which holds the count when failed = false and 0 otherwise. This means that the sum of the count column gives you false + true, while the sum of countfalse gives just the false. Then simply do a second groupBy.
Note: Some of the other answers use pivot. I believe the pivot solution would be slower (it does more); however, if you do choose to use it, add the specific values to the pivot call, i.e. pivot("failed", ["true", "false"]), to improve performance. Otherwise Spark would have to do two passes (the first to find the values).
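Since the question is tagged pyspark and the snippet above is in Scala, a rough PySpark equivalent of the same conditional-sum idea might look like this (a sketch with illustrative names, not the answer author's code):

from pyspark.sql.functions import when, col, sum as spark_sum

grouped_df = data.groupBy("card_scheme", "failed").count()
# countfalse holds the count only for rows where failed == False, and 0 otherwise
with_countfalse = grouped_df.withColumn(
    "countfalse", when(col("failed") == False, col("count")).otherwise(0))
result = (with_countfalse.groupBy("card_scheme")
          .agg((spark_sum("countfalse") / spark_sum("count")).alias("X")))
result.show()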
from pyspark.sql import functions as func
from pyspark.sql.functions import col
data = data.groupby("card_scheme", "failed").count()
Create 2 new dataframes:
a = data.filter(col("failed") == "false").groupby("card_scheme").agg(func.sum("count").alias("num"))
b = data.groupby("card_scheme").agg(func.sum("count").alias("den"))
Join both the dataframes:
c = a.join(b, a.card_scheme == b.card_scheme).drop(b.card_scheme)
Divide one column with another:
c.withColumn('X', c.num/c.den)
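One small note: withColumn returns a new dataframe rather than modifying c in place, so capture the result and keep only the final columns, for example:

# Capture the new dataframe and keep just the scheme and its ratio
result = c.withColumn('X', c.num / c.den).select('card_scheme', 'X')
result.show()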
Works perfectly ! Didn't need to change anything :)
– LaSul
Jan 3 at 13:02