Calculate values from two dataframes in PySpark



























I'm trying to group and sum on a PySpark (2.4) DataFrame, but I can only get the values one by one.

I have the following dataframe:

data.groupBy("card_scheme", "failed").count().show()

+----------------+------+------+
|     card_Scheme|failed| count|
+----------------+------+------+
|             jcb| false|     4|
|american express| false| 22084|
|            AMEX| false|     4|
|      mastercard|  true|  1122|
|            visa|  true|  1975|
|            visa| false|126372|
|              CB| false|     6|
|        discover| false|  2219|
|         maestro| false|     2|
|            VISA| false|    13|
|      mastercard| false| 40856|
|      MASTERCARD| false|     9|
+----------------+------+------+

I'm trying to calculate the formula X = false / (false + true) for each card_scheme and still get one dataframe in the end.

I'm expecting something like:

| card_scheme | X      |
|-------------|--------|
| jcb         | 1      |
| ....        | .      |
| visa        | 0.9846 | (which is 126372 / (126372 + 1975))
| ...         | .      |








python apache-spark pyspark

asked Jan 3 at 10:22 by LaSul, edited Jan 4 at 6:40 by Shaido

5 Answers
































          Creating the dataset



          myValues = [('jcb',False,4),('american express', False, 22084),('AMEX',False,4),('mastercard',True,1122),('visa',True,1975),('visa',False,126372),('CB',False,6),('discover',False,2219),('maestro',False,2),('VISA',False,13),('mastercard',False,40856),('MASTERCARD',False,9)]
          df = sqlContext.createDataFrame(myValues,['card_Scheme','failed','count'])
          df.show()
+----------------+------+------+
|     card_Scheme|failed| count|
+----------------+------+------+
|             jcb| false|     4|
|american express| false| 22084|
|            AMEX| false|     4|
|      mastercard|  true|  1122|
|            visa|  true|  1975|
|            visa| false|126372|
|              CB| false|     6|
|        discover| false|  2219|
|         maestro| false|     2|
|            VISA| false|    13|
|      mastercard| false| 40856|
|      MASTERCARD| false|     9|
+----------------+------+------+


Method 1: This method will be slower, as it involves a transpose via pivot.



from pyspark.sql.functions import col, when

df = df.groupBy("card_Scheme").pivot("failed").sum("count")
# After the pivot there is one column per value of "failed": "true" and "false".
df = df.withColumn("X", when(col("true").isNotNull(), col("false") / (col("false") + col("true"))).otherwise(1))
df = df.select("card_Scheme", "X")
df.show()
+----------------+------------------+
|     card_Scheme|                 X|
+----------------+------------------+
|            VISA|               1.0|
|             jcb|               1.0|
|      MASTERCARD|               1.0|
|         maestro|               1.0|
|            AMEX|               1.0|
|      mastercard|0.9732717137548239|
|american express|               1.0|
|              CB|               1.0|
|        discover|               1.0|
|            visa|0.9846120283294506|
+----------------+------------------+


Method 2: Use a window function instead of a pivot. This will be a lot faster.



from pyspark.sql.functions import col, sum as sum_
from pyspark.sql.window import Window

# Start again from the original df built above (card_Scheme, failed, count).
df = (df.groupBy("card_scheme", "failed").agg(sum_("count").alias("sum_count"))
        .withColumn("X", col("sum_count") / sum_("sum_count").over(Window.partitionBy("card_scheme")))
        .where(col("failed") == False)
        .drop("failed", "sum_count"))
df.show()

+----------------+------------------+
|     card_scheme|                 X|
+----------------+------------------+
|            VISA|               1.0|
|             jcb|               1.0|
|      MASTERCARD|               1.0|
|         maestro|               1.0|
|            AMEX|               1.0|
|      mastercard|0.9732717137548239|
|american express|               1.0|
|              CB|               1.0|
|        discover|               1.0|
|            visa|0.9846120283294506|
+----------------+------------------+





answered Jan 3 at 10:50 by cph_sto (edited Jan 3 at 11:33)


























• Really helpful! Thanks a lot :)

            – LaSul
            Jan 3 at 13:18

































First, split the root dataframe into two dataframes:



df_true = data.filter(data.failed == True).alias("df1")
df_false = data.filter(data.failed == False).alias("df2")


Then, with a full outer join, we can get the final result:



from pyspark.sql.functions import col, when

df_result = (df_true.join(df_false, df_true.card_scheme == df_false.card_scheme, "outer")
             .select(when(col("df1.card_scheme").isNotNull(), col("df1.card_scheme"))
                         .otherwise(col("df2.card_scheme")).alias("card_scheme"),
                     when(col("df1.failed").isNotNull(),
                          col("df2.count") / (col("df1.count") + col("df2.count"))).otherwise(1).alias("X")))


No need to do a groupBy: just extract the two dataframes and join them.






answered Jan 4 at 6:31 by Md Shihab Uddin



















• Your code will work perfectly and will solve the problem at hand. I just have a small remark to make: joins are a quite costly operation if the data is stored on multiple partitions, as they involve a shuffle, whereas a groupBy may do the grouping on local partitions first and then shuffle. So the number of rows shuffled will be a lot less, making it more efficient to work with. Regards,

  – cph_sto
  Jan 4 at 8:35








• Yes. I agree :)

  – Md Shihab Uddin
  Jan 4 at 8:39

































          data.groupBy("card_scheme").pivot("failed").agg(count("card_scheme")) should work. I am not sure about the agg(count(any_column)), but the clue is pivot function. In result you'll get two new columns: false and true. Then you can easily calculate the x = false / (false + true).






answered Jan 3 at 11:07 by wind
























• Well, this is really close to what I want, but it sometimes gives me "null" values

            – LaSul
            Jan 3 at 13:01

































          A simple solution would be to do a second groupby:



import org.apache.spark.sql.functions.{lit, sum, when}
import spark.implicits._ // for the $"..." column syntax

val grouped_df = data.groupBy("card_scheme", "failed").count() // your dataframe
val with_countFalse = grouped_df.withColumn("countfalse", when($"failed" === "false", $"count").otherwise(lit(0)))
with_countFalse.groupBy("card_scheme").agg((sum($"countfalse") / sum($"count")).alias("X")).show()


The idea is that you can create a second column, countfalse, which holds the count when failed = false and 0 otherwise. This means that the sum of the count column gives you false + true, while the sum of countfalse gives just the false. Then simply do a second groupBy.
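
Since the question is tagged PySpark and the code above is Scala, here is a rough PySpark equivalent of the same conditional-aggregation idea (a sketch assuming data is the original dataframe from the question; the variable names are illustrative):

from pyspark.sql.functions import col, lit, sum as sum_, when

grouped_df = data.groupBy("card_scheme", "failed").count()

# Keep the count only where failed is false, 0 otherwise.
with_countfalse = grouped_df.withColumn(
    "countfalse", when(col("failed") == False, col("count")).otherwise(lit(0)))

# sum(count) is false + true, sum(countfalse) is just false, so the ratio is X.
(with_countfalse.groupBy("card_scheme")
                .agg((sum_("countfalse") / sum_("count")).alias("X"))
                .show())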



Note: Some of the other answers use pivot. I believe the pivot solution would be slower (it does more); however, if you do choose to use it, add the specific values to the pivot call, i.e. pivot("failed", ["true", "false"]), to improve performance. Otherwise Spark has to do two passes (the first one just to find the distinct values).






answered Jan 3 at 11:12 by Assaf Mendelson













































            from pyspark.sql import functions as func
            from pyspark.sql.functions import col
            data = data.groupby("card_scheme", "failed").count()


            Create 2 new dataframes:



            a = data.filter(col("failed") == "false").groupby("card_scheme").agg(func.sum("count").alias("num"))
            b = data.groupby("card_scheme").agg(func.sum("count").alias("den"))


            Join both the dataframes:



            c = a.join(b, a.card_scheme == b.card_scheme).drop(b.card_scheme)


Divide one column by the other to get X:



c = c.withColumn('X', c.num / c.den)
c.show()





answered Jan 3 at 11:35 by Ernest S Kirubakaran



















• Works perfectly! Didn't need to change anything :)

  – LaSul
  Jan 3 at 13:02










