Scala - Spark Boost GroupBy Computing for multiple Dimensions


























My goal is to create a cube of 4 dimensions and 1 measure.



This means I have 16 GroupBys to compute in total (one per subset of the dimensions).



In my code you can see the 4 dimensions (Gender, Age, TotalChildren, ProductCategoryName) and the measure TotalCost.



I have filtered all my columns to drop any row that is null.



After that I compute every GroupBy one by one and then use coalesce(1) so that each result is written as a single CSV file.



The whole process takes about 10 minutes, which I think is too much.



Is there any way to speed this up? Maybe by computing some GroupBys from others?



Also, my data is about 5 GB, so if it is read once per GroupBy, that is 16 reads, or roughly 80 GB in total.





Here is my code:



import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object ComputeCube {

  def main(args: Array[String]): Unit = {

    val spark: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("SparkProject2018")
      .getOrCreate()

    import spark.implicits._

    val filePath = "src/main/resources/dataspark.txt"

    val df = spark.read
      .options(Map("inferSchema" -> "true", "delimiter" -> "|", "header" -> "true"))
      .csv(filePath)
      .select("Gender", "BirthDate", "TotalCost", "TotalChildren", "ProductCategoryName")

    // Drop any row with a null in one of the dimension columns
    val df2 = df
      .filter("Gender is not null")
      .filter("BirthDate is not null")
      .filter("TotalChildren is not null")
      .filter("ProductCategoryName is not null")

    // Age in whole years, computed from BirthDate
    val currentDate = udf { (dob: java.sql.Date) =>
      import java.time.{LocalDate, Period}
      Period.between(dob.toLocalDate, LocalDate.now).getYears
    }

    val df3 = df2.withColumn("Age", currentDate($"BirthDate"))

    // All 16 subsets of the 4 dimensions, each aggregated separately
    val groupByAll = df3.groupBy("Gender", "Age", "TotalChildren", "ProductCategoryName").avg("TotalCost")
    val groupByGenderAndAgeAndTotalChildren = df3.groupBy("Gender", "Age", "TotalChildren").avg("TotalCost")
    val groupByGenderAndAgeAndProductCategoryName = df3.groupBy("Gender", "Age", "ProductCategoryName").avg("TotalCost")
    val groupByGenderAndTotalChildrenAndProductCategoryName = df3.groupBy("Gender", "TotalChildren", "ProductCategoryName").avg("TotalCost")
    val groupByAgeAndTotalChildrenAndProductCategoryName = df3.groupBy("Age", "TotalChildren", "ProductCategoryName").avg("TotalCost")
    val groupByGenderAndAge = df3.groupBy("Gender", "Age").avg("TotalCost")
    val groupByGenderAndTotalChildren = df3.groupBy("Gender", "TotalChildren").avg("TotalCost")
    val groupByGenderAndProductCategoryName = df3.groupBy("Gender", "ProductCategoryName").avg("TotalCost")
    val groupByAgeAndTotalChildren = df3.groupBy("Age", "TotalChildren").avg("TotalCost")
    val groupByAgeAndProductCategoryName = df3.groupBy("Age", "ProductCategoryName").avg("TotalCost")
    val groupByTotalChildrenAndProductCategoryName = df3.groupBy("TotalChildren", "ProductCategoryName").avg("TotalCost")
    val groupByGender = df3.groupBy("Gender").avg("TotalCost")
    val groupByAge = df3.groupBy("Age").avg("TotalCost")
    val groupByTotalChildren = df3.groupBy("TotalChildren").avg("TotalCost")
    val groupByProductCategoryName = df3.groupBy("ProductCategoryName").avg("TotalCost")
    val groupByNone = df3.groupBy().avg("TotalCost")

    // Each aggregate is coalesced to one partition and written as its own CSV
    groupByAll.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
      .mode("overwrite").save("src/main/resources/All.csv")

    groupByGenderAndAgeAndTotalChildren.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
      .mode("overwrite").save("src/main/resources/Gender_Age_TotalChildren.csv")

    groupByGenderAndAgeAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
      .mode("overwrite").save("src/main/resources/Gender_Age_ProductCategoryName.csv")

    groupByGenderAndTotalChildrenAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
      .mode("overwrite").save("src/main/resources/Gender_TotalChildren_ProductCategoryName.csv")

    groupByAgeAndTotalChildrenAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
      .mode("overwrite").save("src/main/resources/Age_TotalChildren_ProductCategoryName.csv")

    groupByGenderAndAge.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
      .mode("overwrite").save("src/main/resources/Gender_Age.csv")

    groupByGenderAndTotalChildren.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
      .mode("overwrite").save("src/main/resources/Gender_TotalChildren.csv")

    groupByGenderAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
      .mode("overwrite").save("src/main/resources/Gender_ProductCategoryName.csv")

    groupByAgeAndTotalChildren.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
      .mode("overwrite").save("src/main/resources/Age_TotalChildren.csv")

    groupByAgeAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
      .mode("overwrite").save("src/main/resources/Age_ProductCategoryName.csv")

    groupByTotalChildrenAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
      .mode("overwrite").save("src/main/resources/TotalChildren_ProductCategoryName.csv")

    groupByGender.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
      .mode("overwrite").save("src/main/resources/Gender.csv")

    groupByAge.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
      .mode("overwrite").save("src/main/resources/Age.csv")

    groupByTotalChildren.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
      .mode("overwrite").save("src/main/resources/TotalChildren.csv")

    groupByProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
      .mode("overwrite").save("src/main/resources/ProductCategoryName.csv")

    groupByNone.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
      .mode("overwrite").save("src/main/resources/None.csv")
  }
}
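

The question "can some GroupBys be computed from others?" has a yes answer in principle: if the finest grouping keeps sum and count per group, every coarser grouping can be derived from it without touching the raw data again. A minimal plain-Python sketch of that rollup idea (the data and helper names here are hypothetical, purely for illustration):

```python
from collections import defaultdict

# Toy rows: (gender, age, total_children, category, total_cost).
rows = [
    ("M", 30, 2, "Bikes", 100.0),
    ("F", 30, 2, "Bikes", 200.0),
    ("M", 40, 1, "Books", 50.0),
]

# One pass over the raw data: keep (sum, count) at the finest grain.
finest = defaultdict(lambda: [0.0, 0])
for g, a, c, p, cost in rows:
    acc = finest[(g, a, c, p)]
    acc[0] += cost
    acc[1] += 1

def rollup(dims):
    """avg(TotalCost) for any subset of the 4 dimensions, derived
    from the finest-grain sums/counts -- no rescan of raw rows."""
    out = defaultdict(lambda: [0.0, 0])
    for key, (s, n) in finest.items():
        sub = tuple(key[i] for i in dims)
        out[sub][0] += s
        out[sub][1] += n
    return {k: s / n for k, (s, n) in out.items()}

# Average cost by gender, derived from the finest grouping:
print(rollup((0,)))  # {('M',): 75.0, ('F',): 200.0}
```

The same principle is what makes the answer below work: the expensive scan happens once, and the 15 coarser aggregates are cheap reductions of a small intermediate table.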









scala apache-spark intellij-idea apache-spark-sql

asked Dec 27 '18 at 16:13 by giorgionasis (edited Dec 28 '18 at 0:34)


1 Answer


















import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object Test1 {

  case class SalesValue(Gender: String, BirthDate: String, TotalChildren: Int,
                        ProductCategoryName: String, TotalCost: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
    import spark.implicits._

    val seq = Seq(("boy", "1990-01-01", 2, "milk", 200), ("girl", "1990-01-01", 2, "milk", 250), ("boy", "1990-01-01", 2, "milk", 270))
    val ds = spark.sparkContext.parallelize(seq).map(r => SalesValue(r._1, r._2, r._3, r._4, r._5)).toDS()

    // Aggregate once at the finest grain, keeping sum and count (not avg):
    // every coarser average can then be derived without rescanning the raw data.
    val agg = ds.groupBy("Gender", "BirthDate", "TotalChildren", "ProductCategoryName")
      .agg(sum($"TotalCost").alias("sum"), count($"Gender").alias("count"))
    agg.cache()

    // Example: average TotalCost per Gender, computed from the cached partial sums
    agg.groupBy("Gender").agg((sum($"sum") / sum($"count")).alias("avg(TotalCost)")).show()
  }
}
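

The reason the answer carries sum and count instead of avg is that averages do not compose: averaging pre-computed averages gives the wrong result whenever group sizes differ. A quick plain-Python check with hypothetical numbers:

```python
# Two finest-grain groups as (sum, count) partial aggregates.
groups = [(300.0, 3), (50.0, 1)]

# Correct coarser average: combine sums and counts, then divide once.
correct = sum(s for s, _ in groups) / sum(n for _, n in groups)  # 350 / 4 = 87.5

# Naive average-of-averages is wrong when group sizes differ.
naive = ((300.0 / 3) + (50.0 / 1)) / 2  # (100 + 50) / 2 = 75.0

print(correct, naive)  # 87.5 75.0
```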





answered by user7460598 (a new contributor)


















            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53947868%2fscala-spark-boost-groupby-computing-for-multiple-dimensions%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            1 Answer
            1






            active

            oldest

            votes








            1 Answer
            1






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            0














            import org.apache.spark.sql.SparkSession
            import org.apache.spark.sql.functions._

            object Test1 {
            case class SalesValue(Gender:String, BirthDate:String, TotalChildren:Int, ProductCategoryName:String, TotalCost: Int)
            def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
            import spark.implicits._
            val seq = Seq(("boy", "1990-01-01", 2, "milk", 200), ("girl", "1990-01-01", 2, "milk", 250), ("boy", "1990-01-01", 2, "milk", 270))

            val ds = spark.sparkContext.parallelize(seq).map(r => SalesValue(r._1, r._2, r._3, r._4, r._5)).toDS()
            val agg = ds.groupBy("Gender", "BirthDate", "TotalChildren", "ProductCategoryName").agg(sum($"TotalCost").alias("sum"), count($"Gender").alias("count"))
            agg.cache()
            val result1 = agg.groupBy("Gender").agg(sum($"sum")/sum($"count")).show()
            }
            }





            share|improve this answer










            New contributor




            user7460598 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
            Check out our Code of Conduct.























                edited Dec 28 '18 at 3:34






























                answered Dec 28 '18 at 3:17









                user7460598








































































































