Scala - Spark Boost GroupBy Computing for multiple Dimensions

My goal is to create a Cube of 4 Dimensions and 1 Measure.

This means I have in total 16 GroupBy`s to compute.

In my code you can see the 4 Dimensions (Gender,Age,TotalChildren,ProductCategoryName) and the Measure TotalCost.

I have filter all my columns to drop any row that it is null.

After that I compute every GroupBy one by one and then I use coalesce() to bind the csv`s into one file.

All the process takes about 10 minutes which I think is too much.

Is there any way to enhance the process? Maybe by computing some groupby`s from others?

Also my data is about 5GB so if I read it 16 times as the number of groupby`s this means in total 80GB.

Here is my Code

import org.apache.spark.sql.SparkSession

import org.apache.spark.sql.functions.udf





object ComputeCube {



def main(args:Array[String]):Unit= {



val spark: SparkSession = SparkSession.builder()

  .master("local[*]")

  .appName("SparkProject2018")

  .getOrCreate()



import spark.implicits._



val filePath="src/main/resources/dataspark.txt"



val df = spark.read.options(Map("inferSchema"->"true","delimiter"->"|","header"->"true"))

  .csv(filePath).select("Gender", "BirthDate", "TotalCost", "TotalChildren", "ProductCategoryName")



val df2 = df

  .filter("Gender is not null")

  .filter("BirthDate is not null")

  .filter("TotalChildren is not null")

  .filter("ProductCategoryName is not null")



val currentDate = udf{ (dob: java.sql.Date) =>

  import java.time.{LocalDate, Period}

  Period.between(dob.toLocalDate, LocalDate.now).getYears

}



val df3 = df2.withColumn("Age", currentDate($"BirthDate"))





val groupByAll = df3.groupBy("Gender","Age", "TotalChildren", "ProductCategoryName" ).avg("TotalCost")



val groupByGenderAndAgeAndTotalChildren = df3.groupBy("Gender","Age", "TotalChildren").avg("TotalCost")



val groupByGenderAndAgeAndProductCategoryName = df3.groupBy("Gender","Age", "ProductCategoryName" ).avg("TotalCost")



val groupByGenderAndTotalChildrenAndProductCategoryName = df3.groupBy("Gender", "TotalChildren", "ProductCategoryName" ).avg("TotalCost")



val groupByAgeAndTotalChildrenAndProductCategoryName = df3.groupBy("Age", "TotalChildren", "ProductCategoryName" ).avg("TotalCost")



val groupByGenderAndAge = df3.groupBy("Gender","Age").avg("TotalCost")



val groupByGenderAndTotalChildren = df3.groupBy("Gender","TotalChildren").avg("TotalCost")



val groupByGenderAndProductCategoryName = df3.groupBy("Gender","ProductCategoryName" ).avg("TotalCost")



val groupByAgeAndTotalChildren = df3.groupBy("Age","TotalChildren").avg("TotalCost")



val groupByAgeAndProductCategoryName = df3.groupBy("Age","ProductCategoryName" ).avg("TotalCost")



val groupByTotalChildrenAndProductCategoryName = df3.groupBy("TotalChildren","ProductCategoryName" ).avg("TotalCost")



val groupByGender = df3.groupBy("Gender").avg("TotalCost")



val groupByAge = df3.groupBy("Age").avg("TotalCost")



val groupByTotalChildren = df3.groupBy("TotalChildren" ).avg("TotalCost")



val groupByProductCategoryName = df3.groupBy("ProductCategoryName" ).avg("TotalCost")



val groupByNone = df3.groupBy().avg("TotalCost")





groupByAll.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/All.csv")



groupByGenderAndAgeAndTotalChildren.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender_Age_TotalChildren.csv")



groupByGenderAndAgeAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender_Age_ProductCategoryName.csv")



groupByGenderAndTotalChildrenAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender_TotalChildren_ProductCategoryName.csv")



groupByAgeAndTotalChildrenAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Age_TotalChildren_ProductCategoryName.csv")



groupByGenderAndAge.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender_Age.csv")



groupByGenderAndTotalChildren.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender_TotalChildren.csv")



groupByGenderAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender_ProductCategoryName.csv")



groupByAgeAndTotalChildren.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Age_TotalChildren.csv")



groupByAgeAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Age_ProductCategoryName.csv")



groupByTotalChildrenAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/TotalChildren_ProductCategoryName.csv")



groupByGender.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender.csv")



groupByAge.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Age.csv")



groupByTotalChildren.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/TotalChildren.csv")



groupByProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/ProductCategoryName.csv")



groupByNone.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/None.csv")



  }

 }

edited Dec 28 '18 at 0:34

asked Dec 27 '18 at 16:13

giorgionasis

92210

add a comment |

My goal is to create a Cube of 4 Dimensions and 1 Measure.

This means I have in total 16 GroupBy`s to compute.

In my code you can see the 4 Dimensions (Gender,Age,TotalChildren,ProductCategoryName) and the Measure TotalCost.

I have filter all my columns to drop any row that it is null.

After that I compute every GroupBy one by one and then I use coalesce() to bind the csv`s into one file.

All the process takes about 10 minutes which I think is too much.

Is there any way to enhance the process? Maybe by computing some groupby`s from others?

Also my data is about 5GB so if I read it 16 times as the number of groupby`s this means in total 80GB.

Here is my Code

import org.apache.spark.sql.SparkSession

import org.apache.spark.sql.functions.udf





object ComputeCube {



def main(args:Array[String]):Unit= {



val spark: SparkSession = SparkSession.builder()

  .master("local[*]")

  .appName("SparkProject2018")

  .getOrCreate()



import spark.implicits._



val filePath="src/main/resources/dataspark.txt"



val df = spark.read.options(Map("inferSchema"->"true","delimiter"->"|","header"->"true"))

  .csv(filePath).select("Gender", "BirthDate", "TotalCost", "TotalChildren", "ProductCategoryName")



val df2 = df

  .filter("Gender is not null")

  .filter("BirthDate is not null")

  .filter("TotalChildren is not null")

  .filter("ProductCategoryName is not null")



val currentDate = udf{ (dob: java.sql.Date) =>

  import java.time.{LocalDate, Period}

  Period.between(dob.toLocalDate, LocalDate.now).getYears

}



val df3 = df2.withColumn("Age", currentDate($"BirthDate"))





val groupByAll = df3.groupBy("Gender","Age", "TotalChildren", "ProductCategoryName" ).avg("TotalCost")



val groupByGenderAndAgeAndTotalChildren = df3.groupBy("Gender","Age", "TotalChildren").avg("TotalCost")



val groupByGenderAndAgeAndProductCategoryName = df3.groupBy("Gender","Age", "ProductCategoryName" ).avg("TotalCost")



val groupByGenderAndTotalChildrenAndProductCategoryName = df3.groupBy("Gender", "TotalChildren", "ProductCategoryName" ).avg("TotalCost")



val groupByAgeAndTotalChildrenAndProductCategoryName = df3.groupBy("Age", "TotalChildren", "ProductCategoryName" ).avg("TotalCost")



val groupByGenderAndAge = df3.groupBy("Gender","Age").avg("TotalCost")



val groupByGenderAndTotalChildren = df3.groupBy("Gender","TotalChildren").avg("TotalCost")



val groupByGenderAndProductCategoryName = df3.groupBy("Gender","ProductCategoryName" ).avg("TotalCost")



val groupByAgeAndTotalChildren = df3.groupBy("Age","TotalChildren").avg("TotalCost")



val groupByAgeAndProductCategoryName = df3.groupBy("Age","ProductCategoryName" ).avg("TotalCost")



val groupByTotalChildrenAndProductCategoryName = df3.groupBy("TotalChildren","ProductCategoryName" ).avg("TotalCost")



val groupByGender = df3.groupBy("Gender").avg("TotalCost")



val groupByAge = df3.groupBy("Age").avg("TotalCost")



val groupByTotalChildren = df3.groupBy("TotalChildren" ).avg("TotalCost")



val groupByProductCategoryName = df3.groupBy("ProductCategoryName" ).avg("TotalCost")



val groupByNone = df3.groupBy().avg("TotalCost")





groupByAll.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/All.csv")



groupByGenderAndAgeAndTotalChildren.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender_Age_TotalChildren.csv")



groupByGenderAndAgeAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender_Age_ProductCategoryName.csv")



groupByGenderAndTotalChildrenAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender_TotalChildren_ProductCategoryName.csv")



groupByAgeAndTotalChildrenAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Age_TotalChildren_ProductCategoryName.csv")



groupByGenderAndAge.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender_Age.csv")



groupByGenderAndTotalChildren.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender_TotalChildren.csv")



groupByGenderAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender_ProductCategoryName.csv")



groupByAgeAndTotalChildren.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Age_TotalChildren.csv")



groupByAgeAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Age_ProductCategoryName.csv")



groupByTotalChildrenAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/TotalChildren_ProductCategoryName.csv")



groupByGender.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender.csv")



groupByAge.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Age.csv")



groupByTotalChildren.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/TotalChildren.csv")



groupByProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/ProductCategoryName.csv")



groupByNone.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/None.csv")



  }

 }

edited Dec 28 '18 at 0:34

asked Dec 27 '18 at 16:13

giorgionasis

92210

add a comment |

My goal is to create a Cube of 4 Dimensions and 1 Measure.

This means I have in total 16 GroupBy`s to compute.

In my code you can see the 4 Dimensions (Gender,Age,TotalChildren,ProductCategoryName) and the Measure TotalCost.

I have filter all my columns to drop any row that it is null.

After that I compute every GroupBy one by one and then I use coalesce() to bind the csv`s into one file.

All the process takes about 10 minutes which I think is too much.

Is there any way to enhance the process? Maybe by computing some groupby`s from others?

Also my data is about 5GB so if I read it 16 times as the number of groupby`s this means in total 80GB.

Here is my Code

import org.apache.spark.sql.SparkSession

import org.apache.spark.sql.functions.udf





object ComputeCube {



def main(args:Array[String]):Unit= {



val spark: SparkSession = SparkSession.builder()

  .master("local[*]")

  .appName("SparkProject2018")

  .getOrCreate()



import spark.implicits._



val filePath="src/main/resources/dataspark.txt"



val df = spark.read.options(Map("inferSchema"->"true","delimiter"->"|","header"->"true"))

  .csv(filePath).select("Gender", "BirthDate", "TotalCost", "TotalChildren", "ProductCategoryName")



val df2 = df

  .filter("Gender is not null")

  .filter("BirthDate is not null")

  .filter("TotalChildren is not null")

  .filter("ProductCategoryName is not null")



val currentDate = udf{ (dob: java.sql.Date) =>

  import java.time.{LocalDate, Period}

  Period.between(dob.toLocalDate, LocalDate.now).getYears

}



val df3 = df2.withColumn("Age", currentDate($"BirthDate"))





val groupByAll = df3.groupBy("Gender","Age", "TotalChildren", "ProductCategoryName" ).avg("TotalCost")



val groupByGenderAndAgeAndTotalChildren = df3.groupBy("Gender","Age", "TotalChildren").avg("TotalCost")



val groupByGenderAndAgeAndProductCategoryName = df3.groupBy("Gender","Age", "ProductCategoryName" ).avg("TotalCost")



val groupByGenderAndTotalChildrenAndProductCategoryName = df3.groupBy("Gender", "TotalChildren", "ProductCategoryName" ).avg("TotalCost")



val groupByAgeAndTotalChildrenAndProductCategoryName = df3.groupBy("Age", "TotalChildren", "ProductCategoryName" ).avg("TotalCost")



val groupByGenderAndAge = df3.groupBy("Gender","Age").avg("TotalCost")



val groupByGenderAndTotalChildren = df3.groupBy("Gender","TotalChildren").avg("TotalCost")



val groupByGenderAndProductCategoryName = df3.groupBy("Gender","ProductCategoryName" ).avg("TotalCost")



val groupByAgeAndTotalChildren = df3.groupBy("Age","TotalChildren").avg("TotalCost")



val groupByAgeAndProductCategoryName = df3.groupBy("Age","ProductCategoryName" ).avg("TotalCost")



val groupByTotalChildrenAndProductCategoryName = df3.groupBy("TotalChildren","ProductCategoryName" ).avg("TotalCost")



val groupByGender = df3.groupBy("Gender").avg("TotalCost")



val groupByAge = df3.groupBy("Age").avg("TotalCost")



val groupByTotalChildren = df3.groupBy("TotalChildren" ).avg("TotalCost")



val groupByProductCategoryName = df3.groupBy("ProductCategoryName" ).avg("TotalCost")



val groupByNone = df3.groupBy().avg("TotalCost")





groupByAll.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/All.csv")



groupByGenderAndAgeAndTotalChildren.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender_Age_TotalChildren.csv")



groupByGenderAndAgeAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender_Age_ProductCategoryName.csv")



groupByGenderAndTotalChildrenAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender_TotalChildren_ProductCategoryName.csv")



groupByAgeAndTotalChildrenAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Age_TotalChildren_ProductCategoryName.csv")



groupByGenderAndAge.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender_Age.csv")



groupByGenderAndTotalChildren.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender_TotalChildren.csv")



groupByGenderAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender_ProductCategoryName.csv")



groupByAgeAndTotalChildren.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Age_TotalChildren.csv")



groupByAgeAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Age_ProductCategoryName.csv")



groupByTotalChildrenAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/TotalChildren_ProductCategoryName.csv")



groupByGender.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender.csv")



groupByAge.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Age.csv")



groupByTotalChildren.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/TotalChildren.csv")



groupByProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/ProductCategoryName.csv")



groupByNone.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/None.csv")



  }

 }

edited Dec 28 '18 at 0:34

asked Dec 27 '18 at 16:13

giorgionasis

92210

My goal is to create a Cube of 4 Dimensions and 1 Measure.

This means I have in total 16 GroupBy`s to compute.

In my code you can see the 4 Dimensions (Gender,Age,TotalChildren,ProductCategoryName) and the Measure TotalCost.

I have filter all my columns to drop any row that it is null.

After that I compute every GroupBy one by one and then I use coalesce() to bind the csv`s into one file.

All the process takes about 10 minutes which I think is too much.

Is there any way to enhance the process? Maybe by computing some groupby`s from others?

Also my data is about 5GB so if I read it 16 times as the number of groupby`s this means in total 80GB.

Here is my Code

import org.apache.spark.sql.SparkSession

import org.apache.spark.sql.functions.udf





object ComputeCube {



def main(args:Array[String]):Unit= {



val spark: SparkSession = SparkSession.builder()

  .master("local[*]")

  .appName("SparkProject2018")

  .getOrCreate()



import spark.implicits._



val filePath="src/main/resources/dataspark.txt"



val df = spark.read.options(Map("inferSchema"->"true","delimiter"->"|","header"->"true"))

  .csv(filePath).select("Gender", "BirthDate", "TotalCost", "TotalChildren", "ProductCategoryName")



val df2 = df

  .filter("Gender is not null")

  .filter("BirthDate is not null")

  .filter("TotalChildren is not null")

  .filter("ProductCategoryName is not null")



val currentDate = udf{ (dob: java.sql.Date) =>

  import java.time.{LocalDate, Period}

  Period.between(dob.toLocalDate, LocalDate.now).getYears

}



val df3 = df2.withColumn("Age", currentDate($"BirthDate"))





val groupByAll = df3.groupBy("Gender","Age", "TotalChildren", "ProductCategoryName" ).avg("TotalCost")



val groupByGenderAndAgeAndTotalChildren = df3.groupBy("Gender","Age", "TotalChildren").avg("TotalCost")



val groupByGenderAndAgeAndProductCategoryName = df3.groupBy("Gender","Age", "ProductCategoryName" ).avg("TotalCost")



val groupByGenderAndTotalChildrenAndProductCategoryName = df3.groupBy("Gender", "TotalChildren", "ProductCategoryName" ).avg("TotalCost")



val groupByAgeAndTotalChildrenAndProductCategoryName = df3.groupBy("Age", "TotalChildren", "ProductCategoryName" ).avg("TotalCost")



val groupByGenderAndAge = df3.groupBy("Gender","Age").avg("TotalCost")



val groupByGenderAndTotalChildren = df3.groupBy("Gender","TotalChildren").avg("TotalCost")



val groupByGenderAndProductCategoryName = df3.groupBy("Gender","ProductCategoryName" ).avg("TotalCost")



val groupByAgeAndTotalChildren = df3.groupBy("Age","TotalChildren").avg("TotalCost")



val groupByAgeAndProductCategoryName = df3.groupBy("Age","ProductCategoryName" ).avg("TotalCost")



val groupByTotalChildrenAndProductCategoryName = df3.groupBy("TotalChildren","ProductCategoryName" ).avg("TotalCost")



val groupByGender = df3.groupBy("Gender").avg("TotalCost")



val groupByAge = df3.groupBy("Age").avg("TotalCost")



val groupByTotalChildren = df3.groupBy("TotalChildren" ).avg("TotalCost")



val groupByProductCategoryName = df3.groupBy("ProductCategoryName" ).avg("TotalCost")



val groupByNone = df3.groupBy().avg("TotalCost")





groupByAll.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/All.csv")



groupByGenderAndAgeAndTotalChildren.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender_Age_TotalChildren.csv")



groupByGenderAndAgeAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender_Age_ProductCategoryName.csv")



groupByGenderAndTotalChildrenAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender_TotalChildren_ProductCategoryName.csv")



groupByAgeAndTotalChildrenAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Age_TotalChildren_ProductCategoryName.csv")



groupByGenderAndAge.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender_Age.csv")



groupByGenderAndTotalChildren.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender_TotalChildren.csv")



groupByGenderAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender_ProductCategoryName.csv")



groupByAgeAndTotalChildren.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Age_TotalChildren.csv")



groupByAgeAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Age_ProductCategoryName.csv")



groupByTotalChildrenAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/TotalChildren_ProductCategoryName.csv")



groupByGender.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Gender.csv")



groupByAge.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/Age.csv")



groupByTotalChildren.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/TotalChildren.csv")



groupByProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/ProductCategoryName.csv")



groupByNone.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")

  .mode("overwrite").save("src/main/resources/None.csv")



  }

 }

scala apache-spark intellij-idea apache-spark-sql

edited Dec 28 '18 at 0:34

asked Dec 27 '18 at 16:13

giorgionasis

92210

edited Dec 28 '18 at 0:34

asked Dec 27 '18 at 16:13

giorgionasis

92210

edited Dec 28 '18 at 0:34

asked Dec 27 '18 at 16:13

giorgionasis

92210

asked Dec 27 '18 at 16:13

giorgionasis

92210

asked Dec 27 '18 at 16:13

giorgionasis

92210

add a comment |

1 Answer
1

active

oldest

votes

import org.apache.spark.sql.SparkSession

import org.apache.spark.sql.functions._



object Test1 {

  case class SalesValue(Gender:String, BirthDate:String, TotalChildren:Int, ProductCategoryName:String, TotalCost: Int)

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()

    import spark.implicits._

    val seq = Seq(("boy", "1990-01-01", 2, "milk", 200), ("girl", "1990-01-01", 2, "milk", 250), ("boy", "1990-01-01", 2, "milk", 270))



    val ds = spark.sparkContext.parallelize(seq).map(r => SalesValue(r._1, r._2, r._3, r._4, r._5)).toDS()

    val agg = ds.groupBy("Gender", "BirthDate", "TotalChildren", "ProductCategoryName").agg(sum($"TotalCost").alias("sum"), count($"Gender").alias("count"))

    agg.cache()

    val result1 = agg.groupBy("Gender").agg(sum($"sum")/sum($"count")).show()

  }

}

edited Dec 28 '18 at 3:34

answered Dec 28 '18 at 3:17

user7460598

New contributor

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53947868%2fscala-spark-boost-groupby-computing-for-multiple-dimensions%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

import org.apache.spark.sql.SparkSession

import org.apache.spark.sql.functions._



object Test1 {

  case class SalesValue(Gender:String, BirthDate:String, TotalChildren:Int, ProductCategoryName:String, TotalCost: Int)

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()

    import spark.implicits._

    val seq = Seq(("boy", "1990-01-01", 2, "milk", 200), ("girl", "1990-01-01", 2, "milk", 250), ("boy", "1990-01-01", 2, "milk", 270))



    val ds = spark.sparkContext.parallelize(seq).map(r => SalesValue(r._1, r._2, r._3, r._4, r._5)).toDS()

    val agg = ds.groupBy("Gender", "BirthDate", "TotalChildren", "ProductCategoryName").agg(sum($"TotalCost").alias("sum"), count($"Gender").alias("count"))

    agg.cache()

    val result1 = agg.groupBy("Gender").agg(sum($"sum")/sum($"count")).show()

  }

}

edited Dec 28 '18 at 3:34

answered Dec 28 '18 at 3:17

user7460598

New contributor

add a comment |

import org.apache.spark.sql.SparkSession

import org.apache.spark.sql.functions._



object Test1 {

  case class SalesValue(Gender:String, BirthDate:String, TotalChildren:Int, ProductCategoryName:String, TotalCost: Int)

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()

    import spark.implicits._

    val seq = Seq(("boy", "1990-01-01", 2, "milk", 200), ("girl", "1990-01-01", 2, "milk", 250), ("boy", "1990-01-01", 2, "milk", 270))



    val ds = spark.sparkContext.parallelize(seq).map(r => SalesValue(r._1, r._2, r._3, r._4, r._5)).toDS()

    val agg = ds.groupBy("Gender", "BirthDate", "TotalChildren", "ProductCategoryName").agg(sum($"TotalCost").alias("sum"), count($"Gender").alias("count"))

    agg.cache()

    val result1 = agg.groupBy("Gender").agg(sum($"sum")/sum($"count")).show()

  }

}

edited Dec 28 '18 at 3:34

answered Dec 28 '18 at 3:17

user7460598

New contributor

add a comment |

import org.apache.spark.sql.SparkSession

import org.apache.spark.sql.functions._



object Test1 {

  case class SalesValue(Gender:String, BirthDate:String, TotalChildren:Int, ProductCategoryName:String, TotalCost: Int)

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()

    import spark.implicits._

    val seq = Seq(("boy", "1990-01-01", 2, "milk", 200), ("girl", "1990-01-01", 2, "milk", 250), ("boy", "1990-01-01", 2, "milk", 270))



    val ds = spark.sparkContext.parallelize(seq).map(r => SalesValue(r._1, r._2, r._3, r._4, r._5)).toDS()

    val agg = ds.groupBy("Gender", "BirthDate", "TotalChildren", "ProductCategoryName").agg(sum($"TotalCost").alias("sum"), count($"Gender").alias("count"))

    agg.cache()

    val result1 = agg.groupBy("Gender").agg(sum($"sum")/sum($"count")).show()

  }

}

edited Dec 28 '18 at 3:34

answered Dec 28 '18 at 3:17

user7460598

New contributor

import org.apache.spark.sql.SparkSession

import org.apache.spark.sql.functions._



object Test1 {

  case class SalesValue(Gender:String, BirthDate:String, TotalChildren:Int, ProductCategoryName:String, TotalCost: Int)

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder().appName("test").master("local[*]").getOrCreate()

    import spark.implicits._

    val seq = Seq(("boy", "1990-01-01", 2, "milk", 200), ("girl", "1990-01-01", 2, "milk", 250), ("boy", "1990-01-01", 2, "milk", 270))



    val ds = spark.sparkContext.parallelize(seq).map(r => SalesValue(r._1, r._2, r._3, r._4, r._5)).toDS()

    val agg = ds.groupBy("Gender", "BirthDate", "TotalChildren", "ProductCategoryName").agg(sum($"TotalCost").alias("sum"), count($"Gender").alias("count"))

    agg.cache()

    val result1 = agg.groupBy("Gender").agg(sum($"sum")/sum($"count")).show()

  }

}

edited Dec 28 '18 at 3:34

answered Dec 28 '18 at 3:17

user7460598

New contributor

edited Dec 28 '18 at 3:34

answered Dec 28 '18 at 3:17

user7460598

New contributor

answered Dec 28 '18 at 3:17

user7460598

answered Dec 28 '18 at 3:17

user7460598

New contributor

user7460598 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Bdtjtk