Unable to convert text format to Parquet format through Spark
I am trying to insert a partition's data from one table (text format) into another table (Parquet format) using the Spark framework. The data is around 20 GB, and the configuration I am using is:



master = yarn
deploy-mode = client
driver memory = 3g
executor memory = 15g
num executors = 50
executor cores = 4
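
For reference, these settings would translate into a spark-submit invocation roughly like the one below (the class name and application jar are placeholders):

# sketch of a spark-submit call matching the settings above;
# com.example.TextToParquet and text-to-parquet.jar are placeholder names
spark-submit \
  --master yarn \
  --deploy-mode client \
  --driver-memory 3g \
  --executor-memory 15g \
  --num-executors 50 \
  --executor-cores 4 \
  --class com.example.TextToParquet \
  text-to-parquet.jar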



I am using the following piece of code to do it:



val df = spark.sql("select * from table1")
df.repartition(70).write.mode("append").format("parquet").insertInto("table2")


Every time I try running this, after completing a certain number of tasks, the job fails with a Java heap space error.



Given the size of the data and the Spark configuration I have specified, I am not sure what I am missing here that is causing the job to fail. Any help would be greatly appreciated.
scala apache-spark hive
asked Dec 29 '18 at 12:56 by Mohit Raja
  • I'd suggest trying to set the -Xms and -Xmx JVM parameters as well. – slesh, Dec 29 '18 at 13:05

  • What is the number of files in the source? – bdcloud, Dec 29 '18 at 13:18

  • @slesh Won't executor memory serve the same purpose? – Mohit Raja, Dec 29 '18 at 13:26

  • @bdcloud It's 113. – Mohit Raja, Dec 29 '18 at 13:27

  • One option is to increase the driver memory. Please also post the stack trace of the error you are encountering. When you say a few tasks are completed, do you mean some of the files were converted to Parquet in the destination? Also, are the source files gzipped? – bdcloud, Dec 29 '18 at 14:01
1 Answer
You have to set JVM parameters:



How to set Spark MemoryStore size when running in IntelliJ Scala Console?



Official info:




Spark properties mainly can be divided into two kinds: one is related
to deploy, like “spark.driver.memory”, “spark.executor.instances”,
this kind of properties may not be affected when setting
programmatically through SparkConf in runtime, or the behavior is
depending on which cluster manager and deploy mode you choose, so it
would be suggested to set through configuration file or spark-submit
command line options; another is mainly related to Spark runtime
control, like “spark.task.maxFailures”, this kind of properties can be
set in either way.




https://spark.apache.org/docs/latest/configuration.html
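
Concretely, following the quoted guidance, deploy-related properties such as spark.driver.memory should go on the spark-submit command line (or into spark-defaults.conf) rather than being set through SparkConf at runtime. A minimal sketch, assuming the same job as above; the 8g driver memory is only an illustrative increase over the original 3g, not a verified fix:

# deploy-related properties passed at submit time, per the documentation quoted above;
# --driver-memory 8g is equivalent to --conf spark.driver.memory=8g (8g is illustrative)
spark-submit \
  --master yarn \
  --deploy-mode client \
  --driver-memory 8g \
  --executor-memory 15g \
  --num-executors 50 \
  --executor-cores 4 \
  --class com.example.TextToParquet \
  text-to-parquet.jar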
answered Dec 31 '18 at 11:58 by Matko Soric