Why is Java HashMap put not working in Spark Scala?

I have a sample Spark DataFrame as follows:

val mydf1 = Seq((1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")).toDF("id", "col2")

scala> mydf1.show
+---+----+
| id|col2|
+---+----+
|  1|   a|
|  2|   b|
|  3|   c|
|  4|   d|
|  5|   e|
+---+----+

I am trying to add the rows of the above DataFrame to a java.util.HashMap as follows:

import java.util._
val jmap = new java.util.HashMap[Integer, String]()

mydf1.rdd.foreach { case Row(id: Integer, col2: String) => jmap.put(id, col2) }

But after running the above code, the ids and col2 values still do not get added to the jmap HashMap:

scala> jmap.size
res13: Int = 0

Am I missing something in my implementation?

I know I could use the Scala converters, but for some reason I don't want to use them.

Tags: scala apache-spark
asked Jan 1 at 15:38 by user3243499

  • You realize that the executors are filling the copy of the jmap that was sent to each of them in the closure, not the jmap that you defined in your driver? There is no way to update driver variables from an executor. [A driver-side sketch follows this list.]
    – RealSkeptic, Jan 1 at 15:41

  • My end goal is to create a Java HashMap and write it to disk. Can't I achieve that in any way?
    – user3243499, Jan 1 at 15:43

  • I think you asked a similar question earlier, and I believe it is an XY problem. The question is why you think the correct solution for writing to disk is to serialize a Java HashMap. What is the problem you are actually trying to solve?
    – RealSkeptic, Jan 1 at 15:46

  • Spark already provides many methods for writing an RDD/DataFrame/Dataset to a distributed file system (such as HDFS or S3) in multiple formats (CSV, JSON, Parquet, ORC). Since Spark is intended for "big data", it does not make sense, conceptually, to write to a local disk, because your data is supposed to be too big to fit on one machine. If you are sure your data will fit on one machine, you can collect the DataFrame first and then save the local Scala collection; but in that case you might consider whether you are really using Spark for what it is intended. [See the Parquet sketch after this list.]
    – Luis Miguel Mejía Suárez, Jan 1 at 15:52

  • If the other environment also uses the same Spark cluster and the same filesystem (e.g. HDFS), you may want to just save it as Parquet and read it as a Dataset in the Java program. Or you may serialize each object individually (a stream of pairs). Also check why CSV is slowing you down. Is it parsing time? Do you read the entire CSV into memory?
    – RealSkeptic, Jan 1 at 16:12
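
Following up on the first comment above: since each executor only mutates its own copy of jmap, the map has to be populated on the driver after the data has been brought back. A minimal sketch of that idea, assuming the mydf1 from the question is still in scope (e.g. the same spark-shell session):

import java.util.{HashMap => JHashMap}
import org.apache.spark.sql.Row

// Build the map on the driver: bring the (small) data back first,
// then populate a plain java.util.HashMap locally.
val jmap = new JHashMap[Integer, String]()

mydf1.rdd
  .map { case Row(id: Integer, col2: String) => (id, col2) }
  .collect()                                   // ships the rows to the driver
  .foreach { case (id, col2) => jmap.put(id, col2) }

Note that this pulls everything into driver memory, which is exactly what the later comments warn about for anything larger than a small lookup table.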
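
And following up on the comments about Spark's built-in writers: if the goal is simply to persist the data, letting Spark write it out (for example as Parquet) avoids the driver-side map entirely. A sketch, with an illustrative output path:

// Write the DataFrame with Spark's built-in writer instead of hand-rolling a HashMap.
// The path here is only an example; point it at your HDFS/S3/local location.
mydf1.write
  .mode("overwrite")
  .parquet("/tmp/mydf1.parquet")

// Later (possibly from a separate Spark or Java program) read it back:
val restored = spark.read.parquet("/tmp/mydf1.parquet")
restored.show()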
















1 Answer
An RDD is a distributed collection spread across the executors in the cluster, and foreach runs on those executor nodes. jmap, on the other hand, is a local collection object: a copy of it is shipped to each executor (because it is referenced inside the foreach closure), but that copy never comes back to the driver with the added values.

One way to do this is to collect all the RDD values on the driver and add them to jmap there (not advisable for a large collection, though):

mydf1.rdd.collect().foreach { case Row(id: Integer, col2: String) => jmap.put(id, col2) }

– Sivasonai, answered Jan 1 at 16:45
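
If the end goal from the comments (writing the map to disk) still stands, the jmap populated above can then be written out with plain Java serialization. This is only a sketch; the file path is illustrative and, as noted, the whole approach assumes the data fits comfortably on the driver:

import java.io.{FileOutputStream, ObjectOutputStream}

// Serialize the driver-side map to a local file; java.util.HashMap is Serializable.
val oos = new ObjectOutputStream(new FileOutputStream("/tmp/jmap.ser"))
try {
  oos.writeObject(jmap)
} finally {
  oos.close()
}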




