Why is Java HashMap put not working in Spark Scala?
I have a sample Spark dataframe as follows:
val mydf1 = Seq((1, "a"), (2, "b"),(3, "c"),(4, "d"),(5, "e")).toDF("id", "col2")
scala> mydf1.show
+---+----+
| id|col2|
+---+----+
| 1| a|
| 2| b|
| 3| c|
| 4| d|
| 5| e|
+---+----+
I am trying to add the rows of the above DataFrame to a java.util.HashMap as follows:
import java.util._
val jmap = new java.util.HashMap[Integer, String]()
mydf1.rdd.foreach{case Row(id: Integer, col2: String) => jmap.put(id, col2)}
But after running the above code, I still don't see the id and col2 values added to the jmap HashMap:
scala> jmap.size
res13: Int = 0
Am I missing something in my implementation?
I know I could use Scala converters, but for some reason I don't want to use them.
scala apache-spark
asked Jan 1 at 15:38 – user3243499
You realize that the executors are filling the copy of the jmap that was sent to each of them in the closure, not the jmap that you defined in your driver? There is no way to update driver variables from an executor.
– RealSkeptic
Jan 1 at 15:41
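A minimal sketch of the point above, assuming the spark-shell session from the question (spark and mydf1 in scope): an accumulator is the supported way to send values from executors back to the driver, whereas mutations to a closure-captured HashMap stay on the executors.
// Sketch only: count rows via an accumulator; jmap.size would still be 0.
val rowCount = spark.sparkContext.longAccumulator("rowCount")
mydf1.rdd.foreach(_ => rowCount.add(1))
println(rowCount.value) // prints 5 on the driver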
My end goal is to create a Java HashMap and write it to disk. Can't I achieve that in any way?
– user3243499
Jan 1 at 15:43
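For reference, a minimal sketch of that end goal, assuming the HashMap has already been filled on the driver (the file path is illustrative): java.util.HashMap is Serializable, so plain Java object serialization can write it to local disk.
import java.io.{FileOutputStream, ObjectOutputStream}
// Sketch: write the driver-side HashMap to a local file.
val oos = new ObjectOutputStream(new FileOutputStream("/tmp/jmap.ser"))
try oos.writeObject(jmap) finally oos.close()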
I think you asked a similar question earlier, and I believe it is an XY problem. The question is why you think the correct solution for writing to disk is to serialize a Java HashMap. What is the problem you are actually trying to solve?
– RealSkeptic
Jan 1 at 15:46
Spark already provides plenty of methods for writing an RDD, DF, or DS to a distributed file system (like HDFS or S3) in multiple formats (like CSV, JSON, Parquet, ORC). Since Spark is intended for "big data", it does not make sense (from a conceptual point of view) to write to a local disk, because your data is supposed to be big enough not to fit on one machine. Now, if you are sure your data will fit on one machine, you can collect the DF first and then save the local Scala collection. But once again, you may want to consider whether you are really using Spark for what it is intended.
– Luis Miguel Mejía Suárez
Jan 1 at 15:52
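A short sketch of the Spark-native approach described in the comment above; the output path is illustrative.
// Write the DataFrame with Spark's own writer.
mydf1.write.mode("overwrite").parquet("hdfs:///tmp/mydf1_parquet")
// Or, if the data is known to fit on one machine, collect it to the driver first:
val localRows = mydf1.collect() // Array[Row] on the driver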
If the other environment also uses the same Spark cluster and the same filesystem (e.g. HDFS), you may want to just save it as Parquet and read it as a Dataset in the Java program. Or you may serialize each object individually (a stream of pairs). Also check why CSV is slowing you down. Is it parsing time? Do you read the entire CSV into memory?
– RealSkeptic
Jan 1 at 16:12
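A sketch of the read-back side of that suggestion, assuming the illustrative Parquet path from the sketch above (the same DataFrameReader API is available from Java):
// Read the saved Parquet back as a DataFrame.
val restored = spark.read.parquet("hdfs:///tmp/mydf1_parquet")
restored.show()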
1 Answer
An RDD is a distributed collection spread across the executors in the cluster, and foreach runs on those executor nodes. jmap, on the other hand, is a local collection object on the driver; it is serialized and sent to the individual executors (because it is referenced inside foreach), but it never comes back to the driver with the added values.
One way around this is to collect all the RDD values on the driver and add them to jmap there (though this is not advisable for large collections):
mydf1.rdd.collect().foreach { case Row(id: Integer, col2: String) => jmap.put(id, col2) }
answered Jan 1 at 16:45 – Sivasonai
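An alternative sketch along the same lines: map to key/value pairs first and collect them in one step. The same caveat about collecting large data to the driver applies, and no Scala-to-Java converters are needed (jmap2 is just an illustrative name).
import org.apache.spark.sql.Row
// Sketch: build the pairs on the executors, collect them, and fill the HashMap on the driver.
val jmap2 = new java.util.HashMap[Integer, String]()
mydf1.rdd
  .map { case Row(id: Integer, col2: String) => (id, col2) }
  .collectAsMap()
  .foreach { case (id, col2) => jmap2.put(id, col2) }
// jmap2.size is now 5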