Why is Java HashMap put not working in Spark Scala?

I have a sample Spark DataFrame as follows:

val mydf1 = Seq((1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")).toDF("id", "col2")

scala> mydf1.show
+---+----+
| id|col2|
+---+----+
|  1|   a|
|  2|   b|
|  3|   c|
|  4|   d|
|  5|   e|
+---+----+

I am trying to add the rows of the above DataFrame to a java.util.HashMap as follows:

import java.util._
val jmap = new java.util.HashMap[Integer, String]()

mydf1.rdd.foreach { case Row(id: Integer, col2: String) => jmap.put(id, col2) }

But after running the above code, the ids and col2 values still do not get added to the jmap HashMap:

scala> jmap.size
res13: Int = 0

Am I missing something in my implementation?

I know I could use the Scala converters, but for some reason I don't want to use them.

Tags: scala apache-spark
asked Jan 1 at 15:38 by user3243499

  • You realize that the executors are filling the copy of the jmap that was sent to each of them in the closure, not the jmap that you defined in your driver? There is no way to update driver variables from an executor. [A driver-side sketch follows this list.]
    – RealSkeptic, Jan 1 at 15:41

  • My end goal is to create a Java HashMap and write it to disk. Can't I achieve that in any way?
    – user3243499, Jan 1 at 15:43

  • I think you asked a similar question earlier, and I believe it is an XY problem. The question is why you think the correct solution for writing to disk is to serialize a Java HashMap. What is the problem you are actually trying to solve?
    – RealSkeptic, Jan 1 at 15:46

  • Spark already provides many methods for writing an RDD/DataFrame/Dataset to a distributed file system (such as HDFS or S3) in multiple formats (CSV, JSON, Parquet, ORC). Since Spark is intended for "big data", it does not make sense, conceptually, to write to a local disk, because your data is supposed to be too big to fit on one machine. If you are sure your data will fit on one machine, you can collect the DataFrame first and then save the local Scala collection; but in that case you might consider whether you are really using Spark for what it is intended. [See the Parquet sketch after this list.]
    – Luis Miguel Mejía Suárez, Jan 1 at 15:52

  • If the other environment also uses the same Spark cluster and the same filesystem (e.g. HDFS), you may want to just save it as Parquet and read it as a Dataset in the Java program. Or you may serialize each object individually (a stream of pairs). Also check why CSV is slowing you down. Is it parsing time? Do you read the entire CSV into memory?
    – RealSkeptic, Jan 1 at 16:12
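
Following up on the first comment above: since each executor only mutates its own copy of jmap, the map has to be populated on the driver after the data has been brought back. A minimal sketch of that idea, assuming the mydf1 from the question is still in scope (e.g. the same spark-shell session):

import java.util.{HashMap => JHashMap}
import org.apache.spark.sql.Row

// Build the map on the driver: bring the (small) data back first,
// then populate a plain java.util.HashMap locally.
val jmap = new JHashMap[Integer, String]()

mydf1.rdd
  .map { case Row(id: Integer, col2: String) => (id, col2) }
  .collect()                                   // ships the rows to the driver
  .foreach { case (id, col2) => jmap.put(id, col2) }

Note that this pulls everything into driver memory, which is exactly what the later comments warn about for anything larger than a small lookup table.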
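
And following up on the comments about Spark's built-in writers: if the goal is simply to persist the data, letting Spark write it out (for example as Parquet) avoids the driver-side map entirely. A sketch, with an illustrative output path:

// Write the DataFrame with Spark's built-in writer instead of hand-rolling a HashMap.
// The path here is only an example; point it at your HDFS/S3/local location.
mydf1.write
  .mode("overwrite")
  .parquet("/tmp/mydf1.parquet")

// Later (possibly from a separate Spark or Java program) read it back:
val restored = spark.read.parquet("/tmp/mydf1.parquet")
restored.show()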
















1 Answer
An RDD is a distributed collection spread across the executors in the cluster, and foreach runs on those executor nodes. jmap, on the other hand, is a local collection object: a copy of it is shipped to each executor (because it is referenced inside the foreach closure), but that copy never comes back to the driver with the added values.

One way to do this is to collect all the RDD values on the driver and add them to jmap there (not advisable for a large collection, though):

mydf1.rdd.collect().foreach { case Row(id: Integer, col2: String) => jmap.put(id, col2) }

– Sivasonai, answered Jan 1 at 16:45
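
If the end goal from the comments (writing the map to disk) still stands, the jmap populated above can then be written out with plain Java serialization. This is only a sketch; the file path is illustrative and, as noted, the whole approach assumes the data fits comfortably on the driver:

import java.io.{FileOutputStream, ObjectOutputStream}

// Serialize the driver-side map to a local file; java.util.HashMap is Serializable.
val oos = new ObjectOutputStream(new FileOutputStream("/tmp/jmap.ser"))
try {
  oos.writeObject(jmap)
} finally {
  oos.close()
}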




