Unable to convert text format to Parquet format through Spark
I am trying to insert a partition's data from one table (text format) into another table (Parquet format) using the Spark framework. The data is around 20 GB, and the configuration I am using is:



master = yarn
deploy-mode = client
driver memory = 3g
executor memory = 15g
num executors = 50
executor cores = 4
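
For reference, these settings would translate into a spark-submit invocation roughly like the one below (the class name and application jar are placeholders):

# sketch of a spark-submit call matching the settings above;
# com.example.TextToParquet and text-to-parquet.jar are placeholder names
spark-submit \
  --master yarn \
  --deploy-mode client \
  --driver-memory 3g \
  --executor-memory 15g \
  --num-executors 50 \
  --executor-cores 4 \
  --class com.example.TextToParquet \
  text-to-parquet.jar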



I am using the following piece of code to do it:



val df = spark.sql("select * from table1")
df.repartition(70).write.mode("append").format("parquet").insertInto("table2")


Every time I try running this, after completing a certain number of tasks, the job fails with a Java heap space error.



Given the size of the data and the Spark configuration I have specified, I am not sure what I am missing here that is causing the job to fail. Any help would be greatly appreciated.
scala apache-spark hive
asked Dec 29 '18 at 12:56 by Mohit Raja
  • I'd suggest trying to set the -Xms and -Xmx JVM parameters as well. – slesh, Dec 29 '18 at 13:05

  • What is the number of files in the source? – bdcloud, Dec 29 '18 at 13:18

  • @slesh Won't executor memory serve the same purpose? – Mohit Raja, Dec 29 '18 at 13:26

  • @bdcloud It's 113. – Mohit Raja, Dec 29 '18 at 13:27

  • One option is to increase the driver memory. Please also post the stack trace of the error you are encountering. When you say a few tasks are completed, do you mean some of the files were converted to Parquet in the destination? Also, are the source files gzipped? – bdcloud, Dec 29 '18 at 14:01
1 Answer
You have to set JVM parameters:



How to set Spark MemoryStore size when running in IntelliJ Scala Console?



Official info:




Spark properties mainly can be divided into two kinds: one is related
to deploy, like “spark.driver.memory”, “spark.executor.instances”,
this kind of properties may not be affected when setting
programmatically through SparkConf in runtime, or the behavior is
depending on which cluster manager and deploy mode you choose, so it
would be suggested to set through configuration file or spark-submit
command line options; another is mainly related to Spark runtime
control, like “spark.task.maxFailures”, this kind of properties can be
set in either way.




https://spark.apache.org/docs/latest/configuration.html
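
Concretely, following the quoted guidance, deploy-related properties such as spark.driver.memory should go on the spark-submit command line (or into spark-defaults.conf) rather than being set through SparkConf at runtime. A minimal sketch, assuming the same job as above; the 8g driver memory is only an illustrative increase over the original 3g, not a verified fix:

# deploy-related properties passed at submit time, per the documentation quoted above;
# --driver-memory 8g is equivalent to --conf spark.driver.memory=8g (8g is illustrative)
spark-submit \
  --master yarn \
  --deploy-mode client \
  --driver-memory 8g \
  --executor-memory 15g \
  --num-executors 50 \
  --executor-cores 4 \
  --class com.example.TextToParquet \
  text-to-parquet.jar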
answered Dec 31 '18 at 11:58 by Matko Soric