Unable to convert text format to parquet format through Spark
I am trying to insert one partition's data from a table stored as text into another table stored as parquet, using the Spark framework. The data is around 20 GB, and the configuration I am using is:
master = yarn
deploy-mode = client
driver memory = 3g
executor memory = 15g
num executors = 50
executor cores = 4
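For reference, these settings correspond to a spark-submit invocation along these lines (the --class value and the jar name below are placeholders for the actual application):

# The --class value and the jar name are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --driver-memory 3g \
  --executor-memory 15g \
  --num-executors 50 \
  --executor-cores 4 \
  --class com.example.TextToParquet \
  text-to-parquet.jar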
I am using the following piece of code to do it:
val df = spark.sql("select * from table1")
df.repartition(70).write.mode("append").format("parquet").insertInto("table2")
Every time I run this, after a certain number of tasks complete, the job fails with a Java heap space error.
Given the size of the data and the Spark configuration I have specified, I am not sure what I am missing that causes the job to fail. Any help would be greatly appreciated.
scala apache-spark hive
asked Dec 29 '18 at 12:56 by Mohit Raja
I'd suggest trying to set the -Xms and -Xmx JVM parameters as well
– slesh
Dec 29 '18 at 13:05
What is the number of files in the source?
– bdcloud
Dec 29 '18 at 13:18
@slesh Won't executor memory serve the same purpose?
– Mohit Raja
Dec 29 '18 at 13:26
@bdcloud It's 113.
– Mohit Raja
Dec 29 '18 at 13:27
One option is to increase driver memory. Also, please post the stack trace of the error you are encountering. When you say a few tasks are completed, do you mean some of the files were converted to parquet in the destination? Also, are the source files gzipped?
– bdcloud
Dec 29 '18 at 14:01
1 Answer
You have to set the JVM parameters through spark-submit or the configuration file rather than programmatically:
How to set Spark MemoryStore size when running in IntelliJ Scala Console?
Official info:
Spark properties mainly can be divided into two kinds: one is related to deploy, like “spark.driver.memory”, “spark.executor.instances”, this kind of properties may not be affected when setting programmatically through SparkConf in runtime, or the behavior is depending on which cluster manager and deploy mode you choose, so it would be suggested to set through configuration file or spark-submit command line options; another is mainly related to Spark runtime control, like “spark.task.maxFailures”, this kind of properties can be set in either way.
https://spark.apache.org/docs/latest/configuration.html
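To make the distinction concrete, here is a minimal Scala sketch (the app name is a placeholder; this illustrates the two kinds of properties rather than being a complete fix):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("TextToParquet")               // placeholder app name
  .config("spark.task.maxFailures", "8")  // runtime-control property: takes effect here
  // .config("spark.driver.memory", "8g") // deploy-time property: in client mode the
  //                                      // driver JVM is already running by this point,
  //                                      // so this is too late; pass --driver-memory
  //                                      // to spark-submit instead
  .getOrCreate()

Note that heap sizes in particular have to go through --driver-memory and --executor-memory on the spark-submit command line; Spark rejects -Xmx inside spark.executor.extraJavaOptions.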
answered Dec 31 '18 at 11:58 by Matko Soric