How can one refresh the data in a Parquet table created through Amazon Athena?

I've created a Parquet-format table through Amazon Athena after allowing AWS Glue to create a CSV-format table on my behalf. That CSV table is what I base my CREATE TABLE AS SELECT statements on.
In the scenario that my data refreshes (e.g. I add/delete/update rows in the CSV table by uploading new CSVs and letting Glue crawl them), how can I propagate those changes to the Parquet tables?
From what I can tell, it seems to be a kill-and-fill scenario in which I have to drop the table and its associated partitions, then re-run my CTAS statement. The reason it's kill-and-fill is that Athena refuses to delete data from my S3 bucket.
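Concretely, the kill-and-fill cycle I'm doing looks roughly like this (table and bucket names here are placeholders, not my real ones):

```sql
-- 1. Drop the Parquet table (metadata only; the data stays in S3).
DROP TABLE IF EXISTS my_db.events_parquet;

-- 2. Delete the old Parquet objects outside Athena, e.g.:
--    aws s3 rm s3://my-bucket/my-tables/table1/parquet/ --recursive
--    (CTAS fails if the target location is not empty.)

-- 3. Re-create the Parquet table from the Glue-crawled CSV table.
CREATE TABLE my_db.events_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/my-tables/table1/parquet/'
) AS
SELECT * FROM my_db.events_csv;
```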
amazon-s3 parquet amazon-athena aws-glue
asked Dec 20 '18 at 17:26
Makoto
2 Answers
Athena doesn't perform updates or deletes; it only appends data.
You can use Hive on EMR and then use its upsert capability, or you can store the data in Redshift and use Redshift's merge approach.
Another approach is to add a column, iudflag, which takes the values I, U, and D for insert, update, and delete. You can then insert all records in Athena and use a SQL window function to pick out the latest version of each record, as sketched below.
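A minimal sketch of that window-function dedup, assuming hypothetical id and updated_at columns alongside the iudflag column (adjust names to your schema):

```sql
-- Keep only the most recent row per id, and drop rows whose
-- latest state is a delete.
SELECT *
FROM (
  SELECT
    t.*,
    ROW_NUMBER() OVER (
      PARTITION BY id
      ORDER BY updated_at DESC
    ) AS rn
  FROM my_db.events_with_flags t
) latest
WHERE rn = 1
  AND iudflag <> 'D';
```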
answered Dec 28 '18 at 5:54
Sandeep Fatangare
This is not an amazing solution, but it would work for you:
You can keep your Parquet table static and put the new updates into a different folder, something like:
- Parquet table: s3://my-bucket/my-tables/table1/parquet/
- CSV table: s3://my-bucket/my-tables/table1/csv/
Once you have that, you can create a second table that points to the CSV folder, so you will have two tables: the Parquet one and the CSV one. Then you can create a view that combines both tables; querying the view will read from both.
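A minimal sketch of such a view, assuming both tables share compatible schemas (table names here are placeholders):

```sql
CREATE OR REPLACE VIEW my_db.table1_combined AS
SELECT * FROM my_db.table1_parquet   -- static, already-converted data
UNION ALL
SELECT * FROM my_db.table1_csv;      -- fresh data crawled by Glue
```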
Of course, depending on your volume, this can cause a performance drop quite quickly. You will need a process that, from time to time, takes the content of the CSV table, converts it to Parquet, and moves it into the original Parquet table.
This is quite similar to a lambda architecture, where you keep your batch and speed layers separated.
As I said, this is not the best solution ever; there are better approaches, but I think this is the best one for you.
I hope it helps :)
answered Dec 27 '18 at 14:45
Roberto
This isn't going to work. My original thought was that this could automatically refresh itself. Supposing I add a new column, the view isn't going to automatically capture that.
– Makoto
Dec 27 '18 at 17:17