How can one refresh the data in a Parquet table created through Amazon Athena?

I've created a Parquet-format table through Amazon Athena after allowing AWS Glue to create a CSV-format table on my behalf. That CSV table is what I base my CREATE TABLE AS SELECT statements on.
In the scenario that my data refreshes (e.g. I add/delete/update rows in the CSV table by uploading new CSVs and letting Glue crawl them), how can I propagate those changes to the Parquet tables?
From what I can tell, it seems to be a kill-and-fill scenario in which I have to drop the table and its associated partitions, then re-run my CTAS statement. The reason it's kill-and-fill is that Athena refuses to delete data from my S3 bucket.
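Concretely, the kill-and-fill cycle I'm doing looks roughly like this (table and bucket names here are placeholders, not my real ones):

```sql
-- 1. Drop the Parquet table (metadata only; the data stays in S3).
DROP TABLE IF EXISTS my_db.events_parquet;

-- 2. Delete the old Parquet objects outside Athena, e.g.:
--    aws s3 rm s3://my-bucket/my-tables/table1/parquet/ --recursive
--    (CTAS fails if the target location is not empty.)

-- 3. Re-create the Parquet table from the Glue-crawled CSV table.
CREATE TABLE my_db.events_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/my-tables/table1/parquet/'
) AS
SELECT * FROM my_db.events_csv;
```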
amazon-s3 parquet amazon-athena aws-glue
asked Dec 20 '18 at 17:26
Makoto
2 Answers
Athena doesn't perform updates or deletes; it only appends data.
You can use Hive on EMR and then use its upsert capability, or you can store the data in Redshift and use Redshift's merge approach.
Another approach is to add a column, iudflag, which takes the values I, U, and D for insert, update, and delete. You can then insert all records in Athena and use a SQL window function to pick out the latest version of each record, as sketched below.
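A minimal sketch of that window-function dedup, assuming hypothetical id and updated_at columns alongside the iudflag column (adjust names to your schema):

```sql
-- Keep only the most recent row per id, and drop rows whose
-- latest state is a delete.
SELECT *
FROM (
  SELECT
    t.*,
    ROW_NUMBER() OVER (
      PARTITION BY id
      ORDER BY updated_at DESC
    ) AS rn
  FROM my_db.events_with_flags t
) latest
WHERE rn = 1
  AND iudflag <> 'D';
```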
answered Dec 28 '18 at 5:54
Sandeep Fatangare
This is not an amazing solution, but it would work for you:
You can keep your Parquet table static and put the new updates into a different folder, something like:
- Parquet table: s3://my-bucket/my-tables/table1/parquet/
- CSV table: s3://my-bucket/my-tables/table1/csv/
Once you have that, you can create a second table that points to the CSV folder, so you will have two tables: the Parquet one and the CSV one. Then you can create a view that combines both tables; querying the view will read from both.
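A minimal sketch of such a view, assuming both tables share compatible schemas (table names here are placeholders):

```sql
CREATE OR REPLACE VIEW my_db.table1_combined AS
SELECT * FROM my_db.table1_parquet   -- static, already-converted data
UNION ALL
SELECT * FROM my_db.table1_csv;      -- fresh data crawled by Glue
```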
Of course, depending on your volume, this can cause a performance drop quite quickly. You will need a process that, from time to time, takes the content of the CSV table, converts it to Parquet, and moves it into the original Parquet table.
This is quite similar to a lambda architecture, where you keep your batch and speed layers separated.
As I said, this is not the best solution ever; there are better approaches, but I think this is the best one for you.
I hope it helps :)
answered Dec 27 '18 at 14:45
Roberto
This isn't going to work. My original thought was that this could automatically refresh itself. Supposing I add a new column, the view isn't going to automatically capture that.
– Makoto
Dec 27 '18 at 17:17