How can one refresh the data in a Parquet table created through Amazon Athena?

I've created a Parquet-format table through Amazon Athena after letting AWS Glue create a CSV-format table on my behalf. That CSV table is what I base my CREATE TABLE AS SELECT (CTAS) statements on.

When the data refreshes (e.g. I add/delete/update rows in the CSV table by uploading new CSVs and letting Glue crawl them), how can I propagate those changes to the Parquet tables?

From what I can tell, it's a kill-and-fill scenario: I have to drop the table and its associated partitions, then re-run my CTAS statement. It's kill-and-fill because Athena refuses to delete data from my S3 bucket.
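
For concreteness, a minimal sketch of that kill-and-fill cycle; the database, table, and bucket names below are made up:

    -- Hypothetical names throughout: my_db, sales_csv, sales_parquet, and the bucket paths.

    -- 1. Drop the Parquet table's metadata (Athena does not delete the S3 objects).
    DROP TABLE IF EXISTS my_db.sales_parquet;

    -- 2. Remove the old Parquet files outside Athena, e.g. with the AWS CLI:
    --      aws s3 rm s3://my-bucket/tables/sales_parquet/ --recursive
    --    (CTAS requires its external_location to be empty.)

    -- 3. Re-run the CTAS statement against the Glue-crawled CSV table.
    CREATE TABLE my_db.sales_parquet
    WITH (
        format = 'PARQUET',
        external_location = 's3://my-bucket/tables/sales_parquet/'
    ) AS
    SELECT *
    FROM my_db.sales_csv;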










amazon-s3 parquet amazon-athena aws-glue






asked Dec 20 '18 at 17:26 by Makoto
























2 Answers
































Athena doesn't perform updates or deletes; it only appends data.

You can use Hive on EMR and its upsert support, or store the data in Redshift and use Redshift's merge approach.

Another approach is to add a column, say iud_flag, that takes the values I, U, and D for insert, update, and delete. You can then insert all records in Athena and use a SQL window function to pick the most recently updated record for each key.
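
A rough sketch of that last approach, assuming a hypothetical append-only table my_db.orders_append_only with an order_id key, an updated_at timestamp, and the iud_flag column described above:

    -- Every change is appended as a new row tagged 'I', 'U', or 'D' in iud_flag.
    -- A window function keeps the latest row per key and filters out deletes.
    SELECT order_id,
           customer_id,
           amount
    FROM (
        SELECT *,
               row_number() OVER (
                   PARTITION BY order_id
                   ORDER BY updated_at DESC
               ) AS rn
        FROM my_db.orders_append_only
    ) latest
    WHERE rn = 1
      AND iud_flag <> 'D';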






answered Dec 28 '18 at 5:54 by Sandeep Fatangare











































This is not an amazing solution, but it would work for you: you can keep your Parquet table static and put the new updates into a different folder. Something like:

• Parquet table: s3://my-bucket/my-tables/table1/parquet/

• CSV table: s3://my-bucket/my-tables/table1/csv/

Once you have that, you can create a second table that points to the CSV folder, so you have two tables: the Parquet one and the CSV one. Then you can create a view that combines both tables; querying the view will read from both of them.
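
One way to build that view is with UNION ALL, assuming the two tables share the same columns (all names below are placeholders):

    -- Hypothetical view over the static Parquet table plus the fresh CSV table.
    CREATE OR REPLACE VIEW my_db.table1_combined AS
    SELECT id, col_a, col_b FROM my_db.table1_parquet
    UNION ALL
    SELECT id, col_a, col_b FROM my_db.table1_csv;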



Of course, depending on your volume, this can cause a performance drop fairly quickly. You will need a process that periodically takes the content of the CSV table, converts it into Parquet, and moves it into the original Parquet table.
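
A hedged sketch of that periodic step, assuming a newer Athena engine where INSERT INTO is available (it was not at the time of this answer) and the same placeholder names as above:

    -- Append the CSV table's current contents to the Parquet table; Athena
    -- writes the new files in the destination table's storage format (Parquet).
    INSERT INTO my_db.table1_parquet
    SELECT id, col_a, col_b
    FROM my_db.table1_csv;

    -- Afterwards, clear the CSV prefix out-of-band so the same rows aren't appended twice:
    --   aws s3 rm s3://my-bucket/my-tables/table1/csv/ --recursive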



This is quite similar to a lambda architecture, where you keep your batch and speed layers separate.

As I said, this is not the best solution ever; there are better approaches, but I think this is the best one for you.

I hope it helps :)






answered Dec 27 '18 at 14:45 by Roberto





















• This isn't going to work. My original thought was that this could automatically refresh itself. Supposing I add a new column, the view isn't going to automatically capture that. – Makoto, Dec 27 '18 at 17:17










