R database table pull encoding issue












0















I have a MySQL database containing a table of names. Once I pull them into R I run into some encoding problems which I don't understand but seem to be all related to the way R interpreted the file.



con <- dbConnect(MySQL(),
host = 'localhost',
user = 'root',
dbname='test',
rstudioapi::askForPassword('Database password')
)


and pull the names table from my MySQL db with



df <- dbReadTable(con, 'names')


My table looks something like



df <- structure(list(id = c(373863, 17731249), name = c("AARESTRUP, JAN CARO̸E", 
"BÃœTTNER, CHRISTIAN")), .Names = c("id", "name"), class = "data.frame", row.names = c(NA,
-2L))


df
id name
1 373863 AARESTRUP, JAN CARO̸E
2 17731249 BÃœTTNER, CHRISTIAN


Checking the encoding brings



Encoding(df$name)
[1] "unknown" "unknown"


Using Encoding(df$name) <- "UTF-8" fixes the second name but turns parts of name one into a non UTF-8 character.



df
id name
1 373863 AARESTRUP, JAN CARO<U+0338>E
2 17731249 BÜTTNER, CHRISTIAN


Now it gets strange as df$name[1] returns
[1] "AARESTRUP, JAN CARO̸E" which is exactly how it is stored in the MySQL db while it seems like its differently interpreted when I call the whole df.



First, is there a way to ensure the dbReadTable() reads in automatically UTF-8. I found link which does not seem to change anything in terms of my table encoding.



Second, what is going on with the difference in the interpretation of the non UTF-8 encoding depending on how you access the name (whole df vs. single element)










share|improve this question



























    0















    I have a MySQL database containing a table of names. Once I pull them into R I run into some encoding problems which I don't understand but seem to be all related to the way R interpreted the file.



    con <- dbConnect(MySQL(),
    host = 'localhost',
    user = 'root',
    dbname='test',
    rstudioapi::askForPassword('Database password')
    )


    and pull the names table from my MySQL db with



    df <- dbReadTable(con, 'names')


    My table looks something like



    df <- structure(list(id = c(373863, 17731249), name = c("AARESTRUP, JAN CARO̸E", 
    "BÃœTTNER, CHRISTIAN")), .Names = c("id", "name"), class = "data.frame", row.names = c(NA,
    -2L))


    df
    id name
    1 373863 AARESTRUP, JAN CARO̸E
    2 17731249 BÃœTTNER, CHRISTIAN


    Checking the encoding brings



    Encoding(df$name)
    [1] "unknown" "unknown"


    Using Encoding(df$name) <- "UTF-8" fixes the second name but turns parts of name one into a non UTF-8 character.



    df
    id name
    1 373863 AARESTRUP, JAN CARO<U+0338>E
    2 17731249 BÜTTNER, CHRISTIAN


    Now it gets strange as df$name[1] returns
    [1] "AARESTRUP, JAN CARO̸E" which is exactly how it is stored in the MySQL db while it seems like its differently interpreted when I call the whole df.



    First, is there a way to ensure the dbReadTable() reads in automatically UTF-8. I found link which does not seem to change anything in terms of my table encoding.



    Second, what is going on with the difference in the interpretation of the non UTF-8 encoding depending on how you access the name (whole df vs. single element)










    share|improve this question

























      0












      0








      0








      I have a MySQL database containing a table of names. Once I pull them into R I run into some encoding problems which I don't understand but seem to be all related to the way R interpreted the file.



      con <- dbConnect(MySQL(),
      host = 'localhost',
      user = 'root',
      dbname='test',
      rstudioapi::askForPassword('Database password')
      )


      and pull the names table from my MySQL db with



      df <- dbReadTable(con, 'names')


      My table looks something like



      df <- structure(list(id = c(373863, 17731249), name = c("AARESTRUP, JAN CARO̸E", 
      "BÃœTTNER, CHRISTIAN")), .Names = c("id", "name"), class = "data.frame", row.names = c(NA,
      -2L))


      df
      id name
      1 373863 AARESTRUP, JAN CARO̸E
      2 17731249 BÃœTTNER, CHRISTIAN


      Checking the encoding brings



      Encoding(df$name)
      [1] "unknown" "unknown"


      Using Encoding(df$name) <- "UTF-8" fixes the second name but turns parts of name one into a non UTF-8 character.



      df
      id name
      1 373863 AARESTRUP, JAN CARO<U+0338>E
      2 17731249 BÜTTNER, CHRISTIAN


      Now it gets strange as df$name[1] returns
      [1] "AARESTRUP, JAN CARO̸E" which is exactly how it is stored in the MySQL db while it seems like its differently interpreted when I call the whole df.



      First, is there a way to ensure the dbReadTable() reads in automatically UTF-8. I found link which does not seem to change anything in terms of my table encoding.



      Second, what is going on with the difference in the interpretation of the non UTF-8 encoding depending on how you access the name (whole df vs. single element)










      share|improve this question














      I have a MySQL database containing a table of names. Once I pull them into R I run into some encoding problems which I don't understand but seem to be all related to the way R interpreted the file.



      con <- dbConnect(MySQL(),
      host = 'localhost',
      user = 'root',
      dbname='test',
      rstudioapi::askForPassword('Database password')
      )


      and pull the names table from my MySQL db with



      df <- dbReadTable(con, 'names')


      My table looks something like



      df <- structure(list(id = c(373863, 17731249), name = c("AARESTRUP, JAN CARO̸E", 
      "BÃœTTNER, CHRISTIAN")), .Names = c("id", "name"), class = "data.frame", row.names = c(NA,
      -2L))


      df
      id name
      1 373863 AARESTRUP, JAN CARO̸E
      2 17731249 BÃœTTNER, CHRISTIAN


      Checking the encoding brings



      Encoding(df$name)
      [1] "unknown" "unknown"


      Using Encoding(df$name) <- "UTF-8" fixes the second name but turns parts of name one into a non UTF-8 character.



      df
      id name
      1 373863 AARESTRUP, JAN CARO<U+0338>E
      2 17731249 BÜTTNER, CHRISTIAN


      Now it gets strange as df$name[1] returns
      [1] "AARESTRUP, JAN CARO̸E" which is exactly how it is stored in the MySQL db while it seems like its differently interpreted when I call the whole df.



      First, is there a way to ensure the dbReadTable() reads in automatically UTF-8. I found link which does not seem to change anything in terms of my table encoding.



      Second, what is going on with the difference in the interpretation of the non UTF-8 encoding depending on how you access the name (whole df vs. single element)







      r encoding dbplyr






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked Jan 3 at 18:28









      CERCER

      441516




      441516
























          1 Answer
          1






          active

          oldest

          votes


















          1














          <U+0338> is a UTF-8 character. But it's an overlay character that when rendered occupies the same space as the O that preceeds it. For some reason, it's displaying it incorrectly (although I checked and in R it displays correctly, so it seems to be something to do with StackOverflow, or perhaps Firefox). You can use



          apply(df, MARGIN = 2, FUN = function(x) if( typeof(x) == 'character' ) Encoding(x)<-'UTF-8')


          on all of your database result tables, which will convert encoding to UTF-8 for all of the character columns.



          I think that the difference in behavior between when you call df$name[1] and when you call df has to do with the different print/show methods for character type objects vs data.frames.



          df wants to show you the individual characters, meaning it doesn't want to render the characters in a way that disguises the fact that the E with a slash is in fact two characters. It also doesn't want to display the character as a spacing character, since it would then look like a backslash. Showing the unicode for the character is the best way to let the user know it is there.



          The show method for character type objects wants to render the characters for you (most of the time).






          share|improve this answer


























            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54027855%2fr-database-table-pull-encoding-issue%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            1 Answer
            1






            active

            oldest

            votes








            1 Answer
            1






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            1














            <U+0338> is a UTF-8 character. But it's an overlay character that when rendered occupies the same space as the O that preceeds it. For some reason, it's displaying it incorrectly (although I checked and in R it displays correctly, so it seems to be something to do with StackOverflow, or perhaps Firefox). You can use



            apply(df, MARGIN = 2, FUN = function(x) if( typeof(x) == 'character' ) Encoding(x)<-'UTF-8')


            on all of your database result tables, which will convert encoding to UTF-8 for all of the character columns.



            I think that the difference in behavior between when you call df$name[1] and when you call df has to do with the different print/show methods for character type objects vs data.frames.



            df wants to show you the individual characters, meaning it doesn't want to render the characters in a way that disguises the fact that the E with a slash is in fact two characters. It also doesn't want to display the character as a spacing character, since it would then look like a backslash. Showing the unicode for the character is the best way to let the user know it is there.



            The show method for character type objects wants to render the characters for you (most of the time).






            share|improve this answer






























              1














              <U+0338> is a UTF-8 character. But it's an overlay character that when rendered occupies the same space as the O that preceeds it. For some reason, it's displaying it incorrectly (although I checked and in R it displays correctly, so it seems to be something to do with StackOverflow, or perhaps Firefox). You can use



              apply(df, MARGIN = 2, FUN = function(x) if( typeof(x) == 'character' ) Encoding(x)<-'UTF-8')


              on all of your database result tables, which will convert encoding to UTF-8 for all of the character columns.



              I think that the difference in behavior between when you call df$name[1] and when you call df has to do with the different print/show methods for character type objects vs data.frames.



              df wants to show you the individual characters, meaning it doesn't want to render the characters in a way that disguises the fact that the E with a slash is in fact two characters. It also doesn't want to display the character as a spacing character, since it would then look like a backslash. Showing the unicode for the character is the best way to let the user know it is there.



              The show method for character type objects wants to render the characters for you (most of the time).






              share|improve this answer




























                1












                1








                1







                <U+0338> is a UTF-8 character. But it's an overlay character that when rendered occupies the same space as the O that preceeds it. For some reason, it's displaying it incorrectly (although I checked and in R it displays correctly, so it seems to be something to do with StackOverflow, or perhaps Firefox). You can use



                apply(df, MARGIN = 2, FUN = function(x) if( typeof(x) == 'character' ) Encoding(x)<-'UTF-8')


                on all of your database result tables, which will convert encoding to UTF-8 for all of the character columns.



                I think that the difference in behavior between when you call df$name[1] and when you call df has to do with the different print/show methods for character type objects vs data.frames.



                df wants to show you the individual characters, meaning it doesn't want to render the characters in a way that disguises the fact that the E with a slash is in fact two characters. It also doesn't want to display the character as a spacing character, since it would then look like a backslash. Showing the unicode for the character is the best way to let the user know it is there.



                The show method for character type objects wants to render the characters for you (most of the time).






                share|improve this answer















                <U+0338> is a UTF-8 character. But it's an overlay character that when rendered occupies the same space as the O that preceeds it. For some reason, it's displaying it incorrectly (although I checked and in R it displays correctly, so it seems to be something to do with StackOverflow, or perhaps Firefox). You can use



                apply(df, MARGIN = 2, FUN = function(x) if( typeof(x) == 'character' ) Encoding(x)<-'UTF-8')


                on all of your database result tables, which will convert encoding to UTF-8 for all of the character columns.



                I think that the difference in behavior between when you call df$name[1] and when you call df has to do with the different print/show methods for character type objects vs data.frames.



                df wants to show you the individual characters, meaning it doesn't want to render the characters in a way that disguises the fact that the E with a slash is in fact two characters. It also doesn't want to display the character as a spacing character, since it would then look like a backslash. Showing the unicode for the character is the best way to let the user know it is there.



                The show method for character type objects wants to render the characters for you (most of the time).







                share|improve this answer














                share|improve this answer



                share|improve this answer








                edited Jan 4 at 16:11

























                answered Jan 3 at 19:39









                AColemanAColeman

                112




                112
































                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Stack Overflow!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54027855%2fr-database-table-pull-encoding-issue%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    Monofisismo

                    Angular Downloading a file using contenturl with Basic Authentication

                    Olmecas