R database table pull encoding issue

I have a MySQL database containing a table of names. Once I pull them into R I run into some encoding problems which I don't understand but seem to be all related to the way R interpreted the file.

con <- dbConnect(MySQL(),

                 host = 'localhost',

                 user = 'root',

                 dbname='test',

                 rstudioapi::askForPassword('Database password')

)

and pull the names table from my MySQL db with

df <- dbReadTable(con, 'names')

My table looks something like

df <- structure(list(id = c(373863, 17731249), name = c("AARESTRUP, JAN CAROÌ¸E", 

"BÃœTTNER, CHRISTIAN")), .Names = c("id", "name"), class = "data.frame", row.names = c(NA, 

-2L))





df

        id                   name

1   373863 AARESTRUP, JAN CAROÌ¸E

2 17731249    BÃœTTNER, CHRISTIAN

Checking the encoding brings

Encoding(df$name)

[1] "unknown" "unknown"

Using Encoding(df$name) <- "UTF-8" fixes the second name but turns parts of name one into a non UTF-8 character.

df

        id                         name

1   373863 AARESTRUP, JAN CARO<U+0338>E

2 17731249           BÜTTNER, CHRISTIAN

Now it gets strange as df$name[1] returns
[1] "AARESTRUP, JAN CARO̸E" which is exactly how it is stored in the MySQL db while it seems like its differently interpreted when I call the whole df.

First, is there a way to ensure the dbReadTable() reads in automatically UTF-8. I found link which does not seem to change anything in terms of my table encoding.

Second, what is going on with the difference in the interpretation of the non UTF-8 encoding depending on how you access the name (whole df vs. single element)

asked Jan 3 at 18:28

CER

441516

add a comment |

I have a MySQL database containing a table of names. Once I pull them into R I run into some encoding problems which I don't understand but seem to be all related to the way R interpreted the file.

con <- dbConnect(MySQL(),

                 host = 'localhost',

                 user = 'root',

                 dbname='test',

                 rstudioapi::askForPassword('Database password')

)

and pull the names table from my MySQL db with

df <- dbReadTable(con, 'names')

My table looks something like

df <- structure(list(id = c(373863, 17731249), name = c("AARESTRUP, JAN CAROÌ¸E", 

"BÃœTTNER, CHRISTIAN")), .Names = c("id", "name"), class = "data.frame", row.names = c(NA, 

-2L))





df

        id                   name

1   373863 AARESTRUP, JAN CAROÌ¸E

2 17731249    BÃœTTNER, CHRISTIAN

Checking the encoding brings

Encoding(df$name)

[1] "unknown" "unknown"

Using Encoding(df$name) <- "UTF-8" fixes the second name but turns parts of name one into a non UTF-8 character.

df

        id                         name

1   373863 AARESTRUP, JAN CARO<U+0338>E

2 17731249           BÜTTNER, CHRISTIAN

First, is there a way to ensure the dbReadTable() reads in automatically UTF-8. I found link which does not seem to change anything in terms of my table encoding.

Second, what is going on with the difference in the interpretation of the non UTF-8 encoding depending on how you access the name (whole df vs. single element)

asked Jan 3 at 18:28

CER

441516

add a comment |

I have a MySQL database containing a table of names. Once I pull them into R I run into some encoding problems which I don't understand but seem to be all related to the way R interpreted the file.

con <- dbConnect(MySQL(),

                 host = 'localhost',

                 user = 'root',

                 dbname='test',

                 rstudioapi::askForPassword('Database password')

)

and pull the names table from my MySQL db with

df <- dbReadTable(con, 'names')

My table looks something like

df <- structure(list(id = c(373863, 17731249), name = c("AARESTRUP, JAN CAROÌ¸E", 

"BÃœTTNER, CHRISTIAN")), .Names = c("id", "name"), class = "data.frame", row.names = c(NA, 

-2L))





df

        id                   name

1   373863 AARESTRUP, JAN CAROÌ¸E

2 17731249    BÃœTTNER, CHRISTIAN

Checking the encoding brings

Encoding(df$name)

[1] "unknown" "unknown"

Using Encoding(df$name) <- "UTF-8" fixes the second name but turns parts of name one into a non UTF-8 character.

df

        id                         name

1   373863 AARESTRUP, JAN CARO<U+0338>E

2 17731249           BÜTTNER, CHRISTIAN

First, is there a way to ensure the dbReadTable() reads in automatically UTF-8. I found link which does not seem to change anything in terms of my table encoding.

Second, what is going on with the difference in the interpretation of the non UTF-8 encoding depending on how you access the name (whole df vs. single element)

asked Jan 3 at 18:28

CER

441516

I have a MySQL database containing a table of names. Once I pull them into R I run into some encoding problems which I don't understand but seem to be all related to the way R interpreted the file.

con <- dbConnect(MySQL(),

                 host = 'localhost',

                 user = 'root',

                 dbname='test',

                 rstudioapi::askForPassword('Database password')

)

and pull the names table from my MySQL db with

df <- dbReadTable(con, 'names')

My table looks something like

df <- structure(list(id = c(373863, 17731249), name = c("AARESTRUP, JAN CAROÌ¸E", 

"BÃœTTNER, CHRISTIAN")), .Names = c("id", "name"), class = "data.frame", row.names = c(NA, 

-2L))





df

        id                   name

1   373863 AARESTRUP, JAN CAROÌ¸E

2 17731249    BÃœTTNER, CHRISTIAN

Checking the encoding brings

Encoding(df$name)

[1] "unknown" "unknown"

Using Encoding(df$name) <- "UTF-8" fixes the second name but turns parts of name one into a non UTF-8 character.

df

        id                         name

1   373863 AARESTRUP, JAN CARO<U+0338>E

2 17731249           BÜTTNER, CHRISTIAN

First, is there a way to ensure the dbReadTable() reads in automatically UTF-8. I found link which does not seem to change anything in terms of my table encoding.

Second, what is going on with the difference in the interpretation of the non UTF-8 encoding depending on how you access the name (whole df vs. single element)

r encoding dbplyr

asked Jan 3 at 18:28

CER

441516

asked Jan 3 at 18:28

CER

441516

asked Jan 3 at 18:28

CER

441516

asked Jan 3 at 18:28

CER

441516

asked Jan 3 at 18:28

CER

441516

add a comment |

1 Answer
1

active

oldest

votes

<U+0338> is a UTF-8 character. But it's an overlay character that when rendered occupies the same space as the O that preceeds it. For some reason, it's displaying it incorrectly (although I checked and in R it displays correctly, so it seems to be something to do with StackOverflow, or perhaps Firefox). You can use

apply(df, MARGIN = 2, FUN = function(x) if( typeof(x) == 'character' ) Encoding(x)<-'UTF-8')

on all of your database result tables, which will convert encoding to UTF-8 for all of the character columns.

I think that the difference in behavior between when you call df$name[1] and when you call df has to do with the different print/show methods for character type objects vs data.frames.

df wants to show you the individual characters, meaning it doesn't want to render the characters in a way that disguises the fact that the E with a slash is in fact two characters. It also doesn't want to display the character as a spacing character, since it would then look like a backslash. Showing the unicode for the character is the best way to let the user know it is there.

The show method for character type objects wants to render the characters for you (most of the time).

edited Jan 4 at 16:11

answered Jan 3 at 19:39

AColeman

112

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54027855%2fr-database-table-pull-encoding-issue%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

apply(df, MARGIN = 2, FUN = function(x) if( typeof(x) == 'character' ) Encoding(x)<-'UTF-8')

on all of your database result tables, which will convert encoding to UTF-8 for all of the character columns.

I think that the difference in behavior between when you call df$name[1] and when you call df has to do with the different print/show methods for character type objects vs data.frames.

The show method for character type objects wants to render the characters for you (most of the time).

edited Jan 4 at 16:11

answered Jan 3 at 19:39

AColeman

112

add a comment |

apply(df, MARGIN = 2, FUN = function(x) if( typeof(x) == 'character' ) Encoding(x)<-'UTF-8')

on all of your database result tables, which will convert encoding to UTF-8 for all of the character columns.

I think that the difference in behavior between when you call df$name[1] and when you call df has to do with the different print/show methods for character type objects vs data.frames.

The show method for character type objects wants to render the characters for you (most of the time).

edited Jan 4 at 16:11

answered Jan 3 at 19:39

AColeman

112

add a comment |

apply(df, MARGIN = 2, FUN = function(x) if( typeof(x) == 'character' ) Encoding(x)<-'UTF-8')

on all of your database result tables, which will convert encoding to UTF-8 for all of the character columns.

I think that the difference in behavior between when you call df$name[1] and when you call df has to do with the different print/show methods for character type objects vs data.frames.

The show method for character type objects wants to render the characters for you (most of the time).

edited Jan 4 at 16:11

answered Jan 3 at 19:39

AColeman

112

apply(df, MARGIN = 2, FUN = function(x) if( typeof(x) == 'character' ) Encoding(x)<-'UTF-8')

on all of your database result tables, which will convert encoding to UTF-8 for all of the character columns.

I think that the difference in behavior between when you call df$name[1] and when you call df has to do with the different print/show methods for character type objects vs data.frames.

The show method for character type objects wants to render the characters for you (most of the time).

edited Jan 4 at 16:11

answered Jan 3 at 19:39

AColeman

112

edited Jan 4 at 16:11

answered Jan 3 at 19:39

AColeman

112

answered Jan 3 at 19:39

AColeman

112

answered Jan 3 at 19:39

AColeman

112

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Bdtjtk