R database table pull encoding issue
I have a MySQL database containing a table of names. Once I pull them into R I run into some encoding problems which I don't understand but seem to be all related to the way R interpreted the file.
con <- dbConnect(MySQL(),
host = 'localhost',
user = 'root',
dbname='test',
rstudioapi::askForPassword('Database password')
)
and pull the names
table from my MySQL db with
df <- dbReadTable(con, 'names')
My table looks something like
df <- structure(list(id = c(373863, 17731249), name = c("AARESTRUP, JAN CARO̸E",
"BÃœTTNER, CHRISTIAN")), .Names = c("id", "name"), class = "data.frame", row.names = c(NA,
-2L))
df
id name
1 373863 AARESTRUP, JAN CARO̸E
2 17731249 BÃœTTNER, CHRISTIAN
Checking the encoding brings
Encoding(df$name)
[1] "unknown" "unknown"
Using Encoding(df$name) <- "UTF-8"
fixes the second name but turns parts of name one into a non UTF-8 character.
df
id name
1 373863 AARESTRUP, JAN CARO<U+0338>E
2 17731249 BÜTTNER, CHRISTIAN
Now it gets strange as df$name[1]
returns
[1] "AARESTRUP, JAN CARO̸E"
which is exactly how it is stored in the MySQL db while it seems like its differently interpreted when I call the whole df
.
First, is there a way to ensure the dbReadTable()
reads in automatically UTF-8
. I found link which does not seem to change anything in terms of my table encoding.
Second, what is going on with the difference in the interpretation of the non UTF-8 encoding depending on how you access the name (whole df vs. single element)
r encoding dbplyr
add a comment |
I have a MySQL database containing a table of names. Once I pull them into R I run into some encoding problems which I don't understand but seem to be all related to the way R interpreted the file.
con <- dbConnect(MySQL(),
host = 'localhost',
user = 'root',
dbname='test',
rstudioapi::askForPassword('Database password')
)
and pull the names
table from my MySQL db with
df <- dbReadTable(con, 'names')
My table looks something like
df <- structure(list(id = c(373863, 17731249), name = c("AARESTRUP, JAN CARO̸E",
"BÃœTTNER, CHRISTIAN")), .Names = c("id", "name"), class = "data.frame", row.names = c(NA,
-2L))
df
id name
1 373863 AARESTRUP, JAN CARO̸E
2 17731249 BÃœTTNER, CHRISTIAN
Checking the encoding brings
Encoding(df$name)
[1] "unknown" "unknown"
Using Encoding(df$name) <- "UTF-8"
fixes the second name but turns parts of name one into a non UTF-8 character.
df
id name
1 373863 AARESTRUP, JAN CARO<U+0338>E
2 17731249 BÜTTNER, CHRISTIAN
Now it gets strange as df$name[1]
returns
[1] "AARESTRUP, JAN CARO̸E"
which is exactly how it is stored in the MySQL db while it seems like its differently interpreted when I call the whole df
.
First, is there a way to ensure the dbReadTable()
reads in automatically UTF-8
. I found link which does not seem to change anything in terms of my table encoding.
Second, what is going on with the difference in the interpretation of the non UTF-8 encoding depending on how you access the name (whole df vs. single element)
r encoding dbplyr
add a comment |
I have a MySQL database containing a table of names. Once I pull them into R I run into some encoding problems which I don't understand but seem to be all related to the way R interpreted the file.
con <- dbConnect(MySQL(),
host = 'localhost',
user = 'root',
dbname='test',
rstudioapi::askForPassword('Database password')
)
and pull the names
table from my MySQL db with
df <- dbReadTable(con, 'names')
My table looks something like
df <- structure(list(id = c(373863, 17731249), name = c("AARESTRUP, JAN CARO̸E",
"BÃœTTNER, CHRISTIAN")), .Names = c("id", "name"), class = "data.frame", row.names = c(NA,
-2L))
df
id name
1 373863 AARESTRUP, JAN CARO̸E
2 17731249 BÃœTTNER, CHRISTIAN
Checking the encoding brings
Encoding(df$name)
[1] "unknown" "unknown"
Using Encoding(df$name) <- "UTF-8"
fixes the second name but turns parts of name one into a non UTF-8 character.
df
id name
1 373863 AARESTRUP, JAN CARO<U+0338>E
2 17731249 BÜTTNER, CHRISTIAN
Now it gets strange as df$name[1]
returns
[1] "AARESTRUP, JAN CARO̸E"
which is exactly how it is stored in the MySQL db while it seems like its differently interpreted when I call the whole df
.
First, is there a way to ensure the dbReadTable()
reads in automatically UTF-8
. I found link which does not seem to change anything in terms of my table encoding.
Second, what is going on with the difference in the interpretation of the non UTF-8 encoding depending on how you access the name (whole df vs. single element)
r encoding dbplyr
I have a MySQL database containing a table of names. Once I pull them into R I run into some encoding problems which I don't understand but seem to be all related to the way R interpreted the file.
con <- dbConnect(MySQL(),
host = 'localhost',
user = 'root',
dbname='test',
rstudioapi::askForPassword('Database password')
)
and pull the names
table from my MySQL db with
df <- dbReadTable(con, 'names')
My table looks something like
df <- structure(list(id = c(373863, 17731249), name = c("AARESTRUP, JAN CARO̸E",
"BÃœTTNER, CHRISTIAN")), .Names = c("id", "name"), class = "data.frame", row.names = c(NA,
-2L))
df
id name
1 373863 AARESTRUP, JAN CARO̸E
2 17731249 BÃœTTNER, CHRISTIAN
Checking the encoding brings
Encoding(df$name)
[1] "unknown" "unknown"
Using Encoding(df$name) <- "UTF-8"
fixes the second name but turns parts of name one into a non UTF-8 character.
df
id name
1 373863 AARESTRUP, JAN CARO<U+0338>E
2 17731249 BÜTTNER, CHRISTIAN
Now it gets strange as df$name[1]
returns
[1] "AARESTRUP, JAN CARO̸E"
which is exactly how it is stored in the MySQL db while it seems like its differently interpreted when I call the whole df
.
First, is there a way to ensure the dbReadTable()
reads in automatically UTF-8
. I found link which does not seem to change anything in terms of my table encoding.
Second, what is going on with the difference in the interpretation of the non UTF-8 encoding depending on how you access the name (whole df vs. single element)
r encoding dbplyr
r encoding dbplyr
asked Jan 3 at 18:28
CERCER
441516
441516
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
<U+0338>
is a UTF-8 character. But it's an overlay character that when rendered occupies the same space as the O that preceeds it. For some reason, it's displaying it incorrectly (although I checked and in R it displays correctly, so it seems to be something to do with StackOverflow, or perhaps Firefox). You can use
apply(df, MARGIN = 2, FUN = function(x) if( typeof(x) == 'character' ) Encoding(x)<-'UTF-8')
on all of your database result tables, which will convert encoding to UTF-8 for all of the character columns.
I think that the difference in behavior between when you call df$name[1]
and when you call df
has to do with the different print/show methods for character type objects vs data.frames.
df
wants to show you the individual characters, meaning it doesn't want to render the characters in a way that disguises the fact that the E with a slash is in fact two characters. It also doesn't want to display the character as a spacing character, since it would then look like a backslash. Showing the unicode for the character is the best way to let the user know it is there.
The show method for character type objects wants to render the characters for you (most of the time).
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54027855%2fr-database-table-pull-encoding-issue%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
<U+0338>
is a UTF-8 character. But it's an overlay character that when rendered occupies the same space as the O that preceeds it. For some reason, it's displaying it incorrectly (although I checked and in R it displays correctly, so it seems to be something to do with StackOverflow, or perhaps Firefox). You can use
apply(df, MARGIN = 2, FUN = function(x) if( typeof(x) == 'character' ) Encoding(x)<-'UTF-8')
on all of your database result tables, which will convert encoding to UTF-8 for all of the character columns.
I think that the difference in behavior between when you call df$name[1]
and when you call df
has to do with the different print/show methods for character type objects vs data.frames.
df
wants to show you the individual characters, meaning it doesn't want to render the characters in a way that disguises the fact that the E with a slash is in fact two characters. It also doesn't want to display the character as a spacing character, since it would then look like a backslash. Showing the unicode for the character is the best way to let the user know it is there.
The show method for character type objects wants to render the characters for you (most of the time).
add a comment |
<U+0338>
is a UTF-8 character. But it's an overlay character that when rendered occupies the same space as the O that preceeds it. For some reason, it's displaying it incorrectly (although I checked and in R it displays correctly, so it seems to be something to do with StackOverflow, or perhaps Firefox). You can use
apply(df, MARGIN = 2, FUN = function(x) if( typeof(x) == 'character' ) Encoding(x)<-'UTF-8')
on all of your database result tables, which will convert encoding to UTF-8 for all of the character columns.
I think that the difference in behavior between when you call df$name[1]
and when you call df
has to do with the different print/show methods for character type objects vs data.frames.
df
wants to show you the individual characters, meaning it doesn't want to render the characters in a way that disguises the fact that the E with a slash is in fact two characters. It also doesn't want to display the character as a spacing character, since it would then look like a backslash. Showing the unicode for the character is the best way to let the user know it is there.
The show method for character type objects wants to render the characters for you (most of the time).
add a comment |
<U+0338>
is a UTF-8 character. But it's an overlay character that when rendered occupies the same space as the O that preceeds it. For some reason, it's displaying it incorrectly (although I checked and in R it displays correctly, so it seems to be something to do with StackOverflow, or perhaps Firefox). You can use
apply(df, MARGIN = 2, FUN = function(x) if( typeof(x) == 'character' ) Encoding(x)<-'UTF-8')
on all of your database result tables, which will convert encoding to UTF-8 for all of the character columns.
I think that the difference in behavior between when you call df$name[1]
and when you call df
has to do with the different print/show methods for character type objects vs data.frames.
df
wants to show you the individual characters, meaning it doesn't want to render the characters in a way that disguises the fact that the E with a slash is in fact two characters. It also doesn't want to display the character as a spacing character, since it would then look like a backslash. Showing the unicode for the character is the best way to let the user know it is there.
The show method for character type objects wants to render the characters for you (most of the time).
<U+0338>
is a UTF-8 character. But it's an overlay character that when rendered occupies the same space as the O that preceeds it. For some reason, it's displaying it incorrectly (although I checked and in R it displays correctly, so it seems to be something to do with StackOverflow, or perhaps Firefox). You can use
apply(df, MARGIN = 2, FUN = function(x) if( typeof(x) == 'character' ) Encoding(x)<-'UTF-8')
on all of your database result tables, which will convert encoding to UTF-8 for all of the character columns.
I think that the difference in behavior between when you call df$name[1]
and when you call df
has to do with the different print/show methods for character type objects vs data.frames.
df
wants to show you the individual characters, meaning it doesn't want to render the characters in a way that disguises the fact that the E with a slash is in fact two characters. It also doesn't want to display the character as a spacing character, since it would then look like a backslash. Showing the unicode for the character is the best way to let the user know it is there.
The show method for character type objects wants to render the characters for you (most of the time).
edited Jan 4 at 16:11
answered Jan 3 at 19:39
AColemanAColeman
112
112
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54027855%2fr-database-table-pull-encoding-issue%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown