How To Correct UTF-8 Characters Stored As ASCII
I have some old data that is stored in ASCII format. Clearly there is UTF-8 data that was not properly converted to ASCII before being written. For example, José will appear in the file as JosÃ©. I can easily fix this with the Java snippet below:

    byte[] utf8Bytes = c_TOBETRANSLATED.getBytes("ISO-8859-1");
    String s2 = new String(utf8Bytes, "UTF-8");

But I need to do this in Python with the rest of my code. I'm only just starting in Python, and my internet searches and trial and error are not helping me find a Python solution that does the same thing.

Tags: python utf-8
asked Jan 1 at 2:05 by Scott M; edited Jan 1 at 4:16 by noob
Comments:

Are you using Python 2 or 3? – Jonah Bishop, Jan 1 at 2:09

That's not ASCII. – user2357112, Jan 1 at 2:32

To my horror, I've discovered this can be intentional: a sort of Base256 with byte values converted to ISO 8859-1 characters so a byte sequence can be stored in a string datatype. – Tom Blodget, Jan 1 at 4:03

Your problem description sounds like you are using Latin-1 to view a UTF-8 file. What are the actual bytes in the file? – tripleee, Jan 1 at 23:28
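One way to answer the last comment's question is to read the file in binary mode, so that no decoding happens at all. The sketch below is only an illustration: the helper name `describe_bytes` and the two temporary files are assumptions made for the demo, not part of the original post. It writes one file containing correct UTF-8 and one containing text that was already garbled before being saved, then shows how the raw bytes differ.

```python
import os
import tempfile


def describe_bytes(path, limit=64):
    """Return the first `limit` raw bytes of a file, without decoding them."""
    with open(path, "rb") as f:
        return f.read(limit)


with tempfile.TemporaryDirectory() as tmp:
    utf8_path = os.path.join(tmp, "utf8.txt")        # hypothetical demo file
    double_path = os.path.join(tmp, "double.txt")    # hypothetical demo file

    # Case 1: the file holds correct UTF-8; only the viewer is wrong.
    with open(utf8_path, "w", encoding="utf-8") as f:
        f.write("José")

    # Case 2: the garbled text itself was saved as UTF-8 (double encoding).
    with open(double_path, "w", encoding="utf-8") as f:
        f.write("JosÃ©")

    utf8_bytes = describe_bytes(utf8_path)
    double_bytes = describe_bytes(double_path)

print(utf8_bytes)    # b'Jos\xc3\xa9'
print(double_bytes)  # b'Jos\xc3\x83\xc2\xa9'
```

If the bytes look like the first case, opening the file with `encoding='utf-8'` should be enough. If they look like the second case, the data needs an encode/decode round trip after it is read.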
2 Answers
If you are using Python 3, you can do the following using the built-in bytes function:

    test = "JosÃ©"
    fixed = bytes(test, 'iso-8859-1').decode('utf-8')
    # fixed will now contain the string José
answered Jan 1 at 2:24 by Jonah Bishop
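When the same round trip is applied to a whole data set, some values may already be correct or may contain characters that ISO-8859-1 cannot represent, and in those cases the conversion raises an exception. A minimal sketch of a defensive wrapper follows. The function name `fix_mojibake` and its fall-back behaviour (return the input unchanged) are assumptions for illustration, not something the answer above prescribes.

```python
def fix_mojibake(text, wrong_encoding="iso-8859-1"):
    """Undo UTF-8 text that was mistakenly decoded as `wrong_encoding`.

    Returns the input unchanged when the round trip is not possible,
    which usually means the text was not garbled in this way.
    """
    try:
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(fix_mojibake("JosÃ©"))  # José
print(fix_mojibake("José"))   # José (unchanged: b'Jos\xe9' is not valid UTF-8)
print(fix_mojibake("日本"))    # 日本 (unchanged: not encodable as ISO-8859-1)
```

This is a heuristic. A string that legitimately contains a sequence such as "Ã©" would also be converted, so it is worth spot-checking the results on real data.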
If you have "JosÃ©" "in the file", the data was read or displayed incorrectly by the file viewer. The file is UTF-8 but was decoded with the wrong encoding. Example:

    import locale

    # Correctly written
    with open('file.txt', 'w', encoding='utf8') as f:
        f.write('José')

    # The default encoding for open()
    print(locale.getpreferredencoding(False))

    # Incorrectly opened
    with open('file.txt') as f:
        data = f.read()
    print(data)

    # What I think you are requesting as a fix.
    # Re-encode with the incorrect encoding, then decode correctly.
    print(data.encode('cp1252').decode('utf8'))

    # Correctly opened
    with open('file.txt', encoding='utf8') as f:
        print(f.read())

Output:

    cp1252
    JosÃ©
    José
    José
answered Jan 1 at 23:22 by Mark Tolonen
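The two answers reverse the damage with different codecs (ISO-8859-1 in one, cp1252 in the other). For "José" both give the same result, but they are not interchangeable in general: the codec used for re-encoding should be the one that misread the bytes in the first place. The following sketch uses a sample string chosen for illustration (it does not come from the question) to show a case where only cp1252 can undo a cp1252 misread.

```python
original = "€5 for José"

# Simulate the damage: UTF-8 bytes decoded with the Windows default codec.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)  # â‚¬5 for JosÃ©

# Reversing with the same codec recovers the original text.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)  # €5 for José

# ISO-8859-1 cannot encode '‚' (U+201A), so this round trip fails.
try:
    garbled.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)  # False
```

If it is unknown which codec produced the garbled text, trying cp1252 first and ISO-8859-1 second is a reasonable starting point on data that passed through Windows tools, though that ordering is a guess rather than a rule.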