How To Correct UTF-8 Characters Stored As ASCII

-3

I have some old data that is supposedly stored in ASCII format. Clearly it contains UTF-8 data that was not properly converted before being written. For example, José appears in the file as JosÃ©. I can easily fix this with the Java snippet below:

byte[] utf8Bytes = c_TOBETRANSLATED.getBytes("ISO-8859-1");
String s2 = new String(utf8Bytes, "UTF-8");

But I need to do this in Python with the rest of my code. I'm only just starting in Python, and my internet searches and trial and error are not helping me find a Python solution that does the same thing.

edited Jan 1 at 4:16

noob

4761516

asked Jan 1 at 2:05

Scott M

Are you using Python 2 or 3?

– Jonah Bishop
Jan 1 at 2:09

6

That's not ASCII.

– user2357112
Jan 1 at 2:32

To my horror, I've discovered this can be intentional; a sort of Base256 with byte values converted to ISO 8859-1 characters so a byte sequence can be stored in a string datatype.

– Tom Blodget
Jan 1 at 4:03

Your problem description sounds like you are using Latin-1 to view a UTF-8 file. What are the actual bytes in the file?

– tripleee
Jan 1 at 23:28

python utf-8


2 Answers

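If you are using Python 3, you can do the following using the bytes constructor: encode the garbled text as ISO-8859-1 to recover the original bytes, then decode those bytes as UTF-8: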

test = "JosÃ©"
fixed = bytes(test, 'iso-8859-1').decode('utf-8')
# fixed will now contain the string José
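If only some of the values in the data are garbled, the round trip can raise an exception on text that is already correct. The following is a minimal sketch of one way to guard against that. The helper name `fix_mojibake` and its `wrong_encoding` parameter are hypothetical, not part of this answer; the sketch returns the input unchanged whenever the encode/decode round trip fails. The second example assumes the text was mis-decoded as cp1252 rather than ISO-8859-1, which matters for characters such as the euro sign.

```python
def fix_mojibake(text: str, wrong_encoding: str = "iso-8859-1") -> str:
    """Undo UTF-8 bytes that were mistakenly decoded with `wrong_encoding`.

    Hypothetical helper for illustration only. Returns the input unchanged
    when it does not look like this particular kind of mojibake.
    """
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(fix_mojibake("JosÃ©"))                          # José
print(fix_mojibake("José"))                           # already correct, unchanged
print(fix_mojibake("â‚¬"))                            # not Latin-1 text, unchanged
print(fix_mojibake("â‚¬", wrong_encoding="cp1252"))   # €
```

Falling back to the original string is a design choice for mixed data; for a one-off cleanup it may be preferable to let the exception propagate so that unexpected values are noticed.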

answered Jan 1 at 2:24

Jonah Bishop

8,92433257


If "JosÃ©" is what you see "in the file", the data was most likely read or displayed with the wrong encoding by the file viewer. The bytes are UTF-8, but they were decoded as something else. Example:

import locale

# Correctly written
with open('file.txt','w',encoding='utf8') as f:
    f.write('José')

# The default encoding for open()
print(locale.getpreferredencoding(False))

# Incorrectly opened
with open('file.txt') as f:
    data = f.read()
    print(data)
    # What I think you are requesting as a fix.
    # Re-encode with the incorrect encoding, then decode correctly.
    print(data.encode('cp1252').decode('utf8'))

# Correctly opened
with open('file.txt',encoding='utf8') as f:
    print(f.read())

Output:

cp1252
JosÃ©
José
José
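To tell which situation applies, it can help to look at the raw bytes, as one of the comments on the question suggests. The sketch below is an illustration under assumptions: the `hex_dump` helper and the temporary file names are hypothetical. It writes one properly encoded file and one double-encoded file and prints their bytes. If é appears as `c3 a9`, the file is valid UTF-8 and only needs to be opened with the right encoding; if it appears as `c3 83 c2 a9`, the garbled text itself was saved as UTF-8 and needs the re-encode/decode repair.

```python
import os
import tempfile


def hex_dump(path):
    """Return the raw bytes of a file as space-separated hex pairs."""
    with open(path, "rb") as f:  # binary mode: no decoding takes place
        return " ".join(f"{b:02x}" for b in f.read())


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.txt")
    bad = os.path.join(tmp, "bad.txt")

    # Proper UTF-8: 'é' is stored as the two bytes c3 a9.
    with open(good, "w", encoding="utf8") as f:
        f.write("José")

    # Double-encoded: the mojibake text itself was saved as UTF-8.
    with open(bad, "w", encoding="utf8") as f:
        f.write("JosÃ©")

    print(hex_dump(good))  # 4a 6f 73 c3 a9
    print(hex_dump(bad))   # 4a 6f 73 c3 83 c2 a9
```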

answered Jan 1 at 23:22

Mark Tolonen

93.8k12113176


