Decode bad escape characters in Python
I have a database with a lot of names, and some of the names contain bad characters. For example, the name in one record is JosÃ© Florés
I want to clean this to get José Florés
I tried the following:
name = " JosÃ© Florés "
print(name.encode('iso-8859-1', errors='ignore').decode('utf8', errors='backslashreplace'))
The output mangles the last name: ' José Flor\xe9s '
What is the best way to solve this? The names can contain any kind of Unicode or hex escape sequences.
python python-3.x string character-encoding
edited Jan 3 at 18:53 by snakecharmerb
asked Jan 3 at 18:35 by Arpit Acharya
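A minimal sketch of why the attempt above mangles the last name, assuming the stored value is the mixed string JosÃ© Florés (first word is UTF-8 that was read as Latin-1, second word is already correct). The helper name try_strict_decode is hypothetical and only serves the illustration. Encoding the whole string to Latin-1 yields a valid UTF-8 sequence for the first word but a lone 0xE9 byte in the second, which cannot be decoded as UTF-8:

```python
# Assumed sample record: the first word is damaged, the second is fine.
name = "JosÃ© Florés"

raw = name.encode("iso-8859-1")
print(raw)  # b'Jos\xc3\xa9 Flor\xe9s'


def try_strict_decode(data):
    """Return the UTF-8 decoding of data, or None if it is not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# The whole value is not valid UTF-8 because of the lone 0xE9 byte.
print(try_strict_decode(raw))      # None
# The first word on its own decodes cleanly.
print(try_strict_decode(raw[:5]))  # José
```

With errors='backslashreplace' the decoder does not fail on that byte; it writes it out as the literal text \xe9, which is the damage seen in the output.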
It looks like you have two different encodings in one value. I guess the best way would be to split the value into single words, try to decode each word with both encodings (handle a possible exception and try the other), and then put the words back together.
– Klaus D.
Jan 3 at 18:44
I thought about that, but how should I go about doing that in a database of names like this, which can have any kind of encoding?
– Arpit Acharya
Jan 3 at 18:48
1
This looks to me more like a search-and-replace problem than an encoding/decoding problem. You have a valid string for the encoding, just not the string that's intended. Identify the problem character sequences and the good replacements, then use regex to search and replace. But beware Little Bobby Tables: not an exact analogy, but the "that'll never happen" attitude can be dangerous when those problem sequences occasionally turn out to be perfectly valid and correct, so the substitution is an error. It might be better to automate the search but fix manually.
– Steve314
Jan 3 at 18:59
1
Is this really what your data looks like? It's extremely unusual to see two different encodings in the same string.
– Mark Ransom
Jan 3 at 19:04
Well, this is actually one of the many records I have.
– Arpit Acharya
Jan 4 at 20:18
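The per-word idea from Klaus D.'s comment could be sketched roughly as follows. This is only an illustration under stated assumptions: the damage is assumed to be UTF-8 text that was decoded as ISO-8859-1, and the function names repair_word and repair_name are hypothetical. Words that cannot be round-tripped are assumed to be correct already and are left alone.

```python
def repair_word(word):
    """Undo UTF-8-read-as-Latin-1 damage in one word, if that is what it is."""
    try:
        return word.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or not valid UTF-8 afterwards:
        # assume the word was already correct and keep it unchanged.
        return word


def repair_name(name):
    """Repair each whitespace-separated word independently."""
    return " ".join(repair_word(word) for word in name.split())


print(repair_name("JosÃ© Florés"))  # José Florés
```

Note that name.split() collapses runs of whitespace, and that a correct word which happens to be valid UTF-8 when re-encoded would be changed by mistake, which is the risk Steve314's comment warns about. Other kinds of damage (for example text misread as Windows-1252) would need a different codec pair.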
2 Answers
ftfy is a Python library that fixes Unicode text broken in different ways, using a function named fix_text.
from ftfy import fix_text

def convert_iso_name_to_string(name):
    # Fix each word separately, since words may be damaged differently.
    result = []
    for word in name.split():
        result.append(fix_text(word))
    return ' '.join(result)

name = "JosÃ© Florés"
assert convert_iso_name_to_string(name) == "José Florés"
Using the fix_text function, the names can be standardized, which is an alternative way to solve the problem.
answered Jan 3 at 19:57 by Aditya Purandare
Thanks Aditya! It worked for a lot of test cases.
– Arpit Acharya
Jan 4 at 20:18
We'll start with an example string containing a non-ASCII character (here, “é”, or “e with an acute accent”):
s = 'Florés'
Now if we reference and print the string, it gives us essentially the same result:
>>> s
'Florés'
>>> print(s)
Florés
In contrast to the same string s in Python 2.x, in this case s is already a Unicode string; all strings in Python 3.x are Unicode by default. The visible difference is that s wasn't changed after we instantiated it.
You can find the same explanation here: Encoding and Decoding Strings
answered Jan 3 at 18:44 by Manoj Patel
3
How does this answer the question?
– snakecharmerb
Jan 3 at 18:47
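To connect the point about Python 3 strings to the question, here is a small sketch of how this kind of damage typically arises and how it reverses. It assumes the mismatch is between UTF-8 and ISO-8859-1, as the code in the question implies; the variable names are illustrative only.

```python
s = "Florés"

# A Python 3 str holds code points; bytes only appear when encoding.
utf8_bytes = s.encode("utf-8")
print(utf8_bytes)  # b'Flor\xc3\xa9s'

# Decoding those UTF-8 bytes with the wrong codec produces the garbled form.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)  # FlorÃ©s

# Reversing the same two steps recovers the original text.
restored = garbled.encode("iso-8859-1").decode("utf-8")
print(restored == s)  # True
```

The reversal only works for text that went through exactly this wrong decode, which is why a string that mixes damaged and undamaged words cannot be fixed in one pass.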