Decode bad escape characters in Python












I have a database with a lot of names, and some of the names contain badly encoded characters. For example, a name in one record is JosÃ© Florés
I want to clean this up to get José Florés

I tried the following:

name = "    JosÃ©     Florés "
print(name.encode('iso-8859-1', errors='ignore').decode('utf8', errors='backslashreplace'))

The output mangles the last name: ' José Flor\xe9s '

What is the best way to solve this? The names can contain any kind of Unicode or hex escape sequences.







python python-3.x string character-encoding






asked Jan 3 at 18:35 by Arpit Acharya
edited Jan 3 at 18:53 by snakecharmerb













  • It looks like you have two different encodings in one value. I guess the best way would be to split the value into single words, try to decode each word with both encodings (handle a possible exception and try the other), and then put the words back together.

    – Klaus D.
    Jan 3 at 18:44

  • I thought about that, but how should I go about doing the same in a database of names like this, which can have any kind of encoding?

    – Arpit Acharya
    Jan 3 at 18:48

  • This looks to me more like a search-and-replace problem than an encoding/decoding problem. You have a valid string for the encoding, just not the string that was intended. Identify the problem character sequences and their correct replacements, then use a regex to search and replace. But beware of Little Bobby Tables. It is not an exact analogy, but the "that'll never happen" attitude can be dangerous when those problem sequences occasionally turn out to be perfectly valid and correct, so that the substitution itself is the error. It might be better to automate the search but fix manually.

    – Steve314
    Jan 3 at 18:59

  • Is this really what your data looks like? It's extremely unusual to see two different encodings in the same string.

    – Mark Ransom
    Jan 3 at 19:04

  • Well, this is actually one of the many records I have.

    – Arpit Acharya
    Jan 4 at 20:18
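
The two suggestions in the comments (decode word by word, and automate the search but review the fixes manually) could be sketched roughly as follows. This is only a minimal sketch under a strong assumption: the only damage is UTF-8 bytes that were mis-decoded as ISO-8859-1. The helper names repair_word, repair_name and audit are hypothetical, and names damaged in other ways would need more handling than this.

```python
def repair_word(word):
    """Undo UTF-8-read-as-Latin-1 mojibake in one word, if present."""
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return word.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not Latin-1 encodable, or not valid UTF-8: leave the word alone.
        return word


def repair_name(name):
    """Repair each whitespace-separated word independently."""
    return " ".join(repair_word(word) for word in name.split())


def audit(names):
    """Return (original, proposed) pairs for names the repair would change."""
    pairs = []
    for name in names:
        proposed = repair_name(name)
        # Compare against the whitespace-normalized original so that
        # extra spaces alone do not count as a change.
        if proposed != " ".join(name.split()):
            pairs.append((name, proposed))
    return pairs


# Escapes keep this source ASCII-only: \u00c3\u00a9 is the mojibake for e-acute,
# \u00e9 is a correctly stored e-acute.
mixed = "    Jos\u00c3\u00a9     Flor\u00e9s "
print(repair_name(mixed))  # José Florés
print(audit([mixed, "Ana Lima"]))
```

Returning proposals from audit instead of writing them back keeps a human in the loop, which matters because a word that happens to be valid in both encodings would be changed silently by repair_word.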



















2 Answers
ftfy is a Python library that fixes Unicode text broken in various ways; its main entry point is a function named fix_text.

from ftfy import fix_text

def convert_iso_name_to_string(name):
    # Fix each whitespace-separated word on its own, then rejoin the words.
    result = []
    for word in name.split():
        result.append(fix_text(word))
    return ' '.join(result)

name = "JosÃ© Florés"
assert convert_iso_name_to_string(name) == "José Florés"

Using fix_text, the names can be standardized, which is an alternative way to solve the problem.







answered Jan 3 at 19:57 by Aditya Purandare













  • Thanks, Aditya! This worked for a lot of test cases.

    – Arpit Acharya
    Jan 4 at 20:18



















We'll start with an example string containing a non-ASCII character (here “é”, an e with an acute accent):

s = 'Florés'

Now if we evaluate the string and print it, both give essentially the same result:

>>> s
'Florés'
>>> print(s)
Florés

In contrast to the same string in Python 2.x, s here is already a Unicode string, because all strings in Python 3.x are Unicode by default. The visible difference is that s is displayed unchanged after we create it, whereas the Python 2.x interpreter would echo escaped bytes (for example 'Flor\xc3\xa9s' on a UTF-8 terminal).

You can find the same explanation in "Encoding and Decoding Strings".
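
To make the str versus bytes distinction concrete, here is a small standard-library sketch (the variable names are only illustrative). It shows that a Python 3 str holds code points, that its byte form depends on the codec chosen, and how decoding UTF-8 bytes with the wrong codec produces the kind of damaged text the question describes.

```python
s = "Flor\u00e9s"  # same as 'Florés'; the escape keeps this source ASCII-only

# A str is a sequence of Unicode code points, so the accented e is one character.
assert len(s) == 6

# The byte representation depends on the codec used to encode.
utf8_bytes = s.encode("utf-8")         # b'Flor\xc3\xa9s' (two bytes for the e-acute)
latin1_bytes = s.encode("iso-8859-1")  # b'Flor\xe9s'     (one byte for the e-acute)

# Decoding UTF-8 bytes with the wrong codec yields mojibake.
mojibake = utf8_bytes.decode("iso-8859-1")
print(mojibake)  # FlorÃ©s

# Reversing the wrong step recovers the original text.
restored = mojibake.encode("iso-8859-1").decode("utf-8")
print(restored == s)  # True
```

Note that the round trip only works when the whole string was damaged the same way; a string that mixes damaged and undamaged words raises UnicodeDecodeError at the final decode step.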







answered Jan 3 at 18:44 by Manoj Patel








  • How does this answer the question?

    – snakecharmerb
    Jan 3 at 18:47













