Decode bad escape characters in Python

I have a database with a lot of names, and some of the names contain garbled characters. For example, the name in one record is JosÃ© Florés
I want to clean this up to get José Florés

I tried the following

name = "    JosÃ©     Florés "

print(name.encode('iso-8859-1', errors='ignore').decode('utf8', errors='backslashreplace'))

This fixes the first name but garbles the last name: the output is ' José Flor\xe9s '

What is the best way to solve this? The names can contain any kind of Unicode or hex escape sequences.

edited Jan 3 at 18:53

snakecharmerb

12.1k42552

asked Jan 3 at 18:35

Arpit Acharya

165

It looks like you have two different encodings in one value. I guess the best way would be to split the value into single words, try to decode each word with both encodings (handle a possible exception and try the other), and then put the words back together.

– Klaus D.
Jan 3 at 18:44
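
The word-by-word idea in the comment above could be sketched roughly as follows. This is a minimal sketch, assuming the damaged words are UTF-8 text that was mis-decoded as ISO-8859-1; the helper names fix_word and fix_name are made up for illustration. A word that cannot be round-tripped is left unchanged.

```python
def fix_word(word: str) -> str:
    """Undo UTF-8 text mis-decoded as ISO-8859-1; keep the word if that fails."""
    try:
        # Assumption: the wrong codec was ISO-8859-1 and the right one is UTF-8.
        return word.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The word is not valid UTF-8 in disguise, so leave it alone.
        return word


def fix_name(name: str) -> str:
    # str.split() with no arguments also collapses runs of whitespace.
    return " ".join(fix_word(word) for word in name.split())


print(fix_name("    JosÃ©     Florés "))  # José Florés
```

Note that this can still change a word that was correct to begin with if it happens to form a valid UTF-8 byte sequence, so reviewing the changes is advisable.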

I thought about that, but how should I go about doing this in a database of names like this one, which can have any kind of encoding?

– Arpit Acharya
Jan 3 at 18:48

1

This looks to me more like a search-and-replace problem than an encoding/decoding problem. You have a valid string for the encoding, just not the string that's intended. Identify the problem character sequences and the good replacements, then use a regex to search and replace. But beware Little Bobby Tables. It's not an exact analogy, but the "that'll never happen" attitude can be dangerous when those problem sequences occasionally turn out to be perfectly valid and correct, so that the substitution is an error. It might be better to automate the search but fix manually.

– Steve314
Jan 3 at 18:59
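
The "automate the search, fix manually" suggestion above might look something like this. It is only a sketch: the pattern is a heuristic that assumes UTF-8 bytes were displayed as ISO-8859-1 characters, and the names SUSPICIOUS and find_suspicious are invented for this example. It reports candidates and changes nothing.

```python
import re

# Heuristic: a UTF-8 lead byte shown as a Latin-1 character (U+00C2..U+00F4)
# followed by one or more continuation bytes shown as Latin-1 (U+0080..U+00BF).
SUSPICIOUS = re.compile("[\u00c2-\u00f4][\u0080-\u00bf]+")


def find_suspicious(names):
    """Return (name, matched sequences) pairs for names that look mis-decoded."""
    report = []
    for name in names:
        hits = SUSPICIOUS.findall(name)
        if hits:
            report.append((name, hits))
    return report


names = ["JosÃ© Florés", "Zoë Müller", "FranÃ§ois"]
for name, hits in find_suspicious(names):
    print(repr(name), hits)
```

Correctly accented names such as "Florés" are not flagged, because the accented letter is not followed by a character in the continuation range. Text that was mis-decoded with a different codec (Windows-1252, for instance) would need a different pattern.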

1

Is this really what your data looks like? It's extremely unusual to see two different encodings in the same string.

– Mark Ransom
Jan 3 at 19:04

Well, this is actually one of the many records I have.

– Arpit Acharya
Jan 4 at 20:18

python python-3.x string character-encoding

2 Answers

ftfy is a Python library that repairs Unicode text broken in various ways, using a function named fix_text.

from ftfy import fix_text


def convert_iso_name_to_string(name):
    # Repair each whitespace-separated word independently.
    result = []
    for word in name.split():
        result.append(fix_text(word))
    return ' '.join(result)


name = "JosÃ© Florés"
assert convert_iso_name_to_string(name) == "José Florés"

Applying the fix_text function to each word standardizes the names, which is an alternative way to solve the problem.

answered Jan 3 at 19:57

Aditya Purandare

506

Thanks, Aditya! It worked for a lot of test cases.

– Arpit Acharya
Jan 4 at 20:18


-1

We'll start with an example string containing a non-ASCII character (here “é”, an e with an acute accent):

s = 'Florés'

Now if we reference and print the string, it gives us essentially the same result:

>>> s
'Florés'
>>> print(s)
Florés

Unlike the same string s in Python 2.x, here s is already a Unicode string, because all strings in Python 3.x are Unicode. The visible difference is that s is displayed unchanged after we create it.

The same explanation can be found in "Encoding and Decoding Strings".
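
To connect this with the question, here is a small sketch of how a Unicode string ends up garbled when its bytes are decoded with the wrong codec, and how the damage can be reversed. It assumes the codecs involved are UTF-8 and ISO-8859-1, which matches the JosÃ© example but may not hold for every record.

```python
original = "José"

# str -> bytes: "é" becomes the two bytes 0xC3 0xA9 in UTF-8.
utf8_bytes = original.encode("utf-8")
print(utf8_bytes)  # b'Jos\xc3\xa9'

# Decoding those bytes with the wrong codec yields one character per byte.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)  # JosÃ©

# Reversing the two steps recovers the intended text.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired)  # José
```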

answered Jan 3 at 18:44

Manoj Patel

17519

3

How does this answer the question?

– snakecharmerb
Jan 3 at 18:47
