How To Correct UTF-8 Characters Stored As ASCII

I have some old data that is stored in ASCII format. Clearly there is UTF-8 data that was not properly converted to ASCII before being written. For example, José will appear in the file as JosÃ©. I can easily fix this with the Java snippet below:

byte[] utf8Bytes = c_TOBETRANSLATED.getBytes("ISO-8859-1");
String s2 = new String(utf8Bytes, "UTF-8");

But I need to do this in Python, along with the rest of my code. I'm only just starting out in Python, and my internet searches and trial and error are not helping me find a Python solution that does the same thing.
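For readers unsure how this kind of corruption arises, the following minimal sketch reproduces it in Python 3. It assumes the text was encoded as UTF-8 and later decoded as ISO-8859-1 (Latin-1); the variable names are illustrative and not taken from the question.

```python
# Reproduce the corruption: encode as UTF-8, then decode with the wrong codec.
original = "José"
utf8_bytes = original.encode("utf-8")      # 'é' becomes the two bytes C3 A9
mangled = utf8_bytes.decode("iso-8859-1")  # each byte is turned into one character

print(utf8_bytes)                    # b'Jos\xc3\xa9'
print(mangled)                       # JosÃ©
print(len(original), len(mangled))   # 4 5
```

The extra character appears because the two-byte UTF-8 sequence for "é" is read as two separate Latin-1 characters.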










  • Are you using Python 2 or 3?
    – Jonah Bishop, Jan 1 at 2:09

  • That's not ASCII.
    – user2357112, Jan 1 at 2:32

  • To my horror, I've discovered this can be intentional: a sort of Base256 with byte values converted to ISO 8859-1 characters so a byte sequence can be stored in a string datatype.
    – Tom Blodget, Jan 1 at 4:03

  • Your problem description sounds like you are using Latin-1 to view a UTF-8 file. What are the actual bytes in the file?
    – tripleee, Jan 1 at 23:28
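To answer the question about the actual bytes, one option is to open the file in binary mode and look at the raw values. The sketch below is an illustration only: it writes a small sample file (the temporary path and the sample content are assumptions) so that it can run on its own. With real data, point it at the actual file instead.

```python
import os
import tempfile

# Create a sample file containing UTF-8 bytes (a stand-in for the real data).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write("José".encode("utf-8"))

# Read the raw bytes without any decoding.
with open(path, "rb") as f:
    raw = f.read()

print(raw)                                # b'Jos\xc3\xa9'
print(" ".join(f"{b:02x}" for b in raw))  # 4a 6f 73 c3 a9

# Pure ASCII only contains byte values below 0x80.
is_ascii = all(b < 0x80 for b in raw)
print(is_ascii)                           # False

# If the bytes decode cleanly as UTF-8, the file is most likely UTF-8 already.
print(raw.decode("utf-8"))                # José
```

If the file really contains the bytes C3 A9 for "é", the data is intact UTF-8 and only the tool reading it is using the wrong encoding.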
















python utf-8






edited Jan 1 at 4:16 by noob
asked Jan 1 at 2:05 by Scott M

2 Answers

If you are using Python 3, you can do the following using the built-in bytes():

test = "JosÃ©"
fixed = bytes(test, 'iso-8859-1').decode('utf-8')
# fixed will now contain the string José
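If some strings in the data are already correct, the round trip above raises an exception for them. A minimal sketch of a defensive wrapper follows. The function name `fix_mojibake` is hypothetical, and the sketch assumes the damage was caused by decoding UTF-8 bytes as ISO-8859-1.

```python
def fix_mojibake(text):
    """Undo UTF-8-read-as-Latin-1 damage; return text unchanged if that fails."""
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the resulting
        # bytes are not valid UTF-8: assume it was not damaged in this way.
        return text


print(fix_mojibake("JosÃ©"))   # José
print(fix_mojibake("José"))    # José (already correct, left alone)
print(fix_mojibake("plain"))   # plain
```

This sketch cannot distinguish damaged text from genuine text that happens to look like mojibake, so it is worth spot-checking the results on real data.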






answered Jan 1 at 2:24 by Jonah Bishop

If you see "JosÃ©" "in the file", the data was read or displayed incorrectly by the file viewer. The bytes are UTF-8 but were decoded with the wrong encoding. Example:

import locale

# Correctly written
with open('file.txt', 'w', encoding='utf8') as f:
    f.write('José')

# The default encoding for open()
print(locale.getpreferredencoding(False))

# Incorrectly opened
with open('file.txt') as f:
    data = f.read()
print(data)
# What I think you are requesting as a fix.
# Re-encode with the incorrect encoding, then decode correctly.
print(data.encode('cp1252').decode('utf8'))

# Correctly opened
with open('file.txt', encoding='utf8') as f:
    print(f.read())

Output (on a system where the default encoding is cp1252):

cp1252
JosÃ©
José
José
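Which codec to use for the re-encoding step depends on how the text was wrongly decoded in the first place. As an illustration (the sample strings are assumptions, not data from the question), the euro sign shows a case where cp1252 and ISO-8859-1 behave differently:

```python
euro_bytes = "€".encode("utf-8")        # b'\xe2\x82\xac'
mangled = euro_bytes.decode("cp1252")   # 'â‚¬'

# Reversing with the same wrong codec restores the text.
restored = mangled.encode("cp1252").decode("utf-8")
print(restored)                         # €

# ISO-8859-1 cannot encode '‚' (U+201A), so that round trip fails here.
try:
    mangled.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)                        # False
```

For "José" both codecs give the same result, because "Ã" and "©" sit at the same positions in cp1252 and ISO-8859-1. The difference only shows up for bytes in the 0x80 to 0x9F range.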






answered Jan 1 at 23:22 by Mark Tolonen