What is character encoding and why should I bother with it?












30















I am quite confused about the concept of character encoding.



What are Unicode, GBK, etc.? How does a programming language use them?



Do I need to bother knowing about them? Is there a simpler or faster way of programming without having to trouble myself with them?






























  • 7





    The classic off-site resource for this is Joel Spolsky's essay The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

    – Raedwald
    Apr 10 '15 at 12:32






  • 1





    It's a late reply, but I posted some explanations about the mentioned encodings and charsets + also some shortcuts (e.g. for java)

    – bvdb
    Aug 1 '15 at 10:49





































encoding character-encoding














edited Apr 10 '15 at 11:58









Raedwald











asked May 16 '12 at 3:00









hguser




















3 Answers


















35














(Note that I'm using some of these terms loosely/colloquially for a simpler explanation that still hits the key points.)



A byte can only have 256 distinct values, being 8 bits.



Since some character sets contain more than 256 characters, one cannot in general say that each character is a byte.



Therefore, there must be mappings that describe how to turn each character in a character set into a sequence of bytes. Some characters might be mapped to a single byte but others will have to be mapped to multiple bytes.



Those mappings are encodings, because they are telling you how to encode characters into sequences of bytes.



As for Unicode, at a very high level, Unicode is an attempt to assign a single, unique number (a code point) to every character. Obviously that number has to be something wider than a byte since there are more than 256 characters :) Java strings use UTF-16, an encoding of Unicode built from 16-bit units (and this is why a Java char is 16 bits wide and has integer values from 0 to 65535). Most characters fit in one char, but code points above U+FFFF take two chars, called a surrogate pair. When you get the byte representation of a Java string, you have to tell the JVM the encoding you want to use so it will know how to choose the byte sequence for each character.
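A minimal sketch of that last point, assuming a standard JDK. The class name CharToBytesDemo and the helper encode are made up for illustration; only String.getBytes(Charset) and StandardCharsets come from the Java library. It shows the same character turning into different byte sequences depending on the encoding you ask for, and a code point that needs two chars.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of any library.
class CharToBytesDemo {
    // Encode text with an explicitly chosen charset instead of the platform default.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent), written as an escape so the
        // encoding of this source file does not matter.
        String eAcute = "\u00e9";
        System.out.println(Arrays.toString(encode(eAcute, StandardCharsets.ISO_8859_1))); // [-23]
        System.out.println(Arrays.toString(encode(eAcute, StandardCharsets.UTF_8)));      // [-61, -87]
        System.out.println(Arrays.toString(encode(eAcute, StandardCharsets.UTF_16BE)));   // [0, -23]

        // U+1F600 lies above U+FFFF, so Java stores it as two chars (a surrogate pair).
        String emoji = "\uD83D\uDE00";
        System.out.println(emoji.length());                          // 2
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1
    }
}
```

The negative numbers appear because Java's byte is signed; -23 is the same bit pattern as 0xE9.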







































    34














    ASCII is fundamental



    Originally, 1 character was stored as 1 byte. A byte (8 bits) can distinguish 256 values, but only the lower 7 bits were used, so only 128 characters were defined. This set is known as the ASCII character set.

    • 0x00 - 0x1F contain control codes (e.g. CR, LF, STX, ETX, EOT, BEL, ...)
    • 0x20 - 0x40 contain digits and punctuation
    • 0x41 - 0x7F contain mostly alphabetic characters
    • 0x80 - 0xFF (8th bit set) are undefined in ASCII

    French, German and many other languages needed additional characters (e.g. à, é, ç, ô, ...) which were not available in the ASCII character set. So they used the 8th bit to define their characters. This is what is known as "extended ASCII".

    The problem is that the 1 additional bit does not have enough capacity to cover all the languages in the world. So each region has its own ASCII variant. There are many extended ASCII encodings (Latin-1 being a very popular one).

    Popular question: "Is ASCII a character set or is it an encoding?" ASCII is a character set. However, in programming, charset and encoding are widely used as synonyms. If I want to refer to an encoding that only contains the ASCII characters and nothing more (the 8th bit is always 0), that's US-ASCII.



    Unicode goes one step further



    Unicode is a great example of a character set - not an encoding. It starts with the same characters as the ASCII standard, but extends the list with many additional characters and gives each character a code point, written in the format U+XXXX. It has the ambition to contain all characters (and popular icons) used in the entire world.

    UTF-8, UTF-16 and UTF-32 are encodings that apply the Unicode character table, but each has a different way of encoding it. UTF-8 uses only 1 byte when encoding an ASCII character, giving the same output as any other ASCII encoding. For all other characters it uses 2 to 4 bytes: the high bits of the first byte indicate how many bytes the sequence has, and every following byte starts with the bits 10.
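To make the variable length of UTF-8 visible, here is a small sketch, assuming a standard JDK. Utf8LengthDemo, utf8Length and utf8Bits are illustrative names, not library API. It prints how many bytes UTF-8 needs for four sample characters, together with the bit pattern of each byte.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class Utf8LengthDemo {
    // Number of bytes UTF-8 needs for the given text.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Render the UTF-8 bytes of s as space-separated 8-digit binary groups.
    static String utf8Bits(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            String bits = Integer.toBinaryString(b & 0xFF);
            // left-pad with zeros to a full 8 digits
            sb.append("00000000".substring(bits.length())).append(bits);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+0041 (A), U+00E9, U+20AC (euro sign), U+1F600 (an emoji)
        String[] samples = {"A", "\u00e9", "\u20ac", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.println(utf8Length(s) + " byte(s): " + utf8Bits(s));
        }
    }
}
```

In the output, a single-byte character starts with 0, a lead byte starts with 110, 1110 or 11110 for 2, 3 or 4 bytes, and each continuation byte starts with 10.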



    GBK is an encoding which, just like UTF-8, uses multiple bytes. The principle is pretty much the same. A byte with the 8th bit clear is a plain ASCII character. But just like with UTF-8, a first byte with the 8th bit set (in the range 0x81 - 0xFE) indicates the presence of a 2nd byte, and the pair together encodes one of about 22,000 Chinese characters. The main difference is that it does not follow the Unicode character set; instead it uses a Chinese national character set (GBK extends GB 2312).
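A short sketch comparing the two, assuming the running JDK ships the GBK charset (full JDK builds normally do, but a trimmed-down runtime may not, in which case Charset.forName throws). GbkVsUtf8Demo and hex are illustrative names only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class GbkVsUtf8Demo {
    // Encode text with the given charset and render the bytes as hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: GBK is available in this runtime.
        Charset gbk = Charset.forName("GBK");
        String zhong = "\u4e2d"; // U+4E2D, the Chinese character "zhong"

        // ASCII characters come out identical in both encodings.
        System.out.println(hex("A", gbk) + " / " + hex("A", StandardCharsets.UTF_8));     // 41 / 41
        // The same Chinese character gets 2 bytes in GBK and 3 in UTF-8.
        System.out.println(hex(zhong, gbk) + " / " + hex(zhong, StandardCharsets.UTF_8)); // D6 D0 / E4 B8 AD
    }
}
```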



    Decoding data



    When you encode your data, you use an encoding, but when you decode data, you will need to know what encoding was used, and use that same encoding to decode it.
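What goes wrong when the decoder picks a different encoding can be sketched as follows, assuming a standard JDK. MojibakeDemo and decode are illustrative names; the String(byte[], Charset) constructor is the real library call.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class MojibakeDemo {
    // Decode raw bytes with the charset the caller believes was used.
    static String decode(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "cafe" with an accented e (U+00E9)
        byte[] data = original.getBytes(StandardCharsets.UTF_8);

        // Same encoding on both sides: the text survives.
        System.out.println(decode(data, StandardCharsets.UTF_8).equals(original)); // true

        // Wrong encoding: the two UTF-8 bytes C3 A9 are read as two Latin-1
        // characters (U+00C3 and U+00A9), so the text has 5 chars instead of 4.
        String garbled = decode(data, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length()); // 5
    }
}
```

No exception is thrown in the wrong case, which is why this kind of bug tends to show up only as strange characters on screen.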



    Unfortunately, encodings aren't always declared or specified. It would have been ideal if all files contained a prefix to indicate what encoding their data was stored in. But in many cases applications still just have to assume or guess what encoding they should use (e.g. they use the default encoding of the operating system).

    There is still a lack of awareness about this, as many developers don't even know what an encoding is.



    MIME types

    MIME types are sometimes confused with encodings. They are a useful way for the receiver to identify what kind of data is arriving. Here is an example of how the HTTP protocol defines its content type using a MIME type declaration.



    Content-Type: text/html; charset=utf-8


    And that's another great source of confusion. A MIME type describes what kind of data a message contains (e.g. text/xml, image/png, ...). And in some cases it will additionally describe how the data is encoded (i.e. charset=utf-8). There are 2 points of confusion:

    1. Not all MIME types declare an encoding. In some cases it is optional, and in others (e.g. image/png) it is completely pointless.

    2. The syntax charset=utf-8 adds to the semantic confusion because, as explained earlier, UTF-8 is an encoding and not a character set. But, as also explained earlier, some people just use the 2 words interchangeably.
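As a rough illustration of how a receiver might use such a header, here is a simplified sketch. ContentTypeCharset and charsetOf are made-up names, and the parsing is deliberately naive (it splits on ";" and does not handle every quoting rule the header grammar allows), so treat it as an assumption-laden example rather than a complete header parser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper class, not part of any library.
class ContentTypeCharset {
    // Return the charset named in a Content-Type value, or the fallback
    // when the parameter is absent or names an unknown charset.
    static Charset charsetOf(String contentType, Charset fallback) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // strip optional surrounding double quotes
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=utf-8", StandardCharsets.ISO_8859_1)); // UTF-8
        System.out.println(charsetOf("image/png", StandardCharsets.ISO_8859_1));                // ISO-8859-1
    }
}
```

The fallback parameter makes the guess explicit: when no charset is declared, the caller has to decide what to assume.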


    For example, an XML document can declare its own encoding, so a text/xml response often carries no charset parameter at all. In that case XML parsers will read the start of the file, looking for the encoding in the <?xml ... encoding="..."?> declaration. If it's there, they will decode the rest of the file using that encoding. (When a charset parameter is present, it should still be honoured rather than ignored.)

    The same problem exists when sending e-mails. An e-mail can contain an HTML message or just plain text. In that case, too, MIME types are used to define the type of the content.



    But in summary, a mime type isn't always sufficient to solve the problem.



    Data types in programming languages



    In the case of Java (and many other programming languages), in addition to the dangers of encodings, there's also the complexity of casting bytes and integers to characters, because their content is stored in different ranges.

    • a byte is signed (range: -128 to 127).

    • the char type in Java is an unsigned 16-bit value, i.e. 2 bytes (range: 0 - 65535)

    • InputStream.read() returns an int in the range -1 to 255.

    If you know that your data only contains ASCII values, then with the proper care you can convert your data from bytes to characters or wrap them immediately in Strings.



    // read() returns -1 when the end of the stream has been reached
    int input = stream.read();
    if (input == -1) throw new EOFException();

    // a Java byte is signed, so mask it with 0xFF to get back a value in 0..255
    byte myByte = (byte) input;
    int unsignedInteger = myByte & 0xFF;
    // this cast is only safe because the data is known to be ASCII
    char ascii = (char) unsignedInteger;


    Shortcuts



    The shortcut in Java is to use readers and writers and to specify the encoding when you instantiate them.



    // wrap your stream in a reader. 
    // specify the encoding
    // The reader will decode the data for you
    Reader reader = new InputStreamReader(inputStream, StandardCharsets.UTF_8);
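A self-contained version of the same idea, as a sketch assuming a standard JDK. ReaderDemo and readAll are illustrative names, and a ByteArrayInputStream stands in for whatever file or socket stream you actually have.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class ReaderDemo {
    // Decode the whole stream into a String using the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[1024];
            int n;
            // read() fills the buffer with decoded chars and returns -1 at end of stream
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "naive" with a diaeresis on the i (U+00EF), followed by a euro sign (U+20AC)
        String original = "na\u00efve \u20ac";
        byte[] data = original.getBytes(StandardCharsets.UTF_8);
        String decoded = readAll(new ByteArrayInputStream(data), StandardCharsets.UTF_8);
        System.out.println(decoded.equals(original)); // true
    }
}
```

The reader handles multi-byte sequences that straddle buffer boundaries, which is the part that is easy to get wrong when decoding bytes by hand.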


    As explained earlier, for XML files it doesn't matter that much, because any decent DOM parser or JAXB unmarshaller will check for the encoding declaration when it is given the raw bytes.



























    • 1





      Just a small note: Since almost all encodings encode the 128 basic ASCII characters in the same way, as long as all used characters are defined in this basic set, you can actually encode/decode your message using almost any random encoding. (e.g. UTF-8, US-ASCII, latin-1, GBK, ...).

      – bvdb
      Nov 27 '15 at 11:18






    • 1





      Also interesting is the BOM (byte order mark), which is used for encodings whose units span multiple bytes (e.g. UTF-16). It indicates which of the bytes comes first (the most or the least significant one). This marker is put in front of the message. Another good reason to use decent Readers.

      – bvdb
      Nov 27 '15 at 11:19













    • The character table of Unicode is an encoding by definition; nevertheless it is double-encoded in, e.g., UTF-8. Therefore it is simply wrong that Unicode has no encoding.

      – Amin Negm-Awad
      Oct 2 '17 at 16:56











    • @AminNegm-Awad I did not write "unicode has no encoding". Unicode has encodings (e.g. UTF-8, UTF-16, etc.), but those are implementations. Unicode itself is pretty much just like "the alphabet": it's just a list of characters. That's why Unicode is not an encoding. An encoding, on the other hand, should describe how the information will be stored in bits and bytes. I have no idea what you mean by "double-encoded" though. Are you referring to the fact that there are multiple Unicode implementations? (Because I do agree on that.)

      – bvdb
      Oct 11 '17 at 12:41






    • 1





      No. "A coded character set is a set of characters for which a unique number has been assigned to each character." This is the same definition I used, from Wikipedia. ;-)

      – Amin Negm-Awad
      May 31 '18 at 5:53



















    2














    Character encoding is what you use to solve the problem of writing software for somebody who uses a different language than you do.



    You don't know what the characters are or how they are ordered. Therefore, you don't know what the strings in this new language will look like in binary and, frankly, you don't care.



    What you do have is a way of translating strings from the language you speak to the language they speak (say a translator). You now need a system that is capable of representing both languages in binary without conflicts. The encoding is that system.



    It is what allows you to write software that works regardless of the way languages are represented in binary.




























      protected by Raedwald Dec 14 '18 at 13:19


































          edited May 16 '12 at 3:27

























          answered May 16 '12 at 3:21









          QuantumMechanic (author of the first answer)

























              34














              ASCII is fundamental



              Originally 1 character was always stored as 1 byte. A byte (8 bits) has the potential to distinct 256 possible values. But in fact only the first 7 bits were used. So only 128 characters were defined. This set is known as the ASCII character set.





              • 0x00 - 0x1F contain steering codes (e.g. CR, LF, STX, ETX, EOT, BEL, ...)


              • 0x20 - 0x40 contain numbers and punctuation


              • 0x41 - 0x7F contain mostly alphabetic characters


              • 0x80 - 0xFF the 8th bit = undefined.


              French, German and many other languages needed additional characters. (e.g. à, é, ç, ô, ...) which were not available in the ASCII character set. So they used the 8th bit to define their characters. This is what is known as "extended ASCII".



              The problem is that the additional 1 bit has not enough capacity to cover all languages in the world. So each region has its own ASCII variant. There are many extended ASCII encodings (latin-1 being a very popular one).



              Popular question: "Is ASCII a character set or is it an encoding" ? ASCII is a character set. However, in programming charset and encoding are wildly used as synonyms. If I want to refer to an encoding that only contains the ASCII characters and nothing more (the 8th bit is always 0): that's US-ASCII.



              Unicode goes one step further



              Unicode is a great example of a character set - not an encoding. It uses the same characters like the ASCII standard, but it extends the list with additional characters, which gives each character a codepoint in format u+xxxx. It has the ambition to contain all characters (and popular icons) used in the entire world.



              UTF-8, UTF-16 and UTF-32 are encodings that apply the Unicode character table. But they each have a slightly different way on how to encode them. UTF-8 will only use 1 byte when encoding an ASCII character, giving the same output as any other ASCII encoding. But for other characters, it will use the first bit to indicate that a 2nd byte will follow.



              GBK is an encoding, which just like UTF-8 uses multiple bytes. The principle is pretty much the same. The first byte follows the ASCII standard, so only 7 bits are used. But just like with UTF-8, The 8th bit can be used to indicate the presence of a 2nd byte, which it then uses to encode one of 22,000 Chinese characters. The main difference, is that this does not follow the Unicode character set, by contrast it uses some Chinese character set.



              Decoding data



              When you encode your data, you use an encoding, but when you decode data, you will need to know what encoding was used, and use that same encoding to decode it.



              Unfortunately, encodings aren't always declared or specified. It would have been ideal if all files contained a prefix to indicate what encoding their data was stored in. But still in many cases applications just have to assume or guess what encoding they should use. (e.g. they use the standard encoding of the operating system).



              There still is a lack of awareness about this, as still many developers don't even know what an encoding is.



              Mime types



              Mime types are sometimes confused with encodings. They are a useful way for the receiver to identify what kind of data is arriving. Here is an example, of how the HTTP protocol defines it's content type using a mime type declaration.



              Content-Type: text/html; charset=utf-8


              And that's another great source of confusion. A mime type describes what kind of data a message contains (e.g. text/xml, image/png, ...). And in some cases it will additionally also describe how the data is encoded (i.e. charset=utf-8). 2 points of confusion:




              1. Not all mime types declare an encoding. In some cases it is only optional or sometimes completely pointless.

              2. The syntax charset=utf-8 adds up to the semantic confusion, because as explained earlier, UTF-8 is an encoding and not a character set. But as explained earlier, some people just use the 2 words interchangeably.


              For example, in the case of text/xml it would be pointless to declare an encoding (and a charset parameter would simply be ignored). Instead, XML parsers in general will read the first line of the file, looking for the <?xml encoding=... tag. If it's there, then they will reopen the file using that encoding.



              The same problem exists when sending e-mails. An e-mail can contain a html message or just plain text. Also in that case mime types are used to define the type of the content.



              But in summary, a mime type isn't always sufficient to solve the problem.



              Data types in programming languages



              In case of Java (and many other programming languages) in addition to the dangers of encodings, there's also the complexity of casting bytes and integers to characters because their content is stored in different ranges.




              • a byte is stored as a signed byte (range: -128 to 127).

              • the char type in java is stored in 2 unsigned bytes (range: 0 - 65535)

              • a stream returns an integer in range -1 to 255.


              If you know that your data only contains ASCII values. Then with the proper skill you can parse your data from bytes to characters or wrap them immediately in Strings.



              // the -1 indicates that there is no data
              int input = stream.read();
              if (input == -1) throw new EOFException();

              // bytes must be made positive first.
              byte myByte = (byte) input;
              int unsignedInteger = myByte & 0xFF;
              char ascii = (char)(unsignedInteger);


              Shortcuts



              The shortcut in java is to use readers and writers and to specify the encoding when you instantiate them.



              // wrap your stream in a reader. 
              // specify the encoding
              // The reader will decode the data for you
              Reader reader = new InputStreamReader(inputStream, StandardCharsets.UTF_8);


              As explained earlier for XML files it doesn't matter that much, because any decent DOM or JAXB marshaller will check for an encoding attribute.






              share|improve this answer





















              • 1





                Just a small note: Since almost all encodings encode the 128 basic ASCII characters in the same way, as long as all used characters are defined in this basic set, you can actually encode/decode your message using almost any random encoding. (e.g. UTF-8, US-ASCII, latin-1, GBK, ...).

                – bvdb
                Nov 27 '15 at 11:18






              • 1





                Also interesting is the BOM (byte-order-mark) which is used for encodings that use multiple bytes (e.g. UTF-16). It indicates which of the bytes is the first one (most significant). This marker-byte is put in front of the message. Another good reason to use decent Readers.

                – bvdb
                Nov 27 '15 at 11:19













              • The character table of Unicode is an encoding by definition, nevertheless it is double-encoded in i. e. UTF-8. Therefore it is simply wrong, that Unicode has no encoding.

                – Amin Negm-Awad
                Oct 2 '17 at 16:56











              • @AminNegm-Awad I did not write "unicode has no encoding". Unicode has encodings: (e.g. UTF-8, UTF-16, ... etc .) Bot those are implementations. Unicode itself is pretty much just like "the alphabet", it's just a list of characters. That's why Unicode is not an encoding. An encoding on the other hand should describe how the information will be stored in bits and bytes. - I have no idea what you mean by "double-encoded" though. Are you referring to the fact that there are multiple unicode implementations ? (because I do agree on that).

                – bvdb
                Oct 11 '17 at 12:41






              • 1





                No "A coded character set is a set of characters for which a unique number has been assigned to each character. " This is the same definition I used from wikipedia. ;-)

                – Amin Negm-Awad
                May 31 '18 at 5:53
















              34














              ASCII is fundamental



              Originally 1 character was always stored as 1 byte. A byte (8 bits) has the potential to distinct 256 possible values. But in fact only the first 7 bits were used. So only 128 characters were defined. This set is known as the ASCII character set.





              • 0x00 - 0x1F contain steering codes (e.g. CR, LF, STX, ETX, EOT, BEL, ...)


              • 0x20 - 0x40 contain numbers and punctuation


              • 0x41 - 0x7F contain mostly alphabetic characters


              • 0x80 - 0xFF the 8th bit = undefined.


              French, German and many other languages needed additional characters. (e.g. à, é, ç, ô, ...) which were not available in the ASCII character set. So they used the 8th bit to define their characters. This is what is known as "extended ASCII".



              The problem is that the additional 1 bit has not enough capacity to cover all languages in the world. So each region has its own ASCII variant. There are many extended ASCII encodings (latin-1 being a very popular one).



              Popular question: "Is ASCII a character set or is it an encoding" ? ASCII is a character set. However, in programming charset and encoding are wildly used as synonyms. If I want to refer to an encoding that only contains the ASCII characters and nothing more (the 8th bit is always 0): that's US-ASCII.



              Unicode goes one step further



              Unicode is a great example of a character set - not an encoding. It uses the same characters like the ASCII standard, but it extends the list with additional characters, which gives each character a codepoint in format u+xxxx. It has the ambition to contain all characters (and popular icons) used in the entire world.



              UTF-8, UTF-16 and UTF-32 are encodings that apply the Unicode character table. But they each have a slightly different way on how to encode them. UTF-8 will only use 1 byte when encoding an ASCII character, giving the same output as any other ASCII encoding. But for other characters, it will use the first bit to indicate that a 2nd byte will follow.



              GBK is an encoding, which just like UTF-8 uses multiple bytes. The principle is pretty much the same. The first byte follows the ASCII standard, so only 7 bits are used. But just like with UTF-8, The 8th bit can be used to indicate the presence of a 2nd byte, which it then uses to encode one of 22,000 Chinese characters. The main difference, is that this does not follow the Unicode character set, by contrast it uses some Chinese character set.



              Decoding data



              When you encode your data, you use an encoding, but when you decode data, you will need to know what encoding was used, and use that same encoding to decode it.



              Unfortunately, encodings aren't always declared or specified. It would have been ideal if all files contained a prefix to indicate what encoding their data was stored in. In many cases, however, applications simply have to assume or guess which encoding to use (e.g. they fall back to the default encoding of the operating system).
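What happens when the guess is wrong can be sketched in a few lines of Java. The class and method names are invented for this example; it encodes a string with one charset and decodes the bytes with another.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class WrongDecodingDemo {
    // Encode the text with one charset, then decode the resulting bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String original = "caf\u00E9";
        String right = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String wrong = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println("same encoding:  length " + right.length());
        System.out.println("wrong encoding: length " + wrong.length());
    }
}
```

Decoding UTF-8 bytes as Latin-1 does not fail; it silently turns the two bytes of "é" into two unrelated characters, which is why such bugs often go unnoticed until someone reads the output.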



              There is still a lack of awareness about this, as many developers don't even know what an encoding is.



              MIME types



              MIME types are sometimes confused with encodings. They are a useful way for the receiver to identify what kind of data is arriving. Here is an example of how the HTTP protocol defines its content type using a MIME type declaration.



              Content-Type: text/html; charset=utf-8


              And that's another great source of confusion. A MIME type describes what kind of data a message contains (e.g. text/xml, image/png, ...). In some cases it additionally describes how the data is encoded (i.e. charset=utf-8). There are two points of confusion:


              1. Not all MIME types declare an encoding. In some cases it is optional, and in others it is completely pointless.

              2. The syntax charset=utf-8 adds to the semantic confusion because, as explained earlier, UTF-8 is an encoding and not a character set. But some people simply use the two words interchangeably.
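Reading the charset parameter out of such a header can be sketched as follows. This is a simplified, hypothetical helper rather than a full header parser: it only splits on semicolons and falls back to a caller-supplied default when no usable charset is declared.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class ContentTypeCharset {
    // Returns the charset named in a Content-Type value, or the fallback if none is usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1; // an assumed default, not a universal rule
        System.out.println(charsetOf("text/html; charset=utf-8", fallback));
        System.out.println(charsetOf("image/png", fallback));
    }
}
```

Note that the second header simply has no encoding to report, which is the first point of confusion in the list above.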


              For example, in the case of text/xml, declaring an encoding in the MIME type is often redundant, because XML can declare its own encoding. XML parsers generally inspect the start of the document, looking for the <?xml ... encoding="..."?> declaration, and if it is present they decode the rest of the document using that encoding. (Whether an external charset parameter takes precedence depends on the parser and the protocol.)
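The following sketch shows a standard Java DOM parser picking up the encoding from the XML declaration by itself. The sample document and the helper names are made up for illustration; the parser is handed raw bytes and is never told which encoding they use.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class XmlEncodingDemo {
    // A tiny document whose bytes are Latin-1 and whose declaration says so.
    static byte[] sample() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><word>caf\u00E9</word>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Parse from raw bytes; the parser has to work out the encoding itself.
    static Document parse(byte[] xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    static String declaredEncoding(byte[] xml) {
        return parse(xml).getXmlEncoding();
    }

    static String rootText(byte[] xml) {
        return parse(xml).getDocumentElement().getTextContent();
    }

    public static void main(String[] args) {
        System.out.println("declared encoding: " + declaredEncoding(sample()));
        System.out.println("text length: " + rootText(sample()).length());
    }
}
```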



              The same problem exists when sending e-mails. An e-mail can contain an HTML message or just plain text. In that case, too, MIME types are used to define the type of the content.



              But in summary, a mime type isn't always sufficient to solve the problem.



              Data types in programming languages



              In the case of Java (and many other programming languages), in addition to the dangers of encodings, there's also the complexity of converting bytes and integers to characters, because their values are stored in different ranges.


              • a byte is stored as a signed byte (range: -128 to 127).

              • the char type in Java is stored in 2 unsigned bytes (range: 0 to 65535).

              • an InputStream's read() returns an int in the range -1 to 255.


              If you know that your data contains only ASCII values, you can, with some care, convert your bytes to characters one by one or wrap them directly in Strings.



              // the -1 indicates that there is no data
              int input = stream.read();
              if (input == -1) throw new EOFException();

              // bytes must be made positive first.
              byte myByte = (byte) input;
              int unsignedInteger = myByte & 0xFF;
              char ascii = (char)(unsignedInteger);
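The range mismatch and the "wrap them in Strings" route can be sketched like this. The helper names are invented for the example, and the String conversion assumes the data really is pure ASCII.

```java
import java.nio.charset.StandardCharsets;

final class ByteRangeDemo {
    // Map a signed Java byte (-128..127) to its unsigned value (0..255).
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    // Wrap ASCII-only bytes in a String, naming the charset explicitly.
    static String asciiToString(byte[] data) {
        return new String(data, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte high = (byte) 0xE9;
        System.out.println("signed:   " + high);
        System.out.println("unsigned: " + toUnsigned(high));
        System.out.println(asciiToString(new byte[] { 72, 105 }));
    }
}
```

Passing the charset explicitly avoids depending on whatever the platform default happens to be.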


              Shortcuts



              The shortcut in Java is to use readers and writers and to specify the encoding when you instantiate them.



              // wrap your stream in a reader. 
              // specify the encoding
              // The reader will decode the data for you
              Reader reader = new InputStreamReader(inputStream, StandardCharsets.UTF_8);
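A slightly fuller sketch of the same shortcut, writing and reading with the same explicitly named encoding, could look like this. In-memory streams stand in for real files or sockets, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class ReaderWriterDemo {
    // Encode text to bytes through a Writer that knows the target encoding.
    static byte[] write(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray(); // the writer is closed (and flushed) by now
    }

    // Decode bytes back to text through a Reader using the same encoding.
    static String read(byte[] data) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = write("caf\u00E9 \u20AC");
        System.out.println("bytes written: " + bytes.length);
        System.out.println("round trip ok: " + read(bytes).equals("caf\u00E9 \u20AC"));
    }
}
```

The reader returns chars rather than raw bytes, so none of the manual sign and range handling is needed.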


              As explained earlier, for XML files it matters less, because any decent DOM parser or JAXB unmarshaller will check for an encoding declaration.



























              • 1





                Just a small note: Since almost all encodings encode the 128 basic ASCII characters in the same way, as long as all used characters are defined in this basic set, you can actually encode/decode your message using almost any random encoding. (e.g. UTF-8, US-ASCII, latin-1, GBK, ...).

                – bvdb
                Nov 27 '15 at 11:18






              • 1





                Also interesting is the BOM (byte-order-mark) which is used for encodings that use multiple bytes (e.g. UTF-16). It indicates which of the bytes is the first one (most significant). This marker-byte is put in front of the message. Another good reason to use decent Readers.

                – bvdb
                Nov 27 '15 at 11:19













              • The character table of Unicode is an encoding by definition, nevertheless it is double-encoded in i. e. UTF-8. Therefore it is simply wrong, that Unicode has no encoding.

                – Amin Negm-Awad
                Oct 2 '17 at 16:56











              • @AminNegm-Awad I did not write "unicode has no encoding". Unicode has encodings (e.g. UTF-8, UTF-16, etc.), but those are implementations. Unicode itself is pretty much just like "the alphabet"; it's just a list of characters. That's why Unicode is not an encoding. An encoding, on the other hand, should describe how the information will be stored in bits and bytes. I have no idea what you mean by "double-encoded" though. Are you referring to the fact that there are multiple Unicode implementations? (Because I do agree on that.)

                – bvdb
                Oct 11 '17 at 12:41






              • 1





                No "A coded character set is a set of characters for which a unique number has been assigned to each character. " This is the same definition I used from wikipedia. ;-)

                – Amin Negm-Awad
                May 31 '18 at 5:53





































              edited Jan 1 at 21:38

























              answered Aug 1 '15 at 10:47









              bvdb



















              2














              Character encoding is what you use to solve the problem of writing software for somebody who uses a different language than you do.



              You don't know what the characters are or how they are ordered. Therefore, you don't know what the strings in this new language will look like in binary, and frankly, you don't care.



              What you do have is a way of translating strings from the language you speak to the language they speak (say a translator). You now need a system that is capable of representing both languages in binary without conflicts. The encoding is that system.



              It is what allows you to write software that works regardless of the way languages are represented in binary.












































                  answered May 16 '12 at 3:31









                  Carl

















                      protected by Raedwald Dec 14 '18 at 13:19


