Why does this unicode character end up as 6 bytes with UTF-16 encoding?




















I was playing with a code snippet from the accepted answer to this question. I simply added a byte array to use UTF-16 as follows:



final char[] chars = Character.toChars(0x1F701);
final String s = new String(chars);
final byte[] asBytes = s.getBytes(StandardCharsets.UTF_8);
final byte[] asBytes16 = s.getBytes(StandardCharsets.UTF_16);


chars has 2 elements, which means two 16-bit integers in Java (since the code point is outside of the BMP).



asBytes has 4 elements, which corresponds to 32 bits, which is what we'd need to represent two 16-bit integers from chars, so it makes sense.



asBytes16 has 6 elements, which is what confuses me. Why do we end up with 2 extra bytes when 32 bits are sufficient to represent this Unicode character?






























  • 5





    What are the actual bytes? I bet there's a byte order mark (BOM) in the UTF-16 one.

    – Daniel Pryden
    Jan 4 at 12:09

























java unicode














edited Jan 4 at 14:20 by Boann










asked Jan 4 at 11:59 by mahonya




















2 Answers














The UTF-16 bytes start with the byte order mark (BOM) FE FF, which indicates that the value is encoded in big-endian byte order. According to Wikipedia, the BOM is also used to distinguish UTF-16 from UTF-8:




Neither of these sequences is valid UTF-8, so their presence indicates that the file is not encoded in UTF-8.




You can convert a byte[] to a hex-encoded String as per this answer:



asBytes   = F09F9C81
asBytes16 = FEFFD83DDF01
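
For reference, here is a minimal sketch of one way to produce that hex dump. The class name HexDump and the helper toHex are made up for illustration and are not taken from the linked answer; any byte-to-hex routine would do.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class HexDump {
    // Formats each byte as two uppercase hex digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F701));
        // Prints: asBytes   = F09F9C81
        System.out.println("asBytes   = " + toHex(s.getBytes(StandardCharsets.UTF_8)));
        // Prints: asBytes16 = FEFFD83DDF01
        System.out.println("asBytes16 = " + toHex(s.getBytes(StandardCharsets.UTF_16)));
    }
}
```

The first two bytes of the UTF-16 output, FE FF, are the BOM; the remaining four are the character itself.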





























  • Thanks, it is the BOM indeed

    – mahonya
    Jan 4 at 12:33


































"asBytes has 4 elements, which corresponds to 32 bits, which is what we'd need to represent two 16-bit integers from chars, so it makes sense."




Actually, no. The number of chars Java needs to represent a code point has nothing to do with it. The number of UTF-8 bytes depends directly on the numeric value of the code point itself.



Code point U+1F701 (0x1F701) uses 17 bits (11111011100000001).



0x1F701 requires 4 bytes in UTF-8 (F0 9F 9C 81) to encode its 17 bits. See the bit distribution chart on Wikipedia. The algorithm is defined in RFC 3629.
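
To make the bit distribution concrete, here is a rough sketch that encodes the code point by hand using the four-byte UTF-8 pattern (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx) and compares the result with what the JDK produces. The class and method names (Utf8ByHand, encodeFourByte) are hypothetical, and the method deliberately handles only supplementary code points.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; covers only the 4-byte UTF-8 form.
class Utf8ByHand {
    // Encodes a supplementary code point (U+10000..U+10FFFF) as
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
    static byte[] encodeFourByte(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        return new byte[] {
            (byte) (0xF0 | (codePoint >> 18)),           // top 3 bits
            (byte) (0x80 | ((codePoint >> 12) & 0x3F)),  // next 6 bits
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)),   // next 6 bits
            (byte) (0x80 | (codePoint & 0x3F))           // low 6 bits
        };
    }

    public static void main(String[] args) {
        int cp = 0x1F701;
        // Number of significant bits in the code point: prints 17
        System.out.println(Integer.toBinaryString(cp).length());
        byte[] manual = encodeFourByte(cp);
        byte[] library = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        // The hand-rolled bytes match the JDK encoder: prints true
        System.out.println(Arrays.equals(manual, library));
    }
}
```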




"asBytes16 has 6 elements, which is what confuses me. Why do we end up with 2 extra bytes when 32 bits is sufficient to represent this Unicode character?"




Per the Java documentation for StandardCharsets:




UTF_16



public static final Charset UTF_16


Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark




0x1F701 requires 4 bytes in UTF-16 (D8 3D DF 01) to encode its 17 bits. See the bit distribution chart on Wikipedia. The algorithm is defined in RFC 2781.
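
As a sketch of the surrogate-pair arithmetic behind those four bytes, the following splits the code point by hand and checks the result against Character.toChars. The class and method names (SurrogatesByHand, toSurrogates) are invented for this example.

```java
import java.util.Arrays;

// Hypothetical illustration class for the UTF-16 surrogate calculation.
class SurrogatesByHand {
    // Splits a supplementary code point into a high and a low surrogate.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F701);
        // Prints: D83D DF01
        System.out.println(String.format("%04X %04X", (int) pair[0], (int) pair[1]));
        // Same result as the JDK: prints true
        System.out.println(Arrays.equals(pair, Character.toChars(0x1F701)));
    }
}
```

Each surrogate is one 16-bit code unit, which is why the character itself occupies exactly 4 bytes in UTF-16.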



UTF-16 is subject to endianness, unlike UTF-8, so encoding with StandardCharsets.UTF_16 includes a BOM to specify the actual byte order used in the byte array. Java's UTF_16 encoder writes big-endian bytes, hence the FE FF prefix.



To avoid the BOM, use StandardCharsets.UTF_16BE or StandardCharsets.UTF_16LE as needed:




UTF_16BE



public static final Charset UTF_16BE


Sixteen-bit UCS Transformation Format, big-endian byte order



UTF_16LE



public static final Charset UTF_16LE


Sixteen-bit UCS Transformation Format, little-endian byte order




Since their byte order is implied by their names, they don't include a BOM in the byte array.
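
A small sketch comparing the three charsets on the same string follows. The class name BomDemo and the helper encodedLength are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class comparing UTF_16, UTF_16BE and UTF_16LE.
class BomDemo {
    // Returns how many bytes the string occupies in the given charset.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F701));
        // BOM (FE FF) plus the surrogate pair: prints 6
        System.out.println(encodedLength(s, StandardCharsets.UTF_16));
        // Surrogate pair only, big-endian (D8 3D DF 01): prints 4
        System.out.println(encodedLength(s, StandardCharsets.UTF_16BE));
        // Surrogate pair only, little-endian (3D D8 01 DF): prints 4
        System.out.println(encodedLength(s, StandardCharsets.UTF_16LE));
    }
}
```

When the BOM is omitted, the reader has to know the byte order by some other means, so the decoding side should use the same explicit charset.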






answered Jan 4 at 12:16 by Karol Dowbecki (first answer)













answered Jan 7 at 17:14 by Remy Lebeau (second answer), edited Jan 7 at 21:32





























