Why does this unicode character end up as 6 bytes with UTF-16 encoding?


I was playing with a code snippet from the accepted answer to this question. I simply added a byte array to use UTF-16 as follows:

final char[] chars = Character.toChars(0x1F701);
final String s = new String(chars);
final byte[] asBytes = s.getBytes(StandardCharsets.UTF_8);
final byte[] asBytes16 = s.getBytes(StandardCharsets.UTF_16);

chars has 2 elements, which means two 16-bit integers in Java (since the code point is outside of the BMP).

asBytes has 4 elements, which corresponds to 32 bits, which is what we'd need to represent two 16-bit integers from chars, so it makes sense.

asBytes16 has 6 elements, which is what confuses me. Why do we end up with 2 extra bytes when 32 bits is sufficient to represent this unicode character?

edited Jan 4 at 14:20 by Boann

asked Jan 4 at 11:59 by mahonya

What are the actual bytes? I bet there's a byte order mark (BOM) in the UTF-16 one.

– Daniel Pryden
Jan 4 at 12:09

java unicode


2 Answers

The UTF-16 bytes start with the byte order mark (BOM) FE FF, which indicates that the value is encoded in big-endian order. Per Wikipedia, the BOM is also used to distinguish UTF-16 from UTF-8:

"Neither of these sequences is valid UTF-8, so their presence indicates that the file is not encoded in UTF-8."

You can convert a byte[] to a hex-encoded String as per this answer:

asBytes   = F09F9C81

asBytes16 = FEFFD83DDF01
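One way to produce a dump like the one above is sketched here. The class name HexDumpSketch and the helper toHex are made up for illustration and are not taken from the linked answer; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

class HexDumpSketch {
    // Hypothetical helper: formats each byte as two uppercase hex digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F701));
        // prints "asBytes   = F09F9C81"
        System.out.println("asBytes   = " + toHex(s.getBytes(StandardCharsets.UTF_8)));
        // prints "asBytes16 = FEFFD83DDF01"
        System.out.println("asBytes16 = " + toHex(s.getBytes(StandardCharsets.UTF_16)));
    }
}
```

The first two bytes of the UTF-16 output (FE FF) are the BOM; the remaining four are the encoded character.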

answered Jan 4 at 12:16 by Karol Dowbecki

Thanks, it is the BOM indeed

– mahonya
Jan 4 at 12:33


"asBytes has 4 elements, which corresponds to 32 bits, which is what we'd need to represent two 16-bit integers from chars, so it makes sense."

Actually no, the number of chars needed to represent a codepoint in Java has nothing to do with it. The number of bytes is directly related to the numeric value of the codepoint itself.

Codepoint U+1F701 (0x1F701) uses 17 bits (11111011100000001)

0x1F701 requires 4 bytes in UTF-8 (F0 9F 9C 81) to encode its 17 bits. See the bit distribution chart on Wikipedia. The algorithm is defined in RFC 3629.
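As a rough illustration of that bit distribution, the sketch below packs a supplementary code point into the four-byte UTF-8 layout by hand and compares the result with what the JDK produces. The class and method names (Utf8BitsSketch, encodeFourByte) are hypothetical, and the helper only handles the four-byte case.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8BitsSketch {
    // Hypothetical helper: encodes a code point in U+10000..U+10FFFF using the
    // four-byte layout 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
    static byte[] encodeFourByte(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),          // top 3 bits
            (byte) (0x80 | ((cp >> 12) & 0x3F)), // next 6 bits
            (byte) (0x80 | ((cp >> 6) & 0x3F)),  // next 6 bits
            (byte) (0x80 | (cp & 0x3F))          // low 6 bits
        };
    }

    public static void main(String[] args) {
        int cp = 0x1F701;
        byte[] manual = encodeFourByte(cp);
        byte[] library = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        System.out.println(Integer.toBinaryString(cp).length()); // prints 17
        System.out.println(Arrays.equals(manual, library));      // prints true
    }
}
```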

"asBytes16 has 6 elements, which is what confuses me. Why do we end up with 2 extra bytes when 32 bits is sufficient to represent this unicode character?"

Per the Java documentation for StandardCharsets:

UTF_16
public static final Charset UTF_16
Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark

0x1F701 requires 4 bytes in UTF-16 (D8 3D DF 01) to encode its 17 bits. See the bit distribution chart on Wikipedia. The algorithm is defined in RFC 2781.
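The surrogate-pair arithmetic behind those four bytes can be sketched as follows. The names SurrogateSketch and toSurrogates are illustrative only; in real code Character.toChars already does this.

```java
class SurrogateSketch {
    // Hypothetical helper: splits a supplementary code point into a
    // high (lead) and low (trail) surrogate.
    static char[] toSurrogates(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int v = cp - 0x10000;                      // 20-bit value
        char high = (char) (0xD800 + (v >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F701);
        // prints "D83D DF01"
        System.out.println(String.format("%04X %04X", (int) pair[0], (int) pair[1]));
        // prints true: matches the JDK's own conversion
        System.out.println(java.util.Arrays.equals(pair, Character.toChars(0x1F701)));
    }
}
```

Each surrogate is one 16-bit code unit, so the pair occupies 4 bytes before any BOM is added.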

Unlike UTF-8, UTF-16 is subject to endianness, so StandardCharsets.UTF_16 includes a BOM to specify the byte order actually used in the byte array.
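To see the BOM doing its job on the decoding side, here is a small sketch. The class BomDecodeSketch and its decode helper are made up for illustration, and the byte arrays are written out by hand. Decoding with StandardCharsets.UTF_16 reads the BOM to pick the byte order, so both arrays yield the same String.

```java
import java.nio.charset.StandardCharsets;

class BomDecodeSketch {
    // Hypothetical helper: decodes bytes with the BOM-aware UTF_16 charset.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String expected = new String(Character.toChars(0x1F701));
        // BOM FE FF followed by big-endian code units D83D DF01
        byte[] bigEndian = {(byte) 0xFE, (byte) 0xFF, (byte) 0xD8, 0x3D, (byte) 0xDF, 0x01};
        // BOM FF FE followed by the same code units in little-endian order
        byte[] littleEndian = {(byte) 0xFF, (byte) 0xFE, 0x3D, (byte) 0xD8, 0x01, (byte) 0xDF};
        System.out.println(decode(bigEndian).equals(expected));    // prints true
        System.out.println(decode(littleEndian).equals(expected)); // prints true
    }
}
```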

To avoid the BOM, use StandardCharsets.UTF_16BE or StandardCharsets.UTF_16LE as needed:

UTF_16BE
public static final Charset UTF_16BE
Sixteen-bit UCS Transformation Format, big-endian byte order

UTF_16LE
public static final Charset UTF_16LE
Sixteen-bit UCS Transformation Format, little-endian byte order

Since their byte order is implied by their names, they don't need to include a BOM in the byte array.
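A quick comparison of the three charsets, as a minimal sketch (the class Utf16VariantsSketch and its hex helper are hypothetical names):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf16VariantsSketch {
    // Hypothetical helper: encodes s with the given charset and returns uppercase hex.
    static String hex(String s, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(charset)) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F701));
        System.out.println(hex(s, StandardCharsets.UTF_16));   // prints FEFFD83DDF01 (6 bytes, with BOM)
        System.out.println(hex(s, StandardCharsets.UTF_16BE)); // prints D83DDF01 (4 bytes)
        System.out.println(hex(s, StandardCharsets.UTF_16LE)); // prints 3DD801DF (4 bytes)
    }
}
```

With UTF_16BE or UTF_16LE the array has the 4 elements the question expected.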

edited Jan 7 at 21:32

answered Jan 7 at 17:14 by Remy Lebeau
