Why does this unicode character end up as 6 bytes with UTF-16 encoding?
I was playing with a code snippet from the accepted answer to this question. I simply added a byte array to use UTF-16 as follows:
import java.nio.charset.StandardCharsets;
final char[] chars = Character.toChars(0x1F701);
final String s = new String(chars);
final byte[] asBytes = s.getBytes(StandardCharsets.UTF_8);
final byte[] asBytes16 = s.getBytes(StandardCharsets.UTF_16);
chars has 2 elements, which means two 16-bit integers in Java (since the code point is outside of the BMP).
asBytes has 4 elements, which corresponds to 32 bits, which is what we'd need to represent two 16-bit integers from chars, so it makes sense.
asBytes16 has 6 elements, which is what confuses me. Why do we end up with 2 extra bytes when 32 bits is sufficient to represent this unicode character?
java unicode
edited Jan 4 at 14:20 by Boann
asked Jan 4 at 11:59 by mahonya
What are the actual bytes? I bet there's a byte order mark (BOM) in the UTF-16 one.
– Daniel Pryden
Jan 4 at 12:09
2 Answers
The UTF-16 bytes start with the byte order mark (BOM) FE FF, which indicates that the value is encoded in big-endian order. As per Wikipedia, the BOM is also used to distinguish UTF-16 from UTF-8:
Neither of these sequences is valid UTF-8, so their presence indicates that the file is not encoded in UTF-8.
You can convert a byte[] to a hex-encoded String, as per this answer, to see the actual bytes:
asBytes = F09F9C81
asBytes16 = FEFFD83DDF01
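For anyone who wants to reproduce the dump above, here is a minimal sketch. The class name HexDump and the helper toHex are made up for illustration (they are not from the linked answer); only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

public class HexDump {
    // Hypothetical helper: formats each byte as two uppercase hex digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F701));
        // UTF-8: four bytes, no BOM
        System.out.println("asBytes   = " + toHex(s.getBytes(StandardCharsets.UTF_8)));
        // UTF-16: two-byte BOM followed by the two surrogates
        System.out.println("asBytes16 = " + toHex(s.getBytes(StandardCharsets.UTF_16)));
    }
}
```

The first two bytes of the UTF-16 output are the BOM; the remaining four are the character itself.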
answered Jan 4 at 12:16 by Karol Dowbecki
Thanks, it is the BOM indeed
– mahonya
Jan 4 at 12:33
asBytes has 4 elements, which corresponds to 32 bits, which is what we'd need to represent two 16-bit integers from chars, so it makes sense.
Actually no, the number of chars needed to represent a codepoint in Java has nothing to do with it. The number of bytes is directly related to the numeric value of the codepoint itself.
Codepoint U+1F701 (0x1F701) uses 17 bits (11111011100000001)
0x1F701 requires 4 bytes in UTF-8 (F0 9F 9C 81) to encode its 17 bits. See the bit distribution chart on Wikipedia. The algorithm is defined in RFC 3629.
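To make the bit distribution concrete, here is a rough sketch that encodes the code point by hand using the 4-byte pattern 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx and compares the result with what the JDK produces. The class Utf8ByHand and the method encode4 are hypothetical names, and the sketch only handles code points in the 4-byte range.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8ByHand {
    // Hypothetical helper: encodes a supplementary code point
    // (U+10000..U+10FFFF) into its 4-byte UTF-8 form.
    static byte[] encode4(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not in the 4-byte range");
        }
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),          // 11110xxx: top 3 bits
            (byte) (0x80 | ((cp >> 12) & 0x3F)), // 10xxxxxx: next 6 bits
            (byte) (0x80 | ((cp >> 6) & 0x3F)),  // 10xxxxxx: next 6 bits
            (byte) (0x80 | (cp & 0x3F))          // 10xxxxxx: low 6 bits
        };
    }

    public static void main(String[] args) {
        int cp = 0x1F701;
        // Number of significant bits in the code point
        System.out.println(Integer.toBinaryString(cp).length());
        byte[] manual = encode4(cp);
        byte[] library = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        // true if the hand-rolled bytes match the JDK's encoder
        System.out.println(Arrays.equals(manual, library));
    }
}
```

The 4 bytes carry 21 payload bits in total, which is why a 17-bit value still needs all 4 of them: the 3-byte form only has room for 16 bits.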
asBytes16 has 6 elements, which is what confuses me. Why do we end up with 2 extra bytes when 32 bits is sufficient to represent this unicode character?
Per the Java documentation for StandardCharsets:
UTF_16
public static final Charset UTF_16
Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark
0x1F701 requires 4 bytes in UTF-16 (D8 3D DF 01) to encode its 17 bits. See the bit distribution chart on Wikipedia. The algorithm is defined in RFC 2781.
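The surrogate pair can be worked out by hand as well. The sketch below follows the usual UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves); SurrogatesByHand and toSurrogates are made-up names, and the input is assumed to be a code point above U+FFFF.

```java
import java.util.Arrays;

public class SurrogatesByHand {
    // Hypothetical helper: splits a supplementary code point into
    // its {high, low} UTF-16 surrogates.
    static char[] toSurrogates(int cp) {
        int v = cp - 0x10000;                       // 20-bit value
        char high = (char) (0xD800 | (v >> 10));    // top 10 bits
        char low = (char) (0xDC00 | (v & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F701);
        System.out.println(String.format("%04X %04X", (int) pair[0], (int) pair[1]));
        // true if the result matches the JDK's own conversion
        System.out.println(Arrays.equals(pair, Character.toChars(0x1F701)));
    }
}
```

Written out in big-endian order, those two 16-bit units are the D8 3D DF 01 bytes quoted above.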
UTF-16 is subject to endianness, unlike UTF-8, so encoding with StandardCharsets.UTF_16 includes a BOM to specify the actual byte order used in the byte array.
To avoid the BOM, use StandardCharsets.UTF_16BE or StandardCharsets.UTF_16LE as needed:
UTF_16BE
public static final Charset UTF_16BE
Sixteen-bit UCS Transformation Format, big-endian byte order
UTF_16LE
public static final Charset UTF_16LE
Sixteen-bit UCS Transformation Format, little-endian byte order
Since their endianness is implied in their names, they don't need to include a BOM in the byte array.
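A quick way to see the difference is to compare the array lengths for the three charsets. This is only a sketch; the class name Utf16Variants and the helper encodedLength are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16Variants {
    // Hypothetical helper: number of bytes produced for s in the given charset.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F701));
        // With BOM: 2 bytes of BOM + 4 bytes of data
        System.out.println("UTF_16:   " + encodedLength(s, StandardCharsets.UTF_16));
        // Explicit byte order: data only, no BOM
        System.out.println("UTF_16BE: " + encodedLength(s, StandardCharsets.UTF_16BE));
        System.out.println("UTF_16LE: " + encodedLength(s, StandardCharsets.UTF_16LE));
    }
}
```

With UTF_16LE the two bytes of each 16-bit unit are swapped, so the array starts with 3D D8 rather than D8 3D.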
edited Jan 7 at 21:32
answered Jan 7 at 17:14 by Remy Lebeau