UTF-16

UTF-16 represents non-BMP characters (those from U+10000 through U+10FFFF) using a pair of 16-bit words, known as a surrogate pair. First 0x10000 is subtracted from the code point to give a 20-bit value. This value is then split into two 10-bit halves, each of which is encoded as a surrogate, with the most significant half placed in the first surrogate. To allow safe use of simple word-oriented string processing, separate ranges of values are used for the two surrogates: 0xD800–0xDBFF for the first, most significant surrogate and 0xDC00–0xDFFF for the second, least significant surrogate.

For example, the character at code point U+10000 becomes the code unit sequence 0xD800 0xDC00, and the character at U+10FFFF, the highest code point in Unicode, becomes the sequence 0xDBFF 0xDFFF.

Lead \ Trail     DC00     DC01    …      DFFF
D800           010000   010001   …    0103FF
D801           010400   010401   …    0107FF
  ⋮               ⋮        ⋮     ⋱       ⋮
DBFF           10FC00   10FC01   …    10FFFF
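
This mapping can be expressed directly in code. The following C sketch (the function name encode_surrogate_pair is illustrative, not taken from any particular library) performs the subtraction and the 10-bit split described above:

#include <assert.h>
#include <stdint.h>

/* Sketch: encode a non-BMP code point (U+10000..U+10FFFF) as a UTF-16
   surrogate pair, writing the lead and trail code units. */
static void encode_surrogate_pair(uint32_t cp, uint16_t *lead, uint16_t *trail)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    uint32_t v = cp - 0x10000;        /* 20-bit value */
    *lead  = 0xD800 | (v >> 10);      /* most significant 10 bits */
    *trail = 0xDC00 | (v & 0x3FF);    /* least significant 10 bits */
}

For U+10000 this yields 0xD800 0xDC00, and for U+10FFFF it yields 0xDBFF 0xDFFF, matching the table above.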
Example UTF-16 encoding procedure

The character at code point U+64321 (hexadecimal) is to be encoded in UTF-16. Since it is above U+FFFF, it must be encoded with a surrogate pair, as follows:

v = 0x64321
v′ = v - 0x10000
= 0x54321
= 0101 0100 0011 0010 0001

vh = 01 0101 0000 // higher 10 bits of v′
vl = 11 0010 0001 // lower 10 bits of v′
w1 = 0xD800 // the 1st word is initialized with the lead surrogate base, to receive the high bits
w2 = 0xDC00 // the 2nd word is initialized with the trail surrogate base, to receive the low bits

w1 = w1 | vh
= 1101 1000 0000 0000
| 01 0101 0000
= 1101 1001 0101 0000
= 0xD950

w2 = w2 | vl
= 1101 1100 0000 0000
| 11 0010 0001
= 1101 1111 0010 0001
= 0xDF21

The correct UTF-16 encoding for this character is thus the following word sequence:

0xD950 0xDF21
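
The same steps can be checked mechanically. A minimal, self-contained C program (a sketch using only standard headers) reproduces the two words derived above:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t v  = 0x64321;                         /* code point U+64321 */
    uint32_t vp = v - 0x10000;                     /* v' = 0x54321 */
    uint16_t w1 = 0xD800 | (uint16_t)(vp >> 10);   /* lead surrogate  */
    uint16_t w2 = 0xDC00 | (uint16_t)(vp & 0x3FF); /* trail surrogate */
    printf("0x%04X 0x%04X\n", w1, w2);             /* prints 0xD950 0xDF21 */
    return 0;
}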

Since the character lies above U+FFFF, it cannot be represented in UCS-2.
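
Decoding reverses the procedure: subtract the surrogate bases 0xD800 and 0xDC00, recombine the two 10-bit halves, and add 0x10000 back. A minimal C sketch of this inverse step (the function name decode_surrogate_pair is illustrative):

#include <stdint.h>

/* Sketch: recombine a lead/trail surrogate pair into the original
   code point; the inverse of the encoding steps shown above. */
static uint32_t decode_surrogate_pair(uint16_t lead, uint16_t trail)
{
    /* caller ensures lead is in 0xD800..0xDBFF and trail in 0xDC00..0xDFFF */
    return 0x10000 + (((uint32_t)(lead - 0xD800) << 10) | (uint32_t)(trail - 0xDC00));
}

Applied to the pair 0xD950 0xDF21 computed above, this returns 0x64321 as expected.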