        Return WordToUInt32(Leading)
    Else If (Leading >= $DC00) Then
        Error("Invalid code sequence.")
    Else
        Var Code: UInt32
        Code = WordToUInt32(Leading And $3FF) Shl 10
        Trailing = ReadWord()
        If ((Trailing < $DC00) Or (Trailing > $DFFF)) Then
            Error("Invalid code sequence.")
        Else
            Code = Code Or WordToUInt32(Trailing And $3FF)
            Return (Code + $10000)
        End If
    End If
End Function
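The same decoding steps can be sketched in runnable Python. This is only an illustration of the pseudocode: the name `read_utf16_char` and the use of an iterator of 16-bit integers in place of `ReadWord()` are assumptions made for the example.

```python
def read_utf16_char(words):
    """Decode one code point from an iterator of 16-bit code units."""
    leading = next(words)
    if leading < 0xD800 or leading > 0xDFFF:
        return leading                     # ordinary character, one code unit
    if leading >= 0xDC00:
        raise ValueError("trailing surrogate without a leading one")
    code = (leading & 0x3FF) << 10         # upper 10 bits of the result
    trailing = next(words)
    if trailing < 0xDC00 or trailing > 0xDFFF:
        raise ValueError("leading surrogate without a trailing one")
    code |= trailing & 0x3FF               # lower 10 bits of the result
    return code + 0x10000


print(hex(read_utf16_char(iter([0xD83D, 0xDE00]))))  # 0x1f600
```

The surrogate pair D83D DE00 yields U+1F600, while a lone code unit such as 0x0041 is returned unchanged.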
UTF-8
UTF-8 (Unicode Transformation Format, 8-bit) is a Unicode character encoding built on 8-bit code units.
UTF-16, UTF-8 : .
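Text that consists only of characters with codes below 128 becomes plain ASCII text when written in UTF-8. Conversely, in UTF-8 text any byte with a value below 128 represents the ASCII character with the same code. All other characters are represented by sequences of 2 to 6 bytes (in practice only up to 4 bytes, since codes above 2^21 are not used), in which the first byte always has the form 11xxxxxx and the remaining bytes have the form 10xxxxxx.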
, UTF-8 , ASCII US-ASCII, a 1. .
, ( ) , UTF-8 UTF-16.
, UTF-16 , .
, UTF-16, UCS-2.
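UTF-8 was invented on 2 September 1992 by Ken Thompson and Rob Pike and implemented in Plan 9. The UTF-8 standard is now officially specified in RFC 3629 and ISO/IEC 10646 Annex D.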
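Note: characters encoded in UTF-8 can be up to six bytes long, but the Unicode standard defines no characters above 0x10FFFF, so a Unicode character occupies at most 4 bytes in UTF-8.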
UTF-8 can encode values from 0 to 0x7FFFFFFF (the non-negative half of the 32-bit integer range).
1. The basic unit of the encoding is the 8-bit byte (octet). A character is encoded with 1 to 6 bytes.
2. ASCII characters (0x00 to 0x7F) are encoded as themselves, in a single byte.
3. . , .
4. In the encoding of a non-ASCII character, every byte has its high bit set to 1.
5. A character occupies at most 6 bytes.
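In the first byte of a multibyte sequence, the number of leading 1 bits equals the number of bytes in the sequence, and these bits are followed by a 0 bit. The remaining bits of the first byte hold the high-order bits of the character code. Each following byte carries 6 bits of the code after the marker bits 10. The bytes 11111110 (0xFE) and 11111111 (0xFF) are never used in UTF-8.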
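To encode a character, its code is split into groups of 6 bits. The code is treated as a 32-bit number. The last byte of the sequence receives bits 0..5 of the code. The bytes before it receive, in turn, bits 6..11, 12..17, 18..23 and 24..29. The first byte receives whatever high-order bits remain.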
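The bit layout described above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name `utf8_encode` is hypothetical, and the sketch follows the original 1 to 6 byte scheme rather than the 4-byte limit of RFC 3629.

```python
def utf8_encode(code):
    """Encode one value in 0..0x7FFFFFFF using the original 1-6 byte scheme."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("value out of range")
    if code < 0x80:
        return bytes([code])               # ASCII: one byte, unchanged
    # (sequence length, first value that no longer fits, lead-byte marker)
    forms = ((2, 0x800, 0xC0), (3, 0x10000, 0xE0), (4, 0x200000, 0xF0),
             (5, 0x4000000, 0xF8), (6, 0x80000000, 0xFC))
    for length, limit, marker in forms:
        if code < limit:
            break
    out = []
    for _ in range(length - 1):
        out.append(0x80 | (code & 0x3F))   # bits 0..5 first, then 6..11, ...
        code >>= 6
    out.append(marker | code)              # remaining high bits go in the lead byte
    return bytes(reversed(out))


print(utf8_encode(0x20AC).hex(" "))  # e2 82 ac
```

For code points within the Unicode range, the result matches Python's built-in `str.encode("utf-8")`.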
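Unicode range (HEX) | Bytes in UTF-8 | Characters represented |
00000000-0000007F | 1 | ASCII, including the Latin alphabet, basic punctuation and Arabic digits |
00000080-000007FF | 2 | Cyrillic, extended Latin, Arabic, Armenian, Greek, Hebrew and Coptic alphabets; Syriac, Thaana, N'Ko; the IPA; some punctuation |
00000800-0000FFFF | 3 | all other modern scripts, including the Georgian alphabet and Indic, Chinese, Korean and Japanese writing; complex punctuation; mathematical and other special symbols |
00010000-001FFFFF | 4 | musical symbols, rare Chinese characters, extinct scripts |
00200000-03FFFFFF | 5 | not used in Unicode |
04000000-7FFFFFFF | 6 | not used in Unicode |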
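The ranges in the table translate directly into a length lookup. The following Python sketch assumes the original six-range scheme shown in the table; the name `utf8_length` is hypothetical.

```python
def utf8_length(code):
    """Return how many bytes the original UTF-8 scheme needs for a value."""
    if code < 0:
        raise ValueError("value out of range")
    # Upper bound of each row of the table, for 1..6 bytes
    upper_bounds = (0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF)
    for length, upper in enumerate(upper_bounds, start=1):
        if code <= upper:
            return length
    raise ValueError("value out of range")


print(utf8_length(0x44F))  # 2
```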
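The UTF-8 bit layout technically allows a character to be written with more bytes than necessary, by padding its code with leading zeros. Such overlong sequences are invalid, and a decoder must reject them. For example, the ASCII character 1 (0x31) could also be written as 11000000 10110001 (0xC0 0xB1) or 11100000 10000000 10110001 (0xE0 0x80 0xB1). The byte markers are: 110 00000 (0xC0), 1110 0000 (0xE0), 11110 000 (0xF0), 111110 00 (0xF8), 1111110 0 (0xFC) for leading bytes, and 10 000000 (0x80) for continuation bytes.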
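One way a decoder can reject overlong forms is to compare the decoded value with the smallest value that genuinely needs that sequence length. The Python sketch below is an assumption-laden illustration (the name `utf8_decode_strict` is hypothetical, and it decodes a single sequence in the original 1 to 6 byte scheme); it is not taken from any particular library.

```python
def utf8_decode_strict(data):
    """Decode the sequence at the start of `data`; reject overlong forms."""
    first = data[0]
    if first < 0x80:
        return first
    # (mask selecting the marker bits, marker value, smallest valid code)
    forms = ((0xE0, 0xC0, 0x80), (0xF0, 0xE0, 0x800), (0xF8, 0xF0, 0x10000),
             (0xFC, 0xF8, 0x200000), (0xFE, 0xFC, 0x4000000))
    for length, (mask, marker, minimum) in enumerate(forms, start=2):
        if (first & mask) == marker:
            break
    else:
        raise ValueError("invalid leading byte")
    if len(data) < length:
        raise ValueError("truncated sequence")
    code = first & ~mask & 0xFF            # payload bits of the leading byte
    for byte in data[1:length]:
        if (byte & 0xC0) != 0x80:
            raise ValueError("invalid continuation byte")
        code = (code << 6) | (byte & 0x3F)
    if code < minimum:
        raise ValueError("overlong sequence")
    return code
```

With this check, `b"\x31"` decodes to 0x31, while both `b"\xC0\xB1"` and `b"\xE0\x80\xB1"` raise `ValueError`.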
UTF-8 32- . , Unicode 0x001FFFFF . 32- , UTF-8 .
. . , . , . 6 . 36 42 .
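The byte values 254 (0xFE) and 255 (0xFF) can never occur in UTF-8 text. Because Unicode defines no characters with codes above 2^21, the byte values 248 to 253 (0xF8 to 0xFD) also go unused in UTF-8. If artificially lengthened sequences (padded with leading zeros) are forbidden, the byte values 192 and 193 (0xC0 and 0xC1) are not used either.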
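This claim can be checked empirically. The Python sketch below encodes every Unicode scalar value with the built-in codec and collects the byte values that never appear. Because the built-in codec enforces the 0x10FFFF limit of RFC 3629, it reports 0xF5 to 0xF7 as unused as well, in addition to the values named above.

```python
# Every code point except the surrogates, which cannot be encoded in UTF-8
text = "".join(chr(cp) for cp in range(0x110000)
               if not 0xD800 <= cp <= 0xDFFF)
used = set(text.encode("utf-8"))
unused = sorted(set(range(256)) - used)

print([hex(b) for b in unused])
```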
BOM (signature)
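Many Windows programs (including Notepad) add the bytes 0xEF, 0xBB, 0xBF to the beginning of any document saved as UTF-8.
This is the byte order mark (BOM), also often called a signature (hence the names UTF-8 and UTF-8 with Signature). The signature lets programs determine automatically that a file is encoded in UTF-8, but files that carry it may be handled incorrectly by older programs, in particular XML parsers. Some programs, such as Notepad++, Notepad2 and Kate, can convert UTF documents with or without a signature.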
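Example: a text consisting of the single character a.
Saved as UTF-8 with Signature, it is the byte sequence: 0xEF 0xBB 0xBF 0x61
Saved as UTF-8 (without a signature), it is the single byte: 0x61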
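The two byte sequences in this example can be reproduced with Python's standard codecs, where `utf-8-sig` is the codec that writes and strips the signature. The variable names are chosen only for this illustration.

```python
import codecs

with_signature = "a".encode("utf-8-sig")    # signature + character
without_signature = "a".encode("utf-8")     # character only

print(with_signature.hex(" "))     # ef bb bf 61
print(without_signature.hex(" "))  # 61

# Decoding with "utf-8-sig" drops the signature; plain "utf-8" keeps it
# as the character U+FEFF at the start of the text.
stripped = with_signature.decode("utf-8-sig")
kept = with_signature.decode("utf-8")
```

Here `stripped` is `"a"`, while `kept` is `"\ufeffa"`, which is why software unaware of the signature may mishandle such files.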
The BOM is the Unicode character with code 0xFEFF. The BOM is also used in UTF-16 and UTF-32.
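The byte order mark (BOM) is a Unicode character used to indicate the byte order of a text file. Its code point is U+FEFF. According to the specification its use is optional, but if a BOM is used, it must be placed at the very beginning of the text file. Besides indicating byte order, the BOM can also show which Unicode encoding the text uses.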
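Because Unicode text can be encoded as 16-bit or 32-bit integers, a computer that receives such text must know the byte order in which those integers are stored.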
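In Unicode, the code point U+FEFF originally also served as the zero-width no-break space. Starting with Unicode 3.2, U+2060 Word Joiner[1] is used for that purpose, and U+FEFF is used only as a BOM.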
Encoding | hex | dec | ISO-8859-1 | KOI8-R | CP1251 | CP866 | Notes |
UTF-8[t 1] | EF BB BF | 239 187 191 | ï»¿ | О╩© | п»ї | я╗┐ | |
UTF-16 (BE) | FE FF | 254 255 | þÿ | ЧЪ | юя | ■ | |
UTF-16 (LE) | FF FE | 255 254 | ÿþ | ЪЧ | яю | ■ | |
UTF-32 (BE) | 00 00 FE FF | 0 0 254 255 | ␀␀þÿ | ␀␀ЧЪ | ␀␀юя | ␀␀■ | ␀ is the NUL character |
UTF-32 (LE) | FF FE 00 00 | 255 254 0 0 | ÿþ␀␀ | ЪЧ␀␀ | яю␀␀ | ■␀␀ | |
UTF-7[t 1] | 2B 2F 76 38; 2B 2F 76 39; 2B 2F 76 2B; 2B 2F 76 2F[t 2] | 43 47 118 56; 43 47 118 57; 43 47 118 43; 43 47 118 47 | +/v8; +/v9; +/v+; +/v/ | | | | |
UTF-1[t 1] | F7 64 4C | 247 100 76 | ÷dL | | | | |
UTF-EBCDIC[t 1] | DD 73 66 73 | 221 115 102 115 | Ýsfs | | | | |
SCSU[t 1] | 0E FE FF[t 3] | 14 254 255 | ␎þÿ | ␎ЧЪ | ␎юя | ␎■ | ␎ is the Shift Out character |
BOCU-1[t 1] | FB EE 28 | 251 238 40 | ûî( | ШН( | ыо( | √ю( | |
GB-18030[t 1] | 84 31 95 33 | 132 49 149 51 | �1�3 | └1∙3 | „1•3 | Д1Х3 | |
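1. This is not literally a byte order mark, since in these encodings the code unit is a single byte and there is no byte order to resolve.[2][3]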
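2. Because UTF-7 uses base64, the last byte of the BOM, before base64 encoding, is 001111xx in binary, where xx depends on the next character (the first character after the BOM). Strictly speaking, this byte is therefore not purely part of the BOM, since it also carries information about the following (non-BOM) character. For xx=00, 01, 10, 11 the byte becomes, respectively, 38, 39, 2B, 2F after base64 encoding. If no character follows within the base64 run, the byte is 38 and the next byte is 2D.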
3. SCSU allows other encodings of U+FEFF; the form shown is the signature recommended in UTR #6.[4]
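The signatures in the table can be used to guess the encoding of a byte stream. The Python sketch below covers only the UTF-8, UTF-16 and UTF-32 rows; the function name `detect_bom` and the returned labels are assumptions made for the example. The order of the checks matters, because the UTF-32 (LE) signature FF FE 00 00 begins with the UTF-16 (LE) signature FF FE.

```python
import codecs

# Longer signatures first, so UTF-32 (LE) is not mistaken for UTF-16 (LE)
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "UTF-32 (BE)"),
    (codecs.BOM_UTF32_LE, "UTF-32 (LE)"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 (BE)"),
    (codecs.BOM_UTF16_LE, "UTF-16 (LE)"),
]


def detect_bom(data):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return None


print(detect_bom(b"\xef\xbb\xbfa"))  # UTF-8
```

A result of `None` only means that no signature is present; as noted above, the BOM is optional, so the text may still be encoded in Unicode. A UTF-16 (LE) text that begins with U+0000 is indistinguishable from UTF-32 (LE) by signature alone.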