; :
. , : , , [34];
, . , .
, , . , . Å (A ) Å, μ - .
, . , , , .
. , .
Й (U+0419) can be represented as the base character И (U+0418) followed by the combining breve ̆ (U+0306)
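Graphic characters in Unicode are divided into spacing and non-spacing (zero-width) characters. Non-spacing characters take up no room in the line when displayed. They include, in particular, accent marks and other diacritics. Both spacing and non-spacing characters have their own codes. Spacing characters are also called base characters, and non-spacing characters are called combining characters; the latter cannot occur on their own. For example, the character á can be represented as the sequence of the base character a (U+0061) and the combining character ́ (U+0301), or as the single precomposed character á (U+00E1).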
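A special kind of combining character is the variation selector. Variation selectors affect only those characters for which such variants are defined. In version 5.0, variants are defined for a number of mathematical symbols, for characters of the traditional Mongolian script, and for characters of the Phags-pa script.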
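Because the same characters can be represented by different codes, comparing strings byte by byte becomes impossible. Normalization algorithms (normalization forms) solve this problem by converting text to a defined standard form. The conversion replaces characters with equivalent ones using tables and rules. Decomposition is the replacement (splitting) of one character by several constituent characters, and composition, conversely, is the replacement (joining) of several constituent characters by one character.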
The Unicode Standard defines 4 text normalization algorithms: NFD, NFC, NFKD, and NFKC.
NFD
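NFD, normalization form D (D for decomposition), is canonical decomposition: an algorithm that recursively replaces precomposed characters (also called composite characters) with their canonical decompositions according to the decomposition tables.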
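A minimal sketch of canonical decomposition, assuming Python's standard `unicodedata` module as the implementation (the article itself names no library). The helper `code_points` is a hypothetical name introduced only for display.

```python
import unicodedata


def code_points(text: str) -> list[str]:
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# A precomposed character and its NFD (canonically decomposed) form.
composed = "\u00E1"  # á as one code point
decomposed = unicodedata.normalize("NFD", composed)

print(code_points(composed))    # ['U+00E1']
print(code_points(decomposed))  # ['U+0061', 'U+0301']

# The Cyrillic letter Й decomposes to И followed by a combining breve.
print(code_points(unicodedata.normalize("NFD", "\u0419")))  # ['U+0418', 'U+0306']
```

The two strings render identically but compare unequal until both are normalized to the same form.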
NFC
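NFC, normalization form C (C for composition), is canonical decomposition followed by canonical composition. The text is first brought to form D (NFD). Then, in the NFD-normalized text, composition proceeds according to the following definitions and rule:
a character S is a starter if it has a canonical combining class of zero;
in any sequence of characters beginning with a starter S, a character C is blocked from S only if there is some character B between S and C that either is a starter or has the same or a higher combining class than C. This rule applies only to strings that have undergone canonical decomposition;
a primary composite is a character that has a canonical decomposition in the Unicode character database (or a canonical Hangul decomposition, and is not in the composition exclusion list);
a character X can be primary combined with a character Y if and only if there is a primary composite Z that is canonically equivalent to the sequence <X, Y>;
if the next character C is not blocked from the last starter L encountered and can be primary combined with it, L is replaced by the composite L-C and C is removed.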
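The composition rule and the notion of blocking can be observed with a short sketch. It assumes Python's `unicodedata` module; the sample strings are illustrative choices, not taken from the article.

```python
import unicodedata

# Combining classes of the marks used below.
ACUTE = "\u0301"      # combining acute accent, class 230
OVERLINE = "\u0305"   # combining overline, class 230
DOT_BELOW = "\u0323"  # combining dot below, class 220

for mark in (ACUTE, OVERLINE, DOT_BELOW):
    print(f"U+{ord(mark):04X} has combining class {unicodedata.combining(mark)}")

# Not blocked: the acute accent follows the starter "a" directly,
# and a primary composite (U+00E1) exists, so the pair is composed.
simple = unicodedata.normalize("NFC", "a" + ACUTE)

# Blocked: the overline sits between "a" and the acute accent and has
# the same combining class, so the acute cannot combine with "a".
blocked = unicodedata.normalize("NFC", "a" + OVERLINE + ACUTE)

# Reordered, then composed: decomposition sorts the dot below (220)
# before the acute (230); "a" + dot below composes to U+1EA1.
reordered = unicodedata.normalize("NFC", "a" + ACUTE + DOT_BELOW)

for label, text in (("simple", simple), ("blocked", blocked), ("reordered", reordered)):
    print(label, [f"U+{ord(ch):04X}" for ch in text])
```

In the blocked case the string comes back unchanged, because no composite exists for "a" with an overline and the overline blocks the acute accent.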
NFKD
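NFKD, normalization form KD, is compatibility decomposition: characters are replaced by their compatibility equivalents, which discards formatting distinctions such as the following [35]:
font variants (ℍ and ℌ);
circled forms (①);
width variants (ｶ and カ);
rotated forms (︷ and {);
superscripts and subscripts (⁹ and ₉);
fractions (¼);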
().
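The kinds of compatibility characters listed above can be checked with a small sketch, again assuming Python's `unicodedata` module. The dictionary name `samples` and its labels are introduced here for illustration only.

```python
import unicodedata

# One sample per kind of compatibility character from the list above.
samples = {
    "font variant": "\u210D",  # ℍ double-struck capital H
    "circled": "\u2460",       # ① circled digit one
    "width": "\uFF76",         # ｶ halfwidth katakana KA
    "rotated": "\uFE37",       # ︷ vertical left curly bracket
    "superscript": "\u2079",   # ⁹ superscript nine
    "subscript": "\u2089",     # ₉ subscript nine
    "fraction": "\u00BC",      # ¼ vulgar fraction one quarter
}

# Apply compatibility decomposition to every sample.
nfkd = {kind: unicodedata.normalize("NFKD", ch) for kind, ch in samples.items()}

for kind, result in nfkd.items():
    points = " ".join(f"U+{ord(ch):04X}" for ch in result)
    print(f"{kind:12} {samples[kind]} -> {result} ({points})")
```

Note that the mapping loses information: after NFKD, ⁹ and ₉ are both the plain digit 9, so the original text cannot be recovered.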
NFKC
NFKC, normalization form KC, is compatibility decomposition (NFKD) followed by canonical composition (NFC).
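To compare the four forms side by side, the following sketch normalizes one string under each of them. It assumes Python's `unicodedata` module; `all_forms` is a hypothetical helper, and the two sample characters are illustrative choices.

```python
import unicodedata

FORMS = ("NFD", "NFC", "NFKD", "NFKC")


def all_forms(text: str) -> dict[str, str]:
    """Normalize text under each of the four normalization forms."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


def show(text: str) -> None:
    """Print the code points produced by each form."""
    for form, result in all_forms(text).items():
        points = " ".join(f"U+{ord(ch):04X}" for ch in result)
        print(f"{form:5} {points}")


show("\u212B")  # Å ANGSTROM SIGN: canonically equivalent to U+00C5
show("\uFB01")  # ﬁ LATIN SMALL LIGATURE FI: compatibility equivalent of "fi"
```

The angstrom sign changes under all four forms because its equivalence is canonical, whereas the ligature changes only under the two K forms.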
(. left-to-right, LTR), (. right-to-left, RTL) , . ; .
, , . (. bidirectional text, BiDi). (, ) , . : , , . ( ) .
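As a small illustration of how directionality is attached to individual characters, the sketch below reads the bidirectional class that the Unicode character database assigns to each character. It assumes Python's `unicodedata` module; `bidi_classes` is a hypothetical helper name, and the sample string is an illustrative choice.

```python
import unicodedata


def bidi_classes(text: str) -> list[str]:
    """Return the bidirectional class of every character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


# Latin letter, Hebrew alef, Arabic alef, space, European digit.
sample = "a\u05D0\u0627 1"
for ch, cls in zip(sample, bidi_classes(sample)):
    print(f"U+{ord(ch):04X} {cls}")
```

Here `L` marks a left-to-right character, `R` and `AL` mark right-to-left ones, and `WS` and `EN` mark whitespace and European digits, whose direction depends on the surrounding text.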
: ,
, :
,
,
,
,
,
,
,
,
,
,
( , ),
,
,
,
,
(),
,
,
( , , )
.
, : , , , , , , .
, .
, , (, Apple MacRoman (0xF0) Windows Wingdings (0xFF)). .
ISO/IEC 10646
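The Unicode Consortium works closely with the working group ISO/IEC/JTC1/SC2/WG2, which develops International Standard 10646 (ISO/IEC 10646). The Unicode Standard and ISO/IEC 10646 are kept synchronized, although each standard uses its own terminology and documentation system.
Cooperation between the Unicode Consortium and the International Organization for Standardization (ISO) began in 1991. In 1993 ISO issued the standard DIS 10646.1. To synchronize with it, the Consortium approved version 1.1 of the Unicode Standard, which added further characters from DIS 10646.1. As a result, the values of the encoded characters in Unicode 1.1 and DIS 10646.1 coincided completely.
The cooperation between the two organizations continued. In 2000 the Unicode 3.0 standard was synchronized with ISO/IEC 10646-1:2000. The next version of ISO/IEC 10646 is to be synchronized with Unicode 4.0. It is possible that these specifications will even be published as a single standard.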
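Like the UTF-16 and UTF-32 formats in the Unicode Standard, ISO/IEC 10646 also has two main character encoding forms: UCS-2 (2 bytes per character, similar to UTF-16) and UCS-4 (4 bytes per character, similar to UTF-32). UCS stands for universal multiple-octet (multi-byte) coded character set. UCS-2 can be regarded as a subset of UTF-16 (UTF-16 without surrogate pairs), and UCS-4 is a synonym for UTF-32.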
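The difference between UCS-2 and UTF-16 comes down to surrogate pairs, which UTF-16 uses for code points above U+FFFF. A minimal sketch of that calculation follows; `to_surrogate_pair` is a hypothetical helper written for illustration, not a function from any standard library.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")  # U+1F600 -> D83D DE00

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
print("\U0001F600".encode("utf-16-be").hex(" "))  # d8 3d de 00
```

A code point in the Basic Multilingual Plane needs no such pair, which is why text limited to that range is encoded identically in UCS-2 and UTF-16.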
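Differences between the Unicode Standard and ISO/IEC 10646:
there are minor differences in terminology;
ISO/IEC 10646 does not include the sections needed for a full implementation of Unicode support:
there is no data on the binary encoding of characters;
there is no description of the algorithms for collation and rendering of characters;
there is no list of character properties (for example, no list of the properties needed to implement support for bi-directional writing).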
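Unicode has several representation forms (Unicode transformation format, UTF): UTF-8, UTF-16 (UTF-16BE, UTF-16LE), and UTF-32 (UTF-32BE, UTF-32LE). A UTF-7 form was also developed for transmission over seven-bit channels, but because of its incompatibility with ASCII it did not become widespread and is not included in the standard. On 1 April 2005 two joke representation forms were proposed: UTF-9 and UTF-18 (RFC 4042).
Microsoft Windows NT and the systems based on it, Windows 2000 and Windows XP, mainly use the UTF-16LE form. The UNIX-like operating systems GNU/Linux, BSD, and Mac OS X use UTF-8 for files, and UTF-32 or UTF-8 for processing characters in memory.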
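Punycode is another form of encoding sequences of Unicode characters, into so-called ACE sequences, which consist only of alphanumeric characters and hyphens, as permitted in domain names.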
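A brief sketch of Punycode in practice, assuming the `punycode` codec that ships with Python's standard library. The sample word is an illustrative choice.

```python
# Encode a label containing a non-ASCII character with the punycode codec.
label = "m\u00FCnchen"  # "münchen"
encoded = label.encode("punycode")
print(encoded)  # b'mnchen-3ya'

# Decoding restores the original Unicode string.
decoded = encoded.decode("punycode")
print(decoded == label)  # True
```

In domain names the encoded label additionally carries the `xn--` prefix, which is handled by the separate IDNA layer rather than by Punycode itself.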
UTF-8
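Main article: UTF-8
UTF-8 is the representation of Unicode that offers the greatest compatibility with older systems that used 8-bit characters. Text consisting only of characters with numbers below 128 becomes plain ASCII text when written in UTF-8. Conversely, in UTF-8 text any byte with a value below 128 represents the ASCII character with the same code. The remaining Unicode characters are represented by sequences of 2 to 6 bytes (in practice only up to 4 bytes, since Unicode has no characters with a code above 10FFFF, and there are no plans to introduce any), in which the first byte always has the form 11xxxxxx and the rest have the form 10xxxxxx. UTF-8 does not use surrogate pairs; 4 bytes are enough to write any Unicode character.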
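The UTF-8 format was invented on 2 September 1992 by Ken Thompson and Rob Pike and implemented in Plan 9[36]. The UTF-8 standard is now formally specified in RFC 3629 and ISO/IEC 10646 Annex D.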
UTF-8 byte sequences are derived from Unicode code points as follows:
Unicode range: UTF-8 byte sequence
0x00000000 to 0x0000007F: 0xxxxxxx
0x00000080 to 0x000007FF: 110xxxxx 10xxxxxx
0x00000800 to 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
0x00010000 to 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Theoretically possible, but not included in the standard, are also:
0x00200000 to 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x04000000 to 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
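Although UTF-8 allows the same character to be written in several ways, only the shortest of them is correct. The other forms must be rejected for security reasons.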
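The bit patterns above translate directly into shifts and masks. The following is a minimal sketch, not a replacement for a real encoder: `encode_utf8` is a hypothetical function that covers only the standard 1 to 4 byte forms, and it assumes the limits of current Unicode (no code points above U+10FFFF, no surrogates).

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one code point using the shortest UTF-8 form."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point <= 0x7F:        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for cp in (0x41, 0x419, 0x20AC, 0x1F600):
    print(f"U+{cp:04X} -> {encode_utf8(cp).hex(' ')}")

# An overlong (non-shortest) form of U+0000 is rejected by a strict decoder.
try:
    b"\xC0\x80".decode("utf-8")
except UnicodeDecodeError:
    print("overlong form rejected")
```

Because each branch is chosen by the size of the code point, the function can only ever produce the shortest form; the decoding side is where overlong sequences have to be detected and refused.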
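In a UTF-16 data stream the low-order byte can be written either before the high-order byte (UTF-16 little-endian) or after it (UTF-16 big-endian). Similarly, there are two variants of the four-byte encoding: UTF-32LE and UTF-32BE.
To indicate that Unicode is in use, the character U+FEFF (zero-width no-break space), also called the byte order mark (BOM), can be sent at the start of a text file or stream. Its appearance also makes it possible to distinguish UTF-16LE from UTF-16BE, because the character U+FFFE is not valid. This method is sometimes also used to mark the UTF-8 format, although the notion of byte order does not apply to it. Files that follow this convention begin with the following byte sequences: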
UTF-8
EF BB BF
UTF-16BE
FE FF
UTF-16LE
FF FE
UTF-32BE
00 00 FE FF
UTF-32LE
FF FE 00 00
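Unfortunately, this method cannot reliably distinguish UTF-16LE from UTF-32LE, because the character U+0000 is permitted by Unicode (although real texts rarely begin with it).
Files in the UTF-16 and UTF-32 encodings that do not contain a BOM must use big-endian byte order (unicode.org).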