Code points and code units

When working with Unicode, there is more than just bytes and characters to deal with. You must also understand something about code points and code units. In Unicode, a code point is a numeric value corresponding to an entry in an encoding table. For example, 0x0061 is the code point for the letter "a". Code points are sometimes combined to form one character. The code point for the letter "a" can be combined with 0x0303, the code point for the combining tilde, to form the character ã. The 0x0303 code point represents a combining diacritical mark, meaning that it is always used to modify another character and is never used on its own.
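To make the idea of combining code points concrete, here is a minimal sketch in Python (not an Oracle feature; the variable names are ours, chosen for illustration). It builds ã from the two code points described above and uses the standard unicodedata module to show that the pair can also be composed into a single code point.

```python
import unicodedata

base = "\u0061"   # code point 0x0061, the letter "a"
tilde = "\u0303"  # code point 0x0303, the combining tilde

# Two code points that display as the single character ã.
combined = base + tilde

# Python strings are sequences of code points, so this lists both of them.
code_points = [hex(ord(cp)) for cp in combined]

# combining() returns a nonzero class for combining marks such as 0x0303.
is_combining = unicodedata.combining(tilde) != 0

# NFC normalization composes the pair into the single code point 0x00E3.
composed = unicodedata.normalize("NFC", combined)

print(code_points, is_combining, hex(ord(composed)))
```

Both combined and composed display as ã, yet they are different sequences of code points, which is why a naive comparison or length check can surprise you.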

In Unicode, it's not only possible for multiple code points to represent one character, but one code point can represent many characters. If you find that confusing, don't feel bad—we do too.

While a code point represents a specific mapping of a character to a numeric value in an encoding table, a code unit refers to the actual representation of the code point in a particular encoding. Take, for example, the Unicode code point 0x0061 representing the letter "a". The UTF-8 representation of that code point uses just one byte: 0x61. That one byte is a code unit. The UTF-16 representation, on the other hand, uses two bytes to represent the code point: 0x0061. Those two bytes together form one code unit. As you can see, the code unit size (i.e., the number of bytes in the code unit) varies depending on which form of Unicode is being used. Sometimes, a code point value is too large for the underlying code unit. In such cases, the code point is represented using two or more code units. For example, the code point 0x1D11E, which represents the musical symbol G clef (𝄞), is represented in UTF-16 using two code units: 0xD834 and 0xDD1E. Neither value represents a code point by itself. Only together do the two code units represent a code point, and in this case the code point in turn represents a single character.
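The following Python sketch, offered as an illustration rather than as anything Oracle provides, splits the encoded form of a string into code units so you can see the values quoted above. The helper names code_units and to_surrogates are hypothetical; the surrogate arithmetic follows the standard UTF-16 rule for code points above 0xFFFF.

```python
def code_units(text, encoding, unit_size):
    """Return the code units of text in the given encoding as integers."""
    data = text.encode(encoding)
    return [
        int.from_bytes(data[i:i + unit_size], "big")
        for i in range(0, len(data), unit_size)
    ]


def to_surrogates(code_point):
    """Split a code point above 0xFFFF into its two UTF-16 code units."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


letter_a = "\u0061"
g_clef = "\U0001D11E"

a_utf8 = code_units(letter_a, "utf-8", 1)         # one 1-byte code unit
a_utf16 = code_units(letter_a, "utf-16-be", 2)    # one 2-byte code unit
clef_utf16 = code_units(g_clef, "utf-16-be", 2)   # two 2-byte code units
clef_utf8 = code_units(g_clef, "utf-8", 1)        # four 1-byte code units

print([hex(u) for u in clef_utf16])          # ['0xd834', '0xdd1e']
print([hex(u) for u in to_surrogates(0x1D11E)])  # ['0xd834', '0xdd1e']
```

Note that the same G clef code point needs two code units in UTF-16 but four in UTF-8, because the UTF-8 code unit is only one byte wide.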

The distinction between bytes, characters, code points, and code units becomes important when you work with string functions such as LENGTH and INSTR. When you want the length of a string, do you want it in terms of bytes, characters, code points, or code units? Oracle9i supports variations of these functions that allow you to choose the semantics that you wish to use for a given invocation; for example, you can use LENGTHC to look at the length of a string in terms of the number of Unicode characters it holds. Certain Unicode characters can be represented in multiple ways. In UTF-16, the character ã can be represented as the single code point 0x00E3 or as the two code points 0x0061 and 0x0303. The standard LENGTH function will see the two code points as two characters. The LENGTHC function will recognize that 0x0061 followed by 0x0303 represents only a single character.

Unicode has many subtle facets. We've been told there may be unusual cases where a single Unicode character could be interpreted as multiple characters by a person who speaks the language in question. Because of subtle issues like this one, we strongly recommend that you become familiar with the resources at http://unicode.org if you work with Unicode.

In addition to LENGTH and LENGTHC, you should also consider LENGTH2 and LENGTH4. LENGTH2 counts the number of code units in a string, and LENGTH4 counts the number of code points in a string. This is why it's important to understand the distinction between code units and code points. Other character string functions, such as SUBSTR and INSTR, have the same variations as LENGTH.
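To see how the four ways of measuring a string can disagree, consider this Python sketch. It is only an analogy to the Oracle functions, not a description of how they are implemented, and all four helper names are hypothetical. In particular, length_characters is a rough approximation that simply skips combining marks; it is not a full implementation of Unicode's rules for what counts as a character.

```python
import unicodedata


def length_bytes(text, encoding="utf-8"):
    """Length in bytes of the encoded string."""
    return len(text.encode(encoding))


def length_code_units(text):
    """Length in UTF-16 code units (two bytes each)."""
    return len(text.encode("utf-16-be")) // 2


def length_code_points(text):
    """Length in code points; Python strings are code point sequences."""
    return len(text)


def length_characters(text):
    """Approximate length in characters: code points that are not combining marks."""
    return sum(1 for cp in text if unicodedata.combining(cp) == 0)


decomposed = "\u0061\u0303"  # "a" followed by the combining tilde
g_clef = "\U0001D11E"        # musical symbol G clef

for sample in (decomposed, g_clef):
    print(
        length_bytes(sample),
        length_code_units(sample),
        length_code_points(sample),
        length_characters(sample),
    )
```

For the decomposed ã, the code unit and code point counts are both 2 while the character count is 1. For the G clef, the code unit count is 2 but the code point and character counts are both 1.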

In Oracle's documentation you'll see references to UCS-2 and UCS-4. These acronyms are where the 2 and 4 in LENGTH2 and LENGTH4 come from. Originally defined by an ISO standard that mirrored the Unicode specification, the UCS-2 and UCS-4 acronyms are now obsolete. Whenever you see UCS-2, think code unit. Whenever you see UCS-4, think code point.