C++ Programming/Software Internationalization/Text Encoding

Text encoding

Text, and in particular the characters used to produce readable text, is represented digitally by means of a character encoding scheme. Such a scheme pairs each character of a given character set (sometimes referred to as a code page) with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate its digital representation.

An easy-to-understand example is Morse code, which encodes letters of the Latin alphabet as series of long and short depressions of a telegraph key; this is similar to how ASCII encodes letters, numerals, and other symbols as integers.

Text and data

Probably the most important use for a byte is holding a character code. Characters typed at the keyboard, displayed on the screen, and printed on the printer all have numeric values. To allow it to communicate with the rest of the world, the IBM PC uses a variant of the ASCII character set. There are 128 defined codes in the ASCII character set. IBM uses the remaining 128 possible values for extended character codes including European characters, graphic symbols, Greek letters, and math symbols.
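As a minimal sketch of the idea that a byte holds a character code, the helpers below convert between a char and its numeric value. The names char_code, from_code and is_ascii are hypothetical, and the sketch assumes an ASCII-compatible execution character set, which is what virtually all current compilers use.

```cpp
#include <cassert>

// Returns the numeric code stored in a char as a value in 0..255.
// Converting through unsigned char avoids negative results on
// platforms where plain char is signed.
inline int char_code(char c) {
    return static_cast<int>(static_cast<unsigned char>(c));
}

// Builds a char from a numeric code in 0..255.
inline char from_code(int code) {
    assert(code >= 0 && code <= 255);
    return static_cast<char>(static_cast<unsigned char>(code));
}

// True if the code is one of the 128 codes defined by ASCII;
// the remaining 128 byte values are the "extended" range.
inline bool is_ascii(char c) {
    return char_code(c) < 128;
}
```

Under the stated assumption, char_code('A') yields 65, while any byte value of 128 or above falls outside ASCII and its meaning depends on the code page in use.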

In the early days of computing, the introduction of coded character sets such as ASCII (1963) and EBCDIC (1964) began the process of standardization. The limitations of such sets soon became apparent, and a number of ad hoc methods were developed to extend them. The need to support multiple writing systems (languages), including the CJK family of East Asian scripts, required support for a far larger number of characters and demanded a systematic approach to character encoding rather than the previous ad hoc approaches.

What's this about UNICODE?

Unicode is an industry standard whose goal is to provide the means by which text of all forms and languages can be encoded for use by computers. At the time of writing, the current version was Unicode 6.1, released in January 2012, which comprises over 110,000 characters from 100 scripts. Since Unicode is just a standard that assigns numbers to characters, there also need to be methods for encoding these numbers as bytes. The three most common character encodings are UTF-8, UTF-16, and UTF-32, of which UTF-8 is by far the most frequently used.

In the Unicode standard, planes are groups of numerical values (code points) that point to specific characters. Unicode code points are logically divided into 17 planes, each with 65,536 (= 2¹⁶) code points. Planes are identified by the numbers 0 to 16 (decimal), which correspond to the possible values 00–10 (hexadecimal) of the first two positions in the six-position format (hhhhhh). As of version 6.1, six of these planes have assigned code points (characters) and are named.

Plane 0 - Basic Multilingual Plane (BMP)
Plane 1 - Supplementary Multilingual Plane (SMP)
Plane 2 - Supplementary Ideographic Plane (SIP)
Planes 3–13 - Unassigned
Plane 14 - Supplementary Special-purpose Plane (SSP)
Planes 15–16 - Supplementary Private Use Area (S PUA A/B)
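Because each plane holds exactly 2¹⁶ code points, the plane number is simply the part of the code point above its low 16 bits. The following sketch illustrates this; plane_of, offset_in_plane and kMaxCodePoint are hypothetical names introduced here, not part of any standard library.

```cpp
#include <cassert>
#include <cstdint>

// Highest valid Unicode code point (the last position of plane 16).
const std::uint32_t kMaxCodePoint = 0x10FFFF;

// Plane number (0..16): the bits above the low 16 bits, i.e. the
// first two hexadecimal digits of the six-position form hhhhhh.
inline unsigned plane_of(std::uint32_t cp) {
    assert(cp <= kMaxCodePoint);
    return static_cast<unsigned>(cp >> 16);
}

// Position of the code point inside its plane (0x0000..0xFFFF).
inline unsigned offset_in_plane(std::uint32_t cp) {
    assert(cp <= kMaxCodePoint);
    return static_cast<unsigned>(cp & 0xFFFF);
}
```

For example, U+0041 lies in plane 0 (the BMP), U+1F600 in plane 1 (the SMP), and U+20000 in plane 2 (the SIP).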

BMP and SMP

BMP		SMP
0000–0FFF	8000–8FFF	10000–10FFF	18000–18FFF
1000–1FFF	9000–9FFF	11000–11FFF	19000–19FFF
2000–2FFF	A000–AFFF	12000–12FFF	1A000–1AFFF
3000–3FFF	B000–BFFF	13000–13FFF	1B000–1BFFF
4000–4FFF	C000–CFFF	14000–14FFF	1C000–1CFFF
5000–5FFF	D000–DFFF	15000–15FFF	1D000–1DFFF
6000–6FFF	E000–EFFF	16000–16FFF	1E000–1EFFF
7000–7FFF	F000–FFFF	17000–17FFF	1F000–1FFFF

SIP and SSP

SIP		SSP
20000–20FFF	28000–28FFF	E0000–E0FFF
21000–21FFF	29000–29FFF
22000–22FFF	2A000–2AFFF
23000–23FFF	2B000–2BFFF
24000–24FFF
25000–25FFF
26000–26FFF
27000–27FFF	2F000–2FFFF

PUA

PUA
F0000–F0FFF	F8000–F8FFF	100000–100FFF	108000–108FFF
F1000–F1FFF	F9000–F9FFF	101000–101FFF	109000–109FFF
F2000–F2FFF	FA000–FAFFF	102000–102FFF	10A000–10AFFF
F3000–F3FFF	FB000–FBFFF	103000–103FFF	10B000–10BFFF
F4000–F4FFF	FC000–FCFFF	104000–104FFF	10C000–10CFFF
F5000–F5FFF	FD000–FDFFF	105000–105FFF	10D000–10DFFF
F6000–F6FFF	FE000–FEFFF	106000–106FFF	10E000–10EFFF
F7000–F7FFF	FF000–FFFFF	107000–107FFF	10F000–10FFFF

Currently, about ten percent of the potential space is used. Furthermore, ranges of characters have been tentatively mapped out for every current and ancient writing system (script) the Unicode Consortium has been able to identify. While Unicode may eventually need to use another of the 11 spare planes for ideographic characters, other planes remain. Even if previously unknown scripts with tens of thousands of characters are discovered, the limit of 1,114,112 code points is unlikely to be reached in the near future. The Unicode Consortium has stated that this limit will never be changed.

The odd-looking limit (it is not a power of 2) is not due to UTF-8, which was originally designed with a limit of 2³¹ code points (32,768 planes) and can encode 2²¹ code points (32 planes) even when limited to 4 bytes. It is due to the design of UTF-16, in which a "surrogate pair" of two 16-bit words encodes the 2²⁰ code points of planes 1 to 16, while single words encode plane 0. Together this gives 2¹⁶ + 2²⁰ = 1,114,112 code points.
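The arithmetic behind the limit, and the way a surrogate pair carries a 20-bit value, can be sketched as follows. SurrogatePair and to_surrogates are hypothetical names used only for illustration; the constants 0xD800 and 0xDC00 are the starts of the high and low surrogate ranges defined by UTF-16.

```cpp
#include <cassert>
#include <cstdint>

// 17 planes of 2^16 code points each...
static_assert(17UL * 65536UL == 1114112UL, "17 planes");
// ...which equals plane 0 (2^16) plus the 2^20 code points that
// surrogate pairs can address (planes 1 to 16).
static_assert((1UL << 16) + (1UL << 20) == 1114112UL, "BMP + pairs");

// The two 16-bit words of a UTF-16 surrogate pair.
struct SurrogatePair {
    std::uint16_t high;  // 0xD800..0xDBFF, carries the top 10 bits
    std::uint16_t low;   // 0xDC00..0xDFFF, carries the bottom 10 bits
};

// Splits a code point from planes 1..16 into a surrogate pair.
inline SurrogatePair to_surrogates(std::uint32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const std::uint32_t v = cp - 0x10000;  // 20-bit value
    SurrogatePair p;
    p.high = static_cast<std::uint16_t>(0xD800 + (v >> 10));
    p.low = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));
    return p;
}
```

For instance, U+1F600 becomes the pair D83D DE00, and the last code point, U+10FFFF, becomes DBFF DFFF, which uses the final value of both surrogate ranges.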

UTF-8

UTF-8 is a variable-length encoding of Unicode, using from 1 to 4 bytes for each character. It was designed for compatibility with ASCII, and as such, single-byte values represent the same character in UTF-8 as they do in ASCII. Because the UTF-8 encoding of any character other than NUL (U+0000) never contains a zero byte, UTF-8 text can be stored in ordinary null-terminated strings and passed through existing C++ code largely unchanged. The main exception is code that counts or indexes characters, since the number of bytes no longer equals the number of characters.
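The sketch below shows one way to produce the 1 to 4 bytes of a code point and to count the characters (code points) in a UTF-8 string rather than its bytes. The function names append_utf8 and count_code_points are hypothetical, and the counting function assumes its input is already well-formed UTF-8; it does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Appends the UTF-8 encoding of one code point to `out`.
inline void append_utf8(std::string& out, std::uint32_t cp) {
    // Surrogates (D800..DFFF) are not valid code points to encode.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x80) {                     // 1 byte: 0xxxxxxx (ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {             // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {           // 3 bytes: 1110xxxx + 2 more
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                             // 4 bytes: 11110xxx + 3 more
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Counts code points by skipping continuation bytes (10xxxxxx).
// Assumes `s` is well-formed UTF-8.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

Appending U+0041, U+00E9, U+20AC and U+1F600 to an empty string gives 1 + 2 + 3 + 4 = 10 bytes, so size() reports 10 while count_code_points reports 4.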

UTF-16

UTF-16 is also variable-length, but works in 16-bit units instead of 8-bit ones, so each character is represented by either 2 or 4 bytes. This means that it is not compatible with ASCII: even an ASCII character occupies two bytes, one of which is zero.
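As an illustration of the 2-or-4-byte behaviour, the following sketch decodes a std::u16string into code points. The names is_high_surrogate, is_low_surrogate and decode_utf16 are hypothetical, and the decoder assumes well-formed input (every high surrogate is followed by a low surrogate); it only checks this with assert.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

inline bool is_high_surrogate(char16_t u) {
    return u >= 0xD800 && u <= 0xDBFF;
}

inline bool is_low_surrogate(char16_t u) {
    return u >= 0xDC00 && u <= 0xDFFF;
}

// Decodes well-formed UTF-16 into a sequence of code points.
inline std::u32string decode_utf16(const std::u16string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (is_high_surrogate(u)) {
            // A pair: two 16-bit units (4 bytes) form one code point.
            assert(i + 1 < s.size() && is_low_surrogate(s[i + 1]));
            const char32_t hi = u - 0xD800;
            const char32_t lo = s[i + 1] - 0xDC00;
            out += static_cast<char32_t>(0x10000 + (hi << 10) + lo);
            ++i;  // skip the low surrogate just consumed
        } else {
            // A single unit (2 bytes) is a plane 0 code point.
            assert(!is_low_surrogate(u));
            out += static_cast<char32_t>(u);
        }
    }
    return out;
}
```

A string holding 'A' followed by the pair D83D DE00 has three 16-bit units but decodes to two code points, U+0041 and U+1F600.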

UTF-32

Unlike the previous two encodings, UTF-32 is not variable-length: every character is represented by exactly 32 bits. This makes encoding and decoding easier, because the 4-byte value maps directly to the Unicode code space. The disadvantage is in space efficiency, as each character takes 4 bytes, no matter what it is.
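The trade-off can be sketched by comparing the storage a std::u32string needs with what the same text would need in UTF-8. The function names utf32_bytes, utf8_length and utf8_bytes are hypothetical, and the sketch assumes char32_t is exactly 32 bits wide, which holds on common platforms although the standard only guarantees at least 32 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bytes used by the text in UTF-32: a fixed amount per code point.
inline std::size_t utf32_bytes(const std::u32string& s) {
    return s.size() * sizeof(char32_t);
}

// Bytes one code point would need in UTF-8 (1 to 4).
inline std::size_t utf8_length(char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Bytes the same text would need in UTF-8.
inline std::size_t utf8_bytes(const std::u32string& s) {
    std::size_t total = 0;
    for (char32_t cp : s) total += utf8_length(cp);
    return total;
}
```

With UTF-32, s.size() is the number of code points and s[i] is the i-th code point directly, with no decoding step. The price is that a five-letter ASCII word takes 20 bytes instead of the 5 bytes it needs in UTF-8.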