User:Newhoggy~enwikibooks/UTF8 for Haskell
This exercise describes how to convert a UTF8 string into a Unicode string in Haskell.
UTF8 is a superset of ASCII, so the following ASCII string is also a valid UTF8 string. To convert this to a unicode code point, simply take the ASCII ordinal value as the code point.
String | H | e | l | l | o |
ASCII code | x48 | x65 | x49 | x49 | x6F |
Unicode code point | U+0048 | U+0065 | U+0049 | U+0049 | U+006F |
Non-ascii characters span multiple byte and take more work to convert to a unicode code point.
String | 你 | 好 |
ASCII code | xE4:xBD:xA0 | xE5:xA5:xBD |
Unicode code point | U+4f60 | U+597d |
Every non-ASCII unicode character when encoding in UTF8 must consist of a leading byte and one or more trailing bytes. Trailing bytes are of the form 0x10nnnnnn so they are easy to recognise. Leading bytes are of the form 0x110nnnnn, 0x1110nnnn, or 0x1111nnnn, each signifying that there are 1, 2, or 3 trailing bytes respectively.
The UTF8 to unicode translation can be describe thus:
Bytes | UTF8 | Unicode |
2 | 0x110aaaaa 0x10bbbbbb | 0x0000000000aaaaabbbbbb |
3 | 0x1110aaaa 0x10bbbbbb 0x10cccccc | 0x00000aaaabbbbbbcccccc |
4 | 0x11110aaa 0x10bbbbbb 0x10cccccc 0x10dddddd | 0xaaabbbbbbccccccdddddd |
In the case of 2 bytes, if the resulting code point is less than or equal to 127, then the UTF8 character sequence is illegal because codepoints 0 to 127 are already represented in the ASCII subset. Notice that each trailing byte contributes 6 bits to the code point while the leading byte contributes 5, 4, or 3 bits to the code point depending on the type of leading byte. Keep this in mind because we will use bit shifting and recursion on this property for the translation.
The code
[edit | edit source]First we need to import the ord function from Data.Char and the bitwise operators .&. (bitwise AND), .|. (bitwise OR), and shiftL (bitwise shift left) from Data.Bits.
> import Data.Bits > import Data.Char
To help document the code we define UTF8 type synonyms for character and string datatypes. The type synonyms will make it clear which function argument, or return type contains what sort of data.
> type Utf8Char = Char > type Utf8String = String
Here we define a helper function that will be used later on in the code. The function determines if a value is inclusively between two values.
> within :: Ord a => a -> (a, a) -> Bool > within value (min, max) = (min <= value) && (value <= max) > consumeUtf8 :: Int -> Int -> Utf8String -> String > consumeUtf8 ordinal 0 remainder = (chr ordinal):(fromUtf8 remainder) > consumeUtf8 ordinal n (next:remainder) = > consumeUtf8 newOrdinal (n-1) remainder > where > nextOrdinal = (ord next) .&. (ord '\x3f') > newOrdinal = (ordinal `shiftL` 6) .|. nextOrdinal > consumeUtf8 _ n [] = > error "Incomplete character. Expecting " ++ (show n) ++ " more bytes." > fromUtf8 :: Utf8String -> String > fromUtf8 [] = [] > fromUtf8 (lead:remainder) > | lead `within` asciiRange = fromUtf8 remainder > | lead `within` plus1Range = consumeUtf8 (ordinal plus1Mask) 1 remainder > | lead `within` plus2Range = consumeUtf8 (ordinal plus2Mask) 2 remainder > | lead `within` plus3Range = consumeUtf8 (ordinal plus3Mask) 3 remainder > | otherwise = error $ "Invalid lead byte: " ++ (show lead) > where > asciiRange = ('\x00', '\x7f') > badRange00 = ('\x80', '\xbf') -- Only tail bytes in this range > badRange01 = ('\xc0', '\xc1') -- Prevent duplicating code points <= 127 > plus1Range = ('\xc2', '\xdf') > plus2Range = ('\xe0', '\xef') > plus3Range = ('\xf0', '\xf4') > badRange02 = ('\xf5', '\xff') -- Illegal in Unicode, or RFC 3629 > plus1Mask = '\x1f' > plus2Mask = '\x0f' > plus3Mask = '\x07' > ordinal mask = (ord lead) .&. (ord mask)