FOSS Open Standards/Standards and Internationalization/Localization of Software

Internationalization and Localization of Software

The internationalization of a product, such as software, is not the same as its localization although they may address many similar issues. Internationalization refers to the process whereby a product is made or adapted so that it can be used internationally (i.e., in different countries or regions all over the world with different cultures and conventions) without redesign. On the other hand, localization addresses how a product may be tailored for a specific country, region or culture by making it linguistically and culturally appropriate. Internationalization is often referred to using the abbreviation "I18N" or "i18n", where the number 18 refers to the number of letters omitted. Similarly, the abbreviation "L10N" or "l10n" is used for localization.
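
The abbreviation rule can be written out as a small function. This is only an illustrative sketch; the function name `numeronym` is an assumption made for this example and does not come from any standard.

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as first letter + number of omitted letters + last letter."""
    if len(word) < 3:
        return word  # nothing between the first and last letter to omit
    return f"{word[0]}{len(word) - 2}{word[-1]}"


print(numeronym("internationalization"))  # i18n
print(numeronym("localization"))          # l10n
```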

It is important that application software that is meant for deployment in many different countries with different cultures and languages be designed with internationalization in mind, to be able to accommodate possibly different ways of expressing an item of information or peculiarities of a different language. Some of the issues that internationalization needs to grapple with include:[1]

  1. Date and time formats
  2. Currency format
  3. Language peculiarities (e.g., alphabets, numerals and left-to-right script vs. right-to-left)
  4. Language character coding sets for textual display
  5. Names and titles
  6. Sorting of names and text
  7. Identification numbers, e.g. social security and passport numbers
  8. Telephone numbers, addresses and international postal codes
  9. Weights and measures
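
To make the first two items concrete, the sketch below formats the same date and number under three sets of conventions. The `CONVENTIONS` table and both function names are hypothetical and hand-written for this illustration; a real application would normally take such data from a locale database instead of hard-coding it.

```python
from datetime import date

# Hypothetical, hand-written convention table (an assumption for this sketch).
CONVENTIONS = {
    "en_US": {"date": "{m:02d}/{d:02d}/{y}", "thousands": ",", "decimal": "."},
    "de_DE": {"date": "{d:02d}.{m:02d}.{y}", "thousands": ".", "decimal": ","},
    "ja_JP": {"date": "{y}/{m:02d}/{d:02d}", "thousands": ",", "decimal": "."},
}


def format_date(value: date, locale_id: str) -> str:
    """Render a date using the field order and separator of the given locale."""
    pattern = CONVENTIONS[locale_id]["date"]
    return pattern.format(y=value.year, m=value.month, d=value.day)


def format_number(value: float, locale_id: str) -> str:
    """Render a number with two decimals and locale-specific separators."""
    conv = CONVENTIONS[locale_id]
    us_style = f"{value:,.2f}"  # e.g. 1,234,567.89
    # translate() swaps both separators in a single pass, so they cannot clash.
    table = str.maketrans({",": conv["thousands"], ".": conv["decimal"]})
    return us_style.translate(table)


for locale_id in CONVENTIONS:
    print(locale_id, format_date(date(2006, 9, 3), locale_id),
          format_number(1234567.891, locale_id))
```

The program logic is identical for every locale; only the data in the table differs.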

While the cultural and linguistic demands may change from country to country, the core program that implements the functionality of a software product does not change. It is therefore common practice to separate text and other environment-dependent data from the program code itself. This makes it easier to support internationalization, as changes need to be made only to the environment-dependent resources and minimal code changes are required.
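
A minimal sketch of this separation is shown below. The catalogs, keys and the `translate` function are hypothetical names chosen for the example; in practice the catalogs would usually live in separate resource files, one per language, and not in the source code.

```python
# Hypothetical in-memory message catalogs (stand-ins for resource files).
CATALOGS = {
    "en": {"welcome": "Welcome, {name}!", "farewell": "Goodbye."},
    "fr": {"welcome": "Bienvenue, {name} !"},
    "ms": {"welcome": "Selamat datang, {name}!", "farewell": "Selamat tinggal."},
}
DEFAULT_LANGUAGE = "en"


def translate(key: str, language: str, **params: str) -> str:
    """Look up a message by key, falling back to the default language."""
    catalog = CATALOGS.get(language, {})
    template = catalog.get(key) or CATALOGS[DEFAULT_LANGUAGE][key]
    return template.format(**params)


print(translate("welcome", "ms", name="Aminah"))
print(translate("farewell", "fr"))  # no French entry, so the English text is used
```

Adding a language then means adding a catalog; the calling code stays unchanged.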

The better internationalized an application is, the easier it is to localize. This is because a well-internationalized application will have built-in support for the items that localization needs to address. These may include:[2]

  1. Language translation
  2. Hardware support for certain languages, e.g. input devices and methods
  3. Local customs
  4. Local content
  5. Aesthetics
  6. Cultural values and social context.

The major work of localization lies in translating the user interface and documentation, but it involves more than just translating the language used. It also needs to cater to other relevant changes such as the use of appropriate cultural and social values, symbols peculiar to the language, the display of numbers, dates and currency, appropriate input methods, etc.

In software internationalization and localization, a set of parameters, termed a locale, is used to define the user's language, country and any special variant preferences that the user wants to see in the user interface.[3] A locale identifier usually contains at least a language and a region/country identifier. Depending on the operating platform/system used, locale identifiers can be defined in several ways. Most systems utilize the two- and three-letter language codes defined by ISO 639-1 and 639-2, respectively, for the language identifier and the two-letter country codes from ISO 3166-1 for the country identifier. However, MS Windows uses a numeric Locale Identifier (LCID) that specifies the language and sort identifier.[4]
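
As an illustration, the sketch below splits an identifier of the form `language[_territory][.codeset][@modifier]`, a layout commonly seen on Unix-like systems. The pattern is a simplifying assumption for this example: it does not cover every identifier a real system accepts (for instance the special names `C` and `POSIX`), and the names `LocaleId` and `parse_locale` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class LocaleId(NamedTuple):
    language: str             # ISO 639 two- or three-letter code
    territory: Optional[str]  # ISO 3166-1 alpha-2 code
    codeset: Optional[str]    # character encoding, e.g. UTF-8
    modifier: Optional[str]   # variant preference, e.g. euro


# Simplified pattern (an assumption): language[_TERRITORY][.codeset][@modifier]
_PATTERN = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:_(?P<territory>[A-Z]{2}))?"
    r"(?:\.(?P<codeset>[A-Za-z0-9-]+))?"
    r"(?:@(?P<modifier>[A-Za-z0-9]+))?$"
)


def parse_locale(identifier: str) -> LocaleId:
    """Split a locale identifier into its parts; absent parts become None."""
    match = _PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a recognised locale identifier: {identifier!r}")
    return LocaleId(**match.groupdict())


print(parse_locale("th_TH.UTF-8"))
print(parse_locale("de_DE@euro"))
```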

Standards Important to I18N and L10N

In this section we shall look at some important standards which are used in i18n and l10n.

Unicode and ISO/IEC 10646

Proper rendering and display as well as practical input methods for multilingual text on a computer system are essential if efforts to make software available in multiple languages are to be successful. Standards are needed for character code tables and character encoding methods. Character code tables assign integer numbers to characters while character encoding is a method by which characters or their respective integer values can be represented as a sequence of bytes for use by the software.
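
The difference between the two can be seen with Python's built-in string functions: `ord` returns the integer assigned to a character by the code table, while `str.encode` produces the byte sequence for a particular encoding. The helper name `describe` is only an assumption for this sketch.

```python
def describe(char: str):
    """Return (code point, UTF-8 bytes, UTF-16 big-endian bytes) for one character."""
    return ord(char), char.encode("utf-8"), char.encode("utf-16-be")


for char in ("A", "é", "ก", "€"):
    code_point, utf8, utf16 = describe(char)
    print(f"U+{code_point:04X}  UTF-8: {utf8.hex()}  UTF-16BE: {utf16.hex()}")
```

Each character has a single code point, but its byte representation depends on the encoding chosen.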

The international standards ISO/IEC 10646[5] and the Unicode Standard (Unicode)[6] describe and define the Universal Character Set (UCS), which is a superset of all other character set standards. It guarantees round-trip compatibility to other character sets. This means simply that no information is lost in the conversion of any text string to UCS and then back to its original encoding.[7]
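
The round-trip property can be checked for individual strings, as in the sketch below. It uses two legacy encodings that ship with Python's standard codecs (ISO 8859-1 and KOI8-R); the function name `round_trips` is an assumption for this example, and passing the check for a few strings is an illustration, not a proof of the general guarantee.

```python
def round_trips(legacy_bytes: bytes, encoding: str) -> bool:
    """Convert legacy-encoded bytes to Unicode and back, then compare."""
    text = legacy_bytes.decode(encoding)               # legacy encoding -> Unicode
    via_utf8 = text.encode("utf-8").decode("utf-8")    # through a UCS encoding form
    return via_utf8.encode(encoding) == legacy_bytes   # Unicode -> legacy encoding


latin1_sample = bytes([0x63, 0x61, 0x66, 0xE9])   # "café" in ISO 8859-1
cyrillic_sample = "Привет".encode("koi8-r")       # Russian text in KOI8-R

print(round_trips(latin1_sample, "iso-8859-1"))
print(round_trips(cyrillic_sample, "koi8-r"))
```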

The Unicode Standard Version 4.0 and ISO/IEC 10646:2003 make use of the same character set tables and character encoding methods, but the Unicode Standard additionally provides details of character properties, processing algorithms, and definitions that are useful to implementers.[8]

ISO/IEC 10646 and Unicode define several encoding forms: UCS Transformation Format 8 (UTF-8), UCS-2, UTF-16, UCS-4 and UTF-32. In an encoding form, each character is represented as one or more encoding units. Apart from UTF-8, all of these encoding forms have an encoding unit larger than one octet (an 8-bit byte), making them hard to use in many current applications and protocols that assume 8- or even 7-bit characters.[9] UTF-8 uses all bits of an octet for its encoding and preserves the full US-ASCII range: each US-ASCII character is encoded in one octet having the normal US-ASCII value. This is important and very useful, since it makes UTF-8 backward compatible with the large existing volume of software that predominantly uses US-ASCII encoding. UTF-8 encodes UCS characters as a varying number of octets, where the number of octets, and the value of each, depend on the integer value assigned to the character in the Unicode character code table.
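
The way the number and value of the octets depend on the character's integer value can be sketched as follows. The function follows the bit layout described in RFC 3629 (one to four octets, code points up to U+10FFFF, surrogates excluded); the name `utf8_encode` is an assumption for this example, and real programs would simply call the built-in `str.encode("utf-8")`.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 octets."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:       # US-ASCII range: one octet, value unchanged
        return bytes([code_point])
    if code_point < 0x800:      # two octets: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:    # three octets: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # four octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for cp in (0x41, 0xE9, 0x0E01, 0x1F600):
    octets = utf8_encode(cp)
    print(f"U+{cp:04X} -> {len(octets)} octet(s): {octets.hex()}")
```

Only the first branch produces octets below 0x80, which is why plain US-ASCII text is already valid UTF-8.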

Unicode has become the dominant encoding scheme in software internationalization and usage in multilingual environments. Many other standards such as XML have adopted Unicode as the underlying scheme to represent text. Modern operating environments like those under GNU/Linux, Mac OS X and MS Windows XP have support for Unicode.[10]

ISO 639

The international standard ISO 639-1 provides a two-letter code identifier (alpha-2) for the representation of names of languages, while ISO 639-2 provides a three-letter identifier (alpha-3).[11] Locale language identifiers make use of the ISO 639 language codes to identify the language to use.

ISO 639-1 was devised mainly for use in terminology. It provides identifiers for those languages that are responsible for a major proportion of the world's literature and which also possess specialized vocabulary and terminology.

ISO 639-2 aims to represent the world's languages for use in bibliography as well as terminology, and it is not as restrictive in scope as ISO 639-1. It was devised to include the languages that are most frequently represented in the total body of the world's literature, regardless of whether specialized terminologies exist in those languages or not. Because ISO 639-2 uses three-letter codes, it can accommodate more languages. So, while it limits coverage of individual languages to those for which at least modest bodies of literature have been developed, other languages are still accommodated by means of identifiers for collections of languages, such as language families.[12]

Under ISO 639-2, some languages have different codes for bibliography and terminology (see Table 8).

Sample ISO 639-1 and 639-2 Language Codes

639-2*     639-1   Language Name
apa                Apache languages
ara        ar      Arabic
bur/mya    my      Burmese
chi/zho    zh      Chinese
dut/nld    nl      Dutch; Flemish
eng        en      English
hin        hi      Hindi
kar                Karen
kin        rw      Kinyarwanda
tlh                Klingon; tlhIngan-Hol
may/msa    ms      Malay
nep        ne      Nepali
swa        sw      Swahili
tam        ta      Tamil
tha        th      Thai
ton        to      Tonga (Tonga Islands)

For the 639-2 codes, where two codes are provided, the bibliographic code is given first and the terminology code is given second.
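
A lookup over a few of these sample codes might be sketched as below. The table holds only a handful of entries, and the policy of preferring the two-letter code and otherwise the terminology code is an assumption made for this example; `SAMPLE_CODES` and `preferred_code` are hypothetical names.

```python
# (bibliographic alpha-3, terminology alpha-3, alpha-2 or None); sample data only
SAMPLE_CODES = [
    ("bur", "mya", "my"),   # Burmese
    ("chi", "zho", "zh"),   # Chinese
    ("dut", "nld", "nl"),   # Dutch; Flemish
    ("eng", "eng", "en"),   # English (one alpha-3 code serves both purposes)
    ("may", "msa", "ms"),   # Malay
    ("tha", "tha", "th"),   # Thai
    ("apa", "apa", None),   # Apache languages (collection, no alpha-2 code)
]

# Index every known spelling of a code to its full record.
_INDEX = {}
for record in SAMPLE_CODES:
    for code in record:
        if code is not None:
            _INDEX[code] = record


def preferred_code(code: str) -> str:
    """Return the alpha-2 code if one exists, otherwise the terminology alpha-3 code."""
    bibliographic, terminology, alpha2 = _INDEX[code.lower()]
    return alpha2 or terminology


print(preferred_code("chi"), preferred_code("zho"), preferred_code("apa"))
```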

ISO 3166-1

ISO 3166-1 provides two-character (alpha-2) and three-character (alpha-3) codes for representing names of countries. It thus provides a table of country codes just as ISO 639 provides a table of language codes. However, these two standards were developed independently, and there was no attempt to use the same code for a language as for the country in which it is spoken, so codes from each list should be used independently. Locale country identifiers make use of the ISO 3166 codes to identify the country or region.

The ISO 3166-1 alpha-2 code is probably best known for its use as the country code top-level domain (ccTLD) in the Internet Domain Name System (DNS). However, there are several ccTLDs in use which are not part of the ISO 3166-1 two-letter codes, e.g., "uk" for the United Kingdom (the corresponding ISO 3166-1 alpha-2 code is "gb").

Sample ISO 3166-1 Alpha-2 Country Codes

ISO 3166-1 (Alpha-2)   Country/Region
CA                     Canada
DE                     Germany
GB                     United Kingdom
KE                     Kenya
NG                     Nigeria
TH                     Thailand
TN                     Tunisia
VE                     Venezuela
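
The relationship between alpha-2 codes and ccTLDs, including the "uk" exception mentioned above, can be sketched as a lookup. The exception table contains only that one entry and the function name `cctld_for` is hypothetical; the sketch does not check whether a given code is actually assigned or delegated.

```python
# Only the exception named in the text; a complete table is out of scope here.
CCTLD_EXCEPTIONS = {"GB": "uk"}


def cctld_for(alpha2: str) -> str:
    """Map an ISO 3166-1 alpha-2 code to the ccTLD commonly used for it."""
    code = alpha2.upper()
    if len(code) != 2 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"not an alpha-2 code: {alpha2!r}")
    return "." + CCTLD_EXCEPTIONS.get(code, code.lower())


for code in ("CA", "DE", "GB", "KE"):
    print(code, cctld_for(code))
```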

The IETF's RFC 3066[13] describes a language tag for use in cases where it is desired to indicate the language used in an information object, how to register values for use in this language tag, and a construct for matching such language tags. RFC 3066 specifies the use of a two-character language code from ISO 639-1 when one exists; when a language does not have a two-character code assigned, the three-character code is used.

The RFC also specifies the use of optional subtags (e.g., a country code from ISO 3166) and how to register a dialect or variant information with the Internet Assigned Numbers Authority (IANA) when there is no available ISO 639 code.
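
A minimal sketch of such tags and of the matching construct follows. It assumes the RFC 3066 syntax of hyphen-separated subtags of one to eight characters and the prefix rule for language ranges, with `*` matching any tag; the function names are hypothetical, and the sketch does not check subtags against the ISO or IANA registries.

```python
import re

# RFC 3066 syntax: a primary subtag of 1-8 letters, then optional subtags
# of 1-8 letters or digits, separated by hyphens.
_TAG = re.compile(r"^[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*$")


def parse_language_tag(tag: str):
    """Return (language, country or None, remaining subtags) for a tag."""
    if not _TAG.match(tag):
        raise ValueError(f"malformed language tag: {tag!r}")
    primary, *subtags = tag.lower().split("-")
    country = None
    # A two-letter second subtag is interpreted as an ISO 3166 country code.
    if subtags and len(subtags[0]) == 2 and subtags[0].isalpha():
        country = subtags.pop(0).upper()
    return primary, country, subtags


def matches(language_range: str, tag: str) -> bool:
    """True if the range equals the tag or is a prefix of it ending at a hyphen."""
    language_range, tag = language_range.lower(), tag.lower()
    if language_range == "*":
        return True
    return tag == language_range or tag.startswith(language_range + "-")


print(parse_language_tag("en-GB"))
print(matches("en", "en-GB"), matches("en", "eng"))
```

Comparison is case-insensitive, which is why both sides are lowercased before matching.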

As of September 2006, RFC 3066 has been obsoleted by the newer, extended RFC 4646.[14]

Internationalization and Localization Software Initiatives

In the past, the languages supported by a piece of software depended very much on where its authors were from. Much commercial off-the-shelf (COTS) software was therefore written mainly for the English language, owing to the dominance of countries like the USA in this area. In recent times, with the emergence of the Internet and globalization, this predominantly single-language support in popular software is changing. There is growing awareness among software developers and authors that much software can and will be deployed worldwide, and that it is important to be able to adapt the software to the local environment. As a result, there is much better support for internationalization and localization on modern software platforms.

For commercial proprietary software, experience has shown that any localization effort has to be considered in the light of its economic viability and/or other benefits that the effort may bring to the vendor. This means that, in many cases, versions of popular commercial proprietary software are not available for languages or cultures where the commercial returns do not justify the effort. Since FOSS can be freely modified and redistributed, at times all that is needed is for some interested party to take the initiative to localize software that is released as FOSS. This has resulted in many popular FOSS applications (e.g., the Mozilla.org family of products, GNOME, KDE, OpenOffice.org) being localized and made available in many languages, including some rather obscure ones.

The Open Internationalization Initiative

The Open Internationalization (OpenI18N) Initiative [15] is a key initiative under the Free Standards Group.[16] This initiative has several active projects under it. One of them is the OpenI18N Specification which is concerned with the specification for interfaces and functionalities that must be supported by GNU/Linux-like operating systems to run internationalized application software, as well as recommendations for such operating systems to facilitate the development of internationalized application software.[17] Other active projects include:

  1. Linux Internationalization Locale Name Guideline
  2. Common XML Locale Repository (now known as Common Locale Data Repository)
  3. Internet Intranet Input Method Framework
  4. OpenI18N Certification Test Suite
  5. Multilingualization library (m17n-lib)

All the standards, publications and documentation from the OpenI18N Initiative are freely available.

Some FOSS I18n and L10n Initiatives

Most of the FOSS I18N and/or L10N projects are community-driven. Almost all major FOSS projects have good support and tools for I18N and L10N. Local users of the software are encouraged to contribute to the L10N projects.

Mozilla Family

The Mozilla Localization Project (MLP)[18] relies mainly on the FOSS community to make the products from the Mozilla Foundation available for different world cultures and languages. The project focuses on software localization, making use of the underlying internationalization support available in the products.

The software localization projects under MLP include:

  1. Mozilla (aka project Seamonkey) with over 100 languages registered
  2. Mozilla Firefox with over 30 languages registered
  3. Mozilla Thunderbird with over 50 languages registered

GNOME

The aim of the GNOME Translation Project[19] is to translate GNOME applications and documentation into every language in existence. This community-based effort currently has translation projects covering well over 100 languages.

K Desktop Environment

The popular K Desktop Environment (KDE) software also has wide support for its internationalization and localization initiatives.[20] Good guides and documentation are available and, again, community-driven localization projects are well supported and well received. As a result, KDE is currently available in over 100 languages.

OpenOffice.org

OpenOffice.org (OOo) has a framework and tools for both I18N and L10N.[21] OOo is now available in over 70 languages, covering all major languages and cultures of the world as well as some minor ones.

Microsoft Software

The newer versions of software from Microsoft, e.g., Windows XP and MS Office 2003, have good internationalization support and are also available in many localized native versions.

MS Windows XP

Localized versions of MS Windows XP are available in 24 languages and the Multilingual User Interface (MUI) Pack offers more localized user interface languages. The MUI Pack is a set of language-specific resource files that can be added to the English version of MS Windows. Microsoft claims that the total number of languages supported in MS Windows XP is in excess of 140.[22]

MS Office

Localized versions of MS Office 2003 are available in over 35 languages.[23] In addition, the MS Office MUI offers support for other languages for which a localized version is not available.

Footnotes

  1. Wikipedia (the free-content encyclopedia) entry on "Internationalization and localization" http://en.wikipedia.org/wiki/Internationalization_and_localization
  2. Wikipedia (the free-content encyclopedia) entry on "Internationalization and localization" http://en.wikipedia.org/wiki/Internationalization_and_localization
  3. Wikipedia (the free-content encyclopedia) entry on "Locale" http://en.wikipedia.org/wiki/Locale
  4. The Microsoft Developer Network (MSDN), "Locale Identifiers" http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/nls_8sj7.asp
  5. ISO/IEC 10646:2003, "Information technology - Universal Multiple-Octet Coded Character Set (UCS)" http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=39921&ICS1=35&ICS2=40&ICS3=
  6. The Unicode Standard http://www.unicode.org/standard/standard.html
  7. Kuhn, M., "UTF-8 and Unicode FAQ for Unix/Linux" http://www.cl.cam.ac.uk/~mgk25/unicode.html
  8. ISO/IEC 10646:2003, "Information technology - Universal Multiple-Octet Coded Character Set (UCS)" http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=39921&ICS1=35&ICS2=40&ICS3=
  9. RFC 3629, "UTF-8, a transformation format of ISO 10646" http://www.ietf.org/rfc/rfc3629.txt
  10. Wikipedia (the free-content encyclopedia) entry on "Unicode" http://en.wikipedia.org/wiki/Unicode
  11. ISO 639 Frequently Asked Questions (FAQ) http://www.loc.gov/standards/iso639-2/faq.html
  12. ISO 639 Frequently Asked Questions (FAQ) http://www.loc.gov/standards/iso639-2/faq.html
  13. RFC 3066, "Tags for the Identification of Languages" http://www.ietf.org/rfc/rfc3066.txt
  14. RFC 4646, "Tags for Identifying Languages" http://www.ietf.org/rfc/rfc4646.txt
  15. The Open Internationalization Initiative http://www.openi18n.org
  16. The Free Standards Group http://www.freestandards.org
  17. OpenI18N 1.3 Globalization Specification http://www.openi18n.org/docs/pdf/OpenI18N1.3.pdf
  18. The Mozilla Localization Project http://www.mozilla.org/projects/l10n
  19. The GNOME Translation Project http://developer.gnome.org/projects/gtp
  20. KDE Internationalization http://i18n.kde.org
  21. OpenOffice.org L10N and I18N Projects http://l10n.openoffice.org
  22. Windows XP LIP FAQ http://www.microsoft.com/globaldev/DrIntl/faqs/winxp.mspx
  23. Office 2003 Editions Localized Versions http://www.microsoft.com/office/editions/prodinfo/language/localized.mspx