Anirudh Zala's Blog: culture

#1 L10N overview

Internationalization and Localization are means of adapting products such as publications, hardware and software for non-native environments, especially other nations and cultures.

Once you implement i18n, l10n knocks at your door as well. If you allow users from different countries and cultures to use your software, they will expect that, apart from translation of the language, real data is also presented according to their local conventions. This expectation is reasonable because different countries use different standards for displaying dates, currencies, units and so on. For example, people in the US are generally less familiar with kilometers because they use the mile as their unit of distance, whereas for people in India the kilometer is quite familiar. The more cultures there are, the more variety one sees in communication, in the display of information, and so on.

#1.1 Locale and formats

Before implementing l10n we first need a proper understanding of the terms locale and format. A locale represents a whole culture: it contains information about how to display dates, how to show currencies, which measurement units to use for conversions, and so on. For example, for an Indian locale it looks like this:

Date display: DD-MM-YYYY
Currency: 1,11,111
Unit to measure distance: Km

For a US locale it can look like this:

Date display: MM-DD-YYYY
Currency: 111,111
Unit to measure distance: Mile

Hence a locale should be seen in a broad way, as it represents a set of various localized items. Sometimes, however, users prefer more customization within a locale, and this is where the term format comes in: it allows one more level of customization. For example, a US user might like to change the date format from 02-01-2007 to February 1st 2007. In short, a locale includes the language, default formats, glyphs and other instructions for a particular culture, while a format is nothing but a different representation of the same values. Overall, a locale can have more than one format.
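To make the two locale profiles listed above more concrete, here is a minimal sketch of the number grouping difference. The function name groupDigits() and the style labels 'indian' and 'western' are hypothetical and exist only for this illustration; the sketch also assumes non-negative whole numbers.

```php
<?php
// Hypothetical helper: group the digits of a non-negative integer for display.
// 'indian'  : last three digits, then groups of two (1,11,111)
// 'western' : groups of three (111,111)
function groupDigits(int $value, string $style): string
{
    $digits = (string) $value;
    if ($style !== 'indian' || strlen($digits) <= 3) {
        return number_format($value);
    }
    $lastThree = substr($digits, -3);
    $rest = substr($digits, 0, -3);
    // Split the remaining digits into pairs, counting from the right.
    $pairs = array_map('strrev', array_reverse(str_split(strrev($rest), 2)));
    return implode(',', $pairs) . ',' . $lastThree;
}

echo groupDigits(111111, 'indian'), "\n";  // 1,11,111
echo groupDigits(111111, 'western'), "\n"; // 111,111
```

The stored value is the same in both calls; only its representation changes, which is the distinction between a value and its localized display.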

#2 How to implement it?

While implementing l10n in software, the administrator or team members first need to determine how many locales and formats should be supported. The chosen locales and formats can be stored in the file system or in a database. Once that is decided, l10n can be implemented at 2 levels.

#2.1 Backend

Software normally has an administrative area from where the whole application is managed. This area should be used to select the formats available for each locale.

From the selected formats, 1 default format should be chosen for each l10n entity. This default format applies to the whole client area of the software unless it is overridden at user level.

For certain entities, such as number format and currency format, only 1 format should be set and a user-level option may not be allowed. It should also be kept in mind that formats of one locale should not be used in another locale.

For software like FS and Flog, where the client is registered from the administrative area, user-level locales and formats can be selected there directly.

#2.2 Frontend

In the client area, if the user is not given the option to set his/her own locale and formats, or if the option is provided but the user is not logged in, then the locale and formats set as default in the administrative area should be used.
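The fallback rule described above can be sketched as a small function. The function name resolveSettings() and the array keys are hypothetical; the sketch assumes that defaults and user preferences are both plain associative arrays.

```php
<?php
// Hypothetical shape: $adminDefaults comes from the administrative area,
// $userPreferences is null for visitors who are not logged in.
function resolveSettings(array $adminDefaults, ?array $userPreferences, bool $userOverrideAllowed): array
{
    if ($userPreferences === null || !$userOverrideAllowed) {
        return $adminDefaults;
    }
    // Drop preferences the user has not set, so the admin defaults fill the gaps.
    $chosen = array_filter($userPreferences, fn($value) => $value !== null);
    return array_merge($adminDefaults, $chosen);
}

$defaults = ['locale' => 'en_US', 'date_format' => 'm-d-Y'];
print_r(resolveSettings($defaults, ['locale' => 'fi_FI', 'date_format' => null], true));
```

Keeping the resolution in one place means the rest of the client area never needs to know whether a value came from the administrator or from the user.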

#2.2.1 Setting locale

While implementing l10n in PHP based software, developers need to set the locale first. The locale can be decided from the selected language. For example, if the user selects English then the locale should be set to en_US; for Finnish it should be set to fi_FI. To set the locale in PHP, use the function setlocale(). You can set the locale for various categories, such as monetary items, dates or messages. Please refer to the PHP manual for more details about how to use this function.

Various PHP functions behave differently depending on the locale, for example strcoll(), strftime() and localeconv(). Note that date() is not one of them: it ignores the locale and always outputs English names.
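Since locale names differ between systems and a locale may not be installed at all, it can help to try several candidates. The sketch below assumes a hypothetical mapping LANGUAGE_LOCALES and a hypothetical helper applyLanguage(); only setlocale() itself is a PHP function.

```php
<?php
// Hypothetical mapping from the language a user selected to candidate locale names.
// Several spellings are listed because naming varies between systems.
const LANGUAGE_LOCALES = [
    'en' => ['en_US.UTF-8', 'en_US'],
    'fi' => ['fi_FI.UTF-8', 'fi_FI'],
];

function applyLanguage(string $language): string
{
    $candidates = LANGUAGE_LOCALES[$language] ?? [];
    // The 'C' locale is used as a last resort.
    $candidates[] = 'C';
    // setlocale() accepts an array of names, tries them in order, and
    // returns the locale it managed to set, or false if none worked.
    $applied = setlocale(LC_TIME, $candidates);
    return $applied === false ? 'C' : $applied;
}

echo applyLanguage('fi'), "\n";
```

The return value tells the caller which locale is actually in effect, which is worth logging when localized output unexpectedly appears in English.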

#2.2.2 Displaying data in various formats based upon locales

Once the locale is set for a particular language, locale related functions behave differently. For example, the code below displays the day name in a different language for each locale. The code remains the same but the information is displayed differently.

// Displays “Wednesday” for English language.
setlocale(LC_TIME,'C');
echo strftime('%A');

// Displays “keskiviikko” for Finnish language.
setlocale(LC_TIME,'fi_FI');
echo strftime('%A');

// Displays “mercredi” for French language.
setlocale(LC_TIME,'fr_FR');
echo strftime('%A');

// Displays “Mittwoch” for German language.
setlocale(LC_TIME,'de_DE');
echo strftime('%A');

// Displays “बुधवार” for Hindi language.
setlocale(LC_TIME,'hi_IN');
echo strftime('%A');

Similarly, locales can be set for entities like currency, number format etc. To set the locale for all entities, the constant LC_ALL should be used.

At code level there can be problems when implementing different formats, because the default format differs from locale to locale. A single hard-coded specifier string, as in the code above, therefore does not serve our purpose for full dates. See the example below.

// Displays 'Friday December 22 1978' in English.
setlocale(LC_ALL, 'en_US');
echo strftime('%A %B %d %Y', mktime(0, 0, 0, 12, 22, 1978))."\n";

// Displays 'perjantai 22 joulukuu 1978' in Finnish.
setlocale(LC_ALL, 'fi_FI');
echo strftime('%A %d %B %Y', mktime(0, 0, 0, 12, 22, 1978))."\n";

// Displays 'vendredi 22 décembre 1978' in French.
setlocale(LC_ALL, 'fr_FR');
echo strftime('%A %d %B %Y', mktime(0, 0, 0, 12, 22, 1978))."\n";

// Displays 'Freitag 22 Dezember 1978' in German.
setlocale(LC_ALL, 'de_DE');
echo strftime('%A %d %B %Y', mktime(0, 0, 0, 12, 22, 1978))."\n";

// Displays '22 दिसम्बर शुक्रवार 1978' in Hindi.
setlocale(LC_ALL, 'hi_IN');
echo strftime('%d %B %A %Y', mktime(0, 0, 0, 12, 22, 1978))."\n";

In this example each locale uses a different format. To make the implementation easy at code level, we should store the conversion specifier in a database or file and pass it directly to the function. For example, for the Finnish locale the conversion specifier %A %d %B %Y would be stored as a string and passed to the function as shown above. The same kind of conversion specifier can be used for all formats of all entities.
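A minimal sketch of that lookup follows. The table contents mirror the specifiers used in the example above; the variable $dateSpecifiers and the function dateSpecifierFor() are hypothetical names, and the array stands in for whatever database table or file the project actually uses.

```php
<?php
// Hypothetical table, as it might be loaded from a database or a file.
// Keys are locale names; values are strftime()-style conversion specifiers.
$dateSpecifiers = [
    'en_US' => '%A %B %d %Y',
    'fi_FI' => '%A %d %B %Y',
    'hi_IN' => '%d %B %A %Y',
];

// Return the stored specifier for a locale, or the fallback locale's
// specifier when the requested locale has no entry.
function dateSpecifierFor(string $locale, array $table, string $fallbackLocale = 'en_US'): string
{
    return $table[$locale] ?? $table[$fallbackLocale];
}

echo dateSpecifierFor('fi_FI', $dateSpecifiers), "\n"; // %A %d %B %Y
```

The returned string is what gets passed to strftime() together with the timestamp. Note that strftime() is deprecated as of PHP 8.1, so on current PHP versions the same lookup idea would feed a different formatter with its own pattern syntax.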

#3 Limitations of l10n

Native support for l10n in a scripting language or database is limited to displaying information in different formats and glyphs. It does not actually convert values according to the locale. For example, if the price of an item is stored in $, then that price, when displayed to a user who has selected the Finnish language (or locale), won't automatically be converted into his/her own currency (i.e. €). This is because the conversion rate between the 2 currencies changes constantly.

For such cases, l10n has to be implemented in a customized way in your software: unit conversion functions can be built and used according to the chosen format. The information itself should be stored in the database in only one format and should be converted and formatted only when it is displayed to users.
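As a sketch of such a conversion function, the example below assumes distances are always stored in kilometres and converted only for display. The constant DISTANCE_UNIT, its locale-to-unit mapping and the function formatDistance() are hypothetical.

```php
<?php
// Assumption for this sketch: distances are always stored in kilometres.
const KM_PER_MILE = 1.609344;

// Hypothetical mapping of locale to the unit shown to the user.
const DISTANCE_UNIT = ['en_US' => 'mi', 'hi_IN' => 'km', 'fi_FI' => 'km'];

function formatDistance(float $kilometres, string $locale): string
{
    // Unknown locales fall back to the stored unit.
    $unit = DISTANCE_UNIT[$locale] ?? 'km';
    $value = $unit === 'mi' ? $kilometres / KM_PER_MILE : $kilometres;
    return number_format($value, 1) . ' ' . $unit;
}

echo formatDistance(160.9344, 'en_US'), "\n"; // 100.0 mi
echo formatDistance(160.9344, 'hi_IN'), "\n"; // 160.9 km
```

Distance works well here because the conversion factor is fixed. Currency is harder precisely because the rate changes, so a currency version would need a rate source in addition to the formatting step.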

There is one exception, in displaying date and time, which can be displayed with different values if time zone related functions are used. Normally software logs date and time in its own time zone, but the user may be located in a different country where the date/time differs from the server time. In such cases the software should provide an option to select a timezone so that date/time can be displayed with localized values. Such an option is essential for software that provides email services.
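A minimal sketch of timezone-aware display, assuming timestamps are stored in UTC and the user's timezone identifier is kept in his/her settings. The function name toUserTime() is hypothetical; DateTimeImmutable and DateTimeZone are standard PHP classes.

```php
<?php
// Assumption: timestamps are stored in UTC and converted only for display.
function toUserTime(string $storedUtc, string $userTimezone): string
{
    $utc = new DateTimeImmutable($storedUtc, new DateTimeZone('UTC'));
    return $utc->setTimezone(new DateTimeZone($userTimezone))->format('Y-m-d H:i');
}

echo toUserTime('2007-02-27 12:00:00', 'Asia/Kolkata'), "\n";    // 2007-02-27 17:30
echo toUserTime('2007-02-27 12:00:00', 'Europe/Helsinki'), "\n"; // 2007-02-27 14:00
```

As with units, the stored value stays in one form and only the displayed value depends on the user.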

#4 Links

http://en.wikipedia.org/wiki/l10n
http://www.useit.com/alertbox/9608.html

#1 I18N overview

Internationalization and localization are means of adapting products such as publications, hardware and software for non-native environments, especially other nations and cultures.

I18n covers many non-English and non-European languages, such as Hindi and Gujarati, that require multiple bytes to store a character. To support such languages, software should use the utf-8 encoding scheme to input, process, store, search and output data in the same language.

#2 How to implement it?

Just a few years ago i18n was a headache for developers because of limited support from databases, scripting languages, browsers, operating systems and other middle layers. Nowadays, with transparent support for utf-8 at each layer, it has become easy to implement.

In this document I describe the steps to implement i18n for LAMP based software with MySQL 4.1 and higher and PHP 5.0 and higher. With lower versions, certain steps may not work.

#2.1 Server side

This section covers the changes that are to be made at server side.

#2.1.1 OS level

At OS level, the only requirement is that the OS supports utf-8 encoding, which modern OSes like FC1...6, CentOS etc. do very well.

#2.1.2 Database level

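At database level, you should use utf8 as the default character set for database communication, together with a utf8_* collation. To do so, add the following entries to the [mysqld] section of my.cnf (the MySQL configuration file). Note that on MySQL 5.5 and later the server option is named character-set-server; default-character-set is no longer accepted in [mysqld].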

# To support Asiatic languages use utf-8. 

init-connect='SET NAMES utf8' 

default-character-set=utf8 

Sometimes a customized my.cnf has more sections, like [client]. In that case add the entry below to the [client] section as well.

# To support Asiatic languages use utf-8. 

default-character-set=utf8 

After making the above entries, restart the MySQL service. Whenever you create a new database, use utf8 as its character set and a utf8_* collation. If you have set the above 2 values this is not strictly required, but it is still advisable to check, because dumps imported from different versions of MySQL may carry their own settings.

To test what is set, run the SQL queries below.

SHOW VARIABLES LIKE 'character_set%';

SHOW VARIABLES LIKE 'collation%';

In the output, the client, connection, results, database and server values should mention utf8. Sometimes it is not possible to add the above entries to my.cnf, especially on a shared hosting server. In that case, execute the SQL query below before any other query (in your PHP script).

SET NAMES 'utf8';

It does the same thing as the entries we added to my.cnf, except that it is applied at runtime and only to the current connection. The database and tables must still be created with the utf8 character set and a utf8_* collation.
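As an illustration of running this per connection from PHP, here is a sketch that assumes the PDO extension with the MySQL driver is used for database access, which the original text does not specify. The helper names utf8Dsn() and connectUtf8() are hypothetical.

```php
<?php
// Hypothetical helper: build a PDO DSN that requests a utf8 connection.
function utf8Dsn(string $host, string $database): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8', $host, $database);
}

// Hypothetical helper: open the connection and issue SET NAMES before any
// other query, as described above. Requires the pdo_mysql driver.
function connectUtf8(string $host, string $database, string $user, string $password): PDO
{
    $pdo = new PDO(utf8Dsn($host, $database), $user, $password);
    $pdo->exec("SET NAMES 'utf8'");
    return $pdo;
}

echo utf8Dsn('localhost', 'myproject'), "\n";
```

Wrapping the connection in one function keeps the SET NAMES call from being forgotten in individual scripts.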

#2.1.3 PHP script level

This is the main area where important changes are to be made. PHP's native string functions do not handle multibyte text, so we have to use certain extensions to fulfill our requirements. These extensions are iconv and mbstring. Of these 2, mbstring is popular and works very well. As the mbstring extension is not part of the standard PHP installation, we need to enable it manually.

If you have set up your web server using utilities like YUM then it is very easy. Just run the command below as the root user and restart the httpd service.

[root@mypc ~]# yum install php-mbstring 

For manual installation, pass the following option to the configure script to enable all the supported languages.

--enable-mbstring=all 

Once this extension is enabled in PHP, we need to set certain directives to make it work. These directives can be set in php.ini for global usage, in httpd.conf for host wise usage and in the PHP script itself for page or project wise usage. I recommend setting them in the PHP script itself so that their effect remains limited to the specific application or project. See the section below for the implementation.

Apart from these changes, all your PHP and other required scripts should be saved in utf-8 encoding, because we store static i18n data in flat files (like fl_fi.inc.php). Editors need to be configured accordingly. Most popular editors provide a character set encoding setting, so select utf-8 as the standard encoding.

#2.1.4 Application level

At application level, you need to set certain directives using the function ini_set() to configure the mbstring extension. These directives are:

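// Settings for i18n support. 
// Note: mbstring.func_overload can only be set in php.ini or httpd.conf, and
// mbstring.encoding_translation only at per-directory level (php.ini, httpd.conf,
// .htaccess), so ini_set() does not change them at runtime.
// mbstring.func_overload was removed in PHP 8.0.

ini_set('mbstring.internal_encoding','utf-8'); 

ini_set('mbstring.func_overload',7); 

ini_set('mbstring.encoding_translation',1); 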

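If your application uses output buffering (i.e. output_buffering=On) then the output handler should be mb_output_handler. The output_handler directive is a per-directory setting, so set it in php.ini, httpd.conf or .htaccess, or start the buffer with that handler from the script like below:

ob_start('mb_output_handler'); 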

Once these directives are set, all string related, regular expression related and mail related functions will work transparently for all kinds of languages. For more information about the above directives, refer to the PHP manual.
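To see why byte-oriented string functions are not enough for multibyte text, the sketch below compares strlen() with mb_strlen() on a UTF-8 string. It assumes the mbstring extension is loaded and that function overloading is not active; the helper name lengths() is hypothetical.

```php
<?php
// Compare byte-oriented and character-oriented functions on a UTF-8 string.
// Assumes the mbstring extension is loaded and function overloading is off.
function lengths(string $text): array
{
    return [
        'bytes' => strlen($text),                  // counts bytes
        'characters' => mb_strlen($text, 'UTF-8'), // counts code points
    ];
}

$name = 'ઝાલા';
print_r(lengths($name));
echo mb_substr($name, 0, 2, 'UTF-8'), "\n"; // first two code points
```

Each Gujarati code point takes three bytes in UTF-8, which is why the two counts differ and why cutting the string by byte offsets can split a character in half.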

While sending emails, utf-8 should be set as the character set encoding in the headers. Similarly, while outputting HTML/XHTML to the browser you need to explicitly set the character set encoding in a meta tag like below:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Moreover, if your application outputs content to the browser directly from a running script, you also need to send a header like below:

<?php header('Content-Type: text/html; charset=utf-8'); ?>

In short wherever and whenever character set encoding is required, pass utf-8 as encoding scheme.

If you use Ajax in your website, you also need to encode your Ajax query string before submitting it to the server. This is especially required when the application runs in MSIE browsers; Firefox and other Gecko based browsers handle Ajax queries that contain i18n data correctly. For more info read the thread http://news.php.net/php.i18n/1059.

#2.2 Client side

At client side only minimal support is required.

#2.2.1 OS level

The client's OS should be able to understand utf-8 encoding, which is the case in recent versions of most OSes. Apart from that, the OS should support typing characters in the native language, which is handled by its keyboard layout and regional settings.

#2.2.2 Browser level

The only requirement at browser level is that the browser can interpret utf-8 encoding, which almost all modern browsers support. The content sent from the server (refer to section 2.1.4) tells the browser which encoding scheme is to be used.

#3 How to implement in existing projects?

Implementing i18n in existing projects requires the above implementation plus conversion of the existing data.

#3.1 Converting database

#3.1.1 Using commands

To convert an existing database into utf-8, take a dump of the data by running the command below.

mysqldump -hHOST -uUSER -p --opt --default-character-set=CHARSET --skip-set-charset DB_NAME|sed -e 's/SOURCE_CHARSET/DEST_CHARSET/g' > DB_NAME.sql 

It will create an SQL file containing all data of the selected database, which you can use to re-create the database. Now convert this file into utf-8 by running the command below.

iconv -f SOURCE_CHARSET -t UTF-8 DB_NAME.sql > UTF_DB_NAME.sql 

To load the above data into the newly created database, run the command below:

mysql -hHOST -uUSER -p --default-character-set=DEST_CHARSET DB_NAME < UTF_DB_NAME.sql 

#3.1.2 Converting database directly

Sometimes it is not possible to do this using the command line interface (for example in a shared hosting environment). In that case certain SQL queries have to be executed to change the character set of the existing database without copying or moving it anywhere. To do so, run the SQL queries below wherever applicable. Please note the order of execution of the queries.

# First change all i18n fields of all tables into BLOB.

ALTER TABLE TABLE_NAME MODIFY FIELD_NAME BLOB; 

# Now change character set of database as UTF-8. 

ALTER DATABASE DB_NAME charset=utf8; 

# Then change character set of each table.

ALTER TABLE TABLE_NAME charset=utf8; 

# Now change all i18n fields of all tables into UTF-8 character set.

ALTER TABLE TABLE_NAME MODIFY FIELD_NAME ORIG_FIELD_TYPE CHARACTER SET utf8; 

Ideally the above queries should be run from a shell or PHP script so that it can be reused later or for other projects.
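A sketch of such a PHP script is shown below. It only generates the statements, in the order given above, so that they can be reviewed before being executed. The function name utf8ConversionQueries() and the shape of its input array are hypothetical.

```php
<?php
// Hypothetical input: table name => [field name => original column type].
// Returns the statements in the same order as the manual steps:
// fields to BLOB, database charset, table charsets, fields back to utf8.
function utf8ConversionQueries(string $database, array $tables): array
{
    $toBlob = [];
    $tableCharset = [];
    $toUtf8 = [];
    foreach ($tables as $table => $fields) {
        foreach ($fields as $field => $type) {
            $toBlob[] = "ALTER TABLE `$table` MODIFY `$field` BLOB;";
            $toUtf8[] = "ALTER TABLE `$table` MODIFY `$field` $type CHARACTER SET utf8;";
        }
        $tableCharset[] = "ALTER TABLE `$table` DEFAULT CHARACTER SET utf8;";
    }
    $databaseCharset = ["ALTER DATABASE `$database` CHARACTER SET utf8;"];
    return array_merge($toBlob, $databaseCharset, $tableCharset, $toUtf8);
}

$queries = utf8ConversionQueries('myproject', ['user' => ['last_name' => 'VARCHAR(50)']]);
echo implode("\n", $queries), "\n";
```

The table and field names are interpolated without validation, so the input should come from a trusted list such as the project's own schema, not from user input.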

#3.2 Converting file system

You will also need to change the encoding of existing PHP scripts and other files if they contain data that requires utf-8 encoding. Normally only files containing non-English text need to be converted, but it is recommended to use the same encoding for all types of files throughout your software. To change the encoding of a file, the iconv utility can be used. Write the output to a new file and then replace the original, because redirecting the output onto the input file would empty it before it is read. One example is provided below.

iconv -f SOURCE_CHARSET -t UTF-8 FILENAME > FILENAME.utf8 && mv FILENAME.utf8 FILENAME 

You can also use the mb_* functions of PHP (for example mb_convert_encoding()) to change the character set encoding of your files. As these are PHP functions, you need to write a script to convert all existing files.
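A minimal sketch of such a script for a single file follows. It assumes the mbstring extension is loaded and that the source character set is known in advance; the function name convertFileToUtf8() and the temporary file suffix are hypothetical.

```php
<?php
// Hypothetical helper: re-encode one file to UTF-8 in place.
// Assumes the mbstring extension is loaded and the source charset is known.
function convertFileToUtf8(string $path, string $sourceCharset): bool
{
    $original = file_get_contents($path);
    if ($original === false) {
        return false;
    }
    $converted = mb_convert_encoding($original, 'UTF-8', $sourceCharset);
    // Write to a temporary file first so a failed write cannot truncate the original.
    $temporary = $path . '.utf8.tmp';
    if (file_put_contents($temporary, $converted) === false) {
        return false;
    }
    return rename($temporary, $path);
}
```

To convert a whole project, this function could be called for each file found while walking the directory tree. Running it twice on the same file would corrupt the text, so it is safer to work on a copy or to record which files have already been converted.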

#3.3 Translating your project

When you design multilingual projects, it becomes mandatory to handle the static text of your application dynamically. There are several options for that, depending upon the tools used to build the application. Symfony based projects can use XLIFF as the standard mechanism to translate static data. If we want such a mechanism in any PHP application, there are 2 ways to do so.

The first way is to declare PHP variables wherever static text is displayed, put those variables in language files (one per language used in the application), and include the right file in the scripts where the translations are required. This is the standard way, but it has some disadvantages. Since the files are PHP based, third party translators, who may come from a non-technical background, will find it difficult to add or update translations. A more professional mechanism exists to overcome this problem.
The second way, preferred for large applications where translators come from various cultures and backgrounds, is to use the PHP extension called Gettext (packaged as php-gettext).
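The first approach can be sketched as follows. This version uses one array per language instead of individual variables, which is a small variation on what is described above. The variable $catalogues, its contents and the function translate() are hypothetical.

```php
<?php
// Hypothetical catalogues; in a real project each array would live in its own
// language file (one per language) and be loaded with include.
$catalogues = [
    'en' => ['greeting' => 'Welcome', 'logout' => 'Log out'],
    'fi' => ['greeting' => 'Tervetuloa'],
];

// Look up a key in the selected language, then in the fallback language,
// and finally return the key itself so missing translations stay visible.
function translate(string $key, string $language, array $catalogues, string $fallback = 'en'): string
{
    return $catalogues[$language][$key] ?? $catalogues[$fallback][$key] ?? $key;
}

echo translate('greeting', 'fi', $catalogues), "\n"; // Tervetuloa
echo translate('logout', 'fi', $catalogues), "\n";   // Log out
```

The drawback mentioned above is visible here: a translator has to edit PHP syntax directly, and a single misplaced quote breaks the whole file.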

To check whether it is installed, run the command php -m from the command line. If the output contains gettext then it is installed. Otherwise it can be installed in various ways, the most popular being the Yum utility. Just type the command below as the root user and that's it.

[root@mypc ~]# yum install php-gettext 

php-gettext works in this way. A language file is prepared using special editors like KBabel or poEdit, in which all translations are kept as pairs of 2 special entries, msgid and msgstr. msgid denotes the identifier to be used in the PHP script, while msgstr denotes the translation associated with that msgid. Using pairs of msgid and msgstr, we can build as many translations as required for any language. There is a separate file for each language. This version of the file has the .po extension and is only meant for editing translations. It is not the file that PHP actually uses to apply the translations; for that, the .po file needs to be compiled into a binary .mo file. The following command can be used.

[user@mypc ~]# msgfmt -cv -o FILE.mo FILE.po 

Normally both files are kept at the same location for easy maintenance. Now that we have generated the translation files, I will describe how to use them in your application. The following code needs to be added to your globally included script to bind the language files to your application.

// Set environment variable. 

putenv('LC_ALL=en_US'); 

// Set selected locale (language). 

setlocale(LC_ALL, 'en_US'); 

// Specify location of translation tables. 

bindtextdomain('frontend', '/web/projects/myproject/translations/'); 

bindtextdomain('backend', '/web/projects/myproject/translations/'); 

// Choose domain for application. 

textdomain('frontend'); 

// Bind specific character set to be used with selected domain. 

bind_textdomain_codeset('frontend','UTF-8'); 

bind_textdomain_codeset('backend','UTF-8'); 

In the above code snippet, the locale name, the domain names (frontend and backend) and the path of the translations folder are to be replaced according to your application's needs. Here en_US is a locale, which can be changed as required.

Since gettext accesses translation files in a special way, they need to be stored according to the rules defined by gettext. With reference to the above code snippet, if the root folder of the translation files is /web/projects/myproject/translations/ and the application areas are admin (the backend domain) and client (the frontend domain), then the language files should be stored in the following way. The names in brackets are the editable .po source files.

/web/projects/myproject/translations/en_US/LC_MESSAGES/backend.mo, [admin.po]
/web/projects/myproject/translations/en_US/LC_MESSAGES/frontend.mo, [client.po]

/web/projects/myproject/translations/en_GB/LC_MESSAGES/backend.mo, [admin.po]
/web/projects/myproject/translations/en_GB/LC_MESSAGES/frontend.mo, [client.po]

/web/projects/myproject/translations/fi_FI/LC_MESSAGES/backend.mo, [admin.po]
/web/projects/myproject/translations/fi_FI/LC_MESSAGES/frontend.mo, [client.po]

Hence by switching the value passed to the function textdomain(), the desired translation file can be selected in each application. Now, how do we use these translations in a PHP script? Each msgid should be written like _('myVar') in the PHP script to pull the corresponding translation from the selected file.

That's it. Whenever translation files are modified, they need to be recompiled using the msgfmt command as described earlier to make the latest translations available. Since compiled binary .mo files are cached by PHP, modifications might not be reflected immediately. In that case the web service should be gracefully restarted.

#4 Summary

The advantage of implementing i18n using utf-8 encoding is that users can input data in their own language and that content is saved and displayed back in the browser in the same language. Not only that, the database can also search records in the native language. For example, if you need to retrieve the records of all users whose last name is “ઝાલા”, then SQL like below will work successfully.

SELECT * FROM user WHERE last_name='ઝાલા'; 

At PHP script level you can compare, sort and split Unicode strings in the same way as normal strings. At the time of writing, the next version of PHP (i.e. PHP 6) was planned to support Unicode by default, which would have removed the need for extensions or settings to enable Unicode strings; that version was never released, so the mbstring based approach described here still applies.

#5 Links

http://en.wikipedia.org/wiki/I18n
http://www.useit.com/alertbox/9608.html
http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html
http://www.nyphp.org/php-presentations/90_Timezones-Internationalization-Localization-Character-Sets-PHP-4-5

27 Feb 2007

Localization

23 Feb 2007

Internationalization