#1 I18N overview
Internationalization and localization are means of adapting products such as publications, hardware and software for non-native environments, especially other nations and cultures.
I18n covers many non-English and non-European languages, such as Hindi and Gujarati, whose characters need multiple bytes to store. To support such languages, software should use the UTF-8 encoding scheme to input, process, store, search and output data in the same language.
#2 How to implement it?
Just a few years ago, i18n was a headache for developers because of limited support from databases, scripting languages, browsers, operating systems and other middle layers. Nowadays, with transparent UTF-8 support at each layer, i18n has become easy to implement.
This document describes the steps to implement i18n for LAMP-based software with MySQL 4.1 or higher and PHP 5.0 or higher. With lower versions, certain steps may not work.
#2.1 Server side
This section covers the changes to be made on the server side.
#2.1.1 OS level
At the OS level, the only requirement is that the OS supports UTF-8 encoding, which modern operating systems such as Fedora Core (FC1 to FC6) and CentOS do very well.
#2.1.2 Database level
At the database level, use UTF-8 as the default character set for database communication, together with a utf8_* collation. To do that, add the following entries to the [mysqld] section of my.cnf (the MySQL configuration file).
# To support Asiatic languages use utf-8.
init-connect='SET NAMES utf8'
default-character-set=utf8
# On MySQL 5.5 and later, use character-set-server=utf8 here instead.
Sometimes a customized my.cnf has more sections, such as [client]. In that case, add the entry below to the [client] section as well.
# To support Asiatic languages use utf-8.
default-character-set=utf8
After making these entries, restart the MySQL service. Whenever you create a new database, use a utf8_* connection collation and UTF-8 as the character set of the file. If you have set the two values above, these changes are not strictly required, but it is still advisable to check them, because you may be loading dumps from different versions of MySQL.
To check what is set, run the SQL queries below.
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';
The values in the output should contain utf8. Sometimes it is not possible to add such entries to my.cnf, especially on a shared hosting server. In that case, execute the SQL query below in your PHP script before any other query.
SET NAMES 'utf8';
It does the same thing as the my.cnf entries, except that it is applied at runtime and only to the current connection. The database and tables must still be created with the UTF-8 character set and a matching collation.
#2.1.3 PHP script level
This is the main area where important changes are to be made. PHP does not handle multibyte strings natively, so we have to use extensions to fulfil our requirements. These extensions are iconv and mbstring; of the two, mbstring is popular and works very well. As the mbstring extension is not enabled in a standard PHP installation, we need to enable it manually.
- If you installed your web server with a package manager such as YUM, this is very easy. Just run the command below as the root user and restart the httpd service.
[root@mypc ~]# yum install php-mbstring
- For a manual installation (compiling PHP from source), pass the following configure option to enable all the supported languages.
--enable-mbstring=all
Once the extension is enabled in PHP, we need to set certain directives to make it work. These directives can be set in php.ini for global use, in httpd.conf for per-host use, or in the PHP script itself for per-page or per-project use. I recommend setting them in the PHP script itself so that their effect remains limited to a specific application or project. See the Application level section for the implementation.
Apart from these changes, all your PHP and other required scripts should be saved in UTF-8 encoding, because static i18n data is stored in flat files (like fl_fi.inc.php). Editors therefore need to be configured accordingly. Most popular editors provide a character set encoding setting; select UTF-8 as the standard encoding.
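To find out which files in a project are not yet saved as UTF-8, a small shell function can help. The following is a minimal sketch, assuming the iconv utility is installed; the function name list_non_utf8 is made up for illustration. It relies on iconv exiting with an error when its input is not valid in the declared source encoding.

```shell
#!/bin/sh
# list_non_utf8 DIR (hypothetical helper name):
# print every regular file under DIR whose content is not valid UTF-8.
list_non_utf8() {
    find "$1" -type f | while IFS= read -r file; do
        # Converting UTF-8 to UTF-8 only validates the input;
        # iconv exits non-zero on an illegal byte sequence.
        if ! iconv -f UTF-8 -t UTF-8 "$file" > /dev/null 2>&1; then
            printf '%s\n' "$file"
        fi
    done
}
```

For example, list_non_utf8 /web/projects/myproject would print the paths that still need conversion. Pure ASCII files pass the check, since ASCII is a subset of UTF-8.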
#2.1.4 Application level
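At the application level, you need to set certain directives using the function ini_set() to configure the mbstring extension. These directives are:
// Settings for i18n support.
ini_set('mbstring.internal_encoding', 'UTF-8');
// The two directives below are not changeable at runtime in most PHP versions
// (ini_set() then returns false); in that case set them in php.ini or httpd.conf.
ini_set('mbstring.func_overload', 7);
ini_set('mbstring.encoding_translation', 1);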
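If your application uses output buffering (i.e. output_buffering=On), the output handler should be set to mb_output_handler as shown below:
ini_set('output_handler','mb_output_handler');
// If output_handler cannot be changed at runtime on your PHP version,
// call ob_start('mb_output_handler') before any output instead.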
Once these directives are set, all string-related, regular-expression-related and mail-related functions will work transparently for all kinds of languages. For more information about the directives above, refer to the PHP manual.
When sending emails, UTF-8 should be set as the character set encoding in the headers. Similarly, when outputting HTML/XHTML to the browser, you need to set the character set encoding explicitly in a meta tag like the one below:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Moreover, if your application outputs content to the browser directly from a running script, you need to send a header like the one below:
<?php header('Content-Type: text/html; charset=utf-8'); ?>
In short, wherever and whenever a character set encoding is required, pass UTF-8.
If your website uses Ajax, you also need to encode the Ajax query string before submitting it to the server. This is especially necessary when the application runs in MSIE browsers; Firefox and other Gecko-based browsers handle Ajax queries that contain i18n data correctly. For more information, read the thread http://news.php.net/php.i18n/1059.
#2.2 Client side
On the client side, minimal support is required.
#2.2.1 OS level
The client's OS should be able to understand UTF-8 encoding, which recent versions of operating systems do. Apart from that, the OS should support typing characters in the native language, which is configured through the keyboard layout and regional settings.
#2.2.2 Browser level
The only requirement at the browser level is that the browser can interpret the UTF-8 character set encoding, which almost all modern browsers support. The content sent from the server (refer to section 2.1.4) tells the browser which encoding scheme to use.
#3 How to implement in existing projects?
Implementing i18n in an existing project requires the implementation above plus conversion of the existing data.
#3.1 Converting database
#3.1.1 Using commands
To convert an existing database to UTF-8, take a dump of the data by running the command below.
mysqldump -hHOST -uUSER -p --opt --default-character-set=SOURCE_CHARSET --skip-set-charset DB_NAME | sed -e 's/SOURCE_CHARSET/DEST_CHARSET/g' > DB_NAME.sql
This creates an SQL file containing all the data of the selected database, which you can use to re-create the database. Now convert the file to UTF-8 by running the command below.
iconv -f SOURCE_CHARSET -t UTF-8 DB_NAME.sql > UTF_DB_NAME.sql
To load this data into the newly created database, run the command below:
mysql -hHOST -uUSER -p --default-character-set=DEST_CHARSET DB_NAME < UTF_DB_NAME.sql
#3.1.2 Converting database directly
Sometimes it is not possible to do this from the command line (for example, in a shared hosting environment). In that case, you need to execute SQL queries that change the character set of the existing database without copying or moving it anywhere. Run the SQL queries below wherever applicable. Please note the order of execution.
# First change all i18n fields of all tables into BLOB.
ALTER TABLE TABLE_NAME MODIFY FIELD_NAME BLOB;
# Now change the character set of the database to UTF-8.
ALTER DATABASE DB_NAME charset=utf8;
# Then change the character set of each table.
ALTER TABLE TABLE_NAME charset=utf8;
# Now change all i18n fields of all tables back to their original type, with the UTF-8 character set.
ALTER TABLE TABLE_NAME MODIFY FIELD_NAME ORIG_FIELD_TYPE CHARACTER SET utf8;
Ideally, run these queries from a shell or PHP script so that it can be reused later or for other projects.
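Such a script could start from a function that only prints the statements, so they can be reviewed before anything touches the database. The following is a minimal sketch for a single field; the function name emit_conversion_sql and the database, table and field names in the usage note are made up for illustration. It uses the equivalent CHARACTER SET spelling of the statements above.

```shell
#!/bin/sh
# emit_conversion_sql DB TABLE FIELD TYPE (hypothetical helper name):
# print, without executing, the four conversion statements in the required order.
emit_conversion_sql() {
    db=$1
    table=$2
    field=$3
    coltype=$4
    # 1. Park the field as BLOB so the stored bytes are left untouched.
    printf 'ALTER TABLE `%s` MODIFY `%s` BLOB;\n' "$table" "$field"
    # 2. Switch the database default character set.
    printf 'ALTER DATABASE `%s` CHARACTER SET utf8;\n' "$db"
    # 3. Switch the table default character set.
    printf 'ALTER TABLE `%s` DEFAULT CHARACTER SET utf8;\n' "$table"
    # 4. Restore the original field type, now declared as utf8.
    printf 'ALTER TABLE `%s` MODIFY `%s` %s CHARACTER SET utf8;\n' "$table" "$field" "$coltype"
}
```

Calling emit_conversion_sql shop user last_name 'VARCHAR(100)' > convert.sql writes the statements to a file that can be inspected and then run through whatever SQL interface the host provides.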
#3.2 Converting file system
You will also need to change the encoding of existing PHP scripts and other files if they contain data that requires UTF-8 encoding. Strictly, only files containing non-English text need to be converted, but it is recommended to use the same encoding for all types of files throughout your software. The iconv utility can be used to change the encoding of a file. An example is provided below.
iconv -f SOURCE_CHARSET -t UTF-8 FILENAME > FILENAME.utf8 && mv FILENAME.utf8 FILENAME
You can also use PHP's mbstring functions (for example mb_convert_encoding()) to change the character set encoding of your files. As these are PHP functions, you need to write a script that converts all existing files.
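When converting with iconv, the output must not be redirected straight onto the input file, because the shell truncates that file before iconv reads it. The following is a minimal sketch of an in-place conversion through a temporary file; the function name to_utf8 and the .utf8.tmp suffix are made up for illustration.

```shell
#!/bin/sh
# to_utf8 SOURCE_CHARSET FILE (hypothetical helper name):
# re-encode FILE as UTF-8, leaving it unchanged if the conversion fails.
to_utf8() {
    src_charset=$1
    file=$2
    tmp="$file.utf8.tmp"
    if iconv -f "$src_charset" -t UTF-8 "$file" > "$tmp"; then
        # Copy the result back over the original; this keeps the
        # original file's permissions and ownership.
        cat "$tmp" > "$file" && rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

For example, to_utf8 ISO-8859-1 fl_fi.inc.php converts one file. Run it only on files that are still in the source encoding, since converting an already converted file would garble its non-ASCII characters.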
#3.3 Translating your project
When you design multilingual projects, it becomes necessary to handle the static text of your application dynamically. Several options exist, depending on the tools used to build the application. Symfony-based projects can use XLIFF as the standard mechanism for translating static data. For PHP applications in general, there are two ways to do so.
- Declare PHP variables wherever static text is displayed, define those variables in language files, one per language used in the application, and include the appropriate file in the scripts where the translations are required. This is the standard way, but it has disadvantages. Since the files are PHP code, third-party translators, who may come from a non-technical background, will find it difficult to add or update translations. A more professional mechanism exists to overcome this problem.
- For large applications, where translators come from various cultures and backgrounds, the preferred approach is to use the PHP extension called Gettext (packaged as php-gettext).
[root@mypc ~]# yum install php-gettext
php-gettext works as follows. A language file is prepared using a special editor such as KBabel or poEdit, in which each translation is kept as a pair of two special entries, msgid and msgstr. msgid is the identifier used in the PHP script, and msgstr is the translation associated with that msgid. Using msgid/msgstr pairs, we can build as many translations as required for any language, with a separate file for each language. This file has the .po extension and is meant only for editing translations. It is not the file PHP uses to apply the translations; for that, the .po file needs to be compiled into a binary .mo file with the following command.
[user@mypc ~]$ msgfmt -cv -o FILE.mo FILE.po
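To make the msgid/msgstr pairing concrete, here is a minimal sketch that writes a tiny .po file by hand instead of using an editor. The function name write_sample_po, the identifier myVar and the Finnish translation are made up for illustration. The first entry, with an empty msgid, is the catalog header that declares the file's character set; a real catalog normally carries more header fields, which the editors mentioned above fill in.

```shell
#!/bin/sh
# write_sample_po FILE (hypothetical helper name):
# create a minimal .po catalog containing a header and one translation.
write_sample_po() {
    cat > "$1" <<'EOF'
msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\n"

msgid "myVar"
msgstr "Tervetuloa"
EOF
}
```

After write_sample_po frontend.po, the file can be compiled with msgfmt into frontend.mo.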
Normally both files are kept in the same location for easy maintenance. Now that the translation files have been generated, here is how to use them in your application. Add the following code to your globally included script to bind the language files to your application.
// Set environment variable.
putenv('LC_ALL=en_US');
// Set selected locale (language).
setlocale(LC_ALL, 'en_US');
// Specify location of translation tables.
bindtextdomain('frontend', '/web/projects/myproject/translations/');
bindtextdomain('backend', '/web/projects/myproject/translations/');
// Choose domain for application.
textdomain('frontend');
// Bind specific character set to be used with selected domain.
bind_textdomain_codeset('frontend','UTF-8');
bind_textdomain_codeset('backend','UTF-8');
In the code snippet above, values such as the locale (en_US), the domain names (frontend, backend) and the translations path are to be replaced according to your application's needs.
Since gettext looks up translation files in a particular way, they need to be stored according to the rules defined by gettext. With reference to the code snippet above, if the root folder of the translation files is /web/projects/myproject/translations/ and the application areas are admin (backend domain) and client (frontend domain), the language files should be stored as follows.
/web/projects/myproject/translations/en_US/LC_MESSAGES/backend.mo, [admin.po]
/web/projects/myproject/translations/en_US/LC_MESSAGES/frontend.mo, [client.po]
/web/projects/myproject/translations/en_GB/LC_MESSAGES/backend.mo, [admin.po]
/web/projects/myproject/translations/en_GB/LC_MESSAGES/frontend.mo, [client.po]
/web/projects/myproject/translations/fi_FI/LC_MESSAGES/backend.mo, [admin.po]
/web/projects/myproject/translations/fi_FI/LC_MESSAGES/frontend.mo, [client.po]
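The folder skeleton for this layout can be created in one go. The following is a minimal sketch; the function name make_locale_dirs is made up for illustration.

```shell
#!/bin/sh
# make_locale_dirs ROOT LOCALE... (hypothetical helper name):
# create ROOT/LOCALE/LC_MESSAGES for every locale given.
make_locale_dirs() {
    root=$1
    shift
    for locale in "$@"; do
        mkdir -p "$root/$locale/LC_MESSAGES"
    done
}
```

For example, make_locale_dirs /web/projects/myproject/translations en_US en_GB fi_FI creates the three LC_MESSAGES folders used in the listing.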
Hence, by switching the value passed to textdomain(), the desired translation files can be selected in each application. How are these translations used in a PHP script? Each msgid should be written as _('myVar') in the PHP script to pull in the corresponding translation from the selected file.
That's it. Whenever translation files are modified, they need to be recompiled with the msgfmt command described earlier to make the latest translations available. Since compiled .mo files are cached by PHP, modifications might not be reflected immediately. In that case, the web server should be gracefully restarted.
#4 Summary
The advantage of implementing i18n with the UTF-8 character set encoding is that users can input data in their own language, and that content is saved and displayed back in the browser in the same language. Not only that, the database can also search records in the native language. For example, if you need to retrieve the records of all users whose last name is “ઝાલા”, SQL like the following will work.
SELECT * FROM user WHERE last_name='ઝાલા';
At the PHP script level, you can compare, sort and split Unicode strings in the same way as you do for normal strings. The next version of PHP (i.e. PHP 6) is planned to support Unicode by default, so no extensions or settings would be required to enable Unicode strings.
#5 Links
http://en.wikipedia.org/wiki/I18n
http://www.useit.com/alertbox/9608.html
http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html
http://www.nyphp.org/php-presentations/90_Timezones-Internationalization-Localization-Character-Sets-PHP-4-5