Anirudh Zala's Blog: 2007

9 Aug 2007

Designing web service

#1 New way of Programming experience

Designing an API requires a different way of thinking from traditional web development, because an API deals mainly with raw data and HTTP headers. Beyond ordinary web development skills, it calls for a good understanding of HTTP headers, HTTP request methods, the REST architectural style, OOP and responsiveness.

Another difference between an API and a normal web application is that once clients start using an API, they come to depend on it. Its URL scheme, response headers and response types therefore cannot be changed easily, so proper care should be taken while designing it. A good practice is to start with a simple interface that only GETs data, and to add functionality for adding, updating and deleting resources based on that experience.

#2 Handling input data

When designing APIs, developers should not rely on browser behaviour. HTML forms can only submit GET and POST requests, so requests using other methods are sent by API clients directly. PHP populates $_GET from the query string and $_POST from the body of POST requests, so the body of a PUT or DELETE request does not appear in either superglobal. That data must be read from standard input, which is accessible through php://input, and then parsed to extract the information needed to execute the request. See the example API in section #11 for more details about handling raw data. This link explains all possible HTTP methods and how they work.
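As a minimal sketch of reading raw input, assuming the client sends a URL-encoded body: the helper name parse_raw_body() is hypothetical and not taken from the example API.

```php
<?php
// Hypothetical helper: turns a URL-encoded request body into an array.
function parse_raw_body(string $raw): array
{
    $fields = [];
    parse_str($raw, $fields);
    return $fields;
}

// For PUT and DELETE the body is not available in $_POST,
// so it is read from php://input instead.
$method = $_SERVER['REQUEST_METHOD'] ?? 'GET';
$raw = in_array($method, ['PUT', 'DELETE'], true)
    ? (string) file_get_contents('php://input')
    : '';
$data = parse_raw_body($raw);
```

A body in another format, such as XML or JSON, would need a different parser at this step.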

#3 Structure of codebase

Normally an API requires one configuration file for server-specific data, one global file for other global settings, one front controller, and a set of classes for the required modules. Since an API has no GUI, multiple controllers are not required, and modern practice in any case recommends a single controller to receive and dispatch requests.

The URL of the API could be like http://api.mysite.fi or http://www.mysite.com/api/. As for directory structure, folders can be created to store the API's include files, classes and controller(s). Since the API is part of the main application, its file and folder structure should conform to the existing standards of the main application.

#4 Authentication and session management

Access control and session management are important aspects of API design. When a request changes the state of a resource on the server, it becomes important to identify the client that made it. This can be done by asking the requester to provide details of the account on whose behalf the request is made.

For example, to build a generic authentication mechanism, a URL like http://www.mysite.fi/api/api_login/ can be provided where clients submit their ID and password to log on. The API authentication module verifies the ID and password and, if they are valid, issues an api_token which can then be used to execute subsequent requests. Depending upon the sensitivity of the data, various authentication mechanisms can be implemented, including session-based authentication and HTTP basic/digest authentication.

For more information about authentication, the following resources can be studied:
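The login-then-token flow could be sketched roughly as below. The function names, the in-memory arrays and the sample account are assumptions made for illustration; a real API would keep accounts and tokens in a database and expire tokens.

```php
<?php
// Hypothetical in-memory stores, standing in for database tables.
$accounts = ['alice' => password_hash('s3cret', PASSWORD_DEFAULT)];
$tokens = [];

// Verifies ID and password; returns a new api_token or null on failure.
function api_login(array $accounts, array &$tokens, string $id, string $password): ?string
{
    if (!isset($accounts[$id]) || !password_verify($password, $accounts[$id])) {
        return null;
    }
    $token = bin2hex(random_bytes(16)); // 32 hexadecimal characters
    $tokens[$token] = $id;
    return $token;
}

// Resolves the account behind a token sent with a later request.
function account_for_token(array $tokens, string $token): ?string
{
    return $tokens[$token] ?? null;
}
```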

http://code.google.com/apis/gdata/basics.html
http://code.google.com/apis/gdata/auth.html
http://code.google.com/apis/accounts/AuthForInstalledApps.html

#5 Access control

When an API gets considerable traffic from various clients, it becomes important to limit how many requests each client can make in a given period. Depending upon available resources and the amount of traffic to each resource, timing-based limits can be applied so that a client cannot submit a particular request more than a fixed number of times per minute or per hour. This also helps to keep away attackers who could otherwise degrade the API service.

For advanced usage, access control can also govern the use of other users' data. If an API client wants to use another user's resources, it must first obtain that user's permission to use the data.
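One simple timing mechanism is a fixed-window counter, sketched below. The class name and the in-memory storage are assumptions for illustration; a real service would keep the counters in shared storage and purge old windows.

```php
<?php
// Hypothetical fixed-window limiter: at most $limit requests per client
// in each window of $windowSeconds seconds.
final class RateLimiter
{
    private $hits = [];      // counts keyed by "client:window"
    private $limit;
    private $windowSeconds;

    public function __construct(int $limit, int $windowSeconds)
    {
        $this->limit = $limit;
        $this->windowSeconds = $windowSeconds;
    }

    // $now is passed in (for example time()) so the logic is easy to test.
    public function allow(string $clientId, int $now): bool
    {
        $key = $clientId . ':' . intdiv($now, $this->windowSeconds);
        $this->hits[$key] = ($this->hits[$key] ?? 0) + 1;
        return $this->hits[$key] <= $this->limit;
    }
}
```

A rejected request would typically be answered with an error response rather than silently dropped.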

#6 Caching of common resources

Once an API starts receiving a high amount of traffic to particular resources, it may become necessary to cache them in memory or in files to serve them faster and to reduce load on the server. Resources that do not change frequently, such as the top-rated images of users, are good candidates for caching.
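A minimal in-memory cache with an expiry time might look like the sketch below. The class and method names are hypothetical; a file-based or shared-memory store would follow the same pattern.

```php
<?php
// Hypothetical cache: values are recomputed only after their TTL expires.
final class TtlCache
{
    private $items = [];

    // Returns the cached value for $key, or calls $producer and stores
    // its result for $ttl seconds. $now is the current Unix timestamp.
    public function remember(string $key, int $ttl, int $now, callable $producer)
    {
        if (isset($this->items[$key]) && $this->items[$key]['expires'] > $now) {
            return $this->items[$key]['value'];
        }
        $value = $producer();
        $this->items[$key] = ['value' => $value, 'expires' => $now + $ttl];
        return $value;
    }
}
```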

#7 Designing URLs

As we have adopted the REST architectural style for implementing the API, we must adhere to its conventions, which requires a proper understanding of REST principles. In REST, the meaning of a request depends on both the URL of the resource and the request method. The URL can be the same, but the response from the API changes with the request method. For example, requesting http://www.mysite.fi/api/users/ with the POST method adds a new user to the existing users, while the GET method returns the list of users. Similarly, requesting http://www.mysite.fi/api/user/1/ with the PUT method updates that user's information, the GET method returns the user's details, and the DELETE method removes the user from the database.

Please note that if http://www.mysite.fi/api/user/1/ is requested with the POST method, the API should return an appropriate error response (HTTP status 405 Method Not Allowed) telling the client that POST is not allowed for that resource. Similarly, no resource may be deleted with the GET, POST or PUT methods; only the DELETE method must be used for that. The del.icio.us API has this problem: everything, including creating, changing and removing resources, is done via GET, which is not a RESTful implementation.
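The coupling between URL and request method can be sketched as a small route table. The handler names and the array-based result are assumptions for illustration only; the list of allowed methods is what a 405 response would carry in its Allow header.

```php
<?php
// Hypothetical route table: URL pattern => allowed method => handler name.
const ROUTES = [
    '#^/api/users/$#'      => ['GET' => 'listUsers', 'POST' => 'addUser'],
    '#^/api/user/(\d+)/$#' => ['GET' => 'showUser', 'PUT' => 'updateUser', 'DELETE' => 'deleteUser'],
];

// Returns the handler for a method and path, or an error status:
// 405 when the URL exists but the method is not allowed, 404 otherwise.
function dispatch(string $method, string $path): array
{
    foreach (ROUTES as $pattern => $handlers) {
        if (preg_match($pattern, $path, $m) !== 1) {
            continue;
        }
        if (!isset($handlers[$method])) {
            return ['status' => 405, 'allow' => implode(', ', array_keys($handlers))];
        }
        return ['status' => 200, 'handler' => $handlers[$method], 'args' => array_slice($m, 1)];
    }
    return ['status' => 404];
}
```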

#8 Programming approach

A typical API can contain the following objects/modules. Each could be a single class that handles the request at a particular stage.

#8.1 Request handler

This is the first module to receive the request. It determines the request method, logs the client's information and stores the raw data received in the request. It then passes control to the authenticator or session manager.

#8.2 Authenticator/Session manager

This module checks whether authentication is required for the particular request. If it is and the client has not supplied valid credentials, the module informs the client, via the error handler module, to send credentials. This module also issues authentication tokens to clients for subsequent requests.

#8.3 Parser

Once the client is authenticated, the next step is to parse the raw data and extract the information needed for execution. The extracted data is then handed over to the data validator.

#8.4 Data Validator

This module validates the extracted data for the particular request (checking the accept type, and the range and type of the data, etc.). If everything is OK, the Executer module is called to execute the request.

The Parser and Data Validator modules can also be combined, depending upon the convenience of developers.
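A type and range check of this kind could be sketched as follows. The rule format (field => [type, min, max]) and the function name are assumptions made for this example.

```php
<?php
// Hypothetical validator. $rules maps a field name to [type, min, max];
// for 'int' the bounds apply to the value, for 'string' to its length.
// Returns the names of the fields that failed validation.
function validate(array $data, array $rules): array
{
    $errors = [];
    foreach ($rules as $field => [$type, $min, $max]) {
        $value = $data[$field] ?? null;
        if ($type === 'int') {
            $int = filter_var($value, FILTER_VALIDATE_INT);
            if ($int === false || $int < $min || $int > $max) {
                $errors[] = $field;
            }
        } elseif ($type === 'string') {
            $length = is_string($value) ? strlen($value) : -1;
            if ($length < $min || $length > $max) {
                $errors[] = $field;
            }
        }
    }
    return $errors;
}
```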

#8.5 Executer

This module is a collection of classes that actually execute the request and generate output as arrays. These classes can correspond to the primary objects of requests, such as image, user, tag etc.

These classes should preferably generate data as associative arrays so that it can be rendered in any format. Once the output is generated, it is handed over to the renderer module.

#8.6 Renderer (XML/XHTML/JSON etc.)

This module renders the generated data and sends it back to the client. The default format could be XML, but a different format can be provided depending upon the query string. There can be a separate template for each format, so that in the final output the dynamic part is replaced with data and the static part remains as it is.
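A rough sketch of rendering one associative array as either XML or JSON is shown below. The function names and the JSON envelope are assumptions; the XML envelope follows the res/status convention used in this article, and array keys are assumed to be valid XML element names.

```php
<?php
// Hypothetical helper: converts a nested associative array to XML elements.
function to_xml(array $data, int $depth = 1): string
{
    $xml = '';
    $pad = str_repeat('  ', $depth);
    foreach ($data as $tag => $value) {
        if (is_array($value)) {
            $xml .= "{$pad}<{$tag}>\n" . to_xml($value, $depth + 1) . "{$pad}</{$tag}>\n";
        } else {
            // Escape &, < and > so the output stays well-formed.
            $xml .= "{$pad}<{$tag}>" . htmlspecialchars((string) $value, ENT_XML1) . "</{$tag}>\n";
        }
    }
    return $xml;
}

// Renders a successful response; XML is the default format.
function render(array $data, string $format = 'xml'): string
{
    if ($format === 'json') {
        return json_encode(['status' => 1, 'data' => $data]);
    }
    return "<?xml version=\"1.0\" encoding=\"utf-8\" ?>\n<res status=\"1\">\n"
        . to_xml($data) . "</res>\n";
}
```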

#8.7 Error handler

This module can be used by any other module, at any level, to inform the client about an error that occurred while executing a request. Errors could be formatted like below:

<?xml version="1.0" encoding="utf-8" ?>
<res status="0">
<error code="123">Invalid data. Please read documentation for more details.</error>
</res>

Note that whenever an error occurs, the "status" attribute is set to "0" rather than "1". Apart from custom error messages and codes, standard HTTP status codes can also be used, so that the client can determine whether a request succeeded without parsing the output.

All these module classes could be designed using the object-oriented features of PHP5, for example getter and setter methods to get and set data whenever required.

It is not easy to test and debug an API with a conventional browser, because developers cannot send raw headers and data from it. Hence either an API client has to be written first, or another option has to be used. One such option is the LiveHTTPHeaders extension for the Firefox browser. It is an excellent way to send raw data over HTTP to the API and to see what response is received for a particular request. For more details about this extension, visit http://livehttpheaders.mozdev.org/
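Combining the custom error body with an HTTP status code could look like the following sketch. The function name and the returned array shape are hypothetical.

```php
<?php
// Hypothetical helper: builds the error document shown above together
// with the HTTP status code that should accompany it.
function error_response(int $httpStatus, int $code, string $message): array
{
    $body = "<?xml version=\"1.0\" encoding=\"utf-8\" ?>\n"
        . "<res status=\"0\">\n"
        . sprintf("<error code=\"%d\">%s</error>\n", $code, htmlspecialchars($message, ENT_XML1))
        . "</res>\n";
    return ['status' => $httpStatus, 'body' => $body];
}

// Usage in a web context:
// $r = error_response(400, 123, 'Invalid data. Please read documentation for more details.');
// http_response_code($r['status']);
// header('Content-Type: application/xml; charset=utf-8');
// echo $r['body'];
```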

Excellent resource about how to design RESTful API can be found here http://www.peej.co.uk/articles/restfully-delicious.html.

#9 Sample of output of data

Below are samples of XML responses. The root node has a status attribute of 1 or 0 to tell the API client whether the request was executed successfully. The rest of the nodes can be designed as per requirements. XML is the most common response format; later, when the API is in more demand, more response formats such as serialized PHP or JSON can be provided.

=> Basic XML structure of response:

<?xml version="1.0" encoding="utf-8" ?>
<res status="1"></res>

=> XML structure of response with data:

<?xml version="1.0" encoding="utf-8" ?>
<res status="1">
  <image>
    <resource>/image/recent</resource>
    <url>http://www.mysite.fi/images/123456.jpg</url>
    <width>120</width>
    <height>90</height>
    <metadata>
      <camera>Canon EOS 60</camera>
      <date-taken>10-09-2007</date-taken>
    </metadata>
  </image>
</res>

The REST architectural style emphasizes the following 2 rules for responding to requests:

#9.1 Resources should be interconnected

Each response should contain information about the previous and next resources. For example, when returning a list of users with a limit of 30 records per page, each response should contain links to the next and previous pages of users, if applicable. By inspecting any response, the API client can then find the previous and the next one. This can be accomplished either by setting attributes like next-url and previous-url, or by providing attributes like tot-page, cur-page, next-page etc. in the response XML, which the client can use in the query string of the request for the next or previous resource.

Similarly, whenever a new resource is created, the response sent back to the client should contain the URL of the newly created resource. This is how interconnectivity works.
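The paging attributes could be computed as in the sketch below. The function name and the "page" query parameter are assumptions; the attribute names come from the text above.

```php
<?php
// Hypothetical helper: computes paging attributes for a list response.
function page_links(string $baseUrl, int $page, int $perPage, int $total): array
{
    $lastPage = max(1, (int) ceil($total / $perPage));
    $links = ['cur-page' => $page, 'tot-page' => $lastPage];
    if ($page > 1) {
        $links['previous-url'] = $baseUrl . '?page=' . ($page - 1);
    }
    if ($page < $lastPage) {
        $links['next-url'] = $baseUrl . '?page=' . ($page + 1);
    }
    return $links;
}
```

For 75 users at 30 per page, page 2 links to both neighbours, while page 3 has only a previous-url.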

#9.2 Revelation of information should be step by step

Not all information should be returned in a single call. For example, the response to a request like http://www.mysite.fi/api/images/ should not contain the details of each image. Instead it can contain a link for each image (http://www.mysite.fi/api/image/12345_6789.jpg), and the actual details of an image are returned when that resource is requested.

All responses should be encoded in UTF-8 before they are sent to the client.

#10 Documentation

Documentation is a must when designing any API; without it, no one will use your API. It should be prepared in parallel with the code, and it should be easy to understand and use.

#11 Example of API

It is important to study an existing API when building one for the first time. The links below contain the source code and tutorial of one such API, a REST implementation for managing users and companies. It is very old and developed in PHP4, but its code is still worth a look to understand how to handle, execute and respond to various requests.

http://nchc.dl.sourceforge.net/sourceforge/phprestsql/phprestsql.tar.gz
http://phprestsql.sourceforge.net/tutorial.html

11 Jun 2007

CVS best practices

Best practices often require thorough knowledge of the technology used in software development. To use CVS fully, we need to know about tags, the trunk, branches, merging etc., so as to minimize or eliminate problems arising out of insufficient use of CVS. This document shows how CVS can be used most effectively to minimize such problems. Here are some policies designed to be followed whenever possible.

#1 Sandbox

The developer sandbox is where each developer keeps his or her working copy of the code base. In CVS this is referred to as the working directory. This is where they build, test and debug the modules that they are working on. A sandbox can also be the area where the staging build or the production build is done. Changes made in the work area are checked into the CVS repository. In addition, changes made in the repository by others have to be updated in the sandbox on a regular basis.

The best practices related to developers sandbox are:

#1.1 Keep System clocks in Sync

CVS tracks changes to source files by using the timestamps on the files. If the date and time of each client system are not in sync, there is a definite possibility of CVS getting confused. Thus system clocks must be kept in sync by use of a central time server or similar mechanism.

CVS is designed from ground up to handle multiple timezones. As long as the host operating system has been setup and configured correctly, CVS will be able to track changes correctly.

#1.2 Stay in sync with the repository

To gain the benefits of working within a sandbox as mentioned above, the developer must keep his or her sandbox in sync with the main repository. A regular cvs update with the appropriate tag or branch name will ensure that the sandboxes are kept up to date.

#1.3 Do not share the sandbox

Sandboxes have to be unique for each developer or purpose. They should not be used for multiple things at the same time. A sandbox can be a working area for a developer or the build area for the final release. If such sandboxes are shared, then the owner of the sandbox will not be aware of the changes made to the files resulting in confusion.

In CVS, the sandbox is created automatically when a working copy is checked out for a CVS project using the cvs checkout [options] MODULES command. In very large projects, it does not make sense for the developers to check−out the entire source into the local sandbox. In such cases, they can take only certain modules in which they are working.

#1.4 Do not work outside the sandbox

The sandbox can be thought of as a controlled area within which CVS can track for changes made to the various source files. Files belonging to other developers will be automatically updated by CVS in the developer's sandbox. Thus the developer who lives within the sandbox will stand to gain a lot of benefits of concurrent development.

#1.5 Cleanup after completion

Make sure that the sandbox is cleaned up after completion of work on the files. Clean up can be done in CVS by using the cvs release [-d] [DIRECTORIES] command. This ensures that no old version of the files exists in the development sandbox.

#1.6 Check−in often

To help other developers keep their code in sync with your code, you must check−in (commit) your code often into the CVS repository. The best practice is to check in as soon as a piece of code is completed, reviewed and tested, using the command cvs commit [options] [-m LOG_MESSAGE | -F FILE] [-r revision] [FILES].

CVS promotes concurrent development. Concurrent development is possible only if all the other developers are aware of the ongoing changes on a regular basis. This awareness can be termed as "situation awareness". One of the "bad" practices that commonly occur is the sharing of files between developers by email. This works against most of the best practices mentioned above. To share updates between two developers, CVS must be used as the communication medium. This will ensure that CVS is aware of the changes and can track them. Thus, audit trail can be established if necessary.

When you commit a change to the repository, make sure your change reflects a single purpose: the fixing of a specific bug, the addition of a new feature, or some particular task. CVS assigns a new revision number to each file changed by the commit, and these revision numbers can forever be used to name the change. You can mention them in bug databases, or use them as arguments to a CVS merge should you want to undo the change or port it to another branch.

#1.7 Add/Commit data in proper way

CVS is not good at handling directories. Once a directory is added, it cannot be removed from the repository in the normal way, so be careful when dealing with directories.

Moreover, CVS tends to exclude empty directories when checking out a module, so any directory that is empty at check-out time won't be included in the checked-out copy. This problem often occurs with directories that store temporary files which are not kept in CVS. If such a directory is missing from the checked-out module, your local sandbox might not work as expected. To solve this problem, an empty file called .keepme can be added to the otherwise empty directory.

#1.8 Use the issue-tracker wisely

Try to create as many two-way links between CVS changesets and your issue-tracking (gForge, Bugzilla, Mantis etc.) database as possible:

If possible, refer to a specific issue ID in every commit log message. When appending information to an issue (to describe progress, or to close the issue) name the revision number(s) responsible for the change.

#2 Branching and Merging

Branching in CVS splits a project's development into separate, parallel histories. Changes made on one branch do not affect the other branches. Branching can be used extensively to maintain multiple versions of a product for providing support and new features.

Merging converges the branches back to the main trunk. In a merge, CVS calculates the changes made on the branch between the point where it diverged from the trunk and the branch's tip (its most recent state), then applies those differences to the project at the tip of the trunk.

#2.1 Know when to create branches

This is a hotly debated question, and it really depends on the culture of your software project. Rather than prescribe a universal policy, we'll describe three common ones here.

#2.1.1 The Never-Branch system

(Often used by nascent projects that don't yet have runnable code.) Users commit their day-to-day work on /trunk. Occasionally /trunk "breaks" (doesn't compile, or fails functional tests) when a user begins to commit a series of complicated changes.

Pros: Very easy policy to follow. New developers have low barrier to entry. Nobody needs to learn how to branch or merge.

Cons: Chaotic development, code could be unstable at any time.

Note: this sort of development is a bit less risky in Subversion than in CVS. Because Subversion commits are atomic, it's not possible for a checkout or update to receive a "partial" commit while somebody else is in the process of committing.

#2.1.2 The Always-Branch system

(Often used by projects that favor heavy management and supervision.) Each user creates/works on a private branch for every coding task. When coding is complete, someone (original coder, peer, or manager) reviews all private branch changes and merges them to /trunk.

Pros: /trunk is guaranteed to be extremely stable at all times.

Cons: Coders are artificially isolated from each other, possibly creating more merge conflicts than necessary. Requires users to do lots of extra merging.

#2.1.3 The Branch-When-Needed system

Users commit their day-to-day work on /trunk.

Rule #1: /trunk must compile and pass regression tests at all times. Committers who violate this rule are publicly humiliated.

Rule #2: a single commit (changeset) must not be so large so as to discourage peer-review.

Rule #3: if rules #1 and #2 come into conflict (i.e. it's impossible to make a series of small commits without disrupting the trunk), then the user should create a branch and commit a series of smaller changesets there. This allows peer-review without disrupting the stability of /trunk.

Pros: /trunk is guaranteed to be stable at all times. The hassle of branching/merging is somewhat rare.

Cons: Adds a bit of burden to users' daily work: they must compile and test before every commit.

#2.2 Assign ownership to Trunk and Branches

The main trunk of the source tree and each of the branches should have an owner assigned, who will be responsible for the following.

#2.2.1 Keeping the list of configurable items for the branch or trunk

The owner will be the maintainer of the contents list for the branch or trunk. This list should contain the item name and a brief description about the item. This list is essential since new artifacts are always added to or removed from the repository on an ongoing basis. This list will be able to track the new additions/deletions to the repository for the respective branch.

#2.2.2 Establishing a working policy for the branch or trunk

The owner will establish policies for check−in and check−out. The policy will define when code can be checked in (after coding, after review etc.) and who is responsible for merging changes to the same file and resolving conflicts (the author or the person who most recently changed the file).

#2.2.3 Identifying and documenting policy deviations

Policies once established tend to have exceptions. The owner will be responsible for identifying the workaround and tracking/documenting the same for future use.

#2.2.4 Merging with the trunk

The branch owner will be responsible for ensuring that the changes in the branch can be successfully merged with the main trunk at a reasonable point in time.

#2.3 Tag each release

As part of the release process, the entire code base must be tagged (by cvs tag [options] SYMBOLIC_TAG [FILES] command) with an identifier that can help in uniquely identifying the release. A tag gives a label to the collection of revisions represented by one developer's working copy (usually, that working copy is completely up to date so the tag name is attached to the "latest and greatest" revisions in the repository).

The identifier for the tag should provide enough information to identify the release at any point in time in the future. One suggested tag identifier is of the form:

release_{major version #}_{minor version #}

Check out the entire codebase using the tag, and then proceed to go through a build / deploy / test process before making the actual release. This will absolutely ensure that what "leaves the door" is a verified and tested codebase.

#2.4 Create a branch after each release

After each software release, once the CVS repository is tagged, a branch has to be immediately created. This branch will serve as the bug fix baseline for that release. This branch is created only if the release is not a bug fix or patch release in the first place. Patches that have to be made for this release at any point in time in the future will be developed on this branch. The main trunk will be used for ongoing product development.

With this arrangement, the changes in the code for the ongoing development will be on the main trunk and the branch will provide a separate partition for hot fixes and bug fix releases. The identifier for the branch name can be of the form.

A branch can be created using cvs tag -b BRANCH_NAME command.

#2.5 Make bug fixes to branches only

This practice extends from the previous practice of creating a separate branch after a major release. The branch will serve as the code base for all bug fixes and patch release that have to be made. Thus, there is a separate repository "sandbox" where the hot fixes and patches can be developed apart from the mainstream development.

This practice also ensures that bug fixes done to previous releases do not mysteriously affect the mainstream version. In addition, new features added to the mainstream version do not creep into the patch release accidentally.

#2.6 Make patch releases from branches only

Since all the bug fixes for a given release are done on its corresponding branch, the patch releases are made from the branch. This ensures that there is no confusion on the feature set that is released as part of the patch release. After the patch release is made, the branch has to be tagged using the release tagging practice (see Tag each release).

#2.7 Merge branch with the trunk after release

After each release from a branch, the changes made to the branch should be merged (by cvs update -j BRANCH command) with the trunk. This ensures that all the bug fixes made to the patch release are properly incorporated into future releases of the application.

This merge could potentially be time consuming depending on the amount of changes made to the trunk and the branch being merged. In fact, it will probably result in a lot of conflicts in CVS resulting in manual merges. After the merge, the trunk code base must be tested to verify that the application is in proper working order. This must be kept in mind while preparing the project schedule.

In the case of changes occurring on branches for a long period, these changes can be merged to the main branch on a regular basis even before the release is made. The frequency of merge is done based on certain logical points in the branch's evolution. To ensure that duplicate merging does not occur, the following practice can be adopted.

In addition to the branch tag, a tag called {branch_name}_MERGED should be created. This is initially at the same level as the last release tag for the branch. This tag is then "moved" after each intermediate merge by using the −F option. This eliminates duplicate merging issues during intermediate merges.

#3 Summary

These policies need not all be followed exactly as stated. There are always exceptions depending upon the type of project and the amount of work to be done, so the policies should be applied in the way that suits the project best.

27 Feb 2007

Localization

#1 L10N overview

Internationalization and Localization are means of adapting products such as publications, hardware and software for non-native environments, especially other nations and cultures.

When you implement i18n, l10n soon follows. If you allow users from different countries and cultures to use your software, they will expect that, apart from translation of the language, the data itself is presented in localized form. This expectation is reasonable because different countries use different standards for displaying dates, currencies, units etc. For example, people in the US are less familiar with kilometers because they use the mile as their unit of distance, while for people in India the kilometer is quite familiar. The more cultures there are, the more variety there is in how information is communicated and displayed.

#1.1 Locale and formats

Before implementing l10n we first need a proper understanding of the terms locale and format. A locale represents a whole culture: it contains information about how to display dates, how to show currencies, which measurement units to use etc. For example, for the Indian locale it is like below:

Date display: DD-MM-YYYY
Currency: 1,11,111
Unit to measure distance: Km

While for US, it can be like below:

Date display: MM-DD-YYYY
Currency: 111,111
Unit to measure distance: Mile

Hence a locale should be seen broadly, as it represents a set of localized items. But sometimes users want more customization within a locale, and this is where the term format comes in: it allows one more level of customization. For example, a US user might like dates formatted as February 1st 2007 rather than 02-01-2007. In short, a locale includes the language, default formats, glyphs and other instructions for a particular culture, while a format is just a different representation of the same value. Hence a locale can have more than one format.
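The two digit groupings listed above (1,11,111 and 111,111) can be reproduced with a short sketch. number_format() is a standard PHP function; group_indian() is a hypothetical helper written for this example.

```php
<?php
// Hypothetical helper: groups digits the Indian way, with the last three
// digits together and the remaining digits in pairs.
function group_indian(int $n): string
{
    $digits = (string) abs($n);
    if (strlen($digits) <= 3) {
        $out = $digits;
    } else {
        $head = substr($digits, 0, -3);
        $tail = substr($digits, -3);
        // Split the leading digits into pairs, counting from the right.
        $head = strrev(implode(',', str_split(strrev($head), 2)));
        $out = $head . ',' . $tail;
    }
    return ($n < 0 ? '-' : '') . $out;
}

echo group_indian(111111), "\n";  // Indian grouping
echo number_format(111111), "\n"; // US grouping
```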

#2 How to implement it?

When implementing l10n in software, the administrator or team members first need to determine how many locales and formats should be supported. The chosen locales and formats can be stored in the file system or in a database. Once that is decided, l10n can be implemented at 2 levels.

#2.1 Backend

Software normally has an administrative area from which the whole software is managed. This area should be used to select the formats available for each locale.

From the selected formats, one default format should be chosen for each l10n entity. This default applies to the whole client area of the software unless it is overridden at user level.

For certain entities, like number format and currency format, only one format should be set, and a user-level option may not be allowed. It should also be kept in mind that the formats of one locale should not be used in another locale.

For software like FS and Flog, where clients are registered from the administrative area, user-level locales and formats can be selected directly.

#2.2 Frontend

In the client area, if the user is not given the option to set his/her own locale and formats, or if the option is provided but the user is not logged in, the defaults set in the administrative area should be used.

#2.2.1 Setting locale

When implementing l10n in PHP-based software, developers need to set the locale first. The locale can be decided from the selected language. For example, if the user selects English, the locale should be set to en_US; for Finnish it should be set to fi_FI. To set the locale in PHP, use the function setlocale(). The locale can be set per category, for example for monetary items, dates or messages. Please refer to the PHP manual for more details about this function.

Various PHP functions behave differently depending upon the locale, for example strcoll(), strftime() and localeconv(). Note that date() is not locale-aware and always outputs English names.
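Since a locale such as fi_FI may not be installed on every server, setlocale() can be given a list of candidates and returns false if none is available. The sketch below assumes a hypothetical language-to-locale map and always falls back to the C locale.

```php
<?php
// Hypothetical map from the selected language to a locale name.
const LOCALES = ['en' => 'en_US', 'fi' => 'fi_FI'];

// Returns the locale names to try, most specific first.
function locale_candidates(string $language): array
{
    if (!isset(LOCALES[$language])) {
        return ['C'];
    }
    $base = LOCALES[$language];
    return [$base . '.UTF-8', $base, 'C'];
}

// setlocale() tries each candidate in turn and returns the one it applied.
$applied = setlocale(LC_TIME, locale_candidates('fi'));
```

Checking the return value is worthwhile, because a silently missing locale makes every locale-dependent function fall back to its previous behaviour.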

#2.2.2 Displaying data in various formats based upon locales

Once the locale is set for a particular language, locale-related functions behave differently. For example, the code below displays the day name in a different language for each locale. The code remains the same, but the information is displayed differently.

// Displays “Wednesday” for English language.
setlocale(LC_TIME,'C');
echo strftime('%A');

// Displays “keskiviikko” for Finnish language.
setlocale(LC_TIME,'fi_FI');
echo strftime('%A');

// Displays “mercredi” for French language.
setlocale(LC_TIME,'fr_FR');
echo strftime('%A');

// Displays “Mittwoch” for German language.
setlocale(LC_TIME,'de_DE');
echo strftime('%A');

// Displays “बुधवार” for Hindi language.
setlocale(LC_TIME,'hi_IN');
echo strftime('%A');

Similarly locales can be set for entities like currency, number format etc. To set locale for all entities, constant LC_ALL should be used.

At code level there can be problems in implementing formats, because the default format differs between locales. Hence the code above doesn't fully serve our purpose. See the example below.

// Displays 'Friday December 22 1978' in English.
setlocale(LC_ALL, 'en_US');
echo strftime('%A %B %d %Y', mktime(0, 0, 0, 12, 22, 1978))."\n";

// Displays 'perjantai 22 joulukuu 1978' in Finnish.
setlocale(LC_ALL, 'fi_FI');
echo strftime('%A %d %B %Y', mktime(0, 0, 0, 12, 22, 1978))."\n";

// Displays 'vendredi 22 décembre 1978' in French.
setlocale(LC_ALL, 'fr_FR');
echo strftime('%A %d %B %Y', mktime(0, 0, 0, 12, 22, 1978))."\n";

// Displays 'Freitag 22 Dezember 1978' in German.
setlocale(LC_ALL, 'de_DE');
echo strftime('%A %d %B %Y', mktime(0, 0, 0, 12, 22, 1978))."\n";

// Displays '22 दिसम्बर शुक्रवार 1978' in Hindi.
setlocale(LC_ALL, 'hi_IN');
echo strftime('%d %B %A %Y', mktime(0, 0, 0, 12, 22, 1978))."\n";

In this example there are different formats for different locales. To make the implementation easy at code level, we should store the conversion specifier in a database or file and pass it directly to the function. For example, for the Finnish locale, the conversion specifier %A %d %B %Y would be stored as a string and used directly in the function, as above. Similarly, stored conversion specifiers can be used for all formats of all entities.
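The idea of storing one conversion specifier per locale can be sketched as below. This is only a minimal illustration: the array $dateFormats stands in for a database table or file, and the function name dateFormatFor() and the fallback format are assumptions, not part of any library.

```php
<?php
// Hypothetical lookup table: locale => strftime() conversion specifier.
// In a real application these strings would come from a database or file.
$dateFormats = array(
    'en_US' => '%A %B %d %Y',
    'fi_FI' => '%A %d %B %Y',
    'fr_FR' => '%A %d %B %Y',
    'de_DE' => '%A %d %B %Y',
    'hi_IN' => '%d %B %A %Y',
);

// Return the stored specifier for a locale, or a fallback if none is stored.
function dateFormatFor(array $formats, $locale, $fallback = '%Y-%m-%d')
{
    return isset($formats[$locale]) ? $formats[$locale] : $fallback;
}

echo dateFormatFor($dateFormats, 'fi_FI'), "\n"; // %A %d %B %Y
echo dateFormatFor($dateFormats, 'sv_SE'), "\n"; // %Y-%m-%d (fallback)
```

The returned string can then be passed as the first argument of strftime() after calling setlocale() for the same locale. Note that strftime() is deprecated as of PHP 8.1, so newer code may need a different formatter.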

#3 Limitations of l10n

Native support of l10n in a script or database is limited to displaying information in different formats and glyphs. It doesn't actually convert values according to the locale. For example, if the price of an item is stored in $, then that price will not automatically be displayed in € to a user who has selected the Finnish language (or locale). This is because conversion rates between two currencies change constantly.

For such issues, l10n should be implemented in a customized way in your software, with unit conversion functions that are applied according to the chosen format. However, information should be stored in the database in only one format and should be converted and formatted only while displaying it to users.
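A minimal sketch of such a conversion function is shown below, assuming prices are stored in USD only. The rate table, the rate value and the function name displayPrice() are hypothetical and exist only for illustration; real rates would have to be refreshed from an external source.

```php
<?php
// Hypothetical rate table; real rates change constantly.
$ratesFromUsd = array('USD' => 1.0, 'EUR' => 0.75);

// Convert a stored USD amount to the display currency and format it.
function displayPrice($amountUsd, $currency, array $rates)
{
    if (!isset($rates[$currency])) {
        throw new InvalidArgumentException("No rate for $currency");
    }
    $converted = $amountUsd * $rates[$currency];
    // number_format(value, decimals, decimal point, thousands separator)
    return number_format($converted, 2, '.', ',') . ' ' . $currency;
}

echo displayPrice(100.00, 'EUR', $ratesFromUsd), "\n"; // 75.00 EUR
```

The stored value never changes; only the displayed string depends on the user's choice.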

However, there is one exception: date and time can be displayed with different values if time zone related functions are used. Normally software logs date and time in its own time zone, but the user may be located in a different country where the local date/time differs from the server time. In such cases the software should provide an option to select a time zone so that date/time can be displayed with localized values. Such an option is essential for software that provides email services.
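The time zone case can be sketched with PHP's DateTime and DateTimeZone classes. The example assumes the server stores timestamps in UTC; the function name toUserTime() and the sample timestamp are made up for illustration.

```php
<?php
// Stored value: server time kept in UTC (an assumption for this sketch).
$stored = new DateTime('2007-02-23 12:00:00', new DateTimeZone('UTC'));

// Convert a stored timestamp to the time zone selected by the user.
function toUserTime(DateTime $utc, $zoneName)
{
    $local = clone $utc; // leave the stored value untouched
    $local->setTimezone(new DateTimeZone($zoneName));
    return $local->format('Y-m-d H:i');
}

echo toUserTime($stored, 'Europe/Helsinki'), "\n"; // 2007-02-23 14:00
echo toUserTime($stored, 'Asia/Kolkata'), "\n";    // 2007-02-23 17:30
```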

#4 Links

http://en.wikipedia.org/wiki/l10n
http://www.useit.com/alertbox/9608.html

23 Feb 2007

Internationalization

#1 I18N overview

Internationalization and localization are means of adapting products such as publications, hardware and software for non-native environments, especially other nations and cultures.

I18n covers many non-English and non-European languages like Hindi, Gujarati etc. that require multiple bytes to store each character. To support such languages, software should use the utf-8 encoding scheme to input, process, store, search and output data in the same language.

#2 How to implement it?

Just a few years ago, i18n was a headache for developers because of limited support from databases, scripting languages, browsers, OSes and other middle layers. Nowadays, with transparent support for utf-8 at each layer, i18n has become easy to implement.

In this document I describe the steps to implement i18n for LAMP-based software with MySQL 4.1 and higher and PHP 5.0 and higher. For lower versions, certain steps may not work.

#2.1 Server side

This section covers the changes to be made at server side.

#2.1.1 OS level

At OS level, the only requirement is that OS should support utf-8 encoding which modern OSes like FC1...6, CentOS etc. support very well.

#2.1.2 Database level

At database level, you should use utf8 as the default character set and a utf8_* collation for database communication. For that, add the following entries to the [mysqld] section of my.cnf (MySQL's configuration file).

# To support Asiatic languages use utf-8. 

init-connect='SET NAMES utf8' 

default-character-set=utf8 

Sometimes customized configuration my.cnf has more sections like [client] etc. In that case add below entry in [client] section also.

# To support Asiatic languages use utf-8. 

default-character-set=utf8 

After making the above entries, restart the MySQL service. Whenever you create a new database, use a utf8_* connection collation and utf-8 as the character set of any file you import. If you have set the two values above, these become the defaults, but it is still advisable to check them, because dumps sometimes come from different versions of MySQL.

To check what is set, run the SQL queries below.

SHOW VARIABLES LIKE 'character_set%';

SHOW VARIABLES LIKE 'collation%';

The connection, database and server values in the output should contain utf8. Sometimes it is not possible to add the above entries in my.cnf, especially on a shared hosting server. In such a case, execute the SQL query below before any other query (in your PHP script).

SET NAMES 'utf8';

It does the same thing as the entries we added in my.cnf, except that it is applied at runtime and affects the current connection only. However, the database and tables must still be created with the utf-8 character set and a matching collation.

#2.1.3 PHP script level

This is the main area where important changes are to be made. PHP does not natively support multi-byte strings, hence we have to use certain extensions to fulfill our requirements. These extensions are iconv and mbstring. Of these two, mbstring is popular and works very well. As the mbstring extension is not part of the standard PHP installation, we need to enable it manually.

If you have configured your web server using utilities like YUM then it is very easy. Just run below command as root user and restart httpd service.

[root@mypc ~]# yum install php-mbstring 

When compiling PHP manually, pass the following configure option to enable all the supported languages.

--enable-mbstring=all

Once this extension is enabled in PHP, we need to set certain directives to make it work. These directives can be set in php.ini for global usage, in httpd.conf for per-host usage, or in the PHP script itself for per-page or per-project usage. I recommend setting them in the PHP script itself so that their effect remains limited to a specific application or project. See the section below for the implementation.

Apart from these changes, all your PHP and other required scripts should be saved in utf-8 character set encoding, because static i18n data is stored in flat files (like fl_fi.inc.php). Hence editors need to be configured accordingly. Most popular editors provide a setting for character set encoding; select utf-8 as the standard encoding.

#2.1.4 Application level

At application level, you need to set certain directives using the function ini_set() to configure the mbstring extension. These directives are:

// Settings for i18n support. 

ini_set('mbstring.internal_encoding','utf-8'); 

ini_set('mbstring.func_overload',7); 

ini_set('mbstring.encoding_translation',1); 

If your application is using output buffering (i.e. output_buffering=On) then output handler should be set as mb_output_handler like below:

ini_set('output_handler','mb_output_handler'); 

Once these directives are set, all string, regular expression and mail related functions will work transparently for all kinds of languages. For more information about the above directives, refer to the PHP manual.
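The difference between byte-based and multi-byte string functions can be seen in the short sketch below. It calls the mb_* functions explicitly instead of relying on mbstring.func_overload, which is an alternative worth knowing because that directive was deprecated in PHP 7.2 and removed in PHP 8.0. The sample string is an arbitrary choice and the script is assumed to be saved as utf-8.

```php
<?php
$name = 'ઝાલા'; // 4 Unicode code points, 3 bytes each in UTF-8

echo strlen($name), "\n";             // 12 (bytes)
echo mb_strlen($name, 'UTF-8'), "\n"; // 4 (characters)

// Byte-based substr() can cut a character in half; mb_substr() works
// on whole characters.
$first = mb_substr($name, 0, 1, 'UTF-8');
echo $first, "\n";                    // ઝ
```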

While sending emails, utf-8 should be set as character set encoding in headers. Similarly while outputting HTML/XHTML to browser you need to explicitly set character set encoding into meta tags like below:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Moreover, if your application outputs content to the browser directly from a running script, you also need to send the corresponding header, like below:

<?php header('Content-Type: text/html; charset=utf-8'); ?>

In short wherever and whenever character set encoding is required, pass utf-8 as encoding scheme.
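For emails, the character set goes into the additional headers passed to mail(). The sketch below only builds such a header string; the helper name utf8MailHeaders() and the sender address are assumptions for illustration, and a non-ASCII subject line would need separate encoding that is not shown here.

```php
<?php
// Build additional headers for mail() declaring a UTF-8 text body.
function utf8MailHeaders($from)
{
    return implode("\r\n", array(
        'From: ' . $from,
        'MIME-Version: 1.0',
        'Content-Type: text/plain; charset=UTF-8',
        'Content-Transfer-Encoding: 8bit',
    ));
}

$headers = utf8MailHeaders('noreply@example.com');
echo $headers, "\n";
// The headers would then be passed as the fourth argument of mail():
// mail($to, $subject, $body, $headers);
```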

If you use Ajax in your website, you also need to encode the Ajax query string before submitting it to the server. This is especially required when the application runs in MSIE browsers. Firefox and other Gecko-based browsers handle Ajax queries that contain i18n data correctly. For more info read thread http://news.php.net/php.i18n/1059.
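The encoding itself happens in the browser, but the expected result can be illustrated from the PHP side. The sketch below uses rawurlencode() to show what a percent-encoded utf-8 value looks like and confirms that decoding restores the original string. The sample value is arbitrary.

```php
<?php
$value = 'ઝાલા';

// Percent-encode the UTF-8 bytes, as the client is expected to do
// before sending the query string.
$encoded = rawurlencode($value);
echo $encoded, "\n";

// PHP decodes $_GET and $_POST automatically; rawurldecode() shows
// the reverse step explicitly.
$decoded = rawurldecode($encoded);
var_dump($decoded === $value);
```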

#2.2 Client side

At client side minimal support is required.

#2.2.1 OS level

The client's OS should be able to understand utf-8 encoding, which is the case in recent versions of most OSes. Apart from that, the OS should support typing characters in the native language, which is configured through Keyboard layout and Regional settings.

#2.2.2 Browser level

The only requirement at browser level is that the browser should be able to interpret utf-8 character set encoding, which is supported by almost all modern browsers. Content sent from the server (refer section 2.1.4) tells the browser which encoding scheme is to be used.

#3 How to implement in existing projects?

To implement i18n in existing projects requires above implementation plus conversion of existing data.

#3.1 Converting database

#3.1.1 Using commands

To convert an existing database to utf-8, first take a dump of the data by running the command below.

mysqldump -hHOST -uUSER -p --opt --default-character-set=CHARSET --skip-set-charset DB_NAME|sed -e 's/SOURCE_CHARSET/DEST_CHARSET/g' > DB_NAME.sql 

It will create SQL file containing all data of selected database which you can use to re-create new database. Now convert existing file into utf-8 format by running below command.

iconv -f SOURCE_CHARSET -t UTF-8 DB_NAME.sql > UTF_DB_NAME.sql 

To dump above data into newly created database, run below command:

mysql -hHOST -uUSER -p --default-character-set=DEST_CHARSET DB_NAME < UTF_DB_NAME.sql

#3.1.2 Converting database directly

Sometimes it is not possible to do this from the command line (for example in a shared hosting environment). In that case you need to execute certain SQL queries to change the character set of the existing database in place, without copying or moving it anywhere. To do so, run the SQL queries below wherever applicable. Please note the order of execution of the queries.

# First change all i18n fields of all tables into BLOB.

ALTER TABLE TABLE_NAME MODIFY FIELD_NAME BLOB;

# Now change character set of database as UTF-8.

ALTER DATABASE DB_NAME charset=utf8;

# Then change character set of each table.

ALTER TABLE TABLE_NAME charset=utf8;

# Now change all i18n fields of all tables into UTF-8 character set.

ALTER TABLE TABLE_NAME MODIFY FIELD_NAME ORIG_FIELD_TYPE CHARACTER SET utf8;

Ideally above queries should be run by making shell or PHP script so that it can be used later or for other projects.
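A PHP script for this could simply generate the statements in the required order, as in the sketch below. The function name buildUtf8ConversionSql(), the database name mydb and the sample schema are hypothetical; the script only prints the SQL and leaves execution to you.

```php
<?php
// Build the conversion statements for one database.
// $fields maps table name => array(column name => original column type).
function buildUtf8ConversionSql($database, array $fields)
{
    $sql = array();
    // Step 1: switch every i18n column to BLOB, which drops its
    // character set information.
    foreach ($fields as $table => $columns) {
        foreach ($columns as $column => $type) {
            $sql[] = "ALTER TABLE `$table` MODIFY `$column` BLOB;";
        }
    }
    // Step 2: change the default character set of the database.
    $sql[] = "ALTER DATABASE `$database` CHARACTER SET utf8;";
    // Step 3: change the default character set of each table.
    foreach (array_keys($fields) as $table) {
        $sql[] = "ALTER TABLE `$table` CHARACTER SET utf8;";
    }
    // Step 4: restore the original column types, now with utf8.
    foreach ($fields as $table => $columns) {
        foreach ($columns as $column => $type) {
            $sql[] = "ALTER TABLE `$table` MODIFY `$column` $type CHARACTER SET utf8;";
        }
    }
    return $sql;
}

// Hypothetical schema used only for illustration.
$statements = buildUtf8ConversionSql('mydb', array(
    'user' => array('last_name' => 'VARCHAR(100)'),
));
echo implode("\n", $statements), "\n";
```

Keeping the schema description in one array makes the same script reusable for other projects.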

#3.2 Converting file system

You will also need to change the encoding of existing PHP scripts and other files if they contain data that requires utf-8 encoding. Normally only files containing non-English text need to be converted, but it is recommended to use the same encoding for all types of files throughout your software. To change the encoding of a file, the iconv utility can be used. One example is provided below. The output is written to a new file first, because redirecting iconv straight back into its input file would empty that file before it is read.

iconv -f SOURCE_CHARSET -t UTF-8 FILENAME > FILENAME.utf8 && mv FILENAME.utf8 FILENAME

You can also use the mbstring functions of PHP (for example mb_convert_encoding()) to change the character set encoding of your files. However, these are PHP functions, so you need to write a script to convert all existing files.
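Such a script could be built around a helper like the one sketched below, which uses mb_convert_encoding(). The function name convertFileToUtf8() and the temporary file suffix are assumptions for illustration; the demonstration runs on a throwaway file rather than on project files.

```php
<?php
// Convert one file from the $from encoding to UTF-8.
// Writes to a temporary file first so a failed write cannot destroy
// the original.
function convertFileToUtf8($path, $from)
{
    $contents = file_get_contents($path);
    if ($contents === false) {
        return false;
    }
    $converted = mb_convert_encoding($contents, 'UTF-8', $from);
    $tmp = $path . '.utf8.tmp';
    if (file_put_contents($tmp, $converted) === false) {
        return false;
    }
    return rename($tmp, $path);
}

// Demonstration on a throwaway file containing "ä" in ISO-8859-1.
$demo = tempnam(sys_get_temp_dir(), 'enc');
file_put_contents($demo, "\xE4");
$ok = convertFileToUtf8($demo, 'ISO-8859-1');
$hex = bin2hex(file_get_contents($demo));
echo $hex, "\n"; // c3a4
unlink($demo);
```

The source encoding has to be known in advance; running the helper twice on the same file would encode it twice.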

#3.3 Translating your project

When you design multilingual projects, it becomes mandatory to dynamically handle static text of your application. Your application can have several options for that depending upon tools used to build it. Symfony based projects can have XLIFF as standard mechanism to translate static data. But overall if we want such mechanism in every PHP application then there are 2 ways to do so.

Declaring PHP variables where static text gets displayed, putting those variables in one language file per language used in the application, and then including the right file in scripts where translations are required. This is the standard way, but it has some disadvantages. Since the file is PHP-based, third-party translators may find it difficult to add or update translations, as they might come from a non-technical background. Hence there exists a more professional mechanism to overcome this problem.
For large applications, where translators are from various cultures and backgrounds, the preferred approach is to use the PHP extension called Gettext (in PHP it is packaged as php-gettext).
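The first approach can be sketched as below, using one array per language file instead of separate variables. The file name in the comment, the keys and the helper translate() are hypothetical choices for illustration.

```php
<?php
// Contents of a hypothetical language file such as lang/fi_FI.inc.php.
// Each language file defines the same keys with translated values.
$lang = array(
    'greeting' => 'Tervetuloa',
    'logout'   => 'Kirjaudu ulos',
);

// Look up a translation; fall back to the key so missing entries
// are visible on the page.
function translate(array $lang, $key)
{
    return isset($lang[$key]) ? $lang[$key] : $key;
}

echo translate($lang, 'greeting'), "\n"; // Tervetuloa
echo translate($lang, 'missing'), "\n";  // missing
```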

To check whether it is installed or not, run command php -m from command line. If output contains text like gettext then it is installed. Otherwise using various options it can be installed. The most popular way is using Yum utility. Just type below command as root user and that's it.

[root@mypc ~]# yum install php-gettext 

php-gettext works in this way. One language file is prepared using special editors like KBabel or poEdit, where all translations are kept as pairs of two keywords, msgid and msgstr. msgid denotes the identifier to be used in the PHP script, while msgstr denotes the translation associated with that msgid. Using pairs of msgid and msgstr, we can build as many translations as required for any language. There is a separate file for each language. This file has the .po extension, but it is only for editing translations. The file that PHP actually uses to apply the translations is a binary .mo file compiled from the .po file. The following command can be used to compile it.

[user@mypc ~]# msgfmt -cv -o FILE.mo FILE.po 

Normally both files are kept at same location for easy maintenance. Since we have generated files for language translations, I will describe how to use them in your application. In your global includable script following code needs to get added to bind your created language files and your application.

// Set environment variable. 

putenv('LC_ALL=en_US'); 

// Set selected locale (language). 

setlocale(LC_ALL, 'en_US'); 

// Specify location of translation tables. 

bindtextdomain('frontend', '/web/projects/myproject/translations/'); 

bindtextdomain('backend', '/web/projects/myproject/translations/'); 

// Choose domain for application. 

textdomain('frontend'); 

// Bind specific character set to be used with selected domain. 

bind_textdomain_codeset('frontend','UTF-8'); 

bind_textdomain_codeset('backend','UTF-8'); 

In the above code snippet, the locale, the domain names (frontend, backend) and the translations path are to be replaced according to your application's needs. Here en_US is a locale which can be changed as required.

Since gettext looks up translation files in a special way, they need to be stored according to the rules defined by gettext. With reference to the code snippet above, if the root folder of the translation files is /web/projects/myproject/translations/ and the application areas are admin (domain backend) and client (domain frontend), then the language files should be stored in the following way.

/web/projects/myproject/translations/en_US/LC_MESSAGES/backend.mo, [admin.po]
/web/projects/myproject/translations/en_US/LC_MESSAGES/frontend.mo, [client.po]

/web/projects/myproject/translations/en_GB/LC_MESSAGES/backend.mo, [admin.po]
/web/projects/myproject/translations/en_GB/LC_MESSAGES/frontend.mo, [client.po]

/web/projects/myproject/translations/fi_FI/LC_MESSAGES/backend.mo, [admin.po]
/web/projects/myproject/translations/fi_FI/LC_MESSAGES/frontend.mo, [client.po]

Hence by switching the value passed to textdomain(), the desired translation files can be selected in each application. Now, how are these translations used in a PHP script? Each msgid should be written like _('myVar') in the PHP script to pull in the corresponding translation from the selected file.
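A small wrapper can make such calls safe on servers where the extension is not loaded. The sketch below is an assumption about how one might do it, not part of gettext itself: the function name t() is made up, and the msgids are placeholders. When no translation is found, gettext() returns the msgid unchanged.

```php
<?php
// Wrapper around gettext(); returns the msgid unchanged when the
// extension is missing or no translation is found.
function t($msgid)
{
    return function_exists('gettext') ? gettext($msgid) : $msgid;
}

echo t('myVar'), "\n";

// Translated strings can carry placeholders that are filled afterwards.
printf(t('Hello %s') . "\n", 'Anirudh');
```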

That's it. Whenever translation files are modified, they need to be recompiled using the msgfmt command described earlier to make the latest translations available. Since compiled binary .mo files are cached, modifications might not be reflected immediately. In such a case the web server should be gracefully restarted.

#4 Summary

The advantage of implementing i18n using utf-8 character set encoding is that users can input data in their own language, and that content is saved and displayed back in the browser in the same language. Not only that, the database can also search records in the native language. For example, if you need to retrieve the records of all users whose last name is “ઝાલા”, then writing SQL like below will work successfully.

SELECT * FROM user WHERE last_name='ઝાલા'; 

At PHP script level, you can compare, sort and split Unicode strings in the same way as normal strings. The next version of PHP (i.e. PHP 6) was planned to support Unicode by default, which would have removed the need for extensions or settings to handle Unicode strings. That version was never released, so the steps above still apply.

#5 Links

http://en.wikipedia.org/wiki/I18n
http://www.useit.com/alertbox/9608.html
http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html
http://www.nyphp.org/php-presentations/90_Timezones-Internationalization-Localization-Character-Sets-PHP-4-5
