Anirudh Zala's Blog

9 Aug 2007

Designing web service

#1 New way of Programming experience

Designing an API requires a different way of thinking about programming, because an API deals mainly with raw data and header management. Hence, apart from traditional web development skills, it requires a good understanding and experience of HTTP headers, HTTP request methods, the REST architectural style, OOP and responsiveness.

Another difference between an API and a normal web application is that once clients start using an API, more and more is built on top of it, so its URL scheme, response headers and response types cannot be changed easily. Proper care should therefore be taken before designing it. A good practice is to start with a simple interface that only supports GETting data, and to use that experience to add functionality for adding, updating and deleting resources.

#2 Handling input data

While designing APIs, developers should not rely on browsers, because HTML forms can only send GET and POST requests; requests using other methods come from API clients that talk to the server directly. PHP fills the $_POST superglobal only for POST requests, so the data sent with PUT and DELETE requests is not populated automatically. All such data must therefore be read from the raw request body, which can be accessed through php://input. It can then be parsed into pieces of information that are finally used to execute the request. Please study the example API for more details about how to handle raw data. This link explains all possible HTTP methods and how they work.
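The idea of parsing the raw request body by hand can be sketched as follows. This is a minimal illustration in Python rather than PHP; in PHP the raw bytes would come from php://input. The function name parse_raw_body and the assumption that the client sends a form-encoded body are mine, not part of any particular API.

```python
from urllib.parse import parse_qs


def parse_raw_body(method, raw_body, encoding="utf-8"):
    """Decode a raw, form-encoded request body into a dict of field -> value.

    Hypothetical helper: assumes the client sent the body as
    application/x-www-form-urlencoded data.
    """
    if method not in ("POST", "PUT", "DELETE"):
        # Other methods are treated as having no body in this sketch.
        return {}
    text = raw_body.decode(encoding)
    parsed = parse_qs(text, keep_blank_values=True)
    # parse_qs returns a list of values per key; keep the last one.
    return {key: values[-1] for key, values in parsed.items()}


fields = parse_raw_body("PUT", b"name=Anna&city=Helsinki")
print(fields)
```

The parsed fields would then be handed to whatever validates and executes the request.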

#3 Structure of codebase

Normally an API requires one configuration file to store server-specific data, one global file to store other global settings, one front controller, and then a set of classes for the required modules. Since an API has no GUI, multiple controllers are not required. In any case, modern practice recommends a single front controller to receive and dispatch requests.

The URL of the API could be like http://api.mysite.fi or http://www.mysite.com/api/. As far as directory structure is concerned, folders can be created to store API-related include files, classes and controller(s). Since the API is part of the main application, its file and folder structure should conform to the existing standards of the main application.

#4 Authentication and session management

Access control and session management are important aspects of API design. When a request changes the state of a resource on the server, it becomes important to identify the client who made the request. This can be done by asking the requester to provide details of the account on whose behalf the request is made.

For example, to build a generic authentication mechanism, a URL like http://www.mysite.fi/api/api_login/ can be provided where clients submit their ID and password to log on. The API authentication module verifies the ID and password and, if they are correct, issues an api_token which can then be used to execute subsequent requests. Depending upon the sensitivity of the data, various authentication mechanisms can be implemented, including session-based authentication and HTTP basic/digest authentication.
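The login-then-token flow described above might look roughly like this. It is only a sketch in Python: the in-memory ACCOUNTS and TOKENS stores and the function names are hypothetical, and a plain SHA-256 digest is used purely to keep the example short (a real service would use a proper password-hashing scheme and persistent storage).

```python
import hashlib
import hmac
import secrets

# Hypothetical in-memory stores; a real service would use a database.
ACCOUNTS = {"alice": hashlib.sha256(b"s3cret").hexdigest()}
TOKENS = {}


def api_login(user_id, password):
    """Return a fresh api_token, or None if the credentials are wrong."""
    expected = ACCOUNTS.get(user_id)
    supplied = hashlib.sha256(password.encode("utf-8")).hexdigest()
    # compare_digest avoids leaking information through timing differences.
    if expected is None or not hmac.compare_digest(expected, supplied):
        return None
    token = secrets.token_hex(16)  # 32 hexadecimal characters
    TOKENS[token] = user_id
    return token


def user_for_token(token):
    """Look up which account a token was issued to (None if unknown)."""
    return TOKENS.get(token)
```

Subsequent requests would carry the token, and the authenticator would call something like user_for_token before executing them.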

For more information about authentication following resources can be studied:

http://code.google.com/apis/gdata/basics.html
http://code.google.com/apis/gdata/auth.html
http://code.google.com/apis/accounts/AuthForInstalledApps.html

#5 Access control

When an API gets considerable traffic from various clients, it becomes important to restrict how many requests each client can make in a given period. Depending upon the available resources and the amount of traffic to various resources, time-based limits can be applied so that a client cannot submit a particular request more than a fixed number of times per minute or per hour. This is important for keeping away attackers who could damage the API service at various levels.

For advanced usage, access control can also be applied to the use of other users' data. If an API client wants to use another user's resources, it must first obtain confirmation from that user to use the data for the intended purpose.
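One possible way to implement such a time-based limit is a sliding window per client. The sketch below is in Python; the class name RateLimiter and its parameters are illustrative assumptions, and the current time is passed in explicitly so the behaviour is easy to follow.

```python
from collections import defaultdict, deque


class RateLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # client id -> request timestamps

    def allow(self, client_id, now):
        history = self._history[client_id]
        # Forget requests that have fallen out of the time window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()
        if len(history) >= self.max_requests:
            return False
        history.append(now)
        return True
```

With RateLimiter(2, 60), a client's third request inside the same minute is refused, while other clients are unaffected.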

#6 Caching of common resources

Once the API starts receiving a high amount of traffic to particular resources, it may become necessary to cache certain resources in memory or in files to serve them faster and to save bandwidth. Resources that do not change frequently, such as the top-rated images of users, are good candidates for caching.
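A very small in-memory cache with an expiry time could be sketched like this in Python. The class name TTLCache and the loader callback are hypothetical; the loader stands in for whatever expensive query produces the resource.

```python
class TTLCache:
    """Cache values in memory and reload them after ttl_seconds."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._entries = {}  # key -> (time stored, value)

    def get(self, key, loader, now):
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]  # still fresh: serve from the cache
        value = loader()  # missing or stale: rebuild the resource
        self._entries[key] = (now, value)
        return value
```

The trade-off is that clients may see data that is up to ttl_seconds old, which is usually acceptable for rarely changing resources.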

#7 Designing URLs

As we have adopted the REST architectural style for implementing the API, we must adhere to its standards, which requires a proper understanding of REST principles. In the REST architectural style, the meaning of a request is determined by the combination of the resource URL and the request method. This means the URL can be the same while the response from the API changes depending on the request method. For example, requesting the resource http://www.mysite.fi/api/users/ with the POST method means adding a new user to the existing users, while the GET method returns the list of users. Similarly, requesting the resource http://www.mysite.fi/api/user/1/ with the PUT method updates the information of that particular user, the GET method returns the details of the user, and the DELETE method removes that user from the database.

Please note that if the resource http://www.mysite.fi/api/user/1/ is requested with the POST method, there should be an appropriate error response telling the API client that the POST method is not allowed for that resource. Similarly, resources must not be deleted with the GET, POST or PUT methods; only the DELETE method may be used for that. This problem exists in the del.icio.us API, where everything is done via GET, including creating, changing and removing resources, which is an unRESTful way of implementation.
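The coupling of URL and request method can be illustrated with a small dispatch table. This is a Python sketch; the action names (list_users, add_user and so on) are placeholders I made up, and the 405 status with an Allow header is the standard HTTP way of saying "method not allowed".

```python
import re

# Each route maps the allowed request methods to a hypothetical action name.
ROUTES = [
    (re.compile(r"^/api/users/$"),
     {"GET": "list_users", "POST": "add_user"}),
    (re.compile(r"^/api/user/(\d+)/$"),
     {"GET": "show_user", "PUT": "update_user", "DELETE": "delete_user"}),
]


def dispatch(method, path):
    """Return (http_status, headers, (action, url_arguments) or None)."""
    for pattern, actions in ROUTES:
        match = pattern.match(path)
        if match is None:
            continue
        if method not in actions:
            # The resource exists, but not for this method.
            return 405, {"Allow": ", ".join(sorted(actions))}, None
        return 200, {}, (actions[method], match.groups())
    return 404, {}, None


print(dispatch("POST", "/api/user/1/"))
```

Here a POST to /api/user/1/ is rejected, while the same URL with PUT or DELETE is routed to the matching action.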

#8 Programming approach

A typical API can contain the following objects/modules. Each object/module could be a single class that handles the request at a particular stage.

#8.1 Request handler

This is the first module to receive the request. It determines the request method, logs the client's information and stores the raw data received in the request. Then it passes control to the authenticator or session manager.

#8.2 Authenticator/Session manager

This module checks whether authentication is required for a particular request. If it is, the module informs the client, via the error handler module, to send credentials. This module also issues authentication tokens to clients for subsequent requests.

#8.3 Parser

Once the client is authenticated, the next step is to parse the raw data and extract the information needed for execution. The extracted data is then handed over to the data validator.

#8.4 Data Validator

This module validates the extracted data for a particular request (checking the accept type, and the range and type of data, etc.). If everything is ok, the Executer module is called to execute the request.

The Parser and Data Validator modules can also be combined, depending upon the convenience of developers.

#8.5 Executer

This module is a collection of classes which actually execute the request and generate output as arrays. These classes could correspond to the primary objects of requests, such as image, user, tag etc.

These classes should preferably generate data as associative arrays so that it can be rendered in any format. Once the output is generated, it is handed over to the renderer module.

#8.6 Renderer (XML/XHTML/JSON etc.)

This module renders the generated data and sends it back to the client. The default format could be XML, but a different format can be provided depending upon the query string. There can be a separate template for each format, so that in the final output the dynamic part is replaced with data and the static part remains as it is.
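To show why format-neutral arrays are convenient, here is a rough Python sketch of a renderer that turns the same dictionary into either XML or JSON. The function names and the fmt parameter are assumptions for illustration, and it builds the output directly instead of using templates; the res root node with a status attribute follows the samples in this article.

```python
import json
import xml.etree.ElementTree as ET


def _append(parent, data):
    """Recursively turn a dict into child XML elements."""
    for key, value in data.items():
        child = ET.SubElement(parent, key)
        if isinstance(value, dict):
            _append(child, value)
        else:
            child.text = str(value)


def render(data, fmt="xml"):
    """Render executer output (a dict) in the requested format."""
    if fmt == "json":
        return json.dumps({"status": 1, **data})
    root = ET.Element("res", {"status": "1"})
    _append(root, data)
    return ET.tostring(root, encoding="unicode")


print(render({"image": {"width": 120, "height": 90}}))
```

Adding another output format then only means adding another branch (or template) in the renderer, without touching the executer classes.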

#8.7 Error handler

This module can be used by any other module at any level to inform the client that an error occurred while executing a request. The format of errors could be like below:

<?xml version="1.0" encoding="utf-8" ?>
<res status="0">
<error code="123">Invalid data. Please read documentation for more details.</error>
</res>

Note that whenever an error occurs, the "status" attribute is set to "0" rather than "1". Apart from custom error messages and codes, standard HTTP status codes and headers can also be used, so that the client can determine whether a request was executed successfully without having to parse the output.
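An error handler that produces the XML shown above together with an HTTP status code might be sketched like this in Python. The function name error_response and the pairing of HTTP status 400 with custom code 123 are illustrative assumptions only.

```python
import xml.etree.ElementTree as ET


def error_response(http_status, code, message):
    """Build (HTTP status, XML body) for a failed request."""
    root = ET.Element("res", {"status": "0"})  # 0 signals failure
    error = ET.SubElement(root, "error", {"code": str(code)})
    error.text = message
    body = ('<?xml version="1.0" encoding="utf-8" ?>\n'
            + ET.tostring(root, encoding="unicode"))
    return http_status, body


status, body = error_response(400, 123, "Invalid data.")
print(status)
print(body)
```

A client can check the HTTP status first and only parse the body when it needs the custom code or message.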

The classes for all these modules could be designed using the object-oriented features of PHP5. One such practice is to have getter and setter methods to get and set data whenever required.

It is not easy to test and debug an API with a conventional browser, because developers cannot send raw headers and data from it. Hence either an API client has to be written first, or other options have to be used. One such option is the LiveHTTPHeaders extension for the Firefox browser. It is an excellent way to send and receive raw data over HTTP to an API, and it lets developers easily see what response is received when a particular request is made. For more details about this extension, visit http://livehttpheaders.mozdev.org/

An excellent resource on how to design a RESTful API can be found at http://www.peej.co.uk/articles/restfully-delicious.html.

#9 Sample of output of data

Samples of XML responses are given below. The root-level node has a status attribute set to 1 or 0 to tell the API client whether the request was executed successfully. The rest of the nodes can be designed as per requirements. XML is the most common response format; later, when demand for the API grows, more response formats such as serialized PHP or JSON can be provided.

=> Basic XML structure of response:

<?xml version="1.0" encoding="utf-8" ?>
<res status="1"></res>

=> XML structure of response with data:

<?xml version="1.0" encoding="utf-8" ?>
<res status="1">
  <image>
    <resource>/image/recent</resource>
    <url>http://www.mysite.fi/images/123456.jpg</url>
    <width>120</width>
    <height>90</height>
    <metadata>
      <camera>Canon EOS 60</camera>
      <date-taken>10-09-2007</date-taken>
    </metadata>
  </image>
</res>

The REST architectural style emphasizes the following 2 rules when responding to requests:

#9.1 Resources should be interconnected

This means each response should contain information about the previous and next resource. For example, when returning a list of users with a limit of 30 records per page, each response should contain links to the next and previous pages of users, if applicable. By inspecting any response, an API client can then find out what the previous response was and what the next one will be. This can be accomplished either by setting attributes like next-url and previous-url, or by providing attributes like tot-page, cur-page, next-page etc. in the response XML, so that they can be used in the query string of a request to access the next or previous resources.

Similarly, whenever a new resource is created, the response sent back to the client should contain the URL of the newly created resource. This is how inter-connectivity works.
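The paging attributes can be computed with a few lines of arithmetic. The Python sketch below assumes a hypothetical "page" query-string parameter and reuses the attribute names mentioned above (cur-page, tot-page, next-url, previous-url); the function name page_links is my own.

```python
def page_links(base_url, cur_page, total_items, per_page=30):
    """Compute paging attributes for a list response."""
    # Ceiling division; an empty list still counts as one page.
    tot_page = max(1, -(-total_items // per_page))
    links = {"cur-page": cur_page, "tot-page": tot_page}
    if cur_page > 1:
        links["previous-url"] = f"{base_url}?page={cur_page - 1}"
    if cur_page < tot_page:
        links["next-url"] = f"{base_url}?page={cur_page + 1}"
    return links


print(page_links("http://www.mysite.fi/api/users/", 2, 75))
```

For 75 users at 30 per page, page 2 gets both a previous and a next link, while the first and last pages get only one of them.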

#9.2 Revelation of information should be step by step

This means that not all information should be returned in a single call. For example, the response to a request like http://www.mysite.fi/api/images/ should not contain the details of each image. Instead it can contain a link to each image (http://www.mysite.fi/api/image/12345_6789.jpg), and the actual information about an image is returned when that resource is requested.

All responses should use UTF-8 characters and be encoded in the same way before being sent to the client.

#10 Documentation

Documentation is a must when designing any API. Without documentation, no one will use your API. Hence it should be prepared in parallel with the development of the code. It should also be easy to understand and use.

#11 Example of API

It is important to study an existing API when building one for the first time. The links below contain the source code and a tutorial for such an API. It is a REST implementation for managing users and companies. It is very old and was developed in PHP4, but it is still worth a look, especially its code, to understand how to handle, execute and respond to various requests.

http://nchc.dl.sourceforge.net/sourceforge/phprestsql/phprestsql.tar.gz
http://phprestsql.sourceforge.net/tutorial.html

11 Jun 2007

CVS best practices

Best practices often require thorough knowledge of the technology used in software development. To use CVS fully, we need to know about tags, the trunk, branches, merging and so on, in order to minimize or eliminate problems arising out of insufficient use of CVS. This document shows how CVS can be used most effectively to minimize such problems. Here are some policies that should be followed whenever possible.

#1 Sandbox

The developer sandbox is where each developer keeps his or her working copy of the code base. In CVS this is referred to as the working directory. This is where they build, test and debug the modules that they are working on. A sandbox can also be the area where the staging build or the production build is done. Changes made in the work area are checked into the CVS repository. In addition, changes made in the repository by others have to be updated in the sandbox on a regular basis.

The best practices related to the developer's sandbox are:

#1.1 Keep System clocks in Sync

CVS tracks changes to source files by using the timestamps on the files. If the date and time of each client system are not in sync, there is a definite possibility of CVS getting confused. Thus system clocks must be kept in sync by use of a central time server or a similar mechanism.

CVS is designed from the ground up to handle multiple time zones. As long as the host operating system has been set up and configured correctly, CVS will be able to track changes correctly.

#1.2 Stay in sync with the repository

To gain the benefits of working within a sandbox as mentioned above, the developer must keep his or her sandbox in sync with the main repository. A regular cvs update with the appropriate tag or branch name will ensure that the sandboxes are kept up to date.

#1.3 Do not share the sandbox

Sandboxes have to be unique for each developer or purpose. They should not be used for multiple things at the same time. A sandbox can be a working area for a developer or the build area for the final release. If such sandboxes are shared, then the owner of the sandbox will not be aware of the changes made to the files resulting in confusion.

In CVS, the sandbox is created automatically when a working copy of a CVS project is checked out using the cvs checkout [options] MODULES command. In very large projects, it does not make sense for developers to check out the entire source into the local sandbox. In such cases, they can check out only the modules on which they are working.

#1.4 Do not work outside the sandbox

The sandbox can be thought of as a controlled area within which CVS can track for changes made to the various source files. Files belonging to other developers will be automatically updated by CVS in the developer's sandbox. Thus the developer who lives within the sandbox will stand to gain a lot of benefits of concurrent development.

#1.5 Cleanup after completion

Make sure that the sandbox is cleaned up after completion of work on the files. Clean up can be done in CVS by using the cvs release [-d] [DIRECTORIES] command. This ensures that no old version of the files exists in the development sandbox.

#1.6 Check-in often

To help other developers keep their code in sync with yours, you must check in (commit) your code often to the CVS repository. The best practice is to check in as soon as a piece of code is completed, reviewed and tested. Check in the changes with the command cvs commit [options] [-m LOG_MESSAGE | -F FILE] [-r revision] [FILES] to ensure that your changes are committed to the CVS repository.

CVS promotes concurrent development. Concurrent development is possible only if all the other developers are aware of the ongoing changes on a regular basis. This awareness can be termed as "situation awareness". One of the "bad" practices that commonly occur is the sharing of files between developers by email. This works against most of the best practices mentioned above. To share updates between two developers, CVS must be used as the communication medium. This will ensure that CVS is aware of the changes and can track them. Thus, audit trail can be established if necessary.

When you commit a change to the repository, make sure your change reflects a single purpose: the fixing of a specific bug, the addition of a new feature, or some particular task. Your commit creates new revision numbers for the changed files, which can be used from then on to refer to the change. You can mention these revision numbers in bug databases, or use them as arguments to a CVS merge should you want to undo the change or port it to another branch.

#1.7 Add/Commit data in proper way

CVS is not good at handling directories. Once a directory has been added, it cannot be removed from the repository in the normal way. Hence be careful when dealing with directories.

Moreover, CVS tends to exclude empty directories when checking out a module. This means any directory that is supposed to be empty at check-out time won't be included in the checked-out copy of the module. This problem often occurs when a directory is used to store temporary files which do not need to be kept in CVS. If such a directory is not present in the checked-out module, your local sandbox might not work as expected. To solve this problem, an empty file called .keepme can be added to the empty directory.
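Placing the .keepme markers can be automated. The following Python sketch walks a sandbox and creates the marker in every directory that is otherwise empty, ignoring CVS's own administrative CVS/ subdirectories. The function name add_keepme_files is hypothetical, and the created files still have to be added and committed with cvs add and cvs commit afterwards.

```python
import os


def add_keepme_files(root, marker=".keepme"):
    """Create an empty marker file in every otherwise-empty directory."""
    created = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into (or count) CVS administrative directories.
        dirnames[:] = [d for d in dirnames if d != "CVS"]
        if not dirnames and not filenames:
            path = os.path.join(dirpath, marker)
            open(path, "w").close()
            created.append(path)
    return created
```

The function returns the list of files it created, which makes it easy to review them before adding them to the repository.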

#1.8 Use the issue-tracker wisely

Try to create as many two-way links as possible between CVS changesets and your issue-tracking database (gForge, Bugzilla, Mantis etc.):

If possible, refer to a specific issue ID in every commit log message. When appending information to an issue (to describe progress, or to close the issue) name the revision number(s) responsible for the change.

#2 Branching and Merging

Branching in CVS splits a project's development into separate, parallel histories. Changes made on one branch do not affect the other branches. Branching can be used extensively to maintain multiple versions of a product for providing support and new features.

Merging converges the branches back to the main trunk. In a merge, CVS calculates the changes made on the branch between the point where it diverged from the trunk and the branch's tip (its most recent state), then applies those differences to the project at the tip of the trunk.

#2.1 Know when to create branches

This is a hotly debated question, and it really depends on the culture of your software project. Rather than prescribe a universal policy, we'll describe three common ones here.

#2.1.1 The Never-Branch system

(Often used by nascent projects that don't yet have runnable code.) Users commit their day-to-day work on /trunk. Occasionally /trunk "breaks" (doesn't compile, or fails functional tests) when a user begins to commit a series of complicated changes.

Pros: Very easy policy to follow. New developers have low barrier to entry. Nobody needs to learn how to branch or merge.

Cons: Chaotic development, code could be unstable at any time.

Note: this sort of development is a bit less risky in Subversion than in CVS. Because Subversion commits are atomic, it's not possible for a checkout or update to receive a "partial" commit while somebody else is in the process of committing.

#2.1.2 The Always-Branch system

(Often used by projects that favor heavy management and supervision.) Each user creates/works on a private branch for every coding task. When coding is complete, someone (original coder, peer, or manager) reviews all private branch changes and merges them to /trunk.

Pros: /trunk is guaranteed to be extremely stable at all times.

Cons: Coders are artificially isolated from each other, possibly creating more merge conflicts than necessary. Requires users to do lots of extra merging.

#2.1.3 The Branch-When-Needed system

Users commit their day-to-day work on /trunk.

Rule #1: /trunk must compile and pass regression tests at all times. Committers who violate this rule are publicly humiliated.

Rule #2: a single commit (changeset) must not be so large so as to discourage peer-review.

Rule #3: if rules #1 and #2 come into conflict (i.e. it's impossible to make a series of small commits without disrupting the trunk), then the user should create a branch and commit a series of smaller changesets there. This allows peer-review without disrupting the stability of /trunk.

Pros: /trunk is guaranteed to be stable at all times. The hassle of branching/merging is somewhat rare.

Cons: Adds a bit of burden to users' daily work: they must compile and test before every commit.

#2.2 Assign ownership to Trunk and Branches

The main trunk of the source tree and each of the branches should have an owner assigned, who will be responsible for the following:

#2.2.1 Keeping the list of configurable items for the branch or trunk

The owner will be the maintainer of the contents list for the branch or trunk. This list should contain the item name and a brief description of the item. This list is essential since new artifacts are added to or removed from the repository on an ongoing basis. This list will be able to track the additions to and deletions from the repository for the respective branch.

#2.2.2 Establishing a working policy for the branch or trunk

The owner will establish policies for check-in and check-out. The policy will define when code can be checked in (after coding, after review etc.) and who is responsible for merging changes to the same file and resolving conflicts (the author or the person who most recently changed the file).

#2.2.3 Identifying and documenting policy deviations

Policies, once established, tend to have exceptions. The owner will be responsible for identifying the workaround and tracking/documenting it for future use.

#2.2.4 Merging with the trunk

The branch owner will be responsible for ensuring that the changes in the branch can be successfully merged with the main trunk at a reasonable point in time.

#2.3 Tag each release

As part of the release process, the entire code base must be tagged (with the cvs tag [options] SYMBOLIC_TAG [FILES] command) with an identifier that uniquely identifies the release. A tag gives a label to the collection of revisions represented by one developer's working copy (usually, that working copy is completely up to date so the tag name is attached to the "latest and greatest" revisions in the repository).

The identifier for the tag should provide enough information to identify the release at any point in time in the future. One suggested tag identifier is of the form:

release_{major version #}_{minor version #}
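Generating and checking tag names in one place helps keep them consistent. Here is a small Python sketch of the suggested form; the helper names release_tag and parse_release_tag are hypothetical. Underscores are used because CVS tag names must start with a letter and may contain only letters, digits, "-" and "_", so a dotted version such as 1.2 cannot be used directly.

```python
import re


def release_tag(major, minor):
    """Build a tag name of the form release_{major}_{minor}."""
    return f"release_{major}_{minor}"


def parse_release_tag(tag):
    """Return (major, minor) from a release tag, or raise ValueError."""
    match = re.fullmatch(r"release_(\d+)_(\d+)", tag)
    if match is None:
        raise ValueError(f"not a release tag: {tag!r}")
    return int(match.group(1)), int(match.group(2))


print(release_tag(1, 2))
```

The resulting name would then be passed to cvs tag as the SYMBOLIC_TAG argument.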

Check out the entire codebase using the tag, and then go through a build / deploy / test process before making the actual release. This ensures that what "leaves the door" is a verified and tested codebase.

#2.4 Create a branch after each release

After each software release, once the CVS repository is tagged, a branch has to be immediately created. This branch will serve as the bug fix baseline for that release. This branch is created only if the release is not a bug fix or patch release in the first place. Patches that have to be made for this release at any point in time in the future will be developed on this branch. The main trunk will be used for ongoing product development.

With this arrangement, the code changes for ongoing development will be on the main trunk, and the branch will provide a separate partition for hot fixes and bug fix releases. The identifier for the branch name can be of the form.

A branch can be created using the cvs tag -b BRANCH_NAME command.

#2.5 Make bug fixes to branches only

This practice extends the previous practice of creating a separate branch after a major release. The branch will serve as the code base for all bug fixes and patch releases that have to be made. Thus, there is a separate repository "sandbox" where the hot fixes and patches can be developed apart from the mainstream development.

This practice also ensures that bug fixes done to previous releases do not mysteriously affect the mainstream version. In addition, new features added to the mainstream version do not creep into the patch release accidentally.

#2.6 Make patch releases from branches only

Since all the bug fixes for a given release are done on its corresponding branch, the patch releases are made from the branch. This ensures that there is no confusion about the feature set that is released as part of the patch release. After the patch release is made, the branch has to be tagged using the release tagging practice (see Tag each release).

#2.7 Merge branch with the trunk after release

After each release from a branch, the changes made to the branch should be merged (with the cvs update -j BRANCH command) into the trunk. This ensures that all the bug fixes made for the patch release are properly incorporated into future releases of the application.

This merge could potentially be time consuming, depending on the amount of changes made to the trunk and to the branch being merged. In fact, it will probably result in a lot of conflicts in CVS that require manual merges. After the merge, the trunk code base must be tested to verify that the application is in proper working order. This must be kept in mind while preparing the project schedule.

If changes occur on a branch over a long period, they can be merged into the main trunk on a regular basis, even before the release is made. The frequency of merging is based on logical points in the branch's evolution. To ensure that duplicate merging does not occur, the following practice can be adopted.

In addition to the branch tag, a tag called {branch_name}_MERGED should be created. This is initially at the same level as the last release tag for the branch. This tag is then "moved" after each intermediate merge by using the -F option. This eliminates duplicate merging issues during intermediate merges.
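The intermediate-merge routine can be written down as the pair of CVS commands it boils down to. The Python sketch below only builds the command lines and does not run them; the function name and the module argument are assumptions. It relies on cvs update accepting two -j options (merge the changes between two tags) and on cvs rtag -F -r to move an existing tag to the tip of the branch.

```python
def intermediate_merge_commands(branch, module):
    """Return the CVS command lines for one intermediate merge.

    Hypothetical helper: commands are returned as argument lists
    and are not executed here.
    """
    merged_tag = f"{branch}_MERGED"
    return [
        # Run in a trunk working copy: apply only the branch changes
        # made since the last merge marker.
        ["cvs", "update", "-j", merged_tag, "-j", branch],
        # After committing the merge: move the marker to the branch tip.
        ["cvs", "rtag", "-F", "-r", branch, merged_tag, module],
    ]


for command in intermediate_merge_commands("release_1_0_patches", "mymodule"):
    print(" ".join(command))
```

If other developers commit to the branch between the merge and the moving of the marker, those commits would be skipped by the next merge, so the two steps are best done while the branch is quiet.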

#3 Summary

Not all of the above policies have to be followed exactly as written. There are always exceptions, depending upon the type of project and the amount of work to be done. Hence the policies should be applied in whatever way suits the project best.
