9 Sept 2008

SEO guidelines

#1 About this document

This document defines generic search engine optimization (SEO) requirements for various projects.

At present this document contains only general SEO guidelines. It will be expanded when training sessions are held, so that it can serve as a reference for almost all SEO requirements.

#2 General requirements

#2.1 Server location

The server should be located in the country from which it will mostly be accessed. If the service has its own domain, it should reside on a dedicated server. Wildcard DNS should not be allowed; all subdomains, if any, should be activated separately.

#2.2 Robots.txt

The robots.txt file is to be placed in the root directory of the site (the value of the DocumentRoot directive if the web server is Apache). It should allow search engines to crawl all directories in which information about the various entities is shown.

Personal pages that require login, such as the listing of an owner's entities or the posting/editing of entities, should be blocked. Most search engines nowadays detect this behavior on their own, so you may omit such entries from the robots.txt file.

#2.3 Encoding

If the language of the website uses special characters, they need to be encoded in URLs and file names using UTF-8 encoding (for example with a PHP function such as urlencode()). For full documentation of the encoding of such characters, see http://www1.tip.nl/~t876506/utf8tbl.html. As practically all browsers support the Unicode UTF-8 standard, it should not be necessary to encode the characters in the actual content. If needed, suitable HTML entities can be looked up at http://leftlogic.com/lounge/articles/entity-lookup/.

If someone requests a URL written with the raw special characters instead of the encoded ones, and their browser overrides UTF-8 with some other character set, the page should answer with a 301 redirect to the correctly encoded URL. See how Wikipedia handles this for an example. This prevents links with the wrong character set from being used on external pages.

#2.4 Header responses

#2.4.1 Page not found (404 error)

Entities that have been removed from the database/software should not be shown. When someone accesses the page of a removed listing, the server should respond with a 404 header response (not a 200 response) and show an error message (or optionally a separate page) saying that the entity has been deleted, has expired, has been sold, etc. Furthermore, the relevant listing page should be shown.

#2.4.2 Redirects (301 response)

As a general rule of thumb, all redirects should use the 301 (Moved Permanently) response. All subdomain variants should be redirected this way (example.com -> 301 -> www.example.com), and so should all other domains that contain the same information, as shown below:

www.example.net -> 301 -> www.example.com
www.example.in -> 301 -> www.example.com

Also ensure that only the specified URLs work, and add a 301 redirect rule for all non-specified URLs that are requested.
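
The host redirects above could be sketched as follows. This is only an illustration in Python, not the configuration of any particular web server; the names CANONICAL_HOST, ALIAS_HOSTS and canonical_redirect are assumptions made for this example, and the host names are the placeholder domains used above.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical settings: the one host that should serve content,
# and the alias hosts that should answer with a 301 to it.
CANONICAL_HOST = "www.example.com"
ALIAS_HOSTS = {"example.com", "www.example.net", "www.example.in"}


def canonical_redirect(url):
    """Return (301, target_url) if url is on an alias host, else None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in ALIAS_HOSTS:
        # Keep scheme, path and query string; only the host changes.
        target = urlunsplit(
            (parts.scheme, CANONICAL_HOST, parts.path or "/", parts.query, "")
        )
        return 301, target
    return None


print(canonical_redirect("http://example.com/tables?page=2"))
print(canonical_redirect("http://www.example.com/tables"))
```

A request that already uses the canonical host returns None, which means it is served normally rather than redirected.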

#3 General page requirements

#3.1 Using standards

The site should comply with the World Wide Web Consortium's (http://www.w3.org/) recommendations for creating web pages (XHTML 1.0 Transitional should be enough) and, if required, with the Americans with Disabilities Act (http://www.ada.gov/).

#3.2 Page design

The pages should be designed with CSS positioning, and the content part of the page should appear in the source code as early as possible, preferably before other body content such as navigational blocks.

The navigation should be implemented with anchor tags and text, and the links should not redirect.

Breadcrumb navigation improves SEO through internal back links, and improves usability because visitors can see their location on the site. An example of breadcrumb navigation: Home => List furniture items => View table => ...

Scripts and other elements (CSS) should be put in external files. The source code should be kept clean, with little or no unused code. The preferred maximum file size for the HTML code is 100 KB.

#3.3 Elements of a page

The following elements should always be included (and be editable in some way) on a page that is to be indexed by search engines:
  • Page title (<title> element in the header)
  • Meta description, robots and keywords (in the header)
  • Page heading (one <h1> per page)

#3.3.1 Page title (<title> element in the header)

A page title should be as specific and concise as possible with respect to the document. This will ensure its uniqueness and improve click-through in search engine result pages. A structure similar to "Page name | Section name | Site title - Tagline" is encouraged for clarity, uniqueness and better usability for the visitor. Focus on delivering a title that runs from specific keywords (closer to the beginning) to general ones. The title should be no longer than 80 characters.
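
As a rough sketch of the structure and length limit described above, a title could be assembled like this. The function name build_title and the fallback order (dropping the most general parts first) are assumptions made for this illustration, not a required implementation.

```python
def build_title(page, section, site, tagline, limit=80):
    """Build "Page | Section | Site - Tagline", shortening it to fit limit.

    The most general parts are dropped first, so the specific
    keywords at the beginning of the title survive.
    """
    candidates = [
        f"{page} | {section} | {site} - {tagline}",
        f"{page} | {section} | {site}",
        f"{page} | {section}",
        page,
    ]
    for title in candidates:
        if len(title) <= limit:
            return title
    # Even the page name alone is too long: cut it hard.
    return page[:limit].rstrip()


print(build_title("Round tables", "Tables", "My furniture",
                  "Quality office furniture"))
```

Dropping whole parts rather than cutting in the middle of a word keeps the shortened title readable in result pages.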

#3.3.2 Meta description, robots and keywords (in the header)

An HTML meta description of around 150 characters should be sufficient. Although it doesn't hurt to go a little over, this data should contain the most concise information about the document. The uniqueness of this information also plays a fair role as far as search engine result pages are concerned.

Meta keywords, on the other hand, are not really necessary, since it is the responsibility of the search engine indexers to determine the nature and relevance of the document. For the sake of accuracy, they cannot rely on what the document claims to be. The Web has moved on from relying on this sort of self-declared meta information, and today the results gained from meta keywords are negligible. Some examples of well written meta tags:

<meta name="description" content="Suppliers of quality office furniture and accessories at discount prices.">
<meta name="keywords" content="furniture, office, store, shop, retail, discount">

#3.3.3 Page heading (one <h1> per page)

A properly structured document will consist of headings, paragraphs, lists, tables and forms, and use an external stylesheet to style them. Many search engines place more emphasis on text within heading tags (and not just on keywords provided in meta elements), so make sure the headings use keywords. Use one <h1> tag per page, containing the most important keywords. You can also use other heading tags (<h2>, <h3> etc.) to provide variations and support the main heading.

Some examples of heading tags:

<h1>Tables</h1>
<h2>Round tables</h2>
<p>... information about round tables ...</p>
<h2>Square Desks</h2>
<p>... information about square desks, etc.</p>

#3.3.4 Body text

Make sure the text of your web pages contains keywords and common phrases that people might search for. Be careful with the frequency of your keywords: they should occur at least a few times if possible, but don't repeat them so often that the copy becomes unnatural. The idea is to spread keywords around discreetly without making it obvious.

A well written document will naturally use keywords that are appropriate and in proportion. Search engine algorithms essentially compare similar documents to get a better understanding of the nature of a document. If a document is not well written and produces unbalanced scores, it will raise flags and may be marked as not relevant, since this indicates a document written for the machine and not for the human reader. Keep in mind that indexing is in place to assist human searches. An example of good body text:

<p>Buy office furniture at affordable prices from any of our retail stores.</p>

#3.3.5 Images and Pictures

Pictures that are not part of the page template should always include an ALT description. This description should either be generated automatically or be editable (this is already partly a requirement of the Americans with Disabilities Act).

#3.4 Automation

The title element and the meta description and keywords need to be generated automatically from different templates. These templates include page- and directory-specific elements as well as generic elements. An example of a template for the title element of a page called Search results page could be:

[Results] - [category] - Search results – My furniture example.com

The different elements that could be included are:
  • Results = Search results pages (New, Old, All)
  • Category name = Such as Wood tables, Wood chairs, Metal chairs
  • Area = Can represent the location of the entity.
  • Page number in the search results, if applicable

The category name or area might not be in its basic form; different grammatical forms might be needed. In the title, meta information and headings, the keywords or key phrases are added as is or in another grammatical form. When the URL is generated automatically (URL rewriting), however, it may need some encoding if a language with non-ASCII characters is used. Non-ASCII keywords and phrases included in URLs need to be percent-encoded as hex values (for example with a PHP function such as urlencode()), like:

www.example.com/product/table/એપલ => www.example.com/product/table/%E0%AA%8F%E0%AA%AA%E0%AA%B2
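
The same percent-encoding can be reproduced outside PHP. The sketch below uses Python's urllib.parse.quote as a stand-in for the urlencode() call mentioned above; the helper name encode_path_segment is an assumption made for this example, and the keyword is written with Unicode escapes only to keep the snippet independent of the file encoding.

```python
from urllib.parse import quote, unquote


def encode_path_segment(segment):
    """Percent-encode one URL path segment as UTF-8 bytes in hex."""
    # safe="" also encodes "/" so that a keyword can never add a path level.
    return quote(segment, safe="")


keyword = "\u0a8f\u0aaa\u0ab2"  # the Gujarati keyword from the example URL
encoded = encode_path_segment(keyword)

print("www.example.com/product/table/" + encoded)
print(unquote(encoded) == keyword)  # decoding gives the keyword back
```

Each of the three characters becomes three UTF-8 bytes, which is why the encoded form consists of nine %XX groups.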

#4 Index page

The index page of your website is likely to receive the highest number of inbound links, since it is the entry point of the site. Linking to the other pages of the website from this page is therefore very important. In theory this page should link to almost all pages reachable from it.

However, the number of links on such a page should be kept to around 100, so in many projects it is not possible to display all links. In such cases the most important links should be visible here, and the remaining pages can be linked from those pages, because the purpose is to chain all important pages together so that they get indexed.

To make this work, it is important to identify those important links. For example, if you are selling something, the home page can link to the pages that list items per product category. Similarly, if the items are bound to a geographical location and your website lists items for sale per province/city/area, links to those pages could be placed on this page.

#5 Search pages

Search pages, whether simple or extended, need not be indexed, as by default they do not contain any information to be searched for.

However, from a usability point of view, their URLs, page design and on-page information should be properly designed and implemented.

#6 Listing pages

Listing pages are the 2nd most important pages of any website, as they display information about the entities for which the website was created. Listed entities can be of various types, such as items for sale, ads, jobs etc.

Such listings may contain pagination and sorting links, depending on the results and the interest of users. It is recommended to keep pagination links as plain text links so that search engines can crawl through all available pages and index them. Sorting links, however, may be implemented using JS (Ajax) etc. so that additional queries to the server are minimized. From a search engine's point of view, it doesn't matter in what order the information is displayed.

If possible, the URL scheme of such pages should be self-descriptive. For example, the URLs of a furniture selling website can be designed like this:

www.myfurniture.com/tables/round
www.myfurniture.com/tables/square
www.myfurniture.com/tables/plastic
www.myfurniture.com/chairs/rocking
www.myfurniture.com/chairs/revolving

The text on such pages should be as informative as possible, and the number of entities per list should be kept to around 10 to 30. Listing pages may also contain links to other important pages that are to be indexed.

#7 View pages

View entity pages show detailed information about the entities listed on listing pages. The title, meta description, meta keywords and H1 tag should contain information about the entity being viewed.

#8 Other pages

Other pages may include pages like Login pages, Posting/Editing entity pages etc.

#8.1 Login pages

Such pages should not get indexed as they don't contain any public searchable information.

#8.2 Posting/Editing entity pages

Pages that contain forms to be submitted are not normally indexed, as they don't display any searchable information to the general public.

As a general rule of thumb, pages that change the state of the server (e.g. data is inserted/updated, a file is created/deleted etc.) and pages that are personal to users are not indexed, as they are tightly integrated with the data of the website.

#9 Resources

http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=35769
http://help.yahoo.com/l/us/yahoo/search/basics/basics-18.html

13 Mar 2008

Lighttpd vs. Apache

#1 Lighttpd overview

Lighttpd is an open source web server, similar to Apache, used to serve web pages. It was developed by Jan Kneschke, a MySQL developer, as a proof of concept for the C10K problem. The immediate reason for its creation was therefore to overcome weaknesses of the Apache web server, such as its high memory footprint.

The prefork model that Apache uses consumes a lot of memory per process (normally more than 20 MB). Multiplied by the number of processes running simultaneously, this quickly exhausts the RAM of the server. Lighttpd beats Apache here with a very low memory footprint (just 6 MB), which means faster output from the web server. The response is faster still when static content is delivered. In Netcraft's latest web server survey, Lighttpd appears among the top 5 web servers currently used on the Internet.
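
The multiplication above is easy to make concrete. The following sketch uses the 20 MB per process figure quoted above; the function names and the example RAM sizes are assumptions chosen only for illustration.

```python
def prefork_memory_mb(processes, per_process_mb=20):
    """Total memory used by a pool of prefork processes, in MB."""
    return processes * per_process_mb


def max_prefork_processes(ram_mb, per_process_mb=20, reserved_mb=0):
    """How many processes fit in RAM after reserving some for the system."""
    return max(0, (ram_mb - reserved_mb) // per_process_mb)


# 100 simultaneous Apache processes at 20 MB each:
print(prefork_memory_mb(100), "MB")

# A server with 2048 MB of RAM, keeping 512 MB for everything else:
print(max_prefork_processes(2048, reserved_mb=512), "processes")
```

With these assumed numbers, fewer than a hundred simultaneous prefork processes already fill a 2 GB server.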

#2 How to set it up

The usual and preferred installation instructions can be found on the installation tutorial page (see Links). For Yum users, the single command yum install zlib pcre lighttpd lighttpd-fastcgi does almost everything.

If you want to start and stop Lighttpd manually, you're done. To install Lighttpd as a service like Apache, edit and install the init script (only if you have installed Lighttpd from source):

# sed -e 's/FOO/lighttpd/g' doc/rc.lighttpd > lighttpd.init
# chmod a+rx lighttpd.init
# cp lighttpd.init /etc/init.d/lighttpd
# cp -p doc/sysconfig.lighttpd /etc/sysconfig/lighttpd
# install -Dp ./doc/lighttpd.conf /etc/lighttpd/lighttpd.conf
# chkconfig lighttpd on

If you have installed Lighttpd using Yum, only the last step is needed. You can also start and stop the Lighttpd service with commands like /etc/init.d/lighttpd start|stop|restart|condrestart|reload|status or service lighttpd start|stop|restart|condrestart|reload|status.

To just test lighttpd.conf, run the command lighttpd -t -f /PATH/TO/CONF/lighttpd.conf

#3 Differences between Apache and Lighttpd

#3.1 General

The main difference between Apache and Lighttpd is the serving model: Lighttpd is event-driven, while Apache is threaded or pre-forked.

Apache provides different multi-processing modules (MPMs) for different runtime environments. The prefork model creates a number of processes when the service starts and manages them in a pool. However, each process requires a lot of memory to handle requests, so the more processes there are, the more memory is required. Simultaneous Apache processes therefore quickly eat up the available RAM.

Lighttpd, on the other hand, uses a single process, a single thread and non-blocking I/O. For that it uses the fastest event handler available on the target system, such as poll, epoll, kqueue or /dev/poll. This difference makes Lighttpd faster than Apache at serving static files.

However, the biggest difference between the two is how they support scripting languages (especially PHP). Apache has the upper hand here because it supports the easy-to-use shared module version, CGI and FastCGI, while Lighttpd offers no shared module option and, at the moment, runs PHP through FastCGI.

#3.2 Configuration level

The configuration files of Lighttpd (lighttpd.conf) and Apache (httpd.conf) differ visibly in style. The syntax of lighttpd.conf looks more like that of php.ini, while httpd.conf has an XML-like syntax. Here is an example of some basic configuration:

#3.2.1 Basic Configuration

Apache:

DocumentRoot /var/www/html
CustomLog /var/www/logs/access common
ErrorLog /var/www/logs/error
User apache
Group apache

Lighttpd:

server.document-root="/var/www/html"
accesslog.filename="/var/www/logs/access"
server.errorlog="/var/www/logs/error"
server.username="apache"
server.groupname="apache"
server.modules=("mod_accesslog")

#3.2.2 Virtual Hosts

Below is an example of the difference between virtual hosts in Apache and Lighttpd, shown for a project called myproject.

Apache:

NameVirtualHost *:80

<VirtualHost *:80>
 ServerName 'www.myproject.com'
 DocumentRoot '/web/myproject/web'
 ErrorLog '/web/logs/myproject_error'
</VirtualHost>

Include conf.d/virtualhosts/*.conf

Lighttpd:

$HTTP["host"] == "www.myproject.com" {
 server.document-root="/web/myproject/web"
 server.errorlog="/web/myproject_error"
}

#3.2.3 Authentication and Authorization

At the moment Lighttpd does not support .htaccess files, so all settings must be specified in lighttpd.conf or in the configuration files that it includes. It does understand Apache user files for basic and digest authentication; group file support is not yet implemented but is planned. Here is an example of authentication and authorization:

Apache:

<Directory ~>
  AuthName "Authentication required to access this area."
  AuthType Basic
  AuthUserFile /web/myproject/docs/valid.users
  Order deny,allow
  Require valid-user
</Directory>
 
Lighttpd:

auth.backend="htpasswd"
auth.backend.htpasswd.userfile="/web/myproject/docs/valid.users"
auth.require=
("~" =>
  (
    "method" =>"basic",
    "realm"  =>"Authentication required to access this area.",
    "require"=>"valid-user"
  )
)

In summary, the Lighttpd configuration file behaves like an active script in which you can declare variables, write conditional logic and compute values, much as in a programming script. This makes the configuration file flexible and agile.

#4 How to run PHP under Lighttpd

#4.1 Configuring PHP under Lighttpd

Apache processes PHP internally, i.e. using the shared module mod_php, while Lighttpd runs PHP under FastCGI. Although Apache also supports FastCGI, running PHP under FastCGI with Apache is uncommon. With Lighttpd, however, FastCGI is the only option, so PHP must be compiled with the FastCGI option (though this is not needed with Apache). For more information, please read http://trac.lighttpd.net/trac/wiki/TutorialLighttpdAndPHP. Below is an example of the difference between running PHP under Apache and Lighttpd.

Apache:

LoadModule php5_module modules/libphp5.so
AddType application/x-httpd-php .php

Lighttpd:

server.modules=( ..., "mod_fastcgi", ... )
fastcgi.server=( ".php" =>
                 (
                   (
                     "socket" => "/tmp/php-fastcgi.socket",
                     "bin-path" => "/usr/bin/php-cgi",
                     "broken-scriptfilename" => "enable",
                     "bin-environment" =>
                     (
                       "PHP_FCGI_CHILDREN" => "2",
                       "PHP_FCGI_MAX_REQUESTS" => "5000"
                     ),
                     "min-procs" => 1,
                     "max-procs" => 2,
                     "idle-timeout" => 60
                   )
                 )
               )

You may need to adjust the path of php-cgi to match your setup. Note that the server.modules directive already exists, along with other modules, at the top of the configuration file; the line above only indicates that mod_fastcgi should be enabled in lighttpd.conf.

Then one directive needs to be set as follows in the php.ini configuration file, if that file exists; if it doesn't, there is nothing to do.

cgi.fix_pathinfo=1

The last 4 directives of the configuration above (bin-environment, min-procs, max-procs and idle-timeout) tune how the PHP FastCGI processes are spawned and managed.

#4.2 Application wise changes

As FastCGI runs PHP as a separate process, PHP directives cannot be set in the configuration file of the web server (i.e. lighttpd.conf). This is one of the biggest drawbacks of FastCGI and a reason why PHP is rarely run under it. Moreover, in FastCGI mode your PHP scripts get limited support from the web server, which may force you to change or rewrite them. For now, therefore, running PHP under Lighttpd is not recommended, because Lighttpd offers only FastCGI mode and so lacks features that PHP applications may rely on, such as setting php.ini options in the server configuration file for all hosts or per host.

Moreover, many benchmarks show that PHP runs slower under FastCGI than as a shared module under Apache.

However, if PHP has to run on Lighttpd, 2 major changes are required:
  1. Move all PHP related settings either into php.ini directly or into a configuration or global file of the application.
  2. Remove all Apache web server related variables and settings from the application.

Once these changes are done, the application needs to be tested heavily to find out whether any of its functionality is broken.

#5 mod_uploadprogress and Prototype

This feature will be available in Lighttpd version 1.5.0, which has not been released yet, so it cannot be tested or described further here. More information about this module can be found on its documentation page (see Links). Once released, it should be one of the finest modules of Lighttpd, because it can easily be integrated with front-end applications using JSON.

#6 Lighttpd and output compression

Lighttpd provides output compression for static data through the mod_compress module. Before sending static content to the client, mod_compress compresses it and saves the result at a specified path. This compressed, cached copy is then served directly from the cache location when a similar request is made by the same or a different client, saving valuable bandwidth and reducing response time.

Lighttpd supports 3 types of compression: deflate, gzip and bzip2. A limitation is that Lighttpd cannot compress files larger than 128 MB or smaller than 128 bytes.

To enable compression, 3 directives need to be set in lighttpd.conf:

compress.cache-dir="/var/www/cache/myproject/"
compress.filetype=("text/plain", "text/html")
compress.max-filesize=1024    # value is in kBytes, i.e. 1 MB

However, since there is already an upper limit of 128 MB on the file size, the last directive does not have to be declared. When compressing various types of static data, keep in mind that if no file type or a wrong file type is listed, no file will be compressed.

You may need to create the cache folder manually and give it the necessary write permissions. The cached content is not cleared automatically, so it is left to the developer to clean it up periodically. A command like the following can be used to remove content that is older than a week.

$ find /var/www/cache/myproject/ -type f -mtime +7 | xargs -r rm
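
If the cleanup has to run from a script rather than from the shell, roughly the same thing can be done as sketched below. The function name remove_old_files is an assumption for this example, and the age check is a plain "older than 7 times 24 hours" comparison, which rounds slightly differently from find's -mtime +7.

```python
import os
import time


def remove_old_files(cache_dir, max_age_days=7, now=None):
    """Delete regular files under cache_dir older than max_age_days.

    Returns the list of removed paths. Directories are left in place.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                continue  # like find -type f, skip symbolic links
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed.append(path)
    return removed


if __name__ == "__main__":
    # Same directory as in the compress.cache-dir example above.
    for path in remove_old_files("/var/www/cache/myproject/"):
        print("removed", path)
```

Such a script could be run periodically, for example from cron, by a user with write access to the cache folder.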

To compress dynamic content we need to rely on PHP itself, which natively supports good compression of its output. For that, the following directive is to be set in php.ini or in an equivalent configuration file.

zlib.output_compression=On

Please note that when zlib.output_compression is enabled, the standard output_handler directive should be left empty. If an additional output handler is needed, it has to be named in zlib.output_handler instead of output_handler, otherwise the output will be corrupted:

output_handler=
zlib.output_handler=

#7 Lighttpd and caching

#7.1 Caching overview

Caching is another way to gain better performance in serving content and to reduce the response time of your PHP scripts. Several kinds of caching software are available for PHP. Some important ones are Zend Platform, APC (APC GUI), XCache, eAccelerator, ionCube Encoder and PHP Accelerator. Certain web servers like Lighttpd also provide built-in modules for caching static content at the web server level: mod_expire, mod_mem_cache and mod_cml. By combining the caching of static and dynamic content effectively, we can gain a lot of speed in serving content. However, these mechanisms do not all work the same way.

The independent software packages mentioned above perform opcode/bytecode caching, i.e. they cache your PHP script in its compiled state, so that when a new request arrives for the same script, the cache serves the compiled version directly instead of reading the file from disk again and compiling it. Of these 6, eAccelerator, XCache and APC are the most widely used. Benchmarks also show that XCache and APC perform better than the others. We will learn more about XCache in a short while.

As said earlier, only a good combination of static and dynamic caching gives a considerable boost in performance, so we should try to cache as much content as possible. For static content, the integrated modules of the web server are the best candidates. In the case of Lighttpd these are mod_expire and mod_mem_cache (the latter is not provided by default).

#7.2 mod_expire

mod_expire controls the Expires header in the response header of HTTP/1.0 messages. It is useful for static files that should be cached, such as images and style sheets. To use this module, it first needs to be enabled in the server.modules directive array. Module-specific directives are then set in the server's configuration file using the syntax shown below.

<access|modification> <number> <years|months|days|hours|minutes|seconds>

Some examples:
  • Cache the contents of the images folder for 2 hours.

expire.url = ( "/images/" => "access 2 hours" )

  • Cache the contents of all sub-folders of the images folder for 2 hours.

$HTTP["url"] =~ "^/images/" {
     expire.url = ( "" => "access 2 hours" )
}

The unit can be hours, months, days etc., depending on the requirement.
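
To make the effect of such a rule concrete, the sketch below computes the Expires header value that a rule like "access 2 hours" leads to. It only illustrates the arithmetic and is not how mod_expire is implemented; the function name is an assumption, and only units with a fixed length in seconds are handled, since the exact length assumed for months and years is not stated here.

```python
from email.utils import formatdate

# Units with an unambiguous length in seconds (months/years left out).
UNIT_SECONDS = {"seconds": 1, "minutes": 60, "hours": 3600, "days": 86400}


def expires_header(rule, access_time, modification_time):
    """Return the Expires value for a "<base> <number> <unit>" rule.

    access_time and modification_time are Unix timestamps.
    """
    base, number, unit = rule.split()
    if base not in ("access", "modification"):
        raise ValueError("base must be 'access' or 'modification'")
    start = access_time if base == "access" else modification_time
    expires_at = start + int(number) * UNIT_SECONDS[unit]
    # HTTP dates are always expressed in GMT.
    return formatdate(expires_at, usegmt=True)


# A file requested at Unix time 0 with the rule from the examples above:
print(expires_header("access 2 hours", access_time=0, modification_time=0))
```

With the access base the expiry moves with every request, while the modification base ties it to the last change of the file.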

#7.3 mod_mem_cache

mod_mem_cache is a plugin that stores the content of files in memory for faster serving. That is, it keeps the specified file types in memory and serves them directly from there instead of reading them from disk, saving disk access time. It is a 3rd party module and is therefore not included in the official distribution of Lighttpd.

This module does not seem very promising for caching, as memory should be used for processing data rather than for storing it. Moreover, memory should not be occupied by files that can reside on, and be easily managed on, disk. For example, when there are thousands of images to serve, it is not advisable to store them all in memory just to serve them faster. More information about this plugin can be found in its own documentation.

#7.4 mod_cml (Cache Meta Language)

mod_cml is another caching module, similar to mod_expire, which Lighttpd provides to cache the static content of dynamic pages. The difference between the two is that mod_cml can cache static fragments that are part of dynamic content. For example, a dynamic page called index.php might include static content such as menu.html and banner.html that is not an integral part of index.php. With mod_cml, these 2 static fragments can be cached and delivered directly from the cache.

However, this type of caching cannot be handled by the Lighttpd web server and mod_cml alone, so some code has to be written in PHP or in special CML scripts for mod_cml, which are written in the Lua programming language.

Using mod_cml requires installing the Lua programming language and libmemcache-1.3.x. Additionally, Lighttpd must be compiled with the 2 options --with-lua and --with-memcache.

#7.5 XCache

XCache is a newly emerging candidate in the market for caching PHP scripts. It is independent software and not a module of Lighttpd, although it has been written by developers of Lighttpd.

XCache is an open-source opcode cacher, which means that it accelerates PHP on servers. It removes the compilation time of PHP scripts by caching their compiled state in shared memory (RAM) and running the compiled version straight from there. This can speed up page generation by up to 5 times, as it also optimizes many other aspects of PHP scripts and reduces server load. Some of the good features of XCache are:
  1. Optimized opcode cache.
  2. Using a generator to produce C code, reduces human mistake greatly.
  3. Running stable on PHP_4_3/PHP_4_4
  4. Supported and tested on all latest php cvs branches, such as PHP_4_3 PHP_4_4 PHP_5_0 PHP_5_1 PHP_5_2 HEAD (6.x)
  5. Alpha supported for in-alpha-php6, with Unicode enabled.
  6. Read-only Cacher Protection that prevents the cache from being corrupted by php-core/extension or any code other than XCache itself.
  7. Atomic get/set/inc/dec API operation on var cache for php programmers.
  8. Optimizer
  9. Encoder/Decoder(Loader)
  10. Administrator Script
      1. view statistics
      2. to see if it's AutoDisableOnCorrupted?
      3. view cached php/variable list
      4. clear cache

The last feature allows the administrator to view statistics and cached PHP variables and to manage the caching behavior of XCache.

#7.5.1 Installing XCache

The standard way to install XCache is from source. Get your desired version of XCache from the XCache project site. Then follow the steps below to install it.

$ tar -zxf xcache-*.tar.gz
$ cd xcache
$ phpize
$ ./configure --enable-xcache
$ make
$ su
# make install
# cat xcache.ini >> /etc/php.ini

To make sure XCache is properly installed, run the command below.

$ php-cgi -v

It will show a string like "with XCache vX.X, Copyright (c) XXXX-XXXX, by XXX". The same can be checked in the output of the phpinfo() function. Once XCache is installed, you can edit xcache.ini, which contains the various caching related directives; editing it is not mandatory, however. A complete explanation of all the directives can be found at http://trac.lighttpd.net/xcache/wiki/PhpIni.

#7.5.2 Configuring Administrator panel

The XCache administrator panel is a web interface with which you can monitor and operate your opcode cache and see how well (or badly) it is doing. Since this page is protected by HTTP authentication, certain values have to be provided in xcache.ini. Set the 2 directives below.

xcache.admin.user='USER'
xcache.admin.pass='MD5(PASSWORD)'

where USER is the user name you wish to use and MD5(PASSWORD) is the MD5 hash of the password you wish to use for that USER.
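
One way to produce that MD5 string is sketched below, using Python's standard hashlib module. The function name xcache_admin_pass and the sample password are assumptions made for this example; any tool that prints the hexadecimal MD5 digest of the password would do.

```python
import hashlib


def xcache_admin_pass(password):
    """Return the hexadecimal MD5 digest of password."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


# Hypothetical credentials, for illustration only.
user = "admin"
digest = xcache_admin_pass("password")

# Lines in the form expected by xcache.ini:
print("xcache.admin.user='%s'" % user)
print("xcache.admin.pass='%s'" % digest)
```

The digest is always 32 hexadecimal characters, whatever the length of the password.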

To set up the web interface, copy xcache/admin/ (the whole directory) to your web document root or a sub-directory of it, then request it from your browser. An HTTP authentication prompt will pop up, where you need to enter the above USER and PASSWORD (as a plain string, not the MD5 hash). When XCache has been installed from an RPM-based package, however, it may be necessary to define an alias in the web server instead of copying the script. To do so, add the directive below to your server configuration file.

Apache:

Alias /xcache-admin/ /usr/share/xcache/admin/

Lighttpd:

alias.url += ("/xcache-admin/" => "/usr/share/xcache/admin/")

Gaining a performance boost from caching is tricky; unless used carefully, it will not give the required boost. Since not everything can be cached (especially dynamic content), we should try to cache whatever can be. This can be achieved with the various types of caching discussed above. Static content is cached well by clients and, failing that, can be cached by web servers. PHP scripts can be cached using opcode caching software like APC, XCache etc., while the static parts of dynamic data can be cached by modules like mod_cml (for the Lighttpd web server only).

#8 Summary

In summary, the Lighttpd web server is certainly worth a look and worth using for serving static data. For dynamic content like PHP scripts it is not optimized (because it supports only FastCGI), so we have to wait until it supports a shared module version of PHP. At the moment it is widely used to serve static content only, so it will take time for it to really start competing with Apache.

However, certain modules like mod_secdownload, mod_compress, mod_geoip, mod_trigger_b4_dl, mod_uploadprogress, mod_useronline etc. are distinctive modules of Lighttpd that can help it stand firmly alongside the currently popular web servers.

#9 Links

http://www.onlamp.com/pub/a/onlamp/2007/04/05/the-lighttpd-web-server.html
http://survey.netcraft.com/Reports/0703/
http://schlitt.info/applications/blog/index.php?/archives/504-Apache-vs.-Lighttpd-echo-performance.html
http://trac.lighttpd.net/trac/wiki/TutorialInstallation
http://trac.lighttpd.net/trac/wiki/TutorialConfiguration
http://trac.lighttpd.net/trac/wiki/Docs:ConfigurationOptions
http://trac.lighttpd.net/trac/wiki/TutorialLighttpdAndPHP
http://trac.lighttpd.net/trac/wiki/Docs:ModUploadProgress
http://trac.lighttpd.net/trac/wiki/Docs:ModCompress
http://blog.lighttpd.net/articles/2006/08/01/mod_uploadprogress-is-back
http://trac.lighttpd.net/trac/wiki/Docs:ModCML
http://www-128.ibm.com/developerworks/library/os-php-fastapps1/index.html
http://trac.lighttpd.net/xcache/wiki/Faq

2 Jan 2008

Varnish accelerator

#1 Introduction

Varnish is a high performance HTTP accelerator (more precisely a Reverse proxy server) designed for content-heavy dynamic web sites. In contrast to other HTTP accelerators, many of which began life as client-side proxies or origin servers, Varnish was designed from the ground up as an HTTP accelerator. The Varnish web site claims that Varnish is ten to twenty times faster than the popular Squid cache on the same hardware.

Varnish is installed in front of one or more web servers. All connections from the Internet addressed to one of those web servers are routed through the proxy, which may either handle the request itself or pass it wholly or partially to the main web server.

There are various reasons to install reverse proxies. They are:
  • Security: the proxy server is an additional layer of defence and therefore protects the webservers further up the chain.
  • Encryption / SSL acceleration: when secure websites are created, the SSL encryption is sometimes not done by the webserver itself, but by a reverse proxy that is equipped with SSL acceleration hardware.
  • Load distribution: the reverse proxy can distribute the load to several servers, each server serving its own application area. In the case of reverse proxying in the neighbourhood of webservers, the reverse proxy may have to rewrite the URLs in each webpage (translation from externally known URLs to the internal locations).
  • Caching static content: A reverse proxy can offload the webservers by caching static content, such as images. Proxy caches of this sort can often satisfy a considerable amount of website requests, greatly reducing the load on the central web server.
  • Compression: the proxy server can optimize and compress the content to speed up the load time.
  • Spoon feeding: when a program generates the page, the web server can hand the complete response to the reverse proxy at full speed and release the program immediately. The proxy then delivers the response at whatever pace each client can accept, so the program does not have to stay open while slow clients are served.
#2 Architecture

Varnish is heavily threaded, with each client connection being handled by a separate worker thread. When the configured limit on the number of active worker threads is reached, incoming connections are placed in an overflow queue; only when this queue reaches its configured limit will incoming connections be rejected.
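
The admission policy described above can be illustrated with a toy model. This Python sketch is not Varnish's implementation; the class name, method names and limits are hypothetical and only mirror the order of decisions: use a free worker, otherwise queue, otherwise reject.

```python
from collections import deque


class AdmissionModel:
    """Toy model of the policy: workers first, then overflow queue, then reject."""

    def __init__(self, max_workers: int, max_queue: int) -> None:
        self.max_workers = max_workers  # assumed limit on active worker threads
        self.max_queue = max_queue      # assumed limit on the overflow queue
        self.active = 0
        self.overflow = deque()

    def accept(self, conn_id: str) -> str:
        """Decide what happens to a newly arrived connection."""
        if self.active < self.max_workers:
            self.active += 1
            return "worker"
        if len(self.overflow) < self.max_queue:
            self.overflow.append(conn_id)
            return "queued"
        return "rejected"

    def finish(self) -> None:
        """A worker finished; give it the oldest queued connection, if any."""
        if self.overflow:
            self.overflow.popleft()  # the worker stays busy with this connection
        elif self.active > 0:
            self.active -= 1
```

With two workers and a queue of one, four simultaneous connections would be handled as worker, worker, queued and rejected.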

The principal configuration mechanism is VCL (Varnish Configuration Language), a DSL used to write hooks which are called at critical points in the handling of each request. Most policy decisions are left to VCL code, making Varnish far more configurable and adaptable than most other HTTP accelerators. When a VCL script is loaded, it is translated to C, compiled to a shared object by the system compiler, and linked directly into the accelerator.

A number of run-time parameters control things such as the maximum and minimum number of worker threads, various timeouts etc. A command-line management interface allows these parameters to be modified, and new VCL scripts to be compiled, loaded and activated, without restarting the accelerator.

In order to reduce the number of system calls in the fast path to a minimum, log data is stored in shared memory, and the task of filtering, formatting and writing log data to disk is delegated to a separate application.

#3 Installation

This section walks through a quick installation. Get the latest version of Varnish from here, or check it out from the repository.

#3.1 Prerequisites

The following tools are required to build Varnish:
  1. A recent version of GCC.
  2. A POSIX compatible make.
  3. Recent versions of GNU autotools like automake, autoconf, libtool.
Recent releases of most operating systems are likely to provide these tools.

#3.2 Configuring and Building

$ ./autogen.sh

You may see some error messages. Check whether configure and Makefile.in were generated. If they weren't, you probably need newer versions of the GNU autotools. If they were, run autogen.sh again: any error messages it still shows the second time around are most likely caused by bugs in autoconf macros installed by other software on your machine, and can safely be ignored.

Next, run configure. In most cases, the defaults are correct and you do not need to specify any command-line options, except perhaps --prefix. If you plan on hacking the Varnish sources, however, you will most likely want to turn on stricter error checks and dependency tracking:

$ ./configure

OR

$ ./configure --enable-debugging-symbols --enable-developer-warnings --enable-dependency-tracking

If configure completes without any errors, run the following two commands to compile and install Varnish.

$ make
$ make install

For more information please visit this link.

#3.3 Enabling Varnish caching

Varnish ships with a management console (reachable with telnet HOST PORT), a caching process that runs as a child of the management process (varnishd), and utilities for logging (varnishlog and varnishncsa), cache statistics (varnishstat), histograms (varnishhist) and log entry ranking (varnishtop).

The following commands enable Varnish caching on your servers.

$ varnishd -a www.example.com:80 -b www.example.com:8080
$ varnishd -a www.example.com:80 -f /usr/local/etc/varnish/myconf.vcl
$ varnishd -a www.example.com:80 -b www.example.com:8080 -T www.example.com:6082
$ varnishd -a www.example.com:80 -f /usr/local/etc/varnish/myconf.vcl -T www.example.com:6082

The first command says that www.example.com actually runs on the Apache web server on port 8080, while Varnish serves it on port 80, the default HTTP port. This arrangement is a must for production servers. On development or test servers the ports could be swapped, because during development and testing you may want to reach your websites without caching.

Sometimes we want a different caching policy (such as caching documents that carry cookies). Such policies are written in a special configuration syntax called VCL, and the second command tells Varnish to use a custom VCL file instead of the default configuration. The -f and -b switches cannot be combined, because the backend that -b would specify is now defined in the configuration file.

Once caching is running it can be controlled from the management console, where caching can be started and stopped and configuration values can be set and unset. Two steps are needed:
  1. start Varnish with the -T switch, as in the third or fourth command, and
  2. use the Telnet utility to open the management console on the given host and port (for example telnet www.example.com 6082).
Note that you should not simply kill the process to start or stop caching; use the management console to control caching for the particular host instead.

Varnish keeps its log in memory, so use the varnishlog or varnishncsa utility to write it to a regular file on disk. For details on these and the other utilities, see their man pages.

#4 VCL

#4.1 Description

VCL is an acronym for Varnish Configuration Language. In a VCL file, you configure how Varnish should behave. It is like Apache web server's httpd.conf and PHP's php.ini configuration files.

#4.2 Syntax

The VCL syntax is very simple, and deliberately similar to C and Perl. Blocks are delimited by curly braces, statements end with semicolons, and comments may be written as in C, C++ or Perl according to your own preferences.

In addition to the C-like assignment (=), comparison (==) and boolean (!, && and ||) operators, VCL supports regular expression and ACL matching using the ~ operator.

Unlike C and Perl, the backslash (\) character has no special meaning in strings in VCL, so it can be freely used in regular expressions without doubling.

Assignments are introduced with the set keyword. There are no user-defined variables; values can only be assigned to variables attached to backend, request or document objects. Most of these are typed, and the values assigned to them must have a compatible unit suffix.

VCL has if tests, but no loops.

The contents of another VCL file may be inserted at any point in the code by using the include keyword followed by the name of the other file as a quoted string.

#4.3 How to

#4.3.1 refresh (purge) document when it gets changed on server?

Refreshing a document is often called purging it. Varnish offers two ways to purge one or more documents:
  • From the management console, use the commands below to purge the desired documents. Regular expressions are allowed, so many documents can be purged with a few commands.
url.purge ^/$
url.purge .*html$
  • In VCL, write logic that purges a document when the request method is PURGE. Any document that needs purging is then requested with the PURGE method, which removes it from the cache. This is the most convenient and practical way to keep cached copies fresh, and because it can be automated the server administrator does not have to purge large numbers of documents by hand.
First, define the hosts from which purge requests will be accepted. This is a sensible precaution so that not everyone can purge the cache.

acl purge
{
  "myhost"; "192.0.2.1";
}

When request is received.

sub vcl_recv
{
  if (req.request == "PURGE")
  {
    if (!client.ip ~ purge)
    {
      error 405 "Not allowed.";
    }
    lookup;
  }
}

When the cache is hit (i.e. the document is to be served from the cache).

sub vcl_hit
{
  if(req.request == "PURGE")
  {
    set obj.ttl = 0s;
    error 200 "Purged.";
  }
}

When the cache is missed (i.e. the document would be served directly from the backend server).

sub vcl_miss
{
  if(req.request == "PURGE")
  {
    error 404 "Not in cache.";
  }
}
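
With handlers like these in place, a document can be purged by sending a PURGE request for its URL from a host listed in the purge ACL. The following is a minimal Python sketch using the standard http.client module; the purge helper name and the 5 second timeout are assumptions made for this example.

```python
import http.client
from typing import Tuple


def purge(host: str, port: int, path: str) -> Tuple[int, str]:
    """Send an HTTP PURGE request for path and return (status, reason)."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("PURGE", path)
        response = conn.getresponse()
        response.read()  # drain the body before closing the connection
        return response.status, response.reason
    finally:
        conn.close()
```

A call such as purge("www.example.com", 80, "/index.html") would be expected to return status 200 when the document was in the cache and 404 when it was not, matching the handlers above. Since the Host header is part of the object hash, the request should use the same host name that clients use.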

#4.3.2 cache documents even when cookies are present?

When request is received.

sub vcl_recv
{
  if (req.request == "GET" && req.http.cookie)
  {
    lookup;
  }
}

Fetch document from backend server.

sub vcl_fetch
{
  if (resp.http.Set-Cookie)
  {
    insert;
  }
}

#4.3.3 support multiple sites running on separate backends in the same Varnish instance?

Define all backend WWW servers which are to be used for caching.

backend www
{
  set backend.host = "www.example.com";
  set backend.port = "8080";
}

Define all backend Image servers which are to be used for caching.

backend images
{
  set backend.host = "images.example.com";
  set backend.port = "8080";
}

When request is received.

sub vcl_recv
{
  if (req.http.host ~ "^(www.)?example.com$")
  {
    set req.backend = www;
  }
  elsif (req.http.host ~ "^images.example.com")
  {
    set req.backend = images;
  }
  else
  {
    error 404 "Unknown virtual host";
  }
}
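
The host matching can be tried outside Varnish before it goes into VCL. This Python sketch is only intended to mirror the host checks above; the ROUTES table and pick_backend function are hypothetical names. It escapes the dots so that they match only a literal dot, and it anchors the images pattern at the end, which is slightly stricter than the VCL.

```python
import re
from typing import Optional

# Hypothetical routing table: (compiled Host pattern, backend name).
ROUTES = [
    (re.compile(r"^(www\.)?example\.com$"), "www"),
    (re.compile(r"^images\.example\.com$"), "images"),
]


def pick_backend(host: str) -> Optional[str]:
    """Return the backend name for a Host header value, or None if unknown."""
    for pattern, backend in ROUTES:
        if pattern.search(host):
            return backend
    return None  # corresponds to the 404 "Unknown virtual host" branch
```

An unescaped dot matches any character, so a pattern written without the backslashes would also accept a host such as wwwXexample.com.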

#4.3.4 force a minimum TTL for all documents?

Fetch document from backend server.

sub vcl_fetch
{
  if (obj.ttl < 120s)
  {
    set obj.ttl = 120s;
  }
}

#5 Performance

While Varnish is designed to reduce contention between threads to a minimum, its performance will only be as good as that of the system's pthreads implementation. Additionally, a poor malloc implementation may add unnecessary contention and thereby limit performance. On FreeBSD (using libthr) and Linux (using native threads), it is believed that performance is limited only by hardware.

When the requested document is in the cache, response time is typically measured in microseconds. This is significantly better than most HTTP servers, so even sites consisting mostly of static content will benefit from Varnish.

#6 Limitations
  1. Current versions of Varnish do not understand the HTTP Vary: header, which can lead to problems with sites that support content negotiation.
  2. The HTTP Host: header is always included in the object hash, so sites that can be reached under several different names will have multiple copies of the same content cached.
  3. Varnish's default policy does not cache documents that carry cookies or sessions, so websites that depend heavily on cookies and sessions cannot use Varnish out of the box for dynamic documents. To solve this, tweak the VCL as shown in section 4.3.2.
  4. Varnish's internal caching mechanism does not obey even the minimum requisite client-side HTTP caching pragmas. It also fails to obey other established caching headers, and end users cannot add support for them through configuration, because cache behaviour can be controlled only from client headers, not from the web server's HTTP response headers. As a result, something like preventing the caching of files that lack an ETag response header is very hard to implement.
  5. Varnish refuses to start if /tmp is mounted noexec, because it compiles a shared library and loads it from /tmp. Such problems are very hard to diagnose, because neither the startup script nor the log files give any indication.
  6. Proper documentation for Varnish and VCL is lacking. There is some documentation in the man pages, but it is available only once Varnish is installed on your machine.
Most of these limitations have been or are being addressed in the development version.

#7 Conclusion

Web accelerators (here, caching software) are not install-and-forget software. They require constant monitoring and inspection of their behaviour and effectiveness. Software like Varnish has limitations, listed in section 6, which must be kept in mind before using it. Your project must also take care of a few other things to use caching effectively:
  1. Documents are cached for the GET and HEAD methods only, so your project should serve as many documents as possible through these two methods.
  2. The URL structure should be caching friendly.
  3. Session IDs should not be appended to the URLs of dynamic documents. A session ID differs every time it is generated, so the same document would be stored in the cache once per session ID, which makes caching far less effective.
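
The effect of session IDs in URLs can be seen by comparing URLs with the session parameter removed. This Python sketch is an illustration only; the parameter names in SESSION_PARAMS and the cache_key_url helper are assumptions, and the better fix is still not to put session IDs into URLs at all.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed session parameter names; adjust to whatever the application uses.
SESSION_PARAMS = {"PHPSESSID", "sid"}


def cache_key_url(url: str) -> str:
    """Return url with session parameters removed from the query string."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Two requests for list.php?page=2 that differ only in PHPSESSID reduce to the same URL here, whereas a cache keyed on the full URL would store one copy per session.
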
#8 Links