Apache module

DC-Apache Module

Contents:

Why Apache?
The DC-Apache Module
DC-Apache Module configuration directives
Document Migration and Replication
Load Balancing and Consistency
Experiments and Results

Why Apache?

Apache is the most popular web server on the Internet. A recent web server survey by Netcraft found that over 60% of the web sites on the Internet use Apache.

Apache is a high-performance, full-featured web server.
It is not only a web server but also a platform on which systems can be built quickly. Apache provides a server API (application programming interface) that can be used to extend the functionality of the web server itself by linking new modules directly into the server executable.
The Apache server's source code is freely available.

The DC-Apache Module:

The DC-Apache module is implemented using the API and the module interface provided by Apache. The module interface lets us insert processing handlers into the Apache server's request processing cycle, which is divided into several phases. The DC-Apache module uses those handlers to process incoming connection requests.

Figure 1: Functional diagram of the DCWS system.

Figure 1 shows the functional structure of the DC-Apache system. The Apache server's main process dispatches requests to several child processes. Each child process serves requests in a request loop that is divided into several phases, as described above. Using the handler mechanism, we implemented the DC-Apache system as an Apache module that processes requests inside this loop. The "pinger" process computes and collects load information about the participating servers. The shared memory holds the document graph and the statistics.

DC-Apache Module configuration directives:

The DC-Apache module reads its configuration from the Apache configuration file. It introduces four new directives: ExportPath, ImportPath, Backend, and DiskQuota.

ExportPath tells the DC-Apache module which directory is exported to the co-op servers. The files in this directory can be replicated or migrated. The DC-Apache module builds the local document graph from all documents in this directory and its sub-directories.
ImportPath tells the server where to store replicated documents. A co-op server stores in this directory the replicated documents it receives from the home server.
Backend gives the address of a co-op server. The home server uses this information to contact co-op servers and to replicate documents to them.
DiskQuota limits the disk space used on a co-op server. Its value is the maximum number of bytes that may be used to hold imported files. The default value is 10 MB.
SetHandler is also needed to specify that requests for documents under the ExportPath directory are processed by the DC-Apache module. It is a standard Apache run-time directive. The name of the DC-Apache module's handler is "DCA-handler".
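
As an illustration only, the directives above might be combined as in the following configuration fragment. The directive names and the handler name come from the description above; the paths, the host name, and the exact argument format of each DC-Apache directive are assumptions, since the argument syntax is not specified here.

```apache
# Hypothetical example: paths, host name, and argument formats are assumed.
ExportPath /usr/local/apache/htdocs/export
ImportPath /usr/local/apache/htdocs/import
Backend    coop1.example.com
# Quota in bytes; 10485760 is 10 MB counted as 10 * 1024 * 1024.
DiskQuota  10485760

# Standard Apache directive: route requests under the exported
# directory to the DC-Apache handler.
<Directory /usr/local/apache/htdocs/export>
    SetHandler DCA-handler
</Directory>
```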

Document Migration and Replication:

The document migration and replication mechanism is illustrated by the following figures:

[Figure: wpe4.jpg]

Server #1 is overloaded, so we want to move some documents to server #2.

[Figure: wpe5.jpg]

Document D is chosen for migration to server #2 because moving it is the most effective option: it balances the load quickly with minimal data transfer. Two hyperlinks must be modified. Clients then see the new URL and access the new server.

[Figure: wpe6.jpg]

Alternatively, document D can be replicated to server #2. Each hyperlink in B and E that points to D then has two possible targets, and we can choose one dynamically according to the current load.
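
The set of hyperlinks affected by moving or copying a document can be read off the document graph. The following is a minimal sketch, assuming the graph is kept as a mapping from each document to the documents it links to. Only the links from B and E to D come from the description above; the other edges, the `DocumentGraph` type, and the `referrers` helper are hypothetical.

```python
from typing import Dict, List

# Hypothetical document graph: each document maps to the documents it links to.
# Only B -> D and E -> D are taken from the text; the other edges are made up.
DocumentGraph = Dict[str, List[str]]


def referrers(graph: DocumentGraph, target: str) -> List[str]:
    """Return, in sorted order, the documents that contain a hyperlink to `target`."""
    return sorted(doc for doc, links in graph.items() if target in links)


graph: DocumentGraph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": ["D"],
}

# If D migrates or is replicated, the hyperlinks in these documents are affected.
print(referrers(graph, "D"))  # ['B', 'E']
```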

Load Balancing and Consistency:

In the DC-Apache module, we store the replication information in shared memory and let the home server generate hyperlinks dynamically when it serves a request. To keep request processing fast, we do not parse a document on every request. Instead, we parse each document once, when the document graph is built and before the server starts to serve requests. The hyperlink information is stored in the document graph: each hyperlink is associated with its start offset and its length in the document. When we send a response and a hyperlink needs to be generated dynamically (because the document it points to has copies on co-op servers), we use this position information to skip the hyperlink text in the file and send in its place a URI generated from the replication information in shared memory.
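
The offset-based substitution described above can be sketched as follows. This is a simplified illustration rather than the module's actual code (the module itself is an Apache module, while this sketch uses Python for brevity). The `Link` record, the `render` and `find_link` helpers, and the example URLs are all assumptions made for the example.

```python
from typing import Dict, List, NamedTuple


class Link(NamedTuple):
    offset: int  # byte offset of the hyperlink text in the stored document
    length: int  # length in bytes of the hyperlink text
    target: str  # name of the document the hyperlink points to


def render(body: bytes, links: List[Link], replicas: Dict[str, str]) -> bytes:
    """Copy `body`, replacing each hyperlink whose target has a replica URI."""
    out = []
    pos = 0
    for link in sorted(links, key=lambda item: item.offset):
        if link.target not in replicas:
            continue  # no copy elsewhere: the stored bytes are sent unchanged
        out.append(body[pos:link.offset])           # text before the hyperlink
        out.append(replicas[link.target].encode())  # generated URI
        pos = link.offset + link.length             # skip the stored hyperlink
    out.append(body[pos:])                          # remainder of the document
    return b"".join(out)


def find_link(body: bytes, href: bytes, target: str) -> Link:
    """Stand-in for the one-time parse: record where `href` sits in `body`."""
    return Link(body.index(href), len(href), target)


body = b'<a href="/docs/D.html">D</a> and <a href="/docs/C.html">C</a>'
links = [
    find_link(body, b"/docs/D.html", "D"),
    find_link(body, b"/docs/C.html", "C"),
]
# Hypothetical replication table: only D has a copy on a co-op server.
replicas = {"D": "http://server2.example.com/import/D.html"}

print(render(body, links, replicas).decode())
```

Because the positions are recorded once, serving a request only involves copying byte ranges and inserting generated URIs; the document is never parsed again.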

This also gives us an opportunity for dynamic load balancing: we can generate hyperlinks according to the current load information and avoid generating hyperlinks that point to heavily loaded servers.
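
One simple way to act on the load information is to point each generated hyperlink at the least-loaded server that holds a copy. The sketch below shows only this one possible policy, not necessarily the one DC-Apache uses. The `pick_server` helper, the server names, and the load values are hypothetical.

```python
from typing import Dict, List


def pick_server(holders: List[str], load: Dict[str, float]) -> str:
    """Return the holder with the lowest reported load.

    `holders` must be non-empty. A server with no load report is treated
    as infinitely loaded, so it is chosen only if no other holder exists.
    """
    return min(holders, key=lambda server: load.get(server, float("inf")))


# Hypothetical load figures, as the pinger process might report them.
load = {"home": 0.92, "coop1": 0.35, "coop2": 0.60}

print(pick_server(["home", "coop1", "coop2"], load))  # coop1
```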

Experiments and Results

Experimental Environments

We tested the Apache server with the DC-Apache module on a cluster of 64 Intel Pentium workstations. Each workstation has a 200 MHz CPU and 128 MB of memory. The operating system is Red Hat Linux release 5.1 (Manhattan) with kernel 2.0.34. The workstations are connected by a Catalyst 5500 switch, which provides a 100 Mbps switched Ethernet network. One workstation is used as the control machine. The Apache servers run on up to 16 workstations, so the number of servers ranges from 1 to 16. A further 32 workstations run the test clients.

Results

We ran the experiments with 16 to 256 clients, in increments of 16. The data was gathered on the client side.
The results for the Sequoia data set (without hot spots) are shown below:

[Figure: wpe8.jpg]

The results for the Mapug data set (with hot spots) are shown below:

[Figure: wpe9.jpg]