Octavian Andrei Dragoi
Assignment 1
Abstract:
This report presents the conceptual (abstract) architecture of the Apache web server. It emphasizes the overall structure of the system without going into implementation details or requiring the reader to know such details beforehand. The main purpose is to make the architecture "intellectually tractable" ([Monroe97]).
The conceptual architecture has been inferred from a number of Apache-related documents and from the way source files are grouped and named.
At a high level, the Apache server architecture is composed of a core that implements the most basic functionality of a web server and a set of standard modules that actually service the phases of handling an HTTP request.
The server core accepts an HTTP request and implicitly invokes the appropriate handlers, sequentially and in the appropriate order, to service the request.
The report shows that the architectural style (in the sense of [Garlan94]) that most closely characterizes the Apache architecture is "implicit invocation", although the notion of an event does not exist in the architecture.
The architecture offers great opportunities for extending or changing the Apache functionality by adding or replacing modules.
Keywords:
Apache, conceptual architecture, abstract architecture, web server
Available online at:
http://www.grad.math.uwaterloo.ca/~oadragoi/CS746G/a1/apache_conceptual_arch.html
The report assumes no previous familiarity with the architecture of the Apache web server, so it can serve as introductory reading on the architecture of the server.
It should be noted that the architecture described here might not be entirely accurate. It has been inferred from several sources, including the overall structure of the files and the file names. It does not start from a previously existing complete design document.
Perhaps this is the place to mention that Apache is written to be drop-in compatible with the NCSA server. This has design consequences: some configuration commands promoted by the NCSA server cannot be implemented naturally in Apache, and they are supported in a way that somewhat departs from the general "philosophy" of the system ([Thau96]). More details are given in the configuration section.
Additional concerns, such as access control and client authorization, are also the responsibility of the web server. As noted, the web server might execute programs in response to client requests. It must ensure that this is not a threat to the host system (where the web server runs). In addition, the web server must be able not only to sustain a high rate of requests, but also to satisfy each request as quickly as possible.
The following are the components of the core:
http_protocol.c: contains routines that communicate directly with the client (through the socket connection), following the HTTP protocol. All data transfers to the client are done using this component.
http_main.c: the component that starts up the server and contains the main server loop that waits for and accepts connections. It is also in charge of managing timeouts.
http_request.c: the component that handles the flow of request processing, dispatching control to the modules in the appropriate order. It is also in charge of error handling.
http_core.c: the component implementing the most basic functionality, described in a source file comment as being "just 'barely' functional enough to serve documents, though not terribly well". Another quote from a source file comment illustrates the function of this component very well: "this file could almost be mod_core.c". That is, the component behaves like a module but has to access some globals directly (which is not characteristic of a module).
alloc.c: the component that takes care of allocating resource pools.
http_config.c: the component that reads and manages the configuration files, as well as support for virtual hosts. An important function of http_config is to form the list of modules that will be called to service the different phases of a request.
It is interesting to observe that although the components of the core have rather distinct functionality, there is no simple way to depict the interactions between them; most of the architectural information lies in the names of the modules rather than in the connectors between them.
This is due to the considerable effort made by the designers to move everything that can be expressed as a separate entity into the modules part of the Apache server. What is left in the core are components too interconnected to be written as separate modules.
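The following are the phases of handling a request for the Apache server: URI to filename translation, host-based access control, authentication, authorization, MIME type determination, fix-ups, sending the actual object back to the client, and logging the request.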
Handlers are defined by modules, and a module may specify handlers for one, many or none of the phases of a request. A handler is the part of a module that is called when the processing of the request enters the phase for which the handler is defined.
The rationale behind letting a module define handlers for more than one phase is that the module can internally save data about the request being processed, and its handlers for a subsequent phase can then make use of that data. In theory the module might even save data between different requests (e.g. it might cache some file content for future use).
It should be noted that modules export additional functions related to configuration and initialization. They are called in the startup phase of the server.
mod_userdir: translates user home directories into actual paths
mod_rewrite: rewrites URLs based on regular expressions (Apache 1.2 and up); it has additional handlers for fix-ups and for determining the MIME type
mod_auth, mod_auth_anon, mod_auth_db, mod_auth_dbm: user authentication using text files, FTP-style anonymous access, Berkeley DB files, and DBM files, respectively
mod_access: host-based access control
mod_mime: determines document types using file extensions
mod_mime_magic: determines document types using "magic numbers" (e.g. all GIF files start with a certain code)
mod_alias: replaces aliases by the actual path
mod_env: fixes up the environment (based on information in configuration files)
mod_speling: automatically corrects minor typos in URLs
mod_actions: file type/method-based script execution
mod_asis: sends the file as it is
mod_autoindex: sends an automatically generated representation of a directory listing
mod_cgi: invokes CGI scripts and returns the result
mod_include: handles server-side includes (documents parsed by the server, which includes certain additional data before handing the document to the client)
mod_dir: basic directory handling
mod_imap: handles image-map files
mod_log_*: various types of logging modules
For some phases only one module (one handler in a module) can be called. Such phases are authorization, authentication, returning the actual object to the client, and sometimes the URI to filename translation.
Other phases of servicing a request can have more than one handler called. For example, more than one module can be called to implement the logging part of the request.
In some phases of processing a request, the handlers (in the registered modules) are called in turn until one returns a special code meaning that the subsequent registered handlers for the current phase should not be called. An example is the URI to filename translation phase.
Furthermore, a handler might return an error code. In that case the processing of the request stops and an error is returned to the client (i.e. no other handlers are called, from this phase or subsequent phases).
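The calling rules described above can be sketched as a small dispatcher. This is a minimal sketch under stated assumptions: the return codes `OK` and `DECLINED`, the `RequestError` exception and the choice of which phases run all their handlers (`RUN_ALL`) are illustrative, and are not taken from the Apache sources.

```python
# Conceptual sketch: return codes and phase groupings are illustrative assumptions.
OK, DECLINED = "ok", "declined"


class RequestError(Exception):
    """Raised when a handler reports an error; processing stops at once."""


RUN_ALL = {"fixup", "log"}  # phases in which every registered handler is called


def run_phase(phase, handlers, request):
    """Call the handlers registered for one phase, in registration order."""
    called = []
    for name, handler in handlers:
        result = handler(request)
        called.append(name)
        if result not in (OK, DECLINED):
            # Any other value is treated as an error code.
            raise RequestError(f"{name} failed in {phase}: {result}")
        if result == OK and phase not in RUN_ALL:
            break  # the first handler that takes the request ends the phase
    return called


def process(request, phases):
    """phases: ordered list of (phase_name, [(module_name, handler), ...])."""
    trace = []
    for phase, handlers in phases:
        trace.append((phase, run_phase(phase, handlers, request)))
    return trace
```

A handler that returns `DECLINED` lets the next registered handler try. An error raised in one phase propagates out of `process`, so no handler of a later phase is called.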
As a consequence, Apache uses a different technique, namely persistent server processes. It forks a fixed number of children right from the beginning. The children service incoming requests independently (in different address spaces). Concurrency in the Apache server is pictured in Figure 5.
Alternatively, when Apache is compiled for MS Windows (as opposed to UNIX), a fixed number of threads is started from the beginning to service the incoming requests (probably due to specific characteristics of this operating system).
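The persistent-process scheme described above for UNIX can be sketched as follows. This is a minimal sketch, assuming a UNIX-like system where `os.fork` is available; the function names `prefork_server`, `ask` and `shutdown` are hypothetical, and the "response" is just the process id of the child that served the connection, so that the independence of the children is visible.

```python
import os
import signal
import socket


def prefork_server(num_children):
    """Fork `num_children` persistent workers that all accept on one socket."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(16)
    children = []
    for _ in range(num_children):
        pid = os.fork()
        if pid == 0:  # child: serve connections until it is killed
            try:
                signal.signal(signal.SIGTERM, signal.SIG_DFL)
                while True:
                    conn, _addr = listener.accept()
                    with conn:
                        conn.recv(1024)  # read (and ignore) the request
                        conn.sendall(str(os.getpid()).encode())
            finally:
                os._exit(0)  # a child must never return into the parent's code
        children.append(pid)
    return listener, children


def ask(port):
    """Send one request and return the pid of the child that answered."""
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall(b"GET / HTTP/1.0\r\n\r\n")
        return int(client.recv(64))


def shutdown(listener, children):
    """Terminate all children and close the listening socket."""
    for pid in children:
        os.kill(pid, signal.SIGTERM)
        os.waitpid(pid, 0)
    listener.close()
```

All children block in `accept` on the same listening socket, and the operating system hands each incoming connection to one of them. No process is created per request.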
From another point of view, one might ask whether a module is a separate process or can be implemented as a separate process. In Apache a module is not a separate process. However, some modules might fork new children in order to do their job. A ready example is the mod_cgi module, which handles CGI scripts. It must fork a new child to execute the actual CGI script (after proper redirection of the standard input and output for the child process), and wait for it to finish. But this is a characteristic of mod_cgi; many other modules need not fork children.
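The fork, redirect and wait pattern described for CGI scripts can be illustrated with Python's `subprocess` module. The sketch below is not how mod_cgi is written; `run_cgi` is a hypothetical helper, and the script in the usage lines is a made-up one-liner standing in for a real CGI program.

```python
import os
import subprocess
import sys


def run_cgi(script_argv, request_body, env):
    """Run a script as a child process and return what it wrote to stdout."""
    completed = subprocess.run(
        script_argv,
        input=request_body,      # becomes the child's standard input
        stdout=subprocess.PIPE,  # the child's standard output is captured
        env=env,                 # request details are passed in the environment
        check=True,              # raise if the script exits with an error
    )                            # run() returns only after the child has finished
    return completed.stdout


# Usage with a stand-in script that echoes the request method and body.
script = [sys.executable, "-c",
          "import os, sys; "
          "sys.stdout.write(os.environ['REQUEST_METHOD'] + ':' + sys.stdin.read().upper())"]
output = run_cgi(script, b"hello", dict(os.environ, REQUEST_METHOD="POST"))
```

The caller blocks until the child exits, which matches the "wait for it to finish" behaviour described in the text.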
A different kind of module is one that, although it is not a separate process and does not fork children, communicates through IPC mechanisms or sockets with a different process (which might, for instance, be located on a different machine). An example of such a module would be an authorization module that communicates with a server managing user and password information. Even the CGI module might be implemented in this way (i.e. with the actual script running as a completely different process, not a child), which would result in improved security but would incur communication overhead as a penalty.
An interesting concept implemented by Apache is that of Virtual Hosts. The server can respond to more than one name (e.g. www.example and www2.example), each assigned to one of the multiple IP addresses of the machine. The multiple IP addresses can be associated with physical network interfaces or with virtual network interfaces (simulated via logical devices by the operating system). Apache is able to "tell" under which name the host has been referenced and to use different configuration options accordingly (e.g. allowing more access rights to users reaching the host through an interface on the local network than to users reaching the web server through an interface on the outside-the-company network). Modules also have access to this information.
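As an illustration of address-based virtual hosts, the sketch below selects a configuration from the local IP address on which a connection arrived. The addresses, the option `allow_admin_pages` and the function `config_for` are made-up examples; only the host names come from the text above.

```python
# Conceptual sketch: addresses and options below are made-up examples.
DEFAULT_CONFIG = {"server_name": "www.example", "allow_admin_pages": False}

VIRTUAL_HOSTS = {
    # address of the outside-facing interface
    "192.0.2.10": {"server_name": "www.example"},
    # address of the interface on the local network: more rights granted
    "10.0.0.5": {"server_name": "www2.example", "allow_admin_pages": True},
}


def config_for(local_address):
    """Return the configuration for the address the connection arrived on."""
    config = dict(DEFAULT_CONFIG)  # start from the server-wide defaults
    config.update(VIRTUAL_HOSTS.get(local_address, {}))
    return config
```

A request arriving on an address with no entry of its own falls back to the default configuration.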
To summarize, the Apache "philosophy" related to configuration is that each component takes care of its own configuration and configuration commands. The server core parses the configuration files and dispatches each configuration command to the appropriate module to be interpreted (executed), or interprets (executes) the command itself if the command was meant for it (i.e. it is a configuration command for the core, not for a module).
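The dispatching of configuration commands can be sketched as a table that maps each command name to the component that declared it. This is a simplified sketch: the one-command-per-line syntax, the names `build_command_table` and `apply_config`, and the commands used in the usage lines are assumptions made for illustration.

```python
# Conceptual sketch: the config syntax and all names here are illustrative only.
class ConfigError(Exception):
    pass


def build_command_table(components):
    """components: {component_name: {command_name: callable(args)}}"""
    table = {}
    for component, commands in components.items():
        for command, handler in commands.items():
            table[command.lower()] = (component, handler)
    return table


def apply_config(text, table):
    """Parse config text line by line and hand each command to its owner."""
    dispatched = []
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        command, *args = line.split()
        try:
            component, handler = table[command.lower()]
        except KeyError:
            raise ConfigError(f"line {number}: unknown command {command!r}") from None
        handler(args)  # the owning component interprets its own command
        dispatched.append((component, command))
    return dispatched


# Usage: the core and one hypothetical module each declare one command.
settings = {}
components = {
    "core": {"Port": lambda a: settings.__setitem__("port", int(a[0]))},
    "mod_example": {"ExampleGreeting": lambda a: settings.__setitem__("greeting", " ".join(a))},
}
table = build_command_table(components)
log = apply_config("# sample\nPort 8080\nExampleGreeting hello world\n", table)
```

The parser knows nothing about what a command means. It only finds the owner and passes the arguments on, which is what lets a new module bring its own commands.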
To "fix" this, the problematic commands of the NCSA server (e.g. Options) are interpreted by the Apache core, even when they affect modules. The core makes this configuration available to modules in the same way it makes the general configuration information available.
Another key structure is the one the Apache core uses to keep track of the various modules. It is a linked list of module records, each holding all the information related to that module (e.g. handlers, per-module configuration data). The module record is the means by which the core calls the module.
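A linked list of module records of the kind just described might look as follows. The field names and the helpers `add_module` and `handlers_for` are hypothetical; the sketch only illustrates the idea that the core reaches a module's handlers by walking the list.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class ModuleRecord:
    """Everything the core knows about one module (illustrative fields)."""
    name: str
    handlers: Dict[str, Callable] = field(default_factory=dict)  # phase -> handler
    config: dict = field(default_factory=dict)                   # per-module data
    next: Optional["ModuleRecord"] = None                        # next record in the list


def add_module(head, record):
    """Prepend a record to the list and return the new head."""
    record.next = head
    return record


def handlers_for(head, phase):
    """Walk the list and collect (module name, handler) pairs for a phase."""
    found = []
    node = head
    while node is not None:
        if phase in node.handlers:
            found.append((node.name, node.handlers[phase]))
        node = node.next
    return found
```

Because records are prepended in this sketch, the module added last is consulted first. That ordering is a choice made here for simplicity, not a statement about Apache.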
What is characteristic of the resource pool is that all its resources are freed at once, when the pool itself is freed, which prevents resource leakage. This is particularly important because of the use of persistent processes.
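The "free everything at once" behaviour of a resource pool can be sketched with a small class that records a cleanup action for each resource. The class `ResourcePool` and its methods are hypothetical names; the in-memory files in the usage lines stand in for whatever a request might allocate.

```python
import io


class ResourcePool:
    """Collects cleanup actions and runs them all when the pool is destroyed."""

    def __init__(self):
        self._cleanups = []

    def register(self, resource, cleanup):
        """Track `resource`; `cleanup(resource)` runs when the pool is destroyed."""
        self._cleanups.append((resource, cleanup))
        return resource

    def destroy(self):
        """Release everything registered, most recent first."""
        while self._cleanups:
            resource, cleanup = self._cleanups.pop()
            cleanup(resource)


# Usage: resources acquired while serving one request share one pool.
request_pool = ResourcePool()
first = request_pool.register(io.StringIO("first"), lambda f: f.close())
second = request_pool.register(io.StringIO("second"), lambda f: f.close())
request_pool.destroy()  # both files are closed here, in one step
```

Code that acquires a resource never has to remember to release it individually, so a long-lived process does not accumulate leaks from one request to the next.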
There is, however, something that might be compared with announcing an event, namely the issuing of a sub-request by a module in order to "force" the core to perform some of the steps of a request on the sub-request (i.e. calling handlers sequentially for each servicing phase). However, this is not (conceptually) a proper event, because the issuing module does not announce something to other modules unknown to it. It is just a means of "forcing" an implicit invocation.
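The sub-request mechanism can be illustrated with a toy core in which a handler asks for another URI to be processed and waits for the result. This is a simplified sketch: it uses only two phases, runs all of them on the sub-request, and every name in it (`run_phases`, `sub_request`, the `/docroot` prefix, the directory handler) is an assumption made for illustration.

```python
# Conceptual sketch: phase names and the request layout are assumptions.
def run_phases(request, phases):
    """The core's loop: call the handler of each phase in order."""
    for _phase, handler in phases:
        handler(request, phases)
    return request


def sub_request(uri, parent, phases):
    """Ask the core to process `uri`; the caller waits for the result."""
    child = {"uri": uri, "trace": parent["trace"]}  # share the trace for display
    return run_phases(child, phases)


def translate(request, phases):
    request["filename"] = "/docroot" + request["uri"]
    request["trace"].append(("translate", request["uri"]))


def respond(request, phases):
    # A hypothetical directory handler: fetch the index file via a sub-request.
    if request["uri"].endswith("/"):
        index = sub_request(request["uri"] + "index.html", request, phases)
        request["body"] = index["body"]
    else:
        request["body"] = "contents of " + request["filename"]
    request["trace"].append(("respond", request["uri"]))


phases = [("translate", translate), ("respond", respond)]
result = run_phases({"uri": "/docs/", "trace": []}, phases)
```

The trace shows that the handlers for the sub-request run to completion before the issuing handler continues, so the invocation is forced but remains synchronous.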
There are other characteristics of event systems (as summarized in [Shaw96]) that do not "fit" the description of the core-modules architecture of Apache. For example, there is no control asynchrony, in the sense that the module issuing the sub-request waits for the sub-request to be completed.
Also, two phases of a request cannot be handled in parallel (each one uses the outcome of the preceding one). Moreover, a module is not a separate process, although it can fork children for some phases, such as running a CGI script.
So although the connectors between modules are implicit invocations and the data flow is a tree, with some restrictions (e.g. some phases cannot have more than one module to handle them, and one phase comes after the other), the architecture does not have the other characteristics of event systems.
It can be argued, however, that there is asynchrony, since different instances of Apache (sub-processes) can handle requests from different HTTP clients at the same time. However, the different instances are independent and do not share information related to the requests being processed.
The way a request is serviced, with phases handled one after the other and the outcome of one phase used (most of the time) by the next, has some similarities with the general "pipeline" style (as in [Shaw96]). There is no upstream control (i.e. when the core invokes the handlers for one phase, no data or control flows upstream). However, again, there is no asynchrony and, more importantly, the core regains control after each phase (i.e. after the handler has been invoked and its job is done).
Furthermore, some phases do not change the conceptual data flow. For example, authorization and authentication do not change the request; they can only deny its execution. More significantly, some handlers might be implemented by the same module, and those handlers might exchange information via private data of the module, bypassing the main data flow. To conclude, the pipeline style is rather poorly reflected by the module structure, although conceptually the idea exists; therefore implicit invocation seems more appropriate to characterize the general conceptual architectural style.
Furthermore, the ability to load modules dynamically, present in the Apache 1.3 release (no static linking with the server code), makes the task of customizing the server even easier, as there is no need to recompile the entire server. It is only necessary to change some configuration files.
Another feature worth mentioning again here is the capability of modules to define their own configuration commands, which they are implicitly called to execute.
An important part of the Apache web server that cannot be changed simply by changing or adding a module is the one that implements the HTTP protocol. On the good side, the protocol is implemented as a separate piece of code (http_protocol.c), and all communication with the client is done through it, so only that part must be changed in order to implement a future version of HTTP. However, unlike the modules, it has no well-defined API.
The core is the one that accepts and manages HTTP connections and calls the handlers in modules in the appropriate order to service the current request.
The architectural style can be characterized as implicit invocation made by the server core on handlers implemented by the modules. Concurrency exists only between a number of persistent, identical processes that service incoming HTTP requests on the same port. Modules are not implemented as separate processes, although a module can fork children or cooperate with other independent processes to handle a phase of processing a request.
The functionality of Apache can easily be changed by writing new modules that complement or replace the existing ones. The server is also highly configurable at different levels (virtual host, directory, module), and modules can define their own configuration commands.