Published on: Sunday 11th July 1999 By: Pankaj Kamthan
Since its inception as an implementation in the NCSA Web server, CGI has become a powerful and widely used standard for interfacing external applications with Web servers.
There are various advantages of CGI:
The CGI approach, however, also has both implicit and explicit limitations. In this article, we discuss these drawbacks, and suggest solutions to minimize, and even eliminate them in certain cases.
Although any programming language can be used to write CGI scripts, Perl has established itself as the lingua franca for CGI programming. When referring to a programming language, we will refer to Perl exclusively. Similarly, even though CGI is independent of any particular server, we will incline our discussion towards Apache, which, as the Netcraft surveys show, has established itself as the most popular among the Web servers in use today.
The leading CGI problem is performance, and is reflected in various forms.
In the case of a compiled language such as C/C++ the overhead is negligible, but portability is sacrificed in many instances. Writing a CGI script in Perl gives it a semblance of platform independence, but it also requires that each request start a separate Perl interpreter, which takes more time and requires extra resources.
The CGI overhead is primarily a consequence of HTTP being a stateless protocol. CGI scripts exit as soon as they have finished processing the current request. If the CGI script does some time-consuming task at start-up (such as connecting to a database), the overhead of reestablishing the state each time it is needed is considerable.
When a server responds to a request that accesses a CGI program, it must create a new process to run the CGI program, and then pass to it, via environment variables and standard input, the information that might be necessary to generate a response. Creating a process for every such request requires time and significant server resources, which limits the number of requests a server can handle concurrently.
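As a rough sketch of the hand-off described above, the following Perl routine collects what a CGI program typically receives: request metadata from environment variables and, for a POST, the request body from standard input. The sub name read_request and its arguments are illustrative only and not part of any library; REQUEST_METHOD, QUERY_STRING and CONTENT_LENGTH are the standard CGI variable names.

```perl
use strict;
use warnings;

# Hypothetical helper: gather the request data the server hands to a CGI program.
# $env is a hash reference (normally \%ENV); $in is a filehandle (normally \*STDIN).
sub read_request {
    my ($env, $in) = @_;
    my %req = (
        method => $env->{REQUEST_METHOD} || 'GET',
        query  => defined $env->{QUERY_STRING} ? $env->{QUERY_STRING} : '',
        body   => '',
    );
    # For a POST, the server states how many bytes to read from standard input.
    if ($req{method} eq 'POST') {
        my $len = $env->{CONTENT_LENGTH} || 0;
        read($in, $req{body}, $len) if $len > 0;
    }
    return \%req;
}
```

In a real script the call would be read_request(\%ENV, \*STDIN). All of this work, plus starting the interpreter, is repeated for every request.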
Thus, CGI scripts do not scale well. With the invocation of each CGI process, the load on the server increases and performance degrades. CGI is often used as a front end to other applications, such as a database, which may take considerable time to initialize. This can also represent a major overhead.
CGI has limited functionality: it only plays the simple role of a "responder", where the application generates the response that is returned to the client. Once it begins execution, a CGI program runs in a separate process. Therefore, it cannot interact with the Web server (to link into other stages of Web server request processing, such as authorization and logging) or take advantage of the server's capabilities. (For example, a CGI script cannot write to the server log files.) This can have an indirect impact on performance as well.
In spite of the shortcomings mentioned in the above section, all is not lost. In the next sections, we will discuss the remedies and point out various techniques that can improve CGI performance in various cases.
The first step towards CGI optimization is obviously to follow the guidelines outlined for efficiency in the language of choice itself (which in our case is Perl). For definitive guidelines on time and space efficiency (among others) in Perl, see Chapter 8 of the book Programming Perl, by Larry Wall, Tom Christiansen & Randal L. Schwartz, O'Reilly & Associates, 1996.
For frequently accessed CGIs, using a Perl compiler (distributed with Perl 5.005) that generates C code from Perl scripts can significantly improve efficiency, as there is no overhead of starting up the Perl interpreter. Perl performance can also be significantly improved (in certain cases, even beyond that of compiled C code) by using the mod_perl module in Apache.
CGI scripts, such as the ones using system() (which also reduces portability) or `backtick` notation are inefficient by their very nature, and very resource-intensive. There are ways to reduce or eliminate all these overheads, but these tend to be operating system- or server-specific (for which the best support seems to be in Apache).
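To illustrate the point about system() and backticks, here is a minimal sketch of doing two common jobs inside the Perl process instead of spawning an external command. The sub names list_dir and timestamp are made up for this example; opendir, readdir and localtime are Perl built-ins.

```perl
use strict;
use warnings;

# Instead of running `ls $dir` in backticks: read the directory in-process.
sub list_dir {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "cannot open $dir: $!";
    my @names = sort grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;
    return @names;
}

# Instead of running `date` in backticks: localtime needs no extra process.
sub timestamp {
    my ($epoch) = @_;
    return scalar localtime(defined $epoch ? $epoch : time);
}
```

Neither routine forks a child process, and both behave the same on any platform where Perl runs.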
CGI Tester lets you test and measure the performance of scripts that run under Cold Fusion 3.1. The tool provides both regression and load testing and can keep track of an unlimited number of tests and testing sessions, including statistics on test failures and server speed.
I/O buffering has its advantages but for time-intensive computations, (for example, searching a large database or creating images on-the-fly), it can be a bottleneck. You could follow these steps to adjust I/O buffering:
The server is given only the IP address of the browser making the request. Reverse DNS lookups let the server pass the fully qualified domain name to CGIs. The problem with DNS is that it uses blocking system calls, which hang the (parent) server process until a call is completed. Even for a single user these calls can take a significant amount of time, so performance suffers when many users are being served.
Server-wide reverse DNS lookups are rarely necessary: a CGI that needs the host name can do the lookup itself, starting from the client address that the server passes in an environment variable. If possible, avoid runtime reverse DNS lookups and use static IP addresses.
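A script-side lookup might look like the sketch below. It assumes the environment variable in question is REMOTE_ADDR, the standard CGI variable holding the client's IP address; the sub name remote_host and the optional $lookup argument are hypothetical and exist only to keep the example testable without a network.

```perl
use strict;
use warnings;
use Socket qw(inet_aton AF_INET);

# Hypothetical helper: resolve the client's host name only when a script needs it.
# $lookup is optional; by default the system resolver is used.
sub remote_host {
    my ($env, $lookup) = @_;
    my $ip = $env->{REMOTE_ADDR};      # assumed variable name, see lead-in
    return unless defined $ip;
    $lookup ||= sub {
        my $packed = inet_aton($_[0]) or return;
        return scalar gethostbyaddr($packed, AF_INET);
    };
    my $name = $lookup->($ip);
    # Fall back to the bare IP address if the lookup fails.
    return defined $name ? $name : $ip;
}
```

Only the scripts that actually need a host name pay for the lookup; every other request is served without waiting on DNS.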
In some servers, such as recent versions of Netscape Enterprise, DNS lookups are set off by default. To turn off reverse DNS lookups in Apache, you can do the following in httpd.conf:
and the following in the AddLog directive:
Most Web servers buffer the output from the CGI script before sending it onward to the browser. If the buffer size is large and the size of a page is small, then the script may have to send several pages before the first one is sent to the browser, resulting in choppy updating.
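On the script side, Perl keeps an output buffer of its own, which can be switched off so that each page leaves the script as soon as it is printed. The sketch below uses the autoflush method from the core IO::Handle module; the sub name emit_pages is invented for illustration. Note that this affects only Perl's buffer and does not change any buffering done by the server.

```perl
use strict;
use warnings;
use IO::Handle;

# Hypothetical helper: print each page and flush it out of Perl's buffer at once.
sub emit_pages {
    my ($out, @pages) = @_;
    $out->autoflush(1);          # same effect as setting $| = 1 for this handle
    print {$out} $_ for @pages;
    return scalar @pages;
}
```

A script would call emit_pages(\*STDOUT, @pages), or simply set $| = 1 near the top before printing anything.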
When a Content-type header is included in a CGI script, the server parses the output and completes the header information (by adding header information of its own) that it considers may be useful to the browser. However, CGI scripts can override the header information included by the server by generating a complete HTTP header on their own. CGI scripts which bypass the server and generate the HTTP header information on their own are known as non-parsed header (NPH) scripts.
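As a minimal sketch of what "generating a complete HTTP header" involves, the routine below builds the status line and headers that an NPH script has to send itself. The sub name nph_head and its named arguments are hypothetical, and the defaults shown are only reasonable guesses for a simple HTML response.

```perl
use strict;
use warnings;

# Hypothetical helper: build the complete response head an NPH script must send,
# because the server adds nothing of its own.
sub nph_head {
    my (%arg) = @_;
    my $protocol = $arg{protocol} || 'HTTP/1.0';
    my $status   = $arg{status}   || '200 OK';
    my $type     = $arg{type}     || 'text/html';
    my $head = "$protocol $status\r\n";
    $head .= "Server: $arg{software}\r\n" if defined $arg{software};
    $head .= "Content-Type: $type\r\n\r\n";
    return $head;
}
```

A script would typically set $| = 1 and then print nph_head(protocol => $ENV{SERVER_PROTOCOL}, software => $ENV{SERVER_SOFTWARE}) before its first page. In Apache, a script is treated as NPH when its file name begins with nph-.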
The advantages of NPH CGI scripts are:
The limitations of NPH CGI scripts are:
NPH scripts can be useful in instances that require "server-push." Examples are animation programs that need to induce "continuity" when presenting image frames to the user, and stock pricing programs which depend on constantly changing data.
For multiple CGIs, an improvement in scalability can be achieved by running them on different Web servers. If the CGIs do share data, then just the CGIs can be placed on different systems. If the CGI is being used as a front end to other applications, such as a database, then the backend program should be run on a separate server doing most of the work, while the actual CGI simply carries messages.
This solution also has certain limitations:
One way to avoid the latency of CGI scripts is to keep them running all the time as a co-process. Instead of having a CGI start in response to a query and then die, it can be useful to start up a persistent CGI-like process along with the Web server. When the Web server gets a request pointing at that process, it connects to the process, hands over the request, and waits for the response while still being able to handle other requests.
One obvious limitation of co-processing is the risk of memory leaks, since the process has to run all the time. The chances of this can be reduced with utilities that detect and locate the problematic areas in the script.
An alternative to the CGI protocol is FastCGI, a standard protocol proposed by Open Market, Inc. The idea behind FastCGI is co-processing. FastCGI is a simple communications protocol that works as follows: it uses a single TCP socket to connect the Web server and the FastCGI script (in contrast to the ordinary CGI method of using pipes and environment variables). This connection provides a CGI-like environment and other information (I/O streams, error-specific), which is set up at the beginning of each request. (The environment variables and stdin data are directed to the application, and stdout and stderr data are directed to the Web server.)
The Web server runs FastCGI scripts as separate processes like ordinary CGI scripts. However, once launched, these scripts don't immediately exit when they finish processing the initial request. Instead, they go into an infinite loop that waits for new incoming requests, processes them, and goes back to waiting.
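The loop structure described above can be sketched without any FastCGI library at all. In the simulation below, $accept stands in for the library call that blocks until the next request arrives; the sub name serve and both callbacks are hypothetical. With the CGI::Fast module, the equivalent loop is usually written as while (my $q = CGI::Fast->new) { ... }.

```perl
use strict;
use warnings;

# Simulated request loop: $accept returns the next request, or undef at shutdown;
# $respond receives the text that would be sent back to the Web server.
sub serve {
    my ($accept, $respond) = @_;
    # Expensive start-up work (for example, a database connection) happens
    # once here, before the loop, and is reused by every request.
    my %state = (started => time, served => 0);
    while (defined(my $request = $accept->())) {
        $state{served}++;
        $respond->("Request $state{served}: $request");
    }
    return $state{served};
}
```

Because %state outlives each individual request, anything initialized before the loop is paid for only once per process rather than once per request.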
The advantages of FastCGI are:
An implementation of FastCGI for Apache, the mod_fastcgi module, was included (though not compiled in by default) in distributions prior to version 1.2. It is no longer included because of the problem of keeping versions synchronized. Commercial implementations of FastCGI are available for Netscape servers and Microsoft IIS from Fast Engines, Inc. Fast.Serv is another commercial implementation of FastCGI and is currently available for all Netscape and Microsoft Web servers on Windows NT and all major UNIX platforms.
More information on FastCGI, including FastCGI server modules and application libraries, is available at the FastCGI Web site.
The limitations of FastCGI are:
If the number of possible inputs and state combinations is small, one can run the CGI for all possible inputs offline and cache each result in a static HTML document. The limitation to this approach is that it may not work if the browser does not cache documents. Also, there are cases, such as the outputs of some CGI scripts, that should not be cached. In such cases, the scripts need to specify the appropriate header (Pragma: no-cache in HTTP 1.0 or Cache-Control in HTTP 1.1), and as a result put load on the server.
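The offline step could look something like the following sketch, which runs a generator once per possible input and saves each result as a static file. The sub name precompute and the "input.html" naming scheme are assumptions made for this example, and the inputs are assumed to be safe to use as file names.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: run the generator once per possible input and save each
# result as a static HTML file that the server can then deliver directly.
sub precompute {
    my ($generate, $dir, @inputs) = @_;
    my @files;
    for my $input (@inputs) {
        # Assumes $input contains no path separators or other unsafe characters.
        my $path = File::Spec->catfile($dir, "$input.html");
        open(my $fh, '>', $path) or die "cannot write $path: $!";
        print {$fh} $generate->($input);
        close($fh) or die "cannot close $path: $!";
        push @files, $path;
    }
    return @files;
}
```

Once the files exist, requests for those inputs can be served as ordinary static documents with no CGI process involved.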
When CGI scripts retrieve and return an existing document (on any server), it is known as server-redirection. It can be done using the HTTP Location: response header pointing to the static HTML document. In Apache, you can also redirect an entire server or directory to a single URL using the Apache module mod_rewrite. (The CGI approach for redirection is preferred if any information is being POSTed to the redirected URL.)
Server-redirection can have various applications, such as returning a standard response page when a user submits a feedback form. When there are a large number of inputs but a small number of frequently requested documents, caching is possible via server-redirection.
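For the feedback-form case, the whole output of the redirecting script can be a single Location header followed by a blank line. The sketch below shows this; the sub name redirect_to and the /thanks.html path are placeholders chosen for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: the entire output of a redirecting CGI script.
# $url may be a full URL or a path on the same server.
sub redirect_to {
    my ($url) = @_;
    return "Location: $url\r\n\r\n";
}
```

After processing the form, the script would finish with print redirect_to('/thanks.html'); and produce no body of its own.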
One solution to the CGI performance problem is to embed the interpreter of a high-level language in the server itself. Embedded interpreters often come with CGI emulation layers, allowing scripts to be executed directly by the server without the overhead of invoking a separate process. An embedded interpreter also eliminates the need to make dramatic changes to the server software itself. In many cases (and in contrast to server proprietary APIs), an embedded interpreter provides a smooth path for speeding up CGI scripts because little or no source code modification is necessary.
One of the most important developments (and a natural choice from both the language and the server standpoint) in the embedded interpreter arena has been the provision of a Perl interpreter within the Apache Web server.
mod_perl is an Apache server module that embeds a copy of the Perl interpreter into the server executable. With mod_perl, Perl becomes the extension language for the Web server, providing complete access to Perl functionality within Apache. One can then write Perl snippets or CGI scripts, which do not require a new Perl interpreter process to be invoked (since Perl is now built into the server). Instead, a new thread executes a precompiled Perl program. Since the CGI scripts (in Perl) are precompiled by the server and executed without forking, they run more quickly and efficiently. (Usually, it is not the size of the script itself but the fork/exec overhead that slows a CGI down.)
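The benefit of precompilation can be illustrated with a toy cache that compiles a script's source into a subroutine on first use and reuses the compiled code afterwards. This is only an illustration of the compile-once idea, not a description of how mod_perl is implemented; the sub name run_cached and the %compiled hash are invented for the example.

```perl
use strict;
use warnings;

my %compiled;   # script name => code reference, kept for the life of the process

# Hypothetical helper: compile the script source on first use, then reuse it.
sub run_cached {
    my ($name, $source) = @_;
    unless ($compiled{$name}) {
        my $code = eval "sub { $source }";
        die "compile error in $name: $@" unless $code;
        $compiled{$name} = $code;
    }
    return $compiled{$name}->();
}
```

A side effect worth noticing is that anything the compiled code keeps in package-level variables also persists between calls, which is why scripts written for one-shot CGI execution sometimes need tidying before they run well in a persistent interpreter.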
Last but not least, it almost goes without saying that the program scripts should be kept as small as possible (but not smaller). This has various advantages, such as ease of testing, debugging and maintenance. These factors are directly or indirectly related to performance.
Optimization in code size also means using a context dependent approach and avoiding "overkill." For example, the overhead of using Perl 5 CGI libraries for small scripts can be considerable, and should be avoided.
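For a script that only needs a few query parameters, a small hand-written parser may be enough. The sketch below is one such minimal version, under stated assumptions: the sub name parse_query is hypothetical, the last value wins when a name repeats, and it makes no attempt to cover everything a full CGI library handles (file uploads, multi-valued fields, character sets).

```perl
use strict;
use warnings;

# Hypothetical helper: decode a URL-encoded query string into a hash reference.
sub parse_query {
    my ($qs) = @_;
    my %param;
    for my $pair (split /[&;]/, defined $qs ? $qs : '') {
        my ($key, $value) = split /=/, $pair, 2;
        next unless defined $key && length $key;
        $value = '' unless defined $value;
        for ($key, $value) {
            tr/+/ /;                               # '+' encodes a space
            s/%([0-9A-Fa-f]{2})/chr hex $1/ge;     # %XX encodes one byte
        }
        $param{$key} = $value;
    }
    return \%param;
}
```

A script would call parse_query($ENV{QUERY_STRING}). Whether the saving is worth giving up a well-tested library depends on how small the script really is.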
The moral (with apologies to the reader) is: "Keep It Small, Silly" (or say goodbye to performance).
CGI is afflicted with various limitations. However, some of these can be circumvented just by careful scripting, by using the strengths and knowing the weaknesses of the language of choice.
I would like to thank Patrick Killelea and Martin Webb for useful discussions and suggesting the example in the section Client-Side Processing. Once a CGI script has been "tailored" to use mod_perl, it ceases to be a "gateway" in its true sense of the meaning; thanks to Jason Turner for pointing this out.