Published on: Sunday 16th November 1997 By: Jason Nugent
As most of us are aware, using a form on a website is an effective way to gather information from a visitor. Information can be requested, mailing lists can be subscribed to, and comments and feedback can be submitted. For those of you who have not yet implemented a form on your website and are wondering about the HTML syntax for doing so, I will first delve into such details. If you are already familiar with forms, you may skip this section and move on to the next.
A form on a website is embodied inside the <FORM> tag, so we will look at this one first and in some detail. Like most HTML tags, the <FORM> tag takes a number of attributes related to the form itself. It possesses the following syntax:
<FORM attribute1=".." attribute2=".."> ... </FORM>
where the attributes can be one (or more) of the following:
Inside the <form> tag, you place all the elements that you want to use in the form itself. The first element is an <input> tag, which takes attributes that define how it appears on the page. It is an open-ended tag, which means that it does not have a corresponding </input> tag to close it. These <input> tags can be of type text, checkbox, image, password, radio, submit, and reset. There is also a <textarea> tag, which creates a large area for entering multi-line information. The <select> tag is used to create a drop-down list of items.
The simple form used in this example contains two text fields, one for a name, and the other for an email address. One button has been added, by which the form is submitted to the server.
<form name="form_name" onSubmit="return isReady(this)" action="">
<table cellpadding=0 cellspacing=5 border=0>
<tr>
<td align="left">Your Name:</td>
<td align="left"><input type="text" name="Name"></td>
</tr>
<tr>
<td align="left">Your Email Address:</td>
<td align="left"><input type="text" name="address"></td>
</tr>
<tr>
<td align="left" colspan=2><input type="submit" value="Submit"></td>
</tr>
</table>
</form>
As we have seen, coupling interactivity via forms and programs or scripts on a server through the Common Gateway Interface (CGI) is an effective way to obtain information from individuals visiting your website. However, there are risks associated with running a CGI script from the web. Poorly written scripts that accept malformed information from an unknowing or malicious user could be made to do things that could bring your server to its knees.
For example, imagine operating a website that contains a field that allows a user to enter the name of a directory on the server. Certainly not the smartest idea, but they are out there. If someone were to put the following in as the directory they wanted listed, bad things could happen:
web_directory ; /bin/rm *
Quite possibly, the command to list the directory would be carried out normally, and then the second command (/bin/rm *) would be carried out and erase every file in the script's working directory.
There are several ways to prevent this sort of thing from happening, and some are better than others, depending on the situation. First and foremost, the script itself could be written to verify that the form submitted to it does not contain any malicious code. Upon detecting such an attempt, the script could refuse to process the entry and store the submitter's IP address in a file for future reference. Or, more simply, the script could display an alternate page telling the visitor that their input was not accepted.
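The verification step itself can be quite small. As a minimal sketch, assuming the script only ever needs plain directory names made of letters, digits, underscores, and hyphens, the check might look like the following. It is written in JavaScript to match the rest of this article; isSafeDirectoryName is a hypothetical name, and the allowed character set is an assumption you would adapt to your own server.

```javascript
// Hypothetical helper: accept only names built from letters, digits,
// underscores, and hyphens. Anything else (spaces, semicolons, slashes,
// wildcards) is rejected outright rather than "cleaned up".
function isSafeDirectoryName(input) {
  return typeof input === "string" && /^[A-Za-z0-9_-]+$/.test(input);
}

var ok = isSafeDirectoryName("web_directory");              // true
var bad = isSafeDirectoryName("web_directory ; /bin/rm *"); // false
```

Listing what is allowed, instead of trying to list everything that is dangerous, means an unexpected character fails the check by default.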
Beware, however, that a poorly written script can still accept requests that do not come from the form. It is possible that a malicious user from a completely different domain could run your script directly and feed it bad information. Fortunately, there are several ways around this. One of the easiest is to make your CGI script examine the HTTP_REFERER and REMOTE_HOST environment variables that the server sets for every request. These variables contain the URL of the referring document and the host name of the machine making the request, respectively, and can be checked to ensure that the request was submitted from a page on an allowed domain (in particular, your own). If the request is not allowed, the remote host name could be logged in a file and the request refused. Bear in mind that HTTP_REFERER is supplied by the client and can be absent or forged, so this check complements validation of the input itself rather than replacing it.
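To make the idea concrete, here is a minimal sketch of the referer check, again in JavaScript for consistency with the rest of this article. The function name isAllowedReferer, the env object holding the CGI variables, and the host name www.example.com are all assumptions made for illustration.

```javascript
// Hypothetical helper: env is an object holding the CGI environment
// variables; allowedHost is the host name your own form is served from.
function isAllowedReferer(env, allowedHost) {
  var referer = env.HTTP_REFERER || "";
  // Capture the host part of an http:// or https:// URL.
  var m = referer.match(/^https?:\/\/([^\/:]+)/i);
  if (m === null) {
    return false; // missing or malformed referer
  }
  return m[1].toLowerCase() === allowedHost.toLowerCase();
}

var fromOwnForm = isAllowedReferer(
  { HTTP_REFERER: "http://www.example.com/form.html" }, "www.example.com"); // true
var fromElsewhere = isAllowedReferer(
  { HTTP_REFERER: "http://other.example.net/copy.html" }, "www.example.com"); // false
```

Comparing the whole host name, not just searching for it somewhere in the URL, keeps a look-alike such as www.example.com.other.example.net from slipping through.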
var a = "This is my field's contents";
var b = a.indexOf("my"); // b now contains 8.
As you can see, b now contains the position (counting from 0) at which the pattern "my" was found. If the pattern searched for is not found, the indexOf() method returns -1.
Note that these functions can be used independently of a form. These methods can be used anywhere, as long as an appropriate string value is passed as an argument.
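As a small illustration of using indexOf() away from any form, the sketch below wraps it in a helper that simply reports whether one string occurs inside another. The name contains and the sample strings are made up for this example.

```javascript
// Hypothetical helper: true when text contains the given piece.
// indexOf() returns -1 when the piece is not found.
function contains(text, piece) {
  return text.indexOf(piece) !== -1;
}

var hasAt = contains("jason@example.com", "@"); // true
var hasDot = contains("no dots here", ".");     // false
```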
First off, what is a regular expression? Put simply, a regular expression is a pattern, written as a string of ordinary and special characters, that programmers can use to describe exactly which text they want to match.
Before we get into using regular expressions to parse text, it is important that you understand a bit about how regular expressions work and what special characters do what. There is just too much to get into here, but here are a few that come up often:
. matches any single character.
? matches zero or one of the preceding character.
+ matches one or more of the preceding character.
* matches zero or more of the preceding character.
^ matches the absolute beginning of the string.
$ matches the absolute end of the string.
\w matches a "word" character (alphanumerics and the "_" character).
\w+ matches a whole word.
\W+ matches a run of non-word characters (use \s+ to match whitespace).
x|y matches one or the other of x or y.
[0-9] matches ONE digit, ranging from 0 to 9.
[A-Za-z] matches any letter, uppercase or lowercase.
Parentheses can be used to group characters together.
(this)+ matches at least one occurrence of "this".
If you wish to search for one of the special characters, you must first escape it with a backslash (\).
\. matches a period.
\? matches a question mark.
\[ matches a left square bracket.
\| matches a "pipe" character.
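A quick way to get a feel for these characters is to try them against short strings. The sketch below uses test(), a RegExp method not otherwise covered in this article, which returns true or false depending on whether the pattern matches. The sample strings are made up for illustration.

```javascript
var results = [
  /^ab?c$/.test("ac"),          // true:  ? allows zero b's
  /^ab+c$/.test("ac"),          // false: + needs at least one b
  /^ab*c$/.test("abbbc"),       // true:  * allows any number of b's
  /^(this)+$/.test("thisthis"), // true:  the group repeats as a unit
  /^a.c$/.test("a.c"),          // true:  . matches any character...
  /^a.c$/.test("abc"),          // true:  ...including the letter b
  /^a\.c$/.test("abc")          // false: \. matches only a period
];
```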
In addition to these, modifiers can be added after the regular expression to control how it searches through the string. Some of the more useful ones include these:
/somematch/g - global (matches all instances).
/somematch/i - ignore case.
/somematch/gi - you can combine them, too.
pattern = /:+/; // matches one or more colons
pattern = new RegExp(":+"); // same thing.
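To see the difference the modifiers make, here is a short sketch that runs the same pattern three ways using the match() method, which this article describes in more detail further on. The sample string is made up for illustration.

```javascript
var s = "Cat, cat, CAT";

var first = s.match(/cat/);    // no modifiers: first exact-case match only
var anyCase = s.match(/cat/i); // i: first match, ignoring case
var all = s.match(/cat/gi);    // gi: every match, ignoring case

// first[0] is "cat", anyCase[0] is "Cat",
// and all is ["Cat", "cat", "CAT"].
```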
There is one very important thing to notice here. With the first method, you must remember to delimit your expression using slashes. A slash marks the beginning and the end of a regular expression. You may also place the regular expression directly into a method call without first storing it in a variable or building it with the RegExp() constructor, which is what I do in the examples below.
The replace() method allows a programmer to replace a found match with another string. It takes two arguments: the regular expression you want searched for, and the replacement text you want substituted. It returns a new string and leaves the original untouched.
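As a brief illustration of replace(), the sketch below swaps colons for spaces in a made-up string, first without and then with the g modifier.

```javascript
var s = "Jason:Nugent:this:is:great";

var once = s.replace(/:/, " ");   // "Jason Nugent:this:is:great"
var every = s.replace(/:/g, " "); // "Jason Nugent this is great"
```

Without the g modifier only the first colon is replaced, which is an easy thing to overlook when cleaning up form input.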
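The search() method returns the position of the first match of a regular expression, or -1 if there is no match:
var s = "Let's use Regular Expressions";
var found = s.search(/use/); // found now contains 6.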
The split() method breaks a string into an array of pieces, cutting it wherever the regular expression matches:
var s = "Jason:Nugent:this:is:great:don't:you:think";
var a = s.split(/:/);
In this case, a becomes the array containing ["Jason", "Nugent", "this", "is", "great", "don't", "you", "think"]. In common CGI applications, this same technique is used to separate the fields of a comma-delimited text file that perhaps serves as a database containing user information.
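The comma-delimited case can be sketched the same way. The record layout (username, full name, email address) and the sample values are assumptions made for this example, and the approach only works when the fields themselves never contain commas.

```javascript
// Hypothetical record: username, full name, email address.
var line = "jnugent, Jason Nugent ,jason@example.com";

// Split on a comma together with any whitespace around it,
// so stray spaces do not end up inside the fields.
var fields = line.split(/\s*,\s*/);
// fields is ["jnugent", "Jason Nugent", "jason@example.com"]
```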
The match() method searches a string in a different way. When the regular expression carries the g modifier, it returns an array consisting of all the matches found in the string. If no matches are found, it returns null.
var s = "Thank you, there, for thinking about me.";
var a = s.match(/th\w+/gi); // "th" followed by word characters; g finds every match, i ignores case.
a is an array that now contains ["Thank", "there", "thinking"].
Now, finally, we get to do some useful things with regular expressions. The following function will parse a form consisting of a username and an email address, and alert the user if the username is not entirely made up of letters, numbers, or spaces. The function will also alert the user if the email address contains more than just alphanumerics, an "at" sign, periods, or hyphens.
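The original listing is not reproduced in this copy of the article, so what follows is only a sketch of how such functions might look. The function names isReady(), isProper(), and isMail(), the field names Name and address, and the two patterns all come from this article; the function bodies and the alert messages are assumptions.

```javascript
// True if s is one word, or two words separated by a single space.
function isProper(s) {
  return /^\w+( \w+)?$/.test(s);
}

// True if s looks like an email address under the article's pattern.
function isMail(s) {
  return /^\w+((-\w+)|(\.\w+))*\@[A-Za-z0-9]+((\.|-)[A-Za-z0-9]+)*\.[A-Za-z0-9]+$/.test(s);
}

// Called from the form's onSubmit handler.
// Returning false cancels the submission.
function isReady(form) {
  if (!isProper(form.Name.value)) {
    alert("Please enter a valid name."); // assumed wording
    form.Name.focus();
    return false;
  }
  if (!isMail(form.address.value)) {
    alert("Please enter a valid email address."); // assumed wording
    form.address.focus();
    return false;
  }
  return true;
}
```

Keeping the two pattern checks in their own small functions means they can be reused or tried out on their own, without a form.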
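OK. Let's stop and examine the regular expressions used in the functions above. First, let's look at the isProper() function since it is simpler. The regular expression used is /^\w+( \w+)?$/. Read from left to right, it requires the string to start with one or more word characters, optionally followed by a single space and a second run of word characters, and then to end.
OK. Shall we move on to the isMail() function? The regular expression is /^\w+((-\w+)|(\.\w+))*\@[A-Za-z0-9]+((\.|-)[A-Za-z0-9]+)*\.[A-Za-z0-9]+$/. It requires a run of word characters, optionally extended by further runs that are each introduced by a single hyphen or period, then an "at" sign, then a host name made of alphanumeric runs joined by periods or hyphens and ending in a period followed by at least one alphanumeric character.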
With this particular regular expression, the bare minimum that a person could enter as an email address is x@x.x, where x is any alphanumeric character. The pattern also allows for email addresses like the following:
firstname.lastname@example.org
email@example.com
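To see where the pattern draws its lines, it can be run against a handful of sample addresses. The sketch below is only a probe; the sample addresses and the variable names are made up for illustration.

```javascript
var mailPattern = /^\w+((-\w+)|(\.\w+))*\@[A-Za-z0-9]+((\.|-)[A-Za-z0-9]+)*\.[A-Za-z0-9]+$/;

var samples = [
  "x@x.x",                            // accepted: the bare minimum
  "firstname.lastname@example.org",   // accepted: period before the "at" sign
  "first-last@mail-host.example.org", // accepted: hyphens on both sides
  "email@example",                    // rejected: no period after the host
  ".email@example.com",               // rejected: starts with a period
  "first+tag@example.com"             // rejected: + is not in the pattern
];

// Collect the samples that the pattern accepts.
var accepted = [];
for (var i = 0; i < samples.length; i++) {
  if (mailPattern.test(samples[i])) {
    accepted.push(samples[i]);
  }
}
// accepted now holds the first three samples.
```

The last sample is worth noting: addresses containing a plus sign do exist, so a pattern this strict will turn away a few legitimate visitors.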
Why not try the example out? It works in Netscape Navigator 2, 3 and 4, as well as Internet Explorer 3 and 4.
You can view the source code of the working example.
For a good introduction to regular expressions, please check out: ftp://ftp.ou.edu/mirrors/CPAN/doc/manual/html/pod/perlre.html
In addition, you might want to check out Tom Christiansen's page on Regular Expressions in Perl 5, which can be found at: http://www.perl.com/CPAN-local/doc/FMTEYEWTK/regexps.html The FMTEYEWTK stands for "Far More Than Everything You Ever Wanted To Know".