
Related items

Internet Explorer As A Development Platform?

META tags: What they are and how they work

Hypertext on PDAs

HTML #5 - Using feedback forms

HTML #4 - Advanced Page Layout

HTML #3 - Making your Web pages more exciting

Formatting Text In HTML

An Introduction to HTML

Stop! Is Your HTML Document Valid?


Published on: Friday 30th April 1999 By: Pankaj Kamthan

HTML Validation

Be conservative in what you produce; be liberal in what you accept.

HTML authoring is prone to errors. These errors are similar to those that can occur when using typesetting languages (e.g., LaTeX). Many browsers do not reveal such errors because they follow the second half of the above maxim: they accept HTML documents and try to display them even if they are not valid HTML. Usually, this means that the browser will try to make educated guesses about what the author probably meant. It works ... sometimes.

Why Validate?

The trend in HTML authoring is, unfortunately, to follow an HTML version of a different maxim:

If it ain't broken (i.e., the document is rendered "correctly"), why fix it?

This, however, hardly qualifies as "good" HTML authoring. The problem is that different browsers (or even different versions of the same browser) will make different guesses about the same illegal construct in an invalid HTML document. The result is a document that will display correctly up to a certain point and then display incorrectly, or even stop abruptly. Even if the document does seem to display correctly in all browsers in existence at that time (testing for which could be quite time-consuming), there is no guarantee that it will do so in their future versions. It is also possible that some other author may use your document in his or her work, only to find that it is incorrect.

These are some of the reasons why you want to follow the first half of the opening maxim (be conservative in what you produce) by making sure your documents are valid HTML. The best way of doing that is by processing your documents through one or more HTML validators.

Standard Generalized Markup Language (SGML) is a meta-language, of which HTML is a "child". For our purposes, a Document Type Definition (DTD) is simply a document that defines the syntax of an SGML-based language, such as HTML. An HTML document that conforms to a DTD is said to be valid with respect to that DTD. Besides DTD conformance, validation can be viewed from other angles: we will restrict ourselves to the syntactic, semantic (spelling, in the case of an HTML document), and stylistic.

In this article, we take a tour of the most commonly used HTML validators, chosen for the different features they offer. Demos of these validators (which refer to the CGI gateways of the respective validators) are also included. Besides validation, we also consider how to make a document optimal. We assume here that the reader has some knowledge of HTML and basic experience with Emacs.

The HTML validators that we will be discussing are given in the following table:

SERVICE | URL | SUPPORTED DTDs | TYPE OF VALIDATION
W3C HTML Validation Service | http://validator.w3.org/ | HTML 3.2, 4.0, Other DTDs | Syntax, Style
HTMLChek | http://uts.cc.utexas.edu/~churchh/htmlchek.html | HTML 3.2 | Syntax
Weblint | http://www.cre.canon.co.uk/~neilb/weblint/ | HTML 3.2 | Syntax, Style
HTML Tidy | http://www.w3.org/People/Raggett/tidy/ | HTML 4.0 | Syntax (limited), Style, Structure
Doctor HTML | http://www2.imagiware.com/RxHTML/ | NA | Syntax (limited), Style, Structure

W3C and WebTechs Validation Services operate directly from the HTML DTD, and both strictly obey the rules of SGML. HTMLChek and Weblint are heuristic validators - they do not completely parse your HTML markup, but simply scan it looking for errors. The advantage of this is that they are fast and can detect constructs that are valid HTML but considered "bad style", such as an <IMG> tag without an ALT attribute; the disadvantage is that they can fail to detect certain errors.
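To make the idea of a heuristic validator concrete, here is a minimal Python sketch that scans tags with a stack instead of consulting a DTD. It is not how HTMLChek or Weblint are implemented; the class name, the message wording, and the partial list of empty elements are assumptions chosen for illustration only.

```python
from html.parser import HTMLParser

# Elements without an end tag (partial list, assumed for illustration).
VOID = {"br", "hr", "img", "input", "meta", "link", "base", "area", "param"}


class HeuristicChecker(HTMLParser):
    """Scan tags with a stack; no DTD is consulted."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # names of elements that are still open
        self.warnings = []   # collected (line, message) pairs

    def handle_starttag(self, tag, attrs):
        line, _ = self.getpos()
        # Style check: an IMG element should carry ALT text.
        if tag == "img" and "alt" not in dict(attrs):
            self.warnings.append((line, "IMG does not have ALT text"))
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        line, _ = self.getpos()
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            innermost = self.open_tags[-1].upper() if self.open_tags else "none"
            self.warnings.append(
                (line, f"unexpected </{tag.upper()}>, innermost open element: {innermost}")
            )


def check(source):
    """Return a list of (line, message) warnings for an HTML string."""
    checker = HeuristicChecker()
    checker.feed(source)
    checker.close()
    return checker.warnings
```

Because the sketch knows nothing about optional end tags (for example </P> or </LI>), it will report false positives on perfectly valid documents and miss many real errors, which is exactly the trade-off described above.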

Using a combination of validators is probably the best solution. Each has features that the others don't, and they complement each other well.

We will discuss the use of the W3C HTML Validation Service, HTML Tidy, Doctor HTML and Weblint. HTMLChek is mentioned here for its historical significance; we will not discuss it, as it is slightly dated.

Part I. Online HTML Validation

There are various online HTML validation services. Among these, the most notable and authoritative is the W3C HTML Validation Service.

Syntax And Style With W3C HTML Validation Service

The W3C HTML Validation Service is based on James Clark's nsgmls SGML parser and Weblint. It checks HTML documents for compliance with W3C HTML Recommendations and other HTML standards. It does not have a File Upload option (like WebTechs HTML Validation Service).

According to the W3C HTML 4.0 Recommendation, the document should begin with one of the following DOCTYPE declarations:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN"
  "http://www.w3.org/TR/REC-html40/strict.dtd">
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
  "http://www.w3.org/TR/REC-html40/loose.dtd">
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN"
  "http://www.w3.org/TR/REC-html40/frameset.dtd">
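As a quick pre-check before submitting a document to a validator, the following Python sketch reports which of the three DOCTYPE declarations listed above (if any) a document begins with. The function and constant names are hypothetical, and the check is purely textual: it says nothing about whether the rest of the document is valid.

```python
import re

# Public identifiers of the three HTML 4.0 DTDs listed above.
HTML40_PUBLIC_IDS = {
    "-//W3C//DTD HTML 4.0//EN": "strict",
    "-//W3C//DTD HTML 4.0 Transitional//EN": "transitional",
    "-//W3C//DTD HTML 4.0 Frameset//EN": "frameset",
}

# A DOCTYPE declaration at the very start of the document (leading whitespace allowed).
DOCTYPE_RE = re.compile(r'\s*<!DOCTYPE\s+HTML\s+PUBLIC\s+"([^"]+)"', re.IGNORECASE)


def html40_flavour(source):
    """Return 'strict', 'transitional' or 'frameset', or None if no listed DOCTYPE leads the document."""
    match = DOCTYPE_RE.match(source)
    if match is None:
        return None
    return HTML40_PUBLIC_IDS.get(match.group(1))
```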

There is a list of other DTDs which are also supported. Source code for this service is also available for some UNIX platforms, and can be used to set up an offline validation service. (This, however, is not straightforward; besides installing a myriad of prerequisite software, one has to download the necessary DTDs, which the SGML parser otherwise retrieves from the WWW.)

W3C HTML Validation Service - Demo

The demo form asks for the URL of the document to be validated and offers the following options:

  • Include Weblint results
  • Show source input
  • Show parse tree
  • Run Weblint in "pedantic" mode
  • Show an outline of this document
  • Don't show attributes in the parse tree

Once your document has passed the W3C HTML Validation Service, you can place the following icon on it: Valid HTML 4.0

HTML Tidy

HTML Tidy is a utility that tidies up sloppy editing into nicely laid out markup. It also works well on the difficult-to-read markup generated by special-purpose HTML editors and conversion tools. It can help you identify how you can make your pages more accessible to people with disabilities. It also supports all HTML 4.0 entities and internationalization (through various character encodings), and has limited support for XML 1.0.

HTML Tidy is able to fix up a wide range of problems and to bring to your attention things that you need to work on yourself. It generates a list of the problems found, giving the line number and column of each. It has limited validation capabilities. Problems that it cannot handle are logged as "errors" rather than "warnings".

For use of an offline version of HTML Tidy, see HTML Tidy Revisited.

CYAN

CYAN is an online interface to HTML Tidy. It can check WWW pages for HTML 4.0 compliance while formatting tags according to your preferences. CYAN will then let you download a copy of the specified page with common errors fixed and point out the remaining errors.

The form accepts either a page address or pasted HTML tags to validate or format, and offers the following options:

  • Display warnings
  • Save preferences using cookies
  • Replace FONT, NOBR and CENTER tags with CSS
  • Read as XML
  • Convert to XML
  • Indent element content
  • Omit optional end tags
  • Force tags to upper case
  • Do not output entities for characters 128 to 255
  • Output numeric rather than named entities
  • Wrap text at column
  • Edit box width and height

Semantics With Doctor HTML

Doctor HTML performs tests, which can be selected from a menu, and displays a report containing the syntax errors and stylistic suggestions. We quickly outline the most important features:

A detailed explanation of all the options in Doctor HTML and an FAQ are available. It does not have a File Upload option (like WebTechs HTML Validation Service).

Doctor HTML - Demo

Enter the URL that you wish Doctor HTML to examine and select the tests you wish it to perform in the form below.

URL:
REPORT FORMAT: Short or Long
TESTS: Do All Tests, or Select from list below:

  • Spelling
  • Image Syntax
  • Form Structure
  • Image Analysis
  • Table Structure
  • Show Commands
  • Document Structure
  • Verify Hyperlinks
  • Show Page (JavaScript Only)

Once your document has passed the Doctor HTML Service, you can place the following icon on it: Checked by Doctor HTML

Optimization With HTML Squisher

Even after an HTML document is validated, it may contain unnecessary characters. HTML Squisher is a script which optimizes your HTML document by removing those characters. By doing so, it shrinks the number of bytes in your HTML document and makes it download faster. Some of its useful features are:

There is, however, one disadvantage: a page that has been squished will not be very human-readable, though it should still be parsed by an HTML editor and will display in your WWW browser without difficulty. You can reformat the HTML into something relatively easy to edit using the HTML Formatter.

HTML Squisher - Demo

To squish an HTML page, enter the URL below and press the Squish button.

URL:

Remove Comments: Yes No

Many HTML editors insert large amounts of text in the form of comments, which have no impact on how an HTML page is presented but slow the document download. If you want these comments removed, leave the "Yes" box checked.

Some parsed-HTML syntax is embedded in HTML comments, and it is occasionally useful to keep other comments as well, for example to keep an author's name associated with a page. If you want to keep these comments in the page, check the "No" box above.

Squish <hr>:Yes No

The horizontal-rule defaults to a width=100% and align=center. You can remove explicit specification of these defaults.

Convert <strong> to <b> Yes No

The logical tag <strong> is usually rendered in bold face. You can save some bytes by using the physical tag <b>.

Convert <em> to <i> Yes No

The logical tag <em> is usually rendered in italics. You can save some bytes by using the physical tag <i>.

Insert Base-HREF:Yes No

For HTML Squisher to display the document properly, a Base-HREF must be inserted so that the images and relative links are accurate. This shouldn't break the page, but it does waste a few bytes. If you select "No", the displayed page will have broken images, but when you put the new page into place on your site, it should work fine.
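The comment, <hr>, <strong> and <em> options described above can be approximated with a few regular expressions. The following Python sketch is only an illustration of the kind of rewriting involved, not HTML Squisher's actual code; the function name and parameter are hypothetical, and the Base-HREF option is not covered.

```python
import re


def squish(html, remove_comments=True):
    """Apply a few byte-saving rewrites to an HTML string."""
    if remove_comments:
        # Drop <!-- ... --> comments; a <!DOCTYPE ...> declaration is not matched.
        html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)

    # Replace logical tags with their shorter physical counterparts.
    html = re.sub(r"<(/?)strong\b", r"<\g<1>b", html, flags=re.IGNORECASE)
    html = re.sub(r"<(/?)em\b", r"<\g<1>i", html, flags=re.IGNORECASE)

    def strip_hr_defaults(match):
        # Remove attributes that only restate the <hr> defaults.
        attrs = re.sub(
            r'\s+(?:width="?100%"?|align="?center"?)(?=\s|$)',
            "",
            match.group(1),
            flags=re.IGNORECASE,
        )
        return "<hr" + attrs + ">"

    return re.sub(r"<hr\b([^>]*)>", strip_hr_defaults, html, flags=re.IGNORECASE)
```

Note that removing comments blindly would also remove scripts or style sheets hidden inside comments, so a real tool needs to be more careful than this sketch.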

Part II. Offline HTML Validation

There are times when you are working offline, by choice or by necessity, and do not have access to the Internet. In such cases, it is useful to be able to validate HTML offline on your own computer.

One solution is to use a syntax-checking HTML editor. Another is to use an HTML syntax checker such as weblint. Many of the WYSIWYGU (What You See Is What You Get Unfortunately) graphical editors (e.g., FrontPage 98, Netscape Composer) overlook basic syntactical errors. Furthermore, some HTML authoring tools generate HTML code that is completely contrary to the design goals of the language: they look at a document from the point of view of layout and then mimic that layout in HTML, often by overusing certain tags (e.g., <BR>) or by using presentational tags (such as <FONT>). This makes the resulting documents impractical to recast into other markup languages, such as eXtensible Markup Language (XML).

Weblint

Weblint is a syntax and minimal style checker for HTML. It catches syntax errors, warns about "bad" HTML style practices and potential compatibility problems. It is implemented as a Perl script which picks fluff off HTML pages, much in the same way as traditional lint picks fluff off C programs. It is available for UNIX, Windows NT, Macintosh and OS/2. We describe here version 1.020.

Features

The following checks are currently performed:

  • Default checks for HTML 3.2.
  • 46 different checks and warnings.
  • Warnings can be enabled/disabled individually, as per your preference.
  • Basic structure and syntax checks.
  • Warnings for use of unknown elements and element attributes.
  • Context checks (where a tag must appear within a certain element).
  • Overlapped or illegally nested elements.
  • Checks if IMG elements have ALT text.
  • Flags obsolete elements.
  • Support for user and site configuration files.
  • Stylistic checks.
  • Checks for HTML which is not portable across all browsers.
  • Flags markup embedded in comments.
  • Support for Netscape and Microsoft HTML extensions.

Obtaining And Installing Weblint

We will not discuss the details of installing weblint; for that you can refer to the readme file from the weblint distribution. The requirement is that you need to have Perl (4.036 or 5.004) on your system.

Using Weblint

Files to be checked are passed on the command-line:

% weblint *.html

Warnings are generated in a format similar to that of lint:

<filename>(line #): <warning>

For example:

index.html(5): malformed heading - open tag is <H1>, but closing is </H2>

For details of usage see the Weblint Man Page.
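Because the warning layout shown above is regular, a report is easy to post-process. The following Python sketch groups report lines by file name; it assumes the report text has already been captured (for example, redirected to a file), and the function name is hypothetical rather than part of weblint.

```python
import re
from collections import defaultdict

# Matches the "<filename>(line #): <warning>" layout shown above.
WARNING_RE = re.compile(r"^(?P<file>.+?)\((?P<line>\d+)\): (?P<message>.*)$")


def group_warnings(report):
    """Group weblint-style report lines into {filename: [(line, message), ...]}."""
    by_file = defaultdict(list)
    for raw in report.splitlines():
        match = WARNING_RE.match(raw)
        if match:  # lines in any other layout are skipped
            by_file[match["file"]].append((int(match["line"]), match["message"]))
    return dict(by_file)
```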

Weblint Gateways

A Weblint gateway is an HTML form which lets you type in a URL and have it checked by weblint without having to install weblint locally, turning it into an online HTML validation service. One such (referer) gateway is available at Concordia University, Canada. For a comprehensive list, see the Weblint Gateways Page.

The Weblint Mode : HTML Validation With Emacs

One advantage of using weblint for offline syntax checking is that you can use it in conjunction with a powerful editor such as Emacs.

Emacs is a widely and freely available editor distributed by the Free Software Foundation (FSF), and it runs under a wide variety of operating systems. Emacs can be configured for HTML editing, for example, with the help of the html-helper-mode. For those who wish to learn more about Emacs, a list of references has been provided.

If you have installed Weblint on your system, you can use the weblint mode to check the validity of your HTML files within Emacs.

Obtaining And Installing Weblint Mode

A copy of weblint mode is available at Concordia University, Canada. To use the mode, you need to add the following lines to your .emacs file after you have changed the path to the weblint directory accordingly:

(setq load-path (cons "path_to_weblint_directory/" load-path))
(autoload 'weblint "weblint" "Weblint syntax checker" t)

Using Weblint Mode

After creating and saving an HTML file, type:

M-x weblint RET

to invoke the weblint mode and do the checking. The results of the checks are displayed in a separate window. You can then toggle back and forth between the windows to browse through the errors and warnings, and do the debugging. The above process can be repeated till you are satisfied with the state of the document.

HTML Tidy Revisited

Obtaining And Installing HTML Tidy

HTML Tidy is available for Windows 95/NT (as a binary), Linux, MacOS, BeOS, and a variety of UNIX platforms. We will not discuss the details of installation. (It is straightforward on Windows 95, Linux and IRIX, on which it was tested.)

Using HTML Tidy

HTML Tidy runs from the command line with various options, as follows:

tidy [[options] filename]*

For a list of options and some representative examples, see the HTML Tidy page.

HTML-Kit

A GUI version of HTML Tidy, HTML-Kit, is also now available for Windows 95/98/NT. HTML-Kit can be used by experts as well as newcomers to HTML, who can benefit from HTML-Kit pointing out errors and improvements to the markup. It will also allow users to see which tags produce which effects. Figure 1 shows a snapshot of HTML-Kit in use with the URL http://www.irt.org. HTML-Kit can perform checks both with Tidy and CSE HTML Validator.

HTML-Kit

Figure 1. HTML-Kit Interface.

Optimization With HTML (Un)Compress

As with the case of online validation, your HTML document may again contain unnecessary characters. HTML (Un)Compress is a shareware program which optimizes your HTML document by removing such characters. It is available for Windows 95/98/NT-based platforms. The most important feature of HTML (Un)Compress is that, besides reducing the size of your document, it preserves the formatting of the document - its Compress tool first removes all information used for editing in the HTML file and then the UnCompress function of this tool adds this formatting information once again.

Conclusion

To err is an HTML author; to forgive is the browser.

HTML authoring is error-prone, and the consequences of this maxim (an HTML version of Alexander Pope's "To err is human, to forgive divine") are damaging: although you can check your documents using a WWW browser, this may not reveal all the errors in the document, because some browsers are quite forgiving and can recover from errors.

It is important that any document, whether in a formal or informal language, be syntactically, semantically and stylistically correct. HTML is no exception. Online or offline HTML validators can be quite useful in this endeavour, before there is a need for damage control.

References

HTML

HTML Validation

General

Online HTML Validation Services

HTML Optimization

Offline HTML Validation Services

Usenet


©2018 Martin Webb