Looking for Something? : Searching the Web

Published on: Sunday 20th June 1999 By: Pankaj Kamthan

Introduction
Obstacles to Searching
Views of the Web
Search Engines
Query
Search Techniques
The Nature of the Information on the Web
- Metadata and RDF
Conclusion
Acknowledgements
References

Introduction

The Web is one of the world's largest sources of publicly available information. It provides a myriad of information but still lacks navigational aids. It is important for users to be able to search for relevant information on the Web quickly and efficiently. As GVU Center's 10th WWW User Survey (October 1998) shows, finding both existing and new information is a significant problem currently facing the Web in general.

Figure 1. Problems Using the Web.

To help us deal with this incredible amount of data, a new skill is needed: Web searching. Given a topic, anyone with a Web browser and access to the Internet can search the Web for information on that topic. Searching, however, is not the same thing as finding. There is little organization or consistency on the Web.

Obstacles to Searching

A variety of problems can occur while searching through the Web:

Lack of Success. The search yields no results. With a global search engine, this can happen for several reasons. The obvious one is that no information corresponding to the query exists. Another is that the site excludes robots (programs that automatically traverse the Web's hypertext structure by retrieving a document and then recursively retrieving all documents it references) by using the robots exclusion standard.
Irrelevancy. The search yields too many results, most of them irrelevant. This is usually because of the inability of a search engine to "understand" what the searcher meant. Thus, a search for "Window" may yield "Microsoft Windows 98", "Tinted Windows" or "Alfred Hitchcock's Rear Window."
Inefficiency. The network connection is slow at the time when results are needed quickly.
Inaccessibility. The Web site on which the relevant information lies is inaccessible.
Lack of Reliability. The search did yield a result, and the site location did have the relevant information but on a subsequent visit, the document was not found.
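
The robot exclusion mentioned under "Lack of Success" can be illustrated with a small sketch. It uses Python's standard urllib.robotparser module to interpret an exclusion file. The file contents, the robot name "ExampleBot" and the example.com URLs are assumptions made up for this illustration, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical exclusion file for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_parser(robots_txt):
    """Build a parser from the text of an exclusion file (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)

# A well-behaved robot asks before retrieving each document.
blocked = parser.can_fetch("ExampleBot", "http://www.example.com/private/report.html")
allowed = parser.can_fetch("ExampleBot", "http://www.example.com/articles/index.html")
print(blocked, allowed)
```

Pages under the excluded path never reach the robot's catalog, so a global search engine cannot return them as hits even though they exist.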

A successful search requires a search strategy, which depends on the search object, view of the Web, search tool, query (if the search tool is a search engine) and search technique.

Views of the Web

There are a number of possible views of the Web, as illustrated in Figure 2 and outlined in Table 1. A key to effective searching (as pointed out in NetResearch: Finding Information Online, by Daniel J. Barrett, O'Reilly & Associates, 1997) is being able to change one's view of the Web as needed.

Figure 2. Views of the Web.

WEB VIEWPOINT	APPROACH	EXAMPLES	ADVANTAGES/DISADVANTAGES
Web is a group of Computers	Computer Engineering	Every computer on the Internet has a unique "name", such as www.irt.org.	The advantage of knowing some computer names during searching is speed. The disadvantage of this view is that one has to remember a lot of computer names (though this can be circumvented to a certain extent, as the naming is standardized and follows an intuitive pattern).
Web is a set of Programs	Computer Science	A search engine, such as AltaVista.	The advantage of this view is that knowing a specific program can speed up the search. The disadvantage is that it ignores intuition; just because one knows every single feature of a search engine does not mean that the search will always be effective.
Web is a collection of Resources	Information Technology, Publishing	Web sites can be viewed as "Yellow Pages", that is, a collection of resources.	The advantage of this view is that resources on a topic one is trying to locate may already exist; all one has to do is look them up. The disadvantage of this view is that such resources may go out of date, they list only a few (at times biased) resources on each topic, and they are limited to searching one topic at a time. They are not useful for searching a combination of topics.
Web is a Library	Library Science	The Web can be viewed as a large collection of libraries worldwide. In this view, a Web document is a "book", a URL is the "call number" and a search engine is a "librarian."	The advantage of this view is that users are often familiar with the way libraries are organized. The disadvantage of this view is that each of these libraries has its own method of organizing and accessing information, there is no master index of all libraries and their contents, and there is no "roadmap" to get from one library to another.
Web is a Community	Sociology	People "communicate" with each other via e-mail, chat, or a bulletin board.	The advantage of this view is that there are many knowledgeable people with access to the Internet who can help locate a resource. The disadvantage of this view is that such a facility can easily be misused. For example, students can post their assignment problems and obtain solutions to them without the knowledge of the teacher, or users can ask an overwhelming number of general questions.

Table 1. Web from the Viewpoint of Searching.

Search Engines

Search engines are powerful tools for searching the Web. The way a search engine works is simple. The words given to a search engine are keywords, and a set of one or more keywords is a query. A search engine has access to a catalog of Web pages and uses it to locate pages that match a query. If a search engine finds a Web page in its catalog that matches a query, the matching page is a hit.
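
The catalog, query and hit described above can be sketched in a few lines. This is a minimal illustration, assuming a catalog that simply maps each word to the pages containing it; the page URLs and texts are invented, and real engines use far more elaborate catalogs.

```python
from collections import defaultdict

# Hypothetical mini-collection of pages: URL -> page text.
PAGES = {
    "http://www.example.com/js.html": "JavaScript image rollover tutorial",
    "http://www.example.com/win.html": "Tinted windows for your car",
    "http://www.example.com/xml.html": "XML and JavaScript together",
}


def build_catalog(pages):
    """Map each lowercased word to the set of URLs whose text contains it."""
    catalog = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            catalog[word].add(url)
    return catalog


def search(catalog, query):
    """Return the hits: pages that contain every keyword of the query."""
    keywords = query.lower().split()
    if not keywords:
        return set()
    hits = set(catalog.get(keywords[0], set()))
    for keyword in keywords[1:]:
        hits &= catalog.get(keyword, set())
    return hits


catalog = build_catalog(PAGES)
print(sorted(search(catalog, "JavaScript")))
```

Because the catalog is built ahead of time, answering a query is a matter of looking up sets rather than reading every page again.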

Local Search Engines

We call a search engine local if it searches only the information local to a Web site. Such a search facility is available, for example, at irt.org. These engines are often CGI scripts which act as a gateway. Some popular scripts/programs in use today are given in the list of references.

Search Gateways

The HTTP protocol limits the possibilities and usefulness of the Web as an integrated Internet access platform since many things we expect to use are not available through the Web. One solution to this problem is to use gateways. A gateway is a program that functions as an interface between two systems. In this case, on one side there is the Web and its HTTP protocol, and on the other side there are other Internet protocols or applications. Gateways are specially designed scripts/programs in a computer language (Perl, C, shell scripts, and so on), complying with the CGI specification or some proprietary API specification.
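
As a rough sketch of the idea, the following script plays the role of a gateway: on one side it reads the QUERY_STRING variable that a Web server sets under the CGI convention, and on the other side it consults a separate system. The GLOSSARY dictionary standing in for that system, the "term" parameter name and the gateway function are assumptions made for illustration only.

```python
import html
import os
from urllib.parse import parse_qs

# Stand-in for the non-Web system behind the gateway (an assumption;
# a real gateway might call a man-page reader or a DNS lookup tool).
GLOSSARY = {
    "robot": "A program that automatically traverses the Web.",
    "hit": "A catalogued page that matches a query.",
}


def gateway(environ):
    """Translate a CGI-style request into a backend lookup and an HTTP response."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    term = params.get("term", [""])[0].strip().lower()
    answer = GLOSSARY.get(term, "No entry found.")
    # Escape everything that is echoed back into the page.
    body = "<p><b>{}</b>: {}</p>".format(html.escape(term), html.escape(answer))
    return "Content-Type: text/html\r\n\r\n" + body


if __name__ == "__main__":
    # Under CGI, the Web server sets QUERY_STRING before running the script.
    print(gateway(os.environ), end="")
```

Keeping the translation in a function that takes the environment as an argument makes the gateway easy to exercise without a Web server.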

Table 2 reflects the usefulness of search gateways in various contexts. Such gateways, though limited in their reach by their very nature, can be very useful. As more and more of these specialized search engines appear on the Web, it will become easier for users to find information relevant to their topics of interest.

TOPIC	EXAMPLE
Books	Amazon.com Bookstore
Documentation	UNIX man Pages
Graphics	GIF Optimizer
Networking	DNS Lookup (InterNIC and CDNNet)
Reference	PC Webopaedia
Software	CPAN
Language Validation	XML Validation

Table 2. Examples of Gateways in Use.

There are more examples on C|net.

Search Engine Criteria

Local search engines, such as those that are CGI-based, vary in the features they offer and in their complexity. From a design and implementation point of view, some desirable features are described below:

Speed and Reliability. It is always important that the search engines are fast and reliable.
Help. The engine should provide a help document describing its search features (such as links to documents or directories if a match to a query is found, a description of the match, and links to topics related to the query) and how the results are organized (such as ranking by number of occurrences of the query, or whether the matching document links open in a new window so that the user can return to the page with the search results). When a search for a query fails, the engine should provide hints for improving the search and other leads for navigation. (The search engine at irt.org provides this feature when a search for a term, such as "video", fails.)
Security. As the engine provides a search field (that could be used for frivolous purposes) to the user, it should take the security precautions normally associated with CGIs.
User-Interface. The engine should have an intuitive user-interface that may be familiar to most users.
Query Lengths. It is difficult to anticipate all possible search queries. Still, the form field should, as far as reasonably feasible, accommodate likely query lengths. In the next (fictitious) example, the length of the field is 10 characters.

This is sufficient for searching the term "JavaScript" but not necessarily for "JavaScript Image Rollover", as the user cannot see what is being entered without scrolling. "Hardwiring" a query length (by using the maxlength attribute in the above case) is also not a good practice.

A guide with examples of some possible queries can be useful here to the user.
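
On the server side, the security and query-length points above might be handled along the following lines. This is only a sketch under stated assumptions: the clean_query and echo_query helpers and the 200-character cap are hypothetical choices for illustration, not part of any particular search script.

```python
import html
import re

# Assumed generous server-side cap; not a value taken from the article.
MAX_QUERY_CHARS = 200


def clean_query(raw):
    """Normalize a user-supplied query before it reaches the search code."""
    # Replace control characters, which have no place in a keyword query.
    text = re.sub(r"[\x00-\x1f\x7f]", " ", raw)
    # Collapse runs of whitespace and trim both ends.
    text = " ".join(text.split())
    # Truncate very long input instead of restricting the form field itself.
    return text[:MAX_QUERY_CHARS]


def echo_query(raw):
    """Return the query in a form that is safe to embed in the results page."""
    return html.escape(clean_query(raw))


print(clean_query("  JavaScript   Image\tRollover "))
print(echo_query("<script>alert(1)</script>"))
```

Handling the limit in the script rather than in the form leaves the user free to type a long query while still protecting the program that receives it.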

Global Search Engines

We call a search engine global if it is not local. There are three types of global search engines: active, passive and meta-search (see Figure 3). We now describe each of them in detail, including their advantages and limitations.

Active Search Engine. This is a search engine that collects Web page information by itself. An example is AltaVista. It uses a robot that travels around the Internet, locates Web pages and adds entries to the catalog. The advantage of active search engines is that they usually have large catalogs that are updated frequently (without human intervention). The disadvantage is that there are often too many hits, which are not very well organized.
Passive Search Engine. This is a search engine that allows people to register their Web pages. An example is Yahoo!. Once a page is registered with the search engine, the page can be found by queries. The advantage of passive search engines is that they tend to be very organized. The disadvantages are that they usually have smaller catalogs and that items may be filed in places the user does not expect.
Meta-Search Engine. This is a search engine that uses several search engines simultaneously. An example is MetaCrawler. The advantage is that it saves the effort of searching multiple search engines one by one. The disadvantages are that the search can be slow and that a meta-search engine may summarize the data in its own way or present only partial results from each search engine, possibly hiding relevant information.

Figure 3. The Three Types of Web Search Engines.
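
The meta-search idea, together with its "partial results" drawback, can be sketched as follows. The two engines here are hypothetical stand-ins that return fixed, ranked lists of made-up URLs; a real meta-search engine would send the query to remote services.

```python
def engine_a(query):
    """Hypothetical engine: returns a ranked list of URLs for any query."""
    return ["http://a.example/1", "http://shared.example/", "http://a.example/3"]


def engine_b(query):
    """Second hypothetical engine with partly overlapping results."""
    return ["http://shared.example/", "http://b.example/2", "http://b.example/3"]


def meta_search(query, engines, per_engine=2):
    """Query every engine, keep the top results of each, and merge without duplicates."""
    merged = []
    seen = set()
    for engine in engines:
        for url in engine(query)[:per_engine]:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


results = meta_search("searching the web", [engine_a, engine_b])
print(results)
```

With per_engine set to 2, the third result of each engine never reaches the user, which is exactly how a meta-search engine can hide relevant information.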

A collection of specialized global search engines is available at http://www.search-engines.net/.

Limitations of Search Engines

In spite of the strength of search engines as a search tool, they have certain limitations (global search engines in particular) that one should be aware of when using them. These are:

The way they sort and rate information; often irrelevant information is presented earlier in the list of hits.
Over-information: a query often returns far more hits than a user can examine.
Their inability to distinguish word meanings.
Their inability to recognize some HTML elements, such as frames. (In cases where frames are being used, proper use of the <META> tag and/or including a detailed description of the site in the <NOFRAMES> tag can help.)

Query

The success of a Web search depends heavily on the keywords chosen. It is important to know how they are treated by specific search engines. A simple query consists of just the keywords one wishes to search for. Simple queries are often too broad, resulting in a response with many unrelated hits. An advanced query uses Boolean operators to include or exclude keywords from the search. The operators supported by a search engine define its query language. Different search engines usually have different query languages, which are described on their respective Web sites, often among the options for a "power search."
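
To make the difference between a simple and an advanced query concrete, here is a small sketch of a made-up query language in which a leading "-" excludes a keyword and an optional leading "+" requires it. This syntax is an assumption for illustration; each real search engine documents its own operators.

```python
def parse_query(query):
    """Split a query into required and excluded keywords (hypothetical syntax)."""
    required, excluded = set(), set()
    for token in query.lower().split():
        if token.startswith("-"):
            word, target = token[1:], excluded
        else:
            word, target = token.lstrip("+"), required
        if word:
            target.add(word)
    return required, excluded


def matches(page_text, query):
    """True if the page has every required keyword and no excluded one."""
    words = set(page_text.lower().split())
    required, excluded = parse_query(query)
    return required <= words and not (excluded & words)


print(matches("Alfred Hitchcock's Rear Window", "window -microsoft"))
print(matches("Microsoft Windows 98", "windows -microsoft"))
```

Excluding a keyword is one way to cut down the unrelated hits that a simple query for an ambiguous word tends to produce.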

Search Techniques

Using certain search techniques, searching time and effort can be greatly reduced. The techniques are of the following types:

General Search. A general search uses the query operator OR to combine keywords, instructing the search engine to locate pages that contain any of the keywords. The more keywords one adds, the broader the search and the more hits one is likely to get, but many of those hits are likely to be irrelevant.
Specific Search. A specific search uses the query operator AND to combine keywords, instructing the search engine to locate pages that contain all of the keywords. The more keywords one adds, the narrower the search and the fewer hits one is likely to get. The chances of getting a hit are lower, but the hits are likely to be relevant.
Incremental Search. An incremental search allows one to narrow and broaden the search progressively until one "zeroes in" on the desired information. This process can, however, be time consuming.
Substring Search. A substring search allows one to search for the occurrence of a keyword (substring) in any word (string) in which it occurs. This can simplify the query but is likely to produce irrelevant hits.
Search-and-Jump. Search-and-jump uses the Find facility of browsers (such as the Find button in Netscape Communicator or the "/" command in Lynx). One chooses a search engine, sets the number of hits to the maximum possible, types the query and starts the search, and then uses the Find facility to locate specific information in the Web page returned. This process is faster than performing multiple queries, but it can still be time consuming.
Category Search. This allows one to search using the categorized list of topics of subject guides (such as Yahoo! or the Netscape Open Directory Project). No query is required. The limitation here is that the list may not be organized in the manner one might expect.
Search and Rank. Some search engines (such as AltaVista) search and rank the results so that relevant hits are listed first. However, effective ranking functions are still not known.
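
Several of the techniques above can be compared side by side in a short sketch. The three documents are invented, and ranking by the number of occurrences of a keyword is just one simple ranking function chosen for illustration, not the method of any particular search engine.

```python
# Hypothetical documents for illustration.
DOCS = {
    "doc1": "search engines rank search results",
    "doc2": "metadata helps search engines",
    "doc3": "library catalogs describe books",
}


def general_search(docs, keywords):
    """OR: documents containing any of the keywords."""
    return {name for name, text in docs.items()
            if any(k in text.split() for k in keywords)}


def specific_search(docs, keywords):
    """AND: documents containing all of the keywords."""
    return {name for name, text in docs.items()
            if all(k in text.split() for k in keywords)}


def substring_search(docs, fragment):
    """Documents in which the fragment occurs inside any word."""
    return {name for name, text in docs.items()
            if any(fragment in word for word in text.split())}


def rank(docs, hits, keyword):
    """Order hits by how often the keyword occurs, most occurrences first."""
    return sorted(hits, key=lambda name: (-docs[name].split().count(keyword), name))


print(sorted(general_search(DOCS, ["search", "books"])))
print(sorted(specific_search(DOCS, ["search", "metadata"])))
print(sorted(substring_search(DOCS, "catalog")))
print(rank(DOCS, {"doc1", "doc2"}, "search"))
```

An incremental search amounts to alternating between the first two functions, adding keywords to narrow the hits and removing them to broaden the search again.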

Table 3 summarizes the search strategies for some typical (search) objects. See also Getting the Most Out of Your Search Engine for the specific features that you could consider when carrying out a search using a global search engine.

OBJECT	WEB VIEW	SEARCH TOOL	QUERY	TECHNIQUE
Topic	Library	Search Engines	Advanced	Category
Document	Library	Search Engines	Advanced	Specific Search
Software	Resource	Usenet FAQ	Advanced	Search-and-Jump
People	Community	Search Engines, Usenet	Advanced	Specific Search

Table 3. Web Search Strategies.

The Nature of the Information on the Web

Effective searching depends strongly on how and in what form the information itself is organized on the Web. The Web was originally built for human consumption, and although everything on it is machine-readable, this data is not necessarily machine-understandable. Until now, HTML has been the universal data format for publishing information on the Web. It was designed to focus on structure and therefore has very limited support for attaching any "meaning" to the content of a document. The development of the eXtensible Markup Language (XML) addresses this problem to a large extent.

Metadata and RDF

Metadata is "structured data about data." For example, a library catalog is metadata, since it describes resources in the library. In our context, it is "data describing Web resources." For example, the HTML <META> tag describes the content of a Web document. Metadata facilitates searching, helping authors to describe their documents in ways that browsers, search engines and robots can understand. As a result, users can have better document discovery services available to them.
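
As an illustration of how a robot or search engine might read such metadata, the sketch below pulls the name and content pairs out of <META> tags using Python's standard html.parser module. The sample document and the MetaExtractor class are assumptions made up for this example.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect name/content pairs from <META> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # The parser reports tag and attribute names in lowercase.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


# Hypothetical document for illustration.
SAMPLE = """<html><head>
<title>Searching the Web</title>
<META NAME="description" CONTENT="An overview of Web search strategies.">
<META NAME="keywords" CONTENT="search, engines, metadata">
</head><body></body></html>"""


def extract_meta(document):
    """Return a dictionary of the metadata declared in the document head."""
    parser = MetaExtractor()
    parser.feed(document)
    parser.close()
    return parser.meta


meta = extract_meta(SAMPLE)
print(meta["description"])
print(meta["keywords"])
```

A search engine that stores these values in its catalog can show the author's own description of a page next to each hit.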

Resource Description Framework (RDF), which is an XML application, is a foundation for processing metadata. It provides interoperability between applications that exchange machine-understandable information on the Web. Till now, automated processing of information on the Web has been very difficult because of the sheer volume. RDF emphasizes facilities to enable automated processing of Web resources.

RDF's syntax specification became a W3C recommendation on February 22, 1999. As pointed out in the specification, RDF could enhance our searching capabilities in a variety of ways: in resource discovery to provide better search engine capabilities, in cataloging for describing the content and content relationships available at a particular Web site, by intelligent software agents to facilitate knowledge sharing and exchange, in content rating, in describing collections of pages that represent a single logical "document."
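
To give a feel for what machine-understandable metadata looks like, the sketch below reads a small RDF description with Python's standard xml.etree.ElementTree module and lists its statements as (resource, property, value) triples. The sample description, its example.com URL and the use of Dublin Core property names are assumptions for illustration, and the code handles only this simplest form of the RDF syntax.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical description of a Web page, for illustration only.
SAMPLE_RDF = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.example.com/search.html">
    <dc:title>Searching the Web</dc:title>
    <dc:creator>A. N. Author</dc:creator>
  </rdf:Description>
</rdf:RDF>"""


def statements(rdf_xml):
    """Return (resource, property, value) triples from simple RDF/XML."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for description in root.findall("{%s}Description" % RDF_NS):
        resource = description.get("{%s}about" % RDF_NS)
        for prop in description:
            # prop.tag has the form "{namespace}localname".
            triples.append((resource, prop.tag, (prop.text or "").strip()))
    return triples


triples = statements(SAMPLE_RDF)
for triple in triples:
    print(triple)
```

Because every property carries its namespace, a program can tell a "title" in one vocabulary from a "title" in another, which is the kind of distinction plain keyword matching cannot make.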

However, for this to become a reality, both XML (as a data format that incorporates "meaning") and RDF (as a metadata framework) will need widespread acceptance and use, along with the requisite software for processing them.

Conclusion

Towards ... Eureka!

Searching the Web is an art as well as a science. There are no perfect solutions; each has its own advantages and limitations. There is also no substitute for human experience. But a blend of experience with intuition and knowledge of some basic techniques can make the journey both enjoyable and fruitful.

In "searching for something", the "something" (in our case information) is as important as the methodology of searching itself. It is expected, at least from the perspective of a computer-based search, that the organization of information on the Web (and thus its searching) will improve as the use of XML and RDF becomes ubiquitous.

Acknowledgements

The statistical graph presented in Figure 1 is from the GVU Center's WWW User Surveys of GTRC and the GVU Center. It is "Copyright 1994-1998 Georgia Tech Research Corporation. All rights Reserved" and its use is hereby acknowledged.

References

Yahoo! - List of search engines.
Deja News - Usenet search.
List of Lists - Mailing list lists.
People Search: Who Where? (E-Mail address search), Personal Pages Worldwide (Personal Home Page lists), and BigYellow (Telephone directories).
Reference Search: Research-It! (General reference) and Webster's Dictionary (Specific reference).
SWISH, WWWWAIS, freeWAIS, Glimpse, Harvest (Unix; Freeware/Shareware); Excite (Windows NT/95, Unix; Commercial) - Document indexing and text searching software.
NetResearch: Finding Information Online, by Daniel J. Barrett, O'Reilly & Associates, 1997. A good introduction to the art and science of searching.
Search Engine Watch - This site contains information about how search engines work, comparisons of search engines, recent news about changes in search engines, and more.
Metadata and Resource Description - W3C.
RDF - What's in it for us? - By Janus Boye. A lucid introduction to RDF.
