Looking for Something? : Searching the Web
Published on: Sunday 20th June 1999 By: Pankaj Kamthan
Introduction
The Web is one of the world's largest sources of publicly available
information. It provides a myriad of information, though it still lacks
navigational aids. It is important for users to be able to search for relevant
information efficiently and quickly on the Web. As
GVU Center's 10th WWW User Survey
(October 1998) shows, finding both existing and new information is a
significant problem currently facing the Web in general.

Figure 1. Problems Using the Web.
To help us deal with this incredible amount of data, a new skill is needed:
Web searching. Given a topic, anyone with a Web browser and access to
the Internet can search the Web for information on that
topic. Searching, however, is not the same thing as
finding. There is little organization or consistency on the Web.
Obstacles to Searching
A variety of problems can occur while searching through the
Web:
- Lack of Success. The search yields no
results. This can happen for various reasons if a
global search engine
is being used. Apart from the obvious possibility that the information
corresponding to the query does not exist, the site holding it may have
excluded robots (programs that automatically traverse the Web's
hypertext structure by retrieving a document and recursively
retrieving all documents that it references) by using
exclusion standard
code.
- Irrelevancy. The search yields too many (most of
them irrelevant) results. This is usually because of the inability of
a search engine to "understand" what the searcher
meant. Thus, a search for "Window" may yield: "Microsoft
Windows 98" or "Tinted
Windows" or "Alfred
Hitchcock's Rear Window."
- Inefficiency. The network connection is
slow at the time when results are needed quickly.
- Inaccessibility. The Web site on which the
relevant information lies is inaccessible.
- Lack of Reliability. The search did yield a
result, and the site location did have the relevant information but on
a subsequent visit, the document was not found.
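To illustrate how robot exclusion can hide a page from a search engine, here is a minimal sketch using Python's standard urllib.robotparser module. The exclusion rules, the robot name "ExampleBot" and the example.com URLs are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical exclusion rules for an example site (an assumption).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robot_name, url, robots_txt=ROBOTS_TXT):
    """Return True if the exclusion rules let robot_name retrieve url."""
    parser = RobotFileParser()
    # parse() takes the lines of the exclusion file.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(robot_name, url)
```

A well-behaved robot would skip every URL for which allowed() returns False, so documents under /private/ would never enter its catalog and a search for them would yield no results.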
A successful search requires a search strategy, which depends on the
search object, view of the Web, search tool, query (if the search tool
is a search engine) and search technique.
Views of the Web
There are a number of possible views of the Web, as illustrated in
Figure 2 and outlined in Table 1. A key to effective searching (as
pointed out in
NetResearch: Finding Information Online,
By Daniel J. Barrett, O'Reilly & Associates, 1997) is
to be able to change one's view of the Web as needed.

Figure 2. Views of the Web.
| WEB VIEWPOINT | APPROACH | EXAMPLES | ADVANTAGES/DISADVANTAGES |
| Web is a group of Computers | Computer Engineering | Every computer on the Internet has a unique "name", such as www.irt.org. | The advantage of knowing some computer names during searching is speed. The disadvantage of this view is that one has to remember a lot of computer names (though this can be circumvented to a certain extent as the naming is standardized and follows an intuitive pattern). |
| Web is a set of Programs | Computer Science | A search engine, such as AltaVista. | The advantage of knowing a specific program is that it can speed up the search. The disadvantage of this view is that it ignores intuition; just because one knows every single feature of a search engine does not mean that the search will always be effective. |
| Web is a collection of Resources | Information Technology, Publishing | Web sites can be viewed as "Yellow Pages", that is, a collection of resources. | The advantage of this view is that resources on a topic one is trying to locate may already exist; all one has to do is to look them up. The disadvantage of this view is that such resources may go out-of-date, they list only a few (at times biased) resources on each topic, and they are limited to searching on one topic at a time. They are not useful for searching a combination of topics. |
| Web is a Library | Library Science | The Web can be viewed as a large collection of libraries worldwide. In this view, a Web document is a "book", a URL is the "call number" and a search engine is a "librarian." | The advantage of this view is that users are often familiar with the way libraries are organized. The disadvantage of this view is that each of these libraries has its own method of organizing and accessing information, there is no master index of all libraries and their contents, and there is no "roadmap" to get from one library to another. |
| Web is a Community | Sociology | People "communicate" with each other via e-mail, chat, or a bulletin-board. | The advantage of this view is that there are many knowledgeable people with access to the Internet who can help locate a resource. The disadvantage of this view is that such a facility can easily be misused. For example, students can post their assignment problems and obtain solutions to them without the knowledge of the teacher, or users can ask an overwhelming number of general questions. |
Table 1. Web from the Viewpoint of Searching.
SEARCH ENGINES
Search engines are powerful tools for searching the Web. The way a
search engine works is simple. The words given to a search
engine are keywords and a set of one or more keywords is a
query. A search engine has access to a catalog of Web pages,
and uses it to find pages that match a query. If a search engine finds a Web page in
its catalog that matches a query, the matching page is a hit.
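The relationship between keywords, query, catalog and hit can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pages and URLs are invented, the catalog is a simple word-to-URL index, and a page is assumed to match a query when it contains every keyword. Real search engines are far more elaborate.

```python
from collections import defaultdict

# Hypothetical pages (URL -> text), made up for illustration.
PAGES = {
    "http://www.example.com/js.html": "JavaScript image rollover tutorial",
    "http://www.example.com/win.html": "Tinted windows for cars",
    "http://www.example.com/film.html": "Rear Window by Alfred Hitchcock",
}


def build_catalog(pages):
    """Map each lowercased word to the set of URLs whose text contains it."""
    catalog = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            catalog[word].add(url)
    return catalog


def search(catalog, query):
    """Return the hits: URLs that contain every keyword of the query."""
    keywords = query.lower().split()
    if not keywords:
        return set()
    hits = set(catalog.get(keywords[0], set()))
    for keyword in keywords[1:]:
        hits &= catalog.get(keyword, set())
    return hits
```

Note that this sketch matches whole words only, so the query "window" does not find the page about "windows". That is one small instance of a search engine not "understanding" what the searcher meant.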
Local Search Engines
We call a search engine local if it searches only
the information local to a Web site. Such a search facility is
available,
for example, at
irt.org.
These engines are often CGI scripts which act as a gateway. Some
popular scripts/programs in use today are given in the list of
references.
Search Gateways
The HTTP protocol limits the possibilities and usefulness of the Web
as an integrated Internet access platform since many things we expect
to use are not available through the Web. One solution to this problem
is to use gateways. A gateway is a program that functions as
an interface between two systems. In this case, on one side there is
the Web and its HTTP protocol, and on the other side there are other
Internet protocols or applications. Gateways are specially designed
scripts/programs in a computer language (Perl, C, shell scripts, and
so on), complying with the CGI specification or some proprietary API
specification.
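A gateway of this kind can be sketched as follows. The sketch assumes a CGI-style program that receives a query string, passes the request to another system, and returns an HTML document. The DIRECTORY dictionary stands in for the non-Web system, and the names lookup, gateway and the "name" parameter are hypothetical.

```python
import html
from urllib.parse import parse_qs

# Stand-in for a non-Web system such as a directory service.
# The entries are invented for illustration.
DIRECTORY = {
    "alice": "alice@example.com",
    "bob": "bob@example.com",
}


def lookup(name):
    """Query the back-end system; return None if there is no entry."""
    return DIRECTORY.get(name.lower())


def gateway(query_string):
    """Translate a CGI query string into a back-end lookup and an HTML reply."""
    params = parse_qs(query_string)
    name = params.get("name", [""])[0].strip()
    result = lookup(name) if name else None
    if result is None:
        body = "<p>No entry found for %s.</p>" % html.escape(name)
    else:
        body = "<p>%s: %s</p>" % (html.escape(name), html.escape(result))
    # A CGI program writes its headers, a blank line, then the document.
    return "Content-Type: text/html\n\n" + body
```

The Web side of the gateway sees only HTTP and HTML, while the lookup function is the only part that needs to know how to talk to the other system.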
Table 2 reflects the usefulness of search gateways in various
contexts. Such gateways, though (by their very nature) limited in
their reach, can be very useful. As more and more such specialized
search engines appear on the Web, it will become easier for
users to find information relevant to their topic(s) of interest.
Table 2. Examples of Gateways in Use.
There are more examples on
C|net.
Search Engine Criteria
Local search engines, such as those that are CGI-based, vary in the
features and complexity they offer. From a design and
implementation point of view, some desirable features are
described below:
- Speed and Reliability. It is always important
that the search engines are fast and reliable.
- Help. The engine should provide a help document to the
user describing search features (such as links to documents or
directories if a match to a query is found, a description of the
match, links to topics related to the query, etc.) and how the results
are organized (such as ranking by number of occurrences of the query,
whether the matching document links open in a new window so that the
user can return to the page with the search results, etc.). When a
search for a query fails, the engine should provide hints to improve the search
and other leads for navigation to the user.
(The search engine
at irt.org provides this feature when a
search for a term (such as "video") fails.)
- Security. As the engine provides a search
field (that could be used for frivolous purposes) to the user, it
should take the security precautions normally associated with
CGIs.
- User-Interface. The engine should have an
intuitive user-interface that may be familiar to most users.
- Query Lengths. It is difficult to
anticipate all possible search queries. Still, the form field should,
as much as reasonably feasible, incorporate possible query lengths. In
the next (fictitious) example, the length of the field is 10
characters.
This is sufficient for searching the term "JavaScript"
but not necessarily for "JavaScript Image Rollover" as the
user cannot see what is being entered without
scrolling. "Hardwiring" a query length (by using the maxlength attribute in the
above case) is also not a good practice.
A guide with examples of some possible queries can be useful here to
the user.
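The security precautions mentioned above can be sketched as a small routine that tidies a query before it is searched and escapes it before it is echoed on the results page. The function names and the MAX_QUERY_CHARS value are assumptions. The limit is meant as a generous server-side guard against abuse, not as a restriction on the visible form field.

```python
import html

# Assumed server-side safety limit (hypothetical value), not a form-field limit.
MAX_QUERY_CHARS = 200


def clean_query(raw):
    """Normalize a user-supplied query before it is searched or displayed."""
    # Replace control and other non-printable characters with spaces.
    printable = "".join(ch if ch.isprintable() else " " for ch in raw)
    # Collapse runs of whitespace and trim both ends.
    collapsed = " ".join(printable.split())
    return collapsed[:MAX_QUERY_CHARS]


def echo_query(raw):
    """Return HTML that safely shows the query back on the results page."""
    return "<p>Results for: %s</p>" % html.escape(clean_query(raw))
```

Escaping the query before displaying it prevents markup typed into the search field from being interpreted by the browser.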
Global Search Engines
We call search engines on the Web that are not local
global. There are three types of global search engines:
active, passive and meta-search. See Figure
3. We now describe each of them in detail, including their advantages
and limitations.
- Active Search Engine. It is a search engine that
collects Web page information by itself. An example is
AltaVista.
It uses a robot that travels around the Internet, locates Web pages
and adds entries to the catalog. The advantage of active search
engines is that they usually have large catalogs and are updated
frequently (without human intervention). The disadvantage is that
there are often too many hits, which are not very well
organized.
- Passive Search Engine. It is a search engine that
allows people to register their Web pages. An example is
Yahoo!.
Once a page is registered with the search engine, the page can be
found by queries. The advantage of passive search engines is that
they tend to be very organized. The disadvantages are that they
usually have smaller catalogs and items may be organized in unexpected
(from the user's viewpoint) places.
- Meta-Search Engine. It is a search engine which uses
several search engines simultaneously. An example is
MetaCrawler.
The advantage is that it saves the effort of searching multiple search
engines. The disadvantages are that the search can be slow and that
it may summarize the data in its own way or present only partial
results from each search engine, possibly hiding relevant
information.

Figure 3. The Three Types of Web Search Engines.
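The idea behind a meta-search engine can be sketched as below. The sketch assumes each underlying engine is available as a function that takes a query and returns a list of URLs. The stub engines engine_a and engine_b and their results are invented for illustration and do not call any real service.

```python
from concurrent.futures import ThreadPoolExecutor


def engine_a(query):
    """Stub for a first search engine (hypothetical results)."""
    return ["http://a.example/1", "http://shared.example/x"]


def engine_b(query):
    """Stub for a second search engine (hypothetical results)."""
    return ["http://shared.example/x", "http://b.example/2"]


def meta_search(query, engines, per_engine=None):
    """Query several engines at once and merge the hits without duplicates.

    engines maps an engine name to a callable(query) returning a list of URLs.
    per_engine, if given, keeps only that many hits from each engine.
    """
    names = list(engines)
    # Run all engines concurrently; map() keeps the results in name order.
    with ThreadPoolExecutor(max_workers=len(names) or 1) as pool:
        results = list(pool.map(lambda name: engines[name](query), names))
    merged, seen = [], set()
    for name, hits in zip(names, results):
        if per_engine is not None:
            hits = hits[:per_engine]
        for url in hits:
            if url not in seen:
                seen.add(url)
                merged.append((url, name))
    return merged
```

Setting per_engine to a small number shows the limitation noted above: hits that an individual engine would have returned may never reach the user.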
A collection of some specialized global search engines is available at
http://www.search-engines.net/.
Limitations of Search Engines
In spite of the strength of search engines as a search tool, there are
certain limitations (particularly, of global search engines) that one
should be aware of when using them. These are:
- The way they sort and rate information; often irrelevant
information is presented earlier in the list of hits.
- Over-information: a query may return far more hits than a user can review.
- Their inability to distinguish word meanings.
- Their inability to recognize some HTML elements, such as
frames. (In cases where frames are being used, proper use of the
<META> tag and/or including a detailed description
of the site in the <NOFRAMES> tag can
help.)
Query
The success of a Web search depends heavily on the keywords chosen. It
is important to know how they are treated by specific search
engines. A simple query consists of just the keywords one
wishes to search. Usually, simple queries can be too broad, resulting
in a response with many unrelated hits. An advanced query
uses Boolean operators to include or exclude keywords from the
search. The types of operators supported by a search engine define its
query language. Different search engines usually have
different query languages, which are described on their respective Web
sites, often among the options for a "power search."
Search Techniques
Using certain search techniques, searching time and effort
can be greatly reduced. The techniques are of the following types:
- General Search. A general search uses the query
operator OR to concatenate keywords, instructing the search engine to
locate pages that contain any of the keywords. The more
keywords one adds, the broader the search, the more
hits one is likely to get. But one is also likely to get many
irrelevant hits.
- Specific Search. A specific search uses the query
operator AND to concatenate keywords, instructing the search engine to
locate pages that contain all of the keywords. The more
keywords one adds, the narrower the search, the fewer
hits one is likely to get. The chances of getting a hit are low but
hits are likely to be relevant.
- Incremental Search. Incremental search allows one
to narrow and broaden the search progressively until one 'zeroes-in'
on the desired information. This process can, however, be time
consuming.
- Substring Search. Substring search allows one to
search for the occurrence of a keyword (substring) in any word (string)
in which it occurs. This process can simplify the query but is likely to
produce irrelevant hits.
- Search-and-Jump. Search-and-jump allows
one to search using the Find facility (such as the
Find button in Netscape Communicator or the
"/" command in Lynx) of browsers. One can choose a search
engine, set the number of hits to the maximum number possible, type the
query and start the search, and then use the Find facility to find specific
information in the Web page returned. This process is faster than
performing multiple queries, but it can still be time
consuming.
- Category Search. This allows one to search using
the categorized list of topics of subject guides (such as
Yahoo!,
Netscape Open Directory Project).
No query is required. The limitation here is that the list may not be
organized in a manner one might expect.
- Search and Rank. Some search engines (such as
AltaVista)
search and rank the results. Relevant hits are listed first; however,
effective ranking functions are still not known.
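Several of the techniques above can be compared side by side in a short sketch. The documents are invented, the function names are hypothetical, and the ranking shown (ordering by number of keyword occurrences) is only one simple possibility, not the method of any particular search engine.

```python
# Hypothetical documents (name -> text), made up for illustration.
DOCS = {
    "doc1": "javascript image rollover with javascript events",
    "doc2": "java applets and image maps",
    "doc3": "rollover effects in style sheets",
}


def words(text):
    """Split text into lowercased words."""
    return text.lower().split()


def general_search(docs, keywords):
    """OR: documents that contain any of the keywords."""
    wanted = {k.lower() for k in keywords}
    return {name for name, text in docs.items() if wanted & set(words(text))}


def specific_search(docs, keywords):
    """AND: documents that contain all of the keywords."""
    wanted = {k.lower() for k in keywords}
    return {name for name, text in docs.items() if wanted <= set(words(text))}


def substring_search(docs, fragment):
    """Documents in which the fragment occurs inside any word."""
    frag = fragment.lower()
    return {name for name, text in docs.items()
            if any(frag in word for word in words(text))}


def rank(docs, hits, keywords):
    """Order hits by total keyword occurrences, highest first."""
    wanted = {k.lower() for k in keywords}

    def score(name):
        return sum(1 for word in words(docs[name]) if word in wanted)

    # Ties are broken by name so that the order is reproducible.
    return sorted(hits, key=lambda name: (-score(name), name))
```

An incremental search then amounts to alternating between these calls, for example starting with general_search and adding keywords to specific_search until the list of hits is small enough to inspect.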
Table 3 summarizes the search strategies for some typical
(search) objects. See also
Getting the Most Out of Your Search Engine
for the specific features that you could consider when carrying out a
search using a global search engine.
| OBJECT | WEB VIEW | SEARCH TOOL | QUERY | TECHNIQUE |
| Topic | Library | Search Engines | Advanced | Category |
| Document | Library | Search Engines | Advanced | Specific Search |
| Software | Resource | Usenet FAQ | Advanced | Search-and-Jump |
| People | Community | Search Engines, Usenet | Advanced | Specific Search |
Table 3. Web Search Strategies.
The Nature of the Information on the Web
Effective searching strongly depends on how and in what form the
information itself is organized on the Web. The Web was originally
built for human consumption, and although everything on it is
machine-readable, this data is not necessarily
machine-understandable. Until now, HTML has been a universal
data format for publishing information on the Web. It was designed to
focus on structure, and therefore has very limited support for
attaching any "meaning" to the content of a document. The
development of
eXtensible Markup Language
(XML) addresses this problem to a large extent.
Metadata and RDF
Metadata
is "structured data about data." For example, a library
catalog is metadata, since it describes resources in the library. In
our context, it is "data describing Web resources." For
example, the HTML <META> tag describes the content
of a Web document. Metadata facilitates searching, helping authors to
describe their documents in ways that browsers, search engines and
robots can understand. As a result, users can have better document
discovery services available to them.
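As an illustration of how a robot might read such metadata, the following sketch collects the name and content attributes of <META> tags using Python's standard html.parser module. The sample document and the names MetaExtractor and extract_metadata are assumptions made for this example.

```python
from html.parser import HTMLParser

# Hypothetical document, made up for illustration.
SAMPLE = """<HTML><HEAD>
<TITLE>Searching the Web</TITLE>
<META NAME="description" CONTENT="An overview of Web search techniques.">
<META NAME="keywords" CONTENT="search engine, query, metadata">
</HEAD><BODY></BODY></HTML>"""


class MetaExtractor(HTMLParser):
    """Collect name/content pairs from <META> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag != "meta":
            return
        fields = dict(attrs)
        name = fields.get("name")
        content = fields.get("content")
        if name and content is not None:
            self.metadata[name.lower()] = content


def extract_metadata(document):
    """Return a dictionary of the metadata found in an HTML document."""
    parser = MetaExtractor()
    parser.feed(document)
    parser.close()
    return parser.metadata
```

A search engine could store the extracted description alongside the URL in its catalog and show it with each hit.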
Resource Description Framework
(RDF), which is an XML application, is a foundation for processing
metadata. It provides interoperability between applications that
exchange machine-understandable information on the Web. Till now,
automated processing of information on the Web has been very difficult
because of the sheer volume. RDF emphasizes facilities to enable
automated processing of Web resources.
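To give a flavour of what machine-understandable metadata looks like, the sketch below reads a small RDF description written in XML and lists its statements as (resource, property, value) triples. The sample description, the example.com URL and the choice of the Dublin Core vocabulary for the properties are assumptions made for illustration. The sketch handles only simple literal properties, not the full RDF syntax.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical description of a Web page (an assumption for this example).
SAMPLE_RDF = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.example.com/search.html">
    <dc:title>Searching the Web</dc:title>
    <dc:creator>A. N. Author</dc:creator>
  </rdf:Description>
</rdf:RDF>"""


def triples(rdf_xml):
    """Return (resource, property, value) triples for simple literal properties."""
    root = ET.fromstring(rdf_xml)
    result = []
    for description in root.findall("{%s}Description" % RDF_NS):
        resource = description.get("{%s}about" % RDF_NS)
        for prop in description:
            # prop.tag has the form "{namespace}localname".
            result.append((resource, prop.tag, (prop.text or "").strip()))
    return result
```

Because each property name is qualified by a namespace, two applications that agree on the vocabulary can exchange such statements without guessing what "title" or "creator" means.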
RDF's syntax specification became a W3C recommendation
on February 22, 1999. As pointed out in the specification, RDF could
enhance our searching capabilities in a variety of ways: in
resource discovery to provide better search engine
capabilities, in cataloging for describing the content and
content relationships available at a particular Web site, by
intelligent software agents to facilitate knowledge sharing
and exchange, in content rating, and in describing
collections of pages that represent a single logical
"document."
However, widespread acceptance and use of both XML (as a data format that
incorporates "meaning") and RDF (as a metadata framework),
along with the requisite software for processing them,
will be necessary for this to become a reality.
Conclusion
Towards ... Eureka!
Searching the Web is an art as well as a science. There are no
perfect solutions; each has its own advantages and
limitations. There is also no substitute for human experience. But a
blend of that with intuition and knowledge of some basic techniques
can make the journey both enjoyable and fruitful.
In "searching for something", the "something" (in
our case information) is as important as the methodology of searching
itself. It is expected, at least from the perspective of a
computer-based search, that the organization of information on the Web
(and thus its searching) will improve as the use of XML and RDF
becomes ubiquitous.
Acknowledgements
The statistical graph presented in Figure 1 is from
GVU Center's WWW User Surveys
of GTRC and the GVU Center. It is "Copyright 1994-1998 Georgia
Tech Research Corporation. All rights Reserved" and its use is
hereby acknowledged.
References
- Yahoo! - List of search engines.
- Deja News - Usenet search.
- List of Lists - Mailing list lists.
- People Search: Who Where? (E-Mail address search), Personal Pages Worldwide (Personal Home Page lists), and BigYellow (Telephone directories).
- Reference Search: Research-It! (General reference) and Webster's Dictionary (Specific reference).
- SWISH, WWWWAIS, freeWAIS, Glimpse, Harvest (Unix; Freeware/Shareware); Excite (Windows NT/95, Unix; Commercial) - Document indexing and text searching software.
- NetResearch: Finding Information Online, By Daniel J. Barrett, O'Reilly & Associates, 1997. A good introduction to the art and science of searching.
- Search Engine Watch - This site contains information about how search engines work, comparisons of search engines, recent news about changes in search engine, and more.
- Metadata and Resource Description - W3C.
- RDF - What's in it for us? - By Janus Boye. A lucid introduction to RDF.