How Search Engines Work
Internet search engines are special sites on the Internet that are designed to help people find information stored on other sites. There are differences in the ways various search engines work, but they all perform three basic tasks:
They search the Internet -- or select pieces of the Internet -- based on important words.
They keep an index of the words they find, and where they find them.
They allow users to look for words or combinations of words found in that index.
Early search engines held an index of a few hundred thousand pages and documents, and received maybe one or two thousand inquiries each day. Today, a top search engine will index hundreds of millions of pages, and respond to tens of millions of queries per day.

Spidering
Before a search engine can tell you where a file or document is, that file or document must first be found. To find information on the hundreds of millions of web pages that exist, a search engine employs special software robots, called spiders, to build lists of the words found on websites.
When a spider is building its lists, the process is called crawling.
In order to build and maintain a useful list of words, a search engine's spiders have to look at a lot of pages. How does any spider start its travels over the web? The usual starting points are lists of heavily used servers and very popular pages. The spider will begin with a popular site, indexing the words on its pages and following every link found within the site. In this way, the spidering system quickly begins to travel, spreading out across the most widely used portions of the web.
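The crawl described above can be sketched as a breadth-first walk over links. This is only an illustration under stated assumptions: the "web" here is a small in-memory dictionary (`TOY_WEB`), the URLs are made up, and the function name `crawl` and the `max_pages` limit are hypothetical. A real spider would fetch pages over the network and parse their HTML.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "popular.example/": ("search engines index words",
                         ["popular.example/a", "other.example/"]),
    "popular.example/a": ("spiders crawl pages", ["popular.example/"]),
    "other.example/": ("words live in an index", []),
}

def crawl(seeds, web, max_pages=100):
    """Start from the seed pages, list each page's words, follow every link."""
    queue = deque(seeds)   # pages waiting to be visited
    seen = set(seeds)      # pages already queued, so none is visited twice
    words_by_url = {}
    while queue and len(words_by_url) < max_pages:
        url = queue.popleft()
        if url not in web:
            continue       # a link that leads nowhere in the toy web
        text, links = web[url]
        words_by_url[url] = text.lower().split()
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return words_by_url

pages = crawl(["popular.example/"], TOY_WEB)
```

Starting from a single popular page, the sketch reaches every page linked from it, which mirrors how a spider spreads outward from heavily used sites.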


Indexing
Once the spiders have completed the task of finding information on web pages, the search engine must store the information in a way that makes it useful. There are two key components involved in making the gathered data accessible to users:
The information stored with the data
The method by which the information is indexed
In the simplest case, a search engine could just store the word and the URL where it was found. In reality, this would make for an engine of limited use, since there would be no way of telling whether the word was used in an important or a trivial way on the page, whether the word was used once or many times, or whether the page contained links to other pages containing the word. In other words, there would be no way of building the ranking list that tries to present the most useful pages at the top of the list of search results.
To make for more useful results, most search engines store more than just the word and URL. An engine might store the number of times that the word appears on a page. The engine might assign a weight to each entry, with increasing values assigned to words as they appear near the top of the document, in sub-headings, in links, in the meta tags or in the title of the page. Each commercial search engine has a different formula for assigning weight to the words in its index. This is one of the reasons that a search for the same word on different search engines will produce different lists, with the pages presented in different orders.
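A minimal sketch of storing a count and a weight with each word and URL might look like the following. The weight values in `FIELD_WEIGHTS`, the field names, and the sample pages are all assumptions chosen for illustration; as noted above, each engine uses its own formula.

```python
from collections import defaultdict

# Hypothetical weights: words in a title count more than words in the body.
FIELD_WEIGHTS = {"title": 5.0, "heading": 3.0, "body": 1.0}

def build_index(pages):
    """Map each word to {url: {"count": occurrences, "weight": summed weight}}."""
    index = defaultdict(dict)
    for url, fields in pages.items():
        for field, text in fields.items():
            for word in text.lower().split():
                entry = index[word].setdefault(url, {"count": 0, "weight": 0.0})
                entry["count"] += 1
                entry["weight"] += FIELD_WEIGHTS.get(field, 1.0)
    return index

sample_pages = {
    "a.example": {"title": "Hash tables",
                  "body": "hash tables store entries by hash value"},
    "b.example": {"title": "Cooking", "body": "a hash of potatoes"},
}
index = build_index(sample_pages)
```

Here the word "hash" appears on both pages, but the first page uses it in the title and twice in the body, so its entry carries a larger count and weight. Changing the numbers in `FIELD_WEIGHTS` changes the stored weights, which is one way two engines could order the same pages differently.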
An index has a single purpose: It allows information to be found as quickly as possible. There are quite a few ways for an index to be built, but one of the most effective ways is to build a hash table. In hashing, a formula is applied to attach a numerical value to each word. The formula is designed to evenly distribute the entries across a predetermined number of divisions. This numerical distribution is different from the distribution of words across the alphabet, and that is the key to a hash table's effectiveness.
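The idea can be sketched with a small hand-built hash table. The choice of `zlib.crc32` as the hashing formula, the bucket count of 8, and the names `bucket_for` and `HashIndex` are assumptions made for this example, not a description of any particular engine.

```python
import zlib

NUM_BUCKETS = 8  # the predetermined number of divisions; tiny for illustration

def bucket_for(word, num_buckets=NUM_BUCKETS):
    """Attach a numerical value to the word and map it to one division."""
    return zlib.crc32(word.encode("utf-8")) % num_buckets

class HashIndex:
    def __init__(self, num_buckets=NUM_BUCKETS):
        self.buckets = [[] for _ in range(num_buckets)]

    def add(self, word, url):
        bucket = self.buckets[bucket_for(word, len(self.buckets))]
        for entry_word, urls in bucket:
            if entry_word == word:
                urls.add(url)      # word already present: record another URL
                return
        bucket.append((word, {url}))

    def lookup(self, word):
        # Only one bucket is scanned, however many words the index holds.
        bucket = self.buckets[bucket_for(word, len(self.buckets))]
        for entry_word, urls in bucket:
            if entry_word == word:
                return urls
        return set()

hash_index = HashIndex()
hash_index.add("spider", "a.example")
hash_index.add("spider", "b.example")
hash_index.add("search", "a.example")
```

Words that start with the same letter, such as "spider" and "search", are placed by their hash value rather than by their spelling, so common initial letters do not crowd a single division.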
The Search Engine Program
The search engine software or program is the final part. When a person requests a search on a keyword or phrase, the search engine software searches the index for relevant information. The software then provides a report back to the searcher with the most relevant web pages listed first.
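As a rough sketch of this last step, the function below looks up each query word in a toy index and lists the matching pages with the highest combined weight first. The index contents, the rule that a page must contain every query word, and the scoring by summed weight are all assumptions for illustration.

```python
def search(index, query):
    """Return URLs containing every query word, most relevant first."""
    words = query.lower().split()
    if not words:
        return []
    # Keep only the pages that contain all of the query words.
    candidates = set(index.get(words[0], {}))
    for word in words[1:]:
        candidates &= set(index.get(word, {}))
    # Score each page by the sum of its stored weights for the query words.
    scored = [(sum(index[w][url] for w in words), url) for url in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored]

# Hypothetical index: word -> {url: weight}.
toy_index = {
    "hash": {"a.example": 7.0, "b.example": 1.0},
    "tables": {"a.example": 6.0},
}
results = search(toy_index, "hash tables")
```

A query for "hash" alone would list both pages, with the more heavily weighted one first; adding "tables" narrows the report to the single page that contains both words.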
