The Search Engine Unveiled
By Bill Spencer

This white paper describes how a search engine works. It's meant for the technically curious, and explains in general terms the basic concepts and some of the theory.

It's also meant to entice you into using a search engine on your website, or other application, if you are not already doing so. A search function on a website is mandatory nowadays. And since good indexing engines are available and require less than a day's effort to set up, you shouldn't embarrass yourself with a mediocre linear search. Get the real thing. It'll pay big dividends in making your site look and feel professional.

If you find this paper interesting and want to look at the actual guts of a search engine, Phase2 does provide a source code license to its search engine for a modest fee. If you hire us to set up the engine for you, then the source code license is free.

Table of Contents

Introduction
Linear Searches: grep and the SQL LIKE Clause
Category Searches: SQL Databases
Indexed Keyword Searches
The Advanced Search and Query Expressions
Stop Words and Stemming
Scoring
TF-IDF Scoring
A Scoring Example
Relevance
Synonyms and Related Words
Search Engines that Learn
Gathering Documents: WWW Spiders and Robots
Network Services Interface
Example Installation: An On-Line Store
Summary

Introduction

It is impressive to use a search engine like Google, Lycos, or Northern Light that searches through hundreds of millions of documents in a fraction of a second and finds exactly what you are looking for.

But even the fastest computer made could not search through that much text in so short a time. So how do these search engines do it? How do they come up with related text that may not even contain the keyword I was searching for?

These are some of the questions tackled by this paper. It will not tell you how to get your web page to be the top listing on a particular search engine. However, it will make clear why it is difficult to get it listed at the top, and what the underlying issues and complexities are in determining these rankings.

This paper begins with a basic description of important concepts in searching. It starts with the easy-to-implement linear search, moves on to category searches and indexed keyword searching, and finally arrives at a real search engine that uses relevance and other more advanced features.

Linear Searches: grep and the SQL LIKE Clause

When we think of searching text, we imagine scanning through document after document looking for certain keywords or phrases, and checking off the documents in which the keywords appear. Computers can perform this process far more quickly than humans, and for searches up to a certain size this is an effective way of searching. One of the most useful computer programs I know, the UNIX utility grep, is simply this. It scans text files seeking out regular expression patterns, and reports back to the user the files that contain the text. The Windows equivalent is the Find in Files option in recent versions of the Windows Explorer. I often forget what file I was recently working on, or am poring through someone else's work and just need to find a certain section of text. grep works great for that.

There are two problems with grep. First, it scans through every character of every document. This is time consuming: once you exceed 3 MB of text the delays become uncomfortable. Second, it is indiscriminate in its search. The sequence of letters 'tea' finds lines containing the word tea, but also steak, and many others.
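To make this concrete, here is a minimal sketch, in Python, of the linear scan grep performs (the 'docs' directory and the choice of *.txt files are hypothetical). Note how the plain pattern 'tea' also matches 'steak', while a word-boundary pattern does not:

    # A grep-style linear scan, for illustration only. It reads every
    # character of every file, which is exactly why this approach slows
    # down as the collection grows.
    import re
    from pathlib import Path

    def linear_search(root, pattern):
        """Return the files under root whose text matches pattern."""
        regex = re.compile(pattern)
        hits = []
        for path in Path(root).rglob("*.txt"):
            if regex.search(path.read_text(errors="ignore")):
                hits.append(str(path))
        return hits

    print(linear_search("docs", r"tea"))      # matches 'tea', 'steak', ...
    print(linear_search("docs", r"\btea\b"))  # matches only the word 'tea'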

These same problems plague searches of unstructured data in databases. For example, when companies leave space for you to fill in comments about their products, those comments wind up in an unstructured text field. Later, a marketing person may search the text fields for catchy phrases like "doesn't work" or "stinks" to get a feel for how people felt about the product. However, not only is this slow, but there are dozens of phrases which indicate a negative or positive reaction to a product. A simple SQL SELECT statement using a LIKE clause will do a linear scan through this field, locate specific words or phrases, and return the database records whose fields contain these phrases.
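As a sketch of what such a search looks like in practice, here is a LIKE-clause scan run from Python against SQLite (the feedback table and its contents are hypothetical):

    # A linear database scan using a SQL LIKE clause, for illustration.
    # The database engine must examine the comment text of every row.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE feedback (id INTEGER PRIMARY KEY, comment TEXT)")
    conn.executemany(
        "INSERT INTO feedback (comment) VALUES (?)",
        [("The installer stinks.",), ("Works great!",), ("It doesn't work.",)],
    )

    # Find every record whose free-text comment contains 'stinks'.
    rows = conn.execute(
        "SELECT id, comment FROM feedback WHERE comment LIKE '%stinks%'"
    ).fetchall()
    print(rows)  # [(1, 'The installer stinks.')]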

Since the linear database search is readily built by an average web developer, the search on many websites is nothing more than a SQL-based search. A quick survey of on-line stores shows the flaws and frustrations associated with these searches: they are slow and highly inaccurate, and they return the right record only if you are looking for a sequence of letters that is almost unique.

In the case of the marketing person prowling through the comments field, the information being sought is probably along the lines of "How many people feel negatively towards our product, and why?", not how many people say it "stinks". Modern search technology, through the use of linguistics and relevance, can answer the right question here.

In summary, the linear search is slow, inaccurate, and doesn't allow for rating the quality of the responses it finds. However, it does allow complicated regular expressions to be located within text files. While it may seem that this type of search should be rejected altogether, there are reasons to use it: it is simple to set up, and there are cases where combining a linear search with an indexed search will accurately pinpoint your result.

Category Searches: SQL Databases

This chapter shouldn't be here, but so many people seek to improve their linear database searches by using categories and structured information that it must be mentioned. I can't tell you how many times I have sat with marketing and engineering people who argue over the details of these improvements, which only marginally improve their search. The energy put in here would be far better spent implementing an indexed text search with relevance.

The first attempted improvement is to take information that is better described in free text, and try to add a lot of structure to it.

For example, with our marketing example above, we could break the ratings down into several different areas, such as installation, ease of use, and support, and ask for rankings such as "wonderful", "okay", "ambivalent", "not so good", and "stinks". Now our database, which is good at processing structured data, gives a more accurate picture of what the user thought of the product. We can quickly find those who think the installation process "stinks". However, we have lost the information about why the user thought the installation "stinks". Invariably this means another field will be added for comments on each of these rating categories, and that is still free text. The ultimate result is a multi-page questionnaire with lots of free text fields, which is even more difficult to understand and less desirable to fill out.

The second attempted improvement is to categorize the information. This puts all the info into appropriate bins, so that you can search by descending a tree hierarchy. Unfortunately, the casual searcher does not always understand how the categories work. This is similar to the yellow pages: to find the number for a local pizza shop, you need to know whether the category is 'pizza', 'restaurant', 'restaurants Italian', or 'food'. After checking several categories you may find what you want, but it leaves you wondering whether you missed something.

Both of these fixes to a linear search require a lot of work, and you still don't end up with what you want. The result may be better, but you still have confused users and overlooked information. The better solution is to let the computer do the work, allowing users the freedom to specify what they are searching for and to create related categories on their own. This is what a real free-text search engine does, as described below.

Indexed Keyword Searches 

The first step of a real text-based search engine is to create an index of words. It is very similar in concept to the index you find at the back of a book: you look up a word and find the page numbers it appears on. In the same way, a search engine indexes every word it can find in a document.

Imagine that you have a pile of documents, thousands of them. You go through each one, indexing each word and recording which documents that word appears in. You could also record its position in each document, which gives you an additional feature or two, but we will leave that for later.

Now, to find all documents which contain the name 'Clinton', we look up 'Clinton' in the index, and there is our list of documents containing 'Clinton'. That was fast. Now you can begin to see how those search engines can be so fast: the indexing, tedious as it was, did most of the work before the search even occurred.
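Here is a minimal sketch of such an inverted index in Python (the three sample documents are hypothetical). Building the index is the tedious part; the lookup afterwards is a single dictionary access:

    # Build an inverted index: word -> set of document ids.
    from collections import defaultdict

    docs = {
        1: "bill clinton spoke today about the budget bill",
        2: "hillary clinton gave a speech",
        3: "monica testified yesterday",
    }

    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)

    # Finding every document that mentions 'clinton' is now one lookup.
    print(index["clinton"])  # {1, 2}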

The Advanced Search and Query Expressions

Let's make our search more complicated. We're going to search for the documents that contain 'Bill' and 'Clinton'. We use our index and get two lists of documents: one listing all documents containing 'Bill', and one listing all documents containing 'Clinton'. If we take the intersection of those two lists, we get a smaller list of documents that contain both words. The resulting list is our search result.

Now let's be tricky with our search query. We're going to find those documents that contain 'Bill' or 'Hillary' and the word 'Clinton', but do not contain 'Monica'. Here's how the search works. For each word, we find a list of documents. We take the union of the lists for 'Bill' and 'Hillary', and intersect the result with the list for 'Clinton'. From the resulting list, we then subtract out any documents that contain 'Monica'.

The last search query can be written as an expression:

(Bill OR Hillary) AND Clinton AND NOT Monica

You have probably seen search engines that understand this type of expression. Advanced search expressions such as these reduce to a simple algebra using sets of documents. This is a powerful way to isolate just those documents we want.
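Using the inverted index sketched above, the whole expression reduces to three set operations on the posting sets (union for OR, intersection for AND, difference for AND NOT):

    # (Bill OR Hillary) AND Clinton AND NOT Monica, as set algebra
    # over the posting sets of the inverted index built above.
    result = ((index["bill"] | index["hillary"]) & index["clinton"]) - index["monica"]
    print(result)  # {1, 2}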

Stop Words and Stemming

There are two easy modifications to the index that improve the search: removal of stop words, and stemming of words to their roots.

Stop words are words like 'and', 'the', 'it', 'our', and 'for' that are not useful for searching. They appear in practically every document, and don't help isolate the unique documents a user is looking for. When we are indexing, they are pulled out altogether and ignored. Typically, an indexing engine will have 300-500 stop words (for the English language) that it ignores.

Stemming is the process of removing the endings of words so that they can be identified as having the same root. The words "search", "searching", "searches", and "searched" all have the same stem 'search', and all of them should be found when searching for the word "search".

An algorithm developed by Porter is often used to stem words. While effective for most common stems, it does make a number of errors. For example, the stem of the word 'engine' is 'engin', which is also the stem of 'engineering' and 'engineer'. These words should not be equivalent in a search. Typical Porter stemming implementations therefore allow for exceptions, so we can explicitly mark 'engineer' as an exception.
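As a sketch of both ideas together, here is a tokenizer that drops stop words and stems whatever remains, using the Porter stemmer from the NLTK library (assuming NLTK is installed; the tiny stop list and exception list are illustrative, not the 300-500 entries a real engine carries):

    # Tokenize text for indexing: drop stop words, stem the rest, and
    # honor an explicit exception list for words Porter over-stems.
    from nltk.stem.porter import PorterStemmer

    STOP_WORDS = {"and", "the", "it", "our", "for", "of", "a"}
    EXCEPTIONS = {"engineer": "engineer", "engineering": "engineer"}
    stemmer = PorterStemmer()

    def tokenize(text):
        tokens = []
        for word in text.lower().split():
            if word in STOP_WORDS:
                continue  # stop words are never indexed
            tokens.append(EXCEPTIONS.get(word, stemmer.stem(word)))
        return tokens

    print(tokenize("searching for the searches"))     # ['search', 'search']
    print(tokenize("the engineering of the engine"))  # ['engineer', 'engin']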

Scoring

Our search now works for finding a list of documents that conforms to an advanced search expression. However, when our search produces 1,000 documents, we need some method for determining that one document is a better match to our search than another. This introduces the concept of scoring.

Assigning a score to every word in a document is the beginning of determining relevance. For the purpose of a search, relevance is a measure of the quality of the match between the search string and the document found.

The simplest score to assign is the number of times a word appears in a document divided by the number of words in the document. This is given the name term frequency (TF). But you might also want to assign a slightly higher score to a word that appears in the first paragraph or the title. And consider a document of 200 words in which one search word appears three times, versus another document of 200 words in which three distinct search words each appear once: you would pick the latter as the more likely successful match.

To account for this, every search engine has its own way of weighting term frequencies. Words found in a title, a first paragraph, or an abstract might all carry different weights. This variety of weights is one reason why it is impossible to predict where your document will end up in a search engine's list; term frequency scoring is a complex, closely guarded secret at each of these search engines.

The score you assign to a search word also depends on the number of documents you search. For example, a word that appears in every document is a useless search word. In fact, it should probably be added to the list of stop words and ignored, since it won't help you narrow down the search. On the other hand, a word that appears in only a handful of documents is the best kind of search word: it immediately narrows the search to that handful of documents. So yet another adjustment to the score is in order, one which reduces the importance of a word based on how many documents that word appears in.

Finally, there may be a reason to weight an entire document more heavily than others. Perhaps it contains lots of links, or there are lots of links that point at it. Or, in an on-line store, it may describe a product you are trying to push and want to score higher in searches. One of the greatest features of Google is the weight it gives to pages that have lots of links pointing at them. This assures that the most "popular" pages have the greatest weight, and are found at the top of the list.

TF-IDF Scoring

One of the most common methods of determining scores in a search is called TF-IDF, for Term Frequency / Inverse Document Frequency. TF-IDF is a heuristic approach to scoring that seems to work effectively, though the reasons why it is effective are not completely understood.

The inverse document frequency is

IDF = 1 - log(DF) / log(TD)

where TD is the total number of documents in the collection, and DF is the number of documents in which a certain word appears. This is a nice function that gives a strong weighting to words that appear relatively infrequently, and a low weighting to those that appear frequently. For example, words that appear in all the documents will be given an IDF score of zero, and words which appear in only a few documents will have an IDF score approaching 1.

The term frequency, TF, is the number of times a word appears in a particular document. As mentioned above, the TF may be modified in any number of ways based on the individual weighting of different parts of the document.

The TF-IDF score of a particular keyword matching a particular document is the product:

TF * IDF
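A minimal sketch of TF-IDF over the toy corpus from the indexing example (natural logarithms are used; any base works, since the formula is a ratio of logs):

    # TF-IDF scoring, following the formulas above:
    #   TF  = occurrences of the word in a document / words in the document
    #   IDF = 1 - log(DF) / log(TD)
    import math

    docs = {
        1: "bill clinton spoke today about the budget bill",
        2: "hillary clinton gave a speech",
        3: "monica testified yesterday",
    }

    def tf(word, doc_id):
        words = docs[doc_id].split()
        return words.count(word) / len(words)

    def idf(word):
        df = sum(1 for text in docs.values() if word in text.split())
        return 0.0 if df == 0 else 1 - math.log(df) / math.log(len(docs))

    def tf_idf(word, doc_id):
        return tf(word, doc_id) * idf(word)

    # 'clinton' appears in 2 of 3 documents, so its IDF is modest;
    # 'bill' appears in only 1 document, so its IDF is the maximum, 1.0.
    print(idf("clinton"))     # 1 - log(2)/log(3), about 0.37
    print(tf_idf("bill", 1))  # (2/8) * 1.0 = 0.25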

A Scoring Example

My favorite scoring algorithm starts by counting how many unique sentences within the document a word appears in. Then I take the square root of that count. Then I take the values for all the words in the document and "normalize" them, such that the sum of the squares of all the term frequency scores for the document equals 1. This is my term frequency score for a document. It avoids the problem of documents that use one word over and over, giving relatively less weight to each additional time a word appears in the document. The normalization automatically adjusts for the total size of the document. Before normalizing, I also take into account different weightings for different parts of the document. Further, if I want to give the entire document special weighting, instead of normalizing to 1 I may decide to normalize to 2, giving that document twice the normal weight.

When doing an actual search, I treat the search string as a document and perform the same scoring calculation on it. Then, during the search, I take the product of the score of each search word and the score of that word in the document, adjust by the IDF of the word, and sum that over all the search words.
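Here is a sketch of that scheme: the per-word weight is the square root of the number of sentences containing the word, normalized so the squares sum to 1, with the IDF adjustment applied at search time (the three-document corpus is hypothetical; readers may recognize the final sum as a cosine-style similarity between weighted term vectors):

    # The scoring scheme described above, as runnable Python.
    import math

    DOCS = [
        "bill clinton spoke today. the budget bill passed.",
        "hillary clinton gave a speech.",
        "monica testified yesterday.",
    ]

    def idf(word):
        df = sum(1 for t in DOCS if word in t.replace(".", " ").split())
        return 0.0 if df == 0 else 1 - math.log(df) / math.log(len(DOCS))

    def term_vector(text):
        sentences = [s.split() for s in text.lower().split(".") if s.strip()]
        weights = {}
        for word in {w for sent in sentences for w in sent}:
            # square root of the number of unique sentences with the word
            weights[word] = math.sqrt(sum(1 for s in sentences if word in s))
        norm = math.sqrt(sum(w * w for w in weights.values()))
        return {word: w / norm for word, w in weights.items()}

    def score(query, doc):
        q, d = term_vector(query), term_vector(doc)
        return sum(q[w] * d[w] * idf(w) for w in q if w in d)

    # Searching with a document as its own query scores close to 1;
    # without the IDF adjustment it would be exactly 1.
    print(score(DOCS[0], DOCS[0]))  # about 0.92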

Relevance

If you take the preceding example, and work through it carefully, you will find that a search for a document using that same document as the search string will yield a search score very close to 1. In fact, were it not for the IDF adjustment, the score would be 1, which is the highest possible score.

We call this score the measure of search relevance.

When we talk about relevance of one document to another, we can take a measure of the relevance just by multiplying the normalized term frequency scores of the two documents together, and summing them. The relevance score will then be some number between 0 and 1, with 1 being the score for identical documents, and 0 for documents that have nothing in common.

To present the results, we can now combine the relevance scoring with our query expression to produce a sequence of documents that satisfy the query, and are ordered by the relevance score.

Once you have relevance and the ability to process a search expression, you have a pretty good search engine. There are many other features that can be added to improve the accuracy of the search or provide more functionality.

Synonyms and Related Words

Content searches are about finding documents with relevant content, not just words that match certain keywords. A search through a web store for kitchen knives may miss a whole set of cutlery products, just because the word 'knives' is not mentioned in the description of the cutlery.

Augmenting a search by giving relevance to related search words is one way of solving this problem. Some search engines have a list of synonyms that is scanned prior to the actual search, and synonyms are automatically added to the search string but given a much lower search weight.

Then the search will find documents with the synonyms, although they might have a lower score.

An improvement to this method is to examine regularly what people are searching for, and to create a list of related words that can augment those same searches in the future, more accurately finding the best set of documents to return. Related words may be synonyms, but could be much more than that: some related words for Boston might be Massachusetts and New England.
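A sketch of this kind of query expansion (the related-words table and its weights are hypothetical; a real engine would curate or learn them):

    # Expand a query with related words at reduced weight, so documents
    # matching only the related words still surface, just scored lower.
    RELATED = {
        "knives": [("cutlery", 0.5), ("blades", 0.4)],
        "boston": [("massachusetts", 0.5), ("new england", 0.3)],
    }

    def expand(query_terms):
        """Return (term, weight) pairs: originals at 1.0, related lower."""
        weighted = [(term, 1.0) for term in query_terms]
        for term in query_terms:
            weighted.extend(RELATED.get(term, []))
        return weighted

    print(expand(["kitchen", "knives"]))
    # [('kitchen', 1.0), ('knives', 1.0), ('cutlery', 0.5), ('blades', 0.4)]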

Search Engines that Learn

The human mind is one of the best examples of a search engine: with limited input we can recall events of the past, piece together related events, and pull together ideas that direct our current actions. As we grow and learn, the associations of the past become more refined; the feedback we get, both positive and negative, guides those associations.

One of the big challenges for search engines is to mimic this learning behavior: use positive feedback from successful searches, and negative feedback from unsuccessful ones, to improve the searches of the future.

What we want to do is build a network of associations between words that is guided by the success and failure of searches. We have a scoring mechanism in the relevance search described above. We have a way of modifying the search in a positive or negative direction by adding related words and synonyms with positive and negative weights. But we still need to be able to provide feedback in a useful way.

Here's a specific model. For each search, an association is made between a search keyword and the documents found. Preserve that relationship in an index, and preserve the search query itself as a document in another index. When a success occurs (perhaps measured as a document being clicked on, or the user never returning for another search), increment by a small value the score for the successful document in the relationship index, and increment by a small value the score of the query as a document in the query index. You could also decrement the scores if there is measurable negative feedback.

Now, when another user performs a search, first run the search against the query index; take the most relevant past query that is not identical, and add its words as related words for the new search. Second, run the search against the relationship index for highly relevant responses. When the actual search is run, adjust the scores of the documents by mixing in the scores of the relevant responses from the relationship index. This causes documents that were successful in the past to be promoted with a higher score.
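A sketch of the feedback bookkeeping (the click signal, the 0.1 increment, and the way the boost is mixed into the score are all hypothetical choices):

    # Learn from feedback: strengthen (term, document) associations on
    # success, weaken them on failure, and mix the result into scoring.
    from collections import defaultdict

    relationship = defaultdict(float)  # (term, doc_id) -> learned boost
    LEARNING_RATE = 0.1

    def record_feedback(query_terms, doc_id, success):
        delta = LEARNING_RATE if success else -LEARNING_RATE
        for term in query_terms:
            relationship[(term, doc_id)] += delta

    def adjusted_score(base_score, query_terms, doc_id):
        boost = sum(relationship[(term, doc_id)] for term in query_terms)
        return base_score + boost

    # A user searched for 'kitchen knives' and clicked document 42:
    record_feedback(["kitchen", "knives"], 42, success=True)
    print(adjusted_score(0.35, ["kitchen", "knives"], 42))  # 0.55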

Over time the engine will learn what the successful searches are. Some of you may recognize this as a neural net approach to computer learning. I think this is a fine example of how that technology can be put to practical use.

Gathering Documents: WWW Spiders and Robots

Up until now we haven't said anything about what the documents we are indexing are. They could be descriptions in a database, a collection of Microsoft Word documents, customer support question and answer sessions, or web pages from thousands of unrelated web sites.

Most people encounter search engines when searching the web. The documents indexed by this kind of engine are collected from the web by a program commonly nicknamed a 'spider' or a 'robot'. A spider does not actually go out onto the web; it executes on one or more machines attached to a site like Google or Yahoo. It behaves just like your browser in collecting a document, except that instead of displaying the document on a screen, it sends the document to the search engine to be indexed. A spider can also 'crawl': it collects the links found on a page, and then retrieves and indexes the pages behind those links as well.
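A minimal spider sketch (the seed URL is hypothetical, and a real spider also respects robots.txt, rate limits, and much more):

    # A minimal web spider: fetch a page, hand it to the indexer, queue
    # the links found on the page, and repeat.
    import re
    from urllib.parse import urljoin
    from urllib.request import urlopen

    def crawl(seed_url, index_document, max_pages=10):
        queue, seen = [seed_url], set()
        while queue and len(seen) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url).read().decode("utf-8", errors="ignore")
            except OSError:
                continue  # unreachable page; skip it
            index_document(url, html)  # send the page off to be indexed
            for href in re.findall(r'href="([^"]+)"', html):
                queue.append(urljoin(url, href))  # this is the 'crawling'

    crawl("http://example.com/", lambda url, html: print("indexed", url))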

Network Services Interface

A search engine, because it tends to monopolize the resources of a computer, is often run as a stand-alone system. Even on smaller websites, where a local search engine runs on the same machine as the application server, it is a good idea for scalability reasons to run it as if it were a stand-alone system.

This means that the search engine should be implemented as a network service. In this way it is independent of programming languages, proprietary database interfaces, and other problems that often get in the way of quick implementations. When run on a stand-alone computer, it can be viewed as an appliance rather than a complex computer program. This eliminates the operating system dependencies often encountered during implementation: you just turn it on, assign the computer a name, and it is ready to be used as a network service.

The service interface utilizes HTTP as its base protocol, and XML (with options for HTML and plain text) for transferring data. XML provides the most flexibility for manipulating the data that results from a search; the HTML interface, on the other hand, can quickly provide a screen display with minimal programming on the part of the web developer. SOAP, a specific protocol format built on HTTP and XML, is a relatively new standard, but expect the service interfaces of search engines to use SOAP in the future.
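As a sketch of what calling such a service might look like from the application side (the host name, query parameters, and XML layout are hypothetical; a real engine documents its own interface):

    # Query a search service over HTTP and read back XML results.
    import xml.etree.ElementTree as ET
    from urllib.parse import urlencode
    from urllib.request import urlopen

    def search(query, host="http://search-appliance:8080"):
        params = urlencode({"q": query, "max": 10})
        with urlopen(f"{host}/search?{params}") as response:
            tree = ET.parse(response)
        # hypothetical layout: <results><hit score="0.87"><url>...</url></hit>...
        return [(hit.findtext("url"), float(hit.get("score")))
                for hit in tree.iter("hit")]

    for url, score in search("kitchen knives"):
        print(f"{score:.2f}  {url}")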

Example Installation: An On-Line Store

As an example that is not a web search engine, suppose you have a web store with an inventory of 10,000 products. The information about these products is stored in a database, and a general product information page for each product is generated dynamically from the database. You have been running a search process using a standard SQL linear search, and your customers are complaining that the search takes too long and finds a lot of stuff they weren't looking for. You've generated a hierarchy of categories to help lead them to the right pages, but that seems cumbersome, and users often get lost along the way.

It is decided to bring in a real indexed search.

Step one of the process is to install a search engine computer on the network, preferably behind the firewall, beside the application server computer.

Step two of the process is to provide a service interface that allows the data in the database to be indexed. (Note: do not be bashful about the amount of data here. Even if you have 1,000,000 pages to be indexed, this will all be handled electronically, and the computer doesn't care about the page size.) You can estimate the time for indexing by assuming that the indexer will consume 20 MB to 200 MB per hour. Assuming each page is about 2 KB, 10,000 pages will take less than an hour, and 1,000,000 pages could take about 4 days at the slower rate. For search engines using a spider, the retrieval time of documents across the web is the rate-limiting step. Updating documents in the index, and deleting documents from it, are actually the most time-consuming operations, often 10 to 100 times slower than adding a new document.
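The arithmetic behind those estimates, as a quick sketch (the 2 KB page size and the 20-200 MB per hour throughput are the assumptions stated above):

    # Back-of-the-envelope indexing time from the figures above.
    PAGE_KB = 2  # assumed average page size

    def indexing_hours(pages):
        """Return (fast, slow) hours at 200 and 20 MB per hour."""
        total_mb = pages * PAGE_KB / 1024
        return total_mb / 200, total_mb / 20

    fast, slow = indexing_hours(10_000)
    print(f"10,000 pages: {fast:.1f} to {slow:.1f} hours")  # 0.1 to 1.0
    fast, slow = indexing_hours(1_000_000)
    print(f"1,000,000 pages: up to {slow / 24:.1f} days")   # about 4.1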

Step three of the process is to configure the search engine server. First we create an index or multiple indices. There may be a number of other implementation specific parameters that need to be set.

Step four is to integrate this new search seamlessly into your website, and integrate into the administrative applications for adding and deleting products. The time required for this step will vary considerably depending upon your needs. A simple integration of the default search presentation into an existing search page may take less than 1 day. Handling multiple search types, including an advanced search page, or specifying categories requires more customization.

Summary

Adding a search engine to a website, intranet, or support network can add enormously to the usefulness of a system. There are a lot of complex processes that underlie the search operation, but a number of search engines are readily available that hide all of the complexity yet yield excellent results for user searches.

Using a search engine already set up as a web service could mean that you have a search engine for your application running in less than a day, so long as a friend is hosting the web service. Otherwise you have to set up your own. Don't be intimidated by it. You'll be amazed by the results, and if you do it yourself, you'll find broader uses for it than you ever first imagined.

Phase2 developed its own search engine, which is highly customizable, fast, and contains all the features of the high-priced engines. We invite you to try it by e-mailing us at bill@p2software.com.

 

Copyright 2002 Phase2 Software Corporation. All rights reserved.
Revised: March 6, 2002.