Search engine parser flow diagram

You need a better understanding of search engines first. There are normally four parts:

1) A web crawler, something that gets the documents you want to add to your search data space. This is usually totally outside the scope of what you call a "search engine".

2) A parser, which takes the document and splits it into indexable text fragments. It usually works with different file formats and human languages, and preprocesses the text into, for example, some fixed records and flowing text. Linguistic algorithms (like stemmers; search for "Porter stemmer" to get a simple one) are also applied here.

3) An indexer, which might be as simple as an inverted list (for each word, the documents that contain it) or as complex as you want if you try to be as clever as Google. Building the index is the really magic part of a successful search engine. Usually there are multiple ranking algorithms that are put together.

4) The frontend with an optional query language. This is where Google is really bad, but as you can see from Google's success, it might not be so important for 98% of the people. But I really miss this.
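To make steps 2 and 3 a bit more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the function names (crude_stem, parse, build_index) are made up for this example, and the suffix stripping is a crude stand-in for a real stemmer such as the Porter stemmer, not an implementation of it.

```python
import re
from collections import defaultdict


def crude_stem(word):
    """Strip a few common English suffixes (a stand-in, NOT the Porter stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def parse(text):
    """Step 2: split text into lowercase word tokens and stem each one."""
    return [crude_stem(tok) for tok in re.findall(r"[a-z0-9]+", text.lower())]


def build_index(docs):
    """Step 3: build an inverted list mapping each term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in parse(text):
            index[term].add(doc_id)
    return index


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Crawlers fetch documents",
    2: "The indexer indexes parsed documents",
}
index = build_index(docs)
```

Looking up index["document"] then gives the ids of both sample documents, because "documents" is reduced to "document" in each. A real parser would also have to deal with file formats, languages and stop words, which this sketch ignores.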

I think you are asking for (3), the indexer. Basically, there are two different kinds of algorithms you find in the classic information retrieval literature: the vector space model and Boolean search. The latter is easy: just check whether the search words are inside the document and return a boolean value. Each search term can be given a relevance probability, and for different search terms you can use Bayesian probability to combine the relevance values and return the highest-ranked documents. The vector model treats a document as a vector of all its words; you can compute a scalar product between document vectors to judge whether they are close together. This is a much more complex theory. The father of IR (information retrieval) was Gerard Salton; you will find a lot of literature under his name.
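The two classic approaches can be sketched in a few lines of Python. This is a toy illustration, not how any real engine does it: the names (tokens, boolean_and, cosine) are hypothetical, the vectors are plain word counts without any weighting, and the scalar product is normalized by vector length (cosine similarity) so that long documents do not win automatically.

```python
import math
from collections import Counter


def tokens(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def boolean_and(query, docs):
    """Boolean search: ids of documents that contain every query term."""
    terms = set(tokens(query))
    return [doc_id for doc_id, text in docs.items() if terms <= set(tokens(text))]


def cosine(a, b):
    """Vector model: normalized scalar product of two word-count vectors."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "search engine index",
    2: "web crawler fetches pages",
    3: "search index ranking",
}

# Documents that contain both "search" and "index".
hits = boolean_and("search index", docs)

# Documents ordered by similarity to the query, best match first.
ranked = sorted(docs, key=lambda d: cosine("search engine", docs[d]), reverse=True)
```

With these sample documents, hits contains documents 1 and 3, and ranked puts document 1 first because it shares both query words. The Bayesian combination of per-term relevance is not shown here.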

This was the state of the art in IR until 1999 (I wrote my diploma thesis about a Usenet news search engine in 1998). Then Google came along, and all that theory went into the trashcan of academic stupidity and practical irrelevance.

Google was not built on mainstream IR theory. Read about it in the link that Srirangan gave you. It is just an ad hoc relevance function built on many, many different sources. You will not find anything in this area besides white-paper marketing blah-blah. These algorithms are the trade secrets and the capital of the search engine companies.

For simple search engines, look at the Lucene library or at dtSearch, which was always my choice for an embeddable search engine library.

There is not really a lot of example code or available information about IR technology in the open source world. Most projects, like Lucene, just implement the most primitive operations. You have to buy books and go to a university library to get access to the research literature.

As literature, I would recommend starting with this book: link text

@Lothar Thanks for the very detailed answer. Do you know of any good articles or books about parsers? How is parsing text different from a compiler parsing a programming language?

forme 2010-01-10 08:17:18

Okay, if you really want linguistic natural language processing, you should read "http://www.amazon.com/Natural-Language-Processing-Python-Steve/dp/0596516495/ref=cm_cr_pr_sims_t" and/or the book "Text Processing in Python". This will give you enough basic understanding to find the correct search terms to google or bing for more literature. Unfortunately, the most interesting stuff is not available online.

Lothar 2010-01-10 10:41:45

I'm curious now :) (going to the local university library...)

forme 2010-01-10 21:33:30
