views:

1002

answers:

7

I am trying to write a search engine that gives meaningful results when a user enters words that occur in a collection of documents. To do that, I want to understand how a search engine works: which data structures and algorithms it uses to build, store, and query its indexes. I would also like pointers on making the search return good results even when the query contains spelling mistakes. One natural extension I would like is phonetic search, so that the search also works for documents in transliterated form.

+1  A: 

A really simple way to get started is to use something like agrep.
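agrep matches patterns while tolerating a bounded number of errors, which is what makes it handy for misspelled queries. The idea can be sketched with plain Levenshtein edit distance over a word list. This is only an illustration of error-tolerant matching, not agrep's actual algorithm, and `edit_distance`, `fuzzy_lookup`, and `max_errors` are hypothetical names chosen for this sketch.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_lookup(word, vocabulary, max_errors=1):
    """Return the vocabulary words within max_errors edits of word."""
    return sorted(w for w in vocabulary if edit_distance(word, w) <= max_errors)


vocabulary = {"search", "engine", "index"}
print(fuzzy_lookup("serch", vocabulary))
```

Scanning the whole vocabulary for every query word is fine for small collections; larger ones usually need an index over the vocabulary as well.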

Greg Hewgill
+6  A: 

You could look at Lucene's documentation; the wiki has a lot of material on its internals. http://lucene.apache.org/java/docs/

...or Lucene.NET if you're a .NET guy: http://incubator.apache.org/lucene.net/

Michał Chaniewski
+3  A: 

While using grep/agrep is a good starting point, to make an efficient search engine you'll need indexing.

For that, you'll need to learn about some specialized data structures, such as suffix trees, which are very useful for efficient indexing and retrieval of search results.
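A full suffix tree is too long to show here, so as a rough illustration of the same idea, here is a sketch using a suffix array, a simpler relative of the suffix tree (a swapped-in structure, not the one named above). The function names are hypothetical, and the naive construction is only suitable for small texts.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text,
    sorted by the suffixes' lexicographic order (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions where pattern occurs in text."""
    m = len(pattern)
    # Binary search for the first suffix whose m-character prefix
    # is not smaller than the pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # All matching suffixes sit next to each other from that point on.
    matches = []
    while lo < len(suffix_array):
        start = suffix_array[lo]
        if text[start:start + m] != pattern:
            break
        matches.append(start)
        lo += 1
    return sorted(matches)


text = "banana"
suffix_array = build_suffix_array(text)
print(find_occurrences(text, suffix_array, "ana"))
```

Once the array is built, each lookup needs only a binary search plus the matches themselves, instead of a scan over the whole text.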

Here's Brin and Page's article, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", about the anatomy of a then-new search engine (Google), written some years ago. Google now offers even more articles about their algorithms.

Eli Bendersky
+3  A: 

Related posts for your reference:

Building a Web Search Engine:

http://stackoverflow.com/questions/112248/building-a-web-search-engine

Open Source Projects for Web Search Engine Components:

http://stackoverflow.com/questions/672476/open-source-projects-for-web-search-engine-components

Timothy Chung
A: 

Writing a search engine is not a trivial task, and how hard it is depends on how many documents you want to search. If the number and size of the documents are not large, then a simple sequential search a la grep/Glimpse or a database is enough. Otherwise you need a library that uses an inverted index for indexing and searching.
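To make the term concrete, here is a minimal sketch of an inverted index: a map from each word to the set of documents that contain it. The function names and the sample documents are made up for illustration, and real libraries add ranking, stemming, compression, and on-disk storage on top of this.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every query term (AND query)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "The quick brown fox",
    2: "A quick search engine",
    3: "Search engines build an inverted index",
}
index = build_index(docs)
print(search(index, "quick"))
print(search(index, "search engine"))
```

A query then touches only the posting sets of its own terms rather than every document, which is why this scales better than a sequential scan. Note that without stemming, "engine" and "engines" are indexed as different terms.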

For documentation you can look at Introduction to Information Retrieval (available online).

You can also look at:

bill
+1  A: 

Research Information Retrieval. The concepts are decades old and very well documented.

kervin
+4  A: 
James McMahon