I am trying to write a search engine that gives meaningful results when a user enters words that occur within a set of documents. For that I want to know exactly how a search engine works: what data structures and algorithms it uses to build, store and query its indexes. I would also like some pointers on how to make a search engine give good results even when the query contains spelling mistakes. One natural extension I would want is phonetic search, so that the search also works for documents in transliterated form.
A really simple way to get started is to use something like agrep.
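To give a feel for what approximate matching means, here is a minimal Python sketch that keeps lines containing a word within a small edit distance of the query. It is not agrep's actual algorithm, and it compares whole words rather than arbitrary substrings. The names `edit_distance` and `fuzzy_grep` and the `max_errors` parameter are made up for this illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_grep(query, lines, max_errors=1):
    """Yield lines that contain a word within max_errors edits of query."""
    q = query.lower()
    for line in lines:
        words = line.lower().split()
        if any(edit_distance(q, w) <= max_errors for w in words):
            yield line
```

Something like `list(fuzzy_grep("serch", open("notes.txt")))` would then tolerate a one-letter typo in the query (the file name is only a placeholder). Every query scans every line, which is exactly why this only works for small collections.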
You could look at Lucene's documentation - there is a lot on the internals in the wiki. http://lucene.apache.org/java/docs/
...or Lucene.NET if you're a .NET guy: http://incubator.apache.org/lucene.net/
While using grep/agrep is a good starting point, to make an efficient search engine you'll need indexing.
For that, you'll need to learn about some specialized data structures, such as suffix trees, which are very useful for efficient indexing and retrieval of search results.
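As a rough illustration of suffix-based indexing, here is a sketch of a suffix array, a simpler relative of the suffix tree (not a suffix tree itself). The construction below is the naive sort-all-suffixes version, which is fine for short texts but far too slow and memory-hungry for real collections. The function names are hypothetical.

```python
def build_suffix_array(text):
    """Return suffix start positions sorted by the suffix they begin (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return sorted start positions of pattern in text, via binary search."""
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with pattern.
    return sorted(sa[start:lo])
```

Once the array is built, each lookup needs only a logarithmic number of comparisons instead of a scan over the whole text, which is the point of indexing.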
Here's Brin and Page's article about the anatomy of a 'new' search engine (Google), written some years ago. Google now offers even more articles about their algorithms.
Related posts for your reference:
Building a Web Search Engine:
http://stackoverflow.com/questions/112248/building-a-web-search-engine
Open Source Projects for Web Search Engine Components:
http://stackoverflow.com/questions/672476/open-source-projects-for-web-search-engine-components
Writing a search engine is not a trivial task, and how hard it is depends on how many documents you want to search. If the number and size of the documents are not large, then a simple sequential search a la grep/Glimpse or a database is enough. Otherwise you need a library that uses an inverted index for indexing and searching.
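To make the inverted-index idea concrete, here is a minimal in-memory sketch in Python. It is not how Lucene or any other library implements it. The names `tokenize`, `build_index` and `search` are made up for this example, and it assumes documents arrive as a dict mapping a document id to its text.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Look terms up with .get so unknown terms do not grow the index.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)
```

A query only touches the postings of its own terms, so the cost no longer grows with the total size of the collection the way a sequential scan does. Real engines add ranking, compression and on-disk storage on top of this.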
For documentation you can look at Introduction to Information Retrieval (available online).
You can also look at:
- Lucene, a full-text search engine in Java.
- Zettair, a simple text search engine in C.
- Search engines for websites, like A Perl search engine library.
Research Information Retrieval. The concepts are decades old and have been written about extensively.