Indexing algorithms to develop an app like google desktop search ? | ansaurus

tags:

views:

295

answers:

2

Q:

Indexing algorithms to develop an app like google desktop search ?

Hi Friends, I want to develop google desktop search like application, I want to know that which Indexing Techniques/ Algorithms I should use so I can get very fast data retrival.

Thanks, Sunny.

+3 A:

The Burrows-Wheeler transform, used to compress data in bzip2, can be used to make substring searching of text a constant time function.

http://en.wikipedia.org/wiki/Burrows-Wheeler%5Ftransform

I haven't seen a simple introduction online, but here is a lot of detail:

http://www.ddj.com/architect/184405504

David Crawshaw 2009-09-13 11:51:40

Interesting article, thanks!

Nick Johnson 2009-09-13 14:24:20

+3 A:

In general, what you want is an Inverted Index. You can do the indexing yourself, but it's a lot of work to get right - you need to handle stemming, stop words, extending the posting list to include positions in the document so you can handle multi-word queries, and so forth. Then, you need to store the index, probably in a B-Tree on disk - or you can make life easier for yourself by using an existing database for the disk storage, such as BDB. You also need to write a query planner that interprets user queries, performs query expansion and converts them to a series of index scans. Wikipedia's article on Search Engine Indexing provides a good overview of all the challenges, too.

Or, you can leverage existing work and use ready-made full text indexing solutions like Apache Lucene and Compass (which is built on Lucene). These tools handle practically everything detailed above (and more), which just leaves you writing the tool to build and update the index by feeding all your documents into Lucene, and the UI to allow users to search it.

Nick Johnson 2009-09-13 14:23:14

related questions

How do I find the Excel column name that corresponds to a given integer?

Calculating a cutting list with the least amount of off cut waste.

Red-Black Trees

How to maintain a recursive invariant in a MySQL database?

RFC calculation in Java need help with algorithm

Best word wrap algorithm?

How do you separate game logic from display?

Most effective way for float and double comparison

Choosing a multiplier for a (string) hash function

Optimizing a search algorithm in C

Find the best combination from a given set of multiple sets

What "already invented" algorithm did you invent?

Designing a Calendar system like Google Calendar

How to overload std::swap()

Looking for algorithm that reverses the sprintf() function output

Merge Sort a Linked List

Puzzle: Find largest rectangle (maximal rectangle problem)

graph serialization

Peak detection of measured signal

Big O, how do you calculate/approximate it?

What problems can be solved, or tackled more easily, using graphs and trees?

Followup: "Sorting" colors by distinctiveness

Efficiently get sorted sums of a sorted list

Function for creating color wheels

Fastest way to get value of pi