views: 885
answers: 9

I need to index a lot of text. The search results must give me the names of the files containing the query and all of the positions where the query matched in each file, so that I don't have to load a whole file to find the matching portions. What libraries can you recommend for doing this?

update: Lucene has been suggested. Can you give me some info on how I should use Lucene to achieve this? (I have seen examples where the search query returned only the matching files.)

A: 

Try using grep.

lilburne
Man, delete this as soon as you can; you will start to get downvotes.
Richard J. Terrell
But that is exactly what I'd do. C/C++/Java are the wrong tools. Use shell programming, or Perl if you must. Convert to an application language if the need is real. Rapid-prototype it first with a scripting language, after which time you'll know all the bits that are needed.
lilburne
You misunderstood the problem. If I needed to find something just once, maybe I would use grep. But I need to search and find things fast (several queries a minute) in a few gigabytes of textual data. (BTW, the downvote wasn't me.)
Richard J. Terrell
The downvote was me. Grep is a system-dependent regexp implementation which doesn't exist by default outside *nix and can't be used intelligently enough in conjunction with C/C++/Java to allow for what the topic poster wants.
Esko
+8  A: 

For Java, try Lucene.

Jared
+2  A: 

It all depends on how you are going to access it, and of course how many users are going to access it. Read up on MapReduce.

If you are going to roll your own, you will need to create an index file which is essentially a map between unique words and tuples like (file, line, offset). Of course, you can also consider other in-memory data structures such as a trie (prefix tree), a Judy array, and the like...
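For illustration, a bare-bones in-memory version of that word-to-(file, offset) map might look like the sketch below. All class and method names here are invented for the example, and it tokenizes naively on letter runs:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal in-memory inverted index: maps each word to the
// list of (file, offset) positions where it occurs.
class InvertedIndex {
    // A posting records one occurrence of a word.
    static class Posting {
        final String file;
        final int offset; // character offset within the file's text
        Posting(String file, int offset) {
            this.file = file;
            this.offset = offset;
        }
    }

    private final Map<String, List<Posting>> index = new HashMap<>();

    // Tokenize on letter runs and record each word's start offset.
    void addFile(String fileName, String text) {
        int i = 0;
        while (i < text.length()) {
            if (Character.isLetter(text.charAt(i))) {
                int start = i;
                while (i < text.length() && Character.isLetter(text.charAt(i))) i++;
                String word = text.substring(start, i).toLowerCase();
                index.computeIfAbsent(word, k -> new ArrayList<>())
                     .add(new Posting(fileName, start));
            } else {
                i++;
            }
        }
    }

    // Returns all (file, offset) postings for a word, or an empty list.
    List<Posting> lookup(String word) {
        List<Posting> postings = index.get(word.toLowerCase());
        return postings != null ? postings : new ArrayList<>();
    }
}
```

Searching then becomes a single hash lookup; for multi-word queries you would intersect the file sets of the individual words' postings.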

Some 3rd party solutions are listed here.

dirkgently
+2  A: 

Lucene - Java

It's open source as well, so you are free to use it and deploy it in your application.

As far as I know, the Eclipse IDE's help system is powered by Lucene, so it has been tested by millions.

Sung Meister
A: 

Why don't you try constructing a state machine by reading all the files? Transitions between states will be letters, and states will be either final (some files contain the word under consideration, in which case the list of files is available there) or intermediate.

As for multiple-word lookups, you'll have to handle each word independently before intersecting the results.

I believe the Boost::Statechart library may be of some help for that matter.
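The letter-per-transition automaton described above is essentially a trie keyed on words, with the file list stored at each final state. A minimal Java sketch (class and method names invented for illustration) could be:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A trie whose edges are letters; a node is a "final" state when its
// word occurs in at least one file, and it stores the set of those files.
class WordTrie {
    private final Map<Character, WordTrie> children = new HashMap<>();
    private final Set<String> files = new HashSet<>();

    // Walk/create one node per letter, then record the file at the end.
    void insert(String word, String file) {
        WordTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new WordTrie());
        }
        node.files.add(file); // marks the final state
    }

    // Returns the set of files containing the word (empty if absent).
    Set<String> lookup(String word) {
        WordTrie node = this;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return new HashSet<>();
        }
        return node.files;
    }
}
```

Lookup cost is proportional to the length of the query word, independent of how many files were indexed; note, though, that this alone gives you matching files, not offsets within them.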

Benoît
I don't see how a state machine would be efficient.
Richard J. Terrell
Why not? In case it isn't as efficient as you'd want, you could add more complicated transitions (strings). It's simply a binary tree. You can decide its size and balance it as much as you like!
Benoît
+2  A: 

Have a look at http://www.compass-project.org/. It can be seen as a wrapper on top of Lucene: Compass simplifies common usage patterns of Lucene such as Google-style search and index updates, as well as more advanced concepts such as caching and index sharding (sub-indexes). Compass also has built-in optimizations for concurrent commits and merges.

The overview page can give you more info: http://www.compass-project.org/overview.html

I integrated this into a Spring project in no time. It is really easy to use and gives your users what they will see as Google-like results.

Paul Whelan
+2  A: 

I believe the Lucene term for what you are looking for is highlighting. Here is a very recent report on Lucene highlighting. You will probably need to store word position information in order to get the snippets you are looking for. The Token API may help.
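As a rough sketch of how position storage looks in the Lucene versions current at the time of writing (the 2.x line; this part of the API was reorganized in later releases): index the field with term vectors that include positions and offsets, then read the vector back for a matching document. Field names here are made up for the example:

```java
// Assumes lucene-core 2.x on the classpath.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;
import org.apache.lucene.store.RAMDirectory;

public class PositionSearch {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

        Document doc = new Document();
        doc.add(new Field("path", "a.txt",
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        // WITH_POSITIONS_OFFSETS is what makes match positions recoverable.
        doc.add(new Field("contents", "the quick brown fox",
                Field.Store.NO, Field.Index.TOKENIZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
        writer.addDocument(doc);
        writer.close();

        IndexReader reader = IndexReader.open(dir);
        TermPositionVector tpv =
                (TermPositionVector) reader.getTermFreqVector(0, "contents");
        int idx = tpv.indexOf("fox");
        for (TermVectorOffsetInfo o : tpv.getOffsets(idx)) {
            // Character offsets into the original text, so there is no
            // need to rescan the file to locate the match.
            System.out.println("fox at " + o.getStartOffset()
                    + "-" + o.getEndOffset());
        }
        reader.close();
    }
}
```

In a real application you would run a query first, map each hit back to its stored "path" field, and then pull the term vector for that document to report the offsets.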

Yuval F
+2  A: 

Also take a look at Lemur Toolkit.

Nemanja Trifunovic
A: 

I'm aware you asked for a library; I just wanted to point you to the underlying concept of building an inverted index (from Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze).

Fabian Steeg