views:

1257

answers:

8

I don't know what normal Windows searching uses, but there is a technique in which you index the files once and then use the index later for faster searching (e.g. Windows Search 4.0).

Is there any other way to search faster than this? Can you elaborate from an implementation point of view? (Assume that I may need to implement it myself.)

To make it simple to understand, let me put it this way:

Assume that I want to build a search application that performs searches similar to the one we use in Windows.

My question is: what are the possible options/ways/approaches available to build such an application? (And which of them are faster than the existing ones?)

(Can a binary-search-tree kind of technique be used?)

+1  A: 

Are you searching for filenames only or do you want to look at the content as well? In what language do you want to implement this?

If you're looking for filenames only, then an index is a big performance increase. If you need to look at the content as well, the index only helps by narrowing the search down to those files in which the content you're looking for might be located.

It still requires you to open each candidate file until you find what you're looking for.
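
For the filename-only case, a rough sketch of such an index (hypothetical Python, not how Windows Search actually works): walk the tree once, map each name token to its paths, and answer lookups from the map.

    # Hypothetical sketch: build a filename index once, query it many times.
    import os
    from collections import defaultdict

    def build_filename_index(root):
        index = defaultdict(set)                 # token -> set of full paths
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # Split the name into crude tokens on dots and spaces.
                for token in name.lower().replace('.', ' ').split():
                    index[token].add(path)
        return index

    index = build_filename_index(r"C:\Users")    # any root directory will do
    print(index.get("report", set()))            # all files with "report" as a name token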

Atmocreations
Yes, I want to search the content as well. As for the implementation, the .NET framework is preferred.
Ravi
Well, then you have to walk your files, open each one, and grep through the content. One way would be to add tags to your index: tags telling the most important categories the current file belongs to. That's how you can find files at a reasonable speed. Regards
Atmocreations
+12  A: 

Take a look at Lucene. It is a super-fast search library for text (files). There is also Lucene.NET available. If you would like to implement it yourself, it is a good starting point and benchmark for your implementation.

Thomas Maierhofer
Thanks for that, Thomas. But Lucene seems to suffice only for text files and not the rest.
Ravi
@Ravi: the solution is to extract text from PDF, DOC, etc. files and feed the extracted text to Lucene.
Stephen C
Exactly, the Lucene in Action book has an entire section dedicated to that.
JG
It's easy to hook Lucene.NET up with Microsoft IFilters: you use the IFilter interface to get the text and then add that to the Lucene index. See http://en.wikipedia.org/wiki/IFilter
Dan Diplo
A: 

Full-text Search: Imagine you had a dictionary of words, and for each word, you wrote down which document contained the word and the exact location of the word in that document. This is known as a full-text index, and it lets you do things like boolean search and matching an exact phrase. Full-text indexing can easily scale to millions of documents and is what Windows Search 4.0 generally uses. See also Lucene or Sphinx.
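
A toy illustration of the idea (hypothetical Python, nowhere near a production indexer): record, for every word, the documents it appears in and the positions within them, then answer exact-phrase queries from that map alone.

    # Toy full-text index: term -> {doc_id: [positions]}.  Illustrative only.
    from collections import defaultdict

    def build_index(docs):                           # docs: {doc_id: text}
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, text in docs.items():
            for pos, word in enumerate(text.lower().split()):
                index[word][doc_id].append(pos)
        return index

    def phrase_search(index, phrase):
        words = phrase.lower().split()
        if not words:
            return []
        hits = []
        for doc_id, positions in index.get(words[0], {}).items():
            for p in positions:
                # The phrase matches if every later word sits at the next position.
                if all(p + i in index.get(w, {}).get(doc_id, [])
                       for i, w in enumerate(words[1:], start=1)):
                    hits.append(doc_id)
                    break
        return hits

    docs = {1: "the quick brown fox", 2: "a quick brown dog", 3: "the lazy dog"}
    idx = build_index(docs)
    print(phrase_search(idx, "quick brown"))         # -> [1, 2]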

Concept Search: Conceptual search allows you to input a bunch of relevant words (or even an entire document) and return documents that most closely resemble your input. Based on your collection of documents, it produces concept spaces that allow it to infer semantic links between words. This allows it to return more relevant search results because the computer "understands" the concepts that you are searching for and will match conceptually similar words and phrases. This is what enterprise search and eDiscovery solutions commonly use. Products that offer conceptual search include Engenium and Autonomy.
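
The answer above doesn't name an algorithm; one classical way to build such a concept space is latent semantic analysis, i.e. a low-rank factorization of the term-document matrix. A rough sketch using scikit-learn, purely as an illustration (not what Engenium or Autonomy actually do):

    # One possible "concept space": latent semantic analysis over a TF-IDF matrix.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["car truck highway traffic",
            "automobile vehicle road traffic",
            "flour yeast oven bread",
            "bread dough baking oven"]

    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(docs)                # term-document matrix
    svd = TruncatedSVD(n_components=2)           # project documents into a 2-D concept space
    doc_concepts = svd.fit_transform(X)

    query = svd.transform(tfidf.transform(["car road"]))
    print(cosine_similarity(query, doc_concepts)[0])
    # The two transport documents score much higher than the two baking documents,
    # even though no single document contains both query words.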

Meta Search: Instead of searching on the content directly, you search on information about the content, known as metadata. Metadata can include things such as tags, keywords, author name, timestamp, etc. For example, if you know the approximate date when a document was written, you can include that metadata in your search criteria to more quickly narrow down your search results.
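
For example (hypothetical Python with made-up fields): keep the metadata in a small structure or database and filter on it before touching any content.

    # Toy metadata search: filter on facts about the documents, not their content.
    from datetime import date

    files = [
        {"path": "budget.xlsx", "author": "alice", "modified": date(2009, 3, 2),  "tags": ["finance"]},
        {"path": "notes.txt",   "author": "bob",   "modified": date(2008, 7, 14), "tags": ["meeting"]},
        {"path": "plan.doc",    "author": "alice", "modified": date(2009, 1, 20), "tags": ["finance", "q1"]},
    ]

    def meta_search(files, author=None, after=None, tag=None):
        for f in files:
            if author and f["author"] != author:
                continue
            if after and f["modified"] <= after:
                continue
            if tag and tag not in f["tags"]:
                continue
            yield f["path"]

    print(list(meta_search(files, author="alice", after=date(2009, 1, 1))))
    # -> ['budget.xlsx', 'plan.doc']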

As you can tell, there are many ways to approach search, and each involve many different types of data structures. If there is a particular area that you want me to elaborate on, I can do that for you.

geofflee
A: 

There are lots of research papers on full-text search available on the web, and there is lots of source code. If you take a look at them, you'll see that using a binary search tree is not going to give good results on modern hardware. A binary search tree is a very specific data structure that is not as fast as possible on a modern CPU with a multi-level cache. Fast data structures have a fan-out higher than 2.

In addition, the problem is better suited to a (radix) trie. See Wikipedia.

Stephan Eggermont
+8  A: 

There are basically two techniques used for full-text search over big corpora: posting lists and suffix arrays.

A posting list is a list of (term, document_id) pairs, optionally with a position in the document. If you sort it or hash it by term, you have an efficiently searchable full-text index.
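
A tiny illustration of that sorted-pair representation (hypothetical Python): keep the postings as a sorted list of (term, doc_id) pairs and binary-search for the range belonging to a term.

    # Toy posting list: sorted (term, doc_id) pairs, looked up with binary search.
    import bisect

    def build_postings(docs):                    # docs: {doc_id: text}
        return sorted({(word, doc_id)
                       for doc_id, text in docs.items()
                       for word in text.lower().split()})

    def lookup(postings, term):
        lo = bisect.bisect_left(postings, (term,))
        hi = bisect.bisect_right(postings, (term, float('inf')))
        return [doc_id for _term, doc_id in postings[lo:hi]]

    docs = {1: "the quick brown fox", 2: "a quick brown dog", 3: "the lazy dog"}
    postings = build_postings(docs)
    print(lookup(postings, "quick"))             # -> [1, 2]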

There are various techniques for making posting lists smaller, faster to access, faster to update, and more flexible, some at the cost of accuracy. Lucene is probably the best off-the-shelf posting-list-based text indexer available today, and (contrary to your earlier comment) it can index text found in PDF, Microsoft Word, etc., files. The Lucene.net project linked by Thomas Maierhofer looks like a pretty reasonable port, although of course you'll always be a bit behind the leading edge of what's going on in the Java version.

For a corpus that's much bigger than memory, you pretty much have to store the posting list on disk. That militates against using a simple binary search tree to access it: if you have a hundred thousand documents of ten thousand words each, you have a billion postings, which means your binary search tree has a minimum depth of 30. The trouble with that is that the 30 nodes on the path from the root of the tree to the leaf will, in general, be located in different parts of your disk — so the disk has to seek 30 times to find the postings for one term! That's about 2½ seconds, which is prohibitively slow.

However, there is a generalization of the binary search tree with much higher fan-out, called a “B-tree”, which can work. Lucene uses a simple data structure that's a lot like a B-tree, but supports massive updates much more easily. I wrote a very simple version of this same data structure in my own dumbfts project, which implements a full-text search engine for my email in a few pages of Python. I use it every day; it's free software, and it works pretty well for what I use it for, but it's not exactly a world-class search system like Lucene.
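
To make the fan-out point concrete, here is a deliberately crude sketch (hypothetical Python, nothing like Lucene's actual on-disk format): sort the terms, cut them into fixed-size blocks, keep only the first term of each block in memory, and pay roughly one block read per lookup instead of one per tree level.

    # Crude two-level index: an in-memory block directory over sorted "on-disk" blocks.
    import bisect

    BLOCK = 4                                    # terms per block; real systems use disk-page-sized blocks

    def build_block_index(entries):              # entries: [(term, postings)] sorted by term
        blocks = [entries[i:i + BLOCK] for i in range(0, len(entries), BLOCK)]
        directory = [b[0][0] for b in blocks]    # first term of each block, kept in RAM
        return directory, blocks                 # pretend the blocks live on disk

    def block_lookup(directory, blocks, term):
        i = bisect.bisect_right(directory, term) - 1   # which block could contain the term
        if i < 0:
            return None
        block = blocks[i]                        # in reality: a single disk read here
        j = bisect.bisect_left(block, (term,))
        if j < len(block) and block[j][0] == term:
            return block[j][1]
        return None

    words = sorted("apple banana cherry date elder fig grape kiwi lemon mango".split())
    entries = [(w, [n]) for n, w in enumerate(words, start=1)]   # fake posting lists
    directory, blocks = build_block_index(entries)
    print(block_lookup(directory, blocks, "grape"))              # -> [7]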

As an example of making posting lists smaller at the cost of accuracy, the Managing Gigabytes book (and the mg4j project) has a data structure called a "signed minimal perfect hash table", which doesn't actually store the terms that were indexed — just hashes of them. So there's a small probability of a false positive — you have to retrieve the documents that supposedly contain the term in order to confirm that they really do.
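
A drastically simplified illustration of the trade-off (hypothetical Python; an ordinary truncated hash, not a real signed minimal perfect hash table): key the postings by a short hash of the term instead of the term itself, so different terms can collide and every match has to be verified against the documents.

    # Store postings under a short hash of the term; the term itself is never stored.
    import hashlib
    from collections import defaultdict

    def short_hash(term):
        # 16-bit hash: deliberately tiny so that different terms can collide.
        return int.from_bytes(hashlib.md5(term.encode()).digest()[:2], 'big')

    def build_hashed_index(docs):                # docs: {doc_id: text}
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for word in text.lower().split():
                index[short_hash(word)].add(doc_id)
        return index

    def candidates(index, term):
        return index.get(short_hash(term), set())

    docs = {1: "the quick brown fox", 2: "a quick brown dog", 3: "the lazy dog"}
    idx = build_hashed_index(docs)
    # These are only candidate documents: some other term could hash to the same
    # slot, so each one must be opened to confirm it really contains "quick".
    print(candidates(idx, "quick"))              # -> {1, 2}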

Suffix arrays, which are a much more compact and slightly slower version of radix trees (aka tries), are implemented by GLIMPSE and a few other programs, but they've basically fallen out of use these days. They have some flexibility not present in the posting-list data structure — they allow regular expression searches and searches with misspellings, for example, but they're not quite as fast. There has been some recent work with the Burrows-Wheeler Transform, which is based on suffix arrays, providing a compression algorithm in which the compressed file is the full-text index! The best-documented version of this is called the FM-index, although I've heard that there are older versions of the technique, maybe unpublished. Unlike the other techniques described above, though, I think this doesn't actually work when the documents are PDF files or something like that — you could still use the same approach of extracting a text version of each page and indexing that, but you don't get the benefit of compressing the original document.
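
A bare-bones suffix array over a single text (hypothetical Python; the quadratic construction is only for illustration, real implementations such as SA-IS build it in linear time): sort the suffix start positions, then binary-search the query to find every occurrence.

    # Minimal suffix array: sort suffix start positions, then binary-search a query.
    def build_suffix_array(text):
        # O(n^2 log n) construction, fine for a sketch; real code uses e.g. SA-IS.
        return sorted(range(len(text)), key=lambda i: text[i:])

    def find_occurrences(text, sa, query):
        def prefix(k):                           # suffix sa[k], truncated to the query length
            return text[sa[k]:sa[k] + len(query)]
        lo, hi = 0, len(sa)
        while lo < hi:                           # lower bound: first suffix >= query
            mid = (lo + hi) // 2
            if prefix(mid) < query:
                lo = mid + 1
            else:
                hi = mid
        start, hi = lo, len(sa)
        while lo < hi:                           # upper bound: first suffix whose prefix > query
            mid = (lo + hi) // 2
            if prefix(mid) <= query:
                lo = mid + 1
            else:
                hi = mid
        return sorted(sa[start:lo])              # every position where the query occurs

    text = "banana"
    sa = build_suffix_array(text)
    print(find_occurrences(text, sa, "ana"))     # -> [1, 3]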

My acquaintance Tim wrote a really good introductory series of blog postings on search back in 2003, which are still pretty great. They cover this stuff (except for the recent developments) in a lot more depth.

Ravi: Is this the kind of information you're looking for?

Edit: thanks for fixing my formatting, Martin!

Kragen Javier Sitaker
+1. It seems this (or any other answer) is not enough for the OP to start with. I doubt anything will be. Very well written. If it did not help the OP, at least it taught me.
Lieven
Thanks, Lieven! I think the OP hasn't read it since I posted it — he hasn't left a comment either here or on his question.
Kragen Javier Sitaker
Thank you for this, Kragen. And sorry for the delay in my comment.
Ravi
I didn't mean to sound like I was complaining — I just meant that we didn't have any information about whether it was "enough for you" or not, because your earlier comments weren't in response to this answer.
Kragen Javier Sitaker
Ya, I got your point.
Ravi
+1 even if the answer wasn't as thorough as it is, I would have voted up just for using the word "militate"
SquareCog
A: 

Perhaps you can adapt PigeonRank™ to your needs :)

pmg
A: 

You can use Knuth-Morris-Pratt or Boyer-Moore search, which are very fast and do not need an index.
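
For reference, a compact Knuth-Morris-Pratt sketch (hypothetical Python); see the comment below for why scanning a whole corpus this way still costs time linear in its size.

    # Knuth-Morris-Pratt: find all occurrences of `pattern` in `text` in O(n + m) time.
    def kmp_search(text, pattern):
        if not pattern:
            return []
        # Failure table: length of the longest proper prefix of pattern[:i+1]
        # that is also a suffix of it.
        fail = [0] * len(pattern)
        k = 0
        for i in range(1, len(pattern)):
            while k and pattern[i] != pattern[k]:
                k = fail[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            fail[i] = k
        # Scan the text, never moving backwards in it.
        hits, k = [], 0
        for i, ch in enumerate(text):
            while k and ch != pattern[k]:
                k = fail[k - 1]
            if ch == pattern[k]:
                k += 1
            if k == len(pattern):
                hits.append(i - k + 1)
                k = fail[k - 1]
        return hits

    print(kmp_search("abracadabra", "abra"))     # -> [0, 7]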

codymanix
They're linear-time in the size of your corpus. If you have a 15 GiB corpus and your disk can read 40 MB per second, you won't get your search results for at least six minutes, assuming your CPU is fast enough to keep up with the disk. By contrast, I query a full-text index on a 15 GB corpus daily and get results in seconds. Google has a full-text index on several billion web pages, and can give you search results in less than a second.
Kragen Javier Sitaker
A: 

There's no single technique or 'silver bullet'. But if you do start from scratch, you'd better grok this and this standard text on the topic.

typeseven