views:

172

answers:

4

I have a big load of documents, text-files, that I want to search for relevant content. I've seen a searching tool, can't remeber where, that implemented a nice method as I describe in my requirement below.

My requirement is as follows:

  • I need an optimised search function: I supply this search function with a list (one or more) partially-complete (or complete) words separated with spaces.
  • The function then finds all the documents containing words starting or equal to the first word, then search these found documents in the same way using the second word, and so on, at the end of which it returns a list containing the actual words found linked with the documents (name & location) containing them, for the complete the list of words.
  • The documents must contain all the words in the list.
  • I want to use this function to do an as-you-type search so that I can display and update the results in a tree-like structure in real-time.

A possible approach to a solution I came up with is as follows: I create a database (most likely using mysql) with three tables: 'Documents', 'Words' and 'Word_Docs'.

  • 'Documents' will have (idDoc, Name, Location) of all documents.
  • 'Words' will have (idWord, Word) , and be a list of unique words from all the documents (a specific word appears only once).
  • 'Word_Docs' will have (idWord, idDoc) , and be a list of unique id-combinations for each word and document it appears in.

The function is then called with the content of an editbox on each keystroke (except space):

  • the string is tokenized
  • (here my wheels spin a bit): I am sure a single SQL statement can be constructed to return the required dataset: (actual_words, doc_name, doc_location); (I'm not a hot-number with SQL), alternatively a sequence of calls for each token and parse-out the non-repeating idDocs?
  • this dataset (/list/array) is then returned

The returned list-content is then displayed:

e.g.: called with: "seq sta cod" displays:

sequence - start - code - Counting Sequences [file://docs/sample/con_seq.txt]
         - stop - code - Counting Sequences [file://docs/sample/con_seq.txt]
sequential - statement - code - SQL intro [file://somewhere/sql_intro.doc]

(and-so-on)

Is this an optimal way of doing it? The function needs to be fast, or should it be called only when a space is hit? Should it offer word-completion? (Got the words in the database) At least this would prevent useless calls to the function for words that does not exist. If word-completion: how would that be implemented?

(Maybe SO could also use this type of search-solution for browsing the tags? (In top-right of main page))

A: 

Not sure about the syntax (this is sql server syntax), but:

-- N is the number of elements in the list

SELECT idDoc, COUNT(1)
FROM Word_Docs wd INNER JOIN Words w on w.idWord = wd.idWord
WHERE w.Word IN ('word1', ..., 'wordN')
GROUP BY wd.idDoc
HAVING COUNT(1) = N

That is, without using like. With like things are MUCH more complex.

Sklivvz
+1  A: 

The fastest way is certainly not using a database at all, since if you do the search manually with optimized data, you can easily beat select search performance. The fastest way, assuming the documents don't change very often, is to build index files and use these for finding the keywords. The index file is created like this:

  1. Find all unique words in the text file. That is split the text file by spaces into words and add every word to a list unless already found on that list.

  2. Take all words you have found and sort them alphabetically; the fastest way to do this is using Three Way Radix QuickSort. This algorithm is hard to beat in performance when sorting strings.

  3. Write the sorted list to disk, one word a line.

  4. When you now want to search the document file, ignore it completely, instead load the index file to memory and use binary search to find out if a word is in the index file or not. Binary search is hard to beat when searching large, sorted lists.

Alternatively you can merge step (1) and step (2) within a single step. If you use InsertionSort (which uses binary search to find the right insert position to insert a new element into an already sorted list), you not only have a fast algorithm to find out if the word is already on the list or not, in case it is not, you immediately get the correct position to insert it and if you always insert new ones like that, you will automatically have a sorted list when you get to step (3).

The problem is you need to update the index whenever the document changes... however, wouldn't this be true for the database solution as well? On the other hand, the database solution buys you some advantages: You can use it, even if the documents contain so many words, that the index files wouldn't fit into memory anymore (unlikely, as even a list of all English words will fit into memory of any average user PC); however, if you need to load index files of a huge number of documents, then memory may become a problem. Okay, you can work around that using clever tricks (e.g. searching directly within the files that you mapped to memory using mmap and so on), but these are the same tricks databases use already to perform speedy look-ups, thus why re-inventing the wheel? Further you also can prevent locking problems between searching words and updating indexes when a document has changed (that is, if the database can perform the locking for you or can perform the update or updates as an atomic operation). For a web solution with AJAX calls for list updates, using a database is probably the better solution (my first solution is rather suitable if this is a locally running application written in a low level language like C).

If you feel like doing it all in a single select call (which might not be optimal, but when you dynamacilly update web content with AJAX, it usually proves as the solution causing least headaches), you need to JOIN all three tables together. May SQL is a bit rusty, but I'll give it a try:

SELECT COUNT(Document.idDoc) AS NumOfHits, Documents.Name AS Name, Documents.Location AS Location 
FROM Documents INNER JOIN Word_Docs ON Word_Docs.idDoc=Documents.idDoc 
INNER JOIN Words ON Words.idWord=Words_Docs.idWord
WHERE Words.Word IN ('Word1', 'Word2', 'Word3', ..., 'WordX')
GROUP BY Document.idDoc HAVING NumOfHits=X

Okay, maybe this is not the fastest select... I guess it can be done faster. Anyway, it will find all matching documents that contain at least one word, then groups all equal documents together by ID, count how many have been grouped togetehr, and finally only shows results where NumOfHits (the number of words found of the IN statement) is equal to the number of words within the IN statement (if you search for 10 words, X is 10).

Mecki
slashmais
A: 

Google Desktop Search or a similar tool might meet your requirements.

J.F. Sebastian
+2  A: 

What you're talking about is known as an inverted index or posting list, and operates similary to what you propose and what Mecki proposes. There's a lot of literature about inverted indexes out there; the Wikipedia article is a good place to start.

Better, rather than trying to build it yourself, use an existing inverted index implementation. Both MySQL and recent versions of PostgreSQL have full text indexing by default. You may also want to check out Lucene for an independent solution. There are a lot of things to consider in writing a good inverted index, including tokenisation, stemming, multi-word queries, etc, etc, and a prebuilt solution will do all this for you.

Nick Johnson
Now at least I know what to search for. Thanks.
slashmais