ansaurus

Question

Answer 1

A:

Not sure about the syntax (this is sql server syntax), but:

-- N is the number of elements in the list

SELECT idDoc, COUNT(1)
FROM Word_Docs wd INNER JOIN Words w on w.idWord = wd.idWord
WHERE w.Word IN ('word1', ..., 'wordN')
GROUP BY wd.idDoc
HAVING COUNT(1) = N

That is, without using like. With like things are MUCH more complex.

Sklivvz 2008-09-29 08:47:51

Answer 2

+1 A:

The fastest way is certainly not using a database at all, since if you do the search manually with optimized data, you can easily beat select search performance. The fastest way, assuming the documents don't change very often, is to build index files and use these for finding the keywords. The index file is created like this:

Find all unique words in the text file. That is split the text file by spaces into words and add every word to a list unless already found on that list.
Take all words you have found and sort them alphabetically; the fastest way to do this is using Three Way Radix QuickSort. This algorithm is hard to beat in performance when sorting strings.
Write the sorted list to disk, one word a line.
When you now want to search the document file, ignore it completely, instead load the index file to memory and use binary search to find out if a word is in the index file or not. Binary search is hard to beat when searching large, sorted lists.

Alternatively you can merge step (1) and step (2) within a single step. If you use InsertionSort (which uses binary search to find the right insert position to insert a new element into an already sorted list), you not only have a fast algorithm to find out if the word is already on the list or not, in case it is not, you immediately get the correct position to insert it and if you always insert new ones like that, you will automatically have a sorted list when you get to step (3).

The problem is you need to update the index whenever the document changes... however, wouldn't this be true for the database solution as well? On the other hand, the database solution buys you some advantages: You can use it, even if the documents contain so many words, that the index files wouldn't fit into memory anymore (unlikely, as even a list of all English words will fit into memory of any average user PC); however, if you need to load index files of a huge number of documents, then memory may become a problem. Okay, you can work around that using clever tricks (e.g. searching directly within the files that you mapped to memory using mmap and so on), but these are the same tricks databases use already to perform speedy look-ups, thus why re-inventing the wheel? Further you also can prevent locking problems between searching words and updating indexes when a document has changed (that is, if the database can perform the locking for you or can perform the update or updates as an atomic operation). For a web solution with AJAX calls for list updates, using a database is probably the better solution (my first solution is rather suitable if this is a locally running application written in a low level language like C).

If you feel like doing it all in a single select call (which might not be optimal, but when you dynamacilly update web content with AJAX, it usually proves as the solution causing least headaches), you need to JOIN all three tables together. May SQL is a bit rusty, but I'll give it a try:

SELECT COUNT(Document.idDoc) AS NumOfHits, Documents.Name AS Name, Documents.Location AS Location 
FROM Documents INNER JOIN Word_Docs ON Word_Docs.idDoc=Documents.idDoc 
INNER JOIN Words ON Words.idWord=Words_Docs.idWord
WHERE Words.Word IN ('Word1', 'Word2', 'Word3', ..., 'WordX')
GROUP BY Document.idDoc HAVING NumOfHits=X

Okay, maybe this is not the fastest select... I guess it can be done faster. Anyway, it will find all matching documents that contain at least one word, then groups all equal documents together by ID, count how many have been grouped togetehr, and finally only shows results where NumOfHits (the number of words found of the IN statement) is equal to the number of words within the IN statement (if you search for 10 words, X is 10).

Mecki 2008-09-29 09:31:58

slashmais 2008-09-29 09:57:19

Answer 3

A:

Google Desktop Search or a similar tool might meet your requirements.

J.F. Sebastian 2008-09-29 09:45:05

Answer 4

+2 A:

What you're talking about is known as an inverted index or posting list, and operates similary to what you propose and what Mecki proposes. There's a lot of literature about inverted indexes out there; the Wikipedia article is a good place to start.

Better, rather than trying to build it yourself, use an existing inverted index implementation. Both MySQL and recent versions of PostgreSQL have full text indexing by default. You may also want to check out Lucene for an independent solution. There are a lot of things to consider in writing a good inverted index, including tokenisation, stemming, multi-word queries, etc, etc, and a prebuilt solution will do all this for you.

Nick Johnson 2008-09-29 10:11:22

Now at least I know what to search for. Thanks.

slashmais 2008-09-29 10:31:50

ansaurus

tags:

views:

answers:

Dynamic search and display

related questions