I have a large collection of documents (text files) that I want to search for relevant content. I've seen a search tool, I can't remember where, that implemented a nice method along the lines of the requirement I describe below.
My requirement is as follows:
- I need an optimised search function: I supply this search function with a list of one or more partially complete (or complete) words separated by spaces.
- The function then finds all the documents containing words that start with or equal the first word, then searches those found documents in the same way using the second word, and so on. At the end it returns, for the complete list of words, the actual words found, linked with the documents (name & location) containing them.
- The documents must contain all the words in the list.
- I want to use this function to do an as-you-type search so that I can display and update the results in a tree-like structure in real-time.
A possible approach to a solution I came up with is as follows: I create a database (most likely using mysql) with three tables: 'Documents', 'Words' and 'Word_Docs'.
- 'Documents' will have (idDoc, Name, Location) of all documents.
- 'Words' will have (idWord, Word), and be a list of unique words from all the documents (a specific word appears only once).
- 'Word_Docs' will have (idWord, idDoc), and be a list of unique id-combinations for each word and each document it appears in.
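To make the table layout concrete, here is a minimal sketch of the three tables and of how one document might be indexed into them. It assumes SQLite through Python's sqlite3 module as a stand-in for MySQL; the helper names `create_schema` and `add_document`, the tokenizing regex and the sample sentence are illustrative assumptions, not part of any existing tool.

```python
import re
import sqlite3

def create_schema(conn):
    # Three tables as described above; the UNIQUE and PRIMARY KEY
    # constraints enforce "a specific word appears only once" and
    # "unique id-combinations".
    conn.executescript("""
        CREATE TABLE Documents (
            idDoc    INTEGER PRIMARY KEY,
            Name     TEXT NOT NULL,
            Location TEXT NOT NULL UNIQUE
        );
        CREATE TABLE Words (
            idWord INTEGER PRIMARY KEY,
            Word   TEXT NOT NULL UNIQUE
        );
        CREATE TABLE Word_Docs (
            idWord INTEGER NOT NULL REFERENCES Words(idWord),
            idDoc  INTEGER NOT NULL REFERENCES Documents(idDoc),
            PRIMARY KEY (idWord, idDoc)
        );
    """)

def add_document(conn, name, location, text):
    # Hypothetical indexer: lower-case the text, split it into
    # alphanumeric words, and record each distinct word once per document.
    cur = conn.execute(
        "INSERT INTO Documents (Name, Location) VALUES (?, ?)",
        (name, location),
    )
    id_doc = cur.lastrowid
    for word in set(re.findall(r"[a-z0-9]+", text.lower())):
        conn.execute("INSERT OR IGNORE INTO Words (Word) VALUES (?)", (word,))
        id_word = conn.execute(
            "SELECT idWord FROM Words WHERE Word = ?", (word,)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO Word_Docs (idWord, idDoc) VALUES (?, ?)",
            (id_word, id_doc),
        )
    return id_doc

conn = sqlite3.connect(":memory:")
create_schema(conn)
add_document(
    conn,
    "Counting Sequences",
    "file://docs/sample/con_seq.txt",
    "The sequence has a start code and a stop code.",  # made-up sample text
)
```

The unique constraint on Word doubles as an index, which is what a prefix lookup on that column would rely on.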
The function is then called with the content of an editbox on each keystroke (except space):
- the string is tokenized
- (here my wheels spin a bit): I am sure a single SQL statement can be constructed to return the required dataset (actual_words, doc_name, doc_location), but I'm not a hot-number with SQL. Alternatively, a sequence of calls, one for each token, parsing out the non-repeating idDocs?
- this dataset (/list/array) is then returned
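One way the single-statement idea could look is sketched below: the statement is built with one pair of joins per token, and each token is matched as a prefix with LIKE. This is only a sketch under assumptions: it uses SQLite via Python's sqlite3 rather than MySQL, the names `search` and `like_prefix` are hypothetical, and the two sample documents with their word lists are made up to mirror the example output further down.

```python
import sqlite3

def like_prefix(token):
    # Escape LIKE wildcards the user may have typed, then add the
    # trailing % so the token matches as a prefix.
    escaped = token.replace("!", "!!").replace("%", "!%").replace("_", "!_")
    return escaped + "%"

def search(conn, query):
    tokens = query.lower().split()
    if not tokens:
        return []
    n = len(tokens)
    # One (Word_Docs, Words) join pair per token; a document only
    # survives if every token finds at least one matching word in it.
    word_cols = ", ".join(f"w{i}.Word" for i in range(n))
    joins = " ".join(
        f"JOIN Word_Docs wd{i} ON wd{i}.idDoc = d.idDoc "
        f"JOIN Words w{i} ON w{i}.idWord = wd{i}.idWord "
        f"AND w{i}.Word LIKE ? ESCAPE '!'"
        for i in range(n)
    )
    sql = (
        f"SELECT {word_cols}, d.Name, d.Location FROM Documents d {joins} "
        f"ORDER BY {word_cols}, d.Name"
    )
    rows = conn.execute(sql, [like_prefix(t) for t in tokens]).fetchall()
    # Each result: (tuple of actual words, doc_name, doc_location)
    return [(row[:n], row[n], row[n + 1]) for row in rows]

# Tiny in-memory data set (assumed contents, for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Documents (idDoc INTEGER PRIMARY KEY, Name TEXT, Location TEXT);
    CREATE TABLE Words (idWord INTEGER PRIMARY KEY, Word TEXT UNIQUE);
    CREATE TABLE Word_Docs (idWord INTEGER, idDoc INTEGER,
                            PRIMARY KEY (idWord, idDoc));
""")
sample = {
    ("Counting Sequences", "file://docs/sample/con_seq.txt"):
        ["sequence", "start", "stop", "code"],
    ("SQL intro", "file://somewhere/sql_intro.doc"):
        ["sequential", "statement", "code"],
}
for (name, location), words in sample.items():
    id_doc = conn.execute(
        "INSERT INTO Documents (Name, Location) VALUES (?, ?)", (name, location)
    ).lastrowid
    for word in words:
        conn.execute("INSERT OR IGNORE INTO Words (Word) VALUES (?)", (word,))
        conn.execute(
            "INSERT INTO Word_Docs SELECT idWord, ? FROM Words WHERE Word = ?",
            (id_doc, word),
        )

for found_words, doc_name, doc_location in search(conn, "seq sta cod"):
    print(" - ".join(found_words), "-", doc_name, f"[{doc_location}]")
```

If a document contains several words matching the same token, this returns one row per combination, which maps naturally onto the branches of a tree display. The number of joins grows with the number of tokens, so whether this is fast enough for as-you-type use would need measuring on real data.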
The returned list-content is then displayed:
e.g.: called with: "seq sta cod" displays:
sequence - start - code - Counting Sequences [file://docs/sample/con_seq.txt]
- stop - code - Counting Sequences [file://docs/sample/con_seq.txt]
sequential - statement - code - SQL intro [file://somewhere/sql_intro.doc]
(and-so-on)
Is this an optimal way of doing it? The function needs to be fast; or should it be called only when a space is hit? Should it offer word-completion? (I've got the words in the database.) At least this would prevent useless calls to the function for words that do not exist. If word-completion is the way to go, how would that be implemented?
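For the word-completion part, one possible sketch is a binary search over a sorted, in-memory copy of the 'Words' list. The function name `complete`, the `limit` parameter and the sample word list are assumptions for illustration; the same lookup could presumably also be done in the database with a prefix LIKE on the Word column.

```python
from bisect import bisect_left

def complete(sorted_words, prefix, limit=10):
    # Jump to the first word >= prefix, then collect words while they
    # still start with the prefix (at most `limit` of them).
    start = bisect_left(sorted_words, prefix)
    matches = []
    for word in sorted_words[start:start + limit]:
        if not word.startswith(prefix):
            break
        matches.append(word)
    return matches

# Hypothetical contents of the 'Words' table, loaded once and kept sorted.
words = sorted(["code", "sequence", "sequential", "start", "statement", "stop"])

print(complete(words, "sta"))
print(complete(words, "xyz"))
```

An empty result here means no indexed word starts with the typed prefix, so the call to the search function could be skipped for that keystroke.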
(Maybe SO could also use this type of search-solution for browsing the tags? (In top-right of main page))