I have a constantly growing database of keywords. I need to parse incoming text inputs (articles, feeds, etc.) and find which keywords from the database are present in the text. The database of keywords is much larger than the text.

Since the database is constantly growing (users add more and more keywords to watch for), I figure the best option will be to break the text input into words and compare those against the database. My main dilemma is implementing this comparison scheme (PHP and MySQL will be used for this project).

The most naive implementation would be to create a simple SELECT query against the keywords table, with a giant IN clause listing all the found keywords.

SELECT user_id,keyword FROM keywords WHERE keyword IN ('keyword1','keyword2',...,'keywordN');
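
In PHP the naive version would look roughly like this (an untested sketch; assumes a PDO connection in $pdo and the keywords table above):

// Tokenize the incoming text: lowercase, split on non-word characters, dedupe.
$tokens = array_values(array_unique(
    preg_split('/\W+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY)
));

// Build a parameterized IN (...) clause so the keywords are safely escaped.
// (A real implementation should guard against an empty token list.)
$placeholders = implode(',', array_fill(0, count($tokens), '?'));
$stmt = $pdo->prepare(
    "SELECT user_id, keyword FROM keywords WHERE keyword IN ($placeholders)"
);
$stmt->execute($tokens);
$matches = $stmt->fetchAll(PDO::FETCH_ASSOC);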

Another approach would be to create a hash table in memory (using something like memcache) and to check against it in the same manner.
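
For the memcache variant, something like this (again just a sketch; it assumes every watched keyword was preloaded under a hypothetical "kw:" prefix with the watching user ids as the value):

// Sketch only: assumes keywords were preloaded as "kw:<keyword>" => user ids.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$keys  = array_map(function ($t) { return 'kw:' . $t; }, $tokens);
$found = $mc->getMulti($keys);      // returns only the keys that exist

foreach ($found as $key => $userIds) {
    $keyword = substr($key, 3);     // strip the "kw:" prefix
    // $keyword is in the database and is being watched by $userIds
}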

Does anyone have any experience with this kind of searching, and any suggestions on how to implement it better? I haven't tried any of these approaches yet; I'm just gathering ideas at this point.

A: 

I'm not 100% clear on what you're asking, but maybe what you're looking for is an inverted index?

Update:

You can use an inverted index to match multiple keywords at once.

Split up the new document into tokens, and insert the tokens paired with an identifier for the document into the inverted index table. A (rather denormalized) inverted index table:

inverted_index
-----
document_id keyword

If you're searching for 3 keywords manually:

select document_id, count(*) from inverted_index
  where keyword in ('keyword1', 'keyword2', 'keyword3')
  group by document_id 
  having count(*) = 3

If you have a table of the keywords you care about, just use an inner join rather than an in() operation:

keyword_table
----
keyword othercols

select keyword_table.keyword, keyword_table.othercols from inverted_index 
   inner join keyword_table on keyword_table.keyword=inverted_index.keyword
   where inverted_index.document_id=id_of_some_new_document

Is any of this closer to what you want?

ʞɔıu
He's trying to do the opposite. In this case, for an incoming huge chunk of text, find which keywords the database already knows about.
Nick Gerakines
As Nick said, I'm looking for the best way to find records matching multiple keywords at once.
Eran Galperin
I suggested this exact solution in the question. I'm looking for alternatives, as this is probably the most inefficient option. Also, this is not against a documents table but a users table; there is only one document at a time, which is tokenized and compared against the user keywords.
Eran Galperin
I would suggest that if you index this correctly it would actually be pretty efficient.
ʞɔıu
It's possible, I'm not ruling out this option. Just looking for more ways to go about it.
Eran Galperin
A: 

I would do 2 things here.

First (and this isn't directly related to the question), I'd break up and partition the user keywords by user: more tables with less data in each, ideally on different servers, so that slices or ranges of users live on different shards for distributed lookups. That is, all of usera's data lives on slice one, userb's on slice two, etc.

Second, I'd have some sort of in-memory hash table to determine the existence of keywords. This would likely be federated as well to distribute the lookups: for n keyword-existence servers, hash the keyword, mod it by n, and distribute ranges of those keys across all of the memcached servers. This quick check lets you ask "is keyword x being watched?": hash it, determine which server it lives on, then make the lookup and collect/aggregate the keywords being tracked.
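
A rough sketch of that scheme in PHP (hypothetical host list and key prefix; not production code):

// Hypothetical sketch: pick a keyword-existence server by hashing the keyword
// and taking it mod n, then ask that server whether the keyword is watched.
$serverList = array(array('10.0.0.1', 11211), array('10.0.0.2', 11211), array('10.0.0.3', 11211));

$pool = array();
foreach ($serverList as $i => $s) {
    $pool[$i] = new Memcached();
    $pool[$i]->addServer($s[0], $s[1]);
}

function serverIndex($keyword, $n) {
    return (crc32($keyword) & 0x7fffffff) % $n;   // stable hash -> server slot
}

$watched = array();
foreach ($tokens as $keyword) {
    $mc = $pool[serverIndex($keyword, count($pool))];
    if ($mc->get('kw:' . $keyword) !== false) {
        $watched[] = $keyword;   // someone is tracking this keyword
    }
}

For what it's worth, the Memcached client can also spread keys across a pool on its own if you add every server to a single instance; the explicit mod-n version above just makes the federation visible.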

At that point you'll at least know which keywords are being tracked and you can take your user slices and perform subsequent lookups to determine which users are tracking which keywords.

In short: SQL is not an ideal solution here.

Nick Gerakines
Partitioning the keywords table by users might not be a good idea - each user only has 4-5 keywords, and there are a lot of users - it will result in a lot of tables. I am leaning toward a non-SQL solution, as you said.
Eran Galperin
Even if keywords are spread widely across users, you'll still have much smaller data sets to iterate through. User-based data partitioning is pretty efficient in most cases and doesn't require a lot of work.
Nick Gerakines
Others have suggested that an in-memory table, or a temporary table built from an article's keywords, is about as optimized as I'm likely to get. Do you have anything to support the claim that SQL is not an ideal solution here?
Eran Galperin
A: 

In general,

  1. Break the text into words, then:

    a. convert words back to canonical root form

    b. drop common conjunction words

    c. strip duplicates

  2. Insert the words into a temporary table and do an inner join against the keywords table, or (as you suggested) build the keywords into a complex query criterion (the temporary-table variant is sketched below).
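
A rough sketch of step 2 (untested; assumes a PDO connection and the keywords(user_id, keyword) table from the question):

// Untested sketch: load the document's tokens into a temporary table,
// then join against the existing keywords table.
$pdo->exec("CREATE TEMPORARY TABLE doc_words (word VARCHAR(64) PRIMARY KEY)");

$ins = $pdo->prepare("INSERT IGNORE INTO doc_words (word) VALUES (?)");
foreach ($tokens as $t) {
    $ins->execute(array($t));
}

$matches = $pdo->query(
    "SELECT k.user_id, k.keyword
       FROM keywords k
       INNER JOIN doc_words d ON d.word = k.keyword"
)->fetchAll(PDO::FETCH_ASSOC);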

It may be worthwhile to cache a 3- or 4-letter hash array with which to pre-filter potential keywords; you will have to experiment to find the best tradeoff between memory size and effectiveness.
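
For example, one way to read that suggestion is a 3-letter prefix filter (just a sketch; $allKeywords here is a hypothetical in-memory copy of the keyword list):

// Sketch of a 3-letter prefix pre-filter: a token can only match if some
// keyword shares its first three letters, so everything else is skipped cheaply.
$prefixes = array();
foreach ($allKeywords as $kw) {              // built once and cached
    $prefixes[substr($kw, 0, 3)] = true;
}

$candidates = array_filter($tokens, function ($t) use ($prefixes) {
    return isset($prefixes[substr($t, 0, 3)]);
});
// Only $candidates need the expensive database or memcached lookup.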

Hugh Bothwell
Any experience with the performance of in-memory table versus a memcache array? I would think an array access would be faster?
Eran Galperin
It's easier to do a JOIN against a table using the MEMORY storage engine than it would be to join against memcached.
Bill Karwin
A: 

Have you considered graduating to a fulltext solution such as Sphinx?

I'm talking out of my hat here, because I haven't used it myself. But it's getting a lot of attention as a high-speed fulltext search solution. It will probably scale better than any relational solution you use.

Here's a blog about using Sphinx as a fulltext search solution in MySQL.

Bill Karwin
Actually I have a lot of experience with Sphinx and it's great. But this is not text search - it's exact matching against a list of keywords.
Eran Galperin
+2  A: 

The classic way of searching a text stream for multiple keywords is the Aho-Corasick finite automaton, which uses time linear in the text to be searched. You'll want minor adaptations to recognize strings only on word boundaries, or perhaps it would be simpler just to check the keywords found and make sure they are not embedded in larger words.

You can find an implementation in fgrep. Even better, Preston Briggs wrote a pretty nice implementation in C that does exactly the kind of keyword search you are talking about. (It searches programs for occurrences of 'interesting' identifiers.) Preston's implementation is distributed as part of the Noweb literate-programming tool. You could find a way to call this code from PHP, or you could rewrite it in PHP; the recognizer itself is about 220 lines of C, and the main program is another 135 lines.

All the proposed solutions, including Aho-Corasick, have these properties in common:

  • A preprocessing step that takes time and space proportional to the number of keywords in the database.

  • A search step that takes time and space proportional to the length of the text plus the number of keywords found.

Aho-Corasick offers considerably better constants of proportionality on the search step, but if your texts are small, this won't matter. In fact, if your texts are small and your database is large, you probably want to minimize the amount of memory used in the preprocessing step. Andrew Appel's DAWG data structure from the world's fastest scrabble program will probably do the trick.

Norman Ramsey
I'm not searching for words in a text - I'm tokenizing text into words then comparing against a database of words.
Eran Galperin
Preston's code does exactly what you want. It combines the tokenization and the search in a single step. You provide a list of keywords, build the automaton, then hit the string with the automaton.
Norman Ramsey
You still don't understand. I don't want to search every keyword in the database against the text - but the other way around. Break down the text into words, and find which of those exist in the database. The database is much larger than the text.
Eran Galperin
OK, with that crucial piece of information added, any simple preprocessing step on the database will do: add all the keywords to a hash table. Then tokenize your small text, look up each word in the hash table, and you're done. Aho-Corasick will work too, but you don't need the complexity.
Norman Ramsey
Yes, I'm not really interested in the preprocessing step; I already chose the method to attack this. Searching a large number of keywords against a very large table is what concerns me.
Eran Galperin
Sounds like you already have an approach that works, so you don't need a new idea.
Norman Ramsey
My approach works, but it's the only one I know. Just checking to see if people know of anything better - since performance is an issue here.
Eran Galperin
People might be able to help you better if you tell us *where* the performance is an issue: is it in preparing the list of keywords, or in examining the text?
Norman Ramsey
As I said, I haven't tried anything in practice yet, so I can't tell where the performance problem will be. I'm just looking for alternative approaches...
Eran Galperin
A: 

I hacked up some code for scanning for multiple keywords using a DAWG (as suggested above, referencing the Scrabble paper), although I wrote it from first principles and I don't know whether it is anything like the Aho-Corasick algorithm or not.

http://www.gtoal.com/wordgames/spell/multiscan.c.html

A friend made some hacks to my code after I first posted it on the wordgame programmers mailing list, and his version is probably more efficient:

http://www.gtoal.com/wordgames/spell/multidawg.c.html

Scales fairly well...

G

Graham Toal