Hello,

I need to filter a stream of text articles by checking every entry for fuzzy matches of predefined strings (I am searching for misspelled product names; sometimes they have a different word order and extra non-letter characters like ":" or ",").

I get excellent results by putting these articles into a Sphinx index and performing a search on it, but unfortunately I get hundreds of articles every second, and updating the index after every article is too slow (and I understand it isn't designed for such a task). I need a library which can build an in-memory index of a small (~100 KB) text and perform fuzzy search on it. Does anything like this exist?

+1  A: 

This problem is almost identical to Bayesian spam filtering, and tools already written for that can simply be trained to recognize articles according to your criteria.

added in response to comment:

So how are you partitioning the stream into bins now? If you already have a corpus of separated articles, just feed that into the classifier. Bayesian classifiers are the way to do fuzzy content matching in context and can classify everything from spam to nucleotides to astronomical spectral categories.
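
For illustration, a minimal sketch of that train-then-classify step in Python, assuming scikit-learn is available (the corpus, labels, and table of examples here are all made up):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hand-labeled corpus: "hit" articles mention the product, "miss" ones don't.
train_texts = [
    "toy story 3 opens in theaters this friday",
    "new toy story: 3 trailer released",
    "stock markets fell sharply today",
    "local council approves new budget",
]
train_labels = ["hit", "hit", "miss", "miss"]

vectorizer = CountVectorizer()
classifier = MultinomialNB()
classifier.fit(vectorizer.fit_transform(train_texts), train_labels)

# Classify an incoming article from the stream.
article = "misspelled toy sotry 3 sequel announced"
print(classifier.predict(vectorizer.transform([article]))[0])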

You could use less stochastic methods (e.g. Levenshtein distance), but at some point you have to describe the difference between hits and misses. The beauty of Bayesian methods, especially if you already have a segregated corpus in hand, is that you don't need to spell out how you are classifying.
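
If you do go the less stochastic route, here is a rough sketch using only Python's standard difflib (the 0.8 threshold and the sorted-token windows, which make the match insensitive to word order, are illustrative choices, not a prescription):

import difflib

def normalize(text: str) -> str:
    # Lowercase and replace non-alphanumerics with spaces,
    # so "Toy Story: 3" and "toy story 3" compare equal.
    return " ".join("".join(c if c.isalnum() else " " for c in text.lower()).split())

def is_fuzzy_hit(article: str, product: str, threshold: float = 0.8) -> bool:
    words = normalize(article).split()
    target_tokens = sorted(normalize(product).split())
    target = " ".join(target_tokens)
    n = len(target_tokens)
    # Slide a window of the product's length over the article; sorting the
    # window's tokens makes the comparison insensitive to word order.
    for i in range(max(len(words) - n + 1, 1)):
        window = " ".join(sorted(words[i:i + n]))
        if difflib.SequenceMatcher(None, window, target).ratio() >= threshold:
            return True
    return False

print(is_fuzzy_hit("New trailer: 3 Story Toy released", "Toy Story 3"))  # True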

msw
Thx, this is a very bright idea, but unfortunately right now I can't train filters, and AFAIK Bayesian filtering will not work well for long (6-7 word) search strings.
Riz
Your AFAIK is incorrect. Apparently not only do you not have time to train filters, but you don't have time to RTFWA.
msw
LOL, don't get me wrong, I didn't mean I am too lazy to train filters (or read Wikipedia), but the number of these filters can be quite big (so I can't prepare a set of trained filters for everyone), and an "add filter - check - train - repeat" loop is not the best solution for my task (end users would prefer to get some wrong results rather than spend more time training filters). As for long search strings, I may be wrong; it's just personal experience from using Bayesian spam filtering in my email client :)
Riz
I am using simple string search and flagging articles by hand right now :) Anyway, thanks for the suggestion; I will definitely try playing with the current data, as I already have sets of correct and wrong data (no idea if I have enough to create good rules). Any Python lib recommendations?
Riz
+1  A: 

How about using the SQLite FTS3 extension?

CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT);

(You may create any number of columns -- all of them will be indexed)

After that you can insert whatever you like and search it without rebuilding the index -- matching either a specific column or the whole row.

(http://www.sqlite.org/fts3.html)
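
A minimal end-to-end sketch in Python, assuming your SQLite build ships with FTS3 compiled in (the table name and sample rows are made up):

import sqlite3

# In-memory database: no files, and inserts are immediately searchable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE articles USING fts3(content TEXT)")

conn.executemany(
    "INSERT INTO articles (content) VALUES (?)",
    [
        ("Toy Story 3 opens next week",),
        ("Unrelated article about databases",),
    ],
)

# MATCH runs a full-text query against the index; no rebuild needed.
for (content,) in conn.execute(
    "SELECT content FROM articles WHERE content MATCH ?", ("toy story",)
):
    print(content)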

t7ko
Thx for the suggestion. I played with SQLite for a bit; using fts3 with the porter tokenizer gives really nice results, but it doesn't work for cases where the search string is "Toy Story 3 something" and the text contains "Toy Story 3 some_other_word" :(
Riz