Hi

I have a database table with around 1000 keywords/phrases (one to four words long). This table changes rarely, so I could extract the data into something more useful (like a regular expression?). So this is not about finding/guessing keywords with natural language processing.

I then have a user inputting some text into a form that I'd like to match against my keywords and phrases.

The program would then store a link to each phrase matched next to the text.

So if we ran the algorithm on this question text against a few phrases that are in here, we'd get a result like so:

{"inputting some text" : 1,
 "extract the data" : 1,
 "a phrase not here" : 0}

What are my options?

  1. Compile a regular expression
  2. Some sort of SQL query
  3. A third way?

Bearing in mind that there are around 1000 possible phrases.

I'm running Django / Python with MySQL.

edit: I'm currently doing this:

>>> text_input = "This is something with first phrase in and third phrase" 
>>> regex = "first phrase|second phrase|third phrase" 
>>> p = re.compile(regex, re.I) 
>>> p.findall(text_input)
['first phrase','third phrase']
A: 

If I understand you correctly, you have a unique set of strings that you want to compare an input string against. In this case you could use a set to store both the processing results and the db values. The comparison could then be done as follows:

>>> db = {'abc', 'def', 'jhi', 'asdf'}
>>> inpt = {'abc', 'tmp'}
>>> db & inpt
{'abc'}

The further conversion to the dictionary is trivial.
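For example, to get the {phrase: 0 or 1} mapping shown in the question, a small sketch building on the sets above:

# 1 for phrases found in the input, 0 otherwise
result = dict((phrase, int(phrase in inpt)) for phrase in db)
# result == {'abc': 1, 'def': 0, 'jhi': 0, 'asdf': 0}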

SilentGhost
Well, kind of. I have a block of text and phrases that I'm looking for within that block. I'm currently doing it via a regular expression like so: >>> text_input = "This is something with first phrase in and third phrase" >>> regex = "first phrase|second phrase|third phrase" >>> p = re.compile(regex, re.I) >>> p.findall(text_input) gives ['first phrase', 'third phrase']
Guy Bowden
FWIW, set comprehension syntax is python 3.0 and higher.
hughdbrown
@hughdbrown: I wasn't using set comprehension, I was using new-style set literals http://docs.python.org/3.1/whatsnew/3.0.html#new-syntax Everything here could be accomplished in py 2.x by using `set(lst)`
SilentGhost
@SilentGhost: It's not that simple. He doesn't parse his input string into phrases.
John Machin
A: 

Just a heads up... you may be interested in django's support for regex in queries

Example from the linked django docs:

Entry.objects.get(title__regex=r'^(An?|The) +')
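The lookup above filters rows whose column matches a given pattern; for the direction the question needs (which stored phrases occur inside a given block of text), one raw-SQL sketch against MySQL could look like this (the keywords table and phrase column names are assumptions):

from django.db import connection

def phrases_in(text):
    # MySQL's LOCATE(needle, haystack) returns 0 when the needle is absent.
    cursor = connection.cursor()
    cursor.execute(
        "SELECT phrase FROM keywords WHERE LOCATE(phrase, %s) > 0", [text])
    return [row[0] for row in cursor.fetchall()]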
Jiaaro
A: 

Here is a slight variation on SilentGhost's answer. Load the keywords line by line and store them in a set. For each keyword found in the user input, increase the corresponding entry in the results.

from StringIO import StringIO
from collections import defaultdict

keyword_file = StringIO("""inputting some text
    extract the data
    a phrase not here""")

keywords = set(line.strip() for line in keyword_file)

# example user input
userinput = "I then have a user inputting some text into a form"

results = defaultdict(int)
for phrase in keywords:
    if phrase in userinput:
        results[phrase] += 1

print results

Hope this points you in the right direction. Not entirely sure this is what you were asking but it's my best guess.

Do you care about speed? Why don't you like the method you use now? Does your method work?

James Brooks
A: 

Once you've formed your pattern, such as (first phrase)|(the second)|(and another) (with the parentheses I indicate), and compiled it into a regular expression object r, a good way to loop over the matches and identify which phrase matched is:

import re

# One group per phrase, in the same order as the list passed to GroupCounter.
r = re.compile(r'(first phrase)|(the second)|(and another)', re.I)

class GroupCounter(object):
  def __init__(self, phrases):
    self.phrases = phrases
    self.counts = [0] * len(phrases)
  def __call__(self, mo):
    # mo.lastindex is the 1-based index of the last group that matched
    self.counts[mo.lastindex - 1] += 1
    return ''
  def asdict(self):
    return dict(zip(self.phrases, self.counts))

thetext = "something with first phrase and the second in it"
g = GroupCounter(['first phrase', 'the second', 'and another'])
r.sub(g, thetext)
print g.asdict()

It would also be reasonable to have the GroupCounter instance build the regex object itself, since it needs the list of phrases in the same order as they appear in the regex.
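A minimal sketch of that variant, with the class building the regex itself (the class name and the re.escape call are my additions; escaping keeps literal phrases from colliding with regex metacharacters):

import re

class PhraseCounter(object):
    def __init__(self, phrases):
        self.phrases = phrases
        # Groups are created in the same order as self.phrases, so
        # mo.lastindex maps straight back to the matching phrase.
        pattern = '|'.join('(%s)' % re.escape(p) for p in phrases)
        self.regex = re.compile(pattern, re.IGNORECASE)
    def __call__(self, mo):
        self.counts[mo.lastindex - 1] += 1
        return ''
    def count(self, text):
        self.counts = [0] * len(self.phrases)
        self.regex.sub(self, text)
        return dict(zip(self.phrases, self.counts))

counter = PhraseCounter(['first phrase', 'the second', 'and another'])
print counter.count("text with first phrase and the second")
# first two phrases counted once each, the third stays at 0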

Alex Martelli
A: 

If you have 1000 phrases, and you're searching an input string to find which of those phrases are substrings, you're probably not going to be happy with the performance you get from using a big regular expression. A trie is a bit more work to implement, but it's a lot more efficient: the regular expression a|b|c|d|e does five tests on each character in a given input string, while a trie only does one. You could conceivably also use a lexer, like Plex, that produces a DFA.

Edit:

I appear to be procrastinating this morning. Try this:

class Trie(object):
    def __init__(self):
        self.children = {}
        self.item = None
    def add(self, item, remainder=None):
        """Add an item to the trie."""
        if remainder is None:
            remainder = item
        if remainder == "":
            self.item = item
        else:
            ch = remainder[0]
            if ch not in self.children:
                self.children[ch] = Trie()
            self.children[ch].add(item, remainder[1:])
    def find(self, word):
        """Return True if word is an item in the trie."""
        if not word:
            return self.item is not None
        ch = word[0]
        if ch not in self.children:
            return False
        return self.children[ch].find(word[1:])
    def find_words(self, word, results=None):
        """Find all items in the trie that word begins with."""
        if results is None:
            results = []
        if self.item:
            results.append(self.item)
        if not word:
            return results
        ch = word[0]
        if ch not in self.children:
            return results
        return self.children[ch].find_words(word[1:], results)

A quick test (words.txt is the BSD words file, a very handy thing to have around - it contains about 240,000 words):

>>> t = Trie()
>>> with open(r'c:\temp\words.txt', 'r') as f:
        for word in f:
            t.add(word.strip())

That takes about 15 seconds on my machine. This, however, is almost instantaneous:

>>> s = "I played video games in a drunken haze."
>>> r = []
>>> for i in range(len(s)):
        r.extend(t.find_words(s[i:]))
>>> r
['I', 'p', 'play', 'l', 'la', 'lay', 'a', 'ay', 'aye', 'y', 'ye', 'yed', 'e', 'd', 'v', 'video', 'i', 'id', 'ide', 'd', 'de', 'e', 'o', 'g', 'ga', 'gam', 'game', 'a', 'am', 'ame', 'm', 'me', 'e', 'es', 's', 'i', 'in', 'n', 'a', 'd', 'drunk', 'drunken', 'r', 'run', 'u', 'un', 'unken', 'n', 'k', 'ken', 'e', 'en', 'n', 'h', 'ha', 'haze', 'a', 'z', 'e']

Yes, unken is in words.txt. I have no idea why.
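To map this back to the question's {phrase: count} output, a rough sketch (assuming the trie is built from the 1000 phrases, added lowercased, rather than from words.txt):

def count_phrases(trie, phrases, text):
    counts = dict((p, 0) for p in phrases)   # unmatched phrases stay at 0
    lowered = text.lower()                   # phrases assumed lowercase in the trie
    for i in range(len(lowered)):
        for hit in trie.find_words(lowered[i:]):
            counts[hit] += 1
    return counts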

Oh, and I did try to compare with regular expressions:

>>> import re
>>> with open(r'c:\temp\words.txt', 'r') as f:
        p = "|".join([l.strip() for l in f])

>>> p = re.compile(p)

Traceback (most recent call last):
  File "<pyshell#250>", line 1, in <module>
    p = re.compile(p)
  File "C:\Python26\lib\re.py", line 188, in compile
    return _compile(pattern, flags)
  File "C:\Python26\lib\re.py", line 241, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Python26\lib\sre_compile.py", line 529, in compile
    groupindex, indexgroup
OverflowError: regular expression code size limit exceeded
Robert Rossney
seems to be similar to the Aho-Corasick thing suggested by @johnmachin - would be interested in the speed differences between the two...
Guy Bowden
Aho-Corasick is faster. But the speed difference between the two is a function of the length of the string being searched, not the size of the dictionary. If you're searching relatively long strings, it's probably worth the extra complexity.
Robert Rossney
A: 

The algorithm for this job is Aho-Corasick ... see the link at the bottom which points to a C-extension for Python.
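For reference, a rough sketch using one such C extension, the pyahocorasick package (its ahocorasick.Automaton API is assumed here, and it is not necessarily the extension linked):

import ahocorasick   # the pyahocorasick C extension (assumed installed)

phrases = ['inputting some text', 'extract the data', 'a phrase not here']

automaton = ahocorasick.Automaton()
for phrase in phrases:
    automaton.add_word(phrase, phrase)   # store the phrase itself as the payload
automaton.make_automaton()               # builds the Aho-Corasick failure links

text = "I then have a user inputting some text into a form"
found = set(payload for end_index, payload in automaton.iter(text))
print dict((p, 1 if p in found else 0) for p in phrases)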

John Machin