Hi

I have a database table with around 1000 keywords/phrases (one to four words long). This table changes rarely, so I could extract the data into something more useful (like a regular expression?). So this is not about finding/guessing keywords with natural language processing.

I then have a user inputting some text into a form that I'd like to match against my keywords and phrases.

The program would then store a link to each phrase matched next to the text.

So if we ran the algorithm on this question text against a few phrases that are in here, we'd get a result like so:

{"inputting some text" : 1,
 "extract the data" : 1,
 "a phrase not here" : 0}

What are my options?

  1. Compile a regular expression
  2. Some sort of SQL query
  3. A third way?

Bearing in mind that there are around 1000 possible phrases.

I'm running Django / Python with MySQL.

edit: I'm currently doing this:

>>> text_input = "This is something with first phrase in and third phrase" 
>>> regex = "first phrase|second phrase|third phrase" 
>>> p = re.compile(regex, re.I) 
>>> p.findall(text_input)
['first phrase','third phrase']
A: 

If I understand you correctly, you have a unique set of strings that you want to compare an input string against. In this case you could use a set to store both the processing results and the db values. The comparison could then be done as follows:

>>> db = {'abc', 'def', 'jhi', 'asdf'}
>>> inpt = {'abc', 'tmp'}
>>> db & inpt
{'abc'}

The further conversion to the dictionary is trivial.
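For example, to get the {phrase: 0 or 1} mapping shown in the question, a small sketch building on the sets above:

# 1 for phrases found in the input, 0 otherwise
result = dict((phrase, int(phrase in inpt)) for phrase in db)
# result == {'abc': 1, 'def': 0, 'jhi': 0, 'asdf': 0}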

SilentGhost
Well, kind of. I have a block of text and phrases that I'm looking for within that block. I'm currently doing it via a regular expression like so: >>> text_input = "This is something with first phrase in and third phrase" >>> regex = "first phrase|second phrase|third phrase" >>> p = re.compile(regex, re.I) >>> p.findall(text_input) gives ['first phrase', 'third phrase']
Guy Bowden
FWIW, set comprehension syntax is python 3.0 and higher.
hughdbrown
@hughdbrown: I wasn't using set comprehension, I was using new-style set literals http://docs.python.org/3.1/whatsnew/3.0.html#new-syntax Everything here could be accomplished in py 2.x by using `set(lst)`
SilentGhost
@SilentGhost: It's not that simple. He doesn't parse his input string into phrases.
John Machin
A: 

Just a heads up... you may be interested in django's support for regex in queries

Example from the linked django docs:

Entry.objects.get(title__regex=r'^(An?|The) +')
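The lookup above filters rows whose column matches a given pattern; for the direction the question needs (which stored phrases occur inside a given block of text), one raw-SQL sketch against MySQL could look like this (the keywords table and phrase column names are assumptions):

from django.db import connection

def phrases_in(text):
    # MySQL's LOCATE(needle, haystack) returns 0 when the needle is absent.
    cursor = connection.cursor()
    cursor.execute(
        "SELECT phrase FROM keywords WHERE LOCATE(phrase, %s) > 0", [text])
    return [row[0] for row in cursor.fetchall()]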
Jiaaro
A: 

Here is a slight variation on SilentGhost's answer. Load the keywords line by line and store them in a set. For each keyword found in the user input, increase the corresponding entry in the results.

from StringIO import StringIO
from collections import defaultdict

keyword_file = StringIO("""inputting some text
    extract the data
    a phrase not here""")

keywords = set(line.strip() for line in keyword_file)

# example user input
userinput = "I then have a user inputting some text into a form"

results = defaultdict(int)
for phrase in keywords:
    if phrase in userinput:
        results[phrase] += 1

print results

Hope this points you in the right direction. Not entirely sure this is what you were asking but it's my best guess.

Do you care about speed? Why don't you like the method you use now? Does your method work?

James Brooks
A: 

Once you've formed your pattern, such as (first phrase)|(the second)|(and another) (with the parentheses I indicate), and compiled it into a regular expression object r, a good way to loop over the matches and identify which phrase matched is:

import re

# One group per phrase, in the same order as the list passed to GroupCounter.
r = re.compile(r'(first phrase)|(the second)|(and another)', re.I)

class GroupCounter(object):
  def __init__(self, phrases):
    self.phrases = phrases
    self.counts = [0] * len(phrases)
  def __call__(self, mo):
    # mo.lastindex is the 1-based index of the last group that matched
    self.counts[mo.lastindex - 1] += 1
    return ''
  def asdict(self):
    return dict(zip(self.phrases, self.counts))

thetext = "something with first phrase and the second in it"
g = GroupCounter(['first phrase', 'the second', 'and another'])
r.sub(g, thetext)
print g.asdict()

It would also be reasonable to have the GroupCounter instance build the regex object itself, since it needs the list of phrases in the same order as they appear in the regex.
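A minimal sketch of that variant, with the class building the regex itself (the class name and the re.escape call are my additions; escaping keeps literal phrases from colliding with regex metacharacters):

import re

class PhraseCounter(object):
    def __init__(self, phrases):
        self.phrases = phrases
        # Groups are created in the same order as self.phrases, so
        # mo.lastindex maps straight back to the matching phrase.
        pattern = '|'.join('(%s)' % re.escape(p) for p in phrases)
        self.regex = re.compile(pattern, re.IGNORECASE)
    def __call__(self, mo):
        self.counts[mo.lastindex - 1] += 1
        return ''
    def count(self, text):
        self.counts = [0] * len(self.phrases)
        self.regex.sub(self, text)
        return dict(zip(self.phrases, self.counts))

counter = PhraseCounter(['first phrase', 'the second', 'and another'])
print counter.count("text with first phrase and the second")
# first two phrases counted once each, the third stays at 0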

Alex Martelli
A: 

If you have 1000 phrases, and you're searching an input string to find which of those phrases are substrings, you're probably not going to be happy with the performance you get from using a big regular expression. A trie is a bit more work to implement, but it's a lot more efficient: the regular expression a|b|c|d|e does five tests on each character in a given input string, while a trie only does one. You could conceivably also use a lexer, like Plex, that produces a DFA.

Edit:

I appear to be procrastinating this morning. Try this:

class Trie(object):
    def __init__(self):
        self.children = {}
        self.item = None
    def add(self, item, remainder=None):
        """Add an item to the trie."""
        if remainder is None:
            remainder = item
        if remainder == "":
            self.item = item
        else:
            ch = remainder[0]
            if ch not in self.children:
                self.children[ch] = Trie()
            self.children[ch].add(item, remainder[1:])
    def find(self, word):
        """Return True if word is an item in the trie."""
        if not word:
            return self.item is not None
        ch = word[0]
        if ch not in self.children:
            return False
        return self.children[ch].find(word[1:])
    def find_words(self, word, results=None):
        """Find all items in the trie that word begins with."""
        if results is None:
            results = []
        if self.item:
            results.append(self.item)
        if not word:
            return results
        ch = word[0]
        if ch not in self.children:
            return results
        return self.children[ch].find_words(word[1:], results)

A quick test (words.txt is the BSD words file, a very handy thing to have around - it contains about 240,000 words):

>>> t = Trie()
>>> with open(r'c:\temp\words.txt', 'r') as f:
        for word in f:
            t.add(word.strip())

That takes about 15 seconds on my machine. This, however, is almost instantaneous:

>>> s = "I played video games in a drunken haze."
>>> r = []
>>> for i in range(len(s)):
        r.extend(t.find_words(s[i:]))
>>> r
['I', 'p', 'play', 'l', 'la', 'lay', 'a', 'ay', 'aye', 'y', 'ye', 'yed', 'e', 'd', 'v', 'video', 'i', 'id', 'ide', 'd', 'de', 'e', 'o', 'g', 'ga', 'gam', 'game', 'a', 'am', 'ame', 'm', 'me', 'e', 'es', 's', 'i', 'in', 'n', 'a', 'd', 'drunk', 'drunken', 'r', 'run', 'u', 'un', 'unken', 'n', 'k', 'ken', 'e', 'en', 'n', 'h', 'ha', 'haze', 'a', 'z', 'e']

Yes, unken is in words.txt. I have no idea why.
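To map this back to the question's {phrase: count} output, a rough sketch (assuming the trie is built from the 1000 phrases, added lowercased, rather than from words.txt):

def count_phrases(trie, phrases, text):
    counts = dict((p, 0) for p in phrases)   # unmatched phrases stay at 0
    lowered = text.lower()                   # phrases assumed lowercase in the trie
    for i in range(len(lowered)):
        for hit in trie.find_words(lowered[i:]):
            counts[hit] += 1
    return counts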

Oh, and I did try to compare with regular expressions:

>>> import re
>>> with open(r'c:\temp\words.txt', 'r') as f:
        p = "|".join([l.strip() for l in f])

>>> p = re.compile(p)

Traceback (most recent call last):
  File "<pyshell#250>", line 1, in <module>
    p = re.compile(p)
  File "C:\Python26\lib\re.py", line 188, in compile
    return _compile(pattern, flags)
  File "C:\Python26\lib\re.py", line 241, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Python26\lib\sre_compile.py", line 529, in compile
    groupindex, indexgroup
OverflowError: regular expression code size limit exceeded
Robert Rossney
seems to be similar to the Aho-Corasick thing suggested by @johnmachin - would be interested in the speed differences between the two...
Guy Bowden
Aho-Corasick is faster. But the speed difference between the two is a function of the length of the string being searched, not the size of the dictionary. If you're searching relatively long strings, it's probably worth the extra complexity.
Robert Rossney
A: 

The algorithm for this job is Aho-Corasick ... see the link at the bottom which points to a C-extension for Python.
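For reference, a rough sketch using one such C extension, the pyahocorasick package (its ahocorasick.Automaton API is assumed here, and it is not necessarily the extension linked):

import ahocorasick   # the pyahocorasick C extension (assumed installed)

phrases = ['inputting some text', 'extract the data', 'a phrase not here']

automaton = ahocorasick.Automaton()
for phrase in phrases:
    automaton.add_word(phrase, phrase)   # store the phrase itself as the payload
automaton.make_automaton()               # builds the Aho-Corasick failure links

text = "I then have a user inputting some text into a form"
found = set(payload for end_index, payload in automaton.iter(text))
print dict((p, 1 if p in found else 0) for p in phrases)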

John Machin