I want to find the first index of a substring in a larger string. I only want it to match whole words, and I'd like the match to be case-insensitive, except that I want CamelCase treated as separate words.

The code below does the trick, but it's slow. I'd like to speed it up. Any suggestions? I tried some regex approaches, but couldn't find a pattern that handled all the edge cases.

def word_start_index(text, seek_word):
    start_index = 0
    curr_word = ""
    def case_change():
        # a lower->upper transition marks a CamelCase word boundary
        return curr_word and ch.isupper() and curr_word[-1].islower()
    def is_match():
        return curr_word.lower() == seek_word.lower()
    for i, ch in enumerate(text):
        if case_change() or not ch.isalnum():
            # a word just ended; check it, then start collecting the next one
            if is_match():
                return start_index
            curr_word = ""
            start_index = None
        if ch.isalnum():
            if start_index is None:
                start_index = i
            curr_word += ch
    # the text may end mid-word
    if is_match():
        return start_index

if __name__ == "__main__":
    #            01234567890123456789012345678
    test_text = "a_foobar_FooBar baz golf_CART"
    test_words = ["a", "foo", "bar", "baz", "golf", "cart", "fred"]

    for word in test_words:
        match_start = word_start_index(test_text, word)
        print match_start, word

Output:

0 a
9 foo
12 bar
16 baz
20 golf
25 cart
None fred
+2  A: 

If I were doing this with regular expressions I'd probably do it like this:

import re

def word_start_index2(text, seek_word):
    # CamelCase form of the word, e.g. 'foobar' -> 'Foobar'
    camel_case = seek_word[0].upper() + seek_word[1:].lower()
    # case-insensitive form built character class by character class,
    # e.g. 'foo' -> '[fF][oO][oO]'
    seek_word_i = ''.join('[' + c.lower() + c.upper() + ']'
                          for c in seek_word)
    # regex1: the word delimited by non-letters (or string boundaries)
    regex1 = r'(?:(?<=[^a-zA-Z])|^)' + seek_word_i + r'(?=$|[^a-zA-Z])'
    # regex2: the CamelCase form, also delimited by case changes
    regex2 = r'(?:(?<=[a-z]|[^A-Z])|^)' + camel_case + r'(?=$|[A-Z]|[^a-z])'
    regex = '%s|%s' % (regex1, regex2)
    m = re.search(regex, text)
    if not m:
        return None
    return m.start()

I haven't performance-tested this against your version, but you could try it to see whether it's faster and let us know.
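
If you want to compare the two, a quick, untested timeit sketch along these lines should work; it assumes both functions are importable from a module, here called wordsearch:

import timeit

setup = 'from wordsearch import word_start_index, word_start_index2'
stmt1 = 'word_start_index("a_foobar_FooBar baz golf_CART", "cart")'
stmt2 = 'word_start_index2("a_foobar_FooBar baz golf_CART", "cart")'

# each statement runs 10000 times; the module (and re) is imported once
print timeit.timeit(stmt1, setup, number=10000)
print timeit.timeit(stmt2, setup, number=10000)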

My answer might give different output from yours on some edge cases, but in your comments you said that you don't care about those cases.

Also, I tried using the (?i) notation to mark only part of the regex as case-insensitive, but it didn't do what I wanted: in Python's re, an inline (?i) applies to the entire pattern no matter where it appears, so it would make the case-sensitive CamelCase branch case-insensitive as well.
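
A quick demonstration of that global behaviour (Python 2.x; later versions restrict where inline flags may appear):

import re

# (?i) appears mid-pattern, but IGNORECASE is applied to the whole
# pattern, so the 'abc' part matches 'ABC' too
print re.search(r'abc(?i)DEF', 'ABCdef')  # -> a match object, not None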

Final self-nitpick: the function should validate its arguments, but I've omitted that code for clarity. You should add checks at least for the following (a sketch follows the list):

  • text should be a string
  • seek_word should be a string matching '[a-zA-Z]+'
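
Something like this untested sketch would cover both checks (validate_args is just an illustrative name):

import re

def validate_args(text, seek_word):
    # text must be a string
    if not isinstance(text, str):
        raise TypeError('text must be a string')
    # seek_word must be a non-empty, letters-only string
    if not isinstance(seek_word, str) or not re.match(r'[a-zA-Z]+\Z', seek_word):
        raise ValueError("seek_word must match '[a-zA-Z]+'")
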
Mark Byers
This is a good answer too.
Jesse Aldridge
How was the performance of this solution? How did you performance test? Can you post the performance testing code you used? When performance testing, I hope you remembered to load the re module only once and not on every call!
Mark Byers
Also, for anyone else reading that may choose to use this solution: if you remove the letters-only constraint on seek_word, you need to use `re.escape`.
Mark Byers
Also note to others reading who want to use this solution: if you are using the same search on multiple files, there would be a performance advantage in constructing the regex only once and reusing it for each file.
Mark Byers
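
For anyone applying the last two comments, here is a rough, untested sketch that escapes the word and compiles the pattern once for reuse across files (make_word_regex and the texts list are made-up names):

import re

def make_word_regex(seek_word):
    # build and compile the pattern once; re.escape each character in
    # case the letters-only constraint on seek_word is ever relaxed
    ci = ''.join('[%s%s]' % (re.escape(c.lower()), re.escape(c.upper()))
                 for c in seek_word)
    camel = re.escape(seek_word[0].upper() + seek_word[1:].lower())
    pattern = (r'(?:(?<=[^a-zA-Z])|^)' + ci + r'(?=$|[^a-zA-Z])'
               r'|(?:(?<=[a-z]|[^A-Z])|^)' + camel + r'(?=$|[A-Z]|[^a-z])')
    return re.compile(pattern)

texts = ["a_foobar_FooBar baz golf_CART"]  # stand-in for your file contents
regex = make_word_regex('cart')            # compile once...
for text in texts:                         # ...reuse for every file
    m = regex.search(text)
    print m.start() if m else None         # -> 25
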
+3  A: 

piece_emitter (below) takes a text string and yields "pieces" as they are found, one at a time, along with their positions; word_emitter then expands each piece into the camelCase sub-words it contains. (The comparison in camel_search lowercases both sides, so matching stays case-insensitive.)

It translates underscores and punctuation to spaces and then splits the text on whitespace. For example,

"a_foobar_FooBar baz golf_CART Foo"

becomes

['a', 'foobar', 'FooBar', 'baz', 'golf', 'CART', 'Foo']

Of course, you also want camelCase words to be treated as separate words. So for each piece in the above list, camel_splitter breaks it at case changes: it groups consecutive characters with itertools.groupby keyed on ch.isupper(), then merges each lone capital with the lowercase run that follows it. (An earlier version used re.split with the look-ahead pattern '(.*[a-z])(?=[A-Z])'; the groupby version avoids re entirely.) Perhaps that is the trickiest part of the whole thing.
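
To see the raw grouping that camel_splitter starts from, here is a quick illustrative snippet:

from itertools import groupby

# group consecutive characters by case
print [''.join(g) for k, g in groupby('getFooBar', str.isupper)]
# prints ['get', 'F', 'oo', 'B', 'ar']; camel_splitter then merges each
# lone capital with the following lowercase run: 'get', 'Foo', 'Bar'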

word_emitter then yields the candidate words one at a time, along with their positions (offsets within the piece).

Once you have a function which splits the text into "words", the rest is easy.

I've also switched the order of your loops, so you only loop through test_text once. This speeds things up when test_text is long compared to test_words.

import string
import itertools

# translate underscores and punctuation to spaces; the two strings have
# the same length, so character positions are preserved
table = string.maketrans(
    '_.,!?;:"(){}@#$%^&*-+='+"'",
    '                       ',
    )

def piece_emitter(text):
    # This generator splits text into 2-tuples of (positions,pieces).
    # Given "a_foobar_FooBar" it returns
    # ((0,'a'),
    #  (2,'foobar'),
    #  (9,'FooBar'),
    #  )
    pos=0
    it=itertools.groupby(text,lambda w: w.isspace())
    for k,g in it:
        w=''.join(g)
        w=w.translate(table)
        it2=itertools.groupby(w,lambda w: w.isspace())
        for isspace,g2 in it2:
            word=''.join(g2)
            if not isspace:
                yield pos,word
            pos+=len(word)

def camel_splitter(word):
    # Given a word like 'FooBar', this generator yields
    # 'Foo', then 'Bar'.
    it=itertools.groupby(word,lambda w: w.isupper())
    for k,g in it:
        w=''.join(g)
        if len(w)==1:
            try:
                k1,g1=next(it)
                w+=''.join(g1)
            except StopIteration:
                pass
        yield w

def word_emitter(piece):
    # Given 'getFooBar', this generator yields in turn the elements of the sequence
    # ((0,'get'),
    #  (0,'getFoo'),
    #  (0,'getFooBar'),
    #  (3,'Foo'),
    #  (3,'FooBar'),
    #  (6,'Bar'), 
    #  )
    # In each 2-tuple, the number is the starting position of the string,
    # followed by the fragment of camelCase word generated by camel_splitter.
    words=list(camel_splitter(piece))
    num_words=len(words)
    for i in range(0,num_words+1):
        prefix=''.join(words[:i])
        for step in range(1,num_words-i+1):
            word=''.join(words[i:i+step])
            yield len(prefix),word

def camel_search(text,words):
    words=dict.fromkeys(words,False)
    for pos,piece in piece_emitter(text):        
        if not all(words[test_word] for test_word in words):
            for subpos,word in word_emitter(piece):
                for test_word in words:
                    if not words[test_word] and word.lower() == test_word.lower(): 
                        yield pos+subpos,word
                        words[test_word]=True
                        break
        else:
            break
    for word in words:
        if not words[word]:
            yield None,word

if __name__ == "__main__":    
    #            01234567890123456789012345678
    test_text = "a_foobar_FooBar baz golf_CART"
    test_words = ["a", "foo", "bar", "baz", "golf", "cart", "fred"]
    for pos,word in camel_search(test_text,test_words):
        print pos,word.lower()
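
For the test strings above, this should print the same results as the question's version:

0 a
9 foo
12 bar
16 baz
20 golf
25 cart
None fred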

Here are the unit tests I used to check the program:

import unittest
import sys
import camel
import itertools

class Test(unittest.TestCase):
    def check(self,result,answer):
        for r,a in itertools.izip_longest(result,answer):
            if r!=a:
                print('%s != %s'%(r,a))
            self.assertTrue(r==a)

    def test_piece_emitter(self):
        tests=(("a_foobar_FooBar baz? golf_CART Foo 'food' getFooBaz",
                ((0,'a'),
                 (2,'foobar'),
                 (9,'FooBar'),
                 (16,'baz'),
                 (21,'golf'),
                 (26,'CART'),
                 (31,'Foo'),
                 (36,'food'),
                 (42,'getFooBaz'),
                )
                ),
            )
        for text,answer in tests:
            result=list(camel.piece_emitter(text))
            print(result)
            self.check(result,answer)
    def test_camel_splitter(self):
        tests=(('getFooBar',('get','Foo','Bar')),
               ('getFOObar',('get','FOO','bar')),
               ('Foo',('Foo',)),
               ('getFoo',('get','Foo')),
               ('foobar',('foobar',)),
               ('fooBar',('foo','Bar')),
               ('FooBar',('Foo','Bar')),
               ('a',('a',)),
               ('fooB',('foo','B')),
               ('FooB',('Foo','B')),               
               ('FOOb',('FOO','b')),                              
               )
        for word,answer in tests:
            result=camel.camel_splitter(word)
            self.check(result,answer)            
    def test_word_emitter(self):
        tests=(("a",
                ((0,'a'),) ),
               ('getFooBar',
                ((0,'get'),
                 (0,'getFoo'),
                 (0,'getFooBar'),
                 (3,'Foo'),
                 (3,'FooBar'),
                 (6,'Bar'), 
                 )                
                )
            )
        for text,answer in tests:
            result=list(camel.word_emitter(text))
            print(result)
            self.check(result,answer)

    def test_camel_search(self):
        tests=(("a_foobar_FooBar baz? golf_CART Foo 'food' getFooBaz",
                ("a", "foo", "bar", "baz", "golf", "cart", "fred", "food",
                  'FooBaz'),
                ((0,'a'),
                 (9,'Foo'),
                 (12,'Bar'),
                 (16,'baz'),
                 (21,'golf'),
                 (26,'CART'),
                 (36,'food'),
                 (45,'FooBaz'),
                 (None,'fred')
                )
                ),
               ("\"Foo\"",('Foo',),((1,'Foo'),)),
               ("getFooBar",('FooBar',),((3,'FooBar'),)),                              
            )
        for text,search_words,answer in tests:
            result=list(camel.camel_search(text,search_words))
            print(result)
            self.check(result,answer)

if __name__ == '__main__':
    unittest.main(argv=sys.argv + ['--verbose'])
unutbu
How does this handle searching "Foo!" for the string "foo"?
Mark Byers
@Mark: Indeed, good point. The OP has not yet specified how he wants this handled, but I suppose a natural thing to do would be to squash punctuation by replacing all punctuation marks with spaces. I'm editing my script to handle punctuation this way.
unutbu
I think this is my favorite answer so far. It's the fastest (almost 10x faster than mine!) And easy to understand. Great work.
Jesse Aldridge
How does it handle searching for 'FooBar' in 'FooBar'?
Mark Byers
I was curious so I decided to test it, and it seems to me that this fails to give the correct answer when searching 'FooBar' in 'FooBar' unfortunately :( . Is there any way to adapt your solution to fix this bug without making it slower? I don't see a simple fix, but maybe I'm missing something.
Mark Byers
@Mark: My code now handles searching for "FooBar", by yielding `pos,piece` as well as `pos,word`.
unutbu
OK, that's a nice simple fix +1. I think it's going to make it about twice as slow though. For typical text I think it will return each word twice, once as a piece, and once as a word... is that right? Since the OP is looking for the fastest solution, not the most correct one, your previous version might have been better for his needs. But personally I prefer this one as I think "correct" *nearly* always is better than "fast but wrong". (There are exceptions to this rule, but I'd say this isn't one of them IMHO, OP's opinion may differ).
Mark Byers
I guess this question is closed now, but in case other people are reading: another possible bug is that it appears to fail for ("\"Foo\"", "Foo"). One thing I don't like about this solution is the high memory usage, and that it parses the whole file even when the match is at the start, which could be slow for very large files. A good point, though, is that it wouldn't take a lot of effort to make it work streaming. My regex solution is not easy to stream (but a positive point is that it stops early if it finds a match, and needs little memory).
Mark Byers
I have tested a little more and there are a few more things to note: one is that it doesn't handle upper case in `test_text` (easy to fix). Another thing is: what should it do with ("getFooBar", "FooBar")? Both our solutions miss this one. My solution could easily be modified to pick this up as a match, but yours I think cannot. I'll ask OP what he wants.
Mark Byers
@Mark: I've reworked my solution to address as many of the points you've raised as I could. `("\"Foo\"","Foo")` and `("getFooBar","FooBar")` should now work. I've included unit tests to demonstrate. I'm not sure if I've addressed your point about "high memory usage". The code no longer uses re.split, so I think I am being memory-friendly, but please tell me if I'm missing something.
unutbu
It seems to be correct now for all the test cases I can think of (apart from coming up with more punctuation symbols: '(){}' etc. but never mind that). It's still doing a translate on the entire file before it tries to search. If you are searching a 100MB file and the first match is near the beginning, that's a lot of unnecessary work. The rest seems nice and streaming though. Note: this uses Python 3's next(it) - so people using Python 2 will need to change it to it.next().
Mark Byers
@Mark: I'm not sure what I can do to remove the `text.translate` call... :/ Thanks for pointing that out, though. `next(it)` also works in Python 2.6+, which is what I'm using. On an unrelated note, but equally fascinating (at least to me) -- how did you see my comment so fast? Is there a way to get automatically notified?
unutbu
@Mark: I managed to replace the big `text.translate` call with many little ones. I split on spaces first, squash punctuation, then split on spaces again. Since I use `itertools.groupby` to do the splitting, I think it truly is memory-friendly now.
unutbu
@Mark: I added a check to break out of the main loop early if all test words have been found. Thanks for all your help in making this answer better.
unutbu
The translate call fails when text is unicode. I fixed by decoding it to an ascii string like so: str(w.decode("ascii", "replace")). Not sure if there's a better way.
Jesse Aldridge
Make that *en*code, not decode.
Jesse Aldridge
fails to find "text" in "QTextBrowser"
Jesse Aldridge
Also, with all the modifications this now runs significantly slower than Mark's answer. I'm going to switch my accepted answer. (sorry)
Jesse Aldridge
@Jesse: That's fine, :) Handling some problem-cases like `("getFooBar","FooBar")` definitely makes my script slower. I like my method mainly because I think it is a fairly organized way to break the problem apart, more easy to modify and maintain than regex. If this is not something you need, then by all means, use the method that is most useful to you.
unutbu
~unutbu: Erm... sorry :( I think your solution is better now than it was before - matches more correctly, and handles large inputs better. But as I said Jesse seems to be looking for fastest, not most correct, and only on short strings, so the improvements are not what he wants. But... I think he maybe will change his mind after he realises how hard it will be to maintain that regex. ;) I didn't particularly like this answer when it was accepted, but now with the changes I think this answer is more suitable than mine. There are still a few issues with it, but I think they're fixable. :)
Mark Byers
~unutbu: And also my solution does not handle 'getFooBar' yet... first I thought it would be an easy fix, but thinking about it more carefully, even if I fix that, there are still some other cases my regex won't easily handle correctly... fixing all of them would make it horribly unreadable, and probably quite a bit slower too. When I made my solution I was thinking in terms of a quick fix for a specific task, not as a reusable component... I think your solution is more suitable for this role.
Mark Byers
No worries, gents. Programming is just a hell of lot of fun, even when I manage to turn an accepted answer into a bowl of spaghetti ;)
unutbu
+1  A: 

With an index to speed up searching :-)

from collections import defaultdict

class IndexedText(object):
    """ an indexed text """
    def __init__(self, text):
        self.text = text
        self._index()

    def word_start_index(self, word):
        # the index keys are lowercased, so look up the lowercased word;
        # return the first occurrence, or None if the word never appears
        positions = self.index.get(word.lower())
        return positions[0] if positions else None

    def _index(self):
        self.index = defaultdict(list)

        def index(word, pos):
            if word:  # skip empty fragments
                self.index[word.lower()].append(pos)

        start = 0
        it = enumerate(self.text)
        lpos, lchar = it.next()
        WS = (' ', '_')

        for pos, char in it:
            if lchar in WS and char not in WS:
                # a new word begins here; record the one that just ended
                # (the rstrip guards against runs of separators)
                index(self.text[start:lpos].rstrip(' _'), start)
                start = pos
            elif lchar.islower() and char.isupper():  # camelCase boundary
                index(self.text[start:pos], start)
                start = pos
            lpos, lchar = pos, char

        # the last word has no separator after it
        index(self.text[start:], start)

if __name__ == "__main__":
    #            01234567890123456789012345678
    test_text = "a_foobar_FooBar baz golf_CART"
    test_words = ["a", "foo", "bar", "baz", "golf", "cart", "fred"]

    index = IndexedText( test_text )

    for word in test_words:
        match_start = index.word_start_index( word )
        print match_start, word
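
The payoff comes when the same text is searched many times: _index makes a single pass over the text, and every later lookup is just a dict access. A rough usage sketch (the word list here is arbitrary):

idx = IndexedText(test_text)          # one indexing pass over the text
for word in ["foo", "bar", "cart"]:   # any number of cheap lookups
    print idx.word_start_index(word), word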
THC4k
Nice idea with the caching, but note that it could actually end up being slower this way if each text is only searched once. I've asked the OP a clarifying question about this. Also how does this handle searching "Foo!" for the string "foo"?
Mark Byers
Yes the index is a nice idea. I don't think I want it right now, but I'll keep it in mind for the future.
Jesse Aldridge
Also, how does this handle searching for 'FooBar' in 'FooBar'?
Mark Byers