views: 198
answers: 9

I'm writing a python program that deals with a fair amount of strings/files. My problem is that I'm going to be presented with a fairly short piece of text, and I'm going to need to search it for instances of a fairly broad range of words/phrases.

I'm thinking I'll need to compile regular expressions as a way of matching these words/phrases in the text. My concern, however, is that this will take a lot of time.

My question is how fast is the process of repeatedly compiling regular expressions, and then searching through a small body of text to find matches? Would I be better off using some string method?

Edit: So, I guess an example of my question would be: how expensive would it be to compile and search with one regular expression versus, say, iterating 'if "word" in string' 5 times?

A: 

If you're just searching for a particular substring, use str.find() instead.
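For example (a minimal sketch; the text and the substring are made up):

text = "a short piece of text to scan"

# str.find returns the index of the first occurrence, or -1 if absent
if text.find("piece") != -1:
    print("found with str.find")

# the `in` operator is usually the most readable way to ask a yes/no question
if "piece" in text:
    print("found with in")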

Amber
A: 

Depending on what you're doing it might be better to use a tokenizer and loop through the tokens to find matches.

However, when it comes to short pieces of text, regexes have incredibly good performance. Personally, I only remember running into problems when text sizes became ridiculous, like 100k words or something like that.

Furthermore, if you are worried about the speed of actual regex compilation rather than matching, you might benefit from creating a daemon that compiles all the regexes up front and then either goes through all the pieces of text in a big loop or runs as a service. That way you will only have to compile the regexes once.
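A minimal sketch of that compile-once idea (the patterns and texts below are just placeholders):

import re

# compile every pattern exactly once, at startup
PATTERNS = [re.compile(p) for p in (r"\bfoo\b", r"\bbar\b", r"\bbaz qux\b")]

def find_hits(text):
    # return the source of every pattern that occurs in a short piece of text
    return [p.pattern for p in PATTERNS if p.search(text)]

for text in ("some foo here", "nothing to see", "bar and baz qux"):
    print(find_hits(text))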

Swizec Teller
(1) Most of the regex engines used in Python/Perl/etc are much slower than they could be (if an enormous amount of work was put into optimisation) i.e. NOT "incredibly good" (2) in many places, a text of 100K words would be considered ridiculously small (3) a regex like `"foo|bar|baz|zot|tox"` is rather likely to cause the text to be searched 5 times instead of running an automaton over it once.
John Machin
A: 

In the general case, you can use the "in" keyword:

for line in open("file"):
    if "word" in line:
        print line.rstrip()

regex is usually not needed when you use Python :)

ghostdog74
+3  A: 

If speed is of the essence, you are better off running some tests before you decide how to code your production application.

First of all, you said that you are searching for words, which suggests that you may be able to break the string up on whitespace with split() and then use simple string comparisons to do your search.

Definitely do compile your regular expressions and do a timing test comparing that with the plain string functions. Check the documentation for the str type for a full list of its methods.
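A rough sketch of the split-and-compare approach, which you could then time against a compiled regex (the word list here is made up):

words_to_find = {"foo", "bar", "baz"}          # the words you care about
text = "a short piece of text with foo and baz in it"

# split on whitespace, then let set intersection do the comparisons
hits = words_to_find.intersection(text.split())
print(sorted(hits))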

Michael Dillon
+1  A: 

When you compile the regexp, it is converted into a state machine representation. Provided the regexp is efficiently expressed, it should still be very fast to match. Compiling the regexp can be expensive though, so you will want to do that up front, and as infrequently as possible. Ultimately though, only you can answer if it is fast enough for your requirements.

There are other string searching approaches, such as the Boyer-Moore algorithm. But I'd wager the complexity of searching for multiple separate strings is much higher than a regexp that can switch off each successive character.

gavinb
You almost answered my question here, the compilation of regular expressions is expensive... How expensive would it be to compile and search with one regular expression versus, say, iterating 'if "word" in string' 5 times?
Wilduck
@gavinb: (1) "state machine representation" and "switch off each successive character" sounds like you think that Python's RE engine uses DFAs. It doesn't. (2) It seems that you are unaware that compilation of regexes is cached. If the caller doesn't pre-compile regexes, there is little penalty.
John Machin
@John Machin, I did assume the RE engine used a DFA. I'll have to have a closer look at the implementation.
gavinb
@Wilduck: I doubt 5 iterations would be enough to base a performance comparison on. As Robert's answer shows, regular expressions can easily be significantly faster than using the 'in' operator. It is good practice to compile the regexp first, but since they are cached, the cost is only borne on the first usage.
gavinb
+3  A: 

You should try to compile all your regexps into a single one using the | operator. That way, the regexp engine will do most of the optimizations for you. Use the grouping operator () to determine which regexp matched.
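A minimal sketch of the combined pattern (the alternatives are made up):

import re

# one combined pattern; each alternative gets its own capturing group
combined = re.compile(r"(axxx)|(byyy)|(czzz)")

m = combined.search("some text with byyy inside")
if m:
    # lastindex is the 1-based number of the group that actually matched
    print("group %d matched %r" % (m.lastindex, m.group(m.lastindex)))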

Aaron Digulla
In what way does Python optimize your regular expression?
Gumbo
For example, `r'axxx|byyy'` will be much faster than checking for `r'axxx'` and then for `r'byyy'` individually since the regexp engine will do the "switch" for you (it will check the first character and then ignore one of the patterns). So more complex regexps don't have to mean "slower".
Aaron Digulla
@Aaron Digulla: "it will check [that] the first character" of which string satisfies what condition? Can you point to where in the source code it does this. For a text of size T bytes and N query strings, are you saying that the best case time is O(T) or O(NT)?
John Machin
The regexp engine will first try to find one of the leading characters of the various subexpressions. If it can't find them, it'll be O(TN) (the regexp engine will check a few characters of each regexp per text position). If you use individual regexps, it'll be sum(O(T*Rx)) where Rx is the size of each regexp.
Aaron Digulla
+1  A: 

If you want to know how fast regex pattern compilation is, you need to benchmark it.

Here is how I do that. It compiles each pattern 1 million times.

import time, re

def taken(f):
    # decorator that reports how long the wrapped function took
    def wrap(*arg):
        t1, r, t2 = time.time(), f(*arg), time.time()
        print t2 - t1, "s taken"
        return r
    return wrap

@taken
def regex_compile_test(x):
    # compile the same pattern 1,000,000 times
    for i in range(1000000):
        re.compile(x)
    print "for", x,

# sample tests
regex_compile_test("a")
regex_compile_test("[a-z]")
regex_compile_test("[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}")

It took around 5 seconds for each pattern on my computer.

for a 4.88999986649 s taken
for [a-z] 4.70300006866 s taken
for [A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4} 4.78200006485 s taken

The real bottleneck is not in compiling the patterns; it's in extracting text with re.findall or replacing with re.sub. If you use those against several MB of text, it's quite slow.

If your text is fixed, use normal str.find; it's faster than regex.

Actually, if you give us samples of your text and your regex patterns, we could give you a better idea; there are many great regex and Python guys out there.

Hope this helps; sorry if my answer couldn't help you.

S.Mark
You might want to take a look at the `timeit` module.
Robert Rossney
Thanks for the info about the `timeit` module, I didn't know about that.
S.Mark
+2  A: 

Your requirement appears to be searching a text for the first occurrence of any one of a collection of strings. Presumably you then wish to restart the search to find the next occurrence, and so on until the searched string is exhausted. Only plain old string comparison is involved.

The classic algorithm for this task is Aho-Corasick for which there is a Python extension (written in C). This should beat the socks off any alternative that's using the re module.
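As a rough illustration, this is what the approach looks like with the pyahocorasick package (one Aho-Corasick C extension on PyPI; the API shown is taken from its documentation, and the query words are made up):

import ahocorasick  # pip install pyahocorasick

automaton = ahocorasick.Automaton()
for word in ("he", "she", "his", "hers"):
    automaton.add_word(word, word)   # store each word as its own payload
automaton.make_automaton()           # build the failure links

# a single pass over the text reports every occurrence of every word
for end_index, word in automaton.iter("ushers"):
    print("%r ends at index %d" % (word, end_index))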

John Machin
+1  A: 

This is a question that can readily be answered by just trying it.

>>> import re
>>> import timeit
>>> find = ['foo', 'bar', 'baz']
>>> pattern = re.compile("|".join(find))
>>> with open('c:\\temp\\words.txt', 'r') as f:
        words = f.readlines()

>>> len(words)
235882
>>> timeit.timeit('r = filter(lambda w: any(s for s in find if w.find(s) >= 0), words)', 'from __main__ import find, words', number=30)
18.404569854548527
>>> timeit.timeit('r = filter(lambda w: any(s for s in find if s in w), words)', 'from __main__ import find, words', number=30)
10.953313759150944
>>> timeit.timeit('r = filter(lambda w: pattern.search(w), words)', 'from __main__ import pattern, words', number=30)
6.8793022576891758

It looks like you can reasonably expect regular expressions to be faster than using find or in. Though if I were you I'd repeat this test with a case that was more like your real data.

Robert Rossney