views: 240
answers: 6
I need to quickly extract text from HTML files. I am using the following regular expressions instead of a full-fledged parser since I need to be fast rather than accurate (I have more than a terabyte of text). The profiler shows that most of the time in my script is spent in the re.sub procedure. What are good ways of speeding up my process? I can implement some portions in C, but I wonder whether that will help given that the time is spent inside re.sub, which I think would be efficiently implemented.

import re

# Remove scripts, styles, tags, entities, and extraneous spaces:
scriptRx    = re.compile(r"<script.*?/script>", re.I)
styleRx     = re.compile(r"<style.*?/style>", re.I)
tagsRx      = re.compile(r"<[!/]?[a-zA-Z-]+[^<>]*>")
entitiesRx  = re.compile(r"&[0-9a-zA-Z]+;")
spacesRx    = re.compile(r"\s{2,}")
....
text = scriptRx.sub(" ", text)
text = styleRx.sub(" ", text)
....

Thanks!

+7  A: 

First, use an HTML parser built for this, like BeautifulSoup:

http://www.crummy.com/software/BeautifulSoup/

Then, you can identify any remaining slow spots with the profiler:

http://docs.python.org/library/profile.html

And for learning about regular expressions, I've found Mastering Regular Expressions very valuable, no matter what the programming language:

http://oreilly.com/catalog/9781565922570

Also:

http://stackoverflow.com/questions/606350/how-can-i-debug-a-regular-expression-in-python

Given the clarified use-case, the above is not what you want for this task. My alternate recommendation is here: http://stackoverflow.com/questions/3277239/speeding-up-regular-expressions-in-python/3277705#3277705

eruciform
lxml might be faster than BeautifulSoup; you should try both.
Juri Robl
Obviously, doing a complete parse can never be faster than substituting a few regular expressions, unless there is some super-duper HTML parsing library. (I tried BS; it's an order of magnitude slower than the regexes I am using.)
Abhi
The one downside to BeautifulSoup is its speed: it's very slow.
Ned Batchelder
@Abhi: well, it depends on what you mean by "parsing" I suppose. Many parsers (and BS may be among them) use regular expressions to identify the tags and attributes, and those would be slow for the same reason all regular expressions are slow. But one could conceivably write a parser as a finite state machine, which goes through one character at a time and alters its state based on the character and current state, and that would be a _lot_ faster. (Quite complicated to code for HTML though)
David Zaslavsky
I'd say David Zaslavsky is spot on: if you're just doing simple removal of portions of HTML, write a character-by-character parser. A lot of the complexity a parser would deal with (tags / nesting) is nonexistent here, since you don't care to remember them, just to remove them.
Wrikken
updated with a link to another SO question on turning on debugging for regexes, which might also help profile them.
eruciform
@all: it's true that bs is slower, but it also depends on his use case. if he's parsing it once and then doing tons of operations on it, it still might be better. though, it might be far worse if he needs to constantly parse new html...
eruciform
@David: I understand. I thought re.compile's job *is* to analyze the regex and effectively create an FSM, but I am not sure whether and how well it can do that. But yes, as @Wrikken pointed out, I am interested in a very special FSM that drops characters, which is my ultimate goal. Anyway, as you said, it will be complicated and probably won't justify the added coding/debugging time. Thanks!
Abhi
BTW, do you know whether regex implementations in today's programming languages are strictly equivalent to the regular languages/FSM formalism, or do they represent a superset? Thanks!
Abhi
@eruciform: I have to extract text from 50 million HTML docs of Web-average length. So I end up calling that set of regexes on each of the docs, one by one.
Abhi
@abhi - that depends on the language; they all overlap quite a bit, but each one has a few things the others don't. personally, i've found the one in perl to be the most powerful (powerful enough to blow your foot off if you're not careful, like perl usually is :-)
eruciform
posted separate answer now that the use-case is more clear: http://stackoverflow.com/questions/3277239/speeding-up-regular-expressions-in-python/3277705#3277705
eruciform
+1  A: 

one thing you can do is combine the script/style regexes using backreferences. here's some sample data:

$ cat sample 
<script>some stuff</script>
<html>whatever </html>
<style>some other stuff</style>

using perl:

perl -ne 'if (/<(script|style)>.*?<\/\1>/) { print $1; }' sample

it will match either script or style. I second the recommendation for "mastering regular expressions", it's an excellent book.
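
In Python, a rough equivalent of that combined pattern might look like this (just a sketch; like the perl version it assumes bare <script>/<style> tags without attributes, as in the sample data, and \1 refers back to whichever tag name was captured):

import re

# combined script/style pattern with a backreference (sketch);
# re.S lets the element body span multiple lines
scriptOrStyleRx = re.compile(r"<(script|style)>.*?</\1>", re.I | re.S)

sample = "<script>some stuff</script>\n<html>whatever </html>\n<style>some other stuff</style>\n"
print(scriptOrStyleRx.sub(" ", sample))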

Paul Sanwald
+1  A: 

The suggestion to use an HTML parser is a good one, since it'll quite possibly be faster than regular expressions. But I'm not sure BeautifulSoup is the right tool for the job, since it constructs a parse tree from the entire file and stores the whole thing in memory. For a terabyte of HTML, you'd need an obscene amount of RAM to do that ;-) I'd suggest you look at HTMLParser, which is written at a lower level than BeautifulSoup, but I believe it's a stream parser, so it will only load a bit of the text at a time.
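
For example, a minimal sketch of stream-style extraction might look like this (using Python 3's html.parser; in Python 2 the module is named HTMLParser, and 'page.html' is just a placeholder file name):

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip = 0        # nesting depth inside <script>/<style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        # keep text only when not inside a script/style element
        if not self.skip:
            self.chunks.append(data)

extractor = TextExtractor()
extractor.feed(open('page.html').read())   # 'page.html' is a placeholder
print(' '.join(extractor.chunks))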

David Zaslavsky
Sorry, I meant 50 million HTML files. It's not one big piece of text. Anyway, HTMLParser didn't work: This is HTML-in-the-wild. Is there a clean way to extract all the text from BeautifulSoup's parse tree?
Abhi
Ah, gotcha. I don't know all that much about BeautifulSoup, but I'd imagine it might be possible to extract the text. Did you try just converting it to a string? i.e. `soup = BeautifulSoup(...)` then `str(soup)` (I don't know if that's it, but that'd be my first guess if I were working on your project)
David Zaslavsky
Thanks for getting back :) But I typed that comment in haste. BS is an order of magnitude slower than regex substitutions, so extracting text from its parse is moot :P
Abhi
+1  A: 

If your use-case is indeed to parse a few things for each of millions of documents, then my above answer won't help. I recommend some heuristics, like doing a couple "straight text" regexes on them to begin with - like just plain /script/ and /style/ to throw things out quickly if you can. In fact, do you really need to do the end-tag check at all? Isn't <style good enough? Leave validation for someone else. If the quick ones succeed, then put the rest into a single regex, like /<script|<style|\s{2,}|etc.../ so that it doesn't have to go through so much text once for each regex.
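
A rough sketch of that idea, reusing the patterns from the question (the cheap substring checks stand in for the "straight text" pre-filter, and the clean function name is just for illustration):

import re

# one combined pass instead of five separate ones (patterns borrowed from
# the question; re.S lets script/style bodies span newlines)
combinedRx = re.compile(
    r"<script.*?</script>"
    r"|<style.*?</style>"
    r"|<[!/]?[a-zA-Z-]+[^<>]*>"
    r"|&[0-9a-zA-Z]+;"
    r"|\s{2,}",
    re.I | re.S)

def clean(text):
    # cheap pre-filter: skip the regex pass entirely for plain-text input
    if '<' in text or '&' in text:
        text = combinedRx.sub(' ', text)
    return text

print(clean('<p>Hello&nbsp;world <script>var x = 1;</script></p>'))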

eruciform
+2  A: 

You're processing each file five times, so the first thing you should do (as Paul Sanwald said) is try to reduce that number by combining your regexes together. I would also avoid using reluctant quantifiers, which are designed for convenience at the expense of efficiency. Consider this regex:

<script.*?</script>

Each time the . goes to consume another character, it first has to make sure </script> won't match at that spot. It's almost like doing a negative lookahead at every position:

<script(?:(?!</script>).)*</script>

But we know there's no point doing the lookahead if the next character is anything but <, and we can tailor the regex accordingly:

<script[^<]*(?:<(?!/script>)[^<]*)*</script>

When I test them in RegexBuddy with this target string:

<script type="text/javascript">var imagePath='http://sstatic.net/stackoverflow/img/';</script>

...the reluctant regex takes 173 steps to make the match, while the tailored regex takes only 28.
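
If you want a rough sense of the same effect in Python, a quick timeit comparison might look like this (a sketch; the re module doesn't report step counts, and the test string and repeat count here are made up):

import re, timeit

reluctant = re.compile(r"<script.*?</script>", re.I | re.S)
unrolled  = re.compile(r"<script[^<]*(?:<(?!/script>)[^<]*)*</script>", re.I)

# made-up test document: one long script element
doc = '<script type="text/javascript">' + "var x = 1; " * 500 + "</script>"

print(timeit.timeit(lambda: reluctant.search(doc), number=2000))
print(timeit.timeit(lambda: unrolled.search(doc), number=2000))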

Combining your first three regexes into one yields this beast:

<(?:(script|style)[^<]*(?:<(?!/\1)[^<]*)*</\1>|[!/]?[a-zA-Z-]+[^<>]*>)

You might want to zap the <HEAD> element while you're at it (i.e., (script|style|head)).

I don't know what you're doing with the fourth regex, for character entities--are you just deleting those, too? I'm guessing the fifth regex has to be run separately, since some of the whitespace it's cleaning up is generated by the earlier steps. But try it with the first three regexes combined and see how much difference it makes. That should tell you if it's worth going forward with this approach.
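
In Python, that combined pattern might be compiled along these lines (a sketch; re.VERBOSE is used purely for readability and shouldn't change the behavior):

import re

# the first three substitutions folded into one pass (sketch)
combinedRx = re.compile(r"""
    <(?:
        (script|style)              # capture the tag name for the backreference
        [^<]*(?:<(?!/\1)[^<]*)*     # unrolled body: '<' allowed only if it doesn't start the close tag
        </\1>                       # matching close tag
      |
        [!/]?[a-zA-Z-]+[^<>]*>      # any other tag, comment, or doctype
    )""", re.I | re.X)

sample = '<p>text <script type="text/javascript">var s = "<b>";</script></p>'
print(combinedRx.sub(' ', sample))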

Alan Moore
Thanks for the insight about reluctant quantifiers. Profiling did show that these are expensive! I will modify my regex.
Abhi
A: 

I would use a simple program with plain Python string partition, something like this; but it has been tested only with one example file containing a style tag:

## simple filtering when not hierarchical tags inside other discarded tags

start_tags=('<style','<script')
end_tags=('</style>','</script>')

##print("input:\n %s" % open('giant.html').read())
out=open('cleaned.html','w')
end_tag=''

for line in open('giant.html'):
    line = ' '.join(line.split())
    if end_tag:
        # still inside a discarded element: look for its end tag
        if end_tag not in line:
            continue ## discard whole line, end tag not reached yet
        _, _, line = line.partition(end_tag)
        end_tag = ''
        if not line.strip():
            continue

    # indexes of the start tags present on (the remainder of) this line
    found = [index for index, start_tag in enumerate(start_tags)
             if start_tag in line]
    for index in found:
        if start_tags[index] not in line:
            continue # already removed while handling an earlier tag
        start, _, end = line.partition(start_tags[index])
        # drop up to the closing angle bracket of the start tag
        _, _, end = end.partition('>')
        # check if the closing tag is already in the same line
        if end_tags[index] in end:
            _, _, end = end.partition(end_tags[index])
            line = start + ' ' + end
            end_tag = '' # end tag found, nothing left open
        else:
            end_tag = end_tags[index]
            line = start # keep only the text before the start tag
    out.write(line + '\n')

out.close()
##print('result:\n%s' % open('cleaned.html').read())
Tony Veijalainen