views: 240
answers: 6
I need to quickly extract text from HTML files. I am using the following regular expressions instead of a full-fledged parser since I need to be fast rather than accurate (I have more than a terabyte of text). The profiler shows that most of the time in my script is spent in the re.sub procedure. What are good ways of speeding up my process? I can implement some portions in C, but I wonder whether that will help given that the time is spent inside re.sub, which I think would be efficiently implemented.

import re

# Remove scripts, styles, tags, entities, and extraneous spaces:
scriptRx    = re.compile(r"<script.*?/script>", re.I)
styleRx     = re.compile(r"<style.*?/style>", re.I)
tagsRx      = re.compile(r"<[!/]?[a-zA-Z-]+[^<>]*>")
entitiesRx  = re.compile(r"&[0-9a-zA-Z]+;")
spacesRx    = re.compile(r"\s{2,}")
....
text = scriptRx.sub(" ", text)
text = styleRx.sub(" ", text)
....

Thanks!

+7  A: 

First, use an HTML parser built for this, like BeautifulSoup:

http://www.crummy.com/software/BeautifulSoup/

Then, you can identify any remaining slow spots with the profiler:

http://docs.python.org/library/profile.html

And for learning about regular expressions, I've found Mastering Regular Expressions very valuable, no matter what the programming language:

http://oreilly.com/catalog/9781565922570

Also:

http://stackoverflow.com/questions/606350/how-can-i-debug-a-regular-expression-in-python

Given the clarified use-case, the above is not what you want for this task. My alternate recommendation is here: http://stackoverflow.com/questions/3277239/speeding-up-regular-expressions-in-python/3277705#3277705

eruciform
lxml might be faster than BeautifulSoup; you should try both.
Juri Robl
Obviously, doing a complete parse can never be faster than substituting a few regular expressions, unless there is some super-duper HTML parsing library. (I tried BS; it's an order of magnitude slower than the regexes I am using.)
Abhi
The one downside to BeautifulSoup is its speed: it's very slow.
Ned Batchelder
@Abhi: well, it depends on what you mean by "parsing" I suppose. Many parsers (and BS may be among them) use regular expressions to identify the tags and attributes, and those would be slow for the same reason all regular expressions are slow. But one could conceivably write a parser as a finite state machine, which goes through one character at a time and alters its state based on the character and current state, and that would be a _lot_ faster. (Quite complicated to code for HTML though)
David Zaslavsky
I'd say David Zaslavsky is spot on: if you're just doing simple removal of portions of HTML, write a character-by-character parser. A lot of the complexity a parser would deal with (tags / nesting) is nonexistent here, since you don't care to remember them, just to remove them.
Wrikken
updated with a link to another SO question on turning on debugging for regexes, which might also help profile them.
eruciform
@all: it's true that bs is slower, but it also depends on his use case. if he's parsing it once and then doing tons of operations on it, it still might be better. though, it might be far worse if he needs to constantly parse new html...
eruciform
@David: I understand. I thought re.compile's job *is* to analyze the regex and effectively create an FSM, but I am not sure whether and how well it can do that. But yes, as @Wrikken pointed out, I am interested in a very special FSM that drops characters, which is my ultimate goal. Anyway, as you said, it will be complicated and probably won't justify the added coding/debugging time. Thanks!
Abhi
BTW, do you know whether regex implementations in today's programming languages are strictly equivalent to the regular languages/FSM formalism, or do they represent a superset? Thanks!
Abhi
@eruciform: I have to extract text from 50 million HTML docs of Web-average length. So I end up calling that set of regexes on each of the docs, one by one.
Abhi
@abhi - that depends on the language; they all overlap quite a bit, but each one has a few things the others don't. personally, i've found the one in perl to be the most powerful (powerful enough to blow your foot off if you're not careful, like perl usually is :-)
eruciform
posted separate answer now that the use-case is more clear: http://stackoverflow.com/questions/3277239/speeding-up-regular-expressions-in-python/3277705#3277705
eruciform
+1  A: 

one thing you can do is combine the script/style regexes using backreferences. here's some sample data:

$ cat sample 
<script>some stuff</script>
<html>whatever </html>
<style>some other stuff</style>

using perl:

perl -ne 'if (/<(script|style)>.*?<\/\1>/) { print $1; }' sample

it will match either script or style. I second the recommendation for "mastering regular expressions", it's an excellent book.
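
In Python, a rough equivalent of that combined pattern might look like this (just a sketch; like the perl version it assumes bare <script>/<style> tags without attributes, as in the sample data, and \1 refers back to whichever tag name was captured):

import re

# combined script/style pattern with a backreference (sketch);
# re.S lets the element body span multiple lines
scriptOrStyleRx = re.compile(r"<(script|style)>.*?</\1>", re.I | re.S)

sample = "<script>some stuff</script>\n<html>whatever </html>\n<style>some other stuff</style>\n"
print(scriptOrStyleRx.sub(" ", sample))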

Paul Sanwald
+1  A: 

The suggestion to use an HTML parser is a good one, since it'll quite possibly be faster than regular expressions. But I'm not sure BeautifulSoup is the right tool for the job, since it constructs a parse tree from the entire file and stores the whole thing in memory. For a terabyte of HTML, you'd need an obscene amount of RAM to do that ;-) I'd suggest you look at HTMLParser, which is written at a lower level than BeautifulSoup, but I believe it's a stream parser, so it will only load a bit of the text at a time.
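
For example, a minimal sketch of stream-style extraction might look like this (using Python 3's html.parser; in Python 2 the module is named HTMLParser, and 'page.html' is just a placeholder file name):

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip = 0        # nesting depth inside <script>/<style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        # keep text only when not inside a script/style element
        if not self.skip:
            self.chunks.append(data)

extractor = TextExtractor()
extractor.feed(open('page.html').read())   # 'page.html' is a placeholder
print(' '.join(extractor.chunks))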

David Zaslavsky
Sorry, I meant 50 million HTML files. It's not one big piece of text. Anyway, HTMLParser didn't work: This is HTML-in-the-wild. Is there a clean way to extract all the text from BeautifulSoup's parse tree?
Abhi
Ah, gotcha. I don't know all that much about BeautifulSoup, but I'd imagine it might be possible to extract the text. Did you try just converting it to a string? i.e. `soup = BeautifulSoup(...)` then `str(soup)` (I don't know if that's it, but that'd be my first guess if I were working on your project)
David Zaslavsky
Thanks for getting back :) But I typed that comment in haste. BS is an order of magnitude slower than regex substitutions, so extracting text from its parse is moot :P
Abhi
+1  A: 

If your use-case is indeed to parse a few things for each of millions of documents, then my above answer won't help. I recommend some heuristics, like doing a couple "straight text" regexes on them to begin with - like just plain /script/ and /style/ to throw things out quickly if you can. In fact, do you really need to do the end-tag check at all? Isn't <style good enough? Leave validation for someone else. If the quick ones succeed, then put the rest into a single regex, like /<script|<style|\s{2,}|etc.../ so that it doesn't have to go through so much text once for each regex.
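
A rough sketch of that idea, reusing the patterns from the question (the cheap substring checks stand in for the "straight text" pre-filter, and the clean function name is just for illustration):

import re

# one combined pass instead of five separate ones (patterns borrowed from
# the question; re.S lets script/style bodies span newlines)
combinedRx = re.compile(
    r"<script.*?</script>"
    r"|<style.*?</style>"
    r"|<[!/]?[a-zA-Z-]+[^<>]*>"
    r"|&[0-9a-zA-Z]+;"
    r"|\s{2,}",
    re.I | re.S)

def clean(text):
    # cheap pre-filter: skip the regex pass entirely for plain-text input
    if '<' in text or '&' in text:
        text = combinedRx.sub(' ', text)
    return text

print(clean('<p>Hello&nbsp;world <script>var x = 1;</script></p>'))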

eruciform
+2  A: 

You're processing each file five times, so the first thing you should do (as Paul Sanwald said) is try to reduce that number by combining your regexes together. I would also avoid using reluctant quantifiers, which are designed for convenience at the expense of efficiency. Consider this regex:

<script.*?</script>

Each time the . goes to consume another character, it first has to make sure </script> won't match at that spot. It's almost like doing a negative lookahead at every position:

<script(?:(?!</script>).)*</script>

But we know there's no point doing the lookahead if the next character is anything but <, and we can tailor the regex accordingly:

<script[^<]*(?:<(?!/script>)[^<]*)*</script>

When I test them in RegexBuddy with this target string:

<script type="text/javascript">var imagePath='http://sstatic.net/stackoverflow/img/';</script>

...the reluctant regex takes 173 steps to make the match, while the tailored regex takes only 28.
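
If you want a rough sense of the same effect in Python, a quick timeit comparison might look like this (a sketch; the re module doesn't report step counts, and the test string and repeat count here are made up):

import re, timeit

reluctant = re.compile(r"<script.*?</script>", re.I | re.S)
unrolled  = re.compile(r"<script[^<]*(?:<(?!/script>)[^<]*)*</script>", re.I)

# made-up test document: one long script element
doc = '<script type="text/javascript">' + "var x = 1; " * 500 + "</script>"

print(timeit.timeit(lambda: reluctant.search(doc), number=2000))
print(timeit.timeit(lambda: unrolled.search(doc), number=2000))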

Combining your first three regexes into one yields this beast:

<(?:(script|style)[^<]*(?:<(?!/\1)[^<]*)*</\1>|[!/]?[a-zA-Z-]+[^<>]*>)

You might want to zap the <HEAD> element while you're at it (i.e., (script|style|head)).

I don't know what you're doing with the fourth regex, for character entities--are you just deleting those, too? I'm guessing the fifth regex has to be run separately, since some of the whitespace it's cleaning up is generated by the earlier steps. But try it with the first three regexes combined and see how much difference it makes. That should tell you if it's worth going forward with this approach.
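
In Python, that combined pattern might be compiled along these lines (a sketch; re.VERBOSE is used purely for readability and shouldn't change the behavior):

import re

# the first three substitutions folded into one pass (sketch)
combinedRx = re.compile(r"""
    <(?:
        (script|style)              # capture the tag name for the backreference
        [^<]*(?:<(?!/\1)[^<]*)*     # unrolled body: '<' allowed only if it doesn't start the close tag
        </\1>                       # matching close tag
      |
        [!/]?[a-zA-Z-]+[^<>]*>      # any other tag, comment, or doctype
    )""", re.I | re.X)

sample = '<p>text <script type="text/javascript">var s = "<b>";</script></p>'
print(combinedRx.sub(' ', sample))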

Alan Moore
Thanks for the insight about reluctant quantifiers. Profiling did show that these are expensive! I will modify my regex.
Abhi
A: 

I would use a simple program with plain Python string partition, something like this; but it has been tested only with one example file containing a style tag:

## simple filtering when not hierarchical tags inside other discarded tags

start_tags=('<style','<script')
end_tags=('</style>','</script>')

##print("input:\n %s" % open('giant.html').read())
out=open('cleaned.html','w')
end_tag=''

for line in open('giant.html'):
    line = ' '.join(line.split())
    if end_tag:
        # still inside a discarded element: look for its end tag
        if end_tag not in line:
            continue ## discard whole line, end tag not reached yet
        _, _, line = line.partition(end_tag)
        end_tag = ''
        if not line.strip():
            continue

    # indexes of the start tags present on (the remainder of) this line
    found = [index for index, start_tag in enumerate(start_tags)
             if start_tag in line]
    for index in found:
        if start_tags[index] not in line:
            continue # already removed while handling an earlier tag
        start, _, end = line.partition(start_tags[index])
        # drop up to the closing angle bracket of the start tag
        _, _, end = end.partition('>')
        # check if the closing tag is already in the same line
        if end_tags[index] in end:
            _, _, end = end.partition(end_tags[index])
            line = start + ' ' + end
            end_tag = '' # end tag found, nothing left open
        else:
            end_tag = end_tags[index]
            line = start # keep only the text before the start tag
    out.write(line + '\n')

out.close()
##print('result:\n%s' % open('cleaned.html').read())
Tony Veijalainen