I need to quickly extract text from HTML files. I am using the following regular expressions instead of a full-fledged parser since I need to be fast rather than accurate (I have more than a terabyte of text). The profiler shows that most of the time in my script is spent in the re.sub procedure. What are good ways of speeding up my process? I can implement some portions in C, but I wonder whether that will help given that the time is spent inside re.sub, which I think would be efficiently implemented.
# Remove scripts, styles, tags, entities, and extraneous spaces:
scriptRx = re.compile("<script.*?/script>", re.I)
styleRx = re.compile("<style.*?/style>", re.I)
tagsRx = re.compile("<[!/]?[a-zA-Z-]+[^<>]*>")
entitiesRx = re.compile("&[0-9a-zA-Z]+;")
spacesRx = re.compile("\s{2,}")
....
text = scriptRx.sub(" ", text)
text = styleRx.sub(" ", text)
....
Thanks!