ansaurus

Question

Answer 1

+4 A:

Hint: Don't use regexes for link extraction or other HTML "parsing" tasks!

Your regex has 6 (SIX) repeating groups in it. Executing it will entail a lot of backtracking. In the worst case, it could even approach O(N^6) where N is the number of input characters. You could ease this a bit by replacing eager matching with lazy matching, but it is almost impossible to avoid pathological cases; e.g. when the input data is sufficiently malformed that the regex does not match.

A far, far better solution is to use some existing strict or permissive HTML parser. Even writing an ad-hoc parser by hand is going to be better than using gnarly regexes.

This page that lists various HTML parsers for Java. I've heard good things about TagSoup and HtmlCleaner.

Stephen C 2010-10-11 22:57:16

Why? Any resource to read (books,articles, etc)

santiagobasulto 2010-10-11 23:03:10

Thanks brother, i'll try with javax.swing.text.html.HTMLEditorKit

santiagobasulto 2010-10-11 23:14:14

Answer 2

+1 A:

Try jscrape instead. Please don't use regex for this.

Regex use vs. Regex abuse

Regular expressions are not Parsers. Although you can do some amazing things with regular expressions, they are weak at balanced tag matching. Some regex variants have balanced matching, but it is clearly a hack -- and a nasty one. You can often make it kinda-sorta work, as I have in the sanitize routine. But no matter how clever your regex, don't delude yourself: it is in no way, shape or form a substitute for a real live parser.

Source

zengr 2010-10-11 22:59:19

Why? Any resource to read (books,articles, etc)

santiagobasulto 2010-10-11 23:03:28

Regex is a very expensive operation. And while you are scraping a website, you will need to parse alot of text. http://www.acorns.com.au/blog/?p=136

zengr 2010-10-11 23:11:08

The complexity of evaluating regex is something terrible according to the linear complexity of parsing the HTML page

jutky 2010-10-11 23:12:16

@zengr Thanks for that!

santiagobasulto 2010-10-11 23:20:18

Answer 3

+1 A:

All your time, all of it, is being spent here:

 html+=line;

Use a StringBuffer. Better still, if you can, run the match on every line and don't accumulate them at all.

EJP 2010-10-11 23:36:37

Good hint, anyway, i need to construct all the HTML becouse lines can be broken.

santiagobasulto 2010-10-12 00:05:37

I tried it, it doesn't change anything.

santiagobasulto 2010-10-12 00:07:24

You tried what exactly?

EJP 2010-10-12 04:00:06

Sorry EJP. It doesn't change anything based on the great time it takes to parse with regex. It does change something, but is nothing comparing it with the overall time. It is a good tip though, i've coded it.

santiagobasulto 2010-10-12 13:49:36

Answer 4

A:

Java tends to be ugly for this kind of stuff. Here is a (poorly written) example of that in Python

from BeautifulSoup import BeautifulSoup
import urllib2, timeit

def parseIt():
    content = ''.join(urllib2.urlopen('http://news.google.com.ar/nwshp?hl=es&amp;tab=wn'))

    soup = BeautifulSoup(content)

    links = soup.findAll("a")
    for link in links:
        print link


if __name__ == '__main__':
    t = timeit.Timer(stmt=parseIt, setup='from BeautifulSoup import BeautifulSoup')
    print t.timeit(number=1)

Time to run for me is: 0.78207570931

bwawok 2010-10-11 23:59:08

Thanks, good example. But i'm trying to do it with java.

santiagobasulto 2010-10-12 00:10:00

ansaurus

tags:

views:

answers:

Java Regex performance

related questions