ansaurus

Question

How to use regular expressions to parse HTML in Java?

Answer 1

+27 A:

Using regular expressions to pull values from HTML is always a mistake. HTML syntax is a lot more complex than it may first appear, and it's very easy for a page to catch out even a very complex regular expression.

Use an HTML Parser instead.
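
As a rough illustration of the parser route, here is a minimal sketch that collects href values using the HTML parser bundled with the JDK (`javax.swing.text.html.parser.ParserDelegator`). That parser is old and is used here only because it needs no extra dependency; the class name `HrefCollector` and the sample markup are made up for the example, and a dedicated library may cope better with messy real-world pages.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class for illustration only.
class HrefCollector {

    // Returns the href value of every <a> start tag, in document order.
    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        hrefs.add(href.toString());
                    }
                }
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<p>foo <a href=\"link1\">bar</a> baz <A HREF='link2'>qux</A></p>";
        System.out.println(extractHrefs(html));
    }
}
```

The parser takes care of tag case, quoting style and attribute order, which are exactly the details a hand-written regex tends to get wrong.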

Dave Webb 2009-03-24 11:41:56

+1 for not parsing HTML with regular expressions.

Welbog 2009-03-24 11:43:13

+1 for answer and morph

Paul Whelan 2009-03-24 11:46:29

It depends on what you are doing. If you are processing a lot of HTML from random sources an HTML Parser may well fail on some of them and will likely require more memory and processing than a regex. For example the Heritrix web crawler uses regex for link extraction on HTML pages.

Kris 2009-03-24 12:19:08

I am amazed how often this very question has been answered with this very answer on this site (and the rest of the Internet) already. I wonder if this topic will ever run dry. +1 nevertheless.

Tomalak 2009-03-24 12:55:33

The solution depends on the question...

ReneS 2009-03-24 13:06:05

Answer 2

+1 A:

I searched the Regular Expression Library (http://regexlib.com/Search.aspx?k=href and http://regexlib.com/Search.aspx?k=src)

The best I found was

((?<html>(href|src)\s*=\s*")|(?<css>url\())(?<url>.*?)(?(html)"|\))

Check out these links for more expressions:

http://regexlib.com/REDetails.aspx?regexp_id=2261

http://regexlib.com/REDetails.aspx?regexp_id=758

http://regexlib.com/REDetails.aspx?regexp_id=774

http://regexlib.com/REDetails.aspx?regexp_id=1437

Mark Justin 2009-03-24 11:50:55

I hate that site. I see they still don't bother to mention which flavor a given regex is targeted at. This regex (id=2261) uses named captures and conditionals, neither of which is supported by Java.
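
For what it's worth, Java has supported named groups since Java 7, but it still has no conditional construct like `(?(html)"|\))`. One way to get a similar effect is to replace the conditional with a plain alternation and read whichever group matched. The following is only a sketch of that idea, not an exact port of the regexlib expression; the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class UrlAttributeFinder {

    // Branch 1: href="..." or src="..." (group 1).
    // Branch 2: url(...) as used in CSS (group 2).
    private static final Pattern URL_PATTERN = Pattern.compile(
            "(?:href|src)\\s*=\\s*\"(.*?)\"" + "|" + "url\\((.*?)\\)");

    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(text);
        while (m.find()) {
            // Only one of the two groups participates in any given match.
            urls.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "<a href=\"a.html\"><img src = \"b.png\"><div style=\"background:url(c.gif)\">";
        System.out.println(findUrls(text)); // [a.html, b.png, c.gif]
    }
}
```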

Alan Moore 2009-03-24 17:03:48

Answer 3

+3 A:

If you want to go down the HTML parsing route, which Dave and I recommend, here's the code to parse a String named data for anchor tags and print their href values.

Since you're just using anchor tags you should be OK with just a regex, but if you want to do more, go with a parser. The Mozilla HTML Parser is the best out there.

// Locate the native parser library and the mozilla.dist.bin directory.
File parserLibraryFile = new File("lib/MozillaHtmlParser/native/bin/MozillaParser" + EnviromentController.getSharedLibraryExtension());
String parserLibrary = parserLibraryFile.getAbsolutePath();
final File mozillaDistBinDirectory = new File("lib/MozillaHtmlParser/mozilla.dist.bin." + EnviromentController.getOperatingSystemName());

// Initialize the native parser, then parse the HTML string into a DOM document.
MozillaParser.init(parserLibrary, mozillaDistBinDirectory.getAbsolutePath());
MozillaParser parser = new MozillaParser();
Document domDocument = parser.parse(data);
NodeList list = domDocument.getElementsByTagName("a");

// Print the href of every anchor that has one.
for (int i = 0; i < list.getLength(); i++) {
    Node n = list.item(i);
    NamedNodeMap m = n.getAttributes();
    if (m != null) {
        Node attrNode = m.getNamedItem("href");
        if (attrNode != null) {
            System.out.println(attrNode.getNodeValue());
        }
    }
}

Scott Cowan 2009-03-24 11:56:12

Answer 4

+6 A:

Don't use regular expressions. Use NekoHTML or TagSoup, which act as a bridge and give you a SAX or DOM view of an HTML document, just as you would have with XML.

mP 2009-03-24 12:40:22

+1 on Neko. Very easy to use.

Damo 2009-03-24 15:20:43

Answer 5

+2 A:

The other answers are correct. The Java Regex API is not the proper tool for your goal. Use the efficient, secure and well-tested high-level tools mentioned in the other answers.

If your question is about the Regex API itself rather than a real-life problem (for learning purposes, for example), you can do it with the following code:

String html = "foo <a href='link1'>bar</a> baz <a href='link2'>qux</a> foo";
Pattern p = Pattern.compile("<a href='(.*?)'>");
Matcher m = p.matcher(html);
while(m.find()) {
   System.out.println(m.group(0));
   System.out.println(m.group(1));
}

And the output is:

<a href='link1'>
link1
<a href='link2'>
link2

Please note that the lazy/reluctant quantifier *? must be used in order to restrict each match to a single tag. Group 0 is the entire match; group 1 is the first capturing group (the first pair of parentheses).
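
To see why the reluctant form matters, the same input can be run through both variants. This is a small sketch with a made-up class name; the greedy `.*` runs to the last `'>` in the string, so its group spans both tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
class GreedyVsReluctant {

    // Returns group 1 of the first match, or null when nothing matches.
    static String firstGroup(String regex, String html) {
        Matcher m = Pattern.compile(regex).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "foo <a href='link1'>bar</a> baz <a href='link2'>qux</a> foo";

        // Reluctant: stops at the first closing quote.
        System.out.println(firstGroup("<a href='(.*?)'>", html)); // link1

        // Greedy: backtracks from the end of the input to the last '>.
        System.out.println(firstGroup("<a href='(.*)'>", html));  // link1'>bar</a> baz <a href='link2
    }
}
```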

Henryk Konsek 2009-03-24 13:17:37

Thanks. While not a real "works-everywhere" regex this works for data returned from google hot trends and I have been pulling my hair to parse it for a long time...

rjha94 2010-10-17 16:01:00

Answer 6

A:

Regular expressions can only parse regular languages, that's why they are called regular expressions. HTML is not a regular language, ergo it cannot be parsed by regular expressions.

HTML parsers, on the other hand, can parse HTML, that's why they are called HTML parsers.

You should use your favorite HTML parser instead.
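
A small illustration of the nesting problem, using a made-up class and made-up markup: neither a reluctant nor a greedy quantifier pairs up opening and closing tags correctly in every case, because the pattern has no notion of depth.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
class NestedTagDemo {

    // Returns group 1 of the first match, or null when nothing matches.
    static String firstGroup(String regex, String html) {
        Matcher m = Pattern.compile(regex).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String nested = "<div>outer <div>inner</div> tail</div>";
        String siblings = "<div>a</div><div>b</div>";

        // Reluctant: stops at the first </div>, cutting the nested element in half.
        System.out.println(firstGroup("<div>(.*?)</div>", nested));  // outer <div>inner

        // Greedy: happens to work for the nested case...
        System.out.println(firstGroup("<div>(.*)</div>", nested));   // outer <div>inner</div> tail

        // ...but swallows two sibling elements as if they were one.
        System.out.println(firstGroup("<div>(.*)</div>", siblings)); // a</div><div>b
    }
}
```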

Jörg W Mittag 2009-03-24 21:30:18

Answer 7

A:

Contrary to popular opinion, regular expressions are useful tools to extract data from unstructured text (which HTML is).

If you are doing complex HTML data extraction (say, find all paragraphs in a page) then HTML parsing is probably the way to go. But if you just need to get some URLs from HREFs, then a regular expression would work fine and it will be very hard to break it.

Try something like this:

/<a[^>]+href=["']?([^'"> ]+)["']?[^>]*>/i
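
In Java, the slashes and the trailing `i` have no direct equivalent: the pattern goes into a string literal (with the double quote escaped) and the flag becomes `Pattern.CASE_INSENSITIVE`. A minimal sketch, assuming a made-up class name and sample input:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class AnchorHrefRegex {

    // Same expression as above, written as a Java string literal.
    private static final Pattern HREF = Pattern.compile(
            "<a[^>]+href=[\"']?([^'\"> ]+)[\"']?[^>]*>",
            Pattern.CASE_INSENSITIVE);

    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<A HREF=\"http://example.com/a\">x</A> "
                + "<a class=nav href=/b>y</a> "
                + "<a href='c.html' target=_blank>z</a>";
        System.out.println(extractHrefs(html)); // [http://example.com/a, /b, c.html]
    }
}
```

Note that the expression only requires the tag name to start with `a`, so other tags carrying an href (such as `<area>`) would match as well.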

Guss 2009-03-25 08:49:23

Answer 8

A:

The problem is that Java just does not work as expected with regexps. For example, try parsing

input = "<td valign='top' someRandom='tag'><a href='#'>google.de</a><td>";
pattern = "<td .*?>(.*)?</td>";

this should match

<a href='#'>google.de</a>

in group(1), but it doesn't. Java's implementation of regexps just sucks.

WasserbettenGuerilla 2010-01-28 23:27:48

It doesn't match because the regex is simply wrong for that particular input string. Your regex requires </td> at the end, your string has <td>.
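
This is easy to check: with the input as posted, `find()` returns false, and once the string ends in `</td>` the same pattern captures the anchor in group 1. The class name below is made up for the demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
class TdMatchDemo {

    private static final Pattern CELL = Pattern.compile("<td .*?>(.*)?</td>");

    // Returns the captured cell content, or null when the pattern does not match.
    static String cellContent(String input) {
        Matcher m = CELL.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String posted = "<td valign='top' someRandom='tag'><a href='#'>google.de</a><td>";
        String fixed = "<td valign='top' someRandom='tag'><a href='#'>google.de</a></td>";

        System.out.println(cellContent(posted)); // null
        System.out.println(cellContent(fixed));  // <a href='#'>google.de</a>
    }
}
```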

ferdystschenko 2010-01-28 23:46:51

Welcome to Stack Overflow. Please don't pollute topics with rants that have been incorrectly posted as answers. Whenever you have a *question*, please press the `Ask Question` button at the top right :)

BalusC 2010-01-29 00:21:32

Answer 9

A:

Woah, guys, you seem to have forgotten to link to the canonical parsing-HTML-with-regular-expressions Stack Overflow answer.

Paul D. Waite 2010-01-28 23:32:39

Not sure about yours, but in my world, 14 Nov 2009 is **after** 24 Mar 2009. Regardless, you should have posted this smartass-like reply as a **comment** rather than an answer ;)

BalusC 2010-01-29 00:19:30

“Doh”, and “doh”.

Paul D. Waite 2010-01-29 11:00:49
