ansaurus

Question

Java: I have a big string of html and need to extract the href="..." text...

Answer 1

+4 A:

.*

This is a greedy operation that will take any character, including the quotes.

Try something like:

"href=\"([^\"]*)\""

Kugel 2009-11-03 22:42:17
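To make the difference concrete, here is a minimal sketch that runs the greedy pattern and the negated-class pattern over the same input. The class name `HrefRegexDemo`, the helper `extract`, and the sample HTML are made up for illustration and are not from the original question. Note that it reads capture group 1, not the whole match:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names and the sample input are illustrative only.
public class HrefRegexDemo {

    // Collects capture group 1 of every match of the given regex.
    static List<String> extract(String regex, String html) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(html);
        while (m.find()) {
            // group(1) is the parenthesized part; group(0) would be the whole match.
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<a href=\"one.html\">1</a> <a href=\"two.html\">2</a>";

        // Greedy: .* runs on to the last quote in the input, giving one long match.
        System.out.println(extract("href=\"(.*)\"", html));

        // Negated class: [^"]* stops at the first closing quote, giving one match per link.
        System.out.println(extract("href=\"([^\"]*)\"", html));
    }
}
```

With the greedy pattern the single result swallows everything between the first opening quote and the last closing quote; with the negated class each link is reported separately.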

It still prints the entire string and not the capture group :(

Legend 2009-11-03 22:49:48

Probably because he's missed the quantifier after the negated quote. But anyway, stop trying to use RegEx for this, it's the wrong tool for the job!

Peter Boughton 2009-11-03 22:56:12

But it's the fastest tool for the job (development wise). Html parsers can be a pain.

Kugel 2009-11-03 22:59:19

Regex *cannot* match HTML nodes correctly. Even with the non-regular extensions of many modern regex engines, HTML is too complex.

Peter Boughton 2009-11-03 23:03:17

Sorry! This works... There was something wrong with my string... Thanks a ton!

Legend 2009-11-03 23:25:59

It is, in fact, the fastest for the given task (performance-wise). But XPath would be faster and more scalable development-wise.

tulskiy 2009-11-04 00:17:50

XPath works on html too?

@Peter I understand that, but the job here was not to match html nodes, but simply find the links.

Kugel 2009-11-04 10:28:57
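On the XPath question: the XPath support that ships with the JDK (`javax.xml.xpath`) operates on a parsed XML DOM, so the sketch below assumes the markup is well-formed XHTML with no DOCTYPE and no namespace. Real-world HTML often is not, and would need a tolerant HTML parser in front. The class name `XPathHrefDemo`, the method `hrefs`, and the sample input are hypothetical:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; assumes the input is well-formed XML (XHTML).
public class XPathHrefDemo {

    // Returns the href attribute value of every <a> element in the document.
    static List<String> hrefs(String xhtml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Throws a SAXException if the markup is not well-formed.
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes =
                (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getNodeValue());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>"
                + "<a href=\"one.html\">1</a> <a href=\"two.html\">2</a>"
                + "</p></body></html>";
        System.out.println(hrefs(xhtml));
    }
}
```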

Answer 2

+3 A:

Regex is great, but it is not the right tool for this particular purpose. Normally you want to use a stack-based parser for this. Have a look at Java HTML parser APIs like JTidy.

BalusC 2009-11-03 22:45:56

Answer 3

+1 A:

"href=\"(.*?)\"" should also work, but I think Kugel's answer will work faster.

tulskiy 2009-11-03 22:46:35

Answer 4

+4 A:

There are two problems with the code you've posted:

Firstly, the .* in your regular expression is greedy. This will cause it to match all characters until the last " character that can be found. You can make this match non-greedy by changing it to .*?.

Secondly, to pick up all the matches, you need to keep iterating with Matcher.find rather than looking for groups. Groups give you access to each parenthesized section of the regex. You, however, are looking for each place where the whole regular expression matches.

Putting these together gives you the following code which should do what you need:

Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.DOTALL);
Matcher m = p.matcher(innerHTML);

while (m.find()) 
{
    System.out.println(m.group(1));
}

Phil Ross 2009-11-03 22:48:33

This works as well! Thank You!

Legend 2009-11-03 23:26:47

Answer 5

+1 A:

You may use an HTML parser library. JTidy, for example, gives you a DOM model of the HTML, from which you can extract all "a" elements and read their "href" attribute.

Lorenzo Boccaccia 2009-11-03 22:51:29

Answer 6

+2 A:

Use a built-in parser. Something like:

    EditorKit kit = new HTMLEditorKit();
    HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
    doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
    kit.read(reader, doc, 0);

    HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);

    while (it.isValid())
    {
        SimpleAttributeSet s = (SimpleAttributeSet)it.getAttributes();
        String href = (String)s.getAttribute(HTML.Attribute.HREF);
        System.out.println( href );
        it.next();
    }

Or use the ParserCallback:

import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.parser.*;
import javax.swing.text.html.*;

public class ParserCallbackText extends HTMLEditorKit.ParserCallback
{
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos)
    {
        if (tag.equals(HTML.Tag.A))
        {
            String href = (String)a.getAttribute(HTML.Attribute.HREF);
            System.out.println(href);
        }
    }

    public static void main(String[] args)
        throws Exception
    {
        Reader reader = getReader(args[0]);
        ParserCallbackText parser = new ParserCallbackText();
        new ParserDelegator().parse(reader, parser, true);
    }

    static Reader getReader(String uri)
        throws IOException
    {
        // Retrieve from Internet.
        if (uri.startsWith("http:"))
        {
            URLConnection conn = new URL(uri).openConnection();
            return new InputStreamReader(conn.getInputStream());
        }
        // Retrieve from file.
        else
        {
            return new FileReader(uri);
        }
    }
}

The Reader could be a StringReader.
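Since the question starts from a string of HTML already in memory, here is a minimal sketch of that StringReader variant, collecting the links into a list instead of printing them. The class name `StringHrefDemo`, the method `hrefs`, and the sample input are hypothetical; the parser calls are the same `ParserDelegator` and `ParserCallback` used above:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class; names and sample input are illustrative only.
public class StringHrefDemo {

    // Parses the HTML held in a String and returns the href of every <a> start tag.
    static List<String> hrefs(String html) throws IOException {
        final List<String> found = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = a.getAttribute(HTML.Attribute.HREF);
                    if (href != null) { // anchors without an href are skipped
                        found.add(href.toString());
                    }
                }
            }
        };

        // true = ignore any charset directive in the document
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return found;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body>"
                + "<a href=\"one.html\">1</a> <a href=\"two.html\">2</a>"
                + "</body></html>";
        System.out.println(hrefs(html));
    }
}
```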

camickr 2009-11-03 23:26:58

Thank you for this. Was not aware of this approach...

Legend 2009-11-04 02:46:55
