ansaurus

Question

Answer 1

+2 A:

Have you looked at using a capturing group?

Pattern.compile(">([^<>]*)</a>")

Note, however, that using regular expressions on HTML is generally not recommended, since HTML is not a regular language. You will get more reliable results from an HTML parser such as JTidy.

Brian Agnew 2009-11-15 09:53:17

I tried this. Provides the same list. Cheers.

Littlejon 2009-11-15 09:59:04

This answer is also correct. Changing the html.substring(m.start(), m.end()) to m.group(1) makes this work.

Littlejon 2009-11-15 10:12:00
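To illustrate the difference the comment above describes, here is a minimal sketch that runs the same pattern twice: once taking the whole match (which is what html.substring(m.start(), m.end()) returns) and once taking only group 1. The class name GroupDemo, the helper method names and the sample HTML are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original question.
class GroupDemo {
    // Same pattern as in the answer: group 1 is the text between '>' and '</a>'.
    private static final Pattern LINK_TEXT = Pattern.compile(">([^<>]*)</a>");

    // Whole match for every hit, delimiters included.
    static List<String> wholeMatches(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = LINK_TEXT.matcher(html);
        while (m.find()) {
            out.add(html.substring(m.start(), m.end())); // equivalent to m.group()
        }
        return out;
    }

    // Only the captured part of every hit.
    static List<String> capturedText(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = LINK_TEXT.matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample input invented for illustration.
        String html = "foo <a href=\"#\">Link 1</a> bar <a href=\"#\">Link 2</a> baz";
        System.out.println(wholeMatches(html)); // [>Link 1</a>, >Link 2</a>]
        System.out.println(capturedText(html)); // [Link 1, Link 2]
    }
}
```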

Answer 2

+2 A:

Keep in mind that due to its limited nature, your regex (and regex in general) may run into problems if the HTML you're trying to parse is slightly more complex. For example, the following would fail to parse correctly, but is completely valid (and common) HTML:

<a href="blah.html">this is only a <em>single</em> link</a>

You might be better off using a DOM parser (I'm pretty sure Java has plenty of options in this regard) and then requesting the inner text of each <a> tag.

Amber 2009-11-15 09:56:54

Nah, it won't fail, it just won't give you what you expect ;) You'd get "> link</a>".

roe 2009-11-15 09:59:30
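The behaviour discussed in this answer and its comment can be checked with a short sketch. It applies the capturing-group pattern from the other answers to the nested example above and shows that only the trailing fragment is captured. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original thread.
class NestedLinkDemo {
    // Returns group 1 of the first match, or null if nothing matches.
    static String firstCapture(String html) {
        Matcher m = Pattern.compile(">([^<>]*)</a>").matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"blah.html\">this is only a <em>single</em> link</a>";
        // The only '>' that is followed by tag-free text and then '</a>'
        // is the one closing '</em>', so just " link" is captured.
        System.out.println("[" + firstCapture(html) + "]"); // [ link]
    }
}
```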

Answer 3

+2 A:

You can do that by wrapping a group around that part of your regex and then using group(X) where X is the number of the group:

Matcher m = Pattern.compile(">([^<>]*)</a>").matcher(html);
while (m.find()) {
    resp.getWriter().println(m.group(1));
}

A better way, though, would be to use a simple parser for this:

import java.io.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class HtmlParseDemo {
    public static void main(String[] args) throws Exception {
        Reader reader = new StringReader("foo <a href=\"#\">Link 1</a> bar <a href=\"#\">Link <b>2</b> more</a> baz");
        HTMLEditorKit.Parser parser = new ParserDelegator();
        parser.parse(reader, new LinkParser(), true);
        reader.close();
    }
}

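// Collects the text inside each <a>...</a> and prints it when the link closes.
class LinkParser extends HTMLEditorKit.ParserCallback {

    private boolean linkStarted = false;            // true while inside an <a> element
    private StringBuilder b = new StringBuilder();  // text collected for the current link

    @Override
    public void handleText(char[] data, int pos) {
        // Nested tags such as <b> split the link text into several chunks.
        if (linkStarted) b.append(data);
    }

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.A) linkStarted = true;
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (t == HTML.Tag.A) {
            linkStarted = false;
            System.out.println(b);
            b = new StringBuilder();  // start fresh for the next link
        }
    }
}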

Output:

Link 1
Link 2 more

Bart Kiers 2009-11-15 09:58:51

That worked great. Thanks.

Littlejon 2009-11-15 10:09:40

You're welcome Littlejon.

Bart Kiers 2009-11-15 10:30:30

Can I get the link target, i.e. '#', instead of the link text ("Link 1" or "Link 2 more")?

Ritz 2010-01-13 09:30:42
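One possible way to do what this comment asks, using the same Swing parser as the answer: the attribute set passed to handleStartTag carries the tag's attributes, so the href can be read with HTML.Attribute.HREF. This is only a sketch; the class name HrefCollector, the helper extractHrefs and the sample input are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback that records the href of every <a> start tag.
class HrefCollector extends HTMLEditorKit.ParserCallback {

    private final List<String> hrefs = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.A) {
            Object href = a.getAttribute(HTML.Attribute.HREF);
            if (href != null) hrefs.add(href.toString()); // skip anchors without href
        }
    }

    // Convenience helper (an assumption, not from the thread): parse and return all hrefs.
    static List<String> extractHrefs(String html) throws IOException {
        HrefCollector collector = new HrefCollector();
        new ParserDelegator().parse(new StringReader(html), collector, true);
        return collector.hrefs;
    }

    public static void main(String[] args) throws IOException {
        // Sample input invented for illustration.
        String html = "foo <a href=\"#\">Link 1</a> bar <a href=\"page.html\">Link <b>2</b> more</a> baz";
        System.out.println(extractHrefs(html)); // [#, page.html]
    }
}
```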

Answer 4

+1 A:

I'm late to the party but I'd like to point out another alternative:

(?<=X)      X, via zero-width positive lookbehind

If you put your initial > into that mess, i.e.

(?<=>)[^<>]*</a>

then it should not be returned as part of your result. (The closing </a> still is, unless you wrap it in a lookahead, (?=</a>), as well.)

Untested, though. Good luck!

Carl Smotricz 2009-11-15 10:37:23
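Since the answer above is marked as untested, here is a small sketch that tries it out. It runs the lookbehind pattern as given, and a variant that additionally puts the closing tag into a lookahead so that only the link text is returned. The variant, the class name LookaroundDemo and the sample HTML are additions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original thread.
class LookaroundDemo {
    // Collects the whole match (m.group()) for every hit of the given regex.
    static List<String> findAll(String regex, String html) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(html);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample input invented for illustration.
        String html = "<a href=\"#\">Link 1</a> and <a href=\"#\">Link 2</a>";

        // Pattern from the answer: the leading '>' is checked but not consumed.
        System.out.println(findAll("(?<=>)[^<>]*</a>", html));     // [Link 1</a>, Link 2</a>]

        // Variant (an assumption): the closing tag is also only looked at, not consumed.
        System.out.println(findAll("(?<=>)[^<>]*(?=</a>)", html)); // [Link 1, Link 2]
    }
}
```

With both lookarounds in place no capturing group is needed, although the limitations with nested tags mentioned in the other answers still apply.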

Answer 5

A:

A nice, quick way to test your regular expressions is to use a regex editor such as the following Eclipse plugin: http://brosinski.com/regex/

crowne 2009-11-15 15:04:23


Regex to extract link content
