Hi All,

I'll be the first to admit that my regex knowledge is hopeless. I am using Java with the following code:

Matcher m = Pattern.compile(">[^<>]*</a>").matcher(html);
while (m.find()) {
 resp.getWriter().println(html.substring(m.start(), m.end()));
}

I get the following list:

>Link Text a</a>
>Link Text b</a>

What am I missing to remove the > and the </a>?

Cheers.

+2  A: 

Have you looked at using a capturing group?

Pattern.compile(">([^<>]*)</a>")

Note however that it's generally not recommended to use regular expressions for HTML, since HTML isn't regular. You will get more reliable results by using an HTML parser such as JTidy.

Brian Agnew
I tried this, but it produces the same list. Cheers.
Littlejon
This answer is also correct. Changing the html.substring(m.start(), m.end()) to m.group(1) makes this work.
Littlejon
+2  A: 

Keep in mind that due to its limited nature, your regex (and regex in general) may run into problems if the HTML you're trying to parse is slightly more complex. For example, the following would fail to parse correctly, but is completely valid (and common) HTML:

<a href="blah.html">this is only a <em>single</em> link</a>
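To see what actually happens, here is a minimal sketch that runs the question's pattern (assuming a capturing group has been added around the text part) over the snippet above. The class name NestedLinkDemo and the helper linkTexts are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedLinkDemo {

    // Hypothetical helper: collects group(1) of every match of the
    // question's pattern, with a capturing group around the text part.
    static List<String> linkTexts(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(">([^<>]*)</a>").matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a href=\"blah.html\">this is only a <em>single</em> link</a>";
        // [^<>]* cannot cross the nested <em> tag, so only the text
        // between </em> and </a> is captured.
        System.out.println(linkTexts(html));
    }
}
```

This prints [ link]: the pattern does match, but only the tail of the link text after the nested tag.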

You might be better off using a DOM parser (I'm pretty sure Java has plenty of options in this regard), from which you can then request the inner text of each <a> tag.

Amber
Nah, it won't fail, it just won't give you what you expect ;) You get "> link</a>".
roe
+2  A: 

You can do that by wrapping a group around that part of your regex and then using group(X) where X is the number of the group:

Matcher m = Pattern.compile(">([^<>]*)</a>").matcher(html);
while (m.find()) {
 resp.getWriter().println(m.group(1));
}

But a better way would be to use a simple parser for this:

import java.io.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class HtmlParseDemo {
   public static void main(String [] args) throws Exception {
       Reader reader = new StringReader("foo <a href=\"#\">Link 1</a> bar <a href=\"#\">Link <b>2</b> more</a> baz");
       HTMLEditorKit.Parser parser = new ParserDelegator();
       parser.parse(reader, new LinkParser(), true);
       reader.close();
   }
}

class LinkParser extends HTMLEditorKit.ParserCallback {

    private boolean linkStarted = false;
    private StringBuilder b = new StringBuilder();

    public void handleText(char[] data, int pos) {
        if(linkStarted) b.append(new String(data));
    }

    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if(t == HTML.Tag.A) linkStarted = true;
    }

    public void handleEndTag(HTML.Tag t, int pos) {
        if(t == HTML.Tag.A) {
            linkStarted = false;
            System.out.println(b);
            b = new StringBuilder();
        }
    }
}

Output:

Link 1
Link 2 more
Bart Kiers
That worked great. Thanks.
Littlejon
You're welcome Littlejon.
Bart Kiers
Can I get the link target, i.e. '#', instead of the link text ('Link 1' or 'Link 2 more')?
Ritz
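On the question of getting the link target rather than the text: one possible approach, sketched here with the same javax.swing parser, is to read the href attribute from the attribute set passed to handleStartTag. The class name HrefParseDemo and the helper hrefs are made up for illustration, and this is a sketch rather than a tested drop-in.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class HrefParseDemo {

    // Hypothetical helper: returns the href value of every <a> tag found.
    static List<String> hrefs(String html) throws IOException {
        final List<String> result = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.A) {
                    // The attribute set holds the tag's attributes; href may be absent.
                    Object href = a.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        result.add(href.toString());
                    }
                }
            }
        };
        try (StringReader reader = new StringReader(html)) {
            new ParserDelegator().parse(reader, callback, true);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        String html = "foo <a href=\"#\">Link 1</a> bar <a href=\"#\">Link <b>2</b> more</a> baz";
        System.out.println(hrefs(html));
    }
}
```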
+1  A: 

I'm late to the party but I'd like to point out another alternative:

(?<=X)      X, via zero-width positive lookbehind

If you put your initial > into that mess, i.e.

(?<=>)[^<>]*</a>

then it should not be returned as part of your result.

Untested, though. Good luck!
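Note that the lookbehind on its own only drops the leading >; the </a> is still consumed by the match. A minimal sketch that pairs it with a positive lookahead, (?=</a>), so that neither ends up in the result, is below. The class name LookaroundDemo, the helper linkTexts and the sample input are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {

    // (?<=>)   : the match must be preceded by '>' (not part of the match)
    // [^<>]*   : the link text itself
    // (?=</a>) : the match must be followed by "</a>" (not part of the match)
    private static final Pattern LINK_TEXT = Pattern.compile("(?<=>)[^<>]*(?=</a>)");

    // Hypothetical helper: returns the whole match for each link found.
    static List<String> linkTexts(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = LINK_TEXT.matcher(html);
        while (m.find()) {
            result.add(m.group()); // no capturing group needed
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample input, shaped like the output shown in the question.
        String html = "<a href=\"a.html\">Link Text a</a> <a href=\"b.html\">Link Text b</a>";
        System.out.println(linkTexts(html));
    }
}
```

This prints [Link Text a, Link Text b]. The same caveat about nested tags inside the link applies here as with the capturing-group version.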

Carl Smotricz
A: 

A nice quick way to test your regular expressions is to use a regex editor such as the following Eclipse plugin: http://brosinski.com/regex/

crowne