I have a page's source and I want to get the anchor text of all its anchor tags.

Could someone please help me out with a pattern for it?

Thanks in advance.

+2  A: 

karim79 is right that regex might be the wrong tool, but here is one simple way it could be done in Java. Note that this will not work if the anchors have additional attributes before the href, if the href does not start with #, or if the anchor text contains anything other than word characters and whitespace. However, it might be a good start or help you understand how you could do it.

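    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class AnchorText {
        public static void main(String[] args) {
            String html = "<body>" +
                    "<a href=\"#first\">got to first</a>" +
                    "<span>something else</span>" +
                    "<a href=\"#second\">got to second</a>" +
                    "</body>";

            // Group 1 captures the fragment after "#", group 2 the anchor text.
            Pattern pattern = Pattern.compile("<a href=\"#(\\w+)\">([\\w\\s]+)</a>");
            Matcher matcher = pattern.matcher(html);
            while (matcher.find()) {
                // Print only the anchor text, not the URL.
                System.out.println(matcher.group(2));
            }
        }
    }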
Tim Büthe
Thanks for the reply.
Jack
But what I want is the anchor text, not the URL.
Jack
@Jack: I added a second group that gets the a-tag's content.
Tim Büthe
A: 

Try this regex pattern; it should give you what you are looking for:

(?<=<\s*a[^>]*>)(?<anchorContent>[\s\S]*?)(?=<\s*/a>)

This will give you a named group called "anchorContent" that holds the text between the opening and closing a-tags.
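One caveat if you are working in Java, as in the other answer: the pattern above uses an unbounded lookbehind (`[^>]*` inside `(?<=...)`), which .NET accepts but which `java.util.regex` will, as far as I know, generally reject. Below is a minimal sketch of the same idea adapted for Java, assuming a plain match on the whole element instead of lookarounds. The class name `AnchorTextExtractor`, the method `anchorTexts`, and the sample HTML are made up for illustration. The `\b` after `a` is an addition that keeps tags such as `<abbr>` from matching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorTextExtractor {
    // <a\b[^>]*>            opening a-tag with any attributes (\b rejects <abbr>, <article>, ...)
    // (?<anchorContent>.*?) the content, matched lazily; DOTALL lets '.' span line breaks
    // </a\s*>               the closing tag
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\b[^>]*>(?<anchorContent>.*?)</a\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed content of every a-element found in the given HTML.
    static List<String> anchorTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher matcher = ANCHOR.matcher(html);
        while (matcher.find()) {
            texts.add(matcher.group("anchorContent").trim());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Hypothetical sample input with extra attributes before and after href.
        String html = "<body>"
                + "<a class=\"nav\" href=\"#first\">go to first</a>"
                + "<abbr title=\"x\">not an anchor</abbr>"
                + "<a href=\"#second\" id=\"s\">go to second</a>"
                + "</body>";
        for (String text : anchorTexts(html)) {
            System.out.println(text);
        }
    }
}
```

This is still a regex over HTML, so it has limits: a `>` inside an attribute value will confuse `[^>]*`, and any nested markup inside the anchor (for example `<b>`) is returned as part of the content rather than stripped.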

Hope that helps.

jimplode