ansaurus

Question

Answer 1

A:

You wrote in a comment on your question that you want the value inside href. Here is a regex for that:

<a[^>]*? href=\"(?<url>[^\"]+)\"[^>]*?>

This regex works with the Microsoft .NET Framework. It captures the content of href in a group named url.

I just noticed that this question is tagged Java. Java has no named groups as of JDK 6, so here's the solution for Java:

<a[^>]*? href="([^"]+)"[^>]*?>

The above regex will capture the content within href putting it in group 1.
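As a side note, named groups were added to java.util.regex in Java 7, so on a newer JDK the .NET-style pattern can be used almost unchanged. A minimal sketch, assuming Java 7 or later; the class name NamedGroupHref and the helper extractUrl are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupHref {

    // Same pattern as the .NET version above; (?<url>...) needs Java 7 or later.
    private static final Pattern HREF =
            Pattern.compile("<a[^>]*? href=\"(?<url>[^\"]+)\"[^>]*?>");

    // Hypothetical helper: returns the href of the first matching anchor, or null.
    static String extractUrl(String html) {
        Matcher m = HREF.matcher(html);
        return m.find() ? m.group("url") : null;
    }

    public static void main(String[] args) {
        // Prints: page.php?page=1
        System.out.println(extractUrl("<a href=\"page.php?page=1\">EDITORS PREFACE</a>"));
    }
}
```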

Test it here: http://www.regexplanet.com/simple/index.html

Run this program:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches {

    public static void main(String[] args) {

        // String to be scanned to find the pattern.
        String line = "<a href='page.php?page=1&filename=SomeFile&chapter=EDITORS PREFACE'>EDITORS PREFACE</a>";
        String pattern = "<a[^>]*? href='([^']+)'[^>]*?>";

        // Create a Pattern object.
        Pattern r = Pattern.compile(pattern);

        // Now create a Matcher object.
        Matcher m = r.matcher(line);

        if (m.find()) {
            // Found value: <a href='page.php?page=1&filename=SomeFile&chapter=EDITORS PREFACE'>
            System.out.println("Found value: " + m.group(0));

            // Found value: page.php?page=1&filename=SomeFile&chapter=EDITORS PREFACE
            System.out.println("Found value: " + m.group(1));
        } else {
            System.out.println("NO MATCH");
        }
    }
}
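
The program above only matches single-quoted href values. If the input may use either quote style, one option is to capture the opening quote and require the same character to close the value with a backreference. This is only a sketch: the class name HrefEitherQuote and the helper extractHref are made up for illustration, and \s is used where the patterns above use a literal space.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefEitherQuote {

    // Group 1 captures the opening quote (" or '); \1 requires the same
    // character to close the value. Group 2 is the href value itself.
    private static final Pattern HREF =
            Pattern.compile("<a[^>]*?\\shref=([\"'])(.*?)\\1[^>]*>");

    // Hypothetical helper: returns the href of the first anchor, or null.
    static String extractHref(String html) {
        Matcher m = HREF.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String single = "<a href='page.php?page=1&filename=SomeFile&chapter=EDITORS PREFACE'>EDITORS PREFACE</a>";
        String dbl = "<a href=\"page.php?page=1\">EDITORS PREFACE</a>";

        // Prints: page.php?page=1&filename=SomeFile&chapter=EDITORS PREFACE
        System.out.println(extractHref(single));
        // Prints: page.php?page=1
        System.out.println(extractHref(dbl));
    }
}
```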

Leniel Macaferi 2010-07-26 18:40:53

In general, using regex to parse XML/HTML is a *bad idea*, because the regex will depend critically on the exact structure of the input. This is NOT guaranteed, so a slight change to the input, such as re-ordering attributes, will break the regex. The only real way to accomplish this with robustness is to use an HTML or XML parser library.

Jim Garrison 2010-07-26 21:21:26
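
To illustrate the parser route mentioned in the comment above, here is a rough sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html), so no extra library is needed. It is an older, lenient parser, and a dedicated HTML library may be a better fit for real pages. The class name LinkParser, the helper extractLinks, and the sample input are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class LinkParser {

    // Hypothetical helper: returns one {href, text} pair per <a> element.
    static List<String[]> extractLinks(String html) {
        final List<String[]> links = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private String href;        // href of the anchor being read
            private StringBuilder text; // non-null while inside an <a> element

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object value = attrs.getAttribute(HTML.Attribute.HREF);
                    href = (value == null) ? null : value.toString();
                    text = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (text != null) {
                    text.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.A && text != null) {
                    links.add(new String[] { href, text.toString().trim() });
                    text = null;
                }
            }
        };

        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        // Attribute order and quote style do not matter to the parser.
        String html = "<a class='toc' href='page.php?page=1'>EDITORS PREFACE</a>";
        for (String[] link : extractLinks(html)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```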

Hi Leniel, thanks for the help, but I can't seem to implement it in Java. I used the first example on this page - http://www.tutorialspoint.com/java/java_regular_expressions.htm - and replaced the pattern variable with your example and line with my sample. It comes up with no match.

usertest 2010-07-26 22:39:47

I just found out why: your line example uses single quotes ('). The regex I posted expects double quotes ("). I updated the answer with a program that I ran using Eclipse. It's working fine now.

Leniel Macaferi 2010-07-27 12:48:03

Hi Leniel, thanks for all the help, but I think you misunderstood my question. I wanted just the text "EDITORS PREFACE", not the whole link.

usertest 2010-07-29 15:31:24

In the 2nd comment on your question you wrote you wanted the text between the href tag... I'll try to update the answer.

Leniel Macaferi 2010-07-29 15:44:30

Sorry, I was replying to the comment before "Do you want it extracted from the href attribute or from between the anchor tags?". I didn't word my reply very clearly.

usertest 2010-08-01 17:12:00
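
For the case clarified in the last comments, where only the text between the anchor tags is wanted, the same regex approach can capture the tag body instead of the href. This is a sketch only and shares the fragility noted earlier in the thread (nested tags inside the link, for example, would end up in the captured text). The class name AnchorText and the helper extractText are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorText {

    // Group 1 is everything between the opening <a ...> and the first </a>.
    // DOTALL lets the link text span several lines.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\b[^>]*>(.*?)</a>",
                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical helper: returns the text of the first anchor, or null.
    static String extractText(String html) {
        Matcher m = ANCHOR.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "<a href='page.php?page=1&filename=SomeFile&chapter=EDITORS PREFACE'>EDITORS PREFACE</a>";

        // Prints: EDITORS PREFACE
        System.out.println(extractText(line));
    }
}
```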

Extract text with java
