Given the string below, how can I extract the EDITORS PREFACE text with Java? Thanks.

<div class='chapter'><a href='page.php?page=1&filename=SomeFile&chapter=EDITORS PREFACE'>EDITORS PREFACE</a></div> 
A: 

Since you wrote in a comment on your question that you want what is inside href, here is a regex for that:

<a[^>]*? href=\"(?<url>[^\"]+)\"[^>]*?>

This regex works with the Microsoft .NET Framework. It captures the content of href in a group named url.

I just noticed that this question is tagged Java. Java has no named groups as of JDK 6, so here is the Java version:

<a[^>]*? href="([^"]+)"[^>]*?>

The above regex will capture the content within href putting it in group 1.

Test it here: http://www.regexplanet.com/simple/index.html
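
As a side note, on Java 7 and later java.util.regex does accept the (?<name>...) syntax, so the named-group form can be used there as well. Below is a minimal sketch under that assumption. The class name NamedGroupExample and the method extractHref are made up for illustration, and the pattern is loosened to accept either quote style around the href value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupExample {
    // Group 1 remembers which quote character opened the value;
    // the named group "url" captures everything up to the matching quote.
    private static final Pattern HREF =
        Pattern.compile("<a[^>]*?\\shref=(['\"])(?<url>.*?)\\1[^>]*>");

    // Returns the href value of the first anchor found, or null if none matches.
    static String extractHref(String html) {
        Matcher m = HREF.matcher(html);
        return m.find() ? m.group("url") : null;
    }

    public static void main(String[] args) {
        String line = "<a href='page.php?page=1&filename=SomeFile&chapter=EDITORS PREFACE'>EDITORS PREFACE</a>";
        System.out.println(extractHref(line));
    }
}
```

On JDK 6 the same pattern would have to fall back to numbered groups, as in the regex above.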

Run this program, which uses single quotes in the pattern to match your sample string:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches
{
    public static void main(String[] args)
    {
        // String to be scanned to find the pattern.
        String line = "<a href='page.php?page=1&filename=SomeFile&chapter=EDITORS PREFACE'>EDITORS PREFACE</a>";

        // The sample uses single quotes around the href value, so the pattern does too.
        String pattern = "<a[^>]*? href='([^']+)'[^>]*?>";

        // Create a Pattern object.
        Pattern r = Pattern.compile(pattern);

        // Now create a Matcher object.
        Matcher m = r.matcher(line);

        if (m.find())
        {
            // Prints: Found value: <a href='page.php?page=1&filename=SomeFile&chapter=EDITORS PREFACE'>
            System.out.println("Found value: " + m.group(0));

            // Prints: Found value: page.php?page=1&filename=SomeFile&chapter=EDITORS PREFACE
            System.out.println("Found value: " + m.group(1));
        }
        else
        {
            System.out.println("NO MATCH");
        }
    }
}
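
If what you need is the text between the anchor tags rather than the href value, the same approach can be pointed at the tag body instead. The following is only a sketch: the class name AnchorTextExample and the method extractAnchorText are hypothetical, and it assumes the anchor contains plain text with no nested tags and that only the first anchor in the string matters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorTextExample {
    // Group 1 is whatever sits between the opening <a ...> tag and the closing </a>.
    private static final Pattern ANCHOR_TEXT =
        Pattern.compile("<a\\b[^>]*>(.*?)</a>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed text of the first anchor, or null if there is no anchor.
    static String extractAnchorText(String html) {
        Matcher m = ANCHOR_TEXT.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String line = "<div class='chapter'><a href='page.php?page=1&filename=SomeFile&chapter=EDITORS PREFACE'>EDITORS PREFACE</a></div>";
        System.out.println(extractAnchorText(line)); // prints: EDITORS PREFACE
    }
}
```

Because the pattern ignores the attributes entirely, it does not care whether href uses single or double quotes, but it would still break on markup that differs much from the sample.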
Leniel Macaferi
In general, using regex to parse XML/HTML is a *bad idea*, because the regex will depend critically on the exact structure of the input. That structure is NOT guaranteed, so a slight change to the input, such as re-ordering attributes, will break the regex. The only robust way to accomplish this is to use an HTML or XML parser library.
Jim Garrison
Hi Leniel, thanks for the help, but I can't seem to implement it in Java. I used the first example on this page - http://www.tutorialspoint.com/java/java_regular_expressions.htm - and replaced the pattern variable with your example and line with my sample. It comes up with no match.
usertest
I just found out why: your line example uses single quotes ('). The regex I posted expects double quotes ("). I updated the answer with a program that I ran using Eclipse. It's working fine now.
Leniel Macaferi
Hi Leniel, thanks for all the help, but I think you misunderstood my question. I wanted just the text "EDITORS PREFACE", not the whole link.
usertest
In the 2nd comment on your question you wrote you wanted the text between the href tag... I'll try to update the answer.
Leniel Macaferi
Sorry, I was replying to the comment before "Do you want it extracted from the href attribute or from between the anchor tags?". I didn't word my reply very clearly.
usertest