I am having problems porting a regular expression that I used in JavaScript to Java. A web page may contain:

<b>Renewal Date:</b> 03 May 2010</td>

I just want to pull out the 03 May 2010, bearing in mind that the web page contains much more than the content above. This is how I currently do it in JavaScript:

DateStr = /<b>Renewal Date:<\/b>(.+?)<\/td>/.exec(returnedHTMLPage);

I tried to follow some tutorials on java.util.regex.Pattern and java.util.regex.Matcher with no luck. I can't seem to translate (.+?) into something they understand.

thanks,

Noeneel

+3  A: 

This is how regular expressions are used in Java:

Pattern p = Pattern.compile("<b>Renewal Date:</b>(.+?)</td>");
Matcher m = p.matcher(returnedHTMLPage);

if (m.find()) // find the next match (and "generate the groups")
    System.out.println(m.group(1)); // prints whatever the .+? expression matched.

There are other useful methods in the Matcher class, such as m.matches(). Have a look at http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Matcher.html
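
For reference, a self-contained version of the snippet above might look like the following sketch. The class name RenewalDateDemo, the helper extractRenewalDate and the sample HTML string are assumptions made for illustration; they are not part of the original question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RenewalDateDemo {
    // Compiled once and reused; '/' needs no escaping in a Java regex.
    private static final Pattern RENEWAL =
            Pattern.compile("<b>Renewal Date:</b>(.+?)</td>");

    // Hypothetical helper: returns the captured date, or null if the label is absent.
    static String extractRenewalDate(String html) {
        Matcher m = RENEWAL.matcher(html);
        // find() scans for the pattern anywhere in the input.
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Made-up sample input standing in for returnedHTMLPage.
        String html = "<tr><td><b>Renewal Date:</b> 03 May 2010</td></tr>";
        System.out.println(extractRenewalDate(html)); // prints "03 May 2010"
    }
}
```

The trim() call is only there because the capture group also picks up the space after the closing </b> tag.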

aioobe
No joy; I posted some more information in this thread.
bebeTech
To get a partial match, you need to use `find()` rather than `matches()`. I've edited aioobe's answer to fix this.
Jan Goyvaerts
@Jan, Thank you!
aioobe
@aioobe, thank you once again. find() did the trick. Sorry for the newbie questions; I have done lots of self-taught JavaScript and am now trying to transition to Java.
bebeTech
+1  A: 

Still no joy. Sample html code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<body>

<table cellpadding="2" cellspacing="0" border="0" width="100%" class="light_box">
<tr>
<td valign=top><b>Expiry Date:</b> 11 May 2010</td>

<td align=right><b>Current Period:</b> Ends: 11 May 2010
</td></tr>
<tr>
<td colspan=2>&nbsp;</td>
</tr></table>
</body>
</html>

Tried a couple of ways:

String regexDate = "<b>Expiry Date:<\\/b>(.+?)<\\/td>";
// String regexDate = "<b>Expiry Date:<\\/b>";
Pattern p = Pattern.compile(regexDate);
String[] items = p.split(returnedHTML);
System.out.println("*******REGEX 1 RESULT*******");
for (String s : items)
{
    System.out.println(s); // prints each piece of the page around the matches
}
System.out.println("*******REGEX 1 RESULT*******");

Pattern p2 = Pattern.compile("<b>Expiry Date:<\\/b>");
Matcher m = p2.matcher(returnedHTML);

if (m.matches()) // check if it matches (and "generate the groups")
{
    System.out.println("*******REGEX 2 RESULT*******");
    System.out.println(m.group(1)); // prints whatever the .+? expression matched.
    System.out.println("*******REGEX 2 RESULT*******");
}

The first returns the whole page minus the pattern and the second returns nothing?

bebeTech
Why are you doing the \\ thing? Just leave them out!
polygenelubricants
Why are you messing around with split() here? You don't need it. You should use the same pattern for p2 as for p.
J-16 SDiZ
@bebeTech: read the documentation. `split()` splits the input string around the matches of the expression you pass in. In this case, the expression you were looking for. In the second case, you used `matches()` when you wanted `find()`.
wds
Thank you guys for your help.
bebeTech
+4  A: 

On matches vs find

The problem is that you used matches when you should've used find. From the API:

  • The matches method attempts to match the entire input sequence against the pattern.
  • The find method scans the input sequence looking for the next subsequence that matches the pattern.

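Note that String.matches(String regex) also looks for a full match of the entire string. Unfortunately String does not provide a partial regex match, but you can always use s.matches(".*pattern.*") instead. Since . does not match line terminators by default, multi-line input such as an HTML page needs the (?s) flag: s.matches("(?s).*pattern.*").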
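
The difference can be seen in a small sketch like the one below. The class name MatchesVsFindDemo, its two helper methods and the sample page string are assumptions made for illustration. The last two lines also show that . does not cross line breaks unless the (?s) flag is set, which matters when the input is a whole HTML page.

```java
import java.util.regex.Pattern;

class MatchesVsFindDemo {
    static final Pattern P = Pattern.compile("<b>Expiry Date:</b>(.+?)</td>");

    // True only if the pattern covers the input from start to end.
    static boolean wholeInputMatches(String input) {
        return P.matcher(input).matches();
    }

    // True if the pattern occurs anywhere in the input.
    static boolean containsMatch(String input) {
        return P.matcher(input).find();
    }

    public static void main(String[] args) {
        // Made-up multi-line sample, loosely based on the HTML in this thread.
        String page = "<tr>\n<td><b>Expiry Date:</b> 11 May 2010</td>\n</tr>";

        System.out.println(wholeInputMatches(page)); // prints "false"
        System.out.println(containsMatch(page));     // prints "true"

        // String.matches is also a full match; (?s) lets . match line breaks.
        System.out.println(page.matches("(?s).*<b>Expiry Date:</b>.*")); // prints "true"
        System.out.println(page.matches(".*<b>Expiry Date:</b>.*"));     // prints "false"
    }
}
```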


On reluctant quantifier

Java understands (.+?) perfectly.

Here's a demonstration: you're given a string s that consists of a string t repeating at least twice. Find t.

System.out.println("hahahaha".replaceAll("^(.+)\\1+$", "($1)"));
// prints "(haha)" -- greedy takes longest possible

System.out.println("hahahaha".replaceAll("^(.+?)\\1+$", "($1)"));
// prints "(ha)" -- reluctant takes shortest possible

On escaping metacharacters

It should also be said that you have injected \ into your regex ("\\" as Java string literal) unnecessarily.

        String regexDate = "<b>Expiry Date:<\\/b>(.+?)<\\/td>";
                                            ^^         ^^
        Pattern p2 = Pattern.compile("<b>Expiry Date:<\\/b>");
                                                      ^^

\ is used to escape regex metacharacters. A / is NOT a regex metacharacter.
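
A quick way to check this is a sketch like the following. The class name EscapingDemo and the helper matchesLiteral are assumptions made for illustration; Pattern.quote is the standard library method for treating a whole string as literal text.

```java
import java.util.regex.Pattern;

class EscapingDemo {
    // Hypothetical helper: matches input against the literal text, with
    // every metacharacter in it neutralised by Pattern.quote.
    static boolean matchesLiteral(String literal, String input) {
        return input.matches(Pattern.quote(literal));
    }

    public static void main(String[] args) {
        // '/' is an ordinary character, so the backslash changes nothing.
        System.out.println("/".matches("/"));     // prints "true"
        System.out.println("/".matches("\\/"));   // prints "true"

        // '.' is a metacharacter, so here the backslash does matter.
        System.out.println("axb".matches("a.b"));   // prints "true"
        System.out.println("axb".matches("a\\.b")); // prints "false"

        System.out.println(matchesLiteral("a.b", "a.b")); // prints "true"
        System.out.println(matchesLiteral("a.b", "axb")); // prints "false"
    }
}
```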

polygenelubricants
I only used it because it was suggested previously that I needed it. If I have just `String regexDate = "<b>Expiry Date:</b>(.+?)</td>"; Pattern p = Pattern.compile(regexDate); Matcher m = p.matcher(returnedHTML); if (m.matches()) { System.out.println("*******REGEX RESULT*******"); System.out.println(m.group(1)); System.out.println("*******REGEX RESULT*******"); }` it still fails.
bebeTech
@bebeTech: use `if (m.find())` instead of `if (m.matches())` in this case. Look at the documentation to see the difference.
polygenelubricants
@polygenelubricants: the problem is not the backslashes he introduced. They end up just quoting the / following them, so shouldn't mess with the results (although they are of course superfluous).
wds
@wds: Ah, you're right. `"/".matches("\\/")` is `true`. Answer restructured.
polygenelubricants
Thank you, all good. 8-)
bebeTech
+1 then, seems you covered most of it
wds
A: 

(.+?) is an odd choice. Try ( *[0-9]+ *[A-Za-z]+ *[0-9]+ *) or just ([^<]+) instead.
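
As a rough sketch of the ([^<]+) variant: the class name NegatedClassDemo, the helper extractExpiryDate and the sample strings below are assumptions made for illustration. One practical difference from (.+?) is that a negated character class also matches line breaks, so the capture still works when the closing </td> is on the next line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedClassDemo {
    // [^<]+ consumes everything up to the next '<', including newlines.
    private static final Pattern EXPIRY =
            Pattern.compile("<b>Expiry Date:</b>([^<]+)</td>");

    // Hypothetical helper: returns the trimmed date, or null if not found.
    static String extractExpiryDate(String html) {
        Matcher m = EXPIRY.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String sameLine = "<td valign=top><b>Expiry Date:</b> 11 May 2010</td>";
        String nextLine = "<td valign=top><b>Expiry Date:</b> 11 May 2010\n</td>";

        System.out.println(extractExpiryDate(sameLine)); // prints "11 May 2010"
        System.out.println(extractExpiryDate(nextLine)); // prints "11 May 2010"
    }
}
```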

drawnonward
It works and validates as OK syntax. I just can't get it to work with Java. I have it working fine with JS.
bebeTech
+1  A: 

OK, so using aioobe's original suggestion (which I also tried earlier), I have:

String regexDate = "<b>Expiry Date:</b>(.+?)</td>";
Pattern p = Pattern.compile(regexDate);
Matcher m = p.matcher(returnedHTML);

if (m.matches()) // check if it matches (and "generate the groups")
{
  System.out.println("*******REGEX RESULT*******"); 
  System.out.println(m.group(1)); // prints whatever the .+? expression matched.
  System.out.println("*******REGEX RESULT*******"); 
}

The if statement must keep evaluating to false, as the **REGEX RESULT** banner is never printed.

In case anyone missed what I am trying to achieve, I just want to get the date out. Somewhere in an HTML page is a date like <b>Expiry Date:</b> 03 May 2010</td> and I want the 03 May 2010.

bebeTech
Then change `if (m.matches())` to `if (m.find())`, as @polygenelubricants mentioned above! @Jan even kindly updated my post to use `find()` instead of `matches()`.
aioobe
Yes, 2 people with the correct answer but I can only tick one?
bebeTech