I have a string containing a large chunk of HTML, and I am trying to extract the link from the href="..." portion of it. The href can appear in one of the following forms:
    <a href="..." />
    <a class="..." href="..." />
I am generally comfortable with regex, but for some reason the following code does not do what I expect:
    String innerHTML = getHTML();
    Pattern p = Pattern.compile("href=\"(.*)\"", Pattern.DOTALL);
    Matcher m = p.matcher(innerHTML);
    if (m.find()) {
        // Get all groups for this match
        for (int i = 0; i <= m.groupCount(); i++) {
            String groupStr = m.group(i);
            System.out.println(groupStr);
        }
    }
Can someone tell me what is wrong with my code? I have done this kind of thing in PHP, but in Java I am doing something wrong. What happens is that it prints the whole HTML string whenever I try to print a group.
EDIT: Just so that everyone knows what kind of string I am dealing with:
    <a class="Wrap" href="item.php?id=43241"><input type="button">
    <span class="chevron"></span>
    </a>
    <div class="menu"></div>
Every time I run the code, it prints the whole string. That is the problem.
As for using jTidy: I'm looking into it, but I would still like to know what went wrong in this case.
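To make the behaviour easy to reproduce, here is a minimal self-contained sketch. The class name HrefRepro and the helper firstGroup are made up for illustration, and the sample markup from the EDIT is hard-coded in place of getHTML(). The second pattern, which uses a reluctant .*? instead of a greedy .*, is only an assumption about the cause and is included purely for comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefRepro {
    // Sample markup copied from the EDIT above (stands in for getHTML()).
    static final String SAMPLE =
        "<a class=\"Wrap\" href=\"item.php?id=43241\"><input type=\"button\">\n"
        + "<span class=\"chevron\"></span>\n"
        + "</a>\n"
        + "<div class=\"menu\"></div>";

    // Hypothetical helper: returns capture group 1 of the first match,
    // or null if the pattern does not match at all.
    static String firstGroup(String regex, String html) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Pattern from the question: greedy .* combined with DOTALL.
        System.out.println("greedy:    " + firstGroup("href=\"(.*)\"", SAMPLE));
        // Assumed alternative: reluctant .*? for comparison.
        System.out.println("reluctant: " + firstGroup("href=\"(.*?)\"", SAMPLE));
    }
}
```

With this input, the greedy group runs on to the last double quote in the string, while the reluctant group stops at the first closing quote after href=".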