I want to parse an HTML document and extract a certain div block that can be repeated.

I've managed to extract the first occurrence of the block, but I can't figure out how to get the next one.

This is my code so far:

    String inputStr = HTTPGetter.get("http://someurl");
    String patternStr = "<div class=\"MY-CLASS\">(.*?)</div>";

    // Compile and use regular expression
    Pattern pattern = Pattern.compile(patternStr);
    Matcher matcher = pattern.matcher(inputStr);
    boolean matchFound = matcher.find();

    if (matchFound) {
        // Get all groups for this match
        for (int i = 0; i <= matcher.groupCount(); i++) {
            String groupStr = matcher.group(i);
            System.out.println("Group found:\n" + groupStr);
        }
    } else {
        System.out.println("Not found");
    }

The document I'm parsing has more than one div block of class MY-CLASS. I want to get all of them.

How can I do that?

+4  A: 

Just use find() in a while loop:

    while (matcher.find()) {
        System.out.println("Group found:\n" + matcher.group(1));
    }

It's the matches you need to iterate through, not the capture groups.
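
For reference, here is a minimal, self-contained sketch of that loop. The class name `DivExtractor`, the helper `extractAll`, and the sample HTML are hypothetical, made up for illustration. The `Pattern.DOTALL` flag is also an addition that is not in the question's code; it is assumed here so that `.` can match across line breaks inside a div.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DivExtractor {
    // Same expression as in the question. DOTALL is an assumption added
    // here so that '.' also matches line breaks inside the block.
    private static final Pattern DIV = Pattern.compile(
            "<div class=\"MY-CLASS\">(.*?)</div>", Pattern.DOTALL);

    // Collects capture group 1 of every match, in document order.
    static List<String> extractAll(String html) {
        List<String> blocks = new ArrayList<>();
        Matcher matcher = DIV.matcher(html);
        // Each call to find() advances to the next match, or returns false.
        while (matcher.find()) {
            blocks.add(matcher.group(1));
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Hypothetical sample input standing in for the downloaded page.
        String html = "<div class=\"MY-CLASS\">first</div><p>x</p>"
                + "<div class=\"MY-CLASS\">second</div>";
        for (String block : extractAll(html)) {
            System.out.println("Group found:\n" + block);
        }
    }
}
```

Note that the reluctant `(.*?)` stops at the first `</div>` it sees, so a block that contains another div inside it will be cut short.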

Alan Moore
+4  A: 

Are you sure you don't want to use an XML parser? Regular expressions are really not suitable for non-regular languages like XML.
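
As a rough sketch of what that could look like with the XML parser that ships with the JDK: the class `XhtmlDivReader`, the method `divText`, and the sample markup below are hypothetical, and the approach assumes the input is well-formed XML (that is, XHTML). Ordinary tag-soup HTML will make the parser throw.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class XhtmlDivReader {
    // Returns the text content of every <div> whose class attribute equals
    // cssClass exactly. Assumes xhtml is well-formed XML.
    static List<String> divText(String xhtml, String cssClass) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            // All div elements in document order, nested ones included.
            NodeList divs = doc.getElementsByTagName("div");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < divs.getLength(); i++) {
                Element div = (Element) divs.item(i);
                if (cssClass.equals(div.getAttribute("class"))) {
                    result.add(div.getTextContent());
                }
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Input that is not well-formed ends up here.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample with a nested div inside the first block.
        String xhtml = "<html><body>"
                + "<div class=\"MY-CLASS\">outer <div class=\"other\">inner</div></div>"
                + "<div class=\"MY-CLASS\">second</div>"
                + "</body></html>";
        for (String text : divText(xhtml, "MY-CLASS")) {
            System.out.println("Block found:\n" + text);
        }
    }
}
```

Because the parser builds a tree, the nested div in the first block is handled without any special casing, which is the case a regular expression struggles with.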

soulmerge
That would only work if the document was XHTML.
JG
There are also plenty of HTML parsers: http://stackoverflow.com/search?q=java+html+parser
Adam Paynter
+1  A: 

I would strongly recommend against using regexps for all but the simplest cases, since HTML is not regular and there are numerous edge cases to trip up your expressions (see numerous answers passim).

Take a look at JTidy, which will parse the HTML and present a DOM interface for you to interrogate.

Brian Agnew
A: 

How can I parse nested div tags? Using this code I can parse an HTML page and extract the content of a single div tag, but not when the page contains nested div tags. Can anyone please tell me how to solve this problem?

shyam