views:

362

answers:

4

Hey everyone, I am trying to find a certain tag in an HTML page with Java. All I know is the kind of tag (div, span, ...) and its id. I don't know what it looks like, how much whitespace it contains, or what else is in the tag. So I thought about using pattern matching, and I have the following code:

 // <tag[any character may be there or not]id="myid"[any character may be there or not]>
 String str1 = "<" + Tag + "[.*]" + "id=\"" + search + "\"[.*]>";
 // <tag[any character may be there or not]id="myid"[any character may be there or not]/>
 String str2 = "<" + Tag + "[.*]" + "id=\"" + search + "\"[.*]/>";
 Pattern p1 = Pattern.compile( str1 );
 Pattern p2 = Pattern.compile( str2 );
 Matcher m1 = p1.matcher( content );
 Matcher m2 = p2.matcher( content );
 int start = -1;
 int stop = -1;
 String Anfangsmarkierung = null;
 int whichMatch = -1;

 while( m1.find() == true || m2.find() == true ){

  if( m1.find() ){
   //System.out.println( " ... " + m1.group() );
   start = m1.start();
   //ende = m1.end();
   stop = content.indexOf( "<", start );
   whichMatch = 1;
  }
  else{
   //System.out.println( " ... " + m2.group() );
   start = m2.start();
   stop = m2.end();
   whichMatch = 2;
  }
 }

But I get an exception from m1.start() (or m2.start()) when I enter the actual tag without the [.*], and I don't get any match when I use the regular expression :( I really haven't found an explanation for this. I haven't worked with Pattern or Matcher at all yet, so I am a little lost. It would be awesome if anyone could explain what I am doing wrong or how I can do it better.

Thanks in advance :)

... dg

+1  A: 

I think each call to find() advances the matcher to the next match. Calling m1.find() again inside the loop body moves the matcher past the match that the loop condition found, possibly to a point where there is no valid match left, which causes m1.start() to throw (I'm guessing) an IllegalStateException. Calling find() once per iteration and storing the result in a flag avoids this problem.

boolean m1Matched = m1.find();
boolean m2Matched = m2.find();
while( m1Matched || m2Matched ) {

    if( m1Matched ){
        ...
    }

    // advance each matcher exactly once per iteration
    m1Matched = m1.find();
    m2Matched = m2.find();
}
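
To make the point about find() concrete, here is a minimal, self-contained sketch. The class name, helper names, and the sample content are made up for illustration and are not from the question. It collects match positions with exactly one find() call per loop pass, and it also checks that start() throws an IllegalStateException once find() has returned false.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindOncePerIteration {

    // Collects the start index of every match; find() is called once per pass.
    static List<Integer> matchStarts(String regex, String content) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(content);
        while (m.find()) {          // the only find() call in the loop
            starts.add(m.start());  // safe: the last find() returned true
        }
        return starts;
    }

    // Returns true if start() throws after find() has returned false.
    static boolean startThrowsAfterFailedFind(String regex, String content) {
        Matcher m = Pattern.compile(regex).matcher(content);
        while (m.find()) {
            // consume every match so that the last find() call fails
        }
        try {
            m.start();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // hypothetical sample input
        String content = "<div id=\"a\"></div><div id=\"a\"/>";
        System.out.println(matchStarts("<div id=\"a\"", content));
        System.out.println(startThrowsAfterFailedFind("<div id=\"a\"", content));
    }
}
```

With a single matcher the usual while (m.find()) idiom is enough; the flag variables above are only needed when two matchers are advanced together.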
butterchicken
Thanks, I will look into that :)
doro
+3  A: 

I know that I am broadening your question, but I think that using a dedicated library for parsing HTML documents (such as http://htmlparser.sourceforge.net/) will be much easier and more accurate than regexps.

Itay
I bet there are some really cool solutions that would take some of the work away as well, but I am supposed to do this from scratch. Thanks, I will look into it anyway ;)
doro
+1  A: 

Here is an example for what you're trying to do adapted from one of my notes:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {

    public static void main(String[] args) {

        String tag = "thetag";
        String id = "foo";

        String content = "<tag1>\n"+
                "<thetag name=\"Tag Name\" id=\"foo\">Some text</thetag>\n" +
                "<thetag name=\"AnotherTag\" id=\"foo\">Some more text</thetag>\n" +
                "</tag1>";

        String patternString = "<" + tag + ".*?name=\"(.*?)\".*?id=\"" + id + "\".*?>";

        System.out.println("Content:\n" + content);
        System.out.println("Pattern: " + patternString);

        Pattern pattern = Pattern.compile(patternString);

        Matcher matcher = pattern.matcher(content);

        boolean found = false;
        while (matcher.find()) {
            System.out.format("I found the text \"%s\" starting at " +
                    "index %d and ending at index %d.%n",
                    matcher.group(), matcher.start(), matcher.end());
            System.out.println("Name: " + matcher.group(1));
            found = true;
        }
        if (!found) {
            System.out.println("No match found.");
        }
    }
}

You'll notice that the pattern string becomes something like <thetag.*?name="(.*?)".*?id="foo".*?> which will search for tags named thetag that have a name attribute followed by an id attribute set to "foo".
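
For comparison with the [.*] used in the question: inside square brackets, . and * are literal characters, so [.*] matches exactly one character that is either a dot or an asterisk. That is likely why the original pattern never matched. A small sketch of the difference follows; the class name, helper, and sample strings are invented for illustration.

```java
import java.util.regex.Pattern;

public class CharClassVsWildcard {

    // True if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // hypothetical sample tag
        String html = "<div class=\"x\" id=\"myid\">";

        // [.*] is a character class: one literal '.' or '*', so this fails
        System.out.println(found("<div[.*]id=\"myid\"", html));

        // .*? outside brackets means "any characters, as few as possible"
        System.out.println(found("<div.*?id=\"myid\"", html));

        // the character class only matches input like this
        System.out.println(found("<div[.*]id", "<div*id"));
    }
}
```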

Note the following:

  • It uses .*? to reluctantly (non-greedily) match zero or more of any character (if you don't understand, try removing the ? to see what I mean).
  • It uses a capturing group between parentheses (the name="(.*?)" part) to extract the contents of the name attribute (as an example).
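
The effect of removing the ? can be seen in a short sketch like the one below. The class name, the firstMatch helper, and the single-line sample input are assumptions made for this illustration. The reluctant form stops at the first closing >, while the greedy form runs on to the last > on the line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {

    // Returns the text of the first match, or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // hypothetical input, kept on one line because '.' does not
        // match line terminators by default
        String input = "<thetag id=\"foo\">one</thetag><thetag id=\"foo\">two</thetag>";

        // reluctant: matches only the first opening tag
        System.out.println(firstMatch("<thetag.*?>", input));

        // greedy: swallows everything up to the last '>' in the input
        System.out.println(firstMatch("<thetag.*>", input));
    }
}
```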
iWerner
Thanks for the code :) Awesome.
doro
A: 

Awesome example. Really helpful :) :)