I need to transform some HTML text that has nested tags, wrapping each search match in an element with a CSS class so it is highlighted (like Firefox's find-in-page). I can't do a simple replace over the whole string (think of a user searching for "img"), so I'm trying to apply the replacement only to the body text, not to tag names or attributes.
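To make the "img" problem concrete, here is a minimal sketch of the naive whole-string replace. The class and method names (`NaiveHighlightDemo`, `naiveHighlight`) are made up for illustration, and using `Pattern.quote` to treat the search term literally is an added assumption, not something the code below does.

```java
import java.util.regex.Pattern;

class NaiveHighlightDemo {
    // Hypothetical helper: wraps every match in the whole string, markup included.
    static String naiveHighlight(String html, String term) {
        return Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE)
                .matcher(html)
                .replaceAll("<span class=\"search\">$0</span>");
    }

    public static void main(String[] args) {
        String html = "<p>An img tag: <img src=\"a.png\"></p>";
        // The tag name "img" is matched too, so the <img> element is corrupted:
        // <p>An <span class="search">img</span> tag: <<span class="search">img</span> src="a.png"></p>
        System.out.println(naiveHighlight(html, "img"));
    }
}
```

The second match lands inside the tag itself, which is why the replacement has to be limited to text runs.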

I have a pretty straightforward HTML parser that I think should do this:

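// Assumes `output` is a final StringBuffer (or other CharSequence) holding the HTML,
// and `lastPos` is an int field, since an anonymous class cannot assign to a local.
final Pattern pat = Pattern.compile(srch, Pattern.CASE_INSENSITIVE);
if (pat.matcher(output).find()) {
    final StringBuffer ret = new StringBuffer(output.length() + 100);
    lastPos = 0;
    try {
        new ParserDelegator().parse(new StringReader(output.toString()),
            new HTMLEditorKit.ParserCallback() {
                @Override
                public void handleText(char[] data, int pos) {
                    // Copy the markup between the previous text run and this one unchanged.
                    ret.append(output.subSequence(lastPos, pos));
                    // Highlight matches inside the text run only.
                    Matcher m = pat.matcher(new String(data));
                    ret.append(m.replaceAll("<span class=\"search\">$0</span>"));
                    lastPos = pos + data.length;
                }
            }, false);
        // Copy whatever markup follows the last text run.
        ret.append(output.subSequence(lastPos, output.length()));
        return ret;
    } catch (Exception e) {
        // Fall back to the unmodified input if parsing fails.
        return output;
    }
}
return output;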

My problem is that when I debug this, handleText is called with text that includes tags! It's as if the parser only goes one level deep. Does anyone know why? Is there some simple thing I need to do to the HTML parser (I haven't used it much) to get 'proper' handling of nested tags?

PS - I figured it out myself; see my answer below. Short answer: it works fine if you pass it HTML, not pre-escaped HTML. Doh! Hope this helps someone else.

<span>example with <a href="#">nested</a> <p>more nesting</p>
</span> <!-- all this gets thrown together -->
A: 

Seems to work fine for me using JDK6 on XP. I wrapped your sample HTML with head and body tags. I got three lines of output:

a) example with
b) nested
c) more nesting

Here's the code I used:

import java.io.*;
import java.net.*;
import javax.swing.text.html.parser.*;
import javax.swing.text.html.*;

public class ParserCallbackText extends HTMLEditorKit.ParserCallback
{
    // Called once for each run of text found between tags.
    public void handleText(char[] data, int pos)
    {
        System.out.println( data );
    }

    public static void main(String[] args)
        throws Exception
    {
        Reader reader = getReader(args[0]);
        ParserCallbackText parser = new ParserCallbackText();
        new ParserDelegator().parse(reader, parser, true);
    }

    static Reader getReader(String uri)
        throws IOException
    {
        // Retrieve from Internet.
        if (uri.startsWith("http:"))
        {
            URLConnection conn = new URL(uri).openConnection();
            return new InputStreamReader(conn.getInputStream());
        }
        // Retrieve from file.
        else
        {
            return new FileReader(uri);
        }
    }
}
camickr
Thanks, camickr, for taking the trouble to verify that. Sorry I didn't give a better test case. This helped because I found the problem while trying to run my sample through your test code. – Jim P
A: 

Sorry for the misleading question. I found my problem, and it wasn't included in my description: my input string had already been pre-processed (HTML-escaped), so the parser was actually looking at text such as

<span>example with &lt;a href="#"&gt;nested&lt;/a&gt; &lt;p&gt;more nesting&lt;/p&gt;
</span> <!-- well of course it all gets thrown together -->
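For anyone who wants to see the difference, here is a rough sketch that collects the text runs the Swing parser reports for real markup versus pre-escaped markup. The class and method names (`EscapedMarkupDemo`, `textRuns`) are hypothetical, and the exact whitespace in each run may differ from what you see.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class EscapedMarkupDemo {
    // Hypothetical helper: returns every text run passed to handleText.
    static List<String> textRuns(String html) {
        final List<String> runs = new ArrayList<>();
        try {
            new ParserDelegator().parse(new StringReader(html),
                new HTMLEditorKit.ParserCallback() {
                    @Override
                    public void handleText(char[] data, int pos) {
                        runs.add(new String(data));
                    }
                }, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return runs;
    }

    public static void main(String[] args) {
        String real = "<html><body><span>example with <a href=\"#\">nested</a> "
                + "<p>more nesting</p></span></body></html>";
        String escaped = "<html><body><span>example with &lt;a href=\"#\"&gt;nested&lt;/a&gt; "
                + "&lt;p&gt;more nesting&lt;/p&gt;</span></body></html>";
        // Real markup: the nested tags split the text into separate runs.
        System.out.println(textRuns(real));
        // Escaped markup: the entities are decoded into plain text, so the
        // "tags" show up inside a text run instead of being parsed as elements.
        System.out.println(textRuns(escaped));
    }
}
```

So the parser was doing the right thing all along: the fix is to hand it the original, unescaped HTML.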
Jim P