My HTML contains tags of the following form:

<div class="author"><a href="/user/1" title="View user profile.">Apple</a> - October 22, 2009 - 01:07</div>

I'd like to extract the date ("October 22, 2009 - 01:07" in this example) from each such tag.

I've implemented javax.swing.text.html.HTMLEditorKit.ParserCallback as follows:

class HTMLParseListerInner extends HTMLEditorKit.ParserCallback {
    private ArrayList<String> foundDates = new ArrayList<String>();
    private boolean isDivLink = false;

    public void handleText(char[] data, int pos) {
        if (isDivLink)
            foundDates.add(new String(data)); // Extracts "Apple" instead of the date.
    }

    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        String divValue = (String) a.getAttribute(HTML.Attribute.CLASS);
        if (t.toString() == "div" && divValue != null && divValue.equals("author"))
            isDivLink = true;
    }
}

However, the above parser returns "Apple", which is the text of the hyperlink inside the tag. How can I fix the parser to extract the date instead?

A: 

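Override handleEndTag and check for "a"? Any text that arrives after the closing "a" tag, while you are still inside the div, should be the date.

However, this HTML parser dates from the late 1990s (it targets HTML 3.2), and these methods are not well specified.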
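
A minimal sketch of that idea, assuming the date always follows the closing `</a>` inside the div and is separated from it by a single "-". The class name `AuthorDateCallback` and the helpers `extractDates` and `stripSeparator` are hypothetical names chosen for illustration; only the `ParserCallback` methods and `ParserDelegator` come from Swing. Nested divs inside the author div are not handled.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback: collects the text that follows </a> inside <div class="author">.
public class AuthorDateCallback extends HTMLEditorKit.ParserCallback {
    private final List<String> foundDates = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();
    private boolean inAuthorDiv = false; // true between <div class="author"> and its </div>
    private boolean afterLink = false;   // true once </a> has been seen inside that div

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        // Compare against the HTML.Tag constant instead of using == on strings.
        if (t.equals(HTML.Tag.DIV) && "author".equals(a.getAttribute(HTML.Attribute.CLASS))) {
            inAuthorDiv = true;
            afterLink = false;
            buffer.setLength(0);
        }
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (!inAuthorDiv) {
            return;
        }
        if (t.equals(HTML.Tag.A)) {
            afterLink = true; // the author link is closed; the date text comes next
        } else if (t.equals(HTML.Tag.DIV)) {
            String date = stripSeparator(buffer.toString());
            if (!date.isEmpty()) {
                foundDates.add(date);
            }
            inAuthorDiv = false;
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Ignore the link text ("Apple"); keep only what follows </a>.
        if (inAuthorDiv && afterLink) {
            buffer.append(data);
        }
    }

    // Assumption: the date is separated from the link by a single leading "-".
    static String stripSeparator(String text) {
        String s = text.trim();
        return s.startsWith("-") ? s.substring(1).trim() : s;
    }

    // Hypothetical convenience method: parses the markup and returns the dates found.
    public static List<String> extractDates(String html) {
        AuthorDateCallback callback = new AuthorDateCallback();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.foundDates;
    }

    public static void main(String[] args) {
        String html = "<div class=\"author\"><a href=\"/user/1\" title=\"View user profile.\">Apple</a>"
                + " - October 22, 2009 - 01:07</div>";
        System.out.println(extractDates(html));
    }
}
```

Buffering the text until `</div>` rather than taking a single `handleText` call is a defensive choice, since the parser is free to deliver the text in more than one piece.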

Tom Hawtin - tackline
And whilst I think about it, it generally isn't a good idea to use `==` on `String`s.
Tom Hawtin - tackline
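
To illustrate the point about `==`, here is a small sketch (the class and method names are hypothetical). `==` on `String`s compares object references, while `equals` compares the characters, so two strings with identical content can still fail an `==` check.

```java
public class StringCompareDemo {
    // Compares with ==, which checks whether both references point to the same object.
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    // Compares with equals, which checks the character content.
    static boolean sameContent(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String built = new String("div"); // a distinct object holding the same characters
        System.out.println(sameReference(built, "div")); // false
        System.out.println(sameContent(built, "div"));   // true
    }
}
```

A check such as `t.toString() == "div"` may happen to pass when both sides are interned literals, but that is an accident of the implementation; comparing with `equals`, or against `HTML.Tag.DIV`, does not depend on it.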
I searched for HTML parsers in Java and this seemed to be a popular one. If you know of any other easy-to-use parsers, I'd appreciate it if you could recommend them.
reprogrammer
A: 
import java.io.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class ParserCallbackDiv extends HTMLEditorKit.ParserCallback
{
    private boolean isDivLink = false; // true while inside <div class="author">
    private String divText;            // most recent text run seen inside that div

    @Override
    public void handleEndTag(HTML.Tag tag, int pos)
    {
        // Only report when the author div closes, not for every div.
        if (tag.equals(HTML.Tag.DIV) && isDivLink)
        {
            System.out.println( divText );
            isDivLink = false;
        }
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos)
    {
        if (tag.equals(HTML.Tag.DIV))
        {
            String divValue = (String)a.getAttribute(HTML.Attribute.CLASS);

            if ("author".equals(divValue))
            {
                isDivLink = true;
                divText = null;
            }
        }
    }

    @Override
    public void handleText(char[] data, int pos)
    {
        // Each text run overwrites the previous one, so when </div> arrives this
        // holds the text that followed </a> rather than the link text.
        if (isDivLink)
            divText = new String(data);
    }

    public static void main(String[] args)
        throws IOException
    {
        // Note the space before "title" so the two attributes stay separate.
        String file = "<div class=\"author\"><a href=\"/user/1\" " +
            "title=\"View user profile.\">Apple</a> - October 22, 2009 - 01:07</div>";
        StringReader reader = new StringReader(file);

        ParserCallbackDiv parser = new ParserCallbackDiv();

        // The third argument tells the parser to ignore any charset declaration.
        new ParserDelegator().parse(reader, parser, true);
    }
}
camickr