How do I handle closing tags (e.g., </h1>) with the Java HTML Parser library?

For example, suppose I have the following:

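// MyFilter.java
import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.nodes.TagNode;

public class MyFilter implements NodeFilter {

    // Accept only tag nodes whose raw tag name is "h1".
    public boolean accept(Node node) {
        if (node instanceof TagNode) {
            TagNode theNode = (TagNode) node;
            // Constant-first comparison avoids a NullPointerException
            // if the tag has no name.
            return "h1".equals(theNode.getRawTagName());
        }
        return false;
    }
}

// MyParser.java
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class MyParser {
    // Note: the input parameter is ignored; the sample markup is hard-coded.
    public final String parseString(String input) throws ParserException {
        Parser parser = new Parser();
        MyFilter theFilter = new MyFilter();
        parser.setInputHTML("<h1>Welcome, User</h1>");
        // Collect every node that the filter accepts.
        NodeList theList = parser.parse(theFilter);
        return theList.toHtml();
    }
}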

When I run my parser, I get the following output back:

<h1>Welcome, User</h1>Welcome, User</h1>

The NodeList contains three nodes:

(tagNode) <h1>

(textNode) Welcome, User

(tagNode) </h1>

I would like the output to be "<h1>Welcome, User</h1>". Does anyone see what is wrong in my sample parser?

A: 

HINT:

I think you need to rely on the isEndTag() method of the tag node in that case.

ring bearer

A: 

Your filter is accepting too many nodes. For your sample input, you want a NodeList that contains only a single node: the one for the <h1> tag. The other two nodes are children of that first node, so they should not be added to the NodeList.


Adding the following code may make the problem easier to see.

for (Node node : theList.toNodeArray())
{
    System.out.println(node.toHtml());
}

It should print

<h1>Welcome, User</h1>
Welcome, User
</h1>
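
To illustrate why the extra nodes duplicate the output, here is a minimal sketch built on a toy node type. ToyNode, collectAll, and renderWith are hypothetical names made up for this illustration; they are not part of the HTML Parser library. The sketch only assumes what the printed output above suggests: a parent node's toHtml() already includes its children, and the collection step still visits those children after accepting the parent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class CollectDemo {

    // Toy stand-in for a parsed node; NOT a class from the HTML Parser library.
    static class ToyNode {
        final String kind;     // "tag", "text" or "endTag"
        final String ownHtml;  // markup of this node alone
        final List<ToyNode> children = new ArrayList<>();

        ToyNode(String kind, String ownHtml) {
            this.kind = kind;
            this.ownHtml = ownHtml;
        }

        ToyNode add(ToyNode child) {
            children.add(child);
            return this;
        }

        // Renders this node followed by everything nested inside it.
        String toHtml() {
            StringBuilder sb = new StringBuilder(ownHtml);
            for (ToyNode child : children) {
                sb.append(child.toHtml());
            }
            return sb.toString();
        }
    }

    // Depth-first collection: an accepted node is added, and its children
    // are still visited and tested against the same filter.
    static void collectAll(ToyNode node, Predicate<ToyNode> filter, List<ToyNode> out) {
        if (filter.test(node)) {
            out.add(node);
        }
        for (ToyNode child : node.children) {
            collectAll(child, filter, out);
        }
    }

    // Builds the sample tree, collects with the filter, and concatenates the results.
    static String renderWith(Predicate<ToyNode> filter) {
        ToyNode h1 = new ToyNode("tag", "<h1>")
                .add(new ToyNode("text", "Welcome, User"))
                .add(new ToyNode("endTag", "</h1>"));
        List<ToyNode> collected = new ArrayList<>();
        collectAll(h1, filter, collected);
        StringBuilder sb = new StringBuilder();
        for (ToyNode node : collected) {
            sb.append(node.toHtml());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Accepting every node renders the children twice.
        System.out.println(renderWith(node -> true));
        // prints <h1>Welcome, User</h1>Welcome, User</h1>

        // Accepting only the opening tag renders the element once.
        System.out.println(renderWith(node -> node.kind.equals("tag")));
        // prints <h1>Welcome, User</h1>
    }
}
```

In this toy model, the accept-everything filter reproduces the duplicated string from the question, while a filter that accepts only the opening tag leaves a single node whose own rendering already covers the text and the closing tag.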
Matthew T. Staebler