ansaurus

Question

Java HTML Parsing

Answer 1

+1 A:

If your HTML is well-formed, you can easily employ an XML parser to do the job for you... If you're only reading, SAX would be ideal.

Yuval =8-)

Yuval 2008-10-26 14:01:36

Answer 2

+6 A:

You might be interested by TagSoup, a Java HTML parser able to handle malformed HTML. XML parsers would work only on well formed XHTML.

PhiLho 2008-10-26 14:16:42

Answer 3

+2 A:

The HTMLParser project (http://htmlparser.sourceforge.net/) might be a possibility. It seems to be pretty decent at handling malformed HTML. The following snippet should do what you need:

Parser parser = new Parser(htmlInput);
CssSelectorNodeFilter cssFilter = 
    new CssSelectorNodeFilter("DIV.targetClassName");
NodeList nodes = parser.parse(cssFilter);

dave 2008-10-26 14:23:12

Answer 4

+7 A:

The main problem as stated by preceding coments is malformed HTML, so an html cleaner or HTML-XML converter is a must. Once you get the XML code (XHTML) there are plenty of tools to handle it. You could get it with a simple SAX handler that extracts only the data you need or any tree-based method (DOM, JDOM, etc.) that let you even modify original code.

Here is a sample code that uses HTML cleaner to get all DIVs that use a certain class and print out all Text content inside it.

import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

/**
 * @author Fernando Miguélez Palomo <fernandoDOTmiguelezATgmailDOTcom>
 */
public class TestHtmlParse
{
    static final String className = "tags";
    static final String url = "http://www.stackoverflow.com";

    TagNode rootNode;

    public TestHtmlParse(URL htmlPage) throws IOException
    {
        HtmlCleaner cleaner = new HtmlCleaner();
        rootNode = cleaner.clean(htmlPage);
    }

    List getDivsByClass(String CSSClassname)
    {
        List divList = new ArrayList();

        TagNode divElements[] = rootNode.getElementsByName("div", true);
        for (int i = 0; divElements != null && i < divElements.length; i++)
        {
            String classType = divElements[i].getAttributeByName("class");
            if (classType != null && classType.equals(CSSClassname))
            {
                divList.add(divElements[i]);
            }
        }

        return divList;
    }

    public static void main(String[] args)
    {
        try
        {
            TestHtmlParse thp = new TestHtmlParse(new URL(url));

            List divs = thp.getDivsByClass(className);
            System.out.println("*** Text of DIVs with class '"+className+"' at '"+url+"' ***");
            for (Iterator iterator = divs.iterator(); iterator.hasNext();)
            {
                TagNode divElement = (TagNode) iterator.next();
                System.out.println("Text child nodes of DIV: " + divElement.getText().toString());
            }
        }
        catch(Exception e)
        {
            e.printStackTrace();
        }
    }
}

Fernando Miguélez 2008-10-26 14:55:57

Answer 5

+3 A:

Several years ago I used JTidy for the same purpose:

http://jtidy.sourceforge.net/

"JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.

JTidy was written by Andy Quick, who later stepped down from the maintainer position. Now JTidy is maintained by a group of volunteers.

More information on JTidy can be found on the JTidy SourceForge project page ."

2008-10-26 16:06:03

Got it - works a charm!

Richie_W 2008-10-26 20:53:05

Answer 6

+1 A:

HTMLUnit might be of help. It does a lot more stuff too.

http://htmlunit.sourceforge.net/[1]

alex 2008-10-26 19:16:21

Answer 7

A:

Is there any tutorial or sample programs for HTML Parsing using HTMLParser ?.. Please help me to... I googled.. but didn't find any .. Thanks in advance..

TAM 2009-08-20 08:48:18

Answer 8

A:

Check this url http://regxjava.blogspot.com/2010/04/partial-html-parser.html

Raj 2010-04-24 18:20:07

ansaurus

tags:

views:

answers:

Java HTML Parsing

related questions