ansaurus

Question

How to use HTML Parser to get complete information about all tags in the HTML page

Answer 1

A:

You seem to be using Swing's HTMLDocument, which is probably not the best choice for this. I believe you would get better results with a dedicated parser such as NekoHTML.

Riduidel 2010-02-18 16:12:49

Answer 2

A:

Another simple library you can use is JTidy, which can clean up your HTML before you parse it. Hope this helps.

http://sourceforge.net/projects/jtidy/

Ciao!

gicappa 2010-02-18 16:33:59

Answer 3

A:

As per the comments:

Actually, I want to extract information such as the product name, price, etc. of all products listed on an online shopping site such as amazon.com. How should I go about it?

Step 1: read their robots file. It's usually found at the root of the site, for example http://amazon.com/robots.txt. If the URL you're trying to access is covered by a Disallow under a User-agent of *, then stop here. Contact them, explain in detail what you're trying to do, and ask for ways/alternatives/webservices which can provide the information you need. Otherwise you may be violating the law, and you risk getting blacklisted by the site and/or by your ISP, or worse. If the URL is not disallowed, proceed to step 2.
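The check in step 1 can be sketched in plain Java. This is only a rough illustration, assuming you have already downloaded the robots.txt content as a string: the class and method names are made up, and the sketch deliberately ignores Allow rules and the * and $ wildcards that some sites use inside paths, so treat a "false" result as a hint and not as permission.

```java
import java.util.Locale;

// Hypothetical helper, not part of any library.
class RobotsCheckSketch {

    /**
     * Returns true if a "User-agent: *" group contains a Disallow rule
     * whose path is a prefix of the given URL path.
     * Simplification: Allow rules and wildcards in paths are ignored.
     */
    public static boolean isDisallowedForAll(String robotsTxt, String path) {
        boolean inStarGroup = false;
        boolean lastLineWasUserAgent = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            // Drop comments, then surrounding whitespace.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Consecutive User-agent lines belong to the same group.
                if (!lastLineWasUserAgent) {
                    inStarGroup = false;
                }
                if (value.equals("*")) {
                    inStarGroup = true;
                }
                lastLineWasUserAgent = true;
            } else {
                lastLineWasUserAgent = false;
                if (inStarGroup && field.equals("disallow")
                        && !value.isEmpty() && path.startsWith(value)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up robots.txt content for illustration.
        String robots = "User-agent: *\nDisallow: /private/\n\n"
                + "User-agent: ExampleBot\nDisallow: /\n";
        System.out.println(isDisallowedForAll(robots, "/private/orders")); // true
        System.out.println(isDisallowedForAll(robots, "/products/123"));   // false
    }
}
```

Passing the file content in as a string keeps the sketch independent of how you fetch it.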

Step 2: check whether the site in question already has a public webservice available, which is much easier to use than parsing a whole HTML page. Using a webservice, you'll get exactly the information you're looking for in a concise format (JSON or XML) based on a simple set of parameters. Look around or contact them for details about any webservices. If there's no way, proceed to step 3.

Step 3: learn how HTML/CSS/JS work, learn how to work with web developer tools like Firebug, and learn how to interpret the HTML/CSS/JS source you see via right-click > View Page Source. My bet is that the site in question uses JS/Ajax to load/populate the information you'd like to gather. In that case, you'll need an HTML parser which is capable of parsing and executing JS as well (the one you're using doesn't do that). This isn't going to be an easy job, so I won't explain it in detail until it's entirely clear what you're trying to achieve, whether that is allowed, and whether there aren't easier-to-use webservices available.

BalusC 2010-02-18 16:43:57

Step 1: robots.txt allows it, so that's not a problem.

Step 2: I tried using AWS for this, but it does not give a comprehensive list of all the information I need, even though the information can be seen on the web page. So I actually need to go to step 3.

Step 3: Now the problem is that I need to extract the product name, price and features. This can be done if I manually identify the pattern in which this information is stored on the web page. But I want a way to automate this pattern finding, or to extract the information without any pattern being supplied to the program. How should I go about it? Thanks.

Heman 2010-02-19 07:34:51

Answer 4

+1 A:

I am doing this fairly reliably with HTML Parser (provided that the HTML document does not change its structure). A web service with a stable API is much better, but sometimes we just do not have one.

General idea:

You first have to know which tags (div, meta, span, etc.) contain the information you want, and which attributes identify those tags. Example:

 <span class="price"> $7.95</span>

If you are looking for this price, then you are interested in span tags with class "price".

HTML Parser has filter-by-attribute functionality:

HasAttributeFilter filter = new HasAttributeFilter("class", "price");

When you parse using a filter, you get a list of Nodes. You can use instanceof on each node to check whether it is of the type you are interested in; for span you'd do something like:

if (node instanceof Span) // or any other supported element.

See the list of supported tags (the tag classes in the org.htmlparser.tags package).
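The select-by-attribute idea is not specific to HTML Parser. As a dependency-free sketch, assuming the markup is already well-formed XML (for example after a cleanup pass with a tool such as JTidy), the JDK's own XPath API can pick out the same span elements. The class name, method name and sample markup below are made up for illustration; real-world pages are rarely well-formed, which is why a tolerant HTML parser is normally the better tool.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of HTML Parser.
class PriceXPathSketch {

    /** Returns the trimmed text of every span whose class attribute is exactly "price". */
    public static List<String> findPrices(String wellFormedMarkup) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // Same idea as filtering on attribute "class" with value "price".
            NodeList spans = (NodeList) xpath.evaluate(
                    "//span[@class='price']",
                    new InputSource(new StringReader(wellFormedMarkup)),
                    XPathConstants.NODESET);
            List<String> prices = new ArrayList<>();
            for (int i = 0; i < spans.getLength(); i++) {
                prices.add(spans.item(i).getTextContent().trim());
            }
            return prices;
        } catch (XPathExpressionException e) {
            // Thrown when the input is not well-formed XML.
            throw new IllegalArgumentException("Markup could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up, well-formed sample markup.
        String page = "<div><span class=\"name\">Frankenstein</span>"
                + "<span class=\"price\"> $7.95</span></div>";
        System.out.println(findPrices(page)); // prints [$7.95]
    }
}
```

Note that the XPath test matches the class attribute exactly, so an element with class="price sale" would not be selected by this sketch.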

An example that uses HTML Parser to grab the meta tag holding a site's description:

Sample tag:

<meta name="description" content="Amazon.com: frankenstein: Books"/>

Code:

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.MetaTag;

public class HTMLParserTest {
    public static void main(String... args) {
        Parser parser = new Parser();
        // Target tag: <meta name="description" content="Some text about the site." />
        HasAttributeFilter filter = new HasAttributeFilter("name", "description");
        try {
            parser.setResource("http://www.youtube.com");
            NodeList list = parser.parse(filter);
            Node node = list.elementAt(0);

            if (node instanceof MetaTag) {
                MetaTag meta = (MetaTag) node;
                String description = meta.getAttribute("content");

                System.out.println(description);
                // Prints: "YouTube is a place to discover, watch, upload and share videos."
            }

        } catch (ParserException e) {
            e.printStackTrace();
        }
    }

}

Bakkal 2010-07-07 21:56:00
