Hello,

What HTML parsers have the following features:

  • Fast
  • Thread-safe
  • Reliable and bug-free
  • Parses HTML and XML
  • Handles erroneous HTML
  • Has a DOM implementation
  • Supports HTML4, JavaScript, and CSS tags
  • Relatively simple, object-oriented API

Which parser do you think is best?

Thank you.

+6  A: 

The best known are NekoHTML and JTidy.

NekoHTML is based on Xerces and provides a simple, adaptable SAX parser that implements the Java SE XMLReader interface.

JTidy is aimed more at reformatting your HTML into something XML-valid, but it is still very useful as an XML parser and can produce a DOM tree if needed.

You could have a look at this list for other alternatives.

Another choice could be to use Hpricot through JRuby.
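As a rough sketch of the NekoHTML route (assuming the nekohtml and xerces jars are on the classpath; the markup here is a made-up example):

```java
import java.io.StringReader;

import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NekoExample {
    public static void main(String[] args) throws Exception {
        // Deliberately broken HTML: the <title> and <p> are never closed.
        String html = "<html><head><title>Hello<body><p>some text";

        DOMParser parser = new DOMParser();
        parser.parse(new InputSource(new StringReader(html)));

        // The result is a standard W3C DOM document.
        Document doc = parser.getDocument();
        System.out.println(
            doc.getElementsByTagName("TITLE").item(0).getTextContent());
    }
}
```

Note that NekoHTML upper-cases element names by default, hence `"TITLE"` in the lookup; this behavior is configurable.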

Valentin Rocher
Why? which features do they have?
Shayan
added some details.
Valentin Rocher
SAX is not what I want, and JTidy's main purpose is cleaning XML. Are you sure it does what I want better than the others?
Shayan
What do you want exactly, then?
Valentin Rocher
It should be DOM-based, and I want extraction as its main job, not transformation.
Shayan
It doesn't matter what its "main" job is, as long as it does what you want it to do.
Anon.
+1  A: 

Well, there aren't as many good HTML parsers in Java as you might hope, but here are some alternatives: http://java-source.net/open-source/html-parsers

Very few of them support JavaScript. Actually, I think you'll have to handle that part on your own using Rhino (http://www.mozilla.org/rhino/).
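Embedding Rhino to evaluate scripts might look roughly like this (a sketch, assuming Rhino's js.jar is on the classpath; it evaluates a standalone expression, not scripts wired to a page's DOM):

```java
import org.mozilla.javascript.Context;
import org.mozilla.javascript.Scriptable;

public class RhinoExample {
    public static void main(String[] args) {
        // Every thread that runs scripts needs its own Context.
        Context cx = Context.enter();
        try {
            Scriptable scope = cx.initStandardObjects();
            Object result = cx.evaluateString(scope, "1 + 2", "<inline>", 1, null);
            System.out.println(Context.toString(result)); // prints "3"
        } finally {
            Context.exit();
        }
    }
}
```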

Kico Lobo
A: 

I think that HTML Cleaner is what you're looking for. See its announcement on TheServerSide to see how it compares to JTidy, TagSoup, and NekoHTML.
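For reference, extraction with HTML Cleaner is roughly this (a sketch assuming the htmlcleaner jar is on the classpath; the markup is a made-up example):

```java
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class CleanerExample {
    public static void main(String[] args) {
        // Nasty HTML: unclosed paragraphs and an unquoted, unclosed link.
        String html = "<html><body><p>First<p>Second <a href='http://example.com'>a link";

        // clean() repairs the markup and returns the root of a tree.
        HtmlCleaner cleaner = new HtmlCleaner();
        TagNode root = cleaner.clean(html);

        // Extraction, not transformation: walk the tree directly.
        for (TagNode a : root.getElementsByName("a", true)) {
            System.out.println(a.getAttributeByName("href") + " -> " + a.getText());
        }
    }
}
```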

Pascal Thivent
But this one is also about transforming HTML into well-formed XML. My main goal is extracting data from it.
Shayan
@Shayan So what? Doesn't it allow you to extract data? Doesn't it offer DOM manipulation? Doesn't it let you parse nasty HTML? I don't get you.
Pascal Thivent
+1  A: 

You probably want to look at doing something like running Mozilla in headless mode. Here is a link to get you started; I'm sure you can use Google to find more information.

fuzzy lollipop
+7  A: 

Check out Web Harvest. It's both a library you can use and a data extraction tool, which sounds like exactly what you want. You create XML script files that instruct the scraper how to extract the information you need, and from where. The provided GUI is very useful for quickly testing the scripts.

Check out the project's samples page to see if it's a good fit for what you are trying to do.
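A Web Harvest script is plain XML; a minimal sketch (element names follow the project's samples, while the URL and XPath expression here are placeholders) could look like:

```xml
<config charset="UTF-8">
    <!-- Download the page, repair the HTML, then extract with XPath. -->
    <var-def name="title">
        <xpath expression="//title/text()">
            <html-to-xml>
                <http url="http://example.com/"/>
            </html-to-xml>
        </xpath>
    </var-def>
</config>
```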

Cesar
+1 for Web Harvest -- if you are trying to do page scraping it is the way to go.
jckdnk111
+2  A: 

Validator.nu's HTML parser, definitely. It's an implementation of the HTML5 parsing algorithm, and Gecko is in the process of replacing its own HTML parser with a C++ translation of this one.
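Using it through the standard JAXP API is roughly this (a sketch, assuming the Validator.nu htmlparser jar is on the classpath; its HtmlDocumentBuilder extends javax.xml.parsers.DocumentBuilder):

```java
import java.io.StringReader;

import nu.validator.htmlparser.dom.HtmlDocumentBuilder;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class Html5Example {
    public static void main(String[] args) throws Exception {
        // The HTML5 algorithm defines exactly how broken input is repaired.
        String html = "<title>Hello</title><p>unclosed paragraph";

        HtmlDocumentBuilder builder = new HtmlDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(html)));

        System.out.println(
            doc.getElementsByTagName("title").item(0).getTextContent());
    }
}
```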

Ms2ger