I am looking for a simple, lightweight Java library that parses HTML. I have looked around a lot and there are many options out there, but I cannot find anything simple. I would really like something like pyquery in Python, except in Java. My requirements are: fast, easy to use, and lightweight.

What do I need it for? Not sure if this matters, but I need to index parts of HTML documents, so I am hoping to be able to select part of a document quickly and then parse it.

+1  A: 

Try Groovy. It has a number of "slurpers," which are DSLs for reading in markup like XML and HTML, as well as JSON. Here, for example.

Anatoly G
why does this get voted down?
Ben
and it's pretty lame to vote down without commenting why
Ben
Groovy is a language that runs on the JVM and is essentially VERY easy to include in your project. I do not see a reason why this was voted down.
Anatoly G
I voted up for your comment, Ben, but I guess that the guy voted down because groovy is not really java?
djondal
What's "really" Java, anyway? It runs on the JVM, is written in Java, and is easy to run.
Anatoly G
@djondal.. thanks. yeah, maybe but it still applies. Either way, really lame to vote down without providing feedback
Ben
maybe they should change SO so we can't vote down without comments
Ben
I have highly GUI-sensitive data that needs to be parsed and then passed to Solr. I was hoping for a direct Java solution. I'll try it though.
Amir Raminfar
A: 

If you want a jQuery-like library, as the title of your question suggests, you should have a look at GWT.

GWT will let you explore the DOM of your page. Try the tutorial and in 30 minutes you will be sure if you want to know more or not...

djondal
This seems really heavy for just parsing text :(
Amir Raminfar
You will be sure if you want to know more or not... It's "or not" then ha ha. Thanks for down voting me, always a pleasure!
djondal
+1  A: 

Use TagSoup to normalize the HTML into XHTML, and XOM to parse the resulting document. It's not that hard.

XPath will give you easy selection similar to CSS selectors.
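
As a rough illustration of the XPath step only, here is a minimal sketch that uses the JDK's built-in DOM parser and XPath engine instead of XOM, and assumes the HTML has already been normalized into well-formed markup. The class name `XhtmlXPathDemo`, the method `selectHrefs`, and the sample markup are made up for this example. The sample also leaves out the XHTML namespace for simplicity; real normalized output may carry it, in which case the XPath would need a namespace prefix mapping.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class XhtmlXPathDemo {

    // Hypothetical, already well-formed markup standing in for normalized HTML.
    static final String SAMPLE =
          "<html><body>"
        + "<h3 class=\"r\"><a href=\"http://example.com/one\">One</a></h3>"
        + "<h3 class=\"r\"><span><a href=\"http://example.com/nested\">Nested</a></span></h3>"
        + "<h3><a href=\"http://example.com/other\">Other</a></h3>"
        + "</body></html>";

    // Returns the href attribute of every element matched by the XPath expression.
    static List<String> selectHrefs(String xhtml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                hrefs.add(((Element) nodes.item(i)).getAttribute("href"));
            }
            return hrefs;
        } catch (ParserConfigurationException | SAXException
                 | IOException | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not parse or query the document", e);
        }
    }

    public static void main(String[] args) {
        // Roughly the XPath counterpart of the CSS selector "h3.r > a".
        // Note: @class='r' is an exact match, while CSS ".r" matches any class token.
        System.out.println(selectHrefs(SAMPLE, "//h3[@class='r']/a"));
        // prints [http://example.com/one]
    }
}
```

Only the first link is returned: the second `a` is nested inside a `span`, so it is not a direct child of the `h3`, and the third `h3` has no `class` attribute.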

Stefan Kendall
I found jsoup at http://jsoup.org/. Is this similar to TagSoup?
Amir Raminfar
Looks similar. TagSoup has a Java library you can download, too, but it doesn't give you any querying or finding abilities; it just produces good XML that you can then parse.
Stefan Kendall
+1  A: 

I have used HTMLParser in the past, and I wasn't very happy with it. I found TagSoup and jsoup, and I really like jsoup. I haven't used it extensively yet, but you can do something like:

Elements resultLinks = doc.select("h3.r > a"); // a elements that are direct children of h3.r
Amir Raminfar
After using jsoup, I think it's exactly what I was looking for. I don't understand why it didn't turn up after so much googling. It has almost every feature that I need.
Amir Raminfar