I'm looking for a Clojure/Java equivalent of Python's lxml library.

I've used it a ton in the past for parsing all sorts of HTML (as a replacement for BeautifulSoup), and it's great to be able to use the same ElementTree API for XML as well -- really a trusted friend! Can anyone recommend a similar Java/Clojure library?

About lxml

lxml is an XML and HTML processing library built on libxml2. It handles broken HTML pages very well, so it is excellent for screen-scraping tasks. It also implements the ElementTree API, so the XML/HTML structure is represented as a tree object with full support for XPath and CSS selectors, among other things.

It also has some really handy utility functions, such as the "cleaner" module, which strips unwanted tags out of the "soup" (i.e. script tags, style tags, etc.).

So it is simple to use, robust, and VERY fast!

+3  A: 

For Java (and thus usable from Clojure) there is the TagSoup library, which, like lxml, is a tolerant parser for faulty SGML variants such as real-world HTML.

Clojure bundles the clojure.xml namespace, but it only works with well-formed XML, so it will reject the broken markup that most web pages contain.
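
To make the limitation concrete, here is a minimal Java sketch of what a strict XML parser does with sloppy markup. It uses only the JDK's built-in javax.xml.parsers API, which, as far as I know, is also what clojure.xml delegates to by default. The class name StrictXmlCheck and the method isWellFormed are made up for this illustration and are not part of any library.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: shows that the JDK's strict XML parser
// refuses markup that is not well-formed.
class StrictXmlCheck {

    // Returns true if the markup parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Raised for unclosed or mismatched tags and similar problems.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Properly nested tags: accepted.
        System.out.println(isWellFormed("<p>Hello <b>world</b></p>")); // true
        // The <b> tag is never closed: rejected.
        System.out.println(isWellFormed("<p>Hello <b>world</p>"));     // false
    }
}
```

A tolerant parser would instead repair the second input (for example by closing the b element for you) and hand back a usable tree, which is why something like TagSoup is the better fit for scraping.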

pmf
+3  A: 

Enlive: http://github.com/cgrand/enlive

I've used it for screen scraping and it works quite well for that. It uses a CSS-selector-like syntax for getting at elements in the document.

dnolen