views: 138

answers: 2
I need to process quite a bit of fairly arbitrary HTML data. Thankfully, the data can be broken into about twelve different templates. My current plan is to build a filter for each of the templates that lets me extract the required data sans irrelevant content. The problem is I'm not sure what the ideal tool for the job is.

I was hoping someone could recommend a good library for working with/extracting elements from arbitrary HTML data. Good in this case means a robust parser, ideally FOSS. In the past I've done everything from writing my own parser and using regular expressions* to using various parsing libraries like Python's ElementTree and BeautifulSoup. Ideally you'll be suggesting something from experience with a number of technologies, not just 'the one library I use'.

I'm going to be doing this on a Linux host and I don't have any real concern with what language I use.

(*) Yeah, everyone knows the saying "using regular expressions to parse html is bad". It's pointless to bring it up again.
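To make the per-template filter plan concrete, here is a minimal sketch in Python using only the stdlib `html.parser` module (no third-party dependency). The template markup, the `PriceFilter` name, and the `td.price` convention are all hypothetical, just stand-ins for whatever one of the twelve templates actually looks like:

```python
from html.parser import HTMLParser

# Hypothetical filter for one template: collect the text of every
# <td class="price"> cell and ignore everything else in the page.
class PriceFilter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.capturing = False
        self.results = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "td" and ("class", "price") in attrs:
            self.capturing = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.results.append(data.strip())

f = PriceFilter()
f.feed('<table><tr><td class="price">9.99</td><td>other</td></tr></table>')
print(f.results)  # ['9.99']
```

A real template filter would likely track more state (nesting depth, multiple fields per record), but the shape is the same: one small handler class per template, each emitting just the data you care about.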

A: 

I've had plenty of success with Hpricot.

http://hpricot.com/

emh
+1  A: 

QueryPath - www.querypath.org

You access elements via CSS selectors, just like in jQuery.

You can also use it as a template engine, among other things.
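QueryPath itself is PHP, but the same CSS-selector style of extraction is available in the asker's earlier Python toolchain via BeautifulSoup's `select()`. A small sketch for comparison, assuming the third-party `beautifulsoup4` package is installed and using made-up sample markup:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

html = '<ul><li class="name">Alice</li><li class="name">Bob</li><li>skip</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector string, much like jQuery's $(...)
names = [li.get_text() for li in soup.select("li.name")]
print(names)  # ['Alice', 'Bob']
```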

toninoj