views: 138

answers: 2
I need to process quite a bit of fairly arbitrary HTML data. Thankfully, the data can be broken into about twelve different templates. My current plan is to build a filter for each of the templates that lets me extract the required data sans irrelevant content. The problem is I'm not sure what the ideal tool for the job is.

I was hoping someone could recommend a good library for working with/extracting elements from arbitrary HTML data. Good in this case means a robust parser, ideally FOSS. In the past I've done everything from writing my own parser and using regular expressions* to using various parsing libraries like Python's ElementTree and BeautifulSoup. Ideally you'll be suggesting something from experience with a number of technologies, not just 'the one library I use'.

I'm going to be doing this on a Linux host and I don't have any real concern with what language I use.

(*) Yeah, everyone knows the saying "using regular expressions to parse html is bad". It's pointless to bring it up again.
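To make the per-template filter plan concrete, here is a minimal sketch in Python using only the stdlib `html.parser` module (no third-party dependency). The template markup, the `PriceFilter` name, and the `td.price` convention are all hypothetical, just stand-ins for whatever one of the twelve templates actually looks like:

```python
from html.parser import HTMLParser

# Hypothetical filter for one template: collect the text of every
# <td class="price"> cell and ignore everything else in the page.
class PriceFilter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.capturing = False
        self.results = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "td" and ("class", "price") in attrs:
            self.capturing = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.results.append(data.strip())

f = PriceFilter()
f.feed('<table><tr><td class="price">9.99</td><td>other</td></tr></table>')
print(f.results)  # ['9.99']
```

A real template filter would likely track more state (nesting depth, multiple fields per record), but the shape is the same: one small handler class per template, each emitting just the data you care about.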

A: 

I've had plenty of success with Hpricot.

http://hpricot.com/

emh
+1  A: 

QueryPath - www.querypath.org

You access elements via CSS selectors, just like in jQuery.

You can also use it as a template engine, among other things.
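QueryPath itself is PHP, but the same CSS-selector style of extraction is available in the asker's earlier Python toolchain via BeautifulSoup's `select()`. A small sketch for comparison, assuming the third-party `beautifulsoup4` package is installed and using made-up sample markup:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

html = '<ul><li class="name">Alice</li><li class="name">Bob</li><li>skip</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector string, much like jQuery's $(...)
names = [li.get_text() for li in soup.select("li.name")]
print(names)  # ['Alice', 'Bob']
```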

toninoj