I am preparing some custom performance tests against a legacy application that emits malformed HTML (missing tags, duplicated quotes, missing quotes, the works). The output can't be changed right now, for all the usual reasons.
I am looking for a lenient HTML parsing library, similar to BeautifulSoup or the HTML Agility Pack, that can be called from C or Java on a UNIX host.
We'll build some test scaffolding and then start redesigning and reimplementing the application, but I need baseline measurements first.