Hi,
I have a collection of HTML documents for which I need to parse the contents of the <meta> tags in the <head> section. These are the only HTML tags whose values I'm interested in, i.e. I don't need to parse anything in the <body> section.
I've attempted to parse these values using the XPath support provided by JDOM. However, this isn't working well because much of the HTML in the <body> section is not valid XML, so the documents can't be parsed as XML in the first place.
Does anyone have any suggestions for how I might go about parsing these tag values in a manner that can deal with malformed HTML? Would a regex be a reasonable approach?
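To make the regex idea concrete, here is a rough sketch in Java of what I have in mind. The class and method names (MetaTagSketch, extractMeta) are just made up for illustration. It assumes the <meta> tags themselves are reasonably well formed even when the <body> is not, that each tag carries a name, http-equiv or property attribute plus a content attribute, and that no attribute value contains a literal ">". It only looks at the text before </head> (or before <body> if that comes first), so malformed markup further down is never examined.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JDOM or any other library.
final class MetaTagSketch {
    // One <meta ...> tag, matched case-insensitively; group 1 is the attribute text.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b([^>]*)>", Pattern.CASE_INSENSITIVE);

    // One attribute: name="value", name='value' or name=value.
    private static final Pattern ATTRIBUTE =
            Pattern.compile("([\\w:-]+)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))");

    // Where the head is assumed to stop: </head> or the start of <body>.
    private static final Pattern HEAD_END =
            Pattern.compile("</head\\s*>|<body\\b", Pattern.CASE_INSENSITIVE);

    // Returns the meta key (name, http-equiv or property) mapped to its content value.
    static Map<String, String> extractMeta(String html) {
        // Cut the document off at the end of the head so the body is ignored.
        Matcher end = HEAD_END.matcher(html);
        String head = end.find() ? html.substring(0, end.start()) : html;

        Map<String, String> result = new LinkedHashMap<>();
        Matcher tag = META_TAG.matcher(head);
        while (tag.find()) {
            String key = null;
            String content = null;
            Matcher attr = ATTRIBUTE.matcher(tag.group(1));
            while (attr.find()) {
                String name = attr.group(1).toLowerCase(Locale.ROOT);
                // Exactly one of groups 2-4 is set, depending on the quoting style.
                String value = attr.group(2) != null ? attr.group(2)
                        : attr.group(3) != null ? attr.group(3)
                        : attr.group(4);
                if (name.equals("name") || name.equals("http-equiv") || name.equals("property")) {
                    key = value;
                } else if (name.equals("content")) {
                    content = value;
                }
            }
            // Skip tags such as <meta charset=...> that have no key/content pair.
            if (key != null && content != null) {
                result.put(key, content);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<html><head><META NAME=\"author\" CONTENT='Don'>"
                + "<meta name=keywords content=\"a, b\" /></head>"
                + "<body><p>unclosed <br></body>";
        System.out.println(extractMeta(html));
    }
}
```

I realise this would miss or mangle tags that are commented out, and it does not decode entities such as &amp; in the values, so I'd still be glad to hear of a more robust alternative.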
Cheers, Don