parse meta tags in Java

views:

672

+1 Q:

Hi,

I have a collection of HTML documents for which I need to parse the contents of the <meta> tags in the <head> section. These are the only HTML tags whose values I'm interested in, i.e. I don't need to parse anything in the <body> section.

I've tried to extract these values using the XPath support provided by JDOM. However, this fails because a lot of the HTML in the <body> section is not well-formed XML.

Does anyone have any suggestions for how I might parse these tag values in a manner that can deal with malformed HTML? Perhaps a regex?

Cheers, Don

+2 A:

If it suits your application, you can use Tidy to convert the HTML to well-formed XML, and then use as much XPath as you like!

divideandconquer.se 2008-11-18 16:52:53

JTidy should provide a good starting point for this.

James Van Huis 2008-11-18 16:54:51

+5 A:

You can likely use the Jericho HTML Parser. In particular, have a look at this to see how you can go about finding specific tags.

bdumitriu 2008-11-18 16:56:05
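
For completeness, the regex idea raised in the question can be sketched with only the standard library. This is a rough illustration, not a replacement for a real HTML parser: the class name MetaTagExtractor and the method extract are made up for this example, and it assumes that no attribute value contains a ">" character and that only <meta> tags carrying both a name (or http-equiv) and a content attribute are of interest. It stops scanning at </head>, so malformed markup in <body> is never looked at.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of JDOM, JTidy or Jericho.
public class MetaTagExtractor {

    // Finds the start of the closing </head> tag, ignoring case.
    private static final Pattern HEAD_END =
            Pattern.compile("</head\\b", Pattern.CASE_INSENSITIVE);

    // Matches one <meta ...> tag; group 1 is the raw attribute text.
    // Assumption: attribute values never contain '>'.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b([^>]*)>", Pattern.CASE_INSENSITIVE);

    // Matches attr="value", attr='value' or attr=value inside a tag.
    private static final Pattern ATTRIBUTE = Pattern.compile(
            "([\\w:-]+)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))");

    /** Returns name (lower-cased) -> content for each meta tag before </head>. */
    public static Map<String, String> extract(String html) {
        // Cut the input at </head> so broken <body> markup is ignored.
        Matcher headEnd = HEAD_END.matcher(html);
        String head = headEnd.find() ? html.substring(0, headEnd.start()) : html;

        Map<String, String> result = new LinkedHashMap<>();
        Matcher tag = META_TAG.matcher(head);
        while (tag.find()) {
            String key = null;
            String content = null;
            Matcher attr = ATTRIBUTE.matcher(tag.group(1));
            while (attr.find()) {
                String attrName = attr.group(1).toLowerCase(Locale.ROOT);
                // Exactly one of groups 2-4 is set, depending on the quoting style.
                String value = attr.group(2) != null ? attr.group(2)
                        : attr.group(3) != null ? attr.group(3) : attr.group(4);
                if (attrName.equals("name") || attrName.equals("http-equiv")) {
                    key = value;
                } else if (attrName.equals("content")) {
                    content = value;
                }
            }
            // Skip tags such as <meta charset=...> that lack a name/content pair.
            if (key != null && content != null) {
                result.put(key.toLowerCase(Locale.ROOT), content);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<html><head><META NAME=\"Description\" content='A test page'>"
                + "<meta name=keywords content=\"java, html\" /></head>"
                + "<body><p>unclosed <b>markup</body></html>";
        extract(html).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Anything this sketch misses (comments containing meta-like text, odd quoting, values with ">" in them) is where a tolerant parser such as JTidy or Jericho, as suggested above, earns its keep.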
