I am building an app in Java that uses Jena for semantic information scraping. I am looking for an RDFa parser that correctly extracts all the RDFa statements on a page. Specifically, it should report the namespaces used and, presuming the RDFa markup on the page is correct, produce correct triples that distinguish between object and data properties.

I went through all the Java RDFa parsers listed at http://rdfa.info/wiki/Consume. They all struggle to extract any RDFa statements. Those that do not crash produce data of little use, because it is incorrectly processed and generally mixed up; the Jena RDFa parser shows plenty of errors and then dies a terrible death. I am a newbie in this area, so please be gentle :)

I was also thinking of using a library written in a different language, but then I don't really know how to plug it into Java code. Any suggestions?

+1  A: 

Most RDFa parsers struggle with invalid HTML. The any23 library includes an RDFa parser that can deal with invalid HTML. It parses any RDFa into full RDF, including namespace mappings and so on, and is under active development.

cygri
Thanks for this. I have used the any23 library in my app; it seemed the best choice among the current Java-based offerings.
lennyks
A: 

Use java-rdfa. It supports Jena and uses the Validator.nu HTML5 parser, which parses HTML the way a browser does (i.e. it repairs broken markup).
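
Whichever parser ends up producing the triples, the object/data property distinction asked about in the question can be approximated afterwards by looking at the term in object position: an IRI or blank node suggests an object property, a literal suggests a data property. Below is a minimal sketch of that check using only the Java standard library. It assumes the parser output has been serialized as N-Triples, one statement per line; the class name `TripleKinds`, the `Kind` enum, and the example.org data are made up for illustration and are not part of any23, java-rdfa, or Jena. Strictly speaking, "object property" and "data property" are OWL terms that describe the predicate, so treat this as a heuristic rather than a schema-aware answer.

```java
import java.util.List;

class TripleKinds {
    /** Hypothetical labels for the two cases the question cares about. */
    enum Kind { OBJECT_PROPERTY, DATA_PROPERTY }

    /** Classifies one N-Triples statement by the kind of term in object position. */
    static Kind classify(String ntriplesLine) {
        // Subject and predicate contain no whitespace in N-Triples, so a split
        // with limit 3 leaves the object (plus the trailing " .") in parts[2].
        String[] parts = ntriplesLine.trim().split("\\s+", 3);
        if (parts.length < 3) {
            throw new IllegalArgumentException("Not a triple: " + ntriplesLine);
        }
        String object = parts[2];
        if (object.startsWith("\"")) {
            return Kind.DATA_PROPERTY;      // literal, e.g. "Alice" or "42"^^<...>
        }
        if (object.startsWith("<") || object.startsWith("_:")) {
            return Kind.OBJECT_PROPERTY;    // IRI or blank node
        }
        throw new IllegalArgumentException("Unrecognized object term: " + object);
    }

    public static void main(String[] args) {
        // Made-up sample statements, standing in for real parser output.
        List<String> lines = List.of(
            "<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/bob> .",
            "<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> \"Alice Smith\" .",
            "_:b0 <http://xmlns.com/foaf/0.1/knows> _:b1 ."
        );
        for (String line : lines) {
            System.out.println(classify(line) + "  " + line);
        }
    }
}
```

If the triples are already in a Jena `Model`, the same test is available directly on each statement's object node, so the string handling above is only needed when working from serialized output.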