I would like to be able to parse XML that isn't necessarily well-formed. I'd be looking for a fuzzy rather than a strict parser, able to recover from badly nested tags, for example. I could write my own but it's worth asking here first.

Update:

What I'm trying to do is extract links and other info from HTML. In the case of well-formed XML I can use the Scala XML API. In the case of ill-formed XML, it would be nice to somehow convert it into correct XML and deal with it the same way; otherwise I'd have to have two completely different sets of functions for dealing with documents.

Obviously, because the input is not well-formed and I'm trying to create a well-formed tree, some heuristic would have to be involved (for example, when you see <parent><child></parent> you would close the <child> first, and when you then see a </child> you would ignore it). But of course this isn't a proper grammar, so there's no correct way of doing it.
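To make that heuristic concrete, here is a minimal sketch in plain Scala. It is only an illustration under several assumptions: the name `TagRepair.repair` is made up, tags are recognised with a simple regular expression (so attribute values containing `>` are not handled), and text and comments are copied through unchanged without any escaping.

```scala
object TagRepair {
  // Group 1 is "/" for a closing tag, group 2 is the tag name,
  // group 3 is "/" for a self-closing tag such as <br/>.
  private val Tag = """<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>""".r

  def repair(input: String): String = {
    val out = new StringBuilder
    var open = List.empty[String] // currently open elements, innermost first
    var pos = 0
    for (m <- Tag.findAllMatchIn(input)) {
      out.append(input.substring(pos, m.start)) // text between tags, copied as-is
      pos = m.end
      val name = m.group(2)
      if (m.group(1) == "/") {
        if (open.contains(name)) {
          // Close everything still open inside `name`, then `name` itself.
          val (inner, rest) = open.span(_ != name)
          inner.foreach(n => out.append(s"</$n>"))
          out.append(s"</$name>")
          open = rest.tail
        }
        // Otherwise this is a stray closing tag and is ignored.
      } else {
        out.append(m.matched)
        if (m.group(3) != "/") open = name :: open
      }
    }
    out.append(input.substring(pos)) // trailing text after the last tag
    open.foreach(n => out.append(s"</$n>")) // close whatever is left open
    out.toString
  }
}
```

With these rules, `TagRepair.repair("<parent><child></parent></child>")` gives `<parent><child></child></parent>`. Other rules would be just as defensible, which is the point about there being no correct answer.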

+1  A: 

Try Tag Soup.

JTidy does something similar but only for HTML.

Wim Coenen
+1  A: 

Try the parser on the XHtml object. It is much more lenient than the one on XML.

Daniel
+5  A: 

What you're looking for would not be an XML parser. XML is very strict about nesting, closing, etc. One of the other answers suggests Tag Soup. This is a good suggestion, though technically it is much closer to a lexer than a parser. If all you want from XML-ish content is an event stream without any validation, then it's almost trivial to roll your own solution. Just loop through the input, consuming content which matches regular expressions along the way (this is exactly what Tag Soup does).
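As a rough illustration of the "loop and match regular expressions" idea, here is a small hand-rolled lexer in plain Scala. It is a sketch only, not Tag Soup's actual code: the names `SoupLexer`, `Open`, `Close` and `Text` are invented for this example, attributes are discarded, and a self-closing tag such as `<br/>` is simply reported as `Open`.

```scala
object SoupLexer {
  sealed trait Event
  final case class Open(name: String) extends Event
  final case class Close(name: String) extends Event
  final case class Text(content: String) extends Event

  // Three alternatives: a tag (groups 1 and 2), a run of text, or a lone "<".
  private val Token = """<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|[^<]+|<""".r

  private def toEvent(m: scala.util.matching.Regex.Match): Event =
    if (m.group(2) == null) Text(m.matched) // not a tag: plain text
    else if (m.group(1) == "/") Close(m.group(2))
    else Open(m.group(2))

  // No nesting checks at all: events are reported in the order they appear.
  def events(input: String): List[Event] =
    Token.findAllMatchIn(input).map(toEvent).toList
}
```

For example, `SoupLexer.events("<parent><child>text</parent></child>")` yields `Open(parent)`, `Open(child)`, `Text(text)`, `Close(parent)`, `Close(child)`: the badly nested input passes straight through, because nothing here tracks which elements are open.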

The problem is that a lexer is not going to be able to give you many of the features you want from a parser (e.g. production of a tree-based representation of the input). You have to implement that logic yourself because there is no way that such a "lenient" parser would be able to determine how to handle cases like the following:

<parent>
    <child>
    </parent>
</child>

Think about it: what sort of tree would you expect to get out of this? There's really no sane answer to that question, which is precisely why a parser isn't going to be of much help.

Now, that's not to say that you couldn't use Tag Soup (or your own hand-written lexer) to produce some sort of tree structure based on this input, but the implementation would be very fragile. With tree-oriented formats like XML, you really have no choice but to be strict, otherwise it becomes nearly impossible to get a reasonable result (this is part of why browsers have such a hard time with compatibility).

Daniel Spiewak
True, the closest match for my problem is the kind of output this gives. I do have an idea about what kind of rules I would use to produce an XML tree (I was hoping to use the XML API for queries) but of course this wouldn't be remotely 'correct'. I can just do it the more pragmatic way.
Joe
+1  A: 

I mostly agree with Daniel Spiewak's answer. This is just another way to create "your own parser".

While I don't know of any Scala-specific solution, you can try using Woodstox, a Java library that implements the StAX API. (Being an event-based API, I am assuming it will be more fault tolerant than a DOM parser.)
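One way to check how far an event-based reader gets is to collect events until it gives up. The sketch below uses the StAX implementation bundled with the JDK rather than Woodstox itself (an assumption made so the example needs no extra dependency), and `StaxProbe.startTags` is a made-up helper name.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants, XMLStreamException}
import scala.collection.mutable.ListBuffer

object StaxProbe {
  /** Returns the start-tag names seen before parsing stopped, plus the error message, if any. */
  def startTags(xml: String): (List[String], Option[String]) = {
    val reader = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml))
    val names = ListBuffer.empty[String]
    try {
      while (reader.hasNext()) {
        if (reader.next() == XMLStreamConstants.START_ELEMENT) names += reader.getLocalName
      }
      (names.toList, None)
    } catch {
      // A well-formedness error ends the stream, but earlier events are kept.
      case e: XMLStreamException => (names.toList, Some(e.getMessage))
    } finally reader.close()
  }
}
```

The events read before the first error remain usable, but the reader does not recover and continue past it, so this is only fault tolerant in a limited sense.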

There is also a Scala wrapper around Woodstox called Frostbridge, developed by the same guy who made the Simple Build Tool for Scala.

I had mixed opinions about Frostbridge when I tried it, but perhaps it is more suitable for your purposes.

HRJ
+1  A: 

I agree with the answers that turning invalid XML into "correct" XML is impossible.

Why don't you just do a regular text search for the hrefs if that's all you're interested in? One issue would be commented out links, but if the XML is invalid, it might not be possible to tell what is intended to be commented out!
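A minimal sketch of such a text search in plain Scala follows. The object and method names (`HrefScan`, `hrefs`, `anchorHrefs`) are invented for illustration, and the regular expressions are deliberately naive: they ignore comments and scripts, and will match anything that looks like an `href` attribute.

```scala
object HrefScan {
  // href = "double quoted" | 'single quoted' | unquoted, case-insensitive.
  private val Href = """(?i)href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))""".r

  // An opening <a ...> tag; the \s keeps tags such as <abbr> out.
  private val Anchor = """(?i)<a\s[^>]*>""".r

  /** Every href value found anywhere in the text, in order of appearance. */
  def hrefs(text: String): List[String] =
    Href.findAllMatchIn(text).map { m =>
      // Exactly one of the three groups took part in the match.
      (1 to 3).map(i => m.group(i)).find(_ != null).getOrElse("")
    }.toList

  /** Only href values that occur inside an opening <a ...> tag. */
  def anchorHrefs(text: String): List[String] =
    Anchor.findAllIn(text).toList.flatMap(tag => hrefs(tag))
}
```

`hrefs` picks up every `href`, including those on `link` elements, while `anchorHrefs` first narrows the search to opening `a` tags. Neither needs the document to be well-formed.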

Adrian Mouat
The reason I wanted to do this was to use the Scala XML API on those well-formed documents that I find, and attempt to fix broken ones first. I suppose I'll just treat it as a string.
Joe
One reason you might not want to do a text search is if you only want to extract links from `a` tags and not, for example, `link` tags or `DOCTYPE` declarations.
Ben James