ansaurus

Question

Error-tolerant XML parsing in Scala

Answer 1

+1 A:

Try Tag Soup.

JTidy does something similar but only for HTML.

Wim Coenen 2009-10-02 22:07:47

Answer 2

+1 A:

Try the parser on the XHtml object. It is much more lenient than the one on XML.

Daniel 2009-10-02 22:32:01

Answer 3

+5 A:

What you're looking for would not be an XML parser. XML is very strict about nesting, closing, etc. One of the other answers suggests Tag Soup. This is a good suggestion, though technically it is much closer to a lexer than a parser. If all you want from XML-ish content is an event stream without any validation, then it's almost trivial to roll your own solution. Just loop through the input, consuming content which matches regular expressions along the way (this is exactly what Tag Soup does).
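To make the "roll your own" idea concrete, here is a minimal sketch of a regex-driven lexer that turns XML-ish text into a flat event stream. The `Event` types and the `SoupLexer` object are hypothetical names invented for this illustration, and the tag pattern is a simplification: attributes are skipped rather than parsed, a self-closing tag is reported as an opening tag, and comments and CDATA sections are not recognised at all.

```scala
import scala.util.matching.Regex

// Hypothetical event types for this sketch; they do not come from any library.
sealed trait Event
case class Open(name: String) extends Event
case class Close(name: String) extends Event
case class Text(content: String) extends Event

object SoupLexer {
  // Group 1 is "/" for a closing tag and empty otherwise; group 2 is the tag name.
  // Everything up to the next '>' (attributes, a trailing '/') is skipped.
  private val Tag: Regex = """<(/?)([A-Za-z][A-Za-z0-9:_-]*)[^>]*>""".r

  def lex(input: String): List[Event] = {
    val events = List.newBuilder[Event]
    var pos = 0
    for (m <- Tag.findAllMatchIn(input)) {
      // Whatever sits between the previous tag and this one is treated as text.
      val text = input.substring(pos, m.start).trim
      if (text.nonEmpty) events += Text(text)
      events += (if (m.group(1) == "/") Close(m.group(2)) else Open(m.group(2)))
      pos = m.end
    }
    // Trailing text after the last tag, if any.
    val tail = input.substring(pos).trim
    if (tail.nonEmpty) events += Text(tail)
    events.result()
  }
}
```

Note that the lexer performs no nesting checks: it will happily report a `Close` for a tag that was never opened, which is exactly the leniency (and the limitation) described above.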

The problem is that a lexer is not going to be able to give you many of the features you want from a parser (e.g. production of a tree-based representation of the input). You have to implement that logic yourself because there is no way that such a "lenient" parser would be able to determine how to handle cases like the following:

<parent>
    <child>
    </parent>
</child>

Think about it: what sort of tree would you expect to get out of this? There's really no sane answer to that question, which is precisely why a parser isn't going to be of much help.

Now, that's not to say that you couldn't use Tag Soup (or your own hand-written lexer) to produce some sort of tree structure based on this input, but the implementation would be very fragile. With tree-oriented formats like XML, you really have no choice but to be strict, otherwise it becomes nearly impossible to get a reasonable result (this is part of why browsers have such a hard time with compatibility).

Daniel Spiewak 2009-10-03 02:03:51

True, the closest match for my problem is the kind of output this gives. I do have an idea about what kind of rules I would use to produce an XML tree (I was hoping to use the XML API for queries) but of course this wouldn't be remotely 'correct'. I can just do it the more pragmatic way.

Joe 2009-10-23 14:38:30

Answer 4

+1 A:

I mostly agree with Daniel Spiewak's answer. This is just another way to create "your own parser".

While I don't know of any Scala-specific solution, you can try using Woodstox, a Java library that implements the StAX API. (Since it is an event-based API, I am assuming it will be more fault-tolerant than a DOM parser.)
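A quick way to test that assumption is to pull events until the parser gives up and see how far it got. The sketch below uses the StAX implementation bundled with the JDK rather than Woodstox itself; Woodstox implements the same `javax.xml.stream` interfaces, but its behaviour on broken input may differ, so treat this only as an illustration of the API. `StaxProbe` and `startTags` are hypothetical names.

```scala
import java.io.StringReader
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants, XMLStreamException}
import scala.collection.mutable.ListBuffer

object StaxProbe {
  // Returns the element names seen before the parser stopped, plus the
  // error message if the input turned out not to be well-formed.
  def startTags(xml: String): (List[String], Option[String]) = {
    val seen = ListBuffer.empty[String]
    val reader = XMLInputFactory.newInstance()
      .createXMLStreamReader(new StringReader(xml))
    try {
      while (reader.hasNext()) {
        // Only start-element events are recorded; all others are skipped.
        if (reader.next() == XMLStreamConstants.START_ELEMENT)
          seen += reader.getLocalName()
      }
      (seen.toList, None)
    } catch {
      // A well-formedness error ends the stream, but earlier events were delivered.
      case e: XMLStreamException => (seen.toList, Some(e.getMessage))
    } finally reader.close()
  }
}
```

With a pull API like this, the caller keeps whatever was read before the error, which a tree-building parser would normally discard. It is still a conforming XML parser, though, so it should not be expected to continue past the first well-formedness error.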

There is also a Scala wrapper around Woodstox called Frostbridge, developed by the same guy who made the Simple Build Tool for Scala.

I had mixed opinions about Frostbridge when I tried it, but perhaps it is more suitable for your purposes.

HRJ 2009-10-03 04:22:42

Answer 5

+1 A:

I agree with the answers that turning invalid XML into "correct" XML is impossible.

Why don't you just do a regular text search for the hrefs if that's all you're interested in? One issue would be commented out links, but if the XML is invalid, it might not be possible to tell what is intended to be commented out!
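A minimal sketch of such a text search, assuming the links appear as quoted `href` attributes (`HrefScan` is a hypothetical name; unquoted attribute values are not handled):

```scala
object HrefScan {
  // Matches href="..." or href='...'; the value lands in group 1 or group 2.
  private val Href = """(?i)href\s*=\s*(?:"([^"]*)"|'([^']*)')""".r

  def hrefs(text: String): List[String] =
    Href.findAllMatchIn(text).map { m =>
      // Only one alternative matches, so the other group is null.
      Option(m.group(1)).getOrElse(m.group(2))
    }.toList
}
```

This treats the document as a plain string, so it does not care whether tags are closed or nested properly. It also returns links that sit inside comments, since it has no notion of what a comment is.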

Adrian Mouat 2009-10-03 10:36:51

The reason I wanted to do this was to use the Scala XML API on those well-formed documents that I find, and to attempt to fix the broken ones first. I suppose I'll just treat it as a string.

Joe 2009-10-23 14:51:47

One reason you might not want to do a text search is if you only want to extract links from `a` tags and not, for example, `link` tags or `DOCTYPE` declarations.
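If a text search is still preferred, the pattern can be narrowed to opening `a` tags. This is only a sketch with the hypothetical name `AnchorScan`; it assumes quoted attribute values and no `>` character inside the tag's attributes.

```scala
object AnchorScan {
  // "<a" followed by a word boundary, so <area>, <abbr> and <link> do not match.
  // The whitespace required before "href" keeps attributes such as data-href out.
  private val AnchorHref =
    """(?i)<a\b[^>]*?\shref\s*=\s*(?:"([^"]*)"|'([^']*)')""".r

  def links(text: String): List[String] =
    AnchorHref.findAllMatchIn(text).map { m =>
      // Only one alternative matches, so the other group is null.
      Option(m.group(1)).getOrElse(m.group(2))
    }.toList
}
```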

Ben James 2009-12-19 18:27:54
