I would like to be able to parse RSS and Atom feeds that contain XML that is not well-formed. The errors I have encountered and would like to fix include "simple" things such as an "&gt" entity whose closing ";" is missing, missing closing tags, and closing tags that appear in the wrong order.

I would like to set aside the question of whether, in theory, it makes any sense to attempt to parse malformed XML documents at all. One "technical" term that seems to come rather close to what I want to do is "tag soup". Which existing CPAN modules should I use to build a parser that can tolerate or correct simple errors like those described above?

A: 

Use the recover option of XML::LibXML if you really must, or XML::Liberal if you really want to go overboard and parse any old rubbish.
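As a rough illustration of the recover option, here is a minimal sketch that tries a strict parse first and only falls back to recovery mode when that fails. The helper name parse_feed_leniently and the sample feed are made up for this example, and what libxml2 actually salvages from a given broken document is not guaranteed, so inspect the resulting tree before trusting it.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper (not part of XML::LibXML): returns the parsed
# document plus a flag saying whether recovery mode was needed.
sub parse_feed_leniently {
    my ($xml) = @_;

    # Strict attempt: load_xml dies on input that is not well-formed.
    my $doc = eval { XML::LibXML->load_xml(string => $xml) };
    return ($doc, 0) if defined $doc;

    # recover => 2 asks libxml2 to keep whatever it managed to parse and
    # to suppress the error messages that recover => 1 would emit.
    $doc = eval { XML::LibXML->load_xml(string => $xml, recover => 2) };
    return ($doc, 1);
}

# Made-up sample: "&gt" lacks its ";" and </title> and </item> are missing.
my $broken = '<rss><channel><title>A &gt B</title>'
           . '<item><title>x</channel></rss>';

my ($doc, $recovered) = parse_feed_leniently($broken);
warn "feed was not well-formed; the recovered tree may be incomplete\n"
    if $recovered;
print $doc->toString, "\n" if defined $doc;
```

XML::Liberal wraps the same parser and tries to patch up the input before parsing; as far as I recall its documented usage is XML::Liberal->new('LibXML') followed by parse_string, but check the module's own documentation before relying on that.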

I'm sure you would like to ignore the question of whether parsing non-well-formed documents makes any sense, but ignoring it won't make it go away. Most RSS tools will correctly reject any non-well-formed XML input completely; you should generally follow suit, unless your tool is something unusual like an RSS debugger.

“Tag soup” is a term specifically related to HTML parsing. One of the central ideas of XML (and hence RSS and Atom) is that there is no such thing.

bobince