tagsoup

With Haskell, how do I process large volumes of XML?

I've been exploring the Stack Overflow data dumps, thus far taking advantage of the friendly XML by “parsing” it with regular expressions. My attempts with various Haskell XML libraries to find the first post in document order by a particular user all ran into nasty thrashing. TagSoup import Control.Monad import Text.HTML.TagSoup use...

TagSoup fails to parse HTML document from a StringReader (Java)

Hi, I have this function: private Node getDOM(String str) throws SearchEngineException { DOMResult result = new DOMResult(); try { XMLReader reader = new Parser(); reader.setFeature(Parser.namespacesFeature, false); reader.setFeatur...

XPath Expression returns nothing for //element, but //* returns a count.

I'm using XOM with the following sample data: Element root = cleanDoc.getRootElement(); //find all the bold elements, as those mark institution and clinic. Nodes nodes = root.query("//*"); <html xmlns="http://www.w3.org/1999/xhtml" xmlns:html="http://www.w3.org/1999/xhtml"> <head> <title>Patient Information</title> <...

JDOM 1.1: hyphen is not a valid comment character

I'm using tagsoup to clean some HTML I'm scraping from the internet, and I'm getting the following error when parsing through pages with comments: The data "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - " is not legal for a JDOM com...

Parsing XML with TagSoup: bug with long attributes?

Hi, I'm trying to parse ugly HTML with TagSoup to extract the value of a given tag. Here is the tag: <input type="hidden" name="hash_check" value="ffc39410ed8da309408a9382450ddc85" /> I want to retrieve the value of the attribute "value" ("ffc39410ed8da309408a9382450ddc85"). And here is my code, in my SAX handler: public void startElement(Str...