I've been exploring the Stack Overflow data dumps and thus far taking advantage of the friendly XML and “parsing” with regular expressions. My attempts with various Haskell XML libraries to find the first post in document-order by a particular user all ran into nasty thrashing.
TagSoup
import Control.Monad
import Text.HTML.TagSoup
use...
Hi
I have this function:
private Node getDOM(String str) throws SearchEngineException {
DOMResult result = new DOMResult();
try {
XMLReader reader = new Parser();
reader.setFeature(Parser.namespacesFeature, false);
reader.setFeatur...
I'm using XOM with the following sample data:
Element root = cleanDoc.getRootElement();
//find all the bold elements, as those mark institution and clinic.
Nodes nodes = root.query("//*");
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:html="http://www.w3.org/1999/xhtml">
<head>
<title>Patient Information</title>
<...
I'm using tagsoup to clean some HTML I'm scraping from the internet, and I'm getting the following error when parsing through pages with comments:
The data "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - " is not legal for a JDOM com...
Hi,
I'm trying to parse ugly HTML with TagSoup to extract value of a given tag.
Here is the tag :
<input type="hidden" name="hash_check" value="ffc39410ed8da309408a9382450ddc85" />
I want to retrieve value of attribute "value" ("ffc39410ed8da309408a9382450ddc85")
And here is my code, in my SAX handler :
public void startElement(Str...