I'm trying to scrape data from a website that has invalid HTML. Simple HTML DOM Parser parses it, but loses some information because of how it handles the invalid HTML. The built-in DOM parser with DOMXPath isn't working; it returns a blank result set. I was able to get it (DOMDocument and DOMXPath) working locally after running the fetched HTML through PHP Tidy, but PHP Tidy isn't installed on the server, and since it's shared hosting, I have no control over that. I tried HTMLPurifier, but that just seems to be for securing user input, since it completely removes the doctype, head, and body tags.

Is there any kind of standalone alternative to PHP Tidy? I would really prefer to use DOMXPath to navigate around and grab what I need; it just seems to need some help cleaning up the HTML before it can parse it.

Edit: I'm scraping this website: http://courseschedules.njit.edu/index.aspx?semester=2010f. For now I'm just trying to get all the course links.

+1  A: 

If you know the errors, you might apply some regular expressions to fix them specifically. While this ad-hoc solution might seem dirty, it may actually be the better option: if the HTML is indeed malformed, it can be complex to infer a correct interpretation automatically.

EDIT: Actually, it might be better to simply extract the needed information through regular expressions, as the page has many errors which would be hard, or at least tedious, to fix. See the sketch below.
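A minimal sketch of the regex approach (the href pattern is an assumption based on the link format shown elsewhere in this thread; adjust it to the actual markup):

$html = file_get_contents('http://courseschedules.njit.edu/index.aspx?semester=2010f');

// match anchors whose href points at a course subject page;
// (?:amp;)? covers both raw and entity-encoded ampersands
preg_match_all(
    '~<a[^>]+href="(index\.aspx\?semester=2010f&(?:amp;)?subjectID=[^"]+)"[^>]*>(.*?)</a>~is',
    $html,
    $matches,
    PREG_SET_ORDER
);

foreach ($matches as $match) {
    printf("%s (%s)\n", trim(strip_tags($match[2])), html_entity_decode($match[1]));
}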

Michele Balistreri
A: 

Is there a web service that will run your content through Tidy? Could you write one? Tidy is the only sane way I know of fixing broken markup.
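If such a service existed, calling it from PHP would be straightforward. A minimal sketch with cURL, assuming a hypothetical endpoint at http://example.com/tidy that accepts raw HTML via POST and returns the cleaned markup:

$dirty = file_get_contents('http://courseschedules.njit.edu/index.aspx?semester=2010f');

// POST the broken HTML to the (hypothetical) Tidy service
$ch = curl_init('http://example.com/tidy');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, array('html' => $dirty));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$clean = curl_exec($ch);
curl_close($ch);

// $clean can now be loaded into DOMDocument for XPath queries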

Robin
+1  A: 

DOM handles broken HTML fine if you use loadHTML or loadHTMLFile:

$dom = new DOMDocument;

// suppress the warnings libxml raises for malformed HTML
libxml_use_internal_errors(TRUE);
$dom->loadHTMLFile('http://courseschedules.njit.edu/index.aspx?semester=2010f');
libxml_clear_errors();

$xPath = new DOMXPath($dom);

// all links inside divs with the class courseList_section
$links = $xPath->query('//div[@class="courseList_section"]//a');
foreach ($links as $link) {
    printf("%s (%s)\n", $link->nodeValue, $link->getAttribute('href'));
}

This will output:

ACCT - Accounting (index.aspx?semester=2010f&subjectID=ACCT)
AD   - Art and Design (index.aspx?semester=2010f&subjectID=AD  )
ARCH - Architecture (index.aspx?semester=2010f&subjectID=ARCH)
... many more ...
TRAN - Transportation Engr (index.aspx?semester=2010f&subjectID=TRAN)
TUTR - Tutoring (index.aspx?semester=2010f&subjectID=TUTR)
URB  - Urban Systems (index.aspx?semester=2010f&subjectID=URB )

Using

echo $dom->saveXML($link), PHP_EOL;

in the foreach loop will output the full outerHTML of the links.

Gordon
This does a little better than Simple HTML DOM Parser, but if you count the results, it only gives 107 of the 123 links.
Telanor
@Telanor updated. The XPath now searches for *all links inside divs with the class courseList_section* instead of for *all links inside spans inside divs*. I am pretty sure you could have fixed that easily yourself though. Also possible `'//a[ancestor::div[@class="courseList_section"]]'`
Gordon
You're right, it does work now. I'm still not sure how I didn't already try this. That's actually the same XPath query I was using locally after running Tidy.
Telanor
A: 

Consider using a real browser or the webbrowser control. I tested with iMacros, and the web scraping works well. Here is a test macro for the first two links:

VERSION BUILD=7050962
URL GOTO=http://courseschedules.njit.edu/index.aspx?semester=2010f
'Get text
'TAG POS=2 TYPE=A FORM=ID:form1 ATTR=TXT:*-* EXTRACT=TXT
'Get link first entry
TAG POS=2 TYPE=A FORM=ID:form1 ATTR=TXT:*-* EXTRACT=HREF
'Get link second entry
TAG POS=3 TYPE=A FORM=ID:form1 ATTR=TXT:*-* EXTRACT=HREF

You can move between the entries by incrementing the POS= value.

SamMeiers
A: 

Another simple way to solve the problem is to pass the site you are trying to scrape through a mobile browser adapter such as Google's mobilizer for complicated websites. This corrects the invalid HTML and lets you use the Simple HTML DOM Parser package, though it might not work if you need information that the adapter strips out of the site. I use this for sites where the information is poorly formatted, or when I need to simplify the formatting so it is easy to parse. The HTML returned by the Google mobilizer is simpler and much easier to process. The link to the adapter is below.

http://www.google.com/gwt/n
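A minimal sketch combining the mobilizer with Simple HTML DOM Parser (the url-encoded "u" parameter is an assumption; check the query format the service actually expects):

include 'simple_html_dom.php';

$target = 'http://courseschedules.njit.edu/index.aspx?semester=2010f';

// fetch the page through the mobilizer so the HTML comes back cleaned up
$html = file_get_html('http://www.google.com/gwt/n?u=' . urlencode($target));

foreach ($html->find('a') as $link) {
    echo $link->plaintext, ' (', $link->href, ")\n";
}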

jerryvig