I'm trying to scrape data from a website that has invalid HTML. Simple HTML DOM Parser parses it, but loses some information because of how it handles the invalid HTML. The built-in DOM parser with DOMXPath isn't working; it returns a blank result set. I was able to get it (DOMDocument and DOMXPath) working locally after running the fetched HTML through PHP Tidy, but PHP Tidy isn't installed on the server, and since it's shared hosting, I have no control over that. I tried HTMLPurifier, but that just seems to be for securing user input, since it completely removes the doctype, head, and body tags.

Is there any kind of standalone alternative to PHP Tidy? I would really prefer to use DOMXPath to navigate around and grab what I need; it just seems to need some help cleaning up the HTML before it can parse it.

Edit: I'm scraping this website: http://courseschedules.njit.edu/index.aspx?semester=2010f. For now I'm just trying to get all the course links.

+1  A: 

If you know the errors, you might apply some regular expressions to fix them specifically. While this ad-hoc solution might seem dirty, it may actually be the better option: if the HTML is indeed malformed, it can be complex to infer a correct interpretation automatically.

EDIT: Actually, it might be better to simply extract the needed information through regular expressions, as the page has many errors which would be hard, or at least tedious, to fix. See the sketch below.
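A minimal sketch of the regex approach (the href pattern is an assumption based on the link format shown elsewhere in this thread; adjust it to the actual markup):

$html = file_get_contents('http://courseschedules.njit.edu/index.aspx?semester=2010f');

// match anchors whose href points at a course subject page;
// (?:amp;)? covers both raw and entity-encoded ampersands
preg_match_all(
    '~<a[^>]+href="(index\.aspx\?semester=2010f&(?:amp;)?subjectID=[^"]+)"[^>]*>(.*?)</a>~is',
    $html,
    $matches,
    PREG_SET_ORDER
);

foreach ($matches as $match) {
    printf("%s (%s)\n", trim(strip_tags($match[2])), html_entity_decode($match[1]));
}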

Michele Balistreri
A: 

Is there a web service that will run your content through Tidy? Could you write one? Tidy is the only sane way I know of fixing broken markup.
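If such a service existed, calling it from PHP would be straightforward. A minimal sketch with cURL, assuming a hypothetical endpoint at http://example.com/tidy that accepts raw HTML via POST and returns the cleaned markup:

$dirty = file_get_contents('http://courseschedules.njit.edu/index.aspx?semester=2010f');

// POST the broken HTML to the (hypothetical) Tidy service
$ch = curl_init('http://example.com/tidy');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, array('html' => $dirty));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$clean = curl_exec($ch);
curl_close($ch);

// $clean can now be loaded into DOMDocument for XPath queries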

Robin
+1  A: 

DOM handles broken HTML fine if you use loadHTML or loadHTMLFile:

$dom = new DOMDocument;

// suppress the warnings libxml raises for malformed HTML
libxml_use_internal_errors(TRUE);
$dom->loadHTMLFile('http://courseschedules.njit.edu/index.aspx?semester=2010f');
libxml_clear_errors();

$xPath = new DOMXPath($dom);

// all links inside divs with the class courseList_section
$links = $xPath->query('//div[@class="courseList_section"]//a');
foreach ($links as $link) {
    printf("%s (%s)\n", $link->nodeValue, $link->getAttribute('href'));
}

This will output:

ACCT - Accounting (index.aspx?semester=2010f&subjectID=ACCT)
AD   - Art and Design (index.aspx?semester=2010f&subjectID=AD  )
ARCH - Architecture (index.aspx?semester=2010f&subjectID=ARCH)
... many more ...
TRAN - Transportation Engr (index.aspx?semester=2010f&subjectID=TRAN)
TUTR - Tutoring (index.aspx?semester=2010f&subjectID=TUTR)
URB  - Urban Systems (index.aspx?semester=2010f&subjectID=URB )

Using

echo $dom->saveXML($link), PHP_EOL;

in the foreach loop will output the full outerHTML of the links.

Gordon
This does a little better than Simple HTML DOM Parser, but if you count the results, it only gives 107 of the 123 links.
Telanor
@Telanor updated. The XPath now searches for *all links inside divs with the class courseList_section* instead of for *all links inside spans inside divs*. I am pretty sure you could have fixed that easily yourself though. Also possible `'//a[ancestor::div[@class="courseList_section"]]'`
Gordon
You're right, it does work now. I'm still not sure how I didn't already try this. That's actually the same XPath query I was using locally after running Tidy.
Telanor
A: 

Consider using a real browser or the webbrowser control. I tested with iMacros, and the web scraping works well. Here is a test macro for the first two links:

VERSION BUILD=7050962
URL GOTO=http://courseschedules.njit.edu/index.aspx?semester=2010f
'Get text
'TAG POS=2 TYPE=A FORM=ID:form1 ATTR=TXT:*-* EXTRACT=TXT
'Get link first entry
TAG POS=2 TYPE=A FORM=ID:form1 ATTR=TXT:*-* EXTRACT=HREF
'Get link second entry
TAG POS=3 TYPE=A FORM=ID:form1 ATTR=TXT:*-* EXTRACT=HREF

You can move between the entries by incrementing the POS= value.

SamMeiers
A: 

Another simple way to solve the problem is to pass the site you are trying to scrape through a mobile browser adapter such as Google's mobilizer for complicated websites. This corrects the invalid HTML and lets you use the Simple HTML DOM Parser package, though it might not work if you need information that the adapter strips out of the site. I use this for sites where the information is poorly formatted, or when I need to simplify the formatting so it is easy to parse. The HTML returned by the Google mobilizer is simpler and much easier to process. The link to the adapter is below.

http://www.google.com/gwt/n
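A minimal sketch combining the mobilizer with Simple HTML DOM Parser (the url-encoded "u" parameter is an assumption; check the query format the service actually expects):

include 'simple_html_dom.php';

$target = 'http://courseschedules.njit.edu/index.aspx?semester=2010f';

// fetch the page through the mobilizer so the HTML comes back cleaned up
$html = file_get_html('http://www.google.com/gwt/n?u=' . urlencode($target));

foreach ($html->find('a') as $link) {
    echo $link->plaintext, ' (', $link->href, ")\n";
}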

jerryvig