I'm trying to scrape information from http://www.nfl.com/scores (in particular, find out when a game is over so my computer can stop recording it). I can download HTML easily enough, and it makes this claim about compliance with standards:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
But:
- An attempt to parse it with Expat produces the error "not well-formed (invalid token)".
- The W3C's online validation service reports 399 errors and 121 warnings.
I tried to run HTML Tidy (just called tidy) on my Linux system with the -xml option, but tidy reports 56 warnings and 117 errors and is unable to recover a good XML file. The errors look like this:
line 409 column 122 - Warning: unescaped & or unknown entity "&role" ...
line 409 column 172 - Warning: unescaped & or unknown entity "&tabSeq" ...
line 1208 column 65 - Error: unexpected </td> in <br>
line 1209 column 57 - Error: unexpected </tr> in <br>
line 1210 column 49 - Error: unexpected </table> in <br>
But when I check the input, the "unknown entities" appear to be part of a properly quoted URL, so I don't know if a double quote is missing somewhere or what.
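To narrow down the entity warnings, here is a minimal sketch using Python's standard-library Expat binding. The URL and parameter values are made up for illustration (only the names role and tabSeq come from the tidy output above), and expat_error is a hypothetical helper, not part of any library. It suggests that a missing quote may not be the cause: in XML, a bare & has to be written as &amp; even inside a properly quoted attribute value.

```python
import xml.parsers.expat


def expat_error(document):
    """Return Expat's error message for `document`, or None if it parses cleanly."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # Second argument True marks this as the final (only) chunk of input.
        parser.Parse(document, True)
    except xml.parsers.expat.ExpatError as exc:
        return str(exc)
    return None


# Hypothetical markup: a correctly quoted URL containing bare ampersands.
raw = '<a href="http://example.com/page?id=1&role=x&tabSeq=2">link</a>'

# The same markup with each ampersand escaped as XML requires.
escaped = raw.replace("&", "&amp;")

print("raw:    ", expat_error(raw))
print("escaped:", expat_error(escaped))
```

If this behaves as I expect, the first line reports a parse error and the second prints None, which would mean the quoting in the page is fine and the unescaped ampersands alone are enough to make Expat give up.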
I know that there is something out there that can parse this stuff because both Firefox and w3m display something reasonable. What tool will fix the noncompliant HTML so that I can parse it with Expat?