I'm trying to scrape data from a website that has invalid HTML. Simple HTML DOM Parser parses it but loses some information because of how it handles the invalid markup. The built-in DOM parser (DOMDocument with DOMXPath) isn't working: it returns an empty result set. I was able to get DOMDocument and DOMXPath working locally after running the fetched HTML through PHP Tidy, but PHP Tidy isn't installed on the server, and it's a shared hosting server, so I have no control over that. I tried HTMLPurifier, but that seems to be meant for sanitizing user input, since it completely removes the doctype, head, and body tags.
Is there a standalone alternative to PHP Tidy? I would really prefer to use DOMXPath to navigate the document and grab what I need; it just seems to need the HTML cleaned up before it can parse it.
Edit: I'm scraping this website: http://courseschedules.njit.edu/index.aspx?semester=2010f. For now I'm just trying to get all the course links.
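
For reference, here is a minimal sketch of the DOMDocument/DOMXPath approach I mean. It is not my exact code, and it is not the real page source: the sample markup and the `subjectID` query parameter are made up for illustration, and the `//a[@href]` expression is just an assumed starting point for "all links".

```php
<?php
// Hypothetical sample of malformed markup: unquoted attribute, unclosed
// <a>/<td>/<tr> tags and a stray </font>. The real page would be fetched
// with file_get_contents() or cURL instead.
$html = '<html><body><table><tr><td><a href=index.aspx?semester=2010f&subjectID=ACCT>ACCT<td>'
      . '<a href="index.aspx?semester=2010f&subjectID=BIOL">BIOL</a></font></table>';

// Collect libxml parse warnings internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);   // libxml's HTML parser tries to recover from bad markup
libxml_clear_errors();   // discard the collected warnings

$xpath = new DOMXPath($doc);

// Print the href and link text of every anchor that has an href attribute.
foreach ($xpath->query('//a[@href]') as $a) {
    echo $a->getAttribute('href'), ' => ', trim($a->textContent), PHP_EOL;
}
```

On the real page, a query along these lines is what comes back empty for me unless the HTML has been run through Tidy first.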