I'm working on a web scraper that will aggregate data from various websites. I started out with PHP's built-in DOM functions, but after running into a couple of issues (especially with malformed markup and character encoding), I've decided to ditch PHP. I was thinking of server-side JavaScript but am open to other suggestions. If I go with JavaScript, which interpreter should I use?
Thanks, I'll give it a shot.
Olivier Lalonde
2010-01-31 07:59:11
+1
A:
There's an excellent BeautifulSoup module for Python that can handle broken markup in most cases. It also lets you use hooks to preprocess the HTML if a page is so malformed that the built-in heuristics don't work. I've used BeautifulSoup to write dozens of parsers.
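To give a rough idea of what this looks like, here is a minimal sketch. It assumes the current `bs4` package (installed with `pip install beautifulsoup4`) and the standard library's `html.parser` backend, which is not necessarily the version or setup I used for my own parsers. The sample markup and the `extract_links` helper are made up for illustration.

```python
from bs4 import BeautifulSoup  # assumption: the modern bs4 package is installed

# Deliberately malformed sample: unclosed <p>, <li> and <b> tags, no </ul>.
BROKEN_HTML = """
<html><body>
<p>Prices<ul>
<li><a href="/item/1">First <b>item</a>
<li><a href="/item/2">Second item</a>
</body></html>
"""


def extract_links(markup):
    """Return (href, text) pairs for every <a> element that has an href.

    Hypothetical helper, written only for this example.
    """
    # "html.parser" is the standard-library backend, so nothing else is needed.
    soup = BeautifulSoup(markup, "html.parser")
    return [
        # Join text fragments with a space and strip surrounding whitespace.
        (a["href"], a.get_text(" ", strip=True))
        for a in soup.find_all("a", href=True)
    ]


links = extract_links(BROKEN_HTML)
for href, text in links:
    print(href, text)
```

The parser closes the dangling tags on its own, so both links should come out even though the markup never closes its list items.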
There's also the html5lib module, which is faster and can also parse invalid HTML.
Both modules have Ruby ports.
Eugene Morozov
2010-01-31 08:42:21