I've been doing some HTML scraping in PHP using regular expressions. It works, but the result is finicky and fragile. Has anyone used any packages that provide a more robust solution? A configuration-driven solution would be ideal, but I'm not picky.

+7  A: 

I would recommend PHP Simple HTML DOM Parser once you have fetched the HTML from the page. It tolerates invalid HTML and provides a very easy way to handle HTML elements.

Espo
+3  A: 

If the page you're scraping is valid X(HT)ML, then any of PHP's built-in XML parsers will do.
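For illustration, here is a minimal sketch of that approach using the built-in DOMDocument and DOMXPath classes. The markup, the `links` id, and the variable names are made up for the example; it assumes the input really is well-formed XML, since `loadXML()` will reject anything sloppier.

```php
<?php
// Hypothetical, well-formed XHTML standing in for a scraped page.
$xhtml = <<<XML
<html>
  <body>
    <ul id="links">
      <li><a href="/first">First</a></li>
      <li><a href="/second">Second</a></li>
    </ul>
  </body>
</html>
XML;

// Parse strictly as XML; this only succeeds on well-formed input.
$doc = new DOMDocument();
$doc->loadXML($xhtml);

// Select the anchors with XPath instead of matching them by regex.
$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query('//ul[@id="links"]//a') as $a) {
    // Map each href to its visible link text.
    $links[$a->getAttribute('href')] = trim($a->textContent);
}

print_r($links);
```

If the real page declares the XHTML namespace on its root element, the XPath expression would need a registered namespace prefix, which this sketch leaves out.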

I haven't had much success with PHP libraries for scraping. If you're adventurous though, you can try simplehtmldom. I'd recommend Hpricot for Ruby or Beautiful Soup for Python, which are both excellent parsers for HTML.

John Douthat
If you're going to be parsing particularly sloppy HTML, make sure you don't use BeautifulSoup 3.1.x (use 3.0.x). 3.1.x uses htmllib as its parser, which is much less forgiving than 3.0.x's use of sgmllib.
Tom
+2  A: 

I've had very good results with the Simple HTML DOM Parser mentioned above as well. There's also the tidy extension for PHP, which works really well too.

Polygraf
+1  A: 

Have a look at this thread; the question goes in a similar direction.

crono
+2  A: 

I had some fun working with htmlSQL, which is not so much a high-end solution, but it is really simple to work with.

BlaM
Late comment, but I just found your answer via Google. I like it! :)
Ben
+1  A: 

For HTML scraping in PHP, I'd recommend cURL + regexp or cURL + a DOM parser, though I personally use cURL + regexp. If you are comfortable with regular expressions, they can actually be more accurate sometimes.
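To make the two options concrete, here is a rough sketch that puts them side by side. The function names (`fetch_page`, `links_by_regex`, `links_by_dom`) and the sample markup are hypothetical, and the regex is deliberately simple; it is meant as an illustration of the trade-off, not a recommended pattern.

```php
<?php
// Fetch a page with cURL; returns the body, or null on failure.
// Defined for completeness; the demo below parses a literal string instead.
function fetch_page(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Regex approach: only matches double-quoted href attributes.
function links_by_regex(string $html): array
{
    preg_match_all('/<a\s[^>]*href="([^"]*)"/i', $html, $m);
    return $m[1];
}

// DOM approach: let libxml's lenient HTML parser repair the markup first.
function links_by_dom(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings about sloppy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $hrefs[] = $a->getAttribute('href');
        }
    }
    return $hrefs;
}

// Sloppy sample markup: unquoted and single-quoted attributes, unclosed <p>.
$html = '<p><a href="/a">A</a> <a class=x href=\'/b\'>B</a><p>unclosed';

print_r(links_by_regex($html)); // finds only /a
print_r(links_by_dom($html));   // finds /a and /b
```

The regex can of course be extended to cover single-quoted and unquoted attributes, which is where the "finicky and fragile" feeling in the question tends to come from; the DOM version handles those variations without pattern changes.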

kavoir.com
+3  A: 

I would also recommend Simple HTML DOM Parser. It is a good option, particularly if you're familiar with jQuery or JavaScript selectors, since you will find yourself right at home.

I have even blogged about it in the past.

Orange Box
A: 

How about getting the HTML using PHP, and then traversing and scraping it using jQuery? Is this possible?

Dmitry
+1  A: 

What kind of package? A library, code, or software? Explain a bit.

Bob
+1  A: 

Finicky and fragile? There's really no way to answer a question this vague without seeing some code and understanding the problem.