Which one is better for screen scraping: Simple HTML DOM or Snoopy? I use Simple HTML DOM and find it comfortable. Does Snoopy have any advantage over Simple HTML DOM?

My requirements: I want to scrape content from a page after logging in. Simple HTML DOM is easy to use, but it takes a long time to print the results.

+1  A: 

Is Snoopy a well-known, mature package?

If it's not, then all other things being equal, I'd probably go with generic HTML DOM code - especially if the scraping is somewhat simple.

But only you know when your code is starting to get too big, unmanageable, etc., at which point it might be better to look at another tool out there like Snoopy.

(Which, admittedly, I don't have experience with; it's apparently at http://sourceforge.net/projects/snoopy/ for those not familiar with it - "Snoopy is a PHP class that simulates a web browser. It automates the task of retrieving web page content and posting forms, for example.")
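
For readers wondering what "retrieving web page content and posting forms" amounts to for a page behind a login, here is a rough sketch using only the Python standard library. It is not Snoopy's API. The form field names `username` and `password`, and the helper names, are assumptions made for illustration; a real site will use whatever its login form defines.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_session():
    """Return a cookie-aware opener and the jar that stores its cookies."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, username, password):
    """Build (but do not send) a form POST carrying the credentials.

    The field names "username" and "password" are assumptions; use the
    names that appear in the target site's login form.
    """
    form = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(login_url, data=form.encode("ascii"), method="POST")


def fetch_after_login(opener, login_url, page_url, username, password):
    """Log in, then fetch a page with the same cookie-aware opener."""
    with opener.open(build_login_request(login_url, username, password)):
        pass  # the session cookie lands in the jar as a side effect
    with opener.open(page_url) as response:
        return response.read().decode("utf-8", errors="replace")
```

The key design point is that one opener handles both requests, so the session cookie set by the login response is sent along when the protected page is fetched.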

The real reason I'm posting, even though I don't know Snoopy per se and thus can't definitively answer your question, is to ask if you've considered using Selenium (http://www.seleniumhq.org/) instead of Snoopy.

Selenium is a fairly well-known testing tool, and it occurred to me that one of the nice things about using it for what you're doing (if you can) is that it has built-in tests.

The reason that's good is that screen scraping is kind of an inherently brittle task - if the target site changes something, blam, your scraping fails. So it's kind of a nice design to have an automated scrape/test-that-scraping-worked system.
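
The scrape-then-check idea does not depend on Selenium. As a minimal sketch using only Python's standard library, the extraction step below fails loudly when the expected element is missing instead of silently returning nothing. The names `TitleParser`, `ScrapeError` and `scrape_title` are made up for this example, and the page title stands in for whatever content you actually scrape.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded; later ones are ignored.
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


class ScrapeError(Exception):
    """Raised when the page no longer looks the way the scraper expects."""


def scrape_title(html):
    """Return the page title, or raise ScrapeError if none is found."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = (parser.title or "").strip()
    if not title:
        raise ScrapeError("no <title> text found; the page layout may have changed")
    return title
```

Running a check like this on every scrape (or on a schedule against the live site) means a layout change shows up as an explicit error rather than as empty or wrong data.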

Something to think about, anyway.

Chirael
Thanks for the link. I'm looking at it.
Sam
A: 

I've stumbled into BeautifulSoup, which is Python-based. I suppose there are a bunch of others too.

Looks like Snoopy is PHP-based, and hence can be run server-side only. Is this what you are really looking for? What are your requirements? Please elaborate on that.

AndreaG
There's also Mechanize (http://wwwsearch.sourceforge.net/mechanize/) which is Python-based and based on BeautifulSoup. Andrea and Jeremy are right, we need more details of what you're trying to do (and how often you want to do it, for how many pages, etc.) to be able to recommend server- vs. client-side, etc.
Chirael