views: 25 · answers: 1
I am using Simple HTML DOM PHP quite successfully to scrape some of my favorite webpages. Some of these pages, however, require me to log in before I can get at the information that I really care about. Does anyone know how (or if it's possible) to get this library to access a page that requires a username and password to be entered before you gain access? Everything I've done to date starts with something like...

$html = file_get_html('http://www.google.com/');
+2  A: 

Very few sites use identical authentication mechanisms, so there's no single way to authenticate with every site.

Your best bet will be to use cURL and make your scraper look like a real browser. This means handling cookies (cURL's cookie file/jar options let you store them between requests), navigating to the login form, submitting it successfully, and then continuing to use that same "browser" session to perform your scraping.
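The approach above can be sketched roughly as follows. This is only an illustration: the URLs and form-field names (`username`, `password`, the login action URL) are placeholders, and you would need to inspect the target site's actual login form to find the real ones.

```php
<?php
// Sketch only: URLs and form-field names below are assumptions --
// view the source of the site's login page for the real values.
require_once 'simple_html_dom.php';

$cookieJar = tempnam(sys_get_temp_dir(), 'cookies');

// 1. Submit the login form, storing any session cookies in the jar.
$ch = curl_init('http://example.com/login');          // assumed login URL
curl_setopt_array($ch, array(
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => http_build_query(array(
        'username' => 'me',                           // assumed field names
        'password' => 'secret',
    )),
    CURLOPT_COOKIEJAR      => $cookieJar,   // write cookies here...
    CURLOPT_COOKIEFILE     => $cookieJar,   // ...and send them back
    CURLOPT_FOLLOWLOCATION => true,         // follow post-login redirects
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_USERAGENT      => 'Mozilla/5.0',// look like a real browser
));
curl_exec($ch);

// 2. Re-use the same handle (and its cookies) on the protected page.
curl_setopt($ch, CURLOPT_URL, 'http://example.com/members');
curl_setopt($ch, CURLOPT_HTTPGET, true);    // switch back to GET
$pageHtml = curl_exec($ch);
curl_close($ch);

// 3. Hand the raw HTML to Simple HTML DOM via str_get_html()
//    instead of fetching it with file_get_html().
$html = str_get_html($pageHtml);
```

The key design point is that both requests share one cookie jar, so the session cookie set by the login response is sent automatically with the follow-up request; `str_get_html()` then lets you keep using the Simple HTML DOM API you already have.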

Please make sure that the sites don't mind being scraped in this way. If discovered, you may be banned from the site depending on how much the site owners dislike scraping.

Charles
Interesting — why would anyone care about being scraped like this?
vicatcu
@vicatcu, it depends on what the sites are, and what you're doing with the data. For example, if you were logging in to a site that hosted forums for members only and pulling out the posts, the site owners might not be happy about it.
Charles
Oh I see what you're saying, I have no intention of pulling private data and reposting it in a public space. Thanks for the advice!
vicatcu