Hi friends,

How can I screen scrape a particular website? I need to log in to the website and then scrape the information behind the login. How could this be done?

Please guide me.

Duplicate: How to implement a web scraper in PHP?

A: 

You want to look at the curl functions; they will let you get a page from another website. Depending on the site you're logging in to, you can use cookies or HTTP authentication to log in first, and then get the page you want.
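A minimal sketch of the cookie-based variant, assuming the site logs you in through a plain form POST. The helper names, the URLs, the form field names, and the cookie file path are all hypothetical; a real site may also require hidden form fields or tokens that this does not handle.

```php
<?php
// Hypothetical helper: curl options for submitting a login form.
// The cookie jar lets the session cookie survive into later requests.
function build_login_options(string $loginUrl, array $credentials, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $loginUrl,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($credentials),
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow the post-login redirect, if any
        CURLOPT_COOKIEJAR      => $cookieFile, // write cookies here when the handle is released
        CURLOPT_COOKIEFILE     => $cookieFile, // enable cookie handling and read saved cookies
    ];
}

// Hypothetical helper: log in, then fetch a page that requires the session.
function fetch_after_login(string $loginUrl, array $credentials, string $targetUrl, string $cookieFile): string
{
    $ch = curl_init();

    // Step 1: submit the login form.
    curl_setopt_array($ch, build_login_options($loginUrl, $credentials, $cookieFile));
    if (curl_exec($ch) === false) {
        throw new RuntimeException('Login request failed: ' . curl_error($ch));
    }

    // Step 2: reuse the same handle (and its cookies) for a normal GET.
    curl_setopt_array($ch, [
        CURLOPT_URL     => $targetUrl,
        CURLOPT_HTTPGET => true,
    ]);
    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('Page request failed: ' . curl_error($ch));
    }

    return $html;
}

// Usage (placeholder URLs and field names):
// $html = fetch_after_login(
//     'https://example.com/login',
//     ['username' => 'me', 'password' => 'secret'],
//     'https://example.com/account',
//     sys_get_temp_dir() . '/cookies.txt'
// );
```

For HTTP authentication instead of a login form, setting `CURLOPT_USERPWD` to `"user:password"` on a single GET request is usually enough.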

Once you have the page, you're probably best off using regular expressions to scrape the data you want.
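For the "small piece of information" case, a regex sketch might look like the following. The helper name is hypothetical, and as the comments below point out, this approach breaks easily once the markup gets more complicated than a single, unnested tag.

```php
<?php
// Hypothetical helper: pull the text of the first <title> element out of raw HTML.
function extract_title(string $html): ?string
{
    // i: match the tag name case-insensitively; s: let "." span newlines.
    // (.*?) is non-greedy so it stops at the first closing tag.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $matches) === 1) {
        // Decode entities such as &amp; and strip surrounding whitespace.
        return trim(html_entity_decode($matches[1], ENT_QUOTES | ENT_HTML5, 'UTF-8'));
    }

    return null; // no title element found
}
```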

Greg
-1 Sorry but this issue has come up time and time again: regex is a terrible way to do scraping. Use an HTML/XML parser. Regexes are so error prone for this sort of thing it's not funny.
cletus
cletus, I completely disagree. If you're looking to get a small piece of information from a blob of HTML, a regex is the way to go.
Greg
A: 

You should look at curl.

benlumley
A: 

You might also want to take a look at BeautifulSoup, a Python library that is supposed to be very good at making bad HTML parseable. It is aimed at things like screen scraping.

I don't know how easy it would be to call from PHP, though.

andynormancx
-1 Beautiful Soup is fine if it's Python, but this isn't. There are PHP libraries (like Zend and SimpleXML) for this. Calling Python is not a sensible solution.
cletus
Seems a little harsh. I don't know that much about SimpleXML and Zend, but Googling suggests SimpleXML is just an XML parser and Zend is an app server. I fail to see how either of those helps in any specific way with the hard problem of scraping HTML in the way that something like BS would.
andynormancx
Zend is also a framework of many different packages. And that's kinda my point: your knowledge of PHP is sketchy (it seems) so suggesting Python (something I presume you know more about based on your answer) doesn't really help.
cletus
So Zend has a package designed for parsing badly formatted HTML as found on most websites, then? If it has, nobody seems to have recommended it here. Is there such a package?
andynormancx
I know enough about PHP to know that it can shell out to another app. So running a quick Python script that uses BS to make the HTML parseable should work. If I were looking at scraping potentially lousy HTML, it is definitely what I would try first, before attempting to roll my own.
andynormancx
A: 

Zend_Http_Client and Zend_Dom_Query
Adrian Grigore
A: 

You could also check out http://php.net/dom
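As a rough sketch of what that looks like, the following uses `DOMDocument` and `DOMXPath` to collect link targets from a page. The helper name and the default XPath query are just illustrative assumptions.

```php
<?php
// Hypothetical helper: return the href of every element matched by an XPath query.
function extract_links(string $html, string $xpath = '//a[@href]'): array
{
    $doc = new DOMDocument();

    // Real-world HTML is often malformed; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($doc))->query($xpath);
    if ($nodes === false) {
        return []; // the XPath expression itself was invalid
    }

    $links = [];
    foreach ($nodes as $node) {
        // Only element nodes carry attributes.
        if ($node instanceof DOMElement && $node->hasAttribute('href')) {
            $links[] = $node->getAttribute('href');
        }
    }

    return $links;
}
```

Calling `extract_links($html, '//li[@class="x"]/a')` would narrow the result to links inside matching list items.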

middus
A: 

Use curl, and once you're in, use the QueryPath PHP library (querypath.org). You can access DOM elements just like in jQuery, via CSS selectors, and it supports method chaining.

It's way better than just using PHP's native XML functions.

It also works as a Drupal extension, but I suppose you could use it in any PHP project.

toninoj
A: 

I use the simple_html_dom class mixed with regex for validation purposes. It is very easy to use and well documented: http://www.quickscrape.com

Steve
You need to disclose your affiliation rather than making it sound like you're just a user. See [Astroturfing](http://en.wikipedia.org/wiki/Astroturfing) and read the SO [FAQ](http://stackoverflow.com/faq) for the policy on promoting products you're affiliated with. (Also, your header section is rendering funny in Firefox.)
Bill the Lizard