How can I screen scrape a website using cURL and show the data within a specific div?

A: 

Fetch the website content using a cURL GET request. There's a code sample on the curl_exec manual page.

Use a regular expression to search for the data you need. There's a code sample on the preg_match manual page, but you'll need to read up on regular expressions to build the pattern you need. As Yacoby mentioned (which I hadn't thought of), a better idea may be to examine the DOM of the HTML page using PHP's SimpleXML or DOM parser.

Output the information you've found from the regex/parser in the HTML of your page (within the required div).
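
The steps above can be sketched roughly as follows. This is only an illustration, written in Python rather than PHP, with urllib standing in for the cURL GET request. The names fetch and div_by_id_regex, the id "price" and the sample markup are all made up for the example, and the pattern only copes with a simple div that has no other div nested inside it.

```python
import re
import urllib.request


def fetch(url):
    """Download a page and return it as text (stand-in for a cURL GET)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def div_by_id_regex(html, div_id):
    """Return the inner HTML of the div with the given id, or None.

    Assumption: the div contains no nested <div>; the non-greedy match
    stops at the first </div> it sees.
    """
    pattern = (r'<div[^>]*\bid=["\']' + re.escape(div_id)
               + r'["\'][^>]*>(.*?)</div>')
    match = re.search(pattern, html, re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else None


# Hypothetical markup; in real use this would be html = fetch(some_url).
sample = '<html><body><div id="price">42 USD</div></body></html>'
print(div_by_id_regex(sample, "price"))  # prints: 42 USD
```

The extracted string can then be echoed into whatever div your own page uses. If the target div can contain other divs, a parser is the safer route.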

Andy Shellam
The <center> cannot hold ... http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
Tim Post
+6  A: 

Download the page using cURL (there are a lot of examples in the documentation). Then use a DOM parser, for example Simple HTML DOM or PHP's DOM, to extract the value from the div element.
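
To give an idea of what the parser approach looks like, here is a minimal sketch in Python (not PHP, and not Simple HTML DOM) using the standard library's html.parser. The class DivText, the helper div_text, the id "news" and the sample page are hypothetical. Unlike a regex, it keeps track of nested divs.

```python
from html.parser import HTMLParser


class DivText(HTMLParser):
    """Collect the text inside the first <div> whose id matches div_id."""

    def __init__(self, div_id):
        super().__init__()
        self.div_id = div_id
        self.depth = 0       # 0 means we are outside the target div
        self.chunks = []     # text fragments found inside the target div
        self.done = False    # set once the target div has been closed

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth:
            self.depth += 1  # a div nested inside the target
        elif dict(attrs).get("id") == self.div_id:
            self.depth = 1   # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def div_text(html, div_id):
    """Return the whitespace-normalised text of the matching div."""
    parser = DivText(div_id)
    parser.feed(html)
    parser.close()
    return " ".join(c.strip() for c in parser.chunks if c.strip())


# Hypothetical page; in practice this is the string downloaded with cURL.
page = ('<div id="news"><p>Hello</p><div class="inner">nested</div>'
        ' world</div><div>other</div>')
print(div_text(page, "news"))  # prints: Hello nested world
```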

Yacoby
Please comment when downvoting to give me the chance to correct or otherwise improve my answer.
Yacoby
A: 

You have to ask permission from the site owner first.

Col. Shrapnel
+1: valid point
Jacco
That doesn't come remotely close to answering the question and it's based on unfounded assumptions. The content could be public domain, for example.
Scott Reynen
I love that "could be".
Col. Shrapnel
This should be a comment rather than an answer, since it doesn't actually answer the question. However, it does bring up a valid point...generally speaking, unless it's really obvious that you're in the clear, it's probably a good idea to check with the site owner(s) first.
Beska
A: 

After downloading with cURL, use XPath to select the div and extract the content.
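
A small sketch of the XPath idea, in Python for illustration. It assumes the downloaded markup is well-formed (XHTML-like) and has no XML namespace, because the standard library's ElementTree is a strict XML parser that supports only a subset of XPath. Real-world HTML usually needs a more forgiving parser first. The function div_by_xpath, the id "content" and the sample markup are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for the downloaded page.
xhtml = """<html><body>
  <div id="header">Site title</div>
  <div id="content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>"""


def div_by_xpath(markup, div_id):
    """Select the div with the given id via XPath and return its text."""
    root = ET.fromstring(markup)
    # ElementTree's XPath subset supports [@attrib='value'] predicates.
    node = root.find(f".//div[@id='{div_id}']")
    if node is None:
        return None
    # itertext() walks every text node under the selected div.
    return " ".join(t.strip() for t in node.itertext() if t.strip())


print(div_by_xpath(xhtml, "content"))
# prints: First paragraph. Second paragraph.
```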

Plumo
A: 

A possible alternative.

# We will store the web page in a string variable.
var string page

# Read the page into the string variable.
cat "http://www.abczyx.com/path/to/page.ext" > $page

# Output the portion in the third (3rd) instance of "<div...</div>"
stex -r -c "^<div&</div\>^3" $page

This code is in biterscripting. The 3 is a sample value that extracts the third div. To extract the div that contains a given string, say "ABC", use this command syntax instead.

stex -r -c "^<div&ABC&</div\>^" $page

Take a look at this script http://www.biterscripting.com/helppages/SS_ExtractTable.html . It shows how to extract an element (div, table, frame, etc.) when the elements are nested.

P M