ansaurus

Question

How do I screen scrape a website and get the data within a div?

Answer 1

A:

Fetch the website content using a cURL GET request. There's a code sample on the curl_exec manual page.

Use a regular expression to search for the data you need. There's a code sample on the preg_match manual page, but you'll need to read up on regular expressions to build the pattern you need. As Yacoby mentioned (and I hadn't thought of), a better idea may be to examine the DOM of the HTML page using PHP's SimpleXML or DOM parser.

Output the information you've found with the regex/parser in the HTML of your own page (within the required div).
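The fetch-then-match steps above can be sketched as follows. This is a minimal illustration in Python rather than PHP, and the names fetch_html and div_text_by_id, the id "price", and the sample markup are all assumptions made up for the example. The regex approach only holds up when the target div has no divs nested inside it, which is one reason a parser is usually the safer choice.

```python
import re
import urllib.request


def fetch_html(url: str) -> str:
    """Download a page with a plain GET request and decode it as text."""
    with urllib.request.urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def div_text_by_id(html: str, div_id: str):
    """Return the inner HTML of <div id="..."> or None if it is not found.

    The non-greedy match stops at the first </div>, so the result is
    truncated when the target div contains nested divs.
    """
    pattern = re.compile(
        r'<div\b[^>]*\bid=["\']' + re.escape(div_id) + r'["\'][^>]*>(.*?)</div>',
        re.IGNORECASE | re.DOTALL,
    )
    match = pattern.search(html)
    return match.group(1).strip() if match else None


# Hypothetical markup standing in for a downloaded page.
sample = '<html><body><div id="price"> 42 USD </div></body></html>'
print(div_text_by_id(sample, "price"))
```

In real use the string passed to div_text_by_id would come from fetch_html(url) for a page you are allowed to scrape.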

Andy Shellam 2010-03-26 12:11:23

The <center> cannot hold ... http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

Tim Post 2010-04-06 17:43:31

Answer 2

+6 A:

Download the page using cURL (there are a lot of examples in the documentation). Then use a DOM parser, for example Simple HTML DOM or PHP's DOM extension, to extract the value from the div element.
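As a rough illustration of the parser-based approach, here is a sketch in Python using only the standard library's html.parser, not the PHP libraries named above. The class DivTextExtractor, the helper div_text, and the id "box" are hypothetical names chosen for the example. Unlike a regex, it tracks nesting depth, so divs inside the target div do not cut the result short.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text inside the first <div> whose id matches target_id."""

    def __init__(self, target_id: str):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # div nesting depth inside the target; 0 = outside
        self.done = False   # True once the target div has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth > 0:
            self.depth += 1     # a div nested inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1      # entered the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def div_text(html: str, target_id: str) -> str:
    """Return the whitespace-normalized text of the target div ('' if absent)."""
    parser = DivTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Hypothetical markup standing in for a downloaded page.
sample = ('<div id="box">Total: <div class="n"><b>3</b> items</div>'
          ' in stock</div><div>other</div>')
print(div_text(sample, "box"))
```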

Yacoby 2010-03-26 12:12:25

Please comment when downvoting to give me the chance to correct or otherwise improve my answer.

Yacoby 2010-03-26 16:38:30

Answer 3

A:

You have to ask permission from the site owner first.

Col. Shrapnel 2010-03-26 12:28:36

+1: valid point

Jacco 2010-03-26 12:32:40

That doesn't come remotely close to answering the question and it's based on unfounded assumptions. The content could be public domain, for example.

Scott Reynen 2010-03-26 15:09:21

I love that "could be".

Col. Shrapnel 2010-03-26 16:26:03

This should be a comment rather than an answer, since it doesn't actually answer the question. However, it does bring up a valid point...generally speaking, unless it's really obvious that you're in the clear, it's probably a good idea to check with the site owner(s) first.

Beska 2010-04-08 18:16:09

Answer 4

A:

After downloading the page with cURL, use XPath to select the div and extract its content.
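A small sketch of the XPath idea, written in Python with the standard library's xml.etree.ElementTree. Two assumptions to keep in mind: ElementTree supports only a limited subset of XPath, and it needs well-formed markup, so the sample below is hand-written XHTML with a made-up id of "content". Real-world pages are often not well-formed and generally call for a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML standing in for a downloaded page.
xhtml = """<html>
  <body>
    <div id="nav">Home</div>
    <div id="content"><p>Hello, <b>world</b>!</p></div>
  </body>
</html>"""

root = ET.fromstring(xhtml)

# Select any descendant div whose id attribute equals "content".
node = root.find(".//div[@id='content']")

# itertext() yields the element's own text and that of all its descendants.
content = "".join(node.itertext()) if node is not None else None
print(content)
```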

Plumo 2010-03-28 10:47:04

Answer 5

A:

A possible alternative.

# We will store the web page in a string variable.
var string page

# Read the page into the string variable.
cat "http://www.abczyx.com/path/to/page.ext" > $page

# Output the portion in the third (3rd) instance of "<div...</div>"
stex -r -c "^<div&</div\>^3" $page

This code is written in biterscripting. The 3 is just a sample value that extracts the third div. If you want to extract the div that contains, say, the string "ABC", use this command syntax instead.

stex -r -c "^<div&ABC&</div\>^" $page

Take a look at the script at http://www.biterscripting.com/helppages/SS_ExtractTable.html which shows how to extract an element (div, table, frame, etc.) when the elements are nested.
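For comparison, the two selections described above (the third div, or a div containing "ABC") can be sketched in Python. This is not a translation of the stex command and does not claim to reproduce its matching rules. It assumes divs are numbered by the order of their opening tags and pairs each opening tag with its closing tag by nesting depth; the function names are made up for the example, and it ignores complications such as div-like text inside comments or scripts.

```python
import re

# Matches opening and closing div tags; group 1 is "/" for a closing tag.
DIV_TAG = re.compile(r"<(/?)div\b[^>]*>", re.IGNORECASE)


def div_spans(html: str):
    """Return (start, end) offsets of every balanced <div>...</div>,
    ordered by the position of the opening tag."""
    stack, spans = [], []
    for m in DIV_TAG.finditer(html):
        if not m.group(1):        # opening tag: remember where it starts
            stack.append(m.start())
        elif stack:               # closing tag that has a matching opener
            spans.append((stack.pop(), m.end()))
    return sorted(spans)


def nth_div(html: str, n: int):
    """Return the n-th div (1-based), or None if there are fewer than n."""
    spans = div_spans(html)
    if 1 <= n <= len(spans):
        start, end = spans[n - 1]
        return html[start:end]
    return None


def smallest_div_containing(html: str, needle: str):
    """Return the shortest div whose markup contains needle, or None."""
    hits = [html[s:e] for s, e in div_spans(html) if needle in html[s:e]]
    return min(hits, key=len) if hits else None


# Hypothetical markup standing in for a downloaded page.
sample = "<div>one</div><div>two <div>ABC</div></div><div>four</div>"
print(nth_div(sample, 3))
print(smallest_div_containing(sample, "ABC"))
```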

P M 2010-05-10 17:05:48
