I am trying to create a simple alert app for some friends.
Basically, I want to be able to extract the "price" and "stock availability" data from a webpage like the following two:

I have built the e-mail and SMS alert part, but now I want to get the quantity and price out of the webpages (those two or any others) so that I can compare the price and quantity available and alert us to place an order if a product is between some thresholds.
I have tried some regexes (found in tutorials, but I am way too much of a n00b for this) and haven't managed to get them working. Any good tips or examples?

Thanks,
Mike

+2  A: 

Whatever you do: don't use regular expressions to parse HTML, or bad things will happen. Use a parser instead.

I think regular expressions are ok for very specific use cases (i.e. the markup/text is always the same). But of course not for validating HTML etc. Parsers are always a good solution but sometimes they are overkill.
Felix Kling
I thought a regex would do the trick here, since I am only trying to extract two pieces of information from the page, and the format is quite standard...
Mike
@Felix Did you read the graphic description of what happens if you try to parse HTML with regular expressions? If you are very daring, click on the first link in my answer.
@Mike A "standard" format sounds like an ideal opportunity to use a standard tool: a parser.
@lutz: I only say that if the scope is clear, regex can be a fast/easy solution. I don't say regex should be used to analyze HTML in general.
Felix Kling
-1 for linking YET AGAIN that answer. Really, give us a break.
kemp
@kemp You don't have to click on any link if you don't like to.
+2  A: 

You are probably best off loading the HTML code into a DOM parser like this one and searching for the "pricing" table. However, any kind of scraping you do can break whenever they change their page layout, and is probably illegal without their consent.

The best way, though, would be to talk to the people who run the site, and see whether they have alternative, more reliable forms of data delivery (Web services, RSS, or database exports come to mind).

Pekka
I only want to do this for me and my friend, so that we can have a script look through the website every hour. They do not support any web services at this time. Database exports... haha, I really don't think so.
Mike
"Illegal?" Seriously..?
David Thomas
Yes. Many sites prohibit any kind of automated browsing/downloading/parsing of their sites' contents in their terms of service. In many jurisdictions, this works and can be enforced. It's unlikely there is going to be any trouble in this case but it's still always worth noting.
Pekka
Pekka do you have some sources on that? I'm interested in this subject
kemp
Scraping data and re-publishing it is a copyright offense in most parts of the world. When it comes to scraping it for private use, the situation looks less clear-cut than I thought. I came across this Google Answers question (http://answers.google.com/answers/threadview?id=746810); it is related to India but makes a few international points, too.
Pekka
Well, republishing copyright-protected content is an offense even if you do it by hand. I was interested in the illegal part of making an automated script to extract it -- not in what you do with that data.
kemp
As I said, it's not as straightforward as re-publishing, and not as easy to attack. Check out the link I posted, there are some pointers there.
Pekka
Ok, thanks (15 chars)
kemp
+2  A: 

Hi Mike,

First, this question goes into too much detail. Second, extracting data from a website might not be legitimate. However, I have some hints:

  1. Use Firebug or the Chrome/Safari Inspector to explore the HTML content and the pattern of the interesting information

  2. Test your RegEx to see if it matches. You may need to do this several times (multi-pass parsing/extraction)

  3. Write a client with cURL or, much simpler, use file_get_contents (NOTE that some hosts disable loading URLs with file_get_contents)

Personally, I'd rather use Tidy to convert the page to valid XHTML and then use XPath to extract the data, instead of RegEx. Why? Because XHTML is not a regular language and XPath is very flexible. You can also learn XSLT to transform the result.

Good luck!

Viet
A: 

As others have pointed out, it probably isn't a good idea to do this for many reasons, but here's the code anyways:

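$c = curl_init();

curl_setopt($c, CURLOPT_URL, 'http://www.sparkfun.com/commerce/product_info.php?products_id=9279');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
curl_setopt($c, CURLOPT_HEADER, 0);         // leave the response headers out

$d = curl_exec($c);
curl_close($c);

// Get the value from the first th of table.pricing.
// curl_exec() returns false on failure, so only match against a string.
$price = null;
if (is_string($d) && preg_match('#<table class="pricing">.*?<th>\$(.*?)</th>#s', $d, $match)) {
    $price = $match[1];
}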

The price of the product can now be found in the $price variable.

Note that you should ask permission first. I'm sure the owner of the site isn't pleased that you are using their data without permission and fetching 11 KB of data to get just one floating-point number.

As I'm not adept in using XML parsers, I'd be interested to see what a 'proper' solution would look like.

Tatu Ulmanen
Although you were generous and enthusiastic, I don't think source code was necessary. The main issue here is legitimacy, and he needs to work it out his own way.
Viet
A: 
$content = file_get_contents('http://www.sparkfun.com/commerce/product_info.php?products_id=9279');
if ($content === false) {
    exit("Could not fetch the page\n");
}

// Fall back to null when a pattern does not match (e.g. after a redesign)
$price = preg_match('#<tr><th>(.*)</th> <td><b>price</b></td></tr>#', $content, $match) ? $match[1] : null;

$in_stock = preg_match('#<input type="hidden" name="quantity_on_hand" value="(.*?)">#', $content, $match) ? $match[1] : null;

echo "Price: $price - Availability: $in_stock\n";
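The threshold check described in the question could then be applied to the extracted $price and $in_stock strings. The following is only a minimal sketch: the helper names parsePrice and shouldAlert, the US-style price format, and the threshold values are all assumptions made for illustration, not something taken from the site or from this thread.

```php
<?php
// Convert a scraped price string such as "$1,299.50" to a float.
// Assumes a US-style format (comma as thousands separator, dot as decimal point).
// Returns null when nothing numeric is left after stripping other characters.
function parsePrice(string $raw): ?float
{
    $clean = preg_replace('/[^0-9.]/', '', $raw);
    return is_numeric($clean) ? (float) $clean : null;
}

// Hypothetical rule: alert when the price lies inside [$minPrice, $maxPrice]
// and at least $minStock units are available.
function shouldAlert(?float $price, int $stock, float $minPrice, float $maxPrice, int $minStock): bool
{
    return $price !== null
        && $price >= $minPrice
        && $price <= $maxPrice
        && $stock >= $minStock;
}

// Example values standing in for the scraped strings.
$price = parsePrice('$19.95');
$stock = (int) '42';

if (shouldAlert($price, $stock, 10.00, 25.00, 5)) {
    echo "Alert: price $price, $stock in stock\n";
}
```

Keeping the parsing separate from the rule means a redesign of the page only affects the extraction patterns, not the alert logic.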
kemp
This works like a charm at first look, and it is just the simple solution I was looking for! Thanks a lot.
Mike
It is very easily modified to get the product name and other info out of the text... WOW, thanks a lot. I mean, it's just the simplest way to get some meaningful data out of many simple websites.
Mike
You're welcome :) If you have specific needs, regular expressions can be perfectly fine to mine data from an HTML page. They break if the structure of the page changes, but so do solutions based on parsers.
kemp
The only thing that can change is different links on the page or stuff like that, but I check the website a lot, so I can tell if the design has changed and make the appropriate change in the regex.
Mike
Would the downvoter care to say why?
kemp
No matter what, this is the answer I was looking for. For anyone looking to do this: it is worth two minutes of looking into.
Mike
+3  A: 

It's called screen scraping, in case you need to google for it.

I would suggest that you use a DOM parser and XPath expressions instead. Feed the HTML through HtmlTidy first, to ensure that it's valid markup.

For example:

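$html = file_get_contents("http://www.example.com");
$html = tidy_repair_string($html);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
// Now query the document. Use //th because the th cells sit inside tr rows,
// not directly under the table element:
foreach ($xpath->query('//table[@class="pricing"]//th') as $node) {
  // A DOMElement cannot be echoed directly; print its text content instead
  echo $node->nodeValue, "\n";
}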
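
To try the same approach without hitting a live site, here is a small self-contained sketch that pulls both the price and the stock quantity out of an inline HTML string. The markup is hypothetical: it is loosely modelled on the selectors that appear in this thread (table.pricing and a hidden quantity_on_hand input), and the real page may be structured differently. Tidy is left out so that the sketch only depends on the DOM extension.

```php
<?php
// Hypothetical markup, assumed for illustration only.
$html = <<<'HTML'
<html><body>
<table class="pricing">
  <tr><th>$19.95</th> <td><b>price</b></td></tr>
</table>
<form>
  <input type="hidden" name="quantity_on_hand" value="42">
</form>
</body></html>
HTML;

// Collect libxml warnings internally instead of printing them;
// real-world HTML is rarely perfectly valid.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// evaluate() with string() yields the text of the first matching node,
// or an empty string when nothing matches.
$price = trim($xpath->evaluate('string(//table[@class="pricing"]//th)'));
$stock = $xpath->evaluate('string(//input[@name="quantity_on_hand"]/@value)');

echo "Price: $price - Availability: $stock\n";
```

Using string() keeps the result a plain PHP string, so an empty value is an easy signal that the page layout has changed.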
troelskn
+1 for recommending the only sensible thing - a parser.
Tomalak
A car is the best choice for general travelling, but if you need to visit your neighbour a simple walk might suffice.
kemp
A: 

I tried it and it works fine, but I have another problem: this example extracts only the first item that it finds. What if the webpage contains something like this: name 1 ... ...

How can I display all the names?

chris
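
One possible way to collect every occurrence with the regex approach is preg_match_all, which keeps going after the first hit where preg_match stops. The markup in the question above was lost, so the td class="name" layout below is purely an assumption for illustration; the pattern would need adjusting to the real page.

```php
<?php
// Hypothetical markup: the original structure is unknown,
// so this layout is only assumed for the example.
$content = <<<'HTML'
<tr><td class="name">name 1</td></tr>
<tr><td class="name">name 2</td></tr>
<tr><td class="name">name 3</td></tr>
HTML;

// preg_match() returns after the first match; preg_match_all() collects all of them.
// $matches[1] holds the text captured by the first group for each match.
preg_match_all('#<td class="name">(.*?)</td>#', $content, $matches);

foreach ($matches[1] as $name) {
    echo $name, "\n";
}
```

With the DOM approach, the equivalent is simply to loop over every node returned by the XPath query instead of reading only the first one.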