ansaurus

Question

Answer 1

+2 A:

Whatever you do, don't use regular expressions to parse HTML, or bad things will happen. Use a parser instead.

2010-01-07 11:33:10

I think regular expressions are ok for very specific use cases (i.e. the markup/text is always the same). But of course not for validating HTML etc. Parsers are always a good solution but sometimes they are overkill.

Felix Kling 2010-01-07 11:37:12

I thought a regex would do the trick here, since I am only trying to extract two pieces of information from the page and the format is quite standard...

Mike 2010-01-07 11:40:03

@Felix Did you read the graphic description of what happens if you try to parse HTML with regular expressions? If you are very daring, click on the first link in my answer.

2010-01-07 11:40:16

@Mike A "standard" format sounds like an ideal opportunity to use a standard tool: a parser.

2010-01-07 11:40:56

@lutz: I only say that if the scope is clear, regex can be a fast/easy solution. I don't say regex should be used to analyze HTML in general.

Felix Kling 2010-01-07 11:48:39

-1 for linking YET AGAIN that answer. Really, give us a break.

kemp 2010-01-07 12:01:35

@kemp You don't have to click on any link if you don't like to.

2010-01-07 12:13:27

Answer 2

+2 A:

You are probably best off loading the HTML code into a DOM parser like this one and searching for the "pricing" table. However, any kind of scraping you do can break whenever they change their page layout, and is probably illegal without their consent.

The best way, though, would be to talk to the people who run the site, and see whether they have alternative, more reliable forms of data delivery (Web services, RSS, or database exports come to mind).

Pekka 2010-01-07 11:37:57

I only want to do this for me and my friend, so that we can have a script look through the website every hour. They do not support any web services at this time. Database exports... haha, I really don't think so.

Mike 2010-01-07 11:53:01

"Illegal?" Seriously..?

David Thomas 2010-01-07 12:52:54

Yes. Many sites prohibit any kind of automated browsing/downloading/parsing of their sites' contents in their terms of service. In many jurisdictions, this works and can be enforced. It's unlikely there is going to be any trouble in this case but it's still always worth noting.

Pekka 2010-01-07 13:09:53

Pekka do you have some sources on that? I'm interested in this subject

kemp 2010-01-07 14:19:51

Scraping data and re-publishing it is a copyright offense in most parts of the world. When it comes to scraping it for private use, the situation looks less clear-cut than I thought. I came across this Google Answers question, which is about India but makes a few international points, too: http://answers.google.com/answers/threadview?id=746810

Pekka 2010-01-07 15:17:14

Well republishing copyright protected contents is an offense even if you do it by hand, I was interested about the illegal part of making an automated script to extract them -- not what you do with that data.

kemp 2010-01-07 17:12:56

As I said, it's not as straightforward as re-publishing, and not as easy to attack. Check out the link I posted, there are some pointers there.

Pekka 2010-01-07 17:15:04

Ok, thanks (15 chars)

kemp 2010-01-07 17:19:22

Answer 3

+2 A:

Hi Mike,

First, this question goes into too much detail. Second, extracting data from a website might not be legitimate. However, I have some hints:

Use Firebug or the Chrome/Safari Inspector to explore the HTML content and find the pattern around the information you are interested in
Test your RegEx to see if it matches. You may need to do this several times (multi-pass parsing/extraction)
Write a client with cURL or, much simpler, use file_get_contents (NOTE that some hosting providers disable loading URLs with file_get_contents)
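
The last hint could be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name fetch_page is hypothetical, and it assumes that checking the allow_url_fopen setting is a reasonable way to detect hosts that disable URL loading with file_get_contents, falling back to cURL when that extension is available.

```php
<?php
// Hypothetical helper (not from the thread): try file_get_contents first,
// then fall back to cURL. Returns the response body, or null on failure.
function fetch_page(string $url): ?string
{
    // Some hosts switch allow_url_fopen off, which disables URL loading
    // through file_get_contents.
    if (filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        $body = @file_get_contents($url);
        if ($body !== false) {
            return $body;
        }
    }

    // Fall back to the cURL extension when it is installed.
    if (function_exists('curl_init')) {
        $c = curl_init($url);
        curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
        $body = curl_exec($c);
        if ($body !== false) {
            return $body;
        }
    }

    return null;
}

// Example call (placeholder URL):
// $html = fetch_page('http://www.example.com');
```

Callers should check for null before running any pattern or parser over the result.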

Personally, I'd rather use Tidy to convert the page to valid XHTML and then use XPath to extract the data, instead of RegEx. Why? Because XHTML is not a regular language, so regular expressions cannot parse it reliably, while XPath is very flexible. You can also learn XSLT to transform the result.

Good luck!

Viet 2010-01-07 11:43:24

Answer 4

A:

As others have pointed out, it probably isn't a good idea to do this for many reasons, but here's the code anyways:

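$c = curl_init();

curl_setopt($c, CURLOPT_URL, 'http://www.sparkfun.com/commerce/product_info.php?products_id=9279');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_HEADER, 0);

// curl_exec returns false on failure, so check before matching
$d = curl_exec($c);
if ($d === false) {
    die('Request failed: ' . curl_error($c));
}

// Get the value from the first th of table.pricing
// ($price stays null if the pattern does not match)
$price = null;
if (preg_match('#<table class="pricing">.*?<th>\$(.*?)</th>#s', $d, $match)) {
    $price = $match[1];
}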

The price of the product can now be found in the $price variable.

Note that you should ask permission first. I'm sure the owner of the site isn't pleased that you are using their data without permission and fetching 11 KB of data to get just one floating point number.

As I'm not adept in using XML parsers, I'd be interested to see what a 'proper' solution would look like.

Tatu Ulmanen 2010-01-07 11:53:45

Although you were generous and enthusiastic, I don't think source code was necessary. The main issue here is legitimacy, and he needs to work it out his own way.

Viet 2010-01-07 11:57:52

Answer 5

A:

$content = file_get_contents('http://www.sparkfun.com/commerce/product_info.php?products_id=9279');
if ($content === false) {
    die("Could not fetch the page\n");
}

// Fall back to 'n/a' when a pattern does not match, instead of reading an undefined $match[1]
$price = preg_match('#<tr><th>(.*)</th> <td><b>price</b></td></tr>#', $content, $match) ? $match[1] : 'n/a';

$in_stock = preg_match('#<input type="hidden" name="quantity_on_hand" value="(.*?)">#', $content, $match) ? $match[1] : 'n/a';

echo "Price: $price - Availability: $in_stock\n";

kemp 2010-01-07 11:58:59

This works like a charm at first look, and is just the simple solution I was looking for! Thanks a lot.

Mike 2010-01-07 12:08:17

It is very easily modified to get the product name and other info out of the text... Wow, thanks a lot! I mean, it's just the simplest way to get some meaningful data out of many simple websites.

Mike 2010-01-07 12:12:51

You're welcome :) If you have specific needs, regular expressions can be perfectly fine to mine data from an HTML page. They break if the structure of the page changes, but so do solutions based on parsers.

kemp 2010-01-07 12:30:44

The only thing that can change is different links on the page or some stuff like that, but I check the website a lot, so I can tell if the design has changed and make the appropriate change in the regex.

Mike 2010-01-07 12:43:58

Would the downvoter care to say why?

kemp 2010-01-07 17:06:50

No matter what, this is the answer I was looking for. For anyone looking to do this, it is worth two minutes of looking into.

Mike 2010-01-08 10:29:54

Answer 6

+3 A:

It's called screen scraping, in case you need to google for it.

I would suggest that you use a dom parser and xpath expressions instead. Feed the HTML through HtmlTidy first, to ensure that it's valid markup.

For example:

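$html = file_get_contents("http://www.example.com");
$html = tidy_repair_string($html);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
// Now query the document. The th cells sit inside tr rows,
// so match them at any depth below the table with //
foreach ($xpath->query('//table[@class="pricing"]//th') as $node) {
  // A DOMElement cannot be echoed directly; print its text content
  echo $node->nodeValue, "\n";
}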

troelskn 2010-01-07 12:01:31

+1 for recommending the only sensible thing - a parser.

Tomalak 2010-01-07 14:35:48

A car is the best choice for general travelling, but if you need to visit your neighbour a simple walk might suffice.

kemp 2010-01-07 17:24:56

Answer 7

A:

I tried it and it works fine, but I have another problem: this example extracts only the first match it finds. What if the webpage contains something like this: name 1 ... ...

How can I display all the names?

chris 2010-01-26 12:33:38
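
One possible way to collect every match rather than only the first is preg_match_all, which stores all captures instead of stopping at the first hit. The sketch below is only an illustration: the markup in $content is hypothetical sample data, since the actual page structure is not shown in the question, and the pattern would need to be adapted to the real page.

```php
<?php
// Hypothetical sample markup; the real page structure is not shown in the thread.
$content = '<ul>'
    . '<li class="name">name 1</li>'
    . '<li class="name">name 2</li>'
    . '<li class="name">name 3</li>'
    . '</ul>';

// preg_match_all fills $matches[1] with the first capture group of every match,
// whereas preg_match stops after the first one.
preg_match_all('#<li class="name">(.*?)</li>#s', $content, $matches);

foreach ($matches[1] as $name) {
    echo $name, "\n";
}
```

With the parser-based approach from Answer 6, the foreach over the XPath query result already visits every matching node, so no extra step is needed there.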


Extract data from website via PHP
