I'm trying to extract the mileage value from different eBay pages, but I'm stuck: the pages differ slightly from one another, so I seem to need too many patterns. Can you help me with a better pattern? Some example items:

http://cgi.ebay.com/ebaymotors/1971-Chevy-C10-Shortbed-Truck-/250647101696?cmd=ViewItem&pt=US_Cars_Trucks&hash=item3a5bbb4100
http://cgi.ebay.com/ebaymotors/1987-HANDICAP-LEISURE-VAN-W-WHEEL-CHAIR-LIFT-/250647101712?cmd=ViewItem&pt=US_Cars_Trucks&hash=item3a5bbb4110
http://cgi.ebay.com/ebaymotors/ws/eBayISAPI.dll?ViewItemNext&item=250647101696

Please see my current patterns at the following link (I still cannot figure out how to escape the HTML here):

http://pastebin.com/zk4HAY3T

However, they are not enough, as new patterns keep turning up.

A: 

This should be a bit more generic - it doesn't care what's inside the html tags. It works on all three of the links you provided.

/Mileage[^<]*<[^>]*><[^>]*>(.*?)\s*miles/i

Of course, there could be better ways depending on what other constraints you have, but this is a good starting point.

Recognizing the duplication there, you could simplify (logically, at least) a bit more:

/Mileage[^<]*(?:<[^>]*>){2}(.*?)\s*miles/i

The pattern looks for the word 'Mileage', then two HTML tags in a row, then the value, then the word 'miles'. The two tags are matched by the (?:<[^>]*>){2} part. The ?: makes the group non-capturing, so that $matches[1] still contains the number you're looking for, and the {2} requires the preceding group to match exactly twice.
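As a rough sketch of how this pattern could be used with preg_match: the $html string below is made-up markup for illustration only (the real eBay pages may be structured differently), and the comma stripping is an extra step added here to turn the match into a number.

```php
<?php
// Hypothetical markup for illustration; the real listing pages may differ.
$html = '<th>Mileage:</th><td>123,456 miles</td>';

// The simplified pattern from above: 'Mileage', two tags, value, 'miles'.
$pattern = '/Mileage[^<]*(?:<[^>]*>){2}(.*?)\s*miles/i';

$mileage = null;
if (preg_match($pattern, $html, $matches)) {
    // $matches[1] is the only capturing group, here "123,456".
    // Remove the thousands separators before casting to an integer.
    $mileage = (int) str_replace(',', '', $matches[1]);
}

var_dump($mileage);
```

If the page puts a different number of tags between 'Mileage' and the value, the pattern will not match and $mileage stays null, so it is worth checking for that case.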

potatoe
+2  A: 

Don't use regular expressions to parse HTML. Even for a relatively simple thing such as this, regular expressions make you highly dependent on the exact markup.

You can use DOMDocument and XPath to grab the value nicely, and it's somewhat more resilient to changes in the page:

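  // $url is the address of the listing page to scrape.
  $doc = new DOMDocument();

  // Real-world pages often contain invalid HTML; @ suppresses the parser warnings.
  @$doc->loadHTMLFile($url);

  $xpath = new DOMXPath($doc);
  foreach ($xpath->query('//th[contains(., "Mileage")]/following-sibling::td') as $td) {
    var_dump($td->textContent);
  }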

The XPath query searches for a <th> whose text contains the word "Mileage", then selects the sibling <td>s that follow it in the same row.

You can then lop off the miles suffix and get rid of commas using str_replace or substr.
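A minimal sketch of that clean-up step, run against an inline HTML string so it works without a network request. The markup in $html and the helper name parseMileage are assumptions made for this example, not anything taken from the actual pages.

```php
<?php
// Hypothetical markup for illustration; the real listing pages may differ.
$html = '<table><tr><th>Mileage:</th><td> 123,456 miles </td></tr></table>';

// Hypothetical helper: strip the "miles" suffix and commas, return an int or null.
function parseMileage(string $text): ?int
{
    $clean = str_ireplace('miles', '', $text);        // drop the unit, any case
    $clean = trim(str_replace(',', '', $clean));      // drop commas and padding
    return preg_match('/^\d+$/', $clean) ? (int) $clean : null;
}

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$values = [];
foreach ($xpath->query('//th[contains(., "Mileage")]/following-sibling::td') as $td) {
    $values[] = parseMileage($td->textContent);
}

var_dump($values);
```

Returning null for anything that is not purely digits after the clean-up makes unexpected cell contents easy to spot, rather than silently turning them into 0.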

Chris Smith