ansaurus

Question

Answer 1

A:

This should be a bit more generic: it doesn't care what's inside the HTML tags. It works on all three of the links you provided.

/Mileage[^<]*<[^>]*><[^>]*>(.*?)\s*miles/i

Of course, there could be better ways depending on what other constraints you have, but this is a good starting point.

Recognizing the duplication there, you could simplify (logically, at least) a bit more:

/Mileage[^<]*(?:<[^>]*>){2}(.*?)\s*miles/i

You're looking for two HTML tags in a row between the words 'Mileage' and 'miles'. That's the (?:<[^>]*>){2} part. The ?: makes the group non-capturing, so $matches[1] still contains the number you're looking for, and the {2} means the preceding group must match exactly twice.
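As a rough illustration of how the simplified pattern behaves with preg_match, here is a minimal sketch. The $html snippet is invented for the example (the real pages may use different markup), and stripping the comma with str_replace is just one way to turn the captured text into a number.

```php
<?php
// Hypothetical markup: two tags in a row sit between "Mileage" and the number.
$html = '<tr><th>Mileage:</th><td>45,120 miles</td></tr>';

// The simplified pattern from above; (?:...) is non-capturing, so the
// lazy (.*?) is the first and only capturing group.
$pattern = '/Mileage[^<]*(?:<[^>]*>){2}(.*?)\s*miles/i';

if (preg_match($pattern, $html, $matches)) {
    // $matches[0] is the whole match, $matches[1] the captured number text.
    $mileage = (int) str_replace(',', '', $matches[1]);
    echo $mileage, "\n";
} else {
    echo "No mileage found\n";
}
```

If the markup gains or loses a tag between the label and the value, the {2} no longer lines up and preg_match returns 0, which is the fragility the pattern trades for its brevity.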

potatoe 2010-06-15 01:29:40

Answer 2

+2 A:

Don't use regular expressions to parse HTML. Even for a relatively simple thing such as this, regular expressions make you highly dependent on the exact markup.

You can use DOMDocument and XPath to grab the value nicely, and it's somewhat more resilient to changes in the page:

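  $doc = new DOMDocument();
  // Suppress the warnings libxml emits for malformed real-world HTML.
  @$doc->loadHTMLFile($url);

  $xpath = new DOMXPath($doc);
  // Every <td> that follows a <th> whose text contains "Mileage".
  foreach ($xpath->query('//th[contains(., "Mileage")]/following-sibling::td') as $td) {
    var_dump($td->textContent);
  }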

The XPath query searches for a <th> which contains the word "Mileage", then selects the <td>s following it.

You can then lop off the miles suffix and get rid of commas using str_replace or substr.
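One possible way to do that cleanup is sketched below. The function name parseMileage is hypothetical, the sample strings are made up, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: turns text such as "45,120 miles" into an integer.
// Returns null when no clean number is left after stripping.
function parseMileage(string $text): ?int
{
    // Drop the "miles" suffix (case-insensitively) and surrounding whitespace.
    $number = trim(str_ireplace('miles', '', $text));
    // Remove thousands separators.
    $number = str_replace(',', '', $number);

    // Accept only plain digit strings, so "n/a" or "" yields null.
    return preg_match('/^\d+$/', $number) ? (int) $number : null;
}

var_dump(parseMileage('45,120 miles'));
var_dump(parseMileage("  7,500 Miles\n"));
var_dump(parseMileage('n/a'));
```

In the loop above you would pass $td->textContent to the helper instead of dumping it directly.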

Chris Smith 2010-06-15 01:46:48

tags: regex, php, preg_match