Hi,

I'm trying to scrape a price from a web page using PHP and Regexes. The price will be in the format £123.12 or $123.12 (i.e., pounds or dollars).

I'm loading the page contents using libcurl and passing the output to preg_match_all. It looks a bit like this:

$contents = curl_exec($curl);

preg_match_all('/(?:\$|£)[0-9]+(?:\.[0-9]{2})?/', $contents, $matches);

So far so simple. The problem is, PHP isn't matching anything at all - even when there are prices on the page. I've narrowed it down to there being a problem with the '£' character - PHP doesn't seem to like it.

I think this might be a charset issue. But whatever I do, I can't seem to get PHP to match it! Anyone have any ideas?

(Edit: I should note if I try using the Regex Test Tool using the same regex and page content, it works fine)

A: 
Daok
Doesn't work, unfortunately :(
Phill Sacre
I have edited the regex and removed a few other things. Check the screenshot. Are you sure the problem isn't in how you use the matches after the regex runs?
Daok
I just noticed your edit. If the regex works fine in the test tool, the problem might be the encoding of the page that curl returns, which could be affecting the $ and £ characters. You might want to output the curl data to check it.
Daok
Yep, turns out the page curl returned was encoded as ISO-8859-1, which apparently PHP doesn't like. Converting it to UTF-8 seems to work.
Phill Sacre
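For reference, here is a minimal sketch of the conversion described in the comment above. It assumes the mbstring extension is available and that the source charset is already known to be ISO-8859-1. The sample string stands in for the curl output. Writing the pound sign as an explicit code point with the /u modifier is an addition of this sketch, not part of the original regex, so that the pattern does not depend on how the PHP source file itself is encoded.

```php
<?php
// Stand-in for curl_exec() output: in ISO-8859-1 the pound sign is the single byte 0xA3.
$contents = "Price: \xA3123.12 or \$45";

// Assumption: the page is known to be ISO-8859-1. Requires the mbstring extension.
$utf8 = mb_convert_encoding($contents, 'UTF-8', 'ISO-8859-1');

// \x{00A3} is the pound sign; /u makes PCRE treat pattern and subject as UTF-8.
preg_match_all('/(?:\$|\x{00A3})[0-9]+(?:\.[0-9]{2})?/u', $utf8, $matches);

print_r($matches[0]);
```

Without the conversion step, the same pattern would not find the pound price, because the raw byte 0xA3 is not a valid UTF-8 sequence.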
+1  A: 

Maybe the pound sign is being served as its HTML entity? I think you should try your regexp with some sort of coaching program (i.e. match it against fixed text locally).

I'd change the regexp to this: '/(?:\$|£)\d+(?:\.\d{2})?/'

Eimantas
Thanks - I tried saving it locally and got an error when opening the file. If I convert the string to UTF-8, it works! So I guess I just need to detect the charset.
Phill Sacre
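One possible way to detect the charset is to read it from the Content-Type response header, which curl exposes through curl_getinfo($curl, CURLINFO_CONTENT_TYPE). The sketch below is only an illustration: charset_from_content_type is a hypothetical helper name, the ISO-8859-1 fallback is an assumption, and a charset declared only in a meta tag inside the page is not handled.

```php
<?php
// Hypothetical helper: pull the charset out of a Content-Type header value,
// e.g. the string returned by curl_getinfo($curl, CURLINFO_CONTENT_TYPE).
function charset_from_content_type(?string $contentType, string $default = 'ISO-8859-1'): string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([\w.:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    // Assumption: fall back to ISO-8859-1 when no charset is declared.
    return $default;
}

// Stand-ins for the real header and body returned by curl.
$charset = charset_from_content_type('text/html; charset=iso-8859-1');
$body    = "\xA3123.12";

// Convert only when the page is not already UTF-8 (requires mbstring).
$utf8 = ($charset === 'UTF-8') ? $body : mb_convert_encoding($body, 'UTF-8', $charset);
```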
A: 

This should work for simple values.

'#(?:\$|\£|\€)(\d+(?:\.\d+)?)#'

This will not work with thousands separators, as in 234,343 or 34,454.45.

OIS
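If thousands separators matter, one way to extend the pattern above is to accept either a plain run of digits or comma-grouped thousands. This is a sketch rather than a complete price parser: it assumes commas are the grouping character and a dot is the decimal point, and it writes the currency symbols as code points with the /u modifier (PHP 7 or later for the "\u{...}" string escapes).

```php
<?php
// Group 1 captures the number: comma-grouped thousands or plain digits, then optional decimals.
$pattern = '/(?:\$|\x{00A3}|\x{20AC})((?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?)/u';

// Sample UTF-8 text containing pound, dollar and euro prices.
$text = "\u{00A3}234,343 and \$34,454.45 and \u{20AC}12.5";

preg_match_all($pattern, $text, $matches);

// Strip the separators to get values that can be cast to float.
$numbers = array_map(
    function ($n) { return (float) str_replace(',', '', $n); },
    $matches[1]
);

print_r($matches[1]);
```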