I am currently trying to process a CSV file in PHP using preg_match(). An example of the data that I am trying to process is below:

"SN120187","Aldersr Rd Nr Shops","","STHPTN","50 56.4241N","1 25.7587W","1001077307","2010-05-30 15:29:49","10","","SURRSHLT3x32","BSU243L1","iiipiiipiiipiiipiii",

"HA035028","Hursley Road - Leigh House Hospital","","HURSLEY","50 59.6772N","1 23.4412W","","","24","","","","The quick brown fox jumped over the lazy dog The quick brown fox jumped over the lazy dog",

This is the regular expression that I am trying to use on this data:

if(preg_match('/^"(?P<code>.+)","(?P<description>.+)","(?P<bay>.*)","(?P<area>.+)","(?P<lat>.+)","(?P<lon>.+)","(?P<build>.*)","(?P<msgTime>.*)","(?P<routes>.*)","(?P<simNo>.*)","(?P<displayType>.*)","(?P<version>.*)","(?P<comments>.*)",$/', $line, $matches)){}

The regular expression works on 95% of the data. However, every line that fails has a non-empty last field.

I began playing around with the data (mainly the last field) and found that the following lines will not pass through the regex:

"SN120187","Aldersr Rd Nr Shops","","STHPTN","50 54.5512N","1 22.9273W","1001077307","2010-05-30 15:29:49","10","","SURRSHLT3x32","BSU243L1","iiiipiiiipiiiipiiii",

"HA035028","Hursley Road - Leigh House Hospital","","HURSLEY","52 58.3498N","1 26.5421W","","","24","","","","iiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipii",

However, if you remove one character from the last field of the above data, it will pass. From playing around with it, I have found that there is no consistent pattern to this error: the overall length of the string does not seem to matter (adding extra characters to other fields shows this), and the length of the final field does not matter either.

I have no idea what is going on. Does anyone have any ideas?

I am currently running PHP version 5.3.2, and no error messages are appearing.

+2  A: 

If this is CSV data, use a CSV processing function like str_getcsv for strings or fgetcsv for reading from a file.
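As a rough sketch of that suggestion, the following parses one of the failing lines from the question with str_getcsv() and labels the fields. The helper name parseStopLine and the $names list are made up for this illustration (the names simply mirror the capture groups in the question's regex), and the delimiter, enclosure and escape arguments are spelled out with their usual defaults:

```php
<?php
// Field names copied from the named groups in the question's regex.
$names = array('code', 'description', 'bay', 'area', 'lat', 'lon', 'build',
               'msgTime', 'routes', 'simNo', 'displayType', 'version', 'comments');

// Hypothetical helper: returns an associative array, or null if the line is too short.
function parseStopLine($line, array $names)
{
    $fields = str_getcsv($line, ',', '"', '\\');
    // The sample lines end with a trailing comma, which can produce an extra
    // empty field, so keep only as many fields as there are names.
    $fields = array_slice($fields, 0, count($names));
    if (count($fields) !== count($names)) {
        return null;
    }
    return array_combine($names, $fields);
}

$line = '"SN120187","Aldersr Rd Nr Shops","","STHPTN","50 54.5512N","1 22.9273W","1001077307","2010-05-30 15:29:49","10","","SURRSHLT3x32","BSU243L1","iiiipiiiipiiiipiiii",';
$row = parseStopLine($line, $names);
var_dump($row['code'], $row['routes'], $row['comments']);
```

When reading straight from a file, fgetcsv() on the file handle should be able to replace the str_getcsv() call inside the helper.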

Gumbo
I've been having trouble with str_getcsv(), but I am more bemused as to why the above is not working. I know I can do CSV stuff in other ways, but I'm really puzzled by this regex problem.
Mabbage
A: 

I tried it locally and it was the same as you described, I have PHP 5.2.10-2ubuntu6.

First try: I removed "(?P<comments>.*)", from your pattern:

$line='"HA035028","Hursley Road - Leigh House Hospital","","HURSLEY","52 58.3498N","1 26.5421W","","","24","","","","iiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipii",';

$r=preg_match('/^"(?P<code>.+)","(?P<description>.+)","(?P<bay>.*)","(?P<area>.+)","(?P<lat>.+)","(?P<lon>.+)","(?P<build>.*)","(?P<msgTime>.*)","(?P<routes>.*)","(?P<simNo>.*)","(?P<displayType>.*)","(?P<version>.*)",$/', $line, $matches);

var_dump($r, $matches);

Output:

int(1)
array(25) {
  [0]=>
  string(169) ""HA035028","Hursley Road - Leigh House Hospital","","HURSLEY","52 58.3498N","1 26.5421W","","","24","","","","iiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipii","
  ["code"]=>
  string(8) "HA035028"
  [1]=>
  string(8) "HA035028"
  ["description"]=>
  string(35) "Hursley Road - Leigh House Hospital"
  [2]=>
  string(35) "Hursley Road - Leigh House Hospital"
  ["bay"]=>
  string(0) ""
  [3]=>
  string(0) ""
  ["area"]=>
  string(7) "HURSLEY"
  [4]=>
  string(7) "HURSLEY"
  ["lat"]=>
  string(11) "52 58.3498N"
  [5]=>
  string(11) "52 58.3498N"
  ["lon"]=>
  string(13) "1 26.5421W",""
  [6]=>
  string(13) "1 26.5421W",""
  ["build"]=>
  string(0) ""
  [7]=>
  string(0) ""
  ["msgTime"]=>
  string(2) "24"
  [8]=>
  string(2) "24"
  ["routes"]=>
  string(0) ""
  [9]=>
  string(0) ""
  ["simNo"]=>
  string(0) ""
  [10]=>
  string(0) ""
  ["displayType"]=>
  string(0) ""
  [11]=>
  string(0) ""
  ["version"]=>
  string(57) "iiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipii"
  [12]=>
  string(57) "iiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipii"
}

Note that <version> now matches the last field, while <lon> matches two fields.
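The same effect can be reproduced on a much smaller input. This is only a toy sketch (the three-field line and the group names first and second are invented for the illustration), but it shows that . is allowed to match the quotes and commas between fields, so a greedy group can swallow a field boundary whenever that lets the whole pattern match:

```php
<?php
// Three quoted fields, but a pattern with only two capture groups.
$line = '"a","b","c",';

$r = preg_match('/^"(?P<first>.+)","(?P<second>.*)",$/', $line, $m);

// The match succeeds because <first> stretches across the first separator:
// $m['first'] is a","b and $m['second'] is c
var_dump($r, $m['first'], $m['second']);
```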


Second try: I replaced every occurrence of . with [^"]:

$line='"HA035028","Hursley Road - Leigh House Hospital","","HURSLEY","52 58.3498N","1 26.5421W","","","24","","","","iiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipii",';

$r=preg_match('/^"(?P<code>[^"]+)","(?P<description>[^"]+)","(?P<bay>[^"]*)","(?P<area>[^"]+)","(?P<lat>[^"]+)","(?P<lon>[^"]+)","(?P<build>[^"]*)","(?P<msgTime>[^"]*)","(?P<routes>[^"]*)","(?P<simNo>[^"]*)","(?P<displayType>[^"]*)","(?P<version>[^"]*)","(?P<comments>[^"]*)",$/', $line, $matches);

var_dump($r, $matches);

Output:

int(1)
array(27) {
  [0]=>
  string(169) ""HA035028","Hursley Road - Leigh House Hospital","","HURSLEY","52 58.3498N","1 26.5421W","","","24","","","","iiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipii","
  ["code"]=>
  string(8) "HA035028"
  [1]=>
  string(8) "HA035028"
  ["description"]=>
  string(35) "Hursley Road - Leigh House Hospital"
  [2]=>
  string(35) "Hursley Road - Leigh House Hospital"
  ["bay"]=>
  string(0) ""
  [3]=>
  string(0) ""
  ["area"]=>
  string(7) "HURSLEY"
  [4]=>
  string(7) "HURSLEY"
  ["lat"]=>
  string(11) "52 58.3498N"
  [5]=>
  string(11) "52 58.3498N"
  ["lon"]=>
  string(10) "1 26.5421W"
  [6]=>
  string(10) "1 26.5421W"
  ["build"]=>
  string(0) ""
  [7]=>
  string(0) ""
  ["msgTime"]=>
  string(0) ""
  [8]=>
  string(0) ""
  ["routes"]=>
  string(2) "24"
  [9]=>
  string(2) "24"
  ["simNo"]=>
  string(0) ""
  [10]=>
  string(0) ""
  ["displayType"]=>
  string(0) ""
  [11]=>
  string(0) ""
  ["version"]=>
  string(0) ""
  [12]=>
  string(0) ""
  ["comments"]=>
  string(57) "iiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipii"
  [13]=>
  string(57) "iiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipiiiipii"
}
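If the long pattern is awkward to maintain by hand, it could also be assembled from a list of field names. This is just a sketch: the $fields array and its true/false flags are my own shorthand for "must be non-empty" (+) versus "may be empty" (*), and the resulting pattern should be the same as the one in the second try. The last field of the test line is built with str_repeat() only to avoid typing it out:

```php
<?php
// Field name => true if the field must be non-empty ([^"]+), false if it may be empty ([^"]*).
$fields = array(
    'code' => true, 'description' => true, 'bay' => false, 'area' => true,
    'lat' => true, 'lon' => true, 'build' => false, 'msgTime' => false,
    'routes' => false, 'simNo' => false, 'displayType' => false,
    'version' => false, 'comments' => false,
);

// Each field becomes "(?P<name>[^"]+)", or "(?P<name>[^"]*)", including the trailing comma.
$parts = array();
foreach ($fields as $name => $required) {
    $parts[] = '"(?P<' . $name . '>[^"]' . ($required ? '+' : '*') . ')",';
}
$pattern = '/^' . implode('', $parts) . '$/';

$comments = str_repeat('iiiip', 11) . 'ii'; // 57 characters
$line = '"HA035028","Hursley Road - Leigh House Hospital","","HURSLEY","52 58.3498N","1 26.5421W","","","24","","","","' . $comments . '",';

$r = preg_match($pattern, $line, $matches);
var_dump($r, $matches['lon'], $matches['routes']);
```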
aularon
Brilliant - The second try works well thanks :) Though I am still a little confused as to why it would accept some versions of the data, and not others with the original regex
Mabbage
A: 

The [^"] answer is fine, but I think you could also turn all your + and * operators into lazy operators by making them +? and *? respectively.

preg_match('/^"(?P<code>.+?)","(?P<description>.+?)","(?P<bay>.*?)","(?P<area>.+?)","(?P<lat>.+?)","(?P<lon>.+?)","(?P<build>.*?)","(?P<msgTime>.*?)","(?P<routes>.*?)","(?P<simNo>.*?)","(?P<displayType>.*?)","(?P<version>.*?)","(?P<comments>.*?)",$/', $line, $matches);

It seems as though one of the expressions was grabbing too much of the line. I'm not entirely sure why (but it would lead to a lot of backtracking).
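Nothing in this thread confirms that backtracking is the cause, but it can be checked. preg_match() returns false when it hits an error, and a bare if (preg_match(...)) treats that the same as 0 (no match), which would also explain why no error message appears. A minimal sketch, where matchOrExplain is a made-up helper name and the short pattern is only a stand-in for the real one:

```php
<?php
// Hypothetical helper: distinguishes "no match" from "the regex engine gave up".
function matchOrExplain($pattern, $subject, &$matches = null)
{
    $result = preg_match($pattern, $subject, $matches);
    if ($result === false) {
        // preg_last_error() reports why the last preg_* call failed.
        $code = preg_last_error();
        if ($code === PREG_BACKTRACK_LIMIT_ERROR) {
            return 'error: backtrack limit reached';
        }
        return 'error: code ' . $code;
    }
    return $result === 1 ? 'match' : 'no match';
}

echo matchOrExplain('/^"(.+)","(.*)",$/', '"a","b",'), "\n";
```

Running the original pattern and one of the failing lines through this should show whether the line really fails to match or whether the engine stopped early. The limit itself is the pcre.backtrack_limit ini setting.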

Aether
`[^"]` is a much better solution. `.*?` will *try* to take the shortest match, but it can still match too much if either the regex or the data is malformed. That won't happen with `[^"]*` because it can't get past the closing quote. In fact, you can even use a *possessive* quantifier (`[^"]*+`) and get a performance boost as a bonus.
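A small sketch of the difference, using an invented three-group pattern and a line that has one field too many (so the data is "malformed" relative to the pattern):

```php
<?php
// Four fields, but both patterns expect exactly three.
$line = '"a","b","c","d",';

// Lazy version: still matches, because the last group is allowed to grow past a closing quote.
$lazy = preg_match('/^"(.*?)","(.*?)","(.*?)",$/', $line, $lazyMatches);

// Possessive [^"]*+ version: a group can never cross a quote, so the bad line is rejected.
$strict = preg_match('/^"([^"]*+)","([^"]*+)","([^"]*+)",$/', $line, $strictMatches);

// $lazy is 1 with $lazyMatches[3] equal to c","d, while $strict is 0.
var_dump($lazy, $lazyMatches[3], $strict);
```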
Alan Moore