I'm trying to parse a set of CSV data using PHP, but I've hit a major issue. One of the fields is a long description field, which itself contains line breaks within the enclosures.

My primary issue is writing a piece of code that can split the data record by record, but also recognize when line breaks within the data should not be treated as record separators. The line breaks within this field are not escaped, making them hard to distinguish from the legitimate line breaks that end a record.

I've tried to come up with a regular expression that can properly handle it, but have had no luck so far. Any ideas?

CSV format:

"####","text data here", "text data \n with linebreaks \n here"\n

A:

According to aleske, a commenter in the documentation for PHP's fgetcsv function:

The PHP's CSV handling stuff is non-standard and contradicts with RFC4180, thus fgetcsv() cannot properly deal with files [that contain line breaks] ...

And he offered up the following function to get around this limitation:

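function csvstring_to_array(&$string, $CSV_SEPARATOR = ';', $CSV_ENCLOSURE = '"', $CSV_LINEBREAK = "\n") {
  // Parses the first record of $string. Pass ',' as the separator for comma-delimited data.
  $o = array('');      // start with one empty field so .= never hits an undefined key

  $cnt = strlen($string);
  $esc = false;        // true while inside an enclosure
  $escesc = false;     // true right after an enclosure has been closed
  $num = 0;            // index of the current field
  $i = 0;
  while ($i < $cnt) {
    $s = $string[$i];

    if ($s == $CSV_LINEBREAK) {
      if ($esc) {
        $o[$num] .= $s;   // a line break inside an enclosure is data
      } else {
        $i++;             // a line break outside an enclosure ends the record
        break;
      }
    } elseif ($s == $CSV_SEPARATOR) {
      if ($esc) {
        $o[$num] .= $s;   // a separator inside an enclosure is data
      } else {
        $num++;
        $o[$num] = '';    // start the next field, even if it stays empty
        $esc = false;
        $escesc = false;
      }
    } elseif ($s == $CSV_ENCLOSURE) {
      if ($escesc) {
        // a doubled enclosure character stands for one literal enclosure character
        $o[$num] .= $CSV_ENCLOSURE;
        $escesc = false;
      }

      if ($esc) {
        $esc = false;     // closing enclosure (or the first half of a doubled one)
        $escesc = true;
      } else {
        $esc = true;      // opening enclosure
        $escesc = false;
      }
    } else {
      if ($escesc) {
        $o[$num] .= $CSV_ENCLOSURE;
        $escesc = false;
      }

      $o[$num] .= $s;
    }

    $i++;
  }

  // Uncomment to drop the parsed record from $string, so that repeated calls
  // walk through a string holding several records:
  // $string = substr($string, $i);

  return $o;
}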

That looks like it will do the trick.

Stephen
A: 

You can use fgetcsv or str_getcsv to parse a CSV. Look at the examples in the PHP documentation.
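
A minimal sketch of the fgetcsv route, assuming a reasonably current PHP version, where fgetcsv keeps line breaks that sit inside the enclosure as part of the field. The sample data and the php://memory stream are made up for illustration; with a real file you would simply fopen() it.

```php
<?php
// Hypothetical sample data: two records, the first with line breaks inside its third field.
$csv_data = "\"1001\",\"text data here\",\"text data \n with linebreaks \n here\"\n"
          . "\"1002\",\"more text\",\"single line\"\n";

// Expose the string as a stream so fgetcsv() can read it record by record.
$handle = fopen('php://memory', 'r+');
fwrite($handle, $csv_data);
rewind($handle);

$rows = array();
// Length 0 means "no line length limit"; delimiter, enclosure and escape are passed explicitly.
while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
    $rows[] = $row;
}
fclose($handle);

print_r($rows);
```

If the third field of the first row still contains its two line breaks, the built-in parser is enough and no custom splitting is needed.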

Felipe Cardoso Martins
When I last tried to use them a few years ago, neither of the getcsv functions would accept newlines in a quoted field. They'd consider it the end of the record.
Charles
A: 

I ended up being able to modify a regular expression with certain special flags to work for my needs. I used the following function call:

preg_match_all('/"\d+",".*",".*"\n/sU', $csv_data, $matches);

This seems to work for a few reasons:

1) The 's' flag tells the regex engine to let the dot match newlines, which it does not do by default. The unfortunate side effect is that the legitimate newlines at the end of each record are also matched by the dot, so the pattern could theoretically match the entire CSV as one result, so

2) I added the U flag. This makes the quantifiers ungreedy by default, so each match now stops at the first closing quote followed by a newline and covers only one record apiece.
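
To illustrate the two flags, here is a small sketch that runs the pattern, with the .* quantifiers written out, against made-up sample data. The $csv_data contents are an assumption for the example, not the real file.

```php
<?php
// Hypothetical sample data: two records, the first with line breaks inside its third field.
$csv_data = "\"1001\",\"text data here\",\"text data \n with linebreaks \n here\"\n"
          . "\"1002\",\"more text\",\"single line\"\n";

// s: the dot also matches newlines; U: quantifiers are ungreedy by default.
preg_match_all('/"\d+",".*",".*"\n/sU', $csv_data, $matches);

// $matches[0] holds one string per record, embedded line breaks included.
foreach ($matches[0] as $record) {
    echo '[', $record, ']';
}
```

This only holds as long as no field contains the sequence "," or a closing quote directly followed by a newline; data like that would need a real CSV parser.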

omgitsfletch