I have a CSV file that holds about 200,000 - 300,000 records. Most of the records can be separated and inserted into a MySQL database with a simple

$line = explode("\n", $fileData);

and then the values separated with

$lineValues = explode(',', $line);

and then inserted into the database using the proper data type, i.e. int, float, string, text, etc.

However, some of the records have a text column that contains a \n inside the string, which breaks the $line = explode("\n", $fileData); method. Each line of data that needs to be inserted into the database has approximately 216 columns. Not every record contains an embedded \n, but whenever one occurs it is enclosed between a pair of single quotes (').

Each line is set up in the following format:

id,data,data,data,text,more data

Example:

1,0,0,0,'Hello World',0
2,0,0,0,'Hello
    World',0
3,0,0,0,'Hi',0
4,0,0,0,,0

As you can see from the example, most records can be easily split with the methods shown above. It's the second record in the example that causes the problem.

Newlines are only \n; the file does not contain \r at all.

A: 

If you can guarantee that every line beginning with a number starts a new record (i.e. it is not the middle of a text description), then you could try something like the below:

// Replace all new-line then id patterns with new-line 0+id
$fileData = preg_replace('/\n(\d)/', "\n0$1", $fileData);

// Split on new-line then the prepended 0, which leaves each id intact
$lines = preg_split("/\n\d/", $fileData);

The first step finds every new-line that is followed by a numeric value and prepends "0" to that value. The second step splits wherever it finds a new-line followed by a digit.

The "0" is added to the front of the id because preg_split removes the characters it matches, so without it the first digit of each id would be lost.

As I say, this will only work if you're sure that the text which breaks a line won't start a new line with a number.
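
As a variation on the same idea (not part of the original answer), a lookahead lets preg_split() split at the new-line without consuming the digit, so the "0" padding step is not needed. This is a minimal sketch that assumes, as above, that no continuation line of a text field starts with a digit; the sample string and the $records name are made up for illustration.

```php
<?php
// Sample data in the question's format, held in $fileData as in the question.
$fileData = "1,0,0,0,'Hello World',0\n"
          . "2,0,0,0,'Hello\n    World',0\n"
          . "3,0,0,0,'Hi',0\n"
          . "4,0,0,0,,0";

// Split at every new-line that is followed by a digit. The lookahead (?=\d)
// checks for the digit without consuming it, so each id stays intact.
// Assumption: a continuation line inside a quoted field never starts with a digit.
$records = preg_split('/\n(?=\d)/', $fileData);

// Each element of $records is now one logical record; the second one still
// contains its embedded new-line.
foreach ($records as $record) {
    echo str_replace("\n", '\n', $record), PHP_EOL;
}
```

The same caveat applies: a text field whose continuation line begins with a digit would still be split in the wrong place.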

ConroyP
A: 

If the csv data is in a file, you can just use fgetcsv() as others have pointed out. fgetcsv handles embedded newlines correctly.
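
For the file case, a rough sketch of the fgetcsv() approach might look like the following. The function name readCsvRecords and the choice of a backslash as the escape character are assumptions made for illustration; the single-quote enclosure matches the sample in the question. Yielding rows one at a time avoids holding the whole file in memory.

```php
<?php
// Hypothetical helper: yields one record at a time from a CSV file whose
// text fields are enclosed in single quotes (as in the question's sample).
function readCsvRecords(string $path): Generator
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    try {
        // 0 = no line-length limit, ',' = delimiter, "'" = enclosure.
        // The "\\" escape character is an assumption about the data.
        while (($row = fgetcsv($handle, 0, ',', "'", "\\")) !== false) {
            if ($row === [null]) {
                continue; // fgetcsv() reports a blank line as [null]
            }
            yield $row;
        }
    } finally {
        fclose($handle);
    }
}
```

Each yielded $row is an array of field values, and a quoted field that spans several physical lines arrives as a single value with the \n still inside it.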

However, if your CSV data is in a string (like $fileData in your example), the following method may be useful, as str_getcsv() only works on one row at a time and cannot split a whole file into records.

You can detect the embedded newlines by counting the quotes in each line. If there is an odd number of quotes, you have an incomplete line, so concatenate this line with the following line. Once you have an even number of quotes, you have a complete record.

Once you have a complete record, split it at the quotes (again using explode()). Counting from zero, odd-numbered chunks are quoted (so embedded commas are not special) and even-numbered chunks are not.

Example:

# Split file into physical lines (records may span lines)
$lines = explode("\n", $fileData);

# Re-assemble records
$records = array ();
$record = '';
$lineSep = '';
foreach ($lines as $line) {
  # Escape @ symbol so we can use it as a marker (as it does not conflict with
  # any special CSV character.)
  $line = str_replace('@', '@a', $line);

  # Escape commas as we don't yet know which ones are separators
  $line = str_replace(',', '@c', $line);

  # Escape quotes in a form that uses no special characters
  $line = str_replace("\\'", '@q', $line);
  $line = str_replace('\\', '@b', $line);

  $record .= $lineSep . $line;
  $lineSep = "\n";

  # Must have an even number of quotes in a complete record!
  if (substr_count($record, "'") % 2 == 0) {
    $records[] = $record;
    $record = '';
    $lineSep = '';
  }
}
if (strlen($record) > 0) {
  $records[] = $record;
}

$rows = array ();

foreach ($records as $record) {
  $chunks_in = explode("'", $record);
  $chunks_out = array ();

  # Decode escaped quotes/backslashes.
  # Decode field-separating commas (unless quoted)
  foreach ($chunks_in as $i => $chunk) {
    # Unescape quotes & backslashes
    $chunk = str_replace('@q', "'", $chunk);
    $chunk = str_replace('@b', '\\', $chunk);
    if ($i % 2 == 0) {
      # Unescape commas
      $chunk = str_replace('@c', ',', $chunk);
    }
    $chunks_out[] = $chunk;
  }

  # Join back together, discarding unescaped quotes
  $record = join('', $chunks_out);

  $chunks_in = explode(',', $record);
  $row = array ();
  foreach ($chunks_in as $chunk) {
    $chunk = str_replace('@c', ',', $chunk);
    $chunk = str_replace('@a', '@', $chunk);
    $row[] = $chunk;
  }
  $rows[] = $row;
}
finnw
+1  A: 

How about manually iterating through the data, from start to finish, with a for-loop or two? It's slower than explode(), but it's easier to get consistent and reliable results regarding quotes.

If you choose this method, remember to take escaped quotes into account.
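
A minimal sketch of such a loop is shown below. The function name parseCsvManually is hypothetical, and the format details (comma delimiter, single-quote enclosure, backslash as the escape character for quotes) are assumptions based on the question's sample rather than a confirmed specification.

```php
<?php
// Hypothetical helper: walks the data one character at a time and tracks
// whether the current position is inside a quoted field.
function parseCsvManually(string $data): array
{
    $rows = [];
    $row = [];
    $field = '';
    $inQuotes = false;
    $length = strlen($data);

    for ($i = 0; $i < $length; $i++) {
        $char = $data[$i];

        if ($char === '\\' && $i + 1 < $length) {
            // Assumed escape rule: a backslash makes the next character literal.
            $field .= $data[++$i];
        } elseif ($char === "'") {
            // An unescaped quote opens or closes a quoted field.
            $inQuotes = !$inQuotes;
        } elseif ($char === ',' && !$inQuotes) {
            // Field separator outside quotes: finish the current field.
            $row[] = $field;
            $field = '';
        } elseif ($char === "\n" && !$inQuotes) {
            // Newline outside quotes: finish the current record.
            $row[] = $field;
            $rows[] = $row;
            $row = [];
            $field = '';
        } else {
            // Ordinary character, or a comma/newline inside quotes.
            $field .= $char;
        }
    }

    // Flush the last record if the data does not end with a newline.
    if ($field !== '' || $row !== []) {
        $row[] = $field;
        $rows[] = $row;
    }

    return $rows;
}
```

Calling parseCsvManually($fileData) returns one array of field values per record, with embedded newlines kept inside the quoted field they belong to.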

Henrik Paul
+3  A: 

The other advice here is, of course, valid, especially if you aim to write your own CSV parser. However, if you just want to get the data out, use the fgetcsv() function and don't worry about the implementation details.

Nouveau
A: 

Use fgetcsv and it'll take care of all of that for you, unless there's some overriding reason you need to have your own CSV parser.
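
If the data is already in a string, like $fileData in the question, one possible way to still use fgetcsv() is to write the string to a php://temp stream and read it back. This is only a sketch: the function name parseCsvString is hypothetical, and the single-quote enclosure and backslash escape are assumptions about the data.

```php
<?php
// Hypothetical helper: parses CSV data held in a string by routing it
// through a temporary stream so that fgetcsv() can be used on it.
function parseCsvString(string $fileData): array
{
    $handle = fopen('php://temp', 'r+');
    fwrite($handle, $fileData);
    rewind($handle);

    $rows = [];
    // 0 = no line-length limit, ',' = delimiter, "'" = enclosure,
    // "\\" = escape character (an assumption about the data).
    while (($row = fgetcsv($handle, 0, ',', "'", "\\")) !== false) {
        if ($row === [null]) {
            continue; // skip blank lines
        }
        $rows[] = $row;
    }
    fclose($handle);

    return $rows;
}
```

Note that this keeps a second copy of the data in the stream, so for very large inputs reading directly from the file is likely the better option.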

KernelM
I'm not familiar with the fgetcsv() function. This is the first time I've been tasked with taking about 300MB worth of CSV files and inserting them into a MySQL database. The first few files were easy as they didn't have the embedded new lines.
Jayrox