tags:

views:

2985

answers:

11

I am looking for suggestions on how to handle a csv file that is being created, then uploaded by our customers, and that may have a comma in a value, like a company name.

Some of the ideas we are looking at are: quoted Identifiers (value "," values ","etc) or using a | instead of a comma. The biggest problem is that we have to make it easy, or the customer won't do it.

+7  A: 

Put double quotes around strings. That is generally what Excel does.

Ala Eli,

you escape a double quote as two double quotes. E.g. "test1","foo""bar","test2"

Joe Philllips
basically the same concept as a quoted Identifiers
Bob The Janitor
you escape a double quote as two double quotes. E.g. "test1","foo""bar","test2"
Eli
That's what I thought but wasn't sure.
Joe Philllips
+1  A: 

There is a popular library for handling this type of data, FileHelpers

Jhonny D. Cano -Leftware-
hehe i like this argument for why not to write your own. http://secretgeek.net/csv_trouble.asp rings true to me
qntmfred
A: 

Use a different delimiter than commas, so that the problem goes away. Hopefully they don't use tabs in their data. :)

James Black
Picking a magic character that you "hope" the client doesn't use in their data strikes me as a really bad idea.
Eli
Just tell the client that some character is not allowed. If it is something they won't be using then no problem. :)
James Black
Yeah I know a guy who swears by using ^ as a delimiter. Just because he doesn't use ^, doesn't mean that no one else uses it.
DrG
A: 

Do your customers have Excel? They should just be able to create a spreadsheet of the values, and save it as a CSV. Values with commas will have quotes around them (as @d03boy points out).

bdukes
A: 

I'm assuming you're refering to reading andd writing the file. I would put the value in quotes.

If you're refering to what to do with the comma once you've read the value into a data type. That will depend on if it's important to you or not. It can be stripped or or kept.

codeelegance
+12  A: 

There's actually a spec for CSV format and how to handle commas:

http://tools.ietf.org/html/rfc4180

Corey Trager
"Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes."
Eli
A: 

What language are you using? Every popular language is going to have a CSV library that will take care of this for you.

Eli
+4  A: 

As others have said, you need to escape values that include quotes. Here's a little CSV reader that supports quoted values, including embedded quotes and carriage returns.

By the way this is unit-tested code. I'm posting it now because this question seems to come up a lot and others may not want an entire library when simple CSV support will do.

You can use it as follows:

using System;
public class test
{
    public static void Main()
    {
        using ( CsvReader reader = new CsvReader( "data.csv" ) )
        {
            foreach( string[] values in reader.RowEnumerator )
            {
                Console.WriteLine( "Row {0} has {1} values.", reader.RowIndex, values.Length );
            }
        }
        Console.ReadLine();
    }
}

Here are the classes. Note that you can use the Csv.Escape function to write valid CSV as well.

using System.IO;
using System.Text.RegularExpressions;

public sealed class CsvReader : System.IDisposable
{
 public CsvReader( string fileName ) : this( new FileStream( fileName, FileMode.Open, FileAccess.Read ) )
 {
 }

 public CsvReader( Stream stream )
 {
  __reader = new StreamReader( stream );
 }

 public System.Collections.IEnumerable RowEnumerator
 {
  get {
   if ( null == __reader )
    throw new System.ApplicationException( "I can't start reading without CSV input." );

   __rowno = 0;
   string sLine;
   string sNextLine;

   while ( null != ( sLine = __reader.ReadLine() ) )
   {
    while ( rexRunOnLine.IsMatch( sLine ) && null != ( sNextLine = __reader.ReadLine() ) )
     sLine += "\n" + sNextLine;

    __rowno++;
    string[] values = rexCsvSplitter.Split( sLine );

    for ( int i = 0; i < values.Length; i++ )
     values[i] = Csv.Unescape( values[i] );

    yield return values;
   }

   __reader.Close();
  }
 }

 public long RowIndex { get { return __rowno; } }

 public void Dispose()
 {
  if ( null != __reader ) __reader.Dispose();
 }

 //============================================


 private long __rowno = 0;
 private TextReader __reader;
 private static Regex rexCsvSplitter = new Regex( @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))" );
 private static Regex rexRunOnLine = new Regex( @"^[^""]*(?:""[^""]*""[^""]*)*""[^""]*$" );
}

public static class Csv
{
 public static string Escape( string s )
 {
  if ( s.Contains( QUOTE ) )
   s = s.Replace( QUOTE, ESCAPED_QUOTE );

  if ( s.IndexOfAny( CHARACTERS_THAT_MUST_BE_QUOTED ) > -1 )
   s = QUOTE + s + QUOTE;

  return s;
 }

 public static string Unescape( string s )
 {
  if ( s.StartsWith( QUOTE ) && s.EndsWith( QUOTE ) )
  {
   s = s.Substring( 1, s.Length - 2 );

   if ( s.Contains( ESCAPED_QUOTE ) )
    s = s.Replace( ESCAPED_QUOTE, QUOTE );
  }

  return s;
 }


 private const string QUOTE = "\"";
 private const string ESCAPED_QUOTE = "\"\"";
 private static char[] CHARACTERS_THAT_MUST_BE_QUOTED = { ',', '"', '\n' };
}
harpo
Is this compatible with excel?
Joe Philllips
works like a champ, I can't believe how smooth this works
Bob The Janitor
works with all of the edges cases we have tried so far
Bob The Janitor
A: 

You can put double quotes around the fields. I don't like this approach, as it adds another special character (the double quote). Just define an escape character (usually backslash) and use it wherever you need to escape something:

data,more data,more data\, even,yet more

You don't have to try to match quotes, and you have fewer exceptions to parse. This simplifies your code, too.

Adam Jaskiewicz
A: 

As this is about general practices let's start from rules of the thumb:

  1. Don't use CSV, use XML with a library to read & write the xml file instead.

  2. If you must use CSV. Do it properly and use a free library to parse and store the CSV files.

To justify 1), most CSV parsers aren't encoding aware so if you aren't dealing with US-ASCII you are asking for troubles. For example excel 2002 is storing the CSV in local encoding without any note about the encoding. The CSV standard isn't widely adopted :(. On the other hand xml standard is well adopted and it handles encodings pretty well.

To justify 2), There is tons of csv parsers around for almost all language so there is no need to reinvent the wheel even if the solutions looks pretty simple.

To name few:

  • for python use build in csv module

  • for perl check CPAN and Text::CSV

  • for php use build in fgetcsv/fputcsv functions

  • for java check SuperCVS library

Really there is no need to implement this by hand if you aren't going to parse it on embedded device.

Piotr Czapla
XML isn't always the answer. CSV is the right format for the job when you have a lot of dense, tabular data (i.e. a spreadsheet). Those tags introduce a lot of overhead, and if each and every line has an identical format, there is no need to be explicit about what each and every value represents. XML is great when you have complicated hierarchical data, or records with optional fields. That isn't always the case.
Adam Jaskiewicz
In theory the "tags" introduce a bit of overhead but I can't think of any real life application where it start to be a problem. Do you have any practical examples? To work on data one should use an database instead of csv. if we speak about data serialization (backups, data interchange), will it matter if parsing takes a week instead of 5 days?
Piotr Czapla
Basically, any situation in which you have data that is best represented by a table. Say you have data from a dozen different sensors that you sample every so often, and you record the timestamp and the value of each of the sensors at that time. Each record is identical: timestamp, sensor0, sensor1, ... sensor11. XML is great for representing complex, irregular data, but it is a rather heavyweight format that does not fit every single situation. KISS
Adam Jaskiewicz
Some people see a problem, and they say "I know, I'll use XML!" Now they have two problems.
Adam Jaskiewicz
I totally agree that xml is not an answer for everything. Especially It is not well suited as a database replacement nor for configuration files.But here the question was about data interchange for which XML was designed for.
Piotr Czapla
It depends what the data is, more than what it is used for. I don't know what data is being transferred, so I'm not sure if CSV is the most appropriate, but if it really is regular, tabular data, a tabular format simply makes more sense than a hierarchical format. Quite often people shoehorn data that *should* be XML into another format, but I think the opposite is just as bad, and it's a pet peeve of mine.
Adam Jaskiewicz
A: 

The CSV format uses commas to separate values, values which contain carriage returns, linefeeds, commas, or double quotes are surrounded by double-quotes. Values that contain double quotes are quoted and each literal quote is escaped by an immediately preceding quote: For example, the 3 values:

test
list, of, items
"go" he said

would be encoded as:

test,"list, of, items","""go"" he said"

Any field can be quoted but only fields that contain commas, CR/NL, or quotes must be quoted.

There is no real standard for the CSV format but almost all applications follow the conventions documented here. The RFC that was mentioned elsewhere is not a standard for CSV, it is an RFC for using CSV within MIME and contains some unconventional and unnecessary limitations that make it useless outside of MIME.

A gotcha that many CSV modules I have seen don't accomodate is the fact that multiple lines can be encoded in a single field which means you can't assume that each line is a separate record, you either need to not allow newlines in your data or be prepared to handle this.

Robert Gamble