views:

5753

answers:

12
+11  Q: 

CSV parser in C++

All I need is a good CSV file parser for C++. At this point it can really just be a comma-delimited parser (i.e. don't worry about escaped newlines and commas). The main need is a line-by-line parser that will return a vector for the next line each time the method is called.

I found this article which looks quite promising: http://www.boost.org/doc/libs/1_35_0/libs/spirit/example/fundamental/list_parser.cpp

I've never used Boost's Spirit, but am willing to try it. Is it overkill/bloated or is it fast and efficient? Does anyone have faster algorithms using STL or anything else?

Thanks!

+3  A: 

You might want to look at my FOSS project CSVfix, which is a CSV stream editor written in C++. The CSV parser is no prize, but does the job and the whole package may do what you need without you writing any code.

anon
Seems great ... What about the status beta / production ?
neuro
The status is "in development", as suggested by the version numbers. I really need more feedback from users before going to version 1.0. Plus I have a couple more features I want to add, to do with XML production from CSV.
anon
Bookmarking it, and will give it a try next time I have to deal with those wonderful standard CSV files ...
neuro
+1 I found a project I can learn from :)
AraK
+14  A: 

If you don't care about escaping commas and newlines,
AND you can't embed commas and newlines in quotes (if you can't escape, then...),
then it's only about three lines of code (OK, 14, but it's only 15 to read the whole file).

std::vector<std::string> getNextLineAndSplitIntoTokens(std::istream& str)
{
    std::vector<std::string>   result;
    std::string                line;
    std::getline(str,line);

    std::stringstream          lineStream(line);
    std::string                cell;

    while(std::getline(lineStream,cell,','))
    {
        result.push_back(cell);
    }
    return result;
}
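
One caveat with the std::getline loop above: a line that ends in a comma (an empty last field) produces no final empty string, because the last std::getline call extracts nothing and fails. A minimal sketch of a variant that keeps the trailing empty field, using std::string::find instead of a stringstream; the function name splitCsvLine is an assumption for illustration, and it still ignores quoting and escaping:

```cpp
#include <string>
#include <vector>

// Hypothetical helper (name is illustrative): split one line on commas,
// keeping empty fields, including an empty field after a trailing comma.
// No quoting or escaping is handled.
std::vector<std::string> splitCsvLine(const std::string& line)
{
    std::vector<std::string>   fields;
    std::string::size_type     start = 0;

    for (;;)
    {
        std::string::size_type comma = line.find(',', start);
        if (comma == std::string::npos)
        {
            // Last field: everything after the final comma (may be empty).
            fields.push_back(line.substr(start));
            break;
        }
        fields.push_back(line.substr(start, comma - start));
        start = comma + 1;
    }
    return fields;
}
```

Note that with this sketch an empty line yields one empty field rather than zero fields; whether that is what you want depends on your data.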

I would just create a class representing a row.
Then stream into that object:

#include <iterator>
#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <string>

class CVSRow
{
    public:
        std::string const& operator[](std::size_t index) const
        {
            return m_data[index];
        }
        std::size_t size() const
        {
            return m_data.size();
        }
        void readNextRow(std::istream& str)
        {
            std::string         line;
            std::getline(str,line);

            std::stringstream   lineStream(line);
            std::string         cell;

            m_data.clear();
            while(std::getline(lineStream,cell,','))
            {
                m_data.push_back(cell);
            }
        }
    private:
        std::vector<std::string>    m_data;
};

std::istream& operator>>(std::istream& str,CVSRow& data)
{
    data.readNextRow(str);
    return str;
}   
int main()
{
    std::ifstream       file("plop.csv");

    CVSRow              row;
    while(file >> row)
    {
        std::cout << "4th Element(" << row[3] << ")\n";
    }
}

But with a little work we could technically create an iterator:

class CVSIterator
{   
    public:
        typedef std::input_iterator_tag     iterator_category;
        typedef CVSRow                      value_type;
        typedef std::ptrdiff_t              difference_type;
        typedef CVSRow*                     pointer;
        typedef CVSRow&                     reference;

        CVSIterator(std::istream& str)  :m_str(str.good()?&str:NULL) { ++(*this); }
        CVSIterator()                   :m_str(NULL) {}

        // Pre Increment
        // (test the read itself, so a last line without a trailing newline is not dropped)
        CVSIterator& operator++()               {if (m_str) { if (!((*m_str) >> m_row)) {m_str = NULL;}}return *this;}
        // Post increment
        CVSIterator operator++(int)             {CVSIterator    tmp(*this);++(*this);return tmp;}
        CVSRow const& operator*()   const       {return m_row;}
        CVSRow const* operator->()  const       {return &m_row;}

        bool operator==(CVSIterator const& rhs) const {return ((this == &rhs) || ((this->m_str == NULL) && (rhs.m_str == NULL)));}
        bool operator!=(CVSIterator const& rhs) const {return !((*this) == rhs);}
    private:
        std::istream*       m_str;
        CVSRow              m_row;
};


int main()
{
    std::ifstream       file("plop.csv");

    for(CVSIterator loop(file);loop != CVSIterator();++loop)
    {
        std::cout << "4th Element(" << (*loop)[3] << ")\n";
    }
}
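
If a plain class with explicit methods is preferred over an iterator, the same loop can be packaged that way. The following is only a sketch: the class name CsvLineReader and the methods firstLine() and nextLine() are assumptions for illustration. Since std::istream has no default constructor, the class holds a reference to a stream that the caller owns (for example a std::ifstream), and firstLine() assumes that stream is seekable.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical wrapper (class and method names are illustrative).
// The stream is owned by the caller; this class only keeps a reference.
class CsvLineReader
{
    public:
        explicit CsvLineReader(std::istream& str) : m_str(str) {}

        // Rewind to the start of the stream and return the first row.
        std::vector<std::string> firstLine()
        {
            m_str.clear();                  // drop eof/fail flags from earlier reads
            m_str.seekg(0, std::ios::beg);  // assumes the stream is seekable
            return nextLine();
        }

        // Return the next row, or an empty vector when no line is left.
        std::vector<std::string> nextLine()
        {
            std::vector<std::string>   result;
            std::string                line;
            if (!std::getline(m_str, line))
            {
                return result;
            }

            std::stringstream          lineStream(line);
            std::string                cell;
            while (std::getline(lineStream, cell, ','))
            {
                result.push_back(cell);
            }
            return result;
        }
    private:
        std::istream&   m_str;
};
```

A blank line in the file also comes back as an empty vector here, so a caller that needs to tell "blank line" from "end of file" would have to check the stream state as well.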
Martin York
This is exactly what I wanted! Now, some extra credit..how would I make this into a class with a constructor and two methods: firstLine() and nextLine(). std::istream doesn't have a default constructor..so what do I use instead? Thanks for the help!!
User1
Can somebody do two fixes above: lineSteam instead of linestream. Missing ")" on while.
User1
first() next(). What is this Java! Only Joking.
Martin York
or you could use some boost libraries to parse csv ... see below
stefanB
A: 

Well, if you only need simple CSV parsing, Neil Butterworth's libs might be overkill in your case. You can just use the istream member function istream& getline (char* s, streamsize n, char delim ); with ',' as the delimiter. It will only handle simple cases, but it can be enough as a starting point ...
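
A minimal sketch of that approach, assuming each line has already been read into a std::string and that no field is longer than 255 characters; the function name and the buffer size are assumptions for illustration. If a field does not fit in the buffer, the member getline sets failbit and the loop stops early.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (name is illustrative): split one line on commas using
// the istream member getline(char*, streamsize, char).
std::vector<std::string> splitWithMemberGetline(const std::string& line)
{
    std::vector<std::string> fields;
    std::istringstream       in(line);
    char                     buf[256];   // assumed maximum field length + 1

    // Each call copies characters up to (not including) the next ',' into buf.
    while (in.getline(buf, sizeof buf, ','))
    {
        fields.push_back(buf);
    }
    return fields;
}
```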

neuro
@Martin: arghhh not fast enough :-)
neuro
/me really hate downvotes without comment ...
neuro
A: 

The Boost Tokenizer documentation specifically mentions parsing CSV files as one of the examples. It still might be overkill for what you need, but less so than writing a full-blown LL parser.

Kristo
+9  A: 

Solution using Boost Tokenizer:

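#include <string>
#include <vector>
#include <boost/tokenizer.hpp>

// Split one CSV row into fields (the function name is illustrative).
// Escape character is backslash, separator is comma, quote is double-quote.
std::vector<std::string> splitLine(std::string const& line)
{
   std::vector<std::string> vec;
   using namespace boost;
   tokenizer<escaped_list_separator<char> > tk(
      line, escaped_list_separator<char>('\\', ',', '\"'));
   for (tokenizer<escaped_list_separator<char> >::iterator i(tk.begin());
      i!=tk.end();++i)
   {
      vec.push_back(*i);
   }
   return vec;
}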
dtw
The boost tokenizer doesn't fully support the complete CSV standard, but there are some quick workarounds. See http://stackoverflow.com/questions/1120140/csv-parser-in-c/1595366#1595366
Rolf Kristensen
+2  A: 

Excuse me, but this all seems like a great deal of elaborate syntax to hide a few lines of code.

Why not this:

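#include <cstdio>
#include <iostream>
#include <string>
#include <vector>

/**

  Read line from a CSV file

  @param[in] fp file pointer to open file
  @param[out] vls reference to vector of strings to hold next line

  */
void readCSV( FILE *fp, std::vector<std::string>& vls )
{
    vls.clear();
    if( ! fp )
     return;
    char buf[10000];
    if( ! fgets( buf, sizeof buf, fp) )
     return;
    std::string s = buf;
    std::string::size_type p = 0, q;
    // loop over columns; each column ends at a comma or at the newline
    while( (q = s.find_first_of(",\n",p)) != std::string::npos ) {
     vls.push_back( s.substr(p,q-p) );
     p = q + 1;
    }
    // last column of a final line that has no trailing newline
    if( p < s.size() )
     vls.push_back( s.substr(p) );
}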

int main(int argc, char* argv[])
{
    if( argc < 2 )
     return 1;
    std::vector<std::string> vls;
    FILE * fp = fopen( argv[1], "r" );
    if( ! fp )
     return 1;
    // read three rows; afterwards vls holds the third one
    readCSV( fp, vls );
    readCSV( fp, vls );
    readCSV( fp, vls );
    fclose( fp );
    if( vls.size() > 3 )
     std::cout << "row 3, col 4 is " << vls[3] << "\n";

    return 0;
}
ravenspoint
A: 

You could also take a look at the capabilities of the Qt library.

It has regular expression support, and the QString class has convenient methods such as split(), which returns a QStringList: the list of strings obtained by splitting the original string at a given delimiter. That should suffice for a CSV file.

To get a column with a given header name I use the following: http://stackoverflow.com/questions/970330/c-inheritance-qt-problem-qstring/1011601#1011601

MadH
+4  A: 

The String Toolkit Library has a token grid class that allows you to load data either from text files, strings or char buffers, and to parse/process them in a row-column fashion.

You can specify the row delimiters and column delimiters or just use the defaults.

void foo()
{
   std::string data;
   data += "1,2,3,4,5\n";
   data += "0,2,4,6,8\n";
   data += "1,3,5,7,9\n";

   strtk::token_grid grid(data,data.size(),",");

   for(std::size_t i = 0; i < grid.row_count(); ++i)
   {
      strtk::token_grid::row_type r = grid.row(i);
      for(std::size_t j = 0; j < r.size(); ++j)
      {
         std::cout << r.get<int>(j) << "\t";
      }
      std::cout << std::endl;
   }
   std::cout << std::endl;
}
Beh Tou Cheh
+2  A: 

When using the Boost Tokenizer escaped_list_separator for CSV files, one should be aware of the following:

  1. It requires an escape character (default backslash - \)
  2. It requires a splitter/separator character (default comma - ,)
  3. It requires a quote character (default quote - ")

The CSV format specified by wiki states that data fields can contain separators in quotes (supported):

1997,Ford,E350,"Super, luxurious truck"

The CSV format specified by wiki states that a quote character inside a quoted field is written as two double-quotes (escaped_list_separator will strip away all quote characters):

1997,Ford,E350,"Super ""luxurious"" truck"

The CSV format doesn't specify that any backslash characters should be stripped away (escaped_list_separator will strip away all escape characters).

A possible workaround to fix the default behavior of the boost escaped_list_separator:

  1. First replace all backslash characters (\) with two backslash characters (\\) so they are not stripped away.
  2. Then replace all doubled double-quotes ("") with a backslash followed by a quote (\").

This workaround has the side effect that empty data fields represented by two double-quotes ("") are transformed into a token consisting of a single quote character. When iterating through the tokens, one must check whether the token is a single quote character and treat it as an empty string.

Not pretty but it works.
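
The two replacement steps can be sketched as a single pass over the line before it is handed to the tokenizer. This is only an illustration under the assumptions stated above: the function name is hypothetical, it uses nothing but std::string, and it is as naive as the workaround itself (it pairs up any two adjacent quote characters, so an empty field "" becomes \").

```cpp
#include <string>

// Hypothetical helper (name is illustrative): rewrite one CSV line so that
// escaped_list_separator keeps backslashes and doubled quotes.
std::string prepareForEscapedListSeparator(const std::string& line)
{
    std::string out;
    out.reserve(line.size() * 2);

    for (std::string::size_type i = 0; i < line.size(); ++i)
    {
        if (line[i] == '\\')
        {
            out += "\\\\";      // step 1: \ becomes \\ so it is not stripped
        }
        else if (line[i] == '"' && i + 1 < line.size() && line[i + 1] == '"')
        {
            out += "\\\"";      // step 2: "" becomes \"
            ++i;                // skip the second quote of the pair
        }
        else
        {
            out += line[i];
        }
    }
    return out;
}
```

The returned string would then be passed to the tokenizer in place of the raw line, with the single-quote-token check described above applied to the results.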

Rolf Kristensen
+3  A: 

It is not overkill to use Spirit for parsing CSVs. Spirit is well suited for micro-parsing tasks. For instance, with Spirit 2.1, it is as easy as:

bool r = phrase_parse(first, last,

    //  Begin grammar
    (
        double_ % ','
    )
    ,
    //  End grammar

    space, v);

The vector, v, gets stuffed with the values. There is a series of tutorials touching on this in the new Spirit 2.1 docs that have just been released with Boost 1.41. I suggest you go check them out here:

http://tinyurl.com/yfucedn

The tutorial progresses from simple to complex. The CSV parsers are presented somewhere in the middle and touch on various techniques for using Spirit. The generated code is as tight as hand-written code. Check out the generated assembler!

Joel de Guzman
+3  A: 

You can use Boost Tokenizer with escaped_list_separator.

escaped_list_separator parses a superset of the CSV format (see the Boost::tokenizer documentation).

This uses only the Boost Tokenizer header files; no linking to Boost libraries is required.

Here is an example (see Parse CSV File With Boost Tokenizer In C++ or Boost::tokenizer for details):

#include <iostream>     // cout, endl
#include <fstream>      // ifstream
#include <vector>
#include <string>
#include <algorithm>    // copy
#include <iterator>     // ostream_iterator
#include <boost/tokenizer.hpp>

int main()
{
    using namespace std;
    using namespace boost;
    string data("data.csv");

    ifstream in(data.c_str());
    if (!in.is_open()) return 1;

    typedef tokenizer< escaped_list_separator<char> > Tokenizer;
    vector< string > vec;
    string line;

    while (getline(in,line))
    {
        Tokenizer tok(line);
        vec.assign(tok.begin(),tok.end());

        // vector now contains strings from one row, output to cout here
        copy(vec.begin(), vec.end(), ostream_iterator<string>(cout, "|"));

        cout << "\n----------------------" << endl;
    }
}
stefanB
Downvotes? Because ....? Well anyway thanks, your annoying and childish response is very constructive because now we all know why you don't like this response ... er ... you did not sleep well?
stefanB
+3  A: 

If you DO care about parsing CSV correctly, this will do it...relatively slowly as it works one char at a time.

#include <string>
#include <vector>
using std::string;
using std::vector;

 int ParseCSV(const string& csvSource, vector<vector<string> >& lines)
    {
       int result(0);

       bool inQuote(false);
       bool newLine(false);
       string field;
       lines.clear();
       vector<string> line;

       string::const_iterator aChar = csvSource.begin();
       while (aChar != csvSource.end())
       {
          switch (*aChar)
          {
          case '"':
             newLine = false;
             if (inQuote == true && (aChar + 1) != csvSource.end() && *(aChar + 1) == '"')
             {
                // a doubled quote inside a quoted field is one literal quote
                field += *aChar;
                ++aChar;
             }
             else
             {
                inQuote = !inQuote;
             }
             break;

          case ',':
             newLine = false;
             if (inQuote == true)
             {
                field += *aChar;
             }
             else
             {
                line.push_back(field);
                field.clear();
             }
             break;

          case '\n':
          case '\r':
             if (inQuote == true)
             {
                field += *aChar;
             }
             else
             {
                if (newLine == false)
                {
                   line.push_back(field);
                   lines.push_back(line);
                   field.clear();
                   line.clear();
                   newLine = true;
                }
             }
             break;

          default:
             newLine = false;
             field.push_back(*aChar);
             break;
          }

          aChar++;
       }

       // flush the last record when the input does not end with a newline
       if (line.size() || field.size())
       {
          line.push_back(field);
          lines.push_back(line);
       }

       return result;
    }
Michael