I have a data file which contains data in row/column form. I would like a way to read this data into a 2D array in C or C++ (whichever is easier), but I don't know how many rows or columns the file might have before I start reading it in.

At the top of the file is a commented line giving a series of numbers relating to what each column holds. Each row holds the data for each number at a point in time, so an example data file (a small one - the ones I'm using are much bigger!) could look like:

# 1 4 6 28
21.2 492.1 58201.5 586.2
182.4 1284.2 12059. 28195.2
.....

I am currently using Python to read in the data using numpy.loadtxt which conveniently splits the data in row/column form whatever the data array size, but this is getting quite slow. I want to be able to do this reliably in C or C++.

I can see some options:

  1. Add a header tag with the dimensions from my extraction program

    # 1 4 6 28
    # xdim, ydim
    21.2 492.1 58201.5 586.2
    182.4 1284.2 12059. 28195.2
    .....
    

    but this requires rewriting my extraction programs and programs which use the extracted data, which is quite intensive.

  2. Store the data in a database, e.g. MySQL or SQLite. Then the data could be extracted on demand. This might be a requirement further along in the development process, so it might be good to look into anyway.

  3. Use Python to read in the data and wrap C code for the analysis. This might be easiest in the short run.

  4. Use wc on Linux to count the lines in the file and the words in the header, which gives the dimensions.

    echo $((`cat FILE | wc -l` - 1)) # get number of rows (-1 for header line)
    echo $((`cat FILE | head -n 1 | wc -w` - 1)) # get number of columns (-1 for '#' character)
    
  5. Use C/C++ code

This question is mostly about point 5: is there an easy and reliable way to do this in C/C++? Any other suggestions would also be welcome.

Thanks

+8  A: 

How about:

  1. Load the file.
  2. Count the number of rows and columns.
  3. Close the file.
  4. Allocate the memory needed.
  5. Load the file again.
  6. Fill the array with data.

Every .obj (3D model file) loader I've seen uses this method. :)
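The steps above can be sketched roughly as follows. This is only a minimal sketch under some assumptions: `Grid` and `readTwoPass` are made-up names, comment lines start with '#', every data row has the same width, and the stream is rewound with `clear()` and `seekg(0)` rather than closed and re-opened.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: a dense rows x cols block of doubles.
struct Grid {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<double> values;  // row-major: element (r, c) is values[r * cols + c]
};

// Two-pass read: count first, allocate once, then fill.
// Assumes '#' starts a comment line and all data rows have the same width.
Grid readTwoPass(std::istream& in) {
    Grid g;
    std::string line;

    // Pass 1: count data rows; count columns on the first data row only.
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') continue;
        if (g.rows == 0) {
            std::istringstream first(line);
            double v;
            while (first >> v) ++g.cols;
        }
        ++g.rows;
    }

    // Rewind; clear() must come first because pass 1 left the EOF/fail flags set.
    in.clear();
    in.seekg(0);

    // Pass 2: reserve the exact size, then fill.
    g.values.reserve(g.rows * g.cols);
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') continue;
        std::istringstream fields(line);
        double v;
        while (fields >> v) g.values.push_back(v);
    }
    return g;
}
```

Taking a `std::istream&` means the same function works on a `std::ifstream` or on a `std::istringstream` when testing.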

knight666
You can use `fseek`/`fstream::seekg` to reset the cursor to 0 without re-opening the file.
KennyTM
A: 

Do you need a square or a ragged matrix? If the latter, create a structure like this:

 std::vector < std::vector <double> > data;

Now read each line at a time into a:

 std::vector <double> d;

and add the vector to the ragged matrix:

 data.push_back( d );

All data structures involved are dynamic, and will grow as required.
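If the data is supposed to be square but is stored this way, a small check along these lines could flag files with short or long rows. It is only a sketch; `Table` and `isRectangular` are names invented here, not part of any library.

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<double> > Table;

// True if every row has the same number of columns as the first row.
// An empty table counts as rectangular.
bool isRectangular(const Table& data) {
    for (std::size_t i = 1; i < data.size(); ++i) {
        if (data[i].size() != data[0].size()) return false;
    }
    return true;
}
```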

anon
+10  A: 

Create table as vector of vectors:

std::vector<std::vector<double> > table;

Inside infinite (while(true)) loop:

Read line:

std::string line;
std::getline(ifs, line);

If something went wrong (probably EOF), exit the loop:

if(!ifs) 
    break;

Skip that line if it's empty or a comment:

if(line.empty() || line[0] == '#')
    continue;

Read row contents into vector:

std::vector<double> row;
std::istringstream iss(line);  // parse only this line, not the rest of the file
std::copy(std::istream_iterator<double>(iss),
          std::istream_iterator<double>(),
          std::back_inserter(row));

Add row to table:

table.push_back(row);

At the time you're out of the loop, "table" contains the data:

  • table.size() is the number of rows

  • table[i] is row i

  • table[i].size() is the number of cols. in row i

  • table[i][j] is the element at the j-th col. of row i

Manuel
upvoted this as it helped the most
Simon Walker
A: 

Figured out a way to do this. Thanks go mostly to Manuel as it was the most informative answer.

std::vector< std::vector<double> > readIn2dData(const char* filename)
{
    /* Function takes a char* filename argument and returns a 
     * 2d dynamic array containing the data
     */

    std::vector< std::vector<double> > table; 
    std::ifstream ifs;  // input-only stream, so read-only files open too

    /*  open file  */
    ifs.open(filename);

    while (true)
    {
        std::string line;
        double buf;
        getline(ifs, line);

        std::stringstream ss(line, std::ios_base::out|std::ios_base::in|std::ios_base::binary);

        if (!ifs)
            // mainly catch EOF
            break;

        if (line.empty() || line[0] == '#')
            // catch empty lines or comment lines
            continue;


        std::vector<double> row;

        while (ss >> buf)
            row.push_back(buf);


        table.push_back(row);


    }

    ifs.close();

    return table;
}

Basically, create a vector of vectors. The only difficulty was splitting by whitespace, which the stringstream object takes care of. This may not be the most efficient way of doing it, but it certainly works in the short term!

Also I'm looking for a replacement for the deprecated atof function, but never mind. It just needs some memory leak checking (it shouldn't have any, since most of the objects are std objects) and I'm done.
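For reference, as far as I know atof is not formally deprecated, but it cannot report a failed conversion (it just returns 0). `std::strtod` from `<cstdlib>` can, because it reports where parsing stopped. A minimal sketch, with `parseDoubles` as a made-up helper name:

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Parse every whitespace-separated number on a line using std::strtod.
// Stops at the first token that is not a valid number.
std::vector<double> parseDoubles(const std::string& line) {
    std::vector<double> out;
    const char* p = line.c_str();
    char* end = 0;
    for (;;) {
        double v = std::strtod(p, &end);
        if (end == p) break;  // no characters consumed: nothing more to parse
        out.push_back(v);
        p = end;              // continue right after the number just read
    }
    return out;
}
```

Whether this is faster than stream extraction would have to be measured on the real files.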

Thanks for all your help

Simon Walker
Why use atof? What is wrong with `ifstream is(file); float f; is >> f;`
graham.reeds
cheers just changed it, much cleaner
Simon Walker
A: 

I've seen your answer, and while it's not bad, I don't think it's ideal either. At least as I understand your original question, the first comment basically specifies how many columns you'll have in each of the remaining rows. For example, the one you've given ("1 4 6 28") contains four numbers, which can be interpreted as saying each succeeding line will contain 4 numbers.

Assuming that's correct, I'd use that data to optimize reading the data. In particular, after that, (again, as I understand it) the file just contains row after row of numbers. That being the case, I'd put all the numbers together into a single vector, and use the number of columns from the header to index into the rest:

class matrix { 
    std::vector<double> data;
    int columns;
public:
    // a matrix is 2D, with fixed number of columns, and arbitrary number of rows.
    matrix(int cols) : columns(cols) {}

    // just read raw data from stream into vector:    
    std::istream &read(std::istream &stream) { 
        std::copy(std::istream_iterator<double>(stream), 
                  std::istream_iterator<double>(), 
                  std::back_inserter(data));
        return stream;
   }

   // Do 2D addressing by converting rows/columns to a linear address
   // If you want to check subscripts, use vector.at(x) instead of vector[x].
   double operator()(size_t row, size_t col) { 
       return data[row*columns+col];
   }
};

This is all pretty straightforward -- the matrix knows how many columns it has, so you can do x,y indexing into the matrix, even though it stores all its data in a single vector. Reading the data from the stream just means copying that data from the stream into the vector. To deal with the header, and simplify creating a matrix from the data in a stream, we can use a simple function like this:

matrix read_data(std::string name) { 
    // read one line from the stream.
    std::ifstream in(name.c_str());
    std::string line;
    std::getline(in, line);

    // break that up into space-separated groups:
    std::istringstream temp(line);
    std::vector<std::string> counter;
    std::copy(std::istream_iterator<std::string>(temp), 
              std::istream_iterator<std::string>(),
              std::back_inserter(counter));

    // the number of columns is the number of groups, -1 for the leading '#'.
    matrix m(counter.size()-1);

    // Read the remaining data into the matrix.
    m.read(in);
    return m;
}

As it's written right now, this depends on your compiler implementing the "Named Return Value Optimization" (NRVO). Without that, the compiler will copy the entire matrix (probably a couple of times) when it's returned from the function. With the optimization, the compiler pre-allocates space for a matrix, and has read_data() generate the matrix in place.
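The linear addressing can also be tried in isolation. The following is a stripped-down sketch, not the class above: `FlatMatrix` is a hypothetical stand-in that adds a `rows()` accessor and uses `at()` for checked access, and it takes its values from a vector instead of a stream.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the matrix class, reduced to the indexing logic.
class FlatMatrix {
    std::vector<double> data;
    std::size_t columns;
public:
    FlatMatrix(std::size_t cols, const std::vector<double>& values)
        : data(values), columns(cols) {}

    // Row count falls out of the total size; a trailing partial row is ignored.
    std::size_t rows() const { return columns ? data.size() / columns : 0; }
    std::size_t cols() const { return columns; }

    // Checked access: at() throws std::out_of_range past the end of the data.
    double operator()(std::size_t row, std::size_t col) const {
        return data.at(row * columns + col);
    }
};
```

With the two example rows from the question and 4 columns, element (1, 0) maps to linear index 1*4+0 = 4, which is the first value of the second row.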

Jerry Coffin
Had to change a couple of things to get this to work: `return data[row*cols+col];` -> `return data[row*columns+col];` and `std::getline(line, in);` -> `std::getline(in, line);`. It's good, but I feel I understand my answer better.
Simon Walker
@Simon: Quite true -- the code wasn't tested, so a couple of bugs isn't a big surprise. Thanks for pointing them out -- I'll fix those in the code.
Jerry Coffin