I need a fast and efficient method to read a space-separated file of numbers into an array. The files are formatted this way:

4 6
1 2 3 4 5 6
2 5 4 3 21111 101
3 5 6234 1 2 3
4 2 33434 4 5 6

The first row is the dimension of the array [rows columns]. The lines following contain the array data.

The data may also be formatted without any newlines like this:

4 6
1 2 3 4 5 6 2 5 4 3 21111 101 3 5 6234 1 2 3 4 2 33434 4 5 6

I can read the first line and initialize an array with the row and column values. Then I need to fill the array with the data values. My first idea was to read the file line by line and use the split function, but the second format gives me pause, because the entire array data would be loaded into memory all at once. Some of these files are in the hundreds of MBs. The second method would be to read the file in chunks and parse them piece by piece. Maybe somebody else has a better way of doing this?

A: 

Unless the machine you're parsing these text files on is limited, files of a few hundred MB should still fit in memory. I'd suggest going with your first approach of reading by line and using split.

If memory becomes an issue, your second approach of reading in chunks should work fine.

Basically what I'm saying is just to implement it and measure if performance is a problem.
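A minimal sketch of that read-and-split approach (the file name and the `MatrixLoader`/`ReadMatrix` names are made up for illustration). Splitting on any whitespace, rather than reading line by line, means the same code handles both layouts:

```csharp
using System;
using System.IO;

static class MatrixLoader
{
    // Reads the whole file, splits on any whitespace, and fills the array.
    // Works for both layouts because it ignores line boundaries entirely.
    public static int[,] ReadMatrix(string path)
    {
        string[] tokens = File.ReadAllText(path)
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        int rows = int.Parse(tokens[0]), cols = int.Parse(tokens[1]);
        var result = new int[rows, cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                result[i, j] = int.Parse(tokens[2 + i * cols + j]);
        return result;
    }
}
```

Passing `null` as the separator array makes `Split` break on any whitespace, including newlines.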

Lester
But: hundreds of MBs - let's assume that's ASCII; so double that when creating the string in .NET. Now split it, so *at least* double again, plus overheads and the new array. Plus the array of integers (4 bytes each). It would have to be x64 before you can confidently say it'll fit in memory...
Marc Gravell
+2  A: 

What's your usage pattern for the data once it's loaded? Do you generally need to touch every array element or will you just make sparse/random access?

If you need to touch most array elements, loading it into memory will probably be the best way to go.

If you need to just access certain elements, you might want to lazy load the elements that you need into memory. One strategy would be to determine which of the two layouts the file uses (with/without newline) and create an algorithm to load a particular element directly from disk as needed (seek the given file offset, read and parse). To efficiently re-access the same element it could make sense to keep the element, once read, in a dictionary indexed by the offset. Check the dictionary first before going to the file for a particular value.
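A sketch of that lazy-load idea (the `LazyMatrix` type is hypothetical). Because the numbers are variable-width, this version scans forward counting tokens instead of seeking a computed byte offset, and memoizes each element by its token index; a real implementation could build an offset index on a first pass to make the disk reads true seeks:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Lazily fetches individual elements from disk and caches them,
// so repeated access to the same element hits the dictionary.
// Assumes well-formed files of non-negative integers.
class LazyMatrix
{
    private readonly string path;
    private readonly Dictionary<long, int> cache = new Dictionary<long, int>();
    public int Rows { get; }
    public int Cols { get; }

    public LazyMatrix(string path)
    {
        this.path = path;
        Rows = ReadNthToken(0);
        Cols = ReadNthToken(1);
    }

    public int this[int row, int col]
    {
        get
        {
            long index = 2L + (long)row * Cols + col; // token position in file
            int value;
            if (!cache.TryGetValue(index, out value))
                cache[index] = value = ReadNthToken(index);
            return value;
        }
    }

    // Scans the file and parses the n-th whitespace-separated integer.
    private int ReadNthToken(long n)
    {
        using (var reader = File.OpenText(path))
        {
            long seen = -1;
            int c;
            while ((c = reader.Read()) != -1)
            {
                if (char.IsWhiteSpace((char)c)) continue;
                seen++;
                int value = c - '0';
                while ((c = reader.Read()) != -1 && !char.IsWhiteSpace((char)c))
                    value = value * 10 + (c - '0');
                if (seen == n) return value;
            }
            throw new EndOfStreamException();
        }
    }
}
```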

On general principle, I would take the simple route unless your testing proves that you need to go a more complicated route (avoid premature optimization).

Eric J.
+1  A: 

Read the file a character at a time. If it's whitespace, start a new number. If it's a digit, use it.

For numbers with multiple digits, keep a counter variable:

int counter = 0;
while (fileOpen) {
    char ch = readChar(); // use your imagination to define this method.
    if (isDigit(ch)) {
        counter *= 10;
        counter += asciiToDecimal(ch);
    } else if (isWhitespace(ch)) {
        appendToArray(counter);
        counter = 0;
    } else {
        // Error?
    }
}

Edited for clarification.

TreDubZedd
Needs some tweaking for cases with multiple whitespace characters in a row (or line breaks), but otherwise +1.
dtb
Yeah; this isn't meant to be a catch-all solution--merely a guide to get the OP thinking. Edit: Also, the call to 'appendToArray()' will need to be something else for the first two numbers.
TreDubZedd
+1  A: 

How about:

    static void Main()
    {
        // sample data
        File.WriteAllText("my.data", @"4 6
1 2 3 4 5 6
2 5 4 3 21111 101
3 5 6234 1 2 3
4 2 33434 4 5 6");

        using (Stream s = new BufferedStream(File.OpenRead("my.data")))
        {
            int rows = ReadInt32(s), cols = ReadInt32(s);
            int[,] arr = new int[rows, cols];
            for(int y = 0 ; y < rows ; y++)
                for (int x = 0; x < cols; x++)
                {
                    arr[y, x] = ReadInt32(s);
                }
        }
    }

    private static int ReadInt32(Stream s)
    { // edited to improve handling of multiple spaces etc
        int b;
        // skip any preceding non-digit characters
        while ((b = s.ReadByte()) >= 0 && (b < '0' || b > '9')) {  }
        if (b < 0) throw new EndOfStreamException();

        int result = b - '0';
        while ((b = s.ReadByte()) >= '0' && b <= '9')
        {
            result = result * 10 + (b - '0');
        }
        return result;
    }

Actually, this isn't very specific about the delimiters - it'll pretty much assume that anything that isn't a digit is a delimiter, and it only supports ASCII (you could use a reader if you need other encodings).
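A reader-based variant of the same tokenizer might look like this (a sketch assuming non-negative integers, as in the stream version; the `TokenReader` wrapper class is just for illustration):

```csharp
using System;
using System.IO;

static class TokenReader
{
    // Same skip-then-accumulate tokenizer, but over a TextReader, so any
    // encoding a StreamReader can decode will work.
    public static int ReadInt32(TextReader reader)
    {
        int c;
        // skip any preceding non-digit characters
        while ((c = reader.Read()) >= 0 && (c < '0' || c > '9')) { }
        if (c < 0) throw new EndOfStreamException();

        int result = c - '0';
        while ((c = reader.Read()) >= '0' && c <= '9')
            result = result * 10 + (c - '0');
        return result;
    }
}
```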

Marc Gravell
A: 

Lets assume we've read the entire file into a string.
You say the first two are rows and columns, so what we definitely need is to parse the numbers.
After that, we can take the first two, create our data structure, and fill it accordingly.

var fileData = File.ReadAllText(...).Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
var convertedToNumbers = fileData.Select(entry => int.Parse(entry)).ToList();
int rows = convertedToNumbers[0];
int columns = convertedToNumbers[1];
// Now we have the number of rows, number of columns, and the data.
int[,] resultData = new int[rows, columns];
// Skipping over the rows and columns values.
var indexableData = convertedToNumbers.Skip(2).ToList();
for (int i = 0; i < rows; i++)
    for (int j = 0; j < columns; j++)
        resultData[i, j] = indexableData[i * columns + j];

An alternative would be to read the first two from a stream, initialize the array, and then read n values at a time, which would be complicated. Also, it's best to keep files open for the shortest time possible.

Rubys
we can't assume that we can read the entire file into memory at once.
luke
A: 

you want to stream the file into memory and parse as you go.

private static IEnumerable<string> StreamAsSpaceDelimited(this StreamReader reader)
{
    StringBuilder builder = new StringBuilder();
    int v;
    while((v = reader.Read()) != -1)
    {
        char c = (char) v;
        if(Char.IsWhiteSpace(c))
        {
            if(builder.Length >0)
            {
                yield return builder.ToString();
                builder.Clear();
            }
        }
        else
        {
            builder.Append(c);
        }
    }
    yield break;
}

this will parse the file into a collection of space-delimited strings (lazily) and then you can read them as ints just like:

using(StreamReader sr = new StreamReader("filename"))
{
    var nums = sr.StreamAsSpaceDelimited().Select(s => int.Parse(s));
    var enumerator = nums.GetEnumerator();
    enumerator.MoveNext();
    int numRows = enumerator.Current;
    enumerator.MoveNext();
    int numColumns = enumerator.Current;
    int r = 0, c = 0;
    int[][] destArray = new int[numRows][];
    for (int i = 0; i < numRows; i++)
        destArray[i] = new int[numColumns];
    while(enumerator.MoveNext())
    {
        destArray[r][c] = enumerator.Current;
        c++;
        if(c == numColumns)
        {
            c = 0;
            r++;
            if(r == numRows)
               break; // we are done
        }
    }
}

Because we use iterators, this should never read more than a few chars at a time. This is a common approach used to parse large files (for example, this is how LINQ2CSV works).

luke
A: 

Here are two methods

IEnumerable<int[]> GetArrays(string filename, bool skipFirstLine)
{
    using (StreamReader reader = new StreamReader(filename))
    {
        if (skipFirstLine && !reader.EndOfStream)
            reader.ReadLine();

        while (!reader.EndOfStream)
        {
            string temp = reader.ReadLine();
            int[] array = temp.Trim().Split().Select(s => int.Parse(s)).ToArray();
            yield return array;
        }
    }
}

int[][] GetAllArrays(string filename, bool skipFirstLine)
{
    int skipNumber = 0;
    if (skipFirstLine )
        skipNumber = 1;
    int[][] array = File.ReadAllLines(filename).Skip(skipNumber).Select(line => line.Trim().Split().Select(s => int.Parse(s)).ToArray()).ToArray();
    return array;
}

If you're dealing with large files, the first would likely be preferable. If files are small, then the second can load the entire thing into a jagged array.

Anthony Pegram
you can't use ReadLine because the file might contain arbitrarily long lines (as in multiple MBs), so you might get an out-of-memory error
luke
Ah, I failed to notice the second file structure problem.
Anthony Pegram