views: 153
answers: 4

Hey guys, I'm writing a small program to read in a csv with a variable number of lines and have a question about best practices:

Is the best way to create storage for the data on each line to make an array that holds the data structures of the CSV (one per line of the file)?

The size allocated to the array could be set to a large number (for example, more lines than there would ever reasonably be in the CSV); I have seen this in many examples on the web.

Or... is there a smarter way to tell how much space would be needed, such as counting the lines beforehand, or dynamically adding space by using a linked list as opposed to an array with static storage allocation? Any best practices? I don't think choosing a random number seems very slick...

Any thoughts would be greatly appreciated. Thanks!

+4  A: 

A few best practices:

  1. Never expect input from the outside to be correct.
  2. Make it transactional (import everything or roll back).
  3. If possible, leverage a third-party API or library like this http://www.codeproject.com/KB/database/CsvReader.aspx or this http://sourceforge.net/projects/javacsv/ to greatly reduce reinventing the wheel. If you're working in C but can use C++, consider this approach: http://stackoverflow.com/questions/415515/how-can-i-read-and-manipulate-csv-file-data-in-c
Nissan Fan
This was helpful... but didn't really address the data structure portion of the question. Thanks.
Tyler Brock
Check out the very end of it, with the link to a similar question in C++. As far as allocation goes, you may want to allocate a default amount and realloc as it's exceeded. You're better off, however, using a collection like in the C++ example, which means allocation is handled under the covers.
Nissan Fan
Ok. You've earned your check mark :-)
Tyler Brock
Umm, just to be a gadfly here, but: #1 is certainly true, but not relevant to the question. #2 may or may not be true depending on the problem, but again, not really relevant to the question.
Jay
+1  A: 

Use a library or count the lines beforehand. You could also use some kind of list data structure to avoid worrying about the line count.
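A minimal sketch of the count-first idea, in C. Everything here is illustrative: the file name, the fixed per-line buffer size, and the assumption that every record ends with a newline (a final line without one would not be counted).

```c
#include <stdio.h>
#include <stdlib.h>

#define MAX_LINE 1024   /* assumed upper bound on one CSV line */

int main(void)
{
    FILE *fp = fopen("data.csv", "r");   /* placeholder file name */
    if (fp == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }

    /* First pass: count the newlines to learn how many lines there are. */
    size_t nlines = 0;
    int c;
    while ((c = fgetc(fp)) != EOF)
        if (c == '\n')
            nlines++;

    /* Allocate exactly one slot per line -- no arbitrary constant. */
    char (*records)[MAX_LINE] = malloc(nlines * sizeof *records);
    if (records == NULL && nlines > 0) {
        fclose(fp);
        return EXIT_FAILURE;
    }

    /* Second pass: rewind and read each line into its slot. */
    rewind(fp);
    size_t i = 0;
    while (i < nlines && fgets(records[i], MAX_LINE, fp) != NULL)
        i++;

    printf("read %zu lines\n", i);
    free(records);
    fclose(fp);
    return EXIT_SUCCESS;
}
```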

+1 to Nissan Fan for recommending a library; in my opinion, unless you're trying to learn a lot about CSV parsing and its edge cases, that is always the way to go.

marr75
Thanks for the response; these are basically the two approaches I outlined above as alternatives. I wonder why all the examples on the web favor the arbitrary-array-size approach. It seems sloppy to me.
Tyler Brock
+1  A: 

There's really no "best practice". Keep in mind the particular structure of your data and how quickly you want to read it, store it, query it, sort it, and find/eliminate/ignore duplicates. A tree, a linked list, hashing, ordered data, etc. are all good options depending on those factors.

I agree with the other fellows. No need to reinvent the wheel. There must be gazillions of samples about how to parse CSV.

However, when choosing your favorite library, a few words of caution:

  1. Best practice: Never assume that the input contains a specific (small or very large) amount of data. Corollary: don't store all of the data in memory, only as little as is reasonable, and assume that whatever the size of your array, the data may be bigger than it; design around that assumption.
  2. Another best practice: Test corner cases (no input, very large input, only one line or element, etc.)
  3. CSV files are not standard. For example, some programs that generate CSV just ignore the following cases:

  3.1. Commas within strings. For example, "Smith, John" is not the same as Smith, John.
  3.2. Special characters within the strings, such as apostrophes, tabs, or quotes. How are they handled? For example, Microsoft typically uses double-double quotes to represent quotes inside a string (see the sketch just below).
  3.3. And, of course, be careful with the end-of-line format (Unix- or Windows-style).
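To make 3.1 and 3.2 concrete, here is a rough sketch of reading one field while honoring quoted commas and doubled quotes. It is illustrative, not a complete parser: it doesn't strip Windows '\r' characters, report overflow, or suppress the empty trailing field for a file ending in a newline.

```c
#include <stdio.h>

/* Reads one field from fp into buf (at most buflen-1 characters).
 * Returns the character that ended the field: ',', '\n', or EOF. */
static int read_field(FILE *fp, char *buf, size_t buflen)
{
    size_t i = 0;
    int c = fgetc(fp);
    int quoted = (c == '"');

    if (quoted)
        c = fgetc(fp);              /* consume the opening quote */

    while (c != EOF) {
        if (quoted && c == '"') {
            c = fgetc(fp);
            if (c != '"')
                break;              /* lone quote: field is closed */
            /* else: doubled quote -- fall through and store one " */
        } else if (!quoted && (c == ',' || c == '\n')) {
            break;                  /* unquoted delimiter ends the field */
        }
        if (i + 1 < buflen)
            buf[i++] = (char)c;
        c = fgetc(fp);
    }
    buf[i] = '\0';
    return c;
}

int main(void)
{
    char field[256];
    int delim;

    /* Echo each field of stdin on its own line, in brackets. */
    do {
        delim = read_field(stdin, field, sizeof field);
        printf("[%s]\n", field);
    } while (delim != EOF);
    return 0;
}
```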

Be sure to take a look at a good-sized sample of actual data. Never believe the users (nor the programmers :-).

Good luck. Luis.

luiscolorado
Wow, thanks for this helpful input. You guys are great!
Tyler Brock
+2  A: 

If you can process the data as you read it, rather than saving it all and processing it afterward, that eliminates the problem entirely.
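A minimal sketch of that streaming approach, assuming each record fits on one line of bounded length; the file name and the limit are placeholders:

```c
#include <stdio.h>

#define MAX_LINE 1024   /* assumed upper bound on one CSV line */

int main(void)
{
    FILE *fp = fopen("data.csv", "r");   /* placeholder file name */
    char line[MAX_LINE];

    if (fp == NULL) {
        perror("fopen");
        return 1;
    }
    /* One reusable buffer: handle each record as it arrives, then
     * forget it -- nothing to size up front. */
    while (fgets(line, sizeof line, fp) != NULL) {
        /* parse and process this one record here */
    }
    fclose(fp);
    return 0;
}
```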

I avoid counting the lines first, as this requires reading the entire file twice. I suppose if the file is small the efficiency hit is not a big deal, but if you know that the file is small, then you could just allocate a big enough space.

So in general, my approach -- if I can't process the file a line at a time -- is to use a data structure that can grow, like a linked list. Then for each line I just allocate a new block.

Depending on what you're up to, you might use a dynamic array instead: allocate an amount of space that ought to be enough for the normal case. If you fill it, allocate a bigger space, copy the first to the second, delete the first, and then continue working with the second. If you fill that, repeat the process. This can mean a lot of data movement, but the amount of space used in the end will be less than with a linked list because you don't have the pointers, and it will be faster to traverse because you're not chasing pointers and possibly running all over virtual memory.
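A sketch of that grow-copy-free strategy, with illustrative numbers (a starting capacity of 16, doubling growth, a fixed line length); a real program would also deal with over-long lines and report allocation failures:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE 256    /* assumed upper bound on one CSV line */

int main(void)
{
    FILE *fp = fopen("data.csv", "r");   /* placeholder file name */
    if (fp == NULL)
        return 1;

    size_t cap = 16;                     /* "normal case" starting size */
    size_t n = 0;
    char (*lines)[MAX_LINE] = malloc(cap * sizeof *lines);
    if (lines == NULL) {
        fclose(fp);
        return 1;
    }

    char buf[MAX_LINE];
    while (fgets(buf, sizeof buf, fp) != NULL) {
        if (n == cap) {
            /* Full: allocate a bigger space, copy the first to the
             * second, delete the first, continue with the second. */
            char (*bigger)[MAX_LINE] = malloc(cap * 2 * sizeof *bigger);
            if (bigger == NULL)
                break;                   /* out of memory: stop reading */
            memcpy(bigger, lines, n * sizeof *lines);
            free(lines);
            lines = bigger;
            cap *= 2;
        }
        strcpy(lines[n++], buf);         /* fgets NUL-terminates buf */
    }

    printf("stored %zu lines\n", n);
    free(lines);
    fclose(fp);
    return 0;
}
```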

Jay
Great answer, thanks!
Tyler Brock
Note that for *"allocate a bigger space, copy the first to the second, delete the first"* you can use the `realloc()` function - this is precisely its intended use (and it may even be able to avoid the copies sometimes).
caf
Did not know, thanks!
Tyler Brock
@caf: Frankly I'd forgotten about realloc -- I don't do a lot of C lately. If you're making the block significantly larger, I'd guess it almost always must allocate new space and free the old. It could only avoid this if there just happens to be available space "bordering" on the original allocation. But point taken, it is probably the "right" way to do it.
Jay
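Following up on caf's point, the whole grow-copy-free step above collapses into one `realloc()` call. A minimal sketch of the usual idiom; the temporary pointer matters because `realloc()` returns NULL on failure while leaving the old block allocated, so assigning its result directly to the only pointer you have would leak it:

```c
#include <stdio.h>
#include <stdlib.h>

#define MAX_LINE 256    /* assumed upper bound on one CSV line */

int main(void)
{
    size_t cap = 16;
    char (*lines)[MAX_LINE] = malloc(cap * sizeof *lines);
    if (lines == NULL)
        return 1;

    /* Growth step: assign realloc's result to a temporary first.
     * On failure it returns NULL but leaves the old block intact,
     * so `lines` must not be overwritten until we know it worked. */
    char (*bigger)[MAX_LINE] = realloc(lines, cap * 2 * sizeof *bigger);
    if (bigger != NULL) {
        lines = bigger;
        cap *= 2;
    }

    printf("capacity is now %zu lines\n", cap);
    free(lines);
    return 0;
}
```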