Please feel free to correct me if I am wrong at any point...

I am trying to read a CSV (comma-separated values) file using the .NET file I/O classes. The problem is that some fields in this CSV file may contain soft carriage returns (i.e. solitary \r or \n markers, rather than the standard \r\n sequence that ends a line in text files). The standard text-mode I/O class StreamReader does not respect this convention: it treats the soft carriage returns as hard line endings, which compromises the integrity of the CSV data.

Using the BinaryReader class now seems to be the only option left, but BinaryReader has no ReadLine() method, hence the need to implement ReadLine() on my own.

My current approach reads one character from the stream at a time, appending to a StringBuilder until a \r\n is found (ignoring all other characters, including solitary \r or \n), and then returns the StringBuilder's contents via ToString().

But I wonder: is this the most efficient way of implementing the ReadLine() function? Please enlighten me.

A: 

How about simply preprocessing the file?

Replace the soft carriage returns with something unique.

For the record, CSV files with linefeeds in the data, that's bad design.
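The preprocessing idea above could be sketched like this (hypothetical helper names; U+0001 is used as the "something unique" placeholder, assumed never to occur in the data):

```csharp
using System.Text.RegularExpressions;

public static class CsvPreprocessor
{
    // Placeholder assumed never to occur in the actual data.
    public const char Placeholder = '\u0001';

    // Replace solitary \r or \n (not part of a \r\n pair) with the
    // placeholder, so StreamReader.ReadLine then only breaks on real
    // \r\n line endings.
    public static string Normalize(string text)
    {
        return Regex.Replace(text, @"\r(?!\n)|(?<!\r)\n", Placeholder.ToString());
    }

    // After parsing, restore the soft line breaks inside each field.
    public static string Restore(string field)
    {
        return field.Replace(Placeholder, '\n');
    }
}
```

After Normalize, the file can be read line by line with a plain StreamReader, and each parsed field passed through Restore to get the original soft line breaks back.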

Lasse V. Karlsen
I think singular line feeds in CSV data may not be a bad idea as long as you are on Windows/DOS. This design has been around for quite some time; it is how it's done in Excel, for example, if you have a line break in a cell (press Alt+Enter to introduce a line break within a cell).
SDX2000
A: 

You could read a bigger chunk at a time, decode it to a string using Encoding.GetString, and then split it into lines on "\r\n" using string.Split, or even pick off the head of the string using string.Substring(0, string.IndexOf("\r\n")) and leave the rest for processing as the next line. Remember to prepend the unfinished last line from the previous read to the next chunk.
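A rough sketch of that chunked approach (hypothetical names; this assumes a single-byte encoding such as ASCII, since a multi-byte character split across two buffer reads would need a stateful Decoder instead of Encoding.GetString):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class ChunkedLineReader
{
    // Yields lines terminated only by hard \r\n pairs; solitary \r or \n
    // characters stay inside the returned line.
    public static IEnumerable<string> ReadHardLines(Stream stream, Encoding encoding)
    {
        byte[] buffer = new byte[8192];
        string carry = "";   // unfinished line carried over from the previous chunk
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            carry += encoding.GetString(buffer, 0, read);
            int pos;
            while ((pos = carry.IndexOf("\r\n", StringComparison.Ordinal)) >= 0)
            {
                yield return carry.Substring(0, pos);
                carry = carry.Substring(pos + 2);
            }
        }
        if (carry.Length > 0)
            yield return carry;   // trailing data with no final \r\n
    }
}
```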

Guge
The underlying stream already buffers the reads to bigger chunks, doesn't it?
configurator
@config: yes, it does.
MusiGenesis
I was more worried about appending strings of length 1 to a "StringBuffer" (does he mean StringBuilder?), and the frequent heap allocations. Much better to do fewer operations on bigger strings.
Guge
@Guge Yes I meant StringBuilder, my apologies :)
SDX2000
+4  A: 

It probably is. In terms of complexity, it goes through each character exactly once, so it is O(n) (where n is the length of the stream); that's not a problem. For reading a single character at a time, a BinaryReader is your best bet.

What I would do is make a class

using System;
using System.IO;
using System.Text;

public class LineReader : IDisposable
{
    private BinaryReader reader;

    public LineReader(Stream stream) { reader = new BinaryReader(stream); }

    public string ReadLine()
    {
        StringBuilder result = new StringBuilder();
        char lastChar = reader.ReadChar();
        // an EndOfStreamException here would propagate to the caller

        try
        {
            while (true)
            {
                char newChar = reader.ReadChar();
                if (lastChar == '\r' && newChar == '\n')
                    return result.ToString();

                result.Append(lastChar);
                lastChar = newChar;
            }
        }
        catch (EndOfStreamException)
        {
            result.Append(lastChar);
            return result.ToString();
        }
    }

    public void Dispose()
    {
        reader.Close();
    }
}

Or something like that.

(WARNING: the code has not been tested and is provided AS IS without warranty of any kind, expressed or implied. Should this program prove defective or destroy the planet, you assume the cost of all necessary servicing, repair or correction.)

configurator
Wow! That was pretty quick. Thanks for your answer I would like to vote it up but I haven't earned enough reputation yet :)
SDX2000
A: 

Your approach sounds fine. One way to improve the efficiency of your method might be to build each line in a regular string (i.e. not a StringBuilder), and then append each complete line string to your StringBuilder. See this article for a further explanation - StringBuilder is not automatically the best choice here.

It probably will matter little, though.

MusiGenesis
That's not completely true: using String.Join on a string[] would be faster. But what about building the string[]? You would need a List<> or a LinkedList<>, which would in turn take more time to build than using a StringBuilder.
configurator
I just read the article and I agree with "configurator". The first order of business here is to comb the incoming stream for \r\n one char at a time and build a string from the rejected chars. StringBuilder outperforms String in this case.
SDX2000
@MusiGenesis thanks for the link to the article though. You have indeed enlightened me :)
SDX2000
It's not the case here, but programmers tend to overuse StringBuilder in simple scenarios where it isn't really needed. "You're fine" is a pretty boring answer, so I thought I would add something.
MusiGenesis
@config: you don't need to use a List or a LinkedList. The easiest way is "string s = System.Text.ASCIIEncoding.ASCII.GetString(b);" where b is an array of bytes. If I get bored today I'll benchmark these different options.
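A minimal sketch of the kind of benchmark MusiGenesis mentions (hypothetical, not from the original comments; absolute timings will vary by machine and runtime):

```csharp
using System;
using System.Diagnostics;
using System.Text;

class ConcatBenchmark
{
    static void Main()
    {
        const int n = 100000;

        var sw = Stopwatch.StartNew();
        string s = "";
        for (int i = 0; i < n; i++)
            s += "x";            // allocates a brand-new string every iteration
        sw.Stop();
        Console.WriteLine("string +=     : " + sw.ElapsedMilliseconds + " ms");

        sw = Stopwatch.StartNew();
        var sb = new StringBuilder();
        for (int i = 0; i < n; i++)
            sb.Append('x');      // amortized growth of one internal buffer
        string s2 = sb.ToString();
        sw.Stop();
        Console.WriteLine("StringBuilder : " + sw.ElapsedMilliseconds + " ms");

        Debug.Assert(s == s2);   // both approaches build the same string
    }
}
```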
MusiGenesis
+1  A: 

You might want to look at using an ODBC/OleDb connection to do this. If you point the data source of an OleDb connection at a directory containing CSV files, you can then query it as if each CSV file were a table.
Check connectionstrings.com (http://www.connectionstrings.com/?carrier=textfile) for the correct connection string.
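A sketch of what that could look like with the Jet text driver (the path, file name, and connection-string details here are illustrative assumptions; check connectionstrings.com for the exact string for your setup, and note this driver is Windows-only):

```csharp
using System;
using System.Data.OleDb;

class CsvQuery
{
    static void Main()
    {
        // Data Source points at the *directory*; each CSV file in it
        // becomes a queryable "table". HDR=Yes treats row 1 as headers.
        string connStr =
            @"Provider=Microsoft.Jet.OLEDB.4.0;" +
            @"Data Source=C:\Data\;" +
            @"Extended Properties=""text;HDR=Yes;FMT=Delimited""";

        using (var conn = new OleDbConnection(connStr))
        {
            conn.Open();
            using (var cmd = new OleDbCommand("SELECT * FROM [orders.csv]", conn))
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine(reader[0]);
            }
        }
    }
}
```

The driver handles quoted fields, so embedded line breaks inside quoted CSV fields are parsed for you rather than being treated as row separators.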

Kevin
Hmm, an interesting solution!
SDX2000