I've been trying to deal with some delimited text files that have non-standard delimiters (not comma/quote or tab delimited). The delimiters are random ASCII characters that don't show up often between the delimiters. After searching around, I seem to have found no solutions in .NET that suit my needs, and the custom libraries that people have written for this seem to have flaws when it comes to gigantic input (a 4GB file where some field values can easily run to several million characters).

While this seems a bit extreme, it is actually standard in the Electronic Document Discovery (EDD) industry for some review software to have field values that contain the full contents of a document. For reference, I've previously done this in Python using the csv module with no problems.

Here's an example input:

Field delimiter = 
Quote character = þ

þFieldName1þþFieldName2þþFieldName3þþFieldName4þ
þValue1þþValue2þþValue3þþSomeVery,Very,Very,Large value(5MB or so)þ
...etc...

Edit: So I went ahead and created a delimited file parser from scratch. I'm kind of wary of using this solution, as it may be prone to bugs. It also doesn't feel "elegant" or correct to have to write my own parser for a task like this, and I have a feeling I probably didn't need to write one from scratch anyway.

A: 

While this doesn't help address the large input issue, a possible solution to the parsing issue might be a custom parser that uses the strategy pattern to supply a delimiter.
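
For illustration, a minimal sketch of that idea (all of these type names are hypothetical, and the comma field delimiter is a placeholder since the real one isn't shown above):

using System.Collections.Generic;

// Hypothetical: the parser takes its delimiters from a strategy object.
// A plain char parameter would work just as well, as the comments below note.
public interface IDelimiterStrategy
{
    char FieldDelimiter { get; }
    char QuoteCharacter { get; }
}

public class EddDelimiters : IDelimiterStrategy
{
    public char FieldDelimiter { get { return ','; } } // placeholder
    public char QuoteCharacter { get { return 'þ'; } }
}

public class DelimitedParser
{
    private readonly IDelimiterStrategy _delimiters;

    public DelimitedParser(IDelimiterStrategy delimiters)
    {
        _delimiters = delimiters;
    }

    // Splitting on the quote character leaves field values at the odd indices;
    // even indices hold the delimiter noise between quoted fields.
    public string[] ParseLine(string line)
    {
        string[] parts = line.Split(_delimiters.QuoteCharacter);
        var fields = new List<string>();
        for (int i = 1; i < parts.Length; i += 2)
            fields.Add(parts[i]);
        return fields.ToArray();
    }
}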

Ian P
"Strategy pattern" is useful when you need to parameterize by *code*, not *data*, and is primarily only exists because of Java and other languages without closures. A plain-jane parameter will do just fine.
Barry Kelly
If you consider that he may need to parse, at some future time, data that is not delimited (say EDI or something of the sort), then strategy would be ideal.
Ian P
All of you object-oriented nazis need to stop over-designing things.
TheSoftwareJedi
+5  A: 

Use the FileHelpers API. It's .NET and open source. It's extremely high-performance, using compiled IL code to set fields on strongly typed objects, and it supports streaming.

It supports all sorts of file types and custom delimiters; I've used it to read files larger than 4GB.
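
For what it's worth, here's a rough sketch of the kind of record class FileHelpers expects ([DelimitedRecord], [FieldQuoted] and FileHelperAsyncEngine are FileHelpers types, but the comma delimiter and two-field layout are placeholders for whatever the file actually uses):

using System;
using FileHelpers;

[DelimitedRecord(",")] // placeholder delimiter
public class EddRecord
{
    [FieldQuoted('þ', QuoteMode.AlwaysQuoted)]
    public string Field1;

    [FieldQuoted('þ', QuoteMode.AlwaysQuoted)]
    public string Field2;
}

class Program
{
    static void Main()
    {
        // The async engine streams records one at a time instead of
        // materializing the whole multi-gigabyte file in memory.
        var engine = new FileHelperAsyncEngine<EddRecord>();
        using (engine.BeginReadFile("c:\\test.file"))
        {
            foreach (EddRecord record in engine)
            {
                Console.WriteLine(record.Field1);
            }
        }
    }
}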

If for some reason that doesn't do it for you, try just reading line by line and splitting with String.Split:

public IEnumerable<string[]> CreateEnumerable(StreamReader input)
{
    string line;
    // Stream one line at a time; only the current line is ever in memory.
    while ((line = input.ReadLine()) != null)
    {
        // Split on the quote character; adjacent quotes yield empty entries.
        yield return line.Split('þ');
    }
}

That'll give you simple string arrays representing the lines in a streamy fashion that you can even LINQ into ;) Remember, however, that the IEnumerable is lazily evaluated, so don't close or alter the StreamReader until you've iterated (or forced a full load with ToList/ToArray or the like - given your file size, however, I assume you won't do that!).

Here's a good sample use of it:

using (StreamReader sr = new StreamReader("c:\\test.file"))
{
    var qry = from l in CreateEnumerable(sr).Skip(1)
              where l[3].Contains("something")
              select new { Field1 = l[0], Field2 = l[1] };
    foreach (var item in qry)
    {
        Console.WriteLine(item.Field1 + " , " + item.Field2);
    }
}
Console.ReadLine();

This will skip the header line, then print out the first two fields from the file wherever the 4th field contains the string "something". It will do this without loading the entire file into memory.

TheSoftwareJedi
In the FileHelpers API I don't seem to see a way to specify a quote char. It also seems to rely on predefined objects to read each record (which is a problem because the content and number of fields are unknown). Simply splitting on the delimiter doesn't cut it because it ignores quote characters.
llamaoo7
FileHelpers is open source, and definitely supports quote chars. As for splitting on the quote char, just add it to the String.Split!
TheSoftwareJedi
A: 

I would be inclined to use a combination of memory-mapped files (MSDN points to a .NET wrapper here) and a simple incremental parse, yielding back an IEnumerable of your records / text lines (or whatever).
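
A minimal sketch of that shape, using the MemoryMappedFile class that later shipped in .NET 4 (this answer predates it, hence the wrapper). Records are assumed to be newline-separated in a single-byte encoding, and the byte-at-a-time accessor reads are for clarity rather than speed:

using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

public static class MappedReader
{
    // Walk the mapped view incrementally and yield one record at a time.
    public static IEnumerable<string> ReadRecords(string path)
    {
        long length = new FileInfo(path).Length;
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var view = mmf.CreateViewAccessor(0, length, MemoryMappedFileAccess.Read))
        {
            var current = new List<byte>();
            for (long pos = 0; pos < length; pos++)
            {
                byte b = view.ReadByte(pos);
                if (b == (byte)'\n')
                {
                    yield return Encoding.ASCII.GetString(current.ToArray());
                    current.Clear();
                }
                else
                {
                    current.Add(b);
                }
            }
            if (current.Count > 0)
                yield return Encoding.ASCII.GetString(current.ToArray());
        }
    }
}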

Tim Jarvis
A: 

You mention that some fields are very, very big; if you try to read them into memory in their entirety, you may be getting yourself into trouble. I would read through the file in 8K (or similarly small) chunks, parse the current buffer, and keep track of state.
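
A bare-bones sketch of that loop (quote tracking is the only state shown here; a real parser would also track field boundaries, and treating 'þ' as the single byte 0xFE assumes a single-byte encoding):

using System.IO;

class ChunkedParser
{
    static void Main()
    {
        using (var stream = File.OpenRead("c:\\test.file"))
        {
            var buffer = new byte[8192];
            // State must live outside the read loop, because a field can
            // straddle a chunk boundary.
            bool insideQuotes = false;
            int bytesRead;
            while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < bytesRead; i++)
                {
                    if (buffer[i] == 0xFE) // 'þ' in a single-byte encoding
                        insideQuotes = !insideQuotes;
                    // ...accumulate field bytes / emit records here...
                }
            }
        }
    }
}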

What are you trying to do with this data that you are parsing? Are you searching for something? Are you transforming it?

Sam Saffron
Mainly transforming it for now (remove data, reorder columns, validate values, etc.). In the future I plan to use SQLite for more complicated operations. For now, it is very safe to assume one record will fit in memory.
llamaoo7
+1  A: 

Windows and high-performance I/O means: use I/O completion ports. You may have to do some extra plumbing to get it working in your case.

This is with the understanding that you want to use C#/.NET, and according to Joe Duffy:

18) Don’t use Windows Asynchronous Procedure Calls (APCs) in managed code.

I had to learn that one the hard way ;), but ruling out APC use, IOCP is the only sane option. It also supports many other types of I/O and is frequently used in socket servers.
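
In managed code you normally get IOCP indirectly rather than driving the port yourself: open the FileStream with FileOptions.Asynchronous, which binds the handle to the thread pool's I/O completion port, and use the async read APIs. A sketch in modern C# (async/await postdates this answer; BeginRead/EndRead played the same role at the time):

using System.IO;
using System.Threading.Tasks;

class IocpReader
{
    static async Task Main()
    {
        // FileOptions.Asynchronous requests overlapped I/O; read completions
        // are dispatched via the CLR thread pool's I/O completion port.
        using (var stream = new FileStream("c:\\test.file", FileMode.Open,
            FileAccess.Read, FileShare.Read, 8192, FileOptions.Asynchronous))
        {
            var buffer = new byte[8192];
            int bytesRead;
            while ((bytesRead = await stream.ReadAsync(buffer, 0, buffer.Length)) > 0)
            {
                // ...hand the chunk to the parser here...
            }
        }
    }
}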

As far as parsing the actual text goes, check out Eric White's blog for some streamlined stream use.

RandomNickName42
A: 

I don't see a problem with you writing a custom parser. The requirements seem sufficiently different from anything already provided by the BCL, so go right ahead.

"Elegance" is obviously a subjective thing. In my opinion, if your parser's API looks and works like a standard BCL "reader"-type API, then that is quite "elegant".

As for the large data sizes, make your parser work by reading one byte at a time and using a simple state machine to work out what to do. Leave the streaming and buffering to the underlying FileStream class. You should be OK with performance and memory consumption.

Example of how you might use such a parser class:

using (var reader = new EddReader(new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, 8192))) {
    // Read a small field
    string smallField = reader.ReadFieldAsText();
    // Read a large field
    Stream largeField = reader.ReadFieldAsStream();
}
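
For completeness, a rough skeleton of the byte-at-a-time state machine inside such a reader (EddReader and its members are the hypothetical names from the example above; only ReadFieldAsText is sketched, and a single-byte encoding is assumed):

using System;
using System.IO;
using System.Text;

public class EddReader : IDisposable
{
    private readonly Stream _stream;
    private const int Quote = 0xFE; // 'þ' in a single-byte encoding (assumption)

    public EddReader(Stream stream)
    {
        _stream = stream; // FileStream's internal buffer does the real I/O
    }

    // Two-state machine: outside quotes we skip delimiter/newline bytes;
    // inside quotes we accumulate the field. Returns null at end of stream.
    public string ReadFieldAsText()
    {
        var sb = new StringBuilder();
        bool insideQuotes = false;
        int b;
        while ((b = _stream.ReadByte()) != -1)
        {
            if (b == Quote)
            {
                if (insideQuotes)
                    return sb.ToString(); // closing quote ends the field
                insideQuotes = true;      // opening quote starts one
            }
            else if (insideQuotes)
            {
                sb.Append((char)b);
            }
        }
        return insideQuotes ? sb.ToString() : null;
    }

    public void Dispose()
    {
        _stream.Dispose();
    }
}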
Christian Hayter