views: 59

answers: 4

To start, I'd like to clarify that I'm not extremely well versed in C#. That said, a project I'm working on in C# using .NET 3.5 has me building a class to read from and export files that contain multiple fixed-width formats based on the record type.

There are currently 5 record types, indicated by the first character of each line, and each one maps to a specific line format. The problem I have is that the types are distinct from each other.

Record type 1 has 5 columns and signifies the beginning of the file

Record type 3 has 10 columns and signifies the beginning of a batch
Record type 5 has 69 columns and signifies a transaction
Record type 7 has 12 columns and signifies the end of the batch (summarizes it)
(these 3 repeat throughout the file, one group per batch)

Record type 9 has 8 columns and signifies the end of the file (summarizes it)

Is there a good library out there for these kinds of fixed-width files? I've seen a few good ones that want to load the entire file in as one spec, but that won't do.

Roughly 250 of these files are read at the end of every month, and their combined file size averages about 300 MB. Efficiency is very important to me in this project.

Based on my knowledge of the data, I've built a class hierarchy of what I "think" an object should look like...

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace Extract_Processing
{
    class Extract
    {
        private string mFilePath;
        private string mFileName;
        private FileHeader mFileHeader;
        private FileTrailer mFileTrailer;
        private List<Batch> mBatches;       // A file can have many batches

        public Extract(string filePath)
        { /* Using file path some static method from another class would be called to parse in the file somehow */ }

        public string ToString()
        { /* Iterates all objects down the hierarchy to return the file in string format */ }

        public void ToFile()
        { /* Calls some method in the file parse static class to export the file back to storage somewhere */ }
    }

    class FileHeader
    { /* ... contains data types for all fields in this format, ToString etc */ }

    class Batch
    {
        private string mBatchNumber;                // Should this be pulled out of the batch header to make LINQ querying simpler for this data set?
        private BatchHeader mBatchHeader;
        private BatchTrailer mBatchTrailer;
        private List<Transaction> mTransactions;    // A batch can have multiple transactions

        public string ToString()
        { /* Iterates through batches to return what the entire batch would look like in string format */ }
    }

    class BatchHeader
    { /* ... contains data types for all fields in this format, ToString etc */ }

    class Transaction
    { /* ... contains data types for all fields in this format, ToString etc */ }

    class BatchTrailer
    { /* ... contains data types for all fields in this format, ToString etc */ }

    class FileTrailer
    { /* ... contains data types for all fields in this format, ToString etc */ }

}

I've left out many constructors and other methods, but I think the idea should be pretty solid. I'm looking for ideas and critique of the approach I'm considering since, again, I'm not knowledgeable about C# and execution time is the highest priority.

The biggest question, besides some critique, is: how should I bring in this file? I've brought in files in other languages, such as VBA using FSO methods, a Microsoft Access ImportSpec to read in the file (5 times, once for each spec... wow, that was inefficient!), and a 'Cursor' object in Visual FoxPro (which was FAST, but again had to be done five times), but I'm looking for hidden gems in C# if such things exist.

Thanks for reading my novel; let me know if you're having issues understanding it. I'm taking the weekend to go over this design to see if I buy it and want to take the effort to implement it this way.

A: 

The best library for these sorts of things is FileHelpers.
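
For what it's worth, FileHelpers has a MultiRecordEngine that can handle multiple record layouts in a single pass by dispatching each line to a record class through a selector delegate. A minimal sketch follows; the record classes, field names, and widths are invented stand-ins for the real spec:

using System;
using FileHelpers;

// Hypothetical record classes; one class per record type in the real spec.
[FixedLengthRecord]
public class FileHeaderRecord
{
    [FieldFixedLength(1)]
    public string RecordType;

    [FieldFixedLength(30)]
    public string FileDescription;
    // ... remaining columns per the type-1 spec
}

[FixedLengthRecord]
public class TransactionRecord
{
    [FieldFixedLength(1)]
    public string RecordType;

    [FieldFixedLength(10)]
    public string AccountNumber;

    [FieldFixedLength(12)]
    public string Amount;
    // ... remaining columns per the type-5 spec
}

public class ExtractReader
{
    // One pass over the file: the selector maps each line to a record
    // class by its first character, so there is no re-reading per record type.
    static Type CustomSelector(MultiRecordEngine engine, string recordLine)
    {
        switch (recordLine[0])
        {
            case '1': return typeof(FileHeaderRecord);
            case '5': return typeof(TransactionRecord);
            // ... cases for record types 3, 7 and 9
            default: return null;   // returning null skips the line
        }
    }

    public static object[] ReadExtract(string path)
    {
        MultiRecordEngine engine = new MultiRecordEngine(
            typeof(FileHeaderRecord), typeof(TransactionRecord));
        engine.RecordSelector = new RecordTypeSelector(CustomSelector);
        return engine.ReadFile(path);
    }
}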

lomaxx
Going to download this and fool around with it. My fear is that I'll have to open the entire file 5 times, once for each 'specification' class implemented by this assembly.
Mohgeroth
+1  A: 

One critique I have is that you are not correctly implementing ToString.

    public string ToString()

Should be:

    public override string ToString()
Mark Byers
Used to java doing that for me, thanks for the critique!
Mohgeroth
+1  A: 

FileHelpers is nice. It has a couple of drawbacks in that it doesn't seem to be under active development anymore, and it makes you use public variables for your fields instead of letting you use properties. But otherwise it's good.

What are you doing with these files? Are you loading them into SQL Server? If so, and you're looking for FAST and SIMPLE, I'd recommend a design like this:

  1. Make staging tables in your database that correspond to each of the 5 record types. Consider adding a LineNumber column and a FileName column too just so you can trace problems back to the file itself.
  2. Read the file line by line and parse it out into your business objects, or directly into ADO.NET DataTable objects that correspond to your tables.
  3. If you used business objects, apply your data transformations or business rules and then put the data into DataTable objects that correspond to your tables.
  4. Once each DataTable reaches an appropriate BatchSize (say 1000 records), use the SqlBulkCopy object to pump the data into your staging tables. After each SqlBulkCopy operation, clear out the DataTable and continue processing.
  5. If you didn't want to use business objects, do any final data manipulation in SQL Server.

You could probably accomplish the whole thing in under 500 lines of C#.
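
Here's a rough sketch of steps 2 and 4 to illustrate; the staging table name, its columns, and the record-type filter are hypothetical:

using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;

class StagingLoader
{
    const int BatchSize = 1000;

    static void LoadTransactions(string path, string connectionString)
    {
        // Hypothetical staging table for record type 5 (transactions).
        DataTable staging = new DataTable("staging_Transaction");
        staging.Columns.Add("LineNumber", typeof(int));
        staging.Columns.Add("FileName", typeof(string));
        staging.Columns.Add("RawLine", typeof(string));

        using (SqlConnection conn = new SqlConnection(connectionString))
        using (StreamReader reader = new StreamReader(path))
        {
            conn.Open();
            string line;
            int lineNumber = 0;
            while ((line = reader.ReadLine()) != null)
            {
                lineNumber++;
                if (line.Length == 0 || line[0] != '5') continue; // transactions only

                staging.Rows.Add(lineNumber, Path.GetFileName(path), line);

                if (staging.Rows.Count >= BatchSize)
                {
                    BulkCopy(conn, staging);
                    staging.Clear(); // reuse the DataTable for the next batch
                }
            }
            if (staging.Rows.Count > 0)
                BulkCopy(conn, staging); // flush the final partial batch
        }
    }

    static void BulkCopy(SqlConnection conn, DataTable table)
    {
        using (SqlBulkCopy bulk = new SqlBulkCopy(conn))
        {
            bulk.DestinationTableName = table.TableName;
            bulk.WriteToServer(table);
        }
    }
}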

mattmc3
I definitely don't want to put this in SQL Server since the raw extract files for a single year total over 3 GB! These files stand as our backup, and we want certain things out of them for both billing and client record keeping, but the reality is that if someone wants to know something about client X at a point in time, we can just unzip the files (the compression rate is 98%) and run a process that reads through them and pulls out what the client wants to know. Reading through this data fast helps so we can build a nice interface later to drill down into the data. Great information though, thanks!
Mohgeroth
+1  A: 

Biggest question besides some critique is, how should I bring in this file?

I do not know of any good library for file I/O, but the reading is pretty straightforward.

Instantiate a StreamReader using a 64 KB buffer to limit disk I/O operations (my estimate is an average of 1,500 transactions per file at the end of the month).

Now you can stream over the file:
1) Use the Read method at the beginning of each line to determine the record type.
2) Use the ReadLine method with String.Split to get the column values.
3) Create the object from the column values.

or

You could just buffer the data from a Stream manually and use IndexOf + Substring for more performance (if done right).

Also, if the lines weren't text columns but primitive data types in binary format, you could use the BinaryReader class as a very easy and performant way to read the objects.
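
A minimal sketch of the first approach, reading whole lines and switching on the first character (and, since the records are fixed width, slicing columns with Substring rather than Split; the offsets and widths shown are invented):

using System;
using System.IO;
using System.Text;

class ExtractParser
{
    static void Parse(string path)
    {
        // 64 KB buffer to cut down on disk I/O operations.
        using (StreamReader reader = new StreamReader(path, Encoding.ASCII, false, 64 * 1024))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.Length == 0) continue;
                switch (line[0]) // first character identifies the record type
                {
                    case '1': /* parse the file header */ break;
                    case '3': /* parse a batch header */ break;
                    case '5': ParseTransaction(line); break;
                    case '7': /* parse a batch trailer */ break;
                    case '9': /* parse the file trailer */ break;
                }
            }
        }
    }

    static void ParseTransaction(string line)
    {
        // Fixed-width slicing with Substring; offsets and widths are illustrative.
        string accountNumber = line.Substring(1, 10).Trim();
        string amount = line.Substring(11, 12).Trim();
        // ... remaining columns per the type-5 spec
    }
}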

Jaroslav Jandek
I got better performance and less headache using the MultiRecordEngine from FileHelpers for what I'm trying to do. Not the type of approach I would have hoped for, but it's efficient enough.
Mohgeroth