views:

703

answers:

4

I'm using VB.NET to process a long fixed-length record. The simplest option seems to be loading the whole record into a string and using Substring to access the fields by position and length. But it seems like there will be some redundant processing within the Substring method that happens on every single invocation. That led me to wonder whether I might get better results using a stream- or array-based approach.

The content starts out as a byte array containing UTF8 character data. A couple of other approaches I've thought of are listed below.

  1. Loading the string into a StringReader and reading blocks of it at a time
  2. Converting the byte array into a char array and accessing the characters positionally within the array
  3. (This one seems dumb but I'll throw it out there) Copying the byte array to a memory stream and using a StreamReader

This is definitely premature optimization; the substring approach may be perfectly acceptable even if it's a few milliseconds slower. But I thought I'd ask before coding it, just to see if anyone could think of a reason to use one of the other approaches.
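For reference, the Substring approach I'm describing looks roughly like this (the field names, offsets, and widths here are made up for illustration):

```csharp
using System;
using System.Text;

class SubstringDemo
{
    static void Main()
    {
        // Hypothetical layout: Name in chars 0-9, Age in chars 10-12.
        byte[] raw = Encoding.UTF8.GetBytes("Jane      042");
        string record = Encoding.UTF8.GetString(raw); // decode the UTF-8 bytes once

        string name = record.Substring(0, 10).TrimEnd(); // one Substring call per field
        int age = int.Parse(record.Substring(10, 3));
        Console.WriteLine($"{name}, {age}");
    }
}
```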

+3  A: 

The primary cost of Substring is copying the extracted characters into a new string. Using Reflector you can see this:

private unsafe string InternalSubString(int startIndex, int length, bool fAlwaysCopy)
{
    if (((startIndex == 0) && (length == this.Length)) && !fAlwaysCopy)
    {
        return this;
    }
    string str = FastAllocateString(length);
    fixed (char* chRef = &str.m_firstChar)
    {
        fixed (char* chRef2 = &this.m_firstChar)
        {
            wstrcpy(chRef, chRef2 + startIndex, length);
        }
    }
    return str;
}

Now, to get there (note that this is InternalSubString, not Substring() itself), the call has to pass through 5 checks on startIndex, length, and such.

If you are referencing the same substring multiple times, then it may well be worth extracting everything once and discarding the giant string. You will incur some overhead in the arrays that store all these substrings.

If it's generally "one-off" access, then use Substring; otherwise, consider partitioning the record up. Perhaps System.Data.DataTable would be of use? If you're doing multiple accesses and parsing into other data types, then a DataTable looks more attractive to me. If you only need one record in memory at a time, then a Dictionary<string, object> should be sufficient to hold one record (field names have to be unique).

Alternatively, you could write a custom, generic class that handles fixed-length record reading for you. Indicate the start index and type of each field. Each field's length is inferred from the start of the next field (the exception is the last field, whose length is inferred from the total record length). The values can be auto-converted using the likes of int.Parse(), double.Parse(), bool.Parse(), etc.

RecordParser r = new RecordParser();
r.AddField("Name", 0, typeof(string));
r.AddField("Age", 48, typeof(int));
r.AddField("SystemId", 58, typeof(Guid));
r.RecordLength(80);

Dictionary<string, object> data = r.Parse(recordString);

If reflection suits your fancy:

[RecordLength(80)]
public class MyRecord
{
    [RecordFieldOffset(0)]
    public string Name { get; set; }

    [RecordFieldOffset(48)]
    public int Age { get; set; }

    [RecordFieldOffset(58)]
    public Guid SystemId { get; set; }
}

Simply run through the properties, using PropertyInfo.PropertyType to know how to handle the substring extracted from the record; pull the offsets and total length out of the attributes; and return an instance of your class with the data populated. Essentially, you use reflection to gather the same information you would otherwise pass to RecordParser.AddField() and RecordLength() in my previous suggestion.
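A minimal sketch of that reflection pass might look like the following. The attribute and parser here are hypothetical types I'm defining for illustration, not existing framework APIs; Convert.ChangeType covers the common primitive conversions (string, int, double, bool), though a type like Guid would need special-casing:

```csharp
using System;
using System.Linq;
using System.Reflection;

[AttributeUsage(AttributeTargets.Property)]
class RecordFieldOffsetAttribute : Attribute
{
    public int Offset { get; }
    public RecordFieldOffsetAttribute(int offset) { Offset = offset; }
}

class RecordParser<T> where T : new()
{
    public T Parse(string record)
    {
        T result = new T();

        // Pair each attributed property with its offset, sorted so that each
        // field's length can be inferred from where the next field starts.
        var fields = typeof(T).GetProperties()
            .Select(p => new { Prop = p, Attr = p.GetCustomAttribute<RecordFieldOffsetAttribute>() })
            .Where(x => x.Attr != null)
            .OrderBy(x => x.Attr.Offset)
            .ToArray();

        for (int i = 0; i < fields.Length; i++)
        {
            int start = fields[i].Attr.Offset;
            // Last field runs to the end of the record.
            int end = i + 1 < fields.Length ? fields[i + 1].Attr.Offset : record.Length;
            string raw = record.Substring(start, end - start).Trim();
            fields[i].Prop.SetValue(result, Convert.ChangeType(raw, fields[i].Prop.PropertyType));
        }
        return result;
    }
}
```

The per-type reflection work (finding properties and attributes) could be cached in a static field so it runs once per record type rather than once per record.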

Then wrap it all up in a neat little, no-fuss class:

RecordParser<MyRecord> r = new RecordParser<MyRecord>();
MyRecord data = r.Parse(recordString);

You could even go so far as to call r.EnumerateFile("path\to\file") and use the yield return iterator syntax to parse out records:

RecordParser<MyRecord> r = new RecordParser<MyRecord>();
foreach (MyRecord data in r.EnumerateFile("foo.dat"))
{
    // Do stuff with record
}
Colin Burnett
Thanks for the internals. That's the kind of thing I wanted to know. Are you suggesting I create a DataTable with columns to match my record format, and then read the record sequentially and populate the DataTable as I go? That's an interesting suggestion I hadn't thought of.
John M Gant
My assumption is that you have multiple records to read and need multiple in memory at the same time. I guess if you have only a single record, then a Dictionary<string,object> should suffice, yes? You could probably even go so far as to write a class to handle this generically. I'll integrate this comment into my answer.
Colin Burnett
Sorry, I guess all my code is in C#. I don't know VB so I'm not sure how much can translate (namely the `yield return` syntax) but I assume it does.
Colin Burnett
+3  A: 

The fastest method will likely be the stream technique: assuming you can read each field sequentially, it keeps only what you need in memory and tracks where you are in the record as you go.
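As a sketch of that idea, assuming the UTF-8 byte array mentioned in the question, you could wrap the bytes in a MemoryStream and read each field's characters directly (the field widths here are invented):

```csharp
using System;
using System.IO;
using System.Text;

class StreamDemo
{
    static void Main()
    {
        // Hypothetical layout: Name in chars 0-9, Age in chars 10-12.
        byte[] raw = Encoding.UTF8.GetBytes("Jane      042");
        using (var reader = new StreamReader(new MemoryStream(raw), Encoding.UTF8))
        {
            string name = ReadField(reader, 10).TrimEnd();
            int age = int.Parse(ReadField(reader, 3));
            Console.WriteLine($"{name}, {age}");
        }
    }

    // Reads up to `length` characters from the reader into a string.
    // (A production version should loop, since Read may return fewer
    // characters than requested on some stream types.)
    static string ReadField(TextReader reader, int length)
    {
        var buffer = new char[length];
        int read = reader.Read(buffer, 0, length);
        return new string(buffer, 0, read);
    }
}
```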

Joel Coehoorn
That's kinda what I was thinking. And thanks for the link.
John M Gant
A: 

How are you reading the record in the first place?

Are you reading it character by character or line by line?

You may be able to process the fields on the fly while you are reading, so no Substring calls would be involved.

If you must read it all once and then process it, read it into a string and use a StringReader; that will allow you to read character by character, or a given number of characters at a time.
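For example, a StringReader lets you read each field as a block of characters without any further substring copies (field widths here are made up):

```csharp
using System;
using System.IO;

class StringReaderDemo
{
    static void Main()
    {
        // Hypothetical record: Name in the first 10 chars, Age in the next 3.
        var reader = new StringReader("Jane      042");
        var buffer = new char[10];

        reader.Read(buffer, 0, 10);                   // read the first field's characters
        string name = new string(buffer).TrimEnd();

        reader.Read(buffer, 0, 3);                    // read the next field into the same buffer
        int age = int.Parse(new string(buffer, 0, 3));

        Console.WriteLine($"{name}, {age}");
    }
}
```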

AppDeveloper
It starts out as a UTF-8 byte array. It's a single record with no lines. Reading the whole thing into a string and processing it with a StringReader was one of my options. Is that what you're recommending?
John M Gant
+1  A: 

What you're trying to do sounds like a parsing task. If I understand correctly, you're loading a huge string that contains multiple fields and their values. For that kind of scenario, Substring is not going to be particularly performant: for each field and its value, you need to call Substring with a specific position and length within the larger string, and that adds up to quite a lot of overhead.

As an alternative, you could implement a simple parser that processes your string once, from start to end, and retrieves each field and value in a single pass. Such a parser wouldn't need to be very complicated; a simple one-character-lookahead parser would probably do. You probably don't even need to tokenize your input: you could process it in streaming fashion, extracting one field, then its value, sticking it in some receptacle, and moving on.
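A single forward pass over a fixed-length record might look like this sketch (the layout table and field names are assumptions for illustration):

```csharp
using System;
using System.Collections.Generic;

class SinglePassDemo
{
    static void Main()
    {
        // Hypothetical layout: (field name, width) pairs in record order.
        var layout = new[] { ("Name", 10), ("Age", 3) };
        string record = "Jane      042";

        var values = new Dictionary<string, string>();
        int pos = 0;
        foreach (var (field, width) in layout)   // one forward pass, no backtracking
        {
            values[field] = record.Substring(pos, width).Trim();
            pos += width;
        }
        Console.WriteLine($"{values["Name"]}, {values["Age"]}");
    }
}
```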

If your input string is more complex than just a series of fields and values (i.e., it's structured), a more sophisticated parser would likely be needed. There are many tools, like ANTLR, that provide frameworks where you define a grammar, generate a parser from it, and get a nice API for consuming your parsed content.

jrista