I have a text file that is automatically generated by an older computer system daily.

Unfortunately, the columns in this file are not delimited, and they are not exactly fixed width either: the width of each column can change from day to day, depending on how many characters the data in that column needs. The file does have column headings, so I want to find the width of each column from the heading row. Here is an example of the column heading row:

JOB_NO[variable amount of white space chars]FILE_NAME[variable amount of ws chars]PROJECT_CODE[variable amount of ws chars][carriage return]

What I want is the index of the first character of each column and the index of its last whitespace character, both taken from the heading row. For the first column, that means the index of the "J" in JOB_NO and the index of the last whitespace character before the "F" in FILE_NAME.

I guess I should also mention that the columns may not always be in the same order from day to day but they will have the same header names.

Any thoughts about how to do this in VB.NET or C#? I know I can use String.IndexOf("JOB_NO") to get the index of the start of the column, but how do I get the index of the last whitespace character in each column (that is, the last whitespace before the first non-whitespace character that starts the next column)?

+2  A: 

Get the index of each column heading, e.g.:

var jPos = str.IndexOf("JOB_NO");
var filePos = str.IndexOf("FILE_NAME");
var projPos = str.IndexOf("PROJECT_CODE");  

Then put them in an array (called ar below) and sort it from min to max. Now you know the order of your columns. The last space of a column is at [the next column's index] - 1.

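int[] ar = { jPos, filePos, projPos };
Array.Sort(ar); // ascending, so ar[0] is the leftmost column
int firstColLastSpace = ar[1] - 1;
int secColLastSpace = ar[2] - 1;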
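The same idea can be put together as a small C# sketch that also remembers which name belongs to which position, so it keeps working when the column order changes. The class name HeaderColumns, the helper FindColumns and the tuple layout are illustrative and not part of the answer above. It assumes each heading occurs exactly once in the heading row and that no heading is a substring of another.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HeaderColumns
{
    // Hypothetical helper: returns (name, start, end) per column, ordered by position.
    // "End" is the index of the last whitespace char that still belongs to the column.
    public static List<(string Name, int Start, int End)> FindColumns(
        string header, IEnumerable<string> names)
    {
        // Locate each known heading; ordinal comparison avoids culture-specific matching.
        var starts = names
            .Select(n => (Name: n, Start: header.IndexOf(n, StringComparison.Ordinal)))
            .Where(c => c.Start >= 0)   // skip headings that are missing from this file
            .OrderBy(c => c.Start)
            .ToList();

        var result = new List<(string Name, int Start, int End)>();
        for (int i = 0; i < starts.Count; i++)
        {
            // A column ends one char before the next column starts;
            // the last column runs to the end of the heading row.
            int end = i + 1 < starts.Count ? starts[i + 1].Start - 1 : header.Length - 1;
            result.Add((starts[i].Name, starts[i].Start, end));
        }
        return result;
    }
}
```

Calling HeaderColumns.FindColumns(headerLine, new[] { "JOB_NO", "FILE_NAME", "PROJECT_CODE" }) returns the columns in the order they appear that day, whatever order the names were passed in.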
Kamyar
A: 

Borrowing heavily from a previous answer I've given... To get column positions, how about this? I'm making the assumption that column names do not contain spaces.

IEnumerable<int> positions=Regex
    .Matches("JOB_NUM   FILE_NAME         SOME_OTHER_THING",@"(?<=^| )\w")
    .Cast<Match>()
    .Select(m=>m.Index);

or (verbose version of the above)

//first get a MatchCollection
//this regular expression matches a word character that immediately follows
//either the start of the line or a space, i.e. the first char of each of
//your column headers
MatchCollection matches=Regex
    .Matches("JOB_NUM   FILE_NAME         SOME_OTHER_THING",@"(?<=^| )\w");
//convert to IEnumerable<Match>, so we can use Linq on our matches
IEnumerable<Match> matchEnumerable=matches.Cast<Match>();
//For each match, select its Index
IEnumerable<int> positions=matchEnumerable.Select(m=>m.Index);
//convert to array (if you want)
int[] pos_arr=positions.ToArray();
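To get from the start positions to column widths, each start can be paired with the distance to the next start. The following is a minimal sketch under the same assumption (headings contain no spaces and are separated by plain spaces, not tabs). The names ColumnWidths and FromHeader are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ColumnWidths
{
    // Hypothetical helper: pairs each column start with its width in the heading row.
    public static (int Start, int Length)[] FromHeader(string header)
    {
        // Same expression as above: a word char at line start or right after a space.
        int[] starts = Regex.Matches(header, @"(?<=^| )\w")
            .Cast<Match>()
            .Select(m => m.Index)
            .ToArray();

        return starts
            .Select((start, i) =>
            {
                // The next column's start (or the end of the line) bounds this column.
                int next = i + 1 < starts.Length ? starts[i + 1] : header.Length;
                return (Start: start, Length: next - start);
            })
            .ToArray();
    }
}
```

The index of the last whitespace character of a column is then Start + Length - 1, and dataLine.Substring(Start, Length) cuts the matching value out of a data line that is at least as long as the heading row.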
spender
I'm sorry, I'm having trouble figuring out how to use the output of that expression. Does "new Regex..." return a value?
avword
I've rewritten my answer to make it clear what's going on. It wasn't necessary to instantiate a new Regex to get a MatchCollection, but yes... "new Regex" returns a new Regex instance, on which I was calling the Matches method. As Regex has a static Matches method, it's probably better to use that instead. The output of the expression is an IEnumerable<int>. (I've indicated this in my edit). You can call ToList or ToArray on it if you're happier with those types of collection.
spender
A: 

Here is an alternative answer that uses a small class, which you can later reuse to parse your lines. The Fields collection acts as a template for pulling the fields out of each line. This solution keeps the whitespace inside each field's width rather than ignoring it, since I presume the widths vary from day to day and you need them to locate the data:

Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module Module1

Sub Main()

    Dim line As String = "JOB_NUM      FILE_NAME         SOME_OTHER_THING  "
    Dim Fields As List(Of Field) = New List(Of Field)
    Dim oField As Field = Nothing

    Dim mc As MatchCollection = Regex.Matches(
        line, "(?<=^| )\w")

    For Each m As Match In mc
        'Build one Field per heading match
        oField = New Field
        oField.Start = m.Index
        Dim nextM As Match = m.NextMatch()
        If Not nextM.Success Then
            'This is the last field: it runs to the end of the line
            oField.Length = line.Length - oField.Start
        Else
            'The field ends where the next heading starts
            oField.Length = nextM.Index - oField.Start
        End If
        'The field name is the heading text without its padding
        oField.Name = line.Substring(oField.Start, oField.Length).Trim()
        'Add to the list
        Fields.Add(oField)
    Next

    'Check the Fields: you can use line.substring(ofield.start, ofield.length)
    'to parse each line of your file.

    For Each f As Field In Fields
        Console.WriteLine("Field Name: " & f.Name)
        Console.WriteLine("Start: " & f.Start)
        Console.WriteLine("Length: " & f.Length)
    Next

    Console.Read()
End Sub

Class Field
    Public Property Name As String
    Public Property Start As Integer
    Public Property Length As Integer
End Class

End Module
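For the parsing step mentioned in the comments above, here is a rough C# sketch of how such a field template could be applied to one data line. RowParser and Parse are hypothetical names, and the fields are passed as plain (name, start, length) tuples rather than the Field class. It assumes a data line may be shorter than the heading row (for example when trailing spaces were stripped), so the offsets are clamped before calling Substring.

```csharp
using System;
using System.Collections.Generic;

public static class RowParser
{
    // Hypothetical helper: cuts one data line into trimmed values keyed by column name.
    // Each field is (name, start, length), as derived from the heading row.
    public static Dictionary<string, string> Parse(
        string line, IEnumerable<(string Name, int Start, int Length)> fields)
    {
        var row = new Dictionary<string, string>();
        foreach (var f in fields)
        {
            // Clamp to the line length so that short lines do not throw.
            int start = Math.Min(f.Start, line.Length);
            int length = Math.Min(f.Length, line.Length - start);
            row[f.Name] = line.Substring(start, length).Trim();
        }
        return row;
    }
}
```

Because the values are keyed by heading name, the calling code does not need to care which order the columns arrive in on a given day.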

jangeador