Here's an example data table (the data are fake, but in the same format as my business data coming from an external system):
R1 Pears,Apples,Bananas 10
Oranges 5
R2 Apricots 15
Bananas 222
Apples,Oranges 15
The data are in a string. Columns are tab-delimited, lines are CRLF-delimited. Output should be the same. Multiple values are comma-delimited.
Here's the "flattened" output desired:
R1 Pears 10
R1 Apples 10
R1 Bananas 10
R1 Oranges 5
R2 Apricots 15
R2 Bananas 222
R2 Apples 15
R2 Oranges 15
Each column is filled down where there are blanks, and columns with multiple values (comma-delimited) are duplicated and filled down in the other columns.
- Complicating assumption: the number of columns in the input is arbitrary.
- Simplifying assumption: only one column will be comma-delimited.
I'm working through this with a relatively naive solution (loops and some recursion), but I'd love to see if this is a situation where LINQ or some other solution would be more appropriate.
I'm working in VB.NET at the moment, but C# is fine too.
Here's my answer so far... optimization isn't terribly important for me, but clarity is always a good thing.
Public Shared Function FlattenPlainTextTable(ByVal InputTable As String) As String
Const RowDelimiter As String = vbCRLF
Const ColDelimiter As String = vbTab
Const MultDelimiter As String = ","
''// First pass: determine the number of columns and which column if any contains
''// multiple values; build a new collection of rows pre-split into columns so the
''// split work can be reused for the second pass.
Dim rows As New System.Collections.Generic.List(Of String())
Dim maxColumnIndex As Integer
Dim multiValueColumnIndex As Integer = -1
Dim thisRow() As String
Dim foundComma As Integer
For Each row As String In Split(InputTable, RowDelimiter)
thisRow = Split(row, ColDelimiter)
rows.Add(ThisRow)
maxColumnIndex = Math.Max(maxColumnIndex, thisRow.GetUpperBound(0))
If multiValueColumnIndex < 0 Then
''// We haven't found a multi-value column yet. Function only supports,
''// at maximum, one multi-value column. Look for a comma in this cell,
''// and if found, make this the multi-value column.
foundComma = row.IndexOf(MultDelimiter)
If foundComma > 0 Then
Dim beforeComma As String
beforeComma = row.Substring(0, foundComma - 1)
''// The column index is the number of column delimiters found before
''// the comma. Faster than splitting into an array and looking for
''// the comma.
multiValueColumnIndex = beforeComma.Length - beforeComma.Replace(ColDelimiter, "").Length
End If
End If
Next
''// If no multi-value column was found, pretend it's the first column--simpler
''// logic to assume there is one.
If multiValueColumnIndex < 0 Then multiValueColumnIndex = 0
''// Initialize lastRow with the maximum number of columns found in the original
''// lastRow is used to fill down values where blanks are found on subsequent rows.
Dim lastRow() As String = Split(New String(","c, maxColumnIndex + 1), ",")
Dim outputTable As New StringBuilder()
Dim thisVal As String
Dim MuliValueColumnValues() As String
Dim multiValues() As String
For Each ThisRow In Rows
''// Get the multi-value column's data first so we know how many times to repeat the row.
If ThisRow.GetUpperBound(0) < multiValueColumnIndex Then
''// If the multi-value column is after the jagged edge of this row, create an array of
''// one blank value.
MuliValueColumnValues = Split("", MultDelimiter) ''// assures GetUpperBound(0)=0
Else
MuliValueColumnValues = Split(ThisRow(multiValueColumnIndex), MultDelimiter)
End If
''// Repeat this row for as many multi-value values were found
For RowRepeat As Integer = 0 To MuliValueColumnValues.GetUpperBound(0)
For columnIndex As Integer = 0 To MaxColumnIndex
If columnIndex = multiValueColumnIndex Then
''// Value is one of the multiple-value values
thisVal = MuliValueColumnValues(RowRepeat)
ElseIf ThisRow.GetUpperBound(0) < columnIndex Then
''// This row's jagged edge already ended, default to blank
thisVal = ""
Else
thisVal = ThisRow(columnIndex)
End If
If thisVal = "" Then
''// Fill down
thisVal = lastRow(columnIndex)
Else
''// Change the fill-down value for next time. (Fill-down only
''// fills down the *last* value in the multi-value column, not
''// the whole set.)
lastRow(columnIndex) = thisVal
End If
If columnIndex > 0 Then outputTable.Append(ColDelimiter)
outputTable.Append(thisVal)
Next
outputTable.Append(RowDelimiter)
Next
Next
return outputTable.ToString()
End Function