views:

39

answers:

1

Here's an example data table (the data are fake, but in the same format as my business data coming from an external system):

 R1   Pears,Apples,Bananas    10
      Oranges         5
 R2   Apricots        15
      Bananas         222
      Apples,Oranges  15

The data are in a string. Columns are tab-delimited, lines are CRLF-delimited. Output should be the same. Multiple values are comma-delimited.

Here's the "flattened" output desired:

 R1   Pears     10
 R1   Apples    10
 R1   Bananas   10
 R1   Oranges   5
 R2   Apricots  15
 R2   Bananas   222
 R2   Apples    15
 R2   Oranges   15

Each column is filled down where there are blanks, and columns with multiple values (comma-delimited) are duplicated and filled down in the other columns.

  • Complicating assumption: the number of columns in the input is arbitrary.
  • Simplifying assumption: only one column will be comma-delimited.

I'm working through this with a relatively naive solution (loops and some recursion), but I'd love to see if this is a situation where LINQ or some other solution would be more appropriate.

I'm working in VB.NET at the moment, but C# is fine too.

Here's my answer so far... optimization isn't terribly important for me, but clarity is always a good thing.

Public Shared Function FlattenPlainTextTable(ByVal InputTable As String) As String
  Const RowDelimiter As String = vbCRLF
  Const ColDelimiter As String = vbTab
  Const MultDelimiter As String = ","
  ''// First pass: determine the number of columns and which column if any contains
  ''// multiple values; build a new collection of rows pre-split into columns so the
  ''// split work can be reused for the second pass.
  Dim rows As New System.Collections.Generic.List(Of String())
  Dim maxColumnIndex As Integer
  Dim multiValueColumnIndex As Integer = -1
  Dim thisRow() As String
  Dim foundComma As Integer
  For Each row As String In Split(InputTable, RowDelimiter)
    thisRow = Split(row, ColDelimiter)
    rows.Add(ThisRow)
    maxColumnIndex = Math.Max(maxColumnIndex, thisRow.GetUpperBound(0))
    If multiValueColumnIndex < 0 Then
      ''// We haven't found a multi-value column yet. Function only supports,
      ''// at maximum, one multi-value column. Look for a comma in this cell,
      ''// and if found, make this the multi-value column.
      foundComma = row.IndexOf(MultDelimiter)
      If foundComma > 0 Then
        Dim beforeComma As String
        beforeComma = row.Substring(0, foundComma - 1)
        ''// The column index is the number of column delimiters found before
        ''// the comma. Faster than splitting into an array and looking for
        ''// the comma.
        multiValueColumnIndex = beforeComma.Length - beforeComma.Replace(ColDelimiter, "").Length
      End If
    End If
  Next
  ''// If no multi-value column was found, pretend it's the first column--simpler
  ''// logic to assume there is one.
  If multiValueColumnIndex < 0 Then multiValueColumnIndex = 0
  ''// Initialize lastRow with the maximum number of columns found in the original
  ''// lastRow is used to fill down values where blanks are found on subsequent rows.
  Dim lastRow() As String = Split(New String(","c, maxColumnIndex + 1), ",")
  Dim outputTable As New StringBuilder()
  Dim thisVal As String
  Dim MuliValueColumnValues() As String
  Dim multiValues() As String
  For Each ThisRow In Rows
    ''// Get the multi-value column's data first so we know how many times to repeat the row.
    If ThisRow.GetUpperBound(0) < multiValueColumnIndex Then
      ''// If the multi-value column is after the jagged edge of this row, create an array of
      ''// one blank value.
      MuliValueColumnValues = Split("", MultDelimiter) ''// assures GetUpperBound(0)=0
    Else
      MuliValueColumnValues = Split(ThisRow(multiValueColumnIndex), MultDelimiter)
    End If
    ''// Repeat this row for as many multi-value values were found
    For RowRepeat As Integer = 0 To MuliValueColumnValues.GetUpperBound(0)
      For columnIndex As Integer = 0 To MaxColumnIndex
        If columnIndex = multiValueColumnIndex Then
          ''// Value is one of the multiple-value values
          thisVal = MuliValueColumnValues(RowRepeat)
        ElseIf ThisRow.GetUpperBound(0) < columnIndex Then
          ''// This row's jagged edge already ended, default to blank
          thisVal = ""
        Else
          thisVal = ThisRow(columnIndex)
        End If
        If thisVal = "" Then
          ''// Fill down
          thisVal = lastRow(columnIndex)
        Else
          ''// Change the fill-down value for next time. (Fill-down only
          ''// fills down the *last* value in the multi-value column, not
          ''// the whole set.)
          lastRow(columnIndex) = thisVal
        End If
        If columnIndex > 0 Then outputTable.Append(ColDelimiter)
        outputTable.Append(thisVal)
      Next
      outputTable.Append(RowDelimiter)
    Next
  Next
  return outputTable.ToString()
End Function
A: 

The following works for a fixed number of columns, how do you want to handle arbitrary number of columns? Can you give an example?

    Dim result As New System.Text.StringBuilder
    Dim fakeData As String = _
    "R1" & vbTab & "Pears,Apples,Bananas" & vbTab & "10" & vbCrLf & _
    vbTab & "Oranges" & vbTab & "5" & vbCrLf & _
    "R2" & vbTab & "Apricots" & vbTab & "15" & vbCrLf & _
    vbTab & "Bananas" & vbTab & "222" & vbCrLf & _
    vbTab & "Apples,Oranges" & vbTab & "15"

    Dim allLines() As String = Microsoft.VisualBasic.Split(fakeData, vbCrLf)
    Dim firstColText As String = String.Empty
    For Each line As String In allLines
        Dim allCols() As String = Microsoft.VisualBasic.Split(line, vbTab)
        Dim allFruits() As String = Microsoft.VisualBasic.Split(allCols(1), ",")
        If allCols(0).Length > 0 Then firstColText = allCols(0)
        For Each fruit As String In allFruits
            result.Append(firstColText).Append(vbTab).Append(fruit).Append(vbTab).Append(allCols(2)).Append(vbCrLf)
        Next
    Next
Tim Schmelter
Any particular reason to use Microsoft.VisualBasic.Split function instead of the built-in framework split method? performing fakeData.Split(vbCrLf) should function the same way.
Bytemaster
Out of habit. Its more flexible and mostly that what i need.If i remember correctly, the Microsoft.VisualBasic.Split function checks if all charcters of the given string are found, the .Net Alternative looks only if one of the characters were found.But there is nothing evil with using the "oldschool"-Split-function ;)http://stackoverflow.com/questions/226517/is-the-microsoft-visualbasic-namespace-true-net-code
Tim Schmelter
@Bytemaster, the String.Split() method by default only supports Char or Char() arrays. There's an overload for an array of string delimiters, but it's a pain to use compared to the more straighforward legacy framework syntax when there's only one delimiter string and no special options needed.
richardtallent