I have a function that converts a .csv file to a DataTable. One of the columns I am converting is a field of names that contain a comma, e.g. "Doe, John". When converting, the function treats this as two separate fields because of the comma. I need the DataTable to hold this as one field: Doe, John.

Function CSV2DataTable(ByVal filename As String, ByVal sepChar As String) As DataTable
    Dim reader As System.IO.StreamReader
    Dim table As New DataTable
    Dim colAdded As Boolean = False

    Try
        ''# open a reader for the input file, and read line by line
        reader = New System.IO.StreamReader(filename)
        Do While reader.Peek() >= 0
            ''# read a line and split it into tokens, divided by the specified 
            ''# separators
            Dim tokens As String() = System.Text.RegularExpressions.Regex.Split _
                (reader.ReadLine(), sepChar)
            ''# add the columns if this is the first line
            If Not colAdded Then
                For Each token As String In tokens
                    table.Columns.Add(token)
                Next
                colAdded = True
            Else
                ''# create a new empty row
                Dim row As DataRow = table.NewRow()
                ''# fill the new row with the token extracted from the current 
                ''# line
                For i As Integer = 0 To table.Columns.Count - 1
                    row(i) = tokens(i)
                Next
                ''# add the row to the DataTable
                table.Rows.Add(row)
            End If
        Loop

        Return table
    Finally
        If Not reader Is Nothing Then reader.Close()
    End Try
End Function
+1  A: 

Instead of rolling your own solution, have you considered using FileHelpers?

http://filehelpers.sourceforge.net/

It should address your issue.

ggonsalv
I would rather make a change to mine if you could help. Thanks in advance.
Nick LaMarca
+3  A: 

Don't use a .Split() function to read your CSV data. Not only does it cause the kind of error you just ran into, but it's slower as well. You need a state-machine-based parser. That will be faster and will make it easier to correctly handle quote-enclosed text.

I have an example here:
http://stackoverflow.com/questions/1544721/reading-csv-files-in-c/1544743#1544743

and there's also a highly respected CSV reader on CodeProject you can use:
http://www.codeproject.com/KB/database/CsvReader.aspx


You'd use my code like this:

Function DataTableFromCSV(ByVal filename As String) As DataTable
    Dim table As New DataTable
    Dim colAdded As Boolean = False

    For Each record As IList(Of String) In CSV.FromFile(filename)
        ''# Add column headers on first iteration
        If Not colAdded Then
            For Each token As String In record
                table.Columns.Add(token)
            Next token
            colAdded = True
        Else
            ''# add the row to the table
            Dim row As DataRow = table.NewRow()
            For i As Integer = 0 To table.Columns.Count - 1
                row(i) = record(i)
            Next
            table.Rows.Add(row)
        End If
    Next record

    Return table
End Function  

If you're using .NET 3.5 or later, I'd write it a little differently to pull the column creation out of the For Each loop (using type inference and .Take(1)), but I wanted to be sure this would work with .NET 2.0 as well.

Joel Coehoorn
Joel, is there no easy way to fix my code to account for this? I don't want to start over, and I don't want to use any open source DLL either.
Nick LaMarca
It is technically possible to rewrite the regular expression to account for your quote-enclosed text. However, this will not be a simple regex. Whenever you have nested components in a regex (i.e., commas inside quotes), you should start to look at alternatives. In this case, because you can guarantee you're only nesting one level deep, a regex is still possible, but it's not going to be a trivial expression. You'll likely find it easier to move to state-machine-based code, especially as that code will be faster and is already written for you. Don't succumb to NIH.
Joel Coehoorn
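For reference, a minimal sketch of the kind of regex described above, assuming the separator is a comma and that no quoted field spans more than one line. The module name QuotedSplitSketch and the helper SplitCsvLine are hypothetical and do not appear in the original thread. The lookahead splits on a comma only when an even number of double quotes follows it, which means the comma sits outside any quoted field.

```vbnet
Imports System.Text.RegularExpressions

Module QuotedSplitSketch
    ' Matches a comma only when an even number of double quotes follows it,
    ' i.e. when the comma is not inside a quoted field.
    Private ReadOnly CommaOutsideQuotes As New Regex(",(?=(?:[^""]*""[^""]*"")*[^""]*$)")

    ' Hypothetical helper: splits one CSV line and strips the enclosing quotes.
    ' Assumes a comma separator and no line breaks inside quoted fields.
    Function SplitCsvLine(ByVal line As String) As String()
        Dim tokens As String() = CommaOutsideQuotes.Split(line)
        For i As Integer = 0 To tokens.Length - 1
            Dim t As String = tokens(i)
            If t.Length >= 2 AndAlso t.StartsWith("""") AndAlso t.EndsWith("""") Then
                ' Drop the enclosing quotes and collapse doubled quotes ("") to one
                tokens(i) = t.Substring(1, t.Length - 2).Replace("""""", """")
            End If
        Next
        Return tokens
    End Function
End Module
```

In the question's function, this would replace the Regex.Split(reader.ReadLine(), sepChar) call with SplitCsvLine(reader.ReadLine()). The lookahead rescans the rest of the line for every comma, so it is likely to be slower than a state machine on long lines.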
One more note: the nice thing about open source is you don't have to use a dll. You can include the code in your project directly. If you don't like the license of the code project link, my code will work and won't encumber your project. In fact, you should read the other answers to the same question. You might be able to use the Microsoft.VisualBasic.FileIO.TextFieldParser answer, for example.
Joel Coehoorn
what is NIH????
Nick LaMarca
OK Joel, I see this code is reading the data correctly, but how can I merge it with mine to return a DataTable? I have an example here: http://stackoverflow.com/questions/1544721/reading-csv-files-in-c/1544743#1544743 This prints out a dialog of the data, but I don't know how to access the first row with this object to get the column names. Can you come up with some code to have this function return a DataTable, assuming the column names are in row 1?
Nick LaMarca
@Nick - sorry, I had to take care of other things yesterday. I added code that explains how to use my csv parser
Joel Coehoorn
A: 

Try splitting on the " character and then keeping only every second element (the odd-numbered ones). Example:

 "Test,Name","123 Street,","NY","12345" 
Dim tokens As String() = System.Text.RegularExpressions.Regex.Split _
            (reader.ReadLine(), """") 'Or send in " as the sepchar

You would get

 {Length=9}
    (0): ""
    (1): "Test,Name"
    (2): ","
    (3): "123 Street,"
    (4): ","
    (5): "NY"
    (6): ","
    (7): "12345"
    (8): ""

So you would take only the odd-numbered elements to retrieve the data. The only caveat is when there is also a " in the data itself.

I still think you should reconsider not using an external library.

Here is an article that addresses it: http://www.secretgeek.net/csv_trouble.asp

ggonsalv
That doesn't work for me, because I think I need the code to first look at the data, then know it has a comma, then preserve it. In this case it only works if the comma field is column 1, I think.
Nick LaMarca
Look at "123 Street," in element (3). The comma is preserved.
ggonsalv
Doing that, for whatever reason, doesn't let the DataTable have the expected column names. I look at the column names for my DataTable and it has one column with the whole first row of column headers as the name of the column.
Nick LaMarca
+1  A: 

I can't help you with the VB.NET side of things, but RFC 4180 is your friend. Specifically, section 2:

5. Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields. For example:

   "aaa","bbb","ccc" CRLF
   zzz,yyy,xxx

6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:

   "aaa","b CRLF
   bb","ccc" CRLF
   zzz,yyy,xxx

7. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example:

   "aaa","b""bb","ccc"
Ken
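As an illustration of how rules 5 to 7 translate into code, here is a minimal state-machine sketch in VB.NET. It assumes a comma separator and a record that fits in a single string, so the embedded line breaks allowed by rule 6 are not handled. The names Rfc4180Sketch and ParseRecord are hypothetical and are not taken from any answer in this thread.

```vbnet
Imports System.Collections.Generic
Imports System.Text

Module Rfc4180Sketch
    ' Hypothetical helper: parses one CSV record held in a single string.
    Function ParseRecord(ByVal line As String) As List(Of String)
        Dim fields As New List(Of String)
        Dim current As New StringBuilder()
        Dim inQuotes As Boolean = False
        Dim i As Integer = 0

        While i < line.Length
            Dim c As Char = line(i)
            If inQuotes Then
                If c = """"c Then
                    If i + 1 < line.Length AndAlso line(i + 1) = """"c Then
                        current.Append(""""c)   ' rule 7: "" is one literal quote
                        i += 1                  ' skip the second quote of the pair
                    Else
                        inQuotes = False        ' closing quote of the field
                    End If
                Else
                    current.Append(c)           ' rule 6: commas in quotes are data
                End If
            ElseIf c = """"c Then
                inQuotes = True                 ' opening quote of a quoted field
            ElseIf c = ","c Then
                fields.Add(current.ToString())  ' unquoted comma ends the field
                current.Length = 0
            Else
                current.Append(c)               ' rule 5: plain unquoted character
            End If
            i += 1
        End While

        fields.Add(current.ToString())          ' the last field has no trailing comma
        Return fields
    End Function
End Module
```

For the rule 7 example line, ParseRecord yields the three fields aaa, b"bb and ccc. A line such as "Doe, John",42 yields Doe, John and 42.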
A: 

Like others have said, don't roll your own.

Give CsvHelper a try. It's separated into a parser and a reader, so you can use just the parser if you want. The parsing code is pretty straightforward and RFC 4180 compliant, if you want to look at its source.

Josh Close
A: 

Have you looked into using the TextFieldParser class (in the Microsoft.VisualBasic.FileIO namespace) that's built into the .NET Framework?

It has a property called HasFieldsEnclosedInQuotes that should handle your situation.

You can set the delimiters and then call the ReadFields method to get the field data (ReadLine returns the raw, unparsed line), and it should account for fields enclosed in quotation marks.

Chris Dunaway
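A minimal sketch of that approach, assuming a comma-delimited file whose first row holds unique column names. The names TextFieldParserSketch and CsvToDataTable are hypothetical. The parser members used (TextFieldType, SetDelimiters, HasFieldsEnclosedInQuotes, EndOfData, ReadFields) belong to Microsoft.VisualBasic.FileIO.TextFieldParser.

```vbnet
Imports System
Imports System.Data
Imports Microsoft.VisualBasic.FileIO

Module TextFieldParserSketch
    ' Hypothetical helper: loads a comma-delimited file into a DataTable.
    Function CsvToDataTable(ByVal filename As String) As DataTable
        Dim table As New DataTable()

        Using parser As New TextFieldParser(filename)
            parser.TextFieldType = FieldType.Delimited
            parser.SetDelimiters(",")
            parser.HasFieldsEnclosedInQuotes = True

            ' Assumption: the first row holds the column names
            If Not parser.EndOfData Then
                For Each name As String In parser.ReadFields()
                    table.Columns.Add(name)
                Next
            End If

            ' Every remaining row becomes one DataRow
            While Not parser.EndOfData
                Dim fields As String() = parser.ReadFields()
                Dim row As DataRow = table.NewRow()
                ' Copy only as many fields as there are columns
                For i As Integer = 0 To Math.Min(fields.Length, table.Columns.Count) - 1
                    row(i) = fields(i)
                Next
                table.Rows.Add(row)
            End While
        End Using

        Return table
    End Function
End Module
```

With this sketch a value such as "Doe, John" arrives as a single field with the enclosing quotes removed. ReadFields throws a MalformedLineException when a line cannot be parsed, so real input may need a Try/Catch around the loop.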