views:

1697

answers:

5

I was wondering if someone could help me.

I have to parse data from a csv file and put this into a db table. An example of the data is as follows:

"first field", "second , field", "third " Field " ", "fourth field"

As you can see there are quotation marks and commas embedded in the fields. I was using ADO.NET but it had issues with the embedded quotation marks. Any field after the embedded quotation marks would be null

e.g. from the example above the following would be inserted into the DB table.

first field  |  second, field   |  third     | NULL

Here is the code i am using.

    Dim dataTable As New DataTable
    Dim dataAdapter As New OleDbDataAdapter

    Dim cmd As New OleDbCommand
    Dim path As String = "c:\"


        Dim conn As OleDbConnection = New OleDbConnection("Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" _
           & System.IO.Path.GetDirectoryName(filename) & ";Extended Properties=""Text;HDR=Yes;FMT=Delimited""")

        cmd.CommandText = "Select * FROM " & System.IO.Path.GetFileName(filename)
        dataAdapter.SelectCommand = cmd
        cmd.Connection = conn
        conn.Open()
        dataAdapter.Fill(dataTable)

Then I insert data into the db table.

Any advice or guidance on this would be greatly appreciated.

A: 

This can be a bit of a challenge - however writing code to manually parse a line of csv data is fairly straightforward.

Basically you've got two possible cases:

  1. Where you are guaranteed that every field is delimited by double quotes
  2. Where any string that contains commas will be bounded by double quotes - things could, potentially, get interesting if you've got commas inside quoted strings so you may have to set limits to how clever you want to be (what the provided data can contain).

For the first case basically all you have to do is split the string on the "," combination (this may have to be done manually) - remembering to remove the leading " from the first column/item and the trailing " from the last column/item - and you're pretty much there.

For the second case you have to do more work - either a) split the line on each comma and then process through the results stitching adjacent items back together where an item starts with a double quote but doesn't end with one (you concatenate fields - not forgetting to include the comma - 'til you find the corresponding item that ends with a " but doesn't start with one) or b) loop through the string at each iteration you pull the first item from the list (if it starts with " then you're looking for ", to end it otherwise you're just looking for a comma on its own and allowing in both instances for the boundary case of the delimeter not being found because you're looking at the last column.

Hope this helps. A bit.

Murph
A: 

You could try to tell apart quotations that delimit a field from those that don't checking that the following regex can be matched on the preceding or following characters @"\p{Z},\p{Z}" Using the same concept, you could tell commas in text apart from delimiting commas by the presence of nearby quotation marks. Running these heuristsics and escaping commas and quotes not matching the rules above should yield a pretty clean result Of course there could be cases when this doesn't work fine, but as far as I know, if the CSV is unescaped there'll be no better way to handle this.

emaster70
A: 

If your code doesn't need to be part of an application, but you just want to generate SQL statements to update a database from CSV data, you may be interested in my own CSVFix project, whih is FOSS & allows you to manipulate CSV data in many ways without programming.

anon
+1  A: 

Don't really understand the answers so far.

As far as I can tell, your example input string can't be parsed without adding extra restrictions.

It is your third field which presents a problem. How can the parser tell that after the " between third and Field we haven't finished the third element? By detecting that the following character is not a comma? What if the third field ALSO contained a comma, is that allowed?

"first field", "second , field", "third ", Field " ", "fourth field"

At that point, you're getting into multi-pass trial and error parsing (though I'm no expert). That is, the parser would have to get to the F of Field, and realise that that is unparsable, back up and include the ", in the third field and carry on to the following "... etc

[Even Excel can't cope with that string as it stands.]

Benjol
One assumes a rule that says that quotes come in pairs. After that its less complex.Observing that a given tool can't cope doesn't mean that a problem can't be solved.
Murph
If quotes come in pairs, does that exclude pairs of quotes within the quotes or not? Excel was just an example.
Benjol
A: 

cool. Thanks for all your help guys. The files I am dealing with are very large so manual parsing would not be efficient.

I have solved the problem by getting proper csv files with escaped quotation marks and adding validation and error handling for when this hasnt happened.

You might also want to try a library like CSV Helper. http://csvhelper.com
Josh Close