views: 8580
answers: 5

Is it possible to BULK INSERT (SQL Server) a CSV file in which the fields are only OCCASIONALLY surrounded by quotes? Specifically, quotes only surround those fields that contain a ",".

In other words, I have data that looks like this (the first row contains headers):

id, company, rep, employees
729216,INGRAM MICRO INC.,"Stuart, Becky",523
729235,"GREAT PLAINS ENERGY, INC.","Nelson, Beena",114
721177,GEORGE WESTON BAKERIES INC,"Hogan, Meg",253

Because the quotes aren't consistent, I can't use '","' as a delimiter, and I don't know how to create a format file that accounts for this.

I tried using ',' as a delimiter and loading it into a temporary table where every column is a varchar, then using some kludgy processing to strip out the quotes, but that doesn't work either, because the fields that contain ',' are split into multiple columns.

Unfortunately, I don't have the ability to manipulate the CSV file beforehand.

Is this hopeless?

Many thanks in advance for any advice.

By the way, I saw this post SQL bulk import from csv, but in that case, EVERY field was consistently wrapped in quotes. So, in that case, he could use ',' as a delimiter, then strip out the quotes afterwards.

A: 

You should be able to specify not only the field separator, which should be [,], but also the text qualifier, which in this case would be ["]. (Using [] to enclose those so there's no confusion with the ".)

Kibbee
Kibbee - thanks for the answer. But I'm not able to use any db tools - just T-SQL, since this is something that will be automated. Essentially, this application has a feature that allows users to upload CSV files, and then the app will load them into a database table. And I don't know how to set a text qualifier with the "BULK INSERT" command. Can you expand on that?
mattstuehler
+5  A: 

It isn't possible to bulk insert this file; from MSDN:

To be usable as a data file for bulk import, a CSV file must comply with the following restrictions:

  • Data fields never contain the field terminator.
  • Either none or all of the values in a data field are enclosed in quotation marks ("").

(http://msdn.microsoft.com/en-us/library/ms188609.aspx)

Some simple text processing should be all that's required to get the file ready for import. Alternatively, your users could be required either to format the file according to these guidelines or to use something other than a comma as a delimiter (e.g. |).
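
As a sketch of that text processing (Python purely as an illustration here; its csv module already copes with fields that are only occasionally quoted):

```python
import csv
import io

# Hypothetical sample mirroring the data in the question
# (in practice you would read the real uploaded file).
raw = (
    "id,company,rep,employees\n"
    '729216,INGRAM MICRO INC.,"Stuart, Becky",523\n'
    '729235,"GREAT PLAINS ENERGY, INC.","Nelson, Beena",114\n'
    '721177,GEORGE WESTON BAKERIES INC,"Hogan, Meg",253\n'
)

# csv.reader strips the optional quotes and keeps embedded commas intact.
rows = list(csv.reader(io.StringIO(raw)))

# Re-emit the rows with a pipe delimiter.
out = io.StringIO()
csv.writer(out, delimiter="|", lineterminator="\n").writerows(rows)
print(out.getvalue())
```

The rewritten file then loads with a plain FIELDTERMINATOR = '|'.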

Macros
Macros - thanks for this. Looks definitive. I thought about pre-processing the file - e.g., changing all the commas to pipes, but I don't know how to distinguish the commas that separate fields from the commas in the values. Is there an easy way to do that?
mattstuehler
Regular expressions may help, but I'm not sure how well they would deal with conditions like multiple commas inside the quotes and multiple quoted strings on one line. Algorithmically, you could parse each string, replacing each comma with a pipe until you reach a ", at which point replacement is switched off until you reach a closing quote. May not be the most efficient, though!
Macros
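
The switch-off-at-a-quote idea Macros describes can be sketched like this (Python as an illustration; note it deliberately ignores doubled "" escape quotes, which is exactly the wrinkle the answers below wrestle with):

```python
def redelimit(line, new_delim="|"):
    """Replace field-delimiting commas with new_delim, leaving commas
    inside double-quoted sections alone. Naive sketch: does not handle
    doubled "" escape quotes inside a field."""
    out = []
    in_quotes = False
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes  # toggle replacement off/on at each quote
            out.append(ch)
        elif ch == "," and not in_quotes:
            out.append(new_delim)  # a real field-delimiting comma
        else:
            out.append(ch)
    return "".join(out)

print(redelimit('729216,INGRAM MICRO INC.,"Stuart, Becky",523'))
# → 729216|INGRAM MICRO INC.|"Stuart, Becky"|523
```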
I had the same thought - parsing row by row, field by field, even character by character. With a little elbow grease, that'd work, but I doubt it'd be very efficient. I was hoping there'd be a quick and easy answer. There should be - seems like this must come up often, since it looks like Excel formats data like this when you try to save a spreadsheet as a CSV file. Oh well.
mattstuehler
You would think that a CSV file that Excel can create would be a valid format to bulk insert into SQL Server, or that the bulk inserter would be able to import Excel-created files. No? Well, I would think so. Maybe that's just me.
cometbill
+1  A: 

This might be more complicated or involved than what you're willing to use, but ...

If you can implement the logic for parsing the lines into fields in VB or C#, you can do this using a CLR table valued function (TVF).

A CLR TVF can be a well-performing way to read data in from an external source when you want some C# or VB code to separate the data into columns and/or adjust the values.

You have to be willing to add a CLR assembly to your database (and one that allows external or unsafe operations so it can open files). This can get a bit complicated or involved, but might be worth it for the flexibility you get.

I had some large files that needed to be regularly loaded to tables as fast as possible, but certain code translations needed to be performed on some columns and special handling was needed to load values that would have otherwise caused datatype errors with a plain bulk insert.

In short, a CLR TVF lets you run C# or VB code against each line of the file with bulk-insert-like performance (although you may need to worry about logging). The example in the SQL Server documentation lets you create a TVF to read from the event log that you could use as a starting point.

Note that the code in the CLR TVF can only access the database in an init stage, before the first row is processed (e.g. no lookups for each row - you use a normal TVF on top of this to do such things). You don't appear to need this based on your question.

Also note, each CLR TVF must have its output columns explicitly specified, so you can't write a generic one that is reusable for each different csv file you might have.

You could write one CLR TVF to read whole lines from the file, returning a one column result set, then use normal TVFs to read from that for each type of file. This requires the code to parse each line to be written in T-SQL, but avoids having to write many CLR TVFs.

Brett
Brett, Sorry - I was on vacation, and just now saw this response. CLR-TVF isn't something I'm familiar with, but I'm definitely going to look into it. Many thanks for this really interesting suggestion!
mattstuehler
+2  A: 

You are going to need to preprocess the file, period.

If you really really need to do this, here is the code. I wrote this because I absolutely had no choice. It is utility code and I'm not proud of it, but it works. The approach is not to get SQL to understand quoted fields, but instead manipulate the file to use an entirely different delimiter.

This function takes an input file and will replace all field-delimiting commas (NOT commas inside quoted-text fields, just the actual delimiting ones) with a new delimiter. You can then tell SQL Server to use the new field delimiter instead of a comma. In the version of the function here, the placeholder is <*TMP*> (I feel confident this will not appear in the original csv - if it does, brace for explosions).

Therefore, after running this function, you import into SQL Server by doing something like:

BULK INSERT MyTable
FROM 'C:\FileCreatedFromThisFunction.csv'
WITH
(
FIELDTERMINATOR = '<*TMP*>',
ROWTERMINATOR = '\n'
)

And without further ado, the terrible, awful function that I apologize in advance for inflicting on you (edit - I've posted a working program that does this, rather than just the function, on my blog here):

Private Function CsvToOtherDelimiter(ByVal InputFile As String, ByVal OutputFile As String) As Integer

    Dim PH1 As String = "<*TMP*>"

    Dim objReader As StreamReader = Nothing
    Dim count As Integer = 0 'This will also serve as a primary key
    Dim sb As New System.Text.StringBuilder
    Dim hasHeaders As Boolean = True 'Assume the first line is column headings

    Try
        objReader = New StreamReader(File.OpenRead(InputFile), System.Text.Encoding.Default)
    Catch ex As Exception
        UpdateStatus(ex.Message)
    End Try

    If objReader Is Nothing Then
        UpdateStatus("Invalid file: " & InputFile)
        Return -1
    End If

    'Grab the first line
    Dim line = objReader.ReadLine()
    'and advance to the next line b/c the first line is column headings
    If hasHeaders Then
        line = Trim(objReader.ReadLine)
    End If

    While Not String.IsNullOrEmpty(line) 'Loop through each line

        count += 1

        'Replace commas with our custom-made delimiter
        line = line.Replace(",", PH1)

        'Find a quoted part of the line, which could legitimately contain commas.
        'In that case we will need to identify the quoted section and swap commas back in for our custom placeholder.
        Dim starti = line.IndexOf(PH1 & """", 0)
        If line.IndexOf("""", 0) = 0 Then starti = 0

        While starti > -1 'Loop through quoted fields

            Dim FieldTerminatorFound As Boolean = False

            'Find the end quote token (originally a ",)
            Dim endi As Integer = line.IndexOf("""" & PH1, starti)

            If endi < 0 Then 'No end token, so the quoted field runs to the end of the line
                FieldTerminatorFound = True
                endi = line.Length - 1
            End If

            While Not FieldTerminatorFound

                'Count the consecutive quotes ending at endi; doubled quotes ("")
                'are escaped quotes inside the field, not terminators
                Dim backChar As String = """" 'That's one quote
                Dim quoteCount = 0
                While backChar = """"
                    quoteCount += 1
                    backChar = line.Chars(endi - quoteCount)
                End While

                If quoteCount Mod 2 = 1 Then 'Odd number of quotes: a real field terminator
                    FieldTerminatorFound = True
                Else 'Keep looking
                    endi = line.IndexOf("""" & PH1, endi + 1)
                End If
            End While

            'Grab the quoted field from the line, now that we have the start and ending indices
            Dim source = line.Substring(starti + PH1.Length, endi - starti - PH1.Length + 1)

            'And swap the commas back in
            line = line.Replace(source, source.Replace(PH1, ","))

            'Find the next quoted field
            starti = line.IndexOf(PH1 & """", starti + PH1.Length)

        End While

        sb.AppendLine(line)

        line = objReader.ReadLine

    End While

    objReader.Close()

    SaveTextToFile(sb.ToString, OutputFile)

    Return count

End Function
Chris Clark
A: 

Chris, thanks a bunch for this!! You saved my biscuits!! I could not believe that the bulk loader wouldn't handle this case when Excel does such a nice job... don't these guys see each other in the halls??? Anyway... I needed a ConsoleApplication version, so here is what I hacked together. It's down and dirty, but it works like a champ! I hardcoded the delimiter and commented out the header handling as they were not needed for my app.

I wish I could also paste a nice big beer in here for ya too.


Module Module1

Sub Main()

    Dim arrArgs() As String = Command.Split(",")
    Dim i As Integer
    Dim obj As New ReDelimIt()

    Console.Write(vbNewLine & vbNewLine)

    If arrArgs.Length >= 2 AndAlso arrArgs(0) <> "" Then
        For i = LBound(arrArgs) To UBound(arrArgs)
            Console.Write("Parameter " & i & " is " & arrArgs(i) & vbNewLine)
        Next

        obj.ProcessFile(arrArgs(0), arrArgs(1))

    Else
        Console.Write("Usage: Test1 <inputfile>,<outputfile>")
    End If

    Console.Write(vbNewLine & vbNewLine)
End Sub

End Module

Public Class ReDelimIt

Public Function ProcessFile(ByVal InputFile As String, ByVal OutputFile As String) As Integer

    Dim ph1 As String = "|"

    Dim objReader As System.IO.StreamReader = Nothing
    Dim count As Integer = 0 'This will also serve as a primary key
    Dim sb As New System.Text.StringBuilder

    Try
        objReader = New System.IO.StreamReader(System.IO.File.OpenRead(InputFile), System.Text.Encoding.Default)
    Catch ex As Exception
        MsgBox(ex.Message)
    End Try

    If objReader Is Nothing Then
        MsgBox("Invalid file: " & InputFile)
        Return -1
    End If

    'grab the first line
    Dim line = objReader.ReadLine()
    'and advance to the next line b/c the first line is column headings
    'Removed Check Headers can put in if needed.
    'If chkHeaders.Checked Then
    'line = objReader.ReadLine
    'End If

    While Not String.IsNullOrEmpty(line) 'loop through each line

        count += 1

        'Replace commas with our custom-made delimiter
        line = line.Replace(",", ph1)

        'Find a quoted part of the line, which could legitimately contain commas.
        'In that case we will need to identify the quoted section and swap commas back in for our custom placeholder.
        Dim starti = line.IndexOf(ph1 & """", 0)

        While starti > -1 'loop through quoted fields

            'Find the end quote token (originally a ",)
            Dim endi = line.IndexOf("""" & ph1, starti)

            'The end quote token could be a false positive because there could occur a ", sequence.
            'It would be double-quoted ("",) so check for that here
            Dim check1 = line.IndexOf("""""" & ph1, starti)

            'A """, sequence can occur if a quoted field ends in a quote.
            'In this case, the above check matches, but we actually SHOULD process this as an end quote token
            Dim check2 = line.IndexOf("""""""" & ph1, starti)

            'If we are in the check1 ("",) situation, keep searching for an end quote token
            'The +1 and +2 accounts for the extra length of the checked sequences
            While (endi = check1 + 1 AndAlso endi <> check2 + 2) 'loop through "false" tokens in the quoted fields
                endi = line.IndexOf("""" & ph1, endi + 1)
                check1 = line.IndexOf("""""" & ph1, check1 + 1)
                check2 = line.IndexOf("""""""" & ph1, check2 + 1)
            End While

            'We have searched for an end token (",) but can't find one, so that means the line ends in a "
            If endi < 0 Then endi = line.Length - 1

            'Grab the quoted field from the line, now that we have the start and ending indices
            Dim source = line.Substring(starti + ph1.Length, endi - starti - ph1.Length + 1)

            'And swap the commas back in
            line = line.Replace(source, source.Replace(ph1, ","))

            'Find the next quoted field
            If endi >= line.Length - 1 Then endi = line.Length 'During the swap, the length of line shrinks so an endi value at the end of the line will fail
            starti = line.IndexOf(ph1 & """", starti + ph1.Length)

        End While

        'Add our primary key to the line
        ' Removed for now
        'If chkAddKey.Checked Then
        'line = String.Concat(count.ToString, ph1, line)
        ' End If

        sb.AppendLine(line)

        line = objReader.ReadLine

    End While

    objReader.Close()

    SaveTextToFile(sb.ToString, OutputFile)

    Return count

End Function

Public Function SaveTextToFile(ByVal strData As String, ByVal FullPath As String) As Boolean
    Dim bAns As Boolean = False
    Dim objWriter As System.IO.StreamWriter
    Try
        objWriter = New System.IO.StreamWriter(FullPath, False, System.Text.Encoding.Default)
        objWriter.Write(strData)
        objWriter.Close()
        bAns = True
    Catch Ex As Exception
        Throw 'Rethrow as-is to preserve the stack trace
    End Try
    Return bAns
End Function

End Class

Indigo42