views:

659

answers:

6

I have a CSV dump from another DB that looks like this (id, name, notes):

1001,John Smith,15 Main Street
1002,Jane Smith,"2010 Rockliffe Dr.
Pleasantville, IL
USA"
1003,Bill Karr,2820 West Ave.

The last field may contain carriage returns and commas, in which case it is surrounded by double quotes. And I need to preserve those returns and commas.

I use this code to import CSV into my table:

BULK INSERT CSVTest
FROM 'c:\csvfile.csv'
WITH
(
  FIELDTERMINATOR = ',',
  ROWTERMINATOR = '\n'
)

SQL Server 2005 bulk insert cannot figure out that carriage returns inside quotes are not row terminators.
How to overcome?


UPDATE:
Looks like the only way to keep line breaks inside a field is to use different row separator. So, I want to mark all row separating line breaks by putting a pipe in front of them. How can I change my CSV to look like this?

1001,John Smith,15 Main Street|
1002,Jane Smith,"2010 Rockliffe Dr.
Pleasantville, IL
USA"|
1003,Bill Karr,2820 West Ave.|

A: 

According to the source of all knowledge (Wikipedia), csv uses new lines to separate records. So what you have is not valid csv.

My suggestion is that you write a perl program to process your file and add each record to the db.

If you're not a perl person, then you could use a programming site or see if some kind SO person will write the parsing section of the program for you.

Added:

Possible Solution

Since the OP states that he can change the input file, I'd change all the new lines that do not follow a " to be a reserved char sequence, eg XXX

This can be an automated replacement in many editors. In Windows, UltraEdit includes regexp find/replace functionality

Then import into the dbms since you'll no longer have the embedded new lines.

Then use SQL Replace to change the XXX occurrences back into new lines.

Larry K
also from the source of all knowledge, "Fields with embedded line breaks must be enclosed within double-quote characters", so the input file is valid.
Jimmy
Not an option. It has to be SQL script that will be executed somewhere else by somebody else. However, I can tweak the CSV file anyway I want in the editor.
z-boss
+1  A: 

Bulk operations on SQL Server do not specifically support CSV even though they can import them if the files are carefully formatted. My suggestion would be to enclose all field values in quotes. BULK INSERT might then allow the carriage returns within a field value. If it does not, then your next solution might be an Integration Services package.

See Preparing Data for Bulk Export or Import for more.

Thomas
A: 

You cannot import this unless the CSV is in valid format. So, you have to either fix the dump or manually using search & replace fix the unwanted new line characters.

SoftwareGeek
The thing is I need to preserve those new line characters in the notes field. Even if I could remove them, the CSV is huge and I cannot fully automate it in the editor.
z-boss
Try the dump without adding the notes field, then do another dump with only the notes field but this time see if you can replace the unwanted characters. Also you might have to import twice since you have two dumps.
SoftwareGeek
If you mean dump from the database - I cannot do that: CSV is given to me as is.
z-boss
A: 

If you have control over the contents of the CSV file, you could replace the in-field line breaks (CRLF) with a non-linebreak character (perhaps just CR or LF), then run a script after the import to replace them with CRLF again.

This is how MS Office products (Excel, Access) deal with this problem.

Ed Harper
But how can I reliably distinguish between the in-field line breaks and row ending line breaks?
z-boss
@z-boss - possibly I misunderstood your question - I inferred that you have control over the production of the CSV file.
Ed Harper
+1  A: 

you can massage these line breaks into one line with a script, eg you can use GNU sed to remove line breaks. eg

$ more file
1001,John Smith,15 Main Street
1002,Jane Smith,"2010 Rockliffe Dr.
Pleasantville, IL
USA"
1003,Bill Karr,"2820
West Ave"

$ sed '/"/!s/$/|/;/.*\".*[^"]$/{ :a;N };/"$/ { s/$/|/ }' file
1001,John Smith,15 Main Street|
1002,Jane Smith,"2010 Rockliffe Dr.
Pleasantville, IL
USA"|
1003,Bill Karr,"2820
West Ave"|

then you can bulk insert.

Edit:

Save this :/"/!s/$/|/;/.*\".*[^"]$/{ :a;N };/"$/ { s/$/|/ } in a file , say myformat.sed. then do this on the command line

c:\test> sed.exe -f myformat.sed myfile

ghostdog74
I need to keep those in-field line breaks. Is it possible with sed to prepend all not in-field line breaks with a pipe character? In this way I would use '|\n' as a row separator for bulk insert and it would work.
z-boss
i don't understand. do you mean something like this: `2010|Rockliffe Dr|Pleasantville|IL USA` ? if not, you should probably show your desired output in your question
ghostdog74
I've updated my question.
z-boss
see my edit. I mangled the data abit for other scenario..
ghostdog74
I'm running on Windows XP and it gives me this error: sed: -e expression #1, char 1: unknown command: `''
z-boss
i see. windows cmd.exe is particular about single quote,so you have to use double quotes and some escaping. however, to save you the trouble, you can also save the sed statements in a file, and then call the sed.exe excutable. see my edit
ghostdog74
Thanks a lot. It did ran and it worked in most cases. Unfortunatelly, the notes column can contain all kind of weird stuff, like full email traces, bunch of empty lines, dates etc, which I believe the format string is not expecting. Looks like I'll have to write my own parser. Thanks for your help anyway.
z-boss
A: 

OK, here's a small Java program that I end up writing to solve the problem.
Comments, corrections and optimizations are welcome.

import java.io.*;

public class PreBulkInsert
{
    public static void main(String[] args)
    {
        if (args.length < 3)
        {
            System.out.println ("Usage:");
            System.out.println ("  java PreBulkInsert input_file output_file separator_character");
            System.exit(0);
        }

        try
        {
            boolean firstQuoteFound = false;
            int fromIndex;
            int lineCounter = 0;
            String str;

            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            BufferedWriter out = new BufferedWriter(new FileWriter(args[1])); 
            String newRowSeparator = args[2];

            while ((str = in.readLine()) != null)
            {
                fromIndex = -1;
                do
                {
                    fromIndex = str.indexOf('"', fromIndex + 1);
                    if (fromIndex > -1)
                        firstQuoteFound = !firstQuoteFound;
                } while (fromIndex > -1);

                if (!firstQuoteFound)
                    out.write(str + newRowSeparator + "\r\n");
                else
                    out.write(str + "\r\n");
                lineCounter++;
            }
            out.close();
            in.close();
            System.out.println("Done! Total of " + lineCounter + " lines were processed.");
        }
        catch (IOException e)
        {
            System.out.println(e.getMessage());
            System.exit(1);
        }       
    }
}
z-boss