I have a text file in a comma-separated format, with most fields enclosed in " marks. I am trying to get it into something I can enumerate through (a generic collection, for example). I don't have control over how the file is output, nor over the delimiter character it uses.

In this case, the fields are separated by commas and text fields are enclosed in " marks. The problem I am running into is that some fields contain quotation marks (e.g. 8" Tray), and these are mistaken for the start of the next field. Numeric fields have no quotes around them, but they do start with a + or - sign (indicating a positive or negative number).

I was thinking of a RegEx, but my skills aren't that great so hopefully someone can come up with some ideas I can try. There are about 19,000 records in this file, so I am trying to do it as efficiently as possible. Here are a couple of example rows of data:

"00","000000112260 ","Pie Pumpkin ","RET","6.99 "," ","ea ",+0000000006.99000
"00","000000304078 ","Pie Apple caramel ","RET","9.99 "," ","ea ",+0000000009.99000
"00","StringValue here","8" Tray of Food ","RET","6.99 "," ","ea ",-00000000005.3200

There are a lot more fields, but you can get the picture....

I am using VB.NET and I have a generic List set up to accept the data. I have tried using CSVReader and it seems to work well until it hits a record like the third one (with a quote in the text field). If I could somehow get it to handle the additional quotes, then the CSVReader option would work great.

Thanks!

A: 

There are at least ODBC drivers for CSV files. But there are different flavors of CSV.

What produced these files? It's not unlikely that there's a matching driver based on the requirements of the source application.

le dorfier
It is an old DOS based accounting package called Business Vision Delta. Unfortunately, the company has been sold to new vendors and they don't support the old DOS stuff any longer. This is the only way I can extract the data to integrate into newer software.
hacker
Can you tell what kind of data tables it uses? Maybe dbfs? Also, try just opening the CSV files with Excel, Access, whatever other apps you have that can import CSV. Try to avoid writing software as a first option.
le dorfier
+1  A: 

Save time and do yourself a favour and download this codeproject article: A Fast CSV Reader (in .NET):

One would imagine that parsing CSV files is a straightforward and boring task. I was thinking that too, until I had to parse several CSV files of a couple GB each. After trying to use the OleDB JET driver and various regular expressions, I still ran into serious performance problems. At this point, I decided I would try the custom class option. I scoured the net for existing code, but finding a correct, fast, and efficient CSV parser and reader is not so simple, whatever platform/language you fancy.

Mitch Wheat
Thanks, but I have tried that and am running into a problem when there are quotes in the text fields.
hacker
ah, sorry. I missed that you had tried it already
Mitch Wheat
+3  A: 

Give a look to the FileHelpers library.

CMS
Looks good, but I found it very frustrating to use. Lack of support for auto-properties instead of private fields is very clumsy.
Alex
And this wasn't a factor in the original question, but that page says that FileHelpers uses dynamic code generation. That means it's not useful in some constrained environments (MonoTouch, for me).
James Moore
A: 

Your problem with CSVReader is that the quote in the third record isn't escaped with another quote (aka double quoting). If you don't escape them, then how would you expect to handle ", in the middle of a text field?

http://en.wikipedia.org/wiki/Comma-separated_values

(I did end up having to work with files (with different delimiters) but the quote characters inside a text value weren't escaped and I ended up writing my own custom parser. I do not know if this was absolutely necessary or not.)

llamaoo7
That is my problem... I can't escape them. I don't have control over how the file is exported. I'm trying to get away from writing a parser that goes character by character to check if there is a comma after a quote, etc. but it may come down to that.
hacker
Well, if you go the route of making your own (I'm still convinced there's a solution somewhere that can handle this case), just be sure to validate the field count and data as best as you can. (I'd post mine but I did it at work.)
llamaoo7
+5  A: 

From here:

// GetFileEncoding is a helper that is not shown here
Encoding fileEncoding = GetFileEncoding(csvFile);
// get rid of all doublequotes except those used as field delimiters
string fileContents = File.ReadAllText(csvFile, fileEncoding);
string fixedContents = Regex.Replace(fileContents, @"([^\^,\r\n])""([^$,\r\n])", @"$1$2");
using (CsvReader csv =
       new CsvReader(new StringReader(fixedContents), true))
{
       // ... parse the CSV
}
Mitch Wheat
This works pretty well, but for some reason it fails on a name like: Product "A" Name. I am sure it has to do with the RegEx, but I can't seem to get it right.
hacker
See my answer below for how I was able to implement this.
hacker
+5  A: 

I recommend looking at the TextFieldParser class in .NET. You need to include

Imports Microsoft.VisualBasic.FileIO

Here's a quick sample:

        ' FileName is the path to the CSV file
        Using afile As New TextFieldParser(FileName)
            Dim CurrentRecord As String() ' this array will hold each line of data
            afile.TextFieldType = FieldType.Delimited
            afile.Delimiters = New String() {","}
            afile.HasFieldsEnclosedInQuotes = True

            ' parse the actual file
            Do While Not afile.EndOfData
                Try
                    CurrentRecord = afile.ReadFields()
                Catch ex As MalformedLineException
                    ' breaks into the debugger; replace with real error handling
                    Stop
                End Try
            Loop
        End Using
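
As a minimal sketch of how the parsed rows might end up in a generic list, the following reads from a string through a StringReader rather than a file path, and records any line the parser rejects instead of stopping. The names TextFieldParserSketch, ParseCsv and badLines are made up for illustration, and the sample rows are shortened versions of the well-formed rows from the question. A row with an unescaped quote inside a field would be expected to land in badLines rather than in the result.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports Microsoft.VisualBasic.FileIO

Module TextFieldParserSketch
    ' Parses CSV text into a list of field arrays. Lines the parser
    ' cannot split are added to badLines instead of aborting the run.
    Function ParseCsv(csvText As String, badLines As List(Of String)) As List(Of String())
        Dim records As New List(Of String())
        Using parser As New TextFieldParser(New StringReader(csvText))
            parser.TextFieldType = FieldType.Delimited
            parser.SetDelimiters(",")
            parser.HasFieldsEnclosedInQuotes = True
            Do While Not parser.EndOfData
                Try
                    records.Add(parser.ReadFields())
                Catch ex As MalformedLineException
                    ' ErrorLine holds the text of the line that failed
                    badLines.Add(parser.ErrorLine)
                End Try
            Loop
        End Using
        Return records
    End Function

    Sub Main()
        Dim sample As String =
            """00"",""000000112260 "",""Pie Pumpkin "",""RET"",+0000000006.99000" & Environment.NewLine &
            """00"",""000000304078 "",""Pie Apple caramel "",""RET"",+0000000009.99000"
        Dim bad As New List(Of String)
        For Each fields As String() In ParseCsv(sample, bad)
            Console.WriteLine(String.Join(" | ", fields))
        Next
        Console.WriteLine("Rejected lines: " & bad.Count)
    End Sub
End Module
```

Collecting rejected lines makes it easy to check afterwards how many records were skipped and why.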
Avi
+1  A: 

The logic of this custom approach is: read through the file one line at a time, split each line on the comma, remove the first and last character of each item (removing the outer quotes without affecting any inner quotes), then add the data to your generic list. It's short and very easy to read and work with.

        Dim Records As New List(Of String())
        Dim FilePath As String = HttpContext.Current.Request.MapPath("YourFile.csv")

        Using fr As New System.IO.StreamReader(FilePath)
            While fr.Peek() <> -1
                Dim FileString As String = fr.ReadLine().Trim()

                If String.IsNullOrEmpty(FileString) Then Continue While 'Empty Line

                'Note: a plain Split also breaks up any field that itself contains a comma
                Dim LineItemsArr() As String = FileString.Split(","c)

                For i As Integer = 0 To LineItemsArr.Length - 1
                    Dim Item As String = LineItemsArr(i)
                    'If the item has a beginning and closing " (quote), cut only the first
                    'and last characters, so quotes inside the text are left alone.
                    'Numeric fields (+/-) have no outer quotes and are kept as they are.
                    If Item.Length >= 2 AndAlso Item.StartsWith("""") AndAlso Item.EndsWith("""") Then
                        LineItemsArr(i) = Item.Substring(1, Item.Length - 2)
                    End If
                Next

                'Then stick the data into your Generic List
                Records.Add(LineItemsArr)
            End While
        End Using
rvarcher
Or before stripping the outer quotes, use this as a test to do string processing, or number processing (if needs be).
Dillie-O
A: 

I am posting this as an answer so I can explain how I did it and why.... The answer from Mitch Wheat was the one that gave me the best solution for this case and I just had to modify it slightly due to the format this data was exported in.

Here is the VB Code:

Dim fixedContents As String = Regex.Replace(
                            File.ReadAllText(csvFile, fileEncoding),
                            "(?<!,)("")(?!,)", 
                            AddressOf ReplaceQuotes)

The RegEx that was used is what I needed to change because certain fields had non-escaped quotes in them and the RegEx provided didn't seem to work on all examples. This one uses 'Look Ahead' and 'Look Behind' to see if the quote is just after a comma or just before. In this case, they are both negative (meaning show me where the double quote is not before or after a comma). This should mean that the quote is in the middle of a string.

In this case, instead of doing a direct replacement, I am using the function ReplaceQuotes to handle that for me. The reason I am using this is because I needed a little extra logic to detect whether it was at the beginning of a line. If I would have spent even more time on it, I am sure I could have tweaked the RegEx to take into consideration the beginning of the line (using MultiLine, etc) but when I tried it quickly, it didn't seem to work at all.
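
The ReplaceQuotes function itself is not shown. As a rough sketch of one alternative, and not necessarily what was done here, the start-of-line and end-of-line checks can be folded into the pattern with RegexOptions.Multiline, and each stray quote can be doubled (the standard CSV escape) instead of being handled in a callback. The names StrayQuoteSketch and EscapeStrayQuotes are made up for illustration. Because lookarounds consume no characters, both quotes in Product "A" Name are matched, which is likely where a pattern that captures the neighbouring characters goes wrong.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StrayQuoteSketch
    ' A double quote that is not at the start or end of a line and not
    ' next to a comma, i.e. one that sits inside a field.
    Private ReadOnly StrayQuote As New Regex("(?<!^|,)""(?!,|\r?$)", RegexOptions.Multiline)

    ' Doubles each stray quote so the text becomes conventionally escaped CSV.
    Function EscapeStrayQuotes(text As String) As String
        Return StrayQuote.Replace(text, """""")
    End Function

    Sub Main()
        Dim row As String = """00"",""8"" Tray of Food "",""RET"",-00000000005.3200"
        Console.WriteLine(EscapeStrayQuotes(row))
        ' prints: "00","8"" Tray of Food ","RET",-00000000005.3200
    End Sub
End Module
```

The escaped text can then be handed to a CSV reader through a StringReader. A field whose text has a quote directly next to a comma (for example 8", Tray) stays ambiguous and is not fixed by this pattern.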

With this in place, using CSV reader on a 32MB CSV file (about 19000 rows), it takes about 2 seconds to read the file, perform the regex, load it into the CSV Reader, add all the data to my generic class and finish. Real quick!!

hacker
+2  A: 

Try this site. http://kbcsv.codeplex.com/

I've looked for a good utility and this is hands down the best that I've found, and it works correctly. Don't waste your time trying other stuff, this is free and it works.

Middletone
Why, thank you!
Kent Boogaart
I second this. 15 char.
Alex
+3  A: 

As this link says... Don't roll your own CSV parser!

Use TextFieldParser as Avi suggested. Microsoft has already done this for you. If you ended up writing one, and you find a bug in it, consider replacing it instead of fixing the bug. I did just that recently and it saved me a lot of time.

skypecakes