tags:

views:

28

answers:

2

Hi,

I have a huge csv file, separated by comma's and I want to do a analysis with glm in R. In one column there exists data with a comma implied, something like: bla,blabla When reading the file in R with read.csv.sql there comes a error-message: RS-DBI driver: (RS_sqlite_import: ./agp.csv line 47612 expected 37 columns of data but found 38) This is due to the 'extra' comma in some of the data, not the whole column has an extra column. How can I fix this? I want to remove this extra superfluous comma. Thanks for the reaction, André

+1  A: 

The CSV format is very simple and can easily be hand edited. In order to include a comma in a value, you must surround the value with quotes quotes. Try this: "bla,blabla". If that data happens to contain any quotes, eg. blah,"thequotedblah",blah, those quotes need to be escaped with another quote, like this: "blah,""thequotedblah"",blah".

Although there is no official standard around it, there isn't much to the CSV format. Wikipedia has a great CSV reference that I have personally used to implement CSV support in applications. Spend 5-10 minutes reading it and you'll know everything you ever need to know to manually create/read/repair CSV data.

Asaph
Thanks for the reaction, but it is a huge file. I can't read the file in Textmate, because of the size. The word bla,blabla is randomly dispersed in one column. How can I delete a column in a csv file without reading it first entirely. Thanks again for the reaction.
andre de boer
A: 

Is it just this one line that contains a non-quoted comma - or are there several such lines? Editing the .csv with an editor that can handle large files (e.g. Ultraedit) to sanitize that one record would certainly help. Asaph's suggestion of quoting is also a good 'un.

Will A