
I have a large csv file with a mix of character and numeric columns. Some of the numerical values are expressed as strings with commas. e.g., "1,513" instead of 1513. What is the simplest way to read the data into R?

I can use read.csv(...,colClasses="character"), but then I have to strip out the commas from the relevant elements before converting those columns to numeric, and I can't find a neat way to do that.

It is easy enough to do it by brute force, but I was hoping there might be some simple solution out there.

+5  A: 

I'm not sure how to have read.csv interpret it directly, but you can reformat the character vector with gsub:

> y <- c("1,200","20,000","100","12,111")
> as.numeric(gsub(",","", y))
[1]  1200 20000   100 12111

This was also answered previously on R-Help (and in Q2 here).

Alternatively, you can pre-process the file, for instance with sed in unix.

Shane
+2  A: 

I think preprocessing is the way to go. You could use Notepad++, which has a regular-expression replace option.

For example, if your file were like this:

"1,234","123","1,234"
"234","123","1,234"
123,456,789

Then, you could use the regular expression "([0-9]+),([0-9]+)" (the quotes are part of the pattern, so the match consumes them) and replace it with \1\2, giving:

1234,"123",1234
"234","123",1234
123,456,789

Then you could use x <- read.csv(file="x.csv",header=FALSE) to read the file.
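The same replacement can also be scripted from R rather than done by hand in Notepad++ (a sketch; the file names x_raw.csv and x.csv are hypothetical):

txt <- readLines("x_raw.csv")
## apply the quote-aware pattern from above; numbers with more than
## one embedded comma would need repeated passes
txt <- gsub('"([0-9]+),([0-9]+)"', "\\1\\2", txt)
writeLines(txt, "x.csv")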

Jacob
This is how I handle it as well.
kpierce8
Anything you can script, you should. Doing it by hand introduces the opportunity for error and isn't very reproducible.
hadley
+1  A: 

I'd rather handle this in R than pre-process the data, as that makes things easier when the data are revised. Following Shane's suggestion of using gsub, I think this is about as neat as I can do:

x <- read.csv("file.csv",header=TRUE,colClasses="character")
col2cvt <- 15:41
x[,col2cvt] <- lapply(x[,col2cvt],function(x){as.numeric(gsub(",", "", x))})
Rob Hyndman
Doesn't colClasses="character" force all columns to be character, in which case the others besides 15:41 are also character? Maybe letting read.csv() decide and then converting the ones in cols 15:41 would get you 'more' numeric columns.
Dirk Eddelbuettel
Yes, but as my question noted, all the other columns are character. I could use as.is=TRUE instead, which would be more general. But letting read.csv() decide via the default arguments is not helpful, because it converts anything that looks like character into a factor, which causes hassles for the numeric columns: they then don't convert properly using as.numeric().
Rob Hyndman
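For reference, the factor pitfall described above looks like this (hypothetical values):

f <- factor(c("1,513", "2,000"))
as.numeric(f)                                # gives the level codes 1 2, not the numbers
as.numeric(gsub(",", "", as.character(f)))   # gives 1513 2000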
A: 

"Preprocess" in R:

lines <- "www, rrr, 1,234, ttt \n rrr,zzz, 1,234,567, rrr"

You can use readLines on a textConnection to split this into two lines, then remove only those commas that sit between digits:

gsub("([0-9])+\,+([0-9])", "\1\2", lines)

[1] "www, rrr, 1234, ttt" "rrr,zzz, 124567, rrr"

DWin
+2  A: 

You can have read.table or read.csv do this conversion for you semi-automatically: create a conversion function and register it as an "as" method using the setAs function, like so:

setAs("character", "num.with.commas", 
        function(from) as.numeric(gsub(",", "", from) ) )

Then run read.csv like:

DF <- read.csv('your.file.here', 
   colClasses=c('num.with.commas','factor','character','numeric','num.with.commas'))
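As a quick sanity check, the coercion can be exercised on an inline two-column file (a hypothetical example using textConnection):

txt <- 'a,b\n"1,513",x\n"2,000",y'
DF <- read.csv(textConnection(txt),
               colClasses = c("num.with.commas", "character"))
str(DF)   # column a should come back numeric: 1513, 2000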
Greg Snow
This is a very nice trick. It could be used for on-import conversion (for example converting Y/N values to a logical vector using `setAs("character", "logical.Y.N", function(from) c(Y=TRUE,N=FALSE)[from] )`).
Marek
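A runnable sketch of that comment's idea (the column name flag and the inline Y/N data are made up for illustration):

setClass("logical.Y.N")
setAs("character", "logical.Y.N",
      function(from) unname(c(Y = TRUE, N = FALSE)[from]))   # unname() drops the lookup names
read.csv(textConnection("flag\nY\nN\nY"), colClasses = "logical.Y.N")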