
I have a large csv file with a mix of character and numeric columns. Some of the numerical values are expressed as strings with commas. e.g., "1,513" instead of 1513. What is the simplest way to read the data into R?

I can use read.csv(...,colClasses="character"), but then I have to strip out the commas from the relevant elements before converting those columns to numeric, and I can't find a neat way to do that.

It is easy enough to do it by brute force, but I was hoping there might be some simple solution out there.

+5  A: 

I'm not sure how to have read.csv interpret it directly, but you can reformat the character vector with gsub:

> y <- c("1,200","20,000","100","12,111")
> as.numeric(gsub(",","", y))
[1]  1200 20000   100 12111

This was also answered previously on R-Help (and in Q2 here).

Alternatively, you can pre-process the file, for instance with sed in unix.

Shane
+2  A: 

I think preprocessing is the way to go. You could use Notepad++, which has a regular-expression replace option.

For example, if your file were like this:

"1,234","123","1,234"
"234","123","1,234"
123,456,789

Then, you could use the regular expression "([0-9]+),([0-9]+)" (the quotes are part of the pattern, so the match consumes them) and replace it with \1\2, giving:

1234,"123",1234
"234","123",1234
123,456,789

Then you could use x <- read.csv(file="x.csv",header=FALSE) to read the file.
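The same replacement can also be scripted from R rather than done by hand in Notepad++ (a sketch; the file names x_raw.csv and x.csv are hypothetical):

txt <- readLines("x_raw.csv")
## apply the quote-aware pattern from above; numbers with more than
## one embedded comma would need repeated passes
txt <- gsub('"([0-9]+),([0-9]+)"', "\\1\\2", txt)
writeLines(txt, "x.csv")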

Jacob
This is how I handle it as well.
kpierce8
Anything you can script, you should. Doing it by hand introduces the opportunity for error and isn't very reproducible.
hadley
+1  A: 

I'd rather handle this in R than pre-process the data, as that makes things easier when the data are revised. Following Shane's suggestion of using gsub, I think this is about as neat as I can do:

x <- read.csv("file.csv",header=TRUE,colClasses="character")
col2cvt <- 15:41
x[,col2cvt] <- lapply(x[,col2cvt],function(x){as.numeric(gsub(",", "", x))})
Rob Hyndman
Doesn't colClasses="character" force all columns to be character, in which case the others besides 15:41 are also character? Maybe letting read.csv() decide and then converting the ones in cols 15:41 would get you 'more' numeric columns.
Dirk Eddelbuettel
Yes, but as my question noted, all the other columns are character. I could use as.is=TRUE instead, which would be more general. But letting read.csv() decide via the default arguments is not helpful, because it converts anything that looks like character into a factor, which causes hassles for the numeric columns: they then don't convert properly using as.numeric().
Rob Hyndman
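For reference, the factor pitfall described above looks like this (hypothetical values):

f <- factor(c("1,513", "2,000"))
as.numeric(f)                                # gives the level codes 1 2, not the numbers
as.numeric(gsub(",", "", as.character(f)))   # gives 1513 2000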
A: 

"Preprocess" in R:

lines <- "www, rrr, 1,234, ttt \n rrr,zzz, 1,234,567, rrr"

You can use readLines on a textConnection to split this into two lines, then remove only those commas that sit between digits:

gsub("([0-9])+\,+([0-9])", "\1\2", lines)

[1] "www, rrr, 1234, ttt" "rrr,zzz, 124567, rrr"

DWin
+2  A: 

You can have read.table or read.csv do this conversion for you semi-automatically: create a conversion function and register it as an "as" method using the setAs function, like so:

setAs("character", "num.with.commas", 
        function(from) as.numeric(gsub(",", "", from) ) )

Then run read.csv like:

DF <- read.csv('your.file.here', 
   colClasses=c('num.with.commas','factor','character','numeric','num.with.commas'))
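As a quick sanity check, the coercion can be exercised on an inline two-column file (a hypothetical example using textConnection):

txt <- 'a,b\n"1,513",x\n"2,000",y'
DF <- read.csv(textConnection(txt),
               colClasses = c("num.with.commas", "character"))
str(DF)   # column a should come back numeric: 1513, 2000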
Greg Snow
This is a very nice trick. It could be used for on-import conversion (for example converting Y/N values to a logical vector using `setAs("character", "logical.Y.N", function(from) c(Y=TRUE,N=FALSE)[from] )`).
Marek
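A runnable sketch of that comment's idea (the column name flag and the inline Y/N data are made up for illustration):

setClass("logical.Y.N")
setAs("character", "logical.Y.N",
      function(from) unname(c(Y = TRUE, N = FALSE)[from]))   # unname() drops the lookup names
read.csv(textConnection("flag\nY\nN\nY"), colClasses = "logical.Y.N")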