The dataset I want to read in contains numbers both with and without a comma as the thousands separator:

"Sudan", "15,276,000", "14,098,000", "13,509,000"
"Chad", 209000, 196000, 190000

and I am looking for a way to read this data in.

Any hint appreciated!

+1  A: 

Looking at that set of data, you could parse it using ", " (note the extra space) as the separator instead of ",".

Scobal
A: 

How about doing it as a two-step process? 1. Replace the "," with a TAB character. 2. Split on tab.

I'm assuming .NET here, but the same principle would apply in any language.

Raj
A couple of comments: 1) the "r" tag means Karsten is using the R language, not .NET. 2) Replacing all commas with tabs wouldn't work; you'd also split the numbers at their thousands separators.
Ken Williams
A: 

You could use the following regular expression to remove the thousands-separator commas and any surrounding quote marks, leaving plain CSV content:

,(?=[0-9])|"

Then process it as normal.

Justin Wignall
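Since the question is tagged R, here is a minimal sketch of how that regular expression might be applied there. It assumes the two sample lines from the question, supplied inline rather than read from a file, and the variable names (`raw`, `cleaned`, `df`) are illustrative only. Note that the pattern relies on the field separator being ", " (comma plus space): a comma directly followed by a digit is treated as a thousands separator, so unquoted numbers separated by a bare "," would be merged.

```r
# Sample lines copied from the question (in practice: raw <- readLines("t.csv"))
raw <- c('"Sudan", "15,276,000", "14,098,000", "13,509,000"',
         '"Chad", 209000, 196000, 190000')

# Remove every comma that is immediately followed by a digit, and every
# double quote. perl = TRUE is needed because the pattern uses a lookahead.
cleaned <- gsub(',(?=[0-9])|"', '', raw, perl = TRUE)

# Parse the cleaned lines as ordinary CSV; strip.white drops the space
# that follows each remaining comma.
df <- read.csv(text = cleaned, header = FALSE, strip.white = TRUE)
str(df)
```

After this, the numeric columns should come back as numbers directly, with no further conversion step.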
+7  A: 

Since there is an "r" tag under the question, I assume this is an R question. In R, you do not need to do anything special to handle the quoted commas:

> read.csv('t.csv', header=FALSE)
     V1          V2          V3          V4
1 Sudan  15,276,000  14,098,000  13,509,000
2  Chad      209000      196000      190000

# if you want to convert them to numbers (repeat for V3 and V4):
> df <- read.csv('t.csv', header=FALSE, stringsAsFactors=FALSE)
> df$V2 <- as.numeric(gsub(',', '', df$V2))
xiechao
I'd love it if read.csv (and read.table at root) took a 'thousands.sep' argument as a character to allow (and strip) in numeric data. For now I think the gsub() solution is all we have though.
Ken Williams
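The gsub() approach can be applied to all numeric columns in one pass. The following is only a sketch under stated assumptions: `strip_commas` is a hypothetical helper (not a base R function), the sample data from the question is supplied inline instead of via 't.csv', and the first column is assumed to be the only non-numeric one.

```r
# Hypothetical helper: drop thousands separators, then convert to numeric
strip_commas <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

# Sample lines from the question, supplied inline instead of via 't.csv'
txt <- c('"Sudan", "15,276,000", "14,098,000", "13,509,000"',
         '"Chad", 209000, 196000, 190000')

# Read every column as character so nothing is converted prematurely
df <- read.csv(text = txt, header = FALSE,
               colClasses = "character", strip.white = TRUE)

# Assumption: every column except the first holds numbers
num_cols <- names(df)[-1]
df[num_cols] <- lapply(df[num_cols], strip_commas)
str(df)
```

Reading with colClasses = "character" keeps the conversion explicit, so a value that still fails to parse shows up as NA with a warning rather than silently changing the column type.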