I'm trying to read a text file with different row lengths:

1
1   2
1   2 3
1   2 3 4
1   2 3 4 5
1   2 3 4 5 6
1   2 3 4 5 6 7
1   2 3 4 5 6 7 8

To overcome this problem, I'm using the argument fill=TRUE in read.table, so:

data<-read.table("test",sep="\t",fill=TRUE)

Unfortunately, to assess the maximum row length, read.table reads only the first 5 lines of the file, and generates an object looking like this:

data
   V1 V2 V3 V4 V5
1   1 NA NA NA NA
2   1  2 NA NA NA
3   1  2  3 NA NA
4   1  2  3  4 NA
5   1  2  3  4  5
6   1  2  3  4  5
7   6 NA NA NA NA
8   1  2  3  4  5
9   6  7 NA NA NA
10  1  2  3  4  5
11  6  7  8 NA NA
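
The behaviour can be reproduced with a small sketch, assuming the sample above is written to a temporary file (the tempfile path and the variable names are illustrative stand-ins, not part of the original setup):

```r
# Recreate the sample file: line n holds the integers 1..n, tab-separated.
path <- tempfile(fileext = ".txt")  # hypothetical stand-in for "test"
rows <- vapply(1:8, function(n) paste(seq_len(n), collapse = "\t"), character(1))
writeLines(rows, path)

# Field counts on the first five lines only, which is what read.table
# inspects when col.names is not supplied.
first_five <- count.fields(path, sep = "\t")[1:5]
max(first_five)

# Without col.names, the column count is fixed from those first five lines,
# so the longer rows further down get wrapped onto extra rows.
data <- read.table(path, sep = "\t", fill = TRUE)
ncol(data)
```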

Is there a way to force read.table to scroll over the whole file to assess the maximum row length? I know a possible solution would be to provide the column number, like:

data<-read.table("test",sep="\t",fill=TRUE,col.names=c(1:8))

But since I have a lot of files, I wanted to assess this automatically within R. Any suggestion? :-)


EDIT: the original file doesn't contain progressive numbers, so this is not a solution:

data1<-read.table("test",sep="\t",fill=TRUE)
data2<-read.table("test",sep="\t",fill=TRUE,col.names=c(1:max(data1,na.rm=TRUE)))
+15  A: 

There is a nice function, count.fields (see its help page), which counts the number of columns in each row:

count.fields("test", sep = "\t")
#[1] 1 2 3 4 5 6 7 8

So, using your second solution:

no_col <- max(count.fields("test", sep = "\t"))
data <- read.table("test",sep="\t",fill=TRUE,col.names=1:no_col)
data
#   X1 X2 X3 X4 X5 X6 X7 X8
# 1  1 NA NA NA NA NA NA NA
# 2  1  2 NA NA NA NA NA NA
# 3  1  2  3 NA NA NA NA NA
# 4  1  2  3  4 NA NA NA NA
# 5  1  2  3  4  5 NA NA NA
# 6  1  2  3  4  5  6 NA NA
# 7  1  2  3  4  5  6  7 NA
# 8  1  2  3  4  5  6  7  8
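
Since the question mentions many files, the two steps above could be wrapped in a small helper and applied to each path. This is only a sketch: read_ragged and write_sample are hypothetical names, and the temporary files stand in for the real data.

```r
# Hypothetical helper: scan the whole file for its widest row, then read it
# with that many columns so no row is wrapped.
read_ragged <- function(path, sep = "\t") {
  n_col <- max(count.fields(path, sep = sep), na.rm = TRUE)
  read.table(path, sep = sep, fill = TRUE,
             col.names = paste0("V", seq_len(n_col)))
}

# Demo data: line n of each file holds the integers 1..n, tab-separated.
write_sample <- function(n_rows) {
  path <- tempfile(fileext = ".txt")
  rows <- vapply(seq_len(n_rows),
                 function(n) paste(seq_len(n), collapse = "\t"),
                 character(1))
  writeLines(rows, path)
  path
}

paths  <- c(write_sample(8), write_sample(3))
tables <- lapply(paths, read_ragged)  # one data frame per file
sapply(tables, ncol)                  # each file keeps its own maximum width
```

For real data, paths would typically come from something like list.files() rather than from temporary files.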
Marek
Brilliant. Elegant and fast :-)
Thrawn
Good call. I've been using R for over a year and never ran into that function, even though it's right there at the end of the read.table documentation!
Steve Lianoglou
+2  A: 

Using count.fields is definitely the right approach for this, but just for completeness:

Another option is to bring in all the raw text and parse it within R:

x <- readLines(textConnection(
"1\t
1\t2
1\t2\t3
1\t2\t3\t4
1\t2\t3\t4\t5
1\t2\t3\t4\t5\t6"))
x <- strsplit(x,"\t")

To combine a list of unequal length vectors, the easiest approach is to use the rbind.fill function from plyr:

library(plyr)
# requires data.frames with column names
x <- lapply(x,function(x) {x <- as.data.frame(t(x)); colnames(x)=1:length(x); return(x)})
do.call(rbind.fill,x)
  1    2    3    4    5    6
1 1 <NA> <NA> <NA> <NA> <NA>
2 1    2 <NA> <NA> <NA> <NA>
3 1    2    3 <NA> <NA> <NA>
4 1    2    3    4 <NA> <NA>
5 1    2    3    4    5 <NA>
6 1    2    3    4    5    6
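
If plyr is not available, a similar result can be sketched in base R by padding every vector with NA up to the longest length before binding. This is an alternative to rbind.fill rather than the approach above, and the variable names and the shortened input are illustrative:

```r
# Split each raw line on tabs, as above (a shortened input for illustration).
x <- strsplit(c("1", "1\t2", "1\t2\t3"), "\t")

# Pad every vector with NA up to the longest row.
n_col  <- max(lengths(x))
padded <- lapply(x, function(v) { length(v) <- n_col; v })

# Bind the equal-length vectors into a character matrix, then a data frame.
df <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)

# Everything is still character at this point; convert columns where possible.
df[] <- lapply(df, type.convert, as.is = TRUE)
df
```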
Shane