ansaurus

Question

Inputting one column of info into an R data frame.

Answer 1

A:

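Maybe just subset by the column name at the end of your read.table call. Note that read.table's default check.names=TRUE would turn "FOCUS APP" into "FOCUS.APP", so set check.names=FALSE to keep the space in the name:

datalist = lapply(filelist, function(x) read.table(x, header=TRUE, sep=";", comment.char="", check.names=FALSE)["FOCUS APP"])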

Gary Lee 2010-09-27 13:02:49

Thanks Gary. Just wondering though, has your method been tested?

Eric Brotto 2010-09-27 13:29:21

This works, but is inefficient because it inputs the entire file before subsetting. `colClasses` is more efficient because it doesn't input `"NULL"` columns.

Joshua Ulrich 2010-09-27 14:12:24
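
To make the comparison in the comment above concrete, here is a minimal sketch that runs both approaches over a list of files. The temporary files, the `ID` and `OTHER` columns, and the values are made up for illustration; only the `FOCUS APP` column name and the `;` separator come from the thread.

```r
# Hypothetical demo data: two small ';'-separated files in a temp directory
demo_dir <- tempfile("demo")
dir.create(demo_dir)
for (i in 1:2) {
  df <- data.frame(ID = 1:3,
                   "FOCUS APP" = c(1.5, 2.5, 3.5) + i,
                   OTHER = letters[1:3],
                   check.names = FALSE)
  write.table(df, file.path(demo_dir, paste0("file", i, ".csv")),
              sep = ";", row.names = FALSE, quote = FALSE)
}
filelist <- list.files(demo_dir, full.names = TRUE)

# Approach 1: parse every column, then keep one by name
by_subset <- lapply(filelist, function(x)
  read.table(x, header = TRUE, sep = ";", comment.char = "",
             check.names = FALSE)["FOCUS APP"])

# Approach 2: "NULL" entries tell read.table not to store those columns
by_colclasses <- lapply(filelist, function(x)
  read.table(x, header = TRUE, sep = ";", comment.char = "",
             check.names = FALSE,
             colClasses = c("NULL", "numeric", "NULL")))

# Both lists should hold one single-column data frame per file
all.equal(by_subset, by_colclasses)
```

The positional `colClasses` vector assumes every file has the same three columns in the same order; if the layouts differ, it would need to be built per file.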

Answer 2

+3 A:

If you just want to read in a particular column from your files, then colClasses is the way to go. For example, suppose your data looked like this:

a,b
1,2
3,4

Then

## Use colClasses to select columns
## "NULL" means skip the column
## "numeric" means that the column is numeric
## Other options are Date, factor - see ?read.table for more
## Use NA to let R decide
data = read.table("/tmp/tmp.csv", sep=",", 
                  colClasses=c("NULL", "numeric"), 
                  header=TRUE)

gives just the second column.

> data
  b
1 2
2 4

csgillespie 2010-09-27 13:07:37

Thanks csgillespie. Looks good, but I don't see where you placed the argument to specify that you wanted the 2nd column. Am I missing something? :)

Eric Brotto 2010-09-27 13:24:27

I've added a comment to the code.

csgillespie 2010-09-27 13:27:57

Hmmmm... I'm not sure if I understand correctly. If I have four columns labeled Tiger, Lion, Bear, Gorilla and all I want is Bear would I write: colClasses=c("NULL", "NULL', 'numeric', "NULL")?

Eric Brotto 2010-09-27 13:33:52

That's correct.

csgillespie 2010-09-27 13:35:15

Eric, be careful with single/double quote pairs. `'` and `"` do not match each other, so `"NULL'` is an open quote.

Joshua Ulrich 2010-09-27 14:15:11

@Joshua: Good spot!

csgillespie 2010-09-27 14:16:28

@csgillespie: thanks!

Joshua Ulrich 2010-09-28 22:20:58
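
The four-column case from the comments above can be sketched as follows. The file contents are invented for illustration, and building the `colClasses` vector from the header line is just one possible way to avoid hard-coding the position (and the quoting slip).

```r
# Hypothetical four-column file matching the Tiger/Lion/Bear/Gorilla example
path <- tempfile(fileext = ".csv")
writeLines(c("Tiger,Lion,Bear,Gorilla",
             "1,2,3,4",
             "5,6,7,8"), path)

# Read only the header line and mark every column except Bear as "NULL"
cols <- strsplit(readLines(path, n = 1), ",")[[1]]
classes <- ifelse(cols == "Bear", "numeric", "NULL")

# Equivalent to colClasses = c("NULL", "NULL", "numeric", "NULL")
bear <- read.table(path, sep = ",", header = TRUE, colClasses = classes)
print(bear)
```

This assumes a simple header with no quoted fields or embedded commas; a header with quoting would need a more careful parse.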

Answer 3

A:

If you are just doing this once, then the colClasses answer is probably the best (however that still reads in all the data, just only processes the one column). If you are doing things like this often then you may want to use a database instead. Look at the RSQLite, sqldf, and SQLiteDF packages as well as RODBC for some possibilities.

Greg Snow 2010-09-27 14:45:29

All the data are *not* read in. `read.table` uses `scan` and the Details section of `?scan` says: "If any of the types is NULL, the corresponding field is skipped (but a NULL component appears in the result)." E.g.: `Data <- data.frame(x=rnorm(20),y=rnorm(20)); write.table(Data,"Data.txt",row.names=FALSE); scan("Data.txt",what=list(0.0,NULL),skip=1)`

Joshua Ulrich 2010-09-28 22:18:36

The scan function has to read every single byte, how else will it know when the interesting columns start? It just does not store the "NULL" columns after identifying where they stop. "Skipped" does not mean those bytes are never read, just that they are not stored/processed. Actual databases have indexing information that allows the database program to jump directly to places of interest in the data and has potential to not read in all the data.

Greg Snow 2010-09-29 16:57:15

@Greg I agree that databases are more I/O efficient for repeated reads of this type, and I understand what you're saying. My point was that the data are not read _into_ (i.e. stored in) your R session. Your answer is not clear that the data are not stored. "[the colClasses answer] still reads *in* all the data, just only processes the one column" (emphasis mine) is misleading.

Joshua Ulrich 2010-09-29 20:16:19
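
For anyone who wants to check the storage side of this discussion, here is a self-contained variant of the `scan` one-liner from the earlier comment. It uses a temporary file instead of `Data.txt`; the seed and the `quiet = TRUE` argument are additions for reproducibility and tidiness. It only shows what ends up stored in the R session, not how many bytes are read from disk.

```r
set.seed(1)  # arbitrary seed, only so the demo is reproducible
Data <- data.frame(x = rnorm(20), y = rnorm(20))
path <- tempfile(fileext = ".txt")
write.table(Data, path, row.names = FALSE)

# what = list(0.0, NULL): keep field 1 as numeric, skip field 2
res <- scan(path, what = list(0.0, NULL), skip = 1, quiet = TRUE)

str(res)           # a list whose second component is NULL
length(res[[1]])   # number of values stored for the first column
is.null(res[[2]])  # the skipped column is not stored
```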
