Hello,

I am currently using this code to input data from numerous files into R:

library(foreign)

setwd("/Users/ericbrotto/Desktop/A_Intel/")

filelist <-list.files()


#assuming semicolon-separated values with a header
datalist = lapply(filelist, function(x) read.table(x, header=TRUE, sep=";", comment.char=""))

#assuming the same header/columns for all files
datafr = do.call("rbind", datalist) 

The headers look like this:

TIME ;POWER SOURCE ;qty MONITORS ;NUM PROCESSORS ;freq of CPU Mhz ;SCREEN SIZE ;CPU LOAD ;BATTERY LEVEL ; KEYBOARD MVT ; MOUSE MVT ;BATTERY MWH ;HARD DISK SPACE ;NUMBER PROCESSES ;RAM ;RUNNING APPS  ;FOCUS APP ;BYTES IN ;BYTES OUT ;ACTIVE NETWORKS ; IP ADDRESS ; NAMES OF FILES ; 

and an example of the data looks like this:

 2010-09-11-19:28:34.680 ; BA ; 1 ; 2 ; 2000 ; 1440 : 900  ; 0.224121 ; 92 ; NO ; NO ; NULL ; 92.581558  ;  57    ; 196.1484375   ; +NULL  ; loginwindow-#35  ;  5259  ;  4506  ; en1 :   ;  192.168.1.3  ;  NULL  ;    

Rather than read all of the columns into a data frame, I would like to grab just one, say, FOCUS APP. Any advice on how to do this?

Thanks,

A: 

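Maybe just adding the column name to your read.table line is enough, like this:

# check.names=FALSE keeps the header "FOCUS APP" as written; by default read.table rewrites it to a syntactic name such as FOCUS.APP
datalist = lapply(filelist, function(x) read.table(x, header=TRUE, sep=";", comment.char="", check.names=FALSE)["FOCUS APP"])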
Gary Lee
Thanks Gary. Just wondering though, has your method been tested?
Eric Brotto
This works, but is inefficient because it inputs the entire file before subsetting. `colClasses` is more efficient because it doesn't input `"NULL"` columns.
Joshua Ulrich
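Applied to the semicolon-separated files from the question, the `colClasses` idea might look roughly like the sketch below. `read_one_column` is a hypothetical helper, not something from the thread, and the two demo files (including the "Finder" row) are made up and shortened to three columns. It assumes every file has the same header.

```r
# Hypothetical helper: read one named column from a semicolon-separated
# file, skipping every other column via colClasses.
read_one_column <- function(path, column, sep = ";") {
  # Peek at the header plus one row to learn the column names.
  # check.names = FALSE keeps "FOCUS APP" from being rewritten.
  peek <- read.table(path, header = TRUE, sep = sep, comment.char = "",
                     nrows = 1, check.names = FALSE, strip.white = TRUE)
  header <- trimws(names(peek))

  if (!column %in% header) stop("column not found: ", column)

  # "NULL" (the string) skips a column; only the wanted one is kept.
  classes <- rep("NULL", length(header))
  classes[header == column] <- "character"

  read.table(path, header = TRUE, sep = sep, comment.char = "",
             colClasses = classes, check.names = FALSE, strip.white = TRUE)
}

# Made-up demo files standing in for the real logs
demo_dir <- tempfile("logs")
dir.create(demo_dir)
writeLines(c("TIME ;FOCUS APP ;BYTES IN",
             " 2010-09-11-19:28:34.680 ; loginwindow-#35 ; 5259"),
           file.path(demo_dir, "a.txt"))
writeLines(c("TIME ;FOCUS APP ;BYTES IN",
             " 2010-09-11-19:28:35.680 ; Finder ; 5300"),
           file.path(demo_dir, "b.txt"))

filelist <- list.files(demo_dir, full.names = TRUE)
focus <- do.call(rbind, lapply(filelist, read_one_column, column = "FOCUS APP"))
```

`comment.char = ""` matters here because values such as `loginwindow-#35` contain a `#`, which `read.table` would otherwise treat as the start of a comment.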
+3  A: 

If you just want to read in a particular column from your files, then colClasses is the way to go. For example, suppose your data looked like this:

a,b
1,2
3,4

Then

## Use colClasses to select columns
## "NULL" means skip the column
## "numeric" means that the column is numeric
## Other options are Date, factor - see ?read.table for more
## Use NA to let R decide
data = read.table("/tmp/tmp.csv", sep=",", 
                  colClasses=c("NULL", "numeric"), 
                  header=TRUE)

gives just the second column.

> data
  b
1 2
2 4
csgillespie
Thanks csgillespie. Looks good, but I don't see where you placed the argument to specify that you wanted the 2nd column. Am I missing something? :)
Eric Brotto
I've added a comment to the code.
csgillespie
Hmmmm... I'm not sure if I understand correctly. If I have four columns labeled Tiger, Lion, Bear, Gorilla and all I want is Bear would I write: colClasses=c("NULL", "NULL', 'numeric', "NULL")?
Eric Brotto
That's correct.
csgillespie
Eric, be careful with single/double quote pairs. `'` and `"` do not match each other, so `"NULL'` is an open quote.
Joshua Ulrich
@Joshua: Good spot!
csgillespie
@csgillespie: thanks!
Joshua Ulrich
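For the four-column example in the comments above (Tiger, Lion, Bear, Gorilla), a minimal sketch that builds the `colClasses` vector from the header rather than counting positions by hand. The temp file and its values are invented for illustration.

```r
# Made-up four-column file matching the example in the comments
tmp <- tempfile(fileext = ".csv")
writeLines(c("Tiger,Lion,Bear,Gorilla",
             "1,2,3,4",
             "5,6,7,8"), tmp)

# Read the header plus one row just to learn the column names
header <- names(read.table(tmp, sep = ",", header = TRUE, nrows = 1))

# "NULL" (the string) skips a column; NA lets R decide the type
classes <- rep("NULL", length(header))
classes[header == "Bear"] <- NA

bear <- read.table(tmp, sep = ",", header = TRUE, colClasses = classes)
```

Here `classes` ends up equivalent to `c("NULL", "NULL", NA, "NULL")`, so only the `Bear` column is kept.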
A: 

If you are just doing this once, then the colClasses answer is probably the best (however that still reads in all the data, just only processes the one column). If you are doing things like this often then you may want to use a database instead. Look at the RSQLite, sqldf, and SQLiteDF packages as well as RODBC for some possibilities.

Greg Snow
All the data are *not* read in. `read.table` uses `scan` and the Details section of `?scan` says: "If any of the types is NULL, the corresponding field is skipped (but a NULL component appears in the result)." E.g.: `Data <- data.frame(x=rnorm(20),y=rnorm(20)); write.table(Data,"Data.txt",row.names=FALSE); scan("Data.txt",what=list(0.0,NULL),skip=1)`
Joshua Ulrich
The scan function has to read every single byte, how else will it know when the interesting columns start? It just does not store the "NULL" columns after identifying where they stop. "Skipped" does not mean those bytes are never read, just that they are not stored/processed. Actual databases have indexing information that allows the database program to jump directly to places of interest in the data and has potential to not read in all the data.
Greg Snow
@Greg I agree that databases are more I/O efficient for repeated reads of this type, and I understand what you're saying. My point was that the data are not read _into_ (i.e. stored in) your R session. Your answer is not clear that the data are not stored. "[the colClasses answer] still reads *in* all the data, just only processes the one column" (emphasis mine) is misleading.
Joshua Ulrich
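A rough sketch of the database route Greg mentions, assuming the DBI and RSQLite packages are installed. The table name `log`, the column names, and the rows are all made up; an in-memory database is used only to keep the example self-contained, whereas a file path would be needed for the data to persist between sessions.

```r
library(DBI)

# Made-up rows standing in for the combined log files
log <- data.frame(
  time      = c("2010-09-11-19:28:34.680", "2010-09-11-19:28:35.680"),
  focus_app = c("loginwindow-#35", "Finder"),
  bytes_in  = c(5259L, 5300L),
  stringsAsFactors = FALSE
)

# Load the data once ...
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "log", log)

# ... then pull out single columns as often as needed
focus <- dbGetQuery(con, "SELECT focus_app FROM log ORDER BY time")

dbDisconnect(con)
```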