The software I am using produces log files with a variable number of lines of summary information followed by a large amount of tab-delimited data. I am trying to write a function that will read the data from these log files into a data frame, ignoring the summary information. The summary information never contains a tab, so the following function works:

read.parameters <- function(file.name, ...){
  # First pass: find the first line containing a tab
  lines <- scan(file.name, what="character", sep="\n")
  first.line <- min(grep("\\t", lines))
  # Second pass: re-read the file, skipping the summary lines
  return(read.delim(file.name, skip=first.line-1, ...))
}

However, these log files are quite big, so reading each file twice is very slow. Surely there is a better way?

Edited to add:

Marek suggested using a textConnection object. The way he suggested in the answer fails on a big file, but the following works:

read.parameters <- function(file.name, ...){
  conn <- file(file.name, "r")
  on.exit(close(conn))
  # Read lines one at a time until the first tab-containing line,
  # then push it back so read.delim sees it as the header.
  repeat {
    line <- readLines(conn, 1)
    if (length(grep("\\t", line))) {
      pushBack(line, conn)
      break
    }
  }
  df <- read.delim(conn, ...)
  return(df)
}
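
To see the final version in action, here is a self-contained sanity check. The fake log contents and the `tempfile()` are illustration only, mimicking the real files' shape (a few summary lines, then tab-delimited data):

```r
read.parameters <- function(file.name, ...){
  conn <- file(file.name, "r")
  on.exit(close(conn))
  # Skip lines one at a time until the first tab-containing line,
  # then push it back so read.delim sees it as the header.
  repeat {
    line <- readLines(conn, 1)
    if (length(grep("\\t", line))) {
      pushBack(line, conn)
      break
    }
  }
  read.delim(conn, ...)
}

# Fake log: three summary lines, then a tab-delimited header and two rows.
tmp <- tempfile(fileext = ".log")
cat(c("summary line one", "summary line two", "summary line three",
      "a\tb\tc", "1\t2\t3", "4\t5\t6"),
    file = tmp, sep = "\n")

df <- read.parameters(tmp)
dim(df)   # 2 rows, 3 columns
unlink(tmp)
```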

Edited again: Thanks Marek for further improvement to the above function.

A: 

You don't need to read the file twice. Use `textConnection` on the result of the first read.

read.parameters <- function(file.name, ...){
  lines <- scan(file.name, what="character", sep="\n") # the original had "tmp.log" hard-coded here; I suppose file.name is meant
  first.line <- min(grep("\\t", lines))
  return(read.delim(textConnection(lines), skip=first.line-1, ...))
}
Marek
I've fixed the typo. Thanks for suggesting `textConnection`, although the function as you've given it doesn't work for me. I think I need to make the textConnection first, then run scan on it, then use pushBack to rewind the connection.
Michael Dunn
Strange. I tested it and it works on my fake data. Did you get an error message or empty results?
Marek
Example: `cat(c("ds","sdds","sddfsd","a\tb\tc","1\t2\t3","1\t2\t3"),file="test.txt", sep="\n")` then `read.parameters("test.txt")` return `data.frame` with 3 cols and 2 rows.
Marek
Interesting. I can confirm it works with your fake data, but with my big data files R stops responding and I have to force-quit. Still, inspired by your suggestion I've produced a working version (added to the question in order to preserve the formatting).
Michael Dunn
A: 

If you can be sure that the header info won't be more than N lines, e.g. N = 200, then try:

scan(..., nlines = N)

That way you won't re-read more than N lines.
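
A sketch of this capped-header idea; the function name, the default `N = 200`, and the demo file are illustrative, not part of the original answer:

```r
read.parameters.capped <- function(file.name, N = 200, ...) {
  # Only the first N lines are scanned for the start of the data block,
  # so at most N lines are ever read twice.
  head.lines <- scan(file.name, what = "character", sep = "\n",
                     nlines = N, quiet = TRUE)
  first.line <- min(grep("\\t", head.lines))
  read.delim(file.name, skip = first.line - 1, ...)
}

# Demo: one summary line, then a tab-delimited header and one data row.
tmp <- tempfile()
cat(c("summary", "a\tb", "1\t2"), file = tmp, sep = "\n")
df2 <- read.parameters.capped(tmp)
unlink(tmp)
```

This fails with an error if no tab appears within the first N lines, which is exactly the case the questioner says he cannot rule out.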

G. Grothendieck
That's a decent approach, but I can't really guarantee anything about the header size. I'm quite pleased with my function using a file connection.
Michael Dunn