tags:

views:

29

answers:

2

Is there anything I can do to get partial results after bumping into an error in a big file? I am using the following command to import data from files. It is the fastest way I know, but it is not robust: a small error can easily ruin the whole import. I hope there is at least a way for scan (or any reader) to quickly report which row/line contains the error, or to return the partial results it has read so far (then I would have an idea where the error is). I could then skip enough lines to recover over 99% of the good data.

rawData = scan(file = "rawData.csv", what = scanformat, sep = ",", skip = 1, quiet = TRUE, fill = TRUE, na.strings = c("-", "NA", "Na","N"))

All the data-import tutorials I have found seem to assume files are in good shape. I did not find any useful hints on dealing with dirty files.

I will sincerely appreciate any hint or suggestion! This has been really frustrating.

A: 

Idea 1: Open a file connection (with the file function) and then scan line by line (with nlines = 1). Wrap each scan call in try (or tryCatch) so you can recover after reading a bad line.
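A minimal sketch of this idea. One caveat: if scan fails partway through a connection, the read position is undefined, so this version pulls each line with readLines and hands it to scan(text = ...) instead of scanning the connection directly. The function name read_with_recovery is made up, and `what` stands in for the question's `scanformat` list:

```r
# Read a delimited file line by line, trapping parse errors so one
# bad line does not abort the whole import. Returns the rows that
# parsed cleanly plus the physical line numbers of the bad ones.
read_with_recovery <- function(path, what, sep = ",", skip = 1,
                               na.strings = c("-", "NA", "Na", "N")) {
  con <- file(path, open = "r")
  on.exit(close(con))
  if (skip > 0) readLines(con, n = skip)   # discard header line(s)

  good <- list()
  bad  <- integer(0)
  i    <- skip                              # physical line counter
  repeat {
    line <- readLines(con, n = 1)
    if (length(line) == 0) break            # end of file
    i <- i + 1
    row <- tryCatch(
      scan(text = line, what = what, sep = sep,
           quiet = TRUE, na.strings = na.strings),
      error = function(e) NULL              # bad line: signal with NULL
    )
    if (is.null(row)) bad <- c(bad, i)      # remember where it failed
    else good[[length(good) + 1]] <- row
  }
  list(rows = good, bad_lines = bad)
}
```

This trades some speed for robustness, but `bad_lines` tells you exactly which lines to inspect or re-skip on a second pass.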

Idea 2: Use readLines to read the file in raw format, then use strsplit to parse it. You can analyse this output to find bad lines and remove them.
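A sketch of this second idea, flagging as "bad" any line whose field count differs from the majority. The helper name find_bad_lines is hypothetical, and this simple strsplit approach does not handle quoted fields containing the separator:

```r
# Read a file as raw text, split each line on the separator, and
# flag lines whose field count differs from the majority of lines.
find_bad_lines <- function(path, sep = ",", skip = 1) {
  lines <- readLines(path)
  if (skip > 0) lines <- lines[-seq_len(skip)]     # drop header line(s)

  n <- lengths(strsplit(lines, sep, fixed = TRUE)) # fields per line
  expected <- as.integer(names(which.max(table(n)))) # majority count
  list(bad  = which(n != expected) + skip,         # physical line numbers
       good = lines[n == expected])                # lines safe to parse
}
```

The `good` lines can then be fed to scan(text = ...) or read.table(text = ...) in one go.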

mbq
A: 

The count.fields function will preprocess a table-like file and tell you how many fields it found on each line (in the sense that read.table looks for fields). This is often a quick way to identify problem lines, because they show a different number of fields from what is expected (or simply different from the majority of other lines).
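For example, on a toy file (made up here for illustration; the question's file and sep = "," are assumed), count.fields immediately singles out the short line:

```r
# Demo of count.fields on a small comma-separated file where one
# line is missing a field.
tf <- tempfile(fileext = ".csv")
writeLines(c("a,b,c",
             "1,2,3",
             "4,5",      # bad line: only two fields
             "6,7,8"), tf)

n <- count.fields(tf, sep = ",")   # fields found on each line
expected <- as.integer(names(which.max(table(n))))  # majority count
suspect  <- which(n != expected)   # line numbers to inspect: here, 3
```

Note that count.fields returns NA for lines that start or end inside a multi-line quoted string, so NA entries are also worth inspecting.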

Greg Snow