views: 112
answers: 4

I think I'm not asking the right question to begin with.

New Question: I have a 1.5 GB TSV file. It has 6 lines of junk at the top and one line of junk at the bottom, all of which I want to remove without opening the file in an editor. Line 7 contains the headers; there are 13 of them. The number of rows is unknown.

How do I read the file into a data frame so that I can do basic descriptive stats, boxplots, etc.?


Original Question:

Hi

I have a feeling this one is really easy. I'm just missing something.

I have a txt file, tab separated, with 6 lines of junk at the top and a junk line at the very bottom as well. In between the junk, the data has a header row Label1 Label2 Label3 Label4 ... Label13, followed by rows of the form text, ID, number, percent, ..., number.

Here is what I enter in R:

datadump <- read.delim2("truncate.txt", header=TRUE, skip=6)

cleandata <- datadump[c(-dim(datadump)[1]),]   # drop the junk line at the bottom

avgposition <- cleandata$Avg.Position

hist(avgposition)

Avg.Position is Label13 and is a number of the form #.#

Yet I get an error: Error in hist.default(avgposition) : 'x' must be numeric

Why is it not seeing the data as numeric?

Thanks!

As requested here is some data:

> dput(cleandata)
structure(list(Account = structure(c(2L, 2L), .Label = c("Crap1", 
"XXS"), class = "factor"), Campaign = structure(c(1L, 1L), .Label = c("3098012", 
"Crap2"), class = "factor"), Customer.Id = structure(c(2L, 2L
), .Label = c("", "nontech broad (7)"), class = "factor"), Ad.Group = structure(c(2L, 
2L), .Label = c("", "RR 236 (300)"), class = "factor"), Keyword = structure(2:3, .Label = c("", 
"chagall pro", "matisse"), class = "factor"), Keyword.Matching = structure(c(2L, 
2L), .Label = c("", "Broad"), class = "factor"), Impressions = c(4L, 
16L), Clicks = c(1L, 1L), CTR = structure(2:3, .Label = c("", 
"25.00%", "6.25%"), class = "factor"), Avg.CPC = structure(2:3, .Label = c("", 
"$0.05 ", "$0.11 "), class = "factor"), Avg.CPM = structure(2:3, .Label = c("", 
"$12.50 ", "$6.88 "), class = "factor"), Cost = structure(2:3, .Label = c("", 
"$0.05 ", "$0.11 "), class = "factor"), Avg.Position = structure(2:3, .Label = c("", 
"3", "3.1"), class = "factor")), .Names = c("Account", "Campaign", 
"Customer.Id", "Ad.Group", "Keyword", "Keyword.Matching", "Impressions", 
"Clicks", "CTR", "Avg.CPC", "Avg.CPM", "Cost", "Avg.Position"
), row.names = 1:2, class = "data.frame")
+2  A: 

R treats a column as non-numeric if it contains anything other than numbers and NA. You're either looking at the wrong column or you have some garbage in the column that needs to be cleaned out.

Perhaps the garbage was on that line you deleted. If there was something other than a number in the column, the whole column gets converted to a non-numeric type, probably a factor. If that's the case, you merely need to convert the variable in question back to numeric:

cleandata$Avg.Position <- as.numeric(levels(cleandata$Avg.Position)[cleandata$Avg.Position])
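A tiny illustration of the trap, with made-up toy data rather than the question's: calling as.numeric() directly on a factor returns the underlying level codes, not the values, which is why the explicit levels() indexing above is needed.

x <- factor(c("3", "3.1", ""))   # what a "numeric" column looks like once junk sneaks in
as.numeric(x)                    # 2 3 1  -- the level codes, not the data
as.numeric(levels(x)[x])         # 3.0 3.1 NA -- the actual values (with a coercion warning for "")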

You could work out just what type you have to convert from with

str(datadump)
John
Looks like the data does indeed have some "" values. It's just not clean enough.
datayoda
I tried loading the actual data and it is giving me tons of errors:
Error: cannot allocate vector of size 128.0 Mb
In addition: Warning messages:
1: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : Reached total allocation of 1535Mb: see help(memory.size)
2: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : Reached total allocation of 1535Mb: see help(memory.size)
3: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : Reached total allocation of 1535Mb: see help(memory.size)
datayoda
+2  A: 

This happens to me a lot when I have to pull from my colleagues' messy Excel files. Basically I get blank "" characters in the data frame. I usually just fix it by recoding them to NA and then converting the columns back with as.numeric().

df[df==""] <- NA                                               ## Recodes all "" as NA
df$Avg.Position <- as.numeric(as.character(df$Avg.Position))   ## factor -> character -> numeric
df$some.other.var <- as.numeric(as.character(df$some.other.var))

If you have other stray strings in Avg.Position, you'll need to search and destroy those too. Don't convert with as.numeric() until you are CERTAIN that everything strange is gone; weird things can happen to your data if you don't.

Alternatively you could do this right at the beginning:

datadump <- read.delim2("truncate.txt", na.strings=c("NA",""), header=TRUE, skip=6)

na.strings=c("NA","") tells read.table to treat both "NA" and "" as NA; you can use this to convert other "junk" values to NA as well.

You can also use nrows=SOME_NUMBER if you know how many data rows there are before the junk line at the very end of the file.

You might want to get rid of the $ signs too, as they are causing your Avg.CPC/CPM/Cost columns to be converted to factors, and that costs time and memory as well. There might be a way to do this at the source. (It looks like a download from web analytics software, but I can't tell which; it's been a long time since I've done web analytics.)
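If you can't fix it at the source, one way to strip those symbols after import could look like the sketch below; the helper name is made up, and the column names are taken from the dput() output above.

strip_to_numeric <- function(x) as.numeric(gsub("[$%,]", "", as.character(x)))  # drop $, % and commas, then convert
cleandata$Avg.CPC <- strip_to_numeric(cleandata$Avg.CPC)
cleandata$Avg.CPM <- strip_to_numeric(cleandata$Avg.CPM)
cleandata$Cost    <- strip_to_numeric(cleandata$Cost)
cleandata$CTR     <- strip_to_numeric(cleandata$CTR)   # "25.00%" becomes 25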

Brandon Bertelsen
this helps! thx.
datayoda
A: 

You use read.delim2, whose default decimal separator is ",", but in your data the decimal separator is ".". Try read.delim instead, and don't forget to provide the na.strings argument as Brandon Bertelsen suggests.

And since it's a 1.5 GB file, you may want to consider the advice in ?read.table about the comment.char parameter:

comment.char: character: a character vector of length one containing a single character or an empty string. Use "" to turn off the interpretation of comments

so use read.delim(some_other_settings, comment.char="").
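Putting the suggestions so far together, a call along these lines might work; the file name and the 6 skipped lines come from the question, the rest is just an illustration.

datadump <- read.delim("truncate.txt", header=TRUE, skip=6,
                       na.strings=c("NA",""), comment.char="")
cleandata <- datadump[-nrow(datadump), ]   # drop the junk line at the bottom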

Marek
read.delim2(file, header = TRUE, sep = "\t", quote = "\"", dec = ",", fill = TRUE, comment.char = "", ...). The default is sep = "\t", so he's using the right separator; read.csv() uses sep = ",".
Brandon Bertelsen
@Brandon I'm not talking about `sep` but about the decimal separator, `dec`.
Marek
+2  A: 

Things apparently get pretty messy for you, partly due to the large size of your data. With the size you report, you really have to do one of these:

  • you rescale your problem so you don't have to load the complete dataset (for instance by dropping columns you don't need; see the colClasses sketch after this list)
  • you use the techniques available in R for huge datasets.
  • you buy a 64-bit system with 12 GB of RAM and set your R memory limit large enough.
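For the first option, one simple lever is the colClasses argument of read.table()/read.delim(): declaring the column types up front means R doesn't have to guess them, and any column set to "NULL" is skipped entirely, which can cut memory use considerably. This is only a sketch; the 13 types are guesses based on the dput() output earlier in the thread, and the junk footer line may still trip a strict type, so combine it with nrows= or the readLines() trick below.

col_types <- c("character", "character", "character", "character", "character",
               "character", "integer", "integer", "character", "character",
               "character", "character", "numeric")   # use "NULL" for columns you don't need
datadump <- read.delim("truncate.txt", header=TRUE, skip=6,
                       na.strings=c("NA",""), comment.char="",
                       colClasses=col_types)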

If you choose one of the latter two options, you might benefit from watching Rosario's presentation at the Los Angeles R Users Group this year. See also the master page here for sample code and such.

That said, for very messy data I use a slightly different solution, namely a combination of readLines() and textConnection(). With the first, I read the data file in as a vector of lines. This allows me to scan all lines for awkward things, often using regular expressions, and to select any subset of lines to read very easily. textConnection() then allows me to use that vector of lines with read.table(), read.delim(), and so on. E.g.:

Lines <- readLines("somefile.txt")
Lines <- Lines[seq(2, 100, by = 2)] # selects every second line

xx <- textConnection(Lines)
Data <- read.table(xx, header = TRUE)
close(xx)
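Applied to the layout described in the question (6 junk lines on top, headers on line 7, one junk line at the bottom), the same pattern might look like the sketch below; note that readLines() still holds the whole 1.5 GB file in memory as a character vector, so it assumes you have the RAM for that.

Lines <- readLines("truncate.txt")
Lines <- Lines[7:(length(Lines) - 1)]   # keep the header row, drop the junk top and footer
con <- textConnection(Lines)
cleandata <- read.delim(con, header=TRUE, na.strings=c("NA",""))
close(con)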

Without having your actual data, it's difficult to guide you through the process. Keep in mind what is said in the other answers; it's all valid.

Joris Meys