views: 112
answers: 4

I think I'm not asking the right question to begin with.

New Question: I have a 1.5 GB TSV file. It has 6 lines of junk at the top and one line of junk at the bottom, all of which I want to remove without opening the file in an editor. Line 7 contains the headers; there are 13 of them. The number of rows is unknown.

How do I read the file into a data frame so that I can do basic descriptive stats, boxplots, etc.?


Original Question:

Hi

I have a feeling this one is really easy. I'm just missing something.

I have a txt file, tab separated, with 6 lines of junk at the top and a junk line at the very bottom as well. In between the junk, the data has a header row Label1 Label2 Label3 Label4 ... Label13, followed by rows of the form text, ID, number, percent, ..., number.

Here is what I enter in R:

datadump <- read.delim2("truncate.txt", header=TRUE, skip=6)

cleandata <- datadump[c(-dim(datadump)[1]),]   # drop the junk line at the bottom

avgposition <- cleandata$Avg.Position

hist(avgposition)

Avg.Position is Label13 and is a number of the form #.#

Yet I get an error: Error in hist.default(avgposition) : 'x' must be numeric

Why is it not seeing the data as numeric?

Thanks!

As requested here is some data:

> dput(cleandata)
structure(list(Account = structure(c(2L, 2L), .Label = c("Crap1", 
"XXS"), class = "factor"), Campaign = structure(c(1L, 1L), .Label = c("3098012", 
"Crap2"), class = "factor"), Customer.Id = structure(c(2L, 2L
), .Label = c("", "nontech broad (7)"), class = "factor"), Ad.Group = structure(c(2L, 
2L), .Label = c("", "RR 236 (300)"), class = "factor"), Keyword = structure(2:3, .Label = c("", 
"chagall pro", "matisse"), class = "factor"), Keyword.Matching = structure(c(2L, 
2L), .Label = c("", "Broad"), class = "factor"), Impressions = c(4L, 
16L), Clicks = c(1L, 1L), CTR = structure(2:3, .Label = c("", 
"25.00%", "6.25%"), class = "factor"), Avg.CPC = structure(2:3, .Label = c("", 
"$0.05 ", "$0.11 "), class = "factor"), Avg.CPM = structure(2:3, .Label = c("", 
"$12.50 ", "$6.88 "), class = "factor"), Cost = structure(2:3, .Label = c("", 
"$0.05 ", "$0.11 "), class = "factor"), Avg.Position = structure(2:3, .Label = c("", 
"3", "3.1"), class = "factor")), .Names = c("Account", "Campaign", 
"Customer.Id", "Ad.Group", "Keyword", "Keyword.Matching", "Impressions", 
"Clicks", "CTR", "Avg.CPC", "Avg.CPM", "Cost", "Avg.Position"
), row.names = 1:2, class = "data.frame")
+2  A: 

R treats a column as non-numeric if it contains anything other than numbers and NA. You're either looking at the wrong column or you have some garbage in the column that needs to be cleaned out.

Perhaps the garbage was on that line you deleted. If there was something other than a number in the column, the whole column gets converted to a non-numeric type, probably a factor. If that's the case, you merely need to convert the variable in question back to numeric:

cleandata$Avg.Position <- as.numeric(levels(cleandata$Avg.Position)[cleandata$Avg.Position])
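A tiny illustration of the trap, with made-up toy data rather than the question's: calling as.numeric() directly on a factor returns the underlying level codes, not the values, which is why the explicit levels() indexing above is needed.

x <- factor(c("3", "3.1", ""))   # what a "numeric" column looks like once junk sneaks in
as.numeric(x)                    # 2 3 1  -- the level codes, not the data
as.numeric(levels(x)[x])         # 3.0 3.1 NA -- the actual values (with a coercion warning for "")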

You could work out just what type you have to convert from with

str(datadump)
John
Looks like the data does indeed have some "" values. It's just not clean enough.
datayoda
I tried loading the actual data and it is giving me tons of errors:
Error: cannot allocate vector of size 128.0 Mb
In addition: Warning messages:
1: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : Reached total allocation of 1535Mb: see help(memory.size)
2: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : Reached total allocation of 1535Mb: see help(memory.size)
3: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : Reached total allocation of 1535Mb: see help(memory.size)
datayoda
+2  A: 

This happens to me a lot when I have to pull from my colleagues' messy Excel files. Basically I get blank "" characters in the data frame. I usually just fix it by recoding them to NA and then converting the columns back with as.numeric().

df[df==""] <- NA                                               ## Recodes all "" as NA
df$Avg.Position <- as.numeric(as.character(df$Avg.Position))   ## factor -> character -> numeric
df$some.other.var <- as.numeric(as.character(df$some.other.var))

If you have other stray strings in Avg.Position, you'll need to search and destroy those too. Don't convert with as.numeric() until you are CERTAIN that everything strange is gone; weird things can happen to your data if you don't.

Alternatively you could do this right at the beginning:

datadump <- read.delim2("truncate.txt", na.strings=c("NA",""), header=TRUE, skip=6)

na.strings=c("NA","") tells read.table to treat both "NA" and "" as NA; you can use this to convert other "junk" values to NA as well.

You can also use nrows=SOME_NUMBER if you know how many data rows there are before the junk line at the very end of the file.

You might want to get rid of the $ signs too, as they are causing your Avg.CPC/CPM/Cost columns to be converted to factors, and that costs time and memory as well. There might be a way to do this at the source. (It looks like a download from web analytics software, but I can't tell which; it's been a long time since I've done web analytics.)
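If you can't fix it at the source, one way to strip those symbols after import could look like the sketch below; the helper name is made up, and the column names are taken from the dput() output above.

strip_to_numeric <- function(x) as.numeric(gsub("[$%,]", "", as.character(x)))  # drop $, % and commas, then convert
cleandata$Avg.CPC <- strip_to_numeric(cleandata$Avg.CPC)
cleandata$Avg.CPM <- strip_to_numeric(cleandata$Avg.CPM)
cleandata$Cost    <- strip_to_numeric(cleandata$Cost)
cleandata$CTR     <- strip_to_numeric(cleandata$CTR)   # "25.00%" becomes 25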

Brandon Bertelsen
this helps! thx.
datayoda
A: 

You use read.delim2, whose default decimal separator is ",", but in your data the decimal separator is ".". Try read.delim instead, and don't forget to provide the na.strings argument as Brandon Bertelsen suggests.

And since it's a 1.5 GB file, you may want to consider the advice in ?read.table about the comment.char parameter:

comment.char: character: a character vector of length one containing a single character or an empty string. Use "" to turn off the interpretation of comments

so use read.delim(some_other_settings, comment.char="").
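Putting the suggestions so far together, a call along these lines might work; the file name and the 6 skipped lines come from the question, the rest is just an illustration.

datadump <- read.delim("truncate.txt", header=TRUE, skip=6,
                       na.strings=c("NA",""), comment.char="")
cleandata <- datadump[-nrow(datadump), ]   # drop the junk line at the bottom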

Marek
read.delim2(file, header = TRUE, sep = "\t", quote = "\"", dec = ",", fill = TRUE, comment.char = "", ...). The default is sep = "\t", so he's using the right separator; read.csv() uses sep = ",".
Brandon Bertelsen
@Brandon I'm not talking about `sep` but about the decimal separator, `dec`.
Marek
+2  A: 

Things apparently get pretty messy for you, partly due to the large size of your data. With the size you report, you really have to do one of these:

  • you rescale your problem so you don't have to load the complete dataset (for instance by dropping columns you don't need; see the colClasses sketch after this list)
  • you use the techniques available in R for huge datasets.
  • you buy a 64-bit system with 12 GB of RAM and set your R memory limit large enough.
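For the first option, one simple lever is the colClasses argument of read.table()/read.delim(): declaring the column types up front means R doesn't have to guess them, and any column set to "NULL" is skipped entirely, which can cut memory use considerably. This is only a sketch; the 13 types are guesses based on the dput() output earlier in the thread, and the junk footer line may still trip a strict type, so combine it with nrows= or the readLines() trick below.

col_types <- c("character", "character", "character", "character", "character",
               "character", "integer", "integer", "character", "character",
               "character", "character", "numeric")   # use "NULL" for columns you don't need
datadump <- read.delim("truncate.txt", header=TRUE, skip=6,
                       na.strings=c("NA",""), comment.char="",
                       colClasses=col_types)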

If you choose one of the latter two options, you might benefit from watching Rosario's presentation at the Los Angeles R Users Group this year. See also the master page here for sample code and such.

That said, for very messy data I use a slightly different solution, namely a combination of readLines() and textConnection(). With the first, I read the data file in as a vector of lines. This allows me to scan all lines for awkward things, often using regular expressions, and to select any subset of lines to read very easily. textConnection() then allows me to use that vector of lines with read.table(), read.delim(), and so on. E.g.:

Lines <- readLines("somefile.txt")
Lines <- Lines[seq(2, 100, by = 2)] # selects every second line

xx <- textConnection(Lines)
Data <- read.table(xx, header = TRUE)
close(xx)
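Applied to the layout described in the question (6 junk lines on top, headers on line 7, one junk line at the bottom), the same pattern might look like the sketch below; note that readLines() still holds the whole 1.5 GB file in memory as a character vector, so it assumes you have the RAM for that.

Lines <- readLines("truncate.txt")
Lines <- Lines[7:(length(Lines) - 1)]   # keep the header row, drop the junk top and footer
con <- textConnection(Lines)
cleandata <- read.delim(con, header=TRUE, na.strings=c("NA",""))
close(con)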

Without having your actual data, it's difficult to guide you through the process. Keep in mind what is said in the other answers; it's all valid.

Joris Meys