views:

75

answers:

3

I have a problem while using a subset of a data.frame in R.

The subset gets created and displayed correctly, but when I try to plot it using qplot(), the rows which were not selected by the subset() also get shown along one axis.

The actual file I am reading in is a web server log, but I have created a small example to illustrate my problem:

This is the ITEMSSOLD.CSV file I read in:

CUST,DT,ITEM,PRICE
BigJoe,10/13/2010,Pickup Truck,20000
TightWad,10/13/2010,USB Drive,12
Jane,10/13/2010,Smart Car,30000
Scrooge,10/13/2010,Gumdrops,1
GeekyMan,10/13/2010,Smart Car,30000

I read this into a data frame as follows:

sales_df <- read.table("C:/R_Expt/ItemsSold.csv", header=TRUE, sep=",")

I then did a subset to get the high-ticket items as follows:

big_sales_df <- subset(sales_df, PRICE>100)

big_sales_df

big_sales_df
      CUST         DT         ITEM PRICE
1   BigJoe 10/13/2010 Pickup Truck 20000
3     Jane 10/13/2010    Smart Car 30000
5 GeekyMan 10/13/2010    Smart Car 30000

So that looks OK.

When I try to plot it via qplot as follows:

qplot(nrow, ITEM, data = ddply(big_sales_df, .(ITEM), "nrow"))

the resulting plot shows all the ITEMS on the Y axis, instead of only Pickup Truck and Smart Car.

The ddply() alone produces the following output:

ddply(big_sales_df, .(ITEM), "nrow")
          ITEM nrow
1 Pickup Truck    1
2    Smart Car    2

Since the example has a small number of ITEMs, the plot is still readable, but in real life, I am trying to plot the names of slow web pages and unfortunately, qplot() tries to put the names of all webpages along the Y axis and it becomes a black blur.

I tried sqldf() also:

qplot(NSOLD, ITEM, data = sqldf('select ITEM, count(*) as NSOLD from big_sales_df group by ITEM order by count(*) desc'))

but this produces the same plot.

What I understood is that subset() is somehow carrying the full parent information inside, instead of just the matched rows.

Is there any way of telling subset()that it should only keep the relevant information?

Or any other way of getting around subset() carrying along the empty members?

I know that a brute force method might be to write the result of the subset() into another CSV file and then read it back in a data.frame, but I am sure there is an easier way.

Many thanks to all your R gurus out there!

Atish

A: 

When you subset your data the factors that existed in the original data set persist. Take the diamonds data set for example. You have 5 different cuts.

unique(diamonds$cut) ## Ideal, Premium, Good, Very Good, Fair

If we subset diamonds, we get:

str(subset(diamonds, cut == "Ideal")) ## Look at structure

In str(), we see that cut maintains the factors that it had originally.

$ cut    : Factor w/ 5 levels "Fair","Good",..: 5 5 5 5 5 5 5 5 5 5 ...

Even though we've removed all the other categories of cut, the factoring persists.

You can remove the extra factors by refactoring the column with it's own unique subsetted factors.

x$cut <- factor(x$cut, labels=unique(x$cut))

Now looking more specifically at your example:

test <- ddply(big_sales_df, .(ITEM), "nrow")
test$ITEM <- factor(test$ITEM, labels=unique(test$ITEM))

Now, try your plot again.

Brandon Bertelsen
Many thanks Brandon, that worked just fine!
Atish
I posted another answer that does the same thing, but in situ in the plot. IN qplot() replace ITEM with factor(ITEM)
Brandon Bertelsen
+1  A: 

You need to remove the factor levels that were dropped from your subset.

big_sales_df$ITEM <- factor(big_sales_df$ITEM)
big_sales_df$CUST <- factor(big_sales_df$CUST)

OR change how you read in the data:

sales_df <- read.csv("ItemsSold.csv", header=TRUE, stringsAsFactors=FALSE)
Joshua Ulrich
Thanks Joshua, your approach worked too!
Atish
+1  A: 

Or you can cheat by factoring item:

qplot(nrow, factor(ITEM), data = ddply(big_sales_df, .(ITEM), "nrow")
Brandon Bertelsen
Thanks again, Brandon.
Atish
Oops, I meant to add that there was a slight typo in your answer - it should be:qplot(nrow, factor(ITEM), data = ddply(big_sales_df, .(ITEM), "nrow"))
Atish
And I like this solution better as it gives me more control, in case I want to plot by several dimensions.
Atish
@Atish: if you like this solution, you should make it the accepted answer.
Joshua Ulrich