I have a problem while using a subset of a data.frame
in R.
The subset gets created and displayed correctly, but when I try to plot it using qplot()
, the rows which were not selected by the subset()
also get shown along one axis.
The actual file I am reading in is a web server log, but I have created a small example to illustrate my problem:
This is the ITEMSSOLD.CSV
file I read in:
CUST,DT,ITEM,PRICE
BigJoe,10/13/2010,Pickup Truck,20000
TightWad,10/13/2010,USB Drive,12
Jane,10/13/2010,Smart Car,30000
Scrooge,10/13/2010,Gumdrops,1
GeekyMan,10/13/2010,Smart Car,30000
I read this into a data frame as follows:
sales_df <- read.table("C:/R_Expt/ItemsSold.csv", header=TRUE, sep=",")
I then did a subset to get the high-ticket items as follows:
big_sales_df <- subset(sales_df, PRICE>100)
big_sales_df
big_sales_df
CUST DT ITEM PRICE
1 BigJoe 10/13/2010 Pickup Truck 20000
3 Jane 10/13/2010 Smart Car 30000
5 GeekyMan 10/13/2010 Smart Car 30000
So that looks OK.
When I try to plot it via qplot
as follows:
qplot(nrow, ITEM, data = ddply(big_sales_df, .(ITEM), "nrow"))
the resulting plot shows all the ITEMS on the Y axis, instead of only Pickup Truck and Smart Car.
The ddply()
alone produces the following output:
ddply(big_sales_df, .(ITEM), "nrow")
ITEM nrow
1 Pickup Truck 1
2 Smart Car 2
Since the example has a small number of ITEMs, the plot is still readable, but in real life, I am trying to plot the names of slow web pages and unfortunately, qplot()
tries to put the names of all webpages along the Y axis and it becomes a black blur.
I tried sqldf()
also:
qplot(NSOLD, ITEM, data = sqldf('select ITEM, count(*) as NSOLD from big_sales_df group by ITEM order by count(*) desc'))
but this produces the same plot.
What I understood is that subset()
is somehow carrying the full parent information inside, instead of just the matched rows.
Is there any way of telling subset()
that it should only keep the relevant information?
Or any other way of getting around subset()
carrying along the empty members?
I know that a brute force method might be to write the result of the subset()
into another CSV file and then read it back in a data.frame
, but I am sure there is an easier way.
Many thanks to all your R gurus out there!
Atish