I actually have a solution for this problem, but I am curious if there is a better way to do what I was trying to do.

I scraped some data from majorleaguesoccer.com and read it into R using

mls.reg.tmp <- read.table("../data/mls_reg_season_20100812.csv",
                          header = FALSE, sep = ";")

Note that I used sep = ";" because some of the attendance figures on the website contain commas as thousands separators, and I scraped them "as is", e.g.,

> str(mls.reg.dat$a_tot)
 Factor w/ 164 levels " 166,060"," 171,282",..: 132 45 159 153 46 160 158 148 150 98 ...

In hindsight, I should've just removed the commas in Python during the pre-processing step of this project. I should also point out that there were some text fields in the data set as well.

> str(mls.reg.dat$team)
 Factor w/ 20 levels "Chicago Fire",..: 4 9 19 11 3 10 13 16 5 6 ...

Given that I want to use the attendance data as a numeric value, I converted using as.numeric and gsub. As an example in a call to ggplot:

ggplot(data = mls.reg.dat, aes(x = as.numeric(gsub(",", "", 
  mls.reg.dat$a_tot)), y = sog)) + geom_point() + 
  facet_wrap(~ team)

Question: Is this the most efficient way of working with data such as this? Or is there a specialized function for doing something along these lines?

I'm posting the question here because I spent quite a bit of time (> 30 min) just working out this simple solution and thought that others might benefit from it as well.

+1  A: 

I am not aware of any specialised function, but you could do it directly when you read the data.

  data <- read.table(...)
  data$someColumn <- as.numeric(gsub(",", "", data$someColumn))

Any subsequent call can then use data$someColumn without further conversion, which also makes the code easier to read.
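
To make the idea concrete, here is a minimal, self-contained sketch. The inline sample and the sog values are made up for illustration (only the two attendance strings and the team name come from the question), and read.table is given text = instead of a file so the snippet runs on its own.

```r
# Hypothetical sample mimicking the scraped file: semicolon-separated,
# with commas as thousands separators in the attendance column.
raw <- "team;a_tot;sog
Chicago Fire; 166,060;120
Chicago Fire; 171,282;135"

# stringsAsFactors = TRUE reproduces the factor column shown in the question
dat <- read.table(text = raw, header = TRUE, sep = ";",
                  stringsAsFactors = TRUE)

# Strip the commas and convert once, right after reading.
# gsub() coerces the factor to character first; calling as.numeric()
# directly on the factor would return the level codes instead.
dat$a_tot <- as.numeric(gsub(",", "", dat$a_tot))

str(dat$a_tot)
```

After this step the column can be used as-is, e.g. aes(x = a_tot, y = sog) in the ggplot call.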

EDIT: seems to be duplicate of "How can I declare a thousand separator in read.csv?"
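
I can't say for sure which approach that question settles on, but one way to handle the separator at read time is to register a conversion with setAs from the methods package and name it in colClasses. A sketch under that assumption; the class name num.with.commas and the sample data are only illustrative:

```r
library(methods)

# Declare a placeholder class and tell R how to convert character to it.
setClass("num.with.commas")
setAs("character", "num.with.commas",
      function(from) as.numeric(gsub(",", "", from)))

# Hypothetical sample in the same layout as the scraped file.
raw <- "team;a_tot
Chicago Fire; 166,060
Chicago Fire; 171,282"

# read.table applies the registered conversion to the second column.
dat2 <- read.table(text = raw, header = TRUE, sep = ";",
                   colClasses = c("character", "num.with.commas"))

str(dat2$a_tot)
```

This keeps the clean-up inside the read step, at the cost of a bit of one-time setup.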

nico
Thanks; I didn't see that post.
rtelmore
A: 

I'm just getting into R. I'm used to using Minitab with Excel as the main input, so I could easily configure the data in Excel before any detailed statistical analysis. I understood that the beauty of R was that you could write scripts that take configured data and perform some serious analysis. Do others, like rtelmore, use other languages (e.g. Python) to refine data into the correct format?

stob