Hi all,

I am new to R, so I apologize if this is easy and straightforward. I have successfully read a web page into a character vector. I want to strip this string down to a smaller segment so I can extract some data. So far, so easy.

The problem is that I am new to regex and R, so this has been pretty hard for me. I simply want to shorten the string so that it includes everything between the

<div class="appForm"

and 

</div>

For some reason, I am having a hard time using the stringr package and ?str_match.

Any help, including more efficient solutions, will be very much appreciated. I am a newbie at web scraping, but determined to stay within R.

+5  A: 

Some in the community heavily discourage using regular expressions to parse text that can contain arbitrarily nested expressions, such as HTML. R does have an XML parser (which also handles HTML) that you might consider using for this purpose.
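
A minimal sketch of that approach with the XML package, where `page` stands for the character vector already read in (the class attribute comes from the question; treat the rest as an illustration, not tested against the actual page):

library(XML)

# 'page' stands for the character vector the question already read in;
# collapse it to one string and tell the parser it is raw HTML
html <- paste(page, collapse = "\n")
doc  <- htmlTreeParse(html, useInternalNodes = TRUE, asText = TRUE)

# The first div whose class is exactly "appForm", i.e. everything
# between <div class="appForm" ...> and its matching </div>
appForm <- getNodeSet(doc, "//div[@class='appForm']")[[1]]

getNodeSet returns a list of matching nodes, so [[1]] takes the first match.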

Stephen
Thanks for the help, but I know very little about HTML or XML and am not sure where to begin. How would I go about scraping the data contained in the ul tags inside the named div tag above?
Btibert3
@Btibert3 - that's an easy problem for XPath. You can specify which nodes from the XML parse tree you want to get information from, and then specify the type of information (attributes or text). Take a look at the R functions I use below and the XPath string; there is also a short sketch for your div/ul case right after these comments. Here's where to start: http://www.w3schools.com/xpath/xpath_nodes.asp
Vince
@Vince - I'm getting an error telling me it couldn't find the function `clean`? Thx
Btibert3
`clean` was my function that did some text cleaning, sorry. Look at the example I just added.
Vince
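
For reference, a minimal sketch of what that comment describes, reusing the parsed doc from the sketch above (the structure inside the appForm div is an assumption, not verified against the page):

library(XML)

# Every li inside any ul within the appForm div, returned as plain text
nodes <- getNodeSet(doc, "//div[@class='appForm']//ul/li")
sapply(nodes, xmlValue)
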
+3  A: 

I suggest using the XML package and XPath. This requires some learning, but if you're serious about web scraping, it's the way to go. I did this with some county-level election data from the NY Times website ages ago, and the code looked something like this (just to give you an idea):

library(XML)

getCounty <- function(url) {
    # Parse the page into an internal DOM so XPath queries work
    doc <- htmlTreeParse(url, useInternalNodes = TRUE)

    # Text nodes of every county-name table cell
    nodes <- getNodeSet(doc, "//tr/td[@class='county-name']/text()")
    tmp <- sapply(nodes, xmlValue)

    # clean() is my own text-cleaning helper (see the comments above)
    county <- sapply(tmp, function(x) clean(x, num = FALSE))

    return(county)
}

You can learn about XPath here.

Another example: grab all R package names from the Crantastic timeline. This looks for the div node with id "timeline", then for the ul with class "timeline" inside it, takes the first a node from each li, and returns their text:

library(XML)

url <- 'http://crantastic.org/'
doc <- htmlTreeParse(url, useInternalNodes = TRUE)

# The first a node in each timeline li holds the package name
nodes <- getNodeSet(doc, "//div[@id='timeline']/ul[@class='timeline']/li/a[1]/text()")
tmp <- sapply(nodes, xmlValue)
tmp

 [1] "landis"          "vegan"           "mutossGUI"       "lordif"         
 [5] "futile.paradigm" "lme4"            "tm"              "qpcR"           
 [9] "igraph"          "aspace"          "ade4"            "MCMCglmm"       
[13] "hts"             "emdbook"         "DCGL"            "wq"             
[17] "crantastic"      "Psychometrics"   "crantastic"      "gR"             
[21] "crantastic"      "Distributions"   "rAverage"        "spikeslab"      
[25] "sem"
Vince
@Btibert3, I tweeted this item and someone responded suggesting this: http://twitter.com/cocteau/status/21673265592 This may be useful for you.
Vince
+2  A: 

I second Stephen's and Vince's advice to use htmlTreeParse in the XML package. There are quite a few SO questions about scraping or using HTML content in R that build on this idea. Take a look at:

Scraping html tables into R data frames using the XML package

How can I use R (Rcurl/XML packages ?!) to scrape this webpage ?

How to isolate a single element from a scraped web page in R

How to transform XML data into a data.frame?

Richie Cotton
Those questions are great! I saw a few of them before and already tried modifying some of the examples to fit my particular problem, to no avail, hence my attempt to just use regex. The website is http://collegesearch.collegeboard.com/search/CollegeDetail.jsp?collegeId=885 and I want to grab not only the names in the "More to explore" box, but also the ids found in the links.
Btibert3
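
A hedged sketch of that task with the XML package; the class of the "More to explore" container and the href format holding the ids are guesses about the page, not verified:

library(XML)

url <- "http://collegesearch.collegeboard.com/search/CollegeDetail.jsp?collegeId=885"
doc <- htmlTreeParse(url, useInternalNodes = TRUE)

# The XPath below guesses at the markup of the "More to explore" box;
# inspect the page source and adjust the div selector to match
links <- getNodeSet(doc, "//div[@class='moreToExplore']//a")

txt   <- sapply(links, xmlValue)                          # visible link text
hrefs <- sapply(links, function(x) xmlGetAttr(x, "href")) # link targets

# If the ids sit in a collegeId= parameter (an assumption), pull them out
ids <- sub(".*collegeId=([0-9]+).*", "\\1", hrefs)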