Hi all,

I am new to R, so I apologize if this is easy and straightforward. I have successfully read a web page into a character vector. I want to strip this string down to a smaller segment so I can extract some data. So far, so easy.

The problem is that I am new to regex and R, so this has been pretty hard for me. I simply want to shorten the string so that it includes everything between the

<div class="appForm"

and 

</div>

For some reason, I am having a hard time using the stringr package and ?str_match.

Any help, including more efficient solutions, will be very much appreciated. I am a newbie at web scraping, but determined to stay within R.

+5  A: 

Some in the community heavily discourage using regular expressions to parse text that can contain arbitrarily nested expressions, such as HTML. R does have an XML parser (which also handles HTML) that you might consider using for this purpose.
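
A minimal sketch of that approach with the XML package, where `page` stands for the character vector already read in (the class attribute comes from the question; treat the rest as an illustration, not tested against the actual page):

library(XML)

# 'page' stands for the character vector the question already read in;
# collapse it to one string and tell the parser it is raw HTML
html <- paste(page, collapse = "\n")
doc  <- htmlTreeParse(html, useInternalNodes = TRUE, asText = TRUE)

# The first div whose class is exactly "appForm", i.e. everything
# between <div class="appForm" ...> and its matching </div>
appForm <- getNodeSet(doc, "//div[@class='appForm']")[[1]]

getNodeSet returns a list of matching nodes, so [[1]] takes the first match.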

Stephen
Thanks for the help, but I know very little about HTML or XML and am not sure where to begin. How would I go about scraping the data contained in the ul tags inside the named div tag above?
Btibert3
@Btibert3 - that's an easy problem for XPath. You can specify which nodes from the XML parse tree you want to get information from, and then specify the type of information (attributes or text). Take a look at the R functions I use below and the XPath string; there is also a short sketch for your div/ul case right after these comments. Here's where to start: http://www.w3schools.com/xpath/xpath_nodes.asp
Vince
@Vince - I'm getting an error telling me it couldn't find the function `clean`? Thx
Btibert3
`clean` was my function that did some text cleaning, sorry. Look at the example I just added.
Vince
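
For reference, a minimal sketch of what that comment describes, reusing the parsed doc from the sketch above (the structure inside the appForm div is an assumption, not verified against the page):

library(XML)

# Every li inside any ul within the appForm div, returned as plain text
nodes <- getNodeSet(doc, "//div[@class='appForm']//ul/li")
sapply(nodes, xmlValue)
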
+3  A: 

I suggest using the XML package and XPath. This requires some learning, but if you're serious about web scraping, it's the way to go. I did this with some county-level election data from the NY Times website ages ago, and the code looked something like this (just to give you an idea):

library(XML)

getCounty <- function(url) {
    # Parse the page into an internal DOM so XPath queries work
    doc <- htmlTreeParse(url, useInternalNodes = TRUE)

    # Text nodes of every county-name table cell
    nodes <- getNodeSet(doc, "//tr/td[@class='county-name']/text()")
    tmp <- sapply(nodes, xmlValue)

    # clean() is my own text-cleaning helper (see the comments above)
    county <- sapply(tmp, function(x) clean(x, num = FALSE))

    return(county)
}

You can learn about XPath here.

Another example: grab all R package names from the Crantastic timeline. This looks for the div node with id "timeline", then for the ul with class "timeline" inside it, takes the first a node from each li, and returns their text:

library(XML)

url <- 'http://crantastic.org/'
doc <- htmlTreeParse(url, useInternalNodes = TRUE)

# The first a node in each timeline li holds the package name
nodes <- getNodeSet(doc, "//div[@id='timeline']/ul[@class='timeline']/li/a[1]/text()")
tmp <- sapply(nodes, xmlValue)
tmp

 [1] "landis"          "vegan"           "mutossGUI"       "lordif"         
 [5] "futile.paradigm" "lme4"            "tm"              "qpcR"           
 [9] "igraph"          "aspace"          "ade4"            "MCMCglmm"       
[13] "hts"             "emdbook"         "DCGL"            "wq"             
[17] "crantastic"      "Psychometrics"   "crantastic"      "gR"             
[21] "crantastic"      "Distributions"   "rAverage"        "spikeslab"      
[25] "sem"
Vince
@Btibert3, I tweeted this item and someone responded suggesting this: http://twitter.com/cocteau/status/21673265592 This may be useful for you.
Vince
+2  A: 

I second Stephen's and Vince's advice to use htmlTreeParse in the XML package. There are quite a few SO questions about scraping or using HTML content in R that build on this idea. Take a look at:

Scraping html tables into R data frames using the XML package

How can I use R (Rcurl/XML packages ?!) to scrape this webpage ?

How to isolate a single element from a scraped web page in R

How to transform XML data into a data.frame?

Richie Cotton
Those questions are great! I saw a few of them before and already tried modifying some of the examples to fit my particular problem, to no avail, hence my attempt to just use regex. The website is http://collegesearch.collegeboard.com/search/CollegeDetail.jsp?collegeId=885 and I want to grab not only the names in the "More to explore" box, but also the ids found in the links.
Btibert3
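
A hedged sketch of that task with the XML package; the class of the "More to explore" container and the href format holding the ids are guesses about the page, not verified:

library(XML)

url <- "http://collegesearch.collegeboard.com/search/CollegeDetail.jsp?collegeId=885"
doc <- htmlTreeParse(url, useInternalNodes = TRUE)

# The XPath below guesses at the markup of the "More to explore" box;
# inspect the page source and adjust the div selector to match
links <- getNodeSet(doc, "//div[@class='moreToExplore']//a")

txt   <- sapply(links, xmlValue)                          # visible link text
hrefs <- sapply(links, function(x) xmlGetAttr(x, "href")) # link targets

# If the ids sit in a collegeId= parameter (an assumption), pull them out
ids <- sub(".*collegeId=([0-9]+).*", "\\1", hrefs)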