ansaurus

Question

Extract Links from Webpage using R

Answer 1

A:

The iMacros wiki has a nice chapter on web scraping. Mabe that gives you some ideas.

SamMeiers 2010-09-19 15:31:40

Answer 2

+1 A:

The documentation for htmlTreeParse shows one method. Here's another:

> url <- "http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r"
> doc <- htmlParse(url)
> links <- xpathSApply(doc, "//a/@href")
> free(doc)

(You can drop the "href" attribute from the returned links by passing "links" through "as.vector".)

My previous reply:

One approach is to use Hadley Wickham's stringr package, which you can install with install.packages("stringr", dep=TRUE).

> url <- "http://stackoverflow.com/questions/3746256/extract-links-from-webpage-using-r"
> html <- paste(readLines(url), collapse="\n")
> library(stringr)
> matched <- str_match_all(html, "<a href=\"(.*?)\"")

(I guess some people might not approve of using regexp's here.)

matched is a list of matrixes, one per input string in the vector html -- since that has length one here, matched just has one element. The matches for the first capture group are in column 2 of this matrix (and in general, the ith group would appear in column (i + 1)).

> links <- matched[[1]][, 2]
> head(links)
[1] "/users/login?returnurl=%2fquestions%2f3746256%2fextract-links-from-webpage-using-r"
[2] "http://careers.stackoverflow.com"                                                  
[3] "http://meta.stackoverflow.com"                                                     
[4] "/about"                                                                            
[5] "/faq"                                                                              
[6] "/"

David F 2010-09-19 16:44:46

Thanks. When I was thinking of using regex, I definatly was going to use hadley's package. I will give the top a shot, but I definatley think this is what I needed.

Btibert3 2010-09-19 19:07:13

ansaurus

tags:

views:

answers:

Extract Links from Webpage using R

related questions