views: 100
answers: 2

Hello all,

I wish to scrape the home page of one of the new Stack Exchange websites: http://webapps.stackexchange.com/ (just once, and for only a few pages; nothing that should bother the servers). If I wanted this for Stack Overflow, I know there is a database dump, but for the new Stack Exchange sites no dumps exist yet.

Here is what I want to do.

Step 1: choose URL

URL <- "http://webapps.stackexchange.com/"

Step 2: read the table

readHTMLTable(URL)  # oops, doesn't work - gives NULL

Step 3: this time, let's try it with the XML package

htmlTreeParse(URL) # OK, this reads the data - but it is all in <div> tags - now what?

So I was able to read the page, but the structure is all in divs. How can it be used to produce the same kind of result as readHTMLTable?
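
A minimal sketch of one way forward, assuming the XML package and guessing at the div class names (inspect the actual page source to find the real ones):

```r
# Sketch: parse the page and pull text out of <div> nodes with XPath.
# The class name "summary" is an assumption; check the real page source
# for the actual class names used on the site.
library(XML)

URL <- "http://webapps.stackexchange.com/"
doc <- htmlParse(URL)  # returns a DOM that can be queried with XPath

# Grab the text of every div with class "summary" (hypothetical class)
summaries <- xpathSApply(doc, "//div[@class='summary']", xmlValue)

# Assemble a data.frame, the analogue of readHTMLTable()'s output
questions <- data.frame(summary = summaries, stringsAsFactors = FALSE)
head(questions)
```

xpathSApply returns a plain character vector here, so you can build up a data.frame column by column, one XPath query per field you want.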

A: 

What are you writing this in? I wrote an application that parses data out of a web scrape (link). I would be more than happy to share the logic.

Josh K
[R](http://www.r-project.org/)
Robert Harvey
+8  A: 

You can do this with the overflowr package (which wraps the StackExchange API). Just use the get.questions() function and supply the site prefix. It's not on CRAN since it isn't complete, but you can download and build it.

library(overflowr)
questions <- get.questions(50)  # fetch 50 questions

For the statistics site, the top 5 most recent questions:

questions <- get.questions(top.n=5, site="stats.stackexchange")
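
Since overflowr just wraps the Stack Exchange API, you can also hit the API directly if you only need raw question data. A sketch using the current (v2.x) API and the jsonlite package; note that overflowr was written against an earlier API version, so this is an alternative path, not what the package does internally:

```r
# Sketch: query the Stack Exchange API directly, bypassing overflowr.
# The "webapps" site parameter matches the site from the question;
# jsonlite transparently handles the gzipped JSON the API returns.
library(jsonlite)

api <- "https://api.stackexchange.com/2.3/questions?order=desc&sort=creation&site=webapps"
resp <- fromJSON(api)

# resp$items is a data.frame with one row per question
head(resp$items[, c("title", "score", "view_count")])
```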

Incidentally, I'm happy to include more people on this project, since I don't have any more time to spend on it. Three of the moderators from Stats.Exchange are currently working on it.

Shane
This looks great, Shane! Any chance there is a download link for an already-built Windows version?
Tal Galili
Nope, sorry. You will have to check it out from svn and build it. I don't see much point in providing a downloadable version until there's more to it: the core infrastructure is there, but you can't do basic things yet (like pull answers).
Shane
OK, thank you Shane - I'll proceed with that...
Tal Galili
Great, and if you make any enhancements, please feel free to submit them back.
Shane