I would like some advice on how to create and visualize a link map between blogs, so as to reflect the "social network" between them.

Here is how I am thinking of doing it:

  1. Start with one (or more) blog home page and collect all the links on that page
  2. Remove all the links that are internal links (that is, if I start from www.website.com, I want to remove all links of the form "www.website.com/*"), but store all the external links.
  3. Go to each of these links (assuming you haven't visited them already), and repeat step 1.
  4. Continue until (let's say) X jumps from the first page.
  5. Plot the data collected.

I imagine that in order to do this in R, one would use RCurl/XML (Thanks Shane for your answer here), combined with something like igraph.

But since I don't have experience with either of them, is there someone here who might be willing to correct me if I have missed any important step, or share a useful snippet of code for this task?

p.s.: My motivation for this question is that in a week I am giving a talk at useR 2010 on "blogging and R", and I thought this might be a nice way both to give the audience something fun and to motivate them to try something like this themselves.

Thanks a lot!

Tal

+4  A: 

Tal,

This type of data collection is referred to as a k-snowball search in network theory, and it should be fairly straightforward in R. As you note, the easiest way to accomplish this would be with the XML package and the htmlTreeParse function. This will parse the HTML from a blog into a tree, which will allow you to perform the link extraction you are interested in more easily.
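
For example, a minimal sketch of that extraction step might look something like this (untested, and the r-bloggers URL is just a stand-in for whatever seed blog you pick):

library(XML)

# parse a page into a tree and pull the href attribute of every anchor
doc   <- htmlTreeParse("http://www.r-bloggers.com/", useInternalNodes = TRUE)
links <- xpathSApply(doc, "//a[@href]", xmlGetAttr, "href")
head(links)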

Also, igraph is perfectly capable of representing these graphs, and it has a useful function, graph.compose, for taking two graphs and returning their edge set composition. You will need this to combine data as you continue to "roll the snowball." The basic steps of the process would be (a rough sketch follows the list):

  1. Find some seed blog
  2. Find all neighbors of the seed, and create its ego-net (star graph) with the seed at the center connected to its neighbors
  3. For all of the seed's neighbors, create their ego-nets, and iteratively compose those graphs with the original seed graph.
  4. Repeat for as many k-degree steps as you are interested in collecting.
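
As a very rough sketch of the graph-building side (the blog names below are made up, and for simplicity I accumulate a single edge list and build the graph once with graph.data.frame, rather than composing ego-nets one by one):

library(igraph)

# each crawl step yields edges of the form seed -> neighbour; collect them
# in one edge list and build the graph at the end
edges <- rbind(
  data.frame(from = "blogA", to = c("blogB", "blogC")),  # ego-net of the seed
  data.frame(from = "blogB", to = c("blogD", "blogA")),  # ego-net of a neighbour
  data.frame(from = "blogC", to = "blogE")
)
g <- graph.data.frame(edges, directed = TRUE)
plot(g)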

I have no code for this in R, but I have generated code that performs a very similar process in Python using Google's SocialGraph API.

Good luck!

DrewConway
Many thanks Drew, I will have a go at it (hopefully I'll get it working in time).
Tal Galili
+7  A: 

NB: This example is a very BASIC way of getting the links and therefore would need to be tweaked in order to be more robust. :)

I don't know how useful this code is, but hopefully it might give you an idea of the direction to go in (just copy and paste it into R; it's a self-contained example once you've installed the packages RCurl and XML):

library(RCurl)
library(XML)

get.links.on.page <- function(u) {
  doc <- getURL(u)
  html <- htmlTreeParse(doc, useInternalNodes = TRUE)
  nodes <- getNodeSet(html, "//html//body//a[@href]")
  urls <- sapply(nodes, function(x) xmlGetAttr(x, "href"))  # extract the href attribute from each anchor
  urls <- sort(urls)
  return(urls)
}

# a naive way of doing it. Python has 'urlparse', which is supposed to be rather good at this
get.root.domain <- function(u) {
  root <- unlist(strsplit(u, "/"))[3]
  return(root)
}

# a naive method to filter out duplicated, invalid and self-referencing urls.
filter.links <- function(seed, urls) {
  urls <- unique(urls)
  urls <- urls[which(substr(urls, start = 1, stop = 1) == "h")]
  urls <- urls[grep("http", urls, fixed = TRUE)]
  seed.root <- get.root.domain(seed)
  urls <- urls[!grepl(seed.root, urls, fixed = TRUE)]  # drop self-referencing urls (safe even if there are none)
  return(urls)
}

# pass each url to this function
main.fn <- function(seed) {
  raw.urls <- get.links.on.page(seed)
  filtered.urls <- filter.links(seed, raw.urls)
  return(filtered.urls)
}

### example  ###
seed <- "http://www.r-bloggers.com/blogs-list/"
urls <- main.fn(seed)

# crawl first 3 links and get urls for each, put in a list 
x <- lapply(as.list(urls[1:3]), main.fn)
names(x) <- urls[1:3]
x

If you copy and paste it into R, and then look at x, I think it'll make sense.
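
If you then wanted to plot what you've collected (step 5 in your list), one rough, untested idea is to flatten x into an edge list and hand it to igraph:

library(igraph)

# flatten the list 'x' (seed url -> vector of outgoing urls) into an edge list
edge.list <- do.call(rbind, lapply(names(x), function(seed) {
  if (length(x[[seed]]) == 0) return(NULL)
  data.frame(from = seed, to = x[[seed]], stringsAsFactors = FALSE)
}))

g <- graph.data.frame(edge.list, directed = TRUE)
plot(g, vertex.size = 3, vertex.label.cex = 0.7)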

Either way, good luck mate!

Tony Breyal