views:

500

answers:

3

Hi all,

I have a (somewhat complex) web scraping challenge that I wish to accomplish and would love for some direction (to whatever level you feel like sharing) here goes:

I would like to go through all the "species pages" present in this link:

http://gtrnadb.ucsc.edu/

So for each of them I will go to:

  1. The species page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/)
  2. And then to the "Secondary Structures" page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/Aero_pern-structs.html)

Inside that link I wish to scrap the data in the page so that I will have a long list containing this data (for example):

chr.trna3 (1-77)    Length: 77 bp
Type: Ala   Anticodon: CGC at 35-37 (35-37) Score: 93.45
Seq: GGGCCGGTAGCTCAGCCtGGAAGAGCGCCGCCCTCGCACGGCGGAGGcCCCGGGTTCAAATCCCGGCCGGTCCACCA
Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<....

Where each line will have it's own list (inside the list for each "trna" inside the list for each animal)

I remember coming across the packages Rcurl and XML (in R) that can allow for such a task. But I don't know how to use them. So what I would love to have is: 1. Some suggestion on how to build such a code. 2. And recommendation for how to learn the knowledge needed for performing such a task.

Thanks for any help,

Tal

+2  A: 

Interesting problem and agree that R is cool, but somehow i find R to be a bit cumbersome in this respect. I seem to prefer to get the data in intermediate plain text form first in order to be able to verify that the data is correct in every step... If the data is ready in its final form or for uploading your data somewhere RCurl is very useful.

Simplest in my opinion would be to (on linux/unix/mac/or in cygwin) just mirror the entire http://gtrnadb.ucsc.edu/ site (using wget) and take the files named /-structs.html, sed or awk the data you would like and format it for reading into R.

I'm sure there would be lots of other ways also.

aaspnas
Thanks aaspnas. I agree with having an intermediate step for data cleaning. Admit though that knowing how to do it with R will be cool :)
Tal Galili
+8  A: 

Tal,

You could use R and the XML package to do this, but (damn) that is some poorly formed HTML you are trying to parse. In fact, in most cases your would want to be using the readHTMLTable() function, which is covered in this previous thread.

Given this ugly HTML, however, we will have to use the RCurl package to pull the raw HTML and create some custom functions to parse it. This problem has two components:

  1. Get all of the genome URLS from the base webpage (http://gtrnadb.ucsc.edu/) using the getURLContent() function in the RCurlpackage and some regex magic :-)
  2. Then take that list of URLS and scrape the data you are looking for, and then stick it into a data.frame.

So, here goes...

library(RCurl)

### 1) First task is to get all of the web links we will need ##
base_url<-"http://gtrnadb.ucsc.edu/"
base_html<-getURLContent(base_url)[[1]]
links<-strsplit(base_html,"a href=")[[1]]

get_data_url<-function(s) {
    u_split1<-strsplit(s,"/")[[1]][1]
    u_split2<-strsplit(u_split1,'\\"')[[1]][2]
    ifelse(grep("[[:upper:]]",u_split2)==1 & length(strsplit(u_split2,"#")[[1]])<2,return(u_split2),return(NA))
}

# Extract only those element that are relevant
genomes<-unlist(lapply(links,get_data_url))
genomes<-genomes[which(is.na(genomes)==FALSE)]

### 2) Now, scrape the genome data from all of those URLS ###

# This requires two complementary functions that are designed specifically
# for the UCSC website. The first parses the data from a -structs.html page
# and the second collects that data in to a multi-dimensional list
parse_genomes<-function(g) {
    g_split1<-strsplit(g,"\n")[[1]]
    g_split1<-g_split1[2:5]
    # Pull all of the data and stick it in a list
    g_split2<-strsplit(g_split1[1],"\t")[[1]]
    ID<-g_split2[1]                             # Sequence ID
    LEN<-strsplit(g_split2[2],": ")[[1]][2]     # Length
    g_split3<-strsplit(g_split1[2],"\t")[[1]]
    TYPE<-strsplit(g_split3[1],": ")[[1]][2]    # Type
    AC<-strsplit(g_split3[2],": ")[[1]][2]      # Anticodon
    SEQ<-strsplit(g_split1[3],": ")[[1]][2]     # ID
    STR<-strsplit(g_split1[4],": ")[[1]][2]     # String
    return(c(ID,LEN,TYPE,AC,SEQ,STR))
}

# This will be a high dimensional list with all of the data, you can then manipulate as you like
get_structs<-function(u) {
    struct_url<-paste(base_url,u,"/",u,"-structs.html",sep="")
    raw_data<-getURLContent(struct_url)
    s_split1<-strsplit(raw_data,"<PRE>")[[1]]
    all_data<-s_split1[seq(3,length(s_split1))]
    data_list<-lapply(all_data,parse_genomes)
    for (d in 1:length(data_list)) {data_list[[d]]<-append(data_list[[d]],u)}
    return(data_list)
}

# Collect data, manipulate, and create data frame (with slight cleaning)
genomes_list<-lapply(genomes[1:2],get_structs) # Limit to the first two genomes (Bdist & Spurp), a full scrape will take a LONG time
genomes_rows<-unlist(genomes_list,recursive=FALSE) # The recursive=FALSE saves a lot of work, now we can just do a straigh forward manipulation
genome_data<-t(sapply(genomes_rows,rbind))
colnames(genome_data)<-c("ID","LEN","TYPE","AC","SEQ","STR","NAME")
genome_data<-as.data.frame(genome_data)
genome_data<-subset(genome_data,ID!="</PRE>")   # Some malformed web pages produce bad rows, but we can remove them

head(genome_data)

The resulting data frame contains seven columns related to each genome entry: ID, length, type, sequence, string, and name. The name column contains the base genome, which was my best guess for data organization. Here it what it looks like:

head(genome_data)
                                   ID   LEN TYPE                           AC                                                                       SEQ
1     Scaffold17302.trna1 (1426-1498) 73 bp  Ala     AGC at 34-36 (1459-1461) AGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTTTCCA
2   Scaffold20851.trna5 (43038-43110) 73 bp  Ala   AGC at 34-36 (43071-43073) AGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTCTCCA
3   Scaffold20851.trna8 (45975-46047) 73 bp  Ala   AGC at 34-36 (46008-46010) TGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTCTCCA
4     Scaffold17302.trna2 (2514-2586) 73 bp  Ala     AGC at 34-36 (2547-2549) GGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACAGGGATCGATGCCCGGGTTCTCCA
5 Scaffold51754.trna5 (253637-253565) 73 bp  Ala AGC at 34-36 (253604-253602) CGGGGGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTCCTCCA
6     Scaffold17302.trna4 (6027-6099) 73 bp  Ala     AGC at 34-36 (6060-6062) GGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGAGTTCTCCA
                                                                        STR  NAME
1 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
2 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
3 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
4 >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>.>>>.......<<<.<<<<<<<<. Spurp
5 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
6 >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<......>>>>.......<<<<.<<<<<<<. Spurp

I hope this helps, and thanks for the fun little Sunday afternoon R challenge!

DrewConway
Drew - Wow, Just wow!This has been the most impressive answer I have ever gotten from anyone regarding R!If I could I would mark it with 2 V's instead of one. Thank you very very much. I can't say I understand your code, but I'll go to it in the next few days/weeks.p.s: Something breaks sometimes, see - genome_data[1709:1711,]And again - thank you so much!!Tal
Tal Galili
I found "why" it breaks in row 1710 - because that line (see link: http://gtrnadb.ucsc.edu/Bdist/Bdist-structs.html) has an extra line "Possible intron: 37-48 (139611-139622)". Drew, if you can see how to solve this in a snap - please let me know. Either way (again) thanks a lot ! Tal
Tal Galili
Update: I found where the code broke, I'll fix it - thanks again!
Tal Galili
Since I can't change your massage - the code should be:if( length(g_split1) > 21 ){ g_split1<-g_split1[c(2,3,5,6)] # in case we have the extra row of "Possible intron:", then we will just take it off} else { g_split1<-g_split1[2:5] # if we don't have that row - continue as usual}Instead of: g_split1<-g_split1[2:5] # if we don't have that row - for the function parse_genomes
Tal Galili
Ahh, thanks Tal. I had not done a thorough enough search of the possible entries. Glad you were able to get it working!
DrewConway
+1  A: 

Just tried it using Mozenda (http://www.mozenda.com). After roughly 10 minutes and I had an agent that could scrape the data as you describe. You may be able to get all of this data just using their free trial. Coding is fun, if you have time, but it looks like you may already have a solution coded for you. Nice job Drew.

Chris
Hi Chris. the code does work, but alternative solutions are always welcome :) . Cheers, Tal
Tal Galili