I'd like to create an rRNA sequence database with a web front end for the lab I work in. It seems common in biology to want to search a large number of sequences using alignment algorithms such as BLAST and HMMER, so I wondered whether there are any existing PHP/Python/Rails projects that allow easy creation of a generic sequence database with a website search form?

UPDATE: GMOD is the type of server I was looking for. It was also suggested that I look at BioMart, which appears to have similar functionality.

A: 

It's not one of the languages you mentioned, but there is BioPerl, which is a collection of Perl modules made specifically for 'programming' with DNA, RNA, and protein sequences.

Look for it on CPAN.org

David Brunelle
Does it also allow the creation of web front ends? Perl is not a language I'm familiar with though.
Michael Barton
You can build a web front end with Perl. In fact, I spent 3 years developing a web site in Perl. I believe that PHP is, to some extent, derived from Perl. I've heard the syntax is pretty similar.
David Brunelle
In fact, I might add that BioPerl is so advanced compared to any other bioinformatics API that the time to learn Perl will be much shorter than the time it would take you to build your own library in another language. For HTML rendering in Perl, you can look up HTML::Template on CPAN. It will help you create HTML using programming constructs like loops and ifs. CPAN is a repository of Perl code that anyone can download. You can find almost anything there: HTML rendering, PDF creation, and much more.
David Brunelle
A: 

Having no idea about what format the information will be stored in, or how DNA sequences are displayed (is it just a long string?), you may be able to get away with simply inserting each DNA sequence into a MySQL database and then executing a simple query like:

SELECT * FROM `dna_table` WHERE `sequence` = $sequence;

Make sure you use an escape string or a parameterized query (to prevent SQL injection), but other than that, this sounds like a REALLY simple DB program that shouldn't be more than about 100 lines of code.
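
As a rough illustration of the parameterized-query advice, here is a minimal Python sketch. It uses the standard-library sqlite3 module as a stand-in for MySQL, and the `id` column, the sample rows, and the `find_exact` helper are assumptions made up for this example. Note that it only finds exact matches.

```python
import sqlite3


def find_exact(conn, sequence):
    """Return (id, sequence) rows whose sequence equals the query exactly."""
    # The "?" placeholder makes the driver bind the value, so user input
    # is never pasted into the SQL text.
    cur = conn.execute(
        "SELECT id, sequence FROM dna_table WHERE sequence = ?", (sequence,)
    )
    return cur.fetchall()


# Hypothetical in-memory table, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dna_table (id TEXT PRIMARY KEY, sequence TEXT)")
conn.executemany(
    "INSERT INTO dna_table VALUES (?, ?)",
    [("seq1", "AAAGTCTGA"), ("seq2", "CCCGGGTTT")],
)

print(find_exact(conn, "AAAGTCTGA"))
```

With a MySQL driver the idea is the same, although the placeholder syntax usually differs.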

Crowe T. Robot
The sequences will probably be stored as FASTA and I'd also like to be able to query the database using BLAST via a web form.
Michael Barton
Working with DNA sequences is much more complicated than a normal string search. You have to account for mismatches, SNPs, introns and exons, alternative splicing, etc...
David Brunelle
That's why I said I wasn't sure. I was just trying to give an example of how to query a DB against a search phrase, in case he didn't even know that. But anyways, yeah I figured it might be more in depth than that.
Crowe T. Robot
Heh. Who would have thought that searching DNA would be complicated? :) It's only four amino acids, A, C, G, T, so it's a string of four different characters. Yet, since some positions differ between individuals (otherwise we would all be identical), there are also combination characters that stand for A or C, A or G, A or C or T, ... up to A or C or G or T. I.e., when searching for, say, AAAGTCTGA, you in fact have to search for that plus ALL the variants written with 'combination characters', and usually you want to allow mismatches (1 or 2 at most, or you'll never see the end of your search).
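
To make that concrete, here is a small Python sketch of the idea, assuming the 'combination characters' are the standard IUPAC nucleotide ambiguity codes. The function names and the brute-force sliding window are invented for illustration; this is not how BLAST works, and it would be far too slow for a real database.

```python
# IUPAC nucleotide codes: each letter stands for a set of possible bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def compatible(a, b):
    """True if the two codes share at least one possible base."""
    return bool(set(IUPAC[a]) & set(IUPAC[b]))


def find_matches(query, target, max_mismatches=1):
    """Slide query along target; return (start, mismatches) for each hit."""
    hits = []
    for start in range(len(target) - len(query) + 1):
        window = target[start:start + len(query)]
        mismatches = 0
        for q, t in zip(query, window):
            if not compatible(q, t):
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches, give up on this window
        else:
            # Loop finished without break: the window is an acceptable hit.
            hits.append((start, mismatches))
    return hits


# "R" means A or G, so it is compatible with the A in the query.
print(find_matches("AAAGTCTGA", "TTAARGTCTGATT", max_mismatches=0))
```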
David Brunelle
A,C,G,T are not amino-acids but nucleotides ;-)
Pierre
+2  A: 

This will be overkill probably but.... ncbi has a lot of software available. Link.

In particular, this.

Angelo
The Genome Workbench looks useful, thanks. I don't think it can be used to create a web front end though, can it?
Michael Barton
A: 

I'd strongly suggest contacting the bioinformatics community. The most important thing is to design the database and decide its purpose. You mention DNA in the title but rRNA in the text - these are completely different things. If it's only a typo, fine - but if you don't understand the difference then talk with people in the community.

Since I'm involved in the community you might like to contact the MyExperiment community (http://en.wikipedia.org/wiki/MyExperiment) and mention my name if you need to. You'll find lots of friendly people and help.

UPDATE I've just noticed you are from Manchester and that's the hub of MyExperiment so it really is the obvious place to start!

peter.murray.rust
Hi Peter. What I meant to write is a generic nucleic acid database whether RNA or DNA that can be accessed via a BLAST web form. I had a look at myExperiment and this seems focused on Taverna workflows rather than web front ends.
Michael Barton
Taverna is based on accessing WebServices. I know these are not web front ends but they are probably the right way to go. In any case you will be put in touch with the best ways to proceed
peter.murray.rust
+6  A: 

Hi Mike, something a little less barebones is http://gmod.org/ - the simplest installation should give you a BLAST form & a "sequence browser" interface. Don't know if there's a HMMER form yet...

(scales pretty well - from a simple sqlite to a real database)

Alternatively, you may want to look into the Galaxy server: http://main.g2.bx.psu.edu/
Its primary aim is making complex genomic queries easy for non-computational people, but I don't know if it has BLAST out of the box.

cheers, yannick

Yannick Wurm
Thanks that was exactly the sort of thing I was looking for. Something I can just install and get running on a server.
Michael Barton
A: 

I agree: You should post your question to [email protected] or the bioperl mailing list.

The question "easy creation of a generic sequence database with a website search form" seems too general. A sequence database is a list of (id, sequence) and by itself doesn't need any tool support. At least I don't see any reason why you would need a tool for that.
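
To illustrate how little is needed, here is a minimal Python sketch that reads FASTA text into an id-to-sequence mapping. The `parse_fasta` name and the sample records are made up for this example, and it assumes ids are unique and taken from the first word of each header line.

```python
def parse_fasta(lines):
    """Build a {id: sequence} dict from an iterable of FASTA lines."""
    records = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the id is the first word after ">".
            words = line[1:].split()
            current_id = words[0] if words else ""
            records[current_id] = []
        elif current_id is not None:
            # Sequence line: sequences may be wrapped over several lines.
            records[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


# Hypothetical two-record example.
example = """>seq1 16S rRNA fragment
AAAGTC
TGA
>seq2
CCCGGGTTT
"""
database = parse_fasta(example.splitlines())
print(database)
```

For a real file you would pass an open file handle instead of `example.splitlines()`.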

I think your question is: Is there a BLAST client as a web form that one can install locally? There are some: PLAN might be worth a try, though I never had it running. BioPerl has objects for standalone BLAST execution (http://doc.bioperl.org/releases/bioperl-1.0/Bio/Tools/Run/StandAloneBlast.html) and can display the results graphically. Debian/Ubuntu Med have ncbi-tools-bin and ncbi-rrna-data, which install the necessary tools and databases in a couple of seconds.

Instead of pondering tool support I would rather hack together a 10 line CGI script that executes blast with an input sequence onto the Fasta files that you have and then see if the users aren't already happy with that.
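
A rough sketch of such a script in Python could look like the following. It uses the standard-library wsgiref server rather than classic CGI, and it assumes a BLAST+ installation that provides the `blastn` command with the `-query`, `-db` and `-outfmt` options. The database path, the form markup, and the function names are placeholders invented for this example.

```python
import os
import subprocess
import tempfile
from html import escape
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server

BLAST_DB = "/path/to/rrna_db"  # hypothetical path to a formatted BLAST database

FORM = (
    '<form method="post">'
    '<textarea name="sequence" rows="8" cols="60"></textarea><br>'
    '<input type="submit" value="BLAST"></form>'
)


def build_command(query_path, db=BLAST_DB):
    """Argument list for blastn (BLAST+ assumed); outfmt 6 is tabular."""
    return ["blastn", "-query", query_path, "-db", db, "-outfmt", "6"]


def run_blast(sequence, db=BLAST_DB):
    """Write the query to a temporary FASTA file and run blastn on it."""
    with tempfile.TemporaryDirectory() as tmp:
        query_path = os.path.join(tmp, "query.fa")
        with open(query_path, "w") as handle:
            handle.write(">query\n" + sequence + "\n")
        result = subprocess.run(
            build_command(query_path, db), capture_output=True, text=True
        )
    return result.stdout if result.returncode == 0 else result.stderr


def app(environ, start_response):
    """WSGI app: show the form and, on POST, the BLAST output below it."""
    body = FORM
    if environ["REQUEST_METHOD"] == "POST":
        size = int(environ.get("CONTENT_LENGTH") or 0)
        fields = parse_qs(environ["wsgi.input"].read(size).decode())
        # Drop whitespace and line breaks from the pasted sequence.
        sequence = "".join(fields.get("sequence", [""])[0].split())
        if sequence.isalpha():
            body += "<pre>" + escape(run_blast(sequence)) + "</pre>"
        else:
            body += "<p>Please enter a sequence made of letters only.</p>"
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [body.encode("utf-8")]


def serve(port=8000):
    """Start a local development server; stop it with Ctrl-C."""
    make_server("localhost", port, app).serve_forever()
```

Calling `serve()` and opening http://localhost:8000/ should show the form. The sequence is passed to blastn through a file and an argument list, so nothing the user types reaches a shell.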

Concerning the programming language: If you like, you can do this with a shell script (*). That might even take you less time than posting on stackoverflow... ;-)

(*) Note to paranoid computer science colleagues: it's going to be an internal application for biologists who don't know the difference between an operating system and operator overloading, so SQL injections are very, very unlikely...

I think this is an example where premature optimization is evil enough, in the sense that you can lose tons of time designing a system too complex for a simple task. In the spirit of agile programming, if you like software engineering buzzwords, you might simply hack something together and then try it on your users before thinking about the architecture.

Maximilian Haeussler
A: 

Concerning GMOD: I am fairly sure that GMOD is complete overkill for your application. GMOD is not a server, it's a collection of tools, the database schema (Chado) being one of them, and Chado is not really meant for someone who will mostly have sequences and ids. BioMart is not a server either; it's a tool that permits de-normalization of model databases so that whole-genome queries run fast enough. One of the BioMart clients (MartView) comes as a web interface. You definitely don't want to use BioMart at the moment, but I can explain that in detail by email. I have the impression that you rather need a web-based BLAST client to get started.

Max
A: 

Galaxy: Galaxy is not a database, it's a website with tools to work with (mostly DNA) sequences from various genomes. Galaxy is tightly linked with the UCSC genome browser's sequences, tools and file formats. So if you want to create a database of entirely new sequences, Galaxy is not for you. It doesn't include any BLAST servers either. If you want to create a database of sequences, Chado as part of GMOD comes close, but I'd rather use a text file to get started; see my post above.

Max
+1  A: 

There's a simple CGI front-end distributed with the NCBI BLAST package as well. You can download it from their FTP site, which is here:

ftp://ftp.ncbi.nih.gov/

James Thompson
A: 

Maybe you can look at Plone4Bio.

Plone is an extended content management engine written in python, with a lot of features and easy to use applications, so you can create your website by using a collection of modules like forums, products for news, etc... (I know you know this already but it is just to give a bit of background).

Plone4Bio aims to provide some Plone applications for bioinformatics... I don't know how far along the project is and I haven't used it yet, but it seems that at least you get a sequence object and some apps for visualizing it, and probably some applications to search them. (P.S. they use it at UniProt - look at the 'Third party data' section for any membrane protein)

I don't know of any other CMS apps aimed at bioinformatics, but maybe you can also easily implement something with django without too much effort.

dalloliogm