views:

323

answers:

5

I have a script that performs BLAST queries (bl2seq)

The script works like this:

  1. Get sequence a, sequence b
  2. write sequence a to filea
  3. write sequence b to fileb
  4. run command 'bl2seq -i filea -j fileb -n blastn'
  5. get output from STDOUT, parse
  6. repeat 20 million times

The program bl2seq does not support piping. Is there any way to do this and avoid writing/reading to the harddrive?

I'm using Python BTW.

+1  A: 

Is this the bl2seq program from BioPerl? If so, it doesn't look like you can do piping to it. You can, however, code your own hack using Bio::Tools::Run::AnalysisFactory::Pise, which is the recommended way of going about it. You'd have to do it in Perl, though.

If this is a different bl2seq, then disregard the message. In any case, you should probably provide some more detail.

Pedro Silva
It's the bl2seq program that comes with NCBI's Blast+. The BioPerl bl2seq is a wrapper of some sort.
Austin
+3  A: 

Depending on what OS you're running on, you may be able to use something like bash's process substitution. I'm not sure how you'd set that up in Python, but you're basically using a named pipe (or named file descriptor). That won't work if bl2seq tries to seek within the files, but it should work if it just reads them sequentially.

cjm
Hey, that worked! You're rad bl2seq -i<(echo sequence1) -j(echo sequence2) -p blastn
Austin
+1  A: 

How do you know bl2seq does not support piping.? By the way, pipes is an OS feature, not the program. If your bl2seq program outputs something, whether to STDOUT or to a file, you should be able to parse the output. Check the help file of bl2seq for options to output to file as well, eg -o option. Then you can parse the file.

Also, since you are using Python, an alternative you can use is BioPython module.

ghostdog74
It's certainly up to the program whether it reads anything from standard input. Just because you can pipe something to STDIN doesn't mean the program will do anything with it.
cjm
who says anything about STDIN. I am asking OP about STDOUT.
ghostdog74
I can parse the output just fine as it's going to STDOUT.@cjn - How would I pipe to two arguments?bl2seq -i <pipe1> -j <pipe2>
Austin
@ghostdog75 you deserve credit for answering this. I didn't really understand what you meant at the time.
Austin
A: 

Wow. I have it figured out.

The answer is to use python's subprocess module and pipes!

EDIT: forgot to mention that I'm using blast2 which does support piping.

(this is part of a class)

def _query(self):
    from subprocess import Popen, PIPE, STDOUT
    pipe = Popen([BLAST,
    '-p', 'blastn',
    '-d', self.database,
    '-m', '8'],
    stdin=PIPE,
    stdout=PIPE)
    pipe.stdin.write('%s\n' % self.sequence)
    print pipe.communicate()[0]

where self.database is a string containing the database filename, ie 'nt.fa' self.sequence is a string containing the query sequence

This prints the output to the screen but you can easily just parse it. No slow disk I/O. No slow XML parsing. I'm going to write a module for this and put it on github.

Also, I haven't gotten this far yet but I think you can do multiple queries so that the blast database does not need to be read and loaded into RAM for each query.

Austin
+1  A: 

I call blast2 using R script:
....
system("mkfifo seq1")
system("mkfifo seq2")
system("echo sequence1 > seq1"), wait=FALSE)
system("echo sequence2 > seq2"), wait=FALSE)
system("blast2 -p blastp -i seq1 -j seq2 -m 8", intern = TRUE)
....
This is 2 times slower(!) vs. writing and reading from hard drive!

standa
Ha! I appreciate your anti-answer.
Austin