bioinformatics

Customizing output of BLAST?

I know this is a very specific question relating to BLAST and Bioinformatics but here goes: I am attempting to use standalone BLAST (I already have downloaded it and tested it running on the command line) to perform a DNA sequence alignment (blastn). I need to be able to provide both my own query file (fasta format) and my own database...

Assessing the significance of a BLASTn score?

I am running standalone command line blast to align many query sequences against a large database sequence of nucleotides. I can modify the command line parameters of the blastn program to change various parameters such as the match/mismatch scores. I am wondering - for the 'bit score' that blastn outputs, does it make sense to compare...

multiFASTA file processing

Hi all, I was curious to know if there is any bioinformatics tool out there able to process a multiFASTA file and give me information such as number of sequences, length, nucleotide/amino acid content, etc., and maybe automatically draw descriptive plots. An R Bioconductor solution or a BioPerl module would also do, but I didn't manage to find anyth...

Fast assessment of corrupted Affymetrix CEL files

Hi all, I'm trying to normalize a large number of Affymetrix CEL files using R. However, some of them appear to be truncated, so when reading them I get the error "Cel file xxx does not seem to have the correct dimensions" and the normalization stops. Manually removing the corrupted files and restarting every time will take very long. Do yo...

Code review please: a Java program that displays an XML file with 30000+ terms in a JTree.

I am hunting for a job and one of the companies that I interviewed with asked me to write a little test program so that they could test my programming abilities. I am a biologist by training, and most of my programming knowledge is self-taught. I am also more comfortable writing Python than Java. This is the brief I was g...

What's the best way to divide large files in Python for multiprocessing?

I run across a lot of "embarrassingly parallel" projects I'd like to parallelize with the multiprocessing module. However, they often involve reading in huge files (greater than 2 GB), processing them line by line, running basic calculations, and then writing results. What's the best way to split a file and process it using Python's multi...
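One possible approach to the question above, as a minimal sketch rather than a definitive answer: split the file into byte ranges that end on line boundaries, and let each worker open the file and read only its own range. The names `chunk_offsets`, `count_lines`, and `parallel_line_count` are hypothetical, and counting lines stands in for the real per-line calculation.

```python
import os
from multiprocessing import Pool


def chunk_offsets(path, n_chunks):
    """Return (start, end) byte ranges that each start and end on a line boundary."""
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as fh:
        for i in range(1, n_chunks):
            # Jump near the i-th cut point, then finish the line we landed in.
            fh.seek(max(bounds[-1], size * i // n_chunks))
            fh.readline()
            bounds.append(fh.tell())
    bounds.append(size)
    # Drop empty ranges, which can appear when a line is longer than a chunk.
    return [(s, e) for s, e in zip(bounds, bounds[1:]) if s < e]


def count_lines(task):
    """Placeholder worker: process the lines in one byte range (here, count them)."""
    path, start, end = task
    n = 0
    with open(path, "rb") as fh:
        fh.seek(start)
        while fh.tell() < end:
            fh.readline()  # replace with the real per-line calculation
            n += 1
    return n


def parallel_line_count(path, n_workers=4):
    """Fan the byte ranges out to a process pool and combine the results."""
    tasks = [(path, s, e) for s, e in chunk_offsets(path, n_workers)]
    with Pool(n_workers) as pool:
        return sum(pool.map(count_lines, tasks))
```

On platforms that spawn rather than fork worker processes, `parallel_line_count` should be called from under an `if __name__ == "__main__":` guard. Passing byte offsets instead of the data itself avoids pickling large chunks between processes.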

Does a regular expression exist for enzymatic cleavage?

Does a regular expression exist for (theoretical) tryptic cleavage of protein sequences? The cleavage rule for trypsin is: after R or K, but not before P. Example: cleavage of the sequence VGTKCCTKPESERMPCTEDYLSLILNR should result in these 3 sequences (peptides): VGTK, CCTKPESER, MPCTEDYLSLILNR. Note that there is no cleavage after ...
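The rule as stated (cut after R or K, unless the next residue is P) can be written as a zero-width pattern: a lookbehind for K or R combined with a negative lookahead for P. The sketch below assumes Python 3.7 or later, where `re.split` accepts patterns that match the empty string; the function name `tryptic_peptides` is hypothetical.

```python
import re

# Zero-width cut site: preceded by K or R, and not followed by P.
CUT_SITE = re.compile(r"(?<=[KR])(?!P)")


def tryptic_peptides(sequence):
    """Split a protein sequence at theoretical tryptic cleavage sites."""
    # A K or R at the very end yields a trailing empty string, so filter it out.
    return [peptide for peptide in CUT_SITE.split(sequence) if peptide]


print(tryptic_peptides("VGTKCCTKPESERMPCTEDYLSLILNR"))
# ['VGTK', 'CCTKPESER', 'MPCTEDYLSLILNR']
```

This covers only the simple rule quoted in the question; it does not model missed cleavages or other exceptions.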

Are there any existing solutions for creating a generic DNA sequence database with a website front end?

I'd like to create an rRNA sequence database with a web front end for the lab I work in. It seems common in biology to want to search a large number of sequences using alignment algorithms such as BLAST and HMMER, so I wondered if there is any existing php/python/rails projects that allow easy creation of a generic sequence database with...

How to extract the first hit elements from an XML NCBI BLAST file?

Hello all, I'm trying to extract only the first hit from an NCBI XML BLAST file. Next, I would like to get only the first HSP. At the final stage I would like to get these based on best score. To make things clear, here is a sample of the XML file: <?xml version="1.0"?> <!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.n...
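A minimal sketch of one way to do this with the standard library's `xml.etree.ElementTree`, assuming the usual NCBI BlastOutput element names (`Hit`, `Hit_id`, `Hit_hsps`, `Hsp`, `Hsp_bit-score`) and assuming that "best score" means the highest bit score. The function name `best_hsp_of_first_hit` is hypothetical.

```python
import xml.etree.ElementTree as ET


def best_hsp_of_first_hit(xml_text):
    """Return (hit id, bit score) for the best-scoring HSP of the first hit, or None."""
    root = ET.fromstring(xml_text)
    hit = root.find(".//Hit")  # first <Hit> in document order
    if hit is None:
        return None
    hsps = hit.findall("./Hit_hsps/Hsp")
    if not hsps:
        return None
    # Assumption: rank HSPs by bit score; swap in Hsp_evalue or Hsp_score if preferred.
    best = max(hsps, key=lambda hsp: float(hsp.findtext("Hsp_bit-score")))
    return hit.findtext("Hit_id"), float(best.findtext("Hsp_bit-score"))
```

For a report on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`; for very large reports an incremental parser such as `ET.iterparse` may be a better fit.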

Python, Huge Iteration Performance Problem

Hi, I'm iterating through 3 words, each about 5 million characters long, and I want to find sequences of 20 characters that identify each word. That is, I want to find all sequences of length 20 in one word that are unique to that word. My problem is that the code I've written takes an extremely long time to run. I've never ev...
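A common way to avoid nested loops for this kind of problem is to collect every length-k substring of each word in a set and then use set difference. The sketch below assumes "unique to that word" means "does not occur in any of the other words"; the function names are hypothetical, and for words of about 5 million characters the sets will need a substantial amount of memory.

```python
def kmers(word, k):
    """Return the set of all length-k substrings of word."""
    return {word[i:i + k] for i in range(len(word) - k + 1)}


def unique_kmers(words, k=20):
    """For each word, return the k-mers that appear in no other word."""
    sets = [kmers(word, k) for word in words]
    result = []
    for i, own in enumerate(sets):
        # Union of the k-mers of every other word.
        others = set().union(*(s for j, s in enumerate(sets) if j != i))
        result.append(own - others)
    return result


# Small illustration with k=2 instead of 20.
print(unique_kmers(["ABCD", "BCDE", "XYZA"], k=2)[0])
# {'AB'}
```

Building the sets is a single linear pass over each word, and membership tests on sets are constant time on average.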

bioinformatics resources

Hi there guys, When it comes to programming, we certainly have some blogs to follow, but when you're thinking of trying a different field, how can you find the big names? I want to try the bioinformatics field and add some blog reading from this domain to my daily schedule. Can you recommend some blogs? ...

Draw a colored sphere from Cartesian coordinates in PyMOL

Hi, I was looking in the wiki for how to convert the following information about beads (Cartesian coordinates + energy): 23.4 54.6 12.3 -123.5 54.5 23.1 9.45 -56.7 ....... into a drawing in PyMOL that contains, for each atom, a sphere of radius R centered on its coordinates and colored along a rainbow gradient. Thanks ...

How to rank genes using information gain?

How is gene ranking done for microarray data using information gain and chi-square statistics? Please illustrate with a simple example. ...

Packages for multiple alignment of mass spec data

Dear R users, I am searching for a good R package to align multiple spectra. Thanks. ...

Running BLAST (bl2seq) without creating sequence files

I have a script that performs BLAST queries (bl2seq). The script works like this: get sequence a and sequence b; write sequence a to filea; write sequence b to fileb; run the command 'bl2seq -i filea -j fileb -n blastn'; get the output from STDOUT and parse it; repeat 20 million times. The program bl2seq does not support piping. Is there ...

Python script for robust multi-array average on microarray data

I have tried Google with no luck. I have seen some weak references to robust multi-array averaging done with Python, but no code. I am not so interested in reinventing the wheel. Any suggestions for a Python module or script? If I could find a nice explanation or example of the algorithm, I would write a Python implementation to share. ...

Running BLAST through XGrid

Does anyone have any experience running BLAST with XGrid? Googling reveals that a tool called 'Xgrid BLAST' existed, but not where to get it. ...

Bored with CS. What can I study that will make an impact?

So I'm going to an average university, majoring in CS. I haven't learned a damn thing and am in my third year. I've come to be really bored with studying CS. Initially, I was kind of misinformed and thought majoring in CS would make me a good "product creator". I make my money combining programming and business/marketing. But I have a...

Coding Blosum62 in the source code

Hi, I am trying to implement protein pairwise sequence alignment using the Needleman-Wunsch global alignment algorithm. I am not clear about how to include the BLOSUM62 matrix in my source code to do the scoring or to fill the two-dimensional matrix. I have googled and found that most people suggested to use flat file which contain...

Extracting code from photograph of T-shirt via OCR

I recently saw someone with a T-shirt with some Perl code on the back. I took a photograph of it and cropped out the code: Next I tried to extract the code from the image via OCR, so I installed Tesseract OCR and the Python bindings for it, pytesser. Pytesser only works on TIFF images, so I converted the image in Gimp and entered the...