bioinformatics

Bioinformatics: job opportunities?

Can anyone with experience in the bioinformatics field comment on what types of programming jobs are available? So far during my co-op terms (similar to paid internships), it's been database joins, queries, and number crunching. Is there more to the field than that? ...

Generating Synthetic DNA Sequence with Substitution Rate

Given these inputs: my $init_seq = "AAAAAAAAAA" #length 10 bp my $sub_rate = 0.003; my $nof_tags = 1000; my @dna = qw( A C G T ); I want to generate: One thousand length-10 tags Substitution rate for each position in a tag is 0.003 Yielding output like: AAAAAAAAAA AATAACAAAA ..... AAGGAAAAGA # 1000th tags Is there a compact ...
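A minimal sketch of one reading of this question, written in Python rather than the poster's Perl. It assumes (the excerpt does not say) that every tag is derived independently from the initial sequence and that a substitution always replaces a base with a different one. The names `mutate`, `generate_tags`, and `seed` are hypothetical.

```python
import random

def mutate(seq, sub_rate, alphabet="ACGT", rng=random):
    """Return a copy of seq in which each position is substituted
    with probability sub_rate (assumption: always by a different base)."""
    out = []
    for base in seq:
        if rng.random() < sub_rate:
            out.append(rng.choice([b for b in alphabet if b != base]))
        else:
            out.append(base)
    return "".join(out)

def generate_tags(init_seq, sub_rate, nof_tags, seed=None):
    """Generate nof_tags tags, each mutated independently from init_seq."""
    rng = random.Random(seed)  # a fixed seed makes runs reproducible
    return [mutate(init_seq, sub_rate, rng=rng) for _ in range(nof_tags)]

tags = generate_tags("AAAAAAAAAA", 0.003, 1000, seed=42)
print("\n".join(tags[:5]))
```

If the intent is instead for mutations to accumulate from one tag to the next, the list comprehension would be replaced by a loop that feeds each tag back into `mutate`.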

How can I talk to UniProt over HTTP in Python?

I'm trying to get some results from UniProt, which is a protein database (details are not important). I'm trying to use a script that translates from one kind of ID to another. I was able to do this manually in the browser, but could not do it in Python. At http://www.uniprot.org/faq/28 there are some sample scripts. I tried the Per...

How do I merge two FASTA files (one file with line break) in Perl?

I have the following two FASTA files: file1.fasta >0 GAATAGATGTTTCAAATGTACCAATTTCTTTCGATT >1 GTTAAGTTATATCAAACTAAATATACATACTATAAA >2 GGGGCTGTGGATAAAGATAATTCCGGGTTCGAATAC file2.qual >0 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 15 40 40 >1 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40...

Querying the DNS service records to find the hostname and TCP/IP

In a paper about the Life Science Identifiers (see LSID Tester, a tool for testing Life Science Identifier resolution services), Dr Roderic DM Page wrote: Given the LSID urn:lsid:ubio.org:namebank:11815, querying the DNS for the SRV record for _lsid._tcp.ubio.org returns animalia.ubio.org:80 as the location of the ubio.org LSID se...

Best OS for bioinformatics?

What is the best choice of operating system for bioinformatics work? Are most of the tools for 64-bit Windows, for Linux/Unix in general, or OS X? ...

How can I find multiple motifs (substrings) in a protein sequence (string)?

The following script is for finding one motif in a protein sequence. use strict; use warnings; my @file_data=(); my $protein_seq=''; my $h= '[VLIM]'; my $s= '[AG]'; my $x= '[ARNDCEQGHILKMFPSTWYV]'; my $regexp = "($h){4}D($x){4}D"; #motif to be searched is hhhhDxxxxD my @locations=(); @file_data= get_file_data("seq.txt"); $protein_se...
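A rough sketch of searching for several motifs at once, in Python rather than the poster's Perl. The residue classes and the hhhhDxxxxD motif come from the excerpt; the second motif (`ssD`), the `MOTIFS` table, and the function name `find_motifs` are hypothetical, added only to show more than one pattern. Each pattern is wrapped in a lookahead so that overlapping occurrences are also reported.

```python
import re

H = "[VLIM]"
S = "[AG]"
X = "[ARNDCEQGHILKMFPSTWYV]"

# Motif name -> regular expression. Only the first motif is from the question.
MOTIFS = {
    "hhhhDxxxxD": f"{H}{{4}}D{X}{{4}}D",
    "ssD": f"{S}{{2}}D",  # hypothetical second motif, for illustration
}

def find_motifs(seq, motifs):
    """Return {motif name: [(1-based start position, matched text), ...]}."""
    hits = {}
    for name, pattern in motifs.items():
        # The lookahead consumes no characters, so overlapping matches are found.
        regex = re.compile(f"(?=({pattern}))")
        hits[name] = [(m.start() + 1, m.group(1)) for m in regex.finditer(seq)]
    return hits

for name, found in find_motifs("AAVLIMDARNDDGG", MOTIFS).items():
    print(name, found)
```

Keeping the motifs in a dictionary means adding another pattern is a one-line change instead of another copy of the search loop.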

Recommended reading for bioinformatics

I'm keen on learning about bioinformatics. I am ideally looking for a short course introduction, with some practical tasks I can get my teeth into immediately to see if there is any interest in it for me. I already have a good understanding of molecular biology, so I should be able to skip most of the foundational work. Any suggestion...

Encouraging good development practices for non-professional programmers?

In my Copious Free Time, I collaborate with a number of scientists (mostly biologists) who develop software, databases, and other tools related to the work they do. Generally these projects are built on a one-off basis, used in-house, and eventually someone decides "oh, this could be useful to other people," so they release a binary or s...

Finding matching keys in two large dictionaries and doing it fast

I am trying to find corresponding keys in two different dictionaries. Each has about 600k entries. Say for example: myRDP = { 'Actinobacter': 'GATCGA...TCA', 'subtilus sp.': 'ATCGATT...ACT' } myNames = { 'Actinobacter': '8924342' } I want to print out the value for Actinobacter (8924342) since it matches a key in myRDP. T...
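One common approach, sketched here under the assumption that both dictionaries fit in memory: dictionary key views support set operations, so the shared keys can be found with a single intersection instead of a nested loop. The function name `matching_values` is hypothetical, and the sequence strings are the elided placeholders from the question.

```python
def matching_values(names, rdp):
    """Return {key: names[key]} for every key present in both dictionaries."""
    common = names.keys() & rdp.keys()  # set intersection on hashed keys
    return {key: names[key] for key in common}

my_rdp = {"Actinobacter": "GATCGA...TCA", "subtilus sp.": "ATCGATT...ACT"}
my_names = {"Actinobacter": "8924342"}

print(matching_values(my_names, my_rdp))  # {'Actinobacter': '8924342'}
```

Because each lookup is a hash probe, the work grows roughly linearly with the number of entries, which should be manageable for about 600k keys per dictionary.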

Performing BLAST/Smith-Waterman searches directly from my application

I'm working on a small application and thinking about integrating BLAST or other local alignment searches into it. My searching has only brought up programs that need to be installed and called externally. Is there a way short of me implementing it from scratch? Any pre-made library perhaps? ...

Perl recursion techniques?

I need a bit of help with this code. I know the sections that should be recursive, or at least I think I do, but am not sure how to implement it. I am trying to implement a path finding program from an alignment matrix that will find multiple routes back to the zero value. For example if you execute my code and insert CGCA as the first ...

cluster short, homogeneous strings (DNA) according to common sub-patterns and extract consensus of classes

Task: to cluster a large pool of short DNA fragments in classes that share common sub-sequence-patterns and find the consensus sequence of each class. Pool: ca. 300 sequence fragments; 8 - 20 letters per fragment; 4 possible letters: a, g, t, c; each fragment is structured in three regions: 5 generic letters, 8 or more positions of g's...

BioPython: Skipping over bad GIDs with Entrez.esummary/Entrez.read

Sorry about the odd title. I am using eSearch & eSummary to go from Accession Number --> gID --> TaxID. Assume that 'accessions' is a list of 20 accession numbers (I do 20 at a time because that's the maximum that NCBI will allow). I do: handle = Entrez.esearch(db="nucleotide", rettype="xml", term=accessions) record = Entrez.read(ha...

Which functional programming languages have bioinformatics libraries?

Which functional programming languages have bioinformatics libraries easily available? (Don't include multi-paradigm languages such as Ruby) Update: Listing which major functional programming languages don't currently have easy access to bioinformatics libraries is also welcome. ...

How can I extract start and end codon from DNA sequences in Perl?

I have code below that tries to identify the positions of the start and end codons of the given DNA sequences. We define the start codon as an ATG sequence and the end codon as a TGA, TAA, or TAG sequence. The problem I have is that the code below works only for the first two sequences (DM208659 and AF038953) but not the rest. What's wrong with my approach be...
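A minimal sketch of the codon definitions given above, in Python rather than the poster's Perl. It assumes (the excerpt does not say) that the stop codon should be searched in the same reading frame as the first ATG, stepping three bases at a time. The function name `find_orf` and the test sequence are hypothetical.

```python
STOP_CODONS = {"TGA", "TAA", "TAG"}

def find_orf(seq):
    """Return (start, stop) as 0-based positions of the first ATG and the
    first in-frame stop codon after it. Returns None if there is no ATG,
    and (start, None) if no in-frame stop codon follows."""
    seq = seq.upper()
    start = seq.find("ATG")
    if start == -1:
        return None
    # Walk codon by codon in the frame set by the start codon.
    for i in range(start + 3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return start, i
    return start, None

print(find_orf("CCATGAAATAGGG"))  # (2, 8)
```

In the example, the TGA that overlaps the start codon at position 3 is ignored because it is out of frame. A search that takes the first stop codon in any frame would report a different end position.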

Output in two rows for multiple columns in Python

I'm working with an output list that contains the following information: [start position, stop position, chromosome, [('sample name', 'sample value'), ('sample name','sample value')...]] [[59000, 59500, chr1, [('cn_04', '1.362352462'), ('cn_01', '1.802001235')]], [100000, 110000, chr1, [('cn_03', '1.88726...

How do I change this to "idiomatic" Perl?

I am beginning to delve deeper into Perl, but am having trouble writing "Perl-ly" code instead of writing C in Perl. How can I change the following code to use more Perl idioms, and how should I go about learning the idioms? Just an explanation of what it is doing: This routine is part of a module that aligns DNA or amino acid sequences...

R statistical package: wrapping GOFrame objects

Dear all, I'm trying to generate GOFrame objects to generate a gene ontology mapping in R for unsupported organisms (see http://www.bioconductor.org/packages/release/bioc/vignettes/GOstats/inst/doc/GOstatsForUnsupportedOrganisms.pdf). However, following the instructions literally doesn't help me. Here's the code I execute (R 2.9.2 on u...

Calculation of DNA sequences

Could you tell me how I can calculate the distance between DNA sequences in Java using the Levenshtein algorithm ...
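For reference, a compact sketch of the Levenshtein (edit) distance using the standard dynamic-programming recurrence. It is written in Python for brevity, not in Java as the question asks, and the same two-row scheme carries over to Java integer arrays. The function name `levenshtein` and the example sequences are hypothetical.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

print(levenshtein("ACGT", "AGT"))  # 1
```

Only two rows of the matrix are kept at any time, so memory use is proportional to the length of the second sequence, not to the product of both lengths.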