I have a DNA file in the following format:

>gi|5524211|gb|AAD44166.1| cytochrome
ACCAGAGCGGCACAGCAGCGACATCAGCACTAGCACTAGCATCAGCATCAGCATCAGC
CTACATCATCACAGCAGCATCAGCATCGACATCAGCATCAGCATCAGCATCGACGACT
ACACCCCCCCCGGTGTGTGTGGGGGGTTAAAAATGATGAGTGATGAGTGAGTTGTGTG
CTACATCATCACAGCAGCATCAGCATCGACATCAGCATCAGCATCAGCATCGACGACT
TTCTATCATCATTCGGCGGGGGGATATATTATAGCGCGCGATTATTGCGCAGTCTACG
TCATCGACTACGATCAGCATCAGCATCAGCATCAGCATCGACTAGCATCAGCTACGAC

How do I read this file and extract the DNA sequence part (ACCAGAGCGG...) without any newlines, for example:

ACCAGAGCGGCACAGCAGCGACATCAGCACTAGCACTAGCATCAGCATCAGCATCAGCCTACATCATCACAGCAGCATCA

Maybe regex isn't needed?

+8  A: 

If there is always exactly one header line:

dnalines = text.split('\n')[1:]
dna = ''.join(dnalines)

Here text is the contents of your file, for example text = open('yourfile').read().
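A slightly more defensive variant of the same idea, as a sketch only: `extract_sequence` is a hypothetical helper name, and it still assumes a single record whose first line is the `>` header. It uses `splitlines()` instead of `split('\n')`, so files saved with Windows line endings should not leave stray `\r` characters in the result.

```python
def extract_sequence(text):
    """Join every line after the first into one string (single-record input assumed)."""
    lines = text.splitlines()  # handles \n and \r\n alike
    # Assumption: lines[0] is the '>' header; everything after it is sequence data.
    return ''.join(line.strip() for line in lines[1:])


sample = ">gi|5524211|gb|AAD44166.1| cytochrome\nACCAGAGCGG\r\nCTACATCATC\n"
print(extract_sequence(sample))  # ACCAGAGCGGCTACATCATC
```

With a real file this would be called as `extract_sequence(open('yourfile').read())`, ideally inside a `with` block so the file gets closed.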

Pierre Bourdon
awesome thanks that is perfect
Joshua
+3  A: 

I did some tests, and it appears that the following is more efficient than delroth's answer (the join-based snippet above):

text.split('\n', 1)[1].replace('\n', '')
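As a quick sanity check, the two snippets should produce the same string for input that uses plain `\n` line endings. The sample text below is made up for illustration; only the header line is borrowed from the question.

```python
# Made-up two-line sample with the header from the question.
text = ">gi|5524211|gb|AAD44166.1| cytochrome\nACCAGAGC\nCTACATCA\n"

# Split into all lines, drop the header, join the rest.
joined = ''.join(text.split('\n')[1:])

# Split off only the header, then delete the remaining newlines.
replaced = text.split('\n', 1)[1].replace('\n', '')

assert joined == replaced
print(replaced)  # ACCAGAGCCTACATCA
```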

Edit: wait, it's not so simple. I timed both methods, twice, using Python 2.6.4 and 3.1.1, on an ~30MB file:

  • Python 2.6.4, my version:

    $ python -m timeit -c "open('x').read().split('\n', 1)[1].replace('\n', '')"
    10 loops, best of 3: 221 msec per loop
    $ python -m timeit -c "open('x').read().split('\n', 1)[1].replace('\n', '')"
    10 loops, best of 3: 219 msec per loop
    
  • Python 2.6.4, delroth's version:

    $ python -m timeit -c "''.join(open('x').read().split('\n')[1:])"
    10 loops, best of 3: 392 msec per loop
    $ python -m timeit -c "''.join(open('x').read().split('\n')[1:])"
    10 loops, best of 3: 390 msec per loop
    
  • Python 3.1.1, my version:

    $ python3 -m timeit -c "open('x').read().split('\n', 1)[1].replace('\n', '')"
    10 loops, best of 3: 803 msec per loop
    $ python3 -m timeit -c "open('x').read().split('\n', 1)[1].replace('\n', '')"
    10 loops, best of 3: 798 msec per loop
    
  • Python 3.1.1, delroth's version:

    $ python3 -m timeit -c "''.join(open('x').read().split('\n')[1:])"
    10 loops, best of 3: 610 msec per loop
    $ python3 -m timeit -c "''.join(open('x').read().split('\n')[1:])"
    10 loops, best of 3: 610 msec per loop
    

Conclusion: in this test Python 3.1.1 is much slower than 2.6.4, and which of the two snippets is faster depends on the Python version!
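To repeat the comparison on another interpreter, something like the following sketch could be used. It calls the `timeit` module from Python rather than the command line and times the two expressions on synthetic in-memory text, so file reading is excluded and the numbers will not match the ones above. The names `make_text`, `via_join`, `via_replace` and `compare` are hypothetical, and the input size is an arbitrary assumption.

```python
import timeit


def make_text(n_lines=1000, width=58):
    """Build synthetic header-plus-sequence text (made-up data, not a real record)."""
    line = "ACGT" * (width // 4) + "ACGT"[: width % 4]
    return ">synthetic header\n" + "\n".join([line] * n_lines) + "\n"


def via_join(text):
    # Split into all lines, drop the header, join the rest.
    return ''.join(text.split('\n')[1:])


def via_replace(text):
    # Split off only the header, then delete the remaining newlines.
    return text.split('\n', 1)[1].replace('\n', '')


def compare(text, number=10):
    """Return total seconds taken by `number` runs of each approach."""
    return {
        "join": timeit.timeit(lambda: via_join(text), number=number),
        "replace": timeit.timeit(lambda: via_replace(text), number=number),
    }


if __name__ == "__main__":
    data = make_text(n_lines=100000)
    for name, seconds in compare(data).items():
        print(name, round(seconds, 4))
```

Which approach comes out ahead may differ between Python versions and machines, as the results above already suggest.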

Stephan202
+1 for using timeit!-)
Alex Martelli