A: 

After the first line in the ids.txt file has been processed, the file seqres.txt has been exhausted. There is something wrong with the nesting of your loops. Also, you're modifying the iterator inside the for line in g loop. Not a good idea.

If you really want to append the line that follows the line whose ID matches, then perhaps something like this might work better:

with open('ids.txt', 'rU') as f:
    ids = f.readlines()
with open('seqres.txt', 'rU') as g:
    seqres = g.readlines()

for id in ids:
    id=id.lower()[0:4]+'_'+id[4]
    with open(id + '.fasta', 'a') as h:
    for line in seqres:
        if line.startswith('>'+ id):
            h.write(seqres.next())
Tim Pietzcker
You mean the with statement is exhausted?Can I access the next line a different way?
reve_etrange
Surprisingly, given the way iteration is handled in Python for loops (by explicit calls to next()) it shouldn't have adverse effect. It is questionable stylistically.
msw
No, Dave observed the same thing. In the first iteration `for` loop, the entire file `g` is processed completely, so every subsequent `for line in g` becomes a NOOP.
Tim Pietzcker
Got it myself just before your update.Thanks for the help...and for reminding me I don't need all that nesting now.
reve_etrange
+1  A: 

The problem is that you are only looping through file g once - after you have read through it the first time the file index position is left at the end of the file, so any further reads will fail with EOF. You would need to reopen g every time round the loop.

However this will be massively inefficient - you are reading the same file repeatedly, once for every line in f. It will be orders of magnitude faster to read all of g into an array at the start and use that, so long as it will fit in memory.

Dave Kirby
Better might be to load the ids from ids.txt into a set, and then walk through seqres.txt, writing out the lines that match the set.
dangph
Good advice. I hadn't realized about the iterator mechanics, thought it would start over again.
reve_etrange
+2  A: 

Here's a mostly flattened implementation. Depending on how many hits you're going to get for each ID, and how many entries there are in 'seqres' you could redesign it.

# Extract the IDs in the desired format and cache them
ids = [ x.lower()[0:4]+'_'+x[4] for x in open('ids.txt','rU')]
ids = set(ids)

# Create iterator for seqres.txt file and pull the first value
iseqres = iter(open('seqres.txt','rU'))
lineA = iseqres.next()

# iterate through the rest of seqres, staggering
for lineB in iseqres:
  lineID = lineA[1:7]
  if lineID in ids:
    with open("%s.fasta" % lineID, 'a') as h:
      h.write(lineB)
  lineA = lineB
MattH
Just one hit per ID.Would using sets be faster than the double loop I found and as suggested by Tim Pietzcker?
reve_etrange
My solution iterates ids.txt once and seqres once. Only the computed ids are held in memory. Tim's solution reads ids.txt, reads seqres. Then loops through ids.txt looping through seqres with every item. My implementation is much faster and uses less memory.
MattH
I guess the iterator.next() method doesn't increment the iterator like the file object's next method? Given that is there still a reason not to use it?I'm curious because it seems like the situation of needing the next value could be common in some contexts.
reve_etrange
Actually, it does. It's just that when you iterate over a file object, it is reading the content's of the file and file-position is progressing through it. If you wanted to iterate through the same file object again, you'd need to `.seek(0)`.
MattH
+1  A: 

For speed, you really want to avoid looping over the same file multiple times. This means you've turned into an O(N*M) algorithm, when you could be a using an O(N+M) one.

To achieve this, read your list of ids into a fast lookup structure, like a set. Since there are only 4600 this in-memory form shouldn't be any problem.

The new solution is also reading the list into memory. Probably not a huge problem with just a few hundred thousand lines, but its wasting more memory than you need, since you can do the whole thing in a single pass, only reading the smaller ids.txt file into memory. You can just set a flag when the previous line was something interesting, which will signal the next line to write it out.

Here's a reworked version:

with open('ids.txt', 'rU') as f:
    interesting_ids = set('>' + line.lower()[0:4] + "_" + line[4] for line in f)  # Get all ids in a set.

found_id = None
with open('seqres.txt', 'rU') as g:
    for line in g:
        if found_id is not None:
            with open(found_id+'.fasta','w') as h:
                h.write(line)

        id = line[:7]
        if id in interesting_ids: found_id = id
        else: found_id = None
Brian
Your `interesting_ids` is a set of strings that are 7 characters long and your `found_id` is going to look for a 6 character string in the set.
MattH
@MattH: Oops, you're right. Fixed.
Brian
+1  A: 

I think there is still progress to be made from the code you declare as final. You can make the result a little less nested and avoid a couple sort of silly things.

from contextlib import nested
from itertools import tee, izip

# Stole pairwise recipe from the itertools documentation
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

with nested(open('ids.txt', 'rU'), open('seqres.txt', 'rU')) as (f, g):
    for id in f:
        id = id.lower()[0:4] + '_' + id[4]
        with open(id + '.fasta', 'w') as h:
            g.seek(0) # start at the beginning of g each time
            for line, next_line in pairwise(g):
                if line.startswith('>' + id):
                    h.write(next_line)

This is an improvement over the final code you posted in that

  • It does not unnecessarily read the whole files into memory, but simple iterates over the file objects. (This may or may not be the best option for g, really. It definitely scales better.)
  • It does not contain the crash condition using gl[line+1] if we are already on the last line of gl
    • Depending on how g actually looks, there might be something more applicable than pairwise.
  • It is not as deeply nested.
  • It conforms to PEP8 for things like spaces around operators and indentation depth.
  • This algorithm is O(n * m), where n and m are the number of lines in f and g. If the length of f is unbounded, you can use a set of its ids to reduce the algorithm to O(n) (linear in the number of lines in g).
Mike Graham