Hi. I have a file like below.

Sequence A.1.1 Bacteria
ATGCGCGATATAGGCCT
ATTATGCGCGCGCGC

Sequence A.1.2 Virus
ATATATGCGCCGCGCGTA
ATATATATGCGCGCCGGC

Sequence B.1.21 Chimpanzee
ATATAGCGCGCGCGCGAT
ATATATATGCGCG

Sequence C.21.4 Human
ATATATATGCCGCGCG
ATATAATATC

I want to split this single file into separate files for the sequences of categories A, B and C. Kindly suggest some reading material for working out how to code this. Thanks. The output should be three files: one for sequences with 'A', a second for sequences with 'B', and a third for sequences with 'C'.

A: 

It's not 100% clear what you want to do, but something like:

currout = None          # output file for the sequence being read
seqname2file = dict()   # category letter -> open output file

with open('thefilewhosenameyoudonottellus.txt') as infile:
  for line in infile:
    if line.startswith('Sequence '):
      seqname = line[9]  # A or B or C
      if seqname not in seqname2file:
        filename = 'outputfileforsequence_%s.txt' % seqname
        seqname2file[seqname] = open(filename, 'w')
      currout = seqname2file[seqname]
    if currout is not None:  # ignore anything before the first header
      currout.write(line)

for f in seqname2file.values():
  f.close()

should get you pretty close. If you want three separate files (one each for A, B and C) that among them contain all the lines from the input file, it's just about done, except that you'll probably need better filenames (but you don't let us in on the secret of what those might be ;-). Otherwise, some tweaks should get it there.
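
The grouping logic can also be checked in memory before any files are written. This is a minimal sketch, assuming the category is the single character right after 'Sequence '; the function name group_by_category and the shortened sample data are hypothetical, not part of the answer above.

```python
def group_by_category(lines):
    """Group lines under the category letter of the most recent header line."""
    groups = {}
    current = None
    for line in lines:
        # A header looks like 'Sequence A.1.1 Bacteria'; index 9 is the category.
        if line.startswith('Sequence ') and len(line) > 9:
            current = line[9]
            groups.setdefault(current, [])
        # Lines seen before the first header have no category and are dropped.
        if current is not None:
            groups[current].append(line)
    return groups


sample = [
    'Sequence A.1.1 Bacteria\n',
    'ATGCGCGATATAGGCCT\n',
    'Sequence B.1.21 Chimpanzee\n',
    'ATATAGCGCGCGCGCGAT\n',
    'Sequence A.1.2 Virus\n',
    'ATATATGCGCCGCGCGTA\n',
]
groups = group_by_category(sample)
print(sorted(groups))    # ['A', 'B']
print(len(groups['A']))  # 4
```

Each list in the result could then be written out with a single writelines() call per category.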

BTW, it always helps immensely (to help you more effectively rather than stumbling in the dark and guessing) if you also give examples of what output results you want for the input data example you give!-)

Alex Martelli
Yes, thanks. Kindly check the output described above.
That certainly got me close, but I am getting an "index out of range" error. Can I do something like this in Python: Index = line[9], and then if (Index == 'a')? I get an error.
You can do index = line[9] if and only if len(line) is at least 10; apparently you're doing it on some line that's shorter than that.
Alex Martelli
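
The length check mentioned in the comment above might look like the sketch below. The helper name category_of is hypothetical, and it assumes header lines start with 'Sequence ' exactly as in the question.

```python
def category_of(line):
    """Return the character at index 9 of a header line, or None otherwise."""
    # line[9] is only valid when the line holds at least 10 characters.
    if line.startswith('Sequence ') and len(line) >= 10:
        return line[9]
    return None


print(category_of('Sequence A.1.1 Bacteria\n'))  # A
print(category_of('Sequence '))                  # None
print(category_of('ATGC\n'))                     # None
```

The sample data uses upper-case category letters, so a comparison against 'a' would never match; compare against 'A', or lower-case the character first.
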
A: 

I'm not sure exactly what you want the output to be, but it sounds like you need something like:

#!/usr/bin/python

import re

# Open the input file
fhIn = open("input_file.txt", "r")

# Open the output files and store their handles in a dictionary
fhOut = {}
fhOut['A'] = open("sequence_a.txt", "w")
fhOut['B'] = open("sequence_b.txt", "w")
fhOut['C'] = open("sequence_c.txt", "w")

# Create a regexp to find the line naming the sequence
Matcher = re.compile(r'^Sequence (?P<sequence>[A-C])')

# Iterate through each line in the file
CurrentSequence = None
for line in fhIn:
    # If the line is a sequence identifier...
    m = Matcher.match(line)
    if m is not None:
        # Select the appropriate sequence from the regexp match
        CurrentSequence = m.group('sequence')
    # Optional: uncomment the next two lines to skip blank lines
    # if len(line.strip()) == 0:
    #     continue
    # Write the line to the current sequence output file
    # (to leave out the sequence titles, add `continue` at the end of the
    # `if m is not None` block above)
    if CurrentSequence is not None:
        fhOut[CurrentSequence].write(line)

# Close all the file handles
fhIn.close()
fhOut['A'].close()
fhOut['B'].close()
fhOut['C'].close()

Completely untested though...

Al
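
For reference, here is a sketch that combines the two approaches above: a regular expression to spot the header lines, and output files opened on demand and closed automatically through contextlib.ExitStack. The function name split_by_category, the [A-Z] pattern (wider than the A to C in the question) and the sequence_<letter>.txt naming are assumptions for illustration only.

```python
import os
import re
import tempfile
from contextlib import ExitStack

# Assumption: a header is 'Sequence ' followed by one upper-case category letter.
HEADER = re.compile(r'^Sequence (?P<category>[A-Z])')


def split_by_category(in_path, out_dir):
    """Write each record to out_dir/sequence_<category>.txt; return the paths."""
    paths = {}
    with ExitStack() as stack, open(in_path) as infile:
        handles = {}
        current = None
        for line in infile:
            m = HEADER.match(line)
            if m:
                current = m.group('category')
                if current not in handles:
                    name = 'sequence_%s.txt' % current.lower()
                    paths[current] = os.path.join(out_dir, name)
                    # ExitStack closes every registered file on exit.
                    handles[current] = stack.enter_context(open(paths[current], 'w'))
            # Lines before the first header are ignored.
            if current is not None:
                handles[current].write(line)
    return paths


# Usage on a small throwaway input file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'input_file.txt')
    with open(src, 'w') as f:
        f.write('Sequence A.1.1 Bacteria\nATGC\n\nSequence C.21.4 Human\nATAT\n')
    result = split_by_category(src, tmp)
    categories = sorted(result)
    with open(result['A']) as f:
        a_text = f.read()
    with open(result['C']) as f:
        c_text = f.read()

print(categories)
```

Blank lines between records end up in the file of the preceding record, which matches the behaviour of both answers above.
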
Thank you. Most appreciated.