Hi all. I have a file.

>Sequence 1.1.1  ATGCGCGCGATAAGGCGCTA   
ATATTATAGCGCGCGCGCGGATATATATATATATATATATT  
>Sequence 1.2.2  ATATGCGCGCGCGCGCGGCG   
ACCCCGCGCGCGCGCGGCGCGATATATATATATATATATATT                 
>Sequence 2.1.1  ATTCGCGCGAGTATAGCGGCG

Now I would like to remove the last number from each line that starts with '>'. For example, in the first line I would like to remove '.1' (the rightmost one), and in the second line '.2', and then write the result to a new file. Thanks.

+4  A: 
if line.startswith('>Sequence'):
  line = line[:-2] # trim 2 characters from the end of the string

Or, if there could be more than one digit after the period:

if line.startswith('>Sequence'):
  dot_pos = line.rfind('.') # find position of rightmost period
  line = line[:dot_pos] # truncate up to but not including the dot

Edit: for the case where the sequence is on the same line as >Sequence

If we know that there will always be only 1 digit to remove we can cut out the period and the digit with:

line = line[:13] + line[15:]

This uses a Python feature called slices. Indexes are zero-based and the end of the range is exclusive, so line[0:13] gives us the first 13 characters of line. When starting at the beginning the 0 is optional, so line[:13] does the same thing. Similarly, line[15:] gives us the substring from index 15 to the end of the string.
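To make the index arithmetic concrete, here is a small sketch that applies the slices to the first header line from the question. The variable names header and trimmed are only for illustration.

```python
header = ">Sequence 1.1.1  ATGCGCGCGATAAGGCGCTA"

print(header[:13])    # indexes 0 through 12: '>Sequence 1.1'
print(header[13:15])  # indexes 13 and 14: '.1' (the part being dropped)
print(header[15:])    # index 15 onward: '  ATGCGCGCGATAAGGCGCTA'

# Gluing the two outer slices together skips the period and the digit.
trimmed = header[:13] + header[15:]
print(trimmed)        # '>Sequence 1.1  ATGCGCGCGATAAGGCGCTA'
```

This only works while the identifier always occupies exactly the same columns, as stated in the comments below.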

mikej
What should I do in this case: >Sequence 1.1.1 atgcgcgcgatatatashhshshsh? Now I only have to remove the last ".1", not a single digit on either side of it, just the last number, ".1" or whatever it is. Thanks
Are you saying that atgcgcgcgatatat is on the same line as >Sequence 1.1.1, or is that just the way your comment was formatted? Please explain a bit more what you mean.
mikej
Sure. Yes, you interpreted it right: the atgcgcgatga sequence is on the same line. So in this case I have to remove the last digit from the series of digits. One thing is certain: this last digit will always be at the 15th position on every line starting with '>'.
@Arshan please include all possible formattings in your question
Otto Allmendinger
Here is the detailed question. >Sequence 1.1.1 atatatccchhchcasjssjsjjsjsjsjsj >Sequence 1.2.2 atatatatatatatassdjdjdjfjfjfjjjgjg >Sequence 1.2.1 atatatatatatatatatatatatatatatat Now, I have to remove the last digit from every line that starts with '>'. In the first line I have to remove '.1' (rightmost), and in the second line I have to remove '.2' (rightmost).
Please consider that every line that starts with '>' is a new line.
@Arshan you can use the 'edit' link to update and clarify your question. Then you can use all the formatting which is not available in comments.
mikej
ok editing done. Please check.
Thanks, it helped and it's done :)
+2  A: 

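Map ".".join(line.split('.')[:-1]) to each line of the file that starts with '>'. Note that this drops everything after the last period, so it only suits header lines that end with the number.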
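A rough sketch of the split-and-rejoin idea, adapted to headers that carry the sequence on the same line. The helper names drop_last_component and clean_header are made up for this example, and it assumes the identifier is always the second whitespace-separated field.

```python
def drop_last_component(identifier):
    # '1.1.1' -> ['1', '1', '1'] -> '1.1'
    return ".".join(identifier.split(".")[:-1])

def clean_header(line):
    # Leave sequence-only lines untouched.
    if not line.startswith(">"):
        return line
    # e.g. ['>Sequence', '1.1.1', 'ATGC...']; assumes the id is field 1
    fields = line.split()
    fields[1] = drop_last_component(fields[1])
    # Note: rejoining collapses runs of whitespace and drops the newline.
    return " ".join(fields)

print(clean_header(">Sequence 1.1.1  ATGCGCGCGATAAGGCGCTA"))
```

Joining with "." instead of "" matters here: an empty separator would glue the remaining numbers together without their period.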

Steve B.
+7  A: 
import fileinput
import re

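# With inplace=True, whatever is printed is written back to the file;
# the original is kept with a .bak extension.
for line in fileinput.input(inplace=True, backup='.bak'):
  line = line.rstrip()
  if line.startswith('>'):
    # remove a ".<digit>" found at the very end of the line
    line = re.sub(r'\.\d$', '', line)
  print(line)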

Many details can be changed depending on the processing you want, which you have not clearly communicated, but this is the general idea.
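One such detail, sketched under the assumption that the sequence may follow the identifier on the same line: anchor the pattern on "followed by whitespace or end of line" instead of on the end of the line only. The names LAST_COMPONENT and trim_header are illustrative.

```python
import re

# ".<digits>" that is followed by whitespace or the end of the line,
# i.e. the last component of an identifier such as 1.1.1
LAST_COMPONENT = re.compile(r'\.\d+(?=\s|$)')

def trim_header(line):
    if line.startswith('>'):
        # count=1: only the first such match (the identifier) is removed
        return LAST_COMPONENT.sub('', line, count=1)
    return line

print(trim_header(">Sequence 1.1.1  ATGCGCGCGATAAGGCGCTA"))
print(trim_header(">Sequence 1.2.2"))
```

Inside the fileinput loop, calling trim_header(line) would take the place of the startswith check and the re.sub call.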

Alex Martelli
Cool use of fileinput. I'd never heard of this module.
hughdbrown
Thanks all. It helped
So Arshan, accept an answer that's helped you most -- that's fundamental StackOverflow etiquette!
Alex Martelli
@hughdbrown, glad you liked it -- it's a great module especially for "pseudo-inplace" alteration of textfiles.
Alex Martelli
+4  A: 
import re
trimmedtext = re.sub(r'(\d+\.\d+)\.\d', r'\1', text)  # Python backreferences are written \1, not $1

Should do it. Somewhat simpler than searching for start characters (and it won't affect your DNA chains).
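For context, a minimal sketch of running this substitution over a whole block of text at once. It uses \d+ for the last number so that multi-digit values are also covered, which is a small departure from the pattern above; trim_text and the file name in the closing comment are hypothetical.

```python
import re

# Capture "<digits>.<digits>", then match the trailing ".<digits>" to drop.
PATTERN = re.compile(r'(\d+\.\d+)\.\d+')

def trim_text(text):
    # r'\1' puts back only the captured first two numbers.
    return PATTERN.sub(r'\1', text)

sample = (
    ">Sequence 1.1.1  ATGCGCGCGATAAGGCGCTA\n"
    "ATATTATAGCGCGCGCGCGGATATATATATATATATATATT\n"
    ">Sequence 1.2.2  ATATGCGCGCGCGCGCGGCG\n"
)
print(trim_text(sample))

# For a real file (the name "sequences.txt" is an assumption):
# with open("sequences.txt") as f:
#     cleaned = trim_text(f.read())
```

Because the sequence lines contain no digits, the pattern can only match inside the headers.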

Oli
Looks great, but sorry, I could not understand your code. I started Python yesterday, so could you be so kind as to start with the opening of the file? Thanks.
@Arshan: Have you read the Python tutorial yet? It should help you understand the basic steps like reading files and iterating through lines to give you the context for using this solution.
Nathan Kitchen
Oli is using what is called a regular expression for performing substitution on the text. These patterns such as (\d+\.\d+)\.\d are a general concept and not specific to Python.
mikej
Yes I know how to open, read and write file. But not sure about iterating lines in a file.
You don't *need* to iterate lines with this. You can if you want to, but you can just chuck the whole file through with `open('filename').read()`. And yes, my code above is based on the regex library built into Python. Regex is something worth learning as it's very useful for doing operations like this. It's also great for input validation.
Oli
+1  A: 

Here's a short script. Run it like: script [filename to clean]. Lots of error handling omitted.

It operates using generators, so it should work fine on huge files as well.

import sys

def clean_line(line):
    # Strip the trailing newline; for header lines also drop the last
    # two characters (the period and the single digit).
    line = line.rstrip()
    if line.startswith(">"):
        return line[:-2]
    return line

def clean(lines):
    # Generator: yields cleaned lines one at a time, so the whole file
    # is never held in memory.
    for line in lines:
        yield clean_line(line)

if __name__ == "__main__":
    filename = sys.argv[1]

    print("Cleaning %s; output to %s.." % (filename, filename + ".clean"))

    # The with statement closes both files even if an error occurs.
    with open(filename, "r") as infile:
        with open(filename + ".clean", "w") as outfile:
            for line in clean(infile):
                # Text mode translates "\n" to the platform line ending.
                outfile.write(line + "\n")
                print(": " + line)
Skurmedel