Hi,

I am having some trouble with a piece of code below:

Input: li is a nested list as below:

li = [['>0123456789 mouse gene 1\n', 'ATGTTGGGTT/CTTAGTTG\n', 'ATGGGGTTCCT/A\n'],   ['>9876543210 mouse gene 2\n', 'ATTTGGTTTCCT\n', 'ATTCAATTTTAAGGGGGGGG\n']]

Using the function below, my desired output is simply the first nine digits following '>' (i.e. i[0][1:10]), but only for sublists in which the total number of '/' characters is greater than 1.

Instead, my code gives the digits for all entries, and it gives them multiple times. I therefore assume something is wrong with my counter and my for loop, but I can't quite figure it out.

Any help, greatly appreciated.

import os

cwd = os.getcwd()


def func_one():
    outp = open('something.txt', 'w')       #output file
    li = []
    for i in os.listdir(cwd):           
        if i.endswith('.ext'):
            inp = open(i, 'r').readlines()
            li.append(inp)
    count = 0
    lis = []
    for i in li:
        for j in i:
            for k in j[1:]:          #ignore first entry in sublist
                if k == '/':
                    count += 1
                if count > 1:
                    lis.append(i[0][1:10])      
                    next_func(lis, outp)

Thanks, S :-)

+8  A: 

Your indentation is possibly wrong: you should check count > 1 within the for j in i loop, not within the one that checks every single character in j[1:].

Also, here's a much easier way to do the same thing:

def count_slashes(items):
    return sum(item.count('/') for item in items)

for item in li:
    if count_slashes(item[1:]) > 1:
        print(item[0][1:10])

Or, if you need the IDs in a list:

result = [item[0][1:10] for item in li if count_slashes(item[1:]) > 1]

Python list comprehensions and generator expressions are really powerful tools. Try to learn how to use them, as they make your life much simpler. The count_slashes function above uses a generator expression, and my last code snippet uses a list comprehension to construct the result list in a concise way.
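To make the comparison concrete, here is a minimal sketch showing that the generator expression in count_slashes does the same work as an explicit loop. The name count_slashes_loop is made up for this comparison only, and li is the sample data from the question.

```python
li = [['>0123456789 mouse gene 1\n', 'ATGTTGGGTT/CTTAGTTG\n', 'ATGGGGTTCCT/A\n'],
      ['>9876543210 mouse gene 2\n', 'ATTTGGTTTCCT\n', 'ATTCAATTTTAAGGGGGGGG\n']]

def count_slashes(items):
    # generator expression: one count per string, summed as it goes
    return sum(item.count('/') for item in items)

def count_slashes_loop(items):
    # hypothetical explicit-loop equivalent, shown for comparison only
    total = 0
    for item in items:
        total += item.count('/')
    return total

# list comprehension: keep the ID of every sublist with more than one '/'
result = [item[0][1:10] for item in li if count_slashes(item[1:]) > 1]
print(result)  # ['012345678']
```

Both functions return 2 for the sequence lines of the first sublist and 0 for those of the second, so only the first ID is kept.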

Tamás
Python surprises me again and again, how easy some things can be. Great answer +1
Felix Kling
A: 
import glob

lis = []
with open('output.txt', 'w') as outfile:
    for filename in glob.iglob('*.ext'):
        with open(filename) as infile:
            content = infile.read()
        # count '/' only in the text after the first (header) line
        if content.partition('\n')[2].count('/') > 1:
            lis.append(content[1:10])
            next_func(lis, outfile)

The reason you get the digits for all entries is that you're not resetting the counter.
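A simplified sketch of the effect, using the sample data from the question. The function names are hypothetical, and the loops are reduced to isolate the counter problem rather than reproduce the original code line by line.

```python
li = [['>0123456789 mouse gene 1\n', 'ATGTTGGGTT/CTTAGTTG\n', 'ATGGGGTTCCT/A\n'],
      ['>9876543210 mouse gene 2\n', 'ATTTGGTTTCCT\n', 'ATTCAATTTTAAGGGGGGGG\n']]

def ids_without_reset(records):
    found = []
    count = 0  # set once, so slashes from earlier sublists carry over
    for record in records:
        for line in record[1:]:
            count += line.count('/')
        if count > 1:
            found.append(record[0][1:10])
    return found

def ids_with_reset(records):
    found = []
    for record in records:
        count = 0  # reset for every sublist
        for line in record[1:]:
            count += line.count('/')
        if count > 1:
            found.append(record[0][1:10])
    return found

print(ids_without_reset(li))  # ['012345678', '987654321']
print(ids_with_reset(li))     # ['012345678']
```

Without the reset, the two slashes counted in the first sublist are still in count when the second sublist is checked, so its ID is reported even though it contains no '/'.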

SilentGhost
Could you possibly tell me how I would reset the counter? This happens to me all the time, so I generally run everything through a function to remove duplicates. Thanks!
Seafoid
@seafoid: you need to move `count = 0` to just after the `for i in li:` line, but you're better off using my code: it's more efficient and there's no need for all those nested loops.
SilentGhost
@SilentGhost - Thanks! Can your code be modified to exclude counting '/' if present in the first string within each sublist?
Seafoid
@seafoid: sure, see my edit
SilentGhost
why the downvote?
SilentGhost
It didn't come from me. Thanks for your help!
Seafoid
oops, it wasn't a downvote. someone just took back his upvote.
SilentGhost
+5  A: 

Tamás has suggested a good solution, although it uses a very different style of coding than you do. Still, since your question was "I am having some trouble with a piece of code below", I think something more is called for.

How to avoid these problems in the future

You've made several mistakes in your approach to getting from "I think I know how to write this code" to having actual working code.

You are using meaningless names for your variables, which makes it nearly impossible to understand your code, even for you. The thought "but I know what each variable means" is evidently wrong; otherwise you would have managed to solve this yourself. Notice below, where I fix your code, how difficult it is to describe and discuss it.

You are trying to solve the whole problem at once instead of breaking it down into pieces. Write small functions or pieces of code that do just one thing, one piece at a time. For each piece you work on, get it right and test it to make sure it is right. Then go on writing other pieces which perhaps use pieces you've already got. I'm saying "pieces" but usually this means functions, methods or classes.
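As a rough sketch of that workflow, each piece can be written and then checked on a tiny hand-made input before the next piece is built on top of it. All function names here are hypothetical.

```python
def header_id(header_line):
    # characters 1 to 9 of the header, i.e. the nine digits after '>'
    return header_line[1:10]

# check the first piece on its own
assert header_id('>0123456789 mouse gene 1\n') == '012345678'

def total_slashes(lines):
    # total number of '/' characters across all given lines
    return sum(line.count('/') for line in lines)

# check the second piece on its own, including the empty case
assert total_slashes(['AT/G\n', 'A/\n']) == 2
assert total_slashes([]) == 0

def ids_with_slashes(records):
    # built only from the two pieces already checked above
    return [header_id(r[0]) for r in records if total_slashes(r[1:]) > 1]

records = [['>0123456789 x\n', 'A/C\n', 'G/T\n'],
           ['>9876543210 y\n', 'ACGT\n']]
assert ids_with_slashes(records) == ['012345678']
```

If one of the assertions fails, the problem is confined to the few lines just above it, which is much easier to debug than one large function.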

Fixing your code

That is what you asked for and nobody else has done so.

First, you need to move the count = 0 line to after the for i in li: line (indented appropriately). This will reset the counter for every sub-list. Second, once you have appended to lis and run your next_func, you need to break out of both the for k in j[1:] loop and the enclosing for j in i: loop. Otherwise the same ID is appended again for every remaining character.

Here's a working code example (without the next_func but you can add that next to the append):

>>> li = [['>0123456789 mouse gene 1\n', 'ATGTTGGGTT/CTTAGTTG\n', 'ATGGGGTTCCT/A\n'],   ['>9876543210 mouse gene 2\n', 'ATTTGGTTTCCT\n', 'ATTCAATTTTAAGGGGGGGG\n']]
>>> lis = []
>>> for i in li:
        count = 0
        for j in i:
            break_out = False
            for k in j[1:]:
                if k == '/':
                    count += 1
                if count > 1:
                    lis.append(i[0][1:10])
                    break_out = True
                    break
            if break_out:
                break

>>> lis
['012345678']
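An alternative to the break_out flag, offered as a sketch rather than as part of the fix above: move the nested loops into a small helper and use return, which leaves both loops at once. The helper name has_multiple_slashes is hypothetical, and unlike the loop above it skips the header line entirely instead of skipping the first character of every line.

```python
def has_multiple_slashes(record):
    # hypothetical helper: True as soon as a second '/' is seen
    count = 0
    for line in record[1:]:      # skip the header line
        for char in line:
            if char == '/':
                count += 1
                if count > 1:
                    return True  # exits both loops immediately
    return False

li = [['>0123456789 mouse gene 1\n', 'ATGTTGGGTT/CTTAGTTG\n', 'ATGGGGTTCCT/A\n'],
      ['>9876543210 mouse gene 2\n', 'ATTTGGTTTCCT\n', 'ATTCAATTTTAAGGGGGGGG\n']]

lis = [record[0][1:10] for record in li if has_multiple_slashes(record)]
print(lis)  # ['012345678']
```

Because count is a local variable of the helper, it also starts from zero for every sub-list without any explicit reset.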

Re-writing your code to make it readable

This is so you see what I meant in the beginning of my answer.

>>> def count_slashes(gene):
    "count the number of '/' characters in the DNA sequences of the gene."
    count = 0
    dna_sequences = gene[1:]
    for sequence in dna_sequences:
        count += sequence.count('/')
    return count
>>> def get_gene_name(gene):
    "get the name of the gene"
    gene_title_line = gene[0]
    gene_name = gene_title_line[1:10]
    return gene_name
>>> genes = [['>0123456789 mouse gene 1\n', 'ATGTTGGGTT/CTTAGTTG\n', 'ATGGGGTTCCT/A\n'],   ['>9876543210 mouse gene 2\n', 'ATTTGGTTTCCT\n', 'ATTCAATTTTAAGGGGGGGG\n']]
>>> results = []
>>> for gene in genes:
        if count_slashes(gene) > 1:
            results.append(get_gene_name(gene))

>>> results
['012345678']
taleinat
`sum(seq.count('/') for seq in gene[1:])` would do the job just fine.
SilentGhost
Great answer - I would have voted it up more than once if I could.
Tamás