Hi,

I have two files and the content is as follows:

[images of the two files; not preserved]

Please consider only the bolded column and the red column; the remaining text is junk. As the two files show, they are similar in many ways. I am trying to compare the bolded text in file_1 with the same column in file_2 (it is not bolded there, but I hope you can make out that it is the same column), and if the values differ, I want to print the red text from file_1. I attempted this with the following script:

import codecs
import itertools
import os

chain_id=[]
for file in os.listdir("."):
    basename = os.path.basename(file)
    if basename.startswith("d.complex"):
        chain_id.append(basename)

for i in chain_id:
    print i
    g=codecs.open(i,  encoding='utf-8')

    f=codecs.open("ac_chain_dssp.dssp",  encoding='utf-8')
    for (x, y) in itertools.izip(g,  f): 
            if y[11]=="C":
                if y[35:38]!= "EN":
                    if y[35:38] != "OTE":
                        if x[11]=="C":
                            if x[12] != "C":
                                if y[35:38] !=x[35:38]:
                                    print x [7:10]


    g.close()
    f.close()

But the results I got were not what I expected. Now I want to modify the code above so that, when I compare the bolded column, a result is printed only if the difference between the values is more than 2. For example, row 1 of the bolded column is 83 in file_1 and 84 in file_2; since the difference is less than 2, I want it to be rejected.

Can someone help me in adding the remaining code? Cheers, Chavanak

PS: This is not homework :)

A: 

I haven't understood your problem fully but

File 1

100 C 20.2
300 B 33.3

File 2

110 C 20.23
320 B 33.34

and you want to compare 3rd column of the two files.

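# file1 and file2 are the two already-opened file objects
lines1 = file1.readlines()
list1 = [float(line.split()[2]) for line in lines1] # list of 3rd column values

lines2 = file2.readlines()
list2 = [float(line.split()[2]) for line in lines2]

# True where the two values differ by more than 2
result = map(lambda x, y: abs(x - y) > 2, list1, list2)

OR

# only the differences whose magnitude is more than 2
result = [list1[i] - list2[i] for i in range(len(list1)) if abs(list1[i] - list2[i]) > 2]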

Is this what you want??

TheMachineCharmer
`2437` is Good prime number!!!
TheMachineCharmer
How can it be what he wants? His data columns are FIXED-WIDTH, and the 5th column has some entries that are all blank. Using str.split() on his data will create a mess. His bolded column is about the NINTH column -- I can't see where you get 3 contiguous columns from.
John Machin
Right I didn't notice that. Thanks. I should have used slicing. +1 to your good answer :D. Also I have mentioned that I haven't understood the question completely.
TheMachineCharmer
+2  A: 

The direct answer to your question is to alter the last condition,
if y[35:38] !=x[35:38]: so that the "fields" at [35:38] are converted to int (or float...) and their difference can be computed, giving something like:

   try:
     iy = int(y[35:38])
     ix = int(x[35:38])
   except ValueError:
     # take whatever action is appropriate here, including silently ignoring the record.
     print("Unexpected value for record # %s" % x[7:10])
   else:
     # compare only when both fields were parsed as integers
     if abs(ix - iy) > 2:
       print(x[7:10])
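
A minimal sketch of how that check could be wrapped up for a whole pair of files, assuming the compared field sits at [35:38] and the identifier at [7:10] as in the question. The name `differing_ids` and the `threshold` parameter are hypothetical, and the sketch targets Python 3, where the built-in `zip` takes the place of `itertools.izip`:

```python
def differing_ids(lines_a, lines_b, threshold=2):
    """Yield x[7:10] for each line pair whose [35:38] fields differ by more than threshold.

    Hypothetical helper: the slice positions come from the question,
    everything else is an illustrative assumption.
    """
    for x, y in zip(lines_a, lines_b):
        try:
            ix = int(x[35:38])
            iy = int(y[35:38])
        except ValueError:
            # Blank or non-numeric field (for example a header line): skip it.
            continue
        if abs(ix - iy) > threshold:
            yield x[7:10]
```

It accepts any two iterables of lines, so two open file objects can be passed directly, e.g. `for rid in differing_ids(g, f): print(rid)`.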

More indirectly, the snippet in the question prompts the following remarks, which may in turn suggest different approaches to the problem.

  • first off, if the files are strictly "fixed format", if they are very big, and/or if nothing else is done with any of the other "field" values found in the file, the current approach is valid and probably very efficient.
  • alternatively, the logic may be made more resilient to possible variations in the file structure by parsing the "fields" of the file, rather than addressing them as slices of a long string. Look into the standard library's csv module for possible parser support.
  • some tests seem goofy or always true (like comparing a 3-character slice to a 2-character string literal). Aside from being logically wrong, this too points to a more "parsed" solution where such logical errors are more readily avoided or more obvious.
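
The "parsed" approach from the last two bullets might look like the sketch below. Because the data is fixed-width, it uses named slices rather than the csv module. The field names (`record_id`, `chain`, `value`) and the `parse_line` helper are hypothetical; only the positions [7:10], [11] and [35:38] come from the question's code.

```python
from collections import namedtuple

# Hypothetical field names; only the positions come from the question's code.
Record = namedtuple("Record", ["record_id", "chain", "value"])

FIELDS = {
    "record_id": slice(7, 10),
    "chain": slice(11, 12),
    "value": slice(35, 38),
}


def parse_line(line):
    """Return a Record, or None when the value field is not an integer."""
    raw = {name: line[pos].strip() for name, pos in FIELDS.items()}
    try:
        value = int(raw["value"])
    except ValueError:
        # Covers blank fields, short lines and text such as "OTE".
        return None
    return Record(raw["record_id"], raw["chain"], value)
```

Once each line is parsed, the comparison becomes a plain test on named fields, such as `abs(a.value - b.value) > 2`, and a mismatch in field width is much easier to spot.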
mjv
+2  A: 

Nothing to do with your problem, but this:

        if y[11]=="C":
            if y[35:38]!= "EN":
# I don't see any "EN" or "OTE" anywhere in your sample input.
# In any case the above condition will always be true, because
# y[35:38] appears to be a 3-byte string but "EN" is a 2-byte string.
                if y[35:38] != "OTE":
                    if x[11]=="C":
                        if x[12] != "C":
                            if y[35:38] !=x[35:38]:
                                print x [7:10]

is ummmmm ...

You may wish to consider an alternative way of expression e.g.

if (x[11] == "C" == y[11]
        and x[12] != "C"
        and y[35:38] not in ("EN?", "OTE")
        and y[35:38] != x[35:38]):
    print x[7:10]
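
The point made in the comments of the first block, that a 3-character slice can never equal a 2-character literal, can be checked directly. This is a small sketch on a made-up line; the idea that the field might be space-padded (so that stripping it helps) is an assumption about the data, not something visible in the question.

```python
# Made-up 38-character line: 35 spaces of padding, then a 3-character field.
y = " " * 35 + "EN "

field = y[35:38]                     # exactly 3 characters on a full-width line
always_true = field != "EN"          # "EN " vs "EN": lengths differ, so never equal
after_strip = field.strip() != "EN"  # comparing the stripped field behaves as intended

print(len(field), always_true, after_strip)  # 3 True False
```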
John Machin
Thanks for the tip :) The code now looks clean and neat :)
forextremejunk