views:

101

answers:

2

I am trying to compare two files. I will list the two file content:

 File 1                           File 2

"d.complex.1"                     "d.complex.1"

  1                                 4
  5                                 5
  48                                47
  65                                21

d.complex.10                    d.complex.10

  46                                5
  21                                46
 109                               121
 192                               192

There are totally 2000 d.complex in each file. I am trying to compare both the files but the problem is the values listed under d.complex.1 in first file has to be checked with all the 2000 d.complex entries in the second file and if the entry do not match, they are to be printed out. For example in the above files, in file1 d.complex.1 number 48 is not present in file2 d.complex.1; so that number has to be stored in a list (to print out later). Then again the same d.complex.1 has to be compared with d.complex.10 of file2 and since 1, 48 and 65 are not there, they have to be appended to a list.

The method I chose to achieve this was to use sets and then do a intersection. Code I wrote was:

first_complex=open( "file1.txt", "r" )
first_complex_lines=first_complex.readlines()
first_complex_lines=map( string.strip, first_complex_lines )
first_complex.close()

second_complex=open( "file2.txt", "r" )
second_complex_lines=second_complex.readlines()
second_complex_lines=map( string.strip, second_complex_lines )
second_complex.close()


list_1=[]
list_2=[]

res_1=[]
for line in first_complex_lines:
    if line.startswith( "d.complex" ):
        res_1.append( [] )
    res_1[-1].append( line )

res_2=[]
for line in second_complex_lines:
    if line.startswith( "d.complex" ):
        res_2.append( [] )
    res_2[-1].append( line )
h=len( res_1 )
k=len( res_2 )
for i in res_1:
   for j in res_2:
       print i[0]
       print j[0]
       target_set=set ( i )
       target_set_1=set( j )
       for s in target_set:
           if s not in target_set_1:
               print s

The above code is giving an output like this (just an example):

1
48
65
d.complex.1.dssp
d.complex.1.dssp

46
21

109 d.complex.1.dssp d.complex.1.dssp d.complex.10.dssp

Though the above answer is correct, I want a more efficient way of doing this, can anyone help me? Also two d.complex.1.dssp are printed instead of one which is also not good.

What I would like to have is:

d.complex.1
d.complex.1 (name from file2)
   1
   48
   65

d.complex.1
d.complex.10 (name from file2)
   1
   48
   65

I am so new to python so my concept above might be flawed. Also I have never used sets before :(. Can someone give me a hand here?

+1  A: 

Pointers:

  • Use list comprehensions or generator expressions to simplify data processing. More readable
  • Just generate the sets once.
  • Use functions to not repeat yourself, especially doing the same task twice.

I've made a few assumptions about your input data, you might want to try something like this.

def parsefile(filename):
  ret = {}
  cur = None
  for line in ( x.strip() for x in open(filename,'r')):
    if line.startswith('d.complex'):
      cur = set()
      ret[line] = cur
    if not cur or not line.isdigit():
      continue
    cur.add(int(line))
  return ret

def compareStructures(first,second):
  # Iterate through key,value pairs in first
  for firstcmplx, firstmembers in first.iteritems():
    # Iterate through key,value pairs in second
    for secondcmplx, secondmembers in second.iteritems():
      notinsecond = firstmembers- secondmembers
      if notinsecond:
        # There are items in first that aren't in second
        print firstcmplx
        print secondcmplx
        print "\n".join([ str(x) for x in notinsecond])

first = parsefile("myFirstFile.txt")
second = parsefile("mySecondFile.txt")

compareStructures(first,second)

Edited for fixes.. shows how much I rely on running the code to test it :) Thanks Alex

MattH
@MattH, good concepts, but you should clean up details -- e.g. you make data1 and data2 but then pass things called first and second, test `if len(notinsecond):` where you should be testing just `if notinsecond:`, etc -- worth editing to fix these issues!
Alex Martelli
@Alex Martelli, thanks for the proof-reading! Was multitasking when I put it together and as I don't have reasonable sample input data, I didn't test it. I normally go back over my own answers, but didn't have the time.
MattH
+1  A: 

There's already a good answer, by @MattH, focused on the Python details of your problem, and while it can be improved in several details the improvements would only gain you some percentage points in efficiency -- worthwhile but not great.

The only hope for a huge boost in efficiency (as opposed to "kai-zen" incremental improvement) is a drastic change in the algorithm -- which may or may not be possible depending on characteristics of your data that you do not reveal, and some details about your precise requirements.

The crucial part is: roughly, what range of numbers will be present in the file, and roughly, how many numbers per "d.complex.N" stanza? You already told us there are going to be about 2000 stanzas per file (and that's also crucial of course) and the impression is that in each file they're going to be ordered by contiguous increasing N -- 1, 2, 3, and so on (is that so?).

Your algorithm builds two maps stanza->numbers (not with top efficiency, but that's what @MattH's answer focuses on enhancing) so then inevitably it needs N squared stanza-to-stanza checks -- as N is 2,000, it needs 4 million such checks.

Consider building reversed maps, number->stanzas -- if the range of numbers and the typical size of (amount of numbers in) a stanza are both reasonably limited, those will be more compact. For example, if the numbers are between 1 and 200, and there are about 4 numbers per stanzas, this implies a number will typically be in (2000 * 4) / 200 -> 40 stanzas, so such mappings would have 200 entries of about 40 stanzas each. It only needs 200 squared (40,000) checks, rather than 4 million, to obtain the joint information for each number (then, depending on exact need for output format, formatting that info may require very substantial effort again -- if you absolutely require as final result 4 million "stanza-pairs" section as the output, then of course there's no way to avoid 4 million "output operations, which will be inevitably very costly).

But all of this depends on those numbers that you're not telling us -- average stanza population, and range of numbers in the files, as well as details on what constraints you must absolutely respect for output format (if the numbers are reasonable, the output format constraints are going to be the key constraint on the big-O performance you can get out of any program).

Remember, to quote Fred Brooks:

Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.

Brooks was writing in the '60s (though his collection of essays, "The Mythical Man-Month", was published later, in the '70s), whence the quaint use of "flowcharts" (where we'd say code or algorithms) and "tables" (where we'd say data or data structures) -- but the general concept is still perfectly valid: the organization and nature of your data, in all kinds of programs focused on data processing (such as yours), can be even more important than the organization of the code, especially since it constrains the latter;-).

Alex Martelli