I have two large log files, a and b. I want to find the devices that appear in a but not in b, and vice versa (excluding lines where the device is common to both). The files look like this example:

04/09/2010,13:11:52,Authen OK,user1,Default Group,00-24-2B-A1-08-88,29,10.1.1.1,(Default),,,,,,13,EAP-TLS,,device1,
04/19/2010,15:35:24,Authen OK,user2,Default Group,00-24-2B-A1-05-EA,29,10.1.1.2,(Default),,,,,,13,EAP-TLS,,device2,
04/09/2010,13:11:52,Authen OK,user3,Default Group,00-24-2B-A1-08-88,29,10.1.1.3,(Default),,,,,,13,EAP-TLS,,device3,
04/19/2010,15:35:24,Authen OK,user4,Default Group,00-24-2B-A1-05-EA,29,10.1.1.4,(Default),,,,,,13,EAP-TLS,,device4,

To reiterate, I need the device (field [-2]) and IP (field [7]) for each device that is in log file a but not b, or in b but not a.

Here's what I've done so far, but it seems a little clunky and is very slow (each file has about 400K lines). I'm cross-referencing twice. Can anyone suggest efficiencies, please? Perhaps I am using the wrong logic?

chst={}
chbs={}
for i,line in enumerate(open('chst.txt').readlines()):
    line=line.split(',')
    chst[line[-2]+','+str(i)]=','.join(line)

for i,line in enumerate(open('chbs.txt').readlines()):
    line=line.split(',')
    chbs[line[-2]+','+str(i)]='.'.join(line)

print "these lines are in CHST but not in CHBS"
for a in chst:
    if a.split(',')[0] not in str(chbs.values()):
        line=chst[a].split(',')
        print line[-2], line[7]

print "\nthese lines are in CHBS but not in CHST"

for a in chbs:
    if a.split(',')[0] not in str(chst.values()):
        line=chbs[a].split(',')
        print line[-2], line[7]
A:

You are looking for a symmetric difference:

chst = { ( line.split( "," )[ -2 ], line.split( "," )[ 7 ] ) for line in open( ... ) }
chbs = { ( line.split( "," )[ -2 ], line.split( "," )[ 7 ] ) for line in open( ... ) }

diff = chst ^ chbs

If you need the asymmetric differences, use -:

chst - chbs # tuples in chst but not in chbs
chbs - chst # tuples in chbs but not in chst
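
As a minimal sketch of the set approach in Python 3, run on in-memory sample lines shaped like those in the question rather than on the real files. The helper name device_ip_pairs, the TEMPLATE string, and the sample data are made up for illustration:

```python
# Hypothetical line template in the format shown in the question.
TEMPLATE = ("04/09/2010,13:11:52,Authen OK,{user},Default Group,"
            "00-24-2B-A1-08-88,29,{ip},(Default),,,,,,13,EAP-TLS,,{device},\n")

def device_ip_pairs(lines):
    """Return the set of (device, ip) tuples found in comma-separated log lines."""
    pairs = set()
    for line in lines:
        fields = line.rstrip("\n").split(",")  # split each line only once
        if len(fields) > 8:                    # skip blank or truncated lines
            pairs.add((fields[-2], fields[7]))
    return pairs

# Made-up sample data: device2 is common to both "files".
chst_lines = [
    TEMPLATE.format(user="user1", ip="10.1.1.1", device="device1"),
    TEMPLATE.format(user="user2", ip="10.1.1.2", device="device2"),
]
chbs_lines = [
    TEMPLATE.format(user="user2", ip="10.1.1.2", device="device2"),
    TEMPLATE.format(user="user3", ip="10.1.1.3", device="device3"),
]

chst = device_ip_pairs(chst_lines)
chbs = device_ip_pairs(chbs_lines)

only_in_chst = chst - chbs   # pairs in chst but not in chbs
only_in_chbs = chbs - chst   # pairs in chbs but not in chst
in_one_only = chst ^ chbs    # symmetric difference: union of the two above

for device, ip in sorted(only_in_chst):
    print(device, ip)
```

An open file object can be passed to device_ip_pairs in place of the list, since it is only iterated line by line. Note that the comparison is on the whole (device, IP) pair, so a device that appears in both files with different IPs would show up in both differences.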

If you need the actual line, instead of a tuple ( device, IP ) you can use dictionaries instead of sets:

chst = { ( line.split( "," )[ -2 ], line.split( "," )[ 7 ] ): line for line in open( ... ) }
chbs = { ( line.split( "," )[ -2 ], line.split( "," )[ 7 ] ): line for line in open( ... ) }

diff = chst.items( ) ^ chbs.items( )

This works because, in Python 3, dict.items( ) returns a view on the items, which has set-like properties. Note that this is called dict.viewitems( ) in Python 2.7.
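
A variation on the dictionary idea, sketched in Python 3 under the assumption that lines should be matched on (device, IP) alone and the full line is only wanted for printing. Comparing items( ) also compares the line text, so two lines for the same device with different timestamps would count as different; comparing the key views avoids that. The helper name index_by_device_ip, the TEMPLATE string, and the sample lines are made up for illustration:

```python
# Hypothetical line template in the format shown in the question.
TEMPLATE = ("{date},13:11:52,Authen OK,{user},Default Group,"
            "00-24-2B-A1-08-88,29,{ip},(Default),,,,,,13,EAP-TLS,,{device},\n")

def index_by_device_ip(lines):
    """Map (device, ip) -> the last full line seen for that pair."""
    index = {}
    for line in lines:
        stripped = line.rstrip("\n")
        fields = stripped.split(",")
        if len(fields) > 8:                    # skip blank or truncated lines
            index[(fields[-2], fields[7])] = stripped
    return index

# Made-up sample data: device2 is in both, but with different dates.
chst = index_by_device_ip([
    TEMPLATE.format(date="04/09/2010", user="user1", ip="10.1.1.1", device="device1"),
    TEMPLATE.format(date="04/09/2010", user="user2", ip="10.1.1.2", device="device2"),
])
chbs = index_by_device_ip([
    TEMPLATE.format(date="04/19/2010", user="user2", ip="10.1.1.2", device="device2"),
    TEMPLATE.format(date="04/19/2010", user="user3", ip="10.1.1.3", device="device3"),
])

# Key views support set operations; the result is an ordinary set of keys.
only_chst_keys = chst.keys() - chbs.keys()
only_chbs_keys = chbs.keys() - chst.keys()

only_chst_lines = [chst[key] for key in sorted(only_chst_keys)]
only_chbs_lines = [chbs[key] for key in sorted(only_chbs_keys)]

for line in only_chst_lines + only_chbs_lines:
    print(line)
```

In this sample, device2 is left out of both results even though its two lines differ in the date field, which is the behavior the key-based comparison is meant to give.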

katrielalex
The sets module has been deprecated since Python 2.6; set and frozenset have been builtins since Python 2.4.
Jim Brissom
Oops, the backporting team *has* been busy! Fixed.
katrielalex
I'm also quite sure that just calling items won't work (and it is also unrelated to dict views): you would have to call viewitems on that dict, which is supported starting with 2.7. The items method just returns a list of key/value pairs, and the ^ operator is not supported for lists, whereas viewitems returns an actual view of type dict_items.
Jim Brissom
I tried it before posting and it works in Py3k.
katrielalex
Thanks all, here's what worked: the top two lines from the first answer plus the two x - y differences. I then joined those strings and tested it. It ran very quickly on the large datasets, and I did some sample searches of the results in the files to test; all seemed good. Well done.
Bill
A: 

There's a bug in line 9, where you are doing ='.'.join(line) instead of =','.join(line), i.e. a dot in the quotes instead of a comma. Or maybe the lines in chbs should be split on dots instead of commas later.

At the moment, if there are three lines for device7 in chbs but none in chst, the script will tell you three times, but your description of the problem implies that you don't need to know how many times it appears. Do you really want that, or is a single report OK for multiple occurrences? If a single report is enough, you could simplify it by just using the device name as the dictionary key and checking whether the other dictionary has that key.

Also, at the moment you're recording the line numbers but not really using them. If you do need to know how many times a device appears, why not report that instead of having to count them? In that case, when adding a device key to the dictionary, first check whether it's already there and, if so, increment a counter (perhaps in another dictionary also keyed by the device name).
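
A rough sketch of the counting idea in Python 3, using collections.Counter from the standard library in place of a hand-rolled counter dictionary. The helper name count_devices, the TEMPLATE string, and the sample data are made up for illustration:

```python
from collections import Counter

# Hypothetical line template in the format shown in the question.
TEMPLATE = ("04/09/2010,13:11:52,Authen OK,{user},Default Group,"
            "00-24-2B-A1-08-88,29,{ip},(Default),,,,,,13,EAP-TLS,,{device},\n")

def count_devices(lines):
    """Count how many lines mention each device name (field [-2])."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if len(fields) > 8:                    # skip blank or truncated lines
            counts[fields[-2]] += 1
    return counts

# Made-up sample data: device7 appears three times, only in chbs.
chst_counts = count_devices([
    TEMPLATE.format(user="user1", ip="10.1.1.1", device="device1"),
    TEMPLATE.format(user="user2", ip="10.1.1.2", device="device2"),
])
chbs_counts = count_devices([
    TEMPLATE.format(user="user2", ip="10.1.1.2", device="device2"),
    TEMPLATE.format(user="user7", ip="10.1.1.7", device="device7"),
    TEMPLATE.format(user="user7", ip="10.1.1.7", device="device7"),
    TEMPLATE.format(user="user7", ip="10.1.1.7", device="device7"),
])

# One report per device, with its occurrence count, via a key lookup
# in the other dictionary.
only_in_chst = {d: n for d, n in chst_counts.items() if d not in chbs_counts}
only_in_chbs = {d: n for d, n in chbs_counts.items() if d not in chst_counts}

for device, n in sorted(only_in_chbs.items()):
    print(device, "appears", n, "time(s) in CHBS but not in CHST")
```

Each membership test here is a dictionary key lookup, so the cost does not grow with the size of the other file the way a substring search over str(chbs.values()) does.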

Simon Hibbs
Thanks Simon, indeed a typo. The way I had it took way too long anyway, so I'm grateful for the answer above.
Bill