I have lines of data, each consisting of 4 fields:

aaaa bbb1 cccc dddd  
aaaa bbb2 cccc dddd  
aaaa bbb3 cccc eeee  
aaaa bbb4 cccc ffff  
aaaa bbb5 cccc gggg  
aaaa bbb6 cccc dddd

Please bear with me.

The first and third fields are always the same, but I don't need them; the 4th field can be the same or different. The thing is, I only want the 2nd and 4th fields from lines whose 4th field value is not shared with any other line. For example, from the above data:

bbb3 eeee  
bbb4 ffff    
bbb5 gggg

Note that I don't mean deduplication, as that would leave one of the entries in. If the 4th field shares a value with another line, I don't want any line that ever had that value.

Humblest apologies once again for asking what is probably simple.

+6  A: 

Here you go:

from collections import defaultdict

LINES = """\
aaaa bbb1 cccc dddd
aaaa bbb2 cccc dddd
aaaa bbb3 cccc eeee
aaaa bbb4 cccc ffff
aaaa bbb5 cccc gggg
aaaa bbb6 cccc dddd""".split('\n')

# Count how many lines each unique value of the fourth field appears in.
d_counts = defaultdict(int)
for line in LINES:
    a, b, c, d = line.split()
    d_counts[d] += 1

# Print only those lines with a unique value for the fourth field.
for line in LINES:
    a, b, c, d = line.split()
    if d_counts[d] == 1:
        print(b, d)

# Prints
# bbb3 eeee
# bbb4 ffff
# bbb5 gggg
RichieHindle
That's about perfect. Many thanks. I now need to weave it into my script: I'm iterating over a file, then making the output available later in the script (via a dictionary). Do you foresee any problems?
The only thing you need to be careful of is that I'm iterating over the lines twice - you can't simply replace my two "for line in LINES:" loops with two "for line in my_open_file:" loops, because the first loop will read the whole file and the second will have nothing to read. Either store the lines in a list for the second loop to use, or seek() back to the start of the file before the second loop.
RichieHindle
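The seek() option from the comment above can be sketched roughly as follows. This is only an illustration, not part of the original answer: the function name unique_fourth_field is made up, io.StringIO stands in for a real open file, and the code assumes every line has exactly four whitespace-separated fields. The result is collected into a dictionary, since that is how the asker wants to use it later.

```python
import io
from collections import defaultdict


def unique_fourth_field(f):
    """Return {second_field: fourth_field} for lines whose fourth field occurs once.

    `f` is any seekable text file object. The function name is hypothetical.
    Assumes every line has exactly four whitespace-separated fields.
    """
    # First pass: count how many lines each fourth-field value appears in.
    d_counts = defaultdict(int)
    for line in f:
        a, b, c, d = line.split()
        d_counts[d] += 1

    # Rewind; otherwise the second loop would find nothing left to read.
    f.seek(0)

    # Second pass: keep only lines whose fourth field was seen exactly once.
    result = {}
    for line in f:
        a, b, c, d = line.split()
        if d_counts[d] == 1:
            result[b] = d
    return result


# io.StringIO stands in for a real file opened with open(...).
sample = io.StringIO(
    "aaaa bbb1 cccc dddd\n"
    "aaaa bbb2 cccc dddd\n"
    "aaaa bbb3 cccc eeee\n"
    "aaaa bbb4 cccc ffff\n"
    "aaaa bbb5 cccc gggg\n"
    "aaaa bbb6 cccc dddd\n"
)
print(unique_fourth_field(sample))
```

The other option mentioned, storing the lines in a list first (for example with f.readlines()), avoids the seek() call at the cost of holding the whole file in memory.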
A: 

For your amplified requirement, you can avoid reading the file twice or saving it in a list:

LINES = """\
aaaa bbb1 cccc dddd
aaaa bbb2 cccc dddd
aaaa bbb3 cccc eeee
aaaa bbb4 cccc ffff
aaaa bbb5 cccc gggg
aaaa bbb6 cccc dddd""".split('\n')

import collections
adict = collections.defaultdict(list)
for line in LINES: # or file ...
    a, b, c, d = line.split()
    adict[d].append(b)

map_b_to_d = dict((blist[0], d) for d, blist in adict.items() if len(blist) == 1)
print(map_b_to_d)

# alternative; saves some memory

xdict = {}
duplicated = object()
for line in LINES: # or file ...
    a, b, c, d = line.split()
    xdict[d] = duplicated if d in xdict else b

map_b_to_d2 = dict((b, d) for d, b in xdict.items() if b is not duplicated)
print(map_b_to_d2)
John Machin