views:

172

answers:

3

Hi,

I am having trouble picking the best data structure for solving a problem.

The problem is as below:

  1. I have a nested list of identity codes where the sublists are of varying length.

    li = [['abc', 'ghi', 'lmn'], ['kop'], ['hgi', 'ghy']]
    
  2. I have a file with two entries on each line; an identity code and a number.

    abc      2.93  
    ghi      3.87  
    lmn      5.96  
    

Each sublist represents a cluster. I wish to select the i.d. from each sublist with the highest number associated with it, append that i.d. to a new list and ultimately write it to a new file.

What data structure should the file with numbers be read in as?

Also, how would you iterate over said data structure to return the i.d. with the highest number that matches the i.d. within a sublist?

Thanks, S :-)

+4  A: 

You can read the file into a dictionary (string=>int), then use a list comprehension to get the highest identity code from each sublist.

d = {}
with open("data", 'rb') as data:
  for line in data:
    key, val = line.split(' ')
    d[key] = float(val)

ids = [max(sublist, key=lambda k: d[k]) for sublist in li]

For Python 2.4, use:

ids = []
for sublist in li:
  subnums = map(lambda x: d[x], sublist)
  ids.append(sublist[subnums.index(max(subnums))])

As noted, this is O(n).

Matthew Flaschen
O(n) sounds good to me.
Justin Peel
You are not currently casting val to a number, so this code will be comparing *strings*, not floats, i.e. '9.1' > '82.3'. Presumably @Seafoid wants values compared as numbers.
Jeffrey Harris
Thanks, @Jeffrey.
Matthew Flaschen
Yeah - I do indeed wish to compare the numerical values. Also, this code throws an error. max() doesn't want to take a keyword argument.
Seafoid
Just figures that max() has support for keyword arguments from v2.5 onwards. Unfortunately, I have to cope with v2.4.3. Any alternatives to max()?
Seafoid
This 2.4 version works nicely, it seems! Thanks!
Seafoid
+2  A: 

My solution assumes that you only want the highest number and not the id that is associated with it.

I'd read the identity codes and the numbers in a dictionary as suggested by Matthew

NEW_LIST = []
ID2NUM = {}
with file('codes') as codes:
  for line in codes:
    id, num = line.rstrip().split()
    ID2NUM[id] = num

I added some numbers so every id has a value. My ID2NUM looks like this:

{'abc': 2.9300000000000002,
 'ghi': 3.8700000000000001,
 'ghy': 1.2,
 'hgi': 0.40000000000000002,
 'kop': 4.3499999999999996,
 'lmn': 5.96}

Then process the list li:

for l in li:
  NEW_LIST.append(max([d[x] for x in l]))

>>> NEW_LIST
[5.96, 4.3499999999999996, 1.2]

To write the new list to a file, one number per line:

with file('new_list', 'w') as new_list:
  new_list.write('\n'.join(NEW_LIST))
mmmasterluke
Sorting is O(n log n), so it's preferable to use max, which is O(n).
Matthew Flaschen
Good point, thanks.
mmmasterluke
A: 

How about storing each sublist as a binary search tree? You'd get O(log n) search performance on average.

Another option would be to use max-heaps and you'd get O(1) to get the maximum value.

Pin