views:

194

answers:

3

I need to find all sub-arrays which share any mutual element and merge them into one sub-array. (Implementing in Python but any algorithmic idea would be helpful)

Multidimensional array structure:

categories = {'car':['automobile','auto'],
             'bike':['vehicle','motorcycle','motorbike','automobile'],
             'software':['computer','macbook','apple','microsoft','mozilla'],
             'firefox':['internet','mozilla','browser']
             'bicycle':['vehicle']}

I'd like to have 'car', 'bike' and 'bicycle' merged into one list (keep first list's key new list's key could be any of the relevant keys) and 'software' and 'firefox' merged into one list as well.

Performance is crucial.

Best solution I could come with so far is to maintain a flatten one-dimension array of element => list_key (e.g 'automobile' => 'car') and then run the following recursive function for each list in the multidimensional array (pseudocode):

function merge_similar(list_key):
    For each element in categories[list_key]:
        If flatten_array.has_key(element):
            list_to_merge = flatten_array[element]
            merge_similar(list_to_merge) /* merge other lists which share an element with our newly found similar list */
            categories[list_key] = merge(categories [list_key], categories[list_to_merge])
            delete categories[list_to_merge]

Any idea how to improve it's performance?

Thanks!

+2  A: 

Note that there is no "first key" -- dicts don't keep order, so if you need some order preserved you'll need to start from some different, alternative data structure.

Apart from order-related issues, I'd start with something like:

def merged(dictoflists):
  result = dict()
  reversed = dict()
  for k, l in dictoflists.iteritems():
    intersecting = set(reversed.get(w) for w in l) - set([None])
    if intersecting:
      pickone = intersecting.pop()
      into = result[pickone]
    else:
      pickone = k
      into = result[k] = set()
    for ok in intersecting:
      into.update(result.pop(ok))
    into.update(l)
    for w in into:
      reversed[w] = pickone
  return dict((k, sorted(l)) for k, l in result.iteritems())

If order is important to you, the uses of set will be problematic and you'll need more complicated (and slower) data structures -- however, if that's the case, you should first specify in complete detail exactly what ordering constraints you need to respect in the various possible cases that can occur.

Alex Martelli
Seems great, going to do some more tests. Key is actually not crucial - I've fixed my post. Thanks!
Saggi Malachi
A: 

I can't imagine that a recursive solution would be speedy.
Is using list.extend() too slow?
You could do something like this:

categories['car'].extend(categories['bike']);
categories['car'].extend(categories['bicycle']);

Or to be more general, if you pass in a list of keys you want to merge:

first_key=None;
for key in keys_whose_lists_I_want_to_merge:
    if first_key is None:
        first_key=key;
    else:
        categories[first_key].extend(categories[key]);

If you're merging a ton of lists, you can optimize that loop to not perform the None check after the first time. See the tip entitled 'Re-map Functions at runtime' on the Python Performance Tips page.

Roman Stolper
A: 
>>> categories = {'car':['automobile','auto'],
             'bike':['vehicle','motorcycle','motorbike','automobile'],
             'software':['computer','macbook','apple','microsoft','mozilla'],
             'firefox':['internet','mozilla','browser'],
             'bicycle':['vehicle']}
>>> # Use sets for values
>>> for k,v in categories.items(): categories[k] = set(v)

>>> # Acumulate
>>> for k1, v1 in categories.items():
    if v1:
        for k2,v2 in categories.items():
            if v2 and k1 != k2 and v1 & v2:
                v1 |= v2
                categories[k2] = None
        categories[k1] = v1


>>> # Print
>>> for k1, v1 in categories.items():
    if v1: print('%s: %r' %(k1,v1))


bicycle: {'motorbike', 'vehicle', 'auto', 'automobile', 'motorcycle'}
firefox: {'apple', 'mozilla', 'macbook', 'computer', 'internet', 'microsoft', 'browser'}
>>> 
Paddy3118