views:

96

answers:

3

I have a list of pairs of objects. Objects can appear in the pair in either order. What is the most efficient algorithm (and implementation?) to find all bags (ie sets with duplicates permitted) of pairs between the same objects. For my purpose the object references can be assumed to be pointers, or names or some similar convenient, short, useful representation. The individual pairs are identifiable. There are no pairs which have the same object in both parts of the pair.

So given a list of pairs (Oid is an object reference; Pid a pair reference)

O1-P1-O2
O3-P2-O4
O5-P3-O1
O1-P4-O2
O2-P5-O1
O1-P6-O5
O7-P7-O8

should return:

P1;P4;P5 and P3;P6
+5  A: 

Fancy terminology may make this problem look difficult, but it's actually pretty simple.

  1. Order elements in each pair. (Since you said objects can be represented as numbers, let's assume pair.first <= pair.second always)
  2. Sort list, using traditional way to compare pairs. I.e. pair1 < pair2 means pair1.first < pair2.first or pair1.first == pair2.first && pair1.second < pair2.second.

Sorted list from your example will look like this

O1-P1-O2
O1-P4-O2
O1-P5-O2
O1-P3-O5
O1-P6-O5
O3-P2-O4
O7-P7-O8

Now all elements from one 'bag' will occupy consecutive spots in the list. Go ahead and grab them.

There're options to solve this with hash too.

Nikita Rybak
Thank you for this. Doesn't your approach involve at least three passes over the data. If so this is not as efficient as I hope for - see my comment on efficiency.
Chris Walton
@Chris Hashtables, then. It'll give you linear solution and, if carefully implemented, will increase memory consumption only a little.
Nikita Rybak
I've implemented your answer in Python. http://stackoverflow.com/questions/3990737/what-is-an-efficient-algorithm-for-extracting-bags-from-lists-of-pairs/3991352#3991352
J.F. Sebastian
+3  A: 

Is "less than" defined on your objects? If so, then you can do this with a single pass through your list of pairs.

1) Create an empty collection of bags, indexed by two "object" parameters. By convention, the first index parameter should be less than the second index parameter.

2) Loop through the list, and find the appropriate bag index at min(pair.left,pair.right), max(pair.left, pair.right). Add the element to that bag.

mbeckish
My objects do not currently have less than defined, but this is a detail that can be incorporated very easily. This looks to be the sort of approach I am looking for, involving only a single pass over the data. I will look at this in more detail. Is there an even more efficient way of approaching this, given that a small portion of a large list will be changing rapidly, but is of most interest in what I am trying to do?
Chris Walton
If you are copying N items from a List to a collection of Bags, then the runtime will be at least O(N).
mbeckish
Thanks. In between comments I realised that I can increase efficiency by adopting your approach, but pre-calculating the collection of bags; and only calculating the changing portion repeatedly. Still O(n) but n is much reduced. Again thanks.
Chris Walton
I've implemented your answer in Python. http://stackoverflow.com/questions/3990737/what-is-an-efficient-algorithm-for-extracting-bags-from-lists-of-pairs/3991352#3991352
J.F. Sebastian
+1  A: 

@Nikita Rybak's solution in Python using itertools.groupby():

#!/usr/bin/env python
from itertools import groupby

pairs = """
O1-P1-O2
O3-P2-O4
O5-P3-O1
O1-P4-O2
O2-P5-O1
O1-P6-O5
O7-P7-O8
""".split()

def lex_order(pair):
    """'O2-P5-O1' -> ['01', '02']"""
    return sorted(pair.split('-')[::2])

data = sorted(pairs, key=lex_order)
for key, group in groupby(data, key=lex_order):
    print "key=%(key)s, pairs=%(pairs)s" % dict(key=key, pairs=list(group))

Output:

key=['O1', 'O2'], pairs=['O1-P1-O2', 'O1-P4-O2', 'O2-P5-O1']
key=['O1', 'O5'], pairs=['O5-P3-O1', 'O1-P6-O5']
key=['O3', 'O4'], pairs=['O3-P2-O4']
key=['O7', 'O8'], pairs=['O7-P7-O8']

@mbeckish's solution in Python:

#!/usr/bin/env python
from collections import defaultdict

pairs = """
O1-P1-O2
O3-P2-O4
O5-P3-O1
O1-P4-O2
O2-P5-O1
O1-P6-O5
O7-P7-O8
""".split()

bags = defaultdict(list)
for pair in pairs:
    i, _, j = pair.split('-') # 'O2-P5-O1' -> ['02', 'P5', '01']
    bags[min(i,j), max(i,j)].append(pair)

import pprint;
pprint.pprint(dict(bags))

Output:

{('O1', 'O2'): ['O1-P1-O2', 'O1-P4-O2', 'O2-P5-O1'],
 ('O1', 'O5'): ['O5-P3-O1', 'O1-P6-O5'],
 ('O3', 'O4'): ['O3-P2-O4'],
 ('O7', 'O8'): ['O7-P7-O8']}
J.F. Sebastian
I understand what that slicing trick does ([::2]) but can you tell me what the terminology is for it? I'd like to read some more about it.
jlv
@jlv: "extended slices": `a[i:j:k]` selects all items of `a` with index `x` where `x = i + n*k`, `n >= 0` and `i <= x < j` http://docs.python.org/reference/datamodel.html Read about `__getitem__()`, `slice()`, and `numpy` arrays for advanced usage.
J.F. Sebastian
Thank ya kindly, J.F.
jlv