ansaurus

Question

Answer 1

+1 A:

Solution:

result = [(l1 + l2[1:]) 
          for l1 in list1 
          for l2 in list2 
          if (l1[0] == l2[0] and l1[3] < l2[2])
          ]

... with test code:

list1 = [(1, 'Type1', 20, 30,),
         (2, 'Type1', 20, 30,),
         (3, 'Type1', 20, 30,),
         (4, 'Type1', 20, 30,),
         (5, 'Type1', 20, 30,),
         (6, 'Type1', 20, 30,), # does not have Type2

         (8, 'Type1', 20, 30,), # multiple
         (8, 'Type1', 25, 35,), # multiple
         (8, 'Type1', 50, 55,), # multiple
         ]

list2 = [(1, 'Type2', 40, 50,), # after
         (2, 'Type2', 10, 15,), # before
         (3, 'Type2', 25, 28,), # inside
         (4, 'Type2', 25, 35,), # inside-after
         (4, 'Type2', 15, 25,), # inside-before
         (7, 'Type2', 20, 30,), # does not have Type1

         (8, 'Type2', 40, 50,), # multiple
         (8, 'Type2', 60, 70,), # multiple
         (8, 'Type2', 80, 90,), # multiple
         ]

result = [(l1 + l2[1:]) 
          for l1 in list1 
          for l2 in list2 
          if (l1[0] == l2[0] and l1[3] < l2[2])
          ]

print '\n'.join(str(r) for r in result)

It is not clear what result would you like if there is more then one occurrence of both Type1 and Type2 within the same text ID. Please specify.

van 2009-06-12 19:04:22

I would like to have all possible type1 before type2 within the same ID.

2009-06-12 19:07:02

what if you have more then 1 type2?

van 2009-06-12 19:15:47

updated with multiple occurrences tests: please check if the result is as you want it to be in regard of ID = 8. Currently you get all combinations that fit, but you may want less. Advise?

van 2009-06-12 19:25:20

Yes the results from ID8 yield the correct solution.

2009-06-12 19:31:31

Heh, I thought by "super inefficient", the questioner meant O(n^2). Apparently not :-)

Steve Jessop 2009-06-12 19:48:27

@onebyone. I know, I know. But often people over-size the issue. In such cases the simplicity is better choice.

van 2009-06-12 20:14:36

Answer 2

+1 A:

I don't know how many types you have. But If we assume you have only type 1 and type 2, then it sounds like a problem similar to a merge sort. Doing it with a merge sort, you make a single pass through the list.

Take two indexes, one for type 1 and one for type 2 (I1, I2). Sort the list by id, start1. Start I1 as the first instance of type1, and I2 as zero. If I1.id < I2.Id then increment I1. If I2.id < I1.id then increment I2. If I1.id = I2.id then check iStart.

I1 can only stop on a type one record and I2 can only stop on a type 2 record. Keep incrementing the index till it lands on an appropriate record.

You can make some assumptions to make this faster. When you find an block that succeeds, you can move I1 to the next block. Whenever I2 < I1, you can start I2 at I1 + 1 (WOOPS MAKE SURE YOU DONT DO THIS, BECAUSE YOU WOULD MISS THE FAILURE CASE!) Whenever you detect an obvious failure case, move I1 and I2 to the next block (on appropriate recs of course).

johnnycrash 2009-06-12 19:05:42

Answer 3

+1 A:

I recently did something like this. I might not be understanding your problem but here goes.

I would use a dictionary:

from collections import defaultdict:
masterdictType1=defaultDict(dict)
masterdictType2=defaultdict(dict)


for item in myList:
   if item[1]=Type1
       if item[0] not in masterdictType1:
           masterdictType1[item[0]]['begin']=item[2] # start index
           masterdictType1[item[0]]['end']=item[-1] # end index
   if item[1]=Type2
       if item[0] not in masterdictType2:
           masterdictType2[item[0]]['begin']=item[2] # start index
           masterdictType2[item[0]]['end']=item[-1] # end index

joinedDict=defaultdict(dict)

for id in masterdictType1:
    if id in masterdictType2:
        if masterdictType1[id]['begin']<masterdictType2[id]['begin']:
            joinedDict[id]['Type1Begin']=masterdictType1[id]['begin']
            joinedDict[id]['Type1End']=masterdictType1[id]['end']
            joinedDict[id]['Type2Begin']=masterdictType2[id]['begin']
            joinedDict[id]['Type2End']=masterdictType2[id]['end']

This gives you explicitness and gives you something that is durable since you can pickle the dictionary easily.

PyNEwbie 2009-06-12 19:14:35

Answer 4

A:

Assuming there are lots of entries for each ID, I would (pseudo-code)

    for each ID:
        for each type2 substring of that ID:
            store it in an ordered list, sorted by start point
        for each type1 substring of that ID:
            calculate the end point (or whatever)
            look it up in the ordered list
            if there's anything to the right, you have a hit

So, if you have control of the initial sort, then instead of (ID, start), you want them sorted by ID, then by type (2 before 1). Then within the type, sort by start point for type2, and the offset you're going to compare for type1. I'm not sure whether by "A before B" you mean "A starts before B starts" or "A ends before B starts", but do whatever's appropriate.

Then you can do the whole operation by running over the list once. You don't need to actually construct an index of type2s, because they're already in order. Since the type1s are sorted too, you can do each lookup with a linear or binary search starting from the result of the previous search. Use a linear search if there are lots of type1s compared with type2s (so results are close together), and a binary search if there are lots of type2s compared with type1s (so results are sparse). Or just stick with the linear search as it's simpler - this lookup is the inner loop, but its performance might not be critical.

If you don't have control of the sort, then I don't know whether it's faster to build the list of type2 substrings for each ID as you go; or to sort the entire list before you start into the required order; or just to work with what you've got, by writing a "lookup" that ignores the type1 entries when searching through the type2s (which are already sorted as required). Test it, or just do whatever results in clearer code. Even without re-sorting, you can still use the merge-style optimisation unless "sorted by start index" is the wrong thing for the type1s.

Steve Jessop 2009-06-12 19:38:37

Answer 5

A:

Could I check, by before, do you mean immediately before (ie. t1_a, t2_b, t2_c, t2_d should just give the pair (t1_a, t2_b), or do you want all pairs where a type1 value occurs anywhere before a type2 one within the same block. (ie (t1_a, t2_b), (t1_a, t2_c), (t1_a, t2_d) for the previous example).

In either case, you should be able to do this with a single pass over your list (assuming sorted by id, then start index).

Here's a solution assuming the second option (every pair):

import itertools, operator

def find_t1_t2(seq):
    """Find every pair of type1, type2 values where the type1 occurs 
    before the type2 within a block with the same id.

    Assumes sequence is ordered by id, then start location.
    Generates a sequence of tuples of the type1,type2 entries.
    """
    for group, items in itertools.groupby(seq, operator.itemgetter(0)):
        type1s=[]
        for item in items:
            if item[1] == TYPE1: 
                type1s.append(item)
            elif item[1] == TYPE2:
                for t1 in type1s:
                    yield t1 + item[1:]

If it's just immediately before, it's even simpler: just keep track of the previous item and yield the tuple whenever it is type1 and the current one is type2.

Here's an example of usage, and the results returned:

l=[[1, TYPE1, 10, 15],
   [1, TYPE2, 20, 25],  # match with first
   [1, TYPE2, 30, 35],  # match with first (2 total matches)

   [2, TYPE2, 10, 15],  # No match
   [2, TYPE1, 20, 25],
   [2, TYPE1, 30, 35],
   [2, TYPE2, 40, 45],  # Match with previous 2 type1s.
   [2, TYPE1, 50, 55],
   [2, TYPE2, 60, 65],  # Match with 3 previous type1 entries (5 total)
   ]

for x in find_t1_t2(l):
    print x

This returns:

[1, 'type1', 10, 15, 'type2', 20, 25]
[1, 'type1', 10, 15, 'type2', 30, 35]
[2, 'type1', 20, 25, 'type2', 40, 45]
[2, 'type1', 30, 35, 'type2', 40, 45]
[2, 'type1', 20, 25, 'type2', 60, 65]
[2, 'type1', 30, 35, 'type2', 60, 65]
[2, 'type1', 50, 55, 'type2', 60, 65]

Brian 2009-06-12 20:53:47

ansaurus

tags:

views:

answers:

Efficient Tuple List Comparisons

related questions