views:

312

answers:

5

I am kind of hitting a wall on this problem and I was wondering if some fresh brains could help me out.

I have a large list of four element tuples in the format:

(ID number, Type, Start Index, End Index)

Previously in the code, I have searched through thousands of blocks of text for two specific types of substrings. These tuples store in which large chunk of text the substring was found, which of the two types of substrings it is, and the start and end index of this substring.

The final goal is to look through this list to find all instances where a type 1 substring occurs before a type 2 substring in a block of text with the same ID. Then I would like to store these objects in the format (ID, Type 1, Start, End, Type2, Start, End).

I've tried to mess around with a bunch of stuff that was super inefficient. I have the list sorted by ID then Start Index, and if been trying varying ways of popping the items off the list for comparisons. I have to imagine there is a more elegant solution. Any brilliant people out there wish to assist my tired brain???

Thanks in advance

+1  A: 

Solution:

result = [(l1 + l2[1:]) 
          for l1 in list1 
          for l2 in list2 
          if (l1[0] == l2[0] and l1[3] < l2[2])
          ]

... with test code:

list1 = [(1, 'Type1', 20, 30,),
         (2, 'Type1', 20, 30,),
         (3, 'Type1', 20, 30,),
         (4, 'Type1', 20, 30,),
         (5, 'Type1', 20, 30,),
         (6, 'Type1', 20, 30,), # does not have Type2

         (8, 'Type1', 20, 30,), # multiple
         (8, 'Type1', 25, 35,), # multiple
         (8, 'Type1', 50, 55,), # multiple
         ]

list2 = [(1, 'Type2', 40, 50,), # after
         (2, 'Type2', 10, 15,), # before
         (3, 'Type2', 25, 28,), # inside
         (4, 'Type2', 25, 35,), # inside-after
         (4, 'Type2', 15, 25,), # inside-before
         (7, 'Type2', 20, 30,), # does not have Type1

         (8, 'Type2', 40, 50,), # multiple
         (8, 'Type2', 60, 70,), # multiple
         (8, 'Type2', 80, 90,), # multiple
         ]

result = [(l1 + l2[1:]) 
          for l1 in list1 
          for l2 in list2 
          if (l1[0] == l2[0] and l1[3] < l2[2])
          ]

print '\n'.join(str(r) for r in result)

It is not clear what result would you like if there is more then one occurrence of both Type1 and Type2 within the same text ID. Please specify.

van
I would like to have all possible type1 before type2 within the same ID.
what if you have more then 1 type2?
van
updated with multiple occurrences tests: please check if the result is as you want it to be in regard of ID = 8. Currently you get all combinations that fit, but you may want less. Advise?
van
Yes the results from ID8 yield the correct solution.
Heh, I thought by "super inefficient", the questioner meant O(n^2). Apparently not :-)
Steve Jessop
@onebyone. I know, I know. But often people over-size the issue. In such cases the simplicity is better choice.
van
+1  A: 

I don't know how many types you have. But If we assume you have only type 1 and type 2, then it sounds like a problem similar to a merge sort. Doing it with a merge sort, you make a single pass through the list.

Take two indexes, one for type 1 and one for type 2 (I1, I2). Sort the list by id, start1. Start I1 as the first instance of type1, and I2 as zero. If I1.id < I2.Id then increment I1. If I2.id < I1.id then increment I2. If I1.id = I2.id then check iStart.

I1 can only stop on a type one record and I2 can only stop on a type 2 record. Keep incrementing the index till it lands on an appropriate record.

You can make some assumptions to make this faster. When you find an block that succeeds, you can move I1 to the next block. Whenever I2 < I1, you can start I2 at I1 + 1 (WOOPS MAKE SURE YOU DONT DO THIS, BECAUSE YOU WOULD MISS THE FAILURE CASE!) Whenever you detect an obvious failure case, move I1 and I2 to the next block (on appropriate recs of course).

johnnycrash
+1  A: 

I recently did something like this. I might not be understanding your problem but here goes.

I would use a dictionary:

from collections import defaultdict:
masterdictType1=defaultDict(dict)
masterdictType2=defaultdict(dict)


for item in myList:
   if item[1]=Type1
       if item[0] not in masterdictType1:
           masterdictType1[item[0]]['begin']=item[2] # start index
           masterdictType1[item[0]]['end']=item[-1] # end index
   if item[1]=Type2
       if item[0] not in masterdictType2:
           masterdictType2[item[0]]['begin']=item[2] # start index
           masterdictType2[item[0]]['end']=item[-1] # end index

joinedDict=defaultdict(dict)

for id in masterdictType1:
    if id in masterdictType2:
        if masterdictType1[id]['begin']<masterdictType2[id]['begin']:
            joinedDict[id]['Type1Begin']=masterdictType1[id]['begin']
            joinedDict[id]['Type1End']=masterdictType1[id]['end']
            joinedDict[id]['Type2Begin']=masterdictType2[id]['begin']
            joinedDict[id]['Type2End']=masterdictType2[id]['end']

This gives you explicitness and gives you something that is durable since you can pickle the dictionary easily.

PyNEwbie
A: 

Assuming there are lots of entries for each ID, I would (pseudo-code)

    for each ID:
        for each type2 substring of that ID:
            store it in an ordered list, sorted by start point
        for each type1 substring of that ID:
            calculate the end point (or whatever)
            look it up in the ordered list
            if there's anything to the right, you have a hit

So, if you have control of the initial sort, then instead of (ID, start), you want them sorted by ID, then by type (2 before 1). Then within the type, sort by start point for type2, and the offset you're going to compare for type1. I'm not sure whether by "A before B" you mean "A starts before B starts" or "A ends before B starts", but do whatever's appropriate.

Then you can do the whole operation by running over the list once. You don't need to actually construct an index of type2s, because they're already in order. Since the type1s are sorted too, you can do each lookup with a linear or binary search starting from the result of the previous search. Use a linear search if there are lots of type1s compared with type2s (so results are close together), and a binary search if there are lots of type2s compared with type1s (so results are sparse). Or just stick with the linear search as it's simpler - this lookup is the inner loop, but its performance might not be critical.

If you don't have control of the sort, then I don't know whether it's faster to build the list of type2 substrings for each ID as you go; or to sort the entire list before you start into the required order; or just to work with what you've got, by writing a "lookup" that ignores the type1 entries when searching through the type2s (which are already sorted as required). Test it, or just do whatever results in clearer code. Even without re-sorting, you can still use the merge-style optimisation unless "sorted by start index" is the wrong thing for the type1s.

Steve Jessop
A: 

Could I check, by before, do you mean immediately before (ie. t1_a, t2_b, t2_c, t2_d should just give the pair (t1_a, t2_b), or do you want all pairs where a type1 value occurs anywhere before a type2 one within the same block. (ie (t1_a, t2_b), (t1_a, t2_c), (t1_a, t2_d) for the previous example).

In either case, you should be able to do this with a single pass over your list (assuming sorted by id, then start index).

Here's a solution assuming the second option (every pair):

import itertools, operator

def find_t1_t2(seq):
    """Find every pair of type1, type2 values where the type1 occurs 
    before the type2 within a block with the same id.

    Assumes sequence is ordered by id, then start location.
    Generates a sequence of tuples of the type1,type2 entries.
    """
    for group, items in itertools.groupby(seq, operator.itemgetter(0)):
        type1s=[]
        for item in items:
            if item[1] == TYPE1: 
                type1s.append(item)
            elif item[1] == TYPE2:
                for t1 in type1s:
                    yield t1 + item[1:]

If it's just immediately before, it's even simpler: just keep track of the previous item and yield the tuple whenever it is type1 and the current one is type2.

Here's an example of usage, and the results returned:

l=[[1, TYPE1, 10, 15],
   [1, TYPE2, 20, 25],  # match with first
   [1, TYPE2, 30, 35],  # match with first (2 total matches)

   [2, TYPE2, 10, 15],  # No match
   [2, TYPE1, 20, 25],
   [2, TYPE1, 30, 35],
   [2, TYPE2, 40, 45],  # Match with previous 2 type1s.
   [2, TYPE1, 50, 55],
   [2, TYPE2, 60, 65],  # Match with 3 previous type1 entries (5 total)
   ]

for x in find_t1_t2(l):
    print x

This returns:

[1, 'type1', 10, 15, 'type2', 20, 25]
[1, 'type1', 10, 15, 'type2', 30, 35]
[2, 'type1', 20, 25, 'type2', 40, 45]
[2, 'type1', 30, 35, 'type2', 40, 45]
[2, 'type1', 20, 25, 'type2', 60, 65]
[2, 'type1', 30, 35, 'type2', 60, 65]
[2, 'type1', 50, 55, 'type2', 60, 65]
Brian