views:

105

answers:

2

I won't go into the details of the problem I'm trying to solve, but it deals with a large string and involves finding overlapping intervals that exist in the string. I can only use one of the intervals that overlap, so I wanted to separate these intervals out and analyze them individually. I was wondering what algorithm to use to do this as efficiently as possible.

I must stress that speed is paramount here. I need to separate the intervals as quickly as possible. The algorithm that came to my mind was an Interval Tree, but I wasn't sure if that's the best that we can do.

Interval Trees can be queried in O(log n) time, n being the number of intervals and construction requires O(nlog n) time, though I wanted to know if we can cut down on either.

Thanks!

Edit: I know the question is vague. I apologize for the confusion. I suggest that people look at the answer by Aaron Huran and the comments on the same. That should help clarify things a lot more.

+1  A: 

You are looking to calculate the difference between the two strings right? What language are you trying to do this in?

Update: Without any sort of criteria on how you will select which intervals to use there are an enormous possible solutions.

One method would be to take the lowest starting number, grab its end. Grab the next starting number that is higher than the previous interval's end. Get this interval's end and repeat.

So for 0-4, 5-13, 8-19, 10-12 You get: 0-4, 5-13 and ignore the others.

Aaron Harun
@Aaron: I'm using Java but I'm not looking to calculate the difference between two strings. I have just one string, that has multiple intervals defined within it. I need to use these intervals for some computation, but I can only use one of all the overlapping intervals, which is why I'm looking to separate them out.
efficiencyIsBliss
@Dharmesh: What do you mean by "multiple intervals defined within it"? Are you trying to parse a data format? If so, could you provide some sample input?
Anon.
@Aaron: I've attached scores to the string intervals, and I need the one with the highest score. So, it might be that the interval 10-12 has the highest score, but if we use the method you described above, we would have O(n) running time.
efficiencyIsBliss
Also, there's sample input in the comments above.
efficiencyIsBliss
A: 

well, I was bored last night so I did this in Python. It's recursive unnecessarily (I just read The Little Schemer and think recursion is super neat right now) but it solves your problem and handles all input I threw at it.

intervals = [(0,4), (5,13), (8,19), (10,12)]  

def overlaps(x,y): 
   x1, x2 = x 
   y1, y2 = y 
   return ( 
      (x1 <= y1 <= x2) or 
      (x1 <= y2 <= x2) or  
      (y1 <= x1 <= y2) or 
      (y1 <= x2 <= y2) 
   ) 

def find_overlaps(intervals, checklist=None, pending=None): 
   if not intervals:  
      return [] 

   interval = intervals.pop() 

   if not checklist: 
      return g(intervals, [interval], [interval]) 

   check = checklist.pop() 

   if overlaps(interval, check): 
      pending = pending or [] 
      checklist.append(check) 
      checklist.append(interval) 
      return pending + [interval] + g(intervals, checklist) 
   else: 
      intervals.append(interval) 
      return g(intervals, checklist) 

Use like this:

>>> find_overlaps(intervals)
[(10, 12), (8, 19), (5, 13)]

Note that it returns all overlapping intervals in REVERSE order of their start point. Hopefully that's a minor issue. That's only happening because I'm using push() and pop() on the list, which operates on the end of the list, rather than insert(0) and pop(0) which operates on the beginning.

This isn't perfect, but it runs in linear time. Also remember that the size of the actual string doesn't matter at all - the running time is relative to the number of intervals, not the size of the string.

Triptych
Yes, I know that the length of the string does not matter. I've also implemented something similar in linear time, but I was hoping I could do something better. I think interval trees can get us down to O(log n), though I haven't read up on them properly yet.
efficiencyIsBliss