views:

331

answers:

1

Hey,

I need to build a full "number range" set given a series of numbers. I start with a list such as :

ID   START  
*    0  
a    4  
b    70  
c    700  
d    701  
e    85
  • where "def" is the default range & should "fill-in" the gaps
  • "overlaps" are value (70, 700, 701) in starting data

And need the following result:

ID  START  END  
*     0 - 39  
a     4 - 49  
*     5 - 69  
c   700 - 7009  
d   701 - 7019  
b   702 - 709  
*    71 - 849  
e    85 - 859  
*    86 - 9

What I am trying to figure out is if there is some sort of algorithm out there or design pattern to tackle this. I have some ideas but I thought I'd run it by the "experts" first. I am using Python.

Any ideas / direction would be greatly appreciated. Some initial ideas I have:

  • Build a "range" list w/ the start & end values padded to the full length. So default would be 0000 to 9999
  • Build a "splits" list that is built on the fly
  • Loop through "range" list comparing each value to the values in the splits list.
  • In the event that an overlap is found, remove the value in the splits list and add the new range(s).
A: 
import operator

ranges = {
    '4'  : 'a',
    '70' : 'b',
    '700': 'c',
    '701': 'd',
    '85' : 'e',
    '87' : 'a',
}

def id_for_value(value):
    possible = '*'
    for idvalue, id in sorted(ranges.iteritems()):
        if value.startswith(idvalue):
            possible = id
        elif idvalue > value:
            break
    return possible

That is enough to know the id of a certain value. Testing:

assert id_for_value('10') == '*'
assert id_for_value('499') == 'a'
assert id_for_value('703') == 'b'
assert id_for_value('7007') == 'c'
assert id_for_value('7017') == 'd'
assert id_for_value('76') == id_for_value('83') == '*'
assert id_for_value('857') == 'e'
assert id_for_value('8716') == 'a'

If you really want the range, you can use itertools.groupby to calculate it:

def firstlast(iterator):
    """ Returns the first and last value of an iterator"""
    first = last = iterator.next()
    for value in iterator:
        last = value
    return first, last

maxlen = max(len(x) for x in ranges) + 1
test_range = ('%0*d' % (maxlen, i) for i in xrange(10 ** maxlen))
result = dict((firstlast(gr), id) 
              for id, gr in itertools.groupby(test_range, key=id_for_value))

Gives:

{('0000', '3999'): '*',
 ('4000', '4999'): 'a',
 ('5000', '6999'): '*',
 ('7000', '7009'): 'c',
 ('7010', '7019'): 'd',
 ('7020', '7099'): 'b',
 ('7100', '8499'): '*',
 ('8500', '8599'): 'e',
 ('8600', '8699'): '*',
 ('8700', '8799'): 'a',
 ('8800', '9999'): '*'}
nosklo
wow! let me try this out ..... one complexity that i left out is the fact that you can have multiple ranges for the same "ID", which would affect the initial dictionary. Thanks!!
John
@John: I have updated my answer to reflect multiple ranges for the same id. I am just using the id as value now.
nosklo
@Nosklo -- for some reason when I have a large START range (7190000), the process runs out of memory. It doesn't matter if I have 3 ranges or 20, a range of that size causes a memory problem. Thoughts? THANKS AGAIN!!
John
@Nosklo -- I know WHY it's running into a memory issue (as im sure you do too). With a range that large, the maxlen becomes 10 million and we are now processing a significantly larger setof data. Could we somehow only build out subsets of the test_range? So - only around 7190000 for that size?
John
@John: You run out of the memory because the list() function is actually creating a big list in memory. I've edited the answer and added a firstlast function that won't create the list. That will prevent the memory error but it is slow. If you need large ranges maybe you could take another approach.
nosklo
@Nosklo : thanks again. Yeah - I need to process larger ranges. I really liked where this was going. Do you have any other thoughts / ideas about another approach? I suppose the full "test_range" approach is key aspect of your solution.
John