views: 512
answers: 4
I have a sorted list of datetimes in text format. The format of each entry is '2009-09-10T12:00:00'.

I want to find the entry closest to a target. There are many more entries than the number of searches I would have to do.

I could convert each entry to a number and then search numerically (for example, these approaches), but that seems like excessive effort.

Is there a better way than this:

def mid(res, target):
    """res is a list of entries, sorted by dt (dateTtime);
    each entry is a dict with a dt and some other info"""
    n = len(res)
    low = 0
    high = n - 1

    # find the first entry whose dt is >= target
    while low < high:
        mid = (low + high) // 2
        t = res[mid]['dt']
        if t < target:
            low = mid + 1
        else:
            high = mid

    # check whether the prior entry is closer to the target
    i = max(0, low - 1)
    a = dttosecs(res[i]['dt'])
    b = dttosecs(res[low]['dt'])
    t = dttosecs(target)
    if abs(a - t) < abs(b - t):
        return i
    else:
        return low

import time
def dttosecs(dt):
    # string to seconds since the beginning
    date,tim = dt.split('T')
    y,m,d = date.split('-')
    h,mn,s = tim.split(':')
    y = int(y)
    m = int(m)
    d = int(d)
    h = int(h)
    mn = int(mn)
    s = min(59,int(float(s)+0.5)) # round to nearest second
    s = int(s)
    secs = time.mktime((y,m,d,h,mn,s,0,0,-1))
    return secs
+4  A: 

You want the bisect module from the standard library. It will do a binary search and tell you the correct insertion point for a new value into an already sorted list. Here's an example that will print the place in the list where target would be inserted:

from bisect import bisect
dates = ['2009-09-10T12:00:00', '2009-09-11T12:32:00', '2009-09-11T12:43:00']
target = '2009-09-11T12:40:00'
print bisect(dates, target)

From there you can compare the entries just before and after the insertion point, which in this case would be dates[i-1] and dates[i], to see which one is closer to your target.
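
For instance, here is a minimal sketch of that comparison (the closest helper and its fmt default are illustrative names, and every string is assumed to parse with strptime):

from bisect import bisect
from datetime import datetime

def closest(dates, target, fmt='%Y-%m-%dT%H:%M:%S'):
    # insertion point for target in the sorted list of strings
    i = bisect(dates, target)
    # the only candidates are the entries just before and just after that point
    candidates = dates[max(i - 1, 0):i + 1]
    t = datetime.strptime(target, fmt)
    # pick the candidate whose parsed time is nearest the target
    return min(candidates, key=lambda d: abs(datetime.strptime(d, fmt) - t))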

Eli Courtwright
+1 you beat me to it :)
Nadia Alramli
I'm using the algorithm from bisect (see the source code from bisect.py). I'm doing that rather than using bisect because my list is [{'dt': dt1, 'x': x1, 'y':y1}, {}, {}, ...] rather than [dt1, dt2,...]. Other than using the bisect module rather than the bisect algorithm, how is your suggestion different than what I'm doing? Or is it that bisect is implemented in C?
foosion
it is easier to convert your input than to code and test bisect yourself, which has already been done efficiently in the standard lib
Anurag Uniyal
How would you convert the input? Given I'm using the bisect code, I don't see how using the code in the module is better than using the code directly in my program.
foosion
@foosion, so how are you injecting the bisect code from the bisect module into your program?
Anurag Uniyal
@foosion: the bisect module's implementation can change and improve; you'll benefit if you use it directly. See my answer for how to decorate/undecorate your list.
nosklo
@anurag: compare the source code for the bisect.py module to the while loop in my mid function
foosion
+2  A: 
import bisect

def mid(res, target):
    keys = [r['dt'] for r in res]
    return res[bisect.bisect_left(keys, target)]
nosklo
Not bad. I'd move the keys list comprehension out of mid() so that we don't redo this for every search
foosion
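
A minimal sketch of that suggestion (the sample res list is hypothetical; keys is built once and reused for every search):

import bisect

res = [{'dt': '2009-09-10T12:00:00'}, {'dt': '2009-09-11T12:32:00'}]  # hypothetical sample data
keys = [r['dt'] for r in res]  # built once, outside the search function

def mid(res, keys, target):
    return res[bisect.bisect_left(keys, target)]

print mid(res, keys, '2009-09-11T12:00:00')  # -> {'dt': '2009-09-11T12:32:00'}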
+1  A: 

First, change to this.

import datetime
def parse_dt(dt):
    return datetime.datetime.strptime( dt, "%Y-%m-%dT%H:%M:%S" )

This removes much of the "effort".

Consider this as the search.

from bisect import bisect

def mid( res, target ):
    """res is a list of entries, sorted by dt (dateTtime) 
       each entry is a dict with a dt and some other info
    """
    times = [ parse_dt(r['dt']) for r in res ]
    index= bisect( times, parse_dt(target) )
    return times[index]

This doesn't seem like very much "effort". It also doesn't depend on your timestamp strings sorting correctly as text: the comparisons are done on datetime objects, so you can change to any timestamp format (updating the strptime format string) and be assured that this will still work.
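
For example, a quick illustration using the parse_dt above (the sample times are illustrative): the strings only support ordering, while the datetime objects support subtraction, which is what the "closest" test needs.

a = parse_dt('2009-09-17T11:00:00')
b = parse_dt('2009-09-17T15:00:00')
t = parse_dt('2009-09-17T12:00:00')
print abs(a - t) < abs(b - t)  # True: 11:00 is one hour from 12:00, 15:00 is three hours away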

S.Lott
+1 Good point about not being dependent on the time format.
Todd Owen
The difference between your parse_dt and my dttosecs is that dttosecs returns a number so that we can determine which of two values is closest to the target. For example, is '2009-09-17T12:00:00' closer to '2009-09-17T11:00:00' or '2009-09-17T15:00:00'? We don't need parse_dt just to use bisect - see nosklo's answer.
foosion
@foosion: you're missing the point. The `datetime` object is essential to do this reliably. I'm saying that you should not depend on a proper string format. I'm also saying that conversion is not "excess effort".
S.Lott
@S.Lott, the string format is dependable (it's output from a GPS device converted to a GPX file by a reliable program and then parsed to produce res[{}, {}, ...]). Does this new info affect your comment?
foosion
@foosion: No. You're working with `datetime` objects, which are only input as strings. Convert the input strings to proper `datetime` objects and use the `datetime` objects properly. I'll try to make this simple: String is for Input and Output. `datetime` is for all processing.
S.Lott
@S.Lott, I'm working with times, which can be represented as strings, datetime objects, seconds since the epoch, etc. Strings can be compared and sorted, as can datetime objects, etc. I don't see why datetime objects are any better for testing whether a > b than string objects. What am I missing?
foosion
@foosion: You're not working with strings. You're working with `datetime` objects. Coincidentally (and it's only coincidence) you happen to be able to create your strings in a similar order. You cannot do ANY OTHER OPERATION except simple order-by comparison. You can't change formats. Nor can you do proper linear interpolation to see which target string has a smaller difference from your target time. There are many, many things a string representation of a `datetime` cannot do. String representations are for input-output only and nothing else. Strings have limitations.
S.Lott
+4  A: 

"Copy and paste coding" (getting bisect's sources into your code) is not recommended as it carries all sorts of costs down the road (lot of extra source code for you to test and maintain, difficulties dealing with upgrades in the upstream code you've copied, etc, etc); the best way to reuse standard library modules is simply to import them and use them.

However, doing one pass to transform the dictionaries into meaningfully comparable entries is O(N), which (even though each step of the pass is simple) will eventually swamp the O(log N) time of the search proper. Since bisect can't support a key= extractor like sort does, what's the solution to this dilemma -- how can you reuse bisect by import and call, without a preliminary O(N) step...?

As quoted here, the solution is in David Wheeler's famous saying, "All problems in computer science can be solved by another level of indirection". Consider e.g....:

import bisect

listofdicts = [
  {'dt': '2009-%2.2d-%2.2dT12:00:00' % (m,d) }
  for m in range(4,9) for d in range(1,30)
  ]

class Indexer(object):
  def __init__(self, lod, key):
    self.lod = lod
    self.key = key
  def __len__(self):
    return len(self.lod)
  def __getitem__(self, idx):
    return self.lod[idx][self.key]


lookfor = listofdicts[len(listofdicts)//2]['dt']

def mid(res=listofdicts, target=lookfor):
    keys = [r['dt'] for r in res]
    return res[bisect.bisect_left(keys, target)]

def midi(res=listofdicts, target=lookfor):
    wrap = Indexer(res, 'dt')
    return res[bisect.bisect_left(wrap, target)]

if __name__ == '__main__':
  print '%d dicts on the list' % len(listofdicts)
  print 'Looking for', lookfor
  print mid(), midi()
assert mid() == midi()

The output (just running this indexer.py as a check, then with timeit, two ways):

$ python indexer.py 
145 dicts on the list
Looking for 2009-06-15T12:00:00
{'dt': '2009-06-15T12:00:00'} {'dt': '2009-06-15T12:00:00'}
$ python -mtimeit -s'import indexer' 'indexer.mid()'
10000 loops, best of 3: 27.2 usec per loop
$ python -mtimeit -s'import indexer' 'indexer.midi()'
100000 loops, best of 3: 9.43 usec per loop

As you see, even in a modest task with 145 entries in the list, the indirection approach can have a performance that's three times better than the "key-extraction pass" approach. Since we're comparing O(N) vs O(log N), the advantage of the indirection approach grows without bounds as N increases. (For very small N, the higher multiplicative constants due to the indirection make the key-extraction approach faster, but this is soon surpassed by the big-O difference). Admittedly, the Indexer class is extra code -- however, it's reusable over ALL tasks of binary searching a list of dicts sorted by one entry in each dict, so having it in your "container-utilities bag of tricks" offers good return on that investment.

So much for the main search loop. For the secondary task of converting two entries (the one just below and the one just above the target) and the target to a number of seconds, consider, again, a higher-reuse approach, namely:

import time

adt = '2009-09-10T12:00:00'

def dttosecs(dt=adt):
    # string to seconds since the beginning
    date,tim = dt.split('T')
    y,m,d = date.split('-')
    h,mn,s = tim.split(':')
    y = int(y)
    m = int(m)
    d = int(d)
    h = int(h)
    mn = int(mn)
    s = min(59,int(float(s)+0.5)) # round to nearest second
    s = int(s)
    secs = time.mktime((y,m,d,h,mn,s,0,0,-1))
    return secs

def simpler(dt=adt):
  return time.mktime(time.strptime(dt, '%Y-%m-%dT%H:%M:%S'))

if __name__ == '__main__':
  print adt, dttosecs(), simpler()
assert dttosecs() == simpler()

Here, there is no performance advantage to the reuse approach (indeed, and on the contrary, dttosecs is faster) -- but then, you only need to perform three conversions per search, no matter how many entries are on your list of dicts, so it's not clear whether that performance issue is germane. Meanwhile, with simpler you only have to write, test and maintain one simple line of code, while dttosecs is a dozen lines; given this ratio, in most situations (i.e., excluding absolute bottlenecks), I would prefer simpler. The important thing is to be aware of both approaches and of the tradeoffs between them so as to ensure the choice is made wisely.

Alex Martelli
+1, comprehensive
Nadia Alramli
Very comprehensive. I finally agree regarding reusing standard code (frankly I wasn't aware of the bisect module - I'd found the algorithm elsewhere) and agree regarding the virtues of simplicity (unless there's a real performance hit). I'll think some more about the search loop timing issues. FWIW, the number of searches is a few orders of magnitude smaller than the size of the list being searched (the list I'm searching is GPS output that creates an entry every second or so and I'm searching for events that usually occur a few times an hour, give or take)
foosion
If we move the 'keys =' list comprehension out of mid(), time per loop goes under 2 usec (I'm getting approximately the same speed as you on what you reported). This is much faster than leaving it in or using Indexer. res will not change between searches, so moving it out seems a winner.
foosion
"res will not change between searches" is the killer bit of specs that wasn't clear earlier -- under such special circumstances a one-off expensive preprocessing step to prep an auxiliary data structure, followed by many uses of that structure, of course becomes a much more attractive proposition.
Alex Martelli
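
A minimal sketch of that one-off preprocessing idea, combining the suggestions above (build_index and closest_entry are illustrative names; the list is assumed non-empty and sorted by 'dt'):

import bisect
from datetime import datetime

FMT = '%Y-%m-%dT%H:%M:%S'

def build_index(res):
    # one-off O(N) preprocessing: parse every 'dt' string once
    return [datetime.strptime(r['dt'], FMT) for r in res]

def closest_entry(res, times, target):
    # O(log N) per search: bisect into the precomputed times, then compare the two neighbors
    t = datetime.strptime(target, FMT)
    i = bisect.bisect_left(times, t)
    lo, hi = max(i - 1, 0), min(i, len(res) - 1)
    return res[lo] if abs(times[lo] - t) <= abs(times[hi] - t) else res[hi]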