ansaurus

Question

Case-insensitive comparison of sets in Python

Answer 1

+3 A:

First, don't you mean a.intersection(b)? The intersection (if case insensitive) would be set(['today']). The difference would be set(['i', 'am', 'fine'])

Here are two ideas:

1.) Write a function to convert the elements of both sets to lowercase and then do the intersection. Here's one way you could do it:

>>> intersect_with_key = lambda s1, s2, key=lambda i: i: set(map(key, s1)).intersection(map(key, s2))
>>> fs1 = frozenset('Today I am fine'.split())
>>> fs2 = frozenset('Hello how are you TODAY'.split())
>>> intersect_with_key(fs1, fs2)
set([])
>>> intersect_with_key(fs1, fs2, key=str.lower)
set(['today'])
>>>

This is not very efficient though because the conversion and new sets would have to be created on each call.

2.) Extend the frozenset class to keep a case insensitive copy of the elements. Override the intersection method to use the case insensitive copy of the elements. This would be more efficient.

lost-theory 2009-09-25 23:44:36

`lambda` is redundant here, just bare `key=str.lower` will work in this case.

J.F. Sebastian 2009-09-25 23:50:25

Ah right :) Thanks.

lost-theory 2009-09-25 23:54:29

I've fixed typo: `s.lower` -> `str.lower`.

J.F. Sebastian 2009-09-26 02:16:43

Yes, meant intersection ... doh.

Adam Nelson 2009-09-28 14:18:16

Answer 2

+5 A:

Unfortunately, even if you COULD "change on the fly" the comparison-related special methods of the sets' items (__lt__ and friends -- actually, only __eq__ needed the way sets are currently implemented, but that's an implementatio detail) -- and you can't, because they belong to a built-in type, str -- that wouldn't suffice, because __hash__ is also crucial and by the time you want to do your intersection it's already been applied, putting the sets' items in different hash buckets from where they'd need to end up to make intersection work the way you want (i.e., no guarantee that 'Today' and 'today' are in the same bucket).

So, for your purposes, you inevitably need to build new data structures -- if you consider it "inelegant" to have to do that at all, you're plain out of luck: built-in sets just don't carry around the HUGE baggage and overhead that would be needed to allow people to change comparison and hashing functions, which would bloat things by 10 times (or more) for the sae of a need felt in (maybe) one use case in a million.

If you have frequent needs connected with case-insensitive comparison, you should consider subclassing or wrapping str (overriding comparison and hashing) to provide a "case insensitive str" type cistr -- and then, of course, make sure than only instances of cistr are (e.g.) added to your sets (&c) of interest (either by subclassing set &c, or simply by paying care). To give an oversimplified example...:

class ci(str):
  def __hash__(self):
    return hash(self.lower())
  def __eq__(self, other):
    return self.lower() == other.lower()

class cifrozenset(frozenset):
  def __new__(cls, seq=()):
    return frozenset((ci(x) for x in seq))

a = cifrozenset(('Today','I','am','fine'))
b = cifrozenset(('hello','how','are','you','today'))

print a.intersection(b)

this does emit frozenset(['Today']), as per your expressed desire. Of course, in real life you'd probably want to do MUCH more overriding (for example...: the way I have things here, any operation on a cifrozenset returns a plain frozenset, losing the precious case independence special feature -- you'd probably want to ensure that a cifrozenset is returned each time instead, and, while quite feasible, that's NOT trivial).

Alex Martelli 2009-09-26 02:03:24

I guess that caching self.lower() could be an interesting first optimization?

NicDumZ 2009-09-26 09:41:32

For the problem as given (wrapping the building and intersection into a function, using return instead of print) the code I give takes 30 usec, with caching of self.lower() 44 usec. Caching the hash as well, 49 usec. Longer (3x) strings appear to behave similarly (i.e., no caching seems to work best). Accessing the cached values is a tad slower than recomputing them on the fly.

Alex Martelli 2009-09-26 17:16:28

Answer 3

+4 A:

Here's version that works for any pair of iterables:

def intersection(iterableA, iterableB, key=lambda x: x):
    """Return the intersection of two iterables with respect to `key` function.

    """
    def unify(iterable):
        d = {}
        for item in iterable:
            d.setdefault(key(item), []).append(item)
        return d

    A, B = unify(iterableA), unify(iterableB)

    return [(A[k], B[k]) for k in A if k in B]

Example:

print intersection('Today I am fine'.split(),
                   'Hello How a re you TODAY'.split(),
                   key=str.lower)
# -> [(['Today'], ['TODAY'])]

J.F. Sebastian 2009-09-26 02:46:07

Just to note, I get this error when doing this on unicode text:*** TypeError: descriptor 'lower' requires a 'str' object but received a 'unicode'I'm looking to a solution now.

Adam Nelson 2009-09-30 16:08:40

@Adam: Use `unicode.lower` for unicode strings.

J.F. Sebastian 2009-09-30 22:08:05

Answer 4

A:

>>> a_, b_ = map(set, [map(str.lower, a), map(str.lower, b)])
>>> a_ & b_
set(['today'])

Or... with less maps,

>>> a_ = set(map(str.lower, a))
>>> b_ = set(map(str.lower, b))
>>> a_ & b_
set(['today'])

eric.frederich 2009-09-28 15:27:51

ansaurus

tags:

views:

answers:

Case-insensitive comparison of sets in Python

related questions