views:

180

answers:

8

From the question "Is partitioning easier than sorting?":

Suppose I have a list of items and an equivalence relation on them, and comparing two items takes constant time. I want to return a partition of the items, e.g. a list of linked lists, each containing all equivalent items.

One way of doing this is to extend the equivalence to an ordering on the items and order them (with a sorting algorithm); then all equivalent items will be adjacent.

(Keep in mind the distinction between equality and equivalence.)

Clearly the equivalence relation must be considered when designing the ordering algorithm. For example, if the equivalence relation is "people born in the same year are equivalent", then sorting based on the person's name is not appropriate.

  1. Can you suggest a datatype and equivalence relation such that it is not possible to create an ordering?

  2. How about a datatype and equivalence relation where it is possible to create such an ordering, but it is not possible to define a hash function on the datatype that will map equivalent items to the same hash value?

(Note: it is OK if nonequivalent items map to the same hash value (collide) -- I'm not asking to solve the collision problem -- but on the other hand, hashFunc(item) { return 1; } is cheating.)

My suspicion is that for any datatype/equivalence pair where it is possible to define an ordering, it will also be possible to define a suitable hash function, and they will have similar algorithmic complexity. A counterexample to that conjecture would be enlightening!
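The sort-then-group approach described above can be sketched in a few lines. This is a minimal illustrative sketch, not from the question itself; `partition_by_sorting` and the birth-year example are made up, and the `key` function is what "extends the equivalence to an ordering":

```python
from itertools import groupby

def partition_by_sorting(items, key):
    """Partition items into equivalence classes, where two items are
    equivalent iff key() gives them the same value.  Sorting on the key
    extends the equivalence to a total order, so equivalent items end
    up adjacent and one linear scan groups them."""
    ordered = sorted(items, key=key)
    return [list(group) for _, group in groupby(ordered, key=key)]

# Example: people are equivalent iff born in the same year.
people = [("Ada", 1815), ("Alan", 1912), ("Grace", 1906), ("Kurt", 1906)]
partition_by_sorting(people, key=lambda p: p[1])
# [[("Ada", 1815)], [("Grace", 1906), ("Kurt", 1906)], [("Alan", 1912)]]
```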

+1  A: 

One example that seems to fit your request is an IEEE floating point type. In particular, a NaN doesn't compare equal to anything else (not even to itself) unless you take special steps to detect that it's a NaN and treat all NaNs as equivalent.

Likewise for hashing. If memory serves, +0.0 and -0.0 have different bit patterns (they differ in the sign bit) but compare precisely equal. I could be remembering the details a bit wrong, but the idea is the same in any case -- the right bit pattern in one part of the number means it has the value 0.0, regardless of the remaining bit(s). Unless your hash function takes this into account, it will produce different hash values for numbers that really compare precisely equal.

Jerry Coffin
Interesting example. A hash function for that datatype could be "if the value compares equal to zero then return 0, else if it is a NaN then return 1, else hash the value into the range [2..n]". Or if the desire is for NaNs with different bit patterns to be partitioned into distinct groups, then skip the NaN check and hash the bits directly. I don't think this example meets the criteria in #1 or #2, though.
Dan
NaN ignores the bits in the mantissa, but zero does not. There is an implied one bit before the mantissa unless the exponent field is all zeros.
drawnonward
@Dan: ultimately, it doesn't prevent you from writing *some* function that provides an ordering -- if you can represent something in a computer at all, you can probably do that. The ordering you get may not mean anything, but that's exactly what we have here -- there's a reason a NaN is specified as comparing unequal to everything, including itself, and if we define NaNs as comparing equal (or equivalent) we're simply forcing "there must be equivalence" onto a situation that doesn't really support it.
Jerry Coffin
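The comments above can be combined into a sketch of an equivalence-respecting hash for IEEE doubles. This is illustrative only: `float_hash` and the bucket count `n` are assumptions, and the range [2..n) follows the scheme Dan describes:

```python
import math
import struct

def float_hash(x, n=1024):
    """Hash an IEEE double so that values that compare equal hash equal:
    +0.0 and -0.0 collapse to one bucket, and (by choice) all NaNs land
    in a single bucket.  Hashing the raw bit pattern instead would split
    these equivalence classes."""
    if x == 0.0:          # true for both +0.0 and -0.0
        return 0
    if math.isnan(x):     # NaNs never compare equal, even to themselves
        return 1
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return 2 + bits % (n - 2)

float_hash(0.0) == float_hash(-0.0)   # True, although the bit patterns differ
```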
+4  A: 

The answer to questions 1 and 2 is no, in the following sense: given a computable equivalence relation ≡ on strings {0, 1}*, there exists a computable function f such that x ≡ y if and only if f(x) = f(y), and ordering or hashing the values f(x) then gives an ordering or hash function for the equivalence. One definition of f is simple, though very slow to compute: enumerate {0, 1}* in lexicographic order (ε, 0, 1, 00, 01, 10, 11, 000, …) and return the first string equivalent to x. The enumeration eventually reaches x itself, so the search always halts.
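A sketch of that canonicalising function f. The `same_ones` equivalence is just a toy example to exercise it; any computable equivalence predicate works:

```python
from itertools import count, product

def canonical(x, equiv):
    """Return the first string, in the enumeration of {0,1}* by length
    and then lexicographically, that is equivalent to x.  Since x itself
    appears in the enumeration, the loop always terminates, and two
    strings get the same canonical form iff they are equivalent."""
    for length in count(0):
        for bits in product("01", repeat=length):
            s = "".join(bits)
            if equiv(s, x):
                return s

# Toy equivalence: strings are equivalent iff they contain the same number of 1s.
same_ones = lambda a, b: a.count("1") == b.count("1")
canonical("1010", same_ones)  # "11", the first string with two 1s
```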

A good answer for those questions, so +1. But I want to say that this is not "the final answer" if you are interested in algorithmic complexity, as the OP is. Computability says nothing about how long the computation takes.
j_random_hacker
I agree but doubt that a "final answer" is forthcoming – none of the techniques in an introductory complexity theory course seem to have any promise, and complexity theorists generally don't have a lot of answers to this kind of question. I'm by no means a specialist, though, and it may be worth framing this question carefully and reposting it on MathOverflow.
If we're going to speculate, then Graph Canonization _might_ be harder than Graph Isomorphism: http://en.wikipedia.org/wiki/Graph_canonization
Not sure what you mean by your 1st comment... E.g. under a few weak assumptions we get that partitioning is faster than sorting whenever the number of equivalence classes is bounded by something less than log2(n). Also ShreevatsaR gives an O(n^2) algorithm for building a perfect hash function, which I think is relevant.
j_random_hacker
o(log(n)) partitions is not exactly what I'd call a weak assumption. ShreevatsaR's algorithm is about as interesting as this one, but it's optimal (again) under the strong assumption that the equivalence relation is an arbitrary oracle rather than a program.
"Weak" was the wrong word; "satisfied often enough in practice to be worth analysing" would have been better. (A very common case would be dividing a set into 2 classes, for example.) ShreevatsaR's assumption of O(1) time to decide equivalence isn't really an assumption at all, just an abbreviation that omits the uninteresting detail of how long the subroutine takes. Spelling it out in full, his algorithm takes O(kn^2) if the subroutine takes O(k) time.
j_random_hacker
+2  A: 

Creating a hash function and an ordering may be expensive but will usually be possible. One trick is to represent an equivalence class by a pre-arranged member of that class, for instance, the member whose serialised representation is smallest, when considered as a bit string. When somebody hands you a member of an equivalence class, map it to this canonicalised member of that class, and then hash or compare the bit string representation of that member. See e.g. http://en.wikipedia.org/wiki/Canonical#Mathematics
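For one concrete datatype -- bags of numbers, equivalent when they contain the same elements regardless of order -- this trick might look like the following sketch (`canonical_form` is an illustrative name, not from the answer):

```python
def canonical_form(bag):
    """Map each member of an equivalence class to a pre-arranged
    representative of that class.  Here the representative is the sorted
    tuple -- the member whose serialised form is smallest -- so
    equivalent bags hash and compare identically."""
    return tuple(sorted(bag))

canonical_form([3, 1, 2]) == canonical_form([2, 3, 1])  # True: same class
```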

Examples where this is not possible or convenient include when somebody gives you a pointer to an object that implements equals() but nothing else useful, and you do not get to break the type system to look inside the object, and when you get the results of a survey that only asks people to judge equality between objects. Also, Kruskal's algorithm uses union-find internally to process equivalence relations, so presumably nothing more cost-effective has been found for that particular application.

mcdowella
+1  A: 

As you probably know, comparison-based sorting takes at least n log n time (more formally, it is Omega(n log n)). If you know that there are always fewer than log2(n) equivalence classes, then partitioning is faster, since you only need to check equivalence against a single member of each class seen so far to decide which part of the partition a given element belongs to.

I.e. your algorithm could be like this:

For each x in our input set X:
    For each equivalence class Y seen so far:
        Choose any member y of Y.
        If x is equivalent to y:
            Add x to Y.
            Resume the outer loop with the next x in X.

    If we get to here then x is not in any of the equiv. classes seen so far.
    Create a new equivalence class with x as its sole member.

If there are m equivalence classes, the inner loop runs at most m times, so the whole thing takes O(nm) time. As ShreevatsaR observes in a comment, there can be at most n equivalence classes, so this is O(n^2) in the worst case. Note this works even if there is no total ordering on X.
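A direct Python rendering of the pseudocode above (a sketch; the names are illustrative):

```python
def partition(items, equivalent):
    """Group items into equivalence classes by comparing each new item
    against one representative per class seen so far: O(n*m) equivalence
    tests for n items and m classes, with no ordering or hashing needed."""
    classes = []
    for x in items:
        for cls in classes:
            if equivalent(x, cls[0]):   # any member serves as representative
                cls.append(x)
                break
        else:                           # no class matched: start a new one
            classes.append([x])
    return classes

# Example: integers are equivalent iff congruent mod 3.
partition(range(7), lambda a, b: a % 3 == b % 3)
# [[0, 3, 6], [1, 4], [2, 5]]
```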

j_random_hacker
More formally you would say it's `Ω(log n!)` :)
BlueRaja - Danny Pflughoeft
@BlueRaja: I thought so too at first but in fact log(n!) is Theta(n log n): http://en.wikipedia.org/wiki/Factorial#Rate_of_growth ;)
j_random_hacker
Yes I know :) An even easier way to see that (which doesn't require calculus) is that for all positive n, log((n/2)^(n/2)) ≤ log(n!) ≤ log(n^n)
BlueRaja - Danny Pflughoeft
A: 
belisarius
Not true if we assume the Axiom of Choice.
Yes, Well Ordering Theorem. I just added an answer referencing that!
Moron
@throwawayacct Here we are talking about a SPECIAL ordering induced by an equivalence relationship. Partially ordered sets do exist, depending on your choice of the ordering. http://en.wikipedia.org/wiki/Partially_ordered. If we forget the equivalence relation, the problem is different, and you can define an order, for Zermelo's sake!
belisarius
@belisarius: Even with a given equivalence relation, a well ordering exists between the equivalence classes, and that is all you need!
Moron
@throwawayacct: there's no need to *assume* Axiom of Choice when working with finite sets..
BlueRaja - Danny Pflughoeft
A: 

I believe that...

1- Can you suggest a datatype and equivalence relation such that it is not possible to create an ordering?

...it's possible only for infinite (possibly only for uncountable) sets.

2- How about a datatype and equivalence relation where it is possible to create such an ordering, but it is not possible to define a hash function on the datatype that will map equivalent items to the same hash value.

...same as above.

Mau
+1  A: 

Theoretically, it is always possible (for questions 1 and 2), because of the Well-Ordering Theorem, even when you have an uncountable number of partitions.

Even if you restrict to computable functions, throwawayaccount's answer covers that.

You need to define your question more precisely :-)

In any case,

Practically speaking,

Consider the following:

Your data type is the set of unsigned integer arrays. The ordering is lexicographic comparison.

You could consider hash(x) = x, but I suppose that is cheating too :-)

I would say (though I haven't thought much about constructing a hash function, so I might well be wrong) that partitioning by ordering is much more practical here than partitioning by hashing, as hashing itself could become impractical. (A hash function exists, no doubt.)

Moron
A: 

Suppose that F(X) is a function which maps an element of some data type T to another of the same type, such that for any Y of type T, there is exactly one X of type T such that F(X)=Y. Suppose further that the function is chosen so that there is generally no practical way of finding the X in the above equation for a given Y.

Define F{0}(X)=X, F{1}(X)=F(X), F{2}(X)=F(F(X)), etc., so that F{n}(X) = F(F{n-1}(X)).

Now define a data type Q containing a positive integer K and an object X of type T. Define an equivalence relation thus:

Q(a,X) vs Q(b,Y):

If a > b, the items are equivalent iff F{a-b}(Y)==X

If a < b, the items are equivalent iff F{b-a}(X)==Y

If a=b, the items are equivalent iff X==Y

For any given object Q(a,X) there exists exactly one Z such that F{a}(Z)==X. Two objects are equivalent iff they have the same Z, so one could define an ordering or hash function based upon Z. On the other hand, if F is chosen so that its inverse cannot practically be computed, the only practical way to compare elements may be the equivalence test above. I know of no way to define an ordering or hash function without either knowing the largest possible "a" value an item could have, or having a means to invert F.
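A toy sketch of this construction. It substitutes a cheap modular bijection for the hard-to-invert F (multiplication by a constant mod a prime is trivially invertible in reality, so assume a genuinely one-way permutation for the argument to bite); `P`, `G`, and the function names are illustrative:

```python
# Toy stand-in for the one-way bijection F: multiplication mod a prime.
P, G = 1_000_003, 5

def F(x):
    return (x * G) % P

def iterate(f, n, x):
    """Apply f to x, n times: iterate(F, n, x) == F{n}(x)."""
    for _ in range(n):
        x = f(x)
    return x

def equivalent(q1, q2):
    """Compare Q(a, X) and Q(b, Y) per the definition above: advance the
    element with the smaller counter by |a - b| applications of F, then
    test equality.  No inversion of F is ever needed."""
    (a, x), (b, y) = q1, q2
    if a > b:
        return iterate(F, a - b, y) == x
    return iterate(F, b - a, x) == y

z = 42
equivalent((3, iterate(F, 3, z)), (5, iterate(F, 5, z)))  # True: same Z
```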

supercat