This is a question that's been lingering in my mind for some time ...

Suppose I have a list of items and an equivalence relation on them, and comparing two items takes constant time. I want to return a partition of the items, e.g. a list of linked lists, each containing all equivalent items.

One way of doing this is to extend the equivalence to a total ordering on the items and sort them with any sorting algorithm; then all equivalent items will be adjacent.

But can it be done more efficiently than with sorting? Is the time complexity of this problem lower than that of sorting? If not, why not?

+1  A: 

Comparison-based sorting has a lower bound of Ω(n log n) comparisons.

Assume you iterate over your set of items and put them into buckets of items with the same comparative value, for example into a set of lists (say, using a hash set). This operation is clearly O(n), even after retrieving the list of lists from the set.

--- EDIT: ---

This of course requires two assumptions:

  • There exists a constant time hash-algorithm for each element to be partitioned.
  • The number of buckets does not depend on the amount of input.

Thus, under these assumptions, partitioning can be done in O(n).
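
As a minimal sketch of this bucketing idea (Python; the key function is a hypothetical stand-in for a constant-time hash that maps exactly the equivalent items to the same value):

    from collections import defaultdict

    def partition_by_key(items, key):
        # One O(1) hash-map insertion per item => O(n) overall,
        # given the two assumptions above.
        buckets = defaultdict(list)
        for item in items:
            buckets[key(item)].append(item)
        return list(buckets.values())

    # Example: equivalence is "same residue mod 3"; the key is the residue.
    print(partition_by_key(range(10), lambda x: x % 3))
    # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]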

Nubsis
This assumes the items have a "getHashCode" method available, or some other way of associating all items that equal each other with a unique key.
Peter Recore
This is not O(n). It is O(n*bucket count).
Anthony Williams
@Anthony Williams So you're saying that O(N) is not equivalent to O(N * some constant)!?
wheaties
@Anthony: But O(n*constant value) = O(n).
JAB
@JAB. Anthony is right. bucket count could very well be n/2 here. What makes you think it is a constant?
Moron
The bucket count is either (a) predefined, in which case it is a constant, or (b) a runtime parameter, like the number of entries
Anthony Williams
@Moron, Anthony: Ah, true.
JAB
@Nubsis: yes, so my question is whether partitioning can be done faster than O(n log n).
reinierpost
It's true I made some implicit assumptions; I've updated the answer with them. Either way, I don't think the number of buckets needs to be taken into consideration. Get and set operations on the bucket data structure can both be assumed to be O(1), and you never need to iterate over the entire set of buckets. This is the same as assuming you have enough buckets to avoid hash collisions. The partitioning problem also cannot be harder than O(n log n), because it can be solved by sorting the input. I argue that it is O(n) under these assumptions.
Nubsis
+2  A: 

If a comparator must be used, then the lower bound is Ω(n log n) comparisons for sorting or partitioning. The reason is that all elements must be inspected, which is Ω(n), and a comparator must perform log n comparisons per element to uniquely place it in relation to the others (each comparison divides the space in two, so for a space of size n, log n comparisons are needed).

If each element can be associated with a unique key which is derived in constant time, then the lower bound is Ω(n), for sorting and partitioning alike (cf. radix sort).

mdma
That's not a great explanation of the lower bound. Sorting requires Ω(n log n) comparisons because it takes log(n!) = Ω(n log n) bits to describe one of n! permutations. Information-theoretically, partitioning is as hard as sorting because if an adversary chooses comparison results as though there are n distinct partitions, it can change its mind at any point prior to the input having been sorted by putting two adjacent elements whose order is not known into one partition.
I was going for the intuitive angle. For example, compare adding n items to a self-balancing tree or skip list. It then requires n log n comparisons, and the result is a sorted collection. Given that partitioning can be recast as a sorting problem, surely partitioning is no harder than sorting, and the complexity depends upon how items are compared, either by hashing or by comparison. Given that the lower bound for radix sort is linear, partitioning in that case can be no more efficient than that.
mdma
@mdma: Please use Omega instead of bigOh when talking about lower bounds.
Moron
Sure, I was unsure how to format it. I'll paste in the nice omega sign from the comment above.
mdma
@mdma: You can always use Omega :-) I wish there was some LaTeX-like support here.
Moron
@user382751 Yours is actually the answer I was looking for! But the other answers and remarks are very helpful additions. I love this site ...
reinierpost
+4  A: 

If you can define a hash function for the items as well as an equivalence relation, then you should be able to do the partition in linear time -- assuming computing the hash is constant time. The hash function must map equivalent items to the same hash value.

Without a hash function, you would have to compare every new item to be inserted into the partitioned lists against the head of each existing list. The efficiency of that strategy depends on how many partitions there will eventually be.

Let's say you have 100 items, and they will eventually be partitioned into 3 lists. Then each item would have to be compared against at most 3 other items before inserting it into one of the lists.

However, if those 100 items would eventually be partitioned into 90 lists (i.e., very few equivalent items), it's a different story. Now your runtime is closer to quadratic than linear.
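
A rough sketch of this comparison-only strategy (Python; equivalent is a hypothetical stand-in for whatever equivalence predicate you have):

    def partition_by_comparison(items, equivalent):
        # Compare each item against the head of every existing list:
        # O(n*m) for n items and m final partitions.
        partitions = []
        for item in items:
            for group in partitions:
                if equivalent(item, group[0]):
                    group.append(item)
                    break
            else:  # no existing partition matched
                partitions.append([item])
        return partitions

    # 3 classes => at most 3 comparisons per item:
    print(partition_by_comparison(range(9), lambda a, b: a % 3 == b % 3))
    # [[0, 3, 6], [1, 4, 7], [2, 5, 8]]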

Dan
Adding to a hash table is essentially just this, where the equivalence relation is (hash maps to same bucket). You haven't simplified the problem at all.
Anthony Williams
@Anthony: you still need the equivalence relation to check for hash collision.
Dan
@Dan: Why should two 'equal' objects have the same hash? I could very well have a set of objects where it is easy to define equality but really hard to ensure equal objects get equal hashes.
Moron
@Dan: Exactly. Doing a hash doesn't help.
Anthony Williams
Still ... in some cases a good hash function may be easy to design, and if it doesn't solve the problem perfectly it will certainly help. E.g. comparing file contents by just comparing their md5 hashes works pretty well; I tried it on the entire Debian source code and out of 10 million files only got a few hundred collisions.
reinierpost
Unless the hash value cleanly maps to the equivalence classes (i.e. equal hashes implies the values belong in the same partition) then hashing doesn't help.
Anthony Williams
@reinierpost: No one is disputing that having such a hash will help. The claim that hashing will solve this irrespective of the compare function is misguided. You will have to come up with different hashes for different compare/equality functions, and solving that might well include solving the partition problem!
Moron
@moron: In the abstract, if one is given a collection of objects and a comparator function, partitioning will be hard. Indeed, if the objects can only be tested for equality rather than ranked, it may take time O(N^2). On the other hand, code which knows enough about objects to actually use them should be able to hash them (if nothing else, by running a binary version of the data through a cryptographic hash function). A high-speed partitioning library should accept a hash function as a parameter.
supercat
@supercat: No one is disputing the benefits of a good, applicable hash function. If you read the question again, it is clear that the OP is looking to compare partitioning with sorting under the constraint of using a comparison function. The talk of hashing is just noise, especially when there are no guarantees of finding a good hash function (we don't even know what objects/compare functions the OP has in mind) more easily than solving partitioning.
Moron
@Moron: Given only a pair-wise equivalence relation, no algorithm can need fewer than N*(N-1)/2 comparisons in the worst case. Allowing a ranking comparison reduces that to O(NlgN). Adding more means of "extracting information" reduces it further.
supercat
@supercat: Yes, but what is your point?
Moron
@Moron: "Why should two 'equal' objects have the same hash?" -- I clarified my answer: "The hash function must map equivalent items to the same hash value." Sorry... I thought that part was obvious. It shouldn't be hard to define such a hash function, but of course it depends on your sense of equivalence.
Dan
@Dan. It is not at all obvious (especially with no info about the objects). For instance, say the Equality function was an input to your PartitionMethod. Do you really think you can use hashing then?
Moron
@Moron: of course not -- an algorithm given only the list of items and the equivalence function cannot "invent" a hash function on its own. But a *human* given a description of the equivalence function should be able to devise a hash function that will yield equal hashes for equivalent items. Which is why I wrote "If you (human) define a hash function..."
Dan
@Dan: Yes, but given an arbitrary equivalence relation, defining such a hash is the hardest part of the problem. (I don't think it's possible in o(n log n) time... and may be even harder depending on how the equivalence is given.) So you've not simplified the problem at all, just restated it (or assumed that the equivalence relation is something sufficiently trivial that defining a hash is trivial).
ShreevatsaR
@moron: I am definitely interested in practical help, in addition to the theoretical viewpoint, so suggesting a hashing function is always helpful.
reinierpost
@rei: Sure, but your question gives no information at all, which could be useful in suggesting a hashing function! We can suggest you use a hash, but in this case, it is like the suggestion "write code" (ok, bit of an exaggeration...) as the hard part could be coming up with the hashing function in the first place. And given that you clearly state (in the question) you are interested in comparing with sorting, IMO, talking about hashing is noise. It is helpful, no doubt, but IMO it is not helpful in the *context* of the question. This is not a discussion forum.
Moron
@ShreevatsaR, @Moron, @reinierpost, @supercat, @Anthony: A challenge: http://stackoverflow.com/questions/3261782 Talking about hashing is not noise. Defining a hash should be no more difficult than designing a sort ordering, which the OP suggested is possible. I look forward to being proven wrong.
Dan
I didn't say that hashing doesn't help or work, only that devising a hash is (in general) as hard as solving the original partitioning problem. So the idea of using a hash "reduces" the problem to something no easier than before. (BTW: your new question talks mainly of possibility rather than algorithmic complexity, you may want to edit it.)
ShreevatsaR
@Anthony Williams: "Unless the hash value cleanly maps to the equivalence classes (i.e. equal hashes implies the values belong in the same partition) then hashing doesn't help." Not true; it may be that several equivalence classes all map to the same hash value. In that case, hashing does gain you information about what class a given element belongs to, just not *all* the information.
j_random_hacker
+3  A: 

If you don't care about the final ordering of the equivalence sets, then partitioning into equivalence sets could be quicker. However, it depends on the algorithm and the number of elements in each set.

If there are very few items in each set, then you might as well just sort the elements and then find the adjacent equal elements. A good sorting algorithm is O(n log n) for n elements.

If there are a few sets with lots of elements in each, then you can take each element and compare it to the existing sets. If it belongs in one of them, add it; otherwise create a new set. This will be O(n*m), where n is the number of elements and m is the number of equivalence sets, which is less than O(n log n) for large n and small m, but worse as m tends to n.

A combined sorting/partitioning algorithm may be quicker.
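
For contrast, a sketch of the sort-then-scan route from the second paragraph (Python; assumes the equivalence can be extended to a total order via a key function):

    from itertools import groupby

    def partition_by_sorting(items, key):
        # O(n log n): sort by a key that extends the equivalence to a
        # total order, then group adjacent equal elements.
        ordered = sorted(items, key=key)
        return [list(g) for _, g in groupby(ordered, key=key)]

    print(partition_by_sorting([7, 1, 4, 10, 2], lambda x: x % 3))
    # [[7, 1, 4, 10], [2]]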

Anthony Williams
Using hashing, the problem can almost certainly be solved in O(N) time, though a good hash function may take long enough to compute that an O(NlgN) algorithm could be faster. Storing observed keys in a tree and discarding duplicates would yield a time of O(NlgM), where N is the number of elements and M is the number of distinct ones.
supercat
Only if the hash function helps identify the equivalence classes (i.e. non-equal but equivalent values have the same hash)
Anthony Williams
A: 

Partitioning is faster than sorting, in general, because you don't have to compare each element to each potentially-equivalent already-sorted element; you only have to compare it to the already-established keys of your partitioning. Take a close look at radix sort. The first step of radix sort is to partition the input based on some part of the key. Radix sort is O(kn). If your data set has keys bounded by a given length k, you can radix sort it in O(n). If your data are comparable and don't have a bounded key, but you choose a bounded key with which to partition the set, sorting the set would be O(n log n) while the partitioning would be O(n).
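
A sketch of one such partitioning pass (Python; assumes each item yields a bounded integer key, here the low byte, which is just an illustrative choice):

    def radix_partition(items, key, key_range):
        # One radix pass: distribute n items into key_range buckets,
        # O(n + key_range) time.
        buckets = [[] for _ in range(key_range)]
        for item in items:
            buckets[key(item)].append(item)
        return [b for b in buckets if b]  # drop empty buckets

    # Partition integers on their low byte:
    print(radix_partition([513, 7, 42, 267], lambda x: x & 0xFF, 256))
    # [[513], [7], [267], [42]]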

Eric Mickelsen
Big-O aside, it should be noted that the choice of key will have a big effect on speed, whether partitioning or radix sorting, with respect to a particular data set. Also, sets with few equivalent elements will favor sorting and sets with many equivalent elements will favor partitioning.
Eric Mickelsen
Doing a radix sort on a hash (which is essentially what one would get from recursively hashing into buckets) is apt to be much faster than radix sorting on typical keys, because the distribution will be much better. A typical radix sort scenario with 256 bins might end up with some bins containing 10% or more of the input records; a hash radix sort with 256 bins would be very unlikely to end up with a bin holding more than 1-2% of the input records unless more than 1% of the input records had identical keys.
supercat
@supercat: Wouldn't a hash with only 256 bins have a ridiculously high number of hash collisions (except for tiny sets), thus counteracting the benefit of more uniform distribution? I'm not sure there really is such a thing as a "hash radix sort." Could you provide a reference to support your claim?
Eric Mickelsen
@tehMick: The number 256 is arbitrary; my point was that dividing items into bins based upon a hash function will result in all the bins being much more uniformly filled than would dividing them into bins using a typical radix-sort function. I'm unaware of such a thing as a "hash radix sort" per se, but one could use a radix sort on the output of a hash function just as well as one could use one on anything else.
supercat
+8  A: 

You seem to be asking two different questions at one go here.

1) If we allow only equality checks, is partitioning easier than if we had some ordering? The answer is no: you require Ω(n^2) comparisons to determine the partitioning in the worst case (for instance, when all items are distinct).

2) If we allow ordering, is partitioning easier than sorting? Again, no. This follows from the Element Distinctness Problem, which says that even to determine whether all objects are distinct, you require Ω(n log n) comparisons. Since sorting can be done in O(n log n) time (and also has an Ω(n log n) lower bound) and solves the partition problem, asymptotically the two are equally hard.

If you pick an arbitrary hash function, equivalent objects need not have the same hash, in which case you haven't done any useful work by putting them in a hashtable.

Even if you do come up with such a hash (one that guarantees equivalent objects get the same hash), the time complexity is O(n) in expectation for a good hash, and Ω(n^2) in the worst case.

Whether to use hashing or sorting completely depends on other constraints not available in the question.

The other answers also seem to be forgetting that your question is (mainly) about comparing partitioning and sorting!

Moron
Any hash function which does not yield equal hashes for equal objects is broken. Part of the definition of a hash function (indeed, practically the only requirement) is that equal values must yield equal hashes.
supercat
The latter is exactly what I was looking for. Thanks!
reinierpost
@supercat: If we had such a hash would we even be having this question posted? My guess is no. Given an arbitrary equality/compare function on objects, it might be hard to come up with such a hash and solving that might well involve solving the partition problem. So just claiming that hashing is a solution is not really helpful.
Moron
@reinierpost: It helped? No upvote then? ;-) Really though, if you think an answer (not just for this question) is correct and useful, you should consider upvoting it, especially when there are many upvoted misleading answers about (in general, not making any claims about this question). The whole premise of this site is that correct and useful answers get upvoted and that the signal/noise ratio is high.
Moron
+1 for seeing and succinctly stating the salient points.
andand
+1, but I really think you meant "unequal objects need not have unequal hashes". Supercat is right. (Though it may be hard to find such a hash.)
ShreevatsaR
@supercat: "Any hash value which does not yield equal hashes for equal objects is broken." -- that's true, but in this case the hash also needs to yield equal hashes for *equivalent* objects. There's a difference between equivalence and equality. "John Smith" is not equal to "Fred Smith" but they may be equivalent if you have defined equivalence to only consider last names.
Dan
@ShreevatsaR: I meant, you just can't pick any hash function and expect it to work. It has to be based on equal objects having equal hashes. I will edit the answer to clarify.
Moron
+1 [what andand said]
Dave
@Moron: I'm not sure what makes you think I didn't upvote your answer. But all answers have been useful, and they complement each other, so it's hard to pick a winner. If you can suggest a criterion for ordering by usefulness, be my guest :-)
reinierpost
@rei: I saw your comment, and saw zero upvotes and so just assumed :-) I was kidding, btw, you are not under any obligations, even if it is your own question. The criterion for ordering is the number of votes! So if people are diligent in voting the right stuff, the ordering is automatic and that is what the voting system is for. Anyway, we are getting way off-topic.
Moron
"Whether to use hashing or sorting completely depends on other constraints not available in the question". Well, yes, like in the ~0.000...01% of the cases where you cannot come up with a reasonable hash function for your elements. Otherwise, a hashtable beats sorting hands-down in this problem.
Dimitris Andreou
@Dimitris: Where did you pull that figure from? btw, did you read the question?
Moron
@Moron, might you show (or pull from somewhere, heh, good one!) an example where sorting would be better than hashing for this problem? Just saying.
Dimitris Andreou
@Dimit: I added an answer to the question Dan posted. I didn't think much about it, so might be wrong, though. In any case, the burden of proof is on _you_! If you make a claim, you better prove it. I never claimed sorting is _always_ (or even usually) better than hashing. I only claimed that in the _context_ of the question ("is partition easier than sorting") talking about hashing is noise. People seem to have misunderstood what I was saying.
Moron
In fact _you_ made a claim that somehow deciding between sorting and hashing for this problem depends on some other, hidden factors. I merely asked you to show an example where the best answer would not be clear (and in particular, would be sorting rather than hashing). Still waiting for that.
Dimitris Andreou
@Dim: I stand by it. Choosing whether hashing or sorting is better depends on the actual situation! There is too little information provided in the question to say hashing is better or not. For an answer: http://stackoverflow.com/questions/3261782/puzzle-need-an-example-of-a-complicated-equivalence-relation-partitioning-th/3266142#3266142. (In fact, I already referred to this answer in my previous comment, if you happened to miss the first sentence of that comment!)
Moron
What I'm getting at is that under any remotely reasonable hash function, it is exceedingly improbable to hit a degenerate case. I would choose the expected O(N) algorithm every time, even with the risk that it might degenerate to Ω(N^2) once in a blue moon, rather than pay the O(NlogN) cost each and every time. (This argument also applies to using hashtables rather than search trees.) Theoretically? Yes, you are right: in the worst case hashing is rubbish (it is!). But in practice, I'd suggest hashing with hardly any discussion over it.
Dimitris Andreou
@Dimi: Without even knowing if a suitable hash function exists you want to suggest hashing to a question whose title is "Is Partitioning easier than Sorting"!? No one is claiming sorting is better (I am repeating myself here) than hashing in practice. If you have a decent hash function sure, use it. If you notice, the question was a theoretical one and the better or worse was only in terms of the worst case complexity. My answer does mention that hashing is _expected_ O(n) with a decent hash. If you see the answer I linked to earlier, I hope you will agree that sorting might be better there.
Moron
...I am not sure where you got that I am suggesting always use sorting or sorting is always better! Please read the answer in the context of the question. Also, I disagree that you can blindly always go with the hash. Depends on the situation! I do agree that hashing will probably win, though... :-)
Moron
A: 

The time required to perform a possibly-imperfect partition using a hash function will be O(n+bucketcount) [not O(n*bucketcount)]. Making the bucket count large enough to avoid all collisions will be expensive, but if the hash function works at all well there should be a small number of distinct values in each bucket. If one can easily generate multiple statistically-independent hash functions, one could take each bucket whose keys don't all match the first one and use another hash function to partition the contents of that bucket.

Assuming a constant number of buckets on each step, the time is going to be O(NlgN), but if one sets the number of buckets to something like sqrt(N), the average number of passes should be O(1) and the work in each pass O(N).
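
A rough sketch of this multi-pass scheme (Python; the list of independent hash functions is a hypothetical input, and any bucket still mixed after the last pass would need a comparison-based cleanup, omitted here):

    def multi_pass_partition(items, equivalent, hashes):
        # Split "dirty" buckets with successive hash functions until every
        # bucket holds only mutually equivalent items.
        clean, dirty = [], [list(items)]
        for h in hashes:
            still_dirty = []
            for bucket in dirty:
                if all(equivalent(x, bucket[0]) for x in bucket):
                    clean.append(bucket)  # every member matches the first
                    continue
                split = {}
                for x in bucket:  # re-split this bucket with the next hash
                    split.setdefault(h(x), []).append(x)
                still_dirty.extend(split.values())
            dirty = still_dirty
        return clean + dirty

    # Toy example: equivalence is "same residue mod 5"; two coarse hashes
    # that respect the equivalence and jointly separate all five classes.
    eq = lambda a, b: a % 5 == b % 5
    hs = [lambda x: (x % 5) % 2, lambda x: (x % 5) % 3]
    print(multi_pass_partition(range(10), eq, hs))
    # [[0, 5], [2, 7], [4, 9], [1, 6], [3, 8]]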

supercat
+1  A: 

This is a classic problem in data structures, and yes, it is easier than sorting. If you want to also quickly be able to look up which set each element belongs to, what you want is the disjoint set data structure, together with the union-find operation. See here: http://en.wikipedia.org/wiki/Disjoint-set_data_structure
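
For completeness, a minimal sketch of the structure (Python; path compression plus union by size) -- note that union-find fits best when the equivalences arrive as explicit pairs rather than only through a pairwise test:

    from collections import defaultdict

    class DisjointSet:
        def __init__(self, n):
            self.parent = list(range(n))
            self.size = [1] * n

        def find(self, x):
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]  # path compression
                x = self.parent[x]
            return x

        def union(self, a, b):
            ra, rb = self.find(a), self.find(b)
            if ra == rb:
                return
            if self.size[ra] < self.size[rb]:
                ra, rb = rb, ra
            self.parent[rb] = ra  # attach the smaller tree under the larger
            self.size[ra] += self.size[rb]

    # Declare some equivalent pairs, then read off the partition.
    ds = DisjointSet(6)
    for a, b in [(0, 1), (1, 2), (3, 4)]:
        ds.union(a, b)
    classes = defaultdict(list)
    for i in range(6):
        classes[ds.find(i)].append(i)
    print(list(classes.values()))  # [[0, 1, 2], [3, 4], [5]]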

Aaron