One alternative is to count the number of occurrences of each character in each string and compare the counts. A simple implementation takes O(max(N, A)) time, where N is the length of the longer string and A is the size of the array used to store the counts. For example, in Java:
public boolean equalIgnoringOrder(String s1, String s2) {
    if (s1.length() != s2.length()) {
        return false;
    }
    // Assuming characters in the range ASCII 0 to 127
    int[] c1 = new int[128];
    int[] c2 = new int[128];
    for (int i = 0; i < s1.length(); i++) {
        c1[s1.charAt(i)]++;
        c2[s2.charAt(i)]++;
    }
    for (int i = 0; i < c1.length; i++) {
        if (c1[i] != c2[i]) {
            return false;
        }
    }
    return true;
}
There are some possible improvements to this. For example, you can cope with an arbitrary character set by doing a range reduction; i.e. do an initial pass through s1 and s2 looking for the smallest and largest characters in each one, and use this to determine the size of c1 and c2 and a base offset. This will use less space on average and reduce the time needed to initialize the count arrays. It also offers a short circuit for the comparison; e.g. when the smallest and largest characters of s1 and s2 are not the same.
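A sketch of what that might look like (same approach as above, just with an extra pass to find the character ranges; the method name and the empty-string check are mine):

public boolean equalIgnoringOrderRangeReduced(String s1, String s2) {
    if (s1.length() != s2.length()) {
        return false;
    }
    if (s1.isEmpty()) {
        return true;
    }
    // First pass: find the smallest and largest characters in each string.
    char min1 = Character.MAX_VALUE, max1 = Character.MIN_VALUE;
    char min2 = Character.MAX_VALUE, max2 = Character.MIN_VALUE;
    for (int i = 0; i < s1.length(); i++) {
        char ch1 = s1.charAt(i);
        char ch2 = s2.charAt(i);
        if (ch1 < min1) min1 = ch1;
        if (ch1 > max1) max1 = ch1;
        if (ch2 < min2) min2 = ch2;
        if (ch2 > max2) max2 = ch2;
    }
    // Short circuit: if the ranges differ, the strings cannot match.
    if (min1 != min2 || max1 != max2) {
        return false;
    }
    // Second pass: count using arrays sized to the actual range, with min1 as the base offset.
    int[] c1 = new int[max1 - min1 + 1];
    int[] c2 = new int[max1 - min1 + 1];
    for (int i = 0; i < s1.length(); i++) {
        c1[s1.charAt(i) - min1]++;
        c2[s2.charAt(i) - min1]++;
    }
    for (int i = 0; i < c1.length; i++) {
        if (c1[i] != c2[i]) {
            return false;
        }
    }
    return true;
}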
By comparison, sorting the two strings with heapsort or quicksort and then comparing them would be O(N log N) on average with O(N) extra space, where N is the length of the longer string.
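For reference, a sort-based version is short (a minimal sketch using the JDK's built-in primitive sort):

import java.util.Arrays;

public boolean equalIgnoringOrderBySorting(String s1, String s2) {
    if (s1.length() != s2.length()) {
        return false;
    }
    // Sort the characters of each string and compare the sorted arrays.
    char[] a1 = s1.toCharArray();
    char[] a2 = s2.toCharArray();
    Arrays.sort(a1);
    Arrays.sort(a2);
    return Arrays.equals(a1, a2);
}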
However, as @pst points out, the constants of proportionality can make an O(N log N) or even an O(N*N) algorithm beat an O(N) algorithm when N is not large. In this case, the average length of the strings being compared is probably the most important factor.
The code above is effectively performing a Radix Sort with a couple of short circuits. (Three if you include the short circuit associated with range reduction.) So ultimately it boils down to whether a Quick Sort / Heap Sort or a Radix Sort would be better. And that depends on input string lengths and character ranges.
On a different tack, @John's answer proposes that we compute a product of prime numbers, one prime per character. If we do the computation using an arbitrary-precision representation, the resulting value is unique to each distinct set of "equal ignoring order" strings. Unfortunately, the computation will be O(N*N): each intermediate product has O(N) digits, and multiplying an O(N)-digit number by a small constant is O(N); doing that for N characters gives O(N*N).
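A sketch of that arbitrary-precision version (the primes parameter is an assumption here: a table mapping each character code to a distinct prime):

import java.math.BigInteger;

// Product of one prime per character; equal products <=> equal ignoring order.
public BigInteger orderInsensitiveProduct(String s, int[] primes) {
    BigInteger product = BigInteger.ONE;
    for (int i = 0; i < s.length(); i++) {
        product = product.multiply(BigInteger.valueOf(primes[s.charAt(i)]));
    }
    return product;
}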
But if we do the computation modulo 2^64 (i.e. let a 64-bit long overflow and wrap), the result is effectively a really good hash that is insensitive to character order; e.g.
long hash = 1;
for (int i = 0; i < s.length(); i++) {
    hash = hash * primes[s.charAt(i)];  // primes: a table mapping each character to a distinct prime
}
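The primes table itself is assumed above; one way you might build it for the ASCII case (a hypothetical helper using simple trial division) is:

// Maps each of the 128 ASCII codes to a distinct prime: 2, 3, 5, 7, ...
static int[] buildPrimeTable() {
    int[] primes = new int[128];
    int found = 0;
    for (int candidate = 2; found < primes.length; candidate++) {
        boolean isPrime = true;
        for (int d = 2; d * d <= candidate; d++) {
            if (candidate % d == 0) {
                isPrime = false;
                break;
            }
        }
        if (isPrime) {
            primes[found++] = candidate;
        }
    }
    return primes;
}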
So, I would claim that the algorithm that gives the best performance and space usage on average for comparing randomly generated strings is likely to be of the form:
if (s1.length() != s2.length()) {
    return false;
}
if (hash(s1) != hash(s2)) { // computed as above
    return false;
}
// Compare using sorting or character counting as above.
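Made concrete, that might look like the following sketch, where hash is the loop above wrapped in a method (primes is the assumed per-character prime table) and the fallback is the counting method from the top of this answer:

public boolean equalIgnoringOrderFast(String s1, String s2) {
    if (s1.length() != s2.length()) {
        return false;
    }
    if (hash(s1) != hash(s2)) {
        return false;
    }
    // Possible hash collision: fall back to the definitive check.
    return equalIgnoringOrder(s1, s2);
}

private long hash(String s) {
    long hash = 1;
    for (int i = 0; i < s.length(); i++) {
        hash = hash * primes[s.charAt(i)];  // overflow wraps, giving the product mod 2^64
    }
    return hash;
}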
One final point. If we assume that the string pointers are not identical and that the strings have equal lengths, any algorithm that computes this equals predicate has to be O(N) or worse. It has to examine every character in both strings to make the determination, and that takes O(N) operations.
Any algorithm that does fewer than 2 * N fetches, or fewer than 2 * N further operations on the fetched values, in this scenario is provably incorrect.