I need to sort a huge list of text strings of arbitrary length. I suppose radix sort is the best option here. List is really huge, so padding strings to the same length is completely impossible.
Is there any ready-made implementation for this task, preferably in C#?
views:
158answers:
5String.Compare() overloads are using such string comparison. See what you need is to feed this to your sort algorithm.
UPDATE
This is the implementation:
[MethodImpl(MethodImplOptions.InternalCall)]
internal static extern int nativeCompareString(int lcid, string string1, int offset1, int length1, string string2, int offset2, int length2, int flags);
Hard to beat this native unmanaged implementation with your own implementation.
How many are many, one million?
The built in List<string>.Sort()
takes O(n * log(n)) on average.
log2(10^6) ~=20, that is not very much slower than O(n) for 10^6 elements. If your strings are more than 20 characters long radix sort O(n * k) will be "slower".
I doubt a radix sort will be significantly faster than the built in sort. But it would be fun to measure and compare.
Edit: there is a point to these statements I made previously, but the point is wrong overall.
Radix sort is the wrong sort to use on large numbers of strings. For things like
I really like squirrels. Yay, yay, yay!
I really like blue jays. Yay, yay, yay!
I really like registers. Yay, yay, yay!
you will have a bunch of entries falling in the same bucket. You could avoid this by hashing, but what use is sorting a hash code?
Use quicksort or mergesort or the like. (Quicksort generally performs better and takes less memory, but many examples have worst-case performance of O(N^2) which almost never occurs in practice; Mergesort doesn't perform quite as well but is usually implemented to be stable, and it's easy to do part in memory and part on disk.) That is, use the built-in sort function.
Edit: Well, it turns out that at least on very large files with long repeats at the beginning (e.g. source code) and with many lines exactly the same (100x repeats, in fact), radix sort does start becoming competitive with quicksort. I'm surprised! But, anyway, here is the code I used to implement radix sort. It's in Scala, not C#, but I've written it in fairly iterative style so it should be reasonably obvious how things work. The only two tricky bits are that (a(i)(ch) & 0xFF)
is to extract a 0-255 byte from an array of arrays of bytes (bytes are signed), that counts.scanLeft(0)(_ + _)
forms a cumulative sum of the counts, starting from zero (and then indices.clone.take(257)
takes all but the last one), and that Scala allows multiple parameter lists (so I split up the always-provided argument from the arguments that have defaults that are used in recursion). Here it is:
def radixSort(a: Array[Array[Byte]])(i0: Int = 0, i1: Int = a.length, ch: Int = 0) {
val counts = new Array[Int](257)
var i = i0
while (i < i1) {
if (a(i).length <= ch) counts(0) += 1
else { counts((a(i)(ch)&0xFF)+1) += 1 }
i += 1
}
val indices = counts.scanLeft(0)(_ + _)
val starts = indices.clone.take(257)
i = i0
while (i < i1) {
val bucket = if (a(i).length <= ch) 0 else (a(i)(ch)&0xFF)+1
if (starts(bucket)+i0 <= i && i < starts(bucket)+i0+counts(bucket)) {
if (indices(bucket) <= i) indices(bucket) = i+1
i += 1
}
else {
val temp = a(indices(bucket)+i0)
a(indices(bucket)+i0) = a(i)
a(i) = temp
indices(bucket) += 1
}
}
i = 1
while (i < counts.length) {
if (counts(i)>1) {
radixSort(a)(i0+starts(i),i0+starts(i)+counts(i),ch+1)
}
i += 1
}
}
And the timings are that with 7M lines of source code (100x duplication of 70k lines), the radix sort ties the built-in library sort, and wins thereafter.
Depending on what you need, you might find inserting all the strings into some form of Trie to be the best solution. Even a basic Ternary Search Trie will have a smaller memory footprint than an array/list of strings and will store the strings in sorted order.
Insertion, lookup and removal are all O(k * log(a))
where a
is the size of your alphabet (the number of possible values for a character). Since a
is constant so is log(a)
so you end up with a O(n * k)
algorithm for sorting.
Edit: In case you are unfamiliar with Tries, they are basically n-ary trees where each edge represents a single character of the key. When inserting, you check if the root node contains an edge (or child, whatever) that matches the first character of your key. If so, you follow that path and repeat with the second character and so on. If not, you add a new edge. In a Ternary Search Trie, the edges/children are stored in a binary tree so the characters are in sorted order and can be searched in log(a)
time. If you want to waste memory you can store the edges/children in an array of size a
which gives you constant lookup at each step.