I need to sort a huge list of text strings of arbitrary length. I suppose radix sort is the best option here. The list is really huge, so padding the strings to the same length is completely impossible.
Is there any ready-made implementation for this task, preferably in C#?

A: 

The String.Compare() overloads already perform this kind of string comparison. What you need is to feed that comparison to your sort algorithm.

UPDATE

This is the implementation:

[MethodImpl(MethodImplOptions.InternalCall)]
internal static extern int nativeCompareString(
    int lcid,
    string string1, int offset1, int length1,
    string string2, int offset2, int length2,
    int flags);

It is hard to beat this native, unmanaged implementation with one of your own.
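The idea of handing an existing string comparison to a general-purpose sort can be sketched as follows (in Python rather than C#, with `locale.strcoll` standing in for `String.Compare`; the function name is made up for illustration):

```python
import functools
import locale

def sort_with_comparator(strings):
    # locale.strcoll plays the role of String.Compare here: a pairwise,
    # locale-aware comparison handed to a comparison-based sort.
    return sorted(strings, key=functools.cmp_to_key(locale.strcoll))

print(sort_with_comparator(["pear", "apple", "banana"]))
# → ['apple', 'banana', 'pear']
```

Note that this is exactly what the commenters below object to: it is a comparison sort, not a radix sort.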

Aliostad
This does not look like radix sort to me.
Albin Sunnanbo
What a very informative update.
mquander
Radix sort is not a comparison sort so a string compare function is useless.
Niki Yoshiuchi
The question is how to *avoid* using a whole-string comparison, not *which one to use*. The whole point of radix sort is that it does *no* string-to-string comparisons, only char comparisons.
Eric Lippert
But this one compares the strings char by char and will return as soon as one does not match. So I cannot understand the negatives!
Aliostad
I see! It starts from the right...
Aliostad
@Eric - String comparisons are built out of char comparisons. What is your point?
Rex Kerr
@Rex - at each step in a radix sort you are doing single char comparisons. At each step in a comparison sort you are doing a string compare.
Niki Yoshiuchi
@Rex: My point is that you never need to call String.Compare in a radix sort; the whole point of radix sort is that you do *not* need to do a whole-string comparison. You don't need to do any *ordering* comparison at all, only an *equality* comparison on chars.
Eric Lippert
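To make that distinction concrete, a single bucketing pass of an MSD radix sort might look like this (an illustrative Python sketch assuming byte-sized characters; the function name is made up). Note that it only indexes one character and uses it as a bucket number; no string-to-string ordering comparison ever happens:

```python
def bucket_by_char(strings, pos):
    # Bucket 0 holds strings too short to have a character at pos;
    # buckets 1..256 correspond to the 256 possible byte values.
    buckets = [[] for _ in range(257)]
    for s in strings:
        key = 0 if len(s) <= pos else ord(s[pos]) + 1
        buckets[key].append(s)
    return buckets

# An MSD radix sort recurses into each non-empty bucket with pos + 1.
b = bucket_by_char(["bat", "ape", "b", "cat"], 0)
print(b[ord('b') + 1])
# → ['bat', 'b']
```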
+2  A: 

See this thread on radix sort, or this radix sort implementation.

Sandeep
A: 

How many is "many"? One million?

The built-in List<string>.Sort() takes O(n * log(n)) on average.

log2(10^6) ≈ 20, so for 10^6 elements that is not much slower than O(n). If your strings are longer than about 20 characters, radix sort's O(n * k) will be "slower".

I doubt a radix sort will be significantly faster than the built-in sort, but it would be fun to measure and compare.
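The arithmetic behind that estimate can be checked directly (a quick Python sanity check; the per-step constants, which differ a lot between the two sorts, are deliberately ignored):

```python
import math

n = 10**6
print(round(math.log2(n), 2))
# → 19.93, i.e. roughly 20

# A comparison sort does ~ n*log2(n) comparisons; radix sort does ~ n*k
# character passes, so the break-even point is around k = 20 characters.
k = 25
print(n * k > n * math.log2(n))
# → True: for average length k = 25 the n*k term is already larger
```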

Albin Sunnanbo
1. Comparisons of whole strings (for quicksort) and of single chars (for radix sort) have very different costs. 2. Quicksort takes O(n * log(n)) only on average; it takes O(n^2) in the worst case.
jifuyo
@jifuyo - Do you have any reason to suspect you might hit a worst-case scenario? And if so, why not just use mergesort? And how do you intend to avoid looking at every character to separate two strings that differ only in the last character? In radix sort, you have to look at the whole string to put each one in the right bucket; in mergesort you have to run to the end to do that comparison.
Rex Kerr
You're forgetting that each comparison in quick/merge/heap sort is going to require the use of a string compare, which is O(k), so the running time of List<string>.Sort() will take O(n*k*log(n)) on average.
Niki Yoshiuchi
@Niki Yoshiuchi string comparison takes O(k) in the worst case, but that requires lots of long strings with long common prefixes. If strings were infinite and random, the number of character comparisons required on average would be some constant, obviously not dependent on the length of the strings (which were infinite). I don't know what the situation is in real life, but I doubt the average complexity is O(k).
Maciej Hehl
@Maciej Hehl: I'm aware. I just don't think it's safe to ignore the possibility, especially when we don't know anything about the data set (think lots of duplicates). Ultimately this is something that would require benchmarking or knowledge of the data.
Niki Yoshiuchi
A: 

Edit: there is a point to these statements I made previously, but the overall conclusion is wrong.

Radix sort is the wrong sort to use on large numbers of strings. For things like

I really like squirrels. Yay, yay, yay!
I really like blue jays. Yay, yay, yay!
I really like registers. Yay, yay, yay!

you will have a bunch of entries falling in the same bucket. You could avoid this by hashing, but what use is sorting a hash code?

Use quicksort or mergesort or the like. (Quicksort generally performs better and takes less memory, but many implementations have O(N^2) worst-case performance, which almost never occurs in practice; mergesort doesn't perform quite as well but is usually implemented to be stable, and it's easy to do part in memory and part on disk.) That is, use the built-in sort function.


Edit: Well, it turns out that at least on very large files with long repeats at the beginning (e.g. source code) and with many lines exactly the same (100x repeats, in fact), radix sort does start becoming competitive with quicksort. I'm surprised! Anyway, here is the code I used to implement radix sort. It's in Scala, not C#, but I've written it in a fairly iterative style, so it should be reasonably obvious how things work. The only three tricky bits are: (1) (a(i)(ch) & 0xFF) extracts a 0-255 byte from an array of arrays of bytes (bytes are signed); (2) counts.scanLeft(0)(_ + _) forms a cumulative sum of the counts, starting from zero (and then indices.clone.take(257) takes all but the last one); and (3) Scala allows multiple parameter lists (so I split the always-provided argument from the arguments that have defaults used in recursion). Here it is:

def radixSort(a: Array[Array[Byte]])(i0: Int = 0, i1: Int = a.length, ch: Int = 0) {
  val counts = new Array[Int](257)
  var i = i0
  while (i < i1) {
    if (a(i).length <= ch) counts(0) += 1
    else { counts((a(i)(ch)&0xFF)+1) += 1 }
    i += 1
  }
  val indices = counts.scanLeft(0)(_ + _)
  val starts = indices.clone.take(257)
  i = i0
  while (i < i1) {
    val bucket = if (a(i).length <= ch) 0 else (a(i)(ch)&0xFF)+1
    if (starts(bucket)+i0 <= i && i < starts(bucket)+i0+counts(bucket)) {
      if (indices(bucket) <= i-i0) indices(bucket) = i-i0+1  // indices hold offsets relative to i0
      i += 1
    }
    else {
      val temp = a(indices(bucket)+i0)
      a(indices(bucket)+i0) = a(i)
      a(i) = temp
      indices(bucket) += 1
    }
  }
  i = 1
  while (i < counts.length) {
    if (counts(i)>1) {
      radixSort(a)(i0+starts(i),i0+starts(i)+counts(i),ch+1)
    }
    i += 1
  }
}

As for the timings: with 7M lines of source code (100x duplication of 70k lines), the radix sort ties the built-in library sort, and wins thereafter.

Rex Kerr
Yes, lots of items in the same bucket. Is this a problem?
jifuyo
@jifuyo - They're not sorted, and you don't know how many there are. So you need to create a resizable array or a tree to hold your pointers, and then you have to bucket again recursively. This gives you `O(N log N)` performance and considerable memory usage even if you're really careful, since you need to generate buckets any time you have confused strings. And then, don't forget that you have to weigh a string compare and pointer manipulation against a char index lookup and creation and maintenance of an array of stored values.
Rex Kerr
@Rex Kerr, radix sort can be done in place, with no extra memory usage.
jifuyo
@jifuyo - Fair enough - if you're allowing two-pass methods, you need at least `O(K*B)` memory to store the bin boundaries, where `K` is the length of the string and `B` is the size of the allowable character set. Still more than quicksort's `O(log N)` in most cases, but pretty insignificant compared to the strings themselves.
Rex Kerr
+2  A: 

Depending on what you need, you might find inserting all the strings into some form of Trie to be the best solution. Even a basic Ternary Search Trie will have a smaller memory footprint than an array/list of strings and will store the strings in sorted order.

Insertion, lookup, and removal are all O(k * log(a)), where a is the size of your alphabet (the number of possible values for a character). Since a is constant, so is log(a), and you end up with an O(n * k) algorithm for sorting.

Edit: In case you are unfamiliar with tries: they are basically n-ary trees where each edge represents a single character of the key. When inserting, you check whether the root node has an edge (or child, whatever) matching the first character of your key. If so, you follow that path and repeat with the second character, and so on. If not, you add a new edge. In a ternary search trie, the edges/children are stored in a binary tree, so the characters are in sorted order and can be searched in log(a) time. If you want to waste memory, you can store the edges/children in an array of size a, which gives you constant lookup at each step.
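As an illustrative sketch of the idea (Python, using plain dicts for children instead of a ternary tree, so per-node lookup is hashed rather than log(a); all names here are made up): insert every string, then walk the trie visiting children in character order, which emits the strings sorted:

```python
def trie_sort(strings):
    # Each node: "end" counts strings ending here (handles duplicates),
    # "kids" maps a character to a child node.
    root = {"end": 0, "kids": {}}
    for s in strings:
        node = root
        for ch in s:
            node = node["kids"].setdefault(ch, {"end": 0, "kids": {}})
        node["end"] += 1

    out = []
    def walk(node, prefix):
        out.extend([prefix] * node["end"])
        # Visiting children in character order yields sorted output;
        # a ternary search trie gets this ordering from its structure.
        for ch in sorted(node["kids"]):
            walk(node["kids"][ch], prefix + ch)
    walk(root, "")
    return out

print(trie_sort(["cab", "ca", "b", "cab"]))
# → ['b', 'ca', 'cab', 'cab']
```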

Niki Yoshiuchi
Well, I can try this approach too. I'm just researching possible ways to speed up my task. Even better if you know of ready-made code for it ;)
jifuyo
A quick Google search revealed a few implementations, but the ones I looked at assumed Roman characters and case-insensitive comparisons. They also used either arrays or lists to store nodes, which is not necessarily the most efficient approach (arrays waste space, lists mean `O(a)` lookup).
Niki Yoshiuchi
I have an implementation of a ternary search trie on my computer (in OCaml, so useless for you) and decided to run some tests. I concatenated a whole bunch of C++ source files together and sorted the result using my trie and List.sort. List.sort was more than 3 times faster. Then I catted that big file with itself 8 times (to create duplicates), after which List.sort was a little less than 3 times as fast (9s to 26s). Finally I made a file with the same line repeated 10^7 times. The trie was twice as fast (27s to 52s).
Niki Yoshiuchi