views:

523

answers:

11

I have a performance problem related to string comparison (in Java).

I'm working on a project that needs to sort a huge list (a TableViewer in Eclipse). Anyway I've pinpointed the bottleneck down to the call to compareTo() for the string to be compared.

Is there any way to optimize the performance of a string compare? I've searched and googled to no avail...

As the project is strictly limited to a Win32 environment, I was thinking that maybe it would be possible to leverage on that...

Any suggestions would be greatly appreciated.

EDIT: I forgot to mention that I would need both numerical comparison and literal comparison of the strings.

EDIT2: The goal is essentially to speed up the user interface because it is unacceptable to wait a few seconds each time you click on the header of the table to perform a sort. I'm looking into maybe caching values somehow to speed up the comparison. As the strings are pretty much static I think it would be possible.

EDIT3: I know a lot of you have been disturbed by the try()-catch() thing. Actually that is less of a concern because even if I remove that code and only execute the catch-block (a single compareTo()) it still executes at virtually the same speed as the original code. If however I comment out the compareTo() also; leaving me with only the overhead of the compare function (getting labels, etc.) it is lightning fast. So I still need a better way to compare strings. Either by caching or by doing some other magic.

Unfortunately it is not possible to change the sorting algorithm - however I doubt that it is that slow because it succeeds in sorting pure integers quite fast.

CLARIFICATION:

The compare function is implemented as part of the TableViewer framework for performing sort operations which means that I'm not implementing the specific sorting algorithm but rather it is implemented by SWT/JFace. I'm only implementing the compare function.

What is further more interesting is the fact that the code for sorting doubles is faster than the string comparison. It is faster to sort columns with only numbers than with actual literal strings.... Which leads me to the conclusion that something fishy is going on in the compareTo() method...

Here is the core of the function:

// e1Label and e2Label is Strings to be compared
//

// Be smart about the comparison and use non-lexical comparison if
// possible (i.e. if both strings are actually numbers...)
//
// Warning: This is only "semi-smart" as the sorting might get "a bit"
// messed up if some of the values in a column can be parsed as
// doubles while others can not...
//
try {
 // Try using numeric (double) comparison of label values
 //
 double e1_double = Double.parseDouble(e1Label);
 double e2_double = Double.parseDouble(e2Label);
 rc = Double.compare(e1_double, e2_double);
} catch (NumberFormatException e) {
 // Use lexical comparison if double comparison is not possible
 //
 rc = e1Label.compareToIgnoreCase(e2Label);
}
+5  A: 

Hi,

Even though the bottleneck seems to be the compareTo() function, it probably stands out in the profiler because it is the function which is called the most in your loop.

It might be beneficial to also know how exactly your sort routine functions. You might be better of changing the sort algorithm since there's much more speed to be gained there.

Toad
Although I totally agree with you on the algorithm part it is out of my control. I've updated the question a bit to reflect the scenario better.
Subtwo
A: 

If you need both "literal" and "numerical" comparison, then what data do those Strings contain? Do they always represent numbers?

If they only contain numbers, then it's probably much faster to store them as numbers (in addition to being the cleaner thing to do).

And if you need "literal" comparison (which i interpret as sorting "100" before "20"), then you can easily implement that on ints or longs with some math that's probably still much faster than String comparison.

Joachim Sauer
The data in the strings is unpredictable meaning it can be both numbers and literals, doubles, integers, names, etc.
Subtwo
+2  A: 

I really doubt that you will be able to speed up String.compareTo() all that much. The solution probably lies in aclling compareTo() less often. But it is impossible to tell you how to do that without knowing more about your algorithm.

Guillaume
Yes, you're probably right. I'll look into if it is possible to do precalculation and caching to speed things up.
Subtwo
+5  A: 

If you have knowledge about your String content you can pre-compute and store additional information to speed up the comparison. For example, suppose your Strings only contained capital letters A-Z. You could assign a rank to the String based on say the first 3 letters; e.g.

  • AAA := 1
  • AAB := 2
  • ...
  • ABA := 27

Then you could speed up your compareTo by first comparing each String's rank (fast int based comparison) and then only performing a full String compare if the ranks were equal.

Adamski
+1 this is actually a pretty good idea. The strings would be pretty much constant over at least a limited period of time.
Subtwo
+3  A: 

It's almost certainly the exceptions that are slowing down the comparison. Throwing and catching an exception is an expensive operation, and you get an exception with every non-numeric cell value.

Consider using a regular expression first to check if the value appears to be numeric, and if not then do not attempt to parse it.

private static final Pattern numberPattern = Pattern.compile("[-+0-9.e]+");

// ...

// e1Label and e2Label is Strings to be compared
//

// Be smart about the comparison and use non-lexical comparison if
// possible (i.e. if both strings are actually numbers...)
//
// Warning: This is only "semi-smart" as the sorting might get "a bit"
// messed up if some of the values in a column can be parsed as
// doubles while others can not...
//
if (numberPattern.matches(e1Label) && numberPattern.matches(e2Label)) {
    try {
        // Try using numeric (double) comparison of label values
        //
        double e1_double = Double.parseDouble(e1Label);
        double e2_double = Double.parseDouble(e2Label);
        rc = Double.compare(e1_double, e2_double);
    } catch (NumberFormatException e) {
        // Use lexical comparison if double comparison is not possible
        //
        rc = e1Label.compareToIgnoreCase(e2Label);
    }
} else {
    rc = e1Label.compareToIgnoreCase(e2Label);
}
finnw
I though so too but I've tried and removed the exception and only did string comparison (i.e. only the last catch block) but the performance was essentially the same curiously enough.
Subtwo
I tried you suggestion but I'm sad to say it doesn't really make a difference in speed. It was a nice try though.
Subtwo
regular expressions are also notoriously slow.
Toad
A: 

As renier and Guillaume already said String.compareTo() is not to blame. It should be slower than numerical compare, but not really that much to matter.

Even if your list is million items long it should not take longer than a second.

If it is an option, I would do the search in background, that is attach some sort of indexing to the strings.

You should really analyze what kind of operations are going to happen most often, single insertions, mass unsorted insertions, mass partially sorted insertions, sorting, deleting, and so on.

Depending on the most common operation, you would choose the more appropriate data structure.

Sint
A: 

Why not sort the list once in the beginning, keeping that updated with an insert sort? Then when you want to change the ordering from ascending to descending, the information is already there. If you want to sort by another column, then just keep the list around incase you switch back this column? Or is that not do-able in SWT? (It's been a while since I used it)

Pod
Maybe it is possible to attach values to each object depending on the column and then use those for caching the position in the list. The hard part is to figure out the actual position when you just get called with in a compare-method given two items at a time.
Subtwo
A: 

It seems to me that what you need to do is avoid calling String.compareTo() as often as you do. There are basically two ways to do this.

1) Implement some form of bucket sort, to avoid performing all those comparisons.

Depending on the number of strings to be sorted (thousands? millions?), using a complete bucket sort might require too much overhead, in terms of space and garbage collections.

To avoid that you could perform a constant rounds of bucket sorts, so the strings are sorted into lists containing all strings where, say, the first 10 letters match. Then you can use the built-in sort to sort each bucket.

2) Make a hash of each string and sort the hashes (making sure to handle collisions). Then you can just reorder the strings afterwards. This is probably the easiest solution.

Using either of these solutions should let you sort millions of strings in less than a second.

Jørgen Fogh
I maybe was a bit vague but implementing the algorithm was not an option for me, I have no doubt that it could prove fruitful if it was possible (I mean possible in the practical - within budget kind of way)
Subtwo
+2  A: 

Don't store the values as String objects. Create your own wrapper that only ever calls Double.parseDouble once for each String. Cache the response (either the value or the Exception). It could probably cache a case-insensitive version of the string too.

Bill Michell
+1  A: 

Based on your recent clarification here is a second answer: Create a class: Item that can be used to represent either a numeric or alphanumeric value, and can determine if this is the case upfront. That way you avoid the overhead of parsing the value and handling any exceptions during your compareTo method call.

public class Item implements Comparable<Item> {
    private final String s;
    private final double d;
    private final boolean numeric;

    public Item(String s) {
        double tmpD;
        boolean tmpNumeric;

        try {
            // Do the work of parsing / catching exceptions *upfront*.
            tmpD = Double.parseDouble(s);
            tmpNumeric = true;
        } catch(NumberFormatException ex) {
            // Parse failed so must be a String.
            tmpD = 0.0;
            tmpNumeric = false;
        }

        this.s = s;
        this.d = tmpD;
        this.numeric = tmpNumeric;
    }

    public String asString() {
        return s;
    }

    public double asDouble() {
        if (!numeric) {
            throw new IllegalStateException("Not a numeric value: " + s);
        }

        return d;
    }

    public boolean isNumeric() {
        return numeric;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Item)) return false;

        Item item = (Item) o;

        return Double.compare(item.d, d) == 0 && s.equals(item.s);
    }

    @Override
    public int hashCode() {
        int result;
        long temp;
        result = s.hashCode();
        temp = d != +0.0d ? Double.doubleToLongBits(d) : 0L;
        result = 31 * result + (int) (temp ^ (temp >>> 32));
        return result;
    }

    public int compareTo(Item item) {
        int ret;

        if (numeric && item.isNumeric()) {
            // Both items are numeric so do fast comparison.
            double diff = d - item.asDouble();
            if (diff > 0.0) {
                ret = 1;
            } else if (diff < 0.0) {
                ret = -1;
            } else {
                ret = 0;
            }
        } else {
            ret = s.compareTo(item.asString());
        }

        return ret;
    }
}
Adamski
Even if I were to eliminate all double conversions and parsing and exceptions it don't make much difference (strange isn't it). If I replace all my code with a single compareTo() call it is still performing horribly slow. So I think the solution has to be to either speed up the actual string comparison or to reduce the need to compare strings. I greatly appreciate your answer though, thanks!
Subtwo
If eliminating the parsing doesn't improve matters, then either most of your Strings can't be parsed to doubles (in which case the code trying to do it is useless and should be eliminated) or the String comparison really isn't making any difference and you should really go and look at the sort algorithm.
DJClayworth
+1  A: 

Even if you can squeeze a bit more performance out of your compareTo(), I think the main problem is the size of the list. Even if, hypothetically, today you can reduce the sorting delay to something acceptable (1 second?), what if next year the application needs to display a list that's twice as big? Sort algorithms are O(n log n), so doubling the size of the list is going to make the sort significantly slower.

For a robust solution, look into virtual tables (using the SWT.VIRTUAL attribute). Then you can implement an underlying data provider which doesn't need to do a full sort up-front. Exactly how you implement it will depend on where your data is coming from. If it's coming from a database, you could consider putting indexes on all the sortable fields. If there's no way to do that, there are other strategies you can use, for example if you have some quick method for dividing the table into chunks (e.g. rows starting with "A", rows starting with "B", etc) then you can start by just extracting the rows in the first chunk, sorting them and displaying them, since the user always begins at the top of the table. Sorting of the subsequent chunks can continue in a background thread.

Todd Owen
+1 Yes. You are right. I would have to consider the possibility of having an even larger list in the future. Unfortunately the time is extremely limited for an advanced implementation with virtual tables at the moment but will probably be needed in the future. Thanks for the suggestion. I'll try to squeeze some performance out of the current solution as it is today and recommend a different approach for the future.
Subtwo