ansaurus

Question

Fastest way to compare strings (literal and numerical)

Answer 1

+5 A:

Hi,

Even though the bottleneck seems to be the compareTo() function, it probably stands out in the profiler because it is the function which is called the most in your loop.

It might be beneficial to also know how exactly your sort routine functions. You might be better of changing the sort algorithm since there's much more speed to be gained there.

Toad 2009-08-27 09:21:20

Although I totally agree with you on the algorithm part it is out of my control. I've updated the question a bit to reflect the scenario better.

Subtwo 2009-08-27 09:37:21

Answer 2

A:

If you need both "literal" and "numerical" comparison, then what data do those Strings contain? Do they always represent numbers?

If they only contain numbers, then it's probably much faster to store them as numbers (in addition to being the cleaner thing to do).

And if you need "literal" comparison (which i interpret as sorting "100" before "20"), then you can easily implement that on ints or longs with some math that's probably still much faster than String comparison.

Joachim Sauer 2009-08-27 09:28:23

The data in the strings is unpredictable meaning it can be both numbers and literals, doubles, integers, names, etc.

Subtwo 2009-08-27 09:50:30

Answer 3

+2 A:

I really doubt that you will be able to speed up String.compareTo() all that much. The solution probably lies in aclling compareTo() less often. But it is impossible to tell you how to do that without knowing more about your algorithm.

Guillaume 2009-08-27 09:28:42

Yes, you're probably right. I'll look into if it is possible to do precalculation and caching to speed things up.

Subtwo 2009-08-27 09:49:15

Answer 4

+5 A:

If you have knowledge about your String content you can pre-compute and store additional information to speed up the comparison. For example, suppose your Strings only contained capital letters A-Z. You could assign a rank to the String based on say the first 3 letters; e.g.

AAA := 1
AAB := 2
...
ABA := 27

Then you could speed up your compareTo by first comparing each String's rank (fast int based comparison) and then only performing a full String compare if the ranks were equal.

Adamski 2009-08-27 09:40:15

+1 this is actually a pretty good idea. The strings would be pretty much constant over at least a limited period of time.

Subtwo 2009-08-27 09:42:52

Answer 5

+3 A:

It's almost certainly the exceptions that are slowing down the comparison. Throwing and catching an exception is an expensive operation, and you get an exception with every non-numeric cell value.

Consider using a regular expression first to check if the value appears to be numeric, and if not then do not attempt to parse it.

private static final Pattern numberPattern = Pattern.compile("[-+0-9.e]+");

// ...

// e1Label and e2Label is Strings to be compared
//

// Be smart about the comparison and use non-lexical comparison if
// possible (i.e. if both strings are actually numbers...)
//
// Warning: This is only "semi-smart" as the sorting might get "a bit"
// messed up if some of the values in a column can be parsed as
// doubles while others can not...
//
if (numberPattern.matches(e1Label) && numberPattern.matches(e2Label)) {
    try {
        // Try using numeric (double) comparison of label values
        //
        double e1_double = Double.parseDouble(e1Label);
        double e2_double = Double.parseDouble(e2Label);
        rc = Double.compare(e1_double, e2_double);
    } catch (NumberFormatException e) {
        // Use lexical comparison if double comparison is not possible
        //
        rc = e1Label.compareToIgnoreCase(e2Label);
    }
} else {
    rc = e1Label.compareToIgnoreCase(e2Label);
}

finnw 2009-08-27 09:44:57

I though so too but I've tried and removed the exception and only did string comparison (i.e. only the last catch block) but the performance was essentially the same curiously enough.

Subtwo 2009-08-27 09:46:41

I tried you suggestion but I'm sad to say it doesn't really make a difference in speed. It was a nice try though.

Subtwo 2009-08-27 09:59:25

regular expressions are also notoriously slow.

Toad 2009-08-27 13:31:59

Answer 6

A:

As renier and Guillaume already said String.compareTo() is not to blame. It should be slower than numerical compare, but not really that much to matter.

Even if your list is million items long it should not take longer than a second.

If it is an option, I would do the search in background, that is attach some sort of indexing to the strings.

You should really analyze what kind of operations are going to happen most often, single insertions, mass unsorted insertions, mass partially sorted insertions, sorting, deleting, and so on.

Depending on the most common operation, you would choose the more appropriate data structure.

Sint 2009-08-27 09:55:24

Answer 7

A:

Why not sort the list once in the beginning, keeping that updated with an insert sort? Then when you want to change the ordering from ascending to descending, the information is already there. If you want to sort by another column, then just keep the list around incase you switch back this column? Or is that not do-able in SWT? (It's been a while since I used it)

Pod 2009-08-27 09:56:03

Maybe it is possible to attach values to each object depending on the column and then use those for caching the position in the list. The hard part is to figure out the actual position when you just get called with in a compare-method given two items at a time.

Subtwo 2009-08-27 10:09:27

Answer 8

A:

It seems to me that what you need to do is avoid calling String.compareTo() as often as you do. There are basically two ways to do this.

1) Implement some form of bucket sort, to avoid performing all those comparisons.

Depending on the number of strings to be sorted (thousands? millions?), using a complete bucket sort might require too much overhead, in terms of space and garbage collections.

To avoid that you could perform a constant rounds of bucket sorts, so the strings are sorted into lists containing all strings where, say, the first 10 letters match. Then you can use the built-in sort to sort each bucket.

2) Make a hash of each string and sort the hashes (making sure to handle collisions). Then you can just reorder the strings afterwards. This is probably the easiest solution.

Using either of these solutions should let you sort millions of strings in less than a second.

Jørgen Fogh 2009-08-27 09:58:49

I maybe was a bit vague but implementing the algorithm was not an option for me, I have no doubt that it could prove fruitful if it was possible (I mean possible in the practical - within budget kind of way)

Subtwo 2009-08-27 10:24:08

Answer 9

+2 A:

Don't store the values as String objects. Create your own wrapper that only ever calls Double.parseDouble once for each String. Cache the response (either the value or the Exception). It could probably cache a case-insensitive version of the string too.

Bill Michell 2009-08-27 09:59:30

Answer 10

+1 A:

Based on your recent clarification here is a second answer: Create a class: Item that can be used to represent either a numeric or alphanumeric value, and can determine if this is the case upfront. That way you avoid the overhead of parsing the value and handling any exceptions during your compareTo method call.

public class Item implements Comparable<Item> {
    private final String s;
    private final double d;
    private final boolean numeric;

    public Item(String s) {
        double tmpD;
        boolean tmpNumeric;

        try {
            // Do the work of parsing / catching exceptions *upfront*.
            tmpD = Double.parseDouble(s);
            tmpNumeric = true;
        } catch(NumberFormatException ex) {
            // Parse failed so must be a String.
            tmpD = 0.0;
            tmpNumeric = false;
        }

        this.s = s;
        this.d = tmpD;
        this.numeric = tmpNumeric;
    }

    public String asString() {
        return s;
    }

    public double asDouble() {
        if (!numeric) {
            throw new IllegalStateException("Not a numeric value: " + s);
        }

        return d;
    }

    public boolean isNumeric() {
        return numeric;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Item)) return false;

        Item item = (Item) o;

        return Double.compare(item.d, d) == 0 && s.equals(item.s);
    }

    @Override
    public int hashCode() {
        int result;
        long temp;
        result = s.hashCode();
        temp = d != +0.0d ? Double.doubleToLongBits(d) : 0L;
        result = 31 * result + (int) (temp ^ (temp >>> 32));
        return result;
    }

    public int compareTo(Item item) {
        int ret;

        if (numeric && item.isNumeric()) {
            // Both items are numeric so do fast comparison.
            double diff = d - item.asDouble();
            if (diff > 0.0) {
                ret = 1;
            } else if (diff < 0.0) {
                ret = -1;
            } else {
                ret = 0;
            }
        } else {
            ret = s.compareTo(item.asString());
        }

        return ret;
    }
}

Adamski 2009-08-27 10:07:36

Even if I were to eliminate all double conversions and parsing and exceptions it don't make much difference (strange isn't it). If I replace all my code with a single compareTo() call it is still performing horribly slow. So I think the solution has to be to either speed up the actual string comparison or to reduce the need to compare strings. I greatly appreciate your answer though, thanks!

Subtwo 2009-08-27 10:14:27

If eliminating the parsing doesn't improve matters, then either most of your Strings can't be parsed to doubles (in which case the code trying to do it is useless and should be eliminated) or the String comparison really isn't making any difference and you should really go and look at the sort algorithm.

DJClayworth 2009-08-27 14:00:41

Answer 11

+1 A:

Even if you can squeeze a bit more performance out of your compareTo(), I think the main problem is the size of the list. Even if, hypothetically, today you can reduce the sorting delay to something acceptable (1 second?), what if next year the application needs to display a list that's twice as big? Sort algorithms are O(n log n), so doubling the size of the list is going to make the sort significantly slower.

For a robust solution, look into virtual tables (using the SWT.VIRTUAL attribute). Then you can implement an underlying data provider which doesn't need to do a full sort up-front. Exactly how you implement it will depend on where your data is coming from. If it's coming from a database, you could consider putting indexes on all the sortable fields. If there's no way to do that, there are other strategies you can use, for example if you have some quick method for dividing the table into chunks (e.g. rows starting with "A", rows starting with "B", etc) then you can start by just extracting the rows in the first chunk, sorting them and displaying them, since the user always begins at the top of the table. Sorting of the subsequent chunks can continue in a background thread.

Todd Owen 2009-08-27 10:33:56

+1 Yes. You are right. I would have to consider the possibility of having an even larger list in the future. Unfortunately the time is extremely limited for an advanced implementation with virtual tables at the moment but will probably be needed in the future. Thanks for the suggestion. I'll try to squeeze some performance out of the current solution as it is today and recommend a different approach for the future.

Subtwo 2009-08-27 10:40:57

ansaurus

tags:

views:

answers:

Fastest way to compare strings (literal and numerical)

related questions