Hi,

The basic idea is to sort the characters of each string and compare the resulting signatures, where a string's signature is its characters in alphabetical order.

What would be an efficient algorithm for this?

Thanks.

A: 

You don't specify the programming language or the encoding of the strings (ASCII, Latin-1, UTF-8, UTF-16, etc.), but basically your compare function would need to sort the characters in each string and then return the result of comparing the sorted strings. Alternatively, it can sum the ordinal values of the characters in each string and compare the two integers, which is cheaper but can only rule a match out: different character sets can produce the same sum ("ad" and "bc", for example).
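As a rough illustration of the sort-and-compare approach, here is a minimal Java sketch (Java being the language the asker names in the comment below). The class and method names are made up for this example. It works on Unicode code points rather than `char` values so that characters outside the Basic Multilingual Plane are kept whole. The `ordinalSum` helper is only good for rejecting pairs, since unequal strings can share a sum.

```java
import java.util.Arrays;

// Hypothetical helper class, named for illustration only.
class SortedSignature {
    // Sorts the code points of s; two anagrams produce the same signature.
    static String signature(String s) {
        int[] cps = s.codePoints().toArray();
        Arrays.sort(cps);
        return new String(cps, 0, cps.length);
    }

    static boolean isAnagram(String a, String b) {
        return signature(a).equals(signature(b));
    }

    // Sum of code point values: different sums prove "not an anagram",
    // but equal sums prove nothing.
    static long ordinalSum(String s) {
        return s.codePoints().asLongStream().sum();
    }
}
```

For instance, "ad" and "bc" have the same ordinal sum (97 + 100 and 98 + 99) even though they are not anagrams, while their sorted signatures differ.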

John Cavan
I am looking for a Java solution, and the strings are UTF-8.
Rachel
+2  A: 

If you are sorting the UTF-8 characters "alphabetically", you can convert them to 32-bit integers (a UTF-8 character is encoded as 1 to 4 bytes) and then do a radix sort. It runs in O(N) time. If you were using just ASCII, I would suggest counting sort.
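A possible shape for that radix sort in Java, as a sketch rather than a tuned implementation: the string is decoded to code points (which Java exposes as non-negative `int` values up to 0x10FFFF), and a least-significant-digit radix sort processes one byte per pass. Three passes cover the 21 bits a code point can use. The class name is an assumption for this example.

```java
// Hypothetical class name; an LSD radix sort over Unicode code points.
class CodePointRadixSort {
    // Sorts non-negative ints below 2^24 using three stable counting passes.
    static int[] sort(int[] values) {
        int[] in = values.clone();
        int[] out = new int[in.length];
        for (int shift = 0; shift < 24; shift += 8) {
            int[] count = new int[257];
            // Histogram of the current byte, offset by one slot.
            for (int v : in) count[((v >>> shift) & 0xFF) + 1]++;
            // Prefix sums turn counts into starting positions.
            for (int i = 0; i < 256; i++) count[i + 1] += count[i];
            // Stable scatter into the output buffer.
            for (int v : in) out[count[(v >>> shift) & 0xFF]++] = v;
            int[] tmp = in; in = out; out = tmp;
        }
        return in;
    }

    // Signature of a string: its code points in sorted order.
    static String signature(String s) {
        int[] sorted = sort(s.codePoints().toArray());
        return new String(sorted, 0, sorted.length);
    }
}
```

For short strings the constant cost of three 256-bucket passes may outweigh a comparison sort, so this is worth measuring before adopting.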

There are many ways to match the signatures, but I would use a hash table (O(1) lookup on average) or an O(log N) structure such as a red-black tree or a skip list.
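A hash-table version of the matching step might look like the following Java sketch, which groups words by signature using a `HashMap`. The class and method names are invented for illustration, and the signature here comes from a plain `Arrays.sort` on code points for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class name; groups words that share a sorted signature.
class AnagramGroups {
    static String signature(String s) {
        int[] cps = s.codePoints().toArray();
        Arrays.sort(cps);
        return new String(cps, 0, cps.length);
    }

    // Maps each signature to the list of input words that produce it.
    static Map<String, List<String>> group(List<String> words) {
        Map<String, List<String>> groups = new HashMap<>();
        for (String w : words) {
            groups.computeIfAbsent(signature(w), k -> new ArrayList<>()).add(w);
        }
        return groups;
    }
}
```

Swapping `HashMap` for `java.util.TreeMap` (a red-black tree) would give the O(log N) variant and keep the signatures in sorted order.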

To further speed up your string matching, you can compress these signatures by run-length encoding the UTF-8 characters (since they're sorted, the signature will be runs + gaps). You could go further and compress them with bit tags that distinguish 7-bit chars (the most common case), RLE runs, and longer literals (8-bit through 32-bit chars). Comparing the compressed strings would be faster.
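To make the run-length idea concrete, here is a small Java sketch that encodes a sorted signature as alternating (code point, run length) pairs. It covers only plain runs, not the gap or bit-tag encoding, and the class name is hypothetical. Whether comparing the encoded form is actually faster depends on how often characters repeat, which this sketch does not measure.

```java
import java.util.Arrays;

// Hypothetical class name; run-length encodes a sorted code point signature.
class RunLengthSignature {
    // Returns {cp0, count0, cp1, count1, ...} for the sorted code points of s.
    static int[] encode(String s) {
        int[] cps = s.codePoints().sorted().toArray();
        int[] out = new int[cps.length * 2];
        int n = 0;
        for (int i = 0; i < cps.length; ) {
            int j = i;
            // Advance j to the end of the current run of equal code points.
            while (j < cps.length && cps[j] == cps[i]) j++;
            out[n++] = cps[i];
            out[n++] = j - i;
            i = j;
        }
        return Arrays.copyOf(out, n);
    }

    static boolean isAnagram(String a, String b) {
        return Arrays.equals(encode(a), encode(b));
    }
}
```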

Adisak
A: 

The question looks similar to one asked here, to which my answer was:

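#include <stdbool.h>
#include <string.h>

#define NUM_ALPHABETS 256
int alphabets[NUM_ALPHABETS];

/* Counts each byte of src, then subtracts each byte of dest. */
bool isAnagram(const char *src, const char *dest) {
    size_t len1 = strlen(src);
    size_t len2 = strlen(dest);
    size_t i;

    if (len1 != len2)
        return false;

    memset(alphabets, 0, sizeof(alphabets));
    /* Cast to unsigned char so bytes >= 0x80 never give a negative index. */
    for (i = 0; i < len1; i++)
        alphabets[(unsigned char)src[i]]++;
    for (i = 0; i < len2; i++) {
        alphabets[(unsigned char)dest[i]]--;
        /* A negative count means dest has a byte that src lacks. */
        if (alphabets[(unsigned char)dest[i]] < 0)
            return false;
    }

    return true;
}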
Ashwin
This is a clever way of using counting sort twice (the second time decrementing). It works great with ASCII, but not so well with UTF-8 (where a character can take 8, 16, 24, or 32 bits). Still, as I said, it's an interesting example of repurposing counting sort to find anagrams.
Adisak
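For the Java and UTF-8 case raised earlier in the thread, the same increment-then-decrement idea can be sketched with a hash map keyed by code point instead of a 256-entry array. This is an adaptation for illustration, not the answer's original code, and the class name is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class name; counting-based anagram test over code points.
class CodePointCounter {
    static boolean isAnagram(String a, String b) {
        // Different numbers of code points can never be anagrams.
        if (a.codePointCount(0, a.length()) != b.codePointCount(0, b.length())) {
            return false;
        }
        Map<Integer, Integer> counts = new HashMap<>();
        // First pass: count every code point in a.
        a.codePoints().forEach(cp -> counts.merge(cp, 1, Integer::sum));
        // Second pass: consume counts with b; running out means a mismatch.
        for (int cp : b.codePoints().toArray()) {
            Integer left = counts.get(cp);
            if (left == null || left == 0) {
                return false;
            }
            counts.put(cp, left - 1);
        }
        return true;
    }
}
```

The map grows only with the number of distinct characters actually seen, so it avoids allocating a table for the whole Unicode range.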