ansaurus

Question

Check my anagram code from a job interview in the past.

Answer 1

+9 A:

The best way to group anagrams is to map the strings to some sort of histogram representation.

 FUNCTION histogram
 [input] -> [output]

 "dog"   -> (1xd, 1xg, 1xo)
 "god"   -> (1xd, 1xg, 1xo)
 "foo"   -> (1xf, 2xo)

Basically, with a linear scan of a string, you can produce the histogram representation of how many of each letters it contains. A small, finite alphabet makes this even easier (e.g. with A-Z, you just have an array of 26 numbers, one for each letter).

Now, anagrams are simply words that have the same histogram.

Then you can have a multimap data structure that maps a histogram to a list of words that have that histogram.

MULTIMAP
[key]           => [set of values]

(1xd, 1xg, 1xo) => { "dog", "god" }
(1xf, 2xo)      => { "foo" }

The canonical form trick

Instead of working on the histograms, you can also work on the "canonical form" of the strings. Basically, you define for each string, what its canonical form is, and two words are anagrams if they have the same canonical form.

One convenient canonical form is to have the letters in the string in sorted order.

FUNCTION canonize
[input]  -> [output]

"dog"    -> "dgo"
"god"    -> "dgo"
"abracadabra" -> "aaaaabbcdrr"

Note that this is just one step after the histogram approach: you're essentially doing counting sort to sort the letters.

This is the most practical solution in actual programming language to your problem.

Complexity

Producing the histogram/canonical form of a word is practically O(1) (finite alphabet size, finite maximum word length).

With a good hash implementation, get and put on the multimap is O(1).

You can even have multiple multimaps, one for each word length.

If there are N words, putting all the words into the multimaps is therefore O(N); then outputting each anagram group is simply dumping the values in the multimaps. This too can be done in O(N).

This is certainly better than checking if each pair of word are anagrams (an O(N^2) algorithm).

polygenelubricants 2010-04-25 03:41:19

I originally started going down this route - counting letters and such, but was told it would be way too slow. Hmm. Isn't a multimap a O(nlogn) lookup? You'd have to create a hash to get it close to o(n) on the lookup with another n on the creation which is still linear.

Michael Dorgan 2010-04-25 03:45:44

@Michael: if you can fix your code and/or explain your algorithm in pseudocode, I can comment on complexity a lot more. If you're not using any data structure at all, then your algorithm is probably `O(N^2)`, but I can't confirm that yet. Multimap can have `O(1)` amortized access. Also, I'm not sure what `n` is referring to your in comment.

polygenelubricants 2010-04-25 03:54:33

Runs fine on my end - strange. I could post my driver if need be.The psuedo code is sort the strings (assuming O(nlogn) there), then run through the sorted list and check for identical sorted strings. If it exists, then remove it so that future scans don't have to double up work. Worst case on that run is O^2 / 2 if not matches are found and O(n) best case if every string matches.

Michael Dorgan 2010-04-25 04:05:49

Great comments. Thanks for the info. I learned a lot!

Michael Dorgan 2010-04-25 04:23:01

@Michael: Additionally, a `multimap<K,V>` can be emulated by a `map<K,set<V>>`. I'm sure C++ has at least a good `map` and `set` implementation. If not, you can just put pairs of `[word, canonize(word)]` in a list, and sort based on the second part of the pair. Anagrams would then appear next to each other in the sorted list. This would be `O(N log N)` total.

polygenelubricants 2010-04-25 04:30:03

@Michael: I added the link to an older question that has a nice C# solution from Jon Skeet that should be easily translatable to C++.

polygenelubricants 2010-04-25 04:39:23

Wow, my paste was so bad that my typedef got mangled. I am using pairs in a list for my method :)

Michael Dorgan 2010-04-25 04:45:05

Answer 2

+1 A:

I'll just treat it as a free function get_anagrams since class Anagram doesn't appear to do anything else.

The choice of list and set aren't any better than anything else, so while I'm at it I'll make them both vector. Actually, the output can be just one flat sequence. So call it anagram_sort. I'll take the specification of "list" and "set" as the basic constructs rather than literal class templates. Also "given … return" can be loosely interpreted as modifying the input in-place. The caller can create a copy if they like.

struct anagram_less { bool operator()( string const &l, string const &r ) {
    string sl( l ), sr( r );
    sort( sl.begin(), sl.end() );
    sort( sr.begin(), sr.end() );
    return sl < sr;
} };

void anagram_sort( vector< string > &v ) { // "the answer"
    sort( v.begin(), v.end(), anagram_less );
} // 10 lines

   // usage:

void print_anagrams( vector< string > &&v ) { // destructive => use rvalue ref
    anagram_sort( v );

    for ( vector< string >::iterator i = v.begin(); i != v.end(); /*++i*/ ) {

        vector< string >::iterator e // find end of run of anagrams
            = adjacent_find( i, v.end(), anagram_less() );
        if ( e != v.end() ) ++ e; // usually need this after adjacent_find :v(

        cout << "( ";
        for ( ; i != e; ++ i ) {
            cout << * i << " ";
        }
        cout << ")" << endl;
    }
}

This is suboptimal as it repetitively sorts the words. I'd rather cut the fat in an interview question than make something that's fast "on paper," though.

To squeeze that last bit of performance out, I'd probably make a copy of the input with serial-number indexes attached, sort the characters of the strings, sort the strings, and then use the indexes to reorder the input vector using this algorithm. Final runtime a low-coefficient O(n log n), as it should be.

Potatoswatter 2010-04-25 05:45:29

ansaurus

tags:

views:

answers:

Check my anagram code from a job interview in the past.

The canonical form trick

Complexity

related questions