The best way to group anagrams is to map the strings to some sort of histogram representation.
FUNCTION histogram
[input] -> [output]
"dog" -> (1xd, 1xg, 1xo)
"god" -> (1xd, 1xg, 1xo)
"foo" -> (1xf, 2xo)
Basically, with a linear scan of a string, you can produce the histogram representation of how many of each letters it contains. A small, finite alphabet makes this even easier (e.g. with A-Z
, you just have an array of 26 numbers, one for each letter).
Now, anagrams are simply words that have the same histogram.
Then you can have a multimap data structure that maps a histogram to a list of words that have that histogram.
MULTIMAP
[key] => [set of values]
(1xd, 1xg, 1xo) => { "dog", "god" }
(1xf, 2xo) => { "foo" }
The canonical form trick
Instead of working on the histograms, you can also work on the "canonical form" of the strings. Basically, you define for each string, what its canonical form is, and two words are anagrams if they have the same canonical form.
One convenient canonical form is to have the letters in the string in sorted order.
FUNCTION canonize
[input] -> [output]
"dog" -> "dgo"
"god" -> "dgo"
"abracadabra" -> "aaaaabbcdrr"
Note that this is just one step after the histogram approach: you're essentially doing counting sort to sort the letters.
This is the most practical solution in actual programming language to your problem.
Complexity
Producing the histogram/canonical form of a word is practically O(1)
(finite alphabet size, finite maximum word length).
With a good hash implementation, get
and put
on the multimap is O(1)
.
You can even have multiple multimaps, one for each word length.
If there are N
words, putting all the words into the multimaps is therefore O(N)
; then outputting each anagram group is simply dumping the values in the multimaps. This too can be done in O(N)
.
This is certainly better than checking if each pair of word are anagrams (an O(N^2)
algorithm).