views: 1414
answers: 13

I'm checking if two strings a and b are permutations of each other, and I'm wondering what the ideal way to do this is in Python. From the Zen of Python, "There should be one -- and preferably only one -- obvious way to do it," but I see there are at least two ways:

sorted(a) == sorted(b)

and

all(a.count(char) == b.count(char) for char in a)

but the first one is slower when (for example) the first char of a is nowhere in b, and the second is slower when they are actually permutations.

Is there any better (either in the sense of more Pythonic, or in the sense of faster on average) way to do it? Or should I just choose from these two depending on which situation I expect to be most common?

+7  A: 

I think the first one is the "obvious" way. It is shorter, clearer, and likely to be faster in many cases because Python's built-in sort is highly optimized.

Robert Gamble
But complexity is more important for large strings.
Um, the first solution has a smaller time complexity than the second one...
Robert Gamble
+6  A: 

Your second example won't actually work:

all(a.count(char) == b.count(char) for char in a)

will only work if b does not contain extra characters not in a. It also does duplicate work if the characters in string a repeat.
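To make the failure concrete (the values of a and b here are illustrative, not from the question):

```python
a = "ab"
b = "abc"

# Every character of a has matching counts in both strings,
# so the check wrongly reports that a and b are permutations:
print(all(a.count(char) == b.count(char) for char in a))  # True, but wrong
```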

If you want to know whether two strings are permutations of the same unique characters, just do:

set(a) == set(b)

To correct your second example:

all(a.count(char) == b.count(char) for char in set(a) | set(b))

set() objects overload the bitwise OR operator so that the expression evaluates to the union of both sets. This ensures that you loop over each distinct character of both strings exactly once.

That said, the sorted() method is much simpler and more intuitive, and would be what I would use.

ʞɔıu
Thanks! I didn't realize you could do the operator overloading, but now I see in the docs there are a bunch of overloaded operators too.
Kiv
+9  A: 

Here is a way which is O(n), asymptotically better than the two ways you suggest.

import collections

def same_permutation(a, b):
    d = collections.defaultdict(int)
    for x in a:
        d[x] += 1
    for x in b:
        d[x] -= 1
    return not any(d.itervalues())

## same_permutation([1,2,3],[2,3,1])
#. True

## same_permutation([1,2,3],[2,3,1,1])
#. False
namin
for Python 3.0, replace .itervalues() with just .values()
Lasse V. Karlsen
This is significantly longer and more complex than necessary and turns out to be slower in every case I tested than the simple "sorted" solution. I can't find anything "better" about it.
Robert Gamble
I was going to ask HAVE YOU MEASURED?? Thanks, Robert.
Norman Ramsey
Actually, when you run it against 20,000,000 characters, namin's version is 50% faster. So if you're looking for permutations in Finnish, it might be worth a look :)
Alabaster Codify
As namin said, it is asymptotically better - and you do not need to measure, just to demonstrate that it is O(n) against O(n log(n)).
Roberto Liffredo
Thanks for the algorithm, my strings are actually not long enough that it is faster, but I will remember it for the future. Am I right that you can express this using collections.defaultdict(int)? I modified it to use this and it was marginally faster.
Kiv
The last 4 lines can be replaced with "return not any(d.itervalues())"
Demur Rumed
+2  A: 

Go with the first one - it's much more straightforward and easier to understand. If you're actually dealing with incredibly large strings and performance is a real issue, then don't use Python, use something like C.

As far as the Zen of Python is concerned, the idea that there should be only one obvious way to do things refers to small, simple things. Obviously, for any sufficiently complicated task there will always be zillions of small variations on ways to do it.

Adam Rosenfield
The sorted function is written in optimized C so performance shouldn't be an issue.
Robert Gamble
+2  A: 

Here is sample code with two methods - 1. using sorted, and 2. the counting method defined by namin.

a='confused'
b='unfocused'
c='foncused'
sort_method=lambda x,y: sorted(x)==sorted(y)
def same_permutation(a, b):
    d = {}
    for x in a:
        d[x] = d.get(x, 0) + 1
    for x in b:
        d[x] = d.get(x, 0) - 1
    for v in d.itervalues():
        if v != 0:
            return False
    return True

the average run times (over 100000 loops) of the 2 methods are:

for a non match a,b

jv@Pioneer:~/jai/thekid$ python -m timeit -s 'import temp' 'temp.sort_method(temp.a,temp.b)'
100000 loops, best of 3: 9.72 usec per loop
jv@Pioneer:~/jai/thekid$ python -m timeit -s 'import temp' 'temp.same_permutation(temp.a,temp.b)'
10000 loops, best of 3: 28.1 usec per loop

for a match a,c

jv@Pioneer:~/jai/thekid$ python -m timeit -s 'import temp' 'temp.sort_method(temp.a,temp.c)'
100000 loops, best of 3: 9.47 usec per loop
jv@Pioneer:~/jai/thekid$ python -m timeit -s 'import temp' 'temp.same_permutation(temp.a,temp.c)'
100000 loops, best of 3: 24.6 usec per loop

Here the strings are small. Run the timer against your real strings and decide.

Edit1: Before answering, I thought the O(n) method from namin was the one to use, but it seems its efficiency only matters when the strings grow large. In any case, you have two methods and a timer - choose for yourself according to your data.

Edit2: removed an incorrect solution

JV
The set method will return true for "aa" and "a" where the other methods will not. This is probably the best solution if this is the desired behavior, the sorted solution is probably the best one if it is not.
Robert Gamble
@Robert: it's not set() alone, it's "set() and len()", so "aa" and "a" will return False in the first method.
JV
@Norman: agreed. But still I will leave the answer for comparing the rest of the two methods.
JV
@Norman: removed the set_n_len method and left the comparison of the other two in place. Marked as community wiki
JV
@JV: thanks; deleted my comment
Norman Ramsey
+12  A: 

"but the first one is slower when (for example) the first char of a is nowhere in b".

This kind of degenerate-case performance analysis is not a good idea. It's a rat-hole of lost time thinking up all kinds of obscure special cases.

Only do the O-style "overall" analysis.

Overall, the sorts are O(n log n).

The a.count(char) for char in a solution is O(n^2). Each count pass is a full examination of the string.

If some obscure special case happens to be faster -- or slower, that's possibly interesting. But it only matters when you know the frequency of your obscure special cases. When analyzing sort algorithms, it's important to note that a fair number of sorts involve data that's already in the proper order (either by luck or by a clever design), so sort performance on pre-sorted data matters.

In your obscure special case ("the first char of a is nowhere in b") is this frequent enough to matter? If it's just a special case you thought of, set it aside. If it's a fact about your data, then consider it.

S.Lott
this makes a great point.
Horace Loeb
+1 for average and edge case elaboration.
JV
Note that the first one is actually *incorrect* when b contains characters not in a (see Nick's answer). Thinking of all possible "degenerate" and "obscure special cases" is important for correctness. :)
ShreevatsaR
@ShreevatsaR: Correctness is unrelated to performance. I'm not addressing correctness in saying that obscure special cases don't matter. I'm specifically addressing performance and only performance.
S.Lott
I'm pretty sure you can make the counting of characters O(n) by doing some bucket counting with an array of size 26. Then you can probably do the inverse for string b and return true iff each of the 26 indices in the array is 0. Totally O(n) where n is len(a) + len(b), or len(a) + len(b) + 26 ops.
Pål GD
+5  A: 

Heuristically, you're probably better off splitting them based on string size.

Pseudocode:

returnvalue = false
if len(a) == len(b)
   if len(a) < threshold
      returnvalue = (sorted(a) == sorted(b))
   else
       returnvalue = naminsmethod(a, b)
return returnvalue

If performance is critical, and string size can be large or small then this is what I'd do.

It's pretty common to split things like this based on input size or type. Algorithms have different strengths or weaknesses and it would be foolish to use one where another would be better... In this case Namin's method is O(n), but has a larger constant factor than the O(n log n) sorted method.
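A minimal Python sketch of this dispatch; THRESHOLD is a placeholder value you would tune by benchmarking on your own data, and counting_method stands in for namin's dict-based version:

```python
from collections import defaultdict

THRESHOLD = 1000000  # placeholder; tune by timing on your own data

def counting_method(a, b):
    # namin's O(n) counting approach
    d = defaultdict(int)
    for x in a:
        d[x] += 1
    for x in b:
        d[x] -= 1
    return not any(d.values())

def same_permutation(a, b):
    # Cheap length check first, then dispatch on size.
    if len(a) != len(b):
        return False
    if len(a) < THRESHOLD:
        return sorted(a) == sorted(b)
    return counting_method(a, b)
```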

Good plan, in my tests Namin's method becomes faster for string lengths over 1 million. All my strings are shorter than that, so I will use sorted for now and keep Namin's in mind.
Kiv
Instead of setting returnvalue, just return.
recursive
A: 

This is a PHP function I wrote about a week ago which checks if two words are anagrams. How would this compare (if implemented the same in python) to the other methods suggested? Comments?

public function is_anagram($word1, $word2) {
    $letters1 = str_split($word1);
    $letters2 = str_split($word2);
    if (count($letters1) == count($letters2)) {
        foreach ($letters1 as $letter) {
            $index = array_search($letter, $letters2);
            if ($index !== false) {
                unset($letters2[$index]);
            }
            else { return false; }
        }
        return true;
    }
    return false;        
}


Here's a literal translation to Python of the PHP version (by JFS):

def is_anagram(word1, word2):
    letters2 = list(word2)
    if len(word1) == len(word2):
       for letter in word1:
           try:
               del letters2[letters2.index(letter)]
           except ValueError:
               return False               
       return True
    return False

Comments:

    1. The algorithm is O(N**2). Compare it to @namin's version (it is O(N)).
    2. The multiple returns in the function look horrible.
Josh Smeaton
I've added a Python version.
J.F. Sebastian
Yeah, I didn't like the multiple returns either. For what I had in mind I wanted a quick escape if false and didn't really know how to refactor that part.
Josh Smeaton
A: 

This version is faster than any of the examples presented so far, except that it is 20% slower than sorted(x) == sorted(y) for short strings. It depends on the use case, but generally a 20% performance gain is insufficient to justify complicating the code by using different versions for short and long strings (as in @patros's answer).

It doesn't use len, so it accepts any iterable; therefore it works even for data that does not fit in memory, e.g., given two big text files with many repeated lines, it answers whether the files have the same lines (the lines can be in any order).

def isanagram(iterable1, iterable2):
    d = {}
    get = d.get
    for c in iterable1:
        d[c] = get(c, 0) + 1
    try:
        for c in iterable2:
            d[c] -= 1
        return not any(d.itervalues())
    except KeyError:
        return False
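Since it only iterates its arguments, the same approach works on arbitrary iterables such as generators; here is the function restated with .values() (for Python 3 compatibility) and a quick demonstration:

```python
def isanagram(iterable1, iterable2):
    # Count items from the first iterable, then decrement for the second;
    # a KeyError means the second iterable has an item the first lacks.
    d = {}
    get = d.get
    for c in iterable1:
        d[c] = get(c, 0) + 1
    try:
        for c in iterable2:
            d[c] -= 1
        return not any(d.values())
    except KeyError:
        return False

# Works on generators that never materialize a full list in memory:
evens = (x * 2 for x in range(5))       # 0, 2, 4, 6, 8
shuffled = (x for x in [6, 0, 8, 2, 4])
print(isanagram(evens, shuffled))  # True
```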

It is unclear why this version is faster than the defaultdict one (@namin's) for large iterable1 (tested on a 25MB thesaurus).

If we replace get in the loop with try: ... except KeyError, then it performs two times slower for short strings, i.e., when there are few duplicates.

J.F. Sebastian
I think it's because if we find a letter in B that is not in A, yours returns False early whereas Namin's always examines all of both iterables.
Kiv
@Kiv: I've checked it when both A and B have the same letters (but maybe different numbers of them; in this case KeyError is never raised), and still the variant with an ordinary dictionary is faster than the defaultdict one.
J.F. Sebastian
A: 

Sorry that my code is not in Python (I have never used it), but I am sure it can easily be translated. I believe this is faster than all the other examples already posted. It is also O(n), but stops as soon as possible:

public boolean isPermutation(String a, String b) {
    if (a.length() != b.length()) {
        return false;
    }

    int[] charCount = new int[256];
    for (int i = 0; i < a.length(); ++i) {
        ++charCount[a.charAt(i)];
    }

    for (int i = 0; i < b.length(); ++i) {
        if (--charCount[b.charAt(i)] < 0) {
            return false;
        }
    }
    return true;
}

First I don't use a dictionary but an array of size 256 for all the characters. Accessing the index should be much faster. Then when the second string is iterated, I immediately return false when the count gets below 0. When the second loop has finished, you can be sure that the strings are a permutation, because the strings have equal length and no character was used more often in b compared to a.

martinus
This is actually slower in Python than sorted(), Namin's, or JF's (as well as being restricted to ASCII). The main problem is that in Java you can say charCount['a'] and it will automatically cast the char to an integer for you; in Python you have to use a lot of calls to ord() instead.
Kiv
oh, I did not know that. In Java this code would probably be the fastest.
martinus
A: 

Here's martinus' code in Python. It only works for ASCII strings:

def is_permutation(a, b):
    if len(a) != len(b):
        return False

    char_count = [0] * 256
    for c in a:
        char_count[ord(c)] += 1

    for c in b:
        char_count[ord(c)] -= 1
        if char_count[ord(c)] < 0:
            return False

    return True
recursive
+1  A: 

I did a pretty thorough comparison in Java with all words in a book I had. The counting method beats the sorting method in every way. The results:

Testing against 9227 words.

Permutation testing by sorting ... done.     18.582 s
Permutation testing by counting ... done.    14.949 s

If anyone wants the algorithm and test data set, comment away.

Pål GD
i.e. 85137529 comparisons of 9227 unique words
Pål GD
A: 

In Python 3.1/2.7 you can just use collections.Counter(a) == collections.Counter(b).

But sorted(a) == sorted(b) is still the most obvious IMHO. You are talking about permutations - changing order - so sorting is the obvious operation to erase that difference.
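For example, Counter compares the strings as multisets of characters:

```python
from collections import Counter

# Equal iff every character occurs the same number of times in both strings.
print(Counter("listen") == Counter("silent"))  # True
print(Counter("aab") == Counter("abb"))        # False: 'a' and 'b' counts differ
```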

Beni Cherniavsky-Paskin
That builds two dicts; Namin's method uses only one.
John Machin
@John Machin: true, but unless it's Unicode, the dicts are no bigger than 256 entries, so who cares? Since the counting loop is done in C, it's likely to be faster [not tested].
Beni Cherniavsky-Paskin
collections.Counter looks simpler but it performs a lot worse than the "sorted" method, at least from what I'm seeing on Python 3.1 on Windows, with relatively short strings
Ricardo Reyes