views:

1986

answers:

8

Okay this is not a question of "how to get all uniques" or "How to remove duplicates from my array in php". This is a question about the time complexity.

I figured that the array_unique is somewhat O(n^2 - n) and here's my implementation:

function array_unique2($array) 
{ 
    $to_return = array(); 
 $current_index = 0;

    for ( $i = 0 ; $i < count($array); $i++ ) 
    { 
        $current_is_unique = true; 

        for ( $a = $i+1; $a < count($array); $a++ ) 
        { 
            if ( $array[$i] == $array[$a] ) 
            { 
                $current_is_unique = false; 
                break; 
            } 
        } 
        if ( $current_is_unique ) 
        { 
   $to_return[$current_index] = $array[$i];
        } 

    } 

    return $to_return; 
}

However when benchmarking this against the array_unique i got the following result:

Testing (array_unique2)... Operation took 0.52146291732788 s.

Testing (array_unique)... Operation took 0.28323101997375 s.

Which makes the array_unique twice as fast, my question is, why ( Both had the same random data ) ?

And a friend of mine wrote the following:

function array_unique2($a)
{
    $n = array();
    foreach ($a as $k=>$v)
        if (!in_array($v,$n))
            $n[$k]=$v;
    return $n;
}

which is twice as fast as the built in one in php.

I'd like to know, why?

What is the time-complexity of array_unique and in_array?

Edit I removed the count($array) from both loops and just used a variable in the top of the function, that gained 2 seconds on 100 000 elements!

+9  A: 

While I can't speak for the native array_unique function, I can tell you that your friends algorithm is faster because:

  1. He uses a single foreach loop as opposed to your double for() loop.
  2. Foreach loops tend to perform faster than for loops in PHP.
  3. He used a single if(! ) comparison while you used two if() structures
  4. The only additional function call your friend made was in_array whereas you called count() twice.
  5. You made three variable declarations that your friend didn't have to ($a, $current_is_unique, $current_index)

While none of these factors alone is huge, I can see where the cumulative effect would make your algorithm take longer than your friends.

Noah Goodrich
Actually, count() is called many times -- once per loop iteration for each loop.
strager
Thank you, by removing count() from both places i gained 2 seconds on 100 000 elements.
Filip Ekberg
Your point 1 is not valid: `in_array()` seems to loop over the entries as well, so there's a hidden for-loop in Filip's friend's code as well!
Christoph
@Christoph: Could you provide a reference? I would be very interested to read up on that myself.
Noah Goodrich
In my time tests the "friend" function underperforms in large array situations.
jmucchiello
@gabriel1836: I just did some benchmarking (therefore 'seems to loop') - you'd have to check the sources of the zend engine to be sure; but afaik, PHP arrays are linked hash tables, and therefore there's no way around walking the list (looping over the map's entries) to find a value...
Christoph
@Christoph, Unless you build a reverse hash table, of course.
strager
@strager: Of course, but that's not how it's done (check my new answer); building a reverse hash table only works for hashable values...
Christoph
Good answer, but i'd like to see some Ordor-calculations too :)
Filip Ekberg
+1  A: 

PHP's arrays are implemented as hash tables, i.e. their performance characteristics are different from what you'd expect from 'real' arrays. An array's key-value-pairs are additionally stored in a linked list to allow fast iteration.

This explains why your implementation is so slow compared to your friend's: For every numeric index, your algorithm has to do a hash table lookup, whereas a foreach()-loop will just iterate over a linked list.

The following implementation uses a reverse hash table and might be the fastest of the crowd (double-flipping courtesy of *joe_mucchiello*):

function array_unique2($array) {
    return array_flip(array_flip($array));
}

This will only work if the values of $array are valid keys, ie integers or strings.

I also reimplemented your algorithm using foreach()-loops. Now, it will actually be faster than your friend's for small data sets, but still slower than the solution via array_flip():

function array_unique3($array) {
    $unique_array = array();

    foreach($array as $current_key => $current_value) {
        foreach($unique_array as $old_value) {
            if($current_value === $old_value)
                continue 2;
        }
        $unique_array[$current_key] = $current_value;
    }

    return $unique_array;
}

For large data sets, the built-in version array_unique() will outperform all other's except the double-flipping one. Also, the version using in_array() by your friend will be faster than array_unique3().

To summarize: Native code for the win!


Yet another version, which should preserve keys and their ordering:

function array_flop($array) {
    $flopped_array = array();

    foreach($array as $key => $value) {
     if(!isset($flopped_array[$value]))
      $flopped_array[$value] = $key;
    }

    return $flopped_array;
}

function array_unique4($array) {
    return array_flip(array_flop($array));
}

This is actually enobrev's array_unique3() - I didn't check his implementations as thoroughly as I should have...

Christoph
This does not preserve the keys. You mean return array_flip(array_flip($array));
jmucchiello
Nice idea, but I wouldn't expect that to be faster. A benchmark could prove me wrong, though.
strager
Actually it smokes the other suggestion in this thread. It just doesn't produce the same results.
jmucchiello
@joe: Filip's code assumed numeric keys, so I did as well; and no, I didn't mean a double flip: I only cared for the values, and these are the keys of the flipped array...
Christoph
I clarified this a bit...
Christoph
Yes, but you can't only care about the values since the builtin array_unique preserves the keys of the input array. Not preserving them is providing a different algorithm. At which point, speed comparisons are irrelevant.
jmucchiello
Then just do your double-flip and use your solution; but it still won't be äquivalent to array_unique(): these algorithms break if the values aren't strings or integers, whereas array_unique() works reasonably well as long as the string conversion of the values return reasonable results
Christoph
Actually, my second algorithm would give even better results than the native one if it preserved the keys - I'll change it in a bit...
Christoph
@Christoph, array_reverse seems to keep the keys and is just barely slower than the double-array_flip - and it beats the native. I posted an example.
enobrev
@Christoph, your new one is the same as array_unique3 in my results. That was my first attempt at speeding up the OP's friend's, but eventually found the array_flip method was best.
enobrev
@enobrev: you're right, my bad - but I still like my naming scheme better ;) `array_unique3()` is also the fastest one which will still reproduce `array_uniques()`'s results (or at least for all string/all integer values - mixed one's might produce different results(?))
Christoph
heh, fan of the flip-flop as well.
enobrev
A: 

PHP is slower to execute than raw machine code (which is most likely executed by array_unique).

Your second example function (the one your friend wrote) is interesting. I do not see how it would be faster than the native implementation, unless the native one is removing elements instead of building a new array.

strager
+3  A: 

Try this algorithm. It takes advantage of the fact that the key lookup is faster than in_array():

function array_unique_mine($A) {
    $keys = Array();
    $values = Array();
    foreach ($A as $k => $v) {
        if (!array_key_exists($v, $values)) {
            $keys[] = $k;
            $values[$v] = $v;
        }
    }
    return array_combine($keys, $values);
}
jmucchiello
I'm not sure the OP is looking for an efficient algorithm, but an explanation as to why the methods have different time complexities.
strager
this will only work if the values in $A are all strings or integers!
Christoph
true. I suppose you could serialize the $values[serialize($v)] = $v part.
jmucchiello
@joe_mucchiello added your method to my post
enobrev
+5  A: 

The time complexity of in_array() is O(n). To see this, we'll take a look at the PHP source code.

The in_array() function is implemented in ext/standard/array.c. All it does is call php_search_array(), which contains the following loop:

while (zend_hash_get_current_data_ex(target_hash, (void **)&entry, &pos) == SUCCESS) {

    // checking the value...

    zend_hash_move_forward_ex(target_hash, &pos);
}

That's where the linear characteristic comes from.

This is the overall characteristic of the algorithm, becaus zend_hash_move_forward_ex() has constant behaviour: Looking at Zend/zend_hash.c, we see that it's basically just

*current = (*current)->pListNext;


As for the time complexity of array_unique():

  • first, a copy of the array will be created, which is an operation with linear characteristic
  • then, a C array of struct bucketindex will be created and pointers into our array's copy will be put into these buckets - linear characteristic again
  • then, the bucketindex-array will be sorted usign quicksort - n log n on average
  • and lastly, the sorted array will be walked and and duplicate entries will be removed from our array's copy - this should be linear again, assuming that deletion from our array is a constant time operation

Hope this helps ;)

Christoph
the code for array.c can be found here: http://www.google.com/search?q=cache%3Ahttp%3A%2F%2Fsvn.php.net%2Fphp-src%2Ftrunk%2Fext%2Fstandard%2Farray.c (thanks to Gumbo); including the link in the answer somehow breaks stackoverflow :(
Christoph
A: 

I'll admit I don't understand the native code very well, but it seems to copy the entire array, sort it, then loop through it removing duplicates. In that case your second piece of code is actually a more efficient algorithm, since adding to the end of an array is cheaper than deleting from the middle of it.

Keep in mind the PHP developers probably had a good reason for doing it the way they do. Does anyone want to ask them?

Ant P.
A: 

The native PHP function array_unique is implemented in C. Thus it is faster than PHP, that has to be translated first. What’s more, PHP uses an different algorithm than you do. As I see it, PHP first uses Quick sort to sort the elements and then deletes the duplicates in one run.

Why his friend’s implementation is faster has his own? Because it uses more built-in functionality that trying to recreate them.

Gumbo
+2  A: 
enobrev
interesting; in my test with the following input array ``` $array = array();for($i = 0; $i < 1000; ++$i) $array[$i] = $i;for($i = 500; $i < 700; ++$i) $array[10000 + $i] = $i; ```array_unique() was second fastest...
Christoph
ALso, could you include joe's algorithm ( http://stackoverflow.com/questions/478002/php-arrays-remove-duplicates-time-complexity#478038 ) in your tests?
Christoph
or just check if a test for `array_key_exists()` would speed up `array_unique3()`?
Christoph
That is interesting. I got similar results as what I'd posted using your data.
enobrev
@Christoph I'll add it. I doubt it though as array_key_exists is a lot slower than isset.
enobrev
@enobrev: it might even be faster...
Christoph
@Christoph added
enobrev
@enobrev: thanks; also, I did some more benchmarking myself: `array_key_exists()` really seems to be slower than `isset()`; I wonder why - shouldn't the more specialized version be faster than the general one?
Christoph
@Christoph, i wish it were as it's more explicit and easier to read. The functional difference between the two is that isset returns false for NULL values. I've read the actual reason it's faster somewhere, though I can't find the link.
enobrev
@Christoph - in prior benchmarking I've found isset faster than array_key_exists and have never understood why. But by habit I never use isset. I suppose that would speed up my version.
jmucchiello
@enobrev - your test suite doesn't include the important parts of array_unique. The builtin version preserves keys. And it keeps the first (not last) instance of the key. Thus the versions where the keys differ are just plain wrong.
jmucchiello
@joe: 'plain wrong' might be a bit strong - they just do things a little differently: `_unique4()` drops the keys (that might be 'plain wrong'), `_unique6()` uses the last key instead and `_unique5()` doesn't preserve the ordering; they are similar, but just don't give exactly the same results...
Christoph
...also, `_unique3` reproduces the results and is still faster
Christoph
@joe_mucchiello, That was my point in showing the results. Some were faster and dropped keys, some were faster and kept the keys, just rearranged things - and in an associative array, the order on the items can be less important. And one of the "plain wrong" ones was your suggestion (flip, flip)
enobrev
If you are benchmarking two functions, they must give identical results or what's the point. I stand behind 'plain wrong'. If you are just replacing a function, sure change it's meaning and refactor as needed. But for benchmarking you need identical outputs for identical inputs.
jmucchiello
flip,flip was my response to Christoph's array_keys(array_flip). If you look at my 2nd response to Christoph's post below, you see me state flip,flip doesn't work correctly. I was denouncing that method because it is plain wrong. Don't attribute it to me. It's Christoph's.
jmucchiello
@joe_mucchiello, so you're saying this would no longer be wrong if I removed 3 of the methods? I think I understand your point, but that's why I explicitly showed the results and not only the timing.
enobrev