views:

971

answers:

4

After using PHP for a while now, I've noticed that not all PHP built in functions as fast as expected. Consider the below two possible implementations of a function that finds if a number is prime using a cached array of primes.

//very slow for large $prime_array
$prime_array = array( 2, 3, 5, 7, 11, 13, .... 104729, ... );
$result_array = array();
foreach( $array_of_number => $number ) {
    $result_array[$number] = in_array( $number, $large_prime_array );
}

//speed is much less dependent on size of $prime_array, and runs much faster.
$prime_array => array( 2 => NULL, 3 => NULL, 5 => NULL, 7 => NULL,
                       11 => NULL, 13 => NULL, .... 104729 => NULL, ... );
foreach( $array_of_number => $number ) {
    $result_array[$number] = array_key_exists( $number, $large_prime_array );
}

This is because in_array is implemented with a linear search O(n) which will linearly slow down as $prime_array grows. Where the array_key_exists function is implemented with a hash lookup O(1) which will not slow down unless the hash table gets extremely populated (in which case it's only O(n)).

So far I've had to discover the big-O's via trial and error, and occasionally looking at the source code. Now for the question...

Is there was a list of the theoretical (or practical) big O times for all* the PHP built in functions?

*or at least the interesting ones

For example find it very hard to predict what the big O of functions listed because the possible implementation depends on unknown core data structures of PHP: array_merge, array_merge_recursive, array_reverse, array_intersect, array_combine, str_replace (with array inputs), etc.

A: 

The explanation for the case you specifically describe is that associative arrays are implemented as hash tables - so lookup by key (and correspondingly, array_key_exists) is O(1). However, arrays aren't indexed by value, so the only way in the general case to discover whether a value exists in the array is a linear search. There's no surprise there.

I don't think there's specific comprehensive documentation of the algorithmic complexity of PHP methods. However, if it's a big enough concern to warrant the effort, you can always look through the source code.

Dathan
This isn't really an answer. As I've stated in the question, I've already tried looking into the PHP source code. Since PHP is implemented is written in C making use of complex macros, which can make it hard at times to "see" the underlying big O for functions.
Kendall Hopkins
@Kendall I overlooked your reference to diving into the source code. However, there is an answer in my reply: "I don't think there's specific comprehensive documentation of the algorithmic complexity of PHP methods." "No" is a perfectly valid answer. (c:
Dathan
+1  A: 

If you are really that interested in time complexity, you could always use some sort of profiling tool, or code (ex. Google hit #1 for "profile PHP") to test functions on a number arrays of size N ranging from N=1 to N=100, and then directly calculate their Big-O. Given that N is sufficiently large, you really shouldn't need to sample on very many intervals.

Rob
If you're going to down vote, please explain why. Since there's not a known list, directly measuring time complexity seems the obvious thing to do.
Rob
This is not an answer to the question. If you did what you suggested and submitted that, then that would be an answer.
Kendall Hopkins
@kendall If you asked how many stones were in the great wall of china, you couldn't really expect anyone to answer with an exact count. The best answers you'd get would be hints about how to estimate the number yourself. You can't expect people to do all the leg work for you, just because you asked.
Frank Farmer
@Frank Farmer I never expected someone to benchmark each of the PHP functions, but if someone wants to submit an answer it should at least attempt to answer to the question, NOT tell me what google is. Not only that, but this answer isn't even a good idea. Sampling N=1...100 is in most cases would only be measuring the overhead of a PHP function call, not the Big O of the function.
Kendall Hopkins
(1) N=100 is nowhere "large". (2) Profiling seldom able to reveal a log N or log log N factor simply because log grows too slowly. (3) Timing fluctuation in system can make the result inaccurate.
KennyTM
A: 

You almost always want to use isset instead of array_key_exists. I'm not looking at the internals, but I'm pretty sure that array_key_exists is O(N) because it iterates over each and every key of the array, while isset tries to access the element using the same hash algorithm that is used when you access an array index. That should be O(1).

One "gotcha" to watch out for is this:

$search_array = array('first' => null, 'second' => 4);

// returns false
isset($search_array['first']);

// returns true
array_key_exists('first', $search_array);

I was curious, so I benchmarked the difference:

<?php

$bigArray = range(1,100000);

$iterations = 1000000;
$start = microtime(true);
while ($iterations--)
{
    isset($bigArray[50000]);
}

echo 'is_set:', microtime(true) - $start, ' seconds', '<br>';

$iterations = 1000000;
$start = microtime(true);
while ($iterations--)
{
    array_key_exists(50000, $bigArray);
}

echo 'array_key_exists:', microtime(true) - $start, ' seconds';
?>

is_set: 0.132308959961 seconds
array_key_exists: 2.33202195168 seconds

Of course, this doesn't show time complexity, but it does show how the 2 functions compare to each other.

To test for time complexity, compare the amount of time it takes to run one of these functions on the first key and the last key.

ryeguy
This is wrong. I'm 100% sure array_key_exists doesn't have to iterate over each key. If you don't believe be take a look at the link below. The reason isset is so much faster is that it's a language construct. Which means it doesn't have the overhead of doing a function call. Also, I think it might be caching the lookup, because of this.Also, this isn't an answer to THE QUESTION! I would like a list of Big(O) for PHP functions (as the question states). Not a single benchmark of my examples.http://svn.php.net/repository/php/php-src/branches/PHP_5_3/ext/standard/array.c
Kendall Hopkins
If you still don't believe me, I've create a small benchmark to demonstrate the point. http://pastebin.com/BdKpNvkE
Kendall Hopkins
+20  A: 

Since it doesn't seem like anyone has done this before I thought it'd be good idea to have it for reference somewhere. I've gone though and either via benchmark or code-skimming to characterize the array_* functions. I've tried to put the more interesting Big-O near the top. This list is not complete.

Note: All the Big-O where calculated assuming a hash lookup is O(1) even though it's really O(n). The coefficient of the n is so low, the ram overhead of storing a large enough array would hurt you before the characteristics of lookup Big-O would start taking effect. For example the difference between a call to array_key_exists at N=1 and N=1,000,000 is ~50% time increase.

Interesting Points:

  1. isset/array_key_exists is much faster than in_array and array_search
  2. +(union) is a bit faster than array_merge (and looks nicer). But it does work differently so keep that in mind.
  3. shuffle is on the same Big-O tier as array_rand
  4. array_pop/array_push is faster than array_shift/array_unshift due to re-index penalty

Lookups:

array_key_exists O(n) but really close to O(1) - this is because of linear polling in collisions, but because the chance of collisions is very small, the coefficient is also very small. I find you treat hash lookups as O(1) to give a more realistic big-O. For example the different between N=1000 and N=100000 is only about 50% slow down.

isset( $array[$index] ) O(n) but really close to O(1) - it uses the same lookup as array_key_exists. Since it's language construct, will cache the lookup if the key is hardcoded, resulting in speed up in cases where the same key is used repeatedly.

in_array (n) - this is because it does a linear search though the array until it finds the value.

array_search O(n) - it uses the same core function as in_array but returns value.

Queue functions:

array_push O(∑ var_i, for all i)

array_pop O(1)

array_shift O(n) - it has to reindex all the keys

array_unshift O(n + ∑ var_i, for all i) - it has to reindex all the keys

Array Intersection, Union, Subtraction:

array_intersect_key if intersection 100% do O(Max(param_i_size)*∑param_i_count, for all i), if intersection 0% intersect O(∑param_i_size, for all i)

array_intersect if intersection 100% do O(n^2*∑param_i_count, for all i), if intersection 0% intersect O(n^2)

array_intersect_assoc if intersection 100% do O(Max(param_i_size)*∑param_i_count, for all i), if intersection 0% intersect O(∑param_i_size, for all i)

array_diff O(π param_i_size, for all i) - That's product of all the param_sizes

array_diff_key O(∑ param_i_size, for i != 1) - this is because we don't need to iterate over the first array.

array_merge O( ∑ array_i, i != 1 ) - doesn't need to iterate over the first array

+ (union) O(n), where n is size of the 2nd array (ie array_first + array_second) - less overhead than array_merge since it doesn't have to renumber

array_replace O( ∑ array_i, for all i )

Random:

shuffle O(n)

array_rand O(n) - Requires a linear poll.

Obvious Big-O:

array_fill O(n)

array_fill_keys O(n)

range O(n)

array_splice O(offset + length)

array_slice O(offset + length) or O(n) if length = NULL

array_keys O(n)

array_values O(n)

array_reverse O(n)

array_pad O(pad_size)

array_flip O(n)

array_sum O(n)

array_product O(n)

array_reduce O(n)

array_filter O(n)

array_map O(n)

array_chunk O(n)

array_combine O(n)

I'd like to thank Eureqa for making it easy to find the Big-O of the functions. It's an amazing free program that can find the best fitting function for arbitrary data.

Kendall Hopkins