views:

512

answers:

5

The PHP array is one of the language's core features IMHO. It is sparse, allows multi-typed keys in the same array, and supports set, dictionary, array, stack/queue and iterative functionality.

BUT after working with PHP for a while now, I've found that quite a few of the array_* functions are much slower than you'd think at first glance. Like in the case of array_rand on a very large array (10000+). array_rand is so slow in fact, that in cases where your using the php array as an indexed array, a function like rand( 0, array_length( $array ) - 1 ) runs MUCH faster than array_rand.

Now onto my question.

How is the PHP array implemented on the C level? This would be very helpful for predicting the Big O of a function that heavily uses the different functionality of the PHP array datatype.

+1  A: 

See this comment in the documentation confirming your dilemma that array_rand, while fast for small arrays, scales very poorly.

I modified fake_array_rand to always only return 1 element, and did some benchmarks against calling array_rand with the second parameter as 1. I ran 100 samples for each function for each number of elements and took the average result. While the internal array_rand is faster for a small number of elements, it scales very poorly.

1 elements: 2.0619630813599E-05 sec. for array_rand,8.4352493286133E-05 sec. 
for fake_array_rand 

10 elements: 2.1675825119019E-05 sec. for array_rand,8.427619934082E-05 sec. 
for fake_array_rand 

100 elements: 2.9319524765015E-05 sec. for array_rand,8.4599256515503E-05 sec. 
for fake_array_rand 

1000 elements: 0.0001157283782959 sec. for array_rand,8.5572004318237E-05 sec. 
for fake_array_rand 

10000 elements: 0.0016669762134552 sec. for array_rand,8.5201263427734E-05 sec. 
for fake_array_rand 

100000 elements: 0.015599734783173 sec. for array_rand,8.5580348968506E-05 sec. 
for fake_array_rand 

1000000 elements: 0.18011983394623 sec. for array_rand,8.6690187454224E-05 sec. for fake_array_rand 

<?php 
function fake_array_rand ($array) 
{ 
        $count = count ($array); 
        # Help keep the number generator random :) 
        $randval and usleep ("0.$randval"); 

        # Seed the random number generator 
        # Generate a random number 
        srand ((double) microtime() * 10000000); 
        $randval = rand(); 

        # Use the random value to 'pick' an entry from the array 
        # Count the number of times that the entry is picked 
        ++$index[$randval % $count]; 

        return $array[$randval % $count]; 
} 
?>

http://us.php.net/manual/en/function.array-rand.php#22360

Chacha102
That's exactly what I'm talking about. It seems the array datatype doesn't have the ability to seek to an offset in O(c) time, like a normal array in C would. Instead it seems to do a linear poll which is O(n), which gets pretty scary quick when it's put into a loop (O(n^2)). On a side note, that's a pretty horrible use of srand.
Kendall Hopkins
Why O(n)? It's **direct** access. It's slower, but in good enough hash function doesn't relate to the number of elements in array. See my answer.
Sagi
Sorry I wasn't clear that I was talking about using `array_rand` in the loop. The way `array_rand` is implemented it has to use linear polling though the linked-list to find the offset it randomly generated, since it doesn't know if there are gaps in the range.
Kendall Hopkins
Yes, that's true. And your solution is impressive. Worth trying profiling with several arrays and adding this kind of optimization to PHP source code.
Sagi
A: 

Take a look at zend/zend_hash.c and zend/zend_hash.h

I don't know c, but to me it looks like a chained hash table.

chris
+1  A: 

Since PHP arrays are ordered maps (even when using contiguous integer indices) array_rand() likely has to cons up a list of keys to select an element from. I'm guessing that it would be prohibitively space and time ineffective to cache the (frequently invalidated) key list so every call is going to incur at least an O(n) traversal and construction cost.

Because your rand(... length ...) implementation takes advantage of the special knowledge you have that the keys are contiguous integers, you'll get O(log n) lookup costs. This seems consistent with Chacha102's data.

msw
+4  A: 

PHP associative arrays are in fact implementation of HashTables.

Internally, it is possible to make numeric arrays or associative arrays. If you combine them, it is associative array.

In numeric arrays, it is very similer to C. You have array of pointers to ZVAL structs.

Because pointers have fixed-length (let's call it n), the offset (x) calculation is easy: x * n.

In PHP types are ZVAL structs (because that way it implements dynamic types), but it also helps in associative array, because you can assume fixed-length. So even if direct access to array is slower, it is still considered O(1).

So what happens in string keys? PHP uses hash function to convert them to intergers.

Searching in numeric and associative array has similer efficianty, because internally they are all numeric.

Only direct-access to array keys is slower, because of the additional level (hash function).

Sagi
+4  A: 

After reading over zend/zend_hash.h and ext/standard/array.c I think I have found the answer (Thankyou Chris and gumbo for the suggestions).

The PHP array is a chained hash table (lookup of O(c) and O(n) on key collisions) that allows for int and string keys. It uses 2 different hashing algorithms to fit the two types into the same hash key space. Also each value stored in the hash is linked to the value stored before it and the value stored after (linked list). It also has a temporary pointer which is used to hold the current item so the hash can be iterated.

The catch for the array_rand function is that in order to assure that key is truly random, the array_rand function must iterate over the array rand(0, count($array)) times (O(n)). This is because there is no way to move to an offset in the hash table in O(c) time because there is no guarantee that there isn't missing keys in the range.

This discovery has somewhat troubled me, because it means there is NO datatype in PHP that has normal C array characteristics. Now most of the time this is ok, because hash lookups are very faster, but it's faults show through in cases like array_rand.

Another thing that also troubled me was the difference between the implementation of array_key_exists and in_array. array_key_exists uses the hash lookup (most of the time O(c)) to see if a key is in the array, while in_array has to linearly search the hash (O(n)).

Consider the two examples below:

in_array version

$array = range(0, 100000);
if( in_array( $random_key, $array ) ) {
   //we found a value
}

array_key_exists version

$array = array_fill_keys( range(0, 100000), NULL );
if( array_key_exists( $random_key, $array ) ) {
   //we found a value, err key
}

While the in_array version looks much cleaner and simpler to understand, it's actually very slow on large arrays (O(n)), where array_key_exists (despite is annoyingly long name) is very fast (O(c) or close).

In conclusion: I wish there was a transparent flag in the zend HashTable data-structure that would be set in cases where the array was created using array_push or array[] = $value that would allow for scaling like a C array instead of a linked list.

Kendall Hopkins
You are right :) But look at SplFixedArray
Sagi