views:

353

answers:

4

I have a PHP script which reads a large CSV and performs certain actions, but only if the "username" field is unique. The CSV is used in more than one script, so changing the input from the CSV to only contain unique usernames is not an option.

The very basic program flow (which I'm wondering about) goes like this:

$allUsernames = array();
while($row = fgetcsv($fp)) {
    $username = $row[0];
    if (in_array($username, $allUsernames)) continue;
    $allUsernames[] = $username;
    // process this row
}

Since this CSV could actually be quite large, it's that in_array bit which has got me thinking. The most ideal situation when searching through an array for a member is if it is already sorted, so how would you build up an array from scratch, keeping it in order? Once it is in order, would there be a more efficient way to search it than using in_array(), considering that it probably doesn't know the array is sorted?

+1  A: 

The array type in php is an ordered map (php array type). If you pass in either ints or strings as keys, you will have an ordered map...

Please review item #6 in the above link.

Steen
Do you mean example #6? The way I read that is that arrays are ordered maps, which does not necessarily equate to being sorted: they just have an order to them.
nickf
@nickf: PHP arrays are hash maps, indexed by the array key. Internal order is irrelevant for accessing the values.
Tomalak
ok that makes sense, but that's just the keys of the array, right? that won't help you try to find a particular value in the array.
nickf
Well, I would suggest that you check if the lookup in your stored values really takes so long that you need a different datastructure and/or some 'optimization' on the array. I think you'll find the array type fast enough.
Steen
+2  A: 

The way to build up an array from scratch in sorted order is an insertion sort. In PHP-ish pseudocode:

$list = []
for ($element in $elems_to_insert) {
     $index = binary_search($element, $list);
     insert_into_list($element, $list, $index);
}

Although, it might actually turn out to be faster to just create the array in unsorted order and then use quicksort (PHP's builtin sort functions use quicksort)

And to find an element in a sorted list:

function binary_search($list, $element) {
    $start = 0;
    $end = count($list);
    while $end - $start > 1) {
        $mid = ($start + $end) / 2;
        if ($list[$mid] < $element){
            $start = $mid;
        }
        else{
            $end = $mid;
        }
    }
    return $end;
}

With this implementation you'd have to test $list[$end] to see if it is the element you want, since if the element isn't in the array, this will find the point where it should be inserted. I did it that way so it'd be consistent with the previous code sample. If you want, you could check $list[$end] === $element in the function itself.

David Zaslavsky
+6  A: 

Not keeping the array in order, but how about this kind of optimization? I'm guessing isset() for an array key should be faster than in_array() search.

$allUsernames = array();
while($row = fgetcsv($fp)) {
  $username = $row[0];

  if (isset($allUsernames[$username])) {
    continue;
  } else {
    $allUsernames[$username] = true;

    // do stuff
  }
}
Henrik Paul
or array_key_exists?
dylanfm
True, you could do that too, but I'd still benchmark the differences between those two. You never know with PHP - one might be O(1), while the other O(n)... (referring to the "do array_flip() twice" trick)
Henrik Paul
I'd say it is "array_key_exists()". PHP arrays are hashes, they are optimized for this kind of random access stuff.
Tomalak
in_array has bad performance .. Using a hash as suggested here, is a good idea.
troelskn
A: 

in_array() does not benefit from having a sorted array. PHP just walks along the whole array as if it were a linked list.

too much php
i figured that - read the last sentence of the question again.
nickf