views:

305

answers:

8

I'm writing a small algorithm in PHP that goes through n number of movies with ratings, and will store the top 5. I'm not reading from a datafile, but from a stream so I cannot simply order the movies by rating.

My question is what is the most efficent way to keep track of the top 5 rated movies as I read the stream? Currently I do the following:

  1. Read in 5 movies (into an array called movies[]), with two keys movies[][name] and movies[][rating]
  2. Order the array by movies[rating] using array_multisort() (highest rating now sits at movies[4])
  3. Read in the next movie
  4. If this new movie rating > movies[0][rating] then replace movies[0] with this new movie
  5. Re-order the list
  6. Repeat 3-5 until finished

My method works, but requires a sort on the list after every read. I believe this to be an expensive method mostly due to the fact that every time I use array_multisort() I must do a for loop on 5 movies just to build the index to sort on. Can anyone suggest a better way to approach this?

+3  A: 

Your algorithm looks fine. I am not sure how the arrays are implemented in PHP. From an algorithm point of view: use a heap instead of an array.

dirkgently
Arrays in PHP isn't really arrays but a mix of stacks, vectors and maps. I belive that they are implemented as hash tables under the surface. I agree with your assessment, though. A priority queue implemented as a heap would be the most efficient way to go.
Emil H
PHP doesn't have arrays. It just calls dictionary arrays. They have a order, but you can't really address a specific element in that order. Direct access is done via a key, which can be numeric. But $a[3] could be before $a[2] in the "array" order.
stesch
+4  A: 

Linked lists would work here.

Build a linked list that chains the first 5 movies in the correct order. For each new movie, just start at the the end of the chain and walk it until your movie is between one with a higher rating and one with a lower rating. Then insert your link into the list here. If the movie was better than the worst (and thus your list is now 6 long), just remove the last link in the chain, and you are back to 5.

No sorting, no indexing.

Phil H
That is sorting, more specific insertion sort.
Georg
True, but it isn't a whole-array sort, because you use the fact that the list is already sorted. It also scales well with large datasets and larger top N sets, unlike a full-array sort.
Phil H
interesting solution I'll let you know how it increases my performance.
A: 

How big is the list? I'm guessing it's not an option to keep the entire list in memory, and sort it at the end?

PanMan
+1  A: 

My method works, but requires a sort on the list after every read.

No it doesn't, it only requires a sort after you find a new movie whos rating is > movies[0][rating].

This method seems efficient to me. You only sort occasionally when there's a new entry for the top 5, which will happen less the more movies you process.

Actually, yes it does. As it's written, it sorts even if there's no insertion but that could be rectified by skipping step 5 if step 4 didn't insert.
paxdiablo
I thought it was reasonably intuitive that you only need to do step 5 if step 4 has happened.
+3  A: 

No point in re-sorting after every read since you really only need to insert a new entry. Use the following algorithm, it's likely to get you the best speed. It's basically an unrolled loop, not the most beautiful code.

set movies[0..4].rating to -1.
while more movies in stream:
    read in next movie.
    if movie.rating < movies[0].rating:
        next while
    if movie.rating < movies[1].rating:
        movies[0] = movie
        next while
    if movie.rating < movies[2].rating:
        movies[0] = movies[1]
        movies[1] = movie
        next while
    if movie.rating < movies[3].rating:
        movies[0] = movies[1]
        movies[1] = movies[2]
        movies[2] = movie
        next while
    if movie.rating < movies[4].rating:
        movies[0] = movies[1]
        movies[1] = movies[2]
        movies[2] = movies[3]
        movies[3] = movie
        next while
    movies[0] = movies[1]
    movies[1] = movies[2]
    movies[2] = movies[3]
    movies[3] = movies[4]
    movies[4] = movie

At the end, you have your sorted list of movies. If there's less than 5, those others will have a rating of -1 so you'll know they're invalid. This is assuming that the rating on a real movie is zero or greater but you can adjust the values if they're not.

If you need to adjust it for more than 5 movies, you can. The best bet would be to roll up the loop again. At some point, however, it's going to become more efficient to sort it than use this method. This method's only really good for a small data set.

paxdiablo
This is probably the fastest approach. Could be made (a little bit) faster by storing the pointer to the movie with the lowest value, that would decrease the amount of comparisons that have to be done.
Georg
@gs, I'm not sure that would be faster, the movie with the lowest rating is always movies[0].
paxdiablo
I don't know the relative cost in php, but there are only 120 possible combinations, so you can do an 0(1) insert if you indirect through a second array, and transition that array to another of the 120 possible mappings.
Pete Kirkham
A: 
  1. there is no need for two keys in array. array with name as key, and rating as value will do. Sort it with arsort();
  2. the algorithm is not perfect, you can do it optimally with linked list. Although I think linked list implemented in PHP will be actually slower that function call to asort() for 6 elements. For big O estimation, you can assume that sorting 6 elements has constant time.
  3. You'll only sort when you encounter movie rated higher then the actual, so in average case you'll do it less an less often, while progressing. You'll sort on every movie only in worst case scenario of having initial list sorted from lowest rated.
vartec
A: 

Here’s what I would do:

// let’s say get_next_movie () returns array with 'rating' and 'name' keys

while ($m = get_next_movie ()) {

  $ratings[$m['rating']][] = $m['movie'];

  $temp_ratings = $ratings;
  $top5 = array ();
  $rating = 5;
  while (1) {
    if (count ($temp_ratings[$rating])) {
      $top5[] = array_shift ($temp_ratings[$rating]);
    } elseif ($rating > 0) {
      --$rating;
    } else {
      break;
    }
  }

  // $top5 has current top 5 :-)

}

$ratings array looks like this, each rating has array of movies inside:

Array
    (
    [5] => Array
        (
            [0] => Five!
        )

    [3] => Array
        (
            [0] => Three
            [1] => Threeeeee
            [2] => Thr-eee-eee
        )

    [4] => Array
        (
            [0] => FOR
        )
    )
Ilya Birman
Looks like noöne notices my brilliant solution :-)
Ilya Birman
A: 

Maybe this can be of help.

class TopList {
    private $items = array();
    private $indexes = array();
    private $count = 0;
    private $total = 5;
    private $lowest;
    private $sorted = false;

    public function __construct($total = null) {
     if (is_int($total))
      $this->total = $total;

     $this->lowest = -1 * (PHP_INT_MAX - 1);
    }

    public function addItem($index, $item) {
     if ($index <= $this->lowest)
      return;

     $setLowest = $this->count === $this->total;
     if ($setLowest) {
      /* //remove first added
      $lowestIndex = array_search($this->lowest, $this->indexes);
      /*/ //remove last added
      $lowestIndex = end(array_keys($this->indexes, $this->lowest));
      //*/
      unset($this->indexes[$lowestIndex], $this->items[$lowestIndex]);
     } else {
      ++$this->count;
      $setLowest = $this->count === $this->total;
     }

     $this->indexes[] = $index;
     $this->items[] = $item;
     $this->sorted = false;

     if ($setLowest)
      $this->lowest = min($this->indexes);
    }

    public function getItems() {
     if (!$this->sorted) {
      array_multisort($this->indexes, SORT_DESC, $this->items);
      $this->sorted = true;
     }
     return $this->items;
    }
}

$top5 = new TopList(5);
foreach ($movies as $movie) {
    $top5->addItem($movie['rating'], $movie);
}
var_dump($top5->getItems());
OIS