views:

92

answers:

2

I am writing a very process-intensive function in PHP that needs to be as optimized as it can get for speed, as it can take up to 60 seconds to complete in extreme cases. This is my situation:

I am trying to match an array of people to an XML list of jobs. The array of people have keywords that I have already analyzed, delimited by spaces. The jobs are from a large XML file.

It's currently setup like this:

$matches = new array();
foreach($people as $person){
    foreach($jobs as $job){
        foreach($person['keywords'] as $keyword){
            $count = substr_count($job->title, $keyword);
            if($count > 0) $matches[$job->title] = $count;
        }
    }
}

I do the keywords loop a few times with different categories. It does what I need it to do, but it feels very sloppy and the process can take a very, very long time depending on the number of people/jobs.

Is there a more efficient, or faster, way of doing this?

+1  A: 

You could use an index of the words in the job titles to make the lookup more efficient:

$jobsByWords = array();
foreach ($jobs as &$job) {
    preg_match_all('/\w+/', strtolower($jobs->title), $words);
    foreach ($words[0] as $word) {
        if (!isset($jobsByWords[$word])) $jobsByWords[$word] = array();
        $jobsByWords[$word][] = &$job;
    }
}

Then you just iterate the people and check if the keywords are in the index:

foreach ($people as $person) {
    foreach ($person['keywords'] as $keyword) {
        $keyword = strtolower($keyword);
        if (isset($jobsByWords[$keyword])) {
            foreach ($jobsByWords[$keyword] as &$job) {
                $matches[$job->title] = true;
            }
        }
    }
}
Gumbo
+1  A: 
$matches = new array();
foreach($people as $person){
    foreach($jobs as $job){
        foreach($person['keywords'] as $keyword){
            $count = substr_count($job->title, $keyword);
            if($count > 0) $matches[$job->title] = $count;
        }
    }
}

Truthfully, your method is a bit sloppy, but I assume that's because you have some specially formatted data that you have to work around? Although other than just being sloppy, I see a bit of lost data in the way you're processing things that I don't think was intentional.

I see that you're not just checking "is the keyword in the job title", but "how many times is the keyword in the job title" and then you're storing this. This means for the job title friendly friend of the friend company, the "keyword" friend shows up 3 times, and thus $matches["friendly friend of the friend company"] = 3. Since you're declaring $matches before you being your $people foreach loop, though, this means you keep over-writing this value any time a new person has that keyword. In other words, if the first person has the keyword "friend" then $matches["friendly friend of the friend company"] is set to 3. Then if the second person has the keyword "friendly", this value is over-written and $matches["friendly friend of the friend company"] now equals 1.

I think what you wanted to do was count how many people have a keyword which is contained in the job title. In this case, rather than counting how many times $keyword appears in $job->title, you should just see if it appears, and respond accordingly.

$matches = new array();
foreach($people as $person){
    foreach($jobs as $job){
        foreach($person['keywords'] as $keyword){
            if(strpos($job->title, $keyword) !== FALSE) /* "If $keyword exists in $job->title" */
                $matches[$job->title]++; /* Increment "number of people who match" */
        }
    }
}

Another possibility is that you wanted to know how many keywords a given person had which matched a given job title. In this case you'd want a separate array per person. This is done with a slight modification.

$matches = new array();
foreach($people as $person){
    $matches[$person] = new array();
    foreach($jobs as $job){
        foreach($person['keywords'] as $keyword){
            if(strpos($job->title, $keyword) !== FALSE) /* "If $keyword exists in $job->title" */
                $matches[$person][$job->title]++; /* Increment "number of keywords which match" */
        }
    }
}

Or, alternatively, you could return to counting how many times a keyword matches now since per-person this is actually a meaningful value ("how well does the job match")

$matches = new array();
foreach($people as $person){
    $matches[$person] = new array();
    foreach($jobs as $job){
        foreach($person['keywords'] as $keyword){
            if($count = substr_count($job->title, $keyword)) /* if(0) = false */
                $matches[$person][$job->title] += $count; /* Increase "number of keywords which match" by $count */
        }
    }
}

Essentially, before tackling the problem of making your loop for efficient, you need to figure out what it is your loop is really trying to accomplish. Figure this out and then your best bet for increasing the efficiency is to just decrease the number of iterations of the loop to a minimum and use as many built-in functions as possible since these are implemented in C (a non-interpreted and therefore quicker-running language).

steven_desu
Yes, I left out quite a bit of detail in what it is I am trying to accomplish in order to keep it simple, and try to focus on the overall structure instead. You've pretty much hit the nail on the head though. At the moment, the keywords are the indices in an array, and the values are how many times they appear on that person. Then when matching a job, it counts the number of times that keyword appears in the job, and gives it a "score" in a different "matched" array (keyword value * number of matches), which all eventually let's me rank relevancy.
Nick
In that case the need to count the number of occurrences (and thus parse each string individually) does give you a unique situation of algorithmic complexity. I would definitely recommend for starters to keep the list of people and jobs as small as necessary (don't count matches for people if you aren't outputting the result)
steven_desu
@Nick Also, consider giving each job an array of all the contained vocabulary. It's much quicker/easier for a computer to compare two arrays since you know exactly where the word will start if it does exist. Essentially you only check as many positions as there are words, not letters. Thus, `$job->wordArray = array('friendly','friend','of','the','friend','company');`. After this, look into `array_count_values()` and `array_key_exists()` to find the number of matches.
steven_desu
@Nick Note that the new method will differentiate between "friendly" and "friend", but this may be good as it will differentiate between "cat" and "catering"
steven_desu
@steven_desu Thanks, I'll give that a shot. Right now I am using regex with word boundaries around the keyword to differentiate between word like "cat" and "catering", which probably does not help with my issue of speed. Unfortunately, each person has to be analyzed and there's no way to avoid that, since I need to display all of them and be able to sort them. But I was considering converting the XML to an array since it's reused over and over for each person, and this certainly helps confirm the need to do that. Thanks!
Nick