All,

I'm building a site that gathers news stories from about 35 different RSS feeds and stores them in an array. I use a foreach() loop to search each article's title and description for any of about 40 keywords, using substr() for each article. If a keyword is found, the article is stored in a DB and will ultimately appear on the site.

The script runs every 30 minutes. The trouble is that it takes 1-3 minutes, depending on how many stories are returned. That's not terrible, but on a shared hosting environment I can see it causing plenty of issues, especially as the site grows and more feeds and keywords are added.

Are there any ways that I can optimize the 'searching' of keywords, so that I can speed up the 'indexing'?

Thanks!!

+2  A: 

35-40 RSS feeds is a lot of requests for one script to fetch and parse all at once. Your bottleneck is most likely the requests, not the parsing, so separate the two concerns. Have one script that requests a single RSS feed every minute or so and stores the result locally. Then have a second script parse the stored results, save the matches, and remove the temporary files every 15-30 minutes.
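As a rough illustration of the "one feed per run" half of that split, here is a minimal sketch. The function name `fetchNextFeed`, the `cursor.txt` file, and the `feed_N.xml` cache names are all made up for this example, and it assumes `file_get_contents()` is allowed to open URLs on your host (`allow_url_fopen`); swap in cURL if it is not.

```php
<?php
// Hypothetical helper: fetch ONE feed per invocation and cache the raw XML on disk.
// $feedUrls, $cacheDir and the file names used here are assumptions for illustration.
function fetchNextFeed(array $feedUrls, string $cacheDir): ?string
{
    $feedUrls = array_values($feedUrls); // make sure indexes run 0..n-1
    if ($feedUrls === []) {
        return null;
    }

    // The cursor file remembers which feed is due on this run (defaults to the first).
    $cursorFile = $cacheDir . '/cursor.txt';
    $index = is_file($cursorFile) ? (int) file_get_contents($cursorFile) : 0;
    $index %= count($feedUrls);

    // Advance the cursor first so a failing feed cannot block the rotation.
    file_put_contents($cursorFile, (string) (($index + 1) % count($feedUrls)));

    $xml = @file_get_contents($feedUrls[$index]);
    if ($xml === false) {
        return null; // skip this feed; the next run moves on to the following one
    }

    // Store the raw XML so the separate parsing script can pick it up later.
    $cachePath = $cacheDir . '/feed_' . $index . '.xml';
    file_put_contents($cachePath, $xml);

    return $cachePath;
}
```

Run from cron once a minute, this touches each of 35 feeds roughly every 35 minutes, and the parsing script only ever reads local files.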

Stephen
+1  A: 

You could use XPath to search the XML directly... Something like:

$dom = new DOMDocument();
$dom->loadXML($feedXml);
$xpath = new DOMXPath($dom);

$query = '//item[contains(title, "foo")] | //item[contains(description, "foo")]';
$matchingNodes = $xpath->query($query);

$matchingNodes will then be a DOMNodeList of all the matching item nodes, which you can save to the database.
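To make the "save those in the database" step a bit more concrete, here is a small sketch that turns the matched nodes into plain arrays. The sample feed and the `$rows` structure are invented for illustration, and the actual insert is left to whatever database layer you use.

```php
<?php
// Sample feed standing in for $feedXml; the item contents are made up.
$feedXml = <<<XML
<rss version="2.0"><channel>
  <item><title>foo launches</title><description>Details</description><link>http://example.com/1</link></item>
  <item><title>Other news</title><description>Nothing here</description><link>http://example.com/2</link></item>
  <item><title>Roundup</title><description>More about foo</description><link>http://example.com/3</link></item>
</channel></rss>
XML;

$dom = new DOMDocument();
$dom->loadXML($feedXml);
$xpath = new DOMXPath($dom);

$matchingNodes = $xpath->query('//item[contains(title, "foo")] | //item[contains(description, "foo")]');

$rows = [];
foreach ($matchingNodes as $item) {
    // string(...) is evaluated relative to the matched <item>;
    // it yields '' when the child element is missing.
    $rows[] = [
        'title'       => $xpath->evaluate('string(title)', $item),
        'description' => $xpath->evaluate('string(description)', $item),
        'link'        => $xpath->evaluate('string(link)', $item),
    ];
}
// $rows now holds one array per matching article, ready to be inserted.
```

An item that matches in both title and description still appears only once, because the `|` union returns a node-set without duplicates.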

To adapt this to your real-world case, you could either build one query that does all the searching in a single shot:

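// Assumes no keyword contains a double quote, which would break the XPath string.
$parts = array();
foreach ($keywords as $keyword) {
    $parts[] = '//item[contains(title, "' . $keyword . '")]';
    $parts[] = '//item[contains(description, "' . $keyword . '")]';
}
// Union all the sub-queries into one expression.
$query = implode(' | ', $parts);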

Or just re-query for each keyword. Personally, I'd build one giant query, since all the matching is then done in compiled C code, which should be more efficient than looping in PHP and aggregating the results there.
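One caveat when building the query from a keyword list: XPath 1.0 string literals have no escape character, so a keyword containing a quote can break the expression. The sketch below is one possible way around that, not part of the original snippet. The helper names `xpathLiteral` and `buildKeywordQuery` are hypothetical, and it uses a single `//item[... or ...]` predicate rather than the `|` union shown above.

```php
<?php
// Hypothetical helper: turn an arbitrary string into a valid XPath 1.0 string expression.
function xpathLiteral(string $value): string
{
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';
    }
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";
    }
    // Both quote types present: split on " and glue the pieces back with concat().
    $parts = [];
    foreach (explode('"', $value) as $i => $piece) {
        if ($i > 0) {
            $parts[] = "'\"'"; // the double quote itself, wrapped in single quotes
        }
        if ($piece !== '') {
            $parts[] = '"' . $piece . '"';
        }
    }
    return 'concat(' . implode(', ', $parts) . ')';
}

// Hypothetical helper: one predicate that matches any keyword in title or description.
function buildKeywordQuery(array $keywords): string
{
    if ($keywords === []) {
        return '//item[false()]'; // no keywords, so match nothing
    }
    $conditions = [];
    foreach ($keywords as $keyword) {
        $literal = xpathLiteral($keyword);
        $conditions[] = 'contains(title, ' . $literal . ')';
        $conditions[] = 'contains(description, ' . $literal . ')';
    }
    return '//item[' . implode(' or ', $conditions) . ']';
}
```

Usage would look like `$matchingNodes = $xpath->query(buildKeywordQuery($keywords));`. Keep in mind that `contains()` is case-sensitive, so "Foo" will not match the keyword "foo" unless you normalise case yourself.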

ircmaxell