tags:

views:

112

answers:

1

I am trying to return the top ten entities that are associated with articles after doing a search on the articles.

Lets say I have the following set up:

CREATE TABLE IF NOT EXISTS `articles` (
  `id` char(36) NOT NULL,
  `html_content` text NOT NULL,
  `featured` tinyint(1) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`),
  FULLTEXT KEY `html_content` (`html_content`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

CREATE TABLE IF NOT EXISTS `entities` (
  `id` char(36) NOT NULL,
  `title` varchar(255) NOT NULL,
  `article_count` int(11) NOT NULL,
  PRIMARY KEY (`id`),
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

// Pivot table as the articles -> entities is a many-to-many Relationship
CREATE TABLE IF NOT EXISTS `articles_entities` (
  `id` char(36) NOT NULL,
  `article_id` char(36) NOT NULL,
  `entity_id` char(36) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `article_id` (`article_id`),
  KEY `entity_id` (`entity_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

A little background, The search returns approx 15K articles and growing daily, each article can have upwards of 20 entities associated with it. Everything I try causes the script to take ages to complete.

Lets say I am searching for all articles that have the word 'google' in them, and I want to return the 10 most associated entities. How would I go about this?

I am using cakePHP, but any solution is ok.

Thanks in advance!

A: 

Here is the approach I would take in Cake and MySQL. Although Cake's ORM doesn't really do much for you in this situation.

Once you have your query passed in and sanitized, do your full-text search:

$params= array('conditions' => array(
    'MATCH(Article.html_content) AGAINST('.$query.')'),
    'fields' => array('Article.id'));

$articles = $this->Article->find('all', $params);

Then using something like array_walk() I would pull all the IDs out of $articles into a single comma separated string, and query the join table for them:

$entities = $this->Article->query('SELECT entity_id, COUNT(entity_id) as count 
    FROM articles_entities WHERE article_id IN ('.$article_ids.') 
    GROUP BY entity_id ORDER BY count DESC);

This should return the IDs of all entities associated with the group, an association count, ordered by the count. Then I would use array_slice() to get the top 10 entities, pull out the IDs and search for them.

Hope this helps.

handsofaten
I should mention, using `IN` may not be the best way if your database is large. See this post http://explainextended.com/2009/08/18/passing-parameters-in-mysql-in-list-vs-temporary-table/
handsofaten