Hi, I'm currently building a little CMS for a smaller site. Now I want to extract all words from the text_content field and store them in my word table for later analysis.

page( id int, 
      title varchar(45),
      # ... a bunch of meta fields ...  
      html_content text,
      text_content text);

word( page_id int,        # Foreign key
      word varchar(100)); # I presume there are no words longer than 100 chars

Currently I'm using the following code, which runs very slowly (understandably) for larger chunks of text.

// Sidenote: $_POST is sanitized earlier, outside the scope of this code.
$_POST['text_content'] = str_replace("\t", "", 
         htmlspecialchars_decode(strip_tags($_POST['html_content'])));

// The text is in Swedish, so we add support for the Swedish vowels
$words = str_word_count($_POST['text_content'], 1, "åäöÅÄÖ");

// Delete all previous records of words
$this->db->delete("word", array('page_id' => $_POST['id']));

// Add current ones
foreach($words as $word)
{
    if (trim($word) == "")
        continue;

    $this->db->query("INSERT INTO word(page_id, word) VALUES(?, ?)", 
                      array($_POST['id'], strtolower(trim($word))));
}

Now, I'm not happy with this solution. I was thinking of creating a trigger in the database that would do pretty much the same thing as the PHP version. Is it possible to create a trigger in MySQL that performs these actions, and if so, how? Or is there a better way? Am I taking a crazy approach to this?

+1  A: 

Triggers that perform large calculations will slow down your application.

I think you are better off scheduling a task to run periodically and perform the extraction for you.

Raj More
I guess that could work, but I'm not too fond of cron jobs.
Morningcoffee
A: 

Have you tried PHP's "htmlentities" function to strip those tags?

RPK
The thing is, I want to remove the HTML tags to produce a clean text-only version.
Morningcoffee
+4  A: 

You could make this PHP code significantly faster by building up a single insert query and executing it rather than a separate query for every word. Otherwise, I don't think your code looks that bad.
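
A minimal sketch of the single-query idea, assuming the word table from the question. The helper name build_word_insert is hypothetical, not part of any library. It only builds the SQL string and the parameter list, so it can be passed to whatever database wrapper is in use.

```php
<?php
// Hypothetical helper (not from the original post): builds one multi-row
// INSERT for the word table instead of issuing one query per word.
// Returns [sql, params], or null when there is nothing to insert.
function build_word_insert(int $pageId, array $words): ?array
{
    $rows = [];
    $params = [];

    foreach ($words as $word) {
        $word = trim($word);
        if ($word === '') {
            continue; // skip empty tokens, as the original loop does
        }
        $rows[] = '(?, ?)';             // one placeholder pair per row
        $params[] = $pageId;
        $params[] = strtolower($word);  // same normalization as the question
    }

    if ($rows === []) {
        return null;
    }

    $sql = 'INSERT INTO word(page_id, word) VALUES ' . implode(', ', $rows);
    return [$sql, $params];
}

// Example: two real words and one blank token produce two rows.
$query = build_word_insert(7, ['Hello', ' ', 'World']);
```

With the wrapper from the question, the result would presumably be executed as $this->db->query($query[0], $query[1]), assuming that wrapper accepts a flat array of bound values as it does in the original loop. For very long texts the statement can exceed MySQL's max_allowed_packet limit, so splitting $words into batches with array_chunk and running one insert per batch may be necessary.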

Scott Saunders
Thanks. I didn't know you could do that. What used to execute for 45 seconds now executes in 0.9 seconds. This solves my problem :)
Morningcoffee
Absolutely. Each query has significant overhead simply from the client-to-server communication. Additionally, the text could be passed to a stored procedure that would break it down and do the inserts.
Kevin Peno