I'm storing a serialized PHP array whose key => value pairs hold the information used to build jQuery UI tabs on a website.

The string stored in the MySQL database looks like:

a:2:{i:0;a:2:{i:1;s:9:"Info";i:2;s:643:"<h2><strong>This section is about foo</strong></h2><p><strong>Lorem ipsum ...";}i:1;a:2:{i:1;s:14:"More Info";i:2;s:465:"<p>Lorem ipsum ...";}}

(This is not a valid serialized array or valid HTML, because I truncated the lengthy content for formatting reasons.)

I would like to allow this content to be fed to Sphinx (full-text indexer) for site search purposes. Basically Sphinx just grabs the contents of the database and indexes what it finds, subject to the configuration options you specify... What I'm wondering is if there's a good way to get either MySQL or Sphinx to strip out the serialization information and html tags so that only the plain text gets indexed.

A: 

Your best bet is probably to stop storing only the PHP serialized format and to add a 'plain text' version alongside it that Sphinx can index. Failing that, another idea would be a PHP script that crawls the table on a regular basis and creates the 'plain text' version out-of-band from the original HTTP request that created the records. With unserialize() and strip_tags() at your disposal, this becomes a fairly trivial problem.
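A minimal sketch of that conversion, assuming PHP 7 or later and data shaped like the sample in the question. The function name `tabs_to_plain_text` and the sample values are made up for illustration; only unserialize(), strip_tags() and the other built-ins are real. How the result is written back to a plain-text column depends on your CMS and is left out.

```php
<?php
// Hypothetical helper: turn one serialized tab array into indexable plain text.
function tabs_to_plain_text(string $serialized): string
{
    // allowed_classes => false stops stored data from instantiating objects (PHP 7+).
    $data = @unserialize($serialized, ['allowed_classes' => false]);
    if (!is_array($data)) {
        return ''; // not a serialized array (or corrupt): nothing to index
    }

    $parts = [];
    array_walk_recursive($data, function ($value) use (&$parts) {
        if (!is_string($value)) {
            return;
        }
        // Put a space before each tag so "<h2>A</h2><p>B</p>" does not become "AB".
        $text = strip_tags(str_replace('<', ' <', $value));
        $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
        // Collapse runs of whitespace left behind by the removed tags.
        $text = trim(preg_replace('/\s+/', ' ', $text));
        if ($text !== '') {
            $parts[] = $text;
        }
    });

    // One line per tab title or tab body.
    return implode("\n", $parts);
}

// Usage with a small, well-formed sample in the same shape as the question's data.
$sample = serialize([
    [1 => 'Info', 2 => '<h2><strong>About foo</strong></h2><p>Lorem ipsum</p>'],
    [1 => 'More Info', 2 => '<p>Dolor sit amet</p>'],
]);
echo tabs_to_plain_text($sample), "\n";
```

The same function works for either approach: call it when the record is saved, or from a scheduled script that walks the table.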

TML
Yeah, that's what I was planning to fall back on when I get to that point in this project. The CMS I'm working with makes storing the plain-text alternative alongside the serialized string a bit of a pain, but I'll get something to work.
Ty W
+1  A: 

For the HTML tag problem, put this in your Sphinx config: html_strip = 1

See the html_strip section of the Sphinx manual.
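For context, html_strip is an index-level option, so it goes inside an index block. A sketch of where it sits, assuming an index and source whose names and path are placeholders rather than anything from the question:

```
index site_content                          # hypothetical index name
{
    source     = site_content_src           # hypothetical source name
    path       = /var/lib/sphinx/site_content   # hypothetical path
    html_strip = 1                          # strip HTML markup before indexing
}
```

This only removes the markup. The serialization wrapper (the a:2:{...} and s:9:"..." parts) still reaches the indexer.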

I have not found a way to strip the serialization info from the index. (But I'm having the same problem.)

smoove666
Nice find, I hadn't seen that before. I think I've found a way to store a plain-text alternative in the DB alongside the original in the CMS I'm using, but this is a very good tip.
Ty W
A: 

How can HTML tags be stripped using only MySQL functions?

akuba