Hi!

I have to hack a content management system to support full-text search for a language that contains special characters. These are stored in the database as HTML entities. Out of the box, the CMS doesn't support this. The bug was reported a long time ago, but apparently it has no priority. I'm stuck with this CMS and the customer is waiting for my solution, so I have to hack it. Damn...

Ok... the CMS stores its content by translating special characters into HTML entities (this is actually done by the bundled editor). So the German word "möchten" becomes "m&ouml;chten" in the DB. The CMS creates a query string like

SELECT * FROM `SiteTree` WHERE MATCH( Content ) AGAINST (<SEARCH_STRING> IN BOOLEAN MODE);

The table is of type MyISAM, the field has a FULLTEXT index.

If you use "m&ouml;chten" as the search string, MySQL will match every page, as & is an operator that does crazy things when it's present in the search string. The search will not work.

The next idea is to replace the special character with an * as a placeholder. But this will also match several words, as soon as you have anything starting with an "m" and another, following word ending with "chten". I don't know why, but replacing only the ampersand with an asterisk (so searching for "m*ouml;chten") leads to similar results.
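A rough way to see why so many pages match is to approximate how the built-in full-text parser might split the search string. The sketch below is only an illustration: it assumes the default MyISAM parser (letters, digits, underscores and apostrophes count as word characters) and the default `ft_min_word_len` of 4, and the function name is made up for this example.

```python
import re

# Assumption: default ft_min_word_len; the server may be configured differently.
FT_MIN_WORD_LEN = 4


def approximate_fulltext_tokens(search_string):
    """Roughly mimic how the default parser would split a search string.

    Anything that is not a word character or an apostrophe is treated as a
    separator, and words shorter than FT_MIN_WORD_LEN are dropped.
    """
    candidates = re.split(r"[^\w']+", search_string)
    return [word for word in candidates if len(word) >= FT_MIN_WORD_LEN]


print(approximate_fulltext_tokens("m&ouml;chten"))
```

Under these assumptions the search string falls apart into the separate words "ouml" and "chten". Without a leading + both are optional in boolean mode, so any page containing an &ouml; entity anywhere would be a hit.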

The same problem was described here.

Ok, folks, I need your help! Any ideas?

Edit: Converting the content to UTF-8 is not an option.

Thanks!
craesh

+1  A: 

Why are you using HTML entities? Just switch to utf8.

Otherwise, try quoting your search string once more, like ('"search"'). Unfortunately, that won't work: there is a long-standing bug, http://bugs.mysql.com/bug.php?id=26265. I guess the only approach is:

A last approach is to store an additional column, used only for searching, with all accents replaced.
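A minimal sketch of how the value for such a search-only column could be derived, assuming the content uses standard HTML entities. The function name `search_key` is hypothetical, and the code that writes the result into the extra column is left out.

```python
import html
import unicodedata


def search_key(content):
    """Decode HTML entities, then strip accents, for a search-only column."""
    decoded = html.unescape(content)  # "m&ouml;chten" -> "möchten"
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", decoded)
    # Drop the combining marks; characters without a decomposition
    # (for example "ß") pass through unchanged.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(search_key("m&ouml;chten"))
```

The user's search term would have to go through the same function before it is passed to AGAINST, so that both sides of the match are normalized the same way.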

Kane
Hey, that tip with adding quotes works! Even if there is an ampersand present. Btw: I'm using MySQL version 5.0.51a
craesh
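The quoting tip could be wired up roughly as follows. This is only a sketch: it assumes the editor writes named HTML entities (not numeric ones) for non-ASCII characters, that a driver with `%s` placeholders is used, and the helper names are invented for the example.

```python
import html.entities


def to_entity_encoded(term):
    """Replace non-ASCII characters with named HTML entities where one exists."""
    out = []
    for ch in term:
        name = html.entities.codepoint2name.get(ord(ch))
        if ord(ch) > 127 and name:
            out.append("&%s;" % name)
        else:
            out.append(ch)
    return "".join(out)


def boolean_phrase(term):
    """Wrap the entity-encoded term in double quotes for boolean mode."""
    # Remove embedded double quotes so the phrase cannot be closed early.
    encoded = to_entity_encoded(term).replace('"', " ")
    return '"%s"' % encoded


# The phrase is passed as a bound parameter, never concatenated into the SQL.
QUERY = "SELECT * FROM `SiteTree` WHERE MATCH( Content ) AGAINST (%s IN BOOLEAN MODE)"

print(boolean_phrase("möchten"))
```

The resulting string, including its double quotes, would be bound to the placeholder in `QUERY`. Whether the phrase search behaves the same on other MySQL versions than the 5.0.51a mentioned above is not something this sketch can guarantee.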
A: 

You can use a full-text-search engine. Apache Lucene is powerful, but a bit hard to learn. Apache Solr is much easier to learn, and can be quite useful. Sphinx is known for its easy integration with MySQL. I believe all of them handle internationalization well.

Yuval F
Sorry, but I won't reimplement the whole search engine of a CMS just to make it work with special characters.
craesh