Hi, I'm indexing strings containing URLs with a MySQL FULLTEXT index, but I don't want the URLs to contribute to the results.

For example, I search for "PHP" or "HTML" and get records like "Ibiza Angels Massage Company see funandfrolicks.php"... a hedonistic distraction at best.

I can't find any examples of adding regular expressions to the stopword list.

The other thing I thought of (and failed at) was writing the fulltext SQL so that it decreases the word's contribution. However, in the following SQL, the relevance value did not change.

SELECT title, content, MATCH(title, content) AGAINST('+PHP >".php"' IN BOOLEAN MODE)
FROM tb_feed
WHERE MATCH(title, content) AGAINST('PHP >".php"' IN BOOLEAN MODE)
ORDER BY published DESC LIMIT 10;

An alternative is a messy SQL statement with the additional condition ...

WHERE ... IF(content REGEXP '[.]php', content REGEXP '(^| )php', 1) ...

Thoughts? What's the best solution?

A: 

If you want php/html to match only when it is not part of a URL, one simple way is to try

like "% php %"
like "% html %"

That way, php/html must be a word in the sentence.

phsiao
Yes, I could, but like "REGEXP '(^| )php'" it's an additional WHERE condition that does not take advantage of MySQL's fast fulltext indexing.
Drew
+1  A: 

If the number of results is bearable, you could choose not to display the rows that match only on the words you want to ignore, such as .php or .html. This is very quick to kludge together, but it uses more memory than you need because the unwanted rows are still fetched.
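A rough sketch of that kind of post-filtering, done in application code after the fulltext query returns. The function names and the regular expression are my own assumptions, not anything MySQL provides; the heuristic treats a term as "part of a URL or filename" when it is glued to a dot or to other word characters.

```python
import re


def matches_as_word(text, term):
    """Return True if `term` appears as a standalone word in `text`.

    Heuristic (an assumption, tune to taste): the term must not be
    preceded by a word character or a dot ("index.php"), and must not be
    followed by a word character or by a dot plus a word character
    ("php.net"). A trailing full stop at the end of a sentence is fine.
    """
    pattern = r"(?<![\w.])" + re.escape(term) + r"(?!\.?\w)"
    return re.search(pattern, text, re.IGNORECASE) is not None


def filter_rows(rows, term):
    """Keep only the result strings where `term` is a real word."""
    return [row for row in rows if matches_as_word(row, term)]
```

You would run the normal MATCH ... AGAINST query first and pass the returned content through filter_rows before display. Note that LIMIT 10 in SQL may then leave you with fewer than 10 rows, so you might fetch a few extra.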

Another solution is to create another field that holds only the text you want to search on. When filling this field, you omit URLs and any other keywords that are not desired, and you put the fulltext index on this field instead of the original. This solution takes little time to write but takes up extra space on the hard drive.
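As a minimal sketch of how such a search-only field could be filled, here is a function that strips URLs and bare filenames before the text is stored. The patterns and the list of file extensions are assumptions for illustration; adjust them to whatever actually shows up in your feed data.

```python
import re

# Hypothetical patterns: full URLs, and bare filenames such as "page.php".
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
FILE_RE = re.compile(r"\b[\w-]+\.(?:php|html?|aspx?|jsp)\b", re.IGNORECASE)


def strip_urls(text):
    """Return `text` with URLs and filenames removed and whitespace collapsed."""
    text = URL_RE.sub(" ", text)   # drop whole URLs first
    text = FILE_RE.sub(" ", text)  # then any leftover bare filenames
    return " ".join(text.split())
```

The result would be written to the extra column (say, a hypothetical content_search) at insert time, and the FULLTEXT index would then cover that column rather than content.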

The better solution is to create another table called keyword (or similar). When a user submits a search query, the keyword table is searched for the specified keywords. The keyword table is populated by splitting the input data into words when the content is uploaded or retrieved.

This last option has the advantage of possibly being fast. The data is also compact, as each keyword is stored only once, with an index pointing back to the main content record. It also allows more clever searches if you so desire.
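To make the keyword-table idea concrete, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL. The table names (keyword, feed_keyword), the tokenizer, and its rule for skipping URL-like tokens are all assumptions of mine. INSERT OR IGNORE is SQLite syntax; in MySQL the equivalent would be INSERT IGNORE.

```python
import sqlite3


def keywords(text):
    """Split text into lowercase words, skipping URL-like or filename-like tokens."""
    words = set()
    for token in text.split():
        token = token.strip(".,;:!?\"'()").lower()
        # Anything still containing a dot or slash is treated as a URL/filename.
        if not token or "." in token or "/" in token:
            continue
        words.add(token)
    return words


def build_index(conn, rows):
    """Create the keyword tables and index (feed_id, content) pairs."""
    conn.executescript("""
        CREATE TABLE keyword (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
        CREATE TABLE feed_keyword (
            keyword_id INTEGER, feed_id INTEGER,
            PRIMARY KEY (keyword_id, feed_id));
    """)
    for feed_id, content in rows:
        for word in keywords(content):
            # Each keyword is stored once; the link table points back to the record.
            conn.execute("INSERT OR IGNORE INTO keyword (word) VALUES (?)", (word,))
            conn.execute(
                "INSERT OR IGNORE INTO feed_keyword (keyword_id, feed_id) "
                "SELECT id, ? FROM keyword WHERE word = ?",
                (feed_id, word),
            )


def search(conn, word):
    """Return the ids of records containing `word` as a keyword."""
    cur = conn.execute(
        "SELECT fk.feed_id FROM keyword k "
        "JOIN feed_keyword fk ON fk.keyword_id = k.id "
        "WHERE k.word = ? ORDER BY fk.feed_id",
        (word.lower(),),
    )
    return [row[0] for row in cur]
```

The lookup is an exact match on an indexed column, so it stays cheap. The price is that you lose the relevance ranking that MATCH ... AGAINST gives you unless you store something like a per-record word count yourself.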

George Patterson
Yeah, it's getting a bit messy, but thanks.
Drew