views:

84

answers:

1

I'm trying to implement a search feature for an offline-accessible StackOverflow, and I'm noticing some problems with using MySQLs FULLTEXT indexing.

Specifically, by default FULLTEXT indexing is restricted to words between 4 and 84 characters long. Terms such as "PHP" or "SQL" would not meet the minimum length and searching for those terms would yield no results.

It is possible to modify the variable which controls the minimum length a word needs to be to be indexed (ft_min_word_len), but this is a system-wide change requiring indexes in all databases to be rebuilt. On the off chance others find this app useful, I'd rather keep these sort of variables as vanilla as possible. I found a post on this site the other day stating that changing that value is just a bad idea anyway.

Another issue is with terms like "VB.NET" where, as far as I can tell, the period in the middle of the term separates it into two indexed values - VB and NET. Again, this means searches for "VB.NET" would return nothing.

Finally, since I'm doing a direct dump of the monthly XML-based dumps, all values are converted to HTML Entities and I'm concerned that this might have an impact on my search results.

I found a blog post which tries to address these issues with the following advice:

  • keep two copies of your data - one with markup, etc. for display, and one modified for searching (remove unwanted words, markup, etc)
  • pad short terms so they will be indexed, I assume with a pre/suffix.

What I'd like to know is, are these really the best workarounds for these issues? It seems like semi-duplicating a > 1GB table is wasteful, but maybe that's just me.

Also, if anyone could recommend a good site to understand MySQL's FULLTEXT indexing, I'd appreciate it. To keep this question from being too cluttered, please leave the site recommendations in the question comments, or email me directly at the site on my user profile).

Thanks!


Additional Info:
I think I should clarify a couple of things.

I know "MySQL" tends to lead to the assumption of "web application", but that's not what I'm going for here. I could install Apache and PHP and run things that way, but I'm trying to keep this light. I can use my website for playing with PHP, so I don't feel the need to install it on my home machine too. I also hope this could be useful for others as well, and I don't want to force anyone else into installing a bunch of extra utilities. I went with MySQL since it was easy and needing to install some sort of DB was unavoidable.

The specifics of the project were going to be:

  • Desktop application written in C# (WinForms)
  • MySQL backend

I'm starting to wonder if I should just say to hell with it, and install everything I'd need to make this an (offline) webapp. As much as we'd all like to think our pet project is going to be used and loved by the community at large, I should know by now that this is likely going end up being only used by a single user.

A: 

From what was already said, I understand, that MySQL FullText is not for you ;) But why stick to MySQL? Try Sphinx:

http://www.sphinxsearch.com/

It will solve most of your problems.

FractalizeR
Just to be sure I've got this straight. Sphinx is its own indexing engine, which works with either MySQL or PostgreSQL. In my case, I'd want to use SphinxSE so that I can use it directly through the MySQL namespaces in C#. This is likely what I'll have to do, and I'll accept this answer then I suppose, but part of this project was to keep things relatively light. For instance, I was going to do this as a desktop/WinForms C# app, rather then needing to install Apache and PHP since it's strictly offline (also I don't want to install those two since I have my own site to use for that sort of dev)
AgentConundrum
No need in Apache or PHP. Just use native C# bindings and Sphinx itself:http://iterat.ive.ly/index.php/2008/01/05/sphinx-search-c-net-client-api/
FractalizeR