I have a database table holding a Chinese dictionary (about 300,000 rows) for an online dictionary. The data structure looks like this:

 ID     ch_smpl     pinyin     definition
 ----------------------------------------
 1       我           wǒ          I, me
 2       我们         wǒmen       we, us
 etc.

I'm not good with PHP and MySQL, so the question is how to set up a search engine. I've found a bunch of tutorials on PHP/MySQL search, and I also found some full-text search examples, but I'm not sure how it works with Chinese characters. The speed of the search is really important for me.

Any suggestions on how to organize the search engine for this purpose are greatly appreciated.

A: 

The speed of your "search engine" depends mainly on three things:

  1. Your SQL-Query
  2. Your database design
  3. Your MySQL configuration

So there will be no "flip this switch and you will get super duper performance". You will need to tackle all these areas. In addition to that there are many other things that can have an impact on performance. For example: operating system, hard drive, amount of memory etc.

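Let's start with MySQL configuration. You should try out the MySQL query cache first. If you mainly do read operations, this can boost your performance, since repeated identical queries are answered from the cache and no I/O is needed. Be aware that the query cache was deprecated in MySQL 5.7.20 and removed in MySQL 8.0, so this only applies to older versions.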

Read here: MySQL Documentation on Query Cache
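
To see whether the query cache is available and actually helping, you can inspect its variables and counters. This is only a sketch and assumes a MySQL version that still ships the query cache (5.7 or older); on MySQL 8.0 these statements return nothing useful.

```sql
-- Is the query cache enabled, and how large is it?
SHOW VARIABLES LIKE 'query_cache%';

-- Hit and insert counters: compare Qcache_hits with Com_select
-- to judge how many SELECTs are served from the cache.
SHOW STATUS LIKE 'Qcache%';
SHOW GLOBAL STATUS LIKE 'Com_select';
```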

Another important area is your database design, in particular which storage engine you choose. Basically you have three options: InnoDB, MyISAM and MEMORY (there are others, but I don't really know them).

As far as I know, MyISAM and MEMORY only support table locking and not row locking. But again, if you mainly do read operations this won't affect you. For read-mostly workloads they can be faster than InnoDB, though you should measure this on your own data. If I were you I would start with MEMORY, since everything is held in memory. But be aware of the implications: you may need more memory, and the table contents are lost whenever the server restarts or crashes, so you have to reload them from a persistent copy.
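
A minimal sketch of the MEMORY approach, assuming the persistent table is called dictionary (as in the queries further down). The name dictionary_mem, the index name and the column types and lengths are my assumptions, not something taken from your schema. MEMORY tables cannot hold TEXT or BLOB columns, and their size is capped by max_heap_table_size, which you will probably have to raise for 300,000 rows.

```sql
-- In-memory copy of the dictionary (hypothetical name and column sizes).
CREATE TABLE dictionary_mem (
  id         INT UNSIGNED NOT NULL PRIMARY KEY,
  ch_smpl    VARCHAR(32)  NOT NULL,
  pinyin     VARCHAR(64)  NOT NULL,
  definition VARCHAR(255) NOT NULL,
  -- MEMORY indexes are HASH by default; BTREE also supports prefix searches.
  INDEX idx_ch_smpl (ch_smpl) USING BTREE
) ENGINE=MEMORY DEFAULT CHARSET=utf8mb4;

-- Reload the rows from the persistent table, e.g. after every server restart.
INSERT INTO dictionary_mem
SELECT id, ch_smpl, pinyin, definition FROM dictionary;
```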

InnoDB on the other hand gives you a lot of data safety and can also be pretty fast if you configure it correctly. Unfortunately this is a wide area, so I won't cover it all. One thing to start with is to set innodb_buffer_pool_size to around 80% of your memory, assuming the machine is a dedicated database server. So if you have 10GB of RAM you could set it to 8GB.

If your server has more than 8 CPUs you might also want to set innodb_thread_concurrency to a higher number. A common rule of thumb is 2 * number of CPUs. Note that a value of 0, the default in current versions, means no limit.
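
The two InnoDB settings above could be applied like this. It is a sketch under assumptions: a dedicated server with 10GB of RAM and 16 CPU cores, and a MySQL version that can resize the buffer pool at runtime (5.7.5 or newer). On older versions, put the values into the configuration file and restart instead. Values set with SET GLOBAL are also lost on restart unless you add them to the configuration file.

```sql
-- Inspect the current values first.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'innodb_thread_concurrency';

-- Assumption: 10 GB RAM, so roughly 80% = 8 GB for the buffer pool.
SET GLOBAL innodb_buffer_pool_size = 8589934592;  -- 8 * 1024^3 bytes

-- Assumption: 16 CPU cores, so 2 * 16 = 32.
SET GLOBAL innodb_thread_concurrency = 32;
```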

If you want to know more about MySQL performance you should grab a cup of coffee and read this blog: MySQL performance blog

Another important thing could be to use indices on some of your columns. But I can't really tell if it will pay off in your case, since my knowledge of Chinese dictionaries is limited ;)

Generally speaking your primary key field should have an index (MySQL creates one automatically for a primary key). In addition to that you can use indices for fields you often query and that rarely change, since every change to an indexed field means the index has to be updated as well, which costs write performance.

As far as I know, an index also only pays off when the column holds a lot of different values. If you have for example a column "gender" which only holds "male" or "female", an index will at best cut the table in half: with 100 users you still end up with about 50 rows to scan. But if you index their phone number, which is in most cases unique, you end up with only one row, which is much more effective.

So maybe you should use an index for the column ch_smpl.
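
Creating such an index and checking that it is used could look like this. The index name idx_ch_smpl is made up, and I'm assuming ch_smpl is a VARCHAR column; for a TEXT column MySQL would require a prefix length, e.g. ch_smpl(16).

```sql
-- Add an index on the column that is searched most often.
CREATE INDEX idx_ch_smpl ON dictionary (ch_smpl);

-- EXPLAIN shows the execution plan; the "key" column should now
-- name idx_ch_smpl instead of being NULL.
EXPLAIN SELECT definition FROM dictionary WHERE ch_smpl = '我';
```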

Last but not least your query. My first advice is to select as little data as possible. That means avoid queries like this:

select * from ...

In your case: If you only want to have the definition for 我 you should use this query:

select definition from dictionary where ch_smpl = '我'

and not

select * from dictionary where ch_smpl = '我'

Also avoid "LIKE" statements with the percent symbol in front of the search term, since MySQL then cannot use the index on this column and has to scan the whole table.

For example:

select * from dictionary where ch_smpl like '%我'

You should use the percent symbol only after the term:

select * from dictionary where ch_smpl like '我%'
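
You can check the difference yourself with EXPLAIN. This sketch assumes an index on ch_smpl already exists; what the optimizer picks can vary with your data and MySQL version, so treat the comments as what you would typically see, not as guaranteed output.

```sql
-- Leading wildcard: typically a full table scan (type = ALL, key = NULL).
EXPLAIN SELECT * FROM dictionary WHERE ch_smpl LIKE '%我';

-- Trailing wildcard: typically a range scan on the ch_smpl index (type = range).
EXPLAIN SELECT * FROM dictionary WHERE ch_smpl LIKE '我%';
```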

One last piece of advice. There is no special switch you can flip as I said before. There are a lot of things you can do to achieve better performance. Try a few things out and measure the performance.

Jens