views: 374

answers: 1

I have a database full of records filled with Asian characters (Chinese, Japanese, and Korean) alongside records filled with Latin characters (English, Français, you name it), and I want to perform fulltext searches on them.

MySQL (http://dev.mysql.com/doc/refman/5.1/en/fulltext-restrictions.html) says:

Ideographic languages such as Chinese and Japanese do not have word delimiters. Therefore, the FULLTEXT parser cannot determine where words begin and end in these and other such languages. The implications of this and some workarounds for the problem are described in Section 11.8, “Full-Text Search Functions”.

This makes sense. But naturally, Section 11.8 does not contain any workarounds or even any mention of the problem.

So, for all the master MySQL/PHP programmers out there, I'm wondering: how would you rank the results of a search for a single Chinese character in a mixed-character database? MySQL's %LIKE% would work, but it doesn't have the nifty relevance ratings. Should I just count the number of times a character appears in the record, and rank by that? I appreciate any advice you have. Thanks!

+1  A: 

It depends on the size of the dataset. If we're talking hundreds of thousands of rows, I'd probably take a look at one of the excellent independent full-text search solutions available. I've actually never had to deal with this problem myself, so I'm not sure which of those solutions include support for Asian languages.

I do however know that Lucene sports an analyzer for Chinese, Japanese, and Korean, so my guess is that it has some kind of support for what you're doing. What I usually do when I need to integrate Lucene with PHP is to implement Lucene as a socket server and connect to it from PHP.

If the dataset is small enough, it might be an option to roll your own ad-hoc approach. There are two parts to this problem: retrieval of the documents to be ranked, and the actual ranking. There are several ways to do the retrieval. One might be to use LIKE, if your dataset is sufficiently small. Another might be to roll your own disk-based indexing scheme, although this would be rather complex and time-consuming. You could also use MySQL as a middle path, as described below.

In order to implement an indexing scheme using MySQL, you would have to create a few tables with the following structure:

document
  document_id
  document_text
  document_tokencount

document_token
  document_id
  token_id
  token_docfrequency
  index (token_id, document_id)

token
  token_id
  token_unicode
  token_globalfrequency
  index (token_unicode)

Then I'd process each document and insert a row into the document_token table for each character (token) in the document. The token_unicode field would contain the integer Unicode code point used to refer to this character. The token_docfrequency field contains an integer corresponding to the number of times the document contains the token, while the token_globalfrequency field contains the total number of times the term is used across all documents.
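As a rough sketch of that indexing step in PHP (illustrative only; it assumes a PDO connection in $pdo, the table and column names above, and a hypothetical indexDocument() helper):

<?php
// Sketch: split a document into Unicode characters (tokens) and record
// per-document frequencies. Assumes the document row already exists.
function indexDocument(PDO $pdo, $documentId, $text)
{
    // Split into individual Unicode characters; this is the whole
    // "tokenizer" for ideographic text.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

    $counts = array();
    foreach ($chars as $char) {
        $counts[$char] = isset($counts[$char]) ? $counts[$char] + 1 : 1;
    }

    $pdo->beginTransaction();

    $pdo->prepare('UPDATE document SET document_tokencount = ? WHERE document_id = ?')
        ->execute(array(count($chars), $documentId));

    $selectToken  = $pdo->prepare('SELECT token_id FROM token WHERE token_unicode = ?');
    $insertToken  = $pdo->prepare('INSERT INTO token (token_unicode, token_globalfrequency) VALUES (?, ?)');
    $updateToken  = $pdo->prepare('UPDATE token SET token_globalfrequency = token_globalfrequency + ? WHERE token_id = ?');
    $insertDocTok = $pdo->prepare('INSERT INTO document_token (document_id, token_id, token_docfrequency) VALUES (?, ?, ?)');

    foreach ($counts as $char => $frequency) {
        // token_unicode holds the character's integer code point.
        $bytes     = unpack('N', mb_convert_encoding($char, 'UTF-32BE', 'UTF-8'));
        $codepoint = $bytes[1];

        $selectToken->execute(array($codepoint));
        $tokenId = $selectToken->fetchColumn();
        if ($tokenId === false) {
            $insertToken->execute(array($codepoint, $frequency));
            $tokenId = $pdo->lastInsertId();
        } else {
            $updateToken->execute(array($frequency, $tokenId));
        }
        $insertDocTok->execute(array($documentId, $tokenId, $frequency));
    }

    $pdo->commit();
}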

This would allow you to do quick searches for tokens:

SELECT * FROM document_token WHERE token_id = 1
UNION
SELECT * FROM document_token WHERE token_id = 2
UNION
SELECT * FROM document_token WHERE token_id = 3

(The UNION approach is a hack that allows MySQL to utilize the index for each of the SELECTs, and will most likely be faster than the corresponding query using a single SELECT and several OR conditions.)
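For example, a small helper along these lines could build that UNION for an arbitrary list of token ids (again just a sketch; fetchCandidates() is a made-up name and a PDO connection is assumed):

<?php
// Sketch: fetch candidate document_token rows for a list of token ids
// using the UNION-per-token approach described above.
function fetchCandidates(PDO $pdo, array $tokenIds)
{
    $parts = array();
    foreach ($tokenIds as $id) {
        $parts[] = 'SELECT document_id, token_id, token_docfrequency ' .
                   'FROM document_token WHERE token_id = ' . (int) $id;
    }
    $sql = implode(' UNION ', $parts);
    return $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);
}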

This leaves us with relevance ranking as the remaining problem, which is what you really asked for. :) This can be done with rather good results by utilizing the Vector Space Model (VSM).

After doing a search, the first thing you would have to do is calculate the tf-idf score for each token in the query. This is done using the formula:

tf-idf = tf(t,d) / tf(d) * log(D / d(t))

where:
tf(t,d) = token frequency in current document
tf(d) = total number of tokens in current document
D = total number of documents
d(t) = number of documents that contain the token
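
In code the score is a one-liner; this sketch just mirrors the formula above, with the counts coming from the tables described earlier:

<?php
// tf-idf for one token in one document, following the formula above.
//   $tokenFreq     = tf(t,d): occurrences of the token in the document
//   $docTokenCount = tf(d):   total tokens in the document (document_tokencount)
//   $totalDocs     = D:       total number of documents
//   $docsWithToken = d(t):    number of documents containing the token
function tfIdf($tokenFreq, $docTokenCount, $totalDocs, $docsWithToken)
{
    return ($tokenFreq / $docTokenCount) * log($totalDocs / $docsWithToken);
}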

Calculate this score for each term in the search query first, and store the result in a hashmap or something similar. This is your first vector, called v_1. Then proceed to the first document. Calculate the tf-idf score for each term in the document as well, and store it as v_2. Now you can calculate a score for this document using cosine similarity:

score = (v_1 · v_2) / (|v_1| * |v_2|)

The result is the cosine of the angle between the two vectors: a value between 0 and 1 that can be used to rank the document (the closer to 1, the more relevant). Continue and do this for every document, then sort them in descending order. The first document in the list will be the most relevant one.
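
Here's a sketch of the similarity calculation and the final ranking, with each vector represented as a PHP array mapping token_id to its tf-idf weight (cosineSimilarity() is a hypothetical helper, not an existing API):

<?php
// Sketch: cosine similarity between two tf-idf vectors
// (token_id => weight), followed by a descending sort over documents.
function cosineSimilarity(array $v1, array $v2)
{
    $dot = 0.0;
    foreach ($v1 as $tokenId => $weight) {
        if (isset($v2[$tokenId])) {
            $dot += $weight * $v2[$tokenId];
        }
    }

    $norm1 = 0.0;
    foreach ($v1 as $weight) { $norm1 += $weight * $weight; }
    $norm2 = 0.0;
    foreach ($v2 as $weight) { $norm2 += $weight * $weight; }

    if ($norm1 == 0.0 || $norm2 == 0.0) {
        return 0.0;
    }
    return $dot / (sqrt($norm1) * sqrt($norm2));
}

// $scores = array();
// foreach ($docVectors as $docId => $vector) {   // document_id => tf-idf vector
//     $scores[$docId] = cosineSimilarity($queryVector, $vector);
// }
// arsort($scores); // highest similarity (most relevant) first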

This might all sound a bit complicated, but if you have some basic understanding of linear algebra, you could probably produce a working solution in a few hours. Still, if at all possible, use an existing solution such as Lucene.

Emil H
Thank you Emil H, that was an extremely thorough solution. I really enjoyed your mini-tutorial on the Vector Space Model, and will spend some time trying to integrate that in. It's always fun to put some math into searching! Thanks, Jasie
Jasie