views: 374

answers: 1

I have a database full of records filled with Asian characters (Chinese, Japanese, and Korean) alongside records filled with Latin characters (English, Français, you name it), and I want to perform fulltext searches on them.

MySQL (http://dev.mysql.com/doc/refman/5.1/en/fulltext-restrictions.html) says:

Ideographic languages such as Chinese and Japanese do not have word delimiters. Therefore, the FULLTEXT parser cannot determine where words begin and end in these and other such languages. The implications of this and some workarounds for the problem are described in Section 11.8, “Full-Text Search Functions”.

This makes sense. But naturally, Section 11.8 does not contain any workarounds or even any mention of the problem.

So, for all the master MySQL/PHP programmers out there, I'm wondering: how would you rank the results of a search for a single Chinese character in a mixed-character database? MySQL's %LIKE% would work, but it doesn't have the nifty relevance ratings. Should I just count the number of times a character appears in the record, and rank by that? I appreciate any advice you have. Thanks!

+1  A: 

It depends on the size of the dataset. If we're talking hundreds of thousands of rows, I'd probably take a look at one of the excellent independent full-text search solutions available. I've actually never had to deal with this problem myself, so I'm not sure which of those solutions include support for Asian languages.

I do however know that Lucene sports an analyzer for Chinese, Japanese, and Korean, so my guess is that it has some kind of support for what you're doing. What I usually do when I need to integrate Lucene with PHP is to implement Lucene as a socket server and connect to it from PHP.

If the dataset is small enough, it might be an option to roll your own ad-hoc approach. There are two parts to this problem: retrieval of the documents to be ranked, and the actual ranking. There are several ways to do the retrieval. One might be to use LIKE, if your dataset is sufficiently small. Another might be to roll your own disk-based indexing scheme, although this would be rather complex and time-consuming. You could also use MySQL as a middle path, as described below.

In order to implement an indexing scheme using MySQL, you would have to create a few tables with the following structure:

document
  document_id
  document_text
  document_tokencount

document_token
  document_id
  token_id
  token_docfrequency
  index (token_id, document_id)

token
  token_id
  token_unicode
  token_globalfrequency
  index (token_unicode)

Then I'd process each document and insert a row into the document_token table for each character (token) in the document. The token_unicode field would contain the integer Unicode code point used to refer to this character. The token_docfrequency field contains an integer corresponding to the number of times the document contains the token, while the token_globalfrequency field contains the total number of times the term is used across all documents.
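As a rough sketch of that indexing step in PHP (illustrative only; it assumes a PDO connection in $pdo, the table and column names above, and a hypothetical indexDocument() helper):

<?php
// Sketch: split a document into Unicode characters (tokens) and record
// per-document frequencies. Assumes the document row already exists.
function indexDocument(PDO $pdo, $documentId, $text)
{
    // Split into individual Unicode characters; this is the whole
    // "tokenizer" for ideographic text.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

    $counts = array();
    foreach ($chars as $char) {
        $counts[$char] = isset($counts[$char]) ? $counts[$char] + 1 : 1;
    }

    $pdo->beginTransaction();

    $pdo->prepare('UPDATE document SET document_tokencount = ? WHERE document_id = ?')
        ->execute(array(count($chars), $documentId));

    $selectToken  = $pdo->prepare('SELECT token_id FROM token WHERE token_unicode = ?');
    $insertToken  = $pdo->prepare('INSERT INTO token (token_unicode, token_globalfrequency) VALUES (?, ?)');
    $updateToken  = $pdo->prepare('UPDATE token SET token_globalfrequency = token_globalfrequency + ? WHERE token_id = ?');
    $insertDocTok = $pdo->prepare('INSERT INTO document_token (document_id, token_id, token_docfrequency) VALUES (?, ?, ?)');

    foreach ($counts as $char => $frequency) {
        // token_unicode holds the character's integer code point.
        $bytes     = unpack('N', mb_convert_encoding($char, 'UTF-32BE', 'UTF-8'));
        $codepoint = $bytes[1];

        $selectToken->execute(array($codepoint));
        $tokenId = $selectToken->fetchColumn();
        if ($tokenId === false) {
            $insertToken->execute(array($codepoint, $frequency));
            $tokenId = $pdo->lastInsertId();
        } else {
            $updateToken->execute(array($frequency, $tokenId));
        }
        $insertDocTok->execute(array($documentId, $tokenId, $frequency));
    }

    $pdo->commit();
}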

This would allow you to do quick searches for tokens:

SELECT * FROM document_token WHERE token_id = 1
UNION
SELECT * FROM document_token WHERE token_id = 2
UNION
SELECT * FROM document_token WHERE token_id = 3

(The UNION approach is a hack that allows MySQL to utilize the index for each of the SELECTs, and will most likely be faster than the corresponding query using a single SELECT and several OR conditions.)
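For example, a small helper along these lines could build that UNION for an arbitrary list of token ids (again just a sketch; fetchCandidates() is a made-up name and a PDO connection is assumed):

<?php
// Sketch: fetch candidate document_token rows for a list of token ids
// using the UNION-per-token approach described above.
function fetchCandidates(PDO $pdo, array $tokenIds)
{
    $parts = array();
    foreach ($tokenIds as $id) {
        $parts[] = 'SELECT document_id, token_id, token_docfrequency ' .
                   'FROM document_token WHERE token_id = ' . (int) $id;
    }
    $sql = implode(' UNION ', $parts);
    return $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);
}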

This leaves us with relevance ranking as the remaining problem, which is what you really asked for. :) This can be done with rather good results by utilizing the Vector Space Model (VSM).

After doing a search, the first thing you would have to do is calculate the tf-idf score for each token in the query. This is done using the formula:

tf-idf = tf(t,d) / tf(d) * log(D / d(t))

where:
tf(t,d) = token frequency in current document
tf(d) = total number of tokens in current document
D = total number of documents
d(t) = number of documents that contain the token
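
In code the score is a one-liner; this sketch just mirrors the formula above, with the counts coming from the tables described earlier:

<?php
// tf-idf for one token in one document, following the formula above.
//   $tokenFreq     = tf(t,d): occurrences of the token in the document
//   $docTokenCount = tf(d):   total tokens in the document (document_tokencount)
//   $totalDocs     = D:       total number of documents
//   $docsWithToken = d(t):    number of documents containing the token
function tfIdf($tokenFreq, $docTokenCount, $totalDocs, $docsWithToken)
{
    return ($tokenFreq / $docTokenCount) * log($totalDocs / $docsWithToken);
}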

Calculate this score for each term in the search query first, and store the result in a hashmap or something similar. This is your first vector, called v_1. Then proceed to the first document. Calculate the tf-idf score for each term in the document as well, and store it as v_2. Now you can calculate a score for this document using cosine similarity:

score = (v_1 · v_2) / (|v_1| * |v_2|)

The result is the cosine of the angle between the two vectors: a value between 0 and 1 that can be used to rank the document (the closer to 1, the more relevant). Continue and do this for every document, then sort them in descending order. The first document in the list will be the most relevant one.
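
Here's a sketch of the similarity calculation and the final ranking, with each vector represented as a PHP array mapping token_id to its tf-idf weight (cosineSimilarity() is a hypothetical helper, not an existing API):

<?php
// Sketch: cosine similarity between two tf-idf vectors
// (token_id => weight), followed by a descending sort over documents.
function cosineSimilarity(array $v1, array $v2)
{
    $dot = 0.0;
    foreach ($v1 as $tokenId => $weight) {
        if (isset($v2[$tokenId])) {
            $dot += $weight * $v2[$tokenId];
        }
    }

    $norm1 = 0.0;
    foreach ($v1 as $weight) { $norm1 += $weight * $weight; }
    $norm2 = 0.0;
    foreach ($v2 as $weight) { $norm2 += $weight * $weight; }

    if ($norm1 == 0.0 || $norm2 == 0.0) {
        return 0.0;
    }
    return $dot / (sqrt($norm1) * sqrt($norm2));
}

// $scores = array();
// foreach ($docVectors as $docId => $vector) {   // document_id => tf-idf vector
//     $scores[$docId] = cosineSimilarity($queryVector, $vector);
// }
// arsort($scores); // highest similarity (most relevant) first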

This might all sound a bit complicated, but if you have some basic understanding of linear algebra, you could probably produce a working solution in a few hours. Still, if at all possible, use an existing solution such as Lucene.

Emil H
Thank you Emil H, that was an extremely thorough solution. I really enjoyed your mini-tutorial on the Vector Space Model, and will spend some time trying to integrate that in. It's always fun to put some math into searching! Thanks, Jasie
Jasie