I have nearly 150k articles in Turkish, which I will use for natural language processing research. After processing the articles, I want to store the words in each article and their frequencies.

I'm storing them in an RDBMS now.

I have 3 tables:

Articles -> article_id, text
Words -> word_id, type, word
Words-Article -> id, word_id, article_id, frequency (index on word_id, index on article_id)

I will query for

  • All words in an article
  • One word's frequency per article
  • A word's occurrences across all articles, and which articles contain it

I have millions of rows in the words-article table. I have always used an RDBMS in this project: I started with MySQL and am using Oracle now. But I don't want to use Oracle, and I want better performance than MySQL.

Also, I have to handle this job on a machine with 4 GB of RAM.
Simply put, how should I store a document-term matrix and run queries on it? Performance is essential. Can "key-value databases" beat MySQL on performance? Or what else can beat MySQL?

If your answer depends on the programming language, I'm writing code in Python, but C/C++ or Java is OK.

A: 

Maybe check out Lucene (or Zend_Search_Lucene in PHP). It's a very nice full-text search (FTS) engine.
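
Lucene is built around an inverted index, which maps each term to the documents that contain it. The sketch below is not Lucene's API; it only illustrates the idea in plain Python, since that is the asker's language. The function name `build_index` and the sample articles are made up for the example, and the tokenization is deliberately naive.

```python
from collections import Counter, defaultdict


def build_index(articles):
    """Map each word to {article_id: frequency} (a toy inverted index)."""
    index = defaultdict(dict)
    for article_id, text in articles.items():
        # Naive whitespace tokenization. Real Turkish text needs proper
        # tokenization and stemming, and str.lower() does not apply
        # Turkish casing rules (for example "I" becomes "i", not "ı").
        for word, freq in Counter(text.lower().split()).items():
            index[word][article_id] = freq
    return index


# Hypothetical sample data, not taken from the question.
articles = {1: "kedi köpek kedi", 2: "köpek kuş"}
index = build_index(articles)

postings_kedi = index["kedi"]      # articles containing "kedi", with counts
postings_kopek = index["köpek"]    # articles containing "köpek", with counts
```

With this layout, "in which articles does a word occur, and how often" is a single dictionary lookup. Listing all words of one article would need a second mapping keyed by article.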

jspcal
A: 

For 150k articles, you must have a few hundred million rows in the words_articles table. This is manageable, as long as you configure MySQL properly.

A few tips:

  • Make sure your tables are MyISAM, not InnoDB.
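  • Drop the id field in the words_articles table and make (word_id, article_id) the primary key. Also make sure article_id has its own index. word_id does not need a separate one, because it is the leftmost column of the primary key, which already serves lookups by word_id:

    ALTER TABLE words_articles
    DROP PRIMARY KEY,
    DROP COLUMN id,
    ADD PRIMARY KEY (word_id, article_id),
    ADD INDEX (article_id);

    (Doing everything in a single ALTER statement gives much better performance than running several, because the large table is rebuilt only once. Leave out the ADD INDEX line if article_id is already indexed.)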

  • Create an index for word in the words table:

    ALTER TABLE words ADD INDEX (word);
    
  • Tweak my.cnf. Specifically, increase the buffer sizes (especially key_buffer_size). my-huge.cnf might be a good starting point.
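
As a rough illustration of how a composite primary key plus an article_id index covers the three queries in the question, here is a minimal sketch. It uses Python's built-in sqlite3 with an in-memory database purely as a stand-in so that it runs without a server; on MySQL the SQL would be the same in spirit, but the driver and placeholder syntax would differ. The index names and sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (word_id INTEGER PRIMARY KEY, word TEXT);
    CREATE INDEX idx_words_word ON words (word);
    CREATE TABLE words_articles (
        word_id INTEGER, article_id INTEGER, frequency INTEGER,
        PRIMARY KEY (word_id, article_id)
    );
    CREATE INDEX idx_wa_article ON words_articles (article_id);
""")

# Hypothetical sample rows: (word_id, article_id, frequency).
conn.executemany("INSERT INTO words VALUES (?, ?)",
                 [(1, "kedi"), (2, "köpek")])
conn.executemany("INSERT INTO words_articles VALUES (?, ?, ?)",
                 [(1, 10, 2), (2, 10, 1), (2, 11, 3)])

# 1. All words in one article (can use the article_id index).
words_in_article = conn.execute(
    "SELECT w.word, wa.frequency FROM words_articles wa "
    "JOIN words w ON w.word_id = wa.word_id "
    "WHERE wa.article_id = ? ORDER BY w.word", (10,)).fetchall()

# 2. One word's frequency in one article (a primary-key lookup).
freq = conn.execute(
    "SELECT frequency FROM words_articles "
    "WHERE word_id = ? AND article_id = ?", (2, 11)).fetchone()

# 3. Every article containing a word (can use the leftmost
#    column of the primary key after resolving the word to its id).
articles_with_word = conn.execute(
    "SELECT wa.article_id, wa.frequency FROM words_articles wa "
    "JOIN words w ON w.word_id = wa.word_id "
    "WHERE w.word = ? ORDER BY wa.article_id", ("köpek",)).fetchall()
```

Whether a given database actually picks these indexes depends on its query planner, so it is worth checking with EXPLAIN on the real data.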

Can Berk Güder