I have nearly 150k articles in Turkish that I will use for natural language processing research. After processing the articles, I want to store each word and its frequency per article.
I'm storing them in an RDBMS now.
I have 3 tables:
Articles -> article_id, text
Words -> word_id, type, word
Words-Article -> id, word_id, article_id, frequency (one index on word_id, one index on article_id)
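To make the layout concrete, here is a minimal sketch of these three tables using Python's built-in sqlite3 module. SQLite is only a stand-in here (my real data is in MySQL/Oracle), and the table name words_article, the column types, and the index names are assumptions for illustration.

```python
import sqlite3

def create_schema(conn):
    """Create the three tables and the two indexes described above."""
    conn.executescript("""
        CREATE TABLE articles (
            article_id INTEGER PRIMARY KEY,
            text       TEXT
        );
        CREATE TABLE words (
            word_id INTEGER PRIMARY KEY,
            type    TEXT,
            word    TEXT
        );
        -- One row per (word, article) pair; this is the table with millions of rows.
        CREATE TABLE words_article (
            id         INTEGER PRIMARY KEY,
            word_id    INTEGER REFERENCES words(word_id),
            article_id INTEGER REFERENCES articles(article_id),
            frequency  INTEGER
        );
        CREATE INDEX idx_wa_word    ON words_article(word_id);
        CREATE INDEX idx_wa_article ON words_article(article_id);
    """)

# In-memory database, used here only so the sketch is runnable as-is.
conn = sqlite3.connect(":memory:")
create_schema(conn)
```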
I will query for:
- all words in an article
- one word's frequency per article
- a word's occurrences across all articles, and which articles it appears in
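The three queries above could look roughly like this. This is a sketch with Python's sqlite3 module standing in for my real database; the table name words_article, the function names, and the sample rows are made up for illustration. I'm reading "one word's frequency per article" as "the frequency of a given word in a given article".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (word_id INTEGER PRIMARY KEY, type TEXT, word TEXT);
    CREATE TABLE words_article (
        id INTEGER PRIMARY KEY, word_id INTEGER, article_id INTEGER, frequency INTEGER
    );
    CREATE INDEX idx_wa_word    ON words_article(word_id);
    CREATE INDEX idx_wa_article ON words_article(article_id);
""")
# Hypothetical sample data: two words, two articles.
conn.executemany("INSERT INTO words VALUES (?, ?, ?)",
                 [(1, "noun", "kitap"), (2, "verb", "okumak")])
conn.executemany(
    "INSERT INTO words_article (word_id, article_id, frequency) VALUES (?, ?, ?)",
    [(1, 10, 3), (2, 10, 1), (1, 11, 5)])

def words_in_article(conn, article_id):
    """All words in one article, with their frequencies."""
    return conn.execute(
        "SELECT w.word, wa.frequency FROM words_article wa "
        "JOIN words w ON w.word_id = wa.word_id "
        "WHERE wa.article_id = ? ORDER BY w.word", (article_id,)).fetchall()

def word_frequency(conn, word, article_id):
    """Frequency of one word in one article (0 if it does not occur)."""
    row = conn.execute(
        "SELECT wa.frequency FROM words_article wa "
        "JOIN words w ON w.word_id = wa.word_id "
        "WHERE w.word = ? AND wa.article_id = ?", (word, article_id)).fetchone()
    return row[0] if row else 0

def word_occurrences(conn, word):
    """Every article containing the word, with the frequency in each."""
    return conn.execute(
        "SELECT wa.article_id, wa.frequency FROM words_article wa "
        "JOIN words w ON w.word_id = wa.word_id "
        "WHERE w.word = ? ORDER BY wa.article_id", (word,)).fetchall()
```

The first query uses the article_id index and the last one uses the word_id index, which is why I have both.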
I have millions of rows in the Words-Article table. I have always worked with an RDBMS in this project: I started with MySQL and am using Oracle now. But I don't want to use Oracle, and I want better performance than MySQL.
Also, I have to handle this job on a machine with 4 GB of RAM.
Simply put, how should I store a document-term matrix and run queries on it? Performance is essential. Can "key-value databases" beat MySQL on performance? Or what else can beat MySQL?
If your answer depends on the programming language, I'm writing code in Python, but C/C++ or Java is OK.