views:

191

answers:

5

Dear Everyone I am currently using mysql. I am finding that my schema is getting incredibly complicated. I seek to find a new db that will suit my needs:

Let's assume I am building a news aggregrator (which collects news from multiple website). I then run algorithms to determine if two news from different sites are actually referring to the same topic. I run this algorithm to cluster news together. The relationship is depicted below:

cluster
\--news1
   \--word1
   \--word2
\--news2
   \--word3
\--news3
   \--word1
   \--word3

And then I will apply some magic and determine the importance of each word. Summing all the importance of each word gives me the importance of a news article. Summing the importance of each news article gives me the importance of a cluster.

Note that above cluster there are also subgroups( like split by region etc), and categories (like sports, etc) which I have to determine the importance of that in a particular day per se.

I have used views in the past to do so, but I realized that views are very slow. So i will normally do an insert into an actual table and index them for better performance. As you can see this leads to multiple tables derived like (cluster, importance), (news, importance), (words, importance) etc which can get pretty messy.

Also the "importance" metric will change. It has become increasingly difficult to alter tables, update data (which I am using TRUNCATE TABLE) and then inserting from null.

I am currently looking into something schemaless like Mongodb. I do not need distributedness. I would very much want something that is reasonably fast (which can be indexed) and something that is a lot more flexible that traditional RDMBS.

NEW

As requested by various people, I will post my usage to this database (they are not actual SQL queries since I hope everyone here could understand)

TABLE word ( word_id, news_id, word )
TABLE news ( news_id, date, site .. )
TABLE clusters ( cluster_id, cluster_leader, cluster_name, ... )
TABLE mapping_clusters_news( cluster_id, news_id)
TABLE word_importance (word_id, score)
TABLE news_importance (news_id, score)
TABLE cluster_importance( cluster_id, score)
TABLE group_importance( cluster_id, score)

You might notice that TABLE_word has an extra news_id column. This is to correspond to TABLE_word_importance column because the same word can have different importance in different articles (if you are familiar with tfidf, this is basically something like that).

All the "importance" table now calculates the importance of each entity by averaging the importance of all the sub-entities below it. This means that Each cluster's importance is determined by all the news inside it, each news's importance is determined by all the words inside it etc.

TYPICAL USAGE:
1) SELECT clusters FROM db THAT HAS word1, word2, word3, .. ORDER BY cluster_importance_score
2) SELECT words FROM db BELONGING TO THE CLUSTER cluster_id=5 ODER BY word_importance score.
3) SELECT groups ordered by importance score.

As you can see, I am deriving a lot of scores from each layer, and someone have been telling me to use a materialized view for this purpose (which postgresql supports it). However, as you can see, this simple schema already consists of 8 tables (my actual db consists of 26 tables of crap like that, which is adding so much additional layers of complexity for maintainance).

NOTE THIS IS NOT ABOUT FULL-TEXT SEARCH.

A: 

How about db4o? db4o

Raj
not really looking into something like that sorry
afvasd
A: 

ORM means "Object-relational mapper". Not using a relational database wouldn't make much sense. I'll pretend you meant "I want to be able to serialize objects".

I don't understand why distributedness is not required. Could you elaborate on that?

Personally, I would reccomend Cassandra. It still has reasonably close ties to (by which I mean easy to integrate with) Hadoop, which you will probably eventually want for your processing. As an added bonus, there's Telephus, so Cassandra supports Twisted beautifully. Cassandra's method of conflict resolution (currently timestamps, soon-ish vector clocks) might work for your changing metric as long as you don't mind getting the old value for as long as the metric hasn't been recalculated. Otherwise, you might move up a level and simply store multiple versions of the data with different versions of the metric. That way, if you decide a metric is a bad idea, you don't have to recompute.

Cassandra, unfortunately, does not have something that serializes/deserializes objects very well yet. However, for the thin wrappers you would be writing (essentially structs with a few methods), would writing a fromCassandra @classmethod really be that big a deal?

lvh
A: 

Postgresql may be "schema based" but it kind of feels like you're throwing the baby out with the bathwater. If you don't need a distributed db or a particularly schema-less design (which it doesn't sound like offhand you do, but you appear to think you do) then I'm not sure why you would want mongodb. Postgres has lots of indexing options and it sounds like its built in full text searching would be good for you. If you're used to MySQL and altering tables (you mentioned issues there) can be a nightmare, mostly its better in Postgres. I'm a fan on Postgres and MongoDB - it just don't sound like there's a good reason to move away from a relational db for data that certainly sounds relational in nature.

rfusca
A: 

In a word, YES, you should probably be looking at something else: Cassandra, Hadoop, MongoDB, something.

MongoDB is basically going to reduce your sample schema to "clusters" and "news", with everything else basically being contained in those two.

The good news:

  1. This will make it easy to modify fields.
  2. Map-reduce operations are a natural fit for the type of work that you're doing. You perform a map-reduce and then save the data back to the "news" item and all will be well.

The bad news:

  1. It's easy to lose track of the structure of data with something like Mongo. Hadoop and Hive typically force your schema little more. But in any case, you'll need to write down some form of schema or just drown.

  2. If you plan to do this for some non-trivial amount of data, then you're going to want "horizontal" scalability. MongoDB is "ok" for this, Hadoop is definitely a "leader" for this.

Gates VP
A: 

When the schema is getting complicated, a graph database can be a good alternative. As I understand your domain, you have lots of entities related to other entities in different ways. Would it make sense to you to model this as a graph/network of entities? As food for thought I whipped up an example using Neo4j:

news-analysis-example

In a graphdb you can set properties on both nodes and relationships, which could be useful in your case (for instance the number of times a word is used in a news entry could be added to the relationship to that word). BTW, I added an extra is_related relationship between two news items, as I thought that could be interesting as well.

nawroth