views:

399

answers:

5

Hi at all. I have a project with 2 applications ( books and reader ).

Books application has a table with 4 milions of rows with this fields:

 book_title = models.CharField(max_length=40)
 book_description = models.CharField(max_length=400)

To avoid to query the database with 4 milions of rows, I am thinking to divide it by subject ( 20 models with 20 tables with 200.000 rows ( book_horror, book_drammatic, ecc ).

In "reader" application, I am thinking to insert this fields:

reader_name = models.CharField(max_length=20, blank=True)
book_subject = models.IntegerField()
book_id = models.IntegerField()

So instead of ForeignKey, I am thinking to use a integer "book_subject" (which allows to access the appropriate table) and "book_id" (which allows to access the book in the table specified in "book_subject").

Is a good solution to avoid to query a table with 4 milions of rows ?

Is there an alternative solution ?

Thanks ^__^

+5  A: 

ForeignKey is implemented as IntegerField in the database, so you save little to nothing at the cost of crippling your model.

Edit: And for pete's sake, keep it in one table and use indexes as appropriate.

Ignacio Vazquez-Abrams
I am using index but the table has 4 milions of rows and it's often queried. So I don't know if the index is enough :-\
xRobot
If it isn't then the database needs more memory.
Ignacio Vazquez-Abrams
4 million rows is nothing to sneeze at, but databases are built for this kind of thing, especially if you're indexing. I would only worry about it if you are getting up to at least a hundred million rows.
LeafStorm
A: 

I'm not familiar with Django, but I have a general understanding of DB.

When you have large databases, it's pretty normal to index your database. That way, retrieving data, should be pretty quick.

When it comes to associate a book with a reader, you should create another table, that links reader to books.

It's not a bad idea to divide the books into subjects. But I'm not sure what you mean by having 20 applications.

Steven
20 applications means 20 tables :) I am using already index but the table has 4 milions of rows and it's often queried. So I don't know if the index is enough :-\
xRobot
Gods no! Do not split it up into 20 tables! Make an EAR diagram of you tables, and you will see what extra table you need to handle this. E.g. extra tables for relating person to books, category to books and so on.
Steven
A: 

Are you having performance problems? If so, you might need to add a few indexes.

One way to get an idea where an index would help is by looking at your db server's query log (instructions here if you're on MySQL).

If you're not having performance problems, then just go with it. Databases are made to handle millions of records, and django is pretty good at generating sensible queries.

Seth
Yes, its a performance problem. I am using index but the table has 4 milions of rows and it's often queried. So I don't know if the index is enough :-\
xRobot
A bigger index (more columns) is probably the way to go, maybe in addition to memcached as @jcm mentions. Often times single column indexes don't help because they don't get used by your queries.
Seth
A: 

A common approach to this type of problem is Sharding. Unfortunately it's mostly up to the ORM to implement it (Hibernate does it wonderfully) and Django does not support this. However, I'm not sure 4 million rows is really all that bad. Your queries should still be entirely manageable.

Perhaps you should look in to caching with something like memcached. Django supports this quite well.

jcm
+1  A: 

Like many have said, it's a bit premature to split your table up into smaller tables (horizontal partitioning or even sharding). Databases are made to handle tables of this size, so your performance problem is probably somewhere else.

Indexes are the first step, it sounds like you've done this though. 4 million rows should be ok for the db to handle with an index.

Second, check the number of queries you're running. You can do this with something like the django debug toolbar, and you'll often be surprised how many unnecessary queries are being made.

Caching is the next step, use memcached for pages or parts of pages that are unchanged for most users. This is where you will see your biggest performance boost for the little effort required.

If you really, really need to split up the tables, the latest version of django (1.2 alpha) can handle sharding (eg multi-db), and you should be able to hand write a horizontal partitioning solution (postgres offers an in-db way to do this). Please don't use genre to split the tables! pick something that you wont ever, ever change and that you'll always know when making a query. Like author and divide by first letter of the surname or something. This is a lot of effort and has a number of drawbacks for a database which isn't particularly big --- this is why most people here are advising against it!

[edit]

I left out denormalisation! Put common counts, sums etc in the eg author table to prevent joins on common queries. The downside is that you have to maintain it yourself (until django adds a DenormalizedField). I would look at this during development for clear, straightforward cases or after caching has failed you --- but well before sharding or horizontal partitioning.

Will Hardy
Ok if I will split the table, I will split it by first letter... indeed is more reasonable :)The table is queried from ajax using an autocomplete field with thisquery in the views.py:books.objects.filter(book_title__istartswith=request.GET['q'])[:100] So do you recommend me index + memcached ?Thanks
xRobot
Make an index on the fist three letters of title (or whatever is the first number you start querying the database) and it will run quite fast.
naivists