views: 105

answers: 4

Hi all,

I'm a lone developer for a telecoms company, and am after some database design advice from anyone with a bit of time to answer.

I am inserting ~2 million rows each day into one table; these tables are then archived and compressed on a monthly basis. Each monthly table contains ~15,000,000 rows, and this is increasing month on month.

For every insert I do above, I am combining the data from rows which belong together and creating another "correlated" table. This table is currently not being archived, as I need to make sure I never miss an update to it (hope that makes sense), although in general this information should remain fairly static after a couple of days of processing.

All of the above is working perfectly. However, my company now wishes to perform some stats against this data, and these tables are getting too large to provide the results in what would be deemed a reasonable time, even with the appropriate indexes set.

So I guess after all the above my question is quite simple. Should I write a script which groups the data from my correlated table into smaller tables, or should I store the query result sets in something like memcache? I'm already using MySQL's query cache, but due to having limited control over how long the data is stored for, it's not working ideally.
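To illustrate what I mean by "smaller tables", here is a rough sketch of the kind of pre-aggregated summary table I'm considering; all of the names (stats_daily_by_destination, destination, total_minutes and so on) are placeholders rather than my real schema:

```sql
-- Sketch only: a summary table a script would fill from the correlated
-- table, so the stats queries scan thousands of rows instead of millions.
CREATE TABLE stats_daily_by_destination (
    stat_date      DATE         NOT NULL,
    destination    VARCHAR(32)  NOT NULL,
    total_minutes  INT UNSIGNED NOT NULL DEFAULT 0,
    call_count     INT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (stat_date, destination)
) ENGINE=InnoDB;
```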

The main advantages I can see of using something like memcache:

  • No blocking on my correlated table after the query has been cached.
  • Greater flexibility in sharing the collected data between the backend collector and front-end processor (i.e. custom reports could be written in the backend and their results stored in the cache under a key, which is then shared with anyone who wants to see that report's data).
  • Redundancy and scalability if we start sharing this data with a large number of customers.

The main disadvantages I can see of using something like memcache:

  • Data is not persistent if the machine is rebooted or the cache is flushed.

The main advantages of using MySQL:

  • Persistent data.
  • Fewer code changes (although adding something like memcache is trivial anyway).

The main disadvantages of using MySQL:

  • Have to define table templates every time I want to provide a new set of grouped data.
  • Have to write a program which loops through the correlated data and fills these new tables.
  • Potentially will still slow down as the data continues to grow.

Apologies for quite a long question. It's helped me to write down these thoughts here anyway, and any advice/help/experience with dealing with this sort of problem would be greatly appreciated.

Many thanks.

Alan

+2  A: 

Aside from the options you discuss above, you might also want to consider adding more powerful hardware into the picture, if that's an option.

This bit of your question shows that the underlying problem here is speed of results:

However my company now wishes to perform some stats against this data, and these tables are getting too large to provide the results in what would be deemed a reasonable time.

In situations where speed of results is important, throwing better/extra hardware at the problem can often work out cheaper than developing new code/database structures/etc.

Just a thought!

codeulike
After reading through the question my first thought was 'Morez hardz' but it appears I was beaten to it.
graham.reeds
Thanks. I'm guessing, however, that my bottleneck is now purely the I/O of extracting the data from the hard drive? I'm not sure what solution I would use even if I threw another machine/more drives into the equation? Of course there are SSD drives, I guess?
If MySQL is caching blocks of data in memory then I/O isn't necessarily the bottleneck. Maybe do some monitoring to find out whether it is or not.
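As a rough starting point (a sketch, assuming InnoDB and/or MyISAM tables), comparing the "disk read" counters to the "read request" counters gives a hint as to whether reads are being served from memory or from disk:

```sql
-- Innodb_buffer_pool_reads (went to disk) vs. Innodb_buffer_pool_read_requests (total)
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
-- Key_reads (went to disk) vs. Key_read_requests (total) for MyISAM indexes
SHOW GLOBAL STATUS LIKE 'Key_read%';
-- How much memory is allocated to the InnoDB buffer pool
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';
```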
codeulike
I've not actually used any MySQL monitoring tools before. Shame on me really. I'll have a look and try to find where the bottleneck actually is with these. Thanks very much for your replies. Greatly appreciated.
+1  A: 

(Another answer from me, different enough that I'll post it separately)

Two questions:

What sort of stats does your company want to generate?
and
After rows are inserted into the database, are they ever changed?

If data doesn't change after insert, then you may be able to build up a separate 'stats' table that you amend/update as new rows are inserted, or soon after new rows are inserted.

e.g. things like:

  • When a new row is inserted that's relevant to stat 'B', go and increment a number in another table for stat 'B', minute 'Y'
    or
  • Every hour, run a small query on rows that have been inserted in the last hour, that generates the stats for that hour and stores them separately
    or
  • As above, but each minute, etc.

It's hard to be any more specific without knowing the details, but depending on the stats you're after, these kinds of approaches may help.
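For the "every hour" variant, a sketch along these lines might work; the table and column names (correlated_calls, stats_hourly, destination, minutes, inserted_at) are invented, and it assumes a unique key on (hour_start, destination) in the stats table:

```sql
-- Roll the previous complete hour of correlated rows up into a small stats table.
INSERT INTO stats_hourly (hour_start, destination, total_minutes, call_count)
SELECT
    DATE_FORMAT(inserted_at, '%Y-%m-%d %H:00:00') AS hour_start,
    destination,
    SUM(minutes),
    COUNT(*)
FROM correlated_calls
WHERE inserted_at >= DATE_FORMAT(NOW() - INTERVAL 1 HOUR, '%Y-%m-%d %H:00:00')
  AND inserted_at <  DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00')
GROUP BY hour_start, destination
ON DUPLICATE KEY UPDATE
    total_minutes = total_minutes + VALUES(total_minutes),
    call_count    = call_count + VALUES(call_count);
```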

codeulike
The stats are basic, such as grouping minutes by destination, counts by destination, and stats by customer. All the information is in the one correlated table, meaning I have to wait till I've got complete correlated data. Of the three solutions above, which would be your preferred? Obviously doing the insert and update for every insert will slow down the speed at which I can do a complete record insert.
If you can update the stats as every insert happens (which is also what user pcent suggested, below), then that is conceptually simpler, but as you say it makes the updates slower. It also increases the likelihood of bugs in the insert code, which you don't want. A separate process that generates stats for recent inserts every n minutes is safer and (in theory) won't slow the inserts down, but it is slightly harder to code.
codeulike
+1  A: 

If you want to do some analysis of static data from a few days back, you should perhaps consider using something like an OLAP system.

Basically, this type of system stores intermediate stats in its own format so it can do quick SUM(), AVG(), COUNT()... operations on large tables.
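To give a small taste of the idea in plain MySQL (not a full OLAP system), GROUP BY ... WITH ROLLUP produces per-group subtotals and a grand total in a single pass; the table and column names below are invented for illustration:

```sql
-- Invented schema: correlated_calls(customer_id, destination, minutes).
-- WITH ROLLUP adds subtotal rows per customer plus a grand-total row.
SELECT customer_id, destination, SUM(minutes) AS total_minutes
FROM correlated_calls
GROUP BY customer_id, destination WITH ROLLUP;
```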

I think your question is a perfect example of the situation where it's used, but perhaps I think so just because it's my job. =)

Take a look.

c-verde
Thanks very much, I'll have a look at OLAP now
+1  A: 

I work at a company with a similar situation, with millions of inserts monthly.

We adopted the strategy of summarizing the data in smaller tables, grouped by certain fields.

In our case, when an insert is performed, it triggers a function which classifies the inserted tuple and increments the summary tables.
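If that classification lives in the database itself, a minimal sketch of the idea as a MySQL trigger might look like this; the table and column names (calls, stats_by_destination, destination, minutes) are invented, and it assumes a unique key on destination in the summary table:

```sql
DELIMITER //
CREATE TRIGGER trg_calls_after_insert
AFTER INSERT ON calls
FOR EACH ROW
BEGIN
    -- Add the new row's minutes to its destination's running totals.
    INSERT INTO stats_by_destination (destination, total_minutes, call_count)
    VALUES (NEW.destination, NEW.minutes, 1)
    ON DUPLICATE KEY UPDATE
        total_minutes = total_minutes + NEW.minutes,
        call_count    = call_count + 1;
END//
DELIMITER ;
```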

From time to time, we move the oldest rows to a backup table, reducing the growth of the main table.
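That periodic move is essentially an INSERT ... SELECT followed by a DELETE over the same date range; this is just a sketch, with the table names and the 90-day cutoff invented:

```sql
-- Copy old rows to the backup table, then remove them from the main table.
INSERT INTO calls_archive
SELECT * FROM calls
WHERE inserted_at < NOW() - INTERVAL 90 DAY;

DELETE FROM calls
WHERE inserted_at < NOW() - INTERVAL 90 DAY;
```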

pcent
Looks like this is the way forwards! Thanks for your reply.