I've been coding PHP for a while now and have a pretty firm grip on it. MySQL, well, let's just say I can make it work.

I'd like to make a stats script to track the stats of other websites, similar to the obvious ones: StatCounter, Google Analytics, Mint, etc.

I would, of course, like to code this properly, and I don't see MySQL liking 20,000,000 to 80,000,000 inserts per day (roughly 925 inserts per second).

I've been doing some research and it looks like I should store each visit ("entry") in a CSV or some other form of flat file and then import the data I need from it.

Am I on the right track here? I just need a push in the right direction, the direction being a way to inhale 1,000 pseudo "MySQL" inserts per second, and the proper way of doing it.

Example Insert: IP, time(), http_referer, etc.

I need to collect this data for the day, and then at the end of the day, or at certain intervals, update ONE row in the database with, for example, how many extra unique hits we got. I know how to do that, of course; I'm just trying to give a visualization since I'm horrible at explaining things.
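
To try to visualize it, here's a rough sketch of the kind of end-of-day rollup I mean (the table and column names are just placeholders, not a real schema):

    <?php
    // End-of-day rollup sketch: count the day's unique IPs from the raw hit
    // log and fold the number into a single summary row. All names are made up.
    $pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');

    $uniques = $pdo->query(
        "SELECT COUNT(DISTINCT ip) FROM raw_hits WHERE DATE(hit_time) = CURDATE()"
    )->fetchColumn();

    $stmt = $pdo->prepare(
        "INSERT INTO daily_summary (site_id, `day`, unique_hits)
         VALUES (:site, CURDATE(), :uniq)
         ON DUPLICATE KEY UPDATE unique_hits = :uniq2"
    );
    $stmt->execute(array(':site' => 1, ':uniq' => $uniques, ':uniq2' => $uniques));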

If anyone can help me out, I'm a great coder and would be more than willing to return the favor.

+4  A: 

We tackled this at the place I've been working at for the last year or so, over the summer. We didn't require much granularity in the information, so what worked very well for us was coalescing data by different time periods. For example, we'd have a single day's worth of real-time stats; after that it'd be pushed into some daily sums, and then off into a monthly table.

This obviously has some huge drawbacks, namely a loss of granularity. We considered a lot of different approaches at the time. For example, as you said, CSV or some similar format could potentially serve as a way to handle a month of data at a time. The big problem, however, is the inserts.

Start by setting out some sample schema in terms of EXACTLY what information you need to keep, and in doing so, you'll guide yourself (through revisions) to what will work for you.
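
For example, a starting point for that kind of coalescing layout might look something like this - purely illustrative names, just to show the shape of it:

    <?php
    // Hypothetical starting schema for the day/month coalescing approach:
    // one raw hits table that only ever holds about a day of data, plus rollups.
    $pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');

    $pdo->exec("
        CREATE TABLE raw_hits (
            id       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
            site_id  INT UNSIGNED    NOT NULL,
            ip       VARBINARY(16)   NOT NULL,
            referer  VARCHAR(255)    NULL,
            hit_time DATETIME        NOT NULL
        )
    ");

    $pdo->exec("
        CREATE TABLE daily_summary (
            site_id     INT UNSIGNED NOT NULL,
            `day`       DATE         NOT NULL,
            hits        INT UNSIGNED NOT NULL DEFAULT 0,
            unique_hits INT UNSIGNED NOT NULL DEFAULT 0,
            PRIMARY KEY (site_id, `day`)
        )
    ");

    $pdo->exec("
        CREATE TABLE monthly_summary (
            site_id     INT UNSIGNED NOT NULL,
            `month`     DATE         NOT NULL,  -- first day of the month
            hits        INT UNSIGNED NOT NULL DEFAULT 0,
            unique_hits INT UNSIGNED NOT NULL DEFAULT 0,
            PRIMARY KEY (site_id, `month`)
        )
    ");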

Another note on the vast number of inserts: we had also talked through the idea of dumping real-time statistics into a little daemon which would store up to an hour's worth of data and then, outside real time, inject it into the database before the next hour was up. Just a thought.

Tony k
Hehe I like this answer much better than mine.
thomasrutter
Wow, it sounds like you were in the same exact boat that I'm in now, the one up shit's creek without a paddle, haha. I need to take in real-time stats, but will be running a script that takes out the information I need and then inserts the aforementioned data into simpler rows in the table. You're kind of losing me with the daemon and schema talk, but forgive me as I'm surely not as seasoned as you are.
John
Schema as in semantics, some type of blueprint or layout of what you're going to do? Daemons as in the Exim mailer daemon, something that runs in the background? Forgive my ignorance.
John
Schema in databases refers literally to how the database is structured, i.e. the set of CREATE TABLE (etc.) commands to be run.
Tony k
Oh, gotcha. So what do you guys think about inserting millions of rows into a single CSV... can a file handle that much information? Could I then pump that data into a temp table, get the information I need from the real-time stats, etc., or would I need some kind of supercomputer to deal with such large files? It'd be like an 8 GB CSV file...
John
A: 

Presuming that your MySQL server is on a different physical machine to your web server, then yes, it probably would be a bit more efficient to log the hit to a file on the local filesystem and then push those entries to the database periodically.

That would add some complexity, though. Have you tested, or considered testing, it with regular queries? I.e., increment a counter using an UPDATE query (because you don't need each entry in a separate row). You may find that this doesn't slow things down as much as you had thought, though obviously if you are pushing 80,000,000 page views a day you probably don't have much wiggle room at all.
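
A rough sketch of what I mean by the counter approach, assuming a table keyed on site and day (all names here are made up):

    <?php
    // One counter row per site per day, bumped with a single statement per
    // hit rather than storing one row per hit.
    $pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');

    $stmt = $pdo->prepare(
        "INSERT INTO daily_counters (site_id, `day`, hits)
         VALUES (:site, CURDATE(), 1)
         ON DUPLICATE KEY UPDATE hits = hits + 1"
    );
    $stmt->execute(array(':site' => 1));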

thomasrutter
I wouldn't be able to use the UPDATE approach, simply because I only want to update information based on unique visitors; therefore I need to look at the data first and then update it :( Which brings me back to figuring out the fastest, most practical solution for taking in the equivalent of 1,000 inserts to MySQL per second. Thanks for answering my question and sharing your knowledge. It's much appreciated.
John
A: 

I think that using MySQL is overkill for the task of collecting the logs and summarizing them. I'd stick to plain log files in your case. It does not provide the full power of a relational database management system, but it's quite enough to generate summaries. A simple lock-append-unlock file operation on a modern OS is seamless and instant. Using MySQL for the same simple operation, on the other hand, loads the CPU and may lead to swapping and other scalability headaches.

Mind the storage as well. With plain text files you'll be able to store years of logs for a highly loaded website, taking into account the current HDD price/capacity ratio and the compressibility of plain text logs.
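
A minimal sketch of that lock-append-unlock logging in PHP (the log path and line format are just examples):

    <?php
    // Append one tab-separated line per hit to a daily log file. LOCK_EX
    // serialises concurrent writers; FILE_APPEND avoids read-modify-write.
    $line = implode("\t", array(
        isset($_SERVER['REMOTE_ADDR'])  ? $_SERVER['REMOTE_ADDR']  : '-',
        time(),
        isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '-',
    )) . "\n";

    file_put_contents(
        '/var/log/mystats/hits-' . date('Y-m-d') . '.log',
        $line,
        FILE_APPEND | LOCK_EX
    );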

Max Kosyakov
A simple lock-append-unlock file operation does not solve the problem of log aggregation between multiple servers, nor is it "instant". MySQL would not "lead to swapping" unless it was incorrectly tuned. Moreover, at the load levels the OP describes, you will not be able to "store years of logs" even with modern cheap server-grade discs.
MarkR
+1  A: 

For the kind of activity you're looking at, you need to look at the problem from a new point of view: decoupling. That is, you need to figure out how to decouple the data-recording steps so that delays and problems don't propagate back up the line.

You have the right idea in logging hits to a database table, insofar as that guarantees in-order, non-contended access. This is something the database provides. Unfortunately, it comes at a price, one of which is that the database completes the INSERT before getting back to you. Thus the recording of the hit is coupled with the invocation of the hit. Any delay in recording the hit will slow the invocation.

MySQL offers a way to decouple that; it's called INSERT DELAYED. In effect, you tell the database "insert this row, but I can't stick around while you do it" and the database says "okay, I got your row, I'll insert it when I have a minute". It is conceivable that this reduces locking issues because it lets one thread in MySQL do the insert, not whichever one you connect on. Unfortunately, it only works with MyISAM tables.
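
A quick sketch of what that looks like from PHP, assuming a MyISAM hits table with made-up columns:

    <?php
    // INSERT DELAYED hands the row to the server and returns straight away;
    // MySQL queues it and writes it when the (MyISAM) table is free.
    $mysqli = new mysqli('localhost', 'user', 'pass', 'stats');

    $ip  = $mysqli->real_escape_string($_SERVER['REMOTE_ADDR']);
    $ref = $mysqli->real_escape_string(
        isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : ''
    );

    $mysqli->query(
        "INSERT DELAYED INTO raw_hits (ip, referer, hit_time)
         VALUES ('$ip', '$ref', NOW())"
    );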

Another solution, which is a more general solution to the problem, is to have a logging daemon that accepts your logging information and just en-queues it to wherever it has to go. The trick to making this fast is the en-queueing step. This is the sort of solution syslogd would provide.
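
Here's a very stripped-down sketch of such a daemon in PHP; the socket path, table, line format and batch size are all made up for illustration:

    <?php
    // Minimal logging daemon sketch: listen on a Unix socket, collect
    // newline-delimited hit records, and flush them to MySQL in batches.
    $server = stream_socket_server('unix:///tmp/hitlog.sock', $errno, $errstr);
    $pdo    = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');
    $batch  = array();

    while ($conn = @stream_socket_accept($server, -1)) {
        while (($line = fgets($conn)) !== false) {
            // Each line: ip \t unix-timestamp \t referer
            $batch[] = array_pad(explode("\t", rtrim($line, "\n"), 3), 3, '');
            if (count($batch) >= 500) {
                flush_batch($pdo, $batch);   // multi-row INSERT
                $batch = array();
            }
        }
        fclose($conn);
        if ($batch) {                        // flush whatever is left over
            flush_batch($pdo, $batch);
            $batch = array();
        }
    }

    function flush_batch(PDO $pdo, array $rows) {
        $placeholders = rtrim(
            str_repeat('(?, FROM_UNIXTIME(?), ?),', count($rows)), ','
        );
        $stmt = $pdo->prepare(
            "INSERT INTO raw_hits (ip, hit_time, referer) VALUES $placeholders"
        );
        $params = array();
        foreach ($rows as $r) {
            array_push($params, $r[0], $r[1], $r[2]);
        }
        $stmt->execute($params);
    }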

staticsan
I like the logging daemon. It should do bulk inserts of lots of records at one time so that the database can be more efficient. Moreover, you should have somewhere to put the logs if the database is temporarily unavailable.
MarkR
Thanks for the great advice, I've never used "insert delayed" but it sounds like it would definitely help. Thank you so much for the tip. Also, what's the deal with this logging daemon? I've never used that before, nor do I know anything about daemons; any suggestions on where to learn more about using it?
John
Do 'man syslogd' to see how a logging daemon would work - because that's what syslogd is. But there's nothing stopping you from writing your own. It would need to listen on a socket for the data and have a way of queueing up the database inserts.
staticsan
A: 

You should be able to get that kind of volume quite easily, provided that you do some stuff sensibly. Here are some ideas.

  1. You will need to partition your audit table on a regular (hourly, daily?) basis, if nothing else so you can drop old partitions to manage space sensibly; DELETEing 10M rows is not cool. (See the rough sketch after this list.)
  2. Your web servers (as you will be running quite a large farm, right?) will probably want to do the inserts in large batches, asynchronously. You'll have a daemon process on each web server machine which reads flat-file logs and batches them up. This is important for InnoDB performance and to avoid auditing slowing down the web servers. Moreover, if your database is unavailable, your web servers need to continue servicing web requests and still have them audited (eventually).
  3. As you're collecting large volumes of data, some summarisation is going to be required in order to report on it at a sensible speed - how you do this is very much a matter of taste. Make sensible summaries.
  4. InnoDB engine tuning - you will need to tune the InnoDB engine quite significantly - in particular, have a look at the variables controlling its use of disc flushing. Writing out the log on each commit is not going to be cool (maybe unless it's on an SSD - if you need performance AND durability, consider an SSD for the logs) :) Ensure your buffer pool is big enough. Personally I'd use the InnoDB plugin and the file-per-table option, but you could also use MyISAM if you fully understand its characteristics and limitations.
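
To illustrate point 1, a rough sketch of daily range partitioning and dropping an old partition (table, partition names and dates are purely illustrative; partition maintenance would normally be scripted):

    <?php
    // Range-partition the audit table by day so old data can be dropped as a
    // whole partition instead of DELETEd row by row.
    $pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');

    $pdo->exec("
        CREATE TABLE raw_hits (
            site_id  INT UNSIGNED  NOT NULL,
            ip       VARBINARY(16) NOT NULL,
            referer  VARCHAR(255)  NULL,
            hit_time DATETIME      NOT NULL
        ) ENGINE=InnoDB
        PARTITION BY RANGE (TO_DAYS(hit_time)) (
            PARTITION p20100101 VALUES LESS THAN (TO_DAYS('2010-01-02')),
            PARTITION p20100102 VALUES LESS THAN (TO_DAYS('2010-01-03')),
            PARTITION pmax      VALUES LESS THAN MAXVALUE
        )
    ");

    // Retiring a day's data is a metadata operation, not millions of DELETEs:
    $pdo->exec("ALTER TABLE raw_hits DROP PARTITION p20100101");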

I'm not going to explain any of the above in more detail, because if you have the developer skills on your team to build an application of that scale, you'll either know what it means or be capable of finding it out.

Provided you don't have too many indexes, 1000 rows/sec is not unrealistic with your data sizes on modern hardware; we insert that many sometimes (and probably have a lot more indexes).

Remember to performance test it all on production-spec hardware (I don't really need to tell you this, right?).

MarkR
MarkR, thanks for the advice, but as I loosely stated in my question, I'm not that intelligent when it comes to this. I only know enough to make it work. After I use WHM/cPanel to create the DB, I use phpMyAdmin to set up the tables/columns, etc. You're way out of my league by a long shot, but again, thanks for your advice. I'll take the bits and pieces of it that I do understand and I guess use Google to learn more about InnoDB, tuning a MySQL engine, etc. Again, thanks a lot.
John
+1  A: 

In my opinion it's a good thing to stick to MySQL for registering the visits, because it provides tools to analyze your data. To decrease the load, I have the following suggestions.

  • Make a fast collecting table, with no indexes except the primary key, MyISAM, one row per hit.
  • Make a normalized data structure for the hits and move the records into it once a day.

This gives you a smaller performance hit for logging and a well indexed normalized structure for querying/analyzing.
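
A rough sketch of the once-a-day move (table names are hypothetical; the normalized target structure is up to you):

    <?php
    // Copy yesterday's rows from the fast collecting table into the
    // normalized tables, then remove them from the collecting table.
    $pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');

    $pdo->exec("
        INSERT INTO hits_normalized (site_id, ip, referer, hit_time)
        SELECT site_id, ip, referer, hit_time
        FROM   hits_collect
        WHERE  hit_time < CURDATE()
    ");

    $pdo->exec("DELETE FROM hits_collect WHERE hit_time < CURDATE()");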

Peter Smit
Is using MyISAM for the collecting table really a good idea if you have other tables with other engines?
MarkR
In my experience there is no extra overhead from using multiple engines at the same time, especially since the MyISAM engine is always loaded anyway.
Peter Smit
But you can't use RAM efficiently if you need to allocate some to the InnoDB buffer pool, some to the MyISAM key cache, and some to the OS cache.
MarkR
As far as I remember, the key cache is used for both InnoDB and MyISAM (but I could be wrong). On the other hand, why would it decrease efficiency? The log table index will always be in cache, so it doesn't really matter which cache. Both caches will exist anyway.
Peter Smit
MyISAM and InnoDB do not use the same key cache; they use independent ones. The log table index may not fit wholly in the cache - the more of it that does, the better. If you have 10M rows per day, your indexes may be a few GB per day. And you want some cache left over for previous days to benefit reporting/summarisation queries.
MarkR
So a "fast collecting table" is simply a table with no indexes? Do I even need to use a primary key? Also when I create my tables it looks like everything is MyISAM already..Again, I can make MySQL work but I'm not keen on all the terms.Thanks for the great advice. Also, what other engines are there? Is it worth my time to research or to look in to using others ones?
John