What is the best IO strategy for a high traffic web app that logs user behaviour on a website and where ALL of the traffic will result in an IO write? Would it be to write to a file and overnight do batch inserts to the database? Or to simply do an INSERT (or INSERT DELAYED) per request? I understand that to consider this problem properly much more detail about the architecture would be needed, but a nudge in the right direction would be much appreciated.

A: 

Write to a file and then load later. It's safer to be coupled to a filesystem than to a database. And the database is more likely to fail than your filesystem.

le dorfier
Wha??? I don't know what operating system you're on, but I know mine doesn't have an ACID compliant filesystem.
benjismith
unless you're looking at an advanced journaling/DB filesystem, yeah - writing to the DB is *much* better
warren
Are you commenters stuck on FAT? Linux FSes have journalled writes and file-lock support, as does NTFS.
Kent Fredric
Then ask yourself why webloggers write to the filesystem (and, you will notice, without noticeable impact on anything else). In most cases, if you are writing to a database, you're writing to both, because the database writes to the FS - most don't manage their own filesystem anymore.
le dorfier
Your file system is optimized for - guess what? - reads and writes, without the overhead of indexing, locking (other than file-level locks), etc.
le dorfier
I don't understand why this is voted down... For non-transactional write-only behavior, the FS is going to be much "safer" than a DB. My FS has a 99.99999% uptime, my DB only 99.99%. That said, writing to *shared* files might be a little less safe.
James Schek
@doofledorfer - which webloggers write to the filesystem? All the ones I've used/know of write into a DB.
warren
@James - your uptimes are impressive. Why does your DB not have a better uptime? Anything planning to take on the gobs of transactions being described should be more durable than the underlying storage, which will always have a tendency to fail.
warren
Web logs - well, Apache for one...
le dorfier
@doofledorfer - based on your term "weblog" - I understood you to mean a tool such as WordPress - a weblogging (or blog) environment
warren
@warren - I may have exaggerated slightly with an extra 9 or two on the FS. But while the DB is down for patching, upgrades, network outages, etc., the FS is still online. FS goes down, DB goes down too. I understand the OP to mean lots of "writes", but full "transactions" were not needed.
James Schek
A: 

My instinct would be to only use the database, avoiding direct filesystem IO at all costs. If you need to produce some filesystem artifact, then I'd use a nightly cron job (or something like it) to read DB records and write to the filesystem.

ALSO: Only use "INSERT DELAYED" in cases where you don't mind losing a few records in the event of a server crash or restart, because some records almost certainly WILL be lost.

benjismith
+1  A: 

By writing to the DB, you allow the RDBMS to decide when disk IO should happen - if you have enough RAM, for instance, it may be effectively caching all those inserts in memory, writing them to disk when there's a lighter load, or on some other scheduling mechanism.

Writing directly to the filesystem is going to be bandwidth-limited more so than writing to a DB which then writes to disk, precisely because the DB can - theoretically - write in more efficient sizes, contiguously, and at "convenient" times.

warren
File systems are designed specifically to optimize disk reads and writes without the overhead of indexing, record-locking, etc. Most databases these days are just a layer on top of the filesystem, so you're probably doing both with this advice.
le dorfier
Any insert to the database involves IO; think about the transaction log.
MichaelT
@MichaelT - I think you misread my answer... the DB will cache those writes until an opportune time to *actually* write them to disk
warren
Any insert to a relational DB is first written to disk (the transaction log). If it weren't, the DB couldn't guarantee that the insert would persist over a failure.
MichaelT
that's distinctly not true, @MichaelT - I've used RDBMSes that keep everything in a RAM cache first
warren
The world's most heavily marketed database, Oracle, can be configured to run in this way...
James Schek
and so can Postgres (I guess that's the one MichaelT refers to); just set fsync = off. That's the only sure way I know of to get your PG database killed (in case of a server crash), and it's probably similar with Oracle. If you want to make sure your DB is consistent, everything in a transaction is written to disk, or nothing.
pfote
A: 

The only problem with using the filesystem to back writes is how you extend the log.

A poorly implemented logger will have to open the entire file to append a line to the end of it. I witnessed one such case where the person logged to a file in reverse order, so that the most recent entries came out first, which required loading the entire file into memory, writing one line out to the new file, and then writing the original file contents after it.

This log eventually exceeded PHP's memory limit and, as such, bottlenecked the entire project.

If you do it properly, however, the filesystem reads/writes will go directly into the system cache and will only be flushed to disk every 10 or more seconds (depending on FS/OS settings), which has a negligible performance hit compared to writing to arbitrary memory addresses.

Oh yes, and whatever system you use, you'll need to think about concurrent log appending. If you use a database, a high insert load can cause deadlock conditions, and with files, you need to make sure that two concurrent writes won't cancel each other out.
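
A minimal sketch of such an appender, assuming a Unix-like OS (fcntl is not available on Windows); the log path and record format are placeholders, not anything from the question:

    import fcntl
    import json
    import time

    LOG_PATH = "/var/log/myapp/requests.log"  # hypothetical path

    def append_log(record: dict) -> None:
        line = json.dumps(record) + "\n"
        # Append mode: every write lands at the current end of file,
        # so the existing contents are never read or rewritten.
        with open(LOG_PATH, "a", encoding="utf-8") as f:
            # Advisory lock so two concurrent requests can't interleave
            # partial lines in the same file.
            fcntl.flock(f, fcntl.LOCK_EX)
            try:
                f.write(line)
                f.flush()
            finally:
                fcntl.flock(f, fcntl.LOCK_UN)

    append_log({"ts": time.time(), "method": "GET", "path": "/home"})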

Kent Fredric
A: 

The insertions will generally impact the (read/update) performance of the table. Perhaps you can do the writes to another table (or database) and have a batch job that processes this data. The advantage of the database approach is that you can query/report on the data, all of it is logically in a relational database, and it may be easier to work with. Depending on how the data is logged to a text file, you could open up more possibilities for corruption.

BrianLy
A: 

There's an easier way to answer this. Profile the performance of the two solutions.

Create one page that performs the DB insert, another that writes to a file, and another that does neither. Otherwise, the pages should be identical. Hit each page with a load tester (JMeter for example) and see what the performance impact is.
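
As a rough, self-contained starting point (not a substitute for a real load test), something like the following sketch times the two strategies; sqlite3 and a local file merely stand in for your actual RDBMS and log target, so the numbers only mean something against your own stack and under a tool such as JMeter:

    import sqlite3
    import time

    N = 10_000

    def bench(label, fn):
        start = time.perf_counter()
        for i in range(N):
            fn(i)
        print(f"{label}: {time.perf_counter() - start:.3f}s for {N} writes")

    conn = sqlite3.connect("bench.db")
    conn.execute("CREATE TABLE IF NOT EXISTS hits (id INTEGER, path TEXT)")

    def db_insert(i):
        conn.execute("INSERT INTO hits VALUES (?, ?)", (i, "/home"))
        conn.commit()  # worst case: one transaction per request

    log_file = open("bench.log", "a")

    def file_append(i):
        log_file.write(f"{i} /home\n")
        log_file.flush()

    bench("per-request INSERT", db_insert)
    bench("file append", file_append)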

If you don't like the performance numbers, you can easily tweak each page to try and optimize performance a bit or try new solutions... everything from using MSMQ backed by MSSQL to delayed inserts to shared logs to individual files with a DB background worker.

That will give you a solid basis to make this decision rather than depending on speculation from others. It may turn out that none of the proposed solutions are viable or that all of them are viable...

James Schek
+1  A: 

I've done this on a recent app. Inserts are generally pretty cheap (especially if you put them into an unindexed hopper table). I think that you have a couple of options.

  1. As above, write data to a hopper table; if whatever application framework you use supports batched inserts, use them, as it will speed things up. Then every x requests, do a merge (via an SP call) into a master table, where you can normalize off data that has low entropy. For example, if you are storing the HTTP method of the request (GET/POST/etc.), this can only ever be one of a few values, so it's better stored as an int, giving improved I/O and query performance. Your master tables can also be indexed as you normally would (a sketch of this approach follows the list).

  2. If this isn't good enough, you can stream the requests to files on the local filesystem, and then have an out-of-band process (i.e. separate from the webserver) pick these files up and BCP them into the database. This comes at the expense of more moving parts and, potentially, a greater delay between receiving requests and them finding their way into the database.
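
A minimal sketch of option 1, assuming a MySQL-style driver (mysql.connector here); the hits_hopper table, merge_hits procedure and connection details are hypothetical:

    import mysql.connector

    conn = mysql.connector.connect(host="localhost", database="weblog",
                                   user="app", password="secret")
    cur = conn.cursor()

    # Low-entropy data stored as small ints rather than text (point 1).
    METHOD_CODES = {"GET": 1, "POST": 2, "HEAD": 3}

    buffer = []

    def log_request(method, path):
        buffer.append((METHOD_CODES.get(method, 0), path))
        if len(buffer) >= 500:   # every x requests...
            # ...batch-insert into the unindexed hopper table...
            cur.executemany(
                "INSERT INTO hits_hopper (method, path) VALUES (%s, %s)",
                buffer)
            conn.commit()
            buffer.clear()
            # ...then merge the hopper into the indexed master table.
            cur.callproc("merge_hits")
            conn.commit()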

Hope this helps, Ace

phatmanace
A: 

Hello from left field, but no one asked (and you didn't specify): how important is it that you never, ever lose data?

If speed is the problem, leave it all in memory, and dump to the database in batches.

Dean J
+1  A: 

When working with an RDBMS the most important thing is optimizing write operations to disk. Something somewhere has got to flush() to persistent storage (disk drives) to complete each transaction, which is VERY expensive and time-consuming. Minimizing the number of transactions and maximizing the number of sequential pages written is key to performance.

If you are doing inserts, sending them in bulk within a single transaction will lead to more efficient write behavior on disk, reducing the number of flush operations.

My recommendation is to queue the messages and periodically (say, every 15 seconds or so) start a transaction, send all queued inserts, and commit the transaction.
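
A minimal sketch of that queue-and-flush pattern; sqlite3 stands in for the real RDBMS, and the table name and interval are illustrative:

    import queue
    import sqlite3
    import threading
    import time

    pending = queue.Queue()

    def log_request(path, status):
        pending.put((path, status))   # cheap for the request thread

    def flusher():
        conn = sqlite3.connect("weblog.db")   # owned by this thread only
        conn.execute(
            "CREATE TABLE IF NOT EXISTS hits (path TEXT, status INTEGER)")
        while True:
            time.sleep(15)
            rows = []
            while not pending.empty():
                rows.append(pending.get_nowait())
            if rows:
                # One transaction, many rows: one flush to persistent
                # storage instead of one per insert.
                with conn:
                    conn.executemany("INSERT INTO hits VALUES (?, ?)", rows)

    threading.Thread(target=flusher, daemon=True).start()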

If your database supports sending multiple log entries in a single request/command, doing so can have a noticeable effect on performance when there is some network latency between the application and the RDBMS, by reducing the number of round trips.

Some systems support bulk operations (BCP) providing a very efficient method for bulk loading data, which can be faster than the use of "insert" queries.

Sparing use of indexes and selection of sequential primary keys help.

Making sure multiple instances either coordinate write operations or write to separate tables can improve throughput in some instances by reducing concurrency management overhead in the database.

Einstein
A: 

Do you log more than what would be available in the webserver logs? It can be quite a lot; see the Apache 2.0 log information for example.

If not, then you can use the good old technique of buffering followed by batch writing. You can buffer at different places: in memory on your server, then batch insert into the DB or batch write to a file every X requests and/or every X seconds.

If you use MySQL, there are several different options/techniques to load a lot of data efficiently: LOAD DATA INFILE, INSERT DELAYED and so on.

Lots of details on insertion speeds.
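
To make the buffer-then-bulk-load idea concrete, here is a rough sketch assuming MySQL and the mysql.connector driver; the spool file, table and column names are placeholders, and LOAD DATA LOCAL INFILE must be enabled on both client and server:

    import csv
    import mysql.connector

    BUFFER_PATH = "/tmp/hits_buffer.csv"   # hypothetical spool file

    def spool(rows):
        # rows: iterable of (ts, path, status) tuples buffered in memory
        with open(BUFFER_PATH, "a", newline="") as f:
            csv.writer(f).writerows(rows)

    def bulk_load():
        conn = mysql.connector.connect(host="localhost", database="weblog",
                                       user="app", password="secret",
                                       allow_local_infile=True)
        cur = conn.cursor()
        cur.execute(
            f"LOAD DATA LOCAL INFILE '{BUFFER_PATH}' INTO TABLE hits "
            "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' "
            "(ts, path, status)")
        conn.commit()
        # After a successful load you would truncate or rotate the spool file.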

Some other tips include:

  • splitting data into different tables per period of time (i.e. per day or per week)
  • using multiple db connections
  • using multiple db servers
  • having good hardware (SSD/multicore)

Depending on the scale and the resources available, it is possible to go different ways. So if you give more details, I can give more specific advice.

elhoim
A: 

If you do not need to wait for a response such as a generated ID, you may want to adopt an asynchronous strategy using either a message queue or a thread manager.
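
A minimal sketch of the thread-manager flavour of this idea (sqlite3 as a stand-in; a real app would pool connections or batch as in the other answers): the request handler just hands the write off and returns without waiting.

    from concurrent.futures import ThreadPoolExecutor
    import sqlite3

    pool = ThreadPoolExecutor(max_workers=2)

    def insert_hit(path):
        # Each task opens its own connection; fine for a sketch only.
        conn = sqlite3.connect("weblog.db")
        try:
            conn.execute("CREATE TABLE IF NOT EXISTS hits (path TEXT)")
            conn.execute("INSERT INTO hits (path) VALUES (?)", (path,))
            conn.commit()
        finally:
            conn.close()

    def handle_request(path):
        pool.submit(insert_hit, path)   # fire and forget: don't wait on the result
        return "OK"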

kls