I have a database-driven website serving about 50,000 pages.

I want to track each webpage/record hit. I will do this by creating logs, and then batch processing the logs once a day. I am not concerned with how I will do the batch process, only with the quickest way to log.

How would you log it, and which do you think is quicker:

a) Use PHP to append to the end of a text log file.

b) Use MySQL to INSERT INTO a non-indexed log table.
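
For reference, the two options boil down to roughly the following (a sketch only; the file path, table and column names are placeholders, and the DB version assumes a MySQL connection is already open):

// (a) append a line to a flat text log file
$line = date('c') . "\t" . $_SERVER['REQUEST_URI'] . "\n";
file_put_contents('/var/log/site/hits.log', $line, FILE_APPEND);

// (b) insert a row into a non-indexed MySQL log table (assumes an open connection)
mysql_query("INSERT INTO hits_log (hit_time, url) VALUES (NOW(), '"
            . mysql_real_escape_string($_SERVER['REQUEST_URI']) . "')");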

+2  A: 

I believe a flat file will be faster to write to.

Robert Rouse
A: 

Writing to a file will be quicker, but writing to the database will be better.

vaske
+2  A: 

Use a database - it is the only sane option, even if it takes a little longer. Once you start with log files you are on a track that will cause you pain: moving servers, file permissions, it precludes load balancing, etc.

If you've already got the database connection open, I reckon it would probably be quicker to insert a single row.

However, as with anything performance-related, the only way to be sure is to write a simple test and measure it...

Update: I've done a quick test, and sure enough, if you have to open and close the file it's about the same speed or slower, using a test of 10,000 lines:

However, when you start to have multiple processes doing this, it slows down, as can be seen below. This is with 10 concurrent processes (all timings in seconds):

DB time: 2.1695
DB time: 2.3869
DB time: 2.4305
DB time: 2.5864
DB time: 2.7465
DB time: 3.0182
DB time: 3.1451
DB time: 3.3298
DB time: 3.4483
DB time: 3.7812
File open time: 0.1538
File open time: 0.5478
File open time: 0.7252
File open time: 3.0453
File open time: 4.2661
File open time: 4.4247
File open time: 4.5484
File open time: 4.6319
File open time: 4.6501
File open time: 4.6646
Open close file time: 11.3647
Open close file time: 12.2849
Open close file time: 18.4093
Open close file time: 18.4202
Open close file time: 21.2621
Open close file time: 22.7267
Open close file time: 23.4597
Open close file time: 25.6293
Open close file time: 26.1119
Open close file time: 29.1471

// Prints (and appends to dbg.txt) the time elapsed since the previous call.
function debug($d)
{
    static $start_time = NULL;

    if( $start_time === NULL )
    {
        $start_time = microtime(true);
        return 0;
    }

    $elapsed = microtime(true) - $start_time;

    printf("$d time: %.4f\n", $elapsed);
    $fp = @fopen('dbg.txt','a');
    fprintf($fp,"$d time: %.4f\n", $elapsed);
    fclose($fp);

    $start_time = microtime(true);
}

// Open the log file once, write 10,000 lines, then close it.
function tfile()
{
    $fp = @fopen('t1.txt','a');
    for ($i=0;$i<10000;$i++)
    {
        $txt = $i."How would you log, which do you think is quicker:How would you log, which do you think is quicker:";
        fwrite($fp,$txt);
    }
    fclose($fp);
}
// Open and close the log file for every single line written.
function tfile_openclose()
{
    for ($i=0;$i<10000;$i++)
    {
        $fp = @fopen('t1.txt','a');
        $txt = $i."How would you log, which do you think is quicker:How would you log, which do you think is quicker:";
        fwrite($fp,$txt);
        fclose($fp);
    }
}

// Insert 10,000 rows into a non-indexed log table over a single connection.
function tdb()
{
    $db = mysql_connect('localhost','tremweb','zzxxcc');

    $select_db = mysql_select_db('scratch');

    if (!$select_db)
        die('Error selecting database.');

    for ($i=0;$i<10000;$i++)
    {
        $txt = $i."How would you log, which do you think is quicker:How would you log, which do you think is quicker:";
        mysql_query("INSERT INTO tlog values('".$txt."')");
    }
}

debug("");

tfile();
debug("File open");

tfile_openclose();
debug("Open close file");

tdb();
debug("DB");
Richard Harrison
This is why different processes should write to different files... Aggregate later. All you have to worry about with opening/closing is buffering the logs while you rotate files at logging intervals (once an hour is probably OK for this kind of load).
SquareCog
Indeed - however I wouldn't write to a log file at all. Once you start doing things like having to aggregate and rotate, it becomes the sort of process that will go wrong, and for little benefit in terms of CPU in the first case. Databases are good at storing and retrieving data...
Richard Harrison
OP here: this seems convincing to me, coupled with Delayed Insert.
A: 

I'd recommend you test both with a few test cases.

I would assume a flat file would be faster, because that's really what the DB is doing - it's just writing it to a file. The only advantage I can think of is that if the database can handle the writes concurrently, you might get better results.

+4  A: 

I would use a Delayed Insert into MySQL. This way you don't have to wait for the insert to finish.
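
A minimal sketch of what that might look like from PHP (the table and column names are assumptions, an open MySQL connection is assumed, and note that MySQL only supports DELAYED for certain storage engines such as MyISAM):

// Fire-and-forget logging: the statement returns immediately and MySQL
// queues the row to be written when the table is free.
mysql_query("INSERT DELAYED INTO hits_log (hit_time, url) VALUES (NOW(), '"
            . mysql_real_escape_string($_SERVER['REQUEST_URI']) . "')");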

Eric Hogue
Interesting - I'd not heard of these Delayed Inserts; they seem perfect for logging.
Except there is no guarantee they will actually insert. Most of the time they will... but since this is logging, you probably want to know about the exceptional cases even more than the regular ones!
SquareCog
It's true that he won't know if the insert fails. But if the logs are just for statistics on page views and he already has a db connection, delayed inserts should be good. If the logs are critical for him, then it's not a good solution.
Eric Hogue
OP here, yes, in my case it is not important if an INSERT fails.
Fair enough, it's a good solution for you then, especially due to ease of implementation. You lose some flexibility and reliability, but you save a ton on development time and complexity. Upvoted! :-)
SquareCog
A: 

It all depends on your infrastructure and limitations. If the disk is slow, writing will be slow. If the SQL server is lagging under the requests, the insert will be slow. A flat file is probably the best way to go, but I would write your code (or use existing code such as PEAR::Log) so you can change the provider and storage method at will.
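
A rough sketch of the PEAR::Log approach (assuming the package is installed; the handler type and file path are placeholders, and the point is that the 'file' handler can later be swapped for a database handler without touching the callers):

require_once 'Log.php';

// 'file' handler today; change the handler string to change the storage backend.
$logger = Log::singleton('file', '/var/log/site/hits.log', 'hits');
$logger->log($_SERVER['REQUEST_URI'], PEAR_LOG_INFO);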

Abyss Knight
+4  A: 

You could try both ways using log4php, which supports:

  • Configuration through xml and properties file (same structure as log4j).
  • File, RollingFile, DailyFile, Echo, Console, Mail, PEAR::Db, PHP error, Syslog or NT events and Socket appenders.
  • Simple, TTCC, Pattern, Html and Xml Layouts.
  • Nested (NDC) and Mapped (MDC) Diagnostic Contexts.
  • Switchable internal debug.

Regarding logging into a file, you could improve performance by buffering the write requests.
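
A minimal log4php sketch (the exact bootstrap differs between log4php versions; the configuration path and logger name here are assumptions, and the appender choice plus any write buffering would live in that configuration file):

require_once 'log4php/Logger.php';

// Appender type (file, db, syslog, ...) and buffering are set in the config,
// so the logging call itself never changes.
Logger::configure('/path/to/log4php.xml');
$log = Logger::getLogger('hits');
$log->info($_SERVER['REQUEST_URI']);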

alexandrul
+10  A: 
  1. Write to file. Rotate logs.

  2. Batch load the file to the database on a scheduled basis.

There are many, many reasons to choose this architecture: ease of scaling (write to many logs, load them into the db), no reliance on the database as a single point of failure (if something goes wrong, you just accumulate logs for a while), the ability to do cleaning and non-trivial parsing at load time without burdening your production servers, and more.
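
As a rough sketch of step 2, and assuming a tab-delimited log file plus a hits_log(hit_time, url) table (all names made up here), the nightly batch job could be little more than a LOAD DATA INFILE:

// Nightly cron job: bulk-load yesterday's rotated log file into MySQL.
// Requires the FILE privilege and a file readable by the MySQL server.
$file = '/var/log/site/hits-' . date('Y-m-d', strtotime('yesterday')) . '.log';
mysql_query("LOAD DATA INFILE '" . mysql_real_escape_string($file) . "'
             INTO TABLE hits_log
             FIELDS TERMINATED BY '\\t'
             (hit_time, url)");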

SquareCog
A: 

A few considerations:

  1. Do you think you'll want to join log data with other data in the database? If so, the overhead of a db insert is likely justified so existing relationships can be easily leveraged.
  2. Would logging the data in the database allow you to greatly reduce the amount of data you're logging (due to existing relationships in the db)? For example, a log in the database of user activity could simply be a table containing a userid, activityid, and a timestamp (see the sketch after this list). A log this lean in a flat file wouldn't be human readable. Depending on your needs, you'd need to capture at least some of the user's data in the log file to assure it can be useful and human readable on its own.
  3. Any chance you'll want to leverage this log data in the front end or via an admin tool down the road? If so, DB write is likely preferable.
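
For illustration, the lean activity log described in point 2 might be nothing more than this (table and column names are assumptions):

// Assumed schema for the lean, relational activity log described in point 2.
mysql_query("CREATE TABLE user_activity_log (
                 user_id     INT UNSIGNED NOT NULL,
                 activity_id INT UNSIGNED NOT NULL,
                 logged_at   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
             )");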
Cory House
A: 

As others mentioned, it depends on lots of things such as traffic, disk speed, etc. You'll have to test both scenarios.

While testing MySQL, try both MyISAM and InnoDB. In theory, InnoDB will perform better as it has row-level locking.
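
If you do benchmark, it helps to create the same (assumed) log table once per engine so that only the storage engine differs:

// Identical assumed log tables, differing only in storage engine.
mysql_query("CREATE TABLE hits_log_myisam (hit_time DATETIME, url VARCHAR(255)) ENGINE=MyISAM");
mysql_query("CREATE TABLE hits_log_innodb (hit_time DATETIME, url VARCHAR(255)) ENGINE=InnoDB");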

Gaurav
A: 

I've done something similar. I log each record to a separate file, then I have a batch process that grabs the files, puts them into a tar file and uploads them to the central log server (in my case, S3 :)).

I generate random file names for each log entry. I do this to avoid locking files for rotation. It's really easy to archive/delete this way.

I use JSON as my log format instead of the typical whitespace-delimited log files. This makes it easier to parse and to add fields in the future. It also means it's easier for me to write one entry per file than to append multiple records per file.
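
A rough sketch of the one-JSON-entry-per-file approach (the spool directory and field names are assumptions):

// One JSON entry per uniquely named file: no locking, trivial to archive or delete.
$entry = array(
    'time' => date('c'),
    'url'  => $_SERVER['REQUEST_URI'],
);
$name = '/var/log/site/spool/' . uniqid('hit_', true) . '.json';
file_put_contents($name, json_encode($entry));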

I've also used log4php + syslog-ng to centralize logging in real time. I have log4php log to syslog, which then forwards the logs to my central server. This is really useful on larger clusters. One caveat is that there's a length limit on syslog messages, so you risk longer messages being truncated.

Gary Richardson
A: 

If this is for a database-driven site, why aren't you just using the built-in logging capabilities of Apache or IIS, together with a suitable reporting tool such as AWStats? And beyond that, there's always Google Analytics.

AWStats and web server logging is my preference - you essentially get it for free anyway. Even if you're not after traffic analysis, you could still consider parsing the Apache access log file yourself for whatever batch processing you need to do.
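
If you go that route, a rough sketch of parsing the access log yourself might look like this (the log path is an assumption, and the regex only pulls the request URL out of a common/combined log format line):

// Walk the Apache access log and count hits per URL.
$hits = array();
$fp = fopen('/var/log/apache2/access.log', 'r');
while (($line = fgets($fp)) !== false)
{
    // e.g.: 1.2.3.4 - - [10/Oct/2008:13:55:36 -0700] "GET /page/42 HTTP/1.1" 200 2326
    if (preg_match('/"(?:GET|POST) (\S+)/', $line, $m))
    {
        $url = $m[1];
        $hits[$url] = isset($hits[$url]) ? $hits[$url] + 1 : 1;
    }
}
fclose($fp);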

iAn
+2  A: 

You should try SQLite. It gives you both the speed of writing to a file and the power of a database.
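
A minimal sketch using PDO's SQLite driver (the database path and schema are assumptions):

// SQLite: a local file on disk, but queried like a database.
$db = new PDO('sqlite:/var/log/site/hits.sqlite');
$db->exec('CREATE TABLE IF NOT EXISTS hits (hit_time TEXT, url TEXT)');
$stmt = $db->prepare('INSERT INTO hits (hit_time, url) VALUES (?, ?)');
$stmt->execute(array(date('c'), $_SERVER['REQUEST_URI']));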

fireweasel
I've used SQLite a few times for similar things - instances where I want to have something in a database, but not the system database - and it's a really neat solution in many ways, one that deserves to be more widely known.
Cruachan
+1  A: 

I read an article in the C++ Users Journal, years ago, about logging performance. Whether you use a DB or files, the best thing to do is write unformatted data that can be "inflated" into meaningful data when (and, more likely, if) you need to view the logs. The vast majority of the cost of logging is in formatting the strings that are written to the destination, and most of the time that cost is wasted - the logs are never read.

I can dig out the article reference if it's useful to you.
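
A rough sketch of that idea: the hot path writes only raw, fixed-width values, and all formatting happens in a reader that you run only when you actually need the logs ($page_id and the file path are placeholders here):

// Hot path: write raw binary, no string formatting at all.
$record = pack('NN', time(), $page_id);   // 4-byte timestamp + 4-byte page id
file_put_contents('/var/log/site/hits.bin', $record, FILE_APPEND);

// Reader ("inflate" step), run only when the logs are actually wanted:
$data = file_get_contents('/var/log/site/hits.bin');
foreach (str_split($data, 8) as $chunk)
{
    $r = unpack('Ntime/Npage', $chunk);
    printf("%s page %d\n", date('c', $r['time']), $r['page']);
}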

endian
A: 

Whether you use file-based logging or database-based logging, your biggest performance hit will be file/table locking. Basically, if client A and client B connect within a relatively small time frame, client B is stuck waiting for the lock on the hits file/table to be released before continuing.

The problem with a file-based mechanism is that file locking is essential to ensure that your hits don't get corrupted. The only way around that is to implement a queue that does a delayed write to the file.
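
For completeness, the usual way to make the plain-file append safe is an exclusive lock, which is exactly where the contention described above comes from (the path is an assumption):

// Each request blocks until it holds the lock: safe, but serialised.
file_put_contents('/var/log/site/hits.log',
                  date('c') . "\t" . $_SERVER['REQUEST_URI'] . "\n",
                  FILE_APPEND | LOCK_EX);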

With database logging, you can at least do the following [MySQL using MyISAM]:

INSERT DELAYED INTO `hits` ...

See 12.2.5.2. INSERT DELAYED Syntax for more information.

Andrew Moore