views:

41

answers:

3

We have an app that creates text log files of requests to it. Pretty typical stuff in the log file and it is space delimited (date, time, url, http code, ip, user agent, etc).

Currently, we are generating around 500k entries in the text log files per day.

We're currently doing a lot of analysis via sed/awk/grep of the text files. However, that isn't really going to scale especially as we want to start reporting across multiple days:

e.g. - How many times did this IP address hit this URL in the last 5 days - What % of requests resulted in 500s for specific URL's

It's easy enough to do regular imports into a mysql db and pull this type of data with select/group-bys. However, even with a few hundred thousand rows, the queries are relatively slow.

I'm a n00b when it comes to some of the new no-sql dbs out there (Casandra, Dynamo, BigTable) but would any of them be well suited for this? I'm continuing reading up on them but maybe this crew had some recommendations.

Thanks!

A: 

Be aware that most RDBMS have bulk insertion functionality - MySQL's for instance

INSERT INTO YOUR_TABLE
VALUES
  (value, value2, value3),
  (value4, value5, value6),
  (value7, value8, value9);
OMG Ponies
+1  A: 

We've had a similar problem at work and managed to solve it by dumping the data into a column based database. These kinds of databases are much better at analytical queries of the kind you're describing. There are several options:

http://en.wikipedia.org/wiki/Column-oriented_DBMS

We've had good experience with InfiniDB:

http://infinidb.org/

Using this approach we managed to speed up the queries by approx. 10x, however is not a silver bullet and eventually you'll run into the same problems again.

You might also want to look at partitioning your data to improve performance.

srkiNZ84
A: 

There are a couple of reasons why I wouldn't necessarily look right away to a NoSQL solution:

  • Yours is a known schema which sounds like it won't be changing.

  • There doesn't seem to be a lot of denormalizing potential for you, as you've pretty much got a single flat table structure.

  • You haven't made any reference to application scalability (# of users), just the size of the query.

And those are three of the big 'wins' for NoSQL as I know it.

That being said, I'm no expert, and I don't know for sure that it wouldn't make for faster reads, so it's definitely worth a try!

LesterDove
Good analysis and breakdown. Thanks! I'll give @srkiNZ84 suggestion of infinidb a shot and see where that takes us.
whatupwilly