I have a program that receives real-time data on 1000 topics. It receives -- on average -- 5000 messages per second. Each message consists of two strings: a topic and a message value. I'd like to save these strings along with a timestamp indicating the message arrival time.

I'm using 32 bit Windows XP on 'Core 2' hardware and programming in C#.

I'd like to save this data into 1000 files -- one for each topic. I know many people will want to tell me to save the data into a database, but I don't want to go down that road.

I've considered a few approaches:

1) Open up 1000 files and write into each one as the data arrives. I have two concerns about this. I don't know if it is possible to open up 1000 files simultaneously, and I don't know what effect this will have on disk fragmentation.

2) Write into one file and -- somehow -- process it later to produce 1000 files.

3) Keep it all in RAM until the end of the day and then write one file at a time. I think this would work well if I have enough RAM, although I might need to move to 64 bit to get over the 2 GB limit.

How would you approach this problem?

+2  A: 

I'd like to explore a bit more why you don't want to use a DB - they're GREAT at things like this! But on to your options...

  1. 1000 open file handles doesn't sound good. Forget disk fragmentation - O/S resources will suck.
  2. This is close to db-ish-ness! Also sounds like more trouble than it's worth.
  3. RAM = volatile. You spend all day accumulating data and have a power outage at 5pm.

How would I approach this? DB! Because then I can query, index, analyze, etc. etc.

:)

n8wrl
Can a DB handle 5000 messages per second? Also, I've heard -- but don't really know -- that DBs are not very good at handling time series data.
Joe H
Sure! Lots of strategies to keep transaction rates high. Time-series data, meaning date+time data? Absolutely.
n8wrl
Actually, DBs are built for that. And they scale very well.
Pavels
Exactly - and they've figured out all the stuff you're asking about. How much to keep in RAM, how to write fast, high rates, etc. etc.
n8wrl
+1  A: 

I would make 2 separate programs: one to take the incoming requests, format them, and write them out to one single file, and another to read from that file and write the requests out to the individual topic files. Doing things this way allows you to minimize the number of file handles open while still handling the incoming requests in realtime. If you make the first program format its output correctly then processing it into the individual files should be simple.
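
For illustration, a minimal sketch of what the first program's write path might look like (the spool file name and the tab delimiter are arbitrary choices, and it assumes topics and messages contain no tabs or newlines):

    using System;
    using System.IO;

    class SpoolWriter
    {
        // Single append-only spool file; the second program reads this later
        // and fans the records out into the per-topic files.
        static readonly StreamWriter spool = new StreamWriter("incoming.spool", true);

        public static void Write(string topic, string message)
        {
            // One record per line: arrival time, topic, message, tab-delimited.
            spool.WriteLine("{0:yyyy-MM-dd HH:mm:ss.fff}\t{1}\t{2}",
                            DateTime.UtcNow, topic, message);
        }
    }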

Stephan
How would the second program work? Would it open 1000 simultaneous files? That's why I wrote 'somehow' for item '2'.
Joe H
I would just open and close files as needed. Little heavy on the resources from the OS, but not as heavy as having one process consume 1000 files simultaneously. It probably would be unable to process in realtime, but it shouldn't lag too far behind.
Stephan
+2  A: 

First calculate the bandwidth! 5000 messages/sec x 2 KB each = 10 MB/sec. Each minute: 600 MB. Well, you could drop that in RAM. Then flush each hour.

Edit: corrected mistake. Sorry, my bad.

Pavels
+1  A: 

I would look into purchasing a real-time data historian package. Something like a PI System or Wonderware Data Historian. I have tried to do things like this in files and a MS SQL database before and it didn't turn out well (it was a customer requirement and I wouldn't suggest it). These products have APIs, and they even have packages where you can make queries against the data just as if it were SQL.

It wouldn't let me post hyperlinks, so just Google those two products and you will find information on them.

EDIT

If you do use a database like most people are suggesting, I would recommend a table for each topic for historical data, and consider table partitioning, indexes, and how long you are going to store the data.

For example, if you are going to store a day's worth and it's one table for each topic, you are looking at 5 updates a second x 60 seconds in a minute x 60 minutes in an hour x 24 hours = 432,000 records a day per table. After exporting the data I would imagine that you would have to clear the data for the next day, which will cause a lock, so you will have to queue your writes to the database. Then, if you are going to rebuild the index so that you can do any querying on it, that will cause a schema modification lock, and you will need MS SQL Enterprise Edition for online index rebuilding. If you don't clear the data every day, you will have to make sure you have plenty of disk space to throw at it.

Basically, what I'm saying is: weigh the cost of purchasing a reliable product against building your own.

Kyle Sonaty
+1  A: 

I'd keep a buffer of the incoming messages, and periodically write the 1000 files sequentially on a separate thread.
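
A rough sketch of that idea, assuming a 60-second flush interval and per-topic List<string> buffers (both arbitrary choices), with the topic used as the file name:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Threading;

    class BufferedTopicWriter
    {
        // Per-topic in-memory buffers, flushed sequentially by a timer callback.
        private readonly Dictionary<string, List<string>> buffers =
            new Dictionary<string, List<string>>();
        private readonly object sync = new object();
        private readonly Timer flushTimer;

        public BufferedTopicWriter()
        {
            flushTimer = new Timer(_ => Flush(), null,
                                   TimeSpan.FromSeconds(60), TimeSpan.FromSeconds(60));
        }

        public void Add(string topic, string message)
        {
            string line = string.Format("{0:yyyy-MM-dd HH:mm:ss.fff}\t{1}",
                                        DateTime.UtcNow, message);
            lock (sync)
            {
                List<string> list;
                if (!buffers.TryGetValue(topic, out list))
                    buffers[topic] = list = new List<string>();
                list.Add(line);
            }
        }

        private void Flush()
        {
            Dictionary<string, List<string>> snapshot;
            lock (sync)
            {
                // Swap the buffers out so the receiving thread is blocked only briefly.
                snapshot = new Dictionary<string, List<string>>(buffers);
                buffers.Clear();
            }
            foreach (KeyValuePair<string, List<string>> kv in snapshot)
                using (StreamWriter w = new StreamWriter(kv.Key + ".log", true))
                    foreach (string line in kv.Value)
                        w.WriteLine(line);
        }
    }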

omellet
+8  A: 

I can't imagine why you wouldn't want to use a database for this. This is what they were built for. They're pretty good at it.

If you're not willing to go that route, storing them in RAM and rotating them out to disk every hour might be an option but remember that if you trip over the power cable, you've lost a lot of data.

Seriously. Database it.

Edit: I should add that getting a robust, replicated and complete database-backed solution would take you less than a day if you had the hardware ready to go.

Doing this level of transaction protection in any other environment is going to take you weeks longer to set up and test.

Oli
+1 for pluckiness!
n8wrl
I still have concerns about whether a DB can handle an average of 5,000 transactions per second. I've got almost no experience with databases -- I've kept away from them because of unfamiliarity and concern about performance with large amounts of data. After storing data for six months, I'll have lots of tables... Every day, I'll create 1000 new tables (1 per topic), so in six months I'll have about 125,000 tables. Is this something that would perform well on one 32 bit machine?
Joe H
+2  A: 

I would agree with Kyle and go with a packaged product like PI. Be aware that PI is quite expensive.

If you're looking for a custom solution, I'd go with Stephan's, with some modifications. Have one server receive the messages and drop them into a queue. You can't use a file to hand off the messages to the other process, though, because you're going to have locking issues constantly. Probably use something like MSMQ (MS Message Queuing), but I'm not sure about its speed.

I would also recommend using a DB to store your data. You'll want to do bulk inserts of data into the DB, though, as I think you would need some hefty hardware to allow SQL Server to do 5000 transactions a second. You're better off doing a bulk insert of, say, every 10,000 messages that accumulate in the queue.
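
A rough sketch of the bulk-insert idea using SqlBulkCopy (the table name, column names and batch size here are made up for illustration):

    using System;
    using System.Data;
    using System.Data.SqlClient;

    class MessageBulkWriter
    {
        // Buffers rows in a DataTable and bulk-copies them once a batch has accumulated.
        private readonly DataTable batch = new DataTable();
        private readonly string connectionString;
        private const int BatchSize = 10000;

        public MessageBulkWriter(string connectionString)
        {
            this.connectionString = connectionString;
            batch.Columns.Add("ArrivalTime", typeof(DateTime));
            batch.Columns.Add("Topic", typeof(string));
            batch.Columns.Add("Message", typeof(string));
        }

        public void Add(DateTime arrivalTime, string topic, string message)
        {
            batch.Rows.Add(arrivalTime, topic, message);
            if (batch.Rows.Count >= BatchSize)
                Flush();
        }

        public void Flush()
        {
            if (batch.Rows.Count == 0) return;
            using (SqlBulkCopy copy = new SqlBulkCopy(connectionString))
            {
                copy.DestinationTableName = "dbo.Messages";   // assumed table name
                copy.WriteToServer(batch);
            }
            batch.Clear();
        }
    }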

DATA SIZES:

Average message ~50 bytes: smalldatetime = 4 bytes + topic (~10 characters, non-Unicode) = 10 bytes + message (~31 characters, non-Unicode) = 31 bytes.

50 * 5000 = ~244 KB/sec -> ~14 MB/min -> ~858 MB/hour

dilbert789
Each message is more like 20 to 30 bytes... so it's not as bad as you think.
Joe H
+4  A: 

Like n8wrl, I would also recommend a DB. But if you really dislike this idea ...

Let's find another solution ;-)

As a minimal first step, I would use two threads. The first is a worker thread, receiving all the data and putting each object (timestamp, two strings) into a queue.

Another thread will check this queue (maybe notified by an event or by checking the Count property). This thread will dequeue each object, open the specific file, write the entry, close the file and proceed to the next item.

I would start with this first approach and take a look at the performance. If it sucks, do some measuring to find where the problem is and try to fix it (e.g. put open files into a dictionary (name, StreamWriter), etc.).
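
A minimal sketch of that two-thread approach, using a Queue<T> guarded by a lock and Monitor.Pulse/Wait (the file naming is an arbitrary choice):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Threading;

    class QueuedFileWriter
    {
        struct Entry { public DateTime Stamp; public string Topic; public string Message; }

        private readonly Queue<Entry> queue = new Queue<Entry>();
        private readonly object sync = new object();

        public QueuedFileWriter()
        {
            // Consumer thread drains the queue and writes each entry to its topic file.
            new Thread(Consume) { IsBackground = true }.Start();
        }

        // Called by the receiving (worker) thread for every message.
        public void Enqueue(string topic, string message)
        {
            lock (sync)
            {
                queue.Enqueue(new Entry { Stamp = DateTime.UtcNow, Topic = topic, Message = message });
                Monitor.Pulse(sync);   // wake the consumer
            }
        }

        private void Consume()
        {
            while (true)
            {
                Entry e;
                lock (sync)
                {
                    while (queue.Count == 0)
                        Monitor.Wait(sync);
                    e = queue.Dequeue();
                }
                // Open-write-close per entry, as described above; if that proves too
                // slow, cache the writers in a Dictionary<string, StreamWriter>.
                using (StreamWriter w = new StreamWriter(e.Topic + ".log", true))
                    w.WriteLine("{0:yyyy-MM-dd HH:mm:ss.fff}\t{1}", e.Stamp, e.Message);
            }
        }
    }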

But on the other hand, a DB would be so fine for this problem... One table, four columns (id, timestamp, topic, message), one additional index on topic, and you're done.

Oliver
This would be opening and closing 5000 files per second -- maybe a little less if two subsequent messages had the same topic. I think this wouldn't work because the second thread could not keep up.
Joe H
+2  A: 

I agree with Oliver, but I'd suggest a modification: have 1000 queues, one for each topic/file. One thread receives the messages, timestamps them, then sticks them in the appropriate queue. The other simply rotates through the queues, seeing if they have data. If so, it reads the messages, then opens the corresponding file and writes the messages to it. After it closes the file, it moves to the next queue. One advantage of this is that you can add additional file-writing threads if one can't keep up with the traffic. I'd probably first try setting a write threshold, though (defer processing a queue until it's got N messages) to batch your writes. That way you don't get bogged down opening and closing a file to only write one or two messages.
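
A rough sketch of the per-topic queues with a write threshold (the threshold, file names and sleep interval are arbitrary choices):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Threading;

    class PerTopicBatcher
    {
        private const int WriteThreshold = 50;   // defer writing until a queue has this many messages

        // One queue per topic; the receiver thread appends, a writer thread
        // rotates through the queues and flushes any that reached the threshold.
        private readonly Dictionary<string, Queue<string>> queues =
            new Dictionary<string, Queue<string>>();
        private readonly object sync = new object();

        public void Receive(string topic, string message)
        {
            string line = string.Format("{0:yyyy-MM-dd HH:mm:ss.fff}\t{1}",
                                        DateTime.UtcNow, message);
            lock (sync)
            {
                Queue<string> q;
                if (!queues.TryGetValue(topic, out q))
                    queues[topic] = q = new Queue<string>();
                q.Enqueue(line);
            }
        }

        // Run on one (or more) dedicated writer threads.
        public void WriterLoop()
        {
            while (true)
            {
                foreach (string topic in TopicsToFlush())
                {
                    string[] lines;
                    lock (sync)
                    {
                        lines = queues[topic].ToArray();
                        queues[topic].Clear();
                    }
                    using (StreamWriter w = new StreamWriter(topic + ".log", true))
                        foreach (string line in lines)
                            w.WriteLine(line);
                }
                Thread.Sleep(100);   // brief pause before the next rotation
            }
        }

        private List<string> TopicsToFlush()
        {
            lock (sync)
            {
                List<string> ready = new List<string>();
                foreach (KeyValuePair<string, Queue<string>> kv in queues)
                    if (kv.Value.Count >= WriteThreshold)
                        ready.Add(kv.Key);
                return ready;
            }
        }
    }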

TMN
A: 

If you don't want to use a database (and I would, but assuming you don't), I'd write the records to a single file -- append operations are as fast as they can be -- and use a separate process/service to split the file up into the 1000 files. You could even roll the file over every X minutes, so that, for example, every 15 minutes you start a new file and the other process starts splitting the completed ones up into 1000 separate files.
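
A minimal sketch of the roll-over part, assuming a 15-minute interval and an arbitrary spool file naming scheme:

    using System;
    using System.IO;

    class RollingSpoolWriter
    {
        // Appends every record to a time-stamped spool file and starts a new one
        // every 15 minutes; the splitter process picks up the completed files.
        private static readonly TimeSpan RollInterval = TimeSpan.FromMinutes(15);
        private StreamWriter current;
        private DateTime currentStart;

        public void Write(string topic, string message)
        {
            DateTime now = DateTime.UtcNow;
            if (current == null || now - currentStart >= RollInterval)
                Roll(now);
            current.WriteLine("{0:yyyy-MM-dd HH:mm:ss.fff}\t{1}\t{2}", now, topic, message);
        }

        private void Roll(DateTime now)
        {
            if (current != null)
                current.Close();
            currentStart = now;
            current = new StreamWriter(
                string.Format("spool-{0:yyyyMMdd-HHmmss}.log", now), true);
        }
    }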

All this does beg the question of why not a DB, and why you need 1000 different files -- you may have a very good reason -- but then again, perhaps you should re-think your strategy and make sure it is sound reasoning before you go too far down this path.

EJB
+1  A: 

Perhaps you don't want the overhead of a DB install?

In that case, you could try a filesystem-based database like SQLite:

SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. SQLite is the most widely deployed SQL database engine in the world. The source code for SQLite is in the public domain.
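
For example, a minimal sketch assuming the System.Data.SQLite ADO.NET provider is installed (the table name and schema are made up for illustration); inserts are batched inside one transaction, since per-row transactions would be far too slow at this message rate:

    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.Data.SQLite;   // System.Data.SQLite ADO.NET provider (an assumption)

    class SqliteLogger
    {
        private readonly SQLiteConnection conn;

        public SqliteLogger(string path)
        {
            conn = new SQLiteConnection("Data Source=" + path);
            conn.Open();
            using (SQLiteCommand create = new SQLiteCommand(
                "CREATE TABLE IF NOT EXISTS messages (stamp TEXT, topic TEXT, message TEXT)", conn))
            {
                create.ExecuteNonQuery();
            }
        }

        // Insert a batch of (topic, message) pairs inside a single transaction.
        public void WriteBatch(IEnumerable<KeyValuePair<string, string>> topicMessagePairs)
        {
            using (SQLiteTransaction tx = conn.BeginTransaction())
            using (SQLiteCommand cmd = new SQLiteCommand(
                "INSERT INTO messages (stamp, topic, message) VALUES (@stamp, @topic, @message)", conn))
            {
                cmd.Transaction = tx;
                cmd.Parameters.Add("@stamp", DbType.String);
                cmd.Parameters.Add("@topic", DbType.String);
                cmd.Parameters.Add("@message", DbType.String);

                foreach (KeyValuePair<string, string> pair in topicMessagePairs)
                {
                    cmd.Parameters["@stamp"].Value = DateTime.UtcNow.ToString("yyyy-MM-dd HH:mm:ss.fff");
                    cmd.Parameters["@topic"].Value = pair.Key;
                    cmd.Parameters["@message"].Value = pair.Value;
                    cmd.ExecuteNonQuery();
                }
                tx.Commit();
            }
        }
    }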

Jeffrey Knight