I'm using a very simple architecture for one of my intranet enterprise applications.

Client:

  • 1 agent running on each computer that sends system config data (once) and reports (every 2 to 5 minutes) => the data flowing from client to server is a few hundred bytes and rarely exceeds a kB.

Server:

  • 1 web application (front-end to manage clients, view reports)
  • a web service to receive all incoming data (which it simply dumps in a table)
  • a system service to read the dump every few seconds and execute the relevant queries - inserts and updates on the actual tables used for reporting (this step is roughly comparable to ETL)

With thousands of clients sending data concurrently to the server, the server simply dumps this incoming data into a temporary table (one insert per client report). A system service running in the background keeps draining this temporary table: every 10 seconds it reads the top 100 rows from the dump table, organizes that data into the relevant reporting tables, removes those 100 rows from the dump, and so on.
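For illustration only, here is a minimal sketch of that read-organize-remove step, assuming a dump table named dbo.DumpTable with ClientId, Payload, and ReceivedAt columns (the actual schema isn't shown in the question). Using DELETE ... OUTPUT makes the "read top 100" and "remove those 100" steps a single atomic statement:

    // A minimal drain loop: every 10 seconds, pull up to 100 rows out of the
    // dump table and organize them into the reporting tables.
    using System;
    using System.Collections.Generic;
    using System.Data.SqlClient;
    using System.Threading;

    class DumpDrainer
    {
        // Placeholder connection string.
        const string ConnStr = "Data Source=.;Initial Catalog=Reports;Integrated Security=True";

        static void Main()
        {
            while (true)
            {
                DrainBatch();
                Thread.Sleep(TimeSpan.FromSeconds(10));
            }
        }

        static void DrainBatch()
        {
            var rows = new List<object[]>();

            using (var conn = new SqlConnection(ConnStr))
            {
                conn.Open();
                // DELETE ... OUTPUT returns the rows it removed, so the read and
                // the removal happen in one atomic statement.
                var cmd = new SqlCommand(
                    "DELETE TOP (100) FROM dbo.DumpTable " +
                    "OUTPUT DELETED.ClientId, DELETED.Payload, DELETED.ReceivedAt;", conn);
                using (SqlDataReader reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        rows.Add(new object[]
                        {
                            reader.GetInt32(0), reader.GetString(1), reader.GetDateTime(2)
                        });
                    }
                }
            }

            // The 'ETL' part: inserts/updates into the actual reporting tables.
            foreach (object[] row in rows)
            {
                ApplyToReportingTables((int)row[0], (string)row[1], (DateTime)row[2]);
            }
        }

        static void ApplyToReportingTables(int clientId, string payload, DateTime receivedAt)
        {
            // The question's reporting schema isn't shown, so this is left as a stub.
        }
    }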

So far I've run my app in a network of 2,000 computers and it seems to be working well. Now I need to scale this to support a network of 25,000 clients. I'm going to run simulation tests with 25,000 requests per second and check whether the architecture holds up.

The server side is .NET based: an ASP.NET web application as the front-end, a web service to dump the data, a .NET system service to perform the ETL, and SQL Server 2005/2008 as the database server.

Hope to get some constructive criticism and guidance from the stackoverflow community to improve this architecture. Do you feel it's good enough as it is to handle 25,000 clients with a single server? Which component do you think is most likely to break down as concurrent activity increases? Is it fundamentally flawed? All guidance welcome. Thanks.

+1  A: 

In the worst case, your system needs to churn through roughly 5,000-13,000 requests per minute. You need to compute the rough throughput of your system at 60-70% utilization (with, say, the current 2,000 clients). If the web service takes approximately 50 milliseconds per request, a single sequential worker can support at most 1,200 requests per minute (50 ms × 1,200 = 60 seconds). A similar calculation can be done for the .NET service. As load increases, throughput is likely to decrease, so the actual numbers will be lower.

Based on such calculations, you can decide whether you have to scale out. You can run your services on multiple servers so the load gets divided, and if the DB server becomes the bottleneck it can be clustered. The one thing you need to check is whether your implementation of the .NET service allows parallelism (IMO the web service should be stateless and scale without issues) - for example, do you need to insert records in the order you received them, etc.?
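As a rough illustration of that kind of back-of-the-envelope calculation (the 50 ms service time, 65% utilization, and 8 worker threads below are all assumed figures, not measurements):

    // Back-of-the-envelope capacity estimate with assumed figures.
    using System;

    class CapacityEstimate
    {
        static void Main()
        {
            double serviceTimeMs = 50;   // assumed time per request
            double utilization = 0.65;   // assumed 60-70% target utilization
            int workerThreads = 8;       // assumed degree of parallelism

            double perWorkerPerMinute = 60000 / serviceTimeMs; // 1,200 req/min for one sequential worker
            double estimatedCapacity = perWorkerPerMinute * workerThreads * utilization;

            Console.WriteLine("One sequential worker: " + perWorkerPerMinute + " requests/min");
            Console.WriteLine("Estimated capacity:    " + estimatedCapacity + " requests/min");
            Console.WriteLine("Worst-case demand:     ~12,500 requests/min (25,000 clients every 2 minutes)");
        }
    }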

VinayC
+1  A: 

Run the simulation and see how it holds up. What's likely to be a bottleneck is the network and possibly the disk I/O. In that case I can suggest a couple of things.

First off, I hope you're using UDP, not TCP?

Try making the service listen on multiple NICs, and run multiple instances of the app accessing the table. I don't know what database you're using, but SQLite would be perfect for this type of app... and it has some features that might help with performance without touching the disk too often.
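For what it's worth, here is a bare-bones sketch of one UDP receiver bound to a single NIC's address, so you can run one instance per card; the address, port, and handling below are placeholders:

    // One receiver instance bound to a single NIC's address; run one per card.
    using System;
    using System.Net;
    using System.Net.Sockets;
    using System.Text;

    class UdpReportListener
    {
        static void Main()
        {
            // Placeholder address and port for one particular NIC.
            var localEndPoint = new IPEndPoint(IPAddress.Parse("192.168.1.10"), 9000);
            using (var listener = new UdpClient(localEndPoint))
            {
                var remote = new IPEndPoint(IPAddress.Any, 0);
                while (true)
                {
                    byte[] datagram = listener.Receive(ref remote); // blocks until a report arrives
                    string report = Encoding.UTF8.GetString(datagram);
                    StoreReport(remote.Address, report);
                }
            }
        }

        static void StoreReport(IPAddress client, string report)
        {
            // Placeholder: write to the dump table or a local store here.
        }
    }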

Put lots of memory in your server.

Assuming all of that is done and it still doesn't perform:

The next step would be to have a series of intermediary servers, each collecting the results for several thousand clients and then forwarding them over a higher-speed link to the main server for processing. You might even be able to batch-send them to the main server with the data compressed over that link, or just SCP them over and import the results in a batch.
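A rough sketch of the batch-and-compress forwarding idea, assuming the intermediary simply POSTs a gzip-compressed batch of reports to a placeholder URL on the main server:

    // Hypothetical intermediary: buffer a few thousand client reports, then
    // forward them to the main server as one gzip-compressed HTTP POST.
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.IO.Compression;
    using System.Net;
    using System.Text;

    class IntermediaryForwarder
    {
        private readonly List<string> buffer = new List<string>();
        private const int BatchSize = 5000;                               // assumed batch size
        private const string MainServerUrl = "http://main-server/ingest"; // placeholder URL

        public void Add(string report)
        {
            string[] batch = null;
            lock (buffer)
            {
                buffer.Add(report);
                if (buffer.Count >= BatchSize)
                {
                    batch = buffer.ToArray();
                    buffer.Clear();
                }
            }
            if (batch != null)
                Forward(batch); // send outside the lock so new reports keep buffering
        }

        private static void Forward(string[] batch)
        {
            byte[] payload = Encoding.UTF8.GetBytes(string.Join("\n", batch));

            var request = (HttpWebRequest)WebRequest.Create(MainServerUrl);
            request.Method = "POST";
            request.ContentType = "application/octet-stream";
            request.Headers["Content-Encoding"] = "gzip";

            using (Stream body = request.GetRequestStream())
            using (var gzip = new GZipStream(body, CompressionMode.Compress))
            {
                gzip.Write(payload, 0, payload.Length);
            }
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                // A 2xx response means the main server accepted the batch.
            }
        }
    }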

Anyway, just my thoughts. I'm working on something similar, but my volume of data is going to be maxing out 1-2 Gbit links almost continuously across a range of different high-end servers... so the intermediary server is what we're doing.

Matt H
@Matt H - when you say the network is the bottleneck, do you mean at the NIC level or at the ISP level? At worst he is talking about 5-13 gigabytes per minute of inbound traffic - roughly 1-2 Gbit/second. IMO it's quite possible that his ISP may not support that, but the local NIC should be able to handle it - right?
VinayC
I assumed it's running in a LAN/WAN environment because he talks about it being an intranet application (not internet). Most NICs only go up to 1 Gbit anyway, so it's a lot of data to be handling. You're right in a sense, though. A network of 25,000 PCs sounds like a big organisation or university; there could be multiple WAN links that become the bottleneck.
Matt H
+3  A: 

Evenly distributed, "worst case" you're at 12,500 transactions/minute, which is about 209 transactions per second.

What you should probably do is load balance the front end.

If you had 4 machines, you're down to about 52 transactions per second on each machine. Each machine stores its transaction data locally and then, in batches, makes bulk inserts into the back-end (final) database. This keeps the transaction volume low on the main database. The difference between inserting 1 row and 50 rows (depending on row size) is pretty minor; at some point it's "the same", depending on network overhead etc.

So, if we round down to 50 (for easy math), every 5 seconds the front-end machines insert 250 rows into the back-end database. That's not a lot of volume (again, depending on the row size).
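One way to do that kind of batched insert from each front-end machine is SqlBulkCopy; the connection string and table name below are placeholders:

    // Hypothetical front-end flush: every few seconds, push locally buffered
    // rows to the central database in one bulk operation instead of row-by-row inserts.
    using System.Data;
    using System.Data.SqlClient;

    class FrontEndFlusher
    {
        // Placeholder connection string for the central database.
        const string CentralConnStr =
            "Data Source=central-db;Initial Catalog=Reports;Integrated Security=True";

        public static void Flush(DataTable locallyBufferedRows)
        {
            if (locallyBufferedRows.Rows.Count == 0)
                return;

            using (var conn = new SqlConnection(CentralConnStr))
            {
                conn.Open();
                using (var bulk = new SqlBulkCopy(conn))
                {
                    bulk.DestinationTableName = "dbo.DumpTable"; // placeholder table
                    bulk.BatchSize = 1000;
                    bulk.WriteToServer(locallyBufferedRows);     // one round trip for the whole batch
                }
            }
            locallyBufferedRows.Clear();
        }
    }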

You mention polling 100 records per pass on the back end. Whatever number you use here, combined with the processing time, determines your drain rate, and that drain rate needs to keep up with your total traffic within the desired finish time.

Specifically, it's all right for the backend processing to be slower than the front-end insertion rate in the short run, as long as your backend catches up in the long run. For example, perhaps most of your traffic is from 8am to 5pm, but all said and done your backend processing will have caught up by 9pm.

Otherwise the backend never catches up, you're always behind, and the backlog just keeps growing. So you need to make sure you can handle that properly as well.
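To put assumed numbers on the "caught up by 9pm" example (209 requests/sec worst case arriving 8am-5pm, backend allowed to run until 9pm):

    // Worked example of the "catches up by 9pm" condition, using assumed numbers.
    using System;

    class CatchUpCheck
    {
        static void Main()
        {
            double arrivalsPerSec = 209;  // worst-case front-end rate from above
            double peakHours = 9;         // traffic arrives 8am - 5pm
            double processingHours = 13;  // backend runs 8am - 9pm

            double totalRows = arrivalsPerSec * peakHours * 3600;       // ~6.8 million rows per day
            double requiredRate = totalRows / (processingHours * 3600); // sustained drain rate needed

            Console.WriteLine("Rows arriving during the day: " + totalRows.ToString("N0"));
            Console.WriteLine("Backend must sustain about " + requiredRate.ToString("N0")
                + " rows/sec to be caught up by 9pm");
        }
    }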

If your report queries are expensive, it's best to offload those as well. Have the front-tier machines send raw data to a single middle-tier machine, then have a third machine make large (perhaps daily) bulk exports into a local reporting database for your report queries.
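Here is a sketch of such a daily bulk export, streaming rows from the middle-tier database into a separate reporting database; the server, database, table, and column names are placeholders:

    // Hypothetical nightly job: stream yesterday's rows from the middle tier
    // into a dedicated reporting database so report queries stay off the ingest path.
    using System;
    using System.Data.SqlClient;

    class NightlyReportExport
    {
        // Placeholder connection strings.
        const string MidTierConnStr = "Data Source=mid-tier;Initial Catalog=Ingest;Integrated Security=True";
        const string ReportingConnStr = "Data Source=report-db;Initial Catalog=Reporting;Integrated Security=True";

        static void Main()
        {
            using (var source = new SqlConnection(MidTierConnStr))
            using (var target = new SqlConnection(ReportingConnStr))
            {
                source.Open();
                target.Open();

                var select = new SqlCommand(
                    "SELECT ClientId, Payload, ReceivedAt FROM dbo.Reports " +
                    "WHERE ReceivedAt >= @from AND ReceivedAt < @to", source);
                select.Parameters.AddWithValue("@from", DateTime.Today.AddDays(-1));
                select.Parameters.AddWithValue("@to", DateTime.Today);

                using (SqlDataReader reader = select.ExecuteReader())
                using (var bulk = new SqlBulkCopy(target))
                {
                    bulk.DestinationTableName = "dbo.Reports"; // placeholder reporting table
                    bulk.BulkCopyTimeout = 0;                  // large exports can take a while
                    bulk.WriteToServer(reader);                // streams rows instead of materializing them
                }
            }
        }
    }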

Also, consider failure and availability scenarios (i.e. if you lose one of your load balanced front tier machines, can you still keep up with traffic, etc.). Lots of room for failure here.

Finally, as a rule, updates tend to be cheaper than deletes, so if you can do your deletes during down time rather than during mainstream processing, you'll probably find some performance there if you need it.
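For example, assuming a hypothetical Processed flag on the dump table, you could only mark rows during the day and purge them in batches during the quiet hours:

    // Hypothetical off-hours cleanup: delete already-processed rows in modest
    // batches so the purge never holds long locks during the busy part of the day.
    using System.Data.SqlClient;

    class OffPeakCleanup
    {
        const string ConnStr = "Data Source=.;Initial Catalog=Reports;Integrated Security=True"; // placeholder

        static void Main()
        {
            using (var conn = new SqlConnection(ConnStr))
            {
                conn.Open();
                int deleted;
                do
                {
                    var cmd = new SqlCommand(
                        "DELETE TOP (5000) FROM dbo.DumpTable WHERE Processed = 1;", conn);
                    cmd.CommandTimeout = 300;
                    deleted = cmd.ExecuteNonQuery(); // rows removed in this batch
                } while (deleted > 0);
            }
        }
    }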

Will Hartung
+1  A: 

At 25k requests per second you need to scale out (even at 25k per minute - and 25k per second is a genuinely huge load that would need many servers). You need a farm of WWW service servers, each dumping the requests into local storage (a queue). You can't have the WWW farm talk straight into the back end; it will die from contention (lock exclusion due to client requests attempting to insert/update the same spot in the database). The WWW service just dumps the request locally, returns the HTTP response, and continues.

From the mid-tier WWW servers these requests then have to be aggregated and loaded into the central servers. This loading has to be reliable, easily configurable, and quite fast. Don't fall into the trap of "I'll just write a copy utility myself with retry logic" - that road is paved with bodies. A good candidate for the local storage is a SQL Server Express instance, and a good candidate for the aggregation and loading is Service Broker.

I know this architecture works because I've done projects that use it; see High Volume Contiguous Real Time Audit and ETL. And I know of projects that use this architecture at really high scale - see March Madness on Demand or Real Time Analytics with SQL Server 2008 R2 StreamInsight for how the Silverlight media-streaming runtime intelligence is collected (the emphasis in the two links is on different technologies, but since I happen to know that project quite well, I know how they collect the data from the WWW web services into their back end).
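To illustrate just the "dump locally and return" half of this (the Service Broker aggregation side is covered in the linked articles), a web method on each WWW server might do no more than the following; the local instance, database, and table names are placeholders:

    // Sketch of a web service method that only writes to a local store and returns.
    // Forwarding to the central server (e.g. via Service Broker) happens asynchronously,
    // outside this call.
    using System.Data.SqlClient;
    using System.Web.Services;

    public class IngestService : WebService
    {
        // Placeholder: a local SQL Server Express instance on the WWW server itself.
        const string LocalConnStr =
            @"Data Source=.\SQLEXPRESS;Initial Catalog=LocalIngest;Integrated Security=True";

        [WebMethod]
        public void SubmitReport(int clientId, string payload)
        {
            using (var conn = new SqlConnection(LocalConnStr))
            {
                conn.Open();
                var cmd = new SqlCommand(
                    "INSERT INTO dbo.LocalDump (ClientId, Payload, ReceivedAt) " +
                    "VALUES (@c, @p, GETUTCDATE());", conn);
                cmd.Parameters.AddWithValue("@c", clientId);
                cmd.Parameters.AddWithValue("@p", payload);
                cmd.ExecuteNonQuery();
            }
            // Return immediately; no contention on the central database in the request path.
        }
    }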

Remus Rusanu
A: 

By my calculations, at worst case you have 25,000 inserts every 120 seconds. Every 10 seconds you read 100 rows, which means that in those 120 seconds you have read only 1,200 rows. This means your temp table will keep accumulating data.
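Putting that into numbers (this just restates the calculation above):

    // The backlog arithmetic from this answer: worst case, 25,000 rows arrive
    // every 120 seconds but only 1,200 rows (100 every 10 seconds) are drained.
    using System;

    class BacklogGrowth
    {
        static void Main()
        {
            int arrivalsPer120s = 25000;    // every client reports at least once per 2 minutes
            int drainedPer120s = 12 * 100;  // 100 rows every 10 seconds
            int growthPer120s = arrivalsPer120s - drainedPer120s;

            Console.WriteLine("Backlog grows by " + growthPer120s + " rows every 2 minutes");
            Console.WriteLine("That is about " + (growthPer120s * 30) + " rows per hour");
        }
    }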

What you need to do to scale the system is think in terms of how you can add components to handle the load.

Design the web service so that it can fire off requests to "slaves" responsible for inserting the data into temp tables. The list of temp table names will need to be kept in some common naming service (something as simple as another table of names would also be OK).

Design the ETL system service in a similar fashion: pick a temp table, read all its rows, do its job, mark the temp table as processed, and go back to sleep.

This way you can add additional processes for the inserts and for the ETL.
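Here is a minimal sketch of that idea, with the "naming service" as a registry table and each ETL worker claiming one staging table at a time; every table, column, and status value below is an assumption:

    // Hypothetical ETL worker: claim one registered staging table, process it,
    // then mark it done. READPAST lets several workers run without blocking each other.
    using System;
    using System.Data.SqlClient;

    class EtlWorker
    {
        const string ConnStr = "Data Source=.;Initial Catalog=Reports;Integrated Security=True"; // placeholder

        static void Main()
        {
            while (true)
            {
                string tableName = ClaimNextStagingTable();
                if (tableName == null)
                {
                    System.Threading.Thread.Sleep(TimeSpan.FromSeconds(10)); // nothing to do, sleep
                    continue;
                }
                ProcessStagingTable(tableName);
                MarkProcessed(tableName);
            }
        }

        static string ClaimNextStagingTable()
        {
            using (var conn = new SqlConnection(ConnStr))
            {
                conn.Open();
                // Registry table: StagingTables(TableName, Status), Status 0 = ready, 1 = claimed, 2 = done.
                var cmd = new SqlCommand(
                    "UPDATE TOP (1) dbo.StagingTables WITH (UPDLOCK, READPAST) " +
                    "SET Status = 1 OUTPUT INSERTED.TableName WHERE Status = 0;", conn);
                return cmd.ExecuteScalar() as string;
            }
        }

        static void ProcessStagingTable(string tableName)
        {
            // Read all rows from the claimed staging table and apply the inserts/updates
            // to the reporting tables here.
        }

        static void MarkProcessed(string tableName)
        {
            using (var conn = new SqlConnection(ConnStr))
            {
                conn.Open();
                var cmd = new SqlCommand(
                    "UPDATE dbo.StagingTables SET Status = 2 WHERE TableName = @t;", conn);
                cmd.Parameters.AddWithValue("@t", tableName);
                cmd.ExecuteNonQuery();
            }
        }
    }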

Finally, your report repository is going to grow at an alarming rate. Hopefully the data there can be cleaned out every week or month?

venky