I have an application that, in its simplest form, reads a large number of phone numbers from a database (about 15 million) and sends each number, one at a time, to a URL for processing. I designed the application like this:

  1. Bulk export the phone numbers from SQL to a text file using SSIS. This is very quick, a matter of 1 or 2 minutes.
  2. Load the numbers into a message queue (I use MSMQ at the moment).
  3. Dequeue the messages from a command-line application, fire off requests over HTTP to some service (about 3 calls per phone number), and finally log to a database.

Problem is: it still takes a long time to complete. MSMQ also has a limit on the size of messages it can hold, so now I have to create multiple message queues. I need a lot of fault tolerance, but I dare not make my message queue transactional because of the performance cost. I'm thinking of publishing the message queue (currently a private queue) to Active Directory so that processes on different systems can dequeue from it and finish quicker. Also, my processors hit 100% during execution, so I'm changing the app to use a thread pool. I'm willing to explore JMS if it will handle the queue better. So far, the most efficient part of the whole process is the SSIS part.
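To be concrete, here's roughly what I mean by the thread pool change (sketched in Python for brevity, though the real app is a .NET command-line client; `process_number` is a stand-in for the three HTTP calls and the DB logging):

```python
from concurrent.futures import ThreadPoolExecutor

def process_number(number):
    # Stand-in for the three HTTP calls per number plus the database log.
    # A real worker would issue the requests and record the outcome here.
    return ("ok", number)

def run_workers(numbers, pool_size=32):
    # A bounded pool caps concurrency so the CPU doesn't peg at 100%
    # the way one-thread-per-message did.
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return list(pool.map(process_number, numbers))

results = run_workers(["5551000", "5551001", "5551002"])
```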

I'd like to hear about better design approaches, especially if you've handled this kind of volume before. I'm ready to switch to Unix or write Lisp if it handles this kind of situation better.

Thanks.

A: 

I have no experience with MSMQ, just a general question: why read the whole number list into the queue at the start? Why not read a number from the text file, process it, and then read the next? Reading all the numbers into memory at once might exceed the RAM limit, and paging will cause delays as objects are swapped between RAM and the page file.

Ashish Patil
Using something like MSMQ gives you a way to track which numbers you have already processed, so you don't start from the beginning again after a failure.
kasperjj
That would mean each number becomes its own entry in MSMQ, which could get silly with 15 million numbers. I would skip MSMQ and use the database itself to record the current state of each processed number.
Adam
It would still be better to read rows in batches, say 1000 at a time, from the DB, process them, and update each row's status as it completes. Reading rows in chunks saves on repeated DB calls, which can be expensive in terms of time, while saving status in the DB is much easier to handle for recovery.
Ashish Patil
Is there really a difference between 15 million messages in a queue versus 15 million rows in a database?
matt b
Trust me, there is. The database update time is significant. The queue manages ordering well, and it's also much faster than updating a DB. And yes, it gracefully handles the ones you have already processed. I can have multiple queues; it doesn't have to be one.
keni
A: 

You could alternatively expose access to the phone numbers via a web service. Instead of reading from MSMQ, the process can ask the web service for a batch of phone numbers, which are flagged as locked in SQL. It then calls the web service back to report success or failure, so the numbers can be marked completed or unlocked in SQL. This way you don't needlessly bulk-select all rows and clog the system, and you get the fault tolerance of SQL itself.

Further to this, you can implement some maintenance to unlock rows that have had no response for a long time, and then ignore the results if the original caller finally gets back to you. You can do this by tagging each locked row with some form of caller ID.

Alternatively, ditch the web service and connect directly to the database. Either way you get fault tolerance, it's simple to implement, it's as fast as you are going to get, and it can be distributed over many processing machines.
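A minimal sketch of that claim/complete cycle, using an in-memory SQLite table in place of SQL Server and made-up column names, purely to illustrate the locking idea:

```python
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE numbers (
    phone     TEXT PRIMARY KEY,
    status    TEXT DEFAULT 'pending',   -- pending / locked / done
    locked_by TEXT,
    locked_at REAL)""")
db.executemany("INSERT INTO numbers (phone) VALUES (?)",
               [("5551000",), ("5551001",), ("5551002",)])

def claim_batch(caller_id, size):
    # Flag a batch as locked, remembering who took it and when, so a
    # maintenance job can later release locks that have gone stale.
    rows = db.execute(
        "SELECT phone FROM numbers WHERE status = 'pending' LIMIT ?",
        (size,)).fetchall()
    phones = [r[0] for r in rows]
    db.executemany(
        "UPDATE numbers SET status = 'locked', locked_by = ?, locked_at = ? "
        "WHERE phone = ?",
        [(caller_id, time.time(), p) for p in phones])
    return phones

def complete(caller_id, phone, ok):
    # Mark done on success, or release the lock on failure.
    new_status = "done" if ok else "pending"
    db.execute(
        "UPDATE numbers SET status = ?, locked_by = NULL "
        "WHERE phone = ? AND locked_by = ?",
        (new_status, phone, caller_id))

batch = claim_batch("worker-1", 2)
for p in batch:
    complete("worker-1", p, ok=True)

pending_left = db.execute(
    "SELECT COUNT(*) FROM numbers WHERE status = 'pending'").fetchone()[0]
done_count = db.execute(
    "SELECT COUNT(*) FROM numbers WHERE status = 'done'").fetchone()[0]
```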

Adam
A: 

Why do you read the whole lot from the database at once? Wouldn't it be more effective to read the phone numbers in batches, so that your issues with queues disappear?

Grzenio
+2  A: 

Here is a simple, super-pragmatic solution:

First split your text file into smaller files, perhaps with about 10,000 entries in each. Let's call them numbers_x.queue.

Create a threadpool based app where each thread processes the files using the following steps:

  1. Look for a file called numbers_x.done; if it exists, find the last complete number in it.
  2. If you found a .done file, scan through numbers_x.queue to position yourself at the number after the last one in the .done file.
  3. Read a number from the .queue file
  4. Do your web API calls
  5. Do your logging
  6. Append the number to the .done file
  7. If the .queue file is not at the end yet, go to 3
  8. Delete the queue file, then the done file
  9. Grab another unprocessed .queue file and continue from 1

While this is a pretty crude approach, it is very easy to implement, reasonably fault tolerant, and you can easily split the .queue files among a set of servers and have them work in parallel.
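The steps above can be sketched like this (Python; counting completed lines in the .done file is a simplification of "find the last full number", and `call_api` stands in for the web calls and logging):

```python
import os

def process_queue_file(queue_path, call_api):
    # Steps 1-2: if a .done file exists, count its entries and skip that
    # many numbers in the .queue file so we resume where we left off.
    done_path = queue_path.replace(".queue", ".done")
    already = 0
    if os.path.exists(done_path):
        with open(done_path) as f:
            already = sum(1 for _ in f)
    with open(queue_path) as q, open(done_path, "a") as done:
        for i, line in enumerate(q):
            if i < already:
                continue                  # position past completed numbers
            number = line.strip()
            call_api(number)              # steps 4-5: web calls + logging
            done.write(number + "\n")     # step 6: append to the .done file
            done.flush()
    os.remove(queue_path)                 # step 8: queue file first,
    os.remove(done_path)                  # then the done file

# Demo: two numbers queued, one already done -> only "5551001" is processed.
with open("numbers_0.queue", "w") as f:
    f.write("5551000\n5551001\n")
with open("numbers_0.done", "w") as f:
    f.write("5551000\n")
processed = []
process_queue_file("numbers_0.queue", processed.append)
```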

kasperjj
Is this really all that different than an approach using a message queue? This sounds like it would just take away the message queue abstraction and re-implement a work queue-like structure. The remote web service would still be a bottleneck.
matt b
Yes, it is different in that it is a much simpler queue structure that takes advantage of his specific requirements, offering much higher performance thanks to the streaming I/O design and the fact that entries are never added to the queue. But yes, the web service could still be a bottleneck. However, it doesn't have to be (100 ms latency with 32 threads in parallel works out to about 13 hours), and he doesn't mention it in his question, so I chose to focus on his problems using MSMQ.
kasperjj
A: 

Hi, what is the goal in using two separate apps plus JMS, one app to load the data and the other to send it on, with JMS queueing the numbers in between? Are both standalone applications, or is the "reading" one embedded in some server? And one last question: are the numbers tracked in any way once they've been sent?

After all, in my opinion, there is no benefit in using JMS here. It only brings additional overhead.

Let's summarize: an app sends numbers to some web server for processing. I assume the receiving party cannot process/receive many numbers at once. I assume the database you are loading the data from is some read-only store from a customer or similar. I assume you want to keep track of the numbers sent.

What I would do:

  • merge the two apps into one
  • create some (possibly embedded) DB to keep track of processed numbers (this is better than maintaining a file manually or keeping the list in memory; in case of a crash you'd be starting from the beginning, and a DB has nice features to offer)
  • create a pool of senders that each take a batch of, let's say, 10 numbers and send it off
  • create a data reader that reads numbers in batches (I see no benefit in reading everything at the beginning; that only delays the start of sending). This is also easier to trace, as you maintain smaller sets of data. The reader's role is also to keep track of idle senders and feed them with data read while all the senders were busy.
  • the senders update the internal DB with the numbers they have already processed

I would say this is quite configurable as well, since you can modify the batch size processed by each sender and the number of senders.
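A rough single-process sketch of this reader/sender split (Python, with one sender thread and an in-memory SQLite table standing in for the embedded tracking DB; the actual send to the web server is omitted):

```python
import queue
import sqlite3
import threading

BATCH = 10  # numbers per sender batch, as suggested above

tracker = sqlite3.connect(":memory:", check_same_thread=False)
tracker.execute("CREATE TABLE processed (phone TEXT PRIMARY KEY)")
db_lock = threading.Lock()
work = queue.Queue(maxsize=4)  # small buffer: the reader feeds idle senders

def reader(all_numbers):
    # Read in batches instead of loading everything up front.
    for i in range(0, len(all_numbers), BATCH):
        work.put(all_numbers[i:i + BATCH])
    work.put(None)  # sentinel; with more senders you'd need one per sender

def sender():
    while True:
        batch = work.get()
        if batch is None:
            break
        for phone in batch:
            # send `phone` to the web server here, then record it
            with db_lock:
                tracker.execute("INSERT INTO processed VALUES (?)", (phone,))

numbers = [f"555{i:04d}" for i in range(25)]
t = threading.Thread(target=sender)
t.start()
reader(numbers)
t.join()
count = tracker.execute("SELECT COUNT(*) FROM processed").fetchone()[0]
```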

If the desired environment is more distributed - you might have more computers - then do it this way:

  • Keep JMS with one (or more equivalent) queue(s) feeding the senders; create a queue for notifying the reader about finished work (so that the feeding queues don't get overloaded).
  • Make the reader part feed data (sets of numbers) and read the notification queue.
  • Create a DB to keep track of processed numbers; make it shared between the senders, or keep it isolated to the reader while the senders send "processing reports" that the reader then uses to update the DB.
Martin