views: 447
answers: 10

I have an interesting problem to solve. One of my clients has me developing a stock analysis program with close to 50 years of stock data for almost a thousand symbols. I've developed a series of filters that are applied on any given day to see if anything falls out for a trade.

We want to run this filter for each day of data we have for each stock; basically a begin-date and end-date type of report. However, it takes 6 minutes to filter each week of data for each symbol, so we are figuring about 40 hours or so to run the report on our entire data set.

The overriding requirement is that my client be able to do anything in the application from any computer, anywhere (he travels a lot), so we are browser-based.

To solve this issue, I wrote an asynchronous method that runs this report; however, the application pool inactivity timeout will kill the job. I don't want to have to start adjusting timeouts for the entire application to support this one report (we are going to do a lot of these, as every stock scenario will need to be run against our entire dataset for analysis before it gets used for active trading).

Does anyone have any general ideas or experiences with a web architecture that will support ultra-long asynchronous processes?

Thanks

+16  A: 

As a general suggestion I would recommend a standalone Windows Service, Console App or similar with very careful lifetime controls and logging, which would run constantly and check (poll) for 'jobs to process' in a database, then update the database with results and progress information.

It may not be the best way but I've used it before many times and it's reliable, scalable and has good performance.

Best to keep web requests to a minute or two maximum - they were never designed for heavy processing times. This way you can 'check in' on the job status every minute or so (using a Web Service).
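
To make the shape of that concrete, here is a minimal sketch of the polling loop. The ReportJobs table, its columns and the status values are made up for illustration; substitute whatever your jobs schema ends up being:

    // Minimal polling worker sketch for a Windows Service or console app.
    // The ReportJobs table and its columns are assumptions for illustration only.
    using System;
    using System.Data.SqlClient;
    using System.Threading;

    class ReportJobWorker
    {
        readonly string _connectionString;

        public ReportJobWorker(string connectionString)
        {
            _connectionString = connectionString;
        }

        public void Run(CancellationToken token)
        {
            while (!token.IsCancellationRequested)
            {
                int? jobId = FindPendingJob();
                if (jobId == null)
                {
                    Thread.Sleep(TimeSpan.FromSeconds(30));   // nothing queued; poll again shortly
                    continue;
                }

                SetStatus(jobId.Value, "Running");
                try
                {
                    RunReport(jobId.Value, token);            // the long-running filter work
                    SetStatus(jobId.Value, "Completed");
                }
                catch (Exception ex)
                {
                    SetStatus(jobId.Value, "Failed: " + ex.Message);
                }
            }
        }

        // With a single worker this two-step find-then-flag is fine; with several
        // workers you would want to claim the row atomically instead.
        int? FindPendingJob()
        {
            using (var conn = new SqlConnection(_connectionString))
            using (var cmd = new SqlCommand(
                "SELECT TOP (1) Id FROM ReportJobs WHERE Status = 'Pending' ORDER BY Id", conn))
            {
                conn.Open();
                object id = cmd.ExecuteScalar();
                return id == null ? (int?)null : (int)id;
            }
        }

        void SetStatus(int jobId, string status)
        {
            using (var conn = new SqlConnection(_connectionString))
            using (var cmd = new SqlCommand(
                "UPDATE ReportJobs SET Status = @status WHERE Id = @id", conn))
            {
                cmd.Parameters.AddWithValue("@status", status);
                cmd.Parameters.AddWithValue("@id", jobId);
                conn.Open();
                cmd.ExecuteNonQuery();
            }
        }

        // Placeholder: load the report parameters, run the filters, write results and progress.
        void RunReport(int jobId, CancellationToken token) { }
    }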

If you have any questions for me or about the idea, please post a comment and I will be happy to help, elaborate or make further suggestions.

Hope that helps!


(Additional: I believe Windows Services are underused! All it takes is a quick base class or collection of reusable helper methods and you've got a logged, reliable, automatic, configurable, quick-to-implement process running under your control. Quick to prototype with too!)

Kieren Johnstone
+1 for a Windows Service. This work doesn't belong in a web application's workload.
p.campbell
+1 yep, that's what services are for, 100% agreed, putting it into a web app is just a hack
andy
Kieren, thank you for your answer. Yes, I think you hit the nail right on the head, and this approach is consistent with the rest of our architecture. We are already running Windows services to manage historical quote downloads when the markets close. Also, the only timeout I'll have to manage is SQL's CommandTimeout, which is simple enough. I can let my users input parameters for their reports, store them in a database, then pull them up and run them in the context of the service. I knew that this requirement was just not going to be solved solely by a web page. Thank you - Bravo!
MikeMalter
+5  A: 

Is there any reason not to simply run a service in the background and archive individual result sets to a read-only results table as they are requested? Do you need to run the query in real time? The app could retrieve pages of results as they get generated by the service.
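
To sketch the "pages of results" part: something along these lines could read a page out of a precomputed results table while the service is still filling it in. The ReportResults table and its columns are assumptions, and ROW_NUMBER() is used so it works on SQL Server 2008:

    // Sketch: page through a precomputed, read-only results table.
    // ReportResults and its columns are hypothetical names for illustration.
    using System.Data;
    using System.Data.SqlClient;

    static class ResultsPager
    {
        public static DataTable GetResultsPage(string connectionString, int reportId,
                                               int pageNumber, int pageSize)
        {
            const string sql = @"
                SELECT Symbol, TradeDate, Signal
                FROM (
                    SELECT Symbol, TradeDate, Signal,
                           ROW_NUMBER() OVER (ORDER BY TradeDate, Symbol) AS RowNum
                    FROM ReportResults
                    WHERE ReportId = @reportId
                ) AS numbered
                WHERE RowNum BETWEEN (@page - 1) * @size + 1 AND @page * @size
                ORDER BY RowNum;";

            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(sql, conn))
            using (var adapter = new SqlDataAdapter(cmd))
            {
                cmd.Parameters.AddWithValue("@reportId", reportId);
                cmd.Parameters.AddWithValue("@page", pageNumber);
                cmd.Parameters.AddWithValue("@size", pageSize);

                var page = new DataTable();
                adapter.Fill(page);   // Fill opens and closes the connection itself
                return page;
            }
        }
    }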

Mike Burton
+1 - I second this. Does a query this long really need to be ad hoc? Pre-cook your results and have the web service retrieve the cached values while an OS service updates the query as necessary.
j0rd4n
Mike and J - I did not want to have to archive recordsets, and then pull them up. However the service part is a very good idea.
MikeMalter
+1  A: 

Generally, ultra-long asynchronous processes don't belong on the web.

His request should be queued and another process should run the job and store the data in the format the user will use it in.

Ed B
Ed, yup. My thinking, thanks for your thoughts.
MikeMalter
+1  A: 

Six minutes to filter a week of data? It seems like your db needs proper index fine-tuning.

friol
Bear in mind that "a week of data" can mean many things -- just each day's close (five data points), or every tick (possibly into millions of points).
Bob Kaufman
It also depends on the software/hardware in use and the database design. In my professional life I've seen databases constructed in numerous silly ways: a database with a read-only table having over a billion rows; a small database (< 5 GB) with many hundreds of tables and dozens of relations using multiple-column / non-simple primary keys -- it chokes up constantly. It's easier to engineer badly, and around those lacking experience there's plenty to go around. "Just this once" becomes the doctrine of regret.
Hardryv
friol, Bob is right. We are looking at millions of points, so there are filters, and then there are FFIILLTTEERRSS! And I do agree with your idea about proper index tuning. Every stored procedure has its query plan analyzed, and I certainly love SQL 2008's feature of telling you which indices you could be missing.
MikeMalter
+3  A: 

It sounds like you are doing SQL queries directly against these data. Have you considered loading the data to e.g. SQL Server Analysis Services and setting up a cube with (for starters) time, stock and symbol dimensions? Depending on the nature of your queries, you may get into quite reasonable response times. Relational databases are good for online transaction processing (within certain load and response time parameters), but analytical work sometimes requires the methods and technologies of data warehouses instead. (Or, perhaps, associative databases... there are alternatives.)

However, considering Murphy, you'll probably have some long-running queries. Do the data vary for different end users? If not, why not precompute answers? Nothing HTTP-based should take more than a minute to process, if even that -- at least not by design!

Pontus Gagge
Pontus, thank you for your thoughts. It does seem like this requirement would be one for SSAS, but due to the filters we are running on our data, I wanted an approach that could simply put them in a while loop and keep running them until the start date is >= the end date - one day or all of them. The query time for this report is not an issue, as these are proof-type reports to validate strategic thinking over a 50-year timespan with regard to trading activity. So even if it took three days to run, what is important is that the approach is validated over a very large historical set.
MikeMalter
+2  A: 

Depending on the specifics of your filter, this sounds like a task that could benefit from parallelization: split the query across multiple compute nodes that each run the filter on a subset (shard) of the data. If your filter is focused on analyzing one stock across a lot of time data, you could split the work by stock symbol and have multiple compute nodes processing different stock symbols concurrently. If you need to study relationships between stock symbols over time, it might make more sense to split the work by time intervals and coalesce the results after the operation (map/reduce). This is a case where throwing more hardware at the problem can actually improve response time considerably. Consider the Google search engine as one example.

The usual caveats apply: review your current filter implementation for performance bottlenecks first. Make sure the tables you're hitting are appropriately indexed, etc. Precalculate relationships and digests of frequently needed computations in advance. Storage is cheap if it will save time.

Your web request could kick off a scatter/gather query operation distributing the query to available compute nodes in the cloud (Windows Azure, Google Apps, Amazon). Given sufficient compute nodes and appropriate distribution of work, you can probably get the response back in near real time.
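
Even before going to multiple machines, the same split-by-symbol idea can be tried on a single box with PLINQ. A minimal sketch, where RunFilter, FilterResult and the symbol list are placeholders rather than anything from the question:

    // Sketch: run the per-symbol filter across all cores with PLINQ.
    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class ParallelFilterSketch
    {
        public static List<FilterResult> RunFiltersInParallel(IEnumerable<string> symbols,
                                                              DateTime begin, DateTime end)
        {
            return symbols
                .AsParallel()
                .AsOrdered()          // keep output in input order; unordered PLINQ results
                                      // can come back in a different order on every run
                .WithDegreeOfParallelism(Environment.ProcessorCount)
                .SelectMany(symbol => RunFilter(symbol, begin, end))
                .ToList();
        }

        // Placeholder for the real filter over one symbol's historical data.
        static IEnumerable<FilterResult> RunFilter(string symbol, DateTime begin, DateTime end)
        {
            yield break;
        }
    }

    class FilterResult { /* whatever a filter hit looks like */ }

Note that ordering alone won't make runs deterministic if the filter mutates shared state; that has to be kept per-symbol or synchronized.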

dthorpe
dthorpe, yes, we are working with PLINQ and parallel queries. However, there are some sorting issues I have to work out. In one report I got 3 hours down to 1/2 hour, but every time we ran the report we got different results. So I need to look into it, but parallelization is not a priority or requirement at this point; in the future it will be. Thank you again for your thoughts.
MikeMalter
A: 

I don't know why everyone answering here wants all the work to be done in the database but the job control to be outside of it. It's like putting the ignition key for your car on the outside. Actually, I do know why: you tagged this question with wcf.

And the suggestion to use "compute nodes" just completed my buzzword bingo - thanks, dthorpe! You don't need "compute nodes", just cores. Most RDBMSs have PX (parallel execution) built right in. Why pay for cloud computing that you'd use every day? Just buy a server with enough CPUs and you'll be fine... No need for "scatter gather" queries, just turn on PX...

Pontus is pointing you in the right direction. Being satisfied with 6-minute performance and worrying about how to schedule it is your problem. There are lots of strategies to organize your data into formats which promote speed: indexes, partitioning, cubes, IOTs. You may be doing two-pass sorts instead of in-memory sorts. Your statistics could be out of date, causing a bad plan.

I'm assuming from the tenor of this question that you've not done a whole ton of db tuning. You really should post a database tuning question (or several) and tell us which RDBMS you're using and how far you've already tuned.

Stephanie Page
Because the database is where the data lives, and 50+ years of data will be gigabytes of data and massive datasets to pass around as objects. I am not saying it cannot be done, but every situation has its own merit for performing certain tasks in the db and others outside it. In this case, I would try to perform as much of the filtering as possible in the DB, then pass the data out to a reporting-type server to do the reports.
Ryk
Stephanie, my sentiments exactly. We are just shoveling stock tick data out of our SQL Server database and are further processing it in binary code. Actually, the databases on this project are the simplest I have ever created; I think we may have barely reached second normal form. But the size of the data is massive. I think I could have better clarified my question by saying a 30+ hour process instead of a 30+ hour query. Thank you for taking the time to share your thoughts with me.
MikeMalter
Ryk, you misunderstood. It's the part AFTER the "but" that I disagree with. I was arguing against scheduling and control being a service in Windows or anything else that's NOT in the database. It's a comma, not a period.
Stephanie Page
Mike, ah, that makes more sense. I worked in a trading firm's IT shop for 8 years. I was the guy who did the database for the PhDs who wrote the "binary code", as you call it. Now that you've clarified the query/process line, "compute node" makes a lot more sense: this is the exact situation we had as well. We have fleets of servers dedicated to what we call replicated processes, and a scheduler that manages those reps.
Stephanie Page
Stephanie, thanks for sharing your experience. It is nice to know that my small project has some similarities to the larger ones.
MikeMalter
+1  A: 

I recommend you read this article about Asynchronous Procedure Execution. If your logic is database bound (which it most certainly is), then this gives a completely reliable way to launch computation tasks in an asynchronous fashion that is resilient to failover. Given that your load is highly parallelizable, you can launch several tasks, e.g. one for each ticker; see the follow-up article Passing Parameters to a Background Procedure.

As a side note, this technique of leveraging the built-in SQL Server asynchronous activation is used by at least two major financial corporations I know of, for exactly the same scenario as yours.
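
The Service Broker plumbing is too long to reproduce here, but the fan-out idea is simple to sketch either way: whether the queue lives in Service Broker or in a plain jobs table, you enqueue one task per ticker and let the background worker(s) drain the queue. A rough sketch against a hypothetical ReportJobs table (all names are assumptions):

    // Sketch: enqueue one background job per ticker symbol.
    // ReportJobs and its columns are made-up names for illustration.
    using System.Collections.Generic;
    using System.Data.SqlClient;

    static class JobFanOut
    {
        public static void EnqueueJobPerSymbol(string connectionString,
                                               IEnumerable<string> symbols,
                                               string reportParameters)
        {
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();
                foreach (var symbol in symbols)
                {
                    using (var cmd = new SqlCommand(
                        @"INSERT INTO ReportJobs (Symbol, Parameters, Status)
                          VALUES (@symbol, @parameters, 'Pending')", conn))
                    {
                        cmd.Parameters.AddWithValue("@symbol", symbol);
                        cmd.Parameters.AddWithValue("@parameters", reportParameters);
                        cmd.ExecuteNonQuery();
                    }
                }
            }
        }
    }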

Remus Rusanu
Remus, thanks for the article, which I certainly will read. Our database is not logic bound, if I understand your meaning. I view stored procedures like those huge steam shovels in mines: they are huge, highly optimized machines good for scooping earth for further processing. In our case LINQ to SQL is our shovel, and we do most of our additional filtering and analysis in binary code with LINQ.
MikeMalter
You may consider the out-of-process service that leverages the same technology, the External Activator: http://blogs.msdn.com/b/sql_service_broker/archive/2008/11/21/announcing-service-broker-external-activator.aspx. You need a way to reliably schedule computation (tasks), and you'll end up building a queue in the database sooner or later, so you'd better start with it.
Remus Rusanu
I am going to go with creating a Windows service to read a report queue table. If it starts getting big and looks like maintenance will become a problem, we will look at MSMQ.
MikeMalter
http://rusanu.com/2010/03/26/using-tables-as-queues/
Remus Rusanu
Thank you for your article. I did not know about the OUTPUT clause. We are not deleting records from our queue at this point, but are flagging a status column and using it as the context.
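
(For reference, the status-column style of claim described above might look roughly like this; the table and column names are made up, and the point is that the single UPDATE ... OUTPUT statement flags the row and returns its key in one atomic step, so two workers cannot grab the same job:)

    // Sketch: claim the next pending job atomically with UPDATE ... OUTPUT,
    // flagging a status column instead of deleting the row. Schema is hypothetical.
    using System.Data.SqlClient;

    static class JobQueueReader
    {
        public static int? ClaimNextPendingJob(string connectionString)
        {
            const string sql = @"
                UPDATE TOP (1) ReportJobs
                SET Status = 'Running', StartedAt = GETUTCDATE()
                OUTPUT inserted.Id
                WHERE Status = 'Pending';";

            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(sql, conn))
            {
                conn.Open();
                object id = cmd.ExecuteScalar();   // null when nothing is pending
                return id == null ? (int?)null : (int)id;
            }
        }
    }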
MikeMalter
A: 

Mike,

There are many ways to answer this question, but the more important question I see that you should be asking is: why does it take 6 minutes to filter for a stock?

Yes, I know you have 50 years of data and lots of stocks, BUT it should not take 6 minutes. So, more importantly, I would be looking at that particular table structure, the indexes on it, and the query and what it is doing.

I used to work for a similar company, with tables that are almost 100 GB each (yes, the size of the table, not the entire db), and after some fine-tuning we got queries that used to take 15+ minutes down to 3 seconds.

I would love to help you especially if you are running on SQL Server. Email me ryk99[at]hotmail[dot]com and we'll see what we can do from there.

Ryk
Ryk, thanks for the offer of assistance. It takes 6 minutes to filter a stock because there are half a dozen filters full of conditions and comparisons that go from a given day through a year of analysis. The issue for me is not the time it takes to execute the query as much as how to support scaling out. What happens when we need to check our entire dataset for a strategic approach to making a trade? However, earlier a great suggestion was made to create a Windows service and poll a table of parameters. The only timeout I have to worry about is the SQL CommandTimeout. Thanks again.
MikeMalter
A: 

Have you thought of using an ETL solution such as SSIS to pre-populate your data?

Mr Shoubs
No, we have not considered it, partially because my experience with SSIS is scant. Maybe I should look into it.
MikeMalter
It's worth looking into (it's the sort of thing SSIS is supposed to be used for), though it does have quite a steep learning curve. If your SQL skills are sound, you shouldn't have too much of a problem; there are a lot of good resources around.
Mr Shoubs