Hi all,
This is my first time posting to a questions site, but I have a somewhat complex problem I've been looking at for a few days.
Background
At work we're implementing a new billing system. However, we want to take the unprecedented step of actually auditing the new billing system against the old one (which is significantly more robust) on an ongoing basis. The reason is that the new billing system is a lot more flexible for our new rate plans, so marketing is really pushing us to get it in place.
We had our IT group develop a report, for a ridiculous amount of money, that runs at 8 AM each morning against the previous day's data, compares records to find byte-count discrepancies, and generates the report. This isn't very useful for us: first, it only runs the next day, and second, if it shows bad results, we have no indication of why we had a problem the day before.
So we want to build our own system that hooks into any possible data source (at first, only the new and old systems' User Data Records (UDRs)) and compares the results in near real-time.
Just some notes on the scale: each billing system produces roughly 6 million records per day, at a total file size of about 1 GB.
My proposed set-up
Essentially, buy some servers. We have the budget for several 8-core / 32 GB RAM machines, so I'd like to do all the processing and storage in in-memory data structures. We can buy bigger servers if necessary, but after a couple of days I don't see any reason to keep the data in memory any longer; it would be written out to persistent storage, with aggregate statistics stored in a database.
Each record essentially contains a record ID from the platform, a correlation ID, username, login time, duration, bytes in, bytes out, and a few other fields.
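For reference, a record along those lines might map to a class like this (just a sketch; the field names and types are my guesses, not an actual schema):

    using System;

    // Hypothetical record layout; field names are illustrative guesses.
    public class UdrRecord
    {
        public string RecordId;       // record-id assigned by the source platform
        public string CorrelationId;  // used to match records across platforms
        public string Username;
        public DateTime LoginTime;
        public TimeSpan Duration;
        public long BytesIn;
        public long BytesOut;
        // ... plus the other fields mentioned above
    }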
I was thinking of using a fairly complex data structure for processing. Each record would be broken into a user object and a record object belonging to either platform A or platform B. At the top level would be a self-balancing binary search tree keyed on username. The next level would be something like a skip list based on date, so we would have pointers to the next matched record, next day, next hour, next month, next year, etc. Finally, there would be the matched-record object itself, essentially just a holder that references the UDR record object from system A and the UDR record object from system B.
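To make that concrete, here is a minimal sketch of how those levels could hang together in C#, using SortedDictionary (a red-black tree internally) in place of both a hand-rolled self-balancing BST and explicit skip pointers; MatchedRecord, User, and UdrStore are hypothetical names of my own:

    using System;
    using System.Collections.Generic;

    // Pairs the same logical record from both platforms.
    public class MatchedRecord
    {
        public UdrRecord SystemA;  // null until system A's record arrives
        public UdrRecord SystemB;  // null until system B's record arrives

        public long BytesInDiscrepancy
        {
            get
            {
                long a = SystemA == null ? 0 : SystemA.BytesIn;
                long b = SystemB == null ? 0 : SystemB.BytesIn;
                return Math.Abs(a - b);
            }
        }

        public double DiscrepancyPercent
        {
            get
            {
                long b = SystemB == null ? 0 : SystemB.BytesIn;
                return b == 0 ? 0 : 100.0 * BytesInDiscrepancy / b;
            }
        }
    }

    // One node per user; records indexed by start time.
    public class User
    {
        public string Username;
        public SortedDictionary<DateTime, MatchedRecord> Records =
            new SortedDictionary<DateTime, MatchedRecord>();
    }

    // Top level: tree keyed on username.
    public class UdrStore
    {
        public SortedDictionary<string, User> Users =
            new SortedDictionary<string, User>();

        // Flattens the tree into one sequence for ad-hoc queries.
        public IEnumerable<MatchedRecord> AllRecords()
        {
            foreach (User user in Users.Values)
                foreach (MatchedRecord rec in user.Records.Values)
                    yield return rec;
        }
    }

Because the per-user records are kept ordered by start time, "next hour" or "next day" becomes a range scan; explicit skip pointers would only pay off if those jumps dominate the access pattern.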
I'd run a number of internal analytics as data is added to see whether the new billing system has choked or started showing large discrepancies compared to the old system, and send an alarm to our operations center to be investigated. I don't have any problem with this part myself.
Problem
The problem I have is that aggregate statistics are great, but I want to see if I can come up with a sort of query language where the user can enter a query for, say, the top contributors to an alarm, see which records contributed to the discrepancy, and dig in and investigate. Originally, I wanted to use a syntax similar to a Wireshark display filter, with some SQL mixed in.
Example:
    udr.bytesin > 1000 && (udr.analysis.discrepancy > 100000 || udr.analysis.discrepancy_percent > 100) && udr.started_date > '2008-11-10 22:00:44' order by udr.analysis.discrepancy DESC LIMIT 10
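For comparison, roughly the same query expressed directly in LINQ against the UdrStore sketch above would look something like this (mapping started_date onto LoginTime, the closest field in the sketch):

    var top = store.AllRecords()
        .Where(m => m.SystemA != null
                 && m.SystemA.BytesIn > 1000
                 && (m.BytesInDiscrepancy > 100000 || m.DiscrepancyPercent > 100)
                 && m.SystemA.LoginTime > new DateTime(2008, 11, 10, 22, 0, 44))
        .OrderByDescending(m => m.BytesInDiscrepancy)
        .Take(10);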
The other option would be to use DLINQ, but I've been out of the C# game for a year and a half now, so I'm not 100% up to speed on the .NET 3.5 stuff. Also, I'm not sure whether it could handle the data structure I was planning on using. The real question is: can I get any feedback on how to approach getting a query string from the user, parsing it, applying it to the data structure (which has quite a few more attributes than outlined above), and getting the resulting list back? I can handle the rest on my own.
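On the parsing side, one low-effort approach, before reaching for DLINQ or a full parser generator such as ANTLR, is to map each queryable field name to an accessor delegate and compose a predicate from the parsed comparisons. A minimal sketch, handling only &&-joined, single-space-separated numeric comparisons over the hypothetical fields above (||, parentheses, dates, ORDER BY, and LIMIT would need a real grammar):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class QueryParser
    {
        // Map query field names to accessors on the record.
        static readonly Dictionary<string, Func<MatchedRecord, long>> Fields =
            new Dictionary<string, Func<MatchedRecord, long>>
            {
                { "udr.bytesin", m => m.SystemA == null ? 0 : m.SystemA.BytesIn },
                { "udr.analysis.discrepancy", m => m.BytesInDiscrepancy },
            };

        // Parses e.g. "udr.bytesin > 1000 && udr.analysis.discrepancy > 100000"
        // into a single predicate over MatchedRecord.
        public static Func<MatchedRecord, bool> Parse(string query)
        {
            var clauses = query.Split(new[] { "&&" }, StringSplitOptions.None)
                .Select(clause =>
                {
                    var parts = clause.Trim().Split(' ');
                    var field = Fields[parts[0]];
                    var op = parts[1];
                    var value = long.Parse(parts[2]);
                    return (Func<MatchedRecord, bool>)(m =>
                        op == ">" ? field(m) > value : field(m) < value);
                })
                .ToList();
            return m => clauses.All(p => p(m));
        }
    }

The returned delegate then plugs straight into the LINQ pipeline: store.AllRecords().Where(QueryParser.Parse(userInput)).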
I am fully prepared to hard-code many of the possible queries and just have them as reports that are run with some parameters, but if there is a nice, clean way of doing this type of query syntax, I think it would be an immensely cool feature to add.
Thanks for your help.