views: 234
answers: 3

Hi all,

First time posting to a Q&A site, but I have a fairly complex problem I've been looking at for a few days.

Background

At work we're implementing a new billing system, and we want to take the unprecedented step of auditing it on an ongoing basis against the old system, which is significantly more robust. The new billing system is a lot more flexible for our new rate plans, so marketing is really pushing us to get it in place.

We had our IT group develop, for a ridiculous amount of money, a report that runs at 8 AM each morning against the previous day's data, compares records for byte-count discrepancies, and generates a report. This isn't very useful for us: for one, it only runs the next day, and secondly, if it shows bad results, we have no indication of why we had a problem the day before.

So we want to build our own system that hooks into any possible data source (at first, only the new and old systems' User Data Records (UDRs)) and compares the results in near real-time.

Just some notes on the scale: each billing system produces roughly 6 million records per day, at a total file size of about 1 GB.

My proposed set-up

Essentially, buy some servers. We have budget for several 8-core / 32 GB RAM machines, so I'd like to do all the processing and storage in in-memory data structures. We can buy bigger servers if necessary, but after a couple of days I don't see any reason to keep the data in memory any longer; it would be written out to persistent storage, with aggregate statistics stored in a database.

Each record essentially contains a record-id from the platform, correlation-id, username, login-time, duration, bytes-in, bytes-out, and a few other fields.

I was thinking of using a fairly complex data structure for processing. Each record would be broken into a user object and a record object belonging to either platform A or platform B. At the top level would be a self-balancing binary search tree keyed on the username. The next level would be something like a skip list based on date, so we would have links to the next matched record, next hour, next day, next month, next year, and so on. Finally, there would be the matched-record object itself: essentially just a holder which references the UDR record object from system A and the UDR record object from system B.
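
For concreteness, here's a rough C# sketch of what I have in mind (the names are just illustrative, and SortedDictionary stands in for a hand-rolled self-balancing tree):

using System;
using System.Collections.Generic;

// Minimal stub; the real record also carries record-id, correlation-id, etc.
class UdrRecord {
    public string Username { get; set; }
    public DateTime LoginTime { get; set; }
    public long BytesIn { get; set; }
    public long BytesOut { get; set; }
}

// Holder pairing the same logical record from both platforms
class MatchedRecord {
    public UdrRecord SystemA { get; set; }
    public UdrRecord SystemB { get; set; }
}

// Skip-list-style node: forward links at several time granularities
class TimeNode {
    public DateTime When { get; set; }
    public MatchedRecord Match { get; set; }
    public TimeNode NextRecord, NextHour, NextDay, NextMonth, NextYear;
}

class MatchIndex {
    // Top level: self-balancing tree keyed on username
    // (SortedDictionary is red-black-tree-backed in .NET)
    private readonly SortedDictionary<string, TimeNode> byUser =
        new SortedDictionary<string, TimeNode>();
}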

I'd run a number of internal analytics as data is added to see if the new billing system has choked or started showing large discrepancies compared to the old system, and send an alarm to our operations center to be investigated. I don't have any problem with this part myself.

Problem

Aggregate statistics are great, but I want to see if I can come up with a sort of query language where the user can enter a query for, say, the top contributors to an alarm, see which records contributed to the discrepancy, and dig in and investigate. Originally I wanted to use a syntax similar to a Wireshark filter, with some SQL mixed in.

Example:

udr.bytesin > 1000 && (udr.analysis.discrepancy > 100000 || udr.analysis.discrepancy_percent > 100) && udr.started_date > '2008-11-10 22:00:44' order by udr.analysis.discrepancy DESC LIMIT 10

The other option would be to use DLINQ, but I've been out of the C# game for a year and a half now, so I'm not 100% up to speed on the .NET 3.5 stuff. I'm also not sure whether it could handle the data structure I was planning on using. The real question is: can I get any feedback on how to approach taking a query string from the user, parsing it, applying it to the data structure (which has quite a few more attributes than outlined above), and getting the resulting list back? I can handle the rest on my own.

I am fully prepared to hard-code much of the possible queries and just have them as reports that are run with some parameters, but if there is a nice, clean way of doing this type of query syntax, I think it would be an immensely cool feature to add.

Thanks for your help.

+1  A: 

Actually, for the above type of query, the dynamic LINQ stuff is quite a good fit. Otherwise you'll have to write pretty much the same thing anyway: a parser, and a mechanism for mapping it to attributes. Unfortunately it isn't an exact hit, since you need to split out things like OrderBy, and dates need to be parameterized - but here's a working example:

// Requires the Dynamic LINQ sample source (System.Linq.Dynamic)
// for the string-based Where/OrderBy overloads
using System;
using System.Linq;
using System.Linq.Dynamic;

class Udr { // formatted for space
    public int BytesIn { get; set; }
    public UdrAnalysis Analysis { get; set; }
    public DateTime StartedDate { get; set; }
}
class UdrAnalysis {
    public int Discrepancy { get; set; }
    public int DiscrepancyPercent { get; set; }
}
static class Program {
    static void Main() {
        Udr[] data = new[] {
              new Udr { BytesIn = 50000, StartedDate = DateTime.Today,
                 Analysis = new UdrAnalysis { Discrepancy = 50000, DiscrepancyPercent = 130 } },
              new Udr { BytesIn = 500, StartedDate = DateTime.Today,
                 Analysis = new UdrAnalysis { Discrepancy = 50000, DiscrepancyPercent = 130 } }
        };
        // Dates must be passed as parameters (@0) rather than written inline
        DateTime when = DateTime.Parse("2008-11-10 22:00:44");
        var query = data.AsQueryable().Where(
            @"bytesin > 1000 && (analysis.discrepancy > 100000
                || analysis.discrepancypercent > 100)
                && starteddate > @0", when)
            .OrderBy("analysis.discrepancy DESC") // ordering is split out separately
            .Take(10);
        foreach (var item in query) {
            Console.WriteLine(item.BytesIn);
        }
    }
}

Of course, you could take the dynamic LINQ sample and customize the parser to do more of what you need...

Marc Gravell
+1  A: 

Whether you use DLINQ or not, I suspect that you'll want to use LINQ somewhere in the solution, because it provides so many bits of what you want.

How much protection do you need from your users, and how technical are they? If this is only for a few very technical internal staff (e.g. people who are already developers) then you could just let them write a C# expression, use CSharpCodeProvider to compile the code, and then apply it to your data.

Obviously this requires your users to be able to write C# - or at least just enough of it for a query expression - and it requires that you trust them not to trash the server. (You can load the code into a separate AppDomain, give it low privileges and tear down the AppDomain after a timeout, but that sort of thing is complicated to achieve - and you don't really want huge amounts of data crossing an AppDomain boundary.)
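
For example, a rough sketch of the compile-an-expression approach (the wrapper class and method names here are mine, not a real API; it assumes a public Udr type in the global namespace, and does no sandboxing):

using System;
using System.CodeDom.Compiler;
using System.Reflection;
using Microsoft.CSharp;

static class UserQueryCompiler {
    // Compiles a user-supplied boolean C# expression into a predicate
    public static Func<Udr, bool> Compile(string expression) {
        string source =
            "public static class UserQuery {" +
            "    public static bool Matches(Udr udr) { return " + expression + "; }" +
            "}";
        var options = new CompilerParameters { GenerateInMemory = true };
        // Let the generated code see the Udr type
        options.ReferencedAssemblies.Add(typeof(Udr).Assembly.Location);
        var provider = new CSharpCodeProvider();
        CompilerResults results = provider.CompileAssemblyFromSource(options, source);
        if (results.Errors.HasErrors) {
            throw new ArgumentException(results.Errors[0].ErrorText);
        }
        MethodInfo matches = results.CompiledAssembly
            .GetType("UserQuery").GetMethod("Matches");
        return udr => (bool)matches.Invoke(null, new object[] { udr });
    }
}

You'd then call something like UserQueryCompiler.Compile("udr.BytesIn > 1000") and pass the resulting delegate to Where.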

Jon Skeet
A: 

On the subject of LINQ in general - again, a good fit due to your size issues:

Just some notes on the scale: each billing system produces roughly 6 million records per day, at a total file size of about 1 GB.

LINQ can be used fully with streaming solutions. For example, your "source" could be a file reader. The Where clause would then iterate over the data, checking individual rows, without having to buffer the entire thing in memory:

    // Assumes a trivial Foo type, e.g.
    //   class Foo { public string Name { get; set; } public int Size { get; set; } }
    static IEnumerable<Foo> ReadFoos(string path) {
        // Parse each pipe-delimited line into a Foo as it is read
        return from line in ReadLines(path)
               let parts = line.Split('|')
               select new Foo { Name = parts[0],
                   Size = int.Parse(parts[1]) };
    }
    static IEnumerable<string> ReadLines(string path) {
        using (var reader = File.OpenText(path)) {
            string line;
            // Yield one line at a time; the file is never fully buffered
            while ((line = reader.ReadLine()) != null) {
                yield return line;
            }
        }
    }

This is now lazy loading... we only read one line at a time. You'll need to use AsQueryable() to use it with dynamic LINQ, but it stays lazy.
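
For example, combining the streaming reader above with the string-based dynamic LINQ syntax (illustrative only; the file name is made up):

// Streams the file and applies a dynamic-LINQ filter lazily
var matches = ReadFoos("data.txt")
    .AsQueryable()
    .Where("size > 1000")
    .OrderBy("size DESC") // note: sorting does have to buffer the matches
    .Take(10);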

If you need to perform multiple aggregates over the same data, then Push LINQ is a good fit; this works particularly well if you need to group data, since it doesn't buffer everything.
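
To illustrate the single-pass idea (this is a hand-rolled sketch, not the Push LINQ API; ReadUdrs is a hypothetical streaming reader like ReadFoos above):

// Several aggregates computed in one pass over the stream -
// the pattern Push LINQ generalizes with composable futures
long count = 0, totalBytesIn = 0, bigDiscrepancies = 0;
foreach (var udr in ReadUdrs("udr.dat")) {
    count++;
    totalBytesIn += udr.BytesIn;
    if (udr.Analysis.Discrepancy > 100000) bigDiscrepancies++;
}
Console.WriteLine("{0} records, {1} bytes in, {2} large discrepancies",
    count, totalBytesIn, bigDiscrepancies);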

Finally - if you want binary storage, serializers like protobuf-net can be used to create streaming solutions. At the moment it works best with the "push" approach of Push LINQ, but I expect I could invert it for regular IEnumerable<T> if needed.
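
As a sketch of what that inversion might look like from the consuming side (assumptions: records were written with SerializeWithLengthPrefix, Udr carries protobuf-net's [ProtoContract]/[ProtoMember] attributes, and DeserializeWithLengthPrefix returns null at end of stream):

using System.Collections.Generic;
using System.IO;
using ProtoBuf;

static IEnumerable<Udr> ReadUdrs(Stream source) {
    while (true) {
        // Assumption: returns null once the stream is exhausted
        Udr udr = Serializer.DeserializeWithLengthPrefix<Udr>(
            source, PrefixStyle.Base128);
        if (udr == null) yield break;
        yield return udr;
    }
}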

Marc Gravell
Have you integrated Push LINQ directly with protobuf-net then? My port has a MessageStreamIterator which implements IEnumerable<TMessage> for whatever message type you're interested in - it works fine for normal LINQ. Works with Push LINQ in the normal way too, of course.
Jon Skeet
It isn't directly integrated, no; however, there is a sample in the repo of doing this. Basically, the deserialization supports sequences via IEnumerable<T> and Add(T) (as an alternative to IList<T>, etc.), and it is trivial to create a Push LINQ feed where the Add pushes a value through Push LINQ.
Marc Gravell
What I have yet to do is to make the deserialization produce an IEnumerable<T> directly...
Marc Gravell