views: 349

answers: 3

We have a system that generates many events as the result of a phone call/web request/SMS/email etc. Each of these events needs to be stored and made available for reporting (for MI/BI etc), and each event has many variables and does not fit any one specific schema.

The structure of the event document is a key-value pair list (cdr=1&name=Paul&duration=123&postcode=l21). Currently we have a SQL Server system using dynamically generated sparse columns to store our (flat) documents, and we have reports that run against that data; for many different reasons I am looking at other solutions.
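
For context, a minimal sketch of the kind of sparse-column table we use today (the table and column names here are illustrative, not our actual schema):

    -- Wide call detail record table where rarely-populated variables are
    -- SPARSE columns; the COLUMN_SET exposes a row's populated sparse
    -- columns as XML (SQL Server 2008+).
    CREATE TABLE dbo.CallDetailRecord
    (
        CdrId        bigint IDENTITY(1,1) PRIMARY KEY,
        CallStart    datetime2     NOT NULL,
        -- dynamically added, mostly-NULL variables:
        Name         nvarchar(100) SPARSE NULL,
        Duration     int           SPARSE NULL,
        Postcode     nvarchar(10)  SPARSE NULL,
        AllVariables xml COLUMN_SET FOR ALL_SPARSE_COLUMNS
    );

    -- An event such as cdr=1&name=Paul&duration=123&postcode=l21 becomes:
    INSERT INTO dbo.CallDetailRecord (CallStart, Name, Duration, Postcode)
    VALUES (SYSUTCDATETIME(), N'Paul', 123, N'l21');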

I am looking for suggestions for a system (open or closed source) that allows us to push these events in (regardless of the schema) and provides reporting and analytics on top of them.

I have seen Pentaho and Jasper, but most of them seem to connect to an existing system to get the data out of it and then report on it. I really just want to be able to push a document in and have it available to be reported on.

As much as I love CouchDB, I am looking for a system that allows schema-less submission of data and provides reporting on top of it (much like Pentaho, Jasper, SQL Server Reporting/Analysis Services etc).

+1  A: 

I don't think there is any DBMS that will do what you want and allow an off-the-shelf reporting tool to be used. Low-latency analytic systems are not quick and easy to build. Low-latency on unstructured data is quite ambitious.

You are going to have to persist the data in some sort of database, though.

I think you may have to take a closer look at your problem domain. Are you trying to run low-latency analytical reports, or an operational report that prompts some action within the business when certain events occur? For low-latency systems you need to be quite ruthless about what constitutes operational reporting and what constitutes analytics.

Edit: Discourage the 'potentially both' mindset unless the business are prepared to pay. Investment banks and hedge funds spend big bucks and purchase supercomputers to do 'real-time analytics'. It's not a trivial undertaking, and it's even less trivial when you try to build such a system for high uptime.

Even on apps like premium-rate SMS services and .com applications the business often backs down when you do a realistic scope and cost analysis of the problem. I can't say this enough. Be really, really ruthless about 'realtime' requirements.

If the business really, really need realtime analytics then you can make hybrid OLAP architectures where you have a marching lead partition on the fact table. This is an architecture where the fact table or cube is fully indexed for historical data but has a small leading partition that is not indexed and thus relatively quick to insert data into.

Analytic queries will table scan the relatively small leading data partition and use more efficient methods on the other partitions. This gives you low latency data and the ability to run efficient analytic queries over the historical data.
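
On SQL Server, where every partition of a table shares the table's indexes, one common way to approximate the unindexed lead partition is a separate heap table that gets switched in later. A sketch, with hypothetical object names throughout:

    -- Historical data lives in a fully indexed, date-partitioned fact table;
    -- current data trickles into a small heap that reports scan directly.
    CREATE PARTITION FUNCTION pfCallDay (date)
        AS RANGE RIGHT FOR VALUES ('2010-01-01', '2010-01-02', '2010-01-03');

    CREATE PARTITION SCHEME psCallDay
        AS PARTITION pfCallDay ALL TO ([PRIMARY]);

    CREATE TABLE dbo.FactCall_History   -- indexed historical partitions
    (
        CallDate   date NOT NULL,
        CallerKey  int  NOT NULL,
        DurationMs int  NOT NULL
    ) ON psCallDay (CallDate);

    CREATE CLUSTERED INDEX ix_FactCall_History
        ON dbo.FactCall_History (CallDate, CallerKey)
        ON psCallDay (CallDate);

    CREATE TABLE dbo.FactCall_Lead      -- today's lead "partition": a heap
    (
        CallDate   date NOT NULL,
        CallerKey  int  NOT NULL,
        DurationMs int  NOT NULL
    );
    GO

    -- Reports query the union; the lead table is small enough to table scan.
    CREATE VIEW dbo.FactCall AS
        SELECT CallDate, CallerKey, DurationMs FROM dbo.FactCall_History
        UNION ALL
        SELECT CallDate, CallerKey, DurationMs FROM dbo.FactCall_Lead;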

Run a process nightly that rolls over to a new leading partition and consolidates/indexes the previous lead partition.

This works well where you have items such as bitmap indexes (on databases) or materialised aggregations (on cubes) that are expensive on inserts. The lead partition is relatively small and cheap to table scan but efficient to trickle insert into. The roll-over process incrementally consolidates this lead partition into the indexed historical data which allows it to be queried efficiently for reports.
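
Continuing the same hypothetical objects, the nightly roll-over can then be a metadata-only partition switch (a real job would generate the date literals dynamically rather than hard-coding them as below):

    -- 1. Make yesterday's lead table structurally identical to the target.
    CREATE CLUSTERED INDEX ix_FactCall_Lead
        ON dbo.FactCall_Lead (CallDate, CallerKey);
    ALTER TABLE dbo.FactCall_Lead WITH CHECK
        ADD CONSTRAINT ck_Lead_Range
        CHECK (CallDate >= '2010-01-03' AND CallDate < '2010-01-04');

    -- 2. Move yesterday's rows into the indexed fact table (metadata only).
    ALTER TABLE dbo.FactCall_Lead
        SWITCH TO dbo.FactCall_History
        PARTITION $PARTITION.pfCallDay('2010-01-03');

    -- 3. Reset the (now empty) lead table and open today's boundary.
    DROP INDEX ix_FactCall_Lead ON dbo.FactCall_Lead;
    ALTER TABLE dbo.FactCall_Lead DROP CONSTRAINT ck_Lead_Range;
    ALTER PARTITION SCHEME psCallDay NEXT USED [PRIMARY];
    ALTER PARTITION FUNCTION pfCallDay() SPLIT RANGE ('2010-01-04');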

Edit 2: The common fields might be candidates to set up as dimensions on a fact table (e.g. caller, time). The less common fields are (presumably) coding. For an efficient schema you could move the optional coding into one or more 'junk' dimensions.

Briefly, a junk dimension is one that represents every existing combination of two or more codes. A row on the table doesn't relate to a single system entity; instead, each row on the dimension table corresponds to a distinct combination of coding that occurs in the raw data.

In order to have any analytic value you are still going to have to organise the data so that the columns in the junk dimension contain something consistently meaningful. This goes back to some requirements work to make sure that the mappings from the source data make sense. You can deal with items that are not always recorded by using a placeholder value such as a zero-length string (''), which is probably better than nulls.
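
As a hedged illustration (the coding columns and the staging table name below are assumptions, not a prescription), a junk dimension and its load might look like this:

    -- One row per distinct combination of the optional coding fields,
    -- with 'not recorded' mapped to a zero-length string rather than NULL.
    CREATE TABLE dbo.DimCallCoding
    (
        CallCodingKey   int IDENTITY(1,1) PRIMARY KEY,
        CallOutcome     varchar(20) NOT NULL,
        CampaignCode    varchar(20) NOT NULL,
        DispositionCode varchar(20) NOT NULL
    );

    -- Add any combinations seen in the raw data that aren't already present.
    INSERT INTO dbo.DimCallCoding (CallOutcome, CampaignCode, DispositionCode)
    SELECT DISTINCT
           COALESCE(s.CallOutcome, ''),
           COALESCE(s.CampaignCode, ''),
           COALESCE(s.DispositionCode, '')
    FROM   staging.RawCallEvents s
    WHERE  NOT EXISTS (SELECT 1
                       FROM dbo.DimCallCoding d
                       WHERE d.CallOutcome     = COALESCE(s.CallOutcome, '')
                         AND d.CampaignCode    = COALESCE(s.CampaignCode, '')
                         AND d.DispositionCode = COALESCE(s.DispositionCode, ''));

    -- The fact table then carries a single CallCodingKey foreign key in
    -- place of the individual coding columns.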

ConcernedOfTunbridgeWells
Potentially both - at the moment there is a need for low-latency analytics and reporting (the business calls it realtime, but we know it is not). Operationally, however, we can make decisions based on very recent actions in the "other" database, which is truncated very frequently. Unfortunately, the same data that is available for operational decisions also needs to be available to our clients via analytics.
Kinlan
At the moment I think it will be two-phase: a DB (essentially call detail records) dedicated to storing the data, however that may be done, and a reporting/analytics system.
Kinlan
Ok, cool, thanks for the edits. If I can ignore the real-time aspect for now: we have a case where the facts and also the dimensions can be "dynamic", for want of a better word - that is, there are no fixed schemas (apart from a few common variables) - so what I am looking for is something that can handle this. At the moment we have a single call detail record table where each row represents one call and the columns represent the variables set in that call (the columns can be created dynamically and may not be populated 99% of the time).
Kinlan
See the second edit for a discussion about junk dimensions with some fanout. This is probably the best way to maintain a large body of coding fields.
ConcernedOfTunbridgeWells
A: 

Are these systems schema-less, or is it just that the schemas have not been determined? Do the schemas change frequently?

John Saunders
The schema is entirely defined by the client, by the application, and by the in-call scripts that we use. That is, the calls we make have a very high chance of containing very different columns. It is very easy to see that we will have 1000+ different measurements taken at the end of a phone call.
Kinlan
I guess I haven't worked with this sort of problem before. You've actually got multiple schemas. Even if, in order to store the data quickly, you've got a simple key/value DB, to do any reporting, each client must have a schema - and different schemas for each client, I presume!
John Saunders
Depending on the campaign a client might be running and the "applications" they might be running at any one time there is a schema, but this schema might only exist for 20 minutes before the client changes it based on their demand.
Kinlan
I wouldn't put it quite so brutally (I've exceeded my quota already for the day). However, this sounds pretty accurate: insufficient analysis meets excessive requirements. Send them a spreadsheet.
John Saunders
I would - this is an object lesson in dealing with requirements generated by sales people. This particular requirement is a perennial one that comes up on a semi-regular basis when market survey firms try to do analytics. Spreadsheets are the solution.
ConcernedOfTunbridgeWells
We are sending them spreadsheets and we will keep sending them spreadsheets, but we have to have a system that can handle 300+ variables coming from an individual call, all of which will be pertinent to a point. It is not necessarily the case that the client will be using these variables; we will be optimising the call scripts based on this data.
Kinlan
The system will be there to help us help the clients as well as potentially provide the reporting that they need; I don't believe that it is a client requirements problem.
Kinlan
A: 

Now I think I see the underlying requirements. This is an online or phone survey application with custom surveys. The way to deal with this requirement is to fob the analytics off onto the client. No online tool will let you turn around schema changes in 20 minutes.

I've seen this type of requirement before and it boils down to the client wanting to do some stats on a particular survey. If you can give them a CSV based on the fields (i.e. with named header columns) in their particular survey they can import it into excel and pivot it from there.

This should be fairly easy to implement from a configurable online survey system as you should be able to read the survey configuration. The client will be happy that they can play with their numbers in Excel as they don't have to get their head around a third party tool. Any competent salescritter should be able to spin this to the client as a good thing. You can use a spiel along the lines of 'And you can use familiar tools like Excel to analyse your numbers'. (or SAS if they're that way inclined)
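
A hedged sketch of how that export could be driven from the survey configuration (the SurveyField metadata table, the SurveyId column and the CDR table name are all assumptions about how such a system might be laid out):

    -- Build "SELECT <only this survey's fields> FROM ..." from configuration
    -- metadata, then hand the result set to whatever writes the CSV.
    DECLARE @SurveyId int = 42;        -- hypothetical survey/script id
    DECLARE @cols nvarchar(max), @sql nvarchar(max);

    SELECT @cols = STUFF((
            SELECT ', ' + QUOTENAME(FieldName)
            FROM   dbo.SurveyField
            WHERE  SurveyId = @SurveyId
            ORDER  BY FieldName
            FOR XML PATH('')), 1, 2, '');

    SET @sql = N'SELECT ' + @cols +
               N' FROM dbo.CallDetailRecord WHERE SurveyId = @SurveyId;';

    EXEC sp_executesql @sql, N'@SurveyId int', @SurveyId = @SurveyId;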

Wrap the exporter in a web page so they can download it themselves and get up-to-date data.

Note that the wheels will come off if you have data volumes of more than 65,535 respondents per survey, as this won't fit onto a spreadsheet tab (Excel 2007 increases this limit to 1,048,575). However, surveys with this volume of response will probably be in the minority. One possible workaround is to provide a means to get random samples of the data that are small enough to work with in Excel.
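
For what it's worth, the sampling workaround can be as simple as capping the export on the database side (again with hypothetical names):

    -- Random sample small enough for a spreadsheet tab.
    SELECT TOP (50000) *
    FROM   dbo.CallDetailRecord
    WHERE  SurveyId = 42               -- hypothetical survey/script id
    ORDER  BY NEWID();                 -- random, but forces a scan and sort

    -- Cheaper, approximate alternative on very large tables:
    -- SELECT * FROM dbo.CallDetailRecord TABLESAMPLE (50000 ROWS);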

Edit: I don't think there are other solutions that are sufficiently flexible for this type of application. You've described a holy grail of survey statistics.

I still think that the basic strategy is to give them a data dump. You can pre-package it to some extent by using OLE automation to construct a pivot table and deliver something partially digested. The API for pivot tables in Excel is a bit hairy but this is certainly feasible - I have written VBA code that programmatically creates pivot tables in the past, so I can say from personal experience that it can be done.

The problem becomes a bit more complex if you want to compute and report distributions of (say) response times, as you have to construct the displays. You can programmatically construct pivot charts if necessary, but automating report construction through Excel in this way will be a fair bit of work.
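
One way to take some of that pain away is to pre-bin the distribution on the database side, so Excel (or whatever renders the report) only has to chart the result. A sketch with assumed column names:

    -- Histogram of response times in 500 ms buckets for one survey/script.
    -- Integer division on ResponseTimeMs does the bucketing.
    SELECT (ResponseTimeMs / 500) * 500 AS BucketStartMs,
           COUNT(*)                     AS Calls
    FROM   dbo.CallDetailRecord
    WHERE  SurveyId = 42               -- hypothetical survey/script id
      AND  ResponseTimeMs IS NOT NULL
    GROUP  BY (ResponseTimeMs / 500) * 500
    ORDER  BY BucketStartMs;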

You might get some mileage from R (www.r-project.org), as you can construct a framework that lets you import data and generate bespoke reports with a bit of R code. It is not an end-user tool, but your client base sounds like they want canned reports anyway.

ConcernedOfTunbridgeWells
It is close; it is an IVR system that can run any script, not just surveys. For instance, a basic script may measure up to 500 variables from the call, including the time to respond to a prompt for a question, the result of the question, etc. At a basic level, yes, we already provide a CSV result table, but with hardcoded columns (it is really unmanageable). An Entity-Attribute-Value schema is not performant, so we discounted that - which is why we settled on SQL Server dynamic columns (through sparse columns), and I am trying to determine if there are other solutions :)
Kinlan
And the other point (without sounding rude to our clients) is that they either can't analyse the data we already present, so we have to manage the reporting for them in summarised form, or they simply don't want to have to manipulate the CSV export we already provide.
Kinlan
(The following verges on exceeding my bluntness quota for the day): If they can't analyze their data, then what's their deal? Do you provide the domain expertise they seem not to have? Do you know what their scripts are about? Really, at some point, someone has to figure out what they want and what it means. If that's constantly changing, then maybe some process changes are required as well, but GIGO.
John Saunders
John, the main thrust I am trying to get at is: are there systems suited to storing and analysing schema-less data? That is, where the schema is not fixed. Most analysis systems I have seen seem to have a very specific schema fixed against mostly slowly changing systems. I am not asking people to question the choice of what data we take on a call and provide to our client base. We will use the data if our clients don't; we do provide domain expertise, and the breadth of the client base means that much of it can't be supplied across all customers - many customers have vastly different calling scripts.
Kinlan
John: However rspec looks interesting.
Kinlan
I think this would be a good time for me to just shut up. The fact is that I'm very skeptical that this is a schema-less database. It sounds like a requirement-less database. The customers may have an excellent grasp of their internal data requirements, but it seems that by the time it gets to IVR, the details are lost and all you have is "300 variables". This seems like an issue with the process. I can imagine it may be endemic to your industry, but it's a lot like what I've seen in other industries.
John Saunders
John, seriously: the question and the issue have nothing to do with the requirements or the process by which we generate the IVR. I specifically tried not to go into detail about our application because I knew there would be this type of discussion about why I want a "schema-less" database; we have the variables because they are precise measurements of every facet of a call. The question was clear and concise: essentially, are there any datastores that are suited to schema-less data and suited for analytics? I am trying to stay away from the "always use a traditional RDBMS" mindset.
Kinlan
I do appreciate the time you spent on this subject.
Kinlan
Optimising - this is the key. I'll still argue that you want to look at statistics software like R. In this case you probably want to think of an event-based model and possibly a data mining tool - this does sound like a data mining application in disguise.
ConcernedOfTunbridgeWells
For sure - everything that happens in a call is the result of an event occurring, and there are events that only ever occur when another event occurs. The system essentially produces one large detail record at the end of the call. (We do have in-call events that can be logged separately, but that is essentially a different system, and might be a different problem domain too.)
Kinlan
Presumably this record has timestamps of various events or (as you say) response times to certain questions - if so, it can probably be transformed into a time series where that is relevant, along with whatever static variables are necessary.
ConcernedOfTunbridgeWells