views:

517

answers:

4

I have various time-series I'd like to correlate and present as either a csv-file or in-memory datatable (.NET). These time-series are arrays of time-value-pairs (actually these are objects containing more than just time and value). The time-series may span across different overlapping periods, and some might even have holes (missing values for given timestamps).

For those interested, I'm using the OPC HDA .NET library to extract historic time-series from an OPC HDA server.

The resulting datatable should have one column for each time-series all in chronological order based on a timestamp column. See example below:

|-------|-------|-------|-------|-------|
   TIME    TS1     TS2     TS3     TS4
|-------|-------|-------|-------|-------|
    1       X               X       X
|-------|-------|-------|-------|-------|
    2       X       X       X       X
|-------|-------|-------|-------|-------|
    3       X       X               X
|-------|-------|-------|-------|-------|
    4       X       X       X 
|-------|-------|-------|-------|-------|
    5       X       X       X 
|-------|-------|-------|-------|-------|

What would be the most effective way of achieving this? With "effective" I mean with the least amount of code. But considering that the timeseries could become quite large, memory usage might also be an issue.

+2  A: 

You might go with a data structure like a nested dictionary and iterate over the contents:

Dictionary <TimeSeries, Dictionary<DateTime, Value>> dict = new Dictionary<TimeSeries, Dictionary<DateTime, Value>>();

foreach (TimeSeries series in dict.Keys) {

    //table row output code goes here
    Dictionary<DateTime, Value> innerDict = dict[series];
    foreach (DateTime date in innerDict.Keys) {
        Value seriesValueAtTimeT = innerDict[date];
        //table column output code goes here
    }
}

Where your output code is writing out to something else, depending on your needs, and you replace the datatypes TimeSeries, Value, etc., with your actual data types.

Doug R
Thanks, Doug R. This pointed me in a direction that works for me.I'm using the SortedDictionary and Dictionary collections as follow:SortedDictionary<DateTime, Dictionary<string, double>> where datetime is the timestamp and the nested dictionary contains all the values from the different time-series with that timestamp. This way I can easily iterate over the chronological list of timestamps, get all the values, and shove the data line by line into a DataTable or text file.
chriscena
A: 

One thing you can do if memory usage becomes a problem is to move from tracking single events to tabulated counts of an event within a given time period. You lose some precision about exactly when things happened, but often summarizing your data like this can reduce the complexity of your picture and make trends more obvious.

In case it's not obvious, I mean taking data that looks like this:

12:00 event1
12:01 event2
12:10 event1
12:11 event1

to this:

12:00-12:15 event1 3
12:00-12:15 event2 1
James Thompson
+1  A: 

You can first scan all present series for the distinct values (for example, aggregating them in a HashSet), then simply dump them into an array of dates (storing a match between date and index position in a dictionary).

var distinctDates = allSeries
  .SelectMany(s => s.Values.Select(v => v.Date))
  .Distinct()
  .OrderBy(d => d)
  .ToArray();

var datePositions = distinctDates
  .Select((d,index) => new 
    {
      Date = d,
      Index = index
    }).
  .ToDictionary(x => x.Date, x => x.Index);

Then, create a jagged array that has width of "NumberOfSeries" and length of "NumberOfDates". After that, do a second scan of all the data and dump them to their positions.

var values = new float[allSeries.Length][];
for (var i=0;i<allSeries.Length;i++)
{
  values[i] = new float[distinctDates.Length];
  var currentSerie = allSeries[i];
  foreach(var value in currentSerie.Values)
  {
    var index = datePositions[value.Date];
    values[i][index] = value.Value;
  }      
}

I wrote this code without touching VisualStudio, so I may have a few typos. Or there may be used a few LINQ methods that are not present in the .NET (just look in Lokad.Shared.dll). But you should be able to get the idea.

Some more notes, while I'm at the topic:

  1. Go for the jagged array, if you have to keep everything in the memory at once. It is way more efficient than a dictionary and has a lot less memory problems than a rectangular array.

  2. Keep Value objects as small as possible (i.e.: float instead of double).

  3. If number of time serie values is expected to go large in the future, then never store values in database in a "one row per value". It is recommended to either go for something like HDF (which has a .NET interface) or use persist time serie fragments in binary form in DB (as in time serie databases)

Sticking to these should allow you to scale up to hundreds of millions of time values without a lot of problems (done that).

Rinat Abdullin
Thanks for your suggestion, but in this particular project I'm stuck with .NET 2.0 for now.
chriscena
You can use all that in .NET 2.0 http://abdullin.com/how-to-use-net-35-syntax-and-compiler-features-for-net-20/
Rinat Abdullin
+1  A: 

Hey Chris. I realize you've already accepted an answer, but thought I'd post a solution I use. Hopefully it's of some use to someone. If not at least it provides a place for me to find it sometime in the future. :-)

This is VBA code fired off directly from an Excel 2007 code module. It could be easily converted to .Net.

The key for the data manipulation is the pivot table object. I've found it to be massively efficient at getting the data into the layout you've specified.

Sub GetIndexData ()
Dim cn as ADODB.Connection, cmd As ADODB.Command, rs As ADODB.Recordset
Dim rPivotTopLeft As Range, rPivotBottomRight As Range

Application.ScreenUpdating = False
Application.Calculation = xlCalculationManual

'Get the data.'
Set cn = New ADODB.Connection
With cn
  .Provider = "SQLOLEDB"
  .ConnectionString = "Database=" & mDBName & ";" & _
                      "Server=" & mDBServerName & ";" & _
                      "UID=" & mDBUserID & ";" & _
                      "Password=" & mDBPassword & ";" & _
                      "Persist Security Info=True;"
  .CursorLocation = adUseClient
  .Open
End With

Set cmd = New ADODB.Command
Set rs = New ADODB.Recordset
With cmd
  .ActiveConnection = adoTools.DBConnection
  .CommandType = adCmdText
  .CommandText = "SELECT YourData From YourSource WHERE YourCritera"
  Set rs = .Execute
End With



If Not (rs.BOF And rs.EOF) Then 'Check that we have some data.'

'Put the data into a worksheet.'
With wsRawData
  .Cells.CurrentRegion.Clear

  Set rPivotTopLeft = .Range("A1")
  With ThisWorkbook.PivotCaches.Add(SourceType:=xlExternal)
    Set .Recordset = rs
    .CreatePivotTable _
        TableDestination:=rPivotTopLeft, _
        TableName:="MyPivotTable"
  End With

  'Massage the data into the desired layout.'
  With .PivotTables("MyPivotTable")
    .ManualUpdate = True

    .PivotFields("Date").Orientation = xlRowField
    .PivotFields("Index").Orientation = xlColumnField
    .AddDataField .PivotFields("Return"), "Returns", xlSum

    .DisplayFieldCaptions = False
    .ColumnGrand = False
    .RowGrand = False

    .ManualUpdate = False
  End With

  mMonthCount = Range(.Range("A3"), .Cells(Rows.Count, "A").End(xlUp)).Count
  mIndexCount = Range(.Range("B2"), .Cells(2, Columns.Count).End(xlToLeft)).Count

  'Convert pivot table to values.'
  Set rPivotBottomRight = .Cells(mMonthCount + 2, mIndexCount + 1)
  With .Range(rPivotTopLeft, rPivotBottomRight)
    .Copy
    .PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks:=False, Transpose:=False
  End With

  'Format the worksheet.'
  .Range("A3").Resize(mMonthCount, 1).NumberFormat = "mmm-yy"
  .Range("B3").Resize(mMonthCount, mIndexCount).NumberFormat = "0.00%"
  Union(.Rows(2), .Columns(1)).Font.Bold = True
  .Cells.ColumnWidth = 7.14
  .Rows(1).Delete

End With


rs.close
Set rs = Nothing
cmd.ActiveConnection = Nothing
Set cmd = Nothing
cn.close
Set cn = Nothing

End Sub

From there its relatively easy to leverage the built in excel regression statistics to output a correlation matrix. With this technique I'm producing a worksheet with a 600x600 matrix of correlations in about 45 seconds.

Note that the .PivotFields parameters should be changed to fit the column names of your data from your data source.

GollyJer