views:

821

answers:

2

Background: I have a sequence of contiguous, time-stamped data. The data-sequence has holes in it, some large, others just a single missing value.
Whenever the hole is just a single missing value, I want to patch the holes using a dummy-value (larger holes will be ignored).

I would like to use lazy generation of the patched sequence, and I am thus using Seq.unfold.

I have made two versions of the method to patch the holes in the data.

The first consumes the sequence of data with holes in it and produces the patched sequence. This is what i want, but the methods runs horribly slow when the number of elements in the input sequence rises above 1000, and it gets progressively worse the more elements the input sequence contains.

The second method consumes a list of the data with holes and produces the patched sequence and it runs fast. This is however not what I want, since this forces the instantiation of the entire input-list in memory.

I would like to use the (sequence -> sequence) method rather than the (list -> sequence) method, to avoid having the entire input-list in memory at the same time.

Questions:

1) Why is the first method so slow (getting progressively worse with larger input lists) (I am suspecting that it has to do with repeatedly creating new sequences with Seq.skip 1, but I am not sure)

2) How can I make the patching of holes in the data fast, while using an input sequence rather than an input list?

The code:

open System

// Method 1 (Slow)
let insertDummyValuesWhereASingleValueIsMissing1 (timeBetweenContiguousValues : TimeSpan) (values : seq<(DateTime * float)>) =
    let sizeOfHolesToPatch = timeBetweenContiguousValues.Add timeBetweenContiguousValues // Only insert dummy-values when the gap is twice the normal
    (None, values) |> Seq.unfold (fun (prevValue, restOfValues) ->  
        if restOfValues |> Seq.isEmpty then
            None // Reached the end of the input seq
        else
            let currentValue = Seq.hd restOfValues
            if prevValue.IsNone then
                Some(currentValue, (Some(currentValue), Seq.skip 1 restOfValues  )) // Only happens to the first item in the seq
            else
                let currentTime = fst currentValue
                let prevTime = fst prevValue.Value
                let timeDiffBetweenPrevAndCurrentValue = currentTime.Subtract(prevTime)
                if timeDiffBetweenPrevAndCurrentValue = sizeOfHolesToPatch then
                    let dummyValue = (prevTime.Add timeBetweenContiguousValues, 42.0) // 42 is chosen here for obvious reasons, making this comment superfluous
                    Some(dummyValue, (Some(dummyValue), restOfValues))
                else
                    Some(currentValue, (Some(currentValue), Seq.skip 1 restOfValues))) // Either the two values were contiguous, or the gap between them was too large to patch

// Method 2 (Fast)
let insertDummyValuesWhereASingleValueIsMissing2 (timeBetweenContiguousValues : TimeSpan) (values : (DateTime * float) list) =
    let sizeOfHolesToPatch = timeBetweenContiguousValues.Add timeBetweenContiguousValues // Only insert dummy-values when the gap is twice the normal
    (None, values) |> Seq.unfold (fun (prevValue, restOfValues) ->  
        match restOfValues with
        | [] -> None // Reached the end of the input list
        | currentValue::restOfValues -> 
            if prevValue.IsNone then
                Some(currentValue, (Some(currentValue), restOfValues  )) // Only happens to the first item in the list
            else
                let currentTime = fst currentValue
                let prevTime = fst prevValue.Value
                let timeDiffBetweenPrevAndCurrentValue = currentTime.Subtract(prevTime)
                if timeDiffBetweenPrevAndCurrentValue = sizeOfHolesToPatch then
                    let dummyValue = (prevTime.Add timeBetweenContiguousValues, 42.0) 
                    Some(dummyValue, (Some(dummyValue), currentValue::restOfValues))
                else
                    Some(currentValue, (Some(currentValue), restOfValues))) // Either the two values were contiguous, or the gap between them was too large to patch

// Test data
let numbers = {1.0..10000.0}
let contiguousTimeStamps = seq { for n in numbers -> DateTime.Now.AddMinutes(n)}

let dataWithOccationalHoles = Seq.zip contiguousTimeStamps numbers |> Seq.filter (fun (dateTime, num) -> num % 77.0 <> 0.0) // Has a gap in the data every 77 items

let timeBetweenContiguousValues = (new TimeSpan(0,1,0))

// The fast sequence-patching (method 2)
dataWithOccationalHoles |> List.of_seq |> insertDummyValuesWhereASingleValueIsMissing2 timeBetweenContiguousValues |> Seq.iter (fun pair -> printfn "%f %s" (snd pair) ((fst pair).ToString()))

// The SLOOOOOOW sequence-patching (method 1)
dataWithOccationalHoles |> insertDummyValuesWhereASingleValueIsMissing1 timeBetweenContiguousValues |> Seq.iter (fun pair -> printfn "%f %s" (snd pair) ((fst pair).ToString()))
+15  A: 

Any time you break apart a seq using "Seq.hd" and "Seq.skip 1" you are almost surely falling into the trap of going O(N^2). IEnumerable<T> is an awful type for recursive algorithms (including e.g. Seq.unfold), since these algorithms almost always have the structure of 'first element' and 'remainder of elements', and there is no efficient way to create a new IEnumerable that represents the 'remainder of elements'. (IEnumerator<T> is workable, but its API programming model is not so fun/easy to work with.)

If you need the original data to 'stay lazy', then you should use a LazyList (in the F# PowerPack). If you don't need the laziness, then you should use a concrete data type like 'list', which you can 'tail' into in O(1).

(You should also check out

http://stackoverflow.com/questions/898988/avoiding-stack-overflow-with-f-infinite-sequences-of-sequences

as an FYI, though it's only tangentially applicable to this problem.)

Brian
Brian, do I understand you correctly, that the process of creating a new sequence from an existing one (e.g. let seq2 = Seq.skip 1 seq1) is an expensive operation? (I would have assumed that it was O(1) )If it is expensive, why is that? (I was under the impression that sequences are only evaluated as needed?)
Treefrog
Well, constructing it is actually fast: O(1). But then evaluating its first element means creating an enumerator for the original sequence, computing its first value, throwing it away, computing the next value, and then yielding that. Thus two "Seq.skip 1"s yield a seq that, when evaluated will: create enumerator over inner, which creates enumerator over orig, computes first value, throws away, yields next value to inner, which inner then throws away, computes one more value, and yields that. Each nested Seq.skip 1 adds even more work, resulting in O(N) time and space.
Brian
Thank you for taking the time to reply Brian!
Treefrog
This would explain why F# provides a Seq.head function, but not Seq.tail.
Javaman59
+10  A: 

Seq.skip constructs a new sequence. I think that is why your original approach is slow.

My first inclination is to use a sequence expression and Seq.pairwise. This is fast and easy to read.

let insertDummyValuesWhereASingleValueIsMissingSeq (timeBetweenContiguousValues : TimeSpan) (values : seq<(DateTime * float)>) =
  let sizeOfHolesToPatch = timeBetweenContiguousValues.Add timeBetweenContiguousValues // Only insert dummy-values when the gap is twice the normal
  seq {
    yield Seq.hd values
    for ((prevTime, _), ((currentTime, _) as next)) in Seq.pairwise values do
      let timeDiffBetweenPrevAndCurrentValue = currentTime.Subtract(prevTime)
      if timeDiffBetweenPrevAndCurrentValue = sizeOfHolesToPatch then
        let dummyValue = (prevTime.Add timeBetweenContiguousValues, 42.0) // 42 is chosen here for obvious reasons, making this comment superfluous
        yield dummyValue
      yield next
  }
Jason
+1: When I was learning F#, I got into the functional programming groove by eliminating all imperative constructs. I watched my code's readability take a nosedive using Seq.unfold rather than the infinitely simpler "loop and ref" approach.
Juliet
Jason, this is the solution that I was looking for. My initial inclination when writing the method was to use yield (I come from a C# background), but as I have no F#-book (waiting for Don Syme's december release) I could not figure out how F# employs yield, so I went with Seq.unfold.
Treefrog
@TreeFrog. even better f# has `yield!` which is the `yield foreach`I've been wishing they would add to c#
ShuggyCoUk
@ShuggyCoUk. Thanks for letting me know Shuggy, I will look into "yield!"
Treefrog
@Jason: "Seq.skip constructs a new sequence" -- looking at the implementation (via Reflector) this does not seem to be the case. The iterator that Seq.skip returns encapsulates the passed iterator, albeit a few elements have been iterated. There is no copying of the sequence.
Richard