views:

154

answers:

2

I'm writing a parser in F#, and it needs to be as fast as possible (I'm hoping to parse a 100 MB file in less than a minute). As normal, it uses mutable variables to store the next available character and the next available token (i.e. both the lexer and the parser proper use one unit of lookahead).

My current partial implementation uses local variables for these. Since closure variables can't be mutable (anyone know the reason for this?) I've declared them as ref:

let rec read file includepath =
    let c = ref ' '
    let k = ref NONE
    let sb = new StringBuilder()
    use stream = File.OpenText file

    let readc() =
        c := stream.Read() |> char
    // etc

I assume this has some overhead (not much, I know, but I'm trying for maximum speed here), and it's a little inelegant. The most obvious alternative would be to create a parser class object and have the mutable variables be fields in it. Does anyone know which is likely to be faster? Is there any consensus on which is considered better/more idiomatic style? Is there another option I'm missing?

+2  A: 

I did the following profiling:

let test() = 
    tic()
    let mutable a = 0.0
    for i=1 to 10 do
        for j=1 to 10000000 do
            a <- a + float j
    toc("mutable")
let test2() = 
    tic()
    let a = ref 0.0
    for i=1 to 10 do
        for j=1 to 10000000 do
            a := !a + float j
    toc("ref")

the average for mutable is 50ms, while ref 600ms. The performance difference is due to that mutable variables are in stack, while ref variables are in managed heap.

The relative difference is big. However, 10^8 times of access is a big number. And the total time is acceptable. So don't worry too much about the performance of ref variables. And remember:

Premature optimization is the root of all evil.

My advice is you first finish your parser, then consider optimizing it. You won't know where the bottomneck is until you actually run the program. One good thing about F# is that its terse syntax and functional style well support code refactoring. Once the code is done, optimizing it would be convenient. Here's an profiling example.

Just another example, we use .net arrays everyday, which is also in managed heap:

let test3() = 
    tic()
    let a = Array.create 1 0.0
    for i=1 to 10 do
        for j=1 to 10000000 do
            a.[0] <- a.[0] + float j
    toc("array")

test3() runs about the same as ref's. If you worry too much of variables in managed heap, then you won't use array anymore.

Yin Zhu
+4  A: 

You mentioned that local mutable values cannot be captured by a closure, so you need to use ref instead. The reason for this is that mutable values captured in the closure need to be allocated on the heap (because closure is allocated on the heap).

F# forces you to write this explicitly (using ref). In C# you can "capture mutable variable", but the compiler translates it to a field in a heap-allocated object behind the scene, so it will be on the heap anyway.

Summary is: If you want to use closures, mutable variables need to be allocated on the heap.

Now, regarding your code - your implementation uses ref, which creates a small object for every mutable variable that you're using. An alternative would be to create a single object with multiple mutable fields. Using records, you could write:

type ReadClosure = {
  mutable c : char
  mutable k : SomeType } // whatever type you use here

let rec read file includepath = 
  let state = { c = ' '; k = NONE } 
  // ... 
  let readc() = 
    state.c <- stream.Read() |> char 
    // etc...

This may be a bit more efficient, because you're allocating a single object instead of a few objects, but I don't expect the difference will be noticeable.

There is also one confusing thing about your code - the stream value will be disposed after the function read returns, so the call to stream.Read may be invalid (if you call readc after read completes).

let rec read file includepath =    
  let c = ref ' '    
  use stream = File.OpenText file    
  let readc() =    
    c := stream.Read() |> char    
  readc

let f = read a1 a2
f() // This would fail!

I'm not quite sure how you're actually using readc, but this may be a problem to think about. Also, if you're declaring it only as a helper closure, you could probably rewrite the code without closure (or write it explicitly using tail-recursion, which is translated to imperative loop with mutable variables) to avoid any allocations.

Tomas Petricek
Ah! I've just been looking over the generated code, and I see what's going on, outer variables used by inner functions are passed as extra parameters -- that's true in all cases, so that's why the mutable ones have to be ref. So the most efficient solution is probably to keep the closure method for the immutable ones, but make the mutable ones global variables so no heap allocation. (`read` is the overall parser function of which everything else is an inner function, so `stream` gets disposed after the whole job is done.) Thanks!
rwallace