views:

104

answers:

2

I'm trying to experiment with software defined radio concepts. From this article I've tried to implement a GPU-parallelism Discrete Fourier Transform.

I'm pretty sure I could pre-calculate 90 degrees of the sin(i) cos(i) and then just flip and repeat rather than what I'm doing in this code and that that would speed it up. But so far, I don't even think I'm getting correct answers. An all-zeros input gives a 0 result as I'd expect, but all 0.5 as inputs gives 78.9985886f (I'd expect a 0 result in this case too). Basically, I'm just generally confused. I don't have any good input data and I don't know what to do with the result or how to verify it.

This question is related to my other post here

open Microsoft.ParallelArrays
open System

 // X64MulticoreTarget is faster on my machine, unexpectedly
let target = new DX9Target() // new X64MulticoreTarget()

ignore(target.ToArray1D(new FloatParallelArray([| 0.0f |]))) // Dummy operation to warm up the GPU

let stopwatch = new System.Diagnostics.Stopwatch() // For benchmarking

let Hz = 50.0f
let fStep = (2.0f * float32(Math.PI)) / Hz
let shift = 0.0f // offset, once we have to adjust for the last batch of samples of a stream

// If I knew that the periodic function is periodic 
// at whole-number intervals, I think I could keep 
// shift within a smaller range to support streams 
// without overflowing shift - but I haven't 
// figured that out

//let elements = 8192 // maximum for a 1D array - makes sense as 2^13
//let elements = 7240 // maximum on my machine for a 2D array, but why?
let elements = 7240

// need good data!!
let buffer : float32[,] = Array2D.init<float32> elements elements (fun i j -> 0.5f) //(float32(i * elements) + float32(j))) 

let input = new FloatParallelArray(buffer)
let seqN : float32[,] = Array2D.init<float32> elements elements (fun i j -> (float32(i * elements) + float32(j)))
let steps = new FloatParallelArray(seqN)
let shiftedSteps = ParallelArrays.Add(shift, steps)
let increments = ParallelArrays.Multiply(fStep, steps)
let cos_i = ParallelArrays.Cos(increments) // Real component series
let sin_i = ParallelArrays.Sin(increments) // Imaginary component series

stopwatch.Start()
// From the documentation, I think ParallelArrays.Multiply does standard element by 
// element multiplication, not matrix multiplication
// Then we sum each element for each complex component (I don't understand the relationship 
// of this, or the importance of the generalization to complex numbers)
let real = target.ToArray1D(ParallelArrays.Sum(ParallelArrays.Multiply(input, cos_i))).[0]
let imag = target.ToArray1D(ParallelArrays.Sum(ParallelArrays.Multiply(input, sin_i))).[0]
printf "%A in " ((real * real) + (imag * imag)) // sum the squares for the presence of the frequency
stopwatch.Stop()

printfn "%A" stopwatch.ElapsedMilliseconds

ignore (System.Console.ReadKey())

+2  A: 

Two suggestions:

  • ensure you're not somehow confusing degrees with radians
  • try doing it sans-parallelism, or just with F#'s asyncs for parallelism

(In F#, if you have an array of floats

let a : float[] = ...

then you can 'add a step to all of them in parallel' to produce a new array with

let aShift = a |> (fun x -> async { return x + shift }) 
               |> Async.Parallel |> Async.RunSynchronously

(though I expect this might be slower that just doing a synchronous loop).)

Brian
+2  A: 

I share your surprise that your answer is not closer to zero. I'd suggest writing naive code to perform your DFT in F# and seeing if you can track down the source of the discrepancy.

Here's what I think you're trying to do:

let N = 7240
let F = 1.0f/50.0f
let pi = single System.Math.PI

let signal = [| for i in 1 .. N*N -> 0.5f |]

let real = 
  seq { for i in 0 .. N*N-1 -> signal.[i] * (cos (2.0f * pi * F * (single i))) }
  |> Seq.sum

let img = 
  seq { for i in 0 .. N*N-1 -> signal.[i] * (sin (2.0f * pi * F * (single i))) }
  |> Seq.sum

let power = real*real + img*img

Hopefully you can use this naive code to get a better intuition for how the accelerator code ought to behave, which could guide you in your testing of the accelerator code. Keep in mind that part of the reason for the discrepancy may simply be the precision of the calculations - there are ~52 million elements in your arrays, so accumulating a total error of 79 may not actually be too bad. FWIW, I get a power of ~0.05 when running the above single precision code, but a power of ~4e-18 when using equivalent code with double precision numbers.

kvb