views:

1544

answers:

9

Status: So far the best answer's program executes in 33% of the time of the original program! But there is probably still other ways to optimize it.


Lua is currently the fastest scripting language out there, however Lua scores really bad in a few benchmarks against C/C++.

One of those is the mandelbrot test (Generate Mandelbrot set portable bitmap file N=16,000), where it scores a horrible 1:109(Multi Core) or 1:28(Single Core)

Since the Delta in speed is quite large, this is a good candidate for optimizations. Also I'm sure some that those who know who Mike Pall is might believe its not possible to optimize this any further, but that's blatantly wrong. Anyone who has done optimizations knows it is always possible to do better. Besides I did manage to get some extra performance with a few tweaks, so I know its possible :)

-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
-- contributed by Mike Pall

local width = tonumber(arg and arg[1]) or 100
local height, wscale = width, 2/width
local m, limit2 = 50, 4.0
local write, char = io.write, string.char

write("P4\n", width, " ", height, "\n")

for y=0,height-1 do
  local Ci = 2*y / height - 1
  for xb=0,width-1,8 do
    local bits = 0
    local xbb = xb+7
    for x=xb,xbb < width and xbb or width-1 do
      bits = bits + bits
      local Zr, Zi, Zrq, Ziq = 0.0, 0.0, 0.0, 0.0
      local Cr = x * wscale - 1.5
      for i=1,m do
        local Zri = Zr*Zi
        Zr = Zrq - Ziq + Cr
        Zi = Zri + Zri + Ci
        Zrq = Zr*Zr
        Ziq = Zi*Zi
        if Zrq + Ziq > limit2 then
          bits = bits + 1
          break
        end
      end
    end
    if xbb >= width then
      for x=width,xbb do bits = bits + bits + 1 end
    end
    write(char(255-bits))
  end
end

So how could this be optimized (of course as with any optimization you have to measure your implementation to be sure its faster). And you aren't allowed to alter the C-core of Lua for this, or use LuaJit, its about finding ways to optimizing one of Lua's weak weak points.

Edit: Putting a Bounty on this as to make the challenge more fun.

A: 

Robert Gould > One of those is the mandelbrot test (Generate Mandelbrot set portable bitmap file N=16,000), where it scores a horrible 1:109

When you quote numbers from the benchmarks game please show where those numbers come from so readers have some context.

In this case you seem to have taken numbers measured on the quadcore machine where the fastest programs have been re-written to exploit multiple cores. Instead of looking at elapsed time sort by CPU time and you'll see the ratio drop to 1:28.

Or look at the median and quartiles to get a better impression of how the set of C++ measurements compares to the set of Lua measurements.

Or there's a whole set of measurements where programs are forced to use just one core - Lua compared with C++ - and if you take a look at those Lua pi-digits programs you'll see that they use the C language GNU GMP library.

igouy
So far Mandrelbot program performance has been halved. And it wasn't hard at all...
Robert Gould
Do you think the programs still follow the benchmarks game "rules" for mandelbrot? (Hint - doing half the work by exploiting symmetry won't be accepted.)
igouy
This is not the shoot out game. It's about optimizing an algorithm in lua without requiring extra dependencies on the core or plugins,
Robert Gould
Then please say so in big letters in your "challenge" - you've made so many comparisons with performance of programs in the benchmarks game that everyone will think these are somehow comparable. If they aren't working against the same requirements they aren't comparable.
igouy
+3  A: 

So here's ~40% for a start:

-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
-- contributed by Mike Pall

local width = tonumber(arg and arg[1]) or 100
local height, wscale = width, 2/width
local m, limit2 = 50, 4.0
local write, char = io.write, string.char

function addChar (line, c)
    table.insert(line, c)
    for i=table.getn(line)-1, 1, -1 do
        if string.len(line[i]) > string.len(line[i+1]) then
            break
            end
        line[i] = line[i] .. table.remove(line)
        end
    end

local h2 = math.floor(height/2)
local hm = height - h2*2
local top_half = {}
for y=0,h2+hm do
    local Ci = 2*y / height - 1
    local line = {""}
    for xb=0,width-1,8 do
        local bits = 0
        local xbb = xb+7
        for x=xb,xbb < width and xbb or width-1 do
            bits = bits + bits
            local Zr, Zi, Zrq, Ziq = 0.0, 0.0, 0.0, 0.0
            local Cr = x * wscale - 1.5
            for i=1,m do
                local Zri = Zr*Zi
                Zr = Zrq - Ziq + Cr
                Zi = Zri + Zri + Ci
                Zrq = Zr*Zr
                Ziq = Zi*Zi
                if Zrq + Ziq > limit2 then
                    bits = bits + 1
                    break
                    end
                end
            end
        for x=width,xbb do 
            bits = bits + bits + 1 
            end
        addChar(line,char(255-bits))
        end
    line = table.concat(line) 
    table.insert(top_half,line)
    end

write("P4\n", width, " ", height, "\n")
for y=1,h2+hm do
    write(top_half[y])
    end
for y=h2,1,-1 do
    write(top_half[y])
    end

-- MarkusQ

MarkusQ
Congrats your optimization beat my optimizations so far! So now I've posted what I had done. It might give you clues for further optimizations to your code (if that's the case make it another answer, so all different attempts can be seen side by side)
Robert Gould
When running you program on my machine it runs in 33% of the time of the original, beating my 34% !
Robert Gould
Is a symmetry exploit allowed by this benchmark? I can't see a claim either way at the site, but the top ranked multi-threaded C contribution doesn't do it.
RBerteig
doesn't matter. Any valid pure lua optimization is valid.this is about clever speed ups, not the shoot out rules.
Robert Gould
Math is not an "exploit." Besides, pretty much any optimization involves taking advantage of _some_ symmetry, though generally more abstract than "the bottom looks just like the top".
MarkusQ
Note also that I've got a faster version downstairs.
MarkusQ
+1  A: 

Now that there is at least one answer faster than my solution I'll post my answer

-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
-- contributed by Mike Pall

local width = tonumber(arg and arg[1]) or 100
local height, wscale = width, 2/width
local m, limit2 = 50, 4.0
local write, char = io.write, string.char
local insert = table.insert
local results={}
write("P4\n", width, " ", height, "\n")

for y=0,height-1 do
  local Ci = 2*y / height - 1
  for xb=0,width-1,8 do
    local bits = 0
    local xbb = xb+7
    for x=xb,xbb < width and xbb or width-1 do
      bits = bits + bits
      local Zr, Zi, Zrq, Ziq = 0.0, 0.0, 0.0, 0.0
      local Cr = x * wscale - 1.5
      for i=1,m do
        local Zri = Zr*Zi
        Zr = Zrq - Ziq + Cr
        Zi = Zri + Zri + Ci
        Zrq = Zr*Zr
        Ziq = Zi*Zi
        if Zrq + Ziq > limit2 then
          bits = bits + 1
          break
        end
      end
    end
    if xbb >= width then
      for x=width,xbb do bits = bits + bits + 1 end
    end
    insert(results,(char(255-bits)))
  end
end
write(table.concat(results))

The trick is storing values to a table before sending them to the output. Something as simple as this reduced the run time to 58%.

Robert Gould
A: 

Next step I did was cache the stuff that was calculated over and over, and replace bit+bit to bit*2, These simple optimizations are quite powerful...

local width = tonumber(arg and arg[1]) or 100
local height, wscale = width, 2/width
local m, limit2 = 50, 4.0
local write, char = io.write, string.char
local results={}
write("P4\n", width, " ", height, "\n")
local height_minus_one = height - 1
local width_minus_one = width -1

for y=0,height_minus_one do
  local Ci = 2*y / height_minus_one
  for xb=0,width_minus_one,8 do
    local bits = 0
    local xbb = xb+7
    for x=xb,xbb < width and xbb or width_minus_one do
      bits = bits *2
      local Zr, Zi, Zrq, Ziq = 0.0, 0.0, 0.0, 0.0
      local Cr = x * wscale - 1.5
      for i=1,m do
        local Zri = Zr*Zi
        Zr = Zrq - Ziq + Cr
        Zi = Zri + Zri + Ci
        Zrq = Zr*Zr
        Ziq = Zi*Zi
        if Zrq + Ziq > limit2 then
          bits = bits + 1
          break
        end
      end
    end
    if xbb >= width then
      for x=width,xbb do bits = bits *2 + 1 end
    end
    table.insert(results,(char(255-bits)))
  end
end
write(table.concat(results))

This optimization makes the program run in 34% of the time of the original, but Markus Q's optimization still beat mine ;)

Robert Gould
Did you display the results? 'cause they look off a bit (+i or so) to me.
MarkusQ
A: 

This was another attempt, but it turned out to be slower than local access of variables (I imagined using a clean environment would have made it faster to find the variables, but it wasn't the case, local's virtual registers is slightly faster) This brought the runtime down to 41%.

local env={}
env.width = tonumber(arg and arg[1]) or 100
env.height = env.width
env.wscale = 2/env.width
env.m = 50
env.limit2 = 4.0
env.write = io.write
env.char = string.char
env.results={}
env.height_minus_one = env.height - 1
env.width_minus_one = env.width -1
env.insert = table.insert

setfenv(function()
    write("P4\n", env.width, " ", env.height, "\n")
    for y=0,height_minus_one do
      local Ci = 2*y / height_minus_one
      for xb=0,width_minus_one,8 do
        local bits = 0
        local xbb = xb+7
        for x=xb,xbb < width and xbb or width_minus_one do
          bits = bits *2
          local Zr, Zi, Zrq, Ziq = 0.0, 0.0, 0.0, 0.0
          local Cr = x * wscale - 1.5
          for i=1,m do
            local Zri = Zr*Zi
            Zr = Zrq - Ziq + Cr
            Zi = Zri + Zri + Ci
            Zrq = Zr*Zr
            Ziq = Zi*Zi
            if Zrq + Ziq > limit2 then
              bits = bits + 1
              break
            end
          end
        end
        if xbb >= width then
          for x=width,xbb do bits = bits *2 + 1 end
        end
        insert(results,(char(255-bits)))
      end
    end
end,env)()

io.write(table.concat(env.results))
Robert Gould
+5  A: 

Pass 2, about 30% better (on my machine) than my previous. The main saving came from unrolling the inner loop to amortize the overhead.

Also included (commented out) is an aborted attempt to save time by exiting early (& set the pixel black) when you are stuck in the central cardioid. It works, but it's slower no matter how I jiggered it.

I've got to run, but I'll leave a parting suggestion. There may be some optimization possible by run-length encoding the results (so instead of saving a bunch of bit-twiddled chars you'd save a list (number of white dots, number of black dots, number of white dots, etc.)). This would:

  1. Reduce the storage/GC overhead
  2. Allow some optimizations on the output generation (when the numbers were >> 8)
  3. Permit some orbit detection.

No idea if it could be coded tight enough to fly, but that is where I would try next if I had more time.

-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
-- contributed by Mike Pall
-- with optimizations by Markus J. Q. (MarkusQ) Roberts

local width = tonumber(arg and arg[1]) or 100
local height, wscale = width, 2/width
local m, limit2 = 50, 4.0
local write, char = io.write, string.char

local h2 = math.floor(height/2)
local hm = height - h2*2
local top_half = {}

for y=0,h2+hm do
    local Ci = 2*y / height - 1
    local line = {""}
    for xb=0,width-1,8 do
        local bits = 0
        local xbb = xb+7
        for x=xb,xbb < width and xbb or width-1 do
            bits = bits + bits
            local Zr, Zi, Zrq, Ziq = 0.0, 0.0, 0.0, 0.0
            local Cr = x * wscale - 1.5
            local Zri = Zr*Zi
            for i=1,m/5 do
                Zr = Zrq - Ziq + Cr
                Zi = Zri + Zri + Ci
                Zri = Zr*Zi

                Zr = Zr*Zr - Zi*Zi + Cr
                Zi = 2*Zri +         Ci
                Zri = Zr*Zi

                Zr = Zr*Zr - Zi*Zi + Cr
                Zi = 2*Zri +         Ci
                Zri = Zr*Zi

                Zr = Zr*Zr - Zi*Zi + Cr
                Zi = 2*Zri +         Ci
                Zri = Zr*Zi

                Zr = Zr*Zr - Zi*Zi + Cr
                Zi = 2*Zri +         Ci
                Zri = Zr*Zi

                Zrq = Zr*Zr
                Ziq = Zi*Zi
                Zri = Zr*Zi
                if Zrq + Ziq > limit2 then
                    bits = bits + 1
                    break
                    end
                -- if i == 1 then
                --    local ar,ai       = 1-4*Zr,-4*Zi
                --    local a_r         = math.sqrt(ar*ar+ai*ai)
                --    local k           = math.sqrt(2)/2
                --    local br,bi2      = math.sqrt(a_r+ar)*k,(a_r-ar)/2
                --    if (br+1)*(br+1) + bi2 < 1 then
                --        break
                --        end
                --    end
                end
            end
        for x=width,xbb do 
            bits = bits + bits + 1 
            end
        table.insert(line,char(255-bits))
        end
    line = table.concat(line) 
    table.insert(top_half,line)
    end

write("P4\n", width, " ", height, "\n")
for y=1,h2+hm do
    write(top_half[y])
    end
for y=h2,1,-1 do
    write(top_half[y])
   end
MarkusQ
Congratulations! It would've been nice to have more contestants, but either way great work!
Robert Gould
I agree. It would have been fun to have more time to spend on it too. Thanks for posting a great question.
MarkusQ
+1  A: 

I don't know Lua that good to produce working code, but you should be able to heavily increase Mandelbrot performance by using some math tricks. There was a suggestion about using symmetry to speed it up, another big improvement could be done using this optimization:

Use a recursive function that uses rectangle coordinates of a Mandelbrot portion. It then calculates the values at the rectangles border lines and the two lines that split in the middle. After this, there are 4 sub-rectangles. If one of it has all the same border pixel colors, you can simply fill it with this color, if not, you recursively call the function on this portion.

I searched for another explanation of this algorithm and found one here - along with a nice visualization. The old DOS program FRACTINT calls this optimization "Tesseral mode".

schnaader
+1  A: 

On the benchmarks game the elapsed time has been reduced from 674s to 211s by spawning more processes to use the available cores.

igouy
there we go! that indeed is the best way to get more throughput :)
Robert Gould
+1  A: 

Why to use local variable Zri? It is possible to avoid its use by reordering next two statements:

Zi = 2*Zr*Zi + Ci Zr = Zrq - Ziq + Cr

There's also possible to use simple periodicity checking, but speedup depends on m. The larger "m" is, the better is speedup gained from periodicity checking.

good ideas, haven't measured the performance difference but it should work
Robert Gould