tags:

views:

874

answers:

11

I'm not talking about algorithmic stuff (eg use quicksort instead of bubblesort), and I'm not talking about simple things like loop unrolling.

I'm talking about the hardcore stuff. Like Tiny Teensy ELF, The Story of Mel; practically everything in the demoscene, and so on.

+2  A: 

The Stalin Scheme compiler is pretty crazy in that aspect.

leppie
+5  A: 

Michael Abrash's "Zen of Assembly Language" had some nifty stuff, though I admit I don't recall specifics off the top of my head.

Actually it seems like everything Abrash wrote had some nifty optimization stuff in it.

Michael Burr
Yeah, I have the Black Book, it's pretty awesome... Pretty out of date, unfortunately.
TraumaPony
I also have this book, don't remember the spesifics anymore. Used to do quite some assembly earlier.
sindre j
+2  A: 

I've gone to the Intel (or AMD) architecture references to see what instructions there are. movsx - move with sign extension is awesome for moving little signed values into big spaces, for example, in one instruction.

Likewise, if you know you only use 16-bit values, but you can access all of EAX, EBX, ECX, EDX , etc- then you have 8 very fast locations for values - just rotate the registers by 16 bits to access the other values.

warren
+3  A: 

I once saw a switch statement with a lot of empty cases, a comment at the head of the switch said something along the lines of:

Added case statements that are never hit because the compiler only turns the switch into a jump-table if there are more than N cases

I forget what N was. This was in the source code for Windows that was leaked in 2004.

Motti
+23  A: 

I once wrote a brute force RC5 key search that processed two keys at a time, the first key used the integer pipeline, the second key used the SSE pipelines and the two were interleaved at the instruction level. This was then coupled with a supervisor program that ran an instance of the code on each core in the system. In total, the code ran about 25 times faster than a naive C version.

Skizz

Skizz
Impressive. But I feel sorry for whoever has to maintain it (but that's the nature of super-optimized code).
Michael Burr
I'm the maintainer, it was a pet project. It is well commented, and I placed the code for the two pipelines in their own columns to make it easier to follow.
Skizz
"I placed the code for the two pipelines in their own columns to make it easier to follow" - great idea. I imagine that would have never crossed my mind.
Michael Burr
Yeah, same... That's actually quite brilliant.
TraumaPony
I would give you +2 if I could: one for the code, and another for the two-column-code. Brilliant^2!
F.D.Castel
"Faster than the native C version" so what did you write it in?
Neil N
If you used SSE, how about using the GPU in addition to that?
LiraNuna
@Neil N: I wrote the key processing in pure assembler with a C++ interface for handling threading.@LiraNuna: Yes, you might be able to use the GPU, this was written a while ago before GPGPUs were common.
Skizz
+1  A: 

The EFF DES cracker, which used custom-built hardware to generate candidate keys (the hardware they made could prove a key isn't the solution, but could not prove a key was the solution) which were then tested with a more conventional code.

CesarB
+1  A: 

The FSG 2.0 packer made by a Polish team, specifically made for packing executables made with assembly. If packing assembly isn't impressive enough (what's supposed to be almost as low as possible) the loader it comes with is 158 bytes and fully functional. If you try packing any assembly made .exe with something like UPX, it will throw a NotCompressableException at you ;)

John T
+12  A: 

In one (here unnamed) video game engine I worked with, they had rewritten the model-export tool (the thing that turns a Maya mesh into something the game loads) so that instead of just emitting data, it would actually emit the exact stream of microinstructions that would be necessary to render that particular model. It used a genetic algorithm to find the one that would run in the minimum number of cycles. That is to say, the data format for a given model was actually a perfectly-optimized subroutine for rendering just that model. So, drawing a mesh to the screen meant loading it into memory and branching into it.

(This wasn't for a PC, but for a console that had a vector unit separate and parallel to the CPU.)

Crashworks
+1 for the l33t use of GA.
ceretullis
This is getting more common among console games as the consoles often have a command buffer that is used to actually render graphics, thus it is just easier to precalculate the required commands and then send them straight to the graphics card (in addition to any pointer fixups needed).
Grant Peters
+8  A: 

In the early days of DOS when we used floppy discs for all data transport there were viruses as well. One common way for viruses to infect different computers was to copy a virus bootloader into the bootsector of an inserted floppydisc. When the user inserted the floppydisc into another computer and rebooted without remembering to remove the floppy, the virus was run and infected the harddrive bootsector, thus permanently infecting the host PC. A particulary annoying virus I was infected by was called "Form", to battle this I wrote a custom floppy bootsector that had the following features:

  • Validate the bootsector of the host harddrive and make sure it was not infected.
  • Validate the floppy bootsector and make sure that it was not infected.
  • Code to remove the virus from the harddrive if it was infected.
  • Code to duplicate the antivirus bootsector to another floppy if a special key was pressed.
  • Code to boot the harddrive if all was well, and no infections was found.

This was done in the program space of a bootsector, about 440 bytes :)

The biggest problem for my mates was the very cryptic messages displayed because I needed all the space for code. It was like "FFVD RM?", which meant "FindForm Virus Detected, Remove?"

I was quite happy with that piece of code. The optimization was program size, not speed. Two quite different optimizations in assembly.

sindre j
+7  A: 
Grant Peters
Ah, good old Quake III
TraumaPony
And these days we can just call __frsqrte. It's too easy; we've all gone soft!
Crashworks
is "__frsqrte" the assembly instruction for the square root? because i have read that InvSqrt(n) is 4 times faster then 1/sqrt(n) (assuming that sqrt uses the FSQRT assembly instruction). But still I am soft... i would not care about that 4 times performance diference... :/
João Portela
I ran a bunch of timings on invsqrts ( http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/ ). The short story is that `_frsqrte` -- the SSE *scalar* op for an estimated inverse square root -- is four times faster than InvSqrt(), which is six times faster than 1.0/`FSQRT`. On the PowerPC the situation is different; there, InvSqrt() is a total disaster because of a load-hit-store issue.
Crashworks
+5  A: 

A Very Biological Optimisation

Quick background: Triplets of DNA nucleotides (A, C, G and T) encode amino acids, which are joined into proteins, which are what make up most of most living things.

Ordinarily, each different protein requires a separate sequence of DNA triplets (its "gene") to encode its amino acids -- so e.g. 3 proteins of lengths 30, 40, and 50 would require 90 + 120 + 150 = 360 nucleotides in total. However, in viruses, space is at a premium -- so some viruses overlap the DNA sequences for different genes, using the fact that there are 6 possible "reading frames" to use for DNA-to-protein translation (namely starting from a position that is divisible by 3; from a position that divides 3 with remainder 1; or from a position that divides 3 with remainder 2; and the same again, but reading the sequence in reverse.)

For comparison: Try writing an x86 assembly language program where the 300-byte function doFoo() begins at offset 0x1000... and another 200-byte function doBar() starts at offset 0x1001! (I propose a name for this competition: Are you smarter than Hepatitis B?)

That's hardcore space optimisation!

UPDATE: Links to further info:

j_random_hacker
Interesting, but I am only into bacteria, and frame shift reading never happens. Could you provide some link for this behavior ? Do you know if any non-viruses show the same technique? ORFs in bacteria normally can (rarely) overlap a couple of codons on different strands, but I am not aware of full overlap on the same strand.
Stefano Borini
@Stefano: Added a few links which I hope are helpful. Seems this even happens in *rat*! I didn't read the full paper though.
j_random_hacker
The "Tiny Teensy ELF" example listed in the OP could be considered to do this. Though its not 2 separate functions, it does in the end use the ELF header to actually contain the code as well (plus some other various information) that is usually separated out. This means that the same data is being used for multiple (3 if I read correctly) purposes (including execution!).
Grant Peters
@Grant Peters: Agreed -- very cool stuff I think! :)
j_random_hacker