Shark gave me this hint about one of my loops:
This instruction is the start of a loop that is not aligned to a 16-byte address boundary. For optimal performance, you should align the start of a hot loop using a compiler directive. With gcc 3.3 or later, use the -falign-loops=16 compiler flag.
for (int i = 0; i < 4; i++) { // Shark flagged this line
//...code
}
How would I set that flag, and does it really improve performance?