Efficient memory bandwidth use for streaming
I have an application that streams through 250 MB of data, applying a simple, fast neural-net threshold function to each data chunk (a chunk is just two 32-bit words). Based on the result of this very simple computation, each chunk is unpredictably pushed into one of 64 bins. So it's one big stream in and 64 shorter (variable length) s...
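
To make the access pattern concrete, here is a minimal baseline sketch in C of the setup described above: one sequential input stream scattered into 64 variable-length bins. Everything named here is an assumption for illustration. `chunk_t`, `classify`, and `bin_chunks` are hypothetical names, and `classify` is only a cheap stand-in for the real threshold function (it hashes the two words down to a bin index 0..63). The sketch uses a plain two-pass counting scatter, which is not necessarily the bandwidth-optimal approach; it reads the input twice in exchange for knowing each bin's exact size up front.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BINS 64

/* One chunk: two 32-bit words, as in the question. */
typedef struct {
    uint32_t w[2];
} chunk_t;

/* Hypothetical placeholder for the real threshold function.
 * Any cheap function that maps a chunk to a bin index 0..63 fits here. */
static unsigned classify(chunk_t c)
{
    uint32_t h = (c.w[0] ^ c.w[1]) * 2654435761u; /* multiplicative hash */
    return h >> 26;                               /* top 6 bits: 0..63 */
}

/* Scatter n chunks from `in` into `out`, grouped by bin.
 * `out` must hold n chunks. On return, bin b occupies
 * out[start[b]] .. out[start[b + 1] - 1]. */
static void bin_chunks(const chunk_t *in, size_t n,
                       chunk_t *out, size_t start[NUM_BINS + 1])
{
    size_t counts[NUM_BINS] = {0};
    size_t cursor[NUM_BINS];

    assert(n == 0 || (in != NULL && out != NULL));

    /* Pass 1: sequential read, count how many chunks land in each bin. */
    for (size_t i = 0; i < n; i++)
        counts[classify(in[i])]++;

    /* Prefix sum turns the counts into each bin's starting offset. */
    start[0] = 0;
    for (unsigned b = 0; b < NUM_BINS; b++) {
        cursor[b] = start[b];
        start[b + 1] = start[b] + counts[b];
    }

    /* Pass 2: sequential read again, 64 independent write streams. */
    for (size_t i = 0; i < n; i++) {
        unsigned b = classify(in[i]);
        out[cursor[b]++] = in[i];
    }
}
```

A caller would allocate `out` with room for all `n` chunks plus a `size_t start[NUM_BINS + 1]` array, then call `bin_chunks(in, n, out, start)`. The second pass is where the bandwidth question lives: the read side is a single prefetch-friendly stream, while the write side touches 64 separate cache lines in an unpredictable order.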