I need to improve the throughput of the system.
The usual round of optimizations has been done, and we have already achieved 1.5x better throughput.
I am now wondering whether I can use Cachegrind's output to improve the system's throughput further.
Can somebody point me to how to get started with this?
My understanding is that the most frequently used data should be kept small enough to stay in the L1 cache, and the next most frequently used set of data should fit in L2.
Am I heading in the right direction?
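
To make the question concrete, here is a minimal sketch of the kind of layout change I have in mind: the fields touched in the hot loop are split from the rarely used ones, so more hot records fit in each cache line. The struct names, fields, and the 64-byte line size are assumptions for illustration only, not taken from my actual system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "cold" part: large fields that the hot loop never reads. */
struct RecordCold {
    char     name[64];
    char     notes[176];
    uint64_t created_at;
};

/* Hypothetical "hot" part: only the fields the inner loop touches. */
struct RecordHot {
    uint32_t key;
    uint32_t counter;
};

/* Assuming 64-byte cache lines, 8 hot records share one line. */
static_assert(sizeof(struct RecordHot) == 8,
              "hot record should stay at 8 bytes");

/* The hot loop walks a dense array of small records, so each cache
 * line fetched carries only data that is actually used. */
uint64_t sum_counters(const struct RecordHot *hot, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += hot[i].counter;
    }
    return total;
}
```

The cold data would live in a parallel array indexed the same way, and would only be loaded on the rare paths that need it. Whether this helps in practice is what I would hope to confirm by comparing the L1 data miss counts that Cachegrind reports before and after the change.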