I use numpy for numerical linear algebra. I suspect I could get much better performance by making small changes to how I carry out certain computations, for example by making them more memory efficient.
Is there any form of instrumentation available in Python to detect cache and TLB misses? There is a very nice API, PAPI, that I learned about in a recent class, but it doesn't have a Python interface:
http://icl.cs.utk.edu/papi/overview/index.html
Also, is there a good general way to profile numpy or other numerical Python code? The timeit module is awkward to integrate into existing code. mpi4py has a nice way to profile using the MPE library. A snippet from its demo code (demo/mpe-logging/cpilog.py):
    communication = MPE.newLogState("Comunicate", "red")
    with communication:
        comm.Bcast([n, MPI.INT], root=0)
This creates a log file that can be displayed graphically, but the approach is MPI-specific.
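To illustrate the kind of interface I have in mind outside of MPI, here is a minimal sketch using only the standard library. The `timed` helper and the `timings` dictionary are hypothetical names of my own, not part of numpy or mpi4py, and this only measures wall-clock time, not cache or TLB misses:

```python
import time
from contextlib import contextmanager

# Hypothetical helper, not part of any library: accumulates wall-clock
# time per label so that regions of code can be timed in place.
timings = {}

@contextmanager
def timed(label):
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the elapsed time even if the timed block raises.
        elapsed = time.perf_counter() - start
        timings[label] = timings.get(label, 0.0) + elapsed

# Stand-in workload; in practice this would wrap a numpy computation.
with timed("compute"):
    total = sum(i * i for i in range(100_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.6f} s")
```

Something like this is easy to drop around existing code, but it gives no graphical log and no hardware-counter information, which is what I am really after.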
Thanks.