First off, a 10 million x 10 million matrix is simply enormous. Assuming doubles for each cell and no storage overhaed, each one of these things is going to be 800 terabytes. Just reading each cell once over from main memory (should it somehow magically fit there, which clearly isn't happening), would take days. Doing it from any sort of plausible SAN (we'll put it on 10GbE) is more likely to be months. And no matrix multiply has O(n) complexity - the normal approaches are O(n^3). So... you aren't doing this with memory mapped files, common databases, or anything of that sort.
Code doing something like this is going to live or die on cache efficiency, where "cache" includes making good use of main memory, local disk drives. Since any storage interface holding more than one 800 terabyte matrix is bound to be a SAN of some sort, you almost certainly involve multiple servers reading and working on different parts of it, too.
There are lots of well-known ways to parallelise matrix multiplication (essentially multiply various-sized sub-matrices and then combining the results), and shift layout so that the access patterns have reasonable cache locality by organizing the data around space-filling curves instead of row/column arrangements. You're certainly going to want to look at the classic LAPACK interfaces and design, Intel's MKL, GotoBLAS as implementations of the BLAS functions tuned to specific modern hardware, and after that you're probably venturing into unexplored territory :-)