I'm trying to write a program that solves the RSA challenge (yes, I have interesting goals). I currently don't have a 64-bit Linux box, and I don't really want to spend my time writing a program that has no chance of ever finishing. While I can do some assembler programming, I would prefer to use C++. However, I would also be interested in how to use inline assembly to do the same thing. The plan is to use the 16 64-bit general-purpose registers and the 128-bit SSE registers to do (really, really long) integer math. Any help on how to do that would be greatly appreciated.
All modern compilers are very good at using all available registers to produce fast code.
Intel C++ and GCC3 in particular often generate code that is hard to beat by hand.
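To make that concrete, here is a minimal sketch of multi-word addition in plain C++, with no inline assembly. The names `Limbs` and `add_in_place` are made up for this illustration, and the little-endian limb layout is an assumption. Optimising compilers can often turn this carry pattern into add-with-carry instructions, but whether yours does depends on the compiler and flags, so check the generated assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical representation: little-endian limbs,
// limbs[0] holds the least significant 64 bits.
using Limbs = std::vector<std::uint64_t>;

// Adds b into a (both must have the same length) and returns the
// carry out of the most significant limb (0 or 1).
std::uint64_t add_in_place(Limbs& a, const Limbs& b) {
    assert(a.size() == b.size());
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        std::uint64_t sum = a[i] + b[i];      // wraps modulo 2^64
        std::uint64_t c1 = sum < a[i];        // 1 if that addition wrapped
        std::uint64_t total = sum + carry;
        std::uint64_t c2 = total < sum;       // 1 if adding the carry wrapped
        a[i] = total;
        carry = c1 | c2;                      // at most one of them can be set
    }
    return carry;
}
```

Unsigned overflow is well defined in C++, so the comparisons are a portable way to recover the carry that the hardware flag would otherwise hold.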
BTW, check this out: http://stackoverflow.com/questions/1295452/why-does-msvc-not-support-inline-assembly-for-amd64-and-itanium-targets
Based on your comment on BarsMonster's answer, you don't need to get closer to the CPU; you need a large-integer library.
One option is GMP, which provides arbitrary-precision integer arithmetic. It has good algorithms for things like multiplying large integers, and a good compiler will do a better job of optimising this than most people.
The main issue that might make you look for an alternative is that it supports variable precision arithmetic, which may be an overhead you'd rather avoid if you know for sure that your numbers have at most 512 binary digits. Even so, you probably want to look at algorithms more than low-level tricks (long multiplication may already be a bad choice at that size), and I'm pretty confident you'll be better off letting the compiler do your optimisation.
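If you do go the fixed-width route, a rough sketch might look like the following. This is not GMP's approach, just an illustration under my own assumptions: the names `U512`, `U1024` and `mul_schoolbook` are hypothetical, and I use 32-bit limbs with 64-bit intermediates so it stays portable (on a 64-bit target you could widen the limbs if your compiler offers a 128-bit integer type). It is plain long multiplication, so treat it as the baseline that a smarter algorithm would need to beat rather than as a recommendation.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLimbs = 16;                   // 16 * 32 bits = 512 bits
using U512  = std::array<std::uint32_t, kLimbs>;     // little-endian limbs
using U1024 = std::array<std::uint32_t, 2 * kLimbs>; // full-width product

// Schoolbook (long) multiplication of two 512-bit values.
// The sizes are compile-time constants, so there is no allocation
// and no length bookkeeping at run time.
U1024 mul_schoolbook(const U512& a, const U512& b) {
    U1024 r{};                                       // zero-initialised
    for (std::size_t i = 0; i < kLimbs; ++i) {
        std::uint64_t carry = 0;
        for (std::size_t j = 0; j < kLimbs; ++j) {
            // (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so t cannot overflow.
            std::uint64_t t = std::uint64_t{a[i]} * b[j] + r[i + j] + carry;
            r[i + j] = static_cast<std::uint32_t>(t);   // low 32 bits
            carry = t >> 32;                             // high 32 bits
        }
        r[i + kLimbs] = static_cast<std::uint32_t>(carry);
    }
    return r;
}
```

Because the loop bounds are fixed, the compiler is free to unroll and schedule the inner loop, which is the kind of low-level work I'd leave to it.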
My advice - spend your time doing the things that require human intelligence, not the things that a machine can do far more consistently and a billion times faster.
And if you really can optimise machine code better than a compiler can, download LLVM and implement that logic as an optimisation pass so we can all get the benefit ;-)
If you just want to do some high-precision math, you would be better off trying the Intel C++ compiler and its math library; it is very powerful and heavily optimized. The same goes for the Boost math library. This might cut your work by 90% :-) They do support arbitrary-precision numbers.