In various assembly languages, MUL (x86) / MULT (MIPS) perform multiplication, but it is a black box to the programmer. I am interested in how a CPU actually accomplishes a multiplication, regardless of the architecture. Let's say I have two 16-bit values in my registers and I am the CPU, so I have to implement MUL using the other bit-fiddling instructions I have (and, or, xor, not, shl, shr, etc.). What should I do?
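In software terms, the classic answer is shift-and-add: walk over the bits of one operand and, whenever a bit is set, add the other operand shifted left by that bit's position into the running product. A minimal sketch in C (the function name mul16 is just illustrative), using only shifts, AND and addition:

    #include <stdint.h>

    /* Shift-and-add multiplication: 16-bit x 16-bit -> 32-bit product,
     * using only shifts, AND and addition. */
    uint32_t mul16(uint16_t a, uint16_t b)
    {
        uint32_t product = 0;
        uint32_t addend  = a;        /* holds a << i, updated each round */

        while (b != 0) {
            if (b & 1)               /* is the current low bit of b set? */
                product += addend;   /* accumulate this partial product */
            addend <<= 1;            /* prepare a << (i+1) for the next bit */
            b >>= 1;                 /* move on to the next bit of b */
        }
        return product;
    }

If you want to be strict about "bit-fiddling only", the addition itself can also be decomposed into a loop of XOR (sum bits) and AND-then-shift (carry bits) until the carries die out.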
The Wikipedia article http://en.wikipedia.org/wiki/Multiplication_ALU lists different methods for doing multiplication in a digital circuit.
When I worked on a project to add SIMD instructions to a DEC Alpha-like processor in Verilog back in college, we implemented a Wallace tree multiplier, primarily because it ran in a fixed number of cycles and was easy to pipeline.
EDIT: You mentioned using the other bit-fiddling instructions. On modern processors, multiplication would not be microcoded like that; it would be way too slow, and the processor would get slaughtered in benchmarks.
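For illustration only, here is a rough software model of the Wallace-tree dataflow for a 16 x 16 -> 32-bit multiply: partial products are formed with ANDs, repeatedly compressed three rows at a time with carry-save adders (3:2 compressors), and only the final two rows go through a carry-propagating add. Real hardware performs each reduction level in parallel combinational logic; this loop-based model and its function names are hypothetical, just to show the structure.

    #include <stdint.h>

    /* One carry-save step: compress three operands into two (sum, carry). */
    static void csa(uint32_t a, uint32_t b, uint32_t c,
                    uint32_t *sum, uint32_t *carry)
    {
        *sum   = a ^ b ^ c;                          /* bitwise full-adder sum   */
        *carry = ((a & b) | (a & c) | (b & c)) << 1; /* carries weigh one bit up */
    }

    uint32_t mul16_wallace_model(uint16_t a, uint16_t b)
    {
        uint32_t pp[16];
        /* Partial products: AND a with each bit of b, shifted into place. */
        for (int i = 0; i < 16; i++)
            pp[i] = (b & (1u << i)) ? ((uint32_t)a << i) : 0;

        /* Reduce 16 rows down to 2 by repeated 3:2 compression. */
        int n = 16;
        while (n > 2) {
            int out = 0, i = 0;
            for (; i + 2 < n; i += 3) {
                uint32_t s, c;
                csa(pp[i], pp[i + 1], pp[i + 2], &s, &c);
                pp[out++] = s;
                pp[out++] = c;
            }
            for (; i < n; i++)        /* pass leftover rows through unchanged */
                pp[out++] = pp[i];
            n = out;
        }

        return pp[0] + pp[1];         /* single final carry-propagating add */
    }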
This page shows the logic gates for a 4 x 4 combinational multiplier. You can work up from there.
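As a companion to that schematic, here is a hypothetical bit-level model of a 4 x 4 array multiplier in C, forming the partial-product bits with ANDs and summing them with full adders; the exact gate arrangement on the linked page may differ.

    #include <stdint.h>

    /* Full adder expressed with gates: sum and carry-out from three bits. */
    static void full_adder(unsigned a, unsigned b, unsigned cin,
                           unsigned *sum, unsigned *cout)
    {
        *sum  = a ^ b ^ cin;
        *cout = (a & b) | (a & cin) | (b & cin);
    }

    /* Multiply two 4-bit values into an 8-bit product: 16 partial-product
     * bits from AND gates, each row summed into the result with a ripple
     * chain of full adders. */
    uint8_t mul4x4_gates(uint8_t a, uint8_t b)
    {
        unsigned pp[4][4];                    /* pp[i][j] = a_j AND b_i */
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                pp[i][j] = ((a >> j) & 1) & ((b >> i) & 1);

        unsigned result[8] = {0};
        for (int i = 0; i < 4; i++) {
            unsigned carry = 0;
            for (int j = 0; j < 4; j++) {     /* add row i at offset i */
                unsigned s, c;
                full_adder(result[i + j], pp[i][j], carry, &s, &c);
                result[i + j] = s;
                carry = c;
            }
            for (int k = i + 4; carry && k < 8; k++) {  /* propagate final carry */
                unsigned s, c;
                full_adder(result[k], 0, carry, &s, &c);
                result[k] = s;
                carry = c;
            }
        }

        uint8_t out = 0;
        for (int k = 0; k < 8; k++)
            out |= (uint8_t)(result[k] << k);
        return out;
    }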
Here is somebody's lab writeup describing how to build a 16-bit multiplier from four 4-bit multipliers, each built from AND gates and full adders. It includes the full design, chip layout, and simulation waveforms.
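Assuming the composition is the usual divide-and-conquer one (the linked writeup may structure it differently), the idea is that splitting each operand into high and low nibbles turns one wide multiply into four narrow multiplies plus shifts and adds. A small illustrative sketch, with mul4 standing in for the 4-bit multiplier block:

    #include <stdint.h>

    /* Stand-in for the 4-bit multiplier building block (for example the
     * AND-gate/full-adder unit sketched above); inputs are 0..15. */
    static uint8_t mul4(uint8_t a, uint8_t b) { return (uint8_t)(a * b); }

    /* 8 x 8 -> 16-bit multiply assembled from four 4 x 4 multiplies. */
    uint16_t mul8x8_from_4x4(uint8_t a, uint8_t b)
    {
        uint8_t al = a & 0x0F, ah = a >> 4;   /* split operands into nibbles */
        uint8_t bl = b & 0x0F, bh = b >> 4;

        uint16_t ll = mul4(al, bl);           /* weight 2^0 */
        uint16_t lh = mul4(al, bh);           /* weight 2^4 */
        uint16_t hl = mul4(ah, bl);           /* weight 2^4 */
        uint16_t hh = mul4(ah, bh);           /* weight 2^8 */

        return (uint16_t)(ll + ((uint16_t)(lh + hl) << 4) + (hh << 8));
    }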