I don't see this as huge and complicated; the closer you get to the hardware, the simpler it gets.
Write a disassembler, that's how the hardware does it. Most processors document the opcode encodings, the instruction set, in the same manual as the assembly language.
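Just to make that concrete, here is a minimal disassembler sketch in C for a made-up 16-bit encoding (not any real processor's): the top 4 bits are the opcode, the next two 4-bit fields pick the destination and source registers, and the low byte doubles as a signed jump offset. Every mnemonic and bit layout here is invented for illustration; the real layout comes from your processor's manual.

    #include <stdio.h>
    #include <stdint.h>

    /* Made-up 16-bit format: bits 15-12 opcode, bits 11-8 destination
     * register, bits 7-4 source register, low byte reused as a signed
     * jump offset.  Nothing here matches a real instruction set. */
    static void disasm(uint16_t insn)
    {
        unsigned opcode = (insn >> 12) & 0xF;
        unsigned rd     = (insn >> 8)  & 0xF;
        unsigned rs     = (insn >> 4)  & 0xF;

        switch (opcode) {
        case 0x1: printf("add r%u, r%u\n", rd, rs);                break;
        case 0x2: printf("sub r%u, r%u\n", rd, rs);                break;
        case 0x3: printf("jz  %+d\n", (int)(int8_t)(insn & 0xFF)); break;
        default:  printf(".word 0x%04x\n", (unsigned)insn);        break;
        }
    }

    int main(void)
    {
        uint16_t program[] = { 0x1120, 0x2345, 0x30FE, 0xF000 };
        for (unsigned i = 0; i < sizeof(program) / sizeof(program[0]); i++)
            disasm(program[i]);
        return 0;
    }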
Look at the opcode for, say, a register-based add instruction: a few of the bits select the source register, a few bits select the destination register, and a few bits say that this is an add instruction. Let's say the instruction set you are looking at uses only two registers for a register-based add. There is some logic, an adder, that can add two register-sized values and output a result plus a carry bit. Registers are stored on chip in memory bits, sometimes called flip-flops. So when an add is decoded, the input registers are connected to the adder using electronic switches. These days this happens at the beginning of the clock cycle; by the end of the clock cycle the adder has a result, the output is routed to the bits for the destination register, and the answer is captured.

Normally an add will also modify the flags in the flag register. When the result is too big to be stored in the register, the carry flag is set (think about what happens when you add the decimal numbers 9 and 1: you get a 0 and carry the 1, right?). There is some logic that looks at the output of the adder, compares the bits with zero, and sets or clears the Z flag in the flag register. Another flag bit is the sign bit, or N bit for negative, which is simply the most significant bit of the answer. This is all done in parallel.
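In a simulator that whole add-plus-flags step is only a few lines. Here is a sketch under the same made-up encoding, with hypothetical 16-bit registers and Z, N, and C bits in a flag register; doing the add one bit wider than the registers is just a trick to make the carry out visible.

    #include <stdio.h>
    #include <stdint.h>

    #define FLAG_Z 0x1   /* result was zero          */
    #define FLAG_N 0x2   /* most significant bit set */
    #define FLAG_C 0x4   /* carry out of the adder   */

    static uint16_t regs[16];   /* register file: plain variables standing in for flip-flops */
    static uint16_t flags;      /* flag register */

    static void exec_add(unsigned rd, unsigned rs)
    {
        /* do the add one bit wider than the registers so the carry out is visible */
        uint32_t wide   = (uint32_t)regs[rd] + (uint32_t)regs[rs];
        uint16_t result = (uint16_t)wide;

        flags = 0;
        if (result == 0)      flags |= FLAG_Z;   /* compare the result bits with zero  */
        if (result & 0x8000)  flags |= FLAG_N;   /* sign bit is the msb of the answer  */
        if (wide & 0x10000)   flags |= FLAG_C;   /* the "carry the 1" out of the adder */

        regs[rd] = result;    /* capture the answer in the destination register */
    }

    int main(void)
    {
        regs[1] = 0xFFFF;
        regs[2] = 0x0001;
        exec_add(1, 2);       /* 0xFFFF + 1 wraps to 0: sets Z and C, clears N */
        printf("r1=0x%04x flags=0x%x\n", (unsigned)regs[1], (unsigned)flags);
        return 0;
    }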
Then say your next instruction is jump-if-zero (jump-if-equal): the logic looks at the Z flag. If it is set, the next instruction fetched is based on bits in the instruction that are added to the program counter through the same or another adder. Or perhaps the bits in the instruction point to an address in memory that holds the new value for the program counter. Or maybe the condition is false; then the program counter is still run through an adder, but what is added to it is the size of the instruction, so that the next sequential instruction is fetched.
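In the simulator the same conditional jump might look like this; again the encoding and the one-word instruction size are assumptions for illustration, not any real processor.

    #include <stdio.h>
    #include <stdint.h>

    #define FLAG_Z 0x1

    static uint16_t pc;      /* program counter, counted in 16-bit words */
    static uint16_t flags;   /* flag register */

    static void exec_jz(uint16_t insn)
    {
        if (flags & FLAG_Z) {
            /* condition true: the low byte is a signed offset relative to the
             * next instruction, run through an adder together with the pc */
            pc = (uint16_t)(pc + 1 + (int8_t)(insn & 0xFF));
        } else {
            /* condition false: just add the size of the instruction so the
             * next sequential instruction is fetched */
            pc = (uint16_t)(pc + 1);
        }
    }

    int main(void)
    {
        pc = 10;
        flags = FLAG_Z;
        exec_jz(0x30FE);      /* offset -2: branch back two words, pc becomes 9 */
        printf("pc after taken jz:     %u\n", (unsigned)pc);

        flags = 0;
        exec_jz(0x30FE);      /* condition false: fall through, pc becomes 10 */
        printf("pc after not-taken jz: %u\n", (unsigned)pc);
        return 0;
    }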
The stretch from a disassembler to a simulator is not a long one. You make variables for each of the registers, decode the instructions, execute the instructions, continue. Memory is an array you read from or write to. The disassembler is your decode step. The simulator performs the same steps as the hardware; the hardware just does them in parallel, using different programming tricks and different programming languages.
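Tying it together, a minimal fetch-decode-execute loop for the same made-up machine might look like this (flags trimmed to just Z for brevity); the switch in the middle is exactly the one the disassembler would use to decode.

    #include <stdio.h>
    #include <stdint.h>

    #define FLAG_Z    0x1
    #define MEM_WORDS 256

    static uint16_t mem[MEM_WORDS];  /* memory: an array you read from and write to */
    static uint16_t regs[16];        /* registers: one variable each */
    static uint16_t flags;
    static uint16_t pc;
    static int running = 1;

    static void step(void)
    {
        uint16_t insn = mem[pc++];           /* fetch, and step the pc past it */
        unsigned op   = (insn >> 12) & 0xF;  /* decode the fields, just like   */
        unsigned rd   = (insn >> 8)  & 0xF;  /* the disassembler does          */
        unsigned rs   = (insn >> 4)  & 0xF;

        switch (op) {                        /* execute */
        case 0x1:                            /* add rd, rs (Z flag only, for brevity) */
            regs[rd] = (uint16_t)(regs[rd] + regs[rs]);
            flags = (regs[rd] == 0) ? FLAG_Z : 0;
            break;
        case 0x3:                            /* jz offset, relative to the next instruction */
            if (flags & FLAG_Z)
                pc = (uint16_t)(pc + (int8_t)(insn & 0xFF));
            break;
        default:                             /* anything else: stop (a made-up halt) */
            running = 0;
            break;
        }
    }

    int main(void)
    {
        mem[0] = 0x1120;                     /* add r1, r2        */
        mem[1] = 0x0000;                     /* treated as a halt */
        regs[1] = 2;
        regs[2] = 3;

        while (running)
            step();

        printf("r1 = %u\n", (unsigned)regs[1]);
        return 0;
    }

Swap the made-up encoding for the one in your processor's manual and the shape of the loop stays the same; only the decode switch grows.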
Depending on how it is implemented, your disassembler might start at the beginning of the program and disassemble straight through to the end; your simulator would start at the beginning but follow the flow of execution, which is not necessarily beginning to end.
Old arcade and game console emulators like MAME have processor cores that you can look at. Unfortunately, especially with MAME, the code is designed for execution speed rather than readability, and most of it is completely unreadable. There are some readable simulators out there if you look, though.
A friend pointed me at this book, http://www1.idc.ac.il/tecs/, which I would like to read but have not yet gotten to. Perhaps it is just the book you are looking for.
Sure, hardware has evolved beyond the trivial state machines that took many clocks to fetch, decode, and execute serially. My guess is that if you understand the classic fetch, decode, and execute cycle, that is enough for this question. Then you may have other, more specific questions, or perhaps I misunderstood the question and you really wanted to understand the memory bus and not the decoder.