For processors that use memory (most of them) there is a memory interface of some sort; some have names (AMBA, AXI, Wishbone), some do not. From the processor's perspective it is an address, data, and a request to either read or write whatever is at that address. In the good old days you would have a single bus, and your flash, ram, and peripherals would sit on that bus watching certain (usually upper) address bits to determine if they were being addressed; if so they would read from or drive the data bus, otherwise remain tristated. Today, depending on the chip, some of that memory decoding happens in or close to the core, and your public interface to the core or chip might be several busses: perhaps a specific flash bus, a specific sram bus, a specific dram bus, etc.
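As a rough sketch of that old single-bus decode, where each device watches the upper address bits (the address map here is made up purely for illustration):

```python
def decode(address):
    """Return which device responds to this bus address, based on bits [31:28]."""
    top = (address >> 28) & 0xF
    if top == 0x0:
        return "flash"        # hypothetical: flash at 0x00000000
    if top == 0x2:
        return "ram"          # hypothetical: ram at 0x20000000
    if top == 0x4:
        return "peripheral"   # hypothetical: peripherals at 0x40000000
    return "none"             # nobody drives the data bus; it stays tristated
```

Each "device" only compares a few upper bits; the lower bits select a location within the device.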
So the first problem you have is a flat linear address space: even if divided up into flash and ram, the ram portion is flat, address 0 to N-1 for N bytes. For a non-embedded operating system, life would be much easier if every program could simply assume it started at address 0, or 0x100, or 0x8000, instead of having to be somehow compiled for whatever the next free memory space happens to be, or forcing the operating system to completely move one program out of lower memory and replace it with another whenever task switching. An old, easy way was Intel's segment:offset scheme. Programs always started at the same place because the code segment was adjusted before launching the program and the offset was used for execution (a very simplified view of this model); when task switching among programs you just change the code segment and restore the pc for the next program. One program could be at address 0x1100 and another at 0x8100, but both programs think they are at address 0x0100. Easy for all the developers. MMUs provide the same functionality by taking the address on the processor bus and calling it a virtual address; the mmu normally sits up close to the processor, between the processor's memory interface and the rest of the chip/world. So the mmu could see address 0x0100, look it up in a table, and go to physical address 0x0100; then when you task switch you change the table so the next fetch of 0x0100 goes to 0x1100. Each program thinks it is operating at address 0x0100, and linking, compiling, developing, loading, and executing code is less painful.
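Both schemes can be sketched in a few lines. The addresses are the ones from the paragraph above; everything else is illustrative:

```python
# Intel real-mode style: physical address = segment * 16 + offset.
# The loader picks the segment; the program always runs at the same offset.
def seg_offset(segment, offset):
    return (segment << 4) + offset

# Program A loaded at 0x1100, program B at 0x8100, both "at" offset 0x0100:
# seg_offset(0x0100, 0x0100) -> 0x1100
# seg_offset(0x0800, 0x0100) -> 0x8100

# MMU style: a per-task lookup table maps virtual to physical, and a
# task switch just swaps which table the mmu consults.
task_a_table = {0x0100: 0x0100}   # task A: virtual 0x0100 -> physical 0x0100
task_b_table = {0x0100: 0x1100}   # task B: virtual 0x0100 -> physical 0x1100

def translate(table, vaddr):
    return table[vaddr]
```

Either way, every program sees the same starting address; only the loader or the table changes.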
The next features are caching, memory protection, etc. The processor and its memory controller may decode some addresses before reaching the mmu, perhaps certain core registers and perhaps the mmu controls themselves. But other things like memory and peripherals are addressed on the other side of the mmu, on the other side of the cache, which is often the next layer of the onion outside the mmu. When polling your serial port to see if another byte is available, for example, you don't want the data access to be cached such that the first read of the serial port status register actually goes out on the physical bus and touches the serial port, while all subsequent reads return the stale version in the cache. You do want this for ram values, that is the purpose of the cache, but for volatile things like status registers it is very bad. So depending on your system you are likely not able to turn on the data cache until the mmu is enabled. The memory interface on an ARM, for example, has control bits that indicate what type of access it is: non-cacheable or cacheable, part of a burst, that sort of thing. So you can enable instruction caching independent of data caching, and with the mmu off those control signals pass straight through to the cache controller, which in turn is connected to the outside world (if it didn't handle the transaction itself). So your instruction fetches can be cached, everything else not cached. But to cache data ram accesses while not caching status registers from the serial port, what you need to do is set up the tables for the mmu. In your embedded environment you may choose to simply map the ram one to one, meaning virtual address 0x1000 becomes physical address 0x1000, but you can now set the data cache enable bit for that chunk of memory. Then for your serial port you also map virtual to physical addresses, but you clear the data cache enable bit for that chunk of memory space.
Now you can enable the data cache: memory reads are cached (because the control signals, as they pass through the mmu, are marked as such), but your register accesses carry control signals indicating non-cacheable.
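A minimal sketch of such a table, assuming 4-kbyte chunks, a one-to-one mapping, and a made-up serial port address:

```python
PAGE = 0x1000  # assume 4-kbyte chunks

# One entry per virtual chunk: physical base plus a cacheable control bit.
# ram at 0x1000 is mapped one to one and cacheable; a hypothetical serial
# port register page at 0x40000000 is mapped one to one, non-cacheable.
page_table = {
    0x1000: {"phys": 0x1000, "cacheable": True},
    0x40000000: {"phys": 0x40000000, "cacheable": False},
}

def translate(vaddr):
    """Return (physical address, cacheable) for this access."""
    entry = page_table[vaddr & ~(PAGE - 1)]
    return entry["phys"] | (vaddr & (PAGE - 1)), entry["cacheable"]
```

The cacheable flag is what travels on with the transaction, so the cache controller caches ram reads but lets every serial port read go all the way out to the device.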
You certainly do not have to map virtual to physical one to one; it depends on embedded or not embedded, operating system or not, etc. But this is where your protection comes in, easiest to see in an operating system. An application at the application layer should not be allowed to get at protected system memory, the kernel, etc., and should not be able to clobber a fellow application's memory space. So when the application is switched in, the mmu tables reflect what memory it is allowed to access and what memory it is not. Any address not permitted to the program is caught by the mmu, an exception/fault (interrupt) is generated, and the kernel/supervisor gets control and can deal with that program. You may remember the term "general protection fault" from the earlier Windows days, before marketing and other interest groups in the company decided to change the name. It came straight out of the Intel manual: that interrupt was fired when you had a fault that didn't fall into the other categories, like a multiple choice question on a test, A bob, B ted, C alice, D none of the above. The general protection fault was the none-of-the-above category, yet the most widely hit, because that is what you got when your program tried to access memory or i/o outside its allocated memory space.
Another benefit from mmus is malloc. Before mmus, the memory allocator had to use schemes to re-arrange memory to keep large empty blocks available for the next big malloc, to minimize the "with 4meg free why did my 1kbyte alloc fail?" problem. Now, like a disk, you chop the memory space up into chunks of 4kbytes or some such size. For a malloc that is one chunk or less in size, take any free chunk in memory, use an mmu table entry to point at it, and give the caller the virtual address tied to that mmu entry. If you want 4096*10 bytes, the trick is not finding that much linear memory but finding 10 linear mmu table entries: take any 10 chunks of memory (not necessarily adjacent) and put their physical addresses in the 10 mmu entries.
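A toy version of that allocator, assuming 4096-byte chunks and a made-up free list of scattered physical chunks:

```python
PAGE = 4096
free_pages = [7, 3, 12, 0, 9, 5, 11, 2, 8, 14]  # scattered free physical chunks
mmu_table = {}      # virtual chunk number -> physical chunk number
next_vpage = [0]    # next unused virtual chunk number

def mmu_malloc(nbytes):
    npages = -(-nbytes // PAGE)          # round up to whole chunks
    base = next_vpage[0]
    for i in range(npages):
        # any free physical chunk will do; adjacency does not matter
        mmu_table[base + i] = free_pages.pop()
    next_vpage[0] += npages
    return base * PAGE                   # caller sees one linear virtual range
```

A 4096*10-byte request succeeds so long as 10 free chunks exist anywhere in physical memory, which is the whole point.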
The bottom line, "how" it does it: it sits, usually, between the processor and the cache, or if there is no cache, the physical memory bus. The mmu logic looks at the address and uses it to look into a table. The bits in the table include the physical address plus some control signals (including cacheable) plus some way of indicating whether this is a valid entry or a protected region. If the address is protected, the mmu fires an interrupt/event back to the core. If valid, it modifies the virtual address to become the physical address on the other/outside of the mmu, and bits like the cacheable bit are used to tell whatever is on the other side of the mmu what type of transaction this is: instruction, data, cacheable, burst, etc. For an embedded, non-os, single-tasking system you may only need a single mmu table. In an operating system, a quick way to perform protection would be to have a table per application, or a subset of the table (tree-like, similar to a directory structure), such that when you task switch you only have to change one thing, the start of the table or the start of one branch of the tree, to change the virtual to physical addresses and allocated memory (protection) for that branch of the tree.
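Putting those pieces together, the lookup described above can be sketched as follows (the entry layout and addresses are made up; real table formats are architecture specific):

```python
PAGE = 0x1000

class MmuFault(Exception):
    """Stands in for the exception/interrupt fired back at the core."""

# Each entry: valid bit, protected bit, physical base, control bits (cacheable).
table = {
    0x0000: {"valid": True, "protected": False, "phys": 0x8000, "cacheable": True},
    0x4000: {"valid": True, "protected": True,  "phys": 0xC000, "cacheable": False},
}

def mmu(vaddr, user_mode=True):
    entry = table.get(vaddr & ~(PAGE - 1))
    if entry is None or not entry["valid"]:
        raise MmuFault("no valid mapping for %#x" % vaddr)
    if entry["protected"] and user_mode:
        raise MmuFault("protection violation at %#x" % vaddr)
    # physical address plus control signals for whatever sits outside the mmu
    return entry["phys"] | (vaddr & (PAGE - 1)), entry["cacheable"]
```

Swapping `table` (or one branch of a tree of tables) on a task switch is all it takes to give the next program its own mappings and its own protection.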