The two equations for M just express a relationship. They are two ways of saying the same thing, and they do not indicate causality. I think the author's assumption is that the number of unique address bits m is fixed by the CPU designer at the start, as a requirement. M can then vary per implementation.
m is the width in bits of a memory address in your system, e.g. 32 for 32-bit x86, 64 (nominally) for x86-64. In a cache context, block size is the amount of data moved between main memory and the cache as one unit (a cache line). On current x86 parts that is typically 64 bytes, so b=6; the 4K figure often quoted for x86 is the page size used by virtual memory, which is a different thing. I believe the tag bits are the upper t bits of the address, which are compared against the tag stored with each cached line to tell whether that line (held very close to the CPU, not even in RAM) contains the data you want. I'm not sure about the set lines part, although I can make plausible guesses that wouldn't be especially reliable.
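To make the tag / set / block-offset idea concrete, here is a minimal sketch. It assumes the usual textbook layout, where an m-bit address is split (high to low) into t tag bits, s set-index bits and b block-offset bits with t = m - s - b. The function name `split_address` and the values m=32, s=7, b=6 are made up for illustration, not taken from the question.

```python
def split_address(addr, m, s, b):
    """Split an m-bit address into (tag, set_index, block_offset).

    Assumed layout, high bits to low bits: tag | set index | block offset.
    """
    t = m - s - b  # whatever is left over after set index and offset
    if t < 0:
        raise ValueError("s + b exceeds the address width m")
    if not 0 <= addr < (1 << m):
        raise ValueError("address does not fit in m bits")
    block_offset = addr & ((1 << b) - 1)       # lowest b bits
    set_index = (addr >> b) & ((1 << s) - 1)   # next s bits
    tag = addr >> (s + b)                      # remaining upper t bits
    return tag, set_index, block_offset


# Hypothetical cache: 32-bit addresses, 128 sets (s=7), 64-byte blocks (b=6).
tag, set_index, offset = split_address(0x12345678, m=32, s=7, b=6)
print(tag, set_index, offset)  # 37282 89 56
```

Putting the three fields back together, `(tag << 13) | (set_index << 6) | offset`, gives the original address again, which is a handy sanity check.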
Circular ... yes, but I think it's just stating that the two variables m and M must obey the equation. M would likely be a given or assumed quantity.
Example 1: If you wanted to use the formulas for a main memory size of M = 4GB (4,294,967,296 bytes), then m would be 32, since M = 2^32, i.e. m = log2(M). That is, it would take 32 bits to address the entire main memory.
Example 2: If the assumed main memory size were smaller, e.g. M = 16MB (16,777,216 bytes), then m would be 24, which is log2(16,777,216).
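If you want to check the two examples yourself, here is a small sketch. The helper name `address_bits` is my own, and it assumes M is an exact power of two, as in both examples.

```python
def address_bits(memory_size):
    """Return m such that 2**m == memory_size (memory_size must be a power of two)."""
    if memory_size <= 0 or memory_size & (memory_size - 1):
        raise ValueError("memory_size must be a positive power of two")
    # A power of two 2**m has exactly m + 1 binary digits.
    return memory_size.bit_length() - 1


print(address_bits(4 * 1024**3))   # M = 4GB  -> 32
print(address_bits(16 * 1024**2))  # M = 16MB -> 24
```

Going the other way, `2**32` gives 4,294,967,296 and `2**24` gives 16,777,216, so the two equations really are the same statement read in opposite directions.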
m & M are related to each other, not defined in terms of each other. They call M a derived quantity, however, since the processor/controller is usually the limiting factor through the word length it uses.
On a real system they are predefined. If you have an 8-bit processor, in the simplest case it handles 8-bit memory addresses (m = 8). Since 8 bits can represent 256 values, you can have a total of 256 memory addresses (M = 2^8 = 256). As you can see, we start with the little m because of the processor constraints, but you could just as well decide you want a memory space of size M and use that to select a processor that can handle it, based on word size = log2(M).
Now if we take your assumptions for your example,
512 sets, 8 blocks per set, 32 words per block, 8 bits per word
I have to assume this is an 8-bit processor given the 8-bit words. At that point your described cache holds 512 * 8 * 32 = 131,072 words, which is far larger than your address space (256 words) & therefore pretty meaningless.
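Here is the arithmetic behind that as a quick sketch. The m = 8 value is my assumption from the 8-bit words, and the tag formula t = m - s - b is the usual textbook one, not something stated in your example.

```python
# Cache parameters from the example.
sets = 512
lines_per_set = 8
words_per_block = 32

# Assumption: 8-bit words imply 8-bit addresses, i.e. m = 8.
m = 8

cache_words = sets * lines_per_set * words_per_block
address_space_words = 2**m

s = sets.bit_length() - 1             # 512 = 2**9 -> 9 set-index bits
b = words_per_block.bit_length() - 1  # 32 = 2**5  -> 5 block-offset bits
tag_bits = m - s - b                  # assumed formula t = m - (s + b)

print(cache_words)          # 131072
print(address_space_words)  # 256
print(tag_bits)             # -6
```

A negative tag width is another way of seeing the problem: the set index and block offset alone already need 14 bits, more than the whole 8-bit address.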
You might want to check out Computer Architecture Animations & Java applets. I don't recall if any of the cache ones go into the cache structure (usually they focus on behavior), but it is a resource I saved in the past to tutor students in architecture.
Feel free to further refine your question if it still doesn't make sense.
It seems you're confused by the math rather than the architectural stuff.
2^m ("2 to the m'th power") is 2 * 2 * ... * 2 with m 2's. 2^1 = 2, 2^2 = 2 * 2 = 4, 2^3 = 2 * 2 * 2 = 8, and so on. Notably, if you have an m-bit binary number, you can only represent 2^m different numbers. (Is this obvious? If not, it might help to replace the 2's with 10's and think about decimal digits: with m decimal digits you can write 10^m different numbers.)
log2(x) ("logarithm base 2 of x") is the inverse function of 2^x. That is, log2(2^x) = x for all x. (This is a definition!)
You need log2(M) bits to represent M different numbers (rounded up to the next whole number if M is not a power of 2).
Note that if you start with M=2^m and take log2 of both sides, you get log2(M)=m. The table is just being very explicit.
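If it helps to see the numbers, here is a small sketch of both directions. The helper name `bits_needed` is my own, and it rounds up for values of M that are not powers of two, which is an addition beyond what the table covers.

```python
def bits_needed(count):
    """Smallest number of bits that can distinguish `count` different values."""
    if count < 1:
        raise ValueError("count must be at least 1")
    # The values are numbered 0 .. count-1; the largest needs this many bits.
    return (count - 1).bit_length()


# M = 2^m direction: m bits give 2**m different numbers.
for m in (1, 2, 3, 8):
    print(m, 2**m)  # 1 2 / 2 4 / 3 8 / 8 256

# m = log2(M) direction: undoing the power gives m back.
for m in (1, 2, 3, 8, 24, 32):
    assert bits_needed(2**m) == m

print(bits_needed(256))   # 8
print(bits_needed(1000))  # 10, since 2**9 = 512 < 1000 <= 1024 = 2**10
```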