[edit: the question only mentions Concurrent Haskell, but the paper referenced is, I believe, "Composable Memory Transactions", the paper in which Haskell STM was first described. Please correct me if I'm wrong here.]
STM works just fine on multiple cores now. The parallel implementation was first shipped in GHC 6.6, and uses a fine-grained two-phase locking strategy; that is, to commit a transaction the implementation first attempts to lock each variable involved in the transaction, then commits the changes, and finally unlocks all the variables. Acquiring a lock does not block: if the lock is already held, then the transaction aborts and retries (this avoids the usual lock-order-reversal deadlock that would apply if lock acquisition was blocking).
This STM implementation is certainly not the fastest - the literature describes many alternative techniques that would result in better performance, but GHC's implementation is relatively straightforward and doesn't involve any global locks (transactions operating on distinct sets of variables can proceed in parallel without interference).