There's a basic problem with picking one example to "truly experience the pain": threading is used for quite a few things that are (almost) completely different from each other.
1. Maintain a UI in one thread and compute in another
2. Split a large computation across many processors
3. Simulate discrete entities, each acting independently of the others
4. Decouple code that deals with devices that work at different speeds
Technically, #1 is a special case of #4, but it's so common, and has enough (nearly) unique requirements, that it's generally easiest to keep it separate.
Of course, that list isn't exhaustive, but I hope it gives some idea of the situation: different uses of threading have substantially different concerns, and the "pain" isn't the same for all of them.
Performance concerns, for one example, vary widely across those applications. In some cases almost all of your threads are idle nearly all the time, so there's virtually no contention when you need to synchronize. In other cases, you'll nearly always have lots of threads active, with far more contention over synchronization.
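To make the difference concrete, here is a rough sketch in Python (chosen only for brevity; the question isn't tied to any language). The function `run_workers` and its parameters are hypothetical names invented for this illustration. It counts how often a worker finds a shared lock already held, first with threads that mostly sleep, then with threads that do nothing but hit the lock. Under CPython the global interpreter lock keeps the busy threads from truly running in parallel, so treat the numbers as indicative at best.

```python
import threading
import time


def run_workers(num_threads, iterations, idle_seconds):
    """Run workers that each bump a shared counter under a single lock.

    Returns (final_count, contended_acquires), where contended_acquires is
    the number of times a worker found the lock already held and had to wait.
    All names here are hypothetical, chosen for this illustration only.
    """
    lock = threading.Lock()
    state = {"count": 0, "contended": 0}

    def worker():
        for _ in range(iterations):
            if idle_seconds:
                # Stand-in for waiting on a user, a device, a socket, etc.
                time.sleep(idle_seconds)
            # Try a non-blocking acquire first so we can tell if we had to wait.
            waited = not lock.acquire(blocking=False)
            if waited:
                lock.acquire()
            try:
                state["count"] += 1
                if waited:
                    state["contended"] += 1
            finally:
                lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["count"], state["contended"]


if __name__ == "__main__":
    # Mostly idle threads: the lock is usually free when someone wants it.
    print("mostly idle:", run_workers(4, 50, 0.001))
    # Always-busy threads: every iteration goes straight for the lock.
    print("always busy:", run_workers(8, 50_000, 0))
```

The contended counts vary from run to run and from machine to machine, so I won't quote any. The point is only that the same locking code behaves quite differently depending on which kind of threading you're doing.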