It was slow on the ancient Pentium MMX, but on more modern processors it is very fast.
Still, MMX is mostly obsolete today. Use SSE2 instead: it works on its own XMM registers rather than aliasing the x87 register stack, so you'll have no problems interleaving it with FPU code.
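To illustrate, here is a minimal sketch, assuming GCC or Clang on x86 with SSE2 enabled (the default on x86-64; 32-bit builds need `-msse2`). The function names `add_i16x8` and `sum_after_simd_add` are made up for this example. The integer SIMD add runs on XMM registers. The `long double` loop after it typically compiles to x87 code on those targets and needs no EMMS in between.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Add eight 16-bit integers at once using XMM registers (hypothetical helper). */
static void add_i16x8(const int16_t *a, const int16_t *b, int16_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi16(va, vb));
}

/* SIMD add followed directly by floating-point math; no EMMS is required
   because SSE2 does not touch the x87 register stack. */
static long double sum_after_simd_add(const int16_t a[8], const int16_t b[8])
{
    int16_t out[8];
    long double total = 0.0L;

    add_i16x8(a, b, out);
    for (int i = 0; i < 8; i++)
        total += out[i];
    return total;
}
```

With MMX intrinsics (`_mm_add_pi16` and friends) the same sequence would need an `_mm_empty()` call, which emits EMMS, before the floating-point loop.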
Also, RDTSC is not a serializing instruction, so the CPU can execute it in parallel with other instructions. That explains your measurement: the CPU simply started executing both RDTSCs and the EMMS in the same clock cycle. If you want to measure the time a piece of code takes, you must serialize both RDTSCs with regard to that code; the CPUID instruction is usually used for that. The serializing instructions take CPU cycles themselves, so you also have to measure how many cycles the measurement rig takes with no code in it, and subtract that from your results.
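A measurement rig along those lines might look roughly like this. It is only a sketch, assuming GCC or Clang inline assembly on x86-64. The names `serialized_rdtsc`, `rig_overhead`, `measure` and `sample_work` are hypothetical, and taking the minimum over several runs is my own choice for filtering out interrupts and cache misses, not something the hardware requires.

```c
#include <stdint.h>

/* CPUID is a serializing instruction: everything before it completes
   before RDTSC reads the time-stamp counter. */
static inline uint64_t serialized_rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__(
        "xor %%eax, %%eax\n\t"   /* CPUID leaf 0 */
        "cpuid\n\t"
        "rdtsc\n\t"
        : "=a"(lo), "=d"(hi)
        :
        : "rbx", "rcx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Cycles taken by the empty rig: two serialized reads with no code between. */
static uint64_t rig_overhead(int runs)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = serialized_rdtsc();
        uint64_t t1 = serialized_rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Best-case cycle count of fn(), with the rig overhead subtracted. */
static uint64_t measure(void (*fn)(void), int runs)
{
    uint64_t overhead = rig_overhead(runs);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = serialized_rdtsc();
        fn();
        uint64_t t1 = serialized_rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best > overhead ? best - overhead : 0;
}

/* Placeholder workload, only here so the rig has something to time. */
static void sample_work(void)
{
    volatile unsigned int sink = 0;
    for (unsigned int i = 0; i < 10000; i++)
        sink += i;
}
```

Calling `measure(sample_work, 100)` gives an approximate cycle count for the workload. The result is in TSC ticks, which on recent CPUs run at a constant rate and are not necessarily core clock cycles.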
The last point is that even on the Pentium MMX the EMMS instruction itself finished quickly; it was the first FPU instruction after it that got a nasty delay.