TSCs (what rdtsc uses) are often not synchronized on multi-processor systems. It may help to set the CPU affinity in order to bind the process to a single CPU.
You could also get timestamps from HPET timers if available, which aren't prone to the same problem.
As for repeatability, those variances are true. You could disable caching, give a realtime priority to the process and/or (if on Linux or something similar) recompile your kernel with a lower, fixed timer interrupt frequency (the one that does time-slicing). You can't eliminate the variance completely, at least not easily and not on regular CPU + OS combos.
In general, for easy coding, reliability and portability reasons, I suggest you use what the OS has to offer. If it offers high-precision timers, use the appropriate OS helper.
(Just in case you're trying a time attack on a crypto system, well, you're going to have to live with 1. this randomness and 2. general defenses that make the system unpredictable for good reasons, so the function might not be deterministic with respect to time.)
EDIT: added paragraph about timers the OS can offer.
EDIT: This refers to Linux. For binding a process to a single CPU (to have an accurate read from RDTSC), you can use sched_setaffinity(2). And here's some code from one of my projects using it for some other purpose (mapping threads to CPUs). This should be your first try. As for HPETs, you can use regular POSIX calls like these, as long as the kernel and the machine support these timers.