Caching (e.g. branch target caching), parallel load units (part of pipelining, but also things like "hit under miss" which don't stall the pipeline), and out-of-order execution are likely to help transform a load-load-branch into something that is closer to a fixed branch. Instruction folding/elimination (what's the proper term for this?) in the decode or branch prediction stage of the pipeline may also contribute.
All of this relies on a lot of different things, though: how many different branch targets there are (e.g. how many different virtual overloads are you likely to trigger), how many things you loop over (is the branch target cache "warm"? how about the icache/dcache?), how the virtual tables or indirection tables are laid out in memory (are they cache-friendly, or is each new vtable load possibly evicting an old vtable?), is the cache being invalidated repeatedly due to multicore ping-ponging, etc...
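To make the pattern concrete, here's a hypothetical C++ loop of the kind being discussed (all names made up); the comments sketch roughly what each virtual call lowers to, though the exact instructions vary by compiler, ABI, and target:

    #include <cstddef>
    #include <cstdio>

    struct Shape {
        virtual ~Shape() = default;
        virtual float area() const = 0;
    };

    struct Circle : Shape {
        float r;
        explicit Circle(float radius) : r(radius) {}
        float area() const override { return 3.14159f * r * r; }
    };

    float total_area(const Shape* const* shapes, std::size_t n) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < n; ++i) {
            // Each shapes[i]->area() is roughly:
            //   load vptr   <- [shapes[i]]    ; load the vtable pointer from the object
            //   load target <- [vptr + slot]  ; load the function address from the vtable
            //   branch target                 ; indirect branch/call to area()
            sum += shapes[i]->area();
        }
        return sum;
    }

    int main() {
        Circle c(2.0f);
        const Shape* shapes[] = { &c, &c, &c };
        std::printf("%f\n", total_area(shapes, 3));
    }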
(Disclaimer: I'm definitely not an expert here, and a lot of my knowledge comes from studying in-order embedded processors, so some of this is extrapolation. If you have corrections, feel free to comment!)
The correct way to determine if it's going to be a problem for a specific program is of course to profile. If you can, do so with the help of hardware counters -- they can tell you a lot about what's going on in the various stages of the pipeline.
Edit:
As Hans Passant points out in a comment above (http://stackoverflow.com/questions/3487937/modern-cpu-inner-loop-indirection-optimizations/3487962#3487962), the key to getting these two things to take the same amount of time is the ability to effectively "retire" more than one instruction per cycle. Instruction elimination can help with this, but superscalar design is probably more important (hit under miss is a very small and specific example; fully redundant load units might be a better one).
Let's take an ideal situation, and assume a direct branch is just one instruction:
branch dest
...and an indirect branch is three (maybe you can get it in two, but it's greater than one):
load vtable from this
load dest from vtable
branch dest
Let's assume an absolutely perfect situation: *this and the entire vtable are in L1 cache, and L1 is fast enough to support an amortized cost of one cycle per instruction for the two loads. (You can even assume the processor reordered the loads and intermixed them with earlier instructions so they complete before the branch; it doesn't matter for this example.) Also assume the branch target cache is hot, there's no pipeline flush cost for the branch, and the branch instruction comes down to a single cycle (amortized).
The theoretical minimum time for the first example is therefore 1 cycle (amortized).
The theoretical minimum for the second example, absent instruction elimination or redundant functional units or something that will allow retiring more than one instruction per cycle, is 3 cycles (there are 3 instructions)!
The indirect branch will always be slower, because there are more instructions, until you reach for something like a superscalar design that allows retiring more than one instruction per cycle.
Once you have this, the minimum for both examples becomes something between 0 and 1 cycles, again, provided everything else is ideal. Arguably you have to have more ideal circumstances for the second example to actually reach that theoretical minimum than for the first example, but it's now possible.
In some of the cases you'd care about, you're probably not going to reach that minimum for either example. Either the branch target cache will be cold, or the vtable won't be in the data cache, or the machine won't be capable of reordering the instructions to take full advantage of the redundant functional units.
...this is where profiling comes in, which is generally a good idea anyway.
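If you want a crude first look before reaching for hardware counters, even a wall-clock micro-benchmark can be suggestive. This is only a sketch with made-up types and numbers: it measures elapsed time rather than what the pipeline is doing, and a compiler that can prove the dynamic type (it can see `new Impl` below) may devirtualize the loop entirely, so treat the results with suspicion and profile your real program instead.

    #include <chrono>
    #include <cstdio>
    #include <memory>

    struct Base {
        virtual ~Base() = default;
        virtual unsigned step(unsigned x) const = 0;
    };

    struct Impl : Base {
        unsigned step(unsigned x) const override { return x * 2u + 1u; }
    };

    static unsigned direct_step(unsigned x) { return x * 2u + 1u; }

    template <typename F>
    static long long time_loop(F&& f) {
        const auto start = std::chrono::steady_clock::now();
        unsigned acc = 0;
        for (int i = 0; i < 100000000; ++i) acc = f(acc);
        const auto stop = std::chrono::steady_clock::now();
        std::printf("(acc=%u) ", acc);  // keep the result live so the loop isn't optimized away
        return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    }

    int main() {
        const std::unique_ptr<Base> obj(new Impl);
        std::printf("direct:  %lld ms\n", time_loop([](unsigned x) { return direct_step(x); }));
        std::printf("virtual: %lld ms\n", time_loop([&](unsigned x) { return obj->step(x); }));
    }

Hardware counters (branch mispredictions, cache misses) from perf, VTune, or your platform's equivalent will tell you far more about why the numbers differ than elapsed time alone.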
You can also simply adopt a slight paranoia about virtuals in the first place. See Noel Llopis's article on data oriented design, the excellent Pitfalls of Object-Oriented Programming slides, and Mike Acton's grumpy-yet-educational presentations. Now you've suddenly moved into patterns that the CPU is already likely to be happy with, if you're processing a lot of data.
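As a rough illustration of where that style of thinking leads (hypothetical types again): instead of a heterogeneous array of base pointers dispatched virtually per element, keep each concrete type in its own contiguous array and process each batch with a plain loop, which gives you dense cache lines, predictable branches, and no per-element vtable load.

    #include <vector>

    struct Circle { float r; };
    struct Square { float side; };

    // One tight, non-virtual loop per concrete type.
    float total_area(const std::vector<Circle>& circles,
                     const std::vector<Square>& squares) {
        float sum = 0.0f;
        for (const Circle& c : circles) sum += 3.14159f * c.r * c.r;
        for (const Square& s : squares) sum += s.side * s.side;
        return sum;
    }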
High level language features like virtual are usually a tradeoff between expressiveness and control. I honestly think, though, that just by increasing your awareness of what virtual is actually doing (don't be afraid to read the disassembly view from time to time, and definitely peek at your CPU's architecture manuals), you'll tend to use it when it makes sense and not when it doesn't, and a profiler can cover the rest if needed.
One-size-fits-all statements about "don't use virtual" or "virtual use is unlikely to make a measurable difference" make me grouchy. The reality is usually more complicated, and either you're going to be in a situation where you care enough to profile or avoid it, or you're in that other 95% where it's probably not worth caring except for the possible educational content.