PIC18
The PIC18 answer given by TK results in the following instructions (binary):
overflow
PUSH
0000 0000 0000 0101
CALL overflow
1110 1100 0000 0000
1111 0000 0000 0000
However, CALL alone is enough to overflow the stack:
CALL $
1110 1100 0000 0000
1111 0000 0000 0000
Smaller, faster PIC18
But RCALL (relative call) is smaller still. Its target is encoded as an offset from the current instruction rather than as an absolute address, so the extra 2 bytes are not needed:
RCALL $
1101 1111 1111 1111
So the smallest on the PIC18 is a single 16-bit (two-byte) instruction. Each loop takes 2 instruction cycles; at 4 clock cycles per instruction cycle, that is 8 clock cycles per loop. The PIC18 has a 31-level stack, so the 32nd call overflows it, after 256 clock cycles. At 64MHz, you would overflow the stack in 4 microseconds using 2 bytes.
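The arithmetic above can be checked with a short Python sketch. The function name `overflow_time_us` and its parameters are invented here for illustration; it assumes, as the text does, that the overflow happens on the first call after the stack is full and that every call costs 2 instruction cycles of 4 clocks each.

```python
def overflow_time_us(stack_levels, clock_hz, cycles_per_call=2, clocks_per_cycle=4):
    """Return (clock cycles, microseconds) until the hardware stack overflows.

    Assumption: the call that overflows is number stack_levels + 1.
    """
    calls = stack_levels + 1
    clocks = calls * cycles_per_call * clocks_per_cycle
    micros = clocks * 1e6 / clock_hz
    return clocks, micros


# PIC18: 31-level stack at 64MHz
clocks, micros = overflow_time_us(stack_levels=31, clock_hz=64_000_000)
print(clocks, micros)  # prints: 256 4.0
```

Passing a different stack depth and clock frequency gives the corresponding figures for other parts.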
PIC16F5x (even smaller and faster)
However, the PIC16F5x series uses 12-bit instructions:
CALL $
1001 0000 0000
Again, that is two instruction cycles per loop at 4 clocks per instruction cycle, so 8 clock cycles per loop.
However, the PIC16F5x has a two-level stack, so the third call overflows it, after 24 clock cycles. At 20MHz, it would overflow in 1.2 microseconds using 1.5 bytes.
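To see why a two-level stack overflows on the third call, here is a toy Python model of a fixed-depth return stack. It is only a sketch: `calls_until_overflow` is a made-up name, and the model just counts pushes rather than imitating what the real hardware does with the lost return address.

```python
def calls_until_overflow(stack_levels):
    """Count self-calls issued until a push no longer fits on the stack."""
    depth = 0
    calls = 0
    while True:
        calls += 1                  # execute one more CALL $
        if depth == stack_levels:   # stack already full: this push overflows
            return calls
        depth += 1                  # return address pushed successfully


calls = calls_until_overflow(2)     # two-level stack
clocks = calls * 2 * 4              # 2 instruction cycles x 4 clocks per call
print(calls, clocks, clocks / 20)   # clocks / MHz gives microseconds
# prints: 3 24 1.2
```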
Intel 4004
The Intel 4004 has an 8-bit call subroutine instruction:
CALL $
0101 0000
For the curious, that corresponds to an ASCII 'P'. With a 3-level stack, that takes 24 clock cycles, for a total of 32.4 microseconds and one byte. (Unless you overclock your 4004 - come on, you know you want to.)
That is as small as the Befunge answer, but much, much faster than Befunge code running in current interpreters.
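The two 4004 figures quoted above (the ASCII character and the 32.4 microseconds) can be reproduced with a few lines of Python. The 740kHz clock is an assumption here: the text does not state the frequency, but it is the value that matches the quoted time.

```python
opcode = 0b0101_0000           # bit pattern shown above for CALL $
print(chr(opcode))             # prints: P

clock_hz = 740_000             # assumed 4004 clock frequency (not stated in the text)
clock_cycles = 24              # figure quoted in the text
micros = clock_cycles * 1e6 / clock_hz
print(round(micros, 1))        # prints: 32.4
```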