In wired ethernet at least, there is no "synchronous clock" that times the beginning of every frame. There is a minimum frame size, but it's more like 64 bytes instead of 1500. There are also minimum gaps between frames, but that might only apply to shared-access networks (ATM and modern ethernet is switched, not shared-access). It is the maximum size that is limited to 1500 bytes on virtually all ethernet equipment.
But the smaller your packets get, the higher the ratio of framing headers to data. Eventually you are spending 40-50 bytes of overhead for a single byte. And more for its acknowledgement.
If you could just hold for a moment and collect another byte to send in that packet, you have doubled your network efficiency. (this is the reason for Nagle's Algorithm)
There is a tradeoff on a channel with errors, because the longer frame you send, the more likely it experience an error and will have to be retransmitted. Newer wireless standards load up the frame with forward error correction bits to avoid retransmissions.
The classic example of "tinygrams" is 10,000 users all sitting on a campus network, typing into their terminal session. Every keystroke produces a single packet (and acknowledgement).... At a typing rate of 4 keystrokes per second, That's 80,000 packets per second just to move 40 kbytes per second. On a "classic" 10mbit shared-medium ethernet, this is impossible to achive, because you can only send 27k minimum sized packets in one second - excluding the effect of collisions:
96 bits inter-frame gap
+ 64 bits preamble
+ 112 bits ethernet header
+ 32 bits trailer
-----------------------------
= 304 bits overhead per ethernet frame.
+ 8 bits of data (this doesn't even include IP or TCP headers!!!)
----------------------------
= 368 bits per tinygram
10000000 bits/s ÷ 368 bits/packet = 27172 Packets/second.
Perhaps a better way to state this is that an ethernet that is maxed out moving tinygrams can only move 216kbits/s across a 10mbit/s medium for an efficiency of 2.16%