I'm fairly certain you are getting accurate values. It might help if you think of an MPEG stream as, well, a stream. In that case, prior to the IBBPBB that you see there would normally be another GOP. Maybe something like this (using the same notation as the original question):
P(-3,-2) B(-2,-1) B(-1,0)
Basically, the B frames that follow an I frame are based on that I frame and the last P frame from the previous GOP.
While it makes logical sense for a video to start off with this:
Start GOP: IPBBPBBPBB...
Later on it must be
Start GOP: IBBPBBPBBPBB
Start GOP: IBBPBBPBBPBB
Start GOP: IBB...
Remember that decoding any B frame requires a complete frame before it and after it. So each pair of B frames should be displayed before the I or P frame that comes just prior to them in the file.
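To make the reordering concrete, here is a minimal sketch of turning file (coded) order into display order. It is only an illustration of the rule above, not how FFMPEG does it internally; the function name and the "I0"/"B1" frame labels are made up for the example.

```python
def display_order(coded):
    """Reorder frames from coded (file) order to display order.

    `coded` is a list of hypothetical labels such as "I0" or "B1",
    where the first character is the frame type.
    """
    shown = []
    held = None  # most recent I or P frame, decoded but not yet displayed
    for frame in coded:
        if frame[0] == "B":
            # B frames are displayed before the I/P frame that precedes them in the file
            shown.append(frame)
        else:
            # A new I/P frame arrived, so the previously held one can be displayed
            if held is not None:
                shown.append(held)
            held = frame
    if held is not None:
        shown.append(held)
    return shown


print(display_order(["I0", "B1", "B2", "P3", "B4", "B5"]))
```

With the IBBPBB input, both B frames come out ahead of the I frame, which is why their timestamps are earlier than the timestamp of the I frame they follow in the file.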
FFMPEG may just have forgone the "special case" of the first GOP.
Since the first two B frames don't have a prior frame to manipulate, you should be able to safely discard them. Just rebase your timestamps off of the first I frame and adjust the audio stream by the same amount.
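Something along these lines is what I mean by rebasing. This is a rough sketch on plain Python lists, not FFMPEG API calls; the function name, the (pts, type) tuples and the sample values are assumptions for illustration. The numbers assume a 90 kHz time base, so one frame at 24 frames/sec is 3750 ticks.

```python
def rebase_on_first_i_frame(video, audio_pts):
    """Drop video frames displayed before the first I frame and shift all timestamps.

    video     -- list of (pts, frame_type) tuples in any order (hypothetical layout)
    audio_pts -- list of audio timestamps in the same time base
    """
    # Presentation time of the earliest I frame becomes the new zero
    first_i = min(pts for pts, ftype in video if ftype == "I")
    # Frames displayed before that I frame (the leading B frames) are discarded
    kept = [(pts - first_i, ftype) for pts, ftype in video if pts >= first_i]
    # Audio is shifted by the same amount; anything that ends up negative
    # falls before the new start and would need trimming
    shifted_audio = [pts - first_i for pts in audio_pts]
    return kept, shifted_audio


# File order: I, B, B, P, B, B with illustrative timestamps
video = [(7500, "I"), (0, "B"), (3750, "B"), (18750, "P"), (11250, "B"), (15000, "B")]
audio = [0, 7500, 15000]
print(rebase_on_first_i_frame(video, audio))
```

The two leading B frames disappear, the I frame lands at 0, and the audio moves back by the same 7500 ticks.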
Whether this will actually result in a loss of frames will depend on FFMPEG's implementation, but the worst-case scenario is that you lose about 83 milliseconds (2 frames at 24 frames/sec).