Without QuickTime (or an equivalent multimedia framework), what you describe is quite a lot of work. Ordinarily, you would use a video compression algorithm (such as H.264) to encode your images into video, and an audio compression algorithm (such as AAC) to encode your audio track. Then you would write these streams into a container file, such as an MPEG-4 file, which interleaves the streams for playback, contains metadata and indexes and so on. Then for playback, you parse the file, decode the video and audio data, and schedule them for playback, taking care to keep them in sync.
QuickTime does all this (and more) for you, and it would be an enormous undertaking to write it all yourself. Is there some reason why you are running on OS X but cannot use QuickTime?
Given the question is tagged with iPhone, why can't you just use QTKit?
If you had to do it from scratch, you could adopt a very simple solution whereby you store your image sequence as a set of JPEG files (but then you would require libjpeg
; use raw RGB or PPM if you must), the audio track as a raw WAV data, and then have another file (a text file you define) that stored timing information, so you would simply stream out the audio, and have the frame numbers of the images stored with their corresponding timecode/sample offset. That is a very simple solution that could be made to work without too much effort.
If you give us some more idea of what you are trying to achieve, we could offer some more specific suggestions.