Hi, I am quite confused about what happens inside codecs. I want to learn about the elements inside audio encoders and decoders. I would be very happy if you could provide some links to good study material. Thanks.

More precisely, I would like to know how a codec parses a media file.

+1  A: 

Try starting here:

Getting Started with Windows Media Encoder

http://www.microsoft.com/windows/windowsmedia/howto/articles/introencoding.aspx

Further information can be found at codecpage.com

wbogacz
+1  A: 

I learned a good bit about the MPEG-4 format by working on an MPEG-4 decoder. There are many reference (and open source) implementations of encoders and decoders out there for both video and audio. So, hit the books -- starting with Wikipedia: it has good general summaries and links to follow (if you're lucky, to open specs). And then hit the source.

There are so many different ways of encoding something (many involving some form of compression, lossy or lossless), and the entire issue is generally further complicated by also having to deal with the framing container and "sub-formats".

Have fun.

  • Dirac: http://diracvideo.org/specifications/
  • MPEG-4: http://en.wikipedia.org/wiki/MPEG-4
  • JPEG: http://jpeg.org/public/jfif.pdf
pst
+5  A: 

Your title asks about A/V compression, but the rest of your question talks about parsing the media file and identifying its codec. Those are very different tasks: spec'd and implemented by different organizations, performed by different APIs in most multimedia libraries, and above all requiring very different skill sets.

A/V file formats aren't too different from any other file formats, which in turn are just formal grammars. Parsing, validation, and the resulting object graphs are conceptually no different from any other grammar -- and in practice, they tend to be far simpler than the grammars you encounter in a standard CS curriculum (compilers, finite automata). The AVI file format is kind of antiquated at this point, but I'd still recommend starting there because:

  • many of today's more complex formats resemble AVI in whole or in part, or at minimum assume you're familiar with its basic structures
  • AVI is a member of a larger family of multimedia formats known as RIFF, which you'll find used in many other places, such as WAV files (see the chunk-walker sketch after this list)
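
To make that concrete, here is a minimal sketch in Python of a chunk walker for a RIFF-family file such as a WAV. The file name and the decision to skip chunk payloads are assumptions for the example; a real parser would also descend into LIST chunks and actually read the 'fmt ' and 'data' payloads.

    import struct

    def walk_riff_chunks(path):
        """Print the top-level chunks of a RIFF file (e.g. a WAV).

        RIFF layout: 'RIFF' <u32 size> <form type>, followed by a sequence of
        chunks, each being a 4-byte FOURCC id, a little-endian u32 size, and
        'size' bytes of payload (padded to an even boundary).
        """
        with open(path, "rb") as f:
            riff, total_size, form_type = struct.unpack("<4sI4s", f.read(12))
            if riff != b"RIFF":
                raise ValueError("not a RIFF file")
            print("form type:", form_type.decode(), " declared size:", total_size)

            while True:
                header = f.read(8)
                if len(header) < 8:
                    break                      # end of file
                chunk_id, chunk_size = struct.unpack("<4sI", header)
                print("chunk", chunk_id.decode(errors="replace"), ":", chunk_size, "bytes")
                # Skip the payload; chunks are padded to an even length.
                f.seek(chunk_size + (chunk_size & 1), 1)

    # Hypothetical usage:
    # walk_riff_chunks("example.wav")

Running it on a typical WAV will usually list a 'fmt ' chunk followed by a 'data' chunk, which is essentially all there is at the container level of that format.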

Codecs, meanwhile, are some of the most complex algorithms you're likely to find in "consumer" software. They draw heavily on advances in both the academic community and the R&D arms of large corporations (including their vast patent libraries). To be proficient with codecs, you need to know at least the basics of:

If you already have a decent background (e.g., you've taken one or two undergraduate-level "math for engineers"-type classes), then I say dive right in. Many of the best A/V codecs are open source:

  • x264 (MPEG-4 part 10, aka AVC)
  • LAME (MPEG-1 layer 3, aka mp3)
  • Xvid (MPEG-4 part 2, same as Divx and many others)
  • Vorbis (alternative, patent-free audio codec)
  • Dirac (alternative, patent-free video codec based on a wavelet transform)
Richard Berg
+1  A: 

In general, video compression is concerned with throwing away as much information as possible whilst having a minimal effect on the viewing experience for the end user. For example, using 4:2:0-subsampled YUV instead of RGB cuts the raw video size in half straight away. This is possible because the human eye is less sensitive to colour than it is to brightness. In YUV, the Y value is brightness, and the U and V values represent colour. You can therefore throw away some of the colour information, which reduces the file size without the viewer noticing any difference.
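
As a rough sanity check of the "half the size" claim, here is a small sketch comparing bytes per uncompressed frame for 24-bit RGB and 4:2:0-subsampled YUV (the 1920x1080 resolution is just an assumed example):

    def bytes_per_frame_rgb(width, height):
        # 3 bytes (R, G, B) for every pixel
        return width * height * 3

    def bytes_per_frame_yuv420(width, height):
        # Full-resolution Y plane plus quarter-resolution U and V planes:
        # 1 + 0.25 + 0.25 = 1.5 bytes per pixel on average
        return width * height * 3 // 2

    w, h = 1920, 1080                      # example resolution
    print(bytes_per_frame_rgb(w, h))       # 6,220,800 bytes
    print(bytes_per_frame_yuv420(w, h))    # 3,110,400 bytes -- exactly half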

After that, most compression techniques take advantage of two kinds of redundancy in particular. The first is temporal redundancy and the second is spatial redundancy.

Temporal redundancy notes that successive frames in a video sequence are very similar. Typically a video would be on the order of 20-30 frames per second, and not much changes in 1/30 of a second. Take any DVD and pause it, then step forward one frame and note how similar the two images are. So, instead of encoding each frame independently, MPEG-4 (and other compression standards) only encodes the difference between successive frames, using motion estimation to find that difference.
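
As a toy illustration (deliberately ignoring motion estimation and just differencing co-located pixels, which is a simplification for the example), the residual between two neighbouring frames is mostly near zero, and that is what makes it so much cheaper to encode than a full frame:

    import numpy as np

    # Two hypothetical consecutive greyscale frames: the second is the first
    # plus a tiny amount of change (noise standing in for real motion).
    prev_frame = np.random.randint(0, 256, size=(1080, 1920)).astype(np.int16)
    curr_frame = prev_frame + np.random.randint(-2, 3, size=prev_frame.shape)

    # Plain frame differencing: a real codec predicts each block from a
    # motion-compensated region of the previous frame, not the same position.
    residual = curr_frame - prev_frame

    print(np.abs(prev_frame).mean())   # large: a full frame of pixel values
    print(np.abs(residual).mean())     # small: mostly near zero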

Spatial redundancy takes advantage of the fact that, in general, the colour variation across an image tends to be quite low frequency. By this I mean that neighbouring pixels tend to have similar colours. For example, in an image of you wearing a red jumper, all of the pixels that represent your jumper would have a very similar colour. It is possible to use the DCT to transform the pixel values into the frequency domain, where some of the high-frequency information can be thrown away. Then, when the inverse DCT is performed (during decoding), the image is reconstructed without the discarded high-frequency information.
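
Here is a minimal sketch of that idea on a single 8x8 block using SciPy's DCT routines. The smooth ramp block, the orthonormal normalisation, and the crude "keep only the top-left 4x4 coefficients" rule are all assumptions for the example; real codecs quantise coefficients rather than zeroing them outright.

    import numpy as np
    from scipy.fft import dctn, idctn

    # A hypothetical smooth 8x8 block of greyscale values (a simple ramp),
    # standing in for the "neighbouring pixels have similar colours" case.
    block = np.add.outer(np.arange(8.0), np.arange(8.0)) * 10.0

    # Forward 2-D DCT: energy concentrates in the low-frequency (top-left) corner.
    coeffs = dctn(block, norm="ortho")

    # Crudely discard high-frequency content by zeroing everything outside the
    # top-left 4x4 corner.
    kept = np.zeros_like(coeffs)
    kept[:4, :4] = coeffs[:4, :4]

    # Inverse DCT reconstructs an approximation of the original block.
    approx = idctn(kept, norm="ortho")

    # Small error, because a smooth block has little high-frequency content.
    print(np.abs(block - approx).mean())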

To view the effects of throwing away this information, open MS Paint and draw a series of overlapping horizontal and vertical black lines. Save the image as a JPEG (which also uses the DCT for compression). Now zoom in on the pattern and notice how the edges of the lines are not as sharp any more and are a little blurry. This is because some information (the sharp black-to-white transitions, which are high frequency) has been thrown away during compression. Read this for an explanation with nice pictures.

For further reading, this book is quite good, if a little heavy on the maths.

Lehane