views: 753 · answers: 7
Does anyone know of an algorithm that I could use to find an "interesting" representative thumbnail for a video?

I have say 30 bitmaps and I would like to choose the most representative one as the video thumbnail.

The obvious first step would be to eliminate all black frames. Then perhaps look at the "distance" between the various frames and choose something that is close to the average.
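
Something along these lines is what I have in mind so far (a rough Python/NumPy sketch; frames would be the 30 bitmaps as arrays, and the brightness threshold is arbitrary):

    import numpy as np

    def pick_thumbnail(frames, black_threshold=10):
        """Sketch: drop near-black frames, then pick the frame closest
        to the average of the remaining ones."""
        # Keep frames whose mean brightness clears an (arbitrary) threshold.
        candidates = [f for f in frames if f.mean() > black_threshold]
        if not candidates:
            return None

        # "Average" frame of the survivors (assumes identical dimensions).
        mean_frame = np.mean([f.astype(np.float32) for f in candidates], axis=0)

        # Choose the frame with the smallest pixel-wise distance to the average.
        return min(candidates,
                   key=lambda f: np.linalg.norm(f.astype(np.float32) - mean_frame))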

Any ideas here or published papers that could help out?

A: 

Wow, what a great question. I guess a second step would be to iteratively remove frames where there's little or no change between a frame and its successors. But all you're really doing there is reducing the set of potentially interesting frames. How exactly you determine "interestingness" is the special sauce, I suppose, since you don't have the user interaction statistics to rely on like Flickr does.
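
Something like this rough Python sketch, perhaps (the change threshold is a pure guess and would need tuning):

    import numpy as np

    def drop_static_frames(frames, min_change=5.0):
        """Keep a frame only if it differs noticeably from the last kept frame."""
        kept = [frames[0]]
        for frame in frames[1:]:
            # Mean absolute pixel difference against the last frame we kept.
            diff = np.abs(frame.astype(np.float32) - kept[-1].astype(np.float32)).mean()
            if diff > min_change:
                kept.append(frame)
        return kept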

ninesided
A: 

Directors will sometimes linger on a particularly 'interesting' or beautiful shot, so how about finding a 5-second section that doesn't change, and then eliminating those sections that are almost black?
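
A rough Python sketch of that idea, assuming the frames were sampled at a known rate (the thresholds are guesses):

    import numpy as np

    def find_still_bright_window(frames, fps=25, seconds=5,
                                 max_change=3.0, min_brightness=20.0):
        """Return the middle frame of the first roughly-5-second window that is
        both nearly static and not close to black; None if there isn't one."""
        window = int(fps * seconds)
        floats = [f.astype(np.float32) for f in frames]
        for start in range(len(floats) - window):
            chunk = floats[start:start + window]
            changes = [np.abs(a - b).mean() for a, b in zip(chunk, chunk[1:])]
            if max(changes) < max_change and np.mean(chunk) > min_brightness:
                return frames[start + window // 2]
        return None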

Bedwyr Humphreys
+2  A: 

I think you should only look at key frames.

If the video is not encoded with a key-frame-based compression scheme, you could create an algorithm based on the following article: Key frame selection by motion analysis.

Depending on the compression of the video you can have key frames every 2 seconds or every 30 seconds. Then I think you should use the algorithm in the article to find the most representative frame out of all the key frames.
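
A very loose Python sketch of the local-minimum-of-motion idea from that paper, using plain frame differencing as a stand-in for the proper motion analysis it describes:

    import numpy as np

    def keyframes_by_motion(frames):
        """Estimate motion per frame pair (crudely, via frame differencing)
        and keep frames sitting at local minima of motion."""
        floats = [f.astype(np.float32) for f in frames]
        motion = [np.abs(a - b).mean() for a, b in zip(floats, floats[1:])]
        keyframes = []
        for i in range(1, len(motion) - 1):
            if motion[i] < motion[i - 1] and motion[i] < motion[i + 1]:
                # motion[i] compares frames[i] and frames[i + 1]; take the latter.
                keyframes.append(frames[i + 1])
        return keyframes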

Davy Landman
+4  A: 

You asked for papers, so I found a few. If you are not on campus or on a VPN connection to campus, these papers might be hard to access.

PanoramaExcerpts: extracting and packing panoramas for video browsing

http://portal.acm.org/citation.cfm?id=266396

This one explains a method for generating a comic-book-style keyframe representation.

Abstract:

This paper presents methods for automatically creating pictorial video summaries that resemble comic books. The relative importance of video segments is computed from their length and novelty. Image and audio analysis is used to automatically detect and emphasize meaningful events. Based on this importance measure, we choose relevant keyframes. Selected keyframes are sized by importance, and then efficiently packed into a pictorial summary. We present a quantitative measure of how well a summary captures the salient events in a video, and show how it can be used to improve our summaries. The result is a compact and visually pleasing summary that captures semantically important events, and is suitable for printing or Web access. Such a summary can be further enhanced by including text captions derived from OCR or other methods. We describe how the automatically generated summaries are used to simplify access to a large collection of videos.

Automatic extraction of representative keyframes based on scene content

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=751008

Abstract:

Generating indices for movies is a tedious and expensive process which we seek to automate. While algorithms for finding scene boundaries are readily available, there has been little work performed on selecting individual frames to concisely represent the scene. In this paper we present novel algorithms for automated selection of representative keyframes, based on scene content. Detailed description of several algorithms is followed by an analysis of how well humans feel the selected frames represent the scene. Finally we address how these algorithms can be integrated with existing algorithms for finding scene boundaries.

smerten
Thanks! Looks promising
Sam Saffron
+1  A: 

It may also be beneficial to favor frames that are aesthetically pleasing. That is, look for common attributes of photography: aspect ratio, contrast, balance, etc.

It would be hard to find a representative shot if you don't know what you're looking for, but with some heuristics like these you could at least come up with something good-looking.
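
As a very crude illustration, a scoring function along these lines could be used to rank frames (Python/OpenCV; the weights and the notion of "pleasing" here are entirely made up):

    import cv2

    def aesthetic_score(frame):
        """Reward contrast and mid-range brightness as a crude proxy
        for 'good looking'; weights are arbitrary."""
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        contrast = gray.std()                        # spread of intensities
        brightness_penalty = abs(gray.mean() - 128)  # distance from mid-grey
        return contrast - 0.5 * brightness_penalty

    # best = max(frames, key=aesthetic_score)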

Scott Wegner
Yeah, I was thinking of calculating histograms and using them as part of the algorithm
Sam Saffron
+5  A: 

If the video contains structure, i.e. several shots, then the standard techniques for video summarisation involve (a) shot detection, then (b) using the first, middle, or nth frame to represent each shot. See [1].

However, let us assume you wish to find an interesting frame in a single continuous stream of frames taken from a single camera source, i.e. a shot. This is the "key frame detection" problem that is widely discussed in IR/CV (Information Retrieval, Computer Vision) texts. Some illustrative approaches:

  • In [2] a mean colour histogram is computed for all frames and the key frame is the one with the closest histogram, i.e. we select the best frame in terms of its colour distribution.
  • In [3] we assume that camera stillness is an indicator of frame importance, as suggested by Beds above: we pick the still frames using optic flow and use those.
  • In [4] each frame is projected into some high dimensional content space, we find those frames at the corners of the space and use them to represent the video.
  • In [5] frames are evaluated for importance using their length and novelty in content space.

In general, this is a large field and there are lots of approaches. You can look at the academic conferences such as The International Conference on Image and Video Retrieval (CIVR) for the latest ideas. I find that [6] presents a useful detailed summary of video abstraction (key-frame detection and summarisation).

For your "find the best of 30 bitmaps" problem I would use an approach like [2]. Compute a frame representation space (e.g. a colour histogram for the frame), compute a histogram to represent all frames, and use the frame with the minimum distance between the two (e.g. pick a distance metric that's best for your space. I would try Earth Mover's Distance).

  1. M.S. Lew. Principles of Visual Information Retrieval. Springer Verlag, 2001.
  2. B. Gunsel, Y. Fu, and A.M. Tekalp. Hierarchical temporal video segmentation and content characterization. Multimedia Storage and Archiving Systems II, SPIE, 3229:46-55, 1997.
  3. W. Wolf. Key frame selection by motion analysis. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 1228-1231, 1996.
  4. L. Zhao, W. Qi, S.Z. Li, S.Q. Yang, and H.J. Zhang. Key-frame extraction and shot retrieval using Nearest Feature Line. In IW-MIR, ACM MM, pages 217-220, 2000.
  5. S. Uchihashi. Video Manga: Generating semantically meaningful video summaries. In Proc. ACM Multimedia 99, Orlando, FL, pages 383-392, 1999.
  6. Y. Li, T. Zhang, and D. Tretter. An overview of video abstraction techniques. Technical report, HP Laboratory, July 2001.
graveca
+1 Awesome answer. I think for my project the best approach is a mixture of histograms and motion detection, since I am dealing with videos.
Sam Saffron
I would suggest also analyzing the soundtrack if the video has one - it can provide cues to where scene/shots start and end.
Unreason
+1  A: 

I worked on a project recently that involved some video processing, and we used OpenCV to do the heavy lifting. We had to extract frames, calculate differences, extract faces, etc. OpenCV has built-in functions that will calculate differences between frames, and it works with a variety of video and image formats.
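
For example, a minimal Python sketch of frame extraction and differencing with OpenCV (the path and sampling step are just placeholders):

    import cv2

    def sample_frames(path, step=30):
        """Grab every Nth frame from a video file."""
        cap = cv2.VideoCapture(path)
        frames, index = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index % step == 0:
                frames.append(frame)
            index += 1
        cap.release()
        return frames

    def frame_difference(a, b):
        """Mean absolute per-pixel difference between two frames."""
        return cv2.absdiff(a, b).mean()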

Don Branson