With the surge of large-scale models, abundant video material, and powerful hardware resources such as GPUs, recent work in video summarization has advanced rapidly. Vision-guided approaches identify important scenes by exploiting visual features of the video. However, despite their notable success in video summarization, previous works rely solely on visual features, such as optical flow and RGB frames, leaving audio features behind. Language-guided approaches learn by automatically extracting caption information from the video. Although these methods give users the option of customizing the summary, they are difficult to use when a user is unfamiliar with the video, since the user must rely on subjective information in the form of user captions.