With the surge of large-scale models, abundant video material, and powerful hardware resources such as GPUs, recent work in video summarization has advanced rapidly. Vision-guided approaches identify important scenes by exploiting visual features of the video. However, despite their notable success in video summarization, previous works rely solely on visual features, such as optical flow and RGB frames, leaving audio features behind. Language-guided approaches learn by automatically extracting caption information from the video. Although these methods give users the option of customizing the summary, they are difficult to use when a user is unfamiliar with the video, since the user must rely on subjective information in the form of user captions.