Abstract: Video, as an information carrier, provides a vast amount of important information to people. Therefore, the method of obtaining video becomes particularly important, which drives the ...
Abstract: Vision-language models pretrained on image-text pairs have demonstrated strong performance in text-to-video retrieval through contrastive learning. However, videos contain much richer ...
ViTiS consists of a frozen video encoder, a visual mapping network, a frozen text embedding layer, a frozen language model and a frozen classifier head. Given input video frames and text, video ...