An Introduction to Vision-Language Modeling

arXiv:2405.17247v1

Contrastive-based VLMs Contrastive-based training is often better explained through an Energy-Based Model (EBM) point of view.
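To make the energy-based reading concrete, the following is a minimal sketch of a symmetric InfoNCE-style contrastive loss, assuming NumPy and a batch of already-computed image and text embeddings. The function names, the temperature value of 0.07, and the choice of negative scaled cosine similarity as the energy are illustrative assumptions, not details taken from the text above. Under these assumptions, matched pairs are pushed toward low energy and mismatched pairs in the batch toward high energy.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def info_nce_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    Row i of image_emb is assumed to be paired with row i of text_emb;
    every other row in the batch acts as a negative sample.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Assumed energy: E(image_i, text_j) = -logits[i, j].
    logits = image_emb @ text_emb.T / temperature

    idx = np.arange(logits.shape[0])
    # Image-to-text: softmax over captions for each image (rows).
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: softmax over images for each caption (columns).
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

In this sketch the softmax plays the role of a Boltzmann distribution restricted to the batch, so lowering the loss lowers the energy of the matched pair relative to the in-batch negatives.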

VLMs from Pretrained Backbones

3.4.2 Negative captioning Negative samples have been used extensively in contrastive objectives to mitigate collapse, enhance generalization, and improve discriminative feature learning.
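One common way to use negative captions in a contrastive objective is to append them as extra candidates that each image must reject. The sketch below assumes NumPy, precomputed embeddings, and one negative caption per image (for example, a caption that describes the image incorrectly). The function name, the temperature, and the image-to-text-only formulation are hypothetical choices for illustration, not a method specified in the text above.

```python
import numpy as np


def loss_with_negative_captions(image_emb, pos_text_emb, neg_text_emb,
                                temperature=0.07):
    """Image-to-text contrastive loss with extra negative captions.

    Row i of pos_text_emb is the correct caption for image i.
    Rows of neg_text_emb are negative captions (assumed: one per image).
    """
    def normalize(v):
        return v / np.linalg.norm(v, axis=1, keepdims=True)

    img = normalize(image_emb)
    pos = normalize(pos_text_emb)
    neg = normalize(neg_text_emb)

    # Candidate set per image: B positive captions followed by B negatives.
    candidates = np.concatenate([pos, neg], axis=0)   # shape (2B, D)
    logits = img @ candidates.T / temperature         # shape (B, 2B)

    # Numerically stable log-softmax over all candidates.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The correct caption for image i sits at column i.
    idx = np.arange(img.shape[0])
    return -log_probs[idx, idx].mean()
```

With this formulation, a negative caption whose embedding lies close to the image raises the loss, so the model is pushed to separate it from the correct caption.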

5.5 Challenges in leveraging video data A challenge for video-text pretraining is the current scarcity of (weak) supervision along the temporal dimension, a problem illustrated in VideoPrism [Zhao et al., 2024]. Existing data (e.g., from the Internet) mostly describes the content of scenes rather than actions or motion, so a video model trained on it can degrade into an image model.