M3 Matryoshka Multimodal Models

https://matryoshka-mm.github.io/

represent visual content as nested sets of visual tokens that capture information across multiple coarse-to-fine granularities.

(2) M3 provides a framework for analyzing the granularity needed for existing datasets,

Matryoshka Representation Learning.

Our approach is inspired by MRL, but instead of learning multiple nested embeddings for a high-dimensional feature vector, we learn nested visual tokens along the token length dimension for the visual input. We are the first to show that the idea of Matryosha learning can enable explicit control over the visual granularity of the visual content that an LMM processes.