Google DeepMind researchers have unveiled PaliGemma, a new open-source vision-language model (VLM) that demonstrates strong performance across a wide range of visual and language tasks despite its relatively small size.
The 3-billion parameter model, which combines a SigLIP vision encoder with a Gemma language model, achieved strong results across nearly 40 diverse benchmarks, spanning standard VLM tasks as well as more specialised challenges such as remote sensing and image segmentation.
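As a rough illustration of how the two components fit together, the sketch below encodes an image into a sequence of tokens, projects them into the language model's embedding space, and prepends them to the text prompt. The dimensions and the `siglip_encode`/`embed_text` stand-ins are illustrative assumptions, not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; PaliGemma pairs a SigLIP-So400m encoder with the
# Gemma-2B decoder for roughly 3B parameters in total.
NUM_IMAGE_TOKENS = 256      # e.g. a 224x224 image split into 16x16 patches
D_VISION = 1152             # vision encoder output width (assumed)
D_MODEL = 2048              # language-model embedding width (assumed)

def siglip_encode(image: np.ndarray) -> np.ndarray:
    """Stand-in for the SigLIP vision encoder: image -> patch embeddings."""
    return rng.normal(size=(NUM_IMAGE_TOKENS, D_VISION))

def embed_text(token_ids: np.ndarray) -> np.ndarray:
    """Stand-in for the Gemma token-embedding lookup."""
    return rng.normal(size=(len(token_ids), D_MODEL))

# A linear projection maps image tokens into the language model's space.
W_proj = rng.normal(size=(D_VISION, D_MODEL)) * 0.02

image = rng.normal(size=(224, 224, 3))
prompt_ids = np.array([5, 17, 42])          # e.g. a tokenised prefix like "caption en"

image_tokens = siglip_encode(image) @ W_proj        # (256, 2048)
text_tokens = embed_text(prompt_ids)                # (3, 2048)

# The decoder sees the image tokens first, then the text prefix, and
# generates the output text from there.
decoder_input = np.concatenate([image_tokens, text_tokens], axis=0)
print(decoder_input.shape)                          # (259, 2048)
```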
PaliGemma excels at tasks such as image captioning and video understanding, often outperforming larger models. Because its architecture accepts multiple input images, it handles video clips and image pairs naturally, and it achieves top results on benchmarks like MMVP and Objaverse Multiview without task-specific fine-tuning.
Key design choices include a prefix-LM training objective, in which the image and text prompt attend to each other bidirectionally while the output text is decoded autoregressively; fine-tuning all model components together rather than freezing the pretrained parts; a multi-stage training process that increases image resolution in later stages; and a curated, diverse pretraining mixture.
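The prefix-LM objective is the least familiar of these choices, so here is a minimal sketch of the attention mask it implies, assuming the image and prompt form the "prefix" and the target text the "suffix"; the function below is illustrative rather than the paper's implementation.

```python
import numpy as np

def prefix_lm_mask(num_prefix: int, num_suffix: int) -> np.ndarray:
    """Boolean attention mask: True means 'query may attend to key'.

    Prefix tokens (image + prompt) attend to each other bidirectionally;
    suffix tokens (the output text) attend to the full prefix and, causally,
    to earlier suffix tokens.
    """
    n = num_prefix + num_suffix
    mask = np.zeros((n, n), dtype=bool)
    mask[:, :num_prefix] = True                       # every token sees the prefix
    causal = np.tril(np.ones((num_suffix, num_suffix), dtype=bool))
    mask[num_prefix:, num_prefix:] = causal           # suffix is autoregressive
    return mask

# Tiny example: 4 prefix tokens (image + prompt), 3 output tokens.
print(prefix_lm_mask(4, 3).astype(int))
```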
The team also conducted extensive ablation studies to analyse the impact of various architectural and training choices. They found that longer pretraining, unfreezing all model components, and increasing resolution all contributed significantly to PaliGemma’s capabilities.
By releasing PaliGemma as an open base model without instruction tuning, the researchers aim to provide a useful starting point for further research into instruction tuning, specific applications, and clearer separation of base models and fine-tunes in VLM development.
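For readers who want to try the base model, the snippet below shows one possible way to run a released checkpoint through Hugging Face Transformers, assuming the `google/paligemma-3b-pt-224` weights are available to you (the Gemma licence must be accepted on the Hub); the `caption en` prompt follows the base model's task-prefix format rather than chat-style instructions.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-pt-224"           # base (pretrained) checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg")                  # any local test image
inputs = processor(text="caption en", images=image, return_tensors="pt")

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=30)

# generate() returns the prompt followed by the new tokens; keep only the latter.
new_tokens = generated[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))
```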
The strong performance of this relatively small model suggests that carefully designed VLMs can achieve state-of-the-art results without necessarily scaling to enormous sizes, potentially enabling more efficient and accessible multimodal AI systems.