
Google Introduces PaliGemma, A 3B Vision Model for Transfer Learning

PaliGemma excels in image captioning and video understanding, surpassing larger models with its versatile architecture and achieving top results on benchmarks like MMVP and Objaverse Multiview without task-specific fine-tuning.


Google DeepMind researchers have unveiled PaliGemma, a new open-source vision-language model (VLM) that demonstrates strong performance across a wide range of visual and language tasks despite its relatively small size.

Read the full paper here 

The 3-billion-parameter model, which combines a SigLIP vision encoder with a Gemma language model, achieved strong results across nearly 40 diverse benchmarks, including standard VLM tasks as well as more specialised challenges in areas like remote sensing and image segmentation.
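
For readers who want to try the model, the sketch below shows one way to run a PaliGemma checkpoint for image captioning. It assumes the Hugging Face transformers integration (AutoProcessor and PaliGemmaForConditionalGeneration) and the checkpoint name google/paligemma-3b-pt-224, neither of which is stated in this article.

```python
# Minimal sketch: image captioning with a PaliGemma checkpoint via
# Hugging Face transformers (assumed integration, not from the article).
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-pt-224"  # assumed base (pretrained) checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg")   # any local image
prompt = "caption en"               # PaliGemma-style task prompt

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=30)

# The generated sequence starts with the input tokens; decode only the new ones.
caption = processor.decode(output_ids[0, inputs["input_ids"].shape[1]:],
                           skip_special_tokens=True)
print(caption)
```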

PaliGemma excels in tasks like image captioning and video understanding, often outperforming larger models. Its architecture supports multiple input images, making it ideal for video clips and image pairs. It achieves top results on benchmarks like MMVP and Objaverse Multiview without task-specific fine-tuning. 
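
The multi-image support works by encoding each frame separately and placing all of the resulting image tokens ahead of the text tokens in the input sequence. The sketch below illustrates that layout; the tensor shapes are illustrative assumptions, not values from the article.

```python
# Conceptual sketch of multi-image input: each frame is encoded on its own,
# and the per-frame token sequences are concatenated in front of the text
# tokens before being fed to the language model.
import torch

num_frames, tokens_per_image, d_model = 4, 256, 2048  # assumed sizes
text_len = 16

# Stand-ins for the SigLIP encoder outputs (one token grid per frame)
# and the embedded text prompt.
frame_tokens = [torch.randn(tokens_per_image, d_model) for _ in range(num_frames)]
text_tokens = torch.randn(text_len, d_model)

# Multi-image input = all image tokens, in frame order, followed by the text.
sequence = torch.cat(frame_tokens + [text_tokens], dim=0)
print(sequence.shape)  # (num_frames * tokens_per_image + text_len, d_model)
```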

Key design choices include a prefix-LM training objective, in which the image and prompt tokens attend to each other with full (bidirectional) attention while the output text is generated autoregressively; fine-tuning all model components together rather than freezing the vision encoder; a multi-stage training process that progressively increases image resolution; and curated, diverse pretraining data.
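
A minimal sketch of the prefix-LM attention pattern described above is given below: prefix tokens (image plus prompt) see each other bidirectionally, while suffix (output) tokens see the whole prefix and only earlier suffix tokens. The token counts are illustrative assumptions.

```python
# Prefix-LM attention mask sketch: bidirectional over the prefix,
# causal over the suffix.
import numpy as np

def prefix_lm_mask(prefix_len: int, suffix_len: int) -> np.ndarray:
    """Return a boolean mask where mask[i, j] is True if token i may attend to token j."""
    total = prefix_len + suffix_len
    mask = np.zeros((total, total), dtype=bool)
    # Every token can see the whole prefix (image + prompt).
    mask[:, :prefix_len] = True
    # Suffix tokens additionally see themselves and earlier suffix tokens (causal).
    for i in range(prefix_len, total):
        mask[i, prefix_len:i + 1] = True
    return mask

print(prefix_lm_mask(prefix_len=4, suffix_len=3).astype(int))
```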

The team also conducted extensive ablation studies to analyse the impact of various architectural and training choices. They found that longer pretraining, unfreezing all model components, and increasing resolution all contributed significantly to PaliGemma’s capabilities.

By releasing PaliGemma as an open base model without instruction tuning, the researchers aim to provide a useful starting point for further research into instruction tuning, specific applications, and clearer separation of base models and fine-tunes in VLM development.

The strong performance of this relatively small model suggests that carefully designed VLMs can achieve state-of-the-art results without necessarily scaling to enormous sizes, potentially enabling more efficient and accessible multimodal AI systems.
