Microsoft has released the new Phi-3.5 models: Phi-3.5-MoE-instruct, Phi-3.5-mini-instruct, and Phi-3.5-vision-instruct. The Phi-3.5-mini-instruct, with 3.82 billion parameters, is built for basic and quick reasoning tasks.
The Phi-3.5-MoE-instruct, with 41.9 billion parameters, handles more advanced reasoning. The Phi-3.5-vision-instruct, with 4.15 billion parameters, is designed for vision tasks like image and video analysis.
Phi-3.5-MoE-instruct
Phi-3.5-MoE-instruct is a 42-billion-parameter open-source model that demonstrates significant improvements in reasoning capabilities, outperforming models with more active parameters, such as Llama 3.1 8B and Gemma 2 9B, across various benchmarks.
While competitive overall, Phi-3.5-MoE falls slightly behind GPT-4o-mini on benchmarks but surpasses Gemini 1.5 Flash. The model supports multilingual applications, although the specific languages covered remain unclear.
Phi-3.5-MoE features 16 experts, two of which are activated for each token during generation, so roughly 6.6 billion parameters are engaged in each inference. The model also extends its context length to 128,000 tokens.
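These figures are consistent with standard top-2 mixture-of-experts routing, where a small router picks two expert feed-forward networks per token. The PyTorch sketch below is a generic illustration of that idea rather than Microsoft’s actual implementation; the class name and layer sizes are placeholder assumptions.

```python
# Generic top-2 mixture-of-experts routing sketch (illustrative only, not
# Microsoft's implementation). Each token is routed to 2 of 16 expert FFNs,
# which is why only a fraction of the total parameters is active per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                          # x: (tokens, d_model)
        gate_logits = self.router(x)               # (tokens, n_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # normalise over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e           # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

layer = Top2MoELayer()
tokens = torch.randn(8, 512)
print(layer(tokens).shape)  # torch.Size([8, 512])
```

Routing each token to only two experts is what keeps per-token compute closer to a dense model of roughly 6.6 billion parameters than to the full 42 billion.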
The model was trained over 23 days using 512 H100-80G GPUs, with a total training dataset of 4.9 trillion tokens.
The model’s development included supervised fine-tuning, proximal policy optimisation, and direct preference optimisation to ensure precise instruction adherence and robust safety measures. The model is intended for use in memory and compute-constrained environments and latency-sensitive scenarios.
Key use cases for Phi-3.5-MoE include general-purpose AI systems, applications requiring strong reasoning in code, mathematics, and logic, and as a foundational component for generative AI-powered features.
The model’s tokenizer supports a vocabulary size of up to 32,064 tokens, with placeholder tokens reserved for downstream fine-tuning. Microsoft provided a sample code snippet for local inference, demonstrating how the model generates responses to user prompts.
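That official snippet is not reproduced here, but a local-inference sketch along the same lines, using the Hugging Face transformers chat pipeline, might look as follows; the example prompt and generation settings are illustrative assumptions rather than Microsoft’s exact code.

```python
# Hedged sketch of local inference with Phi-3.5-MoE-instruct via transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "microsoft/Phi-3.5-MoE-instruct"

# Loading the full 42B-parameter model needs substantial GPU memory;
# device_map="auto" spreads it across available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the difference between a list and a tuple in Python."},
]

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
output = pipe(
    messages,
    max_new_tokens=256,
    do_sample=False,
    return_full_text=False,  # return only the newly generated reply
)
print(output[0]["generated_text"])
```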
Phi-3.5-mini-instruct
With 3.8 billion parameters, this model is lightweight yet powerful, outperforming larger models such as Llama 3.1 8B and Mistral 7B. It supports a 128K-token context length, significantly more than its main competitors, many of which support only up to 8K.
Microsoft’s Phi-3.5-mini is positioned as a competitive option in long-context tasks such as document summarisation and information retrieval, outperforming several larger models like Llama-3.1-8B-instruct and Mistral-Nemo-12B-instruct-2407 on various benchmarks.
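As a rough sketch of how the 128K context window could be put to work for document summarisation, the snippet below feeds an entire document to Phi-3.5-mini-instruct through its chat template; the file name, prompt wording, and generation settings are assumptions for illustration only.

```python
# Hedged sketch: long-document summarisation with Phi-3.5-mini-instruct.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3.5-mini-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

# "report.txt" stands in for any long document; with a 128K context the full
# text can go straight into the prompt instead of being chunked.
long_document = open("report.txt", encoding="utf-8").read()

messages = [
    {
        "role": "user",
        "content": f"Summarise the following report in five bullet points:\n\n{long_document}",
    }
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    generated = model.generate(input_ids, max_new_tokens=300, do_sample=False)

# Decode only the tokens produced after the prompt.
print(tokenizer.decode(generated[0, input_ids.shape[-1]:], skip_special_tokens=True))
```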
The model is intended for commercial and research use, particularly in memory and compute-constrained environments, latency-bound scenarios, and applications requiring strong reasoning in code, math, and logic.
The Phi-3.5-mini model was trained over 10 days using 512 H100-80G GPUs. Training covered 3.4 trillion tokens, combining synthetic data with filtered publicly available web data to enhance the model’s reasoning capabilities and overall performance.
Phi-3.5-vision-instruct
Phi-3.5-vision-instruct is a 4.2-billion-parameter model that excels in multi-frame image understanding and reasoning. It has shown improved performance on benchmarks such as MMMU, MMBench, and TextVQA, demonstrating its capability in visual tasks, and it even outperforms OpenAI’s GPT-4o on several benchmarks.
The model integrates an image encoder, connector, projector, and the Phi-3 Mini language model. It supports both text and image inputs and is optimised for prompts using a chat format, with a context length of 128K tokens. The model was trained over 6 days using 256 A100-80G GPUs, processing 500 billion tokens that include both vision and text data.
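A hedged sketch of multimodal inference in that chat format, loosely following the pattern used for Phi-3 vision models on Hugging Face, is shown below; the image URL, prompt, and specific loading arguments are illustrative assumptions rather than Microsoft’s official example.

```python
# Hedged sketch of image + text inference with Phi-3.5-vision-instruct.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3.5-vision-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
    _attn_implementation="eager",  # flash-attention can be used instead where available
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True, num_crops=4)

# Placeholder image URL; any local or remote image works.
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)

# Images are referenced in the chat prompt via numbered <|image_N|> placeholders.
messages = [{"role": "user", "content": "<|image_1|>\nDescribe what this image shows."}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(prompt, [image], return_tensors="pt").to(model.device)
generated = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=False,
    eos_token_id=processor.tokenizer.eos_token_id,
)
reply = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(reply)
```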
The Phi-3.5 models are now available on the AI platform Hugging Face under an MIT license, making them accessible for a wide range of applications. This release aligns with Microsoft’s commitment to providing open-source AI tools that are both efficient and versatile.