Apple recently introduced LazyLLM, a novel technique designed to improve the efficiency of LLM inference. Detailed in a recent research paper, it aims to accelerate response generation in transformer-based language models without compromising accuracy.
The paper proposes LazyLLM as a technique for efficient LLM inference, particularly in long-context scenarios. LazyLLM selectively computes the KV (key-value) cache for tokens that matter for the next-token prediction and ‘lazily’ defers the computation of the remaining tokens to later steps, when they become relevant.
Developed by Qichen Fu, Thomas Merth, Sachin Mehta, and Mahyar Najibi of Apple, alongside Mohammad Rastegari, who now works at Meta, LazyLLM offers flexibility by allowing the model to revive tokens that were previously pruned, making the process more adaptive and efficient.
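To make the idea concrete, here is a minimal sketch of the core mechanism, not Apple’s implementation: score prompt tokens by the attention the final token pays them, compute the KV cache only for the top-scoring tokens now, and remember which positions were deferred so a later step can revive them. The function name, the `keep_ratio` parameter, and the tensor shapes are illustrative assumptions.

```python
import torch

def select_tokens_to_compute(attn_weights: torch.Tensor, keep_ratio: float = 0.5):
    """Rank prompt tokens by the attention the final position pays them
    (one hypothetical importance measure) and keep the top fraction.

    attn_weights: [num_heads, seq_len, seq_len] attention from one layer.
    Returns a boolean mask over positions: True = compute KV now,
    False = defer ('lazily') until a later decoding step needs the token.
    """
    # Importance of each token = attention it receives from the last
    # position, averaged over heads.
    importance = attn_weights[:, -1, :].mean(dim=0)  # [seq_len]
    k = max(1, int(keep_ratio * importance.numel()))
    keep_idx = importance.topk(k).indices
    mask = torch.zeros_like(importance, dtype=torch.bool)
    mask[keep_idx] = True
    mask[-1] = True  # the current token is always computed
    return mask

# Toy usage: 8 heads over a 16-token prompt.
attn = torch.softmax(torch.randn(8, 16, 16), dim=-1)
compute_mask = select_tokens_to_compute(attn, keep_ratio=0.5)
deferred = (~compute_mask).nonzero().flatten()
print(f"computing KV for {compute_mask.sum().item()} tokens now")
print(f"deferring positions {deferred.tolist()} to later steps")
```

In the full technique such a mask would be applied layer by layer during prefilling, with deferred tokens computed on demand if a later step attends to them, which is the revival behavior described above.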
By reducing the heavy computation of the prefilling stage, LazyLLM paves the way for more responsive and agile AI systems, potentially transforming applications that rely on large language models.
Apple Generative AI Innovations
Apple recently released a new open-source LLM, DCLM-Baseline 7B, featuring 7 billion parameters. The release includes the model weights, training code, and dataset; the model was trained on 2.5 trillion tokens from open datasets, primarily in English, and has a 2048-token context window.
The new model is licensed under the Apple Sample Code License, available on Hugging Face, and usable with the Transformers library. Trained with PyTorch and OpenLM, it matches closed-dataset models like Mistral in performance.
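Since the model is published on Hugging Face and works with the Transformers library, loading it would typically look like the sketch below. The repository id `apple/DCLM-7B` and the `open_lm` dependency for its custom architecture are assumptions based on how OpenLM-trained models are usually published; check the actual model card before use.

```python
# Hypothetical loading sketch -- the repo id and the open_lm dependency
# are assumptions; verify against the model card on Hugging Face.
#   pip install transformers torch
#   pip install git+https://github.com/mlfoundations/open_lm.git
from open_lm.hf import *  # registers the OpenLM architecture with Transformers (assumed)
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "apple/DCLM-7B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Generate within the model's 2048-token context window.
inputs = tokenizer("Machine learning is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```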
This comes after Apple introduced Apple Intelligence at WWDC 2024 to enhance Siri’s capabilities with generative AI.