A recent report by researchers at Microsoft Research in Bangalore has uncovered significant issues with current methods for compressing and quantising LLMs.
The paper, titled “Accuracy is Not All You Need,” highlights that commonly used compression techniques, such as quantisation, can lead to changes in model behaviour that are not captured by traditional accuracy metrics.
The study by Abhinav Dutta, Sanjeev Krishnan, Nipun Kwatra, and Ramachandran Ramjee emphasises the importance of looking beyond accuracy when evaluating compressed models.
The researchers point out that while compressed models often maintain similar accuracy levels to their baseline counterparts, their behaviour can differ significantly. This phenomenon, referred to as “flips,” involves answers changing from correct to incorrect and vice versa, impacting the model’s reliability.
The researchers propose using distance metrics like KL-Divergence and the percentage of flips to better assess the impact of compression. These metrics provide a more nuanced view of how compression affects model outputs as perceived by end-users.
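As a rough illustration of these two metrics, the sketch below computes KL-Divergence between a baseline and a compressed model's output distributions, and the percentage of flips over per-question correctness flags. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the inputs (probability vectors and correctness lists) are assumed to have been collected from the two models already.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) between two probability distributions over the
    same vocabulary, e.g. a baseline and a compressed model's
    next-token probabilities. eps guards against log(0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def flip_percentage(baseline_correct, compressed_correct):
    """Percentage of questions whose correctness changed between the
    baseline and the compressed model, in either direction."""
    flips = sum(b != c for b, c in zip(baseline_correct, compressed_correct))
    return 100.0 * flips / len(baseline_correct)

# Both models answer 3 of 4 questions correctly (identical accuracy),
# yet half the answers flip -- the effect the paper highlights.
baseline = [True, True, False, True]
compressed = [True, False, True, True]
print(flip_percentage(baseline, compressed))  # -> 50.0
```

Note how the example deliberately keeps accuracy constant: aggregate accuracy hides the two offsetting flips, which is exactly why the researchers argue for reporting flips alongside it.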
The research team conducted experiments using multiple LLMs, such as Llama2 chat and Yi chat, across various quantisation techniques and datasets. They found that compressed models perform significantly worse in generative tasks, as evidenced by evaluations on the MT-Bench dataset.
Key Findings
The researchers acknowledge that predicting performance degradation in real-world applications remains challenging. They note that distance metrics may not always indicate visible degradation in downstream tasks.
They also note that compressed models often exhibit significant behavioural differences from their baseline versions, impacting user experience. The flips metric revealed that the proportion of answer changes is substantial, highlighting the limitations of accuracy as a sole performance indicator.
Moreover, in tasks requiring generative capabilities, compressed models underperformed compared to their baseline versions, underscoring the need for more comprehensive evaluation metrics.
The Microsoft Research study concludes that traditional accuracy metrics are insufficient for evaluating the quality of compressed LLMs. The introduction of distance metrics such as KL-Divergence and flips offers a more accurate assessment of model performance, capturing changes that affect end-users.
The researchers argue that these metrics are essential for all optimisation methods that aim to minimise visible changes in model behaviour from a baseline. By adopting these metrics, the field of model optimisation and compression can progress more effectively, ensuring that compressed models meet user expectations and maintain high-quality outputs.