Former OpenAI researcher Andrej Karpathy has introduced llm.c, a project aimed at training LLMs in pure C without the hefty dependencies of PyTorch and CPython.
Have you ever wanted to train LLMs in pure C without 245MB of PyTorch and 107MB of cPython? No? Well now you can! With llm.c:https://t.co/w2wkY0Ho5m
— Andrej Karpathy (@karpathy) April 8, 2024
The llm.c project, available on GitHub, implements GPT-2 training on CPU in fp32 in roughly 1,000 lines of clean C code.
“I chose GPT-2 as the first working example because it is the grand-daddy of LLMs, the first time the modern stack was put together,” wrote Karpathy in his GitHub repository.
One of the key advantages of llm.c is that it compiles and runs instantly and exactly matches the PyTorch reference implementation. By allocating all of its memory in a single block at the start of training, llm.c maintains a constant memory footprint for the rest of the run, which keeps data streaming and batch processing efficient.
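The pattern looks roughly like the following minimal C sketch: one contiguous allocation made up front, carved into per-tensor views whose pointers never change afterwards. The struct, field, and function names here are illustrative assumptions, not the actual llm.c identifiers.

#include <stdlib.h>

// Sketch of the single-allocation idea: one malloc at startup, carved into
// per-tensor views. Names are illustrative, not the actual llm.c identifiers.
typedef struct {
    float *wte;  // token embedding table   (vocab x C)
    float *wpe;  // positional embeddings   (maxT  x C)
    float *lnw;  // final layernorm scale   (C)
    float *lnb;  // final layernorm bias    (C)
} ParamViews;

float *alloc_params(ParamViews *p, size_t vocab, size_t maxT, size_t C) {
    size_t sizes[4] = { vocab * C, maxT * C, C, C };
    size_t total = 0;
    for (int i = 0; i < 4; i++) total += sizes[i];

    // one allocation for everything; the memory footprint is fixed from here on
    float *block = (float *)malloc(total * sizeof(float));
    if (block == NULL) return NULL;

    // hand out non-overlapping slices of the block to each tensor
    float **views[4] = { &p->wte, &p->wpe, &p->lnw, &p->lnb };
    float *cursor = block;
    for (int i = 0; i < 4; i++) {
        *views[i] = cursor;
        cursor += sizes[i];
    }
    return block;  // freeing this one pointer releases the whole model
}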
The core of llm.c lies in manually implementing the forward and backward passes for each individual layer: layernorm, encoder, matmul, self-attention, GELU, residual, softmax, and cross-entropy loss. This hand-written approach demands careful pointer arithmetic and tensor offsets so that every layer reads and writes exactly the right slice of memory.
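To give a flavour of what one such hand-written layer involves, here is a hedged sketch of a GPT-2-style GELU activation in plain C, using the standard tanh approximation; the actual llm.c source may differ in naming and details. The forward pass applies the activation elementwise, and the backward pass accumulates the local gradient into the input's gradient buffer, which is the same shape of work that has to be repeated for every layer in the list above.

#include <math.h>

// GPT-2 style GELU, tanh approximation. Sketch only; names and details
// may differ from the actual llm.c source. Link with -lm.
#define GELU_SCALE 0.7978845608f  /* sqrt(2/pi) */

// forward: out[i] = 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))
void gelu_forward(float *out, const float *inp, int n) {
    for (int i = 0; i < n; i++) {
        float x = inp[i];
        float cube = 0.044715f * x * x * x;
        out[i] = 0.5f * x * (1.0f + tanhf(GELU_SCALE * (x + cube)));
    }
}

// backward: accumulate dL/dinp given dL/dout, using the derivative of the
// same tanh approximation; gradients are added (+=), not overwritten.
void gelu_backward(float *dinp, const float *inp, const float *dout, int n) {
    for (int i = 0; i < n; i++) {
        float x = inp[i];
        float cube = 0.044715f * x * x * x;
        float u = GELU_SCALE * (x + cube);
        float t = tanhf(u);
        float sech2 = 1.0f - t * t;  // sech^2(u)
        float du_dx = GELU_SCALE * (1.0f + 3.0f * 0.044715f * x * x);
        float local = 0.5f * (1.0f + t) + 0.5f * x * sech2 * du_dx;
        dinp[i] += local * dout[i];
    }
}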
“I am curious to learn more about Rust and totally understand the appeal. But I still find C so nice, simple, clean, portable and beautiful, aesthetically. It’s as close as you want to get to direct communion with the machine,” wrote Karpathy.
Karpathy’s next endeavor involves porting llm.c to CUDA layer by layer, aiming for performance comparable to PyTorch but without the heavyweight dependencies. Moving to CUDA also opens the door to lowering precision from fp32 to fp16 and below, and to supporting more modern architectures such as Llama 2, Mistral, and Gemma.