2026

Koda, an LLM trained from scratch

1.27B parameters, trained from scratch to understand how LLMs work

A 1.27-billion-parameter language model trained from scratch to understand how LLMs work internally. LLaMA-style decoder-only architecture (24 layers, GQA, SwiGLU, RoPE), trained in JAX/Flax NNX on 2 L40S GPUs. Checkpoints are published on Hugging Face, with Transformers, GGUF and MLX exports.

github.com huggingface.co

KodaLite-1.3B is a language model I trained from scratch, not for its size but to understand the internals. It uses a LLaMA-style decoder-only architecture (24 layers, hidden size 2048, GQA with 32 query heads and 8 KV heads, SwiGLU, pre-norm RMSNorm, RoPE) and a GPT-2 BPE tokenizer. The full pipeline runs in JAX + Flax NNX on 2 NVIDIA L40S GPUs in bf16: pretraining on SlimPajama (~1.6 billion tokens, ~25 hours) with a crash-recovery orchestrator, LoRA SFT on Dolly and OASST, and NTK-aware context extension from 1024 to 2048 tokens. Checkpoints are published on Hugging Face with exports to Transformers, GGUF (llama.cpp, Ollama, LM Studio) and MLX, alongside a homemade 8-task zero-shot benchmark.
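
To see how the listed dimensions add up to roughly 1.27B parameters, here is a rough counting sketch. The layer count, hidden size and head counts come from the description above. The FFN width (5504), the untied output head and the exact vocabulary size (50257, the standard GPT-2 BPE vocabulary) are assumptions made for illustration and may not match the real configuration.

```python
# Rough parameter count for a LLaMA-style decoder with grouped-query attention.
# From the text: 24 layers, hidden 2048, 32 query heads, 8 KV heads.
# ASSUMED (not stated in the source): ffn=5504, vocab=50257, untied output head.

def count_params(n_layers, hidden, n_heads, n_kv_heads, ffn, vocab, tied):
    head_dim = hidden // n_heads          # 2048 / 32 = 64
    kv_dim = n_kv_heads * head_dim        # K and V are shared across query groups
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # Wq, Wo, then Wk, Wv
    mlp = 3 * hidden * ffn                # SwiGLU: gate, up and down projections
    norms = 2 * hidden                    # two RMSNorm weight vectors per block
    per_layer = attn + mlp + norms
    embed = vocab * hidden                # token embedding table
    head = 0 if tied else vocab * hidden  # separate output projection if untied
    final_norm = hidden
    return n_layers * per_layer + embed + head + final_norm


total = count_params(
    n_layers=24, hidden=2048, n_heads=32, n_kv_heads=8,
    ffn=5504, vocab=50257, tied=False,
)
print(f"{total / 1e9:.2f}B parameters")
```

Under these assumptions the total lands close to the stated 1.27B, which shows the figures are mutually plausible but does not confirm the actual FFN width or embedding setup. With 8 KV heads instead of 32, the K and V projections are four times smaller than in standard multi-head attention.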

Challenges

Training a 1.27B-parameter model on a limited GPU budget (2x L40S, 96 GB VRAM in total)
Sustaining a ~25 hour pretraining run without losing progress
Extending context from 1024 to 2048 tokens after pretraining
Making the model usable outside JAX (Transformers, GGUF, MLX)
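
Keeping a ~25 hour run from losing progress usually comes down to two things: checkpoints that are never left half-written, and a loop that resumes from the last good one. The sketch below shows that pattern with the standard library only. The function names, the JSON state and the `RuntimeError` trigger are hypothetical stand-ins, not the project's orchestrator; a real run would save model and optimizer arrays instead.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # make sure bytes hit disk before the rename
        os.replace(tmp, path)     # readers see either the old or the new file
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def load_checkpoint(path):
    """Return the last saved state, or a fresh one if nothing was saved yet."""
    if not os.path.exists(path):
        return {"step": 0, "loss": None}
    with open(path) as f:
        return json.load(f)


def run(path, total_steps, save_every, train_step, max_restarts=3):
    """Train to total_steps, resuming from the last checkpoint after a failure."""
    restarts = 0
    while True:
        state = load_checkpoint(path)
        try:
            while state["step"] < total_steps:
                state["loss"] = train_step(state["step"])
                state["step"] += 1
                if state["step"] % save_every == 0 or state["step"] == total_steps:
                    save_checkpoint(path, state)
            return state
        except RuntimeError:
            restarts += 1
            if restarts > max_restarts:
                raise
```

The cost of a crash is bounded by `save_every`: any steps taken since the last checkpoint are simply repeated after the restart.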

Solutions

JAX + Flax NNX implementation in bf16 with a crash-recovery orchestrator
SlimPajama pretraining, then LoRA SFT (Dolly, OASST) and an EOS token fix
NTK-aware context extension without full retraining
Export pipeline to Hugging Face Transformers, GGUF and MLX (fp16 and 8-bit)
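
For the context extension, a minimal sketch of the commonly cited NTK-aware rule is shown here: instead of compressing positions, the RoPE base is raised so that low frequencies stretch while high frequencies stay put. The head dimension (64 = 2048 / 32) and the scale (2048 / 1024 = 2) follow from the figures in this page. The RoPE base of 10000 is an assumed default, and the project may use a different variant of the rule.

```python
import math

HEAD_DIM = 64        # hidden 2048 / 32 query heads
SCALE = 2048 / 1024  # target context / pretraining context
BASE = 10000.0       # ASSUMED RoPE base (common default, not stated in the source)


def rope_inv_freq(head_dim, base=BASE):
    """Inverse frequency for each rotated pair of dimensions."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def ntk_scaled_base(head_dim, scale, base=BASE):
    """NTK-aware rule: base' = base * scale^(d / (d - 2))."""
    return base * scale ** (head_dim / (head_dim - 2))


def rope_angles(position, inv_freq):
    """Rotation angle applied to each dimension pair at a given position."""
    return [position * f for f in inv_freq]


orig = rope_inv_freq(HEAD_DIM)
scaled = rope_inv_freq(HEAD_DIM, ntk_scaled_base(HEAD_DIM, SCALE))

# The fastest pair is untouched, so local token order is encoded as before.
assert scaled[0] == orig[0]
# The slowest pair is slowed by exactly 1 / SCALE, so position 2047 now maps
# to an angle that was already seen during pretraining.
assert math.isclose(scaled[-1], orig[-1] / SCALE)
```

Because only the frequency table changes, the pretrained weights can be reused as they are, which is why this kind of extension avoids full retraining.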

Results

KodaLite-1.3B published on Hugging Face (YoAbriel/KodaLite-1.3B, GGUF and MLX variants)
Full pretraining: ~1.6B SlimPajama tokens in ~25h on 2x L40S
Homemade 8-task zero-shot benchmark to measure what the model can actually do
Public code on GitHub (Koda-v0.1)
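
The pretraining figures above translate into a rough throughput estimate. The sketch below uses only the approximate numbers quoted on this page, plus the widely used 6 x parameters x tokens approximation for the training compute of a dense transformer, so the results are order-of-magnitude estimates and not measurements.

```python
def pretraining_summary(tokens, hours, gpus, params):
    """Back-of-the-envelope throughput from approximate run statistics."""
    seconds = hours * 3600
    tokens_per_sec = tokens / seconds
    train_flops = 6 * params * tokens  # standard dense-transformer approximation
    return {
        "tokens_per_sec": tokens_per_sec,
        "tokens_per_sec_per_gpu": tokens_per_sec / gpus,
        "train_flops": train_flops,
        "tflops_per_gpu": train_flops / seconds / gpus / 1e12,
        "tokens_per_param": tokens / params,
    }


# Approximate values quoted in the text: ~1.6B tokens, ~25 h, 2 GPUs, 1.27B params.
summary = pretraining_summary(tokens=1.6e9, hours=25, gpus=2, params=1.27e9)
for name, value in summary.items():
    print(f"{name}: {value:.4g}")
```

With these inputs the run works out to roughly 17,800 tokens per second overall, or about 8,900 per GPU, and about 1.26 training tokens per parameter.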

Technologies

JAX · Flax NNX · Python · LoRA · SlimPajama · Hugging Face · GGUF · MLX