I built a from-scratch LLaMA-style inference engine to deeply understand transformer internals and inference mechanics.
The project implements:
- Token-by-token autoregressive decoding
- KV cache (per-layer, per-head; see the decode-step sketch below)
- Grouped-Query Attention (GQA) with correct KV expansion (sketched below)
- RoPE applied to Q/K (sketched below)
- RMSNorm + SwiGLU MLP
- Distributed matrix execution across multiple nodes (custom cluster backend)
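For anyone who hasn't implemented these pieces before, here is roughly what RoPE on Q/K looks like. This is a simplified PyTorch sketch of the standard rotary formulation, not the exact code in the repo; shapes and names are illustrative:

```python
import torch

def apply_rope(x: torch.Tensor, positions: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    # x: (batch, n_heads, seq_len, head_dim), head_dim even
    # positions: (seq_len,) absolute token positions (matters once a KV cache is in play)
    head_dim = x.shape[-1]
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = positions.float()[:, None] * freqs[None, :]      # (seq_len, head_dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                        # rotate each channel pair
    out = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return out.flatten(-2).type_as(x)                          # re-interleave the pairs
```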
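The GQA KV expansion amounts to repeating each KV head across its group of query heads before attention. Again a simplified sketch under the usual shape conventions, not the repo's actual code:

```python
def expand_kv(k: torch.Tensor, v: torch.Tensor, n_q_heads: int):
    # k, v: (batch, n_kv_heads, seq_len, head_dim)
    # Returns K/V repeated to (batch, n_q_heads, seq_len, head_dim)
    n_kv_heads = k.shape[1]
    assert n_q_heads % n_kv_heads == 0
    n_rep = n_q_heads // n_kv_heads
    # repeat_interleave keeps each group of adjacent query heads mapped to one KV head
    return k.repeat_interleave(n_rep, dim=1), v.repeat_interleave(n_rep, dim=1)
```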
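A single decode step then ties the cache, RoPE, and GQA expansion together. This is only an illustration of the flow (`layer.project_qkv` is a hypothetical helper standing in for the per-layer Q/K/V projections):

```python
def decode_step(x, layer, cache, pos):
    # x: (batch, 1, d_model) hidden state for the newest token
    # cache: dict with "k"/"v" of shape (batch, n_kv_heads, pos, head_dim), empty at pos=0
    q, k, v = layer.project_qkv(x)                      # hypothetical projection helper
    q = apply_rope(q, torch.tensor([pos]))
    k = apply_rope(k, torch.tensor([pos]))
    cache["k"] = torch.cat([cache["k"], k], dim=2)      # grow the cache along the sequence dim
    cache["v"] = torch.cat([cache["v"], v], dim=2)
    k_full, v_full = expand_kv(cache["k"], cache["v"], n_q_heads=q.shape[1])
    scores = (q @ k_full.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v_full       # no mask needed: cache holds only the past
```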
The goal was correctness and clarity rather than performance tuning: I wanted to understand how modern LLM inference actually works under the hood.
Code + README:
https://github.com/rinoScremin/Open-Cluster
I’d love feedback from anyone working on inference engines or distributed LLM systems.