Model distillation trains a smaller model on the outputs of a larger model so the small model learns to imitate the large one on the workflow that matters. The teacher does the expensive thinking. The student ships in production at a fraction of the cost.
How distillation works.
The teacher model (often the largest, slowest, most expensive model in the family) generates outputs over a dataset of inputs. Those outputs become labels for fine-tuning a smaller student model. The student is trained until its outputs match the teacher's closely enough on the eval rubric. The student is then deployed; the teacher is retired from the critical path.
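The loop above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the `Model` interface, the toy teacher and student, and the 0.95 agreement threshold are all assumptions made for the example, and exact-match agreement stands in for whatever eval rubric the workflow actually uses.

```python
import json
from typing import Callable, Iterable

Model = Callable[[str], str]  # hypothetical interface: prompt in, text out


def build_training_records(inputs: Iterable[str], teacher: Model) -> list[dict]:
    """Run the teacher over every input and keep its output as the label."""
    return [{"prompt": text, "completion": teacher(text)} for text in inputs]


def to_jsonl(records: list[dict]) -> str:
    """Serialize records one JSON object per line, a common fine-tuning format."""
    return "\n".join(json.dumps(record) for record in records)


def agreement_rate(inputs: list[str], teacher: Model, student: Model) -> float:
    """Fraction of inputs where the student's output equals the teacher's."""
    if not inputs:
        raise ValueError("need at least one eval input")
    matches = sum(teacher(text) == student(text) for text in inputs)
    return matches / len(inputs)


# Toy stand-ins so the sketch runs; real models would sit behind these calls.
def toy_teacher(text: str) -> str:
    return "refund" if "refund" in text.lower() else "other"


def toy_student(text: str) -> str:
    return "refund" if "refund" in text else "other"  # misses upper-case input


tickets = ["I want a refund", "REFUND please", "Where is my order?"]
records = build_training_records(tickets, toy_teacher)
training_file = to_jsonl(records)  # what a fine-tuning job would consume

rate = agreement_rate(tickets, toy_teacher, toy_student)
ready_to_deploy = rate >= 0.95  # assumed threshold; set it from your own rubric
print(f"agreement={rate:.2f} ready={ready_to_deploy}")
```

In practice the agreement check would run on a held-out eval set rather than the inputs used for training, so the score reflects generalization within the workflow rather than memorization.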
Why distillation wins in production.
- Cost. Smaller models can be ten to a hundred times cheaper per call. On a workflow running a million times a day, this is the difference between a viable feature and an abandoned one.
- Latency. Smaller models are faster. User-facing chat, real-time assist, and high-throughput pipelines all benefit from sub-second responses.
- Local hosting. A distilled small model can run on commodity hardware or even on-device, unlocking offline-capable, compliance-bound, or low-margin workflows the teacher could never serve.
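To make the cost point concrete, here is a back-of-the-envelope calculation. The one-million-calls-a-day volume and the ten-to-hundred-times range come from the text; the teacher's per-call price of $0.01 is a made-up figure for illustration, not a quote from any provider.

```python
def daily_cost(calls_per_day: int, cost_per_call: float) -> float:
    """Total spend per day in dollars."""
    return calls_per_day * cost_per_call


CALLS_PER_DAY = 1_000_000      # volume used in the text
TEACHER_COST_PER_CALL = 0.01   # hypothetical price in dollars

teacher_daily = daily_cost(CALLS_PER_DAY, TEACHER_COST_PER_CALL)

# Daily savings if the student is 10x or 100x cheaper per call.
savings = {}
for ratio in (10, 100):
    student_daily = daily_cost(CALLS_PER_DAY, TEACHER_COST_PER_CALL / ratio)
    savings[ratio] = teacher_daily - student_daily
    print(f"{ratio}x cheaper: student ${student_daily:,.0f}/day, "
          f"saves ${savings[ratio]:,.0f}/day")
```

Under these assumed prices the teacher costs about $10,000 a day, and a student ten times cheaper costs about $1,000 a day. The training and evaluation cost of the distillation itself is not modeled here and would need to be amortized against the savings.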
Caveats.
Distillation transfers task behavior, not general capability. The student inherits the teacher's habits on the trained distribution and nothing more. On off-distribution queries, quality regresses sharply. Distillation works for narrow, well-defined workflows; it is a poor fit for general-purpose assistants.
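One way to see this regression is to score the student separately on in-distribution and off-distribution eval slices. The sketch below assumes each eval example carries a slice tag; the slice names, the examples, and the keyword-matching toy student are all invented for illustration.

```python
from collections import defaultdict
from typing import Callable

# Each example: (slice name, prompt, expected label). All values are illustrative.
Example = tuple[str, str, str]


def accuracy_by_slice(examples: list[Example],
                      student: Callable[[str], str]) -> dict[str, float]:
    """Exact-match accuracy of the student, computed per slice."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for slice_name, prompt, expected in examples:
        totals[slice_name] += 1
        correct[slice_name] += int(student(prompt) == expected)
    return {name: correct[name] / totals[name] for name in totals}


def toy_student(text: str) -> str:
    # Stand-in for a student that only learned the phrasing it was trained on.
    return "refund" if "refund" in text.lower() else "other"


examples = [
    ("in_distribution", "I want a refund", "refund"),
    ("in_distribution", "Where is my order?", "other"),
    ("off_distribution", "Can I get my money back?", "refund"),
    ("off_distribution", "Cancel my subscription", "other"),
]

scores = accuracy_by_slice(examples, toy_student)
gap = scores["in_distribution"] - scores["off_distribution"]
print(scores, f"gap={gap:.2f}")
```

A large gap between the two slices is the signal that the student has learned the workflow's phrasing rather than the underlying task, and that the off-distribution traffic should not be routed to it.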
When to distill.
Distill when the workflow is stable, the eval scoreboard is mature (you need a real benchmark to know whether the student is good enough), the per-call cost or latency target is binding, and the fixture set covers the production distribution. Distillation is a late-stage optimization, not a first move.
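These conditions can be written down as an explicit gate. The sketch below is one possible encoding, not an established checklist: the `DistillationReadiness` class, the category-overlap measure of fixture coverage, and the 90% minimum are all assumptions chosen for the example.

```python
from dataclasses import dataclass


def fixture_coverage(production: set[str], fixtures: set[str]) -> float:
    """Share of production request categories that appear in the fixture set."""
    if not production:
        raise ValueError("need at least one production category")
    return len(production & fixtures) / len(production)


@dataclass
class DistillationReadiness:
    workflow_stable: bool
    eval_mature: bool
    cost_or_latency_binding: bool
    coverage: float  # 0.0 to 1.0, e.g. from fixture_coverage()

    def blockers(self, min_coverage: float = 0.9) -> list[str]:
        """Return the unmet conditions; an empty list means go ahead."""
        issues = []
        if not self.workflow_stable:
            issues.append("workflow is still changing")
        if not self.eval_mature:
            issues.append("no mature eval to judge the student against")
        if not self.cost_or_latency_binding:
            issues.append("cost or latency target is not binding yet")
        if self.coverage < min_coverage:  # assumed threshold
            issues.append(
                f"fixtures cover {self.coverage:.0%} of production "
                f"categories, below {min_coverage:.0%}"
            )
        return issues


coverage = fixture_coverage(
    production={"refund", "shipping", "billing", "login"},
    fixtures={"refund", "shipping", "billing"},
)
not_ready = DistillationReadiness(True, True, True, coverage).blockers()
ready = DistillationReadiness(True, True, True, 1.0).blockers()
print(not_ready, ready)
```

Counting categories is a coarse proxy for distribution coverage; a real check would likely weight categories by traffic volume.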
Frequently asked.
- What is model distillation?
- Model distillation is the practice of training a smaller model (the student) on the outputs of a larger model (the teacher), so the student learns to imitate the teacher on the workflow that matters. The student then ships in production at a fraction of the cost.
- When does model distillation make sense?
- Once the workflow is stable, the eval scoreboard is mature, and the per-call cost or latency target is binding. Distillation is a late-stage optimization, not a first move. The student inherits the teacher's task behavior on the trained distribution, not its general capability.
- What is a small language model (SLM)?
- A small language model is one in the one-to-ten billion parameter range, designed to run cheaply on consumer hardware or at high throughput in production. Many SLMs are themselves distilled from larger models. They excel at narrow, well-defined tasks and underperform large general-purpose models on open-ended reasoning.
- Can we distill from a closed-source model?
- Many provider terms of service forbid using their models' outputs to train a competing model. They generally permit using the outputs to build applications within your own product. Read the terms before assuming either way. The technical method is the same; the legal posture is what differs.