Model distillation trains a smaller model on the outputs of a larger model so the small model learns to imitate the large one on the workflow that matters. The teacher does the expensive thinking. The student ships in production at a fraction of the cost.

How distillation works.

The teacher model (often the largest, slowest, most expensive model in the family) generates outputs over a dataset of inputs. Those outputs become labels for fine-tuning a smaller student model. The student is trained until its outputs match the teacher's closely enough on the eval rubric. The student is then deployed; the teacher is retired from the critical path.
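
The labeling step above can be sketched as follows. This is a minimal illustration, not a definitive pipeline: `toy_teacher` is a hypothetical stand-in for a real call to the large model, and the `prompt`/`completion` record layout and the function names are assumptions chosen for the example.

```python
import json
from typing import Callable, Iterable


def build_distillation_set(
    inputs: Iterable[str], teacher: Callable[[str], str]
) -> list[dict[str, str]]:
    """Label every input with the teacher's output."""
    return [{"prompt": text, "completion": teacher(text)} for text in inputs]


def to_jsonl(records: list[dict[str, str]]) -> str:
    """Serialize the records as one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


def toy_teacher(text: str) -> str:
    """Hypothetical stand-in for a call to the large model."""
    return "refund" if "refund" in text.lower() else "other"


# Inputs sampled from the workflow the student will serve (made up here).
workflow_inputs = ["I want a refund", "Where is my order?"]

records = build_distillation_set(workflow_inputs, toy_teacher)
training_file = to_jsonl(records)
print(training_file)
```

The resulting lines would then feed whatever fine-tuning job trains the student; that step depends on the training stack and is left out here.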

Why distillation wins in production.

  • Cost. Smaller models can be ten to a hundred times cheaper per call. On a workflow running a million times a day, this is the difference between a viable feature and an abandoned one.
  • Latency. Smaller models are faster. User-facing chat, real-time assist, and high-throughput pipelines all benefit from sub-second responses.
  • Local hosting. A distilled small model can run on commodity hardware or even on-device, which opens up offline, compliance-bound, or low-margin workflows that the teacher is too large or too expensive to serve.
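
The cost point can be made concrete with a little arithmetic. The per-call prices below are hypothetical placeholders, not real provider pricing; only the million-calls-a-day volume and the ten-to-a-hundred-times range come from the text above.

```python
def daily_cost(calls_per_day: int, cost_per_call: float) -> float:
    """Total spend per day for a workflow at a fixed per-call price."""
    return calls_per_day * cost_per_call


CALLS_PER_DAY = 1_000_000      # volume from the text above
TEACHER_COST_PER_CALL = 0.01   # hypothetical: one cent per call
CHEAPNESS_FACTOR = 50          # hypothetical: inside the 10x to 100x range

student_cost_per_call = TEACHER_COST_PER_CALL / CHEAPNESS_FACTOR

teacher_daily = daily_cost(CALLS_PER_DAY, TEACHER_COST_PER_CALL)
student_daily = daily_cost(CALLS_PER_DAY, student_cost_per_call)

print(f"teacher: ${teacher_daily:,.0f} per day")
print(f"student: ${student_daily:,.0f} per day")
print(f"saved:   ${teacher_daily - student_daily:,.0f} per day")
```

Under these assumed prices the teacher costs about $10,000 a day and the student about $200 a day. Plug in your own numbers before drawing conclusions.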

Caveats.

Distillation transfers task behavior, not general capability. The student inherits the teacher's habits on the distribution it was trained on and nothing more, so its quality drops sharply on off-distribution queries. Distillation works for narrow, well-defined workflows; it fails for general-purpose assistants.
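
One way to see this caveat in practice is to measure student-teacher agreement separately on in-distribution and off-distribution inputs. The sketch below uses toy keyword classifiers as hypothetical stand-ins for both models; the function names and example inputs are invented for illustration.

```python
from typing import Callable, Sequence


def agreement_rate(
    student: Callable[[str], str],
    teacher: Callable[[str], str],
    inputs: Sequence[str],
) -> float:
    """Fraction of inputs where the student's output equals the teacher's."""
    if not inputs:
        raise ValueError("inputs must not be empty")
    matches = sum(student(text) == teacher(text) for text in inputs)
    return matches / len(inputs)


def toy_teacher(text: str) -> str:
    """Hypothetical teacher that handles several phrasings."""
    lowered = text.lower()
    if "refund" in lowered or "money back" in lowered:
        return "refund"
    if "order" in lowered or "package" in lowered:
        return "shipping"
    return "other"


def toy_student(text: str) -> str:
    """Hypothetical student that only learned the phrasings it was trained on."""
    lowered = text.lower()
    if "refund" in lowered:
        return "refund"
    if "order" in lowered:
        return "shipping"
    return "other"


in_distribution = ["I want a refund", "Where is my order?", "Refund please", "Order status?"]
off_distribution = ["I want my money back", "My package never came", "Hello there", "Can I get a refund?"]

in_rate = agreement_rate(toy_student, toy_teacher, in_distribution)
off_rate = agreement_rate(toy_student, toy_teacher, off_distribution)
print(f"in-distribution agreement:  {in_rate:.2f}")
print(f"off-distribution agreement: {off_rate:.2f}")
```

Exact-match agreement only suits classification-style outputs; free-text workflows would need a rubric-based comparison instead.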

When to distill.

Distill when the workflow is stable, the eval scoreboard is mature (you need a real benchmark to know whether the student is good enough), the per-call cost or latency target is binding, and the fixture set covers the production distribution. Distillation is a late-stage optimization, not a first move.
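
The conditions above can be written down as an explicit checklist. This is only a sketch: the `Readiness` class and its field names are hypothetical, and deciding whether each condition actually holds remains a judgment call for the team.

```python
from dataclasses import asdict, dataclass


@dataclass
class Readiness:
    """One flag per precondition named in the text above."""
    workflow_stable: bool
    eval_benchmark_mature: bool
    cost_or_latency_binding: bool
    fixtures_cover_production: bool


def unmet_conditions(readiness: Readiness) -> list[str]:
    """Names of the preconditions that are not yet satisfied."""
    return [name for name, met in asdict(readiness).items() if not met]


status = Readiness(
    workflow_stable=True,
    eval_benchmark_mature=False,
    cost_or_latency_binding=True,
    fixtures_cover_production=True,
)

blockers = unmet_conditions(status)
print("ready to distill" if not blockers else f"not yet, unmet: {blockers}")
```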

Frequently asked.

What is model distillation?
Model distillation is the practice of training a smaller model (the student) on the outputs of a larger model (the teacher), so the student learns to imitate the teacher on the workflow that matters. The student then ships in production at a fraction of the cost.
When does model distillation make sense?
It makes sense once the workflow is stable, the eval scoreboard is mature, and the per-call cost or latency target is binding. Distillation is a late-stage optimization, not a first move. The student inherits the teacher's task behavior on the trained distribution, not its general capability.
What is a small language model (SLM)?
A small language model is one in the one-to-ten billion parameter range, designed to run cheaply on consumer hardware or at high throughput in production. Many SLMs are themselves distilled from larger models. They excel at narrow, well-defined tasks and underperform large general-purpose models on open-ended reasoning.
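
To see why this parameter range suits consumer hardware, a rough estimate of weight memory helps. The sketch below counts weights only and ignores activations and other runtime overhead; the 16-bit and 4-bit precisions are illustrative assumptions, not figures from this answer.

```python
def weight_memory_gb(num_parameters: int, bits_per_parameter: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e9


for params in (1_000_000_000, 7_000_000_000, 10_000_000_000):
    at_16_bit = weight_memory_gb(params, 16)
    at_4_bit = weight_memory_gb(params, 4)
    print(
        f"{params / 1e9:.0f}B parameters: "
        f"{at_16_bit:.1f} GB at 16-bit, {at_4_bit:.1f} GB at 4-bit"
    )
```
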
Can we distill from a closed-source model?
Many provider terms of service forbid using a model's outputs to build a competing model. They generally permit using the outputs to build applications within your own product. Read the provider's terms before assuming your use is allowed. The technical method is the same; the legal posture is what differs.