Model distillation · Morvion Glossary

Model distillation trains a smaller model on the outputs of a larger model so the small model learns to imitate the large one on the workflow that matters. The teacher does the expensive thinking. The student ships in production at a fraction of the cost.

How distillation works.

The teacher model (often the largest, slowest, most expensive model in the family) generates outputs over a dataset of inputs. Those outputs become labels for fine-tuning a smaller student model. The student is trained until its outputs match the teacher's closely enough on the eval rubric. The student is then deployed; the teacher is retired from the critical path.
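The loop described above can be sketched in a few lines. This is a minimal illustration, not a training recipe: `toy_teacher` and `toy_student` are hypothetical stand-ins for real model calls, the prompt/completion row layout is an assumed fine-tuning format, and exact-match agreement stands in for whatever eval rubric the workflow actually uses.

```python
import json
from typing import Callable

Model = Callable[[str], str]  # hypothetical interface: text in, text out


def build_distillation_set(inputs: list[str], teacher: Model) -> list[dict]:
    """Run the teacher over every input and keep its output as the training label."""
    return [{"prompt": text, "completion": teacher(text)} for text in inputs]


def agreement_rate(inputs: list[str], teacher: Model, student: Model) -> float:
    """Fraction of inputs where the student reproduces the teacher's output exactly."""
    if not inputs:
        return 0.0
    matches = sum(1 for text in inputs if student(text) == teacher(text))
    return matches / len(inputs)


# Hypothetical stand-ins for real models (assumption: a ticket-triage workflow).
def toy_teacher(text: str) -> str:
    return "urgent" if "outage" in text.lower() else "routine"


def toy_student(text: str) -> str:
    # Deliberately weaker than the teacher: it misses capitalised "Outage".
    return "urgent" if "outage" in text else "routine"


tickets = [
    "Outage in region eu-west",
    "Please update my invoice address",
    "outage again since 9am",
]

# Step 1: teacher outputs become labels.
training_rows = build_distillation_set(tickets, toy_teacher)
# Step 2: serialise as JSON Lines, one training example per line.
jsonl = "\n".join(json.dumps(row) for row in training_rows)
# Step 3: after fine-tuning, check how closely the student tracks the teacher.
rate = agreement_rate(tickets, toy_teacher, toy_student)
print(f"student/teacher agreement: {rate:.0%}")
```

In practice the agreement check would run on held-out inputs rather than the training set, and the threshold for "closely enough" would come from the eval rubric.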

Why distillation wins in production.

Cost. Smaller models can be ten to a hundred times cheaper per call. On a workflow running a million times a day, this is the difference between a viable feature and an abandoned one.
Latency. Smaller models are faster. User-facing chat, real-time assist, and high-throughput pipelines all benefit from sub-second responses.
Local hosting. A distilled small model can run on commodity hardware or even on-device, unlocking offline-capable, compliance-bound, or low-margin workflows the teacher could never serve.
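To make the cost point concrete, the arithmetic can be worked through for the million-calls-a-day case. The teacher price below is a made-up figure chosen for illustration, not a quote from any provider; only the ten-to-a-hundred-times ratio comes from the text above.

```python
def daily_cost(calls_per_day: int, cost_per_call: float) -> float:
    """Total spend per day for a workflow at a flat per-call price."""
    return calls_per_day * cost_per_call


CALLS_PER_DAY = 1_000_000
TEACHER_COST_PER_CALL = 0.01  # hypothetical price in USD, for illustration only

teacher_daily = daily_cost(CALLS_PER_DAY, TEACHER_COST_PER_CALL)
student_daily_10x = daily_cost(CALLS_PER_DAY, TEACHER_COST_PER_CALL / 10)
student_daily_100x = daily_cost(CALLS_PER_DAY, TEACHER_COST_PER_CALL / 100)

print(f"teacher:                     ${teacher_daily:,.0f} per day")
print(f"student, 10x cheaper/call:   ${student_daily_10x:,.0f} per day")
print(f"student, 100x cheaper/call:  ${student_daily_100x:,.0f} per day")
```

Under these assumed numbers the per-call ratio carries straight through to the daily bill, which is why the saving matters most on high-volume workflows.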

Caveats.

Distillation transfers task behavior, not general capability. The student inherits the teacher's habits on the trained distribution and nothing more, so its quality drops sharply on off-distribution queries. Distillation works for narrow, well-defined workflows; it fails for general-purpose assistants.
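One way to see this caveat is to measure student/teacher agreement separately on inputs the student was trained on and on inputs it was not. The sketch below uses a deliberately exaggerated toy: `LookupStudent` is a hypothetical student that memorises the teacher's labels and falls back to the most common label otherwise. A real distilled model generalises better than this, but the same split measurement applies.

```python
from collections import Counter
from typing import Callable


def toy_teacher(text: str) -> str:
    """Hypothetical teacher for a ticket-triage workflow."""
    lowered = text.lower()
    return "urgent" if ("outage" in lowered or "down" in lowered) else "routine"


class LookupStudent:
    """Toy student: memorises teacher labels seen in training, nothing else."""

    def __init__(self, training_inputs: list[str], teacher: Callable[[str], str]):
        self.labels = {text: teacher(text) for text in training_inputs}
        # Fallback for unseen inputs: the most frequent training label.
        self.default = Counter(self.labels.values()).most_common(1)[0][0]

    def __call__(self, text: str) -> str:
        return self.labels.get(text, self.default)


def agreement(inputs: list[str], teacher, student) -> float:
    """Fraction of inputs where student and teacher give the same output."""
    return sum(student(t) == teacher(t) for t in inputs) / len(inputs)


trained_on = ["Invoice address change", "Reset my password", "Outage in eu-west"]
off_distribution = [
    "Checkout is down for all users",
    "API down since 09:00",
    "Update billing email",
]

student = LookupStudent(trained_on, toy_teacher)
in_dist = agreement(trained_on, toy_teacher, student)
off_dist = agreement(off_distribution, toy_teacher, student)
print(f"in-distribution agreement:  {in_dist:.0%}")
print(f"off-distribution agreement: {off_dist:.0%}")
```

Reporting the two numbers side by side, rather than one blended score, keeps an off-distribution regression from hiding behind strong in-distribution results.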

When to distill.

Distill when four conditions hold: the workflow is stable, the eval scoreboard is mature (you need a real benchmark to know whether the student is good enough), the per-call cost or latency target is binding, and the fixture set covers the production distribution. Distillation is a late-stage optimization, not a first move.
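These conditions can be written down as an explicit checklist so the decision is recorded rather than assumed. The sketch below is one possible encoding: the class and field names are hypothetical, and each flag would be set by a person reviewing the workflow, not computed automatically.

```python
from dataclasses import asdict, dataclass


@dataclass
class DistillationReadiness:
    """Hypothetical checklist mirroring the four conditions above."""

    workflow_stable: bool
    eval_benchmark_mature: bool
    cost_or_latency_binding: bool
    fixtures_cover_production: bool

    def unmet(self) -> list[str]:
        """Names of the conditions that are not yet satisfied."""
        return [name for name, ok in asdict(self).items() if not ok]

    def ready(self) -> bool:
        """True only when every condition holds."""
        return not self.unmet()


candidate = DistillationReadiness(
    workflow_stable=True,
    eval_benchmark_mature=True,
    cost_or_latency_binding=True,
    fixtures_cover_production=False,
)
print("ready to distill:", candidate.ready())
print("still missing:", candidate.unmet())
```

Listing the unmet conditions, instead of returning a bare yes or no, shows what work remains before distillation is worth attempting.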

Frequently asked.

What is model distillation?

Model distillation is the practice of training a smaller model (the student) on the outputs of a larger model (the teacher), so the student learns to imitate the teacher on the workflow that matters. The student then ships in production at a fraction of the cost.

When does model distillation make sense?

Once the workflow is stable, the eval scoreboard is mature, and the per-call cost or latency target is binding. Distillation is a late-stage optimization, not a first move. The student inherits the teacher's task behavior on the trained distribution, not its general capability.

What is a small language model (SLM)?

A small language model is one in the one-to-ten billion parameter range, designed to run cheaply on consumer hardware or at high throughput in production. Many SLMs are themselves distilled from larger models. They excel at narrow, well-defined tasks and underperform large general-purpose models on open-ended reasoning.

Can we distill from a closed-source model?

Many provider terms of service forbid distilling their models specifically to build a competing model. They generally permit using the outputs to build derivative applications under your own product. Read the license terms before assuming. The technical method is the same; the legal posture is what differs.
