The token budget is the declared maximum number of tokens an AI workflow may consume per request — across the system prompt, the retrieved context, any reasoning traces, and the model's output. Enforcing a budget at runtime is what keeps cost and latency predictable; without one, both grow silently until a production incident makes them visible.
Why a budget matters.
- Cost. Most providers charge per token. A workflow with no budget can quadruple its bill overnight when a retrieval change pushes more context into every prompt.
- Latency. Longer prompts and outputs take longer to process. An enforced budget bounds the worst case, which keeps p95 latency predictable.
- Quality. Above a workload-specific point, more context actively reduces answer quality (lost-in-the-middle, retrieval noise dilution). The budget enforces context discipline.
How to set one.
Measure the current consumption distribution: per-step token counts across at least 100 representative fixtures. The p95 of the distribution, plus a 20% margin, is a reasonable starting budget. Run the workflow under that budget, then trim the costliest steps, re-running the eval harness after each cut to confirm it still passes at the new ceiling.
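The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name `starting_budget`, its parameters, and the nearest-rank percentile method are assumptions chosen for the example, and the input is assumed to be one total token count per fixture (the sum of that fixture's per-step counts).

```python
def starting_budget(token_totals, percentile=95, margin_pct=20):
    """Return a starting budget: the chosen percentile plus a safety margin.

    token_totals: one total token count per fixture (hypothetical input shape).
    """
    if len(token_totals) < 100:
        raise ValueError("measure at least 100 representative fixtures")
    ordered = sorted(token_totals)
    # Nearest-rank percentile: the smallest sample with at least
    # `percentile` percent of all samples at or below it.
    rank = (percentile * len(ordered) + 99) // 100  # integer ceiling division
    p_value = ordered[rank - 1]
    # Add the margin, rounding up to a whole token (integer ceiling division).
    return (p_value * (100 + margin_pct) + 99) // 100


# Example: 100 fixtures consuming 1..100 tokens have a p95 of 95 tokens,
# so the starting budget with a 20% margin is 114 tokens.
print(starting_budget(list(range(1, 101))))
```

Integer arithmetic is used throughout so the result does not depend on floating-point rounding at the percentile or margin step.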
Enforce it at the gateway.
Budget enforcement happens at the model gateway, not in application code. The Vercel AI Gateway, OpenRouter, and most provider SDKs support per-request token limits. Set the budget there so an application bug cannot accidentally exceed it. Cost regressions that show up in observability traces often turn out to stem from a missing token budget.
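As a rough sketch of what gateway-side accounting might look like, the snippet below subtracts the non-output components from the budget and returns whatever is left as the output cap. Everything here is hypothetical: `RequestTokens`, `output_cap`, and the field names are invented for illustration and do not correspond to the API of any particular gateway or SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestTokens:
    """Token counts known before the model is called (hypothetical shape)."""
    system_prompt: int
    retrieved_context: int
    reasoning_reserve: int  # tokens set aside for reasoning traces


def output_cap(budget: int, request: RequestTokens) -> int:
    """Return the output tokens still available under the budget.

    Raises ValueError when the non-output components already use the
    whole budget, so the request can be rejected before reaching the model.
    """
    used = (
        request.system_prompt
        + request.retrieved_context
        + request.reasoning_reserve
    )
    remaining = budget - used
    if remaining <= 0:
        raise ValueError(
            f"budget of {budget} tokens exhausted before output ({used} used)"
        )
    return remaining


# Example: a 4000-token budget with 3400 tokens already committed
# leaves 600 tokens for the model's output.
cap = output_cap(4000, RequestTokens(500, 2500, 400))
print(cap)
```

The returned value is what would be passed along as the per-request output limit, so the total stays under the budget even if the application asks for more.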
Frequently asked.
- What is a token budget for an AI workflow?
- A token budget is the declared maximum number of tokens a single AI workflow may consume per request — across system prompt, retrieved context, reasoning, and output combined. Enforced at the model gateway, it keeps cost and latency predictable and surfaces regressions in context discipline before they hit the bill.
- How do I calculate the right budget for my workflow?
- Measure the token-count distribution across at least 100 representative fixtures. Take the p95 of that distribution plus a 20% margin. Run the workflow under that ceiling, then trim the costliest steps (often retrieval window, sometimes system-prompt boilerplate), re-running the eval harness after each cut to confirm it still passes at the new budget.
- Should the budget be the same for every workflow?
- No. A document-summarization workflow with 5k-token inputs needs a very different budget from a short-reply workflow with 50-token inputs. Set per-workflow budgets, enforce each at the gateway, and treat the budget as a regression-gate metric in CI so that prompt or retrieval changes that exceed it surface in PR review.
- What happens when a request exceeds the budget?
- The gateway rejects the call before it reaches the model, returning a structured error. The application's refusal handler produces a user-facing message and logs the event. This is preferable to the silent alternative — a runaway request that costs ten times the expected amount before completing.
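The rejection path described in the last answer could look roughly like the following. This is a sketch under stated assumptions: the `BudgetExceeded` error shape, the `BUDGETS` table and its values, and the function names are all hypothetical, and a real gateway will define its own error format.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("token_budget")

# Hypothetical per-workflow budgets; real values come from measurement.
BUDGETS = {"summarize-document": 8000, "short-reply": 600}


@dataclass(frozen=True)
class BudgetExceeded:
    """Hypothetical structured error returned instead of calling the model."""
    workflow: str
    budget: int
    requested: int


def check_request(workflow: str, requested: int) -> Optional[BudgetExceeded]:
    """Return None when the request fits, otherwise a structured error."""
    budget = BUDGETS[workflow]
    if requested > budget:
        return BudgetExceeded(workflow, budget, requested)
    return None


def refusal_message(error: BudgetExceeded) -> str:
    """Log the rejection and produce the user-facing message."""
    logger.warning(
        "budget exceeded: workflow=%s budget=%d requested=%d",
        error.workflow, error.budget, error.requested,
    )
    return "This request is too large to process. Try shortening the input."


# Example: a 900-token request to a workflow with a 600-token budget.
error = check_request("short-reply", 900)
if error is not None:
    print(refusal_message(error))
```

Returning the error as data rather than raising keeps the budget, the requested size, and the workflow name available to whatever logs or alerts on the event.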