What is a context window?

The context window is the maximum number of tokens a language model can process in a single call. The system prompt, the conversation history, the retrieved context, and the response itself all share this budget. A call that exceeds it is either truncated or rejected.

How big are context windows in 2026?

200k tokens is the practical floor for production-grade models (Claude, GPT-5), and Gemini 1.5/2.0 advertise up to ~2M. Quality, however, degrades past ~100k tokens for most workloads regardless of the advertised limit. The honest planning number is ~100k usable tokens, not the headline maximum.

Should I always use the longest available context window?

No. Longer context costs more per call, and quality often starts to degrade past ~50k tokens because of the lost-in-the-middle effect. For RAG specifically, reranking to the top 10 chunks beats dumping 100 chunks into a long-context call. Big windows are a tool, not a strategy.

What's 'lost in the middle'?

A well-documented phenomenon in which language models reliably under-attend to information placed in the middle of long contexts. It is most prominent past ~50k tokens. The practical implication: don't trust the model to find a needle in a haystack; give it the needle with minimal hay.

Context window · Morvion Glossary

The context window is the maximum number of tokens a language model can read and reason over in a single call. Everything in the call (system prompt, conversation history, retrieved context, and the response itself) must fit. A call that exceeds the window is either truncated or rejected.
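As a rough illustration of the shared budget, the sketch below adds up each component of a call and checks the total against a window size. The names `CallParts`, `estimate_tokens`, and `fits_window` are hypothetical, and the 4-characters-per-token estimate is only an assumption standing in for the model's real tokenizer.

```python
from dataclasses import dataclass


def estimate_tokens(text: str) -> int:
    """Rough estimate assuming ~4 characters per token (not a real tokenizer)."""
    return (len(text) + 3) // 4


@dataclass
class CallParts:
    """Hypothetical container for everything that shares one context window."""
    system_prompt: str
    history: list[str]
    retrieved: list[str]
    max_response_tokens: int  # room reserved for the model's answer


def total_tokens(parts: CallParts) -> int:
    """Sum every component of the call, including the reserved response."""
    input_tokens = estimate_tokens(parts.system_prompt)
    input_tokens += sum(estimate_tokens(m) for m in parts.history)
    input_tokens += sum(estimate_tokens(c) for c in parts.retrieved)
    return input_tokens + parts.max_response_tokens


def fits_window(parts: CallParts, window_tokens: int) -> bool:
    """True when the whole call, response included, fits the window."""
    return total_tokens(parts) <= window_tokens
```

Reserving the response tokens up front matters in this sketch: a prompt that fills the window exactly leaves the model no room to answer.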

State in 2026.

200k tokens is the practical floor for production-grade models (Claude, GPT-5). The frontier extends to ~2M (Gemini 1.5/2.0), but quality degrades past ~100k tokens for most workloads regardless of the advertised limit. The honest planning number is ~100k usable tokens, not the headline maximum.
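To make the planning number concrete, here is a small worked calculation of how much room is left for retrieved context once the fixed costs of a call are subtracted. The function name `retrieval_budget` and the example token counts are illustrative assumptions; only the 200k and 100k figures come from the text.

```python
ADVERTISED_WINDOW = 200_000  # headline limit cited in the text
PLANNING_CAP = 100_000       # "usable" planning number cited in the text


def retrieval_budget(system_tokens: int, history_tokens: int,
                     response_tokens: int, cap: int = PLANNING_CAP) -> int:
    """Tokens left for retrieved context after fixed costs, never below zero."""
    remaining = cap - (system_tokens + history_tokens + response_tokens)
    return max(remaining, 0)
```

With a hypothetical 2,000-token system prompt, 30,000 tokens of history, and 4,000 tokens reserved for the response, the planning cap leaves 64,000 tokens for retrieval, whereas the advertised window would suggest 164,000.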

Lost in the middle.

Past ~50k tokens, models reliably under-attend to information placed in the middle of the context (the "lost in the middle" phenomenon, documented since 2023). For RAG specifically, this means that dumping 100 retrieved chunks into the prompt works worse than reranking to the top 10 and concatenating those. Context added past the relevance frontier has negative value.
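The rerank-then-concatenate pattern can be sketched as follows. The scoring function here is a toy word-overlap measure, used only so the example runs on its own; a production system would more likely use a dedicated reranking model. All function names are hypothetical.

```python
import re


def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words that appear in the chunk."""
    q = set(re.findall(r"\w+", query.lower()))
    c = set(re.findall(r"\w+", chunk.lower()))
    return len(q & c) / len(q) if q else 0.0


def rerank_top_k(query, chunks, k=10, score=overlap_score):
    """Return the k highest-scoring chunks, best first (ties keep input order)."""
    ranked = sorted(chunks, key=lambda ch: score(query, ch), reverse=True)
    return ranked[:k]


def build_context(query, chunks, k=10):
    """Concatenate only the top-k chunks into the context string."""
    return "\n\n".join(rerank_top_k(query, chunks, k))
```

Passing `score` as a parameter keeps the selection logic separate from the relevance model, so the toy scorer can be swapped out without touching the rest.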

Related concepts.

Context-window planning is the practical face of the token budget. When a workflow grows past the budget, the answer is rarely "move to a larger window"; it is reranking, summarization, or splitting into multiple calls. See also RAG for the canonical pattern that keeps context windows small.
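Of the three options, splitting into multiple calls is the easiest to sketch. The example below greedily packs chunks, in order, into batches that each stay under a per-call token budget. The function names are hypothetical, and the 4-characters-per-token estimate is an assumption standing in for a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate assuming ~4 characters per token (not a real tokenizer)."""
    return (len(text) + 3) // 4


def split_into_batches(chunks, budget_tokens):
    """Greedily pack chunks, in order, into batches that each fit the budget."""
    batches, current, used = [], [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if cost > budget_tokens:
            raise ValueError("a single chunk exceeds the per-call budget")
        # Start a new batch when adding this chunk would overflow the current one.
        if current and used + cost > budget_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(chunk)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Each batch can then be sent as its own call, with the partial answers combined afterwards. How to combine them depends on the workload and is left out of this sketch.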


