Every AI model has a hard limit on how much information it can process at once. This limit is called the context window, and it is measured in tokens, roughly three-quarters of a word each. GPT-4 Turbo offers 128K tokens. Claude offers 200K. Some models now advertise context windows exceeding one million tokens.
The sales pitch writes itself: bigger windows mean more context, more context means better understanding, better understanding means production-ready AI. Feed it everything. Let the model sort it out.
This pitch is wrong. And organizations that bet their AI strategy on bigger context windows are setting themselves up for expensive, hard-to-diagnose failures.
What a Context Window Actually Is
Think of the context window as the AI’s working memory. It is everything the model can “see” when generating a response: the system instructions, the conversation history, the retrieved documents, the user’s question, and the space reserved for the response itself.
Anything outside the window does not exist to the model. It cannot reference it, reason about it, or even acknowledge it. The context window is not a preference. It is a physical constraint of the architecture.
In practical terms, here is what different window sizes hold:
- 8K tokens (roughly 6,000 words): A long blog post. A short contract. About 15 pages of text.
- 128K tokens (roughly 96,000 words): A full novel. A full operations manual. About 200 pages.
- 200K tokens (roughly 150,000 words): Two novels. Several department-level policy manuals combined.
- 1M+ tokens (roughly 750,000 words): Multiple books. A substantial portion of an organization’s documented knowledge.
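The sizes above all rest on the same rule of thumb: one token is roughly three-quarters of a word. A quick sketch of that conversion (a heuristic only; real tokenizers such as OpenAI's tiktoken give exact counts):

```python
def estimate_tokens(text: str) -> int:
    """Approximate token count: ~0.75 words per token, so tokens ≈ words / 0.75."""
    words = len(text.split())
    return round(words / 0.75)

# A 96,000-word operations manual lands right at a 128K-token window.
print(estimate_tokens("word " * 96_000))  # → 128000
```

The heuristic breaks down for code, tables, and non-English text, which tokenize less efficiently, so treat it as a planning estimate, not a billing calculation.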
The progression looks impressive. It isn’t enough. A mid-market company’s total operational knowledge (policies, procedures, pricing rules, customer data, compliance frameworks, exception logic, and the tribal knowledge that would surface if it were ever written down) easily exceeds tens of millions of tokens. A large enterprise operates on orders of magnitude more. No context window fits it all.
The “Lost in the Middle” Problem
Even if you could fit everything into the window, you’d hit a different wall. Research from Stanford and UC Berkeley demonstrated what practitioners had observed for years: models perform significantly worse on information placed in the middle of long contexts compared to information at the beginning or end.
This is the “lost in the middle” problem. Give a model 100 documents and ask it to find a specific fact. If that fact is in document 3 or document 98, the model finds it reliably. If it is in document 47, the model often misses it.
The implications for enterprise use are severe. A model processing a long operational context doesn’t attend to all of it equally. It gravitates toward the beginning and end. Critical information (a pricing exception, a compliance requirement, a routing rule) gets lost if it happens to land in the middle of a large context. The model doesn’t flag it as missing. It doesn’t say “I couldn’t find the relevant discount rule.” It generates a response as if the information doesn’t exist. Which, from the model’s perspective, it effectively doesn’t.
This isn’t a bug that will be fixed in the next model release. It is a fundamental characteristic of how attention mechanisms work in transformer architectures. Improvements are incremental, not transformational. Longer context windows amplify the problem because there is more “middle” to get lost in.
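One common mitigation (not a fix, and not claimed by the research above, but a widely used retrieval pattern) is to reorder retrieved documents so the most relevant ones sit at the beginning and end of the context, where attention is strongest, pushing the least relevant toward the middle. A minimal sketch:

```python
def reorder_for_attention(docs_by_relevance: list) -> list:
    """Interleave a most-relevant-first list so top documents land at
    the edges of the context and weak ones sink to the middle."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        # Alternate: even ranks go to the front, odd ranks to the back.
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

# Ranks 1-5, most relevant first: 1 ends up first, 2 ends up last,
# and the weakest document (5) lands in the middle.
print(reorder_for_attention([1, 2, 3, 4, 5]))  # → [1, 3, 5, 4, 2]
```

This rearrangement trades nothing away; it simply aligns document placement with where the model actually attends.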
Why “Just Feed It Everything” Fails
The brute-force approach (dump everything into the context window and let the model figure it out) fails for three compounding reasons.
Irrelevant context degrades performance. This is counterintuitive but well-documented. Adding more context doesn’t just slow the model down and increase cost; it actively reduces accuracy. Long-context retrieval studies have shown that models given 20 relevant documents perform better than models given the same 20 documents mixed with 80 irrelevant ones, even though the relevant information was present in both cases. The noise overwhelms the signal. The model’s attention spreads across everything in the window, diluting its focus on the information that actually matters.
Cost scales linearly with context size. Every token in the context window costs money. A 200K token prompt costs roughly 20x what a 10K token prompt costs. For a high-volume enterprise application (customer service, compliance review, financial analysis) this multiplication applies to every single request. The brute-force approach doesn’t just underperform. It underperforms expensively.
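The linear cost scaling is easy to make concrete. The per-million-token rate below is an illustrative assumption, not any provider's actual price; the 20x ratio holds at any linear rate:

```python
# Assumed input price for illustration only; real rates vary by provider.
PRICE_PER_MILLION_INPUT = 3.00

def prompt_cost(tokens: int) -> float:
    """Input cost of a single request at a linear per-token rate."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_INPUT

print(prompt_cost(10_000))   # $0.03 per request
print(prompt_cost(200_000))  # $0.60 per request: 20x the compact prompt
```

At a million requests per month, that gap is the difference between a $30,000 line item and a $600,000 one.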
Latency increases with context size. Larger contexts take longer to process. In enterprise applications where response time matters (real-time customer interactions, live agent workflows, time-sensitive compliance checks) the additional latency from processing a bloated context window can push response times past acceptable thresholds. Users notice. Workflows stall. Adoption drops.
The brute-force approach treats context as a quantity problem: just add more. The actual problem is a quality problem: deliver the right context for the right task at the right time.
The Structural Approach: Schemas + Skills
The alternative to brute force is structure. Instead of feeding the model everything and hoping it finds what it needs, you define precisely what each task requires and deliver exactly that.
Business-as-Code provides the framework. Three components replace the “dump everything in” strategy with targeted context delivery.
Schemas define the entities relevant to each task. An agent handling a pricing query receives the pricing schema (customer segments, product tiers, discount rules, approval thresholds) not the entire operations manual. The schema is compact (typically 200-500 tokens), precise (JSON Schema with defined fields and constraints), and complete for its purpose. The agent doesn’t need to hunt through 200 pages to find the discount structure. It’s right there, structured and unambiguous.
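A schema of that kind might look like the following sketch, expressed as a JSON-Schema-style dict. The field names and constraints are hypothetical illustrations, not a standard:

```python
import json

# Hypothetical pricing schema; field names and bounds are assumptions.
PRICING_SCHEMA = {
    "type": "object",
    "properties": {
        "customer_segment": {"enum": ["smb", "mid_market", "enterprise"]},
        "product_tier": {"enum": ["basic", "pro", "premium"]},
        "discount_pct": {"type": "number", "minimum": 0, "maximum": 40},
        "approval_threshold": {"type": "number"},
    },
    "required": ["customer_segment", "product_tier", "discount_pct"],
}

# Serialized, the whole schema is a few hundred characters — a few
# hundred tokens at most, versus a 200-page manual.
print(len(json.dumps(PRICING_SCHEMA)))
```

Because the fields are enumerated and bounded, the agent never has to infer what a valid discount looks like; the constraint is in the structure itself.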
Skills define the decision logic for each task. A pricing skill tells the agent exactly how to calculate the price: check customer segment, apply tier discount, evaluate volume threshold, check for promotional overrides, determine if manual review is required. The skill is typically 500-1,500 tokens of structured markdown. It replaces 20 pages of narrative policy documentation with a precise, step-by-step procedure.
Context provides the background knowledge that makes schemas and skills coherent. Industry-specific constraints, regulatory requirements, organizational structure, strategic priorities. Context is scoped to what matters for the current task; not the entire organizational context, but the slice that affects the decision at hand.
The combined context for a well-structured task is typically 2,000-5,000 tokens. That is 1-2.5% of a 200K token window. It leaves room for conversation history, retrieved data, and the model’s response, all within a compact, focused context that the model can attend to fully.
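A quick budget check makes the arithmetic visible. The component sizes below are illustrative assumptions consistent with the ranges above:

```python
# Assumed token counts for each part of a task's assembled context.
parts = {
    "schema": 400,        # pricing schema
    "skill": 1200,        # pricing decision logic
    "background": 1500,   # scoped regulatory and org context
    "history": 800,       # recent conversation turns
    "user_query": 100,
}
total = sum(parts.values())
print(total, f"= {total / 200_000:.1%} of a 200K window")  # → 4000 = 2.0%
```

Everything the model sees is relevant, and more than 98% of the window remains free for retrieved data and the response.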
Compare the two approaches:
| | Brute Force | Structured (Business-as-Code) |
|---|---|---|
| Context size | 100K-200K tokens | 2K-5K tokens |
| Relevance | 10-20% relevant | 90-100% relevant |
| Accuracy | Degrades with context size | Consistent regardless of task complexity |
| Cost per request | High (full window billed) | Low (minimal tokens) |
| Latency | Seconds (large context processing) | Sub-second (compact context) |
| Lost-in-the-middle risk | High | Minimal (context is short enough for full attention) |
| Maintenance | Rebuild document index when anything changes | Update specific schema or skill |
The structured approach doesn’t require a bigger window. It requires a smarter one.
Context Engineering in Practice
Context Engineering is the discipline that makes this work at scale. It is not a one-time setup. It is an ongoing practice of defining, maintaining, and delivering structured context to AI agents.
In practice, Context Engineering involves three activities.
Mapping. Identifying what context each task, workflow, and agent needs. Not everything: the specific schemas, skills, and background knowledge relevant to each specific function. A customer service agent needs different context than a compliance reviewer. A pricing agent needs different context than a content generation agent. Mapping prevents the “give it everything” default.
Structuring. Converting organizational knowledge from narrative documentation (wikis, policy manuals, process docs) into structured artifacts (JSON schemas, structured skills, organized context files). This is the core work of Business-as-Code. It takes implicit, ambiguous knowledge and makes it explicit, precise, and machine-readable.
Routing. Delivering the right context to the right agent at the right time. A well-architected system doesn’t load all schemas and all skills into every agent’s context window. It loads the schemas and skills relevant to the current task. This is what keeps context compact and focused, and what makes the structured approach scale across hundreds of agents and thousands of tasks.
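The routing step can be sketched as a registry that maps task types to just the artifacts they need. The task names and artifact names below are hypothetical:

```python
# Hypothetical registry: each task type lists only its required artifacts.
REGISTRY = {
    "pricing_query":  {"schemas": ["pricing"],
                       "skills": ["price_quote"]},
    "refund_request": {"schemas": ["orders", "refund_policy"],
                       "skills": ["process_refund"]},
}

def build_context(task_type: str) -> dict:
    """Assemble the compact context for one task: its schemas and
    skills only, never the whole registry."""
    entry = REGISTRY[task_type]
    return {"schemas": list(entry["schemas"]),
            "skills": list(entry["skills"])}

print(build_context("pricing_query"))
# → {'schemas': ['pricing'], 'skills': ['price_quote']}
```

Adding a new task type means adding one registry entry; existing agents' contexts never grow as the catalog does.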
The result is a system where context window size becomes irrelevant. Not because the limit doesn’t exist, but because you never approach it. Each agent operates on a compact, targeted context that fits comfortably within even modest windows, and produces better results than a model drowning in a million tokens of unstructured documentation.
The Recursive Loop keeps the system current. BUILD the context, OPERATE agents on it, LEARN from the gaps, BUILD deeper. When your pricing model changes, you update the pricing schema and skill; not a 200-page document that the model may or may not attend to. Changes propagate immediately, precisely, and without the ambiguity that comes from narrative documentation.
The context window is a real constraint. But it is a constraint on the amount of information the model can process, not on the amount of knowledge the model can access. Structured context delivery means the model processes less but knows more, because what it processes is precise, relevant, and complete for the task at hand.
Frequently Asked Questions
Is a bigger context window always better?
No. Research shows models lose accuracy on information in the middle of large contexts (the 'lost in the middle' problem). A 200K token window that's 80% irrelevant context performs worse than a 10K window with exactly the right information. Quality beats quantity.
How much context does a typical enterprise task require?
It depends on the task. A customer service response might need 5-10 relevant documents. A financial analysis might need hundreds of data points. The key is delivering the right context for each specific task, not dumping everything into the window.
What's the difference between context windows and RAG?
The context window is the model's working memory: what it can see right now. RAG (Retrieval-Augmented Generation) is a technique for filling that window with relevant information. But RAG only works if the information is structured and the retrieval is accurate. Garbage in, garbage out, even with a big window.