What Your AI Sees Matters More Than You Think
- Evan Mallory
- Apr 7
- 5 min read
The case for context engineering in clinical AI — and how Tryal Accelerator is built around it.
There's a persistent myth in the AI space: give the model more information and you'll get better answers. It sounds intuitive. It's also wrong.
A 2025 research study by Chroma tested 18 frontier language models — including GPT-4.1, Claude, and Gemini — and found that every single one performed worse as the amount of input grew. In some cases, models that scored 95% on shorter inputs saw that number collapse to 60% as the input grew longer — a steep, sudden falloff rather than a gentle slope. The researchers call this phenomenon context rot, and it has serious implications for anyone building AI-powered tools, especially in domains like clinical research where accuracy isn't optional.
Unfortunately, this finding isn't very surprising. We have been seeing context-based performance drops for years, and it validates a philosophy we've held from the beginning: the quality of AI output is determined not only by the power of the model, but also by the precision of the information you put in front of it.
The Problem With "Just Add More Data"
Most people's mental model of an LLM is something like a very fast reader — pour in documents, and it synthesizes them top to bottom. But that's not a complete picture.
LLMs process input through an attention mechanism that compares every token of the input against every other token (Vaswani et al., "Attention Is All You Need"). This is powerful, but it comes with two well-documented costs. First, computation scales quadratically: double the input length and you roughly quadruple the processing cost. Second, the model's attention is unevenly allocated. Stanford researchers found that content positioned at the start or end of an input receives far more weight than content buried in the middle, a bias known as the "lost in the middle" problem. In their experiments, accuracy dropped by over 30% simply based on where in the input the relevant information appeared.
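To make the quadratic cost concrete, here is a minimal NumPy sketch of scaled dot-product attention. It is a toy with made-up dimensions, not any production model, but it shows why the score matrix (and therefore the work) grows with the square of the input length.

```python
# Minimal scaled dot-product attention in NumPy. Toy dimensions;
# this illustrates the mechanism, not any particular production model.
import numpy as np

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d)) V, the core transformer operation."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # (n, n) matrix: every token scored against every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
for n in (512, 1024):               # double the input length...
    Q = K = V = rng.normal(size=(n, 64))
    attention(Q, K, V)
    print(f"n={n}: {n * n:,} pairwise scores")
# n=512: 262,144 pairwise scores; n=1024: 1,048,576. Twice the input, four times the work.
```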
This creates a dangerous dynamic. As you add more content to an LLM's input, you're not just adding information — you're pushing existing important information into low-attention zones and diluting the model's finite attention budget with more noise. More input fragments the model's focus.
Why Naive Approaches Fail in the Real World
This is the uncomfortable truth about many AI products currently on the market: most of them are just a wrapper around one of the big three models (ChatGPT, Gemini, or Claude). A thin wrapper can look very impressive in a short demo. You paste in a document, ask a question, and get a fluent answer. It feels magical.
But fluency is not accuracy. And in clinical research — where a misattributed finding, a hallucinated statistic, or a subtly wrong eligibility criterion can cascade into real harm — fluency without accuracy is worse than no answer at all.
The Chroma study highlights a particularly insidious aspect of context rot: the degradation isn't gradual or predictable. A model might perform flawlessly on a 50-page input and then fall apart on 55 pages, with no graceful decline in between. There's no warning. The answers look just as confident and well-structured at 60% accuracy as they do at 95%. This is what makes context rot so dangerous: you cannot tell from the output when the model has started getting things wrong.
For anyone working with real clinical data — protocols, regulatory documents, patient records, literature — this should be alarming. These aren't toy problems with neat answers. They're complex, nuanced, and densely interconnected. Exactly the kind of information that overwhelms an LLM when fed in without structure.
Context Engineering: The Discipline That Matters
The emerging field addressing this challenge is called context engineering: the discipline of curating and structuring everything an LLM sees before it produces output. That means not just the prompt, but the full informational environment surrounding it. Or, as Andrej Karpathy described it, the "delicate art and science of filling the context window with just the right information for the next step."
Context engineering stands in contrast to prompt engineering. Where prompt engineering asks "how do I phrase my question better?", context engineering asks "what does the model need to see right now, and how do I assemble all of it dynamically?" The prompt is one small piece. The surrounding infrastructure — what knowledge is retrieved, how it's structured, what's left out — is what actually determines result quality.
The research community has converged on four core strategies for managing context effectively:
Write — Save important intermediate information to external storage rather than trying to hold everything in the context window at once.
Select — Use retrieval systems to pull in only the information relevant to the current question, rather than dumping entire documents into the input (see the sketch after this list).
Compress — Summarize accumulated history and outputs to preserve essential information without consuming the full attention budget.
Isolate — Split work across specialized agents with focused, clean contexts rather than forcing one model to juggle everything.
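To make the select strategy concrete, here is a minimal sketch. The bag-of-words embed() below is a deliberately crude stand-in for a real embedding model, and the chunk texts are invented; the point is the shape of the pipeline: only the top-ranked chunks ever reach the model.

```python
# Toy "select" strategy: retrieve only the chunks relevant to the question
# instead of sending the whole document. embed() is a bag-of-words stand-in
# for a real embedding model; production systems use dense vector search.
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Crude embedding: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_context(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the top-k most relevant chunks; everything else stays out of the window."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "Inclusion criteria: adults aged 18-65 with confirmed diagnosis.",
    "Site logistics: monitoring visits occur every eight weeks.",
    "Exclusion criteria: prior treatment with study drug class.",
]
print(select_context("Which patients meet inclusion criteria?", chunks))
```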
Anthropic demonstrated the power of isolation with a multi-agent research system where a lead agent delegated sub-tasks to specialized sub-agents. The result was a 90% improvement over a single agent on research tasks — using the same underlying model family. No upgrade in model intelligence was needed; the gains came entirely from giving each agent a cleaner, more focused window of information.
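As a sketch of the isolate pattern (the general pattern, not Anthropic's actual system), here is how a lead agent might fan work out to sub-agents, each with its own clean context. The call_model() stub stands in for any LLM API.

```python
# Sketch of the "isolate" strategy: a lead agent delegates sub-tasks to
# sub-agents, each seeing only its own slice of context. call_model() is a
# stub for any LLM API; this is illustrative, not Anthropic's implementation.
from dataclasses import dataclass

def call_model(system: str, context: str, task: str) -> str:
    """Placeholder for a real LLM call."""
    return f"[answer to {task!r} from {len(context)} chars of context]"

@dataclass
class SubTask:
    question: str
    context: str  # only the information this sub-agent needs

def run_research(goal: str, subtasks: list[SubTask]) -> str:
    # Each sub-agent works in a clean, focused window.
    findings = [
        call_model("You are a focused research sub-agent.", t.context, t.question)
        for t in subtasks
    ]
    # The lead agent sees only the compressed findings, never the raw sources.
    return call_model("You are the lead agent. Synthesize the findings.",
                      "\n".join(findings), goal)

print(run_research(
    "Summarize the trial's eligibility rules",
    [SubTask("What are the inclusion criteria?", "...protocol section 4.1..."),
     SubTask("What are the exclusion criteria?", "...protocol section 4.2...")],
))
```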
How Tryal Accelerator Approaches This
At Tryal, we take a strategic approach to managing context and information. We believe the path to the highest-quality results — and the lowest hallucination rates — is exceptional control over what the AI can see at any given moment.
We achieve this through a network of information retrievers and knowledge graphs that deliver highly targeted results for each question. Rather than feeding a model an entire protocol or a full database of literature, our system identifies and presents different facets of the information to the AI platform for generation. Each query gets the specific context it needs, structured and positioned to work with the model's attention patterns rather than against them.
This is the select and isolate strategy in practice — implemented at an architectural level.
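As a minimal sketch (illustrative only, not our production pipeline), assume facets arrive as (score, text) pairs from upstream retrievers. Assembly can then deliberately place the strongest material at the start and end of the window, where the "lost in the middle" research says attention is highest.

```python
# Illustrative context assembly, not Tryal's actual pipeline: rank retrieved
# facets, then position the strongest ones at the edges of the window, since
# models weight the start and end of the input most heavily.
def assemble_context(facets: list[tuple[float, str]], budget: int = 4) -> str:
    """facets: (relevance_score, text) pairs from upstream retrievers."""
    top = sorted(facets, key=lambda f: f[0], reverse=True)[:budget]
    ordered = [""] * len(top)
    left, right = 0, len(top) - 1
    for i, (_, text) in enumerate(top):
        if i % 2 == 0:              # ranks 1, 3, ... fill from the front
            ordered[left] = text
            left += 1
        else:                       # ranks 2, 4, ... fill from the back
            ordered[right] = text
            right -= 1
    return "\n\n".join(ordered)

facets = [(0.95, "Primary endpoint definition ..."),
          (0.90, "Eligibility criteria ..."),
          (0.40, "Site staffing notes ..."),
          (0.35, "Appendix: abbreviations ...")]
print(assemble_context(facets))  # ranks 1 and 2 land at the edges
```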
Critically, the pipeline doesn't stop at generation. Results are cross-referenced and reviewed before integration into documents. This verification layer catches the kind of subtle inaccuracies that context rot produces: the ones that look right at first glance but don't hold up under scrutiny.
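As a toy illustration of the idea (not our actual review layer), a verification pass can be as simple as flagging any number in a generated draft that never appears in the source material it is supposed to summarize:

```python
# Toy verification pass: flag numbers in the draft that have no support in
# the sources. Illustrative only; a real review layer does far more.
import re

NUM = r"\d+(?:\.\d+)?%?"  # integers, decimals, optional percent sign

def unverified_numbers(draft: str, sources: list[str]) -> set[str]:
    claimed = set(re.findall(NUM, draft))
    supported = {n for s in sources for n in re.findall(NUM, s)}
    return claimed - supported

draft = "The response rate was 62% across 140 patients."
sources = ["Interim analysis: response rate 62% (n=138 patients enrolled)."]
print(unverified_numbers(draft, sources))  # {'140'} -> route to human review
```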
The Stakes Are Too High for "Good Enough"
The research is clear: when large-scale data is given unprocessed to LLMs, the quality of outputs degrades dramatically. While the models themselves continue to improve, the architectural constraints — finite attention, positional bias, statelessness — are fundamental to how transformers work. No amount of model scaling eliminates the need to carefully engineer what goes into the context window.
For clinical applications, this isn't an academic concern. A system that's 95% accurate on a demo and 60% accurate on a real protocol isn't a tool — it's a liability. And because context rot is invisible in the output, organizations relying on naive AI integrations may not discover the problem until the consequences have already materialized.
Having an approach to AI context that supports accuracy and referencing at every level isn't just about good engineering. For real-world application of these technologies in clinical research, it's the only responsible path forward.
References
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., & Polosukhin, I. "Attention Is All You Need." Advances in Neural Information Processing Systems 30 (NeurIPS), 2017.
Chroma Research. "Context Rot: How Increasing Input Tokens Impacts LLM Performance." 2025.
Liu, N. F., et al. "Lost in the Middle: How Language Models Use Long Contexts." Stanford University, 2023.
Karpathy, A. Post on context engineering. 2025.
LangChain. "Context Engineering for Agents." 2025.
Anthropic. "Building Effective Agents." 2025.