Your AI agent isn’t learning. It’s retrieving.
The frozen model problem vendors don’t put in the pitch deck
14 min read | May 2026
Vendors sell continuously improving AI agents. Procurement teams buy them. Security architects sign off. Nobody asks to see the architecture.
The architecture tells a different story. A stateless model that doesn’t change between sessions. A retrieval layer that fetches external memory on each call. Orchestration logic that stitches both together and resets when the session ends. What looks like learning is retrieval. What looks like improvement is a different prompt against the same frozen weights.
This isn’t a vendor criticism. It’s an architectural reality most procurement processes have never been designed to surface.
Key Insights
“Continuously improving” AI agents are typically frozen models re-prompted against updated external memory, not systems that learn from your environment
The model your team evaluated during procurement may not be the model currently in production, and most contracts provide no mechanism to detect the difference
Decisions made by stateless, session-resetting systems cannot be traced to a consistent reasoning system, which breaks audit traceability at the architectural level
The retrieval layer, not the model, is where most agent “improvement” actually happens, and it has no version control, provenance, or integrity controls in most deployments
A credible vendor framework for AI agents in regulated environments requires four contractual and technical controls that current procurement checklists don’t ask for
The Sharp Reframe
What people think: AI agents deployed in production environments learn from your data, improve over time, and adapt to your specific context through continuous training or fine-tuning.
What actually happens: The model is frozen. Weights don’t change between sessions. What changes is the retrieval layer: a vector store or document index that gets updated with new content. The agent appears to “know more” because it retrieves more. It does not reason differently. It does not update its weights. The frozen model processes a fresh prompt against new retrieved content and produces a new output. That is not learning. That is retrieval with better source material.
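To make the pattern concrete, here is a minimal, self-contained sketch. Nothing in it is any vendor's API: the "model" is a frozen function and the "retrieval layer" is a plain dictionary the operator can update at any time. The point is structural, not implementational: same weights, same query, different answer once the index changes.

```python
# Minimal sketch of the "improving" agent loop. Everything is a stand-in,
# not a vendor API: the model is a frozen function, the retrieval layer a dict.

FROZEN_SYSTEM_PROMPT = "You are a reporting assistant."

def frozen_model(prompt: str) -> str:
    # Stands in for a hosted LLM whose weights never change between sessions.
    return f"[answer derived only from the prompt below]\n{prompt}"

def answer(query: str, retrieval_index: dict[str, str]) -> str:
    # The retrieval layer: the only component that changes over time.
    retrieved = [text for key, text in retrieval_index.items() if query.lower() in key]
    prompt = (f"{FROZEN_SYSTEM_PROMPT}\n\nContext:\n"
              + "\n".join(retrieved)
              + f"\n\nQuestion: {query}")
    return frozen_model(prompt)

# Same query, same weights, different "knowledge" once the index is updated:
index_at_evaluation = {"reporting policy": "Policy v1: report quarterly."}
index_in_production = {"reporting policy": "Policy v3: report monthly, include AI decisions."}

print(answer("reporting policy", index_at_evaluation))
print(answer("reporting policy", index_in_production))
```

Nothing in that loop persists after the session ends. "Improvement" is a different context block fed to the same weights.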
This distinction matters at three levels.
First, governance. If the model is frozen, the thing you evaluated in procurement is not the thing producing decisions six months later, not because the model changed, but because the retrieval layer changed. Nobody logged what the retrieval layer looked like the day you signed off.
Second, auditability. A stateless system resets every session. There is no persistent reasoning state. When a regulator asks why the system made a specific decision on a specific day, the answer requires knowing which model version was active, which retrieval snapshot was queried, and which system prompt was in effect. Most deployments log the output. None of those three inputs are captured.
Third, accountability. Vendors claim improvement. The improvement happens in the retrieval layer, which the vendor controls, updates without notice, and does not version in ways your team can inspect. You approved the model. The model didn’t change. Everything around it did.
The hidden antagonist here is not the technology. It is the assumption that “the AI” is a single consistent entity you evaluated once and can rely on continuously. It is not. It is a pipeline. You approved one snapshot of that pipeline. The pipeline keeps moving.
What Everyone Is Doing
Security and procurement teams reviewing AI agent deployments are doing reasonable things. They request vendor security documentation. They run penetration tests. They review data handling agreements. They check for SOC 2 or ISO 27001 certification. They ask about encryption in transit and at rest. Some ask for model cards.
All of this is necessary. None of it surfaces the frozen model problem.
Here is what the standard checklist misses.
Model card reviews don’t capture retrieval layer changes. A model card describes the model: training data, evaluation benchmarks, known limitations. It does not describe the retrieval layer. When the vendor updates the vector store with new documents, adds new retrieval sources, or reindexes existing content, no model card changes. The model card for GPT-4o today is the same model card it was six months ago. What the agent retrieves, and therefore what it “knows,” is entirely different.
Penetration tests test the interface, not the reasoning. A pen test will find injection vulnerabilities, authentication gaps, and data exposure risks. It will not detect that the model your team evaluated in January is now being used with a system prompt that was silently updated in March. The interface is the same. The behavior has changed.
Vendor change management reviews assume discrete software releases. Traditional change management asks: was this change approved, tested, and deployed through a controlled process? For AI agents, the relevant question is: did the reasoning behavior of the system change in ways that affect the decisions it makes? Those are different questions. A retrieval layer update is not a software release. It does not go through the same change management gate. It often does not go through any gate at all.
The hidden assumption underneath all of this is that “the AI agent” is a version-controlled software artifact like any other system in your stack. It is not. It is a composition of a frozen model, a mutable retrieval layer, and orchestration logic, any one of which can change the system’s effective behavior without triggering a formal change management process.
The core insight: you cannot audit a decision made by a system you cannot reconstruct. If you cannot reproduce the exact state of the model, the retrieval snapshot, and the system prompt that were active when a decision was made, you do not have an audit trail. You have a log of outputs.
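To make "reconstruct" concrete, here is a sketch of what a decision record would need to carry so the question can even be asked later. The field names and example values are illustrative, not a standard schema or any vendor's log format.

```python
# Sketch of a decision record that would support later reconstruction.
# Field names and values are illustrative, not a standard.
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

def sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

@dataclass
class DecisionRecord:
    decision_id: str
    timestamp: str
    model_version: str                # the provider's pinned model identifier
    system_prompt_hash: str           # hash of the exact prompt text in effect
    retrieval_snapshot_id: str        # identifier of the index state queried
    retrieved_doc_hashes: list[str]   # hashes of the chunks actually injected
    input_hash: str
    output: str

record = DecisionRecord(
    decision_id="dec-000421",
    timestamp=datetime.now(timezone.utc).isoformat(),
    model_version="vendor-model-2026-01-15",      # illustrative value
    system_prompt_hash=sha256("You are a reporting assistant."),
    retrieval_snapshot_id="index-snapshot-9f3a",  # illustrative value
    retrieved_doc_hashes=[sha256("Policy v3: report monthly.")],
    input_hash=sha256("Should transaction 84411 be flagged?"),
    output="Flagged for review under policy section 4.2.",
)
print(json.dumps(asdict(record), indent=2))
```

A log of outputs is the last field of that record. The audit question needs the rest.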
The Moment I Saw It
I was three hours into a vendor due diligence review for a compliance team at a regulated financial institution. The vendor was presenting an AI agent for regulatory reporting. The demo was clean. The architecture diagram was well-organized. The team had clearly built something real.
We were working through the change management section of the review when I asked the question that changed the conversation.
“What changes when you do a model update — do our existing test cases still pass, or do we re-run validation?”
The vendor’s technical lead paused. Then: “Model updates are handled on our side. You’d get a release note.”
I followed up: “What’s in the release note? Does it include benchmark comparisons against the previous version? Does it specify which retrieval sources changed?”
A longer pause this time. “It would describe the changes at a high level. Detailed benchmark comparisons aren’t part of our standard release process.”
Then the compliance lead on the client side asked the question she had clearly been sitting on: “So if regulators ask us to demonstrate that the system’s behavior is consistent with what we approved in procurement, what evidence do we produce?”
The room went quiet in the way rooms go quiet when everyone realizes simultaneously that nobody had thought through this specific question.
What the vendor had built was technically sound. The model was a frontier model from a major provider. The retrieval layer was well-designed. The orchestration was solid. None of that was the problem.
The problem was that the client had approved a system based on observed behavior during evaluation. That behavior was the product of a specific model version, a specific retrieval snapshot, and a specific system prompt, none of which were contractually frozen, technically versioned, or operationally logged. The thing they evaluated and the thing in production were the same vendor product. They were not the same system.
The uncomfortable part was realizing this wasn't unique to one system. Every AI agent deployment I had reviewed in the preceding twelve months had the same gap. The procurement process was designed to evaluate a software product. The thing being procured was a pipeline with mutable components. The two are not the same, and the standard due diligence framework had not caught up.
Why This Is Different
Most people think AI agents are like other enterprise software: a version-controlled artifact that you evaluate, approve, deploy, and update through a controlled process. The comparison misses the fundamental architecture.
Traditional enterprise software:
Deterministic: same input produces same output
Version-controlled: every change is a discrete, auditable release
Testable: a regression suite verifies behavior before deployment
Stable between updates: behavior does not change unless a release is deployed
Auditable: a log entry points to a specific software version
AI agent deployed in production:
Probabilistic: same input produces statistically similar but not identical outputs
Multi-component: model, retrieval layer, system prompt, and orchestration logic each version independently
Partially testable: benchmarks measure average behavior, not individual decision accuracy
Variable between sessions: retrieval layer updates change effective behavior without a software release
Partially auditable: logs capture outputs, not the full system state that produced them
The fundamental shift is this: in traditional software, the system is the software. In an AI agent, the system is the composition of components, and the composition changes in ways that software version control was not designed to capture.
This affects procurement in three specific ways.
Evaluation validity degrades over time. The model your team tested behaves consistently. The retrieval layer it queries does not. Six months after procurement, the agent may be retrieving from updated document sets, new knowledge bases, or modified vector embeddings. The evaluation you ran is no longer a valid representation of current system behavior, but nothing in your procurement process flagged this.
Change management applies to the wrong layer. Your change management process governs software releases. Retrieval layer updates, system prompt modifications, and orchestration logic changes may not trigger a formal release. They change system behavior without going through the gate designed to catch behavior changes.
Incident response requires state you don’t have. When an AI agent produces a harmful or incorrect output in a regulated environment, incident response requires reconstructing what the system was doing when it failed. That requires the model version, the retrieval snapshot, the system prompt, and the session context. Most deployments log the output. The inputs to the reasoning process are either not logged or not retained in a form that supports reconstruction.
The reader who thinks this doesn’t apply to them is running a deployment where either: the model is open-source and self-hosted with version control over every component, the vendor has contractually committed to versioning the full pipeline and providing reconstruction capability, or the system is not used in contexts where decisions need to be auditable. For every other deployment, this gap exists right now.
The Framework
The Four Verification Gaps. Each gap represents a point in the agent pipeline where procurement, governance, and audit assumptions break against architectural reality.
Layer L5: Reasoning and memory
What it breaks: Retrieval layer updates change the effective knowledge state of the agent without triggering version control, model card updates, or change management reviews, meaning the system behavior at decision time cannot be reconstructed from logged outputs alone.
Example: A regulatory reporting agent is deployed after a three-month evaluation. During evaluation, the retrieval layer indexes internal policy documents from Q3. By Q1 of the following year, the vendor has reindexed with updated regulatory guidance and removed several legacy documents. The agent’s responses to identical queries differ materially. No change management ticket was raised. The evaluation results are no longer valid, but nothing in the governance framework flagged the drift.
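A sketch of how that drift could be made visible, under two assumptions: you stored hashes of what the retrieval layer returned for a fixed query set during evaluation, and you can replay those queries against the production index. Names like retrieve_now are stand-ins for however you query the deployed system.

```python
# Sketch: detect retrieval-level drift by replaying evaluation queries and
# comparing what the index returns now against what it returned at sign-off.
import hashlib

def chunk_hashes(chunks: list[str]) -> set[str]:
    return {hashlib.sha256(c.encode("utf-8")).hexdigest() for c in chunks}

def retrieval_drift(evaluation_baseline: dict[str, set[str]],
                    retrieve_now) -> dict[str, dict[str, int]]:
    """evaluation_baseline: query -> hashes of chunks retrieved at evaluation.
    retrieve_now: callable(query) -> list[str] against the production index."""
    report = {}
    for query, baseline in evaluation_baseline.items():
        current = chunk_hashes(retrieve_now(query))
        report[query] = {
            "unchanged": len(baseline & current),
            "dropped": len(baseline - current),  # e.g. legacy docs removed
            "new": len(current - baseline),      # e.g. reindexed guidance
        }
    return report
```

If the "dropped" and "new" counts are nonzero for queries your evaluation covered, the system answering those queries today is not the system you evaluated, whether or not a change ticket exists.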
Layer L7: Egress and observability
What it breaks: Session-resetting stateless architecture means decisions cannot be traced to a consistent reasoning system. Logs capture outputs but not the full pipeline state, making regulatory reconstruction of any individual decision operationally impossible.
Example: A compliance agent flags a transaction for review. The transaction is later disputed. The audit team needs to demonstrate that the flagging decision was consistent with the policy framework in effect at the time. They have the output. They do not have the model version, the retrieval snapshot, or the system prompt that were active when the decision was made. The audit trail exists. It does not support the question being asked.
Threat and Playbook Map
Threat: Vector Store Poisoning
The retrieval layer is where “learning” actually happens in most deployed agents. It is also the layer with the fewest integrity controls.
How this plays out in a real system:
Entry: the vendor updates the vector store used by the production agent, adding new documents, reindexing existing content, or modifying embedding parameters, without triggering a formal change management process
Escalation: the agent begins retrieving different content for the same queries, producing different outputs, including in edge cases that the original evaluation did not cover
The miss: change management reviews are scoped to software releases. Retrieval layer updates are treated as content updates, not behavioral changes. No review gate fires.
Impact: the agent’s effective behavior has changed in ways that were not tested, approved, or communicated to the operating organization. Decisions made after the update reflect a system state that was never evaluated.
Audit blind spot: the output log shows a decision was made. It does not show that the retrieval state at decision time differed from the retrieval state at evaluation time. The gap is invisible in standard logging.
Playbooks that surface the gaps:
AI Agent Vector Store Integrity Playbook: surfaces whether retrieval sources are versioned, whether embeddings are content-addressed, and whether the organization has visibility into what the agent is retrieving at decision time
AI Agent Context Supply Chain Playbook: surfaces whether retrieved content is validated for provenance and integrity before being injected into the model context, and whether the retrieval pipeline has any controls analogous to a software dependency manifest
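As an illustration of what "content-addressed" and "dependency manifest" could look like in practice, here is a sketch of a retrieval lockfile. The field names and the embedding model identifier are assumptions, not a standard format.

```python
# Sketch: a content-addressed manifest for the retrieval corpus, analogous to
# a dependency lockfile. Field names are illustrative.
import hashlib
import json

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def build_retrieval_manifest(documents: dict[str, bytes],
                             embedding_model: str) -> dict:
    """documents: source_id -> raw bytes of the content that will be indexed."""
    doc_hashes = {source_id: sha256(content)
                  for source_id, content in sorted(documents.items())}
    # The snapshot id changes if any document, any source, or the embedding
    # model changes, i.e. whenever the agent's effective knowledge changes.
    snapshot_id = sha256(json.dumps(
        {"docs": doc_hashes, "embedding_model": embedding_model},
        sort_keys=True).encode("utf-8"))
    return {"snapshot_id": snapshot_id,
            "embedding_model": embedding_model,
            "documents": doc_hashes}

manifest = build_retrieval_manifest(
    {"policy/reporting.md": b"Policy v3: report monthly."},
    embedding_model="text-embedding-example-v2",  # illustrative name
)
print(manifest["snapshot_id"][:16])  # pin this id in change management and decision logs
```

The value is not the hashing itself. It is that a reindex, a removed document, or a swapped embedding model produces a new snapshot id, which gives change management something concrete to gate on.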
Threat: Audit Trail Evasion
Not adversarial evasion in this context. Architectural evasion: the system is designed in a way that makes complete audit reconstruction structurally impossible with standard logging.
How this plays out in a real system:
Entry: the AI agent is deployed with standard application logging: inputs, outputs, timestamps, user IDs
Escalation: over time, the model version, retrieval layer, and system prompt each change independently. Each change is individually minor. The cumulative behavioral drift is significant.
The miss: audit review is triggered by a disputed decision. The audit team queries the log. They have the output. The log does not capture the model version, the retrieval snapshot, or the system prompt active at decision time. The three inputs required for reconstruction are absent.
Impact: the organization cannot demonstrate that the decision was produced by the system they approved. They cannot rule out that a retrieval update or system prompt change introduced a behavior the original evaluation would not have passed.
Audit blind spot: the log looks complete. Timestamps, outputs, session IDs are all present. The incompleteness is structural: standard application logging was not designed to capture the multi-component state of a probabilistic reasoning pipeline.
Playbooks that surface the gaps:
AI Agent Audit Trail Playbook: surfaces whether decision logs include model version, retrieval snapshot reference, and system prompt hash alongside outputs, and whether the log format supports regulatory reconstruction queries
AI Agent Evidence Capture Playbook: surfaces whether the organization can produce a complete evidence bundle for any individual decision, including the full pipeline state at time of decision, not just the output
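A sketch of the evidence-capture question turned into a check you can run against your own logs today. The required field names mirror the illustrative decision record sketched earlier; a real logging stack will use different ones.

```python
# Sketch: check whether an existing decision log could actually support
# reconstruction. Field names are illustrative assumptions.

REQUIRED_FOR_RECONSTRUCTION = (
    "model_version",
    "system_prompt_hash",
    "retrieval_snapshot_id",
    "retrieved_doc_hashes",
    "output",
)

def reconstruction_gaps(log_entries: list[dict]) -> dict[str, int]:
    """Counts, per required field, how many logged decisions are missing it."""
    gaps = {field: 0 for field in REQUIRED_FOR_RECONSTRUCTION}
    for entry in log_entries:
        for field in REQUIRED_FOR_RECONSTRUCTION:
            if not entry.get(field):
                gaps[field] += 1
    return gaps

# A typical "complete-looking" application log fails every field except output:
print(reconstruction_gaps([
    {"timestamp": "2026-03-02T10:15:00Z", "user_id": "u-77",
     "input": "flag txn 84411?", "output": "Flagged for review."},
]))
```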
Why Reviews Miss This
Traditional AI vendor reviews are built on a software procurement mental model. That model has four standard questions: what data does the system process, how is it secured, how is it updated, and who is responsible when it fails.
All four questions are reasonable. None of them surface the frozen model problem.
What they check:
Data processing agreements and data residency
Security certifications (SOC 2, ISO 27001)
Penetration test results
Model card and bias evaluation documentation
Incident response and SLA terms
Why this fails:
Data processing agreements cover data in transit and at rest. They do not cover the retrieval layer’s content composition. A vendor can update what the agent retrieves from your data without triggering a data processing amendment, because the data itself hasn’t changed. How it’s indexed and retrieved has.
Security certifications verify controls around the infrastructure. They do not verify that the system’s reasoning behavior is consistent with what was evaluated. A SOC 2 Type II report tells you the vendor has access controls and change management processes. It does not tell you those processes apply to retrieval layer updates.
Model cards describe the base model. They are static documents. Most are not updated when the retrieval layer changes, because the model didn’t change. The model card for the system you’re reviewing today is likely identical to the model card from the evaluation period, even if the system’s effective behavior has drifted materially.
What the standard lens sees: a vendor with documented controls, a certified infrastructure, and a model with known characteristics.
What the architectural lens reveals: a pipeline with three mutable components, only one of which (the base model) is covered by the documentation you reviewed. The retrieval layer and system prompt are operationally invisible to standard procurement review.
Why This Matters
Procurement decisions expire without a mechanism to detect it
The evaluation you ran established that the system behaved acceptably against your requirements at a point in time. Once deployed, the retrieval layer continues to evolve. Without a contractual versioning requirement and a technical mechanism to detect behavioral drift, your evaluation result has a shelf life. You don’t know how long.
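One lightweight way to put a number on that shelf life is to keep the evaluation prompt set and replay it on a schedule. The sketch below assumes you retained the accepted evaluation outputs and can call the production agent directly (agent_call is a stand-in); it only surfaces drift on prompts the probe set happens to cover, and the similarity threshold needs tuning because outputs are probabilistic.

```python
# Sketch: output-level drift check. Replays the evaluation prompt set against
# the production agent and flags answers that diverge from the accepted baseline.
import difflib

def drift_report(baseline: dict[str, str], agent_call,
                 threshold: float = 0.85) -> list[dict]:
    """baseline: prompt -> output accepted during procurement evaluation.
    agent_call: callable(prompt) -> current production output."""
    flagged = []
    for prompt, accepted_output in baseline.items():
        current_output = agent_call(prompt)
        similarity = difflib.SequenceMatcher(
            None, accepted_output, current_output).ratio()
        if similarity < threshold:
            flagged.append({"prompt": prompt,
                            "similarity": round(similarity, 3),
                            "current_output": current_output})
    return flagged
```

A nonempty report does not prove the system got worse. It proves the system you are running is not the system you evaluated, which is the fact your procurement decision silently depends on.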
Regulated decisions require reconstruction capability you may not have
In financial services, healthcare, and other regulated sectors, decisions made by AI systems are subject to challenge. Regulators, auditors, and litigants can ask why a specific decision was made. The answer requires the complete pipeline state at decision time. If your logging architecture captures outputs but not inputs to the reasoning process, you cannot answer that question. Not “it would be difficult.” Structurally impossible.
Vendor improvement claims are unverifiable without pipeline versioning
When a vendor tells you the agent has improved since deployment, there is no standard mechanism to verify that claim, understand what changed, or assess whether the change affected decision quality in your specific use case. Improvement could mean the retrieval layer was updated with better source material. It could mean the system prompt was modified. It could mean the base model was quietly swapped. Without pipeline versioning, all three are operationally identical from your side.
Silent behavioral drift is not in most incident response playbooks
Incident response processes are built around discrete events: a breach, an outage, a data loss. Silent behavioral drift in an AI agent, where the system gradually produces different outputs as the retrieval layer evolves, does not trigger any of those categories. It is invisible until a specific decision is challenged. By then, the state of the system at decision time may be unrecoverable.
When you map pipeline state explicitly: you identify which component changed and when, you add retrieval versioning to your change management scope, you design for decision reconstruction from the start, and you build vendor contracts that make behavioral versioning a delivery requirement rather than a nice-to-have.
Objection Handling
You might ask: isn’t this solved by using open-source models you self-host, where you control every component?
Yes, partly. Self-hosting a versioned open-source model removes one mutable component from the vendor’s control. You own the model weights. You can freeze them, version them, and reconstruct the model state for any historical decision.
But the retrieval layer remains. Unless you also version-control your vector store, snapshot your embeddings at decision time, and log the retrieval results alongside the model output, you still cannot fully reconstruct a decision. The model is only one of the three inputs to the reasoning pipeline. Controlling it is necessary. It is not sufficient.
The stronger objection is operational: full pipeline state logging is expensive. Retrieval snapshots at decision time require storage, indexing, and a log format that most observability stacks weren’t designed for. This is true. The cost is real.
The response is not “always do this for every AI deployment.” The response is: the cost of full pipeline state logging should be weighed against the regulatory and operational consequence of not being able to reconstruct a decision that gets challenged. In low-stakes consumer applications, the cost-benefit calculation probably doesn’t support it. In regulated financial services, healthcare, or any context where AI decisions are subject to audit or legal challenge, the calculation runs the other way.
The question is not whether you can afford to log pipeline state. It is whether you can afford not to, given the decisions the system is making.
The Reality Check
An AI agent is not a system. It is a pipeline. You approved one snapshot of that pipeline.
The Four Verification Gaps exist at four points in every deployed agent:
Model version: the base model is frozen, but the version active at decision time may not be logged
Retrieval state: the vector store evolves without version control, and the retrieval snapshot at decision time is typically not captured
System prompt: orchestration logic and system prompts change without triggering model card updates or formal change management reviews
Pipeline reconstruction: standard application logging captures outputs. Regulatory reconstruction requires the full pipeline state. Most deployments cannot produce it.
Most procurement reviews ask:
“Is the model documented and evaluated?”
“Does the vendor have security certifications?”
“Is data handled according to our processing agreement?”
Those are necessary. They are not sufficient.
The questions that surface the verification gaps are:
“What contractual mechanism freezes the pipeline state we evaluated?”
“How do we detect behavioral drift between the evaluated and production system?”
“Can you demonstrate that a historical decision can be fully reconstructed from your logs?”
If a vendor cannot answer the third question with a technical demonstration, the audit trail you think you have is a log of outputs. That is not the same thing.
This is the part most architectures never get reviewed on.
References
Bommasani, R. et al. (2021). “On the Opportunities and Risks of Foundation Models.” Stanford CRFM. https://arxiv.org/abs/2108.07258
Lewis, P. et al. (2020). “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” NeurIPS 2020. https://arxiv.org/abs/2005.11401
NIST (2023). “Artificial Intelligence Risk Management Framework (AI RMF 1.0).” https://airc.nist.gov/RMF
Mitchell, M. et al. (2019). “Model Cards for Model Reporting.” ACM FAccT 2019. https://arxiv.org/abs/1810.03993
European Parliament (2024). “EU AI Act, Article 13: Transparency and Provision of Information to Deployers.” https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689
Anthropic (2024). “Claude Model Card.” https://www.anthropic.com/claude-model-card
UK FCA (2024). “AI Update: Principles for the Responsible Development and Deployment of AI.” https://www.fca.org.uk/publications/feedback-statements/fs24-1-artificial-intelligence
Weidinger, L. et al. (2022). “Taxonomy of Risks posed by Language Models.” ACM FAccT 2022. https://arxiv.org/abs/2112.04359
OWASP (2025). “OWASP Agentic Security Initiative: Top 10 for Agentic AI.” https://owasp.org/www-project-top-10-for-large-language-model-applications/
Perez, F. & Ribeiro, I. (2022). “Ignore Previous Prompt: Attack Techniques For Language Models.” NeurIPS 2022 ML Safety Workshop. https://arxiv.org/abs/2211.09527
EBA (2024). “Report on the Use of Artificial Intelligence in the Banking Sector.” European Banking Authority. https://www.eba.europa.eu/regulation-and-policy/innovation-and-fintech/artificial-intelligence
👉 AI Agent Posture Playbooks: 30+ structured assessments to map where your agent controls were built for humans, not agents. Self-directed. No vendor cycle.
👉 Read the agentic security news. Instantly analyze the threat vector, see if it applies to your setup, and find the gaps with our interactive playbook. All free.
👉 Follow me on LinkedIn | X | Substack for weekly analysis of real agent failures, control gaps, and what the frameworks are and are not catching.