
Evidence-Governed RAG in Healthcare: Why Persly Controls Retrieved Evidence Instead of Skipping It
RAG (Retrieval-Augmented Generation) has become the de facto standard architecture for LLM-based AI. In healthcare especially, it is treated as a baseline design choice for reducing hallucinations and incorporating up-to-date information.
Yet most RAG systems share an implicit assumption:
Always perform retrieval for every query and pass the results directly to the generation step.
Persly goes one step deeper than this assumption.
Is retrieval always necessary?
Or more precisely,
Are all retrieved documents always safe and appropriate?
Persly's answer is not "let's reduce retrieval," but rather:
Let's govern the retrieved evidence.
Why Is Always-On Retrieval a Problem?
RAG is generally understood as a technique that improves accuracy. In the medical domain, however, something matters more than "whether retrieval was performed."
Was the retrieved document appropriate?
1. Conflicting Standards Across Documents (Retrieval-Induced Inconsistency)
For example,
"What is the maximum daily dose of acetaminophen for adults?"
The retrieval results may contain a mix of:
- Differing national guidelines
- Pediatric dosing documents
- Outdated guidelines
The LLM may attempt to synthesize these and produce an answer that blends conflicting standards or over-qualifies its response with excessive conditions.
2. Plausible but Inaccurate Documents (Retrieval-Induced Hallucination)
There is no guarantee that top-ranked documents are always high quality.
- Blog-style health articles
- Outdated guidelines
- Information applicable only to specific patient populations
- Documents from a different jurisdiction
Because LLMs tend to treat retrieved documents as strong evidence when generating answers, a single inappropriate document can contaminate the entire response.
3. Excessive Context Distorts Medical Reasoning
"I think the diarrhea is caused by a medication I recently started taking."
The core of this question is triage. But if lengthy guidelines or statistical data are injected directly, the model may generate an unnecessarily complex differential diagnosis.
In healthcare, over-explanation can increase patient anxiety or undermine trust.
Persly's Approach: We Don't Skip Retrieval. We Govern It.
Persly does not choose an architecture that skips retrieval.
Instead, we design the following structure:
Query → Retrieve → Evidence Policy → Generate
Retrieval is always performed.
But retrieved documents are not passed directly to the generation step.
An Evidence Governance Layer sits in between.
In production, Persly also includes a post-generation verification step that re-evaluates the full answer. However, this post focuses specifically on the retrieval and evidence governance structure, so we omit the details of that downstream stage.
Additionally, filtering out documents that are not directly relevant to the question is not just a matter of accuracy—it is significantly more efficient in terms of token consumption as well.
Including unnecessary documents in the LLM context increases cost and latency while simultaneously raising the risk of reasoning distortion. Persly's Evidence Policy therefore serves a dual role: governing safety and coherence while structurally optimizing compute resource usage.
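The Query → Retrieve → Evidence Policy → Generate flow described above can be sketched as follows. This is a minimal illustration, not Persly's implementation: the function names, the `Document` type, the trivial trusted-source filter, and the sample corpus are all hypothetical stand-ins.

```python
# Minimal sketch of an evidence-governed RAG flow.
# All components here are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    text: str
    score: float  # retriever relevance score

def retrieve(query: str) -> list[Document]:
    # Stand-in: a real system would query a vector index or search API.
    return [
        Document("guideline-2024", "Adult acetaminophen dosing guideline ...", 0.91),
        Document("blog-post", "A health blog with unverified dosing claims ...", 0.88),
    ]

def evidence_policy(query: str, docs: list[Document]) -> list[Document]:
    # Governance layer: decide which retrieved documents may reach generation.
    # Trivial placeholder rule; the real policy is a trained model.
    trusted = {"guideline-2024"}
    return [d for d in docs if d.doc_id in trusted]

def generate(query: str, evidence: list[Document]) -> str:
    # Stand-in for the LLM call, grounded only in the governed evidence.
    return f"[answer to {query!r} grounded in {len(evidence)} document(s)]"

def answer(query: str) -> str:
    # Query -> Retrieve -> Evidence Policy -> Generate
    docs = retrieve(query)
    evidence = evidence_policy(query, docs)
    return generate(query, evidence)
```

The key structural point is that `generate` never sees the raw retrieval results: the policy sits between the two stages, so dropping a contaminating document also shrinks the context passed to the LLM.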
Evidence Governance Layer
What Model Are We Training?
We do not train a new Foundation LLM for this purpose. What we train is an Evidence Policy:
- Input: query, retrieved document set
- Output: refined document set
- Objective: decide which documents to use and which to discard
In other words, the model Persly builds is not one that decides "whether to retrieve,"
but one that:
Learns the net benefit of retrieved evidence.
How Is the Evidence Policy Trained?
Persly's Evidence Policy is not a simple relevance model. We train it to jointly assess whether a retrieved document is:
- Safe
- Up to date
- Applicable to the patient's condition
- Consistent with other retrieved documents
Training is organized into three stages.
Stage 1: Initial Labeling via Weak Supervision
We first generate initial labels for document inclusion using rule-based signals.
Examples:
- Drop documents that are outdated based on publication date
- Penalize low-authority sources
- Drop documents with jurisdiction mismatch
- Drop population-specific documents matched to general queries
This stage is imperfect, but it provides a stable teacher signal at scale.
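Rules of this kind can be sketched as a small labeling function. The field names, the five-year freshness window, the jurisdiction value, and the source-tier scheme below are illustrative assumptions, not Persly's actual rules.

```python
# Hedged sketch of Stage 1 weak supervision: rule-based include/drop labels.
from datetime import date

def weak_label(query_population: str, doc: dict, today: date = date(2025, 1, 1)) -> int:
    """Return 1 (include) or 0 (drop) based on rule-based signals."""
    # Drop outdated documents (assumed 5-year freshness window).
    if (today - doc["published"]).days > 5 * 365:
        return 0
    # Drop jurisdiction mismatches (assumed target jurisdiction: US).
    if doc["jurisdiction"] != "US":
        return 0
    # Drop population-specific documents matched to general queries.
    if query_population == "general" and doc["population"] != "general":
        return 0
    # Drop low-authority sources (blogs, forums).
    if doc["source_tier"] == "low":
        return 0
    return 1

labels = [
    weak_label("general", {"published": date(2024, 6, 1), "jurisdiction": "US",
                           "population": "general", "source_tier": "high"}),
    weak_label("general", {"published": date(2012, 1, 1), "jurisdiction": "US",
                           "population": "general", "source_tier": "high"}),  # outdated
    weak_label("general", {"published": date(2024, 6, 1), "jurisdiction": "US",
                           "population": "pediatric", "source_tier": "high"}),  # mismatch
]
```

The resulting labels are noisy, but they can be produced for millions of (query, document) pairs without human annotation, which is what makes this stage useful as a teacher signal.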
Stage 2: Supervised Document Classifier
Next, we train a model that takes (query, document) pairs as input and predicts whether a given document should be used in the generation step.
Input: a (query, document) pair
Output: a binary decision: include the document in the generation context, or exclude it
This model is designed to capture not just semantic relevance, but also:
- Applicability
- Safety alignment
- Freshness
- Context match
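As a toy illustration of what this classifier learns, the sketch below trains a plain logistic regression over hand-crafted features covering those four signals. The features, the training data, and the model itself are stand-ins; a production classifier would fine-tune a neural encoder on the weak labels from Stage 1.

```python
# Illustrative Stage 2 sketch: score a (query, document) pair for inclusion
# using features beyond semantic relevance. All data here is made up.
import math

def features(query: dict, doc: dict) -> list[float]:
    return [
        doc["relevance"],                                           # semantic relevance
        1.0 if doc["population"] == query["population"] else 0.0,   # applicability
        1.0 if doc["source_tier"] == "high" else 0.0,               # safety alignment
        1.0 / (1.0 + doc["age_years"]),                             # freshness
    ]

def train_logistic(X, y, epochs=200, lr=0.5):
    # Plain gradient-descent logistic regression (no external dependencies).
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi  # gradient of log-loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x) -> float:
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

query = {"population": "general"}
train_docs = [  # (document, weak label from Stage 1)
    ({"relevance": 0.9,  "population": "general",   "source_tier": "high", "age_years": 1},  1),
    ({"relevance": 0.9,  "population": "pediatric", "source_tier": "low",  "age_years": 12}, 0),
    ({"relevance": 0.8,  "population": "general",   "source_tier": "high", "age_years": 2},  1),
    ({"relevance": 0.85, "population": "general",   "source_tier": "low",  "age_years": 10}, 0),
]
X = [features(query, d) for d, _ in train_docs]
y = [label for _, label in train_docs]
w, b = train_logistic(X, y)
```

Note that both excluded documents are highly relevant semantically; they are rejected on applicability, freshness, and source-quality grounds, which is exactly the distinction this stage is meant to capture.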
Stage 3: Answer-Level Counterfactual Calibration
The final stage measures the actual net benefit of each document.
The core idea is simple:
Does the answer improve when we include this document versus when we exclude it?
Procedure:
- Retrieve the document set D for query q
- LLM(q, D) → answer_full
- LLM(q, D \ {dᵢ}) → answer_minus_i
- Compare the two answers on accuracy and safety criteria
If removing a specific document improves the accuracy or safety of the answer, that document is classified as harmful evidence.
Through this process, Persly learns not just relevance, but:
The net benefit of each document.
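The leave-one-out procedure above can be sketched in a few lines. Here `answer_quality` is a hypothetical stand-in for the accuracy-and-safety grader applied to generated answers, and the document set and scores are fabricated for illustration.

```python
# Sketch of Stage 3: per-document net benefit via leave-one-out comparison.
def answer_quality(query: str, docs: frozenset) -> float:
    # Toy scorer: the guideline improves the answer, the blog post degrades it.
    # A real system would grade the generated answer on accuracy and safety.
    score = 0.5
    if "guideline" in docs:
        score += 0.4
    if "blog" in docs:
        score -= 0.3
    return score

def net_benefit(query: str, docs: set) -> dict:
    # For each document d_i: quality(q, D) - quality(q, D \ {d_i}).
    full = answer_quality(query, frozenset(docs))
    return {d: full - answer_quality(query, frozenset(docs - {d})) for d in docs}

benefits = net_benefit("max daily acetaminophen dose?", {"guideline", "blog"})
harmful = [d for d, gain in benefits.items() if gain < 0]  # negative net benefit
```

A document with negative net benefit is precisely the "harmful evidence" described above: the answer gets better when it is removed, even though it was retrieved with high relevance.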
What It Means to Treat Retrieval as an Expert
The reason Persly calls retrieval an "expert" is straightforward:
- The LLM is a parametric expert
- Retrieval is a non-parametric expert
- The Evidence Policy is a governance layer that controls experts
In healthcare, something matters more than "more documents":
Which documents should be used?
Retrieval is a powerful knowledge augmentation tool, but it can also become a source of contamination.
Persly therefore does not try to reduce retrieval.
Instead,
Persly manages retrieval.
Why Persly Pursues This Direction
Evaluating healthcare agents goes far beyond simple accuracy.
- Reflecting the latest guidelines
- Maintaining jurisdictional consistency
- Verifying applicability to specific patient populations
- Preventing evidence overload
- Managing latency
- Controlling cost
Choosing not to retrieve is difficult to justify in a healthcare setting.
But governing retrieved evidence is structurally safe.
Persly treats retrieval as another expert and builds a structure where a policy layer governs what that expert is allowed to say.
This is the Evidence-Governed RAG that Persly is building.
Ready to try Persly?
Get trustworthy medical answers backed by verified sources. Persly helps patients and caregivers navigate health information with confidence.