The Life Sciences AI Handbook: Steering Frontier Models in Biology

Tegomoh, Bryan

Translational Evidence and Failure Modes

Published

July 7, 2026

The failure pattern in therapeutic AI is not only model error. It is also the mismatch between model endpoint, biological mechanism, assay system, and program decision. A model that improves a proxy endpoint can harm the program if the proxy is poorly linked to disease biology or developability. PoseBusters (Buttenschoen et al., 2024) is the clearest current example: low-RMSD docking poses from deep learning methods routinely fail basic physical-chemistry validity checks. The same lesson applies across target identification, small-molecule generation, biologics, and trial design. Failure analysis belongs near the start of the workflow, not after candidate nomination.

Learning Objectives

Use this chapter to:

Explain why AI-supported discovery still fails when molecular, biological, toxicological, clinical, or operational evidence does not translate.
Recognize that failure is often a mismatch between benchmark success and the decision that matters next: target, assay, molecule, toxicology, trial, or patient subgroup.

Prerequisites: Evaluation Principles for Life Sciences AI for the prospective-validation discipline; Clinical Trial AI for Translational Research for the trial-evidence context.

Chapter Summary (TL;DR)

Summary: AI-supported discovery still fails when molecular, biological, toxicological, clinical, or operational evidence does not translate. The field has clear examples of benchmark artifacts and validity gaps, but reliable early prediction of clinical success remains beyond current evidence.

Key point: Failure is often a mismatch between benchmark success and the decision that matters next: target, assay, molecule, toxicology, trial, or patient subgroup. Open question: whether earlier validity checks can predict which programs should stop, continue, or change course.

Bottom line: Failure analysis ties together every domain in the handbook because every AI claim eventually faces biological, translational, or institutional selection.

Field Guide

What is this field trying to solve? It tries to explain why AI-supported discovery still fails when molecular, biological, toxicological, clinical, or operational evidence does not translate.

What is the core idea? Failure is often a mismatch between benchmark success and the decision that matters next: target, assay, molecule, toxicology, trial, or patient subgroup.

What is the current state of the field? The field has clear examples of benchmark artifacts and validity gaps, but reliable early prediction of clinical success remains beyond current evidence.

What do we know, and what remains open? Known reference points include PoseBusters, MoleculeNet, Ahlmann-Eltze perturbation benchmarks, translational attrition studies, negative data, clinical-trial readouts, and postmortem analyses. What remains open is whether earlier validity checks can predict which programs should stop, continue, or change course.

Why does this matter? Failure analysis ties together every domain in the handbook because every AI claim eventually faces biological, translational, or institutional selection.

Introduction

PoseBusters is a useful example beyond its narrow docking domain because it shows how a familiar metric can hide invalid outputs (Buttenschoen et al., 2024). The same pattern appears in target selection, small molecules, biologics, and trials: model success depends on the decision that follows the score. The critique of deep perturbation predictors by Ahlmann-Eltze et al. (2025) carries an adjacent lesson: comparing only against other deep methods inflates apparent gains and hides cases where a simple baseline matches or beats them.

Drug development already has high attrition before AI enters the workflow. Historical analyses of clinical development success rates and pharmaceutical attrition show that failures concentrate around efficacy, safety, dose, and commercial viability rather than around candidate-generation speed alone (Hay et al., 2014; Waring et al., 2015). A recent Nature Medicine review of AI in drug development is useful because it spans target discovery, molecule design, trial operations, safety, and lifecycle work, making the same point across the pipeline: evidence standards change with context of use (Zhang et al., 2025). Scannell and colleagues’ analysis of declining R&D efficiency is a reminder that better tools can coexist with worse aggregate productivity when the wrong decisions advance (Scannell et al., 2012).

Where AI accelerates the wrong decision

AI adds the most value when it changes a decision that was previously slow, expensive, or poorly informed. It adds risk when it makes a weak proxy look decisive. Target scores can advance a target with no tractable modality. Docking scores can advance a chemically invalid pose. Morphology profiles can advance a compound that changes cell state without useful mechanism. Trial-matching systems can increase recruitment while worsening representativeness if eligibility and access are not reviewed.

The diagnostic question is simple: what action changed because of the model output, and what evidence says that action should change? If the action is “rank these hypotheses for review,” the evidence burden is modest. If the action is “advance this asset,” “drop this target,” or “replace this comparator,” the evidence burden rises sharply.

Attrition by stage

Failure modes differ by stage. Target-stage failures are usually causal: the target does not drive disease in the intended population or cannot be perturbed safely. Chemistry-stage failures often involve tractability, selectivity, exposure, or toxicology. Biology-stage failures include assay mismatch, animal-model mismatch, compensatory pathways, and weak effect size. Trial-stage failures include enrollment, endpoint sensitivity, dose, adherence, safety, and heterogeneity of treatment effect.

AI methods should therefore be stage-specific. A method that improves hit generation should not be credited with solving clinical attrition. A method that improves trial operations should not be credited with target validity. This stage discipline is what keeps the translation story honest.

Negative data as infrastructure

Most organizations learn less from failure than they should because failed targets, compounds, assays, and clinical hypotheses are poorly structured. For AI systems, that missing negative data creates a distorted evidence base where published successes dominate. A failure archive should record target rationale, intervention direction, modality, assay, endpoint, dose, toxicity, decision date, and reason for stopping.

The purpose is not blame. The purpose is to give future models and future reviewers the denominator that literature and press releases usually hide.
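
The archive fields listed above can be sketched as a small record type. This is a minimal illustration, not a proposed standard: the class name FailureRecord, the helper names, and the free-text field types are assumptions made here, and a production schema would likely use controlled vocabularies and structured dose units.

```python
from dataclasses import dataclass, asdict
from datetime import date
import json


@dataclass(frozen=True)
class FailureRecord:
    """One stopped target, compound, assay, or clinical hypothesis.

    Field names mirror the list in the text; the types are assumptions.
    """
    target_rationale: str
    intervention_direction: str  # e.g. "inhibit" or "activate"
    modality: str
    assay: str
    endpoint: str
    dose: str                    # free text with units in this sketch
    toxicity: str
    decision_date: date
    reason_for_stopping: str


def missing_fields(record: FailureRecord) -> list[str]:
    """Return names of text fields left blank, so gaps are visible at entry time."""
    return [
        name
        for name, value in asdict(record).items()
        if isinstance(value, str) and not value.strip()
    ]


def to_json(record: FailureRecord) -> str:
    """Serialize a record; the date is written in ISO 8601 form."""
    row = asdict(record)
    row["decision_date"] = record.decision_date.isoformat()
    return json.dumps(row, sort_keys=True)
```

Making the reason for stopping and the decision date required fields, rather than optional notes, is what lets later reviewers count stopped programs as a denominator instead of reconstructing them from memory.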

What is demonstrated?

Demonstrated capability includes identifying model failures through stricter benchmark design and physical validity checks. PoseBusters demonstrated that docking evaluations need chemical plausibility in addition to RMSD (Buttenschoen et al., 2024). MoleculeNet demonstrated the value and limits of shared molecular benchmarks (Wu et al., 2018). The single-cell perturbation critique extended this discipline to cell biology (Ahlmann-Eltze et al., 2025).
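
The gap between RMSD-only scoring and validity-aware scoring can be illustrated with a short sketch. It assumes the validity outcome for each pose has already been computed by an external checker and stored as a boolean; the PoseResult type, the function name, and the 2.0 Å default (a conventional docking cutoff) are choices made for this example, not the PoseBusters API.

```python
from dataclasses import dataclass


@dataclass
class PoseResult:
    ligand_id: str
    rmsd_angstrom: float
    passes_validity: bool  # assumed precomputed by an external validity checker


def success_rates(poses: list[PoseResult], rmsd_cutoff: float = 2.0) -> tuple[float, float]:
    """Return (RMSD-only success rate, RMSD-and-valid success rate)."""
    if not poses:
        raise ValueError("need at least one pose")
    n = len(poses)
    rmsd_ok = sum(p.rmsd_angstrom <= rmsd_cutoff for p in poses)
    both_ok = sum(
        p.rmsd_angstrom <= rmsd_cutoff and p.passes_validity for p in poses
    )
    return rmsd_ok / n, both_ok / n
```

Reporting both numbers side by side makes the dependence on metric choice explicit: the second rate can only be equal to or lower than the first, and the size of the drop is the quantity of interest.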

Decision-curve analysis is relevant because discovery decisions are asymmetric. The cost of advancing a false positive can be months of chemistry, animal work, manufacturing preparation, and trial planning, while the cost of filtering out one true positive depends on portfolio depth. Net-benefit framing evaluates prediction models against decision thresholds rather than discrimination alone (Vickers and Elkin, 2006). For therapeutics, AUC without a cost ratio is rarely enough.
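
As a minimal sketch of the net-benefit calculation, the function below scores a model at a single decision threshold. Net benefit is the true-positive rate per candidate minus the false-positive rate per candidate weighted by the threshold odds, pt / (1 - pt). The function names and the reading of "treat all" as "advance every candidate" are assumptions for this illustration, and the scores are assumed to be calibrated probabilities.

```python
def net_benefit(y_true: list[int], scores: list[float], threshold: float) -> float:
    """Net benefit of advancing every candidate with score >= threshold.

    NB = TP/n - (FP/n) * threshold / (1 - threshold)
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    if len(y_true) != len(scores) or not y_true:
        raise ValueError("y_true and scores must be non-empty and equal length")
    n = len(y_true)
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_advance_all(y_true: list[int], threshold: float) -> float:
    """Reference strategy: advance every candidate regardless of score."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(y_true)
    prevalence = sum(y_true) / n
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)
```

The threshold encodes the cost ratio: a threshold of 0.5 treats one wasted advance as equal in cost to one missed active, while a threshold of 0.2 accepts four wasted advances per true active. A model is worth using at a given threshold only if its net benefit exceeds both the advance-all value and zero, which is the net benefit of advancing nothing.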

Evidence Anchor	What It Supports	Practical Constraint
PoseBusters	Docking failure detection through validity checks	Metric choice changes conclusions
Ahlmann-Eltze 2025	Linear baselines compete with deep models on perturbation	Always run the simple comparator
MoleculeNet	Shared molecular benchmark tasks	Program-level value needs external evidence
Clinical attrition literature	Stage-specific failure priors	AI does not erase efficacy and safety gates
Decision-curve analysis	Cost-aware model evaluation	Thresholds must match program economics
FDA and EMA materials	Regulatory attention to AI lifecycle risks	Documentation and accountability are expected

What is theoretical?

Theoretical capability includes AI systems that forecast full program attrition risk across target, chemistry, biology, toxicology, trial execution, and market access. Existing data fragmentation makes this an evidence integration problem rather than a model-size problem. The pharmaceutical industry’s internal failure archives, if combined and shared, would likely improve this prediction substantially; institutional incentives currently work against combining them.

What is beyond current capability?

Beyond current capability is the reliable prediction of clinical success for early discovery assets without prospective evidence. Program success depends on human biology, trial design, safety, adherence, and effect size. AI changes the throughput of candidate generation and triage; it does not yet change the clinical attrition rate substantially.

Surrogate endpoints are another boundary. Biomarker movement or assay rescue can justify continuing a program, but it does not automatically establish patient benefit. Surrogate endpoint validity is context-specific and requires evidence that effects on the marker reliably predict effects on a clinically meaningful endpoint (Fleming and Powers, 2012).

What would make this more promising?

Translational-failure analysis becomes more promising when it shows that the model improves a program decision under realistic costs, validity checks, and stopping rules.

Claim	Evidence that raises or lowers confidence
“The model metric matters”	The metric maps to a named program decision, threshold, cost ratio, and downstream experiment
“The output is valid”	Modality-specific checks catch impossible poses, molecules, sequences, assay artifacts, or biological implausibility
“The benchmark gain is meaningful”	Strong classical baselines, biology-aware splits, external validation, and prospective tests support transfer
“The program learns from failure”	Negative compounds, targets, assays, batches, and stopped programs are archived with usable metadata
“The AI changes attrition risk”	Predefined stop rules, denominators, decision curves, and stage-gate outcomes improve relative to standard workflow

The claim should become stronger only when the model changes a go, no-go, redesign, or evidence-generation decision in a documented way.

What should researchers, biotech teams, funders, and program leaders do with this?

Map each model endpoint to the next experimental decision before committing to the metric.
Keep failed compounds, failed assays, and failed targets in the learning set.
Use decision curves and cost-aware evaluation when experiments are expensive.
Require an explicit stop rule for AI-guided programs.
Apply validity filters appropriate to the modality (PoseBusters for docking; physical-chemistry checks for generated molecules; developability filters for biologics; biological plausibility for generated sequences).
Always include the strongest classical baseline in any benchmark comparison.
Document failures with the same discipline as successes.
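
The baseline recommendation in the list above can be sketched as a reporting rule: state a model's gain only relative to the strongest simple comparator. The function names, the use of mean squared error, and the idea of passing baselines as a dictionary are assumptions for this example, and the right error metric depends on the task.

```python
from statistics import fmean


def mse(y_true: list[float], y_pred: list[float]) -> float:
    """Mean squared error between observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and equal length")
    return fmean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])


def gain_over_strongest_baseline(
    y_true: list[float],
    model_pred: list[float],
    baseline_preds: dict[str, list[float]],
) -> tuple[str, float]:
    """Return (name of best baseline, relative error reduction versus it).

    A negative reduction means the model is worse than the best baseline.
    """
    if not baseline_preds:
        raise ValueError("provide at least one baseline")
    errors = {name: mse(y_true, pred) for name, pred in baseline_preds.items()}
    best_name = min(errors, key=errors.get)
    best_err = errors[best_name]
    if best_err == 0.0:
        raise ValueError("best baseline is already perfect; relative gain is undefined")
    return best_name, (best_err - mse(y_true, model_pred)) / best_err
```

Selecting the comparator by lowest error, rather than by convenience, removes the option of reporting a gain against a weak baseline when a stronger simple one was available.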