Translational Evidence and Failure Modes

Published: May 24, 2026

The failure pattern in therapeutic AI is not only model error. It is often the mismatch between model endpoint, biological mechanism, assay system, and program decision.

Learning Objectives
  • Identify failure modes across the discovery-to-development path.
  • Separate model validation from program validation.
  • Use negative evidence as a design input.
TL;DR

A model that improves a proxy endpoint may still harm the program if the proxy is poorly linked to disease biology or developability. Failure analysis belongs near the start of the workflow, not after candidate nomination.

Introduction

PoseBusters is a useful example outside its narrow docking domain because it shows how a familiar metric can hide invalid outputs (Buttenschoen et al., 2024). The same pattern appears in target selection, small molecules, biologics, and trials: whether a model succeeds depends on the decision its score informs, not on the score alone.

Demonstrated

Demonstrated capability includes identifying model failures through stricter benchmark design and physical validity checks. PoseBusters demonstrated that docking evaluations need chemical and physical plausibility checks, such as sensible bond lengths and the absence of steric clashes, in addition to RMSD (Buttenschoen et al., 2024). MoleculeNet demonstrated the value and limits of shared molecular benchmarks (Wu et al., 2018).
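The effect of adding a validity requirement to an RMSD-based success rate can be sketched in a few lines. This is a minimal illustration, not the PoseBusters implementation: the `Pose` class, the `passes_checks` flag, the `success_rate` helper, and all pose values are hypothetical, and the 2 Å cutoff is assumed as the conventional docking threshold.

```python
from dataclasses import dataclass

RMSD_CUTOFF_ANGSTROM = 2.0  # assumed conventional docking success threshold


@dataclass(frozen=True)
class Pose:
    ligand_id: str
    rmsd: float          # RMSD to the reference pose, in angstroms
    passes_checks: bool  # hypothetical flag: all plausibility checks passed


def success_rate(poses, require_validity):
    """Fraction of poses under the RMSD cutoff, optionally also requiring validity."""
    if not poses:
        raise ValueError("need at least one pose")
    hits = sum(
        1
        for p in poses
        if p.rmsd <= RMSD_CUTOFF_ANGSTROM
        and (p.passes_checks or not require_validity)
    )
    return hits / len(poses)


# Made-up poses for illustration only, not results from any benchmark.
poses = [
    Pose("lig-a", 0.8, True),
    Pose("lig-b", 1.5, False),  # close to the reference but physically implausible
    Pose("lig-c", 1.9, True),
    Pose("lig-d", 3.4, True),
]

rmsd_only = success_rate(poses, require_validity=False)
rmsd_and_valid = success_rate(poses, require_validity=True)
print(f"RMSD only: {rmsd_only:.2f}; RMSD plus validity: {rmsd_and_valid:.2f}")
```

On this toy set the success rate drops from 0.75 to 0.50 once the implausible pose is excluded, which is the sense in which metric choice changes conclusions.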

Evidence Anchor       | What It Supports                           | Practical Constraint
PoseBusters           | Docking failure detection                  | Metric choice changes conclusions
MoleculeNet           | Shared molecular benchmark tasks           | Program-level value needs external evidence
FDA and EMA materials | Regulatory attention to AI lifecycle risks | Documentation and accountability are expected

Theoretical

Theoretical capability includes AI systems that forecast full program attrition risk across target, chemistry, biology, toxicology, trial execution, and market access. Existing data fragmentation makes this an evidence integration problem rather than a model-size problem.

Beyond Current Capabilities

Reliable prediction of clinical success for early discovery assets without prospective evidence remains beyond current capabilities. Program success depends on human biology, trial design, safety, adherence, and effect size, which early discovery data capture only in part.

Practice Notes

  • Map each model endpoint to the next experimental decision.
  • Keep failed compounds, failed assays, and failed targets in the learning set.
  • Use decision curves and cost-aware evaluation when experiments are expensive.
  • Require an explicit stop rule for model-guided programs.
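
As one way to make the decision-curve note concrete, the sketch below computes net benefit in the standard decision curve analysis form, NB = TP/n - (FP/n) * pt / (1 - pt), where pt is the threshold probability at which running a follow-up experiment is judged worthwhile. The function names, labels, and scores are hypothetical and chosen only for illustration; in a real program pt would come from the relative cost of a wasted experiment and the value of a confirmed hit.

```python
def net_benefit(y_true, y_score, threshold):
    """Net benefit of following up every item whose score is >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    if not y_true or len(y_true) != len(y_score):
        raise ValueError("y_true and y_score must be non-empty and equal length")
    n = len(y_true)
    tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
    # False positives are weighted by the odds implied by the threshold.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_follow_up_all(y_true, threshold):
    """Reference strategy: run the follow-up experiment on every item."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Made-up outcomes (1 = confirmed active) and model scores, for illustration only.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1]

for pt in (0.2, 0.5):
    model_nb = net_benefit(y_true, y_score, pt)
    all_nb = net_benefit_follow_up_all(y_true, pt)
    print(f"pt={pt:.1f}  model={model_nb:.3f}  follow-up-all={all_nb:.3f}")
```

Plotting net benefit over a range of thresholds gives the decision curve. A model is worth using at a given threshold only if its net benefit exceeds both the follow-up-all strategy and the follow-up-none strategy, whose net benefit is zero.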