Translational Evidence and Failure Modes
Failure in therapeutic AI is not only a matter of model error. It often comes from a mismatch among the model endpoint, the biological mechanism, the assay system, and the program decision.
- Identify failure modes across the discovery-to-development path.
- Separate model validation from program validation.
- Use negative evidence as a design input.
A model that improves a proxy endpoint may still harm the program if the proxy is poorly linked to disease biology or developability. Failure analysis belongs near the start of the workflow, not after candidate nomination.
Introduction
PoseBusters is a useful example outside its narrow docking domain because it shows how a familiar metric can hide invalid outputs (Buttenschoen et al., 2024). The same pattern appears in target selection, small molecules, biologics, and trials: model success depends on the decision that follows the score.
Demonstrated
Demonstrated capability includes identifying model failures through stricter benchmark design and physical validity checks. PoseBusters demonstrated that docking evaluations need checks of chemical and physical plausibility in addition to RMSD (Buttenschoen et al., 2024). MoleculeNet demonstrated both the value and the limits of shared molecular benchmarks (Wu et al., 2018).
| Evidence Anchor | What It Supports | Practical Constraint |
|---|---|---|
| PoseBusters | Docking failure detection | Metric choice changes conclusions |
| MoleculeNet | Shared molecular benchmark tasks | Program-level value needs external evidence |
| FDA and EMA materials | Regulatory attention to AI lifecycle risks | Documentation and accountability are expected |
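The point that metric choice changes conclusions can be illustrated with a minimal sketch. It is not the PoseBusters implementation. The `Pose` class, the `physically_valid` flag, the pose values, and the 2.0 Å cutoff (a common convention, assumed here) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Pose:
    """One predicted docking pose (hypothetical record for illustration)."""
    rmsd: float             # heavy-atom RMSD to the reference pose, in angstroms
    physically_valid: bool  # assumed flag: True if all plausibility checks passed


def success_rate(poses, rmsd_cutoff=2.0, require_valid=False):
    """Fraction of poses counted as successes.

    With require_valid=False, only the RMSD cutoff is applied.
    With require_valid=True, a pose must also pass the validity flag.
    """
    if not poses:
        raise ValueError("poses must not be empty")
    hits = sum(
        1
        for p in poses
        if p.rmsd <= rmsd_cutoff and (p.physically_valid or not require_valid)
    )
    return hits / len(poses)


# Made-up poses: two have low RMSD but fail the plausibility flag.
poses = [
    Pose(rmsd=0.8, physically_valid=True),
    Pose(rmsd=1.5, physically_valid=False),
    Pose(rmsd=1.9, physically_valid=True),
    Pose(rmsd=3.2, physically_valid=True),
    Pose(rmsd=1.1, physically_valid=False),
]

rmsd_only = success_rate(poses)                           # 4 of 5 poses pass
rmsd_and_valid = success_rate(poses, require_valid=True)  # 2 of 5 poses pass
print(f"RMSD only: {rmsd_only:.2f}, RMSD + validity: {rmsd_and_valid:.2f}")
```

On this toy set the same predictions score 0.80 under the RMSD-only criterion and 0.40 once validity is required, which is the kind of gap a single familiar metric can hide.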
Theoretical
Theoretical capability includes AI systems that forecast full program attrition risk across target, chemistry, biology, toxicology, trial execution, and market access. Existing data fragmentation makes this an evidence integration problem rather than a model-size problem.
Beyond Current Capabilities
Reliable prediction of clinical success for early discovery assets, without prospective evidence, remains beyond current capabilities. Program success depends on human biology, trial design, safety, adherence, and effect size, none of which can be fully observed at the discovery stage.
Practice Notes
- Map each model endpoint to the next experimental decision.
- Keep failed compounds, failed assays, and failed targets in the learning set.
- Use decision curves and cost-aware evaluation when experiments are expensive.
- Require an explicit stop rule for model-guided programs.
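The decision-curve note above can be sketched with the standard net benefit formula, net benefit = TP/n - (FP/n) * (p_t / (1 - p_t)), where p_t is the probability threshold at which a compound is sent to the experiment. The labels, scores, and threshold below are hypothetical values for illustration, and the function names are assumptions, not part of any specific library.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of testing every item whose predicted probability >= threshold.

    The odds threshold / (1 - threshold) act as the exchange rate between
    a wasted experiment (false positive) and a found hit (true positive).
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("y_true and y_prob must be non-empty and equal length")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * (threshold / (1.0 - threshold))


def net_benefit_test_all(y_true, threshold):
    """Reference strategy: run the experiment on every item, ignoring the model."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * (threshold / (1.0 - threshold))


# Made-up outcomes (1 = confirmed active) and model scores for ten compounds.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.6, 0.7, 0.1, 0.3, 0.4, 0.8, 0.55, 0.05]

threshold = 0.2  # assumed: a missed hit costs about four times a wasted experiment
model_nb = net_benefit(y_true, y_prob, threshold)
all_nb = net_benefit_test_all(y_true, threshold)
print(f"model: {model_nb:.3f}, test all: {all_nb:.3f}, test none: 0.000")
```

A model is worth using at a given threshold only if its net benefit exceeds both reference strategies, testing everything and testing nothing (net benefit zero). Repeating the calculation across a range of thresholds gives the decision curve.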