AI for the Life Sciences

Published: May 24, 2026

Life sciences AI is the application of machine learning systems to biological objects: sequences, structures, cells, tissues, organisms, experiments. Its center of gravity is upstream of clinical care: molecules generated before they are synthesized, cells represented before they are perturbed, experiments planned before they are run. The AlphaFold Protein Structure Database made over 214 million AlphaFold 2 predicted structures freely available to researchers (Varadi et al., 2023). RFdiffusion designs proteins that bind, fold, and function in laboratory experiments (Watson et al., 2023). Coscientist plans and executes chemistry on cloud robotics with minimal human input (Boiko et al., 2023). None of this removes the experiment. The work is choosing the right one.

Learning Objectives

This chapter is the scope and orientation entry point for the handbook. You will learn to:

  • Distinguish discovery AI from clinical AI and public health AI, and identify which evidence standard applies
  • Apply the three-tier evidence framework (Demonstrated, Theoretical, Beyond current capabilities) used throughout the handbook
  • Map the major model classes (structure prediction, design, foundation models, perturbation prediction, agentic systems) to the biological objects they represent
  • Identify the audience this handbook is written for, and the role-specific reading paths
  • Recognize the infrastructure layer (NIH Bridge2AI, ARPA-H, CZ Biohub, Arc Institute) that shapes what data and benchmarks exist
  • Read frontier-lab announcements (DeepMind, Isomorphic Labs, Anthropic, OpenAI, Recursion, Insilico) with appropriate skepticism about evidence type
  • Frame any AI claim around the falsifying experiment

Prerequisites: none for orientation. The rest of the handbook assumes you have read this chapter and the Executive Summary.

The Big Picture: Life sciences AI is not one field. It is a set of modeling practices that share three constraints: biological data is structured for measurement (not for computation), every model output is a hypothesis (not an endpoint), and the cost of being wrong is paid at the bench, in the clinic, or by a program that does not exist. The first rule when reading any claim is to ask what biological object the model represents and what experiment would falsify the output.

Three Adjacent Domains, Different Evidence Standards:

| Domain | Asks | Evidence Standard |
| --- | --- | --- |
| Life sciences AI (this handbook) | Does this model identify a molecule, cell state, or experiment worth pursuing? | Benchmarks plus prospective experimental validation in the relevant system |
| Clinical AI | Does this tool improve a patient-level decision? | Prospective trials, regulatory clearance, real-world workflow studies |
| Public health AI | Does this tool improve a population-level decision? | Deployment-context evaluation, proper scoring rules, equity analysis |

The Three-Tier Evidence Framework:

  • Demonstrated: Supported by published evidence in peer-reviewed venues, official documentation, or reproducible benchmark results.
  • Theoretical: Plausible given current methods but not yet established for routine use.
  • Beyond current capabilities: Not supported by credible evidence with current systems.

Every major capability claim in this handbook lands in one of the three tiers.

Landscape of Demonstrated Capabilities (Selected):

| Class | Landmark System | Citation |
| --- | --- | --- |
| Single-chain protein structure | AlphaFold 2 | Jumper et al., 2021 |
| Biomolecular interactions | AlphaFold 3 / Boltz-1 / Chai-1 | Abramson et al., 2024; Wohlwend et al., 2024, preprint |
| Protein design | RFdiffusion | Watson et al., 2023 |
| Variant effect (missense) | AlphaMissense | Cheng et al., 2023 |
| Single-cell foundation model | scGPT, Geneformer | Cui et al., 2024; Theodoris et al., 2023 |
| Genomic foundation model | Evo / Evo 2 | Nguyen et al., 2024; Brixi et al., 2025, preprint |
| Perturbation prediction | GEARS | Roohani et al., 2023 |
| Autonomous chemistry | Coscientist | Boiko et al., 2023 |
| Multi-agent biology research | Virtual Lab | Swanson et al., 2025 |

Reading Frontier-Lab Announcements:

| Source Type | Evidence Weight | Cite As |
| --- | --- | --- |
| Peer-reviewed Nature/Science/Cell paper | High | (Author et al., Year) |
| bioRxiv / arXiv preprint | Moderate, treat as provisional | (Author et al., Year, preprint) |
| Company blog or press release | Low (vendor-reported) | (Company, Date) with “reported” label |
| Verified partnership announcement | Factual for partnership existence; not evidence of efficacy | (Company, Date) |
| Hosted demo or marketing | Not evidence | Do not cite as fact |
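
The citation rules in the table above can be sketched as a small lookup. This is a minimal illustration, not a handbook tool: the keys (`peer_reviewed`, `preprint`, and so on), the function names, and the exact rendering of the "reported" label are assumptions chosen for the example.

```python
from typing import Optional

# Keys are illustrative labels (an assumption); weights and citation
# formats follow the source-type table in this chapter.
SOURCE_RULES = {
    "peer_reviewed": ("High", "({who}, {when})"),
    "preprint": ("Moderate, treat as provisional", "({who}, {when}, preprint)"),
    # How the "reported" label is rendered is an assumption of this sketch.
    "company_release": ("Low (vendor-reported)", "reported ({who}, {when})"),
    "partnership": (
        "Factual for partnership existence; not evidence of efficacy",
        "({who}, {when})",
    ),
    "marketing": ("Not evidence", None),  # do not cite as fact
}


def evidence_weight(source_type: str) -> str:
    """Return the evidence weight assigned to a source type."""
    return SOURCE_RULES[source_type][0]


def cite_as(source_type: str, who: str, when: str) -> Optional[str]:
    """Format a citation, or return None when the source should not be cited."""
    template = SOURCE_RULES[source_type][1]
    return None if template is None else template.format(who=who, when=when)


print(cite_as("peer_reviewed", "Jumper et al.", "2021"))
print(cite_as("preprint", "Brixi et al.", "2025"))
print(cite_as("company_release", "Anthropic", "October 2025"))
print(cite_as("marketing", "Vendor", "2026"))
```

Returning `None` for marketing material makes the "do not cite as fact" rule explicit in calling code instead of leaving it to convention.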

Reading Rule:

Every major capability claim should land in one of three tiers: Demonstrated, Theoretical, or Beyond current capabilities. If a claim cannot be placed, the claim is not specific enough.

The Takeaway:

Life sciences AI is moving fast and unevenly. Some capabilities (structure prediction, single-cell representation) are genuinely transformative for routine work. Others (general biological reasoning, autonomous discovery) are aspirational. The discipline is being honest about which is which, choosing models by the biological question rather than the marketing, and designing the experiment that turns a prediction into a decision. Every chapter of this handbook applies that frame to a specific model class.


Introduction: What This Handbook Is For

October 2024, Stockholm: The Royal Swedish Academy of Sciences awards one half of the Nobel Prize in Chemistry to David Baker for computational protein design, and the other half jointly to Demis Hassabis and John Jumper for protein structure prediction with AlphaFold. The prize recognizes a decade of work in which deep learning moved from a tool researchers tried, to a tool researchers depended on, to (in the Committee’s framing) a tool that changed what was possible in structural biology.

May 2024: AlphaFold 3 extends biomolecular prediction beyond proteins to nucleic acids, ions, and ligands (Abramson et al., 2024). The initial release is server-only, with restricted commercial use. The open-source response (Boltz, Chai) arrives within six months.

October 2025: Anthropic launches Claude for Life Sciences with integrations into Benchling, 10x Genomics, PubMed, and Synapse (Anthropic, October 2025). A year earlier, OpenAI’s o1 had reached 77-78% on GPQA Diamond, exceeding the 69.7% PhD-expert baseline on graduate-level biology, chemistry, and physics questions (OpenAI, September 2024).

May 2026: ARPA-H launches the IGoR program (Intelligent Generator of Research) to “deliver gold-standard biomedical science faster through an AI-powered research ecosystem” focused on Alzheimer’s, Parkinson’s, and autoimmune disease (HHS press release, May 2026).


These are not unrelated events. Together they mark a change in the default operating model of biomedical research: AI is no longer an experiment to run; it is infrastructure to use. The question for working researchers is no longer “can I use AI” but “which model, for which decision, with what validation.”

This handbook is written for those questions. It is not a tour of every model release. It is a framework for reading model outputs as research inputs, choosing systems by biological question rather than brand, and designing the experiment that turns a prediction into a program decision.


What Life Sciences AI Is, and What It Is Not

The Discovery Layer

Life sciences AI sits in the discovery layer of biomedical research:

                     Population health
                            │
                            ▼
  Public Health AI ── Patient populations, surveillance, forecasting
                            │
                            ▼
                     Clinical practice
                            │
                            ▼
       Clinical AI ── Diagnosis, treatment, workflow, liability
                            │
                            ▼
                   Therapeutic development
                            │
                            ▼
   [LIFE SCIENCES AI] ── Molecules, cells, experiments, candidates
                            │
                            ▼
                       Fundamental biology

Each layer above depends on inputs from the layer below. A drug candidate exists before a clinical trial; a clinical trial exists before a regulatory decision; a regulatory decision exists before a population-scale deployment. Life sciences AI operates at the molecule-cell-experiment level. Its outputs are inputs to clinical and public health programs.

The Three Adjacent Domains

| Domain | Object of Study | Failure Cost | Evidence Standard |
| --- | --- | --- | --- |
| Life sciences AI | Molecules, cells, experiments, research decisions | A failed experiment, a discontinued program, a non-validated paper | Benchmarks + prospective experimental validation |
| Clinical AI | Individual patients, diagnostic and therapeutic decisions | Misdiagnosis, mistreatment, liability | Prospective clinical trials, FDA clearance, real-world performance |
| Public health AI | Populations, surveillance, intervention design | Missed outbreaks, misallocated resources, eroded trust | Deployment-context evaluation, proper scoring rules, equity analysis |

The three are complementary but distinct. A model that predicts a binding interaction is a life sciences AI claim. A model that recommends a treatment for a specific patient is a clinical AI claim. A model that forecasts hospitalizations is a public health AI claim. The same architectural family (transformers, diffusion, graph networks) underlies all three, but the evidence standard depends on the decision the model informs, not the math under the hood.
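
As a rough sketch of that last point, the lookup below keys the evidence standard on the decision a model informs and deliberately ignores the architecture. The decision labels (`research`, `patient`, `population`) and the function name are assumptions made for illustration; the domains and standards are the ones listed in this section.

```python
from typing import Tuple

# Decision level -> (domain, evidence standard). The keys are illustrative
# labels; the values restate the three adjacent domains described above.
EVIDENCE_STANDARD = {
    "research": (
        "Life sciences AI",
        "Benchmarks + prospective experimental validation",
    ),
    "patient": (
        "Clinical AI",
        "Prospective clinical trials, FDA clearance, real-world performance",
    ),
    "population": (
        "Public health AI",
        "Deployment-context evaluation, proper scoring rules, equity analysis",
    ),
}


def classify(decision_level: str, architecture: str = "transformer") -> Tuple[str, str]:
    """Return (domain, evidence standard) for the decision a model informs.

    `architecture` is accepted but intentionally unused: the standard follows
    the decision, not the model family.
    """
    return EVIDENCE_STANDARD[decision_level]


# The same architecture lands in different domains depending on the decision.
binding = classify("research", architecture="transformer")
treatment = classify("patient", architecture="transformer")
print(binding[0], "|", treatment[0])
```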

Companion handbooks in this series cover the clinical, population, and biosecurity layers explicitly. Links and descriptions appear on the welcome page under “Explore the Handbook Series.”


The Three-Tier Evidence Framework

Every chapter in this handbook places capability claims in one of three tiers:

Demonstrated

Supported by published evidence in peer-reviewed venues, official documentation, or reproducible benchmark results. The evidence must be specific to a defined task and dataset. “AlphaFold 2 predicts single-chain protein structures at near-experimental accuracy for the majority of well-folded domains evaluated in CASP14” is a demonstrated claim (Jumper et al., 2021).

Theoretical

Plausible given current methods but not yet established for routine use. The capability has been shown in selected systems, narrow tasks, or controlled settings without proven generalization. “Single-cell foundation models can transfer to new tissues and species” is currently a theoretical claim: published evidence supports transfer in some settings (Cui et al., 2024; Theodoris et al., 2023), but the boundary of useful transfer is an open research question.

Beyond Current Capabilities

Not supported by credible evidence with current systems. The capability is either aspirational, has been demonstrated in toy settings that do not generalize, or requires evidence that has not been produced. “Fully autonomous drug discovery without experimental validation” is beyond current capabilities. Coscientist demonstrates autonomous chemistry execution in bounded settings (Boiko et al., 2023), not autonomous discovery without measurement.

The point of the tiers is not to be conservative: it is to be specific. A claim that cannot be placed in a tier is not a claim about a system; it is marketing.

Reading Claims with the Framework

When a press release, a paper title, or a vendor pitch makes a capability claim, ask:

  1. What biological object is the model representing? Sequence? Structure? Cell state? Tissue? Experiment? Reaction?
  2. What is the specific task on which the claim is made? Property prediction? Generation? Ranking? Classification? Planning?
  3. What is the evidence? Held-out benchmark performance? Prospective experiment? Cross-laboratory replication? Vendor-reported internal evaluation?
  4. What experiment would falsify the claim? And has that experiment been done?

A claim that survives questions 1-4 is at minimum demonstrated for the specific task and dataset. A claim that fails question 4 (no falsifying experiment, or the experiment has not been done) is either theoretical or beyond current capabilities.
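
The four questions above can be sketched as a small triage function. This is one possible encoding under stated assumptions, not the handbook's procedure: the `Claim` fields, the `UNPLACEABLE` outcome, and the `shown_in_narrow_settings` flag used to separate Theoretical from Beyond current capabilities are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    DEMONSTRATED = "Demonstrated"
    THEORETICAL = "Theoretical"
    BEYOND = "Beyond current capabilities"
    UNPLACEABLE = "Not specific enough to place"  # assumption of this sketch


@dataclass
class Claim:
    biological_object: Optional[str]      # question 1
    task: Optional[str]                   # question 2
    evidence: Optional[str]               # question 3
    falsifying_experiment: Optional[str]  # question 4
    experiment_done: bool = False         # question 4, second half
    # Hypothetical flag: shown in selected systems without proven generalization.
    shown_in_narrow_settings: bool = False


def place(claim: Claim) -> Tier:
    """Assign a tier from the answers to the four questions."""
    if not (claim.biological_object and claim.task):
        return Tier.UNPLACEABLE
    if claim.evidence and claim.falsifying_experiment and claim.experiment_done:
        return Tier.DEMONSTRATED
    if claim.shown_in_narrow_settings:
        return Tier.THEORETICAL
    return Tier.BEYOND


af2 = Claim("single-chain protein structure", "structure prediction",
            "CASP14 evaluation", "blind prediction against experimental structures",
            experiment_done=True)
sc_transfer = Claim("cell state", "transfer to new tissues and species",
                    "transfer shown in some settings", None,
                    shown_in_narrow_settings=True)
autonomous = Claim("experiment", "drug discovery without experimental validation",
                   None, None)

for claim in (af2, sc_transfer, autonomous):
    print(claim.task, "->", place(claim).value)
```

The three example claims are the ones used in the tier definitions earlier in this section; the free-text evidence strings are paraphrases for the demo, not quotations from the cited papers.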


The Capability Landscape

The handbook organizes life sciences AI into six parts. Each part contains several chapters that apply the evidence framework to a specific model class.

Part I: Foundations (this part)

Part II: Molecular AI

Part III: Therapeutics AI

Part IV: Cellular and Systems Biology

Part V: Engineering and Automation

Part VI: Practice and Governance


The Infrastructure Layer

Models are visible; infrastructure is decisive. The capability gaps in life sciences AI are often data gaps, benchmark gaps, or compute gaps before they are architecture gaps.

Public Programs

  • NIH Bridge2AI (NIH Common Fund, Bridge2AI Consortium): Four grand-challenge data generation projects (CHORUS for AI/ML in clinical care, CM4AI for functional genomics, VOICE for precision public health, AI-READI for salutogenesis). The program’s premise: AI-ready datasets are themselves an infrastructure problem, requiring metadata, ethics review, quality control, and workforce development, not only more storage.
  • ARPA-H IGoR (ARPA-H programs page; HHS press release, May 2026): Intelligent Generator of Research, focused on Alzheimer’s, Parkinson’s, and autoimmune disease. ARPA-H also funds adjacent AI programs: ADVOCATE (cardiovascular AI agents), RAPID (rare-disease AI diagnostics), CATALYST (ADME-tox modeling), ADAPT (precision cancer therapy).
  • NCI Cancer Research Data Commons (NCI CRDC): Data infrastructure spanning genomics, proteomics, and imaging that AI work depends on, even when not formally an “AI program.”

Non-Profit and Foundation Programs

  • CZ Biohub and CZ CELLxGENE (CZ CELLxGENE Discover): Roughly 100 million curated single-cell observations in a standardized, queryable platform. The Tabula Sapiens collection (1.1M cells from 28 organs, 24 donors) is a benchmark first-draft human cell atlas.
  • CZI Virtual Cells Platform (CZI): An active program to build and benchmark foundation models for cell biology.
  • Arc Institute: Co-developer (with Stanford, UC Berkeley, UCSF, and NVIDIA) of the Evo and Evo 2 genomic foundation models (Nguyen et al., 2024; Brixi et al., 2025, preprint).

Frontier Labs

The major AI labs each have life-sciences programs at varying degrees of openness:

| Lab | Visible Life-Sciences Work | Evidence Type |
| --- | --- | --- |
| Google DeepMind / Isomorphic Labs | AlphaFold 2/3, AlphaMissense, AlphaProteo; Eli Lilly and Novartis drug-discovery partnerships (Isomorphic Labs, January 2024) | Peer-reviewed for AlphaFold lineage; AlphaProteo is arXiv preprint (Zambaldi et al., 2024, preprint); partnerships are factual but not efficacy evidence |
| Anthropic | Claude for Life Sciences (October 2025), AI for Science Program | Company announcement; no peer-reviewed life-sciences paper as of this writing |
| OpenAI | Color Health cancer-screening copilot (OpenAI, June 2024); Moderna ChatGPT Enterprise deployment (OpenAI, April 2024); o1 model GPQA Diamond performance (OpenAI, September 2024) | Verified partnerships and benchmark results; no peer-reviewed biology paper |
| Meta FAIR / EvolutionaryScale | ESM-2 protein language model (Lin et al., 2023); ESM-3 multimodal (Hayes et al., 2024, preprint) | ESM-2 peer-reviewed; ESM-3 preprint |

AI-Native Drug Discovery Companies

  • Recursion Pharmaceuticals (Recursion mission): High-content imaging plus ML. In 2025 the company reported its first AI-enabled clinical proof of concept; clinical candidates include REC-617 (CDK7) and REC-4881. It also disclosed a pipeline contraction in May 2025.
  • Insilico Medicine: Generative chemistry for IPF target TNIK; ISM001-055 reported positive Phase IIa topline (Insilico Medicine, November 2024). Company-reported efficacy; not yet peer-reviewed in a journal.
  • Insitro: Machine-learning models for metabolic disease and neuroscience; expanded Eli Lilly small-molecule collaboration in September 2025.

Read these with the framework: a partnership announcement is factual evidence of the partnership; it is not evidence that the AI-discovered molecule will read out positively, advance to Phase III, or change a patient’s outcome.


Who This Handbook Is For

The handbook is written for several overlapping audiences. The shared question is: when does an AI output deserve experimental attention?

| Role | What You Need From This Handbook |
| --- | --- |
| Computational biologist | Capability tier for each model class; what the failure modes are; how to design a benchmark that reflects your actual question |
| Biotechnology team lead | Build-vs-buy framing; license diligence; what the open-source alternatives are when a frontier release is restricted |
| Drug discovery scientist | Where AI shifts a stage gate vs. where it does not; how to read a vendor pitch against published evidence |
| Physician-scientist | Translation between bench AI and clinical decision-making; what makes a discovery-stage AI claim relevant to the clinic |
| Synthetic biologist | Design tools, autonomous lab integration, dual-use considerations |
| Graduate student | Conceptual entry points into model classes; canonical citations; how to read benchmark results |
| Research program leader | Capital allocation framing; which capabilities are infrastructure-grade vs. research-grade; how to evaluate proposals that invoke AI |

If you have read this far, the handbook is also written for you.


How to Read the Rest of the Handbook

If you have 20 minutes

  1. Executive Summary: Handbook-wide conclusions
  2. Protein Structure Prediction: The landmark capability and its limits
  3. Evaluation Principles for Biomedical Discovery AI: The framework that turns capability into decision

If you have an hour

Add:

  1. Single-Cell Foundation Models: The capability frontier in cell biology
  2. Self-Driving Laboratories: The autonomous laboratory frontier
  3. Information Hazards in Capability Research: Dual-use considerations for design and generation tools

If you are doing a deep program review

Read the relevant Part end-to-end. Each chapter is self-contained but cross-references the others.


Practice Notes

  • Name the biological object first. Sequence, structure, ligand, cell state, tissue, experiment, or clinical endpoint. The right model class depends on the object.
  • Name the validation object second. What experiment, benchmark, or independent dataset would change your decision if the model were wrong?
  • Do not equate a model score with biological truth. A high pLDDT, a low Tanimoto, a strong attention weight: these are model outputs, not measurements.
  • Treat every vendor claim as a claim about a specific data distribution until proven otherwise. A model that works on one cell line, one species, or one assay does not work on all of them.
  • Read the license before scoping the project. A model you cannot run on your infrastructure is, for your program, not the state of the art.
  • Cite by version and venue. “AlphaFold” without a version is ambiguous; an announcement is not a paper; a partnership is not an outcome.
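
The note on model scores can be made concrete with pLDDT. AlphaFold-format PDB files store per-residue pLDDT in the B-factor field, so the value can be read and binned as a confidence label without treating it as a measurement. The sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, `demo_atom_line` exists only to build fixed-width demo records, the band cut-offs follow the commonly used 90/70/50 thresholds with boundary handling chosen here, and a single-chain file is assumed.

```python
from typing import Dict


def ca_plddt(pdb_text: str) -> Dict[int, float]:
    """Map residue number to pLDDT, read from the B-factor field of CA atoms."""
    scores: Dict[int, float] = {}
    for line in pdb_text.splitlines():
        # PDB ATOM records are fixed-width: atom name in columns 13-16,
        # residue number in 23-26, B-factor in 61-66 (1-indexed).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def confidence_band(plddt: float) -> str:
    """Bin a pLDDT value into a confidence label (a model output, not a measurement)."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"


def demo_atom_line(serial: int, name: str, resname: str, resseq: int, plddt: float) -> str:
    """Hypothetical helper: format one fixed-width ATOM record for the demo."""
    return (f"ATOM  {serial:>5} {name:<4} {resname:>3} A{resseq:>4}    "
            f"{0.0:>8.3f}{0.0:>8.3f}{0.0:>8.3f}{1.0:>6.2f}{plddt:>6.2f}")


# Invented demo values, not taken from any real prediction.
demo = "\n".join([
    demo_atom_line(1, "N", "MET", 1, 45.2),   # not a CA atom, so it is skipped
    demo_atom_line(2, "CA", "MET", 1, 45.2),
    demo_atom_line(3, "CA", "ALA", 2, 92.1),
    demo_atom_line(4, "CA", "GLY", 3, 68.0),
])

bands = {res: confidence_band(score) for res, score in ca_plddt(demo).items()}
print(bands)
```

A residue labeled "very high" here is still a prediction; the validation object named in the second note is what turns it into a decision.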

Common Questions

Is life sciences AI the same as “AI for science”?

Overlapping but not identical. “AI for science” is a broader phrase that includes physics, chemistry, materials, climate, mathematics, and other domains. This handbook focuses on the biomedical subset: proteins, cells, genomes, organisms, and the research workflows that connect them.

Do I need to know machine learning to use this handbook?

No. The handbook is written for working biologists, drug discovery scientists, and program leaders. It treats ML as a research instrument the same way it would treat a sequencer or a microscope: you should understand what the output means and what its limits are; you do not need to build the instrument from scratch. The Foundation Models for Biology and Evaluation Principles chapters provide enough ML context to read the rest of the handbook.

How often does the handbook update?

Continuously, as landmark papers and benchmark results appear in Nature, Science, Cell, and adjacent venues, and as frontier-lab and FDA actions warrant. The publication and modification dates are visible in each chapter’s metadata.

Why is dual-use treated as a separate chapter rather than woven through?

Because the questions in dual-use review (what to publish, what to release, how to communicate capability without enabling misuse) apply to many model classes at once. The dedicated chapter (Information Hazards in Capability Research) keeps the framework in one place. Cross-references in individual chapters point back to it.

What’s the difference between a foundation model and a task-specific model in biology?

A foundation model is pretrained on broad data (sequences, structures, cells) and adapted to many downstream tasks. A task-specific model is trained for one task. In life sciences, the foundation-model wave is real: ESM-2 / ESMFold for proteins, scGPT and Geneformer for single cells, Evo and Evo 2 for genomes. The wave is also recent and the generalization claims are still being validated. Foundation Models for Biology treats this in depth.

Are AI-discovered drugs real yet?

In the sense that some AI-discovered molecules have reached and progressed in clinical trials, yes: Insilico’s ISM001-055 (Phase IIa, company-reported) is a frequently cited example. In the sense that an AI-discovered molecule has been approved and changed standard-of-care, not yet. Read the evidence with the framework: a Phase IIa topline is a meaningful signal, not a registration-grade outcome. Translational Evidence and Failure Modes covers this in depth.


Cross-References