The Life Sciences AI Handbook: Steering Frontier Models in Biology

Tegomoh, Bryan

Biological Data Infrastructure

Published

July 7, 2026

Life sciences AI inherits the strengths and weaknesses of biological archives. Protein structures, sequences, small molecules, perturbation screens, single-cell observations, and imaging datasets each carry different biases, missingness patterns, and validation traditions. The data layer is the ceiling on every downstream method, and the gap between machine-readable and AI-ready data is wide enough that NIH Bridge2AI was funded to address it directly.

Learning Objectives

Use this chapter to:

Explain why biological data infrastructure sets the ceiling for model quality before training begins.
Recognise provenance, metadata, negative data, assay context, versioning, and measured-versus-computed labels as scientific requirements, not administrative polish.

Prerequisites: none. This chapter is the data-layer companion to the model-layer and evaluation-layer chapters.

Chapter Summary (TL;DR)

Summary: Biological data infrastructure sets the ceiling for model quality before training begins. Large public resources now support foundation models, but many biological domains still lack AI-ready records with complete context and failed-result capture.

Key point: Provenance, metadata, negative data, assay context, versioning, and measured-versus-computed labels are scientific requirements, not administrative polish. Open question: whether provenance, negative results, and licensing constraints can travel cleanly into training and evaluation.

Bottom line: Every model class in the handbook depends on this substrate, from protein design and single-cell models to ecology, therapeutics, and automated laboratories.

Field Guide

What is this field trying to solve? Biological data infrastructure sets the ceiling for model quality before training begins, so the problem is closing the gap between machine-readable and AI-ready data.

What is the core idea? Provenance, metadata, negative data, assay context, versioning, and measured-versus-computed labels are scientific requirements, not administrative polish.

What is the current state of the field? Large public resources now support foundation models, but many biological domains still lack AI-ready records with complete context and failed-result capture.

What do we know, and what remains open? Known reference points include UniProt, PDB, AlphaFold DB, PubChem, ChEMBL, GenBank, RefSeq, CZ CELLxGENE, Bridge2AI, BioSample, OBO Foundry, and dataset-card practices. What remains open is whether provenance, negative results, and licensing constraints can travel cleanly into training and evaluation.

Why does this matter? Every model class in the handbook depends on this substrate, from protein design and single-cell models to ecology, therapeutics, and automated laboratories.

Introduction

The Protein Data Bank has served as a global archive for experimentally determined macromolecular structures since 1971 (wwPDB, 2026). RCSB PDB now also presents computed structure models beside experimental structures, which forces users to distinguish measurement from prediction (RCSB PDB, 2026). For biomedical AI, the same distinction applies across sequence archives, chemical databases, single-cell atlases, and image repositories.

The minimum standard is not simply “public data.” FAIR data principles make findability, accessibility, interoperability, and reuse explicit, but life sciences AI needs a stricter operational version: assay provenance, batch metadata, version pinning, licensing, and a durable flag for measured versus computed records (Wilkinson et al., 2016). The PDB paper that defined the modern structural archive model is still instructive because coordinates are tied to experimental method, deposition history, and validation metadata rather than treated as anonymous shapes (Berman et al., 2000).
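As a minimal sketch, the stricter operational version could be carried as a small record type attached to every row. The class name, field names, and the `pinned_id` format below are illustrative assumptions, not a schema defined by FAIR, the PDB, or any archive named in this chapter.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class EvidenceType(Enum):
    MEASURED = "measured"   # produced by an experiment
    COMPUTED = "computed"   # produced by a model or pipeline


@dataclass(frozen=True)
class RecordProvenance:
    """Hypothetical per-record provenance; all field names are illustrative."""
    accession: str           # identifier in the source archive
    source: str              # archive or internal system name
    source_version: str      # pinned release or snapshot label
    retrieved_on: date       # date the record was downloaded
    method: str              # assay, experimental method, or prediction pipeline
    batch: Optional[str]     # batch identifier, when the source exposes one
    license: str             # licence as stated by the source
    evidence: EvidenceType   # durable measured-versus-computed flag

    def pinned_id(self) -> str:
        """Identifier that changes whenever the source release changes."""
        return f"{self.source}@{self.source_version}:{self.accession}"


# Placeholder values, not a real archive entry.
example = RecordProvenance(
    accession="EXAMPLE-0001",
    source="example-archive",
    source_version="2026-01",
    retrieved_on=date(2026, 1, 15),
    method="X-ray diffraction",
    batch=None,
    license="CC0-1.0",
    evidence=EvidenceType.MEASURED,
)
print(example.pinned_id())
```

Freezing the dataclass is a design choice in this sketch: provenance that can be mutated after ingestion is harder to audit than provenance that has to be replaced.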

FAIR is necessary, not sufficient

FAIR data principles answer whether a dataset can be found, accessed, combined, and reused. AI-ready biological data ask a narrower and harder question: can a model trained or evaluated on this dataset be interpreted against the biology that produced it? A FAIR single-cell matrix without donor metadata, tissue handling, dissociation protocol, batch, cell-type annotation version, and filtering logic is findable but still weak for model evaluation. A FAIR chemical-bioactivity table without assay conditions, target construct, readout type, replicate handling, and inactive compounds is reusable but still misleading for QSAR or ADMET claims.
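The difference between findable and interpretable can be made checkable. The sketch below assumes dataset-level metadata arrives as a plain dictionary; the key names are hypothetical and simply mirror the context items listed in the paragraph above.

```python
# Hypothetical required-context checklists; key names are illustrative only.
REQUIRED_CONTEXT = {
    "single_cell": (
        "donor",
        "tissue_handling",
        "dissociation_protocol",
        "batch",
        "cell_type_annotation_version",
        "filtering_logic",
    ),
    "bioactivity": (
        "assay_conditions",
        "target_construct",
        "readout_type",
        "replicate_handling",
        "inactive_compounds_policy",
    ),
}


def context_gaps(kind, metadata):
    """Return the required context keys that are absent or empty."""
    required = REQUIRED_CONTEXT[kind]  # KeyError for an unknown dataset kind
    return [key for key in required if metadata.get(key) in (None, "", [])]


# A matrix that is findable but carries little biological context.
matrix_metadata = {"donor": "D-01", "batch": "B-3", "tissue_handling": ""}
print(context_gaps("single_cell", matrix_metadata))
```

An empty list from `context_gaps` says only that the fields are populated, not that their contents are correct; that judgment still belongs to a curator.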

Dataset datasheets generalise this discipline by forcing collection motivation, composition, preprocessing, recommended use, and maintenance into the record (Gebru et al., 2021). The Human Cell Atlas made the same point in biological form: a useful atlas is not just a collection of expression matrices, but an organised record of tissue, donor, technology, annotation, and community standards (Regev et al., 2017). For institutional data products, the datasheet and the biological protocol should travel together.

Negative data and failed experiments are model assets

Most public biological data are biased toward successful experiments, publishable effects, and positive findings. That bias is damaging for prioritisation models because the model sees hits, validated pathways, and successful assays more often than abandoned targets, failed compounds, silent perturbations, inconclusive screens, and protocols that never stabilized. In discovery work, the failed experiment is often the most expensive information the institution owns.

Internal AI-ready infrastructure should therefore preserve negative and inconclusive results with the same identifiers as positive findings: target, construct, compound, batch, assay, readout, operator or automation platform, failure category, and stop decision. This does not make a model automatically better. It gives the evaluation layer access to the denominators that published literature usually hides.
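To illustrate what access to the denominators means in practice, the sketch below stores every outcome under the same identifiers and reports a hit rate over conclusive results alongside the counts of inconclusive and failed runs. The outcome vocabulary, field names, and example values are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumed outcome vocabulary; a real system would define its own.
OUTCOMES = ("hit", "inactive", "inconclusive", "failed")


@dataclass(frozen=True)
class AssayOutcome:
    target: str
    compound: str
    assay: str
    outcome: str
    failure_category: str = ""   # free text, empty unless the run failed

    def __post_init__(self):
        if self.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome!r}")


def summarise_by_target(outcomes):
    """Count every outcome per target and compute hit rate over conclusive runs."""
    counts = defaultdict(lambda: dict.fromkeys(OUTCOMES, 0))
    for record in outcomes:
        counts[record.target][record.outcome] += 1
    summary = {}
    for target, c in counts.items():
        conclusive = c["hit"] + c["inactive"]
        summary[target] = {
            **c,
            "tested": sum(c.values()),
            "hit_rate": c["hit"] / conclusive if conclusive else None,
        }
    return summary


# Placeholder screen: one hit, three inactives, one failed run.
screen = [
    AssayOutcome("T1", "C1", "A1", "hit"),
    AssayOutcome("T1", "C2", "A1", "inactive"),
    AssayOutcome("T1", "C3", "A1", "inactive"),
    AssayOutcome("T1", "C4", "A1", "inactive"),
    AssayOutcome("T1", "C5", "A1", "failed", "precipitation"),
]
report = summarise_by_target(screen)
print(report["T1"]["hit_rate"])  # 0.25
```

A hits-only view of the same screen would contain a single record and no way to recover the 0.25; the inactive and failed rows are what make the rate computable.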

Measured records and computed records must stay separate

Predicted structure databases, imputed single-cell profiles, computed annotations, and model-derived labels are useful, but they should not silently merge with measured records. Mixing experimental PDB structures with AlphaFold DB predictions, measured cell states with imputed cell states, or wet-lab assay results with model-predicted activity changes the evidence type. The distinction should be explicit in tables, identifiers, and downstream training filters.
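One way to keep the distinction explicit in a downstream training filter is to refuse records whose evidence type is missing instead of assigning a default. This is a sketch under the assumption that each record is a dictionary with an `evidence` key; the key name, the allowed values, and the example identifiers are hypothetical.

```python
def split_by_evidence(records):
    """Partition records into measured and computed; reject unlabeled ones."""
    measured, computed = [], []
    for record in records:
        flag = record.get("evidence")
        if flag == "measured":
            measured.append(record)
        elif flag == "computed":
            computed.append(record)
        else:
            # No silent default: an unlabeled record is a data error.
            raise ValueError(f"record {record.get('id')!r} has no evidence flag")
    return measured, computed


# Placeholder identifiers, not real accessions.
structures = [
    {"id": "structure-A", "source": "PDB", "evidence": "measured"},
    {"id": "structure-B", "source": "AlphaFold DB", "evidence": "computed"},
    {"id": "structure-C", "source": "PDB", "evidence": "measured"},
]
measured, computed = split_by_evidence(structures)
print(len(measured), len(computed))  # 2 1
```

Raising on a missing flag is deliberate in this sketch: a default of "measured" would reproduce exactly the silent merge the paragraph above warns against.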

Knowledge-graph infrastructure helps when it preserves provenance rather than flattening it. BioCypher, for example, formalizes schema-driven knowledge representation so biomedical graphs can carry source, relation type, and version information rather than becoming untyped edge collections (Lobentanzer et al., 2023). That discipline is what makes graph-derived model inputs inspectable.
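The difference between a typed, provenance-carrying edge and a flattened one can be shown in a few lines. The sketch below is plain Python and does not use or imitate the BioCypher API; the `Edge` fields, relation name, and source labels are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Edge(NamedTuple):
    subject: str
    relation: str
    obj: str
    source: str           # database that asserts the edge
    source_version: str   # release the assertion was read from


def group_assertions(edges):
    """Group edges by triple while keeping every source that asserts them."""
    grouped = defaultdict(list)
    for edge in edges:
        triple = (edge.subject, edge.relation, edge.obj)
        grouped[triple].append((edge.source, edge.source_version))
    return dict(grouped)


# Placeholder entities and sources.
edges = [
    Edge("compound-X", "inhibits", "protein-Y", "source-1", "v34"),
    Edge("compound-X", "inhibits", "protein-Y", "source-2", "2025-11"),
    Edge("protein-Y", "part_of", "pathway-Z", "source-1", "v34"),
]
assertions = group_assertions(edges)
print(assertions[("compound-X", "inhibits", "protein-Y")])
```

A flattened graph would store the first triple once and lose the fact that two independent sources, at two specific releases, support it.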

What is demonstrated?

Demonstrated capability includes training and evaluating models on curated public resources. ChEMBL provides curated bioactivity data for drug-like molecules (Zdrazil et al., 2024). PubChem provides chemical substance, compound, and bioassay records through NIH infrastructure (Kim et al., 2023). AlphaFold DB and the ESM Metagenomic Atlas show how predicted structures became research resources at database scale (AlphaFold Protein Structure Database, 2026; ESM Metagenomic Atlas, 2026).

The same “database” label hides different curation models. UniProt separates expert-reviewed protein annotation from automated annotation at scale, which makes it useful both as a high-confidence reference and as a broad pretraining substrate (UniProt Consortium, 2025). ChEMBL is curated around bioactivity and assay context, while PubChem is broader chemical and bioassay infrastructure (Zdrazil et al., 2024; Kim et al., 2023). Single-cell atlases add a different requirement: donor, tissue, protocol, and cell-type annotation must travel with the expression matrix, as the Tabula Sapiens multi-organ atlas made explicit (Tabula Sapiens Consortium, 2022).

Evidence Anchor	What It Supports	Practical Constraint
PDB	Experimentally determined structure archive	Coverage follows what structural biology could measure
ChEMBL and PubChem	Chemical structure and bioactivity resources	Assay context and curation level differ across entries
Bridge2AI	AI-ready data standards as a program objective	Ethical sourcing, metadata, and fairness are part of data quality

What is theoretical?

Theoretical capability includes cross-database models that learn from raw sequence, structure, chemical, image, and text records without manual harmonization. This remains theoretical in many workflows because identifiers, assay conditions, version histories, and licensing terms often fail to align cleanly.

What is beyond current capability?

Beyond current capability are biological datasets that fully encode the causal context of an experiment. No public archive contains all cell states, reagent histories, operator choices, instrument behavior, and environmental variables needed to remove experimental ambiguity.

What would make this more promising?

Data infrastructure would become more promising if public and consortium datasets routinely carried experiment-level provenance, negative and inconclusive results, license terms, versioned identifiers, and measured-versus-computed flags through model training and evaluation. Cross-institution benchmarks showing that models trained on these records transfer across labs, assays, and organisms would move more data-infrastructure claims from plausible to demonstrated. For now, the practical evidence is strongest where the archive exposes provenance and scope clearly enough for another team to reproduce the data decision.

What should researchers, biotech teams, funders, and program leaders do with this?

Record data version, download date, accession source, and filtering logic for every dataset used in training or evaluation.
Separate experimental structures from computed structure models in tables and figures.
Keep negative, failed, and inconclusive experiments when training prioritisation systems.
Audit licence terms before mixing public, consortium, and commercial data; the strictest term in a mix becomes the binding constraint.
Use dataset cards as a minimum standard for any internal data product.
Treat AI-ready dataset construction as engineering work that deserves its own staffing and budget, not as a side effect of model work.
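The first and fourth recommendations above can be sketched together as a dataset manifest. The manifest keys, the licence tags, and the restrictiveness ranking are all assumptions made for illustration; real licence compatibility is not a simple ordering and needs legal review.

```python
import hashlib
import json
from datetime import date

# Assumed ranking, higher means stricter. Illustrative only, not legal advice.
RESTRICTIVENESS = {
    "public-domain": 0,
    "attribution": 1,
    "non-commercial": 2,
    "commercial-contract": 3,
}


def manifest_entry(name, payload, version, accession_source,
                   downloaded_on, filters, license_tag):
    """Describe one dataset snapshot, including a checksum of its bytes."""
    return {
        "name": name,
        "version": version,
        "accession_source": accession_source,
        "downloaded_on": downloaded_on.isoformat(),
        "filters": list(filters),          # filtering logic, in applied order
        "license": license_tag,
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


def binding_license(entries):
    """Return the strictest licence tag in the mix (KeyError if unranked)."""
    return max((e["license"] for e in entries), key=RESTRICTIVENESS.__getitem__)


# Placeholder datasets; payloads stand in for the downloaded files.
manifest = [
    manifest_entry("structures", b"placeholder-bytes-1", "2026-01",
                   "example-archive", date(2026, 1, 15),
                   ["resolution <= 3.0", "measured only"], "public-domain"),
    manifest_entry("bioactivity", b"placeholder-bytes-2", "release-36",
                   "example-consortium", date(2026, 2, 1),
                   ["drop duplicates"], "non-commercial"),
]
print(binding_license(manifest))            # non-commercial
print(json.dumps(manifest[0], indent=2))    # manifest entries serialise to JSON
```

Letting `binding_license` fail on an unranked tag is intentional in this sketch: an unknown licence should stop the pipeline and prompt a review, not be silently treated as permissive.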