The Life Sciences AI Handbook: Steering Frontier Models in Biology

Tegomoh, Bryan

History of AI in the Life Sciences

Published

July 7, 2026

Computational biology has used algorithms and data-derived models for fifty years. The story is not a single breakthrough with AlphaFold 2 in 2020 but a five-decade arc from dynamic-programming sequence alignment to generative biology, with each generation inheriting and discarding pieces of the one before. Reading the field requires reading that history: most “new” capabilities in a 2026 product brief are a deep-learning rendering of an idea that bioinformatics tried in 1995, often with the same failure modes.

Learning Objectives

Use this chapter to:

Place current AI-biology systems in the longer history of computational biology.
See why shared data, benchmarks, and biological tasks changed which methods mattered.
Use history to calibrate new model claims without treating novelty as proof.

Chapter Summary (TL;DR)

Summary: AI in biology did not begin with AlphaFold. The current foundation-model era grew from sequence alignment, profile HMMs, supervised genomics, deep regulatory models, and decades of structure-prediction benchmarks.

Key point: Methods become important when data, benchmarks, and biological decisions line up. Older methods rarely disappear; they become infrastructure for the next generation.

Bottom line: Use this chapter to read new capability claims against their predecessors. If a field lacks a blind benchmark, shared data standard, or clear failure mode, treat broad claims as early even when the model is new.

Introduction

The conventional public narrative places the start of AI in biology in late 2020, when AlphaFold 2 produced single-chain protein structure predictions whose accuracy was comparable to experimental measurement on the CASP14 blind benchmark (Jumper et al., 2021). That date is right for the moment when biology entered the broad public consciousness as a field where deep learning would matter. It is wrong as a start date for the field.

Biology has been using computational methods since the early 1970s, and data-derived statistical models for nearly as long. The first practical computational tools were dynamic-programming sequence aligners and statistical models for protein families. By the late 1990s, hidden Markov models were the standard method for gene prediction and homology search. Support vector machines and random forests carried most genomics tasks through the 2000s. Deep learning entered biology around 2012, found early traction in regulatory genomics by 2015, and became indispensable in variant calling by 2018. AlphaFold 2 marked an inflection in which problems deep learning could solve, not in whether it should be used at all.

Reading the field requires reading this history. Three reasons:

Most capability claims have a prior generation. “End-to-end learning for transcription factor binding” is the 2015 framing of a problem that profile HMMs and position weight matrices tried to solve in the 1990s. The current generation works better not because the problem is new but because the data and the architecture finally match the problem’s structure. Knowing the prior generation’s failure modes is a practical way to identify where current claims may break.

The pattern of breakthrough is consistent. Major advances have followed a stable pattern: a blind benchmark exists, multiple groups compete on it for years, one group makes an architectural or data-scale change that produces a discontinuous improvement, then incremental advances saturate the metric. CASP played this role for structure, BLAST scoring for alignment, and MoleculeNet for molecular property prediction. The pattern is durable enough to be predictive: fields without a blind benchmark (most of cellular biology, most of perturbation prediction) have not yet had their CASP14.

The next decade will not be a continuation of the last. Each era ended when its dominant method ran out of headroom on the available data. Exact dynamic-programming alignment ran out as sequence databases became too large for exhaustive comparison, which is the gap BLAST's heuristics filled; profile HMMs ran out as deep learning could capture longer-range dependencies. Deep convolutional and recurrent architectures ran out as transformer-based models could capture biology’s sparse, long-range correlations. The current generation’s limits are already visible; what replaces it is not obvious.

The rest of this chapter walks each era in turn. The aim is not a complete bibliography. The aim is to give a reader who knows AI but not biology, or biology but not AI, enough context to read a 2026 capability claim against a 2006 one and tell which lessons still apply.

The Algorithmic Era (1970-1990)

Computational biology started before machine learning. The defining problem of the 1970s and 1980s was sequence comparison: given two protein or DNA sequences, identify their similarity and align them. The Needleman-Wunsch algorithm provided the first dynamic-programming solution for global alignment (Needleman and Wunsch, 1970); the Smith-Waterman algorithm extended it to local alignment, which mattered because most biologically meaningful similarity is local (Smith and Waterman, 1981). These are not learning algorithms. They are exact methods that, given a scoring scheme for matches, mismatches, and gaps, compute the optimal alignment in quadratic time.
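The local-alignment recurrence described above can be sketched in a few lines. This is a minimal illustration, assuming a flat match/mismatch score and a linear gap penalty (the function name and default values are illustrative choices, not taken from the original papers); production aligners use substitution matrices and affine gaps.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Assumed scoring: flat match/mismatch values and a linear gap penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of any local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at 0 is what makes the alignment local rather than global
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("TTTACGTAAA", "GGACGTCC")
print(score, top, bottom)
```

The two nested loops fill one cell per pair of positions, which is where the quadratic cost comes from. Removing the floor at zero and tracing back from the bottom-right corner gives the global (Needleman-Wunsch) variant.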

The scoring scheme was where statistics entered. Margaret Dayhoff’s PAM matrices and later the BLOSUM matrices were empirical: counts of how often each amino acid substituted for each other amino acid in evolutionarily related sequences. BLOSUM converted conserved blocks of related proteins into substitution scores that became a default language for protein comparison (Henikoff and Henikoff, 1992). These were data-derived scoring schemes, the proto-machine-learning of the field.
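How substitution counts become scores can be illustrated with a small log-odds calculation. The sketch below follows the general BLOSUM-style recipe (observed pair frequency divided by the frequency expected by chance, in half-bit units); the toy alignment columns and function names are assumptions for illustration, and the real procedure also clusters sequences by identity before counting.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(columns):
    """Count unordered residue pairs within each aligned column."""
    counts = Counter()
    for col in columns:
        for x, y in combinations(col, 2):
            counts[tuple(sorted((x, y)))] += 1
    return counts


def log_odds_scores(counts):
    """Convert pair counts into rounded half-bit log-odds scores."""
    total = sum(counts.values())
    q = {pair: n / total for pair, n in counts.items()}  # observed pair frequencies

    # Background frequency of each residue implied by the pair frequencies
    p = Counter()
    for (x, y), freq in q.items():
        if x == y:
            p[x] += freq
        else:
            p[x] += freq / 2
            p[y] += freq / 2

    scores = {}
    for (x, y), freq in q.items():
        # Chance expectation: p_x^2 for identical pairs, 2 * p_x * p_y otherwise
        expected = p[x] * p[y] * (1 if x == y else 2)
        scores[(x, y)] = round(2 * math.log2(freq / expected))
    return scores


# Toy columns (hypothetical data): one column per aligned position
columns = ["AAAA", "AAAA", "SSSS", "SSSA"]
print(log_odds_scores(pair_counts(columns)))
```

Pairs seen more often than chance predicts receive positive scores and pairs seen less often receive negative ones, which is the sense in which these matrices are learned from data rather than specified by hand.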

The breakthrough that made computational biology a routine laboratory tool was BLAST, the Basic Local Alignment Search Tool (Altschul et al., 1990). BLAST traded optimality for tractability: by indexing query words and extending high-scoring seed matches, it could search a database that grew faster than Moore’s law in a time that was practical on a single workstation. By the mid-1990s, BLAST queries against GenBank had become a routine first step in molecular biology. The tool has been in continuous use for more than thirty-five years; it sits inside modern bioinformatics pipelines, including those that preprocess data for foundation models.
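The seed-and-extend idea can be shown with a toy sketch. It is a deliberate simplification and not BLAST itself: it assumes exact word matches as seeds, ungapped extension, and a simple drop-off rule, and all names and parameter values are illustrative. BLAST adds neighbourhood words, substitution-matrix scoring, gapped extension, and significance statistics.

```python
from collections import defaultdict


def index_words(query, k):
    """Map every length-k word in the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def extend(a, b, match, mismatch, x_drop):
    """Ungapped extension from position 0 of both strings.

    Returns (best_gain, length_at_best); stops once the running score
    falls more than x_drop below the best score seen so far.
    """
    score = best = best_len = 0
    for n, (x, y) in enumerate(zip(a, b), start=1):
        score += match if x == y else mismatch
        if score > best:
            best, best_len = score, n
        elif best - score > x_drop:
            break
    return best, best_len


def seed_and_extend(query, subject, k=4, match=1, mismatch=-2, x_drop=3):
    """Return sorted (score, query_start, subject_start, length) segment pairs."""
    index = index_words(query, k)
    hits = set()
    for s in range(len(subject) - k + 1):
        for q in index.get(subject[s:s + k], ()):
            # Extend right of the seed, then left by reversing the prefixes
            r_gain, r_len = extend(query[q + k:], subject[s + k:], match, mismatch, x_drop)
            l_gain, l_len = extend(query[:q][::-1], subject[:s][::-1], match, mismatch, x_drop)
            hits.add((k * match + l_gain + r_gain, q - l_len, s - l_len, l_len + k + r_len))
    return sorted(hits, reverse=True)


print(seed_and_extend("TTACGTGCATT", "CCCACGTGCACCC"))
```

Only positions that share an exact word with the query are ever examined, so most of the database is skipped. That is the trade of optimality for tractability: a similarity with no exact seed of length k is missed entirely.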

The algorithmic era did not have “AI” in any modern sense, but it established the field’s defining habit: a method’s value is measured by whether biologists actually use it, not by whether it produces the theoretically best answer. Methods that ran in time biologists tolerated, with output biologists could interpret, won. Methods that were optimal but slow lost. That habit persists.

The Statistical Learning Era (1990-2012)

By the mid-1990s, the practical problem had shifted from comparing two sequences to modelling families of sequences. A protein family shares ancestry but diverges over time; positions in the family vary in how conserved they are, and the variation itself carries information about function. Profile hidden Markov models (HMMs), formalised in tools like HMMER and SAM, captured this position-specific conservation (Eddy, 1998). By 1999 the Pfam database was indexing thousands of protein families with profile HMMs; by 2024 Pfam covered the majority of UniProt’s reviewed sequences.
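Position-specific conservation can be illustrated with a simple profile built from an ungapped alignment. This sketch keeps only the per-position emission scores; a profile HMM as implemented in HMMER or SAM also models insertions, deletions, and transitions between states. The DNA alphabet, uniform background, pseudocount, and toy sequences are assumptions chosen to keep the example short.

```python
import math

ALPHABET = "ACGT"  # assumed alphabet; protein profiles use the 20 amino acids


def build_profile(aligned, pseudocount=1.0, background=0.25):
    """Build per-position log-odds scores from equal-length aligned sequences."""
    profile = []
    for pos in range(len(aligned[0])):
        column = [seq[pos] for seq in aligned]
        denom = len(column) + pseudocount * len(ALPHABET)
        profile.append({
            # Pseudocounts keep unseen symbols from scoring minus infinity
            sym: math.log2(((column.count(sym) + pseudocount) / denom) / background)
            for sym in ALPHABET
        })
    return profile


def score(profile, window):
    """Sum the position-specific scores for one window of sequence."""
    return sum(col[sym] for col, sym in zip(profile, window))


def best_window(profile, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(profile)
    return max(
        (score(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Hypothetical family: positions 3 and 4 vary, the rest are fully conserved
family = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
profile = build_profile(family)
print(best_window(profile, "GGGTATAATGGG"))
```

A mismatch at a fully conserved position costs more than one at a variable position, which is the information a single pairwise substitution matrix cannot express.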

The HMM era also drove gene prediction. GENSCAN used a generalised HMM to predict gene structure in genomic DNA, integrating signal-detection components (splice sites, start codons) into a probabilistic framework that could be trained on annotated genomes (Burge and Karlin, 1997). When the human genome was published in 2001, HMM-based annotation pipelines were the standard.

In parallel, microarrays produced the first large-scale gene-expression data, and the field absorbed the supervised-learning toolkit from broader machine learning. Support vector machines were applied to cancer classification, gene-function prediction, and protein-fold recognition. Random forests followed. Logistic regression and naïve Bayes carried most everyday tasks. The methods were borrowed, not biology-specific; the contribution was usually domain knowledge encoded in the features, not in the algorithm.

The era’s ceiling was set by feature engineering. To classify protein function with an SVM, someone had to design features: amino-acid composition, predicted secondary structure, hydrophobicity profiles, sequence-derived physicochemical properties. The classifier was only as good as the features, and good features required deep domain knowledge plus laborious manual design. By 2010, the obvious bottleneck was that the most informative features were probably not the ones humans were designing.

Deep Learning Enters Biology (2012-2018)

AlexNet won the ImageNet competition in 2012. Within three years, the architecture’s core idea, learning hierarchical features end-to-end from raw inputs, had migrated into biology.

The earliest sustained traction was in regulatory genomics. DeepBind (Alipanahi et al., 2015) used convolutional neural networks to predict transcription-factor binding from raw DNA sequence, outperforming the position-weight-matrix methods that had been standard for two decades. DeepSEA extended the approach to predicting chromatin features and non-coding variant effects at single-nucleotide resolution (Zhou and Troyanskaya, 2015). Basset learned chromatin-accessibility code across cell types (Kelley et al., 2016); DanQ added a recurrent layer to capture motif grammar (Quang and Xie, 2016); Basenji moved toward longer-range, quantitative regulatory profiles (Kelley et al., 2018). By 2017, deep learning was the dominant approach in regulatory genomics, and the field had a working assumption that learned features would beat hand-engineered features almost everywhere.

The next domain was variant calling, the routine task of detecting genetic variants from short-read sequencing data. DeepVariant (Poplin et al., 2018) reframed variant calling as an image-classification problem: read pileups became images, and a convolutional network classified candidate sites. DeepVariant became the first deep-learning method to win the precisionFDA Truth Challenge in 2016 and remains a standard pipeline component in clinical sequencing.

The third domain was protein structure prediction. AlphaFold 1 (Senior et al., 2020, debuting at CASP13 in 2018) used a deep residual network to predict distograms, probability distributions over the distances between residue pairs, which were then used to assemble structures. The result was a substantial step beyond prior physics- and template-based methods, enough to win CASP13 by a meaningful margin. But CASP13 also showed that the architecture was a stepping stone: AlphaFold 1’s accuracy on novel folds was still well short of experimental, and competitors were closing the gap.

The defining feature of this era was that deep learning’s role expanded from “tried in biology” to “expected in biology” for sequence-input tasks. The CNN-and-RNN era did not produce a CASP14-level moment. It produced a steady accumulation of methods that quietly beat their predecessors on standard benchmarks, often by 5-15%, with the occasional larger gain.

The Generative Breakthrough (2018-2024)

The next era is the one most readers know about. It was compressed: a five-year window in which structure prediction, design, and biomolecular interaction prediction each crossed thresholds that had been considered hard for decades.

Structure prediction. AlphaFold 2 (Jumper et al., 2021) won CASP14 in late 2020 with median backbone accuracy comparable to experimental structures on most single-chain targets. Two architectural ideas were central: the Evoformer block, which processed evolutionary sequence information and pairwise residue relationships jointly, and the structure module, which produced geometry directly from learned representations. RoseTTAFold (Baek et al., 2021) followed soon after with a three-track architecture and demonstrated that the breakthrough was reproducible by an independent group. ESMFold (Lin et al., 2023) showed that a protein language model alone, without explicit multiple sequence alignments, could fold many proteins at accuracy approaching AlphaFold 2 while running roughly an order of magnitude faster.

Structural proteome at scale. The AlphaFold Protein Structure Database (Varadi et al., 2024) released over 214 million predicted structures, effectively the predicted structural proteome of life. This made structure a free input to every downstream method, replacing the assumption that structures were rare and expensive. The community assessment of AlphaFold 2’s utility across structural-biology workflows (Akdel et al., 2022) documented both the genuine impact and the persistent gaps: disordered regions, conformational ensembles, large complexes, and ligand-bound states remained hard.

Protein design. RFdiffusion (Watson et al., 2023) adapted diffusion models, the same architecture class behind image generation, to backbone generation. Given a functional constraint (a binding site, a desired fold), RFdiffusion could generate novel protein backbones that did not exist in nature, validated experimentally at meaningful rates. The Baker lab and collaborators have since shown that the design-validate-iterate loop can produce binders to specified targets, novel enzymes, and small functional proteins with reasonable success rates.

Variant interpretation. AlphaMissense (Cheng et al., 2023) used AlphaFold-derived features to classify approximately 89% of the roughly 71 million possible human missense variants as likely benign or likely pathogenic, providing the largest single resource for variant interpretation. The system is a research tool; clinical use requires the standard regulatory pathway and independent validation, but it has reshaped how variant-of-unknown-significance backlogs are triaged.

Biomolecular interactions. AlphaFold 3 (Abramson et al., 2024) extended prediction beyond proteins alone to interactions involving nucleic acids, ions, small-molecule ligands, and post-translationally modified residues, using a diffusion-based generative head. The initial release restricted access (web server with usage limits, code withheld), which prompted independent reproductions: Chai-1 (Chai Discovery) and Boltz-1 (MIT) reported AlphaFold 3-class performance with openly released code and weights by late 2024.

External recognition. The 2024 Nobel Prize in Chemistry was shared between Demis Hassabis and John Jumper of Google DeepMind (for AlphaFold’s protein structure prediction) and David Baker of the University of Washington (for computational protein design, work that includes RoseTTAFold and RFdiffusion). It was the first Nobel awarded primarily for a deep-learning result in biology.

The pattern of this era is worth naming. Each advance was driven by a transformer-class architecture plus a much larger training corpus than the previous generation could use, plus a re-framing of the problem that exposed the architecture to the right signal. Each advance produced a discontinuous improvement on a long-standing blind benchmark, followed by rapid incremental saturation. Each advance was met by independent reproduction within 12-18 months, which became the operational definition of “the advance is real.”

The Foundation Model Era (2021-Present)

The current era is defined by the same architectural family that drives large language models, applied across biological data modalities. The defining property is pre-training: a model is trained on a very large unlabeled corpus with a self-supervised objective (masked-token prediction, next-token prediction, contrastive prediction), then fine-tuned or queried for specific downstream tasks.
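The masked-token objective mentioned above can be made concrete with a sketch of the data-preparation step. This shows only how inputs and labels are constructed, not a model or training loop; the 15% mask rate, the `<mask>` token string, and the function name are assumptions for illustration, and published models typically use more elaborate corruption schemes.

```python
import random

MASK = "<mask>"  # placeholder token; real tokenizers define their own


def mask_tokens(tokens, mask_rate=0.15, rng=None):
    """Hide a random subset of tokens and return (inputs, labels).

    labels holds the original token at masked positions and None elsewhere,
    so a loss would be computed only where the model has to fill a gap.
    Assumes a non-empty token list.
    """
    rng = rng or random.Random()
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_mask)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for p in positions:
        labels[p] = tokens[p]
        inputs[p] = MASK
    return inputs, labels


# Hypothetical protein fragment, one token per residue
sequence = list("MKTAYIAKQRQISFVKSHFSRQ")
inputs, labels = mask_tokens(sequence, rng=random.Random(0))
print("".join(t if t != MASK else "_" for t in inputs))
```

No annotation is needed to build these training pairs: the sequence supplies its own labels, which is why the approach scales to unlabeled corpora of hundreds of millions of sequences.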

Protein language models. ESM-2 (Meta, 2022) and ESM-3 (EvolutionaryScale, 2024) are trained on hundreds of millions to billions of protein sequences with a masked-language-modelling objective. The learned representations transfer to structure prediction, mutation-effect prediction, function annotation, and design. ESMFold (above) is the protein-structure application of the ESM-2 representation.

Genomic foundation models. Evo (Arc Institute, 2024) and related models train on whole-genome sequences with a next-token-style objective, producing representations that transfer to regulatory annotation, variant-effect prediction, and generative design of sub-genomic elements. The scaling laws look qualitatively similar to those in language modelling, though the question of whether genomic-scale generation is biologically meaningful in the same way text generation is meaningful remains open.

Single-cell foundation models. Geneformer (Theodoris et al., 2023), scGPT, and scFoundation are pre-trained on tens of millions of single-cell transcriptomes. They transfer to cell-type annotation, perturbation prediction, and trajectory inference. The cells, tissues, and systems biology chapters of this handbook treat their evaluation in detail; the short version is that they are clearly useful but the evidence that they exceed strong task-specific baselines is mixed.

Multimodal models. Boltz-2 (MIT, 2024-2025), AlphaFold 3, and Chai-1 increasingly combine structural, sequence, and chemical inputs in a single model. The current direction is toward biology-wide foundation models that handle sequence, structure, interaction, and modification in a unified representation.

The field is young: the term “biology foundation model” became routine in 2022-2023. Whether scaling will continue to deliver, whether emergent biological reasoning will appear in larger models, and whether the right pre-training objective for biology is the language-model objective at all are all open questions. The evidence is not yet conclusive.

What Fifty Years of History Tells Us

Three durable patterns emerge from the arc.

Blind benchmarks are the most credible evidence. CASP is the model: results submitted before targets are revealed, evaluated by independent assessors, published in a community-wide assessment. The structure-prediction breakthroughs were validated this way: AlphaFold 1 at CASP13, AlphaFold 2 at CASP14, RoseTTAFold against AlphaFold 2 (with shared targets and consistent assessment). Capability claims in biology that have no equivalent blind-benchmark evidence are claims, not results. The cells and systems chapters, therapeutics chapters, and benchmarks chapter all describe what blind-benchmark coverage looks like in each subfield, and what areas still lack it.

Each era’s breakthrough becomes the next era’s preprocessing step. BLAST is still inside foundation-model training pipelines; profile HMMs are still inside Pfam classification; AlphaFold predictions are inputs to AlphaMissense, RFdiffusion, and structure-based design pipelines. The lesson is not that the old method was wrong; it is that each generation expands the substrate the next generation can use. Methods are not replaced; they are demoted to infrastructure.

The next breakthrough is rarely visible from inside the current paradigm. In 2010, almost no one in computational biology predicted that the field’s next decade would be defined by an architecture (the transformer) that did not yet exist. In 1995, almost no one predicted that the dominant method of the 2000s would be support vector machines from another field. The fields that have had their CASP14 moment (protein structure, variant calling) provide a template for what such a moment looks like. The fields that have not (cellular function, drug response prediction, microbiome interpretation) are where the next decade’s surprises will come from.

The handbook’s organisation reflects this history. The molecular-biology chapters cover problems where the breakthrough has happened. The cellular and systems-biology chapters cover problems where it has partly happened. The therapeutic-discovery chapters cover problems where translation is still a harder constraint than prediction. The automation chapters cover the laboratory infrastructure that any next breakthrough will depend on. Reading any one of those chapters against this history is a practical way to calibrate where a capability claim sits in the arc, and what the prior generation’s failure modes still imply.

How to Use This History

History is useful for calibration, not prediction. The demonstrated pattern is that AI in biology becomes consequential when a method, dataset, benchmark, and decision align. BLAST made sequence search practical. Profile HMMs made protein-family and gene-structure work routine. DeepVariant showed that deep learning could beat established variant-calling pipelines. AlphaFold 2 changed structure prediction because CASP exposed the improvement under blinded conditions. RFdiffusion and related design tools matter because laboratory validation turned generation into measurable design evidence.

The practical lesson is narrower than “the next model will solve the next problem.” It is that cellular perturbation, regulatory genomics, therapeutic translation, and closed-loop laboratories need their own benchmark cultures before broad claims deserve the same weight as structure prediction. Scaling, better data, and new architectures may improve biology models, but the historical record argues for benchmark discipline rather than architectural determinism.

For practitioners, three operational implications follow from this history:

Always ask which generation a method belongs to. A 2024 paper that uses an LSTM for a sequence task is using an older architecture. That is not automatically a problem (the data may suit it; baselines matter), but it is information. Conversely, a foundation-model approach to a problem with 200 labelled examples is an architecture choice without a data justification.
Ask for the blind-benchmark result, not the held-out test set. Self-reported test-set accuracy is the weakest evidence in this history. Performance on an independent blind benchmark (CASP, CAMEO, PoseBusters, BindingDB hold-outs, GuacaMol for molecular generation) is the strongest. Most vendor claims live between these two; treat them accordingly. The Benchmarks for Bio AI chapter catalogues which blind benchmarks exist in each subdomain.
Treat “AI-powered” branding as a question, not an answer. Every era of this history has had branding that absorbed the previous era’s methods under the new label. “AI-powered” pipelines in 2026 routinely include BLAST, profile HMMs, and CNN classifiers from earlier eras. The question is whether the methods that justify the label produce the outcomes the claim implies. The Evaluation Principles for Life Sciences AI chapter sets out a workable framework for that question.