The Life Sciences AI Handbook: Steering Frontier Models in Biology

Tegomoh, Bryan

Variant Effect Prediction

Published

July 7, 2026

Variant effect prediction sits between sequencing and clinical decision-making. A modern exome or genome produces thousands to millions of variants per individual; the clinical question is which of them matter. AI variant predictors (AlphaMissense, EVE, ESM-1v, PrimateAI, SpliceAI, REVEL, CADD, DeepSEA, Enformer, Nucleotide Transformer, GPN-MSA, AlphaGenome) are useful as research and triage tools; the ACMG/AMP framework remains the clinical standard for classification. Using a predictor score as the sole basis for a clinical action is a misuse of the tool. The right framing is that AI provides one strength-of-evidence input into a structured framework that also weighs population data, functional studies, segregation, and prior literature.

Learning Objectives

Use this chapter to:

Prioritize genetic variants by likely molecular or clinical effect while preserving the distinction between prediction and interpretation.
Treat pathogenicity, splicing, regulatory effect, protein function, ancestry, penetrance, and clinical context as different evidence layers.

Prerequisites: Protein Structure Prediction recommended for the AlphaFold-feature context that underlies AlphaMissense; Evaluation Principles for Life Sciences AI for the calibration discipline.

Chapter Summary (TL;DR)

Summary: Prioritize genetic variants by likely molecular or clinical effect while preserving the distinction between prediction and interpretation. Missense and splice prediction are useful triage tools; clinical-grade interpretation still depends on ACMG/AMP logic, segregation, function, and population evidence.

Key point: Pathogenicity, splicing, regulatory effect, protein function, ancestry, penetrance, and clinical context are different evidence layers. Open question: how much prediction can safely change variant interpretation without functional, segregation, or clinical evidence.

Bottom line: Variant-effect work links genomics to precision medicine, target discovery, protein function, diagnostics, and population-scale biology.

Field Guide

What is this field trying to solve? Prioritize genetic variants by likely molecular or clinical effect while preserving the distinction between prediction and interpretation.

What is the core idea? Pathogenicity, splicing, regulatory effect, protein function, ancestry, penetrance, and clinical context are different evidence layers.

What is the current state of the field? Missense and splice prediction are useful triage tools; clinical-grade interpretation still depends on ACMG/AMP logic, segregation, function, and population evidence.

What do we know, and what remains open? Known reference points include AlphaMissense, EVE, ESM-1v, PrimateAI, SpliceAI, REVEL, Enformer, ClinVar, gnomAD, UK Biobank, ClinGen, and MAVE datasets. What remains open is how much prediction can safely change variant interpretation without functional, segregation, or clinical evidence.

Why does this matter? Variant-effect work links genomics to precision medicine, target discovery, protein function, diagnostics, and population-scale biology.

Introduction

Modern human genomics produces variants in industrial quantities. A clinical exome typically yields 20,000-30,000 variants; a whole-genome sequence yields several million. Some small fraction of these variants drive the phenotype that prompted the test. The clinical question is which ones, and the answer requires evidence from population frequency, functional studies, in-silico prediction, segregation, and prior literature.

Variant effect prediction is the in-silico evidence component. The field has produced a steady cadence of predictors over the past decade: CADD (2014), DeepSEA (2015), REVEL (2016), PrimateAI (2018), SpliceAI (2019), EVE (2021), Enformer (2021), ESM-1v (2021), AlphaMissense (2023), Nucleotide Transformer (2025), GPN-MSA (2025), and disease-specific variant models. Each addresses a slightly different problem (missense, splicing, non-coding, regulatory) with a different architecture and training data.

The clinical use of these predictors is bounded by the ACMG/AMP framework (Richards et al., 2015), which defines five classification tiers (benign, likely benign, uncertain significance, likely pathogenic, pathogenic) and assigns evidence-strength weights to different kinds of supporting data. Computational predictors fit into the framework through the PP3 (computational evidence of pathogenicity) and BP4 (computational evidence of benignity) lines. The framework, not the predictor, is the standard.

This chapter applies the evidence framework to variant predictors specifically. The discipline matters because variant interpretation is one of the highest-stakes places where AI-derived scores meet clinical decisions.

What is demonstrated?

Missense variant prediction

Missense variants change one amino acid in a protein. They are the most-studied variant class for AI prediction because the structural and evolutionary context is interpretable.

AlphaMissense (Cheng et al., 2023) is the most recent and widely covered missense predictor. Trained on a combination of AlphaFold-derived structural features, protein language model evidence, and population data, it scored the roughly 71 million possible human missense substitutions and classified approximately 89% of them (about 57% likely benign and 32% likely pathogenic); the remainder were left ambiguous. The Science paper positioned AlphaMissense as a research tool with potential clinical utility, conditional on appropriate validation and framework integration.
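The three-way split described above comes from binning a continuous score. A minimal sketch, assuming the cutoffs reported with the AlphaMissense release (0.34 and 0.564); the function name is hypothetical, and the resulting labels are triage bins, not ACMG/AMP classifications:

```python
# Cutoffs as reported with the AlphaMissense release. Treat them as
# assumptions to re-validate locally, not as clinical thresholds.
LIKELY_BENIGN_BELOW = 0.34
LIKELY_PATHOGENIC_ABOVE = 0.564


def alphamissense_bin(score: float) -> str:
    """Map a 0-1 pathogenicity score to a research triage bin."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score < LIKELY_BENIGN_BELOW:
        return "likely_benign"
    if score > LIKELY_PATHOGENIC_ABOVE:
        return "likely_pathogenic"
    return "ambiguous"


for s in (0.10, 0.50, 0.90):
    print(s, alphamissense_bin(s))
```

Keeping the cutoffs as named constants makes it explicit which thresholds a pipeline used and simplifies swapping in laboratory-calibrated values.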

EVE (Frazer et al., 2021) takes a different approach: an evolutionary generative model trained on protein family multiple sequence alignments, producing unsupervised pathogenicity predictions. The Nature paper showed competitive performance with supervised methods, with the conceptual advantage that EVE does not depend on labelled disease variants for training.

ESM-1v (Meier et al., 2021, preprint) demonstrated that protein language models trained on sequence alone produce useful zero-shot variant effect predictions, without explicit supervision. The preprint remains the canonical reference; the work has been extended in subsequent ESM model releases.
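One common way to obtain a zero-shot score from a protein language model is the masked-marginal log-odds: mask the variant position, then compare the model's log-probability of the mutant residue with that of the wild-type residue. The sketch below illustrates only the arithmetic. The probabilities are hypothetical stand-ins for model output, and no ESM model is loaded.

```python
import math


def masked_marginal_score(position_probs: dict[str, float], wt: str, mut: str) -> float:
    """Return log p(mut) - log p(wt) at one masked position.

    More negative values mean the model finds the mutant residue
    less plausible than the wild-type residue in this context.
    """
    return math.log(position_probs[mut]) - math.log(position_probs[wt])


# Hypothetical model output for one masked position (wild type: leucine).
probs = {"L": 0.62, "I": 0.20, "V": 0.15, "P": 0.03}

score_conservative = masked_marginal_score(probs, wt="L", mut="I")
score_disruptive = masked_marginal_score(probs, wt="L", mut="P")
print(round(score_conservative, 2), round(score_disruptive, 2))
```

In this toy example the leucine-to-proline substitution scores lower than leucine-to-isoleucine. The score is a relative plausibility, not a probability of pathogenicity.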

PrimateAI (Sundaram et al., 2018) used common-variant data from non-human primates as a proxy for benignity and trained a deep network on the resulting labels. The Nature Genetics paper introduced this evolutionary-population-data approach. PrimateAI-3D extended the method by incorporating structural features in subsequent work.

Splicing variant prediction

SpliceAI (Jaganathan et al., 2019) predicts splice-altering effects from primary DNA sequence. The Cell paper demonstrated identification of cryptic splice sites that would be missed by traditional splice-site heuristics, and the predictor has become a standard input in clinical genetics pipelines. Many laboratories have specific score thresholds for SpliceAI evidence inside ACMG/AMP classifications.
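As an illustration of how such thresholds are applied, the sketch below parses a SpliceAI VCF annotation string and bins the maximum delta score. The field order follows the SpliceAI output format (allele, gene symbol, four delta scores, four delta positions). The gene symbol, the example values, and the default cutoffs (0.2 and 0.8, two of the cutoffs suggested by the tool's authors) are placeholders to be replaced by laboratory-validated values.

```python
FIELDS = ["ALLELE", "SYMBOL", "DS_AG", "DS_AL", "DS_DG", "DS_DL",
          "DP_AG", "DP_AL", "DP_DG", "DP_DL"]
DELTA_SCORES = ("DS_AG", "DS_AL", "DS_DG", "DS_DL")


def parse_spliceai(annotation: str) -> dict:
    """Parse one pipe-delimited SpliceAI annotation into typed fields.

    Assumes all score fields are populated (no missing-value markers).
    """
    parts = annotation.split("|")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    record = dict(zip(FIELDS, parts))
    for key in DELTA_SCORES:
        record[key] = float(record[key])
    for key in FIELDS[6:]:
        record[key] = int(record[key])
    return record


def max_delta(record: dict) -> float:
    """Largest of the acceptor/donor gain/loss delta scores."""
    return max(record[key] for key in DELTA_SCORES)


def triage_bin(delta: float, review_at: float = 0.2, high_at: float = 0.8) -> str:
    """Bin a delta score; the cutoffs are placeholders, not validated values."""
    if delta >= high_at:
        return "high"
    if delta >= review_at:
        return "review"
    return "low"


# Hypothetical annotation for an unnamed gene.
rec = parse_spliceai("T|GENE1|0.00|0.00|0.91|0.08|-28|-46|-2|-31")
print(max_delta(rec), triage_bin(max_delta(rec)))
```

The bin is a prompt for review inside the ACMG/AMP workflow; it does not by itself establish a splicing defect.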

Non-coding variant prediction

DeepSEA (Zhou and Troyanskaya, 2015) was the first widely cited deep-learning model for non-coding variant effect prediction, scoring chromatin-feature impact at single-nucleotide resolution.

Enformer (Avsec et al., 2021) extended this approach with a transformer architecture that integrates long-range regulatory context, predicting gene expression and chromatin states from input sequences roughly 200 kilobases long. The Nature Methods paper made Enformer the canonical model for non-coding regulatory variant effects.

Nucleotide Transformer (Dalla-Torre et al., 2025) trained foundation-style transformers on diverse human and multi-species genome data, with downstream applications across variant effect, regulatory annotation, and chromatin features. The Nature Methods paper positioned this as a generalisable genome foundation model.

AlphaGenome (Avsec et al., 2026) extends this lineage with a long-context model for regulatory variant-effect prediction across tracks such as chromatin, expression, and splicing. The Nature paper supports AlphaGenome as a high-capacity regulatory predictor, not as a clinical classifier. Cell-type coverage, assay modality, calibration, and laboratory validation remain decisive.

GPN-MSA (Benegas et al., 2025) uses multispecies alignments to train a DNA language model for coding and noncoding variant effects. Its contribution is evolutionary-context scoring across the genome; the clinical constraint is unchanged: score calibration, phenotype context, and ACMG/AMP evidence rules still determine use.

Disease-specific fine-tuning is most useful when the biological context is narrow and well defined. Zhan and colleagues reported a disease-specific language model for cardiac and regulatory genomics (Zhan et al., 2025). That evidence supports context-specific predictor improvement, not a replacement for disease-gene curation, functional assays, or ClinGen specifications.

Ensemble baselines: REVEL and CADD

REVEL (Ioannidis et al., 2016) is an ensemble random-forest predictor that combines outputs from multiple older predictors (SIFT, PolyPhen, MutationAssessor, FATHMM and others). It remains a strong baseline for missense pathogenicity and is widely used inside ACMG/AMP classifications.

CADD (Kircher et al., 2014) was the first widely adopted ensemble predictor across variant types (missense, splicing, non-coding). The Nature Genetics paper trained a linear support vector machine on a large set of annotations to separate simulated variants from variants that have survived selection in the human lineage; later CADD releases moved to logistic regression. CADD scores are still reported alongside newer predictors in many clinical pipelines.

Functional and population evidence set the reference points for predictor interpretation. Saturation genome editing can produce direct functional maps for genes such as BRCA1 (Findlay et al., 2018), while gnomAD quantifies mutational constraint and population-frequency evidence across large human cohorts (Karczewski et al., 2020). Predictor scores are strongest when they are interpreted against those independent evidence streams.

The clinical framework: ACMG/AMP

The ACMG/AMP guidelines (Richards et al., 2015) are the foundational document for clinical variant interpretation. The framework:

Defines five classification tiers (benign, likely benign, uncertain significance, likely pathogenic, pathogenic)
Specifies evidence categories (population data, computational data, functional data, segregation, allelic data, others) with strength weights (very strong, strong, moderate, supporting)
Provides combination rules that map evidence portfolios to classification tiers
Mandates laboratory-specific score-threshold validation for any computational predictor incorporated into the pipeline
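The combination rules can be sketched in code. What follows is a simplified transcription of the combining rules in Richards et al. (2015), under stated assumptions: evidence arrives as (criterion code, strength) pairs, the direction is read from the first letter of the code, and ClinGen strength modifications and gene-specific specifications are ignored. The function name and data layout are hypothetical.

```python
from collections import Counter


def classify(evidence: list[tuple[str, str]]) -> str:
    """Apply simplified 2015 ACMG/AMP combining rules to evidence pairs."""
    path = Counter(s for code, s in evidence if code.startswith("P"))
    ben = Counter(s for code, s in evidence if code.startswith("B"))
    vs, st = path["very_strong"], path["strong"]
    mo, su = path["moderate"], path["supporting"]

    pathogenic = (
        (vs >= 1 and (st >= 1 or mo >= 2 or (mo >= 1 and su >= 1) or su >= 2))
        or st >= 2
        or (st >= 1 and (mo >= 3 or (mo >= 2 and su >= 2) or (mo >= 1 and su >= 4)))
    )
    likely_pathogenic = (
        (vs >= 1 and mo >= 1)
        or (st >= 1 and mo >= 1)
        or (st >= 1 and su >= 2)
        or mo >= 3
        or (mo >= 2 and su >= 2)
        or (mo >= 1 and su >= 4)
    )
    benign = ben["stand_alone"] >= 1 or ben["strong"] >= 2
    likely_benign = (
        (ben["strong"] >= 1 and ben["supporting"] >= 1) or ben["supporting"] >= 2
    )

    path_call = ("pathogenic" if pathogenic
                 else "likely pathogenic" if likely_pathogenic else None)
    ben_call = "benign" if benign else "likely benign" if likely_benign else None

    # Conflicting or insufficient evidence defaults to uncertain significance.
    if path_call and ben_call:
        return "uncertain significance"
    return path_call or ben_call or "uncertain significance"


# A computational score alone (PP3 at supporting strength) cannot classify.
print(classify([("PP3", "supporting")]))
print(classify([("PVS1", "very_strong"), ("PM2", "moderate")]))
```

The first call returns uncertain significance, which is the point of the framework: a predictor contributes one supporting-level input and needs other evidence categories before any tier above uncertain is reached.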

The framework has been refined through ClinGen sequence-variant interpretation work, gene-specific specifications, and evidence-category recommendations. ClinGen recommendations give more explicit treatment to functional evidence under PS3/BS3 (Brnich et al., 2020) and computational evidence under PP3/BP4, including score calibration and threshold discipline (Pejaver et al., 2022). Clinical laboratories should consult current ClinGen specifications before pipeline integration.

Evidence anchor summary

Evidence Anchor	What It Supports	Practical Constraint
AlphaMissense	Proteome-scale missense triage	Research tool; not a classification
EVE	Unsupervised missense pathogenicity from evolution	Family-level coverage varies; rare protein families undersupported
ESM-1v	Zero-shot missense effects from sequence	Preprint reference; newer ESM-class methods may supersede
PrimateAI	Missense pathogenicity from primate population data	Calibration depends on the primate variant sample
SpliceAI	Cryptic splice-site identification	Performance depends on gene-specific splicing biology
DeepSEA	Single-nucleotide chromatin impact	Limited regulatory-context window
Enformer	Long-range non-coding regulatory impact	Tissue and cell-type coverage in training shapes prediction
Nucleotide Transformer	Foundation-style genome prediction	Performance varies by task; evaluate per regulatory category before use
AlphaGenome	Long-context regulatory variant prediction	Research tool; not a clinical classification
GPN-MSA	Multispecies alignment-based variant scoring	Evolutionary context helps; clinical actionability still requires ACMG/AMP framing
Disease-specific VEP models	Fine-tuned predictors for constrained disease contexts	Disease specificity does not remove laboratory validation
REVEL	Missense ensemble baseline	Inherits the limits of the inputs it ensembles
CADD	Cross-variant ensemble baseline	Older; useful as a reference point
ACMG/AMP (Richards 2015)	The clinical variant classification standard	Predictor scores are inputs, not classifications

What is theoretical?

Several capabilities are plausible but not yet routine.

Whole-genome variant effect prediction at clinical quality. Models that predict pathogenicity uniformly across missense, splicing, regulatory, and structural variants, and across combinations of them, would change the clinical workflow. AlphaGenome and Nucleotide Transformer point in this direction; clinical-quality whole-genome variant interpretation is still future work.

Per-laboratory calibrated predictor outputs. Each laboratory’s variant distribution differs (referral pattern, ancestry mix, gene panel). Calibrating predictor scores at the laboratory level is feasible and is becoming clinical practice for the most-used predictors; making it routine for every clinical pipeline remains unfinished work.
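A minimal sketch of what laboratory-level calibration can look like: estimate a positive likelihood ratio for a score interval from the laboratory's own truth set, then map it to an evidence strength. The strength boundaries assume the Bayesian adaptation of the ACMG/AMP rules, in which very strong evidence corresponds to odds of 350 and strong, moderate, and supporting are its square, fourth, and eighth roots. The counts are hypothetical, and this simple binned estimate is not the procedure used in ClinGen's calibration work.

```python
ODDS_VERY_STRONG = 350.0

# Odds thresholds, strongest first (assumed Bayesian ACMG/AMP adaptation).
STRENGTH_THRESHOLDS = [
    ("very_strong", ODDS_VERY_STRONG),
    ("strong", ODDS_VERY_STRONG ** 0.5),      # about 18.7
    ("moderate", ODDS_VERY_STRONG ** 0.25),   # about 4.33
    ("supporting", ODDS_VERY_STRONG ** 0.125) # about 2.08
]


def positive_likelihood_ratio(path_in_bin: int, path_total: int,
                              benign_in_bin: int, benign_total: int) -> float:
    """LR+ for a score interval: P(in bin | pathogenic) / P(in bin | benign)."""
    if benign_in_bin == 0:
        raise ValueError("no benign variants in bin; widen the interval")
    return (path_in_bin / path_total) / (benign_in_bin / benign_total)


def pathogenic_strength(lr: float) -> str:
    """Map a likelihood ratio to the strongest evidence level it supports."""
    for name, threshold in STRENGTH_THRESHOLDS:
        if lr >= threshold:
            return name
    return "none"


# Hypothetical truth set: 200 pathogenic and 400 benign variants.
lr_high = positive_likelihood_ratio(120, 200, 12, 400)  # top score interval
lr_mid = positive_likelihood_ratio(40, 200, 20, 400)    # middle interval
print(pathogenic_strength(lr_high), pathogenic_strength(lr_mid))
```

With these made-up counts the top interval supports strong evidence and the middle interval only supporting evidence. Small truth sets give unstable ratios, so confidence intervals and ancestry- or gene-stratified checks belong in any real validation.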

Functional-evidence integration at scale. Massively parallel reporter assays, deep mutational scanning, and saturation mutagenesis produce direct functional evidence for tens of thousands of variants per gene. Combining this evidence with predictor scores in a framework-compliant way is the natural next layer of evidence integration.

Predictor disagreement as signal. When AlphaMissense, EVE, ESM-1v, and PrimateAI agree, confidence is higher. When they disagree, the variant is interesting in a way that single-score thresholds do not capture. Multi-predictor disagreement signals are an active area of research.
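A disagreement flag of this kind can be sketched as follows. Each predictor gets its own cutoffs and score direction, each score becomes a three-way call, and the calls are then compared. The predictor names and every cutoff are hypothetical placeholders, not published or validated thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cutoffs:
    benign_side: float
    pathogenic_side: float
    higher_is_damaging: bool = True


def call(score: float, c: Cutoffs) -> str:
    """Turn one predictor score into pathogenic / benign / ambiguous."""
    sign = 1.0 if c.higher_is_damaging else -1.0
    s = sign * score
    if s >= sign * c.pathogenic_side:
        return "pathogenic"
    if s <= sign * c.benign_side:
        return "benign"
    return "ambiguous"


def concordance(calls: dict[str, str]) -> str:
    """Summarise agreement across predictors, ignoring ambiguous calls."""
    decided = {c for c in calls.values() if c != "ambiguous"}
    if not decided:
        return "uninformative"
    if len(decided) == 1:
        return f"concordant_{decided.pop()}"
    return "discordant"


# Hypothetical predictors; predictor_c uses log-odds where lower is worse.
CUTOFFS = {
    "predictor_a": Cutoffs(0.3, 0.7),
    "predictor_b": Cutoffs(0.4, 0.6),
    "predictor_c": Cutoffs(-2.0, -7.0, higher_is_damaging=False),
}


def summarise(scores: dict[str, float]) -> str:
    return concordance({name: call(s, CUTOFFS[name]) for name, s in scores.items()})


print(summarise({"predictor_a": 0.9, "predictor_b": 0.8, "predictor_c": -9.0}))
print(summarise({"predictor_a": 0.9, "predictor_b": 0.2, "predictor_c": -5.0}))
```

Predictors that share training data or input features are not independent, so a concordant result counts for less than the number of agreeing tools suggests.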

Cross-population variant interpretation. Most predictor training is dominated by European-ancestry data. Performance differs across populations in ways that are not always reported. Population-aware variant interpretation at clinical quality has been only partly achieved; meaningful improvement requires both data investment and methodological work.

What is beyond current capability?

A few framing claims are not supported by current evidence.

AI replaces ACMG/AMP variant classification. It does not. The framework integrates many evidence categories; computational prediction is one. No predictor produces a clinically actionable classification by itself.

Single-predictor scores can drive clinical decisions. They cannot. Laboratory validation, framework integration, and clinical context are mandatory. A laboratory that uses AlphaMissense outputs as classifications is not practising at the standard of care.

Predictor performance generalises across all genes. It does not. Many predictors have substantial gene-family-specific performance differences. Gene-specific calibration is required for high-stakes use.

Non-coding variant interpretation is solved. It is not. The non-coding genome is most of the genome, and the prediction tools (DeepSEA, Enformer, Nucleotide Transformer, AlphaGenome) operate at the research-tool level for now. Clinical actionability of non-coding variants beyond well-characterised regulatory regions remains uncommon.

What would make this more promising?

Variant prediction becomes more promising with prospective evidence that calibrated predictor use improves variant classification quality inside ACMG/AMP workflows without increasing false pathogenic or false benign calls. Stronger evidence would include laboratory-specific calibration across ancestry groups, gene families, variant classes, and referral patterns, followed by functional or segregation evidence on variants whose classifications changed. Non-coding claims need perturbational validation and disease-context evidence before they move from research triage to clinical interpretation.

What should researchers, biotech teams, funders, and program leaders do with this?

For researchers and clinical-laboratory users of variant effect predictors:

Use the ACMG/AMP framework as the spine. Predictor scores are PP3/BP4 evidence inputs; they are not the classification. Documented score thresholds and combination rules belong in the pipeline.
Validate per-laboratory. Predictor calibration on the variant set the laboratory actually sees is the prerequisite for any clinical use. Generic published thresholds are a starting point, not an endpoint.
Use multiple predictors and read disagreement. AlphaMissense, EVE, ESM-1v, and PrimateAI capture overlapping but distinct signal. Concordant calls strengthen evidence; discordant calls flag variants for closer review.
Match predictor to variant class. SpliceAI for splice-region variants. Enformer or DeepSEA for non-coding regulatory variants. Missense-specific predictors for missense. Generic high-score thresholds across categories produce noise.
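Cite predictor versions and dates. Labels such as AlphaMissense V2, REVEL 2020, and SpliceAI 1.3 each identify a specific artefact, and scores from different releases of the same tool are not interchangeable. Versioning matters for reproducibility and for understanding score drift.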
Read the calibration before the headline number. A predictor with an AUC of 0.95 and badly calibrated probabilities is harder to use clinically than a predictor with an AUC of 0.85 and good calibration.
Distinguish research-tool from clinical-pipeline use. AlphaGenome and Nucleotide Transformer are research tools at this stage. Using them inside a clinical pipeline requires the same per-laboratory validation as any other input.
Watch for ClinGen updates. Evidence-strength recommendations for computational predictors are evolving. Stay current with ClinGen working-group guidance, gene-specific specifications, and laboratory validation requirements.
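The point about reading calibration before the headline number can be checked numerically. The sketch below uses made-up labels and scores to show that a rank-preserving distortion of the scores leaves AUC unchanged while the Brier score, which is sensitive to calibration, gets worse. Both metrics are implemented directly so that the example needs only the standard library.

```python
def auc(labels: list[int], scores: list[float]) -> float:
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(labels: list[int], probs: list[float]) -> float:
    """Mean squared error between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


# Toy data: 1 = pathogenic, 0 = benign (hypothetical).
labels = [0, 0, 0, 1, 0, 1, 1, 1]
original = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

# Strictly increasing transform: same ranking, different probabilities.
distorted = [p ** 4 for p in original]

print("AUC  ", auc(labels, original), auc(labels, distorted))
print("Brier", brier(labels, original), brier(labels, distorted))
```

Because AUC depends only on ranking, it cannot reveal whether a score of 0.9 means a 90% chance of pathogenicity. A calibration-sensitive metric or a reliability plot on the laboratory's own variants is needed for that.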