Biomedical Knowledge Graphs and Literature AI

Published: May 24, 2026

Biomedical discovery depends on an evidence layer that is larger than any single database or laboratory notebook. Publications, target-disease links, pathways, protein families, compounds, clinical programs, patents, and adverse-event signals sit in separate systems with different identifiers and evidence conventions. Biomedical knowledge graphs and literature AI are useful when they keep those links visible rather than compressing them into a confident paragraph.

Learning Objectives

This chapter gives you a working framework for biomedical knowledge graphs and literature AI. You will learn to:

  • Distinguish a curated biomedical knowledge graph from a text-mined graph and from a vector-search index
  • Read BioBERT, PubMedBERT, SciBERT, SemMedDB, Hetionet, PrimeKG, and Open Targets as different layers in the same evidence stack
  • Evaluate retrieval-augmented generation for biomedical use by checking source provenance, date, relation type, and claim alignment
  • Identify the failure modes that make a polished answer misleading: citation laundering, relation flattening, abstract-only evidence, and missing negative evidence
  • Decide where literature AI belongs in pharma business development, target triage, portfolio scanning, and competitive intelligence
  • Separate demonstrated literature-mining tasks from theoretical causal-inference claims, and both from autonomous scientific judgement, which is beyond current capability

Evidence stack:

Layer | Examples | What it supports | Main caution
Biomedical language models | BioBERT, PubMedBERT, SciBERT | Named-entity recognition, relation extraction, document classification | Text performance does not equal biological truth
Text-mined relation stores | SemMedDB | PubMed-scale subject-predicate-object triples | Predications require source and negation checks
Curated or semi-curated graphs | Open Targets, Hetionet, PrimeKG | Target prioritisation, drug repurposing, disease mechanism mapping | Edge provenance and evidence weighting matter
Retrieval systems | OpenEvidence, Causaly, Iris.ai, Elicit | Literature triage, source retrieval, due-diligence preparation | Retrieval quality and citation alignment must be audited
Internal enterprise graphs | Pharma target, assay, omics, patent, and program data | Portfolio intelligence and institutional memory | Access control, provenance, and data contracts define value

Biomedical language and graph references:

Resource | Type | Verified source
BioBERT | Biomedical language model | Lee et al., 2019
PubMedBERT | Domain-specific biomedical language model | Gu et al., 2021
SciBERT | Scientific-text language model | Beltagy et al., 2019
SemMedDB | PubMed-scale semantic predication repository | Kilicoglu et al., 2012
Hetionet | Biomedical knowledge graph for drug repurposing | Himmelstein et al., 2017
PrimeKG | Precision-medicine knowledge graph | Chandak et al., 2023
Open Targets Platform | Target-disease evidence platform | Ochoa et al., 2020

Three failures that look like progress:

Failure mode | Looks like | Actually means
Citation laundering | Answer includes a plausible paper link | The linked paper does not support the exact claim
Relation flattening | Graph says drug treats disease | Edge may mean association, trial mention, mechanism hypothesis, or text-mined co-occurrence
Abstract-only evidence | Retrieval returns clean snippets | Full paper may contain limitations, exclusions, negative results, or context absent from the abstract

Introduction

Literature AI is now a procurement category because the life sciences evidence base is too large for manual reading alone. Pharma teams use retrieval products for target scans, competitive intelligence, deal diligence, safety review, and evidence mapping. Research groups use similar methods for hypothesis triage, dataset discovery, and entity linking across papers and databases.

The hard problem is not search. It is source-grounded reasoning. A system that retrieves five papers and writes a fluent answer can still fail if it collapses “associated with,” “causes,” “treats,” “is trialed in,” and “was mentioned near” into one relation. A knowledge graph that preserves relation type, source, date, and confidence is more useful than a graph that only maximises connectivity.

Biomedical knowledge graphs and literature AI also sit underneath other chapters in the handbook. Target identification depends on evidence integration. Drug repurposing depends on graph traversal and signature matching. Real-world evidence work depends on entity resolution across clinical, molecular, and claims data. Evaluation depends on tracing every claim back to its source.

Demonstrated

Biomedical Language Models

Biomedical and scientific language models have demonstrated value in text-mining tasks where the ground truth is textual: named-entity recognition, relation extraction, question answering over abstracts, document classification, and search ranking. BioBERT adapted BERT to biomedical text and improved common biomedical text-mining benchmarks (Lee et al., 2019). PubMedBERT showed the value of pretraining from scratch on biomedical corpora rather than adapting a general-domain model (Gu et al., 2021). SciBERT applied domain-specific pretraining to a broader scientific corpus spanning computer science and biomedical papers (Beltagy et al., 2019).

The demonstrated lesson is narrow but important: domain-specific pretraining improves biomedical text handling when the task is tied to language. It does not establish biological mechanism, therapeutic efficacy, clinical utility, or target validity. Text competence is not evidence competence.

Semantic Predications and Literature-Mined Relations

SemMedDB is the canonical PubMed-scale semantic predication repository. It stores subject-predicate-object triples extracted from biomedical text, such as disease-gene, drug-disease, or compound-effect relations (Kilicoglu et al., 2012). This makes SemMedDB useful for evidence discovery, hypothesis generation, and relation lookup at scale.

The limitation is equally important. A semantic predication is not the same as a verified causal relation. Text-mined triples need checks for negation, speculation, population scope, experimental context, and source quality. If a graph edge is used in target prioritisation, the edge needs a provenance trail to the source sentence and, ideally, to the full paper.
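
The checks described above can be sketched as a filter over predication records. This is a minimal illustration under stated assumptions, not SemMedDB's actual schema: the `Predication` fields, the `negated` and `speculative` flags, and the example records are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Predication:
    """Hypothetical record for one text-mined triple (not the SemMedDB schema)."""
    subject: str
    predicate: str
    obj: str
    source_sentence: Optional[str]  # provenance: sentence the triple came from
    source_id: Optional[str]        # provenance: document identifier
    negated: bool = False           # e.g. "did not improve"
    speculative: bool = False       # e.g. "may improve"


def usable_for_prioritisation(p: Predication) -> bool:
    """Keep only asserted triples that trace back to a source sentence."""
    has_provenance = bool(p.source_sentence) and bool(p.source_id)
    return has_provenance and not p.negated and not p.speculative


def triage(preds: List[Predication]) -> Tuple[List[Predication], List[Predication]]:
    """Split predications into usable ones and ones that need human review."""
    keep = [p for p in preds if usable_for_prioritisation(p)]
    review = [p for p in preds if not usable_for_prioritisation(p)]
    return keep, review


# Made-up example records.
examples = [
    Predication("drug_A", "TREATS", "disease_X",
                "Drug A reduced symptoms of disease X.", "doc-1"),
    Predication("drug_B", "TREATS", "disease_X",
                "Drug B did not improve disease X outcomes.", "doc-2",
                negated=True),
    Predication("gene_G", "ASSOCIATED_WITH", "disease_X", None, None),
]
keep, review = triage(examples)
```

The filter routes rejected triples to review instead of deleting them, because a negated predication can itself be useful negative evidence. Population scope, experimental context, and source quality still need a reader.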

Biomedical Knowledge Graphs

Knowledge graphs organise biomedical entities into typed relationships. Hetionet integrated drugs, diseases, genes, pathways, anatomy, and side effects to prioritise repurposing hypotheses (Himmelstein et al., 2017). PrimeKG integrated disease, drug, gene, pathway, and phenotype relations for precision-medicine use cases (Chandak et al., 2023). Open Targets Platform integrates target-disease evidence from genetics, genomics, drugs, literature, and other evidence streams for systematic target prioritisation (Ochoa et al., 2020).

The strongest biomedical graphs share four traits:

  • Typed edges: the relation says what kind of evidence connects two nodes
  • Source provenance: every important edge traces to a database, publication, or curated evidence stream
  • Evidence weighting: genetics, perturbation, expression, clinical, literature, and animal evidence are not interchangeable
  • Versioning: graph changes across database releases are tracked so conclusions are reproducible
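
One way to make these four traits concrete is an edge record that cannot be stored without them. The sketch below is illustrative only: the `Edge` fields, the `EvidenceType` values, and the 0 to 1 weight range are assumptions, not the schema of Hetionet, PrimeKG, or Open Targets.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class EvidenceType(Enum):
    GENETICS = "genetics"
    PERTURBATION = "perturbation"
    EXPRESSION = "expression"
    CLINICAL = "clinical"
    LITERATURE = "literature"
    ANIMAL = "animal"


@dataclass(frozen=True)
class Edge:
    """Hypothetical edge record carrying the four traits listed above."""
    source_node: str
    relation: str                # typed edge, e.g. "associated_with"
    target_node: str
    evidence_type: EvidenceType  # which evidence stream supports the edge
    provenance: str              # database, publication, or curated stream
    weight: float                # evidence weight, assumed to lie in [0, 1]
    graph_version: str           # release the edge was read from


def validate(edge: Edge) -> List[str]:
    """Return a list of problems; an empty list means the edge is auditable."""
    problems = []
    if not edge.relation.strip():
        problems.append("untyped relation")
    if not edge.provenance.strip():
        problems.append("missing provenance")
    if not 0.0 <= edge.weight <= 1.0:
        problems.append("weight outside [0, 1]")
    if not edge.graph_version.strip():
        problems.append("missing graph version")
    return problems


good = Edge("gene_G", "associated_with", "disease_X",
            EvidenceType.GENETICS, "hypothetical-db release notes", 0.7, "v1")
bad = Edge("drug_A", "treats", "disease_X",
           EvidenceType.LITERATURE, "", 1.5, "v1")
```

Here `validate(good)` returns an empty list, while `validate(bad)` reports the missing provenance and the out-of-range weight.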

Retrieval-Augmented Literature Workflows

Retrieval-augmented generation combines source retrieval with answer drafting. In biomedical settings, the critical output is not the paragraph. It is the source set, retrieval query, date, filtering logic, and claim-to-source map. OpenEvidence, Causaly, Iris.ai, and Elicit represent the broad product category of literature retrieval and research assistance in medicine and life sciences (OpenEvidence, 2026; Causaly, 2026; Iris.ai, 2026; Elicit, 2026).

For institutional use, a retrieval workflow should preserve:

  • Query terms and synonyms
  • Corpus boundary: PubMed, patents, trial registries, company filings, internal reports, or mixed sources
  • Retrieval date
  • Ranking method
  • Inclusion and exclusion criteria
  • Exact source passages used to support each claim
  • Human reviewer sign-off for high-stakes conclusions

This is why literature AI belongs in due-diligence workflow design, not only as a chat interface. The professional artifact is an auditable evidence map.
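
The items a retrieval workflow should preserve can be captured in a single record per search run. The following is a minimal sketch, assuming a hypothetical `RetrievalRun` structure; the field names, the example claims, and the passage locators are invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


@dataclass
class RetrievalRun:
    """Hypothetical audit record for one literature search."""
    query_terms: List[str]         # terms and synonyms actually searched
    corpus: List[str]              # corpus boundary
    retrieval_date: date
    ranking_method: str
    inclusion_criteria: List[str]
    exclusion_criteria: List[str]
    # claim text -> exact source passages (or locators) that support it
    claim_sources: Dict[str, List[str]] = field(default_factory=dict)
    # claim text -> reviewer who signed it off
    reviewer_signoff: Dict[str, str] = field(default_factory=dict)

    def unsupported_claims(self) -> List[str]:
        """Claims with no source passage attached."""
        return [c for c, passages in self.claim_sources.items() if not passages]

    def unsigned_claims(self) -> List[str]:
        """Claims that no human reviewer has signed off."""
        return [c for c in self.claim_sources if c not in self.reviewer_signoff]


run = RetrievalRun(
    query_terms=["target T", "disease D"],
    corpus=["PubMed", "ClinicalTrials.gov"],
    retrieval_date=date(2026, 1, 15),  # placeholder date
    ranking_method="keyword match, newest first",
    inclusion_criteria=["human studies"],
    exclusion_criteria=["conference abstracts"],
    claim_sources={
        "Target T is expressed in diseased tissue": ["doc-1, results, paragraph 2"],
        "Target T inhibition improves outcomes": [],
    },
    reviewer_signoff={"Target T is expressed in diseased tissue": "reviewer-1"},
)
```

A report built from this record can refuse to publish while `run.unsupported_claims()` or `run.unsigned_claims()` is non-empty, which turns the audit requirement into a gate.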

Enterprise Knowledge Layers

The highest-value graph in a life sciences organisation is usually not public. It combines internal assay results, failed programs, compounds, target rationales, omics data, clinical evidence, patents, vendor reports, and institutional decisions. Public graphs and language models supply the outside evidence layer. The institutional graph supplies the memory of what the organisation already tried, rejected, licensed, or validated.

The immediate value is often duplicate-work prevention. A target scan that recalls prior internal failures, weak assay transfer, or unresolvable IP constraints is more valuable than a ranked list that only mirrors the public literature.

Theoretical

Causal Discovery from Graphs

Biomedical graphs support hypothesis prioritisation, but causal discovery remains theoretical for most real-world graph workflows. A drug-disease edge, gene-disease edge, or pathway-disease edge can point to a plausible mechanism. It does not establish that intervening on the node changes disease outcome. Causal claims require design logic, perturbation data, negative controls, biological priors, and prospective validation.

Graph structure is still useful. It can highlight mechanistic neighborhoods, missing evidence, contradictory evidence, and targets with convergent support from genetics and perturbation data. The mistake is treating graph centrality or embedding similarity as causal evidence.

Automated Evidence Grading

Automated evidence grading is plausible but not yet trustworthy as a standalone process. A literature system can sort study types, extract sample size, identify endpoints, and tag whether a source is preclinical, clinical, regulatory, or company-reported. The hard part is grading design quality, confounding, selective reporting, population fit, and claim relevance.

For life sciences diligence, the near-term pattern is human-in-the-loop evidence grading. The machine prepares the source map. Domain experts decide which evidence changes a program decision.

Cross-Corpus Intelligence

The most useful literature systems will connect publications, patents, clinical trials, conference abstracts, regulatory labels, omics datasets, and internal reports. This is theoretically powerful because early signals often appear outside peer-reviewed literature. It is also difficult because identifiers differ, corporate naming changes, and trial endpoints rarely map cleanly to mechanistic claims.

This is where entity resolution becomes a strategic capability. Without stable identifiers for genes, proteins, compounds, indications, sponsors, trial assets, and mechanisms, cross-corpus search produces noise.
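
A toy version of entity resolution shows why stable identifiers matter. This sketch assumes a hand-built synonym table; the `GENE:0001`-style identifiers are placeholders, not real database accessions, and a production system would draw on a curated nomenclature resource instead.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical synonym table. The identifiers are placeholders.
SYNONYMS: Dict[str, str] = {
    "her2": "GENE:0001",
    "erbb2": "GENE:0001",
    "neu": "GENE:0001",
    "egfr": "GENE:0002",
    "erbb1": "GENE:0002",
}


def normalise(mention: str) -> str:
    """Lowercase and strip punctuation and spaces, so 'HER-2' becomes 'her2'."""
    return re.sub(r"[^a-z0-9]", "", mention.lower())


def resolve(mention: str) -> Optional[str]:
    """Map a free-text mention to a stable identifier, or None if unknown."""
    return SYNONYMS.get(normalise(mention))


def group_mentions(mentions: List[str]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Group mentions by identifier; unresolved mentions go to a review list."""
    grouped: Dict[str, List[str]] = {}
    unresolved: List[str] = []
    for m in mentions:
        identifier = resolve(m)
        if identifier is None:
            unresolved.append(m)
        else:
            grouped.setdefault(identifier, []).append(m)
    return grouped, unresolved


grouped, unresolved = group_mentions(["HER-2", "ErbB2", "EGFR", "mystery kinase"])
```

Unresolved mentions are kept rather than dropped: a name the table does not recognise may be a renamed asset or a new entity, and silently discarding it is one way cross-corpus search loses early signals.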

Beyond Current Capabilities

Autonomous Scientific Judgement

No literature system should be treated as an autonomous scientific judge. It cannot replace the chain of evidence required to decide whether a target is viable, whether a compound is developable, whether a biomarker is predictive, or whether a company claim survives diligence. Those judgements require source review, experimental context, biological plausibility, and decision accountability.

Complete Evidence Capture

No public literature system captures all relevant evidence. Negative internal experiments, abandoned assets, unreported assay failures, unpublished tox findings, informal regulatory feedback, and confidential business decisions often determine whether a program is viable. Published literature is necessary but incomplete.

Fully Reliable Citation Grounding

Biomedical LLM failures are particularly dangerous when they combine fluent prose with plausible citations. Galactica, an early scientific-language model, became a cautionary case because the public demo produced scientific-looking but unreliable outputs and was withdrawn shortly after release (Taylor et al., 2022, preprint; Edwards, Ars Technica, November 2022). The enduring lesson is not about one product. It is that citation-shaped text is not citation verification.
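
The gap between citation-shaped text and citation verification can be illustrated with a deliberately crude check. The sketch below only tests whether the key terms of a claim appear in the cited passage at all; it is an assumption-laden first filter, it cannot detect negation or context, and "examplinib" is a made-up compound name.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def missing_terms(key_terms: List[str], passage: str) -> List[str]:
    """Return the claim's key terms that never appear in the cited passage."""
    passage_tokens = tokens(passage)
    return [t for t in key_terms if not tokens(t) <= passage_tokens]


# Hypothetical claim: "examplinib reduces fibrosis".
claim_terms = ["examplinib", "fibrosis"]

supporting = "Examplinib reduced fibrosis scores in a mouse model."
unrelated = "Fibrosis progression was measured by imaging in 40 patients."

ok = missing_terms(claim_terms, supporting)
flagged = missing_terms(claim_terms, unrelated)
```

A passage that passes this check is not thereby verified: "examplinib did not reduce fibrosis" would pass too. The check is only useful for catching the cheapest form of citation laundering before a human reads the source.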

Practice Notes

Use literature tools as evidence triage, not final authority. For every high-stakes conclusion, keep a short evidence table with claim, source, source type, publication date, evidence level, and reviewer decision.
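
A short evidence table of this kind can be kept as plain CSV. The sketch below is one possible layout: the column names follow the fields listed above, while the function name and the example row are hypothetical.

```python
import csv
import io
from typing import Dict, List

FIELDS = ["claim", "source", "source_type", "publication_date",
          "evidence_level", "reviewer_decision"]


def write_evidence_table(rows: List[Dict[str, str]]) -> str:
    """Serialise evidence rows to CSV, rejecting rows with empty fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        missing = [f for f in FIELDS if not row.get(f)]
        if missing:
            raise ValueError(f"row is missing fields: {missing}")
        writer.writerow(row)
    return buffer.getvalue()


# Made-up example row.
rows = [{
    "claim": "Target T is genetically linked to disease D",
    "source": "hypothetical-paper-1",
    "source_type": "peer-reviewed article",
    "publication_date": "2024-03-01",
    "evidence_level": "human genetics",
    "reviewer_decision": "accepted",
}]
table = write_evidence_table(rows)
```

Rejecting incomplete rows is a design choice: a claim with no reviewer decision stays out of the table until someone has reviewed it.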

Separate graph edges by evidence type. A genetics edge, expression edge, text-mined co-mention, pathway membership, and clinical-trial mention should never collapse into one generic association score without a visible explanation.
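
A small sketch of what a visible explanation could look like: scores are reported per evidence type and not summed into one number. Taking the maximum score within each type is an assumption made for illustration, as are the evidence-type labels and the scores.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def breakdown(edges: List[Tuple[str, float]]) -> Dict[str, float]:
    """Best score per evidence type (max within type is an assumption)."""
    best: Dict[str, float] = defaultdict(float)
    for evidence_type, score in edges:
        best[evidence_type] = max(best[evidence_type], score)
    return dict(best)


def explain(edges: List[Tuple[str, float]]) -> str:
    """Human-readable breakdown, sorted by evidence type name."""
    parts = breakdown(edges)
    return ", ".join(f"{name}={score:.2f}" for name, score in sorted(parts.items()))


# Made-up edges between one target and one disease.
edges = [
    ("genetics", 0.8),
    ("text_co_mention", 0.3),
    ("text_co_mention", 0.6),
    ("clinical_trial_mention", 0.2),
]
summary = explain(edges)
```

For these edges, `summary` is `clinical_trial_mention=0.20, genetics=0.80, text_co_mention=0.60`, so a reader can see that the strongest support is genetic and that the literature signal is co-mention only.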

Record the retrieval date and corpus boundary. A target-diligence search against PubMed alone answers a different question from a search across PubMed, patents, ClinicalTrials.gov, company filings, and internal reports.

Require a negative-evidence pass. Search for failed trials, terminated programs, contradicted mechanisms, weak animal models, assay-transfer failures, and biomarkers that did not replicate.
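
A negative-evidence pass can start as a fixed set of extra queries run alongside the main search. The sketch below assumes a generic boolean query syntax with quoted phrases and `AND`; the term list and the function name are hypothetical and would need adapting to each search engine.

```python
from typing import List

# Illustrative term list; extend it for the program under review.
NEGATIVE_TERMS = [
    "failed trial",
    "terminated",
    "discontinued",
    "did not replicate",
    "negative result",
    "lack of efficacy",
]


def negative_queries(target: str, indication: str) -> List[str]:
    """Build one query per negative term for a target-indication pair."""
    return [
        f'"{target}" AND "{indication}" AND "{term}"'
        for term in NEGATIVE_TERMS
    ]


queries = negative_queries("target T", "disease D")
```

Keeping the negative queries in the audit record, including the ones that return nothing, shows that the absence of negative findings was looked for and not assumed.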

Treat vendor outputs as research artifacts. Exportable source lists, relation explanations, and audit logs matter more than a polished answer screen.