Biomedical Knowledge Graphs and Literature AI
Biomedical discovery depends on an evidence layer that is larger than any single database or laboratory notebook. Publications, target-disease links, pathways, protein families, compounds, clinical programs, patents, and adverse-event signals sit in separate systems with different identifiers and evidence conventions. Biomedical knowledge graphs and literature AI are useful when they keep those links visible rather than compressing them into a confident paragraph.
This chapter gives you a working framework for biomedical knowledge graphs and literature AI. You will learn to:
- Distinguish a curated biomedical knowledge graph from a text-mined graph and from a vector-search index
- Read BioBERT, PubMedBERT, SciBERT, SemMedDB, Hetionet, PrimeKG, and Open Targets as different layers in the same evidence stack
- Evaluate retrieval-augmented generation for biomedical use by checking source provenance, date, relation type, and claim alignment
- Identify the failure modes that make a polished answer misleading: citation laundering, relation flattening, abstract-only evidence, and missing negative evidence
- Decide where literature AI belongs in pharma business development, target triage, portfolio scanning, and competitive intelligence
- Separate demonstrated literature-mining tasks from theoretical causal inference claims and beyond-current-capability autonomous scientific judgement
Evidence stack:
| Layer | Examples | What it supports | Main caution |
|---|---|---|---|
| Biomedical language models | BioBERT, PubMedBERT, SciBERT | Named-entity recognition, relation extraction, document classification | Text performance does not equal biological truth |
| Text-mined relation stores | SemMedDB | PubMed-scale subject-predicate-object triples | Predications require source and negation checks |
| Curated or semi-curated graphs | Open Targets, Hetionet, PrimeKG | Target prioritisation, drug repurposing, disease mechanism mapping | Edge provenance and evidence weighting matter |
| Retrieval systems | OpenEvidence, Causaly, Iris.ai, Elicit | Literature triage, source retrieval, due-diligence preparation | Retrieval quality and citation alignment must be audited |
| Internal enterprise graphs | Pharma target, assay, omics, patent, and program data | Portfolio intelligence and institutional memory | Access control, provenance, and data contracts define value |
Biomedical language and graph references:
| Resource | Type | Verified source |
|---|---|---|
| BioBERT | Biomedical language model | Lee et al., 2019 |
| PubMedBERT | Domain-specific biomedical language model | Gu et al., 2021 |
| SciBERT | Scientific-text language model | Beltagy et al., 2019 |
| SemMedDB | PubMed-scale semantic predication repository | Kilicoglu et al., 2012 |
| Hetionet | Biomedical knowledge graph for drug repurposing | Himmelstein et al., 2017 |
| PrimeKG | Precision-medicine knowledge graph | Chandak et al., 2023 |
| Open Targets Platform | Target-disease evidence platform | Ochoa et al., 2020 |
Three failures that look like progress:
| Failure mode | Looks like | Actually means |
|---|---|---|
| Citation laundering | Answer includes a plausible paper link | The linked paper does not support the exact claim |
| Relation flattening | Graph says drug treats disease | Edge may mean association, trial mention, mechanism hypothesis, or text-mined co-occurrence |
| Abstract-only evidence | Retrieval returns clean snippets | Full paper may contain limitations, exclusions, negative results, or context absent from the abstract |
Introduction
Literature AI is now a procurement category because the life sciences evidence base is too large for manual reading alone. Pharma teams use retrieval products for target scans, competitive intelligence, deal diligence, safety review, and evidence mapping. Research groups use similar methods for hypothesis triage, dataset discovery, and entity linking across papers and databases.
The hard problem is not search. It is source-grounded reasoning. A system that retrieves five papers and writes a fluent answer can still fail if it collapses “associated with,” “causes,” “treats,” “is trialed in,” and “was mentioned near” into one relation. A knowledge graph that preserves relation type, source, date, and confidence is more useful than a graph that only maximises connectivity.
Biomedical knowledge graphs and literature AI also sit underneath other chapters in the handbook. Target identification depends on evidence integration. Drug repurposing depends on graph traversal and signature matching. Real-world evidence work depends on entity resolution across clinical, molecular, and claims data. Evaluation depends on tracing every claim back to its source.
Demonstrated
Biomedical Language Models
Biomedical and scientific language models have demonstrated value in text-mining tasks where the ground truth is textual: named-entity recognition, relation extraction, question answering over abstracts, document classification, and search ranking. BioBERT adapted BERT to biomedical text and improved common biomedical text-mining benchmarks (Lee et al., 2019). PubMedBERT showed the value of pretraining from scratch on biomedical corpora rather than adapting a general-domain model (Gu et al., 2021). SciBERT established the broader scientific-text pattern across computer science and biomedical corpora (Beltagy et al., 2019).
The demonstrated lesson is narrow but important: domain-specific pretraining improves biomedical text handling when the task is tied to language. It does not establish biological mechanism, therapeutic efficacy, clinical utility, or target validity. Text competence is not evidence competence.
Semantic Predications and Literature-Mined Relations
SemMedDB is the canonical PubMed-scale semantic predication repository. It stores subject-predicate-object triples extracted from biomedical text, such as disease-gene, drug-disease, or compound-effect relations (Kilicoglu et al., 2012). This makes SemMedDB useful for evidence discovery, hypothesis generation, and relation lookup at scale.
The limitation is equally important. A semantic predication is not the same as a verified causal relation. Text-mined triples need checks for negation, speculation, population scope, experimental context, and source quality. If a graph edge is used in target prioritisation, the edge needs a provenance trail to the source sentence and, ideally, to the full paper.
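A minimal sketch of the kind of screen described above, assuming a triple arrives with its source sentence and an abstract identifier. The `Predication` class, the cue lists, and the `screen` function are hypothetical names introduced here for illustration and do not reflect the SemMedDB schema. Keyword matching is a crude stand-in for a real negation or speculation detector, so a flag means "route to a reviewer", not "the triple is wrong".

```python
from dataclasses import dataclass

# Hypothetical cue lists for illustration only. A production pipeline would
# use a dedicated negation/speculation detector, not substring matching.
NEGATION_CUES = ("no ", "not ", "did not", "failed to", "without ")
SPECULATION_CUES = ("may ", "might ", "could ", "suggest", "hypothes")


@dataclass(frozen=True)
class Predication:
    subject: str
    predicate: str
    obj: str
    sentence: str  # source sentence the triple was extracted from
    pmid: str      # identifier of the source abstract


def screen(p: Predication) -> list[str]:
    """Return review flags for a text-mined triple; an empty list means no cue fired."""
    text = f" {p.sentence.lower()} "
    flags = []
    if not p.sentence or not p.pmid:
        flags.append("missing_provenance")
    if any(cue in text for cue in NEGATION_CUES):
        flags.append("possible_negation")
    if any(cue in text for cue in SPECULATION_CUES):
        flags.append("possible_speculation")
    return flags
```

An empty flag list only means that none of these simple cues fired. Population scope, experimental context, and source quality still need a reader.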
Biomedical Knowledge Graphs
Knowledge graphs organise biomedical entities into typed relationships. Hetionet integrated drugs, diseases, genes, pathways, anatomy, and side effects to prioritise repurposing hypotheses (Himmelstein et al., 2017). PrimeKG integrated disease, drug, gene, pathway, and phenotype relations for precision-medicine use cases (Chandak et al., 2023). Open Targets Platform integrates target-disease evidence from genetics, genomics, drugs, literature, and other evidence streams for systematic target prioritisation (Ochoa et al., 2020).
The strongest biomedical graphs share four traits:
- Typed edges: the relation says what kind of evidence connects two nodes
- Source provenance: every important edge traces to a database, publication, or curated evidence stream
- Evidence weighting: genetics, perturbation, expression, clinical, literature, and animal evidence are not interchangeable
- Versioning: graph changes across database releases are tracked so conclusions are reproducible
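The four traits can be sketched as a small data structure. This is an illustrative layout under stated assumptions, not the schema of Hetionet, PrimeKG, or Open Targets: the `Edge` fields, the `EvidenceType` members, and the `release_diff` helper are hypothetical names chosen to mirror the list above.

```python
from dataclasses import dataclass
from enum import Enum


class EvidenceType(Enum):
    GENETICS = "genetics"
    PERTURBATION = "perturbation"
    EXPRESSION = "expression"
    CLINICAL = "clinical"
    LITERATURE = "literature"
    ANIMAL = "animal"


@dataclass(frozen=True)
class Edge:
    source_node: str
    relation: str                # typed edge, e.g. "associated_with"
    target_node: str
    evidence_type: EvidenceType  # kept separate so weighting stays visible
    provenance: str              # database, publication, or curated stream
    graph_release: str           # release label the edge was read from


def edge_key(edge: Edge) -> tuple:
    """Identity of an edge across releases (release label excluded)."""
    return (edge.source_node, edge.relation, edge.target_node, edge.evidence_type)


def release_diff(old_edges, new_edges) -> dict:
    """Report which edges appeared or disappeared between two graph releases."""
    old_keys = {edge_key(e) for e in old_edges}
    new_keys = {edge_key(e) for e in new_edges}
    return {"added": new_keys - old_keys, "removed": old_keys - new_keys}
```

Keeping the release label on every edge lets a conclusion be re-derived against the exact graph version it was drawn from.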
Retrieval-Augmented Literature Workflows
Retrieval-augmented generation combines source retrieval with answer drafting. In biomedical settings, the critical output is not the paragraph. It is the source set, retrieval query, date, filtering logic, and claim-to-source map. OpenEvidence, Causaly, Iris.ai, and Elicit represent the broad product category of literature retrieval and research assistance in medicine and life sciences (OpenEvidence, 2026; Causaly, 2026; Iris.ai, 2026; Elicit, 2026).
For institutional use, a retrieval workflow should preserve:
- Query terms and synonyms
- Corpus boundary: PubMed, patents, trial registries, company filings, internal reports, or mixed sources
- Retrieval date
- Ranking method
- Inclusion and exclusion criteria
- Exact source passages used to support each claim
- Human reviewer sign-off for high-stakes conclusions
This is why literature AI belongs in due-diligence workflow design, not only as a chat interface. The professional artifact is an auditable evidence map.
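One way to make the evidence map auditable is to store the preserved items as a record and check it for gaps before a conclusion is circulated. The sketch below is a minimal illustration. `RetrievalRecord`, its field names, and `audit_gaps` are assumptions introduced here and do not describe any vendor's export format.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class RetrievalRecord:
    query_terms: list[str]          # query terms and synonyms
    corpus: list[str]               # corpus boundary, e.g. ["PubMed", "patents"]
    retrieved_on: date              # retrieval date
    ranking_method: str
    inclusion_criteria: list[str]
    exclusion_criteria: list[str]
    # Claim-to-source map: each claim points at the exact passages used.
    claim_passages: dict[str, list[str]] = field(default_factory=dict)
    reviewer_signoff: Optional[str] = None


def audit_gaps(record: RetrievalRecord, high_stakes: bool = True) -> list[str]:
    """List what is missing before the record can back a conclusion."""
    gaps = []
    if not record.corpus:
        gaps.append("corpus boundary not recorded")
    for claim, passages in record.claim_passages.items():
        if not passages:
            gaps.append(f"no source passage for claim: {claim}")
    if high_stakes and not record.reviewer_signoff:
        gaps.append("missing reviewer sign-off")
    return gaps
```

A claim that maps to an empty passage list is reported rather than dropped, so unsupported statements stay visible to the reviewer.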
Enterprise Knowledge Layers
The highest-value graph in a life sciences organisation is usually not public. It combines internal assay results, failed programs, compounds, target rationales, omics data, clinical evidence, patents, vendor reports, and institutional decisions. Public graphs and language models supply the outside evidence layer. The institutional graph supplies the memory of what the organisation already tried, rejected, licensed, or validated.
The immediate value is often duplicate-work prevention. A target scan that recalls prior internal failures, weak assay transfer, or unresolvable IP constraints is more valuable than a ranked list that only mirrors the public literature.
Theoretical
Causal Discovery from Graphs
Biomedical graphs support hypothesis prioritisation, but causal discovery remains theoretical for most real-world graph workflows. A drug-disease edge, gene-disease edge, or pathway-disease edge can point to a plausible mechanism. It does not establish that intervening on the node changes disease outcome. Causal claims require design logic, perturbation data, negative controls, biological priors, and prospective validation.
Graph structure is still useful. It can highlight mechanistic neighbourhoods, missing evidence, contradictory evidence, and targets with convergent support from genetics and perturbation data. The mistake is treating graph centrality or embedding similarity as causal evidence.
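A toy illustration of the difference between connectivity and convergent support, using invented node names and edges. Neither quantity is causal evidence. The sketch only shows that a highly connected node can carry less diverse evidence than a sparsely connected one.

```python
from collections import defaultdict

# Hypothetical edges for illustration: (target, disease, evidence_type).
EDGES = [
    ("HUB_GENE", "DISEASE_X", "text_co_mention"),
    ("HUB_GENE", "DISEASE_Y", "text_co_mention"),
    ("HUB_GENE", "DISEASE_Z", "text_co_mention"),
    ("QUIET_GENE", "DISEASE_X", "genetics"),
    ("QUIET_GENE", "DISEASE_X", "perturbation"),
]


def degree(edges) -> dict:
    """Count edges touching each node (a simple centrality proxy)."""
    counts = defaultdict(int)
    for target, disease, _ in edges:
        counts[target] += 1
        counts[disease] += 1
    return dict(counts)


def convergent_support(edges, disease) -> dict:
    """Collect the distinct evidence types linking each target to one disease."""
    support = defaultdict(set)
    for target, linked_disease, evidence_type in edges:
        if linked_disease == disease:
            support[target].add(evidence_type)
    return dict(support)
```

Here `HUB_GENE` has the higher degree, but for `DISEASE_X` it is supported only by co-mentions, while `QUIET_GENE` has two independent evidence types. Even the second case is a reason to look closer, not a causal finding.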
Automated Evidence Grading
Automated evidence grading is plausible but not yet trustworthy as a standalone process. A literature system can sort study types, extract sample size, identify endpoints, and tag whether a source is preclinical, clinical, regulatory, or company-reported. The hard part is grading design quality, confounding, selective reporting, population fit, and claim relevance.
For life sciences diligence, the near-term pattern is human-in-the-loop evidence grading. The machine prepares the source map. Domain experts decide which evidence changes a program decision.
Cross-Corpus Intelligence
The most useful literature systems will connect publications, patents, clinical trials, conference abstracts, regulatory labels, omics datasets, and internal reports. This is theoretically powerful because early signals often appear outside peer-reviewed literature. It is also difficult because identifiers differ, corporate naming changes, and trial endpoints rarely map cleanly to mechanistic claims.
This is where entity resolution becomes a strategic capability. Without stable identifiers for genes, proteins, compounds, indications, sponsors, trial assets, and mechanisms, cross-corpus search produces noise.
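A minimal sketch of synonym-based entity resolution, assuming a curated synonym table already exists. The identifiers (`CPD:0001`, `TGT:0001`) and names are placeholders invented for this example, not entries from any real registry. Real systems add context-aware disambiguation that this sketch does not attempt.

```python
import re

# Hypothetical synonym table: placeholder identifiers and invented names.
SYNONYMS = {
    "CPD:0001": ["ABC-123", "abc 123", "examplumab"],
    "TGT:0001": ["Target A", "TGT-A"],
}


def normalise(mention: str) -> str:
    """Lowercase and drop punctuation/spacing so spelling variants collide."""
    return re.sub(r"[^a-z0-9]", "", mention.lower())


def build_index(synonyms: dict) -> dict:
    """Map each normalised surface form to one identifier; refuse ambiguity."""
    index = {}
    for entity_id, names in synonyms.items():
        for name in names:
            key = normalise(name)
            if index.get(key, entity_id) != entity_id:
                raise ValueError(f"ambiguous surface form: {name!r}")
            index[key] = entity_id
    return index


def resolve(mention: str, index: dict):
    """Return the stable identifier for a mention, or None if unknown."""
    return index.get(normalise(mention))
```

Raising on an ambiguous surface form is a deliberate choice in this sketch: a silent guess would push exactly the kind of noise into cross-corpus search that stable identifiers are meant to prevent.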
Beyond Current Capabilities
Autonomous Scientific Judgement
No literature system should be treated as an autonomous scientific judge. It cannot replace the chain of evidence required to decide whether a target is viable, whether a compound is developable, whether a biomarker is predictive, or whether a company claim survives diligence. Those judgements require source review, experimental context, biological plausibility, and decision accountability.
Complete Evidence Capture
No public literature system captures all relevant evidence. Negative internal experiments, abandoned assets, unreported assay failures, unpublished tox findings, informal regulatory feedback, and confidential business decisions often determine whether a program is viable. Published literature is necessary but incomplete.
Fully Reliable Citation Grounding
Biomedical LLM failures are particularly dangerous when they combine fluent prose with plausible citations. Galactica, an early scientific-language model, became a cautionary case because the public demo produced scientific-looking but unreliable outputs and was withdrawn shortly after release (Taylor et al., 2022, preprint; Edwards, Ars Technica, November 2022). The enduring lesson is not about one product. It is that citation-shaped text is not citation verification.
Practice Notes
Use literature tools as evidence triage, not final authority. For every high-stakes conclusion, keep a short evidence table with claim, source, source type, publication date, evidence level, and reviewer decision.
Separate graph edges by evidence type. A genetics edge, expression edge, text-mined co-mention, pathway membership, and clinical-trial mention should never collapse into one generic association score without a visible explanation.
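If a combined score is unavoidable, one option is to return the per-type breakdown alongside the total so the explanation travels with the number. The weights below are arbitrary placeholders chosen for illustration, not recommended values, and `explain_score` is a hypothetical helper.

```python
# Hypothetical weights for illustration only; real weighting needs
# domain review and should be documented alongside the score.
WEIGHTS = {
    "genetics": 1.0,
    "expression": 0.4,
    "text_co_mention": 0.1,
    "pathway_membership": 0.2,
    "clinical_trial_mention": 0.3,
}


def explain_score(edges) -> dict:
    """Score a list of (evidence_type, strength) pairs with a visible breakdown."""
    breakdown = {}
    for evidence_type, strength in edges:
        if evidence_type not in WEIGHTS:
            # Unknown evidence types are rejected rather than silently pooled.
            raise ValueError(f"unrecognised evidence type: {evidence_type}")
        contribution = WEIGHTS[evidence_type] * strength
        breakdown[evidence_type] = breakdown.get(evidence_type, 0.0) + contribution
    return {"total": sum(breakdown.values()), "by_evidence_type": breakdown}
```

A reader of the output can see at a glance whether a total is driven by genetics or by co-mentions, which a single collapsed number hides.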
Record the retrieval date and corpus boundary. A target-diligence search against PubMed alone answers a different question from a search across PubMed, patents, ClinicalTrials.gov, company filings, and internal reports.
Require a negative-evidence pass. Search for failed trials, terminated programs, contradicted mechanisms, weak animal models, assay-transfer failures, and biomarkers that did not replicate.
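A negative-evidence pass can be made routine by generating the extra queries up front and checking which ones were never run. The term list and the generic boolean query syntax below are assumptions for illustration. Real search engines differ in syntax, and the terms should be tuned to the corpus.

```python
# Hypothetical negative-evidence terms; adapt to the corpus and indication.
NEGATIVE_TERMS = [
    "failed trial",
    "terminated",
    "discontinued",
    "contradicted",
    "did not replicate",
    "assay transfer failure",
]


def negative_queries(target: str, indication: str) -> list[str]:
    """Build one generic boolean query per negative-evidence term."""
    base = f'"{target}" AND "{indication}"'
    return [f'{base} AND "{term}"' for term in NEGATIVE_TERMS]


def unsearched_terms(executed_queries: list[str]) -> list[str]:
    """Return negative-evidence terms that no executed query has covered."""
    return [
        term
        for term in NEGATIVE_TERMS
        if not any(f'"{term}"' in query for query in executed_queries)
    ]
```

Logging the output of `unsearched_terms` next to the retrieval date gives the reviewer a direct record of which negative searches were skipped.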
Treat vendor outputs as research artifacts. Exportable source lists, relation explanations, and audit logs matter more than a polished answer screen.