Literature Evidence

Service: Local PubMed Mirror | Six-component scoring | ACMG-aligned

Manual PubMed searches are slow, inconsistent across geneticists, and frequently miss relevant publications. The Literature Evidence service maintains a local genetics-filtered PubMed mirror with pre-extracted variants, genes, and phenotypes, then ranks publications for the case at hand using a six-component scoring model aligned to ACMG evidence categories.

The result is sub-second clinical search delivering ranked publications with per-component score breakdowns and ACMG-aligned strength labels, all integrated into the standard variant interpretation workflow rather than a separate research task.

Contents

01 Why Local PubMed Mining
02 Ingestion Pipeline
03 Genetics Relevance Filtering
04 Variant, Gene, and Phenotype Extraction
05 Clinical Search Workflow
06 Relevance Scoring
07 ACMG-Aligned Evidence Strength
08 Inputs and Outputs
09 Standards and Boundaries

Why Local PubMed Mining

Clinical genetics laboratories need literature evidence to support ACMG variant classification decisions. Three problems with using public PubMed directly motivate a local mirror.

Three Problems

Latency. Public PubMed search is fast for a single query but slow for the kind of structured, multi-criteria, multi-gene queries that real clinical interpretation requires. Sub-second response is essential when literature search is part of every variant review.
Inconsistency. Two geneticists running the same search at the same time may receive different results due to PubMed ranking variation. Reproducible search results are essential for laboratory quality systems and for review of past cases.
Data residency. Sending case context, including HPO terms and gene panels, to public PubMed transmits potentially sensitive context outside the platform. A local mirror keeps clinical search inside the EU-resident Helena infrastructure.

Ingestion Pipeline

Six pipeline stages convert raw PubMed XML into a clinically searchable local database. The pipeline is per-file and resumable: each file completes all stages before the next begins, enabling restart after failure without reprocessing completed files.
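The per-file, resumable behaviour described above can be sketched as a loop over a checkpoint file. This is a minimal illustration, not the service's actual implementation: the function name run_pipeline, the process_file callback, and the JSON checkpoint format are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def run_pipeline(files: Iterable[str],
                 process_file: Callable[[str], None],
                 checkpoint: Path) -> list[str]:
    """Process files one at a time, skipping files already recorded as done."""
    # Hypothetical checkpoint format: a JSON list of completed file names.
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    processed = []
    for name in files:
        if name in done:
            continue  # completed in an earlier run, do not reprocess
        process_file(name)  # runs every stage for this file; raises on failure
        done.add(name)
        # Record completion only after all stages succeeded for this file.
        checkpoint.write_text(json.dumps(sorted(done)))
        processed.append(name)
    return processed
```

Because the checkpoint is written only after a file finishes all stages, a crash mid-file leaves that file unrecorded and it is simply retried on the next run.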

Download

PubMed baseline and daily update files are pulled from the NCBI FTP source with parallel transfers and integrity verification. The full PubMed baseline is approximately 1,300 compressed XML files representing tens of millions of articles.
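Integrity verification could look like the sketch below, which compares a downloaded file against an expected MD5 digest. It assumes the expected digest is already available as a hex string (for example from a companion checksum file); the function name md5_matches is hypothetical.

```python
import hashlib
from pathlib import Path


def md5_matches(path: Path, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Return True when the file's MD5 digest equals the expected hex digest."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        # Read in fixed-size chunks so large archives are never fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()
```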

Parsing

Compressed PubMed XML is parsed with a streaming XML reader, producing structured publication records. Memory-efficient parsing handles the largest files without loading them fully into memory.
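As an illustration of streaming parsing, the sketch below reads a gzip-compressed PubMed XML stream with the standard library and yields one record per article. It extracts only the PMID and title, and the function name iter_articles is an assumption for the example rather than the service's own parser.

```python
import gzip
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator


def iter_articles(stream: BinaryIO) -> Iterator[dict]:
    """Yield minimal publication records from gzip-compressed PubMed XML."""
    with gzip.open(stream, "rb") as xml_file:
        # iterparse emits elements as they are completed, so the whole
        # document is never materialised at once.
        for _event, elem in ET.iterparse(xml_file, events=("end",)):
            if elem.tag != "PubmedArticle":
                continue
            yield {
                "pmid": elem.findtext("MedlineCitation/PMID"),
                "title": elem.findtext("MedlineCitation/Article/ArticleTitle"),
            }
            elem.clear()  # drop the article's children once it has been read
```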

Filtering

Publications are evaluated against curated MeSH descriptors that signal genetics relevance, against accepted publication types, and against a publication date floor. Filtering reduces the input to a manageable, genetics-focused subset of approximately seven to eight percent of the input volume.

Extraction

For each retained publication, the service extracts variant notations, gene symbols, and phenotype mentions. Extracted entities are validated against authoritative reference data before persistence.

Insertion

Extracted records are inserted into the local literature database in batches with conflict-aware upsert semantics. Re-processing a file is idempotent: existing records are updated rather than duplicated.
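Conflict-aware upserts can be illustrated with SQLite's ON CONFLICT clause. The source does not name the database engine or schema, so SQLite, the publication table, and its two columns are stand-ins chosen only to make the idempotency visible.

```python
import sqlite3
from typing import Iterable


def upsert_publications(conn: sqlite3.Connection,
                        rows: Iterable[tuple[str, str]]) -> None:
    """Insert (pmid, title) rows in one batch, updating rows that already exist."""
    # Hypothetical schema: one row per PubMed identifier.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS publication ("
        "pmid TEXT PRIMARY KEY, title TEXT NOT NULL)"
    )
    with conn:  # commit the whole batch, or roll it back on error
        conn.executemany(
            "INSERT INTO publication (pmid, title) VALUES (?, ?) "
            "ON CONFLICT(pmid) DO UPDATE SET title = excluded.title",
            rows,
        )
```

Running the same batch twice leaves the row count unchanged, which is the property that makes re-processing a file safe.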

Cleanup

Downloaded compressed XML files are removed after successful processing to reclaim disk space. Cleanup policy is configurable and can preserve files for audit or troubleshooting.

Genetics Relevance Filtering

Only a small fraction of PubMed publications are relevant to genetics. The filtering layer reduces the input to a focused genetics dataset using three independent dimensions.

Genetics MeSH Descriptors

A curated set of Medical Subject Headings descriptors signals that a publication concerns genetics or genomics. MeSH classifications are the most reliable filtering signal because they are professionally indexed at the source.

Publication Type

Case reports, clinical trials, original research, and review articles are accepted. Editorial pieces, news items, and similar non-research formats are excluded from the indexed dataset.

Publication Date Floor

Publications older than the configured date floor are excluded. The default floor reflects the period over which contemporary genetic nomenclature and reporting standards stabilised.
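The three filtering dimensions combine naturally into a single predicate. In the sketch below the descriptor set, the accepted and excluded publication types, and the date floor are illustrative values only; the service's curated lists and configured floor are not given in this document.

```python
from dataclasses import dataclass, field
from datetime import date

# Illustrative values, not the service's curated configuration.
GENETICS_MESH = {"Mutation", "Genetic Variation", "Genetic Predisposition to Disease"}
ACCEPTED_TYPES = {"Case Reports", "Clinical Trial", "Journal Article", "Review"}
EXCLUDED_TYPES = {"Editorial", "News"}
DATE_FLOOR = date(2000, 1, 1)


@dataclass
class Publication:
    pmid: str
    published: date
    mesh: set[str] = field(default_factory=set)
    types: set[str] = field(default_factory=set)


def is_genetics_relevant(pub: Publication) -> bool:
    """Keep a publication only if it passes all three dimensions."""
    has_genetics_mesh = bool(pub.mesh & GENETICS_MESH)
    has_accepted_type = bool(pub.types & ACCEPTED_TYPES) and not (pub.types & EXCLUDED_TYPES)
    is_recent_enough = pub.published >= DATE_FLOOR
    return has_genetics_mesh and has_accepted_type and is_recent_enough
```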

Variant, Gene, and Phenotype Extraction

The extraction layer pulls the entities that matter for clinical interpretation directly from publication text and metadata. Doing this at ingestion time, not search time, is what enables sub-second search.

Variant Mentions

HGVS cDNA and protein notations alongside legacy notations are extracted from publication text. Notation is normalised so a single variant referenced in different formats across different publications is recognised consistently in search.
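A small sketch of notation normalisation for protein-level missense variants follows. It maps the three-letter HGVS form (p.Arg117His) and the legacy one-letter form (R117H) to the same key. The two regular expressions are deliberately naive assumptions for illustration: a production extractor would cover cDNA notation, more variant types, and would need context to avoid false positives on short legacy patterns.

```python
import re

# One-letter to three-letter amino acid codes.
AA3 = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}

# HGVS protein substitution, with optional parentheses: p.Arg117His, p.(Arg117His)
HGVS_PROTEIN = re.compile(r"\bp\.\(?([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})\)?")
# Legacy one-letter substitution: R117H
LEGACY_PROTEIN = re.compile(r"\b([ARNDCQEGHILKMFPSTWYV])(\d+)([ARNDCQEGHILKMFPSTWYV])\b")


def extract_protein_variants(text: str) -> set[str]:
    """Return protein substitutions found in text, normalised to p.Xxx123Yyy."""
    found = {f"p.{ref}{pos}{alt}" for ref, pos, alt in HGVS_PROTEIN.findall(text)}
    found |= {
        f"p.{AA3[ref]}{pos}{AA3[alt]}"
        for ref, pos, alt in LEGACY_PROTEIN.findall(text)
    }
    return found
```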

Gene Mentions

Candidate gene symbols are validated against the human protein-coding gene reference. This eliminates false positives from common abbreviations that overlap with non-gene acronyms. Mention counts per publication are preserved as evidence of gene centrality to the paper.
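Validation plus mention counting can be sketched as a token scan checked against a reference set. The token pattern and the function name count_gene_mentions are assumptions, and the approved-symbol set is passed in rather than loaded from any particular reference file.

```python
import re
from collections import Counter

# Candidate tokens: an uppercase letter followed by uppercase letters, digits, or hyphens.
TOKEN = re.compile(r"\b[A-Z][A-Z0-9-]{1,9}\b")


def count_gene_mentions(text: str, approved_symbols: set[str]) -> Counter:
    """Count only those candidate tokens that are approved gene symbols."""
    return Counter(tok for tok in TOKEN.findall(text) if tok in approved_symbols)
```

For example, in a sentence mentioning both BRCA1 and PCR, only BRCA1 is counted when PCR is absent from the approved set.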

Phenotype Mentions

Phenotype names are mapped to HPO, OMIM, and MeSH identifiers when available. Stemmed morphological matching ensures that variations such as plural and adjectival forms link to the same underlying phenotype.
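To show what stemmed matching buys, the sketch below uses a deliberately crude suffix stripper so that plural and adjectival forms collapse to one stem. The suffix list and minimum stem length are assumptions for the example; the document does not say which stemming algorithm the service uses, and an established stemmer would be a more likely choice in practice.

```python
# Suffixes tried in order; longer ones first so "ies" wins over "s".
SUFFIXES = ("ies", "es", "ic", "al", "s", "e", "y")
MIN_STEM = 4  # never strip a word down to fewer than four characters


def stem(word: str) -> str:
    """Lowercase a word and strip the first suffix that leaves a long enough stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word


def same_phenotype(mention: str, term_label: str) -> bool:
    """Two phrases match when their sets of word stems are identical."""
    return {stem(w) for w in mention.split()} == {stem(w) for w in term_label.split()}
```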

Clinical Search Workflow

Six steps convert a case's clinical context into a ranked, persisted, stream-ready set of literature evidence. The full workflow targets sub-second response for typical queries.

Step 1

Two-Source Publication Discovery

For each query gene, the service finds candidate publications through two complementary sources. The first is a direct lookup against extracted gene mentions, providing publications where the gene is identified as a meaningful subject. The second is a text search across titles and abstracts, catching mentions that may not have been formally extracted. The two sources are merged into a unique candidate set.
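The merge of the two sources amounts to a set union over publication identifiers. In this sketch the in-memory dictionaries stand in for the extracted-mention index and the title and abstract text search; both structures and the function name are hypothetical.

```python
import re


def discover_candidates(gene: str,
                        mention_index: dict[str, list[str]],
                        abstracts: dict[str, str]) -> set[str]:
    """Union of PMIDs from extracted gene mentions and from a text search."""
    # Source 1: direct lookup against extracted gene mentions.
    from_mentions = set(mention_index.get(gene, ()))
    # Source 2: case-insensitive whole-word search over title and abstract text.
    pattern = re.compile(rf"\b{re.escape(gene)}\b", re.IGNORECASE)
    from_text = {pmid for pmid, text in abstracts.items() if pattern.search(text)}
    return from_mentions | from_text
```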

Step 2

Publication Enrichment

Each candidate publication is enriched with full metadata: title, abstract, journal, publication date, authors, DOI and PMC identifiers, MeSH descriptors, and publication types. Gene mention counts, variant mentions for query genes, and phenotype mentions are attached to support downstream relevance scoring.

Step 3

Parallel Relevance Scoring

Enriched publications are scored across multiple components. Scoring runs in parallel across worker processes, which avoids the limits that Python's global interpreter lock places on multi-threaded code and delivers sub-second results even when hundreds of candidate publications are evaluated.
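One way to run scoring across worker processes is the standard library's ProcessPoolExecutor, sketched below. The scorer here is a placeholder that measures only phenotype overlap, and the task shape, worker count, and chunk size are assumptions, not the service's configuration.

```python
from concurrent.futures import ProcessPoolExecutor

Task = tuple[list[str], list[str]]  # (publication phenotype ids, patient HPO ids)


def score_publication(task: Task) -> float:
    """Placeholder scorer: fraction of patient terms found in the publication."""
    pub_terms, patient_terms = task
    wanted = set(patient_terms)
    if not wanted:
        return 0.0
    return len(set(pub_terms) & wanted) / len(wanted)


def score_all(tasks: list[Task], max_workers: int = 4) -> list[float]:
    """Score tasks in separate processes, preserving input order."""
    # Each worker process has its own interpreter, so CPU-bound scoring
    # is not serialised by a single global interpreter lock.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(score_publication, tasks, chunksize=32))
```

On platforms that start workers by spawning a fresh interpreter, score_all should be called from code guarded by an `if __name__ == "__main__":` block so the worker processes can import the scorer.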

Step 4

Filter and Rank

Publications below a minimum relevance threshold are discarded. Remaining publications are sorted by total score in descending order and capped at the requested result limit.
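The filter-and-rank step is small enough to show in full. The function name and the (pmid, score) tuple shape are assumptions for the sketch.

```python
def filter_and_rank(scored: list[tuple[str, float]],
                    min_score: float,
                    limit: int) -> list[tuple[str, float]]:
    """Drop low scores, sort descending by score, and cap the result count."""
    kept = [item for item in scored if item[1] >= min_score]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:limit]
```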

Step 5

Persist to Session Store

Top results are written into the session output store alongside the variant classification data. The geneticist consumes both data types from a single per-session file.

Step 6

Stream-Ready Export

A compressed newline-delimited JSON file is produced for the frontend. Metadata is emitted on the first line, results follow one per line, and a completion marker closes the stream. This format allows progressive display in the frontend as results are received rather than waiting for the full payload.
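The export layout described above (metadata line, one result per line, completion marker) can be sketched with gzip and json from the standard library. The "type" field and its values, and the "count" field on the final line, are hypothetical names invented for this example.

```python
import gzip
import json
from typing import BinaryIO, Iterable


def write_stream(out: BinaryIO, metadata: dict, results: Iterable[dict]) -> None:
    """Write gzip-compressed newline-delimited JSON to a binary stream."""
    with gzip.open(out, "wt", encoding="utf-8") as fh:
        # First line: metadata about the search.
        fh.write(json.dumps({"type": "metadata", **metadata}) + "\n")
        count = 0
        for result in results:
            # One result per line so a reader can render each as it arrives.
            fh.write(json.dumps({"type": "result", **result}) + "\n")
            count += 1
        # Last line: completion marker so the reader knows the stream ended cleanly.
        fh.write(json.dumps({"type": "complete", "count": count}) + "\n")
```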

Relevance Scoring

Six independent components are weighted and combined into a total relevance score per publication. Each component captures a different aspect of clinical relevance, and the geneticist sees the full per-component breakdown alongside the total score.
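A weighted combination with a visible breakdown might look like the sketch below. The weights are invented for illustration: the document states only that phenotype match is the dominant signal and recency a small one, not the actual values. Each component is assumed to be pre-scaled to the range 0 to 1.

```python
# Hypothetical weights that sum to 1.0; the real values are not published here.
WEIGHTS = {
    "phenotype_match": 0.35,
    "publication_type": 0.20,
    "gene_centrality": 0.15,
    "functional_data": 0.15,
    "variant_match": 0.10,
    "recency": 0.05,
}


def total_score(components: dict[str, float]) -> tuple[float, dict[str, float]]:
    """Return the weighted total and the per-component contributions."""
    breakdown = {
        name: weight * components.get(name, 0.0)
        for name, weight in WEIGHTS.items()
    }
    return sum(breakdown.values()), breakdown
```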

Phenotype Match

How well the publication's phenotype mentions overlap the patient's HPO terms. Stemmed morphological matching ensures that small lexical variations do not break the match. This is the dominant signal in the score because phenotype alignment is the strongest single indicator that a publication is relevant to the case at hand.

Publication Type

Case reports and clinical trial reports are weighted highest. Original research follows. General journal articles and review articles are weighted lower. The hierarchy reflects the relative value of each publication type as evidence in ACMG variant classification.

Gene Centrality

How frequently the query gene is mentioned in the publication. A paper that mentions the gene of interest dozens of times in the abstract and main text is more centrally about that gene than one that mentions it once in passing. Centrality is bounded so a small number of publications mentioning a gene many times does not dominate the ranking.
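Bounding can be as simple as clamping the mention count before scaling it. The cap of 20 mentions below is an assumed value chosen for the example.

```python
def gene_centrality(mention_count: int, cap: int = 20) -> float:
    """Scale a mention count to 0..1, saturating at the cap."""
    return min(max(mention_count, 0), cap) / cap
```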

Functional Data

Whether the publication describes functional studies relevant to ACMG functional evidence criteria. Indicators include MeSH terms for animal models, knockout studies, cell line experiments, and molecular biology techniques. Functional data is a key prerequisite for ACMG functional evidence.

Variant Match

An exact variant notation match between the query and the extracted variant mentions in the publication scores highest. A different variant in the same gene scores lower. This component captures the difference between a paper about the patient's exact variant and a paper about a different variant in the same gene.
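The two tiers can be expressed as a short lookup. The scores 1.0 and 0.5 are placeholders, and the sketch assumes variant notations were already normalised at ingestion so that a plain equality test is meaningful.

```python
def variant_match_score(query_variant: str,
                        query_gene: str,
                        pub_variants: dict[str, set[str]]) -> float:
    """Score 1.0 for the exact variant, 0.5 for another variant in the same gene."""
    # pub_variants maps a gene symbol to the normalised notations found in the paper.
    notations = pub_variants.get(query_gene, set())
    if query_variant in notations:
        return 1.0
    if notations:
        return 0.5
    return 0.0
```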

Recency

The recency score decays linearly with publication age over a recent window. A current-year publication scores at the top of this component; older publications score progressively lower. Recency makes a relatively small contribution because older landmark papers can remain authoritative.
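Linear decay over a window reduces to one line of arithmetic. The ten-year window is an assumption; the document does not state the window length.

```python
def recency_score(pub_year: int, current_year: int, window_years: int = 10) -> float:
    """1.0 for the current year, falling linearly to 0.0 at the window edge."""
    age = max(current_year - pub_year, 0)
    return max(0.0, 1.0 - age / window_years)
```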

ACMG-Aligned Evidence Strength

Each ranked publication is labelled with an evidence strength category that maps directly to ACMG/AMP evidence categories. The literature search output is not just a list of papers but a structured evidence pool ready for ACMG criteria assignment.

Strength	Description and ACMG Alignment
Strong	Publication describes the exact variant and includes functional studies. Candidate evidence for ACMG PS3 (well-established functional studies showing a damaging effect).
Moderate	Publication describes the exact variant OR functional data, but not both. Candidate context for moderate-weight ACMG criteria.
Supporting	Publication describes the gene with phenotype overlap to the case. Candidate context for ACMG PP4 (phenotype highly specific for a single genetic etiology) or related supporting evidence.
Weak	Gene is mentioned but no exact variant or phenotype-specific context is present. Background reference material for completeness rather than direct ACMG evidence.
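Read as decision rules, the table above maps three boolean signals to a label. The sketch below follows the table row by row; the function and parameter names are hypothetical.

```python
def evidence_strength(exact_variant: bool,
                      functional_data: bool,
                      phenotype_overlap: bool) -> str:
    """Assign a strength label following the table's rows from top to bottom."""
    if exact_variant and functional_data:
        return "Strong"
    if exact_variant or functional_data:
        return "Moderate"
    if phenotype_overlap:
        return "Supporting"
    return "Weak"
```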

Inputs and Outputs

What the service consumes from the upstream pipeline and from the geneticist, and what it produces for review and downstream use.

Inputs from the Pipeline

Patient query genes from the upstream variant analysis or phenotype matching results

Patient HPO terms from the case clinical context

Optional exact variant notations from the case variant analysis output

Inputs from the Geneticist

Search initiation as part of the standard variant interpretation workflow

Optional gene panel scoping when only specific genes are of interest

Optional result limit configuration

Outputs for the Geneticist

Ranked publication list with overall relevance score for the case

Per-publication score breakdown across all six relevance components

Per-publication evidence strength label aligned to ACMG categories

Direct links to PubMed identifier, DOI, and PMC where available

Highlighted variant matches, gene mention counts, and phenotype matches per publication

Persisted session results consumable from the same data store as variant classifications

Stream-ready compressed JSON for instant frontend rendering

Standards and Boundaries

The service operates against published standards and within explicit clinical boundaries.

ACMG/AMP

Relevance scoring components and evidence strength labels are aligned to the ACMG/AMP evidence categories. The four strength levels (Strong, Moderate, Supporting, Weak) map to the criteria categories that geneticists use in classification, providing a direct bridge between literature search results and ACMG criteria assignments.

Reference: Richards et al., Genetics in Medicine, 2015, PMID: 25741868

PubMed

The complete PubMed baseline and daily update streams are the source of all literature data. PubMed is maintained by the U.S. National Library of Medicine at NCBI and indexes the majority of biomedical research worldwide. The local mirror is updated regularly so newly published research is available for clinical search shortly after appearing in PubMed.

Reference: PubMed, U.S. National Library of Medicine, NCBI

MeSH

Medical Subject Headings are the controlled vocabulary maintained by NLM for indexing biomedical literature. MeSH descriptors are used for both filtering (genetics relevance) and scoring (functional data signal). MeSH indexing is performed by trained NLM curators, providing the highest quality content classification available.

Reference: Medical Subject Headings (MeSH), U.S. National Library of Medicine

HPO

The Human Phenotype Ontology provides the structured phenotype vocabulary used for matching publication phenotype mentions against patient HPO terms. Stemmed morphological matching extends HPO matching to lexical variants without requiring exact term lookup.

Reference: Köhler et al., Nucleic Acids Research, 2021, PMID: 33264411

HGNC

Gene symbols extracted from publications are validated against the HGNC approved-symbols set. This eliminates false positives from non-gene acronyms and ensures that gene mentions normalise consistently across publications using older gene symbol versions.

Reference: HUGO Gene Nomenclature Committee, hgnc.symbolreport

Reporting Boundary

The service produces ranked publication lists with relevance scores and ACMG-aligned evidence strength labels. It does not generate clinical interpretations, does not assert pathogenicity, and does not replace direct review of source publications by a qualified clinical geneticist. All output is reference material for the geneticist's clinical judgment.

Data Residency

The service runs within the Helena platform on EU-based infrastructure compliant with GDPR Article 9 and 1+MG technical requirements. The local PubMed mirror is hosted within the same EU infrastructure, so clinical search does not transmit case data outside the platform.

What Sets It Apart

Eight design choices that make Helena Literature Evidence distinct from generic PubMed search.

Local mirror, sub-second search

The PubMed baseline is mirrored locally and indexed for fast retrieval. Clinical search returns ranked results in well under a second for typical queries, fast enough to be a routine part of the variant interpretation workflow rather than a special-occasion lookup.

Genetics-focused dataset

A curated MeSH-based filter reduces millions of articles to a focused genetics-relevant subset, removing noise that would otherwise dilute search results. The filter is conservative and biased toward inclusion when relevance is plausible.

Pre-extracted variants, genes, and phenotypes

Variant notations, validated gene symbols, and HPO-mapped phenotype mentions are extracted at ingestion time, not at search time. The geneticist sees publications already enriched with the entities that drive ACMG decisions.

Validated gene symbols

Every candidate gene symbol is validated against the HGNC approved-symbols set before storage. Common abbreviation collisions, the dominant source of false positives in gene mention extraction, are eliminated upstream.

Six-component relevance scoring

A weighted scoring model combines phenotype match, publication type, gene centrality, functional data signal, variant match, and recency into a single ranking. The geneticist sees the per-component breakdown alongside the total score, with full transparency into why each publication ranked where it did.

ACMG-aligned strength labels

Each ranked publication carries a strength label that maps directly to ACMG evidence categories. The literature search output is not just a ranked list of papers but a categorised evidence pool aligned to the criteria the geneticist will apply in classification.

Session-integrated storage

Search results persist in the same per-session data store as variant classifications. The frontend loads both data types from a single source, simplifying architecture and ensuring consistency across the case lifetime.

EU data residency

The local PubMed mirror runs within the Helena EU infrastructure, so clinical search does not transmit case data outside the platform. This satisfies GDPR Article 9 sensitive data handling requirements and 1+MG technical requirements.

See Literature Evidence in Practice

Request a demo to see Helena run a real case literature search end to end, from query genes and HPO terms to ranked publications with per-component score breakdowns and ACMG-aligned strength labels.
