Analysis Workflow
Cohort Analysis
Variant analysis classifies individual patients. Cohort analysis aggregates classified samples into a unified matrix and runs population-level statistical tests. Which genes carry excess rare variant burden in this disease cohort? Which pathways are implicated? Which signals replicate against published GWAS? Which patients carry compound heterozygous variants in the same gene? Helena answers these questions with research-grade methodology.
No variant re-calling. No re-annotation. The service consumes pre-classified samples from variant analysis and adds population-level statistics on top, preserving ACMG context and avoiding duplicate computation.
Clinical and Research Positioning
Cohort analysis bridges per-patient ACMG classification and population-level disease genetics. Concrete examples illustrate where it changes the answer.
Three Question Types
Gene burden in a disease cohort. A monogenic diabetes cohort of 176 samples is analysed for rare variant enrichment in established susceptibility genes. Burden testing against a gnomAD reference identifies which genes carry significantly more carriers than expected, with FDR correction across all tested genes and a per-gene power analysis reported alongside.
Pathway-level enrichment. When several burden-significant genes share a common biological pathway, that pathway is more likely to be causally implicated than any single gene in isolation. Pathway enrichment surfaces these convergent signals as additional evidence beyond gene-level p-values.
Cross-sample compound heterozygotes. A pair of heterozygous variants in a recessive gene is invisible at the per-patient level if neither variant is independently flagged. Cohort analysis identifies compound het candidates per sample with phasing context derived from the pre-classified data.
Pipeline Architecture
Six sequential phases convert raw cohort metadata and per-sample VCFs into a complete analytical product. Each phase persists state so the pipeline can resume or be re-run with different parameters without restarting from scratch.
Bulk Classification
For samples not yet classified, the service dispatches the standard variant analysis pipeline per sample. A distributed concurrency limit ensures the cluster is not overwhelmed when ingesting hundreds of samples at once.
Quality Control
Per-sample metrics are computed from each classified per-sample DuckDB file: variant count, transition-to-transversion (Ti/Tv) ratio, heterozygous-to-homozygous (het/hom) ratio, and mean read depth. The cohort-level mean and standard deviation are calculated for each metric, and samples deviating beyond a configurable threshold on any metric are flagged as outliers, with the specific deviating metric recorded.
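The outlier rule described above can be sketched as a per-metric z-score check. This is a minimal illustration, not the service's implementation: the metric names, the flag_outliers function, and the threshold value are hypothetical, and the z-score form is an assumption based on the cohort mean and standard deviation mentioned in the text.

```python
from statistics import mean, stdev

# Hypothetical metric names; the service's actual column names are not documented here.
METRICS = ("variant_count", "ti_tv", "het_hom", "mean_depth")

def flag_outliers(samples, z_threshold=3.0):
    """Return {sample_id: [metric, ...]} for samples beyond z_threshold on any metric."""
    flagged = {}
    for metric in METRICS:
        values = [s[metric] for s in samples.values()]
        mu, sd = mean(values), stdev(values)
        if sd == 0:
            continue  # no spread on this metric, so nothing can deviate
        for sample_id, s in samples.items():
            if abs(s[metric] - mu) / sd > z_threshold:
                # Record which metric deviated, not just that the sample is an outlier.
                flagged.setdefault(sample_id, []).append(metric)
    return flagged

# Toy cohort of eight samples; S8 has an inflated variant count.
samples = {
    f"S{i}": {
        "variant_count": vc,
        "ti_tv": 2.0 + 0.1 * (i % 2),
        "het_hom": 1.5 + 0.1 * (i % 2),
        "mean_depth": 30 + 2 * (i % 2),
    }
    for i, vc in enumerate([100, 101, 99, 100, 102, 98, 100, 160], start=1)
}
flagged = flag_outliers(samples, z_threshold=2.0)
print(flagged)
```

In a cohort this small the largest possible z-score is bounded by the sample size, which is why the toy call lowers the threshold to 2.0.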
Matrix Construction
A unified cohort variant matrix is built using a deduplicated variant catalog and a sparse genotype representation. Each sample is attached read-only and ingested into the cohort store. Cohort-wide allele frequencies, carrier counts, and ACMG consensus across samples are computed in a final pass.
Compound Heterozygous Detection
When a gene panel is provided, the service identifies pairs of heterozygous variants in the same gene per sample, restricted to coding and splicing consequences. A noise filter excludes genes with an excessive number of heterozygous variants per sample, which typically indicate common low-penetrance variation rather than disease-causing biallelic loss of function.
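A minimal sketch of the pairing logic described above, under stated assumptions: the input tuple layout, the compound_het_candidates function, the qualifying consequence set (Sequence Ontology terms), and the noise-filter cutoff are all hypothetical. The sketch only enumerates candidate pairs; it does not determine whether the two variants are in trans.

```python
from collections import defaultdict
from itertools import combinations

# Assumed set of coding and splicing consequences (Sequence Ontology terms).
QUALIFYING = {
    "missense_variant", "frameshift_variant", "stop_gained",
    "splice_donor_variant", "splice_acceptor_variant",
}

def compound_het_candidates(calls, panel, max_hets_per_gene=5):
    """calls: iterable of (sample_id, gene, variant_id, genotype, consequence)."""
    hets = defaultdict(list)
    for sample_id, gene, variant_id, genotype, consequence in calls:
        if gene in panel and genotype == "het" and consequence in QUALIFYING:
            hets[(sample_id, gene)].append(variant_id)
    pairs = []
    for (sample_id, gene), variants in sorted(hets.items()):
        # Skip single hets, and apply the noise filter to genes with too many hets.
        if len(variants) < 2 or len(variants) > max_hets_per_gene:
            continue
        for a, b in combinations(sorted(variants), 2):
            pairs.append((sample_id, gene, a, b))
    return pairs

calls = [
    ("S1", "GCK", "v1", "het", "missense_variant"),
    ("S1", "GCK", "v2", "het", "stop_gained"),
    ("S1", "GCK", "v3", "het", "synonymous_variant"),   # not a qualifying consequence
    ("S2", "GCK", "v1", "het", "missense_variant"),     # only one het in this gene
    ("S2", "HNF1A", "v4", "hom", "frameshift_variant"),  # homozygous, not a het pair
]
pairs = compound_het_candidates(calls, panel={"GCK", "HNF1A"})
print(pairs)
```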
Statistical Analysis
Burden testing, pathway enrichment, pLoF, frequency analysis, GWAS replication, and polygenic risk scoring run as separate API-driven analyses on the completed matrix. Each analysis writes its own results table keyed by analysis run identifier.
Candidate Nomination
A weighted scoring engine integrates evidence across the prior phases and produces a ranked list of candidate genes with per-component score breakdowns and human-readable evidence summaries.
Cohort Variant Matrix
The unified matrix is the foundation for every downstream analysis. Architectural choices keep it tractable at hundreds of WGS samples while preserving full per-variant fidelity.
Deduplicated Variant Catalog
Each unique combination of chromosome, position, reference allele, and alternate allele is stored once across the entire cohort. This is the foundation for cohort-wide allele frequency calculation and for cross-sample variant comparison.
Sparse Genotype Matrix
Only non-reference genotypes are stored, keyed by variant identifier and sample identifier. For typical disease cohorts where each sample carries variants at a tiny fraction of the genome, this is far more efficient than dense representation.
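The two structures above, a deduplicated catalog and a sparse genotype map, can be illustrated with plain dictionaries. This is an in-memory sketch under assumptions: the CohortMatrix class and its method names are hypothetical, and the actual service stores these tables in its cohort store rather than in Python objects.

```python
class CohortMatrix:
    """Toy in-memory model of a deduplicated catalog plus sparse genotypes."""

    def __init__(self):
        self.catalog = {}    # (chrom, pos, ref, alt) -> variant_id, stored once per cohort
        self.genotypes = {}  # (variant_id, sample_id) -> alternate allele count (1 or 2)

    def add_call(self, sample_id, chrom, pos, ref, alt, alt_count):
        if alt_count == 0:
            return  # reference genotypes are never stored
        key = (chrom, pos, ref, alt)
        # Reuse the existing identifier if another sample already carries this variant.
        variant_id = self.catalog.setdefault(key, len(self.catalog))
        self.genotypes[(variant_id, sample_id)] = alt_count

matrix = CohortMatrix()
matrix.add_call("S1", "chr7", 44145000, "C", "T", 1)
matrix.add_call("S2", "chr7", 44145000, "C", "T", 2)  # same variant, second sample
matrix.add_call("S2", "chr12", 120980000, "G", "A", 1)
matrix.add_call("S3", "chr12", 120980000, "G", "A", 0)  # hom-ref, skipped
print(len(matrix.catalog), len(matrix.genotypes))
```

Storage grows with the number of non-reference calls rather than with samples multiplied by variants, which is the point of the sparse layout.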
ACMG Consensus Resolution
When the same variant is classified differently across samples, the cohort matrix records the most severe classification along with a discordance flag. A variant called Pathogenic in one sample and Likely Benign in another is treated as Pathogenic at the cohort level for burden testing, with the discordance flag preserving the disagreement for review.
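The most-severe-wins rule can be sketched in a few lines. The acmg_consensus function and the label strings are hypothetical; the ordering is the standard five-tier ACMG scale from Benign to Pathogenic.

```python
# Five-tier ACMG scale, ordered from least to most severe.
SEVERITY = ["Benign", "Likely Benign", "VUS", "Likely Pathogenic", "Pathogenic"]
RANK = {label: i for i, label in enumerate(SEVERITY)}

def acmg_consensus(classifications):
    """Return (most severe label, discordance flag) for one variant across samples."""
    labels = set(classifications)
    consensus = max(labels, key=RANK.__getitem__)
    discordant = len(labels) > 1  # any disagreement across samples is preserved as a flag
    return consensus, discordant

print(acmg_consensus(["Pathogenic", "Likely Benign"]))  # ('Pathogenic', True)
print(acmg_consensus(["VUS", "VUS"]))                   # ('VUS', False)
```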
Cohort-Wide Statistics
Cohort allele frequency, carrier count, and homozygote count are computed once and stored on the variant catalog. Statistical analyses query these pre-computed values rather than re-aggregating across the full genotype matrix.
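As an illustration of what these pre-computed values contain, the sketch below derives them for a single variant from its sparse genotypes. The cohort_stats function is hypothetical, and it assumes a diploid autosomal site at which every cohort sample was callable.

```python
def cohort_stats(alt_counts, n_samples):
    """alt_counts: {sample_id: 1 or 2}, the non-reference genotypes of one variant.

    Assumes a diploid autosomal site with no missing genotypes.
    """
    allele_count = sum(alt_counts.values())
    return {
        "allele_count": allele_count,
        "allele_frequency": allele_count / (2 * n_samples),
        "carrier_count": len(alt_counts),
        "homozygote_count": sum(1 for c in alt_counts.values() if c == 2),
    }

stats = cohort_stats({"S1": 1, "S2": 2, "S3": 1}, n_samples=176)
print(stats)
```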
Cohort Quality Control
Outlier samples corrupt cohort-level statistics. Quality control runs before matrix construction, with each metric evaluated independently. A flag on any single metric is sufficient to send a sample to outlier review.
Variant Count
Total variant calls per sample. Outliers may indicate library preparation issues, alignment problems, or sample contamination.
Transitions over Transversions
Ti/Tv ratio is a classical sequencing quality metric. Substantial deviation from expected values may reflect base-call quality issues.
Heterozygous over Homozygous Ratio
The het/hom ratio reflects ancestry and consanguinity, but extreme deviations may indicate a sample swap or contamination.
Mean Read Depth
Average depth across all variants. Low depth correlates with reduced calling sensitivity and lower confidence on variant interpretation.
Gene-Level Burden Testing
The core question: which genes carry excess rare variant burden in the cohort relative to a reference population? Multiple complementary methods are run on every gene with FDR correction across the full set of tested genes.
Fisher Exact Test
Two-by-two contingency table comparing cohort carriers against gnomAD-expected carriers per gene. Two-sided test capturing both enrichment and depletion. The default method for gene-level burden testing.
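A minimal standard-library sketch of a two-sided Fisher exact test on a carrier table. The function name is hypothetical, and how the reference row is derived from gnomAD is not specified in this document; the sketch simply assumes carrier and non-carrier counts are available for both the cohort and the reference.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]].

    Rows: cohort, reference. Columns: carriers, non-carriers.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    total = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x carriers in the cohort row.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one (both tails).
    tail = sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_observed * (1 + 1e-7))
    return min(1.0, tail)

# Illustrative counts only: 12 carriers among 176 cohort samples,
# 400 carriers among 50,000 reference individuals.
p_gene = fisher_exact_two_sided(12, 164, 400, 49600)
print(p_gene)
```

Because the test is two-sided, the same call would also report significance for a gene that is depleted of carriers relative to the reference.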
CMC (Combined Multivariate and Collapsing)
Binary carrier collapsing per gene. At the gene level, this is equivalent to Fisher exact and provides a methodological cross-check.
SKAT-O (Optional)
Sequence Kernel Association Test, optimised. Optional R integration for variance-component testing. Falls back to CMC when the R integration is not available.
Multiple Testing Correction
Benjamini-Hochberg false discovery rate (FDR) is applied across all tested genes. Bonferroni-corrected p-values are also reported. Genes are flagged significant when FDR is below the configured threshold.
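Both corrections are short enough to sketch directly. The function names, the example p-values, and the 0.05 threshold are illustrative assumptions.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

def bonferroni(pvalues):
    """Return Bonferroni-adjusted p-values, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

gene_p = [0.001, 0.008, 0.039, 0.041, 0.60]
q = benjamini_hochberg(gene_p)
significant = [i for i, value in enumerate(q) if value < 0.05]
print(q, bonferroni(gene_p), significant)
```

In this toy input the third and fourth genes have raw p-values below 0.05 but q-values just above it, so only the first two are flagged.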
Statistical Power Analysis
Minimum detectable odds ratio at standard 80% power is computed per gene. This is essential context for null results: a non-significant gene with low power is not the same as a non-significant gene with high power.
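One way to approximate a minimum detectable odds ratio is to search for the smallest odds ratio at which a two-proportion test reaches the target power. The sketch below uses a normal approximation, which is an assumption and not necessarily the method the service uses; the function names, the alpha level, and the example counts are hypothetical.

```python
from statistics import NormalDist

_NORMAL = NormalDist()

def power_for_or(odds_ratio, p0, n_cases, n_controls, alpha=0.05):
    """Approximate power of a two-sided two-proportion test (normal approximation).

    p0 is the carrier frequency in the reference population.
    """
    p1 = odds_ratio * p0 / (1 - p0 + odds_ratio * p0)  # case carrier frequency implied by the OR
    se = (p1 * (1 - p1) / n_cases + p0 * (1 - p0) / n_controls) ** 0.5
    return _NORMAL.cdf(abs(p1 - p0) / se - _NORMAL.inv_cdf(1 - alpha / 2))

def min_detectable_or(p0, n_cases, n_controls, alpha=0.05, target=0.80, upper=1e6):
    """Smallest OR > 1 reaching the target power, or None if even `upper` does not."""
    if power_for_or(upper, p0, n_cases, n_controls, alpha) < target:
        return None
    lo, hi = 1.0, upper
    for _ in range(100):  # bisection on the log scale
        mid = (lo * hi) ** 0.5
        if power_for_or(mid, p0, n_cases, n_controls, alpha) >= target:
            hi = mid
        else:
            lo = mid
    return hi

mdor_small = min_detectable_or(p0=0.01, n_cases=176, n_controls=50000)
mdor_large = min_detectable_or(p0=0.01, n_cases=1000, n_controls=50000)
print(mdor_small, mdor_large)
```

A larger cohort lowers the minimum detectable odds ratio, which is the context a null result needs: the same non-significant p-value says more in a gene where modest effects were detectable.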
Method Discordance Flag
When two methods disagree on significance, the gene is flagged for manual review. Concordant signals across methods are stronger than single-method calls.
Pathway Enrichment
When burden-significant genes share a common biological pathway, that pathway is more likely to be causally implicated. Pathway enrichment is a complementary signal, not a substitute for gene-level evidence.
Method. Fisher exact test, one-sided alternative, per pathway. The contingency table cross-classifies every tested gene by burden significance (significant or not) and by pathway membership (member or not), so the test asks whether significant genes are over-represented among pathway members.
Background. All genes tested in the burden phase, not all genes in the genome. This is critical for correctness: testing a pathway against the full genome inflates significance because cohort capture varies by gene.
Pathway sources. KEGG, Reactome, and Gene Ontology biological process. The researcher provides pathway definitions; the engine runs the test and applies multiple testing correction across all evaluated pathways.
Output. Per pathway, the list of contributing significant genes is preserved. The researcher sees not only the pathway p-value but the specific genes driving the enrichment.
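The one-sided Fisher test for over-representation is equivalent to an upper-tail hypergeometric probability, which the sketch below computes with the tested genes as background. The pathway_enrichment function and the gene identifiers are hypothetical.

```python
from math import comb

def pathway_enrichment(tested, significant, pathway):
    """One-sided enrichment p-value and contributing genes for one pathway.

    Background is the set of tested genes; pathway members outside it are ignored.
    """
    tested = set(tested)
    significant = set(significant) & tested
    members = set(pathway) & tested
    hits = sorted(significant & members)
    N, K, n, k = len(tested), len(members), len(significant), len(hits)
    # P(at least k significant genes fall in the pathway) under random draws.
    upper_tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, min(K, n) + 1))
    return upper_tail / comb(N, n), hits

tested = [f"G{i}" for i in range(1, 21)]   # 20 genes tested in the burden phase
significant = ["G1", "G2", "G3", "G4"]     # burden-significant genes
pathway = ["G1", "G2", "G3", "G21"]        # G21 was never tested, so it is dropped
p_pathway, contributing = pathway_enrichment(tested, significant, pathway)
print(p_pathway, contributing)
```

Restricting the pathway to tested genes is what keeps the background consistent with the burden phase.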
pLoF and Frequency Analysis
Two complementary single-variant analyses targeting different signal classes.
Predicted Loss-of-Function
Variants with frameshift, stop-gained, or canonical splice site consequences are the strongest single-variant signals for haploinsufficiency mechanisms. The pLoF analysis aggregates these per gene, surfacing pLoF carrier counts alongside gene constraint metrics and ClinVar context.
Cohort vs gnomAD Frequency
Per-variant binomial test against the configured gnomAD reference, with FDR correction across all tested variants. Allele count is computed from both genotype classes, counting one alternate allele per heterozygous genotype and two per homozygous genotype, and direction (enriched or depleted) is reported alongside the p-value.
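A sketch of the per-variant test, assuming a diploid autosomal site and a reference allele frequency strictly between 0 and 1. The frequency_test function is hypothetical, and the two-sided rule used here (sum all outcomes no more likely than the observed one) is one common convention, not necessarily the one the service applies.

```python
from math import exp, lgamma, log, log1p

def binom_pmf(k, n, p):
    """Binomial probability computed in log space; requires 0 < p < 1."""
    log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_comb + k * log(p) + (n - k) * log1p(-p))

def frequency_test(n_het, n_hom, n_samples, ref_af):
    """Return (allele count, two-sided binomial p-value, direction)."""
    allele_count = n_het + 2 * n_hom   # one allele per het, two per hom
    allele_number = 2 * n_samples      # diploid: two alleles per sample
    p_observed = binom_pmf(allele_count, allele_number, ref_af)
    tail = sum(q for q in (binom_pmf(k, allele_number, ref_af)
                           for k in range(allele_number + 1))
               if q <= p_observed * (1 + 1e-7))
    direction = "enriched" if allele_count / allele_number > ref_af else "depleted"
    return allele_count, min(1.0, tail), direction

# Illustrative counts: 6 heterozygotes and 1 homozygote among 176 samples.
ac, p_variant, direction = frequency_test(n_het=6, n_hom=1, n_samples=176, ref_af=0.002)
print(ac, p_variant, direction)
```

Counting carriers instead of alleles would give 7 here rather than 8, which understates the cohort frequency whenever homozygotes are present.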
GWAS Replication and Polygenic Risk Scores
Common-variant complement to the rare variant analyses. Tests how the cohort behaves at known signals and at population-derived risk score scales.
GWAS Signal Replication
Per known GWAS signal (typically by rs-identifier), the cohort allele frequency is tested against the gnomAD reference. Output includes the cohort frequency, p-value, and a comparison to the published odds ratio for the trait.
Polygenic Risk Scoring
Polygenic risk scores are computed per sample using PGS Catalog weight files. Coverage of expected variants is reported alongside the score; below a defined threshold the score is flagged as a directional indicator only, with a recommendation for whole-genome data for clinical-grade computation. Within-cohort percentile is provided for relative comparison.
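The score, coverage check, and within-cohort percentile can be sketched as follows. The function names, the 0.9 coverage threshold, and the rs-style identifiers are hypothetical, and the sketch assumes that a variant present in dosages with a value of 0 was covered and homozygous reference, while an absent variant was not covered.

```python
def polygenic_score(weights, dosages, min_coverage=0.9):
    """weights: {variant_id: effect weight}; dosages: {variant_id: effect allele count}."""
    covered = [v for v in weights if v in dosages]
    score = sum(weights[v] * dosages[v] for v in covered)
    coverage = len(covered) / len(weights)
    return {
        "score": score,
        "coverage": coverage,
        # Below the threshold the score is treated as a directional indicator only.
        "directional_only": coverage < min_coverage,
    }

def cohort_percentiles(scores):
    """Percent of cohort samples scoring at or below each sample."""
    n = len(scores)
    return {
        sample: 100.0 * sum(1 for other in scores.values() if other <= value) / n
        for sample, value in scores.items()
    }

weights = {"rs1": 0.2, "rs2": -0.1, "rs3": 0.5, "rs4": 0.3}
result = polygenic_score(weights, {"rs1": 2, "rs2": 1, "rs3": 0})  # rs4 not covered
percentiles = cohort_percentiles({"S1": 0.3, "S2": 0.1, "S3": 0.9, "S4": 0.5})
print(result, percentiles)
```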
Candidate Gene Nomination
A single weighted scoring engine integrates evidence across all prior analyses and produces a ranked list of candidate genes with per-component breakdowns. Researchers see both the top candidate and the specific reasons it ranks where it does.
Burden
Gene reaches significance in the cohort burden test against the chosen control population.
Pathway
Gene contributes to a pathway that is significantly enriched among burden-significant genes.
pLoF Carriers
Gene has predicted loss-of-function carriers in the cohort. pLoF variants are the strongest single-variant evidence for haploinsufficiency mechanisms.
GWAS Overlap
Gene region overlaps a known GWAS signal for the disease or related phenotype.
Disease Association
Gene has prior disease association in ClinVar or OMIM. Established disease genes are weighted differently from novel candidates.
Constraint
Gene is constrained against loss-of-function or missense variation in the general population (high pLI or low LOEUF). Constraint is a strong prior on gene-level disease relevance.
Compound Heterozygous Carriers
Gene has cohort samples with paired heterozygous variants suggesting biallelic loss of function. Normalised by the number of carriers detected.
Output. A ranked list of candidate genes with combined score, per-component score breakdown, and a human-readable evidence summary. Rankings are reproducible: the same cohort with the same parameters produces the same ranked output.
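A weighted scoring engine of this kind can be sketched as a weighted sum over the seven components listed above. The weights, the component keys, and the rank_candidates function are hypothetical; the document does not state the actual weights or how each component is normalised.

```python
# Hypothetical weights; the service's actual weighting is not documented here.
WEIGHTS = {
    "burden": 0.30, "pathway": 0.10, "plof": 0.20, "gwas": 0.10,
    "disease": 0.10, "constraint": 0.10, "compound_het": 0.10,
}

def rank_candidates(evidence):
    """evidence: {gene: {component: value in [0, 1]}}; missing components count as 0."""
    ranked = []
    for gene, components in evidence.items():
        breakdown = {name: w * components.get(name, 0.0) for name, w in WEIGHTS.items()}
        ranked.append((gene, sum(breakdown.values()), breakdown))
    # Sort by score descending, then gene symbol, so ties are ordered deterministically.
    ranked.sort(key=lambda row: (-row[1], row[0]))
    return ranked

evidence = {
    "ABCC8": {"gwas": 1.0},
    "HNF1A": {"burden": 1.0, "constraint": 1.0},
    "GCK": {"burden": 1.0, "plof": 1.0, "disease": 1.0},
}
ranking = rank_candidates(evidence)
for gene, score, breakdown in ranking:
    print(gene, round(score, 2), {k: v for k, v in breakdown.items() if v})
```

Keeping the per-component breakdown next to the combined score is what lets a reviewer see why a gene ranks where it does, and the explicit tie-break keeps the ordering reproducible.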
Inputs and Outputs
What the service consumes from the upstream pipeline and from the researcher, and what it produces for review and downstream use.
Inputs from the Pipeline
N classified per-sample DuckDB files from variant analysis (one per cohort sample)
ACMG/AMP classification, criteria, and supporting evidence already applied per variant
Per-sample annotations: gene symbol, consequence, in silico predictors, gnomAD frequency, ClinVar context, HPO
Optional gene panel for compound heterozygous detection scoping
Inputs from the Researcher
Cohort name, sequencing type (WGS, WES, or CES), control population for burden comparison
Per-sample metadata: sample identifier, sex, age, optional HPO terms, optional clinical subgroup
Qualifying variant criteria: maximum allele frequency, minimum impact, consequence filter, classification inclusion (VUS, P/LP), collapsing strategy
Optional gene panel and pathway definitions (KEGG, Reactome, GO biological process)
Optional GWAS signals for replication, optional PGS Catalog weight files for polygenic risk scoring
Outputs for the Researcher
Per-gene burden results: Fisher, CMC, SKAT-O p-values, FDR q-value, Bonferroni p-value, minimum detectable odds ratio
Per-pathway enrichment results: Fisher p-value, FDR q-value, contributing significant genes
Per-gene pLoF summary: variant counts, carrier counts, constraint metrics, ClinVar context, HPO
Per-variant frequency analysis: cohort allele count vs gnomAD with binomial p-value and direction (enriched or depleted)
Per-signal GWAS replication: cohort allele frequency, p-value, comparison to published odds ratio
Per-sample polygenic scores with coverage warnings and within-cohort percentile
Per-gene candidate ranking: combined score, per-component breakdown, evidence summary
Compound heterozygous candidate pairs per sample per gene
Standards and Boundaries
The service operates against published standards and within explicit research and clinical boundaries.
ACMG/AMP
Variant classification follows ACMG/AMP 2015 with subsequent ClinGen specifications. Performed upstream by the Variant Analysis Service per cohort sample. The Cohort Service consumes that classification as input and does not reclassify.
Reference: Richards et al., Genetics in Medicine, 2015, PMID: 25741868
CMC Method
Combined Multivariate and Collapsing method for rare variant burden testing. Industry-standard collapsing approach for gene-level analysis.
Reference: Li and Leal, AJHG, 2008, PMID: 18691683 (CMC method)
SKAT-O
Sequence Kernel Association Test, optimised. Variance-component test for rare variant association. Optional R integration with documented fallback to CMC.
Reference: Lee et al., AJHG, 2012, PMID: 22863193 (SKAT-O methodology)
Benjamini-Hochberg FDR
False discovery rate correction applied across all tested genes for burden, pathways for enrichment, and variants for frequency analysis. The standard multiple testing approach for high-dimensional genomic studies.
Reference: Benjamini and Hochberg, JRSSB, 1995 (FDR control)
PGS Catalog
Polygenic risk score computation uses weight files from the PGS Catalog. Coverage of expected variants is reported alongside the score, and below a defined threshold the score is flagged as directional only.
Reference: PGS Catalog, Lambert et al., Nature Genetics, 2021, PMID: 33692572
gnomAD
Default control population for burden testing and frequency analysis. Population stratification is configurable per analysis run (NFE is the default).
Reporting Boundary
The service produces statistical analysis results, candidate gene rankings, and per-sample evidence outputs. It does not generate clinical interpretations, does not make individual diagnostic calls, and does not replace clinical or expert review. All output is for review by qualified researchers and clinical geneticists before any clinical action.
Data Residency
The service runs within the Helena platform on EU-based infrastructure compliant with GDPR Article 9 and 1+MG technical requirements. Cohort genomic data does not leave the platform during analysis.
What Sets It Apart
Eight design choices that make Cohort Analysis distinct from generic rare variant association tools.
Maximum-fidelity input
Operates on pre-classified samples from variant analysis. ACMG context is preserved and consumed as evidence rather than recomputed. No information loss between per-sample classification and cohort analytics.
Six analyses on one matrix
Burden testing, pathway enrichment, pLoF analysis, frequency analysis, GWAS replication, and polygenic risk scoring all share the same cohort matrix. Re-running an analysis with different parameters does not require re-ingesting samples.
Power-aware results
Every burden result includes the minimum detectable odds ratio at standard power. Null results are interpretable: not detecting a signal in an underpowered gene is not the same as not detecting a signal in a well-powered gene.
Method cross-check by design
Fisher and CMC run on every burden test. SKAT-O runs when available. Method discordance is surfaced explicitly, so discordant genes are routed to manual review and concordant signals can be identified as such.
ACMG consensus across cohort
When the same variant is classified differently across samples, the most severe classification is recorded with an explicit discordance flag. Conservative for burden testing, transparent for review.
Weighted candidate ranking
A single ranked list integrates evidence from all six analyses with per-component breakdowns and human-readable evidence summaries. The researcher sees both the top candidate and why it is the top candidate.
Sparse matrix architecture
Deduplicated variant catalog plus sparse genotype storage scales to hundreds of WGS samples without the memory cost of a dense N × M samples-by-variants matrix.
Reproducible runs
Each analysis run records the classifier version, qualifying variant criteria, control population, and gene panel. Re-running the same cohort with a new classifier version produces a new run rather than overwriting prior results.
See Cohort Analysis in Practice
Request a demo to see Helena process a real cohort end to end, from per-sample classification through statistical analyses to ranked candidate genes with full evidence breakdown.