Introduction

nf-core/longraredisease is a specialized bioinformatics pipeline for structural variant (SV) detection and clinical interpretation from long-read sequencing data (Oxford Nanopore and PacBio). Designed for rare disease diagnostics, it delivers high-confidence variant discovery through multi-caller consensus, family-based analysis, and phenotype-driven prioritization.

🎯 Primary Focus: Structural Variant Detection

The pipeline excels at identifying and interpreting structural variants through:

  • Multi-caller SV consensus - Sniffles, CuteSV, SVIM with JASMINE merging
  • Phase-aware calling - Haplotype-resolved SV detection using LongPhase
  • Family analysis - Trio-based joint calling and de novo variant detection
  • Clinical annotation - AnnotSV with disease database integration
  • Phenotype prioritization - SVANNA-based ranking using HPO terms

📊 Analysis Capabilities

Core SV Analysis (Always Enabled):

  • Structural Variants - Multi-caller detection (DEL, INS, DUP, INV, BND)
  • Phasing - Long-range haplotyping with LongPhase
  • Quality Control - Comprehensive QC with NanoPlot, mosdepth, MultiQC

Optional Analyses:

  • 🧬 Single Nucleotide Variants - Clair3 or DeepVariant (enable with --snv true)
  • 📈 Copy Number Variants - Spectre or HiFiCNV (enable with --cnv true)
  • 🔁 Short Tandem Repeats - Straglr genotyping (enable with --str true)
  • 🧪 DNA Methylation - Modkit extraction for ONT (enable with --methyl true)

Requirements

Software

  • Nextflow ≥25.04.0 (DSL2)
  • Container engine: Docker, Singularity/Apptainer, or Podman
  • Java ≥17 (required by Nextflow)

Analysis Type       CPU Cores   Memory      Storage
Single WGS sample   8-16        32-64 GB    100 GB

Notes:

  • Coverage recommendations: ≥10x for accurate SV calling, ≥30x for high-confidence trio analysis
  • Storage includes space for input data, intermediate files, and results
  • Adjust --max_cpus and --max_memory parameters based on available resources
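
The coverage thresholds above can be checked against a mosdepth summary before committing to a long run. The following is a minimal sketch, assuming the usual mosdepth summary layout (tab-separated, with a final `total` row whose fourth column is the mean depth). The function name `coverage_tier` and the tier labels are illustrative and not part of the pipeline.

```shell
# Classify a sample by mean depth, using the 10x / 30x thresholds from the notes above.
# $1: path to a mosdepth summary file (assumed layout: chrom, length, bases, mean, min, max)
coverage_tier() {
    awk -F'\t' '
        $1 == "total" {
            found = 1
            if ($4 >= 30)      print "ok-trio"   # enough for trio analysis
            else if ($4 >= 10) print "ok-sv"     # enough for SV calling
            else               print "low"       # below both recommendations
        }
        END { if (!found) exit 1 }               # no "total" row: signal failure
    ' "$1"
}
```

For example, `coverage_tier results/qc/mosdepth/sample1.mosdepth.summary.txt` would print one of the three labels; the exact file name under `qc/mosdepth/` is an assumption.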

Quick Start

1. Install Nextflow

# Install Nextflow (≥25.04.0)
curl -s https://get.nextflow.io | bash
sudo mv nextflow /usr/local/bin/
 
# Verify installation
nextflow -version
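
To turn the version check into a pass/fail test, a small helper can compare the installed version with the 25.04.0 minimum. This is a sketch only: `version_ge` is a hypothetical helper, it relies on `sort -V` (GNU coreutils), and it assumes `nextflow -version` prints a line of the form `version X.Y.Z build N`.

```shell
# Succeeds when version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Report on the Nextflow requirement if the tool is on PATH.
if command -v nextflow >/dev/null 2>&1; then
    # Assumed output format: a line starting with "version", followed by the number.
    nf_version="$(nextflow -version 2>/dev/null | awk '$1 == "version" {print $2; exit}')"
    if version_ge "$nf_version" 25.04.0; then
        echo "Nextflow $nf_version: OK"
    else
        echo "Nextflow ${nf_version:-unknown}: need >= 25.04.0"
    fi
fi
```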

2. Test the Pipeline

# Run with test data
nextflow run nf-core/longraredisease \
    -profile test,docker \
    --outdir test_results

3. Run with Your Data

Minimal SV-focused run:

nextflow run nf-core/longraredisease \
    --input samplesheet.csv \
    --outdir results \
    --fasta reference.fasta \
    --sequencing_platform ont \
    -profile docker

With family analysis and phenotype prioritization:

nextflow run nf-core/longraredisease \
    --input samplesheet.csv \
    --outdir results \
    --fasta reference.fasta \
    --sequencing_platform ont \
    --trio_analysis true \
    --run_svanna true \
    --svanna_db /path/to/svanna_db \
    -profile docker

See docs/usage.md for complete examples and parameter details.


Input Requirements

Required Inputs

Parameter               Description                        Format          Example
--input                 Samplesheet with sample metadata   CSV             samplesheet.csv
--outdir                Output directory                   Path            ./results
--fasta                 Reference genome FASTA             .fasta/.fa      GRCh38.fasta
--sequencing_platform   Platform type                      ont or pacbio   ont

Samplesheet Format

The input samplesheet is a CSV file with the following columns:

Minimal format (single samples):

sample,bam,bai
sample1,/path/to/sample1.bam,/path/to/sample1.bam.bai
sample2,/path/to/sample2.bam,/path/to/sample2.bam.bai
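
When many BAMs sit in one directory, the minimal samplesheet can be generated rather than typed. The sketch below assumes each index is named `<file>.bam.bai` and takes the sample ID from the BAM file name; `make_samplesheet` is a hypothetical helper, not something shipped with the pipeline.

```shell
# Print a minimal samplesheet (sample,bam,bai) for every indexed BAM in a directory.
# $1: directory containing *.bam and matching *.bam.bai files
make_samplesheet() (
    dir="$1"
    echo "sample,bam,bai"
    for bam in "$dir"/*.bam; do
        [ -e "$bam" ] || continue                       # directory has no BAM files
        if [ ! -e "${bam}.bai" ]; then
            echo "skipping ${bam}: no index found" >&2  # warn, but keep going
            continue
        fi
        echo "$(basename "$bam" .bam),${bam},${bam}.bai"
    done
)
```

A call such as `make_samplesheet /path/to/bams > samplesheet.csv` writes the file; BAMs without an index are reported on stderr and left out.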

Family analysis format (trios):

sample,bam,bai,family,paternal_id,maternal_id,sex,phenotype,hpo_terms
proband,proband.bam,proband.bam.bai,family1,father,mother,1,affected,"HP:0001250,HP:0002066"
father,father.bam,father.bam.bai,family1,0,0,1,unaffected,
mother,mother.bam,mother.bam.bai,family1,0,0,2,unaffected,

Column descriptions:

  • sample - Unique sample identifier
  • bam - Path to aligned BAM file
  • bai - Path to BAM index file
  • family - Family identifier (for trio analysis)
  • paternal_id - Father’s sample ID (or 0 if not in study)
  • maternal_id - Mother’s sample ID (or 0 if not in study)
  • sex - 1 = male, 2 = female, 0 = unknown
  • phenotype - affected or unaffected
  • hpo_terms - Comma-separated HPO terms; wrap the field in double quotes when it holds more than one term (e.g., "HP:0001250,HP:0002066")
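
Malformed HPO identifiers are easy to miss by eye, so it can help to check the hpo_terms value before a run. The sketch below assumes the standard identifier form of `HP:` followed by seven digits; `valid_hpo_list` is a hypothetical helper and takes the field value with the surrounding quotes already removed.

```shell
# Succeeds when every comma-separated term looks like HP:NNNNNNN.
# An empty list is accepted, since unaffected relatives carry no terms.
valid_hpo_list() {
    if [ -z "$1" ]; then
        return 0
    fi
    # grep -v selects terms that do NOT match; finding one means the list is invalid.
    if printf '%s\n' "$1" | tr ',' '\n' | grep -Evq '^HP:[0-9]{7}$'; then
        return 1
    fi
    return 0
}
```

For instance, `valid_hpo_list "HP:0001250,HP:0002066" && echo ok` prints `ok`, while a value such as `HP:1250` makes the function fail.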

Optional Inputs

Parameter      Description               Required For
--bed          Target regions BED file   Targeted sequencing
--annotsv_db   AnnotSV database path     SV annotation
--svanna_db    SVANNA database path      Phenotype prioritization
--str_bed      STR loci BED file         STR analysis

Key Parameters

Core Analysis Toggles

Structural variant analysis is always enabled. Optional analyses:

Parameter   Description                                     Default
--snv       Enable SNV calling (Clair3/DeepVariant)         false
--cnv       Enable CNV detection (Spectre)                  false
--str       Enable STR genotyping (Straglr)                 false
--methyl    Enable methylation calling (Modkit, ONT only)   false

SV Detection Parameters

Parameter            Description                                Default
--run_cutesv         Enable CuteSV caller                       true
--run_svim           Enable SVIM caller (recommended for BNDs)  false
--haplotag_bam       Haplotag BAM for phase-aware SV calling    true
--min_sv_size        Minimum SV size to report (bp)             30
--min_read_support   Minimum supporting reads                   auto

Family Analysis Parameters

Parameter         Description                              Default
--trio_analysis   Enable trio/family-based calling         false
--run_svanna      Enable phenotype-driven prioritization   false
--svanna_db       Path to SVANNA database                  -

Multi-caller Consensus Parameters

Parameter               Description                                 Default
--jasmine_max_dist      Max distance for merging breakpoints (bp)   1000
--jasmine_min_support   Min callers supporting merged variant       2
--jasmine_spec_reads    Min supporting reads for consensus          3
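
To see how a given caller-support threshold would affect an existing merged call set, the support count can be read back from the VCF. This sketch assumes the merged VCF carries a `SUPP=<n>` entry in the INFO column giving the number of supporting callers, as Jasmine output normally does; `count_supported` is a hypothetical helper that reads VCF text on stdin.

```shell
# Count merged variants supported by at least $1 callers.
# Reads uncompressed VCF text on stdin; INFO is column 8.
count_supported() {
    awk -F'\t' -v min="$1" '
        /^#/ { next }                                  # skip header lines
        {
            n = split($8, info, ";")
            for (i = 1; i <= n; i++) {
                if (info[i] ~ /^SUPP=/) {              # does not match SUPP_VEC=
                    sub(/^SUPP=/, "", info[i])
                    if (info[i] + 0 >= min + 0) kept++
                }
            }
        }
        END { print kept + 0 }
    '
}
```

For example, `zcat results/structural_variants/merged/sample1.jasmine.vcf.gz | count_supported 2` reports how many variants at least two callers agree on.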

Platform-specific Settings

Parameter               Description                 Options
--sequencing_platform   Sequencing platform         ont, pacbio
--preset                Minimap2 alignment preset   map-ont, map-hifi, map-pb
--snv_caller            SNV caller choice           clair3, deepvariant

Usage Examples

1. Standard SV Analysis (Single Sample)

nextflow run nf-core/longraredisease \
    --input samplesheet.csv \
    --outdir results \
    --fasta GRCh38.fasta \
    --sequencing_platform ont \
    -profile docker

2. Comprehensive Analysis (SVs + SNVs + CNVs)

nextflow run nf-core/longraredisease \
    --input samplesheet.csv \
    --outdir results \
    --fasta GRCh38.fasta \
    --sequencing_platform pacbio \
    --snv true \
    --cnv true \
    --str true \
    -profile singularity

3. Family Trio Analysis with Phenotype Prioritization

nextflow run nf-core/longraredisease \
    --input trio_samplesheet.csv \
    --outdir family_results \
    --fasta GRCh38.fasta \
    --sequencing_platform ont \
    --trio_analysis true \
    --run_svanna true \
    --svanna_db /databases/svanna_data \
    --annotsv_db /databases/AnnotSV \
    -profile docker

4. High-Sensitivity SV Detection

nextflow run nf-core/longraredisease \
    --input samplesheet.csv \
    --outdir sensitive_results \
    --fasta GRCh38.fasta \
    --sequencing_platform ont \
    --run_svim true \
    --min_sv_size 20 \
    --min_read_support 2 \
    --jasmine_min_support 1 \
    -profile docker

5. Targeted Sequencing with BED File

nextflow run nf-core/longraredisease \
    --input samplesheet.csv \
    --outdir targeted_results \
    --fasta GRCh38.fasta \
    --bed targets.bed \
    --sequencing_platform ont \
    -profile docker

Output Structure

results/
├── pipeline_info/              # Pipeline execution reports
│   ├── execution_report.html   # Resource usage report
│   ├── execution_timeline.html # Process execution timeline
│   └── multiqc_report.html     # Comprehensive QC report

├── qc/                         # Quality control metrics
│   ├── mosdepth/               # Coverage statistics per sample
│   ├── nanoplot/               # Read quality metrics (ONT)
│   └── cramino/                # CRAM-based QC (optional)

├── structural_variants/        # 🎯 PRIMARY OUTPUT: SV calls
│   ├── sniffles/               # Per-sample Sniffles VCFs
│   │   └── {sample}.sniffles.vcf.gz
│   ├── cutesv/                 # Per-sample CuteSV VCFs
│   │   └── {sample}.cutesv.vcf.gz
│   ├── svim/                   # Per-sample SVIM VCFs (if enabled)
│   │   └── {sample}.svim.vcf.gz
│   ├── merged/                 # Multi-caller consensus SVs
│   │   ├── {sample}.jasmine.vcf.gz
│   │   └── {sample}.survivor.vcf.gz
│   ├── annotated/              # AnnotSV annotations
│   │   └── {sample}.annotated.tsv
│   └── svanna/                 # Phenotype-prioritized SVs
│       └── {sample}.svanna.html

├── phasing/                    # Haplotype-resolved results
│   ├── haplotagged_bams/       # Phase-tagged alignments
│   │   └── {sample}.haplotagged.bam
│   ├── whatshap/               # Phasing statistics
│   │   └── {sample}.phased.vcf.gz
│   └── longphase/              # Alternative phasing
│       └── {sample}.longphase.vcf.gz

├── snv_calls/                  # SNVs (if --snv enabled)
│   ├── clair3/
│   │   └── {sample}.clair3.vcf.gz
│   └── deepvariant/
│       └── {sample}.deepvariant.vcf.gz

├── cnv_calls/                  # CNVs (if --cnv enabled)
│   └── spectre/
│       └── {sample}.cnv.vcf.gz

├── str_calls/                  # STRs (if --str enabled)
│   └── straglr/
│       └── {sample}.straglr.tsv

└── methylation/                # Methylation (if --methyl enabled, ONT only)
    └── modkit/
        └── {sample}.bedmethyl.gz

Key output files:

  • Merged SVs: structural_variants/merged/{sample}.jasmine.vcf.gz (high-confidence consensus)
  • Annotated SVs: structural_variants/annotated/{sample}.annotated.tsv (clinical interpretation)
  • QC Report: pipeline_info/multiqc_report.html (overall quality assessment)
  • Phenotype-prioritized: structural_variants/svanna/{sample}.svanna.html (ranked by phenotype match)
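
After a run finishes, a quick check that the key files listed above exist can catch a silently skipped step. The sketch below uses the paths from the output tree; `missing_outputs` is a hypothetical helper, and the annotated and SVANNA files are only expected when the corresponding databases were supplied.

```shell
# Print the key output files that are missing for one sample (prints nothing if all exist).
# $1: output directory passed to --outdir, $2: sample ID
missing_outputs() (
    outdir="$1"
    sample="$2"
    for rel in \
        "structural_variants/merged/${sample}.jasmine.vcf.gz" \
        "structural_variants/annotated/${sample}.annotated.tsv" \
        "pipeline_info/multiqc_report.html" \
        "structural_variants/svanna/${sample}.svanna.html"
    do
        [ -e "${outdir}/${rel}" ] || echo "${rel}"
    done
)
```

Running `missing_outputs results sample1` lists any absent paths relative to the output directory.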

Configuration Profiles

Available Profiles:

  • test: Minimal test dataset
  • docker: Use Docker containers
  • singularity: Use Singularity containers

Custom Configuration

// custom.config
params {
    max_cpus = 16
    max_memory = '64.GB'
    outdir = '/scratch/results'
}
 
process {
    withName: 'CLAIR3' {
        cpus = 8
        memory = '32.GB'
    }
}

Run with:

nextflow run nf-core/longraredisease \
    --input samplesheet.csv \
    --fasta GRCh38.fasta \
    --sequencing_platform ont \
    -c custom.config \
    -profile docker

Family-Based Analysis

Trio/Family Configuration

For family-based SV analysis, provide pedigree information in your samplesheet:

sample,bam,bai,family,paternal_id,maternal_id,sex,phenotype,hpo_terms
child_001,child.bam,child.bam.bai,FAM001,father_001,mother_001,2,affected,"HP:0001250,HP:0002066,HP:0001263"
father_001,father.bam,father.bam.bai,FAM001,0,0,1,unaffected,
mother_001,mother.bam,mother.bam.bai,FAM001,0,0,2,unaffected,

Sex encoding: 1 = male, 2 = female, 0 = unknown

Parental IDs: Use 0 for founders (individuals with no parents in the study)
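
Pedigree mistakes (a parent ID that matches no sample, or a parent with the wrong sex code) are a common reason trio analysis finds nothing. The sketch below checks the columns described above, assuming the column order shown in the example and no commas inside the first eight fields; `check_pedigree` is a hypothetical helper, not a pipeline command.

```shell
# Report parent IDs that are absent from the samplesheet or carry an unexpected sex code.
# $1: samplesheet path; exits non-zero when a problem is found.
check_pedigree() {
    awk -F',' '
        function is_founder(id) { return id == "" || id == "0" }
        NR == 1   { next }                              # skip the header row
        $1 == ""  { next }                              # skip blank lines
        { seen[$1] = 1; father[$1] = $5; mother[$1] = $6; sex[$1] = $7 }
        END {
            for (s in seen) {
                f = father[s]; m = mother[s]
                if (!is_founder(f)) {
                    if (!(f in seen))       { print s ": father " f " not in samplesheet"; bad = 1 }
                    else if (sex[f] != "1") { print s ": father " f " is not coded as male (1)"; bad = 1 }
                }
                if (!is_founder(m)) {
                    if (!(m in seen))       { print s ": mother " m " not in samplesheet"; bad = 1 }
                    else if (sex[m] != "2") { print s ": mother " m " is not coded as female (2)"; bad = 1 }
                }
            }
            exit bad + 0
        }
    ' "$1"
}
```

`check_pedigree trio_samplesheet.csv` prints nothing and succeeds for the example trio above.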

De Novo SV Detection

Enable trio analysis to identify de novo structural variants:

nextflow run nf-core/longraredisease \
    --input trio_samplesheet.csv \
    --trio_analysis true \
    --outdir trio_results \
    --fasta GRCh38.fasta \
    --sequencing_platform ont \
    -profile docker

The pipeline will:

  1. ✅ Call SVs in each family member independently
  2. ✅ Merge calls using JASMINE with family-aware parameters
  3. ✅ Identify variants present in child but absent in parents
  4. ✅ Filter based on read support and quality metrics

Phenotype-Driven Prioritization (SVANNA)

When HPO terms are provided, SVANNA ranks SVs by phenotype relevance:

nextflow run nf-core/longraredisease \
    --input trio_samplesheet.csv \
    --trio_analysis true \
    --run_svanna true \
    --svanna_db /path/to/svanna/2302 \
    --outdir prioritized_results \
    --fasta GRCh38.fasta \
    --sequencing_platform ont \
    -profile docker

Required: Download SVANNA database from Monarch Initiative

Output: HTML report ranking SVs by:

  • Overlap with disease-associated genes
  • Regulatory impact predictions
  • Phenotype similarity scores
  • De novo status (if trio data available)

Annotation with AnnotSV

Enable comprehensive SV annotation:

nextflow run nf-core/longraredisease \
    --input samplesheet.csv \
    --annotsv_db /path/to/AnnotSV_db \
    --outdir annotated_results \
    --fasta GRCh38.fasta \
    --sequencing_platform ont \
    -profile docker

AnnotSV provides:

  • Gene overlap and functional impact
  • ClinGen/ClinVar annotations
  • DGV/gnomAD population frequencies
  • Pathogenicity predictions (ACMG criteria)
  • Regulatory element disruption

Troubleshooting

Common Issues

1. Low SV Detection Rate

Symptoms: Fewer SVs than expected

Solutions:

# Lower read support threshold
--min_read_support 2
 
# Reduce minimum SV size
--min_sv_size 20
 
# Enable SVIM for better breakend detection
--run_svim true
 
# Lower consensus requirement
--jasmine_min_support 1

2. High False Positive Rate

Symptoms: Many low-quality SV calls

Solutions:

# Increase read support
--min_read_support 5
 
# Require multiple caller agreement
--jasmine_min_support 2
 
# Increase minimum SV size
--min_sv_size 50
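
Before re-running with a stricter size cutoff, the effect can be previewed on an existing VCF. This is a sketch under stated assumptions: it reads the `SVLEN` entry of the INFO column, keeps records without one (such as breakends), and does not reproduce how the callers themselves apply `--min_sv_size`. `filter_by_svlen` is a hypothetical helper.

```shell
# Keep header lines and records whose absolute SVLEN is at least $1 bp.
# Reads uncompressed VCF text on stdin and writes the filtered VCF to stdout.
filter_by_svlen() {
    awk -F'\t' -v min="$1" '
        /^#/ { print; next }                            # pass headers through
        {
            has = 0
            n = split($8, info, ";")
            for (i = 1; i <= n; i++) {
                if (info[i] ~ /^SVLEN=/) { len = substr(info[i], 7) + 0; has = 1 }
            }
            if (!has) { print; next }                   # no SVLEN (e.g. BND): keep
            if (len < 0) len = -len                     # deletions carry negative lengths
            if (len >= min + 0) print
        }
    '
}
```

For example, `zcat sample1.jasmine.vcf.gz | filter_by_svlen 50 | grep -vc '^#'` counts the records that would survive a 50 bp cutoff.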

3. Memory Issues

Symptoms: Process killed due to OOM

Solutions:

# Increase max memory
--max_memory 128.GB
 
# Cap the CPUs a single process may request
--max_cpus 16
 
# Use chromosome-based parallelization (automatic)

4. Missing De Novo Variants

Symptoms: Expected de novo variants not detected

Checklist:

  • ✅ Ensure --trio_analysis true is set
  • ✅ Verify pedigree information in samplesheet
  • ✅ Check read coverage in all samples (≥30×)
  • ✅ Review structural_variants/merged/ for family calls
  • ✅ Lower --jasmine_min_support if needed

5. SVANNA Database Issues

Symptoms: SVANNA fails or produces no rankings

Solutions:

# Verify database path and version
ls -lh /path/to/svanna/2302
 
# Ensure HPO terms are valid (HP:XXXXXXX format)
# Check samplesheet for proper HPO term formatting
 
# Download latest SVANNA database:
wget https://storage.googleapis.com/svanna-db/svanna-data-2302.tar.gz
tar -xzf svanna-data-2302.tar.gz

Performance Optimization

For large cohorts (>10 samples):

# Raise the resource ceilings available to the pipeline
--max_cpus 64
--max_memory 256.GB

# Use Singularity for better resource isolation
-profile singularity

# Reuse cached results from earlier runs and render the workflow DAG
-resume -with-dag flowchart.html

For whole genome sequencing:

  • Expect 8-24 hours runtime (depending on coverage)
  • Allocate 64-128GB RAM per sample for SV calling
  • Use SSD storage for work directory (I/O intensive)

Test Data

The pipeline includes test data for validation:

  • Location: assets/test_data/
  • Genome: Chromosome 22 subset
  • Samples: Simulated nanopore data
  • Runtime: ~10-15 minutes

Getting Help

Debugging Failed Runs:

# Check Nextflow log for detailed errors
less .nextflow.log
 
# Resume from the last successful step (repeat your original options and add -resume)
nextflow run nf-core/longraredisease -resume
 
# Enable debug mode for verbose output
nextflow run nf-core/longraredisease --debug -profile docker

Reporting Issues:

When reporting issues, please include:

  • Nextflow version (nextflow -version)
  • Command used to run the pipeline
  • Relevant error messages from .nextflow.log
  • Sample metadata (anonymized if sensitive)
  • System specifications (CPU, RAM, storage)

Citation

If you use nf-core/longraredisease in your research, please cite:

nf-core/longraredisease: A Nextflow pipeline for long-read sequencing analysis in rare disease research

Citation to be added upon publication.

Additionally, please cite the tools used in your analysis:

Core SV Tools:

  • Sniffles2: Smolka et al. (2024) Nature Biotechnology
  • CuteSV: Jiang et al. (2020) Genome Biology
  • JASMINE: Kirsche et al. (2023) Nature Methods
  • LongPhase: Lin et al. (2022) Bioinformatics
  • AnnotSV: Geoffroy et al. (2018) Bioinformatics

Optional Analysis Tools:

  • SVANNA: Danis et al. (2022) Genome Medicine
  • Clair3: Zheng et al. (2022) Nature Computational Science
  • Spectre: Suvakov et al. (2021) Genome Research
  • Straglr: Chiu et al. (2021) Genome Biology

Contributing

Contributions are welcome! To contribute:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Make your changes following nf-core guidelines
  4. Test with nextflow run . -profile test,docker
  5. Commit your changes (git commit -m 'Add AmazingFeature')
  6. Push to the branch (git push origin feature/AmazingFeature)
  7. Open a Pull Request

Please ensure:

  • ✅ Code follows nf-core style guidelines
  • ✅ All tests pass successfully
  • ✅ Documentation is updated accordingly
  • ✅ Commit messages are descriptive

License

This project is licensed under the MIT License – see the LICENSE file for details.


Acknowledgments

This pipeline was developed with support from [institution/funding sources]. We thank the nf-core community for infrastructure and best practices, and all tool developers whose software makes this pipeline possible.


Pipeline Version: 1.0.0
Nextflow Version: ≥25.04.0
Last Updated: 2024