nf-core/longraredisease
Long read sequencing pipeline to identify variants in patients with neurodevelopmental disorders
Introduction
nf-core/longraredisease is a comprehensive Nextflow pipeline designed for rare disease diagnostics using Oxford Nanopore long-read sequencing data. The pipeline integrates multiple state-of-the-art tools for variant discovery, including:
- Structural Variants (SVs): Sniffles, CuteSV, SVIM with SURVIVOR merging
- Single Nucleotide Variants (SNVs): Clair3 and DeepVariant
- Copy Number Variants (CNVs): Spectre and HiFiCNV
- Short Tandem Repeats (STRs): Straglr
- Methylation: Modkit
- Phasing: LongPhase with haplotagging
- Quality Control: NanoPlot, mosdepth, and MultiQC
The pipeline supports singleton analysis as well as family-based (trio) analyses with variant annotation and phenotype-driven prioritization.
Samplesheet input
You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location:

    --input '[path to samplesheet file]'

The samplesheet is a comma-separated file (CSV) with a header row and the following columns:
Samplesheet format
| Column | Required | Description |
|---|---|---|
| `sample` | Yes | Unique sample identifier (no spaces) |
| `file_path` | Yes | Path to input file or directory (see details below) |
| `hpo_terms` | No | HPO terms for phenotype annotation (format: `HP:0000001;HP:0000002`) |
| `sex` | No | Sex of the sample (1=male, 2=female, 0=unknown) |
| `phenotype` | No | Phenotype status (1=unaffected, 2=affected, 0 or -9=missing) |
| `family_id` | No | Family identifier (required for trio analysis) |
| `maternal_id` | No | Sample ID of the maternal sample (must match another sample in the samplesheet) |
| `paternal_id` | No | Sample ID of the paternal sample (must match another sample in the samplesheet) |
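
The column rules in the table can be checked before launching a run. The following is a minimal sketch, not the validation the pipeline itself performs: `check_samplesheet` is a hypothetical helper, and it assumes that all eight header columns are present (optional values left empty) and that fields are not quoted.

```bash
# Sketch only: basic format checks for a samplesheet, not the validator used by the pipeline.
# Assumes all eight header columns are present and fields are unquoted.
# Usage: check_samplesheet samplesheet.csv   (returns non-zero if any check fails)
check_samplesheet() {
    awk -F',' '
        { sub(/\r$/, "") }                      # tolerate Windows line endings
        NR == 1 {
            expected = "sample,file_path,hpo_terms,sex,phenotype,family_id,maternal_id,paternal_id"
            if ($0 != expected) { print "line 1: unexpected header"; bad = 1 }
            next
        }
        $0 == "" { next }                       # ignore blank lines
        NF != 8 { print "line " NR ": expected 8 fields, found " NF; bad = 1; next }
        $1 == "" || $1 ~ / / { print "line " NR ": sample must be non-empty, without spaces"; bad = 1 }
        $2 == "" { print "line " NR ": file_path is required"; bad = 1 }
        $3 != "" && $3 !~ /^HP:[0-9]+(;HP:[0-9]+)*$/ { print "line " NR ": hpo_terms must look like HP:0000001;HP:0000002"; bad = 1 }
        $4 != "" && $4 !~ /^[012]$/ { print "line " NR ": sex must be 0, 1 or 2"; bad = 1 }
        $5 != "" && $5 !~ /^(0|1|2|-9)$/ { print "line " NR ": phenotype must be 0, 1, 2 or -9"; bad = 1 }
        seen[$1]++ { print "line " NR ": duplicate sample " $1; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}
```

Calling `check_samplesheet samplesheet.csv` prints one message per problem and returns a non-zero status if any row fails, so it can gate a launch script.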
Input file types
The file_path column can contain:
- Directory path: A directory containing multiple files of the specified `input_type`
  - For `fastq` input: directory with `.fastq.gz` or `.fq.gz` files
  - For `ubam` input: directory with unaligned `.bam` files
- Single file path: Path to a single file
  - For `fastq` input: single `.fastq.gz` or `.fq.gz` file
  - For `bam` or `ubam` input: single `.bam` file

The input_type is specified as a pipeline parameter (see Workflow Options).
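
To see how a given `file_path` entry fits these rules, the sketch below classifies a path for a chosen `input_type`. `classify_input` is a hypothetical helper written for illustration; in particular, rejecting a directory for `bam` input is an assumption drawn from the list above, which only describes directories for `fastq` and `ubam`.

```bash
# Sketch: report how a file_path entry would be read for a given --input_type.
# Prints "directory" or "file"; returns non-zero when the path does not fit the rules above.
classify_input() (
    path=$1
    input_type=$2
    case "$input_type" in
        fastq)    suffixes=".fastq.gz .fq.gz" ;;
        ubam|bam) suffixes=".bam" ;;
        *) echo "unknown input_type: $input_type" >&2; exit 2 ;;
    esac
    if [ -d "$path" ]; then
        # Assumption: directory input is only described for fastq and ubam.
        [ "$input_type" = bam ] && { echo "directory input is not described for bam" >&2; exit 1; }
        for suffix in $suffixes; do
            for f in "$path"/*"$suffix"; do
                [ -f "$f" ] && { echo directory; exit 0; }
            done
        done
        echo "no matching files in $path" >&2
        exit 1
    elif [ -f "$path" ]; then
        for suffix in $suffixes; do
            case "$path" in *"$suffix") echo file; exit 0 ;; esac
        done
        echo "unexpected extension for $input_type: $path" >&2
        exit 1
    fi
    echo "path not found: $path" >&2
    exit 1
)
```

For example, `classify_input /path/to/fam1_child/ fastq` prints `directory` when the directory holds at least one `.fastq.gz` or `.fq.gz` file.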
Example samplesheets
Single sample (FASTQ input)

    sample,file_path,hpo_terms,sex,phenotype,family_id,maternal_id,paternal_id
    sample1,/path/to/sample1.fastq.gz,HP:0002721;HP:0002110,1,2,,,

Trio analysis

    sample,file_path,hpo_terms,sex,phenotype,family_id,maternal_id,paternal_id
    proband,/path/to/proband.fastq.gz,HP:0002721;HP:0001263,1,2,family1,mother,father
    mother,/path/to/mother.fastq.gz,,2,1,family1,,
    father,/path/to/father.fastq.gz,,1,1,family1,,

Multiple families

    sample,file_path,hpo_terms,sex,phenotype,family_id,maternal_id,paternal_id
    fam1_child,/path/to/fam1_child/,HP:0001263,1,2,family1,fam1_mom,fam1_dad
    fam1_mom,/path/to/fam1_mom/,,,,family1,,
    fam1_dad,/path/to/fam1_dad/,,,,family1,,
    fam2_child,/path/to/fam2_child.fastq.gz,HP:0002110,2,2,family2,fam2_mom,fam2_dad
    fam2_mom,/path/to/fam2_mom.fastq.gz,,,,family2,,
    fam2_dad,/path/to/fam2_dad.fastq.gz,,,,family2,,

Example samplesheets are provided in the `assets/` directory.
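
The trio examples above rely on every `maternal_id` and `paternal_id` matching another `sample` in the same file. A minimal sketch of that cross-check follows; `check_pedigree` is a hypothetical helper, it assumes the eight-column layout shown in the examples, and it is not the check the pipeline runs internally.

```bash
# Sketch: verify that parent IDs refer to other samples in the same samplesheet.
# The file is read twice: pass 1 collects sample IDs, pass 2 checks the parent columns.
# Usage: check_pedigree samplesheet.csv   (returns non-zero if any reference is broken)
check_pedigree() {
    awk -F',' '
        { sub(/\r$/, "") }                      # tolerate Windows line endings
        FNR == 1 { next }                       # skip the header in both passes
        NR == FNR { if ($1 != "") samples[$1] = 1; next }
        {
            if ($7 != "" && (!($7 in samples) || $7 == $1)) {
                print "line " FNR ": maternal_id " $7 " does not match another sample"; bad = 1
            }
            if ($8 != "" && (!($8 in samples) || $8 == $1)) {
                print "line " FNR ": paternal_id " $8 " does not match another sample"; bad = 1
            }
            if (($7 != "" || $8 != "") && $6 == "") {
                print "line " FNR ": family_id is required when parents are given"; bad = 1
            }
        }
        END { exit bad + 0 }
    ' "$1" "$1"
}
```

Running `check_pedigree samplesheet_trio.csv` on the trio example prints nothing and returns zero; removing the `father` row would report the dangling `paternal_id` on the proband line.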
Reference genome
The pipeline requires a reference genome in FASTA format. You can provide this using:

    --fasta '/path/to/reference.fasta'

The pipeline supports common reference genomes (GRCh37/hg19, GRCh38/hg38). If the reference has no index (`.fai` file) alongside it, the pipeline creates one automatically.
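
Indexing ahead of time is optional, but it can be convenient when the same reference is shared across runs. The sketch below checks for an existing `.fai` file and otherwise builds one with `samtools faidx`. `ensure_fai` is a hypothetical helper, and it assumes `samtools` is installed separately, which the pipeline itself does not require of the user.

```bash
# Sketch: make sure <fasta>.fai exists, creating it with samtools when missing.
# Usage: ensure_fai /path/to/reference.fasta
ensure_fai() (
    fasta=$1
    if [ -s "$fasta.fai" ]; then
        echo "index present: $fasta.fai"
    elif command -v samtools >/dev/null 2>&1; then
        # samtools faidx writes the index next to the FASTA as <fasta>.fai
        samtools faidx "$fasta" && echo "index created: $fasta.fai"
    else
        echo "no index found and samtools is not on PATH" >&2
        exit 1
    fi
)
```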
Workflow options
Input type
Specify the type of input files using the --input_type parameter:

    --input_type 'fastq'   # FASTQ files
    --input_type 'ubam'    # Unaligned BAM files (default)
    --input_type 'bam'     # Aligned BAM files

Sequencing platform
Specify the sequencing platform:

    --sequencing_platform 'ont'      # Oxford Nanopore (default)
    --sequencing_platform 'hifi'     # PacBio HiFi
    --sequencing_platform 'pacbio'   # PacBio CLR

Alignment options
By default, the pipeline uses Minimap2 for alignment. You can customize the alignment model or use Winnowmap instead:

    --minimap2_model 'map-ont'                        # Minimap2 preset (default: auto-detected)
    --use_winnowmap true                              # Use Winnowmap instead of Minimap2
    --winnowmap_model 'map-ont'                       # Winnowmap preset
    --winnowmap_kmers '/path/to/repetitive_k15.txt'   # Repetitive k-mer file

Analysis modules
Enable or disable specific analysis modules:

    --skip_snv false           # Enable SNV calling (default: true)
    --skip_sv false            # Enable SV calling (default: true)
    --skip_cnv false           # Enable CNV calling (default: true)
    --skip_str false           # Enable STR analysis (default: true)
    --skip_methylation false   # Enable methylation calling (default: true)
    --skip_phasing false       # Enable phasing (default: true)

Trio analysis
Enable family-based analysis for samples with pedigree information:

    --trio_analysis true   # Enable trio analysis (default: false)
    --haplotag_bam true    # Haplotag BAM files with phase information (default: true)

The typical command for running the pipeline is as follows:

    nextflow run nf-core/longraredisease \
        --input ./samplesheet.csv \
        --outdir ./results \
        --fasta /path/to/reference.fasta \
        --input_type fastq \
        -profile docker

This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles.
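
The values accepted by `--input_type` and `--sequencing_platform` under Workflow options can be checked before Nextflow is started. The sketch below assembles the command line and rejects values outside those documented lists. `print_run_command` is a hypothetical helper that only prints the command; it assumes paths without spaces.

```bash
# Sketch: validate documented option values, then print (not run) the launch command.
# Usage: print_run_command SAMPLESHEET OUTDIR FASTA PROFILE [INPUT_TYPE] [PLATFORM]
print_run_command() (
    [ "$#" -ge 4 ] || { echo "usage: print_run_command SAMPLESHEET OUTDIR FASTA PROFILE [INPUT_TYPE] [PLATFORM]" >&2; exit 2; }
    input=$1
    outdir=$2
    fasta=$3
    profile=$4
    input_type=${5:-ubam}      # documented default
    platform=${6:-ont}         # documented default
    case "$input_type" in
        fastq|ubam|bam) ;;
        *) echo "invalid --input_type: $input_type" >&2; exit 1 ;;
    esac
    case "$platform" in
        ont|hifi|pacbio) ;;
        *) echo "invalid --sequencing_platform: $platform" >&2; exit 1 ;;
    esac
    printf 'nextflow run nf-core/longraredisease --input %s --outdir %s --fasta %s --input_type %s --sequencing_platform %s -profile %s\n' \
        "$input" "$outdir" "$fasta" "$input_type" "$platform" "$profile"
)
```

Printing rather than executing keeps the helper safe to try out; the output can be reviewed and then pasted into a terminal or a job script.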
### Example commands
#### Basic singleton analysis
```bash
nextflow run nf-core/longraredisease \
    --input samplesheet.csv \
    --outdir results \
    --fasta GRCh38.fasta \
    --input_type fastq \
    -profile docker
```

#### Trio analysis with all modules enabled

```bash
nextflow run nf-core/longraredisease \
    --input samplesheet_trio.csv \
    --outdir results_trio \
    --fasta GRCh38.fasta \
    --input_type ubam \
    --trio_analysis true \
    --haplotag_bam true \
    -profile docker
```

#### SV-focused analysis

```bash
nextflow run nf-core/longraredisease \
    --input samplesheet.csv \
    --outdir results_sv \
    --fasta GRCh38.fasta \
    --skip_snv true \
    --skip_cnv true \
    --skip_str true \
    --skip_methylation true \
    --skip_phasing true \
    -profile singularity
```

#### Analysis with target regions

```bash
nextflow run nf-core/longraredisease \
    --input samplesheet.csv \
    --outdir results_targeted \
    --fasta GRCh38.fasta \
    --filter_targets true \
    --targets_bed exome_targets.bed \
    -profile docker
```

Pipeline parameters can also be supplied in a params file via `-params-file`, for example:

```bash
nextflow run nf-core/longraredisease -profile docker -params-file params.yaml
```

with:
```yaml title="params.yaml"
input: './samplesheet.csv'
outdir: './results/'
genome: 'GRCh37'
<...>
```

You can also generate such YAML/JSON files via nf-core/launch.
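
The SV-focused example above passes one `--skip_*` flag for every module it turns off. When only one or two modules are wanted, those flags can be generated instead of typed. This is a minimal sketch: `skip_flags_except` is a hypothetical helper, and its module list is taken from the `--skip_*` parameters named in this document.

```bash
# Sketch: print "--skip_<module> true" for every module NOT named as an argument.
# Usage: skip_flags_except sv          (keeps SV calling, skips the rest)
skip_flags_except() (
    modules="snv sv cnv str methylation phasing"
    # Reject names that are not in the documented module list.
    for wanted in "$@"; do
        case " $modules " in
            *" $wanted "*) ;;
            *) echo "unknown module: $wanted" >&2; exit 1 ;;
        esac
    done
    keep=" $* "
    out=""
    for module in $modules; do
        case "$keep" in
            *" $module "*) ;;                          # requested: leave enabled
            *) out="$out --skip_$module true" ;;       # not requested: skip it
        esac
    done
    echo "${out# }"
)
```

The output can be spliced into a command with command substitution, for example `nextflow run nf-core/longraredisease ... $(skip_flags_except sv) -profile singularity`, which reproduces the flags of the SV-focused example.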
Updating the pipeline
When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. Subsequent runs always use the cached version if it is available, even if the pipeline has been updated since. To make sure that you are running the latest version, update the cached copy regularly:

    nextflow pull nf-core/longraredisease

Reproducibility
It is a good idea to specify the pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you’ll be running the same version of the pipeline, even if there have been changes to the code since.
First, go to the nf-core/longraredisease releases page and find the latest pipeline version, which is numeric only (e.g. `1.0.0`). Then specify it when running the pipeline with `-r` (one hyphen), e.g. `-r 1.0.0`. You can switch to another version by changing the number after the `-r` flag.
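
In scripted runs, a small guard can catch a mistyped version tag before it reaches Nextflow. The sketch below accepts only numeric, dot-separated tags of the kind described above and prints the matching `-r` argument. `revision_flag` is a hypothetical helper, and it does not check that the release actually exists.

```bash
# Sketch: turn a numeric-only version tag into a "-r <version>" argument.
# Usage: revision_flag 1.0.0
revision_flag() {
    if printf '%s\n' "$1" | grep -Eq '^[0-9]+(\.[0-9]+)*$'; then
        printf '%s %s\n' -r "$1"
    else
        echo "expected a numeric version such as 1.0.0, got: $1" >&2
        return 1
    fi
}
```

A launch script could then use `nextflow run nf-core/longraredisease $(revision_flag 1.0.0) ...` so that every run records the pinned version explicitly.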