nf-core/epigenomesegmentation
An nf-core pipeline for epigenome segmentation using EpiSegMix/Meth — a hidden Markov model with flexible read count distributions and state duration modeling for histone, open chromatin, and methylation signals.
Introduction
nf-core/epigenomesegmentation is a bioinformatics pipeline for chromatin segmentation. It uses a hidden Markov model (HMM) to annotate genomic regions with functional states (e.g., enhancers, promoters) based on combinations of epigenetic modifications, capturing spatial relations via transition probabilities.
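To make the segmentation idea concrete, the following is a minimal sketch of HMM decoding over binned signal. It is not the EpiSegMix implementation: the two state names, the binary low/high observation, and every probability below are hypothetical values chosen for illustration, whereas EpiSegMix models read counts with flexible distributions.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state path for a sequence of observations."""
    # Work in log space to avoid numerical underflow on long sequences.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    back = []
    for obs in observations[1:]:
        prev = scores[-1]
        current, pointers = {}, {}
        for s in states:
            # Best predecessor for state s, given the transition probabilities.
            best_prev = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            current[s] = (prev[best_prev] + math.log(trans_p[best_prev][s])
                          + math.log(emit_p[s][obs]))
            pointers[s] = best_prev
        scores.append(current)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical toy model: all names and numbers are illustrative assumptions.
STATES = ("background", "enhancer")
START = {"background": 0.8, "enhancer": 0.2}
TRANS = {
    "background": {"background": 0.9, "enhancer": 0.1},
    "enhancer": {"background": 0.2, "enhancer": 0.8},
}
EMIT = {
    "background": {"low": 0.9, "high": 0.1},
    "enhancer": {"low": 0.2, "high": 0.8},
}

# One observation per genomic bin.
signal = ["low", "low", "high", "high", "high", "low", "low"]
segmentation = viterbi(signal, STATES, START, TRANS, EMIT)
```

The high self-transition probabilities are what keep neighbouring bins in the same state, which is the spatial relation the transition matrix is meant to capture.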
Default Workflow: Standard Mode
By default, the EPIGENOMESEGMENTATION pipeline runs in Standard Mode (--standard true, or no mode specified, together with --merge false). This primary pathway processes histone data (BAM files) to generate segmentation models and bypasses methylation processing.
Execution Steps
1. Genome Preparation (PREPARE_GENOME & GENERATE_BINS): Fetches chromosome size files and generates the required genomic bins.
2. Processing Branching (Histone & Methylation): Parses the input samplesheet and evaluates the --merge flag:
   - Histone Processing (PROCESS_HISTONES): Maps BAM files against genomic bins to extract count matrices.
   - Methylation Processing (PROCESS_METHYL): Triggered if --merge true. Processes BED files at base-pair resolution with merged +/- strands to maintain signal fidelity.
3. Count Merging & Synchronization (MERGE_DATA): If --merge is enabled, the pipeline intersects the processed matrices. This synchronizes the Histone (binned) and Methylation (base-pair) data into a consistent windowed format to ensure all multi-omic layers are aligned to the same coordinate system.
4. Segmentation Modeling: Based on the selected parameters (--standard, --duration, or --dna), the data is routed through a specific modeling subworkflow:
   - Standard (MODEL_TRAINING_STD): Default HMM-based segmentation.
   - Duration-Aware (MODEL_TRAINING_DM): Incorporates state duration modeling.
   - DNA-Centric (MODEL_TRAINING_DNA): Optimized for DNA-specific features.
   - Note: Each subworkflow executes four consecutive modules: prepare, train, decode, and report.
5. Distribution Fitting & Automated Selection (DISTRIBUTION_FITTING): If the fitting mode is triggered, the pipeline identifies the optimal statistical distribution for the data using:
   - distfit_histone_train: Trains models across statistical distributions.
   - distfit_histone_assess: Evaluates and selects the distribution with the best fit.
   - Automated Step: If the --best_fit_segmentation flag is present, the pipeline automatically executes the segmentation workflow (Step 4). By default, this runs in standard mode unless --duration is explicitly specified.
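The count merging step can be pictured with a small sketch that aggregates base-pair methylation calls into the same windows as the histone counts and keeps only the windows present in both layers. This is an illustration under stated assumptions, not the MERGE_DATA implementation: the 200 bp bin width, the record layout, and the function names are all hypothetical.

```python
from collections import defaultdict

BIN_SIZE = 200  # hypothetical bin width in base pairs

def bin_methylation(records, bin_size=BIN_SIZE):
    """Sum methylated and total read counts per (chrom, bin_start) window.

    Each record is (chrom, position, methylated_reads, total_reads).
    """
    binned = defaultdict(lambda: [0, 0])
    for chrom, pos, methylated, total in records:
        start = (pos // bin_size) * bin_size
        binned[(chrom, start)][0] += methylated
        binned[(chrom, start)][1] += total
    return {key: tuple(value) for key, value in binned.items()}

def merge_layers(histone_counts, methylation_bins):
    """Join histone counts and binned methylation on shared windows only."""
    shared = sorted(set(histone_counts) & set(methylation_bins))
    return [
        (chrom, start, histone_counts[(chrom, start)],
         *methylation_bins[(chrom, start)])
        for chrom, start in shared
    ]

# Toy inputs: histone read counts per bin and three CpG sites.
histone = {("chr1", 0): 12, ("chr1", 200): 3, ("chr1", 400): 7}
cpgs = [("chr1", 10, 4, 5), ("chr1", 150, 1, 5), ("chr1", 230, 0, 8)]

merged = merge_layers(histone, bin_methylation(cpgs))
```

Each merged row holds the window coordinates, the histone count, and the methylated and total read counts, so every layer refers to the same coordinate system.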
Subworkflow Reference
The pipeline logic is organized into the following modular components:
| Category | Subworkflows |
|---|---|
| Setup | PREPARE_GENOME, GENERATE_BINS |
| Data Processing | PROCESS_HISTONES, PROCESS_METHYL, MERGE_DATA |
| Modeling Modes | MODEL_TRAINING_STD, MODEL_TRAINING_DM, MODEL_TRAINING_DNA |
| Optimization | DISTRIBUTION_FITTING |
Note: You can set the execution mode using the primary --episegmix_mode flag (e.g., standard, duration, dna, or fitting) or by using the direct shortcut flags --standard, --duration, --dna, or --fitting.
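As a reading aid, the routing described in this section can be sketched as a small function that lists which subworkflows would run for a given mode. It reflects only what this README states; the function name, its arguments, the ordering, and the assumption that histone processing always runs are illustrative and not taken from the pipeline code.

```python
MODEL_SUBWORKFLOWS = {
    "standard": "MODEL_TRAINING_STD",
    "duration": "MODEL_TRAINING_DM",
    "dna": "MODEL_TRAINING_DNA",
}

def planned_subworkflows(mode="standard", merge=False,
                         best_fit_segmentation=False, duration=False):
    """Return the subworkflows expected to run, in README order (a sketch)."""
    plan = ["PREPARE_GENOME", "GENERATE_BINS", "PROCESS_HISTONES"]
    if merge:
        # Methylation processing and merging are only triggered by --merge.
        plan += ["PROCESS_METHYL", "MERGE_DATA"]
    if mode == "fitting":
        plan.append("DISTRIBUTION_FITTING")
        if best_fit_segmentation:
            # Segmentation after fitting is standard unless --duration is set.
            follow_up = "duration" if duration else "standard"
            plan.append(MODEL_SUBWORKFLOWS[follow_up])
    elif mode in MODEL_SUBWORKFLOWS:
        plan.append(MODEL_SUBWORKFLOWS[mode])
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return plan

default_plan = planned_subworkflows()
merged_dna_plan = planned_subworkflows(mode="dna", merge=True)
fitting_plan = planned_subworkflows(mode="fitting", best_fit_segmentation=True,
                                    duration=True)
```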
Usage
If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.
First, prepare a samplesheet with your input data that looks as follows:
samplesheet.csv:
sample_id,replicate,epigenetic_mark,file_name,modality,paired_end,distribution
Kidney,1,H3K27ac,../data/kidney/histone/kidney_H3K27ac.bam,ChIP-seq,true,NBI
Kidney,2,WGBS,../data/kidney/wgbs/kidney_WGBS.bed,WGBS,true,BI

Each row represents a specific assay file associated with a sample. The pipeline automatically distinguishes between histone data and methylation data based on the file extension.
Column Specifications
- sample_id: A unique identifier for your sample (e.g., Kidney). Files sharing the same sample_id will be grouped and processed together.
- replicate: The replicate number for the sample (e.g., 1).
- epigenetic_mark: The specific target or assay type (e.g., H3K27ac for histones, WGBS for methylation).
- file_name: The file path. Histone data must be .bam or .bam.gz. Methylation data must be .bed or .bed.gz.
- modality: The type of experiment performed (e.g., ChIP-seq, WGBS).
- paired_end: A boolean value (true or false) indicating if the sequencing data is paired-end.
- distribution: The statistical distribution to apply during model training for this mark (e.g., NBI for Negative Binomial, BI for Binomial). Leave empty to use global defaults.
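Before launching a run, it can help to check a samplesheet against the column rules above. The following is a minimal pre-flight sketch, not the pipeline's own validation: the function names and error messages are hypothetical, and only the extension and paired_end rules stated in this README are checked.

```python
import csv
import io

HISTONE_SUFFIXES = (".bam", ".bam.gz")
METHYLATION_SUFFIXES = (".bed", ".bed.gz")

def classify_row(row):
    """Label a samplesheet row as histone or methylation by file extension."""
    name = row["file_name"]
    if name.endswith(HISTONE_SUFFIXES):
        return "histone"
    if name.endswith(METHYLATION_SUFFIXES):
        return "methylation"
    raise ValueError(f"unsupported file extension: {name}")

def read_samplesheet(handle):
    """Parse the CSV and attach a data_type field to every row."""
    rows = []
    for row in csv.DictReader(handle):
        if row["paired_end"] not in ("true", "false"):
            raise ValueError(f"paired_end must be true or false: {row['paired_end']}")
        rows.append({**row, "data_type": classify_row(row)})
    return rows

# The example samplesheet from this README.
EXAMPLE = """sample_id,replicate,epigenetic_mark,file_name,modality,paired_end,distribution
Kidney,1,H3K27ac,../data/kidney/histone/kidney_H3K27ac.bam,ChIP-seq,true,NBI
Kidney,2,WGBS,../data/kidney/wgbs/kidney_WGBS.bed,WGBS,true,BI
"""

rows = read_samplesheet(io.StringIO(EXAMPLE))
data_types = [row["data_type"] for row in rows]
```

To check a real file, pass an open file handle instead of the in-memory example.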
Now, you can run the pipeline using:
nextflow run nf-core/epigenomesegmentation \
--input samplesheet.csv \
--outdir <OUTDIR> \
--episegmix_mode standard \
--genome hg38 \
-profile <docker/singularity/.../institute>

Please provide pipeline parameters via the CLI or the Nextflow -params-file option. Custom config files, including those provided by the -c Nextflow option, can be used to provide any configuration except for parameters; see docs.
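For repeated runs, the same parameters can be kept in a file and passed with -params-file, which accepts JSON or YAML. The sketch below renders such a file and assembles the matching command without executing it. The helper names, the params.json file name, the results output directory, and the docker profile are placeholder choices for illustration; the parameter keys mirror the flags used in the command above.

```python
import json

def render_params(params):
    """Serialize pipeline parameters as JSON text for -params-file."""
    return json.dumps(params, indent=2) + "\n"

def build_command(params_file, profile):
    """Assemble the nextflow invocation as an argument list (not executed)."""
    return [
        "nextflow", "run", "nf-core/epigenomesegmentation",
        "-profile", profile,
        "-params-file", params_file,
    ]

# Placeholder values: adjust paths, genome, and mode to your own data.
params = {
    "input": "samplesheet.csv",
    "outdir": "results",
    "episegmix_mode": "standard",
    "genome": "hg38",
}

params_text = render_params(params)
command = build_command("params.json", "docker")
```

Write params_text to params.json and run the joined command in a shell, or hand the list to subprocess.run.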
For more details and further functionality, please refer to the usage documentation and the parameter documentation.
Pipeline output
To see the results of an example test run with a full size dataset refer to the results tab on the nf-core website pipeline page. For more details about the output files and reports, please refer to the output documentation.
Credits
The original EpiSegMix framework, used in ESM (https://doi.org/10.1093/bioinformatics/btae178) and ESMM (https://doi.org/10.1101/2025.07.25.666820), was written by Johanna Elena Schmitz and Nihit Aggarwal (Saarland University).
The pipeline was rewritten in Nextflow DSL2 by Aaryan Jaitly (Saarland University).
The EpiSegMix tool was developed and designed by:
- Nihit Aggarwal
- Johanna Elena Schmitz
- Dr. AbdulRahman Salhab
- Prof. Dr. Jörn Walter
- Prof. Dr. Sven Rahmann
Contributions and Support
If you would like to contribute to this pipeline, please see the contributing guidelines.
For further information or help, don’t hesitate to get in touch on the Slack #epigenomesegmentation channel (you can join with this invite).
Citations
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
You can cite the nf-core publication as follows:
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.