nf-core/epigenomesegmentation
An nf-core pipeline for epigenome segmentation using EpiSegMix/Meth — a hidden Markov model with flexible read count distributions and state duration modeling for histone, open chromatin, and methylation signals.
Introduction
This document describes the output produced by the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- References - Genomic bins and chromosome sizes used for analysis.
- IndexFiles - Processed and indexed BAM files.
- Counts - Binned count matrices for Histones and Methylation.
- EpiSegMix - Trained models, segmentation BED files, and diagnostic plots.
- Pipeline information - Report metrics generated during the workflow execution.
References
Output files
- `References/[Genome]/`
  - `*_bins.bed`: Genomic windows (e.g., 200 bp) used for signal aggregation.
  - `*.chrom.sizes`: The chromosome sizes file fetched for the reference genome.
This directory contains the structural files generated during Genome Preparation. These files ensure that all downstream counting and modeling are performed on a consistent genomic coordinate system.
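To illustrate what a fixed-width binning of a genome looks like, here is a minimal sketch that derives BED-style windows from chromosome sizes. It is not the pipeline's own implementation: the function names `parse_chrom_sizes` and `make_bins` are hypothetical, the 200 bp default only mirrors the example above, and keeping a shorter final window at the end of each chromosome is an assumption.

```python
from typing import Dict, Iterator, Tuple


def parse_chrom_sizes(text: str) -> Dict[str, int]:
    """Parse the two-column chrom.sizes format (name, length in bp)."""
    sizes: Dict[str, int] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        chrom, size = line.split()[:2]
        sizes[chrom] = int(size)
    return sizes


def make_bins(
    chrom_sizes: Dict[str, int], bin_size: int = 200
) -> Iterator[Tuple[str, int, int]]:
    """Yield (chrom, start, end) windows using 0-based, half-open BED coordinates."""
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, bin_size):
            # Assumption: the last window is truncated at the chromosome end.
            yield chrom, start, min(start + bin_size, size)


# Toy input with hypothetical chromosome lengths.
example_sizes = parse_chrom_sizes("chr1\t450\nchr2\t200\n")
bins = list(make_bins(example_sizes))
for chrom, start, end in bins:
    print(f"{chrom}\t{start}\t{end}")
```

Because every downstream count is taken on the same windows, comparing the `*_bins.bed` file between runs is a quick way to confirm that two analyses share a coordinate system.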
IndexFiles
Output files
- `IndexFiles/[SampleID]/`
  - `*.nochr.bam`: Filtered BAM files used for the counting process.
  - `*.nochr.bam.bai`: Index files for the coordinate-sorted BAMs.
These are processed alignment files that have been filtered (marked by the `nochr` suffix) and indexed to allow efficient count matrix generation.
Counts
Output files
- `Counts/[SampleID]/[SampleID]_Histones/`: Contains binned histone mark counts.
  - `*_refined_counts.txt`: The final count matrix used as input for the EpiSegMix model.
If --merge is enabled, these matrices will also include WGBS (Methylation) data intersected with the histone bins.
EpiSegMix
This is the core results directory, containing the output of the segmentation modeling. Files are organized by sample and state number (e.g., _s10).
1. Segmentation
Output files
- `EpiSegMix/[SampleID]/Segmentation/`
  - `*.bed.gz`: Compressed BED file containing the genomic coordinates and assigned chromatin states.
  - `*.txt`: A tab-delimited text version of the segmentation results.
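As a rough way to inspect a segmentation file, the sketch below computes the fraction of segmented base pairs assigned to each state. It assumes the state label sits in the fourth BED column, which is not confirmed by this document, so check the actual file layout first. The function names are hypothetical.

```python
import gzip
from collections import defaultdict
from typing import Dict, Iterable


def state_coverage(lines: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of covered base pairs per state label."""
    covered = defaultdict(int)
    for line in lines:
        # Skip blank lines and common BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        # Assumption: columns are chrom, start, end, state.
        _chrom, start, end, state = line.rstrip("\n").split("\t")[:4]
        covered[state] += int(end) - int(start)
    total = sum(covered.values())
    return {s: bp / total for s, bp in covered.items()} if total else {}


def state_coverage_from_file(path: str) -> Dict[str, float]:
    """Convenience wrapper for a gzip-compressed BED file."""
    with gzip.open(path, "rt") as handle:
        return state_coverage(handle)


# Toy records standing in for a real *.bed.gz file.
demo = ["chr1\t0\t600\t1\n", "chr1\t600\t800\t2\n", "chr2\t0\t200\t1\n"]
fractions = state_coverage(demo)
print(fractions)
```

On a real run, `state_coverage_from_file` would be pointed at one of the `*.bed.gz` files, and the result can be compared against the `*-stateDistribution-viterbi.png` plot.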
2. Models
Output files
- `EpiSegMix/[SampleID]/Models/`
  - `final-model-*.json`: The trained HMM parameters.
  - `*.yaml`: The configuration used for the modeling run.
  - `*.log`: Log files tracking the training and decoding steps.
  - `*-train-counts.txt`: The specific data matrix used during the training phase.
3. Plots
Output files
- `EpiSegMix/[SampleID]/Plots/`
  - `*-correlation.png`: Correlation matrix of input marks.
  - `*-histogram.png`: Signal distribution for each mark.
  - `*-transitionMatrix.png`: Probabilities of transitioning between chromatin states.
  - `*-meanEmission-viterbi.png` / `*-normEmission-viterbi.png`: Heatmaps showing the signal signature for each state.
  - `*-stateDistribution-viterbi.png`: Percentage of the genome occupied by each state.
  - `*-viterbi.html`: Interactive HTML report for exploring the segmentation results.
Pipeline information
Output files
- `pipeline_info/`
  - Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`.
  - Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.yml`. The `pipeline_report*` files will only be present if the `--email` / `--email_on_fail` parameters are used when running the pipeline.
  - Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`.
  - Parameters used by the pipeline run: `params.json`.
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. These reports allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
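For a quick troubleshooting pass, the following sketch tallies task outcomes from a Nextflow trace file. It assumes the file is tab-separated with a `status` column, as in Nextflow's default trace layout. The function name `count_task_status` and the task names in the toy input are hypothetical.

```python
import csv
import io
from collections import Counter


def count_task_status(trace_text: str) -> Counter:
    """Count tasks per value of the `status` column in a tab-separated trace."""
    reader = csv.DictReader(io.StringIO(trace_text), delimiter="\t")
    return Counter(row["status"] for row in reader)


# Toy trace content; a real run would read execution_trace.txt instead.
demo_trace = (
    "task_id\tname\tstatus\n"
    "1\tSTEP_A (sample1)\tCOMPLETED\n"
    "2\tSTEP_B (sample1)\tFAILED\n"
    "3\tSTEP_A (sample2)\tCOMPLETED\n"
)
counts = count_task_status(demo_trace)
print(dict(counts))
```

Any non-zero count for a status other than `COMPLETED` or `CACHED` is a reasonable place to start when a run did not finish as expected.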