Introduction

This document describes the output produced by the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data in the following steps, each of which writes to its own output directory:

  • References - Genomic bins and chromosome sizes used for analysis.
  • IndexFiles - Processed and indexed BAM files.
  • Counts - Binned count matrices for Histones and Methylation.
  • EpiSegMix - Trained models, segmentation BED files, and diagnostic plots.
  • Pipeline information - Report metrics generated during the workflow execution.
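
As a quick sanity check after a run, the top-level directories described in this document can be verified with a few lines of Python. This is only a sketch: the directory names are taken from the sections of this document, while the function name `missing_output_dirs` and the `results` path in the usage note are hypothetical and not part of the pipeline.

```python
from pathlib import Path

# Top-level output directories named in this document.
EXPECTED_DIRS = ["References", "IndexFiles", "Counts", "EpiSegMix", "pipeline_info"]


def missing_output_dirs(results_dir):
    """Return the expected top-level directories that are absent from results_dir."""
    root = Path(results_dir)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

Calling `missing_output_dirs("results")` on a complete run should return an empty list, assuming the results directory is named `results`.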

References

Output files
  • References/
    • [Genome]/
      • *_bins.bed: Genomic windows (e.g., 200bp) used for signal aggregation.
      • *.chrom.sizes: The chromosome sizes file fetched for the reference genome.

This directory contains the structural files generated during Genome Preparation. These files ensure that all downstream counting and modeling are performed on a consistent genomic coordinate system.
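
To make the relationship between the two files concrete, the sketch below tiles each chromosome from a chrom.sizes file into fixed-width windows in BED coordinates (0-based, half-open). It illustrates the idea only and is not the pipeline's own implementation. The 200 bp width comes from the example above. The function names and the choice to truncate the last window at the chromosome end are assumptions.

```python
def read_chrom_sizes(path):
    """Parse a two-column, whitespace-separated chrom.sizes file into {chrom: length}."""
    sizes = {}
    with open(path) as handle:
        for line in handle:
            if line.strip():
                chrom, size = line.split()[:2]
                sizes[chrom] = int(size)
    return sizes


def make_bins(chrom_sizes, bin_size=200):
    """Yield (chrom, start, end) windows tiling each chromosome.

    Assumption: the final window is truncated at the chromosome end.
    """
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, bin_size):
            yield chrom, start, min(start + bin_size, size)
```

For a hypothetical 450 bp chromosome this yields three windows: 0-200, 200-400 and 400-450.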

IndexFiles

Output files
  • IndexFiles/
    • [SampleID]/
      • *.nochr.bam: Filtered BAM files used for the counting process.
      • *.nochr.bam.bai: Index files for the coordinate-sorted BAMs.

This directory contains processed alignment files that have been filtered (as indicated by the “nochr” suffix) and indexed to allow efficient count matrix generation.

Counts

Output files
  • Counts/
    • [SampleID]/
      • [SampleID]_Histones/: Contains binned histone mark counts.
      • *_refined_counts.txt: The final count matrix used as input for the EpiSegMix model.

If --merge is enabled, these matrices will also include WGBS (Methylation) data intersected with the histone bins.

EpiSegMix

This is the core results directory, containing the output of the segmentation modeling. Files are organized by sample and state number (e.g., _s10).

1. Segmentation

Output files
  • EpiSegMix/[SampleID]/Segmentation/
    • *.bed.gz: Compressed BED file containing the genomic coordinates and assigned chromatin states.
    • *.txt: A tab-delimited text version of the segmentation results.
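
The compressed BED file can be summarized with the Python standard library alone. The sketch below adds up the number of base pairs assigned to each state. It assumes that the first four tab-separated columns are chromosome, start, end and state label, which this document does not specify, so check the column layout of your own files first. The function name `state_coverage` is hypothetical.

```python
import gzip
from collections import Counter


def state_coverage(bed_gz_path):
    """Return {state: total base pairs} from a gzipped BED file.

    Assumption: columns 1-4 are chrom, start, end and state label.
    """
    covered = Counter()
    with gzip.open(bed_gz_path, "rt") as handle:
        for line in handle:
            # Skip blank lines and optional BED header lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            fields = line.rstrip("\n").split("\t")
            start, end, state = int(fields[1]), int(fields[2]), fields[3]
            covered[state] += end - start
    return dict(covered)
```

Dividing each value by the sum of all values gives a per-state fraction that can be compared with the state distribution plot described under Plots.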

2. Models

Output files
  • EpiSegMix/[SampleID]/Models/
    • final-model-*.json: The trained HMM parameters.
    • *.yaml: The configuration used for the modeling run.
    • *.log: Log files tracking the training and decoding steps.
    • *-train-counts.txt: The specific data matrix used during the training phase.

3. Plots

Output files
  • EpiSegMix/[SampleID]/Plots/
    • *-correlation.png: Correlation matrix of input marks.
    • *-histogram.png: Signal distribution for each mark.
    • *-transitionMatrix.png: Probabilities of transitioning between chromatin states.
    • *-meanEmission-viterbi.png / *-normEmission-viterbi.png: Heatmaps showing the signal signature for each state.
    • *-stateDistribution-viterbi.png: Percentage of the genome occupied by each state.
    • *-viterbi.html: Interactive HTML report for exploring the segmentation results.

Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`.
    • Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.yml`. The `pipeline_report*` files are only present if the `--email` / `--email_on_fail` parameters are used when running the pipeline.
    • Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`.
    • Parameters used by the pipeline run: `params.json`.

Nextflow can generate several reports about the execution of the pipeline. These reports help you troubleshoot failed runs and provide further information such as launch commands, run times and resource usage.
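
When troubleshooting, a tally of task outcomes from `execution_trace.txt` is often a useful first step. The sketch below assumes the trace file is tab-separated with a header row that includes a `status` column, as in Nextflow's default trace output. If the trace fields were customized for your run, the column may be absent. The function name `summarize_trace` is hypothetical.

```python
import csv
from collections import Counter


def summarize_trace(trace_path):
    """Count tasks per status in a Nextflow trace file.

    Assumption: tab-separated file whose header row has a 'status' column.
    """
    with open(trace_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row["status"] for row in reader)
```

Any status other than `COMPLETED` or `CACHED` points to tasks worth inspecting in the corresponding rows of the trace file.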