# README for 2021-modle-paper-001/data/input

This folder contains the input files obtained by running the [fetch_data](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/blob/v2.0.1/workflows/fetch_data.nf) Nextflow workflow.

Refer to [paulsengroup/2021-modle-paper-001-data-analysis/README.md](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/blob/v2.0.1/README.md) for instructions on how to run the workflow.

Files in this folder will serve as inputs for the workflows listed in [paulsengroup/2021-modle-paper-001-data-analysis/README.md](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/blob/v2.0.1/README.md).

## File description
- `checksums.sha256`: SHA256 checksums. Use `shasum -c checksums.sha256` to check file integrity.
- `del*.bed`: BED files with the genomic coordinates of the HoxD deletions. Used in the [comparison_with_mut](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/blob/v2.0.1/workflows/run_benchmarks.nf) workflow.
- `extrusion_barrier_param_optimization_regions_of_interest.bed`: Genomic coordinates of the region used in the parameter optimization performed by the [extrusion_barrier_param_optimization](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/blob/v2.0.1/workflows/extrusion_barrier_param_optimization.nf) workflow.
- `extrusion_barriers_benchmark.bed`: List of extrusion barriers in BED format used in the [run_benchmarks](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/blob/v2.0.1/workflows/run_benchmarks.nf) workflow.
- `GRC?3?_assembly_report.txt.gz`: Assembly reports for GRCh37, GRCh38 and GRCm38. Used by the [preprocess_data](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/blob/v2.0.1/workflows/preprocess_data.nf) workflow to replace NCBI chromosome identifiers with the actual chromosome names (e.g. NC_000001.11 -> chr1).
- `GRC?3?_genome_assembly.fna.gz`: GRCh37, GRCh38 and GRCm38 reference genome assemblies. Used in the [preprocess_data](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/blob/v2.0.1/workflows/preprocess_data.nf) workflow when mining CTCF motifs using MAST.
- `*_chip_fold_change.bigwig`: CTCF and RAD21 ChIP-seq fold change over control for GM12878, IMR90 and H1-hESC cell lines. Used in the [preprocess_data](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/blob/v2.0.1/workflows/preprocess_data.nf) workflow to generate the extrusion barrier annotation to use as input for MoDLE and OpenMM MD simulations.
- `*_chip_narrow_peaks.bed.gz`: CTCF and RAD21 ChIP-seq optimal IDR thresholded peaks for GM12878, IMR90 and H1-hESC cell lines. Used in the [preprocess_data](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/blob/v2.0.1/workflows/preprocess_data.nf) workflow to generate the extrusion barrier annotation to use as input for MoDLE and OpenMM MD simulations.
- `GRCh37_*_GSE63525*`: Hi-C matrices in .hic format used as reference in the [comparison_with_mut](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/blob/v2.0.1/workflows/run_benchmarks.nf) workflow (data produced by this part of the workflow is not part of the paper).
- `GRC?38*.mcool`: Hi-C and Micro-C matrices in .mcool format for H1-hESC and JM8.N4 cell lines. These matrices are used as reference by several workflows.
- `GRCh38_genome_annotation.gtf.gz`: Genome annotation in GTF format for GRCh38. Not used in the latest workflows ([paulsengroup/2021-modle-paper-001-data-analysis](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/tree/v2.0.1) v2.0.1).
- `GRCm38_E12.5?L_*.tar.gz`: TAR archives with the contact matrices for E12.5 DL and PL in text format. Used by the [preprocess_data](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/blob/v2.0.1/workflows/preprocess_data.nf) workflow to generate matrices in .mcool format.
- `GRCh38_regions_for_heatmap_comparison_pt1.bed`: BED file with a list of regions of interest (GRCh38) to be used in MoDLE and OpenMM MD simulations for heatmap comparison.
- `GRCm38_E12.5PL_H3K27ac.bw`: H3K27ac ChIP-seq data in BigWig format. Track shown in Fig. 5E from [MoDLE's paper](https://www.biorxiv.org/content/10.1101/2022.04.13.488157v2).
- `gw_param_optimization/*.tsv`: TSVs with the parameter space used by the [gw_param_optimization](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/blob/v2.0.1/workflows/gw_param_optimization.nf) and [heatmap_comparison_pt2](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/blob/v2.0.1/workflows/heatmap_comparison_pt2.nf) workflows.
- `hoxd_regions_of_interest.bed`: BED file with the genomic coordinates of the regions to be simulated by the [comparison_with_mut](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/blob/v2.0.1/workflows/run_benchmarks.nf) workflow for the HoxD comparison.
- `idh_mutant.bed`: BED file with the genomic coordinates of the regions to be simulated by the [comparison_with_mut](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/blob/v2.0.1/workflows/run_benchmarks.nf) workflow for the comparison with the IDH mutant (data produced by this part of the workflow is not part of the paper).
- `JASPAR_2022_core.zip`: JASPAR 2022 (20220211) non-redundant database (PFMs in MEME format). Used to extract the CTCF PFM (MA0139.1) for MAST.
- `mm10_encode4_genome_data.tsv`: ENCODE4 genome data file used by the [ENCODE-DCC/chip-seq-pipeline2](https://github.com/ENCODE-DCC/chip-seq-pipeline2/tree/v2.2.0) pipeline.
- `README.md`: This file.
- `SRR508515?.fq.gz`: FASTQ files used to generate CTCF and RAD21 fold change over control and optimal IDR thresholded peaks for JM8.N4 using the [ENCODE-DCC/chip-seq-pipeline2](https://github.com/ENCODE-DCC/chip-seq-pipeline2/tree/v2.2.0) pipeline.
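
Where `shasum` is not available, the integrity check against `checksums.sha256` can be approximated in Python. This is a minimal sketch, assuming each line of the checksum file uses the usual `<hex digest>  <file name>` layout written by `shasum` and `sha256sum`, with file names relative to the folder holding the checksum file. The helper names `sha256_of` and `verify_checksums` are hypothetical and are not part of the workflows.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file):
    """Map each file listed in checksum_file to True (match) or False."""
    checksum_file = Path(checksum_file)
    results = {}
    for line in checksum_file.read_text().splitlines():
        if not line.strip():
            continue
        # Assumed layout: "<digest>  <name>" or "<digest> *<name>" (binary mode).
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")
        target = checksum_file.parent / name
        # Missing files are reported as failures rather than raising.
        results[name] = target.is_file() and sha256_of(target) == expected.lower()
    return results
```

Calling `verify_checksums("checksums.sha256")` from inside this folder returns a dictionary; any entry set to `False` points at a missing or corrupted file.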
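
The chromosome renaming described for `GRC?3?_assembly_report.txt.gz` (e.g. NC_000001.11 -> chr1) can be illustrated with a short Python sketch. It assumes the standard NCBI assembly report layout: `#`-prefixed header lines followed by tab-separated rows, with the RefSeq accession in the 7th column and the UCSC-style name in the 10th. The function `load_chrom_name_map` is hypothetical, and the [preprocess_data](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/blob/v2.0.1/workflows/preprocess_data.nf) workflow may build this mapping differently.

```python
import gzip


def load_chrom_name_map(report_path):
    """Map RefSeq accessions to UCSC-style names from a gzipped assembly report."""
    mapping = {}
    with gzip.open(report_path, "rt") as fh:
        for line in fh:
            # Skip the "#"-prefixed header block and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) < 10:
                continue  # not a complete record under the assumed layout
            # Assumed columns: index 6 = RefSeq-Accn, index 9 = UCSC-style-name.
            refseq_accn, ucsc_name = fields[6], fields[9]
            # "na" marks a missing value in either column.
            if refseq_accn != "na" and ucsc_name != "na":
                mapping[refseq_accn] = ucsc_name
    return mapping
```

The resulting dictionary can then be used to rewrite the sequence names of a FASTA or BED file. Rows lacking either identifier are skipped, so sequences without a UCSC-style name keep their original identifier.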
