# README for 2021-modle-paper-001/data/output/preprocessing

This folder contains the input files obtained by running the [preprocess_data](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/blob/v2.0.1/workflows/preprocess_data.nf) Nextflow workflow.

Refer to [paulsengroup/2021-modle-paper-001-data-analysis/README.md](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/blob/v2.0.1/README.md) for instructions on how to run the workflow.

Files in this folder will serve as inputs for the workflows listed in [paulsengroup/2021-modle-paper-001-data-analysis/README.md](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/blob/v2.0.1/README.md).

## File description
- `checksums.sha256`: SHA256 checksums. Use `shasum -c checksums.sha256` to check file integrity.
- `chrom_sizes/*`: Chromosome sizes in `chrom.sizes` and `BED3` format for GRCh37, GRCh38 and GRCm38 genome assemblies.
- `compartments/*`: A/B compartments called with `cooltools eigs-cis` using Hi-C and Micro-C datasets for the `H1 hESC` and `JM8.N4 mESC` cell lines (GRCh38 and GRCm38, respectively).
                    Compartment information was used to generate Supplementary Fig. 5.
- `extrusion_barriers/*_mast_hits.???.gz`: List of candidate CTCF (MA0139.1) binding sites identified by MAST, in BED and TXT format.
                                           Data available for GRCh37, GRCh38 and GRCm38.
                                           Used in the [preprocess_data](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/blob/v2.0.1/workflows/preprocess_data.nf) workflow to generate the extrusion barrier annotation to use as input for MoDLE and OpenMM MD simulations.
- `extrusion_barriers/*_barriers_*_occupancy.bed.gz`: List of extrusion barriers in BED format generated by the [preprocess_data](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/blob/v2.0.1/workflows/preprocess_data.nf) workflow. This extrusion barrier annotation will be used as input for MoDLE and OpenMM MD simulations.
- `MA0139.1.meme`: CTCF PFM in MEME format. Used in the [preprocess_data](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/blob/v2.0.1/workflows/preprocess_data.nf) workflow to generate the list of candidate CTCF binding sites.
- `mcools/*`: Multi-resolution coolers used by several workflows.
              Not all `.mcool` are used by [2021-modle-paper-001-data-analysis v2.0.1](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/tree/v2.0.1).
   - `?L_HiC_E12*` files were generated by converting matrices in text format to `.mcool` format. Matrices in text format were published as part of Carballo 2017 ([GSE101715](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE101715)).
   - `GRCh37_*` files were generated by converting `*.hic` files using HiCExplorer `hicConvertFormat`. These files are not used by [2021-modle-paper-001-data-analysis v2.0.1](https://github.com/paulsengroup/2021-modle-paper-001-data-analysis/tree/v2.0.1).
   - `GRC?3?*` files were generated by running `cooler zoomify` on the highest resolution of `4DNFIFJH2524`, `4DNFI9GMP2J8` and `4DNFINNZDDXV`. This was necessary to work around the bug described in the [Bug report for 4DNFI9GMP2J8 dataset](#bug-report-for-4dnfi9gmp2j8-dataset) section.
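
The integrity check suggested for `checksums.sha256` can also be run from Python, for example on systems where `shasum` is not available. The sketch below is only an illustration: it assumes the usual `<hex digest>  <path>` layout written by `shasum`/`sha256sum`, with paths relative to the folder containing the checksum file. The function names `sha256_of` and `verify_checksums` are hypothetical helpers, not part of the workflow.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file):
    """Yield (path, ok) for every entry listed in checksum_file.

    Assumption: each line looks like `<hex digest>  <path>` and paths are
    relative to the folder that contains checksum_file.
    """
    base_dir = Path(checksum_file).parent
    with open(checksum_file) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            expected, name = line.rstrip("\n").split(maxsplit=1)
            if name.startswith("*"):
                name = name[1:]  # a leading '*' only marks binary mode
            yield name, sha256_of(base_dir / name) == expected.lower()
```

Run from this folder, `list(verify_checksums("checksums.sha256"))` returns one `(path, ok)` pair per listed file; any pair with `ok == False` points at a corrupted or modified file.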

## Bug report for 4DNFI9GMP2J8 dataset
During data analysis we noticed that fetching contacts at resolutions other than the base resolution using the Cooler Python API would sometimes return duplicate pixels for certain rows/columns, that is, the same pair of bins reported more than once with different counts.

Example:
```
         chrom1    start1      end1 chrom2    start2      end2  count  balanced
488629   chr3  37215000  37220000   chr3  37215000  37220000   1571  0.190695
488808   chr3  37215000  37220000   chr3  37215000  37220000   2804  0.340361
488630   chr3  37215000  37220000   chr3  37220000  37225000    175  0.020277
488809   chr3  37215000  37220000   chr3  37220000  37225000   1087  0.125949
488631   chr3  37215000  37220000   chr3  37225000  37230000     70  0.009176
...       ...       ...       ...    ...       ...       ...    ...       ...
489014   chr3  37215000  37220000   chr3  38810000  38815000      1  0.000145
488797   chr3  37215000  37220000   chr3  38845000  38850000      1  0.000150
489016   chr3  37215000  37220000   chr3  38845000  38850000      1  0.000150
488738   chr3  37215000  37220000   chr3  39105000  39110000      1  0.000082
489052   chr3  37215000  37220000   chr3  39105000  39110000      2  0.000164
```

Since the issue affects every resolution except the highest (base) one, we speculate that it was introduced by the `cooler` version used to zoomify the base resolution.

Re-generating the lower resolutions by zoomifying the highest resolution with `cooler` v0.8.11 solves the issue.
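
To check whether a given file is affected, one could look for pixel coordinates that occur more than once in the fetched contacts. The following is a minimal sketch using only the standard library. It assumes the pixels are available as dict-like rows with the columns shown in the example above (with the Cooler Python API, something like `matrix(as_pixels=True, join=True).fetch(region)` followed by `to_dict("records")` should produce such rows, but this is an assumption and not necessarily how the issue was originally spotted). `find_duplicate_pixels` is a hypothetical helper name.

```python
from collections import defaultdict

# Columns that identify a pixel once bin coordinates are joined to it.
COORD_FIELDS = ("chrom1", "start1", "end1", "chrom2", "start2", "end2")


def find_duplicate_pixels(pixels):
    """Return {coordinates: [counts]} for coordinates seen more than once.

    `pixels` is any iterable of dict-like rows carrying the COORD_FIELDS
    keys plus a `count` value.
    """
    counts_by_coord = defaultdict(list)
    for row in pixels:
        key = tuple(row[field] for field in COORD_FIELDS)
        counts_by_coord[key].append(row["count"])
    return {k: v for k, v in counts_by_coord.items() if len(v) > 1}


# First five rows of the example above (5 kbp bins on chr3).
example = [
    {"chrom1": "chr3", "start1": 37215000, "end1": 37220000,
     "chrom2": "chr3", "start2": start2, "end2": start2 + 5000,
     "count": count}
    for start2, count in [
        (37215000, 1571),
        (37215000, 2804),
        (37220000, 175),
        (37220000, 1087),
        (37225000, 70),
    ]
]

duplicates = find_duplicate_pixels(example)
for coord, counts in duplicates.items():
    print(coord, counts)
```

On these five rows the sketch reports two duplicated coordinates; a file without the bug should yield an empty dictionary for every region and resolution.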

