# Dataset Zong 2022

This dataset includes the complete set of raw data used to produce "Large-scale two-photon calcium imaging in freely moving mice" (Zong _et al._, 2022). It comprises 32 files totalling approximately 2.1 TB.


## Dataset Structure

Raw data is split across 27 compressed tarballs (`.tar.gz`).

Each tarball has two components to its name: experiment type and animal ID, e.g.:
* `ca1-97288.tar.gz`
* `behaviour-61439.tar.gz`

Three experiment types refer to two-photon imaging of the respective brain region (CA1, MEC, VC). The fourth experiment type (behaviour) contains no two-photon imaging; it instead studies the behaviour of animals with and without the (inactive) two-photon imaging equipment attached.

Each session stores data in a similar way, although the exact files present vary with the purpose of the session. There is also some variation over time, as experimental protocols evolved.

Five additional explanatory documents are attached:

* This readme file.
* `zong_2022_filetypes_per_session.pdf`, a table describing which file types _should_ be present for each session.
* `zong_2022_sessions_per_figure.pdf`, a table describing which sessions were used in each figure of the paper and its supplementary information.
* `behaviour_data_split_explanation.md`, which explains how data is distributed within the `behaviour` experiment type.
* `md5sum.chk`, which lists the MD5 hash of each uploaded file.


### `*.tar.gz` and `md5sum.chk`

The archive is provided as a series of compressed tarballs and associated checksums. 

You can verify that the contents of each tarball have not been changed by comparing the checksum of the file you have downloaded to the checksum provided at time of upload.

Calculate the checksum of an individual tarball with:

    md5sum ca1-97288.tar.gz

Compare the checksums of _all_ 27 files (note: this will take 8-12 hours, depending on your computer hardware):

    md5sum -c md5sum.chk
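
Where the `md5sum` utility is not available, the same check can be sketched in Python with the standard library. This is a minimal sketch, not part of the published tooling: the function names `md5_of_file` and `verify_checksums` are illustrative, and it assumes that `md5sum.chk` uses the usual `md5sum` layout (`<hex digest>  <filename>` per line) and sits in the same directory as the tarballs.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks so that
    very large tarballs never need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file):
    """Check every file listed in an md5sum-style checksum file.

    Assumption: each non-empty line is "<hex digest>  <filename>" and the
    listed files live next to the checksum file.
    Returns a dict mapping filename -> True (match), False (mismatch)
    or None (file not found).
    """
    checksum_file = Path(checksum_file)
    results = {}
    for line in checksum_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        # md5sum marks binary-mode entries with a leading "*"
        name = name.strip().lstrip("*")
        target = checksum_file.parent / name
        if not target.exists():
            results[name] = None
        else:
            results[name] = md5_of_file(target) == expected.lower()
    return results
```

Calling `verify_checksums("md5sum.chk")` returns one entry per listed file; any value other than `True` indicates a file that is missing or differs from the uploaded version.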

You can extract a tarball into your current directory with:

    tar -xvzf ca1-97288.tar.gz
  

## Analysis Code

Analysis code is stored [on GitHub](https://github.com/kavli-ntnu/MINI2P_toolbox). The state of the GitHub repository at the time of publication is available via [Zenodo](https://doi.org/10.5281/zenodo.6033997).


## File Types

Raw data almost invariably has a filename matching this pattern, which will be referred to as a `<base-name>`:

`<animal-id>_<session-timestamp><session-name>_<recording-order>`

There may be some additional string distinguishing the file type, plus a standard file suffix.

For example:
* `96766_20210314_ML-800_AP-800_22Openfiled_00001.tif`
* `96766_20210314_ML-800_AP-800_22Openfiled_00001_trackingVideoDLC_resnet_50_OPENMINI2P_topcamera_20210305Mar5shuffle1_1030000.csv`

A session may contain one or more different `<base-name>*` groupings.
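
As an illustration of the naming pattern, the sketch below splits a `<base-name>` into its parts with a regular expression. The pattern and the function name `split_base_name` are assumptions made for this example: it treats the leading digits as the animal ID, the final underscore-separated run of digits as the recording order, and everything in between as the combined session timestamp and name. It applies to a bare `<base-name>` only, not to file names that carry additional suffixes.

```python
import re

# Hypothetical pattern (an assumption, not defined by the dataset):
#   leading digits          -> animal ID
#   trailing "_<digits>"    -> recording order
#   everything in between   -> session timestamp and session name
BASE_NAME = re.compile(r"^(?P<animal_id>\d+)_(?P<session>.+)_(?P<order>\d+)$")


def split_base_name(base_name):
    """Split a <base-name> into animal ID, session string and recording order."""
    match = BASE_NAME.match(base_name)
    if match is None:
        raise ValueError(f"not a recognised base name: {base_name!r}")
    return match.groupdict()


parts = split_base_name("96766_20210314_ML-800_AP-800_22Openfiled_00001")
print(parts["animal_id"], parts["session"], parts["order"])
```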

Depending on the nature of the session, some of the file types below may be absent.


### Raw two-photon calcium imaging

Typically `<base-name>.tif`.

Each page of the tif corresponds to two-photon imaging of a volume within the subject's brain. Multiple planes of imaging are stored as adjacent images within the page.

The tifs are generated by [ScanImage2020](https://vidriotechnologies.com/scanimage/) and include all metadata generated by ScanImage.


### Distortion corrected calcium imaging

Typically `<base-name>_wrapping-corrected.tif`. Should also include `<base-name>_imgInfo.mat` and `<base-name>_tifHeader.mat`.

The raw tif includes distortions and aberrations due to the optics of the miniscope. Distortion corrected images have been processed to remove those distortions in order to facilitate stitching multiple fields of view together.

Processed tif files no longer include the ScanImage metadata, as our analysis tools do not recreate its proprietary metadata tags. The ScanImage metadata is instead written out to mat files, included alongside the processed tifs.


### Raw animal tracking

Raw animal tracking has evolved through at least three distinct file types as the research has progressed:
* v1: one tif per frame.
  A directory labelled `<base-name>_trackingVideo` contains several thousand tif files labelled `<base-name>_trackingVideo<frame-number>.tif`. This method is deprecated.
* v2: Tif-stacked, uncompressed video file, `<base-name>_trackingVideo.avi`
* v3: mpeg-4 compressed video file, `<base-name>_trackingVideo.mp4`

Typically `<base-name>_trackingVideo.mp4`; some earlier sessions may instead use `<base-name>_trackingVideo.avi`. May also include `<base-name>_MetaFile.csv`.

The animals are tracked within an open environment by an infrared camera from either above or below. No markers are attached to the animals: animal state vector is later inferred from the video via the DeepLabCut application.

The tracking video is pre-synchronised to the two-photon imaging data: each time a new _plane_ of two-photon imaging is begun, a frame of tracking data is generated. Consequently, the number of tracking frames will _always_ be equal to `number_of_volumes * number_of_planes_per_volume`. When reading out data about a cell, it is therefore important to know in which plane it was detected, in order to select the correct subset of tracking frames to match against.
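
The plane-wise selection of tracking frames can be sketched as follows. This is a minimal illustration under stated assumptions: planes are acquired in the same order within every volume, and tracking frame 0 belongs to plane 0. Neither assumption is confirmed here, so check them against your session before relying on the result. The function name `tracking_frames_for_plane` is illustrative.

```python
def tracking_frames_for_plane(n_tracking_frames, n_planes, plane):
    """Return the tracking-frame indices that belong to one imaging plane.

    Assumptions (not confirmed by the dataset documentation): planes are
    acquired in a fixed order within each volume, and tracking frame 0
    corresponds to plane 0.
    """
    if n_tracking_frames % n_planes != 0:
        raise ValueError("tracking frame count is not a multiple of the plane count")
    if not 0 <= plane < n_planes:
        raise ValueError("plane index out of range")
    # Every n_planes-th frame, starting at the plane's own offset
    return range(plane, n_tracking_frames, n_planes)


# Example: 3 volumes of 2 planes each give 6 tracking frames
frames = list(tracking_frames_for_plane(6, n_planes=2, plane=1))
print(frames)  # [1, 3, 5]
```

The returned indices have one entry per volume, so they can be used directly to index rows of the tracking output when matching against a cell detected in that plane.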


### Derived: Cell data

Two-photon imaging data is processed with the package [Suite2p](https://github.com/MouseLand/suite2p). Documentation for Suite2p is provided at [ReadTheDocs](https://suite2p.readthedocs.io/en/latest/index.html).

The output of Suite2p is stored in the `suite2p` directory of the session, and a `run.log` logfile is included.

Suite2p is _always_ run over all data present in the session directory.

For sessions in which only a single plane per volume is present, the `suite2p` directory will look like so:
```
93562
 |- 20200629_1
    |- suite2p
       |- plane0
       |  |- data.bin
       |  |- F.npy
       |  |- Fneu.npy
       |  |- iscell.npy
       |  |- ops.npy
       |  |- spks.npy
       |  |- stat.npy
       |- run.log
```
For sessions in which multiple planes per volume are present, the suite2p directory will look like so:
```
93562
 |- 20200629_1
    |- suite2p
       |- combined
       |  |- data.bin
       |  |- F.npy
       |  |- Fneu.npy
       |  |- iscell.npy
       |  |- ops.npy
       |  |- spks.npy
       |  |- stat.npy
       |- plane0
       |  |- data.bin
       |  |- *.npy
       |- plane1
       |  |- data.bin
       |  |- *.npy
       |- run.log
```

Understanding and reading these formats is covered by the [Suite2p docs](https://suite2p.readthedocs.io/en/latest/outputs.html).

Suite2p outputs are manually curated by the researchers to identify and remove data artefacts, as well as cells that appear to be duplicates across multiple planes. In sessions with multiple planes, these curations are **only** applied to the files stored in the `plane*` directories. Data stored in `combined` are not curated.
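
A minimal sketch of reading one plane's output with NumPy is given below, following the file layout described in the Suite2p documentation. The function name `load_suite2p_plane` is illustrative. It keeps only the ROIs whose label in the first column of `iscell.npy` is 1, which is how the curation is recorded. For multi-plane sessions, point it at a `plane*` directory rather than `combined`, since only the former are curated.

```python
from pathlib import Path

import numpy as np


def load_suite2p_plane(plane_dir):
    """Load the Suite2p outputs of one plane, keeping only accepted cells."""
    plane_dir = Path(plane_dir)
    F = np.load(plane_dir / "F.npy")            # fluorescence, (n_rois, n_frames)
    Fneu = np.load(plane_dir / "Fneu.npy")      # neuropil fluorescence, same shape
    spks = np.load(plane_dir / "spks.npy")      # deconvolved activity, same shape
    iscell = np.load(plane_dir / "iscell.npy")  # (n_rois, 2): label, probability
    # stat and ops are stored as pickled Python objects
    stat = np.load(plane_dir / "stat.npy", allow_pickle=True)
    ops = np.load(plane_dir / "ops.npy", allow_pickle=True).item()

    keep = iscell[:, 0].astype(bool)  # first column is the cell / not-cell label
    return {
        "F": F[keep],
        "Fneu": Fneu[keep],
        "spks": spks[keep],
        "stat": stat[keep],
        "ops": ops,
        "roi_index": np.flatnonzero(keep),  # original Suite2p ROI numbers
    }
```

For example, `load_suite2p_plane("93562/20200629_1/suite2p/plane0")` would load the single-plane session shown above, assuming the tarball was extracted with that layout.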


### Derived: Animal tracking and state vectors

Animal tracking data is derived by running [DeepLabCut](https://github.com/DeepLabCut/DeepLabCut) over the raw tracking video. DeepLabCut documentation is available [here](https://deeplabcut.github.io/DeepLabCut/docs/intro.html).

DLC output data is provided as:
* `<base-name>_<dlc-model-name>_<dlc-model-iterations>.csv`
* `<base-name>_<dlc-model-name>_<dlc-model-iterations>.h5`
* `<base-name>_<dlc-model-name>_<dlc-model-iterations>filtered.csv`
* `<base-name>_<dlc-model-name>_<dlc-model-iterations>filtered.h5`
* `<base-name>_<dlc-model-name>_<dlc-model-iterations>_forBSOID.csv`
* `<base-name>_<dlc-model-name>_<dlc-model-iterations>includingmetadata.pickle`

The CSV contains the output tracking information:
* Timestamps are given as frame numbers in column `coords`: recall that due to the nature of the synchronisation, each tracking frame corresponds to a new two-photon imaging _plane_ (not volume)
* Each tracked body part has three columns: `X` (in pixels), `Y` (in pixels) and `likelihood`, which is the confidence that DLC assigns to the values in each frame.
* The body parts to be tracked are defined as part of the model. Typically, they include:
  * nose
  * mouse
  * lefthand
  * righthand
  * leftleg
  * rightleg
  * tailbase
  * bodycenter
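
The CSV can be read with pandas along the following lines. This is a sketch under assumptions: it expects the usual single-animal DeepLabCut layout of three header rows (scorer, bodyparts, coords) followed by one row per tracking frame. The function names `load_dlc_csv` and `mask_low_confidence`, and the likelihood threshold of 0.9, are illustrative choices rather than values used by the authors.

```python
import numpy as np
import pandas as pd


def load_dlc_csv(path):
    """Read a DeepLabCut CSV into a DataFrame indexed by tracking frame.

    Assumes three header rows (scorer, bodyparts, coords). The returned
    columns are a two-level index of (body part, coordinate).
    """
    df = pd.read_csv(path, header=[0, 1, 2], index_col=0)
    df.columns = df.columns.droplevel(0)  # drop the scorer (model name) level
    return df


def mask_low_confidence(df, bodypart, threshold=0.9):
    """Return one body part's columns, with coordinates set to NaN in
    frames where DLC's likelihood falls below the threshold."""
    part = df[bodypart].copy()
    coord_cols = [c for c in part.columns if c != "likelihood"]
    part.loc[part["likelihood"] < threshold, coord_cols] = np.nan
    return part
```

For example, `mask_low_confidence(load_dlc_csv(csv_path), "nose")` returns the nose position per tracking frame, with unreliable frames blanked out so that they can be interpolated or excluded later.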

The h5 file stores the same information (as an HDF5 container), if that format is more convenient for you.

The pickle file contains some metadata about the model that was run. It can be opened in Python by unpickling:
```python
import pickle
file = "<base-name>_<dlc-model-name>_<dlc-model-iterations>includingmetadata.pickle"
with open(file, "rb") as f:
    data = pickle.load(f)
```


### Derived: Analysis with NATEx

The authors have published a [GitHub repository](https://github.com/kavli-ntnu/MINI2P_toolbox) providing their MATLAB analysis code, which includes the packaged NATEx application.

The application creates and modifies up to 5 MATLAB data files:
1. `ExperimentInformation.mat`
2. `TrackingResultMatrix.mat`
3. `NeuronInformation.mat`
4. `NeuronActiveMatrix.mat`
5. `NAT.mat`

These 5 files include information about the session (1), the animal tracking (2), the neuron activity (3, 4) and the combination of the tracking and neuron activity (5).


### Complete directory structure

An example of a complete directory structure. This session, for animal 93562 on 2020-08-17, contains two separate `<base-name>` groupings:
* `93562_imaging_20200817_nocookies_00001`
* `93562_imaging_20200817_withcookies_00001`

```
|- VC recordings
   |- 93562
      |- 20200817
         |- suite2p
         |  |- plane0
         |  |  |- data.bin
         |  |  |- F.npy
         |  |  |- Fneu.npy
         |  |  |- iscell.npy
         |  |  |- ops.npy
         |  |  |- spks.npy
         |  |  |- stat.npy
         |  |- run.log
         |- 93562_imaging_20200817_nocookies_00001.tif
         |- 93562_imaging_20200817_nocookies_00001_trackingVideoDLC_resnet50_OPENMINI2P_bottomcameraAug26shuffle1_1030000.csv
         |- 93562_imaging_20200817_nocookies_00001_trackingVideoDLC_resnet50_OPENMINI2P_bottomcameraAug26shuffle1_1030000.h5
         |- 93562_imaging_20200817_nocookies_00001_trackingVideoDLC_resnet50_OPENMINI2P_bottomcameraAug26shuffle1_1030000filtered.csv
         |- 93562_imaging_20200817_nocookies_00001_trackingVideoDLC_resnet50_OPENMINI2P_bottomcameraAug26shuffle1_1030000filtered.h5
         |- 93562_imaging_20200817_nocookies_00001_trackingVideoDLC_resnet50_OPENMINI2P_bottomcameraAug26shuffle1_1030000includingmetadata.pickle
         |- 93562_imaging_20200817_withcookies_00001.tif
         |- 93562_imaging_20200817_withcookies_00001_trackingVideoDLC_resnet50_OPENMINI2P_bottomcameraAug26shuffle1_1030000.csv
         |- 93562_imaging_20200817_withcookies_00001_trackingVideoDLC_resnet50_OPENMINI2P_bottomcameraAug26shuffle1_1030000.h5
         |- 93562_imaging_20200817_withcookies_00001_trackingVideoDLC_resnet50_OPENMINI2P_bottomcameraAug26shuffle1_1030000filtered.csv
         |- 93562_imaging_20200817_withcookies_00001_trackingVideoDLC_resnet50_OPENMINI2P_bottomcameraAug26shuffle1_1030000filtered.h5
         |- 93562_imaging_20200817_withcookies_00001_trackingVideoDLC_resnet50_OPENMINI2P_bottomcameraAug26shuffle1_1030000includingmetadata.pickle
         |- ExperimentInformation.mat
         |- NAT.mat
         |- NeuronActiveMatrix.mat
         |- NeuronInformation.mat
         |- TrackingResultMatrix.mat
         |- 93562_imaging_20200817_nocookies_00001_trackingVideo.avi
```
