immuneML is a software platform for machine learning analysis of adaptive immune receptors and repertoires (AIRR).
This dataset contains the original specification files and complete results for immuneML use case 1: Replication of a published study by Emerson et al., 2017 (doi: 10.1038/ng.3822) inside immuneML.
For more information about immuneML, see the documentation: https://docs.immuneml.uio.no/


The original dataset produced by Emerson et al. was deposited in immuneACCESS (doi: 10.21417/B7001Z).
In our use case, some of the subjects were excluded due to missing data. 
The identifiers of the subjects included in our use case are listed in cmv_metadata.csv, which is part of this dataset.
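
As a minimal sketch, the included subject identifiers could be read from the metadata file with Python's standard library. The function name read_subject_ids and the column name subject_id are assumptions made for illustration and are not confirmed by this description; check the header row of the metadata file and adjust the column name if needed.

```python
import csv


def read_subject_ids(metadata_path, id_column="subject_id"):
    """Return the values of id_column from a CSV metadata file, in file order.

    The default column name "subject_id" is an assumption; pass the actual
    column name if the metadata file uses a different header.
    """
    with open(metadata_path, newline="") as handle:
        reader = csv.DictReader(handle)
        # Fail early with a clear message if the expected column is missing.
        if reader.fieldnames is None or id_column not in reader.fieldnames:
            raise KeyError(f"column {id_column!r} not found in {metadata_path}")
        return [row[id_column] for row in reader]
```

For example, read_subject_ids("cmv_metadata.csv") would return one identifier per included repertoire, provided the assumed column exists.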


The immuneML specification files in this dataset (cmv_study_replication_10_fold_CV.yaml, subsampling_specs.yaml, cmv_study_robustness_assessment.yaml) are compatible with immuneML version 1.0.1. 
Results (study_replication_10_fold_CV_output.zip, subsampled_datasets_output.zip, cmv_robustness_assessment_output.zip) were generated with immuneML version 1.0.1. 
For detailed information about this use case, and versions of these specification files compatible with the latest version of immuneML, see the documentation for this use case: https://docs.immuneml.uio.no/usecases/emerson_replication.html


The use case consists of two parts:

- In part one, immuneML is used to replicate the study by Emerson et al., 2017 (doi: 10.1038/ng.3822) in which CMV status is predicted from TCRbeta repertoires. 
  This was done using the following input files which are included in this dataset:
  - The configuration file cmv_study_replication_10_fold_CV.yaml
  - The metadata file cmv_metadata.csv, which describes which of the repertoire .tsv files to use as input
  - The metadata files cmv_train_metadata.csv and cmv_test_metadata.csv, which describe which of the repertoire files to use as training and test datasets
  - The file emerson_reference.csv, which describes the CMV-associated sequences found in the original study
  It also required the following files, which are not included in this dataset:
  - The immune repertoire .tsv files of the original input dataset (doi: 10.21417/B7001Z)
  
  The results produced by immuneML and accompanying HTML files for navigating the results can be found in study_replication_10_fold_CV_output.zip.

- In part two, immuneML is used to assess the robustness of the method reproduced in part one when it is applied to subsampled datasets.
  This experiment consists of two steps:
  
  - First, immuneML was used to subsample the original dataset into smaller datasets of 400, 200, 100 and 50 immune repertoires. 
    This was done using the configuration file subsampling_specs.yaml. 
    The resulting subsampled datasets and accompanying HTML files summarizing the datasets can be found in subsampled_datasets_output.zip.

  - Second, immuneML was used to perform the robustness assessment, by repeating the experiment described in part one on the subsampled datasets.
    This was done using the following input files which are included in this dataset:
    - The configuration file cmv_study_robustness_assessment.yaml
    - The file emerson_reference.csv, which describes the CMV-associated sequences found in the original study
    - The subsampled datasets which can be found in subsampled_datasets_output.zip
    The results produced by immuneML and accompanying HTML files for navigating the results can be found in cmv_robustness_assessment_output.zip.
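
The subsampling step in part two can be illustrated with a short Python sketch. This is not immuneML's implementation (the actual subsampling was configured in subsampling_specs.yaml); it only shows the general idea of drawing random subsets of 400, 200, 100 and 50 repertoires without replacement. The function name, the fixed seed, and the choice to draw each subset independently from the full dataset are assumptions made for illustration.

```python
import random


def subsample_repertoires(repertoires, sizes=(400, 200, 100, 50), seed=0):
    """Draw one random subset per requested size, without replacement.

    Assumption: each subset is drawn independently from the full list;
    whether immuneML draws them independently or nested is not stated here.
    """
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    pool = list(repertoires)
    subsets = {}
    for size in sizes:
        if size > len(pool):
            raise ValueError(f"cannot draw {size} items from {len(pool)}")
        subsets[size] = rng.sample(pool, size)
    return subsets
```

Calling subsample_repertoires with a list of repertoire file names returns a dictionary that maps each subset size to the selected file names.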