Absolut! synthetic antibody-antigen binding database. This dataset contains computed structures from human and murine CDRH3 sequences to 159 PDB-derived lattice antigens. The dataset has been generated using the Absolut! framework, available at: https://github.com/csi-greifflab/Absolut Please refer to / cite this manuscript for general explanations Robert et al. 2021, A billion synthetic 3D-antibody-antigen complexes enable unconstrained machine-learning formalized investigation of antibody specificity prediction, [will be available on biorxiv in ~July 2021] Structure of the dataset: **** Absolut! database: The 1 billion antibody-antigen structures (6.9e6 murine CDRH3s >= 11AAs) **** RawBindingsMurine/ UniqueCDR3s.txt => list of used murine CDRH3 sequences [Greiff 2018 Cell Reports, PMID: 28514665] XXXX_X.zip => for antigen XXXX_X (example 1ADQ_A = antigen generated from PDB 1ADQ, chain A), the energetically optimal binding structure of each 11-mer from each CDRH3 to this antigen. the CDRH3 sequences are separated into multiple text files that can be concatenated. [careful, the column header "Best?" is written as "Best" only in some files (unnoticed change of format happened)] **** Filtered sequences from the database according to affinity thresholds **** (this allows to work only on sequences with high affinity => smaller files) RawBindingsPerClassMurine/ XXXX_XAnalyses/ => filtered sequences to antigen XXXX_X (examples below for 1ADQ_A) *** CDRH3 and 11-mer(slice) -based *** 1ADQ_A_500kNonMascotte.txt => 500k sequences randomly sampled from non-binders (top 99% energies) *** CDRH3-based top sequences (only keeps the best 11-mer from each CDRH3 to describe its binding) *** 1ADQ_A_superHeroes,txt => bottom 0.01% energies 1ADQ_A_Heroes.txt => bottom 0.1% energies 1ADQ_A_Mascotte.txt => bottom 1% energies (= top 1% high affinity) 1ADQ_A_Looser.txt => bottom 5% energies The files with "Exclusive" mean excluding the higher affinity class example: 1ADQ_A_MascotteExclusive.txt = bottom 0.1% to 1% (excludes heroes and super heroes) *** 11-mer based top sequences (threshold defined from the bottom CDRH3 sequences, but keep multiple low energies 11-mers from the same CDRH3 if they satisfy the threshold (not only the best per CDRH3) - see biorxiv paper above *** Files havs identical names but including "Slices", example: 1ADQ_A_MascotteSlices,txt **** Pre-processed datasets used in the Robert 2021 Biorxiv mentioned above: **** Datasets1/ [binary classification per antigen] 11mer-based/ => (used in the manuscript) 1ADQ_A_Task1_SlicesBalancedData.txt => contains all the bottom 1% sequences (Mascotte) to the antigen + the same amount of non-binders sampled to originate from same CDRH3 length distribution CDRH3-based/ => also provided but not used (same generation procedure) Datasets2/ [binary classification per antigen] 11mer-based/ => (used in the manuscript, harder: separating bottom 1% to bottom 1% to 5%) 1ADQ_A_Task1_SlicesBalancedData.txt => contains all the bottom 1% sequences (Mascotte) to the antigen + the same amount of bottom 1% to 5%, sampled to originate from same CDRH3 length distribution CDRH3-based/ => also provided but not used (same generation procedure) Datasets3/ [multi-class classification for subsets of N antigens] nonRedundant_11mer-based/ => using (142) antigens generated from different-proteins in total Treated142.txt => Binding profile of each bottom 1% CDRH3 to each antigen (column) => 142 columns ListAntigens142.txt => name/meaning of columns of Treated142.txt in this order Task2Annotated_142_nonredundant.zip => annotation of each sequence with the status "non-binder (top 99%)" or which antigens it bind if binder (bottom 1%). this was used to generate Treated142.txt, provided for information. redundant_11mer-based/ => using (159) all antigens even if they were generated from the same protein (different PDB) Task2V6Apr.py => this script takes TreatedXXX.txt and generated the multi-class dataset for N antigens (it just selects randomly N columns (antigens)) and keeps monospecific sequences within these selected antigens. [it also does the ML classification] Datasets4_Paratope_Epitope/ (list of all paratope-epitope pairs from binders (bottom 1%) according to different encodings. perEncoding/ (used in the biorxiv manuscript - only unique paratope-epitope pairs) example: Task4_A_EpiSeq_ParaSeq.txt => sequence encoding, degree-explicit Task4_A_EpiSeq_ParaSeqD2.txt => sequence encoding, degree-explicit, filtered with only residues degree 2 or more Task4_E_EpiSeq_ParaSeq_NoDeg => sequence encoding, degree-free Task4_E_EpiSeq_ParaSeq_NoDegD2 => sequence encoding, degree-free, filtered with only residues degree 2 or more ... with Motif, Aggregate and Chemical encodings. Task4_J_ABDB_Motif_StartX.txt => gapped-motifs encodings (ex: X12XXX1 means gaps of size 12 and 1) perAntigen/ also provided for each antigen separately allDegreeExplicitFeatures/ the list of all encodings for each CDRH3 (or each 11-mer - filtered degree 2 residues or not) so this would allow to use different type of encodings for paratope and epitope, and these files can be used to regenerate the files in perEncoding/ or perAntigen/ DatasetsAdditional/ Variations of Datasets1 and 3 with different affinity thresholds: Hard means comparing Mascotte (1% bottom) versus Loosers exclusive (1% to 5% bottom) Harder means comparing Heroes (0.1% bottom) versus Mascotte exclusive (0.1% to 1% bottom) **** Pre-computed list of all possible binding structures to each antigen of the database **** (these files are needed to calculate binding structure of a new CDRH3 sequence to those antigens, by the Absolut! software. They can automatically be re-generated by Absolut! but it takes 1 to 5 days on one core. Easier to download from here.) Structures/ example: 1d6c68fe18082e4c9eaafa0ec9ae7543-10-11-fa152885be4f2bdfa1c07997182f3093Structures.txt **** Comparison with (1e6) human-derived CDRH3 sequences to 23 antigens **** RawBindingsHuman/ XXX_X/ XXXX_XFinalBindings_Process_1_Of_1.txt => structurally-annotated binding of ~1 million human CDRH3 sequences to antigen XXXX_X [Dewitt Plos One 2016, PMID: 27513338] => use the CDR3 column to get the list of used sequences