Introduction to immune receptor databases

Databases that aggregate information about immune receptor specificity provide a straightforward way to annotate your data and find condition-associated receptors. immunarch provides tools to annotate your data using the most popular AIRR databases: VDJDB, McPAS-TCR and PIRD TBAdb.

Database annotation is a two-step process. First, you need to download database files - either the full database or filtered data obtained from the web interface of the database. After that, you can use immunarch functions to annotate your data and visualise the results. Below you can find a guide for annotation covering both steps.

Downloading databases

VDJDB

VDJDB is a curated database of T-cell receptor sequences of known antigen specificity. The database is GitHub-based and available here: https://github.com/antigenomics/vdjdb-db

Citation: Shugay M et al. VDJdb: a curated database of T-cell receptor sequences with known antigen specificity. Nucleic Acids Research 2017

How to filter and download data

It can be useful to remove irrelevant immune receptors from the database before working with it. For instance, if you analyse human T-cell beta repertoires, there is no need to keep receptors from other species or non-TRB data. Use the VDJDB web interface at https://vdjdb.cdr3.net/search to filter the data. After filtering the data and pressing the “Refresh table” button, locate the “Export” button and select the “TSV” label inside. This starts the download of the filtered database file with a name like “SearchTable-2019-10-17 12_36_11.989.tsv”, which can be used for annotation with immunarch.
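Before passing an exported file to immunarch, it can help to confirm that the export contains only what you selected. The following is a minimal sketch in base R, not part of immunarch. It writes a tiny stand-in file to a temporary location so that it runs on its own; the column names follow the search-table output shown later in this guide, and the rows are made-up placeholders rather than real database records.

```r
# Stand-in for a file exported from the VDJDB web interface.
# The rows are illustrative placeholders, not real database records.
export_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "Gene\tCDR3\tSpecies\tEpitope species",
  "TRB\tCASS_EXAMPLE_1\tHomoSapiens\tCMV",
  "TRB\tCASS_EXAMPLE_2\tHomoSapiens\tCMV",
  "TRA\tCAV_EXAMPLE_3\tHomoSapiens\tInfluenzaA"
), export_path)

# check.names = FALSE keeps headers such as "Epitope species" unchanged
export <- read.delim(export_path, check.names = FALSE, stringsAsFactors = FALSE)

# Tabulate the columns you filtered on in the web interface
gene_counts <- table(export$Gene)
species_counts <- table(export$Species)
print(gene_counts)
print(species_counts)
```

In your own code, replace `export_path` with the path to the downloaded “SearchTable-….tsv” file. If the tables list chains or species you meant to exclude, repeat the filtering step in the web interface.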

How to download full VDJDB

You can use the previous method to download the full database if you set all check marks in the “General” section of the “CDR3” tab. However, if you want to download the raw database files, here is a step-by-step guide to the rather complicated process of downloading and unpacking VDJDB.

  1. First, you need to install JDK 8 - Java Development Kit. If you already have it, skip this step. If you don’t, just search for the proper installation instructions for your system.

  2. Second, you need to install Groovy - a language that is used for processing VDJDB. Go to this link and download the distribution or the Windows installer, depending on your system. For Windows users the best way is to download the Windows installer. For Linux users the easiest way is to use your OS package manager such as apt, dpkg, pacman, etc. For Mac users the most seamless way is to use Homebrew.

  3. Download the VDJDB repository from GitHub via this link: https://github.com/antigenomics/vdjdb-db/archive/master.zip

  4. Unzip the archive and go to the unpacked “vdjdb-db-master” folder.

  5. Go to the “src” folder.

  6. Open your Terminal or Console and execute the following command: groovy -cp . BuildDatabase.groovy --no2fix

  7. After some processing, the database files will be available in the “database” folder inside the “vdjdb-db-master” folder. You will need to provide paths to these files for the immunarch annotation functions.
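If you prefer to script steps 3 to 7, they can be sketched in R roughly as follows. This is an illustration under assumptions, not an official installer: the helper name `build_vdjdb` is hypothetical, and it assumes that Java and Groovy are already installed and that `groovy` is available on your PATH. Only base R functions are used.

```r
# Hypothetical helper that mirrors steps 3-7 of the list above.
# Assumes Java and Groovy are installed and "groovy" is on the PATH.
build_vdjdb <- function(work_dir = tempdir()) {
  work_dir <- normalizePath(work_dir, mustWork = TRUE)
  zip_url <- "https://github.com/antigenomics/vdjdb-db/archive/master.zip"
  zip_path <- file.path(work_dir, "vdjdb-db-master.zip")

  # Step 3: download the repository archive
  download.file(zip_url, zip_path, mode = "wb")

  # Step 4: unpack it; the archive contains the "vdjdb-db-master" folder
  unzip(zip_path, exdir = work_dir)
  repo_dir <- file.path(work_dir, "vdjdb-db-master")

  # Steps 5-6: run the build script from inside the "src" folder
  old_wd <- setwd(file.path(repo_dir, "src"))
  on.exit(setwd(old_wd), add = TRUE)
  status <- system2("groovy", c("-cp", ".", "BuildDatabase.groovy", "--no2fix"))
  if (status != 0) stop("Groovy build failed with exit status ", status)

  # Step 7: return the paths of the generated database files
  list.files(file.path(repo_dir, "database"), full.names = TRUE)
}

# Not run here: needs network access and a working Groovy installation
# db_files <- build_vdjdb("~/Downloads")
```

The function returns the file paths found in the “database” folder, which you can then pass to the immunarch annotation functions.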

McPAS-TCR

McPAS-TCR is a manually curated catalogue of pathology-associated T-cell receptor sequences. The database is available at http://friedmanlab.weizmann.ac.il/McPAS-TCR/

Citation: Tickotsky N, Sagiv T, Prilusky J, Shifrut E, Friedman N (2017). McPAS-TCR: A manually-curated catalogue of pathology-associated T cell receptor sequences. Bioinformatics 33:2924-2929

How to filter and download data

The filtering feature of the database’s web interface is located in the “Search Database” tab. After processing the data, press the “Download .csv” button. The downloaded file named “McPAS-TCR_search.csv” can be used for annotation with immunarch.

How to download full McPAS-TCR

To download McPAS-TCR, go to http://friedmanlab.weizmann.ac.il/McPAS-TCR/ and press the “Download the complete database” button there. Note that sometimes you need to press it twice or press it in a new browser tab to start the download.

TBAdb from PIRD

TBAdb is a manually curated database of T-cell receptors (TCR) and B-cell receptors (BCR) targeting specific antigens or diseases. The database contains three parts, namely TCR-AB, TCR-GD and BCR. These three parts collect sequences and specificity information of TCRA and TCRB, TCR-gamma and TCR-delta, and BCR, respectively. The database is referenced in this paper: https://doi.org/10.1093/bioinformatics/btz614

Currently there is no direct way to download TBAdb.

Citation: Zhang W, Wang L, Liu K, Wei X, Yang K, Du W, Wang S, Guo N, Ma C, Luo L, et al. PIRD: Pan immune repertoire database. Bioinformatics (2019)

Annotation of the clonotypes

After downloading the database, we can proceed to the annotation part with R. To demonstrate the applicability of R and immunarch, we will work through a common task: annotating repertoires in the context of Cytomegalovirus (CMV) infection.

Preprocessing databases with R

To start, we need to load the databases into R and filter out non-human, non-TRB and non-CMV data from the input database. With databases, we follow the same philosophy as with the repLoad and vis functions: the function dbLoad provides a single interface for loading and basic querying of all supported databases.

For demonstration purposes, we will process each of the supported databases below.
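To make the filtering step concrete, the sketch below applies the same three criteria (species, chain and pathology) to a toy data frame in base R. It only illustrates the idea and is not how dbLoad is implemented. The Species, Chain and Pathology column names mirror the columns visible in the dbLoad output in this guide; the rows are made-up placeholders.

```r
# Toy stand-in for a loaded database; rows are placeholders, not real records
toy_db <- data.frame(
  cdr3 = c("SEQ_A", "SEQ_B", "SEQ_C", "SEQ_D"),
  Species = c("HomoSapiens", "MusMusculus", "HomoSapiens", "HomoSapiens"),
  Chain = c("TRB", "TRB", "TRA", "TRB"),
  Pathology = c("CMV", "MCMV", "InfluenzaA", "EBV"),
  stringsAsFactors = FALSE
)

# Keep only rows that satisfy all three criteria at once
keep <- toy_db$Species == "HomoSapiens" &
  toy_db$Chain == "TRB" &
  toy_db$Pathology == "CMV"
filtered_db <- toy_db[keep, ]
print(filtered_db)
```

Only the first row passes all three conditions. With real data, the .species, .chain and .pathology arguments of dbLoad do this for you.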

VDJDB

Download the VDJDB database following the instructions above. In the examples, we use URLs to snippets of databases as file paths. In your own code you need to provide paths to your local database files, e.g., “/Users/yourname/Downloads/vdjdb-db-master/vdjdb.slim.txt”. Do not use the links below since they are only for testing purposes and do not contain the actual databases!

Note that VDJDB data obtained from the web interface differs from VDJDB obtained from raw files. See the “VDJDB search tables” section below for working with search tables.

The most basic way to load VDJDB to R:

vdjdb = dbLoad("https://gitlab.com/immunomind/immunarch/raw/dev-0.5.0/private/vdjdb.slim.txt.gz", "vdjdb")
vdjdb
## # A tibble: 61,049 × 19
##    gene  cdr3        species antig…¹ antig…² antig…³ compl…⁴ v.segm j.segm v.end
##    <chr> <chr>       <chr>   <chr>   <chr>   <chr>     <dbl> <chr>  <chr>  <dbl>
##  1 TRB   CASSTSRLSN… Macaca… STPESA… Tat     SIV     0       TRBV1… TRBJ1…     4
##  2 TRB   CASSQDRGPA… HomoSa… RLRAEA… EBNA3A  EBV     1.93e 4 TRBV4… TRBJ2…     5
##  3 TRB   CASSMSRSSN… Macaca… TTPESA… Tat     SIV     0       TRBV14 TRBJ1…     4
##  4 TRA   CASNTGTASK… HomoSa… GILGFV… M       Influe… 0       TRAV24 TRAJ44     2
##  5 TRB   CASSLGSQNT… MusMus… HGIRNA… M45     MCMV    2.24e24 TRBV1… TRBJ2…     5
##  6 TRB   CSASILGLAG… HomoSa… KLGGAL… IE1     CMV     8.58e 3 TRBV2… TRBJ2…     3
##  7 TRA   CAVLLEYGNK… HomoSa… GILGFV… M       Influe… 0       TRAV1… TRAJ47     3
##  8 TRB   CASSYFSATN… HomoSa… KLGGAL… IE1     CMV     3.44e 3 TRBV6… TRBJ2…     5
##  9 TRB   CASTGDSNER… MusMus… SSYRRP… PB1     Influe… 2.28e 4 TRBV1… TRBJ1…     3
## 10 TRB   CASSAFPCRE… HomoSa… NLVPMV… pp65    CMV     0       TRBV6… TRBJ2…     4
## # … with 61,039 more rows, 9 more variables: j.start <dbl>, mhc.a <chr>,
## #   mhc.b <chr>, mhc.class <chr>, reference.id <chr>, vdjdb.score <dbl>,
## #   Species <chr>, Chain <chr>, Pathology <chr>, and abbreviated variable names
## #   ¹​antigen.epitope, ²​antigen.gene, ³​antigen.species, ⁴​complex.id

To load VDJDB and filter it at the same time, provide the .species, .chain and .pathology arguments:

vdjdb = dbLoad("https://gitlab.com/immunomind/immunarch/raw/dev-0.5.0/private/vdjdb.slim.txt.gz", "vdjdb", .species = "HomoSapiens", .chain = "TRB", .pathology = "CMV")
vdjdb
## # A tibble: 18,039 × 19
##    gene  cdr3        species antig…¹ antig…² antig…³ compl…⁴ v.segm j.segm v.end
##    <chr> <chr>       <chr>   <chr>   <chr>   <chr>     <dbl> <chr>  <chr>  <dbl>
##  1 TRB   CSASILGLAG… HomoSa… KLGGAL… IE1     CMV        8584 TRBV2… TRBJ2…     3
##  2 TRB   CASSYFSATN… HomoSa… KLGGAL… IE1     CMV        3445 TRBV6… TRBJ2…     5
##  3 TRB   CASSAFPCRE… HomoSa… NLVPMV… pp65    CMV           0 TRBV6… TRBJ2…     4
##  4 TRB   CASSLWTTNY… HomoSa… KLGGAL… IE1     CMV       19396 TRBV1… TRBJ1…     5
##  5 TRB   CASSLTTESG… HomoSa… NLVPMV… pp65    CMV           0 TRBV7… TRBJ2…     5
##  6 TRB   CASTAKQDFQ… HomoSa… KLGGAL… IE1     CMV       10972 TRBV1… TRBJ2…     3
##  7 TRB   CASSGAGAGY… HomoSa… KLGGAL… IE1     CMV        6231 TRBV6… TRBJ2…     4
##  8 TRB   CASSLIIGVS… HomoSa… KLGGAL… IE1     CMV       12587 TRBV1… TRBJ1…     5
##  9 TRB   CATSSSGVQE… HomoSa… KLGGAL… IE1     CMV       13267 TRBV15 TRBJ2…     4
## 10 TRB   CASSLGTLEE… HomoSa… NLVPMV… pp65    CMV           0 TRBV6… TRBJ2…     4
## # … with 18,029 more rows, 9 more variables: j.start <dbl>, mhc.a <chr>,
## #   mhc.b <chr>, mhc.class <chr>, reference.id <chr>, vdjdb.score <dbl>,
## #   Species <chr>, Chain <chr>, Pathology <chr>, and abbreviated variable names
## #   ¹​antigen.epitope, ²​antigen.gene, ³​antigen.species, ⁴​complex.id

VDJDB search tables

vdjdb_st = dbLoad("https://gitlab.com/immunomind/immunarch/raw/dev-0.5.0/private/SearchTable-2019-10-17%2012_36_11.989.tsv.gz", "vdjdb-search", .species = "HomoSapiens", .chain = "TRB", .pathology = "CMV")
vdjdb_st
## # A tibble: 4,999 × 19
##    complex.id Gene  CDR3     V     J     Species `MHC A` `MHC B` MHC c…¹ Epitope
##         <dbl> <chr> <chr>    <chr> <chr> <chr>   <chr>   <chr>   <chr>   <chr>  
##  1          0 TRB   CASSVDG… TRBV… TRBJ… HomoSa… HLA-A*… B2M     MHCI    YILEET…
##  2          0 TRB   CAWSWGD… TRBV… TRBJ… HomoSa… HLA-A*… B2M     MHCI    YILEET…
##  3          0 TRB   CASSLVG… TRBV… TRBJ… HomoSa… HLA-A*… B2M     MHCI    YILEET…
##  4          0 TRB   CASSLLM… TRBV… TRBJ… HomoSa… HLA-B*… B2M     MHCI    ELKRKM…
##  5          0 TRB   CASSYRA… TRBV… TRBJ… HomoSa… HLA-B*… B2M     MHCI    ELKRKM…
##  6          0 TRB   CASSQAV… TRBV… TRBJ… HomoSa… HLA-B*… B2M     MHCI    ELKRKM…
##  7          0 TRB   CASVTGS… TRBV… TRBJ… HomoSa… HLA-B*… B2M     MHCI    ELKRKM…
##  8          0 TRB   CASSDGT… TRBV… TRBJ… HomoSa… HLA-B*… B2M     MHCI    ELKRKM…
##  9          0 TRB   CASSLAR… TRBV… TRBJ… HomoSa… HLA-B*… B2M     MHCI    CVETMC…
## 10          0 TRB   CASGGAD… TRBV… TRBJ… HomoSa… HLA-B*… B2M     MHCI    CVETMC…
## # … with 4,989 more rows, 9 more variables: `Epitope gene` <chr>,
## #   `Epitope species` <chr>, Reference <chr>, Method <chr>, Meta <chr>,
## #   CDR3fix <chr>, Score <dbl>, Chain <chr>, Pathology <chr>, and abbreviated
## #   variable name ¹​`MHC class`

McPAS-TCR

mcpas = dbLoad("https://gitlab.com/immunomind/immunarch/raw/dev-0.5.0/private/McPAS-TCR.csv.gz", "mcpas", .species = "Human", .chain = "TRB", .pathology = "Cytomegalovirus (CMV)")
mcpas
## # A tibble: 2,723 × 29
##    CDR3.…¹ CDR3.…² Species Categ…³ Patho…⁴ Patho…⁵ Addit…⁶ Antig…⁷ NGS   Antig…⁸
##    <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>     <dbl> <chr> <chr>  
##  1 NA      CASLAP… Human   Pathog… Cytome… D003586 NA            1 No    pp65   
##  2 NA      CASLQA… Human   Pathog… Cytome… D003586 NA            1 No    pp65   
##  3 NA      CASLSG… Human   Pathog… Cytome… D003586 NA            1 No    pp65   
##  4 NA      CASLVA… Human   Pathog… Cytome… D003586 NA            1 No    pp65   
##  5 NA      CASSHR… Human   Pathog… Cytome… D003586 CMV           1 No    pp65   
##  6 NA      CASSSA… Human   Pathog… Cytome… D003586 CMV           1 No    pp65   
##  7 NA      CATSDP… Human   Pathog… Cytome… D003586 CMV           1 No    pp65   
##  8 CARNTG… CACSLR… Human   Pathog… Cytome… D003586 CMV           1 No    pp65   
##  9 CAGNTG… CASSAW… Human   Pathog… Cytome… D003586 Reumat…       1 No    pp65   
## 10 CAYPYN… CASSEL… Human   Pathog… Cytome… D003586 CMV           1 No    pp65   
## # … with 2,713 more rows, 19 more variables: Protein.ID <chr>,
## #   Epitope.peptide <chr>, Epitope.ID <chr>, MHC <chr>, Tissue <chr>,
## #   T.Cell.Type <chr>, T.cell.characteristics <chr>, CDR3.alpha.nt <chr>,
## #   TRAV <chr>, TRAJ <chr>, TRBV <chr>, TRBD <chr>, TRBJ <chr>,
## #   Reconstructed.J.annotation <chr>, CDR3.beta.nt <chr>, Mouse.strain <chr>,
## #   PubMed.ID <dbl>, Remarks <chr>, Chain <chr>, and abbreviated variable names
## #   ¹​CDR3.alpha.aa, ²​CDR3.beta.aa, ³​Category, ⁴​Pathology, ⁵​Pathology.Mesh.ID, …

TBAdb

tbadb = dbLoad("https://gitlab.com/immunomind/immunarch/raw/dev-0.5.0/private/TBAdb.xlsx", "tbadb", .species = "Homo Sapiens", .chain = c("TRB", "TRA-TRB"), .pathology = "CMV")
tbadb

Repertoire annotation

The key immunarch function for annotation is dbAnnotate. As an input it requires repertoires to search in, a database to look up, and columns of interest such as CDR3 amino acid sequence or V gene segment names columns. If you want to try it on the test data packaged with immunarch, execute the following line of code before proceeding further:

data(immdata)

In a single line of code, you can find all clonotypes with matching CDR3 amino acid sequences in the input data and the VDJDB database:

dbAnnotate(immdata$data, vdjdb, "CDR3.aa", "cdr3")
##            CDR3.aa Samples A2-i129 A2-i131 A2-i133 A2-i132 A4-i191 A4-i192 MS1
##   1:   CASSLGETQYF      11       6       4       2       8       4       0   1
##   2:   CASSFQETQYF       9       3       2       2       4       2       0   1
##   3:    CASSQETQYF       9       5       2       1       2       3       2   0
##   4:   CASSSSYEQYF       9       1       0       0       1       2       2   1
##   5:  CASSLEGYEQYF       8       0       0       1       1       3       0   1
##  ---                                                                          
## 579:  CSVGTGTYEQYF       1       0       0       0       0       1       0   0
## 580: CSVQGGAYNEQFF       1       0       1       0       0       0       0   0
## 581: CSVQGGSYNEQFF       1       0       1       0       0       0       0   0
## 582:  CSVVATNEKLFF       1       0       0       1       0       0       0   0
## 583: CSVVGTGNTEAFF       1       0       0       0       0       0       0   0
##      MS2 MS3 MS4 MS5 MS6
##   1:   3   1   2   5   1
##   2:   1   0   4   0   2
##   3:   1   0   0   4   1
##   4:   1   0   1   1   3
##   5:   0   1   1   1   1
##  ---                    
## 579:   0   0   0   0   0
## 580:   0   0   0   0   0
## 581:   0   0   0   0   0
## 582:   0   0   0   0   0
## 583:   0   0   0   1   0

The “Samples” column specifies the number of samples in which the clonotype was found. The numbers in the other columns are the clonal counts of the clonotype in each input sample.
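The layout of this table can be reproduced on toy data with base R. The sketch below is only an illustration of how the “Samples” column relates to the per-sample counts; it is not the dbAnnotate implementation. The sample names, sequences and counts are placeholders, the `Clones` column name is an assumption of this sketch, and each sequence is assumed to occur at most once per sample.

```r
# Toy repertoires: one data frame per sample (placeholder values)
repertoires <- list(
  S1 = data.frame(CDR3.aa = c("SEQ_A", "SEQ_B"), Clones = c(6, 5),
                  stringsAsFactors = FALSE),
  S2 = data.frame(CDR3.aa = c("SEQ_A", "SEQ_C"), Clones = c(4, 1),
                  stringsAsFactors = FALSE),
  S3 = data.frame(CDR3.aa = "SEQ_D", Clones = 2, stringsAsFactors = FALSE)
)

# Toy database with one sequence that is absent from every sample
db <- data.frame(cdr3 = c("SEQ_A", "SEQ_C", "SEQ_Z"), stringsAsFactors = FALSE)
db_seqs <- unique(db$cdr3)

# One column per sample: clonal count of each database sequence, 0 if absent
counts <- sapply(repertoires, function(smp) {
  idx <- match(db_seqs, smp$CDR3.aa)
  ifelse(is.na(idx), 0, smp$Clones[idx])
})

# "Samples" is the number of samples with a non-zero count
annotated <- data.frame(
  CDR3.aa = db_seqs,
  Samples = rowSums(counts > 0),
  counts,
  stringsAsFactors = FALSE
)

# Drop sequences that were not found anywhere and sort by incidence
annotated <- annotated[annotated$Samples > 0, ]
annotated <- annotated[order(-annotated$Samples), ]
print(annotated)
```

Here SEQ_A is found in two samples, SEQ_C in one, and SEQ_Z is dropped because it does not occur in any sample.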

In the next example we will search the McPAS-TCR database for condition-associated sequences using both CDR3 amino acid sequences and V gene segments:

dbAnnotate(immdata$data, mcpas, c("CDR3.aa", "V.name"), c("CDR3.beta.aa", "TRBV"))
##                CDR3.aa   V.name Samples A2-i129 A2-i131 A2-i133 A2-i132 A4-i191
##    1:      CAISESYEQYF TRBV10-3       3       0       1       0       0       0
##    2: CASSLAPGATNEKLFF  TRBV7-6       3       0       0       0       0       0
##    3:     CASSLGENIQYF   TRBV13       3       0       0       0       0       1
##    4:     CASSLGRETQYF   TRBV28       3       0       0       0       0       0
##    5:   CSVGTGGTNEKLFF TRBV29-1       3       0       0       0       0       0
##   ---                                                                          
## 2123:           KNPTAF   TRBV19       1       0       0       0       0       0
## 2124:       LLGGQETQYF  TRBV7-4       1       0       0       0       0       0
## 2125:     WASSFQGFTEAF   TRBV28       1       0       0       0       0       0
## 2126:    WASSQALPYEQYF TRBV12-4       1       0       0       0       0       0
## 2127:  WASSQQTGTIGGYTF  TRBV6-5       1       0       0       0       0       0
##       A4-i192 MS1 MS2 MS3 MS4 MS5 MS6
##    1:       1   0   0   0   0   0   0
##    2:       0   0   1   0   0   1   0
##    3:       1   0   0   0   0   0   0
##    4:       0   0   0   0   0  75   1
##    5:       0   0   1   0   0   0   1
##   ---                                
## 2123:       0   0   0   0   0   0   0
## 2124:       0   0   0   0   0   0   0
## 2125:       0   0   0   0   0   0   0
## 2126:       0   0   0   0   0   0   0
## 2127:       0   0   0   0   0   0   0

If you seek to search a database for a specific set of sequences, create a data frame containing them and use it as a database file:

local_db = data.frame(Seq = c("CASSDSSGGANEQFF", "CSARLAGGQETQYF"), V = c("TRBV6-4", "TRBV20-1"), stringsAsFactors = FALSE)

dbAnnotate(immdata$data, local_db, c("CDR3.aa", "V.name"), c("Seq", "V"))
##            CDR3.aa   V.name Samples A2-i129 A2-i131 A2-i133 A2-i132 A4-i191
## 1: CASSDSSGGANEQFF  TRBV6-4       7       1       1       2       0       3
## 2:  CSARLAGGQETQYF TRBV20-1       6       1       3       0       1       0
##    A4-i192 MS1 MS2 MS3 MS4 MS5 MS6
## 1:       0   0   0   2   0   0  12
## 2:       0   0   0   1   0   0   1
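To see why adding the V gene column narrows the result, compare matching on one key with matching on two keys. The following base R sketch uses `merge` on a toy repertoire. It is an illustration of the matching idea, not the dbAnnotate implementation; the second repertoire row (same CDR3 with a different V gene) and the `Clones` values are made up for the example.

```r
# Toy repertoire: the same CDR3 appears with two different V genes
repertoire <- data.frame(
  CDR3.aa = c("CASSDSSGGANEQFF", "CASSDSSGGANEQFF", "CSARLAGGQETQYF"),
  V.name = c("TRBV6-4", "TRBV7-2", "TRBV20-1"),
  Clones = c(3, 1, 2),
  stringsAsFactors = FALSE
)

# Custom database of sequences of interest
db <- data.frame(
  Seq = c("CASSDSSGGANEQFF", "CSARLAGGQETQYF"),
  V = c("TRBV6-4", "TRBV20-1"),
  stringsAsFactors = FALSE
)

# Matching on CDR3 only keeps both V-gene variants of the first sequence
by_cdr3 <- merge(repertoire, db, by.x = "CDR3.aa", by.y = "Seq")

# Matching on CDR3 and V gene keeps only exact pairs
by_both <- merge(repertoire, db,
                 by.x = c("CDR3.aa", "V.name"), by.y = c("Seq", "V"))

print(nrow(by_cdr3))
print(nrow(by_both))
```

Matching on both columns is stricter, so it returns fewer rows than matching on CDR3 alone whenever the same CDR3 occurs with several V genes.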

Visualisation

Visualisation with the vis() function will be supported in the next major release of immunarch. You can use ggplot2 to visualise distributions of found clonotypes.
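As a starting point for such a plot, the sketch below reshapes a wide annotation table into a long one and draws a grouped bar chart. It is a sketch under assumptions: the toy table only imitates the layout of the dbAnnotate output (a sequence column, a “Samples” column, then one column per sample), the sample names and counts are placeholders, and the plot is built only if ggplot2 is installed.

```r
# Toy annotation result in a wide layout (placeholder counts)
annotated <- data.frame(
  CDR3.aa = c("CASSDSSGGANEQFF", "CSARLAGGQETQYF"),
  Samples = c(2, 2),
  S1 = c(1, 1),
  S2 = c(0, 3),
  S3 = c(2, 0),
  stringsAsFactors = FALSE
)

# Reshape to long format: one row per clonotype and sample
sample_cols <- setdiff(names(annotated), c("CDR3.aa", "Samples"))
long <- data.frame(
  CDR3.aa = rep(annotated$CDR3.aa, times = length(sample_cols)),
  Sample = rep(sample_cols, each = nrow(annotated)),
  Clones = unlist(annotated[sample_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Build the plot only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(long, ggplot2::aes(x = Sample, y = Clones, fill = CDR3.aa)) +
    ggplot2::geom_col(position = "dodge") +
    ggplot2::labs(x = "Sample", y = "Clonal count", fill = "Clonotype")
  # print(p) draws the plot in an interactive session
}
```

The long format puts each clonotype and sample pair on its own row, which is the shape ggplot2 expects for mapping samples to the x axis and clonotypes to the fill colour.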

Advanced filtering

immunarch provides a very basic query interface that permits filtering by species, chain and pathology only. To perform advanced filtering, such as filtering by antigen epitope, you need to use R directly. In most cases, filtering with the dplyr package is the most seamless way. Here is an example of how to use dplyr to keep only a specific antigen epitope from VDJDB:

# Load the dplyr library
library(dplyr)

# Load the database with immunarch
vdjdb = dbLoad("https://gitlab.com/immunomind/immunarch/raw/dev-0.5.0/private/vdjdb.slim.txt.gz", "vdjdb", .species = "HomoSapiens", .chain = "TRB", .pathology = "CMV")

# Check which antigen epitopes are present in the database
table(vdjdb$antigen.epitope)
## 
## ARNLVPMVATVQGQN       AYAQKIFKI    CPSQEPMSIYVY       CVETMCNEY       DEEDAIAAY 
##               3              39               2               2               2 
## EDVPSGKLFMHVTLG      EFFWDANDIY       ELKRKMIYM       ELRRKMMYM        FPTKDVAL 
##               1               1               5              10              10 
##       IPSINVHHY       KLGGALQAK LSEFCRVLCCYVLEE       MLNIPSINV        NEGVKAAW 
##              93           12667               2              73              49 
##       NLVPMVATV       QIKVRVDMV       QIKVRVKMV       QYDPVAALF      RPHERNGFTV 
##            4496              15              24              39               4 
##     RPHERNGFTVL      TPRVTGGGAM       VLEETSVML       VMAPRTLIL       VTEHDTLLY 
##              22             207              14               1             202 
##       YILEETSVM     YSEHPTFTSQY 
##               3              53
# Keep only the NLVPMVATV epitope and drop all others
vdjdb = vdjdb %>% filter(antigen.epitope == "NLVPMVATV")
vdjdb
## # A tibble: 4,496 × 19
##    gene  cdr3        species antig…¹ antig…² antig…³ compl…⁴ v.segm j.segm v.end
##    <chr> <chr>       <chr>   <chr>   <chr>   <chr>     <dbl> <chr>  <chr>  <dbl>
##  1 TRB   CASSAFPCRE… HomoSa… NLVPMV… pp65    CMV           0 TRBV6… TRBJ2…     4
##  2 TRB   CASSLTTESG… HomoSa… NLVPMV… pp65    CMV           0 TRBV7… TRBJ2…     5
##  3 TRB   CASSLGTLEE… HomoSa… NLVPMV… pp65    CMV           0 TRBV6… TRBJ2…     4
##  4 TRB   CASSLDSLNT… HomoSa… NLVPMV… pp65    CMV           0 TRBV5… TRBJ1…     5
##  5 TRB   CSADGLPISS… HomoSa… NLVPMV… pp65    CMV           0 TRBV2… TRBJ2…     2
##  6 TRB   CASSFRQGAF… HomoSa… NLVPMV… pp65    CMV           0 TRBV7… TRBJ2…     4
##  7 TRB   CASSFGPRAG… HomoSa… NLVPMV… pp65    CMV           0 TRBV7… TRBJ2…     4
##  8 TRB   CASSYGTGKD… HomoSa… NLVPMV… pp65    CMV           0 TRBV7… TRBJ2…     4
##  9 TRB   CSVEAYATDY… HomoSa… NLVPMV… pp65    CMV           0 TRBV2… TRBJ1…     4
## 10 TRB   CASSSGLISF… HomoSa… NLVPMV… pp65    CMV           0 TRBV5… TRBJ2…     4
## # … with 4,486 more rows, 9 more variables: j.start <dbl>, mhc.a <chr>,
## #   mhc.b <chr>, mhc.class <chr>, reference.id <chr>, vdjdb.score <dbl>,
## #   Species <chr>, Chain <chr>, Pathology <chr>, and abbreviated variable names
## #   ¹​antigen.epitope, ²​antigen.gene, ³​antigen.species, ⁴​complex.id
# Check that everything is OK and there are no other epitopes
table(vdjdb$antigen.epitope)
## 
## NLVPMVATV 
##      4496
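The same idea extends to several conditions at once. The sketch below keeps rows for a set of epitopes and requires a minimum confidence score, using base R on a toy data frame so that it runs without the database files. The antigen.epitope and vdjdb.score column names come from the VDJDB output above; the rows, the scores and the threshold of 1 are placeholders chosen for illustration.

```r
# Toy stand-in for the loaded VDJDB table (placeholder rows and scores)
toy_vdjdb <- data.frame(
  cdr3 = c("SEQ_A", "SEQ_B", "SEQ_C", "SEQ_D"),
  antigen.epitope = c("NLVPMVATV", "KLGGALQAK", "TPRVTGGGAM", "NLVPMVATV"),
  vdjdb.score = c(0, 2, 1, 3),
  stringsAsFactors = FALSE
)

# Epitopes of interest and a minimum score (both chosen for illustration)
wanted_epitopes <- c("NLVPMVATV", "KLGGALQAK")
min_score <- 1

keep <- toy_vdjdb$antigen.epitope %in% wanted_epitopes &
  toy_vdjdb$vdjdb.score >= min_score
subset_db <- toy_vdjdb[keep, ]
print(subset_db)
```

With dplyr, the equivalent call would be filter(antigen.epitope %in% wanted_epitopes, vdjdb.score >= min_score).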

Get in contact with us

Cannot find an important feature? Have a question or found a bug? Contact us at