Skip to contents

Repertoire overlap

Repertoire overlap is the most common approach to measure repertoire similarity. It is achieved by computation of specific statistics on clonotypes shared between given repertoires, also called “public” clonotypes. immunarch provides several indices: - number of public clonotypes (.method = "public") - a classic measure of overlap similarity.

  • overlap coefficient (.method = "overlap") - a normalised measure of overlap similarity. It is defined as the size of the intersection divided by the smaller of the size of the two sets.

  • Jaccard index (.method = "jaccard") - measures the similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets.

  • Tversky index (.method = "tversky") - an asymmetric similarity measure on sets that compares a variant to a prototype. If using default arguments, it’s similar to Dice’s coefficient.

  • cosine similarity (.method = "cosine") - a measure of similarity between two non-zero vectors

  • Morisita’s overlap index (.method = "morisita") - a statistical measure of dispersion of individuals in a population. It is used to compare overlap among samples.

  • incremental overlap - overlaps of the N most abundant clonotypes with incrementally growing N (.method = "inc+METHOD", e.g., "inc+public" or "inc+morisita").

The function that includes described methods is repOverlap. Again, the output is easily visualised when passed to vis() function that does all the work:

imm_ov1 <- repOverlap(immdata$data, .method = "public", .verbose = F)
imm_ov2 <- repOverlap(immdata$data, .method = "morisita", .verbose = F)

p1 <- vis(imm_ov1)
p2 <- vis(imm_ov2, .text.size = 2)

p1 + p2

vis(imm_ov1, "heatmap2")

You can easily change the number of significant digits:

p1 <- vis(imm_ov2, .text.size = 2.5, .signif.digits = 1)
p2 <- vis(imm_ov2, .text.size = 2, .signif.digits = 2)

p1 + p2

To analyse the computed overlap measures function apply repOverlapAnalysis.

# Apply different analysis algorithms to the matrix of public clonotypes:
# "mds" - Multi-dimensional Scaling
repOverlapAnalysis(imm_ov1, "mds")
## Standard deviations (1, .., p=4):
## [1] 0 0 0 0
## 
## Rotation (n x k) = (12 x 2):
##                [,1]       [,2]
## A2-i129  18.7767715 -18.360817
## A2-i131 -29.9586985  -7.870441
## A2-i133 -28.1148594  22.629093
## A2-i132  44.3435640   6.221812
## A4-i191 -13.8586515   7.452149
## A4-i192  14.0065477  27.068830
## MS1       8.8469009  -8.151574
## MS2       0.9712073  -1.297017
## MS3      10.4398629   4.894354
## MS4      -0.5131505  10.471309
## MS5     -18.5153823 -12.628029
## MS6      -6.4241122 -30.429669
# "tsne" - t-Stochastic Neighbor Embedding
repOverlapAnalysis(imm_ov1, "tsne")
##                DimI      DimII
## A2-i129  -19.114598 -167.18714
## A2-i131  537.802219  416.46583
## A2-i133 -104.058395  -14.12119
## A2-i132 -246.070113  -85.05854
## A4-i191 -211.091968   29.77542
## A4-i192   55.556464 -162.74795
## MS1       11.636618  -92.01072
## MS2     -208.509752  -52.12419
## MS3        1.357906 -142.37759
## MS4     -198.569845  -20.19550
## MS5      507.982614  423.03316
## MS6     -126.921150 -133.45159
## attr(,"class")
## [1] "immunr_tsne" "matrix"      "array"
# Visualise the results
repOverlapAnalysis(imm_ov1, "mds") %>% vis()

# Apply different analysis algorithms to the matrix of public clonotypes:
# "mds" - Multi-dimensional Scaling
repOverlapAnalysis(imm_ov1, "mds")
## Standard deviations (1, .., p=4):
## [1] 0 0 0 0
## 
## Rotation (n x k) = (12 x 2):
##                [,1]       [,2]
## A2-i129  18.7767715 -18.360817
## A2-i131 -29.9586985  -7.870441
## A2-i133 -28.1148594  22.629093
## A2-i132  44.3435640   6.221812
## A4-i191 -13.8586515   7.452149
## A4-i192  14.0065477  27.068830
## MS1       8.8469009  -8.151574
## MS2       0.9712073  -1.297017
## MS3      10.4398629   4.894354
## MS4      -0.5131505  10.471309
## MS5     -18.5153823 -12.628029
## MS6      -6.4241122 -30.429669
# "tsne" - t-Stochastic Neighbor Embedding
repOverlapAnalysis(imm_ov1, "tsne")
##               DimI       DimII
## A2-i129   26.32524    6.429606
## A2-i131 -144.49709 -241.995123
## A2-i133   15.67059   47.015332
## A2-i132   13.63028   82.041968
## A4-i191   44.23155   87.554131
## A4-i192   38.88944   18.866932
## MS1       14.62390   20.195187
## MS2       25.88140   79.012170
## MS3       26.64503   14.737601
## MS4       33.83457   80.189530
## MS5     -137.06576 -243.242911
## MS6       41.83085   49.195576
## attr(,"class")
## [1] "immunr_tsne" "matrix"      "array"
# Visualise the results
repOverlapAnalysis(imm_ov1, "mds") %>% vis()

# Clusterise the MDS resulting components using K-means
repOverlapAnalysis(imm_ov1, "mds+kmeans") %>% vis()

Public repertoire

In order to build a massive table with all clonotypes from the list of repertoires use the pubRep function.

# Pass "nt" as the second parameter to build the public repertoire table using CDR3 nucleotide sequences
pr.nt <- pubRep(immdata$data, "nt", .verbose = F)
pr.nt
##                                                    CDR3.nt Samples A2-i129
##                                                     <char>   <num>   <num>
##     1:                   TGCGCCAGCAGCTTGGAAGAGACCCAGTACTTC       8       1
##     2:                   TGTGCCAGCAGCTTCCAAGAGACCCAGTACTTC       7      NA
##     3:                   TGTGCCAGCAGTTACCAAGAGACCCAGTACTTC       7       1
##     4:                   TGCGCCAGCAGCTTCCAAGAGACCCAGTACTTC       6       2
##     5:                      TGTGCCAGCAGCCAAGAGACCCAGTACTTC       6       4
##    ---                                                                    
## 75101:             TGTGCTTCACAACTCTTATTGGACGAGACCCAGTACTTC       1      NA
## 75102: TGTGCTTCACAAGCCCTACAGGGCACTTTCCATAATTCACCCCTCCACTTT       1      NA
## 75103:                   TGTGCTTCAGGGCGGGCCTACGAGCAGTACTTC       1      NA
## 75104:             TGTGCTTCCGCCGGACCGGACCGGGAGACCCAGTACTTC       1      NA
## 75105:                TGTGCTTGCGGGACAGATAACTATGGCTACACCTTC       1      NA
##        A2-i131 A2-i133 A2-i132 A4-i191 A4-i192   MS1   MS2   MS3   MS4   MS5
##          <num>   <num>   <num>   <num>   <num> <num> <num> <num> <num> <num>
##     1:      NA       1       1      NA       1    NA    NA     1     1     1
##     2:       1       1       2       1      NA     1    NA    NA     2    NA
##     3:       1       1      NA       1       1     1    NA     2    NA    NA
##     4:      NA       1       1      NA      NA    NA     1    NA     1    NA
##     5:       2      NA       2       3       1    NA    NA    NA    NA     4
##    ---                                                                      
## 75101:       1      NA      NA      NA      NA    NA    NA    NA    NA    NA
## 75102:      NA      NA      NA      NA      NA    NA    NA    NA    NA     1
## 75103:      NA      NA      NA      NA      NA     1    NA    NA    NA    NA
## 75104:      NA       1      NA      NA      NA    NA    NA    NA    NA    NA
## 75105:      NA      NA      NA      NA       1    NA    NA    NA    NA    NA
##          MS6
##        <num>
##     1:     1
##     2:     1
##     3:    NA
##     4:     1
##     5:    NA
##    ---      
## 75101:    NA
## 75102:    NA
## 75103:    NA
## 75104:    NA
## 75105:    NA
# Pass "aa+v" as the second parameter to build the public repertoire table using CDR3 aminoacid sequences and V alleles
# In order to use only CDR3 aminoacid sequences, just pass "aa"
pr.aav <- pubRep(immdata$data, "aa+v", .verbose = F)
pr.aav
##                    CDR3.aa   V.name Samples A2-i129 A2-i131 A2-i133 A2-i132
##                     <char>   <char>   <num>   <num>   <num>   <num>   <num>
##     1:         CASSLEETQYF  TRBV5-1       8       1      NA       2       1
##     2:     CASSDSSGGANEQFF  TRBV6-4       6       1       1       2      NA
##     3:         CASSFQETQYF  TRBV5-1       6       3      NA       1       1
##     4:         CASSLGETQYF TRBV12-4       6       2      NA      NA       4
##     5:     CASSDSGGSYNEQFF  TRBV6-4       5      NA      NA      NA       3
##    ---                                                                     
## 74440:     CTSSRPTQGAYEQYF  TRBV7-2       1      NA      NA      NA      NA
## 74441:    CTSSSRAGAGTDTQYF  TRBV7-2       1      NA      NA      NA      NA
## 74442: CTSSYPGLAGLKRKETQYF  TRBV7-2       1      NA      NA      NA       1
## 74443:    CTSSYRQRPYQETQYF  TRBV7-2       1      NA      NA      NA      NA
## 74444:      CTSSYSTSGVGQFF  TRBV7-2       1      NA      NA      NA      NA
##        A4-i191 A4-i192   MS1   MS2   MS3   MS4   MS5   MS6
##          <num>   <num> <num> <num> <num> <num> <num> <num>
##     1:      NA       2    NA    NA     1     1     1     1
##     2:       3      NA    NA    NA     2    NA    NA    12
##     3:      NA      NA    NA     1    NA     1    NA     1
##     4:       3      NA     1    NA    NA    NA     2     1
##     5:      NA       1     1    NA     1    NA    NA     1
##    ---                                                    
## 74440:      NA      NA    NA    NA    NA    NA    NA     1
## 74441:      NA      NA    NA    NA     1    NA    NA    NA
## 74442:      NA      NA    NA    NA    NA    NA    NA    NA
## 74443:      NA      NA    NA    NA     1    NA    NA    NA
## 74444:      NA      NA    NA    NA    NA     1    NA    NA
# You can also pass the ".coding" parameter to filter out all noncoding sequences first:
pr.aav.cod <- pubRep(immdata$data, "aa+v", .coding = T)
# Create a public repertoire with coding-only sequences using both CDR3 amino acid sequences and V genes
pr <- pubRep(immdata$data, "aa+v", .coding = T, .verbose = F)

# Apply the filter subroutine to leave clonotypes presented only in healthy individuals
pr1 <- pubRepFilter(pr, immdata$meta, c(Status = "C"))

# Apply the filter subroutine to leave clonotypes presented only in diseased individuals
pr2 <- pubRepFilter(pr, immdata$meta, c(Status = "MS"))

# Divide one by another
pr3 <- pubRepApply(pr1, pr2)

# Plot it
p <- ggplot() +
  geom_jitter(aes(x = "Treatment", y = Result), data = pr3)
p