`vignettes/4_overlap.Rmd`

`4_overlap.Rmd`

Repertoire overlap is the most common approach to measure repertoire similarity. It is achieved by computation of specific statistics on clonotypes shared between given repertoires, also called “public” clonotypes. `immunarch`

provides several indices: - number of public clonotypes (`.method = "public"`

) - a classic measure of overlap similarity.

overlap coefficient (

`.method = "overlap"`

) - a normalised measure of overlap similarity. It is defined as the size of the intersection divided by the smaller of the size of the two sets.Jaccard index (

`.method = "jaccard"`

) - it measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets.Tversky index (

`.method = "tversky"`

) - an asymmetric similarity measure on sets that compares a variant to a prototype. If using default arguments, it’s similar to Dice’s coefficient.cosine similarity (

`.method = "cosine"`

) - a measure of similarity between two non-zero vectorsMorisita’s overlap index (

`.method = "morisita"`

) - a statistical measure of dispersion of individuals in a population. It is used to compare overlap among samples.top-overlap - overlaps of the N most abundant clonotypes with sequentially growing N (

`.method = "top+METHOD"`

)

The function that includes described methods is `repOverlap`

. Again the output is easily visualised when passed to `vis()`

function that does all the work:

```
imm_ov1 = repOverlap(immdata$data, .method = "public", .verbose = F)
imm_ov2 = repOverlap(immdata$data, .method = "morisita", .verbose = F)
grid.arrange(vis(imm_ov1), vis(imm_ov2, .text.size=1.5), ncol = 2)
```

`vis(imm_ov1, "heatmap2")`

To analyse the computed overlap measures function apply `repOverlapAnalysis`

.

```
# Apply different analysis algorithms to the matrix of public clonotypes:
# "mds" - Multi-dimensional Scaling
repOverlapAnalysis(imm_ov1, "mds")
```

```
## Standard deviations:
## [1] 0 0 0 0
##
## Rotation:
## [,1] [,2]
## A2-i129 1.8651791 7.771695
## A2-i131 -2.0246183 -8.435687
## A2-i133 -4.6152434 12.017445
## A2-i132 15.6362099 2.186638
## A4-i191 -9.5429713 4.745392
## A4-i192 5.8953987 -10.163369
## MS1 -5.5601454 -1.810935
## MS2 0.3411929 -1.954662
## MS3 5.1691898 1.792923
## MS4 -2.4522675 3.000137
## MS5 5.6609794 -2.721500
## MS6 -10.3729039 -6.428076
```

```
# "tsne" - t-Stochastic Neighbor Embedding
repOverlapAnalysis(imm_ov1, "tsne")
```

```
## DimI DimII
## A2-i129 49.39998 111.706838
## A2-i131 10.24968 -57.071908
## A2-i133 20.40369 66.665371
## A2-i132 49.93091 78.747616
## A4-i191 -44.40195 -102.168599
## A4-i192 24.15683 -9.055338
## MS1 -55.02380 -82.623381
## MS2 13.46577 -44.242101
## MS3 31.77776 89.528218
## MS4 -72.44359 -86.293760
## MS5 20.80516 98.134636
## MS6 -48.32045 -63.327591
## attr(,"class")
## [1] "matrix" "immunr_tsne"
```

```
# Visualise the results
vis(repOverlapAnalysis(imm_ov1, "mds"))
```

```
# Apply different analysis algorithms to the matrix of public clonotypes:
# "mds" - Multi-dimensional Scaling
repOverlapAnalysis(imm_ov1, "mds")
```

```
## Standard deviations:
## [1] 0 0 0 0
##
## Rotation:
## [,1] [,2]
## A2-i129 1.8651791 7.771695
## A2-i131 -2.0246183 -8.435687
## A2-i133 -4.6152434 12.017445
## A2-i132 15.6362099 2.186638
## A4-i191 -9.5429713 4.745392
## A4-i192 5.8953987 -10.163369
## MS1 -5.5601454 -1.810935
## MS2 0.3411929 -1.954662
## MS3 5.1691898 1.792923
## MS4 -2.4522675 3.000137
## MS5 5.6609794 -2.721500
## MS6 -10.3729039 -6.428076
```

```
# "tsne" - t-Stochastic Neighbor Embedding
repOverlapAnalysis(imm_ov1, "tsne")
```

```
## DimI DimII
## A2-i129 142.75898 -155.492305
## A2-i131 -35.09265 -37.830618
## A2-i133 -25.22279 19.723588
## A2-i132 -91.10671 -60.126635
## A4-i191 -40.99363 191.355436
## A4-i192 67.76674 -75.790301
## MS1 -74.74059 170.594758
## MS2 -55.17823 -19.916662
## MS3 147.59090 -114.625395
## MS4 -107.67658 169.287420
## MS5 164.96992 -94.574535
## MS6 -93.07536 7.395247
## attr(,"class")
## [1] "matrix" "immunr_tsne"
```

```
# Visualise the results
vis(repOverlapAnalysis(imm_ov1, "mds"))
```

```
# Clusterise the MDS resulting components using K-means
vis(repOverlapAnalysis(imm_ov1, "mds+kmeans"))
```

In order to build a massive table with all clonotypes from the list of repertoires use the `pubRep`

function.

```
# Pass "nt" as the second parameter to build the public repertoire table using CDR3 nucleotide sequences
pr.nt = pubRep(immdata$data, "nt", .verbose = F)
pr.nt
```

```
## Source: local data table [88,528 x 14]
##
## # A tibble: 88,528 x 14
## CDR3.nt Samples `A2-i129` `A2-i131` `A2-i133` `A2-i132` `A4-i191`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 TGCGCC… 8 2 NA 2 1 NA
## 2 TGTGCC… 7 NA 1 1 2 1
## 3 TGTGCC… 7 1 1 1 NA 1
## 4 TGCGCC… 6 2 NA 1 1 NA
## 5 TGTGCC… 6 5 3 NA 2 3
## 6 TGTGCC… 6 NA 1 1 5 1
## 7 TGTGCC… 6 NA NA NA 1 1
## 8 TGTGCC… 6 1 1 2 NA 5
## 9 TGCAGT… 5 1 3 NA 2 NA
## 10 TGCGCC… 5 1 NA NA NA NA
## # … with 88,518 more rows, and 7 more variables: `A4-i192` <dbl>,
## # MS1 <dbl>, MS2 <dbl>, MS3 <dbl>, MS4 <dbl>, MS5 <dbl>, MS6 <dbl>
```

```
# Pass "aa+v" as the second parameter to build the public repertoire table using CDR3 aminoacid sequences and V alleles
# In order to use only CDR3 aminoacid sequences, just pass "aa"
pr.aav = pubRep(immdata$data, "aa+v", .verbose = F)
pr.aav
```

```
## Source: local data table [87,730 x 15]
##
## # A tibble: 87,730 x 15
## CDR3.aa V.name Samples `A2-i129` `A2-i131` `A2-i133` `A2-i132` `A4-i191`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CASSLE… TRBV5… 8 2 NA 3 1 NA
## 2 CASSDS… TRBV6… 6 1 1 2 NA 5
## 3 CASSDS… TRBV6… 6 NA NA NA 4 1
## 4 CASSFQ… TRBV5… 6 3 NA 1 1 NA
## 5 CASSLG… TRBV1… 6 2 NA NA 4 3
## 6 CASSLQ… TRBV1… 6 NA 1 1 4 NA
## 7 CAS~FF TRBV1… 6 NA 2 1 2 NA
## 8 CSARLA… TRBV2… 6 1 3 NA 2 1
## 9 CASSDS… TRBV6… 5 NA NA NA 3 NA
## 10 CASSFG… TRBV1… 5 NA 1 NA 5 NA
## # … with 87,720 more rows, and 7 more variables: `A4-i192` <dbl>,
## # MS1 <dbl>, MS2 <dbl>, MS3 <dbl>, MS4 <dbl>, MS5 <dbl>, MS6 <dbl>
```

```
# You can also pass the ".coding" parameter to filter out all noncoding sequences first:
pr.aav.cod = pubRep(immdata$data, "aa+v", .coding=T)
```

```
# Create a public repertoire with coding-only sequences using both CDR3 amino acid sequences and V genes
pr = pubRep(immdata$data, "aa+v", .coding = T, .verbose = F)
# Apply the filter subroutine to leave clonotypes presented only in healthy individuals
pr1 = pubRepFilter(pr, immdata$meta, c(Status = "C"))
# Apply the filter subroutine to leave clonotypes presented only in diseased individuals
pr2 = pubRepFilter(pr, immdata$meta, c(Status = "MS"))
# Divide one by another
pr3 = pubRepApply(pr1, pr2)
# Plot it
p = ggplot() + geom_jitter(aes(x = "Treatment", y = Result), data=pr3)
p
```