The repOverlap function analyses the overlap between two or more
repertoires. It provides several methods for comparing immune receptor
sequences that are shared between individuals.
repOverlap(.data, .method = c("public", "overlap", "jaccard", "tversky", "cosine", "morisita", "top+shared", "top+morisita"), .col = "nuc", .quant = c("count", "prop"), .a = 0.5, .b = 0.5, .verbose = T, .dup = c("merge", "remove"))
The data to be processed. Can be data.frame, data.table, or a list of these objects.
Every object must have columns in the immunarch-compatible format; see immunarch_data_format.
Advanced users may also provide other data representations: DBI database connections, Apache Spark DataFrames created with copy_to, or a list of these objects. They are supported with the same limitations as basic objects.
Note: each connection must represent a separate repertoire.
A string that specifies the method of analysis or a combination of methods.
A string that specifies the column to be processed. Pass "nuc" for nucleotide sequence or "aa" for amino acid sequence.
Selects the column with the data to evaluate. Pass "count" or "prop".
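Alpha and beta parameters for the Tversky index. Under the standard definition of the Tversky index, the default values (.a = .b = 0.5) give the Sørensen-Dice coefficient, and .a = .b = 1 gives the Jaccard index.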
If TRUE, output the progress.
Defines the duplicates' behaviour. Pass "merge" or "remove".
An integer that defines the step of incremental overlap (Note! Currently, top+overlap, or "top+shared" and "top+morisita").
"public" and "shared" are synonyms that exist for the convenience of researchers.
The "overlap" coefficient is a similarity measure of the overlap between two finite sets: the size of their intersection divided by the size of the smaller set.
The "jaccard" index is conceptually a percentage of how many objects two sets have in common out of how many objects they have total.
The "tversky" index is an asymmetric similarity measure on sets that compares a variant to a prototype.
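The three set-based measures above can be illustrated with a minimal base R sketch. It does not call repOverlap and is not taken from the immunarch source; the clonotype strings and the variable names (x, y, shared, only_x, only_y) are hypothetical, and the formulas are the standard textbook definitions.

```r
# Two hypothetical repertoires, reduced to their unique clonotype sequences.
x <- c("CASSLGQAYEQYF", "CASSFGTEAFF", "CASSQETQYF", "CASSYSNQPQHF")
y <- c("CASSLGQAYEQYF", "CASSFGTEAFF", "CASRGDTQYF")

shared <- length(intersect(x, y))  # clonotypes found in both repertoires
only_x <- length(setdiff(x, y))    # clonotypes found only in x
only_y <- length(setdiff(y, x))    # clonotypes found only in y

# Overlap coefficient: shared clonotypes relative to the smaller set.
overlap_coef <- shared / min(length(x), length(y))

# Jaccard index: shared clonotypes relative to the union of both sets.
jaccard <- shared / length(union(x, y))

# Tversky index: the two set differences are weighted by a and b.
tversky <- function(a, b) shared / (shared + a * only_x + b * only_y)

# Sorensen-Dice coefficient, for comparison with tversky(0.5, 0.5).
dice <- 2 * shared / (length(x) + length(y))
```

In this sketch tversky(1, 1) equals jaccard and tversky(0.5, 0.5) equals dice, which shows how the two weights move the index between the symmetric special cases. Unequal weights make the measure asymmetric in x and y.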
The "cosine" index is the cosine of the angle between two non-zero vectors of an inner product space, used as a measure of their similarity.
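As a rough illustration of the cosine measure, the base R sketch below computes it for two hypothetical vectors of clone counts. How repOverlap itself aligns the two repertoires is not described here, so aligning on the union of clonotypes with zeros for absent ones is an assumption of this sketch, as are the names x, y, align, vx and vy.

```r
# Hypothetical clone counts for two repertoires, named by clonotype.
x <- c(CASSLGQAYEQYF = 10, CASSFGTEAFF = 5, CASSQETQYF = 1)
y <- c(CASSLGQAYEQYF = 20, CASSFGTEAFF = 10, CASRGDTQYF = 3)

# Align both repertoires on the union of clonotypes (assumption of this
# sketch): a clonotype absent from a repertoire gets a count of 0.
clonotypes <- union(names(x), names(y))
align <- function(v) {
  out <- v[clonotypes]
  out[is.na(out)] <- 0
  unname(out)
}
vx <- align(x)
vy <- align(y)

# Cosine similarity: dot product divided by the product of the norms.
cosine <- sum(vx * vy) / (sqrt(sum(vx^2)) * sqrt(sum(vy^2)))
```

Because the measure depends only on the angle between the vectors, rescaling either repertoire (for example, using proportions instead of counts) leaves the result unchanged.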
The "morisita" index measures how many times it is more likely to randomly select two sampled points from the same quadrat (the dataset is covered by a regular grid of changing size) than it would be in the case of a random distribution generated from a Poisson process.
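The description above is phrased in terms of Morisita's index of dispersion. For comparing two samples, the related Morisita overlap index is commonly used, and the base R sketch below assumes that this is the quantity meant. It is not taken from the immunarch source; the function name morisita_overlap and the count vectors are hypothetical.

```r
# Morisita overlap index for two count vectors defined over the same
# clonotypes, with 0 where a clonotype is absent from a sample.
morisita_overlap <- function(x, y) {
  X <- sum(x)  # total count in the first sample
  Y <- sum(y)  # total count in the second sample
  # Simpson-type terms: probability that two draws without replacement
  # from one sample hit the same clonotype.
  dx <- sum(x * (x - 1)) / (X * (X - 1))
  dy <- sum(y * (y - 1)) / (Y * (Y - 1))
  2 * sum(x * y) / ((dx + dy) * X * Y)
}

# Hypothetical aligned clone counts for two repertoires.
x <- c(10, 5, 1, 0)
y <- c(20, 10, 0, 3)
morisita <- morisita_overlap(x, y)
```

The index is 0 when the two samples share no clonotypes and is close to 1 when clonotypes occur in the same proportions in both. With small samples it can slightly exceed 1.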