This is a utility function to estimate the diversity of species or objects in the given distribution.

Note: the function checks whether .data is a distribution of a random variable (i.e., whether it sums to 1). To force normalisation, set .do.norm to TRUE; to prevent it, set .do.norm to FALSE.
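The normalisation rule described in the note can be sketched in base R. This is only an illustration of the documented behaviour of .do.norm and .laplace, not the package's internal code; the helper name `normalise_if_needed` and its argument names are hypothetical.

```r
# Hypothetical helper that mimics the documented meaning of .do.norm / .laplace.
# It is NOT immunarch's internal implementation.
normalise_if_needed <- function(x, do.norm = NA, laplace = 0) {
  # A vector counts as a distribution if it sums to 1 (up to rounding error).
  is_distribution <- isTRUE(all.equal(sum(x), 1))
  # NA means "decide automatically": normalise only if x is not a distribution.
  if (is.na(do.norm)) do.norm <- !is_distribution
  if (do.norm) {
    x <- x + laplace   # Laplace smoothing: add a pseudocount to every entry
    x <- x / sum(x)    # rescale so the entries sum to 1
  }
  x
}

normalise_if_needed(c(2, 1, 1))                   # counts, normalised automatically
normalise_if_needed(c(2, 1, 1), do.norm = FALSE)  # left untouched
normalise_if_needed(c(2, 1, 1), laplace = 1)      # pseudocount added before scaling
```

With `do.norm = NA` the input is rescaled only when it does not already sum to 1, which matches the three-way behaviour documented for .do.norm.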

    repDiversity(
      .data,
      .method = "chao1",
      .col = "aa",
      .max.q = 6,
      .min.q = 1,
      .q = 5,
      .step = NA,
      .quantile = c(0.025, 0.975),
      .extrapolation = NA,
      .perc = 50,
      .norm = T,
      .verbose = T,
      .do.norm = NA,
      .laplace = 0
    )

.data | The data to be processed. Can be a data.frame, a data.table, or a list of these objects. Every object must have columns in the immunarch-compatible format; see immunarch_data_format. Competent users may provide advanced data representations: DBI database connections, Apache Spark DataFrames from copy_to, or a list of these objects. They are supported with the same limitations as basic objects. Note: each connection must represent a separate repertoire. |
---|---|

.method | The estimation method. One of: chao1, hill, div, gini.simp, inv.simp, gini, raref, d50, dxx. |

.col | A string that specifies the column(s) to be processed. Pass one of the following strings, separated by the plus sign: "nt" for nucleotide sequences, "aa" for amino acid sequences, "v" for V gene segments, "j" for J gene segments. E.g., pass "aa+v" to compute diversity estimations on CDR3 amino acid sequences paired with V gene segments, i.e., in this case a unique clonotype is a pair of CDR3 amino acid and V gene segment. Clonal counts of equal clonotypes will be summed up. |

.max.q | The max Hill number to calculate (default: 6). |

.min.q | The function calculates several Hill numbers; this sets the min (default: 1). |

.q | q-parameter for the Diversity index. |

.step | Rarefaction step's size. |

.quantile | Numeric vector with quantiles for confidence intervals. |

.extrapolation | An integer. An upper limit for the number of clones to extrapolate to. Pass 0 (zero) to turn extrapolation subroutines off. |

.perc | The percent of total reads used for the dXX index (default: 50). |

.norm | Normalise rarefaction curves. |

.verbose | If TRUE, output progress. |

.do.norm | One of three values: NA, T or F. If NA, check whether .data is a distribution (sum(.data) == 1) and, if it is not, normalise it using the given Laplace correction value. If T, do normalisation and Laplace correction. If F, do neither normalisation nor Laplace correction. |

.laplace | A numeric value, which is used as a pseudocount for Laplace smoothing. |
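To make the .col behaviour concrete, the following base R sketch shows what "summing clonal counts of equal clonotypes" means for "aa+v". The toy data frame and its column names (`Clones`, `CDR3.aa`, `V.name`) are assumptions chosen for illustration, and `aggregate` stands in for whatever the package does internally.

```r
# Toy repertoire; the values and column names are illustrative assumptions.
rep_df <- data.frame(
  Clones  = c(10, 5, 3, 2),
  CDR3.aa = c("CASSL", "CASSL", "CASSF", "CASSL"),
  V.name  = c("TRBV1", "TRBV1", "TRBV2", "TRBV2"),
  stringsAsFactors = FALSE
)

# Emulates .col = "aa+v": a clonotype is a (CDR3 amino acid, V gene) pair,
# and counts of identical pairs are summed.
agg <- aggregate(Clones ~ CDR3.aa + V.name, data = rep_df, FUN = sum)
print(agg)
```

The first two rows share the same pair and collapse into one clonotype, while the fourth row keeps its own entry because its V gene segment differs.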

div, gini, gini.simp, inv.simp, raref return a numeric vector of length 1 containing the value.

chao1 returns 4 values: the estimated number of species, the standard deviation of this number, and two 95% confidence interval bounds.
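For orientation, here is a base R sketch of the classic textbook Chao1 estimator with a log-normal confidence interval. It is not taken from the package source and its numbers may differ from what repDiversity reports; the function name `chao1_sketch` and the names of the returned elements are hypothetical. The sketch assumes at least one singleton and one doubleton in the sample.

```r
# Textbook Chao1 (classic form), NOT immunarch's implementation.
# counts: clonotype counts of one sample; z: normal quantile for the interval.
chao1_sketch <- function(counts, z = 1.96) {
  s_obs <- sum(counts > 0)   # observed number of species
  f1 <- sum(counts == 1)     # singletons
  f2 <- sum(counts == 2)     # doubletons
  stopifnot(f1 > 0, f2 > 0)  # the classic formulas need both to be non-zero
  r <- f1 / f2
  est <- s_obs + f1^2 / (2 * f2)                # point estimate of richness
  v <- f2 * (0.5 * r^2 + r^3 + 0.25 * r^4)      # variance of the estimate
  unseen <- est - s_obs                         # estimated unseen species
  k <- exp(z * sqrt(log(1 + v / unseen^2)))     # log-normal interval factor
  c(Estimator = est,
    SD = sqrt(v),
    Conf.lo = s_obs + unseen / k,
    Conf.hi = s_obs + unseen * k)
}

chao1_sketch(c(1, 1, 1, 1, 2, 2, 3, 5, 10))
```

The log-normal interval keeps the lower bound at or above the observed richness, which a symmetric interval around the estimate would not guarantee.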

hill returns a vector of specified length `.max.q - .min.q`

If input data is a single immune repertoire, then the function returns a numeric vector with diversity statistics.

Otherwise, it returns a numeric matrix with diversity statistics for all input repertoires.

- True diversity, or the effective number of types, refers to the number of equally-abundant types needed for the average proportional abundance of the types to equal that observed in the dataset of interest where all types may not be equally abundant.

- Inverse Simpson index is the effective number of types that is obtained when the weighted arithmetic mean is used to quantify average proportional abundance of types in the dataset of interest.

- The Gini coefficient measures the inequality among values of a frequency distribution (for example, levels of income). A Gini coefficient of zero expresses perfect equality, where all values are the same (for example, where everyone has the same income). A Gini coefficient of one (or 100 percent) expresses maximal inequality among values (for example, where only one person has all the income).

- The Gini-Simpson index is the probability of interspecific encounter, i.e., probability that two entities represent different types.

- Chao1 estimator is a nonparametric asymptotic estimator of species richness (number of species in a population).

- Rarefaction is a technique to assess species richness from the results of sampling through extrapolation.

- Hill numbers are a mathematically unified family of diversity indices (differing among themselves only by an exponent q).

- d50 is a recently developed immune diversity estimate. It calculates the minimum number of distinct clonotypes that together account for at least 50 percent of the total sequencing reads obtained following amplification and sequencing.

- dXX is an index similar to d50, where XX is the desired percent of total sequencing reads.
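The definitions above can be sketched with standard textbook formulas in base R. These helpers are illustrative only: the function names are hypothetical, the toy counts are made up, and the package's own implementations may differ in details such as normalisation or tie handling.

```r
# Toy clonotype counts for one repertoire (made-up numbers summing to 100).
counts <- c(50, 30, 10, 5, 3, 1, 1)
p <- counts / sum(counts)  # proportional abundances

# Hill number (true diversity) of order q; q = 1 is defined as a limit
# and equals the exponential of the Shannon entropy.
hill_number <- function(p, q) {
  p <- p[p > 0]
  if (q == 1) exp(-sum(p * log(p))) else sum(p^q)^(1 / (1 - q))
}

# Inverse Simpson index: effective number of types (Hill number of order 2).
inv_simpson <- function(p) 1 / sum(p^2)

# Gini-Simpson index: probability that two random entities differ in type.
gini_simpson <- function(p) 1 - sum(p^2)

# Gini coefficient from sorted values (0 = all values equal).
gini_coef <- function(x) {
  x <- sort(x)
  n <- length(x)
  sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
}

# dXX: minimum number of top clonotypes covering at least `perc` percent of reads.
dxx_index <- function(counts, perc = 50) {
  cs <- cumsum(sort(counts, decreasing = TRUE))
  which(cs * 100 >= perc * sum(counts))[1]
}

hill_number(p, 2)      # same value as inv_simpson(p)
gini_simpson(p)
gini_coef(counts)
dxx_index(counts, 50)  # d50
dxx_index(counts, 90)  # d90
```

For a perfectly even repertoire every Hill number equals the number of clonotypes, which is why these values are read as an "effective number of types".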

See also: repOverlap, entropy, repClonality

- Rarefaction wiki: https://en.wikipedia.org/wiki/Rarefaction_(ecology)

- Hill numbers paper: https://www.uvm.edu/~ngotelli/manuscriptpdfs/ChaoHill.pdf

- Diversity wiki: https://en.wikipedia.org/wiki/Measurement_of_biodiversity

    if (FALSE) {
      data(immdata)

      # chao1
      repDiversity(.data = immdata, .method = 'chao1')

      # Hill numbers
      repDiversity(.data = immdata, .method = 'hill', .max.q = 6, .min.q = 1,
                   .do.norm = NA, .laplace = 0)

      # diversity
      repDiversity(.data = immdata, .method = 'div', .q = 5, .do.norm = NA, .laplace = 0)

      # Gini-Simpson
      repDiversity(.data = immdata, .method = 'gini.simp', .q = 5, .do.norm = NA, .laplace = 0)

      # inverse Simpson
      repDiversity(.data = immdata, .method = 'inv.simp', .do.norm = NA, .laplace = 0)

      # Gini coefficient
      repDiversity(.data = immdata, .method = 'gini', .do.norm = NA, .laplace = 0)

      # rarefaction
      repDiversity(.data = immdata, .method = 'raref', .step = NA,
                   .quantile = c(.025, .975), .extrapolation = 200000, .verbose = T)

      # d50
      repDiversity(.data = immdata, .method = 'd50')

      # dXX
      repDiversity(.data = immdata, .method = 'dxx', .perc = 10)
    }