This is a utility function to estimate the diversity of species or objects in the given distribution.

Note: the function checks whether .data is a distribution of a random variable (that is, whether it sums to 1). To force normalisation or to prevent it, set .do.norm to TRUE (do normalisation) or FALSE (don't do normalisation), respectively.

repDiversity(.data, .method, .max.q = 6, .min.q = 1, .q = 5, .step = NA, .bound = 2, .quantile = c(0.025, 0.975), .extrapolation = 2e+05, .perc = 50, .verbose = T, .do.norm = NA, .laplace = 0)

Argument | Description |
---|---|
.data | The data to be processed. Can be a data.frame, a data.table, or a list of these objects. Every object must have columns in the immunarch-compatible format (see immunarch_data_format). Competent users may provide advanced data representations: DBI database connections, Apache Spark DataFrames from copy_to, or a list of these objects. They are supported with the same limitations as basic objects. Note: each connection must represent a separate repertoire. |
.method | The estimation method. One of: chao1, hill, div, gini.simp, inv.simp, gini, raref, d50, dXX. |
.max.q | The maximum Hill number to calculate (default: 6). |
.min.q | The function calculates several Hill numbers; this sets the minimum one (default: 1). |
.q | q-parameter for the diversity index. |
.step | Size of the rarefaction step. |
.quantile | Numeric vector with quantiles for confidence intervals. |
.extrapolation | An integer that corresponds to the sample extrapolation size. |
.perc | The percentage used for the dXX index (default: 50). |
.verbose | If TRUE, output progress. |
.do.norm | One of three values: NA, TRUE or FALSE. If NA, check whether .data is a distribution (sum(.data) == 1) and, if it is not, normalise it using the given Laplace correction value. If TRUE, always normalise and apply the Laplace correction. If FALSE, do neither normalisation nor Laplace correction. |
.laplace | A numeric value used as a pseudocount for Laplace smoothing. |
.quant | Select the column with data to evaluate. |
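
The interplay of .do.norm and .laplace described above can be sketched in a few lines of R. This is only an illustration of the documented behaviour, not immunarch's internal code: the helper name `normalise_sketch` is hypothetical, and adding the pseudocount to every element before dividing by the new total is an assumption based on the usual definition of Laplace smoothing.

```r
# Hypothetical helper (not part of immunarch) mimicking the documented
# .do.norm / .laplace semantics on a plain numeric vector.
normalise_sketch <- function(x, do_norm = NA, laplace = 0) {
  # The documentation states the check as sum == 1; a tolerance is used
  # here to avoid floating-point surprises (an assumption of this sketch).
  is_distribution <- isTRUE(all.equal(sum(x), 1))
  if (is.na(do_norm)) {
    do_norm <- !is_distribution  # NA: normalise only if needed
  }
  if (do_norm) {
    # Assumed form of Laplace smoothing: add the pseudocount, then rescale.
    x <- (x + laplace) / sum(x + laplace)
  }
  x
}

counts <- c(50, 30, 15, 5)  # made-up clonotype counts
normalise_sketch(counts)                   # normalised, since the sum is not 1
normalise_sketch(counts, do_norm = FALSE)  # returned unchanged
normalise_sketch(counts, laplace = 1)      # pseudocount added before rescaling
```

With do_norm = NA, a vector that already sums to 1 is returned untouched, which matches the "check, then normalise if needed" wording of the .do.norm description.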

div, gini, gini.simp, inv.simp, raref return a numeric vector of length 1 containing the value.

chao1 returns 4 values: the estimated number of species, the standard deviation of this estimate, and the two bounds of its 95% confidence interval.

hill returns a vector of specified length `.max.q - .min.q`

- True diversity, or the effective number of types, refers to the number of equally-abundant types needed for the average proportional abundance of the types to equal that observed in the dataset of interest where all types may not be equally abundant.

- Inverse Simpson index is the effective number of types that is obtained when the weighted arithmetic mean is used to quantify average proportional abundance of types in the dataset of interest.

- The Gini coefficient measures the inequality among values of a frequency distribution (for example, levels of income). A Gini coefficient of zero expresses perfect equality, where all values are the same (for example, where everyone has the same income). A Gini coefficient of one (or 100 percent) expresses maximal inequality among values (for example, where only one person has all the income).

- The Gini-Simpson index is the probability of interspecific encounter, i.e., probability that two entities represent different types.

- The Chao1 estimator is a nonparametric asymptotic estimator of species richness (the number of species in a population).

- Rarefaction is a technique to assess species richness from the results of sampling through extrapolation.

- Hill numbers are a mathematically unified family of diversity indices (differing among themselves only by an exponent q).

- d50 is a recently developed immune diversity estimate. It calculates the minimum number of distinct clonotypes that together account for at least 50 percent of the total sequencing reads obtained following amplification and sequencing.

- dXX is an index similar to d50, where XX corresponds to the desired percentage of total sequencing reads.
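
The definitions above can be made concrete with a small base R sketch on made-up clonotype counts. It uses the textbook formulas and is not immunarch's implementation: all `*_sketch` names are hypothetical, the Gini formula is one common sample form, and the Chao1 line gives only the classic point estimate (without the standard deviation or confidence interval that repDiversity reports).

```r
# Toy clonotype counts (made-up numbers, for illustration only).
counts <- c(10, 5, 2, 2, 1, 1, 1)
p <- counts / sum(counts)  # proportional abundances

# Hill number (true diversity) of order q. For q = 1 the general formula
# is undefined, so the limit exp(Shannon entropy) is used instead.
hill_sketch <- function(p, q) {
  p <- p[p > 0]
  if (q == 1) exp(-sum(p * log(p))) else sum(p^q)^(1 / (1 - q))
}

inv_simpson  <- 1 / sum(p^2)  # inverse Simpson index (Hill number of order 2)
gini_simpson <- 1 - sum(p^2)  # probability that two random draws differ in type

# Gini coefficient, computed from values sorted in increasing order.
gini_sketch <- function(x) {
  x <- sort(x)
  n <- length(x)
  sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
}

# dXX: smallest number of top clonotypes covering at least perc% of reads.
dxx_sketch <- function(counts, perc = 50) {
  cum_share <- cumsum(sort(counts, decreasing = TRUE)) / sum(counts)
  which(cum_share >= perc / 100)[1]
}

# Classic Chao1 point estimate from singletons (f1) and doubletons (f2);
# the bias-corrected form is used when there are no doubletons.
chao1_sketch <- function(counts) {
  s_obs <- sum(counts > 0)
  f1 <- sum(counts == 1)
  f2 <- sum(counts == 2)
  if (f2 > 0) s_obs + f1^2 / (2 * f2) else s_obs + f1 * (f1 - 1) / 2
}

sapply(0:3, function(q) hill_sketch(p, q))  # Hill numbers for q = 0..3
dxx_sketch(counts, perc = 50)               # d50
chao1_sketch(counts)
gini_sketch(counts)
```

In this sketch the Hill number of order 0 is simply the number of observed clonotypes, and the Hill number of order 2 coincides with the inverse Simpson index, which illustrates how the exponent q unifies the family.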

See also: repOverlap, entropy, similarity

Rarefaction wiki: https://en.wikipedia.org/wiki/Rarefaction_(ecology)

Hill numbers paper: https://www.uvm.edu/~ngotelli/manuscriptpdfs/ChaoHill.pdf

Diversity wiki: https://en.wikipedia.org/wiki/Measurement_of_biodiversity

# NOT RUN {
library(immunarch)
library(DBI)    # dbConnect(), dbWriteTable()
library(dplyr)  # tbl()

data('test')

# Write the first repertoire to an embedded MonetDBLite database
dbdir <- tempdir()
con <- dbConnect(MonetDBLite::MonetDBLite(), embedded = dbdir)
dbWriteTable(con, "twbtest", twb[[1]], overwrite = TRUE)
twins <- MonetDBLite::src_monetdblite(dbdir = dbdir)
twbtest <- tbl(twins, "twbtest")

# chao1
repDiversity(.data = twbtest, .method = 'chao1')

# Hill numbers
repDiversity(.data = twbtest, .method = 'hill', .max.q = 6, .min.q = 1, .do.norm = NA, .laplace = 0)

# diversity
repDiversity(.data = twbtest, .method = 'div', .q = 5, .do.norm = NA, .laplace = 0)

# Gini-Simpson
repDiversity(.data = twbtest, .method = 'gini.simp', .q = 5, .do.norm = NA, .laplace = 0)

# inverse Simpson
repDiversity(.data = twbtest, .method = 'inv.simp', .do.norm = NA, .laplace = 0)

# Gini coefficient
repDiversity(.data = twbtest, .method = 'gini', .do.norm = NA, .laplace = 0)

# rarefaction
repDiversity(.data = twbtest, .method = 'raref', .step = NA, .quantile = c(.025, .975), .extrapolation = 200000, .verbose = T)

# d50
repDiversity(.data = twbtest, .method = 'd50')

# dXX
repDiversity(.data = twbtest, .method = 'dXX', .perc = 10)
# }