Compute key immune repertoire statistics

A family of functions that extract core descriptive statistics from an ImmunData object.

Available functions

Supported methods are the following.

airr_stats_chains — count V(D)J chains per repertoire (optionally split by locus). Quickly gauges capture depth per repertoire and, when split by locus, reveals TRA/TRB/IGH balance. Use it for QC, library-size checks, and to spot locus-specific dropouts or over-representation.

airr_stats_lengths — count the number of sequence lengths per repertoire. Summarizes the CDR3 length distribution, a sensitive QC fingerprint of repertoire prep and selection. Helpful for detecting primer/UMI biases, comparing cohorts, and deriving length-based features for models.

airr_stats_genes - count V(D)J gene segments per repertoire, optionally split by locus and using either receptor counts or barcode/UMI counts as the measure. Profiles V/D/J gene usage to characterize repertoire composition and germline biases, with optional locus split. Useful for cohort comparisons, flagging clonal expansions, and producing ML-ready features for repertoire-level ML tasks.

Usage

airr_stats_chains(
  idata,
  locus_col = NA,
  autojoin = getOption("immundata.autojoin", TRUE),
  format = c("long", "wide")
)

airr_stats_lengths(
  idata,
  seq_col = "cdr3_aa",
  autojoin = getOption("immundata.autojoin", TRUE),
  format = c("long", "wide")
)

airr_stats_genes(
  idata,
  gene_col = "v_call",
  level = c("receptor", "barcode"),
  by = c(NA, "locus"),
  autojoin = getOption("immundata.autojoin", TRUE),
  format = c("long", "wide")
)

Arguments

idata: An ImmunData object.
locus_col: Column in idata$annotations that stores the locus (e.g. "locus"). If NULL or missing, the result is not split by locus.
autojoin: Logical. If TRUE, join repertoire metadata by the schema repertoire id. Change the default behaviour by calling options(immunarch.autojoin = FALSE).
format: String. One of "long" ("long" tibble with imd_repertoire_id, facet columns, and value; useful for visualizations) or "wide" (wide/unmelted table of features, with each row corresponding to a specific repertoire / pair of repertoires; useful for Machine Learning).
seq_col: Character vector with names of the columns containing sequences.
gene_col: A single column name in idata$annotations with gene segment calls (e.g., "v_call", "d_call", "j_call", "c_call"). Default is "v_call".
level: One of "receptor" or "barcode". If "receptor" (default), the function counts unique receptors (one per receptor ID) that carry a given gene segment. If "barcode", the function sums counts (e.g., cells/UMIs) per gene segment using the column defined by immundata::imd_schema("count").
by: Either NULL (no split) or "locus". When "locus", the result is further split by the locus column if present (as given by immundata::imd_schema("locus")); otherwise a warning is emitted and the split is ignored.

Value

`airr_stats_chains` Returns a tibble with columns:

repertoire_id – repertoire identifier
locus – TRA, TRB, IGH, ... (present only if locus_col is supplied)
n_chains – number of chains

`airr_stats_lengths` Returns a tibble with columns:

repertoire_id – repertoire identifier
seq_len – lengths of sequences
n – number of receptors

`airr_stats_genes` A tibble with columns:

repertoire_id - repertoire identifier
(optional) locus - TRA, TRB, IGH, ... (present only when by = "locus" and the locus column exists)
<gene_col> - the gene segment value (e.g., V gene)
n - the measure:
- if level = "receptor": number of receptors carrying the gene segment
- if level = "barcode": sum of counts across receptors for the segment

Examples

# Limit the number of threads used by the underlying DB for this session.
# Change this only if you know what you're doing (e.g., multi-user machines, shared CI/servers).
db_exec("SET threads TO 2")

# Load data
# \dontrun{
immdata <- get_test_idata() |> agg_repertoires("Therapy")
#> Rows: 2 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (4): File, Therapy, Response, Prefix
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> ℹ Found 2/2 repertoire files from the metadata on the disk
#> ✔ Metadata parsed successfully
#> 
#> ── Reading repertoire data 
#>   1. /home/runner/work/_temp/Library/immundata/extdata/tsv/sample_0_1k.tsv
#>   2. /home/runner/work/_temp/Library/immundata/extdata/tsv/sample_1k_2k.tsv
#> ℹ Checking if all files are of the same type
#> ✔ All files have the same extension
#> 
#> ── Renaming the columns and schemas 
#> ✔ Renaming is finished
#> 
#> ── Preprocessing the data 
#>   1. exclude_columns
#>   2. filter_nonproductive
#> ✔ Preprocessing plan is ready
#> 
#> ── Aggregating the data to receptors 
#> ℹ No locus information found
#> ℹ Processing data as immune repertoire tables - no counts, no barcodes, no chain pairing possible
#> ✔ Execution plan for receptor data aggregation and annotation is ready
#> 
#> ── Joining the metadata table with the dataset using 'filename' column 
#> ✔ Joining plan is ready
#> 
#> ── Postprocessing the data 
#>   1. prefix_barcodes
#> ✔ Postprocessing plan is ready
#> 
#> ── Saving the newly created ImmunData to disk 
#> ℹ Writing the receptor annotation data to [/tmp/RtmpPHpsgz/file1eef2411382b/annotations.parquet]
#> ℹ Writing the metadata to [/tmp/RtmpPHpsgz/file1eef2411382b/metadata.json]
#> ✔ ImmunData files saved to [/tmp/RtmpPHpsgz/file1eef2411382b]
#> ℹ Reading ImmunData files from [/tmp/RtmpPHpsgz/file1eef2411382b]
#> ✔ Loaded ImmunData with the receptor schema: [c("cdr3_aa", "v_call") and list()]
#> ℹ Reading ImmunData files from [/tmp/RtmpPHpsgz/file1eef2411382b]
#> 
#> ── Summary 
#> ℹ Time elapsed: 2.34 secs
#> ✔ Loaded ImmunData with the receptor schema: [c("cdr3_aa", "v_call") and NULL]
#> ✔ Loaded ImmunData with [1902] chains
# }

#
# airr_stats_chains
#

# \dontrun{
airr_stats_chains(immdata)
#> # A tibble: 2 × 6
#>   imd_repertoire_id n_barcodes n_receptors locus n_chains Therapy
#> *             <int>      <dbl>       <int> <chr>    <int> <chr>  
#> 1                 1        955         871 TCRB       955 ICI    
#> 2                 2        947         867 TCRB       947 CAR-T  
# }

#
# airr_stats_lengths
#

# \dontrun{
airr_stats_lengths(immdata)
#> # A tibble: 28 × 6
#>    imd_repertoire_id seq_len     n    prop    pct Therapy
#>  *             <int>   <dbl> <int>   <dbl>  <dbl> <chr>  
#>  1                 1       9    38 0.0398   3.98  ICI    
#>  2                 1      11   163 0.171   17.1   ICI    
#>  3                 1      14   119 0.125   12.5   ICI    
#>  4                 1      10   121 0.127   12.7   ICI    
#>  5                 1      15    40 0.0419   4.19  ICI    
#>  6                 1      16    30 0.0314   3.14  ICI    
#>  7                 1      17     7 0.00733  0.733 ICI    
#>  8                 1       5     1 0.00105  0.105 ICI    
#>  9                 2      11   181 0.191   19.1   CAR-T  
#> 10                 2      10   119 0.126   12.6   CAR-T  
#> # ℹ 18 more rows
# }

#
# airr_stats_genes
#

# \dontrun{
# V gene usage by receptor count
airr_stats_genes(immdata, gene_col = "v_call", level = "receptor")
#> # A tibble: 92 × 4
#>    v_call      imd_repertoire_id     n Therapy
#>  * <chr>                   <int> <int> <chr>  
#>  1 TRBV20-1*01                 1   112 ICI    
#>  2 TRBV29-1*01                 1   105 ICI    
#>  3 TRBV28*01                   1    91 ICI    
#>  4 TRBV18*01                   1    72 ICI    
#>  5 TRBV7-9*01                  1    53 ICI    
#>  6 TRBV7-2*01                  1    43 ICI    
#>  7 TRBV12-3*01                 1    36 ICI    
#>  8 TRBV7-8*01                  1    35 ICI    
#>  9 TRBV6-5*01                  1    28 ICI    
#> 10 TRBV2*01                    1    27 ICI    
#> # ℹ 82 more rows

# V gene usage by summed cell/UMI counts (if a count column is present)
airr_stats_genes(immdata, gene_col = "v_call", level = "barcode")
#> # A tibble: 92 × 4
#>    v_call      imd_repertoire_id     n Therapy
#>  * <chr>                   <int> <dbl> <chr>  
#>  1 TRBV20-1*01                 1   123 ICI    
#>  2 TRBV29-1*01                 1   122 ICI    
#>  3 TRBV28*01                   1    97 ICI    
#>  4 TRBV18*01                   1    86 ICI    
#>  5 TRBV7-9*01                  1    60 ICI    
#>  6 TRBV7-2*01                  1    44 ICI    
#>  7 TRBV7-8*01                  1    36 ICI    
#>  8 TRBV12-3*01                 1    36 ICI    
#>  9 TRBV10-2*01                 1    32 ICI    
#> 10 TRBV6-5*01                  1    28 ICI    
#> # ℹ 82 more rows

# Split by locus (TRA/TRB/... if locus column exists)
airr_stats_genes(immdata, gene_col = "v_call", level = "receptor", by = "locus")
#> # A tibble: 92 × 4
#>    v_call      imd_repertoire_id     n Therapy
#>  * <chr>                   <int> <int> <chr>  
#>  1 TRBV20-1*01                 1   112 ICI    
#>  2 TRBV29-1*01                 1   105 ICI    
#>  3 TRBV28*01                   1    91 ICI    
#>  4 TRBV18*01                   1    72 ICI    
#>  5 TRBV7-9*01                  1    53 ICI    
#>  6 TRBV7-2*01                  1    43 ICI    
#>  7 TRBV12-3*01                 1    36 ICI    
#>  8 TRBV7-8*01                  1    35 ICI    
#>  9 TRBV6-5*01                  1    28 ICI    
#> 10 TRBV2*01                    1    27 ICI    
#> # ℹ 82 more rows
# }

Available functions

Usage

Arguments

Value

airr_stats_chains Returns a tibble with columns:

airr_stats_lengths Returns a tibble with columns:

airr_stats_genes A tibble with columns:

See also

Examples

`airr_stats_chains` Returns a tibble with columns:

`airr_stats_lengths` Returns a tibble with columns:

`airr_stats_genes` A tibble with columns: