Skip to contents

[Experimental]

A family of functions that extract core descriptive statistics from an ImmunData object.

Available functions

Supported methods are the following.

airr_stats_chains — count V(D)J chains per repertoire (optionally split by locus). Quickly gauges capture depth per repertoire and, when split by locus, reveals TRA/TRB/IGH balance. Use it for QC, library-size checks, and to spot locus-specific dropouts or over-representation.

airr_stats_lengths — count the number of sequence lengths per repertoire. Summarizes the CDR3 length distribution, a sensitive QC fingerprint of repertoire prep and selection. Helpful for detecting primer/UMI biases, comparing cohorts, and deriving length-based features for models.

airr_stats_genes - count V(D)J gene segments per repertoire, optionally split by locus and using either receptor counts or barcode/UMI counts as the measure. Profiles V/D/J gene usage to characterize repertoire composition and germline biases, with optional locus split. Useful for cohort comparisons, flagging clonal expansions, and producing ML-ready features for repertoire-level ML tasks.

Usage

airr_stats_chains(
  idata,
  locus_col = NA,
  autojoin = getOption("immundata.autojoin", TRUE),
  format = c("long", "wide")
)

airr_stats_lengths(
  idata,
  seq_col = "cdr3_aa",
  autojoin = getOption("immundata.autojoin", TRUE),
  format = c("long", "wide")
)

airr_stats_genes(
  idata,
  gene_col = "v_call",
  level = c("receptor", "barcode"),
  by = c(NA, "locus"),
  autojoin = getOption("immundata.autojoin", TRUE),
  format = c("long", "wide")
)

Arguments

idata

An ImmunData object.

locus_col

Column in idata$annotations that stores the locus (e.g. "locus"). If NULL or missing, the result is not split by locus.

autojoin

Logical. If TRUE, join repertoire metadata by the schema repertoire id. Change the default behaviour by calling options(immunarch.autojoin = FALSE).

format

String. One of "long" ("long" tibble with imd_repertoire_id, facet columns, and value; useful for visualizations) or "wide" (wide/unmelted table of features, with each row corresponding to a specific repertoire / pair of repertoires; useful for Machine Learning).

seq_col

Character vector with names of the columns containing sequences.

gene_col

A single column name in idata$annotations with gene segment calls (e.g., "v_call", "d_call", "j_call", "c_call"). Default is "v_call".

level

One of "receptor" or "barcode". If "receptor" (default), the function counts unique receptors (one per receptor ID) that carry a given gene segment. If "barcode", the function sums counts (e.g., cells/UMIs) per gene segment using the column defined by immundata::imd_schema("count").

by

Either NULL (no split) or "locus". When "locus", the result is further split by the locus column if present (as given by immundata::imd_schema("locus")); otherwise a warning is emitted and the split is ignored.

Value

airr_stats_chains Returns a tibble with columns:

  • repertoire_id – repertoire identifier

  • locus – TRA, TRB, IGH, ... (present only if locus_col is supplied)

  • n_chains – number of chains

airr_stats_lengths Returns a tibble with columns:

  • repertoire_id – repertoire identifier

  • seq_len – lengths of sequences

  • n – number of receptors

airr_stats_genes A tibble with columns:

  • repertoire_id - repertoire identifier

  • (optional) locus - TRA, TRB, IGH, ... (present only when by = "locus" and the locus column exists)

  • <gene_col> - the gene segment value (e.g., V gene)

  • n - the measure:

    • if level = "receptor": number of receptors carrying the gene segment

    • if level = "barcode": sum of counts across receptors for the segment

Examples

# Limit the number of threads used by the underlying DB for this session.
# Change this only if you know what you're doing (e.g., multi-user machines, shared CI/servers).
db_exec("SET threads TO 2")

# Load data
# \dontrun{
immdata <- get_test_idata() |> agg_repertoires("Therapy")
#> Rows: 2 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (4): File, Therapy, Response, Prefix
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
#>  Found 2/2 repertoire files from the metadata on the disk
#>  Metadata parsed successfully
#> 
#> ── Reading repertoire data 
#>   1. /home/runner/work/_temp/Library/immundata/extdata/tsv/sample_0_1k.tsv
#>   2. /home/runner/work/_temp/Library/immundata/extdata/tsv/sample_1k_2k.tsv
#>  Checking if all files are of the same type
#>  All files have the same extension
#> 
#> ── Renaming the columns and schemas 
#>  Renaming is finished
#> 
#> ── Preprocessing the data 
#>   1. exclude_columns
#>   2. filter_nonproductive
#>  Preprocessing plan is ready
#> 
#> ── Aggregating the data to receptors 
#>  No locus information found
#>  Processing data as immune repertoire tables - no counts, no barcodes, no chain pairing possible
#>  Execution plan for receptor data aggregation and annotation is ready
#> 
#> ── Joining the metadata table with the dataset using 'filename' column 
#>  Joining plan is ready
#> 
#> ── Postprocessing the data 
#>   1. prefix_barcodes
#>  Postprocessing plan is ready
#> 
#> ── Saving the newly created ImmunData to disk 
#>  Writing the receptor annotation data to [/tmp/RtmpPHpsgz/file1eef2411382b/annotations.parquet]
#>  Writing the metadata to [/tmp/RtmpPHpsgz/file1eef2411382b/metadata.json]
#>  ImmunData files saved to [/tmp/RtmpPHpsgz/file1eef2411382b]
#>  Reading ImmunData files from [/tmp/RtmpPHpsgz/file1eef2411382b]
#>  Loaded ImmunData with the receptor schema: [c("cdr3_aa", "v_call") and list()]
#>  Reading ImmunData files from [/tmp/RtmpPHpsgz/file1eef2411382b]
#> 
#> ── Summary 
#>  Time elapsed: 2.34 secs
#>  Loaded ImmunData with the receptor schema: [c("cdr3_aa", "v_call") and NULL]
#>  Loaded ImmunData with [1902] chains
# }

#
# airr_stats_chains
#

# \dontrun{
airr_stats_chains(immdata)
#> # A tibble: 2 × 6
#>   imd_repertoire_id n_barcodes n_receptors locus n_chains Therapy
#> *             <int>      <dbl>       <int> <chr>    <int> <chr>  
#> 1                 1        955         871 TCRB       955 ICI    
#> 2                 2        947         867 TCRB       947 CAR-T  
# }

#
# airr_stats_lengths
#

# \dontrun{
airr_stats_lengths(immdata)
#> # A tibble: 28 × 6
#>    imd_repertoire_id seq_len     n    prop    pct Therapy
#>  *             <int>   <dbl> <int>   <dbl>  <dbl> <chr>  
#>  1                 1       9    38 0.0398   3.98  ICI    
#>  2                 1      11   163 0.171   17.1   ICI    
#>  3                 1      14   119 0.125   12.5   ICI    
#>  4                 1      10   121 0.127   12.7   ICI    
#>  5                 1      15    40 0.0419   4.19  ICI    
#>  6                 1      16    30 0.0314   3.14  ICI    
#>  7                 1      17     7 0.00733  0.733 ICI    
#>  8                 1       5     1 0.00105  0.105 ICI    
#>  9                 2      11   181 0.191   19.1   CAR-T  
#> 10                 2      10   119 0.126   12.6   CAR-T  
#> # ℹ 18 more rows
# }

#
# airr_stats_genes
#

# \dontrun{
# V gene usage by receptor count
airr_stats_genes(immdata, gene_col = "v_call", level = "receptor")
#> # A tibble: 92 × 4
#>    v_call      imd_repertoire_id     n Therapy
#>  * <chr>                   <int> <int> <chr>  
#>  1 TRBV20-1*01                 1   112 ICI    
#>  2 TRBV29-1*01                 1   105 ICI    
#>  3 TRBV28*01                   1    91 ICI    
#>  4 TRBV18*01                   1    72 ICI    
#>  5 TRBV7-9*01                  1    53 ICI    
#>  6 TRBV7-2*01                  1    43 ICI    
#>  7 TRBV12-3*01                 1    36 ICI    
#>  8 TRBV7-8*01                  1    35 ICI    
#>  9 TRBV6-5*01                  1    28 ICI    
#> 10 TRBV2*01                    1    27 ICI    
#> # ℹ 82 more rows

# V gene usage by summed cell/UMI counts (if a count column is present)
airr_stats_genes(immdata, gene_col = "v_call", level = "barcode")
#> # A tibble: 92 × 4
#>    v_call      imd_repertoire_id     n Therapy
#>  * <chr>                   <int> <dbl> <chr>  
#>  1 TRBV20-1*01                 1   123 ICI    
#>  2 TRBV29-1*01                 1   122 ICI    
#>  3 TRBV28*01                   1    97 ICI    
#>  4 TRBV18*01                   1    86 ICI    
#>  5 TRBV7-9*01                  1    60 ICI    
#>  6 TRBV7-2*01                  1    44 ICI    
#>  7 TRBV7-8*01                  1    36 ICI    
#>  8 TRBV12-3*01                 1    36 ICI    
#>  9 TRBV10-2*01                 1    32 ICI    
#> 10 TRBV6-5*01                  1    28 ICI    
#> # ℹ 82 more rows

# Split by locus (TRA/TRB/... if locus column exists)
airr_stats_genes(immdata, gene_col = "v_call", level = "receptor", by = "locus")
#> # A tibble: 92 × 4
#>    v_call      imd_repertoire_id     n Therapy
#>  * <chr>                   <int> <int> <chr>  
#>  1 TRBV20-1*01                 1   112 ICI    
#>  2 TRBV29-1*01                 1   105 ICI    
#>  3 TRBV28*01                   1    91 ICI    
#>  4 TRBV18*01                   1    72 ICI    
#>  5 TRBV7-9*01                  1    53 ICI    
#>  6 TRBV7-2*01                  1    43 ICI    
#>  7 TRBV12-3*01                 1    36 ICI    
#>  8 TRBV7-8*01                  1    35 ICI    
#>  9 TRBV6-5*01                  1    28 ICI    
#> 10 TRBV2*01                    1    27 ICI    
#> # ℹ 82 more rows
# }