A family of functions that extract core descriptive statistics from an ImmunData
object.
airr_stats_chains
— count V(D)J chains per repertoire
(optionally split by locus). Quickly gauges capture depth per repertoire
and, when split by locus, reveals TRA/TRB/IGH balance. Use it for QC,
library-size checks, and to spot locus-specific dropouts or
over-representation.
airr_stats_lengths
— count the number of sequence lengths
per repertoire. Summarizes the CDR3 length distribution, a sensitive QC
fingerprint of repertoire prep and selection. Helpful for detecting
primer/UMI biases, comparing cohorts, and deriving length-based features for
models.
airr_stats_genes
- count V(D)J gene segments per repertoire,
optionally split by locus and using either receptor counts or barcode/UMI
counts as the measure. Profiles V/D/J gene usage to characterize repertoire
composition and germline biases, with optional locus split. Useful for
cohort comparisons, flagging clonal expansions, and producing ML-ready
features for repertoire-level ML tasks.
Usage
airr_stats_chains(
idata,
locus_col = NA,
autojoin = getOption("immundata.autojoin", TRUE),
format = c("long", "wide")
)
airr_stats_lengths(
idata,
seq_col = "cdr3_aa",
autojoin = getOption("immundata.autojoin", TRUE),
format = c("long", "wide")
)
airr_stats_genes(
idata,
gene_col = "v_call",
level = c("receptor", "barcode"),
by = c(NA, "locus"),
autojoin = getOption("immundata.autojoin", TRUE),
format = c("long", "wide")
)
Arguments
- idata
An
ImmunData
object.- locus_col
Column in
idata$annotations
that stores the locus (e.g."locus"
). IfNULL
or missing, the result is not split by locus.- autojoin
Logical. If TRUE, join repertoire metadata by the schema repertoire id. Change the default behaviour by calling
options(immunarch.autojoin = FALSE)
.- format
String. One of
"long"
("long" tibble withimd_repertoire_id
, facet columns, andvalue
; useful for visualizations) or"wide"
(wide/unmelted table of features, with each row corresponding to a specific repertoire / pair of repertoires; useful for Machine Learning).- seq_col
Character vector with names of the columns containing sequences.
- gene_col
A single column name in
idata$annotations
with gene segment calls (e.g.,"v_call"
,"d_call"
,"j_call"
,"c_call"
). Default is"v_call"
.- level
One of
"receptor"
or"barcode"
. If"receptor"
(default), the function counts unique receptors (one per receptor ID) that carry a given gene segment. If"barcode"
, the function sums counts (e.g., cells/UMIs) per gene segment using the column defined byimmundata::imd_schema("count")
.- by
Either
NULL
(no split) or"locus"
. When"locus"
, the result is further split by the locus column if present (as given byimmundata::imd_schema("locus")
); otherwise a warning is emitted and the split is ignored.
Value
airr_stats_chains
Returns a tibble with columns:
repertoire_id
– repertoire identifierlocus
– TRA, TRB, IGH, ... (present only iflocus_col
is supplied)n_chains
– number of chains
airr_stats_lengths
Returns a tibble with columns:
repertoire_id
– repertoire identifierseq_len
– lengths of sequencesn
– number of receptors
airr_stats_genes
A tibble with columns:
repertoire_id
- repertoire identifier(optional)
locus
- TRA, TRB, IGH, ... (present only whenby = "locus"
and the locus column exists)<gene_col>
- the gene segment value (e.g., V gene)n
- the measure:if
level = "receptor"
: number of receptors carrying the gene segmentif
level = "barcode"
: sum of counts across receptors for the segment
Examples
# Limit the number of threads used by the underlying DB for this session.
# Change this only if you know what you're doing (e.g., multi-user machines, shared CI/servers).
db_exec("SET threads TO 2")
# Load data
# \dontrun{
immdata <- get_test_idata() |> agg_repertoires("Therapy")
#> Rows: 2 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (4): File, Therapy, Response, Prefix
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> ℹ Found 2/2 repertoire files from the metadata on the disk
#> ✔ Metadata parsed successfully
#>
#> ── Reading repertoire data
#> 1. /home/runner/work/_temp/Library/immundata/extdata/tsv/sample_0_1k.tsv
#> 2. /home/runner/work/_temp/Library/immundata/extdata/tsv/sample_1k_2k.tsv
#> ℹ Checking if all files are of the same type
#> ✔ All files have the same extension
#>
#> ── Renaming the columns and schemas
#> ✔ Renaming is finished
#>
#> ── Preprocessing the data
#> 1. exclude_columns
#> 2. filter_nonproductive
#> ✔ Preprocessing plan is ready
#>
#> ── Aggregating the data to receptors
#> ℹ No locus information found
#> ℹ Processing data as immune repertoire tables - no counts, no barcodes, no chain pairing possible
#> ✔ Execution plan for receptor data aggregation and annotation is ready
#>
#> ── Joining the metadata table with the dataset using 'filename' column
#> ✔ Joining plan is ready
#>
#> ── Postprocessing the data
#> 1. prefix_barcodes
#> ✔ Postprocessing plan is ready
#>
#> ── Saving the newly created ImmunData to disk
#> ℹ Writing the receptor annotation data to [/tmp/RtmpPHpsgz/file1eef2411382b/annotations.parquet]
#> ℹ Writing the metadata to [/tmp/RtmpPHpsgz/file1eef2411382b/metadata.json]
#> ✔ ImmunData files saved to [/tmp/RtmpPHpsgz/file1eef2411382b]
#> ℹ Reading ImmunData files from [/tmp/RtmpPHpsgz/file1eef2411382b]
#> ✔ Loaded ImmunData with the receptor schema: [c("cdr3_aa", "v_call") and list()]
#> ℹ Reading ImmunData files from [/tmp/RtmpPHpsgz/file1eef2411382b]
#>
#> ── Summary
#> ℹ Time elapsed: 2.34 secs
#> ✔ Loaded ImmunData with the receptor schema: [c("cdr3_aa", "v_call") and NULL]
#> ✔ Loaded ImmunData with [1902] chains
# }
#
# airr_stats_chains
#
# \dontrun{
airr_stats_chains(immdata)
#> # A tibble: 2 × 6
#> imd_repertoire_id n_barcodes n_receptors locus n_chains Therapy
#> * <int> <dbl> <int> <chr> <int> <chr>
#> 1 1 955 871 TCRB 955 ICI
#> 2 2 947 867 TCRB 947 CAR-T
# }
#
# airr_stats_lengths
#
# \dontrun{
airr_stats_lengths(immdata)
#> # A tibble: 28 × 6
#> imd_repertoire_id seq_len n prop pct Therapy
#> * <int> <dbl> <int> <dbl> <dbl> <chr>
#> 1 1 9 38 0.0398 3.98 ICI
#> 2 1 11 163 0.171 17.1 ICI
#> 3 1 14 119 0.125 12.5 ICI
#> 4 1 10 121 0.127 12.7 ICI
#> 5 1 15 40 0.0419 4.19 ICI
#> 6 1 16 30 0.0314 3.14 ICI
#> 7 1 17 7 0.00733 0.733 ICI
#> 8 1 5 1 0.00105 0.105 ICI
#> 9 2 11 181 0.191 19.1 CAR-T
#> 10 2 10 119 0.126 12.6 CAR-T
#> # ℹ 18 more rows
# }
#
# airr_stats_genes
#
# \dontrun{
# V gene usage by receptor count
airr_stats_genes(immdata, gene_col = "v_call", level = "receptor")
#> # A tibble: 92 × 4
#> v_call imd_repertoire_id n Therapy
#> * <chr> <int> <int> <chr>
#> 1 TRBV20-1*01 1 112 ICI
#> 2 TRBV29-1*01 1 105 ICI
#> 3 TRBV28*01 1 91 ICI
#> 4 TRBV18*01 1 72 ICI
#> 5 TRBV7-9*01 1 53 ICI
#> 6 TRBV7-2*01 1 43 ICI
#> 7 TRBV12-3*01 1 36 ICI
#> 8 TRBV7-8*01 1 35 ICI
#> 9 TRBV6-5*01 1 28 ICI
#> 10 TRBV2*01 1 27 ICI
#> # ℹ 82 more rows
# V gene usage by summed cell/UMI counts (if a count column is present)
airr_stats_genes(immdata, gene_col = "v_call", level = "barcode")
#> # A tibble: 92 × 4
#> v_call imd_repertoire_id n Therapy
#> * <chr> <int> <dbl> <chr>
#> 1 TRBV20-1*01 1 123 ICI
#> 2 TRBV29-1*01 1 122 ICI
#> 3 TRBV28*01 1 97 ICI
#> 4 TRBV18*01 1 86 ICI
#> 5 TRBV7-9*01 1 60 ICI
#> 6 TRBV7-2*01 1 44 ICI
#> 7 TRBV7-8*01 1 36 ICI
#> 8 TRBV12-3*01 1 36 ICI
#> 9 TRBV10-2*01 1 32 ICI
#> 10 TRBV6-5*01 1 28 ICI
#> # ℹ 82 more rows
# Split by locus (TRA/TRB/... if locus column exists)
airr_stats_genes(immdata, gene_col = "v_call", level = "receptor", by = "locus")
#> # A tibble: 92 × 4
#> v_call imd_repertoire_id n Therapy
#> * <chr> <int> <int> <chr>
#> 1 TRBV20-1*01 1 112 ICI
#> 2 TRBV29-1*01 1 105 ICI
#> 3 TRBV28*01 1 91 ICI
#> 4 TRBV18*01 1 72 ICI
#> 5 TRBV7-9*01 1 53 ICI
#> 6 TRBV7-2*01 1 43 ICI
#> 7 TRBV12-3*01 1 36 ICI
#> 8 TRBV7-8*01 1 35 ICI
#> 9 TRBV6-5*01 1 28 ICI
#> 10 TRBV2*01 1 27 ICI
#> # ℹ 82 more rows
# }