Data loading

Input / output

The package provides several IO functions:

repLoad - loads the repertoires having compatible format.
repSave - saves changes and writes the repertoire data to a file in a specific format (immunarch, VDJtools).

repLoad detects the input file format automatically. immunarch currently support the following immune repertoire data formats:

"immunarch" - current software tool, in case you forgot :)
"immunoseq" - https://www.immunoseq.com
"mitcr" - https://github.com/milaboratory/mitcr
"mixcr" - https://github.com/milaboratory/mixcr
"migec" - http://migec.readthedocs.io/en/latest/
"migmap" - https://github.com/mikessh/migmap
"tcr" - https://imminfo.github.io/tcr/
"vdjtools" - https://vdjtools-doc.readthedocs.io/en/master/
"imgt" - https://www.imgt.org/HighV-QUEST/
"airr" - http://docs.airr-community.org/en/latest/datarep/overview.html
"10x" - https://support.10xgenomics.com/single-cell-vdj/software/pipelines/latest/output/annotation
"archer" - ArcherDX clonotype tables.
"imseq" - http://www.imtools.org/
"rtcr" - https://github.com/uubram/RTCR
"vidjil" - http://www.vidjil.org/

Please contact us if there are more file formats you want supported.

For parsing IgBLAST results process the data with MigMap first.

You can load the data from a single file, a list of repertoire file paths, or from a folder with repertoire files.

Working with your files

If you have your files, you should just specify a path to your file or to a folder with files. Then load data using repLoad:

#path argument is a path to the folder with your file or files including the metadata file.
immdata <- repLoad(path)

Working with sample files

You could find a folder with example files here (download and extract test_data.zip or test_data.tar.gz) and use it to test data loading without your own files.

If you are not familiar with the file paths, you can download our mock data to your working directory. You can obtain working directory with getwd() command

You could also download all files to the 'example' folder in your working directory and load all of them by passing folder name to repLoad function in quotation marks:

immdata <- repLoad('example')

The example data is already downloaded with immunarch package. You can load all sample files using the following command:

#path to the folder with example data
file_path <- paste0(system.file(package="immunarch"), "/extdata/io/")
immdata <- repLoad(file_path)

In other cases you may want to provide a metadata file and locate it in the folder. It is necessary to name it “metadata.txt”.

# For instance you have a following structure in your folder:
# >_ ls
# immunoseq1.txt
# immunoseq2.txt
# immunoseq3.txt
# metadata.txt

With the metadata repLoad will create a list in the environment with 2 elements, namely data and meta. All the data will be accessible simply from immdata$data.

Otherwise repLoad will create a dummy metadata file with only sample names.

# To load the whole folder with every file in it type:
file_path <- paste0(system.file(package="immunarch"), "/extdata/io/")
immdata <- repLoad(file_path)
print(names(immdata))

# In order to do that your folder must contain metadata file named
# "metadata.txt".

# In R, when you load your data:
# > immdata <- repLoad("path/to/your/folder/")
# > names(immdata)
# [1] "data" "meta"

# Suppose you do not have "metadata.txt":
# > immdata <- repLoad("path/to/your/folder/")
# > names(immdata)
# [1] "data" "meta"

Dummy metadata data frame looks like this:

as_tibble(data.frame(Sample = c("immunoseq1", "immunoseq2", "immunoseq3"), stringsAsFactors = F))

## # A tibble: 3 × 1
##   Sample    
##   <chr>     
## 1 immunoseq1
## 2 immunoseq2
## 3 immunoseq3

The metadata file “metadata.txt” has to be a table with first column named “Sample” and any number of additional columns with any names. The first column should contain the base names of files without extensions in your folder.

Sample	Sex	Age	Status
immunoseq_1	M	1	C
immunoseq_2	M	2	C
immunoseq_3	F	3	A

In order to import data from the external databases you have to connect to this database and then load the data. Make sure that the table format in your database matches the immunarch’s format.

To illustrate the use of external database, here is an example demonstrating data loading to the local MonetDB database:

# Your list of repertoires in immunarch's format
DATA
# Metadata data frame
META

# Create a temporary directory
dbdir <- tempdir()

# Create a DBI connection to MonetDB in the temporary directory.
con <- DBI::dbConnect(MonetDBLite::MonetDBLite(), embedded = dbdir)

# Write each repertoire to MonetDB. Each table has corresponding name from the DATA
for (i in seq_along(DATA)) {
  DBI::dbWriteTable(con, names(DATA)[i], DATA[[i]], overwrite=TRUE)
}

# Create a source in the temporary directory with MonetDB
ms <- MonetDBLite::src_monetdblite(dbdir = dbdir)
res_db <- list()

# Load the data from MonetDB to dplyr tables
for (i in seq_along(DATA)) {
  res_db[[names(DATA)[i]]] <- dplyr::tbl(ms, names(DATA)[i])
}

# Your data is ready to use
list(data = res_db, meta = META)

immunarch is compatible with the following sources:

R data frames (for most applications)
R data tables (for faster calculations, although they require a lot of RAM)
MonetDB-like databases that support both DBI and dplyr (an optimal choice, although you have to be familiar with dplyr)
Apache Spark (if you have experience with it)

Basic data manipulations with dplyr and immunarch

You can find the introduction to dplyr here: https://CRAN.R-project.org/package=dplyr/vignettes/dplyr.html

Get the most abundant clonotypes

The function returns the most abundant clonotypes for the given repertoire:

top(immdata$data[[1]])

## # A tibble: 10 × 15
##    Clones Proportion CDR3.nt    CDR3.aa V.name D.name J.name V.end D.start D.end
##     <dbl>      <dbl> <chr>      <chr>   <chr>  <chr>  <chr>  <int>   <int> <int>
##  1    173    0.0204  TGCGCCAGC… CASSQE… TRBV4… TRBD1  TRBJ2…    16      18    26
##  2    163    0.0192  TGCGCCAGC… CASSYR… TRBV4… TRBD1  TRBJ2…    11      13    18
##  3     66    0.00776 TGTGCCACC… CATSTN… TRBV15 TRBD1  TRBJ2…    11      16    22
##  4     54    0.00635 TGTGCCACC… CATSIG… TRBV15 TRBD2  TRBJ2…    11      19    25
##  5     48    0.00565 TGTGCCAGC… CASSPW… TRBV27 TRBD1  TRBJ1…    11      16    23
##  6     48    0.00565 TGCGCCAGC… CASQGD… TRBV4… TRBD1  TRBJ1…     8      13    19
##  7     40    0.00471 TGCGCCAGC… CASSQD… TRBV4… TRBD1  TRBJ2…    16      21    26
##  8     31    0.00365 TGTGCCAGC… CASSEE… TRBV2  TRBD1  TRBJ1…    15      17    20
##  9     30    0.00353 TGCGCCAGC… CASSQP… TRBV4… TRBD1  TRBJ2…    14      23    28
## 10     28    0.00329 TGTGCCAGC… CASSWV… TRBV6… TRBD1  TRBJ2…    12      20    25
## # ℹ 5 more variables: J.start <int>, VJ.ins <dbl>, VD.ins <dbl>, DJ.ins <dbl>,
## #   Sequence <lgl>

Filter functional / non-functional / in-frame / out-of-frame clonotypes

Conveniently, functions are vectorised over the list of data frames; and coding(immdata$data) in the example below returns a list of data frames with coding sequences:

coding(immdata$data[[1]])

The next one operates in a similar fashion:

noncoding(immdata$data[[1]])

Now, the computation of the number of filtered sequences is rather straightforward:

nrow(inframes(immdata$data[[1]]))

And for the out-of-frame clonotypes:

nrow(outofframes(immdata$data[[1]]))

Get subset of clonotypes with a specific V gene

It is simple to subset data frame according to labels in the specified index. In this example the resulting data frame contains only records with ‘TRBV10-1’ V gene:

filter(immdata$data[[1]], V.name == 'TRBV10-1')

## # A tibble: 24 × 15
##    Clones Proportion CDR3.nt    CDR3.aa V.name D.name J.name V.end D.start D.end
##     <dbl>      <dbl> <chr>      <chr>   <chr>  <chr>  <chr>  <int>   <int> <int>
##  1      2   0.000235 TGCGCCAGC… CASSES… TRBV1… TRBD2  TRBJ2…    16      20    25
##  2      2   0.000235 TGCGCCAGC… CASSDG… TRBV1… TRBD1  TRBJ2…    13      15    22
##  3      1   0.000118 TGCGCCAGC… CASSGD… TRBV1… TRBD2  TRBJ2…     8      10    15
##  4      1   0.000118 TGCGCCACC… CATLRS… TRBV1… TRBD1  TRBJ2…     6       7     9
##  5      1   0.000118 TGCGCCAGC… CASSES… TRBV1… TRBD2  TRBJ2…    16      20    22
##  6      1   0.000118 TGCGCCAGC… CASSES… TRBV1… TRBD2  TRBJ2…    16      17    21
##  7      1   0.000118 TGCGCCAGC… CASRAS… TRBV1… TRBD2  TRBJ2…    10      13    21
##  8      1   0.000118 TGCGCCAGC… CASRRD… TRBV1… TRBD1  TRBJ2…     8      13    19
##  9      1   0.000118 TGCGCCAGC… CASSEV… TRBV1… TRBD1  TRBJ2…    14      19    24
## 10      1   0.000118 TGCGCCAGC… CASSEG… TRBV1… TRBD2  TRBJ2…    13      19    27
## # ℹ 14 more rows
## # ℹ 5 more variables: J.start <int>, VJ.ins <dbl>, VD.ins <dbl>, DJ.ins <dbl>,
## #   Sequence <lgl>

Downsampling

ds <- repSample(immdata$data, "downsample", 100)
sapply(ds, nrow)

## A2-i129 A2-i131 A2-i133 A2-i132 A4-i191 A4-i192     MS1     MS2     MS3     MS4 
##      99      95      95      98      89      95      82     100      94      98 
##     MS5     MS6 
##      82     100

ds <- repSample(immdata$data, "sample", .n = 10)
sapply(ds, nrow)

## A2-i129 A2-i131 A2-i133 A2-i132 A4-i191 A4-i192     MS1     MS2     MS3     MS4 
##      10      10      10      10      10      10      10      10      10      10 
##     MS5     MS6 
##      10      10

Immunarch data format

immunarch comes with its own data format, including tab-delimited columns that can be specified as follows:

“Clones” - count or number of barcodes (events, UMIs) or reads;
“Proportion” - proportion of barcodes (events, UMIs) or reads;
“CDR3.nt” - CDR3 nucleotide sequence;
“CDR3.aa” - CDR3 amino acid sequence;
“V.name” - names of aligned Variable gene segments;
“D.name” - names of aligned Diversity gene segments or NA;
“J.name” - names of aligned Joining gene segments;
“V.end” - last positions of aligned V gene segments (1-based);
“D.start” - positions of D’5 end of aligned D gene segments (1-based);
“D.end” - positions of D’3 end of aligned D gene segments (1-based);
“J.start” - first positions of aligned J gene segments (1-based);
“VJ.ins” - number of inserted nucleotides (N-nucleotides) at V-J junction (-1 for receptors with VDJ recombination);
“VD.ins” - number of inserted nucleotides (N-nucleotides) at V-D junction (-1 for receptors with VJ recombination);
“DJ.ins” - number of inserted nucleotides (N-nucleotides) at D-J junction (-1 for receptors with VJ recombination);
“Sequence” - full nucleotide sequence.

ImmunoMind – improving design of T-cell therapies using multi-omics and AI. Research and biopharma partnerships, more details: immunomind.io

support@immunomind.io