BCR
ImmunoMind – improving design of T-cell therapies using multi-omics and AI. Research and biopharma partnerships, more details: immunomind.io
support@immunomind.io
Source:vignettes/web_only_v0/BCRpipeline.Rmd
BCRpipeline.Rmd
Overview
A B-cell receptor (BCR) consists of an immunoglobulin molecule and a СD79 molecule. BCR includes two identical heavy chains (generated by recombination of V, D, and J segments), and two identical light chains (generated by recombination of V and J segments). Rapid improvements in high-throughput sequencing provide an opportunity to study B-cell immunoglobulin receptors on the cell surface [1] via the process called BCR repertoire sequencing. This pipeline describes how to work with BCR repertoire data after preprocessing raw sequenced data with MiXCR (https://docs.milaboratories.com/mixcr/about/) or any other programs.
The pipeline involves five steps:
-
Data loading.
The step includes loading the data and checking whether there is enough information for the pipeline to work.
-
Reconstructing clonal lineages.
At this step, all BCR sequences are divided into clonal lineages — sets of sequences that presumably share a common ancestor.
-
Building germline.
At this step, the algorithm generates a sequence that represents the ancestral sequence for each BCR in a clonal lineage.
-
Aligning sequences within clonal lineages.
This step involves preparation for phylogenetic and somatic hypermutation analysis.
-
Phylogenetic analysis.
This step provides phylogeny reconstruction and trunk length calculation (by running the PHYLIP package).
-
Somatic hypermutation analysis.
At this step, we compare clonotype and germline sequences to detect and count the number of mutations in clonotype sequence.
Data loading
BCR data are loaded with repLoad
functions, like any
other data.
Check that BCR data has the following:
- information about full sequence read of each BCR,
- positions of CDR3 region start and end,
- number of mutations per CDR3 region. In MiXCR output files, this
information is located in the column ‘
refPoints
’
Look at the sample BCR data implemented in Immunarch:
Reconstructing clonal lineages
B-cell clonal lineage represents a set of B cells that presumably have a common origin (arising from the same VDJ rearrangement event) and a common ancestor. BCR sequences are clustered according to their similarity, and sets of sequences in one cluster are named clonal lineages. Clustering involves two steps.
- The first step is calculating the distance between sequences.
- The second step is clustering the sequences using the information about the distance between them. Check our tutorial for more information about clustering: https://immunarch.com/articles/web_only/clustering.html
An example of reconstructing clonal lineages using default Immunarch options:
#calulate distance matrix
distBCR <- seqDist(bcrdata$data %>% top(500))
#find clusters
bcrdata$data <- seqCluster(bcrdata$data %>% top(500), distBCR, .perc_similarity = 0.9)
Building germline
Each clonal lineage has its own germline sequence that represents the ancestral sequence for each BCR in the clonal lineage. In other words, germline sequence is a sequence of B-cells immediately after VDJ recombination, before B-cell maturation and hypermutation process. Germline sequence is useful for assessing the degree of mutation and maturity of the repertoire.
In Immunarch, repGermline()
function generates germline
for each sequence:
#generate germline
bcrdata$data %>%
repGermline(.threads = 1)
A germline is represented via sequences of V gene - N…N (CDR3 length) - J gene:
#germline example
bcrdata$data %>%
top(1) %>%
repGermline(.threads = 1) %>% .$full_clones %>% .$Germline.sequence
Оptions:
.data
- The data to be processed. Can be
data.frame
, data.table
, or a list of these
objects.
.species
- Specifies species from which reference V and
J are taken. Available species: “HomoSapiens” (default), “MusMusculus”,
“BosTaurus”, “CamelusDromedarius”, “CanisLupusFamiliaris”, “DanioRerio”,
“MacacaMulatta”, “MusMusculusDomesticus”, “MusMusculusCastaneus”,
“MusMusculusMolossinus”, “MusMusculusMusculus”, “MusSpretus”,
“OncorhynchusMykiss”, “OrnithorhynchusAnatinus”, “OryctolagusCuniculus”,
“RattusNorvegicus”, “SusScrofa”.
.min_nuc_outside_cdr3
- This parameter sets how many
nucleotides should have V or J chain outside of CDR3 to be considered
good for further alignment. Reads with too short chains are filtered
out.
.threads
- The number of threads to use.
Aligning sequences within a clonal lineage
After building clonal lineage and germline, we can start analyzing the degree of mutation and maturity of each clonal lineage. This allows us to find cells with a large number of mutated clones. The phylogenetic analysis will find mutations that influence the affinity of BCRs. Aligning the sequence is the first step toward sequence phylogenetic analysis.
In the Immunarch package, the function repAlignLineage
aligns sequences within clonal lineages. This function requires
Clustal W
app to be installed. The app could be downloaded
here: http://www.clustal.org/download/current/,
or installed via your system package manager (such as apt or dnf).
- For Ubuntu, check this guide: https://www.howtoinstall.me/ubuntu/18-04/clustalw/
- For Windows, download and run
clustalw-x.x-win.msi
repAlignLineage
usage example:
data(bcrdata)
bcr_data <- bcrdata$data %>% top(500)
bcr_data %>%
seqCluster(seqDist(bcr_data), .fixed_threshold = 3) %>%
repGermline(.threads = 1) %>%
repAlignLineage(.min_lineage_sequences = 2)
The function has several parameters:
-
.min_lineage_sequences
— Filters clusters (clonal lineages) with the number of clonotypes lower than the threshold. Aligning clonal lineages with few sequences is of little use.
# take clusters that contain at least 1 sequence
bcr_data <- bcrdata$data
align_dt <- bcr_data %>%
seqCluster(seqDist(bcr_data, .col = 'CDR3.nt', .group_by_seqLength = TRUE),
.perc_similarity = 0.6) %>%
repGermline(.threads = 1) %>%
repAlignLineage(.min_lineage_sequences = 6, .align_threads = 2, .nofail = TRUE)
-
.prepare_threads
— the number of threads to prepare results table. A high number can cause memory overload! -
.align_threads
— the number of threads for lineage alignment.
Requirements for the input table for
repAlignLineage()
- must have columns in the immunarch compatible format
immunarch_data_format
, - must contain the ‘Cluster’ column generated by seqCluster() function,
- must contain the ‘Sequence.germline’ column generated by repGermline() function.
Align sequences in a cluster can be visualized using standard functions:
# A name of the first cluster
align_dt$full_clones$Cluster[[1]]
# Alignment of sequences from the first cluster
image(align_dt$full_clones$Alignment[[1]], grid = TRUE)
Phylogenetic analysis
For BCR phylogenetic analysis, we use the maximum parsimony method. The maximum parsimony method reconstructs a phylogenetic tree for the lineage by minimizing the total number of mutation events [2].
This method also enables the reconstruction of intermediate sequences that may have existed between BCR and germline sequence. These simulated sequences contain mutations that were ancestral to the clonotype group selected with the maximum parsimony analysis.
In Immunarch, function repClonalFamily
builds a
phylogeny of B-cells, generates a common ancestor, and calculates a
trunk length. The mean trunk length represents the distance between the
most recent common ancestor and germline sequence [3]. Mean trunk length
serves as a measure of the maturity of a lineage. The trunk length of
the lineage tree approximates the maturation state of the initiating B
cell for each clone [4].
This function requires PHYLIP
app to be installed. The
app can be downloaded here: https://evolution.genetics.washington.edu/phylip/getme-new1.html,
or could be installed with your system package manager (such as apt or
dnf).
- For Ubuntu, follow the installation guide (https://zoomadmin.com/HowToInstall/UbuntuPackage/phylip ):
- For Windows:
- Download a Zip archive (https://evolution.genetics.washington.edu/phylip/install.html).
- Extract the archive into a folder.
- Add the folder to the PATH (https://www.architectryan.com/2018/03/17/add-to-the-path-on-windows-10/)
repClonalFamily usage example:
bcr <- align_dt %>%
repClonalFamily(.threads = 2, .nofail = TRUE)
#plot visualization of the first tree
vis(bcr[["full_clones"]][["TreeStats"]][[1]])
For each cluster tree is represented as table (The default number of clones for CommonAncestor, Germline, Presumable is 1):
#example for the first tree
bcr[["full_clones"]][["TreeStats"]][[1]]
You can recolor leaves. For example, we recolor leaves where number of AA mutations is not 0:
#take sequence where number of AA mutations is not 0
f <- bcr[["full_clones"]][["TreeStats"]][[1]]
#rename these leaves
f[f$DistanceAA != 0, ]['Type'] = 'mutationAA'
#new tree
vis(f)
Another way to recolor leaves is to use .vis_groups
parameter for repClonalFamily. It allows to assign group names for
specific clone IDs, or lists of clone IDs:
#get all clone IDs from align_dt
clone_ids <- unnest(align_dt[["full_clones"]], "Sequences")[["Clone.ID"]]
#run repClonalFamily with assigning some of these clones to differently named and colored groups
bcr_with_groups <- align_dt %>%
repClonalFamily(.vis_groups = list(
Group1 = clone_ids[1],
Group2 = clone_ids[3],
Group3 = list(clone_ids[5], clone_ids[2]),
Group4 = c(clone_ids[7], clone_ids[4])
), .threads = 2, .nofail = TRUE
)
#display the first tree from repClonalFamily results
vis(bcr_with_groups[["full_clones"]][["TreeStats"]][[1]])
We have found 4 clusters:
bcr$full_clones$Cluster %>% unique()
We have found mismatches between a germline and an ancestor sequence. Dots represent nucleotides matches between the sequences, letters represent mismatches between the sequences:
# the example of common ancestor sequence
bcr$full_clones$Common.Ancestor[1]
# the example of mismatches between a germline and an ancestor sequence
bcr$full_clones$Germline.Output[1]
We have calculated a trunk length for each cluster:
bcr$full_clones[ , c('Cluster', 'Trunk.Length') ]
Also trunk length specified in “TreeStats” table in column “DistanceNT”.
#example fot first tree
bcr[["full_clones"]][["TreeStats"]][[1]][1, ]
Somatic hypermutation analysis
The rate of somatic hypermutation allows us to estimate repertoire maturation and detect the type of mutation contributing to the emergence of high affinity antibodies. This makes somatic hypermutation analysis a valuable asset for B-cell repertoire analysis.
In Immunarch, repSomaticHypermutation()
function is
designed for hypermutation analysis:
bcr_data <- bcrdata$data
shm_data <- bcr %>% repSomaticHypermutation(.threads = 2, .nofail = TRUE)
The function repSomaticHypermutation() takes V and J germline sequences and V and J clonotype sequences as an input. Then the function aligns germline and clonotype sequences to detect and calculate occurring mutations.
Examples of germline and clonotype sequences:
full_clones <- shm_data$full_clones
v_length <- nchar(paste(full_clones[1, "FR1.nt"], full_clones[1, "CDR1.nt"],
full_clones[1, "FR2.nt"], full_clones[1, "CDR2.nt"], full_clones[1, "FR3.nt"], collapse=""))
j_length <- nchar(full_clones[1, "FR4.nt"])
seq_length <- nchar(full_clones[1, "Sequence"])
# the example of germline V sequence
full_clones$Germline.Input[1] %>% substr(1, v_length)
# the example of germline J sequence
full_clones$Germline.Input[1] %>% substr(seq_length - j_length, seq_length)
# the example of clonotype V sequence
full_clones$Sequence[1] %>% substr(1, v_length)
# the example of clonotype J sequence
full_clones$Sequence[1] %>% substr(seq_length - j_length, seq_length)
Example: aligning germline and clonotype V sequences:
image(shm_data$full_clones$Germline.Alignment.V[[3]], grid = TRUE)
Example: aligning germline and clonotype J sequences:
image(shm_data$full_clones$Germline.Alignment.J[[3]], grid = TRUE)
The number of mutations for each clonotype sequence:
cols <- c('Clone.ID', 'Substitutions', 'Insertions', 'Deletions', 'Mutations')
shm_data$full_clones[ , cols ]
Then you could easily estimate the mutation rate: