A B-cell receptor (BCR) consists of an immunoglobulin molecule and a СD79 molecule. BCR includes two identical heavy chains (generated by recombination of V, D, and J segments), and two identical light chains (generated by recombination of V and J segments). Rapid improvements in high-throughput sequencing provide an opportunity to study B-cell immunoglobulin receptors on the cell surface [1] via the process called BCR repertoire sequencing. This pipeline describes how to work with BCR repertoire data after preprocessing raw sequenced data with MiXCR (https://mixcr.readthedocs.io/en/master/) or other analogical programs.

The pipeline involves five steps:

  1. Data loading.

    The step includes loading the data and checking whether there is enough information for the pipeline to work.

  2. Reconstructing clonal lineages.

    At this step, all BCR sequences are divided into clonal lineages — sets of sequences that presumably share a common ancestor.

  3. Building germline.

    At this step, the algorithm generates a sequence that represents the ancestral sequence for each BCR in a clonal lineage.

  4. Aligning sequences within clonal lineages.

    This step involves preparation for phylogenetic and somatic hypermutation analysis.

  5. Phylogenetic analysis.

    This final step provides phylogeny reconstruction and trunk length calculation (by running the PHYLIP package).

Data loading

BCR data are loaded with repLoad functions, like any other data.

Check that BCR data has the following:

  • information about full sequence read of each BCR,
  • positions of CDR3 region start and end,
  • number of mutations per CDR3 region. In MiXCR output files, this information is located in the column ‘refPoints

Look at the sample BCR data implemented in Immunarch:

#load the package into the R environment:


Reconstructing clonal lineages

B-cell clonal lineage represents a set of B cells that presumably have a common origin (arising from the same VDJ rearrangement event) and a common ancestor. BCR sequences are clustered according to their similarity, and sets of sequences in one cluster are named clonal lineages. Clustering involves two steps.

An example of reconstructing clonal lineages using default Immunarch options:

#calulate distance matrix
distBCR <- seqDist(bcrdata$data  %>% top(500))

#find clusters
bcrdata$data <- seqCluster(bcrdata$data  %>% top(500), distBCR, .perc_similarity = 0.9)

Building germline

Each clonal lineage has its own germline sequence that represents the ancestral sequence for each BCR in the clonal lineage. In other words, germline sequence is a sequence of B-cells immediately after VDJ recombination, before B-cell maturation and hypermutation process. Germline sequence is useful for assessing the degree of mutation and maturity of the repertoire.

In Immunarch, repGermline() function generates germline for each sequence:

#generate germline
bcrdata$data %>%
    repGermline(.threads = 1)

A germline is represented via sequences of V gene - N…N (CDR3 length) - J gene:

#germline example
bcrdata$data %>%
     top(1) %>%
     repGermline(.threads = 1) %>% .$full_clones %>% .$Germline.sequence


.data - The data to be processed. Can be data.frame, data.table, or a list of these objects.

.species - Specifies species from which reference V and J are taken. Available species: “HomoSapiens” (default), “MusMusculus”, “BosTaurus”, “CamelusDromedarius”, “CanisLupusFamiliaris”, “DanioRerio”, “MacacaMulatta”, “MusMusculusDomesticus”, “MusMusculusCastaneus”, “MusMusculusMolossinus”, “MusMusculusMusculus”, “MusSpretus”, “OncorhynchusMykiss”, “OrnithorhynchusAnatinus”, “OryctolagusCuniculus”, “RattusNorvegicus”, “SusScrofa”.

.min_nuc_outside_cdr3 — filters out reads with low numbers of nucleotides from V or J gene outside of CDR3 region

.align_j_gene - if the germline sequence does not assemble correctly in the region of the J gene, then set this parameter to True. This will slow down the algorithm, but the assembly of the germline sequence will be more accurate.


  • if TRUE, only the first sequence from reference for each allele name will be used in the analysis;
  • if FALSE, all allele variants will be analyzed, and the output table will increase in size as a result.

If .ref_only_first = FALSE, you will get several different germlines:

bcrdata$data %>%
    top(1) %>%
    repGermline(.ref_only_first = F, .threads = 1) %>% .$full_clones %>% .$Germline.sequence

Aligning sequences within a clonal lineage

After building clonal lineage and germline, we can start analyzing the degree of mutation and maturity of each clonal lineage. This allows us to find cells with a large number of mutated clones. The phylogenetic analysis will find mutations that influence the affinity of BCRs. Aligning the sequence is the first step toward sequence phylogenetic analysis.

In the Immunarch package, the function repAlignLineage aligns sequences within clonal lineages. This function requires Clustal W app to be installed. The app could be downloaded here: http://www.clustal.org/download/current/, or installed via your system package manager (such as apt or dnf).

sudo apt update
sudo apt install clustalw
  • For Windows, download and run clustalw-x.x-win.msi

repAlignLineage usage example:

bcr_data <- bcrdata$data %>% top(500)
bcr_data %>%
  seqCluster(seqDist(bcr_data), .fixed_threshold = 3) %>%
  repGermline(.threads = 1) %>%
  repAlignLineage(.min_lineage_sequences = 2)

The function has several parameters:

  • .min_lineage_sequences — Filters clusters (clonal lineages) with the number of clonotypes lower than the threshold. Aligning clonal lineages with few sequences is of little use.
# take clusters that contain at least 1 sequence
align_dt <- bcr_data %>%
  seqCluster(seqDist(bcr_data), .fixed_threshold = 3) %>%
  repGermline(.threads = 1) %>%
  repAlignLineage(.min_lineage_sequences = 2, .align_threads = 2)
  • .prepare_threads — the number of threads to prepare results table. A high number can cause memory overload!
  • .align_threads — the number of threads for lineage alignment.

Requirements for the input table for repAlignLineage()

  • must have columns in the immunarch compatible format immunarch_data_format,
  • must contain the ‘Cluster’ column generated by seqCluster() function,
  • must contain the ‘Sequence.germline’ column generated by repGermline() function.

Align sequences in a cluster can be visualized using standard functions:

# A name of the first cluster
## [1] "IGHV4-61/IGHJ4_length_45_cluster_1"
# Alignment of sequences from the first cluster
image(align_dt$full_clones$Alignment[[1]], grid = T)

Phylogenetic analysis

For BCR phylogenetic analysis, we use the maximum parsimony method. The maximum parsimony method reconstructs a phylogenetic tree for the lineage by minimizing the total number of mutation events [2].

This method also enables the reconstruction of intermediate sequences that may have existed between BCR and germline sequence. These simulated sequences contain mutations that were ancestral to the clonotype group selected with the maximum parsimony analysis.

In Immunarch, function repClonalFamily builds a phylogeny of B-cells, generates a common ancestor, and calculates a trunk length. The mean trunk length represents the distance between the most recent common ancestor and germline sequence [3]. Mean trunk length serves as a measure of the maturity of a lineage. The trunk length of the lineage tree approximates the maturation state of the initiating B cell for each clone [4].

This function requires PHYLIP app to be installed. The app can be downloaded here: https://evolution.genetics.washington.edu/phylip/getme-new1.html, or could be installed with your system package manager (such as apt or dnf).

sudo apt-get update -y
sudo apt-get install -y phylip

repClonalFamily usage example:

bcr <- bcr_data %>%
  seqCluster(seqDist(bcr_data), .fixed_threshold = 3) %>%
  repGermline(.threads = 1) %>%
  repAlignLineage(.min_lineage_sequences = 2) %>%
  repClonalFamily(.threads = 2)

#plot visualization of the first tree

We have found 4 clusters:

bcr$full_clones$Cluster %>% unique()
## [1] "IGHV4-61/IGHJ4_length_45_cluster_1" "IGHV4-61/IGHJ4_length_63_cluster_1"
## [3] "IGHV5-51/IGHJ4_length_60_cluster_1" "IGHV5-51/IGHJ5_length_45_cluster_1"

We have found mismatches between a germline and an ancestor sequence. Dots represent nucleotides matches between the sequences, letters represent mismatches between the sequences:

## [1] "....................................T....................................................A....................................C...G................................A.NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN.....A........................."

We have calculated a trunk length for each cluster:

## [1] 51 70 66 51