vignettes/web_only/load_mixcr.Rmd
load_mixcr.Rmd
MiXCR is a universal software for fast and accurate extraction of T- and B- cell receptor repertoires from any type of sequencing data. It handles paired- and single-end reads, considers sequence quality, corrects PCR errors and identifies germline hypermutations. The software supports both partial- and full-length profiling and employs all available RNA or DNA information, including sequences upstream of V and downstream of J gene segments.
Some of its features include:
Find more info about MiXCR at here
Follow these instructions here to install MiXCR and get started with processing data. Important Note: Currently supports MiXCR version 3 and above.
MiXCR supports the following formats of sequencing data: fasta, fastq, fastq.gz, paired-end fastq and fastq.gz. In this tutorial we use the real IGH data from here.
You can choose to use the analyze amplicon
method to process in one go:
> mixcr analyze amplicon --species hs \
--starting-material dna \
--5-end v-primers \
--3-end j-primers \
--adapters adapters-present \
--receptor-type IGH \
input_R1.fastq input_R2.fastq analysis
or execute each step align
, assemble
, and exportClones
individually.
> mixcr align -s hs -OvParameters.geneFeatureToAlign=VTranscript \
--report analysis.report input_R1.fastq input_R2.fastq analysis.vdjca
: Mon Aug 25 15:22:39 MSK 2014
Analysis Datefile(s): input_r1.fastq,input_r2.fastq
Input : alignments.vdjca
Output file: align --report alignmentReport.log input_r1.fastq input_r2.fastq alignments.vdjca
Command line arguments: 323248
Total sequencing reads: 210360
Successfully aligned reads: 65.08%
Successfully aligned, percent: 4.26%
Alignment failed because of absence of V hits: 30.19%
Alignment failed because of absence of J hits: 0.48% Alignment failed because of low total score
After you run these commands, you will generate these files with detailed information about calculated clonotypes:
.<chains>.txt <-- This contains the count data we want!
├── analysis.clonotypes. Build clonotypes correct PCR and sequencing errors
├── analysis.clna <- Align raw sequences to reference sequences of segments (V, D, J) of IGH gene
├── analysis.vdjca <- Information on the run ├── analysis.report <-
Create a new folder that only contains the specified clonotype text files from your runs, and create a metadata.txt file in the following format.
The metadata file “metadata.txt” has to be tab delimited file with first column named “Sample” and any number of additional columns with arbitrary names. The first column should contain base names of files without extensions in your folder.
Sample | Sex | Age | Status |
---|---|---|---|
immunoseq_1 | M | 1 | C |
immunoseq_2 | M | 2 | C |
immunoseq_3 | F | 3 | A |
The next section will explain how to load a single file or multiple samples in a folder.
In your R environment, run the commands below. The output should be similar. You can run it on the entire folder or a single file. repLoad
will ignore the file formats that are unsupported.
# 1.1) Load the package into R:
> library(immunarch)
# 1.2) Replace with the path to your clonotypes file
> file_path = "path/to/your/mixcr/data/analysis.clonotypes.IGH.txt"
# 1.3) Load MiXCR data with repLoad
> immdata_mixcr <- repLoad(file_path)
== Step 1/3: loading repertoire files... ==
"<initial>" ...
Processing -- Parsing "/path/to/your/mixcr/data/analysis.clonotypes.IGH.txt" -- mixcr
== Step 2/3: checking metadata files and merging... ==
"<initial>" ...
Processing -- Metadata file not found; creating a dummy metadata...
== Step 3/3: splitting data by barcodes and chains... ==
! Done
Congrats! Now your data is ready for exploration. Follow the steps here to learn more about how to explore your dataset.
$> immdata_mixcr
r$data
$data$analysis.clonotypes.IGH
# A tibble: 33,812 x 15
Clones Proportion CDR3.nt CDR3.aa V.name D.name J.name V.end D.start D.end J.start VJ.ins VD.ins DJ.ins Sequence<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <chr>
1 230 0.00284 TGTGTGAGACATAAACC… CVRHKPMVQ… IGHV4-39 IGHD3-10… IGHJ6 12 NA 5 36 9 3 6 TGTGTGAGACATAAACC…
2 201 0.00248 TGTGCGATTTGGGATGT… CAIWDVGLR… IGHV4-34 IGHD2-21 IGHJ4… 7 NA 5 29 10 7 3 TGTGCGATTTGGGATGT…
3 179 0.00221 TGTGCGAGAGATCATGC… CARDHAGFG… IGHV1-69… IGHD3-10 IGHJ6 13 NA 4 40 18 5 13 TGTGCGAGAGATCATGC…
4 99 0.00122 TGTGCGAGATGGGGATA… CARWGYCIN… IGHV4-39 IGHD2-8 IGHJ6 9 NA 6 64 23 2 21 TGTGCGAGATGGGGATA…
5 97 0.00120 TGTGCGAGAGGCCCCAC… CARGPTSSE… IGHV4-34 IGHD3-22… IGHJ6 13 NA 6 52 26 24 2 TGTGCGAGAGGCCCCAC…
6 97 0.00120 TGTGCGCACCACTATAC… CAHHYTSDY… IGHV2-5 IGHD1-26 IGHJ5 9 NA 2 39 19 NA 20 TGTGCGCACCACTATAC…
7 92 0.00114 TGTGCGAGAGGCCCTCC… CARGPPSMG… IGHV4-34 IGHD5-24… IGHJ4 13 NA 3 38 11 6 5 TGTGCGAGAGGCCCTCC…
8 84 0.00104 TGTGCGAGGTGGCTTGG… CARWLGEDI… IGHV4-39 IGHD3-16… IGHJ4… 8 NA 6 32 13 4 9 TGTGCGAGGTGGCTTGG…
9 83 0.00103 TGTGCGAGAGGCCGCAG… CARGRSGDP… IGHV4-34 IGHD2-2,… IGHJ5 13 NA 4 50 18 13 5 TGTGCGAGAGGCCGCAG…
10 81 0.00100 TGTGTGAGTCACCTCCT… CVSHLLDTS… IGHV1-2 IGHD2-21… IGHJ4… 8 NA 3 40 20 14 6 TGTGTGAGTCACCTCCT…
# … with 33,802 more rows
$meta
# A tibble: 1 x 1
Sample<chr>
1 analysis.clonotypes.IGH
In this tutorial we use three identical samples just to demonstrate the output, but you should put all your output .txt
clonotype files in this folder, along with your metadata.txt
file.
# 1.1) Load the package into R:
> library(immunarch)
# 1.2) Replace with the path to the folder with your processed MiXCR data.
> file_path = "/path/to/your/mixcr/data/"
# 1.3) Load MiXCR data with repLoad
> immdata_mixcr <- repLoad(file_path)
== Step 1/3: loading repertoire files... ==
"/path/to/your/mixcr/data/" ...
Processing -- Parsing "/path/to/your/mixcr/data/analysis.clonotypes.IGH_1.txt" -- mixcr
-- Parsing "/path/to/your/mixcr/data/analysis.clonotypes.IGH_2.txt" -- mixcr
-- Parsing "/path/to/your/mixcr/data/analysis.clonotypes.IGH_3.txt" -- mixcr
-- Parsing "/path/to/your/mixcr/data/metadata.txt" -- metadata
== Step 2/3: checking metadata files and merging files... ==
"/path/to/your/mixcr/data/" ...
Processing -- Everything is OK!
== Step 3/3: processing paired chain data... ==
! Done
Now let’s take a look at the data! Your output should look something like that.
$> immdata_mixcr
r$data
$data$analysis.clonotypes.IGH_1
# A tibble: 32,744 x 15
Clones Proportion CDR3.nt CDR3.aa V.name D.name J.name V.end D.start D.end J.start VJ.ins VD.ins DJ.ins Sequence<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <chr>
1 230 0.00284 TGTGTGAGACATAAACCTATGG… CVRHKPMVQG… IGHV4-39 IGHD3-10, … IGHJ6 12 NA 5 36 9 3 6 TGTGTGAGACATAAACCTATG…
2 201 0.00248 TGTGCGATTTGGGATGTGGGAC… CAIWDVGLRH… IGHV4-34 IGHD2-21 IGHJ4,… 7 NA 5 29 10 7 3 TGTGCGATTTGGGATGTGGGA…
3 179 0.00221 TGTGCGAGAGATCATGCGGGGT… CARDHAGFGK… IGHV1-69… IGHD3-10 IGHJ6 13 NA 4 40 18 5 13 TGTGCGAGAGATCATGCGGGG…
4 99 0.00122 TGTGCGAGATGGGGATATTGTA… CARWGYCING… IGHV4-39 IGHD2-8 IGHJ6 9 NA 6 64 23 2 21 TGTGCGAGATGGGGATATTGT…
5 97 0.00120 TGTGCGAGAGGCCCCACGAGCA… CARGPTSSEW… IGHV4-34 IGHD3-22, … IGHJ6 13 NA 6 52 26 24 2 TGTGCGAGAGGCCCCACGAGC…
6 97 0.00120 TGTGCGCACCACTATACCAGCG… CAHHYTSDYY… IGHV2-5 IGHD1-26 IGHJ5 9 NA 2 39 19 NA 20 TGTGCGCACCACTATACCAGC…
7 92 0.00114 TGTGCGAGAGGCCCTCCGTCGA… CARGPPSMGT… IGHV4-34 IGHD5-24, … IGHJ4 13 NA 3 38 11 6 5 TGTGCGAGAGGCCCTCCGTCG…
8 84 0.00104 TGTGCGAGGTGGCTTGGGGAAG… CARWLGEDIR… IGHV4-39 IGHD3-16, … IGHJ4,… 8 NA 6 32 13 4 9 TGTGCGAGGTGGCTTGGGGAA…
9 83 0.00103 TGTGCGAGAGGCCGCAGCGGCG… CARGRSGDPY… IGHV4-34 IGHD2-2, I… IGHJ5 13 NA 4 50 18 13 5 TGTGCGAGAGGCCGCAGCGGC…
10 81 0.00100 TGTGTGAGTCACCTCCTCGACA… CVSHLLDTSD… IGHV1-2 IGHD2-21, … IGHJ4,… 8 NA 3 40 20 14 6 TGTGTGAGTCACCTCCTCGAC…
# … with 32,734 more rows
$data$analysis.clonotypes.IGH_2
# A tibble: 32,744 x 15
Clones Proportion CDR3.nt CDR3.aa V.name D.name J.name V.end D.start D.end J.start VJ.ins VD.ins DJ.ins Sequence<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <chr>
1 230 0.00284 TGTGTGAGACATAAACCTATGG… CVRHKPMVQG… IGHV4-39 IGHD3-10, … IGHJ6 12 NA 5 36 9 3 6 TGTGTGAGACATAAACCTATG…
2 201 0.00248 TGTGCGATTTGGGATGTGGGAC… CAIWDVGLRH… IGHV4-34 IGHD2-21 IGHJ4,… 7 NA 5 29 10 7 3 TGTGCGATTTGGGATGTGGGA…
3 179 0.00221 TGTGCGAGAGATCATGCGGGGT… CARDHAGFGK… IGHV1-69… IGHD3-10 IGHJ6 13 NA 4 40 18 5 13 TGTGCGAGAGATCATGCGGGG…
4 99 0.00122 TGTGCGAGATGGGGATATTGTA… CARWGYCING… IGHV4-39 IGHD2-8 IGHJ6 9 NA 6 64 23 2 21 TGTGCGAGATGGGGATATTGT…
5 97 0.00120 TGTGCGAGAGGCCCCACGAGCA… CARGPTSSEW… IGHV4-34 IGHD3-22, … IGHJ6 13 NA 6 52 26 24 2 TGTGCGAGAGGCCCCACGAGC…
6 97 0.00120 TGTGCGCACCACTATACCAGCG… CAHHYTSDYY… IGHV2-5 IGHD1-26 IGHJ5 9 NA 2 39 19 NA 20 TGTGCGCACCACTATACCAGC…
7 92 0.00114 TGTGCGAGAGGCCCTCCGTCGA… CARGPPSMGT… IGHV4-34 IGHD5-24, … IGHJ4 13 NA 3 38 11 6 5 TGTGCGAGAGGCCCTCCGTCG…
8 84 0.00104 TGTGCGAGGTGGCTTGGGGAAG… CARWLGEDIR… IGHV4-39 IGHD3-16, … IGHJ4,… 8 NA 6 32 13 4 9 TGTGCGAGGTGGCTTGGGGAA…
9 83 0.00103 TGTGCGAGAGGCCGCAGCGGCG… CARGRSGDPY… IGHV4-34 IGHD2-2, I… IGHJ5 13 NA 4 50 18 13 5 TGTGCGAGAGGCCGCAGCGGC…
10 81 0.00100 TGTGTGAGTCACCTCCTCGACA… CVSHLLDTSD… IGHV1-2 IGHD2-21, … IGHJ4,… 8 NA 3 40 20 14 6 TGTGTGAGTCACCTCCTCGAC…
# … with 32,734 more rows
$data$analysis.clonotypes.IGH_3
# A tibble: 32,744 x 15
Clones Proportion CDR3.nt CDR3.aa V.name D.name J.name V.end D.start D.end J.start VJ.ins VD.ins DJ.ins Sequence<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <chr>
1 230 0.00284 TGTGTGAGACATAAACCTATGG… CVRHKPMVQG… IGHV4-39 IGHD3-10, … IGHJ6 12 NA 5 36 9 3 6 TGTGTGAGACATAAACCTATG…
2 201 0.00248 TGTGCGATTTGGGATGTGGGAC… CAIWDVGLRH… IGHV4-34 IGHD2-21 IGHJ4,… 7 NA 5 29 10 7 3 TGTGCGATTTGGGATGTGGGA…
3 179 0.00221 TGTGCGAGAGATCATGCGGGGT… CARDHAGFGK… IGHV1-69… IGHD3-10 IGHJ6 13 NA 4 40 18 5 13 TGTGCGAGAGATCATGCGGGG…
4 99 0.00122 TGTGCGAGATGGGGATATTGTA… CARWGYCING… IGHV4-39 IGHD2-8 IGHJ6 9 NA 6 64 23 2 21 TGTGCGAGATGGGGATATTGT…
5 97 0.00120 TGTGCGAGAGGCCCCACGAGCA… CARGPTSSEW… IGHV4-34 IGHD3-22, … IGHJ6 13 NA 6 52 26 24 2 TGTGCGAGAGGCCCCACGAGC…
6 97 0.00120 TGTGCGCACCACTATACCAGCG… CAHHYTSDYY… IGHV2-5 IGHD1-26 IGHJ5 9 NA 2 39 19 NA 20 TGTGCGCACCACTATACCAGC…
7 92 0.00114 TGTGCGAGAGGCCCTCCGTCGA… CARGPPSMGT… IGHV4-34 IGHD5-24, … IGHJ4 13 NA 3 38 11 6 5 TGTGCGAGAGGCCCTCCGTCG…
8 84 0.00104 TGTGCGAGGTGGCTTGGGGAAG… CARWLGEDIR… IGHV4-39 IGHD3-16, … IGHJ4,… 8 NA 6 32 13 4 9 TGTGCGAGGTGGCTTGGGGAA…
9 83 0.00103 TGTGCGAGAGGCCGCAGCGGCG… CARGRSGDPY… IGHV4-34 IGHD2-2, I… IGHJ5 13 NA 4 50 18 13 5 TGTGCGAGAGGCCGCAGCGGC…
10 81 0.00100 TGTGTGAGTCACCTCCTCGACA… CVSHLLDTSD… IGHV1-2 IGHD2-21, … IGHJ4,… 8 NA 3 40 20 14 6 TGTGTGAGTCACCTCCTCGAC…
# … with 32,734 more rows
$meta
# A tibble: 3 x 4
Sample Sex Age Status<chr> <chr> <dbl> <chr>
1 analysis.clonotypes.IGH_1 M 1 C
2 analysis.clonotypes.IGH_2 M 2 C
3 analysis.clonotypes.IGH_3 F 3 A
Congrats! Now your data is ready for exploration. Follow the steps here to learn more about how to explore your dataset.