Although next-generation sequencing is widely used in cancer to profile tumors and detect variants, most somatic variant callers used in these pipelines identify variants at the lowest possible granularity, single-nucleotide variants (SNV). As a result, multiple adjacent SNVs are called individually instead of as a single multi-nucleotide variant (MNV). With this approach, the amino acid change predicted from an individual SNV within a codon can differ from the amino acid change predicted from the MNV that results from combining the SNVs, leading to incorrect conclusions about the downstream effects of the variants. Here, we analyzed 10,383 variant call files (VCF) from The Cancer Genome Atlas (TCGA) and found 12,141 incorrectly annotated MNVs. Analysis of seven commonly mutated genes from 178 studies in cBioPortal revealed that MNVs were consistently missed in 20 of these studies, whereas they were correctly annotated in 15 more recent studies. At the BRAF V600 locus, the most common example of an MNV, several public datasets reported separate BRAF V600E and BRAF V600M variants instead of a single merged V600K variant. VCFs from the TCGA Mutect2 caller were used to develop a solution that merges SNVs into MNVs. Our custom script used the phasing information in the SNV VCF to determine whether SNVs were in the same codon and needed to be merged into an MNV before variant annotation. This study shows that institutions performing next-generation sequencing for cancer genomics should incorporate MNV merging as a best-practice step in their pipelines.
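The BRAF V600 case can be worked through in a few lines. The sketch below is illustrative only and is not the script used in this study: it assumes the reference codon 600 is GTG on the coding strand, builds the standard genetic code table, and compares the protein change obtained from each SNV alone with the change obtained when both substitutions are applied to the same codon. The function names `apply_changes` and `protein_change` are hypothetical.

```python
# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))


def apply_changes(codon, changes):
    """Return the codon after substituting bases.

    `changes` maps a 0-based offset within the codon to the alternate base.
    """
    bases = list(codon)
    for offset, alt in changes.items():
        bases[offset] = alt
    return "".join(bases)


def protein_change(ref_codon, changes, residue):
    """Format the amino acid change, e.g. 'V600E', for a set of base changes."""
    alt_codon = apply_changes(ref_codon, changes)
    return f"{CODON_TABLE[ref_codon]}{residue}{CODON_TABLE[alt_codon]}"


# Assumed reference codon for BRAF residue 600 (coding strand).
REF = "GTG"

# Each SNV annotated on its own.
print(protein_change(REF, {1: "A"}, 600))          # GTG -> GAG: V600E
print(protein_change(REF, {0: "A"}, 600))          # GTG -> ATG: V600M

# Both SNVs on the same haplotype, annotated together as one MNV.
print(protein_change(REF, {0: "A", 1: "A"}, 600))  # GTG -> AAG: V600K
```

Neither of the two single-SNV annotations matches the lysine substitution produced when the substitutions occur together, which is why merging must precede annotation.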

Commonly used variant annotation tools, such as SnpEff (7), ANNOVAR (8), and Ensembl variant effect predictor (VEP; ref. 9), annotate variants individually without considering haplotype information or combining nearby in-phase variants into MNVs. Some tools, such as bcftools csq (haplotype-aware consequence caller; ref. 10), have tried to address this problem, but the software expects phased variant call files (VCF) as input, with phasing information in the genotype (GT) field in a specific and seldom-used format. MAC (multi-nucleotide variant annotation corrector; ref. 11) requires both the VCF and the corresponding Binary Alignment Map (BAM) file to correct the annotation of MNVs formed by adjacent SNVs. MACARON (Multi-bAse Codon Association variant ReannotatiON; ref. 12) is another tool that uses both the VCF and the BAM to re-annotate VCFs with corrected MNVs from multiple SNVs within a codon.

We downloaded the RefSeq transcripts BED file from the UCSC Table Browser and preprocessed it into a codon file that defines the genomic positions of each codon. The MNV-merge script then used this codon file to decide whether to merge SNVs, merging only those that are part of the same haplotype and the same codon.
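The merge rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the MNV-merge script itself: the `Codon` and `Snv` records, the `build_index` and `merge_snvs` functions, and the `phase_set` field are hypothetical names; each codon is assumed to occupy three contiguous genomic positions (codons split across exon junctions are ignored); and each phased SNV is assumed to carry an identifier shared by all variants on the same haplotype, derived from whatever phasing fields the caller writes to the VCF. Coordinates are toy values.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Codon:
    chrom: str
    start: int  # 1-based genomic position of the first base
    ref: str    # three reference bases in genomic (plus-strand) order


@dataclass(frozen=True)
class Snv:
    chrom: str
    pos: int                  # 1-based genomic position
    ref: str
    alt: str
    phase_set: Optional[str]  # None means the call is unphased


def build_index(codons):
    """Map every (chrom, pos) covered by a codon to that codon."""
    index = {}
    for codon in codons:
        for offset in range(3):
            index[(codon.chrom, codon.start + offset)] = codon
    return index


def merge_snvs(snvs, index):
    """Merge SNVs that share both a codon and a phase set into one MNV."""
    groups = defaultdict(list)
    kept = []
    for snv in snvs:
        codon = index.get((snv.chrom, snv.pos))
        if codon is None or snv.phase_set is None:
            kept.append(snv)  # non-coding or unphased: leave untouched
        else:
            groups[(codon, snv.phase_set)].append(snv)

    for (codon, phase_set), members in groups.items():
        if len(members) == 1:
            kept.extend(members)
            continue
        members.sort(key=lambda s: s.pos)
        first, last = members[0].pos, members[-1].pos
        # Reference bases spanning the merged interval, including any
        # unchanged base that lies between two SNVs.
        ref = codon.ref[first - codon.start:last - codon.start + 1]
        alt = list(ref)
        for s in members:
            alt[s.pos - first] = s.alt
        kept.append(Snv(codon.chrom, first, ref, "".join(alt), phase_set))

    return sorted(kept, key=lambda s: (s.chrom, s.pos))


# Toy example: two in-phase SNVs in one codon plus an unphased SNV elsewhere.
index = build_index([Codon("chrT", 100, "CAC")])
calls = [
    Snv("chrT", 101, "A", "T", "ps1"),
    Snv("chrT", 102, "C", "T", "ps1"),
    Snv("chrT", 200, "G", "A", None),
]
for record in merge_snvs(calls, index):
    print(record)
```

In this toy input the two phased SNVs are emitted as a single AC>TT record at position 101, while the unphased SNV passes through unchanged. SNVs in the same codon but in different phase sets are deliberately left separate, because they may lie on different haplotypes.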

