Phylovar: toward scalable phylogeny-aware inference of single-nucleotide variations from single-cell DNA sequencing data

Abstract Motivation Single-nucleotide variants (SNVs) are the most common variations in the human genome. Recently developed methods for SNV detection from single-cell DNA sequencing data, such as SCIΦ and scVILP, leverage the evolutionary history of the cells to overcome the technical errors associated with single-cell sequencing protocols. Despite being accurate, these methods are not scalable to the extensive genomic breadth of single-cell whole-genome (scWGS) and whole-exome sequencing (scWES) data. Results Here, we report on a new scalable method, Phylovar, which extends the phylogeny-guided variant calling approach to sequencing datasets containing millions of loci. Through benchmarking on simulated datasets under different settings, we show that, Phylovar outperforms SCIΦ in terms of running time while being more accurate than Monovar (which is not phylogeny-aware) in terms of SNV detection. Furthermore, we applied Phylovar to two real biological datasets: an scWES triple-negative breast cancer data consisting of 32 cells and 3375 loci as well as an scWGS data of neuron cells from a normal human brain containing 16 cells and approximately 2.5 million loci. For the cancer data, Phylovar detected somatic SNVs with high or moderate functional impact that were also supported by bulk sequencing dataset and for the neuron dataset, Phylovar identified 5745 SNVs with non-synonymous effects some of which were associated with neurodegenerative diseases. Availability and implementation Phylovar is implemented in Python and is publicly available at https://github.com/NakhlehLab/Phylovar.


Introduction
With the advent of the first single-cell sequencing (SCS) techniques (Navin et al., 2011;Tang et al., 2009), the fields of single-cell genomics, transcriptomics, proteomics and epigenetics have witnessed remarkable growth over the last decade. Single-cell sequencing technologies have impacted our understanding in different fields of biology including developmental biology, immunology, microbiology and cancer biology (Kashima et al., 2020;Lim et al., 2020;Navin, 2014;Tang et al., 2019;Wang and Navin, 2015). Single-cell DNA sequencing (scDNAseq), as one of the SCS technologies, provides insights into the somatic evolutionary process by sequencing the genomic contents of a complex tissue at a single-cell resolution (Navin, 2014;Navin et al., 2011). Preparing scDNAseq data requires a whole-genome amplification (WGA) process to amplify the DNA material of a single cell to suffice the amount of DNA needed for sequencing. (Kashima et al., 2020;Zafar et al., 2018). WGA technologies, such as multiple displacement amplification (MDA) (Dean et al., 2002;Spits et al., 2006) and multiple annealing and looping-based amplification cycles (MALBAC) (Zong et al., 2012) can elevate the noise level in scDNAseq data. The scDNAseq technical errors include allelic dropout (ADO), false-positive (FP) errors, false-negative (FN) errors and non-uniform coverage (Navin, 2014;Zafar et al., 2018). ADO refers to cases where only one of the two alleles in a heterozygous mutation is amplified, resulting in the loss of the mutated allele. FP artifacts can appear due to uneven amplification or at the early stages of the amplification when the original nucleotide is substituted randomly. The non-uniform coverage over different genomic loci may result in missing data due to zero or insufficient coverage. The scDNAseq-specific technical errors fueled the development of tools such as Monovar (Zafar et al., 2016) and SCcaller (Dong et al., 2017) for detecting single-nucleotide variations (SNVs) from scDNAseq data. Although Monovar and SCcaller account for uneven coverage and scDNAseq-specific errors, more recent methods, SCIU (Singer et al., 2018) and scVILP (Edrisi et al., 2019), showed further improvement in overcoming the scDNAseqspecific technical errors by simultaneously inferring the cells' phylogeny and SNVs. SCIU uses a Markov chain Monte Carlo (MCMC) algorithm to sample the joint posterior distribution of SNVs and the phylogenetic tree of the single cells and reports the tree(s) with the best posterior probability and the corresponding genotypes. scVILP is formulated as an instance of Mixed Integer Linear Programming (MILP) and it aims to find maximum likelihood estimation (MLE) of the observed read counts given the underlying genotype matrix. Here, the MILP solver is restricted to proposing only the genotype matrices that satisfy three-gamete condition in order to maximize the likelihood function (see Estabrook et al., 1976;Ferná ndez-Baca, 2001;Gusfield, 1991Gusfield, , 1997Meacham, 1983;Semple and Steel, 2003 for more details on work related to inference under the threegamete condition).
Although 'regularizing' the mutation detection by using a tree as a guide is a promising direction (Edrisi et al., 2019;Kuipers et al., 2020;Markowska et al., 2021;Singer et al., 2018), applying SCIU and scVILP to datasets with large number of loci such as in Evrony et al. (2015) and Wang et al. (2014) is challenged by either very long running time or large memory consumption of the methods-the major issues in SCIU and scVILP, respectively. Indeed, scVILP runs out of memory on all of the datasets considered in our study here, except for the smallest ones, which is why we do not report on the performance of scVILP. To address this challenge, we developed Phylovar, a likelihood-based method for phylogeny-aware inference of SNVs from scDNAseq datasets consisting of a large number of loci. To simplify likelihood calculations for large-scale data, we assume that mutations occur following an infinite-sites assumption (ISA) (Deshwar et al., 2015;Jahn et al., 2016;Singer et al., 2018). Using this model, Phylovar finds the tree topology and the placement of mutations on ancestral single cells that maximize the likelihood of the erroneous observed read counts given the genotypes. Utilizing a vectorized formulation for likelihood calculations, Phylovar benefits from the vectorized operations in matrix manipulation packages such as NumPy (Harris et al., 2020) to scale up to many loci. We compared the SNV calling accuracy, memory consumption and running time of Phylovar against those of the existing methods, Monovar and SCIU, through a simulation study. We found that Phylovar outperforms SCIU in terms of running time with the same accuracy, while being more accurate than Monovar. Furthermore, we applied our method to two biological datasets: a triple-negative breast cancer (TNBC) dataset (Wang et al., 2014) consisting of 32 single cells and 3375 candidate loci, as well as the dataset from Evrony et al. (2015) containing 16 normal human neuron cells and 2 489 545 candidate loci. For the TNBC data, Phylovar inferred 652 SNVs with 'high' or 'moderate' functional impact, out of which 550 (84%) were also supported by bulk sequencing. For the neuron cells, Phylovar identified 5745 SNVs with nonsynonymous effects some of which were related to neurodegenerative diseases. To the best of our knowledge, Phylovar is the first scDNAseq SNV caller that can utilize the underlying tree structure even when the dataset contains millions of genomic loci.

Materials and methods
The input to Phylovar consists of the reference and variant count matrices, denoted by R ¼ ðr ij Þ 2 N 0 NÂM and V ¼ ðv ij Þ 2 N 0 NÂM , where N and M represent the number of single cells and candidate loci, respectively. Each entry in R and V represents the number of reference and variant counts, respectively, at cell i and site j. These count matrices are obtained from an input file in mpileup format. Here, candidate loci are defined as the genomic loci with a significant number of variant reads. This significance is measured by a statistic test. Note that these loci may not necessarily contain SNVs since the variant reads might be artifacts of scDNAseq technical errors. In all experiments reported below, we used SCIU's likelihood ratio test described in Singer et al. (2018) to identify candidate loci for the analyses. If the total read coverage at a cell and a candidate site is less than k, the corresponding entry is treated as missing data. We used k ¼ 1 in practice.

Single-cell genotype error model
Our genotype model considers bi-allelic genotype with 0 and 1 representing the absence and presence of a mutation, respectively. We differentiate true genotypes from those being subject to scDNAseq errors that propagate from WGA to the sequencing library-called library genotypes. Let G ¼ ðg ij Þ 2 f0; 1g NÂM be the binary matrix containing the true genotypes where g ij represents the true genotype at cell i and locus j. Similarly, we denote the library genotype matrix by W ¼ ðw ij Þ 2 f0; 1g NÂM . We assume library preparation process introduces FP and FN errors into the data resulting in difference between true genotypes and their corresponding library genotypes. Let a and b denote the FP and FN error rates, respectively. Then, the probability of the library genotype given true genotype and error rates are given by the following error model adopted from SiFit (Zafar et al., 2017) and SiCloneFit (Zafar et al., 2019): (1)

Single-cell read count model
For the convenience of notation, let c ij ¼ r ij þ v ij denote the total read coverage at cell i and locus j. We assume that variant read counts follow a binomial distribution whose success probability depends on the value of the library genotype: The variables l 0 and l 1 are the success probabilities associated with reference and alternate alleles, respectively. In practice, we set l 0 to 0.001, which is at the same order of magnitude for the error rate in different Illumina sequencing platforms (Stoler and Nekrutenko, 2021). We used 0.5 for the value of l 1 , which is the mean of variant read counts for a heterozygous mutation.

Tree model
Our tree model consists of two components: a binary tree topology T ¼ ðV; EÞ-where V denotes the set of nodes, and E is the set of edges-and a mutation placement for each genomic locus j, M j 2 f0; 1g 2NÀ1 . The latter is a binary vector of length 2N À 1 containing binary elements for each leaf or internal node in V. We take M j ½q ¼ 1 to denote that a mutation occurred at node q during the evolutionary history of locus j. In our model, we assume mutations evolve following the ISA. According to this model, at most one element in M j is allowed to be 1. This vector requires each node to have an index from f1; . . . ; 2N À 1g. For simplicity, we map indices f1; . . . ; Ng to the leaves/single cells and use the same mapping for the single cells in all tree topologies.

Log-likelihood function
Assuming independence across sites/loci, the log-likelihood function of read counts given true genotypes G, error rates ða; bÞ, underlying tree topology T and mutation placements for all loci (denoted by M) are: FðR; VjG; T; M; a; bÞ ¼ FðR; VjG; a; bÞ Note that the above likelihood is based on G rather than T and M directly, as G is derived from T and M. Therefore, we can drop T and M from Equation (3). It can be shown that after marginalizing out w's, log Pðr ij ; v ij jg ij ; a; bÞ ¼ log f P w Pðr ij ; v ij jw ij ÞPðw ij jg ij ; a; bÞg can be simplified as follows: Here, the log-likelihood values of the missing data are assumed to be 0. The MLE solution is obtained as fFðR; VjG; T; M; a; bÞg: (5)

Hill-climbing search algorithm
Phylovar infers the underlying phylogeny of single cells and their genotypes simultaneously in a hill-climbing fashion. At each step, the log-likelihood function is evaluated and updated by proposing one of the underlying parameters including the tree, mutation placements and error rates. We start the search by reconstructing an initial tree topology. To obtain this tree, first, we create the matrix of initial genotypes, G ð0Þ , as follows: Here, ða ð1Þ ; b ð1Þ Þ ¼ ð0; 0Þ are the initial estimates of the error rates. Given G ð0Þ , we calculate the pairwise Hamming distances between the single cells and build an initial tree topology, T ð1Þ , using the neighbor-joining algorithm (Saitou and Nei, 1987). Given the proposed parameters ðT ð1Þ ; a ð1Þ ; b ð1Þ Þ, the mutation placement with highest log-likelihood for each site j-denoted by M Ãð1Þ j -is determined (see below for more details) yielding the genotype matrix at first iteration, G ð1Þ and the first log-likelihood value FðR; VjG ð1Þ ; T ð1Þ ; M Ãð1Þ ; a ð1Þ ; b ð1Þ Þ. At each iteration t ! 2, either new error rates are estimated (see below for more details) or a new tree is proposed by performing tree rearrangement techniques including subtree pruning and re-grafting (SPR), nearest-neighbor interchange (NNI) and swapping two random leaves. The proposed parameters are accepted if the new log-likelihood value is greater than or equal to the log-likelihood in the previous iteration. In case of stochastic hill-climbing, the acceptance probability of the newly proposed log-likelihood value is: The search procedure terminates when the log-likelihood does not improve after a user-specified number of iterations or when it reaches the maximum number of iterations.

Proposing new error rates
The new error rates at iteration t are calculated using the following equations from the entries of G ðtÀ1Þ and G ð0Þ : Here, the number of 0 entries in G ð0Þ that were 'corrected' to 1 in G ðtÀ1Þ provides a measure of what a more realistic a would be through the hill-climbing trajectory. The same rationale applies to proposing a new value of b. In Equations (8) and (9), a ðtÞ and b ðtÞ indicate the new values of a and b at iteration t, respectively. Here, g ð0Þ ij and g ðtÀ1Þ ij denote the entries of the initial genotype matrix and the genotype matrix at iteration t-1, respectively. The term c ij 6 ¼ 0 indicates the entries with non-zero read counts. The symbol^represents the logical AND operator.

Finding the best mutation placement
Given a topology T ðtÞ and a site j, each possible mutation placement on T ðtÞ yields a unique genotype configuration at the level of single cells. We seek the mutation placement with the highest log-likelihood. Let M ðtÞ ðv k Þ k2f1;...;2NÀ1g denote the mutation placement when the k th node is mutated. As a special case, let M ðtÞ ðv 2N Þ represent the absence of mutation. Because the set of all possible mutation placements is the same for all the sites, we dropped the index j from these two notations. To summarize the effect of all possible ISA mutation placements on genotype configurations, we define S ðtÞ ¼ ðs ðtÞ ki Þ 2 f0; 1g 2NÂN whose k th row, S ðtÞ kÃ , represents the genotype configuration corresponding to M ðtÞ ðv k Þ. We formally define the mapping from mutation placements to genotypes, denoted by U, as follows: where i 2 f1; . . . ; Ng and k 2 f1; . . . ; 2Ng. Here, T ðtÞ v k denotes the subtree rooted at node v k . Note that the mapping U is one-to-one, so we can use its inverse to retrieve the genotypes given a mutation placement. In addition to S ðtÞ , we define two other matrices that store the log-likelihood values from Equation (4), one assuming all genotypes are 0, called matrix of zero-allele likelihoods, Z ðtÞ and the other assuming all genotypes are 1, called matrix of one-allele likelihoods, O ðtÞ . Formally, we define the matrices O ðtÞ ¼ ðo ðtÞ ij Þ 2 ðÀ1; 0 NÂM and Z ðtÞ ¼ ðf ðtÞ ij Þ 2 ðÀ1; 0 NÂM as follows: It can be shown that the following matrix multiplication results in a matrix X ðtÞ ¼ ðv ðtÞ kj Þ 2 ðÀ1; 0 2NÂM whose each element v ðtÞ kj is equal to the log-likelihood value of M ðtÞ ðv k Þ at site j: Here, J 2N;N is matrix of all-ones. The best mutation placement at site j, M ÃðtÞ j , is associated with the highest value in the j th column of X ðtÞ : M ÃðtÞ j ¼ M ðtÞ ðv k Ã Þ: The index corresponding to the highest value is denoted by k Ã : ...;2Ng fv ðtÞ kj g: Using U À1 , we can determine the best genotype configuration at site j which constitutes the j th column of G ðtÞ using G ðtÞ Ãj ¼ S Finally, G ðtÞ is the concatenation of best genotypes configurations at all sites: 3 Results and discussion

Simulation study
We first compared the computational efficiency and SNV calling accuracy of Phylovar, SCIU and Monovar using synthetic datasets simulated under five scenarios: (i) varying the number of mutations, (ii) varying the number of cells, (iii) varying the ADO rates, (iv) copy number effect and (v) violation of ISA. We simulated the datasets using the simulator introduced in Singer et al. (2018). In the first scenario, we investigated how Phylovar performs compared to the other methods when increasing the number of mutations dramatically to a large extent. We simulated datasets containing 16 single cells with 1000, 10 4 and 10 5 mutations. For each mutation value, ten datasets were generated. Phylovar's accuracy was comparable to that of SCIU in terms of F1 measure, while both SCIU and Phylovar were more accurate than Monovar because of accounting for evolutionary history (Fig. 1a). To measure the running time of each method, we used the CPU clock time (in seconds and log scale) during its execution. Figure 1f shows that the running time of each method increased with the number of mutations. For the largest dataset with 10 5 mutations, Phylovar was approximately two orders of magnitude faster than SCIU.
In the second scenario, we sought to answer how the methods' performances depend on the number of cells. Here, we fixed the number of mutations at 10 5 and varied the number of cells, N 2 f8; 16; 32g. For each setting, we generated ten datasets. The F1 accuracy scores of all methods improved as the number of cells increased (Fig. 1b). Again, Phylovar's accuracy was comparable to that of SCIU while it outperformed Monovar. We observed that increasing the number of single cells improved the accuracy of all methods more than increasing the number of mutations. As demonstrated in Figure 1g, similar to the first scenario, the running time of Phylovar was almost two orders of magnitude less than that of SCIU.
In the third scenario, the ADO rate was varied while cells and mutations were fixed at 16 and 1000. We selected ADO rates from the values f0; 0:05; 0:1; 0:15; 0:2; 0:25; 0:3; 0:35g, and generated ten datasets for each ADO value. Figure 1c shows that both SCIU and Phylovar were more robust to high ADO rates than Monovar due to the utilization of the underlying single-cell phylogeny.
Since Phylovar can estimate the false-negative error rates (denoted by b), we measured the correlation between the ADO rates used for generating the simulated datasets and the estimated b's. As demonstrated in Figure 1h, these two values were highly correlated (the Pearson correlation coefficient was 0.991). It is worth noting that based on the linear regression line, the estimated b was almost half of the true ADO, pointing to the difference between the dropout mechanism in the simulator and our definition of b (see Section 2). Given an ADO rate l, the simulator chooses l fraction of the mutations. It changes l 2 of them into reference genotype, and l 2 of them into homozygous mutations; the b in our model indicates the probability of a mutation becoming reference genotype implying b % l 2 , which we can observe in Figure 1h.
Since Phylovar assumes the read counts are originated from diploid strands, in the fourth scenario, additional wild-type alleles were introduced to the read counts to imitate the effect of copy number changes. The simulator randomly selects a fraction of mutated loci (named copy number rate), and chooses c extra copies for each loci with probability 1 2 c (Singer et al., 2018). We increased the copy number rate from 0 to 0.5 with step size 0.125. For each value, we generated ten datasets containing 16 cells and 1000 mutations. Figure 1d shows that the SNV calling accuracy of the methods decreased as more mutated loci were subject to copy number changes.
In the fifth scenario, we were interested in observing how violations of ISA affect the SNV calling accuracy of the methods. Given a fraction of mutations, the simulator randomly selects half of them to recur in different branches, and the rest of them to be lost in the same subtree. We increased the fraction of mutations subject to ISA violations from 0 to 0.15 with 0.05 step size. For each value we generated ten datasets with 16 cells and 1000 mutations. Figure 1e shows that all three methods had a stable performance as the fraction of ISA violations increased. This observation suggests that even though the phylogenies inferred by SCIU and Phylovar might be inaccurate due to the presence of violations of their evolutionary model, the effect of such violations on mutation inference is negligible.

Application to real data
We applied Phylovar on two human scDNAseq datasets. The first dataset consists of single-cell whole-exome sequencing (scWES) samples from a triple-negative breast cancer (TNBC) patient (Wang et al., 2014). Since the population sequencing data from bulk tumor and matched normal tissue are available for the TNBC dataset, the number of mutations shared by scWES and bulk data provides us a metric for measuring the accuracy of our approach. The TNBC dataset consists of 16 diploid cells, eight hyperdiploid/aneuploid cells and eight hypodiploid cells (Wang et al., 2014). Given the (h) Linear regression between estimated false-negative error rates (b's) and actual ADO rates used for simulated data control normal cells, SCIU's likelihood ratio test identifies the loci likely to contain somatic mutations. Applying this statistic test on the input mpileup file resulted in 3375 candidate loci on which we applied Phylovar. Phylovar was run with ten parallel hill-climbing chains, each for 100 000 iterations on a pool of five CPU's, each with 48 cores (AMD EPYC 7642) on a node with 192 GB RAM. The total runtime was 91 min. Phylovar inferred an 18.21% falsenegative error rate and a 1.03% false-positive error rate from TNBC data. We ran SCIU and Monovar with default parameters; SCIU and Monovar terminated after 10 h and 144 min, respectively. Figure 2 shows the three methods' mutation calls on TNBC data from the overlapping sites as well as the initial genotype matrix at the first iteration of our hill-climbing search. We performed hierarchical clustering with Ward's minimum variance method implemented in Python's SciPy package (Virtanen et al., 2020) on the genotype matrix for better visualization. We observed concordance between the calls from Phylovar and SCIU while Monovar's calls are noisy and resemble Phylovar's initial genotypes.
To annotate the mutations, we applied snpEff (Cingolani et al., 2012) on the SNVs detected by Phylovar. Out of 3375 candidate loci, 652 loci contained SNVs with 'high' or 'moderate' functional effects (see https://pcingola.github.io/SnpEff/se_inputoutput/ for details on the types of variants' effects and their descriptions). Then, we ran HaplotypeCaller (GATK version 4.2.0.0) for mutation calling on the bulk tumor and normal samples. Among 652 SNVs in single cells, 550 (84%) mutations were found in bulk data (Fig. 3). Performing this analysis on the results of SCIU yielded the same results as for Phylovar: 652 loci contained SNVs with high or moderate functional effect, out of which 550 mutations overlapped with the calls from bulk data. Monovar detected 18 187 high or moderate non-synonymous mutations in the single-cell data from which 10 981 mutations were found in the bulk data as well which is 60% of the single-cell calls (Figs. 4 and 5).
The second biological data consists of 16 neuron cells on which scWGS was performed to study somatic mutations in human brain development (Evrony et al., 2015). Applying SCIU's statistic test on the input mpileup identified 2 489 545 candidate loci. We ran Phylovar with five parallel hill-climbing chains, each for 50 000 iterations on 5 CPUs with 192 GB RAM. Phylovar finished the process after 17 h and 45 min. To compare our results with other methods, we ran Monovar and SCIU with default parameters. Monovar processed the data in Fig. 2. Clustered heatmaps of mutation calls by different methods performed on the TNBC dataset. Here, rows and columns represent the genomic loci and the single cells, respectively. The pixels show mutation calls (dark blue), reference alleles (light blue) and missing data (pink). The initial genotypes are the initial estimates of genotypes considering no error rates and no underlying phylogeny at the starting step of Phylovar's search algorithm (A color version of this figure appears in the online version of this article) Fig. 3. Clustered heatmap of the mutations detected by Phylovar from TNBC data and population sequencing data of tumor and matched normal tissues. The pixels show mutation calls (dark blue), reference alleles (light blue) and missing data (pink). Columns (cells) are colored according to the ploidy of the cells. The colors of the rows (genes) indicate whether the SNV was found in bulk data or not. Here, popT and popN are the tumor and normal population sequencing samples, respectively. Out of 3375 candidate loci, 652 loci contained SNVs with high or moderate functional effects in the single-cell data among which 550 mutations were found in bulk data as well, which is 84% of the single-cell calls (A color version of this figure appears in the online version of this article) Phylovar's inferred false-positive and false-negative error rates were 75.22% and 1.17%, respectively. snpEff identified 5745 nonsynonymous SNVs among Phylovar's mutation calls. Figure 6 shows hierarchical clustering on the genotypes of Phylovar, Monovar and the initial genotypes at sites with non-synonymous SNVs. We observed similarities between Monovar's calls and the initial genotypes. By comparing the panel of Phylovar in Figure 6 with the other panels, one can see the sparse regions of mutations in panels of Monovar and the initial genotype matrix that are inferred as reference alleles by Phylovar. Furthermore, we investigated the genes likely related to neurodegenerative diseases by comparing our findings with the genes reported in Wei et al. (2019), where somatic mutations were studied in 1461 control and diseased human brains with different neurodegenerative disorders. Among the genes inferred by Phylovar to harbor non-synonymous mutations, 12 genes were reported in Wei et al. (2019). We observed that the genes MUC16 and MLIP were frequently mutated in different regions; also, non-synonymous SNVs were observed within KRT33A and SEMA5B in Wei et al. (2019) from patients with Creutzfeldt-Jakob and Alzheimer diseases, respectively. The presence of these non-synonymous mutations in both diseased and normal samples implies the high mutability of these genes even in a healthy individual.

Conclusions
The rapid growth of SCS technologies poses computational challenges due to the increasing number of cells and sites sequenced per genome (Lä hnemann et al., 2020). In this work, we focused on addressing the computational challenge associated with the breadth of genomic sites in scDNAseq data. Here, we introduced Phylovar, a scalable MLE method for phylogeny-guided inference of SNVs from single-cell DNA sequencing data suitable for scWGS and scWES data with an extensive number of loci. We introduced a novel vectorized formula for likelihood calculation, making Phylovar scalable to hundreds of thousands, even millions of loci.
We assessed Phylovar's performance against state-of-the-art variant callers SCIU (Singer et al., 2018) and Monovar (Zafar et al., 2016), through simulated benchmarks. Phylovar outperforms SCIU in terms of running time while being more accurate than Monovar in different simulation scenarios. We also applied Phylovar to two real biological datasets. For a TNBC dataset with 32 single cells and 3375 candidate loci, Phylovar identified SNVs with functional impact among which 84% were supported by bulk sequencing data. Phylovar was also more accurate than Monovar and 6.5x faster than that of SCIU. For a larger dataset containing 16 normal human neuron cells and approximately 2.5 million candidate loci, Phylovar identified 5745 non-synonymous SNVs some of which were related to neurodegenerative diseases. Interestingly, Phylovar detected 75.22% false-positive, and 1.17% false-negative error rate for this dataset. The neuron cells data was particularly challenging due to large number of sites. For this data, SCIU failed to converge even after ten days of running while Phylovar terminated after less than 18 h. Fig. 4. Clustered heatmap of the mutations detected by SCIU from TNBC data and population sequencing data of tumor and matched normal tissues. The pixels show mutation calls (dark blue), reference alleles (light blue) and missing data (pink). Columns (cells) are colored according to the ploidy of the cells. Since SCIU discarded the diploid cells from the final output, only hypodiploid and hyperdiploid cells are demonstrated here. The colors of the rows (genes) indicate whether the SNV was found in bulk data or not. Here, popT and popN are the tumor and normal population sequencing samples, respectively. SCIU detected 652 SNVs with high or moderate functional effects in the single-cell data among which 550 mutations were found in bulk data as well, which is 84% of the single-cell calls (A color version of this figure appears in the online version of this article) Fig. 5. Clustered heatmap of the mutations detected by Monovar from TNBC data and population sequencing data of tumor and matched normal tissues. The pixels show mutation calls (dark blue), reference alleles (light blue) and missing data (pink). Columns (cells) are colored according to the ploidy of the cells. The colors of the rows (genes) indicate whether the SNV was found in bulk data or not. Here, popT and popN are the tumor and normal population sequencing samples, respectively. Monovar detected 18 187 high or moderate non-synonymous mutations in the single-cell data from which 10 981 mutations were found in the bulk data as well which is 60% of the single-cell calls (A color version of this figure appears in the online version of this article) Phylovar makes it possible to analyze datasets with large number of loci within reasonable time and memory requirements, thus adding to the growing toolbox for analyzing scDNAseq data. As a direction for future research, we will explore deviations from the simplified ISA model and investigate the feasibility of applying more general finite-sites models (FSM) (Zafar et al., 2017(Zafar et al., , 2019 to datasets with many loci. As scDNAseq technologies advance, the sequencing cost per cell decreases (Lim et al., 2020;Tang et al., 2019). Consequently, we expect more scWGS and scWES datasets to emerge in the future, requiring methods such as Phylovar that can perform scalable variant calling on datasets with millions of loci.