## Abstract

Missing data are common in DNA sequences obtained through high-throughput sequencing. Furthermore, samples of low quality or problems in the experimental protocol often cause a loss of data even with traditional sequencing technologies. Here we propose modified estimators of variability and neutrality tests that can be naturally applied to sequences with missing data, without the need to remove bases or individuals from the analysis. Modified statistics include the Watterson estimator *θ*_{W}, Tajima’s *D*, Fay and Wu’s *H*, and HKA. We develop a general framework to take missing data into account in frequency spectrum-based neutrality tests and we derive the exact expression for the variance of these statistics under the neutral model. The neutrality tests proposed here can also be used as summary statistics to describe the information contained in other classes of data like DNA microarrays.

NEUTRALITY tests are among the most widely used tools in population genetics. Many neutrality tests have been developed based on the levels and the patterns extracted from segregating sites, and in particular to be applied to biallelic SNP data. The simplest information that can be extracted from SNP data are the allele frequency spectrum; therefore, many tests focus on the difference between the observed and expected spectrum under the neutral Wright–Fisher model. Widespread tests of this kind include Tajima’s *D* (Tajima 1989), Fu and Li’s *F* and *D* (Fu and Li 1993), and Fay and Wu’s *H* (Fay and Wu 2000). However, this class of tests is much larger, as recently shown by Achaz (2009) following an idea of Nawa and Tajima (2008), and includes among the others the tests by Fu (1997), Zeng *et al.* (2006), and Achaz (2009). A subclass of optimal neutrality tests against specific alternative scenarios was described by Ferretti *et al.* (2010). Some general results on the variances of these tests were provided by Fu (1995) and Pluzhnikov and Donnelly (1996).

All these statistics assume a complete knowledge of the alleles present in the *n* sequenced individuals for all the *L* positions genotyped. However, this is rarely the case: experimental problems in sample preparation or genotyping often result in missing data; *i.e.*, some individual alleles at some positions are actually unknown.

At present, most packages for population genetics analyses like DNAsp (Librado and Rozas 2009) deal with missing data simply by removing individuals and/or positions affected with incomplete data. This is a good strategy as long as missing data represent a very minor fraction of the alleles, since in this case they do not affect the power of the analysis. However, there could be situations in which a large amount of missing data are unavoidable. For example, in samples taken from natural populations the quality of the samples could be low or the amount of DNA available per individual could be insufficient; therefore, genotyping these samples could miss a significant fraction of the alleles.

There is another important reason to consider sequences with missing data. Many of the sequences that are being produced currently are not obtained through Sanger sequencing, but from next-generation sequencing (NGS) technologies. These technologies sequence a large amount of short reads that are then realigned to reconstruct the original sequence. The coverage of these reads is strongly inhomogeneous along the genome and there is often a large fraction of bases that is not covered by a sufficient number of reads, unless the coverage is very high. Missing data are therefore inherent to these technologies: hence, removing individuals or bases with missing alleles would imply a huge loss of information. Given the growing relevance of NGS technologies for population genetics studies, a different strategy is needed to deal with this circumstance. Several estimators of variability can be applied directly to sequenced reads (Lynch 2008; Hellmann *et al.* 2008; Jiang *et al.* 2009; Futschik and Schlötterer 2010; Kang and Marjoram 2011); however, no estimator is available for the sequences obtained after genotype call has been completed for each individual in each position. The difference between these two situations is that, once the genotype has been determined, all the information about the single read bases aligning on a given position and their qualities is (for our purposes) lost.

In this article we present a simple generalization of some estimators and tests that take missing data into account. In particular we consider the Watterson estimator of genetic variability (Watterson 1975), the Tajima estimator of nucleotide diversity (Tajima 1983), neutrality tests based on the frequency spectrum like Tajima’s *D* and Fay and Wu’s *H*, and the HKA test (Hudson *et al.* 1987) for neutral evolution based on the pattern of polymorphism and divergence. The most important result of this article is the general expression for the covariance between the frequency spectrum at two sites

Note that in sequence data, missing data (usually represented by *N*’s located in the same position as the missing alleles) are not equivalent to gaps (represented by white spaces). Gaps correspond to insertions or deletions (indels) in some of the sequences. In this article we do not address indels, even if very short biallelic indels (a few bases long) are similar to SNPs as genetic variants and therefore could be analyzed by similar methods. In practice it is difficult to differentiate indels from missing data if the rate of missing data are high, and this is especially true for sequences obtained from NGS data. Here we consider sequences without indels.

## Neutrality Tests Including Missing Data

In this article we consider estimators and tests based on the frequency spectrum. The basic population parameter involved in these tests is the nucleotide variability *θ* = 2*N*_{e}*μ*, where *N*_{e} is the haploid effective population size and *μ* is the mutation rate per base per generation. We assume a small variability *θ* ≪ 1 and a large window length *L* ≫ 1, such that

All the tests and estimators belong to a general class of neutrality tests that can be parametrized in terms of weights *ω _{i}*, Ω

*(Achaz 2009),*

_{i}*ξ*indicates the number of variants with frequency

_{i}*i*for the derived allele. The weights

*ω*, Ω

_{i}*multiply the normalized frequency spectrum*

_{i}*θ*=

_{i}*iξ*(Nawa and Tajima 2008). The estimators are unbiased estimators of

_{i}*θ*, while the tests are normalized to have mean 0 and variance 1 under the standard neutral model without recombination.

Our definition of neutrality tests based on the frequency spectrum is the most general parametrization compatible with Equations 1–2 that takes explicitly into account the coverage for each site. We denote by *n* the total number of individuals sequenced and by *n _{x}* the number of individuals for which the allele at position

*x*is known. The estimators and tests are defined as

*ξ*(

_{i}*x*) is an index variable that is 1 if there is a segregating site with

*i*derived alleles in position

*x*and 0 otherwise. The estimators are unbiased,

*i.e.*,

*E*(

*T*) = 0, Var(

*T*) = 1 as in the usual framework.

The weights *S* the total number of segregating sites and *et al.* 2008). Its natural generalization is the MCLE with missing data,*S*. For most of the other estimators like Tajima’s Π, we choose the weights *ω _{i}* for the estimators (1), where

*n*is substituted by

*n*, that is,

_{x}As for neutrality tests, Tajima’s *D* (Tajima 1989) corresponds to *H* (Fay and Wu 2000) corresponds to*i.e.*, *n _{x}* =

*n*for all sites.

In this framework it is also possible to implement error corrections for error-prone data: for example, removing singletons (Achaz 2008) is equivalent to the choice of *i.e.*, the ones for which information from most individuals is missing. If we assume that the minimum number of individuals covering reliable positions is *n*_{min}, this filter can be easily implemented by removing all positions with *n _{x}* <

*n*

_{min}from the analysis.

To evaluate the tests (4), we need the variances in the denominators. Our basic result for these variances (leaving out subleading terms in *θ* and 1/*L*) is*ξ _{i}*(

*x*) are index variables with mean

*E*(

*ξ*(

_{i}*x*)) =

*θ*/

*i*and

*ξ*(

_{i}*x*),

*ξ*(

_{j}*x*) are mutually exclusive for

*i*≠

*j*, so

*E*(

*ξ*(

_{i}*x*)

*ξ*(

_{j}*x*)) = 0. The covariance Cov(

*ξ*(

_{i}*x*),

*ξ*(

_{j}*y*)) for the standard neutral model without recombination is presented in the next section.

The HKA (Hudson, Kreitman, Aguadé) test (Hudson *et al.* 1987) and the formulae for estimators and tests based on the folded spectrum are treated in File S1.

## Covariance of the Frequency Spectrum at Different Sites

Since *ξ _{i}*(

*x*),

*ξ*(

_{j}*y*) are index variables, their covariance under the standard neutral model without recombination is

*i*and

*j*,

*n*and

_{x}*n*are the numbers of individuals with known alleles at the two sites, and

_{y}*n*is the number of individuals for which both alleles are known. This probability can be obtained as

_{xy}*k*and

*l*in

*n*complete sequences. (We define a pair of mutations as shared if there are individuals with derived alleles in both loci and as exclusive if no individual sequence contains both the derived alleles.) The sum

*P*=

_{kl}*θ*

^{2}(1/

*kl*+

*σ*), where the matrix

_{kl}*σ*is defined in Fu (1995, Equations 2–3).

_{kl}The coefficients *k* and *l* in *n _{x}* +

*n*−

_{y}*n*complete sequences,

_{xy}*i*and

*j*derived alleles are found among the

*n*,

_{x}*n*alleles in

_{y}*x*and

*y*, respectively, assuming that the

*n*,

_{x}*n*individuals (with

_{y}*n*in common between the two sets) are randomly extracted from the complete set of

_{xy}*n*+

_{x}*n*−

_{y}*n*individuals. The combinatorial formulae for these probabilities are

_{xy}*k*≥

*l*; otherwise use the identity

The formulae for the probabilities *E*(*ξ _{k}ξ_{l}*) by Fu (1995) into the contributions from shared mutations (Fu 1995, Equations 24 and 28) and exclusive mutations (Equations 25, 29, and 30):

Finally, note that the computation of the variances requires an estimate of

## Discussion

In this article we presented a general framework for estimators of variability and neutrality tests based on the frequency spectrum that take into account missing data in a natural way. This is particularly interesting in the light of sequences obtained from NGS data, since for these technologies a relevant fraction of bases is often not sequenced or sequenced at very low read depth.

The approach discussed here is based on results that are conditional on the distribution of the missing data, as summarized by the distribution of all the triples (*n _{x}*,

*n*,

_{y}*n*). An effective way of implementing numerically the above variances is to sample

_{xy}*N*

_{s}random values of (

*n*,

_{x}*n*,

_{y}*n*) from the empirical distribution and compute the covariances using only these values and then rescale the second term in Equation 8 by a factor

_{xy}*L*

^{2}/

*N*

_{s}.

The modifications presented in this article can be applied to all estimators and tests included in the framework of Achaz (2009) and represent, therefore, a complete tool with which to deal with missing data. However, it would be interesting to know the impact of the missing data on the performance of the estimators and tests. If we fix the sample size, an increase in the amount of missing data leads to an increase in the variance of the estimators (Figure 1), as is to be expected given that this is equivalent to loss of information. On the other hand, if the loss of information associated with missing data is compensated by sequencing more individuals, the performances of the estimators actually *increase* (*i.e.*, their variances decrease) with respect to complete sequences with the same coverage (Figure 1). A similar effect can be observed for neutrality tests (Figure 2). The explanation for this counterintuitive behavior lies in the fact that in the case of complete sequences, all individuals share the same genealogical tree at all positions, *i.e.*, the same evolutionary history, while in this case different positions are covered by different sets of individuals with partly independent histories in the same population; therefore, the number of available histories is actually larger and the variance is reduced similarly to what happens with recombination. Our results imply that with the same amount of information per base, missing data could improve the power of neutrality tests.

## Acknowledgments

Work funded by grant CGL2009-09346 to S.R.O., grant AG2010-14822 to Miguel Pérez-Enciso, and Consolider grant CSD2007-00036 “Centre for Research in Agrigenomics” (Ministerio de Ciencia e Innovación, Spain). S.R.O. is recipient of a Ramón y Cajal position (Ministerio de Ciencia e Innovación, Spain). L.F. acknowledges support from Consejo Superior de Investigaciones Científicas (Spain) under the JAE-doc program.

## Footnotes

*Communicating Editor: N. A. Rosenberg*

- Received February 22, 2012.
- Accepted May 24, 2012.

- Copyright © 2012 by the Genetics Society of America