A Survey on Distribution Testing: Your Data is Big. But is it Blue?

The field of property testing originated in work on program checking, and has evolved into an established and very active research area. In this work, we survey the developments of one of its most recent and prolific offspring, distribution testing. This subfield, at the junction of property testing and Statistics, is concerned with studying properties of probability distributions. We cover the current status of distribution testing in several settings, starting with the traditional sampling model, where the algorithm obtains independent samples from the distribution. We then discuss several recent models, in which one either grants the testing algorithms more powerful types of queries, or evaluates their performance against that of an information-theoretically optimal "adversary." In each setting, we describe the state of the art for a variety of testing problems. We hope this survey will serve as a self-contained introduction for those considering research in this field.

Foreword

"Recently there has been a lot of glorious hullabaloo about Big Data and how it is going to revolutionize the way we work, play, eat and sleep." (R. A. Servedio)

This is not a comprehensive survey on distribution testing – yet it aims at being one. It emerged as the author was trying to make sense of what he was doing, and of the myriad of papers read along the way1 – each with new results, sometimes superseding the previous, sometimes incomparable, sometimes none of the above. The field of distribution testing has grown fast these last years, making great strides in Theoretical Computer Science after being the playground of Statisticians for decades (centuries?). Yet, if pressed to find any, I would state one downside to this fast progress: it is easy to get lost, confused about what is known, who proved it, and whether it relates to that other result from this other paper which looks a tad similar. This survey will not solve all these questions – yet it aims at doing so.
1 More precisely, looked hard at along the way.


Introduction
Given data from an experiment, study or population, inferring information about the underlying probability distribution it defines is a fundamental problem in Statistics and data analysis, and has applications and ramifications in a myriad of other fields. But this question, extensively studied for decades, has undergone a significant shift these last years: the amount of data has grown huge, and the corresponding distributions are now often over a very large domain (see for instance [18,60,69]). So large, in fact, that the usual methods from Statistics and learning theory are no longer practical; and one has to look for faster, more sample-efficient techniques and algorithms. In particular, one can restrict the goal: when learning the whole distribution is not necessary, it may be enough to focus on whatever aspect of the data is important to the application. In doing so, it may be possible to overcome the formidable complexity of the task; most of the time at the price of a slightly relaxed guarantee on the answer. (For a more eloquent exposition of these points, see, e.g., [88].1) But if only one phrase and motivation were allowed to justify the whole field of distribution testing, the author could not find anything more concise and trendy than these two words: "Big Data."

1 Testing: what, why, and how?
We work in the setting of property testing as originally introduced in [90,59], where access to an unknown "huge object" is presented to an algorithm via the ability to perform local "inspections." By making only a small number of such queries to the object, the randomized algorithm must determine whether the object exhibits some prespecified property of interest, or is far from every object with the property. (For a more detailed presentation and overview of the field of property testing, the reader is referred to [53,86,87,56].) In distribution testing, this "huge object" is an unknown probability distribution (or a collection of those) over some known domain Ω; and the type of access granted to this distribution can be of several sorts, depending on the specific model. E.g., in the most common setting, the algorithm is provided with independent samples drawn from the distribution; other models may allow the algorithm to query the value of the probability mass function at points of its choosing, or to sample from the distribution after conditioning on some subsets of the domain. For these various models, the question now becomes to bound the number of queries required to test a range of statistical properties, as a function of the domain size and the "farness parameter." (In particular, the running time of the algorithm is usually only a secondary concern, even though obtaining efficient testers is an ancillary goal in many works.)2

2 But what about...
It is natural to wonder how the above approach to distribution testing compares to classic methods and formulations, as studied in Statistics. While the following will not be a thorough comparison, it may shed some light on the difference.
Null and alternative hypotheses. The standard take on hypothesis testing, simple hypothesis testing, relies on defining two classes of distributions, the null and alternative hypotheses H 0 and H 1 . A test statistic is then tailored specifically to these two classes, in order to optimally distinguish between H 0 and H 1 , that is, under the assumption that the unknown distribution D ∈ H 0 ∪ H 1 . The test then rejects the null hypothesis H 0 if statistical evidence is obtained that D / ∈ H 0 . In this view, the distribution testing formulation would be to set H 0 to be the property P to be tested, and define the alternative hypothesis as "everything far from P." In this sense, the latter captures a much more adversarial setting, where almost no structure is assumed on the alternative hypothesis-a setting known in Statistics as composite hypothesis testing.
Small-sample regime. A second and fundamental difference resides in the emphasis given to the testing question. Traditionally, statisticians tend to focus on asymptotic analysis, characterizing-often exactly-the rate of convergence of the statistical tests under the alternative hypothesis, as the number of samples m grows to infinity. Specifically, the goal is to pinpoint the error exponent ρ such that the probability of error (here, failing to reject the null hypothesis although the alternative holds) asymptotically decays as e^{−ρm}. However, this asymptotic behavior will generally only hold for values of m greater than the size of the domain ("alphabet"). In contrast, the computer science focus is on the small-sample regime, where the number of samples available is small with regard to the domain size, and one aims at a fixed probability of error.
Algorithmic flavor. At a more practical level, a third point on which the two approaches deviate is the set of techniques used to tackle the question. Namely, the Statistics literature very often relies on relatively simple-looking and "natural" tests and estimators, which need not be computationally efficient. (This is for instance the case for the generalized likelihood ratio test, which requires computing the maximum likelihood of the sequence of samples obtained under the two hypotheses H_0 and H_1, and is not tractable in general.) On the other hand, works in distribution testing predominantly have (or used to have) a more algorithmic flavor, with an emphasis on the computational aspects of the algorithms thus obtained.

Scope and structure of this survey
This survey is (alas) not comprehensive; choices have been made, sometimes even consciously. Our goal is to provide a substantial overview and summary of the results and areas addressed, as well as describe some useful tools and tricks gleaned along the way. In Part II, we provide notation and definitions the reader shall need to navigate safely.
In Part III, we focus on the standard model for distribution testing, where the algorithm can only access the distribution by drawing independent samples from it. We cover a wide range of properties of probability distributions that have been investigated in this setting, and for which both upper and lower bounds on the sample complexity have been established. These include testing whether the input distribution D is uniform [60,19,79], whether D is identical to a known distribution D * [17,99], and testing whether two unknown distributions D 1 , D 2 are identical [18,100,39]: we describe these results in Section 5, as well as the related problems of tolerant testing for these properties [98].
We then turn, in Section 6, to a slightly different line of research, where one tries to test for structure: we start with the problem of deciding if D has a monotone (non-increasing) probability mass function [20,89,23], before tackling the question of testing whether D belongs to some specific parameterized class of distributions (e. g., is D a Binomial distribution?). We then look at the problem of testing independence, that is, deciding whether a distribution on a domain Ω 1 × Ω 2 is a product distribution.
In Section 7, we follow a slightly different path, and cover testing results assuming structure. For example, we discuss testing for monotonicity assuming that D is k-modal, or testing uniformity or identity when D is guaranteed to be a histogram (that is, to have a piecewise constant probability mass function).
After this, we discuss in Section 8 the class of symmetric properties, that is, properties invariant under any permutation of the domain (e.g., "having small support size," or "having entropy at least (log n)/2") [78,16,98]. We conclude the chapter by providing in Section 9 some tips and remarks to keep in mind when working in the sampling model. The next section, Part IV, is dedicated to alternative or new models for testing distributions-whether it be with stronger types of access, as in Section 11 and Section 12, or with different objectives and settings altogether (Section 13 and Section 14). In each case, we attempt to present an overview (and, whenever possible, the current state of the art) of the particular setting considered, following the same overall outline as in Part III.
Finally, we give in the appendix a summary of the results covered in this survey, as well as additional definitions and tools that may prove useful to anyone interested in distribution testing and learning.
Caveat. This survey does not cover quantum distribution testing, in any of its aspects-whether it be classical testing of quantum properties, quantum testing of classical properties or quantum testing of quantum properties. Not that the author does not deem this area worthy of interest; but, quite sadly, that he does not know the first thing about it, and prefers pointing the reader to [76] (e. g., Section 2.2.6) or [77] rather than showing his utter and complete ignorance.

Part II Preliminaries
All throughout the paper, we denote by [n] the set {1, . . . , n}, and by log the logarithm in base 2; we use the notations Õ(f) and Ω̃(f) to hide polylogarithmic dependencies on the argument, and will sometimes write O_ε(f) to signify that the hidden constant depends on the parameter ε (while f does not). A probability distribution over a (countable) domain3 Ω is a non-negative function D : Ω → [0, 1] such that ∑_{x∈Ω} D(x) = 1. We denote by ∆(Ω) the (convex) polytope of all such distributions, and by U(Ω) the uniform distribution on Ω (when well-defined). Given a distribution D over Ω and a set S ⊆ Ω, we write D(S) for the total probability weight ∑_{x∈S} D(x) assigned to S by D; and let supp(D) := { x ∈ Ω : D(x) > 0 } be the (effective) support of the distribution. Moreover, for S ⊆ Ω such that D(S) > 0, we denote by D_S the conditional distribution of D restricted to S, that is, D_S(x) = D(x)/D(S) for x ∈ S and D_S(x) = 0 otherwise. Finally, for a probability distribution D ∈ ∆(Ω) and an integer m, we write D^{⊗m} ∈ ∆(Ω^m) for the m-fold product distribution obtained by drawing m independent samples s_1, . . . , s_m ∼ D and outputting (s_1, . . . , s_m).
As is usual in property testing of distributions, throughout this survey the distance between two distributions D_1, D_2 ∈ ∆(Ω) will be the total variation distance:

d_TV(D_1, D_2) = sup_{S⊆Ω} ( D_1(S) − D_2(S) ) = (1/2) ∑_{x∈Ω} | D_1(x) − D_2(x) |,

which takes values in [0, 1]. In some cases, it is useful to consider (either as a proxy towards total variation, or for the sake of the analysis) different metrics, such as the ℓ2, Kolmogorov, Earthmover's, or Hellinger distances. More on these can be found in Appendix C.
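To make the definition concrete, here is a minimal sketch (representation ours: distributions as dictionaries mapping domain elements to probabilities) computing the total variation distance as half the ℓ1 distance:

```python
def total_variation(p, q):
    """Total variation distance between two distributions given as dicts
    mapping domain elements to probabilities: half the l1 distance,
    which equals the maximum over subsets S of p(S) - q(S)."""
    support = set(p) | set(q)
    return sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support) / 2
```

For instance, the distance between the uniform distribution on two elements and the point mass on one of them is 1/2, as witnessed by the set containing the other element.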
A property P of distributions over Ω is a subset of ∆(Ω), consisting of all distributions that have the property. The distance from D to a property P, denoted d_TV(D, P), is then defined as inf_{D′∈P} d_TV(D, D′).
We recall the standard definition of testing algorithms for properties of distributions over Ω, where n is the relevant parameter for Ω (namely, in most cases, its size |Ω|). We chose to phrase it in the most general setting possible with regard to how the unknown distribution is "queried," and will specify this aspect further in the relevant sections (sampling access, conditional access, etc.).

Definition 3.1. Let P be a property of distributions over Ω, and let ORACLE_D be an oracle providing some type of access to D. A q-query testing algorithm for P (for this type of oracle) is a randomized algorithm T which takes as input n ∈ N, ε ∈ (0, 1), as well as access to ORACLE_D. After making at most q(ε, n) calls to the oracle, T returns either ACCEPT or REJECT, such that the following holds:
• if D ∈ P, then with probability at least 2/3, T accepts;
• if d_TV(D, P) > ε, then with probability at least 2/3, T rejects;
where the probability is taken over the algorithm's randomness and (if any) the randomness of the oracle's answers.
We sometimes write T^{ORACLE_D} to indicate that T has access to ORACLE_D. Additionally, we will also be interested in tolerant testers-roughly, algorithms robust against a relaxation of the first item above:

Definition 3.2. Let P and ORACLE_D be as in Definition 3.1. A q-query tolerant testing algorithm for P is a randomized algorithm T which takes as input n ∈ N, 0 ≤ ε_1 < ε_2 ≤ 1, as well as access to ORACLE_D. After making at most q(ε_1, ε_2, n) calls to the oracle, T returns either ACCEPT or REJECT, such that the following hold:
• if d_TV(D, P) ≤ ε_1, then with probability at least 2/3, T accepts;
• if d_TV(D, P) ≥ ε_2, then with probability at least 2/3, T rejects;
where the probability is taken over the algorithm's randomness and (if any) the randomness of the oracle's answers.
Note that these definitions in particular do not specify the behavior of the algorithms when d TV (D, P) ∈ (0, ε) (in Definition 3.1) or d TV (D, P) ∈ (ε 1 , ε 2 ) (in Definition 3.2): in this case, any answer from the tester is considered valid. Furthermore, we stress that the two definitions above only deal with the query complexity, and not the running time. Almost every lower bound will however apply to computationally unbounded algorithms, while most upper bounds we will cover are achieved by testing algorithms whose running time is polynomial in the number of queries they make.
The last definition we state here is that of distance estimators; that is, of algorithms which compute an approximation of the distance of the unknown distribution to a property.

Definition 3.3. Let P and ORACLE_D be as in Definition 3.1. A q-query distance estimation algorithm for P is a randomized algorithm A which takes as input n ∈ N, ε ∈ (0, 1], as well as access to ORACLE_D. After making at most q(ε, n) calls to the oracle, A outputs a value γ ∈ [0, 1] such that, with probability at least 2/3, it holds that d_TV(D, P) ∈ [γ − ε, γ + ε].
Remark 3.4 (Tolerant testing and distance approximation). Parnas, Ron, and Rubinfeld define and formalize in [80] the notion of tolerant testing, and show that distance approximation and (fully) 4 tolerant testing are equivalent, up to a logarithmic factor in 1/ε in the sample complexity (Claims 1 and 2, Section 3.1).
Generalization. These definitions can easily be extended to cover situations in which there are two "unknown" distributions D 1 , D 2 that are accessible via ORACLE D 1 and ORACLE D 2 oracles, respectively. For instance, we shall consider algorithms for testing whether D 1 = D 2 versus d TV (D 1 , D 2 ) > ε in such a setting, the property now being formally a subset of ∆(Ω) × ∆(Ω).
On adaptivity and one-sidedness. As usual in property testing, it is possible to specialize these definitions for some classes of algorithms. In particular, a tester which never errs when D ∈ P (but is only allowed to be wrong with probability 1/3 when D is far from P) is said to be one-sided; as defined above, testers are two-sided. More important in this survey is the notion of adaptive testers: if an algorithm's queries do not depend on the previous answers given by the oracle(s), it is said to be non-adaptive; if, however, the i-th query can be a function of the j-th answer for j < i, it is adaptive. (Roughly speaking, a non-adaptive algorithm is one that can write down all the queries it is going to make "in advance," after tossing its own random coins but before receiving any answer.)

On the domain and parameters. Unless specified otherwise, Ω will hereafter by default be the n-element set [n]. When stating the results, the accuracy parameter ε ∈ (0, 1] is to be understood as taking small values, either a fixed (small) constant or a quantity tending to 0 as n → ∞; however, the actual parameter of interest will always be n, viewed as "going to infinity." Hence any dependence on n, no matter how mild, shall be considered as more expensive than any function of ε only.

Standard Model 4 The setting
In this first and most common setting, the testers access the unknown distribution by getting independent and identically distributed (i.i.d.) samples from it.
Definition 4.1 (Standard access model (sampling)). Let D be a fixed distribution over Ω. A sampling oracle for D is an oracle SAMP D defined as follows: when queried, SAMP D returns an element x ∈ Ω, where the probability that x is returned is D(x) independently of all previous calls to the oracle.
This definition immediately implies that all algorithms in this model are by essence non-adaptive: indeed, any tester or tolerant tester can be converted into a non-adaptive one without affecting the sample complexity. (This is a direct consequence of the fact that all an adaptive algorithm can do when interacting with a SAMP oracle is decide, based on the samples it already got, whether to stop asking for samples or to continue.)

A trivial upper bound. It is good to keep in mind that a vast majority of the testing problems studied in this model have an O(|Ω|/ε²) upper bound on their sample complexity. Indeed, it is known that any distribution D ∈ ∆(Ω) can be learnt to accuracy ε with this many samples (see, e.g., [45, Theorems 2.2 and 3.1]); and once a good enough approximation D̂ has been obtained, it is in most cases enough to check whether D̂ is close to the property in order to conclude about D.
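The "learner" behind this trivial upper bound is simply the empirical (plug-in) estimator; a minimal sketch, with distributions again represented as frequency dictionaries:

```python
from collections import Counter

def empirical_distribution(samples):
    """Plug-in estimate of D from i.i.d. samples: output the observed
    frequencies. With O(|Omega|/eps^2) samples, this estimate is
    eps-close to D in total variation with high probability."""
    m = len(samples)
    return {x: c / m for x, c in Counter(samples).items()}
```

The testing-by-learning scheme then accepts if and only if the resulting estimate is close enough to the property P (a purely offline check, requiring no further samples).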
On a related matter, one may wonder whether the standard "testing by learning" argument that holds for Boolean functions also applies to distributions; that is, is testing a property P always at most as hard as (properly) learning the class P?5 It turns out this is not the case for distributions: for instance, we shall see in Section 5.1 that testing uniformity requires sample complexity Ω(√|Ω|), while learning the uniform distribution trivially costs exactly 0 samples. The reason for this difference stems from the fact that while estimating the distance between two Boolean functions is easy, approximating the distance between two distributions even to constant accuracy requires |Ω|^{1−o(1)} samples (Theorem 5.12).
Lower bounds. As common in property testing, proving lower bounds in this model usually comes down to defining two distributions, D yes and D no , over distributions (yes-instances, having the property, and no-instances being far from it, respectively). 6 Then, the key is to argue that with high probability over the choice of (D yes , D no ) ∼ D yes × D no , no q-query algorithm can distinguish between D yes and D no with probability more than, say, 1/4: this in turn proves that any successful tester must have sample complexity greater than q.
In doing so, tools from Appendix D.2 are commonly employed, often implicitly. The reader unfamiliar with the notion of indistinguishability of transcripts 7 or the use of Yao's principle may find there a useful complement.

Testing identity and closeness of general distributions
In this section, we consider the three following testing problems, each of them a generalization of the previous one:
• Uniformity testing: given oracle access to D, decide whether D = U_Ω (the uniform distribution on Ω) or is ε-far from it;
• Identity testing: given oracle access to D and the full description of a fixed D*, decide whether D = D* or is ε-far from it;
• Closeness testing: given independent oracle access to D_1, D_2 (both unknown), decide whether they are equal or ε-far from each other.
The results below apply to any finite domain Ω; for convenience, we denote |Ω| by n, and write U for U Ω .

Testing uniformity
This problem, arguably the most fundamental and widely studied, asks to distinguish whether the unknown distribution is uniform on the known domain, or is at distance at least ε from uniform. Phrased as a property testing question, it was first implicitly considered for the ℓ2 norm by Goldreich and Ron [60], in the context of testing whether a bounded-degree regular graph is an expander (i.e., whether the distribution over vertices obtained after a short random walk on the graph is close to uniform). In this section, we cover the following result.
Theorem 5.1 (Testing uniformity). There exists an algorithm which, given SAMP access to an unknown distribution D ∈ ∆(Ω), satisfies the following. On input ε ∈ (0, 1), it takes O(√n/ε²) samples from D, and
• if D = U, then with probability at least 2/3, the algorithm accepts;
• if d_TV(D, U) > ε, then with probability at least 2/3, the algorithm rejects.
Furthermore, this sample complexity is tight.
In [60], Goldreich and Ron showed that one could efficiently estimate the ℓ2 norm of an unknown distribution, and described how this primitive could be used for uniformity testing. We restate their result, as found later in [17, Theorem 12] (see also [18, Lemma 4] for a detailed analysis of the algorithm):

Theorem 5.2 ([60], rephrased). Given SAMP access to an unknown distribution D ∈ ∆(Ω), there exists an algorithm that takes O((√n/ε²)·log(1/δ)) samples and returns a value p̂ such that, with probability at least 1 − δ, p̂ is a multiplicative (1 ± ε)-approximation of ‖D‖₂².

The high-level idea is to count the number of collisions amongst m samples, that is, the number of pairs of samples with the same value. It is not hard to show that the expected number of such collisions is (m choose 2)·‖D‖₂²; the crux is then to bound the variance of this estimator in order to apply Chebyshev's inequality. The next step is to observe that ‖D − U‖₂² = ‖D‖₂² − 1/n. Combined with the general relation between the ℓ1 and ℓ2 metrics, namely ‖D − U‖₁ ≤ √n·‖D − U‖₂, one can show that to test uniformity to ε in ℓ1 distance (and thus in total variation distance, up to constant factors), it is sufficient to test it to ε/√n in ℓ2 distance. To do so, it is in turn sufficient, with the above observation, to get a multiplicative estimate of ‖D‖₂² up to (1 + ε²/2), i.e., to separate ‖D‖₂² ≥ (1 + ε²)·(1/n) from ‖D‖₂² = 1/n; which, by Theorem 5.2, can be done with O(√n/ε⁴) samples.
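A hedged sketch of the collision-based estimator just described (quadratic-time pair counting for clarity; an equivalent linear-time version sums (c_x choose 2) over the multiplicities c_x of each observed element):

```python
from itertools import combinations
from math import comb

def collision_estimate(samples):
    """Unbiased estimate of ||D||_2^2: the fraction of pairs of samples
    that collide, since each pair collides with probability ||D||_2^2."""
    m = len(samples)
    collisions = sum(1 for a, b in combinations(samples, 2) if a == b)
    return collisions / comb(m, 2)

def l2_dist_to_uniform_sq(samples, n):
    """Estimate of ||D - U||_2^2, using ||D - U||_2^2 = ||D||_2^2 - 1/n."""
    return collision_estimate(samples) - 1 / n
```

The tester then rejects whenever this estimate significantly exceeds its value under uniformity, i.e., whenever the estimated ‖D‖₂² is larger than roughly (1 + ε²/2)·(1/n).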
Remark 5.3. This is an example of an important paradigm: using the more convenient ℓ2 as a proxy towards ℓ1. Albeit non-optimal (in terms of ε), the above tester has an additional feature: it is "weakly tolerant," in the sense that it actually allows one to test whether d_TV(D, U) ≤ ε/(2√n) versus d_TV(D, U) > ε.

Theorem 5.4 ([60], rephrased). There exists an algorithm which, given SAMP access to an unknown distribution D ∈ ∆(Ω), satisfies the following. On input ε ∈ (0, 1), it takes O(√n/ε⁴) samples from D, and
• if d_TV(D, U) ≤ ε/(2√n), then with probability at least 2/3, the algorithm accepts;
• if d_TV(D, U) > ε, then with probability at least 2/3, the algorithm rejects.
(The log(1/δ) dependence in Theorem 5.2 is obtained by the standard "median trick": run the constant-confidence estimator independently O(log(1/δ)) times and take the median value.)

An O(√n/ε²) upper bound. Paninski [79] later improved on this bound, reducing the dependence on ε (under the restriction that ε = Ω(1/n^{1/4})). The algorithm proposed is very similar in spirit: the high-level idea is to count the number K_1 of "non-collisions," that is, the number of elements sampled exactly once, and to reject if this number is significantly less than its expected value under the uniform distribution. The key to the savings, here, is to work directly with the ℓ1 distance, and to work out the complications to still get a good enough bound on the variance of K_1.
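Paninski's statistic itself is even simpler to compute than the collision count; a small sketch (function name ours):

```python
from collections import Counter

def unique_element_count(samples):
    """Paninski's statistic K_1: the number of domain elements appearing
    exactly once among the samples. Under the uniform distribution on n
    elements its expectation is m*(1 - 1/n)^(m-1); a significant deficit
    of unique elements is evidence of non-uniformity."""
    return sum(1 for c in Counter(samples).values() if c == 1)
```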
One can also derive the above O(√n/ε²) upper bound (for all ε > 0) from the testing algorithm of Chan et al. [39] for the ℓ2 distance, along with the usual relation between the ℓ2 and ℓ1 norms. Diakonikolas et al. [49] and Acharya et al. [5] both recently gave another proof of the O(√n/ε²) upper bound (again, without the restriction on the range of ε), with approaches based on a modified χ²-test. The first describes an optimal ℓ2-tester for uniformity which, by taking this many samples, is able to distinguish between D = U and ‖D − U‖₂ > ε/√n: this stronger ℓ2 guarantee immediately implies a tester in total variation distance.8 The second works directly with the χ²-divergence, obtaining a tester with "hybrid tolerance," namely, a tester that can differentiate between d_{χ²}(D ‖ U) ≤ ε²/2 and d_TV(D, U) > ε, which also implies the desired uniformity testing result. (See Section 6.6 for more details.)

An Ω(√n/ε²) lower bound. As a foretaste, it is easy to show, by the birthday paradox, that any tester for uniformity in the standard model must draw Ω(√n) samples, even for ε = 1/2 [60,18]. Indeed, taking without loss of generality the domain to be [n] and n to be even, let the family of "no-distributions" D_no be defined as follows: for any permutation π ∈ S_n of the domain, D_π puts weight 2/n on each of π(1), . . . , π(n/2), and weight 0 on the remaining n/2 elements.

8 Their ℓ2-tester actually offers even a bit more, allowing one to distinguish between ‖D − U‖₂ < ε/(2√n) and ‖D − U‖₂ > ε/√n.
It is straightforward to verify that, for any D ∈ D_no, d_TV(D, U) = 1/2. However, for D chosen uniformly at random in D_no, the Birthday Paradox implies that any algorithm taking o(√n) samples will, with probability 1 − o(1), see no collision: that is, all elements drawn will be distinct. Conditioned on this, the distribution of the transcript under U and that under a (random) no-distribution D are identical, and thus no algorithm can distinguish between the two cases.
From the upper bound above, the √n dependence is tight. But the status of the dependence on ε remained open until Paninski [79], who proved a matching Ω(√n/ε²) lower bound. The construction is, not surprisingly, very similar to the one above: a no-instance D ∈ D_no is defined by n/2 independent coin tosses, according to which each consecutive pair of elements (2i − 1, 2i) is assigned weights either ((1 − 2ε)/n, (1 + 2ε)/n) or ((1 + 2ε)/n, (1 − 2ε)/n). Each such distribution is thus exactly ε-far from uniform. The proof then goes by showing that, as long as m = o(√n/ε²), the total variation distance between U^{⊗m} and the "expected distribution of a transcript of m samples from a randomly chosen no-distribution," defined as E_{D∼D_no}[D^{⊗m}], is o(1). (That is, the goal is to show that the distance between the distribution of a transcript from the uniform distribution and that of a transcript from a random no-distribution is small.) This last part is done by applying techniques from [82, Section 14.4]: first, writing

2 d_TV(P, Q) = E_{x∼Q}[ |dP/dQ(x) − 1| ] ≤ ( E_{x∼Q}[ (dP/dQ(x) − 1)² ] )^{1/2}

(the last inequality by Jensen), where dP/dQ denotes the density of P with regard to Q;9 now, expanding the inner square and massaging the explicit yet discouraging expression of ∆(x_1, . . . , x_m) := dP/dQ(x_1, . . . , x_m) − 1, one can finally obtain an upper bound of (e^{m²ε⁴/n} − 1)^{1/2}, which is o(1) whenever m = o(√n/ε²).
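For concreteness, here is a small sketch generating one no-instance from Paninski's family (the seeding convention is ours):

```python
import random

def paninski_no_instance(n, eps, seed=0):
    """One no-instance from Paninski's lower-bound family: each pair of
    elements (2i, 2i+1) receives weights ((1+2*eps)/n, (1-2*eps)/n), in an
    order decided by an independent fair coin flip. Assumes n is even."""
    rng = random.Random(seed)
    d = [0.0] * n
    hi, lo = (1 + 2 * eps) / n, (1 - 2 * eps) / n
    for i in range(0, n, 2):
        d[i], d[i + 1] = (hi, lo) if rng.random() < 0.5 else (lo, hi)
    return d
```

Every vector produced sums to 1, and each entry deviates from 1/n by exactly 2ε/n, so the total variation distance to uniform is exactly ε, as claimed.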

Testing identity
In this section, we cover the following result, settling the sample complexity of testing identity to a known distribution.
Theorem 5.5 (Testing identity). There exists an algorithm which, given the full specification of D* ∈ ∆(Ω) and SAMP access to an unknown distribution D, satisfies the following. On input ε ∈ (0, 1), it takes O(√n/ε²) samples from D, and
• if D = D*, then with probability at least 2/3, the algorithm accepts;
• if d_TV(D, D*) > ε, then with probability at least 2/3, the algorithm rejects.
Furthermore, this sample complexity is tight.

9 That is, in our discrete setting, dP/dQ denotes the (or rather, a) function f : Ω → R such that P(x) = f(x)Q(x) for every x ∈ Ω.
An Õ(√n·poly(1/ε)) upper bound. This first algorithm, due to Batu et al. [17], relies on an important idea: reducing identity testing of D to testing uniformity on a small number of distributions. This is done by using the general technique of bucketing, which, given an explicit distribution D*, partitions the domain into logarithmically many sets ("buckets") such that D* is almost uniform on each bucket.

Definition 5.6 (Bucketing). Let ℓ := Θ(log n/ε), and define B_0, . . . , B_ℓ as follows: B_0 contains the "light" elements, those x with D*(x) < ε/(2n); and, for j ≥ 1, B_j contains the elements x with D*(x) ∈ [ (ε/(2n))·(1+ε)^{j−1}, (ε/(2n))·(1+ε)^j ). (The exact thresholds vary across the literature; any geometric partition with ratio 1+ε yields the properties below.) From this definition, it is not hard to see that D*(B_0) ≤ ε/2, and that both

max_{x∈B_j} D*(x) ≤ (1+ε)·min_{x∈B_j} D*(x) and d_TV(D*_{B_j}, U_{B_j}) = O(ε)

hold for all j ≥ 1 (see, e.g., [17, Lemma 8]).
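As an illustration, the following sketch implements one concrete instantiation of such a bucketing (the thresholds are chosen here for concreteness; the exact constants in [17] may differ): elements lighter than ε/(2n) go to B_0, and the rest are binned into geometric bands of ratio 1 + ε, so that D* varies by at most a (1 + ε) factor within each bucket.

```python
import math

def bucket(dstar, n, eps):
    """Partition the domain of the explicit distribution dstar (a dict
    element -> probability) into buckets: B_0 holds elements of mass
    below eps/(2n), so D*(B_0) <= eps/2; bucket B_j (j >= 1) holds the
    elements whose mass lies in the j-th geometric band of ratio 1+eps."""
    floor_mass = eps / (2 * n)
    buckets = {0: []}
    for x, px in dstar.items():
        if px < floor_mass:
            buckets[0].append(x)
        else:
            j = 1 + int(math.log(px / floor_mass, 1 + eps))
            buckets.setdefault(j, []).append(x)
    return buckets
```

Since each element of B_0 has mass below ε/(2n) and there are at most n of them, D*(B_0) ≤ ε/2; and the number of nonempty buckets is at most log_{1+ε}(2n/ε) = Θ(log n/ε).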
The algorithm from [17] then works as follows: after bucketing the domain according to the known distribution D*, it takes O(√n log n/ε⁶) samples from D. For each bucket B_j such that D*(B_j) ≥ ε/ℓ, it checks whether enough (namely, Ω(√n/ε⁴)) samples fell in B_j, and rejects otherwise. If enough samples hit the bucket, it uses them to estimate ‖D_{B_j}‖₂² to a multiplicative (1 + ε²), and rejects if this reveals a deviation from uniform. The last step, assuming no rejection occurred, is to check whether the two distributions induced on {B_0, . . . , B_ℓ} by D* and D are (ε/2)-close or ε-far from each other, and reject in the latter case. (As these distributions now have a domain of size logarithmic in n, the sample complexity is not an issue.) If all tests pass, the properties of the bucketing ensure that D and D* must be close: as their conditional distributions are both O(ε)-close to uniform on any interval of the bucketing on which they put weight Ω(ε/ℓ), D and D* must be O(ε)-close on the union of all such intervals; and the contribution of the other intervals to the distance can be ignored, as they add in total at most ℓ·O(ε/ℓ) = O(ε). Furthermore, it is not hard to see that if D = D*, the tester will not reject. What Batu et al. proved is actually slightly stronger: their tester-again, at a very high level, because of the use of the ℓ2 norm as an intermediate step-has some weak tolerance:

Theorem 5.7 ([17, Theorem 24]). There exists an algorithm which, given the full specification of D* ∈ ∆(Ω) and SAMP access to an unknown distribution D, satisfies the following. On input ε ∈ (0, 1), it takes O((√n/ε⁶)·log n) samples from D, and
• if d_TV(D, D*) = O(ε³/√(n log n)), then with probability at least 2/3, the algorithm accepts;
• if d_TV(D, D*) > ε, then with probability at least 2/3, the algorithm rejects.
A tight O(√n/ε²) upper bound. We first note that, as identity testing is at least as hard as uniformity testing, the Ω(√n/ε²) lower bound on the sample complexity of the latter still applies. Thus, the question is now whether the actual sample complexity is closer to the upper bound from the previous section, or to this lower bound. The answer is due to Valiant and Valiant [99];10 and, as we will shortly see, so is (a stronger version of) the lower bound.
To understand their result, we first have to take a small detour and define what it means for an algorithm to be instance-optimal. Recall that, by our definition, the sample complexity of an algorithm for identity testing is taken to be the worst case over all choices of the "known distribution" D*, and in particular is not allowed to depend on D*. What [99] argue is that, for many D*, identity testing may be significantly easier (e.g., consider the case of a distribution putting all its weight on a single element). They model this by allowing the sample complexity of the algorithm to depend on D*, in addition to the usual parameters; and an instance-optimal tester is a tester whose sample complexity for testing any D* is optimal even compared to an algorithm specifically designed for this D*.
Before stating their main theorem, we need a couple last notations. Given a distribution D ∈ Δ(Ω) (seen as an n-dimensional vector of probabilities), define D^{-max}_{-η} to be the vector where the biggest entry has been zeroed out, as well as the set of smallest entries summing to η. 11 Although D^{-max}_{-η} is no longer a probability distribution, its 2/3-(quasi)norm as a vector is still defined: ‖D^{-max}_{-η}‖_{2/3} = (Σ_i (D^{-max}_{-η}(i))^{2/3})^{3/2}. Strange as it may seem, 12 this quantity exactly characterizes the complexity of testing identity.

Theorem 5.7 ([99]). There exists an algorithm which, given the full specification of D* ∈ Δ(Ω) and SAMP access to an unknown distribution D, satisfies the following. On input ε ∈ (0, 1), it takes O(max(‖D*^{-max}_{-cε}‖_{2/3}/ε^2, 1/ε)) samples from D (for some absolute constant c > 0), and
• if D = D*, then with probability at least 2/3, the algorithm accepts;
• if d_TV(D, D*) > ε, then with probability at least 2/3, the algorithm rejects.
Furthermore, this sample complexity is tight: no algorithm taking o(max(‖D*^{-max}_{-c′ε}‖_{2/3}/ε^2, 1/ε)) samples (for some absolute constant c′ > 0) can correctly perform this task. 10 We note that subsequently to [99], the works of Diakonikolas et al. [49] and Acharya et al. [5] mentioned in the previous section also imply this O(√n/ε^2) upper bound. See Section 7.3 and Section 6.6 for a more detailed description of their results. 11 That is, the largest-probability element and η weight of the smallest ones have been removed. 12 It does seem, to the author at least.
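The quantity ‖D^{-max}_{-η}‖_{2/3} can be computed directly from an explicit description of D. The sketch below follows the definition above, with a simplified greedy, whole-entry convention for removing the η mass of smallest entries (the function name is ours):

```python
def truncated_two_thirds_norm(d, eta):
    """Compute ||D^{-max}_{-eta}||_{2/3}: zero out the single largest entry
    of the probability vector d, greedily remove smallest entries totalling
    at most eta, and return (sum_i p_i^(2/3))^(3/2) of the remainder.
    (Partial removal of a boundary entry is ignored in this sketch.)"""
    v = sorted(d, reverse=True)[1:]      # drop the biggest entry
    removed = 0.0
    while v and removed + v[-1] <= eta:  # drop smallest entries, up to eta
        removed += v.pop()
    return sum(p ** (2.0 / 3.0) for p in v) ** 1.5
```

As a sanity check, a point mass has truncated norm 0 (identity to it is trivially easy once its single heavy element is removed), while for the uniform distribution the quantity is Θ(√n), recovering the worst-case bound.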
(Without going into more detail, we note that their upper bound is achieved by a modified version of Pearson's χ^2-test, 13 as will be one of the upper bounds from the next subsection. As for the lower bound, it is, at a very high level, shown by leveraging the nice properties of the Hellinger distance with regard to product distributions to bound the distance between two random processes corresponding to the yes- and no-instances; namely, instead of bona fide distributions, looking at each element i in the domain as generating independently either Poisson(k·D*(i)) samples, or Poisson(k·(D*(i) ± ε_i)) samples for some good choice of "perturbation" ε_i. One can finally conclude by using the relation between Hellinger and total variation distances.) To see why the above theorem implies the claimed O(√n/ε^2) upper bound on testing identity, it is enough to observe that, for all D* ∈ Δ(Ω), ‖D*^{-max}_{-cε}‖_{2/3} ≤ ‖D*‖_{2/3} ≤ √n. It is worth pointing out the implications for other distributions: for instance, the same argument along with a simple computation of its 2/3-norm shows an Ω(n^{1/4}/ε^2) lower bound for testing identity to the Binomial distribution Bin(n, 1/2).

Testing closeness
In this section, we cover the following result, which completely characterizes the sample complexity of the last of the three problems:

Theorem 5.9 (Testing closeness). There exists an algorithm which, given SAMP access to two unknown distributions D_1, D_2 ∈ Δ(Ω), satisfies the following. On input ε ∈ (0, 1), it takes O(max(n^{2/3}/ε^{4/3}, √n/ε^2)) samples from D_1 and D_2, and
• if D_1 = D_2, then with probability at least 2/3, the algorithm accepts;
• if d_TV(D_1, D_2) > ε, then with probability at least 2/3, the algorithm rejects.
Furthermore, this sample complexity is tight.
An Õ(n^{2/3}) · poly(1/ε) upper bound. The first algorithm we describe here is due to Batu et al. [18], and again uses testing with regard to the ℓ_2 distance as a first step. (We reproduce here the (later) version of this result by Chan et al., which improves quadratically the dependence on ε.)

Theorem 5.10 ([39, Theorem 1.2]). There exists an algorithm which, given SAMP access to two unknown distributions D_1, D_2 ∈ Δ(Ω), satisfies the following. On input ε ∈ (0, 1) and b ∈ [0, 1], it takes O(√b/ε^2) samples from D_1 and D_2, and, provided ‖D_1‖_2^2, ‖D_2‖_2^2 ≤ b,
• if D_1 = D_2, then with probability at least 2/3, the algorithm accepts;
• if ‖D_1 − D_2‖_2 > ε, then with probability at least 2/3, the algorithm rejects.

13 Given the explicit description of a distribution D ∈ Δ(Ω) and a multiset S of m samples drawn from an unknown distribution, Pearson's χ^2-test is the quantity Σ_{i∈Ω} (S_i − m·D(i))^2/(m·D(i)), where S_i is the number of occurrences in S of the element i ∈ Ω.
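As a concrete reference point, the classical Pearson χ^2 statistic from the footnote can be computed as follows; the modified statistics of [99, 39] alter the numerator and denominator to control the variance, and this sketch only implements the classical version.

```python
from collections import Counter

def pearson_chi2(dstar, samples):
    """Classical Pearson chi-squared statistic of a sample multiset against
    a fully specified distribution D*: the sum over the domain of
    (S_i - m*D*(i))^2 / (m*D*(i)), with S_i the number of occurrences of i
    among the m samples."""
    m = len(samples)
    counts = Counter(samples)
    return sum((counts.get(i, 0) - m * p) ** 2 / (m * p)
               for i, p in enumerate(dstar) if p > 0)
```

A perfectly balanced sample against the matching D* yields a statistic of 0; deviations from the expected counts m·D*(i) inflate it.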
Furthermore, this sample complexity is tight.
The main observation here is that the sample complexity depends on an upper bound b on the ℓ_2 norms of the distributions (a similar theorem can be obtained with B being an upper bound on the ℓ_∞ norms instead of the ℓ_2 norms; [18, Lemma 6] then proves a sample complexity of O((B^2 + ε^2 √B)/ε^4)). As, by Equation (5.1), ℓ_2 testing has to be done with error parameter ε′ = O(ε/√n), leveraging this dependence is crucial to get anything non-trivial. Thus, the tester from [18] proceeds in two steps, after taking O((n^{2/3} log n)/ε^2) samples from both distributions: first, it filters out "heavy" elements, i.e., those x with either D_1(x) ≥ 1/n^{2/3} or D_2(x) ≥ 1/n^{2/3}, and checks that each of these appears roughly the same number of times under both distributions. Then, it applies the ℓ_2 tester from above to the "filtered distributions" D′_1 and D′_2, which now have ℓ_∞ norm at most B = 1/n^{2/3}, with parameter ε′ = ε/(2√n). The resulting sample complexity from the two steps is Õ(n^{2/3}) · poly(1/ε), proving our first upper bound.
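A sketch of the first, filtering stage follows; for simplicity it thresholds on empirical frequencies, whereas the actual analysis of [18] is more careful about elements near the threshold, and the function name is ours.

```python
from collections import Counter

def split_heavy_light(samples1, samples2, n):
    """Set aside the 'heavy' elements: those whose empirical frequency under
    either sample exceeds 1/n^(2/3). Their probabilities are large enough to
    be compared directly via their counts; what remains has small l_inf
    norm, as required by the l_2 closeness tester."""
    thresh = n ** (-2.0 / 3.0)
    c1, c2 = Counter(samples1), Counter(samples2)
    m1, m2 = float(len(samples1)), float(len(samples2))
    heavy = {i for i in set(samples1) | set(samples2)
             if c1[i] / m1 >= thresh or c2[i] / m2 >= thresh}
    light1 = [x for x in samples1 if x not in heavy]
    light2 = [x for x in samples2 if x not in heavy]
    return heavy, light1, light2
```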
Remark 5.11. Chan et al. [39] later observe that, by (a) directly applying Theorem 5.10 instead of the original ℓ_2 tester of [18], and (b) improving the filtering approach to remove the log n factor (as conjectured by Batu et al.), the overall sample complexity of this two-stage approach can be reduced to O(n^{2/3}/ε^2), both steps being optimal. However, even then, the combined sample complexity is still not optimal, as we shall momentarily see.
An O(max(n^{2/3}/ε^{4/3}, √n/ε^2)) upper bound. Although the testing algorithm of [39] is algorithmically simple (being again a suitably modified variant of Pearson's χ^2-test), the main challenge in their work lies in the analysis, and more particularly in bounding the variance of the statistic they propose. Without going into the details, we reproduce in Algorithm 2 the algorithm itself.
An Ω(max(n^{2/3}/ε^{4/3}, √n/ε^2)) lower bound. As before, we first observe that the Ω(√n/ε^2) lower bound on the sample complexity of uniformity and identity testing still holds, closeness testing being at least as hard as these. For the second part of the lower bound, [39] use a construction similar to that of [19, 100] (which already gives an Ω(n^{2/3}) lower bound). The idea of the construction is to "hide" the distance between the two distributions of a no-instance (D_1, D_2). This is done by choosing Ω(n) light elements with either D_1(i) = 4/n and D_2(i) = 0 or the converse, while making the distributions coincide on Ω(n^{2/3}) heavy elements of weight D_1(i) = D_2(i) = ε^{4/3}/n^{2/3}. The non-zero light elements of D_1 and D_2 are disjoint, and thus would give away the difference; yet, with high probability, the heavy ones are the only elements that may appear several times (i.e., have collisions) when sampling from the distributions, unless enough samples are taken.
[Algorithm 2, the closeness tester of [39], is not reproduced here: it computes the number of occurrences of each element i in the two sample multisets S_1, S_2, and accepts or rejects by thresholding a χ^2-type statistic based on these counts.]

Tolerant testing and distance estimation
As a general rule, asking the testing algorithm to allow some "slack" around the property (i.e., to also accept distributions that are only close to satisfying it) makes the task much harder most of the time. At a very high level, the reason is that seeing a "violation" (that is, a statistically significant deviation from the property) is no longer sufficient to reject the distribution: one piece of evidence is not enough, and the tester must instead obtain quantitative bounds on the amount of violation.
An Ω(n/log n) lower bound. To see how much harder this can be, we start with the following lower bound on tolerant testing of uniformity (and hence of identity and closeness), 14 due to [98, 96]:

Theorem 5.12. There exists an absolute constant ε_0 > 0 such that the following holds. Any algorithm which, given SAMP access to an unknown distribution D ∈ Δ(Ω), distinguishes with probability at least 2/3 between (a) d_TV(D, U) ≤ ε_0 and (b) d_TV(D, U) ≥ 1/2 − ε_0, must have sample complexity Ω(n/log n).
This theorem follows from [96, Theorem 1], which shows for any φ < 1/4 the existence of pairs of instances (D_1, D_2) such that D_1 is φ log(1/φ)-close 15 to the uniform distribution on n elements, while D_2 is φ log(1/φ)-close to the uniform distribution on some subset of n/2 elements, and thus Ω(1)-far from uniform on n elements. Yet D_1 and D_2 cannot be distinguished with o(φ n/log n) samples. These distributions are explicitly constructed using properties of Laguerre polynomials, before arguing that the expected fingerprints of the two distributions (roughly speaking, the "number of k-way collisions, for all k's" 16) are very similar, applying for this a new Central Limit Theorem proven along the way. (The notion of fingerprint and its use in proving lower bounds are covered in more detail in Section 8.)

Remark 5.13. Following an observation from [96], we note that it is possible to get an Ω(n/(ε log n)) lower bound on the sample complexity of distinguishing d_TV(D, U) ≤ ε from d_TV(D, U) ≥ cε, for any ε ∈ (0, ε_0) and c = c(ε_0) > 1. This is done by replacing the D_1, D_2 defined above by the corresponding mixtures D′_i = (ε/ε_0)·D_i + (1 − ε/ε_0)·U: distinguishing D′_1 from D′_2 requires a factor 1/ε more samples than distinguishing between D_1 and D_2.
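The fingerprint of a sample is easy to compute, and the k-way collision counts of footnote 16 are a function of it; a short sketch (counting unordered k-subsets, one common convention):

```python
from collections import Counter
from math import comb

def fingerprint(samples):
    """Fingerprint of a multiset of samples: maps k to the number of domain
    elements observed exactly k times. This vector is all a tester for a
    symmetric property (uniformity, closeness, entropy, ...) can use."""
    occurrences = Counter(samples)        # element -> multiplicity
    return Counter(occurrences.values())  # multiplicity -> #elements

def kway_collisions(samples, k):
    """Number of k-way collisions, here counted as unordered k-subsets of
    equal samples: sum over elements of C(multiplicity, k)."""
    return sum(comb(c, k) for c in Counter(samples).values())
```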
An O(n/log n) upper bound. As we just saw, tolerant testing of uniformity, identity, and closeness of distributions with fewer than n^{1−o(1)} samples in the standard model is hopeless. The good news, on the other hand, is that these tasks can be performed with o(n) samples: more precisely, the odd-looking n/log n lower bound is tight. We only state below the result for tolerant closeness testing; it obviously also applies to uniformity and identity.

Theorem 5.14 ([98, 96]). There exists an algorithm which, given SAMP access to two unknown distributions D_1, D_2 ∈ Δ(Ω), satisfies the following. On input 0 ≤ ε_1 < ε_2 ≤ 1, it takes O(1/(ε_2 − ε_1)^2 · n/log n) samples from D_1 and D_2, and
• if d_TV(D_1, D_2) ≤ ε_1, then with probability at least 2/3, the algorithm accepts;
• if d_TV(D_1, D_2) > ε_2, then with probability at least 2/3, the algorithm rejects.
A final remark. While the dependence on n of these three problems is completely resolved, the exact dependence on (ε_2 − ε_1) remains open (or, at least, ajar). We also note that (as exemplified in Theorem 5.10) tolerant testing in ℓ_2 norm does not suffer from the same fate as in total variation: in the former setting, the sample complexity is left unchanged when going from testing to tolerant testing.
Finally, getting a bit ahead, we point out that the ideas and machinery developed in proving the tolerant testing results above have had other significant applications, in particular as seen in Section 8. 15 Actually, they prove a slightly stronger statement, using instead the relative Earthmover's distance. 16 Given a multiset S of samples and an integer k ≥ 1, a k-way collision is a k-tuple (s_1, …, s_k) of samples from S such that s_1 = ⋯ = s_k.

Testing for structure
In this part of the survey, we focus on properties related to the structure of the unknown distribution: for instance, its shape (is the probability mass function non-increasing?), its class (is it, for instance, a Binomial distribution? A Zipf distribution?), or some other structural characteristic (is a distribution on Ω_1 × Ω_2 a product distribution?). Answering this type of question can be useful for model selection (deciding which specialized algorithm to apply to the data), or crucial for specific applications. One may, e.g., think of health or risk analysis (is the probability of getting cancer decreasing with the distance to Fukushima?), or market applications (detecting whether a trend or shopping pattern is correlated with some particular feature, say geographic location). We start by covering specific properties known to be efficiently testable, such as monotonicity, being a k-histogram, and membership in parameterized classes; before turning, in Section 6.4 and Section 6.6, to recent results of [30] and [5], which generalize many of these specific cases into one testing framework.
Importantly, this section deals with arbitrary distributions, which one must test for such structural properties; the question of leveraging known structure of the distribution in order to test for an additional property it may have will be the focus of Section 7.

Monotonicity
In this section, 17 we cover the problem of testing monotonicity of a distribution over [n]. Recall that D ∈ Δ([n]) is said to be monotone (non-increasing), denoted D ∈ M, if D(1) ≥ ⋯ ≥ D(n), i.e., if its probability mass function is non-increasing. We stress that the definition of the property assumes a total order on the domain; hence the choice of Ω = [n] in this section. (The next subsection will briefly cover the case where the domain is a partially ordered set (poset), a setting for which different algorithms and techniques are required.) The following result, due to Batu et al. [20] (and later slightly improved in [30]), 18 almost completely settles (up to polylog(n) factors and the exact dependence on ε) the complexity of testing whether a distribution is monotone.

Theorem 6.1 (Testing monotonicity). There exists an algorithm which, given SAMP access to an unknown distribution D ∈ Δ([n]), satisfies the following. On input ε ∈ (0, 1), it takes Õ(√n/ε^{7/2}) samples from D, and
• if D ∈ M, then with probability at least 2/3, the algorithm accepts;
• if d_TV(D, M) > ε, then with probability at least 2/3, the algorithm rejects.
Furthermore, no algorithm taking o(√n/ε^2) samples can correctly perform this task.
We note that Acharya, Daskalakis, and Kamath [5] recently improved on this upper bound, achieving the optimal sample complexity O(√n/ε^2) for ε = Ω̃(n^{−1/4}). Their results and techniques are covered in Section 6.6. 17 Part of the following is adapted from [28]. 18 [20] originally claims an Õ(√n/ε^4) sample complexity, but their argument seems to only result in an Õ(√n/ε^6) bound. Subsequent work building on their techniques (and described in Section 6.4) obtains the ε^{7/2} dependence.
An O((√n/ε^{7/2}) polylog n) upper bound. The algorithm of Batu et al. works by taking this many samples from D, and then using them to recursively split the domain [n] in half, as long as the conditional distribution on the current interval is not close enough to uniform (or not enough samples fall into it). If the binary tree created during this recursive process exceeds O(log^2 n/ε) nodes, the tester rejects. They then show that this succeeds with high probability: specifically, that with high probability the leaves of the recursion yield a partition of [n] into ℓ = O(log^2 n/ε) intervals I_1, …, I_ℓ such that, for each j, either (a) the conditional distribution D_{I_j} is O(ε)-close to uniform on this interval; or (b) I_j is "light," i.e., has weight at most O(ε/ℓ) under D
(the first item relying on Theorem 5.2, which relates distance to uniformity and collision count via the ℓ_2 norm). This implies that the partition defines an ℓ-flat distribution D̄ which is ε/2-close to D, and which can easily be learned from another batch of samples. Once this is done, it only remains to test (e.g., via linear programming, which can be done efficiently) whether this D̄ is itself ε/2-close to monotone, and accept if and only if this is the case.
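The recursive splitting step can be sketched as follows; the stopping thresholds below are illustrative, not the paper's, and the ℓ_2 check is done on empirical frequencies rather than through the collision-based estimator the actual analysis uses.

```python
from collections import Counter

def recursive_partition(samples, n, eps):
    """Sketch of the decomposition behind the [20] tester: recursively halve
    [0, n) as long as the empirical conditional distribution on the current
    interval is neither close to uniform (in l_2) nor 'light'. Returns the
    leaf intervals, half-open."""
    m = len(samples)
    counts = Counter(samples)

    def l2_sq_to_uniform(lo, hi, weight):
        # ||D_I - U_I||_2^2 from empirical frequencies conditioned on I
        length = hi - lo
        return sum((counts.get(i, 0) / m / weight - 1.0 / length) ** 2
                   for i in range(lo, hi))

    leaves = []

    def split(lo, hi):
        weight = sum(counts.get(i, 0) for i in range(lo, hi)) / m
        if (hi - lo == 1 or weight <= eps / n          # trivial or light
                or l2_sq_to_uniform(lo, hi, weight) <= eps ** 2 / (hi - lo)):
            leaves.append((lo, hi))
        else:
            mid = (lo + hi) // 2
            split(lo, mid)
            split(mid, hi)

    split(0, n)
    return leaves
```

On a uniform sample the recursion stops immediately with a single interval; on a point mass it isolates the heavy element and discards the light remainder.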
An Ω(√n/ε^2) lower bound. The lower bound ([20, Theorem 11]) works by reducing the problem of uniformity testing to monotonicity testing. More specifically, assume for the sake of simplicity that n is even, and let D ∈ Δ([n]) be the distribution to be tested for uniformity. One can run a monotonicity tester (with parameter ε′ := ε/3) on both D and D̃, where the latter is defined as D̃(i) := D(n + 1 − i) for i ∈ [n]; and accept if and only if both tests pass. If D is uniform, clearly D = D̃ is monotone; conversely, one can show that if both D and its "mirrored version" D̃ pass the test (i.e., are ε′-close to monotone), then it must be the case that D is ε-close to uniform. The result then follows 19 from the Ω(√n/ε^2) lower bound of Theorem 5.1.
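The reduction itself is one line once a monotonicity tester is available. Below is a sketch, with a stand-in tester that receives the distribution explicitly (whereas the real reduction merely forwards and re-indexes samples); both function names are ours.

```python
def mirror(dist):
    """Mirrored distribution on [n]: D~(i) = D(n + 1 - i), i.e. the
    probability vector reversed."""
    return dist[::-1]

def uniformity_test_via_monotonicity(dist, monotonicity_tester, eps):
    """The [20] reduction (sketch): accept iff both D and its mirror pass
    the monotonicity test, run with parameter eps/3."""
    eps_prime = eps / 3.0
    return (monotonicity_tester(dist, eps_prime)
            and monotonicity_tester(mirror(dist), eps_prime))
```

With an exact (rather than approximate) monotonicity checker plugged in, the uniform distribution passes, while any distribution that increases somewhere, or decreases too sharply, fails one of the two runs.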
Remark 6.2. At a (very) high level, the above results can be interpreted as "relating monotonicity to uniformity." That is, the upper bound is essentially established by reducing monotonicity testing to uniformity testing on polylogarithmically many intervals, while the lower bound follows from reducing uniformity testing to monotonicity testing on a constant number of instances.

Monotonicity over other posets
As aforementioned, the question of whether a distribution is monotone presupposes that its domain be endowed with an order relation. In the previous subsection, we covered the case of [n], where the order is total; however, the question was also considered for other partially ordered sets in [20]. In this work, they address monotonicity testing of distributions over the hypergrid Ω = [n]^d, with the corresponding coordinate-wise partial order. (Note that the parameter of interest is still n, and d is to be thought of as a possibly big constant.) Batu et al. then give an algorithm for testing monotonicity in this setting, with sample complexity Õ(n^{d−1/2}); and provide a lower bound of Ω(n^{d/2}) by the same reduction from uniformity as in the univariate case. Their upper bound, detailed for the case d = 2, is in the same spirit as before, adaptively partitioning the domain and checking uniformity on each of the resulting parts. In more detail, this is performed by recursively splitting [n] × [n] into 4 quadrants, stopping the recursion on a quadrant K = I × J if (i) the distribution on it is close to uniform, or (ii) K has very small weight, or finally (iii) the quadrant is far enough from the origin (1, 1). Otherwise, the quadrant is further split. In the two latter cases, K is marked as "light" (discarded), and at the end the total weight of all faraway quadrants is checked to be small enough (as, if D ∈ Δ([n] × [n]) were monotone, all these faraway sets' weights would have to be very small). In the first case, the univariate monotonicity test is used as a subroutine on a small number of (randomly chosen) univariate distributions D_{{i}×J}, to detect violations.

19 [20] actually only shows an Ω(√n) lower bound, as they invoke in the last step the (previously best known) lower bound of [60] for uniformity testing; however, their argument straightforwardly extends to the result of [79].
(The third criterion, (iii), ensures that the recursion tree does not have too many nodes, which is required to keep the sample complexity under control.) Subsequent work of Bhattacharyya, Fischer, Rubinfeld, and Valiant [23] extends these results to arbitrary posets, establishing strong upper and lower bounds under structural conditions on the underlying domain (e.g., for the lower bounds, whether the poset or its closure contains a large matching). Their work also includes the detailed argument for the general high-dimensional (d > 2) case of the [20] algorithm described above.
Finally, we note that the work of Acharya, Daskalakis, and Kamath [5], touched upon in Section 6.6, yields a tight O(n^{d/2}/ε^2) bound for the specific case of the hypergrid, for ε = Ω̃(√d/n^{1/4}).

Testing k-histograms
Another common class of distributions is the set H_k of k-histograms (or k-flat distributions). A distribution D belongs to H_k (where k is a parameter, possibly a function of n) if there exists a partition of [n] into k intervals I_1, …, I_k such that D is constant on each I_j. Indyk, Levi, and Rubinfeld study this property in [65], giving a learning algorithm in ℓ_2 norm as well as property testers for H_k in both ℓ_2 and total variation distances. Their two testers follow the same overall structure, which is reminiscent of the monotonicity tester of Section 6.1: they iteratively partition the support [n] into at most k pieces in a greedy fashion, trying at each stage to find, with some sort of binary search, the largest leftmost interval of the remaining support on which D either has very little weight, or is very close to uniform (in ℓ_2 norm). If the tester succeeds in identifying such a partition within k stages, it accepts, having effectively learned a k-histogram to which D is close; otherwise, it rejects.
This approach leads to an O(log^2 n/ε^4)-query tester in ℓ_2 norm, and an Õ(√(kn)/ε^5)-query one in total variation. An Ω(√n/ε^2) lower bound immediately follows from uniformity testing; [65] also prove that an Ω(√(kn)) dependence is necessary as long as ε < 1/k.
Observe that the testing problem as defined above assumes the partition I_1, …, I_k is unknown.
In the case where one is provided with this partition in advance, it is easy to design a tester with sample complexity O(n^{2/3}/ε^{4/3}), independent of k: indeed, given oracle access to D, it is easy to sample from the corresponding distribution D̄ defined by D̄(x) = D(I_j)/|I_j| (where I_j ∋ x). It then suffices to test closeness of D and D̄, as in Theorem 5.9, to conclude.
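Simulating one draw from the flattened distribution D̄ costs exactly one draw from D; a minimal sketch (the function name and the half-open interval representation are ours):

```python
import random

def sample_flattened(sample_oracle, partition):
    """Draw one sample from D_bar, where D_bar(x) = D(I_j)/|I_j| for x in
    I_j: draw x ~ D via the oracle, locate its interval [lo, hi), and
    return a uniformly random element of that interval."""
    x = sample_oracle()
    for lo, hi in partition:
        if lo <= x < hi:
            return random.randrange(lo, hi)
    raise ValueError("sample fell outside the given partition")
```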

Parameterized classes of distributions
We now turn to a related kind of question: instead of testing whether the unknown distribution has some "shape" (as in the case of monotonicity), we are interested in knowing whether it belongs to a set of parameterized distributions C = {D_θ}_θ, each (succinctly) characterized by a vector of parameters θ. Examples of such classes are the set of all Binomial distributions over a support of size n, where each distribution is defined by a single parameter p ∈ [0, 1]; the class of Gaussian distributions, where the parameters are a pair (µ, σ) ∈ R × R_+; or the class of Poisson distributions, of parameter λ ∈ R_+.
In the following, we focus on the class of Poisson Binomial Distributions (PBDs), 20 a generalization of Binomial distributions. A random variable follows a Poisson Binomial distribution if it is the sum of n independent Bernoulli random variables X_1, …, X_n, each with its own parameter p_i ∈ [0, 1]: thus, a PBD over {0, …, n} is parameterized by the n values p_1, …, p_n.

Testing Poisson Binomial Distributions
In recent work, [42] showed that the class of PBDs can be learned with Õ(1/ε^2) samples, independent of n: that is, knowing that a distribution has this specific structure enables one to learn it very efficiently. However, the game is fundamentally different in the testing setting, where the distribution is allowed to be arbitrary: in this case, Acharya and Daskalakis [4] prove that while Õ(n^{1/4}/ε^2 + 1/ε^6) samples suffice, Ω(n^{1/4}/ε^2) are also necessary.
An Õ(n^{1/4}/ε^2 + 1/ε^6) upper bound. As argued in [4], one could think of two natural ways of testing PBDs; each of them leading, unfortunately, to a sample complexity of Õ(√n). The first one would be to properly learn 21 the distribution D as if it were a PBD (which only costs Õ(1/ε^2) samples), yielding a candidate PBD D̂; and then to check whether this D̂ is indeed close to D with tolerant testing (which can be generically performed with O(m/log m) samples on a domain of size m, by Theorem 5.14; luckily enough, as PBDs have most of their probability weight concentrated on a small fraction of the domain, taking m = O(√(n log(1/ε))) here suffices to get good accuracy). It is not difficult to argue that this indeed constitutes a bona fide tester: if D is indeed a PBD, the learning phase will output a PBD D̂ close to D; while, by contrapositive, if the test passes then D is close to the hypothesis D̂, which itself belongs to the class.
The second naive approach is quite similar: it starts with a learning phase, but avoids the cost of tolerant testing in the second step by performing regular testing instead, leveraging the fact that the identity tester of Theorem 5.7 does provide a very small amount of tolerance. The drawback is that it becomes necessary to run the learning algorithm with very good accuracy in order to accommodate this limited tolerance. When working out the parameters carefully, the second stage indeed only requires O(n^{1/4}) samples; but the bottleneck is now the learning stage, which uses Õ(√n) of them. 20 Which, interestingly enough, have nothing to do with Poisson distributions whatsoever, besides having been studied by the same mathematician [81]. 21 We recall the definition of learning and proper learning algorithms in Appendix F.2.
To circumvent this seemingly hopeless tradeoff, the main insight of Acharya and Daskalakis is to observe that there is some useful information to be exploited in this second testing stage. Namely, the question is not to test whether an arbitrary distribution D is close to the known D̂ or far from it: it is to distinguish between the alternatives that D is (a) a Poisson Binomial Distribution that is close to D̂, versus (b) an arbitrary distribution that is far from it. While this distinction may seem innocuous at first glance, it allows them to exploit specific results on PBDs, and (modulo several case distinctions and many technical details) to reduce the problem in this particular case to an ℓ_2 testing problem, which itself can be performed efficiently. Overall, this results in a testing algorithm with the sample complexity Õ(n^{1/4}/ε^2 + 1/ε^6) claimed above, and illustrates an interesting paradigm: "testing for structure, exploiting this very purported structure in the distribution."

An Ω(n^{1/4}/ε^2) lower bound. [4] then proceed to show that the above sample complexity is optimal, up to polylogarithmic dependence on ε (for n sufficiently big relative to ε). Specifically, they describe a class of distributions Q_ε, comprised of "randomly perturbed Binomials," for which the following hold:
• with high probability, a random Q ∈ Q_ε is ε-far from unimodal;
• unless it takes at least Ω(n^{1/4}/ε^2) samples, no algorithm can distinguish between a randomly chosen Q ∈ Q_ε and the Bin(n, 1/2) distribution.
(The latter is shown using Le Cam's method, similarly to [79, 3], as was the case for the uniformity lower bound of Section 5.1; see Appendix E for more details.) As a consequence, and since all Poisson Binomial Distributions are log-concave (and therefore unimodal), this implies that testing PBDs indeed requires that many samples. 22

A unified approach
A simple observation is that many of the usual structured classes of distributions one would want to test are somehow related: monotone distributions are in particular unimodal, as are log-concave distributions. Monotone Hazard Rate (MHR) distributions are themselves a superset of both log-concave and monotone (non-decreasing) distributions; and the list goes on. Even when no such direct relation holds, there still are common structural aspects. This can lead to efficient and general learning algorithms, as demonstrated by Chan et al. [37, 38]; and, more germane to this survey, also has applications to testing. Indeed, Canonne et al. [30] show how to generalize the "partition-and-test" approach of [20] to any class of distributions enjoying some particular structural property: namely, any class that admits succinct approximations by flat (in a specific, ℓ_2 sense) distributions. They then give one efficient "meta-algorithm" that works for any such class of distributions, and whose sample complexity only depends on the parameters of these approximations, denoted below by Φ_P: 23

Theorem 6.4. There exists a single algorithm which, given SAMP access to an unknown distribution D ∈ Δ([n]) and a mapping Φ_P : (0, 1] × N → N (depending on P), satisfies the following, for every property P ⊆ Δ([n]). On input ε ∈ (0, 1), it takes q(ε, n, Φ_P(ε, n)) samples from D, and
• if D ∈ P, then with probability at least 2/3, the algorithm accepts;
• if d_TV(D, P) > ε, then with probability at least 2/3, the algorithm rejects.

22 Slightly stronger, this establishes the same Ω(n^{1/4}/ε^2) lower bound on testing the classes of Binomial distributions, log-concave distributions, and unimodal distributions. Note that comparable or tighter lower bounds can be obtained by the techniques of Section 6.4.
Moreover, for any property that satisfies a "natural structural criterion," this algorithm has near-optimal sample complexity q(·, ·, ·) (up to logarithmic factors and the exact dependence on ε). (Finally, the algorithm is, for many such properties, computationally efficient.) Instantiating this result, they derive "out-of-the-box" efficient testers for several classes of distributions, merely by showing that these satisfy the premise of the theorem (the formal definition of these classes is given in Appendix F.1):

Corollary 6.5. The classes of monotone, unimodal, log-concave, concave, convex, and monotone hazard rate (MHR) distributions can all be tested with Õ(√n/ε^{7/2}) samples.
Corollary 6.6. The class of t-modal distributions can be tested with Õ(√(tn)/ε^{7/2}) samples.

As a counterpart to this generic, "one-size-fits-all" testing algorithm, [30] also describe a framework to derive lower bounds for such classes. More specifically, they show that (under a relatively mild assumption) testing a class of distributions is at least as hard as testing identity to the worst distribution in the class:

Theorem 6.8. Let P be a class of distributions over [n] for which the following holds: (i) there exists a semi-agnostic learner L for P, with sample complexity q_L(n, ε, δ); (ii) there exists a subclass P_Hard ⊆ P such that testing P_Hard requires q_H(n, ε) samples.
Suppose further that q_L = o(q_H). Then any tester for P must use Ω(q_H) samples.
The idea behind this reduction is quite simple: given a tester for the class P, one can first test whether D is far from the class (if so, then D cannot possibly belong to P_Hard). Otherwise, it becomes possible to efficiently learn D using the semi-agnostic learner, before checking (without any further samples) whether the hypothesis obtained is indeed close to P_Hard. Taking P_Hard to be the singleton consisting of either the uniform or the Bin(n, 1/2) distribution (along with the testing lower bound of [99]), and leveraging the existence of semi-agnostic learners from [37, 38] (each with sample complexity either poly(1/ε) or poly(log n, 1/ε)), they are able to obtain or rederive the following:

Corollary 6.9. Testing log-concavity, convexity, concavity, MHR, unimodality, and t-modality each requires Ω(√n/ε^2) samples (the last one as long as t = o(√n)), for any ε ∈ (0, 1).
Finally, by proving a lower bound on testing one specific "simple" k-SIIRV distribution and invoking the agnostic learner of [40], they also obtain this last corollary:

Corollary 6.11. There exists an absolute constant c > 0 such that testing the class of k-SIIRV distributions requires Ω(k^{1/2} n^{1/4}) samples, for any k = o(n^c).
To conclude this section, we briefly mention that an analogue of Theorem 6.8 can easily be seen to apply to tolerant testing.

Other domains: testing independence
While this section has so far been dedicated to properties of distributions over [n], some other structural properties that have been studied bring the focus to different domains. An important example is independence [17, 12], which applies to distributions over product spaces. Recall that a distribution D over Ω_1 × ⋯ × Ω_d is said to be independent if it is equal to the product of its marginals, i.e., D = π_1 D ⊗ ⋯ ⊗ π_d D (or, equivalently, if for any random variable X = (X_1, …, X_d) distributed according to D the X_i's are independent). Batu et al. [17] and Levi et al. [73] consider the task of testing independence of distributions over [n] × [m], and give an Õ(n^{2/3} m^{1/3}) · poly(1/ε) upper bound as well as an Ω(n^{2/3} m^{1/3}) lower bound (assuming without loss of generality n ≥ m). 24 Their upper bound relies on the result below, which asserts that if a distribution is close to independent, then in particular it is close to the independent distribution defined by its own marginals:

Lemma 6.12 ([17, Proposition 1]). Let P, Q ∈ Δ(Ω_1 × Ω_2), and assume Q is independent. If d_TV(P, Q) ≤ ε, then d_TV(P, π_1 P ⊗ π_2 P) ≤ 3ε.
From there, their algorithm works roughly by dividing [n] into two sets, the heavy prefixes H (i.e., the elements i ∈ [n] for which π_1 D(i) ≥ n^{−α}, where α = α(n, m)) and the light prefixes L. It then tests the independence of D separately on H × [m] and L × [m], before finally checking that both induced distributions are consistent, by testing equivalence of π_2 D_{H×[m]} and π_2 D_{L×[m]}. The two tests performed in the second stage crucially leverage the "heaviness" or "lightness" of H and L: in the first case, |H| cannot be too large, making it possible to learn π_1 D_{H×[m]}; in the second case, the upper bound on ‖π_1 D_{L×[m]}‖_∞ makes it advantageous to apply, in one of the steps, an ℓ_2-based identity tester of [73]. We note that in both cases, Batu et al. perform a bucketing of at least one of the π_i D's, before testing D individually for independence on logarithmically many subdomains. Moreover, for the heavy-prefix case they introduce an elegant subroutine they refer to as a (D, D′)-sieve, which acts as follows. Given SAMP access to a distribution D whose projections are only close to uniform, the sieve provides access to another oracle SAMP_{D′}, where D′ is close to D and has roughly the same independence properties. Moreover, if D was independent then D′ is uniform, and if D was far from independent then D′ is far from uniform (see [17, Section 2.4] for the precise statements).

24 While [17] originally claimed an Õ(n^{2/3} m^{1/3}) · poly(1/ε) upper bound, there was a flaw in one of the lemmas their analysis relied on [17, Theorem 15]. To fix this issue, [73] later proved an alternative to this lemma, establishing the Õ(n^{2/3} m^{1/3}) · poly(1/ε) upper bound. The lower bound itself is based on (a variant of) the construction of [17], but the full and rigorous proof is due to [73], and requires n = Ω(m log m).
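Lemma 6.12 reduces independence testing to comparing D with the product of its own marginals; for an explicitly given joint distribution, this comparison is a direct computation (the function name is ours):

```python
def dist_to_product_of_marginals(joint):
    """Total variation distance between a distribution on [n] x [m], given
    as a matrix of probabilities, and the product pi_1 D (x) pi_2 D of its
    marginals (the reference distribution of Lemma 6.12)."""
    n, m = len(joint), len(joint[0])
    row = [sum(joint[i]) for i in range(n)]                       # pi_1 D
    col = [sum(joint[i][j] for i in range(n)) for j in range(m)]  # pi_2 D
    return 0.5 * sum(abs(joint[i][j] - row[i] * col[j])
                     for i in range(n) for j in range(m))
```

An independent joint distribution yields distance 0, while a perfectly correlated one on {0, 1} × {0, 1} is at distance 1/2 from the product of its (uniform) marginals.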
k-wise and non-uniform k-wise independence. Results also exist for the related properties of k-wise, almost k-wise and non-uniform k-wise independence on high-dimensional domains, typically the hypercube {0, 1}^n [12] or generalized product spaces Σ_1 × · · · × Σ_n [91]. We only briefly mention some of these results; the interested reader is encouraged to consult the above references.
Recall that a distribution on {0, 1}^n is k-wise independent if the marginal distribution it induces on any subset of k variables is uniform. Alon et al. give an Õ(n^k/ε^2)-query algorithm for testing k-wise independence in the standard sampling model (as well as a lower bound within a quadratic gap). Their algorithm relies on a structural result relating the distance to P_{k-wi} (the property of being k-wise independent) to the biases of the distribution on subsets of at most k variables. One can indeed characterize P_{k-wi} as follows:

Fact 6.13. For a distribution D over {0, 1}^n and a non-empty T ⊆ [n], let the bias of D over T be defined as bias_T(D) = E_{x∼D}[(−1)^{∑_{i∈T} x_i}], i.e., the difference between the probabilities that the parity of x over T is even or odd. Then D is k-wise independent if and only if bias_T(D) = 0 for every non-empty T with |T| ≤ k.

The aforementioned structural result is a robust version of this fact: the distance from D to P_{k-wi} is at most C_k · ∑_{0<|T|≤k} |bias_T(D)|, where C_k is a constant only depending on k. In particular, a distribution all of whose biases over small subsets are small must be close to k-wise independent.
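The quantities in Fact 6.13 are easy to compute empirically; the sketch below is a plain illustration of the definition (not the actual tester of [12], whose thresholds and analysis are more delicate):

```python
from itertools import combinations

def bias(samples, T):
    """Empirical bias of a distribution over {0,1}^n on index set T:
    the average of (-1)^{sum_{i in T} x_i}, i.e. Pr[parity over T is
    even] - Pr[parity over T is odd]."""
    s = 0
    for x in samples:
        parity = sum(x[i] for i in T) % 2
        s += 1 if parity == 0 else -1
    return s / len(samples)

def max_bias_up_to_k(samples, n, k):
    """Maximum absolute empirical bias over all non-empty T with |T| <= k.
    By the characterization above, this vanishes (in expectation) exactly
    when the distribution is k-wise independent."""
    return max(abs(bias(samples, T))
               for r in range(1, k + 1)
               for T in combinations(range(n), r))
```

For the exactly uniform distribution on {0, 1}^2 every bias is zero, whereas a point mass has some bias equal to 1.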

A "testing by learning" framework
Subsequent to an early version of this survey, recent work of Acharya, Daskalakis, and Kamath [5] describes improved and nearly optimal algorithms for testing monotonicity, unimodality, log-concavity, monotone hazard rate, and independence. In particular, their work essentially settles the gap from Theorem 6.1, by establishing an O(√n/ε^2 + log n/ε^4) sample upper bound for testing monotonicity. (We also note that their result extends to monotonicity in higher dimensions.) Their algorithms and techniques, albeit orthogonal to those described in Section 6.4, also follow a generic idea that applies to many "structured classes" of distributions. At a very high level, they take a testing-by-learning approach: first learning the unknown distribution as if it belonged to the class, then testing that the hypothesis obtained is both (a) close to the class and (b) close to the unknown distribution.
The key here resides in the second step, since tolerant identity testing is crucially not sample-efficient in general (as seen in Section 5.4). To circumvent this impossibility result, the authors introduce an elegant twist to the question, by learning in χ^2 distance instead of total variation (while the former is a harder task in general, the authors show it can be done efficiently for the classes considered). They then prove that the following relaxed question can, surprisingly, be solved with O(√n/ε^2) samples: "given a known distribution D* ∈ ∆(Ω) and SAMP access to an unknown distribution D ∈ ∆(Ω), distinguish between small d_{χ^2}(D ∥ D*) and big d_TV(D, D*):"

Theorem 6.15 (Testing identity with χ^2 tolerance). There exists an algorithm which, given the full specification of D* ∈ ∆(Ω) and SAMP access to an unknown distribution D, satisfies the following. On input ε ∈ (0, 1), it takes O(√n/ε^2) samples from D, and
• if d_{χ^2}(D ∥ D*) ≤ ε^2/10, then with probability at least 2/3, the algorithm accepts;
• if d_TV(D, D*) > ε, then with probability at least 2/3, the algorithm rejects;
where d_{χ^2}(D ∥ D*) = ∑_{x∈Ω} (D(x) − D*(x))^2 / D*(x).
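The flavor of such a χ^2-type statistic can be conveyed by a minimal sketch (the constants, the threshold, and the function name below are illustrative only; the actual tester and its analysis are in [5]):

```python
def chi2_identity_test(samples, q, eps):
    """Sketch of a chi^2-type identity test. `q` is the known distribution
    D* as a list of probabilities over {0, ..., n-1}; `samples` are draws
    from the unknown D. Returns True (accept) when the statistic is small."""
    m = len(samples)
    counts = [0] * len(q)
    for x in samples:
        counts[x] += 1
    # Centered chi^2 statistic: subtracting counts[i] cancels the sampling
    # variance term, so the statistic concentrates around m * d_chi2(D || D*).
    z = sum(((counts[i] - m * q[i]) ** 2 - counts[i]) / q[i]
            for i in range(len(q)) if q[i] > 0)
    return z <= m * eps ** 2 / 4  # illustrative threshold, not the paper's
```

On a perfectly typical uniform sample the statistic is (slightly) negative, while a point mass drives it far above any reasonable threshold.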

Testing with structure
The focus of this section is in some sense the counterpart of the previous one: instead of trying to decide whether a completely arbitrary distribution D possesses some structural features, we are now promised that D exhibits these features, and are asked to take advantage of this knowledge in order to test something else about D. This sort of question may arise in situations where a priori information is known about the data, either as a direct consequence of its origin or of the application, or because of the modeling assumptions made to explain a given phenomenon [84,102]. The hope is that these additional guarantees on the distribution to be tested would allow one to circumvent the lower bounds that hold in the general case, and to obtain much more efficient testing algorithms. As the next subsections will show, this hope is not ill-founded: many problems indeed become significantly easier when restricted to monotone distributions (Section 7.1), and identity and closeness testing can be performed with only polylog n samples as long as the unknown distributions are k-modal (Section 7.2). Finally, in Section 7.3 we cover a recent result of Diakonikolas et al. [49] that applies to a wide range of classes of distributions, and yields a (very) efficient algorithm for identity testing within these classes. (Unless specified otherwise, all results covered in this section apply to distributions over [n].)

Monotone distributions
We consider here the case where the unknown distribution D ∈ ∆([n]) is known to be monotone. As a first example, we give the following folklore result on testing uniformity.

Proposition 7.1 (Testing uniformity). There exists an algorithm which, given SAMP access to an unknown monotone distribution D ∈ ∆([n]), satisfies the following. On input ε ∈ (0, 1), it takes O(1/ε^2) samples from D, and
• if D = U, then with probability at least 2/3, the algorithm accepts;
• if d_TV(D, U) > ε, then with probability at least 2/3, the algorithm rejects.
Furthermore, this sample complexity is tight.
The proof of this proposition is quite simple, and boils down to the fact that if a monotone distribution is ε-far from uniform, it has to put weight at least (1 + ε)/2 on {1, . . . , n/2} (see, e.g., [20, Lemma 4]). Moreover, this easily generalizes to distributions only assumed to be α-close to monotone, for α sufficiently small with respect to ε.

Another property whose testing complexity changes drastically under the monotonicity assumption is that of closeness testing. Specifically, Batu et al. show in [20, Section 6.1] how to obtain an O(log^3 n/ε^3)-sample tester for closeness in this setting, in stark contrast to the Ω(n^{2/3}) general lower bound. This is, at a high level, done by first partitioning the domain into consecutive intervals on which one of the distributions D_1 is roughly uniform (using an algorithm of [16] for monotone distributions), before checking whether D_1 and D_2 are close on each of these intervals. Moreover, this sample complexity can be further improved: 25 Daskalakis et al. [43] later gave Õ(√log n)- and Õ(log^{2/3} n)-sample algorithms for testing identity and closeness of monotone distributions. We cover these results in Section 7.2, as part of a general technique they introduce for k-modal distributions.
Finally, we briefly mention two related results, due respectively to [16] and [41]. The first one states that for the task of getting a multiplicative estimate of the entropy of a distribution, assuming monotonicity enables exponential savings in sample complexity: O(log^6 n) samples, instead of n^{Ω(1)} for the general case. The second describes how to test whether an unknown k-modal distribution is in fact monotone, using only O(k/ε^2) samples. 26

Monotonicity over other posets
As in Section 6.1, similar questions were also studied for other partially ordered sets. On the hypergrid Ω = [n]^d, [20] use a "partition-and-test" approach to give an O(log^{2d/3+1} n) upper bound on the sample complexity of testing independence of monotone distributions. As for monotone distributions over the Boolean hypercube {0, 1}^n, Rubinfeld and Servedio [89] analyze an Õ(n/ε^2)-sample tester for uniformity, and develop a lower bound technique (of "subcube decomposition") that allows them to derive several interesting results. 27 In particular, they prove that their uniformity tester is essentially optimal, by giving an Ω(n) lower bound; and also provide an exponential lower bound of 2^{Ω(n)} for the sample complexities of testing identity and independence (as well as for multiplicative approximation of entropy). This highlights an important difference between the integer and the Boolean hypercube settings: in the latter, uniformity and identity testing are no longer equivalent.

25 We note that from a result of [24], learning a monotone distribution can be performed with O(log n/ε^3) samples; this implies the same upper bound on testing identity or closeness of monotone distributions, as one can always learn the unknown distribution(s) to sufficient accuracy, before checking closeness of the hypotheses obtained.
26 The authors then use this as a subroutine in a learning algorithm for k-modal distributions.
27 Subsequent work by [9] generalizes their uniformity testing results to the continuous case [0, 1]^n and the hypergrid [k]^n.

Identity, closeness and distance estimation of k-modal distributions
As mentioned in the previous subsection, Daskalakis et al. described in [43] a general support reduction technique that enables them to treat in a unified way the problems of identity and closeness testing (as well as their tolerant testing counterparts) for monotone and k-modal distributions. At the core of their upper bounds is a way to reduce the testing of these structured distributions on domain [n] to the same problem, but for arbitrary distributions on a much smaller domain [ℓ], where ℓ is O(log n/ε) for monotone distributions, and O(k log n/ε^2) for k-modal ones. Applying as black-box the algorithms from Section 5 to the reduced distributions on [ℓ], they obtain: 28

Monotone distributions:
• an Õ(√log n) · poly(1/ε)-sample tester for identity;
• an Õ(log^{2/3} n) · poly(1/ε)-sample tester for closeness;
• an O(log n / (ε^3 log log n))-sample tolerant tester for identity and closeness.
k-modal distributions:
• an Õ(√(k log n)) · poly(1/ε)-sample tester for identity;
• an Õ((k log n)^{2/3}) · poly(1/ε)-sample tester for closeness;
• an O(k^2/ε^4 + (k log n) / (ε^4 log(k log n)))-sample tolerant tester for identity and closeness.
The support reduction for monotone distributions relies on Birgé's oblivious decomposition: this is a partition of the domain, independent of the monotone distribution D, into ℓ(n, ε) = O((log n)/ε) intervals, which induces a "flattened" distribution D′ such that (i) D′ remains monotone, (ii) it is easy to sample from D′ given sample access to D, and (iii) D′ is O(ε)-close to D (specifically, D′ is an ℓ(n, ε)-histogram, obtained by making D uniform on each interval of the decomposition; see Appendix D.4 for more details). For the k-modal case, however, more work is necessary in order to identify a similar (no longer oblivious) decomposition. This leads in particular to the O(k^2/ε^4) overhead, incurred by the call to a subroutine CONSTRUCT-FLAT-DECOMPOSITION. This procedure roughly works by partitioning the unknown distribution into O(k/ε) monotone parts after learning (a crude approximation of) it, and applying Birgé's result to each of these parts.
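Birgé's oblivious decomposition is simple enough to sketch: interval lengths grow geometrically as (1 + ε)^k, independently of the distribution (rounding conventions below are ours; see Appendix D.4 for the precise statement):

```python
def birge_intervals(n, eps):
    """Birgé's oblivious decomposition of {1, ..., n}: intervals of
    geometrically increasing lengths (1 + eps)^k, independent of the
    distribution. Returns a list of (lo, hi) pairs; there are
    O(log(n)/eps) of them."""
    intervals, lo, length = [], 1, 1.0
    while lo <= n:
        hi = min(n, lo + int(length) - 1)
        intervals.append((lo, hi))
        lo = hi + 1
        length *= 1 + eps
    return intervals

def flatten(pmf, intervals):
    """The 'flattened' distribution D': spread each interval's total mass
    uniformly over that interval. For a monotone pmf, D' is O(eps)-close
    to the original."""
    out = [0.0] * len(pmf)
    for lo, hi in intervals:
        mass = sum(pmf[lo - 1:hi])
        for i in range(lo - 1, hi):
            out[i] = mass / (hi - lo + 1)
    return out
```

For n = 10 and ε = 1 the decomposition is (1,1), (2,3), (4,7), (8,10): four intervals instead of ten points.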
Lower bounds. As a pendant to the above reduction technique, Daskalakis et al. also describe a reduction in the other direction, enabling one to carry general testing instances on support [n] to k-modal testing instances on an exponentially bigger support [N]. More precisely, they show how to map an arbitrary pair of distributions D_1, D_2 over [n] to a pair of k-modal distributions D_{1,k}, D_{2,k} over [N], such that (a) the distances between the distributions are preserved; (b) SAMP access to the D_{i,k}'s can be efficiently simulated given SAMP access to the D_i's; and (c) N = Θ(k e^{8n(1+ln α)/k}), where α is the ratio between the maximum and minimum probabilities of D_1 and D_2. By applying this to the hard instance constructions for testing identity and closeness of general distributions (cf. Section 5), they derive optimal or near-optimal lower bounds:

Monotone distributions:
• an Ω(√log n) lower bound for identity;
• an Ω((log n / log log n)^{2/3}) lower bound for closeness;
• an Ω(log n / (log log n · log log log n)) lower bound for tolerant testing of identity and closeness.
k-modal distributions:
• an Ω(√(k log n)) lower bound for identity;
• an Ω((k log n / log(k log n))^{2/3}) lower bound for closeness;
• an Ω(k log n / (log(k log n) · log log(k log n))) lower bound for tolerant testing of identity and closeness.

Identity: a unified approach
In this last subsection, we describe a recent work of Diakonikolas, Kane and Nikishkin [49]. While the previous subsection focused on a particular class of distributions and leveraged its structure to get better algorithms for several testing problems, this paper deals solely with identity testing, 29 but gives a general algorithm that applies to a broad range of distribution classes. Roughly, their main result can be stated as follows:

Theorem 7.2 (informal). Let C ⊆ ∆([n]) be a distribution class such that the probability mass functions of any two D, D′ ∈ C cross (essentially) at most k times. Then, given any explicit D* ∈ C and SAMP access to an unknown distribution D ∈ C, one can test identity of D to D* with O(√k/ε^2) samples.
In the above, k "essential" crossings means that while the pmfs can cross an arbitrary number of times, most of the total variation distance between D and D* comes from at most k different intervals, on each of which either D > D* or D < D* holds throughout. As a direct application of this theorem, and invoking approximation results from [24,37,38], they obtain identity testers for distributions guaranteed to belong to various structured classes. 30

29 Subsequent work by the same authors obtains analogous results for closeness testing, using entirely different techniques [48].
30 See Appendix F.1 for the formal definition of these classes.
In more detail, the main insight in the proof of Theorem 7.2 is to part with the total variation distance, and consider instead one of its generalizations, the A_k-distance:

d_{A_k}(D_1, D_2) = max_{S ∈ S_k} |D_1(S) − D_2(S)|,

where S_k is defined as the family of all subsets of [n] that are the union of at most k intervals. 31 The reason to turn to this new distance is the observation that, as long as two distributions have at most k crossings, their A_k and total variation distances coincide. The authors then describe an optimal algorithm testing identity in the A_k-distance with sample complexity only depending on k (and not on the support size n), which implies the result above. In order to do so, [49] proceed in two steps: first, they show how to reduce general identity testing in the A_k-distance (over [n]) to uniformity testing in the A_k-distance (over a possibly much bigger support [N]). Then, they give an O(√k/ε^2)-sample tester, independent of the support size, for the latter problem, by designing and using as a subroutine a new (optimal) ℓ_2-tester for uniformity (see Section 5.1). The last ingredient in their approach is a carefully designed way to consider many possible partitions of the support, each time with a different number of intervals (namely, k, 2k, 4k, . . . , k/ε); before calling their ℓ_2-tester on the reduced distributions these partitions induce (with appropriate parameters). They show that if the original distribution over [N] is indeed far from uniform in A_k-distance, at least one of the reduced distributions will be far from uniform in ℓ_2 norm, guaranteeing that the tester will detect the discrepancy.
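For two known pmfs, the A_k-distance is easy to compute: an optimal set S can be taken to be a union of maximal runs on which D_1 − D_2 keeps a fixed sign, so it suffices to take the k largest run sums (a sketch, with our own function name):

```python
def ak_distance(p, q, k):
    """A_k distance between two pmfs on [n]: the maximum of |p(S) - q(S)|
    over unions S of at most k intervals. An optimal S consists of maximal
    sign-runs of p - q, so we take the k largest run sums of each sign."""
    diff = [pi - qi for pi, qi in zip(p, q)]

    def best(sign):
        runs, cur = [], 0.0
        for d in diff:
            if sign * d > 0:
                cur += sign * d
            elif cur > 0:
                runs.append(cur)
                cur = 0.0
        if cur > 0:
            runs.append(cur)
        return sum(sorted(runs, reverse=True)[:k])

    return max(best(+1), best(-1))
```

When p − q changes sign at most 2k times, d_{A_k}(p, q) recovers the full total variation distance, as in the discussion above.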

Estimating symmetric properties
For the task of getting an additive estimate of some property, in this case the (Shannon) entropy, of a distribution D over Ω given SAMP access to it, Paninski shows in [78] that achieving a sublinear sample complexity is possible, proving (non-constructively) the existence of an estimation algorithm using o(n) samples. (Note that [16,63] study the different question of obtaining a multiplicative estimate of the entropy: see Table 5 for a summary of these results.) The question of approximating the support size of a distribution has been studied in [83], where the authors proved an almost-linear lower bound on additive support size estimation: namely, that n^{1−o(1)} samples are required to guarantee additive error εn, for any constant ε < 1/2.
In this section, we cover subsequent work by Valiant and Valiant that addresses, among others, these two questions, and establishes matching upper and lower bounds for a whole family of testing problems.

31 We follow here the usual definition, as in, e.g., [45,38]. For technical reasons, [49] define the A_k-distance in a slightly different, but essentially equivalent way (up to constant factors).
32 The definition of the Kolmogorov distance, as well as of the other distance measures used in this survey, can be found in Appendix C.
Namely, across three successive works ([96,97], culminating with [98]) they build a framework which applies (under some mild restrictions) to any symmetric property of distributions. As a corollary, they obtain a tight Θ(n/log n) sample complexity for (additive) approximation of entropy and support size, and for tolerant testing of uniformity. 33

Symmetric properties, histograms and fingerprints. In order to describe the results, we need to introduce a few concepts. A property P of distributions over Ω is said to be symmetric if it is "invariant by relabeling": for any permutation π of the domain, D ∈ P if and only if D ∘ π ∈ P. This includes, for instance, uniformity, "having support size at least 5," or, for properties of pairs of distributions, "being equal." By a slight abuse of notation, we also refer to functions ϕ: ∆(Ω)^k → R as k-ary (scalar) properties. These capture quantities that reflect some statistic of one or several distributions: for instance, distance to uniformity and support size are both unary properties, and the total variation distance between two distributions is a binary property. As in the previous paragraph, a k-ary scalar property ϕ is said to be symmetric if for every permutation π and distributions D_1, . . . , D_k, it holds that ϕ(D_1, . . . , D_k) = ϕ(D_1 ∘ π, . . . , D_k ∘ π). (Hereafter, we only require the domain Ω to be finite.)

The histogram of a distribution D is the function h_D: (0, 1] → N which "counts" the number of elements with a given probability weight: h_D(x) = |{ i ∈ Ω : D(i) = x }|. For any sequence s of m independent samples drawn from D, the fingerprint of s is a vector F = (F_1, . . . , F_m) ∈ N^m, where F_j is the number of elements x ∈ Ω that appear exactly j times in s. Note that F is a random variable which satisfies ∑_{j=1}^m jF_j = m (in particular, the F_j's are not independent).
The fingerprint can be seen as an empirical version of the histogram: 34 indeed, F_j counts the number of elements whose empirical probability is j/m, so that "intuitively" one should expect F_j ≈ h_D(j/m). Moreover, it is not hard to see that symmetric properties are completely characterized by histograms and fingerprints: that is, one can assume without loss of generality that a tester for a symmetric property (or scalar property) is only given the fingerprint of the samples (see, e.g., [18, Section 3.3]).
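Computing the fingerprint of a sample sequence is straightforward:

```python
from collections import Counter

def fingerprint(samples):
    """Fingerprint of a sample sequence: F[j] = number of domain elements
    appearing exactly j times, for j = 1..m (index 0 is unused). Testers
    for symmetric properties may be assumed to see only this vector."""
    occurrences = Counter(samples)       # element -> number of occurrences
    F = [0] * (len(samples) + 1)
    for c in occurrences.values():
        F[c] += 1
    return F
```

For instance, the sequence a, a, b, c, c, c has one element seen once, one seen twice, and one seen three times; the identity ∑_j jF_j = m always holds.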
Valiant and Valiant then proceed to define a symmetric linear property as a symmetric scalar property that can be expressed as

ϕ(D) = ∑_{x : h_D(x) ≠ 0} h_D(x) · f_ϕ(x),

33 The reader may remember this work was also mentioned in Section 5.4, in the context of tolerant testing of uniformity.
34 We remark that the use of the word histogram here is slightly unfortunate, and is not to be confused with that of Section 6.2. Indeed, the latter refers to a class of distributions, not (as is the case here) to a particular characteristic of a given probability distribution.
where h_D is the histogram of D, and f_ϕ: [0, 1] → R is a function depending on ϕ alone. Similarly, they define a linear estimator for a symmetric scalar property as a sequence of coefficients a ∈ R^N which, given m samples from a distribution D, outputs the estimate ϕ̂ = ∑_{j=1}^m a_j F_j, where F is the fingerprint induced by the samples. The last piece missing is a notion of distance between histograms: for this, they consider the relative Earthmover distance R. Roughly, R(h_1, h_2) is the cost of reassigning probability weight in D_1 (which has histogram h_1) to obtain a distribution with histogram h_2, where moving a unit of weight from probability α to probability α′ costs |log(α/α′)|.
Linear programming and estimators: upper and lower bounds. After setting up these concepts, the authors proceed to build on them, using tools from linear programming and polynomial approximation theory. The overall flavor of their framework is as follows: given any linear symmetric property ϕ whose function f_ϕ is well-behaved (broadly speaking, Lipschitz) with regard to the relative Earthmover distance, it is possible to set up two linear programs, (LP)_U and (LP)_L, such that:
• if ϕ can be estimated to within an additive ε with m samples, solving (LP)_U will give the coefficients of a linear estimator that uses O(m) samples, and estimates ϕ to within O(ε); 35
• solving (LP)_L with parameter m will result in two distributions D_1, D_2 that are indistinguishable to any algorithm taking fewer than m samples, and such that ϕ(D_1) − ϕ(D_2) (the objective of the linear program) is maximized;
• (LP)_U and (LP)_L are (for the appropriate parameters) dual of each other.
At a very high level, what the above means is that for a broad family of symmetric properties, it is possible to derive in a unified way tight upper and lower bounds via linear programming; and furthermore that-quite counter-intuitively-the (simple) class of linear estimators is as powerful as any other type of estimators, no matter how complex.
Techniques. Very briefly (and inaccurately), the key ingredients in proving these results are:
• Poissonization, to restore independence between the numbers of occurrences of any two x, y ∈ Ω (see Appendix D.3 for more details on this technique), and to be able to write the expectations of the fingerprint entries, E[F_j], as the inner product of the histogram h_D with some convenient "Poisson functions" poi_j;
• Polynomial approximation theory: in order to approximate f_ϕ by a linear combination of these Poisson functions that can be used in their linear programs, the authors develop an approximation scheme based on Chebyshev polynomials. 36 To see why this is indeed useful, observe that under Poissonization E[F_j] = ∑_x h_D(x) · poi_j(mx), and thus the linear programs can enforce or capture constraints on the fingerprint expectation;
• An insightful use of linear programming duality, which allows them to prove optimality of their linear estimators by relating (LP)_U and (LP)_L;
• Polynomial approximation theory, bis: to build lower bound instances in [96], the authors need to define a (family of) pairs of distributions which, while far from each other in total variation distance, give rise to fingerprints that are very hard to distinguish. They describe these distributions explicitly based on Laguerre and Hermite polynomials, and leverage properties of these polynomials (combined with the CLT mentioned below) to argue that the resulting distributions have histograms that are very close in relative Earthmover distance;
• a new multivariate Central Limit Theorem (CLT) for total variation distance, 37 which allows them to show indistinguishability of the fingerprints obtained from these lower bound instances.
We stress that the above sweeps under the rug most of the details, difficulties and subtleties of the argument; the interested reader is encouraged to consult the original papers for further details.
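The Poissonized identity E[F_j] = ∑_x h_D(x) · poi_j(mx) is easy to check numerically: under Poissonization with parameter m, each element of weight x appears a Poisson(mx) number of times, independently of the others (the function names below are ours, for illustration):

```python
import math

def poi(j, lam):
    """Poisson probability mass function: poi_j(lam) = e^{-lam} lam^j / j!."""
    return math.exp(-lam) * lam ** j / math.factorial(j)

def expected_fingerprint(histogram, m, jmax):
    """Expected fingerprint under Poissonization with parameter m:
    E[F_j] = sum_x h(x) * poi_j(m * x), for j = 1..jmax.
    `histogram` maps probability values x to counts h(x)."""
    return [sum(h * poi(j, m * x) for x, h in histogram.items())
            for j in range(1, jmax + 1)]
```

For the uniform distribution on 4 elements with m = 4, each element appears Poisson(1) times, so E[F_1] = 4/e; and ∑_j j · E[F_j] recovers the expected sample size m.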
Consequences: tolerant testing of entropy, support size, uniformity and closeness. Leveraging their scalar property machinery, Valiant and Valiant are able to obtain tight or near-tight bounds on four tolerant testing problems, resulting in a somewhat unexpected (by the author) characterization of their sample complexity:

Theorem ([96, Corollary 10]). There exists an algorithm which, given SAMP access to an unknown distribution D ∈ ∆(Ω), satisfies the following. On input ε_1, ε_2 ∈ (0, 1] such that ε_2 − ε_1 ≥ 1/n^{Ω(1)}, it takes O((1/(ε_2 − ε_1)^2) · n/log n) samples from D, and
• if H(D) ≤ ε_1, then with probability at least 2/3, the algorithm accepts;
• if H(D) ≥ ε_2, then with probability at least 2/3, the algorithm rejects;
where H(D) = −∑_{x∈Ω} D(x) log D(x) denotes the (Shannon) entropy of the distribution. Furthermore, this sample complexity is tight: no algorithm taking o(n/log n) samples can correctly perform this task.

36 The Chebyshev polynomials play a major role in approximation theory, owing to their extremal properties: when approximating a function on a fixed interval by its (truncated to degree d) expansion in the Chebyshev basis, the error induced by this truncation is very small. This allows one to restrict oneself to such polynomial expansions of low(ish) degree when approximating the function f_ϕ. Provided that one can also approximate these low-degree Chebyshev polynomials by linear combinations of the Poisson functions with small coefficients (which the authors show is possible), this yields a good approximation scheme for the function f_ϕ in terms of the Poisson functions.
37 In [96], the authors actually also prove and use a slightly weaker but more general CLT, for the Earthmover (Wasserstein) metric.

Theorem ([96, Corollary 9]). There exists an algorithm which, given SAMP access to an unknown distribution D ∈ ∆(Ω) with the guarantee that D(x) ≥ 1/n for all x ∈ supp(D), satisfies the following. On input ε_1, ε_2 ∈ (0, 1], it takes O((1/(ε_2 − ε_1)^2) · n/log n) samples from D, and
• if |supp(D)| ≤ ε_1 n, then with probability at least 2/3, the algorithm accepts;
• if |supp(D)| ≥ ε_2 n, then with probability at least 2/3, the algorithm rejects.
Furthermore, this sample complexity is tight (for constant ε_1, ε_2): no algorithm taking o(n/log n) samples can correctly perform this task.
To obtain these two corollaries, the first step is to observe that the scalar properties above are indeed symmetric linear properties, and furthermore are continuous with regard to the relative Earthmover distance, making it possible to apply the Valiants' general framework. We also point out that this framework is quite versatile: indeed, the corresponding results from Section 5.4, Theorem 5.14 and Theorem 5.12 (an upper and a lower bound on tolerant closeness testing), were also established by the same approach.
Related work. Following the work described above, Wu and Yang give in [103] another, self-contained proof of the Θ((1/(ε_2 − ε_1)) · (n/log n)) sample complexity of entropy estimation in the SAMP setting (moreover, their result removes the restriction that ε_2 − ε_1 ≥ 1/n^{Ω(1)}). They do so by analyzing the minimax quadratic risk: as in [98], both their upper and lower bounds rely on polynomial approximation of the function f_ϕ: x ↦ −x log x. For the latter, Wu and Yang bypass the need for Valiant and Valiant's CLT by "introducing independence" among the fingerprint entries, constructing instances whose probabilities D(1), . . . , D(n) are themselves chosen independently at random. (This elegant idea comes at a cost, however: the instances obtained are not exactly probability distributions anymore, as they do not necessarily sum to one. Thus, the authors have to argue that they are "close enough" to probability distributions for the proof to go through.) Similar techniques were also used in [67], where the authors obtain similar results for estimating the Shannon entropy and quantities of the form ∑_{x∈Ω} D(x)^α. Related to this last quantity is the Rényi entropy H_α, whose estimation is studied in [8].

Tips and tricks
Sanity checks for lower bounds. When trying to prove a lower bound, always make sure it does not contradict a known upper bound. In particular, if the argument boils down to testing identity to a single, fixed hard instance D*, the best one can hope for is Ω(√n).
Uniformity testing as a primitive. While this may (sometimes) lead to sample complexities suboptimal by polylog n factors, reducing a testing problem to one or several instances of uniformity testing, either in total variation or in ℓ_2 distance, is a powerful technique.
Bucketing helps. Often in conjunction with the above item-bucketing is a very common and useful technique to break down a problem into several parts, each of them being "nicer" (as the distribution in each bucket is either uniform, or nearly uniform).
ℓ_2 as proxy. Whenever total variation (ℓ_1) is too stringent or too global (does not give enough local information about the distribution), testing in ℓ_2 can prove useful, usually together with one or both of the items above. As a standalone lemma, we recall the following relation between the ℓ_2 norm of a distribution D and its distance from uniformity [18,17,20]: ‖D − U‖_2^2 = ‖D‖_2^2 − 1/n, so that 2 d_TV(D, U) = ‖D − U‖_1 ≤ √n · ‖D − U‖_2.

Independence is treacherous. Be careful of claims of independence: many things are not independent, even when they "obviously are." Poissonization (Appendix D.3) is your friend.
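The ℓ_2 identity above, which underlies collision-based uniformity testing, is immediate to verify numerically:

```python
def l2_uniformity_gap(pmf):
    """Check the identity ||D - U||_2^2 = ||D||_2^2 - 1/n: the squared
    l2 norm of D exceeds that of the uniform distribution exactly by
    D's squared l2 distance from it. Returns both sides."""
    n = len(pmf)
    lhs = sum((p - 1 / n) ** 2 for p in pmf)
    rhs = sum(p ** 2 for p in pmf) - 1 / n
    return lhs, rhs
```

Together with the Cauchy-Schwarz step 2 d_TV(D, U) ≤ √n · ‖D − U‖_2, this is why estimating the collision probability ‖D‖_2^2 suffices to detect ℓ_1-farness from uniform.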
Hellinger is tighter, better for transcripts. In the sampling model, working with the Hellinger distance between yes- and no-instances often enables one to show better lower bounds on the sample complexity of property testing algorithms. For instance, distinguishing two distributions D_1 and D_2 with constant probability requires Ω(1/d_H(D_1, D_2)^2) samples. (Note that phrased in terms of total variation distance, one only gets an Ω(1/d_TV(D_1, D_2)) lower bound, which, albeit sometimes easier to work with, can be, by Theorem C.5, looser by as much as a quadratic factor.) Roughly speaking, the reason for this potential quadratic improvement comes from the nice behavior of the Hellinger distance with respect to product distributions (independent samples), while the total variation distance behaves very poorly in that regard (see Equations (C.1) and (C.4)).
Think of DKW. Performing a coarse learning of the distribution often helps, to approximately identify the problematic portions of the distribution or to decide which subroutine to apply to which part. See Theorem D.1.
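For concreteness, the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality with Massart's tight constant gives a simple sample-size rule for learning a CDF to Kolmogorov distance ε (a standard calculation; function names are ours):

```python
import math

def dkw_sample_size(eps, delta):
    """Sample size from the DKW inequality (Massart's constant):
    Pr[sup_t |F_hat(t) - F(t)| > eps] <= 2 exp(-2 m eps^2), so
    m >= ln(2/delta) / (2 eps^2) samples suffice for accuracy eps
    with probability at least 1 - delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def empirical_cdf(samples, n):
    """Empirical CDF over [n] from a list of samples in {1, ..., n}."""
    m = len(samples)
    return [sum(1 for x in samples if x <= i) / m for i in range(1, n + 1)]
```

For example, ε = 0.1 and δ = 0.05 require only 185 samples, regardless of the support size.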
Symmetric properties. Fingerprints and histograms are all that matters for symmetric properties. [96,97] and [98] are a very good source for lemmas, ideas and techniques that apply to them.
Insight from Statistics. There is an insane amount of literature on statistical tools such as the χ 2 -test. Albeit seldom optimal when used out-of-the-box, custom-tailored variants of these have proven very powerful.

Subsequent work
Following the first version of this survey, several works have been published which settle or address some of the problems covered in this chapter; we hereafter mention a few of them. Diakonikolas and Kane [47] provide a new framework to prove upper bounds for a variety of distribution testing problems, essentially by an elegant reduction from ℓ_1 to ℓ_2 testing (see also [58] for an exposition), as well as an information-theoretic framework for establishing lower bounds. Canonne [29] proves near-tight upper and lower bounds for the problem of testing the class of k-histograms discussed in Section 6.2. Blais, Canonne, and Gur [26] obtain the distribution testing analogue of the communication complexity framework of [25], and leverage it to revisit the "instance-optimal" identity testing bound of Theorem 5.8. Diakonikolas et al. [46] analyze the original collision-based tester for uniformity [60], and show that, surprisingly, it also yields optimal sample complexity (and that Poissonization, here, hurts). Finally, Jiao, Han, and Weissman [66] settle the sample complexity of tolerant testing of uniformity, identity, and closeness, improving on the results of Section 5.4 with regard to the dependence on ε_2 − ε_1.

Other Models
While the sampling model covered in the previous chapter is arguably the most natural and widely considered, it fails to fully capture certain scenarios and situations that arise both in practice and in theory. Moreover, as we saw earlier, algorithms in the SAMP model must in most cases incur a sample complexity that, albeit sublinear, is polynomial in the domain size. Whenever the domain becomes too large, this is a cost one cannot reasonably afford.
For these reasons, among others, there has been recent work on property testing of probability distributions under alternative models: this includes other types of access to the distribution (either more powerful than or incomparable to the sampling one), as well as different objectives or performance measures. The former is the focus of Sections 11 and 12, with the conditional and extended access models; while examples of the latter can be found in Sections 13 and 14.

Conditional samples
In this section, we focus on the conditional access model, a generalization, introduced independently by Chakraborty et al. [36] and Canonne et al. [33], of the sampling model. In this setting, the algorithms are granted sampling access to any conditional distribution of their choosing; that is, they are able to condition the outcome on arbitrary subsets of the domain Ω.

The setting
Definition 11.1 (Conditional access model [36,33]). Fix a distribution D over Ω. A COND oracle for D, denoted COND D , is defined as follows: the oracle takes as input a query set S ⊆ Ω, chosen by the algorithm, that has D(S) > 0. The oracle returns an element i ∈ S, where the probability that element i is returned is D S (i) = D(i)/D(S), independently of all previous calls to the oracle.
Note that the behavior of COND D (S), as described above, is undefined if D(S) = 0, i. e., the set S has zero probability under D. Various definitional choices could be made to deal with this: e. g., Canonne et al. assume that in such a case the oracle (and hence the algorithm) outputs "failure" and terminates, while Chakraborty et al. define the oracle to return in this case a sample uniformly distributed in S. In most situations, this distinction does not make any difference, as most algorithms can always include in their next queries a sample previously obtained. 38 However, the former choice does rule out the possibility of non-adaptive testers taking advantage of the additional power COND provides over SAMP; such testers are part of the focus of [36] (and are discussed in Section 11.5).
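For concreteness, the oracle of Definition 11.1 is easy to simulate when D is known explicitly. The following sketch (the class name and list representation are ours) adopts the definitional choice of Canonne et al. for zero-weight query sets, failing rather than returning a uniform element:

```python
import random

class CondOracle:
    """Simulates COND_D access to an explicitly given distribution D over {0, ..., n-1}."""

    def __init__(self, probs):
        self.probs = probs  # probs[i] = D(i); assumed non-negative, summing to 1

    def cond(self, S):
        """One COND_D query: draw a sample from D conditioned on the set S."""
        weight = sum(self.probs[i] for i in S)
        if weight == 0:
            # Definitional choice of Canonne et al.: fail on zero-weight sets.
            # (Chakraborty et al. would instead return a uniform element of S.)
            raise ValueError("query set has zero probability under D")
        r = random.random() * weight
        for i in S:
            r -= self.probs[i]
            if r <= 0:
                return i
        return S[-1]

    def samp(self):
        """SAMP_D is the special case S = Omega."""
        return self.cond(list(range(len(self.probs))))
```

Of course, an actual COND algorithm only interacts with such an oracle as a black box; this simulation is merely a convenient way to experiment with the model.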
Testing algorithms can often only be assumed to have the ability to query sets S that have some sort of "structure," or are in some way "simple." To capture this, one can define specific restrictions of the general COND model, which do not allow arbitrary sets to be queried but instead enforce some constraints on the queries: [33] introduces and studies two such restrictions, "PAIRCOND" and "INTCOND."

Definition 11.2 (Restricted conditional oracles). A PAIRCOND ("pair-cond") oracle for D is a restricted version of COND D that only accepts input sets S which are either S = Ω (thus providing the power of a SAMP D oracle) or S = {x, y} for some x, y ∈ Ω, i. e., sets of size two.
In the specific case of Ω = [n], an INTCOND ("interval-cond") oracle for D is a restricted version of COND D that only accepts input sets S which are intervals S = [a, b] = {a, a + 1, . . . , b} for some a ≤ b ∈ [n] (note that taking a = 1, b = n this provides the power of a SAMP D oracle).

Testing uniformity
The first result we describe shows that, in stark contrast to what holds in the SAMP model, testing uniformity with conditional samples can be done with a constant number of queries. We cover here the result of [33], which derives essentially matching upper and lower bounds; note that a poly(1/ε)-query tester was also obtained in [36].

Theorem 11.3 (Testing uniformity). There exists an algorithm which, given PAIRCOND access to an unknown distribution D ∈ ∆(Ω), satisfies the following. On input ε ∈ (0, 1), it makes Õ(1/ε²) queries to D, and • if D = U, then with probability at least 2/3, the algorithm accepts; • if d TV (D,U) > ε, then with probability at least 2/3, the algorithm rejects.

(Footnote 38: Conversely, for any lower bound relying on a specific instance of distribution, one can always consider instead a mixture of the original instance and the uniform distribution-the latter with, say, exponentially small weight.)
Furthermore, this is nearly tight: no COND algorithm making o(1/ε²) queries can correctly perform this task.
At a very high level, the algorithm works by considering 3 sets of samples: a reference set R of constantly many points drawn uniformly from [n], a set H of "possibly heavy points" drawn from D, and a set L of "possibly light points" drawn uniformly from [n]. Then, it goes over every pair (h, r) ∈ H × R and (ℓ, r) ∈ L × R, calling the PAIRCOND oracle on them to try and detect a discrepancy between D(h) and D(r) (D(ℓ) and D(r), resp.). Intuitively, if D were far from uniform, then there would be many light points with weight significantly smaller than 1/n, and enough weight put by D on heavy points (with weight significantly bigger than 1/n). Thus, with high probability L would contain light points, and H heavy points: comparing both to the reference points, which can be either heavy or light, at least one of the comparisons will give away the difference. For illustration, the pseudocode is given in Algorithm 3; note that it invokes as a blackbox a subroutine, COMPARE (Algorithm 4). This subroutine, used in several algorithms of [33], behaves as follows: on input two disjoint subsets X, Y of the domain, it either returns a (1 ± η)-multiplicative estimate of the ratio D(X)/D(Y), or signals if this ratio is too high or too low for the estimation to be done efficiently (relative to a threshold parameter K).

Algorithm 3
The uniformity tester of Theorem 11.3
Require: error parameter ε > 0; query access to a PAIRCOND D oracle
1: Set t ← log(4/ε) + 1. Call the SAMP D oracle s j = Θ(2^j · t) times to get samples h 1 , . . . , h s j drawn from D.
5: Select s j points ℓ 1 , . . . , ℓ s j independently and uniformly from [n].
6: for all pairs (x, y) = (i r , h r′ ) and (x, y) = (i r , ℓ r′ ) (where 1 ≤ r ≤ q, 1 ≤ r′ ≤ s j ) do

The lower bound, on the other hand, works by a reduction from a known "hard problem," that of distinguishing a fair from a biased coin (Fact D.3). Specifically, the argument goes by showing how, given access to independent coin tosses which are either (a) fair or (b) ε-biased, one can simulate COND access to a distribution D that is (a) uniform or (b) Ω(ε)-far from uniform. Thus, any COND algorithm for uniformity can be used for distinguishing fair from ε-biased coins, and must therefore make Ω(1/ε²) queries.
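The pairwise-comparison primitive at the heart of this tester can be sketched as follows. This simplified stand-in for COMPARE (the names pair_cond and compare, the sample size m, and the specific thresholds are ours, not those of [33]) estimates D(x)/D(y) from repeated PAIRCOND queries on {x, y}, flagging ratios too extreme to estimate at this sample size:

```python
import random

def pair_cond(probs, x, y):
    """One PAIRCOND_D query on {x, y}: returns x with probability D(x)/(D(x)+D(y))."""
    px, py = probs[x], probs[y]
    return x if random.random() * (px + py) < px else y

def compare(probs, x, y, K=10.0, m=2000):
    """Rough stand-in for COMPARE: estimate D(x)/D(y), or flag an extreme ratio.

    Returns 'high' / 'low' when the empirical ratio falls outside [1/K, K],
    otherwise a multiplicative estimate of D(x)/D(y).
    """
    hits = sum(pair_cond(probs, x, y) == x for _ in range(m))
    f = hits / m  # empirical estimate of D(x)/(D(x)+D(y))
    if f >= K / (K + 1):
        return "high"
    if f <= 1 / (K + 1):
        return "low"
    return f / (1 - f)
```

Setting m = Θ(log(1/δ)/η²) would give a (1 ± η)-accurate answer with probability 1 − δ, by a standard Chernoff bound.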
Note that the above upper bound holds even for the restricted "pair-cond" oracle. It is natural to ask if this is also the case with the "interval-cond" oracle: [33] show that significant savings are possible in this setting as well, giving a poly(log n, 1/ε)-query tester for uniformity:
Maybe surprisingly, this log^Ω(1) n dependence for testing uniformity with INTCOND queries turns out to be necessary, showing a strict separation between INTCOND and PAIRCOND (and a fortiori between INTCOND and COND) for this problem: The upper bound is conceptually simple, and amounts to some sort of "binary descent" performed on O(1/ε) points randomly drawn from D, in order to check that their probability weight is close to 1/n. For each such point s j , the algorithm recursively estimates the log n ratios D(I i )/D(I i−1 ) (where I 0 = [n], and the interval I i is the half of I i−1 which contains s j ). To pass the test, each of these ratios should be very close to 1/2; and as multiplying these ratios together gives a good multiplicative estimate of D(s j ), checking whether any of the resulting estimates deviates too much from 1/n allows one to detect distributions far from uniform.
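The binary descent can be sketched as follows, simulating the INTCOND oracle on an explicitly known distribution (the function names and the per-level sample size m are ours); the product of the estimated half-interval ratios approximates D(s):

```python
import random

def int_cond(probs, a, b):
    """One INTCOND_D query on the interval {a, ..., b} (0-indexed, inclusive)."""
    w = [probs[i] for i in range(a, b + 1)]
    r = random.random() * sum(w)
    for i, p in zip(range(a, b + 1), w):
        r -= p
        if r <= 0:
            return i
    return b

def estimate_weight(probs, s, m=4000):
    """Binary-descent estimate of D(s): multiply the ~log n estimated
    half-interval ratios D(I_i)/D(I_{i-1}) along the path from [n] down to {s}."""
    a, b = 0, len(probs) - 1
    est = 1.0
    while a < b:
        mid = (a + b) // 2
        lo, hi = (a, mid) if s <= mid else (mid + 1, b)
        # fraction of samples from I_{i-1} = {a..b} landing in I_i = {lo..hi}
        hits = sum(lo <= int_cond(probs, a, b) <= hi for _ in range(m))
        est *= hits / m
        a, b = lo, hi
    return est
```

Under uniformity every ratio is exactly 1/2, so the estimate concentrates around 1/n; the actual tester of [33] additionally has to control how multiplicative errors accumulate across the log n levels.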
The lower bound, however, proves to be (a lot) trickier, the difficulty lying in bounding the quantity of information an INTCOND (or COND, for that matter) algorithm can "learn" from its queries. To analyze their family of hard instances D no , [34] follow a hybridization argument, where they introduce many intermediate stages. Each stage corresponds to an algorithm "faking" more of its queries to the oracle for D ∈ D no , instead of actually making them; so that the first stage is the actual algorithm interacting with INTCOND D , and the last stage is an algorithm that answers all its own queries as if it interacted with the uniform distribution (and thus is exactly what would happen if the tester had been given access to INTCOND U ). The authors then proceed to bound the variation distance between the transcripts obtained in any two such consecutive stages: summing all these distances allows them to upper bound the total variation distance between transcripts in the uniform and no-instance cases, and derive their lower bound.

Testing identity
Recalling that in the SAMP model uniformity and identity testing turn out to be equivalent in terms of sample complexity, one may wonder if this is still the case with the more powerful queries a COND oracle allows. As we shall see below, this is indeed the case in the general COND setting, but not in its restricted PAIRCOND variant.
Theorem 11.6 (Testing identity). There exists an algorithm which, given the full specification of D * ∈ ∆(Ω) and COND access to an unknown distribution D, satisfies the following. On input ε ∈ (0, 1), it makes Õ(1/ε²) queries to D, and • if D = D * , then with probability at least 2/3, the algorithm accepts; • if d TV (D, D * ) > ε, then with probability at least 2/3, the algorithm rejects.
Furthermore, this is nearly tight.
The lower bound is implied by Theorem 11.3. The upper bound is due to [52], where the authors apply a χ²-test to the conditional distributions induced by D and D * on adaptively chosen subsets of the domain. The high-level idea is to find a "distinguishing element" x (for D * ), and a small number t of "distinguishing sets" G j (with regard to x and D * ), such that (a) D * (x) is within constant factors of each D * (G j ) and (b) D * (G 1 ), . . . , D * (G t ) are roughly equal. Their algorithm then uses this χ²-test to check consistency between D and D * on {x, G j } for a randomly chosen G j , and on {G 1 , . . . , G t } (where each set G j is seen as a single element). This, along with a third check meant to verify that both D and D * put the same overall weight on ∪ j G j , guarantees that with high probability at least one of the tests performed will catch a discrepancy between D and D * .
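A drastically simplified version of one such consistency check can be sketched as follows (the helper names, parameters, and the naive deviation test are ours; the actual tester of [52] uses a χ²-statistic and carefully chosen accuracy parameters). Conditioning on {x} ∪ G, the fraction of samples landing on x should match the value predicted by the known D*:

```python
import random

def cond_sample(probs, S):
    """One COND_D query on the set S (assumes D(S) > 0)."""
    r = random.random() * sum(probs[i] for i in S)
    for i in S:
        r -= probs[i]
        if r <= 0:
            return i
    return S[-1]

def consistent_on(probs, ref, x, G, eps=0.1, m=5000):
    """Check whether the unknown D (`probs`) and the known D* (`ref`) induce
    roughly the same conditional probability of x within {x} U G."""
    S = [x] + list(G)
    observed = sum(cond_sample(probs, S) == x for _ in range(m)) / m
    expected = ref[x] / sum(ref[i] for i in S)
    return abs(observed - expected) <= eps
```

The point of requirements (a) and (b) above is precisely to guarantee that, whenever D is far from D*, some such conditional probability deviates by a detectable (constant-factor) amount.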
Note that prior to this work, a poly(log * n, 1/ε)-query tester for identity had been obtained by [36]; and a constant-query (Õ(1/ε⁴)) tester was analyzed in [33]. The latter also worked by comparing the weight (under D) of suitably chosen (with relation to D * ) elements and subsets S i ⊆ Ω. In all three cases, the tester crucially uses for its comparisons the ability to condition on arbitrary subsets of the domain. As the following theorems show, this is not by coincidence: with the restricted pair-cond oracle, a dependence on n is both necessary and sufficient.
Theorem 11.7 (Testing identity with PAIRCOND ([34, Theorem 7])). There exists an algorithm which, given the full specification of D * ∈ ∆(Ω) and PAIRCOND access to an unknown distribution D, satisfies the following. On input ε ∈ (0, 1), it makes Õ(log⁴(n)/ε⁴) queries to D, and • if D = D * , then with probability at least 2/3, the algorithm accepts; • if d TV (D, D * ) > ε, then with probability at least 2/3, the algorithm rejects.

The upper bound, at its core, follows the same idea as in the uniformity case (where D * = U): trying to compare the ratios D * (x)/D * (y) and D(x)/D(y) for x ∼ D * and y ∼ D, and rejecting if a significant difference is found at any step. However, this natural approach no longer works here: if, for instance, D * (x) ≪ D * (y) and D(x) ≪ D(y), then the points are not "comparable" unless one makes ω(1) queries. That is, calling COND D on {x, y} will essentially never return x, and the ratio D(x)/D(y) will be estimated as zero no matter whether it is actually 1/ log * n or n^{−100}. To circumvent this, the tester first buckets the points according to D * , and then checks that D assigns approximately the right amount of weight to every bucket. Then, it follows the natural approach above, but on each of the logarithmically many buckets: since on all of them D * is nearly uniform, all points should have comparable weight and the above difficulty no longer arises.
The lower bound leverages this insight of comparable vs. incomparable points, by building as hard instance a distribution on logarithmically many buckets with sizes growing exponentially, which puts the same total weight on each of them. The corresponding no-instance is a perturbed version of this distribution: buckets are grouped in pairs, and in each pair one random bucket has its weight multiplied by 1/2 and the other by 3/2. This does preserve the incomparability: in both the yes- and no-cases, two elements x, y from different buckets have weights D(x), D(y) so multiplicatively far apart that a constant number of queries cannot help estimating the ratio, while two points x, y from the same bucket have exactly the same weight D(x) = D(y). Thus, intuitively, the only way to tell yes- and no-instances apart is to estimate the total weight of a particular bucket: but again, unless many queries are performed the tester cannot even obtain two samples from the same bucket-let alone a number sufficient to estimate its total weight. To make this intuition formal, the analysis proceeds with the same sort of hybridization technique as for Theorem 11.5: by bounding the difference between transcripts obtained by algorithms that "fake" k versus k + 1 of their queries (i. e., that "guess" the samples returned from their first k or k + 1 adaptive queries, instead of actually making these queries to the PAIRCOND oracle). Since algorithms faking all their adaptive queries cannot distinguish between a yes- and a no-instance, by the triangle inequality one gets that algorithms faking none of them still only have negligible advantage in doing so, and thus cannot be bona fide testers.
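For concreteness, this pair of hard instances can be generated explicitly; the following sketch (parameterization ours, assuming an even number of buckets) builds the yes-instance and one random no-instance:

```python
import random

def hard_instances(n_buckets, seed=None):
    """Yes-instance: equal total weight on n_buckets buckets of exponentially
    growing size. No-instance: buckets paired up, with one bucket per pair
    reweighted by 1/2 and the other by 3/2 (n_buckets assumed even)."""
    rng = random.Random(seed)
    sizes = [2 ** j for j in range(n_buckets)]
    yes = []
    for size in sizes:
        yes += [1 / (n_buckets * size)] * size  # equal weight within each bucket
    no = []
    for j in range(0, n_buckets, 2):
        lo, hi = (0.5, 1.5) if rng.random() < 0.5 else (1.5, 0.5)
        no += [lo / (n_buckets * sizes[j])] * sizes[j]
        no += [hi / (n_buckets * sizes[j + 1])] * sizes[j + 1]
    return yes, no
```

Note that within each bucket all elements keep exactly equal weight in both instances, so pairwise comparisons inside a bucket are uninformative, as the argument above requires.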

Testing closeness
We now cover two theorems which yield a good (if not completely tight) characterization of the sample complexity of closeness testing in the conditional setting:

Theorem 11.9 (Testing closeness). There exists an algorithm which, given COND access to two unknown distributions D 1 , D 2 ∈ ∆(Ω), satisfies the following. On input ε ∈ (0, 1), it makes Õ((log log n)/ε⁵) queries to D 1 and D 2 , and • if D 1 = D 2 , then with probability at least 2/3, the algorithm accepts; • if d TV (D 1 , D 2 ) > ε, then with probability at least 2/3, the algorithm rejects.
This upper bound is due to [52], and essentially works by generalizing their ideas for identity testing (Theorem 11.6) to the case where both distributions are unknown. In order to do so, the "distinguishing sets" G 1 , . . . , G t are now defined with regard to both D 1 and D 2 ; the key difficulty now being that the algorithm has no direct way to compute them (which would require explicit knowledge of D 1 and D 2 ). It thus attempts to get a handle on one of these sets by sampling it, that is, including each element independently in a set Ŝ with some guessed probability r. To find an (approximately) good value of r for which this works, the algorithm then iterates over possible values of r by some double binary search-resulting in the log log n dependence.
Note that in previous work, [34] analyzed a Õ(log⁵ n/ε⁴)-query algorithm for this task which worked by simulating (approximate) evaluation query access to D 1 , D 2 and applying techniques similar to those of the "evaluation query" model of Section 12. (The overall cost comes from the calls to this (approximate) EVAL oracle, each using polylog(n) conditional queries.) This doubly logarithmic dependence on the support size may seem unnatural, especially given the constant query complexity of both uniformity and identity testing in the conditional query setting. One may therefore ask whether this can be reduced further, down to poly(1/ε): quite surprisingly, this turns out to be impossible. Indeed, [1] show that a (log log n)^Ω(1) query complexity is necessary: in contrast to what happens in the SAMP setting, identity and closeness testing with conditional queries are inherently different. 39

Theorem 11.10. There exists an absolute constant ε 0 > 0 such that the following holds. Any algorithm which, given COND access to two unknown distributions D 1 , D 2 ∈ ∆(Ω), distinguishes with probability at least 2/3 between (a) D 1 = D 2 and (b) d TV (D 1 , D 2 ) ≥ ε 0 , must have query complexity Ω(√log log n).
The proof of this lower bound is quite intricate, as it has to capture and "beat" all possible adaptive ways a COND testing algorithm could query the distributions and derive some information about them. At a very high level it relies on a technique introduced by Chakraborty et al. [36] for lower bounds against label-invariant properties, namely, the notion of core adaptive testers, a class of conceptually simpler testing algorithms against which it is sufficient to compete. The argument then works by designing yes- and no-instances that are intuitively impossible to distinguish without knowledge of their support size. These instances are obtained by embedding (a variant of) the construction of Theorem 11.8 into a much larger domain, scaling it by a random factor. (This amounts to hiding the relevant part of the distribution in a negligible and unknown portion of the domain, effectively "blindfolding" the testing algorithm.) As in the previous subsection, one can ask if similar upper bounds hold in the more restricted setting of pair-cond queries. As the lower bound of Theorem 11.8 clearly carries over to this more general question, the best one can hope for is a polylog(n)-query upper bound. As it so happens, this hope is justified:

Theorem 11.11 (Testing closeness with PAIRCOND ([34, Theorem 11])). There exists an algorithm which, given PAIRCOND access to two unknown distributions D 1 , D 2 ∈ ∆(Ω), satisfies the following. On input ε ∈ (0, 1), it makes Õ((log⁶ n)/ε²¹) queries to D 1 and D 2 , and • if D 1 = D 2 , then with probability at least 2/3, the algorithm accepts; • if d TV (D 1 , D 2 ) > ε, then with probability at least 2/3, the algorithm rejects.

(Footnote 39: A (small) chasm, if you will.)
Here is a sketch of how this tester works: in a first stage, it obtains a small "cover" of D 1 . This is a set R of logarithmically many representatives, in the following sense: for almost all x ∈ Ω, there exists an r ∈ R such that D 1 (x) is multiplicatively close to D 1 (r) (such an x is said to be in the neighborhood of r). These neighborhoods {U 1 (r)} r∈R can be seen as a succinct cover of the support of D 1 by (not necessarily disjoint) sets, where within each set the points have roughly equal weight-reminiscent of some approximate bucketing.
The algorithm then checks two things: to begin with, it gets an estimate of D 2 (U 1 (r)) for each r in order to make sure D 2 puts the same weight as D 1 on these neighborhoods. Then, it takes samples from both D 1 and D 2 , and verifies that all of them have the "same representative" r ∈ R under both distributions (i. e., that for each x sampled from either distribution, if D 1 (x) ≈ D 1 (r) for some r then D 2 (x) ≈ D 2 (r) as well). As the authors argue, if D 2 is far from D 1 then at least one of these two tests must fail with high probability; that is, the two distributions cannot agree both on the weights of the neighborhoods and on the actual elements each neighborhood contains.
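To illustrate what such a cover guarantees, here is an offline sketch with explicit access to the weights (the actual construction in [34] only has PAIRCOND access to D 1 , and builds the cover by sampling; the greedy strategy and names below are ours):

```python
def build_cover(probs, factor=2.0):
    """Greedy construction of a set R of representatives such that every point
    of non-zero weight is within a multiplicative `factor` of some r in R;
    for weights spanning poly(n) range, |R| = O(log n)."""
    reps = []
    for x in sorted(range(len(probs)), key=lambda i: -probs[i]):
        if probs[x] == 0:
            continue
        # covered if some existing representative has comparable weight
        if not any(probs[r] / factor <= probs[x] <= probs[r] * factor for r in reps):
            reps.append(x)
    return reps
```

The neighborhood U 1 (r) of a representative r is then simply the set of points whose weight is within the chosen factor of D 1 (r).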

Testing for structure: monotonicity
As in Section 6, the analogue for SAMP, we now discuss the task of testing whether an a priori arbitrary distribution meets some structural condition, such as being log-concave, monotone or-say-a Binomial distribution. In this section, we shall specifically focus on testing monotonicity over [n]-in good part because, to the best of the author's knowledge, other structural properties have yet to be studied in the context of conditional queries. (Unless specified otherwise, the following is from [28]; also, recall that all throughout this survey "monotone" is meant as monotone non-increasing, following the notation of Section 6.1.)

Theorem 11.12 (Testing monotonicity). There exists an algorithm which, given COND access to an unknown distribution D ∈ ∆([n]), satisfies the following. On input ε ∈ (0, 1), it makes Õ(1/ε²²) queries to D, and • if D ∈ M, then with probability at least 2/3, the algorithm accepts; • if d TV (D, M) > ε, then with probability at least 2/3, the algorithm rejects.
Furthermore, no algorithm taking o(1/ε²) samples can correctly perform this task.
As the type of queries an INTCOND oracle allows seems very natural in the context of monotonicity, one may wonder whether it allows more efficient testing than in the regular SAMP setting: the following result shows that this is indeed the case.
Furthermore, no algorithm taking o(log(n)/log log n) samples can correctly perform this task.
We also show that-perhaps surprisingly-the ability to condition on intervals is not necessary to obtain such improvements over SAMP algorithms. Namely, even allowing only PAIRCOND queries (although they have no direct connection to the ordering of the domain) is enough to bring down the sample complexity to polylog(n):

Theorem 11.14 (Testing monotonicity with PAIRCOND). There exists an algorithm which, given PAIRCOND access to an unknown distribution D ∈ ∆([n]), satisfies the following. On input ε ∈ (0, 1), it makes Õ((log² n)/ε³ + (log n)/ε⁴) queries to D, and • if D ∈ M, then with probability at least 2/3, the algorithm accepts; • if d TV (D, M) > ε, then with probability at least 2/3, the algorithm rejects.
Furthermore, no algorithm taking o(1/ε²) samples can correctly perform this task.
Before giving an outline of how the first two algorithms work (and a proof for the third), we observe that in all cases the lower bound comes directly from the corresponding lower bound on testing uniformity. Indeed, the reduction of Batu et al. described in Section 6.1 applies regardless of the model itself, so that "monotonicity is always at least as hard as uniformity."

A poly(log n, 1/ε) upper bound for INTCOND. The natural idea here is to see if the known algorithm for monotonicity testing in the SAMP model (covered in Section 6.1) cannot directly be adapted to take advantage of these additional queries. The key part here is how to improve the expensive step in the recursive splitting of the domain, where the algorithm checks if on the current interval the distribution is close enough to uniform in ℓ2 distance. While it would be sufficient to perform a similar test in total variation distance, this is in principle much harder: indeed, in this step the algorithm is not checking if the conditional distribution is uniform (or far from it), but if it is close to uniform; that is, it has to perform tolerant testing.
Yet it is not clear whether INTCOND queries would give us improved testers or tolerant testers in ℓ2 distance. To circumvent this issue, [28] observes that what the tester of Batu et al. requires is slightly weaker, namely, to distinguish distributions on an interval I that (a) are Ω(ε)-far from uniform from those that are (b) O(ε/|I|)-close to uniform in ℓ∞ distance. But this sort of weak tolerance is exactly what (a straightforward modification of) the INTCOND uniformity tester of [33] provides. Indeed, (b) is equivalent to asking that the ratio D(x)/D(y) of any two points in I be in [1 − ε, 1 + ε], which is exactly what this tester checks.
An O ε (1) upper bound for COND. While the above approach obviously also applies when granted general COND queries, it would still incur a polylogarithmic dependence on n. Indeed, even after plugging in the O ε (1)-query algorithm of [34] for estimating distance to uniformity, 40 the whole recursive binary splitting approach inherently brings a log n factor in the cost, from the number of recursion steps and intervals to consider.
Instead, the algorithm of Theorem 11.12 takes another route, reducing testing monotonicity of D to testing another property of another distribution (on another domain). In more detail, it considers the Birgé flattening of D, Φ ε (D), which is a histogram on only ℓ = O(log n/ε) intervals (see Appendix D.4 for more details on this transformation). Now, D is monotone if and only if the "reduced distribution" D ε on [ℓ] induced by Φ ε (D) satisfies some new "exponential property" P ε ⊆ ∆([ℓ]) (defined as the set of distributions Q for which Q(k + 1) ≤ (1 + ε)Q(k) for all k < ℓ). Not only does the above equivalence hold, it is actually robust: d TV (D, M) = d TV (D ε , P ε ).
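The flattening underlying this reduction can be sketched as follows, for an explicitly known D (the (1 + ε)^k schedule for the interval lengths is the standard Birgé decomposition; the exact rounding and constants of Appendix D.4 may differ):

```python
def birge_intervals(n, eps):
    """The oblivious Birge decomposition of [n]: O(log n / eps) consecutive
    intervals whose lengths grow geometrically as (1 + eps)^k."""
    intervals, a, k = [], 0, 0
    while a < n:
        length = max(1, int((1 + eps) ** k))
        intervals.append((a, min(a + length, n) - 1))
        a += length
        k += 1
    return intervals

def flatten(probs, eps):
    """Phi_eps(D): average D over each Birge interval, yielding a histogram
    on O(log n / eps) pieces, plus the reduced distribution on the intervals."""
    ivals = birge_intervals(len(probs), eps)
    flat = [0.0] * len(probs)
    reduced = []
    for (a, b) in ivals:
        w = sum(probs[a:b + 1])
        reduced.append(w)  # total weight of this interval: D_eps(k)
        for i in range(a, b + 1):
            flat[i] = w / (b - a + 1)  # spread uniformly within the interval
    return flat, reduced
```

The list `reduced` plays the role of D ε above; checking the exponential property amounts to verifying reduced[k + 1] ≤ (1 + ε) · reduced[k] for every k.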
Given this, the tester works in two stages: first, it checks that D ε ∈ P ε , using the fact that COND access to D ε can be simulated from COND access to D. Then, it also verifies that Φ ε (D) is close to D, as it should be if D were monotone (as guaranteed by Theorem D.15). These two conditions can easily be seen to hold for any monotone D; and conversely, one can show that if they are both satisfied then D cannot be far from monotone. The remaining part of the argument then amounts to proving that both these stages can be carried out with O ε (1) queries. (Note that the overall 1/ε²² dependence emerges from the second stage, which uses as a subroutine the aforementioned distance-to-uniformity estimation procedure of [34].)

A Õ(log² n/ε⁴) upper bound for PAIRCOND. The following is based on conversations in 2013 with Dana Ron and Rocco Servedio [32], and appears for the first time in this survey. Before proving the theorem, we outline the argument and give its high-level ingredients. While "testing by learning" is usually not efficient for distributions (due to the hardness of tolerant testing), in the particular case of monotonicity such an approach is possible. Somewhat similarly to the ideas of Batu et al. for their SAMP algorithm, we start by partitioning the domain [n] into roughly log n consecutive intervals, such that D is (or should be) almost constant on each of them. (This can be done by performing log n binary searches, where the comparisons between two points are simulated via PAIRCOND queries.) Then, we draw O(1/ε) samples from D, and for each of them compare its probability weight to that of the leftmost point of the interval it falls in. This again can be done with pairwise comparisons, and ensures that the distribution is indeed almost constant on each interval.
Proof of Theorem 11.14. Let t := Θ(log(n)/ε), and let 1 < c < c′ be two constants to be determined in the course of the analysis. The algorithm works as follows:
1: ... (if D were monotone, then one would have D(1) ≥ D(s); but the above shows that with very high probability D(1) ≪ D(s)).
2: By running t binary searches, iteratively find points i 1 = 1 < i 2 < · · · < i t ≤ n satisfying Equation (11.1) (each comparison is done with O((1/ε²) log(t log n)) = Õ(log log n/ε²) PAIRCOND queries, to ensure sufficient accuracy and guarantee overall correctness with probability at least 9/10 by a union bound).
3: If any monotonicity violation is detected during the course of these binary searches (i. e., for some i < j, the estimate of the ratio D(i)/D(j) is less than 1 − Ω(ε)), output REJECT. Otherwise, let B 1 , . . . , B t be the resulting buckets, where B j = {i j , . . . , i j+1 − 1} (and i t+1 = n + 1).
4: Take O(1/ε²) samples from D to get an estimate b̂ t of D(B t ) within an additive ε/16 (with probability 9/10); if b̂ t > ε/8, output REJECT.
5: Take Θ(1/ε) samples from D, and Θ(1/ε) points uniformly from each of B 1 , . . . , B t−1 . For each element s in the union S of these two sample sets, let B j s = {i j s , . . . , i j s +1 − 1} be the corresponding bucket.
• If for any s above we have ρ̂ s ∉ [1/(1 + 3ε/c′), 1 + 3ε/c′], output REJECT.
6: return ACCEPT

To argue correctness, note first that all pairwise queries made are on sets with non-zero weight (the preliminary step guarantees D(1) > 0, and afterwards all queries either contain 1 or a point previously returned by the oracle). Moreover, by a union bound all estimates computed via sampling and pairwise queries are within the desired accuracy with probability at least 7/10, and in particular the i j 's meet the specifications of Equation (11.1). From now on, we condition on this.
Define a point k ∈ [n] to be good if k ∈ B ℓ for some ℓ < t and D(k) ∈ [1 − 4ε/c′, 1 + 4ε/c′] · D(i ℓ ), and bad otherwise. Furthermore, let Bad + and Bad − be the sets of bad points k for which D(k) > D(i ℓ ) and D(k) < D(i ℓ ), resp., where k ∈ B ℓ .
Completeness. If D is monotone, then the binary searches work as expected and the algorithm does not reject in Step 3. Moreover, as D(i t ) < (1 + ε/c)^{−t} ≤ ε/(16n) (for a suitable choice of the constant c), the leftover bucket B t has weight at most ε/16 and Step 4 does not reject either. Finally, for any 1 ≤ ℓ ≤ t, any point s ∈ B ℓ satisfies, by monotonicity and the accuracy guarantees of the estimates, ρ̂ s ∈ [1/(1 + 3ε/c′), 1 + 3ε/c′] for any s considered in Step 5. Therefore, the algorithm reaches Step 6 and accepts.
Soundness. By contrapositive, suppose that the algorithm returns ACCEPT (we will show that in this case D is ε-close to monotone). This means that (a) D(B t ) < 3ε/16; (b) D(Bad + ) ≤ ε/8; and (c) for all ℓ < t, |Bad − ∩ B ℓ | ≤ (ε/8)|B ℓ | (the last two from Step 5 and the choice of constants in the Θ(1/ε)). Now, define D̃ to be the (non-negative) function such that D̃(k) = D(i ℓ ) for k ∈ B ℓ , 1 ≤ ℓ ≤ t; and D′ = D̃/‖D̃‖ 1 to be its normalized version. Clearly, both D̃ and D′ are monotone; it remains to prove that d TV (D, D′) ≤ ε, by first considering the ℓ1 distance between D and D̃. Observe that the contribution of good points to this distance is by definition at most 4ε/c′. As for the leftover bucket B t , it costs at most ε/4. The sum of D(k) − D̃(k) taken over Bad + is at most the sum of D(k) over these points, which is upper bounded by ε/8. Finally, as each bucket B ℓ contains at most (ε/8)|B ℓ | points from Bad − , the sum of D̃(k) − D(k) ≤ D̃(k) taken over all such points is at most an ε/8 fraction of the weight of these buckets according to D̃. Combining the above, for a suitable choice of c′ ≥ 32, leads to ‖D − D̃‖ 1 ≤ 5ε/8; from this and the fact that d TV (D, D′) ≤ ‖D − D̃‖ 1 , we get d TV (D, D′) ≤ ε. Overall, the query complexity is O(t log n · (log log n)/ε² + t · (log log n)/ε³) = Õ((log² n)/ε³ + (log n)/ε⁴).

Estimating symmetric properties
From Section 8, we know that many symmetric properties are "hard" to (additively) estimate in the SAMP model, with a tight and common sample complexity of Θ(n/ log n). This is in particular the case for support size and entropy estimation, as well as distance to uniformity or between unknown distributions. 41 In this section, we consider the analogous questions in the COND model. In a nutshell, the conclusion is that the landscape is much more disparate in this setting: while all "nice" symmetric properties can now be tested and estimated with only polylog(n) queries, this is not necessarily tight: surprisingly, distance to uniformity can be estimated with constantly many queries. Even so, one cannot hope for such drastic improvements in every case: as we shall see, both entropy and support size estimation do require (log log n)^Ω(1) queries. We first state the results, before giving an outline of their proofs.
Theorem 11.15 ([36, Theorem 6.0.1] (Restated informally)). Every "nice" symmetric scalar property of distributions can be tested and estimated by an algorithm making at most poly(log n, 1/ε) conditional queries. Moreover, this only requires (a subset of) INTCOND queries.
As we shall see, these "nice" properties include entropy, support size, and distance between distributions. However, in the specific case of distance to uniformity, it turns out that one can get rid of the dependence on the domain size altogether:

Theorem 11.16 ([34, Theorem 14]). There exists an algorithm which, given PAIRCOND access to an unknown distribution D ∈ ∆(Ω), satisfies the following. On input ε ∈ (0, 1] and δ ∈ (0, 1], it makes O(1/ε²⁰) queries to D and outputs a value τ̂ which, with probability at least 1 − δ , satisfies |τ̂ − d TV (D,U)| ≤ ε.
Combined with the following lower bound, this shows that neither poly(log n, 1/ε) nor O ε (1) is the general answer for estimating or even testing symmetric properties:

Theorem 11.17 ([36, Theorem 7.3.1]). There exists a symmetric property P ⊆ ∆(Ω) and an absolute constant ε 0 > 0 such that the following holds. Any algorithm which, given COND access to an unknown distribution D ∈ ∆(Ω), distinguishes with probability at least 2/3 between (a) D ∈ P and (b) d TV (D, P) ≥ ε 0 , must have query complexity Ω(√log log n).
Quite notably, the proof of this last theorem directly implies the same Ω(√log log n)-query lower bound for support size estimation (both additive and multiplicative) 42 and entropy estimation (additive).
A general poly(log n, 1/ε) upper bound. The argument relies at its core on a primitive introduced by Chakraborty et al., namely, the (explicit) persistent sampler. Somewhat similar to the "approximate EVAL oracle" mentioned in Section 11.2.3, this primitive allows one to simulate from INTCOND D queries both evaluation and sampling oracles for a distribution D̃ that is ε-close to D, with only a poly(log n, 1/ε)-factor overhead. 43 Building on this, they obtain an algorithm which learns a distribution in total variation distance "up to a permutation," with only poly(log n, 1/ε) queries: As the scalar property considered is invariant by permutation, the rest is relatively straightforward: after calling this algorithm with a suitable parameter ε′, it suffices to compute the value of the property on the (explicit) distribution it outputs. The only caveat with this approach is that this "suitable parameter" has to be chosen as a function of the property itself, to guarantee that the computed estimate be close to the target value. That is, it is only practical for scalar properties ϕ that are "weakly continuous" with regard to total variation distance-specifically, those for which there exists a small enough function γ(n, ε) such that |ϕ(D 1 ) − ϕ(D 2 )| ≤ γ(n, ε) whenever d TV (D 1 , D 2 ) ≤ ε.

(Footnote 42: We briefly mention that [1] subsequently gave a Õ(log log n/ε³)-query COND algorithm for multiplicative (and additive) estimation of support size to within a (1 + ε) factor, showing that (log log n)^Θ(1) is indeed the right query complexity for this problem.)

(Footnote 43: The "explicit" refers to the fact that the probability values are explicitly returned by the oracle, as in the case of an actual EVAL oracle; while the "persistent" emphasizes that, on input m, all (at most m) answers simulated by the algorithm will be consistent with the same D̃.)

Remark 11.19.
As described for instance in [100], a simple calculation shows that both entropy and support size are weakly continuous in that sense, with γ(n, ε) = Θ(ε log n) and γ(n, ε) = εn, respectively. (Recall that in the support size estimation problem, it is assumed all non-zero probabilities are at least 1/n.)

Tolerant uniformity testing. The first idea underlying the distance estimation procedure of [34] is to rewrite the total variation distance between D and U as d_TV(D, U) = E_{x∼U}[ψ_D(x)], where ψ_D(x), equal to 1 − nD(x) if D(x) < 1/n and to 0 otherwise, takes values in [0, 1] (as we shall see in Section 12, this technique is extensively used in the Dual setting, where one has access to an evaluation (EVAL) oracle). From this, if one were able to approximate efficiently and accurately enough ψ_D(x) for a uniformly random x, it would be straightforward to estimate the expected value within an additive ε via sampling.
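As a concrete illustration of this identity, here is a minimal sketch in Python. It assumes direct evaluation access to D (a hypothetical `evalD` callable), sidestepping the PAIRCOND-based approximation that the actual algorithm of [34] must perform; the point is only the estimator for E_{x∼U}[ψ_D(x)].

```python
import random

def estimate_distance_to_uniformity(evalD, n, eps, rng=random):
    """Estimate d_TV(D, U) = E_{x~U}[psi_D(x)] by averaging
    psi_D(x) = max(0, 1 - n*D(x)) over uniformly random points.
    Assumes direct evaluation access to D via `evalD` (a simplification)."""
    m = int(4 / eps ** 2)  # enough samples for an additive +-eps estimate
    total = 0.0
    for _ in range(m):
        x = rng.randrange(n)  # x drawn uniformly from the domain {0,...,n-1}
        total += max(0.0, 1.0 - n * evalD(x))
    return total / m
```

For instance, on the distribution putting weight 1/2 on each of two elements of a domain of size 4, the estimate concentrates around the true distance 1/2.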
As the contribution of ψ_D(x) is negligible when D(x) is too small or too big, the algorithm can moreover restrict itself to getting good (multiplicative) approximations only for those elements satisfying D(x) ∈ [ε/n, 1/(εn)]. But now, if it had in hand even one such "reference" element r with D(r) ∈ [ε/n, 1/(εn)], the task would become easy: making O_ε(1) PAIRCOND queries on {x, r} would automatically yield such estimates for the x's that matter, and reveal those whose probability weight is too large or too small to be considered. Having reduced the original question to that of finding such a reference element, Canonne et al. proceed to the core of the proof, i. e., to describe and analyze a subroutine FIND-REFERENCE for this specific task (whose cost dominates the overall query complexity of the algorithm). If this subroutine succeeds in finding a suitable r ∈ Ω, it returns it along with an estimate of its weight D(r); otherwise, it must be the case that there are no or very few good reference points to be found, meaning that the distance to uniformity is very close to 1 anyway.
A specific Ω(√(log log n)) lower bound. After establishing a general polylog(n)-query upper bound for symmetric properties, [36] show that no such approach can ever yield constant query complexity, by describing a particular symmetric property, that of being an "even uniblock distribution," that cannot be tested by any o(√(log log n))-query COND algorithm. The property in question is the set of all distributions D ∈ ∆(Ω) that are uniform on a subset of size 2^{2k} for some (1/8) log n ≤ k ≤ (3/8) log n. Using a technique they introduce (and already briefly mentioned in Section 11.2.3), they are able to restrict the argument to a specific class of testing algorithms, the core adaptive testers, and show that no such algorithm is able to distinguish between an even uniblock distribution D (with parameter k) and another randomly chosen distribution D′, also uniform but on a subset of size 2^{2k+1}.
The key idea of core adaptive testers is to reduce (without loss of generality) the possible actions the tester can take, so as to get a more manageable adversary to argue against. This in particular hinges on the fact that the property of interest is symmetric, in a way that is somewhat reminiscent of the analogous restriction to histograms and fingerprints in the SAMP model. (Very roughly, instead of the actual samples, the algorithm only sees the relations between these samples and the sets queried so far.) By doing so, one can eventually reduce dealing with a general, adversarial COND tester to proving lower bounds against a "deterministic"44 decision-tree-like tester. (Even after doing so, the remaining steps of the proof are intricate, and proceed by induction on the height of this tree in order to rule out with high probability some "bad events," that is, events that would cause the testing algorithm to learn too much from the samples it got.)

Remark 11.20. The fact that the lower bounds of Theorem 11.10 and Theorem 11.18 are similar is not a coincidence, but rather inherent to the technique used. Indeed, the core adaptive tester approach both proofs rely on cannot get past this √(log log n) barrier, which derives from the size of the decision tree representing the tester (namely, q·2^{2q^2} for a q-query tester).

Non-adaptive testing
In this section, we turn to the question of non-adaptive testing in the conditional oracle model, restricting ourselves to algorithms that must specify all the queries they will perform before interacting with the oracle. (In doing so, we set aside the issue of zero-weight query sets, discussed in Section 11.1, as a definitional choice of the model.) This question specifically has been one of the focuses of [36], where upper and lower bounds on uniformity and identity testing are shown. Subsequent work of Acharya, Canonne, and Kamath [1] improved their lower bound of Ω(log log n) on non-adaptive uniformity testing to Ω(log n), effectively settling the query complexity of this question as log^{Θ(1)} n. Quite interestingly, this demonstrates that even deprived of the flexibility and power coming from adaptivity, COND testing algorithms still provide exponential improvements over their SAMP counterparts.

Theorem 11.21 (Testing uniformity non-adaptively). There exists an algorithm which, given COND access to an unknown distribution D ∈ ∆(Ω), satisfies the following. On input ε ∈ (0, 1), it makes poly(log n, 1/ε) non-adaptive queries to D, and • if D = U, then with probability at least 2/3, the algorithm accepts; • if d_TV(D, U) > ε, then with probability at least 2/3, the algorithm rejects.
Furthermore, no non-adaptive COND algorithm making o(log n) queries can correctly perform this task, even for ε = Ω(1).
Moreover, it is worth noting that, similarly to its adaptive counterparts from Section 11.2.2, this non-adaptive tester enjoys some weak tolerance, in the sense that it also accepts distributions that are close to uniform in ℓ∞ distance (see [36, Theorem 4.1.2] for a formal statement). This turns out to be particularly useful: indeed, Chakraborty et al. then employ standard bucketing techniques (as in Section 5.2) to reduce identity testing to O(log n/ε) instances of "weakly tolerant uniformity testing," for which their uniformity tester can be used:

44 We put here the word deterministic between quotes, as core adaptive algorithms still include a probabilistic component, unlike the corresponding definition for non-adaptive algorithms. That is, their behavior does involve some random coin tosses, but of a very limited and fully defined type. (See [36, Definition 7.1.6].)

Theorem 11.22 (Testing identity non-adaptively). There exists an algorithm which, given the full specification of D* ∈ ∆(Ω) and COND access to an unknown distribution D, satisfies the following. On input ε ∈ (0, 1), it makes Õ(log^{27/2} n / ε^{18}) non-adaptive queries to D, and • if D = D*, then with probability at least 2/3, the algorithm accepts; • if d_TV(D, D*) > ε, then with probability at least 2/3, the algorithm rejects.
(Note that, as usual, the lower bound for uniformity testing also applies to identity testing.)

A poly(log n, 1/ε) upper bound. We here give a high-level description of the near-uniformity tester of Theorem 11.21. The algorithm works in two stages: in the first one, it tries to "catch" and detect, by choosing uniformly random subsets of Ω with varying sizes, an element a which has much higher probability weight than the other elements of this random subset. In the second stage, it takes another uniformly random subset of polylogarithmic size, and applies a standard uniformity tester (with very small accuracy parameter) to verify that D is indeed uniform on this subset.
More specifically, the proof relies on the observation that if D is far from uniform, one of the following must hold. In the first case, there is at least one particular size s of the form s = 2^j with log log n ≤ j ≤ log n such that a "typical" uniformly random subset S of size s will contain a heavy element for the distribution D_S (i. e., one with probability much greater than 1/|S|). If so, by performing log^2 n queries on this set the algorithm will then observe a collision (draw the same element at least twice), which by the setting of the parameters would not happen if D were uniform (for any of the sizes s considered).
If this does not happen, then it must be the case that, on a uniformly random subset U of size poly(log n, 1/ε), D_U has with high probability non-negligible distance from uniform, namely, at least ε′ := ε/|U|. But then, invoking the (SAMP) identity tester of Theorem 5.5 on D_U with accuracy parameter ε′ ensures the tester will detect this discrepancy.
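The first, collision-based stage can be sketched as follows, simulating COND queries from an explicitly given distribution; the function names and parameter choices below are ours, not those of [36], and the collision analysis is only valid asymptotically (for s much larger than log^4 n).

```python
import math
import random
from collections import Counter

def cond_sample(D, S, rng=random):
    """Simulate one COND query: a draw from D conditioned on landing in S
    (assumes D(S) > 0; D is given explicitly as a list of probabilities)."""
    return rng.choices(S, weights=[D[x] for x in S])[0]

def stage_one_detects(D, n, j, rng=random):
    """Stage one of the tester for one candidate size s = 2^j: pick a
    uniformly random subset S of size s and look for a collision (some
    element drawn at least twice) among ~log^2 n conditional samples.
    For the sizes the analysis considers, a collision is evidence of an
    element of D_S much heavier than 1/|S|."""
    s = 2 ** j
    S = rng.sample(range(n), s)
    budget = int(math.log2(n)) ** 2
    draws = [cond_sample(D, S, rng) for _ in range(budget)]
    return max(Counter(draws).values()) >= 2
```

For example, a distribution putting weight 0.9 on a single element produces a collision essentially immediately once that element lands in the queried subset.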
A brief glance at the lower bounds. The original Ω(log log n) lower bound of Chakraborty et al. relies on their notion of core adaptive tester, suitably modified for the non-adaptive case, against which they analyze a lower bound construction. In this construction, the yes-instance is the uniform distribution, while the no-instance is a randomly chosen "even uniblock distribution" (as defined in the proof of the symmetric property testing lower bound, Theorem 11.17). While the core non-adaptive tester argument is simpler here than in the adaptive case (in particular, it is no longer necessary to view and analyze the tester as a decision tree), the bound it yields is not tight. Later work by [1] improves this bound to Ω(log n), using more elementary arguments.

Tips and tricks
As in the corresponding section for the standard sampling model, we give here a non-exhaustive list of useful things to consider when working in the conditional query setting.
Sanity checks for lower bounds. As before, when trying to prove a lower bound, always first make sure it would not violate a known upper bound. Specifically, check that the distance to uniformity of your yes- and no-instances is the same: otherwise, O_ε(1) (PAIRCOND) queries will suffice to distinguish them.
Use known low-level primitives. Either because they encapsulate pesky details and allow you to focus on the high-level ideas (e. g., the COMPARE subroutine of [34] to estimate ratios of the form D(X)/D(Y)), or because they may provide features you need without having to reinvent the wheel (ESTIMATE-NEIGHBORHOOD, FIND-REFERENCE (ibid.)).

Use known high-level primitives. The O_ε(1)-query PAIRCOND algorithm for estimating distance to uniformity, and the APPROX-EVAL (COND) and Explicit Persistent Sampler (INTCOND) procedures of [34] and [36, Section 5.2] are powerful tools, the last two effectively providing (almost) the power of an EVAL oracle.

ℓ∞ as a feature. Many known testers provide some weak tolerance with regard to ℓ∞, e. g., because they work by estimating ratios: this sometimes turns out to be quite handy, as shown in the proofs of Theorem 11.13 and Theorem 11.22.
Adaptivity is treacherous. Be careful when proving lower bounds: adaptivity is quite difficult to get a grip on. To deal with it, only a few techniques are available so far, and they are worth thinking of: reductions to easier problems (see Theorem 11.3), hybridization (Theorem 11.5), and core adaptive testers (Theorem 11.10 and Theorem 11.17).
Symmetric properties. For many of them, a good upper bound can be derived from the INTCOND "learner-up-to-permutation" of [36] (Theorem 11.18).

Evaluation queries
This section covers results pertaining to three related models, where the algorithms are granted query access to the distribution, possibly in addition to the usual sampling access. While this type of access gives distribution testing a stronger resemblance to, say, testing of Boolean functions, it is useful to keep in mind that the underlying distance measure remains total variation (while the functional setting is usually concerned with Hamming distance).

The setting(s)
The first type of oracle is an evaluation oracle, similar to the one commonly assumed for testing Boolean and real-valued functions: that is, on query i ∈ Ω it provides the value of the probability density function (pdf) of the underlying distribution D at i.

Definition 12.1 (Evaluation model [89]). Let D be a fixed distribution over Ω. An evaluation oracle for D is an oracle EVAL_D defined as follows: the oracle takes as input a query element x ∈ Ω, and returns the probability weight D(x) that the distribution puts on x.
The second is a dual oracle, which combines the standard model for distributions and the evaluation oracle defined above. In more detail, the testing algorithm is granted access to the unknown distribution D through two independent oracles, one providing samples of the distribution and the other query access to the probability density function.
Definition 12.2 (Dual access model [16, 63, 35]). Let D be a fixed distribution over Ω. A dual oracle for D is a pair of oracles (SAMP_D, EVAL_D) defined as follows: when queried, the sampling oracle SAMP_D returns an element x ∈ Ω, where the probability that x is returned is D(x), independently of all previous calls to any oracle; while the evaluation oracle EVAL_D takes as input a query element y ∈ Ω, and returns the probability weight D(y) that the distribution puts on y.
This type of dual access to a distribution was first considered (under the name combined oracle) in [16] and [63], where the authors address the task of estimating (multiplicatively) the entropy of a distribution, or the f-divergence between two of them; before being reintroduced in [35] as one of the main focuses of the paper. Finally, the last type of oracle we shall cover in this section also provides dual access (both samples and evaluation queries) to the distribution; but this time query access is granted to the cumulative distribution function (cdf).45

Definition 12.3 (Cumulative Dual access model [35]). Let D be a fixed distribution over [n]. A cumulative dual oracle for D is a pair of oracles (SAMP_D, CEVAL_D) defined as follows: the sampling oracle SAMP_D behaves as before, while the evaluation oracle CEVAL_D takes as input a query element j ∈ [n], and returns the probability weight that the distribution puts on [j], that is, D([j]) = ∑_{i=1}^{j} D(i). (This cumulative dual access, as defined, only applies to totally ordered domains.)

Note that in the last two definitions, one can decide to disregard the corresponding evaluation oracle, which amounts to falling back to the standard sampling model. Furthermore, for distributions on Ω = [n] any EVAL_D query can be simulated by (at most) two queries to a CEVAL_D oracle; that is, the cumulative dual model is at least as powerful as the dual one.46

45 To the best of our knowledge, such a cumulative evaluation oracle CEVAL appears for the first time in [20, Section 8].

46 We mention that Canonne and Rubinfeld discuss in [35] a relaxation of these two models, where the queries to the corresponding evaluation oracles are only answered within a multiplicative (1 ± γ) factor. They observe that many of their algorithms can be made robust against such multiplicative noise, while maintaining their query complexity (see, e. g., Table 1 of [35]). Such a relaxation, however, does not preserve the relation between CEVAL and EVAL (one can no longer simulate the latter from the former).

Remark 12.4 (On the relation to ℓ_p-testing for functions on the line). We note that testing distributions with EVAL_D access is strongly reminiscent of the recent results of Berman et al. [21] on testing functions f : [n] → [0, 1] with relation to ℓ_p distances. There are however two major differences, which prevent an easy mapping between the two settings. First, the distance they consider is normalized (by a factor n in the case of ℓ_1 distance), so that a straightforward translation between the two settings would imply replacing ε by ε′ = ε/n, with a corresponding impact on the sample complexity. The second conceptual caveat is that the distance to a class of [0, 1]-valued functions is not directly related to the distance to the analogous class of distributions, which is in general a proper subset of the former.
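The simulation of an EVAL query by at most two CEVAL queries is straightforward; a minimal sketch (the oracle names are ours):

```python
def eval_from_ceval(cevalD):
    """Simulate an EVAL oracle from a CEVAL oracle over [n] = {1, ..., n}:
    D(j) = D([j]) - D([j-1]), so each simulated evaluation query costs
    at most two cumulative queries."""
    def evalD(j):
        return cevalD(j) - (cevalD(j - 1) if j > 1 else 0.0)
    return evalD
```

This is exactly why the cumulative dual model is at least as powerful as the dual one (up to a factor of two in the query complexity).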

Testing identity and closeness of general distributions
Unless specified otherwise, the results covered in this section originate from [35], where tight bounds on the query complexity of testing and tolerant testing of uniformity, identity and closeness are given for the Dual and Cumulative Dual settings.

Testing uniformity, identity and closeness
The first results we describe show that testing uniformity and identity (and, for the case of both dual models, closeness) not only can be performed with a constant number of queries, but also that the dependence on ε itself is as good as one could possibly hope for.
Furthermore, this query complexity is tight.
The first two upper bounds follow from a result of Rubinfeld and Servedio [89, Observation 24] which applies to testing with EVAL_D queries: the idea is to query the probability weight given by D on samples x drawn from the reference distribution D*, hoping to detect some discrepancy between D*(x) and D(x). While this result does not carry over to testing closeness (as it relies on the ability to draw samples from at least one of the two distributions), [34] adapt this algorithm for testing closeness with their construction of an APPROX-EVAL (see Section 11.6). In turn, this directly yields the testing algorithm of Theorem 12.7.
As for the lower bound, it follows from the hardness of distinguishing, even given Cumulative Dual access, between the uniform distribution and a distribution where a random "chunk" of εn + 1 consecutive elements is perturbed, putting all its weight on the first element of the "chunk."

Tolerant testing and distance estimation
Given the results of the previous section, which establish that both dual models enable very efficient testing for the three related questions of uniformity, identity and closeness testing, it is natural to wonder whether similar theorems hold for their tolerant testing counterparts. As shown in [33], this is indeed the case, and it derives from a general technique already mentioned in the proof of Theorem 11.16: the ability to estimate at little cost quantities of the form E_{x∼D}[Φ(x, D(x))] for "nice" functions Φ.
Furthermore, this query complexity is tight for both EVAL_D and Dual accesses.
Furthermore, this query complexity is tight for Dual access.
An O(1/ε^2) upper bound. As previously mentioned, the upper bound (which only requires EVAL access, as well as the ability to sample from at least one of the two distributions involved) follows from a general technique extensively used in [35] and, to a lesser extent, in [63]. In more detail, it boils down to the ability to estimate with very few samples any quantity of the form E_{x∼D}[Φ(x, D(x))], for any bounded function Φ. This in particular applies to total variation distance (and, modulo some technical details, to entropy and support size as well, as described in Section 12.4), observing that d_TV(D, U) = E_{x∼D}[Φ(x, D(x))] for Φ(x, p) = max(0, 1 − 1/(np)), which can be computed from evaluation queries. As an illustration, the tolerant uniformity tester of Theorem 12.8 is described in Algorithm 6.

An Ω(1/ε^2) lower bound. The high-level idea of the lower bound is a reduction from distinguishing between two differently biased coins (from independent tosses, which is "hard" by Fact D.3) to tolerant testing of uniformity (in the dual access model). Specifically, given access to samples from a fixed coin (promised to have one of these two biases), one can define a probability distribution D as follows: the domain Ω = [n] is randomly partitioned into 1/ε^2 pairs of buckets (B_i, B′_i) of equal number of elements. D will be uniform within each bucket, and put equal total weight on every bucket pair. Yet, within each pair (B_i, B′_i) the probability weight is allocated according to a coin toss performed "on-the-fly" when a query is made by the tolerant tester: so that either (a) D(B_i) = (1 + α)D(B′_i), or (b) D(B_i) = D(B′_i) (for some α function of the unknown bias of the coin). Depending on the type of coin, the resulting distribution D will have a different distance from uniformity, so that a tolerant tester must be able to distinguish between the two cases.
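Returning to the upper bound, the Φ-estimation step for distance to uniformity can be sketched as follows; this is a condensed illustration of the technique, not Algorithm 6 itself, and the oracle names are ours.

```python
import random

def dual_distance_to_uniformity(sampD, evalD, n, eps, rng=random):
    """Phi-estimation in the dual model: since
    d_TV(D, U) = E_{x~D}[Phi(D(x))] with Phi(p) = max(0, 1 - 1/(n*p)),
    averaging Phi over O(1/eps^2) sampled points estimates the distance
    to uniformity within an additive eps."""
    m = int(8 / eps ** 2)
    total = 0.0
    for _ in range(m):
        p = evalD(sampD())  # p > 0, since x was just sampled from D
        total += max(0.0, 1.0 - 1.0 / (n * p))
    return total / m
```

Note the expectation is now over x ∼ D (one SAMP plus one EVAL query per term), unlike the E_{x∼U} formulation used in the PAIRCOND setting.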
We note that the lower bound only applies to the Dual access model (and thus a fortiori to EVAL access): the case of Cumulative Dual access remains open, and showing for instance an o(1/ε^2)-query tolerant testing algorithm in this setting for any of the three problems above would result in a strong (and natural) separation between the Dual and Cumulative Dual access models. (See Section 12.5 for a discussion of the current separation results between these two.) More generally, we point out that proving lower bounds in these dual models, as was the case for the COND setting, is quite intricate: as far as the author of this survey is aware, the only techniques available are reductions such as the one above, or ad hoc proofs involving a "needle in a haystack"-type argument.

Testing for structure: monotonicity
In this section, we consider in these new models the problem of testing whether an a priori arbitrary distribution satisfies some structural condition, focusing on the particular case of monotonicity over [n]. The results below originate from [28].
Theorem 12.11 (Testing monotonicity with Cumulative Dual). There exists an algorithm which, given Cumulative Dual access to an unknown distribution D ∈ ∆([n]), satisfies the following. On input ε ∈ (0, 1), it makes Õ(1/ε^4) queries to D, and • if D ∈ M (the class of monotone distributions), then with probability at least 2/3, the algorithm accepts; • if d_TV(D, M) > ε, then with probability at least 2/3, the algorithm rejects.
Furthermore, no algorithm taking o(1/ε) samples can correctly perform this task.
(We note that Canonne also describes a cumulative dual tolerant testing algorithm for monotonicity with query complexity O(log n), although with a restriction on the range of parameters; see Table 3.) Turning to a weaker query model, [28] obtains (nearly) tight bounds, showing that the complexity of testing monotonicity with EVAL queries is logarithmic: Theorem 12.12 (Testing monotonicity with EVAL). There exists an algorithm which, given EVAL access to an unknown distribution D ∈ ∆([n]), satisfies the following. On input ε ∈ (0, 1), it makes O(log n/ε + 1/ε^2) queries to D, and • if D ∈ M, then with probability at least 2/3, the algorithm accepts; • if d_TV(D, M) > ε, then with probability at least 2/3, the algorithm rejects.
Furthermore, no algorithm taking o(log(n)/log log n) samples can correctly perform this task, even for constant ε = 1/2.
The author additionally conjectures in [28] the "right" lower bound for this last problem to be Ω(log(n)/ε), and establishes it for the special case of non-adaptive testing algorithms. We note, however, that nothing specific to the Dual access model is known: that is, while an O_ε(1)-query algorithm exists for Cumulative Dual, no better upper bound than O_ε(log n) has been shown for the Dual setting.
The first result, Theorem 12.11, is the analogue of Theorem 11.12, and relies on the same approach (tailored for the Cumulative Dual model). As for the second, Theorem 12.12, the positive side derives from a result on learning monotone distributions in the EVAL setting, building on a modification of Birgé's argument for the SAMP model. The lower bound itself is obtained by reducing the task to a promise problem on estimating the sum of a non-decreasing sequence, and invoking a result of Sariel Har-Peled [64] for the latter. (Namely, the reduction works by "embedding" such a sequence, summing to 1 or 1 − ε, into a distribution that is either (a) monotone, or (b) has a "bump" of weight ε at a randomly chosen element, but is monotone besides this bump.)

Testing (some) symmetric properties, with and without structure
In contrast to the previous types of oracle covered (see Section 8 and Section 11.4), no general approach to testing symmetric properties is known for the evaluation, dual, or cumulative dual access models. Some specific results exist, however; specifically, for additive and multiplicative estimation of entropy and support size.
The first results we mention relate to multiplicative estimation of entropy given Dual access to an unknown distribution, provided a lower bound on this quantity is known. We then state the additive estimation counterpart for entropy, as well as results pertaining to support size estimation (all in the dual access model). Finally, we cover a result of [35] which shows that under a monotonicity constraint, entropy estimation becomes exponentially easier with Cumulative Dual access (while Dual access queries do not help).

In the Dual access model, any algorithm that, given a parameter h > 0 and the promise that H(D) = Ω(h), estimates the entropy within a multiplicative (1 + γ) factor must have sample complexity Ω(log(n)/(γ(2 + γ)h)).
Turning to the task of additive estimation, one can obtain the following upper bounds for entropy and support size, using similar techniques as for Theorem 12.13 and Section 12.2.2 (that is, carefully massaging the property to estimate into a quantity of the form E_{x∼D}[Φ(x, D(x))] for bounded Φ):

Theorem 12.15. There exists an algorithm which, given Dual access to an unknown distribution D ∈ ∆(Ω), satisfies the following. On input ∆_1, ∆_2 ∈ (0, log n], it makes O((1/(∆_2 − ∆_1)^2) · log^2(n/(∆_2 − ∆_1))) queries to D, and • if H(D) ≤ ∆_1, then with probability at least 2/3, the algorithm accepts; • if H(D) ≥ ∆_2, then with probability at least 2/3, the algorithm rejects; where H(D) = −∑_{x∈Ω} D(x) log D(x) denotes the (Shannon) entropy of the distribution. Furthermore, this query complexity is tight: no algorithm making o(log^2(n)/(∆_2 − ∆_1)^2) queries can correctly perform this task, even when granted Cumulative Dual access.
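The upper-bound direction rests on the same Φ-estimation technique; here is a minimal sketch for entropy, under the explicitly stated (and simplifying) assumption that log(1/D(x)) is bounded on the support, with function names of our choosing.

```python
import math
import random

def dual_entropy_estimate(sampD, evalD, delta, log_ub, rng=random):
    """Additive entropy estimation in the Dual model (sketch): write
    H(D) = E_{x~D}[log2(1/D(x))] and average the integrand over samples.
    Assumes log2(1/D(x)) <= log_ub for every x in the support, so that
    Hoeffding's inequality gives a +-delta estimate from
    O((log_ub/delta)^2) SAMP+EVAL query pairs."""
    m = int(4 * (log_ub / delta) ** 2)
    return sum(math.log2(1.0 / evalD(sampD())) for _ in range(m)) / m
```

On the uniform distribution over 8 elements every term equals 3 bits, so the estimate is exact; in general it concentrates around H(D).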
[35] originally proved both upper and lower bounds in the Dual model; [27] later extended the latter to Cumulative Dual access as well. Observe that the bound stated in Theorem 12.15 differs from that of Theorem 12.13 in some regimes of parameters, e. g., ∆_2 − ∆_1 = γh > 1 and h > 1; and does not require any lower bound h > 0 as input. The lower bounds of Theorem 12.15 and Theorem 12.16 proceed (once again) by a reduction to the problem of distinguishing a fair from a biased coin, where the reduction is performed "on-the-fly" to answer the dual (or cumulative dual) queries while asking for only one coin toss at a time.
Leveraging structure: entropy of (close to) monotone distributions. It is reasonable to ask if the stronger queries granted in the Cumulative Dual access model could help for these estimation tasks. Intuitively, this should be the case whenever the distribution presents some additional property (related to an underlying total order) that cumulative queries can leverage, as in the case of monotonicity. Canonne and Rubinfeld establish in [35] the following two results, showing that cumulative queries do enable significant improvements over the Dual access setting in specific cases:

Theorem 12.17. In the Cumulative Dual access model, there exists an algorithm that estimates the entropy of distributions (on [n]) guaranteed to be O(1/log n)-close to monotone to an additive ∆ = Ω(1), with sample complexity Õ((log log n)^2).

Theorem 12.18. In the Dual access model, any algorithm that estimates the entropy of distributions (on [n]) guaranteed to be O(1/log n)-close to monotone within an additive constant must make Ω(log n) queries to the oracle.
On one hand, the upper bound draws again on properties of the Birgé decomposition to reduce the effective domain size to logarithmic, while preserving the ability to simulate Cumulative Dual access to this reduced distribution. On the other hand, the lower bound of Theorem 12.18 proceeds by building two families of instances with very different support size, enough for the corresponding entropies to differ by a constant. However, in both types of instances (1 − 1/ log n) probability weight is put on the very first element of the domain, effectively "masking" the rest of the support from SAMP queries (while estimating its size with EVAL queries amounts to finding a needle in a haystack).

Separating the three models
One immediate question is whether the ability to query the cumulative distribution function, rather than only the probability mass function, enables significant savings for some natural properties: that is, is there a gap between Cumulative Dual and Dual access? We first observe that it is easy to obtain a separation between the EVAL and Dual access models, although for a contrived testing problem. Namely, consider the property P defined as the set of distributions D ∈ ∆(Ω) putting equal weight on two distinct elements of the domain. Given Dual (or, for that matter, even SAMP) access, a constant number of queries suffices to test P; while any EVAL testing algorithm must perform Ω(n) queries to distinguish a random distribution in P from (say) a distribution putting all its weight on a single element. (Along with the sample complexity of testing uniformity, this also shows that the EVAL and SAMP models are incomparable.)

Separating the power granted by Dual and Cumulative Dual accesses, however, is much trickier. One possible candidate could be monotonicity testing, where the Cumulative Dual setting enjoys a constant-query upper bound (but no O_ε(1)-query algorithm is known for Dual access). Another would be tolerant uniformity or identity testing, as hinted in Section 12.2.1: specifically, by proving an o(1/ε^2) upper bound in the Cumulative Dual access model. Intuitively, any such separation should take advantage of the underlying total order of the domain, as this is the crucial aspect vindicating the Cumulative Dual setting. A preliminary result of this sort was obtained in [35, Section 4.2], where the authors show that estimating the entropy of distributions (very) close to monotone can be done with Õ((log log n)^2) cumulative dual queries, while Ω(log n) are required in the Dual access model.

Tips and tricks
As in the previous models, we give here an (alas) non-comprehensive list of useful things to consider when working in the extended access settings.
Reductions for lower bounds. To the best of the author's knowledge, the two main approaches to obtain lower bounds are either by a reduction to a simpler and well-understood problem (e. g., the biased coin one, as in Theorem 12.8 and Theorem 12.16), or by separately "disabling" both oracles and arguing against the evaluation one via a needle-and-haystack argument. Note that once again, a key aspect in the reductions or custom-tailored arguments is to handle the case of adaptive queries.
Use the Φ Hammer. As illustrated in Section 12.2 and Section 12.4, a general technique for getting upper bounds in the Dual model is the ability to estimate "cheaply" any quantity E_{x∼D}[Φ(x, D(x))], as long as Φ is a bounded function.
Leverage the total order. For the specific case of Cumulative Dual access, Section 12.3 and Theorem 12.15 illustrate how to take advantage of results such as the Birgé decomposition to reduce the problem to a much smaller domain. What is particularly useful here is the "self-reducibility" of Cumulative Dual access with relation to flattenings: that is, the ability to simulate cumulative dual oracle access to any histogram D̄ induced by D on a known partition, given cumulative dual access to D.
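A sketch of this self-reducibility, here for the induced coarse distribution over the intervals of a known partition (from which the flattened distribution on [n] is then easily simulated); the function names are ours.

```python
import bisect
import random

def reduce_cumulative_dual(sampD, cevalD, endpoints):
    """'Self-reducibility' of Cumulative Dual access (sketch): given
    cumulative dual access to D on [n] and a known partition of [n] into
    consecutive intervals (described by their sorted right endpoints),
    simulate cumulative dual access to the induced distribution over the
    intervals, at a cost of one original query per simulated query."""
    def coarse_samp():
        # The interval containing a sample x ~ D is itself a sample from
        # the induced coarse distribution over interval indices.
        return bisect.bisect_left(endpoints, sampD())
    def coarse_ceval(j):
        # Total weight of intervals 0..j is D([1..endpoints[j]]).
        return cevalD(endpoints[j])
    return coarse_samp, coarse_ceval
```

This is what lets Birgé-style arguments shrink the effective domain from n to O(log n) intervals while keeping full cumulative dual access.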

Collections of distributions
This section covers the models and results of Levi, Ron, and Rubinfeld [73, 74], pertaining to (joint) testing of properties of several distributions. One can see this model as capturing situations where samples originate from many distributions, and where some joint property of these distributions is of interest. For instance, given poll results from various cities or states, can we conclude that the population's preferences are homogeneous across the country? From measurements returned by different sensors in a grid, can we be confident all are calibrated similarly, i. e., have the same mean value?

The setting
Introduced in [73], this framework is concerned with properties of collections of m distributions D = (D_1, ..., D_m) over the same domain Ω.50 In the sampling model, each query returns a pair (i, x), where the index i is drawn from [m] and x is then a sample from D_i; in the more general variant, the index i is not uniform, but instead follows an underlying distribution W over [m]. Two variants of the framework can then be studied: the known-weights sampling model, where the algorithm is provided as input with the full description of W; and the unknown-weights sampling model, where it has no prior knowledge of W. In both cases, the distance criterion from Equation (13.1) becomes the corresponding W-weighted average of the distances between the individual distributions.

50 Note that we assume both m and Ω to be known to the algorithms.

Relation to other models
Our choice of notations SAMP^m_D, QCOND^m_D (specific to this survey) is not innocent, and hints at connections between this testing framework and other settings already described. Specifically, we first observe that, as defined in Equation (13.1) (or more generally in Equation (13.2)), the distance between two collections D, D′ can be rewritten as the distance between two single distributions over the product space [m] × Ω (obtained by drawing an index i and then a sample from D_i, resp. D′_i). Therefore, testing a property of collections of distributions amounts to testing a (related) property of single distributions, but on a product space. This connection enables [73] to derive bounds on testing independence in the SAMP model from their results on testing equivalence in the collection setting (see Section 6.5 and Section 13.3).
A byproduct of this rephrasing is that the query model can be seen as a restricted type of conditional access to the distribution, namely, one where the algorithm can only condition on sets of the form {i} × Ω.
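To make the product-space rewriting above concrete, the following sketch (assuming, as in the W-weighted distance of Equation (13.2), that the distance between collections is the W-weighted average of the pairwise total variation distances) checks that it coincides with the total variation distance between the two induced distributions over [m] × Ω:

```python
def tv(p, q):
    """Total variation distance between two pmfs given as dicts."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def collection_distance(W, Ds, Dps):
    """W-weighted distance between two collections (as assumed here)."""
    return sum(W[i] * tv(Ds[i], Dps[i]) for i in range(len(W)))

def product_space_distribution(W, Ds):
    """Distribution of (i, x) with i ~ W and then x ~ D_i, over [m] x Omega."""
    return {(i, x): W[i] * Ds[i][x] for i in range(len(W)) for x in Ds[i]}

W = [0.5, 0.3, 0.2]
Ds  = [{"a": 0.6, "b": 0.4}, {"a": 0.5, "b": 0.5}, {"a": 0.1, "b": 0.9}]
Dps = [{"a": 0.6, "b": 0.4}, {"a": 0.9, "b": 0.1}, {"a": 0.3, "b": 0.7}]

lhs = collection_distance(W, Ds, Dps)
rhs = tv(product_space_distribution(W, Ds), product_space_distribution(W, Dps))
```

The two quantities agree because the W factor is common to both product-space distributions, so it factors out of every term of the ℓ_1 sum.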

Testing equivalence and clusterability
This section describes the results of [73] on testing equivalence and clusterability of collections of distributions, in both the sampling and query models, as well as their implications for testing independence in the standard sampling setting. Note that we merely provide here the statements of the results and the necessary definitions. • if dist(D, P^{eq}_{n,m}) > ε, then with probability at least 2/3, the algorithm rejects.
Furthermore, this algorithm also works in the unknown-weights sampling model.
Theorem 13.4. There exists an algorithm which, given SAMP^m access to an unknown collection of distributions D ∈ ∆(Ω)^m, satisfies the following. On input ε ∈ (0, 1), it takes Õ((n^{1/2}m^{1/2} + n) · poly(1/ε)) samples from D, and • if dist(D, P^{eq}_{n,m}) ≤ Õ(ε^3/√n), then with probability at least 2/3, the algorithm accepts; • if dist(D, P^{eq}_{n,m}) > ε, then with probability at least 2/3, the algorithm rejects. Furthermore, this algorithm also works in the known-weights sampling model.
(For the detailed statement of the weak tolerance guarantee provided by these two algorithms, the reader is referred to [73], Theorems 6.11 and 6.13, respectively.) The main technical contributions of this work lie in proving (nearly) tight lower bounds on testing P^{eq}_{n,m} in their sampling model, and, as a corollary, a corresponding lower bound for testing independence in the (standard) SAMP setting (as covered in Section 6.5). We briefly mention the key use of Poissonization in establishing Theorem 13.5.
Theorem 13.5. There exist absolute constants ε_0 > 0 and c > 0 such that the following holds. Any algorithm which, given SAMP^m access to an unknown collection of distributions D ∈ ∆(Ω)^m, distinguishes with probability at least 2/3 between (a) D ∈ P^{eq}_{n,m} and (b) dist(D, P^{eq}_{n,m}) > ε_0 must have sample complexity Ω(n^{2/3}m^{1/3}), as long as n ≥ cm log m. Coming back to upper bounds, a generalization of P^{eq}_{n,m} is the (k, β)-clusterability property P^{clust}_{n,m,k,β}, where instead of the m distributions D_1, ..., D_m all being equal to one given D*, it is only required that they can be partitioned into k disjoint clusters S_1, ..., S_k such that, for every j and every i ∈ S_j, d_TV(D_i, D*_j) ≤ β (for some choice of D*_1, ..., D*_k). In particular, k = 1 and β = 0 yield P^{clust}_{n,m,1,0} = P^{eq}_{n,m}, where as before n = |Ω|.
[73] describes a tester for (k, β)-clusterability in the query model, which in particular implies a similar tester for equivalence: Theorem 13.8. There exists an algorithm which, given QCOND^m access to an unknown collection of distributions D ∈ ∆(Ω)^m, satisfies the following. On input ε ∈ (0, 1) such that ε > 8β√n, it makes O(kn^{2/3} · poly(1/ε)) queries to D, and • if D ∈ P^{clust}_{n,m,k,β}, then with probability at least 2/3, the algorithm accepts; • if dist(D, P^{clust}_{n,m,k,β}) > ε, then with probability at least 2/3, the algorithm rejects. For this property, Levi, Ron, and Rubinfeld establish upper and lower bounds under both query and (uniform-weights) sampling access, showing in particular that in the latter a strong dependence on m is unavoidable.

Testing for similar means
Theorem 13.9. There exists an algorithm which, given QCOND^m access to an unknown collection of distributions D ∈ ∆([n])^m, satisfies the following. On input ε ∈ (0, 1) and γ > 0, it makes Õ(1/ε^2) queries (independent of γ) to D, and • if D ∈ P^{µ}_{γ,n}, then with probability at least 2/3, the algorithm accepts; • if dist(D, P^{µ}_{γ,n}) > ε, then with probability at least 2/3, the algorithm rejects. Furthermore, this sample complexity is nearly tight: no algorithm making o(1/ε^2) queries can correctly perform this task.
Furthermore, no algorithm taking fewer than (1 − γ)m^{ε/γ} samples can correctly perform this task.
The lower bound of Theorem 13.9 relies on a reduction to our favorite problem: distinguishing biased from fair coins. The lower bound of Theorem 13.10 is more involved, however, and hinges as a first step on the design of two families of distributions with matching first moments (and "therefore" hard to distinguish). This construction builds on properties of Chebyshev polynomials. (As for the upper bound, it essentially works by simulating the algorithm of Theorem 13.9, i.e., obtaining enough collisions in the samples to provide query access to D.) Extensions. In the last part of their work, the authors also study the specific case of Ω = {0, 1} (i.e., n = 2), where they are able to obtain algorithms (in the sampling access setting) with better sample complexity. (One of these upper bounds builds on a result of [73] for testing equivalence, Theorem 13.4.) They also consider testing similarity of means under a different metric, the Earthmover distance, and argue that their results carry over to this setting. Finally, they describe a generalization of P^{µ}_{γ,n}, extending it to several clusters of similar means (k-clusterability of means).
Following the first version of this survey, two works have been published which touch upon or settle some of the problems covered in this section. Aliakbarpour, Blais, and Rubinfeld [11] introduce a new distribution testing problem, that of junta distributions; and show a connection to testing uniformity of (weighted) collections of distributions. Diakonikolas and Kane [47] address the question of equivalence testing in this model and obtain the tight sample complexity for this question, thus improving on the results of Section 13.3.

Competitive testing
This section describes the model of competitive testing introduced and considered in [2,3,6], where the testing algorithm has to compete with the sample complexity of an almost omniscient "ground-truth tester." That is, on any instance of a property testing problem, the sample complexity of the algorithm is compared to that of an ad hoc algorithm specifically designed for this instance, a "genie" which knows almost everything there is to know beforehand.

The setting
The type of access to the distributions is almost the same as in the SAMP setting: the algorithms are provided with independent samples from one or several distributions. The difference now lies in that they are not given as input a parameter ε ∈ (0, 1), but instead an integer m corresponding to the number of samples they are allowed to take (i.e., that are available). The following definition will be necessary to define what "competitive" means, i.e., with regard to what a tester has to compete: an instance (D, D′) is said to be (m, δ)-testable if there exists an algorithm T* which, given the full descriptions of D and D′ along with m samples from an unknown D̂ ∈ {D, D′}, satisfies: • if D̂ = D, then with probability at least 1 − δ, T* accepts; • if D̂ = D′, then with probability at least 1 − δ, T* rejects.
(As before, a similar definition holds for properties over pairs or tuples of distributions.) Given this, we are in position to state what a competitive testing algorithm is: Definition 14.2. Let P ⊆ ∆(Ω) be a property of distributions, and let ψ : N → N be a non-decreasing function. A testing algorithm T for P is said to be ψ-competitive if the following holds. Given SAMP access to an unknown distribution D ∈ ∆(Ω) and on input m ≥ 1, T takes ψ(m) samples from D, and • if D ∈ P, then with probability at least 2/3, it accepts; (We specify that the descriptions of D and D′ are only provided up to relabeling of the elements, since the actual "identity" of the samples does not matter here; i.e., for any permutation π of Ω, (D ∘ π, D′ ∘ π) and (D, D′) have the same description.)
• if D ∉ P, then with probability at least 2/3, it rejects; as long as D is (m, 1/3)-testable. (In particular, if D is not (m, 1/3)-testable, any behavior of T is acceptable.) The goal is therefore to obtain testing algorithms for as slowly growing a function ψ as possible, with the identity function being the Holy Grail. Moreover, note that no dependence on n = |Ω| is explicitly mentioned: in particular, Ω could be infinite or unknown.
Remark 14.3. We only describe here the gist of the results from [2,3]; in particular, these papers contain further observations and useful lemmas, relating (m, 1/3)-testability to general (m, δ)-testability, as well as results on the useful notions of patterns and profiles of sequences of samples (see the next section for a brief mention of the latter).

Testing closeness of general distributions
For the problem of closeness testing, the property P is as before the set of pairs of identical distributions, and an instance is a pair (D_1, D_2). The resulting guarantee is as follows: • if D_1 = D_2, then with probability at least 2/3, the algorithm accepts; • if D_1 ≠ D_2, then with probability at least 2/3, the algorithm rejects; as long as (D_1, D_2) is (m, 1/3)-testable. Furthermore, any ψ′-competitive testing algorithm must satisfy ψ′(m) = Ω(m^{7/6}).
The upper bound is obtained by an algorithm combining a (variant of a) χ²-test and a test based on the "profiles" of the two sequences of samples obtained (i.e., a statistic summarizing all the information about multiplicities and collisions among the samples, very similar in that to the notions of fingerprint and histogram from Section 8). What the authors show is that if the distributions are (m, δ)-different, then at least one of these two tests will detect it; namely, if the "reference genie" focuses on the samples with big probabilities, the χ²-test will capture the discrepancy, while if the genie mostly uses the samples with low probabilities, the profile-based test will perform well.
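To give a flavor of the first ingredient, here is a standard χ²-type closeness statistic from this literature (not necessarily the exact variant of [3]; the toy data and thresholds are purely illustrative). It is small in expectation when the two samples come from the same distribution, and grows with the discrepancy on frequently drawn elements:

```python
import random
from collections import Counter

def chi2_statistic(xs, ys):
    """Chi-square-type closeness statistic over two equal-size sample multisets."""
    X, Y = Counter(xs), Counter(ys)
    stat = 0.0
    for i in set(X) | set(Y):
        xi, yi = X[i], Y[i]
        # The -xi - yi correction removes the Poisson/binomial sampling noise,
        # making the summand (nearly) zero-mean when the distributions agree.
        stat += ((xi - yi) ** 2 - xi - yi) / (xi + yi)
    return stat

rng = random.Random(7)
m = 2000
close_1 = rng.choices(range(10), k=m)   # both uniform on {0, ..., 9}
close_2 = rng.choices(range(10), k=m)
far_1 = rng.choices(range(5), k=m)      # disjoint supports
far_2 = rng.choices(range(5, 10), k=m)

z_close = chi2_statistic(close_1, close_2)
z_far = chi2_statistic(far_1, far_2)
```

On the "far" pair every element is seen by only one of the two samples, so each term contributes roughly its full count, while on the "close" pair the corrected terms nearly cancel.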
For the lower bound, it "suffices" to give a family F of pairs of instances (D_1, D_2), along with a specific algorithm that can distinguish (D_1, D_2) from (say) (D_1, D_1) with a number m of samples given the description of D_1, D_2; but such that no algorithm can do this with fewer than Ω(m^{7/6}) samples without this description. In more detail, [3] defines a fixed distribution D over m^{1/3}/log m elements, as well as a family Q of 2^{m^{1/3}/(2 log m)} distributions that are obtained by randomly perturbing D independently on every pair of consecutive elements. An instance is then a pair (D, D′) where D′ is chosen uniformly at random in Q (so that the hard problem is essentially testing identity to D). The authors show that an algorithm that knows D′ can distinguish (D, D′) from (D, D) with m samples; but (from a minimax argument à la Le Cam; see Appendix E) an algorithm which only knows that D′ is in Q must use Ω(m^{7/6}) samples.

Testing with structure
Subsequent work of Acharya, Jafarpour, Orlitsky, and Suresh [6] in this framework focuses on the task of testing uniformity of monotone distributions (without loss of generality, over the domain Ω = [0, 1]). In this setting, a monotone distribution D ∈ ∆([0, 1]) is (m, δ )-testable if there exists an algorithm with sample complexity m which, given as input the full description of D (up to a permutation of the domain), can distinguish with probability at least 1 − δ between D and U([0, 1]).
By leveraging a binning technique, based on a decomposition of monotone distributions reminiscent of Birgé's, and once again a χ²-test-type testing algorithm, they manage to establish a tight bound on the competitiveness of any algorithm for this question: their tester is ψ-competitive for ψ(m) = O(m√(log m)), as long as D is (m, 1/3)-testable. Furthermore, this is tight: any ψ′-competitive testing algorithm must satisfy ψ′(m) = Ω(m√(log m)).
(The lower bound is proven by defining, given m ≥ 1, an explicit family Q of monotone distributions such that each fixed, specified D ∈ Q can be distinguished from uniform using m samples, but yet, information-theoretically (by a minimax argument), a random choice of D ∈ Q requires Ω(m√(log m)) samples.) Note, regarding the closeness construction of the previous subsection, that it is consistent with the identity testing upper and lower bounds of Theorem 5.5: indeed, as defined in [3] any D′ ∈ Q is ε-far from D for ε = Θ(1/m^{1/2}), and the support size is n = m^{1/3}/log m, so that √n/ε² = m^{7/6}/√(log m).
Before concluding this survey, we remind the reader that the material covered was, sadly, not exhaustive (exhausting, at best). In particular, we only briefly mention some of the topics that were left out, and may have deserved a more thorough coverage, had we had the strength.
• Instance-optimal testing. Barely hinted at in Section 5.2, this model introduced in [99] aims at going beyond the worst-case analysis in the SAMP model by achieving sample complexities that are optimal on a case-by-case basis. Roughly speaking, to test for a property P, the number of samples should only depend on known quantities pertaining to P, not on the size n of the domain Ω. (In the case of testing identity to a known distribution D*, for example, the sample complexity would have to be a function of D* and ε only; in this case it is related to the 2/3-quasinorm ‖D*‖_{2/3}.)
• Testing in other metrics. Instead of total variation or ℓ_2, one can consider testing with regard to general ℓ_p norms (e.g., as in [101], to pinpoint the exact tradeoff between n, p, and ε), or testing closeness for classes of distances such as f-divergences (which include the Jensen-Shannon, Hellinger, and triangle divergences) [63].
• Testing with unequal samples. In the specific case of closeness testing (here in the SAMP model), [7,22] consider the situation where the two distributions are not "created equal." That is, they work in the setting where getting samples from one of the two distributions is more expensive than from the other, or equivalently where one does not get as many samples from the first and the second distribution; and analyze the optimal tradeoff one can obtain between these two distinct sample complexities.
• Connections to (standard) property testing. Although most of the work covered in this survey (as well as most of the literature on the topic, to the best of our knowledge) appears to be disjoint from the property testing literature on, say, Boolean functions, there exist some connections. In particular, Goldreich and Ron articulate in [61, Section 6.3] the relation between sample-based testing of symmetric properties of functions and distribution testing.

Part V Conclusion
As concluding remarks, we describe a few particularly appealing research directions: Lower bounds via communication complexity. Recent work of Blais, Brody, and Matulef [25] (see also [57]) establishes a very elegant framework for leveraging communication complexity results to obtain property testing lower bounds. Obtaining an analogous transference technique applicable to distribution testing, in any of the models touched upon in this survey, would be of great interest. Instance-optimal testing, and testing under structural constraints. Results circumventing the worst-case analysis (as briefly mentioned in Section 5.2 with the results of [99]), or bypassing hardness results by considering natural restrictions of testing problems (e.g., testing for a property P under the promise that the distribution originates from some class C), as in [43,49,48], both seem legitimate and exciting directions for future research.
The following extension of the multiplicative bound is useful when we only have upper and/or lower bounds on P: Corollary B.5. In the setting of Theorem B.3, suppose that P_L ≤ P ≤ P_H. Then for any γ ∈ (0, 1], we have Finally, one also has the following corollary of Theorem B.3: Corollary B.6. Let 0 ≤ w_1, ..., w_m ∈ R be such that w_i ≤ κ for all i ∈ [m], where κ ∈ (0, 1]. Let X_1, ..., X_m be i.i.d. Bernoulli random variables with Pr[X_i = 1] = 1/2 for all i, and let X = ∑_{i=1}^m w_i X_i and W = ∑_{i=1}^m w_i. For any γ ∈ (0, 1], and for any B > e·W, Pr[X > B] < 2^{−B/κ}.

C Metrics over ∆(Ω)
We state the definitions of distances below in full generality: unless specified otherwise, they also hold for probability distributions P, Q over the "usual" probability space (R^d, B_{R^d}), as long as they are absolutely continuous relative to the Lebesgue measure. When considering the case of some discrete set Ω ⊆ R^d, the corresponding σ-algebra is the discrete σ-algebra over Ω. Due to the scope of this survey, we left out many much-studied semi-metrics on distributions; for a more complete introduction to those, as well as a summary of the relations between them, we refer the reader to [54].

C.1 More on Total Variation
As stated in (3.1), the total variation distance admits two different expressions: the standard one, as the supremum over all (Borel) sets S of (P(S) − Q(S)); and the other, as half the ℓ_1 distance between P and Q. These two definitions are easily shown to be equivalent; this result is known as Scheffé's Identity.
Lemma C.1 (Scheffé's Identity [93]). For P, Q two probability measures with densities p, q, d_TV(P, Q) = (1/2)‖p − q‖_1 = sup_S (P(S) − Q(S)). It is often crucial, when dealing with independent draws from two distributions, to bound the variation distance between the resulting tuples. This is done in the lemma below, using properties of the total variation distance (essentially Chernoff and union bounds).
Lemma C.2 (Direct Product Lemma [92, Lemma 3.4]). For P, Q ∈ ∆(Ω) and an arbitrary integer m ≥ 1, let the distributions P^⊗m and Q^⊗m on Ω × ··· × Ω be defined by drawing m independent samples s_1, ..., s_m from P (resp. Q) and returning the m-tuple (s_1, ..., s_m). Then, writing α := d_TV(P, Q), P^⊗m and Q^⊗m satisfy d_TV(P, Q) ≤ d_TV(P^⊗m, Q^⊗m) ≤ 1 − (1 − α)^m. Furthermore, there exist distributions for which the upper bound is achieved.
(Fact C.6, by using properties of the Hellinger distance, will tighten these bounds.) More generally, the following folklore bound holds: Fact C.3. For P_1, Q_1 ∈ ∆(Ω_1) and P_2, Q_2 ∈ ∆(Ω_2), d_TV(P_1 × P_2, Q_1 × Q_2) ≤ d_TV(P_1, Q_1) + d_TV(P_2, Q_2).
C.2 Hellinger distance
Definition C.4 (Hellinger distance). For P, Q two probability measures on R^d with densities p, q, the Hellinger distance is defined as d_H(P, Q) = (1/√2)‖√p − √q‖_2, and takes value in [0, 1]. If P, Q are discrete distributions over some set Ω, this becomes d_H(P, Q)² = (1/2)∑_{x∈Ω}(√(P(x)) − √(Q(x)))² = 1 − ∑_{x∈Ω}√(P(x)Q(x)). The result below relates Hellinger and total variation distances: Theorem C.5 ([15, Corollary 2.39]). For any probability distributions P, Q ∈ ∆(Ω), d_H(P, Q)² ≤ d_TV(P, Q) ≤ √2 · d_H(P, Q). The Hellinger distance is particularly useful when trying to bound the distance between two m-tuples of independent draws, one from P and one from Q.
Indeed, in contrast with the total variation distance, the Hellinger distance between the tuples P^⊗m and Q^⊗m has an exact expression in terms of the distance between P and Q: 1 − d_H(P^⊗m, Q^⊗m)² = (1 − d_H(P, Q)²)^m, where the quantity 1 − d_H(P, Q)² is sometimes called the Hellinger affinity between P and Q.
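A quick numerical check of this product rule, for distributions given as pmfs over a small finite domain (the example distributions are arbitrary):

```python
import math
from itertools import product

def hellinger_sq(p, q):
    """Squared Hellinger distance between two pmfs given as dicts."""
    support = set(p) | set(q)
    return 1.0 - sum(math.sqrt(p.get(x, 0.0) * q.get(x, 0.0)) for x in support)

def power(p, m):
    """The m-fold product distribution, as a dict over m-tuples."""
    out = {}
    for xs in product(p, repeat=m):
        w = 1.0
        for x in xs:
            w *= p[x]
        out[xs] = w
    return out

P = {"a": 0.7, "b": 0.3}
Q = {"a": 0.4, "b": 0.6}
m = 4
# The Hellinger affinity 1 - d_H^2 tensorizes exactly:
lhs = 1.0 - hellinger_sq(power(P, m), power(Q, m))
rhs = (1.0 - hellinger_sq(P, Q)) ** m
```

The identity holds because the affinity of a product distribution factors as the product of the coordinate-wise affinities.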
Combining with Lemma C.2 (whose upper bound is better for small values of the total variation distance) yields Fact C.6, whose leftmost inequality holds for any τ > 0, for α small enough.
For more details on this metric, see, e. g., [

C.3 Kolmogorov distance
Definition C.7 (Kolmogorov distance). Assuming the domain is totally ordered (e.g., Ω = [n], or Ω ⊆ R), one can also define the Kolmogorov distance between P and Q as d_K(P, Q) = sup_x |F_P(x) − F_Q(x)|, where F_P and F_Q are the respective cumulative distribution functions (cdf) of P and Q. Thus, the Kolmogorov distance is the ℓ_∞ distance between the cdf's; and this directly implies d_K(P, Q) ≤ d_TV(P, Q) ∈ [0, 1].
The main reason to introduce and consider this looser metric is the use of the Dvoretzky-Kiefer-Wolfowitz inequality (Theorem D.1), whose guarantees are stated in terms of Kolmogorov distance.
C.4 Earthmover's distance
Definition C.8 (Earthmover's distance). For P, Q two probability measures defined on a metric space (Ω, δ), with densities p, q, the Earthmover's distance (EMD) is defined as d_EMD(P, Q) = inf_{γ∈Γ(p,q)} E_{(X,Y)∼γ}[δ(X, Y)], where Γ(p, q) ⊆ ∆(Ω × Ω) denotes the set of probability distributions with marginals p and q. It is also called the Wasserstein metric.
Intuitively, this metric measures the amount of "dirt" (probability weight) one has to transfer from some place x to some other place y of the domain in order to transform P into Q, where each such unit transfer has a cost δ(x, y) (the closer x and y, the cheaper the transfer). This can be made precise when the domain Ω is finite (e.g., [n]), in which case the EMD has a clean characterization as the solution of a minimum-cost flow optimization problem (see, e.g., [50]). Note that the EMD depends heavily on the choice of the underlying metric on Ω; for instance, [98] chose to equip (0, 1] with the metric δ: (x, y) ↦ |log x − log y|, which endows the corresponding EMD with properties suited to their needs.
As a last property of the EMD, the following theorem of Kantorovich and Rubinstein [68] gives a third characterization of the distance when the distributions have bounded support (e.g., Ω is bounded): Theorem C.9. For P, Q ∈ ∆(Ω) with bounded support, d_EMD(P, Q) = sup_{f∈Lip_1} (E_{X∼P}[f(X)] − E_{Y∼Q}[f(Y)]), where Lip_1 denotes the set of all 1-Lipschitz functions from Ω to R.
Remark C.10. In the case Ω ⊆ R, recall that d_EMD(P, Q) = ‖F_P − F_Q‖_1. This yields a general formulation for the three distances: total variation is half the ℓ_1 distance between the densities, the Kolmogorov distance is the ℓ_∞ distance between the cdf's, and the EMD is the ℓ_1 distance between the cdf's.
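For distributions over [n] with unit spacing, this unified view translates directly into code; the last assertion below instantiates the Kantorovich-Rubinstein dual bound with the 1-Lipschitz function f(x) = x, so the gap between the means can never exceed the EMD (the example pmfs are arbitrary):

```python
def cdf(p):
    """Cumulative sums of a pmf given as a list."""
    out, acc = [], 0.0
    for w in p:
        acc += w
        out.append(acc)
    return out

def d_tv(p, q):    # half the l1 distance between the pmfs
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def d_k(p, q):     # l_infinity distance between the cdfs
    return max(abs(a - b) for a, b in zip(cdf(p), cdf(q)))

def d_emd(p, q):   # l1 distance between the cdfs (unit spacing on [n])
    return sum(abs(a - b) for a, b in zip(cdf(p), cdf(q)))

P = [0.1, 0.4, 0.3, 0.2]
Q = [0.3, 0.3, 0.2, 0.2]
mean_gap = abs(sum(i * w for i, w in enumerate(P))
               - sum(i * w for i, w in enumerate(Q)))
```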
For more details on this metric, see, e. g., [44].

D A non-comprehensive toolkit D.1 Fundamental results
We first recall a fundamental fact from probability theory, the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality. Informally, this states that one can learn the cumulative distribution function of a distribution to additive error ε in ℓ_∞ distance (i.e., learn the distribution in Kolmogorov distance) by taking only O(1/ε²) samples from it.
Theorem D.1 ([51,75]). Let D be a distribution over any (possibly continuous) domain Ω ⊆ R. Given m independent samples x_1, ..., x_m from D, define the empirical distribution D̂ by its empirical distribution function F̂, as follows: F̂(x) = (1/m)∑_{i=1}^m 1{x_i ≤ x}. Then, for all ε > 0, Pr[d_K(D, D̂) > ε] ≤ 2e^{−2mε²}, where the probability is taken over the samples.
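A small numerical illustration of this guarantee, learning the uniform distribution over [n] in Kolmogorov distance from samples (the parameters are chosen only so that the DKW failure probability is negligible):

```python
import bisect
import random

def empirical_cdf(samples, n):
    """Empirical cdf of the samples, evaluated at every point of {1, ..., n}."""
    samples = sorted(samples)
    m = len(samples)
    return [bisect.bisect_right(samples, i) / m for i in range(1, n + 1)]

n = 20
true_cdf = [(i + 1) / n for i in range(n)]  # uniform over [n]

rng = random.Random(0)
eps = 0.05
m = 10000  # DKW: failure probability <= 2 * exp(-2 * m * eps^2), negligible here
samples = [rng.randint(1, n) for _ in range(m)]
emp = empirical_cdf(samples, n)
d_kolmogorov = max(abs(a - b) for a, b in zip(emp, true_cdf))
```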
(In particular, setting m = Θ(log(1/δ)/ε²), we get that d_K(D, D̂) ≤ ε with probability at least 1 − δ.) The following theorem guarantees that applying any (possibly randomized) function to two distributions can never increase their total variation distance: Fact D.2 (Data Processing Inequality for Total Variation Distance). Let D_1, D_2 be two distributions over a domain Ω. Fix any randomized function F on Ω (which can be seen as a distribution over functions over Ω), and let F(D_1) be the distribution such that a draw from F(D_1) is obtained by drawing independently x from D_1 and f from F and then outputting f(x) (likewise for F(D_2)). Then we have d_TV(F(D_1), F(D_2)) ≤ d_TV(D_1, D_2). Moreover, we have equality if each realization f of F is one-to-one.
(See, e.g., part (iv) of Lemma 2 of [85] for a proof of this result.) Finally, we recall a well-known result on distinguishing biased coins (which can for instance be derived from Eq. (2.15) and (2.16) of [10]), and which often comes in handy in proving lower bounds through reductions: distinguishing with probability at least 2/3 a fair coin from a coin with bias (1 + ε)/2 requires Θ(1/ε²) coin tosses.
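The upper-bound direction of this coin-distinguishing fact is a simple empirical-mean threshold test; a sketch, with constants chosen loosely for illustration rather than optimized:

```python
import random

def distinguish_coin(tosses, eps):
    """Decide 'biased' vs 'fair', given tosses from either Bern(1/2) or
    Bern((1 + eps)/2), by thresholding the empirical mean halfway between
    the two possible biases."""
    return "biased" if sum(tosses) / len(tosses) > 0.5 + eps / 4 else "fair"

eps = 0.1
m = 20000  # Theta(1/eps^2) tosses; by Hoeffding, error prob <= exp(-m*eps^2/8)
rng = random.Random(1)
fair_tosses = [1 if rng.random() < 0.5 else 0 for _ in range(m)]
biased_tosses = [1 if rng.random() < (1 + eps) / 2 else 0 for _ in range(m)]
```

The matching lower bound is information-theoretic: with o(1/ε²) tosses, the two transcript distributions remain close in total variation, so no rule can tell them apart reliably.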

D.2 On Yao and non-adaptive algorithms
Yao's Principle (at least, what its "easy direction" given below does) enables one to reduce the problem of dealing with randomized non-adaptive algorithms over arbitrary inputs to the one of deterministic non-adaptive algorithms over a ("suitably difficult") distribution over instances (an instance being a legitimate input to the testing problem, i.e., D ∈ P ∪ { D′ : dist(D′, P) > ε }): Theorem D.4 (Yao's Minimax Principle (easy direction)). Fix any property P ⊆ ∆(Ω). Suppose there is a distribution D over instances such that any q-query deterministic algorithm is correct with probability strictly less than 2/3 when D ∼ D. Then, given any (non-adaptive) q-query randomized tester T, there exists D_T ∈ supp(D) such that Pr[T is correct on D_T] < 2/3.
Hence, any non-adaptive property testing algorithm for P must make at least q + 1 queries.
Now, a direct application of the Data Processing Inequality (Fact D.2) bounds the probability of distinguishing between samples from two distributions by their total variation distance: Lemma D.5. Let D_1, D_2 be two distributions over some set Ω, and A any algorithm (possibly randomized) that takes x ∈ Ω as input and returns yes or no. Then |Pr_{x∼D_1}[A(x) = yes] − Pr_{x∼D_2}[A(x) = yes]| ≤ d_TV(D_1, D_2), where the probabilities are also taken over the possible randomness of A.
With this in hand, we can give a crucial tool in proving lower bounds in distribution testing: Lemma D.6 (Key Tool Against Non-Adaptive Testers). Fix any property P ⊆ ∆(Ω). Let D_yes be a distribution over distributions that belong to P, and D_no be a distribution over distributions D that all have d_TV(D, P) > ε_0. Suppose further that, for all q-query sets Q ⊆ Λ^q, the distributions over answers to Q induced by D_yes and by D_no are at total variation distance at most 1/4. Then any (two-sided) non-adaptive testing algorithm for P must use at least q + 1 queries (for ε ≤ ε_0).
Proof. Let D be the mixture D := (1/2)D_yes + (1/2)D_no (that is, a draw from D is obtained by tossing a fair coin, and returning accordingly a sample drawn either from D_yes or D_no). Fix a q-query deterministic tester T, and let p_Y := Pr_{D∼D_yes}[T accepts] and p_N := Pr_{D∼D_no}[T accepts]. That is, p_Y is the probability that a random yes-distribution is accepted, while p_N is the probability that a random no-distribution is accepted. Via the assumption and the previous lemma, |p_Y − p_N| ≤ 1/4. However, this means that T cannot be a successful tester, as Pr_{D∼D}[T is correct] = (1/2)p_Y + (1/2)(1 − p_N) = 1/2 + (1/2)(p_Y − p_N) ≤ 1/2 + 1/8 = 5/8. So Yao's principle (Theorem D.4) tells us that any randomized non-adaptive q-query tester is wrong on some D in the support of D with probability at least 3/8; but a legitimate tester can only be wrong on any such D with probability less than 1/3.
Remark D.7 (Generalization of Lemma D.6). The conclusion of the above lemma still holds even under quantitatively weaker assumptions, with the constant 1/4 (and, accordingly, the other constants in the argument) replaced by suitable ones.
More strings to this bow. We give below another straightforward, yet useful fact for arguing about lower bounds (especially when applied to distributions over transcripts). It asserts that if two random variables are statistically close conditioned on a low-probability event not happening, then their overall distributions are close.
Fact D.8 (Fact 34 from [34]). Let D_1, D_2 be two distributions over the same finite set Ω. Let E be an event such that D_i(E) = α_i ≤ α for i = 1, 2, and suppose the conditional distributions (D_1)_Ē and (D_2)_Ē satisfy d_TV((D_1)_Ē, (D_2)_Ē) ≤ β. Then d_TV(D_1, D_2) ≤ 2α + β. Proof. Write D_i = (1 − α_i)·(D_i)_Ē + α_i·(D_i)_E for i = 1, 2. We may upper bound d_TV(D_1, D_2) ≤ d_TV((D_1)_Ē, (D_2)_Ē) + α_1 + α_2 ≤ β + 2α, and the fact is established.

D.3 Poissonization
This technique, taking its name from the Poisson distribution (itself named after Siméon Denis Poisson, a French mathematician from the nineteenth century whose name happens to mean "fish"), is a very handy trick used to "restore independence" between random variables in some specific situations, generally to make the analysis easier.
As an example, consider taking m independent samples from a distribution D ∈ ∆([n]), and looking at the resulting fingerprint F = (F_1, ..., F_m) (i.e., the vector whose j-th component is the number of elements in the domain which have been drawn exactly j times). The F_j's, whose analysis is often paramount to proving upper or lower bounds, are not independent: for a start, they always satisfy ∑_{j=1}^m jF_j = m. In order to come down on F with all the convenient machinery which requires independence amongst random variables, one can then apply this trick: in a nutshell, taking m′ ∼ Poisson(m) samples instead of exactly m. Then the individual element counts become independent Poisson random variables, and thus enjoy good concentration properties. Even better, many nice properties of the Poisson distribution will apply: the random variable m′ itself will be tightly concentrated around its mean m, for instance. For more on this technique, see, e.g., [19, Section 3.2] and [98, Section 3.1], as well as [94].
The Poisson distribution has many key properties: amongst others, the sum of finitely many independent Poisson random variables is itself Poisson; a Poisson random variable is tightly concentrated around its expectation; and the Poisson distribution can be viewed as the limit of a Binomial distribution Bin(n, p) when n goes to infinity while keeping the product λ = np constant. (See also Le Cam's Inequality [70] for a generalization of this limit theorem to sums of non-necessarily identical Bernoulli random variables.) Fact D.10. Let Ω be a discrete domain, and D ∈ ∆(Ω). Suppose one takes m′ ∼ Poisson(m) independent samples s_1, ..., s_{m′} from D, and define X_ω, for ω ∈ Ω, as the number of times ω appears amongst the s_i's. Then (a) the (X_ω)_{ω∈Ω} are independent, and (b) X_ω ∼ Poisson(mD(ω)).
Fact D.11 ([55, 95]). Let λ > 0, and k ∈ N. Then, for X ∼ Poisson(λ), Pr[X ≥ k] ≤ e^{−λ}(eλ/k)^k for k > λ, and Pr[X ≤ k] ≤ e^{−λ}(eλ/k)^k for k < λ. This implies that, both for upper and lower bounds on algorithms taking i.i.d. samples from a distribution, one can assume without loss of generality that, for instance, the fingerprint (as defined above) of the samples has all the independence properties wished for. Indeed, up to constant factors in the sample and time complexity, the existence of a tester and of a "Poissonized tester" (a tester that, on input m, draws m′ ∼ Poisson(m) and then asks m′ samples from the oracle) are equivalent: • a tester with sample complexity m and success probability 2/3 implies a Poissonized tester taking Poisson(m) samples and having success probability 3/5; • a Poissonized tester taking Poisson(m) samples and having success probability 2/3 implies a tester with sample complexity 3m and success probability 3/5.
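A sketch of the trick in code, using Knuth's classical Poisson sampler (the toy distribution and helper names are illustrative). The counts of Fact D.10 and the fingerprint defined above are both computed, and the fingerprint identity ∑_j jF_j = (number of samples) holds by construction:

```python
import math
import random
from collections import Counter

def sample_poisson(lam, rng):
    """Knuth's Poisson sampler (fine for moderate lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, 1.0
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k - 1

def poissonized_counts(D, m, rng):
    """Take m' ~ Poisson(m) samples from D (a pmf dict) and return per-element
    counts; each count X_w is then Poisson(m * D(w)), independently of the rest."""
    items, weights = zip(*D.items())
    m_prime = sample_poisson(m, rng)
    return Counter(rng.choices(items, weights=weights, k=m_prime))

def fingerprint(counts):
    """F_j = number of domain elements seen exactly j times."""
    return Counter(counts.values())

D = {"a": 0.5, "b": 0.3, "c": 0.2}
rng = random.Random(3)
m = 100
trials = 400
# Empirical mean of the count of "a" over many runs: should concentrate near m*D(a) = 50
avg_a = sum(poissonized_counts(D, m, rng)["a"] for _ in range(trials)) / trials
counts = poissonized_counts(D, m, rng)
F = fingerprint(counts)
```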

D.4 Birgé's decomposition
We state here a few facts about monotone distributions, namely that they admit a succinct approximation, itself monotone, close in total variation distance. This theorem from [24] has recently been pivotal in several results on learning and testing k-modal distributions [41,43].
For a distribution D and parameter ε, define the histogram Φ_ε(D) to be the flattened distribution with respect to the oblivious decomposition I_ε = (I_1, ..., I_ℓ): for every k ∈ [ℓ] and i ∈ I_k, Φ_ε(D)(i) = D(I_k)/|I_k|. Note that while Φ_ε(D) (obviously) depends on D, the partition I_ε itself crucially does not; in particular, it can be computed prior to getting any sample or information about D. We stress the fact that Φ_ε(D) is supported on logarithmically many intervals; and note that samples from Φ_ε(D) can straightforwardly be obtained given sampling access to D.
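A sketch of the oblivious decomposition and flattening, with interval lengths growing geometrically as ⌊(1 + ε)^k⌋ (one common choice; the exact parameterization varies across papers). The assertions check the structural properties stressed above, and that flattening preserves monotonicity:

```python
def birge_partition(n, eps):
    """Oblivious partition of {1, ..., n} into O(log(n)/eps) intervals whose
    lengths grow geometrically as floor((1 + eps)^k)."""
    intervals, left, k = [], 1, 0
    while left <= n:
        length = max(1, int((1 + eps) ** k))
        right = min(left + length - 1, n)
        intervals.append((left, right))
        left, k = right + 1, k + 1
    return intervals

def flatten(p, intervals):
    """Phi_eps(D): spread the mass of each interval uniformly over it."""
    out = [0.0] * len(p)
    for l, r in intervals:
        mass = sum(p[l - 1:r])
        for i in range(l - 1, r):
            out[i] = mass / (r - l + 1)
    return out

n, eps = 100, 0.1
intervals = birge_partition(n, eps)
# A monotone (non-increasing) distribution: truncated geometric
raw = [0.9 ** i for i in range(n)]
total = sum(raw)
D = [w / total for w in raw]
flat = flatten(D, intervals)
```

Note that flattening a non-increasing distribution yields a non-increasing step function, since the interval averages inherit the ordering; this matches the intuition, discussed below, that flattening should never take a distribution farther from monotone.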
Remark D.16. Another proof, self-contained and phrased in terms of discrete distributions (whereas the original paper by Birgé is primarily intended for continuous ones), can be found in [ One can interpret this corollary as saying that the Birgé decomposition provides a tradeoff: the distribution becomes simpler (and at least as close to monotone) while not straying too far from the original distribution. Incidentally, the last step of the proof above implies the following easy fact, that one could also get from Fact D.2: Conjectures. This corollary (which in particular implies that if D is ε-close to monotone, then so is Φ_α(D)) does not "feel" tight. Intuitively, flattening out parts of the distribution should indeed never take it farther from monotone, but one would expect the process, in most cases, to bring it strictly closer. This motivates the following conjecture, namely that there is always a "good" choice of the parameter α: Conjecture D.19. For all n ∈ N, there exists η > 0 such that the following holds. For all α ∈ (0, 1/2] and D_1, D_2 ∈ ∆([n]), there exists β ∈ [α/2, 2α] for which d_TV(Φ_β(D_1), Φ_β(D_2)) ≤ (1 − η)·d_TV(D_1, D_2).
One could even hope for the stronger statement that many values of α are "good." Conjecture D.20. For every η > 0 there exists c > 0 such that for all α ∈ (0, 1/2] and D 1 ,

E Assouad and Le Cam
In this section, we describe two techniques used to prove lower bounds for distribution learning and testing in the SAMP model: Assouad's lemma and Le Cam's method. (We do not cover here Fano's Inequality, another and somewhat more general result than Assouad's; the interested reader is referred to [104].) The minimax risk is defined as R_m(C) = min_{A∈A_m} max_{D∈C} E[d_TV(D, D̂_A)], where A_m is the set of (deterministic) learning algorithms A which take m samples and output a hypothesis distribution D̂_A.
In other terms, R m (C) is the minimum expected error of any m-sample learning algorithm A when run on the worst possible target distribution (from C) for it. It is immediate from the definition that for any H ⊆ C, one has R m (C) ≥ R m (H).
To prove lower bounds on learning a family C, a common method is to come up with a (sub)family of distributions in which, as long as a learning algorithm does not take enough samples, there always exist two (far) distributions which could still have yielded indistinguishable "transcripts." In other terms, after running any learning algorithm A on m samples, an adversary can still exhibit two very different distributions (depending on A) that ought to have been distinguished, yet could not possibly have been from only m samples. This is formalized by the following theorem, due to Assouad:

Theorem E.2 (Assouad's Lemma [14]). Let C ⊆ ∆(Ω) be a family of probability distributions. Suppose there exist a family H ⊆ C of 2 r distributions and constants α, β > 0 such that, writing H = {D z } z∈{0,1} r , (i) for all x, y ∈ {0, 1} r , the distance between D x and D y is at least proportional to the Hamming distance: d TV (D x , D y ) ≥ α‖x − y‖ 1 ; (E.2) (ii) for all x, y ∈ {0, 1} r with ‖x − y‖ 1 = 1, the squared Hellinger distance of D x , D y is small: d H (D x , D y ) 2 ≤ β. (E.3) Then R m (C) ≥ (rα/2)(1 − √(2mβ)). In particular, to achieve error at most ε ≤ rα/4, any learning algorithm for C must have sample complexity Ω(1/β).

Remark E.3 (High-level idea). Intuitively, every distribution in H is determined by r "binary choices." With this interpretation, item (i) means that two distributions differing in many choices should be far (so that a learning algorithm has to "figure out" most of the choices in order to achieve a small error), while item (ii) requires that two distributions defined by almost the same choices be very close (so that a learning algorithm cannot distinguish them too easily).
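To make condition (i) concrete, the following Python sketch builds the standard "perturbed pairs" family (the same construction used for uniformity testing in Section E.2 below) and exhibits that condition (i) holds with α = 2ε/n, with equality; the helper names are hypothetical.

```python
def perturbed(z, n, eps):
    """D_z over [n] (n even, len(z) = n/2): the pair (2i-1, 2i) carries
    mass ((1+eps)/n, (1-eps)/n) or the reverse, according to bit z[i]."""
    D = []
    for bit in z:
        s = eps if bit else -eps
        D += [(1 + s) / n, (1 - s) / n]
    return D

def tv(P, Q):
    """Total variation distance between two pmfs on a common domain."""
    return 0.5 * sum(abs(p - q) for p, q in zip(P, Q))

def hamming(x, y):
    """Hamming distance between two bit strings."""
    return sum(a != b for a, b in zip(x, y))
```

Each coordinate on which x and y disagree contributes exactly 2ε/n to the total variation distance, so d TV (D x , D y ) = (2ε/n) · ‖x − y‖ 1 for this family.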

E.2 Testing lower bounds: Le Cam's Method
We now turn to another technique, better suited for proving lower bounds on property testing or parameter estimation, i.e., settings where the quantity of interest is a functional of the unknown distribution, instead of the distribution itself. We start with some terminology that will be useful in stating the main result of this section.

Definition E.6 (Estimator). Let C ⊆ ∆(Ω) be a family of probability distributions over Ω, and m ≥ 1. For any real-valued functional ϕ : C → [0, 1] (a "scalar property"), we denote by E m the set of estimators for ϕ, that is, the set of (deterministic) algorithms E taking m independent samples from a distribution D ∈ C and returning an estimate φ̂ E of ϕ(D).
Then, for all m ≥ 1,

min E∈E m max D∈D 1 ∪D 2 E[ |φ̂ E − ϕ(D)| ] ≥ (δ/2) · (1 − inf p 1 ,p 2 d TV (p 1 , p 2 )), (E.6)

where the infimum is over all p 1 , p 2 in the convex hulls of { D ⊗m : D ∈ D 1 } and { D ⊗m : D ∈ D 2 }, respectively. One particular reason this result is interesting is that the infimum is taken over the convex hulls of the m-fold product distributions from the families D 1 and D 2 , and not over the m-fold product distributions themselves. While this makes the computations much less straightforward (as a mixture of product distributions is not, in general, itself a product distribution, one can no longer rely on the Hellinger distance as a proxy for total variation and leverage its nice properties with regard to product distributions), it also usually yields much tighter bounds, as the infimum over the convex hull is often significantly smaller.
We obtain an immediate corollary in terms of property testing, where a testing algorithm is said to fail if it returns ACCEPT on a no-instance or REJECT on a yes-instance. Note, as usual, that if the samples originate from a distribution which is neither a yes- nor a no-instance, then any output is valid and the tester cannot fail.
As any (possibly randomized) bona fide testing algorithm may only fail with probability at most 1/3, the above, combined with Yao's Principle, implies a lower bound of Ω(m) on the sample complexity as soon as m and D 1 , D 2 satisfy inf p 1 ,p 2 d TV (p 1 , p 2 ) < 1/3 in (E.6).
Proof of Corollary E.8. We apply Theorem E.7 with ϕ defined by ϕ(D) = 0 for D ∈ D 1 and ϕ(D) = 1 for D ∈ D 2 , so that δ = 1.

An application: testing uniformity. To prove a lower bound of Ω(√n/ε 2 ) for testing uniformity over [n] (cf. Section 5.1), Paninski [79] defines the families D 1 = P = {U n } and D 2 as the set of distributions D obtained by perturbing each disjoint pair of consecutive elements (2i − 1, 2i) by either (ε/n, −ε/n) or (−ε/n, ε/n) (for a total of 2 n/2 distinct distributions). He then analyzes the total variation distance between U n ⊗m and the uniform mixture 2 −n/2 ∑ z∈{0,1} n/2 D z ⊗m , which for m ≤ c √n/ε 2 is less than 1/3, establishing the lower bound.
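The key quantity in Paninski's argument, the distance between the m-fold uniform distribution and the uniform mixture of the m-fold perturbed distributions, can be computed exactly for toy parameters by brute-force enumeration. A sketch (function names hypothetical; feasible only for tiny n and m):

```python
from itertools import product

def paninski(z, n, eps):
    """D_z over [n] (n even): the pair (2i-1, 2i) carries mass
    ((1+eps)/n, (1-eps)/n) or the reverse, according to bit z[i]."""
    D = []
    for bit in z:
        s = eps if bit else -eps
        D += [(1 + s) / n, (1 - s) / n]
    return D

def product_prob(D, outcome):
    """Probability of an m-tuple of outcomes under i.i.d. draws from D."""
    p = 1.0
    for i in outcome:
        p *= D[i]
    return p

def mixture_tv(n, eps, m):
    """Exact d_TV between the m-fold uniform distribution and the
    uniform mixture of the m-fold D_z's, over all n^m outcomes."""
    zs = list(product((0, 1), repeat=n // 2))
    Ds = [paninski(z, n, eps) for z in zs]
    total = 0.0
    for outcome in product(range(n), repeat=m):
        u = (1.0 / n) ** m
        mix = sum(product_prob(D, outcome) for D in Ds) / len(zs)
        total += abs(u - mix)
    return 0.5 * total
```

For m = 1 the mixture is exactly uniform (the ±ε/n perturbations average out), so the distance is 0: a single sample reveals nothing, and the distance only grows as more samples become available.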

F Miscellaneous definitions

F.1 Distribution classes
Recall that a distribution D over [n] is monotone (non-increasing) if its probability mass function (pmf) satisfies D(1) ≥ D(2) ≥ · · · ≥ D(n). A natural generalization of the class M of monotone distributions is the set of t-modal distributions, i. e., distributions whose pmf can go "up and down" or "down and up" up to t times. We reproduce the following definitions from [30].
Definition F.1 (t-modal). Fix any distribution D over [n], and integer t. D is said to have t modes if there exists a sequence i 0 < · · · < i t+1 such that either (−1) j D(i j ) < (−1) j D(i j+1 ) for all 0 ≤ j ≤ t, or (−1) j D(i j ) > (−1) j D(i j+1 ) for all 0 ≤ j ≤ t. We call D t-modal if it has at most t modes, and write M t for the class of all t-modal distributions. The particular case of t = 1 corresponds to the set M 1 of unimodal distributions.
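Definition F.1 can be restated operationally: the number of modes of D equals the number of strict direction changes of its pmf, once runs of equal consecutive values are collapsed. A small Python sketch (function names hypothetical):

```python
def num_modes(D):
    """Number of modes of a pmf D, per Definition F.1: the largest t for
    which some subsequence i_0 < ... < i_{t+1} strictly alternates in
    D-value.  This equals the number of strict direction changes after
    collapsing runs of equal consecutive values."""
    # keep one representative of each run of equal consecutive values
    vals = [D[0]] + [v for a, v in zip(D, D[1:]) if v != a]
    changes = 0
    for a, b, c in zip(vals, vals[1:], vals[2:]):
        if (b - a) * (c - b) < 0:   # strict up-then-down or down-then-up
            changes += 1
    return changes

def is_t_modal(D, t):
    """Membership in the class M_t of Definition F.1."""
    return num_modes(D) <= t
```

Under this convention a monotone pmf has 0 modes and a unimodal ("up then down") pmf has 1, consistent with M ⊆ M 1 .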
Definition F.2 (Log-Concave). A distribution D over [n] is said to be log-concave if the following holds: (i) for any 1 ≤ i < j < k ≤ n such that D(i)D(k) > 0, D( j) > 0; and (ii) for all 1 < k < n, D(k) 2 ≥ D(k − 1)D(k + 1). We write L for the class of all log-concave distributions.

Definition F.3 (Concave and Convex). A distribution D over [n] is said to be concave if it satisfies the following conditions: (i) for any 1 ≤ i < j < k ≤ n such that D(i)D(k) > 0, D( j) > 0; and (ii) for all 1 < k < n such that D(k − 1)D(k + 1) > 0, 2D(k) ≥ D(k − 1) + D(k + 1); it is convex if the reverse inequality holds in (ii). We write K − and K + for the classes of all concave and convex distributions, respectively.
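The conditions of Definitions F.2 and F.3 can be checked directly from a pmf. A minimal Python sketch (function names are illustrative):

```python
def _interval_support(D):
    """Condition (i) of both definitions: the support is an interval."""
    supp = [i for i in range(len(D)) if D[i] > 0]
    return not supp or supp[-1] - supp[0] + 1 == len(supp)

def is_log_concave(D):
    """Definition F.2: interval support, and D(k)^2 >= D(k-1) D(k+1)
    for every interior point k."""
    if not _interval_support(D):
        return False
    return all(D[k] ** 2 >= D[k - 1] * D[k + 1] for k in range(1, len(D) - 1))

def is_convex(D):
    """Definition F.3 (convex case): interval support, and for interior k
    with D(k-1) D(k+1) > 0, 2 D(k) <= D(k-1) + D(k+1)."""
    if not _interval_support(D):
        return False
    return all(2 * D[k] <= D[k - 1] + D[k + 1]
               for k in range(1, len(D) - 1) if D[k - 1] * D[k + 1] > 0)
```

For instance, a truncated geometric pmf passes both checks (with equality in the log-concavity condition), while a "tent"-shaped pmf is log-concave but not convex.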
It is not hard to see that convex and concave distributions are unimodal; moreover, every concave distribution is also log-concave, i.e., K − ⊆ L. Note that in both Definition F.2 and Definition F.3, condition (i) is equivalent to enforcing that the distribution be supported on an interval.

Definition F.4 (Monotone Hazard Rate). A distribution D over [n] is said to have monotone hazard rate (MHR) if its hazard rate H D (i) def = D(i)/ ∑ j≥i D( j) is a non-decreasing function of i. We write MHR for the class of all MHR distributions.

It is known that every log-concave distribution is both unimodal and MHR (see, e.g., [13, Proposition 10]), and that monotone distributions are MHR. Finally, we recall the definition of the following two classes, which both extend the family of Binomial distributions BIN n .
Definition F.5 (Poisson Binomial). A random variable X is said to follow a Poisson Binomial Distribution (with parameter n ∈ N) if it can be written as X = ∑ n k=1 X k , where X 1 , . . . , X n are independent, not necessarily identically distributed Bernoulli random variables. We denote by PBD n the class of all such Poisson Binomial Distributions.
One can generalize even further, by allowing each random variable of the summation to be integer-valued.
Definition F.6 (k-SIIRV). Fix any k ≥ 0. We say a random variable X is a k-Sum of Independent Integer Random Variables (k-SIIRV) with parameter n ∈ N if it can be written as X = ∑ n j=1 X j , where X 1 , . . . , X n are independent random variables taking values in {0, 1, . . . , k − 1}. We denote by k-SIIRV n the class of all such k-SIIRVs.
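Both classes come with an obvious sampling procedure, which the following Python sketch makes explicit (the function names and the parameterization, one weight vector per summand for the k-SIIRV, are illustrative):

```python
import random

def sample_pbd(ps, rng=random):
    """One draw from the Poisson Binomial Distribution with Bernoulli
    parameters ps = (p_1, ..., p_n): a sum of independent coin flips."""
    return sum(rng.random() < p for p in ps)

def sample_ksiirv(k, weights_list, rng=random):
    """One draw from a k-SIIRV with parameter n = len(weights_list):
    each summand X_j takes values in {0, ..., k-1} with its own
    (unnormalized) weight vector weights_list[j]."""
    x = 0
    for w in weights_list:
        x += rng.choices(range(k), weights=w)[0]
    return x
```

Taking k = 2 and weights (1 − p j , p j ) recovers the Poisson Binomial case, and identical p j recover BIN n .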
k-SIIRV distributions arise in various areas, such as survey sampling or social studies. We note that the name k-SIIRV is somewhat misleading, as the parameter n is implicit; as such, the term (n, k)-SIIRV seems preferable. We however chose to keep the original name in the above definition in order to stay consistent with the literature on the topic.

F.2 Distribution learning
In this appendix, we recall the notions of learning, proper learning, and agnostic learning of distributions over a domain Ω.
Definition F.7 (Learning and Proper Learning). Let C ⊆ ∆(Ω) be a class of probability distributions and D ∈ C an unknown distribution. Let also H ⊆ ∆(Ω) be a set of distributions, the hypothesis class. A learning algorithm for C (with hypothesis class H) is a randomized algorithm L which takes as input |Ω| and ε, δ ∈ (0, 1), and has access to SAMP D under the promise that D ∈ C, and returns the description of a distribution D̂ ∈ H such that with probability at least 1 − δ one has d TV (D, D̂) ≤ ε. The sample complexity of the algorithm is then the number of samples it takes in the worst case, over all D ∈ C. If H ⊆ C, then we say L is a proper learning algorithm.
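For a concrete (if naive) instance of Definition F.7 over Ω = [n], the empirical distribution of the samples is itself a learning algorithm: it is a standard fact that m = O((n + log(1/δ))/ε 2 ) samples suffice for it to be ε-close in total variation to any target, with probability at least 1 − δ. A sketch (the function name is hypothetical):

```python
from collections import Counter

def empirical_learner(samples, n):
    """Output the empirical distribution over [n] = {1, ..., n}:
    the fraction of samples equal to each domain element."""
    m = len(samples)
    counts = Counter(samples)
    return [counts[i] / m for i in range(1, n + 1)]
```

Since the hypothesis is an arbitrary element of ∆([n]) rather than of C, this learner is not proper in general; its sample complexity is also far from optimal for structured classes such as those of Section F.1.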
A useful generalization, which corresponds to the notion of "model misspecification" in Statistics, is that of agnostic learning, where the unknown distribution is not guaranteed to belong to the class C.
Definition F.8 (Agnostic and Semi-Agnostic Learning). Let C, H ⊆ ∆(Ω) be classes of distributions and D ∈ ∆(Ω). An agnostic learning algorithm for C (with hypothesis class H) is a randomized algorithm L which takes as input |Ω| and ε, δ ∈ (0, 1), and has access to SAMP D , and returns the description of a distribution D̂ ∈ H such that with probability at least 1 − δ one has d TV (D, D̂) ≤ OPT C,D + ε, where OPT C,D def = inf D C ∈C d TV (D C , D) is the distance of D to the purported class C. If the guarantee is instead that d TV (D, D̂) ≤ c · OPT C,D + ε for some absolute constant c ≥ 1, then L is a semi-agnostic learner (with agnostic constant c). The sample complexity of the algorithm is then the number of samples it takes in the worst case, over all D ∈ ∆(Ω). Finally, if H ⊆ C, then we say L is a proper agnostic learning algorithm.