Sharp bounds for population recovery

The population recovery problem is a basic problem in noisy unsupervised learning that has attracted significant research attention in recent years [WY12, DRWY12, MS13, BIMP13, LZ15, DST16]. A number of different variants of this problem have been studied, often under assumptions on the unknown distribution (such as that it has restricted support size). In this work we study the sample complexity and algorithmic complexity of the most general version of the problem, under both the bit-flip noise and the erasure noise models. We give essentially matching upper and lower sample complexity bounds for both noise models, together with efficient algorithms matching these sample complexity bounds up to polynomial factors.


The erasure noise and bit-flip noise population recovery problems
The noisy population recovery (NPR) problem is to learn an unknown probability distribution D on {0, 1}^n, under ν-noise, to ℓ_∞-accuracy ε. In this problem the learner gets access to independent samples y, each distributed as follows: first x ∼ D, and then y ∼ Noise_ν(x), where Noise_ν(·) denotes the application of bit-flip noise or erasure noise (described below). The learner's task is to produce an estimate D̂ of D satisfying ‖D̂ − D‖_∞ ≤ ε (with high probability). For the sake of a compact representation, we assume the learner only lists the nonzero values of D̂; this means that a successful learner need only list O(1/ε) nonzero values. We are interested in minimizing both the sample complexity and the running time of learning algorithms.
A simpler variant of the NPR problem is the estimation task. Here the algorithm does not need to compute a complete hypothesis D̂; it only needs to produce an ε-accurate estimate of D(u) for a given input u ∈ {0, 1}^n. Certainly the estimation task is no harder than full NPR; conversely, it is known and not hard to see (see Section 2.1) that given the ability to do estimation, one can do full NPR with just a poly(n, 1/ε) factor slowdown. Hence we mainly focus on estimation in this paper.
As mentioned above, we consider two different models of noise. Each involves a parameter 0 < ν < 1; smaller values of ν correspond to more noise, so ν may be better thought of as a "correlation" parameter.
Erasure noise. For x ∈ {0, 1}^n we define Erase_{1−ν}(x) to be the distribution on {0, 1, ?}^n given by independently replacing each coordinate of x with the symbol '?' with probability 1 − ν. Thus ν is the retention probability for each coordinate.
Bit-flip noise. For x ∈ {0, 1}^n we define Flip_{(1−ν)/2}(x) to be the distribution on {0, 1}^n given by independently flipping each coordinate of x with probability (1 − ν)/2. Equivalently, each coordinate of x is retained with probability ν (as in erasure noise), and is otherwise replaced with a uniformly random bit. This is also the model of noise associated with the so-called "Bonami noise operator" T_ν (see [13] for the precise description and many applications of this operator).
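To make the two channels concrete, here is a minimal Python sketch of sampling y given x. The dictionary encoding of D and all identifiers are ours, for illustration only.

```python
import random

def erase_noise(x, nu):
    # Erase_{1-nu}: each coordinate is kept with probability nu and
    # replaced by the erasure symbol '?' with probability 1 - nu.
    return ['?' if random.random() > nu else b for b in x]

def flip_noise(x, nu):
    # Flip_{(1-nu)/2}: each coordinate is flipped with probability (1-nu)/2;
    # equivalently, kept w.p. nu and replaced by a uniform bit otherwise.
    return [b ^ 1 if random.random() < (1 - nu) / 2 else b for b in x]

def npr_sample(D, nu, noise):
    # One NPR sample: draw x ~ D (dict mapping bit-tuples to probabilities),
    # then pass x through the chosen noise channel.
    xs, ps = zip(*D.items())
    x = random.choices(xs, weights=ps)[0]
    return noise(list(x), nu)

if __name__ == "__main__":
    D = {(0, 0, 0): 0.7, (1, 1, 0): 0.3}   # a toy distribution on {0,1}^3
    print(npr_sample(D, 0.9, erase_noise))
    print(npr_sample(D, 0.9, flip_noise))
```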

Our results
For the bit-flip noise population recovery problem, our main result is a lower bound on the sample complexity of estimation, as well as a full NPR algorithm whose running time (and hence also sample complexity) matches it up to polynomial factors.

Theorem 1.1 (NPR bit-flip noise upper and lower bounds). Let ε > 0 be sufficiently small and let n ∈ ℕ. Then any estimation algorithm for NPR with bit-flip noise must use at least exp(Θ(n^{1/3} · ln^{2/3}(1/ε)/ν^{2/3})) samples, provided that 2 ln(2/ε)/n ≤ 1 − ν and ν⁴ ≥ ln(1/ε)/n; moreover, there is a full NPR algorithm whose time and sample complexity match this bound up to polynomial factors.

Prior to the present paper and the recent independent article [14], no nontrivial upper or lower bounds were known even for the sample complexity of the general bit-flip noise population recovery problem. (See [17, 10, 6] for earlier work that gave upper bounds and algorithms under the additional assumption that the unknown distribution D is guaranteed to be supported on at most k strings.)

For the erasure noise population recovery problem, our main positive result is an efficient algorithm, and our main negative result is a near-matching lower bound for algorithms that meet either of the following conditions: (a) the algorithm only uses information about the number of 1s present in the received string, or (b) the ambient dimension n is sufficiently large. More precisely, we have the following theorem.

Theorem 1.2 (NPR erasure noise upper and lower bounds). Let ε > 0 be sufficiently small, let 0 ≤ ν ≤ 1 and let n ∈ ℕ.
1. There is an algorithm for the full NPR problem with erasure noise using time and samples at most poly(n, (1/ε)^{1/ν}). Moreover, the sample complexity of the estimation algorithm is upper bounded by O((1/ε)^{1/ν}).
For this problem, an earlier paper by Moitra and Saks [11] gave an algorithm with sample complexity and running time (n/ε)^{O(log(1/ν)/ν)}. In the above theorem, item 3 follows from a simple reduction from item 2 (which exploits the shift invariance of binomial distributions). So as not to disrupt the flow of the text, this reduction is presented in Section 6; in the main body of this paper we focus on proving items 1 and 2.
Finally, we note that in recent independent work, Yury Polyanskiy, Ananda Theertha Suresh, and Yihong Wu [14] have obtained results similar to Theorem 1.1 and Theorem 1.2 for the population recovery problem. We explain the relationship between their results and ours below.

Our techniques and relationship to [14]
Our approach is similar in spirit to, and shares some technical similarities with, the recent papers [5, 12] on the trace reconstruction problem, as well as an earlier paper by Moitra and Saks [11] on population recovery with erasure noise. At a high level, we take an analytic view of the combinatorial process defined by the bit-flip and erasure noise operators, and convert the sample complexity questions for these population recovery problems into questions about the extrema of real-coefficient polynomials satisfying certain conditions on various circles in the complex plane; we then obtain our sample complexity bounds by analyzing these extremal polynomial questions. We remark that [11] was the first paper in this line of work to introduce complex analytic tools; in contrast to our paper, [11] arrives at the complex analytic formulation by considering the dual of an LP-based estimator. However, it is possible to arrive at the complex analytic formulation directly, without invoking the notion of LP duality, and this is what we do in this paper. Finally, we mention that the main algorithmic ingredient in our results is linear programming.

The work presented here and in [14] was done concurrently and independently. We now briefly explain the relationship between the techniques and results in these papers. (a) The results for NPR with bit-flip noise (i. e., Theorem 1.1) are the same as those of [14]. (b) As stated, our results for NPR with erasure noise are quantitatively somewhat weaker, though qualitatively quite similar, to those of [14]. In particular, we show that the sample complexity of the estimation problem in this setting is (1/ε)^{Θ(1/ν)}. In contrast, [14] shows that the sample complexity of the estimation problem in the presence of erasure noise is precisely (1/ε)^{max{2, 2(1−ν)/ν}} up to polylogarithmic factors. For any ν, our result thus differs from that of [14] only by a fixed constant factor in the exponent of ε.
As our paper was written independently of [14], no attempt was made to match the results of [14] or to obtain exponents with precise constant factors. Incidentally, it turns out that the proof of item 1 of Theorem 1.2 in fact yields the same exponent as that of [14], though we state our result without the factor "1 − ν" (see Theorem 4.1). Finally, we also note that the sample complexity bounds in both our results and the results of [14] are "dimension free" for the estimation problem, i. e., they are independent of the ambient dimension n. In contrast, for the full NPR problem, the sample complexity depends on n (in both papers).
At a high level, the techniques of [14] are similar to ours (and to those of [12, 5]), though our proofs are substantially shorter. This is essentially because we are able to leverage some recent results from [4, 8] in our proofs. In particular, in the proof of item 2 of Theorem 1.2 we directly utilize the construction of an extremal polynomial from [8], whereas [14] relies on an argument from first principles based on Hadamard's three-lines theorem.

Preliminaries
We start by formally defining the notion of a probability distribution on a finite set X, as well as the ℓ_1 and ℓ_∞ metrics on the space of such distributions.

Definition 2.1. A probability distribution on a finite set X is a function D : X → [0, 1] such that ∑_{x∈X} D(x) = 1. For two such distributions D_1 and D_2, their ℓ_1 distance, denoted ‖D_1 − D_2‖_1, is given by ∑_{x∈X} |D_1(x) − D_2(x)|. Likewise, their ℓ_∞ distance, denoted ‖D_1 − D_2‖_∞, is given by max_{x∈X} |D_1(x) − D_2(x)|.

Well-known preliminary reductions
Estimation, enumeration, and recovery. Variants of the NPR problem with relaxed goals have been studied in the literature. One is the aforementioned estimation problem. Another (complementary) variant is called enumeration: in the enumeration problem, the learning algorithm is only required to return a list of strings x_1, . . . , x_m that is guaranteed (with high probability) to include all strings that have probability at least ε under D; such strings are sometimes referred to as "heavy hitters." Batman et al. [1] give a range of results for the enumeration problem.
It is easy to see that a solution to the estimation problem can be efficiently bootstrapped to full NPR given the ability to solve the enumeration problem (simply run estimation, with a sufficiently boosted success probability, on each of the m strings in the list obtained from enumeration). In turn, it is also well known that an estimation algorithm can be efficiently transformed into an enumeration algorithm via a "branch-and-prune" approach. Roughly speaking, such an approach maintains a not-too-large (size at most O(1/ε)) set of i-bit prefixes that is known to contain all the "heavy hitters"; to construct the set of (i + 1)-bit prefixes, the approach first "branches" to extend each i-bit prefix x to both x0 and x1, and then "prunes" any element of {x0, x1} that is determined, using the estimation procedure, not to be a heavy hitter. (Note that since only heavy hitters are maintained, it will again be the case that the set of (i + 1)-bit prefixes has size at most O(1/ε).) A schematic sketch of this loop appears below. As observed in [1], an early example of such a branch-and-prune routine that performs enumeration given an oracle for estimation is the Goldreich-Levin algorithm [9] for list-decoding the Hadamard code. Both Dvir et al. [7] and Batman et al. [1] give fairly detailed analyses of the above-described reduction from enumeration to estimation; we omit the details here and refer the interested reader to Section 6.1 of [7] and Section 2 of [1], respectively.
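The following Python sketch shows only the control flow of branch-and-prune. The prefix-mass oracle `estimate` is hypothetical (in the actual reduction it would be implemented, with boosted success probability, from the estimation algorithm); here it is instantiated with exact masses purely for illustration.

```python
def enumerate_heavy_hitters(estimate, n, eps):
    """Branch-and-prune enumeration. `estimate(prefix)` is assumed to return
    an (eps/4)-accurate estimate of the total D-mass of strings starting with
    `prefix`; every true eps-heavy prefix then survives the pruning, and at
    most O(1/eps) prefixes survive each round."""
    prefixes = [()]
    for _ in range(n):
        branched = [p + (b,) for p in prefixes for b in (0, 1)]          # branch
        prefixes = [p for p in branched if estimate(p) >= 3 * eps / 4]   # prune
    return prefixes

if __name__ == "__main__":
    D = {(0, 1, 1): 0.5, (1, 1, 0): 0.4, (1, 1, 1): 0.1}
    exact = lambda pre: sum(w for x, w in D.items() if x[:len(pre)] == pre)
    print(enumerate_heavy_hitters(exact, 3, 0.3))   # [(0, 1, 1), (1, 1, 0)]
```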
Summarizing the reductions discussed above, we have that NPR is (up to polynomial factors) no harder than the estimation problem, and it is also clearly no easier than estimation (since estimation is a subproblem of general NPR). Thus, in the rest of this paper we restrict our attention to the estimation problem.
Symmetrization. We further recall some well-known techniques that have been used in past papers on NPR. First, in the estimation problem, we may assume without loss of generality that the string u whose probability is to be estimated is u = (0, . . . , 0). To see this, note that the learner can XOR each non-erased coordinate of each received sample with the corresponding coordinate of u; since both noise models act independently and identically on each coordinate, this yields samples corresponding to the shifted distribution D′ given by D′(x) = D(x ⊕ u), which satisfies D′(0, . . . , 0) = D(u). Next, for the problem of estimating D(0, . . . , 0), we may assume without loss of generality that D is symmetric, meaning that it gives equal probability mass to all strings with the same Hamming weight. In other words, D is effectively given by a probability distribution D^{sym} on [0..n], with D(x) = D^{sym}(|x|)/(n choose |x|). On one hand, if D(0, . . . , 0) can be estimated in the general case, it can certainly be estimated in the symmetric case. On the other hand, given a general distribution D, the learner can randomly permute the coordinates of each sample, effectively obtaining access to samples from a symmetric distribution D′ with D′(0, . . . , 0) = D(0, . . . , 0). Thus it suffices for the learner to be able to estimate in the symmetric case. (We note that both these tricks, namely reducing to the case u = (0, . . . , 0) and symmetrizing D, appear in prior work; see, e. g., [7, 11].) Both preprocessing steps are sketched in code below.
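A minimal sketch of the two preprocessing steps (our own illustration; '?' symbols are left untouched, so the same code serves both noise models):

```python
import random

def shift_and_symmetrize(sample, u):
    # Step 1: XOR every non-erased coordinate with u, so that estimating D(u)
    # becomes estimating D'(0,...,0); both noise channels act i.i.d. per
    # coordinate and therefore commute with this fixed XOR.
    shifted = [y if y == '?' else y ^ u_i for y, u_i in zip(sample, u)]
    # Step 2: apply a fresh uniformly random coordinate permutation to each
    # sample; this simulates samples drawn from the symmetrized distribution.
    random.shuffle(shifted)
    return shifted

if __name__ == "__main__":
    print(shift_and_symmetrize([1, '?', 0, 1], (1, 0, 0, 1)))
```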
In the symmetric case, we will express the unknown D^{sym} as a probability (row) vector p = [p_0 p_1 · · · p_n], where p_i denotes the total weight of the strings with Hamming weight i. Although the learner observes full strings, it may as well only consider the Hamming weights of the strings it receives. (This is without loss of generality in the bit-flip noise model, since the number of 0s is completely determined by the number of 1s. This is not so in the erasure noise model, which is why our lower bound there holds only for algorithms that only use the number of 1s in each received string; see the discussion in Section 1.3.) Thus we may view the learner as obtaining samples from the probability (row) vector q = [q_0 q_1 · · · q_n] = pA, where A is the (n+1) × (n+1) matrix with A_{ij} = Pr[|y| = j] for y ∼ Noise_ν(x) and x an arbitrary string of Hamming weight i. For bit-flip noise, the rows of A have the following generating function.

Proposition 2.2. For bit-flip noise, ∑_{j=0}^n A_{ij} z^j = ((1+ν)/2 + ((1−ν)/2)z)^{n−i} · ((1−ν)/2 + ((1+ν)/2)z)^i.

Proof. Fix i, j and let x be any string of weight i. Let E_k denote the event that k of the 1s in x become 0 and j − i + k of the 0s in x become 1 (so that the resulting string has weight exactly j). From positivity constraints, we derive that 0 ≤ k ≤ i and i ≤ j + k ≤ n. It follows that

A_{ij} = ∑_k Pr[E_k] = ∑_k (i choose k) ((1−ν)/2)^k ((1+ν)/2)^{i−k} · ((n−i) choose (j−i+k)) ((1−ν)/2)^{j−i+k} ((1+ν)/2)^{n−j−k}.

Thus, summing ∑_j A_{ij} z^j, reversing the order of summation over j and k, and rewriting the double sum as a product of two binomial expansions (one for the 0s of x that flip up, one for the 1s of x that flip down), we obtain exactly the product ((1+ν)/2 + ((1−ν)/2)z)^{n−i} · ((1−ν)/2 + ((1+ν)/2)z)^i. This completes the proof.
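The generating function above can be sanity-checked numerically against the combinatorial sum over the events E_k. The following self-contained check (our illustration) verifies agreement for a small n:

```python
from math import comb

def A_entry(n, i, j, nu):
    # A[i][j] = Pr[|y| = j | |x| = i] under bit-flip noise, via the sum over
    # the events E_k (k ones flip down, j-i+k zeros flip up).
    q, p = (1 - nu) / 2, (1 + nu) / 2   # flip / retain-as-is probabilities
    total = 0.0
    for k in range(0, i + 1):
        if 0 <= j - i + k <= n - i:
            total += (comb(i, k) * q**k * p**(i - k)
                      * comb(n - i, j - i + k) * q**(j - i + k) * p**(n - j - k))
    return total

def poly_mul_linear(coeffs, a, b):
    # Multiply a coefficient list (lowest degree first) by the factor (a + b*z).
    out = [0.0] * (len(coeffs) + 1)
    for t, c in enumerate(coeffs):
        out[t] += a * c
        out[t + 1] += b * c
    return out

def genfunc_row(n, i, nu):
    # Row i of A as the coefficients of (p + q z)^(n-i) * (q + p z)^i.
    q, p = (1 - nu) / 2, (1 + nu) / 2
    coeffs = [1.0]
    for _ in range(n - i):
        coeffs = poly_mul_linear(coeffs, p, q)
    for _ in range(i):
        coeffs = poly_mul_linear(coeffs, q, p)
    return coeffs

if __name__ == "__main__":
    n, nu = 6, 0.3
    for i in range(n + 1):
        row = genfunc_row(n, i, nu)
        assert all(abs(row[j] - A_entry(n, i, j, nu)) < 1e-12 for j in range(n + 1))
    print("generating function matches the combinatorial sum")
```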
For the erasure model, the generating function is even simpler.

Proposition 2.3. For erasure noise, ∑_{j=0}^n A_{ij} z^j = ((1−ν) + νz)^i.

Proof. By definition of Erase_{1−ν}, each of the i ones of x survives independently with probability ν, so when j ≤ i we have A_{ij} = (i choose j) ν^j (1−ν)^{i−j}, and A_{ij} = 0 when j > i. The claimed generating function then follows from the binomial theorem.

To recap, in the estimation problem the learner's task is to estimate p_0 to accuracy ε, given samples from q. We recall the well-known fact that, by taking the empirical distribution of O(n/δ²) samples, the learner may obtain an estimate q̂ of q satisfying ‖q̂ − q‖_1 ≤ δ (with high probability). Although q = pA, as noted in previous works one unfortunately cannot effectively estimate p_0 simply as the first coordinate of q̂A^{−1}, because A is poorly conditioned. Instead one needs a more sophisticated approach.
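The following sketch (our illustration, using numpy) builds the erasure matrix A from Proposition 2.3 and shows its condition number exploding with n, which is the quantitative reason the naive inversion fails:

```python
import numpy as np
from math import comb

def A_erasure(n, nu):
    # A[i][j] = Pr[j ones survive | weight i] = C(i,j) nu^j (1-nu)^(i-j);
    # row i has generating function ((1-nu) + nu z)^i.
    A = np.zeros((n + 1, n + 1))
    for i in range(n + 1):
        for j in range(i + 1):
            A[i, j] = comb(i, j) * nu**j * (1 - nu)**(i - j)
    return A

if __name__ == "__main__":
    for n in (5, 10, 20):
        # cond(A) >= (1/nu)^n since A is triangular with diagonal nu^i, so
        # the first coordinate of q-hat A^{-1} amplifies the sampling error
        # in q-hat by an exponentially large factor.
        print(n, f"{np.linalg.cond(A_erasure(n, 0.5)):.3e}")
```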

Reduction to an analytic problem
It is not hard to characterize the optimal sample complexity for the estimation problem. Define

η(ε, ν) = min { ‖pA − p′A‖_1 : p, p′ probability vectors with |p_0 − p′_0| > 2ε }

(where the parameter ν implicitly appears within A). If two probability vectors p and p′ have |p_0 − p′_0| > 2ε, then a successful estimation algorithm must be able to distinguish the two cases. But if q = pA and q′ = p′A are close, in the sense that ‖q − q′‖_1 ≤ δ, then a learning algorithm needs Ω(1/δ) samples to distinguish them with high probability. We have reached the following conclusion: any estimation algorithm requires Ω(1/η(ε, ν)) samples.
On the other hand, suppose the lower bound η(ε/2, ν) ≥ 2δ holds. Consider an estimation algorithm that first produces an empirical estimate q̂ with ‖q̂ − q‖_1 < δ using O(n/δ²) samples, and then exactly solves the following optimization problem using linear programming:

min over probability vectors p̃ of ‖q̂ − p̃A‖_1.
(This can be efficiently written as an LP with O(n) variables and constraints and with rational numbers of poly(n) bit-complexity.) First, observe that the value of the above program is at most δ (because p̃ = p achieves objective value at most δ). Thus, if p* is the optimal solution, then ‖q̂ − p*A‖_1 ≤ δ. This implies that ‖pA − p*A‖_1 ≤ 2δ, which by our assumption η(ε/2, ν) ≥ 2δ implies that |p_0 − p*_0| ≤ ε. Thus we get an efficient solution to the estimation problem; a sketch of this LP appears below. In conclusion, we have established the following result.
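A hedged sketch of this LP (our own instance of the standard ℓ_1-to-LP reduction, using scipy; the paper's algorithm is this program in spirit, not necessarily in these exact variables):

```python
import numpy as np
from scipy.optimize import linprog

def estimate_p(q_hat, A):
    """Solve  min_{prob. vector p}  || q_hat - p A ||_1  as a linear program
    and return the optimal p.  Variables are (p, s), where the slack s_j
    upper-bounds the j-th absolute deviation, so minimizing sum(s) is
    exactly minimizing the l1 objective."""
    m = len(q_hat)                                           # m = n + 1
    c = np.concatenate([np.zeros(m), np.ones(m)])            # minimize sum(s)
    # Encode  pA - s <= q_hat  and  -pA - s <= -q_hat  (note (pA)_j = (A^T p)_j).
    A_ub = np.block([[A.T, -np.eye(m)], [-A.T, -np.eye(m)]])
    b_ub = np.concatenate([q_hat, -q_hat])
    A_eq = np.concatenate([np.ones(m), np.zeros(m)]).reshape(1, -1)  # sum(p) = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (2 * m), method="highs")
    return res.x[:m]

if __name__ == "__main__":
    from math import comb
    n, nu = 10, 0.5
    A = np.array([[comb(i, j) * nu**j * (1 - nu)**(i - j) if j <= i else 0.0
                   for j in range(n + 1)] for i in range(n + 1)])
    p_true = np.zeros(n + 1); p_true[0], p_true[3] = 0.6, 0.4
    p_star = estimate_p(p_true @ A, A)    # exact q, for illustration only
    print(round(p_star[0], 3))            # close to 0.6
```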
Thus we see that, up to polynomial factors, both the sample complexity and the algorithmic complexity of the estimation problem are controlled by the parameter η(ε, ν).
We now further simplify the definition of η(ε, ν), similarly to what was done in [5]. The difference of two probability vectors over [0..n] is precisely an arbitrary vector in the set

∆ = { c ∈ ℝ^{n+1} : ∑_i c_i = 0 and ‖c‖_1 ≤ 2 }.

Thus we have that η(ε, ν) = min_{c ∈ ∆, c_0 > 2ε} ‖cA‖_1.
Note also that ∑_j (cA)_j z^j is a polynomial in z that is easily calculated from the generating function of the noise process (see Proposition 2.2 and Proposition 2.3). Putting it all together, we obtain the following inequality: for any c ∈ ∆,

‖cA‖_1 ≥ max_{|z|=1} |∑_j (cA)_j z^j|,

since |z^j| = 1 whenever |z| = 1.
Given c ∈ ∆ with c_0 > 2ε, define the following polynomial (with real coefficients and a complex parameter):

Q_c(u) = ∑_{i=0}^n c_i u^i.

Thus the assumptions on c are equivalent to Q_c(0) > 2ε, Q_c(1) = 0, and L(Q_c) ≤ 2, where L(Q_c) is the length of Q_c, i. e., the sum of the absolute values of its coefficients.
Write E_c(z) (resp. F_c(z)) for the polynomial ∑_{j=0}^n (cA)_j z^j in the erasure (resp. bit-flip) noise model. In analyzing E_c, we use that E_c(z) = ∑_{i=0}^n c_i u^i = Q_c(u), where u = (1 − ν) + νz. As z ranges over the unit circle |z| = 1, the range of the parameter u, namely {(1 − ν) + νz : |z| = 1}, is precisely the circle ∂D_ν(1 − ν) of radius ν centered at the real value 1 − ν. Thus

max_{|z|=1} |E_c(z)| = max_{u ∈ ∂D_ν(1−ν)} |Q_c(u)|.

In analyzing F_c, we use that by Proposition 2.2,

F_c(z) = ∑_{i=0}^n c_i ((1+ν)/2 + ((1−ν)/2)z)^{n−i} ((1−ν)/2 + ((1+ν)/2)z)^i = (((1+ν) + (1−ν)z)/2)^n · Q_c(w), where w = ((1−ν) + (1+ν)z) / ((1+ν) + (1−ν)z).

As the parameter z ranges over the unit circle, w also ranges over the unit circle. To see this, note that w is a Möbius transformation of z, and thus as z ranges over a circle, w ranges over either a circle or a line. Further, it is easy to see that for z = 1, −1, i, the resulting w lies on the unit circle. Consequently, for any z such that |z| = 1, w lies on the unit circle. Parameterizing w as w = e^{iθ} and writing Φ_c(w) = F_c(z) for z, w related as above, it is not hard to compute that

|Φ_c(e^{iθ})| = (ν² / (ν² + (1 − ν²) sin²(θ/2)))^{n/2} · |Q_c(e^{iθ})|.     (3.5)

We have reached the following conclusion: if Q is a polynomial with real coefficients satisfying |Q(0)| ≥ 2ε and L(Q) ≤ 2, then max_{u ∈ ∂D_ν(1−ν)} |Q(u)| ≥ 2ε^{max{1, (1−ν)/ν}}.

Proof. Let us first consider the case 1/2 ≤ ν ≤ 1. In this case, note that the point 0 is contained in the disk D_ν(1 − ν). By the maximum modulus principle, |Q(0)| ≤ max_{u ∈ ∂D_ν(1−ν)} |Q(u)|. Since |Q(0)| ≥ 2ε, this completes the proof in this case.
Let us now assume that 0 < ν < 1/2. Let U be the unit circle, let O be the circle of radius 1/2 centered at 1/2 (which lies inside U), and let C = ∂D_ν(1 − ν) (which lies inside O). The Möbius transformation A(u) = 1/(1 − u) takes these circles to vertical lines U′, O′, and C′ with real parts 1/2, 1, and 1/(2ν), respectively. Defining the function f(u) = Q(A^{−1}(u)), we have that f is bounded on the strip defined by U′ and C′, with sup_{u∈U′} |f(u)| ≤ 2 and sup_{u∈O′} |f(u)| ≥ 2ε. Writing M for the maximum modulus of f on C′, the Hadamard three-lines theorem implies that

2ε ≤ sup_{u∈O′} |f(u)| ≤ (sup_{u∈U′} |f(u)|)^{1−t} · M^t ≤ 2^{1−t} M^t, where t = (1 − 1/2)/(1/(2ν) − 1/2) = ν/(1−ν),

which completes the proof after rearrangement (yielding M ≥ 2ε^{(1−ν)/ν}).
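The key geometric claim in the bit-flip analysis, that w stays on the unit circle, admits a quick numeric check. The formula for w below is the one displayed above (our reconstruction from the generating function of Proposition 2.2):

```python
import cmath

def mobius_w(z, nu):
    # The Mobius transformation arising from the bit-flip generating function:
    # w = ((1-nu) + (1+nu) z) / ((1+nu) + (1-nu) z).
    return ((1 - nu) + (1 + nu) * z) / ((1 + nu) + (1 - nu) * z)

if __name__ == "__main__":
    nu = 0.3
    for k in range(8):
        z = cmath.exp(2j * cmath.pi * k / 8)     # points on the unit circle
        print(f"|z| = {abs(z):.6f}   |w| = {abs(mobius_w(z, nu)):.6f}")
    # Every printed |w| equals 1: the unit circle maps onto the unit circle.
```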

An upper bound on η(ε, ν) for erasure noise
In this section we will prove Theorem 4.1, our upper bound on η(ε, ν) for erasure noise. In order to prove this theorem, we will first collect a few facts. Given a, r > 0, define the set B_{a,r} as

B_{a,r} = { 1 − 8a + 4a(z + 1/z) : z ∈ ∂D_r(0) }.

We now make a few observations about the set B_{a,r} as r varies. In particular, we have the following fact.

• For r = 1, the set B_{a,1} is the line segment joining 1 and 1 − 16a.
• For r = 2, the set B_{a,2} is the ellipse centered at 1 − 8a with major axis [1 − 8a − 10a, 1 − 8a + 10a] and semi-minor axis of length 6a.
• For r = 4, the set B_{a,4} is the ellipse centered at 1 − 8a with major axis [1 − 8a − 17a, 1 − 8a + 17a] and semi-minor axis of length 15a.

Proof. For z ∈ ∂D_r(0), we can write z = re^{iθ} with θ ∈ [0, 2π). Consequently, points w = x_1 + iy_1 on B_{a,r} can be parameterized as

x_1 = 1 − 8a + 4a(r + 1/r) cos θ,  y_1 = 4a(r − 1/r) sin θ,

so for r ≠ 1 the pair (x_1, y_1) satisfies

(x_1 − (1 − 8a))² / (4a(r + 1/r))² + y_1² / (4a(r − 1/r))² = 1.

This implies that for each r > 1, B_{a,r} describes an ellipse centered at 1 − 8a, with major axis the real segment [1 − 8a − 4a(r + 1/r), 1 − 8a + 4a(r + 1/r)] and semi-minor axis of length 4a(r − 1/r) along the imaginary direction. Plugging in the values r ∈ {1, 2, 4}, we get the claim.
Next, we make the following claim.

Claim. The disk D_{4a}(1 − 4a) is contained in the region bounded by the ellipse B_{a,2}.

Proof. The ellipse B_{a,2} is centered at 1 − 8a with the major and minor axes aligned with the real and imaginary axes. Further, the length of the semi-major axis is 10a and the length of the semi-minor axis is 6a. Thus any point z = x + iy is contained in (the region bounded by) this ellipse as long as

(x − (1 − 8a))² / (10a)² + y² / (6a)² ≤ 1.

The circle ∂D_{4a}(1 − 4a) consists of the points z = x + iy with x = 1 − 4a + 4a cos θ and y = 4a sin θ. Observe that for any such point,

(x − (1 − 8a))² / (10a)² + y² / (6a)² = (4/25)(1 + cos θ)² + (4/9) sin² θ.

The last quantity is easily bounded by 1, showing that ∂D_{4a}(1 − 4a) is contained in B_{a,2}; this immediately implies the same for D_{4a}(1 − 4a).
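A small numeric verification of the containment just proved (our illustration):

```python
import math

def on_circle(theta, a):
    # A point on the circle boundary dD_{4a}(1-4a).
    return (1 - 4 * a + 4 * a * math.cos(theta), 4 * a * math.sin(theta))

def in_ellipse(x, y, a):
    # Membership in the region bounded by B_{a,2}: ellipse centered at
    # 1 - 8a with semi-axes 10a (real) and 6a (imaginary).
    return ((x - (1 - 8 * a)) / (10 * a))**2 + (y / (6 * a))**2 <= 1

if __name__ == "__main__":
    a = 0.01
    assert all(in_ellipse(*on_circle(2 * math.pi * t / 1000, a), a)
               for t in range(1000))
    print("dD_{4a}(1-4a) lies inside B_{a,2}")
```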
By Hadamard's three-circle theorem, any function f holomorphic on the disk of radius r_3 satisfies

max_{|z|=r_2} |f(z)| ≤ (max_{|z|=r_1} |f(z)|)^t · (max_{|z|=r_3} |f(z)|)^{1−t}   for 0 < r_1 < r_2 < r_3, where t = ln(r_3/r_2)/ln(r_3/r_1).

Consequently, we have the following corollary (Corollary 4.5). We next recall the following result from [8] (Theorem 4.6), and we will also use the following result from [4]. Here the last inequality uses the relation between T and M and the fact that ε ≤ τ. Finally, set a = ν/63. Then, applying Corollary 4.5 and plugging in a = ν/63, M = ln(1/ε)/ν² and T = (2Mν)/7, we obtain the claimed bound, which concludes the proof.
Theorem 5.2 (Corollary 3.2 of [3]). There is a universal constant c > 0 such that the following holds. Let Q(u) be a univariate polynomial with complex coefficients, Q(u) = ∑_{j=0}^n b_j u^j, with |b_0| = 1 and all coefficients satisfying |b_j| ≤ M. Let A be a subarc of the unit circle of length a, where 0 < a < 2π. Then there is some w ∈ A such that |Q(w)| ≥ e^{−c(1 + ln M)/a}.

We now apply this theorem to the polynomial Q_c/c_0 by setting "M" to 1/c_0, "a" to θ*, and "A" to A*. This yields some w ∈ A* with |Q_c(w)| ≥ c_0 · e^{−c(1 + ln(1/c_0))/θ*}. Combining with equation (5.3) completes the proof.

An upper bound on η(ε, ν) for bit-flip noise
In this section we prove the following result.
Theorem 5.1. There is a universal constant c > 0 such that for all ν, all 0 < ε < c, and all n ∈ ℕ which satisfy 2 ln(2/ε)/n ≤ 1 − ν and ν⁴ ≥ ln(1/ε)/n, it holds that η(ε, ν) ≤ exp(−Ω(n^{1/3} · ln^{2/3}(1/ε)/ν^{2/3})).

Recalling equation (3.6), to prove this result we must demonstrate the existence of a vector [c_0 c_1 · · · c_n] ∈ ∆ with c_0 > 2ε satisfying the bound (5.4),
where we recall the definition of F_c from equation (3.2). To prove (5.4), we will use Theorem 4.6 and the following lemma, which relates the multiplicity of the roots of a polynomial at 1 to the supremum of the polynomial on an arc of the unit circle centered at 1. For this, we fix a suitable degree parameter M (a function of n, ν and ε). We first make the following observations about M.

(i) Since ν⁴ ≥ ln(1/ε)/n, it is the case that M ≤ n.
(ii) Since 1 − ν ≥ 2 ln(2/ε)/n, it is moreover the case that M ≥ ln(1/ε).

For M as defined above, let us rescale the polynomial from Theorem 4.6 so that |a_0| = 2ε and thus ∑_{j=1}^M |a_j| ≤ 1. We now set c_j = a_j for all 0 ≤ j ≤ M and c_j = 0 otherwise. Note that since M ≤ n, this is well defined.
By construction, the polynomial p(u) = ∑_{j=0}^M c_j u^j has at least T repeated roots at 1, where the last equality in the definition of T uses 1 − ν ≥ 2 ln(2/ε)/n. We note the value of T for later reference, and we define θ* accordingly; observe that since 1 − ν ≥ 2 ln(1/ε)/n, it holds that θ* ≤ 4/63. Let A be the arc of the unit circle A = {e^{iθ} : −θ* ≤ θ ≤ θ*}. Applying Lemma 5.4 (and observing that all coefficients of p of degree M + 1 and higher are zero), we obtain an upper bound on sup_{w∈A} |p(w)|; here the first inequality uses the definition of A, the second inequality uses equation (5.7), and the last equality uses equation (5.5).
To bound |Φ_c(w)| on A, we will need a couple of facts. First, since ∑_{j=0}^n |c_j| ≤ 2, it is the case that |p(w)| ≤ 2 for all |w| = 1. Consequently, recalling equation (3.5), we may bound |Φ_c(w)|, where the last inequality uses sin²(θ*/2) ≥ (θ*)²/8, which holds since θ* ≤ 4/63. Finally, again using ν⁴ ≥ ln(1/ε)/n and recalling equation (5.6), we obtain the desired bound on η(ε, ν) with room to spare. This concludes the proof.

Reduction between the restricted and the general lower bound

In this section we give a reduction from item 2 to item 3 of Theorem 1.2. We first set up some notation. For a distribution D supported on {0, 1}^n, let D_e denote the distribution of Erase_{1−ν}(x) where x ∼ D. Let D_{1,e} denote the distribution of the number of ones in D_e, and let D_{0,1,e} denote the joint distribution of the numbers of zeros and ones in D_e. Note that D_{1,e} is supported on ℕ and D_{0,1,e} is supported on ℕ².
Note that by symmetrization, without loss of generality we can assume that X and Y are symmetric distributions. As a consequence, for any pair z, z′ ∈ {0, 1, ?}^{n_0} which have the same number of zeros and ones (and hence the same number of '?' symbols), X_e(z) = X_e(z′) and Y_e(z) = Y_e(z′). Item 3 would thus follow if we could show ‖X_{0,1,e} − Y_{0,1,e}‖_1 ≤ ε^{Ω(1/ν)}. While this is not necessarily true, we will modify X and Y to achieve this property.
Choose a number m_0 (we will fix its value later) and define X′ by sampling x from X and padding with m_0 zeros; the distribution Y′ is defined likewise. For any ℓ ∈ ℕ and q ∈ [0, 1], let Bin(ℓ, q) denote the binomial distribution with ℓ trials, where each trial succeeds with probability q. We now claim that

‖X′_{0,1,e} − X′_{1,e} × Bin(m_0, ν)‖_1 ≤ n_0 / √(m_0 · min{ν, 1 − ν}).     (6.1)

To prove this, we need the following basic fact about binomial distributions: shifting Bin(ℓ, q) by one changes it by at most O(1/√(ℓ · q(1−q))) in ℓ_1 distance.
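A quick numeric illustration of this shift-invariance (our own sketch; the 1/√(mq(1−q)) rate printed alongside is the standard local-CLT estimate, stated here as an assumption):

```python
from math import comb

def binom_pmf(m, q):
    return [comb(m, k) * q**k * (1 - q)**(m - k) for k in range(m + 1)]

def shift_l1(m, q):
    # l1 distance between Bin(m, q) and Bin(m, q) shifted by one, i.e. the
    # cost (per unit of padding) of one extra surviving padded zero.
    pmf = binom_pmf(m, q)
    shifted = [0.0] + pmf            # mass moved from k to k + 1
    padded = pmf + [0.0]
    return sum(abs(x - y) for x, y in zip(padded, shifted))

if __name__ == "__main__":
    q = 0.3
    for m in (100, 400, 1600, 6400):
        # The distance decays like 1/sqrt(m*q*(1-q)): quadrupling m halves it.
        print(m, round(shift_l1(m, q), 4), round((m * q * (1 - q))**-0.5, 4))
```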
This almost proves item 3, except for the factor of 1/min{ν, 1 − ν} in the lower bound on n.
To remove this factor, note that even when ν = 1 (i. e., there is no noise at all), there is a trivial lower bound of ε^{−2} for any estimation algorithm for NPR (this holds as long as n ≥ log(1/ε)). Further, since access to samples with erasure noise ν can be simulated given access to samples with noise rate ν′ (provided ν ≤ ν′, by independently erasing each surviving coordinate with probability 1 − ν/ν′), this lower bound holds for every noise rate ν. Combining this observation with the earlier lower bound proves item 3.