Some Limitations of the Sum of Small-bias Distributions

We exhibit ε-biased distributions D on n bits and functions f : {0,1}^n → {0,1} such that the xor of two independent copies (D + D) does not fool f, for several choices of f and ε described below.

We shall refer to such probability distributions simply as "distributions." Small-bias distributions, introduced by Naor and Naor [24], cf. [1, 2, 5], are distributions that look balanced to parity functions over {0,1}^n.

Definition 1.1. A distribution D over {0,1}^n is ε-biased if for every nonempty subset I ⊆ [n], we have

|E_{x∼D}[(−1)^{∑_{i∈I} x_i}]| ≤ ε.

An ε-biased distribution can be generated using a seed of O(log(n/ε)) bits. Since their introduction, small-bias distributions have become a fundamental object in theoretical computer science and have found uses in many areas, including derandomization and algorithm design.
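To make Definition 1.1 concrete, the bias of an explicitly given distribution can be checked by brute force over all parity tests. The following sketch (our own illustrative code, exponential time, tiny n only) does exactly that.

```python
from itertools import combinations

def bias(dist, n):
    """Maximum over nonempty I of |E_{x~dist}[(-1)^(sum_{i in I} x_i)]|.
    `dist` is a list of (probability, bit-tuple) pairs."""
    worst = 0.0
    for size in range(1, n + 1):
        for I in combinations(range(n), size):
            e = sum(p * (-1) ** sum(x[i] for i in I) for p, x in dist)
            worst = max(worst, abs(e))
    return worst

n = 3
# The uniform distribution over {0,1}^3 has bias 0 ...
uniform = [(1 / 2 ** n, tuple((v >> i) & 1 for i in range(n)))
           for v in range(2 ** n)]
# ... while the uniform distribution over even-parity strings has bias 1,
# witnessed by the full parity test I = [n].
even = [(1 / 2 ** (n - 1), x) for _, x in uniform if sum(x) % 2 == 0]
```

The second example is exactly the kind of test that motivates the definition: a parity test detects any distribution supported on a proper affine subspace.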
In the last decade or so researchers have considered the sum (i.e., bitwise XOR) of several independent copies of small-bias distributions. The first paper to explicitly consider it is [8]. This distribution appears to be significantly more powerful than a single small-bias copy, while retaining a modest seed length. In particular, two main questions have been asked.

Question 1.2 (RL). Reingold and Vadhan (personal communication) asked whether there exists a constant c such that the sum of two independent copies of any n^{−c}-biased distribution fools one-way logarithmic space, a.k.a. one-way polynomial-size branching programs, which would imply RL = L. It is known that a small-bias distribution fools one-way width-2 branching programs (Saks and Zuckerman, see also [7] where a generalization is obtained). No such result is known for width-3 programs.

Question 1.3 (Polynomials). The papers [8, 21, 39] show that the sum of d small-bias generators fools F_2-polynomials of degree d. However, the proofs only apply when d ≤ (1 − Ω(1)) log n. It is an open question whether the construction works for larger d. If the construction worked for any d = log^{O(1)} n, it would make progress on long-standing open problems in circuit complexity regarding AC^0 with parity gates [28]. This question is implicit in the works [8, 21, 39] and explicit in the survey [38, Chapter 1] (Open question 4).
In terms of negative results, Meka and Zuckerman [22] show that the sum of two distributions with constant bias does not fool mod 3 linear functions. Bogdanov, Dvir, Verbin, and Yehudayoff [7] show that for ε = 2^{−O(√(n/k))}, the sum of k copies of ε-biased distributions does not fool circuits of size poly(n) and depth O(log² n) (NC²).
This paper gives two different approaches to improving on both these results and to obtaining other limitations of the sum of small-bias distributions. One is based on the complexity of decoding, and the other on bounding the mod 3 rank (see Definition 1.8). Either approach is a candidate to answer negatively the "RL question" (Question 1.2).

THEORY OF COMPUTING, Volume 13 (16), 2017, pp. 1–23

Our results
The following theorem states our main counterexamples. We denote by D + D the bitwise XOR of two independent copies of a distribution D.
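For a single parity test, the effect of XOR-ing independent copies is easy to see: the signed bias multiplies, so D + D has bias at most ε² against every test when D is ε-biased. A brute-force sanity check (our own illustrative code, tiny n only):

```python
from itertools import combinations, product

def signed_bias(dist, I):
    """Signed bias E_{x~dist}[(-1)^(sum_{i in I} x_i)] of one parity test."""
    return sum(p * (-1) ** sum(x[i] for i in I) for p, x in dist)

def xor_dist(d1, d2):
    """Distribution of the bitwise XOR of independent samples of d1 and d2."""
    acc = {}
    for (p1, x), (p2, y) in product(d1, d2):
        z = tuple(a ^ b for a, b in zip(x, y))
        acc[z] = acc.get(z, 0.0) + p1 * p2
    return [(p, z) for z, p in acc.items()]

# A skewed 3-bit distribution: probability of the string encoding v is
# proportional to v + 1.
n, total = 3, sum(v + 1 for v in range(8))
D = [((v + 1) / total, tuple((v >> i) & 1 for i in range(n)))
     for v in range(8)]
# bias_I(D + D) = bias_I(D)^2 for every nonempty I, since the expectation
# factors over the two independent copies.
```

This multiplicativity is why the sum retains a modest seed length while looking much less biased to every individual parity test; the theorems below show its limits against richer test classes.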
Theorem 1.4. For any c, there exists an explicit ε-biased distribution D over {0,1}^n and an explicit function f such that f(D + D) = 0 and Pr_{x∼{0,1}^n}[f(x) = 0] ≤ p, where ε, f, p are given by any one of the following choices:

i. ε = 2^{−Ω(n)}, f is a uniform poly(n)-size circuit, and p = 2^{−Ω(n)};
ii. ε = 2^{−Ω(n/log n)}, f is a uniform fan-in 2, poly(n)-size circuit of depth O(log² n), and p = 2^{−n/4};
iii. ε = 1/n^c, f is a one-way O(c log n)-space algorithm, and p = O(1/n^c);
iv. ε = n^{−Ω(1)}, f is a mod 3 linear function, and p = 1/2.

Moreover, all our results extend to more copies of D as follows. The input D + D to f can be replaced by the bitwise XOR of k independent copies of D if ε is replaced by ε^{2/k}, where k is at most the following quantities corresponding to the above items: i. n/60; ii. n/(6 log n); iii. 2c; iv. O(log n/log log n).

Theorem 1.4.i is tight up to the constant in the exponent because every ε2^{−n}-biased distribution is ε-close to uniform.
Theorem 1.4.ii would also be true with ε = 2^{−Ω(n)} if a decoder for certain algebraic-geometric codes ran in NC², which we conjecture it does. [7] prove Theorem 1.4.ii with ε = 2^{−O(√(n/k))}. Theorem 1.4.iii can also be obtained in the following way, pointed out to us by Chen and Zuckerman (personal communication). Since one can distinguish a set of size s from uniform with a width-(s + 1) branching program, and there exist ε-biased distributions with support size O(n/ε²), the sum of two such distributions can be distinguished from uniform in space O(c log n) when ε = n^{−c}. Actually, both their proof and ours (presented later) apply to c > 0.01; but for smaller c Theorem 1.4.iv kicks in.
We have not been able to say anything on the "Polynomials question" (Question 1.3).
There exist other models of interest. For read-once DNFs no counterexample with large error is possible, because Chari, Rohatgi, and Srinivasan [12], building on [15], show that (just one) n^{−O(log(1/δ))}-biased distribution fools any read-once DNF on n variables with error δ, cf. Section 5. The [12] result is rediscovered by De, Etesami, Trevisan, and Tulsiani [14], who also show that it is essentially tight by constructing a distribution which is n^{−Ω(log(1/δ)/log log(1/δ))}-biased yet does not δ-fool a read-once DNF. In particular, fooling with polynomially small error requires super-polynomially small bias.
It would be interesting to know whether the XOR of two copies overcomes this limitation, i.e., whether it δ-fools any read-once DNF on n variables provided each copy has bias poly(δ/n). If true, this would give a generator with seed length O(log(n/δ)), which is open.
We are unable to resolve this for read-once DNFs. However, we show that the corresponding result for general DNFs would resolve long-standing problems on circuit lower bounds [35]. This can be interpreted as saying that such a result for DNFs is either false or extremely hard to prove. We also get conditional counterexamples for depth-3 and AC^0 circuits.

Theorem 1.5. Suppose polynomial time (P) has fan-in 2 circuits of linear size and logarithmic depth. Then Theorem 1.4 also applies to the following choices of parameters:

i. ε = n^{−ω(1)}, f is a depth-3 circuit of size n^{o(1)} and unbounded fan-in, and p = n^{−ω(1)};
ii. ε = n^{−ω(1)}, f is a DNF formula of size poly(n), and p = 1 − 1/n^{o(1)}.

Moreover, all our results extend to more copies of D as follows. The input D + D to f can be replaced by the bitwise XOR of k ≤ log n independent copies of D if ε is replaced by ε^{2/k}.
Recall that it is still open whether NP has linear-size circuits of logarithmic depth.
Theorem 1.6. Suppose for every δ > 0 there exists a constant d such that NC² has AC^0 circuits of size 2^{n^δ} and depth d. Then Theorem 1.4 also applies to the following choice of parameters: ε = n^{−log^c n}, f is an AC^0 circuit of size n^{O(c)} and depth O(c), and p = n^{−log^{Ω(1)} n/4}. Moreover, our result extends to more copies of D as follows. The input D + D to f can be replaced by the bitwise XOR of k ≤ log^{c+1} n/(6(c + 1) log log n) independent copies of D if ε is replaced by ε^{2/k}.
Recall that the assumption in Theorem 1.6 holds for NC¹ instead of NC²; in fact it holds even for logspace. Moreover, the parameters in the conclusion of Theorem 1.6 are tight in the sense that n^{−(log n)^{O(d)}} bias fools AC^0 circuits of size n^d and depth d, as shown in the sequence of works [4, 29, 10, 34].
All the above results except Theorem 1.4.iv are based on a new, simple connection between small-bias generators and error-correcting codes, discussed in Section 1.2.

Definition 1.7. A distribution over X^n is k-wise independent if its marginal distribution on any k positions is uniformly distributed.

Definition 1.8. Let S ⊆ F_p^n be a set of vectors. Define the mod p rank of S, denoted rank_p(S), to be the rank of S over F_p. We define the mod p rank of a distribution D to be the mod p rank of its support.
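The mod p rank of Definition 1.8 is simply the rank of the support, viewed as a matrix over F_p, and is computable by Gaussian elimination. A small sketch (our own code; it uses that p is prime, so inverses are pow(a, p − 2, p)):

```python
def rank_mod_p(vectors, p):
    """Rank over F_p (p prime) of a set of integer vectors, by Gaussian
    elimination modulo p (Definition 1.8 applied to a support set)."""
    rows = [[x % p for x in v] for v in vectors]
    rank = 0
    for col in range(len(rows[0]) if rows else 0):
        piv = next((r for r in range(rank, len(rows)) if rows[r][col]), None)
        if piv is None:
            continue
        rows[rank], rows[piv] = rows[piv], rows[rank]
        inv = pow(rows[rank][col], p - 2, p)   # Fermat inverse, p prime
        rows[rank] = [inv * x % p for x in rows[rank]]
        for r in range(len(rows)):
            if r != rank and rows[r][col]:
                f = rows[r][col]
                rows[r] = [(a - f * b) % p for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank

# Example over F_3: (1,1,0) + (0,1,1) = (1,2,1), so the three vectors
# below have mod 3 rank 2.
```
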
Definition 1.9. The correlation of two functions f, g : {0,1}^n → {0,1} is |Pr_x[f(x) = g(x)] − Pr_x[f(x) ≠ g(x)]|.

Theorem 1.4.iv instead follows [22] and bounds the mod 3 rank of small-bias distributions. It turns out an upper bound on the mod 3 rank of some k-wise independent distributions over bits would allow us to reduce the bias in Theorem 1.4.iv, assuming long-standing conjectures on correlation bounds for low-degree polynomials (which may be taken as standard).
Claim 1.10. Suppose

1. the parity of k copies of mod 3 on disjoint inputs of length m has correlation 2^{−Ω(k)} with any F_2-polynomial of degree ε√m, for some constant ε > 0, and
2. for every c, there exists an n^{−c}-almost c log n-wise independent distribution whose support on {0,1}^n ⊆ F_3^n = {0,1,2}^n has mod 3 rank at most n^{0.49}.

Then the "RL question" has a negative answer, i.e., for every c, there exists an n^{−c}-biased distribution D such that D + D does not fool a one-way O(log n)-space algorithm. More specifically, D + D does not fool a mod 3 linear function.
Contrapositively, an affirmative answer to the "RL question," even for permutation, width-3 branching programs, implies lower bounds on the mod 3 rank of k-wise independent distributions, or that the aforementioned correlation bounds are false.
What we know about the second assumption in Claim 1.10 is in Section 3, where we initiate a systematic study of the mod 3 rank of (almost) k-wise independent distributions, and obtain the following lower and upper bounds. First, we give an Ω(k log n) lower bound on the mod 3 rank for almost k-wise independent distributions, specifically, distributions such that any k coordinates are 1/10-close to being uniform over {0,1}^k (Claim 3.1). This also gives an exponential separation between mod 3 rank and seed length for such distributions.
We then prove the following upper bounds; see Claim 3.9.
Theorem 1.11. For infinitely many n, there exist k-wise independent distributions over {0,1}^n with mod 3 rank d for k = 2 and d ≤ n^{0.73}.
We note that an upper bound of n − 1 on the mod 3 rank of a k-wise independent distribution implies that the distribution is constant on a mod 3 linear test. We ask what is the largest k* = k*(n) such that there exists a k-wise independent distribution with mod 3 rank ≤ n − 1. We conjectured the bound k*(n) = Ω(n). Partial progress towards this conjecture appeared in a preliminary version of this paper [20]. This conjecture was later verified [9].

Our techniques
All our counterexamples in Theorems 1.4 and 1.5, except Theorem 1.4.iv, come from a new connection between small-bias distributions and linear codes, which we now explain. Let C ⊆ F^n be a linear error-correcting code over a finite field of characteristic 2. (Using characteristic 2 allows us to work with small bias over bits, as opposed to large alphabets, which makes things slightly simpler.) We also use C to denote the uniform distribution over the code. Define N_e to be the "noise" distribution over F^n obtained by repeating the following process e times: pick a uniformly random position from [n], and set it to a uniform symbol in F. Now define D_e to be the distribution on n log|F| bits obtained by adding N_e to C, and we have the following fact.

Fact 1.12. Let d⊥ be the minimum distance of the dual code of C. Then D_e is ε-biased for ε = (1 − d⊥/n)^e.
Proof. If a test is on fewer than d⊥ field elements, D_e has zero bias because it is (d⊥ − 1)-wise independent. Otherwise, the bias is nonzero only if none of the symbols touched by the test is hit by random noise, which happens with probability at most (1 − d⊥/n)^e.

Our main observation is that the XOR of two noisy codewords is also a noisy codeword, with the number of errors injected into the codeword doubled. That is, D_e + D_e is supported on C + N_{2e}.

Definition 1.13. An algorithm is a threshold-e discriminator for the code C if it decides whether a string is within Hamming distance e of the code.

Now suppose an algorithm is a threshold-2e discriminator for C. Then it can be used to distinguish D_e + D_e from uniform. More generally, if an algorithm is a threshold-ke discriminator for C, then it can distinguish the XOR of k independent copies of D_e from uniform. Contrapositively, if D_e + D_e fools f, then f is not a threshold-2e discriminator for C. Thus, to obtain counterexamples we only have to exhibit an appropriate threshold discriminator. We achieve this by drawing on results from coding theory. This is explained below, after two remarks.
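The observation that D_e + D_e is a noisy-codeword distribution with at most 2e errors can be checked directly on a toy binary code. The generator matrix below is an arbitrary example of ours, not a code used in the paper:

```python
import random

G = [[1, 1, 0, 1, 0, 0],     # generator matrix of a toy [6,3] binary code
     [0, 1, 1, 0, 1, 0],
     [1, 0, 1, 0, 0, 1]]

def sample_D_e(e, rng):
    """One draw from D_e: a uniform codeword, followed by e noise rounds,
    each setting a uniformly random position to a uniformly random bit."""
    cw = [0] * len(G[0])
    for row in G:
        if rng.random() < 0.5:
            cw = [a ^ b for a, b in zip(cw, row)]
    for _ in range(e):
        cw[rng.randrange(len(cw))] = rng.randrange(2)
    return cw

def dist_to_code(y):
    """Brute-force Hamming distance from y to the code (tiny codes only)."""
    best = len(y)
    for mask in range(2 ** len(G)):
        cw = [0] * len(G[0])
        for i, row in enumerate(G):
            if (mask >> i) & 1:
                cw = [a ^ b for a, b in zip(cw, row)]
        best = min(best, sum(a != b for a, b in zip(cw, y)))
    return best
```

Since the two hidden codewords XOR to a codeword, every sample of D_e + D_e lies within Hamming distance 2e of the code, which is exactly what a threshold-2e discriminator detects.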
Remark 1.14. Our threshold discriminator is only required to tell apart noisy codewords and uniformly random strings. This is a weaker condition than decoding. In fact, similar threshold discriminators have been considered in the context of tolerant property testing [17, 19, 30], where tolerant testers are designed to decide if the input is close to being a codeword or far from every codeword, by looking at as few positions of the input as possible.
Remark 1.15. We note that our connection between ε-biased distributions and linear codes is different from the well-known connection in [24], which shows that for a binary linear code with relative minimum and maximum distance ≥ 1/2 − ε and ≤ 1/2 + ε, respectively, the columns of its k × n generator matrix form the support of an ε-biased distribution over {0,1}^k. However, that connection to codes is lost once we consider the sum of the same distributions. In contrast, the sum of our distributions bears the code structure of a single copy.
As hinted before Fact 1.12, the small-bias property is established through a case analysis based on the weight of the test. This paradigm goes back at least to the original work by Naor and Naor [24]. It was used again more recently in [23, 3]. Our reasoning is especially close to [23, 3] because in both papers small tests are handled by local independence but large tests by sums of independent biased bits.
For general circuits (Theorem 1.4.i), we consider the asymptotically good binary linear code with constant dual relative distance, based on algebraic geometry and exhibited by Guruswami in [32]. We conjecture that the corresponding threshold discriminator can be implemented in NC². However, we are unable to verify this. Instead, for NC² circuits (Theorem 1.4.ii), we use Reed–Solomon codes and the Peterson–Gorenstein–Zierler syndrome-decoding algorithm [27, 16], which we note is in NC². Under the assumption that NC² is contained in AC^0 circuits of size 2^{n^δ}, by scaling the NC² result down to polylog n bits followed by a depth reduction, we obtain our results for AC^0 circuits (Theorem 1.6). This result could also be obtained by scaling down a result in [7].
Our counterexample for one-way log-space computation (Theorem 1.4.iii) also uses Reed–Solomon codes. The threshold discriminator is simply syndrome decoding: to decode from e errors, it can be realized by computing the syndrome in a one-way fashion using space O(e log q), where q is the size of the underlying field of the code. For a given constant c, setting q = n, message length k = d⊥ − 1 = n − O(c), and e = O(c), we obtain a one-way space-O(c log n) distinguisher for the sum of two distributions with bias n^{−c}.
Naturally, one might try to eliminate the dependence on c in the O(c log n) space bound with a different choice of e and q, which would answer the "RL question" in the negative. In Claim 2.2, however, we show that to obtain n^{−c} bias, the space O(e log q) for syndrome decoding must be Ω(c log n), regardless of the code and the alphabet. Thus our result is the best possible that can be obtained using syndrome decoding. We raise the question of whether syndrome decoding is optimal for one-way decoding in this setting of parameters, and specifically whether it is possible to devise a one-way decoding algorithm using space o(e log q). There do exist alternative one-way decoding algorithms, cf. [30], but apparently not for our setting of parameters of e = O(1) and k = n − O(1).
Our conditional result for depth-3 circuits and DNF formulas (Theorem 1.5) follows from scaling down to barely superlogarithmic input length, and a depth reduction [35] (cf. [38, Chapter 3]) of the counterexample for general circuits (Theorem 1.4.i). We note that the 2^{−Ω(n)} bias in Theorem 1.4.i is essential for this result, in the sense that 2^{−n/log n} bias would be insufficient to obtain Theorem 1.5. We also remark that since O(log² n)-wise independence suffices to fool DNF formulas [4], one must consider linear codes with dual distance less than log² n in our construction, and so D_e has bias at least (1 − log² n/n)^e = 2^{−O(log² n)}. On the other hand, [14] shows that 2^{−O(log² n log log n)} bias fools DNF formulas.
The connection between codes and small-bias distributions motivates us to study further the complexity of decoding. [37, Chapter 6] and [31] show that list-decoding requires computing the majority function. In Claim 4.2 we extend their ideas and prove that the same requirement holds even for decoding up to half of the minimum distance. This gives some new results for AC^0 and for branching programs. Finally, since log^{O(1)} n-wise independence fools AC^0 [10, 34], we obtain that AC^0 cannot distinguish a codeword from a code with log^{Ω(1)} n dual distance from uniformly random strings. This also gives some explanation of why scaling is necessary to obtain Theorem 1.6 from Theorem 1.4.ii.
A different approach. We now explain the high-level ideas in proving Theorem 1.4.iv. Meka and Zuckerman [22] construct the following constant-bias distribution D over n := d^5 bits with mod 3 rank less than √n. Each output bit is the square of the mod 3 sum of 5 out of the d uniformly random input bits, which can be written as a degree-5 polynomial over F_2. Since any parity of the output bits is also a degree-5 polynomial over {0,1}^d, D has constant bias. To show that a mod 3 linear function is always 0 on the support of D + D, they observe that for sufficiently large n, D has mod 3 rank at most d² < √n, and so D + D has mod 3 rank at most O(d⁴) < n.

We extend their construction using ideas from the Nisan generator [25]: we pick a pseudo-design consisting of n sets where each set has size n^β (we will choose β to be a small constant), and the intersection of any two sets has size O(log n). Such a pseudo-design exists provided the universe has size n^{2β}. The output distribution is again the square of the mod 3 sum on each set.
For any test of size at least C log n bits, let J be any C log n bits of the test. We fix the intersections of their corresponding sets in the universe to make them independent. After we do this, every bit in J is still a mod 3 function on n^β − |J| log n ≥ 0.9n^β bits.
We further fix every bit outside the |J| sets in the universe. This will not affect the bits in J. Now consider any bit b in the test that is not in J; it corresponds to a set which has intersection at most log n with each of the sets that correspond to the bits in J. Thus, b is now a mod 3 function on at most |J| log n = O(log² n) input bits and thus can be written as a degree-O(log² n) polynomial over F_2. Hence, the parity of the bits outside J is also an F_2-polynomial of the same degree, and we call this polynomial p. Now observe that the bias of the test equals the correlation between the parity of the bits in J and p. Since each bit in J is a mod 3 function on n^β bits, by Smolensky's theorem [33], it has constant correlation with p. In Lemma 2.9 we prove a variant of Impagliazzo's XOR lemma [18] to show that the XOR of log n independent such bits makes the correlation drop from constant to ε = n^{−β/4}. This variant of the XOR lemma may be folklore, but we are not aware of any reference. This handles tests of size at least C log n. For smaller tests, the above distribution could have constant bias, and hence we XOR it with a 1/n^{Ω(1)}-almost C log n-wise independent distribution, which gives us ε bias for tests of size less than C log n and has sufficiently small rank. We then show that the XOR of the two distributions has rank less than √n and conclude as in the previous paragraph. We refer the reader to [40] for background on XOR lemmas.
Organization. In Section 2 we describe our counterexamples and prove Theorems 1.4 and 1.5, and Claim 1.10. In Section 3 we prove our lower and upper bounds on the mod 3 rank of k-wise independence. As a bonus, in Section 4 we include some results on the complexity of decoding. For example, we show that for codes with large minimum distance, AC^0 circuits and read-once branching programs cannot decode when the number of errors is close to half of the minimum distance of the code (Claim 4.2). We obtained these results while attempting to devise low-complexity algorithms that can decode (which, by our connection, would have consequences for the sum of small-bias generators).

Our counterexamples
We are now ready to prove Theorems 1.4 and 1.5, and Claim 1.10. We consider linear codes with different parameters; the bias of D follows from Fact 1.12. Then we present our distinguishers. Finally, we explain how our results hold for k copies instead of 2.

General circuits
Venkatesan Guruswami [32] exhibits the following family of constant-rate binary linear codes whose primal and dual relative minimum distances are both constant.
Theorem 2.1 (Theorem 4 in [32]). For infinitely many n, there exists a binary linear code C with block length n and dimension n/2, which can be constructed, encoded, and decoded from n/60 errors in time poly(n). Moreover, the dual of C has minimum distance at least n/30.

Proof of Theorem 1.4.i. Applying Fact 1.12 with e = n/120 to the code in Theorem 2.1, we obtain a distribution D that is 2^{−n/3600}-biased. Our threshold-2e discriminator f for the code C decodes and re-encodes the input, and accepts if and only if the input and the re-encoded string differ in at most 2e positions. Since both the encoding and decoding algorithms run in polynomial time, so does f. Note that f accepts at most 2^{n/2} · ∑_{i=0}^{2e} (n choose i) ≤ 2^{n/2} · 2^{H(1/60)n} ≤ 2^{0.75n} possible strings, where H(·) is the binary entropy function (cf. [13, Example 11.1.3] for the first inequality). Hence, f distinguishes D + D from the uniform distribution with probability at least 1 − 2^{−0.25n}.
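The decode-and-re-encode discriminator in the proof above can be illustrated with the simplest possible stand-in, the n-bit repetition code (this is our toy example; the actual proof uses the algebraic-geometric code of Theorem 2.1):

```python
def threshold_discriminator(y, two_e):
    """Threshold-2e discriminator for the repetition code {0^n, 1^n}:
    decode by majority vote, re-encode, and accept iff the input differs
    from the re-encoded string in at most 2e positions."""
    n = len(y)
    bit = 1 if sum(y) > n // 2 else 0          # decode = majority vote
    return sum(v != bit for v in y) <= two_e   # compare with re-encoding
```

With 2e much smaller than n/2, this accepts every codeword corrupted in at most 2e positions but only an exponentially small fraction of all strings, mirroring the counting argument above.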

NC² circuits
Proof of Theorem 1.4.ii. Let q be a power of 2. Consider the Reed–Solomon code C over F_q with block length q − 1, dimension q/2 and minimum distance q/2. C has dual minimum distance q/2 + 1 and can be decoded from q/4 errors. Applying Fact 1.12 to C with e = q/12, we obtain a distribution D over n := (q − 1) log q bits that is 2^{−Ω(n/log n)}-biased.
Let α be a primitive element of F_q. Let H be a parity-check matrix for C. We first recall the Peterson–Gorenstein–Zierler syndrome-decoding algorithm [27, 16].
Given a corrupted codeword y, let (s_1, ..., s_{q/2})^T := Hy be the syndrome of y. Suppose y has v < q/2 errors, and let E denote the set of its corrupted positions. Let

Λ_v(x) := ∏_{i∈E} (1 − α^i x) = 1 + λ_1 x + ⋯ + λ_v x^v

be the error locator polynomial. The syndromes and the coefficients of Λ_v are linearly related by

s_{j+v} + λ_1 s_{j+v−1} + ⋯ + λ_v s_j = 0.

This forms a linear system with unknowns λ_i. The algorithm decodes by attempting to solve the corresponding linear systems with v errors, where v ranges from 2e down to 1. Note that the system has a unique solution if and only if y and some codeword differ in exactly v positions, for some v between 1 and 2e. Thus, f computes the determinants of the 2e < q/4 systems and accepts if and only if one of them is nonzero. Since computing the determinant is in NC² [6], f can be computed by an NC² circuit. The system always has a solution when the input is drawn from D + D, and so f always accepts. On the other hand, f accepts at most q^{q/2 + H_q(1/6)(q−1)} ≤ 2^{3n/4} possible strings, where H_q(·) is the q-ary entropy function. Therefore, f distinguishes D + D from the uniform distribution with probability at least 1 − 2^{−n/4}.
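The linear relation between syndromes and locator coefficients can be checked numerically. The sketch below uses a toy Reed–Solomon code over the prime field F_13 instead of a characteristic-2 field, purely to keep the field arithmetic to one line; all names and parameters are our own illustration:

```python
P, ALPHA, N, K = 13, 2, 12, 6      # toy [12,6] Reed-Solomon code over F_13

def encode(msg):
    """Evaluate the message polynomial at ALPHA^0, ..., ALPHA^(N-1)."""
    return [sum(m * pow(ALPHA, i * j, P) for j, m in enumerate(msg)) % P
            for i in range(N)]

def syndromes(y):
    """s_j = sum_i y_i * ALPHA^(i*j) for j = 1..N-K; all zero on codewords."""
    return [sum(yi * pow(ALPHA, i * j, P) for i, yi in enumerate(y)) % P
            for j in range(1, N - K + 1)]

def locator_coeffs(positions):
    """Coefficients (lambda_1, ..., lambda_v) of the error locator
    Lambda(x) = prod_{i in E} (1 - ALPHA^i * x)."""
    poly = [1]
    for i in positions:
        X = pow(ALPHA, i, P)
        new = poly + [0]
        for u in range(1, len(new)):
            new[u] = (new[u] - X * poly[u - 1]) % P
        poly = new
    return poly[1:]
```

For a word with error set E of size v, the identity Λ_v(α^{−i}) = 0 for i ∈ E yields exactly the linear relations between syndromes and locator coefficients described above.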

One-way log-space computation
Proof of Theorem 1.4.iii. Let q be a power of 2. Consider the [q − 1, q − 6c, 6c] Reed–Solomon code C over F_q = F_{2^{log q}}, which has dual minimum distance q − 6c + 1 and can be decoded from 3c errors. Applying Fact 1.12 to C with e = c, we obtain a distribution D over n := (q − 1) log q bits that is O(c log n/n)^c-biased.
Let H be a parity-check matrix of C. On input y ∈ F_q^{q−1}, our distinguisher f computes s_{2e+1}, ..., s_{4e} from the syndrome s := Hy. Clearly this can be implemented in one pass and space (2e + O(1)) log q. Finally, using the Peterson–Gorenstein–Zierler syndrome-decoding algorithm, f accepts if and only if y differs from a codeword of C in at most 2e positions.
Since f accepts at most q^{q−6c} · ∑_{i=0}^{2c} ((q−1) choose i)(q−1)^i = O(q^{q−2c}) possible strings, it distinguishes D + D from the uniform distribution with probability at least 1 − O(q^{1−2c}) = 1 − O(1/n^c).

Computing the input for syndrome decoding requires space (2e + O(1)) log q. We now show that in order to obtain n^{−c} bias via our construction, we always have 2e log q = Ω(c log n). Thus, one cannot answer the "RL question" in the negative via syndrome decoding.
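The one-pass space claim can be illustrated directly: each syndrome is a running sum that needs only its accumulator and the current power of α, so computing a fixed number of syndromes takes O(e log q) bits of state. A toy sketch over a prime field (our simplification; the proof works over F_{2^{log q}}):

```python
def stream_syndromes(stream, num, alpha=2, p=13):
    """One-pass computation of the syndromes s_1..s_num of a received word,
    keeping only 2*num field elements of state (O(num * log p) bits)."""
    acc = [0] * num              # running syndromes s_1..s_num
    cur = [1] * num              # cur[j-1] = alpha^(i*j) at position i
    for y_i in stream:           # a single left-to-right pass over the word
        for j in range(num):
            acc[j] = (acc[j] + y_i * cur[j]) % p
            cur[j] = cur[j] * pow(alpha, j + 1, p) % p
    return acc
```

Note that the input is consumed strictly left to right, matching the one-way model, and that the space bound does not depend on the block length.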
Claim 2.2. For every q ≥ n + 1, let C be an [n, k, d] code over F_q which decodes from e errors, and let d⊥ be its dual minimum distance. If C satisfies (1 − d⊥/n)^e < q^{−c} for sufficiently large c, then we have e log q = Ω(c log n).
Proof. If d⊥ > (1 − 1/q)n, then by the Plotkin bound applied to the dual code, n − k = O(1). By the Singleton bound, e ≤ d ≤ n − k, and so we have e = O(1). Hence, (1 − d⊥/n)^e = (1/q)^e ≥ q^{−c} for sufficiently large c, and therefore the condition is not satisfied.
On the other hand, suppose d⊥ ≤ (1 − 1/q)n. Then (1 − d⊥/n)^e ≥ q^{−e}, so the condition (1 − d⊥/n)^e < q^{−c} implies e log q > c log q ≥ c log n.
Theorem 2.3 ([35, 36]). Let C : {0,1}^n → {0,1} be a circuit of size cn, depth c log n and fan-in 2. The function computed by C can also be computed by an unbounded fan-in circuit of size 2^{c′n/log log n} and depth 3 with inputs x_1, x_2, ..., x_n, x̄_1, x̄_2, ..., x̄_n, where c′ depends only on c.
By the assumption that P has fan-in 2 circuits of linear size and logarithmic depth, and the fact that f in Theorem 1.4.i is in P, we can apply Theorem 2.3 to f and obtain an unbounded fan-in depth-3 circuit of size 2^{O(n/log log n)} that computes the same function. Then we scale down n to n′ = log n · log log log log n bits (we set the rest of the n − n′ bits uniformly at random) to get an n^{−ω(1)}-biased distribution D_{n′} and a circuit f_{n′} of size n^{o(1)} and depth 3 that distinguishes D_{n′} + D_{n′} from uniform with probability at least 1 − n^{−ω(1)}. This proves Theorem 1.5.i.
To prove Theorem 1.5.ii, note that f_{n′} accepts with probability 1 under D_{n′} + D_{n′}, and without loss of generality we can assume f_{n′} is an AND-OR-AND circuit. Hence, it contains a DNF f′ such that (1) f′ accepts under D_{n′} + D_{n′} with probability 1, and (2) f′ rejects with probability at least 1/(2n^{o(1)}) under the uniform distribution.
Proof of Theorem 1.6. Let D and f be the distribution and distinguisher in Theorem 1.4.ii, respectively. Let D_{n′} and f_{n′} be the scaled distribution and distinguisher of D and f on n′ = log^{c+1} n bits, respectively. (We set the rest of the n − n′ bits uniformly at random.) D_{n′} has bias 2^{−Ω(n′/log n′)} = n^{−Ω(log^c n)}. By our assumption, f_{n′} is in AC^0 and distinguishes D_{n′} + D_{n′} from uniform with probability 1 − n^{−log^c n/4}.

Mod 3 linear functions
Recall the definition of mod p rank in Definition 1.8.

Fact 2.4 (Lemma 7.1 and 7.2 in [22]). Let S, T be two sets of vectors in F_3^n. Define S² to be the set {x ×_3 x : x ∈ S}, where x ×_3 y denotes the pointwise product of two vectors x and y (over F_3). Then (1) rank_3(S²) ≤ rank_3(S)², and (2) for S, T ⊆ {0,1}^n, the set S +_2 T := {x ⊕ y : x ∈ S, y ∈ T} satisfies rank_3(S +_2 T) ≤ rank_3(S) + rank_3(T) + rank_3(S) · rank_3(T).

Proof. For (1), let x = ∑_i α_i u_i and y = ∑_j β_j v_j be any two vectors in the span of S, written in a basis u_1, ..., u_r of the support. We have x ×_3 y = ∑_{i,j} α_i β_j (u_i ×_3 v_j). Thus rank_3(S²) ≤ r². For (2), observe that for any a, b ∈ {0,1} ⊆ F_3, we have a ⊕ b = a + b + ab (mod 3), and thus every element of S +_2 T lies in the span of S, T, and the pointwise products of their basis vectors.

The following lemma is well-known (cf. [25]). We include a proof here for completeness.

Lemma 2.5. For every constant β ∈ (0, 1/2) and every sufficiently large n, there exist sets S_1, ..., S_n ⊆ [d] with d = n^{2β} such that |S_i| = t := n^β for every i, and |S_i ∩ S_j| ≤ t′ := O(log n) for all i ≠ j.
We will use the following Chernoff bound in the proof.
Claim 2.6 (Chernoff bound). Let X_1, ..., X_n ∈ {0,1} be n independent and identically distributed random variables with E[X_i] = µ for each i. Then for every 0 < δ < 1 we have

Pr[|∑_{i=1}^n X_i − µn| ≥ δµn] ≤ 2e^{−δ²µn/3}.

Proof sketch of Lemma 2.5. The sets are constructed one at a time. A candidate set S is formed by including each element of the universe independently with some probability p, and by the Chernoff bound |S| is concentrated around its expectation. For each previously chosen set S_j we also have E[|S ∩ S_j|] = pt = 0.1 log n, and again by the Chernoff bound |S ∩ S_j| = O(log n) with high probability. It follows by a union bound that with nonzero probability there is an S which satisfies the two conditions above.
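The bound in Claim 2.6 is easy to sanity-check by simulation; the sketch below (our own illustrative code, with the standard constant 3 in the exponent) compares the bound against the empirical tail frequency:

```python
import math, random

def chernoff_bound(n, mu, delta):
    """The bound 2*exp(-delta^2 * mu * n / 3) of Claim 2.6 on
    Pr[|X_1 + ... + X_n - mu*n| >= delta*mu*n]."""
    return 2 * math.exp(-delta ** 2 * mu * n / 3)

def empirical_tail(n, mu, delta, trials, rng):
    """Empirical frequency of the deviation event over many trials."""
    hit = 0
    for _ in range(trials):
        s = sum(rng.random() < mu for _ in range(n))
        if abs(s - mu * n) >= delta * mu * n:
            hit += 1
    return hit / trials
```

For example, with n = 400, µ = 1/2 and δ = 0.3, the bound is about 2e^{−6} ≈ 0.005, and the deviation event is a 6-standard-deviation event that essentially never occurs in simulation.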
Proof of Theorem 1.4.iv. Let α < 1/36 and β = 4α. Also let d, t, t′ be the parameters and S_1, ..., S_n be the pseudo-design specified in Lemma 2.5. Define the function L : {0,1}^d → {0,1}^n whose i-th output bit y_i equals mod₃²(x_{S_i}), the square over F_3 of the mod 3 sum of the input bits indexed by S_i (note that squares mod 3 lie in {0,1}). Let T_1 be the image set of L. Without the square, this set has mod 3 rank d, and so by Fact 2.4, rank_3(T_1) ≤ d² = O(n^{16α}). Let T_2 be an ε-almost k-wise independent set, where ε = 1/n^α and k = 2 log n. Known constructions [2, Theorem 2] (see also [24]) produce such a set of size O((k log n)/ε)², and therefore rank_3(T_2) is at most O(n^{2α} log⁴ n).
Consider the set T := T_1 +_2 T_2. By Fact 2.4, T has rank at most O(n^{18α} log⁴ n). By the same fact, T +_2 T has rank at most O(n^{36α} log⁸ n) < n, because α < 1/36. Therefore, there is a nonzero mod 3 linear function ℓ such that ℓ(y) ≡ 0 (mod 3) for any y ∈ T +_2 T, while Pr[ℓ(y) = 0] ≤ 1/2 for a uniform y in {0,1}^n. It remains to show that T is O(1/n^{0.99α})-biased. For any test on I ⊆ [n], we consider separately the cases (1) |I| ≤ k and (2) |I| > k.
Write y = y_1 ⊕ y_2, where y_1 ∈ T_1 and y_2 ∈ T_2. Case (1) follows from the fact that T_2 is 1/n^α-almost k-wise independent. Case (2) follows from the following claim.
Claim 2.7. For any test on I with |I| > k, the bias is at most O(1/n^{0.99α}).

Proof. Let J be a set of k indices of I, and let f and p denote the parity of the output bits indexed by J and by I \ J, respectively. Observe that the bias of the test equals E[(−1)^{f⊕p}], which is the correlation between f and p.
Consider the sets S_j ⊆ [d] with j ∈ J. Let B_1 be the set of indices appearing in their pairwise intersections. That is, B_1 := {ℓ ∈ [d] : ℓ ∈ S_i ∩ S_j for some distinct i, j ∈ J}. Fixing the value of every x_ℓ with ℓ ∈ B_1, each mod₃²(x_{S_j}) in f becomes a function on m := n^β − t′·k ≥ 0.9n^β bits.
Let B_2 be the set of indices in [d] outside the S_j for j ∈ J. The bits in B_2 do not affect the outputs in J. Fixing their values, each mod₃²(x_{S_j}) in p is a function of at most t′·k = O(log² n) bits, and so can be written as a polynomial of degree O(log² n) over F_2. Since p is a parity of values mod₃²(x_{S_j}), it can also be written as a polynomial of degree O(log² n) over F_2.
We will use the following theorem by Smolensky [33] (cf. [38, Chapter 1]). The proof in [38] has the condition that n is divisible by 3. This condition can be removed. For example, when n = 3ℓ + 1, we can set a random bit of the uniform distribution to zero. This distribution is close to uniform, but now we can apply [38] as stated.

Theorem 2.8 ([33]). There exists an absolute constant ε > 0 such that for every n that is divisible by 3 and for every polynomial p : {0,1}^n → {0,1} of degree at most ε√n, the correlation between mod₃² and p is at most a constant strictly less than 1.

To build intuition, note that after fixing the input bits in B_1 and B_2, each of the mod₃²(x_{S_j}) in f is a function on m ≥ 0.9n^β bits, so by Theorem 2.8 it has correlation at most a constant less than 1 with any polynomial of degree O(log² n). In the following lemma we prove a variant of Impagliazzo's XOR lemma [18] to show that the correlation of f, the XOR of k such functions, with p drops from constant to O(1/m^{0.249}). Averaging over the values of the fixed bits in B_1 and B_2 finishes the proof.

Lemma 2.9. Let k = 2 log m, and define f : {0,1}^{m×k} → {0,1} by f(x^{(1)}, ..., x^{(k)}) := mod₃²(x^{(1)}) ⊕ ⋯ ⊕ mod₃²(x^{(k)}). Let p : {0,1}^{m×k} → {0,1} be any polynomial of degree O(log² m). We have Cor(f, p) ≤ O(1/m^{0.249}).

Proof. We will use the fact that Theorem 2.8 holds for polynomials of degree n^{Ω(1)} to get correlation 1/n^{Ω(1)} for polynomials of degree polylog(n).
As in the proof in [18], we first show the existence of a measure M : {0,1}^m → [0,1] of size |M| := ∑_x M(x) = 2^m/4 such that with respect to its induced distribution D(x) := M(x)/|M|, the function mod₃² is 1/(2m^{0.249})-hard-core for polynomials of degree O(log² m). It follows that there exists a set of inputs S ⊆ {0,1}^m of size at least 2^m/8 such that mod₃² is 1/m^{0.249}-hard-core on S for any polynomial of degree O(log² m). Now we apply the following lemma, which is stated in [18] for circuits, but the same proof applies to polynomials.

Lemma 2.10 (Lemma 4 in [18]). If g is ε-hard-core on some set of δ2^n inputs for polynomials of degree d, then the function f(x^{(1)}, ..., x^{(k)}) := g(x^{(1)}) ⊕ ⋯ ⊕ g(x^{(k)}) is (ε + (1 − δ)^k)-hard-core for polynomials of the same degree.
Applying this lemma with our choice of k, we have for any polynomial p of degree O(log² m) that Cor(f, p) ≤ 1/m^{0.249} + (1 − 1/8)^{2 log m} = O(1/m^{0.249}). Hence f is O(1/m^{0.249})-hard for any polynomial of degree O(log² m), and the lemma follows.
Proof of Claim 1.10. We replace the pseudo-design in the proof of Theorem 1.4.iv with one that has set size t = O(log⁴ n) and intersection size t′ = O(log n). Using the same idea as in the proof of Lemma 2.5, one can show that such a pseudo-design exists provided the universe is of size d = O(log⁸ n). Now, using the same argument, for tests of size larger than c log n, we apply assumption (1) to f and p, which are the parity of c log n copies of the mod 3 function on m = O(log⁴ n) bits and a polynomial of degree O(log² n), respectively. This gives bias O(1/n^c). Note that the image set T_1 now has mod 3 rank d² = O(log^{16} n).
For tests of size at most c log n, we replace the almost k-wise independent set with the n −c -almost k-wise independent distribution given by (2), which has bias n −c , and we denote the support of the distribution by T 2 .
By Fact 2.4, $T := T_1 +_2 T_2$ has mod 3 rank $O(n^{0.49} \log^{16} n) = o(n^{0.5})$. Hence $T +_2 T$ has rank less than $n$ and the claim follows.
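The rank bound for an XOR-sumset can be checked numerically. The sketch below assumes one plausible reading of Fact 2.4: since coordinatewise $a \oplus b = a + b + ab \pmod 3$, every element of $A +_2 B$ lies in the span of the rows of $A$, the rows of $B$, and the pairwise Hadamard products, giving $\mathrm{rank}_3(A +_2 B) \le r_A + r_B + r_A r_B$ (roughly the quadratic growth used above). The helper `rank_mod3` and the random test sets are our own illustrative choices:

```python
import random

def rank_mod3(vectors):
    # Gaussian elimination over F_3; each row is a sequence of integers.
    rows = [[v % 3 for v in r] for r in vectors]
    rank, ncols = 0, (len(rows[0]) if rows else 0)
    for col in range(ncols):
        piv = next((i for i in range(rank, len(rows)) if rows[i][col]), None)
        if piv is None:
            continue
        rows[rank], rows[piv] = rows[piv], rows[rank]
        inv = 1 if rows[rank][col] == 1 else 2  # inverses mod 3: 1->1, 2->2
        rows[rank] = [(inv * v) % 3 for v in rows[rank]]
        for i in range(len(rows)):
            if i != rank and rows[i][col]:
                c = rows[i][col]
                rows[i] = [(a - c * b) % 3 for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank

random.seed(1)
n = 12
A = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(5)]
B = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(5)]
S = {tuple(a ^ b for a, b in zip(x, y)) for x in A for y in B}  # A +_2 B

rA, rB, rS = rank_mod3(A), rank_mod3(B), rank_mod3(list(S))
# a XOR b = a + b + a*b (mod 3) coordinatewise, hence the bound below.
assert rS <= rA + rB + rA * rB
```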

Sum of k copies of small-bias distributions
We now show that the results hold for $k$ copies when $\varepsilon$ is replaced by $\varepsilon^{2/k}$, proving the "Moreover" part of Theorems 1.4, 1.5 and 1.6.
Proof of the "Moreover" part of Theorems 1.4, 1.5 and 1.6. To prove Theorems 1.4.i, 1.4.ii and 1.4.iii, we replace $e$ by $2e/k$ in their proofs to obtain distributions $D$ that are $\varepsilon^{2/k}$-biased. Since we have to throw in at least one error, we need $2e/k \ge 1$. The rest follows by noting that the sum of $k$ copies of $D$ is identically distributed to $D + D$.
By scaling down the above small-bias distributions $D$ for Theorems 1.4.i and 1.4.ii to $n$ bits as in the proofs of Theorems 1.5 and 1.6, respectively, we obtain $\varepsilon^{2/k}$-biased distributions $D_n$ such that the sum of $k$ copies of $D_n$ is identically distributed to the $D_n + D_n$ in Theorems 1.6 and 1.5. Moreover, $k$ scales from $k(n)$ to $k(n')$.
For Theorem 1.4.iv, let $\alpha := \log(1/\varepsilon)/\log n$, so that $\varepsilon^{2/k} = n^{-2\alpha/k}$. We set $\beta = 8\alpha/k$ instead of $4\alpha$ in the construction of $T_1$ and replace $T_2$ by an $n^{-2\alpha/k}$-almost $2\log n$-wise independent set in the proof; call the resulting sets $T'_1$ and $T'_2$, respectively. We now have $\mathrm{rank}_3(T'_1) = O(n^{32\alpha/k})$ and $\mathrm{rank}_3(T'_2) = O(n^{4\alpha/k} \log^4 n)$. Thus the set $T' := T'_1 +_2 T'_2$ has rank at most $O(n^{36\alpha/k} \log^4 n)$, and therefore the sum of $k$ copies has rank at most $\mathrm{rank}_3(T')^k = O(n^{36\alpha} \log^{4k} n) < n$ for $k = O(\log n/\log\log n)$. The bias of $T'$ follows from the facts that $T'_2$ has bias $n^{-2\alpha/k}$ against tests of size at most $2\log n$, and $T'_1$ has bias $O(n^{-2\alpha/k})$ against tests of size greater than $2\log n$.
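The arithmetic behind the final rank bound can be spelled out; the constant $c$ below is a generic hidden constant, not one fixed by the paper:

```latex
\mathrm{rank}_3(T')^k
  \;\le\; \big(c\, n^{36\alpha/k}\log^4 n\big)^k
  \;=\; c^k\, n^{36\alpha}\log^{4k} n \;<\; n ,
```

which holds provided $36\alpha < 1$ and $k\log\big(c\log^4 n\big) < (1-36\alpha)\log n$, i.e., for $k = O(\log n/\log\log n)$ as claimed.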

Mod rank of k-wise independence
In this section we begin a systematic investigation of the mod 3 rank of k-wise independent distributions.
Recall Definition 1.8 of mod p rank. We also define the mod p rank of a matrix over the integers to be its rank over $\mathbb{F}_p$, and we write $\mathrm{rank}_p$ for mod p rank.
We will sometimes work with vectors over $\{-1,1\}$ instead of $\{0,1\}$. Note that the map $x \mapsto (1-x)/2$ converts the values 1 and $-1$ to 0 and 1, respectively, so the mod 3 rank of a set changes by at most 1 when we switch vector values from $\{-1,1\}$ to $\{0,1\}$, and vice versa.
While we state our results for mod 3, all the results in this section extend naturally to mod p for any odd prime p.

Lower bound for almost k-wise independence
In the following claim we give a rank lower bound for almost k-wise independent distributions. Here "almost" is measured in statistical distance. (Another possible definition uses the maximum bias of any parity.)

Claim 3.1. Let D be any subset of $\{0,1\}^n$. If $\mathrm{rank}_3(D) = t$, then D is not $1/10$-almost $ct/\log(n/t)$-wise independent, for a universal constant c.
This gives an exponential separation between seed length and rank for almost k-wise independence.Indeed, for k = O(1), the seed length is Θ(log log n), whereas the rank must be Ω(log n).
THEORY OF COMPUTING, Volume 13 (16), 2017, pp. 1-23

Proof. Let $C$ be the span of $D$ over $\mathbb{F}_3$ and let $C^\perp$ be its orthogonal complement; $C^\perp$ has dimension $n - t$. We view $C^\perp$ as a linear code over $\mathbb{F}_3$ and let $d^\perp$ be its minimum distance. Since $C^\perp$ is linear, $d^\perp$ equals the minimum Hamming weight of its nonzero elements. Moreover, by the Singleton bound, $d^\perp - 1 \le t$. By the Hamming bound, $3^{n-t} \cdot \sum_{i=0}^{e} \binom{n}{i} 2^i \le 3^n$, where $e := \lfloor (d^\perp - 1)/2 \rfloor$; that is, $3^t \ge \binom{n}{e} \ge (n/e)^e \ge (n/t)^e$, where we use the fact that $e \le d^\perp - 1 \le t$ in the last inequality. Hence $d^\perp \le O(t/\log(n/t))$. Now let $y$ be a codeword in $C^\perp$ with Hamming weight $d^\perp$, and let $I := \{i \mid y_i \ne 0\}$. Note that for every $x \in D$ we have $\langle y, x \rangle \equiv 0 \pmod 3$, since $y \in C^\perp$. On the other hand, for a uniformly distributed $x$ in $\{0,1\}^I$ we have $\langle y, x \rangle \equiv 0 \pmod 3$ with probability at most $1/2$. Therefore $D$ is bounded away from uniform by a constant on the $d^\perp$ bits indexed by $I$, and thus cannot be close to $d^\perp$-wise independent.
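The mechanics of this proof can be exercised on a toy instance: take a small set $D$ of low mod 3 rank, brute-force a minimum-weight dual codeword $y$, and observe the distance gap on the support of $y$. The set $D$ below is an arbitrary illustrative choice, not one from the paper:

```python
from itertools import product

# A set D in {0,1}^n spanning at most a t-dimensional subspace over F_3.
n, t = 8, 3
D = [(0, 0, 0, 0, 0, 0, 0, 0),
     (1, 1, 0, 0, 1, 0, 1, 0),
     (0, 1, 1, 0, 0, 1, 0, 1),
     (1, 0, 1, 1, 0, 0, 0, 1)]

# Brute-force the dual code C^perp = {y : <y,x> = 0 (mod 3) for all x in D}
# and find a minimum-weight nonzero codeword (feasible at this toy size).
best = None
for y in product(range(3), repeat=n):
    if any(y) and all(sum(a * b for a, b in zip(y, x)) % 3 == 0 for x in D):
        w = sum(1 for v in y if v)
        if best is None or w < best[0]:
            best = (w, y)

d_perp, y = best
I = [i for i in range(n) if y[i]]
assert d_perp <= t + 1  # Singleton bound: d_perp - 1 <= t

# Every x in D satisfies <y,x> = 0 (mod 3) by construction, whereas a
# uniform x on the bits in I satisfies it with probability at most 1/2.
hits = sum(1 for x in product((0, 1), repeat=len(I))
           if sum(y[I[j]] * x[j] for j in range(len(I))) % 3 == 0)
assert hits / 2 ** len(I) <= 0.5
```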

Pairwise independence
We now show that the mod 3 rank of a pairwise independent set can be as small as n 0.73 .Then we give evidence that our approach cannot do any better.
Definition 3.2.We say H is an Hadamard matrix of order n if its entries are ±1 and it satisfies HH T = nI n , where I n is the n × n identity matrix.
It is well known that if we normalize an Hadamard matrix H so that its first row is the all-ones row (which can always be achieved by multiplying each column by its first entry) and then remove that row, the uniform distribution over the columns of the truncated matrix is pairwise independent.
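This construction is easy to verify exhaustively for a small Hadamard matrix; the sketch below uses the Sylvester construction (whose first row is already all ones, so no column normalization is needed) and checks pairwise independence directly:

```python
from collections import Counter
from itertools import combinations

def sylvester(k):
    # Sylvester construction of a 2^k x 2^k Hadamard matrix with +-1 entries.
    H = [[1]]
    for _ in range(k):
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

n = 8
H = sylvester(3)
rows = H[1:]  # drop the all-ones row

# Picking a uniformly random column, each remaining row is a +-1 variable.
for r in rows:
    assert sum(r) == 0  # each variable is unbiased

# Pairwise independence: every pair of rows realizes each of the four sign
# patterns equally often (a consequence of row orthogonality).
for r1, r2 in combinations(rows, 2):
    counts = Counter(zip(r1, r2))
    assert len(counts) == 4 and all(c == n // 4 for c in counts.values())
```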
Henceforth we will work with vectors whose entries are from $\{-1,1\} = \{2,1\} \subseteq \mathbb{F}_3$. The following two claims show that certain Hadamard matrices cannot have mod 3 rank smaller than $n/2$. They are taken from [41]; here we give a self-contained proof for completeness. First, we lower bound the mod p rank of any square matrix over the integers in terms of its determinant.

We also note the following negative result for decoding by low-degree polynomials: if $16\big((1 - d^\perp/n)^e\big)^{1/2^{t-1}} < 2^{-t}$ for some constant $t$ and $e \le \frac{d-1}{2}$, then no degree-$t$ polynomial over $\mathbb{F}_2$ can be a threshold-$te$ discriminator for $C$.
Proof. Suppose for contradiction that some polynomial $P$ is a threshold-$te$ discriminator for $C$. By Fact 1.12 and the Schwartz-Zippel Lemma, there exists an $\varepsilon$-biased distribution $D$ with $\varepsilon := (1 - d^\perp/n)^e$ such that $P$ distinguishes the sum of $t$ independent copies of $D$ from uniform with probability at least $2^{-t}$. But by [39], the sum of $t$ copies of $D$ fools $P$ with error $16\varepsilon^{1/2^{t-1}}$, a contradiction.
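The contradiction in this proof reduces to a comparison of two quantities, which can be written out explicitly (a sketch of the arithmetic, using the proof's parameters):

```latex
16\,\varepsilon^{1/2^{t-1}} \;<\; 2^{-t},
\qquad \varepsilon = \big(1 - d^\perp/n\big)^{e},
```

equivalently $\varepsilon < \big(2^{-t}/16\big)^{2^{t-1}} = 2^{-(t+4)2^{t-1}}$. Since $(1 - d^\perp/n)^e \le e^{-e\,d^\perp/n}$, for constant $t$ this holds once $e = \Omega(n/d^\perp)$.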

Fooling read-once DNF formulas
In this section we state and prove that an $m^{-O(\log(1/\delta))}$ bound on the bias suffices to $\delta$-fool any read-once DNF formula with $m$ terms. This follows directly from Lemma 5.2 in [12]. The rest follows from the fact that $D$ fools each $\bigwedge_{i \in S} C_i$ with error $\varepsilon$, because it is an AND of AND terms.
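The fact used above — that a distribution of parity bias $\varepsilon$ fools every AND of terms of a read-once DNF with error at most $\varepsilon$ — holds for any distribution, because an AND of literals has Fourier $L_1$-norm at most 1. A brute-force sanity check on an illustrative DNF and sample set (both our own choices, not the paper's construction):

```python
import random
from itertools import product, combinations

random.seed(7)
n = 8
# A read-once DNF: the terms use disjoint variables.
terms = [((0, 1), (1, 1)),             # x0 AND x1
         ((2, 0), (3, 1), (4, 1)),     # (NOT x2) AND x3 AND x4
         ((5, 1), (6, 0))]             # x5 AND (NOT x6)

def and_of(term_subset, x):
    # The conjunction of all literals appearing in the chosen terms.
    return all(x[i] == b for t in term_subset for (i, b) in t)

D = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(40)]
U = list(product((0, 1), repeat=n))

# eps = max bias of any nonempty parity under D.
eps = max(abs(sum((-1) ** sum(x[i] for i in range(n) if (m >> i) & 1)
                  for x in D)) / len(D)
          for m in range(1, 2 ** n))

# D fools every AND of terms with error at most eps (Fourier L1 <= 1).
for k in range(1, len(terms) + 1):
    for sub in combinations(terms, k):
        pD = sum(and_of(sub, x) for x in D) / len(D)
        pU = sum(and_of(sub, x) for x in U) / len(U)
        assert abs(pD - pU) <= eps + 1e-9
```

Note that a random 40-element sample has fairly large $\varepsilon$; the point of the check is the inequality itself, which is what inclusion-exclusion over the terms then amplifies into the DNF bound.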
feedback on a preliminary version of this paper, in particular pointing out several inaccuracies in our definitions and in our empirical results.We wish to thank the referees for the detailed and useful feedback, and in particular for pointing out several inaccuracies.