Average-Case Lower Bounds and Satisfiability Algorithms for Small Threshold Circuits

$ \newcommand{\cclass}[1]{{\normalfont\textsf{#1}}} $We show average-case lower bounds for explicit Boolean functions against bounded-depth threshold circuits with a superlinear number of wires. We show that for each integer $d>1$, there is a constant $\varepsilon_d>0$ such that the Parity function on $n$ bits has correlation at most $n^{-\varepsilon_d}$ with depth-$d$ threshold circuits which have at most $n^{1+\varepsilon_d}$ wires, and the Generalized Andreev function on $n$ bits has correlation at most $\exp(-{n^{\varepsilon_d}})$ with depth-$d$ threshold circuits which have at most $n^{1+\varepsilon_d}$ wires. Previously, only worst-case lower bounds in this setting were known (Impagliazzo, Paturi, and Saks (SICOMP 1997)). We use our ideas to make progress on several related questions. We give satisfiability algorithms beating brute force search for depth-$d$ threshold circuits with a superlinear number of wires. These are the first such algorithms for depth greater than 2. We also show that Parity on $n$ bits cannot be computed by polynomial-size $\textsf{AC}^0$ circuits with $n^{o(1)}$ general threshold gates. Previously no lower bound for Parity in this setting could handle more than $\log(n)$ gates. This result also implies subexponential-time learning algorithms for $\textsf{AC}^0$ with $n^{o(1)}$ threshold gates under the uniform distribution. In addition, we give almost optimal bounds for the number of gates in a depth-$d$ threshold circuit computing Parity on average, and show average-case lower bounds for threshold formulas of any depth. Our techniques include adaptive random restrictions, anti-concentration and the structural theory of linear threshold functions, and bounded-read Chernoff bounds.

Third, the design of non-trivial satisfiability algorithms is closely tied to proving average-case lower bounds, though there is no formal connection. Fourth, the seminal paper by Linial, Mansour, and Nisan [30] shows that average-case lower bounds for Parity against a circuit class are tied to non-trivially learning the circuit class under the uniform distribution.
With these different motivations in mind, we systematically study average-case lower bounds for bounded-depth threshold circuits. Our first main result shows correlation upper bounds for Parity and another explicit function known as the Generalized Andreev function with respect to threshold circuits with few wires. No correlation upper bounds for explicit functions against bounded-depth threshold circuits with superlinear wires were known before our work.

Theorem 1.1. For each depth $d \ge 1$, there is a constant $\varepsilon_d > 0$ such that for all large enough $n$, no threshold circuit of depth $d$ with at most $n^{1+\varepsilon_d}$ wires agrees with Parity on more than a $1/2 + 1/n^{\varepsilon_d}$ fraction of inputs of length $n$, or with the Generalized Andreev function on more than a $1/2 + 1/2^{n^{\varepsilon_d}}$ fraction of inputs of length $n$.

Quite often, the ideas behind lower bound results for a circuit class C yield satisfiability algorithms (running in better than brute-force time) for C, and vice versa (see, e. g., [40]). Here, we leverage the ideas behind our correlation bounds against threshold circuits to obtain improved satisfiability algorithms for these circuits. More precisely, we constructivize the proof of the strong correlation upper bounds for the Generalized Andreev function to get non-trivial satisfiability algorithms for bounded-depth threshold circuits with few wires. Previously, such algorithms were only known for depth-2 circuits, due to Impagliazzo, Paturi, and Schneider [23] and Tamaki (unpublished).

Theorem 1.2. For each depth $d \ge 1$, there is a constant $\varepsilon_d > 0$ such that the satisfiability of depth-$d$ threshold circuits with at most $n^{1+\varepsilon_d}$ wires can be solved in randomized time $2^{n-n^{\varepsilon_d}}\mathrm{poly}(n)$.

Theorem 1.2 is re-stated and proved as Theorem 6.4 in Section 6. Using our ideas, we also show correlation bounds against $\mathsf{AC}^0$ circuits with a few threshold gates, as well as learning algorithms under the uniform distribution for such circuits.

Theorem 1.3. For each constant $d$, there is a constant $\gamma > 0$ such that Parity has correlation at most $1/n^{\Omega(1)}$ with $\mathsf{AC}^0$ circuits of depth $d$ and size at most $n^{(\log n)^{0.4}}$ augmented with at most $n^{\gamma}$ threshold gates. Moreover, the class of $\mathsf{AC}^0$ circuits of size at most $n^{(\log n)^{0.4}}$ augmented with at most $n^{\gamma}$ threshold gates can be learned to constant error under the uniform distribution in time $2^{n^{1/4+o(1)}}$.

Theorem 1.3 captures the content of Corollary 8.4 and Theorem 8.2 in Section 8. Having summarized our main results, we now describe related work and our proof techniques in more detail.

Related work
There has been a large body of work proving upper and lower bounds for constant-depth threshold circuits. Much of this work has focused on the setting of small gate complexity, which seems to be the somewhat easier case to handle. A distinction must also be drawn between work that focuses on the setting where the threshold gates are assumed to be majority gates (i. e., the linear function sign-representing the gate has integer coefficients that are bounded by a polynomial in the number of variables) and work that focuses on general threshold gates, since analytic tools such as rational approximation that are available for majority gates do not work in the setting of general threshold gates.
We discuss the work on wire complexity first, followed by the results on gate complexity.
Wire complexity. Paturi and Saks [41] considered depth-2 majority circuits and showed an $\Omega(n^2)$ lower bound on the wire complexity required to compute Parity; this nearly matches the upper bound of $O(n^2)$. They also showed that there exist majority circuits of size $n^{1+\Theta(\varepsilon_1^d)}$ and depth $d$ computing Parity; here $\varepsilon_1 = 2/(1+\sqrt{5})$. Impagliazzo, Paturi, and Saks [24] showed a depth-$d$ lower bound for general threshold circuits computing Parity: namely, that any such circuit must have wire complexity at least $n^{1+\varepsilon_2^d}$, where $\varepsilon_2 < \varepsilon_1$. The proof of [24] proceeds by induction on the depth $d$. The main technical lemma shows that a circuit of depth $d$ can be converted to a depth-$(d-1)$ circuit of the same size by setting some of the input variables. The fixed variables are set in a random fashion, but not according to the uniform distribution. In fact, this distribution has statistical distance close to 1 from the uniform distribution and, furthermore, depends on the circuit whose depth is being reduced. Therefore, it is unclear how to use this technique to prove a correlation bound with respect to the uniform distribution. In contrast, we are able to reduce the depth of the circuit by setting variables uniformly at random (though the variables that we restrict are sometimes chosen in a way that depends on the circuit), which yields the correlation bounds we want.
Gate complexity. The aforementioned paper by Paturi and Saks [41] also proved a near-optimal $\Omega(n)$ lower bound on the number of gates in any depth-2 majority circuit computing Parity.
Siu, Roychowdhury, and Kailath [52] considered majority circuits of bounded depth and small gate complexity. They showed that Parity can be computed by depth-$d$ majority circuits with $O(dn^{1/(d-1)})$ gates. Building on the ideas of [41], they also proved a near-matching lower bound of $\Omega(dn^{1/(d-1)})$. Further, they also considered the problem of correlation bounds and showed that there exist depth-$d$ majority circuits with $O(dn^{1/(2(d-1))})$ gates that compute Parity almost everywhere, and that majority circuits of significantly smaller size have $o(1)$ correlation with Parity (i. e., these circuits cannot compute Parity on more than a $1/2 + o(1)$ fraction of inputs; recall that $1/2$ is trivial since a constant function computes Parity correctly on $1/2$ of its inputs). Impagliazzo, Paturi, and Saks [24] extended the worst-case lower bound to general threshold gates, where they proved a slightly weaker lower bound of $\Omega(n^{1/(2(d-1))})$. As discussed above, though, it is unclear how to use their technique to prove a correlation bound.
Beigel [6] extended the result of Siu et al. to the setting of $\mathsf{AC}^0$ augmented with a few majority gates. He showed that any subexponential-size depth-$d$ $\mathsf{AC}^0$ circuit with significantly fewer than some $k = n^{\Theta(1/d)}$ majority gates has correlation $o(1)$ with Parity. The techniques of all the above results, with the exception of [24], were based on the fact that majority gates can be well approximated by low-degree rational functions. However, this is not true for general threshold functions [51] and hence these techniques do not carry over to the case of general threshold gates.
A lower bound technique that does carry over to the setting of general threshold gates is that of showing that the circuit class has low-degree polynomial sign-representations. Aspnes, Beigel, Furst, and Rudich [3] used this idea to prove that $\mathsf{AC}^0$ circuits augmented with a single general threshold output gate (we refer to these circuits as $\mathsf{TAC}^0$ circuits, as in [17]) of subexponential size and constant depth have correlation $o(1)$ with Parity.

Proof techniques
In recent years, there has been much work exploring the analytic properties (such as noise sensitivity) of linear threshold functions (LTFs) and their generalizations, polynomial threshold functions (PTFs) (e. g., [50,38,11,20,12,32,26]). We show here that these techniques can be used in the context of constant-depth threshold circuits as well. In particular, using these techniques, we prove a qualitative refinement of Peres' theorem (see below) that may be of independent interest.
Our first result (Theorem 3.2 in Section 3) is a tight correlation bound for Parity with threshold circuits of depth $d$ and gate complexity much smaller than $n^{1/(2(d-1))}$. This generalizes both the results of Siu et al. [52], who proved such a result for majority circuits, and Impagliazzo, Paturi, and Saks [24], who proved a worst-case lower bound of the same order. The proof uses a fundamental theorem of Peres [42] on the noise sensitivity of LTFs; Peres' theorem has also been used by Klivans, O'Donnell, and Servedio [28] to obtain learning algorithms for functions of a few threshold gates. We use Peres' theorem to prove a noise sensitivity upper bound on small threshold circuits of constant depth.

The observation underlying the proof is that the noise sensitivity of a function is exactly the expected variance of the function after applying a suitable random restriction (see also [36]). Seen in this light, Peres' theorem says that, on application of a random restriction, any threshold function becomes quite biased in expectation and hence is well approximated by a constant function. Our analysis of the threshold circuit therefore proceeds by applying a random restriction to the circuit and replacing all the threshold gates at depth $d-1$ by the constants that best approximate them, to obtain a circuit of depth $d-1$. A straightforward union bound tells us that the new circuit is a good approximation of the original circuit after the restriction. We repeat this procedure with the depth-$(d-1)$ circuit until the entire circuit becomes a constant, at which point we can say that after a suitable random restriction, the original circuit is well approximated by a constant, which means its variance is small. Hence, the noise sensitivity of the original circuit must be small as well, and we are done.
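Peres' theorem is easy to check empirically on a small example. The sketch below is our own illustration (not code from the paper); it computes $\mathrm{NS}_p$ of a 5-bit majority gate exactly and checks it against a $2\sqrt{p}$ envelope, where the constant $2$ is simply a slack value that works for this small example, not the constant in Peres' theorem.

```python
import itertools

def maj(x):  # a threshold gate: sign of the sum of +/-1 inputs
    return 1 if sum(x) > 0 else -1

def noise_sensitivity(f, n, p):
    """Exact NS_p(f) = Pr[f(x) != f(y)], where x is uniform on {-1,1}^n
    and y flips each bit of x independently with probability p."""
    total = 0.0
    for x in itertools.product((-1, 1), repeat=n):
        for flips in itertools.product((0, 1), repeat=n):
            w = 1.0
            for fl in flips:
                w *= p if fl else 1 - p
            y = tuple(-v if fl else v for v, fl in zip(x, flips))
            if f(x) != f(y):
                total += w / 2 ** n
    return total

# Peres' theorem says NS_p of any LTF is O(sqrt(p)); check a few values of p.
for p in (0.01, 0.04, 0.16):
    assert noise_sensitivity(maj, 5, p) <= 2 * p ** 0.5
```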
This technique is expanded upon in Section 8, where we use a powerful noise sensitivity upper bound for low-degree PTFs due to Kane [26], along with standard switching arguments [21], to prove similar results for $\mathsf{AC}^0$ circuits augmented with almost $n^{1/(2(d-1))}$ threshold gates. This yields Theorem 1.3 via some standard results.
In Section 4, we consider the problem of extending the above correlation bounds to threshold circuits with small (slightly superlinear) wire complexity. The above proof breaks down even for depth-2 threshold circuits with a superlinear number of wires, since such circuits could have a superlinear number of gates and hence the union bound referred to above is no longer feasible.
In the case of depth-2 threshold circuits, we are nevertheless able to use Peres' theorem, along with ideas of [3], to prove correlation bounds for Parity against circuits with nearly $n^{1.5}$ wires. (This result was independently obtained by Kane and Williams in a recent paper [27].) This result is tight, since by a result of Siu et al. [52], Parity can be well approximated by depth-2 circuits with $O(\sqrt{n})$ gates and hence $O(n^{1.5})$ wires. This argument is in Section 4.1.
Unfortunately, however, this technique requires us to set a large number of variables, which renders it unsuitable for larger depths. The reason for this is that, if we set a large number of variables to reduce the depth from some large constant d to d − 1, then we may be in a setting where the number of wires is much larger than the number of surviving variables and hence correlation bounds for Parity may no longer be possible at all.
We therefore use a different strategy to prove correlation bounds against larger constant depths. The linchpin of the argument is a qualitative refinement of Peres' theorem (Lemma 4.4), which says that on application of a random restriction to an LTF, with good probability the variance of the LTF becomes negligible (even exponentially small, for suitable parameters). The proof of this refinement is via anti-concentration results based on the Berry-Esseen theorem, together with the analysis of general threshold functions via a critical-index argument, as in many recent papers [50,38,11,32].
The above refinement of Peres' theorem allows us to proceed with our argument as in the small-gate-complexity case. We apply a random restriction to the circuit and, by the refinement, with good probability (say $1 - n^{-\Omega(1)}$) most gates end up exponentially close to constants. We can then set these "imbalanced" gates to constants and still apply a union bound to ensure that the new circuit is a good approximation to the old one. For the small number of gates that do not become imbalanced in this way, we set all variables feeding into them. Since the number of such gates is small, we do not set too many variables. We now have a depth-$(d-1)$ circuit. Continuing in this way, we get a correlation bound of $n^{-\Omega(1)}$ with Parity. This gives part of Theorem 1.1.

We then strengthen this correlation bound to $\exp(-n^{\Omega(1)})$ for the Generalized Andreev function, which, intuitively speaking, has the following property: even after applying any restriction that leaves a certain number of variables unfixed, the function has exponentially small correlation with any LTF on the surviving variables. To prove lower bounds for larger-depth threshold circuits, we follow more or less the same strategy, except that in the above argument we need most gates to become imbalanced with very high probability ($1 - \exp(-n^{\Omega(1)})$). To ensure this, we use a bounded-read Chernoff bound due to Gavinsky, Lovett, Saks, and Srinivasan [16]. We can use this technique to reduce depth as above as long as the number of threshold gates at depth $d-1$ is "reasonably large." If the number of gates at depth $d-1$ is very small, then we simply guess the values of these few threshold gates, move them to the top of the circuit, and proceed. This gives the other part of Theorem 1.1.
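The imbalance phenomenon described above can be observed directly. The following exact computation (our own illustration, with a majority gate standing in for a generic LTF) shows that the expected bias $|\mathbf{E}[f|_\rho]|$ of a restricted gate grows as the restriction fixes more variables:

```python
import itertools

def maj(x):  # a stand-in LTF: majority of +/-1 inputs
    return 1 if sum(x) > 0 else -1

def expected_abs_bias(f, n, p):
    """E_rho[|E[f|rho]|], where each variable stays free with probability p
    and is otherwise fixed to a uniform +/-1 value (exact enumeration)."""
    total = 0.0
    for pattern in itertools.product((0, 1), repeat=n):  # 1 = variable stays free
        free = [i for i, b in enumerate(pattern) if b]
        fixed = [i for i, b in enumerate(pattern) if not b]
        w = p ** len(free) * (1 - p) ** len(fixed)
        for z in itertools.product((-1, 1), repeat=len(fixed)):
            full = [0] * n
            for i, v in zip(fixed, z):
                full[i] = v
            s = 0
            for xv in itertools.product((-1, 1), repeat=len(free)):
                for i, v in zip(free, xv):
                    full[i] = v
                s += f(tuple(full))
            total += w * (abs(s) / 2 ** len(free)) / 2 ** len(fixed)
    return total

# Fixing more variables (smaller p) drives the restricted gate toward a constant.
assert expected_abs_bias(maj, 7, 0.1) > expected_abs_bias(maj, 7, 0.9)
```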
This latter depth-reduction result can be completely constructivized to design a satisfiability algorithm that runs in time $2^{n-n^{\Omega(1)}}$. The algorithm proceeds in the same way as the above argument, iteratively reducing the depth of the circuit. A subtlety arises when we replace imbalanced gates by constants, since we are changing the behaviour of the circuit on some (though very few) inputs. Thus, a circuit which was satisfiable only at one of these inputs might now end up unsatisfiable. However, we show that there is an efficient algorithm that enumerates these inputs and can hence check if there are satisfying assignments to the circuit among these inputs. This gives Theorem 1.2.
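For intuition about the base case of such algorithms: satisfiability of a single threshold gate is easy, since $\langle w, x\rangle$ is maximized over $\{-1,1\}^n$ by taking $x_i = \mathrm{sgn}(w_i)$. The snippet below is our own illustration of this simple fact, not the algorithm of Theorem 1.2.

```python
import itertools, random

def ltf_satisfiable(w, theta):
    """Decide whether some x in {-1,1}^n has <w, x> >= theta.
    The maximum of <w, x> is sum_i |w_i|, attained at x_i = sign(w_i)."""
    return sum(abs(wi) for wi in w) >= theta

def brute_force(w, theta):
    return any(sum(wi * xi for wi, xi in zip(w, x)) >= theta
               for x in itertools.product((-1, 1), repeat=len(w)))

random.seed(0)
for _ in range(200):
    w = [random.randint(-5, 5) for _ in range(8)]
    theta = random.randint(-25, 25)
    assert ltf_satisfiable(w, theta) == brute_force(w, theta)
```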
In Section 7, we prove correlation bounds for the Generalized Andreev function against threshold formulas of any arity and any depth. The proof is based on a retooling of the argument of Nečiporuk for formulas of constant arity over any basis and yields a correlation bound as long as the wire complexity is at most $n^{2-\Omega(1)}$.

Independent work of Kane and Williams [27] and connections to our results

Independent of our work, Kane and Williams [27] proved some very interesting results on threshold circuit lower bounds. Their focus is on threshold circuits of depth 2 and (a special case of) depth 3 only, but in this regime they are able to show stronger superlinear gate lower bounds and superquadratic wire lower bounds on the complexity of an explicit function (which is closely related to the Generalized Andreev function referred to above).
The techniques of [27] are closely related to ours, since they also analyze the effect of random restrictions on threshold gates. The statement of the random restriction lemma in our paper is different from that in [27]. While we are only able to prove that a threshold gate becomes highly biased with high probability under a random restriction, [27] prove that with high probability, the threshold gate actually becomes a constant.
However, to obtain this stronger conclusion, the random restrictions from [27] have to set most variables, which makes it unsuitable for proving lower bounds for larger depths as mentioned above. This technique can recover our tight average-case gate lower bounds (Section 3) and wire lower bounds (Section 4.1) for depth 2.
While our random restriction lemmas are weaker (in that they don't make the threshold gate a constant), it makes sense to ask if we can use them to recover the threshold circuit lower bounds of [27].
In Section 9, we show that this is true by using a natural strengthening of our main restriction lemma to prove a lower bound for threshold circuits that matches the lower bound of [27]. This result should not be considered an "alternate proof," since we use many other ideas from [27] in the proof (we have also used one of these ideas, namely Theorem 2.9, to improve our lower bounds for threshold formulas from [8] in Section 7). Our interest is in understanding the relative power of the random restriction lemmas in the two results.

Preliminaries
Throughout, all logarithms will be taken to the base 2.

Basic Boolean function definitions
A Boolean function on $n$ variables will be a function $f : \{-1,1\}^n \to \{-1,1\}$. We use the standard inner product on functions $f, g : \{-1,1\}^n \to \mathbb{R}$, namely $\langle f, g\rangle = \mathbf{E}_{x\sim\{-1,1\}^n}[f(x)\cdot g(x)]$. (Unless specifically mentioned otherwise, we use $a \sim A$ for a finite set $A$ to denote that $a$ is a uniform random sample from the set $A$.) Given Boolean functions $f, g$ on $n$ variables, the correlation between $f$ and $g$, denoted $\mathrm{Corr}(f,g)$, is defined as $\mathrm{Corr}(f,g) = |\langle f, g\rangle|$. Also, we use $\delta(f,g)$ to denote the fractional distance between $f$ and $g$: i. e., $\delta(f,g) = \Pr_{x\sim\{-1,1\}^n}[f(x) \neq g(x)]$. Note that for Boolean $f, g$, we have $\mathrm{Corr}(f,g) = |1 - 2\delta(f,g)|$.
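As a quick illustration of these definitions (our own sketch, not from the paper), the identity $\mathrm{Corr}(f,g) = |1 - 2\delta(f,g)|$ can be checked by direct enumeration:

```python
import itertools

def corr(f, g, n):
    """Correlation |<f,g>| = |E[f(x) g(x)]| over uniform x in {-1,1}^n."""
    return abs(sum(f(x) * g(x)
                   for x in itertools.product((-1, 1), repeat=n))) / 2 ** n

def dist(f, g, n):
    """Fractional distance delta(f,g) = Pr[f(x) != g(x)]."""
    return sum(f(x) != g(x)
               for x in itertools.product((-1, 1), repeat=n)) / 2 ** n

par = lambda x: 1 if x.count(-1) % 2 == 0 else -1  # Par_5: product of the bits
maj = lambda x: 1 if sum(x) > 0 else -1            # majority on 5 bits

assert abs(corr(par, maj, 5) - abs(1 - 2 * dist(par, maj, 5))) < 1e-12
```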
We say that $f$ is $\delta$-approximated by $g$ if $\delta(f,g) \le \delta$. We use $\mathrm{Par}_n$ to denote the Parity function on $n$ variables, i. e., $\mathrm{Par}_n(x_1,\ldots,x_n) = \prod_{i=1}^n x_i$.

Definition 2.1 (Restrictions). A restriction on $n$ variables is a function $\rho : [n] \to \{-1,1,*\}$. A random restriction is a distribution over restrictions. We use $R^n_p$ to denote the distribution over restrictions on $n$ variables obtained by setting each $\rho(i) = *$ with probability $p$, and to $1$ and $-1$ with probability $(1-p)/2$ each. We will often view the process of sampling a restriction from $R^n_p$ as picking a pair $(I, y)$, where $I \subseteq [n]$ is obtained by picking each element of $[n]$ to be in $I$ with probability $p$, and $y \in \{-1,1\}^{n-|I|}$ is chosen uniformly at random.

Definition 2.2 (Restriction trees and decision trees). A restriction tree $T$ on $\{-1,1\}^n$ of depth $h$ is a binary tree of depth $h$ all of whose internal nodes are labeled by one of the $n$ variables, and the outgoing edges from an internal node are labeled $+1$ and $-1$; we assume that a node and its ancestor never query the same variable. Each leaf $\ell$ of $T$ defines a restriction $\rho_\ell$ that sets all the variables on the path from the root of the tree to $\ell$ and leaves the remaining variables unset. A random restriction tree $T$ of depth $h$ is a distribution over restriction trees of depth $h$. Given a restriction tree $T$, the process of choosing a random edge out of each internal node generates a distribution over the leaves of the tree (note that this distribution is not uniform: the weight it puts on a leaf $\ell$ at depth $d$ is $2^{-d}$). We use the notation $\ell \sim T$ to denote a leaf of $T$ picked according to this distribution.
A decision tree is a restriction tree all of whose leaves are labeled either by $+1$ or $-1$. We say a decision tree has size $s$ if the tree has $s$ leaves. We say a decision tree computes a function $f : \{-1,1\}^n \to \{-1,1\}$ if for each leaf $\ell$ of the tree, $f|_{\rho_\ell}$ is equal to the label of $\ell$.
The following proposition relates noise sensitivity, variance under random restrictions, and correlation with Parity.

Proposition 2.6. Let $f$ be a Boolean function on $n$ variables. Then:
1. For any $p \in [0, 1/2]$, $\mathrm{NS}_p(f) = \frac{1}{2}\,\mathbf{E}_{\rho\sim R^n_{2p}}[\mathrm{Var}(f|_\rho)]$.
2. $\mathrm{Corr}(f, \mathrm{Par}_n) \le O(\mathrm{NS}_{1/n}(f))$.

The above proposition seems to be folklore, but we could not find explicit proofs in the literature. For completeness we present the proof below.
Proof. For point 1, we know that $\mathrm{NS}_p(f) = \Pr_{x,y}[f(x) \neq f(y)]$, where $x$ and $y$ are sampled as in Definition 2.4. Alternately, we may also think of sampling $(x,y)$ in the following way: choose $\rho = (I,z) \sim R^n_{2p}$, and for the locations indexed by $I$ choose $x', y' \in \{-1,1\}^{|I|}$ independently and uniformly at random to define the strings $x$ and $y$ respectively; outside $I$, both $x$ and $y$ agree with $z$. Hence, we have
$$\mathrm{NS}_p(f) = \mathbf{E}_{\rho\sim R^n_{2p}}\left[\Pr_{x',y'}\left[f|_\rho(x') \neq f|_\rho(y')\right]\right] = \mathbf{E}_{\rho\sim R^n_{2p}}\left[\frac{1 - \mathbf{E}[f|_\rho]^2}{2}\right] = \frac{1}{2}\,\mathbf{E}_{\rho\sim R^n_{2p}}\left[\mathrm{Var}(f|_\rho)\right],$$
where the second equality uses the independence of $x'$ and $y'$.

We now proceed with point 2. As $\mathrm{NS}_p(f)$ is monotone non-decreasing in $p$ for $p \le 1/2$ [37], it suffices to prove the bound with $p = 1/n \le 1/2$, and hence we have
$$2\,\mathrm{NS}_p(f) = \mathbf{E}\left[|f(x) - f(y)|\right] \ge \left|\mathbf{E}\left[(f(x) - f(y))\cdot\mathrm{Par}_n(x)\right]\right| = \left|\langle f, \mathrm{Par}_n\rangle\right|\cdot\left(1 - (1-2p)^n\right),$$
where the last equality follows since $\mathbf{E}[f(y)\,\mathrm{Par}_n(x)] = (1-2p)^n\,\langle f, \mathrm{Par}_n\rangle$, as can be seen by expanding $f$ in the Fourier basis. Note that for $\rho = (I,y)$ chosen as above, $1 - (1-2p)^n$ is exactly the probability that $I \neq \emptyset$, which is $\Omega(1)$ for $p = 1/n$. Hence we have $\mathrm{Corr}(f, \mathrm{Par}_n) = |\langle f, \mathrm{Par}_n\rangle| \le O(\mathrm{NS}_{1/n}(f))$.
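Point 1 can also be verified numerically. The following self-contained check (ours, for illustration) computes both sides of the identity $\mathrm{NS}_p(f) = \frac{1}{2}\mathbf{E}_{\rho\sim R^n_{2p}}[\mathrm{Var}(f|_\rho)]$ exactly for a small majority function:

```python
import itertools

def maj(x):
    return 1 if sum(x) > 0 else -1

def ns(f, n, p):
    """Exact NS_p(f): y flips each bit of x independently with probability p."""
    total = 0.0
    for x in itertools.product((-1, 1), repeat=n):
        for flips in itertools.product((0, 1), repeat=n):
            w = 1.0
            for fl in flips:
                w *= p if fl else 1 - p
            y = tuple(-v if fl else v for v, fl in zip(x, flips))
            if f(x) != f(y):
                total += w / 2 ** n
    return total

def half_expected_variance(f, n, q):
    """(1/2) E_{rho ~ R^n_q}[Var(f|rho)], by exact enumeration (q = Pr[*])."""
    total = 0.0
    for stars in itertools.product((0, 1), repeat=n):  # 1 = variable left free
        free = [i for i, b in enumerate(stars) if b]
        fixed = [i for i, b in enumerate(stars) if not b]
        w = q ** len(free) * (1 - q) ** len(fixed)
        for z in itertools.product((-1, 1), repeat=len(fixed)):
            full = [0] * n
            for i, v in zip(fixed, z):
                full[i] = v
            s = 0
            for xv in itertools.product((-1, 1), repeat=len(free)):
                for i, v in zip(free, xv):
                    full[i] = v
                s += f(tuple(full))
            mean = s / 2 ** len(free)
            total += w * ((1 - mean ** 2) / 2) / 2 ** len(fixed)
    return total

# NS_p(f) = (1/2) E_{rho ~ R^n_{2p}}[Var(f|rho)]
for p in (0.1, 0.25):
    assert abs(ns(maj, 5, p) - half_expected_variance(maj, 5, 2 * p)) < 1e-9
```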

Threshold functions and circuits
Definition 2.7 (Threshold functions and gates). A threshold gate is a gate $\phi$ labeled with a pair $(w, \theta)$, where $w \in \mathbb{R}^m$ for some $m \in \mathbb{N}$ and $\theta \in \mathbb{R}$. The gate computes the Boolean function $f_\phi : \{-1,1\}^m \to \{-1,1\}$ defined by $f_\phi(x) = \mathrm{sgn}(\langle w, x\rangle - \theta)$ (we define $\mathrm{sgn}(0) = -1$ for the sake of this definition). The fan-in of the gate $\phi$, denoted $\mathrm{fan\text{-}in}(\phi)$, is $m$. A linear threshold function (LTF) is a Boolean function that can be represented by a threshold gate. More generally, a Boolean function $f : \{-1,1\}^m \to \{-1,1\}$ is a degree-$k$ polynomial threshold function (PTF) if there is a degree-$k$ real polynomial $P$ such that $f(x) = \mathrm{sgn}(P(x))$ for all $x$.

Definition 2.8 (Threshold circuits). A threshold circuit $C$ is a Boolean circuit whose gates are all threshold gates. There are designated output gates, which compute the functions computed by the circuit. Unless explicitly mentioned otherwise, however, we assume that our threshold circuits have a unique output gate. The gate complexity of $C$ is the number of (non-input) gates in the circuit, while the wire complexity is the sum of the fan-ins of the various gates.
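In code, a threshold gate in the sense of Definition 2.7 is just the sign of an affine form. As a small illustration of ours (using the convention $\mathrm{sgn}(0) = -1$), majority is the LTF with $w = (1,\ldots,1)$ and $\theta = 0$:

```python
import itertools

def threshold_gate(w, theta):
    """Return the LTF x -> sgn(<w,x> - theta) on {-1,1} inputs, sgn(0) = -1."""
    def f(x):
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) - theta > 0 else -1
    return f

maj5 = threshold_gate([1] * 5, 0)
for x in itertools.product((-1, 1), repeat=5):
    assert maj5(x) == (1 if sum(x) > 0 else -1)  # strict majority of +1s
```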
A Threshold map from n to m variables is a depth-1 threshold circuit C with n inputs and m outputs. We say that such a map is read-k if each input variable is an input to at most k of the threshold gates in C.
Applying the above theorem in the case that $s = n$ and the $f_i$'s are just the input bits, we obtain the following.

Definition 2.11 (Restrictions of threshold gates and circuits). Given a threshold gate $\phi$ of fan-in $m$ labeled by the pair $(w, \theta)$ and a restriction $\rho$ on $m$ variables, we use $\phi|_\rho$ to denote the threshold gate over the variables indexed by $\rho^{-1}(*)$ obtained in the natural way by setting variables according to $\rho$.

The Generalized Andreev function
We state here the definition of a generalization of Andreev's function, due to Komargodski and Raz [29] and to Chen, Kabanets, Kolokolova, Shaltiel, and Zuckerman [7]. This function will be used to give strong correlation bounds against constant-depth threshold circuits with slightly superlinear wire complexity. We first need some definitions. We have the following explicit construction of a bit-fixing extractor.
Also recall that a function $\mathrm{Enc} : \{-1,1\}^a \to \{-1,1\}^b$ defines an $(\alpha, L)$-error-correcting code, for parameters $\alpha \in [0,1]$ and $L \in \mathbb{N}$, if for any $z \in \{-1,1\}^b$, the number of elements in the image of $\mathrm{Enc}$ that are at relative Hamming distance at most $\alpha$ from $z$ is bounded by $L$.
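For intuition, here is a toy example of ours (not the code used in the paper): the 3-fold repetition code on 2 bits has minimum distance 3, so Hamming balls of relative radius $1/6$ around distinct codewords are disjoint, making it a $(1/6, 1)$-error-correcting code in the sense above.

```python
import itertools

def enc(msg):  # 3-fold repetition code: {-1,1}^2 -> {-1,1}^6
    return tuple(b for b in msg for _ in range(3))

def max_list_size(alpha, b, codewords):
    """The largest number of codewords within relative Hamming distance
    alpha of any single word z in {-1,1}^b."""
    best = 0
    for z in itertools.product((-1, 1), repeat=b):
        cnt = sum(1 for c in codewords
                  if sum(ci != zi for ci, zi in zip(c, z)) <= alpha * b)
        best = max(best, cnt)
    return best

codewords = [enc(m) for m in itertools.product((-1, 1), repeat=2)]
assert max_list_size(1 / 6, 6, codewords) == 1  # a (1/6, 1)-error-correcting code
```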
The following theorem is folklore and is stated explicitly in the paper by Chen et al. [7].

Now we can define the Generalized Andreev function as in [7]. The function is $F : \{-1,1\}^{4n} \times \{-1,1\}^n \to \{-1,1\}$ (so $F$ is a function on $5n$ bits) and is defined as follows. Let $\gamma > 0$ be a constant parameter. The parameter will be fixed later according to the application at hand.
Let $E$ be any $(n, n^{\gamma}, m = 0.9n^{\gamma}, 2^{-n^{\Omega(\gamma)}})$ extractor (we can obtain an explicit one using Theorem 2.17). We interpret the output of $E$ as an integer from a suitable range. Given $a \in \{-1,1\}^{4n}$, we use $F_a(\cdot)$ to denote the resulting sub-function on $n$ bits obtained by fixing $x_1 = a$.
The following lemma was proved as part of Theorem 6.5 in [7].

Concentration bounds
We state a collection of concentration bounds that we will need in our proofs. The proofs of Theorems 2.20 and 2.22 may be found in the excellent book by Dubhashi and Panconesi [13].
We also need a multiplicative form of the Chernoff bound for sums of Boolean random variables.
We will need the following variant of the Chernoff bound that may be found in the survey of Chung and Lu [9].
Theorem 2.23 (Theorem 3.4 in [9]). Fix any parameter $p \in (0,1)$. Let $X_1,\ldots,X_n$ be independent Boolean-valued random variables such that $\mathbf{E}[X_i] = p$ for each $i \in [n]$. Let $X = \sum_{i\in[n]} a_i X_i$ where each $a_i \ge 0$. Then,
$$\Pr\left[X \ge \mathbf{E}[X] + \lambda\right] \le \exp\left(-\frac{\lambda^2}{2(\nu + a\lambda/3)}\right),$$
where $a = \max\{a_1,\ldots,a_n\}$ and $\nu = p\cdot\sum_i a_i^2$. In the case that $\lambda \ge \mathbf{E}[X]$, we have $\nu \le pa\cdot\sum_i a_i = a\,\mathbf{E}[X] \le a\lambda$, and hence we obtain
$$\Pr\left[X \ge \mathbf{E}[X] + \lambda\right] \le \exp\left(-\frac{3\lambda}{8a}\right).$$

Let $Y_1,\ldots,Y_m$ be random variables defined as functions of independent random variables $X_1,\ldots,X_n$. For $i \in [m]$, let $S_i \subseteq [n]$ index those random variables among $X_1,\ldots,X_n$ that influence $Y_i$. We say that $Y_1,\ldots,Y_m$ are read-$k$ random variables if any $j \in [n]$ belongs to $S_i$ for at most $k$ different $i \in [m]$.
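As a sanity check (our own, not from [9]), the first tail bound above can be compared with the exact tail of a small weighted sum of i.i.d. Bernoulli variables; the stated bound also follows from Bernstein's inequality, so the assertion below must hold.

```python
import itertools, math

p = 0.3
a_vec = [1.0, 2.0, 0.5, 1.5, 1.0, 2.5, 0.5, 1.0]  # the weights a_i >= 0
n = len(a_vec)

def exact_tail(t):
    """Pr[sum_i a_i X_i >= t] for independent X_i ~ Bernoulli(p), exactly."""
    total = 0.0
    for bits in itertools.product((0, 1), repeat=n):
        if sum(a * b for a, b in zip(a_vec, bits)) >= t:
            total += p ** sum(bits) * (1 - p) ** (n - sum(bits))
    return total

mean = p * sum(a_vec)
a = max(a_vec)
nu = p * sum(ai ** 2 for ai in a_vec)
for lam in (1.0, 2.0, 4.0):
    bound = math.exp(-lam ** 2 / (2 * (nu + a * lam / 3)))
    assert exact_tail(mean + lam) <= bound
```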
The notation $D(p\|q)$ represents the KL-divergence (see, e. g., [10]) between the two probability distributions on $\{0,1\}$ where the probabilities assigned to $1$ are $p$ and $q$ respectively.

Theorem 2.24 (A read-$k$ Chernoff bound [16]). Let $Y_1,\ldots,Y_m$ be $\{0,1\}$-valued read-$k$ random variables such that $\mathbf{E}[Y_i] = p_i$. Let $p$ denote the average of $p_1,\ldots,p_m$. Then, for any $\varepsilon > 0$,
$$\Pr\left[\sum_{i\in[m]} Y_i \ge (p+\varepsilon)m\right] \le \exp\left(-D(p+\varepsilon\|p)\cdot m/k\right) \quad\text{and}\quad \Pr\left[\sum_{i\in[m]} Y_i \le (p-\varepsilon)m\right] \le \exp\left(-D(p-\varepsilon\|p)\cdot m/k\right).$$

It was pointed out to us by an anonymous reviewer that the above actually follows easily from an older, stronger result of Janson [25]. We cite the result from [16] since it is stated in the form above, which is useful for us.
Using standard estimates on the KL-divergence, Theorem 2.24 implies the following.

Correlation bounds against threshold circuits with small gate complexity
This section serves as a warm-up to our main results in the following sections. Here we use a simple version of our technique to show that constant-depth threshold circuits with a small number of gates cannot correlate well with the Parity function. The following is a consequence of our main result.

Corollary 3.1. Let $C$ be a depth-$d$ threshold circuit with at most $k$ threshold gates. Then $\mathrm{Corr}(C, \mathrm{Par}_n) \le O(k^{(d-1)/d}\cdot n^{-1/(2d)})$; in particular, the correlation is $n^{-\Omega(1)}$ whenever $k \le n^{(1-\Omega(1))/(2(d-1))}$.
It should be noted that Nisan [34] already proved stronger correlation bounds for the Inner Product function against any threshold circuit (not necessarily constant-depth) with a sub-linear (much smaller than $n/\log n$) number of threshold gates. The idea of his proof is to first show that each threshold gate on $n$ variables has a $\delta$-error randomized communication protocol with complexity $O(\log(n/\delta))$ [34, Theorem 1]. One can use this to show that any threshold circuit as in the theorem can be written as a decision tree of depth $n/k$ querying threshold functions, and hence has an $\exp(-\Omega(k))$-error protocol of complexity at most $n/10$. Standard results in communication complexity then imply that any such function can have correlation at most $\exp(-\Omega(k))$ with Inner Product.
However, such techniques cannot be used to obtain lower bounds or correlation bounds for the parity function, since the parity function has low communication complexity, even in the deterministic setting. An even bigger disadvantage to this technique is that it cannot be used to obtain any superlinear lower bound on the wire complexity, since threshold circuits with a linear number of wires can easily compute functions with high communication complexity, such as the Disjointness function.
The techniques we use here give nearly tight correlation bounds for the Parity function (see [52]) and can also be strengthened to the setting of small wire complexity, as we will show later. In fact, we prove something stronger: we upper bound the noise sensitivity of small constant-depth threshold circuits, which additionally implies the existence of non-trivial learning algorithms [28,17] for such circuits. Further, our techniques also imply noise sensitivity bounds for $\mathsf{AC}^0$ circuits augmented with a small number of threshold gates. For the sake of exposition, we postpone these generalizations to Section 8 and prove only the noise sensitivity result for constant-depth threshold circuits in this section.

Correlation bounds via noise sensitivity
The main result of this section is the following.
Theorem 3.2. Let $C$ be a depth-$d$ threshold circuit with at most $k$ threshold gates. Then, for any parameters $p, q \in [0,1]$, we have
$$\mathrm{NS}_{p^{d-1}q}(C) \le O\left(k\sqrt{p} + \sqrt{q}\right). \tag{3.1}$$

Proof. We assume that $q \le 1/2$, since otherwise $\sqrt{q} = \Omega(1)$ and (3.1) holds trivially. (By Proposition 2.6, a noise sensitivity bound of this form also yields a correlation bound with Parity; this is how the theorem is applied in Corollary 3.1.) The proof of (3.1) is by induction on the depth $d$ of the circuit. The base case $d = 1$ is just Peres' theorem (Theorem 2.12).
Now assume that $C$ has depth $d > 1$. Let $k_1$ be the number of threshold gates at depth $d-1$ in the circuit. We choose a random restriction $\rho \sim R^n_p$ and consider the circuit $C|_\rho$. It is easy to check that
$$\mathrm{NS}_{p^{d-1}q}(C) = \mathbf{E}_{\rho\sim R^n_p}\left[\mathrm{NS}_{p^{d-2}q}(C|_\rho)\right],$$
and hence, to prove (3.1), it suffices to bound the expectation of $\mathrm{NS}_{p^{d-2}q}(C|_\rho)$. Let us first consider the circuit $C|_\rho$. Peres' theorem tells us that on application of the restriction $\rho$, each threshold gate at depth $d-1$ becomes quite biased on average. Formally, by Theorem 2.12 and Proposition 2.6, for each threshold gate $\phi$ at depth $d-1$ there is a constant $b_{\phi,\rho} \in \{-1,1\}$ such that
$$\mathbf{E}_{\rho}\left[\Pr_x\left[\phi|_\rho(x) \neq b_{\phi,\rho}\right]\right] \le O(\sqrt{p}). \tag{3.2}$$
In particular, replacing $\phi|_\rho$ by $b_{\phi,\rho}$ in the circuit $C|_\rho$ yields a circuit that differs from $C|_\rho$ on only an $O(\sqrt{p})$ fraction of inputs (in expectation). Applying this replacement to each of the $k_1$ threshold gates at depth $d-1$ yields a circuit $C^\rho$ with $k - k_1$ threshold gates and depth $d-1$ such that
$$\mathbf{E}_\rho\left[\delta(C|_\rho, C^\rho)\right] \le O(k_1\sqrt{p}), \tag{3.3}$$
where $\delta(C|_\rho, C^\rho)$ denotes the fraction of inputs on which the two circuits differ. On the other hand, we can apply the inductive hypothesis to $C^\rho$ to obtain
$$\mathrm{NS}_{p^{d-2}q}(C^\rho) \le O\left((k-k_1)\sqrt{p} + \sqrt{q}\right).$$
Therefore, to infer (3.1), we put the above together with (3.3) and the following elementary fact.

Proposition 3.3. Let $f, g$ be Boolean functions on $m$ variables with $\delta(f,g) \le \delta$. Then, for any $r \in [0,1]$, we have $\mathrm{NS}_r(f) \le \mathrm{NS}_r(g) + 2\delta$.
Proof of Proposition 3.3. By Proposition 2.6, we know that $\mathrm{NS}_r(f) = \frac{1}{2}\,\mathbf{E}_{\rho\sim R^m_{2r}}[\mathrm{Var}(f|_\rho)]$, and similarly for $g$. By definition of noise sensitivity, we have
$$\mathrm{NS}_r(f) = \Pr_{x,y}\left[f(x) \neq f(y)\right],$$
where $x \in \{-1,1\}^m$ is chosen uniformly at random and $y$ is chosen by flipping each bit of $x$ independently with the appropriate noise probability. Note that each of $x$ and $y$ is individually uniformly distributed over $\{-1,1\}^m$, and hence both $f(x) = g(x)$ and $f(y) = g(y)$ hold with probability at least $1 - 2\delta$. This yields
$$\Pr_{x,y}\left[f(x) \neq f(y)\right] \le \Pr_{x,y}\left[g(x) \neq g(y)\right] + 2\delta,$$
which implies the claimed bound.
The above theorem yields the correlation bound for Parity stated above (Corollary 3.1) as we now show.
Proof of Corollary 3.1. We apply Theorem 3.2 with the following optimized parameters: $p = (nk^2)^{-1/d}$, and $q \in [0,1]$ such that $p^{d-1}q = 1/n$. It may be verified that for this setting of parameters, Theorem 3.2 gives us
$$\mathrm{NS}_{1/n}(C) \le O\left(k^{(d-1)/d}\cdot n^{-1/(2d)}\right).$$
As noted in Proposition 2.6, we have $\mathrm{Corr}(C, \mathrm{Par}_n) \le O(\mathrm{NS}_{1/n}(C))$. This completes the proof.
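To see how the two error terms balance, the following numeric sketch (ours; it assumes a noise-sensitivity bound of the form $k\sqrt{p} + \sqrt{q}$, as in the proof sketch above) evaluates both terms at $p = (nk^2)^{-1/d}$ and $q = 1/(np^{d-1})$:

```python
# Balancing k*sqrt(p) and sqrt(q) subject to the constraint p^(d-1) * q = 1/n.
# Assumption: the noise-sensitivity bound has the form k*sqrt(p) + sqrt(q).
n, d = 10 ** 6, 3
k = round(n ** (1 / (2 * (d - 1)))) // 2  # gate budget below n^(1/(2(d-1)))

p = (n * k ** 2) ** (-1 / d)
q = 1 / (n * p ** (d - 1))

assert 0 < p < 1 and 0 < q <= 1
assert abs(p ** (d - 1) * q * n - 1) < 1e-9  # the constraint p^(d-1) q = 1/n
term1, term2 = k * p ** 0.5, q ** 0.5
assert abs(term1 - term2) / term1 < 1e-9     # this choice of p balances the terms
```

Note that $q \le 1$ forces $k \le n^{1/(2(d-1))}$, matching the gate-complexity threshold discussed throughout this section.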

Remark 3.4.
It is instructive to compare the above technique with the closely related work of Gopalan and Servedio [17]. The techniques of [17], applied to the setting of Theorem 3.2, show that $\mathrm{NS}_p(C) \le O(k2^k\sqrt{p})$, which gives a better dependence on the noise parameter $p$ but a much worse dependence on $k$. Indeed, this is not surprising, since in this setting the technique of Gopalan and Servedio does not use the fact that the circuit is of depth $d$. The threshold circuit is converted to a decision tree of depth $k$ querying threshold functions, and it is this tree that is analyzed. We believe that the right answer should incorporate the best of both bounds: a dependence of $\sqrt{p}$ on the noise parameter together with only a polynomial dependence on $k$. As in Corollary 3.1, this would show that $\mathrm{Corr}(C, \mathrm{Par}_n) \le n^{-\Omega(1)}$, but additionally, we would also get $\mathrm{Corr}(C, \mathrm{Par}_n) \le n^{-1/2+o(1)}$ as long as $k = n^{o(1)}$ and $d$ is a constant.
It is known from the work of Siu, Roychowdhury and Kailath [52, Theorem 7] that Corollary 3.1 is tight, in the sense that there do exist circuits of gate complexity roughly n^{1/(2(d−1))} that have significant correlation with Par_n. More formally:

Theorem 3.5 (Theorem 7 in [52]). Let ε > 0 be an arbitrary constant. Then, there is a threshold circuit of depth d with O(d) · (n log(1/ε))^{1/(2(d−1))} gates that computes Par_n correctly on a 1 − ε fraction of inputs.

Correlation bounds against threshold circuits with small wire complexity
In this section, we prove correlation bounds against threshold circuits of small wire complexity. We prove three results of this kind. The first result, proved in Section 4.1, is a near-optimal correlation bound against depth-2 circuits computing the Parity function. This result was obtained independently by Kane and Williams [27].
Note that the above theorem is nearly tight, since by Theorem 3.5, there is a depth-2 circuit with O(√n) gates (and hence O(n^{3/2}) wires) that computes Parity on n variables correctly with high probability. The proof is a simple argument based on Peres' theorem along with a correlation bound against low-degree polynomial threshold functions due to Aspnes, Beigel, Furst and Rudich [3].
The advantage of the above technique is that it yields a near-optimal lower bound in the depth-2 case. Unfortunately, however, it does not directly extend to depths larger than 2. The main results of the section, stated below, show how to obtain correlation bounds against all constant depths.
The second result, proved in Section 4.2 via a more involved argument, yields a correlation bound for the Parity function against all constant depths. For comparison, we note that a result of Paturi and Saks [41] shows that the Parity function can be computed by threshold circuits of depth d with n^{1+2^{−Ω(d)}} many wires. A worst-case lower bound of a similar form was proved by Impagliazzo et al. [24]. Finally, we are also able to extend the techniques used in the proof of the above theorem to prove exponentially small correlation bounds. This kind of correlation bound cannot be proved for the Parity function, since the Parity function on n variables has correlation Ω(1/√n) with the Majority function on n variables (see, e.g., [37, Section 5.3]), which clearly has a threshold circuit with only n wires.
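The Ω(1/√n) correlation between Parity and Majority can be verified exactly for small odd n; a quick Python enumeration (ours, not from the paper) for n = 5, where the exact value works out to 3/8:

```python
from fractions import Fraction
from itertools import product

def majority(x):
    return 1 if sum(x) > 0 else -1

def corr_maj_parity(n):
    # |E_x[Maj_n(x) * Par_n(x)]| over uniform x in {-1,1}^n, n odd
    total = 0
    for x in product((-1, 1), repeat=n):
        par = 1
        for b in x:
            par *= b
        total += majority(x) * par
    return abs(Fraction(total, 2 ** n))

c5 = corr_maj_parity(5)   # exact correlation of Maj_5 with Par_5
```

The value 3/8 already exceeds 1/(2√5), consistent with the Ω(1/√n) lower bound.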
We prove such a correlation bound in Section 4.3 for the Generalized Andreev function from Section 2.4. For some technical reasons, we prove a slightly stronger result (Theorem 4.13) from which we obtain the following consequence.

Corollary 4.3 (Correlation bounds for Andreev's function).
For any constant d ≥ 1, there is an ε_d = 1/2^{O(d)} such that the following holds. Let F be the Generalized Andreev function on 5n variables as defined in Section 2.4 for any constant γ < 1/6. Any depth-d threshold circuit C of wire complexity at most n^{1+ε_d} satisfies Corr(C, F) ≤ exp(−n^{Ω(ε_d)}), where the Ω(·) hides constants independent of d and n.
We now state a key lemma that will be used in the proofs of our correlation bounds in Sections 4.2 and 4.3. The lemma will be proved in Section 5.
Lemma 4.4 (Main structural lemma for threshold gates). The following holds for some absolute constant p_0 ∈ [0, 1]. For any threshold gate φ over n variables with label (w, θ) and any p ∈ [0, p_0], the probability over ρ ∼ R^n_p that φ|_ρ is t-balanced is at most q, for some t = 1/p^{Ω(1)} and q = p^{Ω(1)}.

Correlation bounds against depth-2 threshold circuits computing Parity
In this section, we prove Theorem 4.1. The proof is based on two subclaims: a correlation bound against low-degree polynomial threshold functions (Theorem 4.5) and the following restriction lemma.

Lemma 4.6. Let f_1, . . . , f_t be the LTFs computed by C at depth 1. Under a random restriction ρ with ∗-probability p = 1/n^{1−α}, with probability at least 1 − n^{−Ω(γ)}, the circuit C|_ρ is n^{−Ω(γ)}-approximated by a circuit C̃_ρ which is obtained from C|_ρ by replacing each of the f_i|_ρ's by an O(n^{α/2−Ω(γ)})-junta g_i.
Assuming the above two claims, we can finish the proof of Theorem 4.1 easily as follows.
Let C be a circuit of wire complexity n^{1+ε}. We apply a random restriction ρ with ∗-probability p = 1/n^{1−α} as in Lemma 4.6. Call the restriction good if there is a circuit C̃_ρ as in the lemma that n^{−Ω(γ)}-approximates C|_ρ and |ρ^{−1}(∗)| ≥ n^α/2. The probability that the first of these events does not occur is at most n^{−Ω(γ)} by Lemma 4.6, and the probability that the second does not occur is at most exp(−Ω(n^α)) by a Chernoff bound (Theorem 2.22). Thus, the probability that ρ is not good is at most n^{−Ω(γ)}. Say ρ is a good restriction. Note that each restricted LTF at depth 1 in the circuit C̃_ρ is an O(n^{α/2−Ω(γ)})-junta and hence can be represented exactly by a polynomial of the same degree. This implies that C̃_ρ is a PTF of degree O(n^{α/2−Ω(γ)}) and hence, by Theorem 4.5, has correlation at most n^{−Ω(γ)} with the Parity function (on the remaining at least n^α/2 variables). Moreover, C|_ρ is well-approximated by C̃_ρ and hence has correlation at most n^{−Ω(γ)} + n^{−Ω(γ)} with Parity.
Upper bounding the correlation by 1 for bad restrictions, we see that the overall correlation is at most n^{−Ω(γ)}.
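The step "a k-junta can be represented exactly by a polynomial of degree k" can be made concrete via Fourier interpolation. The Python sketch below (ours, with made-up LTF weights) computes the exact multilinear expansion of a 3-variable LTF and checks that the degree-3 polynomial agrees with the LTF on every input:

```python
import math
from fractions import Fraction
from itertools import product, combinations

def fourier_expansion(f, k):
    """Return {S: f^(S)} with f^(S) = E_x[f(x) * prod_{i in S} x_i], exactly."""
    coeffs = {}
    for r in range(k + 1):
        for S in combinations(range(k), r):
            tot = sum(f(x) * math.prod(x[i] for i in S)
                      for x in product((-1, 1), repeat=k))
            coeffs[S] = Fraction(tot, 2 ** k)
    return coeffs

def evaluate(coeffs, x):
    # evaluate the multilinear polynomial at x in {-1,1}^k
    return sum(c * math.prod(x[i] for i in S) for S, c in coeffs.items())

# A 3-junta LTF (hypothetical example weights): sign(2*x0 + x1 + x2)
def ltf(x):
    return 1 if 2 * x[0] + x[1] + x[2] > 0 else -1

coeffs = fourier_expansion(ltf, 3)
exact = all(evaluate(coeffs, x) == ltf(x) for x in product((-1, 1), repeat=3))
```

The interpolation is exact for any Boolean function on k bits, which is all the proof needs.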
We now prove Lemma 4.6.
Proof of Lemma 4.6. Let f_1, . . . , f_t be the LTFs appearing at depth 1 in the circuit. We will divide the analysis based on the fan-ins of the f_i's (i.e., the number of variables they depend on). We denote by β the quantity 3/4 + ε/2. It can be checked that both inequalities in (4.1) hold for this choice. Consider any f_i of fan-in at most n^β. When hit with a random restriction with ∗-probability n^{−(1−α)}, we see that the expected number of variables of f_i that survive is at most n^{β−(1−α)} = n^{α−(1−β)} = n^{α/2−Ω(γ)} by (4.1) above. By a Chernoff bound (Theorem 2.22), the probability that this number exceeds twice its expectation is exponentially small. Union bounding over all the gates of small fan-in, we see that with probability 1 − exp(−n^{Ω(1)}), all the low fan-in gates depend on at most 2n^{α/2−Ω(γ)} many variables after the restriction. We call this high probability event E_1. Now, we consider the gates of fan-in at least n^β. Without loss of generality, let f_1, . . . , f_r be these LTFs. Since the total number of wires is at most n^{1+ε}, we have r ≤ n^{1+ε−β} = n^{1/2−α/2−Ω(γ)} by (4.1).
By Theorem 2.12, we know that each f_i|_ρ of high fan-in is, in expectation, close to a constant. By linearity of expectation and Markov's inequality, we can bound the probability that the total error incurred over the high fan-in gates is large; call the complementary event E_2. Consider the event E = E_1 ∧ E_2. A union bound tells us that the probability of E is at least 1 − n^{−Ω(γ)}. When this event occurs, we construct the circuit C̃_ρ from the statement of the claim as follows.
When the event E occurs, the LTFs of low fan-in are already n^{α/2−Ω(γ)}-juntas, so there is nothing to be done for them. Now, consider the LTFs of high fan-in: when E_2 occurs, each of these is close to a constant, and in the circuit C̃_ρ these gates are replaced by the corresponding constants, which are 0-juntas. The circuit C̃_ρ now has the required form. We now analyze the error introduced by this operation. Each replacement changes the circuit on only a small fraction of inputs, and thus the overall error introduced is at most n^{−Ω(γ)} (since E_2 is assumed to occur). Thus, the circuit C̃_ρ is an n^{−Ω(γ)}-approximation to C|_ρ.

Correlation bounds for Parity against constant-depth circuits
In this section, we prove Theorem 4.2, assuming Lemma 4.4. The proof proceeds by iteratively reducing the depth of the circuit. In order to perform this depth-reduction for a depth-d circuit, we need to analyze the threshold map defined by the threshold gates at depth d − 1. The first observation, which follows from Markov's inequality, shows that we may assume (after setting a few variables) that the map reads each variable only a few times.
Fact 4.7 (Small wire-complexity to small number of reads). Let C be any threshold circuit on n variables with wire complexity at most cn. Then, there is a set S of at most n/2 variables such that each variable outside S is an input variable to at most 2c many gates in C.
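Fact 4.7 is constructive: since the total wire count is at most cn, at most n/2 variables can be read more than 2c times. A Python sketch (ours, with a made-up toy circuit) of this Markov-style selection:

```python
from collections import Counter

def high_read_vars(gate_inputs, n, c):
    """Return the set S of variables read more than 2c times.
    gate_inputs: one list of variable indices per gate, total wire
    count at most c*n.  By Markov's inequality, |S| <= n/2; every
    variable outside S feeds at most 2c gates."""
    reads = Counter(v for gate in gate_inputs for v in gate)
    return {v for v, r in reads.items() if r > 2 * c}

# Toy circuit: n = 6 variables, c = 1 (at most 6 wires total).
gates = [[0, 1], [0, 2], [0, 3]]   # variable 0 is read 3 times
S = high_read_vars(gates, n=6, c=1)
```

Setting the variables in S (here just variable 0) leaves a read-2c circuit.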
The second observation is that if the fan-ins of all the threshold gates are small, then depth-reduction is easy (after setting some more variables).

Proposition 4.8. Let C be a read-t threshold circuit on n variables in which every threshold gate has fan-in at most k. Then there is a set S of at least n/kt variables such that every gate of C reads at most one variable from S.

Proof. This may be done via a simple graph-theoretic argument. Define an undirected graph whose vertex set is the set of n variables, with two variables adjacent iff they feed into the same threshold gate. We need to pick an S that is an independent set in this graph. Since the graph has degree at most kt, we can greedily find an independent set of size at least n/kt. Let S be such an independent set.
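The greedy argument in this proof is easy to implement; a Python sketch (ours, on a made-up instance with fan-in k = 2 and read t = 2):

```python
def junta_set(n, gate_inputs):
    """Greedily pick an independent set S in the conflict graph in which
    two variables are adjacent iff some gate reads both.  Every gate
    then reads at most one variable of S."""
    conflicts = {v: set() for v in range(n)}
    for gate in gate_inputs:
        for u in gate:
            for v in gate:
                if u != v:
                    conflicts[u].add(v)
    S = set()
    for v in range(n):                # greedy scan over the variables
        if conflicts[v].isdisjoint(S):
            S.add(v)
    return S

# Toy instance: n = 12 variables, 8 gates of fan-in 2, each variable read <= 2.
gates = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [1, 2], [5, 6]]
S = junta_set(12, gates)
```

On this instance greedy returns S = {0, 2, 4, 6, 8, 10}, well above the n/kt = 3 guarantee, and no gate reads two variables of S.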
Let B > 2 be a constant real parameter that we will choose to satisfy various constraints in the proofs below. For d ≥ 1, define the parameters ε_d and δ_d in terms of B as in (4.2).

Theorem 4.2 (Correlation bounds for Parity (restated)). For any d ≥ 1 and c ≤ n^{ε_d}, any depth-d threshold circuit C with at most cn wires satisfies Corr(C, Par_n) ≤ O(n^{−ε_d}), where the O(·) hides absolute constants (independent of d and n).
Proof idea. The proof will be an induction on d. Assume that C has n^{1+ε} many wires for a small ε > 0. Our aim is to apply a restriction that leaves plenty of variables alive (i.e., set to ∗) while at the same time reduces the depth of the circuit from d to d − 1.
We first apply a random restriction ρ with ∗-probability n^{−δ} for a suitable δ > 0 to the circuit C and analyze the effect of ρ on the threshold gates in C at depth d − 1 (just above the variables).
By Lemma 4.4, we see that with high probability, each threshold gate φ at depth d − 1 becomes imbalanced and hence (by the Chernoff bound (Theorem 2.20)) highly biased. Such gates can be replaced by constants without noticeably changing the correlation with the parity function. Now the only gates at depth d − 1 are the balanced gates.
It remains to handle the balanced gates. Consider a gate φ of C with initial fan-in k. The expected number of input wires to φ that survive the random restriction ρ is k · n^{−δ}. Further, by Lemma 4.4, the probability that φ remains balanced is at most n^{−Ω(δ)}. While these two events are not independent, we can argue that their effects are somewhat independent in the following sense. We argue that the random variable Y that is equal to the fan-in of φ|_ρ if φ is balanced and 0 otherwise has expectation roughly k · n^{−δ} · n^{−Ω(δ)}. Hence, the expected number of wires W that feed into balanced gates is at most n^{1+ε} · n^{−δ} · n^{−Ω(δ)}, where the first term accounts for the total fan-ins of all the gates at depth d − 1. For δ suitably larger than ε, the expectation of W is much smaller than n^{1−δ}, the expected number of surviving variables. Thus, if we set all the variables corresponding to the wires feeding into the balanced gates, we can set all the balanced gates to constants. In the process, we don't set too many live variables (since the expectation of W is much smaller than n^{1−δ}) and further we reduce the depth of the circuit to d − 1.
We can now apply the induction hypothesis to bound the correlation of the resulting circuit with the Parity function on (roughly) n^{1−δ} variables to get the correlation bound against depth-d circuits.
Proof. The proof is by induction on the depth d of C. The base case is d = 1, which is the case when C is only a single threshold gate; in this case, Corollary 2.13 tells us that Corr(C, Par_n) ≤ O(n^{−ε_1}). Now, we handle the inductive case, when the depth d > 1. Our analysis proceeds in phases.

Phase 1. We first transform the circuit into a read-2c circuit by setting n/2 variables. This may be done by Fact 4.7. This defines a restriction tree of depth n/2. By Fact 2.3, it suffices to show that at each leaf of this restriction tree, the correlation of the restricted circuit with Par_{n/2} remains bounded by O(n^{−ε_d}).
Let n_1 now denote the new number of variables and let C_1 be the restricted circuit at some arbitrary leaf of the restriction tree. By renaming the variables, we assume that they are indexed by the set [n_1].
Phase 2. We restrict the circuit with a random restriction ρ = (I, y) ∼ R^{n_1}_p, where p = n^{−δ_d/2}. Let φ_1, . . . , φ_m denote the threshold gates at depth d − 1, and let L index the large gates, i.e., those of fan-in at least n^{δ_d}. By Lemma 4.4, we know that for each i ∈ [m] and some t = 1/p^{Ω(1)} and q = p^{Ω(1)}, the gate φ_i|_ρ is t-balanced with probability at most q. Further, we also know that for each i ∈ L, the expected value of fan-in(φ_i|_ρ) is p · fan-in(φ_i), since each variable is set to a constant with probability 1 − p; since i ∈ L, the expected fan-in of each φ_i|_ρ (i ∈ L) is at least n^{δ_d/2}. Hence, by a Chernoff bound (Theorem 2.22), for any i ∈ L, the fan-in of φ_i|_ρ exceeds 2p · fan-in(φ_i) with exponentially small probability. We call a set I generic if |I| ≥ n_1 p/2 and fan-in(φ_i|_ρ) ≤ 2p · fan-in(φ_i) for each i ∈ L. Let G denote the event that I is generic; by the above, G occurs with high probability. Our aim is to further restrict the circuit by setting all the input variables to the gates φ_i that are t-balanced. In order to analyze this procedure, we consider the total fan-in of the t-balanced large gates; conditioned on I being generic, its expectation is at most 4pq · n^{1+ε_d}. (4.7) We let µ := 4pq · n^{1+ε_d}. By Markov's inequality, the total fan-in of the t-balanced large gates exceeds µ/√q with probability at most √q. In particular, we can condition on a fixed generic I ⊆ [n] such that the same bound holds for random y ∼ {−1, 1}^{n_1−|I|}. The above gives us a restriction tree T (that simply sets all the variables in [n_1] \ I) such that at all but a 2√q fraction of leaves λ of T, the total fan-in of the large gates at depth 1 in C_1 that are t-balanced is at most µ/√q; call such λ good leaves. Let n_2 denote |I|, which is the number of surviving variables.
Phase 3. We will show that for any good leaf λ, the bound (4.9) holds, where C_λ denotes C_1|_{ρ_λ}. This will prove the theorem by Fact 2.3, where we have used the fact that Par_{n_1}|_{ρ_λ} = ±Par_{n_2} for each leaf λ, and also that 2√q ≤ n^{−ε_d} for a large enough choice of the constant B.
It remains to prove (4.9). We do this in two steps. In the first step, we set all large t-imbalanced gates to their most probable constant values. Formally, for a t-imbalanced threshold gate φ labeled by (w, θ), we have |θ| ≥ t · ‖w‖_2. We replace φ by a constant b_φ, which is 1 if θ ≥ t · ‖w‖_2 and −1 if −θ ≥ t · ‖w‖_2. This turns the circuit C_λ into a circuit C̃_λ of at most the wire complexity of C_λ. Further, note that for any x ∈ {−1, 1}^{n_1}, C_λ(x) = C̃_λ(x) unless there is a t-imbalanced threshold gate φ such that φ(x) ≠ b_φ. By the Chernoff bound (Theorem 2.20), the probability that this happens for any fixed imbalanced threshold gate is at most exp(−Ω(t²)) ≤ exp(−n^{Ω(δ_d)}).
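The claim that a t-imbalanced gate is a near-constant function can be checked exactly on a small gate. The Python sketch below (ours) compares the exact deviation probability with the Hoeffding bound exp(−t²/2); the precise constant in the exponent and the sign convention sign(⟨w, x⟩ − θ) are our assumptions, not taken from the paper.

```python
import math
from fractions import Fraction
from itertools import product

def deviation_prob(w, theta):
    """Exact Pr_x[ <w, x> > theta ] over uniform x in {-1,1}^n, i.e., the
    probability that the gate deviates from its most probable value -1
    when theta is large and positive."""
    n = len(w)
    hits = sum(1 for x in product((-1, 1), repeat=n)
               if sum(wi * xi for wi, xi in zip(w, x)) > theta)
    return Fraction(hits, 2 ** n)

w = [1] * 8                    # ||w||_2 = sqrt(8) ~ 2.828
t = 1.5
theta = 4.5                    # theta >= t * ||w||_2 ~ 4.243: t-imbalanced
p_dev = deviation_prob(w, theta)      # exactly 9/256
hoeffding = math.exp(-t * t / 2)      # Hoeffding-type bound exp(-t^2/2)
```

The exact deviation probability 9/256 ≈ 0.035 is comfortably below the bound exp(−t²/2) ≈ 0.325.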
By a union bound over the at most n large threshold gates, we see that C_λ and C̃_λ disagree on at most an exp(−n^{Ω(δ_d)}) fraction of inputs. In particular, by Fact 2.3 we get (4.10), where the last inequality is true for a large enough constant B.
In the second step, we further define a restriction tree T_λ such that C̃_λ becomes a depth-(d − 1) circuit with at most cn wires at all the leaves of T_λ. We first restrict by setting all variables that feed into any of the t-balanced gates. The number of variables set in this way is at most µ/√q, which is at most n_2/2 for a large enough choice of the constant B. This leaves n_3 ≥ n_2/2 variables still alive. Further, all the large t-balanced gates are set to constants with probability 1. Finally, by Proposition 4.8, we may set all but a set S of n_4 = n_3/(2cn^{δ_d}) variables to ensure that with probability 1, all the small gates depend on at most one input variable each. At this point, the circuit C̃_λ may be transformed into a depth-(d − 1) circuit C′_λ with at most as many wires as C̃_λ, which is at most cn.
Note that the number of unset variables is n_4 ≥ pn/(8cn^{δ_d}) ≥ n^{1−2δ_d} for large enough B. Hence, the number of wires, cn, is at most n_4^{1+ε_{d−1}} for suitably large B. Thus, by the inductive hypothesis, we have Corr(C′_λ, Par_{n_4}) ≤ O(n_4^{−ε_{d−1}}) with probability 1 over the choice of the variables restricted in the second step. Along with (4.10) and Fact 2.3, this implies (4.9) and hence the theorem.

Exponential correlation bounds for the Generalized Andreev function
We now prove an exponentially strong correlation bound for the Generalized Andreev function defined in Section 2.4 with any γ < 1/6. For any d ≥ 1, we define the constants ε_d and δ_d as in (4.2), where B > 2 is a large constant that will be chosen below.
Proof overview. As in the case of Theorem 4.2, the proof proceeds by an iterative depth reduction. The depth reduction here is technically more challenging than that of Theorem 4.2, since we are aiming for exponentially small correlation bounds and hence can no longer afford to ignore "bad" events that occur with polynomially small probability. This forces us to prove a more involved depth-reduction statement, which we state as a separate lemma (Lemma 4.11 below). We show how to use this depth-reduction statement to prove our correlation bound in Theorem 4.13 and Corollary 4.3.
We begin with a definition that will be required to state our depth-reduction lemma.
Definition 4.9 (Simplicity). We call a threshold circuit C (t, d, w)-simple if there is a set R of r ≤ t threshold functions g_1, . . . , g_r such that for every setting of these threshold functions to bits b_1, . . . , b_r, the circuit C can be represented on the corresponding inputs (i.e., inputs x satisfying g_i(x) = b_i for each i ∈ [r]) by a depth-d threshold circuit of wire complexity at most w.
In particular, note that a (t, d, w)-simple circuit C may be expressed as a disjunction, over all settings b_1, . . . , b_r ∈ {−1, 1}, of the terms C_{b_1,...,b_r} ∧ ⋀_{i∈[r]} (g_i = b_i), where each C_{b_1,...,b_r} is a depth-d circuit of wire complexity at most w. Further, note that the OR appearing in the above expression is disjoint (i.e., no two terms in the OR can be simultaneously true). We need the following elementary property of disjoint ORs.
Proposition 4.10. Let h be a disjoint OR of g_1, . . . , g_N, i.e., h = g_1 ∨ · · · ∨ g_N and no two of the g_i are simultaneously true. Then, for any Boolean function f, Corr(h, f) ≤ ∑_{i=1}^N Corr(g_i, f) + (N − 1) · Corr(1, f), where 1 denotes the constant 1 function.
Proof. Note that h being the disjoint OR of g_1, . . . , g_N translates into the identity h = (∑_i g_i) − (N − 1). Thus, using the bilinearity of the inner product and the triangle inequality, the claimed bound follows. We are now ready to prove our main depth-reduction lemma for threshold circuits with small wire complexity.
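The identity h = (∑_i g_i) − (N − 1) uses the {−1, 1} convention in which −1 denotes "true"; a quick exhaustive Python check (ours) over every disjoint value pattern:

```python
def disjoint_or_identity_holds(N):
    """Check h = sum(g_i) - (N - 1) on every disjoint pattern of values,
    where -1 means 'true' and at most one g_i may be true at a time."""
    patterns = [tuple(1 for _ in range(N))]                  # none true
    patterns += [tuple(-1 if i == j else 1 for i in range(N))
                 for j in range(N)]                          # exactly one true
    for g in patterns:
        h = -1 if -1 in g else 1          # the OR of the g_i
        if h != sum(g) - (N - 1):
            return False
    return True

ok = all(disjoint_or_identity_holds(N) for N in range(1, 6))
```

Since disjointness means at most one disjunct is true, these N + 1 patterns are the only ones that occur, and the identity holds on each.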
Lemma 4.11 (Depth-reduction lemma with exponentially small failure probability). Let d ≥ 1 be any constant and assume that ε_d, δ_d are defined as in (4.2) with B large enough. Say we are given any depth-d threshold circuit C on n variables with at most n^{1+ε_d} wires.
There is a restriction tree T of depth n − n^{1−2δ_d} with the following property: for a random leaf λ ∼ T, let E(λ) denote the event that the circuit C|_{ρ_λ} is exp(−n^{ε_d})-approximated by an (n^{δ_d}, d − 1, n^{1+ε_d})-simple circuit. Then, Pr_λ[¬E(λ)] ≤ exp(−n^{ε_d}).
Comparison with Theorem 4.2. The proof of this lemma is similar to the proof of the depth reduction in Theorem 4.2, with a few key differences. The main change is that we require the simplification from depth d to depth d − 1 to fail only with exponentially small probability, as opposed to the bound of n^{−Ω_d(1)} in the proof of Theorem 4.2. To ensure this, we will use the read-k Chernoff bound stated in Theorem 2.24.
However, this lemma yields strong bounds only when the number of random variables (the parameter m in the statement of Theorem 2.24) is quite large. In particular, for example, if the number of gates at depth d − 1 is small (a constant, say), then the bound obtained from Theorem 2.24 is not very useful. In this case, instead of simplifying these gates, we simply say that for each setting of the outputs of these gates, we get a circuit of smaller depth. This is why we obtain (t, d − 1, w)-simple circuits in general and not simply a depth-(d − 1) threshold circuit.
Proof. Let φ_1, . . . , φ_m be the threshold gates appearing at depth d − 1 in the circuit C. We say that φ_i is large if fan-in(φ_i) ≥ n^{δ_d} and small otherwise.
As in the inductive case of Theorem 4.2, our construction proceeds in phases.
Phase 1. This is identical to Phase 1 in Theorem 4.2. We thus get a restriction tree of depth n/2 such that at all leaves of this tree, the resulting circuit is a read-2c circuit with at most cn wires. Let C 1 denote the circuit obtained at some arbitrary leaf of the restriction tree and let n 1 denote the number of variables.
Phase 2. The basic idea here is similar to Phase 2 from Theorem 4.2. However, there are technical differences, since we apply a concentration bound to ensure that the circuit simplifies with very high probability. We restrict the circuit with a random restriction ρ = (I, y) ∼ R^{n_1}_p, where p = n^{−δ_d/2}. As in Theorem 4.2, we have, for some t = 1/p^{Ω(1)} and q = p^{Ω(1)}, that each φ_i|_ρ is t-balanced with probability at most q, and for each i ∈ L,

Pr_ρ[fan-in(φ_i|_ρ) > 2p · fan-in(φ_i)] ≤ exp(−Ω(n^{δ_d/2})), and (4.13)
Pr_{ρ=(I,y)}[|I| < n_1 p/2] ≤ exp(−Ω(np)). (4.14)

Now, we partition L as L = L_1 ∪ · · · ∪ L_a, where a ≤ 1/ε_d, as follows. The set L_j indexes all threshold gates at depth d − 1 of fan-in at least n^{δ_d+(j−1)ε_d} and less than n^{δ_d+jε_d}. We let ℓ_j denote |L_j|. For each i ∈ L, let Y_i be a random variable that is 1 if φ_i|_ρ is t-balanced and 0 otherwise. Note that this defines a collection of read-2c Boolean random variables (the underlying independent random variables are ρ(k) for each k ∈ [n_1]).
Let Z_j = ∑_{i∈L_j} Y_i be the number of t-balanced gates in L_j. By the read-2c Chernoff bound (Theorem 2.24), the probability that Z_j exceeds 2qℓ_j is exponentially small in terms of qℓ_j. Assuming that ℓ_j ≥ n^{3δ_d/4} and B = δ_d/ε_d is a large enough constant, this probability is upper bounded by exp(−2n^{ε_d}). On the other hand, if ℓ_j < n^{3δ_d/4}, then Z_j < n^{3δ_d/4} with probability 1. Hence, we have Pr_{ρ=(I,y)}[Z_j ≥ max{2qℓ_j, n^{3δ_d/4}}] ≤ exp(−2n^{ε_d}), and by a union bound, the probability that this occurs for some j ∈ [a] is at most a · exp(−2n^{ε_d}). We call a set I generic if |I| ≥ n_1 p/2 and fan-in(φ_i|_ρ) ≤ 2p · fan-in(φ_i) for each i ∈ L. Let G denote the event that I is generic. By (4.13) and (4.14), we know that G fails with exponentially small probability. In particular, similar to Theorem 4.2, we can fix any generic I such that (4.17) holds. Consider the restriction tree T that sets all the variables not in I. The tree leaves n_2 ≥ pn_1/2 = pn/4 variables unfixed. We call a leaf λ of the tree good if for each j ∈ [a] we have Z_j < max{2qℓ_j, n^{3δ_d/4}}, and bad otherwise. By (4.17), a random leaf is bad with probability at most 2a · exp(−2n^{ε_d}). (4.18) For good leaves λ, we show how to approximate C_λ := C_1|_{ρ_λ} as claimed in the lemma statement. For the remainder of the argument, fix any good leaf λ. We partition [a] = J_1 ∪ J_2, where J_1 consists of those j with 2qℓ_j ≥ n^{3δ_d/4} and J_2 of the rest. Note that for any j ∈ J_1, we have ℓ_j · n^{δ_d+(j−1)ε_d} ≤ n^{1+ε_d}, the total wire complexity of the circuit, since we have ℓ_j gates of fan-in at least n^{δ_d+(j−1)ε_d} each. In particular, we can bound the total fan-in of all the t-balanced gates indexed by ⋃_{j∈J_1} L_j as in (4.19).

Phase 3. We proceed in two steps as in Theorem 4.2. Since the steps are very similar, we just sketch the arguments. In the first step, we replace all large t-imbalanced gates by their most probable values. This yields a circuit C̃_λ of at most the wire complexity of C_λ and such that (4.20) holds. In the second step, we construct another restriction tree rooted at λ that simplifies the circuit to the required form. We first restrict by setting all variables that feed into the t-balanced gates that are indexed by ⋃_{j∈J_1} L_j.
By (4.19), the number of variables set in this way is at most n_2/2 for a large enough choice of the constant B. This sets all the t-balanced gates indexed by ⋃_{j∈J_1} L_j to constants while leaving n_3 ≥ n_2/2 variables still alive. Finally, by Proposition 4.8, we may set all but a set of n_4 = n_3/(2cn^{δ_d}) variables to ensure that with probability 1, all the small gates depend on at most one input variable each. We may replace the small gates by the unique variable they depend on, or by a constant (if they do not depend on any variable), without increasing the wire complexity of the circuit. Call the circuit thus obtained C′_λ. At this point, the only threshold gates at depth d − 1 in the circuit C′_λ are the t-balanced gates indexed by ⋃_{j∈J_2} L_j. But by the definition of J_2, there can be at most (1/ε_d) · n^{3δ_d/4} ≤ n^{δ_d} of them. For every setting of these threshold gates to constants, the circuit becomes a depth-(d − 1) circuit of wire complexity at most n^{1+ε_d}. Hence, we have an (n^{δ_d}, d − 1, n^{1+ε_d})-simple circuit, as claimed.
Note that the number of variables still surviving is n_4 ≥ pn/(16cn^{δ_d}) ≥ n^{1−2δ_d} for a large enough choice of the parameter B. Hence, the restriction tree constructed satisfies the required depth constraints.
For a random leaf ν ∼ T, the probability that E(ν) does not occur is at most the probability that the leaf sampled in Phase 2 is bad. By (4.18), this is bounded by 2a · exp(−2n^{ε_d}) ≤ exp(−n^{ε_d}), as claimed.
We now prove the correlation bound for the Generalized Andreev function against threshold circuits with small wire complexity. For the sake of induction, it helps to prove a statement that is stronger in two ways: firstly, we consider any function F_a = F(a, ·) where a ∈ {−1, 1}^{4n} has high Kolmogorov complexity, and the input to F_a is further restricted by an arbitrary restriction ρ that leaves a certain number of variables alive; secondly, we prove a correlation bound against circuits which are the AND of a small threshold circuit with a small number of threshold gates.

Definition 4.12 (Intractability). We say that f : {−1, 1}^n → {−1, 1} is (N, d, t, α)-intractable if for any restriction ρ on n variables that leaves m ≥ N variables unset, any depth-d threshold circuit C on m variables of wire complexity at most m^{1+ε_d}, and any set S of at most t threshold functions, we have Corr(f|_ρ, C ∧ ⋀_{g∈S} g) ≤ α.

The main theorem of this section is the following generalized correlation bound.
The proof is by induction on d. The properties of F a are only used to prove the base case of the theorem, which can then be used to prove the induction case using Lemma 4.11. We prove the base case separately below (we assume that the constant B > 0 is large enough so that this implies the base case of the theorem stated above).
Proof. Let γ < 1/6 be as in the definition of the Generalized Andreev function in Section 2.4. Let τ be any restriction of n variables leaving m ≥ √n variables unfixed. Define f := F_a|_τ. Let C be a conjunction of √n + 1 threshold gates, each on m variables. We wish to prove that Corr(f, C) ≤ exp(−n^{Ω(γ)}).
We build a restriction tree T for C of depth m − n^γ by restricting all but n^γ arbitrarily chosen variables. For any leaf ρ of T, the restricted circuit C′ := C|_ρ is a conjunction of √n + 1 threshold gates, each on n^γ variables. By Corollary 2.10, each threshold function can be described using n^{2γ} bits. Hence, the entire circuit can be described in a standard way using (√n + 1) · O(n^{2γ}) < n bits. Then, by Lemma 2.19, we have Corr(f|_ρ, C′) ≤ exp(−n^{Ω(γ)}).
Proof of Theorem 4.13. We only need to prove the inductive case. Assume that d ≥ 2 is given. Fix any restriction ρ that sets all but m ≥ n^{1−ε_d} variables and let f = F_a|_ρ. Let C be a depth-d threshold circuit on the surviving variables of wire complexity at most m^{1+ε_d}, and let S be any set of at most n^{ε_d} threshold functions on the m variables. We need to show that Corr(f, C ∧ ⋀_{g∈S} g) is small. We apply Lemma 4.11 to C to obtain a restriction tree T as guaranteed there. For a leaf λ of T, write f_λ, C_λ and g_λ for f|_{ρ_λ}, C|_{ρ_λ} and g|_{ρ_λ} respectively, and let E(λ) be the event defined in the statement of Lemma 4.11; by that lemma, E(λ) fails for a random leaf with exponentially small probability. (4.21) Fix any leaf λ so that E(λ) holds. We want to bound Corr(f_λ, C_λ ∧ ⋀_{g∈S} g_λ). (4.22) Further, by the definition of simplicity and its consequence (4.11), we know that there exist r ≤ m^{δ_d} threshold functions h_1, . . . , h_r such that the approximating circuit may be written as a disjoint OR, over settings b_1, . . . , b_r, of terms of the form C_{b_1,...,b_r} ∧ ⋀_{i∈[r]}(h_i = b_i), where each C_{b_1,...,b_r} is a depth-(d − 1) threshold circuit of wire complexity at most m^{1+ε_d}. This further implies that C_λ ∧ ⋀_{g∈S} g_λ is approximated by the disjoint OR of the terms C_{b_1,...,b_r} ∧ ⋀_{i∈[r]}(h_i = b_i) ∧ ⋀_{g∈S} g_λ, (4.23) and the OR remains disjoint. Note that we may apply the induction hypothesis to each term in the OR at this point, since the number of surviving variables is at least m^{1−2δ_d}, which is large enough (throughout, we assume that B is a large enough constant for many of the inequalities to hold); the wire complexity of each depth-(d − 1) circuit C_{b_1,...,b_r} is at most m^{1+ε_d}, which is small enough relative to the number of surviving variables; and further, the number of threshold functions in each term is at most n^{ε_d} + n^{δ_d} < m^{ε_{d−1}}. Thus, by the inductive hypothesis, we obtain a correlation bound for each choice of b_1, . . . , b_r. Using the fact that the OR in (4.23) is disjoint, from Proposition 4.10 we obtain a correlation bound for the whole OR. Putting the above together with (4.21) and (4.22), we obtain the bound claimed in the induction case, and hence the theorem.

Proof of Corollary 4.3. For a random a ∈ {−1, 1}^{4n}, we know by Fact 2.15 that K(a) ≥ 3n with probability 1 − exp(−Ω(n)). For each such a, by Theorem 4.13, we have Corr(C_a, F_a) ≤ exp(−n^{ε_d/2}), where C_a is the circuit obtained by substituting x_1 = a in C. Hence, the claimed bound follows.

Proof of Main Structural Lemma (Lemma 4.4)
We prove the following statement that was key in proving the results of Section 4. Proof overview. The proof follows a template that has been used in many results on threshold functions (see, e. g., [50,38,11,20,12,32]). The basic idea is to divide threshold gates into one of two kinds: threshold gates such that none of their variables have too much weight (called regular threshold functions below) and those where some variable has large weight. For regular gates, the lemma can be easily proved by appealing to standard results on anticoncentration of sums of independent random variables. For gates that are not regular, we first try to make them regular by setting a few variables. If we succeed, then we can appeal to the regular case and we will be done. Otherwise, it must be the case that there are many variables that have large weight in the threshold gate. In this case, we show that the random restriction, with high probability, sets a large number of these variables. Since these variables have large weight, this results in a potentially large shift in the threshold θ of the threshold gate, causing it to become imbalanced with high probability.
We now proceed with the formal details. We need the following definitions and facts that have appeared many times before in the literature on threshold functions (see, e.g., [11]).

Definition 5.1 (Regularity and critical index). Let ε ∈ [0, 1] be a real parameter. We say that w ∈ R^n is ε-regular if for each i ∈ [n], |w_i| ≤ ε · ‖w‖_2.
Assume that the coordinates of the vector w are sorted so that |w_1| ≥ |w_2| ≥ · · · ≥ |w_n|. Let w^{≥i} ∈ R^{n−i+1} denote the vector obtained by removing the first i − 1 coordinates of w. We define the ε-critical index of w to be the least K = K(ε) such that the vector w^{≥K+1} is ε-regular. Note that K = 0 if w is already ε-regular, and we define K = n if the ε-critical index is not otherwise defined.
We say that an n-variable threshold gate φ labeled by (w, θ ) is ε-regular if w is. Similarly, the ε-critical index of φ is defined to be the ε-critical index of w.
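The critical index is straightforward to compute; a short Python sketch (ours):

```python
import math

def critical_index(w, eps):
    """Least K such that the tail of |w| (sorted descending, first K
    entries removed) is eps-regular, i.e., every tail entry is at most
    eps times the tail's 2-norm.  Returns len(w) if no such K exists."""
    a = sorted((abs(x) for x in w), reverse=True)
    n = len(a)
    for K in range(n):
        tail_norm = math.sqrt(sum(x * x for x in a[K:]))
        # a[K] is the largest tail entry, so checking it suffices
        if a[K] <= eps * tail_norm:
            return K
    return n

K = critical_index([4, 2, 1, 1, 1, 1], 0.5)   # the first two weights are heavy
```

For the weight vector (4, 2, 1, 1, 1, 1) and ε = 1/2, the two heavy coordinates must be removed before the tail becomes regular, so K = 2; an already-regular vector such as (1, 1, 1, 1) has K = 0.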
Define the parameter L = L(ε) = (100 log²(1/ε))/ε². The Berry–Esseen theorem (see, e.g., [14]) yields the following standard anticoncentration lemma for linear functions. (See [11, Corollary 2.2] for this particular statement.)

Lemma 5.2 (Anticoncentration for regular linear functions). Let w ∈ R^n be ε-regular and let J ⊆ R be any interval. Then Pr_x[⟨w, x⟩ ∈ J] is within O(ε) of the Gaussian measure of J, where Φ(·) denotes the cdf of the Gaussian with mean 0 and variance ‖w‖²_2. In particular, if |J| denotes the length of J, then Pr_x[⟨w, x⟩ ∈ J] ≤ O(|J|/‖w‖_2 + ε).

We now proceed with the proof of Lemma 4.4. Throughout, we work with random restrictions sampled from R^n_p, where p ∈ [0, 1] is the probability from the statement of Lemma 4.4: equivalently, we pick a pair (I, y) where I ⊆ [n] and y ∈ {−1, 1}^{n−|I|}.
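For ε-regular weights the Gaussian approximation is already quite accurate at modest sizes. The Python check below (ours) takes w = 1^n with n = 100, which is 0.1-regular, and compares the exact probability of landing in an interval with the Gaussian estimate:

```python
import math

def interval_prob(n, lo, hi):
    """Exact Pr[<1^n, x> in [lo, hi]] for uniform x in {-1,1}^n.
    The sum equals 2B - n where B ~ Bin(n, 1/2)."""
    total = 0
    for b in range(n + 1):
        s = 2 * b - n
        if lo <= s <= hi:
            total += math.comb(n, b)
    return total / 2 ** n

def gaussian_prob(sigma, lo, hi):
    # Gaussian measure of [lo, hi] with mean 0, standard deviation sigma
    Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return Phi(hi / sigma) - Phi(lo / sigma)

n = 100                        # ||w||_2 = 10, so w is 0.1-regular
exact = interval_prob(n, -10, 10)
approx = gaussian_prob(10.0, -10, 10)
gap = abs(exact - approx)
```

The gap between the exact probability and the Gaussian estimate is well below the regularity parameter ε = 0.1, as the lemma predicts.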
Let the threshold gate φ be labeled by the pair (w, θ), where w ∈ R^n. We may assume that the variables of the threshold gate have been sorted so that |w_1| ≥ |w_2| ≥ · · · ≥ |w_n|. Note that after applying a restriction ρ = (I, y), the threshold gate φ|_ρ is labeled by the pair (w′, θ′), where w′ is the restriction of w to the coordinates in I and θ′ = θ − ⟨w″, y⟩, where we use w″ to denote the vector w restricted to the indices in [n] \ I. For a random restriction ρ ∼ R^n_p, define the following "bad" events: 1. B_t(ρ) (t a parameter): φ|_ρ is t-balanced, i.e., |θ′| ≤ t · ‖w′‖_2. This is the event whose probability we want to upper bound.
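The relabeling (w, θ) ↦ (w′, θ′) under a restriction can be sanity-checked exhaustively. The Python sketch below (ours, with made-up weights; we use the convention sign(⟨w, x⟩ − θ)) verifies that the restricted label agrees with the original gate on every completion of the live variables:

```python
from itertools import product

def restrict_gate(w, theta, I, y):
    """Apply the restriction rho = (I, y): variables in I stay alive, the
    rest are fixed to y (in index order).  Returns the label (w', theta')
    of phi|_rho, with theta' = theta - <w'', y>."""
    fixed = [i for i in range(len(w)) if i not in I]
    w_live = [w[i] for i in sorted(I)]
    theta_new = theta - sum(w[i] * yi for i, yi in zip(fixed, y))
    return w_live, theta_new

def gate(w, theta, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) - theta > 0 else -1

w, theta = [3, -2, 5, 1], 1
I, y = {0, 2}, (1, -1)           # fix x_1 = 1 and x_3 = -1
w2, theta2 = restrict_gate(w, theta, I, y)

# The restricted gate agrees with the original on every completion.
agree = all(
    gate(w2, theta2, z) == gate(w, theta, (z[0], 1, z[1], -1))
    for z in product((-1, 1), repeat=2)
)
```

Here the fixed weights contribute ⟨w″, y⟩ = −3, so the new threshold is θ′ = 4 and the two gates agree on all four completions.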
3. B²_{k,ℓ}(ρ) (k, ℓ parameters): I contains at least k variables among the first ℓ variables x_1, . . . , x_ℓ. We have the following upper bounds on the probabilities of some of the above bad events. Proof. The proof follows by applying the variant of the Chernoff bound in Theorem 2.23. Since each variable is added to I independently with probability p, we have E[‖w′‖²_2] = p · ‖w‖²_2. Applying Theorem 2.23 to the Boolean random variables X_i, which take value 1 iff x_i ∈ I, and to X = ∑_i w_i² X_i, we obtain the following bound.
where for the last inequality we have used the fact that $w$ is $\varepsilon$-regular and hence $w_i^2 \leq \varepsilon^2 \|w\|_2^2$ for each $i$.

Proof. The probability that any specific set of $k$ variables is a subset of $I$ is $p^k$. By a union bound, for any $\ell$, we have
$$\Pr_\rho\big[B^2_{k,\ell}(\rho)\big] \leq \binom{\ell}{k} p^k \leq \left(\frac{e p \ell}{k}\right)^k.$$

We start with a simpler subcase of the lemma that follows almost directly from Lemma 5.2.
Lemma 5.5 (The regular case). Say that $w$ is $\varepsilon$-regular for some $\varepsilon \leq 1/(16\log(1/p))$. Then

Proof. We bound $\Pr_\rho[B_t(\rho)]$ as follows.
where the bound on $\Pr[B^1(\rho)]$ follows from Claim 5.3. Now, note that the event $\neg B^1(\rho)$ only depends on the choice of $\rho^{-1}(*) = I$. Hence we can condition on an $I$ so that this event occurs; choosing $\rho$ is now equivalent to choosing a random assignment $y$ to the variables in $[n] \setminus I$.
We have $\theta' = \theta - \langle w'', y \rangle$. Using the fact that $B^1(\rho)$ doesn't occur, we have $\|w''\|_2^2 = \|w\|_2^2 - \|w'\|_2^2 \geq (1 - 4p)\|w\|_2^2 \geq \|w\|_2^2/4$. Using the $\varepsilon$-regularity of $w$, for each $i \notin I$, we have $|w_i| \leq \varepsilon \|w\|_2 \leq 2\varepsilon \|w''\|_2$. Thus, $w''$ is $2\varepsilon$-regular.
• $\|w'\|_2 \leq 2p^{1/2} \|w\|_2 \leq 4p^{1/2} \|w''\|_2$.

Using the above, we can see that the probability that $|\theta'| \leq t\|w'\|_2$ is at most the probability that $\langle w'', y \rangle$ falls in an interval $J$ of length $2t\|w'\|_2 \leq 8t p^{1/2} \|w''\|_2$, which is at most $O(t p^{1/2} + \varepsilon)$, where the final inequality uses the anti-concentration bound in Lemma 5.2. Putting the above together with (5.2), we are done.
Proof of Lemma 4.4. The proof of the lemma is a standard case analysis based on the $\varepsilon$-critical index of the threshold gate $\varphi$ (see [50, 38, 11, 32]). Let $\varepsilon = p^{1/8}$ and $t = p^{-1/16}$. The parameter $p_0$ will be chosen in the proof below.

The first case is when the critical index $K \leq L$. In this case, we bound the probability of $B_t(\rho)$ by
$$\Pr_\rho[B_t(\rho)] \leq \Pr_\rho\big[B_t(\rho) \mid \neg B^2_{1,K}(\rho)\big] + \Pr_\rho\big[B^2_{1,K}(\rho)\big] \leq \Pr_\rho\big[B_t(\rho) \mid \neg B^2_{1,K}(\rho)\big] + epK \leq \Pr_\rho\big[B_t(\rho) \mid \neg B^2_{1,K}(\rho)\big] + \sqrt{p}, \quad (5.1)$$
where the second inequality follows from Claim 5.4 and the final inequality follows from the fact that $epK \leq epL \leq \sqrt{p}$ by our choice of parameters. The event $\neg B^2_{1,K}(\rho)$ only depends on the choice of the sub-restriction $\rho|_{[K]}$ and we can condition on $\rho|_{[K]}$ so that this event occurs. From now on, the random choice will be a restriction $\rho' \sim \mathcal{R}^{n-K}_p$ on the remaining variables. Since the restricted linear function is now $\varepsilon$-regular by the definition of the $\varepsilon$-critical index, we can apply Lemma 5.5 to conclude that the conditional probability of $B_t$ is suitably small (note that $\varepsilon = p^{1/8} \leq 1/(16\log(1/p))$ as long as $p$ is smaller than some absolute constant $p_0$, so that Lemma 5.5 is applicable). Along with (5.1), this implies the lemma in the case that $K \leq L$.

The second case is when $K > L$. As in previous cases, we first condition on some bad event not occurring. We have

As above, we can condition on a fixed $I$ so that $B^2_{1,L}(\rho)$ does not occur (i.e., none of the first $L$ variables belong to $I$). We then use the following claim that is implicit in [11].

THEORY OF COMPUTING, Volume 14 (9), 2018, pp. 1-55

Proposition 5.6. Assume that $L' = (10 r \log(1/\varepsilon))/\varepsilon^2$ and that the $\varepsilon$-critical index $K > L'$. Let $y$ be a random assignment to any set of variables including the first $L'$ variables. Then, the probability over $y$ that the restricted threshold gate is not $(1/\varepsilon)$-imbalanced is at most $2^{-r}$.
Applying the above proposition with $L' = L$ and $r = 10\log(1/\varepsilon)$, we have

Putting this together with (5.4), we have the claimed upper bound on $\Pr_\rho[B_t(\rho)]$ in the case that $K > L$.
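The restriction process and the bad event $B_t$ analyzed above can be made concrete. The following sketch (with our own, hypothetical function names) samples $\rho \sim \mathcal{R}^n_p$ and tests $t$-balance of the restricted gate; it illustrates the definitions, not the proof itself.

```python
import math
import random

def sample_restriction(n, p, rng):
    """Sample rho = (I, y) ~ R^n_p: each variable independently stays free
    (joins I) with probability p, else receives a uniform value in {-1, 1}."""
    I, y = [], {}
    for i in range(n):
        if rng.random() < p:
            I.append(i)
        else:
            y[i] = rng.choice((-1, 1))
    return I, y

def is_t_balanced(w, theta, I, y, t):
    """The bad event B_t: |theta'| <= t * ||w'||_2, where theta' = theta - <w'', y>,
    w' is w restricted to I, and w'' is w restricted to the assigned variables."""
    theta_prime = theta - sum(w[i] * y[i] for i in y)
    norm_w_prime = math.sqrt(sum(w[i] ** 2 for i in I))
    return abs(theta_prime) <= t * norm_w_prime
```

Estimating $\Pr_\rho[B_t(\rho)]$ by repeated sampling for a regular $w$ reproduces, at small scale, the kind of bound the lemma establishes.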
For completeness, we give below a proof sketch of Proposition 5.6.

Satisfiability algorithms beating brute-force search
In this section, we give satisfiability algorithms beating brute-force search for bounded-depth threshold circuits with few wires. Until now, such algorithms were only known for threshold circuits of depth 2. We will assume that each threshold gate on $m$ input bits is given as a pair $(w, \theta)$, where $w \in \mathbb{Z}^m$ and $\theta \in \mathbb{Z}$, and $\theta$ as well as each component of $w$ has bit complexity $\mathrm{poly}(n)$. Note that this assumption is without loss of generality for a threshold function, and that some assumption on the representability of threshold functions is necessary in an algorithmic context. The satisfiability algorithm relies on an algorithmic version of Lemma 4.11, along with a couple of additional ideas. Essentially, we use the algorithmic version of the lemma to reduce satisfiability of bounded-depth circuits to satisfiability of ANDs of threshold functions, which we can then solve using a recent result of Williams, stated below.
Theorem 6.1 ([54]). There is a deterministic algorithm which, given a bounded-depth circuit $C$ on $n$ variables of size $2^{n^{o(1)}}$ with ANDs, ORs and threshold gates, and with the threshold gates appearing only at the bottom layer, decides if $C$ is satisfiable in time $2^{n - n^{\varepsilon}} \mathrm{poly}(n)$, where $\varepsilon > 0$ is a constant that depends only on the depth of the circuit.
We also need the following fact about threshold gates on $n$ input bits: the set of inputs evaluating to $1$ (and dually, the set of inputs evaluating to $-1$) of a linear threshold gate can be enumerated in time proportional to the number of such inputs, up to a $\mathrm{poly}(n)$ factor.

Proposition 6.2. Let $(w, \theta)$ represent a threshold function $\varphi$ on $m$ input bits, where $w \in \mathbb{Z}^m$ and $\theta \in \mathbb{Z}$ are integers of bit complexity $\mathrm{poly}(m)$. Let $S$ be the set of inputs on which $\varphi$ evaluates to $1$. Then $S$ can be enumerated in time $|S| \cdot \mathrm{poly}(m)$.
Proof. We will show how to construct a decision tree for φ in time |S|poly(m), where S is the set of inputs on which φ takes value 1. Given a decision tree of size at most |S|poly(m), it is easy to enumerate the set of inputs on which φ takes value 1 in time |S|poly(m) by scanning through leaves labeled 1 and outputting all assignments corresponding to any such leaf.
The decision tree is constructed recursively as follows. Check if $\varphi$ restricted according to the current partial assignment is satisfiable (in the sense that there is a total assignment consistent with the partial assignment for which $\varphi$ evaluates to $1$). Note that satisfiability of a linear threshold gate whose weights have polynomial bit complexity can be decided trivially in polynomial time: simply set each unassigned variable to the sign of its weight. If the satisfiability check fails, make the current node a leaf and label it with $-1$. If it succeeds, check if $\varphi$ under the current partial assignment is falsifiable. If this check fails, make the current node a leaf and label it with $1$. Otherwise, branch on an arbitrary unassigned variable and recurse.
Clearly, this decision tree can be constructed with polynomial work at each node, and hence in time $N \cdot \mathrm{poly}(m)$, where $N$ is the number of leaves of the tree. We show that $N \leq |S| \cdot m$. Indeed, we prove inductively that for any internal node $v$ of the tree at height $h \geq 1$, the number of $-1$ leaves of the subtree rooted at $v$ is at most $h$ times the number of $1$ leaves, from which the claim follows as the height of the tree is at most $m$.
For the inductive claim, the base case $h = 1$ is clear, as any node at height $1$ must have one leaf labeled $1$ and the other labeled $-1$. Assume the claim for height $h$ and consider a node $v$ at height $h + 1$. Either one of its children is a leaf, or not. If one of the children is a leaf, then the other child $v'$ is not, and by the induction hypothesis, since $v'$ is of height at most $h$, it has at most $h$ times as many $-1$ leaves as $1$ leaves. The number of $-1$ leaves of $v$ is at most one plus the number of $-1$ leaves of $v'$, and hence at most $h + 1$ times the number of $1$ leaves. In case both children of $v$ are internal nodes, they are both of height at most $h$, and by the induction hypothesis, both have at most $h$ times as many $-1$ leaves as $1$ leaves, which implies that the same holds for $v$.

Definition 6.3. We use THR to refer to the class of linear threshold functions. We use AND • THR to refer to the class of polynomial-size circuits with an AND gate at the top and threshold gates at the bottom layer.
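The recursive construction in the proof of Proposition 6.2 can be sketched as follows, under the sign convention $\varphi(x) = 1$ iff $\langle w, x \rangle \geq \theta$ (the convention, and all names, are ours). The satisfiability and falsifiability checks are the trivial ones from the proof: compare $\theta$ against the maximum and minimum achievable values of the linear form.

```python
from itertools import product

def ltf(w, theta, x):
    """phi(x) = 1 iff <w, x> >= theta (sign convention fixed for this sketch)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else -1

def enumerate_ones(w, theta):
    """Output-sensitive enumeration of S = {x : phi(x) = 1}, following the
    recursive decision-tree construction in the proof of Proposition 6.2."""
    m = len(w)

    def recurse(prefix):
        fixed = sum(wi * xi for wi, xi in zip(w, prefix))
        rest = range(len(prefix), m)
        hi = fixed + sum(abs(w[i]) for i in rest)   # max achievable value
        lo = fixed - sum(abs(w[i]) for i in rest)   # min achievable value
        if hi < theta:          # unsatisfiable: a leaf labeled -1
            return
        if lo >= theta:         # unfalsifiable: a leaf labeled 1; emit its inputs
            for tail in product((-1, 1), repeat=m - len(prefix)):
                yield prefix + list(tail)
            return
        for b in (-1, 1):       # branch on the next unassigned variable
            yield from recurse(prefix + [b])

    return list(recurse([]))
```

For instance, `enumerate_ones([2, 1], 2)` returns the single satisfying input `[1, 1]`, and the recursion visits only polynomially more nodes than it emits leaves, matching the leaf-counting argument above.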
Theorem 6.4. For each integer $d > 0$, there is a constant $\varepsilon_d > 0$ such that satisfiability of a depth-$d$ threshold circuit with at most $n^{1+\varepsilon_d}$ wires on $n$ variables can be solved by a randomized algorithm in time $2^{n - \Omega(n^{\varepsilon_d})} \mathrm{poly}(n)$.
Proof. As the proof follows the proof of Lemma 4.11 closely, we just give a sketch. Call a depth-$d$ circuit AND • THR-skew if the top gate is an AND and all but at most one child of the top gate are bottom-level threshold gates, with the possibly exceptional child being a depth-$(d-1)$ threshold circuit with few wires. We follow the depth-reduction argument in the lemma to give a recursive algorithm which reduces satisfiability of polynomial-size depth-$d$ AND • THR-skew circuits to satisfiability of polynomial-size depth-$(d-1)$ AND • THR-skew circuits by appropriately restricting variables.
For the base case $d = 1$, we simply appeal to the algorithm given by Theorem 6.1, which solves satisfiability of AND • THR circuits of polynomial size in time $2^{n - n^{\varepsilon}} \mathrm{poly}(n)$ for some constant $\varepsilon > 0$.
For the inductive case, we simulate the proof of Lemma 4.11, which performs and analyzes a certain kind of adaptive random restriction. Various bad events might happen in Phases 2 and 3 of this random restriction process; however, each step of the restriction process, as well as the check that a bad event has happened, can be implemented in polynomial time. Moreover, the probability that a bad event happens is at most $2^{-n^{\varepsilon_d}}$. Whenever a bad event happens, we simply do brute-force search on the remaining variables of the circuit; thanks to the exponentially small probability of a bad event, with high probability we only spend time $2^{n - n^{\varepsilon_d}}$ on such brute-force searches.
In Phase 3 of the restriction process, we replace imbalanced gates by their most probable values. This changes the functionality of the circuit and might lose us satisfying assignments or give us new invalid satisfying assignments. To guard against the former, for each imbalanced gate, we use Proposition 6.2 to efficiently enumerate the inputs evaluating to the minority value of that gate, and for each such input we check whether it satisfies the original circuit. If it does, we just output "yes." To guard against invalid assignments, we also append to the top gate of the skew circuit a child representing the assignment of the imbalanced gate to its majority value; this needs to be done so that we don't end up with "false positives" in the base case of the recursive algorithm. Although each such false positive could be tested individually, there might be too many of them, and this would destroy all the savings we accrue through the course of the algorithm. The total time spent enumerating minority values of imbalanced gates is again at most $2^{n - n^{\varepsilon_d}} \mathrm{poly}(n)$, with high probability, using the efficient enumeration and the imbalance property.
Finally, there are a few balanced gates, with high probability at most $O(n^{\delta_d})$ of them, for which we need to try all possible values. This could be expensive, but is compensated for by an increased savings at depth $d - 1$, obtained by setting the constant $B$ large enough in the proof of Lemma 4.11. We also need to set $B$ large enough so that the savings given by the application of Williams' algorithm in the base case overwhelms the loss due to branching on balanced threshold gates at depth $d = 2$.

Threshold formulas
A threshold formula is a threshold circuit in which the fan-out of each gate is at most $1$. A formula can be viewed as a rooted tree. Note that a depth-2 threshold circuit can always be converted to a threshold formula without increasing either the wire complexity or the gate complexity (recall that the gate complexity only measures the number of non-input gates).

Lemma 7.1. The number of distinct Boolean functions on $n$ variables computed by threshold formulas with at most $s$ wires is $2^{O(ns)}$.

Proof. A formula with $s$ wires is a rooted tree with $s + 1$ nodes. The number of different rooted trees with $s + 1$ nodes is $2^{O(s)}$ [39]. For a fixed tree structure, we label the leaves by variables $x_1, \ldots, x_n$, and label the internal nodes by LTFs. Since the number of leaves is at most $s$, there are $O(n^s)$ different ways of labeling the leaves.
We next label the internal nodes. Without loss of generality, we assume that leaves feeding into the same node are labeled by distinct variables, and that each internal node has fan-in at least two. Consider a topological order of the internal nodes, say $h_1, \ldots, h_k$, where $k \leq s$. For $i = 1, \ldots, k$, let $s_i$ be the fan-in of $h_i$; then $\sum_i s_i = s$. We label each $h_i$ by an LTF on $s_i$ inputs. For fixed labels of the $h_j$ with $1 \leq j < i$, by Theorem 2.9, there are $2^{O(n s_i)}$ possible choices for $h_i$. Therefore, the number of different functions computed by threshold formulas with $s$ wires is at most
$$2^{O(s)} \cdot O(n^s) \cdot \prod_{i=1}^{k} 2^{O(n s_i)} = 2^{O(ns)}.$$

The main result of this section is the following.
Theorem 7.2. Let $\gamma > 0$ be any constant parameter. Any threshold formula on $5n$ variables with at most $n^{2-3\gamma}$ wires has correlation at most $\exp(-n^{\Omega(\gamma)})$ with the Generalized Andreev function, as defined in Section 2.4, with parameter $\gamma$.
Proof. Let $C$ be a threshold formula with $n$ inputs and $s = n^{2-3\gamma}$ wires. Let $L$ be the number of leaves in the formula tree; then $L \leq s \leq 2L$. We build a restriction tree $T$ for $C$ up to depth $n - pn$, for $p = n^{\gamma}/n$, by greedily restricting the variable occurring most frequently among the leaves. When $m$ variables remain, the most frequent variable appears in at least a $1/m$ fraction of the leaves, so restricting it leaves at most a $(1 - 1/m)$ fraction of the leaves. Continuing until only $pn$ variables are left unrestricted, the number of remaining leaves is at most
$$L \cdot \prod_{m = pn+1}^{n} \left(1 - \frac{1}{m}\right) = L \cdot \frac{pn}{n} = Lp \leq sp = n^{1-2\gamma}.$$
Thus, for any leaf $\ell$ of $T$, the restricted formula $C|_{\rho_\ell}$, on $n' := pn = n^{\gamma}$ variables, has at most $n^{1-2\gamma}$ wires. Let $F_a(x) := F(a, x)$. Then, by Lemma 2.19, $\mathrm{Corr}(F_a|_{\rho_\ell}, C|_{\rho_\ell}) \leq \exp(-n^{\Omega(\gamma)})$.
Note that this holds for every leaf $\ell$ of $T$. By Fact 2.3, $\mathrm{Corr}(F_a, C) \leq \exp(-n^{\Omega(\gamma)})$.
Let $D$ be a threshold formula with $5n$ inputs and $n^{2-3\gamma}$ wires, and let $D_a(x) := D(a, x)$. Then $D_a$ is a formula on $n$ inputs with at most $n^{2-3\gamma}$ wires, and thus $\mathrm{Corr}(F_a, D_a) \leq \exp(-n^{\Omega(\gamma)})$ whenever $K(a) \geq 3n$. Since a random $a \in \{-1,1\}^{4n}$ has $K(a) \geq 3n$ with probability $1 - 2^{-\Omega(n)}$, the correlation of $D$ and $F$ is at most $2^{-\Omega(n)} + \exp(-n^{\Omega(\gamma)}) = \exp(-n^{\Omega(\gamma)})$.
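The leaf-counting step in the proof above telescopes exactly: restricting the most frequent variable when $m$ variables remain shrinks the leaf count by a factor of at most $(1 - 1/m)$, and the product of these factors over $m = n, \ldots, pn+1$ is exactly $pn/n = p$. A small sketch (ours) verifying this:

```python
def remaining_leaf_fraction(n, unrestricted):
    """Multiply the worst-case shrinkage factors (1 - 1/m) as the greedy
    procedure restricts one variable at a time, from n variables down to
    `unrestricted` of them; the product telescopes to unrestricted / n."""
    frac = 1.0
    for m in range(n, unrestricted, -1):   # m = n, n-1, ..., unrestricted + 1
        frac *= 1.0 - 1.0 / m
    return frac

n, keep = 1000, 10    # e.g., keep = pn unrestricted variables
print(remaining_leaf_fraction(n, keep))   # equals keep/n = 0.01 up to float error
```

So a formula with $L$ leaves retains at most $Lp$ leaves after the greedy restriction, as used in the proof.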

AC$^0$ circuits with a few threshold gates
In this section, we extend the noise sensitivity and correlation bounds from Section 3 to the more general setting of small AC$^0$ circuits (i.e., constant-depth circuits made up of AND and OR gates) augmented with a small number of threshold gates.
We prove noise sensitivity bounds for Boolean functions computed by such circuits. As consequences of this, we are able to prove correlation bounds against such circuits (as in Section 3) and also devise learning algorithms for such circuits under the uniform distribution (as in [28,17]).
Following Gopalan and Servedio [17], we define TAC$^0[k]$ to be the class of constant-depth circuits made up of AND and OR gates and at most $k$ arbitrary threshold gates. The inputs to the circuit are allowed to be arbitrary literals over the underlying variables.
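Recall that $\mathrm{NS}_q(f)$ is the probability that $f(x) \neq f(y)$, where $x$ is uniform and $y$ flips each bit of $x$ independently with probability $q$. For intuition on the quantities bounded in this section, the following sketch (ours) computes $\mathrm{NS}_q$ exactly by enumeration for small $n$; for Parity, the answer is $(1 - (1-2q)^n)/2$.

```python
from itertools import product

def noise_sensitivity(f, n, q):
    """Exact NS_q(f) = Pr[f(x) != f(y)] for x uniform on {-1,1}^n and y
    obtained from x by flipping each coordinate independently w.p. q."""
    total = 0.0
    for x in product((-1, 1), repeat=n):
        for flips in product((0, 1), repeat=n):
            weight = 1.0
            for b in flips:
                weight *= q if b else (1 - q)
            y = tuple(xi * (-1 if b else 1) for xi, b in zip(x, flips))
            total += weight * (f(x) != f(y))
    return total / 2 ** n

def parity(x):
    """Parity in the {-1,1} convention: 1 iff an odd number of -1 inputs."""
    return 1 if sum(1 for v in x if v == -1) % 2 else -1
```

For example, `noise_sensitivity(parity, 3, 0.1)` evaluates to $(1 - 0.8^3)/2 = 0.244$ up to floating-point error, illustrating why Parity is the natural hard function for noise-sensitivity-based correlation bounds.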
We prove upper bounds on the noise sensitivity of small depth-$d$ TAC$^0[k]$ circuits for $k$ much smaller than $n^{1/2(d-1)}$. The main theorem of the section is the following. This implies a correlation bound for the Parity function against such circuits, as in Section 3 (see Corollary 8.4 below). Using Theorem 8.1 along with a general idea due to Klivans et al. [28], we also get the following subexponential-time (i.e., $2^{o(n)}$-time) learning algorithms for TAC$^0[k]$ circuits of small size.

Lemma 8.3 (Kane [26]). Let $f$ be a degree-$D$ polynomial threshold function (PTF). Then, for any $q \in (0, 1/2]$, $\mathrm{NS}_q(f) \leq \alpha(q, D)$.

Proof of Theorem 8.1. This is a standard switching argument (see, e.g., [21]) augmented with the ideas of Theorem 3.2. We assume throughout that $q \leq 1/2$ without loss of generality, since otherwise $\alpha(q, D) \geq q \geq 1/2$ and the claim is trivial. We say that a threshold gate is a true threshold gate if it is not an AND or OR gate. For any parameters $k_1, d_1, t_1, s_1 \in \mathbb{N}$ with $d_1 \geq 2$, we define TAC$^0[k_1, d_1, t_1, s_1]$ to be the class of constant-depth circuits made up of AND, OR and threshold gates such that:

• the overall depth is at most $d_1$,
• the total number of gates at depth at most $d_1 - 2$ in the circuit is at most $s_1$,
• all the true threshold gates are at depth at most $d_1 - 2$ and there are at most $k_1$ of them, and
• the bottom fan-in of the circuit (i.e., the maximum fan-in of a gate at depth $d_1 - 1$) is at most $t_1$.
Note that the circuit $C$ in the statement of the theorem is in the class TAC$^0[k, d+1, 1, M]$, since we may replace the input literals with (say) AND gates of fan-in $1$, at the expense of increasing the depth by $1$, but in the process satisfying all the criteria of the above definition. We prove the following stronger statement: for any $p, q, D$ as in the statement of the theorem, and any $C$ from the class TAC$^0[k, d, D, M]$ with $d \geq 2$, we have

where $\rho_d \sim \mathcal{R}^n_{p_d}$ and $p_d := 2p^{d-2}q \in [0, 1]$. Proving (8.1) will clearly prove the theorem. The proof is by induction on $d$.

The base case is $d = 2$. In this case, since there are no true threshold gates at depth $d - 1$ by assumption, a true threshold gate can only occur as the output gate of the circuit $C$. Since AND and OR gates are also threshold gates, we can assume that the output gate is a threshold gate. The bottom fan-in being at most $D$ implies that each gate at depth $1$ can be represented exactly as a polynomial of degree at most $D$, and therefore that the function computed by $C$ is a degree-$D$ PTF. Hence, Lemma 8.3 trivially implies the result.

Now assume $d > 2$. Let $\psi_1, \ldots, \psi_s$ denote the AND and OR gates at depth exactly $d - 2$ in the circuit and let $\varphi_1, \ldots, \varphi_m$ denote the true threshold gates. By assumption, $m \leq k$ and $s \leq M$. We sample a random restriction $\rho \sim \mathcal{R}^n_p$ and consider the restricted circuit $C|_\rho$. Håstad's switching lemma [21] tells us that for each $i \in [s]$, we have

and hence by a union bound,

Also, as in the base case, we see that each $\varphi_j$ computes a degree-$D$ PTF.
Hence, Lemma 8.3 gives us Consider the circuit C ρ obtained from C| ρ as follows: if there is an i ∈ [s] such that DT-depth(ψ i | ρ ) ≥ D, then C ρ is defined to be a trivial circuit that always outputs 1; otherwise, C ρ is the depth-(d − 1) circuit obtained from C| ρ as follows: • We replace each φ j | ρ by a bit b j,ρ ∈ {−1, 1} so that by Fact 2.5, we have • Since each ψ i | ρ is a depth-D decision tree, we can write it as a D-DNF or D-CNF or as a disjoint sum of terms of size at most D each. For each gate χ at depth at most d − 3 that takes ψ i as an input, we do the following: -If χ is an OR gate, then we take the D-DNF representing ψ i | ρ and feed the terms of the DNF directly into χ, eliminating the output OR gate of the D-DNF.
-If χ is an AND gate, we do the same as above, except that we use the D-CNF representation of ψ i | ρ and eliminate the output AND gate.
-If χ is a threshold gate, then we write ψ i | ρ as a disjoint sum of terms of size at most D each and feed each of the terms directly to χ. The gate χ now has many inputs in the place of ψ i | ρ , and the weight given to each of these inputs is the same as the weight given to ψ i | ρ .
Note that the above operations do not increase the number of gates at depth at most d − 3 in the circuit.
Note that $C_\rho$ has depth $d - 1$ and bottom fan-in at most $D$. Further, the number of gates at depth at most $d - 3$ in $C_\rho$ is at most $M - s$. Hence, $C_\rho$ is a circuit from the class TAC$^0[k - m, d - 1, D, M]$. We can thus apply the induction hypothesis and obtain

To obtain (8.1), we use

where the inequality follows from Proposition 3.3. Inequality (8.5) allows us to bound the first term on the right-hand side. It remains to analyze the last term on the right-hand side of (8.6). Define a Boolean random variable $Z = Z(\rho)$ which is $1$ iff there is an $i \in [s]$ such that $\psi_i|_\rho$ is not a depth-$D$ decision tree. Let $\Delta = \Delta(\rho)$ be the random variable defined by

It easily follows from the definition of $C_\rho$ that for any choice of $\rho$, either $Z = 1$, in which case we can trivially bound $\delta(C_\rho, C|_\rho)$ by $1$, or

Hence, for any choice of $\rho$, we get

Corollary 8.4. Then $\mathrm{Corr}(C, \mathrm{Par}_n) \leq n^{o(1)} \cdot \delta^{1 - (1/d)}$. In particular, if $\delta = n^{-\Omega(1)}$, then $\mathrm{Corr}(C, \mathrm{Par}_n) = n^{-\Omega(1)}$.

Of course, we need to be judicious in our choice of constants in the $O(\cdot)$. We leave this matter to the interested reader.
Proof. We choose $D$ such that $\omega(1) \leq D \leq o(\log n/\log\log n)$, so that $M \leq n^{o(D)}$, and $p, q$ as in Corollary 3.1. We can then use Theorem 8.1 to obtain

Thus, we get $\mathrm{NS}_{1/n}(C) \leq n^{o(1)} \delta^{1-(1/d)}$. By Proposition 2.6, we have $\mathrm{Corr}(C, \mathrm{Par}_n) \leq O(\mathrm{NS}_{1/n}(C))$, which proves the claim.
Remark 8.5. The above corollary can be strengthened considerably if a widely believed strengthening of Lemma 8.3, namely the Gotsman-Linial conjecture [18], is shown to hold. The Gotsman-Linial conjecture is a conjecture about the average sensitivity of low-degree PTFs. We do not recall the exact statement of the conjecture here, and refer the reader to the article by Gopalan and Servedio [17] instead. As noted by [17, Corollary 13], the Gotsman-Linial conjecture implies a corresponding noise-sensitivity bound for any $p$ and any degree. This would be almost a complete generalization of the result of Beigel [6], who proved such a result in the setting where all the threshold gates have polynomial weight. In contrast, the results of Podolskii [43] and Gopalan and Servedio [17] can prove such correlation bounds only if $k < \log n$.

Learning algorithms for TAC 0 [k] circuits (Theorem 8.2)
To prove Theorem 8.2, we use Theorem 8.1 along with an observation of Klivans, O'Donnell, and Servedio [28]. We have the following lemma that can be obtained by putting together Fact 9 and Corollary 15 in [28].
Lemma 8.6. Let $\mathcal{F}$ be a class of Boolean functions defined on $\{-1,1\}^n$. Assume that for some $\varepsilon > 0$ there is a $\gamma > 0$ such that $\mathrm{NS}_\gamma(f) \leq \varepsilon/3$ for every $f \in \mathcal{F}$. Then, there is an algorithm that learns $\mathcal{F}$ to error $\varepsilon$ under the uniform distribution in time $n^{O(1/\gamma)}$.
We now prove Theorem 8.2.
Proof of Theorem 8.2. We can assume that $\varepsilon \geq 1/n^{1/2d}$, since otherwise we can just run a brute-force algorithm that takes time $2^{O(n)}$. We choose $D$ such that $\omega(1) \leq D \leq o(\log n/\log\log n)$, so that $M = n^{o(D)}$. Theorem 8.1 tells us that for any $p, q \geq 1/n$ and any $C$ from the class of circuits described in the theorem statement, we have

where $A$ and $B$ are $n^{o(1)}$. We choose $p, q$ so that the first two terms above are each bounded by $\varepsilon/10$. This requires $p \leq \varepsilon^2/O(k^2 A^2)$ and $q \leq \varepsilon^2/O(B^2)$. Further, to ensure that the last term is at most $\varepsilon/10$, it suffices to choose $p \leq n^{-\Omega(1)}$ (in fact, this ensures that the third term is $n^{-\omega(1)}$, whereas $\varepsilon \geq n^{-1/2d}$ by assumption). Thus, we fix $p = \min\{\varepsilon^2/O(k^2 A^2), n^{-1/4d}\}$ and $q = \varepsilon^2/O(B^2)$, so that all the above conditions are satisfied. This gives $\mathrm{NS}_\gamma(C) \leq \varepsilon/3$ where $\gamma = p^{d-1} q$. Hence, by Lemma 8.6, we obtain the statement of the theorem.
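The algorithm behind Lemma 8.6 is the low-degree algorithm of Linial, Mansour, and Nisan [30]: estimate all Fourier coefficients of $f$ up to degree $O(1/\gamma)$ and output the sign of the truncated expansion. The following toy sketch (ours) computes the coefficients exactly from the truth table instead of estimating them from samples, which suffices to illustrate how the hypothesis is built.

```python
from itertools import combinations, product
from math import prod

def low_degree_learn(f, n, degree):
    """Toy low-degree algorithm: compute (exactly, rather than by sampling)
    every Fourier coefficient of f up to the given degree, and return the
    sign of the truncated Fourier expansion as the hypothesis."""
    points = list(product((-1, 1), repeat=n))
    coeffs = {}
    for d in range(degree + 1):
        for S in combinations(range(n), d):
            # hat{f}(S) = E_x[f(x) * chi_S(x)], with chi_S(x) = prod_{i in S} x_i
            coeffs[S] = sum(f(x) * prod(x[i] for i in S) for x in points) / len(points)

    def hypothesis(x):
        val = sum(c * prod(x[i] for i in S) for S, c in coeffs.items())
        return 1 if val >= 0 else -1

    return hypothesis
```

For Majority on 3 bits, even the degree-1 truncation already recovers the function exactly; low noise sensitivity is what guarantees, in general, that low-degree truncation loses little.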
Remark 8.7. Assuming the Gotsman-Linial conjecture, the above technique yields subexponential-time constant-error learning algorithms as long as $M \leq 2^{n^{o(1)}}$ and $\delta = n^{-\Omega(1)}$. To contrast again with the work of Gopalan and Servedio [17], the results of [17], even assuming the Gotsman-Linial conjecture, only yield subexponential-time learning algorithms in the setting where $k < \log n$. However, the dependence on the error parameter in [17] is better than the dependence we obtain here. (The running time there has an $\varepsilon^3$ in place of the $\varepsilon^{2d}$ that we obtain here.)

Depth-3 threshold circuit lower bounds
In this section, we compare our lower bound techniques for threshold functions with the simultaneous independent work of Kane and Williams [27], which proves improved lower bounds for depth-3 threshold circuits. While both results analyze the effects of random restrictions on threshold gates, the random restriction lemma obtained in [27] is different from ours. It thus makes sense to ask if our techniques can also be used to prove the lower bounds of [27]. We show that a modified version of our structural lemma (Lemma 4.4) can be used to recover the lower bound of [27] up to n o(1) factors.
The modified structural lemma is as follows. We say that a threshold gate $\varphi$ with label $(w, \theta)$ is $(t, k)$-imbalanced if there is a set $S$ of at most $k$ input variables of $\varphi$ such that for each setting of these variables, the resulting threshold gate is $t$-imbalanced. We prove the following modification of Lemma 4.4.
Lemma 9.1. Let $\varphi$ be a threshold gate on $n$ variables with label $(w, \theta)$. For a parameter $p \in (0,1)$ and any $t, k \in \mathbb{N}$ such that $t = (1/p)^{o(1)}$ and $k = \omega(1)$ (i.e., $\log_{1/p}(t) \to 0$ and $k \to \infty$ as $p \to 0$), we have
$$\Pr_{\rho \sim \mathcal{R}^n_p}\big[\varphi|_\rho \text{ is not } (t,k)\text{-imbalanced}\big] \leq p^{1/2 - o(1)}.$$
(Here, upper bounding the probability of an event by $p^{1/2-o(1)}$ means that the probability is at most $\sqrt{p} \cdot g(p)$, where $\lim_{p \to 0} g(p) p^{\delta} = 0$ for any fixed $\delta > 0$.)

Remark 9.2. The statement of Lemma 9.1 is somewhat incomparable to that of Lemma 4.4. On the one hand, Lemma 4.4 yields a stronger property: namely, that the threshold gate is $(t, 0)$-imbalanced (or equivalently, not $t$-balanced) with high probability; Lemma 9.1 only yields that the gate is $(t, k)$-imbalanced with high probability (and also holds only for a smaller value of $t$). However, the probability estimates in Lemma 9.1 are stronger than those in Lemma 4.4: it can be checked that the proof of Lemma 4.4 cannot yield an upper bound of $p^c$ for some $c < 1/2$.
Replacing the random restriction lemma of [27] with Lemma 9.1 in the lower bound proof of [27] yields the same lower bound for depth-3 circuits up to n o(1) factors. We state the result below after a few basic definitions.
An unweighted majority gate is a threshold gate whose label (w, θ ) satisfies |w i | = 1 for all i. We let MAJ • THR • THR denote the class of depth-3 circuits that have an unweighted majority gate as the output gate above two layers of general threshold gates.

A modified structural lemma
In this section, we prove Lemma 9.1. We will need the following easy consequence of Theorem 2.20.
Fact 9.4. Let $\varphi$ be a $(t, k)$-imbalanced threshold gate, and let $S$ be a set of at most $k$ input variables of $\varphi$ such that for each setting of these variables, the resulting threshold gate is $t$-imbalanced. The threshold gate $\varphi'$ obtained from $\varphi$ by keeping only the variables in $S$ (with their weights) and removing the others satisfies

We now prove Lemma 9.1.
Proof of Lemma 9.1. The proof of the above lemma will follow along the lines of the proof of Lemma 4.4. We also use the definitions and notation from Section 5.
Throughout, we assume that $p$ is a small enough constant, since the statement of the lemma only needs to be proved for $p \to 0$. We can also assume that $k = o(\lg(1/p))$, since assuming an upper bound on $k$ only makes the statement stronger. We sample $\rho = (I, y)$ from $\mathcal{R}^n_p$. We will need a number of parameters in the proof. Let $\varepsilon = m\sqrt{p}$ for an integer parameter $m \leq (1/p)^{o(1)}$ to be chosen later. Define $q = 16 m^2 p \lg(1/p)$. Note that $q = p^{1-o(1)}$ since $m \leq (1/p)^{o(1)}$. Let $L = 100 \log^2(1/\varepsilon)/\varepsilon^2$. At some places in the proof, we will assume that $p$ is smaller than some constants, since the statement of the lemma is only non-trivial for $p \to 0$.

We assume that the variables of the threshold gate have been sorted so that $|w_1| \geq |w_2| \geq \cdots \geq |w_n|$. After applying a restriction $\rho$, the threshold gate $\varphi|_\rho$ is labeled by the pair $(w', \theta')$, where $w'$ is the restriction of $w$ to the coordinates in $I$ and
$$\theta' = \theta'(\rho) = \theta - \langle w'', y \rangle, \quad (9.1)$$
where $w''$ denotes the vector $w$ restricted to the indices in $[n] \setminus I$. For a random restriction $\rho \sim \mathcal{R}^n_p$, we define the bad events $B_t(\rho)$, $B^1(\rho)$ and $B^2_{k,\ell}(\rho)$ exactly as in the proof of Lemma 4.4 (here $\ell$ is a varying parameter we will choose below and $k$ is as fixed above). Additionally, we define the bad event $\widetilde{B}_t(\rho)$ to be the event that $\varphi|_\rho$ is not $(t, k)$-imbalanced: note that this is exactly the event whose probability we need to upper bound. Also note that $B_t(\rho)$ holds whenever $\widetilde{B}_t(\rho)$ does, and hence any upper bound on the probability of $B_t(\rho)$ also applies to $\widetilde{B}_t(\rho)$.
Exactly as in Lemma 4.4, we first consider the case that $w$ is $\varepsilon$-regular, and bound $\Pr_\rho[\widetilde{B}_t(\rho)]$ in this case. This easily follows since $\Pr[\widetilde{B}_t(\rho)] \leq \Pr[B_t(\rho)]$, and hence by Lemma 5.5 we have

where we have used the fact that $t = (1/p)^{o(1)}$ and $\varepsilon = m\sqrt{p}$.

In the non-regular case, we proceed by case analysis based on the $\varepsilon$-critical index $K$ of $w$. Specifically, we proceed based on whether $K \leq L$ or not.

We first show how to handle the former case, i.e., $K \leq L$. In this case, we bound the probability of $\widetilde{B}_t(\rho)$ as follows

where the first inequality follows from Claim 5.4, and we ensure the final inequality by choosing $m$ so that $(epK/k)^k \leq (epL/k)^k \leq \sqrt{p}$. This can be done by choosing $m = O(\log(1/p) \cdot (1/p)^{1/2k})$, since we then have

for $m$ as chosen above. Note that $m = (1/p)^{o(1)}$ since $k = \omega(1)$ by assumption. The event $\neg B^2_{k,K}(\rho)$ only depends on the choice of the sub-restriction $\rho|_{[K]}$, and we can condition on $\rho|_{[K]}$ so that this event occurs. From now on, the random choice will be a restriction $\rho' \sim \mathcal{R}^{n-K}_p$ on the remaining variables.
Let $S$ denote the set of variables in $I$ among the first $K$ variables. We show that with high probability over the choice of $\rho'$, we obtain a $t$-imbalanced threshold gate for each setting of the variables in $S$.
There are $2^k$ settings of the variables in $S$. For each such setting, the restricted linear function is now $\varepsilon$-regular by the definition of the $\varepsilon$-critical index. Hence we can appeal to the regular case (i.e., inequality (9.2)) and use a union bound over the settings of the variables in $S$ to conclude that

where we have used our assumption that $k = o(\log(1/p))$ for the final inequality. Along with (9.3), this implies the lemma in the case that $K \leq L$.
In the case that K > L, we use a slightly different strategy.
As we did above, we condition on a fixed $I$ so that $\neg B^2_{k,L}(\rho)$ occurs (i.e., fewer than $k$ among the first $L$ variables belong to $I$). Let $S$ denote the set of variables in $I$ among the first $L$ variables. Note that $|S| \leq k$. Now, note that if $\widetilde{B}_t(\rho)$ does occur, then on a uniformly random setting of the variables in $S$, the probability that the restricted threshold gate is not $t$-imbalanced is at least $1/2^k$.
On the other hand, we analyze what happens when we set all variables not in $I$, and also the variables in $S$, uniformly at random. Note that Proposition 5.6 is now applicable with $L' = L$ and $r = 10\log(1/\varepsilon)$, since we are setting all of the first $L$ variables (and others besides). Applying Proposition 5.6, the probability that the restricted threshold gate is not $t$-imbalanced is at most $\varepsilon^{10} < p$. Along with the previous paragraph, this implies that

Putting this together with (9.4), we have the claimed upper bound on $\Pr_\rho[\widetilde{B}_t(\rho)]$ in the case that $K > L$.

Superlinear gate lower bounds and superquadratic wire lower bounds for threshold circuits
We now prove Theorem 9.3. The proof closely follows [27] except that we use the structural lemma proved above in lieu of the random restriction lemma from [27].
The following proposition (a weaker version of which is implicit in [27]) will help us prove lower bounds on circuits with an unweighted majority gate on top.
We assume that $\Pr_{y \in \{-1,1\}^n}[f(y) = h'(y)] \geq 1 - 1/(16s)$ and obtain a contradiction to (9.5) (the other case is similar). It is easy to check, using our assumptions on the $w_i$ and $\theta$, that whenever $f(y) = h'(y)$, we have $f(y) L'(y) \geq 1/2$. Also, for any $y$ such that $f(y) \neq h'(y)$, we have

The explicit function we will use is the same as that defined by Kane and Williams [27, Section 6], which is also a generalization of the Andreev function. The Generalized Andreev function from Section 2.4 can also be suitably modified (by changing the parameters) to yield the same lower bounds.
To prove the above theorem, we need some definitions and lemmas which are closely related to statements proved in [27].

Proof. We show that each circuit from the class THR • THR$[m, k, \ell]$ can be described using $r$ bits, where $r$ is as above. This will prove the claim.
To see this, note that by Corollary 2.10 the number of distinct threshold functions on $m$ bits is at most $2^{O(m^2)}$, and hence each threshold gate of fan-in greater than $k$ can be described using at most $O(m^2)$ bits. The gates of fan-in at most $k$ can compute at most $m^k 2^{O(k^2)}$ distinct functions; we assume that all these functions appear in the bottom layer. The threshold gate on top thus computes a threshold function of at most $\ell + m^k 2^{O(k^2)}$ threshold functions at the bottom layer. By Theorem 2.9, each such function can be described using

many bits. Thus, the total description length is $O(\ell m^2 + 2^{O(k^2)} \cdot m^{k+1})$ as required.
We need the following lemma, implicit in [27]. Using the function $F$ from the above lemma, we define our hard function, as in [27], in the following way. Set $M = n^{16}$ in the above lemma. In what follows, we assume that $n$ is a power of $2$ and that $\log M \mid n$. The function $B_n : \{-1,1\}^n \times \{-1,1\}^n \to \{-1,1\}$ is defined by $B_n(x, y) = F(z, x)$, where $z \in \{-1,1\}^{\log M}$ is obtained by partitioning the $y$ variables into $\log M$ blocks of size $n/\log M$ and taking the parity of the bits in the $i$-th block to obtain $z_i$ (for each $i \in [\log M]$).
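The construction of $B_n$ from $F$ is mechanical; the sketch below (ours) implements the block-parity compression of $y$ into $z$. The function `selector_F` is a hypothetical, Andreev-style stand-in: the actual $F$ of Lemma 9.7 is different and is not reproduced here.

```python
def block_parities(y, num_blocks):
    """Compress y in {-1,1}^n into z in {-1,1}^{num_blocks}: z_i is the parity
    (product, in the +/-1 convention) of the i-th block of y."""
    n = len(y)
    assert n % num_blocks == 0
    size = n // num_blocks
    z = []
    for i in range(num_blocks):
        par = 1
        for v in y[i * size:(i + 1) * size]:
            par *= v
        z.append(par)
    return z

def make_B(F, num_blocks):
    """B(x, y) = F(z, x) with z the vector of block parities of y."""
    return lambda x, y: F(block_parities(y, num_blocks), x)

# Hypothetical stand-in for F: interpret z as a binary index and select
# that bit of x, Andreev-style.
def selector_F(z, x):
    idx = 0
    for b in z:
        idx = 2 * idx + (1 if b == -1 else 0)
    return x[idx % len(x)]
```

A restriction on $y$ is "live" precisely when each block retains at least one free variable, so that every value of $z$ remains reachable.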
Lemmas 9.6 and 9.7 imply a strong correlation bound for $B_n$ against certain kinds of circuits. We will say that a restriction on the $y$ variables of $B_n$ is live if it leaves at least one variable unset in each of the $\log M$ blocks defined above.
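The $p$-random restriction $\mathcal{R}^n_p$ used in the proofs below, together with this liveness condition, can be sketched as follows (an illustrative sketch, not the paper's code):

```python
import random

def random_restriction(n, p, rng=random):
    """Sample rho = (I, w) from R^n_p: each coordinate is left unset
    (alive) with probability p; otherwise it is fixed to a uniformly
    random value in {-1, 1}."""
    alive = []   # the set I of unset coordinates
    fixed = {}   # the assignment w to the remaining coordinates
    for i in range(n):
        if rng.random() < p:
            alive.append(i)
        else:
            fixed[i] = rng.choice([-1, 1])
    return alive, fixed

def is_live(alive, n, num_blocks):
    """rho is live if every one of the num_blocks equal-sized blocks of
    y-variables contains at least one unset coordinate."""
    size = n // num_blocks
    return len({i // size for i in alive}) == num_blocks
```

With $p = m/n$ as chosen later, each block of size $n/\log M$ contains an alive coordinate except with probability $(1-p)^{n/\log M}$, which is the quantity bounded in the proof of Theorem 9.3.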
Lemma 9.8. Let $n \in \mathbb{N}$ and $M = n^{16}$ be as above. Fix any $m, k, \ell \in \mathbb{N}$ such that the number of functions in $\mathsf{THR} \circ \mathsf{THR}[\log M, k, \ell]$ is $2^{o(n)}$. Then there is an $x \in \{-1,1\}^n$ such that for any live restriction $\rho$ on the $y$ variables that leaves $m$ variables unset, the restricted function $B_{n,\rho} := B_n(x, \cdot)|_\rho$ satisfies $\mathrm{Corr}(C, B_{n,\rho}) \le 1 - \Omega(1/n^3)$ for each circuit $C$ from the class $\mathsf{MAJ} \circ \mathsf{THR} \circ \mathsf{THR}[m, k, \ell, n^3]$.
Proof. By Lemma 9.7, we can fix an $x \in \{-1,1\}^n$ such that the function $F(\cdot, x)$ has correlation at most $O(1/n^{3.5})$ with any function from the class $\mathsf{THR} \circ \mathsf{THR}[\log M, k, \ell]$. Proposition 9.5 then implies that any circuit from the class $\mathsf{MAJ} \circ \mathsf{THR} \circ \mathsf{THR}[m, k, \ell, n^3]$ has correlation at most $1 - \Omega(1/n^3)$ with $F(\cdot, x)$.
For each $i \in [\log M]$, fix a variable $y^{(i)}$ in the $i$-th block of $y$ variables that is left unset by $\rho$. Let $Y$ denote the set of $\log M$ variables chosen in this way. Construct a restriction tree $T$ that sets all the variables not in $Y$ that are not set by $\rho$. For each leaf $\lambda$ of $T$, we let $C_\lambda$ denote the circuit $C|_{\rho\lambda}$, and similarly $B_{n,\lambda} := B_{n,\rho}|_\lambda$. By Fact 2.3, we have $\mathrm{Corr}(C, B_{n,\rho}) \le \mathbb{E}_{\lambda \sim T}[\mathrm{Corr}(C_\lambda, B_{n,\lambda})]$, and in particular there is a $\lambda$ such that $\mathrm{Corr}(C, B_{n,\rho}) \le \mathrm{Corr}(C_\lambda, B_{n,\lambda})$. Fix any such $\lambda$ for the rest of the proof.
Note that $C_\lambda$ is a circuit from the class $\mathsf{MAJ} \circ \mathsf{THR} \circ \mathsf{THR}[m, k, \ell, n^3]$. Also note that $B_{n,\lambda}$ is exactly a copy of the function $F(\cdot, x)$, with some subset of its variables replaced by their negations. By negating some of the variables in $C_\lambda$ if necessary, we can obtain a circuit $C'_\lambda$ from the class $\mathsf{MAJ} \circ \mathsf{THR} \circ \mathsf{THR}[m, k, \ell, n^3]$ such that $\mathrm{Corr}(C_\lambda, B_{n,\lambda}) = \mathrm{Corr}(C'_\lambda, F(\cdot, x)) \le 1 - \Omega(1/n^3)$, where the latter inequality follows from our reasoning above. By our choice of $\lambda$, we obtain $\mathrm{Corr}(C, B_{n,\rho}) \le \mathrm{Corr}(C_\lambda, B_{n,\lambda}) \le 1 - \Omega(1/n^3)$, which concludes the proof of the lemma.
Proof of Theorem 9.3. The proof closely follows the ideas of [27, Theorem 1.3]. The main new ingredient is the use of Lemma 9.1, which replaces the use of the random restriction lemmas from [27]. Fix $m, k \in \mathbb{N}$ such that $\omega((\log n)^2) \le m \le n^{o(1)}$, $2^{k^2} \le n^{o(1)}$, and $m^k = n^{\omega(1)}$. (For example, we could set $m = 2^{(\log n)^{5/6}}$ and $k = (\log n)^{1/4}$.) Let $\ell = n/m^3$. By our choice of parameters and Lemma 9.6, the number of functions in $\mathsf{THR} \circ \mathsf{THR}[\log M, k, \ell]$ is $2^{o(n)}$. By Lemma 9.8, we can find an $x$ so that the restricted function $B_n(x, \cdot)$ is as guaranteed there. We fix any such $x$ for the rest of the proof and consider the restricted function $B_n(x, \cdot)$.
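For concreteness, the suggested setting $m = 2^{(\log n)^{5/6}}$, $k = (\log n)^{1/4}$ satisfies all three constraints, by a routine calculation:

```latex
% Checking m = 2^{(\log n)^{5/6}} and k = (\log n)^{1/4}:
\begin{align*}
m   &= 2^{(\log n)^{5/6}} = n^{(\log n)^{-1/6}} = n^{o(1)},
      \qquad\text{and } m = \omega((\log n)^2),\\
2^{k^2} &= 2^{(\log n)^{1/2}} = n^{(\log n)^{-1/2}} = n^{o(1)},\\
m^k &= 2^{(\log n)^{5/6} \cdot (\log n)^{1/4}} = 2^{(\log n)^{13/12}}
      = n^{(\log n)^{1/12}} = n^{\omega(1)}.
\end{align*}
```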
We begin with the gates case. Let $C$ be a $\mathsf{MAJ} \circ \mathsf{THR} \circ \mathsf{THR}$ circuit computing $B_n(x, \cdot)$ with at most $s = n^{1.5}/(m^{10} r)$ gates, where $r \le n^{o(1)}$ is a parameter that will be chosen below.
We will apply a random restriction $\rho = (I, w) \sim \mathcal{R}^n_p$ where $p = m/n$. We bound the probability of the following bad events: 1. Event $E_1(\rho)$: This is the event that the restriction $\rho$ is not live. For this event to occur, there must be some $i \in [\log M]$ such that each of the bits of $y$ that $z_i$ depends on is fixed to a constant by $\rho$.
For each $i$, the probability that this happens is at most $(1 - m/n)^{n/\log M} \le 2^{-\Omega(m/\log M)} \le n^{-\omega(1)}$. By a union bound over $i \in [\log M]$, the probability of $E_1$ is at most $n^{-\omega(1)}$.
2. Event $E_2(\rho)$: By Lemma 9.1, the probability that a given threshold gate is not $(t, k)$-imbalanced for $t = \log n$ is at most $p^{1/2 - \delta}$, where $\delta \to 0$ as $n \to \infty$. As $p \ge 1/n$, we may bound this probability by $\sqrt{p} \cdot n^\delta$. Set $r = n^\delta$.
The expected number of gates that are not $(t, k)$-imbalanced is at most $s \cdot \sqrt{p} \cdot r \le n/m^9$. The event $E_2$ is that the number of gates that are not $(t, k)$-imbalanced is at least $2n/m^9$. By Markov's inequality, the probability of this event is at most $1/2$.
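The expectation bound here is a direct substitution of $s = n^{1.5}/(m^{10} r)$ and $p = m/n$:

```latex
\begin{align*}
s \cdot \sqrt{p} \cdot r
  = \frac{n^{1.5}}{m^{10}\, r} \cdot \sqrt{\frac{m}{n}} \cdot r
  = \frac{n^{1.5}\, m^{1/2}}{m^{10}\, n^{1/2}}
  = \frac{n}{m^{9.5}}
  \le \frac{n}{m^{9}}.
\end{align*}
```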
By a union bound, there exists a restriction $\rho$ such that neither of the events $E_1$ nor $E_2$ occurs. Fix such a restriction $\rho$ and consider the circuit $C|_\rho$. The number of gates in $C|_\rho$ that are not $(t, k)$-imbalanced is bounded by $2n/m^9 \le \ell$. By Fact 9.4, any $(t, k)$-imbalanced threshold gate $\phi$ in $C|_\rho$ can be approximated to error $\exp(-\Omega(t^2)) = n^{-\omega(1)}$ by a threshold gate $\phi'$ of fan-in at most $k$. By a union bound, the circuit $C'_\rho$ obtained by replacing each $(t, k)$-imbalanced gate $\phi$ by the corresponding $\phi'$ satisfies $\Pr_{y' \in \{-1,1\}^{|I|}}[C|_\rho(y') \ne C'_\rho(y')] \le s \cdot n^{-\omega(1)} \le n^{-\omega(1)}$. Since $\rho$ is live and $C|_\rho$ computes $B_{n,\rho}$, the circuit $C'_\rho$ thus has correlation $1 - n^{-\omega(1)} > 1 - \Omega(1/n^3)$ with $B_{n,\rho}$, contradicting Lemma 9.8.
We now consider the wires case, which is very similar. Let $C$ now be a $\mathsf{MAJ} \circ \mathsf{THR} \circ \mathsf{THR}$ circuit computing $B_n(x, \cdot)$ with at most $s = n^{2.5}/(m^{10} r)$ wires, where $r$ is as defined above. We will say that a gate on the bottom layer is large if its fan-in is at least $n/m^2$, and small otherwise. We use $L$ to denote the set of large gates and $S$ to denote the set of small gates. Let $s_L := |L|$. Note that $s_L \le s/(n/m^2) = n^{1.5}/(m^8 r)$.
As in the gates case, we apply a random restriction $\rho \sim \mathcal{R}^n_p$ to the circuit $C$. We now consider the following bad events.
1. Event $E_1(\rho)$: This event is as defined above. As noted above, the probability of this event is bounded by $n^{-\omega(1)}$.
2. Event $E_2(\rho)$: This is the event that some gate in $S$ has fan-in at least $k$ after the restriction. Fix any threshold gate $\phi \in S$. Since $\phi$ has fan-in at most $n/m^2$, the probability that it has fan-in at least $k$ after the restriction is bounded by $\binom{n/m^2}{k} p^k \le \left(\frac{n}{m^2} \cdot \frac{m}{n}\right)^k = m^{-k} = n^{-\omega(1)}$, where the last equality follows from our choice of parameters (recall that $m^k = n^{\omega(1)}$). Since the circuit $C$ has at most $s$ wires, it also has at most $s$ gates, and hence by a union bound over the gates in $S$, the probability that $E_2(\rho)$ occurs is at most $n^{2.5} \cdot n^{-\omega(1)} \le n^{-\omega(1)}$.

3. Event $E_3(\rho)$: This is the event that the number of gates in $L$ that are not $(t, k)$-imbalanced is at least $2n/m^7$. As in the gates case, we can argue that the expected number of gates in $L$ that are not $(t, k)$-imbalanced is at most $s_L \cdot \sqrt{p} \cdot r \le n/m^7$. Thus, by Markov's inequality, the probability that $E_3(\rho)$ occurs is at most $1/2$.
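As in the gates case, the bound on the expectation is a direct substitution, now with $s_L \le n^{1.5}/(m^8 r)$ and $p = m/n$:

```latex
\begin{align*}
s_L \cdot \sqrt{p} \cdot r
  \le \frac{n^{1.5}}{m^{8}\, r} \cdot \sqrt{\frac{m}{n}} \cdot r
  = \frac{n^{1.5}\, m^{1/2}}{m^{8}\, n^{1/2}}
  = \frac{n}{m^{7.5}}
  \le \frac{n}{m^{7}}.
\end{align*}
```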
By a union bound, we see that there is a $\rho$ such that none of $E_1(\rho)$, $E_2(\rho)$, or $E_3(\rho)$ occurs. From here on, we can follow the proof of the gates case verbatim to derive a contradiction to Lemma 9.8. This proves the theorem in the wires case.