Bounding the Sensitivity of Polynomial Threshold Functions

We give the first non-trivial upper bounds on the average sensitivity and noise sensitivity of polynomial threshold functions. More specifically, for a Boolean function f on n variables equal to the sign of a real, multivariate polynomial of total degree d, we prove: 1) The average sensitivity of f is at most O(n^{1-1/(4d+6)}) (we also give a combinatorial proof of the bound O(n^{1-1/2^d})). 2) The noise sensitivity of f with noise rate \delta is at most O(\delta^{1/(4d+6)}). Previously, only bounds for the linear case were known. Along the way we show new structural theorems about random restrictions of polynomial threshold functions, obtained via hypercontractivity. These structural results may be of independent interest, as they provide a generic template for transforming problems about polynomial threshold functions defined on the Boolean hypercube into problems about polynomial threshold functions defined in Gaussian space.


Background
Let P be a real, multivariate polynomial of degree d, and let f = sign(P). We say that the Boolean function f is a polynomial threshold function (PTF) of degree d. PTFs play an important role in computational complexity, with applications in circuit complexity [ABFR94,Bei93], learning theory [KS04,KOS04], communication complexity [She08,She09], and quantum computing [BBC+01]. While many interesting properties (e.g., Fourier spectra, influence, sensitivity) have been characterized for the case d = 1 of linear threshold functions (LTFs), very little is known for degrees 2 and higher. Gotsman and Linial [GL94] conjectured, for example, that the average sensitivity of a degree d PTF is O(d√n). In this work, we take a step towards resolving this conjecture and give the first nontrivial bounds on the average sensitivity and noise sensitivity of degree d PTFs (Theorem 1.6).
Average sensitivity [BL85] and noise sensitivity [KKL88,BKS99] are two fundamental quantities that arise in the analysis of Boolean functions. Roughly speaking, the average sensitivity of a Boolean function f measures the expected number of bit positions that change the sign of f for a randomly chosen input, and the noise sensitivity of f measures the probability over a randomly chosen input x that f changes sign if each bit of x is flipped independently with probability δ (we give formal definitions below).
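Both quantities are easy to estimate empirically. The following sketch (an illustration of ours, not from the paper; the function names are hypothetical) estimates NS_δ(f) by sampling random inputs and their δ-perturbations, and tries it on the Majority function, a degree-1 PTF whose noise sensitivity is known to be O(√δ).

```python
import random

def noise_sensitivity(f, n, delta, trials=20000, rng=None):
    """Monte Carlo estimate of NS_delta(f) = Pr[f(X) != f(Z)], where Z
    flips each bit of a uniform X in {1,-1}^n independently w.p. delta."""
    rng = rng or random.Random(0)
    flips = 0
    for _ in range(trials):
        x = [rng.choice((1, -1)) for _ in range(n)]
        z = [-xi if rng.random() < delta else xi for xi in x]
        flips += (f(x) != f(z))
    return flips / trials

def majority(x):
    # A degree-1 PTF: the sign of the sum of the (odd number of) bits.
    return 1 if sum(x) > 0 else -1

# For delta = 0.01 the estimate should be on the order of sqrt(0.01) = 0.1.
est = noise_sensitivity(majority, 21, 0.01)
```
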
Bounds on the average and noise sensitivity of Boolean functions have direct applications in hardness of approximation [Hås01,KKMO07], hardness amplification [O'D04], circuit complexity [LMN93], the theory of social choice [Kal05], and quantum complexity [Shi00]. In this paper, we focus on applications in learning theory, where it is known that bounds on the noise sensitivity of a class of Boolean functions yield learning algorithms for the class that succeed in harsh noise models (i.e., work in the agnostic model of learning) [KKMS08]. We obtain the first efficient algorithms for agnostically learning PTFs with respect to the uniform distribution on the hypercube. We also give efficient algorithms for agnostically learning ellipsoids in R^n with respect to the Gaussian distribution, resolving an open problem of Klivans et al. [KOS08]. We discuss these learning theory applications in Section 2.

Main Definitions and Results
We begin by defining the (Boolean) noise sensitivity of a Boolean function. For any δ ∈ (0, 1), let X be a random element of the hypercube {1, −1}^n and let Z be a δ-perturbation of X defined as follows: for each i independently, Z_i is set to X_i with probability 1 − δ and to −X_i with probability δ. The noise sensitivity of f for noise δ, denoted NS_δ(f), is then defined by NS_δ(f) = Pr[f(X) ≠ f(Z)]. Intuitively, the Boolean noise sensitivity of f measures the probability that f changes value when a random input to f is perturbed slightly. In order to analyze Boolean noise sensitivity, we will also need to analyze the Gaussian noise sensitivity, which is defined similarly, except that the random variables X and Z are drawn from a multivariate Gaussian distribution. Let N = N(0, 1) denote the univariate Gaussian distribution on R with mean 0 and variance 1.
Definition 1.2 (Gaussian noise sensitivity). Let f : R^n → {−1, 1} be any Boolean function on R^n. Let X, Y be two independent random variables drawn from the multivariate Gaussian distribution N^n, and let Z be a δ-perturbation of X defined by Z = (1 − δ)X + √(2δ − δ²) Y. The Gaussian noise sensitivity of f for noise δ, denoted GNS_δ(f), is defined by GNS_δ(f) = Pr[f(X) ≠ f(Z)]. It is well known that the Boolean and Gaussian noise sensitivities of LTFs are at most O(√δ). Our results give the first nontrivial bounds for degrees 2 and higher in both the Gaussian and Boolean cases, with the Gaussian case being considerably easier to handle than the Boolean case.
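As a sanity check on the δ-perturbation in Definition 1.2, the sketch below (ours, purely illustrative) samples Z = (1 − δ)X + √(2δ − δ²)Y and verifies the two properties that make it the right Gaussian analogue of the Boolean perturbation: each Z_i is again standard Gaussian, since (1 − δ)² + (2δ − δ²) = 1, and E[X_i Z_i] = 1 − δ, matching the correlation of the Boolean perturbation.

```python
import math
import random

def gaussian_pair(delta, rng):
    """Sample (X, Z) with X ~ N(0,1) and Z = (1-delta) X + sqrt(2 delta - delta^2) Y."""
    x = rng.gauss(0, 1)
    y = rng.gauss(0, 1)
    z = (1 - delta) * x + math.sqrt(2 * delta - delta ** 2) * y
    return x, z

rng = random.Random(1)
delta = 0.1
pairs = [gaussian_pair(delta, rng) for _ in range(50000)]
var_z = sum(z * z for _, z in pairs) / len(pairs)  # should be close to 1
corr = sum(x * z for x, z in pairs) / len(pairs)   # should be close to 1 - delta = 0.9
```
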
For the Gaussian case, we get a slightly better dependence on the degree d.
Theorem 1.4 (Gaussian noise sensitivity). For any degree d polynomial P such that P is either multilinear or corresponds to an ellipsoid, the corresponding PTF f = sign(P) satisfies, for all 0 < δ < 1, GNS_δ(f) ≤ O_d(δ^{1/(2d+2)}).
Diakonikolas et al. [DRST09] prove that a similar bound holds for all degree d PTFs. Our next set of results bounds the average sensitivity, or total influence, of degree d PTFs.
Definition 1.5 (average sensitivity). Let f be a Boolean function, and let X be a random element of {1, −1}^n. The influence of the i-th variable is defined by Inf_i(f) = Pr_X[f(X) ≠ f(X^{⊕i})], where X^{⊕i} denotes X with its i-th bit flipped. The sum of all the influences is referred to as the average sensitivity of the function f, AS(f) = Σ_{i=1}^n Inf_i(f). Clearly, for any function f, AS(f) is at most n. It is well known that the average sensitivity of "unate" functions (functions monotone in each coordinate), and thus of LTFs in particular, is O(√n). This bound is tight, as the Majority function has average sensitivity Θ(√n). As mentioned before, Gotsman and Linial [GL94] conjectured in 1994 that the average sensitivity of any degree d PTF is O(d√n). We are not aware of any progress on this conjecture until now; no o(n) bounds were known. We give two upper bounds on the average sensitivity of degree d PTFs. We first use a simple translation lemma, bounding the average sensitivity of a Boolean function in terms of its noise sensitivity, together with Theorem 1.3 to obtain the following bound.
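For small n, the influences and average sensitivity can be computed exactly by enumerating the hypercube. The sketch below (our illustration, not from the paper) does this for Majority on 7 bits: flipping bit i changes the sign exactly when the other 6 bits split 3-3, so each influence is C(6,3)/2^6 = 0.3125 and AS = 7 · 0.3125 = 2.1875 = Θ(√n).

```python
from itertools import product

def influences(f, n):
    """Exact Inf_i(f) = Pr_X[f(X) != f(X with bit i flipped)] for all i."""
    inf = [0] * n
    for x in product((1, -1), repeat=n):
        fx = f(list(x))
        for i in range(n):
            y = list(x)
            y[i] = -y[i]
            if f(y) != fx:
                inf[i] += 1
    return [v / 2 ** n for v in inf]

majority = lambda x: 1 if sum(x) > 0 else -1

inf = influences(majority, 7)
avg_sens = sum(inf)  # 7 * C(6,3)/2^6 = 2.1875 for Majority on 7 bits
```
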
We also give an elementary combinatorial argument showing that the average sensitivity of any degree d PTF is at most 3n^{1−1/2^d}. The combinatorial proof is based on the following lemma for general Boolean functions, which may prove useful elsewhere. For x ∈ {1, −1}^n and i ∈ [n], let We believe that when the functions f_i in the above lemma are LTFs, the above bound can be improved to O(n), which in turn would imply the Gotsman-Linial conjecture for quadratic threshold functions.

Random Restrictions of PTFs - a structural result
An important ingredient of our sensitivity bounds for PTFs is a set of new structural theorems about random restrictions of PTFs, obtained via hypercontractivity. The structural results we obtain can be seen as part of the high-level "randomness vs. structure" paradigm that has played a fundamental role in many recent breakthroughs in additive number theory and combinatorics. Specifically, we obtain the following structural result (Lemmas 5.1 and 5.2): for any PTF, there exists a small set of variables such that, with at least constant probability, a random restriction of these variables satisfies one of the following: (1) the restricted polynomial is "regular" in the sense that no single variable has large influence, or (2) the sign of the restricted polynomial is a very biased function.
We remark that our structural results, though motivated by similar results of Servedio [Ser07] and Diakonikolas et al. [DGJ+09] for the simpler case of LTFs, do not follow from a generalization of their LTF arguments to PTFs. The structural results for random restrictions of low-degree PTFs provide a reasonably generic template for reducing problems involving arbitrary PTFs to problems on regular PTFs. In fact, these structural properties are used precisely for this reason, both in this work and in the parallel work of Meka and Zuckerman [MZ09] (by one of the authors) to construct pseudorandom generators for PTFs.

Related Work
Independent of this work, Diakonikolas, Raghavendra, Servedio, and Tan [DRST09] have obtained nearly identical results to ours for both the average and noise sensitivity of PTFs. The broad outline of their proof is also similar to ours. In our proof, we first obtain bounds on noise sensitivity and then move to average sensitivity using a translation lemma. On the other hand, Diakonikolas et al. [DRST09] first obtain bounds on the average sensitivity of PTFs and then use a generalization of Peres' argument [Per04] for LTFs to move from average sensitivity to noise sensitivity. Regarding our structural result described in Section 1.3, Diakonikolas, Servedio, Tan and Wan [DSTW09] have independently obtained similar results to ours. As an application, they prove the existence of low-weight approximators for polynomial threshold functions.

Proof Outline
The proofs of our theorems are inspired by the use of the invariance principle in the proof of the "Majority is Stablest" theorem [MOO05]. As in the proof of the "Majority is Stablest" theorem, our main technical tools are the invariance principle and the anti-concentration bounds (also called small ball probabilities) of Carbery and Wright [CW01].
Bounding the probability that a threshold function changes value, either when it is perturbed slightly (in the case of noise sensitivity) or when a variable is flipped (average sensitivity), involves bounding probabilities of the form Pr[|Q(X)| ≤ |R(X)|], where Q(X), R(X) are low-degree polynomials and R has small l_2-norm relative to that of Q. The event |Q(X)| ≤ |R(X)| implies that either |Q(X)| is small or |R(X)| is large. In other words, for every γ > 0,
Pr[|Q(X)| ≤ |R(X)|] ≤ Pr[|Q(X)| ≤ γ] + Pr[|R(X)| ≥ γ].
Since R has small norm, the second quantity in the above expression can be easily bounded using a tail bound (even Markov's inequality suffices). Bounding the first quantity is trickier. Our first observation is that if the random variable X were distributed according to the Gaussian distribution, as opposed to the uniform distribution on the hypercube, bounds on probabilities of the form Pr[|Q(X)| ≤ γ] would immediately follow from the anti-concentration bounds of Carbery and Wright [CW01]. We then transfer these bounds to the Boolean setting using the invariance principle.
Unfortunately, the invariance principle holds only for regular polynomials (i.e., polynomials in which no single variable has large influence). We thus obtain the required bounds on noise sensitivity and average sensitivity for the special case of regular PTFs. We then extend these results to an arbitrary PTF f using our structural results on random restrictions of the PTF f . The structural results state that either the restricted PTF is a regular polynomial or is a very biased function. In the former case, we resort to the above argument for regular PTFs and bound the noise sensitivity of the given PTF. In the latter case, we merely note that the noise sensitivity of a biased function can be easily bounded. This in turn lets us extend the results for regular PTFs to all PTFs.

Learning Theory Applications
In this section, we briefly elaborate on the learning theory applications of our results. Our bounds on Boolean and Gaussian noise sensitivity imply learning results in the challenging agnostic model of learning of Haussler [Hau92] and Kearns, Schapire and Sellie [KSS94] which we define below.
Definition 2.1. Let D be an arbitrary distribution on X and C a class of Boolean functions f : X → {−1, 1}. For δ, ε ∈ (0, 1), we say that algorithm A is a (δ, ε)-agnostic learning algorithm for C with respect to D if the following holds. For any distribution D′ on X × {−1, 1} whose marginal over X is D, if A is given access to a set of labeled examples (x, y) drawn from D′, then with probability at least 1 − δ algorithm A outputs a hypothesis h : X → {−1, 1} such that Pr_{(x,y)∼D′}[h(x) ≠ y] ≤ opt + ε, where opt = min_{f∈C} Pr_{(x,y)∼D′}[f(x) ≠ y].
Kalai, Klivans, Mansour and Servedio [KKMS08] showed that the existence of low-degree, real-valued polynomial l_2-approximators for a class of functions implies agnostic learning algorithms for the class. In an earlier result, Klivans, O'Donnell and Servedio [KOS04] gave a precise relationship between polynomial approximation and noise sensitivity, essentially showing that small noise sensitivity bounds imply good low-degree polynomial l_2-approximators.
Combining these two results, it follows that bounding the noise sensitivity (either Boolean or Gaussian) of a concept class C yields an agnostic learning algorithm for C (with respect to the appropriate distribution). Thus, using our bounds on noise sensitivity of PTFs, we obtain corresponding learning algorithms for PTFs with respect to the uniform distribution over the hypercube.
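The KKMS algorithm itself uses l_1 polynomial regression; the toy sketch below (our illustration, with hypothetical names) conveys the flavor of the simpler "low-degree" version of this pipeline: project the labels onto the characters of degree at most k (here exactly, from a full truth table) and output the sign of the truncated expansion. For Majority with k = 1 the degree-1 Fourier coefficients are all equal and positive, so the hypothesis recovers the function exactly.

```python
from itertools import combinations, product

def chi(x, s):
    """The character (monomial) x^S = product of x_i over i in S."""
    p = 1
    for i in s:
        p *= x[i]
    return p

def low_degree_learn(labels, n, k):
    """Project the labels onto characters of degree <= k (exact Fourier
    coefficients from the full truth table) and output the sign of the
    truncated expansion.  KKMS use the stronger l_1 regression."""
    points = list(product((1, -1), repeat=n))
    sets = [s for d in range(k + 1) for s in combinations(range(n), d)]
    coef = {s: sum(labels[x] * chi(x, s) for x in points) / 2 ** n
            for s in sets}
    return lambda x: 1 if sum(c * chi(x, s) for s, c in coef.items()) >= 0 else -1

n, k = 5, 1
points = list(product((1, -1), repeat=n))
maj = {x: (1 if sum(x) > 0 else -1) for x in points}
h = low_degree_learn(maj, n, k)
err = sum(h(x) != maj[x] for x in points) / 2 ** n  # degree-1 part recovers Majority
```
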
Theorem 2.2. The concept class of degree d PTFs is agnostically learnable to within ε with respect to the uniform distribution on {−1, 1}^n in time n^{(1/ε)^{O(d)}}.
These are the first polynomial-time algorithms for agnostically learning constant-degree PTFs with respect to the uniform distribution on the hypercube (to within any constant error parameter). Previously, Klivans et al. [KOS08] had shown that quadratic (degree 2) PTFs corresponding to spheres are agnostically learnable with respect to spherical Gaussians on R^n. Our bounds on the Gaussian noise sensitivity of ellipsoids imply that this result can be extended to all ellipsoids with respect to (not necessarily spherical) Gaussian distributions, thus resolving an open problem of Klivans et al. [KOS08].
It is implicit in a recent paper of Blais, O'Donnell and Wimmer [BOW08] that bounding the Boolean noise sensitivity of a concept class C yields non-trivial learning algorithms with respect to a very broad class of discrete and continuous product distributions. We believe this is additional motivation for obtaining bounds on a function's Boolean noise sensitivity.

Organization
The rest of the paper is organized as follows. We introduce the necessary notation and preliminaries in Section 4. We then present the structural results on random restrictions of PTFs (Lemmas 5.1 and 5.2) in Section 5. In Section 6 we present our analysis of Gaussian noise sensitivity, followed by the analysis of Boolean noise sensitivity in Section 7. We remark that the analysis of the Gaussian noise sensitivity is simpler than the Boolean noise sensitivity analysis, since the Boolean case, in some sense, reduces to the "regular" or Gaussian case. We then present our bounds on average sensitivity of PTFs in Section 8.

Notation and Preliminaries
We will consider functions/polynomials over n variables X_1, . . . , X_n. Corresponding to any set I ⊆ [n] (possibly a multi-set), there is a monomial X^I defined as X^I = ∏_{i∈I} X_i. The degree of the monomial X^I is the size of the set I, denoted by |I|. Note that if I is a "regular" set (as opposed to a multi-set), then the monomial X^I is linear in each of the participating variables X_i, i ∈ I.
A polynomial of degree d is a linear combination of monomials of degree at most d, that is, P(X_1, . . . , X_n) = Σ_{I⊆[n], |I|≤d} a_I X^I. The a_I's are called the coefficients of the polynomial P. By convention, we set a_I = 0 for all other I. If the above summation is only over sets I and not multi-sets, then the polynomial is said to be multilinear. Observe that while working over the hypercube, it suffices to consider only multilinear polynomials. We use the following notation throughout.
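Concretely, a multilinear polynomial can be stored as a map from sets I to coefficients a_I. The sketch below (ours, purely illustrative) evaluates such a polynomial and verifies Parseval's identity E[P(X)²] = Σ_I a_I² over the uniform hypercube, which holds because distinct multilinear monomials are orthonormal under this distribution.

```python
from itertools import product

def eval_poly(coeffs, x):
    """Evaluate P(x) = sum_I a_I x^I, with coeffs mapping frozensets I to a_I."""
    total = 0.0
    for I, a in coeffs.items():
        term = a
        for i in I:
            term *= x[i]
        total += term
    return total

# Parseval over the uniform hypercube: E[P(X)^2] = sum_I a_I^2.
coeffs = {frozenset({0}): 1.0, frozenset({1, 2}): -2.0, frozenset({0, 1, 2}): 0.5}
n = 3
second_moment = sum(eval_poly(coeffs, x) ** 2
                    for x in product((1, -1), repeat=n)) / 2 ** n
sum_sq = sum(a * a for a in coeffs.values())  # 1 + 4 + 0.25 = 5.25
```
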
1. Unless otherwise stated, we work with a PTF f of degree d and a degree d polynomial P(X) = Σ_I a_I X^I with zero constant term (i.e., a_∅ = 0) such that f(X_1, . . . , X_n) = sign(P(X_1, . . . , X_n) − θ). In case of ambiguity, we will refer to the coefficients a_I as a_I(P).
2. For a polynomial P as above and an underlying distribution over X = (X_1, . . . , X_n), the l_2-norm of the polynomial over X is defined by ‖P‖² = E[P(X)²]. Note that if P is a multilinear polynomial and the distribution is either the multivariate Gaussian N^n or the uniform distribution over the hypercube, then ‖P‖² = Σ_I a_I².
4. For i ∈ [n], P_{|i}(X_1, . . . , X_i) = Σ_{I⊆[i]} a_I X^I is the restriction of P to the variables X_1, . . . , X_i. 5. For a multi-set S, x ∈_u S denotes a uniformly chosen element of S.
6. For clarity, we suppress the exact dependence of the constants on the degree d in this extended abstract; a more careful examination of our proofs shows that all constants depending on the degree d are at worst 2^{O(d)}.
Definition 4.1. A partial assignment We now define regular polynomials, which play an important role in all our results. Intuitively, a polynomial is regular if no variable has high influence. For a polynomial Q, the weight of the i-th coordinate is defined by w_i(Q) = Σ_{I: i∈I} a_I(Q)². We also assume, without loss of generality, that the variables are ordered such that w_1(P) ≥ w_2(P) ≥ · · · ≥ w_n(P).
We repeatedly use three powerful tools: (2, 4)-hypercontractivity, the invariance principle of Mossel et al. [MOO05] and the anti-concentration bounds of Carbery and Wright [CW01]. We state the relevant results below.
The following anti-concentration bound is a special case of Theorem 8 of [CW01] (in their notation, set q = 2d and the log-concave distribution µ to be N^n): for any polynomial P of degree at most d with ‖P‖ = 1, any t ∈ R, and X ∼ N^n, Pr[|P(X) − t| ≤ ε] ≤ O(d ε^{1/d}). The invariance principle we use takes the following form: there exist constants C_d, c_d > 0 depending only on d such that for any ε-regular multilinear polynomial P of degree at most d with ‖P‖ = 1 and any t ∈ R, the probabilities Pr_{X ∈_u {1,−1}^n}[P(X) ≤ t] and Pr_{G ∼ N^n}[P(G) ≤ t] differ by at most C_d ε^{c_d}. The result stated in [MOO05] uses max_i w_i²(P) as the notion of regularity instead of Σ_i w_i⁴(P) as we do; however, their proof extends straightforwardly to the above.
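The flavor of the Carbery-Wright bound can be seen empirically. The sketch below (ours; the constant in the bound is set to 1 purely for illustration) estimates Pr[|P(X)| ≤ ε] for the degree-3 polynomial P(x) = x_1 x_2 x_3, which has unit norm under the Gaussian since E[x_i²] = 1, and compares it to d·ε^{1/d}.

```python
import random

# Empirical look at Carbery-Wright anti-concentration: for a degree-d
# polynomial P with E[P(X)^2] = 1 under Gaussian X, Pr[|P(X)| <= eps]
# should be O(d eps^{1/d}).  Here P(x) = x_1 x_2 x_3 (degree 3, unit norm).
rng = random.Random(2)
d, eps, trials = 3, 0.01, 100000
hits = 0
for _ in range(trials):
    p = rng.gauss(0, 1) * rng.gauss(0, 1) * rng.gauss(0, 1)
    if abs(p) <= eps:
        hits += 1
prob = hits / trials
bound = d * eps ** (1 / d)  # constant taken to be 1 for illustration only
```
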

Random Restrictions of PTFs
We now establish our structural results on random restrictions of low-degree PTFs. The use of critical indices (K(P, ε)) in our analysis is motivated by the results of Servedio [Ser07] and Diakonikolas et al. [DGJ + 09] who obtain similar results for LTFs. At a high level, we show the following.
Given any ε > 0, define the ε-critical index of a multilinear polynomial P, K = K(P, ε), to be the least index i such that w_j²(P) ≤ ε² σ²_{i+1}(P) for all j > i. We consider two cases depending on how large K(P, ε) is and, roughly, show the following (here c, α > 0 are some universal constants).
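The critical index is straightforward to compute given the coordinate weights. In the sketch below (ours), we assume σ_k²(P) = Σ_{j≥k} w_j²(P), i.e., the tail weight; the paper's definition of σ is not reproduced in this excerpt, so this should be treated as an assumption made for illustration.

```python
def critical_index(w_sq, eps):
    """eps-critical index of P given w_sq[j-1] = w_j^2(P), sorted decreasing.
    ASSUMPTION: sigma_k^2(P) = sum_{j >= k} w_j^2(P) (tail weight); the
    paper's sigma is defined elsewhere and is not reproduced here.
    Returns the least i with w_j^2 <= eps^2 * sigma_{i+1}^2 for all j > i."""
    n = len(w_sq)
    for i in range(n + 1):
        tail = sum(w_sq[i:])  # sigma_{i+1}^2 under the assumption above
        if all(wj <= eps * eps * tail for wj in w_sq[i:]):
            return i
    return n

k0 = critical_index([4.0, 1.0, 1.0, 1.0, 1.0], 1.0)  # already regular: K = 0
k1 = critical_index([4.0, 1.0, 1.0, 1.0, 1.0], 0.5)  # top variable too heavy: K = 1
```
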
2. K > 1/ε^{cd}. In this case we show that, with probability at least α, the value of the threshold function is determined by the top L = 1/ε^{cd} variables.
More concretely, we show the following.
Lemma 5.1. For every integer d, there exist constants a_d ∈ R, γ_d > 0 such that for any multilinear polynomial P of degree at most d and K = K(P, ε) as defined above, the following holds.
Lemma 5.2. For every d, there exist constants b_d, c_d ∈ R, δ_d > 0, such that for any multilinear polynomial P of degree at most d the following holds. If K(P, ε) ≥ c_d log(1/ε)/ε² = L, then a random partial assignment (x_1, . . . , x_L) ∈_u {1, −1}^L is b_d ε-determining for P with probability at least δ_d.
To prove the above structural properties we need the following simple lemmas.

Proof of Lemma 5.2
We use the following simple lemma.
Proof of Lemma 5.2. Suppose that K(P, ε) ≥ L = c log(1/ε)/ε² for a constant c to be chosen later, and let Q(X_1, . . . , X_n) = P(X_1, . . . , X_n) − P_{|L}(X_1, . . . , X_L). The proof proceeds as follows. We first show that ‖Q‖ is significantly smaller than ‖P_{|L}‖. We then use Lemma 5.4 applied to P_{|L} − θ and Markov's inequality applied to |Q(X)| to show that, with at least constant probability, |P_{|L}(X_1, . . . , X_L) − θ| is larger than |Q(X)|, so that Q(X) cannot flip the sign of P_{|L}(X_1, . . . , X_L) − θ. We first bound ‖Q‖.
Claim 5.6. For a suitably large constant c_d, ‖Q‖ ≤ α_d √ε ‖P_{|L}‖.
By Claim 5.6 and Markov's inequality, The lemma now follows.

Gaussian Noise Sensitivity of PTFs
In this section, we bound the Gaussian noise sensitivity of PTFs and thus prove Theorem 1.4. The proof is simpler than the Boolean case and only makes use of an anti-concentration bound for polynomials in Gaussian space. Although Theorem 1.4 was stated only for multilinear polynomials and ellipsoids, we give a proof below that works for all degree d polynomials using ideas from Diakonikolas et al. [DRST09], who were the first to prove a bound on the Gaussian noise sensitivity of general degree d polynomials (see remarks after the statement of Claim 6.1).
Proof of Theorem 1.4. Let f be the degree d PTF and P the corresponding degree d polynomial such that f(x) = sign(P(x)). We may assume without loss of generality that P is normalized, i.e., ‖P‖² = E[P²(X)] = 1.
We note that we could get a slightly stronger bound of O_d(δ^{1/(2d)} log(1/δ)) by using a stronger tail bound instead of Markov's inequality in the above argument.
Claim 6.1. There exists a constant c_d such that ‖Q‖ ≤ c_d √δ.
An earlier version of this paper had an error in the proof of this claim. As pointed out to us by the authors of [DRST09], that proof worked only for multilinear polynomials and ellipsoids. Diakonikolas et al. [DRST09] proved the claim for general degree d polynomials. For the sake of completeness, we give a simplified presentation of their proof (that works for all degree d polynomials) in Section A.

Noise sensitivity of PTFs
We now bound the noise sensitivity of PTFs and prove Theorem 1.3. We do so by first bounding the noise sensitivity of regular PTFs and then using the results of the previous section to reduce the general case to the regular case.

Noise sensitivity of Regular PTFs
At a high level, we bound the noise sensitivity of regular PTFs as follows: (1) Reduce the problem to that of proving certain anti-concentration bounds for regular PTFs over the hypercube. (2) Use the invariance principle of Mossel et al. [MOO05] to reduce proving anti-concentration bounds over the hypercube to that of proving anti-concentration bounds over Gaussian distributions. (3) Use the Carbery-Wright anti-concentration bounds [CW01] for polynomials over log-concave distributions.
For the rest of this section, we fix a degree d multilinear polynomial P and a corresponding degree d PTF f. Recall that it suffices to consider multilinear polynomials, as we are working over the hypercube. We first reduce bounding noise sensitivity to proving anti-concentration bounds.
Lemma 7.1. For 0 < ρ < 1, δ > 0, Proof. Let S ⊆ [n] be a random subset where each i ∈ [n] is included in S independently with probability ρ. From the definition of noise sensitivity it easily follows that Define a non-negative random variable P_S as follows: P_S² = Σ_{I: |I∩S| odd} a_I². We can then bound the first quantity in the above expression using P_S as follows: The lemma now follows by combining Equations (7.1), (7.2), (7.3) and the above equation.
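The random variable P_S admits a clean closed form in expectation: each monomial I has |I ∩ S| odd with probability (1 − (1 − 2ρ)^{|I|})/2 when S includes each coordinate independently with probability ρ, so E_S[P_S²] = Σ_I a_I² (1 − (1 − 2ρ)^{|I|})/2. The sketch below (ours; the example polynomial is arbitrary) checks this identity by exhaustive enumeration of S.

```python
from itertools import product

# Verify E_S[P_S^2] = sum_I a_I^2 (1 - (1 - 2 rho)^{|I|}) / 2, where
# P_S^2 = sum over monomials I with |I ∩ S| odd of a_I^2 and S is a
# rho-random subset of [n].
n, rho = 4, 0.3
coeffs = {frozenset({0}): 1.0, frozenset({1, 2}): -2.0, frozenset({0, 1, 3}): 0.5}

expected = 0.0
for bits in product((0, 1), repeat=n):  # enumerate all subsets S of [n]
    s = {i for i, b in enumerate(bits) if b}
    pr = 1.0
    for b in bits:
        pr *= rho if b else 1 - rho
    ps2 = sum(a * a for I, a in coeffs.items() if len(I & s) % 2 == 1)
    expected += pr * ps2

closed_form = sum(a * a * (1 - (1 - 2 * rho) ** len(I)) / 2
                  for I, a in coeffs.items())
```
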
We now prove an anti-concentration bound for regular PTFs.
Lemma 7.2. If P is ε-regular, then for any interval I ⊆ R of length at most α, Now, by the above equation and Theorem 4.4 applied to the random variable Y for the interval I, We can now obtain a bound on the noise sensitivity of regular PTFs.

Noise Sensitivity of arbitrary PTFs
We prove Theorem 1.3 by recursively applying the following lemma.
Call x ∈ {1, −1}^n (ε, f)-good if there exists an i, 1 ≤ i ≤ t, such that NS_ε(f_{x,i}) ≤ Δ_d ε^{1/(2d+2)}, and let t_x be such an i for an (ε, f)-good x. Then, from the definition of f_{x,i} and Lemma 7.4, (7.5) Let y ∈_δ x be a δ-perturbation of x ∈_u {1, −1}^n. Then, since |S_{x,t_x}| ≤ Lt, Also note that for any i ≥ 1, conditioned on an assignment for the values in x_{|S_{x,i}} and Combining (7.5), (7.6), (7.7), we get Since ε^{1/(2d+2)}/log²(1/ε) and the above is applicable for all ε > 0, we get that for all ρ > 0,

Average sensitivity of PTFs
In this section we bound the average sensitivity of PTFs on the Boolean hypercube, proving Theorem 1.6. We first prove a lemma bounding the average sensitivity of a Boolean function in terms of its noise sensitivity; Theorem 1.6 follows immediately from Theorem 1.3 and the following lemma: We then give a bound of O(n^{1−2^{−d}}) on the average sensitivity using a different, combinatorial argument that does not use the noise sensitivity bounds. We first prove the theorem using Lemma 1.7.
where P_i(·), Q_i(·) are degree d − 1 and degree d polynomials, respectively, that do not depend on x_i. Define f_i(x_{−i}) = sign(P_i(x_{−i})) and g_i(x) = f(x)f_i(x_{−i}). Then, Observe that g_i is monotone increasing in x_i for i ∈ [n], and hence I_i(g_i) = E_X[X_i g_i(X)]. Thus, Since |f(x)| ≤ 1 for all x, we have We now use induction and Lemma 1.7. For an LTF f, the functions f_i as defined above are constants. Therefore, by Equation (8.1), Suppose the theorem is true for degree d PTFs, let f be a degree d + 1 PTF, and let f_i be as defined before. Then, by Equation (8.1) and Lemma 1.7, Therefore, AS(f) ≤ 3n^{1−2^{−(d+1)}}. The theorem follows by induction.
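The identity I_i(g) = E[X_i g(X)] for a function monotone increasing in coordinate i is easy to confirm numerically. The sketch below (ours, purely illustrative) checks it for Majority on 5 bits, which is monotone in every coordinate; both sides equal C(4,2)/2^4 = 0.375.

```python
from itertools import product

def influence(g, n, i):
    """Exact Inf_i(g) = Pr[g(X) != g(X with bit i flipped)]."""
    cnt = 0
    for x in product((1, -1), repeat=n):
        y = list(x)
        y[i] = -y[i]
        cnt += (g(list(x)) != g(y))
    return cnt / 2 ** n

def correlation(g, n, i):
    """E[X_i g(X)] over the uniform hypercube."""
    return sum(x[i] * g(list(x)) for x in product((1, -1), repeat=n)) / 2 ** n

g = lambda x: 1 if sum(x) > 0 else -1  # Majority: monotone in every coordinate
inf0 = influence(g, 5, 0)     # = C(4,2)/2^4 = 0.375
corr0 = correlation(g, 5, 0)  # equal, by monotonicity in coordinate 0
```
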
where µ(x) = 1/2^n is the probability of choosing x under the uniform distribution. We bound the first term in the above expression by the average sensitivity of the f_i's and show that the second term vanishes. Observe that, Note that for x ∉ S_i^j ∪ S_j^i, both f_i(x) and f_j(x) are independent of the values of x_i, x_j. For such x, (abusing notation) let f_i(x_{−ij}) = f_i(x), f_j(x_{−ij}) = f_j(x), and let T_{ij} = {(x_k : k ≠ i, j) : x ∉ S_i^j ∪ S_j^i}. Then, since for x ∉ S_i^j ∪ S_j^i the values f_i(x), f_j(x) depend only on x_{−ij}, we get that x ∉ S_i^j ∪ S_j^i if and only if x_{−ij} ∈ T_{ij}. Therefore, Then, the left-hand side of the lemma is Θ(n^{3/2}) and AS(f_i) = m − 1 = Θ(√n) for all i.
The Hermite polynomials are especially useful while working over the (multivariate) normal distribution due to the following orthonormality conditions.
The bound follows from the Cauchy-Schwarz inequality, together with the orthonormality of H_{S\R} and the independence of the (Z_i − X_i) over the i's.