Optimal Cryptographic Hardness of Learning Monotone Functions

A wide range of positive and negative results have been established for learning different classes of Boolean functions from uniformly distributed random examples. However, polynomial-time algorithms have thus far been obtained almost exclusively for various classes of monotone functions, while the computational hardness results obtained to date have all been for various classes of general (nonmonotone) functions. Motivated by this disparity between known positive results (for monotone functions) and negative results (for nonmonotone functions), we establish strong computational limitations on the efficient learnability of various classes of monotone functions. We give several such hardness results which are provably almost optimal since they nearly match known positive results. Some of our results show cryptographic hardness of learning polynomial-size monotone circuits to accuracy only slightly greater than 1/2 + 1/√n; this accuracy bound is close to optimal by known positive results (Blum et al., FOCS '98). Other results show that under a plausible cryptographic hardness assumption, a class of constant-depth, sub-polynomial-size circuits computing monotone functions is hard to learn; this result is close to optimal in terms of the circuit size parameter by known positive results as well (Servedio, Information and Computation '04). Our main tool is a complexity-theoretic approach to hardness amplification via noise sensitivity of monotone functions that was pioneered by O'Donnell (JCSS '04).

Abstract: Over the years a range of positive algorithmic results have been obtained for learning various classes of monotone Boolean functions from uniformly distributed random examples. Prior to our work, however, the only negative result for learning monotone functions in this model has been an information-theoretic lower bound showing that certain super-polynomial-size monotone circuits cannot be learned to accuracy 1/2 + ω(log n/ √ n) (Blum, Burch, and Langford, FOCS'98). This is in contrast with the situation for nonmonotone functions, where a wide range of cryptographic hardness results establish that various "simple" classes of polynomial-size circuits are not learnable by polynomial-time algorithms.
In this paper we establish cryptographic hardness results for learning various "simple" classes of monotone circuits, thus giving a computational analogue of the information-theoretic hardness results of Blum et al. mentioned above. Some of our results show the cryptographic hardness of learning polynomial-size monotone circuits to accuracy only slightly greater than 1/2 + 1/√n, which is close to the optimal accuracy bound by positive results of Blum et al. Other results show that under a plausible cryptographic hardness assumption, a class of constant-depth, sub-polynomial-size circuits computing monotone functions is hard to learn. This result is close to optimal in terms of the circuit-size parameter by known positive results as well (Servedio, Information and Computation 2004). Our main tool is a complexity-theoretic approach to hardness amplification via noise sensitivity of monotone functions that was pioneered by O'Donnell (JCSS 2004).

Introduction
More than two decades ago Valiant introduced the Probably Approximately Correct (PAC) model of learning Boolean functions from random examples [34]. Since that time a great deal of research effort has been expended on trying to understand the inherent abilities and limitations of computationally efficient learning algorithms. This paper addresses a discrepancy between known positive and negative results for uniform distribution learning by establishing strong computational hardness results for learning various classes of monotone functions.

Background and motivation
In the uniform distribution PAC learning model, a learning algorithm is given access to a source of independent random examples (x, f(x)) where each x is drawn uniformly from the n-dimensional Boolean cube and f is the unknown Boolean function to be learned. The goal of the learner is to construct a high-accuracy hypothesis function h, i.e., one that satisfies Pr[f(x) ≠ h(x)] ≤ ε, where the probability is with respect to the uniform distribution and ε is an error parameter given to the learning algorithm. Algorithms and hardness results in this framework have interesting connections with topics such as discrete Fourier analysis [23], circuit complexity [22], noise sensitivity and influence of variables in Boolean functions [16,4,21,29], coding theory [11], privacy [8,17], and cryptography [7,20]. For these reasons, and because the model is natural and elegant in its own right, the uniform distribution learning model has been intensively studied for almost two decades.
Monotonicity makes learning easier. For many classes of functions, uniform distribution learning algorithms have been devised that substantially improve on a naive exponential-time approach to learning via brute-force search. However, despite intensive efforts, researchers have not yet obtained poly(n)-time learning algorithms in this model for various simple classes of functions. Interestingly, in many of these cases restricting the class of functions to the corresponding class of monotone functions has led to more efficient (sometimes poly(n)-time) algorithms. We list some examples:
1. A simple algorithm learns monotone O(log n)-juntas to perfect accuracy in poly(n) time, and a more complex algorithm [9] learns monotone Õ(log² n)-juntas to any constant accuracy in
4. No poly(n)-time algorithm can learn the general class of all Boolean functions on {0,1}^n to accuracy better than 1/2 + poly(n)/2^n, but a simple polynomial-time algorithm can learn the class of all monotone Boolean functions to accuracy 1/2 + Ω(1/√n) [6]. We note also that the result of [9] mentioned above follows from a $2^{\tilde{O}(\sqrt{n})}$-time algorithm for learning arbitrary monotone functions on n variables to constant accuracy. (It is easy to see that no comparable algorithm can exist for learning arbitrary Boolean functions to constant accuracy.)
Cryptography and hardness of learning. Essentially all known representation-independent hardness of learning results (i.e., results that apply to learning algorithms that do not have any restrictions on the syntactic form of the hypotheses they output) rely on some cryptographic assumption, or an assumption that easily implies a cryptographic primitive. For example, under the assumption that certain subset sum problems are hard on average, Kharitonov [20] showed that the class AC¹ of logarithmic-depth, polynomial-size Boolean circuits (circuits with AND, OR, and NOT gates) is hard to learn under the uniform distribution. Subsequently Kharitonov [19] showed that if factoring Blum integers is $2^{n^{\varepsilon}}$-hard for some fixed ε > 0, then even the class AC⁰ of constant-depth, polynomial-size Boolean circuits similarly cannot be learned in polynomial time under the uniform distribution. In later work, Naor and Reingold [26] gave constructions of pseudorandom functions with very low circuit complexity. Their results imply that if factoring Blum integers is super-polynomially hard, then even depth-5 TC⁰ circuits cannot be learned in polynomial time under the uniform distribution. (TC⁰ circuits are Boolean circuits that can also use MAJ gates. The value of a MAJ gate is one if at least half of its inputs are one, and zero otherwise.) We note that all of these hardness results apply even to algorithms that may make black-box "membership queries" to obtain the value f(x) for inputs x of their choosing.
Monotonicity versus cryptography? Given that cryptography precludes efficient learning while monotonicity seems to make efficient learning easier, it is natural to investigate how these phenomena interact. One could argue that prior to the current work there was something of a mismatch between known positive and negative results for uniform-distribution learning: as described above, a fairly broad range of polynomial-time learning results have been obtained for various classes of monotone functions, but there were no corresponding computational hardness results for monotone functions. Can all monotone Boolean functions computed by polynomial-size circuits be learned to 99% accuracy in polynomial time from uniform random examples? As far as we are aware, prior to our work answers were not known even to such seemingly basic questions about learning monotone functions as this one. This gap in understanding motivated the research presented in this paper (which, as we describe below, lets us answer "no" to the above question in a strong sense).

Our results and techniques: cryptography trumps monotonicity
We present several different constructions of "simple" (polynomial-time computable) monotone Boolean functions and prove that these functions are hard to learn under the uniform distribution, even if membership queries are allowed. We now describe our main results, followed by a high-level description of how we obtain them.
Blum, Burch, and Langford [6] showed that arbitrary monotone functions cannot be learned to accuracy better than 1/2 + O(log n/√n) by any algorithm that makes poly(n) many membership queries. This is an information-theoretic bound that is proved using randomly generated monotone DNF formulas of size (roughly) $n^{\log n}$ that are not polynomial-time computable. A natural goal is to obtain computational lower bounds for learning polynomial-time computable monotone functions that match, or nearly match, this level of hardness (which is close to optimal by the (1/2 + Ω(1/√n))-accuracy algorithm of Blum et al. described above). We prove near-optimal hardness for learning polynomial-size monotone Boolean circuits (circuits with AND and OR gates):
Theorem 1.1 (informal statement). If one-way functions exist, then there is a class of poly(n)-size monotone Boolean circuits that cannot be learned to accuracy $1/2 + 1/n^{1/2-\varepsilon}$ for any fixed ε > 0.
Our approach yields even stronger lower bounds if we make stronger assumptions:
• Assuming the existence of sub-exponential one-way functions, we improve the bound on the accuracy to 1/2 + polylog(n)/√n.
• Assuming the hardness of factoring Blum integers, our hard-to-learn functions may be computed in monotone NC¹.
• Assuming that Blum integers are $2^{n^{\varepsilon}}$-hard to factor on average (which is the same hardness assumption used in Kharitonov's work [19]), we obtain a lower bound for learning constant-depth circuits of sub-polynomial size that almost matches the positive result from [31]. More precisely, we show that for any (sufficiently large) constant d, the class of monotone functions computed by depth-d Boolean circuits of size $2^{(\log n)^{O(1)/(d+1)}}$ cannot be learned to accuracy 51% under the uniform distribution in poly(n) time. In contrast, the positive result of [31] shows that monotone functions computed by depth-d Boolean circuits of size $2^{O((\log n)^{1/(d+1)})}$ can be learned to any constant accuracy in poly(n) time.
These results are summarized in Figure 1.
Hardness assumption | Complexity of f | Accuracy bound | Ref.
none | random $n^{\log n}$-term monotone DNF | $1/2 + \omega(\log n)/n^{1/2}$ | [6]
OWF (poly) | poly-size monotone circuits | $1/2 + 1/n^{1/2-\varepsilon}$ | Thm. 2.8
Figure 1: Summary of known hardness results for learning monotone Boolean functions. The meaning of each row is as follows: under the stated hardness assumption, there is a class of monotone functions computed by circuits of the stated complexity that no poly(n)-time membership query algorithm can learn to the stated accuracy. In the first column, OWF and BI denote one-way functions and Blum Integers respectively, and "poly" and "$2^{n^{\alpha}}$" mean that the problems are intractable for poly(n)- and $2^{n^{\alpha}}$-time algorithms, respectively (for some fixed α > 0). Recall that the poly(n)-time algorithm of [6] for learning monotone functions implies that the best possible accuracy bound for monotone functions is $1/2 + \Omega(1)/n^{1/2}$.
Proof techniques. A natural first approach is to try to replace the random $n^{\log n}$-term monotone DNFs constructed in [6] by pseudorandom DNFs of polynomial size. We were not able to do this directly; indeed, as we discuss in Section 5, constructing such DNFs seems closely related to an open problem of Goldreich, Goldwasser, and Nussboim [13]. However, it turns out that a closely related approach does yield some results along the desired lines; in Section 4 we present and analyze a simple variant of the information-theoretic construction from [6] and then show how to replace random choices by pseudorandom ones in this variant. Because our variant gives a weaker quantitative bound on the information-theoretic hardness of learning than [6], this gives a construction of polynomial-time-computable monotone functions that, assuming the existence of one-way functions, cannot be learned to accuracy 1/2 + 1/polylog(n) under the uniform distribution. While this answers the question posed above (even with "51%" in place of "99%"), the 1/polylog(n) factor is rather far from the O(log n/√n) factor that one might hope for as described above.
In Section 2 we use a different construction to obtain much stronger quantitative results; another advantage of this second construction is that it enables us to show hardness of learning monotone circuits rather than just circuits computing monotone functions. We start with the simple observation that using standard tools it is easy to construct polynomial-size monotone circuits computing "slice" functions that are pseudorandom on the middle layer of the Boolean cube {0, 1} n . Such functions are easily seen to be mildly hard to learn, i. e., hard to learn to accuracy 1 − Ω(1/ √ n). We then use the elegant machinery of hardness amplification of monotone functions pioneered by O'Donnell [28] to amplify the hardness of this construction to near-optimal levels (as summarized in rows 2-4 of Figure 1). We obtain our result for constant-depth, sub-polynomial-size circuits (row 5 of Figure 1) by augmenting this approach with an argument that, at a high level, is similar to one used in [2], by "scaling down" and modifying our hard-to-learn functions using a variant of Nepomnjaščiȋ's theorem [27].

Preliminaries
We consider Boolean functions of the form f : {0,1}^n → {0,1}. We view {0,1}^n as endowed with the natural partial order: x ≤ y if and only if x_i ≤ y_i for all i = 1, ..., n. A Boolean function f is monotone if x ≤ y implies f(x) ≤ f(y). Our work uses various standard definitions from the fields of circuit complexity, learning, and cryptographic pseudorandomness; for completeness we recall this material below.
Learning. As described earlier, all of our hardness results apply even to learning algorithms that may make membership queries, i.e., black-box queries to an oracle that gives the label f(x) of any example x ∈ {0,1}^n on which it is queried. It is clear that for learning with respect to the uniform distribution, having membership query access to the target function f is at least as powerful as being given uniform random examples labeled according to f, as the learner can simply generate uniform random strings for herself and query the oracle to simulate a random example oracle.
The goal of the learning algorithm is to construct a hypothesis h so that $\Pr_x[h(x) \neq f(x)]$ is small, where the probability is taken over the uniform distribution. We shall only consider learning algorithms that are allowed to run in poly(n) time, so the learning algorithm L may be viewed as a probabilistic polynomial-time oracle machine that is given black-box access to the function f and attempts to output a hypothesis h with small error relative to f. We establish that a class C of functions is hard to learn by showing that for a uniform random f ∈ C, the expected error of any poly(n)-time learning algorithm L is close to 1/2 when run with f as the target function. Thus we bound the quantity
$$\Pr_{f \in C,\ x \in \{0,1\}^n}\bigl[L^f(1^n) \to h;\ h(x) = f(x)\bigr], \tag{1.1}$$
where the probability is also taken over any internal randomization of the learning algorithm L. We say that a class C is hard to learn to accuracy 1/2 + ε(n) if, for every poly(n)-time membership query learning algorithm L (i.e., probabilistic polynomial-time oracle algorithm), we have that the above quantity (1.1) is smaller than 1/2 + ε(n) for all sufficiently large n. As noted in [6], it is straightforward to transform a lower bound of this sort into a lower bound for the usual ε, δ formulation of PAC learning.
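The two points above (membership queries can simulate random examples, and accuracy is measured over uniform x) can be illustrated with a small simulation. This is a sketch under stated assumptions: the target function (majority on 9 bits), the constant hypothesis, and the helper names `random_example` and `estimate_accuracy` are chosen only for illustration and do not come from the constructions in this paper.

```python
import random

def random_example(oracle, n, rng):
    """Simulate a uniform random example (x, f(x)) with one membership query."""
    x = tuple(rng.randint(0, 1) for _ in range(n))
    return x, oracle(x)

def estimate_accuracy(oracle, hypothesis, n, samples, rng):
    """Empirical estimate of Pr_x[h(x) = f(x)] over uniform x in {0,1}^n."""
    hits = 0
    for _ in range(samples):
        x, label = random_example(oracle, n, rng)
        hits += int(hypothesis(x) == label)
    return hits / samples

def majority(x):
    """Illustrative monotone target: 1 iff more than half of the bits are 1."""
    return int(2 * sum(x) > len(x))

rng = random.Random(0)
n = 9
acc_self = estimate_accuracy(majority, majority, n, 2000, rng)      # perfect hypothesis
acc_const = estimate_accuracy(majority, lambda x: 1, n, 2000, rng)  # trivial hypothesis
```

A class is hard to learn to accuracy 1/2 + ε(n) in the sense above when no efficient learner does noticeably better than the trivial hypothesis, whose accuracy on a balanced target is about 1/2.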
Circuit complexity. We shall consider various classes of circuits computing Boolean functions, including the classes NC¹ (polynomial-size, logarithmic-depth, bounded fan-in Boolean circuits), AC⁰ (polynomial-size, constant-depth, unbounded fan-in Boolean circuits), and TC⁰ (polynomial-size, constant-depth, unbounded fan-in Boolean circuits with MAJ gates).
A circuit is said to be monotone if it is composed entirely of AND/OR gates with no negations. Every monotone circuit computes a monotone Boolean function, but of course non-monotone circuits may compute monotone functions as well. The famous result of [30] shows that there are natural monotone Boolean functions (such as the perfect matching function) that can be computed by polynomial-size circuits but cannot be computed by quasi-polynomial-size monotone circuits, and Tardos [32] observed that the separation can be increased to an exponential gap.
Thus, in general, it is a stronger result to show that a function can be computed by a small monotone circuit than to show that it is monotone and can be computed by a small circuit.
Pseudorandom functions. Pseudorandom functions [12] are the main cryptographic primitive that underlies our constructions. Fix k(n) ≤ n, and let G be a family of functions g : {0,1}^{k(n)} → {0,1}, each of which is computable by a circuit of size poly(k(n)). We say that G is a t(n)-secure pseudorandom function family if the following condition holds: for any probabilistic t(n)-time oracle algorithm A, we have
$$\Bigl|\Pr_{g \in G}\bigl[A^{g}(1^n) \text{ outputs } 1\bigr] - \Pr_{g' \in G'}\bigl[A^{g'}(1^n) \text{ outputs } 1\bigr]\Bigr| \le 1/t(n),$$
where G′ is the class of all $2^{2^{k(n)}}$ functions from {0,1}^{k(n)} to {0,1} (so the second probability above is taken over the choice of a truly random function g′). Note that the purported distinguisher A has oracle access to a function on k(n) bits but is allowed to run in time t(n).
It is well known that a pseudorandom function family that is t(n)-secure for all polynomials t(n) can be constructed from any one-way function [12,14]. We shall use the following folklore quantitative variant that relates the hardness of the one-way function to the security of the resulting pseudorandom function: Proposition 1.2. Fix t(n) ≥ poly(n) and suppose there exist one-way functions that are hard to invert on average for t(n)-time adversaries. Then there exists a constant, 0 < c < 1, such that for any k(n) ≤ n, there is a pseudorandom family G of functions g :

Lower bounds via hardness amplification of monotone functions
In this section we prove our main hardness results, summarized in Figure 1, for learning various classes of monotone functions under the uniform distribution with membership queries.
Let us start with a high-level explanation of the overall idea. Inspired by the work on hardness amplification within NP initiated by O'Donnell [28,33,15], we study constructions of the form
$$f(x_1, \dots, x_m) = C\bigl(f'(x_1), \dots, f'(x_m)\bigr),$$
where C is a Boolean "combining function" with low noise stability (we give precise definitions later) that is both efficiently computable and monotone. Recall that O'Donnell showed that if f′ is weakly hard to compute and the combining function C has low noise stability, then f is very hard to compute. This result holds for general (not necessarily monotone) functions C, and thus generalizes Yao's XOR lemma, which addresses the case where C is the XOR of m bits (and hence has the lowest noise stability of all Boolean functions [28]).
Roughly speaking, we establish an analogue of O'Donnell's result for learning. Our analogue, given in Section 2.2, essentially states that for certain well-structured functions f′ that are hard to learn to high accuracy, if C has low noise stability then f is hard to learn to accuracy even slightly better than 1/2. As our ultimate goal is to establish that "simple" classes of monotone functions are hard to learn, we shall use this result with combining functions C that are computed by "simple" monotone Boolean circuits. In order for the overall function f to be monotone and efficiently computable, we need the initial f′ to be well-structured, monotone, efficiently computable, and hard to learn to high accuracy. Such functions are easily obtained by a slight extension of an observation of Kearns, Li, and Valiant [18]. They noted that the middle slice f′ of a random Boolean function on {0,1}^k is hard to learn to accuracy greater than 1 − Θ(1/√k) [6,18]; by taking the middle slice of a pseudorandom function instead, we obtain an f′ with the desired properties. In fact, by a result of Berkowitz [5] (see also [35,3]), this slice function is computable by a polynomial-size monotone circuit, so the overall hard-to-learn functions we construct are computed by polynomial-size monotone circuits.
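To make the shape of this construction concrete, the following Python sketch builds a function that applies a combining function C to m copies of a middle-slice function on disjoint k-bit blocks, and checks monotonicity by brute force. Several choices here are assumptions made only for illustration: majority stands in for the combining function (the actual C is constructed in Section 2.3), g is a truly random table on k = 3 bits rather than a pseudorandom function, the slice function uses the usual three-case definition (1 above the middle layer, g on it, 0 below), and all helper names are hypothetical.

```python
import itertools
import random

def slice_of(g, k):
    """Middle slice of g: 1 above layer k//2, g on that layer, 0 below (assumed form)."""
    mid = k // 2
    def f(x):
        w = sum(x)
        return 1 if w > mid else 0 if w < mid else g[x]
    return f

def combine(C, f_prime, k, m):
    """Return x -> C(f'(x_1), ..., f'(x_m)) where x is split into m blocks of k bits."""
    def f(x):
        return C(tuple(f_prime(x[i * k:(i + 1) * k]) for i in range(m)))
    return f

def maj(bits):
    """Stand-in monotone combining function (not the paper's C)."""
    return int(2 * sum(bits) > len(bits))

def is_monotone(func, n):
    """Brute force: raising any single 0 bit to 1 never lowers the output."""
    for x in itertools.product((0, 1), repeat=n):
        for i in range(n):
            if x[i] == 0 and func(x[:i] + (1,) + x[i + 1:]) < func(x):
                return False
    return True

rng = random.Random(2)
k, m = 3, 3
g = {x: rng.randint(0, 1) for x in itertools.product((0, 1), repeat=k)}
f = combine(maj, slice_of(g, k), k, m)
f_is_monotone = is_monotone(f, k * m)
```

Because both the slice function and the combining function are monotone, the composition is monotone regardless of how g behaves on the middle layer, which is what lets a pseudorandom g be plugged in.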

Organization
In Section 2.2 we adapt the analysis from [28,33,15] to reduce the problem of constructing hard-to-learn monotone Boolean functions to constructing monotone combining functions C with low noise stability. In Section 2.3 we show how constructions and analyses in [28,24] can be used to obtain a "simple" monotone combining function with low noise stability. In Section 2.4 we establish Theorems 2.8 and 2.9 (lines 2 and 3 of Figure 1) by making different assumptions about the hardness of the initial pseudorandom functions. Finally, in Section 3 we establish Theorems 3.2 and 3.5 by making specific number theoretic assumptions (namely, the hardness of factoring Blum integers) to obtain hard-to-learn monotone Boolean functions that can be computed by very simple circuits.

Preliminaries
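For g : {0,1}^k → {0,1}, we write slice(g) to denote the "middle slice" function:
$$\mathrm{slice}(g)(x) = \begin{cases} 1 & \text{if } |x| > \lfloor k/2 \rfloor, \\ g(x) & \text{if } |x| = \lfloor k/2 \rfloor, \\ 0 & \text{if } |x| < \lfloor k/2 \rfloor, \end{cases}$$
where |x| denotes the number of ones in the string x. It is immediate that slice(g) is a monotone Boolean function for any g.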
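As a quick sanity check of the monotonicity claim, the sketch below implements a middle-slice function for a random table g on k = 8 bits and verifies monotonicity exhaustively. It assumes the usual three-case form of slice(g) (1 above layer ⌊k/2⌋, g on that layer, 0 below); the names `make_slice` and `is_monotone` are hypothetical. It also computes the fraction of inputs lying on the middle layer, which is the only place where slice(g) is unpredictable.

```python
import itertools
import math
import random

def make_slice(g, k):
    """Return slice(g): 1 above the middle layer, g on it, 0 below (assumed definition)."""
    mid = k // 2
    def f(x):
        w = sum(x)
        if w > mid:
            return 1
        if w < mid:
            return 0
        return g[x]
    return f

def is_monotone(f, k):
    """Brute-force check that flipping any 0 to 1 never decreases f."""
    for x in itertools.product((0, 1), repeat=k):
        fx = f(x)
        for i in range(k):
            if x[i] == 0 and f(x[:i] + (1,) + x[i + 1:]) < fx:
                return False
    return True

rng = random.Random(1)
k = 8
g = {x: rng.randint(0, 1) for x in itertools.product((0, 1), repeat=k)}
slice_g = make_slice(g, k)
monotone = is_monotone(slice_g, k)

# Fraction of inputs on the middle layer: C(k, k/2) / 2^k, which scales like 1/sqrt(k).
middle_fraction = math.comb(k, k // 2) / 2 ** k
```

For k = 8 the middle layer holds 70 of the 256 inputs; as k grows this fraction shrinks proportionally to 1/√k, matching the "mildly hard to learn" behavior of slice functions.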
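Bias and noise stability. Following the analysis in [28,33,15], we shall study the bias and noise stability of various Boolean functions. Specifically, we adopt the following notations and definitions from [15]. The bias of a 0-1 random variable X is defined to be
$$\mathrm{Bias}[X] = \bigl|\Pr[X = 0] - \Pr[X = 1]\bigr|.$$
Recall that a probabilistic Boolean function h on {0,1}^k is a probability distribution over Boolean functions on {0,1}^k (so for each input x, the output h(x) is a 0-1 random variable). The expected bias of a probabilistic Boolean function h is
$$\mathrm{ExpBias}[h] = \mathbb{E}_x\bigl[\mathrm{Bias}[h(x)]\bigr].$$
Let C : {0,1}^m → {0,1} be a Boolean function and let 0 ≤ δ ≤ 1/2. The noise stability of C at noise rate δ is
$$\mathrm{NoiseStab}_\delta[C] = 2 \Pr_{x, \eta}\bigl[C(x) = C(x \oplus \eta)\bigr] - 1,$$
where x ∈ {0,1}^m is uniform random, η ∈ {0,1}^m is a vector whose bits are each independently 1 with probability δ, and ⊕ denotes bitwise XOR.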
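For small m, noise stability can be computed exactly by enumeration. The sketch below assumes the standard normalization NoiseStab_δ[C] = 2·Pr[C(x) = C(x ⊕ η)] − 1 with x uniform and each bit of η set independently with probability δ; the function name `noise_stability` and the three example functions are illustrative choices. It reproduces the fact that XOR is the least noise-stable of the three.

```python
import itertools

def noise_stability(C, m, delta):
    """Exact value of 2 * Pr[C(x) = C(x XOR eta)] - 1 by enumerating x and eta."""
    points = list(itertools.product((0, 1), repeat=m))
    agree_prob = 0.0
    for eta in points:
        weight = sum(eta)
        p_eta = delta ** weight * (1 - delta) ** (m - weight)
        agree = sum(C(x) == C(tuple(a ^ b for a, b in zip(x, eta))) for x in points)
        agree_prob += p_eta * agree / len(points)
    return 2 * agree_prob - 1

def parity(x):
    return sum(x) % 2

def dictator(x):
    return x[0]

def maj(x):
    return int(2 * sum(x) > len(x))

stab_parity = noise_stability(parity, 3, 0.1)
stab_dictator = noise_stability(dictator, 3, 0.1)
stab_maj = noise_stability(maj, 3, 0.1)
```

With δ = 0.1 on three bits, parity has stability (1 − 2δ)³ and the dictator has stability 1 − 2δ, with majority in between; monotone combining functions cannot match parity, which is why the choice of C matters.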

Hardness amplification for learning
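Throughout this subsection we write m for m(n) and k for k(n). We shall establish the following:
Lemma 2.1. Let C : {0,1}^m → {0,1} be a polynomial-time computable function. Let G′ be the family of all $2^{2^k}$ functions from {0,1}^k to {0,1}, where n = mk and k = ω(log n). Then the class $\mathcal{C} = \{ f = C \circ \mathrm{slice}(g)^{\otimes m} \mid g \in G' \}$ of Boolean functions over {0,1}^n is hard to learn to accuracy
$$\frac{1}{2} + \frac{1}{2}\sqrt{\mathrm{NoiseStab}_{\Theta(1/\sqrt{k})}[C]} + o(1/n).$$
This easily yields Corollary 2.3, which is an analogue of Lemma 2.1 with pseudorandom rather than truly random functions, and which we use to obtain our main hardness of learning results.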
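Proof of Lemma 2.1. Let k, m be such that mk = n, and let C : {0,1}^m → {0,1} be a Boolean combining function. We prove the lemma by establishing an upper bound on the probability
$$\Pr_{g \in G',\ x \in \{0,1\}^n}\bigl[L^{f}(1^n) \to h;\ h(x) = f(x)\bigr], \tag{2.1}$$
where L is an arbitrary probabilistic polynomial-time oracle machine (running in time poly(n) on input 1^n) that is given oracle access to $f \stackrel{\mathrm{def}}{=} C \circ \mathrm{slice}(g)^{\otimes m}$ and outputs some hypothesis h : {0,1}^n → {0,1}. We first observe that because C is computed by a uniform family of circuits of size poly(m) ≤ poly(n), it is easy for a poly(n)-time machine to simulate oracle access to f if it is given oracle access to g. So, the probability in (2.1) is at most
$$\Pr_{g \in G',\ x \in \{0,1\}^n}\bigl[L^{g}(1^n) \to h;\ h(x) = (C \circ \mathrm{slice}(g)^{\otimes m})(x)\bigr]. \tag{2.2}$$
To analyze the above probability, suppose that in the course of its execution L never queries g on any of the inputs x_1, ..., x_m ∈ {0,1}^k, where x = (x_1, ..., x_m). Then the a posteriori distribution of g(x_1), ..., g(x_m) (for uniform random g ∈ G′), given the responses to the queries of L that it received from g, is identical to the distribution of g′(x_1), ..., g′(x_m), where g′ is an independent uniform draw from G′: both distributions are uniform random over {0,1}^m. (Intuitively, this just means that if L never queries the random function g on any of x_1, ..., x_m, then giving L oracle access to g does not help it predict the value of f on x = (x_1, ..., x_m).) As L runs in poly(n) time, for any fixed x_1, ..., x_m the probability that L queried g on any of x_1, ..., x_m is at most m · poly(n)/2^k. Hence (2.2) is bounded by
$$\Pr_{g, g' \in G',\ x \in \{0,1\}^n}\bigl[L^{g}(1^n) \to h;\ h(x) = (C \circ \mathrm{slice}(g')^{\otimes m})(x)\bigr] + \frac{m \cdot \mathrm{poly}(n)}{2^k}. \tag{2.3}$$
The first summand in (2.3) is the probability that L correctly predicts the value $(C \circ \mathrm{slice}(g')^{\otimes m})(x)$, given oracle access to g, where g and g′ are independently random functions and x is uniform over {0,1}^n. It is clear that the best possible strategy for L is to use a maximum likelihood algorithm, i.e., predict according to the function h that, for any fixed input x, outputs 1 if and only if the random variable $(C \circ \mathrm{slice}(g')^{\otimes m})(x)$ (which we emphasize has randomness over the choice of g′) is biased towards 1. The expected accuracy of this h is precisely
$$\frac{1}{2} + \frac{1}{2}\,\mathrm{ExpBias}\bigl[C \circ \mathrm{slice}(g')^{\otimes m}\bigr]. \tag{2.4}$$
Now fix $\delta' = \binom{k}{\lfloor k/2 \rfloor}/2^k = \Theta(1/\sqrt{k})$ to be the fraction of inputs in the "middle slice" of {0,1}^k. We observe that the probabilistic function slice(g′) (where g′ is truly random) is "δ′-random" in the sense of Definition 3.1 of [15], meaning that it is balanced, truly random on inputs in the middle slice, and deterministic on all other inputs. This means that we may apply the following technical lemma (Lemma 3.7 from [15], see also [28]):
Lemma 2.2. Let h : {0,1}^k → {0,1} be a function that is δ-random. Then
$$\mathrm{ExpBias}\bigl[C \circ h^{\otimes m}\bigr] \le \sqrt{\mathrm{NoiseStab}_{\delta}[C]}.$$
Applying this lemma to the function slice(g′) we obtain
$$\mathrm{ExpBias}\bigl[C \circ \mathrm{slice}(g')^{\otimes m}\bigr] \le \sqrt{\mathrm{NoiseStab}_{\delta'}[C]}. \tag{2.5}$$
Combining (2.3), (2.4), and (2.5) with the fact that m · poly(n)/2^k is negligible when k = ω(log n) gives the lemma.
Corollary 2.3. Let C : {0,1}^m → {0,1} be a polynomial-time computable function. Let G be a pseudorandom family of functions from {0,1}^k to {0,1} that is secure against poly(n)-time adversaries, where n = mk and k = ω(log n). Then the class $\mathcal{C} = \{ f = C \circ \mathrm{slice}(g)^{\otimes m} \mid g \in G \}$ of Boolean functions over {0,1}^n is hard to learn to accuracy
$$\frac{1}{2} + \frac{1}{2}\sqrt{\mathrm{NoiseStab}_{\Theta(1/\sqrt{k})}[C]} + o(1/n).$$
Proof. The corollary follows from the fact that (2.2) must differ from its pseudorandom counterpart,
$$\Pr_{g \in G,\ x \in \{0,1\}^n}\bigl[L^{g}(1^n) \to h;\ h(x) = (C \circ \mathrm{slice}(g)^{\otimes m})(x)\bigr], \tag{2.6}$$
by less than any fixed 1/poly(n). Otherwise, we would easily obtain a poly(n)-time distinguisher that, given oracle access to g, runs L to obtain a hypothesis h and checks whether $h(x) = (C \circ \mathrm{slice}(g)^{\otimes m})(x)$ for a random x to determine whether g is drawn from G or G′.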
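The maximum-likelihood step in the proof of Lemma 2.1 can be checked numerically on a toy instance. The sketch below is illustrative and rests on assumptions: it takes Bias[X] = |Pr[X = 0] − Pr[X = 1]|, uses majority as a stand-in combining function, and uses tiny parameters k = 2, m = 3 so that every truly random choice of the inner function on the middle layer can be enumerated. It confirms that the best predictor with no information about the inner function has accuracy exactly 1/2 plus half the expected bias.

```python
import itertools

K, M = 2, 3                      # toy block length and number of blocks
MID = K // 2
MIDDLE = [x for x in itertools.product((0, 1), repeat=K) if sum(x) == MID]

def slice_value(g, block):
    """Middle-slice value: 1 above the middle layer, g on it, 0 below (assumed form)."""
    w = sum(block)
    if w > MID:
        return 1
    if w < MID:
        return 0
    return g[block]

def maj(bits):
    """Stand-in combining function, used only for illustration."""
    return int(2 * sum(bits) > len(bits))

# Every truly random inner function, restricted to the middle layer
# (its values elsewhere never affect the slice function).
all_g = [dict(zip(MIDDLE, vals))
         for vals in itertools.product((0, 1), repeat=len(MIDDLE))]

inputs = list(itertools.product((0, 1), repeat=K * M))
accuracy = 0.0
exp_bias = 0.0
for x in inputs:
    blocks = [x[i * K:(i + 1) * K] for i in range(M)]
    p_one = sum(maj([slice_value(g, b) for b in blocks]) for g in all_g) / len(all_g)
    accuracy += max(p_one, 1 - p_one) / len(inputs)   # maximum-likelihood guess at x
    exp_bias += abs(1 - 2 * p_one) / len(inputs)      # bias of the output at x
```

The identity holds pointwise because max(p, 1 − p) = 1/2 + |1 − 2p|/2, so bounding the expected bias bounds the accuracy of every learner that has not queried the relevant inputs.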
By instantiating Corollary 2.3 with a "simple" monotone function C having low noise stability, we obtain strong hardness results for learning simple monotone functions. We exhibit such a function C in the next section.

A simple monotone combining function with low noise stability
In this section we combine known results of [28,24] to obtain: for every monotone m-variable function C, so the above upper bound is fairly close to the best possible (within a polylog(m) factor if $m = 2^{k^{\Theta(1)}}$). Following [28,15], we use the "recursive majority of 3" function and the "tribes" function in our construction. We require the following results on noise stability: Then for $\ell \ge \log_{1.1}(1/\delta)$ Then if η = O(1/d), we have Setting δ = Θ(1/√k) and recalling that $3^{\ell} \le k^6$, we have $\ell \ge \log_{1.1}(1/\delta)$ so we may apply Lemma 2.5 to obtain As $O(k^{-.35}) \le O(1/d)$, we may apply Lemma 2.6 with the previous inequalities to obtain The bound (2.7) follows from a rearrangement of the bounds on k, m, d and ℓ. It is easy to see that C can be computed by monotone circuits of depth O(ℓ) = O(log m) and size poly(m). This completes the proof.
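The two building blocks named above are easy to state in code. The following Python sketch defines recursive majority of 3 on 3^ℓ bits and the tribes function on d·2^d bits, and composes them by feeding each tribes input with its own copy of recursive majority, which gives a function on 3^ℓ·d·2^d bits. The composition order and the helper names are assumptions made for illustration; they are chosen to be consistent with the input length m = 3^ℓ d 2^d used in the next section.

```python
def rec_maj3(bits):
    """Recursive majority of 3 on 3**l bits (depth-l ternary tree of MAJ3 gates)."""
    if len(bits) == 1:
        return bits[0]
    third = len(bits) // 3
    votes = [rec_maj3(bits[i * third:(i + 1) * third]) for i in range(3)]
    return int(sum(votes) >= 2)

def tribes(bits, d):
    """OR of 2**d disjoint ANDs, each over d bits (d * 2**d inputs in total)."""
    assert len(bits) == d * 2 ** d
    return int(any(all(bits[j * d:(j + 1) * d]) for j in range(2 ** d)))

def combining_function(bits, l, d):
    """Tribes on d * 2**d disjoint copies of recursive majority (assumed composition)."""
    block = 3 ** l
    assert len(bits) == block * d * 2 ** d
    inner = [rec_maj3(bits[i * block:(i + 1) * block]) for i in range(d * 2 ** d)]
    return tribes(inner, d)

example_rec = rec_maj3((1, 1, 0, 0, 0, 0, 1, 0, 1))
example_tribes_hit = tribes((1, 1, 0, 0, 0, 0, 0, 0), 2)
example_tribes_miss = tribes((1, 0, 1, 0, 1, 0, 1, 0), 2)
example_combined = combining_function((1, 1, 0, 0, 0, 0), 1, 1)
```

Both building blocks use only AND, OR, and majority of three bits, so the composed function is monotone and has a circuit of depth logarithmic in its input length.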

Nearly optimal hardness of learning polynomial-size monotone circuits
Given a value of k, let $m = 3^{\ell} d 2^{d}$ for ℓ, d as in Proposition 2.4. Let G be a pseudorandom family of functions g : {0,1}^k → {0,1} secure against poly(n)-time adversaries, where n = mk. Given that we have k = ω(log n), we may apply Corollary 2.3 with the combining function from Proposition 2.4 and conclude that the class $\mathcal{C} = \{ C \circ \mathrm{slice}(g)^{\otimes m} \mid g \in G \}$ is hard to learn to accuracy
$$\frac{1}{2} + \frac{1}{2}\sqrt{\mathrm{NoiseStab}_{\Theta(1/\sqrt{k})}[C]} + o(1/n) \le \frac{1}{2} + O\!\left(\frac{k^{3.5}\sqrt{\log n}}{\sqrt{n}}\right). \tag{2.8}$$
We claim that the functions in $\mathcal{C}$ can, in fact, be computed by poly(n)-size monotone circuits. This follows from a result of Berkowitz [5] that states that if a k-variable slice function is computed by a Boolean circuit of size s and depth d, then it is also computed by a monotone Boolean circuit with MAJ gates of size O(s + k) and depth d + 1. Combining these monotone circuits for slice(g) with the monotone circuit for C, we obtain a poly(n)-size monotone circuit for each function in $\mathcal{C}$.
By making various different assumptions on the hardness of one-way functions, Proposition 1.2 lets us obtain different quantitative relationships between k (the input length for the pseudorandom functions) and n (the running time of the adversaries against which they are secure), and thus different quantitative hardness results from (2.8) above:
Theorem 2.8. Suppose that standard one-way functions exist. Then for any constant ε > 0 there is a class C of poly(n)-size monotone circuits that is hard to learn to accuracy $1/2 + 1/n^{1/2-\varepsilon}$.
Proof. If poly(n)-hard one-way functions exist then we may take $k = n^{c}$ in Proposition 1.2 for an arbitrarily small constant c; this corresponds to taking d = γ log k for γ a large constant in Proposition 2.4. The claimed bound on (2.8) easily follows. (We note that while not every n is of the required form $mk = 3^{\ell} d 2^{d} k$, it is not difficult to see that this and our subsequent theorems hold for all (sufficiently large) input lengths n by padding the hard-to-learn functions.)
Theorem 2.9. Suppose that sub-exponentially hard ($2^{n^{\alpha}}$-hard for some fixed α > 0) one-way functions exist. Then there is a class C of poly(n)-size monotone circuits that is hard to learn to accuracy 1/2 + polylog(n)/√n.
Proof. As above, but now we take $k = \log^{\gamma} n$ for some sufficiently large constant γ (i.e., d = c log k for a small constant c).

Hardness of learning simple circuits
In this section we obtain hardness results for learning very simple classes of circuits computing monotone functions under a concrete hardness assumption for a specific computational problem, namely factoring Blum integers. Naor and Reingold [26] showed that if factoring Blum integers is computationally hard then there is a pseudorandom function family, which we denote G, that is computable in TC⁰. From this it easily follows that the functions {slice(g) | g ∈ G} are also computable in TC⁰. We now observe that the result of Berkowitz [5] mentioned earlier for converting slice circuits into monotone circuits applies not only to Boolean circuits, but also to TC⁰ circuits. This means that the functions in {slice(g) | g ∈ G} are in fact computable in monotone TC⁰, i.e., by polynomial-size, constant-depth circuits composed only of AND/OR/MAJ gates. As the majority function can be computed by polynomial-size, O(log n)-depth monotone Boolean circuits (see, e.g., [1]), the functions in {slice(g) | g ∈ G} are computable by O(log n)-depth monotone Boolean circuits. Finally, using the parameters in Theorem 2.8 we have a combining function C that is an O(log n)-depth poly-size monotone Boolean circuit, which implies the following lemma:
Theorem 3.2. If factoring Blum integers is hard on average for any poly(n)-time algorithm, then for any constant ε > 0 there is a class C of poly(n)-size monotone NC¹ circuits that is hard to learn to accuracy $1/2 + 1/n^{1/2-\varepsilon}$.
Now we show that under a stronger but still plausible assumption on the hardness of factoring Blum integers, we get a hardness result for learning a class of constant-depth monotone circuits that is very close to a class known to be learnable to any constant accuracy in poly(n) time. Suppose that n-bit Blum integers are $2^{n^{\alpha}}$-hard to factor on average for some fixed α > 0 (which is the same hardness assumption that was earlier used by Kharitonov [19]). This means there exist $2^{n^{\alpha/2}}$-secure pseudorandom functions that are computable in TC⁰. Using such a family of functions in place of G in the construction for the preceding theorem and fixing ε = 1/3, we obtain:
Lemma 3.3. Assume that Blum integers are $2^{n^{\alpha}}$-hard to factor on average. Then there is a class C of poly(n)-size monotone NC¹ circuits that is hard for any $2^{n^{\alpha/20}}$-time algorithm to learn to accuracy $1/2 + 1/n^{1/6}$.
Now we "scale down" this class C as follows. Let n′ be such that n′ = (log n)^κ for a suitable constant κ > 20/α, and let us substitute n′ for n in the construction of the previous lemma; we call the resulting class of functions C′. In terms of n, the functions in C′ (which are functions over {0,1}^n that only depend on the first n′ variables) are computed by $(\log n)^{O(\kappa)}$-size, O(log log n)-depth monotone circuits whose inputs are the first (log n)^κ variables in x_1, ..., x_n. We moreover have that C′ is hard for any $2^{(n')^{\alpha/20}} = 2^{(\log n)^{\kappa\alpha/20}} = \omega(\mathrm{poly}(n))$-time algorithm to learn to some accuracy $1/2 + 1/(n')^{1/6} = 1/2 + o(1)$. We now recall the following variant of Nepomnjaščiȋ's theorem that is implicit in [2]. As every function in C′ can be computed in NC¹, which is contained in NL, combining Lemma 3.4 with the paragraph that precedes it, we obtain the following theorem (final line of Figure 1):
This final hardness result is of interest because it is known that constant-depth circuits of only slightly smaller size can be learned to any constant accuracy in poly(n) time under the uniform distribution (without needing membership queries): Theorem 3.5 is thus nearly optimal in terms of the size of the constant-depth circuits for which it establishes hardness of learning.

A computational analogue of the Blum-Burch-Langford lower bound
In this section we first present a simple variant of the lower bound construction in [6], obtaining an information-theoretic lower bound on the learnability of the general class of all monotone Boolean functions. The quantitative bound our variant achieves is weaker than that of [6], but has the advantage that it can be easily derandomized. Indeed, as mentioned in Section 5 (and further discussed below), our construction uses a certain probability distribution over monotone DNFs, such that a typical random input x satisfies only poly(n) many "candidate terms" (which are terms that may be present in a random DNF drawn from our distribution). By selecting terms for inclusion in the DNF in a pseudorandom rather than truly random way, we obtain a class of poly(n)-size monotone circuits that is hard to learn to accuracy 1/2 + 1/polylog(n) (assuming one-way functions exist).
Below we start with an overview of why it is difficult to obtain a computational analogue of the exact construction of [6] using the pseudorandom approach sketched above, and of the idea behind our variant, which overcomes this difficulty. We then provide our information-theoretic construction and analysis, followed by its computational analogue.

Idea
Recall the information-theoretic lower bound from [6]. It works by defining a distribution $P_s$ over monotone functions of the form $\{0,1\}^n \to \{0,1\}$ as follows. (Here $s$ is a numerical parameter which should be thought of as the number of membership queries that a learning algorithm is allowed to make.) Take $t = \log(3sn)$. A draw from $P_s$ is obtained by randomly including each length-$t$ monotone term in the DNF independently with probability $p$, where $p$ is chosen so that the function is expected to be balanced on "typical inputs" (more precisely, on inputs with exactly $n/2$ ones). The naive idea for derandomizing this construction is to simply use a pseudorandom function with bias $p$ to determine whether each possible term of size $t$ should be included or excluded in the DNF. However, there is a problem with this approach: we do not know an efficient way to determine whether a typical example $x$ (with, say, $n/2$ ones) has any of its $\binom{n/2}{t}$ candidate terms (each of which is pseudorandomly present/not present in $f$) actually present in $f$, so we do not know how to evaluate $f$ on a typical input $x$ in less than $\binom{n/2}{t}$ time.
We get around this difficulty by instead considering a new distribution of random monotone DNFs. In our construction we will again use a random function with bias p to determine whether each possible term of length t is present in the DNF. However, in our construction, a typical example x will have only a polynomial number of candidate terms that could be satisfied, and thus it is possible to check all of them and evaluate the function in poly(n) time.
The main difficulty of this approach is to ensure that although a typical example has only a polynomial number of candidate terms, the function is still hard to learn in polynomial time. We achieve this by partitioning the variables into blocks of size $k$ and viewing each block as a "super-variable" (corresponding to the AND of all $k$ variables in the block). We then construct the DNF by randomly choosing length-$t$ terms over these super-variables. It is not difficult to see that with this approach, we can equivalently view our problem as learning a $t$-DNF $f$ with terms chosen as above, where each of the $n/k$ variables is drawn from a product distribution with bias $1/2^k$. By fine-tuning the parameters that determine $t$ (the size of each term of the DNF) and $k$ (the size of the partitions), we are able to achieve an information-theoretic lower bound showing that this distribution over monotone functions is hard to learn to accuracy $1/2 + o(1)$.

Construction
Let us partition the variables $x_1, \dots, x_n$ into $m = n/k$ blocks $B_1, \dots, B_m$ of $k$ variables each. Let $X_i$ denote the conjunction of all $k$ variables in $B_i$ ($X_1, \dots, X_m$ are the super-variables). The following is a description of our distribution $P$ over monotone functions. A function $f$ is drawn from $P$ as follows (we fix the values of $k, t$ later):

• Construct a monotone DNF $f_1$ as follows: each possible conjunction of $t$ super-variables chosen from $\{X_1, \dots, X_m\}$ is placed in the target function $f_1$ independently with probability $p$, where $p$ is defined as the solution to:

$$(1-p)^{\binom{m/2^k}{t}} = \frac{1}{2}. \qquad (4.1)$$

Note that for a uniform choice of $x \in \{0,1\}^n$, we expect $m/2^k$ ones in the corresponding "superassignment" $X = (X_1, \dots, X_m)$, and any superassignment with this many ones satisfies $\binom{m/2^k}{t}$ many candidate terms. Thus $p$ is chosen such that a "typical" example $X$, with $m/2^k$ ones, has probability $1/2$ of being labeled positive under $f_1$.

• Let $f(x) = 1$ if the number of supervariables satisfied in $x$ is greater than $m/2^k + (m/2^k)^{2/3}$, and let $f(x) = f_1(x)$ otherwise.

Note that because of the final step of the construction, the function $f$ is not actually a DNF (though it is a monotone function). Intuitively, the final step is there because if too many supervariables were satisfied in $x$, there could be too many (more than poly(n)) candidate terms to check, and we would not be able to evaluate $f_1$ efficiently. We will show later that the probability that the number of supervariables satisfied in $x$ is greater than $m/2^k + (m/2^k)^{2/3}$ is at most $2e^{-(m/2^k)^{1/3}/3} = 1/n^{\omega(1)}$, and thus the function $f$ is $1/n^{\omega(1)}$-close to $f_1$; so hardness of learning results established for the random DNFs $f_1$ carry over to the actual functions $f$. For most of our discussion we shall refer to $P$ as a distribution over DNFs, meaning the functions $f_1$.
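The construction can be illustrated on toy parameters with the following minimal sketch. It is not the authors' code: the function names, the toy values of $n$, $k$, $t$, and the reading of the final step as "output 1 when too many supervariables are satisfied" (the value that keeps $f$ monotone) are assumptions made for illustration, and truly random coins are used for the terms.

```python
import itertools
import math
import random


def bias_p(typical_ones: int, t: int) -> float:
    """Solve (1 - p)^C(typical_ones, t) = 1/2 for p (the balance condition)."""
    num_terms = math.comb(typical_ones, t)
    return 1.0 - 2.0 ** (-1.0 / num_terms)


def supervariables(x, k):
    """AND together each consecutive block of k bits of x."""
    return [int(all(x[i:i + k])) for i in range(0, len(x), k)]


def sample_f1(m, t, p, rng):
    """Include each t-subset of the m supervariables independently w.p. p."""
    return {term for term in itertools.combinations(range(m), t)
            if rng.random() < p}


def evaluate_f(x, k, t, terms, threshold):
    """Evaluate f: the threshold step first, then the DNF f_1."""
    X = supervariables(x, k)
    ones = [i for i, bit in enumerate(X) if bit]
    if len(ones) > threshold:
        # Final step of the construction (assumed form): too many
        # supervariables are satisfied, so output 1 without checking terms.
        return 1
    # Only the C(len(ones), t) candidate terms over satisfied
    # supervariables can possibly be satisfied by x.
    return int(any(c in terms for c in itertools.combinations(ones, t)))
```

For example, with $n = 24$ and $k = 2$ there are $m = 12$ supervariables and $m/2^k = 3$ expected ones, so `bias_p(3, 2)` gives the inclusion probability for terms of length $t = 2$ and the threshold is `3 + 3 ** (2 / 3)`.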

Information-theoretic lower bound
As discussed previously, we view the distribution $P$ defined above as a distribution over DNFs of terms of size $t$ over the supervariables. Each possible combination of $t$ supervariables appears in $f_1$ independently with probability $p$, and each supervariable is drawn independently, taking value 1 with probability $1/2^k$ and 0 with probability $1 - 1/2^k$. We first observe that learning $f$ over the supervariables drawn from the product distribution is equivalent to learning the original function over the original variables. This is because if we are given the original membership query oracle for $n$-bit examples we can simulate answers to membership queries on $m$-bit "supervariable" examples and vice versa. Thus we henceforth analyze the product distribution.
We follow the proof technique of [6]. To simplify our analysis, we consider an "augmented" oracle, as in [6]. Given a query $X$, with ones in positions indexed by the set $S_X$, the oracle will return the first conjunct in lexicographic order that appears in the target function and is satisfied by $X$. Additionally, the oracle returns 1 if $X$ is positive and 0 if $X$ is negative. (So instead of just giving a single bit as its response, if the example is a positive one the oracle tells the learner the lexicographically first term in the target DNF that is satisfied.) Clearly, lower bounds for this augmented oracle imply the same bounds for the standard oracle.
We are interested in analyzing $P_s$, the conditional distribution over functions drawn from the initial distribution $P$ that are consistent with the information learned by the learning algorithm $A$ in the first $s$ queries. We can think of $P_s$ as a vector $V_s$ of $\binom{m}{t}$ elements, one for each possible conjunct of size $t$. Initially, each element of the vector contains $p$, the probability that the conjunct is in the target function. When a query $X$ is made, the oracle examines one by one the entries that are satisfied by $X$. For each entry having value $p$, we can think of the oracle as flipping a coin and replacing the entry by 0 with probability $1 - p$ and by 1 with probability $p$. After $s$ queries, $V_s$ will contain some entries set to 0, some set to 1 and the rest set to $p$. Because $V_s$ describes the conditional distribution $P_s$ given the queries made so far, the Bayes-optimal prediction for an example $X$ is simply to answer 1 if $V_s(X) \ge 1/2$ and 0 otherwise.
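The vector view of $P_s$ described above can be sketched as a lazily sampled oracle. This is a minimal illustration under assumptions: the class and method names are hypothetical, the vector is stored sparsely (an absent key stands for an entry still equal to $p$), and `posterior` computes the quantity written $V_s(X)$ in the text.

```python
import itertools
import random


class LazyOracle:
    """Augmented membership oracle with lazily flipped term coins.

    state[term] is absent while the entry still holds p, and becomes
    True (entry 1) or False (entry 0) once its coin has been flipped.
    """

    def __init__(self, t, p, seed=0):
        self.t = t
        self.p = p
        self.rng = random.Random(seed)
        self.state = {}

    def query(self, X):
        """Return (label, lexicographically first present term or None)."""
        ones = [i for i, bit in enumerate(X) if bit]
        # combinations() over a sorted list yields terms in lexicographic order.
        for term in itertools.combinations(ones, self.t):
            if term not in self.state:
                self.state[term] = self.rng.random() < self.p
            if self.state[term]:
                return 1, term  # stop at the first term that is present
        return 0, None

    def posterior(self, X):
        """Probability that X is positive given the coins flipped so far."""
        ones = [i for i, bit in enumerate(X) if bit]
        undecided = 0
        for term in itertools.combinations(ones, self.t):
            if term not in self.state:
                undecided += 1
            elif self.state[term]:
                return 1.0  # a satisfied entry is already known to be 1
        return 1.0 - (1.0 - self.p) ** undecided

    def bayes_predict(self, X):
        """Bayes-optimal label: 1 iff the posterior is at least 1/2."""
        return int(self.posterior(X) >= 0.5)
```

Because `query` stops at the first present term, each query sets at most one entry to 1, which matches the observation that after $s$ queries at most $s$ entries of the vector equal one.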
We now analyze $V_s(X)$, the conditional probability (over functions drawn from $P$ that are consistent with the answers to the first $s$ queries) that a random example $X$ drawn from the product distribution evaluates to 1. We will show that for $s = \mathrm{poly}(n)$, for $X$ drawn from the product distribution on $\{0,1\}^m$, with probability at least $1 - 1/n^{\omega(1)}$ the value $V_s(X)$ lies in $1/2 \pm 1/\log n$. This is easily seen to give a lower bound of the type we require.
Following [6], we first observe that after $s$ queries there can be at most $s$ entries set to one in the vector $V_s$. We shall also use a lemma from [6], by which we may henceforth assume that there are at most $2s/p$ zeros in $V_s$.

We now establish the following, which is an analogue (tailored to our setting) of Claim 3 of [6]: for any vector $V_s$ of size $\binom{m}{t}$ with at most $s$ entries set to 1, at most $2s/p$ entries set to 0, and the remaining entries set to $p$, for a random example $X$ (drawn from $\{0,1\}^m$ according to the $1/2^k$-biased product distribution), we have that with probability at least $1 - \varepsilon_1$, the quantity $V_s(X)$ lies in the range

$$1 - (1-p)^{\binom{m/2^k - (m/2^k)^{2/3}}{t} - \frac{2s\sqrt{n}}{p}\,2^{-kt}} \;\le\; V_s(X) \;\le\; 1 - (1-p)^{\binom{m/2^k + (m/2^k)^{2/3}}{t}}. \qquad (4.2)$$

Here $\varepsilon_1 = s2^{-kt} + \frac{1}{\sqrt{n}} + 2e^{-(m/2^k)^{1/3}/3}$.

Proof. Let $X$ be a random example drawn from the $1/2^k$-biased product distribution over $\{0,1\}^m$, and consider the following 3 events:

• None of the 1-entries in $V_s$ are satisfied by $X$. There are at most $s$ 1-entries in $V_s$ and the probability that any given one is satisfied by $X$ is $2^{-kt}$. Therefore the probability that some 1-entry is satisfied by $X$ is at most $s2^{-kt}$ and the probability that none of the 1-entries in $V_s$ are satisfied by $X$ is at least $1 - s2^{-kt}$.

• At most $(2s\sqrt{n}/p)2^{-kt}$ of the 0-entries in $V_s$ are satisfied by $X$. Because there are at most $2s/p$ entries set to 0 in $V_s$, the expected number of 0-entries in $V_s$ satisfied by $X$ is at most $(2s/p)2^{-kt}$. By Markov's inequality, the probability that the actual number exceeds this by a $\sqrt{n}$ factor is at most $1/\sqrt{n}$.

• The number of ones in $X$ lies in the range $m/2^k \pm (m/2^k)^{2/3}$. Using a multiplicative Chernoff bound, we have that this occurs with probability at least $1 - 2e^{-(m/2^k)^{1/3}/3}$. Note that for any $X$ in this range, $f(X) = f_1(X)$. So, conditioning on this event occurring, we can assume that $f(X) = f_1(X)$.
Therefore, by a union bound, the probability that all 3 of the above events occur is at least $1 - \varepsilon_1$, where $\varepsilon_1 = s2^{-kt} + 1/\sqrt{n} + 2e^{-(m/2^k)^{1/3}/3}$. Given that these events all occur, we show that $V_s(X)$ lies in the desired range. We follow the approach of [6]. For the lower bound, $V_s(X)$ is minimized when $X$ has as few ones as possible and when as many of the 0-entries in $V_s$ are satisfied by $X$ as possible. So $V_s(X)$ is at least

$$1 - (1-p)^{\binom{m/2^k - (m/2^k)^{2/3}}{t} - \frac{2s\sqrt{n}}{p}\,2^{-kt}}.$$

For the upper bound, $V_s(X)$ is maximized when $X$ has as many ones as possible and as few zeros as possible. So $V_s(X)$ is at most

$$1 - (1-p)^{\binom{m/2^k + (m/2^k)^{2/3}}{t}},$$

which completes the proof.

Now let us choose values for $k$ and $t$. What are our goals in setting these parameters? First off, we want $\binom{m/2^k}{t}$ to be at most poly(n) (so that there are at most poly(n) candidate terms to be checked for a "typical" input). Moreover, for any $s = \mathrm{poly}(n)$ we want both sides of (4.2) to be close to $1/2$ (so the accuracy of any $s$-query learning algorithm is indeed close to $1/2$ on typical inputs), and we want $\varepsilon_1$ to be small (so almost all inputs are "typical"). With this motivation, we set $k = \Theta(\log n)$ to be such that $m/2^k$ (recall, $m = n/k$) equals $\log^6 n$, and we set $t = \sqrt{\log n}$. This means

$$\binom{m/2^k}{t} = \binom{\log^6 n}{\sqrt{\log n}} \le (\log^6 n)^{\sqrt{\log n}} = 2^{6\log(\log n)\sqrt{\log n}},$$

which is $n^{o(1)}$.

Now we analyze (4.2). By the definition of $p$ from (4.1), $(1-p)^{E} = 2^{-E/\binom{m/2^k}{t}}$ for any exponent $E$, so it suffices to compare each exponent in (4.2) with $\binom{m/2^k}{t}$. With the parameters above, the exponent in the lower bound is a $(1 - o(1))$ fraction of $\binom{m/2^k}{t}$ and the exponent in the upper bound is a $(1 + o(1))$ multiple of it, so both sides of (4.2) are within $1/\log n$ of $1/2$. The above analysis has thus established the following.
Lemma 4.3. Let $L$ be any poly(n)-time learning algorithm. If $L$ is run with a target function that is a random draw $f$ from the distribution $P$ described above, then for all but a $1/n^{\omega(1)}$ fraction of inputs $x \in \{0,1\}^n$, the probability that $h(x) = f(x)$ (where $h$ is the hypothesis output by $L$) is at most $1/2 + 1/\log n$.

It is easy to see that by slightly modifying the values of $t$ and $k$ in the above construction, it is actually possible to replace $1/\log n$ with any $1/\mathrm{polylog}\, n$ in the above lemma.
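The effect of the parameter choice above on the number of candidate terms can be checked numerically in the log domain. The sketch below is illustrative only: the function names are hypothetical, logarithms are assumed to be base 2, $\sqrt{\log n}$ is rounded down to an integer, and $n$ is specified through $\log n$ because the asymptotics only become visible for very large $n$.

```python
import math


def log2_binomial(a: int, b: int) -> float:
    """log2 of C(a, b), summed factor by factor to avoid huge intermediates."""
    return sum(math.log2(a - i) - math.log2(i + 1) for i in range(b))


def candidate_term_bits(log_n: int) -> float:
    """log2 of the candidate-term count with m/2^k = (log n)^6, t = sqrt(log n)."""
    typical_ones = log_n ** 6      # typical number of satisfied supervariables
    t = math.isqrt(log_n)          # term length, rounded down
    return log2_binomial(typical_ones, t)


def naive_term_bits(log_n: int) -> float:
    """log2 of C(n/2, log n): candidate terms without supervariables."""
    return log2_binomial(2 ** (log_n - 1), log_n)
```

For instance, at $\log n = 2^{14}$ the first quantity is below $\log n$ (fewer than $n$ candidate terms), while the second is larger than $\log n$ by several orders of magnitude, reflecting a candidate-term count that is superpolynomial in $n$.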

Computational lower bound
To obtain a computational analogue of Lemma 4.3, we make a pseudorandom choice of terms in a draw of $f_1$ from $P$.
Recall that the construction of $P$ placed each possible term (conjunction of $t$ supervariables) in the target function with probability $p$, as defined in (4.1). We first consider a distribution that uses uniform bits to approximate the probability $p$. This can be done by approximating $\log(p^{-1})$ with poly(n) bits, associating each term with independent uniform poly(n) bits chosen this way, and including that term in the target function if all bits are set to 0. It is easy to see that the resulting construction yields a probability distribution that is statistically close to $P$, and we denote it by $P^{\mathrm{stat}}$. Now, using a pseudorandom function rather than a truly random (uniform) one for the source of uniform bits will yield a distribution, which we denote by $P^{\mathrm{PSR}}$. Similar arguments to those we give elsewhere in the paper show that a poly(n)-time adversary cannot distinguish the resulting construction from the original one (or else a distinguisher could be constructed for the pseudorandom function).
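The idea of deriving each term's inclusion coin from a keyed function can be sketched as follows. This is an illustration under stated assumptions, not the construction analyzed above: HMAC-SHA256 stands in for the pseudorandom function, the function name and term encoding are hypothetical, and a threshold comparison on the 256 output bits is swapped in for the all-zero-bits test described in the text.

```python
import hashlib
import hmac


def term_present(key: bytes, term: tuple, p: float) -> bool:
    """Pseudorandom coin with bias roughly p for one candidate term.

    HMAC-SHA256 is an assumed stand-in for the pseudorandom function;
    the same (key, term) pair always yields the same answer.
    """
    message = ",".join(map(str, term)).encode()
    digest = hmac.new(key, message, hashlib.sha256).digest()
    value = int.from_bytes(digest, "big")   # 256 pseudorandom bits
    # Threshold test (an assumed variant): accept if value < p * 2^256.
    return value < int(p * 2 ** 256)
```

The key plays the role of the random seed selecting one function from the pseudorandom family, so a single short key implicitly fixes the presence or absence of every candidate term.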
To complete the argument, we observe that every function $f$ in the support of $P^{\mathrm{PSR}}$ can be evaluated with a poly(n)-size circuit. It is obviously easy to count the number of supervariables that are satisfied in an input $x$, so we need only argue that the function $f_1$ can be computed efficiently on a "typical" input $x$ that has "few" supervariables satisfied. But by construction, such an input will satisfy only poly(n) candidate terms of the monotone DNF $f_1$ and thus a poly(n)-size circuit can check each of these candidate terms separately (by making a call to the pseudorandom function for each candidate term to determine whether it is present or absent). Thus, as a corollary of Lemma 4.3, we can establish the main result of this section:

Theorem 4.4. Suppose that standard one-way functions exist. Then there is a class $C$ of poly(n)-size monotone circuits that is hard to learn to accuracy $1/2 + 1/\mathrm{polylog}(n)$.
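The evaluation argument above can be sketched in code: count the satisfied supervariables, then query the keyed function once per candidate term. As before this is only a sketch under assumptions: HMAC-SHA256 is a stand-in for the pseudorandom function, all names are hypothetical, the coin uses a threshold test, and the "too many supervariables" case is assumed to output 1.

```python
import hashlib
import hmac
import itertools


def prf_coin(key: bytes, term: tuple, p: float) -> bool:
    """Keyed coin of bias roughly p (HMAC-SHA256 as an assumed PRF)."""
    message = ",".join(map(str, term)).encode()
    digest = hmac.new(key, message, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") < int(p * 2 ** 256)


def evaluate(key, x, k, t, p, threshold):
    """Evaluate the pseudorandom monotone function on the bit list x."""
    # Supervariable i is the AND of the i-th block of k input bits.
    blocks = [all(x[i:i + k]) for i in range(0, len(x), k)]
    ones = [i for i, satisfied in enumerate(blocks) if satisfied]
    if len(ones) > threshold:
        return 1  # atypical input: assumed to be labeled 1, no PRF calls
    # At most C(threshold, t) PRF calls, one per candidate term.
    return int(any(prf_coin(key, term, p)
                   for term in itertools.combinations(ones, t)))
```

The work per input is bounded by the number of $t$-subsets of the satisfied supervariables, which is the quantity the parameter choice keeps polynomial on typical inputs.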
An obvious goal for future work is to establish even sharper quantitative bounds on the cryptographic hardness of learning monotone functions. Blum, Burch, and Langford [6] obtained their $\frac{1}{2} + \frac{\omega(\log n)}{\sqrt{n}}$ information-theoretic lower bound by considering random monotone DNF that are constructed by independently including each of the $\binom{n}{\log n}$ possible terms of length $\log n$ in the target function. Can we match this hardness with a class of polynomial-size circuits?
As mentioned in Section 1, it is natural to consider a pseudorandom variant of the construction in [6] in which a pseudorandom rather than truly random function is used to decide whether or not to include each of the $\binom{n}{\log n}$ candidate terms. However, we have not been able to show that a function $f$ constructed in this way can be computed by a poly(n)-size circuit. Intuitively, the problem is that for an input $x$ with (typically) $n/2$ bits set to 1, to evaluate $f$ we must check the pseudorandom function's value on all of the $\binom{n/2}{\log n}$ potential "candidate terms" of length $\log n$ that $x$ satisfies. Indeed, the question of obtaining an efficient implementation of these "huge pseudorandom monotone DNF" has a similar flavor to Open Problem 5.4 of [13]. In that question the goal is to construct pseudorandom functions that support "subcube queries" that give the parity of the function's values over all inputs in a specified subcube of $\{0,1\}^n$. In our scenario we are interested in functions $f$ that are pseudorandom only over the $\binom{n}{\log n}$ inputs with precisely $\log n$ ones (these inputs correspond to the "candidate terms" of the monotone DNF) and are zero everywhere else, and we only need to support "monotone subcube queries" (i.e., given an input $x$, we want to know whether $f(y) = 1$ for any $y \le x$).
ANDREW WAN is a Ph.D. candidate at Columbia University, advised by Tal Malkin and Rocco Servedio. His interests include complexity theory, cryptography, and computational learning theory. Before graduate school, he was a student of philosophy at Columbia University and enjoyed playing the piano, the trumpet, and the accordion. Although he still enjoys playing music, the PAC model rarely affords him the time.
HOETECK WEE is an assistant professor at Queens College, CUNY. He received his Ph.D. from UC Berkeley under the supervision of Luca Trevisan and his B.S. from MIT. He was previously a postdoc at Columbia University and a visiting student at Tsinghua University and IPAM. Hoeteck currently lives in Manhattan close to the cafés in order to cut down on his commute. He's working to convince more people that "black box is the new black."