An optimal lower bound for monotonicity testing over hypergrids

For positive integers $n, d$, consider the hypergrid $[n]^d$ with the coordinate-wise product partial ordering denoted by $\prec$. A function $f: [n]^d \to \mathbb{N}$ is monotone if $\forall x \prec y$, $f(x) \leq f(y)$. A function $f$ is $\varepsilon$-far from monotone if at least an $\varepsilon$-fraction of values must be changed to make $f$ monotone. Given a parameter $\varepsilon$, a \emph{monotonicity tester} must distinguish with high probability a monotone function from one that is $\varepsilon$-far. We prove that any (adaptive, two-sided) monotonicity tester for functions $f:[n]^d \to \mathbb{N}$ must make $\Omega(\varepsilon^{-1}d\log n - \varepsilon^{-1}\log \varepsilon^{-1})$ queries. Recent upper bounds show the existence of $O(\varepsilon^{-1}d \log n)$-query monotonicity testers for hypergrids. This closes the question of monotonicity testing for hypergrids over arbitrary ranges. The previous best lower bound for general hypergrids was a non-adaptive bound of $\Omega(d \log n)$.


Introduction
Given query access to a function f : D → R, the field of property testing [RS96, GGR98] deals with the problem of determining properties of f without reading all of it. Monotonicity testing [GGL+00] is a classic problem in property testing. Consider a function f : D → R, where D is some partial order given by "≺", and R is a total order. The function f is monotone if for all x ≺ y (in D), f(x) ≤ f(y). The distance to monotonicity of f is the minimum fraction of values that need to be modified to make f monotone. More precisely, define the distance between functions as d(f, g) := |{x : f(x) ≠ g(x)}|/|D|. Let M be the set of all monotone functions. Then the distance to monotonicity of f is min_{g∈M} d(f, g).
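For the special case of the line D = [n] with distinct values, the distance to monotonicity has a concrete form: keeping a longest non-decreasing subsequence and repairing everything else is optimal, and the unbounded range ℕ always leaves room to fill in consistent values for the repaired points. A minimal sketch (the function name is ours):

```python
def distance_to_monotonicity_line(values):
    """Distance to monotonicity for f : [n] -> N, given as a list.

    Keep a longest non-decreasing subsequence and change every other
    point; with an unbounded range the changed points can always be
    assigned consistent values, so the distance is (n - L) / n.
    """
    n = len(values)
    # lis[i] = length of the longest non-decreasing subsequence ending at i
    lis = [1] * n
    for i in range(n):
        for j in range(i):
            if values[j] <= values[i]:
                lis[i] = max(lis[i], lis[j] + 1)
    return (n - max(lis)) / n

# A monotone function has distance 0; one out-of-place value costs 1/n.
assert distance_to_monotonicity_line([1, 2, 3, 4]) == 0.0
assert distance_to_monotonicity_line([1, 5, 2, 3, 4]) == 0.2
```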
A function is called ε-far from monotone if the distance to monotonicity is at least ε. A property tester for monotonicity is a, possibly randomized, algorithm that takes as input a distance parameter ε ∈ (0, 1), error parameter δ ∈ [0, 1], and query access to an arbitrary f . If f is monotone, then the tester must accept with probability > 1 − δ. If it is ε-far from monotone, then the tester rejects with probability > 1 − δ. (If neither, then the tester is allowed to do anything.) The aim is to design a property tester using as few queries as possible. A tester is called one-sided if it always accepts a monotone function. A tester is called non-adaptive if the queries made do not depend on the function values. The most general tester is an adaptive two-sided tester.
Monotonicity testing has a rich history and the hypergrid domain, [n] d , has received special attention. The boolean hypercube (n = 2) and the total order (d = 1) are special instances of hypergrids. Following a long line of work [EKK + 00, GGL + 00, DGL + 99, LR01, FLN + 02, AC06, Fis04, HK08, PRR06, ACCL06, BRW05,BBM12], previous work of the authors [CS13] shows the existence of O(ε −1 d log n)-query monotonicity testers. Our result is a matching adaptive lower bound that is optimal in all parameters (for unbounded range functions). This closes the question of monotonicity testing for unbounded ranges on hypergrids. This is also the first adaptive bound for monotonicity testing on general hypergrids.

Previous work
The problem of monotonicity testing was introduced by Goldreich et al. [GGL+00]. We refer the interested reader to the introduction of [CS13] for a more detailed history of previous upper bounds.
There have been numerous lower bounds for monotonicity testing. We begin by summarizing the state of the art. The known adaptive lower bounds are Ω(log n) for the total order [n] by Fischer [Fis04], and Ω(d/ε) for the boolean hypercube {0, 1}^d by Brody [Bro13]. For general hypergrids, Blais, Raskhodnikova, and Yaroslavtsev [BRY13] recently proved the first result, a non-adaptive lower bound of Ω(d log n). Theorem 1.1 is the first adaptive bound for monotonicity testing on hypergrids and is optimal (for arbitrary ranges) in all parameters. Now for the chronological documentation. The first lower bound was the non-adaptive bound of Ω(log n) for the total order [n] by Ergün et al. [EKK+00]. Blais, Brody, and Matulef [BBM12] gave an ingenious reduction from communication complexity to prove an adaptive, two-sided bound of Ω(d). (Honing this reduction, Brody [Bro13] improved this bound to Ω(d/ε).) The non-adaptive lower bounds of Blais, Raskhodnikova, and Yaroslavtsev [BRY13] were also achieved through communication complexity reductions.
We note that our theorem only holds when the range is N, while some previous results hold for restricted ranges. The results of [BBM12, Bro13] provide lower bounds even for the range [√d]. The non-adaptive bound of [BRY13] holds even when the range is [nd]. In that sense, the communication complexity reductions provide stronger lower bounds than our result.

Main ideas
The starting point of this work is the result of Fischer [Fis04], an adaptive lower bound for monotonicity testing for functions f : [n] → N. He shows that adaptive testers can be converted to comparison-based testers, using Ramsey theory arguments. A comparison-based tester for [n] can be easily converted to a non-adaptive tester, for which an Ω(log n) bound was previously known. We make a fairly simple observation. The main part of Fischer's proof actually goes through for functions over any partial order, so it suffices to prove lower bounds for comparison-based testers.
(The reduction to non-adaptive testers only holds for [n].) We then prove a comparison-based lower bound of Ω(ε^{−1}d log n − ε^{−1} log ε^{−1}) for the domain [n]^d. As usual, Yao's minimax lemma allows us to prove deterministic lower bounds over some distribution of functions. The major challenge in proving (even non-adaptive) lower bounds for monotonicity is that the tester might make decisions based on the actual values that it sees. Great care is required to construct a distribution over functions whose monotonicity status cannot be decided by simply looking at the values. But a comparison-based tester has no such power, and optimal lower bounds over all parameters can be obtained with a fairly clean distribution.

The reduction to comparison-based testers
Consider the family of functions f : D → R, where D is some partial order and R ⊆ N. We will assume that f always takes distinct values, so for all x ≠ y, f(x) ≠ f(y). Since we are proving lower bounds, this is no loss of generality.
Definition 2.1. An algorithm A is a (t, ε, δ)-monotonicity tester if A has the following properties. For any f : D → R, the algorithm A makes t (possibly randomized) queries to f and then outputs either "accept" or "reject". If f is monotone, then A accepts with probability > 1 − δ. If f is ε-far from monotone, then A rejects with probability > 1 − δ.
Given a positive integer s, let D^s denote the collection of ordered s-tuples with each entry in D. We define two symbols acc and rej, and denote D′ = D ∪ {acc, rej}. Any (t, ε, δ)-tester can be completely specified by the following family of functions. For all s ≤ t, x ∈ D^s, y ∈ D′, we consider a function p^y_x : R^s → [0, 1], with the semantic that for any a ∈ R^s, p^y_x(a) denotes the probability that the tester queries y as the (s + 1)th query, given that the first s queries are x_1, . . . , x_s and f(x_i) = a_i for 1 ≤ i ≤ s. By querying acc or rej we mean returning accept or reject, respectively. These functions satisfy the following properties.

(1) For every s ≤ t, x ∈ D^s, and a ∈ R^s: Σ_{y∈D′} p^y_x(a) = 1.

(2) For every x ∈ D^t, a ∈ R^t, and y ∈ D: p^y_x(a) = 0.
(1) ensures the decisions of the tester at step (s + 1) must form a probability distribution.
(2) implies that the tester makes at most t queries.
For any positive integer s, let R^(s) denote the collection of subsets of R of cardinality s. For reasons that will soon become clear, we introduce new functions as follows. For each s, x ∈ D^s, y ∈ D′, and each permutation σ : [s] → [s], we associate a function q^y_{x,σ} : R^(s) → [0, 1], with the following semantic. For any set S = {a_1 < a_2 < · · · < a_s} ∈ R^(s),

q^y_{x,σ}(S) := p^y_x(a_{σ(1)}, . . . , a_{σ(s)}).

That is, q^y_{x,σ} sorts the answers in S in increasing order, permutes them according to σ, and passes the permuted tuple to p^y_x. Any adaptive tester can be specified by these functions. The important point to note is that there are only finitely many such functions; their number is upper bounded by (t|D|)^{t+1}. These q-functions allow us to define comparison-based testers.
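The passage from the p-functions to the q-functions is mechanically just sort-then-permute. A minimal sketch (names and the stand-in p are ours):

```python
def make_q(p):
    """Turn p (a function of an ordered tuple of answers) into q, which
    takes an unordered set S of answers and a permutation sigma of
    {0, ..., s-1}: sort S increasingly, permute by sigma, forward to p."""
    def q(S, sigma):
        a = sorted(S)                              # a_1 < a_2 < ... < a_s
        return p(tuple(a[sigma[i]] for i in range(len(a))))
    return q

# Example: a stand-in "query probability" that depends only on the order
# in which the two answers are presented.
p = lambda t: 1.0 if t[0] < t[1] else 0.0
q = make_q(p)
# {10, 7} sorted is (7, 10); the identity sigma passes (7, 10) to p.
assert q({10, 7}, [0, 1]) == 1.0
assert q({10, 7}, [1, 0]) == 0.0
```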
Definition 2.2. A tester is comparison-based if for all s, x, y, and σ, the function q^y_{x,σ} is a constant function on R^(s). In other words, the (s + 1)th decision of the tester, given that the first s queries are x, depends only on the ordering of the answers received, and not on the values of the answers.
The following theorem is implicit in the work of Fischer [Fis04].

Theorem 2.3. Suppose there exists a (t, ε, δ)-monotonicity tester for functions f : D → N. Then there exists a comparison-based (t, ε, 2δ)-monotonicity tester for functions f : D → N.

This implies that a comparison-based lower bound suffices for proving a general lower bound on monotonicity testing. We provide a proof of the above theorem in the next section for completeness.

Performing the reduction
We essentially present Fischer's argument, observing that D can be any partial order. A monotonicity tester is called discrete if the corresponding functions p^y_x only take values in {i/K | 0 ≤ i ≤ K} for some finite K. Note that this implies the functions q^y_{x,σ} also take discrete values.

Claim 2.4. If there exists a (t, ε, δ)-monotonicity tester A, then there exists a discrete (t, ε, 2δ)-monotonicity tester A′.
Proof. We do a rounding on the p-functions of A. Set K := ⌈20t|D|^t/δ²⌉. For each y ∈ D ∪ {acc}, define p̃^y_x(a) to be p^y_x(a) rounded down to the nearest multiple of 1/K, and let p̃^{rej}_x(a) absorb the leftover mass so that the p̃-functions still sum to 1. The p̃-functions describe a new discrete tester A′ that makes at most t queries. We argue that A′ is a (t, ε, 2δ)-tester. Given a function f that is either monotone or ε-far from monotone, consider a sequence of queries x_1, . . . , x_s after which A returns a correct decision ℵ. Call such a sequence good, and let α denote the probability this occurs; writing x_{s+1} := ℵ,

α = Π_{i=0}^{s} p^{x_{i+1}}_{(x_1,...,x_i)}(f(x_1), . . . , f(x_i)).

We know that the sum of α over all good query sequences is at least 1 − δ. Two cases arise. Suppose all of the probabilities in the RHS are ≥ 10t/δK. Then each factor shrinks by a factor of at most (1 − δ/10t) under rounding, so the probability of this good sequence arising in A′ is at least (1 − δ/10t)^t α ≥ α(1 − δ/2). Otherwise, suppose some probability in the RHS is < 10t/δK. Then the total probability mass on such good sequences in A is at most (10t/δK) · |D|^t ≤ δ/2. Therefore, the probability of good sequences in A′ is at least (1 − 3δ/2)(1 − δ/2) ≥ 1 − 2δ. That is, A′ is a (t, ε, 2δ)-tester.
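The rounding step can be illustrated concretely: every probability except that of rej is rounded down to the grid {i/K}, and the slack is shifted onto rej, so the result is still a distribution and no surviving probability drops by more than 1/K. A sketch under those assumptions (names ours):

```python
from fractions import Fraction
from math import floor

def discretize(dist, K):
    """Round each probability in dist (a dict keyed by queries plus
    'acc'/'rej') down to a multiple of 1/K, dumping the slack on 'rej'
    so that the result is still a probability distribution."""
    out = {}
    slack = Fraction(0)
    for key, p in dist.items():
        p = Fraction(p)
        if key == 'rej':
            out[key] = p            # adjusted at the end
            continue
        rounded = Fraction(floor(p * K), K)
        slack += p - rounded
        out[key] = rounded
    out['rej'] = out.get('rej', Fraction(0)) + slack
    return out

d = discretize({'x1': Fraction(1, 3), 'acc': Fraction(1, 3),
                'rej': Fraction(1, 3)}, K=10)
assert sum(d.values()) == 1            # still a distribution
assert d['x1'] == Fraction(3, 10)      # 1/3 rounded down to 3/10
```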
We introduce some Ramsey theory terminology. For any positive integer i, a finite coloring of N^(i) is a function col_i : N^(i) → {1, . . . , C} for some finite number C. An infinite set X ⊆ N is called monochromatic w.r.t. col_i if for all sets A, B ∈ X^(i), col_i(A) = col_i(B). A k-wise finite coloring of N is a collection of k colorings col_1, . . . , col_k. (Note that each coloring is over different-sized tuples.) An infinite set X ⊆ N is k-wise monochromatic if X is monochromatic w.r.t. all the col_i's.
The following is a simple variant of Ramsey's original theorem. (We closely follow the proof of Ramsey's theorem as given in Chapter VI, Theorem 4 of [Bol00].)

Theorem 2.5. For any k-wise finite coloring of N, there is an infinite k-wise monochromatic set X ⊆ N.
Proof. We proceed by induction on k. If k = 1, the statement is trivially true: let X be the largest color class; since the coloring is finite, X is infinite. For k > 1, we iteratively construct an infinite k-wise monochromatic set as follows. Let a_0 be the minimum element of N, and consider the (k − 1)-wise coloring col′_1, . . . , col′_{k−1} of N ∖ {a_0} defined by col′_i(S) := col_{i+1}({a_0} ∪ S). By the induction hypothesis, there is an infinite (k − 1)-wise monochromatic set A_0 ⊆ N ∖ {a_0}; let C_0 be the vector of k colors whose ith entry C^i_0 is the common value of col_i({a_0} ∪ S) over S ⊆ A_0 with |S| = i − 1. Subsequently, let a_1 be the minimum element of A_0, consider the analogous (k − 1)-wise coloring of A_0 ∖ {a_1}, and again obtain an infinite (k − 1)-wise monochromatic set A_1 and a color vector C_1. Continuing this procedure, we get an infinite sequence a_0, a_1, a_2, . . . of natural numbers, an infinite sequence of vectors of k colors C_0, C_1, . . ., and an infinite nested sequence of infinite sets A_0 ⊃ A_1 ⊃ A_2 ⊃ · · ·. Every A_r contains a_s for all s > r, and by construction any set {a_r} ∪ S with S ⊆ A_r, |S| = i − 1, has color C^i_r under col_i. Since there are only finitely many colors, some vector of colors occurs infinitely often, say as C_{r_1}, C_{r_2}, . . .. The corresponding infinite sequence of elements a_{r_1}, a_{r_2}, . . . is k-wise monochromatic: any i-subset of it has some a_{r_l} as its minimum element, the remaining elements lie in A_{r_l}, and hence its color is the common value C^i.
Proof. (of Theorem 2.3) Suppose there exists a (t, ε, δ)-tester for functions f : D → N. We need to show there is a comparison-based (t, ε, 2δ)-tester for such functions.
By Claim 2.4, there is a discrete (t, ε, 2δ)-tester A. Equivalently, we have the functions q^y_{x,σ} as described in the previous section. We now describe a t-wise finite coloring of N. Consider s ∈ [t]. Given a set A ∈ N^(s), col_s(A) is a vector indexed by (y, x, σ), where y ∈ D′, x ∈ D^s, and σ is an s-permutation, whose (y, x, σ) entry is q^y_{x,σ}(A). The domain is finite, so the number of dimensions is finite. Since the tester is discrete, the number of possible colors is finite. Applying Theorem 2.5, we deduce the existence of a t-wise monochromatic infinite set R ⊆ N. We have the property that for any y, x, σ, and any two sets A, B ∈ R^(s), we have q^y_{x,σ}(A) = q^y_{x,σ}(B). That is, the algorithm A is a comparison-based tester for functions with range R.
Consider the strictly monotone map φ : N → R, where φ(b) is the bth element of R in sorted order. Now, given any function f : D → N, the function φ ◦ f has range R; since φ is strictly monotone, f and φ ◦ f have the same distance to monotonicity and induce identical comparisons. Running A on φ ◦ f therefore yields a tester for functions with range N whose decisions depend only on comparisons: a comparison-based (t, ε, 2δ)-tester. This completes the proof of Theorem 2.3.

Lower bounds
We assume that n is a power of 2, set ℓ := log_2 n, and think of [n] as {0, 1, . . . , n − 1}. For any number 0 ≤ z < n, we think of the binary representation of z as an ℓ-bit vector (z_1, z_2, . . . , z_ℓ), where z_1 is the least significant bit.
Consider the following canonical, one-to-one mapping φ : [n]^d → {0, 1}^{dℓ}. For any y = (y_1, y_2, . . . , y_d) ∈ [n]^d, we concatenate the binary representations of the coordinates in order to get a dℓ-bit vector φ(y). Hence, we can transform a function over the hypercube {0, 1}^{dℓ} into a function over the hypergrid [n]^d (we make this precise later). We will now describe a distribution of functions over the boolean hypercube with equal mass on monotone and ε-far-from-monotone functions. The key property is that for a function drawn from this distribution, any deterministic comparison-based algorithm errs in classifying it with non-trivial probability. This property will be used in conjunction with the above mapping to get our final lower bound.
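The canonical mapping is just per-coordinate binary expansion followed by concatenation. A minimal sketch (with the least significant bit first within each block, as in the text; the function name is ours):

```python
def phi(y, n):
    """Map y in [n]^d to a {0,1}^{d*l} bit vector (l = log2 n) by
    concatenating the binary representations of the coordinates,
    least significant bit first within each block of l bits."""
    l = n.bit_length() - 1          # n is assumed to be a power of 2
    bits = []
    for coord in y:
        bits.extend((coord >> i) & 1 for i in range(l))
    return bits

# n = 4 (so l = 2), d = 2: the point (1, 2) becomes bits of 1, then bits of 2.
assert phi((1, 2), 4) == [1, 0, 0, 1]
```

Since each coordinate occupies its own block of bits, increasing a coordinate of y increases the number encoded by the corresponding block, which is what makes 2val (over the hypercube) induce a monotone function over the hypergrid.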

The hard distribution
We focus on functions f : {0, 1}^m → N. (Eventually, we set m = dℓ.) Given any x ∈ {0, 1}^m, we let val(x) := Σ_{i=1}^m 2^{i−1} x_i denote the number for which x is the binary representation. Here, x_1 denotes the least significant bit of x.
For convenience, we let ε be a power of 1/2. For k ∈ {1, . . . , 1/(2ε)}, we let

S_k := {x ∈ {0, 1}^m : (k − 1) · ε2^{m+1} ≤ val(x) < k · ε2^{m+1}}.

Note that the S_k's partition the hypercube, with each |S_k| = ε2^{m+1}. In fact, each S_k is a subhypercube of dimension m′ := m + 1 − log(1/ε), with the minimal element having all zeros in the m′ least significant bits, and the maximal element having all ones in those. We describe a distribution F_{m,ε} on functions. The support of F_{m,ε} consists of the monotone function f(x) = 2val(x), and of m′/(2ε) functions indexed as g_{j,k} with j ∈ [m′] and k ∈ [1/(2ε)], defined as follows:

g_{j,k}(x) := 2val(x) + 2^j if x ∈ S_k and x_j = 0; 2val(x) − 2^j if x ∈ S_k and x_j = 1; and 2val(x) if x ∉ S_k.

Equivalently, on S_k the function g_{j,k} is 2val applied after flipping the jth coordinate, and elsewhere it agrees with 2val.
The distribution F_{m,ε} puts probability mass 1/2 on the function f = 2val and mass ε/m′ on each of the g_{j,k}'s. All these functions take distinct values on their domain. Note that 2val induces a total order on {0, 1}^m.
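With the definition of g_{j,k} above (on S_k, flip coordinate j and apply 2val; elsewhere, plain 2val), injectivity and the locality of the perturbation can be checked mechanically. A sketch with small toy parameters of our choosing (coordinates 0-indexed in code, unlike the 1-indexed text):

```python
from itertools import product

def val(x):
    # x is a bit list, x[0] the least significant bit
    return sum(b << i for i, b in enumerate(x))

def block(x, m_prime):
    """Index of the subcube S_k containing x; membership is determined
    by the m - m' most significant bits."""
    return val(x) >> m_prime

def g(x, j, k, m_prime):
    """g_{j,k}: flip coordinate j inside S_k, plain 2*val elsewhere.
    (Definition as reconstructed in the text; 0-indexed coordinates.)"""
    if block(x, m_prime) == k:
        y = list(x)
        y[j] ^= 1
        return 2 * val(y)
    return 2 * val(x)

m, m_prime, j, k = 4, 2, 1, 2          # toy parameters (ours)
cube = [list(p) for p in product([0, 1], repeat=m)]
images = [g(x, j, k, m_prime) for x in cube]
assert len(set(images)) == len(cube)   # g_{j,k} takes distinct values
# g_{j,k} agrees with 2*val exactly outside S_k:
assert all((g(x, j, k, m_prime) == 2 * val(x)) == (block(x, m_prime) != k)
           for x in cube)
```

The injectivity check reflects the structural reason the values are distinct: flipping a low coordinate keeps a point inside its subcube S_k, so g_{j,k} is 2val composed with a bijection of the domain.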
The distinguishing problem: Given query access to a random function f drawn from F_{m,ε}, we want a deterministic comparison-based algorithm that declares either f = 2val or f ≠ 2val. We refer to any such algorithm as a distinguisher. Naturally, we say that the distinguisher errs on f if its declaration is wrong. Our main lemma is the following.
Lemma 3.1. Any deterministic comparison-based distinguisher that makes fewer than m′/(8ε) queries errs with probability at least 1/8.
The following proposition allows us to focus on non-adaptive comparison-based testers.
Proposition 3.2. Given any deterministic comparison-based distinguisher A for F m,ε that makes at most t queries, there exists a deterministic non-adaptive comparison-based distinguisher A ′ making at most t queries whose probability of error on F m,ε is at most that of A.
Proof. We represent A as a comparison tree. On any path of A, the total number of distinct domain points involved in comparisons is at most t. Note that 2val induces a total order: for any x ≠ y, either val(x) < val(y) or vice versa. For any comparison in A, exactly one outcome is consistent with this ordering. (An outcome "f(x) < f(y)" where val(x) > val(y) is inconsistent with the total order.) We construct a comparison tree A′ that declares f ≠ 2val whenever a comparison is inconsistent with the total order, and otherwise mimics A. This can only decrease the probability of error, since an inconsistent comparison certifies that f ≠ 2val. Moreover, the consistent outcome of every comparison is determined in advance, so the tree of A′ collapses to a single path; hence it can be modeled as a non-adaptive distinguisher that queries all the points on this path upfront and then makes the relevant comparisons for the output.
Combined with Proposition 3.2, the following lemma completes the proof of Lemma 3.1.

Lemma 3.3. Any non-adaptive deterministic comparison-based distinguisher that makes fewer than m′/(8ε) queries errs with probability at least 1/8.
Proof. Let X be the set of points queried by the distinguisher. Set X_k := X ∩ S_k; these form a partition of X. We say that a pair of points (x, y) captures the (unique) coordinate j if j is the largest coordinate where x_j ≠ y_j. (By largest coordinate, we refer to the value of the index.) For a set Y of points, we say Y captures coordinate j if there is a pair in Y that captures j.
Claim 3.4. For any j and k, if the algorithm distinguishes between 2val and g_{j,k}, then X_k captures j.
Proof. If the algorithm distinguishes between 2val and g_{j,k}, there must exist a pair (x, y) ∈ X × X such that val(x) < val(y) and g_{j,k}(x) > g_{j,k}(y). We claim that x and y capture j; this will also imply they lie in the same S_{k′}, since the bits above the jth are then the same in x and y, and membership in a block is determined by the most significant bits. Firstly, observe that we must have y_j = 1 and x_j = 0; otherwise g_{j,k}(y) − g_{j,k}(x) ≥ 2(val(y) − val(x)) > 0, contradicting the supposition. Now suppose (x, y) do not capture j, so there exists i > j which is the largest coordinate at which they differ. Since val(y) > val(x), we have y_i = 1 and x_i = 0. Therefore,

val(y) − val(x) ≥ 2^{i−1} + 2^{j−1} − Σ_{i′<i, i′≠j} 2^{i′−1} = 2^{i−1} + 2^{j−1} − (2^{i−1} − 1 − 2^{j−1}) = 2^j + 1,

and hence g_{j,k}(y) − g_{j,k}(x) ≥ 2(val(y) − val(x)) − 2^{j+1} ≥ 2 > 0, a contradiction. So x, y capture j and lie in the same S_{k′}. If k′ ≠ k, then again g_{j,k}(y) − g_{j,k}(x) = 2(val(y) − val(x)) > 0, a contradiction. Therefore x, y ∈ S_k and X_k captures j.
The following claim allows us to complete the proof of the lemma.

Claim 3.5. Any set Y ⊆ {0, 1}^m captures at most |Y| − 1 coordinates.

Proof. We proceed by induction on |Y|. When |Y| = 2, this is trivially true. Otherwise, pick the largest coordinate j captured by Y and let Y_0 = {y ∈ Y : y_j = 0} and Y_1 = {y ∈ Y : y_j = 1}. By induction, Y_0 captures at most |Y_0| − 1 coordinates, and Y_1 captures at most |Y_1| − 1 coordinates. Any pair (x, y) ∈ Y_0 × Y_1 captures only coordinate j: such a pair differs at j, and the largest coordinate at which it differs is captured by Y, hence is at most j. Therefore, the total number of captured coordinates is at most (|Y_0| − 1) + (|Y_1| − 1) + 1 = |Y| − 1.
If |X| ≤ m′/(8ε), then there exist at least 1/(4ε) values of k such that |X_k| ≤ m′/2. By Claim 3.5, each such X_k captures at most m′/2 coordinates. Therefore, there exist at least (1/(4ε)) · (m′/2) = m′/(8ε) functions g_{j,k} that are indistinguishable from the monotone function 2val to a comparison-based procedure that queries X. This implies the distinguisher must err (make a mistake on either these g_{j,k}'s or on 2val) with probability at least min((ε/m′) · (m′/(8ε)), 1/2) = 1/8.
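Both the definition of capturing and Claim 3.5 are easy to verify exhaustively on small hypercubes. A sketch (function name ours):

```python
from itertools import combinations, product

def captured(Y):
    """Set of coordinates captured by Y: coordinate j is captured if
    some pair in Y differs at j and agrees at all higher coordinates,
    i.e. j is the largest coordinate at which the pair differs."""
    caps = set()
    for x, y in combinations(Y, 2):
        diffs = [i for i in range(len(x)) if x[i] != y[i]]
        if diffs:
            caps.add(max(diffs))
    return caps

# Claim 3.5, checked on every subset of {0,1}^3 of size up to 4:
points = list(product([0, 1], repeat=3))
for size in range(2, 5):
    for Y in combinations(points, size):
        assert len(captured(Y)) <= len(Y) - 1
```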

The final bound
Recall that, given a function f : {0, 1}^{dℓ} → N, we obtain the function f̃ : [n]^d → N by defining f̃(y) := f(φ(y)). We start with the following observation.
Proposition 3.6. The function 2val is monotone and every g j,k is ε/2-far from being monotone.
Proof. Let u and v be elements of [n]^d such that u ≺ v. We have val(φ(u)) < val(φ(v)), so 2val is monotone. For the latter claim, it suffices to exhibit a matching of violated pairs of cardinality ε2^{dℓ} for g_{j,k}. This is given by the pairs (u, v) where φ(u) and φ(v) differ exactly in their jth coordinate and both lie in S_k; there are |S_k|/2 = ε2^{dℓ} such pairs. Note that these pairs are comparable in [n]^d and are violations: the smaller endpoint receives the larger value under g_{j,k}.

Theorem 3.7. Set s := m′/(8ε) = Ω(ε^{−1}d log n − ε^{−1} log ε^{−1}). Any deterministic comparison-based (t, ε/2, 1/8)-monotonicity tester for functions f : [n]^d → N must have t ≥ s.

Proof. Consider the distribution over functions on [n]^d where we generate f from F_{m,ε} and output the induced function f̃. Suppose t < s. By Proposition 3.6, the deterministic comparison-based monotonicity tester acts as a deterministic comparison-based distinguisher for F_{m,ε} making fewer than s queries, contradicting Lemma 3.3. Combined with Theorem 2.3 and Yao's minimax lemma, this yields the lower bound of Theorem 1.1 for adaptive, two-sided testers.
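Unwinding the parameters makes the final count explicit: with m = dℓ = d log_2 n and m′ = m + 1 − log(1/ε), the query lower bound s reads

```latex
s \;=\; \frac{m'}{8\varepsilon}
  \;=\; \frac{d\log_2 n + 1 - \log_2(1/\varepsilon)}{8\varepsilon}
  \;=\; \Omega\!\left(\varepsilon^{-1} d \log n - \varepsilon^{-1}\log \varepsilon^{-1}\right),
```

matching the bound stated in the abstract.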

Conclusion
In this paper, we exhibit a lower bound of Ω(ε^{−1}d log n − ε^{−1} log ε^{−1}) queries for adaptive, two-sided monotonicity testers for functions f : [n]^d → N, matching the upper bound of O(ε^{−1}d log n) queries of [CS13]. Our proof hinged on two things: that for monotonicity on any partial order one can focus on comparison-based testers, and a lower bound on comparison-based testers for the hypercube domain. Some natural questions are left open. Can one focus on some restricted class of testers for the Lipschitz property, and more generally, can one prove adaptive, two-sided lower bounds for Lipschitz testing on the hypergrid/cube? Currently, an Ω(d log n)-query non-adaptive lower bound is known for the problem [BRY13]. Can one prove comparison-based lower bounds for monotonicity testing on a general N-vertex poset? For the latter problem, there is an O(√(N/ε))-query non-adaptive tester, and an Ω(N^{1/log log N})-query non-adaptive, two-sided-error lower bound [FLN+02]. Our methods do not yield any results for bounded ranges, but there are significant gaps in our understanding of that regime. For monotonicity testing of boolean functions f : {0, 1}^n → {0, 1}, the best adaptive lower bound is Ω(log n), while the best non-adaptive bound is Ω(√n) [FLN+02].