Online Row Sampling

Finding a small spectral approximation for a tall $n \times d$ matrix $A$ is a fundamental numerical primitive. For a number of reasons, one often seeks an approximation whose rows are sampled from those of $A$. Row sampling improves interpretability, saves space when $A$ is sparse, and preserves row structure, which is especially important, for example, when $A$ represents a graph. However, correctly sampling rows from $A$ can be costly when the matrix is large and cannot be stored and processed in memory. Hence, a number of recent publications focus on row sampling in the streaming setting, using little more space than what is required to store the output approximation [KL13, KLM+14]. Inspired by a growing body of work on online algorithms for machine learning and data analysis, we extend this work to a more restrictive online setting: we read rows of $A$ one by one and immediately decide whether each row should be kept in the spectral approximation or discarded, without ever retracting these decisions. We present an extremely simple algorithm that approximates $A$ up to multiplicative error $\epsilon$ and additive error $\delta$ using $O(d \log d \log(\epsilon||A||_2/\delta)/\epsilon^2)$ online samples, with memory overhead proportional to the cost of storing the spectral approximation. We also present an algorithm that uses $O(d^2)$ memory but only requires $O(d\log(\epsilon||A||_2/\delta)/\epsilon^2)$ samples, which we show is optimal. Our methods are clean and intuitive, allow for lower memory usage than prior work, and expose new theoretical properties of leverage score based matrix approximation.


Background
A spectral approximation to a tall $n \times d$ matrix $A$ is a smaller, typically $\tilde{O}(d) \times d$ matrix $\tilde{A}$ such that $\|\tilde{A}x\|_2 \approx \|Ax\|_2$ for all $x$. Typically one asks for a multiplicative approximation, which guarantees that $(1-\epsilon)\|Ax\|_2^2 \le \|\tilde{A}x\|_2^2 \le (1+\epsilon)\|Ax\|_2^2$. In other notation, $(1-\epsilon)A^TA \preceq \tilde{A}^T\tilde{A} \preceq (1+\epsilon)A^TA$. Such approximations have many applications, most notably for solving least squares regression over $A$ [9,11]. If $A$ is the vertex-edge incidence matrix of a graph, $\tilde{A}$ is a spectral sparsifier [26]. It can be used to approximate effective resistances, spectral clustering, mixing time and random walk properties, and many other computations.
A number of recent papers focus on fast algorithms for spectral approximation. Using sparse random subspace embeddings [9,23,22], it is possible to find $\tilde{A}$ in input sparsity time, i.e., running time scaling linearly in the number of nonzero entries of $A$. These methods produce $\tilde{A}$ by randomly recombining the rows of $A$ into a smaller number of rows. In some cases these embeddings are not enough, as it is desirable for the rows of $\tilde{A}$ to be a subset of the rows of $A$. If $A$ is sparse, this ensures that $\tilde{A}$ is also sparse. If $A$ represents a graph, it ensures that $\tilde{A}$ is also a graph, specifically a weighted subgraph of the original.
It is well known that sampling $O(d \log d/\epsilon^2)$ rows of $A$ with probabilities proportional to their leverage scores yields a $(1 \pm \epsilon)$-factor spectral approximation to $A$. Further, this sampling can be done in input sparsity time, either using subspace embeddings to approximate leverage scores, or using iterative sampling techniques [20], some of which only work with subsampled versions of the original matrix [11].
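To make the sampling primitive concrete, the following small numerical sketch (our illustration, not code from this paper) samples rows of a Gaussian matrix by their leverage scores with oversampling rate $c = 8\log d/\epsilon^2$ and checks the resulting spectral approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 2000, 10, 0.5
A = rng.standard_normal((n, d))

# Leverage scores: l_i = a_i^T (A^T A)^{-1} a_i; they always sum to rank(A) = d.
M_inv = np.linalg.inv(A.T @ A)
lev = np.einsum('ij,jk,ik->i', A, M_inv, A)

# Sample row i with p_i = min(c * l_i, 1); rescale kept rows by 1/sqrt(p_i).
c = 8 * np.log(d) / eps ** 2
p = np.minimum(c * lev, 1.0)
keep = rng.random(n) < p
A_tilde = A[keep] / np.sqrt(p[keep])[:, None]

# Spectral check: eigenvalues of (A^T A)^{-1} (A~^T A~) should lie in [1-eps, 1+eps].
lams = np.real(np.linalg.eigvals(np.linalg.solve(A.T @ A, A_tilde.T @ A_tilde)))
```

The rescaling by $1/\sqrt{p_i}$ makes $\tilde{A}^T\tilde{A}$ an unbiased estimator of $A^TA$; the leverage-proportional probabilities control its variance.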

Streaming and online row sampling
When $A$ is very large, input sparsity running times are not enough; memory restrictions also become important. Hence, recent work has tackled row sampling in a streaming model of computation. [16] presents a simple algorithm for sampling rows from an insertion-only stream, using space approximately proportional to the size of the final approximation. [15] gives a sparse-recovery based algorithm that works in dynamic streams with row insertions and deletions, also using nearly optimal space. Unfortunately, to handle dynamic streams, the algorithm in [15] is complex, requires additional restrictions on the input matrix, and uses significantly suboptimal running time to recover a spectral approximation from its low memory representation of the input stream.
While the algorithm in [16] is simple and efficient, we believe that its proof is incomplete, and do not see an obvious way to fix it. The main idea behind the algorithm is to sample rows by their leverage scores with respect to the stream seen so far. These leverage scores may be coarse overestimates of the true scores. However, as more rows are streamed in, better estimates can be obtained and the sampled rows pruned to a smaller set. Unfortunately, the probability of sampling a row becomes dependent on which other rows are sampled. This seems to break the argument in that paper, which essentially claims that their process has the same distribution as a single round of leverage score sampling would.

In this paper we initiate the study of row sampling in an online setting. As in an insertion stream, we read rows of $A$ one by one. However, upon seeing a row, we immediately decide whether it should be kept in the spectral approximation or discarded, without ever retracting these decisions. We present an algorithm similar to that of [16]; however, since we never prune previously sampled rows, the probability of sampling a row depends only on whether previous rows in the stream were sampled. This limited dependency structure allows us to rigorously argue that a spectral approximation is obtained.
In addition to addressing gaps in the literature on streaming spectral approximation, our restricted model extends work on online algorithms for a variety of other machine learning and data analysis problems, including principal component analysis [4], clustering [21], classification [3,14], and regression [14]. In practice, online algorithms are beneficial since they can be highly computationally and memory efficient. Further, they can be applied in scenarios in which data is produced in a continuous stream and intermediate results must be output as the stream is processed. Spectral approximation is a widely applicable primitive for approximate learning and computation, so studying its implementation in an online setting is a natural direction. Since the initial publication of this work, online row sampling methods have found applications in kernel matrix approximation [7,8] and sliding window algorithms for streaming matrix approximation [6].

Our results
Our primary contribution is a very simple algorithm for leverage score sampling in an online manner. The main difficulty with row sampling using leverage scores is that the scores themselves are not easy to compute. They are given by $l_i = a_i^T(A^TA)^{-1}a_i$, and so require solving systems in $A^TA$ if computed naively. This is not only expensive, but also impossible in an online setting, where we do not have access to all of $A$.
A critical observation is that it always suffices to sample rows by overestimates of their true leverage scores. The number of rows that must be sampled is proportional to the sum of these overestimates. Since the leverage score of a row can only go up when we remove rows from the matrix, a simple way to obtain an overestimate is to compute the leverage score using just a subset of the other rows of $A$. That is, letting $A_j$ contain just $j$ of the $n$ rows of $A$, we can overestimate $l_i$ by $a_i^T(A_j^TA_j)^{\dagger}a_i$. [11] shows that if $A_j$ is a subset of $j$ rows sampled uniformly at random, then the expected value of this overestimate is $d/j$. This simple fact immediately gives a result for online sampling from a randomly ordered stream. If we compute the leverage score of the current row $a_i$ against all previously seen rows (or some approximation to these rows), then the expected sum of our overestimates is bounded by $d + d/2 + \cdots + d/n = O(d\log n)$. So, sampling $O(d\log d\log n/\epsilon^2)$ rows is enough to obtain a $(1+\epsilon)$ multiplicative-error spectral approximation.
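A quick experiment illustrates the $O(d \log n)$ behavior of prefix-based overestimates on a randomly ordered stream (our illustration with i.i.d. Gaussian rows, for which any fixed arrival order is a uniformly random order):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 3000, 10
A = rng.standard_normal((n, d))  # i.i.d. rows: the stream is in random order

M = np.zeros((d, d))  # A_{i-1}^T A_{i-1}, maintained incrementally
total = 0.0
for i, a in enumerate(A):
    if i < d:
        l_hat = 1.0  # prefix is rank deficient; a_i is (generically) outside its row span
    else:
        # Overestimate of l_i: leverage of a_i against the prefix, capped at 1.
        l_hat = min(a @ np.linalg.solve(M, a), 1.0)
    total += l_hat
    M += np.outer(a, a)

print(total, d * np.log(n))  # total grows like d log n, far below the trivial bound n
```

The cap at 1 is valid because a true leverage score never exceeds 1, so the capped prefix score remains an overestimate.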
What if we cannot guarantee a randomly ordered input stream? Is there any hope of being able to compute good leverage score estimates in an online manner? Surprisingly, the answer is yes: we can in fact run nearly the same algorithm and be guaranteed that the sum of estimated leverage scores is low, regardless of stream order. Roughly, each time we receive a row that has a high leverage score with respect to the previous rows, it must account for a significant part of the spectrum of $A$. If $A$ does not continue to grow unboundedly, there simply cannot be too many of these significant rows.
Specifically, we show that if we sample by the ridge leverage scores [1] over all previously seen rows, which are the leverage scores computed over $A_i^TA_i + \lambda I$ for some small regularizing factor $\lambda$, then with just $O(d\log d\log(\epsilon\|A\|_2^2/\delta)/\epsilon^2)$ samples we obtain a $(1+\epsilon)$ multiplicative-error, $\delta$ additive-error spectral approximation. That is, with high probability we sample a matrix $\tilde{A}$ with $(1-\epsilon)A^TA - \delta I \preceq \tilde{A}^T\tilde{A} \preceq (1+\epsilon)A^TA + \delta I$. To gain intuition behind this bound, note that we can convert it into a purely multiplicative one by setting $\delta = \epsilon\sigma_{\min}(A)^2$, where $\sigma_{\min}(A)$ is the minimum singular value of $A$ (as long as we have some estimate of $\sigma_{\min}(A)$). This setting of $\delta$ requires taking $O(d\log d\log(\kappa(A))/\epsilon^2)$ samples, where $\kappa(A) = \sigma_{\max}(A)/\sigma_{\min}(A)$ is the condition number of $A$. If we have a polynomial bound on this condition number, as we do, for instance, for graphs with polynomially bounded edge weights, this becomes $O(d\log^2 d/\epsilon^2)$, nearly matching the $O(d\log d/\epsilon^2)$ achievable when sampling by true leverage scores.
Our online sampling algorithm is extremely simple. When each row arrives, we compute its online ridge leverage score, or an estimate of it, and then irrevocably either add the row to our approximation or discard it. As mentioned, the algorithm is similar in form to the streaming algorithm of [16], except that it does not require pruning previously sampled rows. This allows us to avoid difficult dependency issues. Additionally, without pruning, we do not even need to store all previously sampled rows. As long as we store a constant-factor spectral approximation of our previous samples, we can compute good approximations to the online ridge leverage scores. In this way, we can keep just $O(d\log d\log(\epsilon\|A\|_2^2/\delta))$ rows in working memory ($O(d\log^2 d)$ if we want a spectral graph sparsifier), filtering our input stream into an $O(d\log d\log(\kappa(A))/\epsilon^2)$-size output stream. Note that this memory bound in fact improves as $\epsilon$ decreases and, regardless, can be significantly smaller than the output size of the algorithm.
In addition to our main sampling result, we use our bounds on online ridge leverage score approximations to show that an algorithm in the style of [2] allows us to remove a $\log d$ factor and sample just $O(d\log(\epsilon\|A\|_2^2/\delta)/\epsilon^2)$ rows (Theorem 4.1). This algorithm is more complex and can require $O(d^2)$ working memory. However, in Theorem 5.1 we show that it is asymptotically optimal. The $\log(\epsilon\|A\|_2^2/\delta)$ factor is not an artifact of our analysis, but is truly the cost of restricting ourselves to online sampling. No algorithm can obtain a $(1+\epsilon)$ multiplicative, $\delta$ additive spectral approximation taking fewer than $\Omega(d\log(\epsilon\|A\|_2^2/\delta)/\epsilon^2)$ rows in an online manner.

Overview
Let $A$ be an $n \times d$ matrix with rows $a_1, \dots, a_n$. A natural approach to row sampling from $A$ is picking an a priori probability with which each row is kept, and then deciding whether to keep each row independently. A common choice is for the sampling probabilities to be proportional to the leverage scores of the rows. The leverage score of the $i$-th row of $A$ is defined to be $l_i := a_i^T(A^TA)^{\dagger}a_i$, where the dagger symbol denotes the pseudoinverse. In this work, we will be interested in approximating $A^TA$ with some (very) small multiple of the identity added. Hence, we will be interested in the $\lambda$-ridge leverage scores [1]: $l_i := a_i^T(A^TA + \lambda I)^{-1}a_i$. In many applications, obtaining the (nearly) exact values of $a_i^T(A^TA + \lambda I)^{-1}a_i$ for sampling is difficult or outright impossible. A key idea is that as long as we have a sequence $l_1, \dots, l_n$ of overestimates of the $\lambda$-ridge leverage scores, that is, for $i = 1, \dots, n$, $l_i \ge a_i^T(A^TA + \lambda I)^{-1}a_i$, we can sample by these overestimates and obtain rigorous guarantees on the quality of the obtained spectral approximation. This notion is formalized in Theorem 2.1.
Theorem 2.1. Let $A$ be an $n \times d$ matrix with rows $a_1, \dots, a_n$. Let $\epsilon \in (0,1)$, $\delta > 0$, $\lambda := \delta/\epsilon$, $c := 8\log d/\epsilon^2$. Assume we are given $l_1, \dots, l_n$ such that for all $i = 1, \dots, n$, $l_i \ge a_i^T(A^TA + \lambda I)^{-1}a_i$. For $i = 1, \dots, n$, let $p_i := \min(c\,l_i, 1)$. Construct $\tilde{A}$ by independently sampling each row $a_i$ of $A$ with probability $p_i$, and rescaling it by $1/\sqrt{p_i}$ if it is included in the sample. Then, with high probability, $(1-\epsilon)A^TA - \delta I \preceq \tilde{A}^T\tilde{A} \preceq (1+\epsilon)A^TA + \delta I$, and the number of rows in $\tilde{A}$ is $O\left(\left(\sum_{i=1}^n l_i\right)\log d/\epsilon^2\right)$.

Proof. This sort of guarantee for leverage score sampling is well known; see for example Lemma 4 of [11]. If we sampled both the rows of $A$ and the rows of $\sqrt{\lambda} I$ with the leverage scores over $(A^TA + \lambda I)$, we would have $(1-\epsilon)(A^TA + \lambda I) \preceq \tilde{A}^T\tilde{A} \preceq (1+\epsilon)(A^TA + \lambda I)$. However, we do not sample the rows of the identity. Since we could have sampled them each with probability 1, we can simply subtract $\lambda I = (\delta/\epsilon)I$ from the multiplicative bound and obtain the stated guarantee.

The idea of using overestimates of leverage scores to perform row sampling has been applied successfully to various problems (see, e.g., [17,11]). However, in these applications, access to the entire matrix is required beforehand. In the streaming and online settings, we have to rely on partial data to approximate the true leverage scores. The most natural idea is to just use the portion of the matrix seen thus far as an approximation to $A$. This leads us to introduce the online $\lambda$-ridge leverage scores $l_i := \min\left(a_i^T(A_{i-1}^TA_{i-1} + \lambda I)^{-1}a_i,\, 1\right)$, where $A_i$ is defined as the matrix consisting of the first $i$ rows of $A$. Since clearly $A_{i-1}^TA_{i-1} \preceq A^TA$ for all $i$, it is not hard to see that $l_i$ does overestimate the true $\lambda$-ridge leverage score of row $a_i$. A more complex question, however, is establishing an upper bound on $\sum_{i=1}^n l_i$ so that we can bound the number of samples needed by Theorem 2.1.
A core result of this work, stated in Theorem 2.2, is establishing such an upper bound; in fact, this bound is shown to be tight up to constants (Theorem 5.1) and is nearly linear in most cases.
Theorem 2.2. Let $A$ be an $n \times d$ matrix with rows $a_1, \dots, a_n$. Let $A_i$ for $i \in \{0, \dots, n\}$ be the matrix consisting of the first $i$ rows of $A$. For $\lambda > 0$, let $l_i := \min\left(a_i^T(A_{i-1}^TA_{i-1} + \lambda I)^{-1}a_i,\, 1\right)$ be the online $\lambda$-ridge leverage score of the $i$-th row of $A$. Then $\sum_{i=1}^n l_i = O(d\log(\|A\|_2^2/\lambda))$.

Theorems 2.1 and 2.2 suggest a simple algorithm for online row sampling: simply use the online $\lambda$-ridge leverage scores, with $\lambda := \delta/\epsilon$. This gives a spectral approximation with $O(d\log d\log(\epsilon\|A\|_2^2/\delta)/\epsilon^2)$ rows. Unfortunately, computing each $l_i$ exactly requires us to store all the rows we have seen in memory (or, alternatively, to store the sum of their outer products, $A_i^TA_i$). In many cases, such a requirement would defeat the purpose of streaming row sampling.
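Theorem 2.2's bound can be checked numerically. The sketch below (our illustration) computes exact online ridge leverage scores by storing $A_{i-1}^TA_{i-1}$, which is precisely the $O(d^2)$-memory approach just described, and compares their sum against the determinant-based bound from the theorem's proof:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 2000, 10, 0.1
A = rng.standard_normal((n, d))

M = lam * np.eye(d)  # A_{i-1}^T A_{i-1} + lam * I, maintained incrementally
total = 0.0
for a in A:
    l = min(a @ np.linalg.solve(M, a), 1.0)  # online lam-ridge leverage score of row a
    total += l
    M += np.outer(a, a)

# Determinant bound: each score forces a multiplicative determinant increase, so
#   sum l_i <= 2 log( det(A^T A + lam I) / lam^d ) <= 2 d log(1 + ||A||_2^2 / lam).
sign, logdet = np.linalg.slogdet(M)
det_bound = 2 * (logdet - d * np.log(lam))
norm_bound = 2 * d * np.log(1 + np.linalg.norm(A, 2) ** 2 / lam)
```

Both inequalities are deterministic, so they hold for any input matrix and any row order, which is the content of the theorem.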
A natural idea is to use the sample we have kept thus far as an approximation to $A_i$ when computing $l_i$. It turns out that the approximate online ridge leverage scores $\tilde{l}_i$ computed in this way will not always be good approximations to $l_i$; however, we can still prove that they satisfy the requisite bounds and yield the same row sample size! We formalize these results in the algorithm ONLINE-SAMPLE (Figure 1) and Theorem 2.3.
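Figure 1 is not reproduced in this excerpt; the following is a hedged sketch of ONLINE-SAMPLE's structure. The $(1+\epsilon)$ multiplier on the estimated score and the exact constants are our assumptions, chosen so the estimate remains an overestimate when the kept sample is a good approximation; the paper's precise constants may differ.

```python
import numpy as np

def online_sample(rows, eps, delta, rng):
    # Sketch of ONLINE-SAMPLE: sample each arriving row by an approximate
    # online ridge leverage score computed from the rows kept so far.
    d = len(rows[0])
    lam = delta / eps
    c = 8 * np.log(d) / eps ** 2        # oversampling rate, as in Theorem 2.1
    M = lam * np.eye(d)                 # A~_{i-1}^T A~_{i-1} + lam * I
    kept = []
    for a in rows:
        # (1 + eps) factor keeps the estimate an overestimate (our assumption).
        l_approx = min((1 + eps) * a @ np.linalg.solve(M, a), 1.0)
        p = min(c * l_approx, 1.0)
        if rng.random() < p:            # irrevocable keep/discard decision
            w = a / np.sqrt(p)
            kept.append(w)
            M += np.outer(w, w)
    return np.array(kept)

rng = np.random.default_rng(3)
n, d, eps, delta = 20000, 8, 0.7, 0.05
A = rng.standard_normal((n, d))
A_tilde = online_sample(A, eps, delta, rng)
```

Note that only $M$ and the output are stored; the decision for each row depends on earlier coin flips only through $M$, which is the limited dependency structure the analysis exploits.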

Theorem 2.3. With high probability, ONLINE-SAMPLE returns $\tilde{A}$ such that $(1-\epsilon)A^TA - \delta I \preceq \tilde{A}^T\tilde{A} \preceq (1+\epsilon)A^TA + \delta I$, and the number of rows in $\tilde{A}$ is $O(d\log d\log(\epsilon\|A\|_2^2/\delta)/\epsilon^2)$.
To save computation, we note that, with a small modification, we can run ONLINE-SAMPLE with batch processing of rows. Specifically, say we start from the $i$-th position in the stream; we can store the next $b = O(d)$ rows. We can then compute sampling probabilities for all of these rows at once using a system solver for $(\tilde{A}_{i+b}^T\tilde{A}_{i+b} + \lambda I)$. Using a trick introduced in [25], applying a Johnson-Lindenstrauss random projection to the rows whose scores we are computing, we need just $O(\log(1/\gamma))$ system solves to compute constant-factor approximations to the ridge leverage scores with probability $1-\gamma$. If we set $\gamma = 1/\mathrm{poly}(n)$, then we can union bound over our whole stream, using this trick with each batch of $O(d)$ input rows. The batch probabilities will only be closer to the true ridge leverage scores than the non-batch probabilities, so we enjoy the same guarantees as ONLINE-SAMPLE.
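The projection trick can be sketched as follows (a standalone illustration with assumed parameters, not the paper's exact procedure). Writing $M = A^TA + \lambda I = C^TC$ with $C = [A; \sqrt{\lambda} I]$, each ridge leverage score is a norm, $l_i = \|CM^{-1}a_i\|^2$, and a Gaussian sketch of $C$ preserves all such norms to a constant factor using only $k$ system solves:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, lam = 2000, 20, 1.0
A = rng.standard_normal((n, d))
M = A.T @ A + lam * np.eye(d)

# Exact ridge leverage scores: l_i = a_i^T M^{-1} a_i = ||C M^{-1} a_i||^2,
# where C = [A; sqrt(lam) I] satisfies C^T C = M.
exact = np.einsum('ij,jk,ik->i', A, np.linalg.inv(M), A)

k = 400  # O(log(1/gamma)) rows suffice for constant-factor estimates per row
C = np.vstack([A, np.sqrt(lam) * np.eye(d)])
Pi = rng.standard_normal((k, n + d)) / np.sqrt(k)   # Johnson-Lindenstrauss sketch
Q = (Pi @ C) @ np.linalg.inv(M)                     # in practice: k linear system solves
approx = ((Q @ A.T) ** 2).sum(axis=0)               # ||Pi C M^{-1} a_i||^2 per row

ratio = approx / exact
```

The explicit inverse above stands in for the $k$ solves one would perform against $M$; the point is that $k$ is independent of $n$.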
Additionally, it turns out that with a simple trick, it is possible to reduce the memory usage of the algorithm by a factor of $\epsilon^{-2}$, bringing it down to $O(d\log d\log(\epsilon\|A\|_2^2/\delta))$ (assuming the row sample is written to an output stream). Note that this expression gets smaller with $\epsilon$; hence we obtain a row sampling algorithm with memory complexity independent of the desired multiplicative precision. The basic idea is that, instead of keeping all previously sampled rows in memory, we store a smaller set of rows that give a constant-factor spectral approximation, still enough to give good estimates of the online ridge leverage scores.
This result is presented in the algorithm SLIM-SAMPLE (Figure 2) and Lemma 3.5. A particularly interesting consequence, for graphs with polynomially bounded edge weights, is Corollary 2.4: such graphs admit online row sampling with $O(d\log^2 d/\epsilon^2)$ sampled rows. For an unweighted graph on $d$ vertices, $\sigma_{\max}(A)^2 \le d$, since $d$ is the largest squared singular value of the complete graph. Combining this with Lemma 6.1 of [27], we have that the condition number of a graph on $d$ vertices whose edge weights are within a multiplicative $\mathrm{poly}(d)$ of each other is polynomial in $d$. So $\log(\kappa^2(A)) = O(\log d)$, which gives the corollary.
We remark that the algorithm of Corollary 2.4 can be made to run in nearly linear time in the stream size. We combine SLIM-SAMPLE with the batch processing idea described above. Because $A$ is a graph, our matrix approximation is always a symmetric diagonally dominant matrix with $O(d)$ nonzero entries. We can solve systems in it in time $\tilde{O}(d)$. Using the Johnson-Lindenstrauss random projection trick of [25], we can compute approximate ridge leverage scores for a batch of $O(d)$ rows with failure probability polynomially small in $n$ in $\tilde{O}(d\log n)$ time. Union bounding over the whole stream, we obtain nearly linear running time.
To complement the row sampling results discussed above, we explore the limits of the proposed online setting. In Section 4 we present the algorithm ONLINE-BSS, which obtains spectral approximations with $O(d\log(\epsilon\|A\|_2^2/\delta)/\epsilon^2)$ rows in the online setting (with larger memory requirements than the simpler sampling algorithms). Its analysis is given in Theorem 4.1. In Section 5, we show that this number of samples is in fact the best achievable, up to constant factors (Theorem 5.1). The $\log(\epsilon\|A\|_2^2/\delta)$ factor is truly the cost of requiring rows to be selected in an online manner.

THEORY OF COMPUTING, Volume 16 (15), 2020, pp. 1-25

Analysis of sampling schemes
We begin by bounding the sum of the online $\lambda$-ridge leverage scores. The intuition behind the proof of Theorem 2.2 is that whenever we add a row with a large online leverage score to a matrix, we increase its determinant significantly, as follows from the matrix determinant lemma (Lemma 3.1). Thus we can reduce upper bounding the sum of the online leverage scores to bounding the matrix determinant.
Proof of Theorem 2.2. By Lemma 3.1, we have
$$\det(A_i^TA_i + \lambda I) = \det(A_{i-1}^TA_{i-1} + \lambda I)\left(1 + a_i^T(A_{i-1}^TA_{i-1} + \lambda I)^{-1}a_i\right) \ge \det(A_{i-1}^TA_{i-1} + \lambda I)(1 + l_i) \ge \det(A_{i-1}^TA_{i-1} + \lambda I)\,e^{l_i/2},$$
using that $1 + x \ge e^{x/2}$ for $x \in [0,1]$. Hence,
$$\det(A^TA + \lambda I) \ge \det(\lambda I)\,e^{\sum_{i=1}^n l_i/2} = \lambda^d\,e^{\sum_{i=1}^n l_i/2}.$$
Taking logarithms of both sides, we obtain
$$\sum_{i=1}^n l_i \le 2\log\frac{\det(A^TA + \lambda I)}{\lambda^d} \le 2d\log\left(1 + \frac{\|A\|_2^2}{\lambda}\right) = O(d\log(\|A\|_2^2/\lambda)).$$

We now turn to analyzing the algorithm ONLINE-SAMPLE. Because the samples taken by the algorithm are not independent, we are not able to use a standard matrix Chernoff bound like the one behind Theorem 2.1. However, we do know that whether we take row $i$ does not depend on later rows; thus, we are able to analyze the process as a martingale. We will use a matrix version of the Freedman inequality given by Tropp.
Theorem 3.2 (Matrix Freedman inequality [28]). Let $Y_0, Y_1, \dots, Y_n$ be a matrix martingale whose values are self-adjoint matrices with dimension $d$, and let $X_1, \dots, X_n$ be the difference sequence. Assume that the difference sequence is uniformly bounded in the sense that $\|X_k\|_2 \le R$ almost surely, for $k = 1, \dots, n$.
Define the predictable quadratic variation process of the martingale as
$$W_k := \sum_{j=1}^k \mathbb{E}\left[X_j^2 \mid Y_{j-1}, \dots, Y_0\right].$$
Then, for all $\epsilon > 0$ and $\sigma^2 > 0$,
$$\Pr\left[\exists k \ge 0 : \|Y_k\|_2 \ge \epsilon \text{ and } \|W_k\|_2 \le \sigma^2\right] \le d\cdot\exp\left(\frac{-\epsilon^2/2}{\sigma^2 + R\epsilon/3}\right).$$

We begin by showing that the output of ONLINE-SAMPLE is in fact an approximation of $A$, and that the approximate online leverage scores are lower bounded by the actual online leverage scores.
We construct a matrix martingale $Y_0, Y_1, \dots, Y_n \in \mathbb{R}^{d\times d}$ with difference sequence $X_1, \dots, X_n$. Set $Y_0 = 0$. If $\|Y_{i-1}\|_2 \ge \epsilon$, we set $X_i := 0$. Otherwise, let
$$X_i := (A^TA + \lambda I)^{-1/2}\left(\tilde{a}_i\tilde{a}_i^T - a_ia_i^T\right)(A^TA + \lambda I)^{-1/2},$$
where $\tilde{a}_i = a_i/\sqrt{p_i}$ if row $i$ is sampled and $\tilde{a}_i = 0$ otherwise. In the case that $\|Y_{i-1}\|_2 < \epsilon$, by construction, $\|Y_j\|_2 < \epsilon$ for all $j \le i-1$, and so the difference sequence admits the uniform bound required by Theorem 3.2.

Multiplying on both right and left by $(A^TA + \lambda I)^{1/2}$, the event $\|Y_n\|_2 \le \epsilon$ implies
$$-\epsilon(A^TA + \lambda I) \preceq \tilde{A}^T\tilde{A} - A^TA \preceq \epsilon(A^TA + \lambda I).$$
And so, bounding the predictable quadratic variation process of the martingale $\{Y_i\}$ and applying Theorem 3.2, with high probability
$$(1-\epsilon)(A^TA + \lambda I) \preceq \tilde{A}^T\tilde{A} + \lambda I \preceq (1+\epsilon)(A^TA + \lambda I).$$
Subtracting $\lambda I = (\delta/\epsilon)I$ from all sides, we get
$$(1-\epsilon)A^TA - \delta I \preceq \tilde{A}^T\tilde{A} \preceq (1+\epsilon)A^TA + \delta I.$$
Finally, note that since the sampling probabilities are set using $\tilde{A}_{i-1}$, which we have shown spectrally approximates $A_{i-1}$, we obtain the desired lower bound on $\tilde{l}_i$ by equation (3.1).
If we set $c$ in ONLINE-SAMPLE to be proportional to $\log n$ rather than $\log d$, we would be able to take a union bound over all the rows and guarantee that with high probability all the approximate online leverage scores $\tilde{l}_i$ are close to the true online leverage scores $l_i$. Thus Theorem 2.2 would imply that ONLINE-SAMPLE only selects $O(d\log n\log(\|A\|_2^2/\lambda)/\epsilon^2)$ rows with high probability. In order to remove the dependency on $n$, we have to sacrifice achieving close approximations to $l_i$ at every step. Instead, we show that the sum of the computed approximate online leverage scores is still small with high probability, using a custom Chernoff bound.
Proof. Define $\delta_i := \log\det(\tilde{A}_i^T\tilde{A}_i + \lambda I) - \log\det(\tilde{A}_{i-1}^T\tilde{A}_{i-1} + \lambda I)$. The proof closely follows the idea of the proof of Theorem 2.2. We will aim to show that large values of $\tilde{l}_i$ correlate with large values of $\delta_i$. Then, the sum of the $\delta_i$ can be bounded by the logarithm of the ratio of the determinants of $\tilde{A}^T\tilde{A} + \lambda I$ and $\lambda I$, giving us a bound on the sum of the $\tilde{l}_i$. First, we show that $\mathbb{E}\left[\exp(\tilde{l}_i/8 - \delta_i) \mid \tilde{A}_{i-1}, \dots, \tilde{A}_0\right]$ is always at most 1: if row $i$ is sampled by ONLINE-SAMPLE with $p_i < 1$, this follows from the relation between $\tilde{l}_i$ and the determinant increase; otherwise, we have $p_i = 1$, and the claim follows by (3.3). We then analyze the expected product of $\exp(\tilde{l}_i/8 - \delta_i)$ over the first $k$ steps, $\mathbb{E}\left[\exp\left(\sum_{i=1}^k (\tilde{l}_i/8 - \delta_i)\right)\right]$. Since, conditioned on the first $k-1$ steps, the final factor has conditional expectation at most 1, induction shows that this expected product is at most 1. Hence, by Markov's inequality, $\sum_{i=1}^n (\tilde{l}_i/8 - \delta_i)$ is small with high probability. By Lemma 3.3, with high probability we have $\tilde{A}^T\tilde{A} + \lambda I \preceq (1+\epsilon)(A^TA + \lambda I)$. We also have, with high probability, $\sum_{i=1}^n \delta_i = \log\frac{\det(\tilde{A}^T\tilde{A} + \lambda I)}{\det(\lambda I)} = O(d\log(\|A\|_2^2/\lambda))$. Hence, with high probability, $\sum_{i=1}^n \tilde{l}_i = O(d\log(\|A\|_2^2/\lambda))$.

Proof of Theorem 2.3. The statement follows immediately from Lemmas 3.3 and 3.4.
Observe that by Theorem 2.3, ONLINE-SAMPLE stores $O(d\log d\log(\epsilon\|A\|_2^2/\delta)/\epsilon^2)$ rows in memory. We now consider a simple modification of the algorithm, SLIM-SAMPLE (Figure 2), that removes the $1/\epsilon^2$ factor from the working memory usage at no additional cost.
Lemma 3.5. With high probability, SLIM-SAMPLE returns $\tilde{A}$ such that $(1-\epsilon)A^TA - \delta I \preceq \tilde{A}^T\tilde{A} \preceq (1+\epsilon)A^TA + \delta I$, and the number of rows in $\tilde{A}$ is $O(d\log d\log(\epsilon\|A\|_2^2/\delta)/\epsilon^2)$. Moreover, with high probability, the memory requirement of SLIM-SAMPLE is dominated by storing $O(d\log d\log(\epsilon\|A\|_2^2/\delta))$ rows of $A$.
Proof. As the samples are independent, the statement follows from Theorem 2.1 and Lemmas 3.3 and 3.4.

Asymptotically optimal algorithm
In addition to sampling by online leverage scores, we introduce a row sampling algorithm, ONLINE-BSS (Figure 3), which improves the row count of ONLINE-SAMPLE by a $\log d$ factor, to $O(d\log(\epsilon\|A\|_2^2/\delta)/\epsilon^2)$. This improved bound matches the lower bound for online sampling given in Theorem 5.1. The approach uses a variant of the deterministic "BSS" method, introduced by Batson, Spielman, and Srivastava in [2]. It is well known that this method yields spectral approximations with a $\log d$ factor fewer rows than leverage score sampling in the offline setting, and we show that this improvement extends to online approximation. Unlike the original BSS algorithm of [2], our algorithm is randomized. It is similar to, and inspired by, the randomized version of BSS from [19], especially "Algorithm 1" of that paper. In both algorithms, as in online leverage score sampling, when a new row is processed, a probability $p_i$ is assigned to it, and the row is kept with probability $p_i$ and rejected otherwise. The key difference between the algorithms is in the definition of $p_i$. Like ONLINE-SAMPLE, at each step ONLINE-BSS maintains a row sample $\tilde{A}_i$ which approximates the matrix $A_i$ seen so far. However, $p_i$ cannot be computed solely from $\tilde{A}_{i-1}$; it is necessary to "remember" the entire input. Thus, ONLINE-BSS is not memory efficient, using $O(d^2)$ space. One may improve the memory dependence by simply running ONLINE-BSS on the output stream of rows produced by ONLINE-SAMPLE. This reduces the storage cost to the size of that output spectral approximation. Of course, this does not mean that ONLINE-BSS leads to a space savings over ONLINE-SAMPLE; however, the number of rows in its output stream will be smaller than that of ONLINE-SAMPLE by a $\log d$ factor.
We also remark that ONLINE-SAMPLE gives bounds on both the size of the output spectral approximation and its accuracy with high probability. In contrast, ONLINE-BSS gives an expected bound on the output size, while it never fails to output a correct spectral approximation. These guarantees are similar to those given in [19]. Below, we present the performance guarantees of ONLINE-BSS and their analysis.

Figure 3: The Online BSS Algorithm
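Since Figure 3 is not reproduced in this excerpt, the following sketch shows one plausible reading of the barrier mechanism. The exact form of $p_i$ in step 3(b) is our assumption, chosen to match the constants $c_U = 2/\epsilon + 1$ and $c_L = 2/\epsilon - 1$ used in the analysis below; with this choice, the barrier invariant holds deterministically.

```python
import numpy as np

def online_bss(rows, eps, delta, rng):
    # Barrier-based online sampling: keep A~_i^T A~_i strictly between a lower
    # and an upper barrier that drift toward (1 -/+ eps) A^T A -/+ delta I.
    d = len(rows[0])
    cU, cL = 2 / eps + 1, 2 / eps - 1
    S = np.zeros((d, d))            # A~_i^T A~_i
    BU = delta * np.eye(d)          # upper barrier B^U_i
    BL = -delta * np.eye(d)         # lower barrier B^L_i
    kept = []
    for a in rows:
        XU, XL = BU - S, S - BL     # slack to each barrier
        # Assumed form of step 3(b): p_i is large when close to either barrier.
        p = min(1.0, cU * a @ np.linalg.solve(XU, a) + cL * a @ np.linalg.solve(XL, a))
        if rng.random() < p:
            kept.append(a / np.sqrt(p))
            S = S + np.outer(a, a) / p
        BU = BU + (1 + eps) * np.outer(a, a)
        BL = BL + (1 - eps) * np.outer(a, a)
    return np.array(kept)

rng = np.random.default_rng(5)
n, d, eps, delta = 20000, 8, 0.6, 0.1
A = rng.standard_normal((n, d))
A_tilde = online_bss(A, eps, delta, rng)
```

As described in the text, the accuracy guarantee here is deterministic (the sample never leaves the barriers), while the number of kept rows is bounded only in expectation.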
Proof of Theorem 4.1, Part 1. As in [2], a key idea of ONLINE-BSS is to maintain two matrices, $B^U_i$ and $B^L_i$, acting as upper and lower "barriers." We will prove that the current approximation $\tilde{A}_i$ always falls between them:
$$B^L_i \prec \tilde{A}_i^T\tilde{A}_i \prec B^U_i. \tag{4.1}$$
Equivalently, $X^U_i := B^U_i - \tilde{A}_i^T\tilde{A}_i$ and $X^L_i := \tilde{A}_i^T\tilde{A}_i - B^L_i$ will always remain positive definite. Since, at the completion of the algorithm, $B^U_n = (1+\epsilon)A^TA + \delta I$ and $B^L_n = (1-\epsilon)A^TA - \delta I$, this ensures that the final approximation always satisfies the approximation bound in claim (1) of the theorem. $p_i$ is chosen at step 3(b) to ensure this invariant: if either $X^U_i$ or $X^L_i$ is too small (we are too close to one of the barriers), then at least one of $a_i^T(X^U_{i-1})^{-1}a_i$ or $a_i^T(X^L_{i-1})^{-1}a_i$ will be large, and so $p_i$ will be large. We prove invariant (4.1) by induction on $i$. The base case follows from the initialization of $\tilde{A}_0$ with $\tilde{A}_0^T\tilde{A}_0 = 0$, $B^U_0 = \delta I$, and $B^L_0 = -\delta I$, since clearly $-\delta I \prec 0 \prec \delta I$. For each successive step, we consider two possibilities.
Case 1: $p_i = 1$. Since, by the induction assumption, $X^U_{i-1}$ and $X^L_{i-1}$ are both positive definite, so are $X^U_i$ and $X^L_i$, giving the claim.

Case 2: $p_i < 1$. Since $c_L = 2/\epsilon - 1 > 1$ for $\epsilon \in (0,1)$, and since $p_i < 1$ (by the fact that we are in Case 2), we have $a_i^T(X^L_{i-1})^{-1}a_i < 1$. This in turn gives $X^L_{i-1} \succ a_ia_i^T$, and thus, since $X^L_i \succeq X^L_{i-1} - a_ia_i^T$, it must be positive definite, giving the claim in this case.
Thus, we have shown (4.1) for all $i$. In particular, $B^L_n \prec \tilde{A}^T\tilde{A} \prec B^U_n$. Since by construction $B^U_n = (1+\epsilon)A^TA + \delta I$ and $B^L_n = (1-\epsilon)A^TA - \delta I$, this gives the first claim of the theorem.
In our proof of the second claim, bounding the expected number of rows sampled, we will need the following technical lemma, which is derived from the Sherman-Morrison formula [24].

Lemma 4.2. Given a positive definite matrix $X$, two vectors $u$ and $v$, two scalar multipliers $a < 1$ and $b < 1$, and a probability $p$, define the random variable $\bar{X}$ to be $X - a\,uu^T$ with probability $p$ and $X - b\,uu^T$ otherwise. Then, if $u^TX^{-1}u = 1$,
$$\mathbb{E}\left[v^T\bar{X}^{-1}v\right] = v^TX^{-1}v + \frac{pa + (1-p)b - ab}{(1-a)(1-b)}\left(v^TX^{-1}u\right)^2.$$

Proof. We apply the Sherman-Morrison formula to each of the two possibilities ($\bar{X} = X - a\,uu^T$ and $\bar{X} = X - b\,uu^T$, respectively). These give respective $\bar{X}^{-1}$ values of
$$X^{-1} + \frac{a}{1-a}X^{-1}uu^TX^{-1} \quad\text{and}\quad X^{-1} + \frac{b}{1-b}X^{-1}uu^TX^{-1}.$$
Combining these gives the stated result.
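Lemma 4.2 can be sanity-checked numerically. The script below (our illustration, with arbitrary concrete numbers) verifies the Sherman-Morrison consequence that the lemma's proof computes, stated for a test direction $v$:

```python
import numpy as np

rng = np.random.default_rng(6)
d = 5
G = rng.standard_normal((d, d))
X = G @ G.T + d * np.eye(d)                     # positive definite
u, v = rng.standard_normal(d), rng.standard_normal(d)
u = u / np.sqrt(u @ np.linalg.solve(X, u))      # normalize so u^T X^{-1} u = 1
a, b, p = 0.3, -0.4, 0.6                        # multipliers a, b < 1; probability p

# E[v^T Xbar^{-1} v] over Xbar = X - a u u^T (w.p. p) or X - b u u^T (w.p. 1-p):
lhs = (p * v @ np.linalg.inv(X - a * np.outer(u, u)) @ v
       + (1 - p) * v @ np.linalg.inv(X - b * np.outer(u, u)) @ v)

# Closed form: v^T X^{-1} v + (pa + (1-p)b - ab)/((1-a)(1-b)) * (v^T X^{-1} u)^2.
Xinv = np.linalg.inv(X)
coef = (p * a + (1 - p) * b - a * b) / ((1 - a) * (1 - b))
rhs = v @ Xinv @ v + coef * (v @ Xinv @ u) ** 2
```

Since the combined coefficient has denominator $(1-a)(1-b)$, showing that an expected quadratic form does not increase reduces to checking the sign of the numerator $pa + (1-p)b - ab$, which is how the coefficient is used in the proof of Part 2.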
Proof of Theorem 4.1, Part 2. We will show that the probability that row $a_i$ is included in $\tilde{A}$ is at most $8/\epsilon^2\cdot l_i$, where $l_i$ is the online $2\delta/\epsilon$-ridge leverage score of $a_i$, i.e., $l_i = \min\left(a_i^T(A_{i-1}^TA_{i-1} + (2\delta/\epsilon)I)^{-1}a_i,\, 1\right)$. By Theorem 2.2, this implies that $\tilde{A}$ has $O(d\log(\epsilon\|A\|_2^2/\delta)/\epsilon^2)$ rows in expectation, completing the second claim of the theorem.
First, we introduce some notation to help in the analysis. Let $q_i$ be the probability that row $a_i$ is sampled by the algorithm. Note that $q_i$ is fixed, and we seek to prove that $q_i \le 8/\epsilon^2\cdot l_i$. The probability $p_i$ that $a_i$ is sampled at step $i$ is a random variable with $q_i = \mathbb{E}[p_i]$. We can then define
$$Y^U_{i-1,j} := X^U_j + \frac{\epsilon}{2}\sum_{k=j+1}^{i-1}a_ka_k^T, \qquad Y^L_{i-1,j} := X^L_j + \frac{\epsilon}{2}\sum_{k=j+1}^{i-1}a_ka_k^T,$$
so that $Y^U_{i-1,i-1} = X^U_{i-1}$ and $Y^L_{i-1,i-1} = X^L_{i-1}$. Assume that $l_i < 1$; otherwise, since $p_i \le 1$ (it is a probability), we trivially have $\mathbb{E}[p_i] \le 8/\epsilon^2\cdot l_i$ as desired. Now, note that for all $i > 0$,
$$p_i \le c_U\,a_i^T(X^U_{i-1})^{-1}a_i + c_L\,a_i^T(X^L_{i-1})^{-1}a_i. \tag{4.2}$$
Next, we will show that for $j < i-1$,
$$\mathbb{E}\left[a_i^T(Y^U_{i-1,j+1})^{-1}a_i\right] \le \mathbb{E}\left[a_i^T(Y^U_{i-1,j})^{-1}a_i\right] \tag{4.3}$$
and
$$\mathbb{E}\left[a_i^T(Y^L_{i-1,j+1})^{-1}a_i\right] \le \mathbb{E}\left[a_i^T(Y^L_{i-1,j})^{-1}a_i\right]. \tag{4.4}$$
Combined with (4.2) and the fact that $Y^U_{i-1,0} \succeq \frac{\epsilon}{2}\left(A_{i-1}^TA_{i-1} + \frac{2\delta}{\epsilon}I\right)$ (and likewise for $Y^L_{i-1,0}$), (4.3) and (4.4) give
$$q_i = \mathbb{E}[p_i] \le (c_U + c_L)\cdot\frac{2}{\epsilon}\cdot l_i = \frac{8}{\epsilon^2}\cdot l_i.$$
The last equality follows from the fact that in ONLINE-BSS we set $c_U = 2/\epsilon + 1$ and $c_L = 2/\epsilon - 1$. This completes the claim that for all $i$, the probability $q_i$ that row $a_i$ is sampled is bounded by $q_i = \mathbb{E}[p_i] \le 8/\epsilon^2\cdot l_i$, giving the second part of Theorem 4.1.

It remains to prove (4.3) and (4.4). To do this we will show a somewhat stronger statement: conditioned on any choices for the first $j$ rows, the expected value of $a_i^T(Y^U_{i-1,j+1})^{-1}a_i$ is no larger than that of $a_i^T(Y^U_{i-1,j})^{-1}a_i$, and analogously for $(Y^L_{i-1,j+1})^{-1}$. Similarly to the proof of Part 1, we consider two cases.

Case 1: $p_{j+1} = 1$. In this case, the positive semidefinite matrix $\frac{\epsilon}{2}a_{j+1}a_{j+1}^T$ is added in moving from $Y^U_{i-1,j}$ to $Y^U_{i-1,j+1}$, since $\tilde{A}_{j+1}^T\tilde{A}_{j+1} = \tilde{A}_j^T\tilde{A}_j + a_{j+1}a_{j+1}^T$. This gives that $a_i^T(Y^U_{i-1,j+1})^{-1}a_i \le a_i^T(Y^U_{i-1,j})^{-1}a_i$. An analogous argument holds for $Y^L_{i-1,j+1}$, giving (4.4).

Case 2: $p_{j+1} < 1$. This case is more delicate. Importantly, by how $p_{j+1}$ is set in step 3(b) of ONLINE-BSS, and by the observation that $Y^U_{i-1,j} \succeq X^U_j$ and $Y^L_{i-1,j} \succeq X^L_j$ for $j \le i-1$ (recall that we must prove (4.3) and (4.4) under the assumption that $j \le i-1$), we have
$$a_{j+1}^T(Y^U_{i-1,j})^{-1}a_{j+1} \le p_{j+1}/c_U \tag{4.5}$$
and
$$a_{j+1}^T(Y^L_{i-1,j})^{-1}a_{j+1} \le p_{j+1}/c_L. \tag{4.6}$$
Now, we apply Lemma 4.2 with $u = w_{j+1} := a_{j+1}/\sqrt{r}$, where $r := a_{j+1}^T(Y^U_{i-1,j})^{-1}a_{j+1}$, so that $u^T(Y^U_{i-1,j})^{-1}u = 1$, as required. With probability $p_{j+1}$, $Y^U_{i-1,j+1} = Y^U_{i-1,j} - (1/p_{j+1} - (1+\epsilon/2))a_{j+1}a_{j+1}^T$; similarly, with probability $1 - p_{j+1}$, $Y^U_{i-1,j+1} = Y^U_{i-1,j} + (1+\epsilon/2)a_{j+1}a_{j+1}^T$. To prove (4.3) it suffices to show that the coefficient $\frac{pa + (1-p)b - ab}{(1-a)(1-b)}$ from Lemma 4.2 is non-positive.
Letting $r = a_{j+1}^T(Y^U_{i-1,j})^{-1}a_{j+1}$, we can write $a = r\cdot(1/p_{j+1} - (1+\epsilon/2))$ and $b = -r\cdot(1+\epsilon/2) < 0$. By (4.5), $r \le p_{j+1}/c_U$ and thus $a \le r/p_{j+1} \le 1/c_U = 1/(2/\epsilon+1) < 1$. Thus, the denominator $(1-a)(1-b)$ is positive, and so it remains to show that the numerator $pa + (1-p)b - ab$ is non-positive. We can write
$$pa + (1-p)b - ab = -\frac{\epsilon r}{2} + r^2(1+\epsilon/2)\left(\frac{1}{p_{j+1}} - (1+\epsilon/2)\right) \le -\frac{\epsilon r}{2} + \frac{\epsilon r}{2} - r^2(1+\epsilon/2)^2 \le 0,$$
using $r \le p_{j+1}/c_U$ in the middle step.

Lower barrier bound (4.4): For the lower barrier bound we give a similar argument. We use $a = -r\,(1/p_{j+1} - (1-\epsilon/2))$, $b = r\,(1-\epsilon/2)$, and $p = p_{j+1}$, where now $r = a_{j+1}^T(Y^L_{i-1,j})^{-1}a_{j+1}$ and we again normalize $u = a_{j+1}/\sqrt{r}$ so that $u^T(Y^L_{i-1,j})^{-1}u = 1$, as required. With probability $p_{j+1}$, $Y^L_{i-1,j+1} = Y^L_{i-1,j} + (1/p_{j+1} - (1-\epsilon/2))a_{j+1}a_{j+1}^T$; similarly, with probability $1-p_{j+1}$, $Y^L_{i-1,j+1} = Y^L_{i-1,j} - (1-\epsilon/2)a_{j+1}a_{j+1}^T$. Again, to prove (4.4) it suffices to show that the numerator $pa + (1-p)b - ab$ is non-positive. We can write $a = -r\,(1/p_{j+1} - (1-\epsilon/2)) < 0$ and $b = r\,(1-\epsilon/2)$. Note that by (4.6), $r \le p_{j+1}/c_L = p_{j+1}/(2/\epsilon-1) < 1$, and thus $b < 1$. So the denominator $(1-a)(1-b)$ is positive. It thus remains to show that the numerator $pa + (1-p)b - ab$ is non-positive. We simplify this numerator as
$$pa + (1-p)b - ab = -\frac{\epsilon r}{2} + r^2(1-\epsilon/2)\left(\frac{1}{p_{j+1}} - (1-\epsilon/2)\right) \le -\frac{\epsilon r}{2} + \frac{\epsilon r}{2} - r^2(1-\epsilon/2)^2 \le 0,$$
giving the required bound. This proves (4.4) and completes the theorem.
Lower bound
Here we show that the row count obtained by Theorem 4.1 is in fact optimal. While it is possible to obtain a spectral approximation with $O(d/\epsilon^2)$ rows in the offline setting, online sampling always incurs a loss of $\Omega(\log(\epsilon\|A\|_2^2/\delta))$ and must sample $\Omega(d\log(\epsilon\|A\|_2^2/\delta)/\epsilon^2)$ rows.
Theorem 5.1. Assume that $\epsilon\|A\|_2^2 \ge c_1\delta$ and $\epsilon \ge c_2/\sqrt{d}$, for fixed constants $c_1$ and $c_2$. Then any algorithm that selects rows in an online manner and outputs a spectral approximation to $A^TA$ with $(1+\epsilon)$ multiplicative error and $\delta$ additive error with probability at least $1/2$ must sample $\Omega(d\log(\epsilon\|A\|_2^2/\delta)/\epsilon^2)$ rows of $A$ in expectation.
Note that the lower bounds we assume on $\epsilon\|A\|_2^2$ and $\epsilon$ are very minor. They just ensure that $\log(\epsilon\|A\|_2^2/\delta) \ge 1$ and that $\epsilon$ is not so small that we can essentially sample all rows.
Proof. We apply Yao's minimax principle, constructing, for any large enough $M$, a distribution on inputs $A$ with $\|A\|_2^2 \le M$ for which any deterministic online row selection algorithm that succeeds with probability at least $1/2$ must output $\Omega(d\log(\epsilon M/\delta)/\epsilon^2)$ rows in expectation. The best randomized algorithm that works with probability $1/2$ on any input matrix with $\|A\|_2^2 \le M$ therefore must select at least $\Omega(d\log(\epsilon M/\delta)/\epsilon^2)$ rows in expectation on the worst-case input, giving us the theorem.
Our distribution is as follows. We select an integer $N$ uniformly at random from $[1, \log(M\epsilon/\delta)]$. We then stream in the vertex-edge incidence matrices of $N$ complete graphs on $d$ vertices, doubling the weight of each successive graph. Intuitively, spectrally approximating a complete graph requires selecting $\Omega(d/\epsilon^2)$ edges [2] (as long as $\epsilon \geq c_2/\sqrt{d}$ for some fixed constant $c_2$). Each time we stream in a new graph with double the weight, we force the algorithm to add $\Omega(d/\epsilon^2)$ more edges to its output, eventually forcing it to return $\Omega(d/\epsilon^2 \cdot N)$ edges, which is $\Omega(d\log(M\epsilon/\delta)/\epsilon^2)$ in expectation.
Specifically, let $K_d$ be the $\binom{d}{2} \times d$ vertex-edge incidence matrix of the complete graph on $d$ vertices, so that $K_d^TK_d$ is the Laplacian matrix of the complete graph on $d$ vertices. We weight the first graph so that its Laplacian has all of its nonzero eigenvalues equal to $\delta/\epsilon$ (that is, each edge has weight $\delta/(d\epsilon)$). In this way, even if we select $N = \log(M\epsilon/\delta)$, we have overall $\|A\|_2^2 \leq \delta/\epsilon + 2\delta/\epsilon + \cdots + 2^{\log(M\epsilon/\delta)-1}\delta/\epsilon \leq M$. Even if $N = 1$, all nonzero eigenvalues of $A^TA$ are at least $\delta/\epsilon$, so achieving $(1+\epsilon)$ multiplicative error and $\delta I$ additive error is equivalent to achieving $(1+2\epsilon)$ multiplicative error. ($A^TA$ is a graph Laplacian, so it has a null space. However, as all rows are orthogonal to the null space, achieving additive error $\delta I$ is equivalent to achieving additive error $\delta I_r$, where $I_r$ is the identity projected onto the span of $A^TA$. Since $\delta I_r \preceq \epsilon A^TA$, the additive error can be absorbed into an extra $(1+\epsilon)$ multiplicative factor, which is why we must achieve $(1+2\epsilon)$ multiplicative error.)
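As a concrete illustration (not from the paper), the following Python sketch builds this hard instance for small parameters: it stacks the scaled incidence matrices of $N$ complete graphs with doubled weights and checks that $\|A\|_2^2 = (2^N - 1)\,\delta/\epsilon \leq M$. The parameter values $d$, $\epsilon$, $\delta$, $M$ are arbitrary choices for the demonstration.

```python
import itertools
import numpy as np

def incidence_complete(d):
    """Unweighted vertex-edge incidence matrix of K_d: one row e_u - e_v per edge."""
    rows = []
    for u, v in itertools.combinations(range(d), 2):
        row = np.zeros(d)
        row[u], row[v] = 1.0, -1.0
        rows.append(row)
    return np.array(rows)

d, eps, delta = 6, 0.5, 0.5
M = 16 * delta / eps                   # chosen so that log2(M * eps / delta) = 4
N = int(np.log2(M * eps / delta))      # number of doubled complete graphs streamed in

base_weight = (delta / eps) / d        # per-edge weight making nonzero eigenvalues delta/eps
K = incidence_complete(d)
# graph k (0-indexed) has its edge weights scaled by 2^k; rows scale by the square root
blocks = [np.sqrt(2 ** k * base_weight) * K for k in range(N)]
A = np.vstack(blocks)

eigs = np.linalg.eigvalsh(A.T @ A)     # ascending order
top = eigs[-1]
assert abs(top - (2 ** N - 1) * delta / eps) < 1e-8   # ||A||_2^2 = (2^N - 1) delta/eps
assert top <= M
print("top eigenvalue of A^T A:", top)
```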
In order for a deterministic algorithm to be correct with probability $1/2$ on our distribution, it must be correct for at least half of our $\log(M\epsilon/\delta)$ possible choices of $N$.
Let $i$ be the lowest choice of $N$ for which the algorithm is correct. By the lower bound of [2], the algorithm must output $\Omega(d/\epsilon^2)$ rows of $A_i$ to achieve a $(1+2\epsilon)$ multiplicative-error spectral approximation. Here $A_i$ is the input consisting of the vertex-edge incidence matrices of $i$ increasingly weighted complete graphs. Call the output on this input $\tilde{A}_i$. Now let $j$ be the second lowest choice of $N$ on which the algorithm is correct. Since the algorithm was correct on $A_i$ to within a multiplicative $(1+2\epsilon)$, to be correct on $A_j$ it must output a set of edges $\tilde{A}_j$ such that $\tilde{A}_j^T\tilde{A}_j$ approximates $A_j^TA_j$ to within a multiplicative $(1+2\epsilon)$. Since we double each successive copy of the complete graph, $A_j^TA_j \preceq 2(A_j^TA_j - A_i^TA_i)$. So $\tilde{A}_j^T\tilde{A}_j - \tilde{A}_i^T\tilde{A}_i$ must be a $(1+8\epsilon)$ spectral approximation to the true difference $A_j^TA_j - A_i^TA_i$. Noting that this difference is itself just a weighting of the complete graph, by the lower bound in [2] the algorithm must select $\Omega(d/\epsilon^2)$ additional edges between the $i$th and $j$th input graphs. Iterating this argument over all $\log(M\epsilon/\delta)/2$ inputs on which the algorithm must be correct, it must select a total of $\Omega(d\log(M\epsilon/\delta)/\epsilon^2)$ edges in expectation over all inputs.
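The error-doubling step in this argument can be checked in a one-dimensional sketch (an illustrative simplification, not the paper's proof): if $\tilde{x}_j$ and $\tilde{x}_i$ approximate scalars $x_j$ and $x_i$ each within a $(1+2\epsilon)$ multiplicative factor, and $x_j \leq 2(x_j - x_i)$, then $\tilde{x}_j - \tilde{x}_i$ is within additive error $8\epsilon\,(x_j - x_i)$ of the true difference.

```python
import random

def relative_error_of_difference(x_i, x_j, t_i, t_j):
    """Error of (t_j - t_i) relative to the true gap (x_j - x_i)."""
    return abs((t_j - t_i) - (x_j - x_i)) / (x_j - x_i)

random.seed(1)
for _ in range(10000):
    eps = random.uniform(0.001, 0.1)
    x_i = random.uniform(0.0, 1.0)
    # x_j >= 2 x_i guarantees x_j <= 2 (x_j - x_i), as in the doubled-weight instance
    x_j = random.uniform(2.0 * x_i + 0.01, 4.0 * x_i + 1.0)
    # each approximation is within a (1 + 2 eps) multiplicative factor
    t_i = x_i * random.uniform(1.0 / (1.0 + 2.0 * eps), 1.0 + 2.0 * eps)
    t_j = x_j * random.uniform(1.0 / (1.0 + 2.0 * eps), 1.0 + 2.0 * eps)
    assert relative_error_of_difference(x_i, x_j, t_i, t_j) <= 8.0 * eps + 1e-12
print("difference always within additive 8*eps of the true gap")
```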

Future work
The main open question arising from the original publication of this work [13] was whether one could prove that the algorithm of [16] works despite the dependencies arising from its row pruning step. By operating in the online setting, our algorithm avoids row pruning and hence is able to skirt these dependencies, as the probability that a row is sampled depends only on earlier rows in the stream. However, because the streaming setting offers the potential for sampling fewer rows than the online setting, obtaining a rigorous proof for [16] is very interesting. This open question was essentially resolved in [18], which presents an algorithm similar to that of [16] for insertion-only streams that admits a correct proof.
While our work focuses on spectral approximation, variants on (ridge) leverage score sampling and the BSS algorithm are also used to solve low-rank approximation problems, including column subset selection [5,12] and projection-cost-preserving sketching [10,12]. Compared with spectral approximation, there is less work on streaming sampling for low-rank approximation, and understanding how online algorithms may be used in this setting would be an interesting direction. Since initial publication, this question has been studied extensively [6,8,7], with online ridge leverage scores being employed for online low-rank approximation of kernel matrices and for low-rank approximation in sliding window streams.