Efficient Rounding for the Noncommutative Grothendieck Inequality

$ \newcommand{\cclass}[1]{{\textsf{#1}}} $The classical Grothendieck inequality has applications to the design of approximation algorithms for $\cclass{NP}$-hard optimization problems. We show that an algorithmic interpretation may also be given for a noncommutative generalization of the Grothendieck inequality due to Pisier and Haagerup. Our main result, an efficient rounding procedure for this inequality, leads to a polynomial-time constant-factor approximation algorithm for an optimization problem which generalizes the Cut Norm problem of Frieze and Kannan, and is shown here to have additional applications to robust principal component analysis and the orthogonal Procrustes problem.


Introduction
In what follows, the standard scalar product on $\mathbb{C}^n$ is denoted $\langle\cdot,\cdot\rangle$, i.e., $\langle x,y\rangle=\sum_{i=1}^n x_i\overline{y_i}$ for all $x,y\in\mathbb{C}^n$. We always think of $\mathbb{R}^n$ as canonically embedded in $\mathbb{C}^n$; in particular the restriction of $\langle\cdot,\cdot\rangle$ to $\mathbb{R}^n$ is the standard scalar product on $\mathbb{R}^n$. Given a set $S$, the space $M_n(S)$ stands for all the matrices $M=(M_{ij})_{i,j=1}^n$ with $M_{ij}\in S$ for all $i,j\in\{1,\dots,n\}$. Thus, $M_n(M_n(\mathbb{R}))$ is naturally identified with the $n^4$-dimensional space of all 4-tensors $M=(M_{ijkl})_{i,j,k,l=1}^n$ with $M_{ijkl}\in\mathbb{R}$ for all $i,j,k,l\in\{1,\dots,n\}$. The set of all $n\times n$ orthogonal matrices is denoted $O_n\subseteq M_n(\mathbb{R})$, and the set of all $n\times n$ unitary matrices is denoted $U_n\subseteq M_n(\mathbb{C})$.
Given $M\in M_n(M_n(\mathbb{R}))$ and $U,V\in M_n(\mathbb{R})$, write $M(U,V)=\sum_{i,j,k,l=1}^n M_{ijkl}U_{ij}V_{kl}$, and define
$$\mathrm{Opt}_{\mathbb{R}}(M)=\max_{U,V\in O_n}M(U,V).$$
Our main result is the following theorem.

Theorem 1.1. There exists a polynomial-time algorithm that takes as input $M\in M_n(M_n(\mathbb{R}))$ and outputs $U,V\in O_n$ such that $M(U,V)\ge\mathrm{Opt}_{\mathbb{R}}(M)/O(1)$. Analogously, there exists a polynomial-time algorithm that takes as input $M\in M_n(M_n(\mathbb{C}))$ and outputs $U,V\in U_n$ such that $|M(U,V)|\ge\mathrm{Opt}_{\mathbb{C}}(M)/O(1)$, where in the complex case $M(U,V)=\sum_{i,j,k,l=1}^n M_{ijkl}U_{ij}\overline{V_{kl}}$ and $\mathrm{Opt}_{\mathbb{C}}(M)=\max_{U,V\in U_n}|M(U,V)|$.

We will explain the ideas that go into the proof of Theorem 1.1 later; it suffices to say at this juncture that our algorithm is based on a rounding procedure for semidefinite programs that is markedly different from rounding algorithms that have previously been used in the optimization literature, and as such it indicates the availability of techniques that have thus far remained untapped for the purpose of algorithm design. A non-constructive version of Theorem 1.1 was given in earlier works, as explained below; our contribution here is to design an efficient rounding algorithm and to establish various applications of it. Before explaining the proof of Theorem 1.1 we list some of its applications as an indication of its usefulness.
Remark 1.2. The implied constants in the $O(1)$ terms of Theorem 1.1 can be taken to be any number greater than $2\sqrt{2}$ in the real case, and any number greater than 2 in the complex case. There is no reason to believe that the factor $2\sqrt{2}$ in the real case is optimal, but the factor 2 in the complex case is sharp in a certain natural sense that will become clear later. The main content of Theorem 1.1 is the availability of a constant-factor algorithm rather than the value of the constant itself. In particular, the novelty of the applications to combinatorial optimization that are described below is the mere existence of a constant-factor approximation algorithm.

Applications of Theorem 1.1
We now describe some examples demonstrating the usefulness of Theorem 1.1. The first example does not lead to a new result, and is meant to put Theorem 1.1 in context. All the other examples lead to new algorithmic results. Many of the applications below follow from a more versatile reformulation of Theorem 1.1 that is presented in Section 5 (see Proposition 5.1).

The Grothendieck problem
The Grothendieck optimization problem takes as input a matrix $A\in M_n(\mathbb{R})$ and aims to efficiently compute (or estimate) the quantity
$$\max_{\varepsilon,\delta\in\{-1,1\}^n}\ \sum_{i,j=1}^n A_{ij}\varepsilon_i\delta_j. \tag{1.1}$$
This problem falls into the framework of Theorem 1.1 by considering the 4-tensor $M\in M_n(M_n(\mathbb{R}))$ given by $M_{iijj}=A_{ij}$ and $M_{ijkl}=0$ if either $i\ne j$ or $k\ne l$. Indeed,
$$\mathrm{Opt}_{\mathbb{R}}(M)=\max_{U,V\in O_n}\sum_{i,j=1}^n A_{ij}U_{ii}V_{jj}=\max_{\varepsilon,\delta\in\{-1,1\}^n}\sum_{i,j=1}^n A_{ij}\varepsilon_i\delta_j.$$
Here we used the diagonal structure of the tensor $M$ to argue that one may without loss of generality restrict the variables $U,V\in O_n$ to be diagonal $\pm1$ matrices. While diagonal matrices always commute, in general $U,V\in O_n$ will not, and it is in this sense that we may think of $\mathrm{Opt}_{\mathbb{R}}(\cdot)$ as a noncommutative generalization of (1.1). A polynomial-time constant-factor approximation algorithm for the Grothendieck problem was designed in [2], where it was also shown that it is NP-hard to approximate this problem within a factor less than $1+\varepsilon_0$ for some $\varepsilon_0\in(0,1)$. A simple transformation [2] relates the Grothendieck problem to the Frieze-Kannan Cut Norm problem [12] (this transformation can be made to have no loss in the approximation guarantee [26, Sec. 2.1]), and as such the constant-factor approximation algorithm for the Grothendieck problem has found a variety of applications in combinatorial optimization; see the survey [26] for much more on this topic. In another direction, based on important work of Tsirelson [45], the Grothendieck problem has found applications to quantum information theory [7]. Since the problem of computing $\mathrm{Opt}_{\mathbb{R}}(\cdot)$ contains the Grothendieck problem as a special case, Theorem 1.1 encompasses all of these applications, albeit with the approximation factor being a larger constant.
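As a small sanity check of this reduction (ours, not from the paper), one can verify numerically for $n=2$ that maximizing $\sum_{i,j}A_{ij}U_{ii}V_{jj}$ over a fine grid of orthogonal matrices gives the same value as the bilinear sign maximum (1.1); the matrix $A$ below is arbitrary illustrative data.

```python
import math
from itertools import product

# Numerical illustration (not from the paper): for the diagonal 4-tensor
# M_iijj = A_ij, the value max_{U,V in O_2} sum_ij A_ij U_ii V_jj is
# attained at diagonal +-1 matrices, so Opt_R of the embedded tensor
# recovers the bilinear sign maximum (1.1).

A = [[1.0, -2.0], [3.0, 0.5]]  # arbitrary illustrative data

def objective(u, v):
    return sum(A[i][j] * u[i] * v[j] for i in range(2) for j in range(2))

# Diagonals of 2x2 orthogonal matrices: (cos t, cos t) for rotations and
# (cos t, -cos t) for reflections; sample both families on a fine grid.
diags = []
for k in range(180):
    c = math.cos(2 * math.pi * k / 180)
    diags.extend([(c, c), (c, -c)])

grid_max = max(objective(u, v) for u in diags for v in diags)
sign_max = max(objective(e, d)
               for e in product((-1, 1), repeat=2)
               for d in product((-1, 1), repeat=2))

# The grid contains the diagonals of +-1 matrices, and by bilinearity no
# orthogonal pair can beat the best sign pattern.
assert abs(grid_max - sign_max) < 1e-9
```

The assertion reflects the convexity argument quoted above: the objective over the grid factors as $c_1c_2$ times a sign-pattern value, so the best sign pattern is never beaten.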

Robust PCA
The input to the classical principal component analysis (PCA) problem consists of integers $K,N,n\in\mathbb{N}$ and a set of points $a_1,\dots,a_N\in\mathbb{R}^n$. The goal is to find a $K$-dimensional subspace maximizing the sum of the squared $\ell_2$ norms of the projections of the $a_i$ on the subspace. Equivalently, the problem is to find the maximizing vectors in
$$\max_{\substack{y_1,\dots,y_K\in\mathbb{R}^n\\ \langle y_i,y_j\rangle=\delta_{ij}}}\ \sum_{i=1}^N\sum_{j=1}^K\langle a_i,y_j\rangle^2, \tag{1.2}$$
where here, and in what follows, $\delta_{ij}$ is the Kronecker delta. This question has a closed-form solution in terms of the singular values of the $N\times n$ matrix whose $i$-th row contains the coefficients of the point $a_i$. The fact that the quantity appearing in (1.2) is the maximum of the sum of the squared norms of the projected points makes it somewhat non-robust to outliers, in the sense that a single long vector can have a large effect on the maximum. Several more robust versions of PCA were suggested in the literature. One variant, known as "R1-PCA," is due to Ding, Zhou, He, and Zha [10], and aims to maximize the sum of the Euclidean norms of the projected points, namely,
$$\max_{\substack{y_1,\dots,y_K\in\mathbb{R}^n\\ \langle y_i,y_j\rangle=\delta_{ij}}}\ \sum_{i=1}^N\Big(\sum_{j=1}^K\langle a_i,y_j\rangle^2\Big)^{1/2}. \tag{1.3}$$
We are not aware of any prior efficient algorithm for this problem that achieves a guaranteed approximation factor. Another robust variant of PCA, known as "L1-PCA," was suggested by Kwak [29], and further studied by McCoy and Tropp [32] (see Section 2.7 in [32] in particular). Here the goal is to maximize the sum of the $\ell_1$ norms of the projected points, namely,
$$\max_{\substack{y_1,\dots,y_K\in\mathbb{R}^n\\ \langle y_i,y_j\rangle=\delta_{ij}}}\ \sum_{i=1}^N\sum_{j=1}^K|\langle a_i,y_j\rangle|. \tag{1.4}$$
In [32] a constant-factor approximation algorithm for the above problem is obtained for $K=1$ based on [2], and for general $K$ an approximation algorithm with an approximation guarantee of $O(\log n)$ is obtained based on prior work by So [43]. In Section 5.1 we show that both of the above robust versions of PCA can be cast as special cases of Theorem 1.1, thus yielding constant-factor approximation algorithms for both problems and all $K\in\{1,\dots,n\}$.
The key observation is that both (1.3) and (1.4) can be expressed as bilinear optimization problems over a pair of orthogonal matrices. For instance, in the case of (1.4) one of the matrix variables will have the $y_i$ as its columns, while the other will be diagonal, with diagonal entries (in an optimal solution) matching the signs of the inner products $\langle a_i,y_j\rangle$.
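The elementary identity behind this observation, $\sum_{i,j}|\langle a_i,y_j\rangle|=\max_{\varepsilon}\sum_{i,j}\varepsilon_{ij}\langle a_i,y_j\rangle$, can be checked numerically; the following sketch (ours, with random illustrative data; the identity itself does not require the $y_j$ to be orthonormal) does so by brute force over the sign variables.

```python
import random
from itertools import product

# Small-scale check (ours) of the observation behind Section 5.1: the
# L1-PCA objective sum_{i,j} |<a_i, y_j>| equals a bilinear maximum over
# sign variables eps_{ij}, which is what turns (1.4) into a bilinear
# optimization over a pair of matrices.

random.seed(0)
N, K, n = 3, 2, 4
a = [[random.gauss(0, 1) for _ in range(n)] for _ in range(N)]
y = [[random.gauss(0, 1) for _ in range(n)] for _ in range(K)]

dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
inner = [[dot(a[i], y[j]) for j in range(K)] for i in range(N)]

abs_value = sum(abs(inner[i][j]) for i in range(N) for j in range(K))
sign_value = max(
    sum(eps[i * K + j] * inner[i][j] for i in range(N) for j in range(K))
    for eps in product((-1, 1), repeat=N * K)
)
assert abs(abs_value - sign_value) < 1e-12
```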

The orthogonal Procrustes problem
Let $n,d\ge 1$ and $K\ge 2$ be integers. Suppose that $S_1,\dots,S_K\subseteq\mathbb{R}^d$ are $n$-point subsets of $\mathbb{R}^d$. The goal of the generalized orthogonal Procrustes problem is to rotate each of the $S_k$ separately so as to best align them. Formally, write $S_k=\{x_1^k,x_2^k,\dots,x_n^k\}$. The goal is to find $K$ orthogonal matrices $U_1,\dots,U_K\in O_d$ that maximize the quantity
$$\sum_{i=1}^n\Big\|\sum_{k=1}^K U_kx_i^k\Big\|_2^2. \tag{1.5}$$
If one focuses on a single summand appearing in (1.5), say $\sum_{k=1}^K U_kx_1^k$, then it is clear that in order to maximize its length one would want to rotate each of the $x_1^k$ so that they would all point in the same direction, i.e., they would all be positive multiples of the same vector. The above problem aims to achieve the best possible such alignment (in aggregate) for multiple summands of this type. We note that by expanding the squares one sees that $U_1,\dots,U_K\in O_d$ maximize the quantity appearing in (1.5) if and only if they minimize the quantity
$$\sum_{i=1}^n\sum_{k,l=1}^K\big\|U_kx_i^k-U_lx_i^l\big\|_2^2.$$
The term "generalized" was used above because the orthogonal Procrustes problem refers to the case $K=2$, which has a closed-form solution. The name "Procrustes" is a (macabre) reference to Greek mythology (see http://en.wikipedia.org/wiki/Procrustes).
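For intuition about the closed-form case $K=2$, the following sketch (ours; a standard linear-algebra fact rather than anything specific to this paper) works out the planar special case with pure rotations, where the optimal angle is explicit.

```python
import math

# Illustration (standard fact, not specific to this paper): for K = 2 in
# the plane with pure rotations, the Procrustes alignment has a closed
# form. The angle maximizing sum_i <R(theta) x_i, y_i> is
#   theta* = atan2(sum_i cross(x_i, y_i), sum_i dot(x_i, y_i)),
# since the objective equals cos(theta) * C + sin(theta) * S.

def best_rotation_angle(xs, ys):
    s = sum(x[0] * y[1] - x[1] * y[0] for x, y in zip(xs, ys))  # cross terms
    c = sum(x[0] * y[0] + x[1] * y[1] for x, y in zip(xs, ys))  # dot terms
    return math.atan2(s, c)

# Recover a known rotation from noiseless data.
theta0 = 0.7
xs = [(1.0, 0.0), (0.3, -2.0), (1.5, 0.4)]
ys = [(math.cos(theta0) * p[0] - math.sin(theta0) * p[1],
       math.sin(theta0) * p[0] + math.cos(theta0) * p[1]) for p in xs]
assert abs(best_rotation_angle(xs, ys) - theta0) < 1e-9
```

In higher dimensions the $K=2$ optimum is obtained analogously from a singular value decomposition of the cross-covariance matrix of the two point sets.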
The generalized orthogonal Procrustes problem has been extensively studied since the 1970s, initially in the psychometric literature (see, e. g., [6,14,3]), and more recent applications of it are to areas such as image and shape analysis, market research, and biometric identification; see the books [15,11], the lecture notes [44], and [33] for much more information on this topic.
The generalized orthogonal Procrustes problem is known to be intractable, and it has been investigated algorithmically in, e.g., [3, 4, 42]. A rigorous analysis of a polynomial-time approximation algorithm for this problem appears in the work of Nemirovski [35], where the generalized orthogonal Procrustes problem is treated as an important special case of a more general family of problems called "quadratic optimization under orthogonality constraints," for which he obtains an $O\big(\sqrt[3]{n+d+\log K}\big)$-approximation algorithm. This was subsequently improved by So [43] to $O(\log(n+d+K))$. In Section 5.2 we use Theorem 1.1 to improve the approximation guarantee for the generalized orthogonal Procrustes problem as defined above to a constant approximation factor. See also Section 5.2 for a more complete discussion of variants of this problem considered in [35, 43] and how they compare to our work.

A Frieze-Kannan decomposition for 4-tensors
In [12] Frieze and Kannan designed an algorithm which decomposes every (appropriately defined) "dense" matrix into a sum of a few "cut matrices" plus an error matrix that has small cut-norm. We refer to [12] and also Section 2.1.2 in the survey [26] for a precise formulation of this statement, as well as its extension, due to [2], to an algorithm that allows sub-constant errors. In Section 5.3 we apply Theorem 1.1 to prove the following result, which can be viewed as a noncommutative variant of the Frieze-Kannan decomposition. For the purpose of the statement below it is convenient to identify the space $M_n(M_n(\mathbb{C}))$ of all 4-tensors with $M_n(\mathbb{C})\otimes M_n(\mathbb{C})$. Also, for $M\in M_n(\mathbb{C})\otimes M_n(\mathbb{C})$ we denote from now on its Frobenius (Hilbert-Schmidt) norm by $\|M\|_2=\big(\sum_{i,j,k,l=1}^n|M_{ijkl}|^2\big)^{1/2}$.

Theorem 1.3. There exists a universal constant $c\in(0,\infty)$ with the following property. Suppose that $M\in M_n(\mathbb{C})\otimes M_n(\mathbb{C})$ and $0<\varepsilon\le 1/2$, and let $T=\lceil c/\varepsilon^2\rceil$. One can compute in time $\mathrm{poly}(n,1/\varepsilon)$ a decomposition
$$M=\sum_{t=1}^T\alpha_t\,A_t\otimes B_t+E \tag{1.6}$$
such that $A_t,B_t\in U_n$, the coefficients $\alpha_t\in\mathbb{C}$ satisfy $|\alpha_t|\le\|M\|_2/n$, and $\mathrm{Opt}_{\mathbb{C}}(E)\le\varepsilon\,\mathrm{Opt}_{\mathbb{C}}(M)$. Moreover, if $M\in M_n(\mathbb{R})\otimes M_n(\mathbb{R})$ then one can replace $\mathrm{Opt}_{\mathbb{C}}(M)$ in (1.6) by $\mathrm{Opt}_{\mathbb{R}}(M)$, take the coefficients $\alpha_t$ to be real, $A_t,B_t\in O_n$, and $E$ such that $\mathrm{Opt}_{\mathbb{R}}(E)\le\varepsilon\,\mathrm{Opt}_{\mathbb{R}}(M)$.

Theorem 1.3 contains as a special case its commutative counterpart, as studied in [12, 2]. Here we are given $A\in M_n(\mathbb{R})$ with $|a_{ij}|\le 1$ for all $i,j\in\{1,\dots,n\}$, and we aim for an error $\varepsilon n^2$. Define $M_{iijj}=a_{ij}$ and $M_{ijkl}=0$ if $i\ne j$ or $k\ne l$. Then $\|M\|_2\le n$. An application of Theorem 1.3 (in the real case), with $\varepsilon$ suitably rescaled, then yields the classical decomposition; moreover, the number of terms is $T=O(1/\varepsilon^2)$. Theorem 1.3 is proved in Section 5.3 via an iterative application of Theorem 1.1, following the "energy decrement" strategy as formulated by Lovász and Szegedy [31] in the context of general weak regularity lemmas.
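The energy-decrement mechanism can be illustrated on a toy instance. The following sketch (ours) makes two simplifying assumptions: $n=2$ with real entries, and the constant-factor oracle of Theorem 1.1 is replaced by a brute-force grid over $O_2\times O_2$.

```python
import math
import random
from itertools import product

# Toy illustration (ours) of the "energy decrement" iteration behind
# Theorem 1.3. Each step finds a well-correlated term A (x) B, subtracts
# the least-squares multiple alpha * (A tensor B), and the Frobenius
# energy of the residual never increases.

random.seed(1)
n = 2
IDX = list(product(range(n), repeat=4))
M = [[[[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]
      for _ in range(n)] for _ in range(n)]

def orth(theta, reflect):
    # 2x2 rotation, optionally composed with a reflection.
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [s, -c]] if reflect else [[c, -s], [s, c]]

def evaluate(T, U, V):
    return sum(T[i][j][k][l] * U[i][j] * V[k][l] for i, j, k, l in IDX)

def frob2(T):
    return sum(T[i][j][k][l] ** 2 for i, j, k, l in IDX)

candidates = [orth(2 * math.pi * k / 72, r) for k in range(72) for r in (False, True)]

E = M
norms = [frob2(E)]
for _ in range(5):
    A, B = max(((U, V) for U in candidates for V in candidates),
               key=lambda p: abs(evaluate(E, *p)))
    alpha = evaluate(E, A, B) / n ** 2  # least-squares coefficient: ||A (x) B||_F^2 = n^2
    E = [[[[E[i][j][k][l] - alpha * A[i][j] * B[k][l] for l in range(n)]
          for k in range(n)] for j in range(n)] for i in range(n)]
    norms.append(frob2(E))

assert all(norms[t + 1] <= norms[t] + 1e-12 for t in range(5))
assert norms[5] < norms[0]
```

Each subtraction decreases the energy by $\mathrm{evaluate}(E,A,B)^2/n^2$, which is the quantitative heart of the Lovász-Szegedy argument.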
Other than being a structural statement of interest in its own right, we show in Section 5.3 that Theorem 1.3 can be used to enhance the constant-factor approximation of Theorem 1.1 to a PTAS for computing $\mathrm{Opt}_{\mathbb{C}}(M)$ when $\mathrm{Opt}_{\mathbb{C}}(M)=\Omega(n\|M\|_2)$. Specifically, if $\mathrm{Opt}_{\mathbb{C}}(M)\ge\kappa n\|M\|_2$ then one can compute a $(1+\varepsilon)$-factor approximation to $\mathrm{Opt}_{\mathbb{C}}(M)$ in time $2^{\mathrm{poly}(1/(\kappa\varepsilon))}\,\mathrm{poly}(n)$. This is reminiscent of the Frieze-Kannan algorithmic framework [12] for dense graph and matrix problems.

Quantum XOR games
As we already noted, the Grothendieck problem (recall Section 1.1.1) also has consequences in quantum information theory [7], and more specifically to bounding the power of entanglement in so-called "XOR games," which are two-player one-round games in which the players each answer with a bit and the referee bases her decision on the XOR of the two bits. As will be explained in detail in Section 1.2 below, the literature on the Grothendieck problem relies on a classical inequality of Grothendieck [16], while our work relies on a more recent yet by now classical noncommutative Grothendieck inequality of Pisier [36] (and its sharp form due to Haagerup [18]). Even more recently, the Grothendieck inequality has been generalized to another setting, that of completely bounded linear maps defined on operator spaces [38,21]. While we do not discuss the operator-space Grothendieck inequality here, we remark that in [40] the operator-space Grothendieck inequality is proved by reducing it to the Pisier-Haagerup noncommutative Grothendieck inequality. Without going into details, we note that this reduction is also algorithmic. Combined with our results, it leads to an algorithmic proof of the operator-space Grothendieck inequality, together with an accompanying rounding procedure.
In [41], the last two named authors show that the noncommutative and operator-space Grothendieck inequalities have consequences in a setting that generalizes that of classical XOR games, called "quantum XOR games": in such games, the questions to the players may be quantum states (and the answers are still a single classical bit). In [41], efficient constant-factor approximation algorithms are derived for the maximum success probability of players in such a game, in three settings: players sharing an arbitrary quantum state, players sharing a maximally entangled state, and players not sharing any entanglement. Using Theorem 1.1, one can efficiently compute strategies achieving the same approximation guarantee. These matters are taken up in [41] and will not be discussed further here.

The noncommutative Grothendieck inequality
The natural semidefinite relaxation of (1.1) is
$$\mathrm{SDP}_{\mathbb{R}}(A)=\max_{x_1,\dots,x_n,y_1,\dots,y_n\in S^{d-1}}\ \sum_{i,j=1}^n A_{ij}\langle x_i,y_j\rangle, \tag{1.8}$$
where $S^{d-1}$ is the unit sphere of $\mathbb{R}^d$. (The dimension $d$ can be taken to be at most the number of vector variables, $d\le 2n$.) Since, being a semidefinite program (SDP), the quantity appearing in (1.8) can be computed in polynomial time with arbitrarily good precision (see [17]), the fact that the Grothendieck optimization problem admits a polynomial-time constant-factor approximation algorithm follows from the following inequality, which is a classical inequality of Grothendieck of major importance to several mathematical disciplines (see Pisier's survey [37] and the references therein for much more on this topic; the formulation of the inequality as below is due to Lindenstrauss and Pełczyński [30]):
$$\mathrm{SDP}_{\mathbb{R}}(A)\le K_G\max_{\varepsilon,\delta\in\{-1,1\}^n}\sum_{i,j=1}^n A_{ij}\varepsilon_i\delta_j. \tag{1.9}$$
Here $K_G\in(0,\infty)$, which is understood to be the infimum over those constants for which (1.9) holds true for all $n\in\mathbb{N}$ and all $A\in M_n(\mathbb{R})$, is a universal constant known as the (real) Grothendieck constant. Its exact value remains unknown, the best available bounds [39, 5] being $1.676<K_G<1.783$. In order to actually find an assignment $\varepsilon,\delta$ to (1.1) that is within a constant factor of the optimum one needs to argue that a proof of (1.9) can be turned into an efficient rounding algorithm; this is done in [2]. If one wishes to mimic the above algorithmic success of the Grothendieck inequality in the context of efficient computation of $\mathrm{Opt}_{\mathbb{R}}(\cdot)$, the following natural strategy presents itself: one should replace real entries of matrices by vectors in $\ell_2$, i.e., consider elements of $M_n(\ell_2)$, and replace the orthogonality constraints underlying the inclusion $U\in O_n$, namely,
$$\sum_{k=1}^n U_{ik}U_{jk}=\sum_{k=1}^n U_{ki}U_{kj}=\delta_{ij}\qquad\text{for all }i,j\in\{1,\dots,n\},$$
by the corresponding constraints using scalar products.
Specifically, given an $n\times n$ vector-valued matrix $X\in M_n(\ell_2)$ define two real matrices $XX^*,X^*X\in M_n(\mathbb{R})$ by
$$(XX^*)_{ij}=\sum_{k=1}^n\langle X_{ik},X_{jk}\rangle\qquad\text{and}\qquad (X^*X)_{ij}=\sum_{k=1}^n\langle X_{ki},X_{kj}\rangle, \tag{1.10}$$
and let the set of $d$-dimensional vector-valued orthogonal matrices be given by
$$O_n(\mathbb{R}^d)=\big\{X\in M_n(\mathbb{R}^d):\ XX^*=X^*X=I\big\}. \tag{1.11}$$
One then considers the following quantity associated to every $M\in M_n(M_n(\mathbb{R}))$:
$$\mathrm{SDP}_{\mathbb{R}}(M)=\max_{X,Y\in O_n(\mathbb{R}^d)}\sum_{i,j,k,l=1}^n M_{ijkl}\langle X_{ij},Y_{kl}\rangle. \tag{1.12}$$
Since the constraints that underlie the inclusion $X,Y\in O_n(\mathbb{R}^d)$ are linear equalities in the pairwise scalar products of the entries of $X$ and $Y$, the quantity $\mathrm{SDP}_{\mathbb{R}}(M)$ is a semidefinite program and can therefore be computed in polynomial time with arbitrarily good precision. One would therefore aim to prove the following noncommutative variant of the Grothendieck inequality (1.9):
$$\mathrm{SDP}_{\mathbb{R}}(M)\le K\,\mathrm{Opt}_{\mathbb{R}}(M)\qquad\text{for some universal constant }K\in(0,\infty). \tag{1.13}$$
The term "noncommutative" refers here to the fact that $\mathrm{Opt}_{\mathbb{R}}(M)$ is an optimization problem over the noncommutative group $O_n$, while the classical Grothendieck inequality addresses an optimization problem over the commutative group $\{-1,1\}^n$. In the same vein, noncommutativity is manifested by the fact that the classical Grothendieck inequality corresponds to the special case of "diagonal" 4-tensors $M\in M_n(M_n(\mathbb{R}))$, i.e., those that satisfy $M_{ijkl}=0$ whenever $i\ne j$ or $k\ne l$. Grothendieck conjectured [16] the validity of (1.13) in 1953, a conjecture that remained open until its 1978 affirmative solution by Pisier [36]. A simpler, yet still highly nontrivial proof of the noncommutative Grothendieck inequality (1.13) was obtained by Kaijser [25]. In Section 4 we design a rounding algorithm corresponding to (1.13) based on Kaijser's approach. This settles the case of real 4-tensors of Theorem 1.1, albeit with a worse approximation guarantee than the one claimed in Remark 1.2. The algorithm modeled on Kaijser's proof is interesting in its own right, and seems to be versatile and applicable to other problems, such as possible non-bipartite extensions of the noncommutative Grothendieck inequality in the spirit of [1]; we shall not pursue this direction here.
A better approximation guarantee, and arguably an even more striking rounding algorithm, arises from the work of Haagerup [18] on the complex version of (1.13). In Section 3 we show how the real case of Theorem 1.1 follows formally from our results on its complex counterpart, so from now on we focus our attention on the complex case.

The complex case
In what follows we let $S^{d-1}_{\mathbb{C}}$ denote the unit sphere of $\mathbb{C}^d$ (thus $S^0_{\mathbb{C}}$ can be identified with the unit circle $S^1\subseteq\mathbb{R}^2$). The classical complex Grothendieck inequality [16, 30] asserts that there exists $K\in(0,\infty)$ such that for every $n\in\mathbb{N}$ and every $A\in M_n(\mathbb{C})$,
$$\max_{x_1,\dots,x_n,y_1,\dots,y_n\in S^{d-1}_{\mathbb{C}}}\bigg|\sum_{i,j=1}^n A_{ij}\langle x_i,y_j\rangle\bigg|\le K\max_{\alpha,\beta\in(S^0_{\mathbb{C}})^n}\bigg|\sum_{i,j=1}^n A_{ij}\alpha_i\overline{\beta_j}\bigg|. \tag{1.14}$$
Let $K^{\mathbb{C}}_G$ denote the infimum over those $K\in(0,\infty)$ for which (1.14) holds true. The exact value of $K^{\mathbb{C}}_G$ remains unknown, the best available bounds being $1.338<K^{\mathbb{C}}_G<1.4049$ (the left inequality is due to unpublished work of Davie, and the right one is due to Haagerup [19]).
For $M\in M_n(M_n(\mathbb{C}))$ we define
$$\mathrm{SDP}_{\mathbb{C}}(M)=\max_{X,Y\in U_n(\mathbb{C}^d)}\bigg|\sum_{i,j,k,l=1}^n M_{ijkl}\langle X_{ij},Y_{kl}\rangle\bigg|, \tag{1.15}$$
where analogously to (1.11) we set
$$U_n(\mathbb{C}^d)=\big\{X\in M_n(\mathbb{C}^d):\ XX^*=X^*X=I\big\}.$$
Here for $X\in M_n(\mathbb{C}^d)$ the complex matrices $XX^*,X^*X\in M_n(\mathbb{C})$ are defined exactly as in (1.10), with the scalar product being the complex scalar product. Haagerup proved [18] that
$$\mathrm{SDP}_{\mathbb{C}}(M)\le 2\,\mathrm{Opt}_{\mathbb{C}}(M). \tag{1.16}$$
Our main algorithm is an efficient rounding scheme corresponding to inequality (1.16). The constant 2 in (1.16) is sharp, as shown in [20] (see also [37, Sec. 12]). We note that the noncommutative Grothendieck inequality, as it usually appears in the literature, involves a slightly more relaxed semidefinite program. In order to describe it, we first remark that instead of maximizing over $X,Y\in U_n(\mathbb{C}^d)$ in (1.15) we could equivalently maximize over $X,Y\in M_n(\mathbb{C}^d)$ satisfying $XX^*,X^*X,YY^*,Y^*Y\preceq I$, which is the same as the requirement $\|XX^*\|,\|X^*X\|,\|YY^*\|,\|Y^*Y\|\le 1$, where here and in what follows $\|\cdot\|$ denotes the operator norm of matrices. This fact is made formal in Lemma 2.2 below. By relaxing the constraints to $XX^*+X^*X\preceq 2I$ and $YY^*+Y^*Y\preceq 2I$, we obtain the following quantity, which can be shown to still be a semidefinite program:
$$\|M\|_{\mathrm{nc}}=\max\bigg\{\bigg|\sum_{i,j,k,l=1}^n M_{ijkl}\langle X_{ij},Y_{kl}\rangle\bigg|:\ X,Y\in M_n(\mathbb{C}^d),\ XX^*+X^*X\preceq 2I,\ YY^*+Y^*Y\preceq 2I\bigg\}. \tag{1.17}$$
Haagerup proved [18] that the following stronger inequality holds true for all $n\in\mathbb{N}$ and $M\in M_n(M_n(\mathbb{C}))$:
$$\|M\|_{\mathrm{nc}}\le 2\,\mathrm{Opt}_{\mathbb{C}}(M). \tag{1.18}$$
As our main focus is algorithmic, in the following discussion we will establish a rounding algorithm for the tightest relaxation (1.16). In Section 2.3 we show that the same rounding procedure can be used to obtain an algorithmic analogue of (1.18) as well.
THEORY OF COMPUTING, Volume 10 (11), 2014, pp. 257-295

The rounding algorithm
Our main algorithm is an efficient rounding scheme corresponding to (1.16). In order to describe it, we first introduce the following notation. Let $\varphi:\mathbb{R}\to\mathbb{R}_+$ be given by
$$\varphi(t)=\frac{1}{2\cosh(\pi t/2)}. \tag{1.19}$$
One computes that $\int_{\mathbb{R}}\varphi(t)\,dt=1$, so $\varphi$ is a density of a probability measure $\mu$ on $\mathbb{R}$, known as the hyperbolic secant distribution. By [23, Sec. 23.11] we have
$$\int_{\mathbb{R}}s^{it}\varphi(t)\,dt=\frac{2s}{1+s^2}\qquad\text{for every }s\in(0,\infty). \tag{1.20}$$
It is possible to efficiently sample from $\mu$ using standard techniques; see, e.g., [8, Ch. IX.7].
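The identity (1.20) can be checked numerically by direct quadrature; the following sketch (ours, not part of the paper's proof) does so for a few values of $s$, with the case $s=1$ also confirming that $\varphi$ integrates to 1.

```python
import math

# Numerical check (ours) of the identity (1.20) for the hyperbolic secant
# density phi(t) = 1 / (2 cosh(pi t / 2)):  for s > 0,
#     integral over R of s^{it} phi(t) dt = 2s / (1 + s^2).
# The integrand decays exponentially, so trapezoidal quadrature on a
# truncated interval is very accurate.

def phi(t):
    return 1.0 / (2.0 * math.cosh(math.pi * t / 2.0))

def secant_transform(s, T=50.0, steps=20000):
    # Real part of int s^{it} phi(t) dt; the imaginary part vanishes by symmetry.
    h = 2 * T / steps
    total = 0.0
    for k in range(steps + 1):
        t = -T + k * h
        w = 0.5 if k in (0, steps) else 1.0
        total += w * math.cos(t * math.log(s)) * phi(t)
    return total * h

for s in (0.25, 1.0, 3.0):
    assert abs(secant_transform(s) - 2 * s / (1 + s * s)) < 1e-6
```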
In what follows, given $X\in M_n(\mathbb{C}^d)$ and $z\in\mathbb{C}^d$ we denote by $\langle X,z\rangle\in M_n(\mathbb{C})$ the matrix whose entries are $\langle X,z\rangle_{jk}=\langle X_{jk},z\rangle$.

Rounding procedure:
1. Let $X,Y\in M_n(\mathbb{C}^d)$ be given as input. Choose $z\in\{1,-1,i,-i\}^d$ uniformly at random, and sample $t\in\mathbb{R}$ according to the hyperbolic secant distribution $\mu$.
2. Compute $X_z=\frac{1}{\sqrt2}\langle X,z\rangle$ and $Y_z=\frac{1}{\sqrt2}\langle Y,z\rangle$, together with the polar decompositions $X_z=U_z|X_z|$ and $Y_z=V_z|Y_z|$.
3. Output the pair of matrices $A=U_z|X_z|^{it}$ and $B=V_z|Y_z|^{-it}$.

Figure 1: The rounding algorithm takes as input a pair of vector-valued matrices $X,Y\in M_n(\mathbb{C}^d)$. It outputs two matrices $A,B\in U_n(\mathbb{C})$.
Theorem 1.4. Fix $n,d\in\mathbb{N}$ and $\varepsilon\in(0,1)$. Suppose that $M\in M_n(M_n(\mathbb{C}))$ and that $X,Y\in U_n(\mathbb{C}^d)$ satisfy
$$|M(X,Y)|\ge(1-\varepsilon)\,\mathrm{SDP}_{\mathbb{C}}(M), \tag{1.21}$$
where $\mathrm{SDP}_{\mathbb{C}}(M)$ is given in (1.15). Then the rounding procedure described in Figure 1 outputs a pair of matrices $A,B\in U_n$ such that
$$\mathbb{E}\big[|M(A,B)|\big]\ge\Big(\frac12-\varepsilon\Big)\mathrm{SDP}_{\mathbb{C}}(M). \tag{1.22}$$
Moreover, rounding can be performed in time polynomial in $n$ and $\log(1/\varepsilon)$, and can be derandomized in time $\mathrm{poly}(n,1/\varepsilon)$.
While the rounding procedure of Figure 1 and the proof of Theorem 1.4 (contained in Section 2 below) appear to be different from Haagerup's original proof of (1.18) in [18], we derived them using Haagerup's ideas. One source of difference arises from changes that we introduced in order to work with the quantity $\mathrm{SDP}_{\mathbb{C}}(M)$, while Haagerup's argument treats the quantity $\|M\|_{\mathrm{nc}}$. A second source of difference is that Haagerup's proof of (1.18) is rather indirect and nonconstructive, whereas it is crucial for the algorithmic applications already mentioned in Section 1.1 that we formulate a polynomial-time rounding procedure. Specifically, Haagerup establishes the dual formulation of (1.18) through a repeated use of duality, and he uses a bootstrapping argument that relies on nonconstructive tools from complex analysis. The third step in Figure 1 originates from Haagerup's complex-analytic considerations. Readers who are accustomed to semidefinite rounding techniques will immediately notice that this step is unusual; we give intuition for it in Section 1.2.3 below, focusing for simplicity on applying the rounding procedure to vectors rather than matrices (i.e., the more familiar setting of the classical Grothendieck inequality).

An intuitive description of the rounding procedure in the commutative case
Consider the effect of the rounding procedure in the commutative case, i.e., when $X,Y\in M_n(\mathbb{C}^d)$ are diagonal matrices. Let the diagonals of $X,Y$ be $x,y\in(\mathbb{C}^d)^n$, respectively. The first step consists in performing a random projection: for $j\in\{1,\dots,n\}$ let $\alpha_j=\langle x_j,z\rangle/\sqrt2\in\mathbb{C}$ and $\beta_j=\langle y_j,z\rangle/\sqrt2\in\mathbb{C}$, where $z$ is chosen uniformly at random from $\{1,-1,i,-i\}^n$ (alternatively, with minor modifications to the proof one may choose i.i.d. $z_j$ uniformly from the unit circle, as was done by Haagerup [18], or use standard complex Gaussians). This step results in sequences of complex numbers whose pairwise products $\alpha_k\overline{\beta_j}$, in expectation, exactly reproduce the pairwise scalar products $\langle x_k,y_j\rangle/2$. However, in general the resulting complex numbers $\alpha_k$ and $\beta_j$ may have modulus larger than 1. Extending the "sign" rounding performed in, say, the Goemans-Williamson algorithm for MAXCUT [13] to the complex domain, one could then round each $\alpha_k$ and $\beta_j$ independently by simply replacing them by their respective complex phase.
The procedure that we consider differs from this standard practice by taking into account potential information contained in the modulus of the random complex numbers $\alpha_k,\beta_j$. Writing in polar coordinates $\alpha_k=r_ke^{i\theta_k}$ and $\beta_j=s_je^{i\phi_j}$, we sample a real $t$ according to a specific distribution (the hyperbolic secant distribution $\mu$), and round each $\alpha_k$ and each $\beta_j$ to
$$a_k=e^{i(\theta_k+t\ln r_k)}\qquad\text{and}\qquad b_j=e^{i(\phi_j-t\ln s_j)},$$
respectively. Observe that this step performs a correlated rounding: the parameter $t$ is the same for all $j,k\in\{1,\dots,n\}$.
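The effect of this correlated rounding can be checked numerically in the commutative setting: by (1.20), averaging $a_k\overline{b_j}$ over $t\sim\mu$ reproduces $\alpha_k\overline{\beta_j}$ up to the factor $2/(1+r_k^2s_j^2)$. The following sketch (ours; the opposite rotation direction for $b$ is the sign convention adopted here) verifies this by integrating against the density of $\mu$.

```python
import cmath
import math

# Illustration (ours) of the correlated rounding in the commutative case:
# with a = e^{i(theta + t ln r)} and b = e^{i(phi - t ln s)}, averaging
# a * conj(b) over t ~ mu reproduces alpha * conj(beta) up to the factor
# 2 / (1 + r^2 s^2), by the identity (1.20).

def sech_density(t):
    return 1.0 / (2.0 * math.cosh(math.pi * t / 2.0))

def expected_product(alpha, beta, T=50.0, steps=100000):
    r, theta = abs(alpha), cmath.phase(alpha)
    s, phi = abs(beta), cmath.phase(beta)
    h = 2 * T / steps
    total = 0.0 + 0.0j
    for k in range(steps + 1):
        t = -T + k * h
        w = 0.5 if k in (0, steps) else 1.0
        a = cmath.exp(1j * (theta + t * math.log(r)))
        b = cmath.exp(1j * (phi - t * math.log(s)))
        total += w * a * b.conjugate() * sech_density(t)
    return total * h

alpha, beta = 0.8 * cmath.exp(0.3j), 1.7 * cmath.exp(-1.1j)
r, s = abs(alpha), abs(beta)
target = 2 * alpha * beta.conjugate() / (1 + (r * s) ** 2)
assert abs(expected_product(alpha, beta) - target) < 1e-6
```

Note that both outputs have modulus exactly 1, so they are valid "signs" for the bilinear objective.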
The proof presented in [18] uses the maximum modulus principle to show the existence of a real t for which a k , b j as defined above provide a good assignment. Intuition for the existence of such a good t can be given as follows. Varying t along the real line corresponds to rotating the phases of the complex numbers α j , β k at a speed proportional to the logarithm of their modulus: elements with very small modulus vary very fast, those with modulus 1 are left unchanged, and elements with relatively large modulus are again varied at (logarithmically) increasing speeds. This means that the rounding procedure takes into account the fact that an element with modulus away from 1 is a "miss": that particular element's phase is probably irrelevant, and should be changed. However, elements with modulus close to 1 are "good": their phase can be kept essentially unchanged.
We identify a specific distribution µ such that a random t distributed according to µ is good, in expectation. This results in a variation on the usual "sign" rounding technique: instead of directly keeping the phases obtained in the initial step of random projection, they are synchronously rotated for a random time t, at speeds depending on the associated moduli, resulting in a provably good pair of sequences a k , b j of complex numbers with modulus 1.
Roadmap: In Section 2 we prove Theorem 1.4 both as stated in Section 1.2.2 and in a form based on (1.18). The real case as well as a closely related Hermitian case are treated next, first in Section 3 as a corollary of Theorem 1.4, and then using an alternative direct rounding procedure in Section 4. Section 5 presents the applications that were outlined in Section 1.1.

Proof of Theorem 1.4
In this section we prove Theorem 1.4. The rounding procedure described in Figure 1 is analyzed in Section 2.1, while the derandomized version is presented in Section 2.2. The efficiency of the procedure is clear; we refer to Section 2.2 for a discussion on how to discretize the choice of $t$. In Section 2.3 we show how the analysis can be modified to the case of $\|M\|_{\mathrm{nc}}$ and (1.18).
In what follows, it will be convenient to use the following notation. Given $M\in M_n(M_n(\mathbb{C}))$ and $X,Y\in M_n(\mathbb{C}^d)$, write $M(X,Y)=\sum_{i,j,k,l=1}^n M_{ijkl}\langle X_{ij},Y_{kl}\rangle$. Let $z\in\{1,-1,i,-i\}^d$ be chosen uniformly at random, and let $X_z,Y_z$ be random variables taking values in $M_n(\mathbb{C})$ defined as in the second step of the rounding procedure (see Figure 1). Then,
$$\mathbb{E}_z\big[M(X_z,Y_z)\big]=\frac12\,M(X,Y),$$
where we used the fact that $\mathbb{E}[z_r\overline{z_s}]=\delta_{rs}$ for every $r,s\in\{1,\dots,d\}$.
Observe that (1.20) implies that for every $s\in(0,\infty)$,
$$s=\frac{1+s^2}{2}\int_{\mathbb{R}}s^{it}\varphi(t)\,dt.$$
Recall that the output of our rounding scheme as described in Figure 1 is the pair $A=U_z|X_z|^{it}$ and $B=V_z|Y_z|^{-it}$, where $X_z=U_z|X_z|$ and $Y_z=V_z|Y_z|$ are the polar decompositions of $X_z$ and $Y_z$, respectively. Applying the above identity to the (positive) eigenvalues of $|X_z|\otimes|Y_z|$, we deduce the matrix equality (2.5). Our goal from now on is to bound the second, "error" term on the right-hand side of (2.5). Specifically, the rest of the proof is devoted to showing that the estimate (2.6) holds for any fixed $t\in\mathbb{R}$. Once established, the estimate (2.6) completes the proof of the desired expectation bound (1.22). So, for the rest of the proof, fix some $t\in\mathbb{R}$. As a first step towards (2.6) we state the following claim.
Now, for every $t\in\mathbb{R}$ define two vector-valued matrices $F(t)$ and $G(t)$. Recalling that $X_z=U_z|X_z|$ is the polar decomposition of $X_z$, and analogously for $Y_z$, and using that each term $X_r^*X_rX_r^*X_r$ and $Y_r^*Y_rY_r^*Y_r$ is positive semidefinite, one deduces that $F(t)$ and $G(t)$ satisfy the norm bounds
$$\|F(t)F(t)^*\|,\ \|F(t)^*F(t)\|,\ \|G(t)G(t)^*\|,\ \|G(t)^*G(t)\|\le 1. \tag{2.14}$$
As shown in Lemma 2.2 below, (2.14) implies that there exists a pair of vector-valued matrices $R,S\in U_n(\mathbb{C}^{d+2n^2})$ with $M(R,S)=M(F(t),G(t))$, completing the proof of (2.6).
Lemma 2.2. Let $X,Y\in M_n(\mathbb{C}^d)$ satisfy $\|XX^*\|,\|X^*X\|,\|YY^*\|,\|Y^*Y\|\le 1$. Then there exist $R,S\in U_n(\mathbb{C}^{d+2n^2})$ such that for every $M\in M_n(M_n(\mathbb{C}))$ we have $M(R,S)=M(X,Y)$. Moreover, $R$ and $S$ can be computed from $X$ and $Y$ in time $\mathrm{poly}(n,d)$.
Proof. Let $A=I-XX^*$ and $B=I-X^*X$, and note that $A,B\succeq 0$ and $\mathrm{Tr}(A)=\mathrm{Tr}(B)$. Write the spectral decompositions of $A$ and $B$, and use them to define $R\in M_n(\mathbb{C}^{d+2n^2})$ by appending to the entries of $X$ two blocks of $n^2$ coordinates each. With this definition we have $RR^*=XX^*+A=I$ and $R^*R=X^*X+B=I$, so $R\in U_n(\mathbb{C}^{d+2n^2})$. Let $S\in U_n(\mathbb{C}^{d+2n^2})$ be defined analogously from $Y$, with the last two blocks of $n^2$ coordinates permuted. One checks that $M(R,S)=M(X,Y)$, as required. Finally, $A$, $B$, their spectral decompositions, and the resulting $R,S$ can all be computed in time $\mathrm{poly}(n,d)$ from $X,Y$.

Derandomized rounding
Note that we can always assume that $X,Y\in U_n(\mathbb{C}^d)$, where $d\le 2n^2$. We start by slightly changing the projection step. Define $X_z$ to be the projection $X_z=\frac{1}{\sqrt2}\langle X,z\rangle$, after we replace all singular values of $X_z$ that are smaller than $\varepsilon$ with $\varepsilon$. Then, writing the analogue of (2.5) for this modified $X_z$, we see that its first term is at least $M(X,Y)-4\varepsilon d\,\mathrm{SDP}_{\mathbb{C}}(M)$. Here we use that $\|X_z\|,\|Y_z\|\le d$, which follows by the triangle inequality and $X,Y\in U_n(\mathbb{C}^d)$. For the second term in (2.5), the previous analysis remains unchanged, provided we prove an analogue of (2.14). Using (2.11) and the analogous equations for $F(t)^*F(t)$, $G(t)G(t)^*$ and $G(t)^*G(t)$, it will suffice to bound four expressions such as $\|F(t)F(t)^*\|$. One checks that the modification to the rounding we made can only increase this by $\varepsilon^4$ (even for each $z$), hence following the previous analysis we get the bound in (2.16) with $\mathrm{SDP}_{\mathbb{C}}(M)/2$ replaced by $(1+\varepsilon^4)\,\mathrm{SDP}_{\mathbb{C}}(M)/2$. Next, we observe that the coordinates of $z$ need not be independent, and it suffices if they are chosen from a four-wise independent distribution. As a result, there are only $\mathrm{poly}(n)$ possible values of $z$ and they can be enumerated efficiently. Therefore, we can assume that we have a value $z\in\{1,-1,i,-i\}^d$ for which the bound (2.17) holds. Notice that for some universal constant $c>0$, with probability at least $1-\varepsilon$, a sample from the hyperbolic secant distribution is at most $c\log(1/\varepsilon)$ in absolute value. Therefore, denoting the restriction of the hyperbolic secant distribution to the interval $[-c\log(1/\varepsilon),c\log(1/\varepsilon)]$ by $\mu'$, and using the fact that the expression inside the expectation in (2.17) is never larger than $\mathrm{SDP}_{\mathbb{C}}(M)$, for $t$ distributed according to $\mu'$ we lose at most an additional $\varepsilon\,\mathrm{SDP}_{\mathbb{C}}(M)$ in (2.17). Moreover, for any $t,t'\in\mathbb{R}$ one has the continuity estimate (2.18); the first absolute value appearing in (2.18) is controlled in terms of $\|X_z\|$, $\||X_z|^{-1}\|$, and $|t-t'|$. We have $\|X_z\|\le d$ as explained above, and $\||X_z|^{-1}\|\le 1/\varepsilon$ by the way our modified rounding procedure was defined. Similar bounds hold for $Y_z$. It therefore suffices to pick $t$ from a grid of size $O\big(\log(1/\varepsilon)\max(1/\varepsilon^2,d/\varepsilon)\big)$.

The rounding procedure in the case of (1.18)

Theorem 1.4 addressed the performance of the rounding procedure described in Figure 1 with respect to inequality (1.16). Here we prove that this rounding procedure has the same performance with respect to the noncommutative Grothendieck inequality (1.18) as well. This is the content of the following theorem.

Theorem 2.3. Fix $n,d\in\mathbb{N}$ and $\varepsilon\in(0,1)$. Suppose that $M\in M_n(M_n(\mathbb{C}))$ and that $X,Y\in M_n(\mathbb{C}^d)$ satisfy

$$XX^*+X^*X\preceq 2I,\qquad YY^*+Y^*Y\preceq 2I,\qquad\text{and}\qquad |M(X,Y)|\ge(1-\varepsilon)\,\|M\|_{\mathrm{nc}}, \tag{2.20}$$
where $\|M\|_{\mathrm{nc}}$ was defined in (1.18). Then the rounding procedure described in Figure 1 outputs a pair of matrices $A,B\in U_n$ such that
$$\mathbb{E}\big[|M(A,B)|\big]\ge\Big(\frac12-\varepsilon\Big)\|M\|_{\mathrm{nc}}. \tag{2.21}$$
Proof. We shall explain how to modify the argument presented in Section 2.1, relying on the notation that was introduced there. All we need to do is to replace (2.16) by the assertion that the corresponding error term is at most $\|M\|_{\mathrm{nc}}/2$ for every fixed $t\in\mathbb{R}$. Starting from (2.7), we obtain an analogous inequality; similarly, from (2.8) we obtain (2.26), and analogously (2.27). Recalling the definition of $\|M\|_{\mathrm{nc}}$, it follows from (2.26) and (2.27) that the required bound holds for every $t\in\mathbb{R}$, completing the proof of the desired expectation bound (2.21).
Remark 2.5. The following example, due to Haagerup [18], shows that the factor-2 approximation guarantee obtained in Theorem 2.3 is optimal: the best constant in (1.18) equals 2. Let M ∈ M_n(M_n(C)) be given by M_{1jk1} = δ_{jk}, and M_{ijkl} = 0 if (i, l) ≠ (1, 1). A direct computation shows that Opt_C(M) = 1. Define X, Y ∈ M_n(C^n) by X_{1j} = Y_{j1} = √(2/(n + 1))·e_j ∈ C^n for j ∈ {1, . . . , n}, with all other entries of X and Y vanishing (here e_1, . . . , e_n is the standard basis of C^n). Using these two vector-valued matrices one shows that M_nc(X, Y) ≥ 2n/(n + 1). Recall that in the introduction we mentioned that it was shown in [20] that the best constant in the weaker inequality (1.16) is also 2, but the example exhibiting this stronger fact is more involved.
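The values attained in Haagerup's example can be checked numerically. For this M one has M(A, B) = ∑_j A_{1j} B_{j1} = (AB)_{11}, so Opt_C(M) = 1 is attained at A = B = I, while the vector-valued assignment attains 2n/(n + 1). The numpy sketch below verifies only these attained values, not feasibility of X, Y for the constraints of (1.18):

```python
import numpy as np

n = 50
# For this M, M(A, B) = sum_j A[0, j] * B[j, 0] = (A @ B)[0, 0], so
# |M(A, B)| <= 1 for unitary A, B, with equality at A = B = I.
A = B = np.eye(n)
assert abs((A @ B)[0, 0]) == 1.0

# Vector-valued assignment: X_{1j} = Y_{j1} = sqrt(2/(n+1)) e_j in C^n,
# so the pairing is sum_j <X_{1j}, Y_{j1}> = n * 2/(n+1).
c = np.sqrt(2.0 / (n + 1))
value = sum(np.dot(c * np.eye(n)[j], c * np.eye(n)[j]) for j in range(n))
assert np.isclose(value, 2 * n / (n + 1))   # approaches 2 as n grows
```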

The real and Hermitian cases
The n × n Hermitian matrices are denoted H_n. A 4-tensor M ∈ M_n(M_n(C)) ≅ M_n(C) ⊗ M_n(C) is said to be Hermitian if M_{ijkl} = M̄_{jilk} for all i, j, k, l ∈ {1, . . . , n}. Investigating the noncommutative Grothendieck inequality in the setting of Hermitian M is most natural in applications to quantum information, while problems in real optimization, as described in the introduction, lead to real M ∈ M_n(M_n(R)). These special cases are treated in this section. Consider the following Hermitian analogue of the quantity Opt_C(M): Note that the convex hull of U_n consists of all the matrices A ∈ M_n(C) with ‖A‖ ≤ 1, so by convexity, for every M ∈ M_n(M_n(C)), the supremum defining Opt_C(M) is unchanged if the unitaries are relaxed to matrices of norm at most 1. The following theorem establishes an algorithmic equivalence between the problems of approximating either of these two quantities. In Section 3.2 we show that for every K > 2√2 there exists an algorithm Alg* as in assertion 1 of Theorem 3.1. Consequently, we obtain the algorithm for computing Opt_R(M) whose existence was claimed in Theorem 1.1. The implication 1 ⟹ 2 of Theorem 3.1 is the only part of Theorem 3.1 that will be used in this article; the reverse direction 2 ⟹ 1 is included here for completeness. Both directions of Theorem 3.1 are proved in Section 3.3.

Two-dimensional rounding
In this section we give an algorithmic version of Krivine's proof [28] that the 2-dimensional real Grothendieck constant satisfies K_G(2) ≤ √2. The following theorem is implicit in the proof of [28, Thm. 1].

(3.3)
where x_j, y_k ∈ L_2(R) are such that ‖x_j‖_2, ‖y_k‖_2 ≤ 1, and ℜ(·) denotes the real part.

For every
5. Return (λ_j)_{j∈{1,...,n}} and (µ_k)_{k∈{1,...,n}}.

By definition, cos(θ_j − φ_k) = ℜ(x̄_j y_k). Using 1 − ε ≤ p ≤ 1, which follows from the bound stated in Theorem 3.2, η_jk equals p times a weighted average of the product of certain values taken by f and g, the former depending only on θ_j and the latter only on φ_k. Equivalently, this weighted average can be written as the inner product of two vectors x_j and y_k of norm at most 1. Finally, all steps of the rounding procedure can be performed in time polynomial in n and 1/ε.
Hermitian rounding procedure 1. Let X,Y ∈ M n (C d ) and ε > 0 be given as input.
2. Let A, B ∈ M_n(C) be the unitary matrices returned by the complex rounding procedure described in Figure 1. If necessary, multiply A by a complex phase to ensure that M(A, B) is real. Write the spectral decompositions of A, B as A = ∑_j e^{iθ_j} u_j u_j* and B = ∑_k e^{iφ_k} v_k v_k*, where θ_j, φ_k ∈ R and u_j, v_k ∈ C^n.
3. Apply the two-dimensional rounding algorithm from Figure 2 to the vectors x_j := e^{iθ_j} and y_k := e^{iφ_k}. Let λ_j, µ_k be the results.

This shows that for the purpose of proving the noncommutative Grothendieck inequality for Hermitian M we may assume without loss of generality that the "component matrices" of X, Y are themselves Hermitian. Nevertheless, even in this case the rounding algorithm described in Figure 1 returns unitary matrices A, B that are not necessarily Hermitian. The following simple argument shows how Krivine's two-dimensional rounding scheme can be applied to the eigenvalues of A, B to obtain Hermitian matrices of norm 1, at the loss of a factor √2 in the approximation. A similar argument, albeit not explicitly algorithmic, also appears in [41, Claim 4.7].
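Steps 2 and 3 above can be sketched as follows (numpy; the deterministic sign-of-cosine rounding is only a stand-in for the randomized two-dimensional rounding of Figure 2, and the helper name is ours). The point is that ∑_j λ_j u_j u_j* is Hermitian with norm at most 1 whenever the λ_j are real numbers in [−1, 1]:

```python
import numpy as np

def hermitian_from_unitary(A, round_phase=lambda th: np.sign(np.cos(th))):
    """Sketch of Steps 2-3: A is unitary with spectral decomposition
    A = sum_j e^{i theta_j} u_j u_j^*.  Each phase theta_j is rounded to a
    real lambda_j in [-1, 1]; `round_phase` (sign of the real part) is a
    simplified stand-in for the two-dimensional rounding of Figure 2."""
    eigvals, U = np.linalg.eig(A)           # eigvals = e^{i theta_j}
    thetas = np.angle(eigvals)
    lam = np.clip(round_phase(thetas), -1, 1)
    return (U * lam) @ U.conj().T           # sum_j lambda_j u_j u_j^*

rng = np.random.default_rng(1)
G = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
Q, _ = np.linalg.qr(G)                      # a random unitary matrix
Ap = hermitian_from_unitary(Q)
assert np.allclose(Ap, Ap.conj().T)         # Hermitian
assert np.linalg.norm(Ap, 2) <= 1 + 1e-9    # operator norm at most 1
```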

Theorem 3.4. Let n be an integer, M ∈ M_n(M_n(C)) Hermitian, ε ∈ (0, 1) and X, Y ∈ U_n(C^d). Then the rounding procedure described in Figure 3 runs in time polynomial in n and 1/ε and outputs a pair of Hermitian matrices A′, B′ ∈ H_n with norm at most 1 such that the corresponding approximation guarantee holds.

Proof. Let A, B ∈ M_n(C) be as defined in Step 2 of Figure 3, where A′, B′ are as returned by the rounding procedure and the expectation is over the random choices made in the two-dimensional rounding step, i.e., Step 3 of Figure 3. Applying Claim 3.3, where in the last inequality, for the first term, we used that M(u_j u_j*, v_k v_k*) is real since M is Hermitian, and for the second term we defined the vector-valued matrices where multiplication of the vector x_j (resp. y_k) with the scalar-valued matrix u_j u_j* (resp. v_k v_k*) is taken entrywise. Here we assumed that the vectors x_j, y_k from Claim 3.3 lie in C^{2n}, which is without loss of generality since there are 2n of them. One checks that

Proof of Theorem 3.1
In this section we prove Theorem 3.1. We first record for future use the following simple lemma, which is an algorithmic version of (3.2).

For any
Formally, for every i, j, k, l ∈ {1, . . . , 2n} we have the stated identity. Then M is Hermitian. Apply the algorithm Alg* (whose existence is the premise of assertion 1 of Theorem 3.1) to M to get A, B ∈ H_{2n} such that Opt*_C(M) ≤ K·|M(A, B)|. Since A, B ∈ H_{2n} we have A_3 = A_2* and B_3 = B_2*. Because max{‖A‖, ‖B‖} ≤ 1, also max{‖ℜA‖, ‖ℜB‖} ≤ 1. By Lemma 3.5 we can therefore efficiently find U, V ∈ O_n attaining the corresponding value. To prove the reverse implication 2 ⟹ 1, define ψ : M_{2n}(R) → H_n as in (3.7). By Lemma 3.6 below we have max{‖ψ(U)‖, ‖ψ(V)‖} ≤ 1. Moreover, using Lemma 3.7 below we obtain an identity in which ℜ(·) and ℑ(·) denote the real and imaginary parts, respectively. Observing what this identity gives for every X ∈ H_n, the rightmost term in (3.8) equals Opt*_C(M). Therefore Opt*_C(M) ≤ K·|M(ψ(U), ψ(V))|, so that the algorithm that outputs ψ(U), ψ(V) has the desired approximation factor. Lemma 3.6. Let ψ : M_{2n}(R) → H_n be given as in (3.7).
Since X is Hermitian, Z is symmetric. For every a, b ∈ R^n the corresponding quadratic forms agree. Since Z is symmetric and X is Hermitian, it follows that ‖Z‖ = ‖X‖.
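Taking Z to be the standard realification Z = [[ℜX, −ℑX], [ℑX, ℜX]] (our reading of the matrix Z defined here, since this is the usual construction), both the symmetry and the norm identity ‖Z‖ = ‖X‖ can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
G = rng.standard_normal((5, 5)) + 1j * rng.standard_normal((5, 5))
X = (G + G.conj().T) / 2                  # a random Hermitian matrix
R, I = X.real, X.imag
Z = np.block([[R, -I], [I, R]])           # real 2n x 2n realification of X

# X Hermitian means R symmetric and I antisymmetric, so Z is symmetric,
# and the realification preserves the operator norm.
assert np.allclose(Z, Z.T)
assert np.isclose(np.linalg.norm(Z, 2), np.linalg.norm(X, 2))
```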

Set
3. Let X(ε) ∈ M n (R) of norm at most 1 be such that

Direct rounding in the real case
We now describe and analyze a different rounding procedure for the real case of the noncommutative Grothendieck inequality. The argument is based on the proof of the noncommutative Grothendieck inequality due to Kaijser [25], which itself has a structure similar to the proof of the classical Grothendieck inequality due to Krivine [27] (see also [22], the "Notes and Remarks" section of Chapter 1 of [9], and the survey article [24]), and uses ideas from [36] to extend that proof to the noncommutative setting.
Theorem 4.1. Given n ∈ N, M ∈ M_n(M_n(R)) and η ∈ (0, 1/2), suppose that X, Y ∈ O_n(R^d) satisfy (4.1), where SDP_R(M) is defined in (1.12). Then the rounding procedure described in Figure 4 runs in time polynomial in n and outputs a pair of real matrices A, B ∈ M_n(R) with norm at most 1 such that (4.2) holds. Note that by Lemma 3.5 we can also efficiently convert the matrices A, B to orthogonal matrices U, V ∈ O_n without changing the approximation guarantee.
The proof of Theorem 4.1 relies on two claims that are used to bound the error that results from the truncation step (step 2) in the rounding procedure in Figure 4. The first claim plays the same role as Claim 2.1 did in the complex case.
Claim 4.2. Fix X ∈ M_n(R^d) and let ε ∈ {−1, 1}^d be chosen uniformly at random. Set X_ε = ⟨ε, X⟩. Proof. Define the symmetric real vector-valued matrix Z by the block construction below. Following the proof of Lemma 1.1 in [36] (which establishes a bound analogous to the one proved in Claim 2.4 for the case of i.i.d. {±1} random variables) and defining as usual Z_r ∈ M_n(R) by (Z_r)_{ij} = (Z_{ij})_r for every r ∈ {1, . . . , d}, we obtain the corresponding matrix bound. Replacing Z_r by its definition and using ABA* ≼ ‖B‖·AA*, which holds for every positive semidefinite B ∈ M_n(R), we arrive at the following matrix inequality: The inequality above implies a separate matrix inequality for each of the two diagonal blocks, proving the claim.
THEORY OF COMPUTING, Volume 10 (11), 2014, pp. 257-295 The second claim appears as Lemma 2.3 in [25] (in the complex case). We include a short proof for the sake of completeness.
Proof. Define Y_τ by "truncating" the singular values of Y at τ, as is done in step 2 of the rounding procedure described in Figure 4, so that ‖Y_τ‖ ≤ τ. Define Y^τ = Y − Y_τ. By definition, Y, Y_τ and Y^τ all have the same singular vectors, and the non-zero singular values of Y^τ are of the form µ − τ, where µ ≥ τ is a singular value of Y. The claim then follows from the bound µ − τ ≤ µ²/(4τ), valid for any µ ≥ τ, applied to each nonzero eigenvalue. Proof of Theorem 4.1. We shall continue using the notation introduced in Figure 4, Claim 4.2 and Claim 4.3. Let X, Y ∈ O_n(R^d) satisfy (4.1). For every ε ∈ {−1, 1}^d and any τ > 0, let Y_ε = (Y_ε)_τ + (Y_ε)^τ be the decomposition promised by Claim 4.3. Combining the bound from Claim 4.2 with the one from Claim 4.3, we see that (4.4) holds, where the final step of (4.4) follows from Y ∈ O_n(R^d), and the same bound holds for E_ε‖((Y_ε)^τ)*(Y_ε)^τ‖. We then have (4.5), where the last inequality in (4.5) follows from the definition of SDP_R(M) and (4.4).
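The split used in Claim 4.3 (cap the singular values at τ, keep the excess separately) can be sketched as follows (numpy; helper name ours). The final loop checks the elementary bound µ − τ ≤ µ²/(4τ) for µ ≥ τ invoked above:

```python
import numpy as np

def truncate_singular_values(Y, tau):
    """Split Y = Y_tau + Y_exc: the singular values of Y_tau are those of Y
    capped at tau, and Y_exc keeps the excess mu - tau on the same
    singular vectors."""
    U, s, Vh = np.linalg.svd(Y)
    Y_tau = U @ np.diag(np.minimum(s, tau)) @ Vh
    Y_exc = U @ np.diag(np.maximum(s - tau, 0)) @ Vh
    return Y_tau, Y_exc

rng = np.random.default_rng(3)
Y = rng.standard_normal((6, 6))
tau = 1.0
Y_tau, Y_exc = truncate_singular_values(Y, tau)
assert np.allclose(Y_tau + Y_exc, Y)
assert np.linalg.norm(Y_tau, 2) <= tau + 1e-12
# Each excess singular value mu - tau is at most mu^2/(4 tau) when mu >= tau,
# since mu^2/(4 tau) - mu + tau = (mu - 2 tau)^2 / (4 tau) >= 0.
s = np.linalg.svd(Y, compute_uv=False)
for mu in s[s >= tau]:
    assert mu - tau <= mu**2 / (4 * tau) + 1e-12
```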
To bound the first term in (4.5), consider the decomposition (4.6). Note that the rightmost term in (4.6) is at most the corresponding quantity, where B(ε) = (1/τ)(Y_ε)_τ and X(ε) is as defined in the rounding procedure in Figure 4. To bound the leftmost term in (4.6), define the matrix W_ε. Moreover, by definition W_ε W_ε* = (X_ε X_ε*)², so using Claim 4.2 we obtain the required bound, and the same bound holds for E_ε‖W_ε* W_ε‖. By the definition of (Y_ε)_τ we also have the analogous estimate, where we used the definition of SDP_R(M). Finally, combining (4.5) and (4.6) with the bounds shown above we obtain (4.7). Setting τ = √3/2 in (4.7) and simplifying leads to the desired bound (4.2). All steps in the rounding procedure can be performed efficiently. The calculation of X(ε) in the third step of Figure 4 can be expressed as a semidefinite program and solved in time polynomial in n. Alternatively, one may directly compute X(ε) as follows. Write the polar decomposition Tr_2(M*(I ⊗ (Y_ε)_τ)) = QP, where the partial trace is taken with respect to the second tensor factor, Q is an orthogonal matrix and P is positive semidefinite. The optimal choice of X(ε) is the real matrix of norm at most 1 maximizing |Tr(X(ε)·QP)| = |Tr(X(ε)·Tr_2(M*(I ⊗ (Y_ε)_τ)))| = |M(X(ε), (Y_ε)_τ)|, and this is achieved by taking X(ε) = Q*.
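The polar-decomposition shortcut for computing X(ε) can be sketched as follows (numpy; helper name ours). For a real matrix C = QP, the maximizer of Tr(XC) over ‖X‖ ≤ 1 is X = Q^T, with optimum Tr(P), the sum of the singular values of C:

```python
import numpy as np

def maximize_trace(C):
    """Return the real X with operator norm <= 1 maximizing Tr(X C).

    Writing the polar decomposition C = Q P via the SVD C = U S V^T
    (so Q = U V^T, P = V S V^T), the maximizer is X = Q^T and the
    optimum equals the sum of the singular values of C."""
    U, s, Vh = np.linalg.svd(C)
    Q = U @ Vh
    return Q.T, s.sum()

rng = np.random.default_rng(4)
C = rng.standard_normal((5, 5))
X_opt, opt = maximize_trace(C)
assert np.isclose(np.trace(X_opt @ C), opt)
# No feasible X does better: compare against random strict contractions.
for _ in range(100):
    W, _ = np.linalg.qr(rng.standard_normal((5, 5)))
    assert np.trace((0.99 * W) @ C) <= opt + 1e-9
```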

Some applications
Before presenting the details of our applications of Theorem 1.1, we observe that the problem of computing Opt R (M) is a rather versatile optimization problem, perhaps more so than what one might initially guess from its definition. The main observation is that by considering matrices M which only act non-trivially on certain diagonal blocks of the two variables U,V that appear in the definition of Opt R (M), these variables can each be thought of as a sequence of multiple matrix variables, possibly of different shapes but all with operator norm at most 1. This allows for some flexibility in adapting the noncommutative Grothendieck optimization problem to concrete settings, and we explain the transformation in detail next.
For every n, m ≥ 1, let M_{m,n}(R) be the vector space of real m × n matrices. Given integers k, ℓ ≥ 1 and sequences of integers (m_i), (n_i) ∈ N^k, (p_j), (q_j) ∈ N^ℓ, we define Bil_R(k, ℓ; (m_i), (n_i), (p_j), (q_j)), or simply Bil_R(k, ℓ) when the remaining sequences are clear from context, as the set of all maps that are linear in both arguments. Concretely, f ∈ Bil_R(k, ℓ) if and only if there exist real coefficients α_{irs,juv} such that the corresponding expansion holds for every pair of arguments. Note that this definition coincides with the definition of Opt_R(f) given in the introduction whenever f ∈ Bil_R(1, 1; n, n, n, n). The proof of the following proposition shows that the new optimization problem still belongs to the framework of the noncommutative Grothendieck problem.
Proposition 5.1. There exists a polynomial-time algorithm that takes as input k, ℓ ∈ N, (m_i), (n_i) ∈ N^k, (p_j), (q_j) ∈ N^ℓ and f ∈ Bil_R(k, ℓ; (m_i), (n_i), (p_j), (q_j)), and outputs an assignment attaining Opt_R(f) up to a multiplicative O(1) factor. Moreover, the implied constant in the O(1) term can be taken to be any number larger than 2√2.
Proof. Let k, ℓ ∈ N, (m_i), (n_i) ∈ N^k, (p_j), (q_j) ∈ N^ℓ be given, and define t := max{m, n, p, q}. We first describe how ⊕_{i=1}^k M_{m_i,n_i}(R) (and ⊕_{j=1}^ℓ M_{p_j,q_j}(R), respectively) can be identified with a subset of M_t(R) consisting of block-diagonal matrices. For any i ∈ {1, . . . , k} and r ∈ {1, . . . , m_i}, s ∈ {1, . . . , n_i}, let F^i_{r,s} ∈ M_t(R) be the matrix that has all entries equal to 0 except the entry in position (r + ∑_{j<i} m_j, s + ∑_{j<i} n_j), which equals 1. Similarly, for any j ∈ {1, . . . , ℓ} and u ∈ {1, . . . , p_j}, v ∈ {1, . . . , q_j}, let G^j_{u,v} ∈ M_t(R) be the matrix that has all entries equal to 0 except the entry in position (u + ∑_{i<j} p_i, v + ∑_{i<j} q_i), which equals 1. Define linear maps Φ and Ψ accordingly. From the definition, one verifies the required identities. Conversely, let U, V ∈ O_t be arbitrary, and U′, V′ their orthogonal projections on Im(Φ) and Im(Ψ), respectively. Then M(U′, V′) = M(U, V). Moreover, if (U_i) and (V_j) are such that Φ((U_i)) = U′ and Ψ((V_j)) = V′, then by (5.2) max_{i∈{1,...,k}} ‖U_i‖ = ‖U′‖ ≤ ‖U‖ = 1, and similarly max_{j∈{1,...,ℓ}} ‖V_j‖ ≤ 1. As in the proof of Lemma 3.5, we may then argue that a suitable orthogonal assignment exists. To conclude the proof of Proposition 5.1 it remains to note that the algorithm of Theorem 1.1, when applied to M, produces in polynomial time U, V ∈ O_t such that Opt_R(M) ≤ O(1)·M(U, V). Arguing as in Lemma 3.5, (U_i) and (V_j) can be computed from U, V in polynomial time and constitute the output of the algorithm.
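The block-diagonal embedding underlying the maps Φ and Ψ can be sketched as follows (numpy; helper name ours). Placing the rectangular blocks on disjoint row and column ranges of a t × t matrix makes the operator norm of the embedding equal to the largest operator norm among the blocks, which is the fact used above:

```python
import numpy as np

def embed_blocks(mats, t):
    """Place rectangular matrices as disjoint diagonal blocks of a t x t
    matrix, as in the map Phi of Proposition 5.1.  The operator norm of
    the embedding equals the largest block norm."""
    M = np.zeros((t, t))
    r = c = 0
    for A in mats:
        m, n = A.shape
        M[r:r + m, c:c + n] = A
        r += m
        c += n
    return M

rng = np.random.default_rng(5)
blocks = [rng.standard_normal((2, 3)), rng.standard_normal((4, 2))]
t = max(sum(A.shape[0] for A in blocks), sum(A.shape[1] for A in blocks))
M = embed_blocks(blocks, t)
assert np.isclose(np.linalg.norm(M, 2),
                  max(np.linalg.norm(A, 2) for A in blocks))
```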

Constant-factor algorithm for robust PCA problems
We start with the R1-PCA problem, as described in (1.3). Let a_1, . . . , a_N ∈ R^n be given, and define f ∈ Bil_R(1, N; (K), (n), (1, . . . , 1), (K, . . . , K)) accordingly. The condition Z_i ∈ O_{1,K} is equivalent to Z_i being a unit vector, while Y ∈ O_{K,n} is equivalent to the K rows of Y being orthonormal. Using that the ℓ_2 norm of a vector u ∈ R^K equals max_{v∈S^{K−1}} ⟨u, v⟩, the quantity appearing in (1.3) equals the supremum of f over these variables, which by definition is Opt_R(f). An approximation algorithm for the R1-PCA problem then follows immediately from Proposition 5.1. The algorithm for L1-PCA follows similarly. Letting g ∈ Bil_R(1, NK; (K), (n), (1, . . . , 1), (1, . . . , 1)) be defined analogously, and using that Z_{ik} ∈ O_{1,1} is equivalent to Z_{ik} ∈ {−1, 1}, the quantity appearing in (1.4) equals sup_{Y∈O_{K,n}, Z_{1,1},...,Z_{N,K}∈O_{1,1}} g(Y, (Z_{1,1}, . . . , Z_{N,K})) = Opt_R(g), which again fits into the framework of Proposition 5.1.
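The identity used for R1-PCA, namely that the best unit vectors z_i in the bilinear form recover ∑_i ‖Y a_i‖_2, can be checked numerically (numpy sketch with random data; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(6)
n, K, N = 8, 3, 20
A = rng.standard_normal((N, n))                      # data points a_i as rows
Y = np.linalg.qr(rng.standard_normal((n, K)))[0].T   # K orthonormal rows

# f(Y, (z_i)) = sum_i <z_i, Y a_i>; choosing the unit vectors
# z_i = Y a_i / ||Y a_i|| recovers the R1-PCA objective sum_i ||Y a_i||_2.
proj = A @ Y.T                                       # rows are Y a_i
z = proj / np.linalg.norm(proj, axis=1, keepdims=True)
f_val = np.sum(z * proj)
assert np.isclose(f_val, np.linalg.norm(proj, axis=1).sum())
```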

A constant-factor algorithm for the orthogonal Procrustes problem
The generalized orthogonal Procrustes problem was described in Section 1.1.3. Continuing with the notation introduced there, let A_1, . . . , A_K be d × n real matrices such that the i-th column of A_k is the vector x_i^k ∈ R^d. Our goal is then to efficiently approximate the quantity in (5.3). It is clear that (5.4) is at least as large as (5.3). For the other direction, by Cauchy-Schwarz either U_1, . . . , U_K or V_1, . . . , V_K achieve a value in (5.3) that is at least as large as (5.4). Finally, from an assignment to (5.4) we can efficiently extract an assignment to (5.3) achieving at least as high a value by choosing whichever of the (U_k) or the (V_l) leads to the higher value.
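The Cauchy-Schwarz step can be illustrated numerically, assuming (as suggested by the surrounding discussion) that (5.3) maximizes ∑_{k,l} ⟨U_k A_k, U_l A_l⟩ = ‖∑_k U_k A_k‖² and (5.4) is its bilinear relaxation with independent U's and V's. The better of the two candidate assignments always matches the bilinear value:

```python
import numpy as np

rng = np.random.default_rng(7)
d, n, K = 4, 6, 5
As = [rng.standard_normal((d, n)) for _ in range(K)]
Us = [np.linalg.qr(rng.standard_normal((d, d)))[0] for _ in range(K)]
Vs = [np.linalg.qr(rng.standard_normal((d, d)))[0] for _ in range(K)]

S_U = sum(U @ A for U, A in zip(Us, As))   # sum_k U_k A_k
S_V = sum(V @ A for V, A in zip(Vs, As))   # sum_l V_l A_l
bilinear = np.sum(S_U * S_V)               # relaxed value: <S_U, S_V>
# By Cauchy-Schwarz, <S_U, S_V> <= max(||S_U||^2, ||S_V||^2), so the
# better of the two symmetric assignments does at least as well.
quadratic = max(np.sum(S_U * S_U), np.sum(S_V * S_V))
assert quadratic >= bilinear - 1e-9
```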
Consequently, the quantity appearing in (5.5) differs from the quantity defining P(A_1, . . . , A_K) by an additive term that does not depend on V_1, . . . , V_K. While the optimization problems appearing in (5.3), (5.5) and (5.6) have the same exact solutions, this is no longer the case when it comes to approximation algorithms. To the best of our knowledge the polynomial-time approximability of minimizing the quantity appearing in (5.6) has not been investigated in the literature. Nemirovski [35] and So [43] studied the polynomial-time approximability of maximizing the quantity appearing in (5.5): the best known algorithm for this problem [43] has approximation factor O(log(n + d + K)). This immediately translates into the same approximation factor for computing P(A_1, . . . , A_K), which was the previously best known algorithm for this problem. Our constant-factor approximation algorithm for P(A_1, . . . , A_K) yields an approximation for the maximization of the quantity appearing in (5.5) with a better approximation factor than that of [43] unless the additive difference ∑_{k=1}^K ‖A_k‖_2² is too large. Precisely, our algorithm becomes applicable in this context as long as this term is at most a factor (1 − 1/C) smaller than P(A_1, . . . , A_K), where C is the approximation constant from Proposition 5.1. This will be the case for typical applications in which one may think of each A_k as obtained from a common A by applying a small perturbation followed by an arbitrary rotation: in that case it is reasonable to expect the optimal solution to satisfy ⟨U_k A_k, U_l A_l⟩ ≈ ‖A‖_2² for every l, k ∈ {1, . . . , K}; see, e.g., [35, Sec. 4.3].

An algorithmic noncommutative dense regularity lemma
Our goal here is to prove Theorem 1.3, but before doing so we note that it leads to a PTAS for computing Opt_C(M) whenever Opt_C(M) ≥ κn‖M‖_2 for some constant κ > 0 (we shall use below the notation introduced in the statement of Theorem 1.3).
The idea for this is exactly as in Section 3 of [12]. The main point is that given such an M and ε ∈ (0, 1), the decomposition in Theorem 1.3 only involves T = O(1/(κε)²) terms, which can be computed in polynomial time. Given such a decomposition, we exhaustively search over a suitably discretized set of values a_t, b_t for Tr(A_t X) and Tr(B_t Y), respectively. For each such choice of values, verifying whether it is achievable using X and Y of operator norm at most 1 can be cast as a semidefinite program. The final approximation to Opt_C(M) is given by the maximum value of |∑_{t=1}^T α_t a_t b_t| among those sequences (a_t), (b_t) that were determined to be feasible.
In slightly more detail, first note that for any X, Y of operator norm at most 1 the values of Tr(A_t X) and Tr(B_t Y) lie in the complex disc of radius n. Given our assumption on Opt_C(M), the bound on α_t stated in Theorem 1.3 implies |α_t| = O(Opt_C(M)/(κn²)). Hence an approximation of each Tr(A_t X) and Tr(B_t Y) up to additive error εκn/T translates into an additive approximation error for M(X, Y) of O(ε·Opt_C(M)). As a result, to obtain a multiplicative (1 ± ε) approximation to Opt_C(M) it suffices to exhaustively enumerate (O((n·T/(εκn))²))^{2T} possible values for the sequences (a_t), (b_t).
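The discretization of the complex disc can be sketched as follows (numpy; illustrative parameters rather than the εκn/T spacing of the argument, and the helper name is ours). A square lattice of spacing δ covers every point well inside the disc to within δ/√2:

```python
import numpy as np

def complex_grid(radius, step):
    """All points of a square lattice of spacing `step` that lie in the
    complex disc of the given radius; used to discretize the candidate
    values a_t for Tr(A_t X)."""
    xs = np.arange(-radius, radius + step / 2, step)
    pts = (xs[None, :] + 1j * xs[:, None]).ravel()
    return pts[np.abs(pts) <= radius]

grid = complex_grid(radius=2.0, step=0.5)
# Any target value with |p| <= radius - step is within step/sqrt(2) of the grid.
rng = np.random.default_rng(8)
for _ in range(50):
    p = rng.uniform(-1, 1) + 1j * rng.uniform(-1, 1)
    p *= 1.5 / max(1.0, abs(p))
    assert np.min(np.abs(grid - p)) <= 0.5 / np.sqrt(2) + 1e-9
```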
Finally, to decide whether a given sequence of values can be achieved, it suffices to decide two independent feasibility problems of the following form: given n × n complex matrices (A_t)_{t=1}^T of norm at most 1 and (a_t)_{t=1}^T ∈ C^T, does there exist X ∈ M_n(C) of norm at most 1 such that max_{t∈{1,...,T}} max{|Re(Tr(A_t X) − a_t)|, |Im(Tr(A_t X) − a_t)|} ≤ εκn/T?
This problem can be cast as a semidefinite program, and feasibility can be decided in time that is polynomial in n and T .
Proof of Theorem 1.3. The argument is iterative. Assume that {A_t, B_t}_{t=1}^{τ−1} ⊆ U_n have already been constructed. Write M_1 := M and define M_τ accordingly. If Opt_C(M_τ) ≤ ε·Opt_C(M) then we may stop the construction. Otherwise, by Theorem 1.1, and multiplying by an appropriate complex phase if necessary, we can find A_τ, B_τ ∈ U_n such that (5.7) holds with its left-hand side real. Set