To further add to @Omnomnomnom's answer:
The calculations you posted show that the original problem is equivalent to the case where you have a diagonal matrix $\Delta$ instead of a merely symmetric $\Sigma$. Note that your $B$ again has orthonormal columns. You can now solve this new problem
$$\max \lbrace {\rm tr} ( B^T \Delta B ) \mid B \in \mathbb{R}^{n,r}, \, B^T B={\rm Id}_r \rbrace,$$
which is a bit easier than the original, and the maximum will be the same. Also, to get the desired matrix you can then set $U=VB$.
It turns out that the maximum is the sum of the $r$ largest eigenvalues, attained at $B=(e_1,\ldots,e_r) \in \mathbb{R}^{n,r}$ (the first $r$ standard basis vectors), which then yields $U = VB = (v_1,\ldots,v_r)$, where the $v_i$ are eigenvectors of $\Sigma$ ordered by decreasing eigenvalue.
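If it helps, here is a quick numerical sketch of this claim (my own addition, using NumPy; $n=6$, $r=3$ and a random $\Sigma$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 6, 3

# Build a random symmetric Sigma and its eigendecomposition Sigma = V Delta V^T
A = rng.standard_normal((n, n))
Sigma = (A + A.T) / 2
lam, V = np.linalg.eigh(Sigma)     # eigh returns ascending eigenvalues
lam, V = lam[::-1], V[:, ::-1]     # reorder so lambda_1 >= ... >= lambda_n

B = np.eye(n)[:, :r]               # B = (e_1, ..., e_r)
U = V @ B                          # U = V B = (v_1, ..., v_r)

print(np.trace(U.T @ Sigma @ U))   # objective at the claimed maximizer
print(lam[:r].sum())               # sum of the r largest eigenvalues -- agrees
```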
The author of the proof you are following suggests that the solution of the new maximization problem is obvious; Omnomnomnom and I disagree. You could therefore take the approach shown in his or her answer, or you could try to finish your proof. The latter is actually quite tedious, as I will show below, so I'd suggest using Omnomnomnom's proof or Nick's suggestion for the upper bound.
In order to finish your proof, start by calculating
$${\rm tr} ( B^T \Delta B ) = \sum_{j=1}^r \sum_{i=1}^n b_{i,j}^2 \lambda_i = \sum_{i=1}^n \Big(\sum_{j=1}^r b_{i,j}^2\Big)\lambda_i = \sum_{i=1}^n h_i \lambda_i,$$ where $B=(b_{i,j})$ and $h_i := \sum_{j=1}^r b_{i,j}^2$. You could also have arrived at this point by expressing the columns of $U$ as linear combinations of an orthonormal basis of eigenvectors.
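As a sanity check of this identity, here is a small sketch (my addition; it assumes $\Delta = \operatorname{diag}(\lambda_1,\ldots,\lambda_n)$ and draws a random $B$ with orthonormal columns via a QR decomposition):

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 6, 3
lam = np.sort(rng.standard_normal(n))[::-1]   # eigenvalues in decreasing order
Delta = np.diag(lam)

# Random B with orthonormal columns, B^T B = Id_r
B, _ = np.linalg.qr(rng.standard_normal((n, r)))
h = (B**2).sum(axis=1)                        # h_i = sum_j b_{i,j}^2

print(np.trace(B.T @ Delta @ B))              # tr(B^T Delta B)
print(h @ lam)                                # sum_i h_i lambda_i -- same value
```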
Now we'll show two properties of the coefficients $h_i$. For the first one, extend $B$ with additional columns to an orthogonal square matrix $C=[B \mid R]=(c_{i,j})$ and note that for each $i$
$$0 \le h_i = \sum_{j=1}^r b_{i,j}^2 = \sum_{j=1}^r c_{i,j}^2 \le \sum_{j=1}^n c_{i,j}^2 = 1$$
holds because $CC^T={\rm Id}_n$. Secondly, for the sum of the coefficients we have
$$\sum_{i=1}^n h_i = \sum_{j=1}^r \sum_{i=1}^n b_{i,j}^2 = \sum_{j=1}^r \langle b_j, b_j \rangle = r \cdot 1 = r,$$ where $b_j$ denotes the $j$-th column of $B$.
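Both properties are easy to confirm numerically for a random feasible $B$ (again my own sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 6, 3
B, _ = np.linalg.qr(rng.standard_normal((n, r)))  # B^T B = Id_r
h = (B**2).sum(axis=1)

print(bool(h.min() >= 0 and h.max() <= 1))  # True: 0 <= h_i <= 1
print(bool(np.isclose(h.sum(), r)))         # True: sum_i h_i = r
```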
Now we've arrived at exactly this problem from convex optimization (with $a=(\lambda_1,\ldots,\lambda_n)^T$, $x_{\min}=0$ and $x_{\max}=1$). Instead of solving it with those methods, we can find the maximum by hand. Assuming the eigenvalues are ordered so that $\lambda_1 \ge \cdots \ge \lambda_n$, we bound
$$\sum_{i=1}^n h_i \lambda_i \le \sum_{i=1}^{r-1} h_i \lambda_i + \lambda_r \sum_{i=r}^n h_i = \sum_{i=1}^{r-1} h_i \lambda_i + \lambda_r \Big(r-\sum_{i=1}^{r-1} h_i\Big) = r \lambda_r + \sum_{i=1}^{r-1} h_i (\lambda_i - \lambda_r).$$
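Before continuing, the linear program above can also be handed to an off-the-shelf solver to check the claimed maximum; here is a sketch using scipy.optimize.linprog (my addition, with arbitrary $n$, $r$ and random eigenvalues):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
n, r = 6, 3
lam = np.sort(rng.standard_normal(n))[::-1]   # lambda_1 >= ... >= lambda_n

# maximize lam . h  subject to  0 <= h_i <= 1 and sum_i h_i = r
res = linprog(c=-lam,                         # linprog minimizes, so negate
              A_eq=np.ones((1, n)), b_eq=[r],
              bounds=[(0, 1)] * n)

print(-res.fun)           # optimum of the relaxed problem
print(lam[:r].sum())      # sum of the r largest eigenvalues -- agrees
```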
Now we can establish an upper bound for our maximization problem:
$$\max \lbrace {\rm tr} ( B^T \Delta B ) \mid B \in \mathbb{R}^{n,r}, \, B^T B={\rm Id}_r \rbrace \le \max \lbrace \sum_{i=1}^n h_i \lambda_i \mid 0 \le h_i \le 1, \, \sum_{i=1}^n h_i = r \rbrace \\ \le \max \lbrace r \lambda_r + \sum_{i=1}^{r-1} h_i (\lambda_i -\lambda_r) \mid 0 \le h_i \le 1 \rbrace \le \sum_{i=1}^r \lambda_i.$$
For the last inequality we've set $h_i=1$, which is optimal since each $\lambda_i - \lambda_r$ is nonnegative. Since this bound is attained by $h_1 = \ldots = h_r = 1$ and $h_{r+1} = \ldots = h_n = 0$, which corresponds to $B=(e_1,\ldots,e_r)$, we have found our maximum and with it the desired $B$ and $U$.
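Finally, a randomized sanity check of the whole chain (my own sketch): coefficient vectors $h$ coming from a random feasible $B$ never exceed the intermediate bound, and the bound never exceeds the sum of the $r$ largest eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(4)
n, r = 6, 3
lam = np.sort(rng.standard_normal(n))[::-1]   # lambda_1 >= ... >= lambda_n

for _ in range(1000):
    B, _ = np.linalg.qr(rng.standard_normal((n, r)))
    h = (B**2).sum(axis=1)                    # feasible by the two properties above
    value = h @ lam
    bound = r * lam[r - 1] + h[:r - 1] @ (lam[:r - 1] - lam[r - 1])
    assert value <= bound + 1e-12             # the bound derived above
    assert bound <= lam[:r].sum() + 1e-12     # ... never exceeds the maximum

print("all bounds hold")
```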