Exercise 8

from Example Sheet 3


Consider the following classical-quantum (cq) states on the Hilbert space \(\mathbb{C}^n\otimes \mathcal{H}\): \[ \rho = \sum_{i=1}^np_i|i\rangle\langle i|\otimes \rho_i, \qquad \qquad \sigma = \sum_{i=1}^np_i|i\rangle\langle i|\otimes \sigma_i \] where \(\rho_i,\sigma_i\in\mathcal{D}(\mathcal{H})\) and \(\{p_i\}\) is a probability distribution. Evaluate the quantum relative entropy \(D(\rho\|\sigma)\) and use the result to prove that \[ D\bigg(\sum_ip_i\rho_i\|\sum_ip_i\sigma_i\bigg)\leq \sum_ip_iD(\rho_i\|\sigma_i), \] i.e. the joint convexity of the quantum relative entropy.

The solution I presented during the example class used the monotonicity of the relative entropy under the partial trace, which was proven in the lectures using the joint convexity of the relative entropy. I’ll leave that proof below, but will first prove joint convexity without using monotonicity under the partial trace. To do so, we’ll use Lieb’s concavity theorem: for any matrix \(X\) and \(t\in(0,1)\), the function \[ f(A,B) := \operatorname{tr}[X^\dagger A^t X B^{1-t}] \] is jointly concave in the positive matrices \(A\) and \(B\).

That is, for any probability distribution \(p_i\) and positive matrices \(A_i\) and \(B_i\), we have \[ f\Big( \sum_i p_i A_i, \sum_i p_i B_i \Big) \geq \sum_i p_i f(A_i, B_i). \]
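Lieb’s theorem is the one nontrivial input here, and while we take it as given, it is easy to sanity-check numerically. Below is a minimal numpy sketch (not part of the proof; the helper names `rand_pos` and `f` are mine) that samples random positive matrices and verifies the joint concavity inequality for one random mixture:

```python
# Sample check of Lieb's concavity theorem: for a fixed X and t in (0,1),
# f(A, B) = tr[X^dag A^t X B^(1-t)] should satisfy
#   f(sum_i p_i A_i, sum_i p_i B_i) >= sum_i p_i f(A_i, B_i).
import numpy as np
from scipy.linalg import fractional_matrix_power as mpow

rng = np.random.default_rng(0)
d, n, t = 4, 3, 0.3

def rand_pos(d):
    """Random positive definite d x d matrix."""
    G = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    return G @ G.conj().T + 1e-6 * np.eye(d)

def f(A, B, X, t):
    return np.trace(X.conj().T @ mpow(A, t) @ X @ mpow(B, 1 - t)).real

X = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
p = rng.dirichlet(np.ones(n))          # a random probability distribution
As = [rand_pos(d) for _ in range(n)]
Bs = [rand_pos(d) for _ in range(n)]

lhs = f(sum(p[i] * As[i] for i in range(n)),
        sum(p[i] * Bs[i] for i in range(n)), X, t)
rhs = sum(p[i] * f(As[i], Bs[i], X, t) for i in range(n))
print(lhs >= rhs - 1e-10)              # expect True
```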

We’ll follow Nielsen and Chuang. We define \[ I_t(A,X) = \operatorname{tr}[ X^\dagger A^t X A^{1-t}] - \operatorname{tr}[ X^\dagger X A]. \] The first term is \(f(A,A)\), which is concave in \(A\) since joint concavity in \((A,B)\) implies concavity along the diagonal \(B=A\), while the second term is linear in \(A\). Therefore, \(I_t(A,X)\) is concave in \(A\).

Now, \(t\mapsto I_t(A,X)\) is a function from \([0,1]\to \mathbb{R}\). In fact, we can write it more simply as a function of \(t\) by writing \(A\) in its eigendecomposition, \(A = \sum_i \lambda_i P_i\), where the \(\lambda_i\) are the eigenvalues of \(A\) and the \(P_i\) the associated eigenprojections. Then \(A^t = \sum_i \lambda_i^t P_i\). So, \[ I_t(A,X) = \operatorname{tr}[ X^\dagger A^t X A^{1-t}] - \operatorname{tr}[ X^\dagger X A] = \sum_{i,j} \lambda_i^t \lambda_j^{1-t} \operatorname{tr}[X^\dagger P_i X P_j] - \operatorname{tr}[X^\dagger X A]. \tag{1} \] Notice that for each \(i\) and \(j\), the quantity \(\operatorname{tr}[X^\dagger P_i X P_j]\) is just a number. So \(I_t\) is a sum of eigenvalues of \(A\) to the powers \(t\) and \(1-t\), weighted by some numbers, minus \(\operatorname{tr}[X^\dagger X A]\), which has no \(t\)-dependence. Thus, \(I_t\) is differentiable at \(t=0\) (recall \(\lambda_i>0\) since \(A\) is positive definite); we can take the derivative using the rule \(\frac{\mathrm{d}}{\mathrm{d}t} x^t = \ln(x) x^t\), which also gives \(\frac{\mathrm{d}}{\mathrm{d}t} x^{1-t} = -\ln(x) x^{1-t}\), to find \[ \frac{\mathrm{d}}{\mathrm{d}t}I_t(A,X) = \sum_{i,j} [\ln(\lambda_i)\lambda_i^t \lambda_j^{1-t} - \lambda_i^t \ln(\lambda_j) \lambda_j^{1-t}] \operatorname{tr}[X^\dagger P_i X P_j] \] and in particular, evaluating the derivative at \(t=0\), \[\begin{aligned} \left.\frac{\mathrm{d}}{\mathrm{d}t}\right|_{t=0} I_t(A,X) &= \sum_{i,j} [\ln(\lambda_i) \lambda_j - \ln(\lambda_j) \lambda_j] \operatorname{tr}[X^\dagger P_i X P_j] \\ &= \sum_{i,j} \operatorname{tr}[X^\dagger \ln(\lambda_i) P_i X \lambda_j P_j] - \sum_{i,j} \operatorname{tr}[X^\dagger P_i X \ln(\lambda_j) \lambda_j P_j]\\ &= \sum_{i} \operatorname{tr}[X^\dagger \ln(\lambda_i) P_i X A] - \sum_{i} \operatorname{tr}[X^\dagger P_i X \ln(A) A]\\ &= \operatorname{tr}[X^\dagger \ln(A) X A] - \operatorname{tr}[X^\dagger X \ln(A) A] =: I(A,X). \end{aligned}\]

In fact, \(I(A,X)\) is concave in \(A\) as well. Using the definition of the derivative, \[ I(p A_1 + (1-p) A_2,X) = \lim_{t\to 0^+} \frac{I_t(p A_1 + (1-p) A_2, X) - I_0(p A_1 + (1-p) A_2, X)}{t}. \] But \(I_0(B,X)=0\) for all \(B\geq 0\), since the two terms in (1) cancel at \(t=0\). And since \(B\mapsto I_t(B,X)\) is concave and \(t>0\), we have \[\begin{aligned} I(p A_1 + (1-p) A_2,X) &\geq \lim_{t\to 0^+} \frac{pI_t( A_1, X) + (1-p) I_t(A_2,X) }{t} \\ &=p \lim_{t\to 0^+} \frac{I_t(A_1,X)}{t} +(1-p) \lim_{t\to 0^+} \frac{I_t(A_2,X)}{t}\\ &= p I(A_1,X) + (1-p) I(A_2,X), \end{aligned}\] using the definition of the derivative again, and that \(I_0(A_1,X) = I_0(A_2,X)=0\).
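As a quick numerical sanity check of the two facts just derived (that the trace formula above is the derivative of \(I_t\) at \(t=0\), and that \(I(\cdot,X)\) is concave), here is a short numpy sketch; the helpers `I_t` and `I` simply mirror the definitions above, and the finite-difference step is only a rough approximation of the limit:

```python
# Compare the closed form I(A, X) = tr[X^dag ln(A) X A] - tr[X^dag X ln(A) A]
# with a finite-difference derivative of I_t at t = 0 (recall I_0 = 0),
# then check concavity of I(., X) on one random mixture.
import numpy as np
from scipy.linalg import logm, fractional_matrix_power as mpow

rng = np.random.default_rng(1)
d = 4

def rand_pos(d):
    """Random positive definite d x d matrix."""
    G = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    return G @ G.conj().T + 1e-6 * np.eye(d)

def I_t(A, X, t):
    return (np.trace(X.conj().T @ mpow(A, t) @ X @ mpow(A, 1 - t))
            - np.trace(X.conj().T @ X @ A)).real

def I(A, X):
    L = logm(A)
    return (np.trace(X.conj().T @ L @ X @ A)
            - np.trace(X.conj().T @ X @ L @ A)).real

A = rand_pos(d)
X = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))

eps = 1e-6
# Since I_0(A, X) = 0, the forward difference I_eps / eps approximates I(A, X).
print(I(A, X), I_t(A, X, eps) / eps)   # should agree up to O(eps) error

A1, A2, p = rand_pos(d), rand_pos(d), 0.3
print(I(p * A1 + (1 - p) * A2, X) >= p * I(A1, X) + (1 - p) * I(A2, X) - 1e-10)
```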
Next, for any quantum states \(\rho\) and \(\sigma\), we choose \(A = \begin{pmatrix} \rho & 0 \\ 0 & \sigma \end{pmatrix}\) and \(X = \begin{pmatrix} 0 & 0 \\ I& 0 \end{pmatrix}\) as block matrices. We calculate \[\begin{aligned} I(A,X) &= \operatorname{tr}[X^\dagger \ln(A) X A] - \operatorname{tr}[X^\dagger X \ln(A) A] \\ &=\operatorname{tr}\Big[\begin{pmatrix}0 & I \\ 0 & 0\end{pmatrix} \begin{pmatrix}\ln(\rho) & 0 \\ 0 & \ln(\sigma)\end{pmatrix} \begin{pmatrix} 0 & 0 \\ I & 0\end{pmatrix} \begin{pmatrix}\rho & 0 \\ 0 & \sigma\end{pmatrix} \Big] - \operatorname{tr}\Big[ \begin{pmatrix}I & 0\\ 0 & 0\end{pmatrix} \begin{pmatrix}\rho \ln(\rho) & 0 \\ 0 & \sigma \ln(\sigma)\end{pmatrix} \Big]\\ &=\operatorname{tr}\Big[\begin{pmatrix}0 & \ln(\sigma) \\ 0 & 0\end{pmatrix} \begin{pmatrix} 0 & 0 \\ \rho & 0\end{pmatrix} \Big] + \ln(2) S(\rho)\\ &=\operatorname{tr}\Big[\begin{pmatrix} \ln(\sigma)\rho & 0 \\ 0 & 0\end{pmatrix} \Big] + \ln(2) S(\rho)\\ &=\ln(2)\operatorname{tr}[\log(\sigma)\rho] + \ln(2) S(\rho)\\ &= - \ln(2) D(\rho\|\sigma), \end{aligned}\] where we used \(\operatorname{tr}[\rho\ln(\rho)] = \ln(2)\operatorname{tr}[\rho\log(\rho)] = -\ln(2)S(\rho)\) (logarithms without a base are base \(2\), whence the factors of \(\ln 2\)) and \(D(\rho\|\sigma) = -S(\rho) - \operatorname{tr}[\rho\log(\sigma)]\). Therefore, defining \(A_i = \begin{pmatrix} \rho_i & 0 \\ 0 & \sigma_i\end{pmatrix}\), so that \(\sum_i p_i A_i = \begin{pmatrix} \sum_i p_i\rho_i & 0 \\ 0 & \sum_i p_i\sigma_i\end{pmatrix}\), we have \[ D\Big(\sum_i p_i \rho_i\|\sum_i p_i\sigma_i\Big) = - \frac{1}{\ln(2)}I\Big(\sum_i p_i A_i, X\Big) \leq - \frac{1}{\ln(2)}\sum_i p_i I(A_i, X) = \sum_i p_i D(\rho_i \|\sigma_i) \] using the concavity of \(A\mapsto I(A,X)\) (the factor \(-1/\ln(2)\) reverses the inequality).
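One can also check the block-matrix identity \(I(A,X) = -\ln(2)D(\rho\|\sigma)\), and the joint convexity it yields, on random states. Again this is a sketch rather than part of the argument; the helpers `rand_state` and `D` are mine, with `D` computing the relative entropy in bits and assuming full-rank states:

```python
# Check I(A, X) = -ln(2) D(rho || sigma) for the block-matrix choice of A, X,
# then the joint convexity inequality itself on random mixtures.
import numpy as np
from scipy.linalg import logm

rng = np.random.default_rng(2)
d, n = 3, 4

def rand_state(d):
    """Random full-rank density matrix."""
    G = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    rho = G @ G.conj().T
    return rho / np.trace(rho).real

def D(rho, sigma):
    """Quantum relative entropy in bits, assuming full-rank states."""
    return np.trace(rho @ (logm(rho) - logm(sigma))).real / np.log(2)

def I(A, X):
    L = logm(A)
    return (np.trace(X.conj().T @ L @ X @ A)
            - np.trace(X.conj().T @ X @ L @ A)).real

rho, sigma = rand_state(d), rand_state(d)
Z = np.zeros((d, d))
A = np.block([[rho, Z], [Z, sigma]])
X = np.block([[Z, Z], [np.eye(d), Z]])
print(I(A, X), -np.log(2) * D(rho, sigma))   # the two values should agree

# Joint convexity, as implied by concavity of A -> I(A, X):
p = rng.dirichlet(np.ones(n))
rhos = [rand_state(d) for _ in range(n)]
sigmas = [rand_state(d) for _ in range(n)]
lhs = D(sum(p[i] * rhos[i] for i in range(n)),
        sum(p[i] * sigmas[i] for i in range(n)))
rhs = sum(p[i] * D(rhos[i], sigmas[i]) for i in range(n))
print(lhs <= rhs + 1e-10)                    # expect True
```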

Proof of joint convexity using monotonicity of the relative entropy:

We notice that \(\rho\) has the block-diagonal form \[ \rho = \begin{pmatrix} p_1 \rho_1 & & & \\ & p_2 \rho_2 & & \\ & & \ddots & \\ & & & p_n \rho_n\end{pmatrix} \implies \log \rho = \begin{pmatrix} \log(p_1\rho_1) & & & \\ & \log(p_2 \rho_2) & & \\ & & \ddots & \\ & & & \log (p_n \rho_n)\end{pmatrix}. \] Since the same form holds for \(\sigma\), we have \[ \log \rho - \log \sigma = \begin{pmatrix} \log(\rho_1) - \log(\sigma_1) & & & \\ & \log(\rho_2) - \log(\sigma_2) & & \\ & & \ddots & \\ & & & \log (\rho_n) -\log(\sigma_n)\end{pmatrix} \] using \(\log(p_i\rho_i) = \log(p_i)I + \log(\rho_i)\) (taking the states to be full rank, say) so that the \(\log(p_i)\) terms cancel. Thus, we may write \(\rho (\log \rho - \log \sigma)\) as \[ \begin{pmatrix} p_1 \rho_1 (\log(\rho_1) - \log(\sigma_1)) & & & \\ & p_2 \rho_2(\log(\rho_2) - \log(\sigma_2)) & & \\ & & \ddots & \\ & & & p_n \rho_n(\log (\rho_n) -\log(\sigma_n))\end{pmatrix}. \] Taking the trace, we find \[ D(\rho\|\sigma) = \operatorname{tr}[\rho (\log \rho - \log \sigma)] = \sum_i p_i D(\rho_i\|\sigma_i). \] On the other hand, the monotonicity of the relative entropy under the partial trace (the data-processing inequality) gives \[ D\left(\sum_i p_i \rho_i \| \sum_i p_i \sigma_i \right) = D( \operatorname{tr}_1 \rho \| \operatorname{tr}_1 \sigma ) \leq D(\rho \|\sigma) = \sum_i p_i D(\rho_i\|\sigma_i), \] where we use \(\operatorname{tr}_1\) to denote the partial trace over the first system (which we haven’t given an explicit label), so that \(\operatorname{tr}_1 \rho = \sum_i p_i \rho_i\) and \(\operatorname{tr}_1 \sigma = \sum_i p_i \sigma_i\).
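As a final sanity check of this second proof, the following sketch builds random cq-states as block-diagonal matrices and verifies both the equality \(D(\rho\|\sigma) = \sum_i p_i D(\rho_i\|\sigma_i)\) and the monotonicity inequality numerically (the helper names are mine, as before):

```python
# Build cq-states as block-diagonal matrices and verify both steps:
# D(rho || sigma) = sum_i p_i D(rho_i || sigma_i), and monotonicity under
# the partial trace tr_1 over the classical register.
import numpy as np
from scipy.linalg import logm, block_diag

rng = np.random.default_rng(3)
d, n = 3, 4

def rand_state(d):
    """Random full-rank density matrix."""
    G = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    rho = G @ G.conj().T
    return rho / np.trace(rho).real

def D(rho, sigma):
    """Quantum relative entropy in bits, assuming full-rank states."""
    return np.trace(rho @ (logm(rho) - logm(sigma))).real / np.log(2)

p = rng.dirichlet(np.ones(n))
rhos = [rand_state(d) for _ in range(n)]
sigmas = [rand_state(d) for _ in range(n)]

# cq-states on C^n (x) C^d: direct sums of the weighted blocks p_i rho_i
rho = block_diag(*[p[i] * rhos[i] for i in range(n)])
sigma = block_diag(*[p[i] * sigmas[i] for i in range(n)])
print(np.isclose(D(rho, sigma),
                 sum(p[i] * D(rhos[i], sigmas[i]) for i in range(n))))

# tr_1 just sums the diagonal blocks, giving the average states
print(D(sum(p[i] * rhos[i] for i in range(n)),
        sum(p[i] * sigmas[i] for i in range(n))) <= D(rho, sigma) + 1e-10)
```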