# Pearson Correlation Coefficient, Covariance Matrix and Linear Dependency

After reading this post, I think you will find it easy to explain this picture from Wikipedia:

*Figure (from Wikipedia): several sets of (x, y) points, with the correlation coefficient of x and y for each set.*

From Wikipedia,

> In statistics, the Pearson product-moment correlation coefficient (sometimes referred to as the PMCC, and typically denoted by $\rho$) is a measure of the correlation (linear dependence) between two variables X and Y, giving a value between +1 and −1 inclusive.

The definition of Pearson coefficient is

$\rho_{X,Y} = \frac{cov(X,Y)}{\sigma_X\sigma_Y}$

where $cov(X,Y)$ is the covariance between X and Y, and $\sigma_X$ and $\sigma_Y$ are the standard deviations.

For reference, the standard deviation is the square root of the variance, and an estimate of the variance of X from n samples $\{x_i\}$ (with sample mean $\mu_X$) is:

$var(X) \doteq \frac{1}{n} \sum_{i=1}^n (x_i - \mu_X)^2$

The estimate of covariance between X and Y is

$cov(X,Y) \doteq \frac{1}{n} \sum_{i=1}^n (x_i - \mu_X) (y_i - \mu_Y)$

The estimate of Pearson coefficient is

$\rho_{X,Y} \doteq \frac{\sum_{i=1}^n (x_i - \mu_X) (y_i - \mu_Y)}{\sqrt{ \sum_{i=1}^n (x_i-\mu_X)^2} \sqrt{\sum_{i=1}^n (y_i - \mu_Y)^2}}$

The denominator is simply a non-negative normalization factor; only the numerator (the covariance) measures the correlation between X and Y. So let us take a closer look at the covariance.
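As a quick sketch (in Python rather than the post's MATLAB), the estimate above can be computed directly; note that the $1/n$ factors in the variance and covariance estimates cancel between numerator and denominator:

```python
import math

def pearson(xs, ys):
    """Sample estimate of the Pearson correlation coefficient.

    The 1/n factors of the covariance and variance estimates cancel,
    so plain sums of deviation products suffice."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (math.sqrt(sum((x - mx) ** 2 for x in xs)) *
           math.sqrt(sum((y - my) ** 2 for y in ys)))
    return num / den

print(pearson([1, 2, 3, 4], [3, 5, 7, 9]))   # exactly linear data: rho is 1 (up to rounding)
print(pearson([1, 2, 3, 4], [9, 7, 5, 3]))   # exactly anti-linear data: rho is -1 (up to rounding)
```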

Consider a sample of four bivariate points $\{d_i = \langle x_i, y_i \rangle\}_{i=1}^4$ placed symmetrically around their mean, e.g., the corners of a square centered at the origin. (We consider bivariate samples here only because they are easy to show on a 2D screen.)

It is not hard to verify that the covariance estimated from such a sample is 0. Note that if a point (relative to the mean) lies in the first or third quadrant, then $(x_i-\mu_X)(y_i-\mu_Y)$ is positive; otherwise it is negative. So, in general, for a set of points distributed symmetrically around their mean, the positive and negative terms cancel, and we have $cov(X,Y) \rightarrow 0$ and thus $\rho_{X,Y} \rightarrow 0$.

To make $cov(X,Y)$ or $\rho_{X,Y}$ larger than 0, we need more points in the first and third quadrants, for example:

Similarly, to make $cov(X,Y)$ or $\rho_{X,Y}$ less than 0, we need more points in the second and fourth quadrants. However large (or small) $cov(X,Y)$ becomes, the denominator of $\rho_{X,Y}$ normalizes the result to the interval $[-1, 1]$.
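The quadrant argument can be checked numerically; the points below are my own toy examples, not the post's original figures:

```python
def sample_cov(xs, ys):
    """Sample covariance with the 1/n convention used above."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

# Four corners of a square, symmetric around the mean (0, 0):
# the quadrant-1/3 products (+1) cancel the quadrant-2/4 products (-1).
print(sample_cov([1, -1, -1, 1], [1, 1, -1, -1]))    # 0.0

# Points concentrated in quadrants 1 and 3 -> positive covariance.
print(sample_cov([2, 1, -1, -2], [2, 1, -1, -2]))    # 2.5

# Points concentrated in quadrants 2 and 4 -> negative covariance.
print(sample_cov([2, 1, -1, -2], [-2, -1, 1, 2]))    # -2.5
```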

The covariance matrix of X and Y is

$C(X,Y) = \begin{bmatrix} var(X) & cov(X,Y) \\ cov(X,Y) & var(Y) \end{bmatrix}$

which contains all the factors used to compute $\rho_{X,Y}$.
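For instance, the matrix can be estimated from a sample and $\rho_{X,Y}$ recovered from its entries; the data below are made-up values for illustration:

```python
import math

def cov_matrix(xs, ys):
    """2x2 sample covariance matrix [[var(X), cov(X,Y)], [cov(X,Y), var(Y)]]."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return [[vx, cxy], [cxy, vy]]

C = cov_matrix([1.0, 2.0, 3.0, 4.0], [1.5, 3.1, 4.9, 6.5])
rho = C[0][1] / math.sqrt(C[0][0] * C[1][1])  # Pearson coefficient from C
print(rho)  # close to 1: the y values are almost exactly linear in x
```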

As an experiment, let us construct a covariance matrix with the largest possible $cov(X,Y)$ given $var(X)=var(Y)=1$:

$C(X,Y) = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$

then we sample 1000 points from a Gaussian distribution $N(d; [0,0]^T, C(X,Y))$ and plot the result:

You see, all samples lie on the same line.
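This degenerate case is easy to reproduce without a full Gaussian sampler: $C$ here has rank one, a valid factor is $L = [1, 1]^T$ (so $LL^T = C$), and each sample is simply $(z, z)$ for a standard normal $z$ (a sketch, not the post's MATLAB code):

```python
import random

random.seed(0)
# C = [[1, 1], [1, 1]] has rank one: one factorization is L = [1, 1]^T,
# so a sample is L * z = (z, z) -- the second coordinate is a deterministic
# copy of the first, and every point falls on the line y = x.
points = [(z, z) for z in (random.gauss(0.0, 1.0) for _ in range(1000))]
assert all(x == y for x, y in points)
```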

If we relax $cov(X,Y)$ to be 0.9, the result is:

If we use a negative $cov(X,Y)$, say $-0.9$, then we have

$C(X,Y) = \begin{bmatrix} 1 & -0.9 \\ -0.9 & 1 \end{bmatrix}$

and

In all the plots above, the slope of the point cloud is either 1 or −1. How can we get other slope values? The trick is to change the bounding box defined by $var(X)$ and $var(Y)$. For example,

$C(X,Y) = \begin{bmatrix} 2 & \sqrt{2} \\ \sqrt{2} & 1 \end{bmatrix}$
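This matrix is again rank one ($2 \cdot 1 - \sqrt{2}^2 = 0$), so all samples still fall on a single line, but now with slope $\sigma_Y/\sigma_X = 1/\sqrt{2}$ rather than $\pm 1$. A sketch:

```python
import math
import random

random.seed(0)
# C = [[2, sqrt(2)], [sqrt(2), 1]] factors as L = [sqrt(2), 1]^T (L L^T = C),
# so each sample is (sqrt(2) * z, z) for a standard normal z.
pts = [(math.sqrt(2.0) * z, z)
       for z in (random.gauss(0.0, 1.0) for _ in range(1000))]

# Every point satisfies y = x / sqrt(2): the slope is 1/sqrt(2), not +/-1.
assert all(abs(y - x / math.sqrt(2.0)) < 1e-12 for x, y in pts)
```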

Note that a covariance matrix must always be positive semi-definite; for a non-degenerate distribution, $C(X,Y)$ must be positive definite, i.e., $z^T C z > 0$ for any non-zero vector $z$. In the bivariate case, with $z = [a, b]^T$, this constraint is equivalent to

$a^2 var(X) + 2ab\, cov(X,Y) + b^2 var(Y) > 0$

If $cov(X,Y) = \pm\sqrt{var(X)}\sqrt{var(Y)} = \pm\sigma_X\sigma_Y$, the left-hand side becomes the perfect square $(a\sigma_X \pm b\sigma_Y)^2$, which is only $\geq 0$ (it vanishes for $z = [\sigma_Y, \mp\sigma_X]^T$). If we want strict positivity, we need

$|cov(X,Y)| < \sqrt{var(X) var(Y)} = \sigma_X\sigma_Y$

This constrains the Pearson coefficient to the interval $[-1, 1]$. When $\rho_{X,Y}$ is 1 or −1, the covariance matrix is only positive semi-definite, all sample points of X and Y lie on a line, and we say X and Y are linearly dependent.
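For a 2×2 matrix, positive definiteness can be checked with Sylvester's criterion (both leading principal minors positive); the determinant condition $var(X)\,var(Y) - cov(X,Y)^2 > 0$ is exactly $|cov(X,Y)| < \sigma_X\sigma_Y$. A sketch:

```python
def is_positive_definite_2x2(vx, vy, cxy):
    """Sylvester's criterion for [[vx, cxy], [cxy, vy]]:
    positive definite iff both leading principal minors are positive."""
    return vx > 0 and vx * vy - cxy ** 2 > 0

print(is_positive_definite_2x2(1.0, 1.0, 0.9))   # True:  |rho| = 0.9 < 1
print(is_positive_definite_2x2(1.0, 1.0, 1.0))   # False: rho = 1, only semi-definite
print(is_positive_definite_2x2(2.0, 2.0, 2.0))   # False: cov equals sigma_X * sigma_Y
```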

Plots in this post were drawn using the following MATLAB commands:

```matlab
M = sample_gaussian([0,0], [2,sqrt(2);sqrt(2),1], 1000);
cla; scatter(M(:,1), M(:,2)); axis tight
```


where function sample_gaussian can be found in my previous post.
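For readers without MATLAB, here is a hypothetical Python stand-in (the name `sample_gaussian_2d` is mine, not the original function's) based on a Cholesky-style factorization $C = LL^T$:

```python
import math
import random

def sample_gaussian_2d(mean, C, n, seed=None):
    """Draw n samples from a 2D Gaussian N(mean, C).

    Factors C = L L^T with L lower triangular; each sample is
    mean + L z, where z is a pair of independent standard normals."""
    rng = random.Random(seed)
    l11 = math.sqrt(C[0][0])
    l21 = C[1][0] / l11
    l22 = math.sqrt(max(C[1][1] - l21 ** 2, 0.0))  # 0 if C is only semi-definite
    return [(mean[0] + l11 * z1,
             mean[1] + l21 * z1 + l22 * z2)
            for z1, z2 in ((rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
                           for _ in range(n))]

# Same parameters as the MATLAB snippet above:
M = sample_gaussian_2d([0.0, 0.0],
                       [[2.0, math.sqrt(2.0)], [math.sqrt(2.0), 1.0]],
                       1000, seed=0)
```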