# What is a correlation matrix

Nigel Clay. PhD Candidate - Mathematical Science

Before we consider a matrix, let's have a brief chat about what correlation actually is.

There are two main types of correlation. Pearson's product-moment correlation coefficient is the one people most often mean when they use the term correlation coefficient. Statistically this is defined as:

[math]\rho_{X,Y}=\frac{\sum_{i=1}^n\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{\sqrt{\sum_{i=1}^n\left(x_i-\bar{x}\right)^2\sum_{i=1}^n\left(y_i-\bar{y}\right)^2}}[/math]

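This particular correlation statistic is a measure of linear association between two variables [math]X[/math] and [math]Y[/math]. It always lies between -1 and +1, and it reaches those extremes only when the points fall exactly on a straight line.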
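To make the formula concrete, here is a minimal sketch in plain Python. The function name `pearson` and the toy data (`hours`, `score`) are made up for illustration; the function simply computes the sum of products of deviations divided by the square root of the product of the sums of squared deviations.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the sums of squared deviations.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical data: score is exactly twice hours, so the points lie on a line.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 6, 8, 10]
print(pearson(hours, score))  # 1.0
```

Reversing one of the lists would give -1 instead, since the points would then lie on a line with negative slope.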

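In some cases, however, we cannot think of a meaningful way to calculate the means [math]\bar{x}[/math] and [math]\bar{y}[/math], for example when the data are only ordinal. The second main type, Spearman's rank correlation coefficient, covers this case:

[math]\rho_s=1-\frac{6\sum_{i=1}^n\left(x_i-y_i\right)^2}{n\left(n^2-1\right)}[/math]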

In this case [math]x_i[/math] and [math]y_i[/math] are not the values of the variables themselves; they are the ordinal ranks of those values. The interpretation can no longer be made in a linear sense but can be thought of as a directional (monotonic) association.
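As a rough sketch of the rank-based calculation, the snippet below converts each variable to ranks and applies the squared-rank-difference formula. The helper names `ranks` and `spearman` and the data are made up for illustration, and the code assumes there are no tied values, since this shortcut formula is only exact in that case.

```python
def ranks(values):
    """Rank from 1 (smallest) to n. Assumes no tied values."""
    if len(set(values)) != len(values):
        raise ValueError("this simple formula assumes no ties")
    position = {v: i + 1 for i, v in enumerate(sorted(values))}
    return [position[v] for v in values]

def spearman(xs, ys):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rank_x, rank_y = ranks(xs), ranks(ys)
    # d is the difference between the two ranks of each observation.
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical data: ys always increases with xs, but not along a straight line.
xs = [1, 2, 3, 4, 5]
ys = [1, 8, 27, 64, 125]
print(spearman(xs, ys))  # 1.0
```

The ranks agree perfectly here, so the coefficient is 1 even though the relationship is curved, which is the "directional" reading described above.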

Now that we have got this out of the way, we can think about what it means to consider a matrix of correlations. Let's say you have [math]k[/math] different random variables. You can compute the appropriate correlation statistic described above between any two of these. We can arrange these into a grid such that the value of any cell represents the correlation between the variable assigned to the row and the variable assigned to the column. It is usual that the order of variables in the rows is the same as in the columns, so the grid is square ([math]k[/math] rows by [math]k[/math] columns) and the diagonal values represent the correlation of a given variable with itself. This means, of course, that the diagonal values are all 1. The other thing you can say about such a grid is that it is symmetrical about the diagonal. That is because the correlation between variables [math]X[/math] and [math]Y[/math] is the same as the correlation between [math]Y[/math] and [math]X[/math]. Arranging the values in this way is called a correlation matrix. It gives you a complete view of the bivariate correlations that exist in whatever dataset you're looking at.
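The grid described above can be sketched in a few lines of plain Python. The function names (`pearson`, `correlation_matrix`) and the three variables with their measurements are invented purely for illustration; in practice you would more likely reach for a statistics package.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

def correlation_matrix(columns):
    """Return (names, matrix); matrix[i][j] correlates names[i] with names[j]."""
    names = list(columns)
    # Rows and columns use the same variable order, so the grid is k by k.
    matrix = [[pearson(columns[row], columns[col]) for col in names]
              for row in names]
    return names, matrix

# Made-up measurements for k = 3 variables (illustrative only).
data = {
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [52, 58, 69, 77, 90],
    "age_years": [34, 21, 45, 29, 38],
}
names, matrix = correlation_matrix(data)
for name, row in zip(names, matrix):
    print(f"{name:>10}", "  ".join(f"{value:6.3f}" for value in row))
```

Printing the rows shows ones down the diagonal and the same value at positions (row, column) and (column, row), matching the two properties noted above.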

In the social sciences this can be useful in sorting through different factors to determine which, if any, have an association with each other. There are two warnings I would make though.

Firstly, while the Pearson correlation measures linear association, a particular value does not by itself mean that the variables have a meaningful linear relationship. Anscombe's quartet gives some great examples of datasets which have nearly identical summary statistics but look very different when viewed as scatterplots.
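A small sketch of the same point, using two toy datasets of my own rather than Anscombe's actual numbers: in the first, a clearly curved relationship still produces a coefficient close to 1, and in the second, a variable that is completely determined by the other produces a coefficient of zero. The names `pearson`, `xs_a`, `ys_a`, `xs_b` and `ys_b` are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

# Curved but increasing: y = x^2 for x = 1..10.
# The coefficient is close to 1 although a straight line is the wrong shape.
xs_a = list(range(1, 11))
ys_a = [x ** 2 for x in xs_a]
print(round(pearson(xs_a, ys_a), 3))

# Symmetric curve: y = x^2 for x = -2..2.
# y is fully determined by x, yet the linear correlation is zero.
xs_b = [-2, -1, 0, 1, 2]
ys_b = [x ** 2 for x in xs_b]
print(pearson(xs_b, ys_b))
```

Neither number tells you the shape of the data, which is why a scatterplot is worth a look before reading much into any cell of a correlation matrix.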

Secondly, the fact that two variables are correlated does not imply that one caused the other. Causation is a whole other ballgame.