# What is a linear correlation

The best way to think about this is to imagine a scatterplot of points with $y$ on the vertical axis and $x$ represented by the horizontal axis. Given this framework, you see a cloud of points, which may be vaguely circular, or may be elongated into an ellipse. What you are trying to do in regression is find what might be called the 'line of best fit'. However, while this seems straightforward, we need to figure out what we mean by 'best', and that means we must define what it would be for a line to be good, or for one line to be better than another, etc. Specifically, we must stipulate a loss function. A loss function gives us a way to say how 'bad' something is, and thus, when we minimize that, we make our line as 'good' as possible, or find the 'best' line.

Traditionally, when we conduct a regression analysis, we find estimates of the slope and intercept so as to minimize the sum of squared errors. These are defined as follows:

$$SSE=\sum_{i=1}^N(y_i-(\hat\beta_0+\hat\beta_1 x_i))^2$$

In terms of our scatterplot, this means we are minimizing the sum of the squared vertical distances between the observed data points and the line.
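As a minimal sketch of the loss function above, the following Python snippet computes the SSE for a candidate line and checks that the least-squares estimates give a smaller SSE than nearby lines. The data, the seed, and the helper name `sse` are assumptions made up for illustration; they are not part of the original answer.

```python
import numpy as np

def sse(b0, b1, x, y):
    """Sum of squared vertical distances between the points and the line b0 + b1*x."""
    return float(np.sum((y - (b0 + b1 * x)) ** 2))

# Hypothetical data, generated only for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 0.5 * x + rng.normal(size=200)

# np.polyfit with deg=1 returns the least-squares coefficients as [slope, intercept].
b1_hat, b0_hat = np.polyfit(x, y, deg=1)
best = sse(b0_hat, b1_hat, x, y)

# Nudge each estimate away from the least-squares solution and recompute the loss.
nudged = [
    sse(b0_hat + d0, b1_hat + d1, x, y)
    for d0, d1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]
]
print(best, min(nudged))
```

Every nudged line has a larger SSE than the fitted one, which is all that 'best' means under this loss function.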

On the other hand, it is perfectly reasonable to regress $x$ onto $y$, but in that case, we would put $x$ on the vertical axis, and so on. If we kept our plot as is (with $x$ on the horizontal axis), regressing $x$ onto $y$ (again, using a slightly adapted version of the above equation with $x$ and $y$ switched) means that we would be minimizing the sum of the horizontal distances between the observed data points and the line. This sounds very similar, but is not quite the same thing. (The way to recognize this is to do it both ways, and then algebraically convert one set of parameter estimates into the terms of the other. Comparing the first model with the rearranged version of the second model, it becomes easy to see that they are not the same.)

Note that neither way would produce the same line we would intuitively draw if someone handed us a piece of graph paper with points plotted on it. In that case, we would draw a line straight through the center, but minimizing the vertical distance yields a line that is slightly flatter (i.e., with a shallower slope), whereas minimizing the horizontal distance yields a line that is slightly steeper.
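To make the flatter/steeper contrast concrete, here is a small sketch that fits both regressions and then expresses each line's slope on the original axes ($x$ horizontal, $y$ vertical). The simulated data are hypothetical, and the 'SD line' (slope equal to the ratio of standard deviations) is used only as a rough stand-in for the line one might draw by eye through the center of the cloud; that choice is my assumption, not something stated in the answer.

```python
import numpy as np

# Hypothetical, positively correlated data for illustration.
rng = np.random.default_rng(42)
x = rng.normal(size=500)
y = 1.0 + 0.5 * x + rng.normal(size=500)

# Regress y on x: minimizes squared vertical distances.
slope_y_on_x = np.polyfit(x, y, deg=1)[0]

# Regress x on y: minimizes squared horizontal distances.
# The fitted line is x = a + c*y, so on the original axes its slope is 1/c.
c = np.polyfit(y, x, deg=1)[0]
slope_x_on_y_redrawn = 1.0 / c

# Rough stand-in for the 'through the center' line (an assumption of this sketch).
sd_line_slope = y.std(ddof=1) / x.std(ddof=1)

print(slope_y_on_x, sd_line_slope, slope_x_on_y_redrawn)
```

With positively correlated data like these, the $y$-on-$x$ slope comes out below the SD-line slope and the redrawn $x$-on-$y$ slope comes out above it. How far apart the three are depends on how strong the correlation is.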

A correlation is symmetrical; $x$ is as correlated with $y$ as $y$ is with $x$. The Pearson product-moment correlation can be understood within a regression context, however. The correlation coefficient, $r$, is the slope of the regression line when both variables have been standardized first. That is, you first subtract the mean from each observation, and then divide the differences by the standard deviation. The cloud of data points will now be centered on the origin, and the slope will be the same whether you regress $y$ onto $x$, or $x$ onto $y$ (but note the comment by @DilipSarwate below).
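The standardization claim can be checked numerically. The sketch below uses made-up data and a hypothetical helper named `standardize`; it fits the regression in both directions on the standardized variables and compares each slope with Pearson's $r$ from `np.corrcoef`.

```python
import numpy as np

# Hypothetical data on deliberately different scales.
rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=3.0, size=300)
y = 2.0 + 0.8 * x + rng.normal(scale=4.0, size=300)

def standardize(v):
    """Subtract the mean, then divide by the (sample) standard deviation."""
    return (v - v.mean()) / v.std(ddof=1)

zx, zy = standardize(x), standardize(y)

slope_zy_on_zx = np.polyfit(zx, zy, deg=1)[0]  # regress standardized y on standardized x
slope_zx_on_zy = np.polyfit(zy, zx, deg=1)[0]  # regress standardized x on standardized y
r = np.corrcoef(x, y)[0, 1]                    # Pearson's r on the raw data

print(slope_zy_on_zx, slope_zx_on_zy, r)
```

The two slope coefficients agree with each other and with $r$. Keep in mind that equal coefficients do not make the two fitted lines coincide when drawn on the same axes: the second fit is $z_x = r\,z_y$, which has slope $1/r$ in the $(z_x, z_y)$ plane.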

Now, why does this matter? Using our traditional loss function, we are saying that all of the error is in only one of the variables (viz., $y$). That is, we are saying that $x$ is measured without error and constitutes the set of values we care about, but that $y$ has sampling error. This is very different from saying the converse. This was important in an interesting historical episode: In the late 1970s and early 1980s in the US, the case was made that there was discrimination against women in the workplace, and this was backed up with regression analyses showing that women with equal backgrounds (e.g., qualifications, experience, etc.) were paid, on average, less than men. Critics (or just people who were extra thorough) reasoned that if this were true, women who were paid equally with men would have to be more highly qualified, but when this was checked, it was found that although the results were 'significant' when assessed the one way, they were not 'significant' when checked the other way, which threw everyone involved into a tizzy. See here for a famous paper that tried to clear the issue up.

The formula for the slope of a simple regression line is a consequence of the loss function that has been adopted. If you are using the standard Ordinary Least Squares loss function (noted above), you can derive the formula for the slope that you see in every intro textbook. This formula can be presented in various forms; one of which I call the 'intuitive' formula for the slope. Consider this form for both the situation where you are regressing $y$ on $x$, and where you are regressing $x$ on $y$:

$$\overbrace{\hat\beta_1=\frac{\text{Cov}(x,y)}{\text{Var}(x)}}^{y\text{ on }x}\qquad\overbrace{\hat\beta_1=\frac{\text{Cov}(y,x)}{\text{Var}(y)}}^{x\text{ on }y}$$

Now, I hope it's obvious that these would not be the same unless $\text{Var}(x)$ equals $\text{Var}(y)$. If the variances are equal (e.g., because you standardized the variables first), then so are the standard deviations, and thus the variances would both also equal $\text{SD}(x)\text{SD}(y)$. In this case, $\hat\beta_1$ would equal Pearson's $r$, which is the same either way by virtue of the principle of commutativity:

$$\overbrace{r=\frac{\text{Cov}(x,y)}{\text{SD}(x)\text{SD}(y)}}^{x\text{ with }y}$$
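The covariance-over-variance form of the slope can be sketched directly with NumPy. The data and variable names below are assumptions for illustration; the snippet computes both slopes from the sample covariance and variances, compares them with least-squares fits, and forms $r$ from the covariance and the two standard deviations.

```python
import numpy as np

# Hypothetical data with unequal variances in x and y.
rng = np.random.default_rng(7)
x = rng.normal(scale=2.0, size=400)
y = 3.0 - 1.5 * x + rng.normal(scale=5.0, size=400)

cov_xy = np.cov(x, y, ddof=1)[0, 1]  # sample covariance; Cov(x, y) == Cov(y, x)
var_x = np.var(x, ddof=1)
var_y = np.var(y, ddof=1)

b_y_on_x = cov_xy / var_x  # slope when regressing y on x
b_x_on_y = cov_xy / var_y  # slope when regressing x on y

# Same numerator, but the denominator is SD(x) * SD(y), so swapping x and y changes nothing.
r = cov_xy / (np.sqrt(var_x) * np.sqrt(var_y))

print(b_y_on_x, b_x_on_y, r)
```

The two slopes share a numerator and differ only in the denominator, so they match only when the variances match. The expression for $r$ is unchanged if $x$ and $y$ are swapped.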