FAQ# 1141 Last Modified 1-January-2009
Correlation and linear regression are not the same.
What is the goal?
Correlation quantifies the degree to which two variables are related. Correlation does not fit a line through the data points. You are simply computing a correlation coefficient (r) that tells you how much one variable tends to change when the other one does. When r is 0.0, there is no linear relationship. When r is positive, there is a trend that one variable goes up as the other one goes up. When r is negative, there is a trend that one variable goes up as the other one goes down.
Linear regression finds the best line that predicts Y from X.
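To make the two goals concrete, here is a minimal sketch in Python using only the standard library. The function names (pearson_r, fit_line) and the data values are made up for illustration; they are not taken from any particular program.

```python
import math

def pearson_r(x, y):
    """Correlation: one number describing how x and y vary together."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fit_line(x, y):
    """Regression: least-squares slope and intercept that predict y from x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical example data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)               # a single coefficient, no line
slope, intercept = fit_line(x, y) # a line, usable for prediction
print(f"r = {r:.4f}")
print(f"Y = {intercept:.3f} + {slope:.3f} * X")
```

Correlation returns a single coefficient, while regression returns a slope and intercept you can use to predict Y for a new X.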
What kind of data?
Correlation is almost always used when you measure both variables. It rarely is appropriate when one variable is something you experimentally manipulate.
Linear regression is usually used when X is a variable you manipulate (time, concentration, etc.).
Does it matter which variable is X and which is Y?
With correlation, you don't have to think about cause and effect. It doesn't matter which of the two variables you call "X" and which you call "Y". You'll get the same correlation coefficient if you swap the two.
The decision of which variable you call "X" and which you call "Y" matters in regression, as you'll get a different best-fit line if you swap the two. The line that best predicts Y from X is not the same as the line that predicts X from Y (however, both of those lines have the same value for R²).
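You can check the effect of swapping the variables with a short Python sketch. The helper name (summarize) and the data are hypothetical, chosen only to show the behavior.

```python
import math

def summarize(x, y):
    """Return (slope, intercept, r) for the least-squares line predicting y from x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical example data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.5, 4.0, 3.5, 6.0]

slope_yx, intercept_yx, r_xy = summarize(x, y)  # line predicting Y from X
slope_xy, intercept_xy, r_yx = summarize(y, x)  # line predicting X from Y

# To compare the two lines on the same axes, rewrite X = a + b*Y as Y = (X - a) / b.
slope_xy_on_xy_axes = 1.0 / slope_xy

print(f"r (X,Y) = {r_xy:.4f}, r (Y,X) = {r_yx:.4f}")
print(f"slope predicting Y from X:      {slope_yx:.4f}")
print(f"slope of X-from-Y line, as dY/dX: {slope_xy_on_xy_axes:.4f}")
```

The correlation coefficient is identical either way, but the two slopes differ when drawn on the same axes. The two lines coincide only when the points fall exactly on a line (r is +1 or -1).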
The correlation coefficient itself is simply a way to describe how two variables vary together, so it can be computed and interpreted for any two variables. Further inferences, however, require an additional assumption -- that both X and Y are measured, and both are sampled from Gaussian distributions. This is called a bivariate Gaussian distribution. If those assumptions are true, then you can interpret the confidence interval of r and the P value testing the null hypothesis that there really is no correlation between the two variables (and any correlation you observed is a consequence of random sampling).
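As an illustration of what a confidence interval of r looks like, here is a minimal Python sketch based on the Fisher z transformation. This is a common textbook approximation that relies on the bivariate Gaussian assumption; it is not necessarily the method any particular program uses. The function name (fisher_ci) and the input values are hypothetical.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for a correlation coefficient.

    Uses the Fisher z transformation, which assumes the (X, Y) pairs
    are sampled from a bivariate Gaussian distribution.
    """
    if n < 4:
        raise ValueError("need at least 4 pairs for this approximation")
    if not -1.0 < r < 1.0:
        raise ValueError("r must be strictly between -1 and 1")
    z = math.atanh(r)               # transform r to an approximately Gaussian scale
    se = 1.0 / math.sqrt(n - 3)     # approximate standard error of z
    zcrit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - zcrit * se), math.tanh(z + zcrit * se)

# Hypothetical example: r = 0.6 observed in 30 pairs
lower, upper = fisher_ci(0.6, 30)
print(f"95% CI of r: {lower:.3f} to {upper:.3f}")
```

If the interval excludes zero, the corresponding test of the null hypothesis of no correlation would give a small P value. Both interpretations depend on the assumptions above.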
With linear regression, the X values can be measured or can be a variable controlled by the experimenter. The X values are not assumed to be sampled from a Gaussian distribution. The vertical distances of the points from the best-fit line (the residuals) are assumed to follow a Gaussian distribution, with the SD of the scatter not related to the X or Y values.
Relationship between results
Correlation computes the value of the Pearson correlation coefficient, r. Its value ranges from -1 to +1.
Linear regression quantifies goodness of fit with r², sometimes shown in uppercase as R². If you put the same data into correlation (which is rarely appropriate; see above), the square of r from correlation will equal r² from regression.
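This equality can be verified numerically. The Python sketch below computes r from the correlation formula and, separately, computes goodness of fit from the regression residuals as 1 minus the ratio of the residual sum of squares to the total sum of squares. The data values and variable names are made up for illustration.

```python
import math

# Hypothetical example data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.5, 4.0, 3.5, 6.0]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

# Correlation: Pearson r
r = sxy / math.sqrt(sxx * syy)

# Regression: best-fit line, then residuals (vertical distances from the line)
slope = sxy / sxx
intercept = my - slope * mx
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
ss_res = sum(e ** 2 for e in residuals)  # scatter around the line
ss_tot = syy                             # scatter around the mean of Y
r_squared = 1.0 - ss_res / ss_tot

print(f"r from correlation:  {r:.4f}")
print(f"r squared:           {r ** 2:.4f}")
print(f"R² from regression:  {r_squared:.4f}")
```

The last two printed values agree up to floating-point rounding.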