With big data, one sometimes has to compute correlations involving thousands of buckets of paired observations or time series. For instance a data bucket corresponds to a node in a decision tree, a customer segment, or a subset of observations having the same multivariate feature. Specific contexts of interest include multivariate feature selection (a combinatorial problem) or identification of best predictive set of metrics.
In large data sets, some buckets will contain outliers or meaningless data, and buckets might have different sizes. We need something better than the tools offered by traditional statistics. In particular, we want a correlation metric that satisfies the following
- Independent of sample size to allow comparisons across buckets of different sizes: a correlation of (say) 0.6 must always correspond to (say) the 74-th percentile among all potential paired series (X, Y) of size n, regardless of n
- Same bounds as old-fashioned correlation for back-compatibility. it must be between -1 and +1, with -1 and +1 attained by extreme, singular data sets, and 0 meaning no correlation
- More general than traditional correlation: it measures the degree of monotonicity between two variables X and Y (does X grow when Y grows?) rather than linearity (Y = a + b*X + noise, with a, b chosen to minimize noise). Yet not as general as the distance correlation (equal to zero if and only if X and Y are independent) or my structuredness coefficient .
- Not sensitive to outliers, robust
- More intuitive, more compatible with the way the human brain perceives correlations
Note that R-Squared, a goodness-of-fit measure used to compare model efficiency across multiple models, is typically the square of the correlation coefficient between observations and predicted values, measured on a training set via sound cross-validation techniques. It suffers the same drawbacks, and benefits from the same cures as traditional correlation. So we will focus here on the correlation.
To illustrate the first condition (dependence on n), let's consider the following made-up data set with two paired variables or time series X, Y:
Here n=8 and r (the traditional correlation) is equal to r=0.22. If n=7 (delete the last observation), then r=0. If n=6, r=0.29. Clearly we observe high correlations when n is even, although they slowly decrease to converge to 0, for large values of n. If you shift Y one cell down (assuming both X and Y are infinite time series), then correlations for n even are now negative! However, this (X, Y) process is supposed to simulate a perfect 0 correlation. The traditional correlation coefficient fails to capture this pattern, for small n.
This is a problem with big data, where you might have tons of observations (monthly home prices and square feet for 300 million homes) but you compute correlations on small buckets (for instance, for all specific types of houses sold in 2013 in a particular zip code), to refine home value estimates for houses not on the market, by taking into account local fluctuations. In particular, comparisons between buckets of different sizes become meaningless.
Our starting point to improve the standard correlation will be Spearman's rank correlation. It is the traditional correlation, but measured on ranked variables. All correlations from the new family that I will introduce in the next section, are also based on rank statistics. By working on ranked variables, you satisfy conditions #3 and #4 (though #4 will be further improved with my new correlation). Let's denote Spearman's coefficient as s.
It is easy to prove (see Wikipedia article ) that:
s = 1 - S(X, Y)/q(n)
- S(X, Y) = SUM< |x(j) - y(j)|^2 >,
- x(j), y(j) being the ranks of observation j, in the X and Y series,
- q(n) = n*(n^2-1)/6,
Of course s satisfies condition #2, by construction. A very interesting and important fact, and a source of concerns, is that q(n) = O(n^3). In short, q(n) grows too fast.
1. A new family of rank correlations
Without loss of generality, from now on we can now assume that X is ordered and thus x(j) = j. Our new correlation will be denoted as t. or t(c) to emphasize the fact that it is a family of correlations governed by the parameter c. Bayesian statisticians might view c as a prior.
The new correlation is computed as follows:
- Compute u = SUM< |x(j) - y(j)|^c>
- Compute v = SUM< |x(j) - z(j)|^c> where z(j) = n-1-y(j)
- T = T(c) = min(u, v)
- Sign = +1 if u<v, -1 if u>v, 0 if u=v
- t = t(c) = Sign * < 1-T(c)/q(n) >
- q(n) is the maximum value for T(c) computed across all (X, Y) with n observations
Explanations and discussion
- This new correlation is still symmetric (t will not change if you swap X and Y). Also, if you reverse Y order, then only the sign of the correlation changes (but this was already true for s. prove it as an exercise).
- It always satisfies condition #2 when c>0. Also, when c>0, t =+1 if and only if s =+1, t =-1 if and only if s =-1.
- It is identical to s (Spearman's) when c=2. However c=2 makes t too sensitive to outliers. Also c=2 is an artificial value that makes old-style computations and modeling easy. But it is not built with practical purposes in mind, unlike c=1. In short, it is built based on theoretical and historic considerations, and suffers drawbacks and criticism similar to old-fashioned R Squared.
- I conjecture that q(n) = O(n^
). When c is large, it creates a very long, artificial tail for the distribution of T(c). A small c is preferred, otherwise, very rare, extremely high values of T(c) artificially boost q(n) and create a t that is not well balanced.
- We will focus here on the case c=1
- We compute the rank distances between X and Y (in u) as well as between X and Z, (in v), where Z is Y in reversed order, and then pick up the smallest between u and v. This steps also determines the sign of the correlation.
- q(n) can be pre-computed for each n, I provide look-up tables for small values, approximations are used for large values
Distribution of correlation vector (s. t) , for n=9. It s
hows all n! = 362,880 potential correlation combinations
Extreme case: s = 0.17, t = -0.26 (n=9). Which one makes sense, s or t ? Both are not statistically different from 0
2. Asymptotic behavior and normalization
Again, we assume that t is defined using c=1. Simulations where X = (0, 1. n-1) and Y is a random permutation of n elements (obtained by first creating n random numbers then extracting their rank) show that
- t converges faster than s to 0: in one experiment, the observed t averaged over 30 simulations, for n=10, was 0.02; average s was 0.04
- Average |t | is also smaller than average |s |
- We haven't computed p-values yet, but they will be different, everything else (n, c, test level, null hypothese that correlation is 0) being equal
Finally, when t and s are quite different, usually the t value looks more natural, as in the above chart: t <0 but s >0 because t is better at eliminating the effect of the two outliers (top right corner, bottom left corner). This is critical when you compute correlations across thousands of data buckets of various sizes: you are bound to find outliers in some buckets. And if you look for extreme correlations, your conclusions might be severely biased: read the curse of big data for explanations.
Below is the distribution for T, when n=10. When n=10, 97.2% of t values are between -0.5 and +0.5. When n=7, 92.0% of t values are between -0.5 and +0.5. So the distribution of t depends on n. Theoretically, when c gets very close to 0, what happens?
Histogram for T (n=10)
Interesting fact. t and s agree most of the time. The classic correlation r between s and t. computed on all n! = 362,880 potential (X, Y) vectors with n=9 observations, is equal to 0.96.
To estimate the asymptotic distribution of t (when n becomes large), one needs to compute q(n). All we know so far is that q(n)=O(n^2). Also, when n is a multiple of 3, then q(n)=(n-6)+(n^2)/3. Is there an exact formula if n is not a multiple of 3? We are going to ask this question on Kaggle. If no exact formula is known, a simple solution would consist in using an approximate t. based on q(n)=(n-6)+(n^2)/3 and T truncated to (n-6)+(n^2)/3.
Because the correlation is in the range [-1, +1] and is thus bounded, we would like a correlation of (say) 0.5 to always represent the same percentile of t. regardless of n. There are various ways to accomplish this, but it always involves transforming t. Some research still needs to be done. For r or s. one typically uses the Fisher transformation. But it will not work on t. plus it transforms a bounded metric in a non-bounded one, making interpretation difficult.
Finally, most of this research and these simulations involve the generation of a large number of permutations for small and large n. Here I discuss how to generate these permutations.
One way is to produce all permutations: click here to download the source code. However, it becomes very slow when n is larger than 20, although it is easy to use Map Reduce to split the task on multiple virtual machines, in parallel. Another strategy is to sample enough random permutations to get a good approximation of t 's distribution, as well as q(n). There is a one-to-one mapping between permutations of n elements, and integer numbers between 1 and n! Each of these numbers can be uniquely represented by a n-permutation, and the other way around. See questions 79 and 80 in our 66 job interview questions for data scientists for details. Check the numbering permutations section in the Wikipedia article on permutations, for details. I will write a simple, easier to read version of this algorithms. In short, it is very slow: O(n^2) to transform a number into a permutation or the other way around. So instead, I used a rudimentary approach (click here to download my source code), which is just as fast:
Algorithm to generate a random permutation (p(0), p(1). p(n-1))
For k=0 to n-1, do:
- Step 1. Generate p(k) as a random number on <0. n-1>
- Step 2. Repeat Step 1 until p(k) is different from p(0), p(1). p(k-1)
Question. what is the computational complexity of this algorithm? Answer. read here.
2.3. Final comments
- If you are interested in using correlations to assess how good your predicted values are likely to be, I recommend dividing your data set into two sets, for cross-validation: test and control. Fine-tune your model on the test data (also called training set ) using t to measure its predictive power. But the real predictive power is measured by computing t on the control data. Read this article about a leaving-one-out strategy to accomplish a similar goal.
- Note that t. unlike s. has a bi-modal distribution, with a dip around 0. This makes sense, since a correlation of 0 means no pattern found (a rare event), while different from 0 (but far from -1 or +1) means pattern found - any kind of pattern however slight - a more frequent event. Indeed, we view this bi-modal distribution as an advantage that t has over s .
- t can be used for outlier detection: when |t -s| is large, or |t -r| is large, we have outliers in the data set
- My t is one of the few metrics defined by an algorithm, rather than a formula. This is a new trend in data science - using algorithms to define stuff.
- The fact that s uses square distances, rather than distances (like t ) makes it less robust, as outliers heavily weights on s. While both s and t are based on rank statistics to smooth out the impact of outliers, s is still very sensitive to outliers when n is large. And with big data, we see more cases of large n. This is particularly true for data with small relative variance (defined e.g. as mean / stdv), where rank variance is higher than relative variance in the actual data values.
- You could spend years of research on this whole topic of t and random permutations, and write a book or Ph.D. thesis on the subject. Here I did the minimum amount of research to get to a good understanding and solution to the problem. This way of doing data science research is in contrast with slow-paced, comprehensive academic research.