Derivation of Spearman's coefficient from Pearson's coefficient

Photo credit to wal_172619

Goals of the article

This article is a letter to a friend taking a statistics course in 2019. He and his instructor were looking for an explanation of the relationship between Spearman's coefficient and Pearson's coefficient, particularly the origin of the coefficient 6.

Letter to friend

It was nice to shoot the shit last night. "Nerd snipe"' is the most appropriate description of what you did that I have ever heard. It is a term which once learned, is impossible to imagine living without. Also I solved your problem mostly. ~~There is a hiccup that is very unsatisfying but still obviously what Spearman originally did. Why he(?) did that or better stated, how he justified that is unclear.~~ It is clear now why he did that and it does not have anything to do with the six sigma concept. ~~Tell your instructor that the 6 comes from a curve fitting estimate as best I can tell.~~ Do not tell her that — that's wrong. Tell her that it is probably more algebra than it is worth (that much is true) and that Spearman is an opaque writer with no sense of style. You can read Spearman's original work on the subject [1] but it does not shed much light on the issue (look around page 80-87, the equation in question first appears on page 87). His explanation is hand waving at best but that is to be expected from something written in 1904. I think trying to write clearly was a fad that started in the 1950's.

Let us begin, throughout the course of this letter, \(X\) is the set of ordered ranks \(\{2,6,8,5,4 \dots x_i\dots x_n\}\) where \(X\) contains the unique values from 1 to \(n\) in unknown order. The same is true for the set \(Y\). Pearson's coefficient is

\begin{equation}\label{eq:pearson} \rho = \frac{\text{cov}(X,Y)}{\sigma_X\sigma_Y} \end{equation}

We are going to show that Spearman's coefficient is a special case of Pearson's coefficient where \(x_i,y_i\) are ordinal data ranks. (\(d_i \equiv x_i-y_i\))

\begin{equation}\label{eq:spearman} \rho = \frac{\text{cov}(X,Y)}{\sigma_X\sigma_Y} = 1 -\frac{ 6\sum_i d_i^2}{n(n^2-1)} \end{equation}

Begin by unsimplify Eq (\ref{eq:pearson}) a lot using the definition of covariance and standard deviation.

\begin{equation} \rho = \frac{\text{cov}(X,Y)}{\sigma_X\sigma_Y} = \frac{ \frac{1}{n}\sum_i (x_i-\bar{x})(y_i-\bar{y}) }{ \left[ \frac{1}{n-1}\sum_i (x_i-\bar{x})^2 \right]^{0.5}\left[ \frac{1}{n-1}\sum_i (y_i-\bar{y})^2\right]^{0.5}} \end{equation}

Next take note that the set of \(X\) and the set of \(Y\) are equivalent so the two factors in the denominator are the same. It is just \(\sqrt{a}\sqrt{a}\) so we can write that as \(a\) on the bottom.

\begin{equation} \rho = \frac{ \frac{1}{n}\sum_i (x_i-\bar{x})(y_i-\bar{y} ) }{ \frac{1}{n-1}\sum_i (x_i-\bar{x})^2} \end{equation}

The average of all integers from 1 to \(n\) is approximately \(n/2\) so any \(\bar{x}\) or \(\bar{y}\) is just \(n/2\).

\begin{equation} \frac{\bar{x}}{n/2} = \frac{ 0.5+ n/2}{n/2} \approx 1 \;\; \text{as} \;\; n \rightarrow \infty \end{equation}

\begin{equation}\label{eq:num2} \rho = \frac{ \frac{1}{n}\sum_i (x_i-n/2)(y_i-n/2 ) }{ \frac{1}{n-1}\sum_i (x_i-n/2)^2} \end{equation}

Things are going to get a little hairy here. Take the numerator and explode it out and simplify until we have \(\sum_i -\left(x_i -y_i\right)^2\) which is the difference squared term we are seeking.

\begin{align} \text{numerator} &= \frac{1}{n}\sum_i x_iy_i - x_i\frac{n}{2} -y_i\frac{n}{2} + \frac{n^2}{4} \\ &= \frac{1}{2n}\sum_i 2x_iy_i - x_in -y_in + \frac{n^2}{2} \\ &= \frac{1}{2n}\sum_i 2x_iy_i - x_i^2 -x_i(n-x_i) -y_i^2 - y_i(n-y_i) + \frac{n^2}{2} \\ &= \frac{1}{2n}\left[ \sum_i \left( 2x_iy_i - x_i^2 -y_i^2\right)+ \sum_i\left(-x_i(n-x_i) - y_i(n-y_i) + \frac{n^2}{2} \right)\right]\\ &= \frac{1}{2n}\left[ \sum_i -\left(x_i -y_i\right)^2+ \sum_i\left(-x_i(n-x_i) - y_i(n-y_i) + \frac{n^2}{2} \right)\right]\\ &= \frac{1}{2n}\left[ \sum_i -d_i^2+ \sum_i\left(-x_i(n-x_i) - y_i(n-y_i) + \frac{n^2}{2} \right)\right] \end{align}

The term on the left is fine but the term on the right is bit of a mess. In these next steps keep in mind that any term in a summation that is a function of only \(x_i\) or \(y_i\) is interchangeable because the sets \(X\) and \(Y\) are equivalent. However if both variables appear in one term they are not interchangeable since the order of the set then becomes important.

\begin{align} \text{numerator} &= \frac{1}{2n}\left[ \sum_i -d_i^2+ \sum_i\left(-x_i(n-x_i) - y_i(n-y_i) + \frac{n^2}{2} \right)\right]\\ &= \frac{1}{2n}\left[ \sum_i -d_i^2\right] + \frac{1}{2n}\left[ \sum_i\left(-2x_i(n-x_i) + \frac{n^2}{2} \right)\right]\\ &= \frac{1}{2n}\left[ \sum_i -d_i^2\right] + \frac{1}{n}\left[ \sum_i\left(-x_i(n-x_i) + \frac{n^2}{4} \right)\right]\\ &= \frac{1}{2n}\left[ \sum_i -d_i^2\right] + \frac{1}{n}\left[ \sum_i\left(x_i^2 - nx_i + \frac{n^2}{4} \right)\right]\\ &= \frac{1}{2n}\left[ \sum_i -d_i^2\right] + \frac{1}{n}\left[ \sum_i\left(x_i - \frac{n}{2} \right)^2\right]\label{eq:num1} \end{align}

In the limit of large \(n\) the following approximation is true.

\begin{equation}\label{eq:approx1} \frac{1}{n}\left[ \sum_i\left(x_i - \frac{n}{2} \right)^2\right] \approx \frac{1}{n-1}\left[ \sum_i\left(x_i - \frac{n}{2} \right)^2\right] \end{equation}

Plug that in to Eq (\ref{eq:num1}).

\begin{equation} \text{numerator} = \frac{1}{2n}\left[ \sum_i -d_i^2\right] + \frac{1}{n-1}\left[ \sum_i\left(x_i - \frac{n}{2} \right)^2\right] \end{equation}

And plug that numerator back in to Eq (\ref{eq:num2}) and simplify.

\begin{align} \rho =& \frac{ \frac{1}{2n}\left[ \sum_i -d_i^2\right] + \frac{1}{n-1}\left[ \sum_i\left(x_i - \frac{n}{2} \right)^2\right]}{ \frac{1}{n-1}\sum_i (x_i-\frac{n}{2})^2}\\ \rho =& \frac{\frac{1}{n-1} \sum_i\left(x_i - \frac{n}{2} \right)^2}{ \frac{1}{n-1}\sum_i (x_i-\frac{n}{2})^2} -\frac{ \frac{1}{2n}\left[ \sum_i d_i^2\right]}{ \frac{1}{n-1}\sum_i (x_i-\frac{n}{2})^2}\\ \rho =& 1 -\frac{ \frac{1}{2n}\sum_i d_i^2}{ \frac{1}{n-1}\sum_i (x_i-\frac{n}{2})^2} \label{eq:numsolved} \end{align}

The next step in the derivation is an approximation and has a lot to do with the mystery 6. We are going to make a substitution in the denominator. See the figure below for a graphical analysis of the error of this approximation.

\begin{equation}\label{eq:num3} \frac{1}{n-1}\sum_i \left(x_i-\frac{n}{2}\right)^2 \approx (n^2-1)/12 \end{equation}

[Caption] Comparison of the approximation from Eq (\ref{eq:num3}).

I do not have a good explanation for this other than "it works ok'' but I recognize that it's unsatisfying for two reasons. [1] "Because it seems to be similar" is a poor explanation of why we might do something in this space and [2] the "equivalent" expression appears to diverge as \(n\) increases which is usually the opposite of what we want in an approximation in statistics. We generally want better approximations as our data sets grow, not worse approximations. However for \(n\) up to 100, it looks pretty good in the plot so at this point I'll accept it. Since the lines on the graph are not identical, no amount of algebraic manipulation is going to show those expressions as exactly equivalent and it is a fool's errand to do so.

Forget that quitter mindset, I figured it out.

\begin{align} \text{denominator } = & \frac{1}{n-1}\sum_i \left(x_i-\frac{n}{2}\right)^2 \\ = & \frac{1}{n-1} \left( \sum_i \left[ x_i^2 +\frac{n^2}{4} - nx_i \right] \right)\\ = & \frac{1}{n-1} \left( \sum_i x_i^2 + \sum_i \frac{n^2}{4} - n\sum_i x_i \right) \label{eq:series} \end{align}

From a sequences and series class, remember this is true if the set \(X\) contains every integer between 1 and \(n\).

\begin{align} \sum_{i=1}^n x_i =& \frac{n(n+1)}{2}\label{eq:summation1}\\ \sum_{i=1}^n x_i^2 =& \frac{n(n+1)(2n+1)}{6}\label{eq:summation2} \end{align}

Plug Eq (\ref{eq:summation1}) and Eq (\ref{eq:summation2}) into Eq (\ref{eq:series})

\begin{align} \text{denominator } = & \frac{1}{n-1} \left( \sum_i x_i^2 + \sum_i \frac{n^2}{4} - n\sum_i x_i \right)\\ = & \frac{1}{n-1} \left( \frac{n(n+1)(2n+1)}{6} + \frac{n^3}{4} - n\frac{n(n+1)}{2} \right)\\ = & \frac{1}{n-1} \left( \frac{ 2n^3+3n^2+n }{6} + \frac{n^3}{4} - \frac{n^3+n^2}{2} \right)\\ = & \frac{1}{n-1} \left( \frac{ 4n^3+6n^2+2n }{12} + \frac{3n^3}{12} - \frac{6n^3+6n^2}{12} \right)\\ = & \frac{1}{n-1} \left( \frac{ 4n^3+6n^2+2n + 3n^3 - 6n^3-6n^2}{12} \right)\\ = &\frac{ (4n^3+ 3n^3 - 6n^3) + (6n^2 -6n^2)+2n}{12(n-1)}\\ = &\frac{ n^3 +2n}{12(n-1)}\label{eq:correct_d} \end{align}

At this point, it is important to remember that no approximations have been made in the denominator and that the following statement is exactly true in all contexts where the set \(X\) contains all the integers from 1 to \(n\).

\begin{equation} \frac{1}{n-1}\sum_{i=1}^n \left(x_i-\frac{n}{2}\right)^2 = \frac{ n^3 +2n}{12(n-1)} \end{equation}

It is not clear whether I lost track of an \(n\) or whether someone decided to ignore it with an "as \(n \rightarrow \infty\) \((n^3+2n) \rightarrow n^3 \)". Whatever substitution they made using a limit argument (which they must have done because Eq (\ref{eq:num3}) not an exact equality as shown in the figure), the end result is that they accepted \(n^3\) as the dominant term in the equation (which it is). It is unclear why Spearman left it as \((n^2-1)/12\) instead of \(n^2/12\) but it is an equally valid approximation in either case. At very large \(n\) it does indeed converge, but because there are a couple of limits taken here in the approximation they have different ranges where they are better or worse approximations.

\begin{equation} \label{eq:approx} \text{denominator} = \frac{1}{n-1}\sum_i \left(x_i-\frac{n}{2}\right)^2 =\frac{ n^3 +2n}{12(n-1)} \approx \frac{ n^2 -1}{12} \approx \frac{n^2}{12} \end{equation}

Use first approximation in Eq (\ref{eq:approx}) to make a substitution in denominator of Eq (\ref{eq:numsolved}).

\begin{equation}\label{eq:further-approx} \rho = 1 -\frac{ \frac{1}{2n}\sum_i d_i^2}{ \frac{1}{n-1}\sum_i (x_i-\frac{n}{2})^2} = 1 -\frac{ \frac{1}{2n}\sum_i d_i^2}{(n^2-1)/12} = 1 -\frac{ 6}{n(n^2-1)}\sum_i d_i^2 \approx 1 -\frac{ 6}{n^3} \sum_i d_i^2 \end{equation}

It is also possible to use Eq (\ref{eq:correct_d}) to derive another version of Spearman's coefficient. This does not require making any approximation in the derivation except for Eq (\ref{eq:approx1}) which is an extremely small thing for \(n>100\).

\begin{equation}\label{eq:mine} \rho = 1 -\frac{ \frac{1}{2n}\sum_i d_i^2}{ \frac{1}{n-1}\sum_i (x_i-\frac{n}{2})^2} = 1 -\frac{ \frac{1}{2n}\sum_i d_i^2}{(n^3+2n)/(12(n-1))} = 1 -\frac{6(n-1)}{n^4+2n^2}\sum_i d_i^2 \end{equation}

Not to put too fine a point on it, but the equation that I believe is Spearman's original derivation has an error that the broader community corrected in the early 1900s. Spearman's original paper shows this equation without an \(n\) in the denominator (which cannot be neglected by any reasoning involving limits as \(n\rightarrow \infty\) or any other such justification). The same is true for the factor of 2 in the second term which is similarly incorrect to neglect.

\begin{equation}\label{eq:original} \rho = 1 -\frac{ 3\sum_i d_i^2}{(n^2-1)} \end{equation}

The reason that Spearman's original, Eq (\ref{eq:original}), is wrong is that the denominator scales with \(n^2\) instead of with \(n^3\). When taking limits or making approximations, it is often acceptable (for large \(n\)) to say

\begin{equation*} n^a \approx n^a+ (\text{stuff like }n^2+n \dots n^i \text{ where } i \lt a) \end{equation*}

but it is never acceptable to say

\begin{equation*} n^{a+1} \approx n^a+ (\text{stuff like }n^2+n \dots n^i \text{ where } i \lt a) \end{equation*}

because the term with the highest power is the dominant term at large \(n\).

It is unorthodox but no less correct to say \(n^3+2 \approx n^3-1\) as it is to say \(n^3+2 \approx n^3\). What is unclear still is why all modern statements of Spearman's coefficient use the first approximation instead of the second. It does nothing to further simplify the expression, it is every bit as wrong, and the \(-1\) comes out of nowhere. The only thing that makes me think otherwise is that I must have made so algebraic error but I'll be damned if I can find it. Let me know if you do.

The only plausible explanation is a degree of freedom argument. If you have a sample of only 1 item, you can compute a Spearman coefficient if you use the formula without a \(-1\). This is meaningless and adding the \(-1\) creates a divide by 0 error in the blind application of the formula which might help some people. However, if you are paying that little attention to what you are doing, you probably should not be doing math or operating heavy machinery anyway.

Summary of equations

After playing with this for a few days I have found four versions of Spearman's coefficient. One which appears in your textbook and is perfectly reasonable, one with a bit more of an approximation which is every bit as valid as the common one, one which is slightly more complex and requires slightly fewer approximations so I like it more, and one which is the original version in Spearman's paper which appears to have an error. The original one from the 1904 paper is not derived there, it is just stated so it is not easy to determine where he made an error.

Common in text	\( \rho = 1 -\frac{ 6}{n(n^2-1)} \sum_i d_i^2\)	Eq (\ref{eq:spearman})
Approx from text eq	\(\rho = 1 -\frac{ 6}{n^3} \sum_i d_i^2\)	Eq (\ref{eq:further-approx})
Mine - fewer approximations	\(\rho = 1 -\frac{6(n-1)}{n^4+2n^2}\sum_i d_i^2\)	Eq (\ref{eq:mine})
Spearman's original	\(\rho = 1 -\frac{ 3}{(n^2-1)} \sum_i d_i^2\)	Eq (\ref{eq:original})

Attempt 1 at explaining the 6

Pearson's coefficient is the ratio of the covariance and standard deviations of 2 sets of numbers (\(X\) and \(Y\)). Spearman's coefficient is Pearson coefficient for two sets of numbers containing only the unique integers \(1\dots n\). The 6 in Pearson's formula is the result of several pages of algebraic simplifications that are possible because both \(X\) and \(Y\) contain integers \(1\dots n\). It has no greater significance.

Attempt 2 at explaining the 6

The 6 in Pearson's formula is the result of several pages of algebraic simplifications that are possible because both (\(X\) and \(Y\)) contain only unique integers \(1\dots n\). That algebra could be summarized as

\begin{equation*} \frac{1}{2}\frac{\text{stuff}}{\frac{\text{stuff}}{6}+\frac{\text{stuff}}{4}+\frac{\text{stuff}}{2}} = \frac{12}{2}[\text{stuff}] = 6[\text{stuff}] \end{equation*}

The 2, 4, and 6, are from the \(\sigma_X^2\); the 6 and 2 can be thought of more specifically as originating from the formulas for the sum of integers or sum of squared integers.

The end

I do not think these explanations will be satisfying because I think you are looking for something that is not there as I was when I began this little distraction. The 6 is a consequence of tedious algebra; there's no single source to point to and say "it came from right there!"

Let me know if any of that is unclear. It looked good to me when I wrote it but that is the curse of technical writing — everything I write looks inescapably clear to me.

References

[1] C. Spearman, "The proof and measurement of association between two things," American Journal of Psychology, vol. 15, pp. 72—101, 1904.