Term | Definition
---|---
True negative | \(T_n\): the probability of not having the condition and testing negative
True positive | \(T_p\): the probability of having the condition and testing positive
False negative | \(F_n\): the probability of having the condition but testing negative
False positive | \(F_p\): the probability of not having the condition but testing positive
Specificity | \( S_p \equiv \frac{T_n}{T_n+F_p} \)
Sensitivity (aka recall) | \( S_n \equiv \frac{T_p}{T_p+F_n} \)
Threshold | the cut-off applied to a continuous measurement; values above it test positive, values below it test negative
False positive rate | \( \equiv \frac{F_p}{T_n+F_p} = 1-S_p \)
Prevalence | \( P_\text{prev} \equiv T_p+F_n \)
Positive predictive value (aka precision) | \( \text{PPV} \equiv \frac{T_p}{T_p+F_p} \)
Negative predictive value | \( \text{NPV} \equiv \frac{T_n}{T_n+F_n} \)
Positive likelihood ratio | \( \text{LR}{+} \equiv \frac{S_n}{1-S_p} \)
Negative likelihood ratio | \( \text{LR}{-} \equiv \frac{1-S_n}{S_p} \)
Photo credit to Greg Jeanneau
In Sex part 2: Tests that lie, we discussed the problem of binary classification (a yes/no test). The inaccuracy of our screening tests (for sexually and non-sexually transmitted infections) often leads people to say "Well, why the hell can't we just have an accurate test?" If they are deep-thinkers in their frustration they might think "Why the hell can't we just accurately measure the thing in the blood/urine/etc that we're looking for?" Unfortunately, even when the measurement is perfectly accurate, it is rarely possible to translate that measurement into a 100% accurate yes/no result. The details of that translation force a trade-off between sensitivity and specificity on the test designer, no matter how accurate the measurement is.
Before starting this article, we need to review the vocabulary of statistics.
In the field of binary classification, there are four probabilities which determine everything: true positive (\(T_p\)), true negative (\(T_n\)), false positive (\(F_p\)), and false negative (\(F_n\)). Those four probabilities always add up to 1. The prevalence of what we are looking for is a ratio of these four: \( P_\text{prev} \equiv \frac{T_p+F_n}{T_p+T_n+F_p+F_n} = T_p+F_n \).
From here it makes sense to define sensitivity, specificity, PPV, and NPV as follows: \( S_n \equiv \frac{T_p}{T_p+F_n} \), \( S_p \equiv \frac{T_n}{T_n+F_p} \), \( \text{PPV} \equiv \frac{T_p}{T_p+F_p} \), and \( \text{NPV} \equiv \frac{T_n}{T_n+F_n} \).
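To make these definitions concrete, here is a minimal Python sketch; the function name and the example numbers are mine, invented for illustration:

```python
def classifier_terms(tp, tn, fp, fn):
    """All nine useful terms, computed from the four probabilities
    (tp, tn, fp, fn must sum to 1)."""
    prevalence  = tp + fn                # fraction of the population that is truly positive
    sensitivity = tp / (tp + fn)         # S_n
    specificity = tn / (tn + fp)         # S_p
    ppv         = tp / (tp + fp)         # positive predictive value
    npv         = tn / (tn + fn)         # negative predictive value
    return {"Tp": tp, "Tn": tn, "Fp": fp, "Fn": fn,
            "prevalence": prevalence,
            "sensitivity": sensitivity, "specificity": specificity,
            "PPV": ppv, "NPV": npv,
            "LR+": sensitivity / (1 - specificity),
            "LR-": (1 - sensitivity) / specificity}

# Hypothetical example: 1% prevalence and a fairly good test
print(classifier_terms(tp=0.009, tn=0.90, fp=0.09, fn=0.001))
```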
It is at this point that we should be done: we have defined nine useful terms... the only nine useful terms. However, statisticians are not done. In an ego-fueled battle to obfuscate their field, they have defined many more, often redundant, terms. Statisticians owe you and me an apology, personally, for this inane bullshit. Just try to remember two things:
(1) You need to understand the distribution of probability between \(T_p\), \(T_n\), \(F_p\), and \(F_n\) to understand what is happening.
(2) If you do not understand the distribution of probability between \(T_p\), \(T_n\), \(F_p\), and \(F_n\), you do not understand what is happening, no matter how many flowery, opaque words you can pepper a sentence with.
Here is a reference of redundant and unnecessary terms in binary classification statistics.
Term | Also known as | Definition
---|---|---
True positive rate | Sensitivity | \( \equiv \frac{T_p}{T_p+F_n} = S_n \)
True negative rate | Specificity | \( \equiv \frac{T_n}{T_n+F_p} = S_p \)
False negative rate | | \( \equiv \frac{F_n}{F_n+T_p} = 1-S_n \)
False positive rate | Fall-out | \( \equiv \frac{F_p}{T_n+F_p} = 1-S_p \)
Precision | Positive predictive value | \( \equiv \frac{T_p}{T_p+F_p} = \text{PPV} \)
Recall | Sensitivity | \( \equiv \frac{T_p}{T_p+F_n} = S_n \)
False discovery rate | | \( \equiv \frac{F_p}{T_p+F_p} = 1-\text{PPV} \)
False omission rate | | \( \equiv \frac{F_n}{T_n+F_n} = 1-\text{NPV} \)
In addition, there are a large number of metrics which attempt to gauge the performance of a binary classifier; a short code sketch evaluating them follows the table.
Metric | Definition
---|---
Accuracy | \( \equiv \frac{T_p+T_n}{T_p+T_n+F_p+F_n}\) |
Balanced accuracy | \( \equiv \frac{1}{2}\left( \frac{T_p}{T_p+F_n} +\frac{T_n}{F_p+T_n}\right) \) |
F1 score | \( \equiv \frac{2T_p}{2T_p +F_p+F_n} \) |
Informedness | \( \equiv \frac{T_p}{T_p +F_n}+\frac{T_n}{T_n +F_p}-1 \) |
Markedness | \( \equiv \frac{T_p}{T_p +F_p}+\frac{T_n}{T_n +F_n}-1 \) |
Matthews correlation coefficient | \( \equiv \frac{T_pT_n -F_pF_n}{\big( (T_p +F_p)(T_p+F_n)(T_n+F_p)(T_n+F_n)\big)^{1/2}} \) |
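As a sanity check on the formulas above, here is a short Python sketch (my own helper, not a standard routine) that evaluates each metric from the four probabilities:

```python
from math import sqrt

def performance_metrics(tp, tn, fp, fn):
    """Common single-number summaries of a binary classifier."""
    sn  = tp / (tp + fn)                 # sensitivity
    sp  = tn / (tn + fp)                 # specificity
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return {
        "accuracy":          (tp + tn) / (tp + tn + fp + fn),
        "balanced accuracy": 0.5 * (sn + sp),
        "F1":                2 * tp / (2 * tp + fp + fn),
        "informedness":      sn + sp - 1,          # = S_n + S_p - 1
        "markedness":        ppv + npv - 1,        # = PPV + NPV - 1
        "Matthews corr.":    (tp * tn - fp * fn) /
                             sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

print(performance_metrics(tp=0.009, tn=0.90, fp=0.09, fn=0.001))
```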
If you want to make a yes/no test from a continuous measurement, you have to choose a cut-off threshold. According to the test, all measurements above the threshold are positive and all below are negative. Unless the measurement distributions of the two populations (those with and without the condition) are well-separated, there will occasionally be real positives below the threshold and real negatives above it. For example, imagine we measure antibody in the blood to determine if a person has a certain bacterial infection. We observe most people with the infection have antibody levels above 6 arbitrary units (a.u.). However, occasionally a healthy person has antibody levels higher than 6 a.u. and occasionally a sick person has less than 6 a.u. of antibody.
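A minimal sketch of how this thresholding plays out in code, using made-up antibody levels and the hypothetical 6 a.u. cut-off:

```python
import numpy as np

threshold  = 6.0   # arbitrary units (a.u.) of antibody; hypothetical cut-off
antibody   = np.array([1.2, 4.8, 5.5, 6.3, 7.9, 9.4, 2.1, 6.8])           # measured levels
truly_sick = np.array([False, False, True, True, True, True, False, False])

test_positive = antibody > threshold          # the test: above threshold = positive

tp = np.sum( test_positive &  truly_sick)     # sick people the test catches
fp = np.sum( test_positive & ~truly_sick)     # healthy people flagged as sick
tn = np.sum(~test_positive & ~truly_sick)     # healthy people correctly cleared
fn = np.sum(~test_positive &  truly_sick)     # sick people the test misses

print(tp, fp, tn, fn)   # the sick person at 5.5 a.u. is a false negative,
                        # the healthy person at 6.8 a.u. is a false positive
```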
For any proxy variable we directly measure, there is a distribution of values for the diseased and healthy populations.
The graphics above show that the choice of threshold determines everything about the statistics of the classifier. Since the probability distributions are not chosen by the designer, the threshold is the only design choice. How does a designer choose that threshold?
In the graphic below, panels (a,b,c), we are shown the probability distribution functions of antibody concentration from a measured population. In panel (a), the distributions of the sick and the healthy are well-separated and there are an equal number of sick and healthy people. It is not difficult to choose a threshold, since any threshold between 2 and 8 results in near-100% \(S_n\) and \(S_p\). In panel (b), there are still an equal number of healthy and sick people, but the distributions are wider and their averages are closer together. It is much harder to choose a threshold. In panel (c), the means and standard deviations are the same as in panel (b), but the prevalence of the disease is now 10%, not 50%.
In panels (d,e,f), we are shown the probability distribution functions of the sick and healthy people separately. At the top of each graph are ten thresholds in viridis colors. At each threshold, we can compute all the parameters of a binary classifier (sensitivity, specificity, NPV, PPV, etc).
In panels (g,h,i), we are plotting the parametric curve \(S_p(t)\) vs \(S_n(t)\), where \(t\) is the threshold. The viridis colored dots on these panels correspond to the viridis arrows in panels (d,e,f). Depending on the threshold chosen, it is always possible to reach sensitivities and specificities anywhere from 0 to 1. Sensitivity and specificity are independent of prevalence; therefore the plots in panels (h) and (i) are identical, since the only parameter that changes between them is prevalence.
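For illustration, here is one way such a parametric curve could be traced, assuming (purely hypothetically) Gaussian antibody distributions in the spirit of the poorly separated panels:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical distributions, loosely in the spirit of the poorly separated panels:
healthy = norm(loc=4.0, scale=1.5)   # antibody levels of healthy people (a.u.)
sick    = norm(loc=6.0, scale=1.5)   # antibody levels of sick people (a.u.)

thresholds = np.linspace(0, 10, 200)

sensitivity = sick.sf(thresholds)      # S_n(t): fraction of sick above the threshold
specificity = healthy.cdf(thresholds)  # S_p(t): fraction of healthy below the threshold

# Plotting specificity vs sensitivity traces out the parametric curve of panels (g,h,i);
# neither quantity involves prevalence, which is why panels (h) and (i) are identical.
```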
In panels (j,k,m), we are plotting the parametric curve \(\text{PPV}(t)\) vs \(S_n(t)\), where \(t\) is the threshold. The viridis colored dots on these panels correspond to the viridis arrows in panels (d,e,f). Depending on the threshold chosen, it is always possible to reach sensitivities anywhere in the range \([0,1]\). However, the positive predictive value is confined to the interval \([P_\text{prev},1]\).
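One way to see where that lower bound comes from is to rewrite the PPV in terms of sensitivity, specificity, and prevalence (a standard rearrangement of Bayes' theorem, consistent with the definitions above):

\[ \text{PPV}(t) = \frac{T_p}{T_p+F_p} = \frac{S_n(t)\,P_\text{prev}}{S_n(t)\,P_\text{prev} + \big(1-S_p(t)\big)\big(1-P_\text{prev}\big)} \]

When the threshold is so low that everyone tests positive, \(S_n \to 1\) and \(S_p \to 0\), and the expression collapses to \(P_\text{prev}\); only by raising the threshold can a test push the PPV toward 1.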
The final two rows are the graphs which statisticians use to choose threshold cut-offs. The most common is the receiver operating characteristic (ROC) curve, which plots sensitivity vs false positive rate, or \(S_n(t)\) vs \(1-S_p(t)\). The ROC curve does not change as a function of prevalence because neither sensitivity nor specificity depends on prevalence. It is the most common tool for selecting an optimum threshold value, and it is usually appropriate when the prevalence is near 50%. ROCs are another example of needlessly opaque conventions in the field of statistics. Keeping things simple by focusing on sensitivity and specificity reduces the number of terms and variables. For that reason, the plot above shows \(S_p(t)\) vs \(S_n(t)\). An ROC curve has no advantages over such a plot since it is just a linear transformation of it (the false positive rate is simply \(1-S_p\)).
A specificity vs sensitivity graph contains a limit I am calling the "line of meaninglessness": the line where \(S_p=1-S_n\), which is where \(\text{LR}{+}\) and \(\text{LR}{-}\) are both 1. At any point along that line, the positive predictive value exactly equals the prevalence, no matter what the prevalence is. Full details on how this line of meaninglessness is derived and why \(\text{PPV} = P_\text{prev}\) when \(S_p=1-S_n\) are discussed at the end of Sex part 2: Tests that lie.
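As a quick consistency check with the PPV expression above: on that line \(1-S_p = S_n\), so

\[ \text{PPV} = \frac{S_n\,P_\text{prev}}{S_n\,P_\text{prev} + S_n\,(1-P_\text{prev})} = P_\text{prev}. \]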
In the final row we are plotting what is commonly known as a precision/recall curve. The precision/recall curve is the appropriate tool for optimizing the selection of a threshold when the prevalence is extremely small [1,2,3].
Consider panels (g, h, i) where we are plotting specificity vs sensitivity. When the distributions are well-separated (panel g), there are many thresholds which give both a sensitivity and specificity of 1. When the distributions are poorly separated (panels h and i), there is no threshold which gives both a sensitivity and specificity of 1. In most practical situations, the probability distributions are closer to panels (h,i) than they are to panel (g). This results in a trade-off between sensitivity and specificity. No matter how accurate the measurement is, the sensitivity and specificity will not both be 100%.
Let us say we are designing a screening test we hope to deploy to the entire US population. Our goal is to make a cheap test (otherwise we cannot afford 300 million of them) and we need every person with the disease to test positive. False positives are not a problem because we plan to use a more expensive test later to sort out which positives are true and which are false. Therefore we need a highly sensitive test, but not necessarily a highly specific one. This is a common design constraint for medical screening tests for STIs, hepatitis, and cancers. For most medical screening tests, a negative result is reliable. However, a positive result does not justify treatment, only a more specific follow-up test. This is especially true for low-prevalence conditions, where the positive predictive value plummets even at sensitivities and specificities close to 100%.
To be reliable, a confirmation test following a positive screening test must have a high specificity. The confirmation test works with much more favorable statistics because it only sees samples from the set of people who had a positive screening test, and the disease prevalence in that group is much higher. The positive predictive value of the confirmation test is higher than that of the screening test because (1) the prevalence of disease is higher among people with a positive screening test than in the general population and (2) the specificity of the confirmation test is often higher.
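To put rough numbers on this, here is a small Python sketch; the sensitivities, specificities, and 0.1% prevalence are invented for illustration:

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from sensitivity, specificity, and prevalence."""
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp)

# Hypothetical screening test: very sensitive, deployed on a low-prevalence population.
screen_ppv = ppv(sensitivity=0.99, specificity=0.95, prevalence=0.001)
print(f"screening PPV ~ {screen_ppv:.1%}")      # roughly 2%: most positives are false

# The confirmation test only sees people who screened positive, so its working
# prevalence is the screening test's PPV rather than the population prevalence.
confirm_ppv = ppv(sensitivity=0.99, specificity=0.999, prevalence=screen_ppv)
print(f"confirmation PPV ~ {confirm_ppv:.1%}")  # now the positives are mostly real
```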
When using the mirror-log scale, areas are not proportional to probabilities; the scale accentuates low-probability regions.
All values higher than the threshold are counted positive by the test: every value above the threshold is a true positive or a false positive, and every value below the threshold is a true negative or a false negative.
To avoid divide-by-zero errors, all probabilities are bounded between \(10^{-5}\) and \(0.99999\), even when the threshold lies outside the square probability distribution.
[1] "The relationship between Precision-Recall and ROC curves," Proceedings of the 23rd International Conference On Machine Learning, pp. 233—240, 2006.
[2] "Precision-recall operating characteristic (P-ROC) curves in imprecise environments," 18th International Conference On Pattern Recognition, vol. 4, pp. 123—127, 2006.
[3] "The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets," PloS One, vol. 10, 2015.