To critically evaluate the literature and to design valid studies, surgeons require an understanding of basic statistics. Despite the increasing complexity of reported statistical analyses in surgical journals and the decreasing use of inappropriate statistical methods, errors such as those made in the comparison of multiple groups still persist. This review introduces the statistical issues relating to multiple comparisons, describes the theoretical basis behind analysis of variance (ANOVA), discusses the essential differences between ANOVA and multiple t-tests, and provides an example of the computations and computer programming used in performing ANOVA.
Keywords: research/statistics and numerical data, data interpretation/statistical, models, statistical, review
Suppose that a researcher performs an experiment to assess the effects of an antibiotic on interleukin-6 (IL-6) levels in a cecal ligation and puncture rat model. He randomizes 40 rats to one of four equally sized groups: placebo with sham laparotomy, antibiotic with sham laparotomy, placebo with cecal ligation and puncture, and antibiotic with cecal ligation and puncture. He measures IL-6 levels in all four groups and wishes to determine whether a difference exists between the levels in the control rats (placebo with sham laparotomy) and the other groups. He performs two-tailed Student’s t-tests on all of the possible pairwise comparisons and determines that there is a significant difference between the control rats and rats receiving placebo with cecal ligation and puncture (P = 0.049). Is this statistical analysis valid?
Just as methodological flaws in research design can influence the interpretation of trial results, failure to use appropriate statistical tests may result in inaccurate conclusions. Readers must be knowledgeable enough to recognize data analytic errors and to interpret the reported statistical findings. However, in a survey of 91 fifth-year surgery residents in 1987, 92% reported less than 5 hours of instruction in statistics during their residency [1]. In a more recent survey reported in 2000 of 62 surgical residency programs, only 33% included education in statistics as a formal component of their curricula [2].
Given the growing impetus to practice evidence-based medicine, surgeons must be able to understand basic statistics to interpret the literature. Although descriptive statistics and t-tests are the most widely used statistical methods [3–5], researchers are employing increasingly sophisticated techniques for analyzing data. A review of trends in statistical techniques in surgical journals in 2003 compared to 1985 reported that statistical analyses have become more complicated with time [5]. In particular, the most significant changes were increases in the use of analysis of variance (ANOVA), nonparametric tests, and contingency table analyses. While the use of more advanced statistical methods may reflect increasing contributions of statisticians and epidemiologists to study design and interpretation, researchers must still be able to understand basic statistical concepts so as to choose the appropriate test. Additionally, surgeons must be able to judge the validity of the statistical methods and results reported in the literature both for research purposes and for clinical application.
Over the past several decades, not only have statistical analyses become more sophisticated, but the appropriate application of tests has improved as well. For example, in 2003, out of 187 randomly selected articles from surgical journals, 14 (7%) study authors incorrectly used t-tests instead of ANOVA for comparison of means for three or more groups [5]. In comparison, in 1985, 50 journal articles from the New England Journal of Medicine were analyzed, of which 27 (54%) used inappropriate statistical methods for comparison of multiple means [6]. Although advancements have been made in the statistics included in medical journals, errors still occur. Inappropriate statistical analyses were identified in 27% of studies examined from 2003 surgical journals [5]. Therefore, readers must be able to recognize common errors and the appropriate methods for addressing them. The primary purpose of this paper is to address the problem with multiple comparisons and to discuss why, when, and how to use ANOVA. The intended audience for the main text is surgical researchers and clinicians and, therefore, the concepts and applications of ANOVA are highlighted. For interested readers, the calculations for the main test statistic for a simple, one-way ANOVA are included (Appendix 1). A simulated example is also provided with calculations and basic computer programming (Appendix 2). The purpose of the appendices is to provide concrete examples that reinforce the concepts presented in the paper and to increase the readers’ confidence in using ANOVA. Lastly, definitions of the statistical terms used but not explained in the paper are included in a Glossary section.
Student’s t-Test Versus ANOVA
ANOVA expands on the basic concepts used in performing a t-test. In a previous article in the Journal of Surgical Research, Livingston discussed the use of Student’s t-test to detect a statistical difference in means between two normally distributed populations [7]. The F-ratio or F-statistic, which is used in ANOVA, can also be used to compare the means of two groups, and yields equivalent results to the t-statistic in this situation. In fact, mathematically, when comparing only two groups, the F-ratio is equal to the square of the t-statistic. However, there are several key differences between the two tests. First, ANOVA can be used for comparing the means of more than two groups and is in fact more statistically powerful in this situation. Moreover, variants of ANOVA can include covariates, which allow one to control statistically for confounders and to detect interactions whereby one variable moderates the effects of another variable.
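The equivalence between the F-ratio and the squared t-statistic for two groups can be checked numerically. The following is a minimal sketch, not the authors' code: the function names `pooled_t` and `one_way_f` and the two small data sets are hypothetical and chosen only for illustration, and the t-statistic is assumed to be the equal-variance (pooled) Student's form.

```python
from statistics import mean


def pooled_t(a, b):
    """Student's two-sample t-statistic using the pooled variance."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((x - mb) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (ma - mb) / (pooled_var * (1 / na + 1 / nb)) ** 0.5


def one_way_f(*groups):
    """One-way ANOVA F-ratio: between-groups mean square over within-groups mean square."""
    n = sum(len(g) for g in groups)   # total number of observations
    k = len(groups)                   # number of groups
    grand_mean = sum(sum(g) for g in groups) / n
    means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical measurements for two groups (illustration only).
group_a = [4.1, 5.0, 5.9, 6.2, 4.8]
group_b = [6.3, 7.1, 6.8, 7.9, 7.4]

t = pooled_t(group_a, group_b)
f = one_way_f(group_a, group_b)
print(f"t = {t:.4f}, t^2 = {t * t:.4f}, F = {f:.4f}")
```

The sign of t carries the direction of the difference, whereas F, being a ratio of squared quantities, does not. The same `one_way_f` function accepts three or more groups, which is the situation in which ANOVA is needed.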
t-Tests and F-tests vary essentially in the method of quantifying the variability around the group means. The t-statistic is calculated using the actual difference between means, while the F-statistic is calculated from the sums of the squared differences between means. This difference has implications for the probability distributions and the interpretation of the two test statistics. To better understand these differences, a discussion of the t- and F-families of probability distributions and degrees of freedom is necessary. Degrees of freedom is a parameter, dependent upon sample size, that is used to calculate the probability distributions for certain statistical models. Degrees of freedom may be considered a measure of parsimony, as it reflects the number of observations that remain free to vary and are therefore available to estimate additional parameters. In other words, as more model parameters are estimated, fewer degrees of freedom remain.
The t-test is based upon the t-distribution, which is similar to a normal distribution (i.e., it resembles a bell-shaped curve whereby 95% of data points lie within two standard deviations and 99.7% lie within three standard deviations of the mean) except for the use of the sample rather than the true population standard deviation [7]. The t-distribution approaches a normal distribution as the sample size, n, increases. A smaller sample size and fewer degrees of freedom (n − 1) result in the tails of the t-distribution being denser, containing a greater percentage of the data points. Thus, there is a family of t-distributions that are dependent upon the degrees of freedom. All members of the family of t-distributions are symmetric around zero as depicted in Fig. 1A.
(A) The Student’s t-distribution is similar to a normal or Gaussian distribution except that the sample standard deviation is used instead of the population standard deviation. The critical value indicating statistical significance can be either positive or negative.
The probability density function or equation for generating the family of F-distributions is also dependent upon the sample size, n. The total degrees of freedom for the F-distribution, as for the t-distribution, is n − 1. However, the total degrees of freedom is divided up into the between and within groups degrees of freedom, both of which contribute to the probability distribution. Because the F-distribution is based on squared sums, the F-distribution is always positive (Fig. 1B). The flatness and skewness of the distribution depend upon the between and within groups degrees of freedom. For more about the calculations of degrees of freedom for the F-ratio, refer to Appendix 1.
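As a brief illustration of how the total degrees of freedom is divided, the standard partition for a one-way ANOVA can be written as follows. The symbol g for the number of groups is an assumption introduced here for clarity; the notation used in Appendix 1 may differ.

```latex
\begin{align*}
df_{\text{total}}   &= n - 1, \\
df_{\text{between}} &= g - 1, \\
df_{\text{within}}  &= n - g, \\
df_{\text{total}}   &= df_{\text{between}} + df_{\text{within}}
                     = (g - 1) + (n - g).
\end{align*}
```

For the introductory example of 40 rats in four groups, this would give 39 = 3 + 36.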
These differences in probability distributions result in two main distinctions between the t- and the F-tests. First, directionality of hypothesized statistical relations can be evaluated using a one-tailed t-test, which answers the question of whether the mean of one group is larger than the other. In contrast, the F-test cannot determine the direction of a difference, only that one exists. The reason is that for a t-test, the critical value, or the value at which the t-statistic is significant, can be either positive or negative (since the distribution is centered about zero). Therefore, the t-test can evaluate hypotheses at either tail. In contrast, the F-ratio is always a positive number. Second, t-tests are not additive; that is, multiple t-tests cannot be summed together to identify a difference between multiple groups. For example, if the t-statistic for a comparison between A and B is −3 and the t-statistic for a comparison between B and C is +3, then the t-statistic for a comparison between A and C is not necessarily 0; that is, one cannot conclude that there is no difference between A and C. On the other hand, the F-test can identify an overall difference between three or more means using a single test that compares all of the groups simultaneously; thus, the F-test is referred to as an omnibus test.
Problem of Multiple Comparisons
One important advantage of the F-test is that as an omnibus test, it maintains an appropriate familywise error rate in hypothesis testing. In contrast, multiple t-tests result in an increased probability of making at least one Type 1 error. The problem of multiple comparisons is important to recognize in the literature, especially since the increase in the error rate may be substantial. As an example, in an analysis of 40 studies in orthopedic surgery journals, 182 significant results were reported. However, after adjustment for multiple comparisons, only 59.3% of these remained statistically significant [8]. Therefore, the Type 1 error or false-positive rate was much greater than the standard, predetermined rate of 5%.
The probability of at least one Type 1 error rises rapidly with the number of comparisons. The mathematical explanation for this increase is as follows: assuming an α equal to 0.05 and a true null hypothesis, the probability of correctly finding no significant difference in a single comparison is 1 − α or 0.95. However, if two independent comparisons are made, the probability that neither yields a false-positive result is no longer 0.95. Rather, the probability is $(1 - \alpha)^2$ or 0.90, and the likelihood of at least one Type 1 error is 1 − 0.90 or 0.10. Therefore, the probability that at least one Type 1 error occurs if k comparisons are made is $1 - (1 - \alpha)^k$; if 10 comparisons are made, the Type 1 error rate increases to approximately 40%.
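The growth of the familywise error rate can be tabulated directly from this formula. The sketch below is illustrative only; the function name `familywise_error_rate` is hypothetical, and the calculation assumes that the k comparisons are independent, which may not hold in practice.

```python
def familywise_error_rate(k, alpha=0.05):
    """Probability of at least one Type 1 error across k independent comparisons."""
    if k < 0:
        raise ValueError("k must be a non-negative number of comparisons")
    # Chance that none of the k comparisons is a false positive: (1 - alpha) ** k
    return 1 - (1 - alpha) ** k


# Tabulate the error rate for a few numbers of comparisons.
for k in (1, 2, 6, 10):
    print(f"k = {k:2d}: familywise error rate = {familywise_error_rate(k):.3f}")
```

Because the rate is bounded above by 1, the increase is steepest for the first few comparisons and then levels off.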
When all pairwise comparisons are made for n groups, the total number of possible combinations is n(n − 1)/2. However, some pairwise comparisons may not be biologically plausible and other pairwise comparisons may be related to each other. Therefore, the true overall Type 1 error rate is unknown. Nonetheless, the take-home message is that the false-positive error rate can far exceed the accepted rate of 0.05 when multiple comparisons are performed.
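The count of pairwise comparisons can be verified by enumerating them. The following minimal sketch uses the four groups from the introductory rat example; the variable names are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# The four groups from the introductory example; the first is the control.
groups = [
    "placebo + sham laparotomy",
    "antibiotic + sham laparotomy",
    "placebo + cecal ligation and puncture",
    "antibiotic + cecal ligation and puncture",
]

# Every unordered pair of distinct groups.
pairs = list(combinations(groups, 2))
n = len(groups)
assert len(pairs) == n * (n - 1) // 2

# Only some of the pairs involve the control group.
control = groups[0]
control_pairs = [pair for pair in pairs if control in pair]

print(f"All pairwise comparisons: {len(pairs)}")
print(f"Comparisons involving the control group: {len(control_pairs)}")
for first, second in pairs:
    print(f"  {first}  vs  {second}")
```

Listing the pairs also makes it easier to see which comparisons actually address the research question and which are incidental.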
Different statistical methods may be used to correct for inflated Type 1 error rates associated with multiple comparisons. One such method is the Bonferroni correction, which resets the significance threshold to α/k, where k represents the number of comparisons made. For example, if 10 hypotheses are tested, then only results with a P-value of less than 0.05/10 or 0.005 would be considered statistically significant. The Bonferroni correction therefore results in fewer statistically significant results. However, the resultant trade-off for minimizing the likelihood of a Type 1 error is a potential inflation of the Type 2 error rate. Another approach is to minimize the number of comparisons performed by using an omnibus test, such as the F-ratio in ANOVA, thereby diminishing the Type 1 error rate.
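The effect of the Bonferroni correction on a set of results can be sketched as follows. The function name `bonferroni_significant` and the list of ten P-values are hypothetical and serve only to illustrate the threshold of α/k described above.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each P-value as significant or not at the Bonferroni threshold alpha / k."""
    k = len(p_values)
    if k == 0:
        return []
    threshold = alpha / k
    return [p < threshold for p in p_values]


# Hypothetical P-values from ten comparisons (illustration only).
p_values = [0.001, 0.004, 0.012, 0.020, 0.049, 0.060, 0.150, 0.300, 0.450, 0.800]

unadjusted = [p < 0.05 for p in p_values]
adjusted = bonferroni_significant(p_values)

print(f"Significant without correction: {sum(unadjusted)}")
print(f"Significant with Bonferroni correction: {sum(adjusted)}")
```

Results that pass the conventional threshold but fail the corrected one illustrate the trade-off: fewer false positives at the cost of a greater chance of missing a true difference.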
In the initial example, the total number of pairwise comparisons that can be made between four groups of rats is 4(4 − 1)/2 or six. Therefore, the probability of at least one Type 1 error is $1 - (1 - 0.05)^6$ or 0.26, which is substantially higher than the predetermined level for rejecting the null hypothesis of 0.05. Using a Bonferroni correction, the adjusted significance threshold would be 0.05/6 or 0.008 for each comparison. Therefore, a P-value of 0.049 would not be considered statistically significant. Rather than having to perform six separate pairwise comparisons, ANOVA would have identified whether any significant difference in means existed using a single test. An F-ratio less than the critical value would have precluded further unnecessary testing.
Basic Concepts and Terminology
ANOVA was developed by Sir Ronald A. Fisher and introduced in 1925. Although termed analysis of variance, ANOVA aims to identify whether a significant difference exists between the means of two or more groups. The question that ANOVA answers is: are all of the group means the same? Or is the variance between the group means greater than would be expected by chance? For example, consider the data in Table 1 representing 23 observations distributed among four groups. Expressed in words, the null hypothesis in ANOVA is that the means of all four groups are equivalent; that is, the means for each column are equal. Expressed as an equation, the null hypothesis is: