I've created an easy to use R script that will import your data, and performs and writes up a state-of-the-art dependent or independent t-test. The goal of this script is to examine whether more researcher-centered statistical tools (i.e., a one-click analysis script that checks normality assumptions, calculates effect sizes and their confidence intervals, creates good figures, calculates Bayesian and robust statistics, and writes the results section) increases the use of novel statistical procedures. Download the script here: https://github.com/Lakens/Perfect-t-test. For comments, suggestions, or errors, e-mail me at D.Lakens@tue.nl. The script will likely be updated - check back for updates or follow me @Lakens to be notified of updates.
Correctly comparing two groups is remarkably
challenging. When performing a t-test
researchers rarely manage to follow all recommendations that statisticians have
made over the years. Where statisticians update their recommendations,
statistical textbooks often do not. Even though reporting effect sizes and
their confidence intervals has been recommended for decades (e.g., Cohen, 1990),
statistical software (e.g., SPSS 22) often does not provide these statistics. Progress
is slow, and Sharpe (2013) points to a lack of awareness, a lack of time, a
lack of easily usable software, and a lack of education as some of the main
reasons for the resistance to adopting statistical innovations.
Here, I propose a way to speed up the
widespread adoption of the state-of-the-art statistical techniques by providing
researchers with an easy to use script in free statistical software (R) that
will perform and report all statistical analyses, practically with a single
button press. The script (Lakens, 2015, available at https://github.com/Lakens/Perfect-t-test) follows state-of-the-art
recommendations (see below), creates plots of the data, and writes the results
section, including a minimally required interpretation of the statistical
results.
Automated analyses might strike readers as a
bad idea because it facilitates mindless statistics. Having performed
statistics mindlessly for most of my professional career, I sincerely doubt
access to this script would have reduced my level of understanding. If
anything, reading an automatically generated results section of your own data
that includes statistics you are not accustomed to calculate or report is
likely to make you think more about the usefulness of these statistics, not
less. However, the goal of this script is not to educate people. The main goal
is to get researchers to perform and report the analyses they should, and make
this as efficient as possible.
Comparing two groups
Keselman, Othman, Wilcox, and Fradette (2004)
proposed the a more robust two-sample t-test
that provides better Type 1 error control in situations of variance heterogeneity
and nonnormality, but their recommendations have not been widely implemented.
Researchers might in general be unsure whether it is necessary to change the
statistical tests they use to analyze and report comparisons between groups. As
Wilcox,
Granger, and Clark (2013, p. 29) remark: “All indications are that generally, the safest
way of knowing whether a more modern method makes a practical difference is to
actually try it.” Making sure conclusions based on multiple
statistical approaches converge is an excellent way to gain confidence in your
statistical inferences. This R script calculates traditional Frequentist
statistics, Bayesian statistics, and robust statistics, using both a hypothesis testing as an estimation approach, to invite researchers to
examine their data from different perspectives.
Since Frequentist and Bayesian statistics are
based on assumptions of equal variances and normally distributed data, the R
script provides boxplots and histograms with kernel density
plots overlaid with a normal distribution curve to check for outliers and normality. Kernel density
plots are a non-parametric technique to visualize the distribution of a
continuous variable. They are similar to a histogram, but less dependent on the
specific choice of bins used when creating a histogram. The graphs plot both
the normal distribution, as the kernel density function, making it easier to
visually check whether the data is normally distributed or not. Q-Q plots are provided as an additional check for normality.
Yap and Sim
(2011) show that no single test for normality will perform optimally for all
possible distributions. They conclude (p. 2153): “If the distribution is
symmetric with low kurtosis values (i.e. symmetric short-tailed distribution),
then the D'Agostino-Pearson and Shapiro-Wilkes tests have good power. For
symmetric distribution with high sample kurtosis (symmetric long-tailed), the
researcher can use the JB, Shapiro-Wilkes, or Anderson-Darling test." All
four normality tests are provided in the R script. Levene’s test for the equality of variances is provided, although for independent t-tests, Welch’s
t-test (which does not require equal variances) is
provided by default, following recommendations by Ruxton (2006). A short explanation accompanies all plots and assumption checks to help researchers to interpret the results.
The script also creates graphs that, for example, visualize the distribution of the datapoints, and provide both within as between confidence intervals:
The script provides interpretations for effect
sizes based on the classifications ‘small’, ‘medium’, and ‘large’. Default
interpretations of the size of an effect based on these three categories should
only be used as a last resort, and it is preferable to interpret the size of
the effect in relation to other effects in the literature, or in terms of its
practical significance. However, since researchers often do not interpret
effect sizes (if they are reported to begin with), the default interpretation
(and the suggestion to interpret effect sizes in relation to other effects in
the literature) should at least function as a reminder that researchers are
expected to interpret effect sizes. The common language effect
size (McGraw & Wong, 1992) is provided as an additional way to communicate the effect size.
Similarly, the Bayes Factor is classified into
anecdotal, moderate, strong, very strong, and decisive evidence for the
alternative or null hypothesis, following Jeffreys (1961), even though
researchers are reminded that default interpretations of the strength of the
evidence should not distract from the fact that strength of evidence is a
continuous function of the Bayes Factor. We can expect researchers will rely
less on default interpretations, the more acquainted they become with these
statistics, but for novices some help in interpreting effect sizes and Bayes
Factors will guide their interpretation.
Running the Markdown script
R Markdown scripts provide a way to create
fully reproducible reports from data files. The script combines the commands to
perform all statistical analyses with the written sections of the final output.
Calculated statistics and graphs are inserted into the written report at
specified locations. After installing the required packages, preparing the
data, and specifying some variables in the Markdown document, the report can be
generated (and thus, the analysis procedure can be performed) with a single
mouse-click (scroll down for an example of the output).
The R Markdown script and the ReadMe file contain
detailed instructions on how to run the script, and how to install required
packages, including the PoweR package
(Micheaux & Tran, 2014) to perform the normality tests, HLMdiag to create the Q-Q plots (Loy
& Hofmann, 2014). ggplot2 for all
plots (Wickham, 2009), car (Fox &
Weisberg, 2011) to perform Levene's test, MBESS
(Kelley, 2007) to calculate effect sizes and their confidence intervals, WRS for the robust statistics (Wilcox
& Schönbrodt, 2015), BootsES to calculate a robust effect size for the
independent t-test (Kirby &
Gerlanc, 2013), BayesFactor for the
bayes factor (Morey & Rouder, 2015), and BEST (Kruschke & Meredith, 2014) to calculate the Bayesian
highest density interval.
The data file (which should be stored in the
same folder that contains the R markdown script) needs to be tab delimited with
a header at the top of the file (which can easily be created from SPSS by
saving data through the 'save as' menu and selecting 'save as type: Tab delimited
(*.dat)', or in Excel by saving the data as ‘Text (Tab delimited) (.txt)’. For
the independent t-test the data file
needs to contain at least two columns (one specifying the independent variable,
and one specifying the dependent variable, and for the dependent t-test the data file needs to contain three
columns, one subject identifier column, and two columns for the two dependent
variables. The script for dependent t-tests allows you to select a subgroup for the analysis, as long as the data file contains
an additional grouping variable (see the demo data). The data files can contain irrelevant data,
which will be ignored by the script. Finally, researchers need to specify the
names (or headers) of the independent and dependent variables, as well as
grouping variables. Finally, there are some default settings researchers can
change, such as the sidedness of the test, the alpha level, the percentage for
the confidence intervals, and the scalar on the prior for the Bayes
Factor.
The script can be used to create either a word
document or a html document. The researchers can easily interpret all the
assumption checks, look at the data for possible outliers, and (after minor
adaptations) copy-paste the result sections into their article.
The statistical results the script generates has been compared
against the results provided by SPSS,
JASP,
ESCI,
online Bayes
Factor calculators, and BEST online.
Minor variations in the HDI calculation between BEST online and this script are
possible depending on the burn-in samples and number of samples, and for huge t-values there are minor variations
between JASP and the latest version of the Bayes Factor package used in this
script. This program is distributed in the hope that it will be useful, but without
any warranty. If you find an error, please contact me at D.Lakens@tue.nl.
Promoting Statistical Innovations
Statistical software is built around individual statistical tests, while
researchers perform a set of procedures.
Although it is not possible to create standardized procedures for all
statistical analyses, most, if not all, of the steps researchers have to go
through when they want to report correlations, regression analyses, ANOVA’s, and
meta-analyses are sufficiently structured. These tests make up a large portion
of analyses reported in journal articles. Demonstrating this, David Kenny
has created R scripts that will perform and report mediation and moderator analyses.
Felix
Schönbrodt has created a Shiny app that
performs several meta-analytic techniques. Making statistical innovations more accessible has a high potential to
substantially improve the quality of the statistical tests researchers perform
and report. Statisticians who take the application of generated knowledge seriously should try to experiment with the best way to get researchers to use state-of-the-art techniques. R markdown scripts are an excellent method to combine statistical
analyses and a written report in free software. Shiny apps might make
these analyses even more accessible, because they no longer require users to
install R and R packages.
Despite the name of this script, there is
probably not such a thing as a ‘perfect’ report of a statistical test.
Researchers might prefer to report standard errors instead of standard
deviations, perform additional checks for normality, different Bayesian or robust
statistics, or change the figures. The benefit of markdown scripts with a GNU license
stored on GitHub is that they can be forked
(copied to a new repository) where researchers are free to remove, add, or
change sections of the script to create their own ideal test. After some time,
a number of such scripts may be created, allowing researchers to choose an
analysis procedure that most closely matches their desires. Alternatively,
researchers can post feature requests or errors that can be incorporated in future
versions of this script.
It is important that researchers attempt to
draw the best possible statistical inferences from their data. As a science, we
need to seriously consider the most efficient way to accomplish this. Time is
scarce, and scientists need to master many skills in addition to statistics. I
believe that some of the problems in adopting new statistical procedures
discussed by Sharpe (2013) such as lack of time, lack of awareness, lack of
education, and lack of easy to use software can be overcome by scripts that
combine traditional and more novel statistics, are easy to use, and provide a
brief explanation of what is calculated while linking to the relevant
literature. This approach might be a small step towards a better understanding
of statistics for individual researchers, but a large step towards better
reporting practices.
References
Baguley, T. (2012). Calculating and
graphing within-subject confidence intervals for ANOVA. Behavior research methods, 44,
158-175.
Cumming, G. (2012). Understanding the new statistics: Effect
sizes, confidence intervals, and meta-analysis. New York: Routledge.
Fox, J. & Weisberg, S. (2011). An R Companion to Applied Regression, Second
edition. Sage, Thousand Oaks CA.
Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford:
Oxford University Press, Clarendon Press.
Kelley, K. (2007). Confidence
intervals for standardized effect sizes: Theory, application, and
implementation. Journal of Statistical
Software, 20, 1-24.
Kirby, K. N., & Gerlanc, D.
(2013). BootES: An R package for bootstrap confidence intervals on effect
sizes. Behavior Research Methods, 45, 905-927.
Lakens, D. (2015). The perfect
t-test (version 0.1.0). Retrieved from
https://github.com/Lakens/perfect-t-test. doi:10.5281/zenodo.17603
Loy, A., & Hofmann, H. (2014).
HLMdiag: A Suite of Diagnostics for Hierarchical Linear Models. R. Journal of
Statistical Software, 56, pp. 1-28. URL: http://www.jstatsoft.org/v56/i05/.
McGraw, K. O., & Wong, S. P.
(1992). A common language effect size statistic. Psychological Bulletin, 111,
361-365.
Sharpe, D. (2013). Why the
resistance to statistical innovations? Bridging the communication gap. Psychological Methods, 18, 572-582.
Wilcox, R. R., Granger, D. A., Clark,
F. (2013). Modern robust statistical methods: Basics with illustrations using psychobiological
data. Universal Journal of Psychology,
1, 21-31.
Yap, B. W., & Sim, C. H. (2011).
Comparisons of various types of normality tests. Journal of Statistical Computation and Simulation, 81, 2141-2155.