A Guide to Power Analysis for Hypothesis Tests with One Categorical Independent Variable with Two Groups

Anna-Marie Ortloff, University of Bonn
Christian Tiefenau, University of Bonn
Matthew Smith, University of Bonn, Fraunhofer FKIE

Introduction
What can I expect?

The following short tutorials provide an overview of the data necessary to conduct power analysis for basic hypothesis tests, where to find this data in our database, and how to use it to conduct a priori power analysis using G*Power and R. G*Power is a statistical software package designed for power analysis in experimental design. It is widely used by researchers in many fields to determine the statistical power of their studies and to estimate the sample size needed to detect a significant effect. R is a programming language and software environment widely used for statistical computing and graphics. Developed by Ross Ihaka and Robert Gentleman, it offers extensive statistical analysis and visualization capabilities, and its large ecosystem of packages makes it a popular choice for data scientists and statisticians.
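
As a first taste, a minimal a priori power analysis in R might look like the following sketch, which assumes the pwr package is installed; the effect size of d = 0.5 is a placeholder for illustration, not a recommendation.

    # A priori power analysis for an independent t-test: how many
    # participants per group are needed to detect a medium effect
    # (Cohen's d = 0.5) with alpha = 0.05 and power = 0.80?
    library(pwr)

    pwr.t.test(d = 0.5,            # assumed standardized effect size
               sig.level = 0.05,   # significance criterion (alpha)
               power = 0.80,       # desired power (1 - beta)
               type = "two.sample",
               alternative = "two.sided")
    # Since n is left unspecified, pwr solves for it: n is about 63.77,
    # i.e. 64 participants per group after rounding up.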

The guide contains the following tutorials; further information on power analysis can be found below.

    Introduction
    Fisher's Exact Test
    Chi-Squared Test
    McNemar's Test
    Independent t-Test
    Paired t-Test
    Wilcoxon Rank-Sum Test
    Wilcoxon Signed-Rank Test

What is Power Analysis?

Four parameters are relevant to power analysis: power, the significance criterion (i.e. the \(\alpha\) error level), the reliability of the sample results or sensitivity of the test, and the effect size [2]. These four parameters are interdependent: when three of them are available, the fourth can be calculated. Such calculations are referred to as power analysis. In general, there are four different kinds of power analysis, each used to determine one of the parameters from the other three. It is also possible to determine both \(\alpha\) and power if a ratio of \(\alpha\) to \(\beta\) is given together with the other two parameters; this is termed compromise power analysis [5]. The four standard flavors are summarized, e.g., by Cohen [2] in Chapter 1.5.
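
This interdependence can be seen directly in base R's power.t.test(), where exactly one of the parameters is left as NULL and the function solves for it; the numbers below are purely illustrative.

    # Each call omits (or NULLs) exactly one parameter, which is then computed:
    power.t.test(n = 64, delta = 0.5, sd = 1, sig.level = 0.05)         # solves for power
    power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)   # solves for n
    power.t.test(n = 64, sd = 1, sig.level = 0.05, power = 0.80)        # solves for delta (effect size)
    power.t.test(n = 64, delta = 0.5, sd = 1,
                 sig.level = NULL, power = 0.80)                        # solves for alpha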

The Four Parameters Explained
All four parameters are explained below.
The power of a statistical test is the probability that the test correctly rejects the null hypothesis, i.e. yields a significant result, when the alternative hypothesis is true [4]. Power can also be written as \(1-\beta\), where \(\beta\) is the probability of a Type II error, i.e. wrongly failing to reject the null hypothesis when the alternative hypothesis is true. This means that if a test has a statistical power of 0.8, a commonly used and accepted value [2, 3], an actual effect will be detected 80% of the time.
The significance criterion or significance level is the maximum accepted probability of making a Type I error, i.e. wrongly rejecting the null hypothesis and detecting an effect when there actually is none [2]. Using the widely accepted threshold of 0.05 for statistical significance means that in only 5% of cases will an effect be detected in the sample even though it does not exist in the population.
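
Both definitions can be checked with a short simulation, sketched below with illustrative settings: under a true effect of d = 0.5 and 64 participants per group, a two-sample t-test at \(\alpha = 0.05\) should reject in roughly 80% of replications, and under no effect in roughly 5%.

    set.seed(1)
    n <- 64; d <- 0.5; reps <- 10000

    # p-values when the alternative hypothesis is true (true effect d)
    p_alt  <- replicate(reps, t.test(rnorm(n, mean = d), rnorm(n))$p.value)
    # p-values when the null hypothesis is true (no effect)
    p_null <- replicate(reps, t.test(rnorm(n), rnorm(n))$p.value)

    mean(p_alt  < 0.05)   # empirical power, close to 0.80
    mean(p_null < 0.05)   # empirical Type I error rate, close to 0.05
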
Reliability refers to how well a sample estimate represents the corresponding population parameter [2]. Depending on the type of estimated parameter, reliability is influenced by different factors, such as the quality of the measurement instrument and how well sources of variance that might obscure the effect you are trying to measure are controlled [4]. The largest and invariably present influence, however, is sample size [2]: larger samples produce more consistent and reliable estimates than smaller ones.
Finally, the effect size measures the magnitude of the impact of an independent variable on dependent variables, rather than only the presence or absence of an effect [4]. There are generally two types of effect sizes: non-standardized, or simple, effect sizes, which represent the size of the effect in the units of the outcome variable, and standardized effect sizes, which represent the effect relative to the variability in the sample or population [1]. When comparing two means, e.g. with a t-test, the difference in mean completion time between two interface variants is a simple effect size, measured in units of time such as minutes, while a standardized effect size for this scenario, such as Cohen's d, takes the standard deviation in the two groups into account. Standardized effect sizes are commonly classified as belonging either to the d-family, such as Cohen's d in the example above, or to the r-family, such as the correlation coefficient Pearson's r [6].
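
For the interface example above, both kinds of effect size are easy to compute in R; the completion times below are made up for illustration.

    # Completion times in minutes for two hypothetical interface variants
    time_a <- c(12.1, 10.4, 11.8, 13.0, 12.5)
    time_b <- c(10.2,  9.8, 10.9, 11.1,  9.5)

    # Simple effect size: difference in mean completion time, in minutes
    simple_es <- mean(time_a) - mean(time_b)

    # Standardized effect size: Cohen's d, the mean difference divided
    # by the pooled standard deviation (unitless)
    pooled_sd <- sqrt(((length(time_a) - 1) * var(time_a) +
                       (length(time_b) - 1) * var(time_b)) /
                      (length(time_a) + length(time_b) - 2))
    cohens_d <- simple_es / pooled_sd

    simple_es   # about 1.66 minutes
    cohens_d    # about 1.96 standard deviations
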
  1. Thom Baguley. Standardized or simple effect size: What should be reported? British Journal of Psychology, 100(3):603–617, 2009.
  2. Jacob Cohen. Statistical Power Analysis for the Behavioral Sciences. L. Erlbaum Associates, Hillsdale, NJ, 2nd edition, 1988.
  3. Julian Di Stefano. How much power is enough? Against the development of an arbitrary convention for statistical power calculations. Functional Ecology, 17(5):707–709, 2003.
  4. Paul D. Ellis. The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results. Cambridge University Press, 2010.
  5. Franz Faul, Edgar Erdfelder, Albert-Georg Lang, and Axel Buchner. G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39(2):175–191, May 2007.
  6. Robert Rosenthal. Parametric measures of effect size. In Harris Cooper and Larry Hedges, editors, The Handbook of Research Synthesis, pages 231–244. Russell Sage Foundation, New York, 1994.