ACEM Statistics & Evidence-Based Medicine Revision Guide

Statistics and evidence-based medicine (EBM) are essential components of the ACEM fellowship exam. These topics appear as MCQs, but the principles also underpin the SAQ component where candidates are asked to interpret study findings or justify clinical decisions. The content is conceptually demanding but finite — a structured approach pays significant dividends.

Exam relevance: Expect 5–10 MCQs on statistics and EBM in any given sitting. Questions typically test understanding of concepts and the ability to interpret results — not complex calculations. You will not need a calculator.

Hierarchy of Evidence

The hierarchy of evidence ranks study designs by their susceptibility to bias. Understanding this hierarchy is fundamental to critical appraisal and is one of the most commonly examined statistical concepts.

Level	Study design	Key features
1a	Systematic review of RCTs (with or without meta-analysis)	Synthesises all high-quality RCTs on a topic using a pre-specified, reproducible search strategy. Highest level of evidence for therapeutic questions. A meta-analysis is a statistical technique used within a systematic review to pool results — the two terms are not synonymous.
1b	Individual RCT with narrow confidence interval	Participants randomised to intervention or control. Gold standard for establishing causation. Key features: randomisation, blinding, intention-to-treat analysis, adequate power.
2a	Systematic review of cohort studies	Highest level for prognosis and harm questions where RCTs are impractical.
2b	Individual cohort study (or low-quality RCT)	Observational. Follows an exposed group and an unexposed group over time. Can establish association and temporal sequence but not definitive causation. Major limitation: confounding.
3	Case-control study	Starts with outcome (cases) and works backwards to identify exposures. Efficient for rare diseases. Susceptible to recall bias. Produces odds ratios, not relative risk (because the incidence of disease cannot be determined from case-control design).
4	Case series / case reports	Descriptive only. No comparator group. Useful for generating hypotheses and reporting rare conditions or novel presentations.
5	Expert opinion	Lowest level. Subject to individual bias and experience.

Important nuance: The hierarchy applies differently depending on the clinical question. For therapy/prevention questions, RCTs are the gold standard. For prognosis questions, cohort studies are often the highest achievable level. For diagnosis questions, cross-sectional studies comparing the test to a reference standard are appropriate (using STARD reporting criteria). The level of evidence is not the only determinant of quality — a well-conducted cohort study may provide better evidence than a poorly designed RCT.

Study Design Concepts

Randomised controlled trials

The RCT is the gold standard for evaluating therapeutic interventions. Key design features that affect internal validity:

Randomisation: Ensures groups are comparable at baseline by distributing both known and unknown confounders equally. Methods include simple randomisation, block randomisation (ensures equal group sizes), and stratified randomisation (ensures balance for specific prognostic variables).
Allocation concealment: Prevents investigators from knowing which group a participant will be assigned to before enrolment. This is distinct from blinding — it prevents selection bias at the point of recruitment. The gold standard is central telephone or computer-based randomisation.
Blinding: Single-blind (participants unaware of allocation), double-blind (participants and investigators unaware), triple-blind (participants, investigators, and outcome assessors unaware). Blinding reduces performance bias (participants behave differently if they know they received the intervention) and detection bias (assessors measure outcomes differently).

Intention-to-treat vs per-protocol analysis

Intention-to-treat (ITT) analysis analyses every participant in the group to which they were originally randomised, regardless of whether they completed the intervention, crossed over to the other group, or withdrew. ITT preserves the benefits of randomisation and provides a conservative estimate of treatment effect (it is biased toward the null hypothesis). This makes it the preferred primary analysis for superiority trials.

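Per-protocol (PP) analysis includes only those who completed the study as intended. It can overestimate treatment effect in superiority trials, because excluding non-adherent participants removes the dilution they would otherwise contribute and breaks the balance achieved by randomisation. In non-inferiority trials, however, PP analysis has traditionally been preferred as the primary analysis, because ITT is anti-conservative in that context: non-adherence and crossover make the groups look more alike, which biases toward the null and therefore favours a conclusion of non-inferiority. Both analyses should be reported, and the non-inferiority conclusion is most convincing when they agree.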

Exam favourite: "Which analysis should be the primary analysis in a superiority trial?" Answer: ITT. "Which in a non-inferiority trial?" Answer: both ITT and PP should be presented, with PP typically primary, because ITT biases toward showing non-inferiority.

Pragmatic vs explanatory trials

Pragmatic trials test an intervention under real-world conditions: broad inclusion criteria, usual-care comparator, clinically relevant outcomes, minimal protocol-mandated visits. They answer the question "Does this work in routine practice?" Explanatory trials test under ideal conditions: strict inclusion/exclusion criteria, close protocol adherence, placebo comparator, surrogate outcomes. They answer "Can this work under optimal conditions?" The PRECIS-2 tool provides a framework for categorising where on the pragmatic-explanatory continuum a trial sits.

Cross-over trials

Participants receive both the intervention and control in sequence, separated by a washout period to eliminate carryover effects. Each participant serves as their own control, which reduces between-subject variability and requires a smaller sample size. Limitations: only suitable for stable conditions (not acute illness) with interventions that have a reversible effect. Carryover effects can bias results if the washout period is insufficient.

Cluster randomised trials

Randomisation occurs at the group level (e.g. hospitals, EDs, practices) rather than the individual level. Used when individual randomisation is impractical (e.g. implementing a protocol change in an ED) or when contamination between groups would occur (e.g. education interventions). The key statistical consideration is the intracluster correlation coefficient (ICC): outcomes of individuals within the same cluster are correlated, which reduces effective sample size. The design effect = 1 + (m – 1) × ICC, where m is the average cluster size; it is the factor by which the sample size of an individually randomised trial must be multiplied. Cluster trials therefore typically need a larger total sample size than individually randomised trials to achieve the same power.
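The design effect formula above can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the trial numbers (20 EDs of 50 patients, ICC of 0.05) are hypothetical and chosen for the example, not taken from any real study.

```python
def design_effect(mean_cluster_size: float, icc: float) -> float:
    """Design effect = 1 + (m - 1) * ICC, where m is the average cluster size."""
    return 1 + (mean_cluster_size - 1) * icc


def effective_sample_size(total_n: int, mean_cluster_size: float, icc: float) -> float:
    """Number of independent participants the clustered sample is 'worth'."""
    return total_n / design_effect(mean_cluster_size, icc)


# Hypothetical trial: 20 EDs x 50 patients each, ICC = 0.05
deff = design_effect(50, 0.05)
n_eff = effective_sample_size(1000, 50, 0.05)
print(f"Design effect: {deff:.2f}")
print(f"Effective sample size: {n_eff:.0f} of 1000 enrolled")
```

With these assumed inputs the design effect is 3.45, so 1000 enrolled patients carry roughly the information of 290 individually randomised patients. Even a small ICC matters when clusters are large.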

Diagnostic Test Performance

The 2×2 table

	Disease present (D+)	Disease absent (D–)	Totals
Test positive (T+)	True positive (TP) = a	False positive (FP) = b	a + b
Test negative (T–)	False negative (FN) = c	True negative (TN) = d	c + d
Totals	a + c	b + d	N

All diagnostic test metrics derive from this table. Being able to construct and populate a 2×2 table from a clinical scenario is the single most important statistical skill for the exam.

Sensitivity and specificity

Sensitivity = a / (a + c) = TP / (TP + FN) — the proportion of people with the disease who test positive. A highly sensitive test has few false negatives and is good at ruling out disease when negative (mnemonic: SnNOut — Sensitivity, Negative result, rules Out).

Specificity = d / (b + d) = TN / (TN + FP) — the proportion of people without the disease who test negative. A highly specific test has few false positives and is good at ruling in disease when positive (mnemonic: SpPIn — Specificity, Positive result, rules In).

Sensitivity and specificity are intrinsic properties of the test at a given threshold and do not change with disease prevalence. However, they exist in tension — lowering the threshold to increase sensitivity will decrease specificity, and vice versa. The optimal threshold depends on the clinical context (e.g. screening tests prioritise sensitivity; confirmatory tests prioritise specificity).

Predictive values

Positive predictive value (PPV) = a / (a + b) = TP / (TP + FP) — the probability that a patient with a positive test actually has the disease. Negative predictive value (NPV) = d / (c + d) = TN / (TN + FN) — the probability that a patient with a negative test truly does not have the disease.
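As a practice aid, the four formulas above can be computed directly from the cells of the 2×2 table. The sketch below uses hypothetical counts (100 patients with disease, 200 without), and the function name is illustrative.

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute test metrics from 2x2 cells (a = TP, b = FP, c = FN, d = TN)."""
    return {
        "sensitivity": tp / (tp + fn),  # a / (a + c)
        "specificity": tn / (tn + fp),  # d / (b + d)
        "ppv": tp / (tp + fp),          # a / (a + b)
        "npv": tn / (tn + fn),          # d / (c + d)
    }


# Hypothetical study: 100 patients with disease, 200 without
metrics = diagnostic_metrics(tp=90, fp=30, fn=10, tn=170)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Note that sensitivity and specificity are calculated down the columns (within disease status), whereas PPV and NPV are calculated across the rows (within test result). That is why only the predictive values shift when the column totals, i.e. the prevalence, change.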

Predictive values are heavily influenced by disease prevalence (pre-test probability). This is the single most important concept for exam questions about diagnostic tests:

As prevalence increases: PPV increases, NPV decreases.
As prevalence decreases: PPV decreases, NPV increases.
Sensitivity and specificity remain unchanged when prevalence changes (they are properties of the test, not the population).

Worked example: A test with 95% sensitivity and 95% specificity is applied to two populations. In Population A (prevalence 50%, 1000 people): 475 TP, 25 FP, 25 FN, 475 TN → PPV = 475/500 = 95%. In Population B (prevalence 1%, 10,000 people): 95 TP, 495 FP, 5 FN, 9405 TN → PPV = 95/590 = 16%. The same test with the same sensitivity/specificity produces a PPV of 95% in one population and 16% in another. This is why screening tests in low-prevalence populations generate so many false positives.
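The worked example can be reproduced for any prevalence with a short function. This is a sketch under the assumption that sensitivity and specificity stay fixed across populations; the function name and the intermediate 10% prevalence are illustrative additions.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) for a test applied at a given prevalence."""
    tp = sensitivity * prevalence              # true positives per person tested
    fn = (1 - sensitivity) * prevalence        # false negatives
    fp = (1 - specificity) * (1 - prevalence)  # false positives
    tn = specificity * (1 - prevalence)        # true negatives
    return tp / (tp + fp), tn / (tn + fn)


# Same test (95% sensitive, 95% specific) at three prevalences
for prevalence in (0.50, 0.10, 0.01):
    ppv, npv = predictive_values(0.95, 0.95, prevalence)
    print(f"Prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")

ppv_high, _ = predictive_values(0.95, 0.95, 0.50)
ppv_low, npv_low = predictive_values(0.95, 0.95, 0.01)
```

At 50% prevalence the PPV is 95%, and at 1% prevalence it falls to about 16%, matching the figures in the worked example, while the NPV moves in the opposite direction.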

Likelihood ratios

Likelihood ratios (LRs) combine sensitivity and specificity into a single measure that expresses how much a test result changes the probability of disease. They are independent of prevalence and can be applied at the bedside.

Positive likelihood ratio (LR+) = sensitivity / (1 – specificity). Interpretation: how many times more likely is a positive test result in a person with disease compared to a person without disease?

Negative likelihood ratio (LR–) = (1 – sensitivity) / specificity. Interpretation: how many times more likely is a negative test result in a person with disease compared to a person without disease?

LR+	Approximate change in probability	LR–	Approximate change in probability
> 10	Large, often conclusive increase	< 0.1	Large, often conclusive decrease
5–10	Moderate increase	0.1–0.2	Moderate decrease
2–5	Small increase	0.2–0.5	Small decrease
1–2	Minimal, rarely important	0.5–1	Minimal, rarely important

The clinical application uses Bayes' theorem in simplified form: Pre-test odds × LR = Post-test odds. To convert between probability and odds: Odds = probability / (1 – probability); Probability = odds / (1 + odds). A Fagan nomogram provides a graphical shortcut — draw a line from the pre-test probability through the likelihood ratio to read off the post-test probability.
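The odds conversion is where most arithmetic slips occur, so it can help to see the three steps written out. The sketch below assumes a hypothetical test (90% sensitivity, 80% specificity) and a pre-test probability of 20%; all function names are illustrative.

```python
def probability_to_odds(probability: float) -> float:
    return probability / (1 - probability)


def odds_to_probability(odds: float) -> float:
    return odds / (1 + odds)


def likelihood_ratios(sensitivity: float, specificity: float):
    """Return (LR+, LR-)."""
    return sensitivity / (1 - specificity), (1 - sensitivity) / specificity


def post_test_probability(pre_test_probability: float, lr: float) -> float:
    """Pre-test odds x LR = post-test odds, converted back to a probability."""
    post_test_odds = probability_to_odds(pre_test_probability) * lr
    return odds_to_probability(post_test_odds)


# Hypothetical test and patient
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.90, specificity=0.80)
after_positive = post_test_probability(0.20, lr_pos)
after_negative = post_test_probability(0.20, lr_neg)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.3f}")
print(f"Post-test probability if positive: {after_positive:.1%}")
print(f"Post-test probability if negative: {after_negative:.1%}")
```

Here LR+ is 4.5 and LR– is 0.125, so a 20% pre-test probability becomes about 53% after a positive result and about 3% after a negative one. Multiplying the probability itself by the LR (rather than the odds) is a common error.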

ROC curves

A Receiver Operating Characteristic (ROC) curve plots sensitivity (y-axis) against 1 – specificity (x-axis) across all possible thresholds. The area under the ROC curve (AUC or c-statistic) quantifies overall test discrimination: 1.0 = perfect test, 0.5 = no better than chance (the diagonal line). Commonly quoted interpretation benchmarks: 0.5–0.7 poor, 0.7–0.8 acceptable, 0.8–0.9 excellent, > 0.9 outstanding. ROC curves are useful for (1) comparing two diagnostic tests applied to the same population (the test with the higher AUC has better overall discrimination) and (2) selecting a threshold. Two common statistical criteria are the point on the curve closest to the top-left corner and the point that maximises Youden's index (sensitivity + specificity – 1), which is equivalent to maximising the sum of sensitivity and specificity. The two often coincide but are not guaranteed to select the same threshold, and both weight false negatives and false positives equally, so the clinically optimal threshold may differ.
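To make the AUC and threshold-selection ideas concrete, the sketch below takes four hypothetical thresholds (labels t1 to t4, with invented sensitivity and specificity pairs), estimates the AUC with the trapezoidal rule, and then applies two selection criteria: Youden's index and the distance to the top-left corner. The data and function names are assumptions for illustration.

```python
import math

# Hypothetical (label, sensitivity, specificity) at four test thresholds
cutoffs = [
    ("t1", 0.95, 0.50),
    ("t2", 0.85, 0.75),
    ("t3", 0.70, 0.92),
    ("t4", 0.40, 0.98),
]


def auc_trapezoid(cutoffs) -> float:
    """Trapezoidal area under the curve of sensitivity vs (1 - specificity)."""
    points = sorted((1 - spec, sens) for _, sens, spec in cutoffs)
    points = [(0.0, 0.0)] + points + [(1.0, 1.0)]  # anchor both ends of the curve
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def youden_index(sens: float, spec: float) -> float:
    return sens + spec - 1


def distance_to_corner(sens: float, spec: float) -> float:
    """Euclidean distance from the ROC point to the top-left corner (0, 1)."""
    return math.hypot(1 - sens, 1 - spec)


auc = auc_trapezoid(cutoffs)
best_youden = max(cutoffs, key=lambda c: youden_index(c[1], c[2]))
closest = min(cutoffs, key=lambda c: distance_to_corner(c[1], c[2]))
print(f"AUC (trapezoidal): {auc:.3f}")
print(f"Highest Youden index: {best_youden[0]}")
print(f"Closest to top-left corner: {closest[0]}")
```

With these invented numbers the AUC is about 0.88, Youden's index selects t3 and the closest-to-corner rule selects t2. The two criteria are separate rules that can select different thresholds, and neither accounts for the clinical cost of a missed diagnosis versus a false alarm.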

Measures of Treatment Effect

Absolute and relative risk

Measure	Formula	Interpretation
Control event rate (CER)	Events in control / total in control	The baseline risk in the control group.
Experimental event rate (EER)	Events in treatment / total in treatment	The risk in the treatment group.
Absolute risk reduction (ARR)	CER – EER	The actual difference in event rates. More clinically meaningful than relative measures because it incorporates baseline risk.
Relative risk (RR)	EER / CER	RR < 1 indicates the treatment reduces risk. RR = 0.75 means a 25% relative reduction.
Relative risk reduction (RRR)	1 – RR, or ARR / CER	Often reported in trials because it sounds impressive. A 50% RRR could reflect a drop from 2% to 1% (ARR = 1%, NNT = 100) or from 40% to 20% (ARR = 20%, NNT = 5). Always ask for the absolute numbers.
Odds ratio (OR)	(a × d) / (b × c) from the 2×2 table	Used in case-control studies and logistic regression. Approximates the RR when the event rate is low (< 10%). At higher event rates the OR lies further from 1 than the RR, so it exaggerates the apparent effect in either direction.
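The relationship between OR and RR in the last row of the table is easiest to appreciate numerically. The sketch below assumes a cohort-style 2×2 layout (a = exposed with event, b = exposed without, c = unexposed with event, d = unexposed without) and two hypothetical cohorts with the same RR of 2 but different baseline event rates.

```python
def risk_ratio(a: int, b: int, c: int, d: int) -> float:
    """RR = risk in exposed / risk in unexposed."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """OR = (a x d) / (b x c)."""
    return (a * d) / (b * c)


# Hypothetical rare outcome: 2% vs 1% event rate (1000 per group)
rare = (20, 980, 10, 990)
# Hypothetical common outcome: 60% vs 30% event rate (1000 per group)
common = (600, 400, 300, 700)

for label, cells in (("Rare outcome", rare), ("Common outcome", common)):
    print(f"{label}: RR = {risk_ratio(*cells):.2f}, OR = {odds_ratio(*cells):.2f}")
```

For the rare outcome the OR (about 2.02) is almost identical to the RR of 2.0. For the common outcome the RR is still 2.0 but the OR is 3.5, which would overstate the effect if read as a risk ratio.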

NNT and NNH

Number needed to treat (NNT) = 1 / ARR. It represents the number of patients who need to receive the intervention (rather than control) to prevent one additional adverse outcome. A lower NNT indicates a more effective treatment. Number needed to harm (NNH) = 1 / absolute risk increase (ARI). It represents the number of patients who need to be exposed to a treatment before one additional patient experiences a specific harm. Ideally, NNT should be low and NNH should be high.

Worked example: A trial shows mortality of 8% in the control group and 5% in the treatment group. ARR = 0.08 – 0.05 = 0.03 (3%). NNT = 1/0.03 = 33.3, conventionally rounded up to 34. You would need to treat about 34 patients to prevent one death. The RRR = 0.03/0.08 = 37.5%. Note that if the baseline risk were 0.8% vs 0.5% instead (same RRR of 37.5%), the ARR would be only 0.3% and the NNT would be 1/0.003 = 333.3, rounded up to 334: a much less clinically compelling result despite the same relative reduction.
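The same calculation can be scripted to check practice answers. This is a minimal sketch using the event rates from the worked example; the function name is illustrative, and the NNT is reported both unrounded and rounded up to the next whole patient, which is the usual convention.

```python
import math


def treatment_effect(cer: float, eer: float) -> dict:
    """Effect measures from control (CER) and experimental (EER) event rates."""
    arr = cer - eer   # absolute risk reduction
    rr = eer / cer    # relative risk
    return {
        "arr": arr,
        "rr": rr,
        "rrr": 1 - rr,   # relative risk reduction
        "nnt": 1 / arr,  # unrounded number needed to treat
    }


for cer, eer in ((0.08, 0.05), (0.008, 0.005)):
    result = treatment_effect(cer, eer)
    print(
        f"CER {cer:.1%}, EER {eer:.1%}: ARR {result['arr']:.2%}, "
        f"RRR {result['rrr']:.1%}, NNT {result['nnt']:.1f} "
        f"(rounded up: {math.ceil(result['nnt'])})"
    )

high_risk = treatment_effect(0.08, 0.05)
low_risk = treatment_effect(0.008, 0.005)
```

Both scenarios give the same RRR of 37.5%, but the NNT differs tenfold because the ARR depends on baseline risk.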

Hypothesis Testing and P-Values

Type I and Type II errors

	H₀ is actually true (no real effect)	H₀ is actually false (real effect exists)
Reject H₀	Type I error (false positive) — probability = α	Correct decision — probability = power (1 – β)
Fail to reject H₀	Correct decision — probability = 1 – α	Type II error (false negative) — probability = β

Alpha (α) is conventionally set at 0.05, meaning a 5% chance of concluding there is an effect when there is not (false positive). Beta (β) is conventionally set at 0.20, giving a power of 80% — the probability of detecting a true effect when one exists. Power is increased by: larger sample size, larger true effect size, less variability in the outcome measure, and a higher alpha threshold (less conservative, but increases Type I error risk).
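The interplay between alpha, power, and effect size can be illustrated with a sample size calculation. The sketch below uses one common normal-approximation formula for comparing two proportions with a two-sided test and equal group sizes; this formula is an assumption brought in for illustration (the exam does not require it), and the event rates are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control: float, p_treatment: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate n per group to detect p_control vs p_treatment (two-sided)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    n = (z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2
    return math.ceil(n)


# Hypothetical: detect a mortality reduction from 8% to 5%
n_80 = sample_size_per_group(0.08, 0.05, power=0.80)
n_90 = sample_size_per_group(0.08, 0.05, power=0.90)
n_large_effect = sample_size_per_group(0.08, 0.02, power=0.80)
print(f"80% power: {n_80} per group")
print(f"90% power: {n_90} per group")
print(f"Larger effect (8% vs 2%), 80% power: {n_large_effect} per group")
```

The pattern matters more than the exact numbers: demanding higher power increases the required sample size, and a larger true effect reduces it.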

P-values: what they mean and what they do not

The p-value is the probability of observing data as extreme as (or more extreme than) the observed results, assuming the null hypothesis is true. It is not:

The probability that the null hypothesis is true (a common misinterpretation).
The probability that the results occurred by chance.
A measure of effect size or clinical importance.

A p-value of 0.03 means: "If there truly were no difference between the groups, there would be a 3% probability of observing a result this extreme or more extreme." It does not mean there is a 97% chance the treatment works.

The threshold of p < 0.05 for "statistical significance" is a convention, not a scientific law. P-values of 0.049 and 0.051 represent virtually identical evidence, and the dichotomous significant/not-significant distinction is increasingly recognised as problematic.

Confidence intervals

A confidence interval (CI) provides a range of plausible values for the true population parameter based on the observed sample data. The most commonly used is the 95% CI.

Interpretation: A 95% CI means that if the study were repeated many times using the same methods, 95% of the calculated intervals would contain the true population parameter. It does not mean there is a 95% probability that the true value lies within this specific interval (the true value either is or is not in the interval — it is not a probability statement about a single interval).

What confidence intervals tell you that p-values do not

Magnitude of effect: A CI of 0.4–0.8 for a risk ratio tells you the treatment probably reduces risk by 20–60%. A p-value alone only tells you it is "significant."
Precision: A narrow CI (e.g. RR 0.52–0.58) indicates a precise estimate (large sample, low variability). A wide CI (e.g. RR 0.15–1.85) indicates imprecision (small sample, high variability, or both) — the true effect could be anything from a large benefit to a large harm.
Clinical significance vs statistical significance: A CI that lies entirely below a clinically meaningful threshold suggests the result, while statistically significant, may not be clinically important. For example, a mean difference in pain score of 2 mm (95% CI: 1–3 mm) on a 100 mm VAS is statistically significant but clinically unimportant (the minimum clinically important difference is typically quoted as about 13 mm).
Equivalence with p-values: If the 95% CI for a difference does not include 0 (or for a ratio does not include 1.0), the result is statistically significant at p < 0.05. If it does include 0 (or 1.0), the result is not statistically significant at that level.

Exam application: When presented with a CI in an exam question, assess three things: (1) Does it cross the line of no effect (0 for differences, 1.0 for ratios)? This tells you statistical significance. (2) How wide is it? This tells you precision. (3) Where does it sit relative to a clinically meaningful threshold? This tells you clinical significance. A result can be statistically significant but clinically unimportant (narrow CI, small effect) or statistically non-significant but clinically potentially important (wide CI crossing 1.0 but including large effects — the study may be underpowered).
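The effect of sample size on CI width can be demonstrated with two hypothetical trials that have identical event rates (5% vs 8%) but different sizes. The sketch below uses the standard large-sample log method for the CI of a risk ratio; the function name and the trial counts are assumptions for illustration.

```python
import math
from statistics import NormalDist


def risk_ratio_ci(events_trt: int, n_trt: int, events_ctl: int, n_ctl: int,
                  level: float = 0.95):
    """Risk ratio with a CI calculated on the log scale (large-sample method)."""
    rr = (events_trt / n_trt) / (events_ctl / n_ctl)
    se_log_rr = math.sqrt(1 / events_trt - 1 / n_trt + 1 / events_ctl - 1 / n_ctl)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical large trial: 50/1000 vs 80/1000 events
rr_big, lo_big, hi_big = risk_ratio_ci(50, 1000, 80, 1000)
# Hypothetical small trial with the same event rates: 5/100 vs 8/100
rr_small, lo_small, hi_small = risk_ratio_ci(5, 100, 8, 100)

for label, rr, lo, hi in (("Large", rr_big, lo_big, hi_big),
                          ("Small", rr_small, lo_small, hi_small)):
    crosses = lo <= 1.0 <= hi
    print(f"{label} trial: RR {rr:.2f} (95% CI {lo:.2f} to {hi:.2f}), "
          f"crosses 1.0: {crosses}")
```

Both trials have the same point estimate (RR 0.625). The large trial's interval excludes 1.0, whereas the small trial's interval is wide and crosses 1.0: it cannot exclude either a large benefit or harm, which is the pattern expected from an underpowered study.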

Frequentist vs Bayesian approaches

Frequentist statistics (the traditional approach) tests hypotheses using p-values and assumes a fixed but unknown true parameter. It asks: "What is the probability of these data, given the null hypothesis?" It does not incorporate prior knowledge.

Bayesian statistics starts with a prior probability (based on existing evidence or clinical judgement) and updates it with new data to produce a posterior probability. It asks: "What is the probability of the hypothesis, given these data?" — which is arguably the more clinically useful question. In clinical practice, Bayesian reasoning is what clinicians do naturally at the bedside: starting with a pre-test probability and updating it with test results (likelihood ratios). Bayesian approaches are increasingly used in adaptive trial designs, where interim analyses can lead to early stopping for efficacy or futility, and in clinical decision-making tools.

Bias and Confounding

Bias is a systematic error that leads to an incorrect estimate of the association between exposure and outcome. It is distinct from random error (imprecision), which is addressed by increasing sample size.

Type of bias	Description	How to minimise
Selection bias	Systematic differences in who enters the study or how they are allocated. Includes Berkson's bias (hospital-based studies overrepresent comorbidity).	Randomisation, consecutive patient enrolment, clear inclusion/exclusion criteria.
Information (measurement) bias	Systematic errors in how outcomes or exposures are measured.	Blinding of outcome assessors, standardised measurement tools, objective outcomes.
Recall bias	Participants with the outcome recall exposures differently from those without (particularly in case-control studies).	Use prospective designs, objective exposure records, standardised questionnaires.
Lead-time bias	Screening appears to improve survival by detecting disease earlier, even if the patient dies at the same age. The apparent survival benefit is an artefact of earlier diagnosis.	Use mortality rate (not survival time) as the outcome in screening studies.
Length-time bias	Screening preferentially detects slow-growing (less aggressive) disease, making screened populations appear to have better outcomes.	Use randomised screening trials rather than observational comparisons.
Publication bias	Studies with positive or statistically significant results are more likely to be published. Meta-analyses may overestimate treatment effects as a result.	Funnel plots to detect asymmetry, search for grey literature and trial registries, Egger's test.
Attrition bias	Systematic differences in dropout rates between groups.	ITT analysis, minimise loss to follow-up, report and compare dropouts between groups.
Observer (detection) bias	Knowledge of group allocation influences how outcomes are assessed.	Blinding of outcome assessors (triple-blinding).

Confounding occurs when a third variable is associated with both the exposure and the outcome (and does not lie on the causal pathway between them), distorting the apparent relationship. Unlike most biases, confounding by a measured variable can be addressed in the analysis phase as well as the design phase. Design-phase methods: randomisation (distributes both known and unknown confounders), restriction (limit the study to one level of the confounder), and matching (pair cases and controls on the confounder). Analysis-phase methods: stratification (analyse separately for each level of the confounder) and multivariable regression (statistically adjust for the confounder). Effect modification (interaction) is different from confounding: it occurs when the effect of the exposure genuinely differs across levels of a third variable. This is a real biological phenomenon, not a bias, and should be reported rather than adjusted away.

Critical Appraisal and Reporting

Reporting guidelines

Guideline	Study type	Key checklist items
CONSORT	Randomised controlled trials	Flow diagram, sample size calculation, randomisation method, blinding, ITT analysis, primary/secondary outcomes.
STROBE	Observational studies (cohort, case-control, cross-sectional)	Eligibility criteria, sources of data, handling of missing data, confounders addressed, effect measures with CIs.
PRISMA	Systematic reviews and meta-analyses	Search strategy, inclusion/exclusion criteria, risk of bias assessment, forest plot, heterogeneity assessment.
STARD	Diagnostic accuracy studies	Reference standard, blinding of assessors, patient flow diagram, sensitivity/specificity with CIs.
CARE	Case reports	Patient information, clinical findings, timeline, diagnostic assessment, interventions, outcome.
TRIPOD	Prediction model studies	Model development, validation, calibration, discrimination (c-statistic).

Survival analysis

Kaplan-Meier curves display the probability of surviving (or remaining event-free) over time. Each downward step represents one or more events. Censored observations (patients lost to follow-up, or still event-free when the study ends) are usually marked on the curve with tick marks. The log-rank test compares survival curves between groups (null hypothesis: no difference in survival). Cox proportional hazards regression produces hazard ratios (HRs) that quantify the relative rate of an event occurring in one group vs another, adjusted for covariates. HR < 1 favours the treatment group (lower event rate) when the event is an adverse outcome. The key assumption is proportional hazards: the hazard ratio is constant over time. Survival curves that cross suggest this assumption is violated.
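To show how censoring enters a Kaplan-Meier curve, the sketch below implements the product-limit estimate for a small hypothetical cohort of eight patients. The follow-up times and event indicators are invented, and the function is a teaching illustration rather than a replacement for a statistical package.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times  : follow-up time for each patient
    events : 1 if the event occurred at that time, 0 if censored
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        leaving = sum(1 for ti in times if ti == t)  # events plus censored at t
        if deaths:
            # Multiply by the conditional probability of surviving past t
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # censored patients leave the risk set without a step
    return curve


# Hypothetical cohort: follow-up in months, 0 = censored
times = [2, 3, 3, 5, 8, 8, 9, 12]
events = [1, 1, 0, 1, 0, 1, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"Month {t}: S(t) = {s:.3f}")
```

Censored patients do not produce a step in the curve, but they do shrink the number at risk, so each later event causes a larger drop. This is why the tail of a Kaplan-Meier curve, where few patients remain, is the least reliable part.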

Meta-analysis and forest plots

A forest plot displays the results of individual studies alongside the pooled estimate. Each study is represented by a square (area proportional to weight) with a horizontal line (confidence interval). The diamond at the bottom represents the pooled estimate — its width shows the pooled CI. The vertical line at RR = 1 (or difference = 0) represents no effect. If the diamond does not cross this line, the pooled result is statistically significant.

Heterogeneity quantifies variability between study results beyond what would be expected by chance alone. The I² statistic is the most commonly reported measure: I² < 25% = low heterogeneity, 25–75% = moderate, > 75% = high. High heterogeneity suggests the studies may not be measuring the same underlying effect — in this situation, a pooled estimate may be misleading and a random-effects model (which accounts for between-study variability) is preferred over a fixed-effects model. Sources of heterogeneity can be explored using subgroup analysis or meta-regression.
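For readers who want to see where I² comes from, it is derived from Cochran's Q: I² = (Q – df) / Q × 100%, floored at zero, where df is the number of studies minus one. The sketch below assumes four hypothetical studies reporting log risk ratios with known variances and uses inverse-variance (fixed-effect) weights to calculate Q; the data and function name are illustrative.

```python
def heterogeneity(effects, variances):
    """Return (pooled fixed-effect estimate, Cochran's Q, I-squared in %)."""
    weights = [1 / v for v in variances]  # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, q, i_squared


# Hypothetical log risk ratios and their variances from four studies
log_rr = [-0.5, -0.1, -0.8, 0.1]
variances = [0.04, 0.02, 0.05, 0.03]
pooled, q, i2 = heterogeneity(log_rr, variances)
print(f"Pooled log RR: {pooled:.3f}, Q = {q:.2f}, I-squared = {i2:.0f}%")

# Identical study results give no heterogeneity
_, _, i2_none = heterogeneity([-0.3, -0.3, -0.3], [0.04, 0.02, 0.05])
```

These invented studies disagree substantially, so I² falls in the high range, whereas identical study results give an I² of zero. I² describes the proportion of variability attributable to between-study differences rather than chance; it does not indicate the direction or clinical importance of those differences.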

Subgroup analysis and composite outcomes

Subgroup analyses examine whether the treatment effect differs across predefined patient subgroups. They are prone to false positives due to multiple comparisons and should be interpreted cautiously. Criteria for credibility of a subgroup effect: the subgroup was pre-specified, a small number of subgroups were tested, a formal test of interaction is statistically significant (not just the within-subgroup p-value), and the finding is biologically plausible and consistent with other evidence.
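The multiple-comparisons problem can be quantified with a one-line formula: if k independent tests are each performed at significance level α and there is truly no effect in any subgroup, the chance of at least one false positive is 1 – (1 – α)^k. The sketch below assumes independent tests, which is a simplification because real subgroups often overlap.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 10, 20):
    rate = familywise_error_rate(n_tests)
    print(f"{n_tests:>2} subgroup tests: {rate:.0%} chance of at least one false positive")
```

With 10 subgroup tests the chance of at least one spurious "significant" finding is about 40%, and with 20 tests it is about 64%. This is why a single significant subgroup among many carries little weight without a significant test of interaction.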

Composite outcomes combine multiple endpoints into a single measure (e.g. MACE = major adverse cardiac events, variously defined but often combining death, MI, stroke, and revascularisation). They increase statistical power by increasing the number of events, but can be misleading if the treatment effect is driven by the least clinically important component (e.g. a reduction in revascularisation driving a significant MACE result, with no effect on death or MI). Always examine the individual components of a composite outcome.

EBM in Emergency Medicine Practice

Evidence-based medicine integrates the best available evidence with clinical expertise and patient values. The EBM process follows five steps: (1) formulate a clinical question using the PICO framework (Population, Intervention, Comparator, Outcome), (2) search for the best evidence (PubMed, Cochrane Library, pre-appraised resources), (3) critically appraise the evidence (validity, results, applicability), (4) apply the evidence to the individual patient (considering patient values, local resources, clinical context), and (5) evaluate the outcome.

Understanding the distinction between clinical significance and statistical significance is crucial and is a common exam theme. A statistically significant result (p < 0.05) may have minimal clinical impact if the effect size is small relative to a clinically meaningful threshold. Conversely, a study may fail to reach statistical significance due to insufficient power, while the point estimate and confidence interval suggest a potentially important effect. In this situation, the correct interpretation is "insufficient evidence to conclude there is a difference" — not "there is no difference." Absence of evidence is not evidence of absence.

Study strategy: Statistics is best learned by practising with MCQs. For each concept, understand the definition, know when it applies, and be able to interpret the result in clinical terms. Construct 2×2 tables from clinical scenarios. Work through NNT calculations until they are second nature. Read the methods and results sections of major EM trials (e.g. ARISE, PARAMEDIC2, CRASH-2) and practise identifying the study design, primary outcome, statistical tests used, and whether the conclusions are supported by the data.

References

Cameron P, Little M, Mitra B, Deasy C (eds). Textbook of Adult Emergency Medicine. 6th ed. Elsevier; 2026. Chapter 22 (Academic Emergency Medicine).
Sackett DL, Straus SE, Richardson WS, Rosenberg W, Haynes RB. Evidence-Based Medicine: How to Practice and Teach It. 2nd ed. Churchill Livingstone; 2000.
Greenhalgh T. How to Read a Paper: The Basics of Evidence-Based Medicine and Healthcare. 6th ed. Wiley-Blackwell; 2019.
Jaeschke R, Guyatt GH, Sackett DL. Users' guides to the medical literature. III. How to use an article about a diagnostic test. A. Are the results of the study valid? JAMA. 1994;271(5):389–391.
Peng J, Deenadayalan Y, Garg R. Likelihood ratios for the emergency physician. Academic Emergency Medicine. 2018;25(5):596–598.
Moher D, Liberati A, Tetzlaff J, Altman DG; PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. BMJ. 2009;339:b2535.
Schulz KF, Altman DG, Moher D; CONSORT Group. CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials. BMJ. 2010;340:c332.
Bossuyt PM, Reitsma JB, Bruns DE, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015;351:h5527.
von Elm E, Altman DG, Egger M, et al. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement. Lancet. 2007;370(9596):1453–1457.
Higgins JPT, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ. 2003;327(7414):557–560.
The NNT Group. Quick summaries of evidence-based medicine. Available at: thennt.com.
Fagan TJ. Nomogram for Bayes theorem. N Engl J Med. 1975;293(5):257.
Sun X, Briel M, Walter SD, Guyatt GH. Is a subgroup effect believable? Updating criteria to evaluate the credibility of subgroup analyses. BMJ. 2010;340:c117.