Part 2 (Bayesian Sample Efficiency) introduced the idea that Bayesian methods support adaptive designs and early stopping. This article formalizes that idea. In practice, experimenters do not wait passively for a fixed sample size: they monitor dashboards, check interim results, and face pressure to call experiments early. The question is how to do this without destroying the validity of the conclusion.
Sequential testing provides the answer. Instead of committing to a single analysis at a predetermined endpoint, sequential methods define rules for evaluating evidence at multiple checkpoints throughout the experiment. Done correctly, these rules control error rates (frequentist) or provide calibrated posterior summaries (Bayesian) at every look.
The Peeking Problem
In a standard frequentist A/B test, the significance level alpha = 0.05 controls the false positive rate under a specific contract: you analyze the data exactly once, at the pre-specified sample size. If you check the p-value after every batch of users and stop as soon as p < 0.05, the true false positive rate inflates well beyond 5%. With continuous monitoring, it can exceed 25%.
The mechanism is straightforward. Under the null hypothesis, the test statistic follows a random walk. Given enough looks, even a random walk will cross any fixed threshold. Each peek is an additional opportunity for a false alarm. The more you peek, the more likely you are to see a "significant" result that is pure noise.
False positive rate inflation from continuous peeking compared to fixed-horizon analysis
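The inflation is easy to reproduce. The sketch below (a minimal Monte Carlo, assuming a normal test statistic with known variance; `peeking_fpr` is an illustrative helper, not the plotted figure's source) tests after every batch under the null and records how often any look rejects:

```python
import math
import random

random.seed(0)

def peeking_fpr(n_looks, batch, trials=2000, z_crit=1.96):
    """Estimate the false positive rate when testing after every batch under H0."""
    false_positives = 0
    for _ in range(trials):
        total, running_sum = 0, 0.0
        for _ in range(n_looks):
            for _ in range(batch):
                running_sum += random.gauss(0, 1)  # H0: zero-mean observations
                total += 1
            z = running_sum / math.sqrt(total)  # z-statistic, known unit variance
            if abs(z) > z_crit:
                false_positives += 1  # stop at the first "significant" look
                break
    return false_positives / trials

print(f'1 look of 500:  FPR ~ {peeking_fpr(1, 500):.3f}')   # near the nominal 0.05
print(f'20 looks of 25: FPR ~ {peeking_fpr(20, 25):.3f}')   # well above 0.05
```

Holding the total sample at 500 and varying only the number of looks isolates the effect of peeking itself.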
Group Sequential Designs
Group sequential designs are the frequentist solution to the peeking problem. They pre-specify a set of interim analyses (e.g., at 25%, 50%, 75%, and 100% of the target sample) and adjust the significance threshold at each look so that the overall Type I error rate stays at alpha.
Alpha Spending Functions
An alpha spending function alpha(t) maps the fraction of information collected, t, to the cumulative Type I error "spent" by that point. Two common choices:
O'Brien-Fleming: spends almost no alpha early and most at the final analysis. The early boundaries are stringent, making it hard to stop early unless the effect is large. The final-analysis threshold is close to the standard alpha = 0.05.
Pocock: spends alpha roughly uniformly across looks, so the boundaries are approximately equal at each interim analysis. Early stopping is easier, but the final threshold is noticeably more conservative than the fixed-horizon alpha = 0.05.
The Lan-DeMets approach generalizes this: you specify the spending function up front, but the timing and number of looks can be chosen adaptively.
O'Brien-Fleming and Pocock alpha spending functions with boundary annotations
import math
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
FT_BG = '#FFF1E5'
FT_CLARET = '#990F3D'
FT_OXFORD = '#0F5499'
FT_TEAL = '#0D7680'
FT_MANDARIN = '#FF8833'
plt.rcParams.update({
'figure.facecolor': FT_BG,
'axes.facecolor': FT_BG,
'savefig.facecolor': FT_BG,
'font.family': 'sans-serif',
'font.sans-serif': ['Helvetica Neue', 'Arial', 'sans-serif'],
'axes.spines.top': False,
'axes.spines.right': False,
})
def normal_cdf(z):
    # Abramowitz & Stegun 26.2.17 polynomial approximation to the standard normal CDF
    if z < 0:
        return 1 - normal_cdf(-z)
    t = 1 / (1 + 0.2316419 * z)
    poly = t * (0.319381530 + t * (-0.356563782 + t * (
        1.781477937 + t * (-1.821255978 + t * 1.330274429))))
    return 1 - (1 / math.sqrt(2 * math.pi)) * math.exp(-0.5 * z * z) * poly
def normal_ppf(p):
    # Abramowitz & Stegun 26.2.23 rational approximation for the inverse normal CDF
    if p == 0.5:
        return 0.0
    if p < 0.5:
        return -normal_ppf(1 - p)
    t = math.sqrt(-2 * math.log(1 - p))
    c0, c1, c2 = 2.515517, 0.802853, 0.010328
    d1, d2, d3 = 1.432788, 0.189269, 0.001308
    return t - (c0 + c1 * t + c2 * t * t) / (1 + d1 * t + d2 * t * t + d3 * t * t * t)
alpha = 0.05
z_alpha2 = normal_ppf(1 - alpha / 2)
# Spending functions
ts = [i / 500 for i in range(1, 501)]
def obf_spending(t):
return 2 - 2 * normal_cdf(z_alpha2 / math.sqrt(t))
def pocock_spending(t):
return alpha * math.log(1 + (math.e - 1) * t)
obf = [obf_spending(t) for t in ts]
poc = [pocock_spending(t) for t in ts]
# Nominal boundary z-values at 4 interim analyses, derived from the incremental
# alpha spent at each look (ignores correlation between looks; illustration only)
interims = [0.25, 0.50, 0.75, 1.0]
obf_increments = []
poc_increments = []
prev_obf = 0
prev_poc = 0
for t in interims:
curr_obf = obf_spending(t)
curr_poc = pocock_spending(t)
obf_increments.append(curr_obf - prev_obf)
poc_increments.append(curr_poc - prev_poc)
prev_obf = curr_obf
prev_poc = curr_poc
obf_z = [normal_ppf(1 - inc / 2) if inc > 0 else float('inf') for inc in obf_increments]
poc_z = [normal_ppf(1 - inc / 2) if inc > 0 else float('inf') for inc in poc_increments]
fig, ax = plt.subplots(figsize=(9, 5.5))
ax.plot(ts, obf, color=FT_OXFORD, linewidth=2.5, label="O'Brien-Fleming", alpha=0.85)
ax.plot(ts, poc, color=FT_CLARET, linewidth=2.5, label='Pocock', alpha=0.85)
ax.axhline(y=alpha, color=FT_MANDARIN, linewidth=1.2, linestyle='--')
ax.text(0.03, alpha + 0.002, f'\u03b1 = {alpha}', fontsize=9, color=FT_MANDARIN)
# Mark interim analyses
for i, t in enumerate(interims):
ax.axvline(x=t, color='#dddddd', linewidth=0.8, zorder=1)
obf_val = obf_spending(t)
poc_val = pocock_spending(t)
ax.plot(t, obf_val, 'o', color=FT_OXFORD, markersize=6, zorder=4)
ax.plot(t, poc_val, 'o', color=FT_CLARET, markersize=6, zorder=4)
ax.text(t + 0.02, obf_val, f'z={obf_z[i]:.2f}', fontsize=7.5, color=FT_OXFORD, va='bottom')
ax.text(t + 0.02, poc_val + 0.001, f'z={poc_z[i]:.2f}', fontsize=7.5, color=FT_CLARET, va='bottom')
ax.set_xlim(0, 1.08)
ax.set_ylim(0, alpha * 1.3)
ax.set_xlabel('Information fraction', fontsize=11, color='#333333')
ax.set_ylabel('Cumulative \u03b1 spent', fontsize=11, color='#333333')
ax.legend(fontsize=10, framealpha=0.9, loc='upper left')
fig.text(0.5, 0.97, "Alpha Spending Functions: O'Brien-Fleming vs Pocock",
ha='center', fontsize=14, fontweight='bold', color='#333333')
fig.text(0.5, 0.935, 'Cumulative \u03b1 spent vs information fraction; boundary z-values at 4 interim analyses',
ha='center', fontsize=10, color='#666666')
fig.text(0.02, 0.01, 'Source: Philip Jama via pjama.github.io',
fontsize=8, color='#999999', ha='left')
fig.tight_layout(rect=[0, 0.03, 1, 0.92])
fig.savefig('spending_functions.png', dpi=150, bbox_inches='tight')
print('wrote spending_functions.png')
Bayesian Sequential Monitoring
The Bayesian approach to sequential analysis avoids the peeking problem by construction. The posterior is always valid: it represents a coherent summary of the evidence seen so far, regardless of how many times you look at it. There is no need to adjust for multiple looks because the posterior does not make a promise about long-run error rates that peeking could violate.
The standard Bayesian monitoring procedure tracks the posterior probability that treatment beats control (P(p_T > p_C | data)) at each interim analysis. Decisions follow threshold rules:
Stop for efficacy: if P(p_T > p_C | data) > theta_upper (e.g., 0.99), declare the treatment a winner.
Stop for futility: if P(p_T > p_C | data) < theta_lower (e.g., 0.01), declare no meaningful effect.
Continue: otherwise, collect more data.
The thresholds theta_upper and theta_lower are design parameters. More aggressive thresholds (0.95/0.05) stop experiments sooner but increase the chance of incorrect decisions. Conservative thresholds (0.99/0.01) require more data but provide stronger evidence.
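As a concrete sketch of this rule (assuming Beta(1, 1) priors on two conversion rates; `prob_treatment_beats_control` and `decide` are illustrative names, not a library API):

```python
import random

random.seed(1)

def prob_treatment_beats_control(s_t, n_t, s_c, n_c, draws=20000):
    """Monte Carlo estimate of P(p_T > p_C | data) under Beta(1, 1) priors."""
    wins = 0
    for _ in range(draws):
        p_t = random.betavariate(1 + s_t, 1 + n_t - s_t)  # posterior draw, treatment
        p_c = random.betavariate(1 + s_c, 1 + n_c - s_c)  # posterior draw, control
        if p_t > p_c:
            wins += 1
    return wins / draws

def decide(prob, theta_upper=0.99, theta_lower=0.01):
    """Threshold rule: stop for efficacy, stop for futility, or keep collecting."""
    if prob > theta_upper:
        return 'stop: efficacy'
    if prob < theta_lower:
        return 'stop: futility'
    return 'continue'

# Hypothetical interim data: 250/2000 treatment conversions vs 180/2000 control
prob = prob_treatment_beats_control(250, 2000, 180, 2000)
print(f'P(p_T > p_C | data) = {prob:.4f} -> {decide(prob)}')
```

The same function evaluated at each interim look, with fixed thresholds, is the whole monitoring procedure.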
Posterior probability of treatment superiority over time with stopping boundaries
The two frameworks make different kinds of promise:
Group sequential (frequentist): controls the probability of a false positive across all possible stopping points. The guarantee is about long-run error rates under repeated use.
Bayesian sequential: reports the probability of a correct decision given the data actually observed. The guarantee is about the coherence of the current inference.
In practice, the operating characteristics (how often each method stops early, expected sample size, error rates) are often similar when calibrated to comparable decision thresholds. The philosophical difference matters most when communicating results: "we stopped because the posterior probability exceeded 0.99" is more intuitive to most stakeholders than "we stopped because the z-statistic crossed the O'Brien-Fleming boundary."
Stopping for Futility
Most discussions of early stopping focus on efficacy: detecting a winner quickly. Futility stopping is equally valuable. If the treatment effect is negligibly small, continuing the experiment wastes traffic and delays the next test.
A Bayesian futility rule stops when the posterior probability that the treatment effect exceeds a minimum detectable effect (MDE) is below some threshold. For example: stop if P(delta > MDE | data) < 0.05, where delta = p_T - p_C. This is more useful than simply checking whether the posterior favors treatment, because a tiny positive effect that will never be practically meaningful is not worth chasing.
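A sketch of that futility check, using Beta(1, 1) posterior draws and invented interim numbers (`prob_effect_exceeds_mde` is a hypothetical helper):

```python
import random

random.seed(2)

def prob_effect_exceeds_mde(s_t, n_t, s_c, n_c, mde, draws=20000):
    """Monte Carlo estimate of P(p_T - p_C > MDE | data) under Beta(1, 1) priors."""
    exceed = 0
    for _ in range(draws):
        delta = (random.betavariate(1 + s_t, 1 + n_t - s_t)
                 - random.betavariate(1 + s_c, 1 + n_c - s_c))
        if delta > mde:
            exceed += 1
    return exceed / draws

# Tiny positive effect: 9.2% vs 9.0% conversion after 10,000 users per arm
p_win = prob_effect_exceeds_mde(920, 10000, 900, 10000, mde=0.0)
p_mde = prob_effect_exceeds_mde(920, 10000, 900, 10000, mde=0.02)
print(f'P(delta > 0)   = {p_win:.3f}')  # posterior mildly favors treatment...
print(f'P(delta > MDE) = {p_mde:.3f}')  # ...but a 2-point lift is all but ruled out
```

With P(delta > MDE) below 0.05, the rule stops for futility even though the posterior leans positive.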
Stopping Boundaries
The following simulation visualizes both efficacy and futility boundaries on a single plot. The test statistic (or posterior summary) traces a path through the monitoring region. When it crosses an upper boundary, the experiment stops for efficacy. When it crosses a lower boundary, it stops for futility. The region between the boundaries is the continuation zone.
Efficacy and futility stopping boundaries with a sample test statistic path
import math
import random
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
random.seed(3)
FT_BG = '#FFF1E5'
FT_CLARET = '#990F3D'
FT_OXFORD = '#0F5499'
FT_TEAL = '#0D7680'
FT_MANDARIN = '#FF8833'
plt.rcParams.update({
'figure.facecolor': FT_BG,
'axes.facecolor': FT_BG,
'savefig.facecolor': FT_BG,
'font.family': 'sans-serif',
'font.sans-serif': ['Helvetica Neue', 'Arial', 'sans-serif'],
'axes.spines.top': False,
'axes.spines.right': False,
})
def normal_ppf(p):
    # Abramowitz & Stegun 26.2.23 rational approximation for the inverse normal CDF
    if p == 0.5:
        return 0.0
    if p < 0.5:
        return -normal_ppf(1 - p)
    t = math.sqrt(-2 * math.log(1 - p))
    c0, c1, c2 = 2.515517, 0.802853, 0.010328
    d1, d2, d3 = 1.432788, 0.189269, 0.001308
    return t - (c0 + c1 * t + c2 * t * t) / (1 + d1 * t + d2 * t * t + d3 * t * t * t)
K = 5
interims = list(range(1, K + 1))
info_fracs = [k / K for k in interims]
z_alpha2 = normal_ppf(0.975)
# O'Brien-Fleming-style efficacy boundary
upper = [z_alpha2 / math.sqrt(t) for t in info_fracs]
# Futility boundary (beta-spending style)
z_beta = normal_ppf(0.80)
lower = [-z_beta * math.sqrt(t) for t in info_fracs]
# Simulate z-statistic path under H1 with drift
drift = 2.2
z_path = []
for k in range(1, K + 1):
t = k / K
z = drift * math.sqrt(t) + random.gauss(0, 0.35)
z_path.append(z)
# Find crossing point
cross_interim = None
for i in range(K):
if z_path[i] >= upper[i]:
cross_interim = i
break
if z_path[i] <= lower[i]:
cross_interim = i
break
fig, ax = plt.subplots(figsize=(10, 5.5))
# Shade continuation region
ax.fill_between(interims, lower, upper, color=FT_TEAL, alpha=0.07, zorder=1)
# Boundaries
ax.plot(interims, upper, color=FT_CLARET, linewidth=2.5, marker='s', markersize=7,
label='Efficacy boundary', zorder=3)
ax.plot(interims, lower, color=FT_OXFORD, linewidth=2.5, marker='s', markersize=7,
label='Futility boundary', zorder=3)
# Sample path
ax.plot(interims, z_path, color=FT_MANDARIN, linewidth=2.5, marker='o', markersize=8,
label='Test statistic path', zorder=4)
ax.axhline(y=0, color='#cccccc', linewidth=0.8)
# Annotate crossing
if cross_interim is not None:
    cx = interims[cross_interim]
    cy = z_path[cross_interim]
    # Label whichever boundary was actually hit, not efficacy unconditionally
    hit = 'efficacy' if cy >= upper[cross_interim] else 'futility'
    ax.annotate(f'Crosses {hit}\nboundary', xy=(cx, cy),
                xytext=(cx + 0.6, cy - 0.8),
                fontsize=10, color=FT_CLARET, fontweight='bold',
                arrowprops=dict(arrowstyle='->', color=FT_CLARET, lw=1.5),
                ha='left')
# Label boundary values
for i in range(K):
ax.text(interims[i] + 0.1, upper[i] + 0.1, f'{upper[i]:.2f}',
fontsize=8, color=FT_CLARET)
ax.text(interims[i] + 0.1, lower[i] - 0.25, f'{lower[i]:.2f}',
fontsize=8, color=FT_OXFORD)
ax.set_xlim(0.5, K + 0.8)
ax.set_xticks(interims)
ax.set_xticklabels([f'Interim {k}' for k in interims], fontsize=9)
ax.set_xlabel('Analysis', fontsize=11, color='#333333')
ax.set_ylabel('z-statistic', fontsize=11, color='#333333')
ax.legend(fontsize=9, framealpha=0.9, loc='lower right')
# Add continuation region label
mid_interim = interims[K // 2]
mid_y = (upper[K // 2] + lower[K // 2]) / 2
ax.text(mid_interim, mid_y, 'Continue', fontsize=10, color=FT_TEAL,
ha='center', va='center', alpha=0.6, fontstyle='italic')
fig.text(0.5, 0.97, 'Group Sequential Stopping Boundaries',
ha='center', fontsize=14, fontweight='bold', color='#333333')
fig.text(0.5, 0.935, "O'Brien-Fleming efficacy (upper) and futility (lower); sample z-statistic path under H\u2081",
ha='center', fontsize=10, color='#666666')
fig.text(0.02, 0.01, 'Source: Philip Jama via pjama.github.io',
fontsize=8, color='#999999', ha='left')
fig.tight_layout(rect=[0, 0.03, 1, 0.92])
fig.savefig('stopping_boundaries.png', dpi=150, bbox_inches='tight')
print('wrote stopping_boundaries.png')
Practical Guidance
Sequential testing in production requires discipline:
Pre-register the monitoring schedule. Decide how many interim looks, at what information fractions, and with what thresholds before the experiment starts. Ad hoc decisions after seeing data undermine the guarantees.
Automate the monitoring. Manual dashboard checks invite unplanned peeks. Build the sequential analysis into the experiment platform so that boundary crossings trigger alerts rather than relying on human judgment about when to look.
Account for delayed outcomes. If the primary metric takes days to mature (e.g., 7-day retention), interim analyses based on incomplete outcome data will be biased toward zero. Use the outcome window appropriate for the metric, not the most recent data.
Report the stopping rule with the result. A sequential experiment that stopped early at interim look 3 of 5 carries different interpretive weight than a fixed-horizon analysis. Transparency about the design supports reproducibility.
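The first two points can be made concrete as a pre-registered plan object checked by the platform rather than by eye (a sketch with hypothetical names; `MonitoringPlan` and `check` are not from any library):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MonitoringPlan:
    """Pre-registered sequential design, fixed before the experiment launches."""
    info_fractions: tuple  # planned looks, as fractions of the target sample
    theta_upper: float     # posterior threshold for an efficacy stop
    theta_lower: float     # posterior threshold for a futility stop

def check(plan, look_index, posterior_prob):
    """Automated boundary check, run by the platform at each planned look."""
    if not 0 <= look_index < len(plan.info_fractions):
        raise ValueError('unplanned interim look')
    if posterior_prob > plan.theta_upper:
        return 'stop: efficacy'
    if posterior_prob < plan.theta_lower:
        return 'stop: futility'
    return 'continue'

plan = MonitoringPlan(info_fractions=(0.25, 0.5, 0.75, 1.0),
                      theta_upper=0.99, theta_lower=0.01)
print(check(plan, 1, 0.97))  # inside the continuation zone, so keep collecting
```

The frozen dataclass makes the schedule immutable after launch, and rejecting unplanned look indices enforces the pre-registered schedule in code.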
Connection to the Series
Sequential testing completes a loop that began in Part 1 (Online Experiments with a Bayesian Lens). The Beta posteriors from that article update continuously as data arrives. Part 2 (Bayesian Sample Efficiency) showed that this updating can reduce the required sample size. This article provides the formal rules for acting on that updating: when the posterior crosses a boundary, you stop.
Part 4 (Multi-Armed Bandits and Thompson Sampling) took a different approach to the same tension: rather than deciding when to stop, Thompson sampling decides how to allocate at each round. And Part 5 (Causal Inference from Observational Data) addressed what to do when you cannot run the experiment at all. Together, the six articles in this series cover the full lifecycle of evidence-based decisions: from designing experiments and analyzing posteriors, through adaptive allocation and observational inference, to the formal rules for stopping and committing.
Sequential monitoring decides when to stop an experiment; the explore-exploit tradeoff from bandit allocation and the observational methods from causal inference address different gaps in the same decision pipeline.