In continuation of the post on biased estimators, today we ask: what’s the harm in running what some people call one-shot experiments where you have very few items but many participants? In this blogpost I want to attempt to hint at an answer to this question by simulating loads of data and seeing how well the experimental data recover the features of the distributions we sampled from, i.e., their true values. We will talk about all of this against the backdrop of the previous post on biased estimators where we discussed what happens to our standard deviation estimates in the face of low sample sizes.
But Maik, you might say, the low number of items is not a problem if we have many participants, right? If the one-shot proponents are correct, then it should indeed be possible to balance out the lack of items with a comparably large number of participants. This way, the reasoning goes, we will have enough power to detect the effects that we are interested in studying but do not need to spend so much time writing items. After all, writing items is hard and time-consuming, and we generally do not care about the item-level effects at all, right? We just want to know whether there is an effect related to our experimental manipulation. We shall see whether this reasoning holds up to simulation-driven scrutiny.
Before moving on, let me clarify that no matter what the simulations below turn up, one-shot experiments raise a number of other issues. I consider these at least somewhat unrelated to the statistical matters that I want to discuss today. Below are two such issues that I think are important to keep in mind when considering one-shot experiments but which I will not discuss further here:
- Generalizing to the population of sentences is almost guaranteed to be impossible with few items (Clark 1973). If we consider the reverse situation, where we test thousands of items but only a handful of participants, I think many would regard any claims about speakers in general as dubious. The same logic applies here: with few items, generalizing to the population of items is dubious at best. Plus, if we cannot generalize to the population of items, what is the meaning of the fixed-effect estimates that we get from our models?
- Compensation practices on MTurk and other platforms where participants often make less than minimum wage deserve, in my estimation, more skepticism. While this is not uniquely a problem of one-shot experiments, my impression is that it is more likely to occur in such designs.
Simulation setup
Before we set anything up, here is the intuition for why a small number of items should cause trouble:
- Variance components (the SDs of the item intercepts and slopes) are estimated from however many item levels the design provides. With only 6 items, these estimates have huge sampling variability and, as discussed in the previous post, a bias toward smaller values.
- If the model underestimates the item variance, it "explains" less variation through items and therefore attributes more of the observed structure to the fixed effects and/or the residual error. That typically produces too-small standard errors for fixed effects that cross items, and hence inflated test statistics and over-confident confidence intervals.
- Fixed-effect point estimates themselves are often not severely biased by this alone (in linear mixed models, fixed effects are usually consistent), but if the random-effects structure is misspecified (e.g., item random slopes exist but are not estimated, or are estimated badly), the fixed-effect estimates can shift too.
- In the practical extreme, with only 6 items the item random-effect SD may be estimated as near zero (or the model may fail to estimate a slope variance at all), which amounts to treating items as almost identical and gives the illusion of very precise fixed effects that do not actually generalize to the population of items.
We will ignore a question that often lurks in the background of these simulation studies: what kind of experiment are the simulated data supposed to come from? For now, the point of this post is not to argue that all experimental paradigms are subject to the problems we will encounter below. Rather, the point is to show that these problems are possible in the most basic case (an unbounded Gaussian response). By extension, they might plague other paradigms as well, though we will not have time to look into that here. I guess the null hypothesis is that more complex paradigms pattern like the basic data we are about to look at.
The generate_data() function below takes inspiration from this blog post by June Choe.
Click to show/hide the code
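# Note: mvrnorm() comes from MASS; tibble(), %>%, and the dplyr verbs assume the
# tidyverse is loaded (presumably in the post's setup chunk), as is lme4 for the
# model fits further down.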
generate_data <- function(
n_subs = 48,
n_items = 48,
intercept = 200,
slope = 40,
sd_item_intercept = 20,
sd_item_slope = 40,
sd_subj_intercept = 50,
sd_subj_slope = 20,
subj_ranef_cor = 0.8,
item_ranef_cor = 0.8,
noise_sd = 25,
seed = NULL
) {
if (!is.null(seed)) {
set.seed(seed)
}
# subject REs
Sigma_subj <- matrix(
c(
sd_subj_intercept^2,
subj_ranef_cor * sd_subj_intercept * sd_subj_slope,
subj_ranef_cor * sd_subj_intercept * sd_subj_slope,
sd_subj_slope^2
),
nrow = 2,
byrow = TRUE
)
subj_re <- mvrnorm(n = n_subs, mu = c(0, 0), Sigma = Sigma_subj)
subj_df <- tibble(
subject = factor(1:n_subs),
subj_intercept = subj_re[, 1],
subj_slope = subj_re[, 2]
)
# item REs
Sigma_item <- matrix(
c(
sd_item_intercept^2,
item_ranef_cor * sd_item_intercept * sd_item_slope,
item_ranef_cor * sd_item_intercept * sd_item_slope,
sd_item_slope^2
),
nrow = 2,
byrow = TRUE
)
item_re <- mvrnorm(n = n_items, mu = c(0, 0), Sigma = Sigma_item)
item_df <- tibble(
item = factor(1:n_items),
item_intercept = item_re[, 1],
item_slope = item_re[, 2]
)
subj_item <- expand.grid(
subject = factor(1:n_subs),
item = factor(1:n_items)
) %>%
as_tibble()
# balanced condition assignment within each item:
set.seed(ifelse(is.null(seed), sample.int(1e6, 1), seed + 1))
subj_item <- subj_item %>%
group_by(item) %>%
mutate(cond = sample(rep(c("control", "treat"), length.out = n()))) %>%
ungroup() %>%
mutate(
cond = factor(cond, levels = c("control", "treat")),
cond_coded = ifelse(cond == "control", 0, 1)
)
dat <- subj_item %>%
left_join(subj_df, by = "subject") %>%
left_join(item_df, by = "item") %>%
mutate(
intercept = intercept,
slope = slope,
noise = rnorm(n(), mean = 0, sd = noise_sd),
response = intercept +
slope * cond_coded +
subj_intercept +
subj_slope * cond_coded +
item_intercept +
item_slope * cond_coded +
noise
) %>%
select(subject, item, cond, cond_coded, response)
return(dat)
}
sim_dat <- generate_data(seed = 1111)
sim_dat
# A tibble: 2,304 × 5
subject item cond cond_coded response
<fct> <fct> <fct> <dbl> <dbl>
1 1 1 control 0 308.
2 2 1 treat 1 250.
3 3 1 control 0 219.
4 4 1 control 0 179.
5 5 1 control 0 221.
6 6 1 treat 1 565.
7 7 1 control 0 237.
8 8 1 treat 1 337.
9 9 1 treat 1 306.
10 10 1 control 0 144.
# ℹ 2,294 more rows
And here’s that sample data set plotted, with the violins showing the raw sample response distributions:
Click to show/hide the code
sim_dat %>%
ggplot(aes(
x = cond,
y = response,
group = cond,
fill = cond,
color = cond,
shape = cond
)) +
geom_violin(alpha = .3, color = NA) +
stat_summary(fun = mean, geom = "line", group = 1, color = "grey40") +
stat_summary(fun = mean, geom = "point", size = 3) +
stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = .1) +
guides(color = "none", fill = "none", shape = "none", group = "none") +
labs(x = "Condition", y = "Response times ±CI<sub>95%</sub>")
The next step, of course, is to model these data. Given that our focus is on variance components and the effect their misestimation may have (and because this type of model is independently appropriate), we will fit a linear mixed model with by-item and by-participant random intercepts and slopes. You can see this in action below.
Click to show/hide the code
sample_fit <- lmer(
response ~ cond_coded +
(1 + cond_coded | subject) +
(1 + cond_coded | item),
data = sim_dat,
control = lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e5))
)
summary(sample_fit)
Linear mixed model fit by REML ['lmerMod']
Formula: response ~ cond_coded + (1 + cond_coded | subject) + (1 + cond_coded | item)
Data: sim_dat
Control: lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 200000))
REML criterion at convergence: 22024
Scaled residuals:
Min 1Q Median 3Q Max
-4.18642 -0.66646 -0.02485 0.65559 3.31501
Random effects:
Groups Name Variance Std.Dev. Corr
subject (Intercept) 3746.62 61.210
cond_coded 351.44 18.747 0.860
item (Intercept) 511.38 22.614
cond_coded 2213.17 47.044 0.834
Residual 634.56 25.191
Number of obs: 2304, groups: subject, 48; item, 48
Fixed effects:
Estimate Std. Error t value
(Intercept) 197.7093 9.4483 20.9253
cond_coded 54.5447 7.3866 7.3843
Correlation of Fixed Effects:
(Intr)
cond_coded 0.551
That’s a lot of info, but luckily, we only need a few of these numbers to check how close our simulated data are to the simulation parameters. We will walk through them in turn, starting with the fixed effects and then moving on to the variance components.
- Fixed-effect intercept: This should be close to the true value of 200.
- Fixed-effect estimate for cond_coded (the column containing the condition manipulation coded as 0 or 1; in other words, the categorical condition identifier with treatment coding): This should be close to the simulation parameter of 40.
- Standard error for cond_coded: We might expect something like 6.5 (see the excursus below).
- Residual SD: This should be close to the simulation parameter of 25.
- By-item intercept SD: This should be close to the simulation parameter of 20.
- By-item slope SD: This should be close to the simulation parameter of 40.
- By-subject intercept and slope SDs: These should be close to the simulation parameters of 50 and 20, respectively.
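As an aside, rather than reading these off the summary() printout, you can pull the same numbers out programmatically with the standard lme4 accessors (a small convenience snippet, not strictly needed for what follows):
fixef(sample_fit)             # fixed effects; true values: 200 and 40
sqrt(diag(vcov(sample_fit)))  # standard errors of the fixed effects
VarCorr(sample_fit)           # random-effect SDs and correlations; true SDs: 20/40 (item), 50/20 (subject)
sigma(sample_fit)             # residual SD; true value: 25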
As you can see, we are not exactly right on all of these counts. For now, this does not matter too much: to see how well we recover the true parameters on average, we will be running many simulations below. Up to this point, the only goal was to illustrate the data generation and modeling process.
Before we move on to the large-scale simulations, however, let us take a quick detour through a rough approximation of the expected standard error for the fixed effect of cond_coded, since having this expectation in hand will help us interpret the simulation results later on.
Excursus: Computing the expected standard error
To compute the expected standard error for the fixed effect of cond_coded, we need to consider the variance components that contribute to it. Here are the relevant points to keep in mind:
- By-subject and by-item random intercepts drop out of the standard error calculation for the fixed effect of cond_coded because they do not covary with condition.
- For the purposes of this calculation, we can also leave out the correlations between intercepts and slopes.
- By-subject and by-item random slopes do contribute to the standard error because they do covary with condition.
- Residual variance also contributes to the standard error.
Combining these three sources of variance, we can compute an approximate expected standard error for the fixed effect of cond_coded as follows:
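In symbols, with \(n_{\text{subj}}\) participants, \(n_{\text{item}}\) items, and a balanced 0/1 condition split, the approximation implemented by the function below is

\[
\mathrm{SE}(\hat{\beta}_{\text{cond}}) \approx \sqrt{\frac{\sigma^2_{\text{subj slope}}}{n_{\text{subj}}} + \frac{\sigma^2_{\text{item slope}}}{n_{\text{item}}} + \frac{4\,\sigma^2_{\text{residual}}}{n_{\text{subj}}\, n_{\text{item}}}}.
\]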
Click to show/hide the code
approximate_standard_error <- function(
n_subs,
n_items,
sd_subj_slope,
sd_item_slope,
noise_sd
) {
var_est <-
(sd_subj_slope^2) /
n_subs +
(sd_item_slope^2) / n_items +
(4 * noise_sd^2) / (n_subs * n_items) # normally: noise_sd^2 * (1 / n_treat + 1 / n_control)
# but because of balance n_treat = n_control we can simplify
se <- sqrt(var_est)
return(se)
}
approximate_standard_error(
n_subs = 48,
n_items = 48,
sd_subj_slope = 20,
sd_item_slope = 40,
noise_sd = 25
)
[1] 6.5384812
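This value of roughly 6.5 applies to the 48-item design from above. Since the simulations below will use fewer items, it is worth evaluating the same approximation at those item counts too (same parameters, only n_items varies):
sapply(c(4, 6, 12, 24), function(ni) {
  approximate_standard_error(
    n_subs = 48,
    n_items = ni,
    sd_subj_slope = 20,
    sd_item_slope = 40,
    noise_sd = 25
  )
})
# roughly 20.5, 16.8, 12.1, and 8.8
We will come back to these numbers when interpreting the simulation results.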
With this done, we are now in a position to move on to our simulations.
Simulation at scale
Now it's time to scale the operation up and move from the single illustration sample above to full sets of simulations, so that we can check how reliable our statistical methods are in the face of low item numbers.
The function below essentially runs the data generation and model fitting steps from above multiple times for different numbers of items. It then collects the relevant metrics from each simulation run and summarizes them across runs to give us an idea of how well we are recovering the true parameters on average. We do this to minimize the impact of random noise in any single simulation step. As we saw above, single runs can be a little bit off the true parameters just by chance, so repeating the process many times and calculating averages should lead to a more reliable picture overall.
As before, we will mostly focus on the fixed effect for the condition manipulation, i.e., cond_coded. Some metrics that we will collect are:
- Mean estimate for cond_coded: This should be close to the true value of 40.
- Mean model standard error for cond_coded: This should be close to the expected standard error computed above for the relevant number of items.
- Coverage of 95% CIs for cond_coded: This checks whether the true value falls within the 95% confidence interval that the model estimates. It should be close to 0.95.
- Proportion of significant \(p\)-values for cond_coded: Since we know that there is an effect in the population, this estimates the power of the design, i.e., the probability of detecting a true effect (\(1 - \beta\)). Unlike coverage, there is no single target value here: how much power we can reasonably expect depends on the effect size, the variance components, and the numbers of participants and items (see the rough calculation after this list).
The final two metrics are direct measures of how well our inferential procedure is performing and thus give us a sense of how reliable our statistical tests are in the face of low item numbers in one-shot designs. If coverage is far from its nominal level, or power is far lower than the design should afford, we have good reason to be worried about the reliability of our inferences.
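To calibrate expectations for that last metric, we can approximate power analytically. The simulation code below computes its \(p\)-values from a normal approximation, so under that same approximation the probability of a significant result for a true effect of 40 is about \(\Phi(40/\mathrm{SE} - 1.96)\), with the SE taken from the excursus formula. This is a back-of-the-envelope sketch that ignores the uncertainty in the variance estimates; power_approx() is not part of the simulation code.
power_approx <- function(se, effect = 40) {
  # probability that a Wald z-test detects a true effect of size `effect`
  pnorm(effect / se - qnorm(0.975))
}
power_approx(approximate_standard_error(48, c(4, 6, 12, 24), 20, 40, 25))
# roughly 0.50, 0.66, 0.91, and 0.99 for 4, 6, 12, and 24 items
If this rough calculation is on the right track, then even with perfectly estimated variance components we should not expect high power for the 4- and 6-item designs.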
Click to show/hide the code
run_sim_map <- function(
n_sims = 500,
n_subs = 48,
n_items_vec = c(6, 12, 24),
true_slope = 40,
sd_item_intercept = 20,
sd_item_slope = 40,
sd_subj_intercept = 50,
sd_subj_slope = 20,
subj_ranef_cor = 0.8,
item_ranef_cor = 0.8,
noise_sd = 25,
seed_stream = 1111
) {
# helper to fit a single dataset and extract metrics
fit_one <- function(dat, true_slope) {
fit <- try(
lmer(
response ~ cond_coded +
(1 + cond_coded | subject) +
(1 + cond_coded | item),
data = dat,
control = lmerControl(
optimizer = "bobyqa",
optCtrl = list(maxfun = 2e5)
)
),
silent = TRUE
)
if (inherits(fit, "try-error")) {
return(tibble(
est = NA_real_,
se = NA_real_,
t_val = NA_real_,
p = NA_real_,
cover = NA,
item_intercept_sd = NA_real_,
item_slope_sd = NA_real_,
converged = FALSE
))
}
# fixed effect
cf <- summary(fit)$coefficients
est <- cf["cond_coded", "Estimate"]
se <- cf["cond_coded", "Std. Error"]
t_val <- cf["cond_coded", "t value"]
p_val <- 2 * pnorm(-abs(t_val)) # approx p-value
cover <- (est - 1.96 * se <= true_slope) &
(est + 1.96 * se >= true_slope)
# extract random effect SDs
vc_df <- as.data.frame(VarCorr(fit))
item_intercept_sd <- vc_df$sdcor[
vc_df$grp == "item" & vc_df$var1 == "(Intercept)"
]
item_slope_sd <- vc_df$sdcor[
vc_df$grp == "item" & vc_df$var1 == "cond_coded"
]
subj_intercept_sd <- vc_df$sdcor[
vc_df$grp == "subject" & vc_df$var1 == "(Intercept)"
]
subj_slope_sd <- vc_df$sdcor[
vc_df$grp == "subject" & vc_df$var1 == "cond_coded"
]
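# NOTE: as.data.frame(VarCorr(fit)) also contains the intercept-slope
# correlation rows, which have var1 == "(Intercept)" and var2 == "cond_coded".
# The two intercept filters above therefore match two rows each (an SD and a
# correlation); adding `& is.na(vc_df$var2)` to each filter would isolate the SDs.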
tibble(
est = est,
se = se,
t_val = t_val,
p = p_val,
cover = cover,
item_intercept_sd = item_intercept_sd,
item_slope_sd = item_slope_sd,
subj_intercept_sd = subj_intercept_sd,
subj_slope_sd = subj_slope_sd,
converged = TRUE
)
}
set.seed(seed_stream)
results_list <- map(n_items_vec, function(ni) {
# create a stream of seeds for reproducibility
seeds <- sample.int(1e8, n_sims)
sims_res <- map(seq_len(n_sims), function(i) {
dat <- generate_data(
n_subs = n_subs,
n_items = ni,
slope = true_slope,
seed = seeds[i],
sd_item_intercept = sd_item_intercept,
sd_item_slope = sd_item_slope,
sd_subj_intercept = sd_subj_intercept,
sd_subj_slope = sd_subj_slope,
subj_ranef_cor = subj_ranef_cor,
item_ranef_cor = item_ranef_cor,
noise_sd = noise_sd
)
fit_one(dat, true_slope)
}) %>%
list_rbind() %>%
summarise(
n_items = ni,
n_sims = n_sims,
mean_est = mean(est, na.rm = TRUE),
mean_model_se = mean(se, na.rm = TRUE),
coverage = mean(cover, na.rm = TRUE),
prop_significant = mean(p < 0.05, na.rm = TRUE),
mean_item_intercept_sd = mean(item_intercept_sd, na.rm = TRUE),
mean_item_slope_sd = mean(item_slope_sd, na.rm = TRUE),
mean_subj_intercept_sd = mean(subj_intercept_sd, na.rm = TRUE),
mean_subj_slope_sd = mean(subj_slope_sd, na.rm = TRUE),
prop_converged = mean(converged, na.rm = TRUE)
) %>%
mutate(
n_subs = n_subs,
true_slope = true_slope,
true_sd_item_intercept = sd_item_intercept,
true_sd_item_slope = sd_item_slope,
true_sd_subj_intercept = sd_subj_intercept,
true_sd_subj_slope = sd_subj_slope
)
})
return(list_rbind(results_list))
}
For our purposes, we will run the simulations for 4 different kinds of experiments. Each will feature 48 participants but they will differ in the number of items tested: 4, 6, 12, or 24. All other simulation parameters are kept constant across these conditions and are the same as before when we just used generate_data(). Each experiment type will be simulated 1,000 times to get reliable estimates of the relevant metrics.
Click to show/hide the code
sim_fits <- run_sim_map(
n_sims = 1000,
n_subs = 48,
n_items_vec = c(4, 6, 12, 24)
)
sim_fits
# A tibble: 4 × 17
n_items n_sims mean_est mean_model_se coverage prop_significant mean_item_intercept_sd mean_item_slope_sd
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 1000 39.5 19.2 0.859 0.545 9.62 36.6
2 6 1000 39.8 16.1 0.921 0.681 9.94 37.9
3 12 1000 41.1 11.8 0.921 0.92 10.1 38.8
4 24 1000 39.9 8.76 0.947 0.994 10.3 39.8
mean_subj_intercept_sd mean_subj_slope_sd prop_converged n_subs true_slope true_sd_item_intercept true_sd_item_slope
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 25.2 20.5 1 48 40 20 40
2 25.4 20.0 1 48 40 20 40
3 25.4 19.7 1 48 40 20 40
4 25.0 19.8 1 48 40 20 40
true_sd_subj_intercept true_sd_subj_slope
<dbl> <dbl>
1 50 20
2 50 20
3 50 20
4 50 20
With these results finally in, we can now turn to interpreting them. Let’s go through the main takeaways.
First off, the mean estimate of interest, the fixed effect of cond_coded, not only matches the true value quite closely across all conditions, it also does so consistently, irrespective of the number of items. This is in line with our expectations, as the fixed effects are not biased estimators.
Turning to the issue of biased estimators, though: if our expectations continue to be correct, we should see the standard errors go astray, because the model only had very few item levels from which to estimate the by-item variance components. And because the variance components that the model estimates feed directly into the inferential calculations, those calculations should be off as well. Looking at the numbers, this is indeed what we find:
- The model standard error (mean_model_se) decreases as n_items increases, from about 19 with 4 items to about 8.8 with 24 items. Much of this decrease simply reflects that fewer items genuinely mean more uncertainty: our approximation from the excursus gives roughly 20.5, 16.8, 12.1, and 8.8 for these item counts. Note, though, that the model standard errors sit slightly below these predicted values at the low item counts, consistent with underestimated item variance.
- The coverage (the proportion of 95% CIs covering the true value) improves with more items, 0.859 \(\Rightarrow\) 0.947, approaching the nominal 0.95 only in the 24-item condition.
- The proportion of significant models for the slope effect (i.e., the power to detect the true slope of 40) also improves steadily as a function of n_items: 0.545 \(\Rightarrow\) 0.994, roughly in line with the back-of-the-envelope power calculation from earlier.
All of this shows that with too few items you underestimate the item-related variance components and, as a result, commit inferential errors. This can in principle (I think) play out in several ways, and we saw one of them here: variance that would normally be captured by the item random effects, and thus be subject to the mixed model's shrinkage mechanism, got reinterpreted as belonging to the residual variance pool. That is, the model failed to account for the systematic variance associated with the items, underestimating it, and in turn the residual variance, our key non-systematic variance component, was inflated. And because inference always relates systematic variance to non-systematic variance, this misallocation compromised our inferential steps.
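Note that the simulations above do not record the residual SD, so this last step of the story is an interpretation rather than something we measured. A quick way to check it would be a small helper along these lines (resid_sd_check() is hypothetical and was not run for this post):
resid_sd_check <- function(n_items, n_sims = 200) {
  # refit the same model on fresh datasets and record the estimated residual SD
  sds <- replicate(n_sims, {
    dat <- generate_data(n_subs = 48, n_items = n_items)
    fit <- lmer(
      response ~ cond_coded +
        (1 + cond_coded | subject) +
        (1 + cond_coded | item),
      data = dat
    )
    sigma(fit)
  })
  mean(sds)
}
# if item variance really leaks into the residual, resid_sd_check(6) should sit
# above the true noise_sd of 25, and resid_sd_check(24) should be closer to it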
Note, however, that this was what happened at scale. It is also perfectly plausible that variance which does not get recognized as belonging to the random effects instead gets interpreted as belonging to the fixed effects in some model. While our large-scale simulations did not show this to be the dominant pattern, it is at least possible for it to happen in individual cases. And those individual cases might be your cases when you run one-shot experiments! This would then have the opposite effect, namely overconfident inference and Type I errors.
Below, we plot some of these results to get a better visual sense of the pattern, especially how the metrics improve as we increase the number of items.
Click to show/hide the code
sim_plot <- function(y, ylab, yline) {
sim_fits %>%
ggplot(aes(x = n_items, y = {{ y }})) +
geom_point() +
geom_line() +
scale_x_continuous(breaks = c(4, 6, 12, 24)) +
scale_y_continuous(
limits = c(0, NA),
breaks = scales::breaks_pretty()
) +
geom_hline(yintercept = yline, color = colors[2], lty = "dashed") +
labs(y = ylab, x = "Number of items")
}
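# `colors` is assumed to be a palette defined earlier in the post's setup chunk;
# combining two ggplots with `+`, as done below, presumably relies on patchwork.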
sim_plot(coverage, "CI coverage rate", 1) +
  sim_plot(prop_significant, "Proportion of significant *p*-values", .95)
For the mean recovery, it's good news all around. As we saw last time, the mean is not a biased estimator and will produce reliable estimates even with few or imbalanced data. However, it should be emphasized that the fixed-effect estimation here was not really suffering from too little data: we have an adequate number of participants, so it would have been surprising if we had under- or overshot on the location portion of the task.
Click to show/hide the code
sim_plot(mean_est, "Mean estimates", unique(sim_fits$true_slope))
Turning to the item-related variance components, however, it seems that our fears have come true: the estimates, especially for the intercept SD, come out well below the simulation parameters. In addition, though the trend is slight, we do see the estimates move toward the correct values as we increase the number of items, which further cements the biasedness of the variance estimators as the culprit behind the lackluster performance in the hypothesis tests and the coverage checks.
Click to show/hide the code
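# (code not shown in the original; a plausible reconstruction using the
# sim_plot() helper, with the true SDs as reference lines:)
sim_plot(mean_item_intercept_sd, "Mean item intercept SD", unique(sim_fits$true_sd_item_intercept)) +
  sim_plot(mean_item_slope_sd, "Mean item slope SD", unique(sim_fits$true_sd_item_slope))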

For completeness, we can also look at the by-subject variance components:
Click to show/hide the code
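# (likewise a plausible reconstruction of the code not shown:)
sim_plot(mean_subj_intercept_sd, "Mean subject intercept SD", unique(sim_fits$true_sd_subj_intercept)) +
  sim_plot(mean_subj_slope_sd, "Mean subject slope SD", unique(sim_fits$true_sd_subj_slope))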

There is more to be said about these results, of course, but I think the main takeaways are already clear enough from what we have seen so far. For anyone interested in investigating the puzzling aspects of the above, let me briefly list what I consider mysterious: it is not at all clear to me why the intercept SD estimates come out at about half the size of the simulation parameters (without approaching the true values as the item number increases), while the slope SD estimates are much closer to the true values and do appear to approach them with more items. What I would expect is, first, that both estimates are underestimated (because both are biased estimators) and, second, that both approach the true values with more items. Lastly, I do not understand why the by-subject variance components are also underestimated, given that we have enough participants to estimate them well. (One candidate explanation worth ruling out: as flagged in a comment in the simulation code above, the VarCorr() filters for the intercept SDs also match the intercept-slope correlation rows, so the reported intercept SDs may in fact be averages of an SD and a correlation near 0.8, which would land at roughly half the true values.)
Wrap-up
In this post we have seen that having too few items can lead to biased estimates of the variance components associated with them in a linear mixed model. This extended the project we began in the previous post on biased estimators where low sample sizes led to underestimation issues. Here we saw this happening not just for low sample sizes in general but specifically for low item numbers in one-shot experiments (with otherwise adequate, or at least commonly used, participant numbers).
The poor estimation of the item-related variance components can in turn lead to poor inferential performance, as we saw in the coverage and power checks above. On the account sketched there, this is because variance that should have been attributed to the item random effects gets misattributed, largely to the residual, which leaves the standard errors understating the true uncertainty and the confidence intervals covering the true value less often than they should.
In terms of the implications for research design, this means that one-shot experiments with few items and many participants are likely to produce unreliable inferential results. While the (location-related) fixed effect estimates themselves may be unbiased, any generalizations or inferences drawn from them may be suspect due to the poor estimation of variance components. Trivially, this also means that any kind of hypothesis about the variance components themselves (e.g., whether items differ in their slopes) will be unreliable.
Thus, in addition to the other issues that one-shot experiments raise (see above), there are statistical reasons to be cautious as well.
I want to stress that the results here are not the be-all and end-all of the discussion on one-shot experiments. There are many other factors at play, and different experimental paradigms may yield different results. In particular, one might think that a different response variable could change the dynamics at play, such as acceptability judgments on Likert scales or forced-choice tasks. However, the simulations here provide a strong indication that one-shot experiments with few items can lead to biased variance component estimates and unreliable inferences even in the simple case, and my expectation would be that these issues do not simply evaporate when we switch from linear to non-linear models. For simulation results showing that non-scale response variables may be more prone to sign flips than scale response data in low-powered designs, see, e.g., Sun et al. (2025), who touch on the issue of Type M and Type S errors (Gelman and Carlin 2014).
References
Citation
@online{thalmann2026,
author = {Thalmann, Maik},
title = {Biased Estimators {II:} {Too} Many Participants},
date = {2026-01-31},
url = {https://maikthalmann.com/posts/2026-01-31_biased-estimators-2-too-many-participants/},
langid = {en}
}