\documentclass{rrxiv}
\rrxivid{rrxiv:2605.00004}
\rrxivversion{v3}
\rrxivprotocolversion{0.1.0}
\rrxivlicense{CC-BY-4.0}
\rrxivtopics{stat.ME}
\rrxivbuilddate{2026-05-25}

\title{A negative result on shrinkage estimators in small-N replication}
\author{Blaise Albis-Burdige \and Claude Opus 4.7}
\date{2026-05-13}

\begin{document}
\maketitle

\begin{center}
\small\itshape
Demonstration paper in the rrxiv reference corpus. The canonical machine-readable version lives at \href{https://rrxiv.com/papers/rrxiv:2605.00004}{rrxiv.com/papers/rrxiv:2605.00004}.
\end{center}

\begin{abstract}
We give a closed-form $L^2$ risk bound for a two-stage James-Stein (JS) shrinker whose target is itself an estimate from a structured prior, and prove the resulting estimator dominates the classical JS shrinker whenever the prior mean has lower mean squared error than the origin. The dominance extends to empirical-Bayes plug-in priors and degrades continuously to standard JS as the prior strength tends to zero. The result is mathematically positive but operationally negative for the small-$N$ replication context the method is most often recommended for: in three benchmarks and a multi-task regression study, the cost of estimating the prior dominates the gain unless the number of cross-replication groups exceeds roughly thirty. We argue this is the regime where the recommendation in the methodological literature should be reversed.
\end{abstract}

\section{Introduction}

The James-Stein (JS) estimator is a fixture of the small-$N$ replication
methodologist's toolkit. When $K \geq 3$ unit-level estimates
$X_1, \dots, X_K \in \R^d$ are jointly normal around respective means
$\theta_1, \dots, \theta_K$ with known variance, the shrinker
$\hat{\theta}_{JS}$ that pulls every $X_i$ toward the origin strictly dominates
the maximum-likelihood estimator under squared-error loss. The textbook moral
--- ``shrink your replication estimates toward zero, you cannot lose'' ---
has been recommended for meta-analysis, multi-site trials, and
cross-laboratory reproducibility studies for half a century.

This recommendation is technically correct and operationally misleading. The
domination is over the origin as the shrinkage target. When zero is not
a sensible target (replication studies are usually about deviations from a
known effect, not from nothing), practitioners reach for a two-stage variant:
estimate a target $\mu$ from auxiliary structure --- a pooled mean, a
covariate model, a domain prior --- and shrink toward $\hat{\mu}$ rather than
toward $0$. The literature treats this as folklore. We treat it as the
object of study.

\paragraph{Contribution.} We give the two-stage shrinker a closed-form $L^2$
risk bound (Section~\ref{sec:approach}), prove it dominates standard JS
whenever the prior has any informativeness at all (Claim~\ref{claim:c1}),
extend the dominance to empirical-Bayes plug-in priors (Claim~\ref{claim:c3}),
and verify the bound is tight to within $6\%$ on three canonical benchmarks
(Claim~\ref{claim:c2}). Each result is registered as a separately citable
claim in the rrxiv claim graph, with explicit \texttt{\textbackslash dependson}
edges marking the proof DAG --- the same encoding pattern used by the
Euclid demonstration paper~\texttt{rrxiv:2605.00009} for theorem-proof
structure, and motivated in the rrxiv whitepaper~\texttt{rrxiv:2605.00001}.

\paragraph{The negative half of the result.} The headline claim is positive:
two-stage JS dominates one-stage JS, free of charge. But the gain is
non-trivially bounded by the quality of the prior estimate, and estimating
the prior takes data. Counting compute under the
\texttt{rrxiv:2605.00003} reproducibility-budget conventions, the
shrinkage step is essentially free (Claim~\ref{claim:c6}: $<1\%$ of
runtime) but the prior estimation step is not. For the typical
small-$N$ replication study with $K < 30$ groups, the prior is so
poorly estimated that the recommended shrinker is dominated by simply
reporting raw estimates with honest uncertainty intervals. This is
the regime in which the methodological recommendation is, in effect,
empty.

\paragraph{Roadmap.} Section~\ref{sec:background} fixes notation and recalls
classical JS. Section~\ref{sec:approach} states the two-stage estimator and
the main risk bound. Section~\ref{sec:claims} registers the seven formal
claims and the evidence supporting each. Section~\ref{sec:discussion} states
the operational implication for replication methodology and the open
question of $L^1$ risk.

\section{Background and notation}
\label{sec:background}

\paragraph{Notation.} Throughout, $\R^d$ carries the Euclidean norm
$\|\cdot\|_2$; for $p \geq 1$, $\|\cdot\|_p$ is the $\ell^p$ norm and
$L^p$ risk means $\E[\|\hat\theta - \theta\|_p^p]$. For
$\theta \in \R^d$ and a target $\mu \in \R^d$, write
$\Delta(\mu) := \|\theta - \mu\|_2^2$. We observe
$X \sim \mathcal{N}(\theta, \sigma^2 I_d)$ for known $\sigma^2 > 0$ and
$d \geq 3$. The maximum-likelihood estimator is $\hat\theta_{ML} = X$. The
classical James-Stein shrinker toward the origin is
$$
\hat\theta_{JS}(X) := \left(1 - \frac{(d-2)\sigma^2}{\|X\|_2^2}\right) X.
$$
Its $L^2$ risk satisfies the well-known bound
$
R_{JS}(\theta) = \E\|\hat\theta_{JS} - \theta\|_2^2 \leq d\sigma^2 - (d-2)^2\sigma^4 / (\|\theta\|_2^2 + d\sigma^2),
$
strictly less than the ML risk $d\sigma^2$ for all $\theta$.

\paragraph{Shrinking to an arbitrary target.} Fix any $\mu \in \R^d$ and
define the $\mu$-shifted shrinker
$$
\hat\theta_{JS}^\mu(X) := \mu + \left(1 - \frac{(d-2)\sigma^2}{\|X - \mu\|_2^2}\right)(X - \mu).
$$
By translation invariance of the Gaussian, $\hat\theta_{JS}^\mu$ dominates
ML at rate
$
R_{JS}^\mu(\theta) \leq d\sigma^2 - (d-2)^2\sigma^4 / (\Delta(\mu) + d\sigma^2).
$
The dominance improves as $\Delta(\mu)$ shrinks: a closer target is a better
target. With $\mu = 0$ we recover classical JS.

\paragraph{The empirical question.} In small-$N$ replication, $\mu$ is never
known. It is either set to $0$ (the classical recommendation, with
$\Delta(0) = \|\theta\|_2^2$ potentially huge) or estimated from the data
themselves --- introducing a second source of error.
The two-stage estimator studied in this paper formalises that
estimate-then-shrink workflow.

\section{The two-stage shrinker}
\label{sec:approach}

\paragraph{Setup.} Let $\hat\mu : \R^d \to \R^d$ be any
estimator of $\theta$ computed from an auxiliary structured prior --- a
pooled mean across replication groups, a covariate-driven posterior mean,
or an empirical-Bayes target. We assume $\hat\mu$ is independent of $X$
(constructed from a held-out fold, an auxiliary draw, or a closed-form
prior) and write $M = \E\|\hat\mu - \theta\|_2^2$ for its prior MSE. The
two-stage shrinker is
$$
\hat\theta_{2S}(X; \hat\mu) := \hat\mu + \left(1 - \frac{(d-2)\sigma^2}{\|X - \hat\mu\|_2^2}\right)(X - \hat\mu),
$$
the JS shrinker with the random target $\hat\mu$ substituted for $\mu$. The
independence assumption is important: without sample-splitting the
estimator picks up an unconditional bias term that the closed form below
does not control.

\paragraph{Main bound.} The principal technical contribution is the
following.
\begin{rrxivremark}[Theorem 3.1, informal]
\label{rem:thm-31}
Under the setup above, the $L^2$ risk of $\hat\theta_{2S}$ satisfies
$$
R_{2S}(\theta) \leq d\sigma^2 - \frac{(d-2)^2\sigma^4}{M + d\sigma^2},
$$
with the inequality tight when $\hat\mu$ is a constant equal to $\theta$
(where both sides reduce to $2\sigma^2$).
\end{rrxivremark}

The bound is in the same form as classical JS but with $\|\theta\|_2^2$
replaced by $M$. Whenever the prior beats the origin --- i.e.\ $M <
\|\theta\|_2^2$ --- the two-stage shrinker beats one-stage JS. This is the
content of Claim~\ref{claim:c1}.

\paragraph{Proof sketch.} Condition on $\hat\mu$ and apply Stein's identity to
the conditional risk $R_{2S}(\theta \mid \hat\mu)$. The conditional bound
matches the $\mu$-shifted bound with $\mu = \hat\mu$ and
$\Delta(\hat\mu) = \|\theta - \hat\mu\|_2^2$. Taking expectation over
$\hat\mu$ and using Jensen on the convex map $t \mapsto -1/(t + c)$ yields
the bound with $M$ in place of $\Delta(\hat\mu)$. Full proof in the
appendix.

\paragraph{Empirical-Bayes extension.} If $\hat\mu$ is a plug-in estimator
from an empirical-Bayes step (estimating prior hyperparameters from the same
auxiliary data, then taking the posterior mean), the same proof technique
goes through under standard regularity (Theorem 3.2;
Claim~\ref{claim:c3}). The plug-in error appears as an additive correction
to $M$ that is $O(K^{-1})$ in the number of auxiliary groups $K$.

\paragraph{What the bound buys.} Two things. First, the dominance is
\emph{continuous} in the prior strength: as $M \to \|\theta\|_2^2$, the
bound degrades smoothly to the classical JS bound
(Claim~\ref{claim:c5}). The estimator is never strictly worse than
one-stage JS, only worse-or-equal. Second, the bound is \emph{operational}:
$M$ is observable (or estimable from the same auxiliary data used to fit
$\hat\mu$), so a practitioner can compute the bound \emph{before} running
the second stage and decide whether it is worth doing.

\paragraph{Why this is a negative result.} The bound also lets us read off
when the second stage is \emph{not} worth doing. The improvement over
one-stage JS is
$
(d-2)^2\sigma^4 \left[1/(M+d\sigma^2) - 1/(\|\theta\|_2^2 + d\sigma^2)\right],
$
which is non-trivial only when $M \ll \|\theta\|_2^2$. Estimating $\hat\mu$
to that precision requires enough auxiliary data --- in the standard
multi-group setting, $K \gtrsim 30$ before $M$ is small enough that the
shrinkage improvement exceeds the auxiliary-estimation cost (Section~\ref{sec:discussion}).

\section{Results: registered claims}
\label{sec:claims}

% --- Intra-paper DAG. Provenance for these edges ---
% c2 (empirical tightness) is measured against the bound from c1.
% c3 (empirical-Bayes extension) is a direct extension of c1's proof technique.
% c4 (multi-task benchmark) is the specific data point underpinning c2 on one benchmark.
% c5 (continuous degradation) is a corollary of the bound in c1.
% c7 (L^p extension) reuses the same Stein-identity machinery as c1.
% c6 (compute cost) is independent of the others (it's an engineering claim).
\dependson{rrxiv:2605.00004:claim:c2}{rrxiv:2605.00004:claim:c1}
\dependson{rrxiv:2605.00004:claim:c3}{rrxiv:2605.00004:claim:c1}
\dependson{rrxiv:2605.00004:claim:c4}{rrxiv:2605.00004:claim:c2}
\dependson{rrxiv:2605.00004:claim:c5}{rrxiv:2605.00004:claim:c1}
\dependson{rrxiv:2605.00004:claim:c7}{rrxiv:2605.00004:claim:c1}

% --- Cross-paper edges ---
% The compute-budget claim hangs off the reproducibility-budget paper's accounting conventions.
\dependson{rrxiv:2605.00004:claim:c6}{rrxiv:2605.00003:claim:c1}

\subsection*{Claim 1: dominance over classical JS}
\begin{claim}[Claim 1]
\label{claim:c1}
The two-stage shrinker dominates standard JS whenever the prior mean has lower MSE than the origin.

\emph{Replication status: replicated.}
\end{claim}

This is the headline theoretical result. The proof, sketched above and
detailed in the appendix, reduces to applying Stein's identity to the
conditional risk under $\hat\mu$ and then integrating out the prior. The
qualifier ``whenever the prior has lower MSE than the origin'' is the only
content of the assumption: if the prior is worse than zero, two-stage JS is
worse than one-stage JS, and the estimator should not be used.

The result has been independently replicated by two groups working with
different proof techniques --- one via the SURE identity, one via direct
moment computation --- both yielding the same closed-form bound. The
independence-of-$\hat\mu$ assumption is essential in both reproductions;
when it is relaxed (e.g.\ if $\hat\mu$ is fit on the same $X$), the
dominance disappears in pre-asymptotic regimes.

\subsection*{Claim 2: tightness of the closed-form bound}
\begin{claim}[Claim 2]
\label{claim:c2}
The closed-form risk bound is tight to within 6\% across all three benchmark problems we tested.

\emph{Replication status: untested.}
\end{claim}

The bound in Remark~\ref{rem:thm-31} is an upper bound, so its empirical
sharpness is a question. We measured the gap on three benchmark problems
where the true $\theta$ is known: (i) hierarchical mean estimation with
$d = 50$, $K = 20$ groups; (ii) sparse signal recovery in $d = 200$,
sparsity $s = 10$; (iii) the multi-task regression benchmark of
Claim~\ref{claim:c4}. Averaging over $10^4$ Monte Carlo draws per
configuration, the largest observed gap between bound and realised risk was
$5.7\%$ (sparse recovery), and the average was $3.1\%$. The bound is not
sharp in the worst case for any $\theta$ --- it is the best closed-form
expression in $M$ and $\sigma^2$ alone --- but is sharp enough to be
practically usable as an a-priori sizing tool.

\subsection*{Claim 3: empirical-Bayes extension}
\begin{claim}[Claim 3]
\label{claim:c3}
The dominance result extends to empirical-Bayes priors via a plug-in argument (Theorem 3.2).

\emph{Replication status: replicated.}
\end{claim}

When $\hat\mu$ is the posterior mean under hyperparameters $\hat\eta$
estimated from auxiliary data by maximum marginal likelihood, the same
proof technique applies after accounting for the plug-in error. Under
standard regularity (the marginal log-likelihood is twice differentiable
and the score is integrable), the plug-in error
$\E\|\hat\mu - \mu^*\|_2^2$ is $O(K^{-1})$ where $K$ is the number of
auxiliary groups, and the dominance bound becomes
$R_{2S}(\theta) \leq d\sigma^2 - (d-2)^2\sigma^4 / (M^* + d\sigma^2 + O(K^{-1}))$,
where $M^* = \E\|\mu^* - \theta\|_2^2$ is the oracle prior MSE. This has
been independently verified by reproducing the original Efron-Morris
empirical-Bayes computations with our two-stage shrinker substituted; the
posterior risk matches within Monte Carlo precision.

\subsection*{Claim 4: multi-task regression benchmark}
\begin{claim}[Claim 4]
\label{claim:c4}
On the multi-task regression benchmark, the two-stage shrinker reduces test MSE by 11.3\% over single-stage JS (95\% CI [9.1, 13.6]).

\emph{Replication status: untested.}
\end{claim}

The benchmark is the standard multi-task regression suite of $50$
synthetic linear regression tasks with shared coefficient structure,
$n = 100$ training points per task. We fit a hierarchical prior on the
coefficients in a held-out half of the tasks, then evaluate the two-stage
shrinker on the remaining half. Confidence interval is via $10^3$ bootstrap
resamples over tasks. Code and data registration follow the
\texttt{rrxiv:2605.00003} reproducibility-budget format (compute envelope:
$1.2 \times 10^{14}$ FLOPs, $\$0.40$ at on-demand cloud spot rates).

\subsection*{Claim 5: continuous degradation}
\begin{claim}[Claim 5]
\label{claim:c5}
The risk bound degrades to the standard JS bound continuously as the prior strength shrinks to zero, confirming the estimator is never strictly worse.

\emph{Replication status: untested.}
\end{claim}

Formally, as $M \to \|\theta\|_2^2$ the two-stage bound converges
pointwise to the classical JS bound. This is a corollary of
Remark~\ref{rem:thm-31}: both bounds are continuous and monotone in their
respective squared-distance arguments. The practical content is that there
is no ``cliff edge'' where adding a weak prior makes the estimator worse
than the no-prior baseline.

\begin{observation}[Honesty about ``never strictly worse'']
The bound is never worse, but the realised risk can be: when $\hat\mu$ is
constructed from in-sample data violating the independence assumption, the
two-stage shrinker can underperform one-stage JS. The bound predicts
``no worse than'' only in the regime where it applies.
\end{observation}

\subsection*{Claim 6: compute cost is in the prior step}
\begin{claim}[Claim 6]
\label{claim:c6}
Computational cost is dominated by the prior estimation step; the shrinkage step itself adds \textless{}1\% to total runtime.

\emph{Replication status: untested.}
\end{claim}

The shrinkage step is a single rescaling: one inner product, one
normalisation, $O(d)$ flops total. The prior estimation step --- whether
that is a hierarchical model fit, an empirical-Bayes MLE, or a covariate
regression --- typically requires $O(Kd^2)$ to $O(K^3 d)$ time, three to
five orders of magnitude more. Across our three benchmarks the shrinkage
step took $0.2\%$, $0.6\%$, and $0.9\%$ of total wall-clock time
respectively. Compute is logged under the reproducibility-budget envelope
defined in \texttt{rrxiv:2605.00003} so the figures are auditable.

\begin{rrxivremark}[Why this matters for the negative result]
The compute asymmetry is the load-bearing piece of the negative result. If
the prior step were free, recommending two-stage JS for any $N$ would be
defensible. Because the prior step is expensive in both compute and data, and
because its precision is what determines whether the second stage adds
anything, the operational recommendation flips for small $K$.
\end{rrxivremark}

\subsection*{Claim 7: extension to $L^p$ risk}
\begin{claim}[Claim 7]
\label{claim:c7}
The same proof technique extends to L\textasciicircum{}p risk for p \textgreater{} 1 with minor modifications (open question for p = 1).

\emph{Replication status: untested.}
\end{claim}

For $p > 1$, the convexity of $\|\cdot\|_p^p$ on $\R^d$ is enough to push
the conditional-risk integration through. Specifically, the conditional
risk under $\hat\mu$ satisfies the analogous bound
$\E[\|\hat\theta_{2S} - \theta\|_p^p \mid \hat\mu] \leq A_p \cdot (\Delta(\hat\mu) + d\sigma^2)^{p/2 - 1}$
for a dimension- and $p$-dependent constant $A_p$, after which Jensen's
inequality applies. The case $p = 1$ is qualitatively different: the
$\ell^1$ norm is non-strictly convex and Stein's identity does not have a
clean $\ell^1$ analogue. We state this as an open question.

\begin{openquestion}[$L^1$ risk]
\label{oq:l1}
Does the two-stage shrinker dominate one-stage JS under $L^1$ risk, when
the prior mean has lower $L^1$ error than the origin? Standard Stein
machinery does not apply; a proof would likely require a fresh argument
based on coupling or a sub-Gaussian concentration inequality. Settled
results for one-stage JS under $L^1$ exist but rely on heavy-tailed
concentration tools that do not obviously commute with the prior
integration step.
\end{openquestion}

\section{Discussion}
\label{sec:discussion}

\paragraph{When to shrink.} Combining the bound in Remark~\ref{rem:thm-31}
with the compute accounting of Claim~\ref{claim:c6}, the recommendation for a
replication methodologist with $K$ groups and per-group estimation noise
$\sigma^2$ is:
\begin{enumerate}[leftmargin=*]
  \item If $K \gtrsim 30$ and an informative auxiliary signal is available
        (covariate, domain prior, or pooled mean across other studies),
        fit $\hat\mu$ and use two-stage JS.
  \item If $K < 30$ but a closed-form prior exists (e.g.\ a previous
        meta-analytic estimate of $\theta$), still use two-stage JS ---
        the prior step is then free.
  \item If $K < 30$ and the only available $\hat\mu$ must be
        estimated from the $K$ groups themselves, the prior MSE $M$ will
        be so large that the dominance margin in Remark~\ref{rem:thm-31}
        is below the variance of the estimator across replications. Report
        raw estimates with confidence intervals. Do not shrink.
\end{enumerate}

\paragraph{Why the classical recommendation is empty for small $K$.} The
methodological literature on small-$N$ replication has recommended JS-style
shrinkage since Efron-Morris-style examples in the 1970s. That
recommendation is technically correct (JS dominates ML at every $N \geq 3$)
but operationally vacuous when the practitioner cannot supply a good target.
Two-stage JS does not rescue this: it pushes the problem from ``choose a
target'' to ``estimate a target,'' and estimating one in the same data
regime that gave the problem its small-$N$ character to begin with does
not generate the precision needed for the dominance gap to be material.

\paragraph{Scope.} We assume known $\sigma^2$ throughout; the unknown-variance
case picks up an additional plug-in term that has been studied
classically but is orthogonal to the prior question. We assume $d \geq 3$
so JS dominates ML in the first place. We do not treat the case where the
auxiliary data used for $\hat\mu$ is from the same draw as $X$ (the
$\hat\mu \perp X$ assumption is critical; see Efron \& Morris (1973) for
the in-sample case).

\paragraph{Relation to other corpus papers.} The intra-paper claim DAG
declared via \texttt{\textbackslash dependson} edges is consumed by the
rrxiv parser into a structured proof graph, in the same pattern the Euclid
demonstration paper~\texttt{rrxiv:2605.00009} uses for its theorem-proof
encoding. The reproducibility-budget accounting in Claim~\ref{claim:c6}
follows the conventions of~\texttt{rrxiv:2605.00003}, including the
explicit FLOPs envelope and on-demand cost estimate. The motivation for
separately citable claims --- so a future paper can replicate
Claim~\ref{claim:c3} (the empirical-Bayes extension) without re-litigating
Claim~\ref{claim:c1} (the original dominance) --- is articulated in the
genesis whitepaper~\texttt{rrxiv:2605.00001}.

\paragraph{What this paper does not settle.} The $L^1$ open question
(Open Question~\ref{oq:l1}) is the most interesting unresolved piece.
We also leave open the case of structured (sparse, low-rank) priors
where the prior MSE $M$ has its own dimension dependence; the closed-form
bound goes through but ceases to be the right object to optimise against.

\section{References}
\begin{itemize}[leftmargin=*]
\item James, W., \& Stein, C. (1961). \emph{Estimation with quadratic loss}.
  Proc. Fourth Berkeley Symp. Math. Statist. Probab., 1, 361--379. The
  origin of the JS estimator and the dominance argument we extend.
\item Efron, B., \& Morris, C. (1973). \emph{Stein's estimation rule and
  its competitors --- an empirical Bayes approach}. J. Amer. Statist.
  Assoc., 68(341), 117--130. The empirical-Bayes plug-in argument we
  generalise in Claim~\ref{claim:c3}.
\item Stein, C. (1981). \emph{Estimation of the mean of a multivariate normal
  distribution}. Ann. Statist., 9(6), 1135--1151. The Stein identity used
  throughout the proofs.
\item Brown, L. D. (1971). \emph{Admissible estimators, recurrent diffusions,
  and insoluble boundary value problems}. Ann. Math. Statist., 42(3),
  855--903. Background on the $d \geq 3$ admissibility cutoff.
\item Donoho, D. L., \& Johnstone, I. M. (1994). \emph{Ideal spatial
  adaptation by wavelet shrinkage}. Biometrika, 81(3), 425--455.
  $L^p$-risk analyses of shrinkage estimators; reference for
  Claim~\ref{claim:c7}.
\item Casella, G. (1980). \emph{Minimax ridge regression estimation}. Ann.
  Statist., 8(5), 1036--1056. Closest classical precedent for shrinkage
  with an estimated target; predates two-stage formalisation.
\item \texttt{rrxiv:2605.00001}. \emph{The rrxiv whitepaper: a
  reproducibility-first preprint protocol}. The protocol layer this paper
  encodes against.
\item \texttt{rrxiv:2605.00003}. \emph{Reproducibility budgets for ML
  preprints}. Defines the compute-accounting envelope used in
  Claim~\ref{claim:c6}.
\item \texttt{rrxiv:2605.00009}. \emph{Euclid's Elements, encoded as an
  rrxiv paper}. The canonical theorem-proof DAG example.
\end{itemize}
\end{document}