License: CC BY 4.0
arXiv:2403.07264v1 [stat.ML] 12 Mar 2024

 

Near-Interpolators: Rapid Norm Growth and
the Trade-Off between Interpolation and Generalization


 


Yutong Wang (MIDAS, University of Michigan), Rishi Sonthalia (UCLA), Wei Hu (University of Michigan)

Abstract

We study the generalization capability of nearly-interpolating linear regressors: $\bm{\beta}$'s whose training error $\tau$ is positive but small, i.e., below the noise floor. Under a random matrix theoretic assumption on the data distribution and an eigendecay assumption on the data covariance matrix $\bm{\Sigma}$, we demonstrate that any near-interpolator exhibits rapid norm growth: for $\tau$ fixed, $\bm{\beta}$ has squared $\ell_2$-norm $\mathbb{E}[\|\bm{\beta}\|_2^2] = \Omega(n^{\alpha})$ where $n$ is the number of samples and $\alpha > 1$ is the exponent of the eigendecay, i.e., $\lambda_i(\bm{\Sigma}) \sim i^{-\alpha}$. This implies that existing data-independent norm-based bounds are necessarily loose. On the other hand, in the same regime we precisely characterize the asymptotic trade-off between interpolation and generalization. Our characterization reveals that larger norm scaling exponents $\alpha$ correspond to worse trade-offs between interpolation and generalization. We verify empirically that a similar phenomenon holds for nearly-interpolating shallow neural networks.

1 INTRODUCTION

Regularization (Nakkiran et al., 2020) and early stopping (Ji et al., 2021) are techniques to mitigate the effect of harmful overfitting by training models to nearly, rather than perfectly, interpolate the training data. A key question is: how do near-interpolators generalize?

A long line of work has investigated this question for perfect-interpolators. Zhang et al. (2017) noted the surprising phenomenon that, even with noise, perfect-interpolators do not necessarily overfit catastrophically, and can still generalize to some extent. The phenomenon was later formalized as "benign overfitting" and proven to hold in linear regression by Bartlett et al. (2020). Mallinar et al. (2022) introduced the more nuanced notion of tempered overfitting, which is closer to the empirical observation in Zhang et al. (2017) that the test error of perfect-interpolators does degrade somewhat. Koehler et al. (2021) establish a setting under which benign overfitting can be explained by uniform bounds. However, to the best of our knowledge, no work has studied the generalization of near-interpolating linear regression (Ghosh and Belkin (2022) establish a lower bound for the interpolation-generalization trade-off, which is related to but distinct from our contributions).

Our results. Under random matrix-theoretic and power-law spectra assumptions, we prove that nearly-interpolating ridge regressors have norms that grow rapidly (Theorem 1.6), implying that existing (non-asymptotic) generalization bounds are loose (Section 4.2). Moreover, we derive the exact formula relating the large-sample limits of the training and testing errors (Theorem 1.5), using the eigenlearning framework of Simon et al. (2023). Finally, we show that our theoretical results on near-interpolating ridge regressors are relevant empirically and can give insight into the behavior of early-stopped near-interpolating shallow neural networks.

Implications. Our result on the norm growth implies that existing data-independent bounds and their possible extensions are necessarily loose (see Section 4.2). Thus, in order to explain the learning capability of near-interpolators, there is a need to develop data-dependent generalization bounds.

On the other hand, our result allows the analysis of the trade-off between nearness of interpolation and generalization. Our result reveals a delicate interplay between the overparametrization ratio and the power-law spectra exponent. In particular, a larger power-law spectra exponent implies a larger asymptotic excess test error ratio of, for instance, interpolation at 5% of the noise floor over interpolation at 50% of the noise floor. Put more simply, the harmfulness of overfitting depends on the data distribution. Moreover, this effect is stronger at high levels of overparametrization (large $p/n$). (See Figure 2 and Figure 1-left panel.) Experimentally, this effect appears in shallow neural networks as well (Figure 4).

1.1 Related works

Near interpolation. Learning algorithms that (nearly) interpolate the training data, such as deep neural networks, have been surprisingly effective in practice despite conventional statistical wisdom suggesting otherwise (Zhang et al., 2021). Many practices in modern machine learning, e.g., early-stopped neural networks and high-dimensional ridge regression, result in near- rather than perfect-interpolators (Ji et al., 2021; Kuzborskij and Szepesvári, 2022). On the theory side, Ghosh and Belkin (2022) provide a lower bound on the test error for near-interpolators.

Power law spectra. Empirically, power law spectra arise in neural tangent kernels computed from practical networks for common datasets, e.g., MNIST (Velikanov and Yarotsky, 2021). On the theory side, the power law spectra assumption has been used previously to analyze benign (Bartlett et al., 2020, Theorem 2) and tempered overfitting phenomena (Mallinar et al., 2022, Theorem 3.1).

Looseness of existing generalization bounds. Our work is motivated by the empirical evidence of Wei et al. (2022), which suggests that norms of kernel ridge regressors grow rapidly, potentially beyond the purview of norm-based bounds. We confirm that bounds similar to the ones in Koehler et al. (2021, Corollary 1) grow to infinity. Even more refined bounds such as Koehler et al. (2021, Theorem 1) grow as $n$ goes to infinity. Therefore, our work suggests that explaining the generalization capability of near-interpolators will require new tools.

1.2 Notations

Throughout this work, we assume the setting of high-dimensional linear regression as described below. Let $n$ denote the number of samples and $p$ denote the feature dimension. Consider the setting where $n, p \to \infty$ at the same time. The sample-to-feature ratio is denoted $\gamma := n/p \in \mathbb{R}_{>0}$ and the asymptotic sample-to-feature ratio is denoted $\gamma_{\ast} := \lim_{n\to\infty} \gamma \in \mathbb{R}_{\geq 0}$. Here, $n$ is the fundamental parameter on which $p$ depends implicitly.

Let $X \in \mathbb{R}^{p}$ and $Y \in \mathbb{R}$ denote a random vector (resp. variable), referred to as the sample (resp. label). Suppose that $\bm{\beta}^{\star} \in \mathbb{R}^{p}$ is such that $Y = \varepsilon + X^{\top}\bm{\beta}^{\star}$ where $\varepsilon \in \mathbb{R}$ is a random variable denoting independent, zero mean noise, i.e., $\mathbb{E}[\varepsilon] = 0$ and $\varepsilon \perp X$. Here $\perp$ denotes independence between random variables. The noise variance is denoted $\sigma^{2} := \mathbb{E}[\varepsilon^{2}]$.

The training data is denoted $\{(x_i, y_i)\}_{i=1}^{n}$ where $x_i \in \mathbb{R}^{p}$ and $\varepsilon_i \in \mathbb{R}$ are i.i.d. realizations of $X$ and of the noise, and $y_i = \varepsilon_i + x_i^{\top}\bm{\beta}^{\star}$. Let $\mathbf{X} = [x_1, \dots, x_n] \in \mathbb{R}^{p \times n}$ be the data matrix obtained by horizontally stacking the $x_i$'s, and let $\mathbf{y} = (y_1, \dots, y_n)^{\top} \in \mathbb{R}^{n}$ be the column vector obtained by concatenating the $y_i$'s. Likewise, define $\bm{\varepsilon} = (\varepsilon_1, \dots, \varepsilon_n)^{\top} \in \mathbb{R}^{n}$. For a positive integer $p$, let $\mathbf{I}_p$ denote the $p \times p$ identity matrix. Let $\bm{\beta} \in \mathbb{R}^{p}$ be arbitrary. The empirical training mean squared error (MSE) of $\bm{\beta}$ is denoted

$$\mathcal{E}_{\mathtt{train}}^{n}(\bm{\beta}) = \frac{1}{n}\|\mathbf{X}^{\top}\bm{\beta} - \mathbf{y}\|_{2}^{2}.$$

Likewise, the expected test error of $\bm{\beta}$ is denoted

$$\mathcal{E}_{\mathtt{test}}^{n}(\bm{\beta}) = \mathbb{E}[\|X^{\top}\bm{\beta} - Y\|_{2}^{2}].$$

Let $\hat{\bm{\Sigma}} := n^{-1}\mathbf{X}\mathbf{X}^{\top}$ denote the sample covariance matrix and $\check{\mathbf{G}} := n^{-1}\mathbf{X}^{\top}\mathbf{X}$ the (scaled) Gram matrix. Let $\bm{\Sigma} := \mathbb{E}[\hat{\bm{\Sigma}}]$ denote the population covariance.

We note that all quantities defined on the training data implicitly depend on $n$. When the dependencies need to be made explicit, we shall write $\bm{\beta}^{\star}_{n}$, $\hat{\bm{\Sigma}}_{n}$ and so on.
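As a concrete illustration of the notation above, the following NumPy sketch builds a small synthetic instance and evaluates the training MSE, the sample covariance and the scaled Gram matrix. The sizes, the seed, the Gaussian features with identity population covariance, and the choice of $\bm{\beta}^{\star}$ are illustrative assumptions, not choices made in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)   # illustrative seed
n, p, sigma = 20, 50, 1.0        # hypothetical sizes and noise level

X = rng.standard_normal((p, n))                  # data matrix; columns are the samples x_i
beta_star = rng.standard_normal(p) / np.sqrt(p)  # hypothetical ground truth
eps = sigma * rng.standard_normal(n)             # noise vector
y = X.T @ beta_star + eps                        # y_i = eps_i + x_i^T beta_star


def train_mse(beta, X, y):
    """Empirical training MSE (1/n) * ||X^T beta - y||_2^2, with X of shape (p, n)."""
    return np.sum((X.T @ beta - y) ** 2) / X.shape[1]


Sigma_hat = X @ X.T / n   # p x p sample covariance
G_check = X.T @ X / n     # n x n scaled Gram matrix
```

The two matrices share their nonzero eigenvalues, which is why results stated for one can be transferred to the other.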

1.3 Our contributions

Recall the minimum norm near-interpolator:

Definition 1.1.

Let $\tau \in (0, \sigma^{2})$. The minimum norm $\tau$-near-interpolator is defined as

$$\underline{\bm{\beta}}_{\tau} := \mathrm{argmin}_{\bm{\beta} \in \mathbb{R}^{p}} \|\bm{\beta}\|_{2}^{2} \quad \text{s.t.} \quad \tfrac{1}{n}\|\mathbf{X}^{\top}\bm{\beta} - \mathbf{y}\|_{2}^{2} \leq \tau. \tag{1}$$

A $\tau$-near-interpolator (not necessarily of the minimum norm) is any $\bm{\beta} \in \mathbb{R}^{p}$ satisfying the inequality in (1).

In the overparameterized ($p > n$) regime, near-interpolators can often be realized by ridge regression:

Definition 1.2.

Let $\varrho > 0$. The ridge regressor with regularizer $\varrho$ is defined as

$$\hat{\bm{\beta}}_{\varrho} := \mathrm{argmin}_{\bm{\beta} \in \mathbb{R}^{p}} \tfrac{1}{n}\|\mathbf{X}^{\top}\bm{\beta} - \mathbf{y}\|_{2}^{2} + \varrho\|\bm{\beta}\|_{2}^{2}. \tag{2}$$
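Setting the gradient of the ridge objective to zero gives the closed form $\hat{\bm{\beta}}_{\varrho} = (\hat{\bm{\Sigma}} + \varrho\mathbf{I}_p)^{-1} n^{-1}\mathbf{X}\mathbf{y}$, which by the push-through identity equals $n^{-1}\mathbf{X}(\check{\mathbf{G}} + \varrho\mathbf{I}_n)^{-1}\mathbf{y}$. The second form only requires an $n \times n$ solve, which is cheaper when $p > n$. A minimal NumPy sketch follows; the function names are hypothetical.

```python
import numpy as np


def ridge_primal(X, y, rho):
    """Solve (Sigma_hat + rho * I_p) beta = X y / n, where X has shape (p, n)."""
    p, n = X.shape
    return np.linalg.solve(X @ X.T / n + rho * np.eye(p), X @ y / n)


def ridge_dual(X, y, rho):
    """Equivalent form beta = X (G_check + rho * I_n)^{-1} y / n."""
    p, n = X.shape
    return X @ np.linalg.solve(X.T @ X / n + rho * np.eye(n), y) / n
```

Both functions return the same vector up to numerical error; the dual form is the natural choice in the overparameterized regime considered here.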

Main problems: For any $\tau \in (0, \sigma^{2})$, (1) find a sequence of regularizers $\{\varrho_n\}_n$ so that

$$\mathcal{E}_{\mathtt{train}}^{\ast} := \lim_{n\to\infty} \mathbb{E}[\mathcal{E}_{\mathtt{train}}^{n}(\hat{\bm{\beta}}_{\varrho_n})] = \tau \tag{3}$$

where the expectation is over all sources of randomness, and (2) compute the associated asymptotic test error:

$$\mathcal{E}_{\mathtt{test}}^{\ast} := \lim_{n\to\infty} \mathcal{E}_{\mathtt{test}}^{n}(\hat{\bm{\beta}}_{\varrho_n}). \tag{4}$$

Definition 1.3.

Let $\tau \in (0, \sigma^{2})$ be arbitrary and $\{\varrho_n\}_n$ be any sequence of regularizers. If Equation 3 holds, then we say the sequence of regressors $\{\hat{\bm{\beta}}_{\varrho_n}\}_n$ is an asymptotic $\tau$-near interpolator with asymptotic test error given by Equation 4.
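For a single finite sample, a regularizer achieving a prescribed training error can be found numerically. The residual of the ridge regressor is $\varrho(\check{\mathbf{G}} + \varrho\mathbf{I}_n)^{-1}\mathbf{y}$, so its training MSE is nondecreasing in $\varrho$ and bisection applies. The sketch below bisects on $\log\varrho$; the function names, bracket and iteration count are illustrative assumptions, and the code targets the realized training error of one draw rather than the expectation in Equation 3.

```python
import numpy as np


def train_mse_ridge(evals, coeffs_sq, rho, n):
    """Ridge training MSE: (1/n) * sum_i (rho / (g_i + rho))^2 * (u_i^T y)^2."""
    return np.sum((rho / (evals + rho)) ** 2 * coeffs_sq) / n


def regularizer_for_train_error(X, y, tau, lo=1e-12, hi=1e12, iters=200):
    """Bisect on log(rho) so that the ridge training MSE is approximately tau.

    X has shape (p, n). Assumes the training MSE is below tau at rho=lo
    and above tau at rho=hi.
    """
    n = X.shape[1]
    evals, U = np.linalg.eigh(X.T @ X / n)   # spectrum of the scaled Gram matrix
    evals = np.clip(evals, 0.0, None)        # guard against tiny negative round-off
    coeffs_sq = (U.T @ y) ** 2
    for _ in range(iters):
        mid = np.sqrt(lo * hi)               # geometric midpoint
        if train_mse_ridge(evals, coeffs_sq, mid, n) < tau:
            lo = mid                         # too little regularization
        else:
            hi = mid
    return np.sqrt(lo * hi)
```

Computing the eigendecomposition once makes each bisection step a cheap vector operation instead of a fresh linear solve.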

Assumption 1.4 (Power-law spectra; also referred to as the eigenvalue decay condition (Goel and Klivans, 2017)).

Suppose there exists $\alpha > 1$ such that the population data covariance matrix is $\bm{\Sigma} = \mathrm{diag}(\lambda_1, \cdots, \lambda_p)$ where $\lambda_i = i^{-\alpha}$.

There are many examples of random matrix ensembles exhibiting power-law spectra in a broader sense than that of Assumption 1.4. For instance, see Arous and Guionnet (2008), Mahoney and Martin (2019) and Wang et al. (2024). For simplicity, we do not pursue a general setting and will consider the setting of Assumption 1.4.

The asymptotic test error of an asymptotic $\tau$-near interpolator can be calculated as follows. Let $F = {}_{2}F_{1}$ denote the Gaussian hypergeometric function (Dutka, 1984, Eqn. (27)).

Figure 1: Left: Synthetic experiments validating the lower bound on the norms of $0.2$-near-interpolators given by Proposition 1.7. The squared norms are fitted by least squares (in log-log space) to estimate the norm-growth exponent $\alpha$ using only data points. See Section 5 for additional experiment details. Right: Trade-off between the testing and training errors from Theorem 1.5. The solid curves are the parametrized curves whose $(x, y)$-coordinates are $(\mathcal{E}_{\mathtt{train}}^{\ast}, \mathcal{E}_{\mathtt{test}}^{\ast})$, parametrized by $k$ (which is in 1-to-1 correspondence with $r$; see Theorem 1.5). The scatter points, subsampled for visualization, denote the results of ridge regression runs on the HDA model (Example 2.8). The colored ribbons denote the 20-80 quantiles for the scatter points. The horizontal dotted line denotes the noise level $\sigma^{2}$, which is set to $1$ without loss of generality.

Theorem 1.5 (Exact trade-off formula).

Let $\tau \in (0, \sigma^{2})$ be arbitrary. Suppose that $\sup_{n=1,2,\dots} \|\bm{\beta}^{\star}\|_{2} < +\infty$, Assumption 1.4 holds, and $X = \bm{\Sigma}^{1/2} Z$ where $Z \sim \mathcal{N}(0, \mathbf{I}_p)$. There exists a unique number $k \in \mathbb{R}_{>0}$ such that the following hold. Define the regularizer-factor

$$r := 1 - \gamma_{\ast}^{-1} F(1, \tfrac{1}{\alpha}; 1 + \tfrac{1}{\alpha}; -k\gamma_{\ast}^{-\alpha})$$

and let $\varrho_n := r n^{-\alpha}$. Then $\{\hat{\bm{\beta}}_{\varrho_n}\}_n$ is an asymptotic $\tau$-near interpolator whose asymptotic test error is

$$\mathcal{E}_{\mathtt{test}}^{\ast} = \sigma^{2} \frac{1}{1 - \gamma_{\ast}^{-1} F(2, \tfrac{1}{\alpha}; 1 + \tfrac{1}{\alpha}; -k\gamma_{\ast}^{-\alpha})}. \tag{5}$$

Moreover, $\mathcal{E}^{\ast}_{\mathtt{test}}$ is a decreasing function w.r.t. $\alpha$, fixing all other quantities.

The reason we call Theorem 1.5 an "exact trade-off formula" is that Equation 5 allows the calculation of the trade-off curve between train and test error (Figure 1-right). The fundamental parameter is $k$. The asymptotic testing error and the training error, i.e., $\tau$, both depend on $k$ via monotonic 1-1 correspondences on the domain $k \in (k_{\mathtt{crit}}, \infty)$. See Figure 3 below. Thus, the asymptotic testing error depends on the training error implicitly through $k$.
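Equation 5 is straightforward to evaluate numerically. The sketch below uses `scipy.special.hyp2f1` for $F$ and provides a cross-check against the integral representation $\int_0^{1/\gamma_{\ast}} (1 + k x^{\alpha})^{-2}\,dx = \gamma_{\ast}^{-1} F(2, \tfrac{1}{\alpha}; 1 + \tfrac{1}{\alpha}; -k\gamma_{\ast}^{-\alpha})$, which follows from Euler's integral formula for ${}_2F_1$. The function names and parameter values are illustrative assumptions; the guard in the code only checks that the denominator of Equation 5 is positive and does not compute $k_{\mathtt{crit}}$.

```python
import numpy as np
from scipy.special import hyp2f1
from scipy.integrate import quad


def asymptotic_test_error(k, alpha, gamma_star, sigma2=1.0):
    """Right-hand side of Equation 5 as a function of the eff-reg-factor k."""
    F2 = hyp2f1(2.0, 1.0 / alpha, 1.0 + 1.0 / alpha, -k * gamma_star ** (-alpha))
    denom = 1.0 - F2 / gamma_star
    if denom <= 0:
        raise ValueError("k is too small: the denominator of Equation 5 is not positive")
    return sigma2 / denom


def F2_by_quadrature(k, alpha, gamma_star):
    """gamma_star * int_0^{1/gamma_star} (1 + k x^alpha)^(-2) dx, used as a cross-check."""
    val, _ = quad(lambda x: (1.0 + k * x ** alpha) ** (-2), 0.0, 1.0 / gamma_star)
    return gamma_star * val
```

Sweeping `k` over a grid and recording the resulting test error is one way to trace out a curve of the kind shown in Figure 1-right, once the matching training error is computed for each `k`.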

Figure 1-right panel demonstrates that, empirically, training and test MSEs concentrate closely around Equation 5. We further discuss in detail the implications of Theorem 1.5 after stating Proposition 3.2.

Figure 1-right shows that for near-interpolators, the test error does not degrade much when training below the noise floor. A natural question is whether this can be explained by data-independent norm-based generalization bounds such as the one found in Koehler et al. (2021). Our next result shows that the expected squared norm of an asymptotic $\tau$-near interpolator grows superlinearly in $n$:

Theorem 1.6 (Rapid norm growth).

In the situation of Theorem 1.5, for any $\tau \in (0, \sigma^{2})$, suppose $\varrho_n > 0$ is a sequence of regularizers such that $\{\hat{\bm{\beta}}_{\varrho_n}\}_n$ is an asymptotic $\tau$-near interpolator. Then $\mathbb{E}[\|\hat{\bm{\beta}}_{\varrho_n}\|_{2}^{2}] = \Omega(n^{\alpha})$.

As a consequence, data-independent norm-based generalization bounds for near-interpolators, similar to the one in Koehler et al. (2021), are necessarily loose. See Section 4.2.

The key technical result that enables the proof of Theorem 1.6 is the following:

Proposition 1.7 (Rapid norm growth - generic).

Suppose Assumption 1.4 holds and the random matrix-theoretic Assumptions 2.5, 2.6 and 2.9 all hold. For any $r > 0$, let $\varrho_n := r n^{-\alpha}$ be the regularizer for the ridge regression. Then $\mathbb{E}[\|\hat{\bm{\beta}}_{\varrho_n}\|_{2}^{2}] = \Omega(n^{\alpha})$.
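The scaling in Proposition 1.7 can be probed with a small Monte Carlo sketch. All concrete choices below are illustrative assumptions and are not taken from the paper's experiments: Gaussian features with $\bm{\Sigma} = \mathrm{diag}(i^{-\alpha})$, $p = 2n$, unit noise variance, $\bm{\beta}^{\star}$ equal to the first standard basis vector, $r = 1$, and five trials per $n$. The code fits the slope of $\log$ mean $\|\hat{\bm{\beta}}_{\varrho_n}\|_2^2$ against $\log n$; a finite-sample slope is only suggestive of the asymptotic exponent.

```python
import numpy as np


def mean_sq_norm(n, alpha, r=1.0, ratio=2, trials=5, seed=0):
    """Monte Carlo estimate of E||beta_hat||_2^2 with rho_n = r * n^(-alpha)."""
    rng = np.random.default_rng(seed)
    p = ratio * n
    lam = np.arange(1, p + 1, dtype=float) ** (-alpha)   # power-law spectrum
    beta_star = np.zeros(p)
    beta_star[0] = 1.0                                   # hypothetical ground truth
    rho = r * float(n) ** (-alpha)
    sq_norms = []
    for _ in range(trials):
        X = np.sqrt(lam)[:, None] * rng.standard_normal((p, n))   # X = Sigma^{1/2} Z
        y = X.T @ beta_star + rng.standard_normal(n)
        # Ridge solution in its n x n (dual) form.
        beta_hat = X @ np.linalg.solve(X.T @ X / n + rho * np.eye(n), y) / n
        sq_norms.append(beta_hat @ beta_hat)
    return float(np.mean(sq_norms))


def fitted_exponent(ns, alpha, **kwargs):
    """Least-squares slope of log(mean squared norm) versus log(n)."""
    norms = [mean_sq_norm(n, alpha, **kwargs) for n in ns]
    slope, _ = np.polyfit(np.log(ns), np.log(norms), 1)
    return slope
```

For example, `fitted_exponent([50, 100, 200], alpha=1.5)` returns an empirical growth exponent that can be compared with $\alpha$.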

Remark 1.8.

Proposition 1.7 still holds when the stronger Assumption 1.4 is replaced by the weaker Assumption 2.2. See Proposition 4.1.

Remark 1.9 (Effective-factor).

The quantity $k$ in Theorem 1.5 has the following interpretation. Let $\kappa := k n^{-\alpha}$, which is known as the effective regularizer in Wei et al. (2022). The connection between the effective regularizer and the statistical learning-theoretic literature's notion of effective dimension is explained in Jacot et al. (2020a, §4.1). For this reason, we refer to $k$ with the shortened name eff-reg-factor.

1.4 Organization

In Section 2, we present the necessary background as well as new technical results on random matrix theory (RMT). In Sections 3 and 4, we sketch the proofs of Theorem 1.5 and Proposition 1.7, respectively. In Section 4.2, we discuss the implication of our results on the looseness of norm-based generalization bounds. In Section 5, we discuss our experiments. We discuss related works and the context of our work in greater detail in Section 6. Finally, we conclude with a discussion of future work and limitations.

2 PRIMER ON RANDOM MATRIX THEORY

We start with a fundamental concept in random matrix theory (RMT), followed by a review of RMT adapted to the power-law spectra setting and a new result (Proposition 2.10) to prepare for our main results.

Definition 2.1 (Empirical spectral measure).

For $c \in \mathbb{R}$, let $\delta_c$ denote the Dirac measure on $\mathbb{R}$ at $c$. Let $\mathbf{M} \in \mathbb{R}^{p \times p}$ be a matrix with real eigenvalues $\lambda_1, \dots, \lambda_p$. The empirical spectral measure of $\mathbf{M}$, denoted by $\mathtt{esd}(\mathbf{M})$, is the measure on $\mathbb{R}$ given by $\mathtt{esd}(\mathbf{M}) = \frac{1}{p}\sum_{i=1}^{p} \delta_{\lambda_i}$.

Our random matrix-theoretic assumptions differ from the standard RMT ones in order to accommodate power-law spectra. The following is a random matrix-theoretic extension of the earlier Assumption 1.4:

Assumption 2.2 (Power-law spectra, RMT version).

In the situation of Section 1.2, let $\alpha > 1$ and $H$ be some probability measure on $\mathbb{R}_{\geq 0}$. Assume that $\mathtt{esd}(n^{\alpha}\bm{\Sigma})$ converges to $H$ (in the sense of convergence in distribution). We refer to $H$ as the $\alpha$-scaled limiting spectral distribution ($\alpha$-scaled LSD).

Morally, we can think of the above $\alpha$ as the same as that of Assumption 1.4.

Remark 2.3 (Comparison with standard LSD).

In RMT, the condition that "$\mathtt{esd}(\bm{\Sigma})$ converges to $H$" is standard, where $H$ is simply referred to as the limiting spectral distribution (LSD) (Bai and Silverstein, 2010). For power-law spectra covariance, i.e., $\bm{\Sigma}$ satisfying Assumption 1.4, $\mathtt{esd}(\bm{\Sigma})$ may not have a measure-theoretic limit while $\mathtt{esd}(n^{\alpha}\bm{\Sigma})$ does, as we will show in Section 4.1.

Definition 2.4 (Stieltjes transform).

Let $\mu$ be a measure on $\mathbb{R}$ with support $S$. The Stieltjes transform of $\mu$ is the (complex-valued) function with input $z\in\mathbb{C}\setminus S$ given by $\mathcal{S}_{\mu}(z):=\int\frac{\mu(t)\,dt}{t-z}$.
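As a quick illustration of Definition 2.4 (a worked special case added for orientation, not a statement taken from the text), applying the definition to an empirical spectral measure turns the integral into a finite sum. Here the integral is read as integration against the measure, and $\mathbf{M}$ is assumed to have real eigenvalues $\lambda_{1},\dots,\lambda_{p}$ as above:

$$\mathcal{S}_{\mathtt{esd}(\mathbf{M})}(z)=\frac{1}{p}\sum_{i=1}^{p}\frac{1}{\lambda_{i}-z},\qquad z\notin\{\lambda_{1},\dots,\lambda_{p}\}.$$

In particular, if all $\lambda_{i}\geq 0$ and $r>0$, then $r\,\mathcal{S}_{\mathtt{esd}(\mathbf{M})}(-r)=\frac{1}{p}\sum_{i=1}^{p}\frac{r}{\lambda_{i}+r}\in(0,1]$, which is the form in which the transform enters the assumptions below.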

Next, we recall the so-called self-consistent equation (Tao, 2011), which relates the regularizer-factor $r$ to the eff-reg-factor $k$:

Assumption 2.5 (Self-consistent equation).

In the situation of Section 1.2, denote by $\lambda_{i}$ the $i$-th largest eigenvalue of $\bm{\Sigma}$. For each $r>0$, there exists a unique $k\equiv k(r)\in\mathbb{R}$ such that the tuple $(r,k)$ satisfies

$$1=\frac{r}{k}+\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{p}\frac{1}{1+kn^{-\alpha}\lambda_{i}^{-1}}.\tag{6}$$

Similar to Remark 2.3, Equation 6 includes an $n^{-\alpha}$ scaling term that does not show up in the standard self-consistent equation. Again, our Assumption 2.5 differs from the standard one in order to deal with the power-law spectra.
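To make the role of the $n^{-\alpha}$ scaling concrete, the following is a minimal sketch that solves a finite-$n$ version of Equation 6 by bisection. It rests on assumptions that are not in the text: the limit is dropped, the spectrum is taken to be exactly $\lambda_{i}=i^{-\alpha}$, and the names `residual` and `solve_k` are hypothetical helpers.

```python
def residual(k, r, n, p, alpha):
    """Finite-n residual of Equation 6: r/k + (1/n) * sum_i 1/(1 + k n^-alpha / lambda_i) - 1.

    Assumption for illustration only: lambda_i = i**(-alpha) exactly, so that
    k * n**(-alpha) / lambda_i = k * (i / n)**alpha.
    """
    spectral_sum = sum(1.0 / (1.0 + k * (i / n) ** alpha) for i in range(1, p + 1))
    return r / k + spectral_sum / n - 1.0


def solve_k(r, n, p, alpha, iters=100):
    """Bisection for the k > 0 with residual(k) = 0 (hypothetical helper).

    The residual is strictly decreasing in k, tends to +infinity as k -> 0+
    and to -1 as k -> infinity, so the root is unique.
    """
    if r <= 0:
        raise ValueError("this sketch only handles r > 0")
    lo, hi = 1e-12 * r, max(1.0, 2.0 * r)
    while residual(hi, r, n, p, alpha) > 0.0:  # grow hi until the sign flips
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if residual(mid, r, n, p, alpha) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Since the spectral sum is positive, any solution must satisfy $r/k<1$, so the returned value always exceeds $r$; this gives a cheap sanity check on the output.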

Next, we state a version of the classical Marchenko-Pastur law for a random matrix $\mathbf{X}$ (and its associated Gram matrix $n^{\alpha}\check{\mathbf{G}}_{n}$):

Assumption 2.6 (Marchenko-Pastur law).

In the setting of Assumption 2.5, further assume that

$$\lim_{n\to\infty}r\,\mathcal{S}_{\mathtt{esd}(n^{\alpha}\check{\mathbf{G}}_{n})}(-r)=k\,\mathcal{S}_{H}(-k),\quad\text{almost surely}$$

and $\lim_{n\to\infty}\tfrac{d}{dr}\left(r\,\mathcal{S}_{\mathtt{esd}(n^{\alpha}\check{\mathbf{G}}_{n})}(-r)\right)=\tfrac{d}{dr}\left(k\,\mathcal{S}_{H}(-k)\right)$. We note that the $k$ on the RHS depends on $r$.

Remark 2.7.

Assumption 2.5 and Assumption 2.6 are standard assumptions in random matrix theory. Both of them are satisfied by the well-studied high-dimensional asymptotic (HDA) model (Example 2.8). For instance, see Dobriban and Wager (2018) under “Marchenko-Pastur theorem”.

The HDA model serves as an exemplary model in random matrix theory possessing many properties that are particularly amenable to analysis. It is defined as:

Example 2.8.

Let $\gamma_{\ast}\in(0,\infty)$. The high-dimensional asymptotic (HDA) model (see Bai and Silverstein (2010); Dobriban and Wager (2018)) is specified by:

  1. $\mathbf{X}=\bm{\Sigma}^{1/2}\mathbf{Z}$, where the entries of $\mathbf{Z}=\{Z_{ij}\}\in\mathbb{R}^{p\times n}$ are i.i.d., have zero mean $\mathbb{E}[Z_{ij}]=0$ and unit variance $\mathbb{E}[Z_{ij}^{2}]=1$. The matrix $\bm{\Sigma}$ is positive semidefinite.

  2. $n/p\to\gamma_{\ast}$.

  3. The spectral distribution of $n^{\alpha}\bm{\Sigma}$ converges to a distribution $H$ supported on $\mathbb{R}_{\geq 0}$.

Note that our Example 2.8 differs somewhat from the conventional HDA model, wherein the third item is “$\bm{\Sigma}$ converges to a distribution $H$”. Since we are working with power-law spectra in the covariance matrix, we require the $n^{\alpha}$ scaling factor in our Example 2.8.

We now state a new random matrix-theoretic assumption that is one of the key steps for proving rapid norm growth under the RMT setting (Proposition 4.1):

Assumption 2.9 (Positivity condition).

In the setting of Assumption 2.5, further assume that for every $r>0$, we have

$$\lim_{n\to\infty}\mathbb{E}\left[\frac{d}{dr}\left(r\,\mathcal{S}_{\mathtt{esd}(n^{\alpha}\check{\mathbf{G}}_{n})}(-r)\right)\right]>0.$$
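To see what Assumption 2.9 asks for, it may help to expand the quantity at finite $n$. The following is an illustrative computation rather than part of the paper's argument; it assumes that $n^{\alpha}\check{\mathbf{G}}_{n}$ is a symmetric positive semidefinite $m\times m$ matrix (as a Gram matrix) with eigenvalues $\mu_{1},\dots,\mu_{m}\geq 0$, where $m$ and $\mu_{i}$ are notation introduced only for this sketch:

$$r\,\mathcal{S}_{\mathtt{esd}(n^{\alpha}\check{\mathbf{G}}_{n})}(-r)=\frac{1}{m}\sum_{i=1}^{m}\frac{r}{\mu_{i}+r},\qquad\frac{d}{dr}\left(r\,\mathcal{S}_{\mathtt{esd}(n^{\alpha}\check{\mathbf{G}}_{n})}(-r)\right)=\frac{1}{m}\sum_{i=1}^{m}\frac{\mu_{i}}{(\mu_{i}+r)^{2}}\geq 0.$$

So the derivative is nonnegative for every finite $n$ under these assumptions; the content of Assumption 2.9 is that its expectation has a strictly positive limit as $n\to\infty$.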

We show that the HDA model satisfies Assumption 2.9, a fact that appears to be new:

Proposition 2.10.

Assumption 2.9 holds for the HDA model.

We prove the proposition in Appendix C. Having introduced the necessary RMT background, we now turn to proving Theorem 1.5 on the interpolation-generalization trade-off.

3 INTERPOLATION-GENERALIZATION TRADE-OFF

Simon et al. (2023) derived “estimates” of the testing and training errors of kernel ridge regression. These estimates, dubbed the eigenlearning framework, are non-rigorous due to invoking a Gaussian universality condition (see Mallinar et al. (2022) for a thorough discussion; works in a similar vein include Bordelon et al. (2020) and Canatar et al. (2021)). However, when the kernel is linear and the data is Gaussian (as is the case in Theorem 1.5), the framework is rigorous; see Jacot et al. (2020b).

Given this, we use the eigenlearning framework to rigorously calculate the asymptotic training and testing error of the estimators in Theorem 1.5. To this end, we first define two key functions of the eff-reg-factor $k$ (see Remark 1.9 for the terminology):

Definition 3.1.

Let $\alpha>1$ and $\gamma_{\ast}\in[0,\infty)$. Define functions $\mathcal{I}(\cdot)\equiv\mathcal{I}_{\alpha,\gamma_{\ast}}(\cdot)$ and $\mathcal{J}(\cdot)\equiv\mathcal{J}_{\alpha,\gamma_{\ast}}(\cdot)$ as

$$\mathcal{I}(k):=\int_{0}^{\tfrac{1}{\gamma_{\ast}}}\frac{dx}{1+kx^{\alpha}},\quad\mathcal{J}(k):=\int_{0}^{\tfrac{1}{\gamma_{\ast}}}\frac{dx}{(1+kx^{\alpha})^{2}}.$$

When $\gamma_{\ast}=0$, we take $1/0:=+\infty$. (Throughout this work, $\alpha$ and $\gamma_{\ast}$ are fixed constants. For brevity, we often simply write $\mathcal{I}$ or $\mathcal{J}$.)

These integrals arise in explicit calculations of the eigenlearning equations for the train and test MSEs applied to our setting. They can be computed via the integral representation of the Gaussian hypergeometric function given in (Dutka, 1984, Eqn. (27)). The calculations are in Section A.1, where we show that $\mathcal{I}$ and $\mathcal{J}$ are, respectively, equal to

$$\int_{0}^{\tfrac{1}{\gamma_{\ast}}}\frac{dx}{1+kx^{\alpha}}=\gamma_{\ast}^{-1}F\left(1,\tfrac{1}{\alpha};1+\tfrac{1}{\alpha};-k\gamma_{\ast}^{-\alpha}\right),\quad\text{and}$$

$$\int_{0}^{\tfrac{1}{\gamma_{\ast}}}\frac{dx}{(1+kx^{\alpha})^{2}}=\gamma_{\ast}^{-1}F\left(2,\tfrac{1}{\alpha};1+\tfrac{1}{\alpha};-k\gamma_{\ast}^{-\alpha}\right).$$
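The details are deferred to Section A.1; the following is one standard route to these identities, given here as a sketch that may differ from the authors' computation. It assumes $\gamma_{\ast}>0$ and uses Euler's integral representation $F(a,b;c;z)=\frac{\Gamma(c)}{\Gamma(b)\Gamma(c-b)}\int_{0}^{1}t^{b-1}(1-t)^{c-b-1}(1-zt)^{-a}\,dt$ for $c>b>0$ and real $z<1$. With $b=\tfrac{1}{\alpha}$ and $c=1+\tfrac{1}{\alpha}$ the prefactor equals $\tfrac{1}{\alpha}$ and the factor $(1-t)^{c-b-1}$ equals $1$, so the substitution $t=u^{\alpha}$ gives, for $w\geq 0$ and $a\in\{1,2\}$,

$$F\left(a,\tfrac{1}{\alpha};1+\tfrac{1}{\alpha};-w\right)=\frac{1}{\alpha}\int_{0}^{1}t^{\frac{1}{\alpha}-1}(1+wt)^{-a}\,dt=\int_{0}^{1}\frac{du}{(1+wu^{\alpha})^{a}}.$$

Rescaling $x=u/\gamma_{\ast}$ then yields

$$\int_{0}^{\tfrac{1}{\gamma_{\ast}}}\frac{dx}{(1+kx^{\alpha})^{a}}=\gamma_{\ast}^{-1}\int_{0}^{1}\frac{du}{(1+k\gamma_{\ast}^{-\alpha}u^{\alpha})^{a}}=\gamma_{\ast}^{-1}F\left(a,\tfrac{1}{\alpha};1+\tfrac{1}{\alpha};-k\gamma_{\ast}^{-\alpha}\right),$$

which is the first identity for $a=1$ and the second for $a=2$.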

To relate the above to the estimates $\mathcal{E}^{\ast}_{\mathtt{test}}$ and $\mathcal{E}^{\ast}_{\mathtt{train}}$ of the testing and training errors in the eigenlearning framework, we prove the following:

Proposition 3.2.

In the situation of Theorem 1.5,

$$\mathcal{E}^{\ast}_{\mathtt{test}}\equiv\lim_{n\to\infty}\mathcal{E}_{\mathtt{test}}^{n}(\hat{\bm{\beta}}_{\varrho})=\sigma^{2}\cdot\frac{1}{1-\mathcal{J}(k)},\quad\text{and}$$

$$\mathcal{E}^{\ast}_{\mathtt{train}}\equiv\lim_{n\to\infty}\mathbb{E}[\mathcal{E}_{\mathtt{train}}^{n}(\hat{\bm{\beta}}_{\varrho})]=\sigma^{2}\cdot\frac{(1-\mathcal{I}(k))^{2}}{1-\mathcal{J}(k)}.$$

Moreover, there exists $k_{\mathtt{crit}}\in\mathbb{R}_{\geq 0}$ such that

  1. For each $r>0$, there exists a unique $k\in(k_{\mathtt{crit}},+\infty)$ such that $r=\mathcal{R}(k):=k(1-\mathcal{I}(k))$,

  2. $\mathcal{R}$ is monotonically increasing on $(k_{\mathtt{crit}},+\infty)$,

  3. $\mathcal{E}^{\ast}_{\mathtt{test}}>\sigma^{2}$ for all $k\in(k_{\mathtt{crit}},+\infty)$,

  4. $\lim_{k\to+\infty}\mathcal{E}^{\ast}_{\mathtt{test}}=\sigma^{2}$.

For the proof of Proposition 3.2, see Section A.2. At a high level, to prove the first part we apply the eigenlearning framework while accounting for the additional layer of complexity due to the power-law spectra. For the “Moreover” part, we directly analyze $\mathcal{R}$, $\mathcal{E}^{\ast}_{\mathtt{train}}$ and $\mathcal{E}^{\ast}_{\mathtt{test}}$ as functions of $r$, $k$ and $\alpha$. Now, note that Proposition 3.2 immediately implies Theorem 1.5.

We now discuss some of the consequences of Proposition 3.2. First, $\mathcal{R}$ is a bijection that relates the eff-reg-factor $k$ and the regularizer-factor $r$. The plot of $\mathcal{R}$ is shown in Figure 3. Furthermore, note that $\lim_{k\to+\infty}\mathcal{E}^{\ast}_{\mathtt{test}}=\sigma^{2}$ precisely states that the test error can be made arbitrarily close to the noise floor as $k$ (equivalently, $r$) goes to infinity.

Figure 2: Larger power-law spectra exponent implies larger asymptotic excess test error when interpolating to $5\%$ of the noise floor compared to $50\%$ of the noise floor. Let $\mathcal{E}^{\ast}_{\mathtt{test}}=\mathcal{E}^{\ast}_{\mathtt{test}}(\alpha,\gamma_{\ast},\tau)$ be as in Theorem 1.5, where we make the dependency on the parameters $\alpha,\gamma_{\ast},\tau$ explicit. The colors and contour lines of the plot show the ratio of test errors at two levels of nearness of interpolation, $\mathcal{E}^{\ast}_{\mathtt{test}}(\alpha,\gamma_{\ast},0.05)/\mathcal{E}^{\ast}_{\mathtt{test}}(\alpha,\gamma_{\ast},0.5)$, over an $(\alpha,\gamma_{\ast})$-grid.

Using Proposition 3.2 with the implementation of ${}_{2}F_{1}$ in SciPy, we illustrate the trade-off between the training error and the test error in Figure 1-Right and the test error ratio in Figure 2.

Remark 3.3 (Data-independent regularizer-selection).

Let $\tau\in(0,\sigma^{2})$ be a desired level of nearness of interpolation. To select a regularizer $\varrho_{n}$ that achieves $\tau^{\prime}$-near-interpolation for $\tau^{\prime}\approx\tau$, we use the following method:

Step 1. Find $k\in(k_{\mathtt{crit}},+\infty)$ such that $\mathcal{E}^{\ast}_{\mathtt{train}}=\mathcal{E}^{\ast}_{\mathtt{train}}(k)=\tau$, using the expression for $\mathcal{E}^{\ast}_{\mathtt{train}}(k)$ in Proposition 3.2. Let $k_{\tau}$ be such a $k$.

Step 2. Set $r:=\mathcal{R}(k_{\tau})$, where $\mathcal{R}$ is also as in Proposition 3.2.

Step 3. Set the regularizer as $\varrho_{n}:=rn^{-\alpha}$, matching the scaling used in Proposition 4.1.

4 RAPID NORM GROWTH

Proposition 1.7 can be proven in even greater generality under random matrix-theoretic assumptions.

Proposition 4.1 (Rapid norm growth under RMT).

Suppose Assumptions 2.2, 2.5, 2.6, and 2.9 all hold. For any $r>0$, let $\varrho_{n}:=rn^{-\alpha}$ be the regularizer for the ridge regression. Then $\mathbb{E}[\|\hat{\bm{\beta}}_{\varrho_{n}}\|_{2}^{2}]=\Omega(n^{\alpha})$.

The goal of this section is to sketch the proof for Proposition 4.1. Complete proofs of all results are included in the Appendix. Throughout, we assume the setting of Section 1.2.

The key step is the following lower bound:

Proposition 4.2.

Let $\varrho:=rn^{-\alpha}$. Then we have

$$\mathbb{E}\|\hat{\bm{\beta}}_{\varrho}\|_{2}^{2}\geq n^{\alpha}\sigma^{2}\cdot\mathbb{E}\left[\frac{d}{dr}\left(r\,\mathcal{S}_{\mathtt{esd}(n^{\alpha}\check{\mathbf{G}})}(-r)\right)\right].$$

The proof of Proposition 4.2 and other omitted proofs in this section can be found in Appendix B.

Given Proposition 4.2, the proof of Proposition 4.1 is straightforward:

Proof of Proposition 4.1.

Let

$$L:=\lim_{n\to\infty}\mathbb{E}\left[\tfrac{d}{dr}\left(r\,\mathcal{S}_{\mathtt{esd}(n^{\alpha}\check{\mathbf{G}})}(-r)\right)\right]>0$$

be as in Assumption 2.9. Thus, for all sufficiently large $n$, we have $\mathbb{E}\left[\tfrac{d}{dr}\left(r\,\mathcal{S}_{\mathtt{esd}(n^{\alpha}\check{\mathbf{G}})}(-r)\right)\right]>L/2>0$. By Proposition 4.2, we get that $\mathbb{E}\|\hat{\bm{\beta}}_{\varrho}\|_{2}^{2}\geq n^{\alpha}\sigma^{2}\cdot\frac{L}{2}$ for all sufficiently large $n$, as desired. ∎

4.1 $\alpha$-scaled limiting spectral distribution

In this section, we check that power-law spectra covariance matrices have an $\alpha$-scaled limiting spectral distribution. In other words, Assumption 1.4 implies Assumption 2.2. This check is necessary because we have proved Proposition 4.1 rather than Proposition 1.7, so we need to make sure that Proposition 1.7 is indeed a special case of Proposition 4.1.

Definition 4.3.

Given a measure $\mu$ on $\mathbb{R}$, we let $\mathtt{cdf}[\mu]$ denote the cumulative distribution function (CDF) of $\mu$.

Proposition 4.4.

Under Assumption 1.4, we have that Assumption 2.2 holds. In other words,

$$\lim_{n\to\infty}\mathtt{cdf}[\mathtt{esd}(n^{\alpha}\bm{\Sigma})](t)=\begin{cases}1-\gamma_{\ast}t^{-1/\alpha}&:t\geq\gamma_{\ast}^{\alpha}\\0&\text{otherwise.}\end{cases}$$
Figure 3: The $\mathcal{R}(k)$ function from Proposition 3.2. The $x$-axis is the input $k$. Note that for $k<k_{\mathtt{crit}}$ the regularizer $r$ is negative. Although we are only interested in the $(k_{\mathtt{crit}},+\infty)$ portion, negative regularizers have been studied by Tsigler and Bartlett (2020); Wu and Xu (2020).
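The limit in Proposition 4.4 can be checked numerically at finite $n$. The sketch below assumes, for illustration only, that the spectrum is exactly $\lambda_{i}=i^{-\alpha}$ and that $p$ is $n/\gamma_{\ast}$ rounded to an integer; `empirical_cdf` and `limiting_cdf` are hypothetical helper names.

```python
def empirical_cdf(t, n, alpha, gamma):
    """CDF of esd(n^alpha * Sigma) at t, assuming lambda_i = i**(-alpha) exactly.

    The scaled eigenvalues are n^alpha * i^(-alpha) = (n / i)**alpha for i = 1..p,
    with p = round(n / gamma).
    """
    p = round(n / gamma)
    count = sum(1 for i in range(1, p + 1) if (n / i) ** alpha <= t)
    return count / p


def limiting_cdf(t, alpha, gamma):
    """Limit stated in Proposition 4.4."""
    if t >= gamma ** alpha:
        return 1.0 - gamma * t ** (-1.0 / alpha)
    return 0.0
```

The reason the two agree is that $(n/i)^{\alpha}\leq t$ holds exactly when $i\geq nt^{-1/\alpha}$, so roughly $p-nt^{-1/\alpha}$ of the $p$ eigenvalues lie below $t$, and dividing by $p$ with $n/p\to\gamma_{\ast}$ gives $1-\gamma_{\ast}t^{-1/\alpha}$.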

4.2 Looseness of norm-based generalization bounds

Conjecturally, a norm-based generalization bound for near-interpolators should have the following form: under suitable assumptions, with high probability

$$\sup_{\bm{\beta}:\,\|\bm{\beta}\|\leq B,\ \mathcal{E}_{\mathtt{train}}(\bm{\beta})\leq\tau}\mathcal{E}_{\mathtt{test}}(\bm{\beta})=O\left(\frac{B^{2}\,\mathrm{Tr}(\bm{\Sigma})}{n}\right),$$

where $\tau\in\mathbb{R}_{\geq 0}$. For $\tau=0$, the best known bound for (perfect) interpolators is given by Koehler et al. (2021, Corollary 1) under a Gaussian assumption on the data, $X\sim\mathcal{N}(0,\bm{\Sigma})$, and $B\geq\|\bm{\beta}^{\star}\|$.

To the best of our knowledge, there is no known extension to the case of near-interpolators, i.e., where $\tau>0$. However, even if such a bound were available, it would not be informative for our scenario, since by Theorem 1.5 and Proposition 1.7, it is possible to choose $\varrho_{n}$ such that (1) $\mathbb{E}[\mathcal{E}_{\mathtt{train}}(\hat{\bm{\beta}}_{\varrho_{n}})]\to\tau$, (2) $\mathcal{E}_{\mathtt{test}}(\hat{\bm{\beta}}_{\varrho_{n}})\to c\in\mathbb{R}_{\geq 0}$, and (3) $\|\hat{\bm{\beta}}_{\varrho_{n}}\|^{2}=\Omega(n^{\alpha})$ for any $\alpha>1$. Thus, the bound goes to infinity while $\mathcal{E}_{\mathtt{test}}(\hat{\bm{\beta}}_{\varrho_{n}})$ is finite.
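The divergence can be made explicit with a short calculation. This is a sketch under simplifying assumptions that go beyond the text: $\lambda_{i}(\bm{\Sigma})=i^{-\alpha}$ exactly with $\alpha>1$, and the norm budget satisfies $B^{2}\geq Cn^{\alpha}$ for some constant $C>0$ (a hypothetical constant) so that the ball of radius $B$ contains the near-interpolator. Writing $\zeta$ for the Riemann zeta function,

$$1\leq\mathrm{Tr}(\bm{\Sigma})=\sum_{i=1}^{p}i^{-\alpha}\leq\zeta(\alpha)<\infty,\qquad\text{so}\qquad\frac{B^{2}\,\mathrm{Tr}(\bm{\Sigma})}{n}\geq Cn^{\alpha-1}\xrightarrow[n\to\infty]{}\infty.$$

The trace stays bounded because $\alpha>1$, so the growth of the bound is driven entirely by the norm, at rate at least $n^{\alpha-1}$.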

5 EXPERIMENTS

We run two types of synthetic experiments. The first type, plotted in Figure 1, employs (linear) ridge regression. The second type, plotted in Figure 4, employs neural networks. The data for both types of experiments are drawn from the HDA model (Example 2.8). Moreover, we conduct experiments on several real-world regression datasets from the UCI regression collection.

5.1 Experiments on synthetic data

To generate Figure 1-left, we run experiments with $\alpha \in \{1.25, 2.5\}$ and $\tfrac{n}{p} = \gamma_{\ast} = \tfrac{2}{3}$. We sweep over the train MSE parameter $\mathcal{E}_{\mathtt{train}}^{*}$ to explore the trade-off between the training and testing MSE in linear ridge regression as described in Theorem 1.5. The values of $\mathcal{E}_{\mathtt{train}}^{*}$ are swept on a linearly-spaced grid of size $16$ from $0.05\sigma^{2}$ to $0.8\sigma^{2}$. The parameters are $n_{\mathtt{train}} = 5000$, $n_{\mathtt{test}} = 1000$, $\gamma_{\ast} = 0.5$, $\alpha = 1.75$ and $\sigma^{2} = 1$.

The regularizer achieving a desired training MSE $\tau \in (0, \sigma^{2})$ is chosen according to the method described in Remark 3.3. We sample $\bm{\beta}^{\star} \in \mathbb{R}^{p}$ such that the entries $\bm{\beta}^{\star}_{i}$ are i.i.d. Gaussian with zero mean and variance $10/p$.
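The selection of a regularizer that attains a target training MSE can be sketched as follows. Remark 3.3 is not reproduced in this section, so the sketch substitutes plain bisection on the empirical training MSE, which is monotone in the ridge penalty; it is not necessarily the procedure of Remark 3.3. The Gaussian design with covariance eigenvalues $i^{-\alpha}$ is only an illustrative stand-in for the HDA model, and the function names `train_mse` and `find_regularizer`, the small problem sizes, and the search bracket are all assumptions made for this example.

```python
import numpy as np

def train_mse(eigvals, y_rot, n, lam):
    """Training MSE of ridge regression with penalty lam.

    For beta = argmin (1/n)||y - X b||^2 + lam ||b||^2, the residual equals
    n*lam * (X X^T + n*lam I)^{-1} y, so the MSE has a closed form in the
    eigenbasis of X X^T.
    """
    shrink = n * lam / (eigvals + n * lam)
    return float(np.mean((shrink * y_rot) ** 2))

def find_regularizer(X, y, tau, lo=1e-12, hi=1e6, iters=200):
    """Bisection in log(lam) for the penalty whose training MSE equals tau.

    Assumes train_mse(lo) < tau <= train_mse(hi) (hypothetical bracket).
    """
    n = X.shape[0]
    eigvals, U = np.linalg.eigh(X @ X.T)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against round-off
    y_rot = U.T @ y
    log_lo, log_hi = np.log(lo), np.log(hi)
    for _ in range(iters):
        mid = 0.5 * (log_lo + log_hi)
        if train_mse(eigvals, y_rot, n, np.exp(mid)) < tau:
            log_lo = mid  # training error too small: regularize more
        else:
            log_hi = mid  # training error too large: regularize less
    lam = np.exp(0.5 * (log_lo + log_hi))
    return lam, train_mse(eigvals, y_rot, n, lam)

# Illustrative data (assumption): Gaussian features with power-law covariance.
rng = np.random.default_rng(0)
n, p, alpha, sigma2 = 200, 300, 1.75, 1.0
spectrum = np.arange(1, p + 1, dtype=float) ** (-alpha)
X = rng.standard_normal((n, p)) * np.sqrt(spectrum)
beta_star = rng.normal(0.0, np.sqrt(10.0 / p), size=p)  # variance 10/p
y = X @ beta_star + rng.normal(0.0, np.sqrt(sigma2), size=n)

tau = 0.5 * sigma2  # target training MSE, below the noise floor
lam, achieved = find_regularizer(X, y, tau)
print(f"lambda = {lam:.3e}, training MSE = {achieved:.6f}")
```

Working in the eigenbasis of $XX^{\top}$ means the decomposition is computed once, so each bisection step costs only $O(n)$.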

The same setup is used for Figure 1-right, except we sweep over $n_{\mathtt{train}}$ rather than over $\mathcal{E}_{\mathtt{train}}^{*}$. The values of $n_{\mathtt{train}}$ are swept on a logarithmically-spaced grid of size $20$ from $200$ to $5000$.
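For concreteness, the two sweep grids described in this subsection could be built as below. This is a minimal sketch, not the authors' code: the variable names are hypothetical, and rounding the training-set sizes to integers is an assumption.

```python
import numpy as np

sigma2 = 1.0  # noise variance, as in the stated parameters

# 16 linearly spaced target training MSEs from 0.05*sigma^2 to 0.8*sigma^2
train_mse_grid = np.linspace(0.05 * sigma2, 0.8 * sigma2, 16)

# 20 logarithmically spaced training-set sizes from 200 to 5000
# (rounded to integers; the rounding rule is an assumption)
n_train_grid = np.rint(np.geomspace(200, 5000, 20)).astype(int)

print(train_mse_grid)
print(n_train_grid)
```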

For Figure 4, the same setup is used as in Figure 1, except that ridge regression is replaced with neural networks during training (we use the default settings in sklearn, except with early stopping for near-interpolation). We emphasize that the ground truth data (i.e., the teacher) is still generated via the same linear function $\bm{\beta}^{\star}$. All code for the experiments is included in Appendix E.

Remark 5.1.

As discussed in the introduction, near-interpolating neural networks and their interpolation-generalization trade-offs exhibit phenomena similar to those in the case of ridge regression. Namely, a larger power-law spectrum exponent implies a larger asymptotic excess test error when interpolating to $5\%$ of the noise floor compared to $50\%$ of the noise floor, for instance.

Remark 5.2.

Intriguingly, in both the ridge regression (Figure 1) and neural network (Figure 4) experiments, the setting with the larger value of the norm-growth exponent $\alpha$ (right panels) results in poorer trade-offs (left panels). For ridge regression, this is explained by the “Moreover” part of Theorem 1.5. We believe it is an interesting future direction to determine whether this behavior can be proved in the neural network setting.

Figure 4: Experiments with neural networks (top row: 1-hidden layer, bottom row: 5-hidden layers). Analogous to Figure 1. See Section 5 and Remark 5.1 for details.

5.2 Experiments on UCI datasets

We conduct experiments on the forest and the stock datasets from the UCI regression collection (Kelly et al., 2023). Using neural tangent kernels (Arora et al., 2019), we observe power-law spectra in the kernel matrices for both of these datasets (Figure 5-right and Figure 6-right). In this subsection, we discuss the experiments on the forest dataset in relation to our theoretical results. Due to space constraints, we refer the reader to Appendix F for details of the experimental setup.

In Figure 5-left, note that the curve corresponding to forest.2-1 has the fastest spectral decay and simultaneously the worst trade-off. Evidently, a larger decay exponent corresponds to a poorer trade-off, especially for near-interpolators, i.e., as the training error approaches $0$. This is in agreement with our theoretical results under random matrix theory assumptions illustrated in Figure 2. A similar phenomenon occurs for the stock dataset. See Appendix F.
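One simple way to compare decay exponents across kernels is a least-squares fit of log-eigenvalue against log-index. The sketch below is an illustration under that assumption; the procedure actually used for the figures is described in Appendix F, which is not reproduced here, and the function name `fit_decay_exponent` and its `skip` parameter are hypothetical.

```python
import numpy as np

def fit_decay_exponent(eigvals, skip=0):
    """Estimate alpha in lambda_i ~ i^(-alpha) by a log-log linear fit.

    eigvals: eigenvalues of a kernel or covariance matrix, in any order.
    skip:    number of leading eigenvalues to drop before fitting
             (hypothetical knob for ignoring a non-power-law head).
    """
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    lam = lam[lam > 0]  # the logarithm needs strictly positive values
    idx = np.arange(1, lam.size + 1)
    slope, _ = np.polyfit(np.log(idx[skip:]), np.log(lam[skip:]), 1)
    return -slope

# Sanity check on an exact synthetic power law with exponent 1.75
synthetic = np.arange(1, 501, dtype=float) ** (-1.75)
estimate = fit_decay_exponent(synthetic)
print(estimate)
```

On real kernel matrices the smallest eigenvalues are often dominated by numerical noise, so the fitted range matters; any such truncation is a modeling choice, not something specified in the text.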

6 ADDITIONAL RELATED WORKS AND NOVELTY OF OUR WORK

Technical Novelty of Theorem 1.7. Prior works of Hastie et al., (2022); Derezinski et al., (2019) require a lower bound on the smallest eigenvalue of the covariance matrix. Hence, the scenario studied in this paper is not amenable to their results. Further, Theorem 1.5 shows that when we have a sharper decay (larger $\alpha$) of the eigenvalues, we have a worse trade-off between the test and training error. Hence, we show that the scenario from Hastie et al., (2022) is the most benign one. This well-conditioning assumption is relaxed in Cheng and Montanari, (2022); however, they require that $\|\Sigma^{-1/2}\beta\|$ is finite. We do not require this assumption. Finally, Dobriban and Wager, (2018) does not need the well-conditioning assumption but instead needs to assume an isotropic distribution on $\beta$. Since we work with fixed $\beta$, their results do not apply.

Trade-offs in interpolation-based learning. Prior works (Ghosh and Belkin, 2022; Belkin et al., 2018; Sonthalia et al., 2023) have studied the trade-off between near-interpolation and generalization. For regression, previous works have also studied the fundamental trade-off in learning algorithms between overparametrization and (Lipschitz) smoothness (Bubeck and Sellke, 2021), and robustness and smoothness (Zhang et al., 2022).

Looseness of existing generalization bounds. Belkin et al., (2018, Theorem 1) establishes for classification that the RKHS norm of a “near-interpolating” classifier grows at rate $\Omega(\exp(n^{1/p}))$. The growth is unbounded if $n = \Omega(\exp(p))$. If the number of samples $n = \Theta(\mathrm{poly}(p))$, then the lower bound does not grow to infinity. While our results are for regression and thus not directly comparable, our lower bound is meaningful in the more practical $n \propto p$ regime.

Power-law spectra datasets. Synthetic data with artificial power law EVD covariance have been used frequently as toy examples (Berthier et al., 2020; Mallinar et al., 2022). On real datasets, power law EVD is often observed to describe neural tangent kernels (NTK) well in practice, including on MNIST (Bahri et al., 2021, Fig. 4; Velikanov and Yarotsky, 2022, Fig. 2), Fashion-MNIST (Cui et al., 2021, Fig. 7), Caltech 101 (Murray et al., 2022, Fig. 1), and CIFAR-100 (Wei et al., 2022, Fig. 3).

Figure 5: Left. Training/testing error trade-off on the “forest” dataset from the UCI regression dataset collection using kernel ridge regression with the neural tangent kernel. Each curve is labeled by “DatasetName.d-f”, where “d” and “f” represent the number of layers and the number of fixed layers in the NTK corresponding to ReLU networks. Right. The plot of eigenvalue against eigenvalue index for the NTK matrix exhibits power-law spectra. A tiny value is added to the eigenvalues for better visualization on the log-scale.

Theoretical machine learning works using power-law spectra. Bordelon et al., (2020) shows that power law EVD implies a power law learning curve. Velikanov and Yarotsky, (2021, §6.2) computes the power law EVD exponent for certain NTKs with ReLU to be $\alpha = 1 + \frac{1}{d}$. Murray et al., (2022) computes the EVD for NTKs with several different activations. Bartlett et al., (2020, Theorem 6) shows that benign overfitting occurs when the covariance matrix eigenvalues are $\lambda_{i} = i^{-1}\log^{-b}(i+1)$ for $b > 1$. Mallinar et al., (2022) studies power law decay for $\alpha \geq 1$ and proposes a taxonomy of overfitting into three categories: catastrophic, tempered and benign. The EVD condition is also known as the capacity condition in the kernel ridge regression literature. See Bietti et al., (2021) and the references therein.

Random matrix theory (RMT). The signal processing research community has long been using RMT for theoretical analysis (Couillet and Debbah, 2012). Increasingly, RMT has also been applied to machine learning as a key tool for analysis.

In particular, Dobriban and Wager, (2018); Hastie et al., (2022); Jacot et al., (2020b); Liang and Rakhlin, (2020) have applied RMT to (kernel) ridge regression; Sonthalia and Nadakuditi, (2023); Kausik et al., (2023) use it to understand generalization of linear denoisers; and Paquette et al., (2022, 2021) use the so-called local Marchenko-Pastur law (Knowles and Yin, 2017) to analyze gradient-based algorithms. Finally, Wei et al., (2022) also applies such a local law to analyze the so-called generalized cross-validation (GCV) estimator.

7 DISCUSSION AND LIMITATIONS

We conclude with several future research directions that we believe will be fruitful:

Connection to early stopping. Typically, early stopping prevents the trained algorithm from perfectly interpolating the data. Can early stopped learning theory results, e.g., Ji et al., (2021); Kuzborskij and Szepesvári, (2022), be applied to analyze near-interpolators?

Near-interpolators and uniform convergence generalization bound. Is it possible to use a uniform convergence-based approach to give a non-vacuous generalization bound under the setting studied in this work? This question has already been raised by Dobriban and Wager, (2018) in the context of classification in a similar setting. An interesting question is whether classical learning theory can be used to obtain results that are currently only obtained via random matrix theoretic or similar techniques. Another approach is to extend the results of Koehler et al., (2021) to the near-interpolation setting.

Limitations. Our work is restricted to analyzing a random matrix model. Understanding the phenomenon uncovered in this paper in more general models and additional real-world settings will be needed. Moreover, our work does not rule out the existence of a uniform convergence generalization bound.

Code availability. Code used to run and plot the experiments shown in all figures is available at https://github.com/YutongWangUMich/Near-Interpolators-Figures.

Acknowledgements

YW acknowledges support from the Eric and Wendy Schmidt AI in Science Postdoctoral Fellowship, a Schmidt Futures program. WH acknowledges support from the Google Research Scholar program.

References

  • Arora et al., (2019) Arora, S., Du, S. S., Li, Z., Salakhutdinov, R., Wang, R., and Yu, D. (2019). Harnessing the power of infinitely wide deep nets on small-data tasks. In International Conference on Learning Representations.
  • Arous and Guionnet, (2008) Arous, G. B. and Guionnet, A. (2008). The spectrum of heavy tailed random matrices. Communications in Mathematical Physics, 278(3):715–751.
  • Bahri et al., (2021) Bahri, Y., Dyer, E., Kaplan, J., Lee, J., and Sharma, U. (2021). Explaining neural scaling laws. arXiv preprint arXiv:2102.06701.
  • Bai and Silverstein, (2010) Bai, Z. and Silverstein, J. W. (2010). Spectral analysis of large dimensional random matrices, volume 20. Springer.
  • Bartlett et al., (2020) Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. (2020). Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070.
  • Belkin et al., (2018) Belkin, M., Ma, S., and Mandal, S. (2018). To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, pages 541–549. PMLR.
  • Berthier et al., (2020) Berthier, R., Bach, F., and Gaillard, P. (2020). Tight nonparametric convergence rates for stochastic gradient descent under the noiseless linear model. Advances in Neural Information Processing Systems, 33:2576–2586.
  • Bietti et al., (2021) Bietti, A., Venturi, L., and Bruna, J. (2021). On the sample complexity of learning with geometric stability. In Advances in Neural Information Processing Systems.
  • Bordelon et al., (2020) Bordelon, B., Canatar, A., and Pehlevan, C. (2020). Spectrum dependent learning curves in kernel regression and wide neural networks. In International Conference on Machine Learning, pages 1024–1034. PMLR.
  • Bubeck and Sellke, (2021) Bubeck, S. and Sellke, M. (2021). A universal law of robustness via isoperimetry. Advances in Neural Information Processing Systems, 34:28811–28822.
  • Canatar et al., (2021) Canatar, A., Bordelon, B., and Pehlevan, C. (2021). Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. Nature communications, 12(1):1–12.
  • Cheng and Montanari, (2022) Cheng, C. and Montanari, A. (2022). Dimension free ridge regression. arXiv preprint arXiv:2210.08571.
  • Couillet and Debbah, (2012) Couillet, R. and Debbah, M. (2012). Signal processing in large systems: A new paradigm. IEEE Signal Processing Magazine, 30(1):24–39.
  • Cui et al., (2021) Cui, H., Loureiro, B., Krzakala, F., and Zdeborová, L. (2021). Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime. In Advances in Neural Information Processing Systems, pages 10131–10143.
  • Derezinski et al., (2019) Derezinski, M., Liang, F. T., and Mahoney, M. W. (2019). Exact expressions for double descent and implicit regularization via surrogate random design. ArXiv, abs/1912.04533.
  • Dobriban and Wager, (2018) Dobriban, E. and Wager, S. (2018). High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1):247–279.
  • Dutka, (1984) Dutka, J. (1984). The early history of the hypergeometric function. Archive for History of Exact Sciences, pages 15–34.
  • Ghosh and Belkin, (2022) Ghosh, N. and Belkin, M. (2022). A universal trade-off between the model size, test loss, and training loss of linear predictors. arXiv preprint arXiv:2207.11621.
  • Goel and Klivans, (2017) Goel, S. and Klivans, A. (2017). Eigenvalue decay implies polynomial-time learnability for neural networks. Advances in Neural Information Processing Systems, 30.
  • Hastie et al., (2022) Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. (2022). Surprises in High-Dimensional Ridgeless Least Squares Interpolation. The Annals of Statistics.
  • Jacot et al., (2020a) Jacot, A., Simsek, B., Spadaro, F., Hongler, C., and Gabriel, F. (2020a). Implicit regularization of random feature models. In International Conference on Machine Learning, pages 4631–4640. PMLR.
  • Jacot et al., (2020b) Jacot, A., Simsek, B., Spadaro, F., Hongler, C., and Gabriel, F. (2020b). Kernel alignment risk estimator: Risk prediction from training data. Advances in Neural Information Processing Systems, 33:15568–15578.
  • Ji et al., (2021) Ji, Z., Li, J., and Telgarsky, M. (2021). Early-stopped neural networks are consistent. Advances in Neural Information Processing Systems, 34:1805–1817.
  • Karp and López, (2017) Karp, D. B. and López, J. L. (2017). Representations of hypergeometric functions for arbitrary parameter values and their use. Journal of Approximation Theory, 218:42–70.
  • Kausik et al., (2023) Kausik, C., Srivastava, K., and Sonthalia, R. (2023). Generalization error without independence: Denoising, linear regression, and transfer learning. arXiv preprint arXiv:2305.17297.
  • Kelly et al., (2023) Kelly, M., Longjohn, R., and Nottingham, K. (2023). The UCI machine learning repository. https://archive.ics.uci.edu.
  • Knowles and Yin, (2017) Knowles, A. and Yin, J. (2017). Anisotropic local laws for random matrices. Probability Theory and Related Fields, 169(1):257–352.
  • Koehler et al., (2021) Koehler, F., Zhou, L., Sutherland, D. J., and Srebro, N. (2021). Uniform convergence of interpolators: Gaussian width, norm bounds and benign overfitting. Advances in Neural Information Processing Systems, 34:20657–20668.
  • Kuzborskij and Szepesvári, (2022) Kuzborskij, I. and Szepesvári, C. (2022). Learning Lipschitz functions by GD-trained shallow overparameterized ReLU neural networks. arXiv preprint arXiv:2212.13848.
  • Liang and Rakhlin, (2020) Liang, T. and Rakhlin, A. (2020). Just interpolate: Kernel “ridgeless” regression can generalize. The Annals of Statistics, 48(3):1329–1347.
  • Mahoney and Martin, (2019) Mahoney, M. and Martin, C. (2019). Traditional and heavy tailed self regularization in neural network models. In International Conference on Machine Learning, pages 4284–4293. PMLR.
  • Mallinar et al., (2022) Mallinar, N. R., Simon, J. B., Abedsoltan, A., Pandit, P., Belkin, M., and Nakkiran, P. (2022). Benign, tempered, or catastrophic: Toward a refined taxonomy of overfitting. In Advances in Neural Information Processing Systems.
  • Murray et al., (2022) Murray, M., Jin, H., Bowman, B., and Montufar, G. (2022). Characterizing the spectrum of the NTK via a power series expansion. arXiv preprint arXiv:2211.07844.
  • Nakkiran et al., (2020) Nakkiran, P., Venkat, P., Kakade, S. M., and Ma, T. (2020). Optimal regularization can mitigate double descent. In International Conference on Learning Representations.
  • Paquette et al., (2021) Paquette, C., Lee, K., Pedregosa, F., and Paquette, E. (2021). SGD in the large: Average-case analysis, asymptotics, and stepsize criticality. In Conference on Learning Theory, pages 3548–3626. PMLR.
  • Paquette et al., (2022) Paquette, C., van Merriënboer, B., Paquette, E., and Pedregosa, F. (2022). Halting time is predictable for large models: A universality property and average-case analysis. Foundations of Computational Mathematics, pages 1–77.
  • Silverstein and Choi, (1995) Silverstein, J. W. and Choi, S.-I. (1995). Analysis of the limiting spectral distribution of large dimensional random matrices. Journal of Multivariate Analysis, 54(2):295–309.
  • Simon et al., (2023) Simon, J. B., Dickens, M., Karkada, D., and Deweese, M. (2023). The eigenlearning framework: A conservation law perspective on kernel ridge regression and wide neural networks. Transactions on Machine Learning Research.
  • Sonthalia et al., (2023) Sonthalia, R., Li, X., and Gu, B. (2023). Under-parameterized double descent for ridge regularized least squares denoising of data on a line. arXiv preprint arXiv:2305.14689.
  • Sonthalia and Nadakuditi, (2023) Sonthalia, R. and Nadakuditi, R. R. (2023). Training data size induced double descent for denoising feedforward neural networks and the role of training noise. Transactions on Machine Learning Research.
  • Tao, (2011) Tao, T. (2011). Intuitive understanding of the Stieltjes transform. MathOverflow. Version: 2011-10-25.
  • Tsigler and Bartlett, (2020) Tsigler, A. and Bartlett, P. L. (2020). Benign overfitting in ridge regression. arXiv preprint arXiv:2009.14286.
  • Velikanov and Yarotsky, (2021) Velikanov, M. and Yarotsky, D. (2021). Explicit loss asymptotics in the gradient descent training of neural networks. Advances in Neural Information Processing Systems, 34:2570–2582.
  • Velikanov and Yarotsky, (2022) Velikanov, M. and Yarotsky, D. (2022). Tight convergence rate bounds for optimization under power law spectral conditions. arXiv preprint arXiv:2202.00992.
  • Wang et al., (2024) Wang, Z., Engel, A., Sarwate, A. D., Dumitriu, I., and Chiang, T. (2024). Spectral evolution and invariance in linear-width neural networks. Advances in Neural Information Processing Systems, 36.
  • Wei et al., (2022) Wei, A., Hu, W., and Steinhardt, J. (2022). More than a toy: Random matrix models predict how real-world neural representations generalize. In Proceedings of the 39th International Conference on Machine Learning, pages 23549–23588. PMLR.
  • Wu and Xu, (2020) Wu, D. and Xu, J. (2020). On the optimal weighted $\ell_{2}$ regularization in overparameterized linear regression. Advances in Neural Information Processing Systems, 33:10112–10123.
  • Zhang et al., (2017) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations.
  • Zhang et al., (2021) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2021). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115.
  • Zhang et al., (2022) Zhang, H., Wu, Y., and Huang, H. (2022). How many data are needed for robust learning? arXiv preprint arXiv:2202.11592.

Checklist

  1. For all models and algorithms presented, check if you include:

    (a) A clear description of the mathematical setting, assumptions, algorithm, and/or model. [Yes]

    (b) An analysis of the properties and complexity (time, space, sample size) of any algorithm. [Not Applicable]

    (c) (Optional) Anonymized source code, with specification of all dependencies, including external libraries. [Yes]

  2. For any theoretical claim, check if you include:

    (a) Statements of the full set of assumptions of all theoretical results. [Yes]

    (b) Complete proofs of all theoretical results. [Yes]

    (c) Clear explanations of any assumptions. [Yes]

  3. For all figures and tables that present empirical results, check if you include:

    (a) The code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL). [Yes]

    (b) All the training details (e.g., data splits, hyperparameters, how they were chosen). [Yes]

    (c) A clear definition of the specific measure or statistics and error bars (e.g., with respect to the random seed after running experiments multiple times). [Yes]

    (d) A description of the computing infrastructure used. (e.g., type of GPUs, internal cluster, or cloud provider). [Yes]

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets, check if you include:

    (a) Citations of the creator If your work uses existing assets. [Not Applicable]

    (b) The license information of the assets, if applicable. [Not Applicable]

    (c) New assets either in the supplemental material or as a URL, if applicable. [Not Applicable]

    (d) Information about consent from data providers/curators. [Not Applicable]

    (e) Discussion of sensible content if applicable, e.g., personally identifiable information or offensive content. [Not Applicable]

  5. If you used crowdsourcing or conducted research with human subjects, check if you include:

    (a) The full text of instructions given to participants and screenshots. [Not Applicable]

    (b) Descriptions of potential participant risks, with links to Institutional Review Board (IRB) approvals if applicable. [Not Applicable]

    (c) The estimated hourly wage paid to participants and the total amount spent on participant compensation. [Not Applicable]

Appendix

Appendix A Proof of Theorem 1.5 — the exact trade-off formula

Our goal is to calculate the asymptotic test error $\mathcal{E}_{\mathtt{test}}^{\ast}$ under the assumptions of Theorem 1.5. This is accomplished through the following three steps.

The first step is to calculate the closed-form solution for the integrals defined in Definition 3.1, which are key ingredients for the expression of $\mathcal{E}_{\mathtt{test}}^{\ast}$. This is done in Section A.1. The second step is to relate the integrals from Definition 3.1 to the self-consistent equations in Equation 6. This is done in Section A.2. The final step is to relate the self-consistent equations in Equation 6 to the asymptotic test error $\mathcal{E}_{\mathtt{test}}^{\ast}$. This is done in Section A.3.

A.1 Closed-form expression for the integrals in Definition 3.1

We prove the identities

$$\int_{0}^{\tfrac{1}{\gamma_{\ast}}} \frac{dx}{1 + k x^{\alpha}} = \gamma_{\ast}^{-1} F(1, \tfrac{1}{\alpha}; 1 + \tfrac{1}{\alpha}; -k\gamma_{\ast}^{-\alpha}), \quad \text{and}$$
$$\int_{0}^{\tfrac{1}{\gamma_{\ast}}} \frac{dx}{(1 + k x^{\alpha})^{2}} = \gamma_{\ast}^{-1} F(2, \tfrac{1}{\alpha}; 1 + \tfrac{1}{\alpha}; -k\gamma_{\ast}^{-\alpha}).$$

as shown in the main text following Definition 3.1.
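As a numerical sanity check of the two identities, one can compare direct quadrature of the left-hand sides with the hypergeometric right-hand sides evaluated through scipy.special.hyp2f1. This is a minimal sketch: the helper names `lhs` and `rhs` and the parameter values are illustrative assumptions, not taken from the experiments.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import hyp2f1

def lhs(power, k, alpha, gamma):
    """Integral of (1 + k x^alpha)^(-power) over [0, 1/gamma], by quadrature."""
    value, _ = quad(lambda x: (1.0 + k * x**alpha) ** (-power), 0.0, 1.0 / gamma)
    return value

def rhs(power, k, alpha, gamma):
    """gamma^{-1} * 2F1(power, 1/alpha; 1 + 1/alpha; -k * gamma^{-alpha})."""
    z = -k * gamma ** (-alpha)
    return hyp2f1(power, 1.0 / alpha, 1.0 + 1.0 / alpha, z) / gamma

# Illustrative parameter values (assumption)
k, alpha, gamma = 2.0, 1.75, 2.0 / 3.0
for power in (1, 2):
    print(power, lhs(power, k, alpha, gamma), rhs(power, k, alpha, gamma))
```

The two printed columns should agree up to quadrature error for each value of `power`.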

Proof.

Let ${}_{2}F_{1}(a, b; c; z)$ be the Gauss hypergeometric function. Note that the function is implemented in SciPy as scipy.special.hyp2f1 and is used to plot Figure 1. Let $\Gamma$ denote the Gamma function. The integral representation of the Gauss hypergeometric function is well-known (we use the formula stated in Karp and López, (2017)) and is given by

$${}_{2}F_{1}(\sigma, a; b; -z) = \frac{\Gamma(b)}{\Gamma(a)\Gamma(b-a)} \int_{0}^{1} \frac{t^{a-1}(1-t)^{b-a-1}}{(1+zt)^{\sigma}}\, dt. \qquad (7)$$

Moreover, (7) is finite for $z \in \mathbb{C} \setminus (-\infty, -1]$ and $\Re(b-a) > 0$ and $\Re(a) > 0$. Thus, by Equation 7, we have

$${}_{2}F_{1}(1, \tfrac{1}{\alpha}; 1 + \tfrac{1}{\alpha}; -k\gamma_{\ast}^{-\alpha}) = \frac{\Gamma(1 + \tfrac{1}{\alpha})}{\Gamma(\tfrac{1}{\alpha})\Gamma(1)} \int_{0}^{1} \frac{t^{1/\alpha} t^{-1}}{1 + k\gamma_{\ast}^{-\alpha} t}\, dt = \frac{1}{\alpha} \int_{0}^{1} \frac{t^{1/\alpha} t^{-1}}{1 + k\gamma_{\ast}^{-\alpha} t}\, dt$$

where the second equality follows from the well-known identity $z=\Gamma(1+z)/\Gamma(z)$ for the Gamma function. Let $u=t^{1/\alpha}$. Then we have $du=\frac{1}{\alpha}t^{1/\alpha}t^{-1}\,dt$. Thus, by $u$-substitution, we have

$$
\frac{1}{\alpha}\int_{0}^{1}\frac{t^{1/\alpha}t^{-1}}{1+k\gamma_{\ast}^{-\alpha}t}\,dt=\int_{0}^{1}\frac{1}{1+k\gamma_{\ast}^{-\alpha}u^{\alpha}}\,du=\int_{0}^{1/\gamma_{\ast}}\frac{\gamma_{\ast}}{1+kx^{\alpha}}\,dx
$$

where the second equality uses the substitution $x=u/\gamma_{\ast}$. Now, by the definition of $\mathcal{I}$ in Definition 3.1, we have

$$
\mathcal{I}(k)=\int_{0}^{1/\gamma_{\ast}}\frac{1}{1+kx^{\alpha}}\,dx=\frac{1}{\gamma_{\ast}}\int_{0}^{1/\gamma_{\ast}}\frac{\gamma_{\ast}}{1+kx^{\alpha}}\,dx=\frac{1}{\gamma_{\ast}}\,{}_{2}F_{1}(1,\tfrac{1}{\alpha};1+\tfrac{1}{\alpha};-k\gamma_{\ast}^{-\alpha})
$$

as desired. By an analogous calculation, we get $\mathcal{J}(k)=\frac{1}{\gamma_{\ast}}\,{}_{2}F_{1}(2,\tfrac{1}{\alpha};1+\tfrac{1}{\alpha};-k\gamma_{\ast}^{-\alpha})$. ∎
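As a numerical sanity check of the two hypergeometric representations, the following Python sketch compares a quadrature estimate of each integral with a truncated Gauss series. It assumes that Definition 3.1 (not restated in this appendix) defines $\mathcal{J}(k)=\int_{0}^{1/\gamma_{\ast}}(1+kx^{\alpha})^{-2}\,dx$. The parameter values and the helper names `hyp2f1_series` and `simpson` are illustrative choices, and the plain power series used here is only valid when $|k\gamma_{\ast}^{-\alpha}|<1$.

```python
def hyp2f1_series(a, b, c, z, terms=2000):
    """Truncated Gauss series sum_n (a)_n (b)_n / ((c)_n n!) z^n, valid for |z| < 1."""
    total, term = 0.0, 1.0
    for n in range(terms):
        total += term
        # Ratio of consecutive terms of the hypergeometric series.
        term *= (a + n) * (b + n) / ((c + n) * (n + 1)) * z
    return total


def simpson(f, lo, hi, m=4000):
    """Composite Simpson rule with m (even) subintervals."""
    h = (hi - lo) / m
    s = f(lo) + f(hi)
    for j in range(1, m):
        s += (4 if j % 2 else 2) * f(lo + j * h)
    return s * h / 3


# Illustrative values (assumptions, not taken from the paper).
alpha, gamma_star, k = 2.0, 0.5, 0.2
z = -k * gamma_star ** (-alpha)          # equals -0.8 here, inside the unit disk
upper = 1.0 / gamma_star

# Left-hand sides: the integrals, with J assumed to be the squared-denominator analogue of I.
I_quad = simpson(lambda x: 1.0 / (1.0 + k * x ** alpha), 0.0, upper)
J_quad = simpson(lambda x: 1.0 / (1.0 + k * x ** alpha) ** 2, 0.0, upper)

# Right-hand sides: (1 / gamma_star) * 2F1(a, 1/alpha; 1 + 1/alpha; z) with a = 1, 2.
I_hyp = hyp2f1_series(1.0, 1.0 / alpha, 1.0 + 1.0 / alpha, z) / gamma_star
J_hyp = hyp2f1_series(2.0, 1.0 / alpha, 1.0 + 1.0 / alpha, z) / gamma_star

print("I(k): quadrature", I_quad, "series", I_hyp)
print("J(k): quadrature", J_quad, "series", J_hyp)
```

For larger $k$ the argument leaves the unit disk, and one would need an analytically continued implementation of ${}_{2}F_{1}$ instead of the raw series.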

A.2 Proof of Proposition 3.2

We begin by analyzing the functions defined in Definition 3.1 and proving items 1 and 2 of the “Moreover” part of Proposition 3.2:

Proposition A.1.

Let $\mathcal{I}$ and $\mathcal{J}$ be the functions defined in Definition 3.1. Under Assumption 1.4 and Assumption 2.5, we have that $r=\mathcal{R}(k):=k\cdot(1-\mathcal{I}(k))$ and $\tfrac{dr}{dk}=1-\mathcal{J}(k)$.

Furthermore, the following holds:

  1. $\mathcal{R}(k)\asymp k$ for $k\gg 0$.

  2. There exists $k_{\mathtt{crit}}>0$ such that $\mathcal{R}(k_{\mathtt{crit}})=0$, and $\mathcal{R}$ is increasing and positive on $(k_{\mathtt{crit}},+\infty)$.

  3. $\mathcal{J}(k)<1$ for $k\in(k_{\mathtt{crit}},+\infty)$ and $\mathcal{J}(+\infty)=0$.

Proof of Proposition A.1.

We begin by proving the first part: that $r=\mathcal{R}(k):=k\cdot(1-\mathcal{I}(k))$ and $\tfrac{dr}{dk}=1-\mathcal{J}(k)$. Rewrite the limit in Equation 6 as follows:

$$
\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{p}\frac{1}{1+kn^{-\alpha}\lambda_{i}^{-1}}=\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n/\gamma}\frac{1}{1+k(i/n)^{\alpha}}=\int_{0}^{1/\gamma_{\ast}}\frac{dx}{1+kx^{\alpha}}
$$

The right-most equality follows from the definition of the (Riemann) integral. If $\gamma_{\ast}=0$, then $1/\gamma_{\ast}=+\infty$ and the above is interpreted as an improper Riemann integral. Now, rearranging Equation 6, we get the desired formula $r=\mathcal{R}(k):=k\cdot(1-\mathcal{I}(k))$. The formula for $\frac{dr}{dk}$ follows by “differentiating under the integral” (Leibniz integral rule).
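Both steps can be illustrated numerically. The sketch below takes the special case $\alpha=2$, where the integrals have elementary closed forms, and assumes $\lambda_{i}=i^{-\alpha}$ exactly, $p=n/\gamma_{\ast}$, and $\mathcal{J}(k)=\int_{0}^{1/\gamma_{\ast}}(1+kx^{\alpha})^{-2}\,dx$ (our reading of Definition 3.1). It shows the normalized sum approaching the integral as $n$ grows, and compares a finite-difference derivative of $\mathcal{R}$ with $1-\mathcal{J}(k)$. All numerical values and function names are illustrative.

```python
import math

# Illustrative values (assumptions); alpha = 2 admits closed-form integrals.
alpha, gamma_star, k = 2.0, 0.5, 3.0
upper = 1.0 / gamma_star


def I_closed(k):
    # integral_0^upper dx / (1 + k x^2) = arctan(sqrt(k) * upper) / sqrt(k)
    return math.atan(math.sqrt(k) * upper) / math.sqrt(k)


def J_closed(k):
    # integral_0^upper dx / (1 + k x^2)^2 = (upper / (1 + k upper^2) + I(k)) / 2
    return 0.5 * (upper / (1.0 + k * upper * upper) + I_closed(k))


def riemann_sum(n, k):
    # (1/n) * sum_{i=1}^{p} 1 / (1 + k (i/n)^alpha), assuming lambda_i = i^(-alpha), p = n / gamma_star
    p = int(round(n / gamma_star))
    return sum(1.0 / (1.0 + k * (i / n) ** alpha) for i in range(1, p + 1)) / n


def R(k):
    return k * (1.0 - I_closed(k))


# The normalized sum approaches the integral as n grows.
for n in (10, 100, 1000, 10000):
    print(n, riemann_sum(n, k), I_closed(k))

# Central finite difference of R versus the Leibniz-rule formula 1 - J(k).
h = 1e-5
dR_fd = (R(k + h) - R(k - h)) / (2 * h)
print("dR/dk (finite difference):", dR_fd, " 1 - J(k):", 1.0 - J_closed(k))
```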

For the first item of the “Furthermore” part, it suffices to show that $\lim_{k\to+\infty}\mathcal{I}(k)=0$. This follows from the fact that $\lim_{k\to+\infty}\frac{1}{1+kx^{\alpha}}=0$ for all $x>0$, integrability of the function $(1+x^{\alpha})^{-1}$ over $\mathbb{R}_{\geq 0}$, and the dominated convergence theorem. Likewise, $\lim_{k\to\infty}\mathcal{J}(k)=0$ as well.

For the second item of the “Furthermore” part, we note that for all $k$ sufficiently large, we have $\tfrac{dr}{dk}>0$ since $\lim_{k\to\infty}\mathcal{J}(k)=0$. Now, let $k_{\mathtt{crit}}$ be the largest real number such that $\mathcal{R}(k_{\mathtt{crit}})=0$. Since $\mathcal{R}(0)=0$, we must have $k_{\mathtt{crit}}\geq 0$.

For all $k>k_{\mathtt{crit}}$, we claim that $\mathcal{I}(k)<1$. To see this, assume the contrary. Then by the fact that $\lim_{k\to+\infty}\mathcal{I}(k)=0$ and the intermediate value theorem, there must exist $k^{\prime}\geq k$ such that $\mathcal{I}(k^{\prime})=1$, which implies that $\mathcal{R}(k^{\prime})=0$. Since $k^{\prime}\geq k>k_{\mathtt{crit}}$, this contradicts the maximality of $k_{\mathtt{crit}}$.

Finally, since $1+kx^{\alpha}\leq(1+kx^{\alpha})^{2}$ for all $k\geq 0$ and $x\geq 0$, we have that $\mathcal{I}(k)\geq\mathcal{J}(k)$ for all such $k$. Thus, by the previous claim, for all $k>k_{\mathtt{crit}}$, we have $1>\mathcal{I}(k)\geq\mathcal{J}(k)$. This proves that $\frac{dr}{dk}>0$ for all $k>k_{\mathtt{crit}}$, as desired. ∎
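The objects in Proposition A.1 are easy to compute numerically. The following sketch is an illustration under stated assumptions, not part of the proof: it takes $\gamma_{\ast}<1$ (so that $\mathcal{I}(0)=1/\gamma_{\ast}>1$), again assumes $\mathcal{J}(k)=\int_{0}^{1/\gamma_{\ast}}(1+kx^{\alpha})^{-2}\,dx$, locates $k_{\mathtt{crit}}$ by bisection on $\mathcal{I}(k)=1$, and then checks on a grid that $\mathcal{R}$ is positive and increasing and that $\mathcal{J}<1$ beyond $k_{\mathtt{crit}}$. The values of $\alpha$ and $\gamma_{\ast}$ are arbitrary.

```python
def simpson(f, lo, hi, m=2000):
    """Composite Simpson rule with m (even) subintervals."""
    h = (hi - lo) / m
    s = f(lo) + f(hi)
    for j in range(1, m):
        s += (4 if j % 2 else 2) * f(lo + j * h)
    return s * h / 3


# Illustrative values (assumptions); gamma_star < 1 so that I(0) = 1 / gamma_star > 1.
alpha, gamma_star = 1.5, 0.5
upper = 1.0 / gamma_star


def I(k):
    return simpson(lambda x: 1.0 / (1.0 + k * x ** alpha), 0.0, upper)


def J(k):
    # Assumed form of J from Definition 3.1.
    return simpson(lambda x: 1.0 / (1.0 + k * x ** alpha) ** 2, 0.0, upper)


def R(k):
    return k * (1.0 - I(k))


# I is strictly decreasing in k, so I(k) = 1 has a unique positive root,
# which is the largest zero of R, i.e. k_crit.
lo, hi = 0.0, 1.0
while I(hi) >= 1.0:          # expand until the root is bracketed
    hi *= 2.0
for _ in range(80):          # plain bisection
    mid = 0.5 * (lo + hi)
    if I(mid) > 1.0:
        lo = mid
    else:
        hi = mid
k_crit = 0.5 * (lo + hi)

# Grid check of items 2 and 3 beyond k_crit.
ks = [k_crit + 0.5 * j for j in range(1, 11)]
Rs = [R(k) for k in ks]
Js = [J(k) for k in ks]
print("k_crit =", k_crit)
for k, r, j in zip(ks, Rs, Js):
    print(f"k={k:.3f}  R(k)={r:.4f}  J(k)={j:.4f}")
```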

A.3 Review of the eigenlearning framework (Simon et al., 2023)

Before finishing the proof of Proposition 3.2, we briefly review the eigenlearning framework. Simon et al. (2023) calculate the test error for the estimator

$$
\check{\bm{\beta}}_{\delta}:=\mathbf{X}(\mathbf{X}^{\top}\mathbf{X}+\delta\mathbf{I}_{n})^{-1}y=\mathbf{X}(n\check{\mathbf{G}}+\delta\mathbf{I}_{n})^{-1}y \tag{8}
$$

for kernel ridge regression using the so-called eigenlearning equations (Simon et al., 2023, Section 4.1). Below, we recall some relevant parts of the framework:

Definition A.2 (Eigenlearning eqn. specialized to setting in Section 1.3).

Suppose that the ground truth regression function is linear, i.e., $f(x)=x^{\top}\bm{\beta}^{\star}$ for some $\bm{\beta}^{\star}\in\mathbb{R}^{p}$. Let $\delta$ and $\kappa$ satisfy the equation

$$
n=\frac{\delta}{\kappa}+\sum_{i=1}^{p}\frac{\lambda_{i}}{\lambda_{i}+\kappa}. \tag{9}
$$

Define the following $n$-dependent quantities:

  1. Overfitting coefficient: $\mathcal{E}_{\mathtt{coef}}:=n\frac{d\kappa}{d\delta}$.

  2. Testing error: $\mathcal{E}_{\mathtt{test}}:=\mathcal{E}_{\mathtt{coef}}(\sigma^{2}+C)$, where

     $$
     C=\sum_{i=1}^{p}(1-\mathcal{L}_{i})(\beta^{\star}_{i})^{2}\quad\mbox{and}\quad\mathcal{L}_{i}:=\frac{\lambda_{i}}{\lambda_{i}+\kappa}.
     $$

  3. Training error: $\mathcal{E}_{\mathtt{train}}:=\frac{\delta^{2}}{n^{2}\kappa^{2}}\mathcal{E}_{\mathtt{test}}$.
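To make Definition A.2 concrete, the following sketch evaluates its quantities for one finite problem. Every number is a hypothetical choice for illustration: the spectrum $\lambda_{i}=i^{-\alpha}$, the sizes $n$ and $p$, the ridge $\delta$, and the ground truth $\beta^{\star}_{i}=1/i$ are assumptions, and `solve_kappa` is a helper name introduced here. The overfitting coefficient is computed directly from its definition by a finite difference in $\delta$, and compared with the expression $n/(n-\sum_{i}\mathcal{L}_{i}^{2})$ obtained by implicitly differentiating Equation 9.

```python
def learnabilities(lams, kappa):
    # L_i = lambda_i / (lambda_i + kappa)
    return [lam / (lam + kappa) for lam in lams]


def solve_kappa(lams, n, delta, iters=200):
    """Solve n = delta / kappa + sum_i L_i(kappa) for kappa > 0.

    The right-hand side is strictly decreasing in kappa, so bisection applies.
    """
    def rhs(kappa):
        return delta / kappa + sum(learnabilities(lams, kappa))

    lo, hi = 1e-12, 1e12            # rhs(lo) > n > rhs(hi) for the values used below
    for _ in range(iters):
        mid = (lo * hi) ** 0.5      # bisection on a logarithmic scale
        if rhs(mid) > n:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5


# Hypothetical problem instance (all values are assumptions for illustration).
alpha, n, p = 2.0, 50, 200
lams = [i ** (-alpha) for i in range(1, p + 1)]
beta_star = [1.0 / i for i in range(1, p + 1)]
delta, sigma2 = 0.02, 1.0

kappa = solve_kappa(lams, n, delta)
Ls = learnabilities(lams, kappa)

# Overfitting coefficient from the definition n * d(kappa)/d(delta), via central differences.
h = 1e-4 * delta
coef_fd = n * (solve_kappa(lams, n, delta + h) - solve_kappa(lams, n, delta - h)) / (2 * h)
# The same quantity from implicit differentiation of Equation 9.
coef_closed = n / (n - sum(L * L for L in Ls))

C = sum((1.0 - L) * b * b for L, b in zip(Ls, beta_star))
E_test = coef_closed * (sigma2 + C)
E_train = delta ** 2 / (n ** 2 * kappa ** 2) * E_test

print("kappa =", kappa)
print("E_coef: finite difference", coef_fd, " closed form", coef_closed)
print("E_test =", E_test, " E_train =", E_train)
```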

Proof of Proposition 3.2.

Simon et al. (2023) use a different scaling for ridge regression than the one we use. To bridge the two notations, we first resolve this discrepancy. Comparing Equation 8 with the expression in Equation 11, if we let $\delta:=n\varrho$, then the expressions are equivalent, i.e., $\check{\beta}_{\delta}=\hat{\beta}_{\varrho}$. To see this, note that

$$
\begin{aligned}
\check{\beta}_{\delta}=\check{\beta}_{n\varrho} &= X(X^{\top}X+n\varrho\mathbb{I}_{n})^{-1}y \\
&= (XX^{\top}+n\varrho\mathbb{I}_{p})^{-1}Xy && \because \mbox{Lemma B.2} \\
&= (n(n^{-1}XX^{\top}+\varrho\mathbb{I}_{p}))^{-1}Xy \\
&= (\hat{\Sigma}+\varrho\mathbb{I}_{p})^{-1}\tfrac{1}{n}Xy=\hat{\beta}_{\varrho} && \because \mbox{Definition of $\hat{\beta}_{\varrho}$}
\end{aligned}
$$
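The chain of equalities above can be verified on a random instance. The NumPy sketch below assumes the convention implied by the formulas, namely that $X$ is $p\times n$ with samples as columns (so that $X^{\top}X$ is $n\times n$) and $\hat{\Sigma}=n^{-1}XX^{\top}$. The dimensions, ridge value, and variable names are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, rho = 7, 4, 0.3                       # illustrative sizes and ridge parameter
X = rng.standard_normal((p, n))             # columns are samples (assumed convention)
y = rng.standard_normal(n)

# Kernel ("n x n") form with delta = n * rho, as in Equation 8.
delta = n * rho
beta_check = X @ np.linalg.solve(X.T @ X + delta * np.eye(n), y)

# Covariance ("p x p") form, as in Equation 11.
Sigma_hat = X @ X.T / n
beta_hat = np.linalg.solve(Sigma_hat + rho * np.eye(p), X @ y / n)

print("max abs difference:", np.max(np.abs(beta_check - beta_hat)))
```

The two forms agree up to floating-point error; in practice one would use whichever system is smaller, $n\times n$ or $p\times p$.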

Below, let $r>0$ be arbitrary. Furthermore, we claim that as $n\to\infty$, the pair $(r,k)$ satisfies Equation 6 if and only if the tuple $(\delta,\kappa):=(nrn^{-\alpha},kn^{-\alpha})$ satisfies Equation 9:

$$
n=\frac{\delta}{\kappa}+\sum_{i=1}^{p}\frac{\lambda_{i}}{\lambda_{i}+\kappa}\iff n=\frac{nrn^{-\alpha}}{kn^{-\alpha}}+\sum_{i=1}^{p}\frac{\lambda_{i}}{\lambda_{i}+kn^{-\alpha}}\iff 1=\frac{r}{k}+\frac{1}{n}\sum_{i=1}^{p}\frac{1}{1+kn^{-\alpha}\lambda_{i}^{-1}}.
$$

Taking the limit as $n\to\infty$, we have proved the claim.

Next, we show that $\lim_{n\to\infty}C=0$ where $C$ is as in Definition A.2. We have $\mathcal{L}_{i}:=\frac{\lambda_{i}}{\lambda_{i}+\kappa}=\frac{1}{1+k(i/n)^{\alpha}}$. Note that $\lim_{n\to\infty}\mathcal{L}_{i}=1$ for all fixed $i$. On the other hand, since $\sup_{n=1,2,\dots}\|\beta^{\star}\|_{2}<+\infty$, the dominated convergence theorem implies that $\lim_{n\to\infty}C=0$.
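The decay of $C$ can be observed numerically. The sketch below uses a hypothetical ground truth with fixed, square-summable coordinates $\beta^{\star}_{i}=1/i$, together with $p=n/\gamma_{\ast}$ and the expression $\mathcal{L}_{i}=1/(1+k(i/n)^{\alpha})$ from the text. These choices are assumptions for illustration only and are not the paper's general setting.

```python
# Illustrative values (assumptions).
alpha, gamma_star, k = 2.0, 0.5, 1.0


def C(n):
    """C = sum_i (1 - L_i) * (beta_star_i)^2 with L_i = 1 / (1 + k (i/n)^alpha)."""
    p = int(round(n / gamma_star))
    total = 0.0
    for i in range(1, p + 1):
        L_i = 1.0 / (1.0 + k * (i / n) ** alpha)
        beta_i = 1.0 / i                 # hypothetical fixed, square-summable ground truth
        total += (1.0 - L_i) * beta_i ** 2
    return total


ns = (10, 100, 1000, 10000)
values = [C(n) for n in ns]
for n, c in zip(ns, values):
    print(n, c)
```

For each fixed $i$ the factor $1-\mathcal{L}_{i}$ tends to zero, while the weights $(\beta^{\star}_{i})^{2}$ are summable, which is the mechanism behind the dominated convergence argument.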

We claim that the following asymptotic expressions for the testing and training errors hold:

$$
\lim_{n\to\infty}\mathcal{E}_{\mathtt{test}}=\sigma^{2}\cdot\tfrac{dk}{dr}\quad\mbox{and}\quad\lim_{n\to\infty}\mathcal{E}_{\mathtt{train}}=\sigma^{2}\cdot\tfrac{r^{2}}{k^{2}}\cdot\tfrac{dk}{dr} \tag{10}
$$

where $r$ and $k$ satisfy Equation 6 from Assumption 2.5.

To see this, first note that the overfitting coefficient satisfies

$$
\mathcal{E}_{\mathtt{coef}}:=n\tfrac{d\kappa}{d\delta}=n\tfrac{d\kappa}{d\varrho}\tfrac{d\varrho}{d\delta}=n\tfrac{d\kappa}{d\varrho}\tfrac{1}{n}=\tfrac{d\kappa}{d\varrho}=\tfrac{dk}{dr}.
$$

Thus, we obtain the following asymptotic expression:

$$
\lim_{n\to\infty}\mathcal{E}_{\mathtt{test}}=\mathcal{E}_{\mathtt{coef}}\cdot\sigma^{2}=\sigma^{2}\cdot\tfrac{dk}{dr}.
$$

On the other hand, the training error is given by

$$
\mathcal{E}_{\mathtt{train}}=\tfrac{\delta^{2}}{n^{2}\kappa^{2}}\mathcal{E}_{\mathtt{test}}=\tfrac{\varrho^{2}}{\kappa^{2}}\mathcal{E}_{\mathtt{test}}=\mathcal{E}_{\mathtt{test}}\cdot\tfrac{r^{2}}{k^{2}}.
$$

Therefore, $\lim_{n\to\infty}\mathcal{E}_{\mathtt{train}}=\sigma^{2}\cdot\tfrac{r^{2}}{k^{2}}\cdot\tfrac{dk}{dr}$. This proves (10), as desired. ∎
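The limiting formulas in (10) trace out a curve of (training error, testing error) pairs parametrized by $k>k_{\mathtt{crit}}$. The sketch below evaluates this curve in the special case $\alpha=2$, where $\mathcal{I}$ and $\mathcal{J}$ have elementary closed forms, using $\tfrac{dk}{dr}=1/(1-\mathcal{J}(k))$ from Proposition A.1. The values of $\gamma_{\ast}$, $\sigma^{2}$, and the grid of $k$ are illustrative assumptions, as is the form $\mathcal{J}(k)=\int_{0}^{1/\gamma_{\ast}}(1+kx^{\alpha})^{-2}\,dx$.

```python
import math

# Illustrative values (assumptions); alpha = 2 gives closed-form integrals.
alpha, gamma_star, sigma2 = 2.0, 0.5, 1.0
upper = 1.0 / gamma_star


def I(k):
    # integral_0^upper dx / (1 + k x^2)
    return math.atan(math.sqrt(k) * upper) / math.sqrt(k)


def J(k):
    # integral_0^upper dx / (1 + k x^2)^2 (assumed form of J)
    return 0.5 * (upper / (1.0 + k * upper * upper) + I(k))


def tradeoff(k):
    """Return (r, limiting train error, limiting test error) for a given k."""
    r = k * (1.0 - I(k))
    dk_dr = 1.0 / (1.0 - J(k))          # inverse of dr/dk = 1 - J(k)
    test = sigma2 * dk_dr
    train = sigma2 * (r / k) ** 2 * dk_dr
    return r, train, test


# Grid chosen so that I(k) < 1 at every point, i.e. all k lie above k_crit.
ks = [1.5, 2.0, 4.0, 8.0, 16.0, 64.0, 256.0]
rows = [tradeoff(k) for k in ks]
for k, (r, train, test) in zip(ks, rows):
    print(f"k={k:7.1f}  r={r:9.4f}  train={train:.4f}  test={test:.4f}")
```

Along this grid the training error stays below the noise level $\sigma^{2}$ while the testing error stays above it, and pushing the training error toward zero (small $k$) raises the testing error.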

Appendix B Proof of Proposition 4.1 — rapid norm growth under RMT assumptions

The first key technical step is the following:

Proposition B.1.

$\mathbb{E}\|\hat{\bm{\beta}}_{\varrho}\|_{2}^{2}\geq n^{-1}\sigma^{2}\,\mathbb{E}[\mathrm{tr}((\hat{\bm{\Sigma}}+\varrho\mathbf{I}_{p})^{-2}\hat{\bm{\Sigma}})]$.

Proof.

Below, for brevity we let $\mathbf{a}:=\mathbf{X}^{\top}\bm{\beta}^{\star}$ and $\mathbf{M}:=(\hat{\bm{\Sigma}}+\varrho\mathbf{I}_{p})^{-1}\frac{1}{n}\mathbf{X}$. Recall that the closed-form solution for Equation 2 is given by the formula

$$
\hat{\bm{\beta}}_{\varrho}:=(\hat{\bm{\Sigma}}+\varrho\mathbf{I}_{p})^{-1}\tfrac{1}{n}\mathbf{X}y. \tag{11}
$$

Thus,

$$
\hat{\bm{\beta}}_{\varrho}=(\hat{\bm{\Sigma}}+\varrho\mathbf{I}_{p})^{-1}\frac{1}{n}\mathbf{X}y=(\hat{\bm{\Sigma}}+\varrho\mathbf{I}_{p})^{-1}\frac{1}{n}\mathbf{X}(f(\mathbf{X})+\bm{\varepsilon})=\mathbf{M}(\mathbf{a}+\bm{\varepsilon}).
$$

Thus,

$$
\|\hat{\bm{\beta}}_{\varrho}\|_{2}^{2}=(\mathbf{a}+\bm{\varepsilon})^{\top}\mathbf{M}^{\top}\mathbf{M}(\mathbf{a}+\bm{\varepsilon})=\underbrace{\mathbf{a}^{\top}\mathbf{M}^{\top}\mathbf{M}\mathbf{a}}_{\geq 0}+\bm{\varepsilon}^{\top}\mathbf{M}^{\top}\mathbf{M}\bm{\varepsilon}+2\bm{\varepsilon}^{\top}\mathbf{M}^{\top}\mathbf{M}\mathbf{a}.
$$

Note that $\bm{\varepsilon}\perp\mathbf{M}^{\top}\mathbf{M}\mathbf{a}$ (where $\perp$ denotes independence) since $\bm{\varepsilon}\perp\mathbf{X}$. Thus, since $\mathbb{E}[\bm{\varepsilon}]=0$, the cross term $2\bm{\varepsilon}^{\top}\mathbf{M}^{\top}\mathbf{M}\mathbf{a}$ has zero expectation, and we have

$$\mathbb{E}[\|\hat{\bm{\beta}}_{\varrho}\|_{2}^{2}]=\mathbb{E}[(\mathbf{a}+\bm{\varepsilon})^{\top}\mathbf{M}^{\top}\mathbf{M}(\mathbf{a}+\bm{\varepsilon})]\geq\mathbb{E}[\bm{\varepsilon}^{\top}\mathbf{M}^{\top}\mathbf{M}\bm{\varepsilon}]=\mathbb{E}[\mathrm{tr}(\mathbf{M}^{\top}\mathbf{M}\bm{\varepsilon}\bm{\varepsilon}^{\top})].$$

Since $\mathbf{M}^{\top}\mathbf{M}\perp\bm{\varepsilon}\bm{\varepsilon}^{\top}$, we have

$$\mathbb{E}[\mathrm{tr}(\mathbf{M}^{\top}\mathbf{M}\bm{\varepsilon}\bm{\varepsilon}^{\top})]=\mathrm{tr}(\mathbb{E}[\mathbf{M}^{\top}\mathbf{M}]\mathbb{E}[\bm{\varepsilon}\bm{\varepsilon}^{\top}])=\mathrm{tr}(\mathbb{E}[\mathbf{M}^{\top}\mathbf{M}\sigma^{2}\mathbf{I}_{n}])=\sigma^{2}\mathbb{E}[\mathrm{tr}(\mathbf{M}^{\top}\mathbf{M})].$$

On the other hand, $\mathbf{M}\mathbf{M}^{\top}=\frac{1}{n}(\hat{\bm{\Sigma}}+\varrho\mathbf{I}_{p})^{-1}\hat{\bm{\Sigma}}(\hat{\bm{\Sigma}}+\varrho\mathbf{I}_{p})^{-1}$. Using the cyclic property of trace, $\mathrm{tr}(\mathbf{M}^{\top}\mathbf{M})=\mathrm{tr}(\mathbf{M}\mathbf{M}^{\top})=\frac{1}{n}\mathrm{tr}((\hat{\bm{\Sigma}}+\varrho\mathbf{I}_{p})^{-2}\hat{\bm{\Sigma}})$, and we get the desired inequality. ∎
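As a quick numerical sanity check of the trace identity used in the last step, the sketch below verifies $\mathrm{tr}(\mathbf{M}^{\top}\mathbf{M})=\frac{1}{n}\mathrm{tr}((\hat{\bm{\Sigma}}+\varrho\mathbf{I}_{p})^{-2}\hat{\bm{\Sigma}})$ on one random instance. It assumes $\hat{\bm{\Sigma}}=\frac{1}{n}\mathbf{X}\mathbf{X}^{\top}$ with $\mathbf{X}\in\mathbb{R}^{p\times n}$, which is inferred from the ridge objective; the sizes, seed, and variable names are illustrative choices and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, rho = 7, 4, 0.3  # hypothetical dimensions and ridge parameter

# Columns of X are samples, matching the objective (1/n)||X^T beta - y||^2.
X = rng.standard_normal((p, n))
Sigma_hat = X @ X.T / n                          # assumed: p x p sample covariance
R = np.linalg.inv(Sigma_hat + rho * np.eye(p))   # (Sigma_hat + rho I_p)^{-1}
M = R @ X / n                                    # p x n, so that beta_hat = M @ y

lhs = np.trace(M.T @ M)                # trace of the n x n matrix M^T M
rhs = np.trace(R @ R @ Sigma_hat) / n  # (1/n) tr((Sigma_hat + rho I)^{-2} Sigma_hat)
print(lhs, rhs)
```

The two printed values should agree up to floating-point error, which is all the final step of the proof needs.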

Proof sketch of Prop. B.1.

We first simplify $\|\hat{\bm{\beta}}_{\varrho}\|_{2}^{2}$ using the well-known formula for ridge regression, $\hat{\bm{\beta}}_{\varrho}=(\hat{\bm{\Sigma}}+\varrho\mathbf{I}_{p})^{-1}\frac{1}{n}\mathbf{X}\mathbf{y}$. Next, let $\mathbf{M}:=(\hat{\bm{\Sigma}}+\varrho\mathbf{I}_{p})^{-1}\tfrac{1}{n}\mathbf{X}$. Using the independence of $\mathbf{X}$ and $\bm{\varepsilon}$, we get $\mathbb{E}[\|\hat{\bm{\beta}}_{\varrho}\|_{2}^{2}]\geq\mathbb{E}[\mathrm{tr}(\mathbf{M}^{\top}\mathbf{M}\bm{\varepsilon}\bm{\varepsilon}^{\top})]$. Since $\mathbf{M}^{\top}\mathbf{M}$ and $\bm{\varepsilon}\bm{\varepsilon}^{\top}$ are also independent, we have

$$\mathbb{E}[\mathrm{tr}(\mathbf{M}^{\top}\mathbf{M}\bm{\varepsilon}\bm{\varepsilon}^{\top})]=\sigma^{2}\mathbb{E}[\mathrm{tr}(\mathbf{M}^{\top}\mathbf{M})].$$

Since $\mathbf{M}\mathbf{M}^{\top}=\frac{1}{n}(\hat{\bm{\Sigma}}+\varrho\mathbf{I}_{p})^{-1}\hat{\bm{\Sigma}}(\hat{\bm{\Sigma}}+\varrho\mathbf{I}_{p})^{-1}$, the cyclic property of trace gives $\mathrm{tr}(\mathbf{M}^{\top}\mathbf{M})=\frac{1}{n}\mathrm{tr}((\hat{\bm{\Sigma}}+\varrho\mathbf{I}_{p})^{-2}\hat{\bm{\Sigma}})$, and we get the desired inequality. ∎

For the sake of completeness, we prove Equation 11, though it is well known.

Proof of Equation 11.

Start with the objective function $\mathcal{F}(\bm{\beta}):=\frac{1}{n}\|\mathbf{X}^{\top}\bm{\beta}-\mathbf{y}\|_{2}^{2}+\varrho\|\bm{\beta}\|_{2}^{2}$. Taking the derivative with respect to $\bm{\beta}$, we have

$$\begin{aligned}
\frac{1}{2}\nabla_{\bm{\beta}}\left(\frac{1}{n}\|\mathbf{X}^{\top}\bm{\beta}-\mathbf{y}\|_{2}^{2}+\varrho\|\bm{\beta}\|_{2}^{2}\right)
&=\frac{1}{2}\nabla_{\bm{\beta}}\left(\bm{\beta}^{\top}(\hat{\bm{\Sigma}}+\varrho\mathbf{I}_{p})\bm{\beta}-\frac{2}{n}\bm{\beta}^{\top}\mathbf{X}\mathbf{y}\right)\\
&=(\hat{\bm{\Sigma}}+\varrho\mathbf{I}_{p})\bm{\beta}-\frac{1}{n}\mathbf{X}\mathbf{y}.
\end{aligned}$$

Since $\nabla_{\bm{\beta}}\mathcal{F}(\hat{\bm{\beta}}_{\varrho})=0$, we are done. ∎
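The first-order condition above can be checked numerically. The following sketch, under the assumption that $\hat{\bm{\Sigma}}=\frac{1}{n}\mathbf{X}\mathbf{X}^{\top}$ (inferred from the objective), solves the linear system for $\hat{\bm{\beta}}_{\varrho}$, confirms that the half-gradient vanishes there, and confirms that a small perturbation increases the objective. The helper names `objective` and `half_gradient` and all numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, rho = 6, 5, 0.2  # hypothetical dimensions and ridge parameter

X = rng.standard_normal((p, n))   # columns are samples
y = rng.standard_normal(n)
Sigma_hat = X @ X.T / n           # assumed sample covariance
A = Sigma_hat + rho * np.eye(p)

def objective(beta):
    # F(beta) = (1/n) ||X^T beta - y||^2 + rho ||beta||^2
    return np.sum((X.T @ beta - y) ** 2) / n + rho * np.sum(beta ** 2)

def half_gradient(beta):
    # (1/2) grad F(beta) = (Sigma_hat + rho I) beta - (1/n) X y
    return A @ beta - X @ y / n

# Closed-form ridge solution obtained by setting the gradient to zero.
beta_hat = np.linalg.solve(A, X @ y / n)

grad_norm = np.linalg.norm(half_gradient(beta_hat))
delta = 1e-3 * rng.standard_normal(p)
gap = objective(beta_hat + delta) - objective(beta_hat)
print(grad_norm, gap)
```

Because the objective is strictly convex for $\varrho>0$, the stationary point is the unique minimizer, so `gap` should be positive for any nonzero perturbation.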

Lemma B.2 (Special case of Woodbury formula).

Let

$$\mathbf{M}\in\mathbb{R}^{p\times n} \tag{12}$$

be an arbitrary matrix and $\varrho\in(0,\infty)$. Then

$$(\mathbf{M}\mathbf{M}^{\top}+\varrho\mathbf{I}_{p})^{-1}\mathbf{M}=\mathbf{M}(\mathbf{M}^{\top}\mathbf{M}+\varrho\mathbf{I}_{n})^{-1}\in\mathbb{R}^{p\times n}.$$

Proof of Lemma B.2.

It suffices to prove Lemma B.2 for the special case when $\varrho=1$, which we assume below; the general case follows by applying this special case to $\mathbf{M}/\sqrt{\varrho}$. By the Woodbury matrix identity, we have

$$(\mathbf{M}\mathbf{M}^{\top}+\mathbf{I}_{p})^{-1}=\mathbf{I}-\mathbf{M}(\mathbf{M}^{\top}\mathbf{M}+\mathbf{I}_{n})^{-1}\mathbf{M}^{\top} \tag{13}$$

For brevity, let $\mathbf{P}:=\mathbf{M}\mathbf{M}^{\top}+\mathbf{I}_{p}$ and let $\mathbf{N}:=\mathbf{M}^{\top}\mathbf{M}+\mathbf{I}_{n}$. To proceed, we have

$$\begin{aligned}
\mathbf{P}^{-1}\mathbf{M}
&=\mathbf{M}-\mathbf{M}\mathbf{N}^{-1}\mathbf{M}^{\top}\mathbf{M} && \because \text{multiplying (13) by $\mathbf{M}$ on the right}\\
&=\mathbf{M}(\mathbf{I}_{n}-\mathbf{N}^{-1}\mathbf{M}^{\top}\mathbf{M}) && \because \text{factoring out $\mathbf{M}$ on the left}\\
&=\mathbf{M}(\mathbf{I}_{n}-(\mathbf{I}_{n}-\mathbf{N}^{-1})) && \because \mathbf{I}_{n}=\mathbf{N}^{-1}\mathbf{N}=\mathbf{N}^{-1}+\mathbf{N}^{-1}\mathbf{M}^{\top}\mathbf{M}\\
&=\mathbf{M}\mathbf{N}^{-1}
\end{aligned}$$

as desired. ∎
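The identity of Lemma B.2 is easy to confirm numerically for a general $\varrho>0$. The sketch below does so on a single random matrix; the dimensions, seed, and value of `rho` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, rho = 6, 3, 0.5  # hypothetical dimensions and regularization

M = rng.standard_normal((p, n))

# Left-hand side: (M M^T + rho I_p)^{-1} M, computed via a linear solve.
left = np.linalg.solve(M @ M.T + rho * np.eye(p), M)

# Right-hand side: M (M^T M + rho I_n)^{-1}; only an n x n matrix is inverted.
right = M @ np.linalg.inv(M.T @ M + rho * np.eye(n))

print(np.max(np.abs(left - right)))  # both sides are p x n matrices
```

When $n\ll p$, the right-hand side is the cheaper form to evaluate, since it inverts an $n\times n$ rather than a $p\times p$ matrix.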

Lemma B.3.

Let $\mathbf{M}\in\mathbb{R}^{p\times p}$ be any symmetric matrix and $z\in\mathbb{R}$. Then we have

$$\tfrac{d}{dz}\mathrm{tr}(z(\mathbf{M}+z\mathbf{I}_{p})^{-1})=\mathrm{tr}(\mathbf{M}(\mathbf{M}+z\mathbf{I}_{p})^{-2}).$$

Proof of Lemma B.3.

Without loss of generality, suppose that $\mathbf{M}=\mathrm{diag}(\lambda_{1},\dots,\lambda_{p})$. Then we have $f(z):=\mathrm{tr}(z(\mathbf{M}+z\mathbf{I}_{p})^{-1})=\sum_{i=1}^{p}\frac{z}{\lambda_{i}+z}$. Now, from elementary calculus, we have

$$\frac{d}{dx}\frac{x}{y+x}=(y+x)^{-1}-x(y+x)^{-2}=(y+x)^{-2}((y+x)-x)=\frac{y}{(y+x)^{2}}.$$

From this, we recover the fact that $\frac{d}{dz}f(z)=\sum_{i=1}^{p}\frac{\lambda_{i}}{(\lambda_{i}+z)^{2}}=\mathrm{tr}(\mathbf{M}(\mathbf{M}+z\mathbf{I}_{p})^{-2})$, as desired. ∎
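A finite-difference check illustrates Lemma B.3. The sketch below compares a central difference of $f(z)=\mathrm{tr}(z(\mathbf{M}+z\mathbf{I}_{p})^{-1})$ with the closed form $\mathrm{tr}(\mathbf{M}(\mathbf{M}+z\mathbf{I}_{p})^{-2})$. As an assumption made only for this example, $\mathbf{M}$ is taken positive semidefinite and $z>0$ so that $\mathbf{M}+z\mathbf{I}_{p}$ is invertible; the step size `h` and the other values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
p, z, h = 5, 0.7, 1e-5  # hypothetical size, evaluation point, and step size

A = rng.standard_normal((p, p))
M = A @ A.T        # symmetric PSD, so M + z I is invertible for z > 0
I = np.eye(p)

def f(t):
    # f(t) = tr(t (M + t I)^{-1})
    return np.trace(t * np.linalg.inv(M + t * I))

# Central finite difference approximating df/dz.
numeric = (f(z + h) - f(z - h)) / (2 * h)

# Closed form from the lemma: tr(M (M + z I)^{-2}).
R = np.linalg.inv(M + z * I)
closed_form = np.trace(M @ R @ R)
print(numeric, closed_form)
```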

Lemma B.4 (Gram-to-covariance).

Let $c\in\mathbb{R}$ and $z\in\mathbb{C}$ be arbitrary. Then $\mathcal{S}_{\mathtt{esd}(c\hat{\bm{\Sigma}})}(z)=\gamma\cdot\mathcal{S}_{\mathtt{esd}(c\check{\mathbf{G}})}(z)-\frac{(1-\gamma)}{z}$.

Proof of Lemma B.4.

Without loss of generality, we may assume that $c=1$. Let $\hat{\lambda}_{1}\geq\dots\geq\hat{\lambda}_{p}$ be the eigenvalues of $\hat{\bm{\Sigma}}$. Since $p>n$, we necessarily have that $\hat{\lambda}_{n+1}=\cdots=\hat{\lambda}_{p}=0$. Moreover, $\hat{\lambda}_{1},\dots,\hat{\lambda}_{n}$ are the eigenvalues of $\check{\mathbf{G}}$. Now, unwinding the definition, we have

$$\mathcal{S}_{\mathtt{esd}(\hat{\bm{\Sigma}})}(z)=\frac{1}{p}\sum_{i=1}^{p}\frac{1}{\hat{\lambda}_{i}-z}$$

and

$$\mathcal{S}_{\mathtt{esd}(\check{\mathbf{G}})}(z)=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{\hat{\lambda}_{i}-z}.$$

Thus,

$$\begin{aligned}
\mathcal{S}_{\mathtt{esd}(\hat{\bm{\Sigma}})}(z)
&=\frac{1}{p}\left(\sum_{i=1}^{n}\frac{1}{\hat{\lambda}_{i}-z}+\sum_{i=n+1}^{p}\frac{1}{-z}\right)\\
&=\left(\frac{n}{p}\frac{1}{n}\sum_{i=1}^{n}\frac{1}{\hat{\lambda}_{i}-z}\right)-\frac{p-n}{p}\frac{1}{z}\\
&=\gamma\cdot\mathcal{S}_{\mathtt{esd}(\check{\mathbf{G}})}(z)-\frac{(1-\gamma)}{z}
\end{aligned}$$

as desired. ∎
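The Gram-to-covariance relation can be illustrated numerically. The sketch below assumes $\hat{\bm{\Sigma}}=\frac{1}{n}\mathbf{X}\mathbf{X}^{\top}$ and $\check{\mathbf{G}}=\frac{1}{n}\mathbf{X}^{\top}\mathbf{X}$ with $\gamma=n/p$, which is consistent with the proof's use of shared nonzero eigenvalues but is not restated in this section. The helper `stieltjes` and the test point `z` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 8, 3            # hypothetical dimensions with p > n
gamma = n / p

X = rng.standard_normal((p, n))
Sigma_hat = X @ X.T / n   # assumed p x p sample covariance
G_check = X.T @ X / n     # assumed n x n Gram matrix

def stieltjes(A, z):
    # Stieltjes transform of the empirical spectral distribution of symmetric A:
    # the average of 1 / (lambda_i - z) over the eigenvalues lambda_i.
    eig = np.linalg.eigvalsh(A)
    return np.mean(1.0 / (eig - z))

z = -0.4 + 0.3j  # any point away from the spectrum
lhs = stieltjes(Sigma_hat, z)
rhs = gamma * stieltjes(G_check, z) - (1 - gamma) / z
print(lhs, rhs)
```

The $p-n$ zero eigenvalues of $\hat{\bm{\Sigma}}$ contribute exactly the $-(1-\gamma)/z$ correction term.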

Proof of Proposition 4.2.

Recall from Proposition B.1 that $\mathbb{E}\|\hat{\bm{\beta}}\|_{2}^{2}\geq n^{-1}\sigma^{2}\mathbb{E}[\mathrm{tr}((\hat{\bm{\Sigma}}+\varrho\mathbf{I}_{p})^{-2}\hat{\bm{\Sigma}})]$. Below, we analyze the term inside the expectation. By the definition of the Stieltjes transform, we have

$$\mathrm{tr}(\varrho(\hat{\bm{\Sigma}}+\varrho\mathbf{I}_{p})^{-1})=\mathrm{tr}(rn^{-\alpha}(\hat{\bm{\Sigma}}+rn^{-\alpha}\mathbf{I}_{p})^{-1})=\mathrm{tr}(r(n^{\alpha}\hat{\bm{\Sigma}}+r\mathbf{I}_{p})^{-1})=pr\,\mathcal{S}_{\mathtt{esd}(n^{\alpha}\hat{\bm{\Sigma}})}(-r).$$

Therefore, by Lemma B.3, we have

$$\begin{aligned}
\frac{d}{dr}\left(pr\,\mathcal{S}_{\mathtt{esd}(n^{\alpha}\hat{\bm{\Sigma}})}(-r)\right)
&=\frac{d}{dr}\mathrm{tr}(\varrho(\hat{\bm{\Sigma}}+\varrho\mathbf{I}_{p})^{-1})\\
&=\frac{d\varrho}{dr}\cdot\frac{d}{d\varrho}\mathrm{tr}(\varrho(\hat{\bm{\Sigma}}+\varrho\mathbf{I}_{p})^{-1})=n^{-\alpha}\mathrm{tr}((\hat{\bm{\Sigma}}+\varrho\mathbf{I}_{p})^{-2}\hat{\bm{\Sigma}}).
\end{aligned}$$

By Lemma B.4, we have

$$\begin{aligned}
pr\,\mathcal{S}_{\mathtt{esd}(n^{\alpha}\hat{\bm{\Sigma}})}(-r)
&=pr\left(\gamma\cdot\mathcal{S}_{\mathtt{esd}(n^{\alpha}\check{\mathbf{G}})}(-r)+\frac{(1-\gamma)}{r}\right)\\
&=nr\,\mathcal{S}_{\mathtt{esd}(n^{\alpha}\check{\mathbf{G}})}(-r)+p(1-\gamma).
\end{aligned}$$

Thus, we have

$$\frac{d}{dr}\left(pr\,\mathcal{S}_{\mathtt{esd}(n^{\alpha}\hat{\bm{\Sigma}})}(-r)\right)=n\frac{d}{dr}\left(r\,\mathcal{S}_{\mathtt{esd}(n^{\alpha}\check{\mathbf{G}})}(-r)\right)$$

from which we conclude that

$$\mathrm{tr}((\hat{\bm{\Sigma}}+\varrho\mathbf{I}_{p})^{-2}\hat{\bm{\Sigma}})=n^{\alpha+1}\frac{d}{dr}\left(r\,\mathcal{S}_{\mathtt{esd}(n^{\alpha}\check{\mathbf{G}})}(-r)\right).$$

In view of $\mathbb{E}\|\hat{\bm{\beta}}\|_{2}^{2}\geq n^{-1}\sigma^{2}\mathbb{E}[\mathrm{tr}((\hat{\bm{\Sigma}}+\varrho\mathbf{I}_{p})^{-2}\hat{\bm{\Sigma}})]$ from Proposition B.1, we get the desired inequality. ∎
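The key identity of this proof, $\mathrm{tr}((\hat{\bm{\Sigma}}+\varrho\mathbf{I}_{p})^{-2}\hat{\bm{\Sigma}})=n^{\alpha+1}\frac{d}{dr}(r\,\mathcal{S}_{\mathtt{esd}(n^{\alpha}\check{\mathbf{G}})}(-r))$ with $\varrho=rn^{-\alpha}$, can be sanity-checked with a finite difference. The sketch assumes $\hat{\bm{\Sigma}}=\frac{1}{n}\mathbf{X}\mathbf{X}^{\top}$ and $\check{\mathbf{G}}=\frac{1}{n}\mathbf{X}^{\top}\mathbf{X}$; the function name `g` and all numerical values are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, alpha, r, h = 9, 4, 1.5, 0.8, 1e-5  # hypothetical values

X = rng.standard_normal((p, n))
Sigma_hat = X @ X.T / n   # assumed p x p sample covariance
G_check = X.T @ X / n     # assumed n x n Gram matrix
rho = r * n ** (-alpha)   # the scaling rho = r n^{-alpha} used in the proof

scaled_eigs = np.linalg.eigvalsh(n ** alpha * G_check)

def g(t):
    # g(t) = t * S(-t), where S is the Stieltjes transform of esd(n^alpha G_check).
    return t * np.mean(1.0 / (scaled_eigs + t))

# Central finite difference approximating dg/dr.
derivative = (g(r + h) - g(r - h)) / (2 * h)

R = np.linalg.inv(Sigma_hat + rho * np.eye(p))
lhs = np.trace(R @ R @ Sigma_hat)
rhs = n ** (alpha + 1) * derivative
print(lhs, rhs)
```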

B.1 Proof of Proposition 4.4

Proof of Proposition 4.4.

To simplify notation in this proof, we write $\gamma$ instead of $\gamma_{\ast}$. Now, the set of eigenvalues of $n^{\alpha}\bm{\Sigma}$ can be expressed as

$$\begin{aligned}
&\{(n/i)^{\alpha}\}_{i=1,\dots,p}\\
&=\{\underbrace{(\tfrac{n}{p})^{\alpha}}_{=\gamma^{\alpha}},\dots,(\tfrac{n}{n+1})^{\alpha},\underbrace{\tfrac{n}{n}}_{=1},(\tfrac{n}{n-1})^{\alpha},\dots,\underbrace{(\tfrac{n}{1})^{\alpha}}_{=n^{\alpha}}\}.
\end{aligned}$$

Thus, $\mathtt{cdf}[\mathtt{esd}(n^{\alpha}\bm{\Sigma})](t)=0$ if $t<\gamma^{\alpha}$ and $=1$ if $t\geq n^{\alpha}$. It remains to calculate $\mathtt{cdf}[\mathtt{esd}(n^{\alpha}\bm{\Sigma})](t)$ for $t\in[\gamma^{\alpha},n^{\alpha})$.

To this end, let $t \in [\gamma^{\alpha}, n^{\alpha})$ and let $j(t) \in \{1,\dots,p\}$ be the smallest index such that $(n/j(t))^{\alpha} \leq t$. Such an index exists because $(n/p)^{\alpha} = \gamma^{\alpha} \leq t$. By definition of the CDF, we have $\mathtt{cdf}[\mathtt{esd}(n^{\alpha}\bm{\Sigma})](t) = \tfrac{1}{p}(p - j(t) + 1)$. We first argue that $j(t) \neq 1$ by contradiction. If $j(t) = 1$, then we have $n^{\alpha} \leq t$, which contradicts $t < n^{\alpha}$. Thus, $j(t) \neq 1$.

Now, by the definition of $j(t)$, we have $(n/(j(t)-1))^{\alpha} > t$. Therefore, $n/j(t) \leq t^{1/\alpha} < n/(j(t)-1)$, which implies that $j(t) - 1 < n t^{-1/\alpha} \leq j(t)$. By the definition of the ceiling function, we have that $j(t) = \mathtt{ceil}(n t^{-1/\alpha})$. Therefore,

$$\begin{aligned}
\mathtt{cdf}[\mathtt{esd}(n^{\alpha}\bm{\Sigma})](t)
&= \tfrac{1}{p}\left(p - \mathtt{ceil}(n t^{-1/\alpha}) + 1\right) \\
&= 1 - \tfrac{\gamma}{n}\,\mathtt{ceil}(n t^{-1/\alpha}) + \tfrac{\gamma}{n}.
\end{aligned}$$

Taking the limit of both sides as $n \to \infty$ and using the fact that $\lim_{n\to\infty} \mathtt{ceil}(nc)/n = c$ for any $c > 0$ (here with $c = t^{-1/\alpha}$), we get the desired result. ∎
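As a quick numerical sanity check of the counting argument above, the following sketch compares the empirical CDF of the scaled eigenvalues $(n/i)^{\alpha}$ with the finite-$n$ ceiling expression and with the candidate limit $1 - \gamma t^{-1/\alpha}$. The values of `n`, `p`, `alpha` and the test points `t` are illustrative assumptions, not values taken from the paper.

```python
import math
import numpy as np

def empirical_cdf(t, n, p, alpha):
    """Fraction of the scaled eigenvalues (n/i)^alpha, i = 1..p, that are <= t."""
    eigs = (n / np.arange(1, p + 1)) ** alpha
    return float(np.mean(eigs <= t))

def finite_n_cdf(t, n, p, alpha):
    """Finite-n expression (p - ceil(n * t^(-1/alpha)) + 1) / p from the proof."""
    return (p - math.ceil(n * t ** (-1.0 / alpha)) + 1) / p

def limiting_cdf(t, gamma, alpha):
    """Candidate limit 1 - gamma * t^(-1/alpha), for t >= gamma^alpha."""
    return 1.0 - gamma * t ** (-1.0 / alpha)

# Illustrative values (assumptions, not from the paper's experiments).
n, p, alpha = 2000, 4000, 1.75
gamma = n / p  # the proof uses (n/p)^alpha = gamma^alpha

for t in (0.6, 2.0, 3.0, 50.0):  # all lie in [gamma^alpha, n^alpha)
    print(t,
          empirical_cdf(t, n, p, alpha),
          finite_n_cdf(t, n, p, alpha),
          limiting_cdf(t, gamma, alpha))
```

The first two columns should agree, and the gap between the finite-$n$ expression and the limit is at most $\gamma/n$, which is the size of the ceiling correction.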

Appendix C Proof of the Positivity Condition portion of Proposition 2.10

This section proves the Positivity Condition (Assumption 2.9) portion of Proposition 2.10. Thus, throughout this section, we assume the setting of Example 2.8. As mentioned in the main text, it is well known that Assumptions 2.5 and 2.6 hold for the HDA model, so we take them as given. Now, using Assumption 2.6 and elementary calculus, we first show that

$$\lim_{n\to\infty}\mathbb{E}\big[\tfrac{d}{dr}\big(r\mathcal{S}_{\mathtt{esd}(n^{\alpha}\check{\mathbf{G}})}(-r)\big)\big] = \left(\tfrac{dr}{dk}\right)^{-1}\cdot\tfrac{d}{dk}\left(k\mathcal{S}_{H}(-k)\right)$$

where $r$ and $k$ are as in Assumption 2.5. Thus, we reduce to showing the positivity of $\tfrac{dr}{dk}$ and $\tfrac{d}{dk}\left(k\mathcal{S}_{H}(-k)\right)$.

Before proceeding, we recall several definitions and notations adapted from Dobriban and Wager (2018). The function $v$ defined by

$$\lim_{n\to\infty}\mathbb{E}\left[\mathcal{S}_{\mathtt{esd}(n^{\alpha}\check{\mathbf{G}})}(z)\right] = v(z) \tag{14}$$

is analogous to the $v(z)$ defined in the paragraph immediately following Dobriban and Wager (2018, Eqn. (2)). The difference is that our Equation 14 is for the limit of the $n^{\alpha}$-scaled matrices $n^{\alpha}\check{\mathbf{G}}$, rather than for $\check{\mathbf{G}}$ as in Dobriban and Wager (2018).

Let $H = \lim_{n\to\infty}\mathtt{cdf}[\mathtt{esd}(n^{\alpha}\bm{\Sigma})]$ be the limiting distribution as in Assumption 2.2. Plugging $z = -r$ into Dobriban and Wager (2018, Eqn. (A.1)), we have

$$-\frac{1}{v(-r)} = -r - \frac{1}{\gamma}\int\frac{t\,dH(t)}{1+tv(-r)}.$$

Letting $k \equiv k(r) := \frac{1}{v(-r)}$, we can rewrite the above as

$$1 = \frac{r}{k} + \frac{1}{\gamma}\int\frac{t\,dH(t)}{k+t}. \tag{15}$$

By construction, we have

$$\frac{1}{\gamma}\int\frac{t\,dH(t)}{k+t} = \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{p}\frac{1}{1+kn^{-\alpha}\lambda_{i}^{-1}}$$

where the RHS is as in Assumption 2.5. Consequently, the $k$ in the pair $(r, k)$ from Assumption 2.5 coincides with the earlier definition $k := \frac{1}{v(-r)}$ given right before Equation 15. Having established the above, we now turn to the main lemma of this section.
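As a numerical aside, the finite-$n$ analogue of Equation 15 can be solved for $k$ given $r$ by one-dimensional root finding. The sketch below assumes the power-law spectrum $\lambda_i = i^{-\alpha}$, so that $kn^{-\alpha}\lambda_i^{-1} = k(i/n)^{\alpha}$. The helper names (`spectral_sum`, `residual`, `solve_k`) and the parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy.optimize import brentq

def spectral_sum(k, n, p, alpha):
    """(1/n) * sum_{i=1}^p 1 / (1 + k * n^(-alpha) / lambda_i), with lambda_i = i^(-alpha)."""
    i = np.arange(1, p + 1)
    return float(np.sum(1.0 / (1.0 + k * (i / n) ** alpha)) / n)

def residual(k, r, n, p, alpha):
    """Finite-n analogue of Equation 15 written as r/k + sum - 1 (zero at the solution)."""
    return r / k + spectral_sum(k, n, p, alpha) - 1.0

def solve_k(r, n, p, alpha):
    """Solve for k given r > 0; the residual is decreasing in k, so the root is unique."""
    return brentq(residual, 1e-9, 1e9, args=(r, n, p, alpha))

n, p, alpha = 2000, 4000, 1.75  # illustrative values only
for r in (0.1, 1.0, 10.0):
    k = solve_k(r, n, p, alpha)
    print(f"r = {r:5.1f}  ->  k = {k:.6f}")
```

Bracketing works because the residual blows up as $k \to 0^{+}$ and tends to $-1$ as $k \to \infty$. The computed $k$ increases with $r$, in line with the positivity of $\tfrac{dr}{dk}$ established below.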

Lemma C.1.

Under the HDA model (Example 2.8) and the EVD condition (Assumption 2.2), we have that $\lim_{n\to\infty}\mathbb{E}\big[\tfrac{d}{dr}\big(r\mathcal{S}_{\mathtt{esd}(n^{\alpha}\check{\mathbf{G}})}(-r)\big)\big] > 0$.

Proof of Lemma C.1.

By the product rule, we have

$$\tfrac{d}{dr}\left(r\mathcal{S}_{\mathtt{esd}(n^{\alpha}\check{\mathbf{G}})}(-r)\right) = \mathcal{S}_{\mathtt{esd}(n^{\alpha}\check{\mathbf{G}})}(-r) - r\mathcal{S}_{\mathtt{esd}(n^{\alpha}\check{\mathbf{G}})}'(-r).$$

Now, taking the expectation and the limit as $n \to \infty$ on both sides of the above equation, we have

$$\begin{aligned}
\lim_{n\to\infty}\mathbb{E}\left[\tfrac{d}{dr}\left(r\mathcal{S}_{\mathtt{esd}(n^{\alpha}\check{\mathbf{G}})}(-r)\right)\right]
&= \lim_{n\to\infty}\mathbb{E}\left[\mathcal{S}_{\mathtt{esd}(n^{\alpha}\check{\mathbf{G}})}(-r) - r\mathcal{S}_{\mathtt{esd}(n^{\alpha}\check{\mathbf{G}})}'(-r)\right] \\
&= v(-r) - rv'(-r) && \because \text{Definition of $v$ and $v'$} \\
&= \tfrac{d}{dr}\left(rv(-r)\right) && \because \text{Product rule} \\
&= \tfrac{d}{dr}\left(k\mathcal{S}_{H}(-k)\right) && \because \text{Marchenko-Pastur law (Assumption 2.6)} \\
&= \tfrac{dk}{dr}\cdot\tfrac{d}{dk}\left(k\mathcal{S}_{H}(-k)\right) && \because \text{Chain rule} \\
&= \left(\tfrac{dr}{dk}\right)^{-1}\cdot\tfrac{d}{dk}\left(k\mathcal{S}_{H}(-k)\right) && \because \text{Inverse function theorem}
\end{aligned}$$

To complete the proof, it suffices to show that both $\frac{dr}{dk}$ and $\frac{d}{dk}\left(k\mathcal{S}_{H}(-k)\right)$ are positive, which will be checked in the next two lemmas. ∎

Lemma C.2.

The function $\frac{dr}{dk}$ evaluated at $k$ is positive.

Proof of Lemma C.2.

Recall that $k = \frac{1}{v(-r)}$. Thus, we have

$$\frac{dk}{dr}(r) = (-1)\frac{1}{v(-r)^{2}}(-1)\cdot v'(-r) = \frac{v'(-r)}{v(-r)^{2}}.$$

From the proof of Silverstein and Choi (1995, Theorem 4.1), we see that $v'(\cdot) > 0$ for all negative inputs. In particular, $v'(-r) > 0$, which implies that $\frac{dk}{dr}$ is positive. By the inverse function theorem, $\frac{dr}{dk} = \left(\frac{dk}{dr}\right)^{-1}$ is also positive. ∎
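The monotonicity of $r$ in $k$ can also be observed numerically at finite $n$. The sketch below assumes the power-law spectrum $\lambda_i = i^{-\alpha}$ and solves the finite-$n$ analogue of Equation 15 for $r$ explicitly, $r = k\big(1 - \tfrac{1}{n}\sum_{i=1}^{p}\tfrac{1}{1+k(i/n)^{\alpha}}\big)$; differentiating term by term gives $\tfrac{dr}{dk} = 1 - \tfrac{1}{n}\sum_{i=1}^{p}\tfrac{1}{(1+k(i/n)^{\alpha})^{2}}$. The function names and parameter values are illustrative assumptions, and this is a finite-$n$ illustration rather than a substitute for the proof.

```python
import numpy as np

def r_of_k(k, n, p, alpha):
    """Finite-n Equation 15 solved for r: r = k * (1 - (1/n) sum_i 1/(1 + k (i/n)^alpha))."""
    i = np.arange(1, p + 1)
    return float(k * (1.0 - np.sum(1.0 / (1.0 + k * (i / n) ** alpha)) / n))

def dr_dk(k, n, p, alpha):
    """Term-by-term derivative of r_of_k: 1 - (1/n) sum_i 1/(1 + k (i/n)^alpha)^2."""
    i = np.arange(1, p + 1)
    return float(1.0 - np.sum(1.0 / (1.0 + k * (i / n) ** alpha) ** 2) / n)

n, p, alpha = 2000, 4000, 1.75          # illustrative values only
ks = np.linspace(3.0, 50.0, 25)         # a range of k on which r_of_k is positive
rs = np.array([r_of_k(k, n, p, alpha) for k in ks])
slopes = np.array([dr_dk(k, n, p, alpha) for k in ks])

print("r positive on grid:   ", bool(np.all(rs > 0)))
print("r increasing on grid: ", bool(np.all(np.diff(rs) > 0)))
print("dr/dk positive on grid:", bool(np.all(slopes > 0)))
```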

Lemma C.3.

The quantity $\frac{d}{dk}\left(k\mathcal{S}_{H}(-k)\right)$ is positive.

Proof of Lemma C.3.

Plugging $z = -r$ into Dobriban and Wager (2018, Eqn. (3)), we have

$$v(-r) - \frac{1}{r} = \frac{1}{\gamma}\left(m(-r) - \frac{1}{r}\right). \tag{16}$$

Now,

\begin{align}
rm(-r) &= \gamma rv(-r) + (1-\gamma) && \because \text{Equation 16} \tag{17} \\
&= \gamma\frac{r}{k} + (1-\gamma) && \because \text{Definition of $k$} \tag{18} \\
&= \left(\gamma - \int\frac{t\,dH(t)}{k+t}\right) + (1-\gamma) && \because \text{Equation 15} \tag{19} \\
&= 1 - \int\frac{t\,dH(t)}{k+t} \tag{20} \\
&= \int\frac{k\,dH(t)}{k+t} && \because 1 = \int dH(t) = \int\frac{k+t}{k+t}\,dH(t) \tag{21} \\
&= k\mathcal{S}_{H}(-k). \tag{22}
\end{align}

Thus, differentiating under the integral, we have

$$\frac{d}{dk}\left(k\mathcal{S}_{H}(-k)\right) = \int\frac{d}{dk}\left(\frac{k}{k+t}\right)dH(t) = \int\frac{t\,dH(t)}{(k+t)^{2}} > 0$$

as desired. ∎
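The differentiation step can be checked numerically on a discrete stand-in for $H$. In the sketch below, $H$ is replaced by the empirical distribution of the atoms $t_i = (n/i)^{\alpha}$, so that $k\mathcal{S}_{H}(-k)$ becomes the average of $k/(k+t_i)$; a central finite difference is then compared against the closed form, the average of $t_i/(k+t_i)^{2}$. The function names, the step size `h`, and the parameter values are assumptions made for illustration.

```python
import numpy as np

def k_times_stieltjes(k, t):
    """k * S_H(-k) = mean(k / (k + t_i)) for a discrete H with atoms t_i."""
    return float(np.mean(k / (k + t)))

def derivative_formula(k, t):
    """Closed form mean(t_i / (k + t_i)^2), obtained by differentiating term by term."""
    return float(np.mean(t / (k + t) ** 2))

n, p, alpha = 200, 400, 1.75                 # illustrative values only
t = (n / np.arange(1, p + 1)) ** alpha       # atoms of the discrete stand-in for H

k, h = 3.0, 1e-5
finite_difference = (k_times_stieltjes(k + h, t) - k_times_stieltjes(k - h, t)) / (2 * h)
print(finite_difference, derivative_formula(k, t))
```

Both numbers should agree to several digits, and the closed form is a sum of strictly positive terms whenever $k > 0$ and the atoms are positive.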

Appendix D Proof of Theorem 1.6 — rapid norm growth

Proof of Theorem 1.6.

Let $\tau' := \frac{\tau+\sigma^{2}}{2}$. Then by definition, we have $\tau' \in (0,\sigma^{2})$. From Theorem 1.5, we can pick $r' > 0$ so that the sequence of regularizers $\varrho_{n}' = r'n^{-\alpha}$ satisfies $\mathcal{E}_{\mathtt{train}}^{*}(r') = \tau'$. Next, we note that there exists $c > 0$ such that

$$\Pr\left(\mathcal{E}_{\mathtt{train}}(\hat{\bm{\beta}}_{\varrho_{n}'}) \geq \mathcal{E}_{\mathtt{train}}(\hat{\bm{\beta}}_{\varrho_{n}})\right) \geq c$$

for all $n$ sufficiently large. Such a $c$ is guaranteed to exist by the fact that

$$\lim_{n}\mathcal{E}_{\mathtt{train}}(\hat{\bm{\beta}}_{\varrho_{n}'}) = \tau' = \frac{\tau+\sigma^{2}}{2} > \tau = \lim_{n}\mathcal{E}_{\mathtt{train}}(\hat{\bm{\beta}}_{\varrho_{n}}).$$

Now, note that

\begin{align}
& \mathcal{E}_{\mathtt{train}}(\hat{\bm{\beta}}_{\varrho_{n}'}) \geq \mathcal{E}_{\mathtt{train}}(\hat{\bm{\beta}}_{\varrho_{n}}) \tag{23} \\
\iff{} & \text{$\hat{\bm{\beta}}_{\varrho_{n}}$ is an $\mathcal{E}_{\mathtt{train}}(\hat{\bm{\beta}}_{\varrho_{n}'})$-near interpolator} \tag{24} \\
\implies{} & \|\hat{\bm{\beta}}_{\varrho_{n}}\|^{2} \geq \|\hat{\bm{\beta}}_{\varrho_{n}'}\|^{2}. \tag{25}
\end{align}

From this, we conclude that there exists a $c > 0$ such that

$$\Pr\left(\|\hat{\bm{\beta}}_{\varrho_{n}}\|^{2} \geq \|\hat{\bm{\beta}}_{\varrho_{n}'}\|^{2}\right) \geq c$$

for all $n$ sufficiently large. From this, it follows that $\mathbb{E}[\|\hat{\bm{\beta}}_{\varrho_{n}}\|^{2}] \geq c\,\mathbb{E}[\|\hat{\bm{\beta}}_{\varrho_{n}'}\|^{2}]$. By Proposition 1.7 and the preceding inequality, we are done. ∎
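The comparison between the two ridge solutions can be illustrated on synthetic data. The sketch below is only an illustration under stated assumptions: Gaussian features with power-law covariance $\lambda_i = i^{-\alpha}$, a ridge objective normalized as $\tfrac{1}{n}\|\mathbf{X}\bm{\beta}-\mathbf{y}\|^{2} + \varrho\|\bm{\beta}\|^{2}$ (the normalization may differ from the main text), and hypothetical constants `r_small < r_large` playing the roles of $r$ and $r'$. Along the ridge path, the smaller regularizer gives a smaller training error and a larger norm, which is the direction of the implication from (24) to (25).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): power-law covariance, Gaussian features.
n, p, alpha, sigma = 50, 100, 1.75, 1.0
lam = np.arange(1, p + 1, dtype=float) ** (-alpha)
X = rng.standard_normal((n, p)) * np.sqrt(lam)
y = X @ rng.standard_normal(p) + sigma * rng.standard_normal(n)

def ridge(rho):
    """Minimizer of (1/n)||X b - y||^2 + rho ||b||^2, computed in its dual (n x n) form."""
    return X.T @ np.linalg.solve(X @ X.T + n * rho * np.eye(n), y)

def train_error(beta):
    """Mean squared residual on the training set."""
    return float(np.mean((X @ beta - y) ** 2))

r_small, r_large = 0.5, 5.0                      # hypothetical constants with r < r'
rho_small = r_small * n ** (-alpha)              # regularizers scaled as r * n^(-alpha)
rho_large = r_large * n ** (-alpha)
b_small, b_large = ridge(rho_small), ridge(rho_large)

print("train errors: ", train_error(b_small), train_error(b_large))
print("squared norms:", float(np.sum(b_small**2)), float(np.sum(b_large**2)))
```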

Appendix E Code for implementing $\mathcal{I}$ and $\mathcal{J}$

The $\mathcal{I}$ and $\mathcal{J}$ functions from Definition 3.1 can be implemented in SciPy as:

import scipy.special as sc
gamma = 0.5
alpha = 1.75

# I helper
I_gen = lambda x, k, alpha: x * sc.hyp2f1(1, 1 / alpha, 1 + 1 / alpha, -k * x**alpha)
# J helper
J_gen = lambda x, k, alpha: x * sc.hyp2f1(2, 1 / alpha, 1 + 1 / alpha, -k * x**alpha)

I = lambda k: I_gen(1 / gamma, k, alpha)  # \mathcal{I}
J = lambda k: J_gen(1 / gamma, k, alpha)  # \mathcal{J}

N = lambda k: 1 - I(k)  # helper
D = lambda k: 1 - J(k)  # helper

Etst = lambda k: 1 / D(k)        # \mathcal{E}_{test} / \sigma^2
Etrn = lambda k: N(k)**2 / D(k)  # \mathcal{E}_{train} / \sigma^2
R = lambda k: k * (1 - I(k))     # \mathcal{R}
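A possible way to use these functions is sketched below. It re-declares the functions from the listing above (as `def`s, so the block is self-contained) and locates the value `k0` at which $\mathcal{I}(k_0) = 1$, where `R` and `Etrn` both vanish. Reading `k0` as the interpolation limit is our interpretation of the listing rather than a statement from this appendix, and the bracketing interval `[0.1, 100]` is an assumption that suits these particular values of `gamma` and `alpha`.

```python
import numpy as np
import scipy.special as sc
from scipy.optimize import brentq

gamma, alpha = 0.5, 1.75  # same illustrative values as in the listing above

def I(k):
    x = 1.0 / gamma
    return x * sc.hyp2f1(1, 1 / alpha, 1 + 1 / alpha, -k * x**alpha)

def J(k):
    x = 1.0 / gamma
    return x * sc.hyp2f1(2, 1 / alpha, 1 + 1 / alpha, -k * x**alpha)

def Etst(k):
    return 1 / (1 - J(k))                   # test error / sigma^2

def Etrn(k):
    return (1 - I(k)) ** 2 / (1 - J(k))     # training error / sigma^2

def R(k):
    return k * (1 - I(k))

# Value of k where I(k0) = 1, so that R(k0) = 0 and Etrn(k0) = 0.
k0 = brentq(lambda k: I(k) - 1.0, 0.1, 100.0)

# Trace the training/testing trade-off for k above k0.
for k in k0 * np.array([1.0, 1.5, 3.0, 10.0]):
    print(f"k = {k:8.4f}  R = {R(k):8.4f}  Etrn = {Etrn(k):.4f}  Etst = {Etst(k):.4f}")
```

Sweeping `k` over `[k0, ∞)` in this way yields one curve of `Etst` against `Etrn` for a given `alpha`.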

Appendix F Experiments

Code for reproducing all figures is included in the official GitHub repository:

For downloading the UCI regression datasets, we use the following repository:

For the neural tangent kernel, we use the official repository associated with Arora et al. (2019):

In Figure 6-left, note that the curve corresponding to stock.2-1 has the fastest spectral decay and simultaneously the worst trade-off. Evidently, a larger decay exponent corresponds to a poorer trade-off, especially for near-interpolators, i.e., as the training error approaches $0$. This is in agreement with our theoretical results under random matrix theory assumptions illustrated in Figure 2.

Figure 6: Left. Training/testing error trade-off on the “stock” dataset from the UCI regression dataset collection using kernel ridge regression with the neural tangent kernel. Each curve is labeled by “DatasetName.d-f”, where “d” and “f” represent the number of layers and the number of fixed layers in the NTK corresponding to ReLU networks. Right. The eigenvalue index vs. eigenvalue plot of the NTK matrix exhibits power-law spectra. A tiny value is added to the eigenvalues for better visualization on the log scale.