CASR: A Robust Cyclic Framework for Arbitrary Large-Scale Super-Resolution with Distribution Alignment and Self-Similarity Awareness

Wenhao Guo1, Zhaoran Zhao1, Peng Lu1,∗, Sheng Li2, Qian Qiao1, RuiDe Li1
1School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China
2Peking University, Beijing, China
{whguo, lupeng, zhaozhaoran}@bupt.edu.cn, xujun.peng@capitalone.com, lisheng@pku.edu.cn
Abstract

Arbitrary-Scale Image Super-Resolution (ASISR) remains fundamentally limited by cross-scale distribution shift: once the inference scale leaves the training range, noise, blur, and artifacts accumulate sharply. We revisit this challenge from a cross-scale distribution transition perspective and propose CASR, a simple yet highly efficient cyclic SR framework that reformulates ultra-magnification as a sequence of in-distribution scale transitions. This design ensures stable inference at arbitrary scales while requiring only a single model. CASR tackles two major bottlenecks: distribution drift across iterations and patch-wise diffusion inconsistencies. The proposed SDAM module aligns structural distributions via superpixel aggregation, preventing error accumulation, while the SARM module restores high-frequency textures by enforcing autocorrelation and embedding LR self-similarity priors. Despite using only a single model, our approach significantly reduces distribution drift, preserves long-range texture consistency, and achieves superior generalization even at extreme magnification.

∗ Corresponding author.

1 Introduction

Arbitrary-Scale Image Super-Resolution (ASISR) aims to reconstruct high-resolution (HR) images at arbitrary scaling factors from a single low-resolution (LR) input using one unified model. While existing methods [8, 5, 31, 26, 9] perform well within their trained scale ranges, they degrade sharply once the inference scale moves beyond this regime. This failure is rooted in cross-scale distribution shift, where the LR-to-HR mapping, texture statistics, and reconstruction priors become inconsistent under large-scale jumps, leading to blur, detail loss, and severe artifacts. Such instability fundamentally limits the practicality and scalability of ASISR in real-world ultra-magnification scenarios.

Figure 1: Comparison of cyclic cascade stability across different ASISR methods. The SIFID measures distribution shifts between reconstructed images and the training data during cascading. Our method achieves notably higher distribution stability than others.

A straightforward strategy to address this extrapolation challenge is to enlarge the training scale range. However, the ill-posed one-to-many mapping in SR becomes increasingly unstable as the scale expands, making optimization intractable and convergence unreliable. Cascading multiple specialized SR networks is another option, but such pipelines incur substantial parameter redundancy, storage overhead, and lack flexibility for dynamically varying scales.

To overcome these limitations, we propose a cyclic reusable single-network architecture that interprets ultra-magnification as a sequence of in-distribution scale transitions.

Instead of directly predicting large upscaling factors, which forces the model outside its training distribution, CASR progressively enlarges images by repeatedly applying the same SR model. Each step remains within the learned distribution, ensuring stable inference while significantly reducing complexity and memory usage. This cyclic formulation achieves high-quality reconstruction through gradual refinement, offering an elegant, scalable, and efficient solution for ASISR.

However, building a robust cyclic SR framework introduces two key challenges. First, recursive application of an SR model inevitably causes distribution drift: intermediate outputs gradually deviate from the training distribution, amplifying residual noise, ringing, and blur through positive feedback. As illustrated in Fig. 1, the SIFID [24] metric increases consistently with each iteration, reflecting this drift and the resulting quality degradation. Second, diffusion-based priors have shown remarkable potential in enhancing texture realism, but most diffusion backbones impose strict input resolution constraints. In cyclic ASISR, progressively enlarged images must be processed in smaller patches due to memory limits, followed by reassembly into full-resolution outputs. While overlap blending [2, 13] partially mitigates boundary artifacts, cross-patch self-similarity (the coherence of textures and repeated structures across adjacent patches) remains difficult to preserve, as shown in Fig. 2.

Figure 2: Texture inconsistency between patches caused by patch-based super-resolution: identical repeated objects are reconstructed with different texture patterns.

To address these challenges, we propose a novel super-resolution framework, CASR, consisting of two key components: a Superpixel-based Distribution Alignment Module (SDAM) and a Self-similarity Aware Refinement Module (SARM). SDAM stabilizes the cross-scale distribution transitions within the cyclic process by grouping visually similar pixels into homogeneous superpixel regions. This suppresses isolated noise, mitigates edge misalignment, and adaptively controls region granularity to prevent artifacts and over-smoothing. A normalized depth constraint further corrects spatial misalignment during local upsampling, preserving structural consistency across intermediate representations. SARM aims to restore high-frequency textures lost during degradation. Guided by an autocorrelation loss, the network captures repetitive structures, while a cross-correlation mechanism explicitly embeds self-similarity from the low-resolution input. This enables the reconstructed image to maintain structural coherence and significantly enhance fine-grained texture fidelity.

Our main contributions are summarized as follows:

  • We propose CASR, a simple yet theoretically grounded cyclic SR framework that models ultra-magnification as a sequence of in-distribution scale transitions, fundamentally mitigating cross-scale distribution shift.

  • We design SDAM and SARM to jointly stabilize distribution drift and preserve cross-scale self-similarity, enabling CASR to achieve coherent textures and state-of-the-art performance under extreme magnification.

2 Related Work

2.1 Arbitrary-Scale Super-Resolution

MetaSR [12] introduced a meta-upscaling module that dynamically generates filter weights based on input coordinates and scaling factors, enabling arbitrary magnification. However, its generalization drops sharply at large scales. LIIF [8] adopts implicit neural representation (INR), where an MLP predicts RGB values for queried coordinates from local LR features, allowing extrapolation beyond training scales. Subsequent works integrated generative paradigms such as normalizing flows [31, 26] and diffusion models [9, 17] to improve perceptual fidelity. LINF [31] first combined normalizing flows with INR for arbitrary-scale SR, while BFSR [26] introduced conditional learning for further gains. IDM [9] and Kim [17] incorporated diffusion priors, achieving state-of-the-art perceptual results on category-specific datasets. Although implicit neural representation methods for ASISR can scale up to $\times 30$, they still suffer from blur or distortion when recovering fine image details at scales exceeding $\times 4$.

Figure 3: Illustration of the proposed CASR. The purple module denotes the SDAM, the green block corresponds to the SARM, and the gray U-Net represents the SR backbone.

2.2 Large-scale Image Super-Resolution

Large-scale SR has traditionally been explored under fixed upsampling factors. Many methods attempt to mitigate the ill-posedness of extreme downsampling by training on LR–HR pairs up to $\times 16$ or $\times 64$, often within restricted semantic domains. For example, PULSE [19] optimizes StyleGAN [15] latent codes to generate HR faces consistent with LR inputs, while GLEAN [6] enhances spatial consistency by conditioning StyleGAN on convolutional features, extending scalability to $\times 64$. However, these generative approaches remain domain-specific (typically limited to faces, cats, or other well-structured categories) and struggle to generalize to arbitrary real-world content.

Cascaded SR frameworks [22, 21] address scalability by sequentially applying multiple SR models. SR3 [23] stacks several $\times 4$ diffusion-based models to achieve $\times 64$ upsampling. Yet, the mismatch between intermediate outputs and the next model’s training distribution causes performance degradation across stages. CDM [11] alleviates this issue through noise and blur augmentation but requires training and storing multiple networks, limiting scalability. Such cascaded pipelines are ill-suited for arbitrary-scale SR.

3 Method

Given a low-resolution (LR) image and an arbitrary scaling factor $s$, our goal is to reconstruct a high-resolution (HR) image at arbitrary magnification. To support ultra-large scaling, we decompose $s$ into a series of sub-scale factors, each bounded by a predefined maximum scale $s_{\text{max}}$ used during training: $s=s^{1}\times s^{2}\times\cdots\times s^{k}\times\cdots\times s^{K}$, where $s^{k}\leq s_{\text{max}}$ and $k\in[1,K]$. The proposed CASR framework performs $K$ iterative upsampling steps, where each intermediate result serves as the input for the next. In the $k$-th iteration, the upsampling module enlarges $I^{k-1}$ by a factor of $s^{k}$, producing $I^{k}$. Starting from $I^{0}$, this process continues until the final HR output $I^{K}$ is obtained. As shown in Fig. 3, the input $I^{k-1}$ is first processed by the SDAM and divided into patches. Superpixel and depth maps are extracted and fed into the SR backbone, followed by refinement via the SARM to ensure texture consistency. All enhanced patches are then assembled into the final full-resolution output $I^{k}$.
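The decomposition and iteration described above can be sketched as follows. The text does not state how $s$ is split into sub-scales (the experiments use $\times 18=\times 4\times 3\times 1.5$), so the greedy rule below, which takes $s_{\text{max}}$ until the remainder fits, is only one assumed choice. The names `decompose_scale`, `cyclic_upscale`, and `upscale_fn` are hypothetical; `upscale_fn` stands in for one full SDAM, backbone, and SARM pass.

```python
import math

def decompose_scale(s, s_max=4.0):
    """Split s into factors s^1..s^K with every factor <= s_max.

    Greedy rule (an assumption, not taken from the paper): use s_max
    while the remaining scale exceeds it, then emit the remainder.
    """
    if s < 1.0:
        raise ValueError("scale must be >= 1")
    factors = []
    remaining = float(s)
    while remaining > s_max:
        factors.append(float(s_max))
        remaining /= s_max
    if remaining > 1.0 or not factors:
        factors.append(remaining)
    return factors

def cyclic_upscale(image, s, upscale_fn, s_max=4.0):
    """Reuse one upscaler K times: I^k = upscale_fn(I^{k-1}, s^k)."""
    current = image
    for s_k in decompose_scale(s, s_max):
        current = upscale_fn(current, s_k)
    return current

# Every step stays inside the trained range, and the product recovers s.
factors = decompose_scale(18, s_max=4)
assert all(f <= 4 for f in factors)
assert math.isclose(math.prod(factors), 18.0)
```

Any other factorization with $s^{k}\leq s_{\text{max}}$ fits the same loop; only `decompose_scale` would change.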

We next detail the SDAM in Sec. 3.1 and the SARM in Sec. 3.2, followed by the overall training strategy in Sec. 3.3.

3.1 Superpixel-based Distribution Alignment

During the cyclic super-resolution process, reconstruction artifacts tend to accumulate across iterations, leading to a significant distribution shift. As the SR network repeatedly enhances edges and textures, residual noise, ringing effects, and local blurring are unintentionally introduced, progressively altering the feature statistics of the reconstructed image. Consequently, the input of subsequent iterations deviates from the model’s original training manifold.

To mitigate this problem, we aim to suppress these degradation factors at an early stage while preserving as much of the original signal as possible. Motivated by this observation, we propose a novel distribution alignment strategy that separates the stable structural components from noisy artifacts.

Figure 4: Illustration of the distribution alignment process, where the input image is decomposed into a superpixel representation and a depth map. This decomposition effectively removes artifacts and noise, enabling robust SR. Panels: origin, superpixel-seg., superpixel, depth, output (ours).

Superpixel-based Structural Filtering. We first employ a superpixel segmentation strategy to eliminate accumulated artifacts by partitioning the image into coherent and visually homogeneous regions. Superpixels group perceptually similar pixels into compact, uniform segments that approximate sparse coding representations, leading to a smoother and more structured image representation. This process effectively removes cascading artifacts while preserving essential image content. Moreover, the explicit superpixel boundaries facilitate vector-like upsampling through simple nearest-neighbor interpolation.

In our framework, the input is uniformly divided into $n\times n$ superpixel blocks. We design a lightweight fully convolutional Superpixel Segmentation Network (SSN) to predict the soft assignment probabilities of each pixel to its nine neighboring regions. The SSN is adapted from SuperPixel-FCN [28], where we prune redundant channels (reducing convolutional width by 35%) and apply knowledge distillation from the original model to maintain segmentation accuracy while improving inference efficiency.

The SSN outputs a probability map indicating the likelihood of each pixel belonging to its surrounding superpixels. After bilinear upsampling followed by an argmax operation, we obtain the segmentation mask $P^{k-1}$, where each label corresponds to an individual region. The normalized representation of region $r$ in $P^{k-1}$ is computed as:

$$C^{k-1}_{r}=\frac{1}{|r|}\sum_{i\in r}I^{k-1}_{i}, \tag{1}$$

where $I^{k-1}_{i}$ denotes the pixel intensity at position $i$, and $|r|$ is the size of the region.
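As a minimal sketch of Eq. (1), the per-region mean can be computed from an image and an integer label mask and then painted back onto every pixel of the region. The function name `superpixel_mean_image` is hypothetical, the mask is assumed to hold non-negative integer region ids, and the SSN that would produce the mask is not modeled here.

```python
import numpy as np

def superpixel_mean_image(image, labels):
    """Replace each pixel by the mean colour of its superpixel (Eq. 1).

    image:  (H, W, C) array of pixel intensities I^{k-1}
    labels: (H, W) array of non-negative integer region ids P^{k-1}
    """
    h, w, c = image.shape
    flat_labels = labels.reshape(-1)
    flat_pixels = image.reshape(-1, c).astype(np.float64)

    n_regions = int(flat_labels.max()) + 1
    counts = np.bincount(flat_labels, minlength=n_regions)  # |r| per region
    sums = np.zeros((n_regions, c))
    np.add.at(sums, flat_labels, flat_pixels)                # sum of I_i over i in r

    means = sums / np.maximum(counts, 1)[:, None]            # C_r; empty ids stay 0
    return means[flat_labels].reshape(h, w, c)
```

Averaging inside each region is what suppresses isolated noise: a single outlier pixel contributes only $1/|r|$ of its value to the region's representation.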

Depth-guided Geometric Constraint. Superpixel representations alone may disrupt edge continuity, as segmentation boundaries can misalign with object contours. To preserve geometric integrity, we complement the superpixel image with depth-based structural cues. Unlike edge detection, which is easily corrupted by artifacts, depth estimation provides robust geometric information. We thus incorporate depth maps obtained from the pretrained DepthAnything [29] model as auxiliary constraints.

As a result, the original image is decomposed into two complementary and stable representations throughout the iterative cascade: a superpixel image capturing low-frequency content and a structural image preserving high-frequency geometric details. This dual representation effectively suppresses random noise while retaining semantic content and boundary consistency, providing a stable input distribution for the subsequent SR module, as illustrated in Fig. 4.

3.2 Self-Similarity Aware Refinement

Due to GPU memory constraints and the fixed input size of diffusion backbones, progressively upscaled images are divided into smaller patches for independent processing and later reassembly. Although overlapping regions can mitigate boundary artifacts, existing approaches still struggle to fully preserve fine-grained self-similarity—i.e., the ability of local textures or structures to remain consistent across different patches.

This issue involves two challenges: (1) How to represent self-similarity and guide the network to focus on repeating patterns, enabling it to recognize and preserve recurring local textures or structures within the image; (2) How to provide the network with self-similarity information and embed it into the learning process, so that independently processed patches can share fine-grained texture and semantic information, thereby maintaining global consistency.

Figure 5: Illustration of the local self-similarity computation, where structurally similar regions are assigned higher correlation.

In general, the self-similarity of an image can be expressed through correlations in deep feature space. Let $e$ denote the feature map extracted from an image by a pretrained encoder. For any local feature vector $e_{i}$, its similarity to the entire image can be written as $r_{i}=e_{i}e^{\top}$, where $r_{i}\in\mathbb{R}^{hw\times 1}$ describes the correlation between the current location and all spatial positions, as shown in Fig. 5. Preserving this correlation structure during super-resolution helps maintain the intrinsic self-similarity of the image. For the reconstructed image $I^{k}$ and the ground truth $I^{\text{gt}}$, we extract their semantic embeddings $e^{k}$ and $e^{\text{gt}}$ using a pretrained SAM [18] encoder, and compute their cosine self-correlation matrices:

$$R^{k}=e^{k}(e^{k})^{\top},\qquad R^{\text{gt}}=e^{\text{gt}}(e^{\text{gt}})^{\top}. \tag{2}$$

These matrices compactly encode pairwise similarities within each image, providing a robust representation of internal self-similarity.
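A minimal sketch of the self-correlation matrix in Eq. (2) is given below. It assumes the embedding is available as an `(h, w, d)` array and that "cosine" means each feature vector is scaled to unit length before the product; the SAM encoder itself is not called, and `self_correlation` is a hypothetical name.

```python
import numpy as np

def self_correlation(features, eps=1e-8):
    """Cosine self-correlation R = e e^T of an (h, w, d) embedding map.

    Row i of the result is r_i: the similarity between location i and
    every spatial position of the same image.
    """
    h, w, d = features.shape
    e = features.reshape(h * w, d).astype(np.float64)
    e = e / (np.linalg.norm(e, axis=1, keepdims=True) + eps)  # unit-length rows
    return e @ e.T                                            # (h*w, h*w)
```

The matrix grows as $(hw)^{2}$, so in practice the embedding resolution has to stay small for this product to remain affordable.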

To inject this information into the reconstruction process, we introduce a Self-Similarity Aware Refinement Module (SARM) that enables cross-patch information exchange. As shown in Fig. 2, SARM takes the high-dimensional patch features $f$ and aggregates global information through an attention mechanism, allowing each patch to perceive the spatial distribution of patterns across the whole image. To mitigate the loss of global context caused by patch-wise processing, we further extract a global semantic embedding $g$ from the LR image $I^{k}$ using SAM, and introduce it via cross-attention. In contrast to [30], which emphasizes local neighbor cues, our design explicitly incorporates global semantic context to strengthen cross-patch consistency. We additionally enforce this self-similarity structure through a correlation-guided objective:

$$L_{\text{corr}}=\left\|R^{k}-R^{\text{gt}}\right\|_{2}. \tag{3}$$

This correlation-guided loss enforces consistent similarity relationships among semantically related regions, ensuring coherent textures and structures in the final output.
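The objective in Eq. (3) can be sketched as a single norm of the difference between the two correlation matrices. The text does not say whether $\|\cdot\|_{2}$ denotes the entrywise (Frobenius) or the spectral norm; the Frobenius reading is assumed here, and `correlation_loss` is a hypothetical name. Both matrices must come from embeddings of the same spatial size.

```python
import numpy as np

def correlation_loss(r_pred, r_gt):
    """L_corr = ||R^k - R^gt||_2, read as the Frobenius norm (an assumption)."""
    r_pred = np.asarray(r_pred, dtype=np.float64)
    r_gt = np.asarray(r_gt, dtype=np.float64)
    if r_pred.shape != r_gt.shape:
        raise ValueError("correlation matrices must have the same shape")
    # For 2-D input, np.linalg.norm defaults to the Frobenius norm.
    return float(np.linalg.norm(r_pred - r_gt))
```

The loss is zero only when every pairwise similarity in the reconstruction matches the ground truth, which is what ties textures in separately processed patches together.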

3.3 Training Strategy

We adopt SD-Turbo as the backbone of our framework, which is a single-step diffusion model optimized for fast generation. During fine-tuning, all pretrained parameters are kept frozen, and lightweight adaptation is achieved through LoRA modules applied to both the VAE encoder and the denoising U-Net. To ensure deterministic refinement, the stochastic noise injection process in diffusion sampling is disabled. Given a superpixel-aligned input image, the module performs encoding, denoising, and decoding through a VAE–U-Net–VAE pipeline, while structural control signals from the ControlNet [32] branch are injected into the U-Net decoder to guide structure-aware reconstruction. The final output is a high-quality, detail-enhanced high-resolution image.

Two-stage Training. CASR is trained in two stages: a super-resolution stage and a self-similarity stage. In the first stage, the SD-Turbo backbone is fine-tuned while omitting the SARM, focusing on high-quality reconstruction with both perceptual and structural fidelity. The reconstruction loss is:

$$L_{\text{rec}}=\lambda_{1}L_{1}+\lambda_{2}L_{\text{LPIPS}}+\lambda_{3}L_{\text{GAN}}. \tag{4}$$

To better exploit geometric cues, a depth consistency loss is introduced. Given depth maps $d^{k}$ and $d^{\text{gt}}$ from DepthAnything [29], both normalized to $[0,1]$, the alignment loss is:

$$L_{\text{depth}}=\left\|\text{Norm}(d^{k})-\text{Norm}(d^{\text{gt}})\right\|_{2}. \tag{5}$$

The total loss for the first stage is:

$$L_{\text{total}_{1}}=L_{\text{rec}}+\lambda_{4}L_{\text{depth}}. \tag{6}$$

In the second stage, the backbone and ControlNet are frozen, and the global fusion module is trained with an additional correlation term:

$$L_{\text{total}_{2}}=L_{\text{rec}}+\lambda_{4}L_{\text{depth}}+\lambda_{5}L_{\text{corr}}, \tag{7}$$

which enhances self-similarity of the reconstructed image.
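The way Eqs. (4) to (7) combine can be sketched with plain scalars. Two points are assumptions rather than statements from the text: $\text{Norm}(\cdot)$ is taken to be min-max scaling to $[0,1]$, and the weights in `lambdas` are placeholders, since the values of $\lambda_{1},\dots,\lambda_{5}$ are not given. All function names are hypothetical, and the individual terms ($L_{1}$, LPIPS, GAN, $L_{\text{corr}}$) are passed in as already computed numbers.

```python
import numpy as np

def minmax_norm(depth, eps=1e-8):
    """Scale a depth map to [0, 1] (assumed form of Norm in Eq. 5)."""
    d = np.asarray(depth, dtype=np.float64)
    return (d - d.min()) / (d.max() - d.min() + eps)

def depth_loss(d_pred, d_gt):
    """Eq. 5: norm of the difference between normalized depth maps."""
    return float(np.linalg.norm(minmax_norm(d_pred) - minmax_norm(d_gt)))

def stage1_loss(l1, lpips, gan, l_depth, lambdas):
    """Eq. 4 plus the depth term (Eq. 6)."""
    l_rec = lambdas["l1"] * l1 + lambdas["lpips"] * lpips + lambdas["gan"] * gan
    return l_rec + lambdas["depth"] * l_depth

def stage2_loss(l1, lpips, gan, l_depth, l_corr, lambdas):
    """Eq. 7: the stage-one objective plus the correlation term."""
    return stage1_loss(l1, lpips, gan, l_depth, lambdas) + lambdas["corr"] * l_corr

# Placeholder weights for illustration only; not values from the paper.
example_lambdas = {"l1": 1.0, "lpips": 1.0, "gan": 1.0, "depth": 1.0, "corr": 1.0}
```

Because both depth maps are normalized before comparison, a global scale or offset difference between the two depth predictions does not contribute to $L_{\text{depth}}$.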

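Table 1: Comparison with ASISR methods on the DIV8K synthetic dataset, with the best results in bold. Our approach significantly outperforms others in visual perceptual metrics (LPIPS, MUSIQ, NIQE, PI).
Dataset: DIV8K.

| Method | $\times 8$ LPIPS$\downarrow$ | $\times 8$ MUSIQ$\uparrow$ | $\times 8$ NIQE$\downarrow$ | $\times 8$ PI$\downarrow$ | $\times 12$ LPIPS$\downarrow$ | $\times 12$ MUSIQ$\uparrow$ | $\times 12$ NIQE$\downarrow$ | $\times 12$ PI$\downarrow$ | $\times 18$ LPIPS$\downarrow$ | $\times 18$ MUSIQ$\uparrow$ | $\times 18$ NIQE$\downarrow$ | $\times 18$ PI$\downarrow$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LINF [31] | 0.442 | 26.01 | 10.11 | 8.93 | 0.528 | 19.42 | 11.43 | 9.87 | 0.578 | 17.24 | 13.37 | 10.89 |
| BFSR [26] | 0.399 | 24.30 | 8.29 | 7.80 | 0.500 | 18.43 | 10.70 | 9.18 | 0.561 | 17.06 | 14.24 | 11.06 |
| IDM [9] | 0.486 | 24.11 | 7.23 | 6.46 | 0.604 | 23.42 | 7.98 | 6.87 | 0.656 | 23.75 | 7.82 | 6.87 |
| Kim [17] | 0.491 | 23.54 | 8.24 | 8.84 | 0.621 | 21.69 | 8.80 | 7.38 | 0.685 | 20.06 | 8.34 | 7.72 |
| LIIF [8]+Diff | 0.411 | 28.99 | 9.32 | 8.42 | 0.496 | 20.25 | 10.86 | 9.44 | 0.550 | 17.58 | 12.17 | 10.21 |
| CiaoSR [5]+Diff | 0.408 | 30.94 | 9.15 | 8.34 | 0.493 | 20.89 | 10.64 | 9.33 | 0.545 | 17.61 | 11.93 | 10.10 |
| CASR | 0.363 | 53.63 | 5.66 | 5.07 | 0.403 | 53.82 | 5.47 | 4.89 | 0.450 | 51.44 | 6.01 | 5.24 |

| Method | $\times 24$ LPIPS$\downarrow$ | $\times 24$ MUSIQ$\uparrow$ | $\times 24$ NIQE$\downarrow$ | $\times 24$ PI$\downarrow$ | $\times 30$ LPIPS$\downarrow$ | $\times 30$ MUSIQ$\uparrow$ | $\times 30$ NIQE$\downarrow$ | $\times 30$ PI$\downarrow$ | $\times 4\times 3\times 1.5$ LPIPS$\downarrow$ | $\times 4\times 3\times 1.5$ MUSIQ$\uparrow$ | $\times 4\times 3\times 1.5$ NIQE$\downarrow$ | $\times 4\times 3\times 1.5$ PI$\downarrow$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LINF [31] | 0.608 | 16.42 | 15.34 | 11.86 | 0.625 | 16.36 | 16.32 | 12.28 | 0.640 | 19.04 | 9.39 | 6.45 |
| BFSR [26] | 0.594 | 16.42 | 16.19 | 12.05 | 0.611 | 16.49 | 16.71 | 12.21 | 0.772 | 17.50 | 11.07 | 7.26 |
| IDM [9] | 0.710 | 23.76 | 8.03 | 7.21 | 0.705 | 23.84 | 7.96 | 7.33 | 0.608 | 25.06 | 8.62 | 7.80 |
| Kim [17] | 0.796 | 19.45 | 8.57 | 8.14 | 0.709 | 20.06 | 8.48 | 8.29 | 0.604 | 23.19 | 8.72 | 8.09 |
| LIIF [8]+Diff | 0.582 | 16.55 | 13.49 | 10.86 | 0.603 | 16.16 | 15.69 | 11.85 | 0.535 | 20.27 | 11.93 | 9.19 |
| CiaoSR [5]+Diff | 0.579 | 16.35 | 13.08 | 10.66 | 0.603 | 15.95 | 15.70 | 11.87 | 0.503 | 21.06 | 10.98 | 9.57 |
| CASR | 0.495 | 41.42 | 6.93 | 6.18 | 0.501 | 41.76 | 6.98 | 6.09 | 0.450 | 51.44 | 6.01 | 5.24 |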
Figure 6: Qualitative comparison with different methods on the DIV8K dataset. For large-scale super-resolution, our method reconstructs more realistic statue textures and finer fur details on the cat’s ears. Panels, first example (“DIV8K-1493”, $\times 24$): Bicubic, Kim, LINF, LIIF+Diff, LINF-LP, CiaoSR+Diff, IDM, CASR. Panels, second example (“DIV8K-1459”, $\times 8$): Bicubic, Kim, LINF, LIIF+Diff, BFSR, CiaoSR+Diff, IDM, CASR.

4 Experiments

4.1 Training Datasets and Metrics

Following prior works [7, 34, 25, 27], we use the DF2K dataset [1] for training and synthetically generate low-resolution (LR) images via bicubic downsampling.

For perceptual evaluation, we adopt LPIPS [33] along with no-reference image quality assessment metrics, including MUSIQ [16], NIQE [20], and PI [3]. For real-world datasets, since ground-truth references are unavailable, LPIPS is excluded from the evaluation.

4.2 Testing Datasets

We evaluate our method on three types of datasets: synthetic, real-world, and face datasets. For synthetic evaluation, we adopt the last 100 HR images from the DIV8K dataset [10]. The LR inputs are generated by bicubic downsampling of the HR images. For real-world evaluation, we use the RealSR dataset [4], which contains images captured by two different cameras with complex and authentic degradations. We further evaluate our method on the CelebA-HQ dataset [14] following the settings of diffusion-based ASISR methods, including IDM [9] and Kim [17]. Specifically, 100 face images with a resolution of $128\times 128$ are randomly selected for perceptual evaluation.

4.3 Implementation Details

In all experiments, the maximum upsampling factor $s_{\text{max}}$ is set to 4. The overall training consists of two stages. In the first stage, we freeze the Distribution Alignment Module and train the SR backbone for 10K iterations with a batch size of 32 and a learning rate of $2\times 10^{-5}$. Each low-resolution input is first processed by the Distribution Alignment Module to produce $512\times 512$ superpixel and depth maps, which are then fed into the backbone. During fine-tuning, the LoRA rank parameters are set to 16 for the VAE encoder and 32 for the diffusion U-Net. In the second stage, we freeze both the Distribution Alignment Module and backbone networks, training only the SARM on $1024\times 1024$ images divided into four patches, with a batch size of 8 and the same learning rate. All models are trained on four NVIDIA A6000 GPUs.
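For reference, the settings listed above can be collected into a small configuration sketch. The structure and all identifiers (`StageConfig`, `STAGE1`, `STAGE2`, the module name strings) are hypothetical; only the numeric values come from the text, and the iteration count for the second stage is left unset because it is not stated.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class StageConfig:
    trainable: Tuple[str, ...]
    frozen: Tuple[str, ...]
    batch_size: int
    learning_rate: float
    image_size: int              # side length in pixels
    iterations: Optional[int]    # None where the text gives no count
    patches_per_image: int = 1

S_MAX = 4                                     # maximum per-step upsampling factor
LORA_RANK = {"vae_encoder": 16, "unet": 32}   # LoRA ranks during fine-tuning
NUM_GPUS = 4                                  # NVIDIA A6000

STAGE1 = StageConfig(
    trainable=("sr_backbone",),
    frozen=("sdam",),
    batch_size=32,
    learning_rate=2e-5,
    image_size=512,              # superpixel and depth maps fed to the backbone
    iterations=10_000,
)

STAGE2 = StageConfig(
    trainable=("sarm",),
    frozen=("sdam", "sr_backbone"),
    batch_size=8,
    learning_rate=2e-5,          # same learning rate as stage one
    image_size=1024,
    iterations=None,
    patches_per_image=4,
)
```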

4.4 Comparisons with State-of-the-Art

We conduct comprehensive comparisons with several state-of-the-art ASISR methods. These include perceptual quality-driven methods such as LINF [31], BFSR [26], IDM [9], and Kim [17]. Since the official checkpoint of Kim [17] has not been publicly released, we re-implemented the method according to the descriptions provided in the original paper. In addition, we enhanced LIIF [8] and CiaoSR [5] by integrating a diffusion-based post-processing module, resulting in the improved variants LIIF+Diff and CiaoSR+Diff. For high-resolution inference, all methods operate on $512\times 512$ patches, using a 64-pixel overlap to ensure seamless boundary blending during stitching.
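A minimal sketch of the patch-wise inference described above follows. The text specifies only the patch size and the 64-pixel overlap, so two choices here are assumptions: tiles are placed on a regular stride with the last tile flush to the border, and overlapping regions are blended by uniform averaging. The sketch works at a single resolution (`fn` must return a tile of the same size as its input), so the SR model itself is abstracted away; `tile_starts` and `process_tiled` are hypothetical names.

```python
import numpy as np

def tile_starts(length, patch=512, overlap=64):
    """Start offsets of tiles of size `patch` that cover [0, length)."""
    if length <= patch:
        return [0]
    stride = patch - overlap
    starts = list(range(0, length - patch, stride))
    starts.append(length - patch)  # last tile sits flush with the border
    return starts

def process_tiled(image, fn, patch=512, overlap=64):
    """Apply `fn` to overlapping tiles and average the overlapping outputs."""
    h, w = image.shape[:2]
    acc = np.zeros(image.shape, dtype=np.float64)
    # One weight per pixel, broadcast over any trailing channel axes.
    weight = np.zeros((h, w) + (1,) * (image.ndim - 2), dtype=np.float64)
    for y in tile_starts(h, patch, overlap):
        for x in tile_starts(w, patch, overlap):
            tile = image[y:y + patch, x:x + patch]
            acc[y:y + patch, x:x + patch] += fn(tile)
            weight[y:y + patch, x:x + patch] += 1.0
    return acc / weight
```

Uniform averaging hides seams in pixel values, but it does nothing to make two tiles agree on texture, which is the gap the SARM is meant to close.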

Table 2: Comparison with ASISR methods on real-world datasets, with the best results in bold. Our approach achieves consistently superior performance over others, showcasing strong generalization on real-world images.
Dataset: RealSR.

| Method | $\times 8$ MUSIQ$\uparrow$ | $\times 8$ NIQE$\downarrow$ | $\times 8$ PI$\downarrow$ | $\times 12$ MUSIQ$\uparrow$ | $\times 12$ NIQE$\downarrow$ | $\times 12$ PI$\downarrow$ | $\times 18$ MUSIQ$\uparrow$ | $\times 18$ NIQE$\downarrow$ | $\times 18$ PI$\downarrow$ | $\times 24$ MUSIQ$\uparrow$ | $\times 24$ NIQE$\downarrow$ | $\times 24$ PI$\downarrow$ | $\times 30$ MUSIQ$\uparrow$ | $\times 30$ NIQE$\downarrow$ | $\times 30$ PI$\downarrow$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LINF [31] | 19.58 | 11.50 | 9.96 | 17.22 | 12.90 | 10.83 | 16.74 | 14.59 | 11.71 | 17.26 | 15.70 | 12.29 | 18.29 | 16.26 | 12.56 |
| BFSR [26] | 19.58 | 11.50 | 9.96 | 17.22 | 12.90 | 10.83 | 16.74 | 14.59 | 11.71 | 17.26 | 15.70 | 12.29 | 18.29 | 16.26 | 12.56 |
| IDM [9] | 28.69 | 7.50 | 7.15 | 26.56 | 8.14 | 7.67 | 33.79 | 8.00 | 7.52 | 31.68 | 8.17 | 7.47 | 28.22 | 8.35 | 7.44 |
| Kim [17] | 26.39 | 8.09 | 6.94 | 25.16 | 7.83 | 7.85 | 25.32 | 8.51 | 8.42 | 30.59 | 8.58 | 8.46 | 27.43 | 8.56 | 8.43 |
| LIIF [8]+Diff | 19.83 | 10.82 | 9.36 | 16.91 | 12.32 | 10.34 | 16.14 | 13.50 | 10.90 | 16.47 | 13.90 | 11.12 | 17.39 | 14.00 | 11.12 |
| CiaoSR [5]+Diff | 20.53 | 10.62 | 9.25 | 17.14 | 12.04 | 10.19 | 16.22 | 12.99 | 10.64 | 16.24 | 13.23 | 10.74 | 17.35 | 13.26 | 10.75 |
| CASR | 53.50 | 6.81 | 5.71 | 49.42 | 6.73 | 5.73 | 44.03 | 7.56 | 6.34 | 40.35 | 7.80 | 6.65 | 37.84 | 7.81 | 6.73 |
Figure 7: Qualitative comparison with different methods on the RealSR dataset. Our method produces clearer and more natural results. Panels: Bicubic, LINF, LINF-LP, IDM, Kim, LIIF+Diff, CiaoSR+Diff, CASR.

4.4.1 Experiments on Synthetic Dataset

For the synthetic DIV8K dataset, Table 1 presents a detailed comparison with baseline methods. Our method consistently achieves the highest perceptual quality across all upsampling factors. Notably, at $\times 30$, it outperforms the second-best method, LIIF+Diff, by 16.9% in LPIPS. Regarding no-reference metrics (MUSIQ, NIQE, PI), our approach surpasses IDM by 75.2%, 12.3%, and 15.8%, respectively. Importantly, performance remains robust even at extreme upsampling scales, while BFSR and other baselines exhibit significant degradation.

Qualitative comparisons in Fig. 6 further corroborate these quantitative findings. At extreme magnifications, LINF and BFSR produce excessively blurry results, while IDM and Kim suffer from severe blocky artifacts. They struggle to recover fine structures such as rocky surfaces and statue edges, and fail to reconstruct delicate details like the fur on the cat’s ears. Even LIIF+Diff and CiaoSR+Diff exhibit noticeable artifacts and unrealistic textures when operating beyond their training distribution. In contrast, CASR preserves sharp edges and intricate fine details, clearly demonstrating its superiority in perceptual quality.

To thoroughly investigate the impact of distribution shift on cyclic super-resolution, we take the $\times 18$ upsampling task as an example. In the experiments, all baseline methods adopt a multi-stage recursive processing strategy, successively performing $\times 4$, $\times 3$, and $\times 1.5$ progressive upsampling operations. The corresponding results are recorded in the bottom-right corner of Table 1. Experimental results show that although each upsampling stage operates within the scale range covered during training, the overall performance does not improve compared to direct $\times 18$ upsampling, and the quantitative metrics of the multi-stage and single-stage methods are very close. However, the reasons for performance degradation differ: the performance decline in the multi-stage approach is mainly due to error accumulation caused by distribution shift, while the performance drop in the single-stage method results from out-of-scale amplification. Therefore, without effectively mitigating distribution shift, baseline methods struggle to benefit from cyclic cascading structures.

4.4.2 Experiments on Real-World Datasets

Table 2 presents the evaluation results on real-world datasets using no-reference metrics. Our approach consistently achieves superior perceptual quality. On the RealSR dataset at the $\times 30$ scale, it outperforms the second-best method, IDM, by 34.1%, 6.5%, and 9.5% in MUSIQ, NIQE, and PI, respectively, demonstrating strong generalization capability. As shown in Fig. 7, on real-world datasets, LIIF+Diff and CiaoSR+Diff fail to recover sharp edges and produce noticeable blocky artifacts in the second set of images. Similarly, LINF and BFSR struggle to reconstruct crisp edges, while IDM and Kim et al. still generate severe blocky artifacts. In contrast, our method not only reconstructs realistic whisker textures but also effectively restores the structural integrity of the tower top.

4.4.3 Face Image Comparisons with Diffusion-Based ASISR

Table 3 presents the perceptual evaluation results on CelebA-HQ across various upsampling scales. While IDM achieves acceptable performance at lower scales, its quality deteriorates as the upsampling factor increases. In contrast, our approach consistently delivers high perceptual fidelity and maintains fine structural details. Fig. 8 illustrates $\times 12$ super-resolution results ($1536\times 1536$). IDM and Kim produce overly smooth facial reconstructions, resulting in the loss of subtle details. In contrast, CASR accurately restores critical facial features, such as the eyes and mouth, achieving a more realistic and visually faithful reconstruction.

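Table 3: Comparison with diffusion-based ASISR methods on the CelebA-HQ dataset. Our method achieves superior performance at large upsampling scales.
Dataset: CelebA-HQ.

| Method | $\times 4$ MUSIQ$\uparrow$ | $\times 4$ NIQE$\downarrow$ | $\times 4$ PI$\downarrow$ | $\times 6$ MUSIQ$\uparrow$ | $\times 6$ NIQE$\downarrow$ | $\times 6$ PI$\downarrow$ |
|---|---|---|---|---|---|---|
| IDM [9] | 72.79 | 9.70 | 6.97 | 66.77 | 5.60 | 5.44 |
| Kim [17] | 43.57 | 7.83 | 7.48 | 41.06 | 8.06 | 7.87 |
| CASR | 71.12 | 9.69 | 7.19 | 70.86 | 4.77 | 4.87 |

| Method | $\times 8$ MUSIQ$\uparrow$ | $\times 8$ NIQE$\downarrow$ | $\times 8$ PI$\downarrow$ | $\times 12$ MUSIQ$\uparrow$ | $\times 12$ NIQE$\downarrow$ | $\times 12$ PI$\downarrow$ |
|---|---|---|---|---|---|---|
| IDM [9] | 60.18 | 6.51 | 5.83 | 41.47 | 9.66 | 8.08 |
| Kim [17] | 39.40 | 9.15 | 8.21 | 31.66 | 11.10 | 9.56 |
| CASR | 72.41 | 4.63 | 4.17 | 71.71 | 4.77 | 4.04 |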

Figure 8: Super-resolution results at $\times 12$ on CelebA-HQ. IDM and Kim fail to recover fine facial details, while our method produces cleaner and sharper reconstructions. Panels: LR ($128\times 128$), IDM ($1536\times 1536$), Kim ($1536\times 1536$), CASR (Ours) ($1536\times 1536$).

4.5 Ablation Study

4.5.1 Component Effectiveness

To evaluate the contribution of each component in the proposed CASR framework, we conduct an ablation study with four model variants. Base Model: A diffusion-based SR model built on SD-Turbo with LoRA fine-tuning, where LR inputs are upsampled via bicubic interpolation before HR reconstruction. Model 1: Adds the superpixel segmentation module. Model 2: Further incorporates the depth-conditioning module for geometric consistency. Full Model: The complete version equipped with the SARM.

Table 4: Ablation study of various components. The best performance is achieved when all components are enabled.
Scale: $\times 18$ ($\times 4\times 3\times 1.5$).

| SuperPixel | Depth | SARM | LPIPS$\downarrow$ | MUSIQ$\uparrow$ | NIQE$\downarrow$ | PI$\downarrow$ |
|---|---|---|---|---|---|---|
| | | | 0.585 | 31.73 | 7.10 | 5.91 |
| ✓ | | | 0.471 | 42.23 | 6.61 | 5.96 |
| ✓ | ✓ | | 0.467 | 45.18 | 6.15 | 5.37 |
| ✓ | ✓ | ✓ | 0.450 | 51.44 | 6.01 | 5.24 |

[Figure 9 panels: Origin, LR, Base, Model 1, Model 2, Full Model]

Figure 9: Impact of different components. Incorporating the superpixel module (Model 1) effectively suppresses accumulated blur and artifacts during cascading, while depth conditioning (Model 2) further enhances edge sharpness. The full model produces natural results with consistent textures across patches.

Quantitative results in Table 4 and qualitative comparisons in Fig. 9 together validate the effectiveness of each component. Each added module improves LPIPS, MUSIQ, and NIQE (LPIPS falls from 0.585 to 0.450 and MUSIQ rises from 31.73 to 51.44 between the Base Model and the Full Model); PI also improves overall (5.91 to 5.24), despite a marginal increase for Model 1 (5.96). The Base Model suffers from distribution shifts during iterative refinement, producing blurred edges and artifacts in uniform regions (e.g., building facades). Introducing the superpixel segmentation module (Model 1) sharpens boundaries and restores fine textures, indicating that spatially coherent segmentation stabilizes feature distributions across iterations. Adding the depth-conditioning module (Model 2) further enhances geometric fidelity (petal structures become clearer and depth transitions smoother), though patch-level inconsistencies remain in repetitive patterns such as windows. Finally, the Full Model with the SARM achieves globally consistent and perceptually coherent reconstructions, with continuous textures, sharp edges, and artifact-free details. These results demonstrate that global semantic priors effectively align inter-patch features and promote stable high-fidelity generation.
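The per-module contributions can be read off Table 4 by differencing consecutive rows. The short sketch below does this arithmetic; the row-to-variant mapping assumes the table rows follow the order in which the variants are introduced (Base Model, Model 1, Model 2, Full Model), and the helper name `stepwise_gains` is purely illustrative.

```python
# Rows of Table 4 (x18 setting), assumed to be in the order the variants are introduced.
table4 = {
    "Base Model": {"LPIPS": 0.585, "MUSIQ": 31.73, "NIQE": 7.10, "PI": 5.91},
    "Model 1":    {"LPIPS": 0.471, "MUSIQ": 42.23, "NIQE": 6.61, "PI": 5.96},
    "Model 2":    {"LPIPS": 0.467, "MUSIQ": 45.18, "NIQE": 6.15, "PI": 5.37},
    "Full Model": {"LPIPS": 0.450, "MUSIQ": 51.44, "NIQE": 6.01, "PI": 5.24},
}
LOWER_IS_BETTER = {"LPIPS", "NIQE", "PI"}


def stepwise_gains(rows):
    """Gain of each variant over the previous one; positive always means 'better'."""
    names = list(rows)
    gains = {}
    for prev, cur in zip(names, names[1:]):
        gains[cur] = {}
        for metric, value in rows[cur].items():
            delta = value - rows[prev][metric]
            # Flip the sign for metrics where a lower score is better.
            signed = -delta if metric in LOWER_IS_BETTER else delta
            gains[cur][metric] = round(signed, 3)
    return gains


gains = stepwise_gains(table4)
for name, g in gains.items():
    print(name, g)
```

Under this reading, the superpixel module accounts for the largest single LPIPS and MUSIQ gain, and the only negative entry is the small PI change for Model 1.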

Table 5: Ablation study on loss functions, validating the effectiveness of each loss component.
$L_{corr}$ $L_{depth}$ ×18 (×4×3×1.5)
LPIPS↓ MUSIQ↑ NIQE↓ PI↓
0.462 49.33 6.71 5.91
0.459 50.23 6.24 5.40
0.450 51.44 6.01 5.24
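The ×18 setting used in these ablations is written as ×4×3×1.5, i.e. a cascade of three upsampling stages. The sketch below only illustrates that arithmetic, assuming the stages are applied in the listed order and that each stage's output feeds the next; the function name `cascade_sizes` and the 128-pixel input side are hypothetical choices for illustration, not values taken from the ablation setup.

```python
from math import prod


def cascade_sizes(lr_size, factors):
    """Return the image side length after each upsampling stage."""
    sizes = []
    size = lr_size
    for factor in factors:
        # Each stage upsamples the output of the previous stage.
        size = round(size * factor)
        sizes.append(size)
    return sizes


# Stage factors as written in the table headers: x18 = x4 * x3 * x1.5.
factors = (4, 3, 1.5)
total_scale = prod(factors)

# Hypothetical 128-pixel LR side length, chosen only for illustration.
sizes = cascade_sizes(128, factors)

print(total_scale)  # 18.0
print(sizes)        # [512, 1536, 2304]
```

Each individual factor stays small even though the cumulative scale is large, which is the property the cyclic formulation relies on.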

4.5.2 Superpixel Size Analysis

We conducted a comparative analysis over initial superpixel grid sizes ranging from 3×3 to 8×8. Table 6 and Fig. 10 summarize the effect of superpixel block size on super-resolution performance. As the superpixel size increases beyond 4×4, structural consistency degrades (LPIPS rises from 0.450 to 0.516), while the no-reference perceptual score MUSIQ keeps improving (from 51.44 to 64.29). This observation suggests that larger superpixels can more effectively suppress high-frequency artifacts accumulated during cyclic cascading, leading to a more stable feature distribution. However, the increased smoothing effect of large superpixels also tends to remove fine details, thereby reducing fidelity. In practice, we adopt a 4×4 superpixel size to achieve a balanced trade-off between fidelity and perceptual quality.

Table 6: Ablation study on superpixel size.
| SuperPixel Size (DIV8K) | LPIPS↓ | MUSIQ↑ | NIQE↓ | PI↓ |
|---|---|---|---|---|
| 3×3 | 0.513 | 35.11 | 7.67 | 6.81 |
| 4×4 | 0.450 | 51.44 | 6.01 | 5.24 |
| 5×5 | 0.481 | 53.19 | 5.33 | 4.42 |
| 8×8 | 0.516 | 64.29 | 6.29 | 4.89 |

[Figure 10 panels: Origin, 3×3, 4×4, 5×5, 8×8]

Figure 10: While superpixels effectively suppress degradation artifacts, excessively large superpixel sizes remove fine details and may even alter image content.
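The trade-off between artifact suppression and detail loss can be illustrated with a toy experiment. The sketch below is not the SDAM module: it swaps the learned superpixels for plain mean pooling over a regular k×k grid, and uses synthetic inputs (a ±1 checkerboard as a stand-in for high-frequency artifacts, an intensity ramp as a stand-in for fine detail). All names (`block_mean`, `mse`, `SIZE`) are illustrative assumptions.

```python
def block_mean(img, k):
    """Replace every k x k grid cell of a 2D list with the cell mean."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y0 in range(0, h, k):
        for x0 in range(0, w, k):
            ys = range(y0, min(y0 + k, h))
            xs = range(x0, min(x0 + k, w))
            mean = sum(img[y][x] for y in ys for x in xs) / (len(ys) * len(xs))
            for y in ys:
                for x in xs:
                    out[y][x] = mean
    return out


def mse(a, b):
    """Mean squared error between two equally sized 2D lists."""
    h, w = len(a), len(a[0])
    return sum((a[y][x] - b[y][x]) ** 2 for y in range(h) for x in range(w)) / (h * w)


SIZE = 120  # divisible by every grid size tested below
# Toy "detail": a horizontal intensity ramp. Toy "artifact": a +/-1 checkerboard.
ramp = [[float(x) for x in range(SIZE)] for _ in range(SIZE)]
checker = [[1.0 if (x + y) % 2 == 0 else -1.0 for x in range(SIZE)]
           for y in range(SIZE)]
zeros = [[0.0] * SIZE for _ in range(SIZE)]

detail_lost = {}
artifact_left = {}
for k in (3, 4, 5, 8):
    # Error introduced on the clean ramp: larger cells flatten more detail.
    detail_lost[k] = mse(block_mean(ramp, k), ramp)
    # Energy of the checkerboard that survives aggregation (1.0 before pooling).
    artifact_left[k] = mse(block_mean(checker, k), zeros)
    print(f"{k}x{k}: detail lost {detail_lost[k]:.3f}, "
          f"artifact left {artifact_left[k]:.4f}")
```

In this toy setting every grid size removes almost all of the checkerboard energy, while the error on the ramp grows steadily with the cell size. This mirrors the qualitative trend described above, although it says nothing about the perceptual metrics reported in Table 6.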

5 Conclusion

We demonstrate that ASISR becomes fundamentally more stable when ultra-magnification is modeled as a sequence of distribution-consistent transitions rather than a single extrapolation step. This shift reframes ASISR as a principled and inherently scalable paradigm, revealing that the key to extreme-resolution reconstruction lies not in enlarging models or datasets, but in understanding and regulating how representations evolve across scales. Beyond its empirical benefits, this distribution-aware cyclic perspective may open directions for future research. It provides a conceptual foundation for unified multi-scale generative models, progressive detail synthesis, and controllable magnification, and may extend naturally to video, 3D content, and cross-modal reconstruction. More broadly, the principles uncovered here suggest a pathway toward deeper theories of cross-scale representation learning and recursive generative enhancement.

References

  • [1] E. Agustsson and R. Timofte (2017) Ntire 2017 challenge on single image super-resolution: dataset and study. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 126–135. Cited by: §4.1.
  • [2] O. Bar-Tal, L. Yariv, Y. Lipman, and T. Dekel (2023) Multidiffusion: fusing diffusion paths for controlled image generation. URL https://arxiv.org/abs/2302.08113. Cited by: §1.
  • [3] Y. Blau, R. Mechrez, R. Timofte, T. Michaeli, and L. Zelnik-Manor (2018) The 2018 pirm challenge on perceptual image super-resolution. In Proceedings of the European conference on computer vision (ECCV) workshops. Cited by: §4.1.
  • [4] J. Cai, H. Zeng, H. Yong, Z. Cao, and L. Zhang (2019) Toward real-world single image super-resolution: a new benchmark and a new model. In ICCV, pp. 3086–3095. Cited by: §4.2.
  • [5] J. Cao, Q. Wang, Y. Xian, Y. Li, B. Ni, Z. Pi, K. Zhang, Y. Zhang, R. Timofte, and L. Van Gool (2023) Ciaosr: continuous implicit attention-in-attention network for arbitrary-scale image super-resolution. In CVPR, pp. 1796–1807. Cited by: §1, Table 1, Table 1, §4.4, Table 2.
  • [6] K. C. Chan, X. Wang, X. Xu, J. Gu, and C. C. Loy (2021) Glean: generative latent bank for large-factor image super-resolution. In CVPR, pp. 14245–14254. Cited by: §2.2.
  • [7] X. Chen, X. Wang, J. Zhou, Y. Qiao, and C. Dong (2023) Activating more pixels in image super-resolution transformer. In CVPR, pp. 22367–22377. Cited by: §4.1.
  • [8] Y. Chen, S. Liu, and X. Wang (2021) Learning continuous image representation with local implicit image function. In CVPR, pp. 8628–8638. Cited by: §1, §2.1, Table 1, Table 1, §4.4, Table 2.
  • [9] S. Gao, X. Liu, B. Zeng, S. Xu, Y. Li, X. Luo, J. Liu, X. Zhen, and B. Zhang (2023) Implicit diffusion models for continuous super-resolution. In CVPR, pp. 10021–10030. Cited by: §1, §2.1, Table 1, Table 1, §4.2, §4.4, Table 2, Table 3, Table 3.
  • [10] S. Gu, A. Lugmayr, M. Danelljan, M. Fritsche, J. Lamour, and R. Timofte (2019) Div8k: diverse 8k resolution image dataset. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 3512–3516. Cited by: §4.2.
  • [11] J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans (2022) Cascaded diffusion models for high fidelity image generation. Journal of Machine Learning Research 23 (47), pp. 1–33. Cited by: §2.2.
  • [12] X. Hu, H. Mu, X. Zhang, Z. Wang, T. Tan, and J. Sun (2019) Meta-sr: a magnification-arbitrary network for super-resolution. In CVPR, pp. 1575–1584. Cited by: §2.1.
  • [13] Á. B. Jiménez (2023) Mixture of diffusers for scene composition and high resolution image generation. arXiv preprint arXiv:2302.02412. Cited by: §1.
  • [14] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §4.2.
  • [15] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In CVPR, pp. 4401–4410. Cited by: §2.2.
  • [16] J. Ke, Q. Wang, Y. Wang, P. Milanfar, and F. Yang (2021) Musiq: multi-scale image quality transformer. In ICCV, pp. 5148–5157. Cited by: §4.1.
  • [17] J. Kim and T. Kim (2024) Arbitrary-scale image generation and upsampling using latent diffusion model and implicit neural decoder. In CVPR, pp. 9202–9211. Cited by: §2.1, Table 1, Table 1, §4.2, §4.4, Table 2, Table 3, Table 3.
  • [18] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W. Lo, et al. (2023) Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4015–4026. Cited by: §3.2.
  • [19] S. Menon, A. Damian, S. Hu, N. Ravi, and C. Rudin (2020) Pulse: self-supervised photo upsampling via latent space exploration of generative models. In CVPR, pp. 2437–2445. Cited by: §2.2.
  • [20] A. Mittal, R. Soundararajan, and A. C. Bovik (2012) Making a “completely blind” image quality analyzer. IEEE Signal processing letters 20 (3), pp. 209–212. Cited by: §4.1.
  • [21] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever (2021) Zero-shot text-to-image generation. In International conference on machine learning, pp. 8821–8831. Cited by: §2.2.
  • [22] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans, et al. (2022) Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems 35, pp. 36479–36494. Cited by: §2.2.
  • [23] C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi (2022) Image super-resolution via iterative refinement. IEEE transactions on pattern analysis and machine intelligence 45 (4), pp. 4713–4726. Cited by: §2.2.
  • [24] T. R. Shaham, T. Dekel, and T. Michaeli (2019) Singan: learning a generative model from a single natural image. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4570–4580. Cited by: §1.
  • [25] R. Timofte, E. Agustsson, L. Van Gool, M. Yang, and L. Zhang (2017) Ntire 2017 challenge on single image super-resolution: methods and results. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 114–125. Cited by: §4.1.
  • [26] L. Tsao, Y. Lo, C. Chang, H. Chen, R. Tseng, C. Feng, and C. Lee (2024) Boosting flow-based generative super-resolution models via learned prior. In CVPR, pp. 26005–26015. Cited by: §1, §2.1, Table 1, Table 1, §4.4, Table 2.
  • [27] X. Wang, X. Chen, B. Ni, H. Wang, Z. Tong, and Y. Liu (2023) Deep arbitrary-scale image super-resolution via scale-equivariance pursuit. In CVPR, pp. 1786–1795. Cited by: §4.1.
  • [28] F. Yang, Q. Sun, H. Jin, and Z. Zhou (2020) Superpixel segmentation with fully convolutional networks. In CVPR, pp. 13964–13973. Cited by: §3.1.
  • [29] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao (2024) Depth anything: unleashing the power of large-scale unlabeled data. In CVPR, pp. 10371–10381. Cited by: §3.1, §3.3.
  • [30] Z. Yang, H. Jiang, W. Hong, J. Teng, W. Zheng, Y. Dong, M. Ding, and J. Tang (2024) Inf-dit: upsampling any-resolution image with memory-efficient diffusion transformer. In European Conference on Computer Vision, pp. 141–156. Cited by: §3.2.
  • [31] J. Yao, L. Tsao, Y. Lo, R. Tseng, C. Chang, and C. Lee (2023) Local implicit normalizing flow for arbitrary-scale image super-resolution. In CVPR, pp. 1776–1785. Cited by: §1, §2.1, Table 1, Table 1, §4.4, Table 2.
  • [32] L. Zhang, A. Rao, and M. Agrawala (2023) Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 3836–3847. Cited by: §3.3.
  • [33] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 586–595. Cited by: §4.1.
  • [34] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu (2018) Image super-resolution using very deep residual channel attention networks. In Proceedings of the European conference on computer vision (ECCV), pp. 286–301. Cited by: §4.1.