WeatherCity: Urban Scene Reconstruction with Controllable
Multi-Weather Transformation

Wenhua Wu^∗ Huai Guan^∗ Zhe Liu Hesheng Wang^†
Shanghai Jiao Tong University

Abstract

Editable high-fidelity 4D scenes are crucial for autonomous driving, as they can be applied to end-to-end training and closed-loop simulation. However, existing reconstruction methods are primarily limited to replicating observed scenes and lack the capability for diverse weather simulation. Meanwhile, image-level weather editing methods tend to introduce scene artifacts and offer poor controllability over weather effects. To address these limitations, we propose WeatherCity, a novel framework for 4D urban scene reconstruction and weather editing. Specifically, we leverage a text-guided image editing model to achieve flexible editing of image weather backgrounds. To tackle the challenge of multi-weather modeling, we introduce a novel weather Gaussian representation based on shared scene features and dedicated weather-specific decoders. This representation is further enhanced with a content consistency optimization, ensuring coherent modeling across different weather conditions. Additionally, we design a physics-driven model that simulates dynamic weather effects through particles and motion patterns. Extensive experiments on multiple datasets and various scenes demonstrate that WeatherCity achieves flexible controllability, high fidelity, and temporal consistency in 4D reconstruction and weather editing. Our framework not only enables fine-grained control over weather conditions (e.g., light rain and heavy snow) but also supports object-level manipulation within the scene.

Abstract

In the supplementary material, we present additional implementation details (Sec. 6), including training (Sec. 6.1), baselines (Sec. 6.3), and evaluation (Sec. 6.4). We also provide further experimental results and analysis (Sec. 7), including detailed quantitative and qualitative results (Sec. 7.1), a temporal consistency comparison (Sec. 7.2), and 3D baseline comparisons (Sec. 7.3).

Figure 1: We present WeatherCity, a novel framework for dynamic urban scene reconstruction and controllable weather editing. Given a sequence of raw images, WeatherCity seamlessly integrates 4D reconstruction with flexible weather manipulation, producing highly consistent, photorealistic, and versatile multi-weather rendering results.

^∗ The first two authors contributed equally to this paper. ^† Corresponding author.

1 Introduction

High-fidelity 4D simulation is crucial for autonomous driving, as it can not only provide diverse training samples covering edge cases like extreme weather but also construct reproducible virtual testing environments for closed-loop evaluation [6]. However, significant challenges persist across the "reconstruction-editing-simulation" pipeline for dynamic scenes, severely limiting the robustness of autonomous systems in complex environments.

First, existing 4D scene reconstruction methods struggle to overcome the limitation of observation dependency. While emerging techniques like Neural Radiance Fields (NeRF) [24] and 3D Gaussian Splatting (3DGS) [18] achieve high-fidelity geometric and photometric reconstruction, they can only reproduce the weather conditions present during data capture, failing to simulate challenging scenarios such as rain, snow, or fog. Subsequent methods for urban scenes, such as Street Gaussians [38], which models vehicle motion through dynamic appearance models, and OmniRe [4], which introduces a multi-node structure to support modeling of pedestrians and deformable objects, primarily focus on "object-level adjustments" (e.g., modifying vehicle positions or counts) without addressing the crucial environmental dimension of weather.

Second, image-level weather editing techniques fail to meet the consistency requirements essential for 4D scene generation. Early approaches, primarily based on GAN architectures [15], required training specialized models for each weather condition, resulting in limited editing diversity and poor generalization [43]. Recent methods have turned to diffusion models [16, 42], which allow flexible weather control via text prompts. However, they often introduce content hallucinations, such as altering the position of road markings or distorting buildings. Furthermore, they offer limited capability for fine-grained control over weather intensity parameters, such as the strength of rain or snow or the density of fog. Other recent methods have attempted weather editing based on 3D representations. DerainNeRF [21] and WeatherGS [27] utilize NeRF and 3D Gaussians to reconstruct and edit rainy scenes, but their functionality is limited to removing raindrops. ClimateNeRF [20] can simulate static weather effects such as snow cover and flooding, but it cannot generate dynamic weather phenomena such as falling rain or snow.

To address these challenges, we present WeatherCity, an editable high-fidelity 4D urban scene reconstruction and weather editing method. To achieve flexible scene editing, we employ a text-guided image editing model for weather image synthesis, which enables flexible generation of target weather conditions. To ensure scene consistency, we propose a Weather Gaussian representation with shared features and multi-weather decoders that disentangle the intrinsic structural and textural features of the scene from weather-specific appearance attributes. This allows consistent scene structure across varying weather conditions while effectively modeling diverse weather effects. Furthermore, we introduce a content consistency loss to further enhance structural coherence. For simulating and controlling dynamic weather effects, we design a physics-driven dynamic weather simulation system. For rain and snow, we develop a variety of weather particles along with corresponding motion models. For fog, we implement depth-aware fog rendering based on the Beer-Lambert law. Our method achieves realistic dynamic weather simulation with fine-grained control over weather intensity, such as the amount of rain or snow and the density of fog.

The main contributions of this work are summarized as follows:

• We propose a unified framework supporting integrated 4D Reconstruction - Weather Editing - Dynamic Simulation, effectively elevating 2D image editing to 4D simulation and enabling the generation of multi-weather, highly consistent 4D scenes for autonomous driving applications.
• We introduce a weather Gaussian representation with shared features and multi-weather decoders, which disentangles scene geometry from weather-related appearance. This ensures structural consistency across different weather conditions and facilitates efficient switching and editing of multi-weather scenes.
• We construct a physics-driven dynamic weather simulation system, designing weather effects based on weather particles for rain and snow and on optical principles for fog, thereby achieving dynamic weather simulation with both visual realism and physical consistency while enabling precise control.

2 Related Work

2.1 Urban Scene Reconstruction

The field of 3D scene reconstruction has evolved significantly from traditional geometric methods [26, 33] to modern neural representations [24, 18, 25, 45, 38, 4, 35, 36, 3, 7]. The introduction of Neural Radiance Fields (NeRF) [24] and 3D Gaussian Splatting [18] marked a significant leap forward, enabling dense and photorealistic reconstruction of complex driving environments. Neural Scene Graphs (NSG) [25] proposed a graph-based representation that separately models static backgrounds and dynamic objects using dedicated radiance fields. EmerNeRF [39] further advanced this line by introducing a self-supervised framework that automatically decomposes scenes into static and dynamic components. DrivingGaussian [45] developed an incremental static Gaussian representation combined with dynamic Gaussian graphs to enable object-level scene editing, while Street Gaussians [38] achieved more accurate dynamic reconstruction through object pose optimization and 4D spherical harmonics. OmniRe [4] further pushed the boundaries by introducing deformable nodes to handle non-rigid dynamic objects like pedestrians. Despite these advances, these methods inherently bake in the weather conditions present during data capture, lacking the ability to disentangle and control meteorological factors.

2.2 Image-level Weather Editing

Weather editing and enhancement in 2D images represent a long-standing challenge in computer vision. With the development of deep learning, significant progress has been made [43, 42, 9, 29, 1, 14, 10]. ClimateGAN [28] introduced a method for realistic flood simulation on real-world images. WeatherGAN [19] built upon StarGAN v2 [5] and proposed a weather feature-guided approach for multi-domain translation. TPSeNCE [43] further improved weather editing consistency by introducing a triangular probability similarity constraint. The emergence of diffusion models has substantially advanced image generation capabilities, enabling more flexible and controllable editing. For instance, ControlNet [42] incorporated spatial conditioning into text-to-image diffusion models, facilitating various conditional image edits. InstructPix2Pix [1] combined language instructions with diffusion models to enable instruction-based image editing, leveraging large-scale generated data for training. TurboEdit [9] reduced editing artifacts via shifted noise scheduling and novel guidance techniques. Despite these improvements in image quality, challenges such as artifacts and inconsistencies persist. Moreover, these methods generally lack fine-grained control over weather intensity and effects, limiting their applicability in scenarios requiring precise meteorological modeling.

2.3 3D-level Weather Editing

Recent research has begun to explore weather editing directly in 3D scene representations. ClimateNeRF [20] integrates physical simulation with NeRF to achieve editing of various climate effects, although it is limited to static weather phenomena and cannot simulate dynamic conditions. WeatherGS [27] focuses on mitigating weather artifacts, restoring clear scenes from adverse weather inputs. StyleGaussian [22] and SGSST [13] adapt 3D Gaussians for artistic style transfer using reference images, yet their frameworks are not tailored for realistic weather editing. RainyGS [8] combines physical simulation with Gaussian splatting to realistically simulate rainy conditions. Similarly, Fiebelman et al. [11] propose a Gaussian-particle hybrid representation for dynamic weather effects, but their method primarily focuses on foreground weather particles and lacks synchronized background weather editing, limiting the overall realism. In contrast, our approach enables flexible and synchronized control over both background ambiance and foreground weather elements, achieving highly consistent and photorealistic editing results.

3 Method

Figure 2: Overview of WeatherCity. Our framework comprises four main modules. First, the image editing module employs a text-guided image editing foundation model to adapt the weather background of the input images. Second, the scene representation module introduces a weather-aware Gaussian representation based on shared features and multi-weather decoders, which disentangles geometric-textural attributes from weather-specific appearances, thereby ensuring structural consistency across varying meteorological conditions. Third, we construct RGB and content losses for consistency optimization. Finally, a physics-driven dynamic weather simulation mechanism is designed to achieve flexible and controllable editing of diverse dynamic weather effects.

Given a sequence of captured raw scene data, WeatherCity achieves joint 4D dynamic reconstruction and flexible, controllable weather editing. Our framework, illustrated in Fig. 2, comprises four core components. First, the image editing module leverages a text-guided image editing foundation model to flexibly alter image weather backgrounds while preserving the original scene content (Sec. 3.1). Second, the scene representation module introduces a novel Weather Gaussian Representation, which builds upon shared scene features and weather-specific decoders to disentangle geometric and textural attributes from weather appearances, thereby ensuring structural consistency across different conditions (Sec. 3.2). Third, consistency is further optimized by employing RGB and content losses between the rendered and edited multi-weather images (Sec. 3.3). Finally, a physics-driven dynamic weather simulation module is designed, which utilizes particle systems to simulate rain and snow and applies the Beer-Lambert law to model fog, enabling controllable editing of various dynamic weather effects (Sec. 3.4).

3.1 Text-Guided Image Weather Background Editing

The first stage of our pipeline involves editing the input images $\{I_{t}^{raw}|I_{t}^{raw}\in\mathbb{R}^{H\times W\times 3},t=1,...,N\}$ to exhibit various target weather conditions. These edited sequences serve as crucial supervision signals for the subsequent 3D reconstruction and editing tasks. This requires the edited images to achieve both realistic weather effects and consistency with the original scene content.

To this end, we leverage Qwen-Image [34], a powerful text-guided image editing model. For each desired weather condition, we design a corresponding text prompt. Our prompts are crafted not only to describe the target weather effect (e.g., "a rainy city street") but also to explicitly emphasize the strict preservation of the original scene content. Benefiting from these carefully designed prompts and the robust image editing capabilities of Qwen-Image [34], we obtain highly realistic multi-weather image sequences $\{I_{t}^{w}|w\in\mathcal{W},t=1,...,N\}$, where $\mathcal{W}$ is the set of target weather conditions.

3.2 Weather Gaussian with Shared Feature and Multi-Weather Decoders

Following OmniRe [4], we employ a dynamic Gaussian graph to structure the scene, enabling flexible modeling and control of movable objects. Our scene graph $\mathcal{G}=\{\mathcal{N},\mathcal{E}\}$ comprises the following nodes $\mathcal{N}$ :

• Sky Node $\mathcal{N}_{sky}$: Representing the distant sky via an optimizable environment texture map.
• Background Node $\mathcal{N}_{bg}$: Modeling the static background composed of 3D Gaussians.
• Rigid Nodes $\mathcal{N}_{rigid}$: Representing movable rigid objects (primarily vehicles), each composed of 3D Gaussians.
• Non-Rigid Nodes $\mathcal{N}_{nonrigid}$: Accounting for deformable objects (primarily pedestrians), each composed of deformable 3D Gaussians.

To achieve consistent scene reconstruction and diverse weather editing, we design a novel Weather Gaussian Representation that disentangles inherent scene geometry from weather-dependent appearance.

Each Gaussian primitive $G_{i}$ in nodes $\{\mathcal{N}_{bg},\mathcal{N}_{rigid},\mathcal{N}_{nonrigid}\}$ is parameterized as follows:

$$G_{i}=\{\mu_{i},s_{i},r_{i},o_{i},f_{i}\}, \tag{1}$$

where:

• $\mu_{i}\in\mathbb{R}^{3}$ denotes the 3D center position.
• $s_{i}\in\mathbb{R}^{3}$ represents the scale factors along three axes.
• $r_{i}\in\mathbb{R}^{4}$ is the rotation quaternion.
• $o_{i}\in[0,1]$ is the opacity value.
• $f_{i}\in\mathbb{R}^{d}$ is a shared appearance feature encoding the intrinsic texture and material properties.

For each weather condition $w\in\mathcal{W}$ , the corresponding Gaussian color $c_{i}^{w}$ is decoded by a weather-specific MLP $\phi_{w}$ :

$$c_{i}^{w}=\phi_{w}(f_{i}). \tag{2}$$
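Eq. 2 can be illustrated with a minimal sketch in plain Python; it is not the authors' implementation. It follows the configuration reported in the implementation details (32-dimensional shared features, two linear layers with ReLU and a Sigmoid output). The hidden width of 64, the random weights, the omission of bias terms, and the weather names (including a separate decoder for the raw condition) are assumptions made purely for illustration.

```python
import math
import random

FEATURE_DIM = 32   # shared feature size reported in the implementation details
HIDDEN_DIM = 64    # assumed hidden width (not stated in the text)

def make_decoder(rng):
    """Build one weather-specific decoder phi_w: feature -> RGB in (0, 1)."""
    w1 = [[rng.gauss(0.0, 0.1) for _ in range(FEATURE_DIM)] for _ in range(HIDDEN_DIM)]
    w2 = [[rng.gauss(0.0, 0.1) for _ in range(HIDDEN_DIM)] for _ in range(3)]

    def decode(f):
        hidden = [max(0.0, sum(w * x for w, x in zip(row, f))) for row in w1]  # ReLU
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
        return [1.0 / (1.0 + math.exp(-z)) for z in logits]                    # Sigmoid

    return decode

rng = random.Random(0)
weathers = ("raw", "rain", "snow", "fog")   # assumed condition names
decoders = {w: make_decoder(rng) for w in weathers}

# One shared feature f_i is decoded into a different color per weather:
# c_i^w = phi_w(f_i). Geometry (mu, s, r, o) is untouched by the decoder.
f_i = [rng.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)]
colors = {w: decoders[w](f_i) for w in weathers}
```

Because only the color depends on $w$, switching weather in this sketch amounts to swapping the decoder while every Gaussian keeps its position, scale, rotation, and opacity.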

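Subsequently, the shared-feature Gaussians are transformed into multi-weather Gaussians $G_{i}^{w}=\{\mu_{i},\Sigma_{i},o_{i},c_{i}^{w}\}$ through our dedicated weather-specific decoders. The covariance matrix $\Sigma_{i}\in\mathbb{R}^{3\times 3}$ is $\Sigma_{i}=R_{i}S_{i}S_{i}^{\top}R_{i}^{\top}$, where $R_{i}$ is the rotation matrix given by the quaternion $r_{i}$ and $S_{i}$ is the diagonal scale matrix given by $s_{i}$. The spatial influence of the Gaussian at a point $x$ is given by:

$$g_{i}(x)=e^{-\frac{1}{2}(x-\mu_{i})^{\top}\Sigma_{i}^{-1}(x-\mu_{i})}. \tag{3}$$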
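As a worked illustration of the covariance construction and Eq. 3, the following sketch evaluates the Gaussian influence in plain Python. It assumes a $(w, x, y, z)$ quaternion ordering, which the text does not specify, and exploits the fact that $\Sigma^{-1}=RS^{-2}R^{\top}$, so the offset can be rotated into the local frame and divided by the scales instead of inverting a matrix.

```python
import math

def quat_to_rot(q):
    """Rotation matrix from a quaternion in (w, x, y, z) order (assumed convention)."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def influence(x, mu, s, q):
    """g(x) = exp(-0.5 (x - mu)^T Sigma^-1 (x - mu)) with Sigma = R S S^T R^T."""
    R = quat_to_rot(q)
    d = [a - b for a, b in zip(x, mu)]
    # Local-frame offset R^T d, then scale-normalize each axis.
    local = [sum(R[k][j] * d[k] for k in range(3)) for j in range(3)]
    return math.exp(-0.5 * sum((l / sj) ** 2 for l, sj in zip(local, s)))

# A Gaussian stretched along its local x axis, rotated 90 degrees about z,
# so that its long axis points along world y.
q_z90 = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
value = influence([0.0, 2.0, 0.0], [0.0, 0.0, 0.0], [2.0, 1.0, 1.0], q_z90)
```

Here `value` corresponds to a point one standard deviation from the center along the long axis.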

During rendering, each 3D Gaussian is first projected into the camera coordinate frame. The corresponding 2D covariance matrix $\Sigma^{\prime}$ in the image plane is derived as:

$$\Sigma^{\prime}=JW\Sigma W^{\top}J^{\top}, \tag{4}$$

where $J$ denotes the Jacobian of the projective transformation and $W$ represents the view transformation matrix. For each pixel, the contributing Gaussians are sorted by depth and rendered via alpha blending:

$$\hat{I}_{t}^{w}=\sum_{i}c_{i}^{w}\alpha_{i}\prod_{j=1}^{i-1}(1-\alpha_{j}),\quad\hat{D}_{t}=\sum_{i}d_{i}\alpha_{i}\prod_{j=1}^{i-1}(1-\alpha_{j}), \tag{5}$$

where $\alpha_{i}$ is the computed opacity of the $i$-th Gaussian after sorting and $d_{i}$ is the depth of the $i$-th Gaussian.

This design forces the shared features to capture the intrinsic textural properties of the scene, while the separate decoders learn to model the photometric impact of specific weather conditions. Consequently, our representation ensures structural consistency across different weathers on one hand, and enables distinct environmental photometric modeling on the other.
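The per-pixel compositing of Eq. 5 can be sketched as a front-to-back loop. This is an illustrative single-pixel version in plain Python, assuming the per-Gaussian opacities $\alpha_{i}$ have already been evaluated at the pixel; a real rasterizer performs this per tile on the GPU.

```python
def composite(colors, alphas, depths):
    """Front-to-back alpha blending of color and depth for one pixel (Eq. 5)."""
    order = sorted(range(len(depths)), key=lambda i: depths[i])  # nearest first
    transmittance = 1.0            # running product of (1 - alpha_j)
    pixel = [0.0, 0.0, 0.0]
    depth = 0.0
    for i in order:
        weight = alphas[i] * transmittance
        pixel = [p + weight * c for p, c in zip(pixel, colors[i])]
        depth += weight * depths[i]
        transmittance *= 1.0 - alphas[i]
    return pixel, depth, transmittance

# A half-transparent red Gaussian at depth 2 in front of an opaque blue one at depth 4,
# deliberately passed in unsorted order.
pixel, depth, remaining = composite(
    colors=[[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]],
    alphas=[1.0, 0.5],
    depths=[4.0, 2.0],
)
```

Since the weather decoders only change `colors`, the blending weights, and therefore the rendered depth, are identical across weather conditions in this formulation.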

3.3 Multi-Weather Consistency Optimization

We jointly optimize our Weather Gaussian Representation using a composite loss function that aligns the rendered scenes with both the original and the edited multi-weather images.

RGB Loss. We render the scene under both the original clear weather and the edited weather conditions $\mathcal{W}$, and compute the RGB loss against the corresponding ground-truth and edited images. The RGB loss is defined as:

$$\mathcal{L}_{rgb}=\sum_{t=1}^{N}\sum_{w\in\mathcal{W}\cup\{raw\}}\left[(1-\lambda)\left\|\hat{I}_{t}^{w}-I_{t}^{w}\right\|_{1}+\lambda\left(1-{\rm SSIM}(\hat{I}_{t}^{w},I_{t}^{w})\right)\right]. \tag{6}$$
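To make the structure of Eq. 6 concrete, the sketch below evaluates one $(t, w)$ term in plain Python. It is a simplification under stated assumptions: SSIM is computed from global image statistics over a single window rather than the usual sliding Gaussian window, the L1 term is averaged over pixels, images are flattened grayscale lists in $[0, 1]$, and the stabilizing constants follow the common $0.01^{2}$ and $0.03^{2}$ choice. The weight `lam=0.2` matches the SSIM weight reported in the implementation details.

```python
def mean(values):
    return sum(values) / len(values)

def global_ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM from global statistics (a simplification of windowed SSIM)."""
    mu_a, mu_b = mean(a), mean(b)
    var_a = mean([(x - mu_a) ** 2 for x in a])
    var_b = mean([(y - mu_b) ** 2 for y in b])
    cov = mean([(x - mu_a) * (y - mu_b) for x, y in zip(a, b)])
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator

def rgb_loss_term(rendered, target, lam=0.2):
    """(1 - lam) * L1 + lam * (1 - SSIM) for one frame and one weather condition."""
    l1 = mean([abs(x - y) for x, y in zip(rendered, target)])
    return (1 - lam) * l1 + lam * (1 - global_ssim(rendered, target))

target = [0.1, 0.5, 0.9, 0.3]
loss_same = rgb_loss_term(target, target)
loss_diff = rgb_loss_term([0.9, 0.1, 0.2, 0.8], target)
```

The full loss would sum this term over all frames and over the raw condition plus every edited weather.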

Content Consistency Loss. To further enforce semantic and structural consistency across different weather conditions, we introduce a content consistency loss. Specifically, we employ a pre-trained VGG network [30] $\Phi$ to extract content features from both the rendered images and the original weather images. The content consistency loss is computed as the L1 distance between the features of the rendered image under weather condition $w$ and those of the original clear-weather image:

$$\mathcal{L}_{cc}=\sum_{t=1}^{N}\sum_{w\in\mathcal{W}}\left\|\Phi(\hat{I}_{t}^{w})-\Phi(I_{t}^{raw})\right\|_{1}. \tag{7}$$

Applying 2D image editing models on a frame-by-frame basis may lead to temporal flickering and geometric inconsistencies. This loss corrects regions where the 2D editing results distort the scene content, preventing inconsistent 2D edits from compromising scene coherence. It thereby ensures that modifying the weather attribute $w$ does not alter the underlying scene content.

Depth Loss. To supervise the geometric information of the scene, we compute an L1 loss between the rendered depth and the sparse depth map $D_{t}$ obtained from LiDAR projection:

$$\mathcal{L}_{depth}=\sum_{t=1}^{N}\left\|\hat{D}_{t}-D_{t}\right\|_{1}. \tag{8}$$
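Because the LiDAR depth map is sparse, the L1 term in Eq. 8 is only meaningful at pixels that received a LiDAR return. A minimal sketch for one frame follows; it assumes that a depth of zero marks pixels without a return and that the loss is averaged over valid pixels, neither of which is specified in the text.

```python
def depth_loss(rendered, lidar):
    """Mean absolute depth error over pixels with a LiDAR return (assumed: 0 = no return)."""
    valid = [(r, d) for r, d in zip(rendered, lidar) if d > 0.0]
    if not valid:
        return 0.0
    return sum(abs(r - d) for r, d in valid) / len(valid)

rendered_depth = [5.0, 10.0, 20.0, 7.0]
lidar_depth = [0.0, 12.0, 0.0, 6.5]   # only two pixels carry LiDAR depth
loss = depth_loss(rendered_depth, lidar_depth)
```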

The complete optimization loss is a weighted combination of all loss terms:

$$\mathcal{L}_{total}=\mathcal{L}_{rgb}+\lambda_{cc}\mathcal{L}_{cc}+\lambda_{depth}\mathcal{L}_{depth}+\lambda_{opacity}\mathcal{L}_{opacity}+\mathcal{L}_{reg}, \tag{9}$$

where $\lambda_{cc}$ , $\lambda_{depth}$ , and $\lambda_{opacity}$ are balancing weights for the respective loss components. $\mathcal{L}_{opacity}$ is the opacity loss, which ensures the Gaussian opacities align with the non-sky mask, and $\mathcal{L}_{reg}$ is the regularization loss. More details are available in the Supplementary Material.

3.4 Physics-Driven Dynamic Weather Simulation

To achieve visually realistic, physically consistent, and finely controllable dynamic weather simulation, we develop a physics-driven dynamic weather simulation system.

Weather Particle Modeling. We model rain and snow particles using Gaussian ellipsoids: raindrops are represented by a single elongated Gaussian to capture their vertically stretched characteristics and motion blur effects, while snowflakes are approximated by three concentric Gaussian ellipsoids with identical scales arranged at 60-degree angles to form a basic crystal shape. The scale parameters of the Gaussians control the size of the weather particles, while the rotation parameters govern their orientation. Furthermore, the spatial distribution of weather particles is key to controlling weather intensity. To this end, we construct a spatial bounding volume based on the reconstructed scene. Within this volume, particle attributes, including position $\mu_{i}$ , rotation $r_{i}$ , and opacity $o_{i}$ , are initialized according to varying densities to achieve natural visual variations.
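The particle initialization described above can be sketched as follows. This is a hedged illustration in plain Python: the interpretation of density as particles per unit volume, the opacity range, the axis-aligned bounding box, and the choice of rotating snowflake ellipsoids about the z axis are all assumptions, and `init_particles` and `snowflake_rotations` are hypothetical helper names.

```python
import math
import random

def init_particles(bounds, density, rng):
    """Sample particle centers and opacities uniformly inside an axis-aligned volume."""
    (x0, x1), (y0, y1), (z0, z1) = bounds
    volume = (x1 - x0) * (y1 - y0) * (z1 - z0)
    count = round(density * volume)   # higher density -> heavier rain or snow
    return [
        {
            "mu": [rng.uniform(x0, x1), rng.uniform(y0, y1), rng.uniform(z0, z1)],
            "opacity": rng.uniform(0.3, 0.8),   # assumed range for visual variation
        }
        for _ in range(count)
    ]

def snowflake_rotations():
    """Quaternions (w, x, y, z) for three ellipsoids at 0, 60, and 120 degrees about z."""
    return [
        (math.cos(math.radians(a) / 2), 0.0, 0.0, math.sin(math.radians(a) / 2))
        for a in (0.0, 60.0, 120.0)
    ]

rng = random.Random(0)
bounds = ((0.0, 10.0), (0.0, 10.0), (0.0, 5.0))
light = init_particles(bounds, density=0.2, rng=rng)
heavy = init_particles(bounds, density=1.0, rng=rng)
flake = snowflake_rotations()
```

Under this sketch, changing weather intensity reduces to changing a single `density` value, while each snowflake shares one center and scale across its three rotated ellipsoids.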

Motion Control. Each particle moves according to a velocity vector $\mathbf{v}$ updated per frame based on physics-driven parameters. The velocity equations for raindrop and snowflake particles are given as follows:

$$\mathbf{v}_{\text{rain}}=\mathbf{v}_{\text{fall}}+\mathbf{v}_{\text{wind}},\quad\mathbf{v}_{\text{snow}}=\mathbf{v}_{\text{fall}}+\mathbf{v}_{\text{wind}}+\mathbf{v}_{\text{turb}}, \tag{10}$$

where $\mathbf{v}_{\text{fall}}$ is a constant downward velocity and $\mathbf{v}_{\text{wind}}$ is a global wind vector parameterized by magnitude $v_{\text{mag}}$, tilt angle $\theta_{\text{wind}}$, and azimuth $\phi_{\text{wind}}$. The elongated raindrop Gaussian is dynamically aligned with $\mathbf{v}$ to reflect wind-influenced trajectories. $\mathbf{v}_{\text{turb}}$ is a stochastic turbulence component applied only to snowflakes, which induces their characteristic fluttering and non-linear descent.
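A minimal sketch of the per-frame update implied by Eq. 10 is given below. The text does not define how tilt and azimuth map to a vector, so this example assumes a z-up frame with the tilt measured from the horizontal plane and the azimuth measured within it; the Gaussian turbulence model and all numeric values are likewise illustrative assumptions.

```python
import math
import random

def wind_vector(v_mag, theta, phi):
    """Wind from magnitude, tilt (from the horizontal plane), and azimuth; z is up (assumed)."""
    return [
        v_mag * math.cos(theta) * math.cos(phi),
        v_mag * math.cos(theta) * math.sin(phi),
        v_mag * math.sin(theta),
    ]

def step(position, dt, v_fall, v_wind, turb_std=0.0, rng=None):
    """Advance one particle by one frame; turbulence is added only when turb_std > 0."""
    v = [f + w for f, w in zip(v_fall, v_wind)]
    if turb_std > 0.0:
        v = [c + rng.gauss(0.0, turb_std) for c in v]   # v_turb for snowflakes
    new_position = [p + dt * c for p, c in zip(position, v)]
    return new_position, v

v_fall = [0.0, 0.0, -9.0]                       # illustrative fall speed
v_wind = wind_vector(2.0, theta=0.0, phi=0.0)   # horizontal wind along +x

rain_pos, rain_v = step([0.0, 0.0, 10.0], 0.1, v_fall, v_wind)
snow_pos, snow_v = step([0.0, 0.0, 10.0], 0.1, [0.0, 0.0, -1.0], v_wind,
                        turb_std=0.3, rng=random.Random(0))
```

The returned velocity `rain_v` is the direction along which an elongated raindrop Gaussian would be oriented.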

Unified Weather Rendering. A core advantage of our method lies in its unified rendering pipeline. Instead of being rendered through a separate pass, the dynamically generated weather Gaussians for rain and snow are directly integrated into the dynamic Gaussian scene graph as weather nodes $\mathcal{N}_{rain}$ and $\mathcal{N}_{snow}$, thereby facilitating subsequent scene editing and manipulation. All Gaussians (both scene and weather particles) are rasterized together using the standard Gaussian Splatting process. This unified formulation naturally ensures correct occlusion, composition, and blending, resulting in seamless and physically consistent integration of weather effects.

Since fog is uniformly diffused throughout the space, we implement depth-aware fog simulation based on the Beer-Lambert law [32]. The final fog-affected color $c_{render}^{fog}$ is obtained by blending the rendered pixel color $c_{render}$ with the global fog color $c_{fog}$ :

$$c_{render}^{fog}=f\,c_{render}+(1-f)\,c_{fog}, \tag{11}$$

where $f=e^{-d_{f}d_{render}}$ denotes the transmittance, $d_{f}$ is the fog density parameter, and $d_{render}$ represents the depth value obtained from Equation 5. By adjusting parameters $c_{fog}$ and $d_{f}$ , we achieve realistic rendering of fog effects with varying density and colors.
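Eq. 11 reduces to a few lines of code per pixel. The sketch below uses the fog color and density reported in the implementation details as defaults; the depth values and their units are illustrative assumptions.

```python
import math

def apply_fog(c_render, d_render, c_fog=(0.80, 0.80, 0.85), d_f=0.2):
    """Blend a rendered pixel with the fog color using Beer-Lambert transmittance."""
    f = math.exp(-d_f * d_render)   # transmittance decays with depth
    return [f * c + (1.0 - f) * g for c, g in zip(c_render, c_fog)]

pixel = [0.2, 0.4, 0.6]
near = apply_fog(pixel, d_render=0.0)      # no attenuation at zero depth
mid = apply_fog(pixel, d_render=5.0)       # partially fogged
far = apply_fog(pixel, d_render=1000.0)    # converges to the fog color
```

Raising `d_f` thickens the fog at every depth, while `c_fog` shifts its tint, which are the two controls the text describes.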

4 Experiment

We conduct extensive experiments to comprehensively evaluate the effectiveness and superiority of WeatherCity.

Table 1: Quantitative comparison on Waymo Open Dataset and nuScenes Dataset. $\uparrow$ means higher is better.

Method	Waymo CLIP-S $\uparrow$	Waymo CLIP-DS $\uparrow$	Waymo Sem-CS $\uparrow$	nuScenes CLIP-S $\uparrow$	nuScenes CLIP-DS $\uparrow$	nuScenes Sem-CS $\uparrow$
ControlNet [42]	0.634	0.238	0.695	0.656	0.228	0.811
TurboEdit [9]	0.830	0.220	0.801	0.782	0.250	0.829
FRESCO [40]	0.720	0.213	0.824	0.710	0.224	0.855
Qwen-Image [34]	0.785	0.279	0.843	0.804	0.279	0.902
WeatherCity (Ours)	0.872	0.303	0.915	0.870	0.302	0.968

4.1 Experimental Setting

Dataset. We evaluate our method on two prominent autonomous driving benchmarks: the Waymo Open Dataset [31] and the nuScenes dataset [2]. Both datasets provide diverse driving scenarios with multi-sensor data including multi-view images and LiDAR point clouds. For quantitative evaluation, we select five representative scenes rich in dynamic objects, each comprising 30 consecutive frames. The image resolutions are $1920\times 1080$ for Waymo and $1600\times 900$ for nuScenes.

Evaluation Metrics. To assess weather editing quality, we employ the following metrics:

• CLIP-Score (CLIP-S): Measures content preservation by computing the cosine similarity between CLIP image embeddings of edited and original images [17].
• CLIP Directional Similarity (CLIP-DS): Evaluates alignment between edited images and target text prompts in the CLIP embedding space [12].
• Semantic Consistency Score (Sem-CS): Measures the semantic consistency between edited and original images using a frequency-weighted Intersection over Union computed from a semantic segmentation model [23].
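Two of the computations behind these metrics can be sketched in plain Python: the cosine similarity used for CLIP-S and a frequency-weighted IoU of the kind used for Sem-CS. The embeddings and label maps below are toy stand-ins; in practice they would come from a CLIP image encoder and a semantic segmentation model, and weighting classes by their frequency in the original image is an assumption about the exact definition.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def frequency_weighted_iou(original, edited):
    """Per-class IoU weighted by class frequency in the original label map."""
    total = len(original)
    score = 0.0
    for cls, count in Counter(original).items():
        inter = sum(1 for o, e in zip(original, edited) if o == cls and e == cls)
        union = count + sum(1 for e in edited if e == cls) - inter
        score += (count / total) * (inter / union)
    return score

# Toy flattened label maps: one pixel changes class after editing.
labels_original = [0, 0, 1, 1]
labels_edited = [0, 1, 1, 1]
sem_cs = frequency_weighted_iou(labels_original, labels_edited)
clip_s = cosine_similarity([1.0, 0.0], [1.0, 0.0])
```

In this toy case class 0 has IoU 1/2 and class 1 has IoU 2/3, each weighted by 1/2, giving 7/12.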

Baselines. To demonstrate the advantages of our method, we compare it against several state-of-the-art approaches, including image editing methods—ControlNet [42] and TurboEdit [9]—and the video editing model FRESCO [40]. For a fair comparison, all methods are conditioned on the same textual prompts.

Implementation Details. The shared Gaussian feature dimension is set to 32, and the weather-specific MLP decoder consists of two linear layers with ReLU activation and a Sigmoid output mapping features to RGB. We train the model using Adam for 30,000 iterations with a learning rate of $1\mathrm{e}{-4}$, using loss weights $\lambda_{cc}=1.0$ and $\lambda_{depth}=0.01$, and an SSIM weight of $\lambda=0.2$ in Eq. 6. The content loss uses VGG-19 [30] relu_4_1 features. Dynamic weather effects use 40,000 particles for rain and 16,000 for snow. Fog color and density parameters are set to $c_{fog}=[0.80,\,0.80,\,0.85]$ and $d_{f}=0.2$. All Gaussians are rasterized together using the standard Gaussian Splatting pipeline. All experiments are conducted on a server with an Intel W-3335 CPU and an RTX 8000 GPU. More details are provided in the supplementary material.

4.2 Experimental Results

Multi-Weather Editing. Tab. 1 summarizes the quantitative results for multi-weather editing on the Waymo and nuScenes datasets. Our method achieves the best score on every metric and both datasets. The improvements in CLIP-S and Sem-CS in particular demonstrate our method's superior capability in preserving scene content and semantic consistency. This is further supported by the qualitative results shown in Fig. 3 and Fig. 4. Our method precisely maintains the scene's geometric structure and semantic content while generating highly realistic weather effects, including overcast skies, wet ground reflections, accumulated snow on surfaces, and depth-attenuated fog. Furthermore, our method generates convincing dynamic weather particles (raindrops and snowflakes), achieving a level of realism unattainable by image-level editing methods. Although baselines can produce certain weather appearances, they introduce severe content distortion, manifested as warped vehicles, hallucinated structures, and erroneous lane markings. Additionally, image-level editing methods fundamentally lack the capability to produce depth-aware atmospheric effects.

Object-Level Scene Editing. Our framework supports not only weather editing but also object-level manipulation. Leveraging the dynamic Gaussian scene graph, we achieve precise control over dynamic nodes in the scene, enabling object removal, insertion, and repositioning. Fig. 5 presents a visual comparison of object editing results. Using the text prompt "Remove all vehicles except the red and white ones in the center and change the weather to snowy", our method accurately executes the requested edits. In contrast, baseline approaches either fail to remove the specified vehicles or eliminate them entirely without discrimination, demonstrating the superior precision of our editing capability.

Table 2: Comparison of runtime measured in FPS.

Method	Speed (FPS) $\uparrow$
ControlNet [42]	0.033
TurboEdit [9]	0.097
FRESCO [40]	0.142
WeatherCity (Ours)	25.67

Runtime Analysis. We compare the runtime performance of all methods in Tab. 2. For image and video editing models, we report the average inference speed (in FPS). For WeatherCity, we report the average rendering speed. WeatherCity achieves a rendering speed of 25.67 FPS, which is sufficient to meet the requirements for real-time simulation.

4.3 Ablation Study

To validate the effectiveness of each core component in WeatherCity, we conduct extensive ablation experiments under the following configurations:

a. Baseline: All proposed modules are removed, using only Qwen-Image for image editing.
b. w/o WGS: Replace the weather Gaussian with the original 3D Gaussian, where all weather conditions share the same set of Gaussians.
c. w/o $\mathcal{L}_{cc}$: The content consistency loss $\mathcal{L}_{cc}$ is removed.

Tab. 3 presents quantitative comparisons across these configurations, while Fig. 6 provides corresponding visualizations.

Table 3: Ablation study results.

Method	CLIP-S $\uparrow$	CLIP-DS $\uparrow$	Sem-CS $\uparrow$
a. Baseline	0.735	0.276	0.891
b. w/o WGS	0.781	0.212	0.894
c. w/o $\mathcal{L}_{cc}$	0.817	0.289	0.916
WeatherCity	0.880	0.320	0.943

Effectiveness of Weather Gaussian. The removal of our Weather-aware Gaussian causes noticeable degradation in reconstruction metrics. The model fails to disentangle intrinsic scene textures from weather-specific appearances, leading to effect blending across different conditions as illustrated in Fig. 6 (b). This confirms that our shared-scene-feature and weather-specific decoder design effectively separates inherent scene attributes from transient weather appearances, enabling stable scene structure preservation and distinct weather effect modeling.

Effectiveness of content consistency loss. The removal of the content consistency loss leads to a notable decline in scene consistency metrics and introduces inconsistencies in the editing results, as illustrated in Fig. 6 (c). These artifacts stem from frame-wise inconsistencies introduced by the Qwen-Image editing process as illustrated in Fig. 6 (a). In contrast, the content consistency loss enforces feature alignment between the rendered images and the original scene through a pre-trained VGG network. This effectively rectifies the local artifacts caused by per-frame Qwen-Image editing, significantly enhancing semantic coherence and geometric integrity across weather transitions.

Effectiveness of physics-driven dynamic weather simulation. To further demonstrate the advantages of our physics-driven dynamic weather module, we design additional comparative experiments using dynamic weather prompts (e.g., "heavy rain with raindrops falling under gravity, snowflakes drifting with weak wind, fog gradually thickening in the distance"). As shown in Fig. 7, we compare the image editing results from Qwen-Image with our physics-based simulation. The visual comparison clearly reveals that the dynamic weather effects generated by Qwen-Image [34] lack temporal coherence, whereas our method achieves smooth inter-frame transitions through particle motion equations.

5 Conclusion

We present WeatherCity, a unified framework for high-fidelity 4D dynamic scene reconstruction and controllable weather simulation. We extend 2D image editing to 4D scene editing and propose a novel weather Gaussian representation that disentangles scene structure from weather appearance, and a physics-driven simulation system for dynamic effects. This enables the generation of diverse, temporally consistent, and photorealistic urban scenes under various weather conditions with fine-grained control. Extensive experiments validate that our method outperforms alternatives in visual quality, cross-weather consistency, and editing flexibility. WeatherCity not only provides a powerful tool for autonomous driving simulation but also establishes a solid foundation for future research in dynamic and controllable virtual environment creation. Limitations. The current system requires manual tuning of weather particle parameters. Future work will focus on developing more automated editing algorithms to streamline this process.

References

[1] T. Brooks, A. Holynski, and A. A. Efros (2023) Instructpix2pix: learning to follow image editing instructions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 18392–18402. Cited by: §2.2.
[2] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2020) Nuscenes: a multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11621–11631. Cited by: §4.1.
[3] Y. Chen, C. Gu, J. Jiang, X. Zhu, and L. Zhang (2023) Periodic vibration gaussian: dynamic urban scene reconstruction and real-time rendering. arXiv preprint arXiv:2311.18561. Cited by: §2.1.
[4] Z. Chen, J. Yang, J. Huang, R. de Lutio, J. M. Esturo, B. Ivanovic, O. Litany, Z. Gojcic, S. Fidler, M. Pavone, et al. OmniRe: omni urban scene reconstruction. In The Thirteenth International Conference on Learning Representations. Cited by: §1, §2.1, §3.2, §6.
[5] Y. Choi, Y. Uh, J. Yoo, and J. Ha (2020) Stargan v2: diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8188–8197. Cited by: §2.2.
[6] A. Christodoulides, G. K. Tam, J. Clarke, R. Smith, J. Horgan, N. Micallef, J. Morley, N. Villamizar, and S. Walton (2025) Survey on 3d reconstruction techniques: large-scale urban city reconstruction and requirements. IEEE Transactions on Visualization and Computer Graphics. Cited by: §1.
[7] X. Cui, W. Ye, Y. Wang, G. Zhang, W. Zhou, T. He, and H. Li (2025) Streetsurfgs: scalable urban street surface reconstruction with planar-based gaussian splatting. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §2.1.
[8] Q. Dai, X. Ni, Q. Shen, W. Chen, B. Chen, and M. Chu (2025) RainyGS: efficient rain synthesis with physically-based gaussian splatting. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 16153–16162. Cited by: §2.3.
[9] G. Deutch, R. Gal, D. Garibi, O. Patashnik, and D. Cohen-Or (2024) Turboedit: text-based image editing using few-step diffusion models. In SIGGRAPH Asia 2024 Conference Papers, pp. 1–12. Cited by: §2.2, §4.1, Table 1, Table 2, §6.3, §6.3, Table 5, Table 6.
[10] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, et al. (2024) Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning. Cited by: §2.2.
[11] G. Fiebelman, H. Averbuch-Elor, and S. Benaim (2025) Let it snow! animating static gaussian scenes with dynamic weather effects. arXiv preprint arXiv:2504.05296. Cited by: §2.3.
[12] R. Gal, O. Patashnik, H. Maron, G. Chechik, and D. Cohen-Or (2021) StyleGAN-nada: clip-guided domain adaptation of image generators. arXiv preprint arXiv:2108.00946. Cited by: 2nd item, §6.4.
[13] B. Galerne, J. Wang, L. Raad, and J. Morel (2025) SGSST: scaling gaussian splatting style transfer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 26535–26544. Cited by: §2.3.
[14] M. Geyer, O. Bar-Tal, S. Bagon, and T. Dekel (2023) Tokenflow: consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373. Cited by: §2.2.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2020) Generative adversarial networks. Communications of the ACM 63 (11), pp. 139–144. Cited by: §1.
[16] O. Greenberg, E. Kishon, and D. Lischinski (2023) S2ST: image-to-image translation in the seed space of latent diffusion. arXiv preprint arXiv:2312.00116. Cited by: §1.
[17] J. Hessel, A. Holtzman, M. Forbes, R. L. Bras, and Y. Choi (2022) CLIPScore: a reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718. Cited by: 1st item, §6.4.
[18] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis (2023) 3D gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 42 (4), article 139. Cited by: §1, §2.1.
[19] X. Li, C. Li, K. Kou, and B. Zhao (2022) Weather translation via weather-cue transferring. IEEE Transactions on Neural Networks and Learning Systems 35 (6), pp. 7988–7998. Cited by: §2.2.
[20] Y. Li, Z. Lin, D. Forsyth, J. Huang, and S. Wang (2023) Climatenerf: extreme weather synthesis in neural radiance field. In Proceedings of the ieee/cvf international conference on computer vision, pp. 3227–3238. Cited by: §1, §2.3, §6.3, §6.3, §7.3, Table 7.
[21] Y. Li, J. Wu, L. Zhao, and P. Liu (2024) Derainnerf: 3d scene estimation with adhesive waterdrop removal. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 2787–2793. Cited by: §1.
[22] K. Liu, F. Zhan, M. Xu, C. Theobalt, L. Shao, and S. Lu (2024) Stylegaussian: instant 3d style transfer with gaussian splatting. In SIGGRAPH Asia 2024 Technical Communications, pp. 1–4. Cited by: §2.3.
[23] Z. Liu, H. Mao, C. Wu, C. Feichtenhofer, T. Darrell, and S. Xie (2022) A convnet for the 2020s. arXiv preprint arXiv:2201.03545. Cited by: 3rd item, §6.4.
[24] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng (2021) Nerf: representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65 (1), pp. 99–106. Cited by: §1, §2.1.
[25] J. Ost, F. Mannan, N. Thuerey, J. Knodt, and F. Heide (2021) Neural scene graphs for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2856–2865. Cited by: §2.1.
[26] O. Özyeşil, V. Voroninski, R. Basri, and A. Singer (2017) A survey of structure from motion. Acta Numerica 26, pp. 305–364. Cited by: §2.1.
[27] C. Qian, Y. Guo, W. Li, and G. Markkula (2025) Weathergs: 3d scene reconstruction in adverse weather conditions via gaussian splatting. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pp. 185–191. Cited by: §1, §2.3.
[28] V. Schmidt, A. Luccioni, M. Teng, T. Zhang, A. Reynaud, S. Raghupathi, G. Cosne, A. Juraver, V. Vardanyan, A. Hernández-García, et al. ClimateGAN: raising climate change awareness by generating images of floods. In International Conference on Learning Representations. Cited by: §2.2.
[29] S. Sheynin, A. Polyak, U. Singer, Y. Kirstain, A. Zohar, O. Ashual, D. Parikh, and Y. Taigman (2024) Emu edit: precise image editing via recognition and generation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8871–8879. Cited by: §2.2.
[30] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.3, §4.1.
[31] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al. (2020) Scalability in perception for autonomous driving: waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2446–2454. Cited by: §4.1.
[32] D. F. Swinehart (1962) The beer-lambert law. Journal of chemical education 39 (7), pp. 333. Cited by: §3.4.
[33] X. Wang, C. Wang, B. Liu, X. Zhou, L. Zhang, J. Zheng, and X. Bai (2021) Multi-view stereo in the deep learning era: a comprehensive review. Displays 70, pp. 102102. Cited by: §2.1.
[34] C. Wu, J. Li, J. Zhou, J. Lin, K. Gao, K. Yan, S. Yin, S. Bai, X. Xu, Y. Chen, et al. (2025) Qwen-image technical report. arXiv preprint arXiv:2508.02324. Cited by: §3.1, §4.3, Table 1, §6.3, §6.3, Table 5, Table 6.
[35] W. Wu, Q. Wang, G. Wang, J. Wang, T. Zhao, Y. Liu, D. Gao, Z. Liu, and H. Wang (2024) Emie-map: large-scale road surface reconstruction based on explicit mesh and implicit encoding. In European Conference on Computer Vision, pp. 370–386. Cited by: §2.1.
[36] W. Wu, T. Zhao, C. Peng, L. Yang, Y. Wei, Z. Liu, and H. Wang (2025) BEV-gs: feed-forward gaussian splatting in bird’s-eye-view for road reconstruction. arXiv preprint arXiv:2504.13207. Cited by: §2.1.
[37] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo (2021) SegFormer: simple and efficient design for semantic segmentation with transformers. Advances in neural information processing systems 34, pp. 12077–12090. Cited by: §6.2.
[38] Y. Yan, H. Lin, C. Zhou, W. Wang, H. Sun, K. Zhan, X. Lang, X. Zhou, and S. Peng (2024) Street gaussians: modeling dynamic urban scenes with gaussian splatting. In European Conference on Computer Vision, pp. 156–173. Cited by: §1, §2.1.
[39] J. Yang, B. Ivanovic, O. Litany, X. Weng, S. W. Kim, B. Li, T. Che, D. Xu, S. Fidler, M. Pavone, et al. (2023) Emernerf: emergent spatial-temporal scene decomposition via self-supervision. arXiv preprint arXiv:2311.02077. Cited by: §2.1.
[40] S. Yang, Y. Zhou, Z. Liu, and C. C. Loy (2024) Fresco: spatial-temporal correspondence for zero-shot video translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8703–8712. Cited by: §4.1, Table 1, Table 2, §6.3, §6.3, Table 5, Table 6.
[41] Z. Ye, W. Li, S. Liu, P. Qiao, and Y. Dou (2024) AbsGS: recovering fine details for 3d gaussian splatting. arXiv preprint arXiv:2404.10484. Cited by: §6.1.
[42] L. Zhang, A. Rao, and M. Agrawala (2023) Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 3836–3847. Cited by: §1, §2.2, §4.1, Table 1, Table 2, §6.3, §6.3, Table 5, Table 6.
[43] S. Zheng, C. Lu, and S. G. Narasimhan (2024) Tpsence: towards artifact-free realistic rain generation for deraining and object detection in rain. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 5394–5403. Cited by: §1, §2.2.
[44] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba (2017) Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §6.4.
[45] X. Zhou, Z. Lin, X. Shan, Y. Wang, D. Sun, and M. Yang (2024) Drivinggaussian: composite gaussian splatting for surrounding dynamic autonomous driving scenes. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 21634–21643. Cited by: §2.1.


Supplementary Material

6 Implementation Details

We build WeatherCity upon a dynamic Gaussian scene graph following the node design in OmniRe [4], containing a sky node, a static background node, and multiple rigid and non-rigid object nodes for vehicles and pedestrians, respectively, each represented by 3D Gaussian primitives with learnable position, scale, rotation, opacity, and a shared appearance feature vector. For each weather condition, a lightweight weather-specific MLP with two fully connected layers, ReLU activation, and a final Sigmoid layer maps the shared feature of every Gaussian to its RGB color, yielding a set of weather-dependent Gaussians while keeping the underlying geometry shared across all conditions.
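The decoder design described above can be illustrated with a minimal NumPy sketch. It follows the stated layout (two fully connected layers, ReLU, final Sigmoid) and the dimensions given in Sec. 6.1 (32-dimensional shared feature, 64 hidden units). The weight initialization, the function names `init_decoder` and `decode_color`, and the list of weather names are assumptions made for illustration, not the released implementation.

```python
import numpy as np

FEATURE_DIM = 32  # shared Gaussian feature dimension (Sec. 6.1)
HIDDEN_DIM = 64   # hidden layer width (Sec. 6.1)


def init_decoder(rng, feature_dim=FEATURE_DIM, hidden_dim=HIDDEN_DIM):
    """Create weights for one weather-specific decoder (init scheme is assumed)."""
    return {
        "w1": rng.normal(0.0, 0.1, size=(feature_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, 0.1, size=(hidden_dim, 3)),
        "b2": np.zeros(3),
    }


def decode_color(features, params):
    """Map (N, feature_dim) shared features to (N, 3) RGB values in (0, 1)."""
    hidden = np.maximum(features @ params["w1"] + params["b1"], 0.0)  # FC + ReLU
    logits = hidden @ params["w2"] + params["b2"]                     # second FC
    return 1.0 / (1.0 + np.exp(-logits))                              # Sigmoid


rng = np.random.default_rng(0)
# One decoder per weather condition; the geometry and features are shared.
decoders = {name: init_decoder(rng) for name in ("clear", "rainy", "snowy", "foggy")}
shared_features = rng.normal(size=(5, FEATURE_DIM))
colors = {name: decode_color(shared_features, p) for name, p in decoders.items()}
```

Because only the decoder changes between conditions, switching weather in this sketch amounts to selecting a different entry of `decoders` while the Gaussian positions, scales, rotations, and opacities stay untouched.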

6.1 Training Details

Parameter setting. We optimize all scene nodes jointly for 30,000 iterations using Adam, adopting node-specific learning rates to stabilize training across different motion patterns. The rotation parameters of Gaussian nodes are trained with a learning rate of $5\times 10^{-5}$ for non-rigid nodes and $1\times 10^{-5}$ for all other nodes. All other scalar parameters, including shared features and weather-decoder weights, use a base learning rate of $1\times 10^{-4}$. All Gaussian densification operations are driven by the absolute gradient of the Gaussian parameters [41] with a densification threshold of $3\times 10^{-4}$, and the scaling threshold for pruning is set to $3\times 10^{-3}$. The shared Gaussian feature dimension is fixed to 32, and the decoder hidden layer dimension is 64. During training, we randomly sample both raw clear-weather images and edited images of different weather types as supervision, and render the corresponding views from the Gaussian representation to compute reconstruction losses. All Gaussians (scene and weather particles) are rasterized with the standard 3D Gaussian Splatting pipeline.
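For reference, the hyperparameters listed above can be gathered into a small configuration sketch. The values are taken from this paragraph; the dictionary keys, the node-type strings, and the helper `learning_rate` are hypothetical names chosen for illustration.

```python
# Values from the paragraph above; key names are illustrative assumptions.
TRAINING_CONFIG = {
    "iterations": 30_000,
    "optimizer": "adam",
    "feature_dim": 32,
    "decoder_hidden_dim": 64,
    "densify_abs_grad_threshold": 3e-4,
    "prune_scale_threshold": 3e-3,
}


def learning_rate(node_type: str, param_name: str) -> float:
    """Return the learning rate for a parameter group (hypothetical helper)."""
    if param_name == "rotation":
        # Non-rigid nodes use a larger rotation learning rate than other nodes.
        return 5e-5 if node_type == "non_rigid" else 1e-5
    # Shared features, decoder weights, and the remaining parameters.
    return 1e-4
```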

Weather particle simulation. For dynamic weather effects, we instantiate dedicated Gaussian nodes for rain and snow inside a scene-aligned 3D bounding volume, and treat each particle as an elongated or compact Gaussian primitive that is jointly rasterized with the reconstructed scene. In the rainy setting, we sample 40,000 raindrop particles whose base color is fixed to $c_{\text{rain}}=[0.7,0.7,0.8]$, with scale initialized to $[0.0025,0.0025,0.075]$ and opacity set to 0.13, which produces thin, semi-transparent streaks aligned with the velocity direction. For snow, we use 16,000 particles with a brighter color $c_{\text{snow}}=[0.9,0.9,0.95]$, an anisotropic scale of $[0.0064,0.004,0.004]$, and opacity 0.2, leading to denser and more softly visible flakes that exhibit fluttering motion under the turbulence term. Fog is modeled as a global depth-dependent medium using the Beer–Lambert formulation, where the transmittance is parameterized by density $d_{f}=0.2$ and the global fog color is set to $c_{\text{fog}}=[0.8,0.8,0.85]$, enabling continuous control of visibility and color tone by adjusting $d_{f}$ and $c_{\text{fog}}$.
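The settings above can be made concrete with a short NumPy sketch. The particle counts, colors, scales, opacities, and fog parameters come from this paragraph. Everything else is an assumption for illustration: the explicit Euler update with wrap-around respawning in `step_particles`, the Gaussian turbulence jitter, the z-up bounding volume, the velocity and time step, and the linear blend toward the fog color in `apply_fog` (a common way to apply Beer–Lambert transmittance $\exp(-d_{f}\cdot\text{depth})$). The exact motion equations are those of Sec. 3.4 and may differ.

```python
import numpy as np

RAIN = {"count": 40_000, "color": (0.7, 0.7, 0.8),
        "scale": (0.0025, 0.0025, 0.075), "opacity": 0.13}
SNOW = {"count": 16_000, "color": (0.9, 0.9, 0.95),
        "scale": (0.0064, 0.004, 0.004), "opacity": 0.2}
FOG = {"density": 0.2, "color": (0.8, 0.8, 0.85)}


def step_particles(positions, velocity, dt, box_min, box_max, rng, turbulence=0.0):
    """Advance particles by one Euler step and wrap them back into the volume."""
    jitter = turbulence * rng.normal(size=positions.shape)  # assumed turbulence model
    moved = positions + (velocity + jitter) * dt
    extent = box_max - box_min
    return box_min + np.mod(moved - box_min, extent)  # re-enter from the far side


def apply_fog(rgb, depth, density=FOG["density"], fog_color=FOG["color"]):
    """Blend scene color toward the fog color using Beer-Lambert transmittance."""
    transmittance = np.exp(-density * depth)[..., None]
    return transmittance * rgb + (1.0 - transmittance) * np.asarray(fog_color)


rng = np.random.default_rng(0)
# Hypothetical scene-aligned volume (meters), with z pointing up.
box_min, box_max = np.array([-10.0, -10.0, 0.0]), np.array([10.0, 10.0, 8.0])
rain = rng.uniform(box_min, box_max, size=(RAIN["count"], 3))
rain = step_particles(rain, np.array([0.0, 0.0, -9.0]), 0.1, box_min, box_max, rng)
foggy = apply_fog(np.full((2, 2, 3), 0.5), np.array([[0.0, 1.0], [10.0, 100.0]]))
```

In this sketch, raising `density` shortens the visible range and changing `fog_color` shifts the overall tone, which mirrors the continuous control over $d_{f}$ and $c_{\text{fog}}$ described above.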

Prompt design. In all experiments, we use identical text instructions for Qwen-Image and all baselines to ensure a fair comparison. As shown in Table 4, for each target weather condition, we design a structured prompt that separately specifies: (1) strict preservation of the original layout and style, (2) the desired visual properties of the target weather, and (3) prohibited artifacts.

The prompts enforce consistent content preservation—including camera composition, object categories, and spatial arrangement—so the models focus solely on modifying global atmospheric conditions.

For rain, the prompt specifies an overcast sky, wet roads with puddles and reflections, and cool, dim lighting, while forbidding sunlight, dry ground, and hallucinated objects. For snow, it converts foliage to snow-covered bare branches and requires overcast lighting with falling snow and naturally snow-covered ground, without introducing new elements or distortions. For fog, it requests realistic atmospheric haze with depth-dependent visibility reduction and soft, overcast illumination, while prohibiting clear-air or warm-light appearances. For the object-level editing example (snowy weather with vehicle removal), the prompt additionally removes all vehicles except the designated white and red ones. These prompts ensure consistent, content-preserving weather editing across rain, snow, and fog for all compared methods.

Table 4: Prompts used for Qwen-Image and all baseline methods under each weather condition.

Weather	Prompt
Rainy	Please strictly maintain the original composition, all scene contents (including ground, buildings, vegetation, cars, background, pedestrians, etc.), their positions, and the original artistic style. Convert the scene to a rainy setting. Requirements: The image should be clear and realistic; the sky must be overcast with dark clouds; the ground should be wet with puddles and reflections; the lighting should be dim, and the overall tone should be cool to create a rainy atmosphere. Do NOT include: sunny weather, blue skies and white clouds, sunlight, dry ground, any elements not present in the original image, trees that are not in the original, distorted visuals, deformed subjects, or incorrect proportions.
Snowy	Please strictly maintain the original composition, all scene contents (including ground, buildings, vegetation, cars, background, pedestrians, etc.), their positions, and the original artistic style. Convert the scene to a snowy setting. Requirements: The image should be clear and realistic; the sky should be overcast with falling snowflakes; the ground should be naturally covered with snow; do not add any extra vegetation; the lighting should be soft, and the overall tone should be cool. If the original image contains green leaves, please turn them into snow-covered bare branches. Do NOT include: sunny weather, warm tones, sunlight, melting snow, elements unrelated to the original image, elements not present in the original image, distorted visuals, deformed subjects, or incorrect proportions.
Foggy	Please strictly maintain the original composition, all scene contents (including ground, buildings, vegetation, cars, background, pedestrians, etc.), their positions, and the original artistic style. Convert the scene to a foggy setting. Requirements: The image should be clear while maintaining realistic atmospheric fog; the sky should appear overcast; the scene should be filled with natural, soft fog that reduces visibility in the distance; lighting should be diffused and soft, with an overall cool tone. Do NOT include: sunny weather, warm tones, sunlight, dry and clear air, elements unrelated to the original image, elements not present in the original image, deformed subjects, or incorrect proportions.
Snowy & vehicle removal	Please remove all vehicles in the image except for the white and red ones, and then transform the scene into snowy weather. Requirements: The image should be clear and realistic; the sky should be overcast with falling snowflakes; the ground should be naturally covered with snow; do not add any extra vegetation; the lighting should be soft, and the overall tone should be cool. If the original image contains green leaves, please turn them into snow-covered bare branches. Do NOT include: sunny weather, warm tones, sunlight, melting snow, elements unrelated to the original image, distorted visuals, deformed subjects, or incorrect proportions.
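The three-part structure of these prompts (preservation, requirements, prohibitions) can be sketched as a small template helper. The function `build_prompt` and the shortened clause lists below are hypothetical and only illustrate the structure; the exact prompt texts used in the experiments are the ones listed in Table 4.

```python
# Shortened preservation clause; the full wording is given in Table 4.
PRESERVE = ("Please strictly maintain the original composition, all scene contents, "
            "their positions, and the original artistic style.")


def build_prompt(weather, requirements, prohibited, preserve=PRESERVE):
    """Assemble a structured weather-editing prompt (hypothetical helper)."""
    parts = [
        preserve,                                           # (1) content preservation
        f"Convert the scene to a {weather} setting.",
        "Requirements: " + "; ".join(requirements) + ".",  # (2) target weather properties
        "Do NOT include: " + ", ".join(prohibited) + ".",  # (3) prohibited artifacts
    ]
    return " ".join(parts)


rainy_prompt = build_prompt(
    "rainy",
    ["the sky must be overcast with dark clouds",
     "the ground should be wet with puddles and reflections"],
    ["sunny weather", "sunlight", "dry ground"],
)
```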

6.2 Loss Functions

To jointly optimize all learnable parameters of the scene representation and the dynamic node models, we employ a weighted combination of image-based reconstruction terms and regularization losses,

$$\mathcal{L}_{total}=\mathcal{L}_{rgb}+\lambda_{cc}\mathcal{L}_{cc}+\lambda_{depth}\mathcal{L}_{depth}+\lambda_{opacity}\mathcal{L}_{opacity}+\mathcal{L}_{reg}. \tag{12}$$

We set the depth weight to $\lambda_{\text{depth}}=0.01$ , the opacity weight to $\lambda_{\text{opacity}}=0.05$ , and the content consistency weight to $\lambda_{\text{cc}}=1.0$ . The losses $\mathcal{L}_{rgb}$ , $\mathcal{L}_{cc}$ , and $\mathcal{L}_{depth}$ have been introduced in the main text. Here, we additionally present the details of losses $\mathcal{L}_{opacity}$ and $\mathcal{L}_{reg}$ .
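As a quick check of how the weights enter the objective, the following sketch evaluates the weighted sum with the values stated above. The function name `total_loss` and the scalar inputs are illustrative assumptions; in practice each term would come from the renderer and the losses defined in the main text.

```python
# Loss weights stated in this section.
LOSS_WEIGHTS = {"cc": 1.0, "depth": 0.01, "opacity": 0.05}


def total_loss(l_rgb, l_cc, l_depth, l_opacity, l_reg, weights=LOSS_WEIGHTS):
    """Weighted sum of the loss terms; the RGB and regularization terms are unweighted."""
    return (l_rgb
            + weights["cc"] * l_cc
            + weights["depth"] * l_depth
            + weights["opacity"] * l_opacity
            + l_reg)
```

With all five terms equal to 1, the sum is 1 + 1 + 0.01 + 0.05 + 1 = 3.06, which shows that the depth and opacity terms act as light regularizers relative to the RGB and content consistency terms.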

Opacity loss.

We further constrain the opacities of the Gaussians using a 2D supervision derived from the sky mask. For each view, we render an opacity map $O_{G}$ from the current Gaussian scene, and use a binary sky mask $M_{\text{sky}}$ obtained from semantic segmentation [37]. The opacity loss takes the form

$$\mathcal{L}_{\text{opacity}}=-\sum_{u}O_{G}(u)\log O_{G}(u)-\sum_{u}M_{\text{sky}}(u)\log\bigl(1-O_{G}(u)\bigr). \tag{13}$$
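A minimal NumPy sketch of the opacity loss is given below. It sums both terms over all pixels as in the formula; the clipping constant `eps` is an assumption added for numerical stability and is not part of the formulation.

```python
import numpy as np


def opacity_loss(opacity, sky_mask, eps=1e-6):
    """Opacity loss on a rendered opacity map O_G and a binary sky mask M_sky."""
    o = np.clip(opacity, eps, 1.0 - eps)            # avoid log(0); eps is assumed
    entropy_term = -np.sum(o * np.log(o))           # vanishes as O_G approaches 0 or 1
    sky_term = -np.sum(sky_mask * np.log(1.0 - o))  # penalizes opacity in sky pixels
    return entropy_term + sky_term


opacity = np.array([[0.99, 0.01]])
loss_sky_transparent = opacity_loss(opacity, np.array([[0.0, 1.0]]))
loss_sky_opaque = opacity_loss(opacity, np.array([[1.0, 0.0]]))
```

In this toy example the loss is much lower when the nearly transparent pixel is the one marked as sky, which is the behavior the sky-mask supervision is meant to encourage.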

Regularization loss.

The regularization loss comprises sharp shape regularization, voxel deformer regularization, temporal smoothness regularization, and scaling regularization.

6.3 Baselines

ControlNet [42] is an image editing method that introduces spatial conditional control into text-to-image diffusion models. Its core lies in achieving precise guidance of the image editing process by injecting spatial constraint information, supporting image modification tasks under various conditions. By aligning the intermediate features of pre-trained diffusion models with spatial conditions (such as edges and depth), this method maintains the flexibility of text prompts while enhancing the structural consistency of editing results. In the weather editing task of this study, ControlNet [42] generates target weather effects based on text prompts. However, experimental results indicate that it is prone to scene content distortion (e.g., vehicle deformation, incorrect lane markings), lacks fine-grained control over weather intensity, and has a slow inference speed (only 0.033 FPS), making it difficult to meet the real-time and consistency requirements of 4D scene simulation.

TurboEdit [9] is a text-guided image editing method based on few-step diffusion models. It aims to reduce artifacts generated during diffusion model editing and improve editing efficiency and image quality by optimizing noise scheduling strategies and novel guidance techniques. By adjusting the noise distribution and guidance signals during the diffusion process, this method maintains the visual coherence of editing results while reducing inference steps, making it suitable for fast image content modification tasks. As an image-level editing comparison baseline, TurboEdit [9] can generate weather effects to a certain extent. However, limited by the nature of 2D image editing, it cannot model depth-aware atmospheric effects (e.g., depth attenuation of fog), and exhibits insufficient performance in semantic consistency and scene structure preservation. Meanwhile, its inference speed is still far below real-time requirements (0.097 FPS).

FRESCO [40] is a video editing model for zero-shot video translation. Its core innovation lies in modeling spatial-temporal correspondence to achieve cross-domain editing of video sequences without task-specific training. By capturing spatial alignment and temporal coherence between video frames, this method transforms the style or scene of video content under the guidance of text prompts, which makes it suitable for dynamic sequence editing tasks. In this study, FRESCO [40] serves as a video-level editing baseline to verify the weather transformation capability in dynamic scenes. However, experimental results show that it still suffers from significant scene content distortion in multi-weather editing and has limited ability to simulate dynamic weather effects (e.g., falling rain and snow). Its inference speed (0.142 FPS) is also too low to support real-time simulation.

Qwen-Image [34] is a powerful text-guided image editing foundation model with high-quality image generation and editing capabilities. It can accurately respond to semantic requirements in text prompts and generate realistic and content-consistent editing results. Trained on large-scale data, this model achieves a good balance between image content preservation and target effect generation, supporting flexible editing in various scenarios. However, as a pure 2D image editing model, Qwen-Image [34] lacks modeling of temporal coherence between frames. When used alone, it is prone to temporal flickering and geometric inconsistency issues, and cannot support object-level editing and dynamic weather simulation of 4D scenes.

ClimateNeRF [20] is a 3D-level weather editing method that integrates physical simulation with Neural Radiance Fields (NeRF) to enable the editing of various climate effects in 3D scenes. By leveraging the inherent 3D geometric modeling capability of NeRF, this method achieves more realistic environmental rendering compared to 2D image editing approaches. However, a key limitation is that it is confined to static reconstruction and the simulation of static weather phenomena: it lacks the ability to model dynamic vehicles and cannot simulate dynamic weather effects, such as falling rain or snow, which are critical for 4D urban scene simulation. Additionally, it supports only a limited range of weather editing operations and does not offer flexible text-guided weather control. Furthermore, ClimateNeRF [20] has a rendering speed of 0.032 FPS, which is insufficient to meet the demands of real-time simulation.

6.4 Evaluation Details

CLIP-Score (CLIP-S [17]). CLIP-S measures the visual similarity between the original image $I$ and the edited image $\hat{I}$ using the CLIP image encoder. Let $f_{\text{CLIP}}(\cdot)$ denote the CLIP model, then the metric is computed as:

$$\text{CLIP-S}=\frac{\langle f_{\text{CLIP}}(I),\,f_{\text{CLIP}}(\hat{I})\rangle}{\|f_{\text{CLIP}}(I)\|_{2}\,\|f_{\text{CLIP}}(\hat{I})\|_{2}}. \tag{14}$$

CLIP Directional Similarity (CLIP-DS [12]). CLIP-DS evaluates whether the “editing direction” in CLIP space—produced by the edited image relative to the original image—aligns with the target editing direction defined by the text prompt. Given the original image $I$ , edited image $\hat{I}$ , and target text instruction $T$ , the metric is:

$$\text{CLIP-DS}=\frac{\left\langle f_{\text{CLIP}}(\hat{I})-f_{\text{CLIP}}(I),\;f_{\text{CLIP}}(T)\right\rangle}{\|f_{\text{CLIP}}(\hat{I})-f_{\text{CLIP}}(I)\|_{2}\,\|f_{\text{CLIP}}(T)\|_{2}}. \tag{15}$$
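Both CLIP-based metrics reduce to cosine similarities once the embeddings are available. The sketch below assumes the CLIP image and text embeddings have already been computed and are passed in as plain vectors; the function names and the toy 3-dimensional vectors are illustrative only.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def clip_s(img_emb, edited_emb):
    """CLIP-S: similarity between original and edited image embeddings."""
    return cosine(img_emb, edited_emb)


def clip_ds(img_emb, edited_emb, text_emb):
    """CLIP-DS: alignment of the image edit direction with the text embedding."""
    return cosine(edited_emb - img_emb, text_emb)


# Toy embeddings standing in for f_CLIP(I), f_CLIP(I_hat), and f_CLIP(T).
img = np.array([1.0, 0.0, 0.0])
edited = np.array([1.0, 1.0, 0.0])
text = np.array([0.0, 1.0, 0.0])
score_s = clip_s(img, edited)
score_ds = clip_ds(img, edited, text)
```

In the toy case the edit moves the image embedding exactly along the text direction, so CLIP-DS equals 1, while CLIP-S drops to about 0.707 because the edited embedding has drifted away from the original.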

Semantic Consistency Score (Sem-CS). Sem-CS measures the semantic consistency between edited and original images. We apply a ConvNeXt-XL-384 $\times$ 384 [23] model pretrained on ADE20K [44] to perform panoptic segmentation on the original image $I$ and the edited image $\hat{I}$ . Let $\text{IoU}_{c}$ denote the IoU of category $c$ , aggregated over all ADE20K classes $\mathcal{C}$ . Sem-CS is defined as the frequency-weighted IoU (fwIoU):

$$\text{Sem-CS}=\frac{\sum_{c\in\mathcal{C}}n_{c}\,\text{IoU}_{c}}{\sum_{c\in\mathcal{C}}n_{c}}, \tag{16}$$

where $n_{c}$ is the number of pixels belonging to class $c$ in the ground-truth segmentation of the original image.
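The frequency-weighted IoU can be sketched directly on two integer label maps. The function below assumes the segmentation maps of the original and edited images have already been predicted and that at least one class in the given range is present in the original; the name `sem_cs` and the 2x2 example are illustrative.

```python
import numpy as np


def sem_cs(seg_orig, seg_edit, num_classes):
    """Frequency-weighted IoU between original and edited segmentation maps."""
    weighted, total = 0.0, 0
    for c in range(num_classes):
        in_orig = seg_orig == c
        n_c = int(in_orig.sum())  # pixels of class c in the original image
        if n_c == 0:
            continue              # classes absent from the original carry zero weight
        in_edit = seg_edit == c
        inter = np.logical_and(in_orig, in_edit).sum()
        union = np.logical_or(in_orig, in_edit).sum()
        weighted += n_c * inter / union
        total += n_c
    return weighted / total


seg_orig = np.array([[0, 0], [1, 1]])
seg_edit = np.array([[0, 1], [1, 1]])
score = sem_cs(seg_orig, seg_edit, num_classes=2)
```

Here class 0 has IoU 1/2 and class 1 has IoU 2/3, each weighted by 2 pixels, so the score is (2 · 1/2 + 2 · 2/3) / 4 = 7/12.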

We note that fog synthesis substantially reduces scene visibility, which consequently invalidates metrics designed for content preservation (e.g., CLIP-S, Sem-CS). Thus, for foggy weather, our evaluation is solely based on the CLIP-DS metric.

7 Additional Results and Analysis

7.1 Detailed Quantitative and Qualitative Results

Table 5 and Table 6 present the complete quantitative comparison results for the Waymo and nuScenes datasets, respectively. Our method outperforms all baseline approaches (ControlNet, TurboEdit, FRESCO, and Qwen-Image) on every reported metric. For example, on Waymo rainy scenes WeatherCity reaches a Sem-CS of 0.931, compared with 0.852 for the strongest baseline (FRESCO). A higher CLIP-S indicates better content preservation with respect to the original scene. Furthermore, the improvements in Sem-CS quantify our method's ability to preserve the original scene content (such as road layout and vehicle geometry) during the weather transformation process, confirming that WeatherCity reduces the content distortion often observed in image-level editing frameworks.

Rainy: As illustrated in Fig. 8 and Fig. 11, our method successfully renders high-frequency details such as falling raindrops and specular reflections on wet road surfaces. Unlike baselines such as ControlNet and FRESCO, which tend to apply a global style transfer that often blurs the boundary between the road and the environment, our method leverages 3D scene representations to ensure that reflections are geometrically consistent with the camera view.

Snowy: In snowy weather generation, as shown in Fig. 9 and Fig. 12, our method achieves realistic snow accumulation on distinct surfaces, such as vehicle roofs and vegetation, without altering the underlying object semantics. The visual evidence shows that competitive methods (e.g., ControlNet) frequently hallucinate structures or warp the shape of vehicles when attempting to add snow textures. WeatherCity effectively avoids these artifacts, preserving the clear contours of dynamic objects and lane markings.

Foggy: The foggy scenarios highlight the advantage of our depth-aware approach. As seen in Fig. 10 and Fig. 13, WeatherCity simulates physically plausible depth attenuation, where visibility decreases naturally with distance. In contrast, baselines like TurboEdit and FRESCO often apply a uniform haze layer or introduce artifacts that obscure nearby objects, failing to respect the scene’s depth map.

Overall, while baseline methods can produce general weather-like appearances, they suffer from severe content distortion—manifesting as warped vehicles and erroneous scene structures. WeatherCity overcomes these limitations, offering a robust solution for high-fidelity, geometry-preserving weather simulation.

Table 5: Comparison on Waymo Open Dataset. $\uparrow$ means higher is better.

Method	Rainy			Snowy			Foggy
Method	CLIP-S $\uparrow$	CLIP-DS $\uparrow$	Sem-CS $\uparrow$	CLIP-S $\uparrow$	CLIP-DS $\uparrow$	Sem-CS $\uparrow$	CLIP-DS $\uparrow$
ControlNet [42]	0.654	0.228	0.713	0.615	0.261	0.677	0.225
TurboEdit [9]	0.843	0.233	0.825	0.816	0.228	0.787	0.221
FRESCO [40]	0.721	0.209	0.852	0.719	0.253	0.797	0.177
Qwen-Image [34]	0.813	0.248	0.845	0.757	0.310	0.840	0.251
WeatherCity (Ours)	0.898	0.300	0.931	0.847	0.330	0.899	0.278

Table 6: Comparison on nuScenes Dataset. $\uparrow$ means higher is better.

Method	Rainy			Snowy			Foggy
Method	CLIP-S $\uparrow$	CLIP-DS $\uparrow$	Sem-CS $\uparrow$	CLIP-S $\uparrow$	CLIP-DS $\uparrow$	Sem-CS $\uparrow$	CLIP-DS $\uparrow$
ControlNet [42]	0.703	0.250	0.810	0.609	0.234	0.812	0.201
TurboEdit [9]	0.806	0.225	0.848	0.758	0.261	0.811	0.266
FRESCO [40]	0.726	0.220	0.863	0.694	0.239	0.847	0.213
Qwen-Image [34]	0.823	0.256	0.891	0.785	0.302	0.913	0.264
WeatherCity (Ours)	0.880	0.272	0.977	0.860	0.333	0.960	0.301

7.2 Temporal Consistency Comparison

Temporal consistency is crucial for 4D urban scene simulation, requiring coherent motion of dynamic objects and continuous evolution of weather effects across frames. Qualitative comparisons on Waymo/nuScenes dynamic sequences (Fig. 14, Fig. 15, Fig. 16) reveal that baseline methods generally suffer from inter-frame inconsistency: scene geometry undergoes deformation and dynamic vehicles exhibit shape changes between consecutive frames, while weather effects display random fluctuations with noticeable inter-frame flickering. In contrast, our approach achieves temporally consistent weather editing throughout the temporal sequence.

7.3 3D Baseline Comparison

We compare WeatherCity with ClimateNeRF [20], a NeRF-based 3D weather editing method, with quantitative and qualitative results presented in Table 7 and Fig. 17. ClimateNeRF exhibits significant limitations due to its static 3D representation: it cannot effectively model dynamic objects (resulting in motion-blurred vehicles), nor can it simulate evolving weather effects. In contrast, WeatherCity, leveraging dynamic Gaussian modeling, outperforms ClimateNeRF across all reported metrics while delivering more realistic editing effects. Our approach additionally supports dynamic weather particles (such as falling snowflakes) and renders far faster than ClimateNeRF (25.67 vs. 0.032 FPS).

Table 7: Comparison on Waymo Open Dataset scene 788. $\uparrow$ means higher is better.

Method	Snowy			Foggy	FPS $\uparrow$
Method	CLIP-S $\uparrow$	CLIP-DS $\uparrow$	Sem-CS $\uparrow$	CLIP-DS $\uparrow$	FPS $\uparrow$
ClimateNeRF [20]	0.807	0.294	0.905	0.269	0.032
WeatherCity (Ours)	0.847	0.341	0.941	0.280	25.67
