Large Chunk Test-Time Training (LaCT) has shown strong performance on long-context 3D reconstruction, but its fully plastic inference-time updates remain vulnerable to catastrophic forgetting and overfitting. As a result, LaCT is typically instantiated with a single large chunk spanning the full input sequence, falling short of the broader goal of handling arbitrarily long sequences in a single pass. We propose Elastic Test-Time Training, inspired by elastic weight consolidation, which stabilizes LaCT's fast-weight updates with a Fisher-weighted elastic prior around a maintained anchor state. The anchor evolves as an exponential moving average of past fast weights, balancing stability and plasticity. Building on this architecture, we introduce Fast Spatial Memory (FSM), an efficient and scalable model for 4D reconstruction that learns spatiotemporal representations from long observation sequences and renders novel view-time combinations. We pre-train FSM on large-scale curated 3D/4D data to capture the dynamics and semantics of complex spatial environments. Extensive experiments show that FSM supports fast adaptation over long sequences and delivers high-quality 3D/4D reconstruction with smaller chunks while mitigating the camera-interpolation shortcut. Overall, we hope to advance LaCT beyond the bounded single-chunk setting toward robust multi-chunk adaptation, a necessary step for generalizing to genuinely longer sequences, while substantially alleviating the activation-memory bottleneck.
Fast Spatial Memory is an end-to-end feed-forward network that patchifies posed images and augments them with Plücker ray maps and timestamp embeddings to form visual tokens. These tokens are processed by a stack of LaCET blocks, our Large-Chunk Elastic Test-Time Training backbone.
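As a concrete reference for the ray conditioning above, a Plücker ray map encodes each pixel's ray as a unit direction d plus its moment m = o × d, yielding a 6-channel map. The sketch below is a minimal illustration under a standard pinhole model; the function and variable names are ours, not the released code:

```python
import torch

def plucker_ray_map(K_inv, c2w, H, W):
    """Build a 6-channel Plücker ray map for a posed image.
    K_inv: (3, 3) inverse camera intrinsics; c2w: (4, 4) camera-to-world pose."""
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    # Homogeneous pixel centers, shape (H, W, 3)
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)
    dirs = (pix @ K_inv.T) @ c2w[:3, :3].T        # unproject, rotate to world
    dirs = dirs / dirs.norm(dim=-1, keepdim=True) # unit direction d
    origin = c2w[:3, 3].expand_as(dirs)           # camera center o per pixel
    moment = torch.cross(origin, dirs, dim=-1)    # Plücker moment m = o x d
    return torch.cat([dirs, moment], dim=-1)      # (H, W, 6)
```

The moment channels make the parameterization depend on the full ray, not just its direction, so tokens from different camera positions remain distinguishable.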
Each LaCET block maintains two sets of parameters: fast weights (adapted per-scene at inference) and anchor weights (a stable EMA reference). During a forward pass, the fast weights are updated chunk-by-chunk using KV statistics from input-view tokens, while the elastic consolidation term softly restores critical parameters toward the anchor to prevent drift. This stabilizes rapid adaptation while preserving genuine novel-view synthesis capability.
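The chunk-wise fast-weight update with elastic consolidation can be sketched as follows. This is an illustrative simplification, not the authors' implementation: the class name, hyperparameters, and the squared-gradient importance estimate (a common online proxy for the Fisher information) are our assumptions:

```python
import torch

class ElasticFastWeights:
    """Sketch of a LaCET-style fast-weight state: fast weights W adapted
    per scene, an EMA anchor A, and a running importance estimate F."""

    def __init__(self, dim, lr=1e-2, lam=0.1, ema=0.99):
        self.W = torch.zeros(dim, dim)  # fast weights (plastic)
        self.A = torch.zeros(dim, dim)  # anchor (stable EMA reference)
        self.F = torch.zeros(dim, dim)  # per-parameter importance
        self.lr, self.lam, self.ema = lr, lam, ema

    def update_chunk(self, K, V):
        """One large-chunk update from key/value statistics K, V: (n, dim)."""
        err = K @ self.W.T - V                    # reconstruction residual
        grad = err.T @ K / K.shape[0]             # grad of 0.5 * ||K W^T - V||^2
        # Fisher-weighted elastic pull toward the anchor prevents drift
        grad = grad + self.lam * self.F * (self.W - self.A)
        self.W = self.W - self.lr * grad
        # Track importance online; let the anchor trail the fast weights
        self.F = self.ema * self.F + (1 - self.ema) * grad.pow(2)
        self.A = self.ema * self.A + (1 - self.ema) * self.W

    def apply(self, Q):
        return Q @ self.W.T
```

The elastic term only penalizes movement where `F` marks a parameter as important, so less critical directions stay fully plastic for new chunks.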
The model supports two decoding heads: (i) an LVSM-style lightweight linear decoder for direct RGB patch prediction, and (ii) an LRM-style decoder that predicts pixel-aligned 4D Gaussian Splatting primitives followed by differentiable rasterization.
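A minimal sketch of head (i), the linear RGB patch decoder: each target-view token is projected to a flattened patch and the patches are tiled back into an image. Token dimension and patch size are hypothetical placeholders, and the LRM-style Gaussian head is omitted:

```python
import torch
import torch.nn as nn

class LinearPatchDecoder(nn.Module):
    """LVSM-style linear head (sketch): token -> RGB patch, no upsampler."""

    def __init__(self, dim=768, patch=8):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(dim, 3 * patch * patch)

    def forward(self, tokens, hw):
        # tokens: (B, N, dim), laid out on an (h, w) patch grid with N = h * w
        h, w = hw
        x = self.proj(tokens)                       # (B, N, 3 * p * p)
        B, N, _ = x.shape
        x = x.view(B, h, w, 3, self.patch, self.patch)
        x = x.permute(0, 3, 1, 4, 2, 5)             # (B, 3, h, p, w, p)
        x = x.reshape(B, 3, h * self.patch, w * self.patch)
        return torch.sigmoid(x)                     # RGB in [0, 1]
```

The linear head keeps all scene reasoning inside the backbone, whereas the Gaussian head trades some of that simplicity for an explicit, rasterizable 4D representation.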
Left: FSM takes a sequence of posed images at arbitrary times and renders novel view-time combinations. Right: The LaCET block maintains anchor and fast weights, tracking importance online to elastically consolidate updates.
Evaluated on the Stereo4D and NVIDIA Dynamic Scene benchmarks. Because metrics are resolution-dependent, we evaluate FSM at the lowest resolution in the comparison (256×256). FSM-LVSM outperforms all prior feed-forward and optimization-based 4D methods across all metrics.
| Model | Res. (Stereo4D) | PSNR↑ | LPIPS↓ | SSIM↑ | Res. (NVIDIA) | PSNR↑ | LPIPS↓ | SSIM↑ |
|---|---|---|---|---|---|---|---|---|
| *Optimization-based* | | | | | | | | |
| SoM | OOT (~10 min/scene) | — | — | — | 379×672 | 15.30 | 0.509 | 0.317 |
| MoSca | OOT (~45 min/scene) | — | — | — | 379×672 | 21.45 | 0.265 | 0.712 |
| *Feed-forward* | | | | | | | | |
| L4GM | OOT (requires MVD prior) | — | — | — | 256×256 | 10.07 | 0.587 | 0.235 |
| 4DGT | 504×504 | 24.62 | 0.102 | 0.785 | 504×504 | 14.13 | 0.640 | 0.131 |
| MoVieS | 504×504 | 27.19 | 0.114 | 0.888 | 379×672 | 19.16 | 0.315 | 0.514 |
| FSM-LRM (ours) | 256×256 | 27.29 | 0.147 | 0.876 | 256×256 | 20.17 | 0.337 | 0.567 |
| FSM-LVSM (ours) | 256×256 | 32.16 | 0.043 | 0.931 | 256×256 | 23.90 | 0.105 | 0.747 |
Evaluated on DL3DV-140 at 256×256 resolution. FSM-LVSM achieves the best LPIPS and SSIM among all methods at the same resolution, demonstrating that LaCET preserves strong 3D capability while adding 4D generalization.
| Model | Res. | PSNR↑ | LPIPS↓ | SSIM↑ |
|---|---|---|---|---|
| *Static Models* | | | | |
| DepthSplat | 512×448 | 17.81 | 0.356 | 0.596 |
| GS-LRM | 256×256 | 23.02 | 0.266 | 0.705 |
| LVSM | 256×256 | 23.10 | 0.257 | 0.703 |
| RayZer† | 256×256 | 23.72 | 0.222 | 0.733 |
| LongLRM | 540×960 | 24.10 | 0.254 | 0.783 |
| tttLRM | 540×960 | 25.07 | 0.215 | 0.822 |
| tttLVSM | 540×960 | 26.90 | 0.185 | 0.837 |
| FSM-LRM (ours) | 256×256 | 23.59 | 0.206 | 0.766 |
| FSM-LVSM (ours) | 256×256 | 26.69 | 0.091 | 0.846 |
| *Dynamic Models* | | | | |
| FSM-LRM (ours) | 256×256 | 21.89 | 0.314 | 0.692 |
| FSM-LVSM (ours) | 256×256 | 24.61 | 0.118 | 0.787 |
† RayZer ignores input poses and uses target reference images, placing it between pose-conditioned and pose-free approaches.
@arxiv{ma2026fastspatial,
title = {Fast Spatial Memory with Scalable Elastic Test-Time Training},
author = {Ma, Ziqiao and Yu, Xueyang and Zhen, Haoyu and Yang, Yuncong and Chai, Joyce and Gan, Chuang},
year = {2026}
}