Fast Spatial Memory with
Scalable Elastic Test-Time Training

MIT-IBM Watson AI Lab1     University of Michigan2     University of Massachusetts Amherst3  
* Equal Contribution

TL;DR

Large Chunk Test-Time Training (LaCT) has shown strong performance on long-context 3D reconstruction, but its fully plastic inference-time updates remain vulnerable to catastrophic forgetting and overfitting. We propose Elastic Test-Time Training (ETTT), which stabilizes LaCT's fast-weight updates with a Fisher-weighted elastic prior, and introduce Fast Spatial Memory (FSM), an efficient and scalable 4D reconstruction model that learns spatiotemporal representations from long observation sequences to render novel views at novel times.
Fast Spatial Memory Teaser

Abstract

Large Chunk Test-Time Training (LaCT) has shown strong performance on long-context 3D reconstruction, but its fully plastic inference-time updates remain vulnerable to catastrophic forgetting and overfitting. As a result, LaCT is typically instantiated with a single large chunk spanning the full input sequence, falling short of the broader goal of handling arbitrarily long sequences in a single pass. We propose Elastic Test-Time Training (ETTT), inspired by elastic weight consolidation, which stabilizes LaCT's fast-weight updates with a Fisher-weighted elastic prior around a maintained anchor state. The anchor evolves as an exponential moving average of past fast weights, balancing stability and plasticity. On top of this architecture, we introduce Fast Spatial Memory (FSM), an efficient and scalable model for 4D reconstruction that learns spatiotemporal representations from long observation sequences and renders novel view-time combinations. We pretrain FSM on large-scale curated 3D/4D data to capture the dynamics and semantics of complex spatial environments. Extensive experiments show that FSM supports fast adaptation over long sequences and delivers high-quality 3D/4D reconstruction with smaller chunks while mitigating the camera-interpolation shortcut. Overall, we hope to advance LaCT beyond the bounded single-chunk setting toward robust multi-chunk adaptation, a necessary step for generalizing to genuinely longer sequences, while substantially alleviating the activation-memory bottleneck.

Key Insights

  1. Elastic Test-Time Training (ETTT). We introduce a consolidate operator on top of LaCT's fast-weight update, inspired by Elastic Weight Consolidation (EWC) in continual learning. Each LaCET (Large-Chunk Elastic Test-Time Training) block maintains anchor weights and tracks per-parameter importance via an online Fisher-style statistic. During inference, important parameters are softly pulled back toward their anchors while less critical ones adapt freely, transforming the base transformer into a fast, self-refining yet elastic 4D learner.
  2. LaCET mitigates camera-interpolation shortcuts. Standard LaCT exploits short-range temporal redundancy (frame interpolation) rather than learning true view-conditioned 4D representations. LaCET's elastic prior penalizes cumulative fast-weight drift relative to a dynamically evolving anchor, substantially reducing this shortcut and improving generalization under sparse and out-of-distribution inputs.
  3. Scalable 4D pretraining. Fast Spatial Memory is pretrained on a curated mixture of 3D/4D datasets (RealEstate10K, DL3DV, PointOdyssey, Spring, DynamicReplica, Multi-Cam Video, Stereo4D) using a long-context curriculum that gradually increases resolution, temporal span, and input density. FSM is the first large-scale feedforward 4D model that accepts arbitrary view-time inputs and renders arbitrary novel view-time combinations.
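The consolidate operator in insight 1 can be illustrated with a minimal numpy sketch. This is a toy stand-in under our own assumptions, not FSM's actual implementation: the names (`consolidate`, `fisher_decay`, `elastic_lambda`, `anchor_decay`) and the specific EMA forms are illustrative, and the "gradient" is random noise standing in for a chunk's test-time-training gradient.

```python
# Hedged sketch of an EWC-style elastic fast-weight update: importance is an
# EMA of squared gradients, and important parameters are pulled back toward a
# slowly evolving anchor. All hyperparameter names/values are assumptions.
import numpy as np

rng = np.random.default_rng(0)

dim = 8
fast_w = np.zeros(dim)            # fast weights, adapted at inference time
anchor = np.zeros(dim)            # stable anchor (EMA of past fast weights)
fisher = np.zeros(dim)            # online Fisher-style importance estimate

fisher_decay = 0.9                # EMA decay for the importance statistic
elastic_lambda = 1.0              # strength of the elastic pull to the anchor
anchor_decay = 0.99               # EMA decay for the anchor
lr = 0.1

def consolidate(grad):
    """One elastic update: important params are pulled back to the anchor."""
    global fast_w, anchor, fisher
    # Track per-parameter importance as an EMA of squared gradients.
    fisher = fisher_decay * fisher + (1 - fisher_decay) * grad ** 2
    # Gradient step plus the Fisher-weighted elastic penalty gradient,
    # lambda * F * (w - anchor): high-importance params resist drift.
    fast_w = fast_w - lr * (grad + elastic_lambda * fisher * (fast_w - anchor))
    # The anchor drifts slowly toward the current fast weights.
    anchor = anchor_decay * anchor + (1 - anchor_decay) * fast_w

for _ in range(50):
    grad = rng.normal(size=dim)   # stand-in for one chunk's TTT gradient
    consolidate(grad)

drift = np.abs(fast_w - anchor).max()
```

With `elastic_lambda = 0` this reduces to plain fully plastic LaCT-style updates; larger values keep the fast weights bounded near the anchor, trading plasticity for stability.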

Method

Fast Spatial Memory adopts an end-to-end feedforward network that patchifies posed images and augments them with Plücker ray maps and timestamp embeddings to form visual tokens. These tokens are processed by a stack of LaCET blocks — our novel Large-Chunk Elastic Test-Time Training backbone.
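The Plücker ray maps mentioned above can be computed per pixel from camera intrinsics and pose. The sketch below assumes a pinhole camera with z-forward convention; the function name and conventions are ours, not necessarily FSM's.

```python
# Hedged sketch: per-pixel Plücker ray maps from a posed pinhole camera.
# A ray with origin o and unit direction d has Plücker coordinates (d, o x d),
# which identify the ray independently of where o sits along it.
import numpy as np

def plucker_rays(K, c2w, H, W):
    """Return an (H, W, 6) map of [direction, moment] Plücker coordinates."""
    # Pixel grid sampled at pixel centers.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    # Unproject to camera-space directions (z forward).
    x = (u - K[0, 2]) / K[0, 0]
    y = (v - K[1, 2]) / K[1, 1]
    dirs_cam = np.stack([x, y, np.ones_like(x)], axis=-1)
    # Rotate into world space and normalize.
    dirs = dirs_cam @ c2w[:3, :3].T
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    origin = c2w[:3, 3]
    # Plücker moment m = o x d.
    moment = np.cross(np.broadcast_to(origin, dirs.shape), dirs)
    return np.concatenate([dirs, moment], axis=-1)

K = np.array([[128.0, 0, 64], [0, 128.0, 64], [0, 0, 1]])
rays = plucker_rays(K, np.eye(4), 8, 8)
```

In practice this 6-channel map is concatenated with image patches (and here, timestamp embeddings) before tokenization, giving every token an explicit view-time identity.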

Each LaCET block maintains two sets of parameters: fast weights (adapted per-scene at inference) and anchor weights (a stable EMA reference). During a forward pass, the fast weights are updated chunk-by-chunk using KV statistics from input-view tokens, while the elastic consolidation term softly restores critical parameters toward the anchor to prevent drift. This stabilizes rapid adaptation while preserving genuine novel-view synthesis capability.
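The chunk-by-chunk flow above can be sketched with a toy linear fast-weight memory. The update rule here (absorbing each chunk's KV outer products, then an elastic restore toward an EMA anchor) is our simplified stand-in, since this page does not fully specify LaCET's actual update.

```python
# Toy sketch of chunked fast-weight adaptation with elastic consolidation.
# A linear fast-weight memory (W += V^T K per chunk) stands in for the real
# LaCET update; chunk sizes, decays, and lambda are assumed values.
import numpy as np

rng = np.random.default_rng(1)
d, chunk, n_chunks = 16, 32, 4

fast_w = np.zeros((d, d))
anchor = np.zeros((d, d))
anchor_decay, elastic_lambda = 0.95, 0.1

outputs = []
for _ in range(n_chunks):
    k = rng.normal(size=(chunk, d)) / np.sqrt(d)
    v = rng.normal(size=(chunk, d))
    q = rng.normal(size=(chunk, d))
    # Apply the current fast weights to this chunk's queries first...
    outputs.append(q @ fast_w.T)
    # ...then absorb the chunk's KV statistics into the fast weights.
    fast_w = fast_w + v.T @ k
    # Elastic consolidation: softly restore fast weights toward the anchor.
    fast_w = fast_w - elastic_lambda * (fast_w - anchor)
    # The anchor evolves as an EMA of past fast weights.
    anchor = anchor_decay * anchor + (1 - anchor_decay) * fast_w

out = np.concatenate(outputs, axis=0)
```

Note the ordering: each chunk is read with the weights produced by earlier chunks, so the first chunk sees the initial (zero) fast weights, and later chunks benefit from accumulated, elastically bounded context.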

The model supports two decoding heads: (i) an LVSM-style lightweight linear decoder for direct RGB patch prediction, and (ii) an LRM-style decoder that predicts pixel-aligned 4D Gaussian Splatting primitives followed by differentiable rasterization.
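For the first head, an LVSM-style linear decoder is essentially one linear map from token features to patch pixels followed by an unpatchify step. The sketch below uses assumed shapes and a random weight matrix purely to show the tensor plumbing; it is not FSM's trained head.

```python
# Hedged sketch of an LVSM-style linear RGB decoder: tokens -> p*p*3 patch
# pixels via a single linear map, then unpatchify into an image. All shapes
# and the weight initialization are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
p, d = 8, 32                       # patch size, token feature dim
H, W = 32, 32                      # output image size
n_tokens = (H // p) * (W // p)

tokens = rng.normal(size=(n_tokens, d))
W_dec = rng.normal(size=(d, p * p * 3)) * 0.02   # linear decoding head

patches = tokens @ W_dec                         # (n_tokens, p*p*3)
# Unpatchify: interleave the patch grid and per-patch pixel axes.
image = (patches
         .reshape(H // p, W // p, p, p, 3)
         .transpose(0, 2, 1, 3, 4)
         .reshape(H, W, 3))
```

The LRM-style head replaces the RGB output with per-pixel 4D Gaussian Splatting parameters, which are then rendered by a differentiable rasterizer rather than read out directly.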

Model Overview

Left: FSM takes a sequence of posed images at arbitrary times and renders novel view-time combinations. Right: The LaCET block maintains anchor and fast weights, tracking importance online to elastically consolidate updates.

Results

4D Novel View Synthesis

Evaluated on Stereo4D and NVIDIA Dynamic Scene benchmarks at 256×256 resolution. Metrics are resolution-dependent; we adopt the lowest resolution for fair comparison. FSM-LVSM outperforms all prior feed-forward and optimization-based 4D methods across all metrics.

| Model | Stereo4D Res. | PSNR↑ | LPIPS↓ | SSIM↑ | NVIDIA Res. | PSNR↑ | LPIPS↓ | SSIM↑ |
|---|---|---|---|---|---|---|---|---|
| *Optimization-based* | | | | | | | | |
| SoM | OOT (~10 min/scene) | — | — | — | 379×672 | 15.30 | 0.509 | 0.317 |
| MoSca | OOT (~45 min/scene) | — | — | — | 379×672 | 21.45 | 0.265 | 0.712 |
| *Feed-forward* | | | | | | | | |
| L4GM | OOT (requires MVD prior) | — | — | — | 256×256 | 10.07 | 0.587 | 0.235 |
| 4DGT | 504×504 | 24.62 | 0.102 | 0.785 | 504×504 | 14.13 | 0.640 | 0.131 |
| MoVieS | 504×504 | 27.19 | 0.114 | 0.888 | 379×672 | 19.16 | 0.315 | 0.514 |
| FSM-LRM (ours) | 256×256 | 27.29 | 0.147 | 0.876 | 256×256 | 20.17 | 0.337 | 0.567 |
| FSM-LVSM (ours) | 256×256 | 32.16 | 0.043 | 0.931 | 256×256 | 23.90 | 0.105 | 0.747 |

3D Novel View Synthesis

Evaluated on DL3DV-140 at 256×256 resolution. FSM-LVSM achieves the best LPIPS and SSIM among all methods at the same resolution, demonstrating that LaCET preserves strong 3D capability while adding 4D generalization.

| Model | Res. | PSNR↑ | LPIPS↓ | SSIM↑ |
|---|---|---|---|---|
| *Static Models* | | | | |
| DepthSplat | 512×448 | 17.81 | 0.356 | 0.596 |
| GS-LRM | 256×256 | 23.02 | 0.266 | 0.705 |
| LVSM | 256×256 | 23.10 | 0.257 | 0.703 |
| RayZer† | 256×256 | 23.72 | 0.222 | 0.733 |
| LongLRM | 540×960 | 24.10 | 0.254 | 0.783 |
| tttLRM | 540×960 | 25.07 | 0.215 | 0.822 |
| tttLVSM | 540×960 | 26.90 | 0.185 | 0.837 |
| FSM-LRM (ours) | 256×256 | 23.59 | 0.206 | 0.766 |
| FSM-LVSM (ours) | 256×256 | 26.69 | 0.091 | 0.846 |
| *Dynamic Models* | | | | |
| FSM-LRM (ours) | 256×256 | 21.89 | 0.314 | 0.692 |
| FSM-LVSM (ours) | 256×256 | 24.61 | 0.118 | 0.787 |

† RayZer ignores input poses and uses target reference images, placing it between pose-conditioned and pose-free approaches.

3D Novel View Synthesis

FSM-LVSM renders novel views from long observation sequences on real-world static scenes.


4D Novel View Synthesis

FSM renders novel view-time combinations from long dynamic sequences.


BibTeX

@misc{ma2026fastspatial,
  title     = {Fast Spatial Memory with Scalable Elastic Test-Time Training},
  author    = {Ma, Ziqiao and Yu, Xueyang and Zhen, Haoyu and Yang, Yuncong and Chai, Joyce and Gan, Chuang},
  year      = {2026}
}