ECCV 2026 · Malmö, Sweden

InSpace: Structure-Aware 3D Indoor Scene Generation from a Single 360° Image

¹KAIST  ²NAVER LABS  ³Chung-Ang University
* Work done during an internship at NAVER LABS. Co-corresponding authors.
InSpace teaser: 3D indoor scenes generated from single ERP images

InSpace generates complete 3D indoor scenes, including structural layout together with separately textured assets, from a single 360° ERP image.

Abstract

Recent advances in single image-to-3D generation have enabled high-quality asset synthesis, yet extending these capabilities to indoor scene generation remains challenging. Existing methods focus on asset-level generation while neglecting the structural layout, which is essential for downstream applications and serves as the spatial anchor for grounding assets. However, a single image with a limited field of view lacks the spatial coverage to recover a coherent global layout. To this end, we use a 360° image represented in equirectangular projection (ERP) and propose InSpace, a structure-aware framework for 3D indoor scene generation. InSpace comprises three stages: (1) estimating partial scene geometry as spatial priors, (2) generating coarse scene structure with view-selective cross-attention, and (3) producing detailed layout and asset geometry with textures through a global-local hybrid attention, using flow matching. We also propose ERP-FRONT, a paired ERP-Image-to-3D indoor scene dataset based on 3D-FRONT. Experiments show that InSpace generates complete 3D indoor scenes with structural layout, along with separate textured assets from a single ERP image, achieving strong performance across 3D and 2D metrics.

Video

Method

InSpace generates a 3D indoor scene from a single ERP image through three cascaded, flow-matching stages.
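
As background for the stages below, flow-matching sampling can be sketched as integrating a velocity field from noise toward data. This is a minimal illustration, not the InSpace sampler: the time convention (t = 0 is noise, t = 1 is data), the Euler integrator, and the `toy_velocity` field standing in for a trained network are all assumptions made for the example.

```python
def euler_sample(velocity, x0, steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(target):
    """Hypothetical stand-in for a trained network: the exact velocity of
    the straight-line path that ends at one known target point."""
    def velocity(x, t):
        # For x_t = (1 - t) * x0 + t * x1, the velocity is (x1 - x_t) / (1 - t).
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


# Start from an arbitrary "noise" point and flow to the target.
sample = euler_sample(toy_velocity([0.25, -0.5, 0.1]), x0=[1.3, 0.2, -0.7])
```

In a real model the velocity would come from a network conditioned on image features, and the state would be a voxel grid or latent rather than a 3-vector.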

Stage 1 Partial Scene Geometry (PSG)

We estimate a depth map from the ERP image and lift it to 3D via equirectangular back-projection, producing an initial point cloud normalized into [-0.5, 0.5]³. This Partial Scene Geometry, together with a calibrated camera center, provides geometric priors that spatially ground the later stages.

Stage 1: partial scene geometry
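
The back-projection and normalization in Stage 1 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the released implementation: it assumes the depth map stores radial distance from the camera, a y-up axis convention with longitude spanning [-π, π], and normalization by the largest bounding-box extent. The function names are hypothetical, and plain Python lists are used for clarity where a practical version would vectorize.

```python
import math


def erp_backproject(depth):
    """Lift an H x W ERP depth map (radial distance per pixel) to 3D points."""
    h, w = len(depth), len(depth[0])
    points = []
    for v in range(h):
        # Latitude runs from +pi/2 (top row) to -pi/2 (bottom row).
        lat = math.pi / 2 - (v + 0.5) / h * math.pi
        for u in range(w):
            # Longitude runs from -pi (left edge) to +pi (right edge).
            lon = (u + 0.5) / w * 2 * math.pi - math.pi
            r = depth[v][u]
            points.append((r * math.cos(lat) * math.sin(lon),
                           r * math.sin(lat),
                           r * math.cos(lat) * math.cos(lon)))
    return points


def normalize_to_unit_cube(points, camera=(0.0, 0.0, 0.0)):
    """Map points (and the camera center) into [-0.5, 0.5]^3."""
    lo = [min(p[i] for p in points) for i in range(3)]
    hi = [max(p[i] for p in points) for i in range(3)]
    center = [(a + b) / 2 for a, b in zip(lo, hi)]
    scale = max(b - a for a, b in zip(lo, hi)) or 1.0

    def norm(p):
        return tuple((p[i] - center[i]) / scale for i in range(3))

    return [norm(p) for p in points], norm(camera)
```

Applying the same transform to the camera origin keeps the camera center expressed in the normalized frame, which is what lets later stages relate voxel positions to viewing directions.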

Stage 2 Coarse Scene Geometry with View-Selective Cross-Attention

The ERP image is converted into six cubemap faces used to condition a flow-matching model that generates a coarse 3D voxel scene. Unlike conventional models where every voxel attends to a single global feature, our view-selective cross-attention conditions each voxel only on the cubemap faces visible from its 3D position. A lightweight 3D bounding-box detector then predicts oriented boxes for each asset.

Stage 2: coarse scene geometry
View-selective cross-attention: each voxel attends only to the cubemap faces visible from its 3D position.
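
One way to picture the face selection is as a per-voxel mask over the six cubemap faces. The sketch below is an assumption about how "visible" could be decided, not the paper's exact rule: it picks the face that the ray from the camera center to the voxel passes through, and a hypothetical `margin` parameter optionally admits neighboring faces near face boundaries. Face names and both helper functions are illustrative.

```python
FACE_NORMALS = {
    "+x": (1, 0, 0), "-x": (-1, 0, 0),
    "+y": (0, 1, 0), "-y": (0, -1, 0),
    "+z": (0, 0, 1), "-z": (0, 0, -1),
}


def visible_faces(voxel, camera, margin=0.0):
    """Cubemap faces hit (or nearly hit) by the ray from camera to voxel."""
    d = [v - c for v, c in zip(voxel, camera)]
    dominant = max(abs(x) for x in d)
    if dominant == 0.0:
        # The voxel coincides with the camera: no direction, keep all faces.
        return list(FACE_NORMALS)
    keep = []
    for name, normal in FACE_NORMALS.items():
        along = sum(di * ni for di, ni in zip(d, normal))
        # margin=0 keeps only the dominant face; larger values add neighbors.
        if along >= (1.0 - margin) * dominant:
            keep.append(name)
    return keep


def face_mask(voxels, camera, margin=0.0):
    """Boolean mask of shape (num_voxels, 6) in FACE_NORMALS order."""
    names = list(FACE_NORMALS)
    return [[name in visible_faces(v, camera, margin) for name in names]
            for v in voxels]
```

In a cross-attention layer, such a mask would be expanded over the image tokens of each face so that a voxel's query only scores keys from its selected faces.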

Stage 3 Detailed Layout & Asset Generation with Global-Local Hybrid Attention

Conditioned on the predicted 3D boxes, we generate detailed geometry and texture for both the structural layout and each asset. Global self-attention lets all components interact for spatial coherence, while asset-selective cross-attention lets each component attend only to its corresponding image region for fine detail. The latents are decoded into a textured 3D mesh of the complete scene.

Stage 3: layout and asset generation
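
The difference between the global and the asset-selective attention paths can be illustrated with attention masks. The sketch below is a simplified assumption rather than the actual architecture: each query token carries the index of its component (layout or asset), each image token carries the index of the region it falls in, and attention weights are computed from raw scores in pure Python. All names are hypothetical.

```python
import math


def selective_mask(query_component, key_component):
    """mask[q][k] is True when image token k lies in query q's own region."""
    return [[qc == kc for kc in key_component] for qc in query_component]


def masked_attention_weights(scores, mask=None):
    """Row-wise softmax over scores; masked-out keys receive weight 0.

    Assumes every row has at least one allowed key.
    """
    weights = []
    for i, row in enumerate(scores):
        allowed = [mask is None or mask[i][j] for j in range(len(row))]
        peak = max(s for s, a in zip(row, allowed) if a)
        exps = [math.exp(s - peak) if a else 0.0 for s, a in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Component 0 is the layout, component 1 is one asset.
query_component = [0, 1, 1]
image_region = [0, 0, 1, 1]
mask = selective_mask(query_component, image_region)
local = masked_attention_weights([[0.0] * 4 for _ in query_component], mask)
```

Calling `masked_attention_weights` without a mask corresponds to the global path, where every component token can attend to every other token; passing the mask restricts each component to its own image region.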

ERP-FRONT Dataset

ERP-FRONT is a synthetic ERP-Image-to-3D indoor scene dataset built on 3D-FRONT. Each scene is defined at the room level, and the dataset covers a wide range of room sizes. Every scene is paired with ERP observations rendered from inside it, giving 26.5K training and 2.5K test ERP-image-mesh pairs.

Example scenes from ERP-FRONT.
ERP-FRONT dataset statistics.

Results

Indoor Scene Generation Gallery

Ground Truth vs. InSpace across the ERP-FRONT test set.

Qualitative Results on ERP-FRONT

Qualitative Comparison with Single-Image Methods

Quantitative Comparison

Quantitative results on the ERP-FRONT test set. (Left) Coarse structure generation (Stage 2) under different configurations. (Right) Full scene generation (Stage 3) at scene / asset level. Best and second-best are highlighted. VS: view-selective cross-attention. Inv: Layout-Guided Structure Inversion. CD ×10³.

Configuration       Stage 2: Coarse Structure               Level   Stage 3: Full Scene Generation
                    IoU↑    F1↑     Prec.↑  Rec.↑   CD↓             CD↓    F1@.01↑  F1@.02↑  PSNR↑ Top  PSNR↑ Int.  LPIPS↓ Top   LPIPS↓ Int.

Trained w/o VS-CrossAttn
w/o VS              44.36   55.61   55.69   55.76   15.92

Trained w/ VS-CrossAttn (Stage 2) / Trained w/ AS-CrossAttn (Stage 3)
w/ VS (t₀=1.0)      57.54   69.07   69.19   69.12   7.68    Scene   1.52   35.26    79.89    19.02      11.22       0.228        0.653
                                                            Asset   1.92   53.42    76.28    -          -           -            -
  + Inv (t₀=0.3)    58.06   69.66   70.07   69.49   7.41    Scene   1.48   36.63    81.48    19.17      11.22       0.219        0.651
                                                            Asset   1.80   55.42    78.65    -          -           -            -
  + Inv (t₀=0.5)    58.48   69.97   70.33   69.82   7.66    Scene   0.79   36.01    81.69    19.16      11.22       0.224        0.654
                                                            Asset   1.02   53.77    77.59    -          -           -            -
  + Inv (t₀=0.7)    57.09   68.78   68.97   68.79   7.76    Scene   0.63   36.69    82.10    19.17      11.22       0.220        0.650
                                                            Asset   1.72   56.71    79.52    -          -           -            -
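
For readers unfamiliar with the 3D metrics in the table, the sketch below shows one common way to compute Chamfer distance, F-score at a distance threshold, and voxel IoU. Definitions vary between papers, so the choices here are assumptions rather than the evaluation protocol used for the table: Chamfer distance is taken as the sum of the two mean squared nearest-neighbor distances, and nearest neighbors are found by brute force, which is only practical for small point sets.

```python
import math


def _nearest(p, cloud):
    """Distance from point p to its nearest neighbor in cloud."""
    return min(math.dist(p, q) for q in cloud)


def chamfer_distance(pred, gt):
    """Mean squared nearest-neighbor distance, summed over both directions."""
    to_gt = sum(_nearest(p, gt) ** 2 for p in pred) / len(pred)
    to_pred = sum(_nearest(q, pred) ** 2 for q in gt) / len(gt)
    return to_gt + to_pred


def f_score(pred, gt, tau):
    """Harmonic mean of precision and recall at distance threshold tau."""
    precision = sum(_nearest(p, gt) <= tau for p in pred) / len(pred)
    recall = sum(_nearest(q, pred) <= tau for q in gt) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def voxel_iou(pred_occupied, gt_occupied):
    """IoU between two collections of occupied voxel indices."""
    pred_occupied, gt_occupied = set(pred_occupied), set(gt_occupied)
    union = pred_occupied | gt_occupied
    return len(pred_occupied & gt_occupied) / len(union) if union else 1.0
```

Under these definitions, `f_score(pred, gt, 0.01)` and `f_score(pred, gt, 0.02)` would correspond to the F1@.01 and F1@.02 columns, and multiplying `chamfer_distance` by 1000 would give values on the CD ×10³ scale.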

Acknowledgements

This work was supported by the Korea Planning & Evaluation Institute of Industrial Technology (KEIT) and the Ministry of Trade, Industry & Resources (MOTIR) of the Republic of Korea (RS-2024-00417108), and by the Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2021-II211381, Development of Causal AI through Video Understanding and Reinforcement Learning, and Its Applications to Real Environments).

BibTeX


InSpace has been accepted to ECCV 2026. The official citation will be added here soon.