InSpace: Structure-Aware 3D Indoor Scene Generation from a Single 360° Image

Abstract

Recent advances in single image-to-3D generation have enabled high-quality asset synthesis, yet extending these capabilities to indoor scene generation remains challenging. Existing methods focus on asset-level generation while neglecting the structural layout, which is essential for downstream applications and serves as the spatial anchor for grounding assets. Moreover, a single image with a limited field of view lacks the spatial coverage needed to recover a coherent global layout. To address this, we take a 360° image in equirectangular projection (ERP) as input and propose InSpace, a structure-aware framework for 3D indoor scene generation. InSpace comprises three stages: (1) estimating partial scene geometry as spatial priors, (2) generating coarse scene structure with view-selective cross-attention, and (3) producing detailed layout and asset geometry with textures through global-local hybrid attention, using flow matching. We also propose ERP-FRONT, a paired ERP-Image-to-3D indoor scene dataset based on 3D-FRONT. Experiments show that InSpace generates complete 3D indoor scenes with a structural layout and separate textured assets from a single ERP image, achieving strong performance across 3D and 2D metrics.

Interactive Gallery

Explore reconstructions in 3D. Drag to rotate, scroll to zoom. Use the Explode slider to separate individual assets from the structural layout, and switch between Scene, Layout, and Assets. The two viewers (Ground Truth vs. InSpace) share the same camera.

Viewer panels: Input (ERP 360°) · Cubemap (6 faces, FOV 120°) · Ground Truth · InSpace (Ours)

Method

InSpace generates a 3D indoor scene from a single ERP image through three cascaded, flow-matching stages.
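As a rough illustration of what sampling from a flow-matching model involves, the sketch below integrates a velocity field with fixed-step Euler updates. It assumes the common straight-path convention in which t = 1 is pure noise and t = 0 is data; this page does not state InSpace's convention, solver, or step count. The names `euler_sample` and `make_oracle_velocity` are hypothetical, and the oracle is a toy stand-in for a trained network.

```python
def make_oracle_velocity(target):
    """Toy stand-in for a learned velocity network (hypothetical).

    For the straight path x_t = (1 - t) * x_0 + t * noise, the velocity
    dx/dt = noise - x_0 can be rewritten as (x_t - x_0) / t.
    """
    def velocity(x, t):
        return [(xi - ti) / t for xi, ti in zip(x, target)]
    return velocity


def euler_sample(velocity, x_noise, num_steps=10):
    """Integrate dx/dt = velocity(x, t) from t = 1 (noise) down to t = 0 (data)."""
    x = list(x_noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = 1.0 - step * dt  # current time, always greater than zero
        v = velocity(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]  # move from t to t - dt
    return x


target = [0.25, -0.1, 0.4]
sample = euler_sample(make_oracle_velocity(target), [1.3, -0.7, 0.2])
```

With the oracle velocity, the Euler steps land on `target` up to floating-point error. A learned velocity field would only approximate this, and more steps or a higher-order solver may be needed.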

Stage 1 Partial Scene Geometry (PSG)

We estimate a depth map from the ERP image and lift it to 3D via equirectangular back-projection, producing an initial point cloud normalized into [-0.5, 0.5]³. This Partial Scene Geometry, together with a calibrated camera center, provides geometric priors that spatially ground the later stages.
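The back-projection step can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes the depth value is the distance along each viewing ray, pixel centres are sampled, y points up and z points forward at the image centre, and the cloud is normalized by centring its bounding box and dividing by the longest side. None of these conventions are stated on this page. The function names are hypothetical.

```python
import math


def erp_depth_to_points(depth):
    """Back-project an ERP depth map (list of rows of ray distances) to 3D points."""
    height, width = len(depth), len(depth[0])
    points = []
    for row in range(height):
        # Latitude at the pixel centre: +pi/2 at the top edge, -pi/2 at the bottom.
        lat = math.pi / 2 - (row + 0.5) / height * math.pi
        for col in range(width):
            # Longitude at the pixel centre: -pi at the left edge, +pi at the right.
            lon = (col + 0.5) / width * 2 * math.pi - math.pi
            d = depth[row][col]
            points.append((
                d * math.cos(lat) * math.sin(lon),  # x: right
                d * math.sin(lat),                  # y: up
                d * math.cos(lat) * math.cos(lon),  # z: forward
            ))
    return points


def normalize_to_unit_cube(points, camera=(0.0, 0.0, 0.0)):
    """Map the cloud into [-0.5, 0.5]^3 and apply the same transform to the camera."""
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    centre = [(lo + hi) / 2 for lo, hi in zip(mins, maxs)]
    # Uniform scale by the longest side; fall back to 1.0 for a degenerate cloud.
    scale = max(hi - lo for lo, hi in zip(mins, maxs)) or 1.0

    def transform(p):
        return tuple((p[i] - centre[i]) / scale for i in range(3))

    return [transform(p) for p in points], transform(camera)


depth = [[2.0] * 8 for _ in range(4)]  # toy 4x8 ERP depth map, 2 units everywhere
points = erp_depth_to_points(depth)
normalized, camera_center = normalize_to_unit_cube(points)
```

Applying the same transform to the camera keeps it in the normalized frame, so later stages can relate 3D positions to viewing directions.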

Stage 2 Coarse Scene Geometry with View-Selective Cross-Attention

The ERP image is converted into six cubemap faces used to condition a flow-matching model that generates a coarse 3D voxel scene. Unlike conventional models where every voxel attends to a single global feature, our view-selective cross-attention conditions each voxel only on the cubemap faces visible from its 3D position. A lightweight 3D bounding-box detector then predicts oriented boxes for each asset.

View-selective cross-attention: each voxel attends only to the cubemap faces visible from its 3D position.
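One simple way to decide which faces a voxel may attend to is a frustum test on the ray from the camera center to the voxel. The sketch below takes that approach and ignores occlusion. The page does not specify how visibility is actually computed. The face names, axis layout, and function names are hypothetical. The default field of view follows the "FOV 120°" label shown in the viewer on this page, and it must stay below 180°.

```python
import math

# Hypothetical face layout: (forward, right, up) axes for each cubemap face.
FACES = {
    "front": ((0, 0, 1), (1, 0, 0), (0, 1, 0)),
    "back": ((0, 0, -1), (-1, 0, 0), (0, 1, 0)),
    "right": ((1, 0, 0), (0, 0, -1), (0, 1, 0)),
    "left": ((-1, 0, 0), (0, 0, 1), (0, 1, 0)),
    "up": ((0, 1, 0), (1, 0, 0), (0, 0, -1)),
    "down": ((0, -1, 0), (1, 0, 0), (0, 0, 1)),
}


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def visible_faces(voxel, camera, fov_deg=120.0):
    """Return the faces whose square frustum contains the camera-to-voxel ray."""
    ray = [v - c for v, c in zip(voxel, camera)]
    if not any(ray):
        return list(FACES)  # voxel sits on the camera: no direction, allow all faces
    limit = math.tan(math.radians(fov_deg) / 2.0)
    visible = []
    for name, (forward, right, up) in FACES.items():
        depth = _dot(ray, forward)
        if depth <= 0.0:
            continue  # voxel is behind this face's image plane
        if abs(_dot(ray, right)) <= limit * depth and abs(_dot(ray, up)) <= limit * depth:
            visible.append(name)
    return visible


def face_mask(voxels, camera, fov_deg=120.0):
    """Boolean matrix: mask[i][j] is True if voxel i may attend to face j."""
    names = list(FACES)
    return [[name in visible_faces(v, camera, fov_deg) for name in names] for v in voxels]
```

With a field of view wider than 90°, neighbouring faces overlap, so a voxel near a face boundary is assigned to more than one face. A mask like this could then be expanded per image token and passed to a cross-attention layer.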

Stage 3 Detailed Layout & Asset Generation with Global-Local Hybrid Attention

Conditioned on the predicted 3D boxes, we generate detailed geometry and texture for both the structural layout and each asset. Global self-attention lets all components interact for spatial coherence, while asset-selective cross-attention lets each component attend only to its corresponding image region for fine detail. The latents are decoded into a textured 3D mesh of the complete scene.
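The interplay between the two attention types can be sketched with plain scaled dot-product attention. This toy version omits learned projections, multiple heads, and normalization layers, and it simply adds the local result to the global one. The real block structure is not described on this page. `masked_attention`, `hybrid_block`, and the owner-id bookkeeping are hypothetical.

```python
import math


def masked_attention(queries, keys, values, mask=None):
    """Scaled dot-product attention; mask[i][j] = False hides key j from query i.

    Assumes every query can see at least one key.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            allowed = mask is None or mask[i][j]
            scores.append(scale * sum(a * b for a, b in zip(q, k)) if allowed else -math.inf)
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]  # masked keys get weight 0
        total = sum(weights)
        dim = len(values[0])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) / total
                        for d in range(dim)])
    return outputs


def hybrid_block(tokens, token_owner, image_tokens, image_owner):
    """Global self-attention followed by component-restricted cross-attention."""
    # Global: tokens of the layout and all assets attend to one another.
    mixed = masked_attention(tokens, tokens, tokens)
    # Local: a token only sees image tokens tagged with its own component id.
    mask = [[owner == region for region in image_owner] for owner in token_owner]
    local = masked_attention(mixed, image_tokens, image_tokens, mask)
    return [[m + l for m, l in zip(m_row, l_row)] for m_row, l_row in zip(mixed, local)]


tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
token_owner = [0, 1, 1]            # component id per 3D token
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_owner = [0, 1]               # component id per image-region token
output = hybrid_block(tokens, token_owner, image_tokens, image_owner)
```

In this toy setup each component owns exactly one image token, so the local step returns that token unchanged and the output is the globally mixed token plus its own region feature.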

ERP-FRONT Dataset

ERP-FRONT is a synthetic ERP-Image-to-3D indoor scene dataset built on 3D-FRONT. Each scene is defined at the room level, and the scenes cover a wide range of room sizes. Every scene is paired with ERP observations rendered from inside it, yielding 26.5K training and 2.5K test ERP-image-mesh pairs.

Results

Indoor Scene Generation Gallery

Ground Truth vs. InSpace across the ERP-FRONT test set. Click Load 3D on any card to view the interactive meshes; drag to rotate, scroll to zoom.

Qualitative Results on ERP-FRONT

Qualitative Comparison with Single-Image Methods

Qualitative comparison with single-image scene generation methods

Quantitative Comparison

Quantitative results on the ERP-FRONT test set. (Left) Coarse structure generation (Stage 2) under different configurations. (Right) Full scene generation (Stage 3) at the scene and asset levels. Best and second-best are highlighted. VS: view-selective cross-attention. AS: asset-selective cross-attention. Inv: Layout-Guided Structure Inversion. CD values are multiplied by 10³.

Configuration	Stage 2: Coarse Structure					Level	Stage 3: Full Scene Generation
	IoU↑	F1↑	Prec.↑	Rec.↑	CD↓		CD↓	F1@.01↑	F1@.02↑	PSNR↑ (Top)	PSNR↑ (Int.)	LPIPS↓ (Top)	LPIPS↓ (Int.)
Trained w/o VS-CrossAttn
w/o VS	44.36	55.61	55.69	55.76	15.92	–	–	–	–	–	–	–	–
Trained w/ VS-CrossAttn						Trained w/ AS-CrossAttn
w/ VS (t₀=1.0)	57.54	69.07	69.19	69.12	7.68	Scene	1.52	35.26	79.89	19.02	11.22	0.228	0.653
						Asset	1.92	53.42	76.28	–	–	–	–
+ Inv (t₀=0.3)	58.06	69.66	70.07	69.49	7.41	Scene	1.48	36.63	81.48	19.17	11.22	0.219	0.651
						Asset	1.80	55.42	78.65	–	–	–	–
+ Inv (t₀=0.5)	58.48	69.97	70.33	69.82	7.66	Scene	0.79	36.01	81.69	19.16	11.22	0.224	0.654
						Asset	1.02	53.77	77.59	–	–	–	–
+ Inv (t₀=0.7)	57.09	68.78	68.97	68.79	7.76	Scene	0.63	36.69	82.10	19.17	11.22	0.220	0.650
						Asset	1.72	56.71	79.52	–	–	–	–
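For readers unfamiliar with the metrics in the table, the sketch below shows one common way to compute them. The exact definitions used for the table are not given on this page, so everything here is an assumption: occupancy metrics are computed on sets of occupied voxel coordinates, Chamfer distance is the sum of the two mean nearest-neighbour distances (unsquared), and F1@τ counts points within distance τ. Values are returned as fractions rather than percentages. The function names are hypothetical.

```python
import math


def occupancy_metrics(pred_voxels, gt_voxels):
    """IoU, precision, recall, and F1 between two sets of occupied voxel coordinates."""
    pred, gt = set(pred_voxels), set(gt_voxels)
    hits = len(pred & gt)
    union = len(pred | gt)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gt) if gt else 0.0
    total = precision + recall
    return {
        "iou": hits / union if union else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / total if total else 0.0,
    }


def _nearest_distances(source, target):
    # Brute-force nearest neighbour; fine for a sketch, slow for large clouds.
    return [min(math.dist(p, q) for q in target) for p in source]


def chamfer_and_fscore(pred_points, gt_points, tau=0.01):
    """Symmetric Chamfer distance and F-score at threshold tau for two point sets."""
    pred_to_gt = _nearest_distances(pred_points, gt_points)
    gt_to_pred = _nearest_distances(gt_points, pred_points)
    chamfer = sum(pred_to_gt) / len(pred_to_gt) + sum(gt_to_pred) / len(gt_to_pred)
    precision = sum(d <= tau for d in pred_to_gt) / len(pred_to_gt)
    recall = sum(d <= tau for d in gt_to_pred) / len(gt_to_pred)
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return chamfer, fscore


metrics = occupancy_metrics({(0, 0, 0), (1, 0, 0)}, {(0, 0, 0), (0, 1, 0)})
chamfer, fscore = chamfer_and_fscore([(0.0, 0.0, 0.0), (0.02, 0.0, 0.0)],
                                     [(0.0, 0.0, 0.0)], tau=0.01)
```

In practice the point sets would be sampled from the predicted and ground-truth meshes, and a spatial index such as a k-d tree would replace the brute-force search.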

Acknowledgements

This work was supported by the Korea Planning & Evaluation Institute of Industrial Technology (KEIT) and the Ministry of Trade, Industry & Resources (MOTIR) of the Republic of Korea (RS-2024-00417108), and by the Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2021-II211381, Development of Causal AI through Video Understanding and Reinforcement Learning, and Its Applications to Real Environments).

BibTeX


InSpace has been accepted to ECCV 2026. The official citation will be added here soon.