Recent advances in single image-to-3D generation have enabled high-quality asset synthesis, yet extending these capabilities to indoor scene generation remains challenging. Existing methods focus on asset-level generation while neglecting the structural layout, which is essential for downstream applications and serves as the spatial anchor for grounding assets. However, a single image with a limited field of view lacks the spatial coverage to recover a coherent global layout. To this end, we use a 360° image represented in equirectangular projection (ERP) and propose InSpace, a structure-aware framework for 3D indoor scene generation. InSpace comprises three stages: (1) estimating partial scene geometry as spatial priors, (2) generating coarse scene structure with view-selective cross-attention, and (3) producing detailed layout and asset geometry with textures through a global-local hybrid attention, using flow matching. We also propose ERP-FRONT, a paired ERP-Image-to-3D indoor scene dataset based on 3D-FRONT. Experiments show that InSpace generates complete 3D indoor scenes with structural layout, along with separate textured assets from a single ERP image, achieving strong performance across 3D and 2D metrics.
InSpace generates a 3D indoor scene from a single ERP image through three cascaded, flow-matching stages.
We estimate a depth map from the ERP image and lift it to 3D via equirectangular back-projection, producing an initial point cloud normalized into [-0.5, 0.5]³. This Partial Scene Geometry, together with a calibrated camera center, provides geometric priors that spatially ground the later stages.
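
The back-projection and normalization described above can be sketched as follows. This is a minimal illustration, not the released implementation: it assumes the depth map stores Euclidean distance along each ray, a y-up / z-forward axis convention, and a uniform rescaling by the longest bounding-box side. The function names `erp_depth_to_points` and `normalize_to_unit_cube` are hypothetical.

```python
import numpy as np

def erp_depth_to_points(depth):
    """Back-project an (H, W) ERP depth map into an (H*W, 3) point cloud.

    Assumption: depth[i, j] is the distance along the ray through pixel (i, j),
    with the camera at the origin.
    """
    h, w = depth.shape
    # Pixel centers mapped to longitude in (-pi, pi) and latitude in (-pi/2, pi/2).
    lon = ((np.arange(w) + 0.5) / w - 0.5) * 2.0 * np.pi
    lat = (0.5 - (np.arange(h) + 0.5) / h) * np.pi
    lon, lat = np.meshgrid(lon, lat)  # both (H, W)
    # Unit ray directions (assumed convention: y up, z forward).
    dirs = np.stack(
        [np.cos(lat) * np.sin(lon), np.sin(lat), np.cos(lat) * np.cos(lon)],
        axis=-1,
    )
    return (dirs * depth[..., None]).reshape(-1, 3)

def normalize_to_unit_cube(points):
    """Center the cloud and scale it uniformly so it fits in [-0.5, 0.5]^3.

    Returns the normalized points and the camera center (originally at the
    origin) expressed in the normalized frame.
    """
    lo, hi = points.min(axis=0), points.max(axis=0)
    center = (lo + hi) / 2.0
    scale = (hi - lo).max()
    if scale == 0:
        raise ValueError("degenerate point cloud: all points coincide")
    return (points - center) / scale, -center / scale
```

For example, `normalize_to_unit_cube(erp_depth_to_points(depth))` returns both the normalized cloud and the camera center in the same frame, which is the pair of priors the later stages are described as using.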

The ERP image is converted into six cubemap faces used to condition a flow-matching model that generates a coarse 3D voxel scene. Unlike conventional models where every voxel attends to a single global feature, our view-selective cross-attention conditions each voxel only on the cubemap faces visible from its 3D position. A lightweight 3D bounding-box detector then predicts oriented boxes for each asset.
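
One way to picture view-selective cross-attention is as a per-voxel mask over cubemap-face tokens. The sketch below is illustrative only. It assumes that a voxel "sees" exactly the one face whose frustum (from the camera center) contains it; the actual visibility rule may admit several faces. It also omits learned projections and multi-head structure. All names (`visible_face_mask`, `view_selective_cross_attention`, `tokens_per_face`) are hypothetical.

```python
import numpy as np

# Assumed face order: +x, -x, +y, -y, +z, -z.

def visible_face_mask(voxel_centers, camera_center):
    """Return an (N, 6) boolean mask selecting one cubemap face per voxel.

    The face is picked from the dominant axis of the direction from the
    camera center to the voxel center.
    """
    d = voxel_centers - camera_center               # (N, 3)
    rows = np.arange(len(d))
    axis = np.abs(d).argmax(axis=1)                 # dominant axis per voxel
    negative = d[rows, axis] < 0
    face = 2 * axis + negative.astype(int)          # index into the face order
    mask = np.zeros((len(d), 6), dtype=bool)
    mask[rows, face] = True
    return mask

def view_selective_cross_attention(q, k, v, face_mask, tokens_per_face):
    """Masked cross-attention: q is (N, D); k and v hold 6 * tokens_per_face
    tokens, grouped face by face."""
    token_mask = np.repeat(face_mask, tokens_per_face, axis=1)  # (N, 6*T)
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = np.where(token_mask, logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)  # each row has a visible face
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v
```

Under this rule, tokens from faces a voxel cannot see receive exactly zero attention weight, so they contribute nothing to that voxel's conditioning.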

Conditioned on the predicted 3D boxes, we generate detailed geometry and texture for both the structural layout and each asset. Global self-attention lets all components interact for spatial coherence, while asset-selective cross-attention lets each component attend only to its corresponding image region for fine detail. The latents are decoded into a textured 3D mesh of the complete scene.
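
The global-local hybrid attention can be sketched as an unmasked self-attention over all component latents followed by a masked cross-attention to image tokens. This is a simplified illustration under stated assumptions: each latent token and each image token carries an integer component label (how image regions are assigned to components is not specified here, so the labels are taken as given), latents and image tokens share a feature dimension, and learned projections are omitted. The names `attention`, `asset_selective_mask`, and `hybrid_block` are hypothetical.

```python
import numpy as np

def attention(q, k, v, mask=None):
    """Scaled dot-product attention; rows whose mask is all False output zeros."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        logits = np.where(mask, logits, -1e9)
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    if mask is not None:
        weights = weights * mask                    # hard-zero masked entries
    denom = weights.sum(axis=-1, keepdims=True)
    return (weights / np.maximum(denom, 1e-12)) @ v

def asset_selective_mask(latent_component, image_region):
    """(N_latent, N_image) mask: a latent attends only to image tokens that
    carry the same component label."""
    return latent_component[:, None] == image_region[None, :]

def hybrid_block(latents, image_tokens, latent_component, image_region):
    # Global step: every component token attends to every other token.
    x = latents + attention(latents, latents, latents)
    # Local step: each component attends only to its own image region.
    mask = asset_selective_mask(latent_component, image_region)
    return x + attention(x, image_tokens, image_tokens, mask)
```

The split mirrors the stated motivation: the unmasked step lets components coordinate their placement, while the masked step keeps fine detail from leaking between assets.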

ERP-FRONT is a synthetic ERP-image-to-3D indoor scene dataset built on 3D-FRONT. Each scene is defined at the room level, covers a wide range of room sizes, and is paired with ERP observations rendered from inside the scene, giving 26.5K training and 2.5K test ERP-image-mesh pairs.
Qualitative comparison: Ground Truth vs. InSpace across the ERP-FRONT test set.


Quantitative results on the ERP-FRONT test set. (Left) Coarse structure generation (Stage 2) under different configurations. (Right) Full scene generation (Stage 3) at scene / asset level. Best and second-best are highlighted. VS: view-selective cross-attention. Inv: Layout-Guided Structure Inversion. CD ×10³.
| Configuration | Stage 2 IoU↑ | Stage 2 F1↑ | Stage 2 Prec.↑ | Stage 2 Rec.↑ | Stage 2 CD↓ | Level | Stage 3 CD↓ | Stage 3 F1@.01↑ | Stage 3 F1@.02↑ | Stage 3 PSNR↑ (Top) | Stage 3 PSNR↑ (Int.) | Stage 3 LPIPS↓ (Top) | Stage 3 LPIPS↓ (Int.) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Trained w/o VS-CrossAttn* | | | | | | | | | | | | | |
| w/o VS | 44.36 | 55.61 | 55.69 | 55.76 | 15.92 | – | – | – | – | – | – | – | – |
| *Trained w/ VS-CrossAttn* | | | | | | *Trained w/ AS-CrossAttn* | | | | | | | |
| w/ VS (t₀=1.0) | 57.54 | 69.07 | 69.19 | 69.12 | 7.68 | Scene | 1.52 | 35.26 | 79.89 | 19.02 | 11.22 | 0.228 | 0.653 |
| | | | | | | Asset | 1.92 | 53.42 | 76.28 | – | – | – | – |
| + Inv (t₀=0.3) | 58.06 | 69.66 | 70.07 | 69.49 | 7.41 | Scene | 1.48 | 36.63 | 81.48 | 19.17 | 11.22 | 0.219 | 0.651 |
| | | | | | | Asset | 1.80 | 55.42 | 78.65 | – | – | – | – |
| + Inv (t₀=0.5) | 58.48 | 69.97 | 70.33 | 69.82 | 7.66 | Scene | 0.79 | 36.01 | 81.69 | 19.16 | 11.22 | 0.224 | 0.654 |
| | | | | | | Asset | 1.02 | 53.77 | 77.59 | – | – | – | – |
| + Inv (t₀=0.7) | 57.09 | 68.78 | 68.97 | 68.79 | 7.76 | Scene | 0.63 | 36.69 | 82.10 | 19.17 | 11.22 | 0.220 | 0.650 |
| | | | | | | Asset | 1.72 | 56.71 | 79.52 | – | – | – | – |
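
For readers unfamiliar with the Stage 2 occupancy metrics, the sketch below computes IoU, precision, recall, and F1 on binary voxel grids using their standard definitions. It is an assumption that the reported numbers follow these definitions (the values above appear to be percentages, while the sketch returns fractions). The function name `voxel_metrics` is hypothetical, and Chamfer distance is left out because its convention (squared or not, one-sided or symmetric) is not stated here.

```python
import numpy as np

def voxel_metrics(pred, gt):
    """IoU, precision, recall, and F1 between two binary occupancy grids."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = np.logical_and(pred, gt).sum()       # occupied in both
    fp = np.logical_and(pred, ~gt).sum()      # predicted but not in ground truth
    fn = np.logical_and(~pred, gt).sum()      # missed ground-truth voxels
    iou = tp / max(tp + fp + fn, 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return {"iou": iou, "precision": precision, "recall": recall, "f1": f1}
```

The `max(..., 1)` guards only avoid division by zero when a grid is empty; they do not change the result for non-empty grids.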
This work was supported by the Korea Planning & Evaluation Institute of Industrial Technology (KEIT) and the Ministry of Trade, Industry & Resources (MOTIR) of the Republic of Korea (RS-2024-00417108), and by the Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2021-II211381, Development of Causal AI through Video Understanding and Reinforcement Learning, and Its Applications to Real Environments).
InSpace has been accepted to ECCV 2026. The official citation will be added here soon.