
Spatial Reasoning + LLMs: What Works, What Fails

LLMs aren't great at spatial reasoning. They were trained on text; coordinates and dimensions live in a different cognitive space. Yet AI 3D scene composition fundamentally relies on spatial reasoning. How do we make it work? An honest report from a year of production traffic.

TL;DR

What works out of the box

Relative directions: "put the chair to the left of the table" works reliably. The LLM converts "left" to the correct -X offset relative to the table (see the sketch after this list).

Human-scale dimensions: "a 3-meter wide table", "a chair at seat height", "a window at eye level" all map correctly when the system prompt establishes scale references.

Conventional layouts: "chairs around a table", "lamps on side tables", "shelves on a wall" all work because these spatial conventions are abundant in the model's training data.
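To make the first case concrete, here is a minimal sketch of the mapping the model performs implicitly: a relative direction resolves to a signed axis offset, scaled by the two bounding boxes plus a gap. All names here (SceneObject, resolveRelative, the Y-up, -X-is-left frame) are hypothetical illustrations, not Yugma's actual API.

```typescript
type Vec3 = { x: number; y: number; z: number };

interface SceneObject {
  id: string;
  position: Vec3;
  size: Vec3; // axis-aligned bounding box extents in meters
}

type Direction = "left" | "right" | "front" | "behind" | "above";

// Unit offsets per direction in a Y-up frame where -X means "left".
const DIRECTION_AXES: Record<Direction, Vec3> = {
  left:   { x: -1, y: 0, z: 0 },
  right:  { x: 1,  y: 0, z: 0 },
  front:  { x: 0,  y: 0, z: 1 },
  behind: { x: 0,  y: 0, z: -1 },
  above:  { x: 0,  y: 1, z: 0 },
};

// "Put the chair to the left of the table": offset by half of each
// object's extent along the direction axis, plus a small gap.
function resolveRelative(target: SceneObject, movingSize: Vec3, dir: Direction, gap = 0.1): Vec3 {
  const axis = DIRECTION_AXES[dir];
  const spread = (a: number, b: number) => (a + b) / 2 + gap;
  return {
    x: target.position.x + axis.x * spread(target.size.x, movingSize.x),
    y: target.position.y + axis.y * spread(target.size.y, movingSize.y),
    z: target.position.z + axis.z * spread(target.size.z, movingSize.z),
  };
}
```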

What fails

Precise distances: "place the chair exactly 1.234 meters from the table". LLMs default to round numbers; they write 1.2m or 1.5m. Precision beyond one decimal place needs the panel UI.

Rotated coordinate frames: "I want this scene to be in screen space rotated 30 degrees about the Y axis". The LLM doesn't reliably reason about rotated frames; the result is usually wrong by 30°.

Complex stacking: "a stack of 8 books, alternating directions, with a coffee cup on top". Each individual position looks plausible, but small offset errors compound up the stack; by book 6, the stack is leaning (see the sketch after this list).

Beyond-convention patterns: "spiral of 12 chairs descending in a fibonacci spiral". The LLM writes plausible-looking but wrong coordinates.
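The stacking failure is easy to reproduce on paper: when each placement carries a small independent lateral error, the top item inherits the sum of every error below it, so drift grows with stack height. A toy sketch of this (the function name and the ~2cm error figure are our assumptions, not production measurements):

```typescript
// Simulate cumulative lateral drift in a stack where each level is
// placed relative to the one below it, with a small random error.
function stackDrift(levels: number, perStepErrorM: number): number[] {
  const offsets: number[] = [];
  let cumulative = 0;
  for (let i = 0; i < levels; i++) {
    // Each book inherits the offset of the one below, plus its own error.
    cumulative += (Math.random() * 2 - 1) * perStepErrorM;
    offsets.push(cumulative);
  }
  return offsets;
}

// With ~2 cm of error per placement, books 6-8 can easily sit 5+ cm
// off-center, which reads visually as a leaning stack.
console.log(stackDrift(8, 0.02).map((o) => o.toFixed(3)));
```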

The pre-processor pattern

For obvious spatial patterns (precise distances, stacks, and parametric arrangements such as grids and spirals), Yugma pre-processes the prompt deterministically before sending it to the LLM.

The injected coordinates land in the LLM context as [SPATIAL_INTENT]. The LLM still owns object types, materials, and the rest; the pre-processor owns positions.
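A minimal sketch of the idea, assuming a regex-based detector and a fixed per-item height; everything here except the [SPATIAL_INTENT] tag is a hypothetical stand-in for the real detector:

```typescript
type Vec3 = { x: number; y: number; z: number };

// Detect a parametric arrangement in the prompt, compute exact coordinates
// deterministically, and prepend them as a [SPATIAL_INTENT] block.
function preprocess(prompt: string): string {
  const stack = prompt.match(/stack of (\d+) (\w+)/i);
  if (!stack) return prompt; // no known pattern: pass through untouched

  const count = parseInt(stack[1], 10);
  const itemHeight = 0.03; // assumed per-item height in meters (e.g. a book)
  const positions: Vec3[] = Array.from({ length: count }, (_, i) => ({
    x: 0,
    y: i * itemHeight, // exact and non-compounding: each level derived from i
    z: 0,
  }));

  const intent = positions
    .map((p, i) => `${stack[2]} ${i + 1}: (${p.x}, ${p.y.toFixed(2)}, ${p.z})`)
    .join("\n");

  // The LLM still chooses object types and materials; positions are fixed.
  return `[SPATIAL_INTENT]\n${intent}\n[/SPATIAL_INTENT]\n\n${prompt}`;
}
```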

Style memory

A second technique: extract a style fingerprint (palette, materiality, scale, density) from the existing scene and inject it as [STYLE] in the system prompt. The LLM's next placement matches the existing scene's vocabulary.
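A sketch of what such a fingerprint could look like. The four dimensions (palette, materiality, scale, density) and the [STYLE] tag come from above; every type and function name is an assumption of ours:

```typescript
interface PlacedObject {
  color: string;    // hex, e.g. "#8b5a2b"
  material: string; // e.g. "oak", "brushed steel"
  size: number;     // longest dimension in meters
}

interface StyleFingerprint {
  palette: string[];
  materials: string[];
  medianScaleM: number;
  densityPerM2: number;
}

// Summarize the existing scene along the four style dimensions.
function fingerprint(objects: PlacedObject[], floorAreaM2: number): StyleFingerprint {
  const top = (xs: string[], n: number) => [...new Set(xs)].slice(0, n);
  const sizes = objects.map((o) => o.size).sort((a, b) => a - b);
  return {
    palette: top(objects.map((o) => o.color), 4),
    materials: top(objects.map((o) => o.material), 4),
    medianScaleM: sizes[Math.floor(sizes.length / 2)] ?? 0,
    densityPerM2: objects.length / floorAreaM2,
  };
}

// Render the fingerprint as the [STYLE] block injected into the system prompt.
function styleBlock(fp: StyleFingerprint): string {
  return [
    "[STYLE]",
    `palette: ${fp.palette.join(", ")}`,
    `materials: ${fp.materials.join(", ")}`,
    `typical object scale: ${fp.medianScaleM.toFixed(2)} m`,
    `density: ${fp.densityPerM2.toFixed(2)} objects/m²`,
    "[/STYLE]",
  ].join("\n");
}
```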

What still doesn't work

Articulated joints (hinges, sliders), physics-driven layouts (where things would fall), surface-following patterns (drape, fold) — all of these need either a physics simulator or a domain model the LLM doesn't have.

For these we say "prompt the AI for the rough placement, then nudge in the panel UI". Honest expectation-setting matters; users notice when an AI tool overpromises.

The takeaway

LLM spatial reasoning is good enough for designed environments and conventional layouts. It needs help (pre-processors, style fingerprints, scale references) for anything else. Yugma's architecture is layered to give it that help.

Read the agentic-loop deep dive →