
Spatial Reasoning + LLMs: What Works, What Fails

LLMs aren't great at spatial reasoning. They were trained on text; coordinates and dimensions live in a different cognitive space. Yet AI 3D scene composition fundamentally relies on spatial reasoning. How do we make it work? An honest report from a year of production traffic.

TL;DR

What works out of the box

Relative directions: "put the chair to the left of the table" works reliably. The LLM converts "left" to the correct -X offset relative to the table (see the sketch after this list).

Human-scale dimensions: "a 3-meter wide table", "a chair at seat height", "a window at eye level" all map correctly when the system prompt establishes scale references.

Conventional layouts: "chairs around a table", "lamps on side tables", "shelves on a wall" all work because these spatial conventions are abundant in the model's training data.
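To make the first case concrete, here is a minimal sketch of the mapping the model performs implicitly: a relative direction resolves to a signed axis offset, scaled by the two bounding boxes plus a gap. All names here (SceneObject, resolveRelative, the Y-up, -X-is-left frame) are hypothetical illustrations, not Yugma's actual API.

```typescript
type Vec3 = { x: number; y: number; z: number };

interface SceneObject {
  id: string;
  position: Vec3;
  size: Vec3; // axis-aligned bounding box extents in meters
}

type Direction = "left" | "right" | "front" | "behind" | "above";

// Unit offsets per direction in a Y-up frame where -X means "left".
const DIRECTION_AXES: Record<Direction, Vec3> = {
  left:   { x: -1, y: 0, z: 0 },
  right:  { x: 1,  y: 0, z: 0 },
  front:  { x: 0,  y: 0, z: 1 },
  behind: { x: 0,  y: 0, z: -1 },
  above:  { x: 0,  y: 1, z: 0 },
};

// "Put the chair to the left of the table": offset by half of each
// object's extent along the direction axis, plus a small gap.
function resolveRelative(target: SceneObject, movingSize: Vec3, dir: Direction, gap = 0.1): Vec3 {
  const axis = DIRECTION_AXES[dir];
  const spread = (a: number, b: number) => (a + b) / 2 + gap;
  return {
    x: target.position.x + axis.x * spread(target.size.x, movingSize.x),
    y: target.position.y + axis.y * spread(target.size.y, movingSize.y),
    z: target.position.z + axis.z * spread(target.size.z, movingSize.z),
  };
}
```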

What fails

Precise distances: "place the chair exactly 1.234 meters from the table". LLMs default to round numbers; they write 1.2m or 1.5m. Precision beyond one decimal place needs the panel UI.

Rotated coordinate frames: "I want this scene to be in screen space rotated 30 degrees about the Y axis". The LLM doesn't reliably reason about rotated frames; the result is usually wrong by 30°.

Complex stacking: "a stack of 8 books, alternating directions, with a coffee cup on top". Each individual position looks plausible, but small offset errors compound up the stack; by book 6, the stack is leaning (see the sketch after this list).

Beyond-convention patterns: "spiral of 12 chairs descending in a fibonacci spiral". The LLM writes plausible-looking but wrong coordinates.
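The stacking failure is easy to reproduce on paper: when each placement carries a small independent lateral error, the top item inherits the sum of every error below it, so drift grows with stack height. A toy sketch of this (the function name and the ~2cm error figure are our assumptions, not production measurements):

```typescript
// Simulate cumulative lateral drift in a stack where each level is
// placed relative to the one below it, with a small random error.
function stackDrift(levels: number, perStepErrorM: number): number[] {
  const offsets: number[] = [];
  let cumulative = 0;
  for (let i = 0; i < levels; i++) {
    // Each book inherits the offset of the one below, plus its own error.
    cumulative += (Math.random() * 2 - 1) * perStepErrorM;
    offsets.push(cumulative);
  }
  return offsets;
}

// With ~2 cm of error per placement, books 6-8 can easily sit 5+ cm
// off-center, which reads visually as a leaning stack.
console.log(stackDrift(8, 0.02).map((o) => o.toFixed(3)));
```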

The pre-processor pattern

For obvious spatial patterns (precise distances, stacks, and parametric arrangements such as grids and spirals), Yugma pre-processes the prompt deterministically before sending it to the LLM.

The injected coordinates land in the LLM context as [SPATIAL_INTENT]. The LLM still owns object types, materials, and the rest; the pre-processor owns positions.
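A minimal sketch of the idea, assuming a regex-based detector and a fixed per-item height; everything here except the [SPATIAL_INTENT] tag is a hypothetical stand-in for the real detector:

```typescript
type Vec3 = { x: number; y: number; z: number };

// Detect a parametric arrangement in the prompt, compute exact coordinates
// deterministically, and prepend them as a [SPATIAL_INTENT] block.
function preprocess(prompt: string): string {
  const stack = prompt.match(/stack of (\d+) (\w+)/i);
  if (!stack) return prompt; // no known pattern: pass through untouched

  const count = parseInt(stack[1], 10);
  const itemHeight = 0.03; // assumed per-item height in meters (e.g. a book)
  const positions: Vec3[] = Array.from({ length: count }, (_, i) => ({
    x: 0,
    y: i * itemHeight, // exact and non-compounding: each level derived from i
    z: 0,
  }));

  const intent = positions
    .map((p, i) => `${stack[2]} ${i + 1}: (${p.x}, ${p.y.toFixed(2)}, ${p.z})`)
    .join("\n");

  // The LLM still chooses object types and materials; positions are fixed.
  return `[SPATIAL_INTENT]\n${intent}\n[/SPATIAL_INTENT]\n\n${prompt}`;
}
```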

Style memory

A second technique: extract a style fingerprint (palette, materiality, scale, density) from the existing scene and inject it as [STYLE] in the system prompt. The LLM's next placement matches the existing scene's vocabulary.
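A sketch of what such a fingerprint could look like. The four dimensions (palette, materiality, scale, density) and the [STYLE] tag come from above; every type and function name is an assumption of ours:

```typescript
interface PlacedObject {
  color: string;    // hex, e.g. "#8b5a2b"
  material: string; // e.g. "oak", "brushed steel"
  size: number;     // longest dimension in meters
}

interface StyleFingerprint {
  palette: string[];
  materials: string[];
  medianScaleM: number;
  densityPerM2: number;
}

// Summarize the existing scene along the four style dimensions.
function fingerprint(objects: PlacedObject[], floorAreaM2: number): StyleFingerprint {
  const top = (xs: string[], n: number) => [...new Set(xs)].slice(0, n);
  const sizes = objects.map((o) => o.size).sort((a, b) => a - b);
  return {
    palette: top(objects.map((o) => o.color), 4),
    materials: top(objects.map((o) => o.material), 4),
    medianScaleM: sizes[Math.floor(sizes.length / 2)] ?? 0,
    densityPerM2: objects.length / floorAreaM2,
  };
}

// Render the fingerprint as the [STYLE] block injected into the system prompt.
function styleBlock(fp: StyleFingerprint): string {
  return [
    "[STYLE]",
    `palette: ${fp.palette.join(", ")}`,
    `materials: ${fp.materials.join(", ")}`,
    `typical object scale: ${fp.medianScaleM.toFixed(2)} m`,
    `density: ${fp.densityPerM2.toFixed(2)} objects/m²`,
    "[/STYLE]",
  ].join("\n");
}
```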

What still doesn't work

Articulated joints (hinges, sliders), physics-driven layouts (where things would fall), surface-following patterns (drape, fold) — all of these need either a physics simulator or a domain model the LLM doesn't have.

For these we say "prompt the AI for the rough placement, then nudge in the panel UI". Honest expectation-setting matters; users notice when an AI tool overpromises.

The takeaway

LLM spatial reasoning is good enough for designed environments and conventional layouts. It needs help (pre-processors, style fingerprints, scale references) for anything else. Yugma's architecture is layered to give it that help.

Read the agentic-loop deep dive →