How the Agentic AI Loop Builds Coherent 3D Scenes
Getting a coherent 3D scene from a single sentence is harder than it looks. The model has to know about scale, spatial relationships, materials, lighting, and the existing scene state — and emit those decisions in a form the renderer can apply atomically. Here's how Yugma's agentic loop runs end to end.
# TL;DR
- Stage 1: Reference resolution. "the chair", "that lamp" → real scene-graph IDs.
- Stage 2: Scene serialization. Current scene compressed to ~45 tokens/object using a custom YSL format.
- Stage 3: Tool-call composition. LLM emits N parallel typed tool calls in one response.
- Each tool call validated against a JSON schema; malformed ones rejected before they touch the scene.
# Stage 1 — reference resolution
The user says "make the red chair bigger". Before the LLM sees that prompt, a deterministic reference resolver scans the scene for an object whose tags or material color matches "red" and whose name/type contains "chair", and rewrites the prompt to "make object id obj_8h2k (chair_main, color #b32d2d) bigger".
This is a simple step that prevents the LLM from confidently picking the wrong object — which it will sometimes do if the scene has 50 objects and the prompt is ambiguous.
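Sketched in TypeScript, a resolver like this is just deterministic string and scene-graph matching. The `SceneObject` shape and the color heuristic below are illustrative assumptions, not Yugma's actual code:

```typescript
// Illustrative reference resolver. SceneObject and the matching
// heuristics are assumptions, not Yugma's real implementation.
interface SceneObject {
  id: string;     // e.g. "obj_8h2k"
  name: string;   // e.g. "chair_main"
  color: string;  // hex material color, e.g. "#b32d2d"
  tags: string[];
}

// Crude hue test: is the red channel dominant? A real resolver would
// compare against color names in a perceptual color space.
function looksRed(hex: string): boolean {
  const [r, g, b] = [1, 3, 5].map((i) => parseInt(hex.slice(i, i + 2), 16));
  return r > 120 && r > g * 1.5 && r > b * 1.5;
}

function resolveReference(prompt: string, scene: SceneObject[]): string {
  const wantsRed = /\bred\b/i.test(prompt);
  const wantsChair = /\bchair\b/i.test(prompt);
  const match = scene.find(
    (o) =>
      (!wantsChair || o.name.includes("chair") || o.tags.includes("chair")) &&
      (!wantsRed || looksRed(o.color))
  );
  if (!match) return prompt; // no confident match: pass the raw prompt through
  return prompt.replace(
    /the red chair/i,
    `object id ${match.id} (${match.name}, color ${match.color})`
  );
}
```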
# Stage 2 — scene serialization (YSL)
USD-style serialization is verbose: ~400 tokens per object including transforms and materials. Yugma uses a custom YSL (Yugma Scene Language) format that compresses to ~45 tokens per object:
```
obj_8h2k chair_main box pos[1.2,0.45,-0.8] s[0.5,0.9,0.5] mat#b32d2d r0.7 m0 tags[furniture,chair]
```
For a 100-object scene, that's 4,500 tokens of context vs 40,000 — leaving headroom for the system prompt and tool schemas without context overflow.
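Serializing to YSL is mostly string assembly. Here is a sketch; the object shape and field names are inferred from the example line above, not taken from Yugma's source:

```typescript
// Illustrative YSL serializer. YslObject is reverse-engineered from the
// example line above; the real format may carry more fields.
interface YslObject {
  id: string;
  name: string;
  primitive: string;                 // "box", "sphere", ...
  pos: [number, number, number];     // meters
  scale: [number, number, number];
  color: string;                     // "#b32d2d"
  roughness: number;                 // 0..1
  metalness: number;                 // 0..1
  tags: string[];
}

const vec = (v: number[]) => `[${v.join(",")}]`;

function toYsl(o: YslObject): string {
  return [
    o.id,
    o.name,
    o.primitive,
    `pos${vec(o.pos)}`,
    `s${vec(o.scale)}`,
    `mat${o.color}`,
    `r${o.roughness}`,
    `m${o.metalness}`,
    `tags[${o.tags.join(",")}]`,
  ].join(" ");
}

// One line per object: a 100-object scene serializes to ~100 short lines.
function serializeScene(objects: YslObject[]): string {
  return objects.map(toYsl).join("\n");
}
```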
# Stage 3 — tool-call composition
The LLM sees: system prompt (coordinate system, scale references, design principles, material recipes) + scene context (YSL) + 19 typed tool schemas. It emits parallel tool calls in a single response:
```json
[
  {"name": "update_object", "input": {"id": "obj_8h2k", "patch": {"transform": {"scale": [0.7, 1.05, 0.7]}}}},
  {"name": "focus_camera", "input": {"objectId": "obj_8h2k"}}
]
```
The client receives the whole response, dispatches each call against the scene store, and commits a single undo entry. Rendering is real-time.
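A sketch of that client-side dispatch, assuming a hypothetical `SceneStore` API (only the two tools from the example above are shown):

```typescript
// Illustrative dispatch loop. The SceneStore interface is an assumption;
// the key idea is: apply every call, then commit one undo entry.
interface ToolCall {
  name: string;
  input: Record<string, unknown>;
}

interface SceneStore {
  applyPatch(id: string, patch: object): void;
  focusCamera(objectId: string): void;
  commitUndoEntry(label: string): void;
}

function dispatch(calls: ToolCall[], store: SceneStore): void {
  for (const call of calls) {
    switch (call.name) {
      case "update_object":
        store.applyPatch(call.input.id as string, call.input.patch as object);
        break;
      case "focus_camera":
        store.focusCamera(call.input.objectId as string);
        break;
      default:
        // Unknown tool names are skipped; schema validation
        // (next section) should have rejected them already.
        break;
    }
  }
  // One undo entry for the whole response, so user-facing undo
  // reverts the entire AI action atomically.
  store.commitUndoEntry("ai-edit");
}
```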
# Why typed schemas
Letting the LLM emit raw Three.js code looks elegant in demos and breaks in production. The model emits stale APIs, hallucinated methods, or correct-looking-but-wrong material props.
Typed schemas mean: the LLM sees only valid call shapes; every call is validated; malformed calls are rejected before they affect state. Three failure modes (stale API, wrong method, wrong parameter type) collapse to one validation error.
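As a sketch of that validation gate, here is what one tool schema might look like with zod; the `update_object` shape is a guess based on the example call earlier, not Yugma's actual schema:

```typescript
import { z } from "zod";

// Hypothetical schema for update_object, inferred from the example call.
const UpdateObject = z.object({
  id: z.string(),
  patch: z.object({
    transform: z
      .object({
        position: z.tuple([z.number(), z.number(), z.number()]).optional(),
        scale: z.tuple([z.number(), z.number(), z.number()]).optional(),
      })
      .optional(),
  }),
});

const schemas: Record<string, z.ZodTypeAny> = {
  update_object: UpdateObject,
  // ...one schema per tool, 19 in total
};

// Any failure, whatever the cause, surfaces as one validation error
// and never reaches the scene store.
function validateCall(name: string, input: unknown) {
  const schema = schemas[name];
  if (!schema) return { ok: false as const, error: `unknown tool: ${name}` };
  const result = schema.safeParse(input);
  return result.success
    ? { ok: true as const, input: result.data }
    : { ok: false as const, error: result.error.message };
}
```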
# The optional planning pass
For complex prompts ("draft a furnished living room"), the loop runs a planning pass first: a cheap, fast LLM call that breaks the prompt into 5–10 logical steps. Then the main pass executes them as parallel tool calls. This trades one extra LLM hop for dramatically better composition on multi-step requests.
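In sketch form, assuming a generic chat-completion client (the `llm` helper, the model names, and the complexity heuristic here are all placeholders):

```typescript
// Illustrative two-pass loop; nothing here is Yugma's real API.
type Llm = (opts: { model: string; prompt: string }) => Promise<string>;

async function runPrompt(userPrompt: string, sceneYsl: string, llm: Llm) {
  let steps = [userPrompt];
  const isComplex = userPrompt.split(/\s+/).length > 8; // crude stand-in heuristic
  if (isComplex) {
    // Pass 1: a cheap, fast model breaks the request into 5-10 steps.
    const plan = await llm({
      model: "small-fast-model",
      prompt: `Break this 3D scene request into 5-10 concrete steps, one per line:\n${userPrompt}`,
    });
    steps = plan.split("\n").filter((s) => s.trim().length > 0);
  }
  // Pass 2: the main model turns plan + scene context into tool calls.
  return llm({
    model: "main-model",
    prompt: `Scene (YSL):\n${sceneYsl}\n\nExecute these steps as tool calls:\n${steps.join("\n")}`,
  });
}
```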
# Where the loop fails
- Prompts with implicit physics ("the ball rolls down the ramp") — the LLM doesn't simulate.
- Sub-millimeter precision ("offset by exactly 1.234mm") — meter scale is the AI's comfort zone.
- Highly novel aesthetics outside training distribution — the LLM defaults to common conventions.
For those cases, the workflow is iterative: the AI roughs out the scene, the human nudges it via the panel UI, and the AI does the next pass.