NVIDIA AI Introduces SpatialClaw: A Training-Free Agent That Treats Code as the Action Interface for Spatial Reasoning

NVIDIA Research has released SpatialClaw, a training-free framework for spatial reasoning. It targets a persistent weakness of vision-language models (VLMs): they still struggle to judge where objects are, how they relate to each other, and how they move in 3D.
SpatialClaw does not retrain the model. Instead, it changes the action interface the agent uses to call perception tools. The research team argues that this interface is a bottleneck, and their solution is to treat code as the action interface. Across 20 benchmarks, SpatialClaw reaches an average accuracy of 59.9%, surpassing the prior spatial agent SpaceTools by 11.2 points.
What is SpatialClaw
SpatialClaw is an agent loop wrapped around a stateful Python kernel. The kernel is preloaded with the input frames and a set of pre-initialized objects. Perception tools are exposed as plain Python callables. Their outputs, including masks, depth maps, camera geometry, and trajectories, are ordinary Python objects.
The kernel exposes six public entry points. InputImages holds the sampled frames. Metadata carries the frame rate, duration, and frame indices. tools exposes perception and geometry primitives. show() embeds an image in the agent's next context. vlm sends queries to a separate VLM session. ReturnAnswer() submits the final answer.
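To make the idea of a shared, stateful namespace concrete, here is a minimal sketch that uses plain exec() on one dictionary instead of a real Jupyter kernel. MiniKernel, run_cell, and the stub bodies are hypothetical and only illustrate the pattern; the names InputImages, Metadata, show, and ReturnAnswer are taken from the description above, but their real implementations are not shown in this article.

```python
class MiniKernel:
    """Toy stand-in for a stateful kernel: one namespace shared by all cells."""

    def __init__(self, frames, metadata):
        self.answer = None   # set once ReturnAnswer() is called
        self.shown = []      # images queued for the agent's next context
        self.namespace = {
            "InputImages": frames,
            "Metadata": metadata,
            "show": self.shown.append,
            "ReturnAnswer": self._return_answer,
        }

    def _return_answer(self, value):
        self.answer = value

    def run_cell(self, source):
        # Variables defined by one cell stay visible to later cells.
        exec(source, self.namespace)


# Placeholder strings stand in for real image frames.
kernel = MiniKernel(frames=["frame0", "frame1"], metadata={"fps": 2.0})
kernel.run_cell("n_frames = len(InputImages)")
kernel.run_cell("duration = n_frames / Metadata['fps']")
kernel.run_cell("show(InputImages[0]); ReturnAnswer(duration)")
```

Each cell builds on variables left behind by the previous one, which is what lets an agent inspect an intermediate result before deciding on the next step.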
Two perception tools do most of the work. tools.Reconstruct wraps Depth Anything 3 and returns per-frame depth, camera intrinsics, extrinsics, and dense point maps. tools.SAM3 wraps SAM 3 and produces image or video masks from text, point, or box prompts. The framework adds lightweight utilities: tools.Geometry, tools.Mask, tools.Time, tools.Graph, and tools.Draw.
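A dense point map is the per-pixel 3D position implied by a depth map and the camera parameters. The sketch below shows the standard pinhole unprojection into camera coordinates, assuming an intrinsics matrix K with focal lengths fx, fy and principal point cx, cy. The function name depth_to_pointmap is hypothetical and is not the tools.Reconstruct API.

```python
import numpy as np


def depth_to_pointmap(depth, K):
    """Unproject an (H, W) depth map to an (H, W, 3) point map in camera coordinates."""
    h, w = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Pixel coordinate grids: u runs along columns, v along rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)


# Assumed toy camera: 100 px focal length, principal point at pixel (2, 1).
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 1.0],
              [0.0, 0.0, 1.0]])
depth = np.full((3, 5), 2.0)   # flat wall 2 m from the camera
points = depth_to_pointmap(depth, K)
```

Applying the camera extrinsics to these points would place every frame in one shared world frame, which is what makes cross-frame distance measurements meaningful.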
The system is training-free. The same system prompt, toolset, and hyperparameters apply across all benchmarks and backbones.

Why the Action Interface is Important
The research team compared three action interfaces on the same question. Consider measuring the closest distance between a heater and a door.
- One-pass code writes one complete program and runs it once. It commits to a full strategy before seeing any intermediate mask or depth map. A wrong assumption then propagates directly to the answer.
- Structured tool call invokes tools named in a static JSON schema. It cannot freely combine tool outputs with NumPy or SciPy to compose test-time computations. No pre-registered tool exists for closest-point distance, so the result is incorrect.
- SpatialClaw (code as action) calls the tools in code, inspects the results, and revises. It first computes a centroid distance, then refines the centroid using the median. The agent then switches to scipy.spatial.KDTree to find the true nearest points. It reports 0.9439 m against a ground truth of 0.9 m.
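The gap between a centroid distance and a closest-point distance in the heater and door example can be reproduced on synthetic data. The sketch below assumes two axis-aligned grids of points standing in for the segmented objects; grid_points, centroid_distance, and closest_point_distance are hypothetical helper names, and the coordinates are invented so that the true gap is 0.9 m.

```python
import numpy as np
from scipy.spatial import KDTree


def grid_points(xs, ys, zs):
    """Return an (N, 3) array with every combination of the given coordinates."""
    return np.stack(np.meshgrid(xs, ys, zs, indexing="ij"), axis=-1).reshape(-1, 3)


def centroid_distance(a, b):
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))


def closest_point_distance(a, b):
    # For each point in a, look up its nearest neighbour in b, then take the minimum.
    dists, _ = KDTree(b).query(a, k=1)
    return float(dists.min())


# Synthetic stand-ins: a low, deep heater and a tall, thin door 0.9 m apart in x.
heater = grid_points(np.linspace(0.0, 0.5, 6), np.linspace(0.0, 1.0, 6), np.linspace(0.0, 0.6, 4))
door = grid_points(np.linspace(1.4, 1.5, 2), np.linspace(0.0, 1.0, 6), np.linspace(0.0, 2.0, 11))

d_centroid = centroid_distance(heater, door)
d_closest = closest_point_distance(heater, door)
```

On this toy scene the centroid distance comes out near 1.39 m while the closest-point distance is 0.9 m, because centroids also pick up the difference in object height and depth rather than the gap between surfaces.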
Benchmark
SpatialClaw was evaluated on 20 benchmarks in five categories: single-image, multi-view, general, video and 4D, and general video understanding. It improves over the no-tool baseline on all six backbones tested. Backbones range from 26B to 397B parameters across the Qwen3.5/3.6 and Gemma4 families.
A controlled comparison isolates the interface. All three variants share the same toolset and prompt. Only the action interface differs.
| Action interface | Avg. (20 bench.) | Δ vs no-tool |
|---|---|---|
| No-tool baseline | 53.4 | – |
| One-pass code | 55.2 | +1.8 |
| Structured tool call | 56.7 | +3.3 |
| SpatialClaw (code as action) | 59.9 | +6.5 |
Gemma4-31B backbone, 20-benchmark average.
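The Δ column is each interface's average minus the no-tool average. A short check of that arithmetic, using the averages reported in the table (the dictionary keys are hypothetical labels chosen for this sketch):

```python
# 20-benchmark averages on the Gemma4-31B backbone, copied from the table.
averages = {
    "no_tool": 53.4,
    "one_pass_code": 55.2,
    "tool_call": 56.7,
    "code_as_action": 59.9,
}

baseline = averages["no_tool"]
# Round to one decimal to match the table's precision.
deltas = {name: round(avg - baseline, 1) for name, avg in averages.items()}
```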
Against prior spatial agents on the same Gemma4-31B backbone, the gap widens.
| Method | Interface | Avg. | Δ vs SpatialClaw |
|---|---|---|---|
| VADAR | One-pass code | 40.5* | -19.4 |
| pySpatial | One-pass code | 47.8 | -12.1 |
| SpaceTools-Toolshed | Structured tool call | 48.7 | -11.2 |
| SpatialClaw | Code as action | 59.9 | best |
The largest gains come on tasks that demand flexible computation. On Gemma4-31B, DSI-Bench improves by 17.6 points and MindCube by 15.3 points. These benchmarks require geometric computation that ties together multiple frames and views.
An LLM-as-judge attribution analysis explains the wins over the structured tool call interface. Coding accounts for 52.2% of them, control flow for 19.5%, and the remaining 28.3% is neutral.
Inside the Five-Stage Loop
Each sample runs through a five-stage loop: planning, coding, execution, answer synthesis, and answer submission. The planner writes the plan without seeing the images. The main agent then writes one Python cell per step. A static AST checker rejects unsafe code before execution. The loop repeats until ReturnAnswer() is called or 30 steps have passed.
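The combination of a static check and a step budget can be sketched with Python's standard ast module. This is an illustrative approximation, not the project's checker: the deny-lists, check_cell, and run_loop are hypothetical, the cells are a fixed list rather than model output, and exec() on a shared dictionary is not a real sandbox.

```python
import ast

BLOCKED_MODULES = {"os", "subprocess", "shutil", "socket"}   # illustrative deny-list
BLOCKED_CALLS = {"eval", "exec", "open", "__import__"}       # illustrative deny-list


def check_cell(source):
    """Return a list of reasons to reject the cell (empty means it passed)."""
    try:
        tree = ast.parse(source)
    except SyntaxError as err:
        return [f"syntax error: {err.msg}"]
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            names = []
        problems += [f"blocked import: {n}" for n in names if n in BLOCKED_MODULES]
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in BLOCKED_CALLS):
            problems.append(f"blocked call: {node.func.id}")
    return problems


def run_loop(cells, max_steps=30):
    """Run cells in one shared namespace until ReturnAnswer() fires or the budget ends."""
    result = {}
    namespace = {"ReturnAnswer": lambda value: result.setdefault("answer", value)}
    log = []
    for step, source in enumerate(cells[:max_steps], start=1):
        problems = check_cell(source)
        if problems:
            log.append((step, "rejected", problems))   # never executed
            continue
        exec(source, namespace)
        log.append((step, "ran", []))
        if "answer" in result:
            break
    return result.get("answer"), log


answer, log = run_loop([
    "import os; os.remove('x')",   # rejected by the static check
    "total = sum(range(4))",
    "ReturnAnswer(total)",
])
```

The first cell is rejected without being run, and the loop stops as soon as an answer is submitted. A production checker would need a much stricter policy than a short deny-list.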
The official repo is built on a LangGraph workflow and a persistent Jupyter kernel. Backbones are served with vLLM. Perception tools run behind a FastAPI GPU service. A quick start that runs one benchmark on one machine:
git clone --recursive
cd SpatialClaw
bash spatial_agent/scripts/setup.sh
cp .env.example .env # add API keys, or self-host vLLM
python -m spatial_agent.entrypoints.run \
    --dataset spatial_agent/config/dataset/erqa.json \
    --model spatial_agent/config/model/gemini-3-pro.json \
    --concurrency 4
A representative agent cell combines perception with geometry, then refines the estimate:
import scipy.spatial  # explicit import; harmless if the kernel already preloads SciPy
# Reconstruct the scene, then segment both objects in one video pass
recon = tools.Reconstruct.Reconstruct(InputImages)
seg = tools.SAM3.segment_video_by_text(["radiator heater", "door"])
show(seg.visualize(1)) # inspect the masks first
# Closest-point distance via KD-tree, not centroids
pts_h = seg.get_masked_points(recon, frame=1, object=0) # object 0 = heater
pts_d = seg.get_masked_points(recon, frame=2, object=1) # object 1 = door
dists, _ = scipy.spatial.KDTree(pts_d).query(pts_h, k=1)
ReturnAnswer(float(dists.min()))
The agent selects primitives based on the question itself. Distance questions call for KD-tree search and vector norms. Direction questions rely on dot products. No task-specific routing is used.
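As an illustration of the dot-product case, the sketch below classifies a 3D point as in front of or behind a camera, and to its left or right, by projecting the offset onto the camera's forward and right axes. The axis convention (+z forward, +x right), the coordinates, and the helper names are assumptions made for this example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def relative_direction(camera_pos, forward, right, target):
    """Classify a target as front/behind and left/right of a camera."""
    offset = sub(target, camera_pos)
    depth = dot(offset, forward)    # > 0 means in front of the camera
    lateral = dot(offset, right)    # > 0 means to the camera's right
    return ("front" if depth > 0 else "behind", "right" if lateral > 0 else "left")


camera = (0.0, 0.0, 0.0)
forward = (0.0, 0.0, 1.0)   # assumed convention: +z is forward
right = (1.0, 0.0, 0.0)     # assumed convention: +x is right
door = (-0.5, 0.0, 2.0)     # invented object positions in metres
sink = (1.0, 0.0, -1.0)

door_dir = relative_direction(camera, forward, right, door)
sink_dir = relative_direction(camera, forward, right, sink)
door_to_sink = norm(sub(sink, door))   # a distance question uses the vector norm
```

The same two primitives cover both question types: the sign of a dot product answers a direction question, and a vector norm answers a distance question.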
Use Cases
The design is suited to problems that require step-by-step geometric reasoning. Concrete examples include:
- Robotics and embodied agents that measure metric distances between objects before acting.
- Multi-view inspection, where an object's position is recovered from several camera angles.
- Video and 4D analysis that tracks object or camera motion across frames.
- Indoor scene question answering, such as “where is the door relative to the sink?”
Because it is training-free, teams can swap in a different VLM backbone without new data or fine-tuning.