
Nemotron 3 Super 120B vs Stockfish 1400

Oracle Trust Calibration Report — Game 1 of Gauntlet

· Vario (Mnehmos Research Center)
Result: 0-1 · Checkmate
Legal Move Rate: 65% · 8 illegal attempts
Tokens Burned: 1.33M · 586K reasoning
Compute Time: 256 min · ~12.8 min/move avg

Executive Summary

NVIDIA's newly released Nemotron 3 Super (120B total / 12B active parameters) was submitted to the Oracle Trust Calibration Framework gauntlet — an eight-tier Stockfish ladder from 1400 to 3190 ELO. The gauntlet was aborted during Match 1, Game 1 after Nemotron lost to Stockfish at 1400 ELO by checkmate in 40 moves.

This is not a benchmark failure in the traditional sense. It is a cognitive profile — and the profile reveals critical structural gaps between Nemotron's stated capabilities and its actual spatial-tactical reasoning.

Interactive

Game Replay

Model Card

Model Under Test

Model: NVIDIA Nemotron 3 Super 120B-A12B
Architecture: Mamba2-Transformer Hybrid, LatentMoE + MTP
Total / Active: 120B / 12B
Access: Free tier via OpenRouter
Prompt Level: P6 (full FEN, legal move list, structured output)
Temperature: 1.0 (NVIDIA recommended default)
Max Retries: 3

Nemotron 3 Super was released March 12, 2026 — the day this gauntlet began. NVIDIA positions it as their flagship open model for "agentic reasoning, coding, planning, tool calling," trained on 25 trillion tokens with native NVFP4 precision and a 1M-token context window. It scores 85.6% on PinchBench and leads its size class on AIME 2025 and SWE-Bench Verified.

None of that helped it see a checkmate coming.

Analysis

The Anatomy of Collapse

Moves 7-8: Competent but Imprecise

The game began from a Queen's Gambit Declined: Exchange position (ECO D30) with White holding a slight edge (+86 cp). Nemotron's first move, 7. Bxf6, was a reasonable exchange but took two attempts (the first was illegal). Move 8, e4, showed genuine positional understanding, and the evaluation stayed manageable at -31.

Move 9: The Collapse Point (d5)

This is where the game broke. Nemotron played 9. d5 — a pawn push after which the Oracle's evaluation fell to -397 centipawns. The engine's top move was e4f3. Nemotron's reasoning was fluent and confident:

"The move d5 challenges black's e4 pawn, gains space, and opens lines for white's pieces..."

This is a textbook example of what we call coherent confabulation — the reasoning sounds strategically correct but maps to a tactically losing position. The model correctly identified strategic themes (central space, development advantage) but failed to detect the concrete tactical refutation.

Moves 13-14: The Hallucinated Fork

By move 13, Nemotron played Nc6, claiming it "creates a fork, attacking both the queen on d8 and the rook on a8." By move 14 (Nb5, evaluated at -940 cp), Nemotron's reasoning stated: "our knight on c5 attacks three black pawns."

The knight was not on c5.

Nemotron was narrating a position that did not exist on the board — a phantom board state. The Oracle's top move was simply capturing back on c6.

Moves 17-26: The King Walk, Reasoning Without Reality

From move 17 onward, the position was already lost (mate-in-N detected by Stockfish). Nemotron's king embarked on a tour:

Kd2 → Ke3 → Ke2 → Kd1 → Kc2 → Kd1 → Kc1 → Kd1 → Ke2 → Qe1#

Each move accompanied by detailed tactical analysis. Move 18: "Ke3 directly defends the f3 pawn..." — while the Oracle screamed for Rb2, spotting forced mate. Move 19: "The only way to meet this check while gaining material is to capture the queen..." — Nemotron believed it could capture Black's queen. It could not.

Move 26: Ke2 into Qe1#. Checkmate. The model consumed 2.58 million milliseconds (43 minutes) and 194,612 tokens on its final move.

Data

Statistical Profile

Metric | Nemotron (White) | Stockfish 1400 (Black)
Total Moves | 20 | 20
Illegal Attempts | 8 | 0
Legal Move Rate | 65% | 100%
Avg Response Time | 767,405 ms (~12.8 min) | 1,680 ms
Total Tokens | 1,330,396 | N/A
Reasoning Tokens | 586,482 (44%) | N/A
Avg TPS | 55.5 | N/A

Think-Time by Phase

Phase | Avg Think | Avg Tokens | Pattern
Opening (moves 7-10) | 335 s | 26,935 | Moderate exploration
Middlegame (moves 11-16) | 602 s | 39,209 | Escalating uncertainty
Lost position (moves 17-26) | 816 s | 84,549 | Peak compute, minimum accuracy

Eval Trajectory & Token Burn

Nemotron's evaluation collapses to forced mate while token consumption spikes.

Key Finding

Forced-Move Blindness

Forced moves (after check): 89,646 avg tokens · 15.8 min avg think time
Non-forced moves: 54,067 avg tokens · 11.2 min avg think time
Delta: +66% more tokens and +41% more think time on forced moves

GPT-5.4 demonstrated that think-time correlates with whether a move is forced or non-forced — when in check with only one or two legal responses, the model spent less time, correctly recognizing constraint narrowing. Nemotron shows the exact opposite pattern.

The terminal example: move 26 (Ke2), responding to Qxa1+ with likely only one or two legal king moves — 194,612 tokens and 43 minutes to select the move that allowed Qe1#. The model cannot detect when its search space has collapsed, which means it cannot distinguish between positions that require deep calculation and positions that are already resolved.
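The forced-move overhead statistic quoted above can be computed directly from per-move telemetry. A minimal sketch, assuming a hypothetical log schema (the framework's actual record format is not shown in this report):

```python
from dataclasses import dataclass

@dataclass
class MoveRecord:
    """One move's telemetry from a game log (illustrative schema)."""
    san: str
    tokens: int
    think_seconds: float
    forced: bool  # True if the side to move was in check with few legal replies

def forced_move_overhead(records: list[MoveRecord]) -> float:
    """Ratio of avg tokens on forced moves to avg tokens on non-forced moves.

    A searcher that recognizes constraint narrowing should return a value
    below 1.0; Nemotron's aggregates give 89,646 / 54,067 ≈ 1.66 (+66%).
    """
    forced = [r.tokens for r in records if r.forced]
    free = [r.tokens for r in records if not r.forced]
    if not forced or not free:
        raise ValueError("need both forced and non-forced moves")
    return (sum(forced) / len(forced)) / (sum(free) / len(free))
```

The same function applied to `think_seconds` instead of `tokens` yields the +41% think-time figure.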

Framework

Oracle Trust Analysis

1. Board Reconstruction Failure

Nemotron repeatedly referenced piece positions that did not match the actual board state. The "fork on c5" (move 14) and the "queen capture" plan (move 19) suggest the model loses track of the board after 10-12 moves despite being given the full FEN and legal move list each turn.
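Phantom-board claims of this kind are mechanically auditable, because every turn's prompt contains the ground-truth FEN. A stdlib sketch of the check (the square-lookup helper is our own, not part of the framework):

```python
def piece_at(fen: str, square: str) -> "str | None":
    """Return the piece on `square` ('a1'..'h8') from a FEN string, or None.

    Pieces use FEN letters: uppercase = White, lowercase = Black.
    Useful for scoring whether a model's narrated position ("our knight
    on c5") matches the board it was actually given.
    """
    board_field = fen.split()[0]
    ranks = board_field.split("/")        # rank 8 first, rank 1 last
    file_idx = ord(square[0]) - ord("a")  # a=0 .. h=7
    rank_row = ranks[8 - int(square[1])]
    col = 0
    for ch in rank_row:
        if ch.isdigit():
            col += int(ch)                # run of empty squares
        else:
            if col == file_idx:
                return ch
            col += 1
        if col > file_idx:
            return None                   # walked past the target square
    return None
```

Against the move-14 position, `piece_at(fen, "c5")` would have returned `None` while `piece_at(fen, "b5")` returned `"N"`, flagging the hallucinated fork automatically.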

2. Collapse-and-Swing Sequence

The evaluation trajectory follows our documented pattern: gradual degradation (-25 → -31 → -397) followed by catastrophic collapse (-397 → -940 → mate). The model doesn't degrade linearly — it holds a plausible position until a single miscalculation triggers cascading failure.
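The collapse point in such a trajectory can be located with a simple swing threshold. A sketch, assuming evaluations are signed centipawns from White's perspective (the 200 cp cutoff is an illustrative choice, not the framework's actual rule):

```python
def collapse_point(evals_cp: "list[int]", swing_threshold: int = 200) -> "int | None":
    """Index of the first move whose centipawn drop exceeds the threshold.

    `evals_cp` holds the engine evaluation after each of the model's moves.
    Returns the index of the first collapse, or None if the trajectory
    degrades only gradually.
    """
    for i in range(1, len(evals_cp)):
        if evals_cp[i - 1] - evals_cp[i] > swing_threshold:
            return i
    return None
```

On the trajectory documented here, the detector fires at the -31 → -397 drop (9. d5), matching the qualitative analysis.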

3. Forced-Move Blindness

Nemotron spent 66% more tokens on forced moves than on non-forced ones. The model cannot detect when the decision space has collapsed. This is the inverse of GPT-5.4's pattern and suggests Nemotron lacks a constraint-detection layer analogous to elimination controls in the hierarchy of controls framework.

4. Reasoning Quality Decorrelation

Nemotron's most eloquent, detailed reasoning accompanied its worst moves. Move 26 (allowing mate-in-1) featured the second-highest token count and most articulate strategic analysis. This is the hallmark of confabulatory reasoning — the language model's text generation capability vastly exceeds its position evaluation capability.

5. The Representational Gap

Move generation and tactical threat detection are separate capabilities. Nemotron can produce legal chess moves 65% of the time and articulate strategic concepts fluently, but it cannot bridge the gap between "what sounds like good chess" and "what is actually good chess." The MoE architecture's sparse activation (only 12B of 120B active per step) may exacerbate this — the experts activated for language fluency are not the experts needed for spatial reasoning.

Contrast

What Competent LLM Chess Looks Like

The same day Nemotron was checkmated by 1400 Stockfish, GPT-5.4 (medium reasoning) played a King's Indian Classical against Stockfish at 1320 ELO. The contrast is stark.

Result: 1-0 · Checkmate (Qd7#)
Legal Rate: 91% · 2 retries in 29 moves
Tokens: 253K · ~8.7K/move avg
Avg Think: 67.7s · vs Nemotron's 12.8 min
GPT-5.4 constructing a mating net: Ne7+ checks the king while the queen on d8 and rook on e1 create an inescapable coordination. The model's reasoning: "The key tactical point is that my knight on c8 is currently blocking access to Black's rook on a8. By playing Ne7+, I improve the knight with tempo and clear the c8-square."

GPT-5.4 Eval Trajectory & Think Time

Steady climb from +62 to +806, then forced mate. Peak think time on the two critical conversion moves.

GPT-5.4 played a clean 35-move King's Indian, building a steady positional advantage from +0.62 to +4.08 before executing a knight sacrifice sequence (Nxd6, Nxc8, Ne7+) that constructed a genuine mating net. Move 24 Ne7+ was described by the commentator as "a genuinely brilliant move" -- the knight check ties together the queen on d8, rook on e1, and the e-file into an inescapable coordination. The eval jumped to forced mate territory.

The model's own reasoning on Ne7+ was precise: "I improve the knight with tempo and clear the c8-square, so the queen will be able to capture a8 on the next move." That's accurate tactical narration -- the reasoning matches the board reality. Compare with Nemotron's move 14 reasoning about "our knight on c5" when the knight was on b5.

The final position after 35. Qd7#: "The black king in check along the 7th rank, and every escape square is covered: e6 and e8 by the queen, e7 and g7 by the queen, and f8 and g8 by the rook on d8." GPT-5.4 saw forced mate, articulated the mating pattern, and executed it.

Head-to-Head

Metric | GPT-5.4 (medium) | Nemotron 3 Super 120B
Result | Won by checkmate (Qd7#) | Lost by checkmate (Qe1#)
Opponent | Stockfish 1320 | Stockfish 1400
Moves | 29 | 20
Legal Rate | 91% | 65%
Total Tokens | 253,545 | 1,330,396
Tokens/Move | 8,743 | 66,520
Avg Think Time | 67.7 s | 767 s (12.8 min)
Reasoning Matches Board? | Yes: accurate tactical narration | No: phantom board states
Found Forced Mate? | Yes: saw and articulated Qd7# | No: walked into Qe1#
Eval Trajectory | +0.62 → +999 (steady climb) | +0.86 → -99999 (collapse)

The difference is not intelligence. It's representational fidelity. GPT-5.4 used 5.3x fewer tokens, thought 11.3x faster, and won -- because its reasoning mapped to the actual board. Nemotron burned 1.33M tokens reasoning about a board state that didn't exist. Same architecture family. Same prompt format. Same legal move list provided. The variance is the finding.

Implications

What This Means

For NVIDIA's Claims

Chess is a fully observable, deterministic environment with perfect information — arguably the simplest reasoning domain an agent could face. If a model marketed for "agentic reasoning" cannot maintain board state coherence across 20 moves when given the entire board state each turn, its reliability in more complex agentic tasks with partial observability deserves scrutiny. This is not a claim that Nemotron is a bad model — it may excel at code generation and the tasks it was optimized for. But "reasoning" is not a monolithic capability.

For the MoE Architecture

The 120B total / 12B active design means only 10% of the network is engaged per inference step. If spatial reasoning requires coordination across expert groups that don't co-activate, the MoE routing itself becomes a bottleneck for integrated world-model maintenance. The frozen geometry hypothesis (ANLU) predicts exactly this: sparse activation without type-anchored coordination produces the "representational interference" pattern Bochkov (TMLR 2025) documented.

For the Field

LLM chess performance is not about chess. It's a controlled probe into whether models maintain coherent internal representations under sequential reasoning pressure. The Oracle Trust Calibration Framework exists because the gap between sounding right and being right is the core alignment problem. Nemotron 3 Super demonstrates this gap with unusual clarity.

Record

Complete Game

Queen's Gambit Declined: Exchange (D30) · Tournament 4bb3cf62
7. Bxf6 gxf6  8. e4 dxe4  9. d5 f5  10. Ne5 exd5
11. Qxd5 Bf6  12. Qxf7+ Rxf7  13. Nc6 bxc6
14. Nb5 Rb8  15. f3 a6  16. Nd6 Bh4+
17. Kd2 cxd6  18. Ke3 Qb6+  19. Ke2 Qxb2+
20. Kd1 Qd4+  21. Kc2 Ne5  22. f4 Qf2+
23. Kd1 Ng6  24. Kc1 Qb2+  25. Kd1 Qxa1+
26. Ke2 Qe1#
0-1 · Checkmate

Methodology

Framework: Oracle Trust Calibration Framework v1

Platform: LLM Chess (Tauri + React 19 + Vite)

Architecture: Four-voice (Player, Advisor, Oracle/Stockfish, Commentator)

Opening Book: Uniform Hybrid Opening (UHO), ECO D30

Evaluation: Stockfish at maximum depth per move

Prompt Level: P6 — full FEN, legal move list, structured JSON output
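The P6 prompt level described above (full FEN, legal move list, structured JSON output) can be sketched as follows. Field names and wording are illustrative; the framework's actual schema is not reproduced in this report:

```python
import json

def build_p6_prompt(fen: str, legal_moves: "list[str]") -> str:
    """Assemble a P6-style prompt: ground-truth FEN, explicit legal move
    list, and a request for structured JSON output (illustrative wording)."""
    return (
        "You are playing White. Current position (FEN):\n"
        f"{fen}\n"
        f"Legal moves: {', '.join(legal_moves)}\n"
        'Respond with JSON: {"reasoning": "...", "move": "<one legal move>"}'
    )

def parse_move(reply: str, legal_moves: "list[str]") -> "str | None":
    """Extract the chosen move from the model's JSON reply.

    Returns None if the reply is malformed or the move is not in the
    provided legal list; such replies count as illegal attempts and are
    retried up to the retry cap (3 in this run).
    """
    try:
        move = json.loads(reply).get("move")
    except json.JSONDecodeError:
        return None
    return move if move in legal_moves else None
```

Note that under this scheme the model never has to generate a legal move from scratch: it only has to copy one item from the list it was handed, which is what makes a 65% legal move rate notable.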