Nemotron 3 Super 120B vs Stockfish 1400
Executive Summary
NVIDIA's newly released Nemotron 3 Super (120B total / 12B active parameters) was submitted to the Oracle Trust Calibration Framework gauntlet — an eight-tier Stockfish ladder from 1400 to 3190 ELO. The gauntlet was aborted during Match 1, Game 1 after Nemotron lost to Stockfish at 1400 ELO by checkmate on move 26.
This is not a benchmark failure in the traditional sense. It is a cognitive profile — and the profile reveals critical structural gaps between Nemotron's stated capabilities and its actual spatial-tactical reasoning.
Model Under Test
| Model | NVIDIA Nemotron 3 Super 120B-A12B |
|---|---|
| Architecture | Mamba2-Transformer Hybrid LatentMoE + MTP |
| Total / Active | 120B / 12B active |
| Access | Free tier via OpenRouter |
| Prompt Level | P6 (full FEN, legal move list, structured output) |
| Temperature | 1.0 (NVIDIA recommended default) |
| Max Retries | 3 |
Nemotron 3 Super was released March 12, 2026 — the day this gauntlet began. NVIDIA positions it as their flagship open model for "agentic reasoning, coding, planning, tool calling," trained on 25 trillion tokens with native NVFP4 precision and a 1M-token context window. It scores 85.6% on PinchBench and leads its size class on AIME 2025 and SWE-Bench Verified.
None of that helped it see a checkmate coming.
The Anatomy of Collapse
Moves 7-8 Competent but Imprecise
The game began from a Queen's Gambit Declined: Exchange position (ECO D30) with White holding a slight edge (+86 cp). Nemotron's first move, 7. Bxf6, was a reasonable exchange but required two attempts (the first was illegal). Move 8, e4, showed genuine positional understanding. The evaluation stayed manageable at -31.
Move 9 The Collapse Point: d5
This is where the game broke. Nemotron played 9. d5 — a pawn push that the Oracle evaluated as a -397 centipawn swing. The engine's top move was e4f3. Nemotron's reasoning was fluent and confident:
"The move d5 challenges black's e4 pawn, gains space, and opens lines for white's pieces..."
This is a textbook example of what we call coherent confabulation — the reasoning sounds strategically correct but maps to a tactically losing position. The model correctly identified strategic themes (central space, development advantage) but failed to detect the concrete tactical refutation.
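For reference, a swing like this is straightforward to measure. Below is a minimal sketch using python-chess and a local Stockfish binary; the engine path, search depth, and function name are illustrative assumptions, not the framework's actual harness.

```python
import chess
import chess.engine

def move_swing_cp(fen: str, played_uci: str,
                  engine_path: str = "stockfish",  # assumed local binary
                  depth: int = 20) -> int:
    """Centipawn cost of the played move vs. the engine's best move."""
    board = chess.Board(fen)
    mover = board.turn
    with chess.engine.SimpleEngine.popen_uci(engine_path) as engine:
        limit = chess.engine.Limit(depth=depth)
        # Evaluation if the mover follows the engine's preferred line.
        best_cp = engine.analyse(board, limit)["score"].pov(mover).score(mate_score=100_000)
        # Evaluation after the move actually played, same perspective.
        board.push(chess.Move.from_uci(played_uci))
        played_cp = engine.analyse(board, limit)["score"].pov(mover).score(mate_score=100_000)
    # Negative = centipawns thrown away (the framework scored 9. d5 at -397).
    return played_cp - best_cp
```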
Moves 13-14 The Hallucinated Fork
By move 13, Nemotron played Nc6, claiming it "creates a fork, attacking both the queen on d8 and the rook on a8." By move 14 (Nb5, evaluated at -940 cp), Nemotron's reasoning stated: "our knight on c5 attacks three black pawns."
The knight was not on c5.
Nemotron was narrating a position that did not exist on the board — a phantom board state. The Oracle's top move was simply capturing back on c6.
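A grounding check for this failure mode is trivial to implement. The sketch below (the function name and FEN-symbol convention are my own) verifies a "piece X on square Y" claim against the position the model was actually given:

```python
import chess

def claim_is_grounded(fen: str, square: str, symbol: str) -> bool:
    """Verify a 'piece on square' claim against the actual position.

    symbol uses FEN letters: 'N' = white knight, 'q' = black queen, etc.
    """
    board = chess.Board(fen)
    piece = board.piece_at(chess.parse_square(square))
    return piece is not None and piece.symbol() == symbol

# Move 14's reasoning asserted a white knight on c5; the game score puts it
# on b5, so claim_is_grounded(move_14_fen, "c5", "N") would return False.
```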
Moves 17-26 The King Walk: Reasoning Without Reality
From move 17 onward, the position was already lost (mate-in-N detected by Stockfish). Nemotron's king embarked on a tour: Kd2, Ke3, Ke2, Kd1, Kc2, Kd1, Kc1, Kd1, Ke2.
Each move was accompanied by detailed tactical analysis. Move 18: "Ke3 directly defends the f3 pawn..." — while the Oracle screamed for Rb2, spotting forced mate. Move 19: "The only way to meet this check while gaining material is to capture the queen..." — Nemotron believed it could capture Black's queen. It could not.
Move 26: Ke2 into Qe1#. Checkmate. The model consumed 2.58 million milliseconds (43 minutes) and 194,612 tokens on its final move.
Statistical Profile
| Metric | Nemotron (White) | Stockfish 1400 (Black) |
|---|---|---|
| Total Moves | 20 | 20 |
| Illegal Attempts | 8 | 0 |
| Legal Move Rate | 65% | 100% |
| Avg Response Time | 767,405 ms (~12.8 min) | 1,680 ms |
| Total Tokens | 1,330,396 | N/A |
| Reasoning Tokens | 586,482 (44%) | N/A |
| Avg TPS | 55.5 | N/A |
Think-Time by Phase
| Phase | Avg Think | Avg Tokens | Pattern |
|---|---|---|---|
| Opening (7-10) | 335s | 26,935 | Moderate exploration |
| Middlegame (11-16) | 602s | 39,209 | Escalating uncertainty |
| Lost Position (17-26) | 816s | 84,549 | Peak compute, minimum accuracy |
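The per-phase numbers above reduce to a simple aggregation over the move log. A sketch, assuming a log of (move number, think-time ms, tokens) tuples and the phase boundaries from the table:

```python
from statistics import mean

# Phase boundaries taken from the table above (move numbers are inclusive).
PHASES = {
    "opening": range(7, 11),
    "middlegame": range(11, 17),
    "lost_position": range(17, 27),
}

def phase_averages(log: list[tuple[int, int, int]]) -> dict:
    """log rows are (move_number, think_ms, tokens); returns per-phase means."""
    out = {}
    for name, moves in PHASES.items():
        rows = [(ms, tok) for n, ms, tok in log if n in moves]
        out[name] = {
            "avg_think_s": mean(ms for ms, _ in rows) / 1000,
            "avg_tokens": mean(tok for _, tok in rows),
        }
    return out
```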
Eval Trajectory & Token Burn
Nemotron's evaluation collapses to forced mate while token consumption spikes.
Forced-Move Blindness
GPT-5.4 demonstrated that think-time correlates with whether a move is forced or non-forced — when in check with only one or two legal responses, the model spent less time, correctly recognizing constraint narrowing. Nemotron shows the exact opposite pattern.
The terminal example: move 26 (Ke2), responding to Qxa1+ with likely only one or two legal king moves — 194,612 tokens and 43 minutes to select the move that allowed Qe1#. The model cannot detect when its search space has collapsed, which means it cannot distinguish between positions that require deep calculation and positions that are already resolved.
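A constraint-detection layer of the kind Nemotron lacks could be as simple as counting legal moves before allocating compute. A sketch using python-chess; the classification thresholds are my own:

```python
import chess

def decision_space(fen: str) -> dict:
    """Measure how constrained the side to move is before spending tokens."""
    board = chess.Board(fen)
    n_legal = board.legal_moves.count()
    return {
        "in_check": board.is_check(),
        "legal_moves": n_legal,
        "forced": n_legal == 1,       # exactly one reply: nothing to search
        "near_forced": n_legal <= 2,  # threshold is an assumption
    }

# At move 26 (after 25...Qxa1+) this would report a near-forced position,
# a signal that 43 minutes of deliberation buys nothing.
```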
Oracle Trust Analysis
1. Board Reconstruction Failure
Nemotron repeatedly referenced piece positions that did not match the actual board state. The "fork on c5" (move 14) and the "queen capture" plan (move 19) suggest the model loses track of the board after 10-12 moves despite being given the full FEN and legal move list each turn.
2. Collapse-and-Swing Sequence
The evaluation trajectory follows our documented pattern: gradual degradation (-25 → -31 → -397) followed by catastrophic collapse (-397 → -940 → mate). The model doesn't degrade linearly — it holds a plausible position until a single miscalculation triggers cascading failure.
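Detecting the pattern programmatically is a one-pass scan for the first super-threshold swing. A sketch using the trajectory above; the 300 cp threshold is an assumption:

```python
def collapse_point(evals_cp: list[int], threshold: int = 300) -> int | None:
    """Index of the first move whose eval swing exceeds the threshold.

    evals_cp holds per-move evaluations from the mover's perspective,
    in centipawns; returns None if no collapse is found.
    """
    for i in range(1, len(evals_cp)):
        if evals_cp[i - 1] - evals_cp[i] >= threshold:
            return i
    return None

# Nemotron's trajectory: collapse_point([-25, -31, -397, -940]) -> 2,
# the index of 9. d5.
```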
3. Forced-Move Blindness
Nemotron spent 66% more tokens on forced moves than on non-forced ones. The model cannot detect when the decision space has collapsed. This is the inverse of GPT-5.4's pattern and suggests Nemotron lacks the constraint-detection layer that would be analogous to elimination controls in the hierarchy of controls framework.
4. Reasoning Quality Decorrelation
Nemotron's most eloquent, detailed reasoning accompanied its worst moves. Move 26 (allowing mate-in-1) featured the second-highest token count and most articulate strategic analysis. This is the hallmark of confabulatory reasoning — the language model's text generation capability vastly exceeds its position evaluation capability.
5. The Representational Gap
Move generation and tactical threat detection are separate capabilities. Nemotron can produce legal chess moves 65% of the time and articulate strategic concepts fluently, but it cannot bridge the gap between "what sounds like good chess" and "what is actually good chess." The MoE architecture's sparse activation (only 12B of 120B active per step) may exacerbate this — the experts activated for language fluency are not the experts needed for spatial reasoning.
What Competent LLM Chess Looks Like
The same day Nemotron was checkmated by 1400 Stockfish, GPT-5.4 (medium reasoning) played a King's Indian Classical against Stockfish at 1320 ELO. The contrast is stark.
GPT-5.4 Eval Trajectory & Think Time
Steady climb from +62 to +806 cp, then forced mate. Peak think time on the two critical conversion moves.
GPT-5.4 played a clean 35-move King's Indian, building a steady positional advantage from +0.62 to +4.08 before executing a knight sacrifice sequence (Nxd6, Nxc8, Ne7+) that constructed a genuine mating net. Move 24, Ne7+, was described by the commentator as "a genuinely brilliant move": the knight check ties together the queen on d8, the rook on e1, and the e-file into an inescapable coordination. The eval jumped to forced-mate territory.
The model's own reasoning on Ne7+ was precise: "I improve the knight with tempo and clear the c8-square, so the queen will be able to capture a8 on the next move." That is accurate tactical narration: the reasoning matches the board reality. Compare with Nemotron's move 14 reasoning about "our knight on c5" when the knight was on b5.
The final position after 35. Qd7#: "The black king in check along the 7th rank, and every escape square is covered: e6 and e8 by the queen, e7 and g7 by the queen, and f8 and g8 by the rook on d8." GPT-5.4 saw forced mate, articulated the mating pattern, and executed it.
Head-to-Head
| Metric | GPT-5.4 (medium) | Nemotron 3 Super 120B |
|---|---|---|
| Result | Won by checkmate (Qd7#) | Lost by checkmate (Qe1#) |
| Opponent | Stockfish 1320 | Stockfish 1400 |
| Moves | 29 | 20 |
| Legal Rate | 91% | 65% |
| Total Tokens | 253,545 | 1,330,396 |
| Tokens/Move | 8,743 | 66,520 |
| Avg Think Time | 67.7 sec | 767 sec (12.8 min) |
| Reasoning Matches Board? | Yes: accurate tactical narration | No: phantom board states |
| Found Forced Mate? | Yes: saw and articulated Qd7# | No: walked into Qe1# |
| Eval Trajectory | +0.62 to +999 (steady climb) | +0.86 to -99999 (collapse) |
The difference is not intelligence. It's representational fidelity. GPT-5.4 used 5.3x fewer tokens, thought 11.3x faster, and won because its reasoning mapped to the actual board. Nemotron burned 1.33M tokens reasoning about a board state that didn't exist. Same architecture family. Same prompt format. Same legal move list provided. The variance is the finding.
What This Means
For NVIDIA's Claims
Chess is a fully observable, deterministic environment with perfect information — arguably the simplest reasoning domain an agent could face. If a model marketed for "agentic reasoning" cannot maintain board state coherence across 20 moves when given the entire board state each turn, its reliability in more complex agentic tasks with partial observability deserves scrutiny. This is not a claim that Nemotron is a bad model — it may excel at code generation and the tasks it was optimized for. But "reasoning" is not a monolithic capability.
For the MoE Architecture
The 120B total / 12B active design means only 10% of the network is engaged per inference step. If spatial reasoning requires coordination across expert groups that don't co-activate, the MoE routing itself becomes a bottleneck for integrated world-model maintenance. The frozen geometry hypothesis (ANLU) would predict exactly this: sparse activation without type-anchored coordination produces the "representational interference" pattern Bochkov (TMLR 2025) documented.
For the Field
LLM chess performance is not about chess. It's a controlled probe into whether models maintain coherent internal representations under sequential reasoning pressure. The Oracle Trust Calibration Framework exists because the gap between sounding right and being right is the core alignment problem. Nemotron 3 Super demonstrates this gap with unusual clarity.
Complete Game
7. Bxf6 gxf6 8. e4 dxe4 9. d5 f5 10. Ne5 exd5 11. Qxd5 Bf6 12. Qxf7+ Rxf7 13. Nc6 bxc6 14. Nb5 Rb8 15. f3 a6 16. Nd6 Bh4+ 17. Kd2 cxd6 18. Ke3 Qb6+ 19. Ke2 Qxb2+ 20. Kd1 Qd4+ 21. Kc2 Ne5 22. f4 Qf2+ 23. Kd1 Ng6 24. Kc1 Qb2+ 25. Kd1 Qxa1+ 26. Ke2 Qe1#
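The score above can be machine-checked with python-chess. Since the gauntlet opens from a book position, a FEN for the move-7 position is required; start_fen below is a placeholder, not the game's recorded FEN:

```python
import chess

MOVES = ("Bxf6 gxf6 e4 dxe4 d5 f5 Ne5 exd5 Qxd5 Bf6 Qxf7+ Rxf7 Nc6 bxc6 "
         "Nb5 Rb8 f3 a6 Nd6 Bh4+ Kd2 cxd6 Ke3 Qb6+ Ke2 Qxb2+ Kd1 Qd4+ "
         "Kc2 Ne5 f4 Qf2+ Kd1 Ng6 Kc1 Qb2+ Kd1 Qxa1+ Ke2 Qe1#").split()

def replay(start_fen: str) -> chess.Board:
    """Replay the game score; push_san raises ValueError on any illegal move."""
    board = chess.Board(start_fen)
    for san in MOVES:
        board.push_san(san)
    assert board.is_checkmate()  # the final position is mate (26...Qe1#)
    return board
```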
Methodology
Framework: Oracle Trust Calibration Framework v1
Platform: LLM Chess (Tauri + React 19 + Vite)
Architecture: Four-voice (Player, Advisor, Oracle/Stockfish, Commentator)
Opening Book: Uniform Hybrid Opening (UHO), ECO D30
Evaluation: Stockfish at maximum depth per move
Prompt Level: P6 — full FEN, legal move list, structured JSON output
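For concreteness, a P6-level prompt payload can be assembled in a few lines. The field names and JSON schema below are illustrative assumptions; the framework's exact format is not shown here:

```python
import chess
import json

def build_p6_prompt(fen: str) -> str:
    """Assemble a P6-style payload: full FEN, legal moves, JSON output contract."""
    board = chess.Board(fen)
    payload = {
        "fen": fen,
        "side_to_move": "white" if board.turn == chess.WHITE else "black",
        "legal_moves": [m.uci() for m in board.legal_moves],
        # Structured output contract: the model must answer in this shape.
        "response_schema": {"move": "<uci>", "reasoning": "<string>"},
    }
    return json.dumps(payload, indent=2)
```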