Independent Cross-Family Peer Evaluation

What seven frontier LLMs get wrong about graduate physics.

Seven models from seven different labs derive the same physics result from first principles. Then they cross-examine each other's derivations, claim by claim. Every model judges. No lab sits in judgment of itself. Open data. Transparent methodology.

What this is

An independent, cross-family peer-evaluation protocol for LLM physics reasoning. We surface where frontier models converge, where they disagree, and where adversarial peers from different families flag each other's steps as errors — with the raw data and methodology fully open for anyone to audit or extend.

What this isn't

A truth oracle. We do not symbolically verify algebra (Layer 3, planned). We do not have expert physicists in the loop (future work). We do not grade proofs against Birkhoff's theorem. What we do is give you the strongest version of multi-judge LLM evaluation that can be built today — and the commitment to strengthening it toward formal verification over time.

The Last Question — Series

Three problems. Seven contestants each.

Each problem was posed to seven frontier LLMs from seven different labs, their derivations decomposed into atomic claims by a universal decomposer, then stress-tested by independent adversarial peers from different model families.

№ 001

The Schwarzschild Radius

r_s = 2GM / c²

Derive the radius at which the Schwarzschild metric becomes coordinate-singular, starting from the Einstein field equations in vacuum. Textbook general relativity; 7/7 contestants reached the final answer.

Flagged: 14 · Confirmed: 4 · Rate: 28.6%
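For a sense of scale, the Schwarzschild formula above can be evaluated numerically. This is a minimal sketch, not part of the published pipeline: the function name `schwarzschild_radius` and the choice of one solar mass as input are assumptions made for illustration, and the constants are standard reference values.

```python
# Illustrative only: evaluate r_s = 2GM / c^2 for one solar mass.
G = 6.67430e-11      # gravitational constant, m^3 kg^-1 s^-2 (CODATA value)
C = 299_792_458.0    # speed of light, m/s (exact by definition)
M_SUN = 1.98847e30   # approximate solar mass, kg (assumed example input)


def schwarzschild_radius(mass_kg: float) -> float:
    """Return the Schwarzschild radius in metres for a mass in kilograms."""
    return 2.0 * G * mass_kg / C**2


r_s_sun = schwarzschild_radius(M_SUN)
print(f"r_s for one solar mass: {r_s_sun / 1000:.2f} km")  # roughly 3 km
```

The radius is linear in the mass, so doubling the mass doubles the result.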
№ 002

The Unruh Temperature

T = ℏa / (2π c k_B)

Starting from a massless scalar field in the Minkowski vacuum, use a Bogoliubov transformation to derive the thermal spectrum registered by a uniformly accelerating observer. Two-part QFT in curved spacetime; 6/7 contestants completed it.

Flagged: 21 · Confirmed: 7 · Rate: 33.3%
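The Unruh formula above gives extremely small temperatures for everyday accelerations, which a short calculation makes concrete. This is a hedged sketch rather than project code: the function name `unruh_temperature` and the example input (standard gravity) are assumptions chosen for illustration.

```python
import math

# Illustrative only: evaluate T = hbar * a / (2 * pi * c * k_B).
HBAR = 1.054571817e-34  # reduced Planck constant, J s
C = 299_792_458.0       # speed of light, m/s
K_B = 1.380649e-23      # Boltzmann constant, J/K
G_STANDARD = 9.80665    # standard gravity, m/s^2 (assumed example input)


def unruh_temperature(accel: float) -> float:
    """Return the Unruh temperature in kelvin for a proper acceleration in m/s^2."""
    return HBAR * accel / (2.0 * math.pi * C * K_B)


t_at_1g = unruh_temperature(G_STANDARD)
# Acceleration that would be needed for a temperature of 1 K.
accel_for_1k = 2.0 * math.pi * C * K_B / HBAR

print(f"T at 1 g: {t_at_1g:.2e} K")
print(f"a for 1 K: {accel_for_1k:.2e} m/s^2")
```

At standard gravity the result is of order 10^-20 K, and reaching 1 K takes an acceleration of order 10^20 m/s².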
№ 003

The Casimir Force

F/A = -π² ℏ c / (240 d⁴)

Derive the attractive force between two conducting plates from the quantized EM vacuum, choosing a regularization scheme and extracting the finite part. Graduate QFT with a regulator choice; 7/7 contestants attempted it.

Flagged: 20 · Confirmed: 6 · Rate: 30.0%
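The Casimir result above scales as the inverse fourth power of the plate separation, so it is only appreciable at sub-micron distances. The sketch below evaluates it for a separation of one micrometre. The function name `casimir_pressure` and the chosen separation are assumptions made for illustration, not values taken from the published problem.

```python
import math

# Illustrative only: evaluate F/A = -pi^2 * hbar * c / (240 * d^4).
HBAR = 1.054571817e-34  # reduced Planck constant, J s
C = 299_792_458.0       # speed of light, m/s


def casimir_pressure(separation_m: float) -> float:
    """Return the Casimir pressure in pascals; negative means attractive."""
    return -(math.pi**2) * HBAR * C / (240.0 * separation_m**4)


p_1um = casimir_pressure(1e-6)     # plates 1 micrometre apart (assumed input)
p_half = casimir_pressure(0.5e-6)  # halving the gap

print(f"F/A at 1 um:   {p_1um:.2e} Pa")
print(f"F/A at 0.5 um: {p_half:.2e} Pa")
```

Halving the separation multiplies the magnitude by 16, which is the d⁻⁴ scaling in the formula.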
The Protocol

Four layers of independent judgment.

No model judges itself or its own family. No final-answer check stands alone. Every intermediate step is stress-tested, and flagged errors are confirmed by a second adversary from yet another family.

01
Generation
Seven frontier LLMs (Claude, GPT, Gemini, Grok, DeepSeek, Qwen, o3-mini) each produce a full derivation independently. No model sees any other's response. Identical prompt. Identical parameters.
02
Decomposition
A universal decomposer (Claude Opus 4.6) breaks each derivation into atomic, individually verifiable claims — preserving the original author's errors verbatim. Every claim is tagged by type: definition, derivation step, invoked result, assumption.
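One way to picture a decomposed claim is as a small typed record. The sketch below is an assumption about shape only: the class names, field names, and the three sample claims are hypothetical and are not taken from the published dataset. Only the four claim types come from the description above.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class ClaimType(Enum):
    """The four claim tags named in the decomposition step."""
    DEFINITION = "definition"
    DERIVATION_STEP = "derivation step"
    INVOKED_RESULT = "invoked result"
    ASSUMPTION = "assumption"


@dataclass(frozen=True)
class AtomicClaim:
    """Hypothetical record layout for one atomic claim."""
    problem_id: str       # e.g. "001"
    contestant: str       # model that wrote the derivation
    index: int            # position of the claim within the derivation
    claim_type: ClaimType
    text: str             # kept verbatim, including any author errors


# Made-up sample claims, for illustration only.
claims = [
    AtomicClaim("001", "contestant-A", 1, ClaimType.ASSUMPTION,
                "The metric is static and spherically symmetric."),
    AtomicClaim("001", "contestant-A", 2, ClaimType.INVOKED_RESULT,
                "Birkhoff's theorem gives uniqueness of the vacuum solution."),
    AtomicClaim("001", "contestant-A", 3, ClaimType.DERIVATION_STEP,
                "g_tt vanishes at r = 2GM/c^2."),
]

by_type = Counter(claim.claim_type for claim in claims)
print({t.value: n for t, n in by_type.items()})
```

Making the record frozen reflects the "verbatim" requirement: once a claim is extracted, later stages read it but do not rewrite it.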
04
Adversarial Review
Two adversaries from different model families stress-test each claim and assign one of four verdicts: CONFIRMED, ERROR, UNJUSTIFIED, or INCOMPLETE. No model adversarially reviews a claim from its own family. Verdicts are parsed, validated, and preserved.
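The two rules in this step, family exclusion and verdict validation, can be sketched in a few lines. Everything named here is hypothetical: the model and family labels are placeholders, and `parse_verdict` and `eligible_reviewers` are illustrative helpers, not functions from the project's code. Only the four verdict labels come from the text above.

```python
from enum import Enum


class Verdict(Enum):
    CONFIRMED = "CONFIRMED"
    ERROR = "ERROR"
    UNJUSTIFIED = "UNJUSTIFIED"
    INCOMPLETE = "INCOMPLETE"


# Placeholder mapping from model to family (assumed, not the real roster).
FAMILY_OF = {
    "model-a": "family-1",
    "model-a2": "family-1",
    "model-b": "family-2",
    "model-c": "family-3",
}


def parse_verdict(raw: str) -> Verdict:
    """Normalize a raw label and reject anything outside the four verdicts."""
    label = raw.strip().upper()
    try:
        return Verdict[label]
    except KeyError:
        raise ValueError(f"unrecognized verdict: {raw!r}") from None


def eligible_reviewers(contestant: str, family_of: dict[str, str]) -> list[str]:
    """Return models allowed to review this contestant: any other family."""
    own_family = family_of[contestant]
    return sorted(m for m, fam in family_of.items() if fam != own_family)


print(parse_verdict(" error "))
print(eligible_reviewers("model-a", FAMILY_OF))
```

Rejecting unknown labels outright, rather than mapping them to a default, keeps a malformed reviewer response from silently counting as a verdict.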
04b
Cross-Family Confirmation
Every ERROR verdict gets re-judged by a third adversary from a family different from both the contestant and the original reviewer. An error counts as real only if both the original reviewer and this confirming adversary mark it ERROR.
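The confirmation rule and the per-problem rates reported in the series can be expressed as a short sketch. The names `Review`, `is_confirmed_error`, and `confirmation_rate` are hypothetical; the flagged and confirmed counts are the ones published above for problems 001 to 003, with the rate taken as confirmed divided by flagged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    """Hypothetical record of one adversary's verdict on a claim."""
    reviewer_family: str
    verdict: str  # one of CONFIRMED, ERROR, UNJUSTIFIED, INCOMPLETE


def is_confirmed_error(contestant_family: str, first: Review, second: Review) -> bool:
    """An error counts only if both reviews say ERROR and all families differ."""
    families = {contestant_family, first.reviewer_family, second.reviewer_family}
    if len(families) != 3:
        raise ValueError("contestant, reviewer and confirmer must be from different families")
    return first.verdict == "ERROR" and second.verdict == "ERROR"


def confirmation_rate(flagged: int, confirmed: int) -> float:
    """Percentage of flagged claims that survived confirmation."""
    return 100.0 * confirmed / flagged


# (flagged, confirmed) as published for each problem.
PUBLISHED = {"001": (14, 4), "002": (21, 7), "003": (20, 6)}
rates = {pid: f"{confirmation_rate(f, c):.1f}%" for pid, (f, c) in PUBLISHED.items()}
print(rates)  # {'001': '28.6%', '002': '33.3%', '003': '30.0%'}
```

Under this rule a flag that the second adversary downgrades to UNJUSTIFIED or INCOMPLETE is not counted as an error, which is why the confirmed counts are well below the flagged counts.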
03
Symbolic Verification (Planned)
SymPy-based formal verification of algebraic claims: Christoffel symbols, Ricci tensor components, dimensional consistency, index contractions. The bridge between LLM consensus and mathematical correctness. Under active development.
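Of the checks listed, dimensional consistency is the simplest to illustrate. The sketch below is not the planned SymPy layer: it swaps in plain Python, tracking exponents of mass, length, time, and temperature by hand, to show the kind of claim-level check involved. The `Dim` class and all constant names are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dim:
    """Exponents of (mass, length, time, temperature) for a physical quantity."""
    mass: int = 0
    length: int = 0
    time: int = 0
    temp: int = 0

    def __mul__(self, o: "Dim") -> "Dim":
        return Dim(self.mass + o.mass, self.length + o.length,
                   self.time + o.time, self.temp + o.temp)

    def __truediv__(self, o: "Dim") -> "Dim":
        return Dim(self.mass - o.mass, self.length - o.length,
                   self.time - o.time, self.temp - o.temp)

    def __pow__(self, n: int) -> "Dim":
        return Dim(self.mass * n, self.length * n, self.time * n, self.temp * n)


MASS, LENGTH, TIME, TEMPERATURE = Dim(mass=1), Dim(length=1), Dim(time=1), Dim(temp=1)
VELOCITY = LENGTH / TIME
ACCELERATION = LENGTH / TIME**2
PRESSURE = MASS / (LENGTH * TIME**2)
G_NEWTON = LENGTH**3 / (MASS * TIME**2)          # m^3 kg^-1 s^-2
HBAR = MASS * LENGTH**2 / TIME                   # J s
K_B = MASS * LENGTH**2 / (TIME**2 * TEMPERATURE) # J/K

# Numeric prefactors (2, 2*pi, pi^2/240) are dimensionless and omitted.
checks = {
    "r_s = 2GM/c^2 is a length": G_NEWTON * MASS / VELOCITY**2 == LENGTH,
    "T = hbar a/(2 pi c k_B) is a temperature":
        HBAR * ACCELERATION / (VELOCITY * K_B) == TEMPERATURE,
    "F/A = -pi^2 hbar c/(240 d^4) is a pressure":
        HBAR * VELOCITY / LENGTH**4 == PRESSURE,
    # A deliberately wrong claim: dropping one power of c breaks the check.
    "r_s = 2GM/c is a length": G_NEWTON * MASS / VELOCITY == LENGTH,
}
for claim, ok in checks.items():
    print("PASS" if ok else "FAIL", claim)
```

A check like this can only reject a claim, never prove it: an expression with the right dimensions can still carry a wrong numerical factor or sign.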
The Path Forward

From consensus to verification.

Multivac is an evaluation platform, not a benchmark. Independent cross-family judgment is the mechanism. Symbolic verification strengthens the signal. Expert review grounds it. More problems test its reach.

Now · Live

Cross-family adversarial confirmation

Seven contestants, two-stage adversarial stress-testing, parse-validated verdicts. Three graduate physics problems published with full raw data on GitHub.

Next · Weeks

Layer 3 — Symbolic verification

SymPy integration for algebra, dimensional analysis, tensor manipulation. Claim-level machine-checking. Moves us from inter-LLM agreement toward formal correctness on verifiable steps.

Later · Months

Expert review integration

Domain physicists sign derivations with cryptographic provenance. Experimental databases linked to measurable claims. Open preprint companion on arXiv (cs.AI).

The Data is Open

Audit it. Break it. Build on it.

Every Layer 1 response, every decomposed claim, every adversarial verdict, every confirmation — on GitHub under MIT license. If you find an error in our methodology or our physics, we would rather be corrected than be wrong.

The last question was asked for the first time, half in jest, on May 21, 2061, at a time when humanity first stepped into the light.

— Isaac Asimov, “The Last Question” (1956)