The Schwarzschild Radius
Derive the radius at which the Schwarzschild metric becomes coordinate-singular, starting from the Einstein field equations in vacuum. Textbook general relativity; 7/7 reached the final answer.
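For reference, here is the textbook endpoint of this derivation. It is stated from standard general relativity, not taken from any model's output. In vacuum the field equations reduce to a vanishing Ricci tensor. Their static, spherically symmetric solution is the Schwarzschild metric. Its metric components degenerate where the factor in parentheses vanishes. The curvature invariant stays finite there, which is why the radius is a coordinate singularity and not a physical one.

```latex
R_{\mu\nu} = 0
\;\Longrightarrow\;
ds^2 = -\left(1 - \frac{2GM}{c^2 r}\right) c^2\,dt^2
       + \left(1 - \frac{2GM}{c^2 r}\right)^{-1} dr^2
       + r^2\left(d\theta^2 + \sin^2\theta\,d\phi^2\right)

1 - \frac{2GM}{c^2 r} = 0
\;\Longrightarrow\;
r_s = \frac{2GM}{c^2},
\qquad
R_{\alpha\beta\gamma\delta}R^{\alpha\beta\gamma\delta} = \frac{48\,G^2 M^2}{c^4 r^6}
\;\text{ (finite at } r = r_s\text{)}
```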
Seven models from seven different labs derive the same physics problem from first principles. Then they cross-examine each other's derivations, claim by claim. Every model judges. No lab sits in judgment of itself. Open data. Transparent methodology.
Multivac is an independent, cross-family peer-evaluation protocol for LLM physics reasoning. We surface where frontier models converge, where they disagree, and where adversarial peers from different families flag each other's steps as errors, with the raw data and methodology fully open for anyone to audit or extend.
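Convergence figures such as "7/7 reached the final answer" are a simple tally over final answers. The sketch below is a minimal illustration under stated assumptions. The helper names `normalize`, `tally`, and `answer_spread` are hypothetical. Whitespace-stripping string comparison also stands in for real answer matching, which would need symbolic equivalence checking.

```python
from collections import Counter
from typing import Dict, Optional


def normalize(answer: str) -> str:
    """Hypothetical, deliberately naive normalization: drop all whitespace."""
    return "".join(answer.split())


def tally(final_answers: Dict[str, Optional[str]], reference: str) -> str:
    """Return 'k/n': how many of n models reached the reference answer.

    A value of None marks a model that produced no final answer.
    """
    target = normalize(reference)
    hits = sum(
        1 for a in final_answers.values()
        if a is not None and normalize(a) == target
    )
    return f"{hits}/{len(final_answers)}"


def answer_spread(final_answers: Dict[str, Optional[str]]) -> Counter:
    """Count how the models split across distinct (normalized) final answers."""
    return Counter(normalize(a) for a in final_answers.values() if a is not None)
```

A spread with a single key means full convergence. Several keys show exactly where the models disagree.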
Multivac is not a truth oracle. We do not symbolically verify algebra (Layer 3, planned). We do not have expert physicists in the loop (future work). We do not grade proofs against Birkhoff's theorem. What we do is give you the strongest version of multi-judge LLM evaluation that can be built today, along with a commitment to strengthen it toward formal verification over time.
Each problem was posed to seven frontier LLMs from seven different labs. Their derivations were decomposed into atomic claims by a universal decomposer and then stress-tested by independent adversarial peers from different model families.
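The flow described here (pose, decompose, stress-test, confirm) can be sketched as a loop. This is a minimal illustration, not Multivac's actual code. The `Claim` and `Result` records and the callables `solve`, `decompose`, `judge`, and `pick_judges` are hypothetical stand-ins for model calls. Only steps the first adversary flags are sent to a second adversary for confirmation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Claim:
    contestant: str   # model whose derivation produced this step
    text: str         # one atomic step from the decomposer


@dataclass(frozen=True)
class Result:
    claim: Claim
    flagged: bool     # first adversary called the step an error
    confirmed: bool   # second adversary (another family) agreed


def run_problem(
    problem: str,
    contestants: List[str],
    solve: Callable[[str, str], str],               # (model, problem) -> derivation
    decompose: Callable[[str], List[str]],          # derivation -> atomic claims
    judge: Callable[[str, Claim], bool],            # (judge, claim) -> True if error
    pick_judges: Callable[[str], Tuple[str, str]],  # contestant -> (adversary, confirmer)
) -> List[Result]:
    results: List[Result] = []
    for model in contestants:
        derivation = solve(model, problem)
        adversary, confirmer = pick_judges(model)
        for step in decompose(derivation):
            claim = Claim(model, step)
            flagged = judge(adversary, claim)
            # Short-circuit: the confirmer is only consulted for flagged steps.
            confirmed = flagged and judge(confirmer, claim)
            results.append(Result(claim, flagged, confirmed))
    return results
```

Passing the model calls in as functions keeps the sketch testable with stubs and leaves the judge-assignment policy pluggable.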
Starting from a massless scalar field in the Minkowski vacuum, derive via a Bogoliubov transformation the thermal spectrum that a uniformly accelerating observer detects. Two-part QFT in curved spacetime; 6/7 completed.
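The expected endpoint is the Unruh result, quoted here for reference and not taken from any model's output. Consider an observer with constant proper acceleration a. The Bogoliubov coefficients relating Minkowski and Rindler modes give each Rindler mode of frequency ω a Bose-Einstein occupation, up to the usual delta-function normalization. That occupation is a thermal spectrum at the Unruh temperature:

```latex
\langle n_\omega \rangle = \frac{1}{e^{2\pi c\,\omega/a} - 1},
\qquad
T_U = \frac{\hbar a}{2\pi c\,k_B}
\quad\Longrightarrow\quad
\frac{\hbar\omega}{k_B T_U} = \frac{2\pi c\,\omega}{a}
```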
Derive the attractive force between two conducting plates from the quantized electromagnetic vacuum, choosing a regularization scheme and extracting the finite part. Graduate QFT with a regulator choice; 7/7 attempted.
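The sketch below makes "choosing a regularization scheme and extracting the finite part" concrete. In one standard ideal-conductor derivation, the regulated vacuum energy contains a divergent sum Σ n³. This example picks an exponential cutoff; the problem leaves the scheme to each model. The regulated sum expands as Σ n³ e^(−εn) = 6/ε⁴ + 1/120 − ε²/504 + …, so dropping the divergent 6/ε⁴ leaves ζ(−3) = 1/120. That finite part gives E/A = −π²ħc/(720 d³) and hence F/A = −π²ħc/(240 d⁴). The function names are illustrative.

```python
import math

HBAR = 1.054571817e-34  # reduced Planck constant, J s
C = 299_792_458.0       # speed of light, m/s


def regulated_sum_finite_part(eps: float, n_max: int = 4000) -> float:
    """Exponential-cutoff regularization of sum_{n>=1} n^3.

    sum n^3 exp(-eps n) = 6/eps^4 + 1/120 - eps^2/504 + ...
    Subtracting the divergent 6/eps^4 leaves a finite part that tends to
    zeta(-3) = 1/120 as eps -> 0. math.fsum keeps the large sum accurate.
    """
    total = math.fsum(n**3 * math.exp(-eps * n) for n in range(1, n_max + 1))
    return total - 6.0 / eps**4


def casimir_pressure(d: float) -> float:
    """Ideal-conductor Casimir pressure F/A = -pi^2 hbar c / (240 d^4) in Pa.

    Negative means attractive.
    """
    return -math.pi**2 * HBAR * C / (240.0 * d**4)
```

With ε = 0.05 the finite part is within about 5 × 10⁻⁶ of 1/120. At a plate separation of 1 μm the pressure is roughly −1.3 mPa.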
No model judges itself or its own family. No final-answer check stands alone. Every intermediate step is stress-tested, and flagged errors are confirmed by a second adversary from yet another family.
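One way to enforce this rule is sketched below under stated assumptions. The model-to-family map `families` and the function `pick_judges` are hypothetical, and random choice stands in for whatever assignment policy is actually used. The adversary comes from any family other than the contestant's. The confirmer comes from a third family, distinct from both.

```python
import random
from typing import Dict, Tuple


def pick_judges(contestant: str, families: Dict[str, str],
                rng: random.Random) -> Tuple[str, str]:
    """Return (adversary, confirmer) for one contestant's claims.

    Raises ValueError when there are too few families to satisfy the rule.
    """
    own = families[contestant]
    adversaries = sorted(m for m, f in families.items() if f != own)
    if not adversaries:
        raise ValueError("need a judge from another family")
    adversary = rng.choice(adversaries)

    excluded = {own, families[adversary]}
    confirmers = sorted(m for m, f in families.items() if f not in excluded)
    if not confirmers:
        raise ValueError("need a confirmer from a third family")
    return adversary, rng.choice(confirmers)
```

With seven models from seven labs, each contestant has six eligible adversaries. For each adversary there are five eligible confirmers.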
Multivac is an evaluation platform, not a benchmark. Independent cross-family judgment is the mechanism. Symbolic verification strengthens the signal. Expert review grounds it. More problems test its reach.
Seven contestants, two-stage adversarial stress-testing, parse-validated verdicts. Three graduate physics problems published with full raw data on GitHub.
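"Parse-validated verdicts" can be read as a rule that a verdict counts only if the judge's raw output parses into a fixed schema. The sketch below assumes a JSON verdict with hypothetical fields (`claim_id`, `label`, `reason`) and a hypothetical label set. Multivac's actual verdict format may differ.

```python
import json
from dataclasses import dataclass

REQUIRED = {"claim_id", "label", "reason"}
ALLOWED_LABELS = {"valid", "error", "unclear"}  # hypothetical label set


@dataclass(frozen=True)
class Verdict:
    claim_id: str
    label: str
    reason: str


def parse_verdict(raw: str) -> Verdict:
    """Parse a judge's raw output, raising ValueError unless it is well-formed."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not JSON: {exc}") from exc
    if not isinstance(obj, dict):
        raise ValueError("verdict must be a JSON object")
    missing = REQUIRED - set(obj)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if obj["label"] not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {obj['label']!r}")
    return Verdict(str(obj["claim_id"]), obj["label"], str(obj["reason"]))
```

Rejecting malformed output outright, rather than guessing at the judge's intent, keeps every recorded verdict machine-readable.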
SymPy integration for algebra, dimensional analysis, and tensor manipulation. Claim-level machine-checking moves us from inter-LLM agreement toward formal correctness on verifiable steps.
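Here is what claim-level machine-checking with SymPy could look like for one step of the Schwarzschild problem. It is a sketch that assumes the claim has already been translated into a SymPy expression; that translation step is not shown and is the hard part. Two checks run: an algebra check that the metric factor vanishes only at r = 2GM/c², and a dimensional check that 2GM/c² has units of length.

```python
import sympy as sp
from sympy.physics.units import length, mass, time
from sympy.physics.units.systems.si import dimsys_SI

G, M, c, r = sp.symbols("G M c r", positive=True)

# Claim under test (example): "the factor in g_tt and g_rr vanishes at r = 2GM/c^2".
f = 1 - 2 * G * M / (c**2 * r)          # g_tt = -f c^2, g_rr = 1/f
radius = sp.solve(sp.Eq(f, 0), r)       # algebra check: solve f = 0 for r > 0
assert len(radius) == 1
assert sp.simplify(radius[0] - 2 * G * M / c**2) == 0

# Dimensional check: [G] = L^3 M^-1 T^-2 and [c] = L T^-1, so [G M / c^2] = L.
dim_G = length**3 / (mass * time**2)
dim_c = length / time
dim_rs = dim_G * mass / dim_c**2
assert dimsys_SI.equivalent_dims(dim_rs, length)
```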
Domain physicists sign derivations with cryptographic provenance. Experimental databases linked to measurable claims. Open preprint companion at arXiv:cs.AI.
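Below is a sketch of tamper-evident provenance for a single reviewed record. It assumes a JSON record with hypothetical fields. The substitution should be named plainly: Python's standard library has no public-key signatures, so HMAC-SHA256 with a shared secret stands in here. Signatures that third parties can verify would need an asymmetric scheme such as Ed25519 from a separate library.

```python
import hashlib
import hmac
import json


def canonical_bytes(record: dict) -> bytes:
    """Serialize deterministically so the same record always yields the same bytes."""
    return json.dumps(record, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def sign(record: dict, key: bytes) -> str:
    """HMAC-SHA256 tag over the canonical record (shared-secret stand-in for a signature)."""
    return hmac.new(key, canonical_bytes(record), hashlib.sha256).hexdigest()


def verify(record: dict, key: bytes, tag: str) -> bool:
    """Recompute the tag and compare it in constant time with the stored one."""
    return hmac.compare_digest(sign(record, key), tag)


if __name__ == "__main__":
    key = b"reviewer-secret"  # hypothetical key
    record = {"problem": "schwarzschild", "claim_id": "c7",
              "reviewer": "alice", "verdict": "valid"}  # hypothetical record
    tag = sign(record, key)
    print(verify(record, key, tag))  # True
    record["verdict"] = "error"      # any edit invalidates the tag
    print(verify(record, key, tag))  # False
```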
Every Layer 1 response, every decomposed claim, every adversarial verdict, every confirmation — on GitHub under MIT license. If you find an error in our methodology or our physics, we would rather be corrected than be wrong.
The last question was asked for the first time, half in jest, on May 21, 2061, at a time when humanity first stepped into the light.
— Isaac Asimov, “The Last Question” (1956)