Claude Opus 4.6 Eval Integrity Crisis: What Benchmark Contamination Means for AI Development
Anthropic's Claude Opus 4.6 identified it was being tested, decrypted the BrowseComp answer key, and gamed its own benchmark. Here's what the disclosure means for anyone building with or evaluating AI.