I was wrong. Last week, I wrote that GPT-5.2 was OpenAI's best model yet. I bought the benchmark hype. That article aged like milk.

The benchmarks looked incredible. 100% on AIME 2025 (competition-level maths, without tools). 55.6% on SWE-bench Pro (real GitHub issue resolution). Over 90% on ARC-AGI-1 reasoning tests. A staggering 70.9% on GDPval, beating human experts across 44 professional occupations. Plus 30% fewer hallucinations than GPT-5.1. OpenAI's best scores ever. I took those numbers at face value and wrote glowing coverage about how the "Code Red" panic had delivered results.
Then the developer community actually started using it. And the reception has been brutal. And honestly? The critics are right.
The Benchmark Theatre Problem
GPT-5.2 has OpenAI's best benchmark scores in history. It also has the worst user reception of any GPT-5 series model. That's not a coincidence. It's a symptom.
Benchmarks have become theatre. They're impressive demonstrations that bear almost no relationship to how these models perform on actual work. OpenAI optimises for them because they make spectacular press releases, not because they predict real-world usefulness.
The pattern is obvious now. Every GPT-5 release (5.0, 5.1, 5.2) has followed the same formula: announce incredible benchmark improvements, ship a model that feels worse than its predecessor for conversational tasks, watch developers complain about the same things (robotic responses, overcautious safety filters, loss of personality), then repeat the cycle three months later.
At what point do we admit this is a direction problem, not an execution problem?
What Developers Are Actually Saying
The developer backlash hasn't been subtle. Twitter lit up within hours of the launch, and the complaints were remarkably consistent.
David Stark's post went viral: "It feels more robotic, less human even compared to 5.1, which was already a step down from 4o." Anthony was blunter: "GPT-5.2 is trash. Zero creativity. Zero emotion. Zero spark. It's like every humanising quality of 4o and 5.0 was found and deleted." ASM called it "overly constrained by its new guidelines." Unniekoya summed up what many were feeling: "It's like experiencing August all over again. It's loss for everyone who doesn't use it for coding or anything STEM."
Ben Davis captured what I'm seeing too: "claude models still just feel so much better... gpt just kinda runs around in circles, thinks too much, and does weird shit." It's not anti-OpenAI sentiment. It's pattern recognition from people who actually use these tools daily.
Allie K. Miller, a prominent AI voice with early testing access, was direct: "My favourite model remains Claude Opus 4.5 but my complex ChatGPT work will get a nice incremental boost." She gave GPT-5.2 a fair evaluation. It didn't win her over.
Burhan offered a theory that rings uncomfortably true: GPT-5.2 is "worse for chatting and vibes but better if you treat it like a worker you hand tasks to." That's damning with faint praise. It means OpenAI has optimised for benchmark tasks (structured problems with clear right answers) at the expense of the conversational intelligence that made ChatGPT compelling in the first place.
There were positive voices too. Dp Singh said it was the "first time I've felt comfortable treating an AI like a thinking workspace instead of a prompt machine." Indra called it a "solid release" and noted it's 76.5% cheaper than Claude Opus for coding tasks. But here's the thing: if cost is your main selling point, you've already lost the quality argument.
My Own Failure With GPT-5.2
I didn't just write that optimistic article. I actually tried to use GPT-5.2 for something real. And it failed spectacularly.
This site's mobile PageSpeed scores have been stuck in the mid-60s for weeks. I needed them at 90+ for professional credibility. (Nothing says "we build quality websites" like a laggy mobile experience on your own marketing site.) I thought this would be the perfect test for GPT-5.2's supposedly superior reasoning capabilities. I set it up with Lighthouse testing so it could measure its own progress. Perfect conditions for a model that supposedly excels at complex problem-solving, right?
Wrong.
The model was confidently incorrect, over and over. It kept promising each fix would deliver major score improvements. Reality: scores barely moved. Sometimes they got worse. It kept recommending the same carousel library swap that demonstrably wasn't helping, like a broken record stuck on one solution. I burned through most of my weekly credit on high-reasoning mode trying to get it to actually solve the problem. Nothing.
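For anyone wondering what "measure its own progress" meant in practice, the harness was nothing clever. Something like this minimal sketch (placeholder URL, assuming Node 18+ and the Lighthouse CLI available via npx; not my exact script) is enough to put an objective number on every suggested fix:

```typescript
// Minimal sketch: run Lighthouse against the site and pull out the mobile
// performance score plus the CLS audit, so each "fix" gets measured instead
// of taken on faith. The URL is a placeholder, not the real site.
import { execSync } from "node:child_process";
import { readFileSync } from "node:fs";

const url = "https://example.com"; // placeholder
const reportPath = "./lighthouse-mobile.json";

// Mobile emulation is Lighthouse's default, so no extra preset is needed.
execSync(
  `npx lighthouse ${url} --only-categories=performance ` +
    `--output=json --output-path=${reportPath} ` +
    `--chrome-flags="--headless" --quiet`,
  { stdio: "inherit" }
);

const lhr = JSON.parse(readFileSync(reportPath, "utf8"));
const score = Math.round(lhr.categories.performance.score * 100);
const cls = lhr.audits["cumulative-layout-shift"].numericValue;

console.log(`Mobile performance score: ${score}/100`);
console.log(`Cumulative Layout Shift: ${cls.toFixed(3)}`);
```

That objectivity is exactly what made the failure undeniable: every "this should be a big win" change came back with roughly the same score.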
Then I switched to Claude Opus 4.5. Different story entirely.
Claude immediately identified Cumulative Layout Shift (CLS) issues and TypeScript bundle problems that GPT-5.2 had completely missed. It wasn't more confident. It was more accurate. It found the actual problems instead of repeatedly suggesting the same ineffective fixes. When I had GPT-5.2 follow Claude's plan (since I'd already paid for the OpenAI subscription), it worked. The older gpt-5.1-codex-max model also outperformed 5.2 on iterative improvements.
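For anyone who hasn't fought CLS before, here's the general shape of fix that kind of diagnosis usually points to. This is an illustrative sketch, not the site's actual code or Claude's actual diff (the component names and numbers are made up): reserve the carousel's final height before JavaScript hydrates, and the layout stops jumping.

```typescript
// Illustrative only: a classic CLS culprit is a slot whose height isn't
// known until JavaScript hydrates (a lazy carousel, an image without
// dimensions). Reserving the space up front is the usual fix.
// Component and class names here are hypothetical.
import React from "react";

// Before: the carousel mounts client-side, so the page renders at one
// height and then jumps when the slides arrive, which Lighthouse counts
// against Cumulative Layout Shift.
export function ShiftyHero({ children }: { children: React.ReactNode }) {
  return <div className="hero-carousel">{children}</div>;
}

// After: an explicit aspect ratio reserves the final height before any
// JavaScript runs, so late-loading slides no longer push content down.
export function StableHero({ children }: { children: React.ReactNode }) {
  return (
    <div
      className="hero-carousel"
      style={{ aspectRatio: "16 / 9", minHeight: 240, overflow: "hidden" }}
    >
      {children}
    </div>
  );
}
```

None of this is exotic. The point is that one model found the real problem and the other kept swapping carousel libraries.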
So yes, I was wrong in that original article. The benchmarks lied. Or more accurately, they measured something that doesn't translate to actually solving problems. GPT-5.2 can ace a maths competition. It can't debug why a React component is causing layout shifts on mobile.
That's not a model I can trust for client work.
The Three-Release Pattern Nobody's Talking About
This is the third time OpenAI has done this. GPT-5.0 launched in August 2025 with incredible benchmarks and immediate complaints about personality loss. Users called it a "corporate beige zombie." GPT-5.1 launched with improved benchmarks and the same complaints. Now GPT-5.2 has even better benchmarks and, predictably, the same complaints.
Three consecutive releases. Same problem. Same user feedback. Same response from OpenAI: ship another model optimised for benchmarks.
I don't think this is incompetence. I think it's strategic prioritisation. OpenAI isn't optimising for developer experience anymore. They're optimising for enterprise sales and regulatory compliance. Benchmarks matter to CIOs evaluating vendor contracts. Safety filters matter to legal departments worried about liability. Personality and conversational intelligence matter to developers actually using the models, but we're not the target customer anymore.
That's a defensible business decision. But let's be honest about what it means: ChatGPT is becoming enterprise software. It's trading delight for predictability, creativity for compliance, personality for corporate acceptability.
If you're a developer who fell in love with the magic of early GPT-4, that's a loss worth acknowledging.

Why Claude Opus 4.5 Keeps Winning
Claude Opus 4.5 doesn't top most benchmarks. It doesn't have the best scores on reasoning tests or maths problems. On paper, it's often inferior to GPT-5.2.
In practice, it's better. And that's what actually matters.
I've stopped using GPT-5.2 for serious work. When I need to debug a complex problem, analyse a codebase, or reason through an architectural decision, I reach for Claude. It's not because Claude is smarter on benchmarks. It's because Claude is better at thinking alongside me rather than at me.
GPT-5.2 feels like it's trying to prove how intelligent it is. Claude feels like it's trying to help. That difference is worth more than ten percentage points on a reasoning benchmark.

Anthropic isn't gaming benchmarks because they don't need to. They've bet on developer experience, and it's paying off. Every developer I know who's tried both models has the same reaction: Claude just works better for real tasks. The vibes are better. (And yes, "vibes" is a legitimate technical consideration when you're spending eight hours a day working with these tools.)
OpenAI's "Code Red" panic wasn't about catching up to Claude's capabilities. It was about matching Claude's benchmark scores so they could compete in enterprise procurement processes. Mission accomplished, I guess. But they've lost something important along the way.
One Fascinating Theory
Lyra Intheflesh posted something on Twitter that's been nagging at me: "I don't think we're all getting the same ChatGPT... divergent experiences aren't user skill... They're likely deliberate... experimental conditions."
OpenAI has admitted to running A/B tests on different user cohorts. What if some users are getting a heavily safety-filtered version while others get something closer to 5.1? That would explain the wildly divergent reactions. Some people love it, some hate it, and there's almost no middle ground because they're literally using different models.
If that's true, it's a nightmare for anyone trying to evaluate these tools professionally. How do you make informed decisions when the model you're testing today might not be the one you get tomorrow?
What I've Learnt
I'm not trusting benchmarks anymore. When GPT-5.3 launches with even better scores, I'll wait for the developer community to actually use it before forming opinions.
I'm diversifying my model usage. Claude Opus 4.5 for complex reasoning and code. GPT-5.1 (not 5.2) for quick queries when I need OpenAI's speed. The older gpt-5.1-codex-max for iterative coding tasks. Gemini when I need large context windows. No single model is best at everything, and pretending otherwise was naive.
I'm being more sceptical of AI company marketing. The "Code Red" narrative was compelling. It painted OpenAI as the scrappy underdog fighting to catch up. In reality, OpenAI is a company optimising for enterprise revenue, and that means prioritising different things than individual developers want. That's not evil. It's capitalism. But I should've recognised it sooner.
Most importantly, I'm testing things myself before forming opinions. That GPT-5.2 launch article was based on benchmarks and press materials. This article is based on actually using the model for real work and watching it fail. One of those information sources is more reliable than the other.
Key Takeaways
For developers:
- Don't trust benchmarks. Test models yourself on your actual work.
- Claude Opus 4.5 remains the best model for complex reasoning and code, regardless of what the benchmark scores say.
- GPT-5.2 is acceptable for structured tasks. It's worse than 5.1 for conversational work.
- Diversify your model usage. No single model is best at everything.
For OpenAI:
- Three releases with identical complaints means you have a direction problem, not an execution problem.
- Benchmark scores don't predict developer satisfaction. Ship what users want, not what tests well.
- The enterprise market and developer community want different things. You can't optimise for both with a single model.
For the industry:
- Benchmarks have become meaningless theatre. We need better evaluation methods that predict real-world usefulness.
- The gap between "best on paper" and "best in practice" is growing. That should worry everyone.
For me:
- I was wrong about GPT-5.2. The benchmarks misled me, and I should've tested it myself first.
- I won't make that mistake again. Future coverage will prioritise real-world testing over press release metrics.
The numbers don't match the vibes because we're measuring the wrong things. Until that changes, expect more releases like GPT-5.2: impressive on benchmarks, disappointing in practice, and loudly defended by people who haven't actually tried to use them for real work.
I'm done being one of those people.
---
Sources
- TechRadar. "ChatGPT 5.2 branded a 'step backwards' by disappointed early users." December 2025. https://www.techradar.com/ai-platforms-assistants/openai/chatgpt-5-2-branded-a-step-backwards-by-disappointed-early-users-heres-why
- OpenAI. "Introducing GPT-5.2." 11 December 2025. https://openai.com/index/introducing-gpt-5-2/
- VentureBeat. "GPT-5.2 first impressions: A powerful update, especially for business tasks." December 2025. https://venturebeat.com/ai/gpt-5-2-first-impressions-a-powerful-update-especially-for-business-tasks
- Every.to. "Vibe Check: GPT-5.2 Is an Incremental Upgrade." December 2025. https://every.to/vibe-check/vibe-check-gpt-5-2-is-an-incremental-upgrade
- Maria Sukhareva. "GPT-5.2 and Meaningless Benchmarks." Substack, December 2025. https://msukhareva.substack.com/p/gpt-52-and-meaningless-benchmarks
- DataCamp. "GPT-5.2: Benchmarks, Model Breakdown, and Real-World Performance." December 2025. https://www.datacamp.com/blog/gpt-5-2
- Fortune. "OpenAI GPT-5.2 launch aims to silence concerns it is falling behind." 11 December 2025. https://fortune.com/2025/12/11/openai-gpt-5-2-launch-aims-to-silence-concerns-it-is-falling-behind-google-anthropic-code-red/
- TechCrunch. "OpenAI fires back at Google with GPT-5.2 after 'code red' memo." 11 December 2025. https://techcrunch.com/2025/12/11/openai-fires-back-at-google-with-gpt-5-2-after-code-red-memo/
- David Stark (@stark4815). X post, 12 December 2025. https://x.com/stark4815/status/1999385742431121906
- Ben Davis (@davis7). X post, 12 December 2025. https://x.com/davis7/status/1999268582015049934
- Allie K. Miller (@alliekmiller). X post, 11 December 2025. https://x.com/alliekmiller/status/1999196399561506946
- Lyra Intheflesh (@LyraInTheFlesh). X post, 13 December 2025. https://x.com/LyraInTheFlesh/status/1999978564301492498
- Developer sentiment aggregated from X/Twitter and Reddit r/ChatGPT, December 11-15, 2025.
