I woke up yesterday to absolute carnage in my notifications. Sometime around 5 AM Sydney time, while I was very much asleep, Anthropic had dropped Opus 4.6. About 20 minutes later, OpenAI fired back with GPT-5.3-Codex. By the time I got to my coffee, the AI community had already declared a winner, changed its mind twice, and moved on to memes.

Both companies had apparently planned to announce at 10 AM Pacific on 5 February. Anthropic jumped the gun, getting its blog post out first. OpenAI followed within minutes. The actual model drops were roughly 20 minutes apart, depending on which source you trust (TechCrunch, February 2026).

My Slack channels were already a mess by the time I caught up. Twitter was unreadable. And somewhere in the chaos, I found myself trying to figure out the same question everyone else had been arguing about for hours: who actually won?

I've spent the last day pulling apart the benchmarks, pricing, features, and community reactions. The answer isn't as straightforward as either company's marketing would have you believe.

What Anthropic Shipped: Claude Opus 4.6

Anthropic came out swinging with what they're calling their most capable model to date. Three things stand out.

Agent Teams is the headline feature. It's a research preview that lets multiple Claude Code agents work in parallel, splitting complex tasks into independent subtasks and coordinating autonomously. One agent acts as the team lead, assigning work and synthesising results, while teammates each operate in their own context window (Anthropic, February 2026). Think of it like a small dev team, except they don't argue about tabs versus spaces. (In my experience, five to six tasks per teammate keeps things productive before the coordination overhead eats you alive.)
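
Anthropic hasn't published a standalone API for Agent Teams (it's a Claude Code research preview), but the underlying pattern is easy enough to sketch with the standard Anthropic Python SDK. What follows is a minimal conceptual sketch of the lead/teammate fan-out, not the actual feature; the model identifier is a placeholder assumption.

```python
# Conceptual sketch only: Agent Teams is a Claude Code research preview, not a
# public API. This mimics the lead/teammate pattern using the standard Anthropic
# Python SDK. The model ID below is a placeholder assumption.
import asyncio
from anthropic import AsyncAnthropic

client = AsyncAnthropic()   # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-opus-4-6"   # placeholder; use whatever ID Anthropic documents

async def teammate(subtask: str) -> str:
    """One 'teammate' working a single subtask in its own fresh context."""
    resp = await client.messages.create(
        model=MODEL,
        max_tokens=2048,
        messages=[{"role": "user", "content": subtask}],
    )
    return resp.content[0].text

async def team_lead(goal: str, subtasks: list[str]) -> str:
    """The 'lead' fans subtasks out in parallel, then synthesises the results."""
    results = await asyncio.gather(*(teammate(t) for t in subtasks))
    findings = "\n\n".join(f"[{i + 1}] {r}" for i, r in enumerate(results))
    resp = await client.messages.create(
        model=MODEL,
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"Goal: {goal}\n\nTeammate findings:\n{findings}\n\n"
                       "Synthesise these into a single plan.",
        }],
    )
    return resp.content[0].text

if __name__ == "__main__":
    plan = asyncio.run(team_lead(
        "Audit the billing service for race conditions",
        [
            "Review the payment retry logic for concurrency bugs",
            "Review the webhook handlers for duplicate processing",
            "Review the database transaction boundaries",
        ],
    ))
    print(plan)
```

The real feature handles the task assignment and context isolation for you; the point of the sketch is just the shape of it: parallel workers, each with its own context window, and one synthesis pass at the end.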

The 1 million token context window (in beta) is the second big deal. Previous Opus models topped out at 200K. This is a 5x jump, and the numbers back it up. On the MRCR v2 benchmark (8-needle), Opus 4.6 scores 93% at 256K context and 76% at the full 1M window (Digital Applied, February 2026). That's roughly 4-9x more reliable at pulling information from deep in long documents compared to Sonnet 4.5. For anyone doing financial analysis, legal review, or codebase audits, that matters.

One important caveat, though: the 1M window is currently API-only and gated behind Tier 4 access (which requires significant spend history). If you're on a claude.ai Pro or Max plan, you're still capped at 200K tokens. Enterprise gets 500K. So the headline "1M context" is real, but most people won't have access to it yet. Anthropic hasn't said when (or if) it'll roll out to consumer plans.
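
For what it's worth, Anthropic has gated long-context access behind API beta flags before, so if you do have Tier 4 access the call will probably look something like the sketch below. Both the model ID and the beta string are placeholder assumptions; check the Anthropic console and docs for the real values.

```python
# Sketch of opting into a long-context beta on the Anthropic API.
# The model ID and beta flag are placeholder assumptions, not confirmed values;
# eligibility (Tier 4) is whatever your Anthropic console says it is.
from anthropic import Anthropic

client = Anthropic()

response = client.beta.messages.create(
    model="claude-opus-4-6",         # placeholder model ID
    betas=["context-1m-opus-4-6"],   # placeholder beta flag
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": "Summarise the key risks in this 600K-token audit log: ...",
    }],
)
print(response.content[0].text)
```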

Microsoft 365 integration rounds out the enterprise play. Claude in Excel got better at handling long-running financial modelling tasks, and Claude in PowerPoint entered research preview for Max, Team, and Enterprise plans (IT Pro, February 2026). Anthropic also introduced adaptive thinking, which lets the model dynamically adjust reasoning depth based on task complexity. You can control this with an /effort parameter to trade off quality against speed and cost.

Pricing stayed at $5/$25 per million input/output tokens for the standard tier, with a long-context tier at $10/$37.50 for anything beyond 200K tokens. Available immediately on claude.ai, the API, GitHub Copilot, and all the major cloud platforms (Anthropic, February 2026).

What OpenAI Shipped: GPT-5.3-Codex

OpenAI's response was pointed. While Anthropic went broad with enterprise features, OpenAI went deep on coding.

The headline claim is wild: GPT-5.3-Codex is, according to OpenAI, "the first model that was instrumental in creating itself." The Codex team used early versions of the model to debug its own training, manage its own deployment, and diagnose test results during development (OpenAI, February 2026).

That sentence gave me pause, and not in the way OpenAI intended. Not because I think we're headed for Skynet. But because there's a meaningful difference between "the model helped with some of its own development tooling" and "the model created itself." OpenAI's framing is clever marketing. The reality is probably closer to a very capable autocomplete tool being used during the development process. Still impressive. Just not quite the self-bootstrapping AI some people are reading into it.

The model's 25% faster than GPT-5.2-Codex and uses fewer output tokens to get results, which matters a lot when you're running agentic coding workflows that chain dozens of calls together. Context window sits at 400K tokens, optimised for long-horizon agentic tasks (Digital Applied, February 2026).

There's a cybersecurity angle that deserves attention. GPT-5.3-Codex is the first model OpenAI has classified as "High" for cybersecurity capability under their Preparedness Framework. It scored 77.6% on cybersecurity CTF challenges, up from 67.4% on the previous version. That triggered their most comprehensive safety deployment, including trusted-access controls and a $10 million commitment in API credits for cyber defence research (Fortune, February 2026).

Here's a wrinkle worth noting, though: as Vladimir Haltakov (a former VP of Engineering who previously worked on autonomous driving at BMW) pointed out on X, OpenAI appeared to rush this release.

Opus 4.6 was already available in Claude Code and Cursor the moment it dropped. GPT-5.3-Codex? Not available in the API yet. Not even in OpenAI's own Codex app at launch. It's available to paid ChatGPT users and through the Codex CLI and IDE extensions, but API access is "coming soon" (Thurrott, February 2026). That's a significant difference. Anthropic shipped a product. OpenAI shipped a press release with a product coming later.

API pricing for GPT-5.3-Codex specifically hasn't been finalised yet, but the broader Codex family runs at $1.25/$10 per million input/output tokens (PricePerToken, 2026). That's a fraction of what Opus charges.

The Scorecards: Who Actually Won?

Right, let's get into the numbers. I've pulled together the head-to-head benchmarks from both companies' announcements and third-party sources.

| Benchmark | Opus 4.6 | GPT-5.3-Codex | Winner |
| --- | --- | --- | --- |
| ARC AGI 2 (novel problem-solving) | 68.8% | N/A* | Opus |
| GDPval-AA Elo (knowledge work) | 1606 | ~1462** | Opus |
| BrowseComp (web search) | 84.0% | N/A* | Opus |
| Terminal-Bench 2.0 (agentic coding) | 65.4% | 77.3% | GPT-5.3 |
| OSWorld (computer use) | 72.7% | 64.7% | Opus |
| CTF (cybersecurity) | N/A* | 77.6% | GPT-5.3 |
| SWE-Bench Pro (code repair) | N/A* | 56.8% | GPT-5.3 |
| Context window | 1M (API beta) / 200K (claude.ai) | 400K | Opus*** |

*Not reported by this vendor for this model. **GPT-5.2 Elo; GPT-5.3 reports 70.9% "wins or ties" on GDPval. ***1M requires API Tier 4 beta access; claude.ai plans get 200K (Pro/Max/Team) or 500K (Enterprise).

The ARC AGI 2 result is worth lingering on. Opus 4.6 scored 68.8%, nearly doubling the 37.6% from Opus 4.5. That's an 83% improvement in novel problem-solving, the kind of reasoning that separates "good pattern matching" from something approaching genuine intelligence (OfficeChai, February 2026).

But Terminal-Bench 2.0 tells a different story. GPT-5.3-Codex hit 77.3% versus Opus 4.6's 65.4%. That's nearly a 12-point gap on the benchmark that measures real-world agentic coding in terminal environments (Neowin, February 2026).

So who won? It depends entirely on what you're trying to do. And I know that sounds like a cop-out. But the benchmarks genuinely split along functional lines. Opus 4.6 wins on general intelligence, novel reasoning, web browsing, and raw context capacity. GPT-5.3-Codex wins on coding speed, cybersecurity, and inference efficiency.

Don't sleep on Gemini 3 Pro either. It still leads on multimodal reasoning (81% on MMMU-Pro) and holds the biggest context window by default. The three-way race is real, and anyone telling you one model "dominates" across the board isn't reading the same benchmarks I am.

The Part Nobody Wants to Talk About: Are These Actually Good?

Here's where it gets uncomfortable, because the developer community's reaction hasn't been uniformly celebratory. Not even close.

Simon Willison, one of the most respected developers in the Python community, got preview access to both models and wrote something that stuck with me: "They're both really good, but so were their predecessors Codex 5.2 and Opus 4.5." He admitted he'd been "having trouble finding tasks that those previous models couldn't handle but the new ones are able to ace" (Simon Willison, February 2026).

That's not a takedown. It's something more concerning: honest confusion about what's actually improved.

On X, the reactions ranged from mildly underwhelmed to openly sceptical. Developer @estebs put it bluntly: "Opus 4.6, minor improvement if any." AI model evaluator @Vicrom1509 was harsher, showing Opus 4.6 failing a Fano logic puzzle that GPT-5.2 and Gemini 3 Flash both passed: "Don't trust the benchmarks blindly. Trust your own tests."

That last bit stings because it cuts right to the heart of what I've been calling "benchmark theatre" for months now. We covered the same problem with GPT-5.2's launch back in December: incredible benchmark numbers, mediocre real-world vibes. If you haven't read that piece, it's worth revisiting. The pattern's becoming awfully familiar.

Then there's the token consumption problem. Robin Ebers, an AI coach with a sizeable following, ran the same task through both models and found Opus 4.6 had already compacted its chat history while GPT-5.3-Codex was only at 46% of its context (about 118K tokens). His observation: "Opus 4.6 eats tokens like it's the last thing on this earth."

That post got 264 likes, the highest engagement of any tweet I tracked about either release. It clearly resonated. And it raises a practical question that benchmarks don't answer: if a model is technically more capable but burns through your token budget twice as fast, are you actually better off? For teams paying per token at the API level, that's not a hypothetical. That's a budget conversation.

(I've been running Opus 4.6 in Claude Code for about 48 hours now. The adaptive thinking feature, which is supposed to calibrate reasoning depth to task complexity, sometimes overthinks simple tasks. Anthropic says you can tune this with the /effort parameter, but the default setting feels a bit aggressive.)

There's a darker theory circulating too. Some developers have suggested Opus 4.6 feels like a bigger leap than it actually is because Opus 4.5 had been gradually degraded over the preceding weeks. There are multiple GitHub issues from late 2025 and early 2026 reporting declining Opus 4.5 performance, including one titled "Claude 4.5 Opus returning incomplete/partial results with degraded output quality" (GitHub, December 2025). One X user who regularly tests new models put it well: "I could be wrong, and I'm 99.9% sure I am (I want to be wrong)."

I'm not going to pretend I can verify or debunk that claim. Model performance can genuinely drift over time as serving infrastructure changes, and users' perception of quality is notoriously subjective. But the fact that enough developers noticed a decline in 4.5 to file bug reports, right before 4.6 launched to much fanfare? That's worth noting even if the explanation turns out to be mundane.

WinBuzzer summed up the emerging consensus neatly: "Better coding, worse writing." Early adopters were advising people to use 4.6 for coding tasks and stick with 4.5 for writing work (WinBuzzer, February 2026). Which is an odd thing to say about a model that's supposed to be an upgrade across the board.

The bigger picture here isn't about one model or one release. Software engineer @drilonre drew a comparison that's been rattling around in my head: "Different decade. Same exaggeration. Same red flags," comparing the current AI hype cycle to the dot-com bubble. Sociologist @SimonPR_3 went further, arguing that autoregressive transformers can't achieve AGI and that "the models themselves co-author the mythology of their own superintelligence."

I don't know if that's right. I genuinely don't. But I've been building websites for over 20 years, and I remember the late-90s promises about how the internet would make every business infinitely scalable by Thursday. Some of those promises came true. Most took a decade longer than predicted and looked nothing like what was advertised. If that's the pattern repeating here, then the question isn't whether AI is useful (it clearly is) but whether the current pace of releases is delivering genuine capability jumps or just marketing cycles dressed up as breakthroughs.

Looking at the numbers honestly? About 70% of the developer community seems genuinely impressed. The other 30% aren't buying it. That ratio feels about right to me.

The Super Bowl Subplot

Three days after the model releases, both companies ran ads during Super Bowl LX. And Anthropic's was genuinely funny.

The 60-second pregame spot opens with "BETRAYAL" splashed across the screen. A bloke asks his AI chatbot for advice on talking to his mum, and the chatbot pivots into an ad for a fictitious cougar-dating site called Golden Encounters. The tagline: "Ads are coming to AI. But not to Claude." (The Drum, February 2026)

It was a direct shot at OpenAI, which announced in January it'd start testing ads in ChatGPT's free tier. And Sam Altman didn't take it well. He called the ads "funny" but "clearly dishonest," saying OpenAI would "obviously never run ads in the way Anthropic depicts them" (TechCrunch, February 2026). Then he went further, saying "Anthropic wants to control what people do with AI" and that they "serve an expensive product to rich people."

That's a lot of emotion for a Super Bowl ad. If you want the full breakdown of why this monetisation split matters, I covered it in detail back in January: [article:google-anthropic-reject-chatgpt-ads-davos-2026].

What's interesting is that these aren't just different marketing strategies. They're fundamentally different visions for what AI should be. Anthropic is betting on premium, ad-free, enterprise-focused AI. OpenAI is betting on mass-market reach, subsidised by advertising. Google, for its part, told Davos it has "no plans" to put ads in Gemini, which is ironic given they're the world's largest advertising company.

The Market's Verdict

Wall Street had its own reaction to the day's events, and it wasn't subtle.

FactSet Research Systems dropped roughly 9% after Anthropic demoed Opus 4.6's financial analysis capabilities. S&P Global fell 2.8%. Moody's dropped 1.3% (CNN Business, February 2026). Investors took one look at Claude doing financial analysis in spreadsheets and decided the incumbents were in trouble.

(I think the market overreacted, personally. These tools are still better at augmenting financial analysts than replacing them. But I've been wrong about market sentiment before, and the direction of travel is clear.)

The same week, Alphabet announced it expects 2026 capital expenditure of $175 to $185 billion, roughly double the $91.4 billion it spent in 2025 (Fortune, February 2026). Most of that's going toward AI compute infrastructure. When you see numbers that big, you start to understand why these companies are releasing models at a pace that makes your head spin. There's an enormous amount of money being bet on AI delivering enough to justify the infrastructure spend.

As @sloomis74 put it on X: "The AI hype cycle has reached desperation mode. They know what's coming." I wouldn't go that far. But the pressure to show progress, to justify those billions, is clearly driving release cadence. Whether that pressure produces better models or just more models is the trillion-dollar question.

What This Means If You're Picking an AI Tool

I talk to clients about AI tool selection almost every week now, and the "which one should I use?" question has gotten harder, not easier. But here's how I'd think about it.

If coding is your primary use case, GPT-5.3-Codex is hard to argue against on paper. It leads on Terminal-Bench 2.0, it's 25% faster than its predecessor, and the Codex family's list pricing is roughly 4x cheaper than Opus on input tokens (and about 2.5x cheaper on output). At scale, that cost gap is enormous: a project that costs $10,000 in API calls on Opus might run somewhere between $2,500 and $4,000 on Codex, depending on how input-heavy the workload is. But keep in mind the API isn't live yet, and the METR study on AI coding productivity should make everyone cautious about assuming more AI means more output. I wrote about that recently: [article:metr-study-ai-coding-tools-slower-developers-2026].
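
To put rough numbers on that, here's the back-of-envelope maths using the list prices quoted in this piece ($5/$25 for Opus, $1.25/$10 for the Codex family). The workload figures are made up for illustration; the takeaway is that the multiplier depends heavily on how input-heavy your traffic is.

```python
# Back-of-envelope cost comparison using the list prices quoted in this article.
# Opus 4.6 standard tier: $5 / $25 per million input / output tokens.
# Codex family:           $1.25 / $10 per million (GPT-5.3-specific pricing TBC).

def monthly_cost(input_m: float, output_m: float, in_price: float, out_price: float) -> float:
    """Dollar cost for a month's usage, with token counts given in millions."""
    return input_m * in_price + output_m * out_price

# Hypothetical agentic-coding workload: 1,500M input tokens, 300M output tokens.
opus = monthly_cost(1500, 300, 5.00, 25.00)   # $15,000
codex = monthly_cost(1500, 300, 1.25, 10.00)  # $4,875

print(f"Opus:  ${opus:,.0f}")
print(f"Codex: ${codex:,.0f}")
print(f"Ratio: {opus / codex:.1f}x")          # ~3.1x for this input-heavy mix
```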

If you're doing complex, multi-step business workflows, nothing else quite matches Opus 4.6's Agent Teams feature yet. The ability to spin up parallel agents that coordinate autonomously is genuinely useful for large codebase reviews, cross-layer development, and research tasks that benefit from competing hypotheses. Just watch your token consumption. Seriously.

If context length matters to you, Opus 4.6 technically offers up to 1M tokens via the API beta, though you'll need Tier 4 access to get it. On claude.ai plans, you're looking at 200K (Pro/Max/Team) or 500K (Enterprise), which still beats GPT-5.3's 400K. For financial analysis, legal document review, or auditing codebases with hundreds of files, that extra capacity matters. Just don't assume you're getting 1M out of the box.

If you're cost-sensitive, and most businesses should be, the pricing gap is real. Opus at $5/$25 versus Codex at $1.25/$10 means you need to be clear about what you're getting for the premium. For many tasks, Codex at a quarter of the price will get you 90% of the way there. And remember, Opus's tendency to burn through tokens faster (as Robin Ebers demonstrated) means the effective cost gap might be even wider than the sticker price suggests.

My actual recommendation? Stop trying to pick one. The multi-model strategy isn't just sensible anymore, it's becoming necessary. Use Codex for coding sprints. Use Opus for complex reasoning and long-context analysis. Use Gemini when you need multimodal processing or Google integration. The era of "we're a Claude shop" or "we're a ChatGPT shop" is ending faster than most organisations realise.
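
In practice, that can start as something as boring as a routing table. The sketch below just encodes the split I described; the model identifiers are placeholders, and the right mapping should come out of your own evals, not mine.

```python
# A deliberately simple routing table reflecting the split described above.
# Model identifiers are placeholder assumptions; tune the mapping on your own evals.
ROUTES = {
    "coding":       "gpt-5.3-codex",    # fast, cheap agentic coding (once the API ships)
    "long_context": "claude-opus-4-6",  # deep document and codebase analysis
    "multimodal":   "gemini-3-pro",     # image- and video-heavy workloads
    "default":      "claude-opus-4-6",
}

def pick_model(task_type: str) -> str:
    """Return the model for a given task type, falling back to the default."""
    return ROUTES.get(task_type, ROUTES["default"])

assert pick_model("coding") == "gpt-5.3-codex"
assert pick_model("something_else") == ROUTES["default"]
```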

And here's the thing nobody wants to hear: benchmarks are benchmarks. They measure specific tasks under controlled conditions. The only way to know which model works best for your actual use case is to test both on your actual tasks. I've seen Opus outperform Codex on coding tasks that the benchmarks say it shouldn't win, and vice versa. Results will surprise you in both directions. As @Vicrom1509 put it after watching Opus 4.6 fail a logic puzzle that older models passed: "Don't trust the benchmarks blindly. Trust your own tests."

Key Takeaways

The Event:

  • Anthropic released Claude Opus 4.6 and OpenAI released GPT-5.3-Codex within 20 minutes of each other on 5 February 2026
  • Both companies had apparently planned simultaneous 10 AM PST releases, but Anthropic moved up by about 15 minutes
  • OpenAI appeared to rush its release. Opus 4.6 was immediately available in Claude Code and Cursor; GPT-5.3-Codex API access is still pending
  • The releases coincided with competing Super Bowl LX ads three days later (8 February)

Opus 4.6 Strengths:

  • 1M token context window (API-only beta, Tier 4 access required; claude.ai plans capped at 200K-500K)
  • Agent Teams for parallel multi-agent coordination
  • Leads on ARC AGI 2 (68.8%), GDPval-AA Elo (1606), BrowseComp (84%), and OSWorld (72.7%)
  • Microsoft 365 integration with Excel upgrades and PowerPoint in research preview
  • Adaptive thinking for dynamic reasoning depth

Opus 4.6 Concerns:

  • High token consumption ("eats tokens like it's the last thing on this earth")
  • Reports of degraded writing quality compared to 4.5, with early users advising "4.6 for coding, 4.5 for writing"
  • Some developers reporting improvements feel incremental rather than revolutionary
  • Adaptive thinking can overthink simple tasks at default settings

GPT-5.3-Codex Strengths:

  • Leads on Terminal-Bench 2.0 (77.3%) and cybersecurity CTF (77.6%)
  • 25% faster inference than GPT-5.2-Codex, using fewer output tokens
  • First OpenAI model classified "High" for cybersecurity capability
  • Codex family pricing at roughly a quarter of Opus's ($1.25/$10 vs $5/$25 per million input/output tokens)
  • "First model instrumental in creating itself" (used to debug its own training)

GPT-5.3-Codex Concerns:

  • API access not available at launch (still pending)
  • "Self-creating model" framing is better marketing than reality
  • Smaller context window (400K vs Opus's 1M API beta)

For Businesses:

  • There's no single winner across all use cases
  • Cost-sensitive coding work favours GPT-5.3-Codex (when the API actually launches)
  • Complex reasoning and long-context analysis favours Opus 4.6, but watch token costs
  • A multi-model strategy is increasingly the smartest approach
  • Test both models on your specific workflows before committing
  • Don't trust benchmarks blindly. The 30% of developers who aren't impressed have valid points

---

Sources
  1. Anthropic. "Introducing Claude Opus 4.6." 5 February 2026. https://www.anthropic.com/news/claude-opus-4-6
  2. OpenAI. "Introducing GPT-5.3-Codex." 5 February 2026. https://openai.com/index/introducing-gpt-5-3-co...
  3. TechCrunch. "OpenAI launches new agentic coding model only minutes after Anthropic drops its own." 5 February 2026. https://techcrunch.com/2026/02/05/openai-launch...
  4. TechCrunch. "Anthropic releases Opus 4.6 with new agent teams." 5 February 2026. https://techcrunch.com/2026/02/05/anthropic-rel...
  5. VentureBeat. "OpenAI's GPT-5.3-Codex drops as Anthropic upgrades Claude: AI coding wars heat up." 5 February 2026. https://venturebeat.com/technology/openais-gpt-...
  6. IT Pro. "Anthropic reveals Claude Opus 4.6, enterprise-focused model with 1M token context." 5 February 2026. https://www.itpro.com/technology/artificial-int...
  7. OfficeChai. "Anthropic Releases Claude Opus 4.6, Beats Gemini 3 And GPT 5.2 On Most Benchmarks." 5 February 2026. https://officechai.com/ai/claude-opus-4-6-bench...
  8. Digital Applied. "GPT-5.3 Codex Features, Benchmarks, and Migration Guide." February 2026. https://www.digitalapplied.com/blog/gpt-5-3-cod...
  9. Digital Applied. "Claude Opus 4.6 Features, Benchmarks, and Pricing Guide." February 2026. https://www.digitalapplied.com/blog/claude-opus...
  10. CNN Business. "Anthropic Opus 4.6: The AI that shook software stocks gets a big update." 5 February 2026. https://www.cnn.com/2026/02/05/tech/anthropic-o...
  11. Fortune. "OpenAI's new model leaps ahead in coding capabilities, but raises unprecedented cybersecurity risks." 5 February 2026. https://fortune.com/2026/02/05/openai-gpt-5-3-c...
  12. Fortune. "Alphabet plans record $185 billion AI spending, but CEO says it still won't be enough." 4 February 2026. https://fortune.com/2026/02/04/alphabet-google-...
  13. TechCrunch. "Sam Altman got exceptionally testy over Claude Super Bowl ads." 4 February 2026. https://techcrunch.com/2026/02/04/sam-altman-go...
  14. The Drum. "Claude's first Super Bowl campaign asks if ads belong in AI." February 2026. https://www.thedrum.com/news/claude-s-first-sup...
  15. ABC News. "Anthropic, OpenAI rivalry spills into new Super Bowl ads." 5 February 2026. https://abcnews.go.com/Technology/wireStory/ant...
  16. Neowin. "OpenAI debuts GPT-5.3-Codex: 25% faster and setting new coding benchmark records." 5 February 2026. https://www.neowin.net/news/openai-debuts-gpt-5...
  17. PricePerToken. "GPT 5 Codex API Pricing 2026." 2026. https://pricepertoken.com/pricing-page/model/op...
  18. GitHub Changelog. "Claude Opus 4.6 is now generally available for GitHub Copilot." 5 February 2026. https://github.blog/changelog/2026-02-05-claude...
  19. CNBC. "Alphabet resets the bar for AI infrastructure spending." 4 February 2026. https://www.cnbc.com/2026/02/04/alphabet-resets...
  20. Simon Willison. "Opus 4.6 and Codex 5.3." 5 February 2026. https://simonwillison.net/2026/Feb/5/two-new-mo...
  21. WinBuzzer. "Claude Opus 4.6: Better Coding, Worse Writing?" 5 February 2026. https://winbuzzer.com/2026/02/05/claude-opus-4-...
  22. Thurrott. "OpenAI Releases GPT-5.3-Codex." 5 February 2026. https://www.thurrott.com/a-i/openai-a-i/332421/...
  23. GitHub Issues. "Claude 4.5 Opus returning incomplete/partial results with degraded output quality." January 2026. https://github.com/anthropics/claude-code/issue...
