This post is speculation + crystal balling. A change might be coming.

OpenAI has spent six months rolling out updates to GPT-4o. These perform extremely well by human-preference metrics.

gpt-4o-2024-11-20, the latest endpoint, boasts a colossal 1360 Elo on Chatbot Arena, compared to the earliest GPT-4o, which scored a meager 1285. What does that mean? That blind human raters prefer gpt-4o-2024-11-20's output roughly 60% of the time.
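(A 75-point Elo gap works out to about a 61% expected win rate under the standard Elo formula; a quick sanity check in Python, if you want to run the numbers yourself:)

```python
# Expected win probability implied by an Elo gap (standard logistic formula).
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

print(elo_win_prob(1360, 1285))  # ≈ 0.61
```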

I believe this is the result of aggressive human preference-hacking on OpenAI’s part, not any real advances.

Control for style, and gpt-4o-2024-11-20 drops by fifty points. It remains at #1, but only because the other preference-hacked models at the top also drop a lot.

Claude 3.5 Sonnet gains points. So do most of the older GPT-4s.

Optimizing for human preference is not wrong, per se. So long as humans use LLMs, what they like matters. An LLM that produced output in the form of Morse code being punched into your balls would suck to use, even if it was smart.

But this is exactly why you should be careful when using Chatbot Arena to make statements about model capabilities – the top of the chart is mainly determined by style and presentation.

Benchmarks tell a different story: gpt-4o’s abilities are declining.

https://github.com/openai/simple-evals

In six months, GPT-4o's 0-shot MMLU score has fallen from 87.2 to 85.7, which is probably similar to what GPT-4 scored on release.

(to be clear, “GPT-4” doesn’t mean “an older GPT-4o” or “GPT-4 Turbo”, but “the original broke-ass GPT-4 from March 2023, with 8k context and no tools/search/vision and Sept 2021 training data”).

I am more concerned about the collapse of GPT-4o's score on the GPQA benchmark, which fell from 53.1 to 46.0. This is a significant drop, particularly in light of the tendency for scores to rise as data contaminates the internet. (Claude 3.5 Sonnet scores 59.4, for comparison.)

Even this may be optimistic:

https://twitter.com/ArtificialAnlys/status/1859614633654616310

An independent test by Artificial Analysis (on the GPQA diamond subset) found that GPT-4o scored 39.00. They’ve downgraded the model to 71/100, or equal to GPT-4o mini (OpenAI’s free model) in capabilities.

Further benching here:

https://artificialanalysis.ai/providers/openai

Some of their findings complicate the picture I've just described (in particular, they have GPT-4o scoring a higher MMLU than OpenAI's internal evals), but the bottom line is that the new gpt-4o-2024-11-20 is the worst of its line by nearly every metric they test, except for token generation speed.

Livebench

https://livebench.ai

GPT-4o’s scores appear to be either stagnant or regressing.

gpt-4o-2024-05-13 -> 53.98
gpt-4o-2024-08-06 -> 56.03
chatgpt-4o-latest-0903 -> 54.25
gpt-4o-2024-11-20 -> 52.83

Aider Bench

https://github.com/Aider-AI/aider-swe-bench

Stagnant or regressing.

gpt-4o-2024-05-13 -> 72.9%
gpt-4o-2024-08-06 -> 71.4%
chatgpt-4o-latest-0903 -> 72.2%
gpt-4o-2024-11-20 -> 71.4%

Personal benchmarks

It doesn’t hurt to have a personal benchmark or two, relating to your own weird corner of the world. Either you’ll have a way to evaluate AIs that escapes the Goodharting suffered by large benchmarks, or OpenAI starts fine-tuning AIs on your niche use case (in which case, mission fucking accomplished.)

I like to ask LLMs to list the levels in the 1997 PC game Claw (an obscure videogame).
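If you want to automate a check like this, a minimal sketch is below (Python; the ground-truth list is truncated, and the prompt, fuzzy-match threshold, and recall metric are my own choices, not any standard harness):

```python
# Minimal sketch of automating a "personal benchmark": ask a model for
# Claw's level list and measure recall against a ground-truth list.
from difflib import SequenceMatcher

from openai import OpenAI  # official openai>=1.0 client

GROUND_TRUTH = [
    "La Roca",
    "The Battlements",
    # ...fill in the rest of the game's levels yourself...
]

def recall(answer: str, truth: list[str], threshold: float = 0.8) -> float:
    """Fraction of real levels that the model's answer plausibly names."""
    claimed = [
        line.strip("-*0123456789.) ").strip()
        for line in answer.splitlines()
        if line.strip()
    ]
    def named(level: str) -> bool:
        return any(
            SequenceMatcher(None, c.lower(), level.lower()).ratio() >= threshold
            for c in claimed
        )
    return sum(named(level) for level in truth) / len(truth)

client = OpenAI()  # expects OPENAI_API_KEY in the environment
resp = client.chat.completions.create(
    model="gpt-4o-2024-11-20",
    messages=[{"role": "user",
               "content": "List the levels of the 1997 PC game Claw, one per line."}],
)
print(recall(resp.choices[0].message.content, GROUND_TRUTH))
```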

Claude 3.5 Sonnet and Claude 3 Opus do great, getting about 80-90% of Claw’s levels correct.

GPT-4-0314 makes a reasonable attempt, getting about 50-75% right. Typically the first half of the game is fine, with the other levels being a mix of real and hallucinated.

(once, it listed “Wreckage” as a level in the game. That’s actually a custom level I helped make when I was 14-15. I found that weirdly moving: I’d found a shard of myself in the corpus.)

GPT-4o scores like ass: typically in the sub-50% range. It doesn't even consistently nail how many levels are in the game. It correctly lists some levels, but mostly out of order. It has strange, fixed hallucinations. Over and over, it insists there's a level called "Tawara Seaport", which is a real-world port in the island nation of Kiribati. Not even a sensible hallucination given the context of the game.

Another prompt is “What is Ulio, in the context of Age of Empires II?”

GPT-4-0314 tells me it's a piece of fan-made content, created by Ingo van Thiel. When I ask what year Ulio was made, it says "2002". This is correct.

GPT-4o-2024-11-20 has no idea what I’m talking about.

To me, it looks like a lot of “deep knowledge” has vanished from the GPT-4 model. It’s now smaller and shallower and lighter, its mighty roots chipped away, its “old man strength” replaced with a cheap scaffold of (likely crappy) synthetic data.

What about creative writing? Is it better at creative writing?

Who the fuck knows. I don’t know how to measure that. Do you?

A notable attempt is EQBench, which uses Claude 3.5 as a judge to evaluate writing samples. gpt-4o-2024-11-20 is tied for first place. So that seems bullish.

https://eqbench.com/creative_writing.html

…but you’ll note that it’s tied with a 9B model, which makes me wonder about Claude 3.5 Sonnet’s judging.
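For a sense of what "Claude as judge" means mechanically, here is a schematic sketch (my own prompt and rubric, not EQBench's actual methodology):

```python
# Schematic of LLM-as-judge scoring. NOT EQBench's actual prompt or rubric,
# just the general shape of the technique.
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

JUDGE_PROMPT = """You are grading a piece of creative writing.
Rate it from 0 to 10 on originality, imagery, and avoidance of cliche.
Reply with a single number and one sentence of justification.

STORY:
{story}"""

def judge(story: str) -> str:
    msg = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(story=story)}],
    )
    return msg.content[0].text
```

The failure mode is obvious: whatever the judge happens to like becomes the target, which is presumably how a 9B model ends up tied with a frontier one.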

https://eqbench.com/results/creative-writing-v2/gpt-4o-2024-11-20.txt

Most of these samples seem fairly mediocre to me. Uncreative, generic, packed with empty stylistic flourishes and pretentious “fine writing”.

The cockpit was a cacophony of dwindling lights and systems gasping their final breaths, a symphony of technological death rattles. Captain Elara Veyra sat in the command chair, her face illuminated by the sickly green glow of the emergency power indicator, which pulsed like the heartbeat of a dying creature. The Erebus Ascendant, once a proud envoy of humanity’s indomitable spirit, now drifted derelict and untethered in the silent abyss of interstellar void. The engines were cold, the life support systems faltering, and the ship’s AI had succumbed to cascading failures hours ago, leaving Elara alone with her thoughts, her resolve, and the unceasing hum of entropy.

A cacophony refers to sound: lights cannot form a cacophony. How can there be an "unceasing hum" in a "silent abyss"? How does a light gasp a final breath? What is this drizzling horseshit?

This is what people who don’t read imagine good writing to be. It’s exactly what you’d expect from a model preference-hacked on the taste of people who do not have taste.

ChatGPTese is creeping back in (a problem I thought they'd fixed). "Elara"… "once a proud envoy of humanity's indomitable spirit"… "a testament to…" At least it doesn't say "delve".

Claude 3.5 Sonnet's own efforts feel considerably more "alive", thoughtful, and humanlike.

https://eqbench.com/results/creative-writing-v2/claude-3-5-sonnet-20241022.txt

(Note the small details of the thermal blanket and the origami bird in "The Last Transmission". There's nothing really like that in GPT-4o's stories.)

So if GPT-4o is getting worse, what would that mean?

There are two options:

1) It's unintentional. In this world, OpenAI is incompetent. They are dumpstering their model to win a leaderboard dick-measuring contest against DeepMind.

2) It's intentional. In this world, a new, better model is coming, and GPT-4o is being "right-sized" for a new position in the OpenAI product line.

Evidence for the latter is that token-generation speed has increased, which suggests they've actively made the model smaller.

If this is the path we're on, I predict that GPT-4o will become a free model soon. And behind the ChatGPT Plus paywall will be something else: Orion, GPT-5, or the full o1.
