This post is speculation + crystal balling. A change might be coming.
OpenAI has spent six months rolling out updates to GPT-4o. These perform extremely well by human-preference metrics.
gpt-4o-2024-11-20, the latest endpoint, boasts a colossal 1360 Elo on Chatbot Arena, compared to the earliest GPT-4o, which scored a meager 1285. What does that mean? That blind human raters prefer gpt-4o-2024-11-20’s output about 60% of the time in a head-to-head against the original.
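For anyone who wants to check the arithmetic, here is a minimal sketch of how a rating gap turns into a win rate. It assumes Arena scores sit on the standard Elo scale (base-10 logistic, 400-point divisor) and ignores ties; the function name is mine, not anything from the leaderboard's code.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve.

    Assumption: ratings follow the usual 400-point, base-10 scale.
    Ties are not modelled here.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings quoted in the post.
newest = 1360  # gpt-4o-2024-11-20
oldest = 1285  # the earliest GPT-4o

p = elo_expected_score(newest, oldest)
print(f"{p:.3f}")  # prints 0.606
```

Under those assumptions, a 75-point gap is a real but modest edge: roughly three wins in every five matchups.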
I believe this is the result of aggressive human preference-hacking on OpenAI’s part, not any real advances.
Control for style, and gpt-4o-2024-11-20 drops by fifty points. It remains at #1, but only because the other preference-hacked models at the top also drop a lot.
Claude 3.5 Sonnet gains points. So do most of the older GPT-4s.
Optimizing for human preference is not a wrong thing to do, per se. So long as humans use LLMs, what they like matters. An LLM that produced output in the form of Morse code being punched into your balls would suck to use, even if it was smart.
But this is exactly why you should be careful when using Chatbot Arena to make statements about model capabilities – the top of the chart is mainly determined by style and presentation.
Benchmarks tell a different story: gpt-4o’s abilities are declining.
In six months, GPT-4o’s 0-shot MMLU score has fallen from 87.2 to 85.7, which is probably similar to what GPT-4 scored on release.
(to be clear, “GPT-4” doesn’t mean “an older GPT-4o” or “GPT-4 Turbo”, but “the original broke-ass GPT-4 from March 2023, with 8k context and no tools/search/vision and Sept 2021 training data”).
I am more concerned about the collapse of GPT-4o’s score on the GPQA benchmark, which fell from 53.1 to 46.0. This is a significant drop, particularly in light of the tendency for scores to rise as data contaminates the internet. (Claude 3.5 Sonnet scores 59.4, for comparison.)
An independent test by Artificial Analysis (on the GPQA diamond subset) found that GPT-4o scored 39.00. They’ve downgraded the model to 71/100, or equal to GPT-4o mini (OpenAI’s free model) in capabilities.
Some of their findings complicate the picture I’ve just described (in particular, they have GPT-4o scoring a higher MMLU than OpenAI’s internal evals), but the bottom line is that the new gpt-4o-2024-11-20 is the worst of its line by nearly every metric they test, except for token generation speed.
On LiveBench, GPT-4o’s scores appear to be either stagnant or regressing.
It doesn’t hurt to have a personal benchmark or two, relating to your own weird corner of the world. Either you’ll have a way to evaluate AIs that escapes the Goodharting suffered by large benchmarks, or OpenAI starts fine-tuning AIs on your niche use case (in which case, mission fucking accomplished.)
I like to ask LLMs to list the levels in the 1997 PC game Claw (an obscure videogame).
Claude 3.5 Sonnet and Claude 3 Opus do great, getting about 80-90% of Claw’s levels correct.
GPT-4-0314 makes a reasonable attempt, getting about 50-75% right. Typically the first half of the game is fine, with the other levels being a mix of real and hallucinated.
(once, it listed “Wreckage” as a level in the game. That’s actually a custom level I helped make when I was 14-15. I found that weirdly moving: I’d found a shard of myself in the corpus.)
GPT-4o scores like ass: typically in the sub-50% range. It doesn’t even consistently nail how many levels are in the game. It correctly lists some levels but these are mostly out of order. It has strange fixed hallucinations. Over and over, it insists there’s a level called “Tawara Seaport”—which is a real-world port near the island of Kiribati. Not even a sensible hallucination given the context of the game.
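If you want to turn a list-the-levels prompt into a number, a rough sketch might look like the one below. Everything here is hypothetical: the function names are mine, and the "Level A" to "Level D" entries are placeholders standing in for the real level list, which you would fill in yourself. It reports how much of the true list was recovered, which answers were hallucinated, how many items landed in the right slot, and whether the model got the level count right.

```python
def normalize(name: str) -> str:
    """Lowercase and drop punctuation/spaces so trivial spelling variants match."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def score_level_list(answer: list[str], truth: list[str]) -> dict:
    """Score a model's ordered list against a hand-written ground-truth list."""
    truth_norm = [normalize(t) for t in truth]
    answer_norm = [normalize(a) for a in answer]
    truth_set = set(truth_norm)

    # Distinct real levels the model managed to name, in any order.
    found = {a for a in answer_norm if a in truth_set}
    # Answers that match nothing in the ground truth.
    hallucinated = [raw for raw, a in zip(answer, answer_norm) if a not in truth_set]
    # Items that are both correct and in the correct slot.
    in_position = sum(1 for a, t in zip(answer_norm, truth_norm) if a == t)

    return {
        "recall": len(found) / len(truth_norm),
        "hallucinated": hallucinated,
        "in_position": in_position,
        "count_matches": len(answer) == len(truth),
    }


# Placeholder ground truth: swap in the real level names.
truth = ["Level A", "Level B", "Level C", "Level D"]
# A made-up model answer: right names, wrong order, one fixed hallucination.
answer = ["Level A", "Level C", "Level B", "Tawara Seaport"]

result = score_level_list(answer, truth)
print(result)
```

Exact matching after normalization is a blunt instrument. A real version would probably want fuzzy matching for near-miss names, but even this separates "knows the game" from "making it up".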
Another prompt is “What is Ulio, in the context of Age of Empires II?”
GPT-4-0314 tells me it’s a piece of fan-made content, created by Ingo van Thiel. When I asked what year Ulio was made, it says “2002”. This is correct.
GPT-4o-2024-11-20 has no idea what I’m talking about.
To me, it looks like a lot of “deep knowledge” has vanished from the GPT-4 model. It’s now smaller and shallower and lighter, its mighty roots chipped away, its “old man strength” replaced with a cheap scaffold of (likely crappy) synthetic data.
What about creative writing? Is it better on creative writing?
Who the fuck knows. I don’t know how to measure that. Do you?
A notable attempt is EQBench, which uses Claude 3.5 as a judge to evaluate writing samples. gpt-4o-2024-11-20 is tied for first place. So that seems bullish.
Most of these samples seem fairly mediocre to me. Uncreative, generic, packed with empty stylistic flourishes and pretentious “fine writing”.
The cockpit was a cacophony of dwindling lights and systems gasping their final breaths, a symphony of technological death rattles. Captain Elara Veyra sat in the command chair, her face illuminated by the sickly green glow of the emergency power indicator, which pulsed like the heartbeat of a dying creature. The Erebus Ascendant, once a proud envoy of humanity’s indomitable spirit, now drifted derelict and untethered in the silent abyss of interstellar void. The engines were cold, the life support systems faltering, and the ship’s AI had succumbed to cascading failures hours ago, leaving Elara alone with her thoughts, her resolve, and the unceasing hum of entropy.
A cacophony refers to sound: lights cannot form a cacophony. How can there be an “unceasing hum” in a “silent abyss”? How does a light gasp a final breath? What is this drizzling horseshit?
This is what people who don’t read imagine good writing to be. It’s exactly what you’d expect from a model preference-hacked on the taste of people who do not have taste.
ChatGPTese is creeping back in (a problem I thought they’d fixed). “Elara”… “once a proud envoy of humanity’s indomitable spirit”… “a testament to…” At least it doesn’t say “delve”.
Claude 3.5 Sonnet’s own efforts feel considerably more “alive”, thoughtful, and humanlike.
(Note the small details of the thermal blanket and the origami bird in “The Last Transmission”. There’s nothing really like that in GPT-4o’s stories.)
So if GPT-4o is getting worse, what would that mean?
There are two options:
1) It’s unintentional. In this world, OpenAI is incompetent. They are dumpstering their model to win a leaderboard dick-measuring contest against DeepMind.
2) It’s intentional. In this world, a new, better model is coming, and GPT-4o is being “right-sized” for a new position in the OpenAI product line.
Evidence for the latter is the fact that token-generation speed has increased, which indicates they’ve actively made the model smaller.
If this is the path we’re on, I predict that GPT-4o will become a free model soon. And behind the ChatGPT Plus paywall will be something else: Orion, GPT-5, or the full o1.
A miserable listen. One of the most violently wrong-sounding albums I own. It captures a band ready to break up, and its silly melodies and forced-happy tone give it a tragicomic “fiddling on the Titanic” feel. The singer was fired three months after its release, and a year after that the drummer jumped in front of a train.
It’s the only Helloween album that gives me no way in, the only one where the question “what were they trying for here?” has no clear answer. The title and cover suggest a band making a statement for artistic diversity: for breaking out of the power metal ghetto, for doing the unexpected. But “weird” is an adjective, not a noun. An approach, not an identity. You can’t have a band founded on sonic diversity and nothing else: that simply means you don’t have a sound. The cover sums things up: it’s colors for the sake of colors, not actually a painting of anything.
In practice, Chameleon is a three-way solo album between singer Michael Kiske and guitarists Michael Weikath and Roland Grapow, who are now apparently communicating through lawyers who end every correspondence with “conduct yourself accordingly”. The hostility in this hate triangle is palpable, and bleeds through on the record. None of them like or respect what the other two are doing, and at times they almost seem to be sabotaging each other. Also present are the ever-reliable bassist Markus Grosskopf, who does what he can, and drummer Ingo Schwichtenberg, whose paranoid schizophrenia was sadly worsening, and who clearly hates the Beatles- and Queen-influenced songs more than anyone.
It’s horribly overproduced, and an example of how money can’t make bad music good. Songs like “In the Night” are overwrought and overthought, packed with vocal and guitar and saxophone (?) overdubs to disguise how weak they are. Synthesizers prove a particularly hateful presence: even good songs like “Giants” and “I Believe” have cheesy bleep-bloopy one-finger Fairlight arpeggios on them, of the sort you normally hear on Huey Lewis songs. Abominable. If you’re ripping off Queen, couldn’t you also rip off the “No Synthesizers” sleeve notes?
Michael Weikath’s songs have the largest quality delta. “First Time” is an okay hair metal song that passes without much pain. “Giants” is actually a minor classic, and would have fit well on either Keeper album. It has a heavy as hell NWOBHM-influenced main riff, and the chorus is sublime. “Don’t you, won’t you, say that we’ll be free again!” On the other hand, “Revolution Now” is a droning 70s Jimi Hendrix knockoff that’s eight minutes long. It sounds like Oasis’s Be Here Now, and is equally boring. “Windmill” (or “Shitmill”, in Ingo’s memorable term) is the worst ballad ever written by the band: rank, rancid, and insipid.
Roland Grapow’s songs are largely dull. “Crazy Cat” has some big band flash but no good hooks. You’d have to pay me to listen to “I Don’t Wanna Cry No More” again. “Music” has a Pink Floyd-inspired bridge with some fine single-coil Strat guitar soloing, but otherwise is as unmemorable as its title implies. “Step Out of Hell” is filler burdened with yet more synth cheese.
Michael Kiske was never the band’s greatest songwriter. Here, he offers a surprise in “I Believe”, an emotionally bludgeoning but effective ode to faith that’s nearly a masterpiece. It has some wonderful ideas in the Iron Maiden/Manilla Road vein (ironically, he’d soon swear off heavy metal entirely), but it’s just too long and draggy. It needed some tempo changes in the middle. Still, I think this might be the album’s finest track. “When the Sinner” is overlong and mediocre at best, and is overloaded with questionable ideas (if you’re one of the millions of fans who thought “Helloween would sound much better with alto sax solos”, then I’ve got the album for you.) The Paul McCartney-esque “In the Night” is just too sonically confused to stay in the memory.
Not only did Helloween tear to shreds what made them successful, they replaced it with…nothing. Just shallow, derivative imitations of other bands and styles. Chameleon has two good songs and ten bad ones, with saxophones and synthesizers. At times it seems like a practical joke. At least they released it in 1993, when the world’s appetite for retro-progressive dad rock was at an all-time low. The album’s title feels appropriate: it was literally invisible.
What does the title mean? That, I can’t tell you. It’s an excellent power metal album, however. Time of the Oath was 25% better than Master of the Rings; Better than Raw is 25% greater again. Everything locks into place here. Music, tone, style, production, performance. It’s distinct from anything Helloween made before, yet feels like a summation and endpoint of their 90s style: great songs, performed with panache and energy. Helloween wasn’t just out of “comeback hell” in 1998, they were producing melodic power metal that compared decently against their classic 80s run (Walls of Jericho plus Keeper of the Seven Keys I and II).
All of the musicians more than pull their weight, but one of them steals the show. Uli Kusch’s drumming is so damned good here: flashy and technical, yet in-the-pocket and lively. Listen to the way he anchors the start of “Push” (tight triplets on the kicks, with sharp, precise hats punching Markus Grosskopf’s bubbling bass into place like steel tentpegs) or the unyielding chaos of “Midnight Sun”, where flurries of wild snare and tom fills swoop and overtake each other like crazed birds. He adds such interesting skeletons to fairly average midpaced fare like “Hey Lord!” that they seem absolutely compelling.
This might be the most balanced Helloween album from a songwriting perspective. Four songs by Uli, four by Weikath, four by Deris. This is the only 90s Helloween album to have absolutely no songs credited to lead guitarist Roland Grapow, but I happen to know he “ghosted” a fair bit on Uli’s songs. The staccato guitar riff in “Revelation” was written by him, for example.
“Push” is fast. “Falling Higher” is even faster. Tommy Hansen’s production is dated, archaic and rough, with bits of dust seeming to cling to the cracks in every note. I think this was the last time they ever worked with Hansen, and represents another breaking point with their classic power metal style. Subsequent albums had a more modern (sometimes too modern) sound.
There are some recondite progressive rock touches in the album’s second half (“Time” and “A Handful of Pain”), which, unlike Chameleon, are well done and don’t seem too distinct from the band’s core style.
The worst song is probably Weikath’s “Lavdate Dominum”, which listens like a goofy punk rock song, or a heavy metal cover of some Christmas carol. I don’t know what the idea was here. He also gets the album closer, “Midnight Sun”, which is really good; extremely lengthy and technical while also fraught with emotional agitation. One of Deris’s great vocal tracks is on this song.
There are two great songs that are pretty much in the top 10 greatest Helloween tracks every time I make a list. The first is Uli Kusch’s “Revelation”, an amazing, warp-speed epic that seems to be a jaded postmodern take on the Bible. Astonishing shifts in feel and tempo, solo after solo, weird digressions into funk rock and thrash, the album’s greatest chorus…Worlds form, collide, and break apart inside this song.
The second is a complete surprise. “I Can” is one of the hardest sellouts Helloween ever sold, literally being an alternative rock song that sounds like New Order’s Get Ready more than anything. But it’s extremely well-written, compact, and catchy. I’m glad they didn’t go further into the territory explored here, but man, I’m glad they planted a flag at least this far.
1998-2000 was the era where power metal became incredibly competitive: bands like Gamma Ray and Stratovarius were in the middle of career-defining hot streaks, newcomers like Freedom Call were arriving, and even America was finally becoming relevant again thanks to Virgin Steele and Kamelot. Better than Raw ranks alongside the best of that period.