The end of an era. Helloween’s Y2K album is the last to feature the second “classic” lineup of Weikath/Deris/Grosskopf/Grapow/Kusch. It marks a turning point: post-The Dark Ride, Helloween becomes, though not bad, more streamlined, more risk-averse, and (in my view) less interesting.
To dispense with the obvious: no, this isn’t “nu metal” Helloween. It has some downtuned, tonally dark songs, but they mostly seem patterned after Dio/Martin-era Black Sabbath rather than, say, Korn.
It’s definitely confused. I’ll say that much. The band doesn’t fully commit to their new, dark style, writing a bunch of classic-style songs as well, turning the album into a bit of a patchwork. The Dark Ride is an odd, contradictory amphibian of an album that seems to exist in the sunlight and under the starless sky at the same time, with the tracklisting throwing every tonal mismatch into sharp relief. You have basically the floweriest song ever written under the Helloween imprimatur (“All Over the Nations”) right next to arguably the darkest one (“Escalation 666”). “Mr Torture” is a perfect opening, “The Dark Ride” a perfect closer, but otherwise you could jumble the songs at random and get a more cohesive listening experience.
Grapow/Kusch really start driving the band here—to their detriment, as creative conflicts would soon lead to them being ousted (Grapow, 2005: “We weren’t really a band anymore and struggled with tons of issues along the way, it was best for us to leave and aim for new goals.”). They write a ton of songs, and according to Grapow, virtually all the guitar work here is his. At the same time, they were also amassing some songs that never made the album, and were later featured on the debut album of their next band, Masterplan. (You can really imagine “Into the Light” on this album, being sung by Deris.)
Kusch’s “Mr Torture” is one of the all-time Helloween opening songs. Punchy, tight, catchy, accessible, it rolls and bounces along, verses propelled by jagged runs of double-bass, the chorus opening wide up, and the bridge illuminated by a short but flashy Grapow guitar solo that lights the song on fire. Great track.
The lyrics are pretty weird, portraying some kind of…torture entrepreneur? “You can catch him on his website / Has a live chat every weeknight / Cyber-torture soon coming your way!” Well, it wouldn’t be a year 2000 album without gratuitous internet references, I suppose. (Cf. Britney Spears’ “Email My Heart”.)
Then Weikath’s “All Over the Nations” arrives: a fast, melodic, somewhat generic power metal track, it sounds literally nothing like the preceding or following song. Other than Deris’s vocals and Roy Z’s murky but textured production (which proves to be the glue holding The Dark Ride’s disparate shards together), you wouldn’t even think this and “Mr Torture” were from the same album. Not offensive, but definitely a bit lightweight and “Helloween done by committee”.
Two things are noticeable about The Dark Ride: first, it’s really, really good. Possibly superior to Better than Raw, which might make it the best Helloween album ever, aside from Walls of Jericho and The Keepers.
Second, the different songwriters are really, really, really not on the same page anymore. Grapow and Kusch want darkness, Weikath stubbornly cleaves to the “happy happy Helloween” template, and Deris has a foot in both camps. Markus Grosskopf sticks to playing bass, and doesn’t write a song this time (although his composition “Deliver Us” appears on various bonus editions, and suggests he was of one mind with the Grapow/Kusch contingent.)
Grapow’s “Escalation 666” is one of the band’s most crushing and experimental tracks. A doom metal paced trudge through some inner mindscape of madness, it’s not a song, it’s a black hole yawning at the album’s core. The chugging, C-standard (I think?) opening riff sounds supernova-heavy, and the dissonant, effects-laden guitar solo reminds me of “Bleeding Eyes” off that first Masterplan album. It’s not the greatest song on the album, but it’s never far from my thoughts.
Andi Deris proves to be hit or miss like usual, writing two certified classics (the piano-driven single “If I Could Fly” and the flighty, foot-on-the-gas adventure of “We Damn the Night”) and two stinkers. “Mirror Mirror” and “I Live For Your Pain” are just chuggy, downtuned nothingburgers with mediocre ideas and no sense of catchiness or energy. Skip-button fodder. Like Helloween trying to be a grunge rock band or something.
His bonus track “Madness of the Crowds” is a fascinating “one idea” type song, pairing quiet verses with explosive choruses (and some intriguing knifing symphonic stabs). “Immortal” is the closest we have to a torch ballad. Not bad, but a bit slender when compared with Kusch’s “The Departed”, which we just heard a few minutes earlier.
The album concludes with Grapow’s “The Dark Ride”, a monolithic speed epic that’s like a tombstone for this era of the band. Beginning with the (somewhat stale) motif of amusement park sounds, it’s a bit long, but when the ideas come, they really come. Grapow really loves octave-skipping tremolo riffs (like in the pre-chorus: “Out of doubt, no hope / Satan feeds our madness”), but so do I. The guitar solo section is just straight-up Yngwie Malmsteen worship. Some of the last he ever did.
This is one of those spiky albums where the flaws are evident but the strengths are so good that even if I’m bitching about it half the time, I still love it. This is an incredibly special and important record to me. One last triumph of power metal before Y2K shut the world down.
Walt Disney’s career as a director of animated film was not a particularly inspiring one.
We’ll ignore the Laugh-O-Grams and Alice Comedies since those were cranked out under Stakhanovite conditions for nearly no money for men who often turned out to be literal criminals (Pat Sullivan has a borderline classic Wikipedia page, littered with lines like “Sulivan(sic) would often fire employees in a drunken haze, not remembering the next day, when they would return to work as if nothing had happened”, and a Controversies section split into subheadings “Rape Conviction” and “Racism”).
Yes, “Steamboat Willie” and “Skeleton Dance” and “Hell’s Bells” and “The Problematically-Depicted Negro” (etc) are holy classics, but Ub Iwerks (and his hunger for violence) deserves a lot of credit for those. Probably more than he got or will ever get, even by me. “Poor Papa” is great and underrated. “Minnie’s Yoo Hoo” sucks. Etc. More misses than hits, by my lights.
On the whole, you would describe Disney’s directorial output as “stiff, stagey, and moralistic.” You would not describe it as “very fun”. He did not make animation sing. He made it squawk, fret, and preach. His skills were adequate for the rubber hose era. By the 1930s, cartoons were entering their golden years, rapidly exploding in complexity, detail and quality of writing/acting/etc. Walt ended up in over his head, his aging, dated skillset like a Model T entered in the Indy 500.
“The Golden Touch” (1935) was famously the result of a bet that Walt couldn’t direct as well as his animators: a bet that his animators immediately and decisively won. The last animated short ever directed by the man behind the mouse, it’s somewhat watchable, but most of the fun parts—like Midas giving himself a gangsterish gold tooth—feel like they were added by animators to try and punch life into things.
The story is flat and predictable and preachy. Don’t be greedy! Even if you don’t know who King Midas is, you can guess the plot after thirty seconds. Countless opportunities for gags are missed. King Midas spends half the short sitting in a chair. And when Goldie grants Midas the Golden Touch, shouldn’t he do it in a funny or interesting way? Instead of just saying “you have the Golden Touch now!” (or something) and disappearing?
I liked the skeleton. I wonder if that came from Walt. I expect it did. He always had an eye for the morbid.
What was Walt good at? I see him as a visionary and a dreamer who made audacious technical bets (synchronized sound, Technicolor, feature-length films), re-imagined the concept of what a cartoon could be, and then found talented artists to execute his vision. He wasn’t much of an artist himself, but that’s okay. There’s the big picture and the small picture. Georgy Zhukov was a talented general on the Eastern Front, but I could probably beat him at kickboxing—him dying in 1974 helps.
This post is speculation + crystal balling. A change might be coming.
OpenAI has spent six months rolling out updates to GPT-4o. These perform extremely well by human-preference metrics.
gpt-4o-2024-11-20, the latest endpoint, boasts a colossal 1360 Elo on Chatbot Arena, compared to the earliest GPT-4o, which scored a meager 1285. What does that mean? Going by the standard Elo formula, a 75-point gap means blind human raters prefer gpt-4o-2024-11-20’s output about 61% of the time.
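For reference, the win rate implied by a rating gap can be worked out directly. The sketch below assumes Arena scores behave like ordinary Elo ratings (logistic curve, base 10, scale 400); that is an assumption, and the function names are made up for illustration. Under it, the 75-point gap between 1360 and 1285 comes out to roughly 61%, and a 70% win rate would need a gap of about 147 points.

```python
import math


def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A is preferred over B under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def gap_for_win_rate(win_rate: float, scale: float = 400.0) -> float:
    """Rating gap needed for a given head-to-head win rate (inverse of the formula above)."""
    return scale * math.log10(win_rate / (1.0 - win_rate))


new_score = 1360  # gpt-4o-2024-11-20, figure quoted in the post
old_score = 1285  # earliest GPT-4o, figure quoted in the post

implied = elo_expected_score(new_score, old_score)
print(f"Implied preference rate: {implied:.1%}")
print(f"Gap needed for a 70% preference rate: {gap_for_win_rate(0.70):.0f} points")
```

This only covers a head-to-head between the two models; the leaderboard number itself is fitted across many opponents.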
I believe this is the result of aggressive human preference-hacking on OpenAI’s part, not any real advances.
Control for style, and gpt-4o-2024-11-20 drops by fifty points. It remains at #1, but only because the other preference-hacked models at the top also drop a lot.
Claude 3.5 Sonnet gains points. So do most of the older GPT-4s.
Optimizing for human preference is not a wrong thing to do, per se. So long as humans use LLMs, what they like matters. An LLM that produced output in the form of Morse code being punched into your balls would suck to use, even if it was smart.
But this is exactly why you should be careful when using Chatbot Arena to make statements about model capabilities – the top of the chart is mainly determined by style and presentation.
Benchmarks tell a different story: gpt-4o’s abilities are declining.
In six months, GPT-4o’s 0-shot MMLU score has fallen from 87.2 to 85.7, which is probably similar to what GPT-4 scored on release.
(to be clear, “GPT-4” doesn’t mean “an older GPT-4o” or “GPT-4 Turbo”, but “the original broke-ass GPT-4 from March 2023, with 8k context and no tools/search/vision and Sept 2021 training data”).
I am more concerned about the collapse of GPT-4o’s score on the GPQA benchmark, which fell from 53.1 to 46.0. This is a significant drop, particularly in light of the tendency for scores to rise as data contaminates the internet. (Claude 3.5 Sonnet scores 59.4, for comparison.)
An independent test by Artificial Analysis (on the GPQA diamond subset) found that GPT-4o scored 39.00. They’ve downgraded the model to 71/100, or equal to GPT-4o mini (OpenAI’s free model) in capabilities.
Some of their findings complicate the picture I’ve just described (in particular, they have GPT-4o scoring a higher MMLU than OpenAI’s internal evals), but the bottom line is that the new gpt-4o-2024-11-20 is the worst of its line by nearly every metric they test, except for token generation speed.
On LiveBench (https://livebench.ai), GPT-4o’s scores appear to be either stagnant or regressing.
It doesn’t hurt to have a personal benchmark or two, relating to your own weird corner of the world. Either you’ll have a way to evaluate AIs that escapes the Goodharting suffered by large benchmarks, or OpenAI starts fine-tuning AIs on your niche use case (in which case, mission fucking accomplished.)
I like to ask LLMs to list the levels in the 1997 PC game Claw (an obscure videogame).
Claude 3.5 Sonnet and Claude 3 Opus do great, getting about 80-90% of Claw’s levels correct.
GPT-4-0314 makes a reasonable attempt, getting about 50-75% right. Typically the first half of the game is fine, with the other levels being a mix of real and hallucinated.
(once, it listed “Wreckage” as a level in the game. That’s actually a custom level I helped make when I was 14-15. I found that weirdly moving: I’d found a shard of myself in the corpus.)
GPT-4o scores like ass: typically in the sub-50% range. It doesn’t even consistently nail how many levels are in the game. It correctly lists some levels but these are mostly out of order. It has strange fixed hallucinations. Over and over, it insists there’s a level called “Tawara Seaport”—which is a real-world port near the island of Kiribati. Not even a sensible hallucination given the context of the game.
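A test like this can be scored mechanically instead of by eyeball. Below is a minimal sketch, assuming you keep your own ground-truth list and have already parsed the model's answer into a list of names. The level names are placeholders (not Claw's real levels, apart from the hallucinated one quoted above), and `score_level_list` is a hypothetical helper, not part of any existing benchmark.

```python
def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't count as misses."""
    return " ".join(name.lower().split())


def score_level_list(reference: list[str], answer: list[str]) -> dict:
    """Compare a model's list against a non-empty ground-truth list."""
    ref_index = {normalize(name): i for i, name in enumerate(reference)}
    recalled = []      # real levels the model named, in the order it gave them
    hallucinated = []  # names that are not in the ground truth at all
    for raw in answer:
        name = normalize(raw)
        if name in ref_index:
            if name not in recalled:
                recalled.append(name)
        else:
            hallucinated.append(raw)
    positions = [ref_index[name] for name in recalled]
    return {
        "recall": len(recalled) / len(ref_index),         # share of real levels named
        "hallucinated": hallucinated,                     # invented levels
        "count_matches": len(answer) == len(reference),   # did it get the level count right?
        "in_order": positions == sorted(positions),       # are the real ones in game order?
    }


# Placeholder data for illustration only.
REFERENCE = ["Level A", "Level B", "Level C", "Level D"]
ANSWER = ["level a", "Level C", "Level B", "Tawara Seaport"]

result = score_level_list(REFERENCE, ANSWER)
print(result)
```

Tracking hallucinations and ordering separately from recall matters here, since those are exactly the failure modes described above: wrong level count, shuffled order, and the same invented level turning up again and again.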
Another prompt is “What is Ulio, in the context of Age of Empires II?”
GPT-4-0314 tells me it’s a piece of fan-made content, created by Ingo van Thiel. When I ask what year Ulio was made, it says “2002”. This is correct.
GPT-4o-2024-11-20 has no idea what I’m talking about.
To me, it looks like a lot of “deep knowledge” has vanished from the GPT-4 model. It’s now smaller and shallower and lighter, its mighty roots chipped away, its “old man strength” replaced with a cheap scaffold of (likely crappy) synthetic data.
What about creative writing? Is it better on creative writing?
Who the fuck knows. I don’t know how to measure that. Do you?
A notable attempt is EQBench, which uses Claude 3.5 as a judge to evaluate writing samples. gpt-4o-2024-11-20 is tied for first place. So that seems bullish.
Most of these samples seem fairly mediocre to me. Uncreative, generic, packed with empty stylistic flourishes and pretentious “fine writing”.
The cockpit was a cacophony of dwindling lights and systems gasping their final breaths, a symphony of technological death rattles. Captain Elara Veyra sat in the command chair, her face illuminated by the sickly green glow of the emergency power indicator, which pulsed like the heartbeat of a dying creature. The Erebus Ascendant, once a proud envoy of humanity’s indomitable spirit, now drifted derelict and untethered in the silent abyss of interstellar void. The engines were cold, the life support systems faltering, and the ship’s AI had succumbed to cascading failures hours ago, leaving Elara alone with her thoughts, her resolve, and the unceasing hum of entropy.
A cacophony refers to sound: lights cannot form a cacophony. How can there be an “unceasing hum” in a “silent abyss”? How does a light gasp a final breath? What is this drizzling horseshit?
This is what people who don’t read imagine good writing to be. It’s exactly what you’d expect from a model preference-hacked on the taste of people who do not have taste.
ChatGPTese is creeping back in (a problem I thought they’d fixed). “Elara”… “once a proud envoy of humanity’s indomitable spirit”… “a testament to…” At least it doesn’t say “delve”.
Claude 3.5 Sonnet’s own efforts feel considerably more “alive”, thoughtful, and humanlike.
(Note the small details of the thermal blanket and the origami bird in “The Last Transmission”. There’s nothing really like that in GPT-4o’s stories.)
So if GPT-4o is getting worse, what would that mean?
There are two options:
1) It’s unintentional. In this world, OpenAI is incompetent. They are dumpstering their model to win a leaderboard dick-measuring contest against DeepMind.
2) It’s intentional. In this world, a new, better model is coming, and GPT-4o is being “right-sized” for a new position in the OA product line.
Evidence for the latter is the fact that token-generation speed has increased, which indicates they’ve actively made the model smaller.
If this is the path we’re on, I predict that GPT-4o will become a free model soon. And behind the ChatGPT Plus paywall will be something else: Orion, GPT-5, or the full o1.