This post is speculation + crystal balling. A change might be coming.
OpenAI has spent six months rolling out updates to GPT-4o. These perform extremely well by human-preference metrics.
gpt-4o-2024-11-20, the latest endpoint, boasts a colossal 1360 Elo on Chatbot Arena, compared to the earliest GPT-4o, which scored a meager 1285. What does that mean? That blind human raters prefer gpt-4o-2024-11-20’s output roughly 60% of the time.
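For reference, the standard Elo logistic (which Chatbot Arena's Bradley-Terry fit approximates) converts a rating gap into a head-to-head win probability. A quick sketch of what a 75-point gap actually implies:

```python
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """Expected probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# 1360 vs. 1285: a 75-point gap
p = elo_win_prob(1360, 1285)
print(f"{p:.1%}")  # → 60.6%
```

So a 75-point lead means winning a bit over 60% of blind comparisons, not a blowout.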
I believe this is the result of aggressive human preference-hacking on OpenAI’s part, not any real advances.
Control for style, and gpt-4o-2024-11-20 drops by fifty points. It remains at #1, but only because the other preference-hacked models at the top also drop a lot.
Claude 3.5 Sonnet gains points. So do most of the older GPT-4s.
Optimizing for human preference is not a wrong thing to do, per se. So long as humans use LLMs, what they like matters. An LLM that produced output in the form of Morse code being punched into your balls would suck to use, even if it was smart.
But this is exactly why you should be careful when using Chatbot Arena to make statements about model capabilities – the top of the chart is mainly determined by style and presentation.
Benchmarks tell a different story: gpt-4o’s abilities are declining.
In six months, GPT-4o’s 0-shot MMLU score has fallen from 87.2 to 85.7, which is probably similar to what GPT-4 scored on release.
(to be clear, “GPT-4” doesn’t mean “an older GPT-4o” or “GPT-4 Turbo”, but “the original broke-ass GPT-4 from March 2023, with 8k context and no tools/search/vision and Sept 2021 training data”).
I am more concerned about the collapse of GPT-4o’s score on the GPQA benchmark, which fell from 53.1 to 46.0. This is a significant drop, particularly in light of the tendency for scores to rise as benchmark data contaminates the internet. (Claude 3.5 Sonnet scores 59.4, for comparison.)
An independent test by Artificial Analysis (on the GPQA diamond subset) found that GPT-4o scored 39.0. They’ve downgraded the model to 71/100 overall, on par with GPT-4o mini (OpenAI’s free model).
Some of their findings complicate the picture I’ve just described (in particular, they have GPT-4o scoring a higher MMLU than OpenAI’s internal evals do), but the bottom line is that the new gpt-4o-2024-11-20 is the worst of its line by nearly every metric they test, except token generation speed.
LiveBench
GPT-4o’s scores appear to be either stagnant or regressing.
It doesn’t hurt to have a personal benchmark or two, relating to your own weird corner of the world. Either you’ll have a way to evaluate AIs that escapes the Goodharting suffered by large benchmarks, or OpenAI starts fine-tuning AIs on your niche use case (in which case, mission fucking accomplished.)
I like to ask LLMs to list the levels in Claw, an obscure 1997 PC game.
Claude 3.5 Sonnet and Claude 3 Opus do great, getting about 80-90% of Claw’s levels correct.
GPT-4-0314 makes a reasonable attempt, getting about 50-75% right. Typically the first half of the game is fine, with the other levels being a mix of real and hallucinated.
(once, it listed “Wreckage” as a level in the game. That’s actually a custom level I helped make when I was 14-15. I found that weirdly moving: I’d found a shard of myself in the corpus.)
GPT-4o scores like ass: typically in the sub-50% range. It doesn’t even consistently nail how many levels are in the game. It correctly lists some levels, but mostly out of order. It has strange fixed hallucinations: over and over, it insists there’s a level called “Tawara Seaport”, apparently a mangling of Tarawa, a real-world port atoll in the island nation of Kiribati. Not even a sensible hallucination given the context of the game.
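A personal benchmark like this is easy to score mechanically once you have the model’s answer as a list of strings. A minimal sketch (the level names below are stand-ins, not Claw’s real level list; the fuzzy-match threshold and normalization are my assumptions):

```python
import bisect
from difflib import SequenceMatcher

def score_level_list(truth, answer, threshold=0.8):
    """Score a model's attempted level list against ground truth.

    An answer entry is a hit if it fuzzily matches an unclaimed true level
    name; "in_order" measures how much of the real sequence order the hits
    preserve (longest increasing subsequence over matched indices).
    """
    norm = lambda s: s.lower().strip()
    matched = []  # indices into `truth`, in the order the model listed them
    for name in answer:
        best_i, best_ratio = None, 0.0
        for i, t in enumerate(truth):
            r = SequenceMatcher(None, norm(name), norm(t)).ratio()
            if r > best_ratio:
                best_i, best_ratio = i, r
        if best_ratio >= threshold and best_i not in matched:
            matched.append(best_i)

    def lis_len(seq):
        # longest increasing subsequence via patience sorting
        tails = []
        for x in seq:
            j = bisect.bisect_left(tails, x)
            if j == len(tails):
                tails.append(x)
            else:
                tails[j] = x
        return len(tails)

    return {
        "recall": len(matched) / len(truth),
        "in_order": lis_len(matched) / max(len(matched), 1),
    }

# Hypothetical run: one exact hit, one hallucination, one correct level
truth = ["La Roca", "The Battlements", "Thief's Forest"]
guess = ["la roca", "Tawara Seaport", "Thief's Forest"]
print(score_level_list(truth, guess))
```

The point isn’t the exact metric, it’s that a niche question with a known answer key gives you a repeatable eval no lab has Goodharted.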
Another prompt is “What is Ulio, in the context of Age of Empires II?”
GPT-4-0314 tells me it’s a piece of fan-made content, created by Ingo van Thiel. When I ask what year Ulio was made, it says “2002”. This is correct.
GPT-4o-2024-11-20 has no idea what I’m talking about.
To me, it looks like a lot of “deep knowledge” has vanished from the GPT-4o model. It’s now smaller, shallower, and lighter, its mighty roots chipped away, its “old man strength” replaced with a cheap scaffold of (likely crappy) synthetic data.
What about creative writing? Is it any better there?
Who the fuck knows. I don’t know how to measure that. Do you?
A notable attempt is EQBench, which uses Claude 3.5 as a judge to evaluate writing samples. gpt-4o-2024-11-20 is tied for first place. So that seems bullish.
Most of these samples seem fairly mediocre to me. Uncreative, generic, packed with empty stylistic flourishes and pretentious “fine writing”.
The cockpit was a cacophony of dwindling lights and systems gasping their final breaths, a symphony of technological death rattles. Captain Elara Veyra sat in the command chair, her face illuminated by the sickly green glow of the emergency power indicator, which pulsed like the heartbeat of a dying creature. The Erebus Ascendant, once a proud envoy of humanity’s indomitable spirit, now drifted derelict and untethered in the silent abyss of interstellar void. The engines were cold, the life support systems faltering, and the ship’s AI had succumbed to cascading failures hours ago, leaving Elara alone with her thoughts, her resolve, and the unceasing hum of entropy.
A cacophony refers to sound: lights cannot form a cacophony. How can there be an “unceasing hum” in a “silent abyss”? How does a light gasp a final breath? What is this drizzling horseshit?
This is what people who don’t read imagine good writing to be. It’s exactly what you’d expect from a model preference-hacked on the taste of people who do not have taste.
ChatGPTese is creeping back in (a problem I thought they’d fixed). “Elara”… “once a proud envoy of humanity’s indomitable spirit”… “a testament to…” At least it doesn’t say “delve”.
Claude 3.5 Sonnet’s own efforts feel considerably more “alive”, thoughtful, and humanlike.
(Note the small details of the thermal blanket and the origami bird in “The Last Transmission”. There’s nothing really like that in GPT-4o’s stories.)
So if GPT-4o is getting worse, what would that mean?
There are two options:
1) It’s unintentional. In this world, OpenAI is incompetent: they’re dumpstering their model to win a leaderboard dick-measuring contest against DeepMind.
2) It’s intentional. In this world, a new, better model is coming, and GPT-4o is being “right-sized” for a new position in the OA product line.
Evidence for the latter is the fact that token-generation speed has increased, which indicates they’ve actively made the model smaller.
If this is the path we’re on, I predict that GPT-4o will become a free model soon. And behind the ChatGPT Plus paywall will be something else: Orion, GPT-5, or the full o1.
Scott Alexander created a Turing Test for AI generated artwork. Begin quote:
Here are fifty pictures. Some of them (not necessarily exactly half) were made by humans; the rest are AI-generated. Please guess which are which. Please don’t download them onto your computer, zoom in, or use tools besides the naked eye. Some hints:
I’ve tried to balance type of picture / theme, so it won’t be as easy as “everything that looks like digital art is AI”.
I’ve tried to crop some pictures of both types into unusual shapes, so it won’t be as easy as “everything that’s in DALL-E’s default aspect ratio is AI”.
At the end, it will ask you which picture you’re most confident is human, which picture you’re most confident is AI, and which picture was your favorite – so try to keep track of that throughout the exercise.
All the human pictures are by specific artists who deserve credit (and all the AI pictures are by specific prompters/AI art hobbyists who also deserve credit) but I obviously can’t do that here. I’ll include full attributions on the results post later.
I got 88% correct (44/50). Here’s my attempt (or rather my imperfect memory of my attempt), and my justification for the answers I gave.
Concluding Thoughts
You’re in a desert walking along in the sand when all of a sudden you look down, and you see a tortoise, it’s crawling toward you. You reach down, you flip the tortoise over on its back. The tortoise lays on its back, its belly baking in the hot sun, beating its legs, trying to turn itself over, but it can’t, not without your help. But you’re not helping. Why is that?
This guitar teacher talks about an amusing, perhaps alarming, trend he’s seen among his students.
In the 90s, they’d be hardcore fans of a particular band. They’d want to learn Metallica’s Master of Puppets in its entirety, or some obscure song buried at the back of an album. They’d display fierce loyalty to a chosen artist or style.
Things changed in the Napster/Limewire era (early 2000s). Digital filesharing meant the album slowly started to die. Kids would rock up to him with home-burned CDs and tapes of random songs collaged from various places. This was exciting, as far as it went. Kids were taking control of their music. Albums are ultimately a marketing construct from the 1950s dictated by manufacturing constraints. There’s no God-given reason why music has to be doled out in 40-60 minute blobs, all by the same artist, and with an immutable track order. Other worlds are possible. There are more things in heaven and earth than are dreamed of in your philosophy, Horatio.
But the new generation of listeners had far less loyalty to individual bands/artists. They did not know the “deep tracks”. If they wanted to learn an Offspring song, it would always be one of the same 3-4 songs.
Today, it has shifted again. His students are like “yo, I want to learn $SONG”, he’s like “so you like $SONG_ARTIST?” and he gets a blank stare. They live in a world where endless music drifts algorithmically in front of them, like indistinguishable ocean waves. Sometimes they like it, but this doesn’t provoke any interest in who made the music, where it came from, what its context is, and so forth. Why even learn those things? More and equally good music is coming along soon. There’s no reason to be a fan of anyone. Once the artist vanishes from Spotify playlists, they can safely forget about them.
(Superfans obviously still exist, but now seem to be motivated more by weird parasocial obsessions than actual artistic output. What would the average kpop “sasaeng” desire more—an unreleased song by Jungkook, or a piece of Jungkook’s shirt?)
It makes you wonder… what does it actually mean to be a fan of someone?
I view fandom as a search algorithm. A way of managing the limitless choices of entertainment.
If I want to read a book there are millions of them, but reading random books is a poor use of my time: most are bad/uninteresting/unsuitable for me. I can dramatically increase my odds of finding a good book by reading an author I’ve enjoyed in the past.
This creates an illusion that the author is important. Actually, “author” is just an efficient branch on the search tree of book discovery. If my favorite author started writing bad books, I’d eventually stop reading him. The books are what matters.
Which leads to the question: what happens when search algorithms can connect you with good books better than the tried-and-true method of “read books by your favorite author”? Do you still need to have a favorite author? What happens in that world? Does the fan still exist?
Overnight, a Tiktoker became the third biggest star on the platform because she nodded her head to an electronic beat for a few seconds.
This is literally a fame lottery. There is no reasonable way that talent, perseverance, or “star power” can manifest in a video that’s a few seconds long and where you don’t even talk. (The Tiktok algorithm, by the way, is believed to inject a fair amount of random noise, precisely to stop people gaming it.) We expect stars to have a degree of personal charisma. Stars produced this way, as Tenbarge notes below, tend to be punishingly average and unsuited for fame.
All of it — and this is coming from someone who collects influencer merchandise — is incredibly boring and one-dimensional. This is not by any means an invitation to bash Charli, who I feel great sympathy for given her age and precarious position in multiple overlapping industries. She’s at the epicenter of a new generation’s group of power brokers, with her every move impacting the salaries of grown adults, including her parents and older sister. And I mean this in this kindest way possible — she’s not qualified for any of it. Charli, who I have never personally spoken to, comes off as incredibly sweet, caring, and normal. She reminds me of every high school-aged white girl at my hideously expensive dance studio in suburban Cincinnati. She can definitely perform well, as well as any “Dance Moms-”era teenage competitive dancer from Connecticut. But if you read an interview with Charli and her older sister Dixie, the mediocrity is palpable. They don’t really have anything interesting to say about, well, anything — and if they do, their publicists won’t let them. I’ve seen a handful of their YouTube videos, listened to clips from their podcasts, and scrolled through dozens of Instagram posts, TikTok videos, and tweets. There’s nothing wrong with either of them, they’re just oppressively average. And by the way, so are all their friends in the Hype House and Sway House and whatever new teeny-bopper house went on the market this week.
Bella Poarch has attempted to turn her fame into a recording career, with results familiar to anyone who remembers past cases like Kreayshawn or Tila Tequila.
To me, this feels like an area where the past world is meshing incongruously with the new, algorithmic one.
The industry is trying to position Bella Poarch for a lasting career, where her fans continue to buy the songs she puts out for a long time because she’s talented and special. But if her audience cared enough to do that, they probably wouldn’t have found her on Tiktok to begin with. Bella Poarch is the product of a system where the old rules no longer quite apply. She is a nobody who was plucked out of obscurity, through no great merit of her own. She’s not some rising, hot property who will sustain a long career.
And it’s all so meaningless! In 1964, 73 million people watched the Beatles on The Ed Sullivan Show, and it reshaped the face of music. Today, Tiktoks get billions of views and leave no mark on popular culture at all. It’s just noise that gets drowned out by the next day’s noise.
(Hero image: Laura Makabresku | The silent light of God’s Mercy)