
This post is speculation + crystal balling. A change might be coming.

OpenAI has spent six months rolling out updates to GPT-4o. The updated models perform extremely well on human-preference metrics.

gpt-4o-2024-11-20, the latest endpoint, boasts a colossal 1360 Elo on Chatbot Arena, compared to the earliest GPT-4o, which scored a meager 1285. What does that mean? That blind human raters prefer gpt-4o-2024-11-20’s output roughly 61% of the time (that’s what a 75-point Elo gap cashes out to).
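For reference, that figure falls straight out of the standard Elo formula, which converts a rating gap into an expected win rate. A quick Python sketch:

    # Expected win rate implied by an Elo gap: P = 1 / (1 + 10^(-gap/400))
    def elo_win_probability(gap: float) -> float:
        """Probability that the higher-rated model's answer is preferred."""
        return 1 / (1 + 10 ** (-gap / 400))

    print(elo_win_probability(1360 - 1285))  # ~0.61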

I believe this is the result of aggressive human preference-hacking on OpenAI’s part, not any real advances.

Control for style, and gpt-4o-2024-11-20 drops by fifty points. It remains at #1, but only because the other preference-hacked models at the top also drop a lot.

Claude 3.5 Sonnet gains points. So do most of the older GPT-4s.

Optimizing for human preference isn’t wrong, per se. So long as humans use LLMs, what they like matters. An LLM that delivered its output as Morse code punched into your balls would suck to use, even if it was smart.

But this is exactly why you should be careful when using Chatbot Arena to make statements about model capabilities – the top of the chart is mainly determined by style and presentation.
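For intuition on what “controlling for style” means: Chatbot Arena fits battle outcomes with a Bradley-Terry model, and style control (roughly) adds style covariates such as response length and markdown density, so stylistic advantages get absorbed by those terms instead of inflating a model’s rating. Here’s a toy sketch of the idea with fully invented data; it’s not their actual pipeline:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_models, n_battles = 4, 5000

    # Invented battles: model a vs. model b, plus one style covariate
    # (think: normalized difference in response length).
    a = rng.integers(0, n_models, n_battles)
    b = rng.integers(0, n_models, n_battles)
    keep = a != b
    a, b = a[keep], b[keep]            # drop self-battles
    style_diff = rng.normal(size=a.shape)
    true_skill = np.array([0.0, 0.3, 0.6, 0.9])   # invented latent skills
    p_a_wins = 1 / (1 + np.exp(-(true_skill[a] - true_skill[b] + 1.5 * style_diff)))
    a_wins = (rng.random(a.shape) < p_a_wins).astype(int)

    # Design matrix: +1 in model a's column, -1 in model b's, style term last.
    X = np.zeros((len(a), n_models + 1))
    X[np.arange(len(a)), a] = 1.0
    X[np.arange(len(a)), b] = -1.0
    X[:, -1] = style_diff

    # The style column soaks up the stylistic advantage, so the model
    # coefficients estimate skill alone.
    fit = LogisticRegression(fit_intercept=False).fit(X, a_wins)
    skill = fit.coef_[0][:n_models]
    print((400 / np.log(10)) * (skill - skill.mean()))  # Elo-scale ratings

Refit without the style column and whichever model writes the flashiest answers gets extra rating; that difference is what the style-controlled leaderboard subtracts off.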

Benchmarks tell a different story: gpt-4o’s abilities are declining.

https://github.com/openai/simple-evals

In six months, GPT-4o’s 0-shot MMLU score has fallen from 87.2 to 85.7, which is probably similar to what the original GPT-4 scored on release.

(To be clear, “GPT-4” here doesn’t mean “an older GPT-4o” or “GPT-4 Turbo”, but “the original broke-ass GPT-4 from March 2023, with 8k context, no tools/search/vision, and September 2021 training data”.)

I am more concerned about the collapse of GPT-4o’s score on the GPQA benchmark, which fell from 53.1 to 46.0. This is a significant drop, particularly given that scores tend to rise over time as benchmark data contaminates the internet. (Claude 3.5 Sonnet scores 59.4, for comparison.)

Even this may be optimistic:

https://twitter.com/ArtificialAnlys/status/1859614633654616310

An independent test by Artificial Analysis (on the GPQA diamond subset) found that GPT-4o scored 39.00. They’ve downgraded the model to 71/100, or equal to GPT-4o mini (OpenAI’s free model) in capabilities.

Further benching here:

https://artificialanalysis.ai/providers/openai

Some of their findings complicate the picture I’ve just described (in particular, they have GPT-4o scoring a higher MMLU than OpenAI’s internal evals do), but the bottom line is that the new gpt-4o-2024-11-20 is the worst of its line on nearly every metric they test, except token generation speed.

LiveBench

https://livebench.ai

GPT-4o’s scores appear to be either stagnant or regressing.

gpt-4o-2024-05-13 -> 53.98
gpt-4o-2024-08-06 -> 56.03
chatgpt-4o-latest-0903 -> 54.25
gpt-4o-2024-11-20 -> 52.83

Aider Bench

https://github.com/Aider-AI/aider-swe-bench

Stagnant or regressing.

gpt-4o-2024-05-13 -> 72.9%
gpt-4o-2024-08-06 -> 71.4%
chatgpt-4o-latest-0903 -> 72.2%
gpt-4o-2024-11-20 -> 71.4%

Personal benchmarks

It doesn’t hurt to have a personal benchmark or two, relating to your own weird corner of the world. Either you’ll have a way to evaluate AIs that escapes the Goodharting suffered by large benchmarks, or OpenAI starts fine-tuning AIs on your niche use case (in which case, mission fucking accomplished.)

I like to ask LLMs to list the levels in the 1997 PC game Claw (an obscure videogame).

Claude 3.5 Sonnet and Claude 3 Opus do great, getting about 80-90% of Claw’s levels correct.

GPT-4-0314 makes a reasonable attempt, getting about 50-75% right. Typically the first half of the game is fine, with the other levels being a mix of real and hallucinated.

(Once, it listed “Wreckage” as a level in the game. That’s actually a custom level I helped make when I was 14-15. I found that weirdly moving: I’d found a shard of myself in the corpus.)

GPT-4o scores like ass: typically in the sub-50% range. It doesn’t even consistently nail how many levels are in the game. It correctly lists some levels, but mostly out of order. It has strange fixed hallucinations. Over and over, it insists there’s a level called “Tawara Seaport”, apparently after Tarawa, a real-world port in the island nation of Kiribati. Not even a sensible hallucination given the context of the game.
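If you want to rig up something similar, the scoring part is trivial. A minimal sketch in Python (the two-level ground truth and the canned answer below are placeholders; checking level order is an easy extension):

    # Score a model's "list the levels" answer against a known level list.
    def score_level_list(answer: str, ground_truth: list[str]) -> dict:
        lines = [l.strip(" -*.0123456789").lower()
                 for l in answer.splitlines() if l.strip()]
        truth = {name.lower() for name in ground_truth}
        return {
            "recall": len(truth & set(lines)) / len(truth),
            "hallucinated": [l for l in lines if l not in truth],
        }

    print(score_level_list("1. La Roca\n2. Tawara Seaport",
                           ["La Roca", "The Battlements"]))
    # {'recall': 0.5, 'hallucinated': ['tawara seaport']}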

Another prompt is “What is Ulio, in the context of Age of Empires II?”

GPT-4-0314 tells me it’s a piece of fan-made content, created by Ingo van Thiel. When I ask what year Ulio was made, it says “2002”. This is correct.

GPT-4o-2024-11-20 has no idea what I’m talking about.

To me, it looks like a lot of “deep knowledge” has vanished from the GPT-4 model. It’s now smaller and shallower and lighter, its mighty roots chipped away, its “old man strength” replaced with a cheap scaffold of (likely crappy) synthetic data.

What about creative writing? Is it any better at that?

Who the fuck knows. I don’t know how to measure that. Do you?

A notable attempt is EQBench, which uses Claude 3.5 as a judge to evaluate writing samples. gpt-4o-2024-11-20 is tied for first place. So that seems bullish.

https://eqbench.com/creative_writing.html

…but you’ll note that it’s tied with a 9B model, which makes me wonder about Claude 3.5 Sonnet’s judging.

https://eqbench.com/results/creative-writing-v2/gpt-4o-2024-11-20.txt

Most of these samples seem fairly mediocre to me. Uncreative, generic, packed with empty stylistic flourishes and pretentious “fine writing”.

The cockpit was a cacophony of dwindling lights and systems gasping their final breaths, a symphony of technological death rattles. Captain Elara Veyra sat in the command chair, her face illuminated by the sickly green glow of the emergency power indicator, which pulsed like the heartbeat of a dying creature. The Erebus Ascendant, once a proud envoy of humanity’s indomitable spirit, now drifted derelict and untethered in the silent abyss of interstellar void. The engines were cold, the life support systems faltering, and the ship’s AI had succumbed to cascading failures hours ago, leaving Elara alone with her thoughts, her resolve, and the unceasing hum of entropy.

A cacophony refers to sound: lights cannot form a cacophony. How can there be an “unceasing hum” in a “silent abyss”? How does a light gasp a final breath? What is this drizzling horseshit?

This is what people who don’t read imagine good writing to be. It’s exactly what you’d expect from a model preference-hacked on the taste of people who do not have taste.

ChatGPTese is creeping back in (a problem I thought they’d fixed). “Elara”… “once a proud envoy of humanity’s indomitable spirit”… “a testament to…” At least it doesn’t say “delve”.

Claude 3.5 Sonnet’s own efforts feel considerably more “alive”, thoughtful, and humanlike.

https://eqbench.com/results/creative-writing-v2/claude-3-5-sonnet-20241022.txt

(Note the small details of the thermal blanket and the origami bird in “The Last Transmission”. There’s nothing really like that in GPT-4o’s stories.)

So if GPT-4o is getting worse, what would that mean?

There are two options:

1) It’s unintentional. In this world, OpenAI is incompetent. They are dumpstering their model to win a leaderboard dick-measuring contest against DeepMind.

2) It’s intentional. In this world, a new, better model is coming, and GPT-4o is being “right-sized” for a new position in the OpenAI product line.

Evidence for the latter is the fact that token-generation speed has increased, which suggests they’ve actively made the model smaller.

If this is the path we’re on, I predict that GPT-4o will become a free model soon. And behind the ChatGPT Plus paywall will be something else: Orion, GPT-5, or the full o1.


Scott Alexander created a Turing Test for AI-generated artwork. Begin quote:

Here are fifty pictures. Some of them (not necessarily exactly half) were made by humans; the rest are AI-generated. Please guess which are which. Please don’t download them onto your computer, zoom in, or use tools besides the naked eye. Some hints:

  • I’ve tried to balance type of picture/theme, so it won’t be as easy as “everything that looks like digital art is AI”.
  • I’ve tried to crop some pictures of both types into unusual shapes, so it won’t be as easy as “everything that’s in DALL-E’s default aspect ratio is AI”.

At the end, it will ask you which picture you’re most confident is human, which picture you’re most confident is AI, and which picture was your favorite – so try to keep track of that throughout the exercise.

All the human pictures are by specific artists who deserve credit (and all the AI pictures are by specific prompters/AI art hobbyists who also deserve credit) but I obviously can’t do that here. I’ll include full attributions on the results post later.

I got 88% correct (44/50). Here’s my attempt (or rather my imperfect memory of my attempt), and my justification for the answers I gave.

1) Human. Stuff like the cherubs holding her skirts (one blindfolded, one a cyborg) reads like the kind of deliberate creative choice that AI never makes. The birds and feathers and hair (typical pain points for AI) are all sharp and coherent.
2) Human. I haven’t seen an AI-generated image like this before. Tree branches overlap believably (instead of disappearing/multiplying/changing when they occlude each other, as in AI images).
3) Human. Has an AI feel, but the hair strands look correct, and the frills hang believably across her shoulders. The image is loaded with intricate, distinct-yet-similar details (waves/foam/fabric/clouds/birds) that never blend into mush.
4) AI. Blobby oven-mitt hands. Deformed foot. Nobody in the 18th/19th century would allow their daughter to be painted wearing such a scandalous dress.
5) Human. Coherent and symmetrical down to the smallest detail. Note the dividers on the window.
6) Hard call. I went with AI because of the way the blackness of the woman’s hair abruptly changes intensity when the curving line bisects it. A random, unmotivated choice that a human wouldn’t make, but a machine might.
7) AI. Right hand has two thumbs.
8) Human. Haven’t seen any AI images like this before. The sleeping disciples are sprawled in complex but believable ways.
9) Human. I went back and forth, but decided the red crosshatching pattern was too coherent to be AI.
10) AI. I recognize the model that created this: DALL-E 3, which has a grainy, harsh, oversaturated look. Other clues are the nonsensical steps to nowhere, the symmetry errors, and the kitsch colors of the gate, which detract from the sense of ancient grandeur. Nobody would spend so much time on the details of the stonework, only to make it look like it was built out of Lego blocks.
11) A hard one. I guessed AI because of the way the windows/chimneys of the houses appear to be slanted. No reason a human artist would do that. (I didn’t see the nonsensical “signature” in the bottom left corner; if I had, I would have guessed Human.)
12) Guessed AI, got it wrong. Very hard.
13) Not hard. At least 10 separate details would make me guess “AI” at 90% confidence. The hair strands are a melty, incoherent mess. The hands have froglike webbing on them. Her skirt is excessively detailed and its metallic designs lack symmetry. Her earring floats in space. The stylistic choices are confused: the girl’s face is a simplified anime design; does that agree with the ultra-detailed skirt and the near-photorealistic water? Also, where is the light source coming from?
14) AI. Not sure what made me think a human hadn’t made it. Maybe the lack of doors: how do you get in and out?
15) Human. Many specific choices, plus the text on the temple.
16) Heavy AI slop feel. Many occlusion errors. Deformed hands. Probably DALL-E 3 again.
17) I actually can’t remember how I guessed here, but it’s pretty clearly AI. The face’s left eye is deformed in a random, unmotivated way. Occlusion errors. Filled with ugly, harsh artifacts. Seems like it was trying to write letters in the middle of the image and gave up.
18) Human. Mild slop aesthetic (pretty girl + shiny plastic skin + random sparkles/nonsense) plus heterochromia should point to AI, but the hair strands are too coherent.
19) Human. Don’t know why I gave this answer. Was probably a guess.
20) AI. What’s the middle black rectangle in the house? It can’t be a door, because it doesn’t reach the ground. If it’s a window, why doesn’t it match the other two?
21) lol, get this cancer off my screen.
22) What makes this look human? I don’t know. Maybe it’s the stark, understated tree? AI would probably put dramatic, explosion-esque leaves or flowers on it to match the sky.
23) A good example of the Midjourney slop aesthetic. It superficially invokes liturgical religious iconography, yet Mary wears a full face of Instathot makeup and has detailed veins on her hands. Very weird, inhuman choices. What’s the little nub of flesh poking between her right thumb and index finger?
24) Midjourney slop. Occlusion errors on the hair strands. It tried to give her an earring but didn’t finish the job, leaving a strange malformed icicle hanging off her ear.
25) Human. This was a bit of a cheat because I’d seen the image before: it’s a render by Michael Stuart and was my desktop background for a while. If I hadn’t seen it, I would still have said “human” based on the complex, coherent rigging and ratlines.
26) AI. Mangled hands. The man wears two or three belts at once, and the holes don’t make sense. His loincloth tore in a weird way, giving him a Lara Croft-esque thigh-holster strap. An interesting example of stylistic blend: it’s going for a Caravaggio-esque painting, but mistakenly puts all sorts of painterly details into the setting itself: notice how the man’s left hand appears to be holding a pen or a brush, and how the windowsill is a gilded painting frame.
27) AI. Looks like a Midjourney image from two years ago.
28) Blindly guessed “Human” and got lucky.
29) Human. The windows look coherent.
30) AI. Got it wrong. I thought “what are the odds that two such similar-looking pictures back to back would both be by humans?”
31) AI. The cafe has seemingly hundreds of chairs and tables, some of which are overlapping or inside one another. Why are plants growing inside the window frames?
32) Guessed AI because of the numerous mistakes in the reflections. It wasn’t. Come on, man.
33) So sloppy that Oliver Twist is holding up a bowl and asking for more.
34) Human. I was rushing and becoming careless. Seems obvious it’s AI in hindsight, when you see how messy the fruit is.
35) Human. The musculature of the man’s torso is consistent with how human bodies were portrayed by late-medieval artists (not all of them could pilfer corpses), but inconsistent with AI, which is mode-collapsed into anatomical accuracy (see the painting of the Blessed Virgin Mother above).
36) Human. It’s unlike AI attempts I’ve seen at this style. The signature looks real.
37) Looks like a 4-year-old GAN image.
38) I think this is by DALL-E 3. It loves excessive pillars.
39) Human, but I don’t know what makes it human. I haven’t seen AI images like it.
40) If AI slop were a videogame, this picture would be the final boss.
41) Messy, but packed with deliberate human choices and intent. Characters interact with each other in complex ways. Branches and trees look correct.
42) Human. I’d seen this before.
43) AI. A very old image.
44) Guessed human. Whoops. It was AI.
45) AI. Reflection errors in the water. The tree roots on the left look wrong.
46) More slop. The surface of the landing craft is just random shit (and its feet are asymmetrical). The right astronaut has a weird proboscis sprouting from his helmet.
47) Human. Complex interactions between positive and negative space. The cutouts are chaotic yet have a congruent internal logic. I’ve never seen an AI image like it. Some of the cutouts have torn white edges—a human error.
48) Human. Difficult to say. I think AI images generally have either a clear subject or a clear focal point.
49) AI. Another hard one. What swayed me was the second door on the left-side building. It seems to exist off the edge of the land.
50) We have such slop to show you. Cute big-eyed robot + staring directly at viewer + meaningless graffiti sprays + meaningless planets and circles to take up space = Midjourney.

Concluding Thoughts

You’re in a desert walking along in the sand when all of a sudden you look down, and you see a tortoise, it’s crawling toward you. You reach down, you flip the tortoise over on its back. The tortoise lays on its back, its belly baking in the hot sun, beating its legs, trying to turn itself over, but it can’t, not without your help. But you’re not helping. Why is that?


This guitar teacher talks about an amusing, perhaps alarming, trend he’s seen among his students.

In the 90s, they’d be hardcore fans of a particular band. They’d want to learn Metallica’s Master of Puppets in its entirety, or some obscure song buried at the back of an album. They’d display fierce loyalty to a chosen artist or style.

This era of music is summed up by Zebra Man in Jeff Krulik and John Heyn’s infamous gonzo documentary Heavy Metal Parking Lot. “Heavy metal rules! All that punk shit sucks! It belongs on fucking Mars, man!”

Things changed in the Napster/Limewire era (early 2000s). Digital filesharing meant the album slowly started to die. Kids would rock up to him with home-burned CDs and tapes of random songs collaged from various places. This was exciting, as far as it went. Kids were taking control of their music. Albums are ultimately a marketing construct from the 1950s dictated by manufacturing constraints. There’s no God-given reason why music has to be doled out in 40-60 minute blobs, all by the same artist, and with an immutable track order. Other worlds are possible. There are more things in heaven and earth than are dreamed of in your philosophy, Horatio.

But the new generation of listeners had far less loyalty to individual bands/artists. They did not know the “deep tracks”. If they wanted to learn an Offspring song, it would always be one of the same 3-4 songs.

Today, it has shifted again. His students are like “yo, I want to learn $SONG”, he’s like “so you like $SONG_ARTIST?” and he gets a blank stare. They live in a world where endless music drifts algorithmically in front of them, like indistinguishable ocean waves. Sometimes they like it, but this doesn’t provoke any interest in who made the music, where it came from, what its context is, and so forth. Why even learn those things? More and equally good music is coming along soon. There’s no reason to be a fan of anyone. Once the artist vanishes from Spotify playlists, they can safely forget about them.

(Superfans obviously still exist, but now seem to be motivated more by weird parasocial obsessions than actual artistic output. What would the average kpop “sasaeng” desire more—an unreleased song by Jungkook, or a piece of Jungkook’s shirt?)

It makes you wonder… what does it actually mean to be a fan of someone?

I view fandom as a search algorithm: a way of managing the limitless choices of entertainment.

If I want to read a book there are millions of them, but reading random books is a poor use of my time: most are bad/uninteresting/unsuitable for me. I can dramatically increase my odds of finding a good book by reading an author I’ve enjoyed in the past.

This creates an illusion that the author is important. Actually, “author” is just a highly optimal branch on a search tree for book discovery. If my favorite author started writing bad books, I’d eventually stop reading him. The books are what matters.
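Here’s a toy version of that argument, with every number invented: give each author a latent quality, make each book a noisy sample of it, and compare reading at random against reading authors whose previous book you liked.

    import random
    random.seed(0)

    # Invented model: authors have latent quality; books scatter around it.
    authors = [random.gauss(0, 1) for _ in range(1000)]

    def book(quality):
        return random.gauss(quality, 0.5)

    # Strategy 1: read books at random.
    random_picks = [book(random.choice(authors)) for _ in range(10_000)]

    # Strategy 2: only read authors whose previous book you rated highly.
    liked = [q for q in authors if book(q) > 1.0]
    fan_picks = [book(random.choice(liked)) for _ in range(10_000)]

    print(sum(random_picks) / len(random_picks))  # ~0.0
    print(sum(fan_picks) / len(fan_picks))        # ~1.3

“Favorite author” is just the conditioning step: past signal narrowing the search.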

Which leads to the question: what happens when search algorithms can connect you with good books better than the tried-and-true method of “read books by your favorite author”? Do you still need to have a favorite author? What happens in that world? Does the fan still exist?

Here’s a related thing I recently read: Kat Tenbarge’s Sorry, Bella Poarch, this IS ‘Build a B*tch’

Overnight, a Tiktoker became the third biggest star on the platform because she nodded her head to an electronic beat for a few seconds.

This is literally a fame lottery. There is no reasonable way that talent, perseverance, or “star power” can manifest in a video that’s a few seconds long and where you don’t even talk. (The Tiktok algorithm, by the way, is believed to have a reasonable amount of random noise, to stop people hacking it.) We expect stars to have a degree of personal charisma. Stars produced this way, as Tenbarge notes, tend to be punishingly average and unsuited for fame.

All of it — and this is coming from someone who collects influencer merchandise — is incredibly boring and one-dimensional. This is not by any means an invitation to bash Charli, who I feel great sympathy for given her age and precarious position in multiple overlapping industries. She’s at the epicenter of a new generation’s group of power brokers, with her every move impacting the salaries of grown adults, including her parents and older sister. And I mean this in this kindest way possible — she’s not qualified for any of it. Charli, who I have never personally spoken to, comes off as incredibly sweet, caring, and normal. She reminds me of every high school-aged white girl at my hideously expensive dance studio in suburban Cincinnati. She can definitely perform well, as well as any “Dance Moms-”era teenage competitive dancer from Connecticut. But if you read an interview with Charli and her older sister Dixie, the mediocrity is palpable. They don’t really have anything interesting to say about, well, anything — and if they do, their publicists won’t let them. I’ve seen a handful of their YouTube videos, listened to clips from their podcasts, and scrolled through dozens of Instagram posts, TikTok videos, and tweets. There’s nothing wrong with either of them, they’re just oppressively average. And by the way, so are all their friends in the Hype House and Sway House and whatever new teeny-bopper house went on the market this week.

Bella has attempted to turn her fame into a recording career, with results that will be familiar to anyone who remembers past cases like Kreayshawn or Tila Tequila.

To me, this feels like an area where the past world is meshing incongruously with the new, algorithmic one.

The industry is trying to position Bella Poarch for a lasting career, where her fans continue to buy the songs she puts out for a long time because she’s talented and special. But if her audience cared enough to do that, they probably wouldn’t have found her on Tiktok to begin with. Bella Poarch is the product of a system where the old rules no longer quite apply. She is a nobody who was plucked out of obscurity, through no great merit of her own. She’s not some rising, hot property who will sustain a long career.

And it’s all so meaningless! In 1964, 73 million people watched the Beatles on The Ed Sullivan Show, and it reshaped the face of music. Today, Tiktoks get billions of views and leave no mark on popular culture at all. It’s just noise that gets drowned out by the next day’s noise.

(Hero image: Laura Makabresku | The silent light of God’s Mercy)