COAGULOPATH

I once ran a community site for the game Claw.... | News | Coagulopath

I once ran a community site for the game Claw. Three months back, a journalist for a gaming website contacted me—they were looking for some quotes for an article they were putting together. I don’t believe they ever published the article. For anyone interested, my words are below.

Sorry about the wait. Coastal Australia got hit with a one-in-a-hundred-year storm, and my farm flooded. Pretty cool.

I could write a lot about Claw; certainly, more than anyone has the desire to read.

The short version: in 1997, my dad wrote a review column for an Australian tech magazine. He received samples such as cameras, DVD players, and promo copies of games. One of these games was a 2D platformer called Claw, which was about pirate cats. I liked pirates and cats, so Claw was strongly Relevant to My Interests(tm).

I don’t think I have ever “clicked” with something harder than I did with Claw at age seven. The game was spellbinding; for a long time, it was the only one I played, or wanted to play. It seemed to have real depth and beauty and style. Even today, I don’t mentally shelve it with Mario or Sonic, but with a hardcore “art” game like Eric Chahi’s Another World.

Sorry, but I’m simply not a person who can form objective opinions about Claw. Childhood nostalgia is a hell of a drug. If you held a gun to my head and told me to denounce Claw, you’d have to come up with a shovel and an alibi real fast.

Yes, viewed objectively, it’s flawed. The difficulty is tuned very high for a kids’ game. The gameplay loop is simple and arguably simplistic, built around fiendish jumping puzzles and not much else—the puzzles of level 14 require faster reflexes and timing than those of level 1, but they’re fundamentally the same puzzles. It can get monotonous. The cool-looking enemies are repetitive to fight. One move (jump, then slash downward) wrecks basically every baddie in the game, which is disappointing. (The peg-leg pirates of levels 9 and 10 are among the most memorable opponents you’ll face because they’re the one exception that the trick does not work on). There are bugs. Some levels cannot be perfected due to treasure placed outside Claw’s ability to reach.

None of these really register as problems for me, per se. I have a friend who trash-talks his home city with evident fondness. “You know you love something when you even love the bad parts of it.” That’s where I land with Claw.

Monolith swung way harder than they had to with a 2D platform title. It has traditionally animated cutscenes that possibly cost more to create than the game itself. The soundtrack whips. The art and design are lavish and thoughtful. The levels are always well-designed and frequently a masterclass in how to suck the player in. Level Four (The Dark Woods) perfectly captures how it feels to wander lost in a dense forest—that weirdly terrifying sense of there being both too much empty space around you, and not enough. Level 8 (The Shipyards) has the Captain exploring massive and utterly believable ships (which must be painstakingly assembled in the editor like jigsaw puzzles from seemingly hundreds of 64×64 tiles—it’s a pain in the ass. Building a real ship might be easier.) This is the part of Claw that has held up the best: the rivet-tight sense of immersion it builds around the player.

(I love old games, but find a lot of modernistic “retro” efforts kind of frustrating. Everything in them is a slavish recreation of some generic “classic gaming” touchstone. Pixel art. Chiptunes. Floating hearts as health items. Where’s the vision? Super Mario Bros and Sonic look the way they do because of technological limitations, not because they’d hit upon some objectively perfect Aesthetic of Gaming that must be copied and imitated until the end of time. A lot of “retro” games just feel like parasites upon the past, offering the player nothing except his or her own repackaged nostalgia. When I see people trying to Kickstart a “Doom-style FPS game” I always think “What else do you have to offer? I can fire up Dosbox and play the real Doom any time I want.” Claw strikes a good balance, I think.)

The game has a tangled heap of spaghetti instead of a plot, mainly due to contradictions between the animated cutscenes and the game itself. Captain Claw can apparently warp through time and space. In level 8, he captures a gem, and two levels later, receives that same gem as a gift from a crewmate. He kills off the main antagonist in level 2, and thus a NEW main antagonist is shoved into the story out of nowhere. Claw must assemble a lost map to find Tiger Island, yet somehow his arch-nemesis can find his way to Tiger Island without the map, and Claw’s crew are also there at the end, despite him previously ordering them to stay behind. And so on, ad infinitum. Most Marvel comics need twenty years and two retcon arcs to achieve Claw’s level of confusion. It’s sort of impressive.

The game clearly had a lot cut out of it. You can see fragments of a larger story sticking out like dinosaur bones. Who’s Katherine? What’s the relationship between Claw and Marrow? The game cried out for a sequel. Monolith apparently almost made one.

Observable evidence would suggest that Claw was not a commercial success. Growing up, nobody I knew had ever heard of it. Monolith never made another Claw, or even another game quite like it. Their subsequent titles were cheaply made arcade/action titles like Get Medieval and Gruntz (which reused Claw’s WAP engine, and even some of its art assets), or triple-A FPS titles like No One Lives Forever and FEAR.

I beat Claw in 1998, and then beat it a few more times. I wanted more. There wasn’t any more. I was not on the internet and had no way of meeting other Claw fans (if they even existed). Eventually, I moved on.

In 2005, I remembered the game and decided to play it again. My dad’s old CD-ROM didn’t work. I attempted to buy the game again and found that you couldn’t. Monolith had stopped selling it years ago (this was before Steam or digital distribution). I shrugged and acquired the game through other means.

I replayed through the whole game in a day, was hit by that same “Damn, I wish there was more.” Then I went online and found that people had made custom levels. I played a few hundred of them; then I started creating my own.

I felt like I was making them for ghosts. The English Claw community was basically dead in 2005. There was an official Claw website. Monolith paid the hosting bill on it but did nothing else. The official Claw forum required no account to join (you just typed your handle in the user box) and was obviously about 95% spam and trolls. The occasional newcomer would show up, ask for help finding the game, and get linked to goatse—it was that kind of place.

A few people had Geocities sites with Claw levels on them. One of them was DzjeeAr—a highly prolific and creative level designer whom I looked up to. I mailed him my levels, and he sent back honest but fairly blunt feedback: my levels were too difficult and not that fun. He was right. I started working on making them bigger and better. The Claw level editor had so many options. I kept trying random stuff, and interesting new tricks seemed to fall out of nowhere (like cannonballs that could fly diagonally). I needed somewhere to host them.

At the time, I had a subscription to a PC magazine. It had a brief (and error-filled) “create your own website” guide, showing you how to write basic HTML. I used it to create a Claw fansite, but with some help from my dad, I got a Claw fansite online. In early 2005, The Belated Claw Fansite went live.

I called it “Belated” because I felt like I was making something long after it had ceased to matter. Like erecting a monument to Rome in 477 AD. (Except Claw had never had an imperial phase, so maybe a monument to the Etruscans or whatever.) But in my eyes, Claw was a genuinely great game. It was too good to be forgotten. I didn’t care if I was the only fan the game still had: it still deserved fans.

I promoted my site on the Claw forum. Soon, I expanded and began hosting other people’s levels, along with downloads and guides and so on.

I was the greenest of webmasters and made every mistake imaginable. Once, I “installed” a stats tracker by putting the tracking pixel on a hidden page that nobody except me knew about—I couldn’t figure out why my hits wouldn’t rise above 1. For a while, I actually hosted the full game on my site. Lots of people appreciated the gesture—I found this out when my hosting company informed me that I had 1) blown my bandwidth quota by several times, and that 2) I would be paying them a hundred dollars for the pleasure. Oops!

Yet the Claw community seemed to surge back to life around the site. The game’s apparent deadness was an illusion; a ton of people were still hanging around: they just didn’t have anywhere to go.

There was a guy from Poland called Zuczek who had his own Polish-language Claw page. We discussed combining efforts. I’d run the English site; he’d do the Polish one. We relaunched in late 2005 as The Claw Recluse. The site still remains online, 20 years later, in 9 languages, still with my original design.

It served as a lightning rod to gather old Claw fans. I’d say it sparked a revival of the game, but I’m not sure the game had ever reached this level of popularity to begin with. You see this by downloading the full list of custom levels on The Claw Recluse and sorting by date. In the couple of years before 2005, only a few custom levels were made. Dozens upon dozens poured out afterward. It was incredible. The game was coming back to life.

Soon, Zuczek had Teophil working alongside him—he was a longtime Claw fan who’d been active in the community longer than anyone (except possibly a guy called Randy, who came and went). Eventually, Zuczek handed over the site to him. Gradually, I stepped away too.

I have not been involved in running The Claw Recluse since 2007. I moved on to other things. I still play Claw from time to time and was involved in speedrunning a number of years ago. It’s intermittent. Claw will always be something of a North Star for me. I don’t think I have to be crazily obsessed with the game anymore for that to be true.

Claw is actually more active and alive in 2025 than it was in, say, 1999. That defies every intuition I have about how gaming works. They’re supposed to be released, get played by however many people play them, and then die. But somehow, people are keeping this one damned game alive.

The lesson I learned is that a fan can easily care more than the actual creators of the thing they’re a fan of. The Beatles found this out in the 1960s (Lennon famously wrote “I Am the Walrus” to mock/troll superfans who attached profound meaning to lyrics he’d dashed off in a few minutes). Claw was abandoned by its developers and kept alive by its fan community. I am honored to be a part of that, if only for what seems like a brief moment.

Anyway, that’s it. Hopefully some of this was useful.

The “Long Boy” in Lisey’s Story

I read Stephen King’s Lisey’s Story when I was young.... | News | Coagulopath

I read Stephen King’s Lisey’s Story when I was young. I didn’t get much out of it. The incessant baby-talk (“smucking”, “bad gunky”) felt stickily tiresome, like wading through a saliva-splattered ballpit. The pacing was languid; the plot mushy and oversentimental. It felt like a personal work written for Tabitha Spruce King, with me as an outsider, unwanted and begrudged, sitting at their table and being resented for it.

I read it again now and enjoyed it more. It’s not as inaccessible as I thought. I can better see what King was trying to do. It takes themes he explored at thirty, and lets the (different, dimmer) light of seventy gleam over their cracks and hollows.

The plot basically combines Misery (one of his more successful books) with Rose Madder (one of his less successful). From Misery, we get the idea of an psychopath fan who’s obsessed with a famous writer. The twist here is that the famous writer (Scott Landon) is already dead, and the stalker’s rage and entitlement settles on his bereaved widow, Lisey. That adds an interesting dynamic. In Misery, Paul Sheldon at least had some power. He’s the only one who can write the Misery Chastain romance stories his captor loves, so she’s forced to keep him alive. Lisey Landon, on the other hand, is not her husband. She’s just a person who shared a bed with him. Through the world’s eyes, she’s a person-shaped mirror, a window to her husband. Mirrors cannot create; only reflect. They also cannot die; only be smashed. This heightens Lisey’s victimhood: her husband’s fans and enemies grow obsessed with her, but never actually regard her as a person.

From Rose Madder, he takes the idea of an magickal dreamworld that can be accessed using chintzy artifacts. The otherworldly land of Booya Moon (which Scott introduced Lisey her to while he was alive) is useful. Injuries heal swiftly. It might also be a good place to hide a dead man, or lose an unwanted living one. But it’s ultimately a dangerous place to be. This is because of what lives there: the long boy.

The long boy is one of King’s better inventions; one of his most direct forays (along with “N”) into Lovecraft-style horror.

It is not bound by the same rules as most of the things in Booya Moon. It can reach into the real world somehow (using glass surfaces and mirrors as portals). It has marked Scott as its prey, and has spent a long time searching for him. Occasionally he sees its face in glass, peering around and looking for him.

In the end Scott’s thing had come back for him, anyway—that thing he had sometimes glimpsed in mirrors and waterglasses, the thing with the vast piebald side. The long boy.

Long before we see it ourselves, we hear it, in a second-hand way. Scott knows the sound it makes, and imitates this for Lisey.

Scott says, “Listen, little Lisey. I’ll make how it sounds when it looks around.”
“Scott, no—you have to stop.”
He pays no attention. He draws in another of those screaming breaths, purses his wet red
lips in a tight O, and makes a low, incredibly nasty chuffing noise. It drives a fine spray of
blood up his clenched throat and into the sweltering air.

[..]

“I could . . . call it that way,” he whispers. “It would come. You’d be . . . rid of my . . . everlasting . . . quack.”
She understands that he means it, and for a moment (surely it is the power of his eyes) she
believes it’s true. He will make the sound again, only a little louder, and in some other world
the long boy, that lord of sleepless nights, will turn its unspeakable hungry head.

Later (or earlier, in a flashback), Scott is stranded in Booya Moon, and Lisey travels there to rescue him. Here, she briefly sees the long boy in the unflesh.

“Shhhh, Lisey,” Scott whispers. His lips are so close they tickle the cup of her ear. “For
your life and mine, now you must be still.”
It’s Scott’s long boy. She doesn’t need him to tell her. For years she has sensed its presence
at the back of her life, like something glimpsed in a mirror from the corner of the eye. Or, say,
a nasty secret hidden in the cellar. Now the secret is out. In gaps between the trees to her left,
sliding at what seems like express-train speed, is a great high river of meat. It is mostly
smooth, but in places there are dark spots or craters that might be moles or even, she supposes
(she does not want to suppose and cannot help it) skin cancers. Her mind starts to visualize
some sort of gigantic worm, then freezes. The thing over there behind those trees is no worm,
and whatever it is, it’s sentient, because she can feel it thinking. Its thoughts aren’t human,
aren’t in the least comprehensible, but there is a terrible fascination in their very alienness . . .

“A great high river of meat” is a vivid phrase. Stephen King should consider writing more words. He can be quite good at them.

But she finally sees the long boy’s face—or mouth, at least—near the end.

Then there’s movement from her right, not far from where Dooley is thrashing about and trying to haul himself upward. It is vast movement. For a moment the dark and fearsomely sad thoughts which inhabit her mind grow even sadder and darker; Lisey thinks they will either kill her or drive her insane. Then
they shift in a slightly different direction, and as they do, the thing over there just beyond the
trees also shifts. There’s the complicated sound of breaking foliage, the snapping and tearing
of trees and underbrush. Then, and suddenly, it’s there. Scott’s long boy. And she understands
that once you have seen the long boy, past and future become only dreams. Once you have
seen the long boy, there is only, oh dear Jesus, there is only a single moment of now drawn
out like an agonizing note that never ends.
What she saw was an enormous plated side like cracked snakeskin. It came bulging
through the trees, bending some and snapping others, seeming to pass right through a couple
of the biggest. That was impossible, of course, but the impression never faded. There was no
smell but there was an unpleasant sound, a chuffing, somehow gutty sound, and then its
patchwork head appeared, taller than the trees and blotting out the sky. Lisey saw an eye,
dead yet aware, black as wellwater and as wide as a sinkhole, peering through the foliage. She
saw an opening in the meat of its vast questing blunt head and intuited that the things it took
in through that vast straw of flesh did not precisely die but lived and screamed . . . lived and
screamed . . . lived and screamed.
She herself could not scream. She was incapable of any noise at all. She took two steps
backward, steps that felt weirdly calm to her. The spade, its silver bowl once more dripping
with the blood of an insane man, fell from her fingers and landed on the path. She thought, It
sees me . . . and my life will never truly be mine again. It won’t let it be mine.
For a moment it reared, a shapeless, endless thing with patches of hair growing in random
clumps from its damp and heaving slicks of flesh, its great and dully avid eye upon her. The
dying pink of the day and the waxing silver glow of moonlight lit the rest of what still lay
snakelike in the shrubbery.

At the end of the book, the long boy becomes aware of Lisey Landon. She starts seeing it peering in mirrors, uncoiling muddily at the bottoms of glasses, just as Scott did. (Emphasis mine)

“Looks a little like dried blood,” Mike said, and finished his iced tea. The sun, hazy and hot, ran across the surface of his glass, and for a moment an eye seemed to peer out of it at Lisey. When he set it down, she had to restrain an urge to snatch it and hide it behind the plastic pitcher with the other one. […] They both laughed. Lisey thought hers sounded almost as natural as his. She didn’t look at his glass. She didn’t think about the long boy that was now her long boy. She thought about nothing but the long boy.

Like the madman stalking her throughout the story, perhaps the long boy has marked her as a substitute for her husband. The man I truly want is dead and gone…but in his place, you’ll do.

I wonder where King got the idea for the long boy?

Worms as symbols of corruption and decay are too common to be worth discussing at any length. A mindworm or mindsnake is a more specific image, though.

Yes, the brain kind looks like a kind of worm, coiled around and around inside the skull, slippery and wet. Perhaps the metaphor extends further. In the 60s, it was actually believed that planarium worms could encode memories in their bodies, and transfer them to new bodies. In the late fifties, James McConnell of UMich conducted experiments that appeared to show that memory transfer via cannibalism was possible in planarian flatworms.

Chop a worm into three pieces. All three pieces will regrow into new worms, and each of those worms will have the same brain, including (supposedly) the same memories. Do worms store memories outside their brains, somehow? DNA and RNA are fairly informationally dense—the haploid genome of a human being encodes about ~720mb of uncompressed data—and other chemicals and proteins can also encode things. This, as I understand, is fairly well-accepted science.

McConnell apparently figured out something weirder. He used a painful electric jolt to train worms to contract their bodies upon exposure to light. Then, he chopped them to pieces, fed the body parts to cannibalistic worms called Dugesia dorotocephala…and they contracted their bodies to light, too! Confirmation of this has been slow in coming—this the type of science the replication crisis tragically stole from us.

(McConnell, by the way, has one of those all-timer Wikipedia pages. “McConnell was one of the targets of Theodore Kaczynski, the Unabomber. In 1985, he suffered hearing loss when a bomb, disguised as a manuscript, was opened at his house by his research assistant Nicklaus Suino.”)

King, at least, has been struck by the image of a wormlike thing that preys on the psyche and memories and trauma of its victims. Maybe there’s just something viscerally repellant about worms.

In any case, the long boy may not be a worm or a snake. The thing Lisey regards as such might just be an appendage. An adjunct to a large (perhaps vast) body, whose totality we do not see. And furthermore, that it doesn’t eat as much as capture—that its victims might still live on.

Lisey closed her own. For a moment she saw that blunt head that wasn’t a head at all but only a maw, a straw, a funnel into blackness filled with endless swirling bad-gunky. In it she still heard Jim Dooley screaming, but the sound was now thin, and mixed with other screams.

I like the long boy. It will never be as famous as Pennywise or Randall Flagg or [Insert Thinly-Veiled Metaphor for Republican Politician Here] but that’s good. The enemy of darkness is the light, and no horror creature survives too much media exposure. As the century spins on, the long boy will retain its mystery.

(Also, what happens when you chop OpenWorm into three pieces of code?)

The Fate of GPT-4o

This post is speculation + crystal balling. A change might... | News | Coagulopath

This post is speculation + crystal balling. A change might be coming.

OpenAI has spent six months rolling out updates to GPT-4o. These perform extremely well by human-preference metrics.

gpt-4o-2024-11-20, the latest endpoint, boasts a colossal 1360 ELO on Chatbot Arena, compared to the earliest GPT4-o, which scored a meager 1285. What does that mean? That blind human raters prefer gpt-4o-2024-11-20’s output 70% of the time.

I believe this is the result of aggressive human preference-hacking on OpenAI’s part, not any real advances.

Control for style, and gpt-4o-2024-11-20 drops by fifty points. It remains at #1, but only because the other preference-hacked models at the top also drop a lot.

Claude 3.5 Sonnet gains points. So do most of the older GPT-4s.

Optimizing for human preference is not a wrong thing to do, per se. So long as humans use LLMs, what they like matters. An LLM that produced output in the form of Morse code being punched into your balls would suck to use, even if it was smart.

But this is exactly why you should be careful when using Chatbot Arena to make statements about model capabilities – the top of the chart is mainly determined by style and presentation.

Benchmarks tell a different story: gpt-4o’s abilities are declining.

https://github.com/openai/simple-evals

In six months, GPT4-o’s 0-shot MMLU score has fallen from 87.2 to 85.7, which is probably similar to what GPT-4 scored on release.

(to be clear, “GPT-4” doesn’t mean “an older GPT-4o” or “GPT-4 Turbo”, but “the original broke-ass GPT-4 from March 2023, with 8k context and no tools/search/vision and Sept 2021 training data”).

I am more concerned about the collapse of GPT4-o’s score on the GPQA benchmark, which fell from 53.1 to 46.0. This is a significant drop, particularly in light of the tendency for scores to rise as data contaminates the internet. (Claude 3.5 Sonnet scores 59.4, for comparison)

Even this may be optimistic:

https://twitter.com/ArtificialAnlys/status/1859614633654616310

An independent test by Artificial Analysis (on the GPQA diamond subset) found that GPT-4o scored 39.00. They’ve downgraded the model to 71/100, or equal to GPT-4o mini (OpenAI’s free model) in capabilities.

Further benching here:

https://artificialanalysis.ai/providers/openai

Some of their findings complicate the picture I’ve just described (in particular, they have GPT4-o scoring a higher MMLU than OpenAI’s internal evals), but the bottom-line is that the new gpt-4o-2024-11-20 is the worst of its line by nearly every metric they test, except for token generation speed.

Livebench

https://livebench.ai

GPT-4o’s scores appear to be either stagnant or regressing.

gpt-4o-2024-05-13 -> 53.98
gpt-4o-2024-08-06 -> 56.03
chatgpt-4o-latest-0903 -> 54.25
gpt-4o-2024-11-20 -> 52.83

Aider Bench

https://github.com/Aider-AI/aider-swe-bench

Stagnant or regressing.

gpt-4o-2024-05-13 -> 72.9%
gpt-4o-2024-08-06 -> 71.4%
chatgpt-4o-latest-0903 -> 72.2%
gpt-4o-2024-11-20 -> 71.4%

Personal benchmarks

It doesn’t hurt to have a personal benchmark or two, relating to your own weird corner of the world. Either you’ll have a way to evaluate AIs that escapes the Goodharting suffered by large benchmarks, or OpenAI starts fine-tuning AIs on your niche use case (in which case, mission fucking accomplished.)

I like to ask LMMs to list the levels in the 1997 PC game Claw (an obscure videogame.)

Claude 3.5 Sonnet and Claude 3 Opus do great, getting about 80-90% of Claw’s levels correct.

GPT-4-0314 makes a reasonable attempt, getting about 50-75% right. Typically the first half of the game is fine, with the other levels being a mix of real and hallucinated.

(once, it listed “Wreckage” as a level in the game. That’s actually a custom level I helped make when I was 14-15. I found that weirdly moving: I’d found a shard of myself in the corpus.)

GPT-4o scores like ass: typically in the sub-50% range. It doesn’t even consistently nail how many levels are in the game. It correctly lists some levels but these are mostly out of order. It has strange fixed hallucinations. Over and over, it insists there’s a level called “Tawara Seaport”—which is a real-world port near the island of Kiribati. Not even a sensible hallucination given the context of the game.

Another prompt is “What is Ulio, in the context of Age of Empires II?”

GPT-4-0314 tells me it’s a piece of fan-made content, created by Ingo van Thiel. When I asked what year Ulio was made, it says “2002”. This is correct.

GPT-4o-2024-11-20 has no idea what I’m talking about.

To me, it looks like a lot of “deep knowledge” has vanished from the GPT-4 model. It’s now smaller and shallower and lighter, its mighty roots chipped away, its “old man strength” replaced with a cheap scaffold of (likely crappy) synthetic data.

What about creative writing? Is it better on creative writing?

Who the fuck knows. I don’t know how to measure that. Do you?

A notable attempt is EQBench, which uses Claude 3.5 as a judge to evaluate writing samples. gpt-4o-2024-11-20 is tied for first place. So that seems bullish.

https://eqbench.com/creative_writing.html

…but you’ll note that it’s tied with a 9B model, which makes me wonder about Claude 3.5 Sonnet’s judging.

https://eqbench.com/results/creative-writing-v2/gpt-4o-2024-11-20.txt

Most of these samples seem fairly mediocre to me. Uncreative, generic, packed with empty stylistic flourishes and pretentious “fine writing”.

The cockpit was a cacophony of dwindling lights and systems gasping their final breaths, a symphony of technological death rattles. Captain Elara Veyra sat in the command chair, her face illuminated by the sickly green glow of the emergency power indicator, which pulsed like the heartbeat of a dying creature. The Erebus Ascendant, once a proud envoy of humanity’s indomitable spirit, now drifted derelict and untethered in the silent abyss of interstellar void. The engines were cold, the life support systems faltering, and the ship’s AI had succumbed to cascading failures hours ago, leaving Elara alone with her thoughts, her resolve, and the unceasing hum of entropy.

A cacophony refers to sound: lights cannot form a cacaphony. How can there be an “unceasing hum” in a “silent abyss”? How does a light gasp a final breath? What is this drizzling horseshit?

This is what people who don’t read imagine good writing to be. It’s exactly what you’d expect from a model preference-hacked on the taste of people who do not have taste.

ChatGPTese is creeping back in (a problem I thought they’d fixed). “Elara”…”once a proud envoy of humanity’s indominable spirit”… “a testament to…” At least it doesn’t say “delve”.

Claude Sonnet 3.5’s own efforts feel considerably more “alive”, thoughtful, and humanlike.

https://eqbench.com/results/creative-writing-v2/claude-3-5-sonnet-20241022.txt

(Note the small details of the thermal blanket and the origami bird in “The Last Transmission”. There’s nothing really like that in GPT4-o’s stories)

So if GPT-4o is getting worse, what would that mean?

There are two options:

1) It’s unintentional. In this world, OpenAI is incompetent. They are dumpstering their model to win a leaderboard dick-measuring measuring contest against DeepMind.

2) It’s intentional. In this world, a new, better model is coming, and GPT4-o is being “right-sized” for a new position in the OA product line.

Evidence for the latter is the fact that token-generation speed has increased, which indicates they’ve actively made the model smaller.

If this is the path we’re on, I predict that GPT4-o will become a free model soon. And behind the ChatGPT Plus paywall will be something else: Orion, GPT-5, or the full o1.

Older posts