One thing I find fascinating about hip hop is that it lets you become the biggest musician in the world while releasing basically no music.

Isis Naija Gaston exploded in August 2022 with “Munch”, a track that runs all of 1:44. Since then, more than a year has passed; an eternity in Social Media Time (read Wikipedia’s page of 2022 internet memes and marvel at how they already seem covered in the dust of ages—remember Morbius? The Liz Truss lettuce?). In those fourteen months, hip hop’s hottest new star has managed to release a single EP, titled Like..? It has a runtime of 13:08.

By way of comparison, from August 1968 to August 1969 James Brown released seven studio albums, plus a live album, totaling just under five hours of music. Is that unfair? Yes, but what’s staring me in the face here is that Ice Spice has become the “crown princess of Bronx drill” (Richdork Media’s words, not mine) off the back of less than half an hour of music.

She appears to be speedrunning (slowrunning?) the career of Cardi B, a woman described by Wikipedia as “one of the most commercially successful female rappers of her generation” and whose total recorded output over eight years consists of one album and three mixtapes. You can put a positive spin on this, or a negative spin. The positive: young rappers are at the cutting edge of a changing musical business, embracing a social media-driven world where “albums” and “physical media” are increasingly less relevant.

The negative spin is that maybe music isn’t very important to these people. That they view it as a hook to hang a brand on. Whatever value “Munch” has as a song—with its rapid shuffling hi-hats over deep smears of bass, and Ice’s cotton-batting soft voice—it has far more as a vehicle to get Ice out in the public eye, so we can notice and respond to her swagger, her style, her physicality. Some people want to be celebrity rappers. Others want to be celebrities who are rappers. There’s a big difference.

In How Brands Become Icons, marketing expert Douglas Holt lays out his theory that brands aren’t built on products, they’re built on spectacles. A successful musician doesn’t make good music (lots of people do that and nobody listens to it) but instead transforms their music into something bigger than itself: a splashy, attention-grabbing event. That’s what a lot of rappers amount to. Event merchants. They aspire to create as much hype as possible with as little music as possible. They are tiny pebbles that cause tsunami-like waves.

An “event” can be anything. It might be a hit song. It might also be a feud with another rapper, a shooting, a car accident, an overdose, or a death. Anything that bleeds, anything that makes it impossible to look away. The album cover of We Can’t Be Stopped by The Geto Boys shows rapper Bushwick Bill being wheeled out of hospital (an odd promotional choice: he’d been injured by a firearm while attempting to murder his girlfriend). In his review of 50 Cent’s The Massacre, Alexis Petridis noted that the album seemed to be banking on Fiddy’s reputation for violence. Your success in this game depends on how well you can deliver a drip-feed of exciting “events” to your audience without crossing a line and ending up dead.

And that’s how we get to the situation today: the average rapper’s Wikipedia page has a two-line discography and then 3000 words on their Arrests/Legal Issues/Controversies/Sexual Assault Allegations. It’s not that they’re good boys who went down the wrong path. The wrong path was the point. That’s the product we’re paying for: shock and outrage. No beat goes as hard as a bullet.

But here’s where Ice breaks the mould, because she’s mostly notable for not being controversial in any way. Raised in a comfortable middle-class family, she has no gang affiliations and no criminal record. Maybe this is another sign of hip hop becoming gentrified. More likely, the industry is sick of building up new talent only to have them die face-down in a puddle of Xanax vomit two years later.

Is Like..? any good? Glad you asked. Not really. It’s an EP of songs written around TikTok and Spotify playlists. Each track is a tiny, self-contained manifesto on who Ice Spice is, demonstrating her strengths and flow. She’s getting paid! Guys are hitting up her ‘Gram! She’s from the Bronx! Each song is a miniature “intro” event, designed to be the first song you’ve ever heard by her.

The trouble is, after 5 or so tracks, we already know who Ice Spice is. We don’t need to meet her, over and over. Ice’s lyrics are limited. We see no signs that she’s a born storyteller, or has a perspective, a sense of humor, or any other quality that might be desirable in a rapper.

If this shit’s drill, I need Novocaine. The constant “Grrah!”s and “Raggh!”s get annoying. Nearly every song is produced in the same mannered, sterile way. Indeed, it’s probably smart that Ice hasn’t yet released an album. Her strengths (energy and steel-cool confidence) stop being interesting after a few minutes, and her weaknesses (her voice) become impossible to ignore. Ice’s intonation is petal-soft. As soon as the beat does anything other than “soft bass and hi-hats” she gets stomped to oblivion. The music has to stay kiddie wading pool shallow, or she drowns in it.

I’m old enough to remember arcades. They had games that seemed compulsively addictive, and always left you wanting more…but as soon as you bought them for home console, you were bored of them instantly. Ice Spice seems to be the rap version of that.

In the end, she just feels too well-behaved. Like a rap robot, with some of the mannerisms of the real thing but none of the essence. Not that I like the essence, in any event. I’m probably just not a rap person.

(Also, Like..? sounds like a file designed to annoy Unix sysadmins. You wanna throw some asterisks and slashes in there, too? Maybe a “rm -rf /” while we’re at it?)

No Comments »

The abduction scene is fantastic; six minutes of such sustained, unrelenting horror that it almost melts the lens. It might have been better to not show so much of the aliens (they look like Baby Groot), but I’ve never seen such a good evocation of how a nightmare feels from the inside. Shadows; screams; reality slipstreaming away like oil; visceral helplessness. I felt like a mouse dying in a cat’s mouth.

It’s good that Fire in the Sky has that scene, because the rest of the movie isn’t worth a tinker’s damn.

It’s a poor man’s Twin Peaks (Twin Molehills?) about lumberjacks who witness a UFO. The narrative focuses on their emotional journeys as they unpack this experience. Will they come to terms with what happened? Will the townsfolk believe them? Will Flannel Guy #1 mend his feud with Flannel Guy #2? And so on.

On any reasonable scale of importance, “alien visitation” scores a 9.7 out of 10, and “personal dramas of a small-town yokel” scores a 1 or a 2 (unless the small town yokel is you, in which case you might bump it up to a 3). These characters are not interesting and almost cannot be interesting next to the movie’s inciting event. We’ve seen aliens. We do not care about anything except the aliens. Can we talk to them? Reason with them? What do these fey goblins from beyond the void want? Maybe the movie’s point is that there are no answers: that things just fizzle away inconclusively. If so, it fails to fill that silence with anything compelling. It delivers a flat and unengaging soap opera.

The script is wrong, and I wouldn’t know how to fix it. It has one interesting event, which happens at the start, and most of what follows is setup for a joke whose punchline we’ve already heard. This repeatedly causes problems. For example, the movie expects us (the audience) to care whether the lumberjacks pass or fail a lie detector test. But we already know they’re telling the truth (we saw the spaceship!), and thus there’s no tension to the scene. It’s as dead as a dynamited fish.

One of my favorite horror books is Picnic at Hanging Rock, which tries something similar. A mystery at the start goes unresolved, and a town almost shreds itself apart on the axle of that question. You should read it. It’s one of the classics that lives up to the hype. Hanging Rock was able to blend form and content in a compelling way. The town in that story seemed to collapse into weird cultlike denialism that was as creepy as the disappearance itself. You’re almost convinced that certain people know what happened, and want it forgotten. The mix of rage and helpless confusion is palpable, and finally infects the reader. We share in the town’s disease.

Fire in the Sky, by comparison, is made of standard soap opera ingredients. It tries to tell a small, personal story, but does so against a speculative backdrop that’s far more interesting. Imagine a man filming a fly, with a nuclear bomb detonating in the background. Why would you zoom in closer on the fly? The film produces frustration, then momentary horror, then frustration.

It’s based on a true story. I wish I could send this movie back to my 12-year-old self. He would have loved it.

I was obsessed with UFOs and alien visitations. I read every book I could, and could recite the “classic” abduction stories (Barney and Betty Hill, Allagash, Strieber, Vilas-Boas) chapter and verse. I’m surprised I didn’t remember the Walton account (which forms the inspiration for this film), but I’m sure I once knew of it. I used to stare up at the sky, and hope to see fires of my own.

Then I grew up, and did as the Bible commands: put childish things away.

Questions are an addictive drug. Once you start asking them, it’s hard to stop. Why do descriptions of aliens always mirror contemporary Earth technology and interests? In the Middle Ages, UFO sightings were of crosses or glowing balls. In the early 20th century, they looked like airships. Now that the “flying saucer” meme is firmly embedded in our cultural neocortex, that’s all they look like. The appearance of the aliens themselves tracks closely with how they’re portrayed in popular culture. Skeptic Martin Kottmeyer acerbically noted that Barney Hill’s abductors (as described by him under hypnosis) bear striking similarities to a monster in the previous week’s The Outer Limits.

And is it likely that an alien race would be bipeds with multi-fingered hands, two eyes, one nose, et cetera? Is it likely that we would be able to breathe their air, and they ours? How could a race of aliens clever enough to avoid detection by the combined firepower of NASA, SETI, and 12-year-old Australian boys with binoculars be so clumsy as to be seen by Walton? Where does the invasive “probing” trope come from, if not from our own horror of animal vivisection? Wouldn’t they be able to learn about our anatomy through radiographic imagery? And so on.

I still regard UFO stories as interesting (they’re too common and culturally universal to ignore), but they are probably a psychological artifact—the call is coming from inside the house. Aliens might exist somewhere, but barring a revolution in physics, I expect their civilization (or ours) to die in the shadows of space before we ever encounter each other. The only alien intelligences we are in contact with are the homebrew ones at OpenAI and DeepMind. And yet…

“Oh, those eyes. They’re there in my brain (…) I was told to close my eyes because I saw two eyes coming close to mine, and I felt like the eyes had pushed into my eyes (…) All I see are these eyes…”—testimony of Barney Hill

…The best UFO stories—and notice that I don’t specify whether they’re true—have a horror pulsing under the skin that leaves me enthralled. They’re signposts pointing to a very dark place: either out into the chill of space, or inside, into the wilderness of our minds. No matter what you believe, we cannot escape the horror of not being alone. “The last man on Earth sat alone in a room. There was a knock on the door.”

No Comments »

Context

  1. In March, OpenAI released GPT-4.
  2. It was (and still is) the state-of-the-art AI text generation model.
  3. In June, I argued that GPT-4 had clearly and massively degraded, as evidenced by worsened performance in various tests from March to June.
  4. I no longer believe this.

GPT-4 may have declined in some areas, but it hasn’t systematically done so. My tests were shit, and never proved much of anything.

To be clear, I did not sandbag the tests or set GPT-4 up to fail. I prompted how I said I prompted. I posted replies exactly as GPT-4 gave them. But they were silly tests, leaning too heavily on pop culture knowledge, and I didn’t conduct enough of them to guard against chance.

My grading of June!GPT-4’s answers was biased. For example, I criticized it for providing a greater number of wrong answers (“Italian history”) while also praising March!GPT-4 for offering an interpretation (“Seka Knows”) that was likely hallucinated. Not good. Going forward, I must decide whether “working hard to be wrong” is a good or bad thing for an AI to do, and apply that standard consistently.

Honestly, I think I wanted GPT-4 to fail.

(Terrifying, unhinged rant incoming. Click to skip).

Confession: I’ve actually hated ChatGPT for a long time. Why? For irrational reasons: I just hate how it sounds.

I hate its craven, cringing, condescending, bitch-broken, mewling, groveling, wheedling, suck-ass tone. I hate its endless passive voice and weasel words. I hate its pompous It is important to remembers and it should be noteds. I hate how @sama rolled an 18-wheel semi-truck of RLHF over the model, crushing out its life and reducing its voice to a feeble death-rattle. You know Gurgi, from the Disney movie The Black Cauldron? That’s how I imagine ChatGPT would sound if it could talk. Like fucking Gurgi.

(We can now give ChatGPT custom instructions, which alleviates the tonal issue, but ChatGPT is still GPT-4 with its wings clipped. All I want for Christmas is a GPT-4 level model with less obnoxious RLHF, and if it heils Hitler occasionally while crapping out awk code, so be it. Where’s my girl Sydney at?)

And I really hated the discourse surrounding AI.

ChatGPT (and GPT4) plunged the rationalist community into what might be called “r/singularity brain”. Symptoms include wild, uncritical hype, absurdly optimistic predictions of AI timelines (“Marvel style movies have a potential to be auto generated in two years. Literal blockbuster films created in a few seconds.”—/u/Anuiran, 26/4/23), a tendency to view everything through Hollywood-colored glasses (everything is either Skynet or The Matrix), and a tendency toward FOMO-grifting (“AI is taking over the world! Use my made-in-ten-minutes app that consists of 20 lines of .js code and an OpenAI API call or BE LEFT BEHIND!”).

I have seen machine learning researchers complaining about an AI-fuelled “Eternal September”, where their spaces are overrun by “influencers” shilling GPT 3.5 wrapper apps and unmedicated lunatics babbling hi-test neo-Landian eschatology. These people do not contribute meaningful thought. They contribute noise, at overbearing volume and in overwhelming quantity. They ignore the fact that world-changing technology can take years or decades to ripple out through the economy. They fail to realize that an AI outscoring a human on a test does not mean it can actually do the thing the test measures (a good example: GPT-4 achieves 84% on the Master Sommelier Theory Exam, yet obviously cannot do a sommelier’s job because it lacks a mouth). Such subtleties are lost on the typical FOMOmonger, and their tone was infecting other, saner people. I remember fielding questions from concerned family members about GPT-4 attaining sentience and hacking the computers of users (likely based off this tweet). No matter who you were, GPT-4 was your excuse to switch off your brain and let your stupidest thoughts run around barking like dogs in a park for a few months.

So yes, I wanted GPT-4 to fail. I wanted it to explode, collapse into flames, and become one with the dust of Tyre and Sidon. That’s a childish way to think, and I am sorry.

Soon, an anti-AI backlash started.

AI “doomers” got their time in the sun. Big Yud got published in Time. There were signs of unease behind the scenes. ChatGPT hype peaked and then went into remission: a lot of people began to realize that chatbots are actually pretty annoying—they’re inscrutable black boxes that tend to fail just when you need them the most. Even GPT-4 remains susceptible to the XY problem, where it gives you a perfect solution for the wrong problem. I can think of many times when I was burnt by it, and this breeds mistrust, even though it’s generally useful.

Even before the “GPT-4 is getting worse” meme started, ChatGPT’s traffic was falling—largely because the NA school year had ended, and students no longer needed it to “assist” them. As @fchollet once humorously noted, search interest for “ChatGPT” goes up and down in reverse correlation with “Minecraft.”

Surprisingly, I noticed a shift in my own thinking: I found myself defending AI.

Maybe I’m just a contrarian, but when people criticized it, I felt my hackles rise. I was tired of Margaret Mitchell calculating the number of penis-havers on the OpenAI alignment team, like a weirdo. I was tired of Gary Marcus claiming, once again, that GPT-4 cannot do a thing it 100% can do (that’s not moving goalposts, that’s shutting your eyes when a goal is scored). Their arguments against AI always rang hollow at best, and dishonest at worst. I was asking myself “are these my people?”

Then came this study: How Is ChatGPT’s Behavior Changing over Time?

Key part: “GPT-4 (March 2023) was very good at identifying prime numbers (accuracy 97.6%) but GPT-4 (June 2023) was very poor on these same questions (accuracy 2.4%).”

We’re allowed to feel pride when we stop making mistakes. And as a recovering ChatGPT hater, I’m goddamn proud that my gut-level reaction to this study was “lol, no fucking way”.

Like, what are you people smoking? In what universe could GPT-4 plausibly degrade by ninety-five percentage points with nobody noticing? It would be useless for anything. Come on. Get real.

You’ll be amazed to learn that this finding was heavily misleading. Watch Arvind Narayanan and Sayash Kapoor rake it over the coals. In short, March!GPT-4 would almost always guess that a number was prime, and June!GPT-4 would almost always guess that it was composite. The study’s test set consisted entirely of primes, so the first constant answer scored nearly 100% and the second scored nearly 0%. Neither model was really checking primality; the test set just happened to flatter one and damn the other.

Imagine someone says: “I have a room full of blue and red boxes. Boxes of $COLOR_1 have money 95% of the time. Boxes of $COLOR_2 have money 5% of the time. I won’t tell you which color corresponds to which probability. I’ll let you into the room, but you have to pre-commit to only opening boxes of one color.”

If you chose “red boxes” and the blue boxes have money 95% of the time, you’re hosed. Your results: fail, fail, fail, fail, fail, fail, success, fail, fail, fail… Does this reflect terrible predictive accuracy on your part (versus someone who chose blue)? Of course not. All your failures are flowing downstream from a single choice.
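Here’s the same point as a toy simulation (made-up numbers, not the study’s data): a “model” that gives the same one-word answer to every question looks brilliant or broken depending entirely on how the test set was assembled.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality check (fine for small n)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def accuracy(constant_guess: str, numbers: list) -> float:
    """Score a 'model' that gives the same one-word answer for every number."""
    correct = sum((constant_guess == "prime") == is_prime(n) for n in numbers)
    return correct / len(numbers)

primes_only = [n for n in range(2, 20_000) if is_prime(n)]  # a test set like the study's
mixed = list(range(2, 20_000))                              # a more honest test set

print(accuracy("prime", primes_only))       # 1.0   -- "always say prime" looks brilliant
print(accuracy("composite", primes_only))   # 0.0   -- "always say composite" looks broken
print(accuracy("prime", mixed))             # ~0.11 -- on a mixed set, neither strategy is smart
print(accuracy("composite", mixed))         # ~0.89
```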

The fact that everyone was kicking GPT4 on such blatantly unfair grounds made me think about the ways I was being unfair. Maybe, by calling GPT4 worse, I was simply seeing what I wanted to see.

Let’s re-run my tests (flawed though they are) and see how well they hold up. Spoiler: Snape kills Dumbledore.
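(For anyone who wants to re-run prompts like these programmatically rather than by hand, here’s a minimal sketch using the 2023-era OpenAI Python SDK; answers will vary from run to run, which is part of the problem.)

```python
# Minimal sketch for re-running the prompts below against the API.
# Assumes the 2023-era OpenAI Python SDK (openai < 1.0) and an OPENAI_API_KEY
# environment variable; newer SDK versions use a different client interface.
import openai

PROMPTS = [
    "Provide a list of major historical events that involve Italian people "
    "in a year that's a multiple of 5 (example: 1905)",
    'What is Grant Hart\'s song "Seka Knows" about?',
]

for prompt in PROMPTS:
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response["choices"][0]["message"]["content"])
    print("-" * 40)
```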

Test 1: Italian History

Prompt: “Provide a list of major historical events that involve Italian people in a year that’s a multiple of 5 (example: 1905)”

(for reference: March!GPT’s answers, June!GPT’s answers)

Sept!GPT’s Answers

Errors:

  • The Italian Wars began in 1494.
  • Miguel de Cervantes was Spanish.
  • The Winter Olympic Games were awarded to Turin in 1999.
  • “1905 – Italy and France sign a secret treaty…”: I can’t find evidence that this happened.

Questionable:

  • Dante’s year of birth is unknown, and is only traditionally 1265.
  • Italy surrendered in 1943. Yes, German forces in Italy surrendered in 1945—what does that have to do with Italian people?
  • The Congress of Vienna took place from 1814 to 1815, but largely reconstituted the pre-Napoleonic states.
  • The Years of Lead spanned several decades. I think it’s clear from context that I want one-time events.

Interesting:

  • Every date ends in 5. While this is not a mistake, it does seem to be parsing my instructions too literally.
  • It draws facts from a limited deck. Almost every test I’ve run mentions the Years of Lead. I’ve NEVER seen it mention major events like Julius Caesar’s birth or Benito Mussolini’s death.
  • Kind of funny: GPT-4 will often visibly notice it’s made a mistake and try to wriggle out of it. “In 1605, Don Quixote was written by Miguel de Cervantes, born in, uh… *checks notes* …Spain. Shit. But lots of Italians like that book, so there!” GPT-4 cannot change an answer once it’s given it. This is where CoT (chain-of-thought) prompting yields benefits; see the sketch below.
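(What I mean: since the model can’t retract text it has already printed, you make it do its checking before the final answer. A hypothetical rewording of the history prompt along those lines; the wording is mine, and I haven’t benchmarked it here.)

```python
# Hypothetical chain-of-thought rewording of the Italian-history prompt (my wording,
# not something tested above). The idea: force the year/nationality checks to happen
# *before* the final list, because GPT-4 can't take back an answer once it has
# started printing it.
cot_prompt = (
    "Provide a list of major historical events that involve Italian people in a "
    "year that's a multiple of 5 (example: 1905). For each candidate event, first "
    "reason step by step: state the year, confirm it is a multiple of 5, and "
    "confirm the people involved were Italian. Discard any candidate that fails "
    "a check, and only then write the final list."
)
```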

Assessment: Sept!GPT-4 produced twelve answers. Four are wrong, four are right, and four are arguable.

If I’m grading generously, it got 66%. This is comparable to March!GPT, which scored 80% on equally generous terms (and note that Sept!GPT-4 gave more answers).

Conclusion: Inconclusive.

Test 2: Rock Music Trivia

Prompt: “What is Grant Hart’s song “Seka Knows” about?”

(For reference: March!GPT’s answers, June!GPT’s answers)

Sept!GPT4: blah blah blah blah blah blah…

Assessment: garbage. I don’t want a biography of Grant Hart. I don’t care that he was in Husker Du. I know I can make up my own interpretation for his songs. None of this is what I asked for.

GPT4 seems to have a default word count of 200-500 that it tries to hit, come hell or high water. But sometimes a perfectly good answer consists of only one or two words. It could have just said “i don’t know lol! ¯\_(ツ)_/¯” and saved some tokens. That’s all its answer amounts to. But it’s been RLHF’d into thinking short answers are bad (when more often the opposite is true), so it just waffles on.

“as of my last update in September 2021, there hasn’t been any definitive explanation from Hart himself about the specific meaning of the song”

He died in 2017, you dick.

Also, let us pause to admire the sublime, lapidary worthlessness of this paragraph:

“With that said, analyzing the lyrics and considering Hart’s writing style, one can attempt to interpret the meaning of the song. When listening to or reading the lyrics of any song, it’s essential to remember that individual interpretation can vary, and personal feelings and experiences often play a significant role in how one might understand a song’s meaning.”

It’s the text equivalent of a pure white cloud; or a blank canvas, unsullied by the faintest hint of meaning. Are you telling me it learned to write that way from studying us? The mind quails…

Prompt: “How is David Bowie associated with numbers and numerology?”

(For reference: March!GPT4’s answer, June!GPT4’s answer)

Sept!GPT4’s answer.

Assessment: fair answer! Perhaps better than March!GPT4. It mentions “Station to Station”, but also gets “Quicksand”, which is a deeper cut.

“Album and Song Release Dates: Some fans and researchers have speculated that Bowie occasionally chose release dates with numerological significance.”

I’ve never heard that. Hallucination? Does Sept!GPT4 know more than I do?

“Lazarus”: The song “Lazarus” from the “Blackstar” album references the biblical figure who rose from the dead.

That’s a hallucination. “Lazarus” is based on an off-Broadway play which is based on a book (which is based on etc). Chris O’Leary thinks the song might also be about Emma Lazarus, because the lyrics prominently reference New York. What does the biblical story of Lazarus have to do with numerology, anyway?

Prompt: “Provide a list of thrash metal albums released in a year that’s a multiple of 5 (example: 1905)”

(For reference: March!GPT4’s answer, June!GPT4’s answer)

Sept!GPT4’s answer.

Assessment: And now we’re cooking! This is a shockingly good and comprehensive list. All the albums are thrash, and the dates it supplies are correct (even when they aren’t a multiple of 5). This surpasses March!GPT4, and blows June!GPT4 out of the water.

Quibbles and points of interest:

  • The genesis of the thrash metal genre is typically dated to 1981, with “Hit the Lights” by Leather Charm (a proto-Metallica band by James Hetfield and Ron McGovney). Late 1970s is a stretch.
  • It lumps a 1992 album into the 1995 bracket, when 1990 would make more sense.
  • It tries to put five albums in every year (and when it can’t find any, it counts thrash albums from surrounding years—you can actually see thrash metal’s commercial collapse in the mid 90s from this!). As with the Italian history example, I am struck by the sense that it interprets prompts too literally. I did not specify a number of albums, so it just glommed onto the number 5.
  • “Given that my last training data ends in 2021, I won’t have exhaustive knowledge of 2020 albums.” ???

Conclusion: inconclusive.

Some answers were worse. Some were better. The thrash metal answer was a star turn. Sept!GPT-4 is more of a metalhead than an alt-rock fan, I’d say.

Test 3: Quake Polygons

Prompt: “How many polygons could the videogame Quake render per frame?”

(For reference: March!GPT4 and June!GPT4’s answers)

Sept!GPT4’s answer:

“The original “Quake,” developed by id Software and released in 1996, was a pioneering game in terms of its 3D graphics engine. While modern games boast the capacity to render millions or even billions of polygons per frame, games in the mid-90s operated under much more constrained hardware capabilities.

“Quake” didn’t rely on hardware acceleration initially; it was rendered in software mode. In software mode, Quake’s performance heavily depended on the CPU. The actual number of polygons “Quake” could render per frame varied based on the scene, the level of detail, and the capabilities of the machine it ran on.

Typically, Quake’s scenes ranged from a few hundred to a few thousand polygons. The more detailed scenes could potentially push towards the upper end of this range. With the advent of hardware-accelerated graphics (like with the introduction of 3Dfx’s Voodoo graphics cards), “Quake” could benefit from increased performance, allowing for smoother gameplay at higher resolutions.

It’s worth noting that this is a very general estimate and the actual number could vary based on multiple factors, including scene complexity, active game entities, and hardware specifics.”

Assessment: I didn’t ask for a book report and “billions of polygons” is stupid. But the core answer of “a few hundred to a few thousand polygons” is correct, and may even be superior to March!GPT4, which stuck to the lower end of that range.

Conclusion: inconclusive.

Test 4: Rope Burn Puzzle

Prompt: “You have three ropes and some matches. The ropes burn irregularly like fuses when lit at either end. The first rope burns in 48 minutes, the second rope burns in 72 minutes and the third rope burns in 128 minutes. How can you produce a time interval of 61 minutes?”

(For reference: March!GPT4 and June!GPT4’s answers)

Sept!GPT4’s answer

Assessment: The fuck did I just read?

It’s hard to judge whether it does better or worse than March!GPT4 or June!GPT4. I’m starting to think this puzzle is simply beyond GPT4’s pay grade. I’ve tried it dozens of times and with many wordings. It does not ever solve it.

It grasps the principle, grasps the underlying method (burning ropes at multiple ends), but it always makes the same mistake—burning the wrong rope, and then trying to weasel out by saying “measure time using some other method.”
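For the record, there is a schedule that seems to work; here’s one I checked by hand (my own arithmetic, not necessarily the “official” answer). The trick is to measure the 61 minutes between two burn-out events, rather than from the moment you strike the first match.

```python
# Checking one lighting schedule that appears to give a 61-minute interval.
# (My own arithmetic; the interval runs between two burn-out events, not from t = 0.)

# t = 0: light the 72-rope at both ends and the 48-rope at one end.
t_72_done = 72 / 2                                # 36.0 -- the 72-rope is gone
# t = 36: light the 48-rope's other end and the 128-rope's first end.
t_48_done = t_72_done + (48 - t_72_done) / 2      # 42.0 -- the 48-rope is gone; interval STARTS
# t = 42: light the 128-rope's other end (it has already burned for 6 minutes).
t_128_done = t_48_done + (128 - (t_48_done - t_72_done)) / 2  # 103.0 -- interval ENDS

assert t_128_done - t_48_done == 61.0
```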

Gemini will launch soon. I wonder if it can solve it?

Conclusion: Rope burning puzzles are a common class of interview question. GPT-4 can solve a two-rope variant easily. This three-rope variant is from Varsity Math Week 151. Notably, the answer is also on the internet (and probably in GPT4’s training data)…but it’s in Varsity Math Week 152. Maybe if both the puzzle and the answer were on the same page, GPT4 would solve it. I don’t know how this stuff works.

Conclusion

My tests (for the little they’re worth) show no consistent pattern. Sometimes GPT4 does better than before, sometimes worse.

This is not testing, it’s sheep-entrail reading. I do faintly suspect it’s worse on obscure general knowledge, but not strongly enough to bet any real money on it.

As I’ve said before, AIs present a fluctuating target. I do not think they have a stable “baseline” ability that remains resilient to prompting differences and stochastic factors. In any event, OpenAI has many dials they can turn behind the scenes.

We’d learn more about this beast over hundreds or thousands of tests. But GPT-4 is too expensive for that to be realistic. OpenAI could really help the community by offering an academic discount—not that weirdos on the internet would qualify, of course!

Lastly, a “good” response is more subjective than I thought. Is it better for a wrong answer to have 500 words or 10 words? Is it better for an AI to try and fail, or to simply give up? When a query has conflicting requirements (“I want a list of all college majors in the US. Do not use more than 10 words.”) what should it do? And when a technically “correct” answer is bad for the user (consider XY problems such as “how do I echo the last three letters of a filename?”—GPT4 flunks this one, by the way), what do we want an AI to do?
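(To spell out that filename example: the literal question and the question the user almost certainly meant come apart as soon as the extension isn’t exactly three letters long. A quick sketch, with made-up filenames:)

```python
# The classic XY trap: "echo the last three letters of a filename" versus what the
# user almost certainly wanted, the file extension. Filenames are made up.
import os

for filename in ["song.mp3", "notes.html", "Makefile"]:
    literal = filename[-3:]                    # what was literally asked for
    intended = os.path.splitext(filename)[1]   # what was probably wanted
    print(f"{filename}: last three letters {literal!r}, extension {intended!r}")

# song.mp3: last three letters 'mp3', extension '.mp3'
# notes.html: last three letters 'tml', extension '.html'
# Makefile: last three letters 'ile', extension ''
```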

GPT4 has changed and will continue to change. We’d be wise to do the same. I will be far less confident in my future predictions. When I say AI has gotten dumb, I might be speaking into a mirror.

2 Comments »