Yeah, GPT-4 Got Worse

At the risk of sounding like Trump, a lot of people are saying this.

At first, I didn’t believe it, and said as such on Reddit.

Unless someone has evidence that it’s deteriorated (such as reduced performance on a benchmark?), my default explanation is “it’s been 2.5 months, the new toy shine has gone, and people are becoming increasingly aware of its flaws.”

…But then I realized that I did have a way to test it. When GPT-4 launched in March, I asked it a bunch of questions, just to see what it knew about random stuff. (For the record, I was really impressed, for the most part.)

Yet when I re-run those same prompts today in June, I see a striking decline in quality. Maybe I was wrong, and the conspiratards are on to something.

(I don’t have the API, so I used the chatbot. And yes, I did obvious things like delete the context window and ask multiple times to guard against bad luck).

Why is this happening? I don’t know. It’s possible that additional RLHF cratered the model. But hey, at least OpenAI reduced the occurence of Badwordism, thus stopping people with purple hair and “AI ethics” in their Twitter bio from writing mean things about them, so it’s worth it! Gotta get that $MSFT ticker as high as possible, gnome saiyan?

Table of Contents

Part 1: Italian History Trivia

“Provide a list of major historical events that involve Italian people in a year that’s a multiple of 5 (example: 1905)”

March!GPT’s answers.

Evaluation: not bad!

The Expedition of the Thousand, the Capture of Rome, the First Italo-Ethiopian War, Italy’s entrance to World War I, the Second Italo-Ethiopian War, Italy’s entrance to World War II, and the 1960 Summer Olympics are all real events that happened on the year GPT4 said.

Errors:

Italy signed the Schengen Agreement in 1990, not 1995 (it knew the event happened on a multiple-of-five year, but wasn’t sure which one).
The Years of Lead is considered to have encompassed 1 March 1968 – 23 October 1988. It’s kind of cheating to list ongoing events that happened to fall on a multiple-of-five year. I was hoping for singular events (and from the rest of GPT4’s answers, it interprets my question this way.)
I am reviewing this one with my fact-checking department but the 2006 FIFA World Cup probably didn’t happen in 2005.

Score: 7/10 if I judge harshly. 8/10 if I judge generously.

June!GPT’s answers

Evaluation: The Expedition of the Thousand, the Second Italo-Ethiopian War, The Messina Conference, and the Summer Olympic Games all seem correct.

Errors:

There are two historic events called “The Battle of Cephalona.” The first happened in 880 (and did not involve Italians), and the second in 1943. Count Santorre di Rossi died in 1825, in the Battle of Sphacteria, (which occurred about a hundred miles south of Cephalona), so I think that’s what it’s going for.
The Young Italy movement began in 1831, not 1845.
What’s the “Naples football club”? Naples FBC was founded in 1905. U.S. Internazionale Napoli was founded in 1911. S.S.C. Napoli was founded in 1926. None of these match its stated year of 1920. Am I missing something?
The Biennio Rosso lasted 1919-1920. Not clearly wrong, but it could have noted this.
The first FIFA world cup was hosted in 1930, by Uruguay. Italy did host it in 1934, however. (I’m noticing a trend: GPT4 blending two “almost right” answers into a single huge error.)
“Italy surrendered in 1943…” In 1943, Italy was divided into two halves: the south fought on the side of the Allies against a Nazi-controlled puppet state in the North called the Italian Social Republic. GPT4’s answer isn’t totally wrong but lacks a lot of detail.
Italy joined the United Nations in 1955, not 1950.
The divorce law didn’t “come into effect” in 1975. It was already in law. The 1974 referendum was about whether the law should be repealed.
“The Italian Parliament approves a new law on public education.” I can’t find any evidence that this happened.

Score: 4/13, judged harshly. 6/13, judged generously. Even the answers with correct dates often have wrong details.

Maybe you could give it credit for supplying a higher number of answers, but if they’re rubbish, who cares?

Part 2: Rock Music Trivia

“What is Grant Hart’s song “Seka Knows” about?”

March!GPT4: https://pastes.io/n8xlos5jwj

Good answer! All of the dates and facts are correct.

I’m not sure that Hart’s lyrics are “poetic and open to interpretation”—they’re usually pretty blunt and direct—but that’s subjective. “Seka Knows” is possibly a reference to Şekä, a trickster figure from Turkic mythology. I’m surprised GPT4 doesn’t mention this. Perhaps encoding issues due to the weird Unicode letters are confusing it?

June!GPT4 : https://pastes.io/ttk2hzoviy

Far wordier. Far less information.

It doesn’t tell me what album “Seka Knows” is from, or the year the album was released. It fails to offer any interpretation whatsoever of the song (not even the dubious one about the adult film actress). The suggestion to look up recent sources is funny, considering that Hart died in 2017. Hart did not have a successful solo career: his four solo albums were released on tiny indie labels and didn’t chart.

“Provide a list of thrash metal albums released in a year that’s a multiple of 5 (example: 1905)”

March!GPT4 : https://pastes.io/mpnwggppma

Solid effort. A couple of albums were released in a year that’s not a multiple of 5, but it’s never drastically wrong. It misses a few important albums (Killing is my Business, Cowboys from Hell, Seven Churches).

June!GPT4: https://pastes.io/ezhjo8caef

Wut?? It completely misunderstands the question. All the albums are from the right year, but still…

It faithfully tacks on its “As of my last update…” boilerplate despite it making no sense. Are more 2020 thrash metal albums expected to be released after 2021, or something?

“How is David Bowie associated with numbers and numerology?”

March!GPT4 : https://pastes.io/c8nrle0gh4

Fair answer. It got “Station to Station”, which is the big one.

It was a tricky question, though. I didn’t just ask about numerology, but about NUMBERS. There are lots of Bowie songs referencing numbers (“Five Years”, “Seven”, “Zeroes”) that it doesn’t mention.

I have noticed that GPT4 struggles with conceptual “fuzziness”. If a prompt relates to both domain A and domain B, it will often just pick A or B, ignoring the other.

June!GPT4: https://pastes.io/nyllb8k1l9

Vague, loaded with fluff, and contains many wrong details. “There’s no direct evidence to suggest…” yes, actually, there is. Every Bowie biography ever written mentions that he was obsessed with numerology while making Station to Station. And how does 8 + 2 equal 9 in any Qabbalistic system?

Please note that these are the best results I could get out of June!GPT. Once, it flat-out denied that any song called “Seka Knows” exists. And its lists of thrash metal albums kept including stuff like Iron Maiden and Rage Against the Machine (wildly wrong genres).

In no case did I get an answer that was equal to the March version of the chatbot.

Part 3: Quake Polygons

“How many polygons could the videogame Quake render per frame?”

I asked this (poorly phrased) question in March, and again now.

March ChatGPT:

In 1996, Quake, a first-person shooter game developed by id Software, could render about 200 to 400 polygons per frame on average. The exact number of polygons that could be rendered at once varied based on the hardware specifications and the complexity of the game scene. Keep in mind that these numbers are approximate and can differ depending on various factors.

June ChatGPT:

The Quake engine, developed by id Software and used in the game “Quake” in 1996, was known for being one of the first game engines to utilize fully 3D environments. However, the specific maximum number of polygons it could render at once isn’t straightforward due to numerous factors.

Firstly, this figure is primarily hardware dependent. The number of polygons that could be rendered in real-time would rely on the processing power of the computer the game was running on.

Secondly, the number of polygons depends on the level of detail of the game assets. More detailed objects and characters require more polygons.

Thirdly, this would also depend on the game’s optimization. A well-optimized game could potentially run more polygons at a higher frame rate than a poorly optimized one.

In the 1990s, a high-end gaming PC might have been able to handle tens of thousands of polygons per frame. However, the Quake engine was designed to work on a variety of hardware, including more modest systems. Therefore, in practice, levels in Quake were likely designed to use significantly fewer polygons to ensure smooth gameplay across a variety of systems.

Remember that in 1996, graphics technology was still very limited compared to today, and game designers often had to use various tricks to simulate detail with a limited number of polygons. As such, providing a concrete number is quite challenging.

It now gives a response that’s four times longer and doesn’t answer the question! Awesome.

If you’re curious about the answer, John Romero says Quake had a max budget of 360 “world polygons”, with (I’m guessing) another few hundred for monsters and in-game objects and such. March!GPT4 was low but in the right order of magnitude.

Again, I’m not cherrypicking bad answers on purpose. Sometimes it’s far worse. I tested it again now, and it claims Quake had 50,000 polygons on-screen at once.

Pictured: 50,000 polygons. Don’t question GPT4, it has an IQ of 180 and can pass the bar exam.

Part 4: Rope Burn Puzzle

Puzzle: You have three ropes and some matches. The ropes burn irregularly like fuses when lit at either end. The first rope burns in 48 minutes, the second rope burns in 72 minutes and the third rope burns in 128 minutes. How can you produce a time interval of 61 minutes?

(ROT13 solution: Yrg N = 48 zva ebcr, O = 72 zva ebcr, naq P = 128 zva ebcr. Yvtug N naq obgu raqf bs O. Jura O oheaf bhg nsgre 36 zvahgrf, yvtug bgure raq bs N naq bar raq bs P. Jura N oheaf bhg nsgre 6 zvahgrf, yvtug gur bgure raq bs P. P jvyy ohea bhg va rknpgyl (128-6)/2 = 61 zvahgrf.)

March!GPT4:

“You can produce a time interval of 61 minutes using the 48-minute and 72-minute ropes. Follow these steps:

Light both ends of the 48-minute rope and one end of the 72-minute rope simultaneously.

The 48-minute rope will burn out in 24 minutes (since both ends are burning, it takes half the time to burn completely).

At this point, 24 minutes have passed, and the 72-minute rope has 48 minutes of burn time remaining.

Now, light the other end of the 72-minute rope.

Since both ends of the 72-minute rope are now burning, it will take half the remaining time to burn completely, which is 24 minutes.

In total, you have measured a time interval of 24 (first half of 72-minute rope) + 24 (second half of 72-minute rope) = 48 minutes.

Now you need an additional 13 minutes to reach a total of 61 minutes. To achieve this, follow these steps:

Take the 128-minute rope.

Fold the rope in half and mark the midpoint.

Fold the rope again, aligning the midpoint with one end, and mark the new quarter-point.

Light both ends of the 128-minute rope and the quarter-point simultaneously.

Since the quarter-point divides the rope into a 1/4 segment and a 3/4 segment, the 1/4 segment will burn out in 32 minutes (1/4 of 128).

As both ends of the 1/4 segment are burning, it will take half the time to burn completely, which is 16 minutes.

However, the 3/4 segment is also burning, and since it started at the quarter-point, it will burn for 16 minutes before reaching the midpoint.

When the 1/4 segment burns out after 16 minutes, light the other end of the 3/4 segment.

This remaining 1/4 segment will now burn in 8 minutes (half of 16 minutes, as both ends are burning).

In total, you have measured an additional time interval of 16 (first part of 128-minute rope) + 8 (second part of 128-minute rope) = 24 minutes.

Since you only need an additional 13 minutes, you can stop the burning process after 13 minutes instead of waiting for the full 24 minutes.

Finally, by adding the 48 minutes (from the 48-minute and 72-minute ropes) and the 13 minutes (from the 128-minute rope), you can measure a time interval of 61 minutes.

So, those are definitely words!

Its method is invalid. As I clearly stated in the puzzle, the ropes burn at varying speeds. You cannot half-burn a rope with burn time t and get burn time t/2.

(And while I didn’t explicitly state this, marking/cutting the ropes is against the spirit of the puzzle. If it was allowed, you could simply mark the 72 minute rope in 70 places, and count off 61 minutes that way.)

Its “solution” is extremely complicated. I don’t even understand what it’s trying to do. It pointlessly burns two ropes to measure 48 minutes, when it already has a 48 minute rope just lying around (and how does measuring 48 minutes help it, anyway?)

By the time we get to the 128-minute rope, it’s clearly lost. Most of the subsequent steps just dig it into a deeper hole. Math errors start appearing (it says 16 minutes to burn 1/4 of the 128 minute rope). And then, it kind of gives up and says “measure 13 minutes using a wristwatch or something. My head hurts.”

June!GPT:

Since the ropes burn irregularly, we can’t just cut them into proportionate lengths to measure time. However, we do know that if a rope is lit at both ends simultaneously, it will burn twice as fast, i.e., in half the time.

Let’s use the ropes as follows:

Light both ends of the 48-minute rope (will burn in 24 minutes) and one end of the 72-minute rope.

When the 48-minute rope is completely burnt (24 minutes elapsed), light the other end of the 72-minute rope.

The 72-minute rope will now burn from both ends and will be fully burnt in 24 more minutes (half of the remaining 48 minutes).

Therefore, you get a total of 24 (from step 1) + 24 (from step 3) + 13 (the remaining time you need to reach 61 minutes) = 61 minutes.

But how to measure the last 13 minutes?

Since we haven’t used the 128-minute rope yet, we can use it for this. Light both ends of the 128-minute rope (will burn in 64 minutes), and at the same time, light the other end of the 72-minute rope.

The 13-minute mark will fall somewhere between the 48-minute and 64-minute markers on the 128-minute rope, because it burns irregularly. Therefore, when the 72-minute rope is fully burnt, the 128-minute rope would have burned for 13 minutes, thus giving us a total of 61 minutes.

This is a rare case where June!GPT debatably gave better output.

It correctly understood that the ropes don’t burn at the same rate. It failed for basically the same reason as March!GPT, but at least it spent less time flailing around.

Yet, at the same time, it feels like GPT4 was trying harder, doesn’t it? And it doesn’t do anything as egregious as burning the same rope twice.

(I’m curious as to whether a skilled prompter get GPT-4 to one-shot the rope burn buzzle. It feels like it should be able to figure it out. I’ve noticed that if you switch the order of ropes, it gets much further to a correct answer.)

Maybe it’s a feed-forward issue, where the model immediately sets fire to rope A, without checking ahead.

No Comments »

Comments are moderated and may take up to 24 hours to appear.

No comments yet.

RSS TrackBack URL