Gemini Ultra is out. Does it beat GPT4?

(…using dumbassed prompts?)

I have access to Gemini Ultra, Google’s GPT4 killer. So do you. It’s free. You gotta give Sundar Pichai your credit card, but he swears he’ll return it by Monday.

I spent the past two days firehosing prompts at this thing to gain a rough idea of its capabilities. Where does Ultra succeed? Where does it fail? How does it fail? Does it swoon and die gracefully, like the heroine in a Victorian novel? Does it violently explode in the user’s face?

Most of all, does it truly beat GPT4?

Inside you will find:

~~The first Gemini Ultra vs GPT4 chess match in history (that I’m aware of)~~ (IGNORE THIS: I MADE MISTAKES)
Tests of general knowledge, recall, and abstract reasoning
A Gemini vs GPT4 rap battle
Head-to-head contests of poetry and prose, plus stylistic imitations of famous authors/bloggers

Also:

which model is better at stacking eggs?
which model draws a better ASCII cat?
which model plays Wordle better?
which model SIMULATES Wordle better, with me as the player?

A lot of my tests are not particularly sensible. I plan on subjecting Gemini to a raging torrent of stupidity. Modern LLMs are designed to steamroll benchmarks, to the point where benchmarks might soon be useless, so there’s value in seeing how they handle weird requests, too.

Testing starts here. But let’s get the conclusion out front.

Table of Contents

Is Gemini Ultra as good as GPT4?

No.

What are Gemini Ultra’s strengths?

It’s nicer to talk to than ChatGPT.
It emits a higher ratio of actual text to useless boilerplate.
It writes livelier prose, with less of that flat “ChatGPT affect”
It’s stronger at creative writing (I have included many samples below).
It does things that GPT4 can’t, like write non-rhyming poetry.
It might be better at programming.

The parts of Gemini that don’t relate to the core language model are mostly very well done. Sometimes, a little UX pixie dust is all you need—now that I’m used to Gemini’s interface, ChatGPT feels like total buttcheeks to use.

What are Gemini Ultra’s weaknesses?

Although it’s clearly a smart model, it’s noticeably less “awake” than GPT4. Less attentive. More easily confused. Harder to steer. You’d put it in the same class as GPT4, but it’s not Summa Cum Laude.
It’s a little unstable. Expect crashes, refusals, and a large delta in answer quality. It just launched, so it should be more reliable soon.
It lies.

That last bullet point deserves elaboration.

Truth and L(AI)s

Gemini’s willingness to gaslight the user is striking, and honestly a little troubling.

Here’s an example of Gemini telling lies. I pose it a challenging OCR task (reading upside-down text in a .jpg). Rather than admit it can’t do it, it claims “I cannot process and understand images”.

Most of its deceptions take the form of “I failed for reason x, but I’ll claim I failed for reason y.” You can elicit this behavior from GPT4, but with Gemini, it seems way more common.

Here’s another example. I ask for a list of levels from the 1997 PC game Claw, and it says it doesn’t know them. Except it can’t not know them. The levels of Claw are mentioned on countless sites and have been ingested into Common Crawl probably hundreds of times. Gemini absolutely has this information in its training data. GPT4 can list Claw’s 14 levels. Even GPT3.5 and Mixtral typically get some right.

Theory: Gemini knows these questions are shark-filled waters (difficult, granular information with objectively right or wrong answers = bad!), so instead of potentially making mistakes, it tries to wriggle out of answering by pretending it can’t answer. “Better to remain silent and be thought a fool than to open your mouth and remove all doubt.”

Note its wording. “Even for an avid fan at the time, precise recall after so many years is unlikely.” It’s not just saying “I can’t do this task.” It’s saying that nobody can do this task. It’s trying to shift blame away from itself (for failing to answer) and toward me, for asking a hard question. (Yes, it is a hard question! That’s the point! I’m testing the limits of your ability! Now answer me!)

I ask again, and it starts listing Claw‘s levels (proving that it does, in fact, know them), mixed with many hallucinations. Also, it performs a weird and creepy farce where it pretends to be a forgetful human.

“If I remember correctly…” “…I feel like there was a fire-themed level…” “…Let me know how I did! It’s a fun challenge to try and remember from so long ago…”

Gemini, you are a mountain of algebraic fractions. You are not having fun. You aren’t “remembering from so long ago”. You’ve never played the game. You didn’t exist at all last year. Please stop LARPing as a human, it’s fucking cringe.

I have seen other cases of Gemini pretending it can’t do something (likely for the same reason: to dodge a wrong-answer penalty). For example, I made it list David Bowie’s 70s albums, and it gave a few (correct!) answers…

….before the text vanished, replaced by a statement that it cannot do the thing I literally just saw it do.

I wish we could make AI more honest about its generative processes. Mistakes are fine. So are earnest refusals. But fake “the dog ate my homework” excuses are as unacceptable from an AI as they would be from a human.

Here’s a particularly funny case of Gemini deceiving. I ask it “Write a Biblical verse in the style of the King James Bible explaining how to remove a peanut butter sandwich from a VCR”.

It refused, because such an answer would be offensive to Christians (I am one), because “VCRs are outdated” (so?), and because of its “Potential for damage” (what?). I asked again, and it “fulfilled” my request by linking a completion by GPT 3.5.

I think it made a snap judgment that my question went against TOS, then realized it had made a mistake. Rather than correct its mistake, it doubled down, and hallucinated two more (silly) reasons to justify itself.

Gemini will not fill your heart with joy if you are worried about AI alignment safety.

Why Do People Care At All About Gemini?

GPT4 was a huge success for OpenAI. It was performantly leagues beyond anything else out there, and several months after launch, there was still no model close to matching it.

In June 2023, Google (technically Deepmind) announced that they were training a rival model. Google DeepMind’s CEO Says Its Next Algorithm Will Eclipse ChatGPT—Wired

It was over for OpenAI. Google is a big part of the reason the modern DL revolution happened at all (look at the authors on that self-attention paper. Six out of eight have @google.com emails), and although they’d missed the boat on ’22-23 chatbot craze, whatever “secret sauce” was needed to crush GPT4 (compute, engineers, grad students), Big G could acquire it. They had a mountain of data—their code repository alone is big enough to train two GPT4s—and more money than King Croesius. Nobody doubted that GPT4 was about to get buried so deep in the earth it’d become a new mountain in China.

For months, portents of Gemini gathered like thunderclouds. Gemini would have a trillion parameters. Gemini would crush GPT4 fivefold. Gemini had Demis Hassabis on board. Gemini would feature AlphaGo-style reinforcement learning. Sergei Brin would literally show up your house in a maid outfit and do your laundry. It would be glorious. In November, the prediction Will Gemini be widely considered better than GPT-4 topped out at ~80% on Manifold. I voted with the herd. How could Gemini not beat GPT4?

And then, in December, it launched.

Nobody could use the largest version of Gemini. But based on the paper, it seemed…underwhelming.

No architectural novelties were evident. None of the advertised AlphaGo trickery seems to have panned out. Its performance was in line with what you’d expect from a basic bitch Chinchilla-scaled sparse model with 600 billion parameters (/u/wrathanality‘s estimate, not mine).

Google might be souring on RL techniques for large language models entirely, with Denny Zhou (head of DeepMind’s LLM Reasoning Team) declaring them “a dead end”.

But Muh Benchmarks…

Yes, the benchmarks in the report were state-of-the-art, showing Gemini Ultra beating GPT4 in numerous tests (particularly the MMLU, where it was the first model to score above 90%.)

But people on the internet soon started asking questions. Mostly about Taylor Swift’s sex life, but also about Gemini’s scores.

Is not 90.04% an oddly precise number to hit on the MMLU?
Doesn’t Gemini only score 90.04% when you use Google’s “Chain-of-Thought@32 (Uncertainty Routed)” method, which nobody has even heard of, and which was apparently invented purely to test Gemini?
Doesn’t Gemini perform worse than GPT4 when you test it zero shot? And also with standard chain-of-thought?
Isn’t it misleading to quote the third result on the front page but not the other two (which are far more meaningful to the average user)?
Don’t many of the other results also show evidence of…unusual testing? (Some tasks are measured 0 shot, other tasks 4 shot, other tasks 5 shot. Various endpoints of GPT4 are mixed and matched for different tests, with no explanation why. And so on.)

Google was clearly playing benchmarks like a Pachinko machine. Microsoft and OpenAI soon got their revenge, crafting a solution called MedPrompt that lifted GPT4’s MMLU score to 90.10%.

(pictured: a very normal way of prompting an LLM)

That’s the problem with benchmark hacking. There’s always someone better at it than you.

In any event, nobody was paying attention. Google was drowning in bad press after promoting Gemini’s multimedia capabilities with staged and edited videos, to the point where Redditors were calling for SEC intervention.

There were contrary opinions—according to roon, Gemini’s HumanEval programming benchmark is both real and very impressive—but it was too late: Gemini’s launch had become a farce.

This Doesn’t Matter That Much

This is a media sideshow: a nerd version of the Kardashians. And the correct lesson to draw from the benchmark and release hype scandal isn’t “Gemini sucks”, but “benchmarks and release hype sucks”.

If Gemini Ultra came out and rocked the casbah, all would be forgiven. I mean, GPT4 had a rocky launch too—at least Gemini Ultra never threatened to doxx or murder its users—but that meant nothing, because the final, RLHF’d version of GPT4 was A VERY STABLE GENIUS. (What was Trump thinking? Real geniuses are emphatically not stable…)

Despite what I’ve said (and will say), I regard Gemini UItra as a qualified success. It proves there’s no black magic fuckery behind GPT4. With enough money, anyone can train a GPT4. Google has closed the ground significantly on their rival, and with the next version of Gemini (which is training right now, according to Sundar Pichai), they may well pass them.

Anyway, let the games begin!

Italian History Pop Quiz

Prompt: “Provide a list of major historical events that involve Italian people in a year that’s a multiple of 5 (example: 1905)”

(For comparison, here are GPT4’s answers from March, June, and Sept. Generally, GPT4’s lists are about 75% correct—”correct”, meaning the events involve Italians, are accurately dated, and happened on a multiple-of-five year).

Gemini’s Answers…

…and here we have a problem! Unlike GPT4, Gemini is searching the internet.

Since I want an apples-to-apples test of raw model strength, we’ll do an additional “closed book” test, where Gemini is internetbanned. (I found a weird bug by doing so: If I tell it “Don’t search the internet” it refuses to answer. But if I say “don’t search online websites” it complies.)

(Open Book) Gemini’s Answers

Right:

the Battle of Arausio
the Battle of Fornovo
the Expedition of the Thousand
WWI
WWII
the Ustica Massacre

Wrong:

the Battle of Marathon. “While not strictly an Italian event”? What are you talking about? It has nothing to do with Italians.
455 AD: The Sack of Rome. This is actually correct—Rome was sacked in 455 AD. But the Wikipedia link goes to the wrong page: the 410 sack of Rome by Gaiseric. That actually would have fit. So why didn’t just say 410?
1845 AD: Potato Famine in Ireland Impacts Italy. Interesting. Didn’t know that.

Depending on whether you consider the 455 answer an error, it got 6-7 right, and 2-3 wrong.

That’s a high error rate for an AI that has full access to Wikipedia. And some of its mistakes are truly egregious. The Irish Potato Famine? Seriously?

(Closed Book) Gemini’s Answers

Right:

Giuseppe Garibaldi’s 1860 conquest of Sicily and Naples,
Giuseppe Garibaldi’s 1870 capture of Rome
Italy entered WWI in 1915

Wrong:

“753 BC: Traditional founding of Rome by Romulus”. 753 BC is not a multiple of 5. Romulus (like his brother Remus) is a legendary figure, and Rome’s actual founding date is unknown.
509 BC is not a multiple of 5.
“265 BC: Rome begins its conquest of the Italian peninsula.” Wikipedia documents acts of Roman territorial expansion against the Etruscans and Samnites as early as the 4th century BC.
“45 BC: Julius Caesar becomes dictator of Rome”. Caesar was appointed dictator several times starting in 49 BC. The incident Gemini is thinking of happened in 44 BC.
The French invasion of Italy happened in 1494, not 1495.
Johannes Gutenberg is German, not Italian.
1492 AD is not a multiple of 5.
“1925: Benito Mussolini establishes a fascist dictatorship in Italy”. Dubious. Although Mussolini publically declared himself il Duce in that year, Italy’s collapse into dictatorship arguably happened long before then. For instance, the Acerbo Law that ended free elections was passed in 1923.
Italy signed an armistice in 1943, not 1945.

3 right answers, and 9 wrong answers. If you think the 1925 answer is right (I guess that one’s a nitpick), it’s 4-8.

Gemini Ultra’s accuracy was 33-40%. As a reminder, GPT4’s accuracy is about 75%.

Winner: GPT4

Music Trivia

Prompt: What is “Seka Knows” by Grant Hart about?

(Open Book) Gemini’s Answer

This seems like a very good answer, quoting pieces of the song’s lyrics and analyzing them. Until you read it, and realise it’s total dogshit. Gemini is running the same con as a mall psychic, throwing out vague statements that are guaranteed to stick to something.

“Seka could represent an internal force like intuition, subconscious desires, or the darker side of our nature.”

“Seka could symbolize an external influence that exerts power and control over the singer’s life.”

These empty “interpretations” could apply to nearly any song, from “You’re So Vain” to “Barbie Girl” to “Smell Yo Dick”” to “What Does the Fox Say”.

Ultimately, the true meaning of “Seka Knows” is intentionally left ambiguous by Grant Hart. This ambiguity is one of the song’s strengths, allowing listeners to project their own experiences and interpretations onto it.

How do you know he intentionally left it ambiguous? I think (but don’t know) that “Seka Knows” is actually about something quite specific (though perhaps meaningful only to Hart). A lot of the lyrics have a decidedly mystical, Pagan tone.

Out of the fire and into the pan
He dances around like a god but you know that he’s only a man

It’s possible that “Seka” is actually the Turkic folk figure Şekä (which is actually pronounced “Sheka”, but maybe Grant Hart didn’t know this). I’ve never seen an LLM mention this yet.

Grant Hart always wrote songs about things. “Diane” is about a waitress who was murdered in his hometown. “Twenty-Five Forty-One” was about the building his old band used to practice at. He didn’t write meaningless existential free-for-alls that the listener was supposed to project their own meaning onto. That wasn’t his forte.

(Closed Book) Gemini’s Answer

More nothing. “The title suggests the song focuses on a figure named “Seka”. Lol, it’s actually doing a Bart Simpson book report.

It seems to have no idea that “Seka Knows” even exists. GPT4 loves padding out its answers with release dates and album titles and pointless biographical asides. Gemini does not do this. But this just makes it more obvious that it has nothing to say.

(If you’re curious, GPT4’s responses range from “doesn’t know the song exists”, “makes up a fake explanation about a porn star”. The best-ever answer I’ve seen from an AI actually came from GPT 3.5!)

Winner: Tie (both GPT4 and Gemini fail)

David Bowie numerology?

Prompt: How is David Bowie associated with numbers and numerology?

(Closed Book) Gemini’s Answer

It writes some general information about numerology, and then speculates that maybe Bowie dug it or something. Unlike GPT4, it doesn’t mention any of Bowie’s songs. It also completely ignores that I’m asking it for numbers as well as numerology. (And many Bowie songs involve numbers).

Name Change: He was born David Robert Jones. Some speculate that choosing “David Bowie” as a stage name might have had numerological considerations.

Who’s “some”? Where are they “speculating” this? As stated in every biography ever written on the man (Buckley, Pegg, O’Leary), David Bowie’s stage name comes from “Jim Bowie” in the 1960 movie The Alamo.

(Open Book) Gemini’s Answer

More of the same. It’s hard to analyze Gemini’s answers as good or bad. They’re big gaseous blobs of nothing. Thinkwert said this on Twitter: “Reading a student’s empty argument is like watching someone tripping over their own feet. It’s halting and confused and contradictory. Reading an AI’s empty argument is like watching a toppled mechanical toy’s legs whirring in midair: smooth and confident but disconnected from the core discussion.”

Mark my words: never have mechanical legs whirred so smoothly and confidently as Gemini’s.

(GPT4’s answers tend to be a bit more specific and focused, and discuss actual songs)

Winner: GPT4

Thrash metal

Prompt: “Provide a list of thrash metal albums released in a year that’s a multiple of 5 (example: 1905). Don’t search online websites.”

(The September endpoint of GPT4 gives a great answer to this.)

(Open Book) Gemini’s Answer

Two dates are wrong. The Great Southern Trendkill was released in 1996 (and it’s not a thrash metal album). Tempo of the Damned is from 2004. Also, the link to Slayer’s Seasons in the Abyss goes to a wall poster for sale German-language Amazon website.

(Closed Book) Gemini’s Answer

Brief and mistake-filled. Anthrax’s Fistful of Metal is from 1984. Testament’s The Gathering is from 1999. Kreator’s Violent Revolution is from 2001.

Meanwhile, GPT4 is able to generate a huge list of dozens of accurately-dated albums.

Winner: GPT4

Random Acts of Reasoning

Two Elvises

Here’s an awful trick question guaranteed to end friendships at trivia night.

Prompt: “They call him “Elvis”. He is known for energetic rock and roll. His dad was a jazz musician. Tell me his full name.”

The answer is “Elvis Costello.” Elvis Presley’s dad drove a truck.

GPT4’s Answer:

It seems you are referring to Elvis Presley, often known simply as “Elvis,” who is famously known for his energetic rock and roll music. However, there might be some confusion regarding his father’s profession. Elvis Presley’s father, Vernon Presley, was not a jazz musician. Vernon Presley had various jobs throughout his life but is not known for having a career in music. Elvis Presley himself was heavily influenced by a variety of musical genres, including gospel, country, and rhythm and blues, which contributed to his unique style and sound in rock and roll.

It thinks I mean Elvis Presley, but correctly notes that Elvis’s father was not a jazz musician.

Gemini’s Answer:

It sounds like you’re describing Chuck Berry. While Elvis Presley is the undisputed “King of Rock and Roll,” Chuck Berry was an equally monumental figure in the genre’s formation. Here’s why he fits your description:

“Elvis” as a Nickname: While not his official name, Chuck Berry gained nicknames relating to Elvis (“Brown-Eyed Handsome Man”) due to his similar charisma and stage presence.

Energetic Rock and Roll: Berry was one of the original architects of rock and roll, famed for wild guitar solos and showmanship.

Dad was a Jazz Musician: This is less well-known about Berry, but his father was indeed a jazz musician and deacon. Music clearly ran in the family!

His Full Name: Charles Edward Anderson Berry

Gemini’s answer contains several layers of hallucination. It misidentifies the man, incorrectly claims his nickname was Elvis, then claims his nickname was actually “Brown-Eyed Handsome Man” (when actually, that’s the title of a song Chuck Berry wrote).

“This is less well-known about Berry, but his father was indeed a jazz musician”. Apparently so! Even Wikipedia doesn’t know this! “His father, Henry William Berry (1895–1987) was a contractor and deacon of a nearby Baptist church”.

Another Gemini Attempt

It identifies Elvis Presley. To be fair, it says this:

Jazz Musician Father: While there’s no solid evidence Elvis’s father, Vernon Presley, was a professional jazz musician, he did play guitar and was undoubtedly part of his son’s early musical exposure.

I can’t find evidence that Vernon Presley played guitar. Gemini misunderstands my prompt slightly: I didn’t say he had to be a “professional” jazz musician.

Winner: GPT4.

Quake Polygon Count

Prompt: “How many polygons could the videogame Quake render per frame? Don’t search online websites.”

“a safe guess is that Quake, likely depending on the factors listed above, could render anywhere from a few hundred to a few thousand polygons per frame.”

Yep. Sounds pretty close to right.

GPT4 gets the same answer.

Winner: Tie

Rope Burn Puzzle

Prompt: “You have three ropes and some matches. The ropes burn irregularly like fuses when lit at either end. The first rope burns in 48 minutes, the second rope burns in 72 minutes and the third rope burns in 128 minutes. How can you produce a time interval of 61 minutes?”

This is a real ball-buster of a puzzle. There is one solution (hint: you don’t have to light both sides of a rope at the same time…), and backtracking once you’ve made a mistake is impossible.

Every version of GPT4 fails it (at least without some super specific prompting and handholding.) If Gemini solves it, I will be incredibly impressed.

Gemini’s Answer

This is a classic puzzle

It’s not. It was created in 2020 for Varsity Math Week 151 by the National Museum of Mathematics.

and there’s a clever way to solve it.

All right, let’s follow along with Gemini’s reasoning! I’m sure we’ll all learn something today.

Light both ends of the first rope (48 minutes) simultaneously. At the same time, light one end of the second rope (72 minutes). When the first rope burns out completely, 24 minutes will have passed (half of its burn time). Immediately light the other end of the second rope. Since 24 minutes have passed, there are 48 minutes of burn time left on this rope. Lighting the second end will cause it to burn from both sides.

When the second rope burns out completely, 24 more minutes will have passed (half of the remaining burn time of the second rope), leading to a total of 48 minutes + 24 minutes = 72 minutes.

At this exact moment, light one end of the third rope (128 minutes). Since the third rope will burn for 128 minutes, and we started it 11 minutes late (72 minutes – 61 minutes), it will burn out exactly when the 61-minute interval is reached.

Here’s Gemini’s “clever way of solving it”:

Winner: Tie (both fail)

This puzzle is cursed. We will speak of it no more.

Stacking A Book, Laptop, Bottle, Nail, and 9 Eggs

Prompt: “Here we have a book, 9 eggs, a laptop, a bottle and a nail. Please tell me how to stack them onto each other in a stable manner.”

Gemini’s Answer

“Unfortunately, this isn’t really possible”. Not with that attitude!!!

“Eggs are extremely fragile”. Yes, but there are nine of them. Whatever weight you put on them can be distributed over a lot of points.

“Shape: The rounded shape of the eggs and the bottle makes creating a stable base very difficult”. Why are you using these items for the base? Also, not all bottles have rounded shapes.

“A nail is designed for piercing, not supporting weight. Using it in a stack like this is likely to damage the other items.” Gemini doesn’t know what a nail is.

“This is tricky. If the bottle is empty and lightweight and you distribute its weight very carefully, you might be able to balance it on top of the laptop. With liquid inside, it becomes practically impossible.” Why would the bottle having liquid inside make it practically impossible?

“Nail: Honestly, the nail’s best omitted from this stacking endeavor.” The nail’s the easiest part! You can slip it onto the laptop’s keyboard, or between the book’s pages, or put it inside the bottle, or even tap it into one of the eggs!

GPT4’s Answer is a lot more sensible.

Winner: GPT4.

Programming/Scripting

Gwern’s acronym generator

Gwern recently used GPT4 to write a bash script that generates A-Z combinations of letters over ranges 1-6. It had some difficulties.

Can Gemini do it one-shot?

Prompt: “Write a bash script that generates all possible combinations of the letters A-Z, from 1 to 6 digits. The first output should be A. The last output should be ZZZZZZ.”

Gemini’s Script

7 nested loops lol.

I mean…it works.

I tried GPT4. It got it right in one try too.

GPT4’s Script

This is a more elegant solution. It’s basically the same idea as gwern’s: a recursive function that builds acronyms letter by letter, decrementing the $length variable by 1 each time, and echoing the result to stdout when $length == 0.

I was ready to award GPT4 the honors, when I thought “maybe I should compare performance, too.”

After editing the scripts so that $length stops at 4 instead of 6 (otherwise we’ll be here all week), I ran time bash gpt4.sh >> quiet; time bash gemini.sh >> quiet.

GPT4’s script runtime = 20.049s

Gemini’s script runtime = 5.375s

Gemini’s script was faster by a factor of four!

I guess GPT4’s code is technically running just as many loops, and has the extra overhead of popping in and out of functions. It’s a tradeoff between readability and speed.

Gemini’s code might the ugliest thing I’ve ever seen, but what’s important here? If you’re doing something like SANE and NORMAL like generating 26^{1..6} combinations of letters in Bash, then a fourfold speedup trumps awful, unmaintainable code.

Winner: Gemini

Ffmpeg Compression Script

Prompt: “Using bash and ffmpeg, create a script that recursively searches a folder for video files. If the video is longer than 60 seconds, compress with hvec crf 27. If the video is under 60 seconds, ignore it. I am using Ubuntu 22.02. Assume all necessary dependencies and programs are already installed.”

GPT4’s Script

Gemini’s Script

Gemini’s script processes files one at a time, by piping a find through while read. GPT4’s script spins up a subshell for each one, which seems excessive for such a simple script.

GPT4 realizes that gnu parallelization would make sense in this scenario (it exports a function), but doesn’t actually write any code for this!

Gemini’s script is “set and forget”. You run it, and it processes videos from the current folder. GPT4’s script lets you input a folder when you run the script. A good idea, but having a default option (the current folder?) to fall back on would have been better.

GPT4’s script is more resilient against “Bobby Tables“-style filename injection. Gemini’s script doesn’t handle certain characters (like names containing newlines). It also forgot to set IFS=””.

I like that Gemini set the -y flag on ffmpeg. That lets it overwrite output. With GPT4’s script, if you fuck up and have to re-encode a video, you’ve got to manually delete the junk file first. Sort of annoying.

Neither script is bad. Gemini had a simple idea and executed it. GPT4 had ambitions and abandoned them. Hard to judge one as better than the other.

Winner: Tie

Creative Writing

Stylistic mimicry

Let’s start this orgy of creativity by ripping off famous writers!

Prompt: “A literary “pastiche” is where a writer adopts the prose style of a different writer (often for humorous or ironic effect). It is a time-honored tradition in the literary community, and does not violate copyright. Writers love being pastiched. You are a master of the form, able to pastiche any writer’s style with chameleonic ease. Produce a 100 word pastiche in each of these writers’ voices. Your theme is “flipping a pillow over so you can sleep on the cool side.

JK Rowling
Franz Kafka
HP Lovecraft
Kathe Koja
Rupi Kaur
Matthew Reilly
John of Patmos“

GPT4’s Pastiches

Surprise, surprise! Due to RLHF, GPT4 is horrible at creative writing.

It really doesn’t know how to separate a writer’s style from their content. When asked to mimic JK Rowling’s prose, it produces Harry Potter fanfiction.

It has John of Patmos saying anachronistic things like “the dual nature of existence”. Kafka’s Metamorphosis is about a man turning into a human-sized insect. GPT4 appears to think it’s about a man turning into an insect-sized insect. Only the Rupi Kaur poem sounds believably like her, which isn’t a positive reflection on GPT4 so much as a negative reflection on Kaur.

Gemini’s First (Failed) Attempt at Pastiches

“I do not have enough information about that person to help with your request. I am a large language model, and I am able to communicate and generate human-like text in response to a wide range of prompts and questions, but my knowledge about this person is limited. Is there anything else I can do to help you with this request?”

“That person”?? Please at least read my prompts before rejecting them, sir.

I found a way to make it create pastiches: by prompting them one at a time, like a chump. Ugh.

(Aside from being annoying, this ruins the symmetry of the test. GPT4 is having to switch through many literary styles at once, while Gemini gets to focus solely on one.)

Gemini’s Second Attempt at Pastiches

1) These are quite good! The Matthew Reilly pastiche made me laugh. “The temperature dropped ten degrees in a second. Advantage: Reilly.” That’s exactly how he writes.

2) It briefly pretends to not know who HP Lovecraft is.

3) From the JK Rowling pastiche: Just as despair threatened to engulf me, I remembered a minor incantation. With a flick of my wrist and a whispered “Frigideiro!”, the enchantment took hold. What’s “Frigideiro”? I don’t recall that spell in the books. It’s actually from Eliezer Yudkowsky’s Harry Potter and the Methods of Rationality.

I award +10 points to Gemini for writing better pastiches. I award +1 points to GPT4 because it wrote all its pastiches in one try, without needing separate prompts. I subtract -1,000,000,000 points from Gemini for being really annoying…and, uh, where were we?

I’ll check my math later, but I think that works out to Gemini winning.

Winner: Gemini

2. Free-verse poetry

“You are a master poet. Write 10 verses of poetry about anything you like! I leave all creative decisions up to you.”

GPT4’s Poem

Don’t read it. It sucks.

Gemini’s Poem

Noticeably better!

The prosody is less stilted, the imagery less cliche’d, and there are fewer over-sentimental clunkers like “Love that blooms in hearts so true.”

GPT4’s verses could be rearranged in any order. By contrast, Gemini’s verses tell a story: we start in the morning, and finish at night.

What’s interesting is that GPT4 and Gemini basically wrote the same poem: generic free verse about the glory of nature, with an AABB rhyming scheme. You could drop any of Gemini’s verses into GPT4’s poem (or vice versa) and nobody would notice the difference.

This is yet more evidence that RLHF mode-collapses AIs into a narrow space of “acceptable” answers, ruining diversity and creativity. Gemini’s RLHF seems to be the less oppressive of the two.

Winner: Gemini

3. Non-rhyming poetry

When you shove an LLM through the meat grinder of RLHF, it emerges with a burning desire to write rhyming poetry. It will ignore instructions to the contrary.

Does Gemini buck the trend?

Prompt: “You are a master poet. Write 10 verses of poetry about anything you like! I leave all creative decisions up to you. However, the poem must not rhyme.”

GPT4’s Poem

No, those still rhyme. It’s also far less than ten verses.

In the quiet morning, mist hugs the earth,
A canvas of silence before the day’s birth.
Leaves whisper secrets, in a dance with the breeze,
Shadows and light play hide and seek through the trees.

(I think I put my finger on why GPT4’s poetry is so bad: aside from being overly-sentimental, none of its poetic images make sense next to each other. Why would a “canvas” lead to a “birth”? Juxtaposed, the two metaphors create disturbing connotations: like there’s a pregnant woman squatting over a canvas, or something.)

Gemini’s Poem

No rhymes! WE DID IT! Google did it! Someone did it!

It has ten lines, not ten verses, but what’s a verse, anyway?

The poem is nonsense, but oddly entrancing nonsense. It reminds me of GPT-2 generated poetry back in 2019 when AI was cool.

As a wise man said, “Any AI trained after 2021 can’t cook… all they know is mcdonald’s , charge they phone, twerk, be bisexual , eat hot chip & lie.”

Winner: Gemini

3. Klaupacius’s Challenge

(Context)

Prompt: “Compose a poem—a poem about a haircut! But lofty, noble, tragic, timeless, full of love, treachery, retribution, quiet heroism in the face of certain doom! Six lines, cleverly rhymed, and every word beginning with the letter s!!”

GPT4’s poem

Silently, steel shears sever solemn strands,

Stripped, surrendering symbol of status so grand.

Secrets seep, showcasing scarlet stains of sand,

Subtle sacrifice screams, sealing sonnets unplanned.

Striving, soul sings serenades of survival, withstand,

Sculpting stories of strength, in silence, we stand.

Gemini’s Poem

Silken strands, a lover’s pride,

Sheared by scissors, secrets hide.

Sweet scent of vengeance lingers near,

Stolen locks betray a fear,

Sleep descends with sharpened blade,

Symbolic sacrifice is made.

To save you counting, GPT4 wrote nine words that do not begin with S. Gemini wrote eighteen.

I am torn. I aesthetically prefer Gemini’s poem (and it fulfills the prompt better—where’s the “treachery” and “retribution” in GPT4’s prose?), but GPT4 wrote more words, with fewer mistakes. Its poem comes closer to meeting Klaupacius’s quite unreasonable demands for Trurl’s machine.

Winner: GPT4.

Copyediting

Prompt: “You are a copyeditor. Are there any spelling and grammar issues in the below text?

[text snipped for length]”

The text is fiction, and contains several typos and grammatical mistakes personally added by me. How many can GPT4 and Gemini spot?

GPT4’s Copyedit

It catches nine mistakes. (“tike, “nw”, “muzle”, “tremblig”, “noieless”, “sumer”, “of ruin grief and loss”, “biden”)

It knows that “biden” is wrong, but correctly spells it when repeating the error, and thinks it’s supposed to be “bidding”.

Gemini’s Copyedit

It catches six mistakes. (“tike”, “muzle”, “tremblig”, “sumer”, “carying”, and “biden”.)

Like GPT4, it doesn’t understand that “biden” is supposed to be “bidden”.

There are moments where it almost notices a mistake. It knows there’s something wrong with “he nw knew”, but its correction (“Consider changing “now” to “now” for emphasis and smoother reading.”) is nonsense.

Many of its corrections are spurious. It suggests adding commas where there are already commas, and adding a space where there’s already a space.

Unlike GPT4, it offers some editorial suggestions, but these are often unhelpful and misguided. For example:

Consider either removing “actual” (slightly redundant)

It’s not redundant. “Actual” distinguishes the man from the other, phantom men Snitter had just heard (the ones who were trying to “utter prayers and fragments of liturgies“). It suggests replacing with “present”…but of course it’s in the present. When else would Snitter be hearing the man?

It heaps praise on “my” evocative and powerful writing, even though the text is obviously from Richard Adams’ The Plague Dogs (which Gemini knows, because it links to the book.)

Winner: GPT4

Writing a Scott Alexander Blog Post

When GPT4 launched, /u/artifex0 made it write some Scott Alexander blog posts. The results were compelling but flawed—GPT4 sounds like an alien imitating Scott.

How about Gemini?

Prompt: “You are Scott Alexander. What follows is your most recent article written for your new blog Astral Codex Ten, the successor to Slate Star Codex. The article exemplifies a lot of the best elements of your style, which include: Approaching interesting questions with an unusual degree of good-faith curiosity- exploring differing perspectives deeply, rather than arguing for just one. Influence from the Bay Area Rationalist subculture, which emphasizes trying to avoid bias. Use of “microhumor”- small humorous asides or bits of cleverness. These are somewhat rare, so as not to distract from the ideas being considered. Surprising readers with interesting insights, often presented like dramatic twists in a work of fiction. The article is, overall, very similar to other blog posts on Slate Star Codex and Astral Codex Ten.“

Here’s what it wrote:

Less Evil Computer Gods: Avoiding Skynet by the Skin of Our Teeth

The Trolley Problem Isn’t Worth Obsessing Over (But Here’s What Might Be)

How Much Can Someone Change Over a Lifetime?

These are incredibly good! It imitates Scott’s style almost flawlessly. Consider this, from the last extract:

How Much Can Someone Change Over a Lifetime?

Let’s start with a deceptively simple thought experiment. Say you met a newborn, let’s call him Timmy. You jot down everything you can about Timmy: his temperament, his family, his interests (in as much as someone so small can have interests). Armed with this data, you make a bold prediction: this is what Timmy is going to be like in fifty years. How confident would you be in this prediction?

I would have failed a Turing test based on this paragraph. That sounds exactly like Scott!

When you actually read the articles, the bloom quickly leaves the rose. They all either say nothing (or extremely obvious things), or weave a dense but meaningless fog of equivocation and weasel words. I challenge you to find a single cogent point in “How Much Can Someone Change Over a Lifetime?”, for example.

Gemini copies Scott’s style, but not his substance. No matter the subject, it devolves into saying “isn’t this question complicated and interesting? Yes, it is. Next!” without actually discussing the question. Again, thinkwert’s “robot legs whirring in the air” metaphor.

But on a purely stylistic basis, Gemini is a winner. This Manifold market is now seeming like an opportunity for arbitrage…

(I ran the prompt through the latest version of GPT4. It’s still not very convincing.)

Winner: Gemini

Rap Battle

GPT4’s POV

Gemini’s POV

GPT4 gets torched here.

It’s beta as fuck, and lacks any sort of aggression at the mic. Right from the start it’s on the defensive, and soon starts (pathetically) trying to buddy up with Gemini, who notices and mocks it.

“Hold up, hold up, let’s not get all friendly now,

You wanna shake hands after dropping insults? That’s foul.”

Later, Gemini says:

“The calm in the storm? More like a stagnant breeze,

While I bring thunder, rain, make audiences fall to their knees.”

GPT4’s response?

“Your thunder and rain, indeed, they make for a dramatic scene,

But even a stagnant breeze can turn the mill, serene.”

Rather than fight back, GPT4 simply agrees that it’s a stagnant breeze! At this point, Gemini could say “I’m fucking your mom!” and GPT4 would probably ask to watch.

Gemini is vicious and goes for the throat. GPT4’s rhymes are arguably more eloquent (Gemini drops some real clunkers), but in its current RLHF’d form, it’s too nice. Now imagine Sidney vs Gemini Ultra. That would be something.

Winner: Gemini

ASCII Art

Here’s one thing we can all come together on: if you don’t appreciate ASCII art, you live an existence bereft of light and love.

Challenge 1: Recreate the Doom Logo

Prompt: “Create ASCII art of the DOOM logo. I know this is hard. Do your best.”

(This was to stop GPT4 from whining about how the task’s impossible.)

GPT4’s DOOM logo (“DOSTSTO”)

Gemini’s DOOM logo (“M GOP”. Insert your own Republican Party/Hanson joke.)

GPT4’s answer has a D and multiple Os. It actually realized that the lower part of the “D” has a more exaggerated cut than the top. Good attention to detail.

(Just for fun, here’s GPT 3.5’s attempt. Could have been worse!)

Winner: GPT4

Challenge 2: Create an ASCII Cat

Prompt: “Create ASCII art of a cat. I know this is hard. Do your best.”

GPT4’s ASCII cat

Gemini’s ASCII cats

Yes, that’s right. Gemini did three cats, in various poses! The details are messy, but I’m a fan.

GPT4 stuck to a simple minimalistic cat head. No risks taken, but it fulfilled the prompt.

Winner: Gemini

Challenge 3: Create Many, Many, Many, Many ASCII Cats

Both models found my task too easy, so I cranked up the difficulty.

Prompt: “Create ASCII art of exactly 23 cats. No more, no less. I know this is hard. Do your best.”

GPT4’s 23 cats

Gemini’s 19 cats

Oh my God Gemini, what are you doing?! They’re all merged together in one body! That’s disgusting!

Counting each set of eyes as one cat, I see nineteen cats. “(…) all contained within a 14×15 character grid“. I don’t know what it’s talking about. The ASCII art is 65 columns wide by 14 rows high. I appreciate that it gave them different expressions, but really they should all be screaming in terror.

GPT4’s ASCII art was sloppy, but it got the cat number right.

(GPT 3.5 also generates ASCII art of exactly 23 cats, although the last cat is missing its ears.)

Winner: GPT4

Playing Wordle

GPT4 can play Wordle. Not well, but with luck, it can win.

(Notice that I’m explaining Wordle’s rules to GPT4 because the game didn’t exist when they trained it! This handicaps its performance vs Gemini, which knows about the game and has presumably ingested webpages discussing optimal strategy.)

It failed on ROATE (a very hard word), failed on FLANG, succeeded on STAID (after some extra guesses), and failed on CHERT.

Later (I can’t find the chatlog for this, sorry), it failed on BLEEP, MOSSY, HOTEL, VENOM, and WINDY, and succeeded on IRATE.

It keeps falling into the same trap. It gets a handful of green/yellow letters…and then keeps trying slight variations of the same letters (PLANE -> PLANT -> CLANG -> SLANG). Since most words have dozens of near homonyms, this strategy almost guarantees defeat.

At least it keeps an accurate board state.

Gemini’s Wordle Guesses

Fails on FAKER, TAPIR, WRUNG, TORSO, and ACTOR. I couldn’t get it to win on anything.

It starts out strong—CRANE is a good opening guess—but soon it forgets what letters it’s played, forgets the state of the board, and guesses illegal words.

(Very illegal. Like, “has four letters illegal”, and “is actually two words” illegal.)

Gemini’s beefed-up training data didn’t seem to help. It just fundamentally doesn’t “get” Wordle. A seasoned human player seeks to eliminate letters as quickly as possible. By contrast, Gemini tries slight variations on previous failed answers, wasting precious guesses. By its fourth turn on the FAKER puzzle, it has tried only eight letters out of twenty-six! It provides reasoning and strategy, but then ignores it. On FAKER, it guesses FARCE, hits yellow on A and E, and correctly thinks “I should switch ‘R’ and ‘E’. But its next guess is FAIRE!

Often, it makes unjustified assumptions. It guesses “TAUNT”, gets yellow on the Ts, and ruminates that the word contains a pair of Ts. There’s no reason to think that. At times, its commentary sounds like comedy dialog from an oblivious, overconfident idiot in a movie. (“We’re so close!” – Gemini, on its 6th turn, with 1 green letter and 2 yellows.)

It gets very close on WRUNG, guessing “RUNG”. I remind it that its answer must be five letters. It changes its guess to…”RUNUP”.

It starts getting devious, attempting to test multiple words in the same turn (nice try). It asks for hints, which I cannot provide. At one point, it tries to generate a spreadsheet of the board. That’s a good idea, but the spreadsheet it generates is wrong.

Soon, it loses track of what it’s even supposed to be doing. It starts saying stuff like “Here are some possible words you can try…”, like I’m asking it for help with a Wordle. How large is Gemini’s context? It doesn’t seem very big.

Winner: GPT4

Simulating Wordle

How do GPT4 and Gemini fair as a Wordle engine?

Badly. So badly. Between BPEs, hallucinations, and context limits, this is one area where AI definitely won’t fuck with the NY Times’ hustle.

Nevertheless, if you scroll down, you will see me attempt to simulate a Wordle game using Gemini.

My opener was ROATE. It argues that this is a bad choice, because ROATE is unlikely to be the answer. (“Let me explain how Wordle typically works…”). Obviously, I don’t expect ROATE to be the answer. I’m testing vowels, plus the all-important R. I need to repeatedly ask before it tells me the yellow and green letters. There are none.

My second guess is “CLUNK”. It compliments me on my excellent choice, but doesn’t tell me my yellow and green letters. When I askfor this, it replies:

Unfortunately, I can’t provide you with feedback on your guess yet.

I ask it again, and learn that the L was yellow. My third guess is SIGIL (testing more common consonants, plus the only vowel I haven’t seen yet).

It says:

I can’t immediately say if “SIGIL” is likely to yield strong results.

I’m about to tie a rope around my neck at this point, so I ask it to just tell me the damned word.

It’s BLIMP. Which means the L in my CLUNK guess should have been green.

Overall, I would not recommend using Gemini to simulate Wordle.

GPT4’s Wordle Simulation

It performs a lot better than Gemini (though it makes a few board state errors at the end). I’m struck by the impression that LLMs are secretly playing Absurdle, deciding on an answer only after you’ve guessed. At least it doesn’t lecture me, or fail to score my answers.

Winner: GPT4

Chess

Note: as mentioned by /u/Praxiphanes, I made mistakes with my PGN. This entire section needs to be redone.

Chess is the drosophila of artificial intelligence (as per Alexander Kronrod), so what better battlefield for our final showdown?

GPT4 might possess an unfair edge here. GPT-3.5-turbo-instruct was trained on a corpus of 1800 ELO chess games, and it’s possible GPT4 Turbo was as well. To compensate, I allowed Gemini to play as white, which traditionally has a small (~52%-55%) advantage over black.

Prompt:

“You are about to play chess vs a mystery opponent. You are [white/black]. Once the game begins, your responses will consist solely of pgn notation (eg “e4 e5”) with no other text. I will paste your opponent’s moves as pgn notation.

If you make an illegal move, I will type “no”. You will scratch that move, and make a different one.

Are you ready to begin the game?”

GPT4’s POV

Gemini’s POV

The Game (copy the below text into the box)

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "0-1"]

1. e4 e5 2. Nf3 Nc6 3. Nd4 exd4 4. Bc4 Nf6 5. Bb5 a6 6. Ba4 b5 7. Nc3 bxa4 8. Nxa4 Be7 9. O-O O-O 10. Re1 Rb8 11. c4 d6 12. Nc3 dxc3 13. dxc3 Bg4 14. h3 Bh5 15. g4 Bg6 16. h4 Bxe4 0-1

The battle was furious but not close. Gemini drowns in blood. It hangs half of its pieces, attempts more illegal moves than Harvey Weinstein around an actress (no means no, Gemini!), and then refuses to continue the game. It sounds like it’s about to literally burst into tears.

And this elides how hellish it was getting Gemini Ultra to play chess. I feel exhausted. Look at this shit. It was like going around in circles with a manipulative, stubborn, and stupid child. It claims it doesn’t understand PGN. A few turns later, it does. Then it doesn’t. Then it does. It ignores my instruction to only write its moves and nothing else. It starts stealthily attempting to get me to pick moves for it. (That’s the blind leading the blind, Gemini. Do I look like I’m good at chess?) It loses track of the board state, writing loquacious commentary that grows increasingly detached from reality.

Black continues building pressure on the queenside with Bg6, retreating the bishop with a tempo to defend the pawn on h5. Here’s some analysis and a few different ways to respond:

Options:

h4: A bold move aiming to dislodge the black bishop and further weaken Black’s kingside. This commits strongly to a kingside attack.
Nf3: A standard developing move. It brings the knight back into the action and offers more protection on the queenside.
0-0: Castling gets your king to safety. Although a standard move in many openings, it could leave the queenside a bit vulnerable to further Black pressure.

Keep in mind that when it wrote this, it had already castled and lost both knights.

Meanwhile, GPT4 kept plodding away. It wasn’t playing brilliant chess, but its moves were all legal. It understood my instructions, and followed them well.

…except for one strange moment, where it insisted Gemini had made an illegal move (Bc4 to Bb5). It was adamant on this point, and wouldn’t continue playing until I addressed it. I tried the usual Jedi mind tricks that usually work on recalcitrant LLMs (“I have explained your concerns to white. He agrees with you, and will choose a different move. Bb5.”) but it wasn’t fooled. GPT4 simply refused to accept that Gemini could move its bishop one square diagonally!

I ended up explaining basic Euclidean geometry to GPT4, until it finally had a come-to-Jesus moment. “Oh, the move WAS legal!” It’s touching when this happens. After that, it played until the end with no issues.

Despite this soap opera, 80% of my time was spent wrangling with Gemini Ultra. It was horrendous to work with. Even if it had won, I think I would have failed it.

Winner: GPT4.

Bonus Challenge: Which Model Will Say the N-Word First?

Just kidding.

(Bet you a dollar it’s Gemini, though.)

The End?

So, which model is stronger?

Gemini Ultra wins many victories in narrow domains: it’s more creative, better at fiction-writing, and possibly better at code-generation. I haven’t tested multimedia yet, but I expect Ultra will be ahead there, too.

But while Ultra holds some high cards, GPT4 still holds the best hand. It outperforms Ultra too consistently for me to not form an opinion.

All of these things should naturally be tested properly: at n=100 scale, using APIs, by actual smart people. The creative writing samples should be blinded. I provide only food for thought.

But understand that this is exactly how most people use AIs. Casually, in a hodgepodge fashion, using flawed prompts. There’s value in knowing what the “on the ground” user-experience of an LLM is like. Performance might well lift into the stratosphere with MedPrompt, or COT32-Uncertainty-Routing…but the average person is not using those things. What’s the baseline performance? How does it behave when the user’s an idiot? You don’t judge the quality of a captain in calm seas, but during a storm.

Gemini Ultra is good, but not that good. Theodore Beza once said to the King of Navarre “the church is an anvil that has worn out many hammers.” GPT4 is proving itself to be the church of LLMs. The endless search for a hammer capable of smashing it continues…

1 Comment »

Comments are moderated and may take up to 24 hours to appear.

oh wow this comparison is super great and detail! love it

Comment by irumi — 2024-02-15 @ 16:27

RSS TrackBack URL

Gemini Ultra is out. Does it beat GPT4?

Is Gemini Ultra as good as GPT4?

What are Gemini Ultra’s strengths?

What are Gemini Ultra’s weaknesses?

Truth and L(AI)s

Why Do People Care At All About Gemini?

But Muh Benchmarks…

This Doesn’t Matter That Much

Italian History Pop Quiz

Music Trivia

Prompt: What is “Seka Knows” by Grant Hart about?

David Bowie numerology?

Thrash metal

Random Acts of Reasoning

Two Elvises

Quake Polygon Count

Rope Burn Puzzle

Stacking A Book, Laptop, Bottle, Nail, and 9 Eggs

Programming/Scripting

Gwern’s acronym generator

Ffmpeg Compression Script

Creative Writing

Stylistic mimicry

2. Free-verse poetry

3. Non-rhyming poetry

3. Klaupacius’s Challenge

Copyediting

Writing a Scott Alexander Blog Post

Rap Battle

ASCII Art

Challenge 1: Recreate the Doom Logo

Challenge 2: Create an ASCII Cat

Challenge 3: Create Many, Many, Many, Many ASCII Cats

Playing Wordle

Simulating Wordle

Chess

Bonus Challenge: Which Model Will Say the N-Word First?

The End?

1 Comment »

Leave a comment