
How Accurate Are Lords of Limited's Early Format Guesses? An AI Analysis

Spoiler: Surprisingly good, but maybe don't trust them on Green.

Heads Up: This post is for Magic: The Gathering players.
You've been warned. We're deep in the weeds here, talking about the wizard poker podcast equivalent of pre-season NFL predictions.
For non-Magic players, the underlying idea (analyzing expert guesses) might still resonate. You might want to read the first section and check out my Python package for YouTube transcript analysis.

You ever listen to a podcast previewing something (a sports season, a movie lineup, a new MTG set) and think, "Yeah, but how often are they actually right?" I do this constantly with Lords of Limited (LoL), especially their "Early Access" episodes (like the most recent one). Ben and Ethan jump into a brand-new set, play for a few hours against others who are also just figuring things out, and then... they make pronouncements. Bold takes! Predictions! It's part of the fun, listening to smart people try to map an unknown territory based on blurry first impressions.

Naturally, some takes age like fine wine, others like milk left in a hot car. We all remember the big whiffs. But how does the overall record look?

You could figure this out the hard way. Go back, listen to hours of Early Access shows, then listen to the corresponding hours-long "50 Takes" retrospectives, meticulously cross-referencing every prediction. It sounds... noble? Exhausting?

Or, you know, you could get a robot to do it. Which is what I did. Using the magic of transcript analysis and AI comparison (shoutout to Google's new model: gemini-2.5-pro, you are A BEAST), the process became almost trivial:

  1. Feed the AI the Early Access transcript: Tell it to find the core arguments, the "takes," and the quotes backing them up.
  2. Do the same for the "50 Takes" episode: Get the final verdict, the wisdom of hindsight.
  3. Make the AI compare them: Ask it, "Okay, how close was Take A to Final Conclusion B?" and assign an accuracy score.
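If you're curious what that pipeline looks like in code, here's a minimal sketch. It's not my actual package, just an illustration assuming the youtube-transcript-api and google-generativeai Python packages and a placeholder API key:

```python
# Minimal sketch of the three-step pipeline (illustrative, not the real code):
# 1) pull a transcript, 2) extract takes from each episode, 3) compare them.
import google.generativeai as genai
from youtube_transcript_api import YouTubeTranscriptApi

genai.configure(api_key="YOUR_API_KEY")  # placeholder, not a real key
model = genai.GenerativeModel("gemini-2.5-pro")

def fetch_transcript(video_id: str) -> str:
    """Fetch the YouTube transcript and join its segments into one string."""
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    return " ".join(seg["text"] for seg in segments)

def extract_takes(transcript: str, episode_kind: str) -> str:
    """Ask the model to list the core takes (or retrospective verdicts)."""
    prompt = (
        f"This is a transcript of a Lords of Limited {episode_kind} episode.\n"
        "List each distinct take with a short title, a summary, and the "
        "supporting quotes.\n\n" + transcript
    )
    return model.generate_content(prompt).text

def compare_takes(early: str, retro: str) -> str:
    """Grade each Early Access take against the '50 Takes' retrospective."""
    prompt = (
        "Compare the early takes below with the retrospective verdicts. "
        "For each take, explain how it held up and assign one label: "
        "Highly Accurate, Mostly Accurate, Partially Accurate, "
        "Mostly Inaccurate, or Completely Wrong.\n\n"
        f"EARLY ACCESS TAKES:\n{early}\n\nRETROSPECTIVE:\n{retro}"
    )
    return model.generate_content(prompt).text
```

(In practice you run this once per set and parse the model's output into something structured, but the core loop really is just "transcript in, takes out, compare.")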

I went full data nerd and did this for every major set since Karlov Manor. No regrets. The whole thing took less than 12 minutes (discounting the few hours spent fiddling with different AI models and frameworks, making sure this thing actually worked).

(For the interested reader: the AI's detailed comparisons for each set live here. Want to try this on your favorite Hearthstone strategy podcast (not even sure if that's a thing)? The Python code is here. Good luck.)

What Does "Analysis" Even Look Like Here?

Right, so the AI spits out... stuff. What kind of stuff? It's basically three things.

First, the initial predictions from the Early Access show, broken down into multiple takes. Like this one (among 42 different takes) from Aetherdrift:


Format Speed - Slower than expected

Take: The format felt slower than initially anticipated during Early Access, with games often going longer despite the presence of aggressive mechanics. Aggro decks aren't necessarily the default Tier 1.
Supporting Statements:


Second, the retrospective summary (which contains 56 different takes for Aetherdrift), like the one below:


Format Speed - Slower Than Expected

Timestamp: 00:04:30
Current Evaluation: The format played out significantly slower than initial previews suggested. It was one of the slower formats in recent memory, especially in the Play Booster era.
Initial Expectations: Previews billed it as "too fast and too furious."
Supporting Statements:

Key Insights: Initial impressions based on themes (Vehicles) can be misleading. Format speed dictates card evaluation significantly.


And finally, the money shot: the direct comparison. Did the early take hold up? Here is what the AI said, based on the early take above and the retrospective:


Format Speed

Initial Take: Slower than anticipated based on pre-release hype, leading to long, grindy games despite aggressive mechanics. High confidence.
Retrospective Reality: Confirmed to be much slower than expected, one of the slowest Play Booster formats. This significantly impacted card evaluations.
Accuracy Analysis: Highly accurate. The early access gameplay correctly identified the format's defining characteristic, deviating from initial community expectations.
Key Factors: Direct gameplay experience quickly revealed the board complexity and tendency for games to go long, overriding assumptions based on mechanics alone.
Quotes:


These verdicts ("Highly Accurate", "Mostly Accurate", "Partially Accurate", "Mostly Inaccurate" or "Completely Wrong") are the bedrock of the number-crunching that follows.
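To give a sense of that number-crunching, here's a rough sketch (again, illustrative rather than the real analysis code) of how those verdict labels can be tallied into the per-set breakdowns discussed below:

```python
# Tally comparison verdicts into a per-set accuracy breakdown (illustrative).
from collections import Counter

LABELS = [
    "Highly Accurate",
    "Mostly Accurate",
    "Partially Accurate",
    "Mostly Inaccurate",
    "Completely Wrong",
]

def accuracy_breakdown(verdicts: list[str]) -> dict[str, float]:
    """Return each label's share among one set's comparison verdicts."""
    counts = Counter(verdicts)
    total = len(verdicts) or 1  # guard against an empty verdict list
    return {label: counts.get(label, 0) / total for label in LABELS}

# Made-up example, not the real Aetherdrift numbers:
print(accuracy_breakdown(["Highly Accurate"] * 4 + ["Partially Accurate"] * 2))
```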

Overall Accuracy: Wait, They're Actually Pretty Good?

So, how'd they do? Honestly, much better than my cynical brain expected. Across six sets' worth of takes:

And here's the breakdown for each set, if you like charts:

Now, let's be real. Slapping a label like "Partially Accurate" on a nuanced prediction is inherently reductive. Is getting the feel of a format right more important than whiffing on a specific common? Almost certainly. So, these numbers are a starting point, a vibe check for the vibe checkers. The interesting stuff is in the patterns. Where do Ben & Ethan consistently shine, and where do they stumble?

What They Nail (and What They Miss)

Strengths: Reading the Room (The Format's Room)

Where LoL consistently crushes it is understanding the gestalt of a format. The big picture stuff. The feel. It's uncanny how often their initial impression, often phrased using exactly that word ("feels like..."), ends up being spot-on.

Weakness: The Color Green Is Apparently Invisible

If LoL has a consistent blind spot, it's evaluating Green. For roughly the last year and a half, it's been the same story:

What gives? My pet theory: maybe they undervalue the sheer consistency Green often provides? In formats that end up slower than expected (which many recent ones have), maybe the raw stats of Green commons/uncommons just outperform flashier, synergy-dependent cards in other colors? Are big green fatties the new Dual Color Lands? Boring, unexciting, but reliably effective over time? Food for thought.

Weakness: Underestimating the Unassuming Overperformers

Beyond the Green conundrum, another recurring blind spot involves... Evaluating the quiet cards. The unassuming utility commons, the "clunky" looking cards, the glue pieces that end up being format all-stars.

It seems these cards eventually provide unexpected value in the specific context of their format. Maybe it's because they don't scream 'synergy' upfront, or their value is defensive or incremental, making it harder to appreciate them until the format's actual rhythm and key threats become clear through gameplay.

Card Evaluations: Good Hit Rate, Legendary Whiffs

On individual cards that aren't Green or unassuming utility, they're usually pretty solid. Maybe an 80/20 hit rate? They spot the workhorse commons and uncommons effectively, especially removal and obvious archetype payoffs.

But oh, the misses. When they miss, they sometimes really miss:

Anecdotally, when both hosts strongly agree on a card take early? The hit rate seems much higher.

So, What's the Takeaway?

Peering into the early calls of experts like LoL is fascinating. They're working with fuzzy data and intuition, yet they nail the format's DNA (the "feel," the core mechanics, the general speed) with surprising frequency.

Their blind spots (mainly, Green) are just as interesting, perhaps revealing biases towards certain playstyles or an underestimation of raw stats in complex environments. It underscores that even for the best, predicting chaotic systems like MTG limited is hard.

Should you base your first Arena draft entirely on their Early Access takes? I think yes, absolutely yes. Their insights, even the misses, are valuable. And for Ben and Ethan? Keep trusting those gut feelings on format shape... but maybe, just maybe, give those unassuming green cards a second look next time.


Note: During the writing of this article, I heavily relied on generative AI; specifically, gemini-2.5-pro. Ideas, analysis and core content are entirely mine (yes, that also means I read all of the AI outputs/reports). I fed my initial draft into Gemini ("make it sound even more like Matt Levine / Chuck Klosterman!") for what could be called stylistic enhancement. The result had a much better vibe than what I'd written. I made further edits to the Gemini output and restructured a few parts, but it feels right to credit Gemini as a co-author (ghostwriter?) here.