
How Accurate Are Lords of Limited's Early Format Guesses? An AI Analysis

Spoiler: Surprisingly good, but maybe don't trust them on Green.

Heads Up: This post is for Magic: The Gathering players.
You've been warned. We're deep in the weeds here, talking about the wizard poker podcast equivalent of pre-season NFL predictions.
For non-Magic players, the underlying idea (analyzing expert guesses) might still resonate. You might want to read the first section and check out my Python package for YouTube transcript analysis.

You ever listen to a podcast previewing something (a sports season, a movie lineup, a new MTG set) and think, "Yeah, but how often are they actually right?" I do this constantly with Lords of Limited (LoL), especially their "Early Access" episodes (like the most recent one). Ben and Ethan jump into a brand-new set, play for a few hours against others who are also just figuring things out, and then... they make pronouncements. Bold takes! Predictions! It's part of the fun, listening to smart people try to map an unknown territory based on blurry first impressions.

Naturally, some takes age like fine wine, others like milk left in a hot car. We all remember the big whiffs. But how does the overall record look?

You could figure this out the hard way. Go back, listen to hours of Early Access shows, then listen to the corresponding hours-long "50 Takes" retrospectives, meticulously cross-referencing every prediction. It sounds... noble? Exhausting?

Or, you know, you could get a robot to do it. Which is what I did. Using the magic of transcript analysis and AI comparison (shoutout to Google's new model: gemini-2.5-pro, you are A BEAST), the process became almost trivial:

  1. Feed the AI the Early Access transcript: Tell it to find the core arguments, the "takes," and the quotes backing them up.
  2. Do the same for the "50 Takes" episode: Get the final verdict, the wisdom of hindsight.
  3. Make the AI compare them: Ask it, "Okay, how close was Take A to Final Conclusion B?" and assign an accuracy score.
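If you're curious what that pipeline looks like in code, here's a minimal sketch. It's not my actual package, just an illustration assuming the youtube-transcript-api and google-generativeai Python packages and a placeholder API key:

```python
# Minimal sketch of the three-step pipeline (illustrative, not the real code):
# 1) pull a transcript, 2) extract takes from each episode, 3) compare them.
import google.generativeai as genai
from youtube_transcript_api import YouTubeTranscriptApi

genai.configure(api_key="YOUR_API_KEY")  # placeholder, not a real key
model = genai.GenerativeModel("gemini-2.5-pro")

def fetch_transcript(video_id: str) -> str:
    """Fetch the YouTube transcript and join its segments into one string."""
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    return " ".join(seg["text"] for seg in segments)

def extract_takes(transcript: str, episode_kind: str) -> str:
    """Ask the model to list the core takes (or retrospective verdicts)."""
    prompt = (
        f"This is a transcript of a Lords of Limited {episode_kind} episode.\n"
        "List each distinct take with a short title, a summary, and the "
        "supporting quotes.\n\n" + transcript
    )
    return model.generate_content(prompt).text

def compare_takes(early: str, retro: str) -> str:
    """Grade each Early Access take against the '50 Takes' retrospective."""
    prompt = (
        "Compare the early takes below with the retrospective verdicts. "
        "For each take, explain how it held up and assign one label: "
        "Highly Accurate, Mostly Accurate, Partially Accurate, "
        "Mostly Inaccurate, or Completely Wrong.\n\n"
        f"EARLY ACCESS TAKES:\n{early}\n\nRETROSPECTIVE:\n{retro}"
    )
    return model.generate_content(prompt).text
```

(In practice you run this once per set and parse the model's output into something structured, but the core loop really is just "transcript in, takes out, compare.")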

I went full data nerd and did this for every major set since Karlov Manor. No regrets. The whole thing took less than 12 minutes (discounting the few hours spent fiddling with different AI models and frameworks, making sure this thing actually worked).

(For the interested reader: the AI's detailed comparisons for each set live here. Want to try this on your favorite Hearthstone strategy podcast (not even sure if that's a thing)? The Python code is here. Good luck.)

What Does "Analysis" Even Look Like Here?

Right, so the AI spits out... stuff. What kind of stuff? It's basically three things.

First, the initial predictions from the Early Access show, broken down into multiple takes. Like this one (among 42 different takes) from Aetherdrift:


Format Speed - Slower than expected

Take: The format felt slower than initially anticipated during Early Access, with games often going longer despite the presence of aggressive mechanics. Aggro decks aren't necessarily the default Tier 1.
Supporting Statements:


Second, the retrospective summary (which contains 56 different takes for Aetherdrift), like the one below:


Format Speed - Slower Than Expected

Timestamp: 00:04:30
Current Evaluation: The format played out significantly slower than initial previews suggested. It was one of the slower formats in recent memory, especially in the Play Booster era.
Initial Expectations: Previews billed it as "too fast and too furious."
Supporting Statements:

Key Insights: Initial impressions based on themes (Vehicles) can be misleading. Format speed dictates card evaluation significantly.


And finally, the money shot: the direct comparison. Did the early take hold up? Here is what the AI said, based on the early take above and the retrospective:


Format Speed

Initial Take: Slower than anticipated based on pre-release hype, leading to long, grindy games despite aggressive mechanics. High confidence.
Retrospective Reality: Confirmed to be much slower than expected, one of the slowest Play Booster formats. This significantly impacted card evaluations.
Accuracy Analysis: Highly accurate. The early access gameplay correctly identified the format's defining characteristic, deviating from initial community expectations.
Key Factors: Direct gameplay experience quickly revealed the board complexity and tendency for games to go long, overriding assumptions based on mechanics alone.
Quotes:


These verdicts ("Highly Accurate", "Mostly Accurate", "Partially Accurate", "Mostly Inaccurate" or "Completely Wrong") are the bedrock of the number-crunching that follows.
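To give a sense of that number-crunching, here's a rough sketch (again, illustrative rather than the real analysis code) of how those verdict labels can be tallied into the per-set breakdowns discussed below:

```python
# Tally comparison verdicts into a per-set accuracy breakdown (illustrative).
from collections import Counter

LABELS = [
    "Highly Accurate",
    "Mostly Accurate",
    "Partially Accurate",
    "Mostly Inaccurate",
    "Completely Wrong",
]

def accuracy_breakdown(verdicts: list[str]) -> dict[str, float]:
    """Return each label's share among one set's comparison verdicts."""
    counts = Counter(verdicts)
    total = len(verdicts) or 1  # guard against an empty verdict list
    return {label: counts.get(label, 0) / total for label in LABELS}

# Made-up example, not the real Aetherdrift numbers:
print(accuracy_breakdown(["Highly Accurate"] * 4 + ["Partially Accurate"] * 2))
```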

Overall Accuracy: Wait, They're Actually Pretty Good?

So, how'd they do? Honestly, much better than my cynical brain expected. Across six sets' worth of takes:

And here's the breakdown for each set, if you like charts:

Now, let's be real. Slapping a label like "Partially Accurate" on a nuanced prediction is inherently reductive. Is getting the feel of a format right more important than whiffing on a specific common? Almost certainly. So, these numbers are a starting point, a vibe check for the vibe checkers. The interesting stuff is in the patterns. Where do Ben & Ethan consistently shine, and where do they stumble?

What They Nail (and What They Miss)

Strengths: Reading the Room (The Format's Room)

Where LoL consistently crushes it is understanding the gestalt of a format. The big picture stuff. The feel. It's uncanny how often their initial impression, often phrased using exactly that word ("feels like..."), ends up being spot-on.

Weakness: The Color Green Is Apparently Invisible

If LoL has a consistent blind spot, it's evaluating Green. For roughly the last year and a half, it's been the same story:

What gives? My pet theory: maybe they undervalue the sheer consistency Green often provides? In formats that end up slower than expected (which many recent ones have), maybe the raw stats of Green commons/uncommons just outperform flashier, synergy-dependent cards in other colors? Are big green fatties the new Dual Color Lands? Boring, unexciting, but reliably effective over time? Food for thought.

Weakness: Underestimating the Unassuming Overperformers

Beyond the Green conundrum, another recurring blind spot involves... Evaluating the quiet cards. The unassuming utility commons, the "clunky" looking cards, the glue pieces that end up being format all-stars.

It seems these cards eventually provide unexpected value in the specific context of their format. Maybe it's because they don't scream 'synergy' upfront, or their value is defensive or incremental, making it harder to appreciate them until the format's actual rhythm and key threats become clear through gameplay.

Card Evaluations: Good Hit Rate, Legendary Whiffs

On individual cards that aren't Green or unassuming utility, they're usually pretty solid. Maybe an 80/20 hit rate? They spot the workhorse commons and uncommons effectively, especially removal and obvious archetype payoffs.

But oh, the misses. When they miss, they sometimes really miss:

Anecdotally, when both hosts strongly agree on a card take early? The hit rate seems much higher.

So, What's the Takeaway?

Peering into the early calls of experts like LoL is fascinating. They're working with fuzzy data and intuition, yet they nail the format's DNA (the "feel," the core mechanics, the general speed) with surprising frequency.

Their blind spots (mainly, Green) are just as interesting, perhaps revealing biases towards certain playstyles or an underestimation of raw stats in complex environments. It underscores that even for the best, predicting chaotic systems like MTG limited is hard.

Should you base your first Arena draft entirely on their Early Access takes? I think yes, absolutely yes. Their insights, even the misses, are valuable. And for Ben and Ethan? Keep trusting those gut feelings on format shape... but maybe, just maybe, give those unassuming green cards a second look next time.


Note: During the writing of this article, I heavily relied on generative AI; specifically, gemini-2.5-pro. Ideas, analysis and core content are entirely mine (yes, that also means I read all of the AI outputs/reports). I fed my initial draft into Gemini ("make it sound even more like Matt Levine / Chuck Klosterman!") for what could be called stylistic enhancement. The result had a much better vibe than what I'd written. I made further edits to the Gemini output and restructured a few parts, but it feels right to credit Gemini as a co-author (ghostwriter?) here.