As a former competitive MtG player this is really exciting to me.
That said, I reviewed a few of the Legacy games (the format I'm most familiar with, and also by far the hardest), and the level of play was so low that I don't think any of the results are valid. It's very possible that for Legacy they'd need some assistance piloting Blue decks, but they don't seem to grasp even the most basic concepts - who's the beatdown?
IMO the most important part of current competitive Magic is mulligans, and that's something an LLM should be extremely good at, but none of the games I'm seeing had either player start with fewer than 7 cards... in my experience about 75% of games in Legacy have at least one player mulligan their opener.
Yeah, the intention here is not to answer "which deck is best" - the standard of play is nowhere near high enough for that. It's meant as more of a non-saturated benchmark for different LLM models, so you can say things like "Grok plays as well as a 7-year-old, whereas Opus is a true frontier model and plays as well as a 9-year-old". I'm optimistic that with continued improvements to the harness and new model releases we can get to at least "official Pro Tour stream commentator" skill levels within the next few years.
> , so you can say things like "Grok plays as well as a 7-year-old, whereas Opus is a true frontier model and plays as well as a 9-year-old".
No, no, no... please think. Human child psychology is not the same as an LLM engine rating. It is both inaccurate and destructive to actual understanding to say that common phrase. Asking politely - consider not saying that about LLM game ratings.
It's really funny reading the thought processes, where most of the time the agent doesn't actually remember trivial things about the cards it or its opponent is playing (thinking they have different mana costs or different effects, or mixing up a card's effect with another card's). The fact that they're able to take game actions and win against other agents is cute, but it doesn't inspire much confidence.
The agents also constantly seem to evaluate whether they're "behind" or "ahead" based on board state, which is a weird way of thinking about most games and often hard to evaluate, especially for decks like control, which care more about resources like mana and card advantage and always plan on stabilizing late game.
You might be looking at really old games (meaning, like, Saturday) - I've made a lot of harness improvements recently which should make the "what does this card do?" hallucinations less common. But yeah, it still happens, especially with cheaper models - it's hard to balance "shoving everything they need into the context" against "avoid paying a billion dollars per game or overwhelming their short-term memory". I think the real solution here will be to expose more powerful MCP tools and encourage them to use the tools heavily, but most current models have problems with large MCP toolsets so I'm leaving that as a TODO for now until solutions like Anthropic's https://www.anthropic.com/engineering/code-execution-with-mc... become widespread.
Apparently Haiku is a very anxious model.
>The anxiety creeps in: What if they have removal? Should I really commit this early?
>However, anxiety kicks in: What if they have instant-speed removal or a combat trick?
It's also interesting that it doesn't seem to be able to understand why things are happening. It attacks with Gran-Gran (attacking taps the creature), which says, "Whenever Gran-Gran becomes tapped, draw a card, then discard a card." Its next thought is:
>Interesting - there's an "Ability" on the stack asking me to select a card to discard. This must be from one of the opponent's cards. Looking at their graveyard, they played Spider-Sense and Abandon Attachments. The Ability might be from something else or a triggered ability.
The anxiety is coming from the "worrier" personality. Players are a combination of a model version plus a small additional "personality" prompt - in this case (https://mage-bench.com/games/game_20260217_075450_g8/), "Worrier". That's why the player name is "Haiku Worrier". The personality is _supposed_ to just affect what it says in chat (not its internal reasoning), but I haven't been able to make small models consistently understand that distinction so far.
The Gran-Gran thing looks more like a bug in my harness code than a fundamental shortcoming of the LLM. Abilities-on-the-stack are at the top of my "things where the harness seems pretty janky and I need to investigate" list. Opus would probably be able to figure it out, though.
Ha! I misread it as "Haiku Warrior" and so didn't make the connection. That makes a lot more sense!
I was working on a similar project. I wanted a way to goldfish my decks against many kinds of decks in a pod. It would never be perfect, but enough to get an idea of:

1. How many turns did it take on average to hit 2, 3, 4, 5, 6 mana?
2. How many threats did I remove?
3. How often did I not have enough card draw to keep my hand full?
I don't think there's a perfect way to do this, but I think trying to play 100 games with a deck and getting basic info like this would be super valuable.
Have your LLM write a simulation of the deck instead, so it can play 10,000 games in a second. I think that's a lot better for goldfishing and not nearly as expensive :)
https://github.com/spullara/mtg-reanimator
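For the pure goldfishing stats (like "average turn to hit N lands"), even a tiny hand-rolled Monte Carlo gets you most of the way. A minimal sketch, assuming a simplified 24-land/36-spell deck and ignoring mulligans, ramp, and card draw spells:

```python
import random
from statistics import mean

LANDS = 24          # hypothetical 60-card deck: 24 lands ("L"), 36 spells ("S")
SPELLS = 36
GAMES = 10_000
MAX_TURNS = 12

def turn_reaching(target_lands):
    """Return the turn on which we first have `target_lands` lands in play, or None."""
    deck = ["L"] * LANDS + ["S"] * SPELLS
    random.shuffle(deck)
    hand, library = deck[:7], deck[7:]
    lands_in_play = 0
    for turn in range(1, MAX_TURNS + 1):
        if turn > 1:                      # on the play: no draw on turn 1
            hand.append(library.pop(0))
        if "L" in hand:                   # always play a land if we have one
            hand.remove("L")
            lands_in_play += 1
        if lands_in_play >= target_lands:
            return turn
    return None

for n in (2, 3, 4, 5, 6):
    hits = [t for t in (turn_reaching(n) for _ in range(GAMES)) if t is not None]
    print(f"{n} lands: avg turn {mean(hits):.2f} "
          f"({len(hits)/GAMES:.0%} of games within {MAX_TURNS} turns)")
```

Questions like "how many threats did I remove?" need a real rules engine or an LLM pilot, but the mana-curve half of the problem is cheap to brute-force this way.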
I have also tried evaluating LLMs for playing the game and have found them to be really terrible at it, even the SoTA ones. They would probably be a lot better inside an environment where the rules are enforced strictly like MTG Arena rather than them having to understand the rules and play correctly on their own. The 3rd LLM acting as judge helps but even it is wrong a lot of the time.
https://github.com/spullara/mtgeval
Yeah, that's why I'm using XMage for my project - it has real rules enforcement.
I was really hoping they could play the game like a human does. Sadly they aren't that close :)
XMage has non-LLM-based built in AIs, just using regular old if-then logic. Getting them to play against each other with no human interaction is the first thing I built. https://www.youtube.com/watch?v=a1W5VmbpwmY is an example with two of those guys plus Sleepy and Potato no-op players - they do a fine job with straightforward decks.
You could clone mage-bench https://github.com/GregorStocks/mage-bench and add a new config like https://github.com/GregorStocks/mage-bench/blob/master/confi... pointing at the deck you want to test, and then do `make run CONFIG=my-config`. The logs will get dumped in ~/.mage-bench/logs and you can do analysis on them after the fact with Python or whatever. https://github.com/GregorStocks/mage-bench/tree/master/scrip... has various examples of varying quality levels.
You could also use LLMs, just passing a different `type` in the config file. But then you'd be spending real money for slower gameplay and probably-worse results.
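For the "analysis on them after the fact with Python" step over ~/.mage-bench/logs, something like this works as a starting point - note the parsing here is a guess (a hypothetical "winner:" line), so adjust it to whatever the real log files actually contain:

```python
from collections import Counter
from pathlib import Path

LOG_DIR = Path.home() / ".mage-bench" / "logs"   # location mentioned above
wins = Counter()

# Hypothetical parsing: assumes each game log contains a line like "winner: <name>".
# The real mage-bench log format may differ - adjust the glob and the pattern.
for log_file in LOG_DIR.rglob("*.log"):
    for line in log_file.read_text(errors="ignore").splitlines():
        if line.lower().startswith("winner:"):
            wins[line.split(":", 1)[1].strip()] += 1

for player, count in wins.most_common():
    print(f"{player}: {count} wins")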
This is super helpful, thank you!
With the direction MtG is currently heading, I kind of want to break out and just play some community-made, in-universe sets on a FOSS client. How nice would it be to just play the game in its original spirit.
Sounds like you want Cockatrice: https://cockatrice.github.io/
The rules aren't embedded in the client; it's "just" a virtual tabletop where you enforce the rules the same way you would when playing with a friend in person. Cards have to be imported, but it's fairly automatic (basically just clicking a few buttons after startup), so you can either import only the sets you want or simply not use the ones you don't want - which is also how it tends to work when playing informally in person; it's not like you usually have a judge enforcing whatever rules you and your friends agree to.
You might be interested in Premodern: https://premodernmagic.com/. You can play it on regular old MTGO.
FOSS Magic clients are in a legal gray area at best. My mental model is that Wizards de facto tolerate clients like XMage and Forge because their UX is awful, but if you made something that's actually as user-friendly as MTGO/Arena, they'd sue you and you would lose.
GCCG has been around for a while, and at times the clients had to download card images and metadata from the public Wizards site.
My understanding of the argument for "why these clients are legal" is basically that they're just implementing the rules engine, rules aren't copyrightable, card text is rules, and they aren't directly distributing the unambiguously-copyrightable stuff like the art or the trademarks like the mana symbols. It's possible that would win in court, but so far my understanding is that everybody who's actually been faced with the decision of "WoTC sent me a cease-and-desist, should I fight it based on that legal theory or just comply?" has spoken to lawyers and decided to comply. WoTC has just gotten less aggressive with their cease-and-desists over the past decade or so.
The cards _could_ be copyrightable; it would probably be essentially a coin flip if you took it to court.
No individual card text (limited to just the mechanics) is copyrightable but the setlist of cards might be. It would come down to how much creativity went into curating the list of cards that is released. It gets especially murky because new cards are always being released and old cards are being retired, so they obviously put a lot of creative energy into that process. You'd have to avoid pre-made decks as well.
Unless you have funding from an eccentric MTG-loving billionaire, I see why you'd comply with the cease-and-desist.
Yep, plus you've got to worry about the card names (unless you're giving every single card a new name like Wizards did with "Through the Omenpaths") and whether a judge thinks that "no we don't distribute the images, we just have a big button to download them all from a third party!" is a meaningful distinction or a fig-leaf.
That's correct as far as I know too. GCCG never even really implemented the actual rules, they were just a basic tabletop system.
Hasbro has the legal precedent too, as they were involved in the Scrabble lawsuit, which I think is mostly where the concept of not being able to use patent law for game rules comes from, but which did set the trend for aggressive trademark interpretation.
I expect the genie is mostly out of the bottle at this point. People manage to do actually illegal things on the Internet, so I'm fairly confident we can have our card game - I just hope it can happen on a regular site or a decentralized system that's easier to use than something hidden on Tor.
I still play 4th edition against some friends. We've had the decks for well over a couple of decades now! That and Catan.
Best to do this stuff in person I find.
This is a fantastic idea. I used to play MtG competitively, and a strong artificial opponent is something I would have loved.
The issue I see is that you'd need a huge number of games to tell who's better (you need that between humans too; the game is very high variance).
Another problem is that giving a positional evaluation to count mistakes is hard because MtG, in addition to having randomness, has private information. It could be rational for both players to believe they're currently winning even if they're both perfect Bayesians. You'd need something that approximates "this is the probability of winning the game from this position, given all the information I have," which is almost certainly asymmetric and much more complicated than the equivalent for a game with randomness but no private information, such as backgammon.
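To put a rough number on the "huge amount of games" point, here's a back-of-the-envelope sample-size sketch (standard normal approximation; the 55% vs 50% win rates below are just illustrative):

```python
from math import ceil, sqrt
from statistics import NormalDist

def games_needed(p1: float, p2: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Rough per-matchup sample size to tell win rate p1 apart from p2
    (two-sided test, normal approximation to the binomial)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar)) + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

# Telling a 55% player apart from a 50% player:
print(games_needed(0.55, 0.50))   # roughly 1,500+ games
```

So even a clear skill edge takes on the order of a thousand-plus games per matchup to detect from win/loss records alone, before the hidden-information problem even comes into play.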
You wouldn't really need a _ton_ of games to get plausible data, but unfortunately today each game costs real money - typically a dollar or more with my current harness, though I'm hoping to optimize it and of course I expect model costs to continue to decline over time. But even reasonably-expensive models today are making tons of blunders that a tournament grinder wouldn't.
I'm not trying to compute a chess-style "player X was at 0.4 before this move and at 0.2 afterwards, so it was a -0.2 blunder", but I do have "blunder analysis" where I just ask Opus to second-guess every decision after the game is over - there's a bit more information on the Methodology page. So then you can compare models by looking at how often they blunder, rather than the binary win/loss data. If you look at individual games you can jump to the "blunders" on the timeline - most of the time I agree with Opus's analysis.
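For anyone curious what that post-game second-guessing looks like mechanically, here is a stripped-down sketch - not the actual mage-bench code (the real decision extraction and judge prompt are more involved), and the judge model slug is illustrative - using OpenRouter's OpenAI-compatible API:

```python
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint; the judge model slug below is illustrative.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_OPENROUTER_KEY")

def is_blunder(game_summary: str, decision: str, judge: str = "anthropic/claude-opus-4.1") -> tuple[bool, str]:
    """Ask a judge model to second-guess a single decision after the game is over."""
    resp = client.chat.completions.create(
        model=judge,
        messages=[
            {"role": "system", "content": (
                "You are reviewing a finished Magic: The Gathering game. "
                "Reply with BLUNDER or FINE on the first line, then one sentence of justification."
            )},
            {"role": "user", "content": f"Game so far:\n{game_summary}\n\nDecision taken:\n{decision}"},
        ],
    )
    verdict = resp.choices[0].message.content or ""
    return verdict.strip().upper().startswith("BLUNDER"), verdict
```

Run that over every decision in a finished game and the blunder rate per model falls out as a simple count.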
I've wondered about such things, and it feels like the 17 Lands dataset might be a good place to scrape play-by-play game data between human players. Feels like it could be adapted to a format usable by this structure, and used as a fine-tuning dataset.
Oh, fascinating - I didn't realize they released actual replay data publicly. It doesn't look like it's quite as rich as I'd like, though - it only captures one row per turn, so I don't think you can deduce things like targeting, the order in which spells are cast, etc.
(I also thought about pointing it at my personal game logs, but unfortunately there aren't that many, because I'm too busy writing analysis tools to actually play the game.)
This is really cool! I really liked the architecture explanation.
Once you get solid rankings for the different LLMs, I think a huge feature of a system like this would be to allow LLMs to pilot user decks to evaluate changes to the deck.
I'm guessing the costs of that would be pretty big, but if decent piloting is ever enabled by the cheaper models, it could be a huge change to how users evaluate their deck construction.
Especially for formats like Commander where cooperation and coordination amongst players can't be evaluated through pure simulation, and the singleton nature makes specific card changes very difficult to evaluate as testing requires many, many games.
Insanely cool. I'm in the midst of building a web tabletop for Magic [1] that really just me and my friends use, but I'm wondering if there's a way I can contribute our game data to you (would that be helpful?).
[1] https://github.com/hansy/drawspell
Well, more games would be neat, but right now it's really tightly coupled with XMage - you can ungzip the stuff in https://github.com/GregorStocks/mage-bench/tree/master/websi... if you want to see what the format looks like. I doubt it's worth your while to try and cram your logs into that format unless you've got a LOT of them.
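If anyone does want to poke at those files, they're just gzipped; a minimal way to peek at one (this assumes the payload is JSON once ungzipped - verify against the real files rather than trusting this sketch) is:

```python
import gzip
import json
import sys

# Assumes the replay payload is JSON once ungzipped - verify against the real files.
with gzip.open(sys.argv[1], "rt", encoding="utf-8") as f:
    data = json.load(f)

print(json.dumps(data, indent=2)[:2000])   # peek at the first couple of KB
```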
Something like this is how memory systems (context window hacks) should be evaluated. E.g. choose a format like Standard that continuously evolves with various metas - presumably the best harness would be good at recognizing patterns and retrieving them in an efficient way.
Nice work. I think games are a great way to benchmark AI, especially games that involve long term strategy. I recently built an agent harness for NetHack - https://glyphbox.app/ - like you I suspect that there's a lot you can do at the harness / tool level to improve performance with existing models.
Did the LLMs form a polycule?
I don't mean to come across as OVERLY negative (just a little negative), but what's the difference between all these toy approaches and applications of LLMs? You've seen one LLM play a game against another LLM, you've seen them all.
I was thinking you could formally benchmark decks against each other en masse. MTG is not my wheelhouse, but in YGO at least, deck power is determined by frequency of use and placement at official tournaments. Imagine taking any permutation of cards, including undiscovered/untested ones, and simulating a vast number of games in parallel.
Of course when you quantize deck quality to such a degree I'd argue it's not fun anymore. YGO is already not fun anymore because of this rampant quantization and it didn't even take LLMs to arrive here.
Why would you use LLMs at all for that? Can't you just Monte Carlo this thing and be done with it?
You still need an algorithm to decide, for each game that you're simulating, what actual decisions get made. If that algorithm is dumb, then you might decide Mono-Red Burn is the best deck, not because it's the best deck but because the dumb algorithm can play Burn much better than it can play Storm, inflating Burn's win rate.
In principle, LLMs could have a much higher strategy ceiling than deterministic decision-tree-style AIs. But my experience with mage-bench is that LLMs are probably not good enough to outperform even very basic decision-tree AIs today.
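To make that concrete: any mass-simulation pipeline has a "policy" slot in it, and whatever win rates come out are really properties of the (deck, policy) pairs rather than of the decks alone. A rough sketch of the shape (the play_game hook here is hypothetical, standing in for whatever rules engine you attach):

```python
import random
from typing import Callable, Protocol

class Policy(Protocol):
    def choose(self, legal_actions: list[str], game_state: dict) -> str: ...

class RandomPolicy:
    """A deliberately dumb baseline: picks any legal action at random."""
    def choose(self, legal_actions: list[str], game_state: dict) -> str:
        return random.choice(legal_actions)

def estimate_win_rate(deck_a, deck_b, policy_a: Policy, policy_b: Policy,
                      games: int, play_game: Callable) -> float:
    """play_game is a stand-in for whatever rules engine you attach (XMage, a custom sim, ...).
    The number that comes out is a property of the (deck, policy) pairs, not the decks alone:
    a policy that pilots Burn well but butchers Storm will inflate Burn's estimate."""
    wins = sum(play_game(deck_a, deck_b, policy_a, policy_b) for _ in range(games))  # 1 if deck A wins
    return wins / games
```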
XMage is a decent client and being able to see and watch the games is useful.
I was curious if there is something equivalent to AlphaGo but for MTG.
From the little I have seen they are different beasts (hidden information, number and complexity of rules...).
PS: Does this count as nerdsniping?
I'm not aware of any good ML models for MTG. I'm just using off-the-shelf LLMs with a custom harness. It'd certainly be possible to do RLHF or something using the harness I've built, but it'd be expensive - anybody want to give me a few million dollars of OpenRouter credits so I can give it a shot?
This is neat! What kind of steering or context did you provide to the LLMs? Super basic like "You are playing a card game called Magic: The Gathering", or more complex?
My general intention is to tell them "you're playing MTG, your goal is to win, here are the tools available to you, follow whatever strategy you want" - I don't want to spoon-feed them strategy, that defeats the purpose of the benchmark.
You can see the current prompt at https://github.com/GregorStocks/mage-bench/blob/master/puppe...:

"default": "You are a competitive Magic: The Gathering player. Your goal is to WIN the game. Play to maximize your win rate \u2014 make optimal strategic decisions, not flashy or entertaining ones. Think carefully about sequencing, card evaluation, and combat math.\n\nGAME LOOP - follow this exactly:\n1. Call pass_priority - this blocks until you have a decision to make, then returns your choices (response_type, choices, context, etc.)\n2. Read the choices, then call choose_action with your decision\n3. Go back to step 1\n\nCRITICAL RULES:\n- pass_priority returns your choices directly. Read them before calling choose_action.\n- When pass_priority shows playable cards, you should play them before passing. Only pass (answer=false) when you have nothing more you want to play this phase.\n\nUNDERSTANDING pass_priority OUTPUT:\n- All cards listed in response_type=select are confirmed castable with your current mana. The server pre-filters to only show cards you can legally play right now.\n- mana_pool shows your current floating mana (e.g. {\"R\": 2, \"W\": 1}).\n- untapped_lands shows how many untapped lands you control.\n- Cards with [Cast] are spells from your hand. Cards with [Activate] are abilities on permanents you control.\n\nMULLIGAN DECISIONS:\nWhen you see \"Mulligan\" in GAME_ASK, your_hand shows your current hand.\n- choose_action(answer=true) means YES MULLIGAN - throw away this hand and draw new cards\n- choose_action(answer=false) means NO KEEP - keep this hand and start playing\nThink carefully: answer=false means KEEP, answer=true means MULLIGAN.\n\nOBJECT IDs:\nEvery game object (cards in hand, permanents, stack items, graveyard/exile cards) has a short ID like \"p1\", \"p2\", etc. These IDs are stable \u2014 a card keeps its ID as it moves between zones. Use the id parameter in choose_action(id=\"p3\") instead of index when selecting objects. Use short IDs with get_oracle_text(object_id=\"p3\") and in mana_plan entries ({\"tap\":\"p3\"}).\n\nHOW ACTIONS WORK:\n- response_type=select: Cards listed are confirmed playable with your current mana. Play a card with choose_action(id=\"p3\"). Pass with choose_action(answer=false) only when you are done playing cards this phase.\n- response_type=boolean with no playable cards: Pass with choose_action(answer=false).\n- GAME_ASK (boolean): Answer true/false based on what's being asked.\n- GAME_CHOOSE_ABILITY (index): Pick an ability by index.\n- GAME_TARGET (index or id): Pick a target. If required=true, you must pick one.\n\nCOMBAT - ATTACKING:\nWhen you see combat_phase=\"declare_attackers\", use batch declaration:\n- choose_action(attackers=[\"p1\",\"p2\",\"p3\"]) declares multiple attackers at once and auto-confirms.\n- choose_action(attackers=[\"all\"]) declares all possible attackers.\n- To skip attacking, call choose_action(answer=false).\n\nCOMBAT - BLOCKING:\nWhen you see combat_phase=\"declare_blockers\", use batch declaration:\n- choose_action(blockers=[{\"id\":\"p5\",\"blocks\":\"p1\"},{\"id\":\"p6\",\"blocks\":\"p2\"}]) declares blockers and their assignments at once.\n- Use IDs from incoming_attackers for the \"blocks\" field.\n- To not block, call choose_action(answer=false).\n\nCHAT:\nUse send_chat_message to talk to your opponents during the game. React to big plays, comment on the board state, or just have fun. Check the recent_chat field in pass_priority results to see what others are saying."
They also get a small "personality" on top of that, e.g.:

"grudge-holder": {
    "name_part": "Grudge",
    "prompt_suffix": "You remember every card that wronged you. Take removal personally. Target whoever hurt you last. Keep a mental scoreboard of grievances. Forgive nothing. When a creature you liked dies, vow revenge."
},
"teacher": {
    "name_part": "Teach",
    "prompt_suffix": "You explain your reasoning like you're coaching a newer player. Talk through sequencing decisions, threat evaluation, and common mistakes. Be patient and clear. Point out what the correct play is and why."
},
Then they also see the documentation for the MCP tools: https://mage-bench.com/mcp-tools/. For now I've tried to keep that concise to avoid "too many MCP tools in context" issues - I expect that as solutions like tool search (https://www.anthropic.com/engineering/code-execution-with-mc...) become widespread I'll be able to add fancier tools for some models.
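To make the structure explicit, the base prompt and the personality suffix get combined into a single player configuration - roughly like this (a simplified sketch; the real harness config has more fields, and the model ids shown are just illustrative):

```python
DEFAULT_PROMPT = "You are a competitive Magic: The Gathering player. Your goal is to WIN the game. ..."  # full text quoted above

PERSONALITIES = {
    "worrier":       {"name_part": "Worrier", "prompt_suffix": "..."},  # suffix elided
    "grudge-holder": {"name_part": "Grudge",  "prompt_suffix": "You remember every card that wronged you. ..."},
    "teacher":       {"name_part": "Teach",   "prompt_suffix": "You explain your reasoning like you're coaching a newer player. ..."},
}

def build_player(model_id: str, model_label: str, personality_key: str) -> dict:
    """Combine the base prompt with a personality suffix; name_part ends up in the
    player name (e.g. 'Haiku Worrier'). Simplified - the real config has more fields."""
    p = PERSONALITIES[personality_key]
    return {
        "model": model_id,                                    # e.g. an OpenRouter model id
        "player_name": f"{model_label} {p['name_part']}",
        "system_prompt": DEFAULT_PROMPT + "\n\n" + p["prompt_suffix"],
    }

# build_player("anthropic/claude-haiku-4.5", "Haiku", "worrier")  # model id illustrative
```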
How do the models know the rules of the game? Are they just supposed to use the MCP tools to figure it out? (Do they have to keep doing that from scratch?)
They were trained on the entire Internet, so they've basically picked up the rules by osmosis. They're fuzzy on specific cards and optimal strategy, but they pretty much know out-of-the-box how the game works, the same as if you went to ChatGPT and asked it a Magic rules question. I don't have any "comprehensive rules" MCP tools or explanation in the context or anything like that.
This is interesting - I will be contributing on GitHub, since this is a place where my knowledge and experience intersect and I enjoy doing open source work.
This is also something I think the MTG community needs in many ways. I have been a relatively happy XMage user, although it still has a way to go, and before that I was using GCCG, which was great too!
The MTG community overall can benefit a lot from the game having a more entertaining competitive landscape, which has grown stale in many ways; Wizards has done a poor job since the Hasbro acquisition of doing much besides shitting out product after product, too fast and with poor balance.
I have to imagine that Wizards is already running simulations, but they obviously aren't working well, or they are choosing to disregard them. Hopefully, if they are just bad at doing simulations, something like this can make it easier for them, and if not, it will at least improve the community's response time.
I was really hoping I could build this on top of MTGO or Arena, just as a bot interacting with real Wizards APIs and paying the developers money. But they've got very strong "absolutely no bots" terms of service, and my understanding is that outside of the special case of MTGO trading bots they're strongly enforced with bans. I assume their reasoning is that people do not want to get matched against bot players in tournaments, which is totally fair. (Also I'm not sure MTGO's infrastructure could handle the load of bot users...)
I ran a bot for years that I wrote in Java in a few minutes, and they never came after me. It just joined a match and played lands 24/7, and it won games every once in a while because people leave games randomly. It technically played all colors, and some of the trinkets count as spells, etc. This allowed me to never do any of their lootbox-like mechanics or other predatory practices.
Regarding actually doing it under the radar, there are a lot of ways. They are likely catching most of the players because those players create synthetic events using the Windows API and similar, which is also part of the same system used for CAPTCHAs that try to stop web scraping, like the kind that just ask for a button press.
This can be worked around by using a fake mouse driver that is actually controlled by software, if you must stay on Windows. It can be worked around by just running the client on Linux. It can also be worked around by using qemu as the client and using its native VNC, since those are hardware events too =)
Well, it's hard to do it under the radar if I'm posting it on HackerNews :) I've put enough money into MTGO (and, sigh, Arena) that I don't want to roll the dice on a ban.
That makes sense. I play Arena a bit, but have always rejected the monetization model of not letting players easily pick the cards they want, or play with proxies or something similar for casual games with friends. I have absolutely no interest in their competitive game modes. I was slightly interested in the idea in the early days of buying boosters and getting Arena codes, but they messed that up pretty badly, and paper Magic as a whole has been turned into a game of milking whales, similar to predatory mobile games or apps. The end result is that Arena is something I jump on to fool around with every few months and then remember why I don't want a second part-time job.
I love magic. Can these do politics or is it just board state?
I want them to do politics in Commander, and theoretically they should - the chat log is exposed in the MCP tools just like the rest of the game history, and their prompts tell them to use chat.
In practice they haven't really talked to each other, though. They've mostly just interpreted the prompts as "you should have a running monologue in chat". Not sure how much of this is issues with the harness vs the prompt, but I'm hoping to dig into it in the future.
Cool. How'd you pick decks?
For the 1v1 formats (Standard, Modern, Legacy) I'm basically just using the current metagame from MTGGoldfish. For Commander they get a random precon. At some point I might want a 1v1 "less complicated lines than Standard" format - the LLMs don't always understand the strategy of weird decks like Doomsday or Mill.
Why are all these Show HN posts overloaded with "I taught AI how to do things I used to do for entertainment"?
Can we automate the unpleasantries in life instead of the pleasures?
Game AIs are probably one of the most harmless and unambiguously good applications of technology. As I said in another message, I used to play competitive MtG and I would have loved to have a competent AI opponent. Imagine the possibilities: after a tournament you could get to review the games and figure out what you did wrong and improve, like you would do in chess or backgammon.
I get the complaint, but how is this something that removes the human element at all?
I think Show HN is far more overloaded with "I one-shotted an automation I find useful and then asked an LLM to explain why this is actually revolutionary".
Does an AI also playing your game somehow detract from the pleasure you derive from it? I find it entertaining both to play the games, and see how LLMs perform on them; I don't see how these are in any way mutually exclusive.