ML Decision Audit Pipeline
Built on Pokemon battle data as a structured ML audit exercise.
I built a Python machine learning pipeline that turns simulated Pokemon Showdown battle logs into structured decision data, trains a Random Forest model, and uses shadow testing plus guardrails to audit risky battle choices like unnecessary switches, repeated low-value moves, and no-progress loops.
The latest validation run evaluated 5,803 decision turns and compared 34,757 possible actions across them.
Battle logs to future RL and self-play
The pipeline starts with messy battle output, turns it into turn-level decisions, evaluates candidate actions, and then uses shadow review and guardrails as a bridge toward deeper self-play learning later.
Battle logs
Simulated battles created raw turn-by-turn behavior.
Structured turns
Each turn became rows showing the selected action and available choices.
Feature engineering
I added type matchups, damage estimates, switch risk, repetition penalties, and score gaps.
Model
A supervised Random Forest learned patterns connected to winning outcomes.
Shadow review
The model gave recommendations without controlling the battle.
Guardrails
The system used conservative checks to catch risky switches and repeated bad behavior.
Future RL / self-play
The longer-term goal is to learn more directly from rewards, outcomes, and self-play cycles.
A Pokemon battle is turn-based. On each turn, the player usually chooses between attacking, switching Pokemon, using a status or utility move, or recovering. The best choice depends on the current matchup, health, type effectiveness, available moves, and what the opponent might do next. This project turns those choices into structured data so a model can review which decisions looked risky, useful, or worth avoiding.
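As a rough illustration of what "structured data" can mean here, the sketch below expands one turn into one row per legal action and marks the action that was actually chosen. The field names (`battle_id`, `turn`, `action`, `kind`, `was_selected`) and the example moves are hypothetical, not the project's real schema.

```python
from dataclasses import dataclass, asdict


@dataclass
class CandidateRow:
    """One row per legal action on a turn (hypothetical schema)."""
    battle_id: str
    turn: int
    action: str
    kind: str  # "attack", "switch", "status", or "recover"
    was_selected: bool


def expand_turn(battle_id, turn, legal_actions, selected):
    """Turn one decision into one row per available choice."""
    return [
        CandidateRow(battle_id, turn, name, kind, name == selected)
        for name, kind in legal_actions
    ]


rows = expand_turn(
    battle_id="demo-001",
    turn=3,
    legal_actions=[
        ("Thunderbolt", "attack"),
        ("Thunder Wave", "status"),
        ("switch:Garchomp", "switch"),
    ],
    selected="Thunderbolt",
)
for row in rows:
    print(asdict(row))
```

Keeping the unchosen options next to the chosen one is what lets later steps ask whether a better option was available on that turn.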
What the model was really checking
Why Pokemon?
Pokemon gave me a clean set of rules, repeatable decisions, and enough complexity to make the machine learning side interesting. It was a fun theme, but the real point of the project was turning messy battle logs into usable data and learning how to evaluate decisions better.
A machine learning project with real decisions
I wanted a project that felt more interesting than a normal dataset. Pokemon battles are structured enough for analysis, but complicated enough to create real decision-making problems. The goal was not a finished competitive AI. It was to build a strong foundation: collect battle data, review decisions, expose weak logic, and test whether machine learning could improve turn quality.
Every turn is a scoring problem
On each turn, the program reviews the legal choices available in that battle state. If attacking is a real option, it also compares the actual attacks using type matchup, estimated damage, KO potential, switch risk, and repeated-move penalties. The model is learning to evaluate a specific decision in context, not just follow a simple rule.
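One way to picture that per-turn scoring is the minimal sketch below. It is not the project's scoring code: the weights (`50.0` for a possible KO, `10.0` for repeating a move), the field names, and the example moves are assumptions chosen only to show how the listed signals could combine into one number.

```python
def score_action(action, last_action=None):
    """Combine a few per-action signals into one score (illustrative weights)."""
    if action["kind"] == "switch":
        # A switch deals no damage this turn, so it only carries its risk cost.
        return -action["switch_risk"]
    score = action["base_damage"] * action["type_multiplier"]
    if action["can_ko"]:
        score += 50.0  # assumed bonus for a likely knockout
    if action["name"] == last_action:
        score -= 10.0  # assumed repeated-move penalty
    return score


candidates = [
    {"name": "Surf", "kind": "attack", "base_damage": 40.0,
     "type_multiplier": 2.0, "can_ko": False, "switch_risk": 0.0},
    {"name": "Ice Beam", "kind": "attack", "base_damage": 45.0,
     "type_multiplier": 0.5, "can_ko": False, "switch_risk": 0.0},
    {"name": "switch:Ferrothorn", "kind": "switch", "base_damage": 0.0,
     "type_multiplier": 1.0, "can_ko": False, "switch_risk": 30.0},
]

ranked = sorted(
    candidates,
    key=lambda a: score_action(a, last_action="Ice Beam"),
    reverse=True,
)
for action in ranked:
    print(action["name"], score_action(action, last_action="Ice Beam"))
```

In this toy state the super-effective attack ranks first, the resisted and repeated attack second, and the risky switch last.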
Behind the page, the project ran like a repeatable pipeline: generate sandbox battles, structure the turn data, train a model, evaluate the results, and keep the whole cycle easy to rerun.
================================================================================
Step 1: Sandbox Battles
================================================================================
Command: python scripts/run_chunked_random_team_sandbox_battles.py --total-battles 100 --chunk-size 50 --enable-live-switch-guard
Sandbox run started
Total battles requested: 100
Chunks planned: 2
Running chunk 1/2: battles 1-50
Running chunk 2/2: battles 51-100
Saved battle results
Saved turn-level decisions
Saved candidate action rankings
================================================================================
Step 8: Train Learning Examples Model
================================================================================
Training rows: 4,642
Test rows: 1,161
Random Forest test accuracy: 67.0%
ROC AUC: 0.74
================================================================================
Step 9: Shadow Evaluation
================================================================================
Decision turns evaluated: 5,803
Possible actions compared: 34,757
Bad-switch guard triggers: 14
No-progress loops detected: 49
Cycle complete.
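The chunk lines in that log follow simple arithmetic: 100 battles at 50 per chunk gives two chunks covering battles 1-50 and 51-100. The sketch below reproduces that split. `plan_chunks` is a hypothetical helper written for this illustration, not a function from the actual script.

```python
def plan_chunks(total_battles, chunk_size):
    """Return inclusive (first, last) battle numbers for each chunk."""
    if total_battles < 1 or chunk_size < 1:
        raise ValueError("total_battles and chunk_size must be positive")
    chunks = []
    for first in range(1, total_battles + 1, chunk_size):
        # The final chunk may be shorter than chunk_size.
        last = min(first + chunk_size - 1, total_battles)
        chunks.append((first, last))
    return chunks


chunks = plan_chunks(100, 50)
for i, (first, last) in enumerate(chunks, start=1):
    print(f"Running chunk {i}/{len(chunks)}: battles {first}-{last}")
```

Splitting a long run this way means a failure costs at most one chunk of work rather than the whole batch.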
The system was acting scared
The battle logic sometimes switched too aggressively, even when staying in and attacking was clearly better.
I added a model-assisted switch guard
I added audits and a conservative model-assisted switch guard so the model could flag bad switches without taking full control of the battle.
The guard found the exact mistakes I cared about
In the latest 100-battle validation run, the guard triggered 14 times. In every reviewed bad-switch case the model disagreed with the switch, yet the guard still stayed conservative.
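A conservative guard of this kind can be sketched as a two-condition check: the model must be reasonably confident in the best attack, and that attack must beat the switch by a clear margin. The function name, the probability inputs, and both thresholds below are assumptions for illustration, not the project's actual values.

```python
def should_block_switch(p_win_switch, p_win_best_attack,
                        min_margin=0.15, min_confidence=0.60):
    """Block a switch only when the best attack looks clearly better.

    Both inputs are model-estimated win probabilities in [0, 1].
    The thresholds are hypothetical.
    """
    margin = p_win_best_attack - p_win_switch
    return p_win_best_attack >= min_confidence and margin >= min_margin


# Model strongly prefers attacking: the guard intervenes.
print(should_block_switch(0.40, 0.70))
# Model prefers attacking, but only slightly: the guard stays out of the way.
print(should_block_switch(0.55, 0.62))
# Large margin, but low confidence in the attack itself: no intervention.
print(should_block_switch(0.30, 0.50))
```

Requiring both conditions is what keeps the guard conservative: the model can disagree with a switch without that disagreement automatically overriding it.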
Some battles got stuck in low-value loops
Some battles got stuck repeating low-value moves or passive actions.
I added audits and guardrails for no-progress behavior
I added loop detection plus repeated-action audits for status moves, hazards, recovery loops, and low-value repeated attacks.
The audits made loop behavior visible
The system detected 49 no-progress loops and made those failure cases visible for future tuning.
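Loop detection can be approximated with a sliding window over recent turns. The sketch below flags a turn when the last few turns all used the same action and dealt no damage. The record fields (`turn`, `action`, `damage_dealt`) and the window size of 3 are assumptions, and the real audits likely use richer signals.

```python
def detect_no_progress_loops(turns, window=3):
    """Return turn numbers that end a run of repeated, zero-damage actions."""
    flagged = []
    for i in range(window - 1, len(turns)):
        recent = turns[i - window + 1 : i + 1]
        same_action = len({t["action"] for t in recent}) == 1
        no_damage = all(t["damage_dealt"] == 0 for t in recent)
        if same_action and no_damage:
            flagged.append(turns[i]["turn"])
    return flagged


battle = [
    {"turn": 1, "action": "Recover", "damage_dealt": 0},
    {"turn": 2, "action": "Recover", "damage_dealt": 0},
    {"turn": 3, "action": "Recover", "damage_dealt": 0},
    {"turn": 4, "action": "Recover", "damage_dealt": 0},
    {"turn": 5, "action": "Earthquake", "damage_dealt": 35},
    {"turn": 6, "action": "Recover", "damage_dealt": 0},
]
print(detect_no_progress_loops(battle))
```

Here turns 3 and 4 are flagged, and the attack on turn 5 breaks the run, so turn 6 is not.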
A resisted move is not always the wrong move
At first, a resisted move looked like an obvious mistake. But the audits showed that was not always true. Sometimes every available attack was bad, or the alternative was not clearly better.
I separated true bad choices from forced ones
Instead of treating every bad-looking type matchup the same way, I separated true bad type choices from forced choices and misleading alternatives.
Most bad-looking type choices were not truly bad
Out of 330 reviewed bad-type choices, only 1 was labeled as a likely true bad-type mistake. Most were either forced or only looked bad because the alternative was not clearly better.
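The three-way split described above can be sketched as a small labeling function. The label names, the `multiplier` and `score` fields, and the `clear_margin` threshold are hypothetical. The point is only the order of the checks: first ask whether any non-resisted attack existed, then whether it was clearly better.

```python
def label_resisted_choice(chosen, alternatives, clear_margin=10.0):
    """Label a resisted attack as forced, misleading, or a likely mistake."""
    if chosen["multiplier"] >= 1.0:
        return "not_resisted"
    # Only alternatives with a neutral or better matchup can make the choice "bad".
    better_typed = [a for a in alternatives if a["multiplier"] >= 1.0]
    if not better_typed:
        return "forced"
    best_score = max(a["score"] for a in better_typed)
    if best_score - chosen["score"] < clear_margin:
        return "misleading_alternative"
    return "likely_true_mistake"


chosen = {"name": "Flamethrower", "multiplier": 0.5, "score": 20.0}

forced = label_resisted_choice(
    chosen, [{"name": "Fire Blast", "multiplier": 0.5, "score": 22.0}])
misleading = label_resisted_choice(
    chosen, [{"name": "Quick Attack", "multiplier": 1.0, "score": 24.0}])
mistake = label_resisted_choice(
    chosen, [{"name": "Thunderbolt", "multiplier": 2.0, "score": 60.0}])
print(forced, misleading, mistake)
```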
A second way to judge battle decisions
The current model is supervised. It learns from turn-level decisions connected to final battle outcomes and tries to spot which decision patterns were more likely to lead toward winning results.
The model was useful because it stayed in review mode first
The important choice was not handing the system full control right away. I used shadow mode first so the model could disagree with the current logic before it had any real influence on live choices.
The supervised model is not true reinforcement learning yet. It reviews candidate actions in shadow mode before any live decision logic changes. The next step is to add better reward signals, next-state tracking, and self-play so the system can learn more directly from the consequences of its actions.
Important ML lesson: the Random Forest reached about 99.78 percent train accuracy but only about 67.0 percent test accuracy. That gap was a useful reminder that a high training score does not mean much unless the model also works on unseen battles.
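The size of that gap is easy to compute from the reported numbers. The snippet below uses only figures from the validation run (99.78 and 67.0 percent accuracy, 4,642 training rows, 1,161 test rows). The helper name is made up for this example.

```python
def generalization_gap(train_acc, test_acc):
    """Difference between train and test accuracy, in percentage points."""
    return round(train_acc - test_acc, 2)


gap = generalization_gap(99.78, 67.0)
test_fraction = round(1161 / (4642 + 1161), 2)

print(f"Train/test gap: {gap} percentage points")
print(f"Share of rows held out for testing: {test_fraction}")
```

A gap of roughly 33 points on a held-out share of about 20 percent of the rows is the concrete sign of overfitting described above.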
I did not immediately let the model control the battle. I ran it in shadow mode. That means the normal system still made the actual choice, while the model quietly gave its own recommendation. Then I compared the two.
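In code, a shadow comparison amounts to logging both choices per turn and counting where they differ, without feeding the model's pick back into the battle. The sketch below assumes hypothetical per-turn records with `live_choice` and `shadow_choice` fields.

```python
def shadow_compare(turns):
    """Summarize how often the shadow recommendation differed from the live choice."""
    disagreements = [t for t in turns if t["live_choice"] != t["shadow_choice"]]
    rate = len(disagreements) / len(turns) if turns else 0.0
    return {
        "turns": len(turns),
        "disagreements": len(disagreements),
        "disagreement_rate": rate,
        "disagreement_turns": [t["turn"] for t in disagreements],
    }


log = [
    {"turn": 1, "live_choice": "Surf", "shadow_choice": "Surf"},
    {"turn": 2, "live_choice": "switch:Ferrothorn", "shadow_choice": "Surf"},
    {"turn": 3, "live_choice": "Ice Beam", "shadow_choice": "Ice Beam"},
    {"turn": 4, "live_choice": "Recover", "shadow_choice": "Recover"},
]
summary = shadow_compare(log)
print(summary)
```

The disagreement turns are the ones worth reviewing by hand, since they are exactly where the model would have acted differently.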
Part of the job was making the pipeline reliable enough to run repeatedly. During one run, shadow evaluation looked like it had stalled during a heavy prediction step. I fixed that bottleneck and tightened the defensive checks around the review process.
Messy data becomes useful only after structure
Cleaning and structuring the battle logs was just as important as the model itself. Until the turns, actions, and outcomes were readable, none of the downstream ML work mattered.
The custom signals mattered more than I expected
The strongest signals were not generic. They came from battle context like move matchup, damage estimates, switch risk, repetition penalties, and how the chosen move compared to the other legal options on that turn.
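The "compared to the other legal options" signal can be expressed as score-gap features. The sketch below is one possible formulation. The feature names and the example scores are assumptions rather than the project's actual feature set.

```python
def score_gap_features(scores, chosen):
    """Compare the chosen action's score with the other legal options on the turn."""
    best_action = max(scores, key=scores.get)
    others = [s for name, s in scores.items() if name != chosen]
    return {
        "gap_to_best": scores[best_action] - scores[chosen],
        "chosen_is_best": best_action == chosen,
        # Positive when the chosen action beats every other option.
        "margin_over_next": scores[chosen] - max(others) if others else 0.0,
    }


turn_scores = {"Surf": 80.0, "Ice Beam": 12.5, "switch:Ferrothorn": -30.0}
features = score_gap_features(turn_scores, chosen="Ice Beam")
print(features)
```

Features like these describe a choice relative to its alternatives, which a raw per-action score cannot do on its own.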
Models need a controlled trial before they get real influence
Shadow testing was useful because it let the model disagree with the current system before it was trusted to control anything.
Accuracy is not the whole story
The Random Forest reached about 67.0 percent test accuracy and a 0.74 ROC AUC, but the bigger lesson was seeing how easily a model can overfit: training accuracy reached about 99.78 percent on the same run.
The current version is the foundation: logs are parsed, decisions are structured, features are built, models are evaluated, and guardrails are tested. The next goal is to shift toward reinforcement-learning-style training where the agent learns from battle outcomes, immediate rewards, and self-play cycles, moving from reviewing decisions toward making better ones.
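To make the planned shift concrete, the sketch below shows the kind of record a reinforcement-learning setup typically stores per turn: state, action, immediate reward, next state, and a terminal flag. None of this exists in the current pipeline. The `Transition` fields and the reward formula (opponent HP removed minus own HP lost, plus a win or loss bonus) are assumptions for illustration only.

```python
from typing import NamedTuple, Optional


class Transition(NamedTuple):
    """One step of experience (hypothetical structure for future work)."""
    state: dict
    action: str
    reward: float
    next_state: dict
    done: bool


def immediate_reward(hp_before, hp_after, won: Optional[bool] = None):
    """HP fractions are in [0, 1]; `won` is set only on the final turn."""
    dealt = hp_before["opponent"] - hp_after["opponent"]
    taken = hp_before["self"] - hp_after["self"]
    reward = dealt - taken
    if won is True:
        reward += 1.0
    elif won is False:
        reward -= 1.0
    return reward


before = {"self": 1.0, "opponent": 1.0}
after = {"self": 0.75, "opponent": 0.5}
step = Transition(
    state=before,
    action="Surf",
    reward=immediate_reward(before, after),
    next_state=after,
    done=False,
)
print(step.reward)
```

Compared with the current setup, which ties every turn to the final battle outcome, a per-turn reward like this would give the agent feedback on the direct consequence of each action.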
Pokemon Showdown and Smogon
This project was built around Pokemon Showdown, the open-source Pokemon battle simulator that provided the battle environment and battle logs used for experimentation. Pokemon Showdown is maintained by Smogon.
The Python ML pipeline around those battles
My work focused on the Python machine learning pipeline around those logs: parsing battle data, structuring turn-level decisions, engineering features, training models, running shadow evaluation, and testing guardrails.
Project code is available on GitHub.