Project Goals
Win probability is one of the most fundamental metrics in sports analytics - the foundation upon which many advanced analyses are built. In September 2025, I developed a comprehensive win probability model for baseball that will serve as a critical component for many of my future projects, particularly my reinforcement learning-based bullpen management system. The model needed to be accurate, fast, and most importantly, free from data leakage issues that plague many amateur implementations.
The primary objective wasn’t just to calculate win probability, but to build a production-ready system that could evaluate the impact of strategic decisions in real-time. When should a manager pull their starter? How much does bringing in a specific reliever change the team’s chances? These questions require precise, contextual win probability calculations that account for the specific pitchers and batters involved.
Project Methods
Building a reliable win probability model requires careful attention to both the machine learning pipeline and the baseball domain knowledge. I implemented two complementary approaches: a Bayesian model for uncertainty quantification and an XGBoost model for fast, accurate predictions.
The Data Pipeline Challenge
The most critical challenge in this project wasn’t model selection or hyperparameter tuning - it was avoiding data leakage. Many win probability models inadvertently use future information when making predictions, leading to artificially inflated performance metrics that fail catastrophically in production.
The Leakage Problem
Consider this scenario: It’s the top of the 5th inning, score tied 2-2. To calculate win probability, we need to know the current game state. But here’s where things get tricky - when exactly do we capture that state?
Most naive implementations use the state after an at-bat completes. This introduces subtle but devastating leakage. If a player hits a home run, the post-at-bat state includes that run, and the model learns to associate certain at-bat features with scoring that already happened. The model appears brilliant in testing but fails miserably in production when it has to predict before knowing the at-bat outcome.
The Solution: Pre-At-Bat State Capture
I rebuilt the entire data pipeline to capture state at the first pitch of each at-bat:
def process_game_with_leak_free_timing(game_data):
    """Build one training example per at-bat using only pre-at-bat information."""
    states = []
    for at_bat_id, at_bat_data in game_data.groupby('at_bat_number'):
        # Get the FIRST pitch of the at-bat (the pre-at-bat state)
        first_pitch = at_bat_data.iloc[0]

        # Calculate running scores from COMPLETED at-bats only
        completed_at_bats = game_data[game_data['at_bat_number'] < at_bat_id]
        if len(completed_at_bats) == 0:
            current_score = (0, 0)  # Game start
        else:
            last_completed = completed_at_bats.iloc[-1]
            current_score = (last_completed['home_score'],
                             last_completed['away_score'])

        # Build the state vector using only information from the past
        state = build_state(first_pitch, current_score)
        states.append(state)

    return states
This seemingly simple change had profound effects. The model’s performance dropped from an unrealistic 0.95 ROC-AUC to a more honest 0.885 - still excellent, but grounded in reality.
Feature Engineering
The model uses 24 carefully selected features that capture the multifaceted nature of baseball game states (a sketch of how a captured state becomes a feature row follows the list):
Game Situation Features
- Inning and outs: Where we are in the game structure
- Score differential: The current competitive state
- Base runners: Runners on base and scoring position situations
- Leverage index: How critical is this moment?
Pitcher Performance Metrics
- Season stats: ERA, WHIP, FIP, K/9, BB/9
- Fatigue indicators: Pitch count, times through the order
- Recent performance: Rolling averages over last 10 batters faced
Batter Metrics
- Season stats: AVG, OBP, SLG, OPS
- Platoon splits: Performance vs. left/right pitchers
- Clutch metrics: Performance in high-leverage situations
Contextual Factors
- Save situation: Is this a save opportunity?
- High leverage: Critical game moments
- RISP: Runners in scoring position pressure
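To make the grouping concrete, here is a minimal sketch of how one captured game state might be flattened into a feature row. The field names are illustrative assumptions, not the model's actual schema:

def build_feature_row(state):
    # Hypothetical field names -- shown only to illustrate the feature groups
    return {
        # Game situation
        'inning': state['inning'],
        'outs': state['outs'],
        'score_diff': state['home_score'] - state['away_score'],
        'runners_on': state['runners_on'],
        'leverage_index': state['leverage_index'],
        # Pitcher performance and fatigue
        'pitcher_whip': state['pitcher_whip'],
        'pitch_count': state['pitch_count'],
        'times_through_order': state['times_through_order'],
        # Batter quality and platoon context
        'batter_ops': state['batter_ops'],
        'platoon_advantage': int(state['batter_hand'] != state['pitcher_hand']),
        # Contextual flags
        'is_save_situation': int(state['is_save_situation']),
        'risp': int(state['runners_in_scoring_position']),
    }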
The XGBoost Approach
After experimenting with various algorithms, XGBoost emerged as the clear winner for the primary model. Gradient boosting excels at capturing the complex, non-linear relationships inherent in baseball:
from xgboost import XGBClassifier

class FastWinProbabilityModel:
    def __init__(self):
        # Moderately deep trees, a low learning rate, and row/column
        # subsampling to limit overfitting on correlated game states
        self.model = XGBClassifier(
            n_estimators=200,
            max_depth=8,
            learning_rate=0.05,
            subsample=0.8,
            colsample_bytree=0.8,
            objective='binary:logistic',
            eval_metric='logloss'
        )
The model builds an ensemble of decision trees, each learning from the mistakes of its predecessors. This approach naturally handles interactions - for instance, a 3-run lead means something very different in the 2nd inning versus the 9th.
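A minimal usage sketch, assuming leak-free feature matrices X_train and X_test and binary home-win labels y_train have already been built (the variable names here are mine, not the project's):

# Hypothetical usage of the class above
wp_model = FastWinProbabilityModel()
wp_model.model.fit(X_train, y_train)

# Probability that the home team wins from each pre-at-bat state
win_probs = wp_model.model.predict_proba(X_test)[:, 1]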
Bayesian Alternative
I also implemented a Bayesian model using PyMC. From what I understand, the Bayesian approach is the industry standard for win probability, and it was my original plan when starting this project:
import pymc as pm

with pm.Model() as bayesian_model:
    # Hierarchical structure for pitcher effects (partial pooling)
    pitcher_variance = pm.HalfNormal('pitcher_variance', sigma=0.5)
    pitcher_effects = pm.Normal('pitcher_effects',
                                mu=0,
                                sigma=pitcher_variance,
                                shape=n_pitchers)

    # Main effects with informative priors
    beta = pm.Normal('beta', mu=prior_means, sigma=prior_sds)

    # Logistic regression on the game-state features
    logit_p = pm.math.dot(X, beta) + pitcher_effects[pitcher_ids]
    p = pm.Deterministic('p', pm.math.sigmoid(logit_p))

    # Likelihood: did the home team win?
    y_obs = pm.Bernoulli('y_obs', p=p, observed=y)
The Bayesian approach provides full posterior distributions, allowing us to say not just “65% win probability” but “65% with a standard deviation of 8%.” This uncertainty quantification is invaluable for risk-sensitive decision-making.
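Here is a sketch of how those uncertainty estimates might be pulled out of the fitted model, assuming the bayesian_model block above is in scope:

import pymc as pm

with bayesian_model:
    idata = pm.sample(1000, tune=1000, chains=2)  # draw posterior samples

# Posterior draws of each observation's win probability
p_samples = idata.posterior['p'].stack(sample=('chain', 'draw')).values

p_mean = p_samples.mean(axis=-1)  # e.g., 0.65
p_sd = p_samples.std(axis=-1)     # e.g., 0.08 -- the uncertainty estimate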
Addressing the Calibration Challenge
One persistent challenge in win probability modeling is calibration at the extremes. Real baseball games can produce seemingly impossible comebacks - the 2011 Cardinals’ Game 6 World Series comeback, down to their last strike twice, comes to mind. Yet machine learning models tend to be conservative, rarely predicting below 10% or above 90% probability.
Why Models Avoid Extremes
This conservatism stems from how models are trained. Loss functions like log loss severely punish overconfident wrong predictions: a wrong 99% prediction costs -ln(0.01) ≈ 4.6, while a wrong 85% prediction costs only -ln(0.15) ≈ 1.9. The model learns to hedge its bets.
Additionally, extreme situations are rare in training data. A 9th inning, 7-run deficit might occur only a handful of times in a season. Without sufficient examples, the model can’t learn that these situations truly have near-zero win probability.
Calibration Techniques
I explored two approaches to improve calibration:
Platt Scaling (Sigmoid Calibration)
The primary approach I used was Platt scaling, which fits a logistic regression to map predicted probabilities to actual probabilities:
from sklearn.linear_model import LogisticRegression
# Fit sigmoid calibration
platt_cal = LogisticRegression()
platt_cal.fit(val_predictions.reshape(-1, 1), val_outcomes)
# Apply to test set
calibrated_probs = platt_cal.predict_proba(test_predictions.reshape(-1, 1))[:, 1]
Sigmoid calibration provided the best overall performance, improving log loss from 0.411 to 0.357 (13% improvement) and reducing Brier score from 0.133 to 0.118. The calibrated model maintained excellent discrimination while providing much more reliable probability estimates.
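One way to sanity-check this kind of improvement is a reliability curve. A sketch using scikit-learn's calibration_curve, assuming held-out test_outcomes alongside the calibrated probabilities:

from sklearn.calibration import calibration_curve

# Bucket predictions into 10 bins and compare to observed win rates
prob_true, prob_pred = calibration_curve(test_outcomes, calibrated_probs, n_bins=10)

# A well-calibrated model keeps predicted and observed rates close in every bin
for pred, true in zip(prob_pred, prob_true):
    print(f"predicted {pred:.2f} -> observed {true:.2f}")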
Isotonic Regression Alternative
I also tested isotonic regression as a non-parametric alternative:
from sklearn.isotonic import IsotonicRegression

# Fit a monotone, piecewise-constant map from raw to calibrated probabilities
iso_cal = IsotonicRegression(out_of_bounds='clip')
iso_cal.fit(val_predictions, val_outcomes)
calibrated_probs = iso_cal.transform(test_predictions)
However, sigmoid calibration proved superior for this dataset, achieving better log loss (0.357 vs 0.378) while providing reliable probability estimates across the full range of game situations.
Performance and Validation
The model achieved strong performance metrics on held-out 2024 test data:
Classification Metrics
- ROC-AUC: 0.915 (XGBoost), 0.617 (Bayesian)
- Log Loss: 0.411 (XGBoost), 0.657 (Bayesian)
- Brier Score: 0.133 (XGBoost), 0.233 (Bayesian)
The XGBoost model clearly dominates in raw predictive performance, while the Bayesian model provides valuable uncertainty estimates.
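For reference, all three metrics come straight from scikit-learn. A sketch with hypothetical variable names (y_test for the 2024 outcomes, probs for the model's predictions):

from sklearn.metrics import roc_auc_score, log_loss, brier_score_loss

print("ROC-AUC:    ", roc_auc_score(y_test, probs))      # discrimination
print("Log loss:   ", log_loss(y_test, probs))           # probabilistic accuracy
print("Brier score:", brier_score_loss(y_test, probs))   # mean squared error of probs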
Calibration Quality
After sigmoid calibration:
- Final log loss: 0.357 (13% improvement from 0.411)
- Final Brier score: 0.118 (11% improvement from 0.133)
- ROC-AUC maintained: 0.915 (discrimination preserved)
- Reliable probability estimates across all game situations
Extreme Situation Handling
The model correctly identifies extreme situations:
- 9th inning, 5+ run leads: 94.3% average predicted win probability
- 9th inning, down 3+ runs: 8.7% average predicted win probability
- Early game tied situations: 49-51% predictions (appropriate uncertainty)
Data Quality Validation
Critically, the leak-free pipeline was validated (see the sketch after this list):
- Early inning correlations with outcome: 0.15-0.25 (expected: low)
- Late inning correlations with outcome: 0.45-0.55 (expected: high)
- No evidence of future information leakage
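A sketch of that correlation check, assuming a DataFrame with one row per captured state and illustrative column names:

import pandas as pd

# df columns (hypothetical): 'inning', 'pred_wp', 'home_win'
def correlation_by_inning(df: pd.DataFrame) -> pd.Series:
    # Correlation between predicted win probability and the final outcome,
    # per inning; it should rise steadily if no future information leaks in
    return df.groupby('inning').apply(lambda g: g['pred_wp'].corr(g['home_win']))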
Lessons Learned
This project reinforced several critical principles in sports analytics:
- Data leakage is insidious: The initial "amazing" results were too good to be true. Always be suspicious of exceptional performance and audit your pipeline thoroughly.
- Domain knowledge matters: Understanding baseball helped identify which features matter and which model behaviors were unrealistic.
- Simple models can excel: XGBoost outperformed complex neural networks and sophisticated Bayesian hierarchical models.
- Calibration requires attention: Raw model outputs often need post-processing to match real-world probabilities.
- Validation on future data is essential: Testing on 2024 data (not available during 2023 training) provided an honest assessment.
Future Enhancements
While the current model performs well, several enhancements could push it further:
1. Pitcher-Specific Modeling
Instead of using aggregate stats, model individual pitcher tendencies:
- Velocity trends within games
- Performance under pressure
- Pitch arsenal effectiveness by count
2. Momentum and Clutch Factors
Baseball has psychological elements the current model ignores:
- Recent scoring runs
- Clutch performance history
- Home field advantage in late innings
3. Weather and Park Effects
Environmental factors affect win probability:
- Wind effects at Wrigley Field
- Altitude at Coors Field
- Temperature impacts on ball flight
4. Real-Time Updates
Implement online learning to update the model during the season:
- Adjust for injured players
- Capture hot/cold streaks
- Respond to roster changes
5. Uncertainty-Aware RL Integration
Use the Bayesian model’s uncertainty estimates to make risk-adjusted decisions in the bullpen management system.
Conclusion
Building a production-ready win probability model required more than just applying XGBoost to baseball data. The critical innovation was identifying and eliminating data leakage through careful pre-at-bat state capture.
The model now serves as the foundation for multiple downstream projects. My reinforcement learning bullpen management system (post to come soon) uses it to evaluate every potential pitcher change. Future projects will leverage it for in-game strategy optimization, player valuation adjustments, and even fan engagement applications.
Most importantly, this project demonstrated that rigorous attention to data quality and proper validation trumps algorithmic complexity. A simple XGBoost model with clean data outperforms sophisticated approaches built on flawed foundations. In sports analytics, as in many domains, getting the fundamentals right is the key to building systems that actually work when it matters - during the game, when decisions have real consequences.
The win probability calculator is more than just another model - it’s a fundamental building block that enables a new generation of baseball analytics tools. By providing accurate, contextual probability estimates, it transforms vague intuitions about game situations into quantifiable insights that can drive better decisions. This foundation will support many exciting projects to come.