Built a WC 2026 match prediction + betting EV
West_Ganache3649🌱Rookie
2026-06-10 23:44:27 · 👁 100 · 💬 0
I've been building a football match prediction system around the 2026 World Cup and wanted to share it and get some outside eyes on it.

What it does:

- Pulls historical WC data (2006–2022) + live betting odds via API-Football
- Engineers features: ELO ratings, rolling form (last 5 matches), StatsBomb xG, FIFA rankings, fixture-level team stats
- Trains an XGBoost classifier to output Win/Draw/Loss probabilities
- Secondary models for BTTS, Over/Under 2.5 goals, and corners
- Filters daily bets by Expected Value (EV > 0), 100 MXN flat stake budget
- Streamlit dashboard for predictions, historical results, and ROI tracking
- Monte Carlo tournament simulator: runs 10,000 full WC 2026 bracket simulations and spits out champion/finalist probabilities (current prediction: Spain vs Argentina final, Spain wins)

What I'm happy with: The EV-based bet filtering feels solid conceptually, and the XGBoost model beats the logistic regression baseline on log loss.

Where I feel uncertain / would love input:

1. Feature engineering: I'm using ELO + rolling form + xG, but I suspect I'm leaving value on the table. What features have you found most predictive for international football specifically?
2. Data sources: API-Football is good but expensive at scale. Are there free or cheaper alternatives with decent historical depth? I've seen [Football-Data.org](http://Football-Data.org) mentioned, but coverage seems limited.
3. Model calibration: Probabilities look reasonable, but I haven't done a proper calibration curve check yet. Any go-to methods for calibrating XGBoost outputs on small datasets?
4. Tournament simulator: Right now it uses a Poisson goal model scaled by ELO difference. Would a Dixon-Coles correction be worth adding given the small sample of WC matches?
5. Overfitting risk: WC data is inherently small (64 matches per tournament, 5 tournaments, so about 320 matches in total). I'm using cross-validation but still worried. Any suggestions for regularization or data augmentation
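
To make the EV filter concrete, here is a minimal sketch of the idea. It assumes decimal odds and reads the 100 MXN as a flat stake per bet. The `Bet` class and the function names are illustrative only, not the project's actual code.

```python
from dataclasses import dataclass


@dataclass
class Bet:
    match: str
    outcome: str         # e.g. "home", "draw", "away"
    model_prob: float    # model's probability for this outcome
    decimal_odds: float  # bookmaker decimal odds (assumed format)


def expected_value(model_prob: float, decimal_odds: float, stake: float = 100.0) -> float:
    """Expected profit of one bet: win (odds - 1) * stake with prob p, lose stake otherwise."""
    return stake * (model_prob * decimal_odds - 1.0)


def filter_positive_ev(bets, stake: float = 100.0, min_ev: float = 0.0):
    """Keep bets whose expected profit exceeds min_ev, best first."""
    scored = [(expected_value(b.model_prob, b.decimal_odds, stake), b) for b in bets]
    kept = [(ev, b) for ev, b in scored if ev > min_ev]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)
```

One thing this makes visible: the filter is only as good as `model_prob`, so a positive EV from a miscalibrated model is not a real edge. Raising `min_ev` above zero is one way to leave a margin for model error.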
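
For the calibration check in question 3, a rough sketch of a reliability table and expected calibration error (ECE) is below. It works one-vs-rest, for example on "home win" probabilities against 0/1 outcomes. The function names and the default of 5 bins are assumptions, chosen small because the dataset is small.

```python
def reliability_bins(probs, outcomes, n_bins=5):
    """Group predictions into equal-width probability bins.

    Returns a list of (mean predicted prob, observed frequency, count)
    for every non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    table = []
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(y for _, y in members) / len(members)
        table.append((mean_p, freq, len(members)))
    return table


def expected_calibration_error(probs, outcomes, n_bins=5):
    """Count-weighted average gap between predicted and observed frequency."""
    n = len(probs)
    return sum(
        count / n * abs(mean_p - freq)
        for mean_p, freq, count in reliability_bins(probs, outcomes, n_bins)
    )
```

For this to mean anything, the probabilities should come from held-out folds rather than training data. With only a few hundred matches each bin will be noisy, so the table is better read as a rough sanity check than as a precise measurement.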
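
For reference on question 4, the simulator's approach (Poisson goals scaled by ELO difference, repeated over many brackets) can be sketched as follows. The constants `BASE_GOALS` and `ELO_SCALE`, the coin-flip shoot-out, and the single-elimination bracket without a group stage are all placeholder assumptions, not fitted values or the project's real setup.

```python
import math
import random
from collections import Counter

BASE_GOALS = 1.35   # assumed average goals per team per match (placeholder)
ELO_SCALE = 0.0015  # assumed sensitivity of scoring rate to ELO difference (placeholder)


def expected_goals(elo_a, elo_b):
    """Scoring rates for both teams, scaled by ELO difference."""
    diff = elo_a - elo_b
    return BASE_GOALS * math.exp(ELO_SCALE * diff), BASE_GOALS * math.exp(-ELO_SCALE * diff)


def sample_poisson(lam, rng):
    """Knuth's method: multiply uniforms until the product drops below exp(-lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def knockout_winner(team_a, team_b, elo, rng):
    lam_a, lam_b = expected_goals(elo[team_a], elo[team_b])
    goals_a, goals_b = sample_poisson(lam_a, rng), sample_poisson(lam_b, rng)
    if goals_a != goals_b:
        return team_a if goals_a > goals_b else team_b
    # Level after normal time: treat the shoot-out as a coin flip (assumption).
    return team_a if rng.random() < 0.5 else team_b


def simulate_bracket(teams, elo, rng):
    """Play a single-elimination bracket; len(teams) must be a power of two."""
    remaining = list(teams)
    while len(remaining) > 1:
        remaining = [
            knockout_winner(remaining[i], remaining[i + 1], elo, rng)
            for i in range(0, len(remaining), 2)
        ]
    return remaining[0]


def champion_probabilities(teams, elo, n_sims=10_000, seed=0):
    rng = random.Random(seed)
    counts = Counter(simulate_bracket(teams, elo, rng) for _ in range(n_sims))
    return {team: counts[team] / n_sims for team in teams}
```

The two goal counts are drawn independently here. A Dixon-Coles correction would adjust the joint probability of the low scores (0-0, 1-0, 0-1, 1-1), which mostly affects draw frequency, so it would change `knockout_winner` rather than the bracket loop.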