Forecasting System for Counter-Strike 2
- Executive Summary
- Why CS2 Prediction Is Hard
- System Architecture
- Data Foundation & Parser-First Governance
- Rating Methodology — CQE, CQR & Team Elo
- Forecasting Methodology — the GESUS Family
- Validation & Honest Results — match outcome, map-level, in-round
- Shadow-Betting Environment & Audit Trail
- Risk Management & Failure Modes
- Roadmap & Conclusion
CounterQuant is a Counter-Strike 2 intelligence platform built on a single foundation: every statistic we publish is derived from demo files we parse ourselves, never from scraped third-party numbers. That one dataset feeds three layers — proprietary player and team ratings (CQE, CQR, Team Elo), a family of match-outcome forecasting models (the GESUS suite), and a fully public paper shadow-betting ledger that records exactly how those forecasts would have performed against real markets.
The forecasting layer is a regime-aware ensemble. For each upcoming match a single model is selected: NEXUS when a liquid betting market exists for that match (it blends our model with the live market price), otherwise ORACLE, our market-independent base model. Every prediction is written once, frozen, timestamped, and attributed to the exact model and version that made it — so the number you see never silently changes.
Counter-Strike is a high-variance game decided in short series. A best-of-one is close to a coin-flip between evenly matched sides; even a best-of-three can swing on a handful of clutch rounds, an eco that hits, or one player's off-day. Published esports-prediction work tends to report accuracy on cherry-picked top-tier datasets and quietly omits the long tail of unpredictable lower-tier matches. We refuse to do that — we predict all three tiers and report each separately.
Most CS2 analytics sites display numbers scraped from match pages. Those numbers are inconsistent across sources, miss per-round context, and cannot be audited. CounterQuant instead downloads and parses the actual demo files, reconstructing every kill, every round of damage, every economy state and every side-switch from the ground truth of the game server. This is the hard, slow path — and it is the moat. The metrics that drive our ratings and models simply cannot be reproduced by anyone who has not built the parsing pipeline.
CS2 lineups change constantly, maps rotate in and out of the active pool, and balance patches alter how the game is played. A model that treats a team as a fixed entity will misfire the moment a star player is benched. Our features are built point-in-time: every match is scored using only what was known before it started, with recency-weighted form and roster-aware aggregation, so the system tracks teams as they actually are on the day — not as they were a season ago.
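As a rough illustration of the point-in-time idea, the sketch below computes a recency-weighted win rate from only the matches played strictly before the match being scored. The function name, the exponential decay, and the 60-day half-life are assumptions chosen for illustration and are not the production feature definitions; roster-aware aggregation is left out for brevity.

```python
from datetime import date
from math import exp, log

def recency_weighted_win_rate(history, as_of, half_life_days=60.0):
    """Win rate over matches played strictly before `as_of`, weighted by recency.

    history: iterable of (match_date, won) tuples for one team.
    half_life_days: hypothetical decay constant, not a published value.
    """
    decay = log(2) / half_life_days
    weighted_wins = 0.0
    total_weight = 0.0
    for played_on, won in history:
        if played_on >= as_of:
            continue  # point-in-time: ignore the scored match and anything later
        weight = exp(-decay * (as_of - played_on).days)
        weighted_wins += weight * (1.0 if won else 0.0)
        total_weight += weight
    # No prior history means no feature value rather than a guess.
    return weighted_wins / total_weight if total_weight > 0 else None

history = [
    (date(2026, 1, 1), True),    # older win, low weight
    (date(2026, 3, 1), False),   # recent loss, high weight
    (date(2026, 3, 15), True),   # the match being scored: must be ignored
]
form = recency_weighted_win_rate(history, as_of=date(2026, 3, 15))
```

Because the recent loss carries more weight than the older win, `form` lands below 0.5, and the result of the match being scored never enters the calculation.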
CS2 demos are parsed by a cloud worker fleet (10+ concurrent instances) running demoparser2 under a pinned parser version (v4). Only demos that pass validation are written to the gold tables; partial or corrupt parses are rejected rather than silently half-ingested. Versioning the parser means we always know exactly which code produced any given row. Historical CS:GO demos are processed on a separate dedicated server with its own parser version to prevent cross-game contamination of the CS2 gold tables.
Two consumers read the gold tables. The rating engines (Section 5) produce the player and team ratings shown across the site. The feature engine turns the same gold rows into the 509+ point-in-time signals that the GESUS models (Section 6) consume to forecast match outcomes. Both read the identical canonical source, so a player's rating and the model's view of them never disagree about the underlying facts.
All analytics derive from demos we parse ourselves; match metadata is scraped for scheduling only. Live coverage (estimated row counts, refreshed every 10 minutes):
| Gold dataset | What it records | Approx. rows |
|---|---|---|
| demo_kill_events | Every kill: attacker, victim, weapon, round, side | 16,633,076 |
| demo_damage_events | Per-round damage exchanges | 63,453,184 |
| demo_player_rounds | Per-player, per-round state (the basis of every player metric) | 24,589,028 |
| demo_rounds | Per-round outcome, economy, and side | 2,313,333 |
| matches | Match schedule, teams, tier, result (2006-05-18 → 2026-07-01) | 119,957 |
Matches with at least one fully-parsed demo driving their stats: 48,468 · teams tracked: 11,083 · players tracked: 1,380.
A strict data-isolation rule governs everything downstream: only validated, v4-parsed demo data may drive any player or match metric, rating, or model. Scraped third-party numbers (such as external rating figures) are physically separated — stored under an external_ namespace — and are never allowed to feed a rating, an achievement, or a forecast. This guarantees that no model can accidentally learn from, or be contaminated by, numbers we did not compute from ground truth.
The feature engine is the most heavily protected part of the system. We publish the categories and approximate counts, not the specific transformations. The 509+ point-in-time features span six broad families:
Real demo data is never perfectly clean. Controls include: a delete-before-insert guarantee on kill events (the gold tables carry no silent duplicates); per-round side resolution so a player's team is read from the side they actually played each round, correctly handling halftime and overtime swaps; point-in-time Elo snapshots taken before each match; and tolerance for known parser edge-cases (e.g. warmup-round artefacts) rather than letting them corrupt aggregates. Settlement of bets resolves the winner by team identity and name, which absorbs the occasional duplicate team-ID emitted upstream.
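The delete-before-insert guarantee can be sketched with an in-memory SQLite table. Only the table name `demo_kill_events` comes from the text above; the column names, the `demo_id` key, and the helper `replace_demo_kills` are hypothetical, and the production store is not necessarily SQLite.

```python
import sqlite3

def replace_demo_kills(conn, demo_id, kills):
    """Load one demo's kill events idempotently: delete old rows, then insert.

    Both statements run in a single transaction, so a failed insert rolls the
    delete back and the table never holds a half-loaded demo.
    """
    with conn:  # commits on success, rolls back on exception
        conn.execute("DELETE FROM demo_kill_events WHERE demo_id = ?", (demo_id,))
        conn.executemany(
            "INSERT INTO demo_kill_events "
            "(demo_id, round_no, attacker, victim, weapon) VALUES (?, ?, ?, ?, ?)",
            [(demo_id, *kill) for kill in kills],
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_kill_events "
    "(demo_id TEXT, round_no INTEGER, attacker TEXT, victim TEXT, weapon TEXT)"
)
kills = [(1, "p1", "p2", "ak47"), (1, "p3", "p1", "awp")]
replace_demo_kills(conn, "demo-1", kills)
replace_demo_kills(conn, "demo-1", kills)  # re-parse of the same demo
row_count = conn.execute("SELECT COUNT(*) FROM demo_kill_events").fetchone()[0]
```

Re-running the load for the same demo leaves the row count unchanged, which is the property that keeps duplicate kills out of downstream aggregates.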
CounterQuant publishes three proprietary ratings. We describe their design principles here; the exact weightings and constants are vault-tier (see the tiered-disclosure model).
CQE is an opponent-adjusted player rating computed only from validated demo data. It combines round-survival, multi-kill and trade impact, per-round damage, and consistency into a single number, then applies Bayesian shrinkage so that a player with a small sample is pulled toward a sensible prior rather than rocketing to the top off a handful of pug rounds. The headline failure we engineered against is a low-sample player outranking a proven elite on the strength of one hot series.
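The effect of shrinkage can be shown with the simplest possible form, a weighted average of the observed rating and a prior. This is a minimal sketch: the prior mean, the prior strength of 500 rounds, and the function name are illustrative assumptions, and the actual CQE weightings are not published.

```python
def shrink_rating(raw_rating, n_rounds, prior_mean=1.0, prior_strength=500.0):
    """Pull a raw rating toward the prior in proportion to how little data backs it.

    prior_strength acts like a number of "pseudo-rounds" played at prior_mean.
    Both constants are hypothetical placeholders.
    """
    return (n_rounds * raw_rating + prior_strength * prior_mean) / (
        n_rounds + prior_strength
    )

hot_streak = shrink_rating(raw_rating=1.60, n_rounds=60)       # one hot series
proven_elite = shrink_rating(raw_rating=1.25, n_rounds=20000)  # long track record
```

With these inputs the 60-round player is pulled almost all the way back to the prior, while the 20,000-round player keeps nearly all of the raw rating, so the established player ranks higher.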
CQR turns CQE into a ranked, percentile-tiered leaderboard, but only for players who clear an eligibility gate (a verified competitive identity plus minimum match and round volume). The gate is the clean discriminator that keeps mangled or one-off accounts off the board, so the leaderboard reflects real established players.
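A minimal sketch of the gate-then-rank step follows. The field names, the thresholds (`min_matches`, `min_rounds`), and the percentile convention are assumptions for illustration; the real eligibility constants are not disclosed.

```python
def build_leaderboard(players, min_matches=30, min_rounds=600):
    """Rank eligible players by CQE and attach a percentile (100 = top of board).

    The thresholds are hypothetical; only the shape of the gate follows the text:
    verified identity plus minimum match and round volume.
    """
    eligible = [
        p for p in players
        if p["verified"] and p["matches"] >= min_matches and p["rounds"] >= min_rounds
    ]
    ranked = sorted(eligible, key=lambda p: p["cqe"], reverse=True)
    n = len(ranked)
    return [
        {"name": p["name"], "rank": i + 1, "percentile": 100.0 * (n - i) / n}
        for i, p in enumerate(ranked)
    ]

players = [
    {"name": "veteran", "cqe": 1.18, "verified": True, "matches": 240, "rounds": 5200},
    {"name": "regular", "cqe": 1.02, "verified": True, "matches": 90, "rounds": 2000},
    {"name": "one_off", "cqe": 1.45, "verified": False, "matches": 2, "rounds": 44},
]
board = build_leaderboard(players)
```

The unverified two-match account has the highest raw number but never reaches the board, because the gate is applied before any ranking happens.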
Team Elo is a results-driven, opponent-adjusted rating with a tier-weighted update — beating a top team at a LAN moves your rating more than beating a weaker side online. It is reconstructed point-in-time from match history, and a lobby-strength context term scales a team's effective rating by the quality of the field it is competing in. Team Elo is also the backbone strength feature consumed by the forecasting models.
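For readers unfamiliar with Elo, the sketch below shows a standard Elo update with a K-factor that depends on event tier and LAN status. The K values, the LAN multiplier, and the function names are hypothetical, and the lobby-strength context term is omitted; the published Team Elo constants are vault-tier.

```python
TIER_K = {1: 40.0, 2: 28.0, 3: 20.0}  # hypothetical K-factors per event tier
LAN_MULTIPLIER = 1.25                 # hypothetical boost for LAN results

def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a, rating_b, a_won, tier, is_lan=False):
    """Return the post-match ratings (new_a, new_b); the update is zero-sum."""
    k = TIER_K[tier] * (LAN_MULTIPLIER if is_lan else 1.0)
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# Upset win over a stronger team at a Tier-1 LAN...
big_a, big_b = update_elo(1500.0, 1700.0, a_won=True, tier=1, is_lan=True)
# ...versus an expected win over a weaker side in a Tier-3 online match.
small_a, small_b = update_elo(1500.0, 1300.0, a_won=True, tier=3)
```

The first result moves the winner's rating far more than the second, both because the opponent was stronger (lower expected score) and because the tier-weighted K is larger.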
All GESUS forecasting models are built on gradient-boosted tree ensembles with a post-hoc probability calibration stage. This family was chosen over deep learning for reasons specific to the problem: it handles the heavy feature correlation inherent in our 509+ signals natively; it is sample-efficient given a finite history of professional matches; and its feature contributions are interpretable, which is what makes it possible to surface the top-3 drivers behind each prediction.
The family splits into three groups by task:
| Model | Task | Role |
|---|---|---|
| MATCH OUTCOME — pre-match win probability | ||
| ORACLE | Full feature set incl. clean economy + Team Elo | Market-independent base — the everyday forecaster |
| NEXUS | ORACLE ⊕ live market price (log-odds blend) | Selected when a liquid betting market exists |
| APEX | Earlier-generation baseline (290 features) | Deployed baseline; retained for provenance comparison |
| MAP SPECIALISTS — per-map outcome prediction | ||
| MIRAGE | Per-map win probability (7 independent map models) | Feeds veto predictor + map-level intelligence panels |
| IN-ROUND SPECIALISTS — within-match signal extraction | ||
| CLUTCH | Clutch-situation outcome (kill sequence replay) | Player clutch rating signal; feeds achievement engine |
| SPECTER | Round kill/death sequence classification | Per-round performance signal; economy context |
| CIPHER | Player performance regression (CQE prediction) | Roster strength signal for match-outcome models |
For each upcoming match, exactly one model makes the call — we never stack several models onto the same match. If a liquid betting market exists for the match, NEXUS is selected: it blends ORACLE's view with the live market price in log-odds space, weighted by market liquidity, because a sharp market is itself a strong predictor. Otherwise ORACLE makes the market-independent call. The chosen model name and version are stored on the immutable prediction, so you always know who made it.
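The selection rule and the log-odds blend can be sketched as follows. How liquidity maps to a blend weight is not published, so the `liquidity` score in [0, 1], the `min_liquidity` threshold, and the linear weighting are all assumptions; the market probability is assumed to already have the bookmaker margin removed.

```python
from math import exp, log

def logit(p):
    return log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def blend_log_odds(model_p, market_p, market_weight):
    """Weighted average of two probabilities in log-odds space."""
    z = (1.0 - market_weight) * logit(model_p) + market_weight * logit(market_p)
    return sigmoid(z)

def forecast(oracle_p, market_p=None, liquidity=0.0, min_liquidity=0.2):
    """Pick exactly one model for the match and return (model_name, probability).

    liquidity and min_liquidity are hypothetical stand-ins for whatever
    liquidity test the production system applies.
    """
    if market_p is None or liquidity < min_liquidity:
        return "ORACLE", oracle_p
    return "NEXUS", blend_log_odds(oracle_p, market_p, min(liquidity, 1.0))

no_market = forecast(oracle_p=0.60)
with_market = forecast(oracle_p=0.60, market_p=0.70, liquidity=0.5)
```

Blending in log-odds space rather than averaging raw probabilities keeps the result well behaved near 0 and 1, and a weight of 1.0 reduces the blend to the market price itself.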
Models are trained on a strict chronological train/validation/test split — no shuffling, no future information. Features are generated by the same point-in-time walk used at serving time, giving zero train/serve skew: the feature vector a match receives in production is identical to the one it would have received in training. Calibration is fit on validation data only; the test period is held out until final evaluation.
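A chronological split is simple to express: matches are assigned to a fold purely by start date, never by random sampling. The cutoff dates and field names below are illustrative assumptions, not the boundaries used in training.

```python
from datetime import date

def chronological_split(matches, train_end, val_end):
    """Assign each match to train / validation / test by start date only.

    Everything in train precedes everything in validation, which in turn
    precedes everything in test, so no fold can see its own future.
    """
    train = [m for m in matches if m["start"] < train_end]
    val = [m for m in matches if train_end <= m["start"] < val_end]
    test = [m for m in matches if m["start"] >= val_end]
    return train, val, test

matches = [
    {"id": "m3", "start": date(2025, 11, 2)},
    {"id": "m1", "start": date(2024, 6, 10)},
    {"id": "m4", "start": date(2026, 2, 20)},
    {"id": "m2", "start": date(2025, 3, 5)},
]
train, val, test = chronological_split(
    matches, train_end=date(2025, 7, 1), val_end=date(2026, 1, 1)
)
```

Calibration would then be fit on `val` only, with `test` left untouched until the final evaluation.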
With 509+ features and a finite match history, controlling overfitting is the central concern. The ensembles use conservative tree depth, minimum-leaf population floors, row and column subsampling, and early stopping on a held-out validation set — training halts when validation loss stops improving, and the selected iteration count sits well below the budget, confirming the regularisation is binding rather than cosmetic.
| Segment | Held-out test AUC | Reading |
|---|---|---|
| Tier 1 | 0.70–0.72 | The genuine, usable signal — top-tier matches are the most predictable |
| Tier 2 | 0.64–0.67 | Moderate signal; smaller team-history samples |
| Tier 3 | 0.63–0.66 | Noisy by nature; bet sizing is capped hardest here |
Probabilities are calibrated with isotonic regression fit on the validation fold. The deployed baseline (APEX) achieved a calibrated test AUC of 0.702 across all tiers. ORACLE extends that baseline with full economy and Team Elo history signals.
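Isotonic regression fits a non-decreasing step function from raw scores to observed win rates. The sketch below is a small pool-adjacent-violators implementation for illustration only; the function names are hypothetical, it returns a step function rather than an interpolated one, and a production system would more likely rely on a library implementation.

```python
from bisect import bisect_left

def fit_isotonic(scores, outcomes):
    """Pool-adjacent-violators fit on validation data.

    Returns (uppers, values): block i covers scores up to uppers[i] and maps
    them to the calibrated probability values[i].
    """
    blocks = []  # each block is [sum_of_outcomes, count, highest_score_in_block]
    for score, outcome in sorted(zip(scores, outcomes)):
        blocks.append([float(outcome), 1, score])
        # Merge backwards while the previous block's mean exceeds the newest one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [b[2] for b in blocks], [b[0] / b[1] for b in blocks]

def calibrate(score, uppers, values):
    """Map a raw score to its calibrated probability (clamped at both ends)."""
    return values[min(bisect_left(uppers, score), len(values) - 1)]

val_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
val_outcomes = [0, 1, 0, 0, 1, 1]
uppers, values = fit_isotonic(val_scores, val_outcomes)
calibrated = calibrate(0.3, uppers, values)
```

In this toy data the scores 0.2 to 0.4 are pooled into one block with a win rate of one in three, which is what a raw score of 0.3 is mapped to.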
The MIRAGE family runs 7 independent map-specialist models. Because each map is a different tactical game, a single model trained across all maps loses resolution. Each specialist uses the same feature categories but with map-specific CT/T side win rates, economy conversion, and player lineup signals.
| Map | Best version AUC | Note |
|---|---|---|
| de_nuke | 0.641 | CT-side dominance and side balance are strong predictors |
| de_inferno | 0.632 | Economy conversion rate dominant |
| de_dust2 | 0.635 | High data volume; stable estimates |
| de_ancient | 0.611 | Newer map — growing training set |
| de_overpass | 0.613 | Limited data (≈1,400 train rows); lineup signals help |
| de_mirage | 0.608 | Economy and CT/T balance dominant |
| de_anubis | 0.591 | Newest map — smallest training set |
These models operate at round level, not match level. Their outputs feed player rating and achievement calculations — they are not used directly for pre-match win probability.
| Model | Task | AUC (R² where noted) |
|---|---|---|
| CLUTCH | Clutch-situation win prediction (kill sequence replay) | 0.877 |
| SPECTER | Round kill/death sequence outcome | 0.904 |
| CIPHER | Player performance regression (R²) | 0.307 (R²) |
CLUTCH and SPECTER handle different, narrower prediction tasks than full match outcome — their higher AUCs reflect that. Direct comparison to match-outcome AUC would be misleading.
Why ~0.70 is the honest ceiling for match outcome. Pre-match CS2 outcome prediction has a real upper bound around 0.70–0.72 AUC at the top tier — the residual is genuine in-game variance that no pre-match feature can resolve. Claims meaningfully above this on realistic, leakage-free, all-tier data should be treated with suspicion. We would rather publish a credible 0.70 than an inflated number that quietly leaks the future.
Calibration is not the same as AUC. AUC measures ranking; calibration measures whether a stated 70% actually wins ~70% of the time. We calibrate explicitly and judge betting value on calibrated probabilities, not raw scores.
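The distinction can be made concrete with a toy example: squaring every probability preserves the ranking, so AUC is unchanged, yet the stated probabilities no longer match observed win rates. Both helper functions and the sample data are illustrative assumptions, not the evaluation code used for the published figures.

```python
def auc(probs, outcomes):
    """Probability that a random winner is ranked above a random loser."""
    winners = [p for p, y in zip(probs, outcomes) if y == 1]
    losers = [p for p, y in zip(probs, outcomes) if y == 0]
    score = sum((w > l) + 0.5 * (w == l) for w in winners for l in losers)
    return score / (len(winners) * len(losers))

def reliability_table(probs, outcomes, n_bins=5):
    """Per-bucket (count, mean predicted probability, observed win rate)."""
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        i = min(int(p * n_bins), n_bins - 1)
        bins[i][0] += 1
        bins[i][1] += p
        bins[i][2] += y
    return [(n, total / n, wins / n) for n, total, wins in bins if n]

probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
outcomes = [1, 1, 0, 1, 0, 0]
distorted = [p ** 2 for p in probs]  # same ordering, different probabilities

same_ranking = auc(probs, outcomes) == auc(distorted, outcomes)
table = reliability_table(probs, outcomes)
```

A reliability table like this, computed on settled predictions, is what a calibration curve plots: mean predicted probability against observed win rate per bucket.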
The market is hard to beat. On matches with a liquid market, the market price is typically sharper than our standalone model — which is precisely why NEXUS folds the market in rather than ignoring it. We are honest that our independent edge is clearest where markets are thin or absent.
The shadow-betting layer turns forecasts into an honest, public scorecard. It places paper bets only — no real money — starting from a 1000-unit virtual bankroll.
Each upcoming match gets exactly one immutable prediction, written the first time the match is seen and never altered afterwards (probabilities, model and timestamp are frozen). Each prediction gets exactly one shadow bet, and that bet is always on the predicted winner — enforced in code, not by convention. We bet all three tiers.
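The write-once behaviour can be sketched with a frozen dataclass and a ledger keyed by match. The class and field names are hypothetical, and an in-memory dictionary stands in for the real store, where a unique constraint on the match identifier would typically play the same role.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)  # attributes cannot be reassigned after creation
class Prediction:
    match_id: str
    model: str
    model_version: str
    p_team_a: float
    created_at: datetime

class PredictionLedger:
    """Hypothetical in-memory ledger: one immutable prediction per match."""

    def __init__(self):
        self._rows = {}

    def record(self, match_id, model, model_version, p_team_a):
        """Write on first sight of a match; later calls return the frozen row."""
        if match_id not in self._rows:
            self._rows[match_id] = Prediction(
                match_id, model, model_version, p_team_a,
                datetime.now(timezone.utc),
            )
        return self._rows[match_id]

ledger = PredictionLedger()
first = ledger.record("match-1", "ORACLE", "v1", 0.62)
second = ledger.record("match-1", "NEXUS", "v2", 0.55)  # ignored: already frozen
```

The second call returns the original row untouched, so a later model run cannot overwrite the probability, the model name, or the timestamp.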
Stakes are fractional-Kelly (a conservative quarter-Kelly) against the best available price (the live market when it is liquid, otherwise an Elo-implied baseline), with per-tier caps that allow the largest stakes on Tier-1 and the smallest on noisy Tier-3. Every bet records its model probability, market probability, edge, price source, stake and the bankroll it was placed against.
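Quarter-Kelly staking with a per-tier cap can be sketched as below. The Kelly formula itself is standard; the cap values in `TIER_CAP` and the function name are hypothetical, and the sketch assumes a zero stake when the model sees no edge, which the text above does not specify.

```python
TIER_CAP = {1: 0.03, 2: 0.02, 3: 0.01}  # hypothetical max fraction of bankroll

def shadow_stake(model_p, decimal_odds, bankroll, tier, kelly_fraction=0.25):
    """Paper stake in units: a fraction of full Kelly, capped per tier.

    Full Kelly for a win probability p at net odds b is (p*b - (1 - p)) / b.
    """
    b = decimal_odds - 1.0
    full_kelly = (model_p * b - (1.0 - model_p)) / b
    if full_kelly <= 0:
        return 0.0  # assumption: no positive edge means no stake
    fraction = min(kelly_fraction * full_kelly, TIER_CAP[tier])
    return round(bankroll * fraction, 2)

# Same edge, same 1000-unit bankroll, different tiers.
tier1_stake = shadow_stake(model_p=0.60, decimal_odds=2.0, bankroll=1000.0, tier=1)
tier3_stake = shadow_stake(model_p=0.60, decimal_odds=2.0, bankroll=1000.0, tier=3)
```

Here full Kelly is 20% of bankroll and quarter-Kelly is 5%, so both bets hit their tier cap; the identical edge is staked three times larger on Tier-1 than on Tier-3.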
Because predictions are immutable and one-per-match, the numbers on every page — home, predictions, bets, match detail — are read from the same ledger rows rather than recomputed. There is no surface where a different probability or a different winner can appear. Settlement compares the bet's team to the match's real result, so a win is a win everywhere at once.
We document how this system can fail, in detail and in public, on a dedicated page. A summary of the matrix:
| Failure mode | Severity | Primary mitigation |
|---|---|---|
| Roster changes | High | Point-in-time, roster-aware, recency-weighted form |
| Meta / patch shifts | High | Rolling windows; per-map signals; periodic retrain |
| Tier-3 noise | Medium | Per-tier honesty; hardest Kelly caps on T3 |
| Market beats model | High if misapplied | NEXUS blends the market in; honest edge accounting |
| Demo parse gaps | Medium | Validation gate; parser-first isolation; delete-before-insert |
| Overfitting | Medium | Chronological split; calibration; early stopping |
| Format variance (bo1) | Medium | Format feature; confidence floor before betting |
| Small live sample | Documented | Paper-only until a significant track record exists |
Full narrative descriptions are at counterquant.com/transparency/why-it-fails/.
- Immutable prediction ledger — one APEX/ORACLE prediction per match, frozen at creation, public audit trail with shadow-bet bankroll
- Team intelligence pages — TITAN score (Team CQ), ORACLE prediction history, live map pool win rates (score-based, 8 active pool maps), CIPHER roster stats (avg Rating / ADR / Kills from parsed demos)
- Match detail — GESUS prediction panel, per-team map pool heat-map, BO3 veto predictor (bans/picks simulated from historical win rates, only maps with ≥3 parsed appearances)
- Player pages — CQE, CQR, demo-derived stats (ADR, K/D, rating), achievement system (45 unlockable CS2 achievements across 8 categories, sourced entirely from demo events)
- Per-team and per-match predictions shown on home page, predictions list, and team detail — with live probability bars and settled/correct markers
Deploy ORACLE and NEXUS to replace the APEX baseline in the prediction path — same immutable ledger, better model; publish a calibration curve (win-rate by predicted-probability bucket) as the settled-bet sample grows; add SHAP top-3 feature contributions to each prediction card so users can see why the model made each call.
CQ-GESUS is a bet that the honest path wins: parse the real data yourself, build leakage-free features, report per-tier truth, freeze every prediction, and let a public ledger keep score. We do not claim to have beaten Counter-Strike's variance — we claim to measure ourselves against it in the open. Read the methodology, check the ledger, question the numbers. That is exactly what we want.
Version history: v1.0 (June 2026) — initial publication. · v1.1 (June 2026) — expanded metrics (live pipeline counts), full GESUS model family table (8 models), per-map MIRAGE AUCs, in-round specialist results, "What is live now" roadmap section, EC2 parse fleet detail.