Our logistic regression model for high school football

Monday, November 25, 2024

The Voxitatis logistic regression model, based on the complete dataset of Illinois high school football in the fall 2024 season, was built to predict the binary outcome of a game: whether a team would win or lose. Illinois is celebrating its 50th year of high school football this season, and much of what happens on the field resembles the game as it was played in 1975. But one thing that is vastly different today is the use of artificial intelligence to assist both players and teams in their training. We thus congratulate the IHSA and celebrate 50 years of high school football with a model that uses artificial intelligence to predict the winners of the title games.

Erik Drost (Flickr Creative Commons)

The model was trained using historical game data for the 2024 season, which included features representing the strengths and weaknesses of both the team and its opponents. Key features incorporated into the model included:

Team-specific metrics: average points scored and allowed, strength of schedule (SoS), and secondary SoS (the difficulty of opponents’ schedules).
Opponent-specific metrics: the same statistical measures as above but for the opposing team.
Game-specific metrics: the enrollment ratio (team enrollment compared to opponent enrollment) and whether the game was played at home.
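
As a rough sketch of how one row of such a dataset might be assembled, the snippet below builds the features for a single game from one team's point of view. The TeamStats fields and the game_features helper are hypothetical names chosen for this illustration, not the column names of the actual dataset, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class TeamStats:
    # Hypothetical per-team season summary; field names are assumptions.
    avg_points_scored: float
    avg_points_allowed: float
    sos: float            # strength of schedule
    secondary_sos: float  # difficulty of the opponents' schedules
    enrollment: float


def game_features(team: TeamStats, opp: TeamStats, at_home: bool) -> dict:
    """Build one feature row for a game, seen from `team`'s side."""
    return {
        # Team-specific metrics
        "avg_points_scored": team.avg_points_scored,
        "avg_points_allowed": team.avg_points_allowed,
        "sos": team.sos,
        "secondary_sos": team.secondary_sos,
        # Opponent-specific metrics (same measures, other side)
        "opp_avg_points_scored": opp.avg_points_scored,
        "opp_avg_points_allowed": opp.avg_points_allowed,
        "opp_sos": opp.sos,
        "opp_secondary_sos": opp.secondary_sos,
        # Game-specific metrics
        "enrollment_ratio": team.enrollment / opp.enrollment,
        "home": 1 if at_home else 0,
    }


# Made-up numbers, for illustration only.
team_a = TeamStats(35.0, 14.0, 0.55, 0.50, 2000)
team_b = TeamStats(28.0, 21.0, 0.50, 0.52, 1000)
row = game_features(team_a, team_b, at_home=True)
print(row["enrollment_ratio"], row["home"])
```

Under this layout, each game would appear once from each team's side, with the team and opponent columns swapped and the enrollment ratio inverted.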

These features were carefully engineered to reflect intrinsic team strength and contextual factors like opponent quality and game location. The logistic regression model was trained by fitting a mathematical equation to the log-odds of the binary target (win or loss), with each feature contributing a weighted coefficient to this equation. The final output, transformed via a sigmoid function, represented the probability of a team winning.
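
The calculation described above can be sketched in a few lines: the weighted features are summed into a log-odds value, and the sigmoid function turns that value into a probability. The weights and intercept here are made-up placeholders, not the fitted values, and the function names are our own for this example.

```python
import math


def sigmoid(z: float) -> float:
    """Map a log-odds value to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def win_probability(features: dict, weights: dict, intercept: float = 0.0) -> float:
    """Logistic regression prediction: sigmoid of the weighted feature sum."""
    log_odds = intercept + sum(
        weights[name] * value for name, value in features.items()
    )
    return sigmoid(log_odds)


# Made-up weights and intercept, for illustration only.
weights = {"home": 0.5, "enrollment_ratio": 0.3}
features = {"home": 1, "enrollment_ratio": 1.0}
p = win_probability(features, weights, intercept=-0.3)
print(round(p, 4))
```

A log-odds value of zero corresponds to a 50% chance of winning; positive values push the probability above 50% and negative values push it below.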

Accuracy in Predicting Binary Outcomes

Logistic regression is a widely used and reliable method for predicting binary target variables like win/loss. Its accuracy depends on several factors:

Feature Quality: The model’s success hinges on how well the input features capture patterns in the data. Here, incorporating opponent-specific metrics, game-specific factors (like home-field advantage), and team strength ensured the model could learn meaningful relationships.

Data Completeness: The model performs better when the dataset comprehensively covers relevant past game outcomes, which makes the predictions more robust.

Interpretable Coefficients: Logistic regression provides insights into feature importance via its coefficients, which indicate how each feature contributes to the probability of a win. For example, features like ‘Playing at Home’ and ‘Enrollment Ratio’ showed strong positive effects on win probability in this model.

Logistic regression models typically achieve accuracy rates between 60% and 80% in binary classification tasks, depending on the dataset and feature quality. In our case, the model reached 87.7% accuracy, above that typical range. This reflects its ability to capture key patterns, though there is still room for improvement, possibly via more complex models like Random Forests or XGBoost, or by including more contextual data about the games, such as rushing yards gained and allowed, quarterback strength, and so on.
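
For readers unfamiliar with the accuracy figure, here is a minimal sketch of how it is usually computed: each predicted probability is turned into a win or loss call, and the calls are compared with the actual results. The 0.5 cutoff is an assumption on our part (it is the common default), and the sample values are made up.

```python
def accuracy(probabilities, outcomes, threshold=0.5):
    """Share of games where the predicted winner matches the actual result."""
    # Call a win (1) when the predicted probability reaches the threshold.
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(
        1 for predicted, actual in zip(predictions, outcomes)
        if predicted == actual
    )
    return correct / len(outcomes)


# Made-up predictions and results, for illustration only.
sample_probs = [0.91, 0.35, 0.62, 0.48]
sample_results = [1, 0, 0, 1]
print(accuracy(sample_probs, sample_results))
```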

How we split the dataset for accuracy measurements

It’s important, when using machine learning approaches like a logistic regression model, to split the dataset into data that will be used to “train” the model and data that will be used to “test” the model the computer comes up with. Typically, about 20% of the dataset is used for testing.

Since our goal with this project was not only to discover the relative importance of various measurable features of competition but also to make a prediction for the winner of the state’s final game in each class, we placed the last two games each team played in the test dataset and trained the model on the earlier games. This way, the test set included the most recent games, making it more appropriate for evaluating the model’s ability to predict future outcomes.

Games in the training set: 3,936 (about 79%)
Games in the test set: 1,036 (about 21%)
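
A split like this one, holding out each team's last two games, might be sketched as follows. We assume here one row per team per game, with a "team" column and a "date" column in a sortable format such as ISO dates; those column names and the split_last_n helper are hypothetical.

```python
from collections import defaultdict


def split_last_n(rows, n_test=2):
    """Put each team's last `n_test` games (n_test >= 1) in the test set.

    `rows` is a list of dicts with at least a "team" key and a "date" key
    whose values sort chronologically (e.g., "2024-09-13").
    """
    by_team = defaultdict(list)
    for row in rows:
        by_team[row["team"]].append(row)

    train, test = [], []
    for games in by_team.values():
        games.sort(key=lambda r: r["date"])  # oldest first
        train.extend(games[:-n_test])        # earlier games
        test.extend(games[-n_test:])         # most recent games
    return train, test


# Made-up schedule, for illustration only.
rows = [
    {"team": "A", "date": "2024-08-30"},
    {"team": "A", "date": "2024-09-06"},
    {"team": "A", "date": "2024-09-13"},
    {"team": "A", "date": "2024-09-20"},
    {"team": "B", "date": "2024-09-13"},
    {"team": "B", "date": "2024-08-30"},
    {"team": "B", "date": "2024-09-06"},
]
train, test = split_last_n(rows)
print(len(train), len(test))
```

Because every team contributes the same number of test games, the test share lands near, but not exactly at, the usual 20%.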

Note that games against out-of-state opponents were excluded, as team strength indicators for these opponents are not reliably available from the IHSA. Forfeited games (identified by a 1-0 score) were also excluded, as they do not reflect team strength in any way.

DOWNLOAD the complete dataset.

More complex models, such as Random Forests or XGBoost, could lead to more reliable predictions. If outcomes depend on subtle, non-linear interactions (e.g., between enrollment, the strength of schedule, and home-field advantage), XGBoost can capture these relationships better than linear models. XGBoost also provides detailed feature importance metrics, helping to refine which variables matter most.

While logistic regression is inherently limited to linear relationships between features and log-odds, it remains a powerful baseline model due to its simplicity, interpretability, and efficiency. The insights gained here can guide further refinements, such as exploring non-linear models or adding contextual features to improve predictive power. In the meantime, this model provides a solid foundation for evaluating win probabilities and understanding the factors driving game outcomes.

Model Coefficients

The model considered several features of the dataset provided by the IHSA website. Each team-specific metric received two coefficients: one for the team’s own value and one for its opponent’s. For example, “Average Points Scored” by a team had a coefficient of 0.1624. This is a strong positive effect: other features being equal, a team that had scored more points per game had a fairly good chance of beating a team that had scored fewer. The opponent’s average points scored, by contrast, had a negative coefficient. In other words, the more points a team’s opponents scored, the lower that team’s chances of winning were.

Average Points Scored: 0.1624
Average Points Allowed: -0.1594
Strength of Schedule: 0.2684
Secondary Strength of Schedule: 0.4843
Enrollment Ratio: 0.3162
Playing at Home: 0.5313
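
One common way to read coefficients like these is to exponentiate them, which gives the factor by which the odds of winning are multiplied when a feature rises by one unit and everything else stays the same. The sketch below does this for the values listed above. What "one unit" means depends on how each feature was scaled before training, which we do not restate here, so treat the results as illustrative.

```python
import math

# Coefficients as listed in this article.
coefficients = {
    "Average Points Scored": 0.1624,
    "Average Points Allowed": -0.1594,
    "Strength of Schedule": 0.2684,
    "Secondary Strength of Schedule": 0.4843,
    "Enrollment Ratio": 0.3162,
    "Playing at Home": 0.5313,
}

# exp(coefficient) = multiplier on the odds of winning per one-unit increase.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds x {ratio:.2f}")
```

A ratio above 1 raises the odds of winning and a ratio below 1 lowers them. For “Playing at Home,” the multiplier comes to roughly 1.7.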

Home-field Advantage

We note a very strong positive coefficient for “Playing at Home.” Other factors being equal, being the home team increases the odds of winning. This reflects the well-documented “home-field advantage” in sports, and there are several possible explanations for it. Why does playing at home have such a strong effect?

First, the home team plays the game in a more familiar environment. If field conditions are an issue, the home team knows them better than the away team.

Players on the home team also don’t need to adjust their travel schedules. They generally suffer less from travel fatigue or disruptions in their daily routines. For the away team, these disruptions can reduce performance levels.

Let’s not forget the “12th man” theory, which reflects support from the home crowd. This psychologically boosts the home team and increases player morale and confidence.

Because home teams consistently won more often in the 2024 season dataset, the model naturally assigned a strong positive coefficient to the “Playing at Home” feature to reflect this trend. When running the simulations, we set both teams as ‘away’ teams, so neither side received the home-field boost in the reported results.
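
To see what switching off the home flag does, consider a hypothetical game in which everything except location balances out, so the log-odds from the intercept and all other features sum to zero. That balanced starting point is an assumption for illustration, as are the names in the snippet; only the home coefficient comes from the model.

```python
import math

HOME_COEF = 0.5313  # "Playing at Home" coefficient reported in this article


def win_probability(other_log_odds: float, at_home: bool) -> float:
    """Win probability given the log-odds from all non-location terms.

    `other_log_odds` is a hypothetical stand-in for the intercept plus
    every other weighted feature.
    """
    z = other_log_odds + HOME_COEF * (1 if at_home else 0)
    return 1.0 / (1.0 + math.exp(-z))


p_home = win_probability(0.0, at_home=True)      # home flag on
p_neutral = win_probability(0.0, at_home=False)  # flag off, as in our simulations
print(round(p_home, 3), round(p_neutral, 3))
```

In this balanced case, the home flag alone lifts the win probability from 50% to about 63%, which is why it matters that neither team carries it in the simulations.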

The Importance of Enrollment

The data also reflect a strong positive coefficient for the “Enrollment Ratio,” which is simply the enrollment reported by the IHSA for a team divided by the enrollment for the opponent. As expected, schools with higher enrollments tend to win games against schools with lower enrollments in the dataset.

There’s a reason high school athletic associations in the US, like the IHSA, so commonly award state titles in sports for different “classes” of schools. Based on this year’s data from Illinois, having schools with similar enrollments compete against each other for the state title is the fairest approach available.