Editor’s note: This manuscript is being posted as a preprint. It is currently in the process of peer review, and we are awaiting author revisions. In the meantime, a copy of the submitted manuscript is posted here for advance consideration.
This paper applies linear regression to analyze how goals scored, goals allowed, and goal difference relate to points in the final standings of the English Premier League. It shows that there is a strong linear relationship between goal difference and points, as well as relatively strong linear relationships between goals scored and points and between goals allowed and points. The purpose of this paper is to gain insight into the utility of mathematical applications in analyzing the Premier League, which can be applied to other soccer leagues as well. This paper also suggests that, under certain assumptions, linear regression is a useful tool for measuring the relative value of attackers and defenders for English Premier League teams. Given the same capability, a defender should be purchased rather than an attacker to achieve a higher standing.
In a world full of data, it is essential to process and analyze that data. Linear regression, a mathematical approach that models the relationship between a dependent variable and one or more independent variables [1,4], plays a crucial role in analyzing our daily lives because it is used extensively in practical applications such as evaluating house prices, identifying differences between housing areas, and assessing personal health conditions and insurance. Linear regression is also significant in the field of machine learning, where it is a fundamental algorithm because its properties are relatively simple and widely known.
This article mainly focuses on linear regression with one independent variable and one dependent variable, which is called simple linear regression. It concerns two-dimensional sample points in a Cartesian coordinate system and estimates the linear relationship between the dependent variable and the independent variable by finding the linear function that best fits the data points. Linear regression models rest on several assumptions, such as constant variance and independence of errors: the spread of the errors does not change with the independent variable, and the errors of the dependent variable are uncorrelated with each other.
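To make the idea of a best-fit line concrete, the following is a minimal sketch of the standard closed-form least-squares solution for simple linear regression. It is not necessarily the exact formulation used in Section 2. The function name `fit_simple_linear_regression` and the toy data are hypothetical and chosen only for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length, with at least two points")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical, so the slope is undefined")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data lying exactly on the line y = 2x + 1 (not league data).
toy_x = [0.0, 1.0, 2.0, 3.0]
toy_y = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_simple_linear_regression(toy_x, toy_y)
print(slope, intercept)
```

Because the toy points lie exactly on a line, the fit recovers that line. With real data the points scatter around the line, and the fitted slope and intercept minimize the sum of squared vertical distances.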
A real-life example based on the English Premier League standings will also be presented to explore the linear relationship between points and goal difference, since a club with a larger goal difference often ends the season with more points. This is followed by an extension that addresses the question of whether a team should buy a better attacker or a better defender. The hypothesis is that points and goal difference have a strong linear relationship, and that teams in the English Premier League should prioritize the purchase of defenders over attackers, in line with the common saying that defense wins championships.
English Premier League
The Premier League is the top level of the English soccer pyramid, in which twenty teams compete against each other, each playing every other team twice for a total of 38 games. In general, a team with a larger goal difference, calculated as the number of goals scored minus the number of goals conceded, will win more games. Winning more games leads to more points because a win is worth 3 points, a draw is worth 1 point, and a loss is worth 0 points. Therefore, there should be a linear relationship between goal difference and the points earned by a team at the end of the season.
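The scoring rules above can be sketched in a few lines of code. The function names and the example season record are hypothetical and serve only to illustrate the arithmetic.

```python
GAMES_PER_SEASON = 38  # each of the 20 teams plays the other 19 teams twice


def league_points(wins, draws, losses):
    """Points under the rule: 3 for a win, 1 for a draw, 0 for a loss."""
    if wins + draws + losses != GAMES_PER_SEASON:
        raise ValueError("a full season consists of exactly 38 games")
    return 3 * wins + 1 * draws + 0 * losses


def goal_difference(goals_scored, goals_conceded):
    """Goal difference is goals scored minus goals conceded."""
    return goals_scored - goals_conceded


# Hypothetical season record (not taken from the actual standings).
example_points = league_points(wins=20, draws=10, losses=8)
example_gd = goal_difference(goals_scored=60, goals_conceded=40)
print(example_points, example_gd)
```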
To explore this relationship, the mathematical methods in Section 2 are applied. The final standings of the 2018-2019 Premier League season are chosen at random as the example in this section. The table below records the points, goal difference, goals scored, and goals allowed in its second to fifth columns respectively.
| Clubs | Points | Goal Difference | Goals Scored | Goals Allowed |
Since this section explores the impact of goal difference on points in the final standings, the independent variable, on the x-axis, is goal difference, and the dependent variable, on the y-axis, is points. Each data point is plotted in Figure 1 together with the best fit line. The plot shows a clear linear relationship between the two variables.
The coefficients alpha and beta are calculated as 0.64 and 53.45 respectively, using the equations discussed in Section 2. The best fit line shown in the plot has a gradient equal to alpha and a y-intercept equal to beta. Since the x-axis is the goal difference and the y-axis is the points of a club at the end of the season, the gradient means that each additional goal of goal difference corresponds, on average, to 0.64 more points at the end of the season. The y-intercept indicates that a club that scores as many goals as it concedes is expected to end the season with 53.45 points.
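As a worked illustration, the fitted line can be used to estimate points from goal difference. This is only a sketch: the coefficients are the values reported above, the function name `predicted_points` and the sample goal differences are hypothetical, and the output is an average tendency rather than a forecast for any particular club.

```python
ALPHA = 0.64   # gradient reported above: points per unit of goal difference
BETA = 53.45   # y-intercept reported above: expected points at goal difference 0


def predicted_points(goal_diff):
    """Evaluate the best fit line: points = ALPHA * goal_diff + BETA."""
    return ALPHA * goal_diff + BETA


# Hypothetical goal differences chosen only to illustrate the line.
for gd in (-20, 0, 10, 50):
    print(gd, round(predicted_points(gd), 2))
```

A goal difference of 0 returns the intercept itself, and each additional goal of difference adds 0.64 to the estimate.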
The linear relationship between goal difference and points leads to another question for clubs: should they spend money on an attacker or spend the same amount on a defender, assuming that the two players have the same capability and efficiency and differ only in position? Linear regression can again be used to address this question quantitatively.
Attackers or Defenders?
The procedure is similar to that in Section 3; the only change is the x-axis variable. Instead of goal difference, the independent variable is goals scored in one regression and goals allowed in the other. Admittedly, teams that finish the season higher in the standings have not necessarily scored more goals than teams with fewer points, since they may have allowed fewer goals. Conversely, some teams with more points may have allowed more goals because they scored plenty of goals to compensate. Nevertheless, after plotting each point, there is still a linear relationship, although the correlation is weaker than in Section 3. Figure 2 is the plot of goals scored vs. points and Figure 3 is the plot of goals allowed vs. points. The slope of the best fit line in Figure 2 is positive because more goals scored should lead to more points, whereas the slope of the best fit line in Figure 3 is negative because fewer goals allowed should lead to more points.
In Figures 2 and 3, the output alpha values are 1.129 and -1.230 respectively, which means that one more goal scored corresponds to finishing the season with 1.129 more points, and one fewer goal allowed corresponds to finishing with 1.230 more points. Intuitively, scoring a goal and conceding a goal should cancel out, so the difference between the two magnitudes should be 0 rather than about 0.1: in a real game, after each side has scored once, the two teams are back on level terms.

We now return to the question posed at the end of Section 3: should a club buy an attacker or a defender? Quantitatively, based on the results of the linear regression, a defender should be purchased, since a goal prevented is worth more points than a goal scored. It should be noted that this conclusion applies only to the Premier League; in other leagues the result can be different.

The result is, to some extent, reasonable because many clubs in the Premier League are capable of winning the title. For example, the top tier contains six teams that dominate the Premier League. In contrast, in other European leagues such as the Italian Serie A and the Bundesliga, a single team, Juventus and Bayern respectively, has been champion for more than five years, and Real Madrid, Barcelona, and Atletico Madrid dominate La Liga. As a result, when the teams in the Premier League have similar capabilities, the goal difference in each game will be small. For the top six teams, whose offense is already relatively strong, a better defense allows fewer goals, which brings more points given that a draw is worth 1 point and a win is worth 3 points. Weaker teams can earn one point through a strong defense that prevents the stronger teams from scoring, which is exactly what those teams have been doing in recent seasons.
It is not uncommon for a team in the last five positions in the standings to earn one point in a game against one of the top six teams. However, in other leagues, in which there is a huge gap between strong teams and weak teams, a weaker team that allows plenty of goals should consider reinforcing its offense and try to score as many goals as possible to compensate for the goals conceded. The same applies to strong teams whose defense is already good enough: they should consider scoring more goals to secure the win. In such leagues, buying a defender is not as efficient as buying an attacker. Again, it is a matter of the gap in strength between the strong teams and the weak teams in a league.
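The comparison between an attacker and a defender can be reduced to a small piece of arithmetic on the two slopes reported above. The sketch below assumes, as a simplification, that the two separate regressions can be read as per-goal point values; the function and variable names are hypothetical.

```python
SLOPE_GOALS_SCORED = 1.129    # alpha from the goals scored vs. points regression
SLOPE_GOALS_ALLOWED = -1.230  # alpha from the goals allowed vs. points regression


def points_gained(extra_goals_scored=0, goals_prevented=0):
    """Estimated change in season points under the simplifying assumption
    that each slope acts as an independent per-goal point value."""
    from_attack = SLOPE_GOALS_SCORED * extra_goals_scored
    # Preventing a goal lowers goals allowed by one, so the sign flips.
    from_defense = -SLOPE_GOALS_ALLOWED * goals_prevented
    return from_attack + from_defense


attacker_value = points_gained(extra_goals_scored=1)  # one more goal scored
defender_value = points_gained(goals_prevented=1)     # one fewer goal allowed
gap = defender_value - attacker_value
print(round(attacker_value, 3), round(defender_value, 3), round(gap, 3))
```

Under this reading, a goal prevented is worth roughly 0.1 points more than a goal scored, which is the basis for preferring a defender in this data set.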
After the plots were generated and analyzed, the results support both hypotheses: there is a strong linear relationship between goal difference and points, and in the English Premier League a defender should be bought rather than an attacker to earn a higher standing. The saying “defense wins championships” is therefore reasonable in this setting.
The implication of this research is a better understanding of the significance of mathematics in real life. Linear regression, an indispensable part of mathematics, is a vital and practical tool that has already been widely applied in fields such as artificial intelligence and business. The paper explores the principles and operations of linear regression, which can help readers understand the linear relationships between other variables in areas beyond sports such as soccer. Many real-life examples can be quantitatively analyzed by applying linear regression, enabling people to find order in our complex world.
[1] David A. Freedman. Statistical Models: Theory and Practice. Cambridge University Press, 2009.
[2] Tarunpreet Kaur. Factors Affecting Health Insurance Premiums: Explorative and Predictive Analysis. 2018.
[3] A. Mehra. Statistical Sampling and Regression: Simple Linear Regression. PreMBA Analytical Methods. Columbia Business School and Columbia University, 2003.
[4] Sanford Weisberg. Applied Linear Regression, volume 528. John Wiley & Sons, 2005.