## Abstract

A central feature of baseball is the stolen base: an attempt by a runner to advance to the next base during a pitch. Because steal attempts have historically succeeded more than 50% of the time, teams long tried to steal whenever they believed their runner could beat the opposing catcher’s throw. As analytics made their way into baseball, however, teams became more hesitant to steal, because statistical analysis of the risk versus the reward implies that teams who steal recklessly often sabotage themselves. Recent changes to the rules of baseball have since increased the viability of aggressive base-stealing. In 2023, Major League Baseball (MLB) shortened the distance between bases by nine inches. While this change may seem small, many failed steal attempts had been decided by a margin this small or smaller, so the change flips the result for a large portion of attempts. This slight tweak forced teams to re-evaluate how often they should attempt steals, resulting in roughly 3,500 successful steals in 2023, the second-highest total of the last 50 years. In this paper, we explore the variables that affect the likelihood of a successful stolen base and compare our statistical analysis from before and after the 2023 rule changes. We find that the speed of the baserunner had less of an impact on base-stealing success in 2023, likely because the shorter base path increased runners’ margin for error. In addition, our analysis indicates that catchers who are worse at throwing out runners are exposed more consistently.

## Introduction

Stealing bases is one of baseball’s most distinctive tactics. It is also governed by some of the more complex rules of baseball (a very complicated sport in itself), so many fail to recognize its importance. Baseball is one of the biggest sports in the world, and stealing bases is an essential facet of the game with the potential to change the careers of many players. For example, MLB player Terrance Gore has signed with a new team almost every September to fill in as a pinch-runner for playoff baseball. The allure of stealing bases has also entered popular culture, as with Taco Bell’s annual free taco day after the first stolen base of each World Series.

Despite the importance of base stealing, many outsiders do not know of its existence or its value to the game from a strategic perspective. Base stealing has been a staple of the game for over a century, but recently has fallen in popularity due to the rise of analytics in the sport that indicate stealing a base usually is not worth the risk. In this work, we statistically assess the factors that contribute to the success of stealing a base before and after 2023, at which time the rules of baseball were significantly amended.

Stolen bases had been on the decline for a decade prior to the 2023 season. This decline stems from analysis of the break-even rate of a steal attempt: how often a team needs to succeed for the average change in win probability per attempt to be positive. Researchers found that for stealing to increase a team’s win probability, the team would need to succeed at an unsustainable 74% clip. In many cases, the risk of stealing was not worth the reward.
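The break-even idea can be made concrete with a short sketch. It assumes a simplified model in which a successful steal adds a fixed amount of win probability and a failed one subtracts a fixed amount; the function names and the gain and loss values are hypothetical illustrations, not figures from the studies cited here.

```python
def break_even_rate(gain_if_safe: float, loss_if_caught: float) -> float:
    """Success rate p at which p * gain - (1 - p) * loss equals zero."""
    if gain_if_safe <= 0 or loss_if_caught <= 0:
        raise ValueError("gain and loss must both be positive")
    return loss_if_caught / (gain_if_safe + loss_if_caught)


def expected_change(success_rate: float, gain_if_safe: float,
                    loss_if_caught: float) -> float:
    """Average change in win probability per steal attempt."""
    return success_rate * gain_if_safe - (1.0 - success_rate) * loss_if_caught


# Hypothetical values: being caught costs three times what a steal gains.
GAIN = 0.01
LOSS = 0.03

threshold = break_even_rate(GAIN, LOSS)
print(f"break-even success rate: {threshold:.2f}")
for rate in (0.70, 0.80):
    change = expected_change(rate, GAIN, LOSS)
    print(f"success rate {rate:.2f}: expected change {change:+.4f}")
```

Under these assumed values, a team stealing below the threshold loses win probability on average, while a team above it gains. The more costly an out is relative to the extra base, the higher the threshold climbs.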

The 2023 season was unique because it featured a major change in the state of stolen bases due to rulebook changes. As MLB officials pushed for more action in gameplay, the distance between bases shortened by 9 inches, improving the stealing success probability for all players. While this difference seems small, more failed steals came within that margin than the common fan would think. In turn, the league’s success rate on stolen bases took a massive jump. Per pitcherlist.com, a baseball analytics journal and website, the league-wide stolen base success rate is up 5% from last year to a rate of 80%^{1}. This is striking, particularly because base stealing attempts are on the rise. Even with an increase in volume, base stealing efficiency also improved, which is uncharacteristic given historical trends. Thus, the stolen base is back to being one of the key components of baseball.

A multitude of factors influence whether a steal attempt succeeds. However, stealing a base carries a high degree of uncertainty, making effective prediction difficult even for pro-level analytics departments. A runner’s speed is arguably the most important factor, but the impact of other considerations, such as the length of the lead (how far the runner stands from his original base to get a head start), the velocity of the pitch, and the runner’s acceleration, is far harder to calculate. Indeed, stealing a base is almost never guaranteed. In part, this uncertainty led to the reduction in stealing bases that we see today. In the late 2010s, stolen base numbers hit lows not seen in the previous ten years^{1}. The main established theory is that a stolen base would need to have at least a 70% likelihood of success for the risk to be worth the reward, a theory that came from Billy Beane’s Moneyball Oakland A’s and revolutionized sports analytics. Due to the inherent randomness of successfully stealing a base, some teams did not think that a 70% threshold could ever be met^{2}.

In addition to analytics on base stealing, Billy Beane revolutionized baseball at every level. Beginning around 2000, Beane’s Oakland A’s were the first to normalize win-probability in baseball statistics, where before it had only been a theoretical concept. They could apply numbers to situational baseball before anybody else, and could even calculate how likely you were to score based on how many runners were on and how many outs had been recorded.

Beane began with theories regarding the value of on-base percentage relative to batting average, and once the 29 other teams saw Oakland at the top of the standings every season, they began to take note. Teams began to hire data analysts to model which components of baseball best increased winning percentage across seasons. The league’s analytics scene eventually tried to optimize stolen bases, resulting in a significant drop-off in attempts league-wide because they were often inefficient (Greenberg, Neil. “After baseball’s rule changes, the stolen base is officially back”).

Although win probability dominates base-stealing analytics today, the factors that contribute to how likely a runner is to steal successfully are changing in importance significantly because of recent rule changes in Major League Baseball (MLB). Although the 2023 season is not complete, there is still enough information to speculate on the likelihood of stealing a base without being caught. In turn, we collect and analyze gameplay statistics to examine the contributing factors in base stealing likelihood before and after this change.

## Literature Review

Data and research regarding stolen bases in 2023 are scarce due to the recency of the rulebook changes. However, research on this limited data is valuable precisely because of the significance of these recently implemented changes. We therefore anchor our research in studies of pre-2023 stolen bases to understand why their occurrence had been shrinking. Fundamentally, we want to know which factors are most important to making good base-stealers, and how their importance has changed this year.

On a broad scale, stolen bases do not correlate much with winning as a team. Braden Murray, a student studying analytics at Samford University, discovered that among over 70 statistics, stolen bases and speed ranked near the bottom when it came to how strongly they correlated to team wins. This study was completed using 2021 statistics, which largely explains why teams were increasingly stealing fewer bases around that time (Murray, Braden. “MLB Winning Percentage Breakdown: Which Statistics Help Teams Win More Games?”)^{3}. On an even broader scale, pitching metrics seemed to matter much more than hitting statistics overall. This is important to consider; many of the variables that are harder to quantify with a stolen base are the ones for which the pitcher has the most influence.

When modeling the success likelihood of a stolen base in a given situation, statisticians include as many salient features as they can into a regression model. However, University of South Carolina graduate Cade Stanley added another dimension to the evaluation of stolen bases: win probability (Stanley, Cade. “Modeling the Probability of a Successful Stolen Base Attempt in Major League Baseball”)^{4}. Cade Stanley’s analysis tailors the general framework of Beane’s original statistical strategies to stolen bases specifically. Not every part of a baseball game is the same, and stealing a base is far more advantageous in some situations than others. For example, whether you are down one or multiple runs late in a game, you have no more chances if you fail to score in the ninth. The difference between these situations applies to stolen bases. A runner on first matters far more in a one-run game because of the potential to tie the score. While the risk is the same, the reward for a successful steal in the tighter game is much greater since a successful stolen base can significantly improve win probability. By contrast, the main runner that matters in the two-run game is the one at the plate; the base runner needs to score safely to allow for a comeback to occur but does not need to score before the batter. We can expand this simple component to a variety of situations that allow us to assess how much is to be gained from stealing a base at any given point in a game. Though we do not directly analyze the expected value of a stolen base, we feel our work in examining the factors of successful base stealing is a strong basis for future work, particularly under the new rulebook.

Theodore Turocy at Texas A&M University then applied similar metrics to assess how likely a team was to score after a successful versus a failed stolen base, and compared those results against the commonly cited 70% success rate to see when stealing would be a net positive (Turocy, Theodore L. “The theory of theft: An inspection game model of the stolen base play in baseball”)^{5}. He then went one step further and outlined how defenses shift when a threatening runner is on first. With a fast runner on base, the hitter’s production marginally increases, showing that even if no steal is attempted, the threat of one is a boost to the offense. Paul Bursik and Kevin Quinn published a similar analysis in the University of Nebraska Press (Bursik, Paul, and Kevin Quinn. “Whither or whether the stolen base?”)^{6}. They illustrated that while certain teams began to attempt fewer steals, some teams continued attempting stolen bases solely out of tradition and precedent. Herman Demmink III, an independent baseball researcher, also notes that good base-stealing teams can usually gain at least 3 extra wins over bad ones, a difference that can cost a team a shot at the World Series (Demmink III, Herman. “Value of stealing bases in Major League Baseball: “Stealing” runs and wins”)^{7}.

Studies like these are important in showing how stolen bases change the fate of teams, but they cannot reliably account for certain factors that often determine the likelihood of success. Pitch location, exact lead distance and momentum, the quality of a tag, and other unpredictable factors make assessing every component of a stolen base intractable. However, there is still room for research on the nuances of a stolen base rather than its at-large effect on the sport. Demmink’s study and the others mentioned show the importance of stolen bases in baseball but do not directly examine how stealing changed under the new MLB rules. In this study, we intend to fill this gap.

## Methodology

To analyze the likelihood of a stolen base, we gathered all available data that could affect the odds of a stolen base attempt. For base stealers, this includes a runner’s speed, their previous stolen base success percentage, their lead, and their acceleration. For catchers, this includes their pop time (the time from the pitch reaching the catcher’s mitt to the throw arriving at the base), their throw velocity, and their previous success rate at throwing other runners out. At first glance, it seems as though raw base-stealing success rates would entirely determine future base-stealing success. However, factors like level of competition, pitch location, and the right- or left-handedness of a pitcher can significantly affect base-stealing success but cannot be easily collected or modeled. We leave incorporation of these factors for future work.

The data for this project was collected from Baseball Savant (www.baseballsavant.com), a baseball analytics site that allowed us to customize datasets to include everything that could be important in determining the success of a stolen base. Baseball Savant uses tracking data from Statcast/Hawkeye to record the specific velocities, distances, and timings that are key to assessing an interaction as quick as a stolen base. The site already cleans the data, filtering out malformed records and plays where no throw is attempted; such plays would compromise the study because runners slow down when there is no threat of a throw. However, due to price and availability constraints, we were unable to get data for pitch location, game situation, throw location, and other key factors that contribute to successful steals. Our objective is to build a regression model that indicates how likely a player is to steal a base against a certain catcher, given the runner’s strengths versus the catcher’s defensive prowess.

Correlation and regression analysis were the basis of our exploration. By definition, the correlation between two variables is the covariance, the measure of joint variability between two variables, divided by the product of the two variables’ standard deviations. In practice, correlation indicates how much one variable trends relative to another and in which direction. A correlation value of 0.8 implies a strong positive correlation, while a correlation value of -0.2 implies a weak negative correlation.
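The definition above translates directly into code. The following is a minimal sketch using only the Python standard library; the function name and the sample values are made up for illustration and are not drawn from our dataset.

```python
import math


def pearson_correlation(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample covariance and sample standard deviations (n - 1 denominator).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (sd_x * sd_y)


# Made-up illustrative values, not real player data.
sprint_speed = [26.1, 27.0, 27.8, 28.5, 29.3]   # feet per second
steal_success = [0.62, 0.70, 0.68, 0.81, 0.85]  # fraction of attempts safe

r = pearson_correlation(sprint_speed, steal_success)
print(f"r = {r:.3f}")
```

Because the covariance is scaled by both standard deviations, the result always falls between -1 and 1 regardless of the units of either variable, which is what lets us compare speed in feet per second against pop time in seconds.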

Assessing which variables matter was important in determining how to approach building a model that could identify the likelihood of a successful stolen base. We know which variables to test, but we need to test for how much each one sways the odds of a successful stolen base. We hypothesize that the runner’s sprint speed will be the most important factor in the result of a stolen base attempt, as it is the most consistent variable and the only tool we can reliably use to measure the runner’s success probability. By contrast, catchers have many more impactful traits that can determine whether or not they throw out a runner. We expect significant positive correlation with sprint speed and stolen base percentage, while we expect negative correlation with a catcher’s pop time and the baserunner’s distance from second base relative to how frequently a runner was thrown out. Intuitively, we drew this hypothesis from the factors we identified earlier in the paper, as they all directly link to our prediction for this model.

Beyond direct regression analysis, we also analyze the statistical significance of our learned coefficients. This allows us to place varying degrees of confidence on our findings. The first quantity we considered is the estimate, which is the coefficient representing the estimated change in the dependent variable for a one-unit increase in the corresponding independent variable. The estimate measures the size and direction of an effect rather than our confidence in it, and because it depends on the units of the variable, estimates should be compared across variables only alongside their standard errors. The standard deviation and standard error were also crucial to the model-building process, as they tell us how tightly clustered our results were. The standard deviation measures how dispersed data is relative to the mean. The standard error of a sample mean is the standard deviation divided by the square root of the number of samples; more generally, the standard error of an estimate measures how much that estimate would vary from sample to sample. The t-value, the estimate divided by its standard error, is also important because it underlies the test of statistical significance: with it we can measure how likely it is that a relationship this strong would appear in the dataset by chance. Together, these quantities allow us to assess how impactful certain variables are on stolen base success likelihood.
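To show how the estimate, standard error, and t-value relate to one another, here is a minimal sketch of a one-predictor least-squares regression. It is an illustration under simplifying assumptions (a single predictor and a hypothetical function name), not the model fitted in this study, and the sample numbers are arbitrary.

```python
import math


def simple_regression(xs, ys):
    """Fit y = intercept + slope * x and summarize the slope coefficient."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two equal-length samples with at least 3 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor has no variation")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx                      # the "estimate"
    intercept = mean_y - slope * mean_x

    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    residual_variance = rss / (n - 2)
    std_error = math.sqrt(residual_variance / sxx)
    if std_error == 0:
        raise ValueError("perfect fit: the t-value is undefined")

    return {
        "estimate": slope,
        "std_error": std_error,
        "t_value": slope / std_error,
    }


# Arbitrary illustrative data.
summary = simple_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

The t-value is simply the estimate measured in units of its own standard error, so a large coefficient with a large standard error can be less convincing than a small coefficient that is estimated precisely.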

## Experimental Results and Discussion

One of the first questions we wanted to answer was how much speed mattered in the context of stolen bases. For steals in 2023^{8}, while the correlation was low at 0.1314, there is enough of a correlation to yield a t-value of 2.083, linking sprint speed to stolen base percentage. Stats from 2015-2022 gave a lower correlation at 0.1125, but the dataset is about 13 times larger, so the t-value reaches 4.917. All of the other deviations between 2023 sprint speed and 2015-2022 sprint speed are a result of a major difference in sample size.
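The way sample size drives the t-value can be sketched with the standard relationship between a correlation r computed from n paired observations and its t-statistic, t = r * sqrt(n - 2) / sqrt(1 - r^2). The correlation and sample sizes below are hypothetical and chosen only to show the effect; they are not the sample sizes of our datasets.

```python
import math


def t_from_correlation(r: float, n: int) -> float:
    """t-statistic for testing whether a correlation r from n pairs differs from zero."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Hypothetical: the same weak correlation observed in a small and a large sample.
R = 0.12
small_sample_t = t_from_correlation(R, 100)
large_sample_t = t_from_correlation(R, 1300)
print(f"n = 100:  t = {small_sample_t:.2f}")
print(f"n = 1300: t = {large_sample_t:.2f}")
```

For a fixed correlation, the t-value grows roughly with the square root of the sample size, which is why a weak correlation can still be statistically significant in a large multi-season dataset.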

The results we gathered with catcher data gave us much more information about the impactful variables regarding how likely a runner was to be thrown out. These stats are split into a lot of variables, but mainly are determined by how long the catcher takes to get the ball from his mitt to second base, and the runner’s distance from second base. Unsurprisingly, the t-value on the pop time was quite high, but the most impactful statistic in these datasets was easily the runner’s distance from second base.

These two graphs show that, in 2023, the outcome of an attempt correlates more strongly with the average runner distance from second base than with pop time. The same relationship holds before 2023 as well.

The correlations are far more similar before the rule change than after. Another interesting note is that the average runner distance dropped in 2023, which is consistent with the shorter base path contributing substantially to the rise in stolen bases that year.

As throwing out runners seems to be harder overall in 2023, pop time’s t-value is at a respectable -1.136, while the t-value for pop time before 2023 is actually positive. These numbers make more sense because the pop time in both datasets has a high value in the Pr(>|t|) column, implying that the t-values in both columns are decently likely to be due to random chance. However, pop time seems to matter far more with a shorter base path, given that its estimate in the first column is much less statistically significant.

The standard errors of these factors further support these conclusions. The standard error of the runner’s distance from second base lies around 0.015 in both timeframes, while the standard error for pop time is 0.53 before 2023 and 0.98 in 2023, implying much more uncertainty in the pop time estimates.

Pop time itself naturally splits up into two smaller factors: exchange time and arm strength. Both the exchange and the throw affect the process of getting the ball from the catcher’s mitt to second base, and the model shows that the exchange time is much more impactful. In both timeframes that we evaluated, exchange time was more important than arm strength when determining how good a catcher is at throwing out runners.
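The split of pop time into exchange and throw can be sketched with a deliberately simplified model: pop time is the exchange time plus the throw’s flight time, with the ball assumed to travel at a constant average speed over the 127 ft 3 3/8 in between home plate and second base. The constant-speed assumption, the function name, and the sample catcher values are all hypothetical; real throws slow down in flight and are released from slightly different points.

```python
HOME_TO_SECOND_FT = 127.28125          # 127 ft 3 3/8 in
MPH_TO_FT_PER_S = 5280.0 / 3600.0      # feet per second in one mile per hour


def estimated_pop_time(exchange_s: float, avg_throw_mph: float,
                       distance_ft: float = HOME_TO_SECOND_FT) -> float:
    """Exchange time plus flight time at a constant average throw speed."""
    if exchange_s < 0 or avg_throw_mph <= 0 or distance_ft <= 0:
        raise ValueError("exchange must be non-negative; speed and distance positive")
    flight_s = distance_ft / (avg_throw_mph * MPH_TO_FT_PER_S)
    return exchange_s + flight_s


# Hypothetical catcher: 0.70 s exchange, 75 mph average throw speed.
baseline = estimated_pop_time(0.70, 75.0)
quicker_exchange = estimated_pop_time(0.65, 75.0)   # 0.05 s faster exchange
stronger_arm = estimated_pop_time(0.70, 77.0)       # 2 mph faster throw

print(f"baseline:         {baseline:.3f} s")
print(f"quicker exchange: {quicker_exchange:.3f} s")
print(f"stronger arm:     {stronger_arm:.3f} s")
```

A sketch like this only decomposes the timing; it does not by itself show which component better predicts caught-stealing outcomes, which is what the regression results above address.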

Our main takeaway from this analysis is that stealing bases became much easier in 2023. The decreased statistical significance of sprint speed in 2023 suggests that runners had a far larger margin for error, while the inverse for catchers suggests an increased difficulty in throwing out a base stealer. The graphs of pop time and sprint speed for 2023 show a weaker linear relationship between either variable and stolen base success rate than in previous years. Stolen bases became much more random and less predictable in 2023, which is consistent with rule changes that made the feat much easier to accomplish.

Unsurprisingly, we learned that faster runners are more likely to steal bases than slower runners, that catchers who took longer to throw the ball were less likely to throw runners out, and that runners closer to second base were also less likely to be thrown out. We also learned that the catcher’s pop time is an important factor, but that the most important factor is the most unpredictable one, relying very little on skill and more on luck. The runner’s distance from second base often depends on who the runner is, what the situation is, and whether the pitcher is trying to hold him at first. These context-dependent variables make the lead distance intractable to predict, though we maintain it is an important factor. Getting a consistent lead is difficult, so in reality, the majority of stolen bases are determined by a variable that is nearly impossible to predict and control. Evaluating and analyzing this data is still important, however, as the several changes in the 2023 results relative to earlier seasons show a shift towards stealing becoming easier, putting even more stress on catchers than before.

## Conclusion

We found that speed, pop time, and most importantly the runner’s lead from first base affect the likelihood of a successful stolen base. In particular, we see that sprint speed was far more important from 2015-2022 than it is now. Interesting avenues for future work include gathering data from specific stolen base attempts to gain a deeper understanding and more reliable metrics for specific stolen base attempts rather than aggregate findings. Knowing the change in significance of these statistics between the rule change and specific play scenarios would contribute insightful context to the data we have already collected.

Another key limitation is that we collected most of our data early in the 2023 season, so these numbers have likely changed since. In the future, this data may change meaningfully, which could be interesting to explore. Even then, we still have far less data for 2023 than for the seasons before it, so the results of our model remain limited. With more time, we would have liked to input a specific catcher and runner to see how likely a stolen base attempt was to succeed, but time and data constraints prevented us from doing so. Rosters in 2023 were not constructed with this rule change in mind, and we expect teams in future years to tap into stolen bases more than they have. In addition, a catcher who is able to throw out runners at a high clip will be more valuable, while pitchers who have slower wind-ups and pitches will likely be more exploitable. Overall, we identified the factors that matter most to successfully stealing a base, and with MLB’s new rules in their inaugural season, there is much more analysis to be done to further assess what changed and how MLB teams can maximize the stolen base.

## References

1. Schwartz, Nate. “Touching Base: Who’s Stealing More With the New Rules?” (2023).
2. Greenberg, Neil. “After baseball’s rule changes, the stolen base is officially back.” Yahoo Sports, 2023.
3. Murray, Braden. “MLB Winning Percentage Breakdown: Which Statistics Help Teams Win More Games?” Samford University Sports Analytics, 2022.
4. Stanley, Cade. “Modeling the Probability of a Successful Stolen Base Attempt in Major League Baseball.” (2023).
5. Turocy, Theodore L. “The theory of theft: An inspection game model of the stolen base play in baseball.” No. 0401005. University Library of Munich, Germany, 2005.
6. Bursik, Paul, and Kevin Quinn. “Whither or whether the stolen base?” NINE: A Journal of Baseball History and Culture 17.2 (2009): 122-135.
7. Demmink III, Herman. “Value of stealing bases in Major League Baseball: ‘Stealing’ runs and wins.” Public Choice 142.3-4 (2010): 497-505.
8. Baseball Almanac, Inc. “MLB League by League Totals for Stolen Bases.” Baseball Almanac, www.baseball-almanac.com/hitting/hisb3.shtml. Accessed 14 Dec. 2023.