Regression Modeling of U.S. Health Care Cost


Author: Allyson Wang

Peer Reviewer: Hyunjin Christina Lee

Professional Reviewer: Dr. D’Arcy Mays


Health care costs in the United States are rising by almost 5% a year [1], and many families struggle to pay them [2]. Factors that contribute to the high spending include the growing population, rising income, and longer life expectancy. The purpose of this project was to apply regression analysis to identify the major drivers of health care costs and to project future costs using the final model. It was hypothesized that pharmaceutical costs were the major predictor of health care costs. To assess the significance of the predictors and determine the optimal model, individual correlations, t-tests, F-tests, stepwise regression, multicollinearity checks, and relative importance analysis were performed in RStudio. Diabetes and life expectancy were excluded from the final model because they were not significant. After variance inflation factor (VIF) tests were used to assess multicollinearity, the final model included pharmaceutical costs and Medicare enrollment. Relative importance analysis showed that pharmaceutical costs were the major driver, explaining 50.84% of the variation in health care costs, which supports the research hypothesis. Policymakers can focus on curbing the specific predictors that drive the rise in health care costs and should remain cautious about their continued growth in the United States. Further research could include more drivers and test different regression models.


Over previous decades, health care costs in the United States have grown more rapidly than in other countries. Because of these soaring costs, more than 79 million people in the United States, mostly young adults out of college and the unemployed, struggle to pay for health care [2]. Health expenditures can be measured as health insurance premiums, as a percentage of the gross domestic product (GDP), or as health spending per capita. Average annual health spending per person can outline the differences among countries and trends over time [3], so health care costs per capita were used in this study to measure expenditure. Health care costs per capita in the United States are the highest among many countries ($10,000), both in absolute terms and in proportion to GDP [4]. With the country’s growing population, health care accounts for almost 18% of the national GDP and is projected to grow to 19.3% in 2023 [5]. Factors such as the growing population, chronic diseases, and the costs of health services have contributed to this rise in health spending in the U.S. [6]. Medicare enrollment appears to be a major driver of health care costs based on historical and projected data [7]. Diabetes is the most expensive disease at $26,971 per family, and its prevalence among Americans is increasing [8]. In addition, according to the Centers for Medicare & Medicaid Services (CMS), pharmaceutical costs have been the main driver of health care costs and will continue to drive the total cost [9]. Profits of pharmaceutical companies have increased, and prescription drug costs have risen to $1,443 per person per year [10]. In 2016, $329 billion of the $3,337 billion spent on health care went toward prescription drugs [11].

Multiple regression is used to examine the relationship between a set of explanatory variables and a response variable and to predict future values of the response variable. The optimal model is chosen by selecting the “best” predictors to explain the variation in the response variable. The multiple linear regression model is given by the equation below:

$$y = \beta_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$$

Figure 1. Multiple Linear Regression Formula

The parameters $\beta_2$ and $\beta_3$ are the slopes for the explanatory variables $x_2$ and $x_3$, $\beta_1$ is the y-intercept, and $\varepsilon$ is the error term [12].

Limited research has been conducted on the specific contributors to increasing health care costs. Regression analysis can determine the role of each factor in the increase of health spending and help health officials improve current policies to lower consumer costs [13]. The explanatory variables were the number of diabetes cases in millions, gross domestic product (GDP) in billions of dollars, income per capita in dollars, Medicare enrollment in number of people, the population in millions, pharmaceutical costs in dollars, and life expectancy (Table 1). The response variable was health care costs per capita in dollars. Based on previous research, it was hypothesized that pharmaceutical costs are the greatest contributor to health care costs, as one survey found that 87% of health care workers believe the rising costs are due to high pharmaceutical costs [10].

Methods and Materials

Historical data from 1966-2016 were collected from government-sponsored sites. RStudio was downloaded and used to perform the regression and statistical analysis. After the data were imported into R, missing values for one variable were interpolated from its non-missing observations. Individual correlations between each explanatory variable and the response variable were displayed in a matrix scatterplot in RStudio.
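The interpolation and correlation steps described above can be sketched as follows. The study itself used R, so this Python version is only an illustrative stand-in: the function names, the linear interpolation rule, and the tiny data series are all assumptions and are not the study's data.

```python
import math

def interpolate_missing(values):
    """Fill None entries by linear interpolation between the nearest known neighbors.

    Leading or trailing gaps stay None because there is nothing to interpolate between.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical yearly series (NOT the study's data); None marks a missing year.
diabetes = [5.0, None, None, 8.0, 9.0]
healthcost = [100.0, 120.0, 140.0, 160.0, 170.0]

diabetes_filled = interpolate_missing(diabetes)
r = pearson_r(diabetes_filled, healthcost)
print(diabetes_filled, round(r, 4))
```

Repeating `pearson_r` for every pair of variables gives the numbers behind a matrix scatterplot.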

A t-test was performed on a simple linear regression for each explanatory variable, followed by partial t-tests on an all-inclusive multiple regression. Using a significance level of 0.1, variables with p-values above 0.1 were removed from the final model because they were not statistically significant [14]. The F statistic, which tests the overall significance of a regression, was calculated to compare the fit of the candidate models. The R-squared statistic measures the proportion of the variation in the response variable that the model explains, so the higher the value, the better the model explains the response [15].
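As a minimal sketch of how these statistics relate, the following fits a one-predictor regression by least squares and computes R-squared, the slope's t statistic, and the overall F statistic. It is not the study's R code; the function name and the data are hypothetical, and p-values are omitted because they require a t-distribution table or a statistics library.

```python
import math

def simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x with basic fit statistics."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))  # residual SS
    sst = sum((b - mean_y) ** 2 for b in y)                              # total SS
    r2 = 1 - sse / sst

    df_resid = n - 2                      # n minus (one slope + intercept)
    mse = sse / df_resid
    t_slope = slope / math.sqrt(mse / sxx)
    f_stat = (r2 / 1) / ((1 - r2) / df_resid)  # 1 = number of predictors
    return {"slope": slope, "intercept": intercept, "r2": r2,
            "t_slope": t_slope, "f_stat": f_stat}

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fit = simple_ols(x, y)
print(fit)
```

With a single predictor, the F statistic equals the square of the slope's t statistic; in a multiple regression the partial t-tests and the overall F-test answer different questions, which is why both were reported.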

Stepwise regression is a variable selection method that was also used to assess significance, based on the adjusted R-squared and the Akaike information criterion (AIC). The adjusted R-squared increases only if an added variable improves the regression model more than would be expected by chance, which guards against overfitting [16]. AIC estimates how well a model can predict values: the lower the AIC, the better the model [17]. These statistics were applied in three types of stepwise regression (forward selection, backward elimination, and combined), which add or remove variables one at a time. The three procedures were performed once with the insignificant variables and once without them to check for any irregular patterns. The results determined the best drivers to incorporate into the final model.
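The two criteria named above have standard definitions, sketched here for reference. The symbols are assumptions that do not appear in the original: $n$ is the number of observations, $p$ the number of predictors, $k$ the number of estimated parameters, and $\hat{L}$ the maximized likelihood of the model.

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
\qquad
\mathrm{AIC} = 2k - 2\ln\hat{L}
```

For a least-squares fit with normally distributed errors, AIC equals $n \ln(\mathrm{RSS}/n) + 2k$ up to an additive constant, where RSS is the residual sum of squares, so models fitted to the same data can be ranked with that simpler form. As a rough check, assuming $n = 51$ annual observations (1966 to 2016), the all-variable model with $p = 7$ and R2 = 0.9988 gives an adjusted R2 of $1 - 0.0012 \times 50/43 \approx 0.9986$, which agrees with the value reported with Table 2.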

Although the explanatory variables were highly correlated with the response variable, high collinearity among the explanatory variables themselves created redundancy, which required the removal of certain variables. Variance inflation factor (VIF) tests were performed to detect multicollinearity among the explanatory variables; high VIFs indicated high multicollinearity. If a value exceeded the low-multicollinearity range of 1-5, variables were removed to lower it [18]. Relative importance, measured with the lmg metric, was then computed in R to determine the weight of each factor in explaining the response variable. For the final regression model, t-tests were performed and the R2 and F values were calculated to explain the variation captured by the model [19].
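Both diagnostics reduce to simple arithmetic on R-squared values, which the following sketch illustrates. The VIF of a predictor is 1 / (1 - R2), where R2 comes from regressing that predictor on the others, and for two predictors the lmg share of each is its R2 gain averaged over the two possible orders of entry. The function names are hypothetical, and the R2 inputs to the lmg example are invented for illustration; only the VIF of 7.987123 comes from the study.

```python
def vif_from_r2(r2_j):
    """VIF of predictor j, given R^2 from regressing j on the other predictors."""
    return 1.0 / (1.0 - r2_j)

def r2_from_vif(vif):
    """Invert the VIF formula to recover the underlying R^2."""
    return 1.0 - 1.0 / vif

def lmg_two_predictors(r2_a, r2_b, r2_ab):
    """lmg shares for two predictors.

    r2_a, r2_b: R^2 of the response on each predictor alone.
    r2_ab: R^2 of the response on both predictors together.
    Each share averages the predictor's R^2 gain when it enters first and second.
    """
    share_a = 0.5 * (r2_a + (r2_ab - r2_b))
    share_b = 0.5 * (r2_b + (r2_ab - r2_a))
    return share_a, share_b

# With only two predictors, both have the same VIF, so the reported value
# implies this R^2 between the two predictors themselves.
implied_r2 = r2_from_vif(7.987123)

# Hypothetical R^2 values (NOT the study's output), chosen only for illustration.
share_pharm, share_enroll = lmg_two_predictors(r2_a=0.95, r2_b=0.94, r2_ab=0.997)
print(round(implied_r2, 4), round(share_pharm, 4), round(share_enroll, 4))
```

The two lmg shares always add up to the R2 of the full model, which is why the relative importance percentages reported in the study sum to the total variance explained.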

The projected values of the significant drivers for 2017-2026 were downloaded from government-sponsored sites [20]. Using Microsoft Excel, these values were entered into the fitted multiple linear regression equation to project the response variable. A 3D bubble chart was created to show the explanatory variables together with the response variable, which was represented by the size of the bubbles (Figure 3). Finally, a line graph was created to show the overall increase in health care costs from 1966-2026.


Table 1. Regression Modeling of U.S. Health Expenditure: Variable Definitions

Explanatory Variables:

| Variable (units) | Abbreviation |
| --- | --- |
| Medicare Enrollment (# of people) | enroll |
| Life Expectancy (age) | life |
| Pharmaceutical Costs ($/retail drug) | pharm |
| Population (# in millions) | pop |
| Gross Domestic Product ($ in billion) | gdp |
| Income ($ per capita) | income |
| Diabetes (# of cases in millions) | diabetes |

Response Variable:

| Variable (units) | Abbreviation |
| --- | --- |
| Health Care Costs ($ per capita) | healthcost |

Figure 2. Regression Modeling of U.S. Health Expenditure: Matrix Scatterplot


Table 2. Regression Modeling of U.S. Health Expenditure: Partial T-test

Variable | Coefficient | P-Value | Standard Error

R2 = 0.9988, Adjusted R2 = 0.9986, F = 4961, α = 0.1

Table 3. Regression Modeling of U.S. Health Expenditure: Forward Selection without Diabetes

Variable | AIC (1) | Variable | AIC (2) | Variable | AIC (3)
Variable | AIC (4) | Variable | AIC (5) | Variable | AIC (6)

Table 4. Regression Modeling of U.S. Health Expenditure: Backward Elimination and Combined


Table 5. Regression Modeling of U.S. Health Expenditure: Forward Selection with Diabetes

Variable | AIC (1) | Variable | AIC (2) | Variable | AIC (3)
Variable | AIC (4) | Variable | AIC (5)

healthcost = -3551.3247 + 0.7686(gdp) – 0.1627(income) + 68.086(diabetes) + 18.2642(pop)

Table 6. Regression Modeling of U.S. Health Expenditure: Three Possible Models

| Three Possible Models | R2 | Adjusted R2 | F |
| --- | --- | --- | --- |
| Model 1: gdp+income+pop+diabetes | 0.9985 | 0.9984 | 7888 |
| Model 2: gdp+income+pop+enroll+pharm | 0.9986 | 0.9984 | 6412 |
| Model 3: all variables | 0.9988 | 0.9986 | 4961 |

Table 7. Regression Modeling of U.S. Health Expenditure: VIF Tests

pop ~ enroll + pharm + gdp + income
enroll ~ pharm + gdp + income + pop
gdp ~ enroll + pharm

Table 8. Regression Modeling of U.S. Health Expenditure: Relative Importance


Variance explained by model: 99.7%

Table 9. Regression Modeling of U.S. Health Expenditure: F-test and T-test on Final Model

R2 | Adjusted R2 | F
Variable | Coefficient | P-Value | Standard Error

Table 10. Regression Modeling of U.S. Health Expenditure: Projected Costs

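| Year | pharm | enroll (in million) | Projected health costs | Growth rate | CMS projected health costs | CMS growth rate |
| --- | --- | --- | --- | --- | --- | --- |
| 2017 | $1,039 | 57.6 | $10,859.90 | | $10,724 | |
| 2018 | $1,097 | 59.3 | $11,412.93 | 5.09% | $11,193 | 4.4% |
| 2019 | $1,148 | 61.1 | $11,940.38 | 4.62% | $11,670 | 4.3% |
| 2020 | $1,209 | 62.9 | $12,523.64 | 4.88% | $12,230 | 4.8% |
| 2021 | $1,280 | 64.8 | $13,176.20 | 5.21% | $12,804 | 4.7% |
| 2022 | $1,358 | 66.6 | $13,854.34 | 5.15% | $13,394 | 4.6% |
| 2023 | $1,441 | 68.4 | $14,560.38 | 5.10% | $14,024 | 4.7% |
| 2024 | $1,530 | 70.2 | $15,299.91 | 5.08% | $14,690 | 4.7% |
| 2025 | $1,617 | 72.0 | $16,028.28 | 4.76% | $15,365 | 4.6% |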

Figure 3. 3D Model of Final Multiple Linear Regression Model

*The size of the bubbles represents the corresponding health care costs to the values of the drivers.

Figure 4. Representation of Historical and Projected Health Care Costs


The scatterplots in Figure 2 show high correlations between each explanatory variable and the response variable; therefore, all the drivers were initially treated as significant. Diabetes was the only variable that required interpolation because of missing data points. In the partial t-test (Table 2), Diabetes was eliminated from the final model because its p-value was greater than the significance level of 0.1, which marked the driver as insignificant. Medicare Enrollment had by far the smallest standard error (0.00002219), while Life Expectancy had the largest (61.42). The R2 was also extremely high, showing that the model fit the data very closely. Because Diabetes was eliminated by the partial t-test, it was excluded when performing stepwise regression. At one step in forward selection (Table 3), the variable life was insignificant, but after all variables were added, it was significant. In backward elimination and combined stepwise regression (Table 4), no variables were eliminated from the equation, consistent with the results from forward selection. The stepwise procedures produced the following model, which includes all variables except Diabetes:

healthcost = 2705 + 2.656(pharm) – 112.5(life) + 0.00006941(enroll) – 0.08558(income) + 19.81(pop) + 0.4197(gdp)

Regression analyses were then performed with all variables, including Diabetes. Backward elimination and combined stepwise regression both retained all variables. In forward selection, Life Expectancy, Pharmaceutical Costs, and Medicare Enrollment were excluded (Table 5), which suggests a relationship between these three variables and Diabetes in predicting health care costs.

The alias coefficients indicated that Life Expectancy and Medicare Enrollment were perfectly collinear. As mentioned above, Life Expectancy was also insignificant at one step of forward selection. Based on these observations, Life Expectancy was excluded from the final model. Three possible models were proposed from these results, and the statistics calculated for all three were extremely high (Table 6). Because Diabetes had already been eliminated due to its high p-value, Model 2 (GDP, income, population, Medicare enrollment, and pharmaceutical costs) was chosen as the optimal model. This model not only had a high R2 value and a high F value, but it also included a reasonable number of variables, each with a justifiable reason for inclusion.

After the model was chosen, multicollinearity among the variables was examined through VIF tests (Table 7). When more than three variables were included, the VIF values were very high, between 100 and 1000. To lower the VIF, life expectancy, population, GDP, and income were removed. After different combinations were tried, the model with Medicare Enrollment and Pharmaceutical Costs was found to have the lowest VIF, 7.987123. Relative importance was then computed to determine which of the two variables had a greater effect on health care costs (Table 8). Medicare Enrollment and Pharmaceutical Costs each explained about half of the variation in health care costs, 48.86% and 50.84% respectively, which identifies pharmaceutical costs as the stronger predictor. Additionally, the final model was evaluated with an F-test and t-tests, and all statistics were high, indicating a close fit (Table 9). The final multiple linear regression model was:

Final Model: healthcost = -2709 + 5.581(pharm) + 0.0001349(enroll)
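As a quick check of how the final model produces the projections, the sketch below plugs the first three rows of driver values from Table 10 into the equation above. The study did this step in Microsoft Excel; the Python function and variable names here are hypothetical. Note that enroll enters the model as a number of people, while Table 10 lists it in millions.

```python
def project_healthcost(pharm, enroll):
    """Final model from the text: pharm in dollars, enroll in number of people."""
    return -2709 + 5.581 * pharm + 0.0001349 * enroll

# (pharm, enroll) inputs for 2017-2019 from Table 10; enrollment converted
# from millions to people.
inputs = {
    2017: (1039, 57.6e6),
    2018: (1097, 59.3e6),
    2019: (1148, 61.1e6),
}

projected = {year: project_healthcost(p, e) for year, (p, e) in inputs.items()}

# Year-over-year growth rate of the projected cost.
years = sorted(projected)
growth = {
    year: projected[year] / projected[prev] - 1
    for prev, year in zip(years, years[1:])
}

for year in years:
    rate = f"{growth[year]:.2%}" if year in growth else "n/a"
    print(year, f"${projected[year]:,.2f}", rate)
```

For 2017 this gives $10,859.90, and the 2018 growth rate comes out at 5.09%, matching the projected values in Table 10.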

Overall, the health care costs projected by the model were very close to the costs projected by CMS, though the model’s growth rates were higher than those calculated by CMS (Table 10) [7]. The projected values were visualized with a 3D bubble graph (Figure 3), with Medicare Enrollment on the x-axis, Pharmaceutical Costs on the y-axis, and health care costs represented by the size of the bubble. A line graph of health care costs from 1966-2026 showed a close-to-exponential trend (Figure 4).

Discussion and Conclusions

The purpose of this project was to use regression analysis to understand which driver best explains the variation in the response variable and to predict future health care costs in the U.S. It was hypothesized that pharmaceutical costs were the most significant contributor to high health care costs. The optimal model included only the significant drivers, Medicare enrollment and pharmaceutical costs. Based on the relative importance values, pharmaceutical costs (50.84%) had a greater effect on health care costs than Medicare enrollment (48.86%). Little research has been conducted on Medicare enrollment’s contribution, but one study stated that the high costs of Medicare were unexpected, further increasing the costs of the Medicare program [3]. Based on the findings, the hypothesis was supported because pharmaceutical costs were a significant variable, remained in the model that lowered multicollinearity (Table 7), and had the higher relative importance of 50.84%.

Other studies have concluded that pharmaceutical costs pose the biggest threat to rising health care costs in the U.S. Research suggests that pharmaceutical costs drive health care costs in the U.S. more than in any other high-income country [21]. Moreover, CMS projects that pharmaceutical costs will grow in parallel with health care costs and account for more than 17% of the rising costs of premiums in hospitals [9]. One reason for the high costs is overuse, caused by drug advertisements and by prescriptions that patients do not know enough about [11]. As the highest spender among countries, the U.S. spends up to 117% more on prescription drugs and produces 57% of the world’s chemical products, such as prescription drugs [22].

Based on the projected values, all three variables (pharmaceutical costs, Medicare enrollment, and health care costs) are expected to grow rapidly over the next decade. Though the growth rate stays roughly constant, health care costs per person were projected to reach $16,000-$17,000 in 2026. Additionally, a few values for Diabetes were interpolated, which may explain its weaker correlation and why it was the only insignificant variable eliminated by the initial partial t-test.

One potential source of error is any bias in the original datasets. Although the datasets came from trusted sources, different databases and sites may report different values, which could lead to different results. Using a wider range of years would also improve the accuracy of the results. For further experimentation, different final models could be used to see how projected health care costs change when other variables are included. Additional drivers of health care costs, such as physicians’ income, medical service costs, and medical technology, could be included to observe how they affect the costs. More tests could also be run to further analyze the results. For example, confidence levels could be used to assess multicollinearity, and other automatic search methods could be tested in addition to stepwise regression.

Pharmaceutical costs have a large effect on rising health care costs. Consequently, pharmaceutical companies should formulate alternative prescriptions to offer patients a wider range of prescription drugs to choose from [11]. As a whole, the U.S. should focus on the major drivers of the increase in health care costs in order to reduce costs across the nation. The nation needs to set objectives that help families pay for health care and control ever-increasing health care costs.