Regression Modeling of U.S. Health Care Cost

April 23, 2021

Author: Allyson Wang

Peer Reviewer: Hyunjin Christina Lee

Professional Reviewer: Dr. D’Arcy Mays

Abstract

Health care costs in the United States are rising by almost 5% a year [1], and many families struggle to pay them [2]. Factors that contribute to the high spending include the growing population, income, and life expectancy. The purpose of this project was to apply regression analysis to understand which major factors drive health care costs and to project future costs using the final model. It was hypothesized that pharmaceutical costs were the major predictor of health care costs. To assess the significance of the predictors and determine the optimal model, individual correlations, t-tests, F-tests, stepwise regression, multicollinearity (VIF) tests, and relative importance analysis were performed using R Studio. Diabetes and life expectancy were excluded from the final model because they were insignificant. Based on VIF tests of multicollinearity, the final model included pharmaceutical costs and Medicare enrollment. Relative importance showed that pharmaceutical costs were the major driver, explaining 50.84% of the variation in health care costs, which supports the research hypothesis. Policymakers can focus on curbing the specific predictors that drive the rise in health care costs and remain cautious of their continual growth in the United States. Further research can include more drivers and test the outcomes of different regression models.

Introduction

Over previous decades, health care costs in the United States have grown more rapidly than in other countries. Due to these soaring costs, more than 79 million people in the United States, mostly young adults out of college and the unemployed, struggle to pay for health care [2]. Health expenditures can be measured as health insurance premiums, as a percent of the gross domestic product (GDP), or as health spending per capita. Average annual health spending per person can outline the differences among countries and trends over time [3]. Health care costs per capita were used in this study to measure expenditure. Health care costs per capita in the United States are the highest among many countries ($10,000), both in absolute terms and in proportion to GDP [4]. With the country’s growing population, health care accounts for almost 18% of the national GDP and is projected to grow to 19.3% in 2023 [5]. Factors such as the growing population, chronic diseases, and costs of health services have contributed to this rise in health spending in the U.S. [6]. Medicare enrollment seems to be a major driver of health care costs based on the historical and projected data [7]. Diabetes is the most expensive disease at $26,971 per family, and its prevalence among Americans is increasing [8]. In addition, according to the Centers for Medicare & Medicaid Services (CMS), pharmaceutical costs were the main driver of health care costs and will continue to drive the total cost [9]. Profits of pharmaceutical companies have increased, and prescription drug costs have risen to $1,443 per person per year [10]. In 2016, $329 billion of the $3,337 billion spent on health care went toward prescription drugs [11].

Multiple regression is used to model the relationship between several explanatory variables taken together and a response variable, and to predict future values of the response variable. The optimal model is chosen by selecting the “best” predictors to explain the variation of the response variable. The multiple linear regression model is given by the equation below:

$$y = \beta_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$$

Figure 1. Multiple Linear Regression Formula

The y-intercept is β₁, the parameters β₂ and β₃ are the slopes for the explanatory variables x₂ and x₃, and ε represents the error [12].
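As a minimal sketch of how such a model is fitted, the Python snippet below estimates the intercept and the two slopes by ordinary least squares. The study itself used R; NumPy, the variable names, and the small synthetic data set here are illustrative assumptions and are not the project's data.

```python
import numpy as np

# Synthetic, noise-free data (assumption): y = 2 + 3*x2 - 1.5*x3.
x2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x3 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = 2.0 + 3.0 * x2 - 1.5 * x3

# Design matrix: a column of ones for the intercept (beta_1),
# then one column per explanatory variable (beta_2, beta_3).
X = np.column_stack([np.ones_like(x2), x2, x3])

# Ordinary least squares: choose beta to minimize the sum of squared errors.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_1, beta_2, beta_3 = beta

print(f"intercept={beta_1:.3f}, slope_x2={beta_2:.3f}, slope_x3={beta_3:.3f}")
```

Because the synthetic response contains no noise, the fit recovers the generating coefficients; with real data the error term absorbs whatever the explanatory variables do not explain.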

Limited research has been conducted regarding specific contributors to increasing health care costs. Regression analysis in this project can determine the role of each factor in the increase of health spending and help health officials improve current policies to lower consumer costs [13]. The explanatory variables were the number of diabetic cases in millions, gross domestic product (GDP) in dollars, income per capita in dollars, Medicare enrollment in number of people, the population in millions, pharmaceutical costs in dollars, and life expectancy (Table 1). The response variable was health care costs per capita in dollars. Based on previous research, it was hypothesized that pharmaceutical costs are the greatest contributor to health care costs, as a survey found that 87% of health care workers believe the rising costs are due to high pharmaceutical costs [10].

Methods and Materials

Historical data from 1966-2016 were collected from government-sponsored sites. R Studio was used to perform the regression and statistical analysis. After the data were imported into R, missing values for a variable were interpolated from its non-missing observations. Individual correlations between each explanatory variable and the response variable were displayed in a matrix scatterplot in R Studio.
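The interpolation step can be sketched as follows. The text does not say which interpolation method was used, so linear interpolation between the nearest known neighbours is an assumption here, and the function name and the numbers are hypothetical.

```python
def interpolate_missing(years, values):
    """Fill None entries by linear interpolation between the nearest
    non-missing neighbours. Leading or trailing gaps are left as None."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            # Fraction of the way from the left known year to the right one.
            t = (years[i] - years[left]) / (years[right] - years[left])
            filled[i] = values[left] + t * (values[right] - values[left])
    return filled

# Hypothetical series with two gaps (values are made up for illustration).
years = [2000, 2001, 2002, 2003, 2004]
cases = [10.0, None, 12.0, None, 15.0]
print(interpolate_missing(years, cases))  # [10.0, 11.0, 12.0, 13.5, 15.0]
```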

A t-test was performed on a simple linear regression for each explanatory variable, followed by partial t-tests on an all-inclusive multiple regression. Using a significance level of 0.1, variables with p-values above 0.1 were removed from the final model because they were not statistically significant [14]. The F value, which indicates the overall significance of a model, was calculated to compare the fit of the regression models. The R² statistic was calculated to measure the proportion of variation in the response variable explained by the model: the higher the value, the better the model explains the response [15].
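The relationship between R², adjusted R², and the overall F value can be checked with a few lines of Python. This is a sketch using the standard textbook formulas; the sample size of 51 is an assumption (one observation per year for 1966-2016), and the R² inputs are the rounded values reported in Tables 2 and 9.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def f_statistic(r2, n, p):
    """Overall F statistic for the null hypothesis that all slopes are zero."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))

n = 51  # assumption: one observation per year, 1966-2016

# R^2 values as reported (rounded) in Tables 2 and 9.
print(adjusted_r2(0.9988, n, 7))  # model with all seven predictors
print(adjusted_r2(0.9970, n, 2))  # final model: pharm + enroll
print(f_statistic(0.9970, n, 2))
```

Under that assumption the adjusted R² values come out at approximately 0.9986 and 0.9969, in line with the tables. The F value recomputed this way differs somewhat from the reported one because it is very sensitive to rounding in R² when R² is this close to 1.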

Stepwise regression is a variable selection method that was also used to assess significance, based on the adjusted R² and the Akaike information criterion (AIC). The adjusted R² increases only if an added variable improves the regression model more than would be expected by chance, which guards against overfitting [16]. AIC estimates how well a model can predict values: the lower the AIC, the better the model [17]. These statistics were applied in three types of stepwise regression (forward selection, backward elimination, and combined), which add or remove variables one at a time. The three procedures were performed once including the insignificant variables and once excluding them to check for any oddities in the patterns. The results determined the best drivers to incorporate into the final model.
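Forward selection by AIC can be sketched in Python as below. This is an illustration on synthetic data, not the R procedure used in the study; the AIC form n·log(RSS/n) + 2k (defined up to an additive constant), the function names, and the data are all assumptions.

```python
import numpy as np

def aic(X, y):
    """AIC up to an additive constant: n*log(RSS/n) + 2*k,
    where k is the number of fitted coefficients."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + 2 * k

def forward_select(data, y):
    """Greedy forward selection: at each step add the variable that
    lowers AIC the most; stop when no addition lowers it."""
    n = len(y)

    def design(names):
        # Intercept column plus one column per selected variable.
        return np.column_stack([np.ones(n)] + [data[v] for v in names])

    chosen, remaining = [], list(data)
    best = aic(design(chosen), y)  # intercept-only model
    while remaining:
        scores = {v: aic(design(chosen + [v]), y) for v in remaining}
        candidate = min(scores, key=scores.get)
        if scores[candidate] >= best:
            break
        best = scores[candidate]
        chosen.append(candidate)
        remaining.remove(candidate)
    return chosen, best

# Synthetic example (assumption): y depends on a and b; c is pure noise.
rng = np.random.default_rng(0)
n = 60
data = {name: rng.normal(size=n) for name in ("a", "b", "c")}
y = 1.0 + 2.0 * data["a"] - 3.0 * data["b"] + rng.normal(scale=0.1, size=n)

chosen, best_aic = forward_select(data, y)
print(chosen, best_aic)
```

Backward elimination runs the same comparison in reverse, starting from the full model and dropping the variable whose removal lowers AIC the most.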

Although the explanatory variables were highly correlated with the response variable, high collinearity among the explanatory variables themselves created redundancy, which required removing certain variables. Variance inflation factor (VIF) tests were performed to detect multicollinearity among the explanatory variables; high VIFs indicated high multicollinearity. If a VIF exceeded the low-multicollinearity range of 1-5, variables were removed to lower the value [18]. Relative importance, measured with the lmg metric, was then computed in R to observe the weight of each factor on the response variable. For the final regression model, t-tests were performed and the R² and F values were calculated to assess how well the model explained the variation in the response [19].
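A VIF can be computed directly from its definition, VIF = 1 / (1 − R²), where R² comes from regressing one predictor on all the others. The Python sketch below illustrates this; the function name and the toy data are assumptions, and it is not the R routine used in the study.

```python
import numpy as np

def vif(columns):
    """Return {name: VIF}, where VIF_j = 1 / (1 - R^2_j) and R^2_j comes
    from regressing predictor j on all other predictors plus an intercept."""
    names = list(columns)
    n = len(columns[names[0]])
    result = {}
    for target in names:
        y = np.asarray(columns[target], dtype=float)
        others = [np.asarray(columns[v], dtype=float)
                  for v in names if v != target]
        X = np.column_stack([np.ones(n)] + others)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ beta) ** 2)
        tss = np.sum((y - y.mean()) ** 2)
        result[target] = 1.0 / (1.0 - (1.0 - rss / tss))
    return result

# Toy predictors (made up): x2 tracks x1 closely, so both VIFs are large.
predictors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [1.1, 1.9, 3.2, 3.8, 5.1, 6.0],
}
v = vif(predictors)
print(v)
```

With exactly two predictors, both VIFs reduce to 1 / (1 − r²) for their pairwise correlation r, so the two values are always equal.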

The future values of the significant drivers from 2017-2026 were downloaded from government-sponsored sites [20]. Using Microsoft Excel, the predicted values from the datasets were inputted into the calculated multiple linear regression equation to project values of the response variable. A 3D bubble chart was created to incorporate the explanatory variables with the response variable represented by the size of the bubbles (Figure 3). Finally, a line graph was created to observe the overall increase in health care costs from 1966-2026.

Results

Table 1. Regression Modeling of U.S. Health Expenditure: Variable Definitions

Explanatory Variables:

Medicare Enrollment (# of people)	enroll
Life Expectancy (age)	life
Pharmaceutical Costs ($/retail drug)	pharm
Population (# in millions)	pop
Gross Domestic Product ($ in billion)	gdp
Income ($ per capita)	income
Diabetes (# of cases in millions)	diabetes

Response Variable:

Health Care Costs ($ per capita)	healthcost

Figure 2. Regression Modeling of U.S. Health Expenditure: Matrix Scatterplot

Table 2. Regression Modeling of U.S. Health Expenditure: Partial T-test

Variable	Coefficient	P-Value	Standard Error
enroll	0.00005382	0.01957	0.00002219
life	-128.3	0.04262	61.42
pharm	1.823	0.03214	0.8234
income	-0.08533	0.09004	0.0492
pop	22.79	0.02207	9.595
gdp	0.4343	0.00307	0.1384
diabetes	41.24	0.13832	27.31

R² = 0.9988, Adjusted R² = 0.9986, F = 4961, α = 0.1

Table 3. Regression Modeling of U.S. Health Expenditure: Forward Selection without Diabetes

Variable	AIC (1)	Variable	AIC (2)	Variable	AIC (3)
gdp	544.89	income	509.13	pharm	508.09
income	624.52	pharm	516.19	enroll	509.10
pharm	639.11	pop	521.88	pop	509.11
pop	659.54	life	525.18	<none>	509.13
enroll	668.04	enroll	541.42	life	509.73
life	725.22	<none>	544.89
<none>	823.19
Variable	AIC (4)	Variable	AIC (5)	Variable	AIC (6)
enroll	498.85	pop	498.11	life	496.36
pop	505.20	<none>	498.85	<none>	498.11
<none>	508.09	life	499.14
life	508.20

Table 4. Regression Modeling of U.S. Health Expenditure: Backward Elimination and Combined

Variable	AIC
<none>	496.36
income	497.66
life	498.11
pop	499.14
gdp	503.83
enroll	506.78
pharm	512.10

Table 5. Regression Modeling of U.S. Health Expenditure: Forward Selection with Diabetes

Variable	AIC (1)	Variable	AIC (2)	Variable	AIC (3)
gdp	544.89	income	509.13	diabetes	500.66
income	624.52	pharm	516.19	pharm	508.09
diabetes	628.40	diabetes	516.99	enroll	509.10
pharm	639.11	pop	521.88	pop	509.11
pop	659.54	life	525.18	<none>	509.13
enroll	668.04	enroll	541.42	life	509.73
life	725.22	<none>	544.89
<none>	823.19
Variable	AIC (4)	Variable	AIC (5)
pop	498.05	<none>	498.05
<none>	500.66	life	498.60
enroll	501.28	pharm	499.29
pharm	502.63	enroll	499.91
life	502.66

healthcost = -3551.3247 + 0.7686(gdp) – 0.1627(income) + 68.086(diabetes) + 18.2642(pop)

Table 6. Regression Modeling of U.S. Health Expenditure: Three Possible Models

Three Possible Models	R²	Adjusted R²	F
Model 1: gdp+income+pop+diabetes	0.9985	0.9984	7888
Model 2: gdp+income+pop+enroll+pharm	0.9986	0.9984	6412
Model 3: all variables	0.9988	0.9986	4961

Table 7. Regression Modeling of U.S. Health Expenditure: VIF Tests

pop ~ enroll + pharm + gdp + income
enroll	pharm	gdp	income
70.57697	107.74866	879.56958	352.35545
enroll ~ pharm + gdp + income + pop
pharm	gdp	income	pop
76.68112	670.55560	1013.97972	361.94825
gdp ~ enroll + pharm
enroll	pharm
7.987123	7.987123

Table 8. Regression Modeling of U.S. Health Expenditure: Relative Importance

enroll	pharm
48.85788%	50.84499%

Variance explained by model: 99.7%

Table 9. Regression Modeling of U.S. Health Expenditure: F-test and T-test on Final Model

R²	Adjusted R²	F
0.997	0.9969	8053

Variable	Coefficient	P-Value	Standard Error
enroll	0.0001349	<2e-16	0.000006988
pharm	5.581	<2e-16	0.2119

Table 10. Regression Modeling of U.S. Health Expenditure: Projected Costs

Year	pharm	enroll (in million)	Projected health costs	Growth rate	CMS projected health costs	CMS growth rate
2017	$1,039	57.6	$10,859.90		$10,724
2018	$1,097	59.3	$11,412.93	5.09%	$11,193	4.4%
2019	$1,148	61.1	$11,940.38	4.62%	$11,670	4.3%
2020	$1,209	62.9	$12,523.64	4.88%	$12,230	4.8%
2021	$1,280	64.8	$13,176.20	5.21%	$12,804	4.7%
2022	$1,358	66.6	$13,854.34	5.15%	$13,394	4.6%
2023	$1,441	68.4	$14,560.38	5.10%	$14,024	4.7%
2024	$1,530	70.2	$15,299.91	5.08%	$14,690	4.7%
2025	$1,617	72.0	$16,028.28	4.76%	$15,365	4.6%
2026	$1,717	73.7	$16,815.71	4.91%	$16,168	5.2%

Figure 3. 3D Model of Final Multiple Linear Regression Model

Figure 4. Representation of Historical and Projected Health Care Costs

The scatterplots in Figure 2 reveal high correlations between each explanatory variable and the response variable; therefore, all the drivers were preliminarily treated as significant. Diabetes was the only variable that required interpolation because of missing data points. In the partial t-test (Table 2), Diabetes was eliminated from the final model because its p-value (0.13832) was greater than the significance level of 0.1, labeling the driver as insignificant. Medicare Enrollment had by far the smallest standard error (0.00002219), while Life Expectancy had the largest (61.42). The R² was also extremely high, showing that the model fit the data closely. Because Diabetes was eliminated by the partial t-test, the variable was excluded when performing stepwise regression. At one point in forward selection (Table 3), the variable life was insignificant, but after all the other variables were added, it proved significant. In backward elimination and combined stepwise (Table 4), no variables were eliminated from the equation, similar to the results from forward selection. The following model was created as a result of the stepwise functions that included all variables except Diabetes:

healthcost = 2705 + 2.656(pharm) – 112.5(life) + 0.00006941(enroll) – 0.08558(income) + 19.81(pop) + 0.4197(gdp)

Regression analyses were performed with all variables including Diabetes. Backwards elimination and combined stepwise resulted in the inclusion of all variables. In forward selection, Life Expectancy, Pharmaceutical Costs, and Medicare Enrollment were excluded (Table 5), which suggests a relationship between the three excluded variables and Diabetes in terms of prediction of health care costs.

Based on the alias coefficients, Life Expectancy and Medicare Enrollment were perfectly collinear. As mentioned above, at one point during forward selection, Life Expectancy was insignificant. From these observations, Life Expectancy was excluded from the final model. Three possible models were proposed from these observations, and the statistics calculated for all three were extremely high (Table 6). Because Diabetes was originally eliminated from the model due to its high p-value, Model 2 (GDP, income, population, Medicare enrollment, and pharmaceutical costs) was chosen as the optimal model. This model not only had a high R² value and a high F value, but it also included a reasonable number of variables with justifiable reasons.

After the model was chosen, the multicollinearity among the variables was examined through VIF tests (Table 7). When more than three variables were put into the function, the VIF values were very high, ranging from about 70 to over 1,000. To lower the VIF, the variables life expectancy, population, GDP, and income were removed. After different combinations were tried, a model with Medicare Enrollment and Pharmaceutical Costs was found to have the lowest VIF, 7.987123. Relative importance was computed to determine which of the two variables had a greater effect on health care costs (Table 8). Medicare Enrollment and Pharmaceutical Costs each explained roughly half of the variation in health care costs, 48.86% and 50.84% respectively, which identified pharmaceutical costs as the best predictor. Additionally, the model was assessed using an F-test, and all statistics were high, indicating how close the model was to a perfect fit (Table 9). The final multiple linear regression model was found:

Final Model: healthcost = -2709 + 5.581(pharm) + 0.0001349(enroll)
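The lmg metric (named after Lindeman, Merenda, and Gold) assigns each predictor the average gain in R² it produces across all possible orders of entering the predictors. A minimal Python sketch on synthetic data is shown below; the function names and the data are assumptions, and this is not the R routine used in the study.

```python
import numpy as np
from itertools import permutations

def r_squared(y, cols):
    """R^2 of an OLS fit of y on the given columns plus an intercept."""
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - rss / tss

def lmg(data, y):
    """Average each predictor's sequential R^2 gain over all entry orders."""
    names = list(data)
    orders = list(permutations(names))
    shares = {v: 0.0 for v in names}
    for order in orders:
        included, prev = [], 0.0
        for v in order:
            included.append(v)
            cur = r_squared(y, [data[u] for u in included])
            shares[v] += (cur - prev) / len(orders)
            prev = cur
    return shares

# Synthetic, correlated predictors (assumption, for illustration only).
rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(scale=0.5, size=n)

shares = lmg({"x1": x1, "x2": x2}, y)
print(shares, sum(shares.values()))
```

By construction the shares add up to the R² of the full model, which is consistent with Table 8, where 48.86% and 50.84% sum to the 99.7% of variance explained.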

Overall, the health care costs projected by the model were very close to the costs projected by CMS, though the model’s growth rates were higher than the CMS rates in every year except 2026 (Table 10) [7]. The calculated values were visualized with a 3D bubble graph (Figure 3). Medicare Enrollment was the x-axis, Pharmaceutical Costs was the y-axis, and the health care costs were represented by the size of the bubble. A line graph of health care costs from 1966-2026 showed close-to-exponential growth (Figure 4).
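The projections in Table 10 can be reproduced by plugging the inputs into the final model. The Python sketch below does this for 2017 and 2018 (the study used Microsoft Excel for this step); the function and constant names are illustrative, and enrollment is converted from millions to number of people because the coefficient is expressed per person.

```python
# Coefficients of the final model as reported in the text.
INTERCEPT = -2709.0
B_PHARM = 5.581        # per dollar of pharmaceutical cost
B_ENROLL = 0.0001349   # per Medicare enrollee (number of people)

def project_cost(pharm, enroll_millions):
    """Projected health care cost per capita in dollars."""
    return INTERCEPT + B_PHARM * pharm + B_ENROLL * enroll_millions * 1_000_000

# Inputs for 2017 and 2018 taken from Table 10.
cost_2017 = project_cost(1039, 57.6)
cost_2018 = project_cost(1097, 59.3)
growth = (cost_2018 / cost_2017 - 1) * 100

print(f"2017: ${cost_2017:,.2f}")                          # 2017: $10,859.90
print(f"2018: ${cost_2018:,.2f}  growth: {growth:.2f}%")   # 2018: $11,412.93  growth: 5.09%
```

The growth rate for each year is the percentage change from the previous year's projected cost, which is why Table 10 has no growth rate for 2017.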

Discussion and Conclusions

The purpose of this project was to use regression analysis to understand which driver best explains the variation in the response variable and to predict future health care costs in the U.S. It was hypothesized that pharmaceutical costs were the most significant factor in the high health care costs. The optimal model included solely the significant drivers, Medicare enrollment and pharmaceutical costs. Based on the relative importance values, pharmaceutical costs (50.84%) had a greater effect on health care costs than Medicare enrollment (48.86%). Little research has been conducted on Medicare enrollment’s contribution, but one study stated that the expensive costs of Medicare were unexpected, further increasing costs of the Medicare program [3]. Based on the findings, the hypothesis was supported because pharmaceutical costs were not only a significant variable, but were also part of the model with the lowest multicollinearity (Table 7) and had the higher relative importance, 50.84%.

Other studies concluded that pharmaceutical costs pose the biggest threat in terms of rising health care costs in the U.S. Research has suggested that pharmaceutical costs drive health care costs in the U.S. more than in any other high-income country [21]. Moreover, the CMS projects that pharmaceutical costs will grow in parallel with health care costs and account for more than 17% of the reason for rising costs of premiums in hospitals [9]. One reason for the high costs is overuse, caused by drug advertisements and prescriptions patients do not know enough about [11]. As the highest spender, the U.S. spends up to 117% more on prescription drugs than other countries and produces 57% of the world’s chemical products like prescription drugs [22].

Based on the projected values, all three variables (pharmaceutical costs, Medicare enrollment, and health care costs) are expected to grow rapidly over the next decade. Though the growth rate generally stays constant, health care costs per person were projected to be $16,000-$17,000 in 2026. Additionally, a few values for Diabetes were interpolated, which may explain its weaker correlation and why it was the only insignificant variable initially eliminated by the partial t-test.

One potential source of error is any bias associated with the original datasets. Although the datasets were trusted, different databases and sites may hold differing values that could lead to different results. Using a larger year range of data would also improve the accuracy of the results. For further experimentation, different final models can be used to see how health care costs change when other variables are included. Additional drivers of health care costs such as physicians’ income, medical service cost, and medical technology could be included to observe how they affect the costs. To further analyze the results, more tests could have been run. For example, confidence levels could be used to determine multicollinearity, and other automatic search methods could be tested in addition to stepwise.

Pharmaceutical costs have a huge effect on the rising health care costs. Consequently, pharmaceutical companies should formulate alternative prescriptions to offer patients a larger range of prescription drugs to select from [11]. As a whole, the U.S. should focus on the major drivers that cause the increase in health care costs in order to help reduce costs across the nation. The nation needs to focus on objectives to help families pay for health care and control the ever-increasing health care costs.
