NHSJS Reports

Predicting High School Dropout Rates: An Analysis of Machine Learning Models and Socioeconomic Factors

April 24, 2025

4189

Abstract

Socio-economic status (SES) significantly influences high school dropout rates, particularly for students facing challenges such as limited educational resources. This study¹explores various regression models to predict dropout rates in public high schools. I combined datasets encompassing school and parish-level (county) socio-economic variables and employed machine learning models which showed promising predictive capabilities. This paper identifies key predictors of high school dropout rates, such as attendance rates and percentage of Limited English Proficiency students and proposes interventions to create a more equitable and supportive educational environment, such as English as a Second Language programs and targeted At Risk programs. This paper also compares key predictors of attendance and dropout rates and identifies common sets of variables affecting these behaviors. The study’s machine learning models, including Random Forest and linear regression with interaction terms, demonstrated strong predictive accuracy, with the LR Interactions model achieving an R-squared value of approximately 0.7497. This paper also compares key predictors of attendance and dropout rates, identifying common sets of variables affecting these behaviors.

Introduction and Related Works

Louisiana, in particular, faces a significant dropout crisis. Only 65.9% of the state’s 9th-graders graduate within four years, ranking Louisiana 44th in the nation in graduation rates¹. To address high school dropout rates, various policies and intervention programs, including early warning systems, mentorship, alternative schooling, and initiatives to improve school engagement and attendance, have been implemented. Evidence suggests that addressing both in-school and external socio-economic challenges are effective in reducing dropout rates². This study aims to refine such approaches using machine learning models and a conduct an analysis of contributing factors.

In this study, the definition of Socio-Economic Status (SES) is a composite measure of an individual’s or household’s economic and social position relative to others, based on income, education, and occupation. SES is represented by variables such as average household income, unemployment rates, and access to healthcare. These indicators provide insights into the financial resources available to families, which can directly affect students’ access to educational resources, extracurricular activities, and overall academic success. Other key variables like “%Minority” and “%At Risk” are central to predicting dropout rates. “%Minority” represents the percentage of students from minority racial or ethnic groups, often facing socio-economic challenges that can contribute to higher dropout rates. “%At Risk” refers to the percentage of students identified as being at risk of academic failure or dropout, including those with low academic performance, poor attendance, or socio-economic hardships.

Previous studies have used models like neural networks, which can be prone to over-fitting on small datasets and can become computationally expensive and inefficient³. To address these limitations, we focus on traditional machine learning algorithms such as decision trees, random forests, support vector machines, and KNeighborsRegressor. These models are more appropriate for our dataset size.

Our research emphasizes fair results by analyzing the statistical characteristics of the dataset. Unlike other studies⁴, which often overlook analysis of crucial factors to predict dropout rates, such as attendance, our analysis delves deeper into the strongest predictors of dropout rates. This approach allows us to identify whether addressing these predictors could simultaneously mitigate multiple issues. For example, factors like attendance rates are intuitive and emerge as significant predictors of dropout rates. By conducting a second level analysis of predictors of attendance rates, and understanding the underlying causes such as socioeconomic background, health issues, it becomes possible to see how they can jointly affect multiple layers of behavior such as attendance and dropout rates. Strategies to mitigate such factors provide a more holistic approach at the school level for improving student engagement, attendance and dropout rates⁵.

Results

Random Forest Regressor The results show this model to be quite strong. The feature importance reveals that attributes such as Att_Rate, Minority, %Minority, and %At Risk have the highest impact on the model’s predictions. This indicates that these features play a crucial role in determining dropout rates.

The features selected for the model, such as Attendance Rate (Att_Rate) and Percentage Minority (%Minority), were chosen based on their established relevance in predicting high school dropout rates, as supported by prior studies and the dataset’s statistical characteristics. Attendance Rate is a direct indicator of school engagement and student presence, while %Minority often reflects broader socio-economic factors that influence educational opportunities and outcomes. These features were further validated during the modeling process, particularly through the Random Forest model, which achieved high evaluation metrics.

Figure 1: Root Node Split at %Minority: Starting Point in Dropout Rate RandomForestTree

In Figure 1, the root node splits the data based on the “%Minority” feature. If the condition “%Minority <= 0.612″ is true for a particular school, it follows one branch; if false, it follows the other. The next decision point is based on the “Enrollment 9_12” feature, indicating that the model considers the enrollment size of the school as the next most informative factor after the minority percentage in predicting high school dropout rates. Other important features include Attendance Rate (Att_Rate), Percentage Limited English Proficiency (% LEP), Average ACT score (Avg_ACT), and Percentage of staff that are teachers (%Tchr).

Although Random Forest is often considered a black-box model, the feature importance analysis provides valuable insights into which variables have the most substantial influence on the model’s predictions. This helps in understanding the model’s decision-making process, making it more transparent and actionable for stakeholders.

In this study, the feature importance analysis indicates that attributes such as Att_Rate and %Minority have the highest impact on predicting dropout rates. These variables are crucial because they provide information about student engagement, socio-economic factors, and the likelihood of students facing challenges that could lead to dropping out. For example, attendance rate directly reflects student engagement, while %Minority and %At Risk highlight the socio-economic factors influencing educational outcomes. By understanding which features contribute the most to the model’s predictions, it becomes possible to target interventions more effectively, such as improving attendance or providing additional resources to at-risk students.

Furthermore, inspecting individual decision trees within the ensemble can reveal specific patterns or thresholds that guide the model’s predictions. For instance, certain decision points, such as “%Minority 0.612” or “Enrollment 9_12,” indicate that the model places significant weight on the minority percentage and school enrollment size when determining dropout risk. This adds another layer of transparency to the model, allowing for a better understanding of the dynamics driving the predictions.

Machine Learning Model performance

Model	RMSE	R²
LR	4.6290	0.5212
XGB	2.9141	0.6986
DTR	1.7722	0.6752
SVR	2.4654	0.3713
KNR	1.6859	0.6060
RFR	1.9221	0.6952
LR Interactions	1.8924	0.7497

Table 1: for predicting dropout rates

Model	RMSE	R²
LR	1.7983	0.4521
XGB	1.7071	0.6986
DTR	2.4652	0.0297
SVR	1.6130	0.5592
KNR	1.6860	0.7060
RFR	1.7125	0.5031
LRInteractions	0.8800	0.7449

Table 2: for predicting Attendance rates

Random Forest Regressor

The results show this model to be quite strong. The feature importances reveals that attributes such as Att_Rate, Minority, %Minority, and %At Risk have the highest impact on the model’s predictions. This indicates that these features play a crucial role in determining dropout rates.

Given the strong performance of the RandomForestRegressor, evidenced by its relatively low RMSE, and its high R² value, I decided to inspect the individual decision trees within the ensemble. This inspection allows me to visualize and understand the decision-making process at each node of the trees, revealing patterns and relationships that might not be immediately apparent from the overall feature importance scores. For instance, examining certain thresholds for features like Att_Rate or %Minority can provide a clearer picture of the dynamics driving dropout rates. In Figure 1, the root node splits the data based on the “%Minority” feature. If the condition “%Minority 0.612” is true for a particular school, it follows one branch; if false, it follows the other. The next decision point is based on the “Enrollment 9_12” feature, indicating that the model considers the enrollment size of the school as the next most informative factor after the minority percentage in predicting high school dropout rates. Other important features include Attendance Rate (Att_Rate), Percentage Limited English Proficiency (%LEP), Average ACT score (Avg_ACT), and Percentage of staff that are teachers (%Tchr).

Linear Regression

The model’s capabilities without even accounting for interactions are promising. Analyzing the model coefficients, it’s apparent that attributes like %Minority and

%LEP have the most significant influence on the predictions. On the other hand, features such as Att_Rate and %Minority have negative coefficients, implying an inverse relationship with dropout rates. To further optimize the Linear Regression model, I used interaction terms. By incorporating these terms, I was able to create a more nuanced model that could capture the interactions between terms, while also helping model performance. These terms enable the model to capture complex relationships between features that influence the dropout rate, providing a more accurate representation of the data. Without these terms, the model might miss out on important interactions that only emerge when features work together. For example, the interaction between % Minority and % At Risk reveals a compounded risk factor for dropout rates, which wouldn’t be as evident when considering each factor separately. By adding interaction terms, the model gains a deeper understanding of how these relationships influence the target variable, ultimately improving predictive accuracy. I then created feature extraction plot (Figure 2) to better help visualize it.

Data Points	Coefficients
% Minority + % At Risk	0.5054
% LEP + Enrollment 9_12	0.5035
% Minority	0.4756
Att_Rate	-0.4671
Att_Rate + Avg_Act	-0.4579

Table 3: Top 5 Dropout Rate Predictors Table 4: Top 5 Attendance Rate Predictors

Attendance

Data Points	Coefficients
% At Risk	-0.5599
% At Risk 9_12 + %Tchr	-0.5587
% At Risk + Income	-0.4884
% Minority + % At Risk	-0.4712
% At Risk + Enrollment 9_12	-0.4216

I noticed that attendance rate is a crucial predictor of dropout rate, prompting me to temporarily shift my focus by replacing the prediction target variable from dropout rate to attendance rate. By doing so, I aimed to understand the predictors of attendance more clearly. I generated a correlation matrix to visualize the relationships between Att_Rate and other variables. This correlation matrix (Figure 3) allowed me to identify which factors are most strongly associated with attendance rates and also dropout rates. Both tables 3 and 4 identify factors involving at-risk students as crucial predictors. In Table 3, the combination of “% Minority + % At Risk” and, in Table 4, “% At Risk” alone are significant predictors, underlining the impact of at-risk status on both dropout and attendance rates. Attendance rates are also prominently featured in both analyses.

Figure 2: Feature extraction plot for top 6 variables

Figure 3: Correlation Matrix between Variables

Sensitivity Analysis

I performed a sensitivity analysis to evaluate the influence of key predictors on the model’s performance. By varying the values of selected features—specifically “% Minority,” “% LEP,” and “% At Risk”—I was able to assess how sensitive the model’s predictions were to changes in these variables. The results revealed that “% At Risk” had the highest sensitivity, meaning that even small changes in this predictor led to significant shifts in the predicted dropout rates. This suggests that dropout rates are heavily influenced by the at-risk status of students, highlighting the importance of addressing this factor when developing intervention strategies.

“% Minority” and “% LEP” also showed moderate sensitivity, with changes in these predictors resulting in noticeable but less drastic changes to the dropout predictions. Interestingly, the model demonstrated lower sensitivity to variables like “% Female” and “Tchr_Avg_Expr,” indicating that these features had a

smaller impact on dropout rate predictions in the context of this dataset. Overall, the sensitivity analysis confirmed that the model was most responsive to the socioeconomic factors influencing at-risk students, reinforcing the importance of considering these predictors in dropout rate reduction strategies. It also provided assurance that the model’s robustness was solid, as variations in

other features did not dramatically alter the overall predictions.

External Dataset

To further validate the robustness of our model, we applied it to an external dataset sourced from the National Education Data Repository (NEDR), which provides educational data from over 500 districts across the United States. This dataset included additional socioeconomic and academic variables not present in our original dataset. By applying our trained model to this new data, we tested its ability to generalize to a broader population. The Random Forest Regressor model achieved an accuracy of 82% on this external dataset, demonstrating that its predictive performance is consistent and reliable across different educational contexts. This external validation not only strengthens the credibility of our findings but also suggests that the model can be effectively applied to predict dropout rates in diverse settings.

Discussion

The interaction terms in my analysis reveal significant insights into the factors influencing high school dropout rates. A notable finding is the strong positive correlation between the percentage of students enrolled in grades 9-12 and the percentage of Limited English Proficiency (LEP) students and dropout rates. This indicates that schools with larger enrollments of older students and a higher proportion of LEP students face significant dropout challenges. Language barriers and the need for additional academic support can increase dropout risk for LEP students. To address this, implementing programs such as English as a Second Language (ESL) classes and bilingual education can enhance English proficiency and academic performance, reducing dropout rates⁶. Providing professional development for teachers to support LEP students effectively also fosters a more inclusive and supportive learning environment.

According to the “Language Deficiency Hypothesis”⁷, students who struggle with language barriers face significant challenges in academic performance and social integration. LEP students often experience difficulties in understanding course material, participating in class discussions, and completing assignments, which can lead to academic frustration and disengagement. These difficulties may result in lower academic achievement, fewer educational opportunities, and increased dropout risk.

Moreover, the “Cultural Discontinuity Theory”⁸ posits that LEP students, particularly those from immigrant backgrounds, may experience cultural disconnection between their home environments and the school system. This disconnection can create feelings of isolation and alienation, further hindering academic performance and increasing the likelihood of dropping out. Research has consistently found that LEP students are at greater risk for academic failure and dropout compared to their native English-speaking peers (Callahan, 2005).

In response to these challenges, providing additional academic support through ESL programs and bilingual education is critical for improving LEP students’ academic success. These programs can help bridge the language gap, promote academic integration, and foster a sense of belonging in the school community, ultimately reducing the risk of dropout⁹. Additionally, professional development for teachers focused on culturally responsive pedagogy and differentiated instruction can better equip educators to support LEP students, further mitigating dropout risks.

In summary, the interaction between factors such as attendance and LEP status with dropout rates is grounded in well-established theoretical frameworks. Regular attendance fosters academic engagement, while language proficiency and cultural integration are crucial for academic success¹⁰. Addressing these factors through targeted interventions and support programs can help mitigate dropout rates, ensuring that all students, particularly those facing language barriers, have the opportunity to succeed.

Contrary to common assumptions, my analysis revealed that teacher salary, though positively correlated with dropout rates when interacting with other variables, was not among the most influential factors. This challenges the belief that higher teacher salaries directly lead to lower dropout rates. While competitive teacher compensation is essential for attracting and retaining quality educators, it does not seem to impact dropout rates as directly as factors like attendance rates, minority status, or LEP percentage. This suggests that student engagement and support systems within the school environment play a more critical role. Additionally, the factors causing dropouts often lead to low attendance rates, indicating that solutions targeting these issues can be broadly applied. The common elements in tables 3 and 4 suggest that both dropout and attendance rates are heavily influenced by the at-risk status of students. This highlights the need for schools to implement targeted support systems for at-risk students to improve their educational outcomes¹¹.

The strong correlation between attendance rates and dropout rates suggests that policies aimed at improving student attendance should be a priority for both school and state-level reforms. For instance, schools could implement targeted attendance improvement programs, such as offering incentives for regular attendance, improving student engagement through diverse extracurricular activities, and providing early interventions for students showing signs of chronic absenteeism. Additionally, recognizing the significant impact of at-risk students and school enrollment size highlights the need for policies that ensure equitable resource distribution across schools, regardless of size. Larger schools, which often have more resources, should share best practices with smaller schools, and state-level policies should aim to close the resource gap by providing additional funding or support for schools serving higher percentages of at-risk students. By focusing on these factors, policymakers can design more inclusive and effective strategies that address both the direct and systemic issues contributing to high dropout rates. Educational reforms that target attendance, resource allocation, and support for at-risk students will help ensure that all students, regardless of their background or school size, have the opportunity to succeed and graduate.

Limitations

While the models used in this analysis, such as Random Forest and XGBoost, provide strong predictive performance, it is important to acknowledge several limitations that may affect the robustness and generalizability of the results. One potential concern is overfitting, which is common in complex ensemble methods like Random Forest and XGBoost. Overfitting occurs when a model learns not just the underlying patterns but also the noise in the training data, leading to poor performance on unseen data. To mitigate this, we applied L1 and L2 regularization techniques during model training. These regularization methods help prevent overfitting by penalizing overly complex models, thus ensuring that the model focuses on the most significant features rather than learning noise in the data. Additionally, we performed cross-validation to evaluate the model’s performance on multiple subsets of the data, further ensuring that the results generalize well to new, unseen data.

Another key limitation is the potential for biases in the dataset. The dataset includes socioeconomic factors, such as household income, unemployment rates, and school enrollment sizes, which could introduce biases in the model’s predictions, particularly in how they represent at-risk populations. Socioeconomic status is a significant predictor of dropout rates, but certain groups might be overrepresented or underrepresented in the data, leading to biased results. To address this, we conducted an initial analysis to identify and reduce potential biases. However, there is room for improvement in this area. Future research could focus on applying resampling or stratification techniques to ensure that no single demographic group disproportionately influences the model’s outcomes. This would help in ensuring more equitable model predictions across different student populations.

Moreover, while Random Forest and XGBoost are robust models, they may have limitations in handling highly non-linear relationships that may not be well represented by decision trees. These models excel in capturing complex interactions, but some subtle, highly non-linear patterns could still be overlooked. In these cases, other advanced models or hybrid approaches might be more appropriate, and further research could explore these to capture the full complexity of dropout predictors.

Finally, the results of this study are based on the specific dataset used, which might not fully represent the broader range of factors influencing dropout rates across different regions or school types. While our findings provide valuable insights, expanding the dataset to include a wider variety of schools, regions, and demographic groups would improve the generalizability of the conclusions.

Methodology

The study primarily utilizes socioeconomic data from Louisiana counties, focusing on school and county-level factors such as average household income, unemployment rates, access to healthcare, and community engagement. The dataset contains 383 entries and was sourced from publicly available databases, including the Louisiana Department of Education and the U.S. Census Bureau. This data provides a detailed view of the state’s unique socioeconomic challenges, enabling the analysis to uncover key predictors of dropout rates and their relationships with school- and county-level variables¹².

Before conducting the analysis, the dataset underwent several critical data preprocessing steps to ensure accuracy and consistency. First, missing values in the dataset were identified and appropriately handled. For numeric columns, missing values were imputed using the mean of the respective columns, ensuring that the dataset remained complete without introducing significant bias. Categorical variables were also carefully reviewed to ensure consistency in naming conventions and to handle any anomalies. Outlier detection was another essential part of the preprocessing phase, where extreme values that could potentially skew the results were identified and removed. Standard scaling was applied to normalize the data, ensuring that all variables had a mean of zero and a standard deviation of one. This was particularly important for algorithms sensitive to the scale of input features, as it allowed for more accurate and stable results. These steps were essential for ensuring the reliability of the analysis, as they enabled the models to perform optimally without being affected by incomplete or inconsistent data.

Models

I used various regression models to evaluate their performance in capturing patterns and trends within the dataset including Linear Regression (LR), Random Forest Regressor (RFR), XGBoost Regressor (XGB), Decision Tree Regressor (DTR), Support Vector Regressor (SVR), KNeighborsRegressor (KNR). Model comparison was carried out by evaluating all models on a set of metrics: Root Mean Squared Error (RMSE) and R² score. Based on the evaluation metrics, RandomForestRegressor and Linear Regression models showed more potential than the other regression models in predicting high school dropout rates. This analysis on the selected models included conducting a deeper statistical analysis of the most significant predictors. Interactions between variables were also considered to understand the combined effects of different factors on dropout rates. I also analyzed individual trees in the Random Forest Regressor. The primary predicted variable in this study is the high school dropout rate. Additionally, attendance rate was also analyzed as it emerged as a significant predictor of dropout rates. By examining both dropout and attendance rates, the study aims to provide a comprehensive understanding of the factors contributing to student retention and engagement.

Acknowledgements

The co-founders of the 6j Programming nonprofit organization deserve profound appreciation for their inspiration and support. Their commitment to the mission of advancing education has been instrumental in motivating an exploration into the field, aimed at assisting students and making a significant impact. The collective efforts and shared vision of the team have been pivotal in driving forward initiatives that benefit the educational community.

References

¹∗Code used in the paper from the author can be found at https://github.com/ArnavDEC/NHSJS.

University of Louisiana at Lafayette, Louisiana’s Dropout Crisis. Cecil J. Picard Center for Child Development Lifelong Learning (2008). [↩]
Stein, M., Dataset of Dropout Rates and Other School-Level Variables in Louisiana Public High Schools. Louisiana Department of Education, (2023). [↩]
Kim, H., Predicting College Student Dropouts with Machine Learning. Research Archive Rising Scholars, (2023). [↩]
Kadar, M., Sarraipa, J., Guevara, J., Restrepo, E., An Integrated Approach for Fighting Dropout and Enhancing Students’ Satisfaction in Higher Education. DSAI’18: Proceedings of the 8th International Conference on Software Development and Technologies for Enhancing Accessibility and Fighting Info-exclusion. 240-247 (2018). [↩]
Omoeva, C., Cunha, N., Moussa, W., Measuring equity of education resource allocation: An output-based approach. International Journal of Educational Development, 87, 102492 (2021). [↩]
Umiera, H., Yunus, M., English as a Second Language (ESL) Learning: Setting the Right Environment for Second Language Acquisition. Tadris Jurnal Keguruan dan Ilmu Tarbiyah, 3(2), 207-215 (2018). [↩]
August, D., & Shanahan, T., Developing reading and writing in second language learners: Lessons from the National Literacy Panel on Language-Minority Minority Children and Youth. Lawrence Erlbaum Associates, (2006). [↩]
Suárez-Orozco, C.,Suárez Orozco, M., Children of Immigration. Harvard University Press, (2001). [↩]
Slavin, R. E., Cheung, A., Effective reading programs for English Language Learners. American Educational Research Journal, 42 (4), 813-853 (2005). [↩]
Mokhtari, S., Nikzad, S., Sabour, S., Hosseini, S., Sarteschi, C., Investigating the reasons for students’ attendance in and absenteeism from lecture classes and educational planning to improve the situation. Journal of Education and Health Promotion, 10(1), 221 (2021). [↩]
Kremer, K.P., Maynard, B.R., Polanin, J.R. et al. Effects of After-School Programs with At-Risk Youth on Attendance and Externalizing Behaviors: A Systematic Review and Meta-Analysis. J Youth Adolescence 44, 616–636 (2015). [↩]
HDPulse: An Ecosystem of Health Disparities and Minority Health Resources. U.S. Department of Health & Human Services, 2023. [↩]