Predicting Crop Soil Acidity Via Machine Learning


Abstract

Our aim for this project was to build a machine learning model that predicts soil pH levels with minimal error. Using a nine-layer neural network, we predicted the level of soil acidity for rice and maize. The variables used to predict soil acidity were potassium concentration, nitrogen-to-soil ratio, and phosphorus concentration within a plant. Our final result was a mean squared error of 0.5168. With these results, we fulfilled our goal for this project and believe this neural network model can help advance technology in the farming industry. Our research is important because soil acidity is one of the most important factors determining plant health and harvest yield, and by using our model to help optimize soil pH, crop yields may become larger.

Introduction

Rice and maize are vital crops that serve as dietary staples for millions of people globally. The soil's chemical properties, specifically its pH level, play a crucial role in determining plant health and growth. Soil pH influences nutrient availability, bacterial activity, and various physiological processes in plants. An acidity level that is too high or too low can greatly harm a plant and its yield. By using machine learning to accurately predict soil acidity, farmers can save the large amounts of time that would otherwise be spent measuring the acidity of acres of soil. Measuring soil pH manually can yield slightly more accurate results, but it takes an extremely long time compared to predicting it with a machine learning model.

Agriculture is a massive industry, with new innovation and technology constantly boosting crop yields and production. Farmers are a necessity to satisfy the global demand for food. Originally, the plan was to find a dataset with which we could predict crop growth. Unfortunately, after an extensive search turned up no suitable dataset, we adjusted our expectations and settled on another dataset, with the goal of predicting soil acidity instead. After some background research on the factors that affect a crop's health, we chose a dataset containing features such as potassium concentration, nitrogen-to-soil ratio, and phosphorus concentration, all of which is valuable information for predicting soil pH.

How does soil acidity affect plants?

A soil's acidity level should be one of the first things any farmer or homeowner checks before planting crops. When the soil is too acidic, elements like aluminum and manganese become more available and toxic to the plant, while necessary elements like calcium, phosphorus, and magnesium become much less available. Several aspects of a plant's health are affected by soil acidity; for example, in highly acidic soil, a plant's growth can be stunted. The ideal soil acidity level varies heavily among fruits and vegetables: blueberries thrive in very acidic soil, while most vegetables prefer neutral or slightly acidic soil. Highly acidic soil is also more susceptible to erosion, which can greatly affect a plant. Unfit conditions stress the plant and weaken its overall system, making it more vulnerable to environmental stress1. By optimizing soil acidity using machine learning, farmers can achieve a much larger crop yield. Figure 1 is a chart representing how soil acidity affects plant growth. This paper investigates how variations in soil pH affect the growth of rice and maize and presents an innovative approach that utilizes machine learning to accurately predict soil pH levels.

Figure 1: Soil pH chart

These levels are generally well accepted within the farming community. Most species of rice thrive at an acidity level of around 5.52. The most direct effect of soil acidity on rice is on the equilibrium of sulfides and ferrous iron: too much of either is toxic to the crop, while a stable amount of both provides a good growing environment. The ideal soil acidity level for maize is slightly higher, around 6 to 6.5. Maize also has a problem when the pH climbs too high, as the crop develops an iron deficiency3.

Figure 2 – chart that explains how acidic, neutral, and alkaline soils affect plants.

Methods

The dataset chosen contained numerical data stored in a CSV file from Kaggle. There were several preprocessing steps. First, the original dataset was filtered so that it only contained data from rice and maize. The filtered data was then stored in a new dataframe intended for use with the model. The cleaned dataset contained 1800 numerical and textual values: 200 rows and 9 columns, meaning there were 200 individual samples used in the model. I then searched for outliers between soil acidity and the nitrogen-to-soil ratio, potassium concentration, and phosphorus concentration, putting soil acidity on the Y axis and each of the 3 values individually on the X axis of a box plot. There were no major outliers, so I didn't have to remove any data entries, and there were also no null values. With these preprocessing steps done, the first step in my exploratory data analysis was finding any features that correlated strongly with soil acidity. To accomplish this, I used a heatmap as well as another library known as ppscore.
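The preprocessing steps above can be sketched in pandas. The column names here ("label", "ph", "N", "K", "P") are hypothetical stand-ins for the actual Kaggle columns, and the 1.5×IQR rule stands in for the visual box-plot inspection:

```python
import pandas as pd

# Tiny made-up frame standing in for the Kaggle CSV; real column names may differ.
df = pd.DataFrame({
    "label": ["rice", "maize", "wheat", "rice"],
    "ph":    [5.6, 6.2, 7.1, 5.4],
    "N":     [80, 70, 60, 85],
    "K":     [40, 35, 30, 45],
    "P":     [45, 50, 40, 48],
})

# 1) Keep only rice and maize rows, stored in a new dataframe.
crops = df[df["label"].isin(["rice", "maize"])].reset_index(drop=True)

# 2) Confirm there are no null values.
assert crops.isnull().sum().sum() == 0

# 3) Flag outliers numerically with the 1.5*IQR rule (a stand-in for
#    visually inspecting box plots of pH against each nutrient).
q1, q3 = crops["ph"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = crops[(crops["ph"] < q1 - 1.5 * iqr) | (crops["ph"] > q3 + 1.5 * iqr)]
print(len(crops), len(outliers))
```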

Figure 3 – Seaborn heatmap

The heatmap feature in seaborn displays a correlation score between -1 and 1 for each pair of features. A score close to 1 means there is a very strong positive correlation between two features. The target chosen for prediction, soil pH, did not correlate strongly with the other features, which made later stages of this project harder. The pandas profiling library was used to provide a more comprehensive report of the data. Through this investigation, it was concluded that there was no dominant trend in the dataset, which was highly diverse due to the lack of major correlations. The weak correlations demonstrated by the dataset also put non-linear models, such as neural networks and decision trees, into consideration during our model building phase.
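The correlation check can be reproduced with pandas' built-in Pearson correlation, which computes the same matrix that the seaborn heatmap in Figure 3 visualizes. The data and column names here are illustrative stand-ins:

```python
import pandas as pd

# Made-up sample values; "ph", "N", "K" stand in for the real columns.
df = pd.DataFrame({
    "ph": [5.6, 6.2, 5.4, 6.5, 5.9],
    "N":  [80, 70, 85, 65, 75],
    "K":  [40, 35, 45, 33, 38],
})

corr = df.corr()  # Pearson coefficients, each between -1 and 1
# sns.heatmap(corr, annot=True)  # would render a Figure 3-style heatmap
print(corr["ph"].round(2))
```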

The model building process began by building a simple base model, which gave a relative starting point. The first model was a linear regression model which incorporated a train-test split, where 80% of the data was used for training and 20% was used for testing. The two statistics used to measure the model were MSE (mean squared error) and R2 score; both are widely used in machine learning. Mean squared error works by subtracting each predicted value from the actual value, squaring the difference, and averaging over the dataset.

Figure 4 – the formula for mean squared error

The R2 score works a bit differently from mean squared error. It is typically reported as a value up to 1 (or 100%): 100% means the predictions match the actual data perfectly, while a score near 0 means the predicted values have very little correlation with the actual values. It is mainly used as a metric to determine the proportion of the variance in the actual values that the model explains4.

Figure 5: R2 score formula
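The two formulas in Figures 4 and 5 can be computed by hand to check the metrics against a tiny made-up set of actual and predicted pH values:

```python
import numpy as np

# Illustrative values only; the paper's metrics came from the real dataset.
y_true = np.array([5.5, 6.0, 6.5, 7.0])
y_pred = np.array([5.7, 5.9, 6.4, 7.2])

# Figure 4: MSE = mean of squared differences.
mse = np.mean((y_true - y_pred) ** 2)

# Figure 5: R^2 = 1 - SS_res / SS_tot.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
r2 = 1 - ss_res / ss_tot

print(round(mse, 4), round(r2, 4))  # → 0.025 0.92
```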

The base linear regression model utilized an 80/20 train-test split. This split was chosen because it seemed the most suitable: having too little training data would result in underfitting, while having too little testing data wouldn't allow for an accurate evaluation and would add variance, making the testing accuracy hard to trust. The baseline model got a test MSE of 0.8326 and a test R2 score of -0.04. A negative R2 score indicates that the model predicted less accurately than simply taking the average of all the values. This model was discarded for a similar reason to the polynomial regression model: the relationships in the dataset were non-linear, so linear regression wasn't a good fit, and it also demonstrated the worst testing performance of all the models. The next model was a polynomial regression model, which got a test MSE of 0.4466. A grid search was conducted to find the best degree for the model, which was 3. I used a graph to depict all the data points, and the model would plot a line on the graph mapping out its predictions. In the end, polynomial regression was not compatible with this dataset because the data points were scattered too widely across the graph for a proper prediction, and there was no linear relationship between the features, so this model was discarded as well.
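The degree search can be sketched with a scikit-learn pipeline and GridSearchCV. The data below is synthetic, generated from a known cubic so the expected best degree is 3; the paper's search ran over the actual soil dataset:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Synthetic cubic data (an assumption for illustration, not the soil data).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(120, 1))
y = 0.5 * X[:, 0] ** 3 - X[:, 0] + 0.2

# Pipeline: expand features to the candidate degree, then fit linearly.
model = make_pipeline(PolynomialFeatures(), LinearRegression())
search = GridSearchCV(
    model,
    {"polynomialfeatures__degree": [1, 2, 3]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
best_degree = search.best_params_["polynomialfeatures__degree"]
print(best_degree)  # → 3
```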

Figure 6: Polynomial regression model results

The decision tree regressor predicts the target value by using a tree-like model of decisions. Using a decision tree yielded testing results similar to polynomial regression.

Figure 7 – Description of decision tree process

Throughout all this time, I continued to experiment with training and testing sizes, as well as changing which features were used to predict soil pH. It was decided that this model would also use a train-test split. Continuously changing these small factors consumed a large amount of time while giving little to no value in terms of improving the model5. The decision tree gave a test MSE of 0.4764, while the MSE for the training data was 0.2838. Due to the large discrepancy between the training and testing MSEs, we believe this model was slightly overfitting the training data. The decision tree regressor was discarded because, compared to the neural network, it would require more preprocessing and tuning in order to perform better in the future. If many more soil samples were added to the dataset, the tree would need more pruning, and even with optimal preprocessing there was a good chance it could grow too deep and begin to overfit the data. A decision tree simply can't match the flexibility and adaptability of a neural network. The last and final model used on this dataset was a neural network.
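The overfitting diagnosis above can be reproduced on synthetic data: an unpruned decision tree memorizes the training set, so its training MSE sits far below its test MSE, the same kind of gap the paper observed between 0.2838 and 0.4764. The data here is made up for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic regression data with noise (an assumption, not the soil data).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 3))
y = X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

# An unrestricted tree fits the training set almost perfectly, while the
# label noise keeps the test MSE well above it.
train_mse = mean_squared_error(y_tr, tree.predict(X_tr))
test_mse = mean_squared_error(y_te, tree.predict(X_te))
print(round(train_mse, 4), round(test_mse, 4))
```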

Figure 8: Neural network structure

The neural network was inspired by the human brain and the way neurons communicate with each other to produce an output. A neural network must first consist of an input layer, which usually has the most nodes, or neurons, because it has to process the most data. Once the data passes through the input layer, it can flow through any number of nodes in the first hidden layer. There can be any number of hidden layers, and as more layers are added, each layer gradually gets a smaller number of nodes, because there is less data to represent. Once all the data has been processed through the hidden layers, one final layer, known as the output layer, outputs a predicted value.
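The flow described above can be sketched as a minimal NumPy forward pass: three inputs pass through shrinking ReLU hidden layers down to a single output. The weights here are random stand-ins; a trained network would have learned them:

```python
import numpy as np

# Layer widths shrink toward the single output, as described in the text.
rng = np.random.default_rng(0)
layer_sizes = [3, 8, 4, 1]  # input -> hidden -> hidden -> output

weights = [rng.normal(size=(m, n)) for m, n in zip(layer_sizes, layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def predict(x):
    a = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = a @ W + b
        # ReLU on hidden layers; the output layer stays linear for regression.
        a = np.maximum(z, 0) if i < len(weights) - 1 else z
    return a

sample = np.array([0.8, 0.4, 0.5])  # e.g. scaled N, K, P values
print(predict(sample).shape)        # a single predicted value
```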

In order to test the accuracy of many different neural network architectures, I constructed a loop which would compare multiple architectures against each other. Mean squared error was the metric used to compare the architectures, because all the models we had used so far had used the same metric. Each architecture was trained for 20 epochs: using more than 20 would have been computationally expensive and time-consuming, and most of the architectures had validation losses that stabilized around the 16-20 epoch range, which suggested this was a good number of epochs. ReLU activation was used for all the layers of every model, and the Adam optimizer was chosen for compiling every model. The batch size chosen was 16. We chose this size because a larger batch size came with a higher risk of overfitting and poor generalization. Larger batch sizes like 128 or 256 also require more memory, and if this dataset were to grow in the future, a larger batch size would become unfeasible due to the higher memory requirement. A smaller batch size also allows this model to run on a computer with less advanced hardware, making our model more accessible. This loop let me decide whether to use a complex neural network architecture with many layers and nodes or a simpler one. Because there were so many possible neural network structures in the loop, it took a lengthy amount of time for the loop to run and for me to compare the MSE of all the architectures. After filtering through all the architectures the loop tested, the one with the best results had 9 layers: 200, 100, 80, 50, 40, 20, 10, and 5 nodes, with the output layer having 1 node.
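The architecture-search loop can be sketched with scikit-learn's MLPRegressor as a lightweight stand-in for the Keras models in the paper; the synthetic data, the smaller candidate architectures, and the iteration count are all illustrative assumptions:

```python
import warnings
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (not the soil dataset).
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 3))
y = 5.5 + X @ np.array([1.0, -0.5, 0.8]) + rng.normal(0, 0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Candidate hidden-layer layouts; the last is the paper's 9-layer pick
# (plus the implicit 1-node output layer).
architectures = [
    (50, 20),
    (100, 50, 20),
    (200, 100, 80, 50, 40, 20, 10, 5),
]

results = {}
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide non-convergence warnings
    for layers in architectures:
        net = MLPRegressor(hidden_layer_sizes=layers, activation="relu",
                           solver="adam", batch_size=16, max_iter=200,
                           random_state=0).fit(X_tr, y_tr)
        results[layers] = mean_squared_error(y_te, net.predict(X_te))

# Pick the architecture with the lowest test MSE.
best = min(results, key=results.get)
print(best, round(results[best], 4))
```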

Results

In the end, I decided to use a neural network with 9 layers in total, including the output layer. The layers had 200, 100, 80, 50, 40, 20, 10, and 5 nodes, and the output layer had 1 node. Mean squared error was the metric used to test the final model. A neural network was chosen because it can learn nonlinear relationships better than linear models or shallow algorithms, which meant that I could predict soil acidity using a variety of features rather than just one. In the future, if this dataset were to expand, a neural network would also be much better suited to handle a larger dataset than models such as polynomial or linear regression.

Model                     Training MSE    Testing MSE
Linear regression         0.6988          0.8326
Polynomial regression     0.3675          0.4466
Decision tree regressor   0.2838          0.4764
Neural network            0.4203          0.5168

Our final result was an MSE of 0.5168. Compared to the starting model's MSE of 0.8326, an MSE of 0.5168 could be considered adequate, as it could be used to predict pre-collected soil samples with a margin of error small enough to allow the crop to thrive in its soil. All the models used a train/test split of 80% training data and 20% testing data. Although some models, such as polynomial regression and the decision tree regressor, produced better scores, a neural network was chosen over them because neural networks tend to handle non-linear relationships better and are better at learning and extracting features from raw data. The overall improvement in test MSE over the baseline was a decrease of 0.3158. We could tell this model was not overfitting or underfitting because it performed similarly on both the training and testing data. This model can continue to be improved and changed in order to gain better results.

Discussion

Our results were in line with our hypothesis: that we could accurately predict soil acidity using the potassium concentration, nitrogen-to-soil ratio, and phosphorus concentration of a plant. Comparing our results to a research paper from the National Library of Medicine, much of their work reported an RMSE (root mean squared error) between 0.75-0.836, while our RMSE was 0.7188. This is important because, in the real world, most crops only grow healthily within a certain range of acidity, and when farmers plant crops in soil several points off the intended acidity range, the crop struggles to grow and produce, which can lead to major inconsistencies in crop yield and potential economic losses for farmers. Predicting soil pH to within ±0.7188 of its true value helps farmers place crops where they can produce greater yields with reduced variation and improved sustainability. By applying neural networks to soil acidity and agriculture, this paper further shrinks the gap between these two fields.

While there are several studies on how machine learning can be used to predict soil acidity, our research is novel in the way we optimized our neural network for computational efficiency and predictive accuracy. Our research also differs in the features used to predict soil acidity, including potassium concentration, nitrogen-to-soil ratio, temperature, and rainfall.

One major limitation of our model is that our data was solely from India. This constraint hinders the model's generalizability, because rice and maize are grown in several different nations, all with varying soil and weather conditions. Specifically, India's climate is characterized by monsoon patterns, temperature ranges, and soil conditions that may not be representative of other maize- and rice-growing regions such as Southeast Asia, Africa, or the Americas. For instance, soil composition in India is rich in certain minerals that may not be present in other rice- and maize-growing regions, and India's monsoon-dependent agriculture differs from the irrigation systems used in regions such as North America. These geographical and systemic differences in the way other nations farm rice and maize mean that our model may not perform well when applied to data from countries where the environmental variables differ significantly from those in India. To address this limitation, future research should focus on acquiring and integrating data from a wider range of countries and regions. By incorporating datasets from several differing geographical locations, the model can be trained on a more diverse array of environmental conditions, enhancing its generalizability across many different agricultural settings.

Conclusion

Our final model was a 9-layer neural network, and its MSE was 0.5168. Using our model, we concluded that soil acidity can be predicted within a close enough range for the model to be used on typical farms. These findings are important because they bring artificial intelligence, which has only become prevalent across many fields in the past few years, to yet another aspect of farming.

References

  1. Barrow, N., and A. Hartemink. “The Effects of pH on Nutrient Availability Depend on Both Soils and Plants.” Plant and Soil, Mar. 2023, https://www.semanticscholar.org/paper/f91120836a1f3187459d4280d9d62d15f6e7956b []
  2. Yu, T. R. “Characteristics of Soil Acidity of Paddy Soils in Relation to Rice Growth.” SpringerLink, Springer Netherlands, 1 Jan. 1991, link.springer.com/chapter/10.1007/978-94-011-3438-5_12#:~:text=Many%20species%20of%20rice%20plants%20can%20grow%20well%20at%20soil%20pH%205.5 []
  3. Fernández, Fabián, and Robert Hoeft. “Managing Soil Ph and Crop Nutrients.” Managing Soil pH and Crop Nutrients, extension.cropsciences.illinois.edu/handbook/pdfs/chapter08.pdf. Accessed 14 Dec. 2024 []
  4. Kumar, Vijendra, et al. “Advanced Machine Learning Techniques to Improve Hydrological Prediction: A Comparative Analysis of Streamflow Prediction Models.” Water, July 2023, https://www.mdpi.com/2073-4441/15/14/2572 []
  5. Richard, Umeokwobi, and Ocheni Victor. “Nexus of Cryptocurrency and Output Gap in Nigeria: A Decision Tree Regression in Machine Learning Using Python Programming Language.” International Journal of Research and Innovation in Social Science, Jan. 2023, https://www.semanticscholar.org/paper/f879e757a292cb7cdb3b0db001ac0a2aafefdbf3.python-1e6e48aa7a47#:~:text=A%20decision%20tree%20is%20one. Accessed 26 Nov. 2023 []
  6. Yang, Meihua, et al. “Evaluation of Machine Learning Approaches to Predict Soil Organic Matter and PH Using Vis-NIR Spectra.” Sensors, vol. 19, no. 2, 11 Jan. 2019, p. 263, https://doi.org/10.3390/s19020263. Accessed 9 Dec. 2021 []
