Investigating Traditional Machine Learning and Advanced Tabular Deep Learning Models for Water Potability Detection

December 3, 2024

Abstract

Water is vital for the survival of all living organisms, including human beings, whose bodies are approximately 60% water. Despite the abundance of water on Earth, most of it is saltwater, unsuitable for human consumption. Freshwater sources, although safer, can be contaminated by various pollutants, thereby affecting their potability. Ensuring access to potable water is crucial for public health. However, traditional water quality monitoring methods are resource intensive. Recently, machine learning has proven useful in automating complex pattern recognition tasks and improving predictive accuracy across various domains. This paper investigates and compares the performance of traditional machine learning algorithms, such as Random Forest and Gradient Boosting, with that of neural networks, such as xDeepFM and Cross DNN Nets, on a tabular drinking water quality dataset. The results, assessed with several evaluation metrics, indicated that Random Forest, Voting Classifier, xDeepFM, and Cross DNN Nets were superior in predicting water quality. Further, the experiments indicated that ensemble methods and certain deep learning models provided robust predictions, highlighting the potential of machine learning to improve water quality monitoring and management. Future work will focus on optimizing these models through hyperparameter tuning to enhance predictive accuracy further and support effective water management strategies.

Introduction

Water is an essential part of life. Without it, billions of humans would be unable to survive and go about their everyday lives. The human body, which itself is roughly 60% water, relies on water to execute many of its fundamental functions. The significance of water is further evident in the fact that a human cannot survive without water for more than a few days.

About 70% of the Earth is covered in water. Water from rivers, streams, creeks, lakes, glaciers, and aquifers contains low concentrations of salt and is the most suitable for human consumption. However, even these once-safe sources of drinking water can become contaminated by pathogens, chemicals, heavy metals, bacteria, viruses, parasites, pesticides, herbicides, fertilizers, lead, or industrial pollutants, making them unsafe for human consumption. The suitability of water for human consumption is referred to as potability.

Despite global efforts to improve access to and supply of potable water, the World Health Organization estimates that 1 in 10 people, 785 million total people in the world, continue to lack basic drinking water services. Traditional techniques of monitoring water quality involve complex chemical and biological tests that take a long time and demand a lot of human and laboratory resources. These constraints highlight the need for more flexible, efficient, and resource-light alternatives. Recent advancements in artificial intelligence (AI), particularly machine learning (ML) and deep learning (DL) techniques, have demonstrated success in addressing complex, data-driven problems in other fields, raising the question of whether these methods can also be applied effectively to water quality monitoring. Machine learning algorithms can discover trends and abnormalities in water quality data, indicating nascent issues or contamination risks before they become severe problems. By utilizing vast datasets of water quality factors, AI algorithms may predict water safety more quickly and accurately than traditional approaches. This integration of AI not only holds significant potential to improve monitoring efforts in rural and underserved areas, but it also allows for real-time decision-making, considerably enhancing public health response and water management plans.

Throughout the mid to late 20th century, a few core algorithms, such as decision trees, support vector machines (SVM), and linear regression, served as the backbone of predictive analytics, offering reliable and interpretable solutions across various applications. These methods performed best in scenarios where the relationship between input variables and output was well-defined and represented using statistical assumptions. However, as data complexity and dimensionality increased, traditional approaches showed limitations, particularly in handling non-linear interactions and large-scale data efficiently.

In the 21st century, there has been a substantial shift in the machine learning landscape towards deep learning, illustrated by the development of neural networks. Neural networks, inspired by the human brain, are a form of machine learning designed to identify patterns through interconnected layers of nodes, or “neurons,” each performing specific computations. Deep neural networks with numerous hidden layers have demonstrated unmatched success in learning from vast amounts of unstructured data. This renaissance of neural networks, driven by rapid 21st-century advances in computing power and data availability, has led to groundbreaking applications in natural language processing, image recognition, and autonomous driving, showcasing their adaptability and superiority in addressing challenges that traditional machine learning models struggle with.

The success of neural networks prompts consideration of their potential in areas where traditional machine learning has proven effective, such as tabular data. However, when dealing with tabular data, structured in rows and columns, neural networks may introduce unnecessary complexity, potentially leading to overfitting on simpler datasets. In such cases, simpler models may generalize better.

As machine learning continues to evolve, a multifaceted approach that combines classical algorithms with neural networks is expected to provide the most robust foundation for addressing the increasingly complex challenges of data-driven decision-making. While AI has been successful in other practical domains, it is uncertain whether these techniques can consistently and accurately assess water potability. Many challenges and limitations arise in this context. Datasets frequently contain meaningless data (noise) or missing information, which can hinder the performance of AI models. If the data is excessively noisy, AI models may fail to detect meaningful patterns, resulting in inaccurate predictions. To address this, several procedures can be applied, such as manual denoising, outlier removal, and normalization, to enhance the overall quality of the data. However, even after extensive data cleaning, model performance may remain questionable. Another potential issue is overfitting, which occurs when a model becomes too dependent on the training data and therefore fails to fit additional data or predict future observations reliably. To address this, techniques such as cross-validation, regularization, and early stopping can be used. Furthermore, the diversity of water quality across different regions complicates the generalizability of any AI model. The effectiveness of AI in this context therefore remains an open question. This study aims to address these open questions and determine the key features influencing water potability by comparing the performance of traditional machine learning methods with neural networks using tabular data.

Literature Survey

Machine Learning vs Deep Learning

Several studies have compared the performance of tree ensemble models with deep learning models on tabular data. According to one paper, tree ensemble models such as XGBoost (short for Extreme Gradient Boosting, a specific implementation of the gradient boosting framework) outperform deep learning models on a variety of tabular datasets, including datasets that were known to work well with deep learning and had been chosen to demonstrate the efficacy of deep models¹. Beyond outperforming them, these tree-based models required significantly less hyperparameter tuning than their deep learning counterparts, making them more accessible and easier to implement for many practical applications¹.

Similarly, another study found that XGBoost consistently outperformed deep learning models, including advanced architectures like TabNet and NODE (deep learning models specifically designed to work with tabular data), across multiple datasets. This study highlights the efficiency of XGBoost, noting that it performs better even with minimal tuning, in contrast to the extensive optimization often required for the deep learning models². An interesting finding from both papers is that ensemble models, which combine deep learning models with XGBoost, tend to perform better than XGBoost alone¹,². This suggests that while deep learning models may not be the best standalone solution, they can still contribute valuable insights when used in tandem with traditional models.

The inherent inductive biases (the set of assumptions or constraints that a machine learning model or algorithm inherently possesses) of different models play a crucial role in their performance on tabular data. These biases guide the learning process by favoring certain hypotheses over others, thereby influencing how the model generalizes from training data to unseen data. One paper discusses how tree-based models, like Random Forest and XGBoost, have inductive biases that are well-suited for the irregular patterns often found in tabular datasets. These models excel at learning piecewise constant functions and handling uninformative features, which are common in tabular data. In contrast, deep learning models tend to be biased towards smoother solutions, which may not capture the complexities of tabular data effectively³.

Despite the notable performance of tree-based models, the potential of deep learning models on tabular data cannot be entirely dismissed. One survey points out that while deep learning models have shown promising results on certain datasets, there is no single deep model that consistently outperforms others across all tasks⁴. This highlights the ongoing challenge in developing efficient deep learning models for tabular data. Moreover, the survey identifies several open challenges in the field, such as handling data streams, addressing distribution shifts, and ensuring privacy and fairness, all of which are critical for the broader adoption of deep learning in tabular data analysis.

A knowledge gap that this study aims to fill revolves around evaluating whether traditional machine learning models, such as XGBoost, outperform neural networks in this specific context of water potability detection. We believe that addressing this question will not only contribute to the understanding of feature importance in water quality analysis but also provide interesting insights into the comparative effectiveness of traditional machine learning and deep learning models on tabular data, potentially guiding future research and practical applications in this domain.

Water Potability Identification and Detection

A number of studies have focused on identifying and detecting factors influencing water potability using both traditional and sophisticated methodologies. One prominent study⁵ examines the effect of ammonia on the formation of trihalomethanes (THMs) during the chlorination of water from the Sitalakhya River in Dhaka, Bangladesh. The study emphasizes the importance of ammonia concentration, which was largely disregarded in prior studies on THM formation, and finds that THM formation can be useful in determining water potability, which could help the many people around the world who need a fast way to assess it. Parameters such as contact time, temperature, pH, total organic carbon, and chlorine dosage were identified as crucial factors influencing THM formation. Laboratory techniques, including UV-Vis spectrophotometry, were employed to measure ammonia concentrations in treated water samples. Findings indicated a negative logarithmic correlation between ammonia concentration and THM formation, with a notable decrease in THM levels at low ammonia concentrations up to 3 mg/L, followed by a slower decline beyond this threshold. This research highlights the importance of managing ammonia levels in water treatment processes to minimize THM formation and ensure safe drinking water⁵.

Moving away from THMs, one study⁶ ruled out temperature as a useful indicator of water potability and placed greater emphasis on pH, conductivity, and oxidation-reduction potential (ORP). The research revealed significant spatial and temporal variations in temperature, conductivity, pH, and ORP. It found that ambient temperature (the temperature of the surrounding air at a particular location) significantly impacted water temperature, with notable differences between hydrant and faucet water. Additionally, initial scale formation and contaminant leaching from plumbing fittings were identified as concerns. By comparing continuous sensor data with traditional analytical methods, the study demonstrated the potential for real-time monitoring systems to provide valuable insights into water quality dynamics and support proactive management strategies⁶.

Another study explored the use of ML techniques to predict water quality, similar to the approach taken in this study. Key features such as pH, dissolved oxygen, and ammonia concentration (NH3-N) were selected for analysis. The study compared traditional ML models, including support vector machines and k-nearest neighbor algorithms, with more complex neural networks⁷. This research underscores the practicality of traditional ML approaches for accurate and efficient water quality assessment, emphasizing their role in supporting sustainable water management practices. In another case study on Indian rivers, the performance of traditional ML models and artificial neural networks was compared for predicting water potability. Parameters such as dissolved oxygen, pH, conductivity, biochemical oxygen demand, and nitrate levels were analyzed. This study reinforces the finding that ML models, particularly XGBoost, are more suited for tabular data analysis. Additionally, it emphasizes the value of this methodology for assessing critical factors such as water potability based solely on dataset analysis⁸.

The reviewed articles show the importance of ammonia concentration, pH, dissolved oxygen, and temperature, among other parameters, in determining water suitability for drinking. These studies indicate that both conventional analytical methodologies and modern data-driven methods can successfully assess different aspects of water quality. A second knowledge gap this study seeks to fill concerns the relatively less explored factors affecting water drinkability. We use the water characteristics identified in these studies to draw our own conclusions from a single dataset. While some papers have examined pH, ammonia concentration, dissolved oxygen, and temperature as factors influencing potable water supply, little emphasis has been placed on conductivity or nitrate content in the assessment of potability. The purpose of this work is to fill this gap by investigating how conductivity and nitrate levels affect water drinkability and contribute to a comprehensive assessment of water quality. Understanding how these variables relate to water salinity may assist in improving monitoring and management processes aimed at supplying safe, sustainable drinking water across different parts of the world.

Methodology

Dataset Description

The Water Quality Dataset by Aditya Kadiwal on Kaggle⁹, encompassing various parameters from 3276 different bodies of water, was used in this study to assess water quality. The data is structured as a table with 10 columns, each feature containing over 2500 data points. Nine of the ten features are represented as decimal numbers, while the potability feature is encoded as an integer (1 indicating potable and 0 indicating not potable). The parameters are pH, indicating acidity levels; Hardness, representing the concentration of dissolved calcium and magnesium ions; Total Dissolved Solids (TDS), quantifying the combined content of all dissolved inorganic and organic substances; Chlorine concentration; Sulfate concentration; Conductivity, denoting the water’s ability to conduct electrical charges; Total Organic Carbon (TOC), measuring the presence of carbon in organic compounds within the water; Trihalomethanes (THM) concentration; and Turbidity, gauging the impact of suspended particles on water clarity and transparency.

Data Preprocessing

This dataset contained missing values in three columns. Specifically, the pH column had 491 missing values (14.99%), the Sulfate column had 781 missing values (23.84%), and the Trihalomethanes column had 162 missing values (4.95%). We replaced these missing values with the median value of each respective column. This approach is useful because the median is less affected by outliers and provides a robust measure of central tendency. To ensure accuracy, we calculated the median separately for potable and non-potable samples. However, we found that the median values for these columns were approximately the same regardless of potability, so we replaced all missing values with the overall median. After this step, there were no missing values left in our dataset. Following this, we normalized our data to adjust values to a common scale without distorting differences in the ranges of values, thus preventing bias towards features with larger numeric ranges. For normalization, we utilized the MinMaxScaler from the scikit-learn library in Python, scaling all feature values to a range between 0 and 1. By handling missing values and normalizing the data, we ensured that our dataset was clean and ready for further analysis and machine learning modeling. The dataset was then split into training (80%) and testing (20%) sets to support effective model training and accurate performance evaluation.
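The preprocessing steps described above can be sketched as follows. This is a minimal illustration on synthetic data rather than the script used in the study: the random array, the seed, and the use of scikit-learn's SimpleImputer for the median fill are assumptions made for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the nine numeric features (assumption: the real
# data would be loaded from the Kaggle file instead).
rng = np.random.default_rng(0)
X = rng.normal(loc=7.0, scale=2.0, size=(100, 9))
y = rng.integers(0, 2, size=100)

# Blank out roughly 10% of the entries to mimic missing measurements.
X[rng.random(X.shape) < 0.1] = np.nan

# 1) Replace each missing value with the overall median of its column.
X_imputed = SimpleImputer(strategy="median").fit_transform(X)

# 2) Rescale every feature to the [0, 1] range.
X_scaled = MinMaxScaler().fit_transform(X_imputed)

# 3) Hold out 20% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=0
)
```

The sketch follows the order given in the text (scale, then split). A common variant fits the scaler on the training split only, so that no test-set statistics influence the transformation.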

Traditional Models

We identified and used the most popular and widely utilized traditional models available in the scikit-learn library for Python. These models, including Logistic Regression, Random Forest, MLPClassifier, KNeighborsClassifier, Decision Tree, Gradient Boosting Classifier, Voting Classifier, AdaBoost Classifier, and SVC, were chosen to represent the full scope of traditional machine learning performance. Each of these algorithms holds a significant place within the machine learning community, offering distinct advantages and capabilities.

Logistic Regression

Logistic Regression predicts binary outcomes by modeling the relationship between the input features and the outcome. It models the probability of a binary outcome using a logistic function, making it suitable for classification tasks. The method computes the log-odds of the event as a weighted sum of the input features and converts them into a probability. Logistic Regression is widely used for its simplicity and effectiveness in binary classification problems.
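As a small illustration of how a linear score is turned into a probability, consider the sketch below. The weights, bias, and feature values are hypothetical and chosen only for the example; they are not fitted parameters from this study.

```python
import math

def predict_proba(features, weights, bias):
    """Return P(y = 1 | features) under a logistic regression model."""
    # Linear score, interpreted as the log-odds of the positive class.
    log_odds = bias + sum(w * x for w, x in zip(weights, features))
    # The logistic (sigmoid) function maps log-odds to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical weights for two scaled features (illustrative values only).
p = predict_proba([0.5, 0.2], weights=[1.0, -2.0], bias=0.0)

# Threshold the probability at 0.5 to obtain a class label.
label = 1 if p >= 0.5 else 0
```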

Decision Tree

Decision Tree models provide interpretable decision rules through recursive splitting based on feature values. They construct a tree-like structure where internal nodes represent tests on features, branches represent outcomes, and leaf nodes represent class labels. This makes the model easy to understand and visualize. Decision Trees are particularly useful for handling datasets with complex interactions between features.

Random Forest

Random Forest, an ensemble method, leverages multiple decision trees to enhance predictive accuracy and robustness. By averaging the predictions of several trees, it reduces overfitting and improves generalization. This approach increases the model’s stability and accuracy, making it more reliable for various tasks. Random Forest is also effective in handling large datasets with many features.

Multi-layer Perceptron Classifier

The Multi-layer Perceptron Classifier (MLPClassifier) embodies a feed-forward neural network approach, capable of learning intricate data patterns through interconnected layers of nodes. Each layer applies a nonlinear transformation to the input, enabling the model to capture complex relationships. This makes MLPClassifier powerful for tasks requiring deep learning capabilities. It is particularly useful for image and speech recognition tasks. While MLPs represent a step toward neural networks and modern machine learning, their relatively simple architecture and traditional training methods align them with the characteristics of traditional machine learning models.

K-Neighbors Classifier

The K-Neighbors Classifier (KNeighborsClassifier) offers a straightforward classification method relying on the majority class of nearest neighbors. It determines the class of a sample based on the most common class among its k nearest neighbors in the feature space. This simplicity makes it easy to implement and understand. However, the choice of k and the distance metric can significantly impact its performance.
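The nearest-neighbor vote described above can be sketched in a few lines. This is a plain, unweighted majority vote using Manhattan distance on toy data; the function name and the data points are hypothetical, and the library implementation offers more options (such as distance weighting).

```python
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Manhattan distance (p = 1 in the Minkowski family).
    def dist(a, b):
        return sum(abs(ai - bi) for ai, bi in zip(a, b))

    # Sort training points by distance to the query and keep the k closest.
    nearest = sorted(
        zip(train_X, train_y), key=lambda pair: dist(pair[0], query)
    )[:k]
    # The most common label among those neighbors is the prediction.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two clusters in a 2-D feature space (illustrative values only).
train_X = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (0.9, 0.8), (0.8, 0.9)]
train_y = [0, 0, 0, 1, 1]
prediction = knn_predict(train_X, train_y, query=(0.2, 0.2), k=3)
```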

Voting Classifier

The Voting Classifier combines multiple independent classifiers via majority voting to bolster overall prediction accuracy. This ensemble technique can aggregate the strengths of various models to produce a more reliable outcome. By leveraging diverse algorithms, it mitigates the weaknesses of individual classifiers. This method is particularly effective when no single classifier performs best on all parts of the dataset.
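Majority (hard) voting can be illustrated with a short sketch. The three prediction lists below are hypothetical outputs of already-trained classifiers; in scikit-learn, the VotingClassifier with voting="hard" performs this kind of combination.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample by majority."""
    combined = []
    # zip(*...) regroups the predictions so each tuple holds one sample's votes.
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical outputs of three classifiers on four samples.
model_a = [1, 0, 1, 0]
model_b = [1, 1, 0, 0]
model_c = [0, 1, 1, 0]
ensemble = hard_vote([model_a, model_b, model_c])
```

Each model is wrong on a different sample in this toy case, yet the ensemble output is determined by the two models that agree, which is the mechanism by which voting can offset individual weaknesses.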

AdaBoost Classifier

AdaBoost iteratively increases the weights of misclassified samples so that subsequent weak learners emphasize challenging instances. By focusing on hard-to-classify samples, it enhances overall model performance. This boosting technique can significantly increase the accuracy of weak learners. AdaBoost is particularly effective in scenarios where the accuracy of simple models needs improvement.
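One round of the reweighting idea can be sketched as follows. This uses the classic two-class (discrete) AdaBoost update as an illustration; library implementations may differ in detail, and the function name and sample weights are hypothetical.

```python
import math

def adaboost_reweight(weights, misclassified):
    """One round of discrete AdaBoost sample reweighting.

    weights:       current sample weights (summing to 1)
    misclassified: booleans, True where the weak learner was wrong
    Returns (alpha, new_weights).
    """
    # Weighted error of the current weak learner.
    err = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    # Learner's vote strength: larger when the weighted error is smaller.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Increase the weight of mistakes, decrease the weight of correct samples.
    raw = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(raw)
    # Renormalize so the weights again sum to 1.
    return alpha, [w / total for w in raw]

weights = [0.25, 0.25, 0.25, 0.25]
alpha, new_weights = adaboost_reweight(weights, [False, False, False, True])
```

In this toy round the single misclassified sample ends up carrying half of the total weight, so the next weak learner is pushed to classify it correctly.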

Support Vector Classifier

The Support Vector Classifier (SVC) constructs an optimal hyperplane that separates the classes by maximizing the margin between them. This approach ensures a clear distinction between classes, which is particularly effective for high-dimensional data. SVC aims to find the boundary that separates different classes with the widest possible margin. This makes it a robust choice for classification tasks with complex boundaries.

Collectively, these methods comprise a diverse toolkit capable of addressing a wide array of machine learning tasks, from regression to classification. Their performance varies with dataset features and the specific nature of the problem being addressed, underlining their versatility and utility in real-world applications. We apply all of them in our water potability analysis.

Deep Learning Models

In this study, models from the Python library DeepTables were utilized to represent deep learning performance. Specifically, DeepFM, xDeepFM, Linear, DNN Nets, Cross DNN Nets, DCN Nets, DCN, and Wide&Deep were selected. The most basic and interpretable of these is the Linear model. The DeepTables library also provides a few preset models, such as DeepFM.

DeepFM is a deep learning model that blends the strengths of factorization machines (FM) and deep neural networks (DNN) to capture both low-order and high-order feature interactions. Factorization machines model feature interactions by factoring them into latent factors, which can be thought of as lower-dimensional representations of the features. This allows FMs to capture complex interactions between features without explicitly enumerating all possible combinations, which would be computationally prohibitive in high-dimensional spaces.

Deep neural networks are designed to model complex, non-linear relationships in data. They consist of multiple layers of interconnected neurons. The input layer is where the network receives the data, with each neuron representing a feature. Between the input and output layers are hidden layers, which perform calculations and transformations to capture various levels of feature abstraction. The term deep refers to having many hidden layers. Each neuron applies an activation function, such as ReLU, Sigmoid, or Tanh, to a weighted sum of its inputs to produce an output. Weights determine the importance of each input to a neuron, while biases allow the activation function to be shifted. These parameters are learned during training. In the forward pass, data flows through the network layer by layer, and in the backward pass, or backpropagation, the network adjusts weights and biases to minimize error.
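To make the latent-factor idea concrete, the sketch below computes the second-order interaction term of a factorization machine in two ways: by enumerating every feature pair, and with the standard reformulation that avoids the enumeration. The number of latent factors (4) and the random inputs are assumptions for illustration, and this is not the DeepTables implementation.

```python
import numpy as np

def fm_pairwise_naive(x, V):
    """Sum of <v_i, v_j> * x_i * x_j over all feature pairs i < j."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += np.dot(V[i], V[j]) * x[i] * x[j]
    return float(total)

def fm_pairwise_fast(x, V):
    """Same quantity in O(n * k) time, without enumerating pairs."""
    xv = V * x[:, None]                 # row i is v_i scaled by x_i
    sum_sq = np.sum(xv, axis=0) ** 2    # (sum_i v_if * x_i)^2 per factor f
    sq_sum = np.sum(xv ** 2, axis=0)    # sum_i (v_if * x_i)^2 per factor f
    return 0.5 * float(np.sum(sum_sq - sq_sum))

rng = np.random.default_rng(0)
x = rng.random(9)             # nine features, as in the dataset described above
V = rng.normal(size=(9, 4))   # assumption: 4 latent factors per feature

naive_value = fm_pairwise_naive(x, V)
fast_value = fm_pairwise_fast(x, V)
```

Both functions return the same value up to floating-point error; the second form is what makes the interaction term affordable when the number of features is large.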

The eXtreme Deep Factorization Machine (xDeepFM) refines DeepFM by introducing a Compressed Interaction Network (CIN). The key idea behind CIN is to reduce the computational cost of modeling high-order interactions between features while maintaining model effectiveness. In simpler terms, it helps the model understand how different features relate to each other in complex ways without overwhelming the computer’s processing power. CIN achieves this by organizing feature interactions into a series of layers, similar to how a neural network is structured. Each layer in the CIN learns a specific level of interaction between features and aggregates these interactions in a compressed form. This compressed representation allows the model to capture complex patterns in the data more efficiently.

The Deep & Cross Network (DCN), originally designed for recommendation systems, combines deep learning with cross feature interactions. By integrating a cross network to efficiently capture explicit feature interactions and a deep network to model implicit feature interactions, DCN enhances predictive performance, particularly for tasks like click-through rate (CTR) prediction. The DeepTables library offers a basic implementation of this combination of cross networks and deep neural networks as a preset known as DCN. However, the way the cross features are integrated into the model can differ, as seen in two of the component nets in DeepTables: Cross DNN Nets and DCN Nets. Cross DNN Nets combines CrossNet with DNNs, where CrossNet focuses on explicit feature interactions, while DNNs capture implicit feature interactions through multiple layers. In contrast, DCN Nets stands for Deep & Cross Networks and follows a similar principle but in a more structured manner. While both architectures aim to understand and utilize feature interactions, Cross DNN Nets allow for more flexibility in how these interactions are learned, with the potential for deeper and more complex representations. DCN Nets, on the other hand, provide a more structured and organized framework for modeling feature interactions in deep learning applications.
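The explicit feature crossing performed by a cross network can be sketched as below, using the layer formula from the original DCN formulation, x_{l+1} = x_0 (x_l · w_l) + b_l + x_l. The weights and inputs are hypothetical, and whether DeepTables implements its cross layers in exactly this form is an assumption not verified here.

```python
import numpy as np

def cross_layer(x0, xl, w, b):
    """One cross layer: x_{l+1} = x0 * (xl . w) + b + xl."""
    # The dot product is a scalar, so every output element is the original
    # input x0 scaled by a learned combination of the current features.
    return x0 * np.dot(xl, w) + b + xl

def cross_network(x0, weights, biases):
    """Stack cross layers; each layer raises the interaction order by one."""
    xl = x0
    for w, b in zip(weights, biases):
        xl = cross_layer(x0, xl, w, b)
    return xl

# Hypothetical two-feature input and two cross layers (illustrative values).
x0 = np.array([1.0, 2.0])
weights = [np.array([0.5, 0.5]), np.array([1.0, 0.0])]
biases = [np.zeros(2), np.zeros(2)]
out = cross_network(x0, weights, biases)
```

In a full Deep & Cross model, this output would be concatenated with the output of a deep network running in parallel before the final prediction layer.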

Among the models, Wide&Deep also stands out as a fundamental one, combining linear models (the wide part) with deep neural networks (the deep part). This integration adeptly captures feature interaction memorization and generalization to unseen feature combinations, thereby bolstering recommendation systems and classification tasks. The wide part is excellent at quickly recognizing familiar patterns, making it perfect for recommendation systems that rely on past user actions to suggest new things. Then, there is the deep part, which digs deeper into the data to find hidden connections and patterns. Instead of just remembering, it tries to figure out why things happened the way they did. By combining these two parts, Wide&Deep can both remember past interactions and make educated guesses about new ones. It is like having the best of both worlds – the reliability of old knowledge and the adaptability to handle new situations. This makes Wide&Deep useful for all kinds of tasks and a fundamental representative of Deep Learning Models.

The models evaluated from the Python library DeepTables encompass a wide spectrum of deep learning capabilities tailored for recommendation systems and predictive tasks. They each bring unique enhancements to feature interaction modeling, catering to diverse needs in understanding complex data patterns. Their collective versatility and effectiveness underline their pivotal role in addressing real-world challenges across various domains, such as water potability, and their performance nuances will be utilized in this study on our Water Potability dataset.

Optimizers

In neural networks, weights and biases are adjusted using optimization algorithms like Stochastic Gradient Descent (SGD), Adam, or RMSprop.

SGD is a fundamental optimizer that updates a deep learning model’s parameters based on the gradient of the loss function with respect to each parameter. It randomly selects a subset of the training data, often called a mini-batch, to compute the gradient, which accelerates training by avoiding the computation of gradients on the entire dataset. However, SGD’s simple update rule may lead to slow convergence or oscillations around the minimum of the loss function.

Adaptive Moment Estimation (Adam) is an extension of SGD that incorporates adaptive learning rates for each parameter. It maintains two moving averages of gradients: the first moment (the mean) and the second moment (the uncentered variance). These moving averages are used to adaptively adjust the learning rate for each parameter during training. Adam builds on SGD by providing adaptive learning rates while also incorporating momentum to accelerate convergence and dampen oscillations.

Root Mean Square Propagation (RMSprop) is another optimization algorithm that addresses some of the limitations of SGD. Like Adam, RMSprop adapts the learning rate for each parameter individually. Instead of maintaining a single learning rate for all parameters, RMSprop scales the learning rate based on a running average of the squared gradients for each parameter. This helps mitigate the vanishing or exploding gradient problem by normalizing the gradients and preventing large updates to parameters with frequently occurring features. Optimizers play a significant part in deep learning models and are crucial to their performance. In this study, we investigate the effect of the optimizers on the given networks and report the results in the following section.
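The three update rules can be compared side by side in a short sketch. The example minimizes a simple quadratic with an exact gradient rather than a mini-batch estimate, and the learning rates and decay constants are commonly used defaults chosen as assumptions, not the settings of this study.

```python
import numpy as np

def sgd_step(w, grad, lr=0.1):
    """Plain gradient step with one global learning rate."""
    return w - lr * grad

def rmsprop_step(w, grad, state, lr=0.01, rho=0.9, eps=1e-8):
    """Scale each parameter's step by a running average of squared gradients."""
    state["v"] = rho * state["v"] + (1 - rho) * grad ** 2
    return w - lr * grad / (np.sqrt(state["v"]) + eps)

def adam_step(w, grad, state, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """Combine a momentum-like first moment with an RMSprop-like second moment."""
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

def loss(w):
    """Toy objective: f(w) = sum of squared parameters."""
    return float(np.sum(w ** 2))

def grad(w):
    """Exact gradient of the toy objective."""
    return 2 * w

w_start = np.array([1.0, -2.0])
w_sgd, w_rms, w_adam = w_start.copy(), w_start.copy(), w_start.copy()
rms_state = {"v": np.zeros(2)}
adam_state = {"t": 0, "m": np.zeros(2), "v": np.zeros(2)}

for _ in range(100):
    w_sgd = sgd_step(w_sgd, grad(w_sgd))
    w_rms = rmsprop_step(w_rms, grad(w_rms), rms_state)
    w_adam = adam_step(w_adam, grad(w_adam), adam_state)
```

Note how RMSprop and Adam each carry per-parameter state between steps, whereas plain SGD depends only on the current gradient.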

Experiments and Results

Implementation Details

In our study, we carefully adjusted the settings of various machine learning algorithms to find the best performance. These settings, called hyperparameters, influence how the algorithms learn from data. To find the best combination of hyperparameters for each traditional ML model, we used grid search, which evaluates every combination in a predefined grid of candidate values. For logistic regression, we found that setting the regularization strength to a value of 0.001 and using the linear solver provided the best results. Next, we tuned the hyperparameters for a decision tree model and found that a tree depth of 10, with a minimum leaf sample size of 5 and a minimum split sample size of 5, worked well. Similarly, we tuned the hyperparameters for random forests and found that using 1000 trees, optimizing for log loss, and limiting each split to the square root of the number of features gave us the best results. Moving on to the MLP Classifier, we found that using a logistic activation function, a small alpha value of 0.0001, a batch size of 200, and a hidden layer size of 500 produced the best results. For the k-nearest neighbors algorithm, we found that using 35 neighbors, setting p to 1, and weighting by distance provided the best performance. Additionally, we optimized the hyperparameters for the gradient boosting and AdaBoost algorithms, finding that a learning rate of 0.01, max features set to the square root of the number of features, and 1000 estimators produced optimal results. For the SVC, we achieved good performance by setting the regularization parameter to 0.1 and using a polynomial kernel. We then utilized a wide and deep neural network architecture with the RMSprop optimizer for the deep learning model. We then ran all these models and collected our results.

For the Voting Classifier, we experimented with four different combinations of the best-performing traditional models (listed in Table 1), including one with the top four models, one with the top three models, and one with the second-, third-, and fourth-best models. For the neural networks, we trained for 100 epochs with an early-stopping patience of 100. By carefully tuning these hyperparameters, we were able to optimize the performance of each machine learning algorithm for our specific task.
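The grid search procedure mentioned above amounts to scoring every combination of candidate values and keeping the best one. The sketch below shows only that enumeration. The `toy_score` function is a hypothetical stand-in for "train the model and return its validation score", and the parameter names and candidate values are assumptions made for illustration; its peak is placed at the decision tree settings reported in the text purely so the example has a known answer.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Score every combination in param_grid; return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical scoring function: in practice this would fit a model with
# `params` and return a cross-validated metric.
def toy_score(params):
    return -abs(params["max_depth"] - 10) - abs(params["min_samples_leaf"] - 5)

grid = {"max_depth": [5, 10, 20], "min_samples_leaf": [1, 5, 10]}
best_params, best_score = grid_search(grid, toy_score)
```

The cost grows with the product of the grid sizes (nine evaluations here), which is why grids for expensive models are usually kept small.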

Evaluation Metrics

Evaluation metrics help quantify the performance of the machine learning models used in this study and provide insights into how well they generalize to unseen data. In this scenario, the metrics used on the testing data were precision, recall, and F1-score. The macro average was taken for these evaluation metrics to get an overall picture of how well each model performs across different classes or categories within the dataset. Instead of just looking at the performance for each class separately, the macro average calculates the average performance across all classes. To compute the macro average, the performance metrics for each class are first calculated individually. Precision measures the proportion of true positive predictions among all positive predictions, while recall measures the proportion of true positive predictions among all actual positive instances. The F1-score is a single metric that combines both precision and recall into one measure, providing a balance between the two. It is calculated as the harmonic mean of precision and recall, taking into account both false positives (precision) and false negatives (recall), providing a holistic evaluation of the model’s performance in the binary classification task. Once all three of these metrics are calculated for each class, the macro average is obtained by averaging the metric values across all classes. This means that each class contributes equally to the overall average, regardless of its size or prevalence in the dataset, disregarding any bias that could have been in the dataset.
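The macro-averaging procedure described above can be written out directly. The following is a small sketch in plain Python; the label vectors are made-up toy data, and the convention of returning 0 when a ratio is undefined (no predictions or no instances of a class) is an assumption of this sketch.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall, and F1 when `label` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0  # 0 by convention when undefined
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_average(y_true, y_pred):
    """Unweighted mean of the per-class metrics, so each class counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = [per_class_scores(y_true, y_pred, lab) for lab in labels]
    return tuple(sum(s[i] for s in scores) / len(labels) for i in range(3))

# Toy labels (0 = nonpotable, 1 = potable), invented for illustration.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
macro_p, macro_r, macro_f1 = macro_average(y_true, y_pred)
```

Here the macro F1 is the mean of the per-class F1 values, which is one common convention; it is not necessarily equal to the harmonic mean of the macro precision and macro recall.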

Performance Analysis

Table 1 shows the results of the previously mentioned traditional machine learning models, evaluated based on their accuracy and other performance metrics such as precision, recall, and F1-score. Among these models, the Random Forest classifier attained the highest accuracy of 69.7%, closely followed by the Gradient Boosting Classifier with 68.6%. These two models outperformed the others in terms of overall accuracy, but accuracy alone does not give a complete picture of a machine learning model's performance. When precision, recall, and F1-score are taken into account, which assess a model's ability to correctly classify instances across different classes, some interesting insights emerge, as discussed below.

One interesting insight from these macro averages is the fact that the MLPClassifier achieved the highest precision of 81.5%, indicating its capability to correctly identify positive instances, albeit with moderate accuracy overall. On the other hand, the Decision Tree classifier demonstrated relatively high recall, indicating its ability to capture a high proportion of positive instances, although at the expense of precision.

Model	Accuracy	Macro AVG: Precision	Macro AVG: Recall	Macro AVG: F1-Score
Logistic Regression (LR)	62.70%	31.40%	49.90%	38.50%
Decision Tree (DT)	67.50%	69.70%	57.90%	55.30%
Random Forest (RF)	69.70%	70.10%	61.90%	61.40%
Multi-layer Perceptron Classifier (MLP)	63.00%	81.50%	50.20%	39.00%
KNeighborClassifier (KN)	61.60%	54.00%	51.60%	47.20%
Gradient Boosting Classifier (GB)	68.60%	70.10%	59.80%	58.30%
Adaboost Classifier (AB)	63.10%	58.00%	52.30%	46.70%
Support Vector Classifier (SVC)	62.80%	31.40%	50.00%	38.60%
Voting Classifier (AB, GB, RF, DT)	67.70%	71.30%	57.80%	54.80%
Voting Classifier (GB, RF, DT)	68.40%	71.20%	59.20%	57.10%
Voting Classifier (AB, RF, DT)	67.50%	69.40%	58.00%	55.50%
Voting Classifier (AB, GB, DT)	67.40%	69.40%	57.70%	55.00%

Table 1: Performance Comparison of Different Traditional Machine Learning Methods

**Figure 1:** Grouped Bar Graph of Performance Comparison of Different Traditional Machine Learning Methods

Moreover, the Voting Classifiers, which combine multiple base classifiers, showed competitive performance compared to individual models. Notably, the Voting Classifier composed of Gradient Boosting, Random Forest, and Decision Tree achieved an accuracy of 68.4%, comparable to individual top-performing models. This suggests that ensemble methods like Voting Classifiers can effectively leverage the strengths of different base models to improve overall performance and predict tabular data.
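To make the combination step concrete, the sketch below implements majority ("hard") voting over the label predictions of several base models. Whether the study used hard or soft voting is not stated, so hard voting is an assumption here, as are the toy prediction vectors and the tie-breaking rule.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Combine per-model label lists by majority vote, sample by sample."""
    combined = []
    for votes in zip(*predictions_per_model):
        counts = Counter(votes)
        top = max(counts.values())
        # Ties go to the smallest label (an arbitrary choice for this sketch).
        combined.append(min(label for label, c in counts.items() if c == top))
    return combined

# Made-up predictions from three base models on five samples.
gb = [1, 0, 0, 1, 0]
rf = [1, 1, 0, 0, 0]
dt = [0, 1, 0, 1, 1]
ensemble = hard_vote([gb, rf, dt])
```

With three voters and two classes a tie cannot occur, but the tie-breaking rule matters for ensembles with an even number of models, such as the four-model combination in Table 1.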

Table 2 displays the performance metrics of various deep learning models across different optimizers. Each model’s accuracy, precision, and recall are measured with the optimization techniques of SGD, RMSprop, and Adam. Using the DNN Nets model, the highest accuracy was achieved with the RMSprop optimizer, recording a value of 66.62%. Alongside this, the precision was 57.23% and the recall was 40.57%. The Adam optimizer produced a comparable accuracy of 66.31%, but it showed a slightly lower precision of 54.73% while maintaining a higher recall of 54.51%. The SGD optimizer also performed reasonably well with an accuracy of 66.31%, precision of 56.65%, and recall of 40.16%.

Model	Optimizer	Accuracy	Precision	Recall
DNN Nets	SGD	66.31%	56.65%	40.16%
	RMSprop	66.62%	57.23%	40.57%
	Adam	66.31%	54.73%	54.51%
Linear	SGD	62.80%	00.00%	00.00%
	RMSprop	59.15%	39.40%	18.44%
	Adam	61.74%	37.90%	45.10%
Cross DNN Nets	SGD	63.72%	52.94%	22.13%
	RMSprop	67.53%	59.51%	39.75%
	Adam	62.96%	50.28%	36.89%
DCN Nets	SGD	66.62%	58.50%	35.25%
	RMSprop	63.41%	50.93%	44.76%
	Adam	63.72%	63.72%	47.54%
DeepFM	SGD	62.88%	00.00%	00.00%
	RMSprop	54.73%	40.70%	47.54%
	Adam	55.79%	35.44%	22.95%
xDeepFM	SGD	62.88%	00.00%	00.00%
	RMSprop	68.75%	71.91%	26.23%
	Adam	66.46%	60.91%	27.46%
DCN	SGD	66.77%	58.44%	36.89%
	RMSprop	64.79%	53.05%	46.31%
	Adam	62.65%	49.75%	40.57%
Wide&Deep	SGD	62.80%	00.00%	00.00%
	RMSprop	66.62%	56.76%	43.03%
	Adam	51.98%	37.46%	43.44%

Table 2: Performance Comparison of Different Tabular Deep Learning Methods

**Figure 2:** Grouped Bar Graph of Performance Comparison of Different Tabular Deep Learning Methods

For the Linear model, the highest accuracy, 62.80%, was observed with the SGD optimizer. However, both precision and recall were recorded as 0, indicating no true positives. With the RMSprop optimizer, the accuracy was lower at 59.15%, with a precision of 39.47% and a very low recall of 18.44%. The Adam optimizer resulted in an accuracy of 61.74%, precision of 37.90%, and recall of 4.51%, showing higher accuracy than RMSprop but slightly lower precision and lower recall.
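An accuracy near 62.8% combined with zero precision and recall would be consistent with a model that predicts the nonpotable class for every sample. The sketch below illustrates this arithmetic on a hypothetical test set; the sample counts (628 nonpotable, 372 potable) are assumptions chosen to match the reported accuracy, not figures from the study.

```python
def binary_report(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # reported as 0 when nothing is predicted positive
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical test set: 628 nonpotable (0) and 372 potable (1) samples.
y_true = [0] * 628 + [1] * 372
y_pred = [0] * len(y_true)  # a model that always predicts "nonpotable"
accuracy, precision, recall = binary_report(y_true, y_pred)
```

In this situation accuracy simply equals the share of the majority class, which is why it can look acceptable even though no potable sample is ever identified.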

The Cross DNN Nets model recorded its highest accuracy, 67.53%, with the RMSprop optimizer, along with a precision of 59.51% and recall of 39.75%. With the Adam optimizer, the accuracy dropped to 62.96%, with a precision of 50.28% and recall of 36.89%. The SGD optimizer also showed a drop in accuracy, though a smaller one than Adam, recording an accuracy of 63.72%, precision of 52.94%, and recall of 22.13%.

For the DCN Nets model, the highest accuracy came with the SGD optimizer: 66.62%, accompanied by a precision of 58.50% and recall of 35.25%. The Adam optimizer also showed relatively good performance, with an accuracy of 63.72%, precision of 63.72%, and recall of 47.54%. The RMSprop optimizer gave the lowest accuracy, though only slightly, at 63.41%, with a precision of 50.93% and recall of 44.76%.

The DeepFM model showed subpar results compared to the others. With the Adam optimizer, it achieved an accuracy of 55.79%, precision of 35.44%, and a low recall of 22.95%. With the SGD optimizer, the accuracy was 62.88%, but both precision and recall were reported as 0. The RMSprop optimizer resulted in an accuracy of 54.73%, precision of 40.70%, and recall of 47.54%.

The xDeepFM model was comparatively strong: its highest accuracy was 68.75% with the RMSprop optimizer, which also had the highest precision of all the models at 71.91% but a lower recall of 26.23%. This configuration was probably the strongest overall in terms of accuracy and precision, although not recall. The Adam optimizer resulted in an accuracy of 66.46%, precision of 60.91%, and recall of 27.46%. The SGD optimizer, once again, had precision and recall values of 0 despite achieving an accuracy of 62.88%.

The DCN model achieved its highest accuracy of 66.77% with the SGD optimizer, along with a precision of 58.44% and recall of 36.89%. The RMSprop optimizer provided an accuracy of 64.79%, precision of 53.05%, and recall of 46.31%. The Adam optimizer resulted in an accuracy of 62.65%, precision of 49.75%, and recall of 40.57%.

Finally, the Wide&Deep model was also one of the stronger models, reaching its highest accuracy of 66.62% with the RMSprop optimizer, with a precision of 56.76% and recall of 43.03%. The Adam optimizer resulted in the lowest accuracy of 51.98%, with a precision of 37.46% and recall of 43.44%. The SGD optimizer, while achieving an accuracy of 62.80%, had precision and recall values of 0. Possible explanations for this are given in the following discussion section.

Discussion

Going into this study, the hypothesis was that deep learning models would outperform traditional machine learning models in terms of classification accuracy and other performance metrics owing to their higher information processing capabilities. However, we also expected some traditional machine learning models including the ensemble methods to perform better than some deep learning algorithms.

Our work partially supports this hypothesis. Ensemble methods like Random Forest, Gradient Boosting, and Voting Classifiers indeed showed high accuracy and balanced performance metrics, supporting the hypothesis for traditional models. One commonality among the relatively well-performing models, such as Random Forest and Gradient Boosting, is that they keep precision and recall fairly balanced, suggesting they are well suited to dealing with imbalanced datasets (which is often the case with water potability). Conversely, while the xDeepFM model achieved high precision and accuracy among deep learning models, other deep learning models did not consistently outperform the top traditional machine learning models. This indicates that while deep learning has potential, its superiority is not absolute and depends on specific configurations and optimizers.

The findings are consistent with other studies in the field of machine learning and deep learning⁷,⁸. Ensemble methods like Random Forest and Gradient Boosting are well documented for their high performance in tabular classification tasks due to their ability to reduce overfitting and improve generalization. Similarly, the strong performance of the xDeepFM model aligns with recent research highlighting the effectiveness of advanced deep learning architectures in handling complex datasets and capturing intricate patterns. However, the variability in performance among different deep learning models and optimizers also reflects ongoing challenges in optimizing these models, a topic widely discussed in current literature.

On average, we found that the traditional machine learning models had an accuracy of over 65%, while the deep learning models had an average accuracy of about 63%. Deep learning typically excels on high-dimensional and unstructured data like images, text, and audio, where complex hierarchical patterns exist. However, tabular data (such as this water potability dataset) often has simpler, structured relationships between features. This water potability dataset likely does not have enough complexity or size to fully utilize deep learning's capabilities. In tabular data, feature interactions (how two or more features jointly relate to the target variable) are usually simple and shallow. Traditional methods like tree-based models capture these interactions efficiently without needing deep architectures. Deep learning models are designed to learn complex representations and feature hierarchies, which might not exist in tabular data. This might have led to over-complicating the problem and thus to poorer performance.

Despite these findings, several limitations need to be addressed. The dataset used for training and testing may contain inherent biases that could affect the generalizability of the results. In this particular dataset, 61% of the samples were of nonpotable water and 39% were of potable water. A 61/39 split is moderately imbalanced. While it is not as severe as more extreme imbalances (e.g., 90/10 or 95/5), it still means that a model sees more examples of one class than the other during training. Class imbalance can lead to higher accuracy on the majority class at the expense of poor performance on the minority class, and it can skew metrics like accuracy, making them less reliable indicators of model performance. This is most probably why the accuracies of the different models, whether traditional or neural, stayed in the 60-70% range. Rather than relying on accuracy alone, macro average metrics were used to mitigate this imbalance in the performance evaluation, but the imbalance can still influence model behavior during training.

The performance of deep learning models varied significantly across different optimizers. This variability highlights the sensitivity of deep learning models to hyperparameter tuning and optimization strategies. Some models, like the Multi-layer Perceptron (MLP) Classifier, showed high precision but moderate overall accuracy. This indicates a trade-off between different performance metrics that needs to be balanced based on specific application requirements. Additionally, several models with the SGD optimizer reported zero precision and recall.

In the test results (Table 2), zero precision and recall occurred only when SGD was selected as the optimizer. Although this could be attributed to the inherent class imbalance, it could also be caused by something more specific to the SGD optimizer itself. SGD is known to be very sensitive to the learning rate. If the learning rate is too high, the optimizer might overshoot the minimum of the loss function, leading to poor convergence or divergence. If the learning rate is too low, the optimizer may converge too slowly or get stuck in a local minimum. An improper learning rate can prevent the model from learning effectively, causing it to make random or constant predictions that result in zero precision and recall. However, as seen in our results, SGD with a suitable learning rate attained some of the highest accuracy, precision, and recall scores for models such as DNN Nets, DCN Nets, and DCN. This illustrates that with careful hyperparameter tuning, SGD can be an effective optimizer for tabular data, even if its variability makes it more challenging to use than other optimizers like RMSprop or Adam. For example, RMSprop had consistently good results and exhibited very low variability compared to the other optimizers, so it might be a better optimizer to focus on in future studies.
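The learning-rate sensitivity described above can be illustrated on a one-dimensional toy problem. The sketch below uses plain (full-batch, non-stochastic) gradient descent on f(x) = x², which is a deliberate simplification; the three learning rates are arbitrary values chosen to show the too-low, reasonable, and too-high regimes.

```python
def gradient_descent(lr, steps=50, x0=1.0):
    """Plain gradient descent on f(x) = x^2 (gradient 2x); returns the final x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2.0 * x  # each step multiplies x by (1 - 2*lr)
    return x

too_small = gradient_descent(lr=0.0001)  # barely moves away from the start
reasonable = gradient_descent(lr=0.1)    # shrinks steadily toward the minimum at 0
too_large = gradient_descent(lr=1.5)     # overshoots further on every step and diverges
```

Since each step multiplies x by (1 - 2·lr), the iterate converges only when that factor has magnitude below 1. A network whose weights stall or diverge in a comparable way may end up emitting a single class for every input.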

As many of these traditional machine learning and deep learning models demonstrated comparable performance, an additional feature analysis was performed to determine if the top two traditional machine learning models (Random Forest and Gradient Boosting Classifier) and the top two deep learning models (using the RMSprop optimizer) agreed on which features most influenced water potability predictions.

For the Random Forest model, the “feature_importances_” attribute was used to assess how much each feature contributed to the model’s decision-making process by examining the importance scores. The Random Forest model assigns these scores based on how effectively each feature reduces impurity, such as Gini impurity or entropy, when used to split data at decision tree nodes. In essence, whenever a feature is selected for a split, the model measures how much it enhances the purity of the resulting groups. Features that consistently lead to larger reductions in impurity are deemed more critical and receive higher importance scores. The results are plotted as a bar chart in Figure 3. Features like pH, Hardness, Sulfate, Chloramines, and Solids played a more decisive role in distinguishing between classes, each accounting for over 10% of the model’s total importance, while the remaining features did not.
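The impurity reduction behind these scores can be computed by hand for a single split. The sketch below uses Gini impurity on binary labels; the label lists are invented for illustration, and a real forest would sum such decreases (weighted by the share of samples reaching each node) over all splits on a feature and all trees before normalizing.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n  # fraction of class 1
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted_children

# Invented example: a split that sends most nonpotable samples (0) to the left.
parent = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
left = [0, 0, 0, 0, 0, 1]
right = [0, 1, 1, 1]
decrease = impurity_decrease(parent, left, right)
```

A feature whose splits repeatedly produce large decreases of this kind accumulates a high importance score.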

Similarly, the Gradient Boosting model also has a “feature_importances_” attribute, and the results are presented in Figure 4. For Gradient Boosting, the importance score is calculated based on how often a feature is selected to make a split in the decision trees that compose the ensemble model, as well as how much that split reduces the overall prediction error. Features that lead to larger reductions in error receive higher importance scores. Features like pH, Hardness, Sulfate, Chloramines, and Solids once again emerged as the most decisive, each accounting for over 10% of the model’s total importance, while the remaining features did not.

For the deep learning models, the DeepTables library does not have a built-in feature importance attribute, so a feature-removal (drop-column) importance method was used. This involved iteratively removing one feature at a time from the dataset and retraining the model without that feature. For both the xDeepFM and Cross DNN Nets models, the process was made consistent by clearing the session after each iteration to prevent residual effects from previous models. After retraining, the modified model was evaluated on the test set, and the new accuracy was measured. The drop in accuracy, determined by comparing the baseline model’s accuracy to that of the modified model, was used to assess the importance of the omitted feature. A larger decrease in accuracy indicated a more critical feature, as its absence significantly impacted the model’s predictive capabilities. The results of these analyses are shown in Figures 5 and 6.
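The remove-and-retrain procedure described above can be sketched as follows. A tiny nearest-centroid classifier stands in for the deep learning models so that the example stays self-contained; it is not DeepTables, and the data, feature names, and helper functions are all hypothetical. Note that this retrains the model for each removed feature, whereas permutation importance in the usual sense shuffles a column and re-scores an already trained model.

```python
def fit_centroids(X, y):
    """Toy stand-in model: the mean feature vector of each class."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, X):
    """Assign each row to the class with the nearest centroid."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(centroids, key=lambda lab: dist(x, centroids[lab])) for x in X]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def drop_column_importance(X_train, y_train, X_test, y_test, feature_names):
    """Accuracy drop after retraining without each feature in turn."""
    baseline = accuracy(y_test, predict(fit_centroids(X_train, y_train), X_test))
    drops = {}
    for j, name in enumerate(feature_names):
        keep = [k for k in range(len(feature_names)) if k != j]
        Xtr = [[row[k] for k in keep] for row in X_train]
        Xte = [[row[k] for k in keep] for row in X_test]
        model = fit_centroids(Xtr, y_train)  # retrained from scratch without feature j
        drops[name] = baseline - accuracy(y_test, predict(model, Xte))
    return baseline, drops

# Hypothetical data: "signal" separates the classes, "noise" does not.
X_train = [[0.0, 5.0], [0.2, 1.0], [0.1, 3.0], [1.0, 4.0], [0.9, 2.0], [1.1, 3.0]]
y_train = [0, 0, 0, 1, 1, 1]
X_test = [[0.0, 3.0], [0.3, 2.0], [0.8, 4.0], [1.2, 3.0]]
y_test = [0, 0, 1, 1]
baseline, drops = drop_column_importance(
    X_train, y_train, X_test, y_test, ["signal", "noise"]
)
```

In this toy setting, removing the informative feature lowers accuracy while removing the uninformative one leaves it unchanged. With stochastic training, as in neural networks, small positive or negative drops can also arise from run-to-run variation, so repeating each retraining several times would give a steadier estimate.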

In these graphs, a positive drop in accuracy indicates that removing the feature led to a decrease in the model’s performance. This suggests that the feature is valuable, as its absence caused the model to be less accurate. The larger the positive drop, the more important the feature is, since it implies that the feature provides critical information for the model to make accurate predictions. Conversely, a negative drop (or an increase) in accuracy would mean that removing the feature actually improved the model’s performance. This could suggest that the feature was redundant, noisy, or even misleading, causing the model to make less accurate predictions when it was included. For the xDeepFM model, Solids, Chloramines, and Sulfate were the only features to show a positive importance score. In contrast, for the Cross DNN Nets model, Turbidity, Sulfate, pH, and Solids emerged as the only ones with positive importance scores. Features like pH, Hardness, Sulfate, Chloramines, and Solids were consistently identified as critical across both model types, suggesting that these variables fundamentally influence water potability. This consistency may also reflect their robust influence on the underlying chemistry of water, which has direct implications for water quality management and public health initiatives.

Conversely, the differences in feature importance rankings, especially regarding models like xDeepFM and Cross DNN Nets, point to the potential of deep learning models to capture nuanced interactions between features that traditional models might overlook. For instance, the significance of Turbidity in the Cross DNN Nets model suggests that deep learning techniques can uncover hidden patterns in the data that aren’t immediately apparent through traditional analyses. This emphasizes the value of using diverse modeling approaches, as each can provide unique insights that enhance our understanding of the factors influencing water potability and even expand the horizons of the usefulness of other features when it comes to water potability.

**Figure 3:** Bar Graph of Feature Analysis on Random Forest Model

**Figure 4:** Bar Graph of Feature Analysis on Gradient Boosting Model

**Figure 5:** Bar Graph of Feature Analysis on xDeepFM Model

**Figure 6:** Bar Graph of Feature Analysis on Cross DNN Nets Model

Based on these findings, several steps can be taken to further investigate the most important features of water potability and the advantages of neural networks compared to traditional machine learning methods. This study serves as a foundational introduction to the use of machine learning for determining water potability and establishes a benchmark. However, there are newer and more complex models in both traditional machine learning and neural networks that could be applied to this dataset. Conducting extensive hyperparameter tuning, especially for deep learning models, can help identify optimal configurations and improve performance consistency. Exploring more sophisticated ensemble methods, such as stacking and boosting, could further enhance model performance by leveraging the strengths of multiple classifiers such as XGBoost, LightGBM, and CatBoost. Additionally, this dataset is just one of many water sample datasets available, and it offers only around 10 features with which to predict water potability. These features might not be enough to predict water potability reliably, so it is crucial for future studies to replicate these findings and perform additional investigations on other water quality data.

Conclusion

This paper explored the efficiency of neural networks and machine learning algorithms in addressing water pollution problems by considering various water characteristics for quality prediction. The study compared different ML algorithm performances on a drinking water quality dataset, selecting significant features and creating subsets for training and testing. The results indicated that Random Forest, Voting Classifier, xDeepFM, and Cross DNN Nets were superior in predicting water quality. Future work will focus on enhancing algorithm performance through hyperparameter tuning to optimize results further, thereby contributing to better decision-making, long-term planning, and faster action in environmental problem automation.

Acknowledgment

Thank you for the guidance of mentor Pramit Saha from the University of Oxford and Sarah Olshan from the University of Illinois at Urbana-Champaign in the development of this research paper.

Supplementary Section

The evaluation metrics on the training data have been reported in Tables 3 and 4 below.

Model	Accuracy	Macro AVG: Precision	Macro AVG: Recall	Macro AVG: F1-Score
Logistic Regression (LR)	60.50%	30.30%	50.00%	37.70%
Decision Tree (DT)	69.50%	76.50%	62.10%	60.30%
Random Forest (RF)	74.90%	71.10%	64.80%	67.30%
Multi-layer Perceptron Classifier (MLP)	60.60%	80.30%	50.10%	38.00%
KNeighborClassifier (KN)	64.20%	59.10%	55.90%	48.80%
Gradient Boosting Classifier (GB)	72.50%	80.30%	65.70%	65.10%
Adaboost Classifier (AB)	64.00%	69.40%	55.10%	49.30%
Support Vector Classifier (SVC)	60.50%	30.30%	50.00%	37.70%
Voting Classifier (AB, GB, RF, DT)	70.30%	82.70%	62.50%	60.30%
Voting Classifier (GB, RF, DT)	80.10%	86.60%	78.30%	79.90%
Voting Classifier (AB, RF, DT)	78.40%	86.20%	72.80%	73.80%
Voting Classifier (AB, GB, DT)	70.60%	80.10%	63.20%	61.50%

Table 3: Performance Comparison of Different Traditional Machine Learning Methods (Training Data)

Model	Optimizer	Accuracy	Precision	Recall
DNN Nets	SGD	70.27%	69.29%	44.29%
	RMSprop	85.27%	85.92%	74.95%
	Adam	88.51%	87.44%	82.79%
Linear	SGD	60.53%	00.00%	00.00%
	RMSprop	60.53%	00.00%	00.00%
	Adam	60.54%	00.00%	00.00%
Cross DNN Nets	SGD	67.33%	73.42%	26.98%
	RMSprop	80.88%	83.95%	63.73%
	Adam	77.25%	75.89%	62.09%
DCN Nets	SGD	68.89%	68.01%	39.85%
	RMSprop	81.87%	73.78%	83.85%
	Adam	85.95%	87.41%	75.24%
DeepFM	SGD	60.64%	40.57%	48.45%
	RMSprop	72.21%	64.60%	65.47%
	Adam	67.67%	76.49%	26.11%
xDeepFM	SGD	60.38%	39.34%	13.54%
	RMSprop	70.34%	68.60%	45.84%
	Adam	75.50%	76.13%	55.22%
DCN	SGD	70.88%	68.39%	48.74%
	RMSprop	84.39%	83.21%	75.73%
	Adam	85.15%	84.35%	76.60%
Wide&Deep	SGD	65.84%	73.72%	20.89%
	RMSprop	67.67%	76.48%	45.07%
	Adam	58.81%	47.80%	47.48%

Table 4: Performance Comparison of Different Tabular Deep Learning Methods (Training Data)

References

Shwartz-Ziv, Ravid, and Amitai Armon. “Tabular Data: Deep Learning Is Not All You Need.” Information Fusion, vol. 81, May 2022, pp. 84–90, https://doi.org/10.1016/j.inffus.2021.11.011.
Fayaz, Sheikh Amir, et al. “Is Deep Learning on Tabular Data Enough? An Assessment.” International Journal of Advanced Computer Science and Applications, vol. 13, no. 4, 2022, https://doi.org/10.14569/ijacsa.2022.0130454. Accessed 26 June 2022.
Grinsztajn, Léo, et al. “Why Do Tree-Based Models Still Outperform Deep Learning on Tabular Data?” HAL (Le Centre Pour La Communication Scientifique Directe), French National Centre for Scientific Research, July 2022.
Borisov, Vadim, et al. “Deep Neural Networks and Tabular Data: A Survey.” IEEE Transactions on Neural Networks and Learning Systems, 2022, pp. 1–21, https://doi.org/10.1109/tnnls.2022.3229161.
Sun, Ying-Xue, et al. “Effect of Ammonia on the Formation of THMs and HAAs in Secondary Effluent Chlorination.” Chemosphere, vol. 76, no. 5, July 2009, pp. 631–37, https://doi.org/10.1016/j.chemosphere.2009.04.041. Accessed 21 July 2022.
Ling, Jing. “Evaluation of the Suitability of Real-Time Continuous Water Quality Monitoring in a Chloraminated Drinking Water Distribution System.” Repositories.lib.utexas.edu, May 2022, hdl.handle.net/2152/116737. Accessed 28 May 2024.
Kaddoura, Sanaa. “Evaluation of Machine Learning Algorithm on Drinking Water Quality for Better Sustainability.” Sustainability, vol. 14, no. 18, Sept. 2022, p. 11478, https://doi.org/10.3390/su141811478.
G. Bharati Ainapure, et al. “Drinking Water Potability Prediction Using Machine Learning Approaches: A Case Study of Indian Rivers.” Water Practice & Technology, vol. 18, no. 12, UWA Publishing, Nov. 2023, pp. 3004–20, https://doi.org/10.2166/wpt.2023.202. Accessed 30 Mar. 2024.
Kadiwal, A. “Water Quality.” Kaggle, 2021, https://www.kaggle.com/datasets/adityakadiwal/water-potability.
