Authors: Rushank Goyal, Pranati Dani, Shinjini Ghosh, Aditya Gudi, William Fang
Professional Reviewer: Dr. Swati Gupta
Abstract
Liver disease is highly pervasive and growing fast, yet methods for its early detection remain limited. To help combat this issue, we used a public dataset provided by the University of California Irvine that offers insights into the diagnosis of liver disease. We applied logistic regression, multilayer perceptron, deep neural network, and random forest classifier models to our data. In addition, we altered the parameters of each of these models to identify the combination that provided the most accurate predictions for liver disease patients. After careful consideration and evaluation, the multilayer perceptron model with the lbfgs solver, the tanh activation function, and a hidden layer configuration of (5, 2) was found to give the most accurate predictions on the test data, with an accuracy score of nearly 82%.
Structure
In this paper, Section 2 describes the background of the topic and why researching the efficacy of models, specifically for liver disease, is important. Section 3 presents a careful analysis of the dataset. Section 4 describes the logistic regression, multilayer perceptron, deep neural network, and random forest classifier models. It also details and examines the parameters used in these models. Section 5 brings together the statistical results of all the models, while Section 6 compares the statistics of the models with each other. Finally, Section 7 summarizes the research and restates the importance of the topic and the conclusions drawn from the study.
Background
Liver disease disrupts the liver’s functions and arises from many different causes, including alcohol abuse, liver cancer, hepatitis A, B, C, D, and E, infectious mononucleosis, iron overload, and more (Wedro, 2019). It remains highly fatal and dangerous to those living with it. In the United States, around 4.5 million people are currently diagnosed with liver disease, approximately 42,000 people died from it in 2018, and more than 73,100 new cases arise every year (GE Healthcare, 2019; National Center for Health Statistics, 2019). Early detection could prevent a large percentage of deaths; however, detection is not easy. Symptoms such as itchy skin, nausea, chronic fatigue, abdominal pain, swelling, and yellowish skin only appear in the later stages, where fatality may occur (GE Healthcare, 2019). Machine learning algorithms currently used to predict liver disease in patients rely on a small number of parameters, such as gender and age, and remain inadequate, while the number of deaths increases significantly every year; deaths from cirrhosis alone have risen by 65% since 1999 (British Medical Journal, 2018). Worldwide, liver disease accounts for 2 million deaths every year, according to the National Center for Biotechnology Information (Asrani et al., 2019). In fact, it is now the fifth-largest killer (Godfrey, 2014). With our research, we hoped to increase the accuracy of liver disease prediction by testing different types of models and comparing numerous combinations of parameters.
Data Collection
To define a model for the detection of liver disease, collecting data was vital. Our dataset, hosted on Kaggle, was compiled from a public dataset provided by the University of California Irvine in the United States (UC Irvine, 2012). It recorded attributes of diseased and healthy patients from the North East of Andhra Pradesh, India. The data contained 416 liver patient records and 167 non-liver patient records. In terms of gender, there were 441 male and 142 female patients. Features included the age and gender of the patient, along with levels of total bilirubin, direct bilirubin, alkaline phosphatase, alanine aminotransferase, aspartate aminotransferase, total proteins, albumin, and, lastly, the albumin and globulin ratio. The set therefore offered plenty of features correlated with the diagnosis of liver disease, which helped improve the accuracy of results. This dataset was compared with numerous others and was chosen for its variety of features, its number of data points, and its high usability rating on Kaggle. With these data, which were split into training and testing sets, we could focus on the model comparisons to produce the most accurate and precise results for liver disease patients.
Methods & Results
I. Logistic Regression
Logistic regression is a regression model used to predict class membership. In our study, logistic regression was used to determine whether a patient is diagnosed with liver disease or not. This type of regression is most commonly used when the dependent variable is dichotomous. The diagnosis of liver disease fit this criterion, as it was either a positive or a negative result.
The first step to logistic regression was creating the hypothesis function using the sigmoid function:
With these formulas, as z tends to infinity, the y value approaches 1 (positive result), and as z tends to negative infinity, the y value approaches 0 (negative result).
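For reference, assuming the standard logistic regression setup in which z is a linear combination of the input features x with weights θ (θ and x are assumed notation, not taken from the paper), the sigmoid hypothesis and the limits described above can be written as:

```latex
\begin{align*}
h_\theta(x) &= \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = \theta^{\top} x \\
\lim_{z \to \infty} \sigma(z) &= 1, \qquad \lim_{z \to -\infty} \sigma(z) = 0
\end{align*}
```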
To create the separation between the hypothesis function results, a decision boundary was used. It could have been linear or non-linear and created a wall between the positive and negative results, which in this case were the positive or negative diagnosis of liver disease, respectively. The parameters of the model were estimated by maximizing a likelihood function (using Maximum Likelihood Estimation or MLE) that predicted the mean of a Bernoulli distribution for each example (Brownlee, 2019).
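The decision boundary and the likelihood being maximized can be sketched in a few lines of plain Python. This is a minimal illustration and not the code used in the study: the function names, the 0.5 threshold, and the linear form of z are assumptions made for the example.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(theta, x):
    """Probability of a positive diagnosis, assuming z is linear in x."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

def log_likelihood(theta, X, y):
    """Bernoulli log-likelihood that maximum likelihood estimation maximizes."""
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for x, label in zip(X, y):
        p = min(max(predict_proba(theta, x), eps), 1.0 - eps)
        total += label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total

def classify(theta, x, threshold=0.5):
    """Linear decision boundary: positive when the probability reaches the threshold."""
    return 1 if predict_proba(theta, x) >= threshold else 0
```

A log-likelihood closer to zero indicates a better fit, which is how a value such as the one reported for this model would be read.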
Once we fed the data into the model, it converged in as few as 8 iterations and we obtained a log-likelihood value of -290.61, with the greatest positive coefficient associated with the albumin and globulin ratio, and the greatest negative coefficient associated with the direct bilirubin amount. This implied that the predicted chance of a liver disease diagnosis increased as the albumin and globulin ratio increased and the direct bilirubin amount decreased. The model suggested that, out of the factors considered, alkaline phosphatase was the least correlated with the presence or absence of liver disease. The model had an accuracy of around 61%.
II. Multilayer Perceptron
A multilayer perceptron (MLP) is a type of neural network that includes an input layer, one or more hidden layers, and an output layer. Even a single hidden layer can be enough for an MLP to reach sufficient accuracy. The input consisted of the features described in Section 3, and a positive or negative diagnosis was given as output.
The connections between the layers had weights that were trained to produce the best results. With each iteration, the solver (stochastic gradient descent, for example) adjusted these weights. This was done by comparing the output of the network, computed using forward propagation, with the expected output. The error was then calculated and propagated back through the network using backpropagation. Working backward allowed the weights on the nodes of the network to be updated such that the output came closer to the desired output with each iteration. The hidden layer configuration, solver, and activation function were not learned in this way; they were chosen by comparing the results of separate experiments.
After the training was done with a selected number of iterations (taking care not to overfit), the trained model could be used to make predictions.
We experimented over the Cartesian product of solvers, activation functions, and hidden layer configurations, totaling 72 experiments (3 solvers × 4 activation functions × 6 configurations). We chose solvers from lbfgs, sgd, and adam; activation functions from identity, logistic, tanh, and relu; and hidden layer configurations from {(2,), (3,), (5,), (2,2), (3,2), (5,2)}. The greatest accuracy was obtained with these specific parameters: {solver: lbfgs, activation function: tanh, hidden layer configuration: (5,2)}.
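The 72-experiment search described above could be organized roughly as follows. This is a sketch under stated assumptions rather than the study's actual code: the helper names, the max_iter value, and the random_state are hypothetical, and scikit-learn's MLPClassifier is assumed because its option names match the solver and activation names listed.

```python
from itertools import product

SOLVERS = ["lbfgs", "sgd", "adam"]
ACTIVATIONS = ["identity", "logistic", "tanh", "relu"]
HIDDEN_LAYERS = [(2,), (3,), (5,), (2, 2), (3, 2), (5, 2)]

def build_grid():
    """Cartesian product of solvers, activations, and hidden layer shapes."""
    return [
        {"solver": s, "activation": a, "hidden_layer_sizes": h}
        for s, a, h in product(SOLVERS, ACTIVATIONS, HIDDEN_LAYERS)
    ]

def best_configuration(X_train, y_train, X_test, y_test,
                       max_iter=1000, random_state=42):
    """Fit one MLP per configuration and return the most accurate one.

    max_iter and random_state are assumed values, not taken from the paper.
    """
    # Imported here so the grid can be built without scikit-learn installed.
    from sklearn.neural_network import MLPClassifier

    best_score, best_params = -1.0, None
    for params in build_grid():
        model = MLPClassifier(max_iter=max_iter,
                              random_state=random_state, **params)
        model.fit(X_train, y_train)
        score = model.score(X_test, y_test)  # mean accuracy on the test set
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

GRID = build_grid()
```

Scoring each configuration on the test set mirrors how accuracy is reported here; selecting on a separate validation split would be a more conservative design.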
III. Deep Neural Network
Simply put, deep neural networks (DNNs) take an input X and predict the output Y by utilizing their connections the way the brain uses its neurons. Therefore, a DNN learns different patterns between the input and output in the form of layers (Montavon et al., 2018).
The simplest possible model would contain the input layer, 1 hidden layer, and the output layer. In reality, though, the number of hidden layers can vary widely. The input layer takes different inputs and forms values for the hidden layer(s), where all computations occur. The results are subsequently transferred to the output layer, which performs the final classification: diseased or healthy, in this case. All the layers are connected through nodes that have certain weights and biases, which act as parameters that are adjusted with each iteration. These weights help determine the classification at the end.
The initial prediction was made through the hypothesis function using the sigmoid:
The cost/loss was calculated using the mean squared error. The goal of the DNN is to minimize this cost/loss by changing the weights and biases. As in logistic regression, this is done with gradient descent, but it relies on a process called backpropagation: the network goes backward through the layers to update the weights and biases according to the error measured by the loss function. Therefore, with each backward pass, we reduced the loss further and taught the model to produce better results. The following formula was used to calculate the new parameters:
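As a point of reference, the standard gradient descent update under a mean squared error loss takes the form below. The symbols are assumed notation rather than the paper's own: w and b stand for any weight and bias in the network, α is the learning rate, n is the number of training examples, and ŷ_i is the network's output for example i.

```latex
\begin{align*}
J(w, b) &= \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \\
w &\leftarrow w - \alpha \frac{\partial J}{\partial w}, \qquad
b \leftarrow b - \alpha \frac{\partial J}{\partial b}
\end{align*}
```

Backpropagation supplies the partial derivatives layer by layer through the chain rule, starting from the output layer.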
In our analysis, we split the data into training and testing sets, with a learning rate of 0.33. In our research, we varied the n_estimators parameter from 0 to 100 with a step of 10, while all other parameters were kept constant. After conducting 100 iterations, our model achieved an accuracy of around 71%.
IV. Random Forest Classifier
A random forest classifier (RFC) is an ensemble machine learning algorithm. Such algorithms combine more than one algorithm of the same or different kind for classifying objects; in this case, the combined algorithm is a decision tree (Yiu, 2019). Suppose a training set is given as [X1, X2, X3, X4] with corresponding labels [L1, L2, L3, L4]. The random forest classifier will create some number of decision trees, each with a random subset of the training set as input. This number is decided by the n_estimators parameter. Another important parameter is the criterion parameter, which can take on 2 possible values: gini and entropy. The formulas for calculating these are given below (pj is the proportion of the samples that belong to class j for a particular node).
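The two criteria can be computed directly from the class proportions at a node. The sketch below uses the standard definitions of Gini impurity and entropy; the function names are chosen for illustration, and the base-2 logarithm is an assumption.

```python
import math

def gini(proportions):
    """Gini impurity: 1 - sum_j p_j^2."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    """Entropy: -sum_j p_j * log2(p_j), treating 0 * log2(0) as 0."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

# Class proportions of the full dataset (416 diseased, 167 healthy).
root = [416 / 583, 167 / 583]
root_gini = gini(root)
root_entropy = entropy(root)
```

Both measures are 0 for a pure node and largest when the classes are evenly mixed, so a split that lowers them produces purer child nodes.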
Since it measures the quality of a split, optimizing the criterion parameter is necessary to maximize the quality of the individual decision trees in the classifier. Based on these two and other parameters, the random forest classifier makes its final prediction by taking the majority vote across the decision trees (Ronaghan, 2018).
In our analysis, we also split the data into training and testing sets, with a test size of 0.33 and a random state of 42. We varied the n_estimators parameter from 50 to 250 with a step of 10, and varied criterion between gini and entropy. All other parameters were kept constant, and a random state of 42 was used for initializing the classifier. The highest accuracy, 75.6%, was achieved with 90 estimators and the gini criterion.
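The majority vote mentioned above can be illustrated with a few lines of plain Python. This is a simplified sketch with a hypothetical function name, and it is not how any particular library implements the step.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees.

    On a tie, the label that appears first in the input wins, because
    Counter.most_common keeps first-seen order among equal counts.
    """
    counts = Counter(tree_predictions)
    label, _ = counts.most_common(1)[0]
    return label

# Five hypothetical trees voting on one patient (1 = diseased, 0 = healthy).
vote = majority_vote([1, 0, 1, 1, 0])
```

Note that scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting hard votes, so results may differ slightly from this simplified picture.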
Table of Results
Metric | Logistic Regression | Multilayer Perceptron | Deep Neural Network | Random Forest Classifier |
Mean Squared Error | 0.280 | 0.269 | 0.283 | 0.275 |
R-Squared Statistic | 0.168 | NA | NA | NA |
Accuracy | 61.0% | 81.8% | 70.7% | 75.6% |
95% Confidence Interval of Accuracy | (0.561, 0.641) | (0.787, 0.849) | (0.670, 0.744) | (0.716, 0.796) |
Discussion
As shown in the table of results, we calculated the mean squared error, R-squared value, confidence intervals, and the accuracy score for all models when applicable. These values were obtained using the metrics module of the scikit-learn Python library (Pedregosa et al., 2011).
The mean squared error is the average squared difference between the test labels and the values predicted by the model. Therefore, the higher this value, the less accurate the model. For logistic regression, the mean squared error was 0.280, the second-highest of all models. Its accuracy score, which is the number of correct predictions divided by the total number of predictions, was 61%, the lowest among all models. Its R-squared value, which shows how well the logistic formula fits the data, was low at 0.168. Lastly, the 95 percent confidence interval of the accuracy, the range in which the accuracy is expected to fall with 95 percent confidence, was widest for logistic regression at ±0.0398, making its accuracy estimate the least precise. Thus we can claim that the logistic regression model was not our best-performing one.
Turning to the DNN, its mean squared error was the highest at 0.283, and its accuracy score was a mediocre 71 percent. In addition, the DNN model had the next-widest confidence interval of accuracy at ±0.0368, again indicating that this model cannot be our most accurate and efficient one.
Between the RFC and the MLP, the RFC had a slightly larger mean squared error (0.275 versus 0.269, a difference of 0.006), and its accuracy score was much lower than that of the MLP model (75.6% versus 81.8%, or 6.2 percentage points less). On both of these measures, the MLP model can be declared the most efficient and accurate, specifically the one with lbfgs as the solver, tanh as the activation function, and a hidden layer configuration of (5, 2).
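The statistics compared in this section can be sketched in plain Python as follows. These functions are illustrative stand-ins, not the scikit-learn routines used in the study, and the confidence interval uses the normal approximation to the binomial, which is an assumption since the exact method is not stated.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def mse(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def accuracy_half_width(acc, n, z=1.96):
    """Half-width of a 95% normal-approximation interval for an accuracy
    measured on n samples (z = 1.96 is the usual 95% critical value)."""
    return z * math.sqrt(acc * (1.0 - acc) / n)

# Toy labels for illustration only (1 = diseased, 0 = healthy).
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
toy_accuracy = accuracy(y_true, y_pred)
toy_mse = mse(y_true, y_pred)
```

Under this approximation the interval narrows as n grows and is widest, for a fixed n, when the accuracy is near 50%.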
Conclusion
In conclusion, our study suggested that the MLP algorithm produced the most accurate and precise results regarding the detection of liver disease. Its accuracy, which was the main reason for this choice, was nearly 82%, and its mean squared error was 0.269.
Further, we found that specific parameters can result in higher accuracies as well. For example, the multilayer perceptron model reached its highest accuracy of 82% with the lbfgs solver, the tanh activation function, and a hidden layer configuration of (5, 2), while the random forest gave its best results with 90 estimators and the gini criterion.
We do concede, however, that there were a few limitations to our study. For instance, the complete dataset only had 583 data points, and more data would likely have given better results. There might also be differences based on ethnicity that this dataset, made up solely of South Asians, did not capture. Lastly, 416 out of the 583 patients were diagnosed with disease, so this high percentage (71.36%) might not be representative of what real-life cases look like.
Further research could include testing other algorithms with other parameters and obtaining a larger dataset that has higher geographical diversity and is closer to the real-life ratio of diseased to healthy patients.
With further and deeper discoveries into the performance of different models, liver disease can be detected early on in a higher percentage of patients. As discussed earlier, liver disease unfortunately results in a large number of deaths due to late detection, and such statistical comparison of models can help improve that in the future. Our research delved into the question of the most efficient and reliable way to detect liver disease and, according to the methods we employed, the multilayer perceptron model with the lbfgs solver, the tanh activation function, and a hidden layer configuration of (5, 2) produced the most accurate diagnosis.
References
[Wedro, 2019] Wedro, B. (2019, July 11). Liver disease symptoms, signs, diet & treatment. MedicineNet. https://www.medicinenet.com/liver_disease/article.htm
[GE Healthcare, 2019] GE Healthcare. (2019, November 20). Early detection may aid patients with liver disease. https://www.gehealthcare.com/article/early-detection-may-aid-patients-with-liver-disease
[National Center for Health Statistics, 2019] National Center for Health Statistics. (2019, October 11). FastStats. https://www.cdc.gov/nchs/fastats/liver-disease.htm
[British Medical Journal, 2018] British Medical Journal. (2018, July 19). Rapid rise in deaths from liver disease in the US over the last decade. https://www.bmj.com/company/newsroom/rapid-rise-in-deaths-from-liver-disease-in-the-us-over-the-last-decade/
[Asrani et al., 2019] Asrani, S. K., Devarbhavi, H., Eaton, J., & Kamath, P. S. (2019). Burden of liver diseases in the world. Journal of Hepatology, 70(1), 151–171. https://doi.org/10.1016/j.jhep.2018.09.014
[Godfrey, 2014] Godfrey, K. (2014, July 22). Liver disease is now the fifth most common cause of death. Nursing Times. https://www.nursingtimes.net/archive/liver-disease-is-now-the-fifth-most-common-cause-of-death-22-07-2014/
[UC Irvine, 2012] University of California Irvine. (2012, May). UCI Machine Learning Repository: ILPD (Indian Liver Patient Dataset) Data Set. https://archive.ics.uci.edu/ml/datasets/ILPD+(Indian+Liver+Patient+Dataset)
[Brownlee, 2019] Brownlee, J. (2019, October 28). A Gentle Introduction to Logistic Regression With Maximum Likelihood Estimation. https://machinelearningmastery.com/logistic-regression-with-maximum-likelihood-estimation/
[Montavon et al., 2018] Montavon, G., Samek, W., & Müller, K.-R. (2018, February). Methods for interpreting and understanding deep neural networks. https://doi.org/10.1016/j.dsp.2017.10.011
[Yiu, 2019] Yiu, T. (2019, June 12). Understanding Random Forest. https://towardsdatascience.com/understanding-random-forest-58381e0602d2
[Ronaghan, 2018] Ronaghan, S. (2018, May 12). The Mathematics of Decision Trees, Random Forest and Feature Importance in Scikit-learn and Spark. https://towardsdatascience.com/the-mathematics-of-decision-trees-random-forest-and-feature-importance-in-scikit-learn-and-spark-f2861df67e3
[Pedregosa et al., 2011] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(85), 2825-2830. https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html