Applications of Existing Convolutional Neural Networks to Deepfake Detection

Abstract

As technology becomes increasingly advanced, AI-generated media known as Deepfakes have raised concerns in areas like cybersecurity, misinformation, and privacy. Since Convolutional Neural Networks (CNNs) have proved effective in related fields such as image classification, the objective of this research is to apply CNNs to the growing problem of detecting Deepfake images. Although these architectures were not originally designed for this task, this study determines which of them performs best at detecting Deepfake images. A dataset of real and Deepfake images was constructed, and several CNN architectures, including pre-existing and pre-trained models such as VGG-16, VGG-19, DenseNet121, and ResNet50, were tested. A previously developed architecture for age classification was also modified to fit this task, and all models were compared against a custom-designed architecture. The custom model achieved the highest validation accuracy of 97%, though it lacked generalizability, dropping to 81.1% accuracy when tested on another dataset. Pre-trained transfer learning models underperformed even with fine-tuning, with ResNet50 yielding a maximum accuracy of 94%. The highest-performing modified age-classification model reached a validation accuracy of 92.36%, and increasing the number of convolutional layers in that model generally improved accuracy. Overall, this study found that although pre-trained image classifiers can be applied to Deepfake detection, they were not designed for it. Evidence of overfitting and differences between the saliency maps of different models show the importance of developing large, generalizable datasets specifically for Deepfake detection. Custom CNNs and adapted models can be developed and run quickly with high performance on a single dataset, but the lack of representative training data and of State Of The Art (SOTA) benchmark methods leaves models at risk of overfitting and limits the applicability of their results.

Keywords: Machine Learning, Computer Science, Deepfakes, Artificial Intelligence, Convolutional Neural Networks, Image Classification, AI-Generated Media, Model Performance Evaluation

Introduction

Motivation

Deepfakes, or digitally manipulated media, are becoming more common and sophisticated, creating challenges in areas like cybersecurity, misinformation, and privacy. Deepfakes have significant implications for cybersecurity, as they can be used to impersonate individuals in phishing attacks or bypass biometric authentication systems. In the realm of misinformation, Deepfakes can spread false narratives or discredit legitimate sources, causing widespread confusion and distrust. From a privacy standpoint, the ability to create hyper-realistic synthetic media poses a risk to personal reputations and can be used for blackmail, harassment, or identity theft. Although AI-based detection exists in current research, it is often inaccurate when faced with new, sophisticated Deepfakes, limiting its usefulness1. By enhancing detection systems, this research aims to fill this gap, mitigating the threats of AI-generated media and protecting individuals, institutions, and the public.

This research focuses on answering the question: which CNN architectures detect Deepfake media most effectively? To that end, this study trains, tests, and compares the performance of various CNN architectures with different features to determine which are better suited to the task. The CNN architectures tested here are, for the most part, key performers in image recognition but have yet to be applied to Deepfake detection. The scope of this research is limited to binary classification of images as either real or fake, since the CNN architectures are trained on labeled data. Because the study focuses on Deepfake detection for still images, its conclusions and data may not be applicable to other forms of media such as videos or time lapses.

Background

The industry-standard tools for detecting Deepfake media consist of various deep learning techniques, most notably CNNs and Generative Adversarial Networks (GANs). CNNs are a type of artificial neural network used primarily for image recognition tasks due to their ability to detect minute details through kernel or filter optimization. One of the most widely used CNNs for identifying AI-generated images is XceptionNet, which excels at detecting the pixel anomalies often indicative of Deepfakes.

Patel et al. conducted a case study on Deepfake generation and detection in 20232, reporting three approaches to image/video detection: physical/physiological approaches, signal-level feature approaches, and data-driven models. Data-driven models relate the most to this study, and Patel et al. describe common CNNs that have performed well on problems like this. Rather than focusing on specific features of images or faces (for face forgery problems), data-driven models like CNNs and Recurrent Neural Networks (RNNs) are trained over a general dataset of AI and real images or videos. One of the models they discuss alongside other industry-standard models like Xception and DenseNet is MesoNet, which uses Inception as the foundation of its architecture, allowing it to detect compressed images and Deepfakes on social media platforms, where not all images are high resolution.

Additionally, Ashok V and Dr. Preetha Theresa Joy applied XceptionNet to Deepfake detection in 20233, using it to categorize images and video frames as real or fake. They compared the results to other SOTA models like VGG-16, ResNet-152, and Inception V3 and found that XceptionNet performed best. However, using CNNs for this problem risks overfitting the data, essentially training the model to memorize rather than predict: the model becomes accustomed to seeing the same data and stops learning anything new during the epochs. CNN-based approaches therefore often struggle to generalize, performing well only on their training datasets, underperforming on new data, and yielding conclusions that are not robust. Ashok V and Dr. Joy combated this by taking their data from the FaceForensics++ database, a forensics dataset containing 1000 videos altered using four different face-forging techniques: Face2Face, Deepfakes, FaceSwap, and NeuralTextures. Drawing on multiple techniques helps minimize the confounding variables that would arise if only one technique had been used to turn the real videos into Deepfakes. They trained their model using 5000 real images and 5000 Deepfakes and then tested it on 794 video frames taken from FaceForensics++.

This study uses a large dataset to combat overfitting and increase the generalizability of the custom-architecture CNN: 60,000 synthetically generated images and 60,000 real images (collected from Krizhevsky & Hinton's CIFAR-10 dataset). A dataset of this size exposes the model to many different images and supports a more thorough training process. Furthermore, the personal CNN model was tested on data from Manjil Karki's Deepfake and Real Images dataset to gauge its performance on unseen data and test its generalizability.

A GAN is another widely used tool, consisting of two neural networks: a generator and a discriminator. The two are pitted against each other such that the generator outputs progressively more convincing fake images while the discriminator attempts to differentiate the AI-generated images from the real ones. This technique allows GANs to achieve high accuracies and better identify subtle inconsistencies and patterns within data4. GANs and other deep learning methods like CNNs are staples within the field of AI-generated media detection, as they provide a robust system for detecting and recognizing differences between real and fake images.

Though GANs are commonly used for Deepfake generation, Galamo Monkam, Weifeng Xu, and Jie Yan introduced G-Job GAN in 20235, a GAN-based machine learning model that outperformed other industry-standard GAN models. Their proposed architecture consists of two CNNs, a discriminator and a generator, that work together to progressively create and detect Deepfake images. They pulled images from CelebA, a large-scale dataset containing over 200,000 celebrity images with 40 facial attributes per image, using 162,770 images for training, 19,867 for validation, and 19,962 for testing. They measured their results using F1 score and noted that accuracy improved as dataset size increased.

Though GANs bring unprecedented accuracy and immense potential to the field of Deepfake detection, they are notoriously hard to train, and the generators can easily become biased if not trained on a large and diverse dataset. Finding such datasets is a key limitation of GANs, since current real-world Deepfake detection datasets often lack variance across demographics (ethnicity, gender, image quality, age, etc.) and manipulation techniques (lip syncing, expression transfer, face swapping, etc.). Deepfake technology evolves quickly, which means that datasets can rapidly become outdated as new techniques surface. Privacy concerns around people's real facial data also complicate compiling real images for these datasets, which is another reason some public datasets cannot capture the full range of images.

This was combated by utilizing transfer learning for the industry-standard models incorporated into the study. These models were imported with pre-trained weights determined through exposure to different datasets, which helps increase the generalizability of the results and reduce the likelihood of overfitting. Additionally, the personal CNN model was tested on Manjil Karki's Deepfake and Real Images dataset6 to test the generalizability of the solution and expose the model to different kinds of images. Using images from different sources exposes the model to a diverse set of images and helps determine the model's actual strength.

Methodology Overview

This problem falls under supervised learning and is a classification task. In supervised learning, the model is trained on labeled data, where it is known whether each image is real or fake. Since the output is a prediction of whether an image is real or fake, this is a binary classification problem; the labels are 0 for fake images and 1 for real images. A mix of real images and Deepfake images was gathered to build the dataset. The data used is vision data: images that were either generated by artificial intelligence (fake) or captured in the real world (real). After gathering the images, they were converted into numerical values usable by the CNN models. Each model's output is a label, either 0 or 1, depending on whether it predicts the image is fake or real.

After data preprocessing, several accepted image-recognition models with differing architectural priorities and goals, such as DenseNet121, ResNet50, VGG-16, and VGG-19, were tested. For comparison, an age classifier was modified for this problem and a personalized CNN architecture was developed. After testing and fine-tuning, these models were evaluated on validation accuracy, validation loss, validation precision, validation recall, and validation F1 score. The generalizability of the personal CNN architecture was further tested on the Manjil Karki Deepfake and Real Images dataset from Kaggle6. By experimenting with different CNN architectures, the goal is to identify which produces the most accurate predictions and, in turn, which types of architectures are best applied to this situation. After all the models were evaluated, saliency maps were visualized for true real, true fake, false real, and false fake predictions for the ResNet50 model, the 4 Convolution [512, 512, 2] model from the age classifier, the personal CNN on the CIFAKE dataset, and the personal CNN on the Manjil Karki dataset. This research is significant because finding the best CNN architecture can improve Deepfake detection systems and help address the growing issue of media manipulation.

Results

The raw VGG architectures had the lowest validation accuracies overall, and the transfer learning models also trailed the custom model, as shown in the table below. ResNet50 had the highest validation accuracy among the pre-trained models at 94 percent, but was still lower than the experimental model. The modified age classifier model with the highest accuracy was the 4 convolution [512, 512, 2] model, with an accuracy of 92.36 percent.

These models were evaluated by comparing their validation losses, accuracies, precision, recall, and F1 scores. Loss is an evaluation metric that compares the true labels to the predicted outputs using a specific loss function, such as cross-entropy loss or mean squared error, and it is helpful for visualizing how inaccurate different models are relative to one another. This study uses cross-entropy loss due to its common use in classification problems; it compares the true probability distribution of the data to the predicted probability distribution. Cross-entropy loss also helps identify areas of possible over- or underfitting, which is essential in evaluating the overall success of the image classification models. Equation 1 shows the formula for cross-entropy loss.

(1)   \begin{equation*}Loss = - \frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(p_{i,c})\end{equation*}
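To make the formula concrete, the snippet below is a minimal NumPy sketch (illustrative only, not the study's training code) that computes this loss from one-hot labels and predicted probabilities:

```python
import numpy as np

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy over N samples.

    y_true: (N, C) one-hot labels; y_pred: (N, C) predicted probabilities.
    eps guards against log(0).
    """
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Two images labeled [fake, real] with hypothetical softmax outputs.
y_true = np.array([[1.0, 0.0], [0.0, 1.0]])   # 0 = fake, 1 = real
y_pred = np.array([[0.9, 0.1], [0.2, 0.8]])
print(cross_entropy_loss(y_true, y_pred))      # ~0.164
```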

Validation accuracy, on the other hand, measures the proportion of correct predictions out of the total number of predictions; multiplying by 100 expresses it as a percentage. Equation 2 shows the formula for accuracy.

(2)   \begin{equation*}Accuracy = \frac{\text{Correct Predictions}}{\text{Total Number of Predictions}}\end{equation*}


Including precision, recall, and F1 score helps identify the predictive habits of the model by assessing its likelihood of correctly identifying Deepfake images versus real images. Together with the accuracy and loss results, this creates a thorough set of evaluation metrics for the study. Below, Equation 3, Equation 4, and Equation 5 show the formulas used to calculate Precision7, Recall8, and F1 Score9, respectively.

(3)   \begin{equation*}Precision = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}\end{equation*}

(4)   \begin{equation*}Recall = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}\end{equation*}

(5)   \begin{equation*}F1\ Score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}\end{equation*}
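For reference, all five metrics can be reproduced with standard scikit-learn calls. The snippet below is a hedged sketch with made-up labels; `average=None` returns the per-class F1 values reported in the tables that follow:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, log_loss,
                             precision_score, recall_score)

# Hypothetical predictions for six images: 0 = fake, 1 = real.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([[0.8, 0.2], [0.3, 0.7], [0.1, 0.9],
                   [0.4, 0.6], [0.2, 0.8], [0.9, 0.1]])
y_pred = y_prob.argmax(axis=1)

print("loss     :", log_loss(y_true, y_prob))                # Equation 1
print("accuracy :", accuracy_score(y_true, y_pred))          # Equation 2
print("precision:", precision_score(y_true, y_pred))         # Equation 3
print("recall   :", recall_score(y_true, y_pred))            # Equation 4
print("f1/class :", f1_score(y_true, y_pred, average=None))  # Equation 5
```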

Below is a table comparing the performance of the transfer learning models, the personal experimental model, and the raw architectures of VGG-16 and VGG-19 (note that the following metrics are all after fine-tuning; where two F1 values appear, they are reported per class).

Model Type | Validation Loss | Validation Accuracy | Validation Precision | Validation Recall | Validation F1 Score
VGG-16 Architecture | 0.6931 | 0.5025 | 0.5025 | 0.5025 | 0.6689
VGG-19 Architecture | 0.6931 | 0.5025 | 0.5025 | 0.5025 | 0.6689
VGG Transfer Learning | 0.5760 | 0.8097 | 0.8097 | 0.8097 | 0.8221, 0.7955
DenseNet121 Transfer Learning | 0.4374 | 0.8378 | 0.8378 | 0.8378 | 0.8474, 0.8268
ResNet50 Transfer Learning | 0.1746 | 0.9355 | 0.9355 | 0.9355 | 0.9335, 0.9374
Personal CNN | 0.1314 | 0.9696 | 0.9696 | 0.9696 | 0.9697, 0.9695
Personal CNN on Manjil Karki's Dataset | 0.9635 | 0.8112 | 0.8215 | 0.8112 | 0.8103
Figure 1: Validation Metrics for Image Classification Models

For the modified age classifier, 12 different models were generated: for each number of convolutional layers (2, 3, and 4), the dense-layer node configuration varies among [2], [512, 2], [512, 512, 2], and [1024, 512, 256, 2].

Model Name | Validation Accuracy | Validation Loss | Validation Precision | Validation Recall | Validation F1 Score (per class)
2 convolution [2] | 0.8618 | 0.3391 | 0.8618 | 0.8618 | 0.8634, 0.8600
2 convolution [512, 2] | 0.8852 | 0.2985 | 0.8852 | 0.8852 | 0.8914, 0.8781
2 convolution [512, 512, 2] | 0.9014 | 0.2954 | 0.9014 | 0.9014 | 0.9041, 0.8986
2 convolution [1024, 512, 256, 2] | 0.8992 | 0.3800 | 0.8992 | 0.8992 | 0.9031, 0.8949
3 convolution [2] | 0.9109 | 0.2644 | 0.9109 | 0.9109 | 0.9118, 0.9100
3 convolution [512, 2] | 0.9052 | 0.2594 | 0.9052 | 0.9052 | 0.9105, 0.8992
3 convolution [512, 512, 2] | 0.9159 | 0.2500 | 0.9159 | 0.9159 | 0.9135, 0.9182
3 convolution [1024, 512, 256, 2] | 0.9178 | 0.4092 | 0.9178 | 0.9178 | 0.9195, 0.9159
4 convolution [2] | 0.9123 | 0.2316 | 0.9123 | 0.9123 | 0.9096, 0.9149
4 convolution [512, 2] | 0.9188 | 0.2289 | 0.9188 | 0.9188 | 0.9226, 0.9146
4 convolution [512, 512, 2] | 0.9237 | 0.2263 | 0.9237 | 0.9237 | 0.9242, 0.9231
4 convolution [1024, 512, 256, 2] | 0.9219 | 0.2524 | 0.9219 | 0.9219 | 0.9240, 0.9197
Figure 2: Validation Metrics for Modified Age Classifier

The model with the highest accuracy for the age classifier was the 4 convolution [512, 512, 2] model, while the model with the lowest accuracy was the 2 convolution [2] model. The following confusion matrices show the highest- and lowest-performing models; a label of 0 means fake, and a label of 1 means real.

Figure 3: Confusion Matrices for Modified Age Classifier Model. (a) 4-convolution model [512, 512, 2] with the highest validation accuracy (92.36%). (b) 2-convolution model [2] with the lowest validation accuracy (86.17%)

Methodology

Figure 4: Overview of Methodology Visualization

Data Preprocessing

This project uses CIFAKE: Real and AI-Generated Synthetic Images10, taken from Kaggle, which contains 60,000 synthetically generated images (created as an equivalent of CIFAR-10 using Stable Diffusion version 1.4) and 60,000 real images (collected from Krizhevsky & Hinton's CIFAR-10 dataset). At the beginning of the project, the images are assigned numerical labels of 0 and 1, where 0 represents an AI-generated image and 1 represents a real image. The data was split with scikit-learn's train_test_split using an 80:20 ratio. During preprocessing, the image paths are looped through and separated into four lists: real_test, real_train, fake_test, and fake_train. X is the four lists concatenated together, and y is the corresponding list of 0 and 1 labels. The test set is later split in half to create a separate validation set.
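A minimal sketch of this splitting step is shown below, assuming the CIFAKE images have already been decoded into arrays (the placeholder loading is hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the decoded CIFAKE images
# (hypothetical loader; the real dataset has 60,000 of each class).
fake_images = np.random.rand(600, 32, 32, 3)
real_images = np.random.rand(600, 32, 32, 3)

X = np.concatenate([fake_images, real_images])
y = np.concatenate([np.zeros(len(fake_images)),   # 0 = AI-generated
                    np.ones(len(real_images))])   # 1 = real

# 80:20 train/test split; the test portion is later halved into test/validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_test, y_test, test_size=0.5, stratify=y_test, random_state=42)
```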

Figure 5: Color Channel Intensities for AI Images
Figure 6: Color Channel Intensities for Human Images

To preprocess specifically for CNN testing, the values are converted back from binary into 32×32 images such that the shapes of X_train, X_test, X_validation, y_train, y_test, and y_validation are, respectively, (96000, 32, 32, 3), (24000, 32, 32, 3), (12000, 32, 32, 3), (96000, 2), (24000, 2), and (12000, 2). The goal is to organize the data into separate arrays for later use when experimenting with scikit-learn models and visualizing the data.
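The reshaping and one-hot label encoding can be sketched as follows, assuming the arrays from the split above (the [0, 1] pixel scaling is an assumption, as the text does not specify normalization):

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

# Raw values -> (N, 32, 32, 3) float images; scaling to [0, 1] is an assumption.
X_train = np.asarray(X_train).reshape(-1, 32, 32, 3).astype("float32") / 255.0
# Integer labels -> one-hot vectors of shape (N, 2): [1, 0] = fake, [0, 1] = real.
y_train = to_categorical(y_train, num_classes=2)
```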

To better understand the data, the different RGB color distributions, as well as the average brightness for pixels, are visualized for the fake images and the real images separately.

To visualize the RGB color intensities, each image is traversed pixel by pixel in row-major order. For each pixel, the red, green, and blue values are summed separately, and each sum is divided by the image dimensions to obtain the average intensity of that color channel for that image. With one list per channel for fake and for real images, a total of six lists was created, and a histogram was plotted for each. Figure 5 shows the average red, green, and blue color intensities for the AI-generated images, and Figure 6 shows the same for the human images (computed over the total dataset, irrespective of the train/test split). Although the distributions of red and green intensity are approximately normal for both human and AI images, the distribution of blue intensity appears skewed right, which may indicate that most images do not contain much blue. However, there are no significant differences in the distributions between human and AI images.

To visualize the brightness values, the images are converted to grayscale, and the same row-major approach is used to loop through each pixel. To find the average brightness per image, the grayscale pixel values are summed and the total is divided by the image dimensions. Looping through the lists of real and fake images, this computation was repeated for every image in the study and saved in two lists: one for the average brightnesses of all real images, and one for all fake images.
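Both statistics can be computed equivalently with vectorized NumPy operations; the sketch below mirrors the per-pixel averaging described above (the grayscale luma weights are an assumption, since the conversion method is not specified):

```python
import numpy as np

def channel_means(images):
    """Per-image average red, green, and blue intensities.

    images: (N, 32, 32, 3) array. Equivalent to the row-major pixel loop:
    each channel is summed over all pixels and divided by the image size.
    """
    means = images.mean(axis=(1, 2))          # shape (N, 3)
    return means[:, 0], means[:, 1], means[:, 2]

def average_brightness(images):
    """Per-image average brightness after a grayscale conversion
    (standard luma weights; an assumption, as the text does not
    specify the conversion used)."""
    gray = images @ np.array([0.299, 0.587, 0.114])   # (N, 32, 32)
    return gray.mean(axis=(1, 2))
```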

Both are plotted as histograms for easier comparison, shown together in Figure 7 below. Both graphs show an approximately normal distribution, with most images having an average brightness near the middle of the spectrum, indicating no significant skew in brightness for either AI or human images. However, the average brightness of the human images has a slightly greater spread than that of the AI images, which may be due to colors and shades found in real life that AI images do not replicate.

Figure 7: Average Color Intensity for AI Images and Human Images

To solve the research problem, this work began by implementing several scikit-learn methods (KNN, Random Forest, Linear Regression, and SVC models) using both the brightness training data described in the Dataset section and the full image data. After experimenting with the data to understand how it responds to model classification in general, the study progressed to the neural-network MLP classifier and then to testing various CNN architectures.
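A sketch of these baseline experiments is given below; the hyperparameters are assumptions, integer labels (not one-hot) are assumed, and logistic regression stands in for the linear regression model named above, since a plain regressor has no classification accuracy:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Flatten each 32x32x3 image into a single feature vector.
X_tr = X_train.reshape(len(X_train), -1)
X_te = X_test.reshape(len(X_test), -1)

baselines = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
}
for name, clf in baselines.items():
    clf.fit(X_tr, y_train)                # assumes integer labels, not one-hot
    print(name, clf.score(X_te, y_test))  # mean accuracy on the test split
```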

Introduction to Models Used

VGG-16 and VGG-19 are two pre-existing CNN architectures commonly used for image recognition in this field. The architectures are known for their simplicity, consisting of 16 and 19 layers respectively with small 3×3 filters; they serve as benchmarks in image classification and are often also used in image style transfer problems11. These models were implemented manually by building their layers in TensorFlow and Keras without pre-trained weights.

Transfer learning is also applied by importing the VGG model from Keras Applications, which includes pre-determined weights from past training. Transfer learning takes models that have already been trained on large datasets and adapts them to new tasks, meaning the imported model has already been configured to produce its best accuracy on image recognition tasks, its most common use in this field12. For the VGG-16, VGG-19, and transfer learning VGG models, the networks are later fine-tuned by unfreezing the layers from layer 14 onward, which includes the last two convolutional blocks (each block containing one max pooling layer and three convolutional layers) and the connected dense layers. The earlier layers, which capture more general features, are kept frozen, while the later layers responsible for higher-level abstraction and pattern identification are retrained to better distinguish Deepfakes from real images.
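A sketch of this setup in Keras is shown below; the classification head and learning rates are assumptions, while the unfreeze-from-layer-14 step mirrors the description above:

```python
import tensorflow as tf

# Import VGG16 with ImageNet weights and no classification head.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(32, 32, 3))
base.trainable = False                  # phase 1: train only the new head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),    # head size: assumption
    tf.keras.layers.Dense(2, activation="softmax"),   # fake vs. real
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
# ... initial training on the frozen base, then:

# Phase 2: fine-tune by unfreezing from layer 14 onward
# (the last two convolutional blocks, per the description above).
base.trainable = True
for layer in base.layers[:14]:
    layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),   # small LR: assumption
              loss="categorical_crossentropy", metrics=["accuracy"])
```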

DenseNet121 is another architecture examined; it has a distinctive structure in which each layer is connected to every other layer, instead of layers simply being stacked one after another. DenseNet121 ensures each layer receives input from all previous layers, which makes the flow of information more efficient13. This model is imported from TensorFlow Keras Applications to obtain its predetermined weights from prior training on image recognition tasks; it is commonly used for text generation, image generation, and image feature extraction. For fine-tuning, DenseNet121 was unfrozen from layer 299 of 429, retraining the final dense block and classifier layers, which are responsible for decision making and fine pattern identification, two crucial aspects of Deepfake detection. Retraining only the later dense blocks reduced the risk of overfitting from unnecessarily retraining earlier layers and improved performance by letting the model specialize its predictions.

Additionally, this study uses the ResNet architecture, which is unique in that it prioritizes model efficiency by using residual blocks to skip unnecessary layers, reducing computation time and making the model easier to train. This approach allows ResNet to support extremely deep networks while avoiding the performance degradation that often comes with adding convolutional layers. Due to its deep, complicated nature, the model is commonly used for demanding computer vision tasks like image segmentation and object detection14, and within the field it is most often applied to image recognition. In this study it is imported via transfer learning from the TensorFlow Keras Applications library. For the ResNet50 transfer learning, the first 7 blocks, which primarily serve as general feature extractors, are preserved, and the layers are unfrozen starting at block 8, which includes the final convolutional layers, global average pooling, and the fully connected output layer. The model is retrained with these upper layers so that it can adapt to detect subtle patterns within Deepfake images that were not captured during its original training.

Finally, to act as a comparison point, this study looks at a personalized CNN with strategically composed convolutional and pooling layers to see if it could outperform the standard models used for image classification. This methodology helps determine whether it is necessary to develop specific and targeted models or if other models perform well too.

This model takes images of size 32×32 with three RGB color channels and applies 128 filters in its first block. The first block contains an alternating pattern of a convolutional layer, a Leaky ReLU activation function, and a Batch Normalization layer. A Leaky ReLU activation function is a variant of ReLU that helps the network learn complex patterns by reducing the likelihood of the network becoming inactive during training, while Batch Normalization normalizes the outputs of the previous layers to help stabilize the model15. This three-layer combination is repeated twice more. The first block ends with a pooling layer that reduces the dimensions by half, to 16×16, and a dropout layer that drops 20% of the neurons to prevent overfitting.

The second block has 256 filters of size (3,3) but follows the same repetition pattern of the first block. The third block follows the same pattern but with 512 filters and only two repetitions of the three-layer combination consisting of a convolutional layer, LeakyReLU layer, and batch normalization layer.

Finally, the network flattens the output into a 1D vector in preparation for the following dense layers, which ultimately output the probabilities for each class used for binary classification. The Adam optimizer, an adaptive learning-rate method commonly used for deep networks16, is utilized here, and the model was trained with a batch size of 64 over 10 epochs.
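A Keras sketch consistent with this description is given below; the kernel sizes, dense-layer width, and dropout rates for the later blocks are assumptions where the text does not specify them:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def conv_block(model, filters, repeats):
    """`repeats` rounds of Conv -> LeakyReLU -> BatchNorm,
    closed by 2x2 max pooling and 20% dropout."""
    for _ in range(repeats):
        model.add(layers.Conv2D(filters, (3, 3), padding="same"))
        model.add(layers.LeakyReLU())
        model.add(layers.BatchNormalization())
    model.add(layers.MaxPooling2D((2, 2)))   # halves the spatial dimensions
    model.add(layers.Dropout(0.2))           # 20% dropout (stated for block 1)

model = models.Sequential()
model.add(layers.Input(shape=(32, 32, 3)))
conv_block(model, 128, repeats=3)   # block 1: 32x32 -> 16x16
conv_block(model, 256, repeats=3)   # block 2: 16x16 -> 8x8
conv_block(model, 512, repeats=2)   # block 3: 8x8 -> 4x4
model.add(layers.Flatten())
model.add(layers.Dense(256, activation="relu"))    # dense width: assumption
model.add(layers.Dense(2, activation="softmax"))   # class probabilities

model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, batch_size=64, epochs=10,
#           validation_data=(X_val, y_val))
```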

During transfer learning, the model is unfrozen starting at layer 23 of 40 so that the layers of the final block (a convolutional layer, a LeakyReLU layer, Batch Normalization, a second convolutional layer, LeakyReLU, Batch Normalization, max pooling, and a dropout layer) and the uppermost dense-layer block can be retrained. Essentially, the low- to mid-level features are kept the same while the higher layers are re-trained on the dataset, which yields better performance since the model is able to identify new patterns from the data.

Figure 8 below visualizes the architecture and labels the different layers of the model.

Figure 8: Personal CNN Architecture

The section below gives an in-depth explanation and walkthrough of M. Fatih Aydogdu and M. Fatih Demirci's Age Classifier CNN17, which uses a distinctive technique to determine a person's age from an image. Their methodology is then replicated and applied to this research problem.

Age Classification CNN Architecture

This methodology was taken from research conducted by M. Fatih Aydogdu and M. Fatih Demirci, who used a unique approach to classify age from images with a convolutional neural network. In their paper, they created multiple CNN stages consisting of increasing numbers of layers and varying node counts, then combined each stage with each dense-layer configuration to create a combinatorial number of models and find the most accurate one. This approach was successful because they ensured that the final feature-map format of the images was common across all CNN stages.

The age classification model is included due to its unique architecture and methodology. The systematic combination of multiple CNN models is an interesting way to increase model performance, since it allows experimentation with different numbers of layers and nodes at once. A multi-stage approach helps build a more robust model with varying levels of complexity, ultimately allowing flexible and progressive analysis of fine image details, which is crucial for technical problems like age classification and Deepfake detection.

To replicate their approach, stages with two, three, and four convolutional layers are used. Matrix arithmetic is then performed to find a combination of layers and pooling that yields the same final image size for all three CNN stages, making the architectures compatible. However, the approach had to be adjusted to fit the CIFAKE image sizes, which differ from those in the original study.

Figure 9: Base Convolutional Models
Figure 10: Dense Layer Configurations

After building and saving each individual CNN stage, the dense-layer node configurations to be used with each structure were determined; four different groups were selected. These four configurations were applied to each CNN stage, creating 12 different models, as shown in the sketch below. Using a for loop over the node configurations, each model was built and stored in a list of all 12 to make training easier later. Another for loop cycled through the models and their names, trained them, saved their validation curves to files, and recorded their accuracies in a table for clarity. Changing the number of node groups or CNN stages would have produced a different number of models, but to avoid deviating from the original study, four node groups and three CNN stages are used, yielding 12 models of varying convolutional layers and node counts.
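The composition process can be sketched as two nested loops over the convolutional stages and the dense configurations; the filter counts and activations below are assumptions, since only the layer and node counts are specified:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

conv_depths = [2, 3, 4]                                  # convolutional stages
dense_configs = [[2], [512, 2], [512, 512, 2], [1024, 512, 256, 2]]

def build_model(n_conv, dense_nodes):
    """One of the 12 models: n_conv conv/pool pairs plus a dense stack."""
    model = models.Sequential()
    model.add(layers.Input(shape=(32, 32, 3)))
    filters = 32
    for _ in range(n_conv):
        model.add(layers.Conv2D(filters, (3, 3), padding="same",
                                activation="relu"))
        model.add(layers.MaxPooling2D((2, 2)))
        filters *= 2                                     # growth rule: assumption
    model.add(layers.Flatten())
    for nodes in dense_nodes[:-1]:
        model.add(layers.Dense(nodes, activation="relu"))
    model.add(layers.Dense(dense_nodes[-1], activation="softmax"))
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

all_models = [(f"{n} convolution {cfg}", build_model(n, cfg))
              for n in conv_depths for cfg in dense_configs]   # 12 models
```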

Figure 11: Model Composition Process
Figure 12: All 12 Final Models for Modified Age Classifier

Evaluation Metrics and Modes of Comparison

For each of these models, the preprocessed dataset of real and fake images was used for training over ten epochs. The 12 models generated in the age classifier replica were individually trained; the validation accuracies, loss, precision, recall, and F1 scores were saved in the table shown in the Results section, and confusion matrices were generated for the highest- and lowest-performing models. The confusion matrices assisted in evaluating model precision, and the accuracies helped summarize the overall performance of all the models.

Analyzing the accuracy of the models provided a general sense of performance, but accuracy can be misleading with imbalanced datasets, which is why other metrics are included as well. Validation loss was useful because it quantified the error between the predicted and actual outputs, helping to identify potential over- or underfitting.

Precision quantifies how many of the images the model labeled as fake were actually fake, while Recall indicates how many of the truly fake images the model correctly identified. F1 Score is the harmonic mean of precision and recall, balancing false positives and false negatives in a single robust score that gives a broader idea of the model's exactness. Though these three scores are related, using all of them provides a nuanced understanding of how well the model predicted fake or real images and, together with validation accuracy and loss, a more complete and reliable evaluation of model performance.

Error Analysis

This section visualizes, analyzes, and compares the saliency maps for ResNet50 (the highest-performing SOTA image classifier), the 4 Convolution [512, 512, 2] model (the highest-performing model from the modified age classifier), the personal CNN on the CIFAKE dataset (the original dataset), and the personal CNN on Manjil Karki's Deepfake and Real Images dataset (the test dataset). These models are used for the saliency maps because each performed best in its category: ResNet50 had the highest performance across all metrics among the pretrained image classifiers, the 4 Convolution [512, 512, 2] model had the highest performance of the 12 models from the modified age classifier, and the personal CNN had the highest performance across all models on the CIFAKE dataset. To make the saliency maps, images were selected that the models classified with high confidence scores, where the confidence score refers to the level of confidence the model had when making its prediction for a given image.

From the saliency map visualizations, a clear pattern emerges in the focus of the four different models. The personal CNN, which exhibits a higher accuracy on the CIFAKE dataset than the ResNet50 and 4 Convolution [512, 512, 2] models, focuses on the central part of the image, typically around key features such as human or animal faces. Its area of focus is also smaller and contains brighter pixels, indicating that the center of the image was crucial to the model's predictions. This suggests the model is efficient, but it can also lead to overfitting on the limited image patterns specific to the CIFAKE dataset. In comparison, the ResNet50 and 4 Convolution [512, 512, 2] models, while sometimes also focusing on the center of the image, typically consider a wider area with dimmer pixels, indicating a less precise region influencing their predictions, which could have contributed to their lower performance.

ResNet50 in particular shows saliency maps in Figure 13 that do not exhibit a circular area of focus: in all the visualizations displayed, the model's focus is scattered around the image, whereas the other models' focus is clearly circular and central. The 4 Convolution [512, 512, 2] model likewise does not display a circular locus of focus but instead seems to consider the whole image. This is a clear difference from the personal CNN, which exhibits a very small, usually central, circular area of focus.

It can also be observed that the models had more difficulty with images where the central object or person was a similar color to the background. For the personal CNN on the new dataset (Figure 16), the two misclassified images show people in the dark, leaving a less distinct boundary between the people and the background, while the two correctly classified images show a clear separation, having been taken in better lighting with starker lines.

The visualizations for the ResNet50 model, the 4 Convolution [512, 512, 2] model, the personal CNN on the CIFAKE dataset, and the personal CNN on the Manjil Karki dataset are shown below. There are four subplots per visualization, each containing the actual image on the left side by side with the saliency-map overlay on the right. All images used for the saliency map analysis have high confidence levels (close or equal to 1.000), indicating that the model was essentially certain about its prediction, a helpful metric to consider while analyzing the behavior of the different models and its role in their overall performance.
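For reference, a saliency map of this kind can be computed as the gradient of the winning class score with respect to the input pixels; the sketch below uses `tf.GradientTape` and is one common formulation, not necessarily the exact code used here:

```python
import numpy as np
import tensorflow as tf

def saliency_map(model, image):
    """Gradient of the predicted class score w.r.t. the input pixels.

    image: (32, 32, 3) array in [0, 1]. Returns a (32, 32) map where
    brighter values mark pixels with the greatest influence on the prediction.
    """
    x = tf.convert_to_tensor(image[np.newaxis], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        preds = model(x)
        score = tf.reduce_max(preds[0])       # confidence of the winning class
    grads = tape.gradient(score, x)[0]        # (32, 32, 3) pixel gradients
    sal = tf.reduce_max(tf.abs(grads), axis=-1)   # collapse color channels
    return (sal / (tf.reduce_max(sal) + 1e-8)).numpy()
```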

Figure 13: ResNet50 Multiple Saliency Map Overlays and Image
Figure 14: 4 Convolution [512, 512, 2] For Multiple Saliency Map Overlays and Images
Figure 15: Personal CNN Architecture on CIFAKE Dataset For Multiple Saliency Map Overlays and Images
Figure 16: Personal CNN on Manjil Karki Dataset For Multiple Saliency Map Overlays and Images

Overfitting Analysis

Detailed analysis of the training and validation metrics for the pre-trained image classifiers, as well as the custom CNNs, shows that alongside the potential for recognizing underlying patterns in images lies the risk of memorizing the data instead of analyzing it.

Comparing the training accuracy with the validation accuracy after fine-tuning for the ResNet50 model shows that although the training accuracy continued to increase over the 10 epochs, the validation accuracy fluctuated between 93 and 94 percent and did not improve. This is a strong indicator of potential overfitting, where the model memorizes the training data without improving its performance on unseen validation data. This could occur because the image classifiers were imported with pre-existing weights from prior training on datasets intended for image classification, not Deepfake detection. The same pattern, validation accuracy fluctuating between two values while training accuracy continues to increase, is also present in the ResNet50 results before fine-tuning and in the personal CNN after fine-tuning on the original CIFAKE dataset. The values for each evaluation metric for both training and validation data, before and after fine-tuning, are shown in Figure 17.
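This comparison can be read directly off the Keras training history; the sketch below (assuming the `model` and data arrays defined earlier) plots the two curves whose divergence signals overfitting:

```python
import matplotlib.pyplot as plt

# `history` is returned by model.fit; a training curve that keeps rising
# while the validation curve plateaus is the overfitting signature above.
history = model.fit(X_train, y_train, batch_size=64, epochs=10,
                    validation_data=(X_val, y_val))

plt.plot(history.history["accuracy"], label="training accuracy")
plt.plot(history.history["val_accuracy"], label="validation accuracy")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```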

Conclusion

For the CIFAKE dataset, the pretrained image classifiers were outperformed by the personal CNN architecture. The best-performing pretrained model was ResNet50, with a validation accuracy of 94 percent, which was surprising considering the personal model reached about 97 percent. The age classifier models had accuracies that increased roughly with the number of nodes and convolutional layers; the highest accuracy for the age classifier replication was the 4 convolution [512, 512, 2] model, with a validation accuracy of 92.36 percent. However, testing the personal CNN on Manjil Karki's Deepfake and Real Images dataset resulted in a validation accuracy of about 81 percent, a significant drop from its performance on the CIFAKE dataset.

While transfer learning and fine-tuning are expected to improve model performance, the results showed that on certain datasets they can be outperformed by raw CNN architectures without pre-existing weights. This may be because the weights of the pre-trained image classifiers were determined for generic image classification, an entirely separate problem from Deepfake detection.

These results support the idea that although pre-trained image classifiers can be applied to Deepfake detection, the lack of datasets specific to this task and the lack of SOTA methods for benchmarking create models prone to overfitting and limited generalizability. Creating the personal CNN was easier and faster than using the industry-standard models, and it even performed better on the CIFAKE dataset, but it underperformed on Manjil Karki's Deepfake and Real Images dataset. This points to overfitting on the training dataset, shown by the increase in training accuracy over 10 epochs while the validation accuracy did not change and often decreased, and it calls into question the diversity and strength of current datasets used for Deepfake detection.

However, one limitation of this study is the use of a single dataset to train the transfer learning models and the modified age classifier. Although the dataset contains many images, all the fake images were taken from the same generative platform: they were created as an equivalent of CIFAR-10 using Stable Diffusion version 1.4, while all the real images were collected from Krizhevsky & Hinton's CIFAR-10 dataset. This raises the possibility of bias skewing the data and may mean the data is not representative of all images. Furthermore, testing the personal CNN on Manjil Karki's Deepfake and Real Images dataset to determine its generalizability revealed a significant performance drop, from a validation accuracy of about 97 percent to about 81 percent, indicating that the model's performance is not as strong as previously thought. Additionally, this study was limited to a few architectures and did not explore beyond the transfer learning models (VGG-16, VGG-19, DenseNet121, ResNet50) and the age classifier model. To stay accurate to the original age classifier, its methodology was not varied far from the original design, though accuracy was found to keep increasing with depth.

Although there are existing datasets standardly used for Deepfake detection tasks, this study highlights the importance of developing updated datasets that account for the latest advances in generative AI. These newer datasets, alongside clearer SOTA models specific to this task, will allow for clear benchmarking standards to improve both training and evaluation. Future research should focus on developing more specialized CNN architectures and should further experiment with the modified age classifier to see whether it continues to follow the trend of increasing validation accuracy with increasing numbers of nodes and CNN stages. Further research should also test the impacts of changing the nodes, CNN stages, and CNN layers.

As Deepfake technology continues to evolve and become harder to detect, it becomes clearer that models designed specifically to detect fake images are increasingly necessary. Though this study searches for applications of existing models to this new problem, it also encourages the development of more specialized models and more substantial datasets to mitigate the problem. Detecting Deepfakes is not just a technical challenge but a necessary step in safeguarding digital content and safety.

Appendix

Epoch | Validation Loss | Validation Accuracy | Validation Precision | Validation Recall | Validation F1 Score
1 | 0.8218 | 0.7968 | 0.8173 | 0.7968 | 0.7944
2 | 0.6918 | 0.8140 | 0.8140 | 0.8140 | 0.8140
3 | 0.7282 | 0.8088 | 0.8144 | 0.8088 | 0.8084
4 | 0.7506 | 0.8108 | 0.8155 | 0.8108 | 0.8105
5 | 0.7647 | 0.8044 | 0.8074 | 0.8044 | 0.8043
6 | 0.7409 | 0.8164 | 0.8165 | 0.8164 | 0.8163
7 | 0.7355 | 0.8092 | 0.8098 | 0.8092 | 0.8092
8 | 0.7369 | 0.8232 | 0.8240 | 0.8232 | 0.8232
9 | 0.7180 | 0.8112 | 0.8113 | 0.8112 | 0.8112
10 | 0.9635 | 0.8112 | 0.8215 | 0.8112 | 0.8103
Figure 17: Table of all Training and Validation Evaluation Metrics Per Epoch Before and After Fine Tuning
Figure 18: Evaluation Metrics Per Epoch for Fine Tuning Personal CNN on Manjil Karki Dataset
Epoch | Model | Training Accuracy | Training F1 (per class) | Training Loss | Validation Accuracy | Validation F1 (per class) | Validation Loss
(Training precision and recall equal training accuracy, and validation precision and recall equal validation accuracy, at every epoch; values rounded to four decimal places.)
1 | 2 convolution [2] | 0.7890 | 0.7847, 0.7931 | 0.5308 | 0.7847 | 0.7480, 0.8120 | 0.4688
2 | 2 convolution [2] | 0.8362 | 0.8332, 0.8392 | 0.3739 | 0.8520 | 0.8581, 0.8454 | 0.3565
3 | 2 convolution [2] | 0.8479 | 0.8448, 0.8508 | 0.3544 | 0.8431 | 0.8356, 0.8499 | 0.3576
4 | 2 convolution [2] | 0.8554 | 0.8528, 0.8579 | 0.3362 | 0.8677 | 0.8647, 0.8705 | 0.3108
5 | 2 convolution [2] | 0.8598 | 0.8575, 0.8620 | 0.3276 | 0.8668 | 0.8610, 0.8720 | 0.3160
6 | 2 convolution [2] | 0.8661 | 0.8638, 0.8683 | 0.3168 | 0.8572 | 0.8486, 0.8648 | 0.3294
7 | 2 convolution [2] | 0.8687 | 0.8666, 0.8707 | 0.3119 | 0.8562 | 0.8573, 0.8550 | 0.3380
8 | 2 convolution [2] | 0.8741 | 0.8720, 0.8761 | 0.3028 | 0.8629 | 0.8595, 0.8662 | 0.3273
9 | 2 convolution [2] | 0.8739 | 0.8719, 0.8758 | 0.2993 | 0.8733 | 0.8695, 0.8769 | 0.3084
10 | 2 convolution [2] | 0.8761 | 0.8742, 0.8780 | 0.2963 | 0.8618 | 0.8634, 0.8600 | 0.3391
1 | 2 convolution [512, 2] | 0.8775 | 0.8768, 0.8782 | 0.2950 | 0.8743 | 0.8747, 0.8740 | 0.2996
2 | 2 convolution [512, 2] | 0.8908 | 0.8901, 0.8915 | 0.2657 | 0.8879 | 0.8848, 0.8909 | 0.2731
3 | 2 convolution [512, 2] | 0.8983 | 0.8978, 0.8989 | 0.2497 | 0.8897 | 0.8896, 0.8897 | 0.2632
4 | 2 convolution [512, 2] | 0.9019 | 0.9013, 0.9024 | 0.2426 | 0.8974 | 0.8972, 0.8976 | 0.2542
5 | 2 convolution [512, 2] | 0.9053 | 0.9047, 0.9058 | 0.2333 | 0.8987 | 0.8996, 0.8978 | 0.2574
6 | 2 convolution [512, 2] | 0.9085 | 0.9081, 0.9090 | 0.2257 | 0.8975 | 0.9010, 0.8937 | 0.2693
7 | 2 convolution [512, 2] | 0.9111 | 0.9105, 0.9116 | 0.2213 | 0.8961 | 0.8968, 0.8954 | 0.2587
8 | 2 convolution [512, 2] | 0.9136 | 0.9131, 0.9140 | 0.2156 | 0.8982 | 0.8972, 0.8992 | 0.2545
9 | 2 convolution [512, 2] | 0.9165 | 0.9160, 0.9171 | 0.2082 | 0.8996 | 0.8979, 0.9012 | 0.2556
10 | 2 convolution [512, 2] | 0.9198 | 0.9194, 0.9202 | 0.2008 | 0.8852 | 0.8914, 0.8781 | 0.2985
1 | 2 convolution [512, 512, 2] | 0.9036 | 0.9034, 0.9038 | 0.2606 | 0.8923 | 0.8912, 0.8933 | 0.2926
2 | 2 convolution [512, 512, 2] | 0.9158 | 0.9156, 0.9160 | 0.2090 | 0.8938 | 0.8926, 0.8951 | 0.2704
3 | 2 convolution [512, 512, 2] | 0.9208 | 0.9206, 0.9211 | 0.1986 | 0.8933 | 0.8909, 0.8956 | 0.2732
4 | 2 convolution [512, 512, 2] | 0.9235 | 0.9232, 0.9238 | 0.1900 | 0.8957 | 0.8976, 0.8937 | 0.2746
5 | 2 convolution [512, 512, 2] | 0.9282 | 0.9279, 0.9285 | 0.1795 | 0.8967 | 0.8998, 0.8935 | 0.2684
6 | 2 convolution [512, 512, 2] | 0.9317 | 0.9313, 0.9321 | 0.1705 | 0.9093 | 0.9099, 0.9086 | 0.2397
7 | 2 convolution [512, 512, 2] | 0.9355 | 0.9352, 0.9359 | 0.1621 | 0.9008 | 0.9017, 0.8999 | 0.2565
8 | 2 convolution [512, 512, 2] | 0.9389 | 0.9385, 0.9392 | 0.1535 | 0.9021 | 0.9002, 0.9039 | 0.2852
9 | 2 convolution [512, 512, 2] | 0.9410 | 0.9406, 0.9413 | 0.1488 | 0.9052 | 0.9045, 0.9059 | 0.3013
10 | 2 convolution [512, 512, 2] | 0.9447 | 0.9444, 0.9450 | 0.1393 | 0.9014 | 0.9041, 0.8986 | 0.2954
1 | 2 convolution [1024, 512, 256, 2] | 0.9147 | 0.9144, 0.9150 | 0.2635 | 0.8963 | 0.8988, 0.8936 | 0.2686
2 | 2 convolution [1024, 512, 256, 2] | 0.9296 | 0.9294, 0.9298 | 0.1779 | 0.9028 | 0.9034, 0.9021 | 0.2542
3 | 2 convolution [1024, 512, 256, 2] | 0.9363 | 0.9361, 0.9365 | 0.1616 | 0.9007 | 0.8977, 0.9035 | 0.3374
4 | 2 convolution [1024, 512, 256, 2] | 0.9396 | 0.9392, 0.9399 | 0.1522 | 0.8990 | 0.8962, 0.9017 | 0.2756
5 | 2 convolution [1024, 512, 256, 2] | 0.9436 | 0.9432, 0.9439 | 0.1433 | 0.9069 | 0.9075, 0.9063 | 0.2685
6 | 2 convolution [1024, 512, 256, 2] | 0.9476 | 0.9473, 0.9478 | 0.1342 | 0.9025 | 0.9031, 0.9019 | 0.3024
7 | 2 convolution [1024, 512, 256, 2] | 0.9512 | 0.9509, 0.9514 | 0.1256 | 0.9024 | 0.9009, 0.9039 | 0.3043
8 | 2 convolution [1024, 512, 256, 2] | 0.9544 | 0.9541, 0.9547 | 0.1167 | 0.9055 | 0.9053, 0.9057 | 0.3641
9 | 2 convolution [1024, 512, 256, 2] | 0.9565 | 0.9562, 0.9569 | 0.1105 | 0.9034 | 0.9040, 0.9029 | 0.3179
10 | 2 convolution [1024, 512, 256, 2] | 0.9609 | 0.9607, 0.9611 | 0.1010 | 0.8992 | 0.9031, 0.8949 | 0.3800
1 | 3 convolution [2] | 0.8249 | 0.8222, 0.8276 | 0.4700 | 0.8760 | 0.8744, 0.8776 | 0.2986
2 | 3 convolution [2] | 0.8734 | 0.8721, 0.8746 | 0.3032 | 0.8787 | 0.8870, 0.8689 | 0.2963
3 | 3 convolution [2] | 0.8854 | 0.8843, 0.8865 | 0.2782 | 0.8917 | 0.8923, 0.8912 | 0.2681
4 | 3 convolution [2] | 0.8941 | 0.8931, 0.8951 | 0.2598 | 0.8810 | 0.8733, 0.8878 | 0.2764
5 | 3 convolution [2] | 0.9003 | 0.8994, 0.9012 | 0.2468 | 0.8885 | 0.8831, 0.8934 | 0.2676
6 | 3 convolution [2] | 0.9042 | 0.9033, 0.9050 | 0.2377 | 0.8973 | 0.8973, 0.8974 | 0.2506
7 | 3 convolution [2] | 0.9086 | 0.9077, 0.9094 | 0.2288 | 0.8805 | 0.8723, 0.8877 | 0.2803
8 | 3 convolution [2] | 0.9104 | 0.9095, 0.9112 | 0.2234 | 0.9079 | 0.9062, 0.9096 | 0.2435
9 | 3 convolution [2] | 0.9122 | 0.9114, 0.9131 | 0.2200 | 0.9032 | 0.9057, 0.9006 | 0.2526
10 | 3 convolution [2] | 0.9136 | 0.9128, 0.9145 | 0.2137 | 0.9109 | 0.9118, 0.9100 | 0.2644
1 | 3 convolution [512, 2] | 0.9151 | 0.9149, 0.9153 | 0.2187 | 0.9068 | 0.9095, 0.9040 | 0.2358
2 | 3 convolution [512, 2] | 0.9230 | 0.9229, 0.9231 | 0.1932 | 0.9083 | 0.9089, 0.9078 | 0.2326
3 | 3 convolution [512, 2] | 0.9268 | 0.9267, 0.9269 | 0.1845 | 0.9138 | 0.9154, 0.9122 | 0.2267
4 | 3 convolution [512, 2] | 0.9305 | 0.9305, 0.9306 | 0.1766 | 0.9103 | 0.9100, 0.9107 | 0.2470
5 | 3 convolution [512, 2] | 0.9322 | 0.9322, 0.9322 | 0.1715 | 0.9050 | 0.9011, 0.9086 | 0.2420
6 | 3 convolution [512, 2] | 0.9359 | 0.9359, 0.9359 | 0.1636 | 0.8969 | 0.9021, 0.8911 | 0.2678
7 | 3 convolution [512, 2] | 0.9349 | 0.9348, 0.9350 | 0.1668 | 0.9139 | 0.9119, 0.9158 | 0.2161
8 | 3 convolution [512, 2] | 0.9395 | 0.9394, 0.9396 | 0.1549 | 0.9126 | 0.9107, 0.9144 | 0.2608
9 | 3 convolution [512, 2] | 0.9411 | 0.9410, 0.9412 | 0.1498 | 0.9168 | 0.9162, 0.9173 | 0.2181
10 | 3 convolution [512, 2] | 0.9434 | 0.9432, 0.9435 | 0.1457 | 0.9052 | 0.9105, 0.8992 | 0.2594
1 | 3 convolution [512, 512, 2] | 0.9407 | 0.9406, 0.9408 | 0.2158 | 0.9126 | 0.9106, 0.9145 | 0.2328
2 | 3 convolution [512, 512, 2] | 0.9454 | 0.9453, 0.9455 | 0.1411 | 0.9183 | 0.9179, 0.9186 | 0.2421
3 | 3 convolution [512, 512, 2] | 0.9459 | 0.9458, 0.9461 | 0.1389 | 0.9128 | 0.9117, 0.9138 | 0.2218
4 | 3 convolution [512, 512, 2] | 0.9477 | 0.9476, 0.9479 | 0.1333 | 0.9130 | 0.9153, 0.9105 | 0.2240
5 | 3 convolution [512, 512, 2] | 0.9506 | 0.9504, 0.9508 | 0.1272 | 0.9187 | 0.9196, 0.9177 | 0.2416
6 | 3 convolution [512, 512, 2] | 0.9512 | 0.9509, 0.9515 | 0.1244 | 0.9153 | 0.9156, 0.9151 | 0.2250
7 | 3 convolution [512, 512, 2] | 0.9540 | 0.9538, 0.9542 | 0.1193 | 0.9178 | 0.9176, 0.9179 | 0.2762
8 | 3 convolution [512, 512, 2] | 0.9416 | 0.9409, 0.9423 | 0.1529 | 0.9118 | 0.9103, 0.9133 | 0.2388
9 | 3 convolution [512, 512, 2] | 0.9573 | 0.9571, 0.9575 | 0.1108 | 0.9154 | 0.9147, 0.9162 | 0.2767
10 | 3 convolution [512, 512, 2] | 0.9578 | 0.9575, 0.9580 | 0.1116 | 0.9159 | 0.9135, 0.9182 | 0.2500
1 | 3 convolution [1024, 512, 256, 2] | 0.9550 | 0.9549, 0.9551 | 0.1606 | 0.9178 | 0.9172, 0.9185 | 0.3068
2 | 3 convolution [1024, 512, 256, 2] | 0.9594 | 0.9594, 0.9594 | 0.1065 | 0.9165 | 0.9183, 0.9146 | 0.2322
3 | 3 convolution [1024, 512, 256, 2] | 0.9599 | 0.9598, 0.9601 | 0.1055 | 0.9168 | 0.9194, 0.9139 | 0.2464
4 | 3 convolution [1024, 512, 256, 2] | 0.9594 | 0.9592, 0.9595 | 0.1053 | 0.9140 | 0.9127, 0.9153 | 0.3069
5 | 3 convolution [1024, 512, 256, 2] | 0.9589 | 0.9587, 0.9591 | 0.1104 | 0.9047 | 0.9006, 0.9084 | 0.3168
6 | 3 convolution [1024, 512, 256, 2] | 0.9610 | 0.9607, 0.9612 | 0.1034 | 0.9139 | 0.9145, 0.9134 | 0.2662
7 | 3 convolution [1024, 512, 256, 2] | 0.9626 | 0.9624, 0.9628 | 0.0982 | 0.9186 | 0.9190, 0.9181 | 0.3584
8 | 3 convolution [1024, 512, 256, 2] | 0.9647 | 0.9645, 0.9649 | 0.0928 | 0.9158 | 0.9142, 0.9172 | 0.6383
9 | 3 convolution [1024, 512, 256, 2] | 0.9518 | 0.9514, 0.9521 | 0.1290 | 0.9057 | 0.9087, 0.9024 | 0.2861
10 | 3 convolution [1024, 512, 256, 2] | 0.9636 | 0.9634, 0.9639 | 0.0966 | 0.9178 | 0.9195, 0.9159 | 0.4092
1 | 4 convolution [2] | 0.8254 | 0.8252, 0.8256 | 0.4070 | 0.8787 | 0.8784, 0.8790 | 0.2918
2 | 4 convolution [2] | 0.8787 | 0.8785, 0.8789 | 0.2918 | 0.8927 | 0.8930, 0.8923 | 0.2734
3 | 4 convolution [2] | 0.8935 | 0.8932, 0.8937 | 0.2613 | 0.9028 | 0.9041, 0.9014 | 0.2455
4 | 4 convolution [2] | 0.9021 | 0.9018, 0.9024 | 0.2444 | 0.8961 | 0.8998, 0.8921 | 0.2572
5 | 4 convolution [2] | 0.9086 | 0.9082, 0.9090 | 0.2281 | 0.8957 | 0.8910, 0.9001 | 0.2482
6 | 4 convolution [2] | 0.9124 | 0.9120, 0.9128 | 0.2204 | 0.9050 | 0.9031, 0.9068 | 0.2391
7 | 4 convolution [2] | 0.9156 | 0.9153, 0.9159 | 0.2122 | 0.9058 | 0.9018, 0.9095 | 0.2396
8 | 4 convolution [2] | 0.9201 | 0.9199, 0.9203 | 0.2013 | 0.9156 | 0.9157, 0.9155 | 0.2153
9 | 4 convolution [2] | 0.9220 | 0.9217, 0.9224 | 0.1972 | 0.9230 | 0.9237, 0.9223 | 0.2007
10 | 4 convolution [2] | 0.9248 | 0.9245, 0.9252 | 0.1889 | 0.9123 | 0.9096, 0.9149 | 0.2316
1 | 4 convolution [512, 2] | 0.9219 | 0.9220, 0.9219 | 0.2024 | 0.9157 | 0.9172, 0.9141 | 0.2169
2 | 4 convolution [512, 2] | 0.9277 | 0.9278, 0.9276 | 0.1825 | 0.9133 | 0.9140, 0.9125 | 0.2154
3 | 4 convolution [512, 2] | 0.9320 | 0.9321, 0.9319 | 0.1728 | 0.8901 | 0.8989, 0.8795 | 0.2505
4 | 4 convolution [512, 2] | 0.9330 | 0.9331, 0.9329 | 0.1701 | 0.9234 | 0.9235, 0.9233 | 0.1997
5 | 4 convolution [512, 2] | 0.9344 | 0.9345, 0.9343 | 0.1676 | 0.9228 | 0.9233, 0.9224 | 0.2175
6 | 4 convolution [512, 2] | 0.9373 | 0.9374, 0.9373 | 0.1584 | 0.9218 | 0.9220, 0.9215 | 0.2130
7 | 4 convolution [512, 2] | 0.9381 | 0.9381, 0.9382 | 0.1571 | 0.9266 | 0.9270, 0.9261 | 0.1925
8 | 4 convolution [512, 2] | 0.9419 | 0.9418, 0.9420 | 0.1505 | 0.9125 | 0.9178, 0.9065 | 0.2331
9 | 4 convolution [512, 2] | 0.9422 | 0.9422, 0.9423 | 0.1491 | 0.9267 | 0.9280, 0.9253 | 0.2045
10 | 4 convolution [512, 2] | 0.9438 | 0.9436, 0.9439 | 0.1424 | 0.9188 | 0.9226, 0.9146 | 0.2289
1 | 4 convolution [512, 512, 2] | 0.9464 | 0.9463, 0.9465 | 0.1592 | 0.9250 | 0.9252, 0.9248 | 0.2005
2 | 4 convolution [512, 512, 2] | 0.9478 | 0.9478, 0.9478 | 0.1341 | 0.9252 | 0.9245, 0.9258 | 0.2024
3 | 4 convolution [512, 512, 2] | 0.9457 | 0.9455, 0.9458 | 0.1390 | 0.9258 | 0.9267, 0.9250 | 0.2081
4 | 4 convolution [512, 512, 2] | 0.9483 | 0.9482, 0.9485 | 0.1325 | 0.9253 | 0.9275, 0.9230 | 0.2081
5 | 4 convolution [512, 512, 2] | 0.9511 | 0.9510, 0.9512 | 0.1266 | 0.9184 | 0.9158, 0.9208 | 0.2444
6 | 4 convolution [512, 512, 2] | 0.9486 | 0.9485, 0.9488 | 0.1358 | 0.9235 | 0.9240, 0.9230 | 0.2235
74 convolution [512, 512, 2]0.95358335970.9534641 0.953701850.11942505840.95358335970.95358335970.92425000670.92454547 0.923952040.20587842170.92425000670.9242500067
84 convolution [512, 512, 2]0.95418751240.95410717 0.954267440.11985733360.95418751240.95418751240.92608332630.92690563 0.92524220.23839986320.92608332630.9260833263
94 convolution [512, 512, 2]0.95458334680.95443803 0.954727530.11714459210.95458334680.95458334680.91891664270.9187609 0.919071730.23677161340.91891664270.9189166427
104 convolution [512, 512, 2]0.95536458490.95523065 0.95549760.11500819770.95536458490.95536458490.9236666560.92419726 0.92312850.22633799910.9236666560.923666656
14 convolution [1024, 512, 256, 2]0.96047914030.96037346 0.96058420.13509884480.96047914030.96047914030.91783332820.91562545 0.91992850.25295427440.91783332820.9178333282
24 convolution [1024, 512, 256, 2]0.96021872760.9602125 0.960224870.10463325680.96021872760.96021872760.92466664310.9251035 0.924224560.21584180.92466664310.9246666431
34 convolution [1024, 512, 256, 2]0.95641666650.9564003 0.956432940.11412245040.95641666650.95641666650.9213333130.92248315 0.92014880.29025495050.9213333130.921333313
44 convolution [1024, 512, 256, 2]0.96048957110.9603735 0.960604850.10494970530.96048957110.96048957110.91366666560.91044253 0.91666660.24000394340.91366666560.9136666656
54 convolution [1024, 512, 256, 2]0.95710414650.95709336 0.95711480.11374570430.95710414650.95710414650.92141664030.9218788 0.920948860.21314528580.92141664030.9214166403
64 convolution [1024, 512, 256, 2]0.95855206250.95834154 0.95876030.10758584740.95855206250.95855206250.9243333340.92524284 0.92340140.23822979630.9243333340.924333334
74 convolution [1024, 512, 256, 2]0.96168750520.96152073 0.96185270.099563077090.96168750520.96168750520.91991668940.9189097 0.920898740.24683223660.91991668940.9199166894
84 convolution [1024, 512, 256, 2]0.95863538980.9584166 0.95885180.10745286940.95863538980.95863538980.92100000380.9190434 0.92286410.26743102070.92100000380.9210000038
94 convolution [1024, 512, 256, 2]0.96234375240.96216434 0.96252130.098953217270.96234375240.96234375240.9236666560.92550415 0.921736060.26576995850.9236666560.923666656
104 convolution [1024, 512, 256, 2]0.96214580540.9619803 0.96230990.099753610790.96214580540.96214580540.92191666360.92404956 0.91966040.25235506890.92191666360.9219166636
Figure 19: Evaluation Metrics Per Epoch for 12 Models from Modified Age Classifier
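For readers wishing to reproduce per-epoch metrics like those summarized in Figure 19, the scikit-learn functions cited in references 7–9 can be attached to training via a callback. The sketch below is a minimal illustration, not the study's actual evaluation code: it assumes a Keras binary classifier with a two-unit softmax output, and the names x_val, y_val, and EpochMetrics are hypothetical.

```python
# Minimal sketch of per-epoch metric logging (assumed Keras setup,
# not the authors' exact code): a callback that scores a held-out
# validation set after every epoch with the scikit-learn metrics
# cited in references 7-9.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from tensorflow import keras

class EpochMetrics(keras.callbacks.Callback):
    """Log accuracy, precision, recall, and F1 on validation data each epoch."""

    def __init__(self, x_val, y_val):
        super().__init__()
        self.x_val = x_val  # validation images (hypothetical held-out split)
        self.y_val = y_val  # integer labels: 0 = real, 1 = Deepfake

    def on_epoch_end(self, epoch, logs=None):
        probs = self.model.predict(self.x_val, verbose=0)
        preds = np.argmax(probs, axis=1)  # class with the highest softmax score
        print(
            f"epoch {epoch + 1}: "
            f"acc={accuracy_score(self.y_val, preds):.4f} "
            f"prec={precision_score(self.y_val, preds, average='weighted'):.4f} "
            f"rec={recall_score(self.y_val, preds, average='weighted'):.4f} "
            f"f1={f1_score(self.y_val, preds, average='weighted'):.4f}"
        )

# Usage: model.fit(x_train, y_train, epochs=10,
#                  callbacks=[EpochMetrics(x_val, y_val)])
```

With a roughly balanced two-class dataset, weighted-average precision and recall track plain accuracy closely, which is consistent with the near-identical accuracy, precision, and recall values reported for each epoch in Figure 19.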

References

  1. N. Jacobson, Deepfakes and Their Impact on Society, CPI OpenFox, 2024
  2. Y. Patel et al., Deepfake Generation and Detection: Case Study and Challenges, IEEE Access 11, 143296–143323, 2023
  3. Deepfake Detection Using XceptionNet, IEEE, https://ieeexplore.ieee.org/document/10363477
  4. What is a GAN? – Generative Adversarial Networks Explained, Amazon Web Services, https://aws.amazon.com/what-is/gan/
  5. G. Monkam, W. Xu, J. Yan, A GAN-based Approach to Detect AI-Generated Images, in 2023 26th ACIS International Winter Conference on SNPD-Winter, IEEE, 229–232, 2023
  6. Deepfake and Real Images, Kaggle, https://www.kaggle.com/datasets/manjilkarki/deepfake-and-real-images
  7. precision_score, scikit-learn, https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html
  8. recall_score, scikit-learn, https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html
  9. f1_score, scikit-learn, https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
  10. CIFAKE: Real and AI-Generated Synthetic Images, Kaggle, https://www.kaggle.com/datasets/birdy654/cifake-real-and-ai-generated-synthetic-images
  11. G. Boesch, Very Deep Convolutional Networks (VGG) Essential Guide, viso.ai, 2021
  12. What Is Transfer Learning? A Guide for Deep Learning, Built In, https://builtin.com/data-science/transfer-learning
  13. S. Das, Implementing DenseNet-121 in PyTorch: A Step-by-Step Guide, deepkapha notes, 2023
  14. G. Boesch, Deep Residual Networks (ResNet, ResNet-50) – 2024 Guide, viso.ai, 2023
  15. L. Gupta, Batch Normalization and ReLU for solving Vanishing Gradients, Analytics Vidhya, 2021
  16. Adam Optimizer – an overview, ScienceDirect Topics, https://www.sciencedirect.com/topics/computer-science/Adam-optimizer
  17. M. F. Aydogdu, M. F. Demirci, Age Classification Using an Optimized CNN Architecture, in Proceedings of the International Conference on Compute and Data Analysis, ACM, 233–239, Lakeland, FL, USA, 2017
