Detection of Forged Images Using a Combination of Passive Methods Based on Neural Networks
1. Introduction
Acquisition artifacts are introduced into the image either by imperfections in the manufacture of camera sensors, known as fixed-pattern noise, or by the programs cameras use to process sensor data before storage. Artifacts introduced in this manner usually follow a predictable pattern, and any divergence from that pattern is evidence of manipulation.
Format artifacts are introduced during the digital storage of images by algorithms such as JPEG compression, which removes information less relevant to the human eye; areas of an image whose artifacts of this type differ from what is expected indicate that the image was manipulated. Manipulation artifacts are introduced during image editing by a program, for example when a blur filter alters areas of an image in predictable patterns; the presence of these known artifacts in areas of an image is evidence of manipulation.
There has yet to be a consensus on classifying all image manipulations; however, most papers recognize at least two types: copy-move and splicing. In copy-move forgery, an area of an image is duplicated and pasted over another area of the same image. In splicing, an area of a different image is pasted over the image. This manipulation can add new meaning to the image or conceal information; the resulting image is a fusion of two different images.
This methodological improvement overcomes the limitations of previous studies, ensuring that the detection model is trained on diverse and representative data. By incorporating a wider range of manipulations, including subtle retouching techniques such as filters, highlights, resizing, and inpainting, the proposed approach enhances the model’s ability to generalize and accurately detect manipulation across various contexts.
Furthermore, our approach capitalizes on established passive detection methods within a multi-stream neural network architecture, as outlined in the abstract. This integration enables the utilization of knowledge acquired from traditional detection methods to enhance the accuracy of neural network-based approaches. By leveraging this synergy, our method aims to significantly mitigate overfitting, a prevalent challenge in such methodologies.
2. Related Works
A problem in defining the search string was the significant variability of the terms used to refer to the addressed subject. For example, terms sometimes used to refer to counterfeits are “falsification”, “tampering”, “counterfeiting”, “adulteration”, “forgery”, “manipulation”, “edited”, “doctored”, or “altered”. After some tests, we defined the following search string: “image AND (tamper OR forged OR forgery) AND (detect OR localize) AND NOT video”. This search returned a considerable number of works. To narrow the results of this first search, we adopted the following criteria:
- The presented detection method must follow the passive approach;
- The detection method should primarily focus on verifying digital images’ authenticity; and
- The work must be among the top five most relevant results of each database that satisfy the other criteria.
This search yielded 15 works. However, most of the papers returned by this initial search did not claim to detect all types of manipulation and primarily focused on a single type. Therefore, to include works related to general identification, terms such as “global” or “universal” were tested. Unfortunately, not all works that detect all types of manipulation used those terms, so the search key had to be made less specific: “image AND (tamper OR forged OR forgery) AND (detect OR localize)”, and to limit the search the following criteria were employed:
- The work must introduce a method for detecting general-purpose manipulated images in its text;
- The presented detection method must follow the passive approach;
- The presented method should not require file formats with data compression;
- The detection method should primarily focus on verifying digital images’ authenticity;
- The paper was published in the last five years; and
- The paper must be among the most relevant results of each database that satisfy the other criteria.
The detection methods found can be categorized as classifiers capable of determining alterations in JPEG artifacts, classifiers to detect duplicated areas, and methods that compare detected features to determine correlation or fixed pattern noise (FPN).
To understand methods based on fixed pattern noise, it is first necessary to understand the image capture process. Image acquisition aims to transform an image into a discrete and numerical representation that a computer can store and process. This approach requires a sensor capable of capturing a range of energy from the electromagnetic spectrum and generating as output an electric signal proportional to the captured energy level; then, a digitizer must convert the analog signal into digital information that can be represented in binary form.
The datasets encompassing Duplication and Splicing manipulations consist of CASIA v1.0, CASIA v2.0, Columbia, and the Image Manipulation Database. CASIA v1.0 and v2.0 feature images with splicing and duplication manually introduced. In contrast, the Columbia and Image Manipulation datasets are generated algorithmically by editing patches of an image. The Columbia dataset exclusively contains splicing changes, while the Image Manipulation Database includes both duplication and splicing. The CASIA datasets also feature alterations classified as retouching, aimed at camouflaging the other modifications.
To tackle these challenges, this study employs a meticulously curated dataset comprising four publicly available datasets featuring images manually tampered by humans. Importantly, this dataset does not impose restrictions on the types of manipulations, aiming to closely mimic real-life scenarios and facilitate generalization across a broader spectrum of manipulations. Furthermore, the multi-stream CNN approach enables the utilization of knowledge acquired from traditional passive methods. This approach allows for the extraction of data streams that may contain pertinent information for the model’s analysis.
3. Materials and Methods
In this section, we present the methodology adopted for image manipulation detection, which integrates Error-Level Analysis (ELA) and Discrete Wavelet Transform (DWT) alongside a novel multi-stream neural network architecture.
Our decision to adopt a multi-stream neural network architecture was driven by the recognition that traditional passive methods could offer valuable insights to guide the learning process of a Convolutional Neural Network (CNN) for image manipulation detection. Rather than aiming to identify the optimal method outright, we viewed this choice as an initial exploration of the potential capabilities such an architecture may offer. In line with this perspective, we selected two streams based on methodologies outlined in previous literature, with the intention of extracting different facets of image data. This approach was motivated by our desire to integrate diverse aspects of passive image analysis, serving as a foundational step in our investigation into the effectiveness of multi-stream architectures for detecting image manipulation.
The multi-stream architecture comprises three distinct CNNs, each operating on a unique data stream extracted from the original image. Two of these streams were selected based on methodologies outlined in prior literature, which will be explained in detail in the subsequent sections. Each stream is designed to analyze specific data subsets, while the third stream processes the unaltered image itself. By adopting this multi-stream framework, our aim is to leverage the strengths of traditional passive methods and integrate them into a unified detection system.
3.1. Error-Level Analysis
Listing 1. Error-Level Analysis implementation in pseudocode.

# Performs Error-Level Analysis (ELA) on an image using JPEG compression.
# Returns a normalized difference image between the original image and a
# JPEG-compressed version of the image.
FUNCTION method_1_ela(image, quality)
    # Create another image with JPEG compression at the given quality
    save_image_as_jpeg(image, temp_image, quality)
    compressed_image = open_image(temp_image)
    # Calculate the difference between the original and the JPEG-compressed image
    difference_image = image - compressed_image
    # Normalize the difference for contrast: assign 255 to the brightest
    # points and proportionally scale all other values
    normalized_difference = difference_image.normalizeContrast()
    RETURN normalized_difference
END FUNCTION
In this example, the resulting image at the bottom is mostly black, but the edited areas contain more color. However, ELA results are not always so easily interpreted and traditionally require a forensics specialist to examine them. This paper instead proposes using ELA as a feature-extraction step and feeding its output to a Convolutional Neural Network to perform the authenticity analysis of an image automatically.
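For concreteness, the ELA pseudocode above can be realized in a few lines of Python. The following is a minimal sketch using the Pillow library; the function name `ela` and the default quality of 90 are our own choices, not taken from the paper’s implementation.

```python
# Minimal ELA sketch with Pillow. Recompresses the image as JPEG in memory,
# takes the per-pixel absolute difference, and stretches its contrast.
import io

from PIL import Image, ImageChops, ImageEnhance


def ela(image: Image.Image, quality: int = 90) -> Image.Image:
    """Return a contrast-normalized difference between `image`
    and a JPEG-recompressed copy of it."""
    buffer = io.BytesIO()
    image.convert("RGB").save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    compressed = Image.open(buffer)
    # Per-pixel absolute difference between original and recompressed image
    difference = ImageChops.difference(image.convert("RGB"), compressed)
    # Scale so the brightest difference maps to 255
    extrema = difference.getextrema()  # one (min, max) pair per band
    max_diff = max(band_max for _, band_max in extrema) or 1
    return ImageEnhance.Brightness(difference).enhance(255.0 / max_diff)
```

Authentic regions then tend toward black, while regions recompressed at a different quality stand out, matching the behavior described above.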
3.2. Discrete Wavelet Transform
This technique works as follows: the image is first converted to grayscale; DWT is then applied to decompose the image information into sub-bands; the image is reconstructed after discarding the LL band; and finally a bilateral filter and the Laplace operator are applied. The result is an image in which sharp pixel transitions are more easily visible; however, when filters are used to mask the manipulation, this method fails to highlight the manipulated areas. This work incorporates the image resulting from this method as one of the streams in a Convolutional Neural Network, utilizing it for feature selection to enhance the model’s ability to detect large pixel transitions.
Listing 2. DWT-based method implementation in pseudocode.

FUNCTION method_2_dwt(image)
    # Convert image to grayscale and perform the discrete wavelet transform
    gray_image = convertToGrayscale(image)
    coeffs = discreteWaveletTransform(gray_image)
    (LL, (LH, HL, HH)) = coeffs
    # Reconstruct the image using only the high-frequency components
    high_freq_components = (None, (LH, HL, HH))
    joinedLhHlHh = inverseDiscreteWaveletTransform(high_freq_components)
    # Apply a bilateral filter to smooth the image while preserving edges
    blurred = bilateralFilter(joinedLhHlHh, 9, 75, 75)
    # Apply Laplacian edge detection to highlight edges
    kernel_size = 3
    imgLaplacian = laplacianEdgeDetection(blurred, kernel_size)
    # Convert negative values to absolute values
    final_image = convertScaleToAbs(imgLaplacian)
    RETURN final_image
END FUNCTION
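As an illustration, the pipeline in Listing 2 can be approximated with plain NumPy. The sketch below hand-rolls a single-level Haar DWT and substitutes a simple box blur for the bilateral filter to avoid extra dependencies, so it is an approximation of the method rather than a faithful reimplementation.

```python
# Approximate the DWT-based stream: Haar decomposition, drop LL, reconstruct,
# smooth, then apply a Laplacian kernel. Box blur stands in for the bilateral
# filter used in the original pipeline.
import numpy as np


def haar_dwt2(gray: np.ndarray):
    """Single-level 2-D Haar transform; returns (LL, (LH, HL, HH))."""
    a = gray[0::2, 0::2]; b = gray[0::2, 1::2]
    c = gray[1::2, 0::2]; d = gray[1::2, 1::2]
    LL = (a + b + c + d) / 4.0
    LH = (a + b - c - d) / 4.0
    HL = (a - b + c - d) / 4.0
    HH = (a - b - c + d) / 4.0
    return LL, (LH, HL, HH)


def inverse_haar_dwt2(LL: np.ndarray, details) -> np.ndarray:
    """Invert haar_dwt2, interleaving the four sub-bands back into pixels."""
    LH, HL, HH = details
    out = np.empty((LL.shape[0] * 2, LL.shape[1] * 2), dtype=float)
    out[0::2, 0::2] = LL + LH + HL + HH
    out[0::2, 1::2] = LL + LH - HL - HH
    out[1::2, 0::2] = LL - LH + HL - HH
    out[1::2, 1::2] = LL - LH - HL + HH
    return out


def conv3(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """3x3 convolution with edge padding."""
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros_like(img)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out


def method_2_dwt(gray: np.ndarray) -> np.ndarray:
    LL, details = haar_dwt2(gray.astype(float))
    # Discard the low-frequency LL band before reconstruction
    high = inverse_haar_dwt2(np.zeros_like(LL), details)
    # Smooth (stand-in for the bilateral filter), then apply the Laplacian
    blurred = conv3(high, np.full((3, 3), 1.0 / 9.0))
    laplacian = conv3(blurred, np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], float))
    # Take absolute values, as convertScaleToAbs does in the pseudocode
    return np.abs(laplacian)
```

A production implementation would typically use `pywt.dwt2`/`pywt.idwt2` and OpenCV’s `bilateralFilter` and `Laplacian` instead of these hand-rolled pieces.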
3.3. Proposed Method
This approach aims to explore the accuracy of each stream individually in addition to their combination. For this, we individually trained three distinct models and a model composed of all three streams, plus additional learning layers. Stream A uses only the original images without alterations as input, Stream B uses only images with feature selection by ELA, and Stream C uses only images with feature selection by the DWT-based method. Finally, the models are combined and four layers are added to generate the Merged model, which uses images from all three streams as input.
The tests revealed a tendency for pre-trained models to overfit to the training images. Additionally, attempts to mitigate overfitting by unfreezing some of the pre-trained layers yielded similar results. Moreover, employing another pre-trained model, such as InceptionResNetV2, did not demonstrate superior performance compared to the custom model. Consequently, these findings lead us to conclude that pre-trained models may not be well-suited for image manipulation detection.
With that in mind, we decided to create three identical custom CNNs and implement regularization methods to minimize overfitting to produce results capable of generalizing to a more extensive set of images and facilitating the comparison of the three individual models.
The convolutional layers were configured with 16, 32, 64, 128, and 64 filters, each using a 3 by 3 kernel. Similarly, the dense layers were configured with sizes of 128, 128, 64, and 32.
The final proposed merged model is a culmination of individual models A, B, and C. Initially, the last non-output layers of models A and B were combined through a concatenation layer. Subsequently, this combined output was further merged with the last non-output layer of model C through another concatenation layer. To facilitate learning, two dense layers with 64 neurons each were added, accompanied by batch normalization layers. Notably, all layers of the individual models were frozen to preserve the acquired knowledge during training.
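Given the layer sizes reported above, the merged architecture can be sketched in Keras roughly as follows. The activation functions, pooling layers, and the binary sigmoid head are assumptions on our part rather than details taken from the paper.

```python
# Sketch of the three-stream architecture: three identical custom CNNs whose
# last non-output layers are concatenated, followed by dense + batch-norm
# layers, with the individual streams frozen.
import tensorflow as tf
from tensorflow.keras import layers, models


def build_stream(name: str) -> tf.keras.Model:
    """One custom CNN stream: five conv blocks, then four dense layers."""
    inputs = layers.Input(shape=(224, 224, 3), name=f"{name}_input")
    x = inputs
    for filters in (16, 32, 64, 128, 64):
        x = layers.Conv2D(filters, 3, activation="relu", padding="same")(x)
        x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    for units in (128, 128, 64, 32):
        x = layers.Dense(units, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid", name=f"{name}_out")(x)
    return models.Model(inputs, outputs, name=name)


def build_merged(stream_a, stream_b, stream_c) -> tf.keras.Model:
    # Freeze the individual streams to preserve their learned weights
    for model in (stream_a, stream_b, stream_c):
        model.trainable = False
    # Take the last non-output layer of each stream
    feats = [m.layers[-2].output for m in (stream_a, stream_b, stream_c)]
    ab = layers.Concatenate()([feats[0], feats[1]])      # merge A and B first
    abc = layers.Concatenate()([ab, feats[2]])           # then merge in C
    x = abc
    for _ in range(2):
        x = layers.Dense(64, activation="relu")(x)
        x = layers.BatchNormalization()(x)
    out = layers.Dense(1, activation="sigmoid")(x)
    return models.Model([m.input for m in (stream_a, stream_b, stream_c)], out)
```
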
3.4. Dataset Assembly
The initial step in executing our experiments involved compiling the final dataset. To achieve this, we meticulously curated authentic and manipulated images from four distinct datasets. These datasets were chosen for their robust representation of real-life scenarios, as they involved human manipulation of images, with an emphasis on creating manipulations that were challenging to detect. The size of the dataset was limited due to the low availability of humanly manipulated images, which require significant time and effort to create. Additionally, biases may have been introduced due to the skill set of individuals selected to manipulate the images. Nevertheless, efforts were made to ensure that biases were minimized by selecting datasets that inherently did not impose restrictions on the types of manipulations that could be performed.
-
CASIA V2.0: Proposed in [54], contains 7491 authentic images and 5123 manipulated images featuring Splicing and/or Duplication operations, with retouching operations applied on top to mask the alterations;
-
Realistic Tampering Dataset: Proposed by [55,56], contains 220 authentic images and 220 images with Splicing and/or Duplication manipulations made to the originals with the objective of being realistic. Retouching operations are sometimes applied to help hide the Compositing and Duplication manipulations. In addition, this dataset provides masks of tampered areas and information about the capture devices used;
-
IMD2020: Proposed by [57], it consists of four parts. The first is a dataset of 80 authentic images manipulated to generate 1930 realistically tampered images using all types of manipulation, with their respective manipulation masks. The second part consists of 35,000 authentic images captured by 2322 different camera models; the images were collected online and reviewed manually by the authors. The third part has 35,000 algorithmically generated images with retouching manipulations. Finally, the last part has 2759 authentic images acquired by the authors with 19 different camera models, designed for sensor noise analysis;
-
CASIA V1.0: Proposed in [54], contains 800 authentic images, 459 Duplication-type manipulated images, and 462 Splicing images. This dataset has no retouching operations applied.
By not imposing limitations on the number of manipulations used, certain types of manipulations may be overrepresented in the dataset. Furthermore, only one of the selected datasets provides information on the types of manipulation performed on each image, leaving uncertainty regarding the exact biases present. However, the hope is that the proportion of manipulations in the dataset mirrors that of real-world scenarios.
The second step of the final program is to apply the described methods to the dataset; since both methods produce an image as output, the result is two new sets of images with specific inconsistencies enhanced. Methods from the TensorFlow library were used whenever applicable.
The dataset was divided into three parts: 70% for training, 20% for validation, and 10% for testing. Each image was resized to 224 by 224 pixels to match the input size required by MobileNetV2, a pre-trained neural network that was initially considered but ultimately set aside in favor of a custom model, which performed better.
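The split step above can be sketched as follows, assuming a list of image file paths; the helper name and the fixed seed are illustrative choices of ours.

```python
# Shuffle file paths deterministically and split them 70/20/10.
import random


def split_dataset(paths, seed: int = 42):
    """Return (train, validation, test) lists split 70% / 20% / 10%."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train = int(n * 0.7)
    n_val = int(n * 0.2)
    train = paths[:n_train]
    validation = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:]
    return train, validation, test
```
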
The experiment consisted of first creating two additional sets of images using the two selected passive methods. Three convolutional neural networks were then trained: Model A using the original dataset, and Models B and C using the outputs of the selected passive methods. These models were combined by concatenation at the penultimate layer, and an extra dense layer followed by a batch normalization layer was introduced so the model could learn how to combine the results; this combination constitutes the final merged model.
To enhance accuracy and mitigate overfitting, three techniques were employed. First, a model checkpoint callback was implemented, which saved the model weights corresponding to the minimum validation loss achieved during training. Second, an early stopping condition was defined, terminating training after 50 epochs without any improvement in validation loss.
The third technique used to reduce overfitting was dataset augmentation, applied to the training images at the end of every epoch: images were randomly flipped along both axes, and their brightness, contrast, saturation, and hue were randomly adjusted.
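The augmentation step can be sketched without TensorFlow as follows; the jitter ranges are illustrative assumptions, and the hue/saturation adjustments are omitted here because they require a color-space conversion.

```python
# Random flips plus brightness/contrast jitter for an HxWx3 float image
# with values in [0, 1].
import numpy as np


def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomly flip an image and jitter its brightness and contrast."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]            # horizontal flip
    if rng.random() < 0.5:
        image = image[::-1, :, :]            # vertical flip
    brightness = rng.uniform(-0.1, 0.1)      # additive brightness shift
    contrast = rng.uniform(0.8, 1.2)         # multiplicative contrast factor
    mean = image.mean()
    image = (image - mean) * contrast + mean + brightness
    return np.clip(image, 0.0, 1.0)
```

In the actual pipeline, the equivalent `tf.image` random ops (including saturation and hue jitter) would run inside the input pipeline rather than as a NumPy function.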
The experiments were conducted on a system with the following specifications: Windows 11 operating system, Intel(R) Core(TM) i5-8300H CPU 2.30GHz, 16GB RAM, GeForce GTX 1050 graphics card, Python version 3.10, CUDA version 11.2, cuDNN version 8.1.1, and TensorFlow version 2.10.0. The training was performed with a batch size of 32 over a total of 500 epochs, as the dataset exhibited significant variation in manipulations, leading to fluctuation in the loss of validation images.
4. Results and Discussion
After training, the final accuracy obtained by the merged model on the test set was 89.59%, higher than the 78.64% obtained by the model trained only on original images.
5. Conclusions and Future Work
This study provides a comprehensive review of passive forensic methods designed to detect manipulated images, emphasizing their underlying principles and efficacy. Building upon this foundation, we employ a novel approach that integrates two distinct detection methods within a multi-stream convolutional neural network (CNN) framework. This innovative methodology allows us to harness the strengths of each individual method while mitigating their respective limitations, thereby enhancing overall detection accuracy. Moreover, our analysis is conducted on a meticulously curated dataset comprising a diverse range of image manipulations, ensuring the robustness and generalizability of our findings. Through this combined effort, we aim to advance the field of image manipulation detection by offering a unified and effective solution capable of addressing real-world challenges.
The amalgamation of differently trained models on the same image dataset highlights the potential of leveraging various passive detection techniques to improve overall performance. This is validated by the superior accuracy achieved by our final model compared to its constituent models, indicating the efficacy of using insights from passive methods to select streams for CNN analysis in a multi-stream model.
The study’s limitations primarily revolve around the size and scope of the dataset, coupled with the lack of comprehensive knowledge regarding the manipulations performed on the images. The absence of detailed documentation regarding the specific manipulations applied to each image impedes a thorough analysis of detection performance across different manipulation types. Furthermore, the study lacks a thorough statistical analysis and discussion of false positives/negatives. While the study demonstrates the potential for multi-stream CNN architectures, the absence of localization of manipulated areas represents a further avenue for exploration. Additionally, the study did not thoroughly explore the selection of optimal streams based on passive methods.
However, the findings hold broader implications, highlighting the potential of multi-stream CNN architectures in image manipulation detection. By integrating diverse passive detection methods, this approach not only enhances detection accuracy but also underscores the importance of combining complementary techniques for robust detection. Moreover, the study emphasizes the necessity for standardized datasets and rigorous evaluation metrics in the field of image forensics. These insights can inform future research aimed at developing more effective and reliable methods for detecting manipulated images, thereby enhancing the integrity and trustworthiness of digital media.
This research lays the groundwork for several avenues of expansion and real-world application. Firstly, further exploration could focus on enhancing the dataset used, incorporating more diverse and meticulously documented manipulations to improve the model’s robustness and generalization capabilities. Additionally, future studies could delve into developing methods for localizing manipulated areas within images, which would significantly enhance the practical utility of image manipulation detection systems.
Moreover, the multi-stream CNN architecture demonstrated in this research presents a promising framework for integrating various detection techniques. Expanding upon this, researchers could explore the selection of optimal streams based on passive methods, thereby refining the model’s ability to detect a wide range of manipulation types with greater accuracy.
In real-world scenarios, the findings of this research could be applied in various fields, including digital forensics, media authentication, and content moderation on social media platforms. By deploying robust image manipulation detection systems developed from this research, organizations and individuals can better safeguard against the dissemination of misleading or falsified visual content, thereby upholding the integrity and trustworthiness of digital media.