Automation and Robots for Disaster Response

Detection of Forged Images Using a Combination of Passive Methods Based on Neural Networks

By inergency On Mar 14, 2024

[ad_1]

1. Introduction

Fake news, the dissemination of false information via social networks, has become a pervasive issue in today’s digital landscape. Such misinformation has the potential to deceive and garner more attention than factual news due to its deliberate crafting to evoke strong reactions [1]. Notably, events like the Higgs Boson exploration in 2012 generated a flurry of false information, with as many as 600 tweets per minute [2]. Of particular concern is the use of manipulated images, which can implicitly lend credibility to false narratives, thereby eroding public trust [3,4].

While numerous studies delve into identifying manipulation in images, many focus on datasets containing only specific tampering types for example [5,6], or contain datasets algorithmically generated such as [7]. Other papers use numerous more specific types of manipulation; for example, Liu and Pun [8] mentioned a kind of manipulation to remove parts of an image and replace them with surroundings. Besides, Rocha et al. [9] mentioned healing as a manipulation capable of softening features.

With the intention of categorizing manipulations in as few types as possible while still addressing all forms of manipulation, this paper adopts the categorization of image manipulations proposed by [10,11]. Along with copy-move and splicing, this categorization recognizes retouching as a manipulation technique that involves slightly modifying an image content without completely hiding large portions. Those modifications include applying filters, highlights, resizing, inpainting, or rotating parts of an image.

There are two general approaches to identifying image manipulation: active and passive. Active techniques, also known as preemptive, aim to preemptively insert structures in an image that can be used to detect any future changes. These structures can be visible to the human eye, like in watermarking, or hidden using steganography techniques. Manipulation can be reliably detected by assessing the integrity of the inserted structure. However, this approach requires that the structure be inserted correctly in a trustworthy image capture before any manipulation, as these methods only enable the detection of modifications post-insertion [12].

Meanwhile, passive methods, also known as blind methods, use only the content already present in an image and do not require any prior action, making them better suited for use in social media. To detect image forgery, passive methods can either analyze artifacts intrinsic to a digital image or search for inconsistencies in the content of an image. According to Lubna and Chowdhury [12], the artifacts present in digital images can be divided into three types: acquisition, format, and manipulation.

Acquisition artifacts are introduced in the image either by imperfections created by the manufacture of camera sensors called fixed-pattern noise or by programs used by cameras to process sensor data before storage; the introduction of artifacts in this manner usually follows a predictable pattern, and any divergences to it is evidence of manipulation.

Format artifacts, introduced in the digital storage of images by algorithms like the jpeg compression, by removing information less relevant to human eyes, areas of the image with artifacts of this type different than expected indicate the image was manipulated. Manipulation artifacts, that get introduced during image manipulation by a program, like the application of a blur filter that alters areas of an image in predictable patterns, the presence of those known artifacts in areas of an image is evidence of manipulation.

The other possible approach involves analyzing the content of an image for inconsistencies. Some inconsistencies are duplicated portions of an image, unnaturally sharp edges, shadow inconsistencies, and perspective inconsistencies [13]. Unfortunately, all those methods of detection have their pros and cons. There is no absolute best method capable of detecting all types of manipulation. Therefore, to better understand detection methods and their limitations, it is essential to understand how images can be manipulated in the first place [14].

There has yet to be a consensus on classifying all image manipulations; however, most papers recognize the existence of at least two types, copy-move, and splicing. In copy-move forgery, an area of an image is duplicated and pasted over another area of the same image. In splicing, however, an area of another image is pasted over the image. This manipulation can add new meaning to the image or conceal information; the resulting image is a fusion of two different images.

Following the discussion on various approaches to identifying image manipulation, it is essential to highlight how this paper contributes to advancing the field. While existing studies often focus on datasets limited to specific tampering types or utilize algorithmically generated datasets (e.g., [15,16,17,18]), this research employs a more comprehensive dataset. Notably, this dataset is manually curated and designed to encompass a broad spectrum of realistically manipulated images, covering all types of manipulation encountered in real-life scenarios.

This methodological improvement overcomes the limitations of previous studies, ensuring that the detection model is trained on diverse and representative data. By incorporating a wider range of manipulations, including subtle retouching techniques such as filters, highlights, resizing, and inpainting, the proposed approach enhances the model’s ability to generalize and accurately detect manipulation across various contexts.

Furthermore, our approach capitalizes on established passive detection methods within a multi-stream neural network architecture, as outlined in the abstract. This integration enables the utilization of knowledge acquired from traditional detection methods to enhance the accuracy of neural network-based approaches. By leveraging this synergy, our method aims to significantly mitigate overfitting, a prevalent challenge in such methodologies.

Besides this introductory section, the following sections aim to explain better how the proposed approach works; in Section 2, we discuss several relevant related works. Section 3 presents the methods, the dataset used, and the model architecture; then, Section 4 discusses the results we obtained. Finally, Section 5 presented the final considerations and suggestions to improve the approach.

2. Related Works

A problem in defining the search string consisted of the significant variability of terms used to refer to the addressed subject. For example, terms sometimes used to refer to counterfeits are “falsification”, “tampering”, “counterfeiting”, “adulteration”, “forgery”, “manipulation”, “edited”, “doctored” or “altered”. After some tests, we defined the following search string: “image AND (tamper OR forged OR forgery) AND (detect OR localize) AND NOT video”. This search returned a considerable amount of work. To delimit the search, using the first search string, we adopted the following criteria:

The presented detection method must follow the passive approach;
The Detection method should primarily focus on verifying digital images’ authenticity; and
Must be in the top five most relevant results of each base that follows the other criteria.

With this research, we find 15 works. However, most of the papers returned on this initial search did not claim to be capable of detecting all types of manipulation and primarily focused on a single type, therefore, to include works related to general identification, terms such as “global” or “universal” were tested. Unfortunately, not all works that detect all types of manipulation used those terms, so the search key had to be made less specific: “image AND (tamper OR forged OR forgery) AND (detect OR localize)” and to limit the search the following criteria were employed:

Introduce a method of detecting general-purpose manipulated images in their text;
The presented detection method must follow the passive approach;
The presented method should not require file formats with data compression;
The Detection method should primarily focus on verifying digital images’ authenticity;
The paper was published in the last five years; and
The paper must have the most relevant results of each base that follow the other criteria.

From this search, three additional works were incorporated, bringing the total to 18 selected works. Among the selected works, similarities emerge in their operational methodologies, despite variances in their detection approaches. We identified commonalities based on the detected manipulations, denoted as Det. Man. (D for duplication, S for splicing, and R for retouching with a slash used to distinguish them where applicable), their treatment of color space, techniques employed for feature extraction to extract pertinent image details, and the detection methods utilized to identify image manipulation. These shared characteristics are meticulously detailed in Table 1, which serves as a comprehensive overview of their collective attributes and methodologies.

The detection methods found can be categorized as classifiers capable of determining alterations in JPEG artifacts, classifiers to detect duplicated areas, and methods that compare detected features to determine correlation or fixed pattern noise (FPN).

To understand methods based on fixed pattern noise, it is first necessary to understand the image capture process. Image acquisition aims to transform an image into a discrete and numerical representation that a computer can store and process. This approach requires a sensor capable of capturing a range of energy from the electromagnetic spectrum and generating as output an electric signal proportional to the captured energy level; then, a digitizer must convert the analog signal into digital information that can be represented in binary form.

The manufacturing process of most sensors responsible for capturing images in digital cameras introduces imperfections that cause minor differences in light sensitivity [31]. The divergences of all the sensors present in a camera introduce a variation in the values of the pixels of images registered by these cameras, resulting in unevenness similar to a signature in all the images it generates [32].

The FPN of a sensor is constant; however, it varies from sensor to sensor. In Sensors of the Charged-Coupled Device (CCD) type, the FPN varies randomly, while in sensors of the Complementary Metal Oxide Semiconductor (CMOS) type, due to its perpendicular capture system, the FPN forms vertical bars, as can be seen in Figure 1 taken from [33].

The work by [25] assumes that the FPN information of the camera used for capture is previously known and then calculates the FPN of the image for validation by comparing the two and marking significantly different areas as being possibly altered.

On the other hand, methods based on classifiers use filtered features to determine if there is a correlation between parts of the image or alteration of the compression through machine learning [34]. In deep learning [35,36,37], convolutional neural networks (CNNs) are increasingly being used for image classification [38,39,40].

In this field, several authors are working to reduce the complexity of the classification step, considering the use of big data [41,42,43] and making this evaluation more efficient [44,45,46]. Finally, correlation-based methods compare features by similarity to determine whether they contain Duplication, often using a final step to eliminate a portion of the found correlations. In this context, unsupervised learning methods are also applied [47].

All datasets in our literature review focus on detecting Duplication and Splicing manipulations. Their most significant divergences are the images used for alteration, the size of the altered area, and the application of subsequent modifications to hide falsification on the manipulated areas. The work by [48], which is the only one focused on detecting manipulation retouching, had to algorithmically generate its dataset. The datasets used by analyzed works focused only on the detection of Duplication type are MICC-F220, MICC-F600, MICC-F, MICC-F2000, MICC-F8multi, IMD, CMH, CoMoFoD, and UCID.

The datasets encompassing Duplication and Splicing manipulations consist of CASIA v1.0, CASIA v2.0, Columbia, and the Image Manipulation Database. CASIA v1.0 and v2.0 feature manually tampered images with splicing and duplication manually introduced. In contrast, the Columbia and Image Manipulation Dataset are generated by algorithmically by editing patches of an image. The Columbia dataset exclusively contains splicing changes, while the Image Manipulation Database includes both duplication and splicing. Both datasets feature different alterations classified as retouching, aimed at camouflaging other modifications.

The absence of a standardized dataset across these investigations poses challenges in comparing the effectiveness of methods solely based on reported accuracy metrics. Additionally, the reliability of results obtained on datasets algorithmically generated by manipulating image patches may introduce unwanted biases and not accurately reflect the expected accuracy in real-world scenarios. Furthermore, the limited range of manipulations detected by these methods, along with other specific limitations such as the requirement for compression differences in images (e.g., [19,21,27,28]), or the necessity for information about the FPN of capture devices used (e.g., [25,29,30]), further restricts their applicability. Moreover, the unavailability of some datasets, as they were created by the authors and are not publicly accessible, adds another layer of complexity to the evaluation process.

To tackle these challenges, this study employs a meticulously curated dataset comprising four publicly available datasets featuring images manually tampered by humans. Importantly, this dataset does not impose restrictions on the types of manipulations, aiming to closely mimic real-life scenarios and facilitate generalization across a broader spectrum of manipulations. Furthermore, the multi-stream CNN approach enables the utilization of knowledge acquired from traditional passive methods. This approach allows for the extraction of data streams that may contain pertinent information for the model’s analysis.

3. Materials and Methods

In this section, we present the methodology adopted for image manipulation detection, which integrates Error-Level Analysis (ELA) and Discrete Wavelet Transform (DWT) alongside a novel multi-stream neural network architecture.

Our decision to adopt a multi-stream neural network architecture was driven by the recognition that traditional passive methods could offer valuable insights to guide the learning process of a Convolutional Neural Network (CNN) for image manipulation detection. Rather than aiming to identify the optimal method outright, we viewed this choice as an initial exploration of the potential capabilities such an architecture may offer. In line with this perspective, we selected two streams based on methodologies outlined in previous literature, with the intention of extracting different facets of image data. This approach was motivated by our desire to integrate diverse aspects of passive image analysis, serving as a foundational step in our investigation into the effectiveness of multi-stream architectures for detecting image manipulation.

The multi-stream architecture comprises three distinct CNNs, each operating on a unique data stream extracted from the original image. Two of these streams were selected based on methodologies outlined in prior literature, which will be explained in detail in the subsequent sections. Each stream is designed to analyze specific data subsets, while the third stream processes the unaltered image itself. By adopting this multi-stream framework, our aim is to leverage the strengths of traditional passive methods and integrate them into a unified detection system.

3.1. Error-Level Analysis

Error-Level Analysis is a passive detection method traditionally used by human forensics specialists to make differences in format artifacts of jpeg images more evident, it works by taking the difference between a jpeg image at different quality levels, making any difference in compression rate more evident on the resulting image [48], an example of this process can be seen in the Figure 2 and the code used to generate the images is presented in Listing 1.

Listing 1. Error-Level analysis implementation in pseudocode.

                      # Static method that performs Error Level Analysis (ELA) on an image using JPEG compression.

    # Returns a normalized difference image between the original image and a JPEG-compressed version of the image.
    FUNCTION method_1_ela(image, quality)
        # Create another image with jpeg compression of the given quality
        save_image_as_jpeg(image, temp_image, quality)
        compressed_image = open_image(temp_image)

        # Calculate the image difference between the original and the JPEG-compressed image
	difference_image = image - compressed_image

        # Normalize the difference image for contrast by assigning a value of 255 to the brightest points, while proportionally adjusting the values of all other points based on their distance from the brightest point.
	normalized_difference  = difference_image.normalizeContrast()
	RETURN normalized_difference
    END FUNCTION

In this example, the resulting image on the bottom is mostly black, but edited areas have more color in them, however, ELA results aren’t always so easily interpreted and traditionally need a forensics specialist to look at the results, however, this paper proposed using ELA as a feature extraction step and feeding it to a Convolutional Neural Network in order to perform the authenticity analysis of an image automatically.

3.2. Discrete Wavelet Transform

DWT stands for denoising [49], which is a mathematical technique used for analyzing signals that can be applied to images [50]. It is a way of decomposing an image into a set of frequency components, with each component representing a different level of detail or resolution [51].

In image manipulation detection, DWT is often used for feature selection as it allows for efficient compression of image data while preserving important image features. The DWT algorithm works by dividing an image into four smaller blocks or “sub-bands” of different frequencies: The LL (low-low) sub-band, which contains the low-frequency information, and the LH (low-high), HL (high-low), and HH (high-high) sub-bands, which contain the high-frequency information [52].

The equations to perform DWT can be found in Equations (1) and (2) and the equation to reverse the process known as inverse DWT is presented in Equation (3).

$W_{φ} (j_{0}, k) = \frac{1}{\sqrt[]{M}} \sum_{M} f (x) φ_{j_{0}, k} (x)$

(1)

$W_{ψ} (j, k) = \frac{1}{\sqrt{M}} \sum_{k} f (x) ψ_{j, k} (x)$

(2)

$\begin{matrix} f (x) = & \frac{1}{\sqrt{M}} \sum_{k} W_{φ} (j_{0}, k) φ_{j_{0}, k} (x) \\ + \frac{1}{\sqrt{M}} \sum_{i = i_{0}}^{\infty} \sum_{k} W_{ψ} (j, k) ψ_{j, k} (x) . \end{matrix}$

(3)

This work uses the technique proposed by [18], referred to in the rest of this work as DWT method for abbreviation purposes, which makes use of DWT, Bilateral Filters, and the Laplace operator to remove less meaningful features of an image, resulting in an image only with features containing sharp pixel variation, which is a common indicator of forgery.

This technique works by initially the image is converted to grayscale, then DWT is applied to decompose the image information into sub-bands, then the image is reconstructed after discarding the LL band and a bilateral filter, and the Laplace operator is applied, the result is an image where sharp pixel transitions are more easily visible, however when filters are used to mask the manipulation this method fails to highlight the manipulated areas. This work incorporates the image resulting from this method as one of the streams in a Convolutional Neural Network, utilizing it for feature selection to enhance the model’s ability to detect large pixel transitions.

An example of the result of this process can be seen in Figure 3 that was taken from the RTD and the code used to generate the images is presented in Listing 2.

Listing 2. DWT based method implementation in pseudocode.

FUNCTION method_2_dwt(image)
    # Convert image to grayscale and perform discrete wavelet transform
    gray_image = convertToGrayscale(image)
    coeffs = discreteWaveletTransform(gray_image)
    (LL, (LH, HL, HH)) = coeffs

    # Reconstruct the image using only the high-frequency components
    high_freq_components = (None, (LH, HL, HH))
    joinedLhHlHh = inverseDiscreteWaveletTransform(high_freq_components)

    # Apply bilateral filter to smooth the image while preserving edges
    blurred = bilateralFilter(joinedLhHlHh, 9, 75, 75)

    # Apply Laplacian edge detection to highlight edges
    kernel_size = 3
    imgLapacian = laplacianEdgeDetection(blurred, kernel_size)

    # Convert negative values to zero
    final_image = convertScaleToAbs(imgLapacian)

    RETURN final_image
END FUNCTION

3.3. Proposed Method

The proposed approach first consists of applying the ELA method and the method proposed by [18], both used as a feature selection step to generate two extra sets of images serving as the distinct streams for the subsequent model architecture.

Subsequently, the original dataset and the two new sets of images are shuffled and utilized for training and evaluation of three distinct CNNs. Following training, the three models are frozen to preserve the acquired knowledge and then merged. Additional learning layers are appended, and the combined model is fine-tuned through training. These steps are visually represented in Figure 4, providing an overview of the system architecture.

This approach aims to explore the accuracy of each stream individually in addition to their combination. For this, we individually trained three distinct models and a model composed of all three streams, plus additional learning layers. Stream A uses only the original images without alterations as input, Stream B uses only images with feature selection by ELA, and Stream C uses only images with feature selection by the DWT-based method. Finally, the models are combined and four layers are added to generate the Merged model, which uses images from all three streams as input.

Initially, we considered employing pre-trained models to integrate passive methods into our research. For an initial assessment of their performance compared to custom models, we selected the MobileNetV2 pre-trained model due to its lightweight nature. We conducted preliminary tests, comparing the performance of MobileNetV2 combined with four dense layers, each incorporating l2 regularization and followed by a batch normalization layer, against Model ‘A’ as detailed later. Results of this comparison are presented in Figure 5 and Figure 6.

The tests revealed a tendency for pre-trained models to overfit to the training images. Additionally, attempts to mitigate overfitting by unfreezing some of the pre-trained layers yielded similar results. Moreover, employing another pre-trained model, such as InceptionResNetV2, did not demonstrate superior performance compared to the custom model. Consequently, these findings lead us to conclude that pre-trained models may not be well-suited for image manipulation detection.

With that in mind, we decided to create three identical custom CNNs and implement regularization methods to minimize overfitting to produce results capable of generalizing to a more extensive set of images and facilitating the comparison of the three individual models.

The primary challenge in developing the neural networks for this study was addressing the issue of overfitting, where the models tend to perform well on training data but struggle to generalize to new, unseen data. To address this, we conducted a series of experiments aimed at evaluating the impact of various model architectures on accuracy. These experiments involved exploring different configurations such as altering the number and size of convolutional and dense layers, as well as implementing diverse regularization methods. While we did not gather specific data from these experiments, they were instrumental in guiding our selection process and ultimately led to the development of the final model architecture outlined in Figure 7.

Through this exploration, we discovered that simply increasing the number of layers did not yield significant accuracy improvements. On the contrary, it heightened the risk of overfitting. Consequently, we opted to craft a model comprising five convolutional layers, each followed by a batch normalization layer and a max pooling layer. The outputs from these layers were then flattened and channeled through four dense layers, each supported by a batch normalization layer. Additionally, to mitigate overfitting, L2 regularization was applied to both the convolutional and dense layers, as illustrated in Figure 7.

The chosen values for the convolutional layers were set as 16, 32, 64, 128, and 64, each utilizing a 3 by 3 kernel. Similarly, the dense layers were configured with sizes of 128, 128, 64, and 32.

The final proposed merged model is a culmination of individual models A, B, and C. Initially, the last non-output layers of models A and B were combined through a concatenation layer. Subsequently, this combined output was further merged with the last non-output layer of model C through another concatenation layer. To facilitate learning, two dense layers with 64 neurons each were added, accompanied by batch normalization layers. Notably, all layers of the individual models were frozen to preserve the acquired knowledge during training.

While Dropout layers were initially considered, our experimentation revealed that Batch Normalization exhibited superior performance within the same training time, aligning with findings by Singh et al. [53]. The optimizer employed was Adam, with accuracy serving as the primary metric during training. For the Loss function, Binary Cross Entropy was utilized. Rectified Linear Unit activation functions were applied across all layers, with the exception of the output layer, which employed the sigmoid function for classification purposes.

3.4. Dataset Assembly

The initial step in executing our experiments involved compiling the final dataset. To achieve this, we meticulously curated authentic and manipulated images from four distinct datasets. These datasets were chosen for their robust representation of real-life scenarios, as they involved human manipulation of images, with an emphasis on creating manipulations that were challenging to detect. The size of the dataset was limited due to the low availability of humanly manipulated images, which require significant time and effort to create. Additionally, biases may have been introduced due to the skill set of individuals selected to manipulate the images. Nevertheless, efforts were made to ensure that biases were minimized by selecting datasets that inherently did not impose restrictions on the types of manipulations that could be performed.

CASIA V2.0: proposed in [54], contains 7491 authentic images and 5123 manipulated images containing Splicing and or Duplication operations with retouching operations applied on top to mask alterations;
Realistic Tampering Dataset: Proposed by [55,56] containing 220 authentic and 220 Splicing and or Duplication manipulations made to the original images with the objective of being realistic. Retouching operations are sometimes applied to help hide Compositing and Duplication manipulations. In addition, this dataset provides masks of tampered areas and information about capture devices used;
IMD2020: Proposed by [57], it consists of four parts, first a dataset containing 80 authentic images manipulated to generate 1930 images tampered realistically and using all types of manipulation, with their respective manipulation masks. Then the second part consists of 35,000 authentic images captured by 2322 different camera models, the images were collected online and reviewed manually by the authors. The third has 35,000 algorithmically generated images with retouching manipulations. Finally, the last part has 2759 authentic images acquired by the authors with 19 different camera models designed for sensor noise analysis;
CASIA V1.0: Proposed in [54], Contains 800 authentic images, 459 Duplicate-type manipulation images, and 462 Splicing images. This dataset has no retouching operations applied.

By not imposing limitations on the number of manipulations used, certain types of manipulations may be overrepresented in the dataset. Furthermore, only one of the selected datasets provides information on the types of manipulation performed on each image, leaving uncertainty regarding the exact biases present. However, the hope is that the proportion of manipulations in the dataset mirrors that of real-world scenarios.

To balance the dataset and incorporate realistically manipulated images, we divided the images into two folders. The first folder consists of 7491 authentic images from the CASIA V2.0 dataset. The second folder contains 7491 manipulated images sourced from the realistic images part of the IMD2020 dataset, Realistic Tampering Dataset (RTD), CASIA V2.0, and an additional 218 images from CASIA V1.0, in order to make number of manipulated and pristine images equal. However, it remains uncertain whether there is a 50% incidence of manipulation in real-world applications, potentially introducing classifier bias. Table 2 illustrates the distribution of images from each dataset.

Therefore, as explained in Table 2 the final dataset used in this paper consists of 14,982 images in total, half original and half Manipulated, since some of the images were not supported by model all images were converted to .jpg format.

The second step of the final program is to apply the methods described to the dataset, as both methods generate an image as output. The result is two new sets of images with specific inconsistencies enhanced, using methods from the TensorFlow library whenever applicable.

The dataset was divided into three parts: 70% for training, 20% for validation, and 10% for testing. Each image was resized to 224 by 224 pixels to match the input size required by MobileNetV2, a pre-trained neural network that was initially considered but ultimately not used in favor of a custom model due to improved performance.

The experiment consisted of first creating two additional sets of images using the two selected passive methods. Subsequently, three convolutional neural networks were trained: Model A using the original dataset, and Models B and C utilizing the outputs of the selected passive methods. These models were then combined by concatenating them at the penultimate layer. Additionally, an extra dense layer followed by a batch regularization layer was introduced so the model could learn how to combine the results, this combination is the final merged model.

To enhance accuracy and mitigate overfitting, three techniques were employed. First, a model checkpoint callback was implemented, which saved the model weights corresponding to the minimum validation loss achieved during training. Second, an early stopping condition was defined, terminating training after 50 epochs without any improvement in validation loss.

The third technique used to reduce overfitting was the use of dataset augmentation techniques applied at the end of every epoch to the training images, which randomly flipped the image in the four-axis and changed the brightness, contrast, saturation, and hue of the images randomly.

The experiments were conducted on a system with the following specifications: Windows 11 operating system, Intel(R) Core(TM) i5-8300H CPU 2.30GHz, 16GB RAM, GeForce GTX 1050 graphics card, Python version 3.10, CUDA version 11.2, cuDNN version 8.1.1, and TensorFlow version 2.10.0. The training was performed with a batch size of 32 over a total of 500 epochs, as the dataset exhibited significant variation in manipulations, leading to fluctuation in the loss of validation images.

4. Results and Discussion

After training, the final Accuracy obtained by the merged model was 89.59% in the set of test images, higher than the model trained just with original images, which obtained 78.64%.

The Figure 8, Figure 9, Figure 10 and Figure 11 illustrate the area under the curve (AUC) of the Receiver Operating Characteristic (ROC) for the models, assessed across both validation and test datasets, highlighting their performance capabilities. Complementarily, Figure 12, Figure 13, Figure 14 and Figure 15 present detailed graphics depicting the accuracy and loss metrics during the training phase for models A, B, C and the specifically proposed merged model, showcasing each model’s progression and comparative effectiveness throughout the training process.

The culmination of these analyses is encapsulated in Table 3, which details the final results achieved after activating the predetermined stop conditions, offering a comprehensive overview of model efficacious and operational efficiencies.

5. Conclusions and Future Work

This study provides a comprehensive review of passive forensic methods designed to detect manipulated images, emphasizing their underlying principles and efficacy. Building upon this foundation, we employ a novel approach that integrates two distinct detection methods within a multi-stream convolutional neural network (CNN) framework. This innovative methodology allows us to harness the strengths of each individual method while mitigating their respective limitations, thereby enhancing overall detection accuracy. Moreover, our analysis is conducted on a meticulously curated dataset comprising a diverse range of image manipulations, ensuring the robustness and generalizability of our findings. Through this combined effort, we aim to advance the field of image manipulation detection by offering a unified and effective solution capable of addressing real-world challenges.

After training, our novel merged method demonstrated a remarkable accuracy of 89.59% on validation images, showcasing the advantages of our multi-stream CNN approach compared to individual streams. In contrast to models utilizing singular detection methodologies, such as Error Level Analysis (ELA) and the method proposed by [18], which were employed as separate streams in our multi-stream model, our merged model showcased superior performance. This comparison highlights the advantage of analyzing multiple streams concurrently, showcasing the synergistic effects of integrating diverse detection methods within a unified framework. Additionally, the individual streams achieved the following accuracies: the model trained solely on original images obtained an accuracy of 78.64%, while the ELA-based model achieved 68.02%, and the model utilizing the method proposed by [18] achieved 50.70%.

The amalgamation of differently trained models on the same image dataset highlights the potential of leveraging various passive detection techniques to improve overall performance. This concept is validated by the superior accuracy achieved by our final model compared to its constituent models, indicating the efficacy of using insight of passive methods to select streams for CNN analysis in a multi-stream model.

The study’s limitations primarily revolve around the size and scope of the dataset, coupled with the lack of comprehensive knowledge regarding the manipulations performed on the images. The absence of detailed documentation regarding the specific manipulations applied to each image impedes a thorough analysis of detection performance across different manipulation types. Furthermore, the study lacks a thorough statistical analysis and discussion of false positives/negatives. While the study demonstrates the potential for multi-stream CNN architectures, the absence of localization of manipulated areas represents a further avenue for exploration. Additionally, the study did not thoroughly explore the selection of optimal streams based on passive methods.

However, the findings hold broader implications, highlighting the potential of multi-stream CNN architectures in image manipulation detection. By integrating diverse passive detection methods, this approach not only enhances detection accuracy but also underscores the importance of combining complementary techniques for robust detection. Moreover, the study emphasizes the necessity for standardized datasets and rigorous evaluation metrics in the field of image forensics. These insights can inform future research aimed at developing more effective and reliable methods for detecting manipulated images, thereby enhancing the integrity and trustworthiness of digital media.

This research lays the groundwork for several avenues of expansion and real-world application. Firstly, further exploration could focus on enhancing the dataset used, incorporating more diverse and meticulously documented manipulations to improve the model’s robustness and generalization capabilities. Additionally, future studies could delve into developing methods for localizing manipulated areas within images, which would significantly enhance the practical utility of image manipulation detection systems.

Moreover, the multi-stream CNN architecture demonstrated in this research presents a promising framework for integrating various detection techniques. Expanding upon this, researchers could explore the selection of optimal streams based on passive methods, thereby refining the model’s ability to detect a wide range of manipulation types with greater accuracy.

In real-world scenarios, the findings of this research could be applied in various fields, including digital forensics, media authentication, and content moderation on social media platforms. By deploying robust image manipulation detection systems developed from this research, organizations and individuals can better safeguard against the dissemination of misleading or falsified visual content, thereby upholding the integrity and trustworthiness of digital media.

[ad_2]