CamDiff: Camouflage Image Augmentation via Diffusion Model

The burgeoning field of camouflaged object detection (COD) seeks to identify objects that blend into their surroundings. Despite the impressive performance of recent models, we have identified a limitation in their robustness: existing methods may misclassify salient objects as camouflaged ones, despite these two characteristics being contradictory. This limitation may stem from the lack of multi-pattern training images, which limits robustness to saliency. To address this issue, we introduce CamDiff, a novel approach inspired by AI-Generated Content (AIGC) that overcomes the scarcity of multi-pattern training images. Specifically, we leverage the latent diffusion model to synthesize salient objects in camouflaged scenes, while using the zero-shot image classification ability of the Contrastive Language-Image Pre-training (CLIP) model to prevent synthesis failures and ensure that the synthesized object aligns with the input prompt. Consequently, the synthesized image retains its original camouflage label while incorporating salient objects, yielding camouflage samples with richer characteristics. The results of user studies show that the salient objects in the scenes synthesized by our framework attract more of the user's attention; thus, such samples pose a greater challenge to existing COD models. Our approach enables flexible editing and efficient large-scale dataset generation at a low cost. It significantly enhances COD baselines' training and testing phases, emphasizing robustness across diverse domains. Our newly generated datasets and source code are available at https://github.com/drlxj/CamDiff.


Introduction
Camouflage is a predatory strategy that has evolved in natural organisms through biological adaptation [4]. Visually, organisms alter the appearance of their bodies to match their surroundings, making them difficult to detect at first glance. Motivated by this phenomenon, a recent field of research called camouflaged object detection (COD) [8,12,14] has gained significant attention from the computer vision community [16,48,55]. This area of study has broad applications, including medical image diagnosis and segmentation [35,46,7], species discovery [33], and crack inspection [10].
In the literature, several works [9,7,32] directly extend well-developed salient object detection (SOD) models to COD tasks. However, it is noteworthy that salient and camouflaged objects are two contrasting object categories: the greater the level of saliency, the lower the degree of camouflage, and vice versa [22]. Hence, different strategies are imperative for detecting these two distinct object types. SOD models rely on both global and local contrasts, whereas COD models aim to avoid regions of high saliency. Unfortunately, our experiments (see Sec. 4) reveal a decline in the accuracy of current COD methods when salient and camouflaged objects co-exist in an image. As Fig. 1 illustrates, we tested the robustness of several state-of-the-art (SOTA) COD methods on salient objects. Many of these methods misclassified the salient objects as camouflaged ones. These results indicate that current COD models are not robust enough to scenes containing salient objects. Specifically, PFNet [28] and ZoomNet [32] detect only the more salient object (the yellow ball) and neglect the less salient object (the green ball). Thus, we speculate that existing COD works may only learn to distinguish the foreground from the background rather than the camouflage and saliency patterns. This underscores the necessity of further research in COD to gain insight into the camouflage pattern and make COD methods truly effective.
To distinguish the saliency and camouflage patterns, one straightforward idea is to train the network via contrastive learning, which has demonstrated its effectiveness in other vision tasks [5,54,19]. As suggested in [3,44,20], strong data augmentation can significantly benefit contrastive learning, leading to effective feature representation modeling. However, generating positive and negative pairs as contrastive samples is not feasible in our setup due to the lack of salient objects in conventional camouflage datasets.

Figure 1. Visualization results of current COD models tested on an image with salient objects. As the object is salient, the ground truth (GT) should be all-black for the COD task. Nonetheless, the existing COD methods are less robust to scenes with salient objects, especially PFNet and ZoomNet.

Furthermore, existing COD datasets mainly contain a single
object, making the direct extension of contrastive learning infeasible. Besides, collecting and annotating a new dataset containing camouflaged and salient objects within a single image would be time-consuming and labor-intensive.
In this study, we aim to enhance the robustness of future COD models with respect to salient objects. To achieve this objective, we propose augmenting contrastive samples in the training data by leveraging the recent diffusion model [38,13] as a form of data augmentation to generate synthetic images. This approach is inspired by the success of AI-Generated Content (AIGC) [2,37] and large-scale generative models. While some recent attempts have been made to utilize diffused images for data augmentation, these efforts are only feasible for more common scenarios, such as daily indoor scenes [15] or urban landscapes [24], where the domain gap is small. By contrast, we are specifically interested in camouflage scenes, which are rare and challenging for pre-trained diffusion models. These differences make our task of synthesizing multi-pattern images with large domain gaps very challenging, and to the best of our knowledge, it has not been addressed in camouflage settings. In addition, existing works [17] rely on additional frozen-weight deep networks to generate pseudo labels as supervision, limiting their performance and application. These limitations motivate us to design a novel framework that generates realistic salient objects within camouflaged scenes. Our approach differs from concurrent diffusion-augmentation methods [1,52] regarding (a) the non-negligible domain gap and (b) the preserved camouflage label.
To address the target problem, in this work we propose CamDiff, a diffusion-based adversarial generation framework. Specifically, our method consists of a generator and a discriminator. The generator is a frozen-weight Latent Diffusion Model (LDM) [38] that has been trained on a large number of categories, making it possible to synthesize the most salient objects at scale. For the discriminator, we adopt the Contrastive Language-Image Pre-training (CLIP) model [36] for its generality. Our discriminator compares the input text prompt and the synthesized object to ensure semantic consistency. To preserve the original camouflage label, we only add the generated salient object to the background, i.e., outside of the ground-truth (GT) label. Therefore, our CamDiff effectively transforms the problem into an inpainting task, without requiring any additional labeling cost. In such a way, we can effectively and easily enable customized editing, hence improving the development of COD from the data-driven aspect.
Overall, our research provides a fresh perspective on the notion of camouflage, and our newly introduced camouflage synthesis tool will serve as a foundation for advancing this rapidly growing field.

Diffusion Models
Diffusion models [13,38] are generative models that generate samples from a distribution by learning to gradually remove noise from data points. Recent research [6] shows that diffusion models outperform Generative Adversarial Networks (GANs) [11] in high-resolution image generation tasks without the drawbacks of mode collapse [31] and unstable training [30], and achieve unprecedented results in conditional image generation [37]. Therefore, they have been applied in many domains, such as text-to-image and guided synthesis [29,34], 3D shape generation [56,26], molecule prediction [45], video generation [51], and image inpainting [38].
Some researchers have studied diffusion models for image inpainting. For example, Meng et al. [29] found that diffusion models can not only fill regions of an image but can also do so conditioned on a rough sketch of the image. Another study [39] concludes that diffusion models can smoothly fill regions of an image with realistic content, without edge artifacts, when trained directly on the inpainting task.

Camouflage Object Detection
Camouflaged object detection (COD) aims to detect a concealed object within an image. Several works (e.g., SINet [9], UGTR [50], ZoomNet [32]) have compared COD with SOD and concluded that simply extending SOD models to the COD task cannot bring the desired results, because the target objects have different attributes, i.e., concealed or prominent. To detect concealed objects, many methods have been proposed recently. For example, some methods utilize a multi-stage strategy to address the concealment of camouflaged images. SINet [9] is the first multi-stage method to locate and distinguish camouflaged objects. Another multi-stage method is SegMaR [18], which localizes objects and zooms in on possible object regions to progressively detect camouflaged objects. In addition, multi-scale feature aggregation is the second main strategy and has been used in many methods, such as CubeNet [57], which integrates low-level and high-level features by introducing X connection and attention fusion, as well as ZoomNet [32], which processes the input images at three different scales to fully explore imperceptible clues between the candidate objects and their background surroundings. A detailed review of COD models is out of the scope of this work; we refer readers to recent top-tier works [16,14].

Camouflage Image Generation
Although generating camouflage images has received limited attention, a few notable works exist in this area. One of the earliest methods, proposed in 2010, relies on hand-crafted features [4]. Zhang et al. [53] recently proposed a deep learning-based approach for generating camouflaged images. Their method employs iterative optimization and an attention-aware camouflage loss to selectively mask out salient features of foreground objects, while a saliency map ensures these features remain recognizable. However, the slow iterative optimization process limits the practical application of their method. Moreover, the style transfer of the background image to the hidden objects can often result in noticeable appearance discontinuities, leading to visually unnatural synthesized images. To overcome these limitations, Li et al. [23] proposed a Location-Free Camouflage Generation Network. Although this method outperforms the previous approach [53] in terms of visual quality, it may fail to preserve the desired foreground features or may leave objects identifiable via the saliency map in certain cases. In summary, existing methods all follow the same strategy to produce camouflage images: they take two images representing the foreground and the background, respectively, and attempt to directly integrate the foreground object into the background by finding a place where the object is hard to detect within the synthesized image.

Overall Architecture
To evaluate the effectiveness of existing camouflaged object detection (COD) methods on negative samples (i.e., scenes with salient objects), we suggest creating synthetic salient objects on top of current camouflage datasets. Normally, when a task-specific model is trained with COD datasets, it should effectively detect the camouflaged samples while being robust and not detecting the synthesized salient ones. Therefore, such an approach allows us to thoroughly investigate whether a learning-based COD method can accurately distinguish between camouflaged and salient objects. To achieve this objective, we propose a new generation network called CamDiff, which is built upon existing COD datasets. Since these datasets already contain camouflaged objects with corresponding camouflage ground-truth masks, our aim is to add synthesized salient objects to the background. By doing so, we can maintain and leverage the original camouflage labels while also introducing salient samples with contrasting characteristics.
Fig. 2 illustrates the overall architecture of our proposed method. We start with a COD dataset, which provides us with a source image and its corresponding ground truth (GT). Using the GT, we identify the bounding box with the minimum coverage area to prevent CamDiff from altering the camouflaged object. Next, we divide the source image into nine areas via grid lines, using the bounding box to preserve the area where the camouflaged object is placed. Only eight of the areas are available as input to CamDiff. We randomly select one of these regions and cut it out from the source image, covering a specific proportion (e.g., 75% as the default setting in our experiments) of the total area from the center. We then feed the masked image into the generation network, and CamDiff generates a salient object within the masked area. Finally, we place the selected region back into its original location within the source image. In such a manner, we can not only preserve the GT labels for camouflaged objects but also add contradictory synthesized salient samples.
To generate the salient object, we propose a generation framework based on the Generative Adversarial Network (GAN) architecture. Specifically, we utilize the widely-acknowledged Latent Diffusion Model (LDM) as the generator and the Contrastive Language-Image Pre-training (CLIP) model as the discriminator. As shown in Fig. 2, the input to our framework is an image with the previously-masked region, along with a text prompt that describes the target object. This masked image and text prompt are then fed into the generator. Based on the prompt, the LDM block generates the target object on top of the masked region. The filled-up region is then sent to the discriminator to determine if it matches the input prompt. If not, the generator adjusts the seed to generate a new salient object. The objective is for the generation network to output only validated images, for which the discriminator predicts a high probability of matching the input prompt.
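The generate-and-verify loop described above can be sketched as follows. Note that `inpaint` and `clip_match_prob` are hypothetical stand-ins for the frozen LDM inpainting call and the CLIP consistency check, and the threshold and retry count are illustrative, not the authors' settings.

```python
# A minimal sketch of CamDiff's generator-discriminator loop:
# keep resampling with a new seed until the CLIP-based discriminator
# judges the inpainted object consistent with the text prompt.
def synthesize_salient_object(masked_image, mask, prompt,
                              inpaint, clip_match_prob,
                              threshold=0.5, max_tries=10):
    """Return a validated inpainted image, or None on failure.

    inpaint(masked_image, mask, prompt, seed) -> candidate image
    clip_match_prob(candidate, prompt)        -> probability in [0, 1]
    """
    for seed in range(max_tries):
        candidate = inpaint(masked_image, mask, prompt, seed=seed)
        if clip_match_prob(candidate, prompt) >= threshold:
            # Accepted: this region is pasted back into the source image.
            return candidate
    return None  # synthesis failed for this image
```

In practice, the retry budget bounds the cost of images whose prompt the diffusion model cannot render well, which is why a small fraction of dataset images remain unmodified (cf. the success rates reported in the experiments).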
Our framework transforms the image generation task into an inpainting task, and thus requires a mask to cover the selected region. The mask generation process is explained in Algorithm 1. The mask is designed to cover a certain percentage of the selected region to avoid artifacts when blending the synthesized object with the source image. The ratio of the masked area to the region area is set to a constant, RATIO_MASK. The size of the selected region is crucial for the inpainting task, as it affects the quality of the generated salient object. If the region is too small, the LDM may fill in background instead of an object, while if it is too large, the salient object may be much larger than the concealed object, misleading COD methods. Therefore, we set an upper bound (RATIO_MAX) and a lower bound (RATIO_MIN) for the ratio between the region area and the total area of the source image. The values for these parameters are listed in Tab. 1.
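The region-selection and masking step can be illustrated with a small sketch. The function name, the rectangle-based region representation, and the default ratio values below are assumptions for illustration; RATIO_MASK, RATIO_MIN, and RATIO_MAX play the roles described above.

```python
import random

def select_region(regions, image_area, ratio_min=0.05, ratio_max=0.3,
                  ratio_mask=0.75):
    """Pick one background region whose area ratio lies within the bounds,
    then return a centered mask covering `ratio_mask` of that region.

    regions: list of (x, y, w, h) rectangles (the eight non-object areas).
    Returns (region_index, (mx, my, mw, mh)) or None if no region fits.
    """
    candidates = list(range(len(regions)))
    random.shuffle(candidates)  # randomize which region gets the object
    for i in candidates:
        x, y, w, h = regions[i]
        ratio = (w * h) / image_area
        if ratio_min < ratio < ratio_max:
            # Center a mask covering ratio_mask of the region's area.
            scale = ratio_mask ** 0.5
            mw, mh = int(w * scale), int(h * scale)
            mx = x + (w - mw) // 2
            my = y + (h - mh) // 2
            return i, (mx, my, mw, mh)
    return None  # no region satisfies the bounds
```

Keeping the mask strictly inside the selected region leaves an unmasked border, which is what lets the synthesized patch blend back into the source image without visible seams.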

Latent Diffusion Model (LDM)
We use the LDM [38], pre-trained on a large-scale dataset, as our generator's base model. The LDM is a two-stage method that consists of an autoencoding model, which learns the latent representation of an image, and a Denoising Diffusion Probabilistic Model (DDPM) [13]. In the first stage, the autoencoding model is trained to learn a space that is perceptually equivalent to the image space. The encoder E encodes the given image x ∈ R^(H×W×3) to the latent representation z ∈ R^(h×w×c), so that z = E(x), while the decoder D reconstructs the estimated image x̂ from the latent representation, such that x̂ = D(z) and x̂ ≈ x. In the second stage, the DDPM is trained to generate latent representations within the pre-trained latent space, starting from a random Gaussian noise input z_t. The neural backbone ε_θ(z_t, t) of the LDM is realized as a time-conditional UNet, and the objective of the DDPM trained on the latent space is simplified as:

L_LDM = E_{E(x), ε∼N(0,1), t} [ ‖ε − ε_θ(z_t, t)‖₂² ].   (1)
Conditioning LDM
To control the image synthesis, the conditional LDM implements a conditional denoising autoencoder ε_θ(z_t, y, t) with inputs y such as text, semantic maps, or inputs for other image-to-image translation tasks [38]. The proposed CamDiff exploits this ability to control image synthesis through text input. To turn DDPMs into more flexible conditional image generators, their underlying UNet backbone is augmented with the cross-attention mechanism. The embedding sequence τ_θ(y) ∈ R^(M×d_τ) from the CLIP ViT-L/14 text encoder is fused with the latent feature maps via a cross-attention layer implementing

Attention(Q, K, V) = softmax(QKᵀ/√d) · V,

where Q = W_Q^(i) φ_i(z_t), K = W_K^(i) τ_θ(y), and V = W_V^(i) τ_θ(y). Here, φ_i(z_t) denotes an intermediate (flattened) representation of the UNet, and W_Q^(i), W_K^(i), and W_V^(i) are learnable projection matrices. The objective of the conditional LDM is converted from Eqn. 1 to:

L_LDM = E_{E(x), y, ε∼N(0,1), t} [ ‖ε − ε_θ(z_t, t, τ_θ(y))‖₂² ].   (2)
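A minimal, single-head version of this cross-attention fusion can be written in NumPy as follows. Real implementations are multi-head, batched, and run inside the UNet; the shapes here are purely illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(z, tau_y, W_Q, W_K, W_V):
    """Fuse latent features with text embeddings via cross-attention.

    z:     (N, d)      flattened latent (UNet) features -> queries
    tau_y: (M, d_tau)  text-token embeddings            -> keys/values
    W_Q, W_K, W_V: learnable projection matrices to a shared d_k.
    """
    Q = z @ W_Q          # (N, d_k)
    K = tau_y @ W_K      # (M, d_k)
    V = tau_y @ W_V      # (M, d_k)
    d_k = Q.shape[-1]
    # Each latent position attends over the text tokens.
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V   # (N, d_k)
```

The key point is that the queries come from the image latents while keys and values come from the prompt, so every spatial location can pull in prompt-relevant semantics during denoising.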

CLIP for Zero-Shot Image Classification
To improve the quality of generated objects based on text input, it is necessary to use a discriminator that can assess the consistency of the generated objects with the text prompt. However, since the text prompt can be any arbitrary class, traditional classifiers that only recognize a fixed set of object categories are unsuitable for this task. Therefore, CLIP models offer a better option.
The CLIP model comprises an image encoder and a text encoder. The image encoder can employ various computer vision architectures, including five ResNets of varying sizes and three vision transformer architectures. Meanwhile, the text encoder is a decoder-only transformer that uses masked self-attention to ensure that the representation of each token in a sequence depends solely on the tokens that appear before it, preventing any token from attending to future tokens. Both encoders are pre-trained to align matching texts and images in vector space. This is achieved by taking image-text pairs and pushing their output vectors closer together in vector space while separating the vectors of non-pairs. The CLIP model is trained on a massive dataset of 400 million text-image pairs publicly available on the internet.
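The zero-shot classification step reduces to a cosine-similarity softmax over prompt embeddings. In the sketch below, the embedding vectors stand in for the outputs of CLIP's image and text encoders (e.g., for prompts like "a photo of a {class}"), and the temperature is only a loose analogue of CLIP's learned logit scale.

```python
import numpy as np

def zero_shot_probs(image_emb, text_embs, temperature=100.0):
    """Zero-shot class probabilities from CLIP-style embeddings.

    image_emb: (d,)   embedding of the synthesized region.
    text_embs: (K, d) embeddings of K candidate class prompts.
    Returns a length-K probability vector over the classes.
    """
    # L2-normalize so dot products become cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (text_embs @ image_emb)
    e = np.exp(logits - logits.max())  # stable softmax
    return e / e.sum()
```

In CamDiff's setting, the discriminator would accept a synthesized object when the probability assigned to the input prompt's class is sufficiently high.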

Experimental Setup
Datasets. To synthesize multi-pattern images for the COD task, we selected four widely-used COD datasets: CAMO [21], CHAM [42], COD10K [9], and NC4K [27]. It should be noted that the COD10K dataset provides semantic labels as filenames; therefore, we used the label directly as the text prompt. Some prompts are shown in Fig. 2, which lists the classes. However, the list of classes is not directly available for the other three datasets. Since they contain common animal species such as birds, cats, and dogs, we randomly chose a text prompt from the COD10K label list.

Baselines. To evaluate the robustness of existing COD methods to both salient and camouflaged objects, we selected four representative and classical COD methods as our baselines: SINet [9], PFNet [28], C2FNet [43], and ZoomNet [32]. It is worth noting that since our paper submission, several new SOTA models have emerged, including FSPNet [16] and the EVP model [25]. However, this paper aims to explore new mechanisms for detecting camouflage patterns, and thus comprehensive testing of all models falls beyond the scope of this article.
Evaluation Metrics. To assess the quality of the synthesized images, we employed the Inception Score [40]. For COD models, we follow previous works [8,48] and evaluate the performance using conventional metrics: Mean Absolute Error (M), max F-measure (F_m), S-measure (S_m), and max E-measure (E_m).
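As a concrete example, the M metric is simply the mean absolute error between the predicted map and the ground-truth mask. This is a minimal sketch; the F-, S-, and E-measures follow their respective papers.

```python
import numpy as np

def mae(pred, gt):
    """Mean Absolute Error between a predicted map and a GT mask.

    Both arrays are expected to hold values in [0, 1]; binary GT masks
    work as-is. Lower is better.
    """
    return float(np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean())
```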
Implementation Details. Our implementation of CamDiff is realized in the PyTorch framework, with hyperparameters related to mask generation specified in Tab. 1. The whole learning process is executed on a 2080Ti GPU. We followed the conventional train-test split [9,8,57,32], using a training set of 4,040 images from COD10K and CAMO.
Among these training samples, we replaced 3,717 images with our synthesized multi-pattern images. The original testing samples comprised 6,473 images from CAMO, CHAM, COD10K, and NC4K. To form our Diff-COD testing set, we replaced 5,395 images with our generated images. Although we cannot entirely replace the camouflage dataset, since some images contain specific objects that the diffusion model may not generate well using the pre-trained weights, our success rate remains high. Specifically, over 92% of the training images and 83% of the testing images can be modified with extra salient patterns. This high success rate confirms the effectiveness of our generation framework. Note that we resized the images and masks to 512 × 512 to meet the requirements of the LDM.

Quality of Synthesized Images
Inception Score. To prove that our CamDiff generates a prominent object rather than a concealed one, we choose the Inception Score as the evaluation metric and evaluate it on the SOD datasets [47,41,49], the COD datasets [21,42,9,27], and our generated dataset with multi-pattern images. Tab. 2 shows that the original SOD datasets have a higher Inception Score than the original COD datasets, which aligns with our expectations. The rationale behind the Inception Score is that a well-synthesized image should contain easily recognizable objects for an off-the-shelf recognition system. The recognition system is more likely to detect prominent objects rather than camouflaged ones. As a result, images with multiple patterns tend to have a higher Inception Score than those with only camouflaged patterns. By comparing the Inception Score before and after the modification, we can easily evaluate the effectiveness of our framework. Upon replacing images in the COD datasets with multi-pattern images, it is evident that the Inception Score increases across all COD datasets. This indicates that we have successfully incorporated prominent patterns on top of the original COD datasets.

User Study. We conducted a user study to further evaluate the synthesized images' quality. Participants were given a subset of our synthesized images along with their corresponding labels (e.g., "Butterfly" in Fig. 3) and were asked to circle the object they detected first based on the label. The object chosen by the user was considered the most prominent, since it attracted the most human attention. The results of our user study, with over 10 participants, showed that the average rate of users choosing the synthesized object, i.e., the salient one, was 98%. This indicates that the synthesized objects are more prominent and easier to detect than the original objects in the images.
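For reference, the Inception Score can be computed from a recognition network's predicted class probabilities as follows. This is a sketch assuming the probabilities p(y|x) are already available from a pre-trained Inception network; split handling and averaging details vary across implementations.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from per-image class probabilities.

    probs: (n_images, n_classes), each row p(y|x) from a classifier.
    IS = exp( mean_x KL( p(y|x) || p(y) ) ); higher means more
    confidently recognizable (and more diverse) objects.
    """
    p_y = probs.mean(axis=0)  # marginal class distribution p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

Uniform, uninformative predictions give a score near 1, while confident and varied predictions push the score up, which is why adding prominent objects to camouflage images raises it.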
Overall, the increased Inception Score and the positive results from the user study support our claim that CamDiff generates prominent objects rather than concealed ones in the synthesized images. In addition, CamDiff has demonstrated a robust capability to generate diverse objects and variations in posture for a single object type. Fig. 4 provides examples of various classes of synthesized images, each of which can be extended to generate three additional images of the same class.

Quantitative Comparison
In this section, we introduce quantitative experiments by evaluating SOTA COD methods on the synthesized samples generated by our CamDiff. Tab. 3 shows the performance of pre-trained models on original and generated testing samples; Tab. 4 compares the performance of models trained with original COD images and with our generated training samples; Tab. 5 presents the robustness analysis on SOD datasets.

Pretrained Weights Setting. We created a new Diff-COD dataset to evaluate existing COD methods' effectiveness on images containing salient and camouflaged objects. This dataset includes both types of images, and we trained four SOTA COD methods (SINet [9], PFNet [28], C2FNet [43], and ZoomNet [32]) on the Diff-COD training set. We then evaluated their performance on the Diff-COD testing set.

Table 3. Quantitative results of the pre-trained COD models on the Diff-COD test dataset and the COD dataset. ↑ (↓) denotes that higher (lower) is better.
It is important to note that the pre-trained LDM block can only output images with a resolution of 512 × 512. This resolution is suitable for most existing methods, which are trained with a resolution of no more than 352 × 352. However, the current SOTA method, ZoomNet [32], requires a main resolution of 384 × 384 and an additional higher resolution with a scale of 1.5 (576 × 576), which exceeds the capacity of the LDM model. To ensure a fair comparison, we retrained ZoomNet with a main scale of 288 × 288.

Tab. 4 displays the results of the pre-trained COD models trained on the original COD training sets and of the newly-trained COD models on our Diff-COD training sets. It is evident that the models trained on the Diff-COD training set perform significantly better on the Diff-COD testing set than their counterparts. To further confirm the effectiveness of our approach in enhancing the robustness of COD models against saliency, we conducted experiments on conventional saliency datasets, including DUTS-TE [47], ECSSD [41], and XPIE [49]. As displayed in Tab. 5, when the models were trained using our Diff-COD dataset, their performance on saliency benchmarks declined. This is expected, since the poorer performance on the SOD datasets indicates that the newly-trained models have truly learned the camouflage pattern rather than the salient pattern. As a result, these models are better equipped to withstand the influence of salient objects.

Qualitative Comparison
Fig. 5 demonstrates the effect of training on multi-pattern images on the performance of COD models. The figure is divided into three cases, each presenting the results for a different camouflaged object (fish, crab, and frog). On the left side of the dashed line in each case, the original image from the COD dataset, a synthesized multi-pattern image, and the ground truth are shown. The right side displays, in the first row, the results of four models (SINet, PFNet, C2FNet, and ZoomNet) pre-trained on the original COD datasets. The second row presents the results of the models tested on the synthesized images using the same checkpoints as in the first row. Most of them detect the salient objects, which is undesirable, and their accuracy in detecting camouflaged objects decreases. For instance, SINet loses some parts compared with the mask in the first row, and ZoomNet ignores the camouflaged objects. These results indicate that COD methods lack robustness to saliency. The third row presents the results of the models trained on our Diff-COD dataset and then tested on the synthesized images. Compared to the second row, the robustness to saliency improves significantly. Nevertheless, compared to the first row, ZoomNet loses some parts of the camouflaged object. We believe this may be caused by the added noise in the training set making the fitting more difficult, but we plan to investigate the cause in future work.
Overall, it can be concluded from Fig. 5 that the presence of salient objects harms the performance of COD models in detecting camouflaged objects.However, training the COD models on multi-pattern images increases their robustness to the effects of salient objects.

Conclusion
In summary, our work introduces CamDiff, a framework that generates salient objects on camouflage scenes while preserving the original labels, enabling the easier collection and combination of contrastive patterns in realistic images without incurring extra costs for learning and labeling. Through experiments conducted on the Diff-COD test sets, we demonstrate that current COD methods lack robustness to negative examples (e.g., scenes with salient objects). To address this limitation, we create a novel Diff-COD training set using CamDiff. Our experimental results demonstrate that training existing COD models on this set improves their resilience to saliency. Overall, our work provides a new perspective on camouflage and contributes to the development of this emerging field.

Future Work. We aim to extend our framework to consider original images with multiple objects and leave room for their generation. Additionally, while we only implemented multi-pattern images as the data augmentation method in our experiments, we plan to evaluate the results using other data augmentation methods to provide a more comprehensive analysis of the impact of multi-pattern images on the performance and robustness of these models.

Figure 2 .
Figure 2. Our CamDiff consists of a generator and a discriminator. The input of CamDiff is a pair of a masked image and a text prompt. Only after the discriminator judges that the synthesized object is consistent with the text input can the synthesized image be output and placed back into the source image. The white star in the source image means that region (8) is selected as the masked region.

Figure 3 .
Figure 3. In the user study, the solution involved presenting the synthesized object within a green box, while the original object within the image was enclosed in a red box. The study results indicate that users were more likely to circle the objects in the green box, highlighting the synthesized objects as more prominent and easier to detect than the original objects within the images.

Figure 4 .
Figure 4. Examples of synthesized images from CamDiff from various classes. Each image is extended to generate three additional images of the same class, featuring objects with varying appearances.

Figure 5 .
Figure 5. Qualitative comparison. We conducted a qualitative comparison on three cases: Fish, Crab, and Frog. We analyzed the impact of adding salient objects to camouflaged images on the pre-trained SINet, PFNet, C2FNet, and ZoomNet, respectively, by comparing the results of the first two rows. Furthermore, we evaluated the training results on the Diff-COD test set by comparing the qualitative outcomes with the pre-trained results.

Algorithm 1 Mask generation
1: Put the eight regions' indices in a list candidates in order
2: Shuffle the indices in candidates
3: for i in candidates do
4:   if the area of region i is higher than RATIO_MIN then
5:     if the area of region i is less than RATIO_MAX then
       …
     end if
15: end for
16: return mask

Table 2.
Comparison of the generated datasets with the original COD and SOD datasets. The type "orig." means the original dataset, while the type "new" means the synthesized dataset based on the corresponding COD dataset.

Table 4.
Quantitative results on the Diff-COD test dataset. "Pre." means the model is loaded with the pre-trained checkpoint provided by the officially released code. "Tr." means that the model is loaded with the checkpoints trained on our synthesized training set.

Table 5.
Quantitative results on the original SOD testing sets. "Pre." means the model is loaded with the pre-trained checkpoint provided by the paper, while "Tr." means that the model is loaded with the checkpoints trained on our synthesized training set.

We retrained ZoomNet with a main scale of 288, since 288 × 1.5 = 432 is less than 512 and still a relatively high resolution. To ensure equal evaluation, we trained ZoomNet on the original and our new training sets with the same main resolution of 288 × 288. Tab. 3 compares each model's performance with its pre-trained checkpoints on both the Diff-COD and original COD datasets. The results indicate that all COD methods perform significantly worse on the Diff-COD dataset. This is because these methods detect the additionally generated salient objects and classify them as camouflaged ones, indicating a lack of robustness to saliency. As a result, we can conclude that our Diff-COD testing set serves as a more challenging benchmark and can be used as an additional tool for robustness analysis.

Trained on our Generated Datasets. As previously mentioned, our framework has the capability to generate new training samples with both salient and camouflaged objects. By training on our Diff-COD dataset using only camouflage supervision, the networks should learn the distinction between the two contrasting notions and become more resilient to saliency.