Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers

Most polyp segmentation methods use CNNs as their backbone, leading to two key issues when exchanging information between the encoder and decoder: 1) taking into account the differences in contribution between different-level features, and 2) designing an effective mechanism for fusing these features. Unlike existing CNN-based methods, we adopt a transformer encoder, which learns more powerful and robust representations. In addition, considering the influence of image acquisition and the elusive properties of polyps, we introduce three standard modules, including a cascaded fusion module (CFM), a camouflage identification module (CIM), and a similarity aggregation module (SAM). Among these, the CFM is used to collect the semantic and location information of polyps from high-level features; the CIM is applied to capture polyp information disguised in low-level features; and the SAM extends the pixel features of the polyp area with high-level semantic position information to the entire polyp area, thereby effectively fusing cross-level features. The proposed model, named Polyp-PVT, effectively suppresses noise in the features and significantly improves their expressive capability. Extensive experiments on five widely adopted datasets show that the proposed model is more robust to various challenging situations (e.g., appearance changes, small objects, rotation) than existing representative methods. The proposed model is available at https://github.com/DengPingFan/Polyp-PVT.


I. INTRODUCTION
Colonoscopy is the gold standard for detecting colorectal lesions, since it enables colorectal polyps to be identified and removed in time, thereby preventing further spread. As a fundamental task in medical image analysis, polyp segmentation (PS) aims to locate polyps accurately at an early stage, which is of great significance in the clinical prevention of rectal cancer. Traditional PS models mainly rely on low-level features, e.g., texture [1], geometric features [2], and simple linear iterative clustering superpixels [3]. However, these methods yield low-quality results and suffer from poor generalization ability. With the development of deep learning, PS has achieved promising progress. In particular, the U-shaped network [4] has attracted significant attention due to its ability to adopt multi-level features for reconstructing high-resolution results.
PraNet [5] employs a two-stage segmentation approach, adopting a parallel decoder to predict rough regions and an attention mechanism to restore a polyp's edges and internal structure for fine-grained segmentation. ThresholdNet [6] is a confidence-guided data augmentation method based on a hybrid manifold, designed to address the problems caused by limited annotated data and imbalanced data distributions.
Although these methods have greatly improved accuracy and generalization ability compared to traditional methods, it is still challenging for them to locate the boundaries of polyps, as shown in Fig. 1, for several reasons: (1) Image noise. During data collection, the lens rotates in the intestine to capture polyp images from different angles, which also causes motion blur and reflection problems, greatly increasing the difficulty of polyp detection. (2) Camouflage. The color and texture of polyps are very similar to the surrounding tissues, with low contrast, providing them with powerful camouflage properties [11], [12] and making them difficult to identify. (3) Polycentric data. Current models struggle to generalize to multi-center (or unseen) data with different domains/distributions. To address the above issues, our contributions in this paper are as follows:
• We present a novel polyp segmentation framework, termed Polyp-PVT. Unlike existing CNN-based methods, we adopt the pyramid vision transformer as an encoder to extract more robust features.
• To support our framework, we introduce three simple modules. Specifically, the cascaded fusion module (CFM) collects the semantic and location information of polyps from the high-level features through progressive integration. Meanwhile, the camouflage identification module (CIM) captures polyp cues disguised in low-level features, using an attention mechanism to focus on potential polyps and reduce incorrect information in the lower-level features. We further introduce the similarity aggregation module (SAM), equipped with non-local and graph convolutional layers, to mine local pixel and global semantic cues from the polyp area.
• Finally, we conduct extensive experiments on five challenging benchmark datasets, including Kvasir-SEG [13], ClinicDB [8], ColonDB [10], Endoscene [14], and ETIS [9], to evaluate the performance of the proposed Polyp-PVT. On ColonDB, our method achieves a mean Dice
(mDic) of 0.808, which is 5.5% higher than the existing cutting-edge method SANet [7].On the ETIS dataset, our model achieves a mean Dice (mDic) of 0.787, which is 3.7% higher than SANet [7].

A. Polyp Segmentation
Traditional Methods. Computer-aided detection is an effective alternative to manual detection, and a detailed survey has been conducted on detecting ulcers, polyps, and tumors in wireless capsule endoscopy imaging [15]. Early solutions for polyp segmentation were mainly based on low-level features, such as texture [1], geometric features [2], or simple linear iterative clustering superpixels [3]. However, these methods have a high risk of missed or false detections due to the high similarity between polyps and surrounding tissues.
Deep Learning-Based Methods. Deep learning techniques [16]-[25] have greatly promoted the development of polyp segmentation. Akbari et al. [26] proposed a polyp segmentation model using a fully convolutional neural network, whose segmentation results are significantly better than those of traditional solutions. Brandao et al. [27] used a shape-from-shading strategy to recover depth, merging the result into an RGB model to provide richer feature representations. More recently, encoder-decoder-based models, such as U-Net [4], UNet++ [28], and ResUNet++ [29], have gradually come to dominate the field with excellent performance. Sun et al. [30] introduced dilated convolutions to extract and aggregate high-level semantic features with resolution retention to improve the encoder network. Psi-Net [31] introduced a multi-task segmentation model that combines contour and distance-map estimation to assist segmentation mask prediction. Hemin et al. [32] first attempted to use a deeper feature extractor to perform polyp segmentation based on Mask R-CNN [33].
Different from the methods based on U-Net [4], [28], [34], PraNet [5] uses reverse attention modules to mine boundary information with a global feature map, which is generated by a parallel partial decoder from high-level features. Polyp-Net [35] proposed a dual-tree wavelet pooled CNN with a local gradient-weighted embedding level set, effectively avoiding erroneous information in high-signal areas and thereby significantly reducing the false positive rate. Rahim et al. [36] proposed using different convolution kernels within the same hidden layer for deeper feature extraction, with MISH and rectified linear unit activation functions for deep feature propagation and smooth non-monotonicity; in addition, they adopted a generalized intersection-based objective to overcome scale, rotation, and shape differences. Jha et al. [37] designed a real-time polyp segmentation method called ColonSegNet. Ahmed et al. [38] were the first to apply generative adversarial networks to polyp segmentation. Another interesting idea, proposed by Thambawita et al. [39], is to introduce pyramid-based augmentation into the polyp segmentation task. Further, Tomar et al. [40] designed a dual-decoder attention network based on ResUNet++ for polyp segmentation. More recently, MSEG [41] improved PraNet with a simple encoder-decoder structure; specifically, it replaced the original Res2Net50 backbone with HarDNet [42] and removed the attention mechanism to achieve faster and more accurate polyp segmentation. As an early attempt, TransFuse [43] was the first to combine CNNs and transformers in a parallel two-branch architecture. DCRNet [44] uses external and internal context-relation modules to separately estimate the similarity between each location and all other locations in the same and different images. MSNet [45] introduced a multi-scale subtraction network to eliminate redundant and complementary information between the multi-scale features. Providing a comprehensive review of polyp segmentation is beyond the scope of this paper; in Tab. I, however, we briefly survey representative works related to ours.
To adapt to dense prediction tasks such as semantic segmentation, several methods [71]-[77] have also introduced the pyramid structure of CNNs into the design of transformer backbones. For instance, PVT-based models [71], [72] use a hierarchical transformer with four stages, showing that a pure transformer backbone can be as versatile as its CNN counterparts while performing better on detection and segmentation tasks. In this work, we design a new transformer-based polyp segmentation framework, which can accurately locate the boundaries of polyps even in extreme scenarios.

A. Overall Architecture
As shown in Fig. 2, our Polyp-PVT consists of four key modules: a pyramid vision transformer (PVT) encoder, a cascaded fusion module (CFM), a camouflage identification module (CIM), and a similarity aggregation module (SAM). Specifically, the PVT extracts multi-scale features with long-range dependencies from the input image. The CFM collects semantic cues and locates polyps by progressively aggregating the high-level features. The CIM is designed to remove noise and enhance the low-level representation of polyps, including texture, color, and edges. The SAM fuses the low- and high-level features provided by the CIM and CFM, effectively propagating pixel-level polyp information to the entire polyp region.
Given an input image I ∈ R^(H×W×3), we use the transformer-based backbone [71] to extract four pyramid features X_i ∈ R^((H/2^(i+1)) × (W/2^(i+1)) × C_i), where C_i ∈ {64, 128, 320, 512} and i ∈ {1, 2, 3, 4}. Then, we adjust the channels of the three high-level features X_2, X_3, and X_4 to 32 through three convolutional units and feed them (i.e., X′_2, X′_3, and X′_4) into the CFM, yielding the fused high-level feature T_1; a 1 × 1 convolutional layer on T_1 produces the intermediate prediction P_1. Meanwhile, the low-level feature X_1 is refined by the CIM, yielding T_2. After that, T_1 and T_2 are aligned and fused by the SAM, yielding the final feature map F. Finally, F is fed into a 1 × 1 convolutional layer to predict the polyp segmentation result P_2. We use the sum of P_1 and P_2 as the final prediction. During training, we optimize the model with a main loss L_main and an auxiliary loss L_aux. The main loss is calculated between the final segmentation result P_2 and the ground truth (GT), and is used to optimize the final polyp segmentation result. Similarly, the auxiliary loss is used to supervise the intermediate result P_1 generated by the CFM.

B. Transformer Encoder
Due to uncontrolled factors in their acquisition, polyp images tend to contain significant noise, such as motion blur, rotation, and reflection. Some recent works [78], [79] have found that vision transformers [66], [71], [72] demonstrate stronger performance and better robustness to input disturbances than CNNs [16], [17]. Inspired by this, we use a vision transformer as our backbone network to extract more robust and powerful features for polyp segmentation. Different from [66], [73], which use a fixed "columnar" structure or a shifted windowing scheme, the PVT [71] is a pyramid architecture whose representation is calculated with spatial-reduction attention operations, which reduces resource consumption. Note that the proposed model is backbone-independent; other popular transformer backbones are also feasible in our framework. Specifically, we adopt PVTv2 [72], an improved version of PVT with stronger feature extraction ability. To adapt PVTv2 to the polyp segmentation task, we remove the last classification layer and design a polyp segmentation head on top of the four multi-scale feature maps (i.e., X_1, X_2, X_3, and X_4) generated by the different stages. Among these feature maps, X_1 gives detailed appearance information of polyps, while X_2, X_3, and X_4 provide high-level features.
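As an illustration of how the encoder's outputs are prepared for the decoder modules, the following minimal NumPy sketch (not the authors' code; it assumes a 352 × 352 input and uses random weights) projects the three high-level pyramid features to 32 channels with 1 × 1 convolutions, as described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, out_ch):
    """A 1x1 convolution as a per-pixel channel projection (random weights here)."""
    w = rng.standard_normal((x.shape[-1], out_ch))
    return np.einsum('hwc,co->hwo', x, w)

# hypothetical pyramid features from a 352x352 input (strides 4/8/16/32,
# channels 64/128/320/512 as stated in the paper)
shapes = [(88, 88, 64), (44, 44, 128), (22, 22, 320), (11, 11, 512)]
X1, X2, X3, X4 = [rng.standard_normal(s) for s in shapes]

# the three high-level maps are projected to 32 channels before entering the CFM
X2p, X3p, X4p = (conv1x1(x, 32) for x in (X2, X3, X4))
print([x.shape for x in (X2p, X3p, X4p)])  # spatial sizes kept, channels -> 32
```

The low-level map X1 keeps its resolution and is handled separately by the CIM.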

C. Cascaded Fusion Module
To balance accuracy and computational cost, we follow recent popular practices [5], [80] to implement the cascaded fusion module (CFM). Specifically, we define F(•) as a convolutional unit composed of a 3 × 3 convolutional layer with padding set to 1, batch normalization [81], and ReLU [82]. As shown in Fig. 2 (b), the CFM mainly consists of two cascaded parts, as follows: (1) In part one, we up-sample the highest-level feature map X′_4 to the same size as X′_3 and then pass the result through two convolutional units F_1(•) and F_2(•), yielding X^1_4 and X^2_4. Then, we multiply X^1_4 with X′_3 and concatenate the result with X^2_4. Finally, we use a convolutional unit F_3(•) to smooth the concatenated feature, yielding the fused feature map X_34 ∈ R^(H/16 × W/16 × 32). The process can be summarized as Eqn. 1:

X_34 = F_3(Concat(X^1_4 ⊙ X′_3, X^2_4)),   (1)
where "⊙" denotes the Hadamard product, and Concat(•) is the concatenation operation along the channel dimension.
(2) As shown in Eqn. 2, the second part follows a similar process to part one. First, we up-sample X′_4, X′_3, and X_34 to the same size as X′_2 and smooth them using the convolutional units F_4(•), F_5(•), and F_6(•), respectively. Then, we multiply the smoothed X′_4 and X′_3 with X′_2, and concatenate the resulting map with the up-sampled and smoothed X_34. Finally, we feed the concatenated feature map into two convolutional units (i.e., F_7(•) and F_8(•)) to reduce the dimension, and obtain T_1 ∈ R^(H/8 × W/8 × 32), which is also the output of the CFM:

T_1 = F_8(F_7(Concat(F_4(Up(X′_4)) ⊙ F_5(Up(X′_3)) ⊙ X′_2, F_6(Up(X_34))))),   (2)

where Up(•) denotes up-sampling to the size of X′_2.
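The part-one dataflow of the CFM (up-sample, two convolutional units, Hadamard product, concatenation, smoothing) can be sketched shape-wise in NumPy. Note this is an illustrative sketch: conv_unit stands in for the 3 × 3 conv-BN-ReLU unit F_i with a random 1 × 1 projection plus ReLU, and the feature maps are random:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_unit(x, out_ch):
    """Stand-in for F_i: here a random 1x1 projection followed by ReLU."""
    w = rng.standard_normal((x.shape[-1], out_ch)) / np.sqrt(x.shape[-1])
    return np.maximum(np.einsum('hwc,co->hwo', x, w), 0.0)

def upsample2x(x):
    """Nearest-neighbour 2x spatial upsampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

X3p = rng.standard_normal((22, 22, 32))   # X'_3, stride-16 feature
X4p = rng.standard_normal((11, 11, 32))   # X'_4, stride-32 feature

up = upsample2x(X4p)                       # to the size of X'_3
X4_1 = conv_unit(up, 32)                   # F_1
X4_2 = conv_unit(up, 32)                   # F_2
fused = np.concatenate([X4_1 * X3p, X4_2], axis=-1)  # Hadamard product, then concat
X34 = conv_unit(fused, 32)                 # F_3 smooths back to 32 channels
print(X34.shape)                           # (22, 22, 32), i.e. H/16 x W/16 x 32
```

Part two repeats the same pattern one pyramid level up, at the resolution of X′_2.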

D. Camouflage Identification Module
Low-level features often contain rich detail information, such as texture, color, and edges.However, polyps tend to be very similar in appearance to the background.Therefore, we need a powerful extractor to identify the polyp details.
As shown in Fig. 2 (c), we introduce a camouflage identification module (CIM) to capture the details of polyps from different dimensions of the low-level feature map X_1. Specifically, the CIM consists of a channel attention operation [83] Att_c(•) and a spatial attention operation [84] Att_s(•), which can be formulated as:

T_2 = Att_s(Att_c(X_1)).

The channel attention operation Att_c(•) can be written as:

Att_c(x) = σ(H_1(P_max(x)) + H_2(P_avg(x))) ⊙ x,

where x is the input tensor and σ(•) is the Softmax function. P_max(•) and P_avg(•) denote the adaptive maximum pooling and adaptive average pooling functions, respectively. H_i(•), i ∈ {1, 2}, shares parameters and consists of a 1 × 1 convolutional layer that reduces the channel dimension 16 times, followed by a ReLU layer and another 1 × 1 convolutional layer that recovers the original channel dimension. The spatial attention operation Att_s(•) can be formulated as:

Att_s(x) = σ(G(Concat(R_max(x), R_avg(x)))) ⊙ x,

where R_max(•) and R_avg(•) represent the maximum and average values obtained along the channel dimension, respectively, and G(•) is a 7 × 7 convolutional layer with padding set to 3.
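A minimal NumPy sketch of the CIM's two attention gates follows. It is a dataflow illustration only, with several stated simplifications: a sigmoid gate is used where the text writes σ, the 7 × 7 convolution G(•) is replaced by a simple sum of the two channel-reduced maps, and the shared MLP H uses random weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, reduction=16):
    """Channel gate: pool over space, pass max/avg descriptors through a
    shared squeeze-excite MLP, and gate the input channels."""
    c = x.shape[-1]
    rng = np.random.default_rng(0)
    w1 = rng.standard_normal((c, c // reduction)) / np.sqrt(c)
    w2 = rng.standard_normal((c // reduction, c)) / np.sqrt(c // reduction)
    mlp = lambda v: np.maximum(v @ w1, 0.0) @ w2      # shared H(.)
    gate = sigmoid(mlp(x.max(axis=(0, 1))) + mlp(x.mean(axis=(0, 1))))
    return x * gate

def spatial_attention(x):
    """Spatial gate: channel-wise max/mean maps fused and used to gate pixels."""
    r_max = x.max(axis=-1, keepdims=True)
    r_avg = x.mean(axis=-1, keepdims=True)
    gate = sigmoid(r_max + r_avg)   # stands in for the 7x7 conv G(.)
    return x * gate

x1 = np.random.default_rng(1).standard_normal((88, 88, 64))  # low-level feature X_1
t2 = spatial_attention(channel_attention(x1))
print(t2.shape)   # shape is preserved: the CIM only re-weights X_1
```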

E. Similarity Aggregation Module
To explore high-order relations between the lower-level local features from the CIM and the higher-level cues from the CFM, we introduce the non-local operation [85], [86] in the graph convolution domain [87] to implement our similarity aggregation module (SAM). As a result, the SAM can inject detailed appearance features into high-level semantic features using global attention.
Given the feature map T_1, which contains high-level semantic information, and T_2, with rich appearance details, we fuse them through self-attention. First, two linear mapping functions W_θ(•) and W_ϕ(•) are applied on T_1 to reduce the dimension and obtain the query and key feature maps Q = W_θ(T_1) and K = W_ϕ(T_1). Here, we take a convolution operation with a kernel size of 1 × 1 as the linear mapping process. For T_2, we use a convolutional unit W_g(•) to reduce the channel dimension to 32 and interpolate it to the same size as T_1. Then, we apply a Softmax function along the channel dimension and choose the second channel as the attention map, leading to T′_2. These operations are represented as F(•) in Fig. 3. Next, we calculate the Hadamard product between K and T′_2; this operation assigns different weights to different pixels, increasing the weight of edge pixels. After that, we use an adaptive pooling operation to reduce the displacement of features and apply a center crop to obtain the feature map V ∈ R^(4×4×16). In summary, the process can be formulated as V = AP(K ⊙ T′_2), where AP(•) denotes the pooling and crop operations. Then, we establish the correlation between each pixel in V and K through an inner product, written as f = V^T ⊗ K, where "⊗" denotes the inner product operation, V^T is the transpose of V, and f is the correlation attention map.
After obtaining the correlation attention map f, we multiply it with the feature map Q, and the resulting features are fed to the graph convolutional layer [86] GCN(•), leading to G ∈ R^(4×4×16). Following [86], we calculate the inner product between f and G as in Eqn. 9, reconstructing the graph-domain features into the original structural features:

Y′ = f^T ⊗ G.   (9)

The reconstructed feature map Y′ is adjusted to the same number of channels as T_1 by a convolutional layer W_z(•) with a 1 × 1 kernel size, and then combined with the feature T_1 to obtain the final output Z ∈ R^(H/8 × W/8 × 32) of the SAM. Eqn. 10 summarizes this process:

Z = T_1 + W_z(Y′).   (10)
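The core idea of the SAM, attending over all pixel pairs of T_1 while weighting the values by a low-level attention map derived from T_2, can be illustrated with a generic non-local step in NumPy. This is a simplified sketch with random projection weights; the actual module additionally uses adaptive pooling, center cropping, and a graph convolution:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def non_local_fuse(t1, t2_att, d=16, seed=0):
    """Generic non-local step: project t1 to queries/keys, weight the values
    by the low-level attention map t2_att, and mix values by pixel-pair similarity."""
    h, w, c = t1.shape
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((c, d)) / np.sqrt(c) for _ in range(3))
    q = t1.reshape(h * w, c) @ Wq
    k = t1.reshape(h * w, c) @ Wk
    v = (t1 * t2_att).reshape(h * w, c) @ Wv   # detail-weighted values
    attn = softmax(q @ k.T / np.sqrt(d))       # (hw, hw) pixel-pair similarity
    return (attn @ v).reshape(h, w, d)

t1 = np.random.default_rng(1).standard_normal((16, 16, 32))  # high-level feature
t2_att = np.random.default_rng(2).random((16, 16, 1))        # attention in [0, 1]
z = non_local_fuse(t1, t2_att)
print(z.shape)   # (16, 16, 16)
```

The quadratic (hw × hw) attention is what the pooled 4 × 4 feature V in the paper avoids.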

F. Loss Function
Our loss function can be formulated as Eqn. 11:

L = L_main + L_aux,   (11)

where L_main and L_aux are the main loss and auxiliary loss, respectively. The main loss L_main is calculated between the final segmentation result P_2 and the ground truth G:

L_main = L^w_IoU(P_2, G) + L^w_BCE(P_2, G).   (12)

The auxiliary loss L_aux is calculated between the intermediate result P_1 from the CFM and the ground truth G:

L_aux = L^w_IoU(P_1, G) + L^w_BCE(P_1, G).   (13)

L^w_IoU(•) and L^w_BCE(•) are the weighted intersection over union (IoU) loss [88] and the weighted binary cross-entropy (BCE) loss [88], which constrain the prediction map from the global-structure (object-level) and local-detail (pixel-level) perspectives. Unlike the standard BCE loss, which treats all pixels equally, L^w_BCE(•) considers the importance of each pixel and assigns higher weights to hard pixels. Similarly, compared to the standard IoU loss, L^w_IoU(•) pays more attention to hard pixels.
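A runnable sketch of the combined BCE + IoU objective is given below. For simplicity the per-pixel weight map is uniform here, whereas the actual weighted variants [88] derive it from the local ground-truth structure to emphasize hard pixels:

```python
import numpy as np

def bce_iou_loss(pred_logits, gt, weight=None):
    """Combined BCE + IoU loss; `weight` is a per-pixel importance map
    (uniform in this sketch -- the paper derives it from the GT, as in [88])."""
    if weight is None:
        weight = np.ones_like(gt)
    p = 1.0 / (1.0 + np.exp(-pred_logits))   # sigmoid probabilities
    eps = 1e-8
    bce = -(gt * np.log(p + eps) + (1 - gt) * np.log(1 - p + eps))
    w_bce = (weight * bce).sum() / weight.sum()
    inter = (weight * p * gt).sum()
    union = (weight * (p + gt - p * gt)).sum()
    w_iou = 1.0 - inter / (union + eps)
    return w_bce + w_iou

gt = (np.random.default_rng(0).random((64, 64)) > 0.7).astype(float)
logits = (gt * 2 - 1) * 4.0          # confident, mostly-correct prediction
loss = bce_iou_loss(logits, gt)
print(loss < 0.2)                    # near-perfect prediction -> small loss
```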

G. Implementation Details
We implement our Polyp-PVT with the PyTorch framework and use a Tesla P100 GPU to accelerate the computations. Considering the differences in the sizes of polyp images, we adopt a multi-scale training strategy [5], [41]. The hyperparameter details are as follows. To update the network parameters, we use the AdamW [89] optimizer, which is widely used in transformer networks [71]-[73]. The learning rate is set to 1e-4 and the weight decay to 1e-4 as well. Further, we resize the input images to 352 × 352 and train with a mini-batch size of 16 for 100 epochs. More details about the training loss curves, parameter settings, and network parameters are shown in Fig. 4, Tab. II, and Tab. III, respectively. The total training time is nearly 3 hours, with the best performance typically reached by around epoch 30. For testing, we only resize the images to 352 × 352, without any post-processing optimization strategies.
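The multi-scale strategy can be sketched as picking a rescaled input resolution per iteration. The scale set {0.75, 1.0, 1.25} follows common practice in [5], [41] and is an assumption here, as is rounding to a multiple of 32 to stay compatible with the stride-32 encoder:

```python
import numpy as np

def multiscale_size(base=352, scales=(0.75, 1.0, 1.25), rng=None):
    """Pick a training resolution per iteration (hypothetical scale set;
    the result is snapped to a multiple of 32 for the stride-32 encoder)."""
    rng = rng or np.random.default_rng()
    s = rng.choice(scales)
    return int(round(base * s / 32) * 32)

rng = np.random.default_rng(0)
sizes = {multiscale_size(rng=rng) for _ in range(50)}
print(sorted(sizes))   # drawn from {256, 352, 448}
```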

A. Evaluation Metrics
We employ six widely used evaluation metrics, including Dice [90], IoU, mean absolute error (MAE), weighted F-measure (F^w_β) [91], S-measure (S_α) [92], and E-measure (E_ξ) [93], [94], to evaluate model performance. Among these metrics, Dice and IoU are region-level similarity measures, which mainly focus on the internal consistency of segmented objects; we report their mean values, denoted as mDic and mIoU, respectively. MAE is a pixel-by-pixel comparison metric that represents the average absolute error between the predicted value and the true value. The weighted F-measure (F^w_β) comprehensively considers both recall and precision, eliminating the effect of treating every pixel equally in conventional metrics. The S-measure (S_α) focuses on the structural similarity between the prediction and the ground truth at the region and object level.
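The region-level metrics above can be computed directly from binary maps. A small NumPy example with a hand-checkable prediction (a 4 × 4 ground-truth square of which only the left half is predicted):

```python
import numpy as np

def dice(pred, gt, eps=1e-8):
    inter = (pred * gt).sum()
    return (2 * inter + eps) / (pred.sum() + gt.sum() + eps)

def iou(pred, gt, eps=1e-8):
    inter = (pred * gt).sum()
    union = pred.sum() + gt.sum() - inter
    return (inter + eps) / (union + eps)

def mae(pred, gt):
    return np.abs(pred - gt).mean()

gt = np.zeros((8, 8)); gt[2:6, 2:6] = 1      # 16-pixel ground-truth square
pred = np.zeros((8, 8)); pred[2:6, 2:4] = 1  # left half of it (8 pixels)
print(round(dice(pred, gt), 3), round(iou(pred, gt), 3), round(mae(pred, gt), 3))
# -> 0.667 0.5 0.125
```

Dice = 2·8/(8+16) = 2/3, IoU = 8/16 = 1/2, and MAE counts the 8 wrong pixels out of 64.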
Models. We collect several open-source models from the field of polyp segmentation, for a total of nine comparison models: U-Net [4], UNet++ [28], PraNet [5], SFA [95], MSEG [41], ACSNet [49], DCRNet [44], EU-Net [52], and SANet [7]. For a fair comparison, we use their open-source code and evaluate on the same training and testing sets. Note that the SFA results are generated using the released test model.

C. Analysis of Learning Ability
Settings. We use the ClinicDB and Kvasir-SEG datasets to evaluate the learning ability of the proposed model. ClinicDB contains 612 images, which are extracted from 31 colonoscopy videos. Kvasir-SEG is collected from the polyp class of the Kvasir dataset and includes 1,000 polyp images. Following PraNet, we adopt the same 548 and 900 images from the ClinicDB and Kvasir-SEG datasets as the training set, and the remaining 64 and 100 images are employed as the respective test sets.
Results. As can be seen in Tab. IV, our model is superior to the current methods, demonstrating its better learning ability. On the Kvasir-SEG dataset, the mDic score of our model is 1.3% higher than that of the second-best model, SANet, and 1.9% higher than that of PraNet. On the ClinicDB dataset, the mDic score of our model is 2.1% higher than that of SANet and 3.8% higher than that of PraNet.

D. Analysis of Generalization Ability
Settings. To verify the generalization ability of the model, we test it on three unseen (i.e., polycentric) datasets: ETIS, ColonDB, and EndoScene. There are 196 images in ETIS, 380 images in ColonDB, and 60 images in EndoScene. It is worth noting that the images in these datasets come from different medical centers; in other words, the model has not seen their training data, which differs from the evaluation protocol used for ClinicDB and Kvasir-SEG.
Results. The results are shown in Tab. VI and Tab. V. As can be seen, our Polyp-PVT achieves good generalization performance compared with the existing models, and it generalizes well to multicentric (or unseen) data with different domains/distributions. On ColonDB, it is ahead of the second-best SANet and the classical PraNet by 5.5% and 9.6%, respectively. On ETIS, we exceed SANet and PraNet by 3.7% and 15.9%, respectively. In addition, on EndoScene, our model is better than SANet and PraNet by 1.2% and 2.9%, respectively. Moreover, to further demonstrate the generalization ability of Polyp-PVT, we present the max Dice results in Fig. 5, where our model shows a steady improvement on both ColonDB and ETIS. We also report the standard deviation (SD) of the mean Dice (mDic) for our model and the others in Tab. VII; the SD of our model is comparable to that of the compared models, indicating stable and balanced performance.

E. Qualitative Analysis
Fig. 6 and Fig. 7 show the visualization results of our model and the compared models. We find that our results have two advantages:
• Our model adapts to data captured under different conditions; that is, it maintains stable recognition and segmentation ability across different acquisition environments (different lighting, contrast, reflection, motion blur, small objects, and rotation).
• The segmentation results have internal consistency, and the predicted edges are closer to the ground-truth labels.
We also provide FROC curves on ColonDB in Fig. 8; our curve lies at the top, indicating the best performance.

F. Ablation Study
We analyze in detail the effectiveness of each component of the overall model. The training, testing, and hyperparameter settings are the same as described in Sec. III-G. The results are shown in Tab. VIII.
Components. We use PVTv2 [72] as our baseline (Bas.) and evaluate module effectiveness by removing or replacing components of the complete Polyp-PVT and comparing the variants with the standard version. The standard version is denoted as "Polyp-PVT (PVT+CFM+CIM+SAM)", where "CFM", "CIM", and "SAM" indicate the usage of the CFM, CIM, and SAM, respectively.
Effectiveness of CFM. To analyze the effectiveness of the CFM, a variant "Polyp-PVT (w/o CFM)" is trained. Tab. VIII shows that the model without the CFM drops sharply on all five datasets compared to the standard Polyp-PVT.
In particular, the mDic is reduced from 0.937 to 0.915 on ClinicDB. Effectiveness of CIM. To demonstrate the contribution of the CIM, we also remove it from Polyp-PVT, denoting this variant as "Polyp-PVT (w/o CIM)". As shown in Tab. VIII, this variant performs worse than the full Polyp-PVT. Specifically, removing the CIM causes the mDic to decrease by 1.8% on Endoscene. Meanwhile, it is evident that the lack of the CIM introduces significant noise (please refer to Fig. 10). To further explore the inner workings of the CIM, feature visualizations of its two main configurations are shown in Fig. 9. It can be seen that the low-level features contain a large amount of detailed information; still, the differences between polyps and normal tissues cannot be mined directly from this information. Thanks to the channel and spatial attention mechanisms, information such as the details and edges of polyps can be discerned from a large amount of redundant information.
Effectiveness of SAM.Similarly, we test the effectiveness of the SAM module by removing it from the overall Polyp-PVT and replacing it with an element-wise addition operation,

G. Video Polyp Segmentation
To validate the superiority of the proposed model, we conduct experiments on video polyp segmentation datasets. For a fair comparison, we re-train our model with the same training datasets and use the same testing sets as PNS-Net [64], [97]. We compare our model on three standard benchmarks (i.e., CVC-300-TV [96], CVC-612-T [8], and CVC-612-V [8]) against six cutting-edge approaches, including U-Net [4], UNet++ [28], ResUNet++ [29], ACSNet [49], PraNet [5], and PNS-Net [64], in Tab. XI and Tab. XII. Note that PNS-Net provides all the prediction maps of the compared methods. As seen, our method is very competitive, surpassing the best existing model, PNS-Net, by 3.1% and 6.7% on CVC-612-V and CVC-300-TV, respectively, in terms of mDic.

H. Limitations
Although the proposed Polyp-PVT surpasses existing algorithms, it still performs poorly in certain cases. We present some failure cases in Fig. 12. As can be seen, one major limitation is the inability to detect accurate polyp boundaries under overlapping light and shadow (1st row). Our model can identify the location of the polyp (green mask in the 1st row), but it regards the light-and-shadow part of the edge as the polyp (red mask in the 1st row). More critically, our model incorrectly predicts reflective points as polyps (red masks in the 2nd and 3rd rows). We notice that the reflective points are very salient in the image; therefore, we speculate that the prediction may be driven by these points. A simple remedy may be to convert the input image into a grayscale image, which could eliminate the reflections and light-shadow overlap and thus assist the model's judgment.

V. CONCLUSION
In this paper, we propose a new image polyp segmentation framework, named Polyp-PVT, which utilizes a pyramid vision transformer backbone as the encoder to explicitly extract more powerful and robust features. Extensive experiments show that Polyp-PVT consistently outperforms current cutting-edge models on five challenging datasets without any pre-/post-processing. In particular, on the unseen ColonDB dataset, the proposed model is the first to reach a mean Dice score above 0.8. Interestingly, we also surpass the current cutting-edge PNS-Net on the video polyp segmentation task, demonstrating excellent learning ability. We obtain the above-mentioned achievements by introducing three simple components, i.e., a cascaded fusion module (CFM), a camouflage identification module (CIM), and a similarity aggregation module (SAM), which effectively extract high- and low-level cues separately and fuse them effectively for the final output. We hope this research will stimulate more novel ideas for solving the polyp segmentation task.

Introduction
Colorectal cancer is one of the deadliest cancers in the world. Colonoscopy is the standard procedure for examining, locating, and removing colorectal polyps. However, it has been shown that the missed diagnosis rate of colorectal polyps during colonoscopy is between 6% and 27%. Automatic, accurate, and real-time polyp segmentation during colonoscopy can help clinicians avoid missed lesions and prevent the further development of colorectal cancer.
In recent years, significant progress has been seen in robot-assisted surgery and computer-assisted surgery.The segmentation of surgical instruments can accurately locate robotic instruments and estimate their posture, which is essential for the navigation of surgical robots.In addition, the segmentation results can be used to predict dangerous operations and reduce surgical risks.At the same time, it can provide a variety of automated solutions for postoperative work, such as objective skill evaluation, surgical report generation, and surgical process optimization, which are of great significance to clinical work.

Model
The method in this paper integrates three models for fusion, namely Polyp-PVT, Sinv2-PVT, and Transfuse-PVT. The official Polyp-PVT [dong2021PolypPVT] is designed for polyp segmentation and achieves state-of-the-art segmentation capability and generalization performance. It uses a transformer as the backbone network to extract richer features and mitigates the impact of colorectal image acquisition conditions. Here, we adopt the standard structure without any modification. For Transfuse [zhang2021transfuse], which was also proposed for polyp segmentation, we improve it by replacing the transformer part with the PVT [wang2021pyramid, wang2021pvtv2] to enhance its performance. The official Sinv2 [fan2021concealed] proposes an end-to-end network for searching for and recognizing concealed objects, which obtains considerable segmentation performance.
That task is similar to polyp segmentation and surgical instrument segmentation, so we adopt the model here, employing a stronger PVT transformer [wang2021pyramid, wang2021pvtv2] to replace the original Res2Net [gao2019res2net] backbone and extract more powerful features.

Hyperparameter settings
We use the PyTorch framework to implement our model and a Tesla V100 to accelerate the computations. Taking into account the differences in the sizes of polyp images, we adopt a multi-scale strategy in the training phase. The hyperparameter details are as follows. To update the network parameters, we use the AdamW optimizer, which is widely used in transformer networks. The learning rate is set to 1e-4, and the weight decay is also set to 1e-4.

Inference stage
In the inference stage, we only resize the input images to 352 × 352, without any data augmentation.
For the output, we upsample it to the original image size. In this way, we obtain 15 different prediction results on the test dataset without any data augmentation.
In order to obtain a more stable prediction, we merge the 15 prediction results with a minority voting method. Because the voting strategy can still produce noise in the merged prediction, we apply post-processing: first, a morphological opening operation removes isolated noise points; then, the areas of the connected blocks in the prediction image are counted to remove relatively small noise blocks, yielding the final prediction result.
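The area-based filtering step can be sketched as removing small 4-connected foreground components. The following is a pure-NumPy/BFS illustration (the morphological opening is omitted for brevity, and the area threshold is a hypothetical value):

```python
import numpy as np
from collections import deque

def remove_small_blobs(mask, min_area):
    """Drop 4-connected foreground components smaller than `min_area`
    (stands in for the area-filtering step described above)."""
    h, w = mask.shape
    seen = np.zeros_like(mask, dtype=bool)
    out = np.zeros_like(mask)
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not seen[sy, sx]:
                comp, q = [], deque([(sy, sx)])   # flood-fill one component
                seen[sy, sx] = True
                while q:
                    y, x = q.popleft()
                    comp.append((y, x))
                    for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                if len(comp) >= min_area:         # keep only large components
                    for y, x in comp:
                        out[y, x] = 1
    return out

m = np.zeros((6, 6), dtype=int)
m[1:4, 1:4] = 1      # 9-pixel "polyp" region
m[5, 5] = 1          # isolated noise pixel
clean = remove_small_blobs(m, min_area=4)
print(clean.sum())   # 9: the noise pixel is gone
```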

Result
We show the qualitative results in Fig. 1 and report the evaluation results in Tab. 1. In addition, we share our failure cases in Fig. 2.

Discussion
It can be seen that both of our improved algorithms, Sinv2-PVT and Transfuse-PVT, perform on par with Polyp-PVT. In the 5-fold cross-validation, the three results are relatively stable, all scoring above 0.92 on the IoU; some visual results are given in Fig. 1. However, certain shortcomings are shown in Fig. 2: our results segment the instruments almost correctly but introduce some noise. One characteristic of this noise is that it is biased towards black, mainly because most instruments in the dataset are black. As a result, small black areas (e.g., food residue) in the colonoscopy image are misidentified as surgical instruments. Such noise can be filtered out based on region area to achieve noise removal.

Conclusion
In this paper, we propose a robust and generalizable medical image segmentation framework, which ensembles multiple models and uses the pyramid vision transformer backbone as the encoder to explicitly extract more powerful and robust features.

Fig. 1. The segmentation examples of our model and SANet [7] with different challenging cases, e.g., camouflage (1st and 2nd rows) and image acquisition influence (3rd row). The images from top to bottom are from ClinicDB [8], ETIS [9], and ColonDB [10], which show that our model has better generalization ability.

Fig. 2. Framework of our Polyp-PVT, which consists of (a) a pyramid vision transformer (PVT) as the encoder network, (b) a cascaded fusion module (CFM) for fusing the high-level features, (c) a camouflage identification module (CIM) to filter out the low-level information, and (d) a similarity aggregation module (SAM) for integrating the high- and low-level features for the final output.

Fig. 3. Details of the introduced SAM. It is composed of a GCN and a non-local block, which extend the pixel features of polyp regions with high-level semantic location cues to the entire region.

Fig. 6. Visualization results with the current models. Green indicates a correct polyp, yellow is a missed polyp, and red is a wrong prediction. As can be seen, the proposed model can accurately locate and segment polyps, regardless of size.
Fig. 7.

Fig. 10. Visualization of the ablation study results, which are converted from the output into heat maps. As can be seen, removing any module leads to missed or incorrectly detected results.
Fig. 11.
TABLE XI THE RESULT OF VIDEO POLYP SEGMENTATION ON CVC-612-T [8] AND CVC-612-V [8].

Fig. 12. Visualization of some failure cases. Green indicates a correct polyp, yellow is a missed polyp, and red is a wrong prediction.
B. W. Wang is with Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China. J. Li is with the Computer Vision Lab, Inception Institute of Artificial Intelligence, Abu Dhabi 00000, UAE. H. Fu is with the Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore 138632, Singapore. L. Shao is with the UCAS-Terminus AI Lab, Terminus Group, Chongqing 400042, China.

TABLE II PARAMETER SETTING DURING THE TRAINING STAGE.

TABLE III NETWORK PARAMETERS OF EACH MODULE. NOTE THAT THE ENCODER PARAMETERS ARE THE SAME AS PVT WITHOUT ANY CHANGES. BASICCONV2D AND CONV2D WITH THE PARAMETERS [IN CHANNEL, OUT CHANNEL, KERNEL SIZE, PADDING] AND GCN [NUM STATE, NUM NODE].
region and object level. E-measure (Eξ) is used to evaluate the segmentation results at the pixel and image level. We report the mean and max values of the E-measure, denoted as mEξ and maxEξ, respectively. The evaluation toolbox is derived from https://github.com/DengPingFan/PraNet.
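For reference, the mean Dice (mDic) and mean IoU reported in the tables average the following per-image scores. This is a minimal numpy sketch of the two metrics written by us, not the evaluation toolbox linked above:

```python
import numpy as np

def dice_and_iou(pred, gt, eps=1e-8):
    """Per-image Dice and IoU for binary masks; `eps` guards the
    degenerate case where both masks are empty."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    dice = (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)
    iou = (inter + eps) / (union + eps)
    return float(dice), float(iou)
```

The dataset-level mDic and mIoU are then simply the means of these scores over all test images.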

TABLE IV QUANTITATIVE RESULTS ON THE TEST DATASETS, i.e., KVASIR-SEG AND CLINICDB.

TABLE VII THE STANDARD DEVIATION (SD) OF THE MEAN DICE (MDIC) OF OUR MODEL AND THE COMPARISON MODELS.

TABLE IX ABLATION STUDY OF GCN IN THE SAM MODULE.

TABLE XII VIDEO POLYP SEGMENTATION RESULTS ON CVC-300-TV.