MSRD-Unet: Multiscale Residual Dilated U-Net for Medical Image Segmentation

Abstract: Semantic segmentation is an active research topic in medical image analysis because it aims to detect objects in medical images. In recent years, approaches based on deep learning have shown more reliable performance than traditional approaches in medical image segmentation. The U-Net network is one of the most successful end-to-end convolutional neural networks (CNNs) presented for medical image segmentation. This paper proposes a Multiscale Residual Dilated convolutional neural network (MSRD-UNet) based on U-Net. MSRD-UNet replaces the traditional convolution block with a novel deeper block that fuses multi-layer features using dilated and residual convolution. In addition, the squeeze-and-excitation (SE) attention mechanism and the skip connections are redesigned to give a more reliable fusion of features. MSRD-UNet aggregates contextual information and lets the network go deeper without increasing the number of parameters or the required floating-point operations (FLOPs). The proposed model was evaluated on three multimodal datasets: polyp, skin lesion, and nuclei segmentation. The obtained results showed that the MSRD-UNet model outperforms several state-of-the-art U-Net-based methods.


Introduction:
With the development of medical imaging technologies, medical images have become essential to medical research and clinical diagnosis. Medical image segmentation is indispensable in automated medical image analysis and understanding because it usually means capturing the region of interest in the analyzed image. More accurate segmentation leads to a more accurate analysis and understanding of a medical image.
Deep learning approaches based on convolutional neural networks (CNNs) outperform traditional approaches in many applications 1, 2, 3. CNNs have positively influenced medical image segmentation, especially after the proposal of the U-Net 4 model. U-Net achieved a leap in performance in medical image segmentation. Its improved performance comes from its symmetric structure and skip connections. The structure consists of an encoder that captures low-level features, a corresponding decoder that captures semantic features, and skip connections that fuse the low-level features with the semantic features to give more informative features. In addition, U-Net can achieve good segmentation performance with a relatively small training set 5, 6. Hence, it has become the pioneer model in the field and inspired many modern models that modified it from different perspectives. Residual convolutions improve feature utilization in classification problems 7; many researchers have adopted residual convolution into U-Net for medical image segmentation problems 8, 9. Replacing the traditional convolution block with a residual convolution block in U-Net increases the model's depth and alleviates the vanishing gradient problem 8, 9, 10.
Attention mechanisms have also succeeded in natural language processing 11 and many computer vision problems 12 and have been adopted in medical image segmentation. For instance, Ozan et al. 13 proposed a novel self-attention gate as an extension of the standard U-Net; this allows the model to automatically learn to focus on salient features that are passed through the skip connections. Ashish and Jose 14 developed a semantic-guided attention module for medical image segmentation. This model integrates a multi-scale technique for combining semantic information at different levels with self-attention modules to accumulate relevant contextual features. In addition, the guided attention model was integrated with the U-Net model to obtain more discriminative features and reduce the redundant low-level features that may arise from the encoder-decoder approach. Jianhui et al. 15 adopted a U-shaped network with residual convolutions and used a Squeeze-and-Excitation (SE) block to recalibrate the importance of different channels. SE is an attention mechanism proposed in 16 that automatically learns the weight of each feature channel, then emphasizes valuable features and suppresses unhelpful ones.
Zongwei et al. 17 replaced the direct skip connections used in U-Net with nested and dense skip connections that give more accurate results. They argued that a network with a direct skip connection from the encoder to the decoder fuses semantically dissimilar feature maps, while a nested skip connection fuses more semantically similar feature maps, which makes the learning process easier. Huimin et al. 18 presented another design for the skip connection with fewer parameters and more accurate segmentation results than the nested skip connection. They argued that capturing low-level and semantic features requires fusing smaller and same-scale feature maps from the encoder with larger-scale feature maps from the decoder.
Dilated convolution is a technique first presented for semantic segmentation in 19. It allows the same kernel to have a wider receptive field, in proportion to the chosen dilation rate. Many researchers have adopted this technique for medical image segmentation and reported good performance 20, 21, 22.
Recently, many researchers have integrated the techniques reviewed above to overcome the limitations of the convolution's fixed receptive field, the vanishing gradient problem, and the redundancy of feature maps. These integrations usually require more parameters and FLOPs 8, 22, 23. This paper proposes a new model that integrates these techniques in a new structure to achieve good segmentation performance using fewer parameters and FLOPs.
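To make the receptive-field effect concrete, the following minimal sketch computes the span covered by a dilated kernel along one axis, using the standard relation k_eff = k + (k - 1)(d - 1). The function names are illustrative and do not come from the paper.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span in pixels covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def weights_per_filter(kernel_size: int) -> int:
    """Number of weights in one 2D filter channel; independent of dilation."""
    return kernel_size * kernel_size


for d in (1, 2, 3):
    span = effective_kernel_size(3, d)
    print(f"dilation={d}: covers {span}x{span} pixels with {weights_per_filter(3)} weights")
```

A 3 × 3 kernel therefore covers 3 × 3, 5 × 5, and 7 × 7 pixels at dilations 1, 2, and 3, while the number of weights stays at nine in every case.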
In the proposed MSRD-Unet, the traditional convolution block is replaced with a novel Multiscale Residual Dilated Block (MSRDB), which fuses multi-scale feature maps captured by differently dilated kernels and uses a residual connection so that the network can go deeper while avoiding the vanishing gradient problem. In addition, the skip connections were redesigned to enhance the fusion of feature maps, and a squeeze-and-excitation attention mechanism 16 was used to recalibrate channel-wise attention. The contributions in this model can be summarized as follows:
- The Multiscale Residual Dilated Block (MSRDB) aggregates two parallel dilated convolutions with the base convolution that feeds them. In addition, that base convolution is directly used as a residual connection. This allows aggregating convolutions at different scales and alleviates the vanishing gradient problem without using more parameters.
- The skip connection is redesigned by fusing feature maps from different layers in the encoder before feeding them into the decoder layers. This helps to enhance the integration between the encoder feature maps and the decoder feature maps. In addition, using an SE block after the skip connection captures the most critical features and reduces redundancy.
The proposed model is evaluated using three medical datasets (Data Science Bowl 24, ISIC-2016 25, and CVC-ClinicDB 26). Its performance was compared with U-Net and three other state-of-the-art methods based on the U-shaped approach.

The Proposed Method
This section presents the primary building blocks of the proposed network, MSRD-Unet, and how the skip connection is redesigned to improve the segmentation result. Fig. 1 depicts the overall structure of the proposed MSRD-UNet.

1. Multiscale Residual Dilated Block (MSRDB)
In the proposed approach, the traditional convolution block was replaced with a multiscale residual dilated convolution block, as shown in Fig. 2 (c). The output of a base convolution with the default dilation of 1 is passed to two parallel convolutions with dilations of 2 and 3, where all convolutions use the same kernel size (3 × 3). The outputs of these three multiscale convolutions are concatenated and then passed to a 1 × 1 convolution that compresses the features. Each Convolution (Conv) is followed by Batch Normalization (BN) and a Rectified Linear Unit (ReLU). Extracting features at varying scales gives the network the potential to learn more, while dilated convolution keeps the number of parameters low, since a dilated 3 × 3 kernel has the same number of weights as an undilated one. Finally, a residual connection is passed from the base convolution, which allows the deeper structure without performance degradation. Fig. 2 shows the difference between MSRDB, traditional convolution, and residual convolution.
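The block described above can be sketched in PyTorch as follows. This is a minimal illustration rather than the authors' implementation: the class and helper names are hypothetical, the channel widths are free parameters, and adding the base output to the 1 × 1 output is one assumed reading of "the residual connection is passed from base convolution".

```python
import torch
import torch.nn as nn


def conv_bn_relu(in_ch, out_ch, kernel_size, dilation=1):
    """Conv -> BN -> ReLU; the padding keeps the spatial size unchanged."""
    padding = dilation * (kernel_size - 1) // 2
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding,
                  dilation=dilation, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
    )


class MSRDB(nn.Module):
    """Sketch of a multiscale residual dilated block (names are illustrative)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.base = conv_bn_relu(in_ch, out_ch, 3, dilation=1)
        self.dil2 = conv_bn_relu(out_ch, out_ch, 3, dilation=2)
        self.dil3 = conv_bn_relu(out_ch, out_ch, 3, dilation=3)
        # 1x1 convolution compresses the three concatenated scales
        self.fuse = conv_bn_relu(3 * out_ch, out_ch, 1)

    def forward(self, x):
        base = self.base(x)
        # The base output feeds both dilated branches in parallel
        multi = torch.cat([base, self.dil2(base), self.dil3(base)], dim=1)
        # Assumption: the residual is an element-wise addition of the base output
        return self.fuse(multi) + base


x = torch.randn(2, 3, 64, 64)
y = MSRDB(3, 32)(x)
print(y.shape)  # torch.Size([2, 32, 64, 64])
```

Setting the padding equal to the dilation for a 3 × 3 kernel keeps all three branches at the same resolution, which is what makes the channel-wise concatenation possible.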

2. U-Shaped Structure
MSRD-Unet, as illustrated in Fig. 1, takes a U-shaped structure. It consists of an encoder path (the upper path in Fig. 1) and a decoder path (the lower path). Each of them consists of five layers. Each layer in the encoder path uses an MSRDB block to extract features, followed by a 2 × 2 max pooling operation with stride 2 that downsamples (DS) the resolution. At each DS, the number of channels increases by 32, and the feature map's resolution is halved. Every layer in the decoder path starts with a transposed convolution that upsamples (US) the feature map. At each US, the number of channels decreases by 32, and the feature map's resolution is doubled. The feature map coming from the encoder layer is concatenated (CONC) with that of its corresponding decoder layer, followed by an SE block and an MSRDB block. The SE block is used for dynamic channel-wise feature recalibration and to reduce redundant features before feeding them into the MSRDB block.
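One decoder layer as described (transposed convolution, concatenation with the encoder feature map, then SE) can be sketched as below. Several details are assumptions made only for illustration: the SE reduction ratio of 16, an encoder that starts at 32 channels (so the widths are 32, 64, 96, 128, 160), and a plain Conv-BN-ReLU standing in for the MSRDB block. The class names are hypothetical.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-excitation: one learned weight in (0, 1) per channel."""

    def __init__(self, channels, reduction=16):  # reduction ratio is an assumption
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global average per channel
        self.fc = nn.Sequential(             # excitation: small bottleneck MLP
            nn.Linear(channels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights  # rescale every channel by its weight


class DecoderStep(nn.Module):
    """US -> CONC with the encoder feature map -> SE -> convolution block."""

    def __init__(self, in_ch, step=32):
        super().__init__()
        out_ch = in_ch - step  # channels decrease by 32 at each US
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.se = SEBlock(2 * out_ch)
        # Stand-in for the MSRDB block (hypothetical simplification)
        self.block = nn.Sequential(
            nn.Conv2d(2 * out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
        )

    def forward(self, x, skip):
        x = self.up(x)                   # doubles the resolution
        x = torch.cat([skip, x], dim=1)  # CONC along the channel axis
        return self.block(self.se(x))


deep = torch.randn(2, 160, 16, 16)  # assumed deepest encoder output for a 256x256 input
skip = torch.randn(2, 128, 32, 32)  # assumed feature map from the fourth encoder layer
out = DecoderStep(160)(deep, skip)
print(out.shape)  # torch.Size([2, 128, 32, 32])
```

Placing SE directly after the concatenation means the channel weights are computed over encoder and decoder channels jointly, so redundant channels from either side can be suppressed before the next block.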

3. Skip Connection
The skip connection was redesigned to fuse features from different layers of the encoder before passing them to the decoder path. Feature fusion helps to integrate different scales of features, since it aggregates spatial features with high-level features within the encoder layers. In the encoder path, the first layer was aggregated with the third layer, and the second layer was aggregated with the fifth layer, using element-wise summation. After that, it passes

Evaluation Metrics
Jaccard Similarity (JS) and Dice Score (DS) are essential metrics for evaluating medical image segmentation performance. They are suitable for medical datasets because they cope with the class imbalance common in such data 27. As described in Eq. 1, the Dice Score evaluates the spatial overlap between the ground truth and the predicted mask.
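Under the standard definitions, DS = 2|A ∩ B| / (|A| + |B|) and JS = |A ∩ B| / |A ∪ B| for a predicted mask A and a ground-truth mask B. The sketch below computes both for flattened binary masks; the function names are illustrative, and returning 1.0 when both masks are empty is an assumed convention.

```python
def overlap_counts(pred, truth):
    """Count true positives, false positives, and false negatives."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, fp, fn


def dice_score(pred, truth):
    tp, fp, fn = overlap_counts(pred, truth)
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom  # assumed convention for empty masks


def jaccard_similarity(pred, truth):
    tp, fp, fn = overlap_counts(pred, truth)
    denom = tp + fp + fn
    return 1.0 if denom == 0 else tp / denom


pred = [1, 1, 0, 0, 1, 0]   # flattened predicted mask
truth = [1, 0, 0, 0, 1, 1]  # flattened ground-truth mask
print(round(dice_score(pred, truth), 4))  # 0.6667
print(jaccard_similarity(pred, truth))    # 0.5
```

Neither metric counts true negatives, so a large background region does not inflate the score. The two are also monotonically related by JS = DS / (2 - DS).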

Experiment and Results
This part presents the implementation details of the proposed model and the other models, and compares the performance of the proposed model with that of other state-of-the-art models.

1. Implementation Details
All the models are implemented on a GPU machine in Google's Colaboratory using the PyTorch framework 29. Dice loss 30 is used as a standard segmentation loss function because it is suitable for the imbalance issues in the datasets. The Adam optimizer 31 was employed with a learning rate starting at 1e-4, and ReduceLROnPlateau 32 was used to adjust it during training. All the models were trained for 100 epochs with a batch size of 8 and an image resolution of 256 × 256 pixels. During training, random vertical and horizontal flipping was used as data augmentation. Fig. 4 shows the loss curves for each dataset, with red and blue depicting the training and testing losses of the proposed model, respectively.
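A minimal sketch of this training setup in PyTorch is given below. It is illustrative only: a single convolution stands in for the network, random tensors stand in for a batch, the scheduler uses its default factor and patience (the paper does not state them), and the soft Dice loss with a smoothing term `eps` is one common formulation rather than necessarily the one used by the authors.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


def dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss averaged over the batch (eps is an assumed smoothing term)."""
    probs = torch.sigmoid(logits)
    dims = (1, 2, 3)
    intersection = (probs * target).sum(dim=dims)
    denom = probs.sum(dim=dims) + target.sum(dim=dims)
    return 1 - ((2 * intersection + eps) / (denom + eps)).mean()


def random_flip(images, masks):
    """Apply the same random horizontal/vertical flip to images and masks."""
    if torch.rand(1).item() < 0.5:
        images, masks = torch.flip(images, dims=[-1]), torch.flip(masks, dims=[-1])
    if torch.rand(1).item() < 0.5:
        images, masks = torch.flip(images, dims=[-2]), torch.flip(masks, dims=[-2])
    return images, masks


model = nn.Conv2d(3, 1, kernel_size=3, padding=1)  # stand-in for the network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")

images = torch.rand(8, 3, 256, 256)                 # batch of 8 at 256x256
masks = (torch.rand(8, 1, 256, 256) > 0.5).float()  # random binary masks

for epoch in range(2):  # the paper trains for 100 epochs
    batch_images, batch_masks = random_flip(images, masks)
    optimizer.zero_grad()
    loss = dice_loss(model(batch_images), batch_masks)
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())  # in practice, pass a held-out loss here
    print(epoch, round(loss.item(), 4))
```

`ReduceLROnPlateau` lowers the learning rate only when the monitored value stops improving, so the rate stays at 1e-4 for as long as the loss keeps decreasing.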

Results and Discussion:
An ablation study was performed to measure the influence of the different modifications on MSRD-Unet performance. As previously stated, MSRD-Unet has three modifications: the MSRDB, the redesigned skip connection, and the added SE block. The model was therefore trained and tested on all datasets with and without these modifications, as illustrated in Table 2.
The same hyperparameter setting is applied to each experiment during the training and testing phases. DS was used to evaluate the results on the testing set of each dataset listed in Table 1, as it is a fundamental metric of segmentation performance. In the first experiment, the model was trained and tested on the datasets using only MSRDB, with a direct skip connection and without SE. This model was also compared with variants in which MSRDB is replaced by the traditional and residual convolution blocks illustrated in Fig. 2. The experiment showed that MSRDB improves the model's performance on all datasets, as shown in Table 2. After that, MSRD-Unet was tested without the SE block to measure the effect of redesigning the skip connection. It enhances the result relative to the

The three datasets listed in Table 1 were evaluated using five performance criteria: DS, JS, accuracy (AC), sensitivity (SE), and specificity (SP). The proposed MSRD-Unet was compared to state-of-the-art approaches, namely U-Net, Unet 3+ 18, Att-UNet 13, and RU-Net 8. Table 3 compares our method with the other state-of-the-art methods in terms of the number of parameters and the computational complexity, measured in floating-point operations (FLOPs). MSRD-Unet requires approximately two times fewer FLOPs and parameters than U-Net, Att-UNet, and RU-Net. However, it requires a quarter more parameters than Unet 3+ and approximately 3% more FLOPs. The structure of the proposed model requires more memory than the other models, while its inference time on a single test image is close to that of the other models. So MSRD-Unet consumes acceptable computational resources. MSRD-Unet outperformed the other state-of-the-art methods on three different datasets, as illustrated in Table 4.

On the Data Science Bowl dataset, the proposed model outperformed other state-of-the-art methods by all metrics. It improved the results of its closest competitor method (Att-UNet) by 1.31%, 1.17%, 1.03%, 0.55%, and 1.75% for DS, JS, AC, SE, and SP, respectively.
On the CVC-ClinicDB dataset, the proposed model outperformed other state-of-the-art methods by all metrics and improved the results of its closest competitor method (Unet 3+) by 4.52%, 5.54%, 0.61%, 2.52%, and 1.33% for DS, JS, AC, SE, and SP, respectively.
On the ISIC-2016 dataset, the proposed model outperformed other state-of-the-art methods by all metrics except for Specificity compared to Unet 3+. It improved the results of its closest competitor method (RU-Net) by 1.25%, 1.89%, 0.07%, 0.86%, and 1.33% for DS, JS, AC, and SE, respectively.

MSRD-Unet achieved good performance on three datasets that have different requirements. Its structure allows it to go deeper and wider at each layer. It captures feature maps that integrate the low-level features with more semantic features. This integration is essential to getting good semantic segmentation performance. In addition, MSRD-Unet generalizes to the testing set on all datasets, as the loss curves in Fig. 4 show, with no signs of underfitting or overfitting. Fig. 5 illustrates some qualitative examples of the performance of the proposed network on Data Science Bowl, CVC-ClinicDB, and ISIC-2016. It demonstrates the network's ability to overcome the challenges found in each dataset. Each dataset has its own challenges, as illustrated in Fig. 5 (A, B, and C), including variety in colors, sizes, shapes, and texture.

Conclusion:
This paper presented an end-to-end network based on U-shaped deep learning for medical image segmentation called MSRD-Unet. This approach captures different levels of features by combining multiscale residual dilated convolution with redesigned skip connections that fuse different levels of features before passing them to the decoder. In addition, the SE block was used in the decoder path to dynamically recalibrate channel-wise features. Although the proposed method requires few parameters and FLOPs, it consumes more memory and does not improve inference time. Three different datasets were chosen, and five metrics were used to evaluate the proposed method's performance compared to the state-of-the-art methods. The quantitative and qualitative results showed that the proposed approach outperformed the other methods while still involving acceptable computation costs. In future work, we intend to continue investigating the MSRDB and redesigning the skip connection to improve accuracy and efficiency.
- We hereby confirm that all the Figures and Tables in the manuscript are ours. Besides, the Figures and images that are not ours have been given permission for re-publication, attached with the manuscript.
- Ethical Clearance: The project was approved by the local ethical committee in University of Baghdad.

Figure 1. MSRD-Unet structure; each box represents a multi-channel feature map and indicates its size and resolution.

Figure 2. (a) Traditional convolution, (b) Residual convolution, and (c) Multiscale residual dilated convolution, where k represents kernel size and d represents the dilation.

Figure 4. Dice loss (DL) curves for the three datasets (CVC-ClinicDB, Data Science Bowl, and ISIC-2016) when training the proposed network (MSRD-Unet), where TRAIN-SET represents the training set, TEST-SET represents the testing set, and EP represents the number of epochs.

Figure 5. Segmentation results produced by MSRD-Unet for some examples from (A) the Data Science Bowl test set, (B) the CVC-ClinicDB test set, and (C) the ISIC-2016 test set.

P-ISSN: 2078-8665 Published Online First: Suppl. November 2022 2022, 19(6): 1603-1611 E-ISSN: 2411-7986
them to the corresponding decoder layer, as shown in Fig. 1. Summation between two feature maps of different shapes requires an upsampling block (UP) and a downsampling block (DOWN) (detailed in Fig. 3). The redesigned skip connection enhances segmentation results because it narrows the scale gap between encoder and decoder features without losing spatial features.

80% of the images were used as a training set, while the remaining 20% served as the testing set. The second dataset is the ISIC-2016 dataset provided by the International Skin Imaging Collaboration for skin lesion segmentation. It consists of 900 images in the training set and 379 in the testing set, with annotation masks. The last dataset is CVC-ClinicDB for polyp segmentation in endoscopic colonoscopy frames. It contains 612 images with annotation masks; 80% of the images were used for the training set and the rest for the testing set. Table 1 summarizes all information on the datasets used for the evaluation.

model with a direct skip connection tested before, as illustrated in Table 2. Finally, MSRD-Unet is tested with all components to measure the influence of SE on performance. It enhances the result, as illustrated in Table 2.