Wavelet-Attention Swin for Automatic Diabetic Retinopathy Classification

Diabetic retinopathy (DR) is a complication of diabetes that affects the eyes by damaging the blood vessels in the retina. High blood sugar levels can cause leakage or blockage of these vessels, leading to vision loss or blindness. Early detection of DR is crucial to prevent blindness, but manually analyzing fundus images is time-consuming, especially when the number of images is large. Swin Transformers have gained popularity in medical image analysis because they reduce computation while yielding improved results. This paper introduces the WT Attention-Db5 Block, which focuses attention on the high-frequency domain using the Discrete Wavelet Transform (DWT). This block extracts detailed information from the high-frequency domain while retaining essential low-frequency information. The model is evaluated on the data of the 2019 Blindness Detection challenge (APTOS 2019 BD) held by the Asia Pacific Tele-Ophthalmology Society. The proposed WT-Swin model achieves significant improvements in classification accuracy. For Swin-T, the training and validation accuracies are 99.14% and 98.91%, respectively. For binary classification using Swin-B, the training accuracy is 99.01%, the validation accuracy is 99.18%, and the test accuracy is 98%. In multi-classification, the training and validation accuracies are 93.19% and 86.34%, respectively, while the test accuracy is 86%. In conclusion, early detection of DR is essential for preventing vision loss, and the WT Attention-Db5 Block integrated into the WT-Swin model shows promising results in classification accuracy.


Introduction
The eye is one of the main organs affected by diabetes, which can lead to diabetic retinopathy 1. DR is considered one of the main complications of diabetes and shows no early symptoms: high blood sugar levels damage the blood vessels in the eye and cause them to swell 2. Because the disease begins without symptoms, and symptoms become noticeable only as it advances, detecting it in its early stages makes treatment more beneficial and effective and may prevent its further development 5.
https://doi.org/10.21123/bsj.2024.8565 P-ISSN: 2078-8665 - E-ISSN: 2411-7986 Baghdad Science Journal
The discrete wavelet transform is a method that converts a signal from the time domain into a time-frequency representation. It is also used in compression standards, including JPEG 2000.
Wavelets are functions that integrate to zero, oscillating above and below the x-axis 6. The transform can reduce spectral noise while preserving spectral details, can reduce local noise, and can analyze information in different bands so that the spectrum is rebuilt in its original form after decomposition 7.
The Swin Transformer can serve as a general-purpose backbone for computer vision and has performed well on many tasks, including object detection, semantic segmentation, and image classification. The main idea of Swin is to introduce hierarchy, together with other priors such as locality, into the Transformer encoder, all of which help in visual tasks 8.
Because it consists of windows arranged hierarchically, it achieves high efficiency by restricting the self-attention computation to non-overlapping local windows, while still allowing communication across windows 9.
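To make the efficiency argument concrete, the following is a minimal sketch of non-overlapping window partitioning and of the resulting reduction in query-key pairs. The function names `window_partition` and `attention_pairs`, the 8 x 8 feature map, and the window size are illustrative assumptions rather than the authors' implementation, and the shifted-window step that lets windows communicate is not shown.

```python
from typing import Optional

import numpy as np


def window_partition(x: np.ndarray, window_size: int) -> np.ndarray:
    """Split an (H, W, C) feature map into non-overlapping windows.

    Returns an array of shape (num_windows, window_size * window_size, C).
    H and W must be multiples of window_size.
    """
    h, w, c = x.shape
    if h % window_size or w % window_size:
        raise ValueError("H and W must be divisible by window_size")
    x = x.reshape(h // window_size, window_size, w // window_size, window_size, c)
    # Put the two window-grid axes first, then flatten each window into tokens.
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, window_size * window_size, c)


def attention_pairs(h: int, w: int, window_size: Optional[int] = None) -> int:
    """Count query-key pairs for global (window_size=None) or windowed attention."""
    tokens = h * w
    if window_size is None:
        return tokens * tokens                 # every token attends to every token
    return tokens * window_size * window_size  # every token attends within its window


# Hypothetical 8 x 8 feature map with 3 channels, split into 4 x 4 windows.
feature_map = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
windows = window_partition(feature_map, window_size=4)
```

Under these assumptions, self-attention inside each window grows linearly with the number of tokens, whereas global self-attention grows quadratically.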
In this work, a novel Wavelet Attention block (WTA-Db5 Block) is proposed that selectively captures essential high-frequency information without affecting low-frequency information. Furthermore, a new method called Wavelet-Attention Swin (WTA-Db5-Swin) is proposed, which uses the Wavelet-Db5-Attention Block to extract detailed image data more accurately. This improves image classification accuracy, with the Swin-T and Swin-B versions of the Swin Transformer adopted as the default backbones. As the number of patients with diabetes and DR has risen, diagnosing DR has become more challenging in recent years, and more cases go undiagnosed and untreated. Early detection and treatment of DR are more cost-effective than late or incorrect diagnosis. To address this issue, the study uses wavelet attention to extract features from fundus images and then uses the Swin Transformer instead of CNN-based deep learning to speed up diagnosis and reduce training and testing time.
The proposed approach contributes to the development of new theories related to the use of wavelet analysis and attention mechanisms in medical image analysis, specifically for the diagnosis of diabetic retinopathy, and improves the efficiency of existing models, which can be further extended and applied to other medical imaging tasks. The key contributions of the paper are summarized as follows:
• The proposed approach combines wavelet analysis and attention mechanisms to improve the accuracy of automatic diabetic retinopathy classification.
• The proposed WT Attention-Db5 Block extracts detailed information from the high-frequency domain based on DWT, while preserving basic information in the low-frequency domain, leading to better results and reduced computational burden.
• The proposed approach uses Swin Transformers, which have the advantage of reducing computation while providing better results, to improve the accuracy of diabetic retinopathy classification.
• The proposed approach is evaluated on the APTOS 2019 dataset and achieves a high accuracy of 98% for binary classification and 86% for multi-classification.

Related Work
Li H, et al. proposed WA-CNN for image classification, which extracts high- and low-frequency features from the image and thereby obtains both detailed information and noise 11. Experiments on CIFAR-10 and CIFAR-100 showed that the method achieves high classification accuracy. Sabiha G K. et al. 12 proposed a new method based on a patch-based deep feature generator inspired by ViT. Both ViT (Vision Transformer) and MLP-Mixer use fixed-size square patches to extract features. In this method, rectangular patches were used instead of square patches. To create deep hidden patterns from these patches, DenseNet201 was used, which has 201 layers and was trained on the ImageNet dataset for image classification tasks. The method achieved an accuracy of more than 90% for classification (Normal, NPDR, and PDR) on the APTOS 2019 dataset. Danny C. et al. 13 proposed a transformer-based U-shaped network (PCAT-UNet) that relies on attention and uses skip connections on both sides to combine features; the results showed that the method performs well in segmenting retinal blood vessels on the DRIVE, STARE, and CHASE_DB1 datasets. Gupta, et al. AlShemmary and Omran 16 proposed a method for detecting pupils in eye images using a combination of morphological operations and the Hough Transform.
The method converts the local iris area into a rectangular block to calculate inconsistencies in the image. This pupil detection method is potentially related to diabetic retinopathy classification, because diabetic retinopathy can cause changes in the retinal blood vessels and may change the shape and size of the pupil. The method could therefore be used to identify abnormalities in the pupils of diabetic retinopathy patients.
Jaskari, et al. 17 presented novel results for nine Bayesian neural networks (BNNs) by investigating a clinical dataset and a 5-class classification scheme, as well as benchmark datasets and a binary classification scheme. A novel uncertainty measure is also proposed, which improves performance on some datasets. The findings suggest that BNNs can be used for uncertainty estimation in classifying diabetic retinopathy on clinical data, but proper uncertainty measures are needed to optimize performance, and methods developed for benchmark datasets might not generalize to clinical datasets.
Zia, et al. 18 proposed a learning model based on deep neural networks that has the potential to accurately detect key precursors of Diabetic Retinopathy (DR) from retinal images. By combining the strengths of selected models (VGG and Inception V3) and using an entropy concept to select the most discriminant features, the model can classify features such as enlarged veins, fluid leakage, exudates, hemorrhages, and microaneurysms into different classes and determine the severity level of DR in retinal images. This model can be a useful tool for the accurate diagnosis and treatment of patients with DR.
Ashour 19 highlighted the effectiveness of Artificial Neural Networks (ANN) in time-series applications, particularly Back Propagation and Recurrent neural networks, in solving linear, semi-linear, and nonlinear time series. The study employed forecast skill (SS), mean square error, and absolute mean square error to measure the efficiency and accuracy of the estimation methods used. The study found that RBF neural networks were less efficient and accurate in solving nonlinear time series, but showed good efficiency in the case of linear or semi-linear time series. Overall, the study provides insights into improving modern methods for time series forecasting.

Materials and Methods
In this section, the global context-modeling framework is first introduced, followed by a detailed discussion of the design shown in Fig. 1.

Pre-processing
Retina images are susceptible to various issues, such as inconsistent image sizes caused by the use of cameras with varying aspect ratios and heights. This inconsistency affects the image quality and may result in black areas around the eye, which provide no useful information for diagnosis. To improve the model's performance, it is necessary to standardize the retinal images. This can be achieved by cropping a circular area containing only the eyeball and removing all the pixels around the eye that are not relevant.
One approach to cropping the black areas is to convert the image to grayscale and identify the black areas based on their pixel intensity. A mask can then be created over the rows and columns, which helps remove the vertical and horizontal black rectangles that may appear in the upper right areas of the image. Afterward, the image can be resized to the desired width and height, denoted as R.
Another critical issue is the shape of the eye, which can vary from circular to oval. To standardize the shape of the eye, a circular crop can be made around the center of the image. These steps help ensure that all input images are similar, which is necessary for our Swin framework, which requires input images of size (R x R x 3).
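The border removal and circular crop described above can be sketched as follows. This is an illustrative sketch only: the function names, the channel-average grayscale conversion, and the intensity threshold of 10 are assumptions, not values reported in the paper.

```python
import numpy as np


def crop_black_borders(image: np.ndarray, threshold: int = 10) -> np.ndarray:
    """Drop outer rows and columns whose grayscale intensity never exceeds `threshold`.

    `image` is an (H, W, 3) array; `threshold` is an assumed cut-off for "black".
    """
    gray = image.mean(axis=2)        # simple grayscale: average of the channels
    mask = gray > threshold          # True where the pixel is not black
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0 or cols.size == 0:
        return image                 # entirely dark image: leave it unchanged
    return image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]


def circular_crop(image: np.ndarray) -> np.ndarray:
    """Zero out every pixel outside the largest circle centred in the image."""
    h, w = image.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    radius = min(h, w) / 2.0
    yy, xx = np.ogrid[:h, :w]
    inside = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
    return image * inside[..., None].astype(image.dtype)
```

A fixed threshold is the simplest choice; images with strong vignetting or camera noise may need a higher value.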
The original images in the dataset are high-resolution color images of different sizes (R1 x R2 x 3) captured by various cameras. To obtain the desired input size for our Swin framework, pixels are cropped from the right and left sides of each original image to achieve a square shape and remove any non-relevant parts, as illustrated in Fig. 2. Cropping and resizing were carried out in the following steps:
Step 1: Find the height (R1) and width (R2) of the image.
Step 2: Crop a part from the left ($r_{left}$) and right ($r_{right}$) of each image, as in Eq. 1 and Eq. 2.
Step 3: Resize the resulting image from Step 2 to (512 x 512).
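The cropping and resizing steps can be sketched as below. Because Eq. 1 and Eq. 2 are not reproduced here, the sketch assumes a symmetric crop in which the excess width R2 - R1 is split between the left and right sides; nearest-neighbour resizing is likewise a stand-in for whatever interpolation the authors used, and the function names are hypothetical.

```python
import numpy as np


def crop_to_square(image: np.ndarray) -> np.ndarray:
    """Crop the left and right sides of an (R1, R2, 3) image to obtain a square."""
    r1, r2 = image.shape[:2]         # Step 1: height R1 and width R2
    if r2 <= r1:
        return image                 # assumption: only wide images need side cropping
    excess = r2 - r1
    r_left = excess // 2             # assumed symmetric split of the excess width
    r_right = excess - r_left
    return image[:, r_left:r2 - r_right]   # Step 2


def resize_nearest(image: np.ndarray, size: int = 512) -> np.ndarray:
    """Resize to (size, size) by nearest-neighbour sampling (Step 3)."""
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size     # source row for each output row
    cols = np.arange(size) * w // size     # source column for each output column
    return image[rows][:, cols]


# Hypothetical 300 x 501 fundus image.
fundus = np.zeros((300, 501, 3), dtype=np.uint8)
prepared = resize_nearest(crop_to_square(fundus), size=512)
```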

Analysis of Method
In this section, the principle of Discrete Wavelet Transform (DWT) is first introduced, followed by a detailed discussion of the Wavelet-Db5-Attention Block design.

Discrete Wavelet Transform
Wavelet transform (WT) is a mathematical method used for signal analysis in the field of signal processing that uses a set of orthogonal wavelets 11. It has recently been widely used in various fields, including noise reduction, compression, and analysis 20.
The basic idea of the wavelet transform is that any function can be represented as a superposition of a group of wavelets, which form the basis functions of the transform and are localized in both time and frequency. Wavelets at different time positions can be used to capture local features in the frequency domain, which enables information to be extracted in the time and frequency domains simultaneously 21. DWT decomposes the data into several components at different frequency intervals, which helps in image processing because the transform is suited to discrete data 11.
There are many basis functions for this transform, including Haar, Daubechies (db), Symlets (sym), and other wavelet families 22. The 2D-DWT can be used in digital image processing, as it converts the input image into a set of low-frequency information and high-frequency information that may be vertical, horizontal, or diagonal, as shown in Fig. 3 20. In this study, the wavelet that provides the best classification performance was found to be "db5" from the Daubechies wavelet family.
This study is significant in several ways. Firstly, it demonstrates the potential of using the Swin Transformer with WTA to improve the accuracy of DR classification on the APTOS dataset. Secondly, the use of WTA for feature extraction in the Swin Transformer is a novel approach that has not been extensively explored in the literature on DR classification. Thirdly, the authors' approach outperformed previous deep learning models and human experts, suggesting its potential to improve the efficiency and accuracy of DR diagnosis. Finally, the study highlights the potential for deep learning approaches to address gaps in the literature related to DR classification, paving the way for future research. The structure of the proposed WT-db5 block is shown in Fig. 4.
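The sub-band decomposition performed by the 2D-DWT can be illustrated with a short sketch. To stay self-contained it uses the Haar wavelet, the simplest member of the Daubechies family, as a stand-in; it is not the db5 wavelet used in this study. With the PyWavelets library, a call such as `pywt.dwt2(image, "db5")` would be the natural way to obtain the db5 sub-bands. The function names below are hypothetical.

```python
import numpy as np


def haar_dwt2(image: np.ndarray):
    """One-level 2-D Haar DWT of an (H, W) array with even H and W.

    Returns the low-frequency band and three high-frequency bands.
    """
    x = image.astype(np.float64)
    a, b = x[0::2, 0::2], x[0::2, 1::2]   # top-left and top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]   # bottom-left and bottom-right
    ll = (a + b + c + d) / 2.0            # local average: low-frequency content
    lh = (a + b - c - d) / 2.0            # difference between rows: horizontal edges
    hl = (a - b + c - d) / 2.0            # difference between columns: vertical edges
    hh = (a - b - c + d) / 2.0            # diagonal detail
    return ll, (lh, hl, hh)


def haar_idwt2(ll: np.ndarray, details) -> np.ndarray:
    """Invert haar_dwt2 and rebuild the original array."""
    lh, hl, hh = details
    h, w = ll.shape
    out = np.empty((2 * h, 2 * w), dtype=np.float64)
    out[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    out[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    out[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    out[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return out
```

Each sub-band has half the height and width of the input, and the inverse transform shows that no information is lost by the decomposition.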

Results and Discussion
The results of the study on DR classification using the Swin Transformer with WTA have several potential benefits for the research community, clinicians, and healthcare systems. By improving the accuracy of DR diagnosis, deep learning models like the Swin Transformer with WTA have the potential to facilitate earlier detection and treatment, ultimately reducing the burden of disease. Deep learning models can process large amounts of data quickly and accurately, potentially reducing the workload of clinicians and enabling more efficient use of resources. Additionally, the interpretability of the Swin Transformer with WTA may make it easier for clinicians to understand and trust the model's predictions, increasing its utility in clinical practice.
The proposed architecture was developed in Python, and the implementation was specific to central processing units (CPUs). All experiments were conducted on Google Colaboratory (Colab) using a Graphics Processing Unit (GPU) with 15 GB of memory. The use of a GPU allowed for faster computations and improved performance compared to using only a CPU. It also made it possible to train larger models and process larger datasets, which was essential for achieving the research goals. Table 1 summarizes the parameters set for the proposed model.

Datasets
The APTOS 2019 dataset, created by the Asia Pacific Tele-Ophthalmology Society (APTOS) and used in the blindness detection challenge, consists of fundus images of the retina taken under diverse imaging conditions 23. The dataset has been manually classified by specialists into five severity levels of Diabetic Retinopathy (DR): "0" means no DR, "1" means Mild, "2" means Moderate, "3" means Severe, and "4" means Proliferative DR 24,25.
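The five severity grades can be represented as a simple lookup, as in the sketch below. The mapping to a binary label is an assumption: the text does not state how the binary task was derived, and the sketch takes it to be "no DR" (grade 0) versus "any DR" (grades 1 to 4). The names `SEVERITY_LABELS` and `to_binary_label` are hypothetical.

```python
# Severity grades of the APTOS 2019 dataset as listed in the text.
SEVERITY_LABELS = {
    0: "No DR",
    1: "Mild",
    2: "Moderate",
    3: "Severe",
    4: "Proliferative DR",
}


def to_binary_label(grade: int) -> int:
    """Map a five-level grade to 0 (no DR) or 1 (any DR); assumed binary scheme."""
    if grade not in SEVERITY_LABELS:
        raise ValueError(f"unknown severity grade: {grade}")
    return 0 if grade == 0 else 1


binary_labels = [to_binary_label(g) for g in (0, 2, 4, 1, 0)]
```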

Research on the Different Classifications
Binary Classification
The model for DR binary classification with wavelet attention (WTA) using the Swin Transformer (Swin-T and Swin-B) was trained for 100 epochs. During each epoch, the model was trained on the training dataset using the Adam optimizer with a learning rate of 0.001, momentum of 0.9, and weight decay of 0.0005. The learning rate was reduced by a factor of 0.1 after every 30 epochs.
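The step-wise learning-rate decay described above can be written as a small function. This is a sketch under stated assumptions: epochs are counted from 0, and the dictionary keys are illustrative names rather than the authors' configuration. In PyTorch, the same schedule would typically be expressed with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)`, and the reported momentum of 0.9 presumably corresponds to the first Adam beta coefficient.

```python
# Hyperparameters reported in the text; the key names are illustrative.
TRAIN_CONFIG = {
    "epochs": 100,
    "base_lr": 0.001,
    "momentum": 0.9,        # presumably Adam's first beta coefficient (assumption)
    "weight_decay": 0.0005,
    "lr_step_size": 30,     # decay every 30 epochs
    "lr_gamma": 0.1,        # multiply the learning rate by 0.1 at each step
}


def learning_rate(epoch: int,
                  base_lr: float = TRAIN_CONFIG["base_lr"],
                  step_size: int = TRAIN_CONFIG["lr_step_size"],
                  gamma: float = TRAIN_CONFIG["lr_gamma"]) -> float:
    """Learning rate used during `epoch` (0-indexed) under step decay."""
    return base_lr * gamma ** (epoch // step_size)


schedule = [learning_rate(e) for e in range(TRAIN_CONFIG["epochs"])]
```

Over 100 epochs the rate is reduced three times, at epochs 30, 60, and 90.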
To improve the performance of the swin_tiny (Swin-T) and swin_base (Swin-B) models, the WT Attention-Db5 Block was applied. This block applied the db5 wavelet to the pre-processed image resulting from the circular crop, which had a size of 260 x 260. The results of binary classification using this block are presented in Table 2 and Table 3, and the training process is shown in Fig. 6 and Fig. 7.
For binary classification of DR, the Swin-T model achieved an accuracy of 98% on the test dataset, as shown in Table 3, including 97% accuracy for images without DR. The test loss was also low at 0.0102. Similarly, the Swin-B model achieved a test accuracy of 98% with a test loss of 0.0079. This indicates that the model can generalize well and avoid overfitting.
Regarding multiple classification of DR, the Swin-T model's performance varied depending on the severity of the condition, as shown in Table 5. While the model achieved high accuracy (98%) for identifying images without DR, its accuracy dropped significantly as the severity of DR increased, with an average accuracy of 84%. This suggests that the Swin-T model may have limitations in identifying more severe cases of DR.
The Swin-B model for multiple classification of DR, on the other hand, performed reasonably well, with an average accuracy of 86% and a test loss of 0.0327. The model achieved a high accuracy of 97% for identifying images without DR, indicating its ability to accurately classify healthy retinal images. As for future work, the proposed model can be used to detect specific lesions in diabetic retinopathy, such as (AM) and (HE), and it can be applied to other medical image analysis tasks, such as the detection and classification of lesions in other diseases or medical conditions.

Figure 4. WT Attention-Db5 Block. The WT-db5 block performs DWT on the image (im) to obtain a low-frequency component $X_{ll}$ and three high-frequency components $X_{lh}$, $X_{hl}$, and $X_{hh}$, of which $X_{lh}$ and $X_{hl}$ are taken. The low-frequency component contains the basic information of the image and preserves it from damage, while the high-frequency components contain a lot of noise but retain the detailed information of the image. The WT-db5 block can be defined as in Eq. 3:
$$F_{wa} = F\big(X_{ll}, \mathrm{Sm}(\delta_r(X_{lh}, X_{hl}))\big) \qquad (3)$$
The function $F(\cdot,\cdot)$ collects the final features of the image, while $\mathrm{Sm}(\cdot)$ refers to the SoftMax and $\delta_r(\cdot,\cdot)$
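One possible reading of the block in Fig. 4 is sketched below: two high-frequency sub-bands are related to each other, passed through a SoftMax to form attention weights, and used to re-weight the low-frequency sub-band. The concrete choices are assumptions made for illustration, since the caption does not specify them: the relating function is taken to be an element-wise sum, the SoftMax is taken over all spatial positions, and the aggregation is taken to be a residual re-weighting. The function names are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over all elements of `x`."""
    shifted = x - x.max()
    e = np.exp(shifted)
    return e / e.sum()


def wavelet_attention(low: np.ndarray, high_a: np.ndarray, high_b: np.ndarray) -> np.ndarray:
    """Re-weight the low-frequency band with attention built from two high-frequency bands.

    All inputs are (H, W) sub-bands of the same shape.
    """
    related = high_a + high_b          # relating function: element-wise sum (assumption)
    weights = softmax(related)         # SoftMax over spatial positions (assumption)
    return low + low * weights         # aggregation: residual re-weighting (assumption)


# Hypothetical 4 x 4 sub-bands.
low_band = np.ones((4, 4))
detail_a = np.zeros((4, 4))
detail_b = np.zeros((4, 4))
attended = wavelet_attention(low_band, detail_a, detail_b)
```

With the residual form, positions with little high-frequency activity keep their low-frequency value almost unchanged, while positions with strong detail are emphasized.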

Figure 5. Part of the training process implementation for DR binary classification using (a) Swin-T, and (b) Swin-B.

Figure 7. Training and validation over epochs for APTOS 2019 dataset, (a) loss, (b) accuracy, (epochs=100), WT Attention-Db5 Block-Swin-B to binary class.
Multiple Classifications
The model for classifying diabetic retinopathy into multiple categories uses Swin Transformers (Swin-T and Swin-B) with wavelet attention (WTA) for feature extraction. The model was trained on the training dataset for 100 epochs with the Adam optimizer and a learning rate of 0.001. Throughout the training process, both Swin-T and Swin-B models were evaluated for their performance in binary classification of DR.

Figure 8. Part of the training process implementation for DR multiple classification using (a) Swin-T, and (b) Swin-B.
Table 4, Table 5, Fig. 9, and Fig. 10 summarize the performance of the WT Attention-Db5 Block with swin_tiny (Swin-T) and swin_base (Swin-B) for multiple classifications.

Figure 10. Training and validation over epochs for APTOS 2019 dataset, (a) loss, (b) accuracy, (epochs=100), WT Attention-Db5 Block-Swin-B to multi-class.
The proposed Swin Transformer with Wavelet Attention (WTA) achieves superior accuracy and performance compared to conventional deep learning approaches by being more effective, quicker, and more precise. Using the Swin Transformer allows access to non-local information and, combined with other techniques such as wavelet attention, helps extract remote features, leading to increased diagnosis efficiency.
Conclusion
In this paper, a WT-Swin model based on the new WT Attention-Db5 Block was presented for early-stage detection of two and five severity grades of Diabetic Retinopathy. The network was trained on the APTOS 2019 dataset. For binary classification, Swin-T achieved a test accuracy of 98% with a test loss of 0.0102, and Swin-B achieved a test accuracy of 98% with a test loss of 0.0079. For multi-class classification, Swin-T achieved a test accuracy of 84% with a test loss of 0.0243, and Swin-B achieved a test accuracy of 86% with a test loss of 0.0327. The proposed method achieved better performance than computer-assisted diabetic retinopathy detection systems in terms of speed and accuracy, and it could be suitable for use in clinical applications to detect DR. The study on DR classification using the Swin Transformer with WTA has the potential to help researchers uncover critical areas related to the pathophysiology of diabetic retinopathy and the development of new diagnostic and treatment approaches. The study may help researchers reveal new insights into the complex and subtle features of the retina that are associated with diabetic retinopathy. The use of WTA for feature extraction in the Swin Transformer allows for the capture of multi-scale features, which may detect previously unrecognized patterns in the images that are associated with the disease. By analyzing large datasets of retinal images, deep learning models like the Swin Transformer with WTA can identify patterns and features that are associated with different stages and subtypes of the disease. This could ultimately lead to the development of personalized diagnostic and treatment approaches that are tailored to the individual patient.