Enhancing Fuzzy C-Means Clustering with a Novel Standard Deviation Weighted Distance Measure

The aim of this paper is to present a new approach to improving the Fuzzy C-Means (FCM) algorithm, one of the most important and well-known algorithms for handling uncertainty when forming clusters according to their overlap ratios. One of the main problems with this algorithm is its reliance on the Euclidean distance measure, which by nature makes the formed clusters spherical and therefore unable to contain complex or overlapping cases. This paper therefore proposes a new distance measure: a formula for the variance of the fuzzy cluster was derived and entered as a weight in the Euclidean distance formula, giving the Weighted Euclidean Distance (WED). Moreover, the partition matrix was computed using the K-Means algorithm, creating a hybrid environment between the fuzzy algorithm and the crisp algorithm. To verify the proposal, an experimental simulation was carried out, and the method was then applied to real environmental data from the physical and chemical examination of water testing stations in Basra Governorate. The experimental results showed that the proposed WED measure improved the performance of the HFCM algorithm according to the criteria (Obj_Fun, Iteration, Min_optimization, goodness of fit of the clustering, and overlap) when c = 2, 3. According to the simulation results, c = 2 was chosen to form groups for the real data, which gave the best objective function values (23.93, 22.44, 18.83) at fuzziness degrees (1.2, 2, 2.8). At fuzziness degree m = 3.6, the objective function for the Euclidean Distance (ED) was the lowest, but WED was better on the remaining criteria (Iter. = 2, Min_optimization = 0, and $\delta_{XB}$), which confirms that WED is the best overall.


Introduction
Fuzzy algorithms are of significant importance in addressing many phenomena characterized by uncertainty, particularly the outcomes of laboratory testing aimed at determining concentrations of specific compounds. This category of problem frequently entails outcomes that are accompanied by …

Published Online First: February, 2024. https://doi.org/10.21123/bsj.2024.9516. P-ISSN: 2078-8665, E-ISSN: 2411-7986. Baghdad Science Journal.

The advancement of software development plays a crucial part in the enhancement of optimization algorithms. This advancement has resulted in numerous contributions, particularly within the domain of artificial intelligence. Numerous algorithms, including hybrid algorithms, have been developed and proposed on the basis of their ability to tackle complexity and interference, enhance performance, and minimize errors. Consequently, there have been significant advances in this domain. One study put forth a hybrid technique that combines Principal Component Analysis (PCA) and Fuzzy C-Means (FCM) in order to investigate the characteristics of air pollution [1]. Another work developed a model that enhances the outcomes of K-Means clustering by using PCA to process multidimensional data [2]. Researchers calculated the density distributions of unstable, neutron-rich exotic nuclei using binary cluster analysis and concluded that the calculated cross-sections of the nuclear interactions agreed with practical values [3]. A new algorithm was developed by hybridizing the K-Means algorithm with the DBSCAN algorithm; this proposal addressed the problem of determining the number of initial clusters as well as the cluster centers [4]. A novel methodology was introduced for the assessment of torsion in braiding columns during experimental trials. The proposed technique involved the collection of three-dimensional acoustic emission data, which were subsequently analyzed and classified using FCM to identify and categorize instances of damage [5]. Another study examined the monitoring of water quality, which is regarded as a crucial aspect of safeguarding surface water. It focused on water samples collected from the Nile River, with specific emphasis on the drinking water stations (CDWPs) situated in Cairo [6]. One study addressed the problem of medical imaging of cancer by applying genetic algorithms together with fuzzy and crisp clustering algorithms, using an ADF diffusion filter that improved the accuracy of the algorithms' results [7]. A data clustering technique based on the modified Voronoi Fuzzy Algorithm (VFCA) was presented, which divides the sensed area into a number of cells that are grouped using the FCM algorithm. The results showed the efficiency of the modified algorithm compared to traditional clustering algorithms [8]. Another study explained how to improve the security and safety of transmitting information and messages over the Internet by hiding data within clusters using the Least Significant Bit (LSB) method [9]. A hybrid algorithm was proposed to improve the FCM fuzzy clustering algorithm by adopting the Tabu probabilistic heuristic algorithm and finding a global clustering based on the minimum value of the objective function; the results showed the superiority of the Tabu-FCM hybrid algorithm [10]. One study addressed the problem of image retrieval by applying the FCM clustering algorithm to reduce the search space and speed up the retrieval process. The results showed a significant improvement in the accuracy and speed of the retrieval system [11]. A model was proposed that addresses the problem of the fuzzy exponent in the FCM algorithm by building a hybrid model (IT2-FCM-FTS), which improved the results of predictions in fuzzy time series [12]. Binary cluster analysis was used to identify groups of patients at high risk and low risk of contracting Covid-19, based on a set of vital factors. The study concluded that early detection of the disease can reduce cases of severe infection [13].
One study aimed to classify the pollution areas in the Orontes River in Syria into low-pollution and high-pollution areas using hierarchical cluster analysis. The study classified these areas into two clusters and identified the pollutant elements in the water [14]. Another study proposed a technique that processes images so that prominent particles can be collected and separated from the image background, applying treatments that include removing outliers from prominent areas. This technique was applied to six sets of images with complicated backgrounds [15]. Another work addressed the problem of distributing health human resources to healthcare sites and hospitals in Jakarta, Indonesia, by adopting the FCM and K-Means algorithms, through which three clusters of hospitals were found [16]. The contribution of this paper is a new approach to forming clusters and improving the performance of the Fuzzy C-Means clustering algorithm. A formula for the variance of the fuzzy clusters is derived and used as a weight in the distance measure, and a hybrid scenario is then developed in which the partition (membership degree) matrix of the FCM algorithm is based on the K-Means algorithm. Finally, this methodology was applied to the water sector by monitoring the levels of salt concentrations in the waters of Basra Governorate, Iraq. This phenomenon is linked to an important goal among the sustainability goals set by the United Nations (UN) for the years (2017-2021), namely Goal 6 (clean water and sanitation), noting that this phenomenon also negatively affects Goal 14 (life below water).

K-Means Algorithm
This algorithm was proposed by Hartigan (1957). It divides a number of solutions (S) that have (q) dimensions into K homogeneous clusters [17]. The K-Means clustering method is considered one of the simple traditional methods, in which clusters are formed after pre-determining the number of clusters (K).

Then apply steps 3-5 and stop when the condition in step 5 is true.
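The individual K-Means steps are not fully preserved in this section, so the following is only a minimal sketch of the textbook procedure (random initial centers, assignment by Euclidean distance, center update, stop when the centers no longer change). The function name `kmeans` and its parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Textbook K-Means sketch (assumed, not the paper's code).

    X is an (n x q) array; returns (centers, labels).
    """
    rng = np.random.default_rng(seed)
    # Pick k distinct observations as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Euclidean distance from every observation to every center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        # Stop when the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

Calling `kmeans(X, 2)` on an array with one row per observation returns the two centers and a crisp label for each row.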

Fuzzy Clustering
Fuzzy Logic

Fuzzy logic is an expression of the state of uncertainty that simulates and addresses problems of complexity and interference in data, modeling and measurement errors, etc. The foundation of this logic was laid by the Iranian scientist Zadeh in 1965, and the approach achieved great progress with the development of programming and artificial intelligence algorithms [20]. This logic rests on two basic principles: the availability of membership functions $\mu_A$ and the degree to which each element belongs to the comprehensive set ($X$), which is expressed as: $\mu_A : X \to [0, 1],\; x \in X$. Therefore, a fuzzy set can be defined as a set of ordered pairs of elements ($x$) that belong to the comprehensive set, where the membership function determines the degree to which these elements belong to the fuzzy set ($A$) [21]: $A = \{(x, \mu_A(x)) ;\; x \in X\}$

Fuzzy C-Means Clustering
The complexity and overlap found in many multi-dimensional data sets make traditional clustering algorithms ineffective, so it was necessary to develop clustering techniques that can deal with the overlap between clusters and form homogeneous groups according to fuzzy logic. The first algorithm to address this overlap was Fuzzy C-Means (FCM) [22], which aims to collect large data into new, more homogeneous groups by determining the degrees of membership of the targeted cases. The algorithm was developed by Dunn and Bezdek in 1974 by extending the partition matrix of the K-Means clustering algorithm with a degree of fuzziness. The resulting objective function, which represents the minimization of the sum of squared errors [24,25], is given by the following equation [23]:

$$\mathrm{Obj\_Fun}(P, C, X) = \sum_{i=1}^{n} \sum_{k=1}^{c} p_{ik}^{m}\, d(x_{ij}, c_{kj}) \qquad (3)$$

Where:
$P$: a ($k \times n$) matrix that holds the membership degree of each element within the clusters.
$m$: the fuzziness coefficient (fuzziness exponent), whose value satisfies $1 < m < \infty$.
$x_{ij}$: the value of observation $i$ at dimension $j$.
$c_{kj}$: the center of cluster $k$ at dimension $j$.
$d(x_{ij}; c_{kj})$: a measure of similarity or difference of the observation value $x_{ij}$ from its cluster center $c_{kj}$.
To minimize the objective function, both the center of the fuzzy cluster $c_{kj}$ and the partition matrix, which contains the membership degree $p_{ik}$ of each case, must be determined for a given degree of fuzziness ($m$) and number of clusters ($c$). Eq. 4 gives the estimate of the cluster center [26]. The cluster center is weighted by the membership degree ($p_{ik}$), which in turn depends on the measure of similarity or dissimilarity, and the membership degrees of the partition matrix are estimated according to Eq. 5 [27], where:
$h_k$: the ratio of the sampling size, i.e. the number of elements in each cluster ($n_k$) to the total number of elements ($n$), subject to the condition $\sum_{k=1}^{c} h_k = 1$.
$d(x_{ij}, c_{kj})$: a measure of similarity or difference (distance measure).
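For orientation only, the textbook FCM updates are sketched below, assuming $d$ is the Euclidean distance and the objective uses $d^2$. These are the standard forms and are not necessarily identical to the paper's Eq. 4 and Eq. 5, which are not reproduced in this section; in particular, the standard membership update does not contain the cluster-size ratio $h_k$.

```latex
% Standard FCM updates (assumed textbook form, not the paper's Eq. 4-5)
c_{kj} = \frac{\sum_{i=1}^{n} p_{ik}^{m}\, x_{ij}}{\sum_{i=1}^{n} p_{ik}^{m}},
\qquad
p_{ik} = \left[ \sum_{l=1}^{c}
  \left( \frac{d(x_i, c_k)}{d(x_i, c_l)} \right)^{\frac{2}{m-1}} \right]^{-1}
```

In this form, each center is a membership-weighted mean of the observations, and the memberships of every observation sum to 1 over the $c$ clusters.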

Measure of Distance
The distance or similarity measure $d(x_{ij}, c_{kj})$ is based on forming a similarity matrix for ($n$) cases and ($q$) variables. The degree of closeness between the points of each variable can thus be determined across the cases to form the proximity matrix, which is the basis for measuring the distance from the cluster centers [28, 29]. Given a vector of variables $X = [x_1\; x_2\; \dots\; x_q]$ and an information matrix of dimension ($n \times q$), the proximity matrix is then formed [30]. There are several types of similarity and difference metrics for determining distance. Two traditional distance metrics are considered here, the Euclidean Distance (ED) and the Square Euclidean Distance (SED), together with two proposed measures, the Weighted Euclidean Distance (WED) and the Weighted Square Euclidean Distance (WSED), which are explained later. In the traditional measures, the weight is equal to 1.
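As a minimal sketch of the two traditional measures, the function below computes the ED or SED from every observation to every cluster center, with an optional per-dimension weight that defaults to 1. The function name `distances` and its arguments are illustrative assumptions.

```python
import numpy as np

def distances(X, centers, squared=False, weights=None):
    """Distance from each row of X (n x q) to each center (c x q).

    Returns an (n x c) matrix. With weights=None every dimension has
    weight 1, which gives the plain ED (squared=False) or SED (squared=True).
    """
    diff = X[:, None, :] - centers[None, :, :]      # (n x c x q)
    if weights is None:
        weights = np.ones(X.shape[1])
    sq = (weights * diff ** 2).sum(axis=2)          # weighted squared distance
    return sq if squared else np.sqrt(sq)
```

For example, the point (3, 4) lies at ED 5 and SED 25 from a center at the origin.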

Proposed Weighted Distance
The distance measure in the Fuzzy C-Means algorithm was developed by deriving an approximate formula for the variance of the fuzzy cluster. Starting from Eq. 3, the researchers derived this formula and obtained the weighted form of the distance measure. To achieve this, the derivative of the objective function $\mathrm{Obj\_Fun}(P, C, X)$ was found (Eq. 9). The final form of the derivative approaches the formula of the variance weighted by the membership degree and the fuzziness exponent (Eq. 10). Eq. 7 and Eq. 8 can therefore be interpreted as the distance weighted by the inverse of the standard deviation of the fuzzy cluster.
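Eq. 9 and Eq. 10 are not preserved in this section, so the sketch below only illustrates the idea under an assumed form: a membership-weighted variance per cluster and dimension, $\sigma_{kj}^2 = \sum_i p_{ik}^m (x_{ij} - c_{kj})^2 / \sum_i p_{ik}^m$, whose square root divides the Euclidean difference. The names `fuzzy_std` and `wed`, the ($n \times c$) orientation of `P`, and the small `eps` guard are assumptions, not the authors' exact formula.

```python
import numpy as np

def fuzzy_std(X, centers, P, m):
    """Membership-weighted standard deviation, shape (c x q).

    Assumed form: sqrt(sum_i p_ik^m (x_ij - c_kj)^2 / sum_i p_ik^m).
    P is (n x c) here, i.e. the transpose of a (k x n) partition matrix.
    """
    W = P ** m                                              # (n x c)
    diff2 = (X[:, None, :] - centers[None, :, :]) ** 2      # (n x c x q)
    var = (W[:, :, None] * diff2).sum(axis=0) / W.sum(axis=0)[:, None]
    return np.sqrt(var)

def wed(X, centers, P, m, eps=1e-12):
    """Euclidean distance weighted by the inverse fuzzy standard deviation."""
    sigma = fuzzy_std(X, centers, P, m) + eps               # avoid division by 0
    z = (X[:, None, :] - centers[None, :, :]) / sigma[None, :, :]
    return np.sqrt((z ** 2).sum(axis=2))                    # (n x c)
```

Dividing by the cluster's own spread lets an elongated cluster reach further along its long axis, which is one way to relax the spherical shape imposed by the plain Euclidean distance.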

Developed Fuzzy C-means Algorithm
The FCM clustering algorithm is developed in two steps.

Step 1: Calculate the membership degree matrix from the original data. This step is the initialization stage for the initial cluster centers and adopts the K-Means algorithm:

1. Determining the number of clusters.
2. Determining the initial cluster centers randomly.
3. Calculating the distance according to formula 6.
5. Returning to step 2 and obtaining the cluster centers according to the clusters achieved in step (4).
12. If the condition is true, the algorithm stops. If the condition is not true, the ($P$) matrix is updated according to Eq. 13.
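The two-stage idea (crisp K-Means initialization followed by fuzzy iterations that stop when the change in the objective function falls below a tolerance) can be sketched as follows. This is a simplified illustration that uses the standard FCM updates with the plain Euclidean distance and a squared-distance objective; the function names, the tolerance `tol`, and the ($n \times c$) orientation of `P` are assumptions, and the paper's Eq. 13 update may differ.

```python
import numpy as np

def euclid(X, C):
    """(n x c) Euclidean distances between rows of X and centers C."""
    return np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)

def memberships(D, m, eps=1e-10):
    """Standard FCM membership update from an (n x c) distance matrix."""
    inv = (D + eps) ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def kmeans_centers(X, c, iters=50, seed=0):
    """Crisp K-Means used only to supply the starting centers."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=c, replace=False)].astype(float)
    for _ in range(iters):
        labels = euclid(X, C).argmin(axis=1)
        for k in range(c):
            if np.any(labels == k):
                C[k] = X[labels == k].mean(axis=0)
    return C

def hybrid_fcm(X, c, m=2.0, tol=1e-5, max_iter=100, seed=0):
    # Stage 1: K-Means centers and the initial membership matrix.
    C = kmeans_centers(X, c, seed=seed)
    P = memberships(euclid(X, C), m)
    prev = np.inf
    for it in range(1, max_iter + 1):
        # Stage 2: objective, stopping rule, then center and membership update.
        W = P ** m
        obj = float((W * euclid(X, C) ** 2).sum())
        if abs(prev - obj) < tol:
            break
        prev = obj
        C = (W.T @ X) / W.sum(axis=0)[:, None]
        P = memberships(euclid(X, C), m)
    return C, P, obj, it
```

A call such as `hybrid_fcm(X, 2, m=1.2)` returns the centers, the membership matrix, the final objective value, and the number of iterations used. To try a weighted measure instead, `euclid` would be replaced by a distance function that takes the current memberships into account.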

Identification of the Validity of Fuzzy Cluster Xie-Beni Criteria (δ 𝑋𝐵 )
The cluster validity criterion ($\delta_{XB}$) verifies the validity of the cluster structure based on the objective function and the fuzziness exponent. Accordingly, it is considered better than the partition coefficient criterion, which depends only on the membership degrees of the partition matrix. The best cluster structure is the one that achieves the lowest value of this criterion, which is calculated according to the Xie-Beni formula.
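The paper's formula is not preserved in this section. As a hedged illustration, the commonly used form of the Xie-Beni index divides the membership-weighted within-cluster squared distance by $n$ times the smallest squared distance between two centers. The function name `xie_beni` and the ($n \times c$) orientation of `P` are assumptions.

```python
import numpy as np

def xie_beni(X, centers, P, m=2.0):
    """Xie-Beni validity index in its commonly used form (lower is better).

    X is (n x q), centers is (c x q) with c >= 2, P is (n x c).
    """
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # (n x c)
    compactness = ((P ** m) * d2).sum()
    c = len(centers)
    # Smallest squared distance between any two distinct centers.
    separation = min(((centers[a] - centers[b]) ** 2).sum()
                     for a in range(c) for b in range(a + 1, c))
    return compactness / (len(X) * separation)
```

Compact clusters give a small numerator and well-separated centers give a large denominator, so a lower value indicates a better cluster structure.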

Results and Discussion
The results will be discussed in two directions: the first is the experimental aspect (simulation) and the second is the applied aspect, which includes water sector data and its study through physical and chemical examination data, which represents one of the dimensions of the water sustainability goals.
They will be presented and their results analyzed as follows:

Simulation Aspect
The simulation method is based on studying different and complex phenomena and on testing the proposed mathematical formulas and developed algorithms in order to demonstrate their suitability and conformity with reality. The experimental aspect was therefore built on the following settings:

- Determining the sizes of the experimental samples (n = 20, 50, 200).
- Determining the dimensions of the variables (q = 8, 15, 20).
- Determining the degrees of fuzziness (degree of overlap) m = (1.2, 2, 2.8, 3.6).
- Generating random variables assuming a uniform distribution using the RAND command.
- Converting the generated random variables to a normal distribution using the RANDN directive.
- Repeating the experiment Iter = 100 times to achieve improvement.

Applying the above steps in Matlab V.2023 gave the following simulation results.

The results of Table 1 showed the superiority of the fuzzy cluster algorithm using the Weighted Euclidean Distance (WED) measure for all the specified experimental cases and all degrees of fuzziness. A large improvement was observed in the two-cluster case (c = 2), where the algorithm was able to stop at Iter = 2 with the minimum error improvement Min_Opt. = 0. In addition, it achieved the lowest objective function Obj_Fun and the best fuzzy cluster structure according to the $\delta_{XB}$ criterion, for which the lowest value is preferred. It also achieved the best results when c = 3 in terms of Obj_Fun and the validity of the fuzzy cluster structure according to the $\delta_{XB}$ criterion.

The results of Table 2 showed the superiority of the fuzzy cluster algorithm using the WED measure for all the specified experimental cases and all degrees of fuzziness. A significant improvement was shown in the two-cluster case, where the algorithm was able to stop at Iter = 2 with the minimum error improvement Min_Opt. = 0. In addition, it achieved the lowest Obj_Fun and the best validity of the fuzzy cluster structure according to the $\delta_{XB}$ criterion. It also achieved the best results when c = 3 in terms of Obj_Fun and the validity of the fuzzy cluster structure according to the $\delta_{XB}$ criterion, except for one case in which FCM(ED) had the best Obj_Fun at m = 3.6, although the cluster structure according to FCM(WED) was still the best.

The results of Table 3 showed the superiority of the fuzzy cluster algorithm using the WED measure for all the specified experimental cases and all fuzziness exponents. A significant improvement was shown in the two-cluster case, where the algorithm was able to stop at Iter = 2 with the minimum error improvement Min_Opt. = 0. In addition, it achieved the lowest Obj_Fun and the best validity of the fuzzy cluster structure according to the $\delta_{XB}$ criterion. It also achieved the …

The results indicated that the weighted methods using the standard deviation of the clusters contributed significantly to smoothing the data, improving the analysis results, and forming clusters that are able to contain the cases. In addition, the simulation results showed that the fuzzy clustering with c = 2 is the best because it achieved the best results in terms of Iter. and the minimum amount of improvement Min_Opt.
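The experimental grid described above can be laid out as in the sketch below. It enumerates every combination of sample size, dimension, and fuzziness exponent and draws one data set per combination. Drawing directly from a standard normal distribution is an assumption standing in for the RAND/RANDN steps, and the constant and function names are illustrative.

```python
import itertools
import numpy as np

SIZES = (20, 50, 200)               # sample sizes n
DIMS = (8, 15, 20)                  # numbers of variables q
FUZZIFIERS = (1.2, 2.0, 2.8, 3.6)   # fuzziness exponents m

def simulate_design(seed=0):
    """Yield (n, q, m, X) for every cell of the simulation grid.

    X is an (n x q) array of standard normal draws (an assumed stand-in
    for the paper's Matlab data generation).
    """
    rng = np.random.default_rng(seed)
    for n, q, m in itertools.product(SIZES, DIMS, FUZZIFIERS):
        yield n, q, m, rng.standard_normal((n, q))
```

Each yielded data set would then be clustered with every distance measure for c = 2 and c = 3, with the criteria recorded over the repeated runs.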

Practical Aspect
In order to make use of the HFCM algorithm and verify the efficiency of the proposals, the algorithms were applied to the data of physical and chemical tests from the water testing stations in Basra Governorate. There are 8 stations: the five Shatt al-Arab stations, the Qurna station on the Tigris River (T34), and the city's two stations on the Euphrates River (E20, E21). This sector is considered one of the important sectors that affect aquatic and human life as well as production sectors and industries of various kinds. It is also covered by the sustainability goals presented by the United Nations in its report (2017-2021), which included Goal 6 (clean water and sanitation) among 17 goals. Data were collected from the eight water testing stations on a monthly basis for the years (2010-2021).
The results of Table 4 show that the proposed Euclidean distance measure weighted by the fuzzy cluster standard deviation (WED) achieved the best results compared to the traditional distance measures (SED, ED) according to the adopted criteria, as explained below:

When c = 2, m = 1.2:
The WED measure achieved the best results. The value of the objective function Obj_Fun, which represents the amount of error (23.93), was the lowest compared to the other measures. In addition, the number of iterations was the best (Iter. = 2) and the minimum error improvement was equal to 0. The validity criterion of the fuzzy cluster structure was also the best, since WED achieved the lowest value of $\delta_{XB}$. This is evidence that the WED measure improved the work of the FCM algorithm and reached the best results with the fewest iterations compared to the other measures. The importance of each measure can also be assessed with the AverageMax criterion, which determines the degree of overlap between clusters (see Fig 1).

When c = 2, m = 2.8:
The WED measure achieved the best results. The value of Obj_Fun, which represents the amount of error (18.83), was the lowest compared to the other measures. In addition, the number of iterations was the best (Iter. = 2) and the minimum error improvement was equal to 0. The validity criterion of the fuzzy cluster structure was also the best, since WED achieved the lowest value of $\delta_{XB}$. This is evidence that the WED measure improved the work of the HFCM algorithm and reached the best results with the fewest iterations compared to the other measures. Fig 3 and the results of the AverageMax criterion show that the fuzzy clustering produced by the HFCM algorithm is appropriate at the fuzziness degree m = 2.8 for the ED and WED metrics, which achieved 0.65 and 0.75, respectively. For the two measures SED and WSED, the fuzzy clustering was effectively crisp, with an AverageMax value higher than 90%.

When c = 2, m = 3.6:
The results showed that the ED measure achieved the lowest value of the objective function Obj_Fun, which represents the amount of error (11.01), compared to the other measures, but the WED measure was the best with respect to the (Iter., Min_Opt., $\delta_{XB}$) criteria. Preference can therefore go to the WED measure, as it achieved the best results on most criteria and thus improved the work of the HFCM algorithm.
As is clear from Fig 4 and the AverageMax criterion, the fuzzy clustering produced by the HFCM algorithm was adequate for both ED and WED, but the degree of overlap was lower for the WED measure, which achieved 0.71. This is evidence that WED suffers from less uncertainty than ED, which achieved 0.60, as is clear from the many cases of overlap under ED. The two measures SED and WSED achieved AverageMax values above 85%, and the figure shows that the fuzzy clustering tends towards a crisp clustering.

Conclusion
Based on the analysis of the experimental findings and the observed impact of the Weighted Euclidean Distance (WED) measure on the performance of the Fuzzy C-Means clustering algorithm, it can be inferred that using WED improves the accuracy and validity of the resulting cluster structure, particularly when the number of clusters (c) is set to 2. The Euclidean Distance (ED) measure ranked second. Hence, the researchers propose using these metrics within the fuzzy clustering technique to ascertain their significance in the domain of clustering. This approach aims to establish a hybrid framework that combines the fuzzy clustering algorithm with artificial intelligence algorithms, thereby reducing errors, enhancing algorithmic performance, and facilitating its application across diverse domains.

6. Forming new clusters.
7. Applying the membership function according to formula 5 to obtain the partition matrix ($P$).

The second step: Hyper FCM algorithm:

8. Entering the centers of the clusters resulting from step 5.
9. Entering the matrix of initial membership degrees resulting from step 7.
10. Calculating the objective function according to formula 3.
11. Checking the condition $|\mathrm{Obj\_Fun}^{(t+1)} - \mathrm{Obj\_Fun}^{(t)}| < \varepsilon$.

Fig 1 shows that the FCM algorithm does not show overlap and that the fuzzy clustering tends towards a crisp clustering, since AverageMax was above 90% for all metrics.

Figure 1. Overlap of the stations between the clusters when the fuzziness exponent is 1.2.

Figure 2. Overlap of the stations between the clusters when the fuzziness exponent is 2.

Figure 3. Overlap of the stations between the clusters when the fuzziness exponent is 2.8.

Figure 4. Overlap of the stations between the clusters when the fuzziness exponent is 3.6.