Measuring Positive and Negative Association of Apriori Algorithm with Cosine Correlation Analysis

This work examines positive and negative association rules in the Apriori algorithm by using cosine correlation analysis. The default and the modified Association Rule Mining algorithms are run against the mushroom database to compare their results. The experimental results show that the modified Association Rule Mining algorithm can generate negative association rules. Adding cosine correlation analysis returns fewer association rules than the default Association Rule Mining algorithm. The top ten association rules differ between the default and the modified Apriori algorithm. The positive and negative association rules obtained strengthen each other with a fairly good confidence score.

Association rule analysis is used to obtain association rules that often appear in a dataset (1). The default Association Rule Mining (ARM) finds association rules between items that exist in a transaction, called positive association rules (PAR). Negative association rules (NAR), which show the association between low-frequency itemsets, are also essential to analyze. It is argued that, with proper analysis, NAR will strengthen the positive association rules and therefore add more advantages to ARM.
Support and confidence scores are used to measure the certainty and usability of a rule. These measurements are not sufficient, because confidence is only an estimate of the conditional probability of two or more itemsets (2); they cannot measure the correlation between two itemsets. This may produce misleading association analysis results, which can cause more significant problems later in extended applications.
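As a small illustration of why confidence alone can mislead, the sketch below uses ten made-up transactions (the item names and counts are hypothetical and do not come from the mushroom data). The rule tea ⇒ coffee reaches a confidence of 0.75, yet coffee appears in 90% of all transactions, so the presence of tea actually lowers the chance of coffee.

```python
# Hypothetical toy transactions, for illustration only.
transactions = (
    [{"tea", "coffee"}] * 3   # both items
    + [{"tea"}] * 1           # tea only
    + [{"coffee"}] * 6        # coffee only
)

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

conf_rule = confidence({"tea"}, {"coffee"}, transactions)
base_rate = support({"coffee"}, transactions)

print(f"confidence(tea => coffee) = {conf_rule:.2f}")
print(f"support(coffee)           = {base_rate:.2f}")
```

A rule like this passes a typical confidence threshold even though the two items are negatively related, which is the kind of case a correlation measure is meant to catch.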
This weakness can be handled by adding a correlation analysis to ARM (3). Correlation analysis shows the relationship between two or more variables (4). Combining correlation analysis with ARM will increase its quality by producing rules whose itemsets are better correlated.
Previous studies related to this research concern only positive association rules. Some of them are followed by cosine correlation analysis as well (5). There were also some studies that produced PAR and NAR without any correlation analysis (6,7). The reason for using cosine correlation analysis is that it has a null-invariant characteristic, which is useful in analyzing vast data (8). Another work indicated that negative correlation was found but ignored in mining the personality of students (9). In earlier work, the use of negative correlation in process mining was useful to detect fraud (10).
Therefore, this work studies the measuring of PAR and NAR by using the Apriori algorithm on the support-confidence framework with cosine correlation analysis. Apriori is chosen for modification because it is the oldest and best-known association rule algorithm. The algorithm is simple, yet it has been proven to be customizable for solving problems. Recent work has used it to help semantic maps illustrate the imbalance of implications between functions (11). Apriori has also helped to handle rules that are defined by examining only two τ-dependent tables, by implementing an Apriori-based approach for nondeterministic information systems (12). Road accidents in Dubai have been analyzed successfully with Apriori (13). The investigation of contact patterns can be used to provide optimal node selection in IoT networks (14). Therefore, if the modification of the Apriori algorithm returns significant results, it can be assumed that the modification is a promising approach. The advantages of this research are: (i) enriching data mining methods, especially the ARM method with the Apriori algorithm; (ii) improving association rule quality by producing PAR and NAR followed by cosine correlation analysis.

Related Work:
Negative ARM was usually ignored and has attracted interest only recently (15). The idea of mining negative correlation in association rules aims to reduce the number of uninteresting rules and to obtain more interesting ones (16). One work (17) proposed an approach, based on a multi-objective evolutionary algorithm, to reduce the set of PAR and NAR at a low computational cost, which matters especially in vast datasets. The concern with mining negative association is also shown by (9). In the market-basket analysis domain, negative association rules are also needed to identify product patterns, namely whether products conflict with or complement each other. They also propose an approach that reduces database scans. The work in (18) used an extension of the Apriori algorithm in the spatial domain. The spatial mining generates positive and negative frequent itemsets, which are applied at a temporal bar in a specific situation. The results surpass those that do not find the negative correlation. Another work provides the mining of positive and negative rules as a service (19); it works on XML data. A further work (20) is concerned with mining the negative correlation in infrequent itemsets.
In the health domain, negative association combined with regression is useful in determining depressive symptoms in the middle-aged and elderly (21). Negative correlation is also needed in existing biclustering algorithms for microarrays, for which NBic-ARM was proposed (22). Constraints have been added to the discovery of negative association rules in several domains (23). In all of the related work, the measure of negative correlation has not been investigated. Therefore, this work also proposes cosine correlation to measure the negative correlation.
On the theoretical side, some approaches have also shown that negative correlation improves the performance of a specific method. One method combines negative correlation learning (NCL) with a one-dimensional convolutional neural network (1-dim CNN) to solve the insufficient data problem (24). Negative correlations also helped to relax the variance property in Mixed Generalized Ordered Response (MGOR) models (25). A modified neural network algorithm with adaptive negative correlation (NEA_ANCL) has shown evolutionary results (26). Similar work has also shown that a negative correlation learning neural network combined with particle swarm optimization can solve the denoising problem in the wavelet analysis technique (WAT) (27).

The Approach:
A. Positive Association Rules (PAR) and Negative Association Rules (NAR)
PAR (A ⇒ B) refers to an association rule between items that exist in a transaction, for example a rule that shows which items will be bought together by a customer. The characteristics of PAR (20) are expressed in terms of minimum support (Minsup) and minimum confidence (Minconf).

B. Correlation
Correlation is a statistical measurement of the relationship between two or more variables (21). The result of a correlation analysis shows how strong that relationship is. A correlation rule is valid if its correlation value meets the required minimum correlation value. There are several correlation measures; one of them is cosine correlation. Given two itemsets A and B, the cosine value is computed as follows:

$$\mathrm{cosine}(A, B) = \frac{P(A \cup B)}{\sqrt{P(A)\,P(B)}} = \frac{\mathrm{sup}(A \cup B)}{\sqrt{\mathrm{sup}(A)\,\mathrm{sup}(B)}}$$

The cosine value ranges from 0 to 1. The larger the value, the more transactions contain both itemsets A and B; the smaller the value, the fewer transactions contain both (22). The cosine value is a good correlation indicator because it is null-invariant, which means that null-transactions do not affect the correlation measurement. A null-transaction is a transaction that does not contain any of the tested itemsets, and null-invariance is an important property for measuring correlation in a huge transactional database (9).
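A minimal Python sketch of the cosine measure and its null-invariance, assuming the usual definition cosine(A, B) = sup(A ∪ B) / sqrt(sup(A) × sup(B)). The toy transactions and the item names "x", "y" and "z" are hypothetical. Appending 1000 null-transactions (containing neither x nor y) leaves the cosine value unchanged.

```python
from math import sqrt

def support_count(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)

def cosine(a, b, transactions):
    """cosine(A, B) = sup(A u B) / sqrt(sup(A) * sup(B)).

    Absolute counts are used instead of relative supports: the total
    number of transactions cancels out, which is why null-transactions
    have no effect on the result.
    """
    sup_a = support_count(a, transactions)
    sup_b = support_count(b, transactions)
    if sup_a == 0 or sup_b == 0:
        return 0.0
    sup_ab = support_count(set(a) | set(b), transactions)
    return sup_ab / sqrt(sup_a * sup_b)

# Hypothetical toy data: 4 transactions with both items, 1 with x only, 1 with y only.
base = [{"x", "y"}] * 4 + [{"x"}] * 1 + [{"y"}] * 1
# Same data plus 1000 null-transactions (neither x nor y).
with_nulls = base + [{"z"}] * 1000

c_base = cosine({"x"}, {"y"}, base)
c_nulls = cosine({"x"}, {"y"}, with_nulls)
print(c_base, c_nulls)
```

Both calls evaluate 4 / sqrt(5 × 5), so the two values coincide regardless of how many null-transactions are added.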
C. Apriori Algorithm Enriched with Negative Correlation
In this work, the Apriori algorithm is the main method to be modified. It is enriched with negative association rules and with cosine correlation analysis for each itemset whose support value is eligible. The approach proceeds as follows:
1. Generate k-itemsets derived from (k-1)-itemsets.
2. Count the support value of each itemset.
   a. For itemsets with support ≥ Minsup, compute the cosine value.
      i. If cosine ≥ Mincos, generate the 2^k - 2 distinct association patterns (antecedent ⇒ consequent).
      ii. Each rule with confidence ≥ Minconf is included in PAR.
      iii. PAR is a positive association rule that meets all requirements on support, confidence, and cosine.
   b. For itemsets with support < Minsup, generate the 2^k - 2 distinct association patterns.
      i. Each pattern is used to generate ¬A ⇒ B (Antecedent Negative Rule/ANR) and A ⇒ ¬B (Consequent Negative Rule/CNR).
      ii. For every form with support ≥ Minsup, compute the cosine.
      iii. For every form with cosine ≥ Mincos, compute the confidence.
      iv. If its confidence ≥ Minconf, it is included in NAR (which consists of ANR and CNR). NAR is a negative association rule that meets all requirements on support, confidence, and cosine.
Note: minimum cosine (Mincos).
The default Apriori algorithm considers only the support value and the confidence score. The proposed method enriches it with cosine correlation analysis, applied after the support calculation to itemsets with support ≥ Minsup (the step marked "*" in the pseudocode of the modified Apriori algorithm). Any itemset that satisfies the Mincos threshold is then evaluated for its confidence value. Association rules with confidence ≥ Minconf are valid positive association rules.
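The steps of the approach can be sketched in Python as follows. This is an illustrative reading of the steps, not the authors' implementation, and it rests on several assumptions: the cosine is computed per antecedent/consequent pair; ¬A matches a transaction that does not contain all items of A; candidates are joined from frequent (k-1)-itemsets without subset pruning; and the function names and the toy items "p", "n" and "w" are hypothetical.

```python
from itertools import combinations
from math import sqrt

def frac(transactions, *sides):
    """Fraction of transactions matching every (itemset, negated) side.

    A negated side matches when the transaction does NOT contain the itemset.
    """
    hits = sum(all((s <= t) != neg for s, neg in sides) for t in transactions)
    return hits / len(transactions)

def check(transactions, a, b, minsup, mincos, minconf):
    """Apply the support, cosine and confidence thresholds, in that order."""
    sup_ab = frac(transactions, a, b)
    sup_a, sup_b = frac(transactions, a), frac(transactions, b)
    if sup_ab < minsup or sup_a == 0 or sup_b == 0:
        return None
    cos = sup_ab / sqrt(sup_a * sup_b)
    if cos < mincos:
        return None
    conf = sup_ab / sup_a
    if conf < minconf:
        return None
    return {"sup": sup_ab, "cos": cos, "conf": conf}

def splits(itemset):
    """Yield all 2^k - 2 (antecedent, consequent) splits of a k-itemset."""
    items = sorted(itemset)
    for r in range(1, len(items)):
        for ante in combinations(items, r):
            yield frozenset(ante), itemset - frozenset(ante)

def mine(transactions, minsup, mincos, minconf):
    """Return (PAR, ANR, CNR) as lists of (A, B, measures) tuples."""
    transactions = [frozenset(t) for t in transactions]
    th = (minsup, mincos, minconf)
    par, anr, cnr = [], [], []
    singles = {frozenset([i]) for t in transactions for i in t}
    frequent = {s for s in singles if frac(transactions, (s, False)) >= minsup}
    k = 2
    while frequent:
        # Step 1: join frequent (k-1)-itemsets into k-item candidates.
        candidates = {x | y for x in frequent for y in frequent if len(x | y) == k}
        frequent = set()
        for c in candidates:
            # Step 2: count the support of the candidate itemset.
            is_frequent = frac(transactions, (c, False)) >= minsup
            if is_frequent:
                frequent.add(c)
            for a, b in splits(c):
                if is_frequent:
                    # Step 2a: positive rule A => B.
                    m = check(transactions, (a, False), (b, False), *th)
                    if m:
                        par.append((a, b, m))
                else:
                    # Step 2b: not-A => B (ANR) and A => not-B (CNR).
                    m = check(transactions, (a, True), (b, False), *th)
                    if m:
                        anr.append((a, b, m))
                    m = check(transactions, (a, False), (b, True), *th)
                    if m:
                        cnr.append((a, b, m))
        k += 1
    return par, anr, cnr

# Hypothetical toy data: p and n rarely co-occur, w is common.
toy = [{"p", "w"}] * 5 + [{"n", "w"}] * 4 + [{"p", "n"}]
par, anr, cnr = mine(toy, minsup=0.3, mincos=0.5, minconf=0.5)
for label, rules in (("PAR", par), ("ANR", anr), ("CNR", cnr)):
    print(label, len(rules))
```

In the toy data the itemset {p, n} is infrequent, so it only contributes negative rules such as ¬p ⇒ n and n ⇒ ¬p, while the frequent itemset {p, w} contributes positive rules. Each rule is stored with its antecedent A and consequent B unnegated; the list it belongs to tells which side is negated.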

The NAR mining is likewise enriched with cosine correlation analysis. The default Apriori algorithm does not produce negative association rules. Therefore, this research uses the itemsets that do not meet the minimum support to develop negative association rules.

Materials and Methods: Dataset and the experimental settings
Because adding negative correlation is a new idea, the experiments focus on real data rather than on theoretical aspects. Their main aim is to verify that negative correlation makes the obtained rules stronger than they are without it. The experiments therefore cover only a set of standard variations. Future work can move to more comprehensive experiments that consider more aspects.
The experiments extract PAR and NAR in the support-confidence framework with cosine correlation analysis, using the Apriori algorithm for frequent itemset mining. PAR mining is also performed with the default ARM algorithm, again using Apriori for frequent itemset mining on the support-confidence framework. The data were prepared by Roberto Bayardo from the UCI datasets. The domain is mushroom data, which can be accessed in this repository (http://fimi.ua.ac.be/data/). There are 8124 records, each with 23 fields. The whole dataset comprises over 119 different items.
The experimental scenarios vary the minimum scores of support, confidence, and cosine. Minsup is 30%, 40%, and 50%. Minconf is 50%, 60%, 70%, and 80%. Mincos is 50%, 60%, 70%, and 80%. The 1st scenario uses Minsup 30% and Minconf 50%, the 2nd scenario uses Minsup 30% and Minconf 60%, and the subsequent scenarios are the remaining combinations of those scores, in the same order.
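The scenario grid can be enumerated with a short Python sketch. It assumes that the scenario numbers follow the Cartesian product of Minsup and Minconf in the order given, and that each scenario is run once per Mincos value; both the numbering rule and the variable names are assumptions made for illustration.

```python
from itertools import product

MINSUP = (0.30, 0.40, 0.50)
MINCONF = (0.50, 0.60, 0.70, 0.80)
MINCOS = (0.50, 0.60, 0.70, 0.80)

# Scenario number -> (Minsup, Minconf); numbering assumed to follow product order.
scenarios = dict(enumerate(product(MINSUP, MINCONF), start=1))

# Each scenario is assumed to be repeated for every Mincos value.
runs = [
    (number, minsup, minconf, mincos)
    for number, (minsup, minconf) in scenarios.items()
    for mincos in MINCOS
]

print(scenarios[1], scenarios[2], len(scenarios), len(runs))
```

Under this numbering the 1st and 2nd scenarios match the description above, and the grid contains 12 scenarios in total.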

Results
This part explains the details of the experimental results. The top ten rules obtained are shown for each scenario, that is, for each variation of the minimum support, confidence, and cosine values. Each scenario covers PAR, ANR, and CNR. Where the results of a scenario do not differ much from those of another scenario, the specific outcomes for PAR, ANR, and CNR are not shown in detail. The analysis of the results is discussed at the end. The minimum cosine value is used only for ANR and CNR.
The results of the 1st scenario are shown in Tables 1-4. The results of the 2nd (Table 5), 3rd (Table 6), and 4th (Table 7) scenarios are similar to those of the 1st scenario, and the order of the obtained results does not differ either.
The results of the 6th, 7th (Table 13), and 8th (Table 14) scenarios are similar to those of the 5th scenario, and the order of the obtained results does not differ either.
The result of the 9th scenario is shown in Tables 15-18. Similar results are obtained in the 10th (Table 19), 11th, and 12th scenarios. The 11th and 12th scenarios produce exactly the same results, shown in Table 20, including the number of rules obtained. Therefore, the number of rules obtained for the combination (50%, 80%) is the same as for the combination (50%, 70%). Overall, the results are consistent.
Since PAR is the main goal of Apriori, the differences in the PAR results are also summarized in Table 21. Implicitly, this shows that the modified Apriori does not degrade the default one.
This evaluation analyzes the results of the experiments that have been carried out. The analysis considers the association rules generated, the number of association rules produced, and the top ten results produced by each algorithm. It shows the effects of embedding NAR mining and cosine correlation analysis.
The association rules produced by the default Apriori and the modified Apriori algorithm differ. Several association rules generated by the default Apriori algorithm are not included in the results of the modified Apriori because they do not meet the minimum cosine. The larger the minimum cosine value, the larger the difference between the number of PAR generated by the default Apriori and the number generated by the modified Apriori. The number of NAR, which only the modified Apriori algorithm produces, keeps decreasing as the minimum cosine value increases, at all support values (30%, 40%, 50%) and confidence values (50%, 60%, 70%, 80%).
The experimental results provide PAR and NAR that can be used to enrich the knowledge about an item. They also indicate a mutually reinforcing relationship between association rules that are included in the top ten rules of each type. Across the various minimum support, cosine, and confidence values, some association rules always appear in the top 3 of the positive association rule mining results. Minimum support = 30% with minimum confidence = 50%, 60%, 70%, 80% produces the NAR in ANR form ¬58 ⇒ 86 (if there is a rootstalk, then the veil is white), while minimum support = 30% with minimum confidence = 50%, 60%, 70% produces the NAR in CNR form 86 ⇒ ¬58 (if the veil is white, then there is a rootstalk). Both association rules are included in the top ten results of their type. The two rules confirm the relationship between items 86 and 58 and add knowledge about item 86, which was included in the top ten results of PAR mining.
NAR in the form of CNR at minimum support = 30% and 40% resulted in two mutually reinforcing association rules, namely 1 ⇒ ¬28 (if the mushroom is poisonous, then it has an aroma) and 28 ⇒ ¬1 (if the mushroom has no aroma, then it is not poisonous). In addition, NAR in the form of ANR at minimum support = 40% yields ¬1 ⇒ 28 (if the mushroom is not poisonous, then it has no aroma) and ¬28 ⇒ 1 (if it has an aroma, then it is poisonous), which further strengthens the relationship between items 1 and 28.
Some association rules fulfill the support requirement but do not satisfy the cosine requirement. This causes the different sequence of association rules produced by the default and the modified Apriori algorithm: at the various minimum cosine values, the ordering differs because of the association rules that do not reach the minimum cosine value.

Conclusions:
This work has enriched the default Apriori algorithm with negative ARM. The aim is to enhance the obtained rules and thereby strengthen the analysis. Cosine correlation analysis is used to extend the result with negative association rules. The series of experiments against the mushroom database shows that this work obtains different association rule mining results: the default and the modified ARM algorithm differ in both the quantity and the sequence of rules. The obtained PAR and NAR strengthen each other with a fairly good confidence score. Near-future work is to study the performance of the ARM algorithm on larger datasets.