Performance Evaluation of Intrusion Detection System using Selected Features and Machine Learning Classifiers

Some of the main challenges in developing an effective network-based intrusion detection system (IDS) include analyzing large network traffic volumes and realizing the decision boundaries between normal and abnormal behaviors. Deploying feature selection together with efficient classifiers in the detection system can overcome these problems. Feature selection finds the most relevant features and thus reduces the dimensionality and complexity of the network traffic to be analyzed. Moreover, building the predictive model from only the most relevant features reduces the complexity of the model, which shortens the model-building time and consequently improves detection performance. In this study, two different sets of selected features have been adopted to train four machine learning-based classifiers. The two sets are based on the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO) approaches, respectively. These evolutionary-based algorithms are known to be effective in solving optimization problems. The classifiers used in this study are Naïve Bayes, k-Nearest Neighbor, Decision Tree and Support Vector Machine, which have been trained and tested using the NSL-KDD dataset. The performance of these classifiers with the different feature sets was evaluated. The experimental results indicate that detection accuracy improves by approximately 1.55% when the PSO-based selected features are used instead of the GA-based selected features. The Decision Tree classifier trained with the PSO-based selected features outperformed the other classifiers, with accuracy, precision, recall, and f-score of 99.38%, 99.36%, 99.32%, and 99.34%, respectively. The results show that coupling optimal features with a good classifier in a detection system can reduce the classifier model-building time, reduce the computational burden of analyzing data, and consequently attain a high detection rate.


Introduction:
An intrusion detection system (IDS) is one of the protection methods against network attacks and threats used in most organizations, in addition to firewalls, authentication and encryption. The IDS model was first proposed by (1) as software that monitors and detects any intrusion in a system or network. A modern, effective network-based IDS should be able to automate network surveillance, analysis and attack detection or classification with high accuracy in a short amount of time (2,3). An IDS can be categorized as signature-based, anomaly-based or hybrid. A signature-based IDS accurately detects only known attacks, while an anomaly-based IDS is able to detect unknown attacks by comparing current profiles against predefined normal behaviours. The latter method is effective against zero-day attacks, but it still suffers from high false positive rates (4,5); hence, hybrid methods have recently been developed to overcome these limitations (6).
Due to privacy and security issues, it is difficult to obtain reasonably large and complete real-world network traffic data with attack footprints for IDS performance assessment. Instead, researchers use publicly available benchmark datasets, namely KDD CUP 99 and NSL-KDD, to evaluate IDS performance. The NSL-KDD dataset has been used extensively, including in this work, as it is an improved version of the original KDD CUP 99 dataset, which contains a huge number of redundant records (7). Nonetheless, NSL-KDD still consists of a large volume of network traffic, with 125,973 instances of 41 network-related features and an assigned label that classifies each record as either normal or abnormal. Analyzing a huge dataset imposes a heavy computational burden and hence increases the processing time. Feature selection, or feature reduction, has been proposed to solve this problem. Feature selection identifies and removes irrelevant features that do not contribute to the accuracy of a predictive model and has been widely used in machine learning, data mining and data analysis (8). Using a reduced set of features, also known as the selected features, reduces the complexity of the developed model and thus the time needed to build the classifier model (9).
This study investigates the performance of an IDS that uses only a few selected features, as opposed to all 41 features, with popular machine learning-based classifiers. Feature sets selected with evolutionary-based feature selection techniques in other research works have been adopted. Specifically, the 11 features selected using the Genetic Algorithm (GA) by (10) and the 20 features selected using Particle Swarm Optimization (PSO) by (11) have been used; hence, implementing feature selection is not within the scope of this study. Machine learning (ML) techniques have been widely used for network intrusion detection as they are able to classify benign and attack patterns precisely. ML algorithms automatically improve their detection accuracy with subsequent training, which may involve new and previously unseen data. However, building ML models becomes time consuming as data volumes increase (12). Hence, reducing the volume of data to be processed through feature reduction is critical to improving detection performance. In this work, four state-of-the-art machine learning classifiers, namely Naïve Bayes, k-Nearest Neighbor, Decision Tree and Support Vector Machine, have been implemented and evaluated. The detection accuracy of these classifiers with the different feature sets was studied. This paper is structured as follows. Section 2 presents an overview of the machine learning-based classifiers used in this study. Section 3 discusses some of the IDS models using different machine learning classifiers. Existing feature selection approaches are covered in Section 4. Section 5 reviews some of the related works on IDS models using different feature selection methods and classifiers. Section 6 presents the experimental setup, including the dataset and performance metrics used in this study. The performance of the classifier models using different sets of selected features is compared and discussed in Section 7. Final comments and conclusions are provided in Section 8.

Machine Learning Classifiers:
Machine learning (ML) enables IDSes to detect new attacks without human intervention. ML allows the IDS to change its execution strategy based on recently acquired data. In general, there are two types of learning techniques, namely supervised and unsupervised learning. Supervised learning involves algorithms that are 'taught' by examples, with both the inputs and the output labels provided during training (13). Unsupervised learning algorithms are left to interpret the data without guidance, as no labeled data are provided in the training dataset. Unsupervised learning identifies similarities and differences in data through clustering and association techniques (14).
The machine learning-based classifiers used in this study are the supervised, probabilistic-based Naïve Bayes, k-Nearest Neighbors, Decision Tree and Support Vector Machine. All of these are considered state-of-the-art classifiers, as they have been widely used for classification and regression problems due to their effectiveness. The theoretical background of these algorithms has been discussed extensively in many published works and hence is not covered in depth here. The following subsections discuss the classifiers in general, including their historical backgrounds, recent developments and applications.

Naïve Bayes
The Naïve Bayes (NB) classifier is a probabilistic classifier that uses Bayes' theorem and assumes that features are independent of each other and equally important (15). One problem with NB is the 'zero frequency' or 'zero probability' situation, in which the model cannot make a prediction because a category that appears in the test data set was never observed in the training data set. Smoothing techniques such as Laplace estimation can be applied to avoid this undesirable situation (16).
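The zero-frequency situation and its Laplace correction can be illustrated with a small sketch. This is only an illustration under stated assumptions: the function name `smoothed_likelihoods`, the toy protocol values and the smoothing constant `alpha` are hypothetical and are not taken from the experiments in this study.

```python
from collections import Counter

def smoothed_likelihoods(values, categories, alpha=1.0):
    """Estimate P(value | class) for one nominal feature.

    alpha = 0 gives the plain relative frequency; alpha = 1 gives
    Laplace (add-one) smoothing.
    """
    counts = Counter(values)
    total = len(values) + alpha * len(categories)
    return {c: (counts[c] + alpha) / total for c in categories}

# Hypothetical toy data: protocol values observed for one class in training.
observed = ["tcp", "tcp", "udp", "tcp"]
categories = ["tcp", "udp", "icmp"]  # 'icmp' was never observed for this class

unsmoothed = smoothed_likelihoods(observed, categories, alpha=0.0)
smoothed = smoothed_likelihoods(observed, categories, alpha=1.0)

print(unsmoothed["icmp"])  # zero likelihood wipes out the whole product
print(smoothed["icmp"])    # small but non-zero after smoothing
```

Because NB multiplies the per-feature likelihoods, a single zero forces the class posterior to zero; adding `alpha` to every count keeps each likelihood positive while the smoothed values still sum to one.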
With some improvements made to the traditional NB, it has been used extensively in text classification, along with other classification areas, as it is simple to implement, computationally fast and robust (17,18). Moreover, Naïve Bayes classifiers are among the simplest Bayesian network models and can achieve a higher accuracy level if coupled with kernel density estimation (19,20).

k-Nearest Neighbors
k-Nearest Neighbors (kNN) is a nonparametric classification method that has been widely used due to its simplicity and effectiveness (21). kNN was first described by Fix and Hodges in 1951 (22) in a USAF School of Aviation Medicine technical report and was later expanded by Cover and Hart (23). kNN classifies each unlabeled data point t based on its k nearest neighbors, known as the neighborhood of t. Majority voting among the labels in the neighborhood is then used to decide the classification of t, with or without distance-based weighting.
kNN requires no prior knowledge of the distribution of the data (24). However, kNN is biased by the choice of the k value. One way of choosing a good k value is to run the algorithm many times with different values and keep the one with the best performance. One disadvantage of this classifier is its considerably high computational cost, as it needs to compute the distance from the unlabeled data point t to every training sample. One promising approach to improving kNN accuracy is the use of clustering techniques (25,26). kNN has been deployed in many domains, including text mining, agriculture and medicine, but has been most heavily applied in finance-related areas such as stock market forecasting, bank customer profiling, financial risk management and money laundering analysis (27).
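The neighborhood vote described above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and unweighted voting; the function name `knn_predict` and the two-dimensional toy points are hypothetical and unrelated to the NSL-KDD experiments.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs.
    """
    # The distance to every training sample is computed, which is why the
    # cost of a single prediction grows with the size of the training set.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical, already-normalized toy samples.
train = [
    ((0.10, 0.20), "normal"),
    ((0.15, 0.25), "normal"),
    ((0.20, 0.10), "normal"),
    ((0.80, 0.90), "anomaly"),
    ((0.85, 0.80), "anomaly"),
    ((0.90, 0.95), "anomaly"),
]

print(knn_predict(train, (0.12, 0.18), k=3))
print(knn_predict(train, (0.82, 0.88), k=3))
```

Trying several values of `k` on held-out data and keeping the best one corresponds to the repeated-run strategy mentioned in the text.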

Decision Tree
Decision Tree (DT) is a supervised learning method that maps observations about a data item to conclusions about its target value (28). The leaves represent the class or label, the non-leaf nodes represent the features, and the branches represent conjunctions of feature values that lead to a specific class. To create a DT, the training records are partitioned recursively according to their attribute values (29).
DT is computationally fast even when dealing with large training sets since the trees are generally balanced, and hence traversing a tree from the root to a leaf requires approximately O(log2 N) steps. The tree-based algorithms include ID3 (Iterative Dichotomiser 3), C4.5 (successor of ID3), CART (Classification and Regression Tree), CHAID (Chi-Square Automatic Interaction Detection), MARS (Multivariate Adaptive Regression Splines) and cTree (Conditional Inference Trees). One of the main challenges in DT is building a good decision tree, that is, the smallest decision tree possible. Nonetheless, DT is one of the most used techniques in IDS because of its fast adaptation, simplicity and accuracy (30).
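A root-to-leaf traversal can be sketched as follows. The tree below is hand-written purely for illustration: the feature names are borrowed from NSL-KDD, but the thresholds, the tree shape and the `classify` helper are hypothetical and were not learned from data in this study.

```python
import math

# Hypothetical tree: internal nodes test one feature, leaves carry a label.
tree = {
    "feature": "src_bytes", "threshold": 0.5,
    "left": {"label": "anomaly"},
    "right": {
        "feature": "serror_rate", "threshold": 0.3,
        "left": {"label": "normal"},
        "right": {"label": "anomaly"},
    },
}

def classify(node, record):
    """Walk from the root to a leaf, testing one feature per internal node."""
    while "label" not in node:
        go_left = record[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

print(classify(tree, {"src_bytes": 0.9, "serror_rate": 0.1}))

# Depth of a perfectly balanced binary tree with one leaf per record,
# as a rough upper bound for the 125,973 NSL-KDD instances.
max_depth = math.ceil(math.log2(125973))
print(max_depth)
```

Under the balanced-tree assumption, classifying one record takes at most about 17 comparisons, which is what makes prediction cheap even for a large training set.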

Support Vector Machine
A Support Vector Machine (SVM) is based on statistical learning theory and was developed by Vapnik in 1995 (31). SVM finds the optimal hyperplane that separates two classes. By using different types of kernel functions, the low-dimensional input space is transformed into a high-dimensional space, so that classes that are not separable in the original space can be separated in the added dimensions. Linear, sigmoid, polynomial and radial basis function (RBF) kernels are some of the commonly used kernel functions, which play a significant role in SVM (32).
SVMs have performed well in multiple areas of biological analysis, including the analysis of RNA-sequencing and microarray gene expression data, due to their capability to generalize well with high-dimensional data (33,34). However, SVM performance may degrade when the data are not linearly separable and the data set is large, as precomputing the kernel matrix might become infeasible (35).
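The kernel idea and the kernel-matrix cost can be made concrete with a short sketch. It assumes the common RBF form k(x, y) = exp(-gamma * ||x - y||^2), a hypothetical `gamma` of 0.5, a dense matrix of 8-byte floats, and a training set of 80% of the 125,973 NSL-KDD records; none of these settings are reported in this study.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel: similarity decays with the squared Euclidean distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

same = rbf_kernel((0.0, 0.0), (0.0, 0.0))
apart = rbf_kernel((0.0, 0.0), (1.0, 1.0))
print(same, apart)

# Memory needed to precompute the full N x N kernel matrix in float64.
n_train = int(125973 * 0.8)
kernel_matrix_gib = n_train ** 2 * 8 / 2 ** 30
print(n_train, round(kernel_matrix_gib, 1))
```

Under these assumptions the full kernel matrix would occupy roughly 75 GiB, which illustrates why precomputing it becomes infeasible for large data sets.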

Intrusion Detection System using Machine Learning Classifiers:
Machine learning (ML) has been widely used in network intrusion detection for its ability to classify benign and attack patterns with high precision. Many studies have shown that a classifier developed with an efficient subset of relevant features provides higher predictive accuracy than a classifier developed from the complete set of features (41,42). Feature Selection (FS) is a popular preprocessing technique that aims to find the most relevant features, that is, the features that have a high correlation with the respective results (43). Using only relevant features to build the predictive model reduces the complexity of the developed model, which shortens the time needed to build the classifier model and improves accuracy and efficiency. In general, FS approaches can be classified into three categories: wrapper, filter and hybrid (44). Filter methods only consider the relevance between features and class labels, independent of the classifiers, as depicted in Figure 1. They rank the features using statistical techniques such as the t-test or Fisher discriminant ratio, information theory, correlation coefficients, variance thresholds, or distance measurements (45). These methods require fewer computational resources and are faster than wrapper methods, as no cross-validation process is performed. In wrapper methods, incremental learning sessions of a specific machine learning algorithm are integrated into the feature selection process, as depicted in Figure 2. The prediction performance of the algorithm is tested using different feature subsets and, finally, the subset with the best performance is selected. Wrapper methods, which are based on greedy search algorithms, generally achieve higher accuracy than filter methods. Wrapper methods for feature selection can be categorized into step forward feature selection, step backward feature selection and exhaustive feature selection. Meanwhile, hybrid methods, also known as embedded methods, combine filter and wrapper methods to obtain the best of both techniques.
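As an illustration of a filter method, the following sketch ranks nominal features by information gain, one of the information-theoretic criteria mentioned above. The toy records and the helper names `entropy` and `information_gain` are hypothetical; this is not the selection procedure of (10) or (11).

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on one feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for v, lab in zip(feature_values, labels) if v == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy records.
labels = ["normal", "normal", "anomaly", "anomaly"]
features = {
    "protocol_type": ["tcp", "udp", "tcp", "udp"],  # unrelated to the label
    "flag": ["SF", "SF", "S0", "S0"],               # separates the classes
}

gains = {name: information_gain(vals, labels) for name, vals in features.items()}
ranking = sorted(gains, key=gains.get, reverse=True)
print(ranking)
```

The ranking depends only on the data and the labels, never on a classifier, which is the property that distinguishes filter methods from wrapper methods.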

Related Works:
The following paragraphs discuss some of the existing IDS models with various feature selection techniques and classifiers, and Table 2 presents the summarized information. Sarvari et al. (10) proposed an intrusion detection system using a hybrid SVM approach with the Genetic Algorithm (GA) FS method. GA is a stochastic optimization algorithm based on natural evolution that aims to find the optimal solution. By implementing GA, the number of important features was reduced from 41 to 11. These 11 significant features are categorized into three groups, ranked based on their importance. considering the feature event is independent of the class value. Multi-class SVM is used to classify the different types of attacks in the NSL-KDD dataset. Using the proposed model 31 features were selected out of 41. The proposed system achieved 98% in accuracy and 0.13% false positive rate. Chakir et al. (11) improved IDS efficiency by using the Information Gain (IG) feature selection method and SVM with Particle Swarm Optimization (PSO) for improved classification. PSO is a stochastic approach that performs searches using a population, or swarm, of particles. Experiments were performed on the NSL-KDD dataset and the 20 top-ranked features were selected. The experimental studies indicate that the proposed IG-PSO-SVM detection model performed well, with a 0.9% false alarm rate and 99.8% accuracy as well as precision.
Al-Yaseen (41) suggested a wrapper feature selection method based on the firefly algorithm and SVM. The SVM model was used to assess each subset of features selected by the firefly approach. The key benefit of the proposed system is its ability to adjust the firefly algorithm to match the selection of features, and the 10 top-ranked features were selected. The solution achieved about 78.89% accuracy, compared with only 75.81% when all 41 features were used. The results of the analysis show the effectiveness of the proposed feature selection technique in improving the detection system. This study investigates the performance of evolutionary-based feature selection methods when coupled with some of the state-of-the-art classifiers in detecting attacks in the NSL-KDD data set. Therefore, the 11 GA-based selected features and the 20 PSO-based selected features of Sarvari et al. (10) and Chakir et al. (11), respectively, have been adopted in this work. As mentioned earlier, implementing evolutionary-based feature selection is not within the scope of the study. The authors use the features that were selected in the abovementioned works and evaluate the performance of these two approaches.
The following paragraph provides some background on the evolutionary computing that has gained increasing attention from researchers.
Due to their optimization capabilities, evolutionary-based feature selection techniques have gained much attention from researchers. Popular algorithms that have been widely used include the Genetic Algorithm (GA), Particle Swarm Optimization (PSO) and ant colony optimization (50)(51)(52). Genetic algorithms are randomized search algorithms that rely on biologically inspired operators such as mutation, crossover, selection and reproduction to provide optimization. GA is an iterative process that evolves over time and uses the rule of survival of the fittest to arrive at the best solution. It operates on string structures, analogous to biological structures, and in every generation a new set of strings is created using parts of the fittest members of the old set. GA is computationally costly and can take a long time to converge due to its stochastic nature (53). PSO, proposed by Eberhart and Kennedy (54), was inspired by the movement behavior exhibited by flocks of birds and swarms of insects. PSO consists of individuals, known as particles, that each have a position and a velocity. Using a mathematical formula, it iteratively improves the solution by moving these particles through the given search space. The movement of each particle is influenced by its own best-known position but is also guided toward the best-known position in the search space, which is updated whenever other particles find better positions. This moves the swarm toward the best solutions. PSO is easy to implement and computationally inexpensive compared to GA. However, with more features in the data set, the solution space increases rapidly. In addition, a high number of uncorrelated or redundant features results in many local optima in a large solution space, and thus evolutionary-based methods still suffer from local optimum stagnation problems (45).
In this work the data dimension is limited to 41 and, hypothetically, PSO should be able to converge fast and would be expected to select fewer features than GA. However, based on Table 2, the number of features selected with PSO by (11) is higher than the number selected with GA by (10). This could be due to the Information Gain threshold value used in the experiments, which led to 20 important features being selected. Similarly, another work that deploys a hybrid model integrating the Gini Index with PSO can be found in (55). The authors consider a feature important, and thus select it, only when its Gini Index score is less than 0.4. Consequently, only 18 features are selected from the NSL-KDD dataset in their work. In this study, two sets of selected features, one with 11 features selected using GA and another with 20 features selected using PSO, have been adopted to train the four different predictive models.
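The particle movement that PSO relies on can be sketched as follows. This is a generic continuous PSO minimizing a toy sphere function, not the IG-PSO-SVM procedure of (11); the inertia weight `w`, the acceleration coefficients `c1` and `c2`, the swarm size, the search range and the seed are all assumed values chosen only for illustration.

```python
import random

def pso_minimize(objective, dim, n_particles=20, iterations=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal particle swarm optimizer; returns best position, value, history."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # each particle's best position
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # swarm's best position
    history = [gbest_val]
    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # inertia + pull toward own best + pull toward swarm best
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
        history.append(gbest_val)
    return gbest, gbest_val, history

def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)

best, best_val, history = pso_minimize(sphere, dim=2)
print(best_val)
```

For feature selection, a particle's position would instead encode which features are kept, and the objective would score the resulting subset, for example by classifier accuracy.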

Methodology:
The NSL-KDD dataset, proposed intrusion detection system and performance metrics used in this study are discussed in the following subsections.

Dataset
The NSL-KDD dataset (56), an improved version of the KDD-CUP 99 dataset, has been used in this study. It has no redundant or duplicate records and thus a better detection rate is expected. In this dataset, there are 125,973 instances with 41 attributes or features and one assigned label that indicates whether the record is normal or abnormal. Figure 3 depicts the 41 features of the NSL-KDD dataset. These features can be divided into three categories: 1) features extracted from the TCP/IP connection, 2) features derived from the TCP packet payload and 3) time-based and host-based traffic features. The attacks in this dataset can be classified into four types, namely DoS, Probe, U2R and R2L attacks. This public benchmark dataset has been widely used by many researchers to conduct different types of analyses and develop effective IDSes (57-60).

Design and Implementation
Figure 4 shows the proposed IDS model used in this study and the processes involved. These processes, which include pre-processing data, using selected features, building classification models and evaluating performance, are elaborated in the following subsections.

Pre-processing Data: Data Transformation and Normalization
Figure 5 shows two records taken from the NSL-KDD dataset, specifically the records on lines 2 and 6, which contain a mix of numerical and string values. These string, or nominal, feature values need to be transformed into numeric values; the affected columns are columns 2 (Protocol_type), 3 (Services), 4 (Flag) and 42 (Attack or Normal). For column 42 of each record, the 'normal' value has been mapped to 0 and the 'anomaly' value has been mapped to 1.

Figure 5. NSL-KDD Records.
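The transformation of nominal values can be sketched as below. The label mapping (normal to 0, anomaly to 1) follows the text; the integer coding of the other nominal columns in order of first appearance is an assumption, since the exact coding scheme is not stated, and the helper `encode_nominal` and the toy values are hypothetical.

```python
def encode_nominal(values):
    """Map each distinct string to an integer code, in order of first appearance."""
    codes = {}
    encoded = [codes.setdefault(v, len(codes)) for v in values]
    return encoded, codes

# Hypothetical toy columns in the style of NSL-KDD columns 2 and 42.
protocol_type = ["tcp", "udp", "tcp", "icmp"]
labels = ["normal", "anomaly", "normal", "anomaly"]

encoded_protocol, protocol_codes = encode_nominal(protocol_type)
encoded_labels = [0 if lab == "normal" else 1 for lab in labels]

print(encoded_protocol, protocol_codes)
print(encoded_labels)
```

Integer codes impose an arbitrary order on the categories, which matters for distance-based classifiers such as kNN; one-hot encoding is a common alternative when that is a concern.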
Due to the large variation among some of the feature values, for example the values 146 and 0.08 shown in line 2 of Figure 5, normalization is required for better performance. Normalization scales each feature into a specific range while preserving the relative ordering and spacing of its values. The maximum and minimum values of each feature were determined, and the data were converted into normalized form using min-max scaling, which in its standard form is:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
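A minimal sketch of the normalization step, assuming the standard min-max scaling to the range [0, 1] that the use of each feature's maximum and minimum suggests. The function name `min_max_scale` and the sample column are hypothetical; only the value 146 is taken from the text.

```python
def min_max_scale(column):
    """Scale one feature column to [0, 1] using its own minimum and maximum."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to 0 to avoid
        # dividing by zero.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

# Hypothetical values of one large-valued feature.
column = [0.0, 146.0, 73.0, 292.0]
scaled = min_max_scale(column)
print(scaled)
```

After scaling, a feature with values in the hundreds and a rate-type feature such as 0.08 lie in comparable ranges, so neither dominates distance or margin computations.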

Using Selected Features: Adopting Two Sets of Selected Significant Features
In this study, two sets of selected significant features have been applied. The 20 selected features by (11) obtained using Information Gain and Particle Swarm Optimization (PSO) and 11 selected features by (10) obtained using Genetic Algorithm (GA) are fed into the machine learning models. Both GA and PSO are evolutionary algorithms with their own advantages and limitations. Table 3

Building Classification Models: Training/Testing Data and Predictive Models
The NSL-KDD data are split into training and testing sets for supervised learning. Following the previous works by Sarvari et al. (10) and Chakir et al. (11), 80% of the data has been randomly selected and used to train the machine learning models, and the remaining 20% is used to evaluate the classifiers' performance. Table 4 shows the statistics of the data used in this study. The Decision Tree, Naïve Bayes, Support Vector Machine and k-Nearest Neighbor algorithms are implemented using MATLAB version R2018b. Using the training data, four predictive models are built and then used to classify the remaining 20% of the dataset.
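The random 80/20 split can be sketched as follows. This is an illustrative Python sketch, not the MATLAB code used in the study; the helper `train_test_split`, the seed and the use of integer record indices in place of real records are assumptions, and the exact counts in Table 4 may differ by rounding.

```python
import random

def train_test_split(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of the records and cut it into training and test parts."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Record indices stand in for the 125,973 NSL-KDD instances.
records = list(range(125973))
train_set, test_set = train_test_split(records)
print(len(train_set), len(test_set))
```

Fixing the seed makes the split reproducible, so that all four classifiers and both feature sets are evaluated on the same held-out records.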

Performance Metrics Evaluation
The accuracy, precision, recall and F-score performance measurements are used to evaluate the performance of the classifiers with different sets of selected features. The confusion matrix is the basis for calculating these performance metrics. It includes true positives (TP), the normal instances that are correctly predicted as normal; true negatives (TN), the abnormal instances that are correctly identified as abnormal; false positives (FP), the abnormal instances that are wrongly predicted as normal; and false negatives (FN), the normal instances that are wrongly predicted as abnormal.
The descriptions of the performance metrics are as follows: (i) Classification rate or Accuracy: one of the most important performance measurements of a classification algorithm, showing the ability of the algorithm to correctly predict both positive and negative instances, as given by the following formula:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
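All four metrics can be computed directly from the confusion matrix counts. The sketch below assumes the standard definitions of precision, recall and F-score (the harmonic mean of precision and recall); the counts are hypothetical and do not come from the experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F-score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

# Hypothetical confusion matrix counts.
accuracy, precision, recall, f_score = classification_metrics(
    tp=90, tn=80, fp=10, fn=20)
print(accuracy, precision, recall, f_score)
```

Reporting precision, recall and F-score alongside accuracy matters when the classes are imbalanced, because a high accuracy can hide poor detection of the minority class.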

Results and Discussion:
Accuracy is the most critical performance measurement in intrusion detection, and Figure 6 shows the accuracy of all the classifiers using both the PSO-based and GA-based selected feature sets. Interestingly, even though a greater number of PSO-selected features is used to develop the predictive models, the overall performance of these models is superior to that of the GA-based models. This could be due to the ability of PSO, together with Information Gain, to correctly identify the most relevant attack features in the dataset. In general, accuracy improves by approximately 1.55% when the PSO-based selected features are used instead of the GA-based selected features. As expected, the Decision Tree (DT) classifier attained the highest accuracy, 99.38%, with the PSO-selected features. Meanwhile, the decision tree classifier with the GA-based selected features reached an accuracy of only up to 98%. These results are consistent with the decision tree performance reported in the studies shown in Table 1. In this experiment, the NB classifier performed the worst, behind SVM and kNN. In summary, the classifiers trained on the significant features derived from PSO were more accurate than those trained on the features obtained by GA.
The precision results, which show the percentage of the classifier's predictions that are correct and are one of the important indicators of a good model, are shown in Figure 7. The classifiers using the PSO-based selected features outperformed the classifiers trained with the GA-based selected features. Again, as expected, the decision tree classifier obtained the highest precision (99.36%) compared to the other classifiers. Unlike the previous results, in this experiment SVM has the worst precision, with a value of 88.81%, behind NB and kNN respectively. The difference in SVM performance between the two feature sets is large, at about 5.73%. Meanwhile, the other three classifiers are considerably consistent in their performance.
Figure 8 depicts the recall, or sensitivity, of the predictive models. The classifiers using the PSO-based selected features outperformed the classifiers trained with the GA-based selected features, except for SVM. This problem is prominent in SVM, and published works have discussed the phenomenon, known as the outlier sensitivity problem of standard SVM (61). Many have found that SVMs do not perform well at certain noise intensities. The performance of the SVM trained with the PSO-based selected features degraded in the presence of noise and was even worse than that of the GA-based SVM, by approximately 4.5%. The rest of the classifiers are consistent in their performance. The decision tree classifier again attained the highest recall (99.32%) compared to the other classifiers.
Figure 9 shows the f-score, or f-measure, of the predictive models. In general, the classifiers' f-score is better by approximately 1.56% when the PSO-based selected features are used instead of the GA-based selected features. The Decision Tree (DT) classifier attained the highest f-score, 99.34%, with the PSO-selected features. In this experiment, the NB classifier performed the worst, behind SVM and kNN, with 87.6% using the GA-based features. In summary, as expected, the efficient decision tree outperformed the other classifiers in all tests, with both feature sets. The sensitivity of the standard SVM is susceptible to noise and can be improved upon as suggested in (62). kNN also performs considerably well in comparison to the other two classifiers, while the NB classifier performed the worst in most of the tests.

Conclusion:
The performance of four supervised classifiers with different selected feature sets on the NSL-KDD dataset was evaluated. The feature sets were derived from the Particle Swarm Optimization (PSO) and Genetic Algorithm (GA) feature selection approaches, respectively.
The experimental results show that using a smaller number of selected and relevant features does not necessarily improve accuracy. Instead, using an appropriate number of relevant and significant features, even if that number is larger, can enhance the performance of the machine learning models. The 20 features selected by PSO outperformed the 11 features selected by GA in every performance metric except recall, due to the outlier sensitivity problem of SVM. The adopted PSO feature selection method with Information Gain selected the top 20 relevant features from the 41 features in the NSL-KDD dataset, and hence reduces the complexity and building time of the predictive models while improving their accuracy. Decision Tree has proven to be an efficient classifier and outperformed the Naïve Bayes, k-Nearest Neighbor and Support Vector Machine classifiers in every evaluation test. In this experimental study, a maximum accuracy of 99.38% and precision of 99.36% were attained by the decision tree-based IDS using particle swarm optimization feature selection.
In summary, combining good feature selection with an efficient classifier in a detection system can reduce the complexity of data analysis and consequently improve detection performance.