Anomaly Detection Approach Based on Deep Neural Network and Dropout

In computer system security, intrusion detection systems are fundamental components for discriminating attacks at an early stage. They monitor and analyze network traffic, looking for abnormal behavior or attack signatures in order to detect intrusions early. However, many challenges arise in developing a flexible and efficient network intrusion detection system (NIDS) that achieves a high detection rate on unforeseen attacks. In this paper, a deep neural network (DNN) approach is proposed for an anomaly-detection NIDS. Dropout is the regularization technique used with the DNN model to reduce overfitting. The experiments were carried out on the NSL-KDD dataset. A SoftMax output layer with the cross-entropy loss function was used so that the proposed model performs multi-class classification over five labels, one normal and four attack classes (DoS, R2L, U2R, and Probe). Accuracy was used to evaluate model performance, and the proposed model achieved 99.45%. Recognition time in a NIDS is commonly reduced by applying a feature selection technique; the proposed DNN classifier was also implemented with a feature selection algorithm and reached an accuracy of 99.27%.

NIDSs are developed as classifiers to separate normal traffic from anomalous traffic (4). Deep learning has emerged as a method that can be applied to Big Data with low training time consumption and a high accuracy rate, thanks to its distinctive learning mechanism (5). Deep learning is a non-linear approach within machine learning that can be used to detect intrusions and to develop adaptive IDSs (6,7). Dropout is a regularization technique used to prevent deep models from overfitting (8). Because of the large number of features, fitting the data for pattern detection sometimes becomes restricted; a feature selection method is therefore used with the classifier to provide better estimation and to decrease implementation time (9). Precisely, the significant contributions of this paper are:
• A NIDS built on a DNN model.
• Dropout as the technique used to reduce overfitting.
• The proposed DNN model yields a detection rate of 99.45% and is able to classify the data into five class labels (normal, and four attack labels). The test outcomes demonstrate that this approach has real potential for real-time detection.

Related Work
Various studies have addressed classification problems, particularly in intrusion detection systems. The most relevant related works are: 1) Reyadh Shaker Naoum et al., "An Enhanced Resilient Backpropagation Artificial Neural Network for Intrusion Detection System", 2012 (10). The authors proposed a classifier for intrusions using an enhanced resilient backpropagation neural network. The classifier can assign records to five classes with a reasonably good detection rate of about 94.7%, at a false positive rate of 15.7%. The dataset used in that analysis was NSL-KDD.

Deep Neural Network (DNN)
Deep learning is a powerful family of methods used to train neural networks. A neural network is a biologically motivated paradigm that enables a computer to learn from observational data (17). The term "deep" usually refers to the number of hidden layers in the neural network; each layer can be viewed as an individual algorithm on its own. The DNN is one of the most commonly used deep learning models. A DNN consists of an input layer, a number of hidden layers, and an output layer. Input values are fed to the DNN, and the output values are computed progressively along the hidden layers: at each layer, the output vector of every unit in the previous layer is multiplied by the weight vector of each unit in the current layer to compute a weighted sum. A nonlinear function such as the hyperbolic tangent (Tanh), sigmoid, or rectified linear unit (ReLU) is then applied to the weighted sum to produce the layer's output values. This series of layer computations turns the representations into progressively more abstract representations (18).
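As a minimal sketch of this layer-by-layer computation (the sizes match the paper's 41-input, five-class setting, but the weights and helper names here are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Propagate an input vector through a stack of (W, b, activation) layers."""
    a = x
    for W, b, act in layers:
        z = W @ a + b   # weighted sum of the previous layer's outputs
        a = act(z)      # nonlinearity applied to the weighted sum
    return a

rng = np.random.default_rng(0)
layers = [
    (rng.standard_normal((100, 41)) * 0.1, np.zeros(100), relu),
    (rng.standard_normal((100, 100)) * 0.1, np.zeros(100), sigmoid),
    (rng.standard_normal((5, 100)) * 0.1, np.zeros(5), lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()),
]
print(forward(rng.standard_normal(41), layers))  # five class scores
```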

Dropout
Dropout is a technique that thins a deep neural network by stochastically removing hidden units during training in order to reduce overfitting; hidden units are typically omitted with a dropout rate of 0.5. Training therefore amounts to randomly sampling from a collection of 2^n different thinned networks (n is the number of units that can be dropped), all of which share weights; this is as extreme as bagging can get. At test time, the geometric mean of all thinned-network predictions is approximated by a 'mean network' that keeps every hidden unit but halves all of the outgoing weights (8). Dropout discourages brittle co-adaptations among hidden-unit feature detectors; it does so by injecting a special kind of noise into the hidden output values during the forward pass of training. The noise zeroes out a fixed fraction of the output values of the units in the current layer, much like the noise added to the input of a denoising autoencoder (19).

Feature Selection
The data features used to train machine learning models have a great influence on model performance. Unfortunately, a considerable number of these features are either partially or totally irrelevant or redundant with respect to the target concept (20). Feature selection is a procedure used to choose the smallest set of features needed to maintain or improve accuracy. By using only pertinent features, a classifier generally improves its predictive accuracy (21).

The NSL-KDD Dataset
The NSL-KDD dataset was prepared to avoid some inherent issues of the KDD Cup 1999 dataset. Even though it is relatively old and not an ideal representation of actual networks, it remains the standard reference for comparing NIDS models and has been used by numerous researchers to assess NIDS performance. The dataset contains 125,973 network traffic points in the KDD Train+ set (22). Each NSL-KDD record is built from 41 features. The dataset was prepared using the network traffic captured by the 1998 DARPA IDS evaluation program; the traffic includes normal records and various attack types, namely remote-to-local (R2L), Probing, user-to-root (U2R), and DoS. It is likely that the vast majority of recent attacks derive from these known attacks.

The Proposed Solution for NIDS
A simple deep neural network was constructed. The NSL-KDD dataset was used to fit and assess the model; this dataset consists of 41 data features and is categorized into five classes according to the record characteristics, one normal and four attacks. A SoftMax output layer with the cross-entropy loss function was used so the model performs multi-class classification. Figure 1 represents the general block diagram of the proposed system. With dropout, the feedforward operation from layer 1 to layer 2 becomes (for more clarification see Fig. 2):

$r^{(1)} \sim \text{Bernoulli}(1-p)$
$\tilde{y}^{(1)} = r^{(1)} * y^{(1)}$
$z^{(2)} = w^{(2)} \tilde{y}^{(1)} + b^{(2)}$
$y^{(2)} = f(z^{(2)})$

Here * denotes element-wise multiplication, and $r^{(1)}$ is a vector of independent Bernoulli random variables, each of which is 1 with probability (1 - p). This vector is sampled and multiplied element-wise by the outputs of the layer, $y^{(1)}$, to create the thinned outputs $\tilde{y}^{(1)}$. These thinned outputs are then used as input to the following layer, and the procedure is applied at each layer. A factor of $\frac{1}{1-p}$ is applied during the training phase to ensure that at test time, when all units are active, each layer receives inputs on the expected scale.
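A minimal NumPy sketch of these dropout equations (the "inverted" variant with the 1/(1 - p) scaling at training time, matching the factor described above; the function name is ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(y_prev, W, b, f, p=0.5, training=True):
    """One dropout layer step: thin the previous layer's outputs,
    then compute the current layer's weighted sum and nonlinearity."""
    if training:
        r = rng.binomial(1, 1.0 - p, size=y_prev.shape)  # r ~ Bernoulli(1 - p)
        y_thinned = (r * y_prev) / (1.0 - p)  # thinned outputs, rescaled by 1/(1 - p)
    else:
        y_thinned = y_prev                    # test time: all units active, no rescaling
    z = W @ y_thinned + b                     # z^(2) = w^(2) ỹ^(1) + b^(2)
    return f(z)                               # y^(2) = f(z^(2))
```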

Cross-entropy Function and SoftMax Output Layer
The input to the first layer of the proposed model is 41 nodes, and the output is five nodes. We use cross-entropy as the loss function, to measure the difference between two probability distributions: the true distribution (the one that the machine learning algorithm is attempting to match) and the predicted distribution, as follows:

$H(p, q) = -\sum_{x} p(x) \log q(x)$

where p is the target distribution and q is the predicted distribution.
SoftMax defines another kind of output layer for the proposed neural network: it maps the last hidden neurons to the output nodes. The layer starts in the same way as a ReLU layer, by forming the weighted input

$z_j^L = \sum_k w_{jk}^L a_k^{L-1} + b_j^L,$

where $a_k^{L-1}$ denotes the activation of neuron k in the previous layer. However, the ReLU function is not applied to obtain the output. Instead, a SoftMax function is applied to the $z_j^L$, so the activation $a_j^L$ of the j-th output neuron is

$a_j^L = \frac{e^{z_j^L}}{\sum_k e^{z_k^L}},$

as shown in Fig. 3.
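A short NumPy sketch of these two pieces, the SoftMax activation and the cross-entropy between target and prediction (the input values and names here are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # shift for numerical stability
    return e / e.sum()           # a_j = e^{z_j} / sum_k e^{z_k}

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) log q(x); p is the one-hot target, q the prediction."""
    return -np.sum(p * np.log(q + eps))

z = np.array([2.0, 1.0, 0.1, -1.0, 0.5])  # weighted inputs of the 5 output neurons
q = softmax(z)                            # predicted distribution over the 5 labels
p = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # true label (e.g. "normal")
print(cross_entropy(p, q))
```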

Figure 3. SoftMax output layer
In the testing phase, the mean network is used: it contains all of the hidden units, but with their outgoing weights halved. This gives performance very close to averaging over a vast number of dropout networks. The mean network effectively takes the geometric mean of the probability distributions over classes predicted by the 2^N possible thinned networks (N is the number of units that can be dropped).
These thinned networks do not all make the same predictions, and the mean-network prediction assigns a log probability to the correct answer at least as high as the average of the log probabilities assigned by the individual thinned networks. Each thinned-network estimator is defined as

$\hat{p}_i = P(y = c \mid x;\, \theta_i) \quad (4)$

and the combined prediction is the geometric mean of the predictions of all thinned networks, each computed as in equation (4):

$p(y = c \mid x) = \left( \prod_{i=1}^{k} \hat{p}_i \right)^{1/k} \quad (5)$

where k denotes the number of thinned networks produced by dropout during training.
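As a sketch, the geometric-mean combination of k thinned-network predictions could be computed as below, renormalizing so the result is a distribution (function name ours; in practice dropout approximates this with the single halved-weight mean network rather than enumerating networks):

```python
import numpy as np

def geometric_mean_prediction(preds, eps=1e-12):
    """preds: (k, n_classes) array, one predicted distribution per thinned net."""
    g = np.exp(np.log(preds + eps).mean(axis=0))  # element-wise geometric mean
    return g / g.sum()                            # renormalize to a distribution

# Two thinned nets disagreeing on a 5-class prediction:
preds = np.array([[0.7, 0.1, 0.1, 0.05, 0.05],
                  [0.4, 0.3, 0.1, 0.10, 0.10]])
print(geometric_mean_prediction(preds))
```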

Performance Evaluation
Usually, the performance of ANIDS models is assessed in terms of accuracy, recall, precision, and F-score; a NIDS needs a high detection rate/accuracy. The confusion matrix is used to compute these metrics.

Confusion Matrix
A confusion matrix shows the number of correct and incorrect predictions made by the classification model compared with the actual outcomes (the target values) in the data. The matrix is M×M, where M is the number of label values. In the confusion matrix, TP (true positives) is the number of attack records correctly classified, TN (true negatives) is the number of normal records correctly classified, FP (false positives) is the number of normal records incorrectly classified, and FN (false negatives) is the number of attack records incorrectly classified. P and N denote the positive and negative samples, respectively (3).
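For illustration, the confusion matrix and the derived metrics can be computed with scikit-learn (the label vectors here are placeholders, not the paper's results):

```python
from sklearn.metrics import confusion_matrix, classification_report

labels = ["normal", "DoS", "R2L", "U2R", "Probe"]
# Placeholder predictions; in practice these come from the trained classifier.
y_true = ["normal", "DoS", "Probe", "normal", "R2L", "U2R"]
y_pred = ["normal", "DoS", "Probe", "DoS", "R2L", "U2R"]

cm = confusion_matrix(y_true, y_pred, labels=labels)  # M x M matrix of counts
print(cm)
# Accuracy plus per-class precision, recall, and F-score in one report
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```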

Experimental Results
The experiments were applied to the NSL-KDD dataset to fit and test the model using two estimation methods (holdout and 5-fold cross validation), and in two cases: one with all 41 feature values, and the other using the feature selection method.

Experimental Results Using 41 Data Features
The experiments in this case were carried out using all 41 NSL-KDD feature values.

A. Results by 5-fold Cross Validation
Cross validation is a way to estimate the skill of a model on unseen data, at the cost of greater computational expense. The method systematically creates and evaluates multiple classifiers on multiple data subsets, as shown in Figure 4.

Figure 4. Five-fold cross validation
In our experiment, 5-fold cross validation was used. "KDDTrain.csv" (containing 125,973 data points) was partitioned into 100,778 training data points and 25,196 testing data points (80% for training and 20% for testing). The first hidden layer units use the ReLU activation function and the second hidden layer units use the sigmoid activation function; the learning rate was 0.1 and the number of epochs was 150. The results are shown in Table 2. Given these results, the third configuration, with a model accuracy of 85.86%, was selected to be applied in the holdout estimation method that follows.
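A hedged sketch of this 5-fold procedure with scikit-learn, assuming the 41 features have already been numerically encoded; MLPClassifier stands in for the paper's DNN here (it applies a single activation to all hidden layers and has no dropout):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier

# Assumed layout: 41 numeric feature columns followed by the class label.
data = pd.read_csv("KDDTrain.csv")
X, y = data.iloc[:, :41].values, data.iloc[:, 41].values

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    # Stand-in for the paper's two hidden layers of 100 units each.
    clf = MLPClassifier(hidden_layer_sizes=(100, 100), learning_rate_init=0.1, max_iter=150)
    clf.fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
print(f"mean accuracy over 5 folds: {np.mean(scores):.4f}")
```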

B. Results by Holdout Method
The NSL-KDD dataset has 125,973 network traffic samples stored in the "KDDTrain+.csv" file. This dataset was partitioned into 100,778 data points for training and the remaining 25,192 points for testing. The training data was further partitioned into 67,521 points for training and 33,257 points for validation (training rate = 67%, validation rate = 33%); see Fig. 5. Experiments were implemented in two stages with different numbers of iterations. The starting configuration of the proposed model consisted of 41 input nodes, two hidden layers of 100 units each, and 5 output nodes; the first hidden layer used the ReLU activation function, the second used the sigmoid activation function, the learning rate was 0.1, the dropout rate was 0.5, and the number of epochs was 150. The first stage was used to identify an adequate activation function for the hidden units in each layer, by varying the activation function type among ELU, ReLU, Tanh, and Sigmoid. Table 3 shows the results of this stage. According to these results, the third configuration, which achieved the highest accuracy (99.26%), was carried into the next stage. In the second stage, an adequate learning rate was determined; Table 4 shows the results, with the number of epochs set to 300 and the dropout rate 0.5. From these results, the proposed model accuracy reached 99.45%, which is taken as the proposed model's accuracy.
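A plausible Keras reconstruction of this final configuration (the exact placement of the dropout layers is our assumption; this is a sketch, not the authors' published code):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(41,)),
    layers.Dense(100, activation="relu"),     # first hidden layer: 100 ReLU units
    layers.Dropout(0.5),                      # dropout rate 0.5, as in the paper
    layers.Dense(100, activation="sigmoid"),  # second hidden layer: 100 sigmoid units
    layers.Dropout(0.5),
    layers.Dense(5, activation="softmax"),    # five class labels
])
model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.1),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
# Training as described: 300 epochs, 33% of the training data held out for validation.
# model.fit(X_train, y_train_onehot, epochs=300, validation_split=0.33)
```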

C. Performance Evaluation
As noted previously, the performance of NIDS models is assessed by accuracy, recall, precision, and F-score, and the confusion matrix is used to compute these metrics. The confusion matrix, which shows the number of correct and incorrect predictions made by the classification model compared with the actual outputs, is shown in Table 5. The following formula is used to compute the model accuracy:
$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = 0.9945624 \quad (6)$

The proposed model's classification of the labeled testing dataset is shown in Table 6. Figure 6 shows the accuracy history recorded during the training phase.
Figure 6. Accuracy chart during 300 epochs

Experimental Results Using the Feature Selection Method
The experiments were carried out using a feature selection technique called SelectKBest, a class of the sklearn.feature_selection module in the Python programming language, used for feature selection/dimensionality reduction. The method scores every feature with a univariate statistical test (options include the F-test, mutual information, and chi2) and keeps the k highest-scoring features, where k is set by the user; here the chi2 statistic was used as the score function. These experiments were implemented in two stages. The first stage was used to determine the smallest number of features that could be selected, and also to determine an adequate dropout rate; the results are shown in Table 8. During this stage, the learning rate was 0.1, the first hidden layer had 100 ReLU units, and the second hidden layer had 100 sigmoid units. From these results, the ninth configuration was chosen, which fixes the final setting of the proposed model. The accuracy of the proposed model reached 99.27%. Table 9 shows the confusion matrix resulting from this experiment, Table 10 shows how the proposed DNN model classifies the labeled testing data, and Table 11 shows the F-measure, recall, and precision values for the proposed model. Figure 7 shows the accuracy chart of the proposed model using 36 features during the training phase.
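A sketch of the described SelectKBest step, with k = 36 as in the reported setting (the data here is a toy stand-in for the encoded NSL-KDD features; chi2 requires non-negative inputs, hence the scaling):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the encoded NSL-KDD features: 41 columns, 5 labels.
rng = np.random.default_rng(0)
X = rng.random((1000, 41))
y = rng.integers(0, 5, size=1000)

X_scaled = MinMaxScaler().fit_transform(X)        # chi2 needs non-negative values
selector = SelectKBest(score_func=chi2, k=36)     # keep the 36 highest-scoring features
X_selected = selector.fit_transform(X_scaled, y)
print(selector.get_support(indices=True))         # indices of the retained features
```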

Figure 7. The classifier accuracy chart during training
Although this model's accuracy of 99.27% is lower than the previous 99.45%, this model is the more robust one. Looking at Fig. 6, the training chart of the model with all 41 NSL-KDD features, the training and validation curves are noisy as they rise over the epochs, whereas in Fig. 7, for the model trained with the feature selection method (36 data features), the curves are smoother. The training phase also takes less time with feature selection than without it: the average time per epoch was 11 seconds without feature selection and 9 seconds with it. This means the time consumed to train the model without feature selection was 300 epochs × 11 seconds = 3,300 seconds for an accuracy of 99.45%, while with feature selection it was 250 epochs × 9 seconds = 2,250 seconds for an accuracy of 99.27%.