1 Introduction

In today’s large-scale computer networks, an immense amount of flow data is observed each day, and some of these records do not conform to normal network behavior. Some of them are malicious and can cause serious damage to the network. Therefore, it is important to sift through traffic data and detect anomalous events as they occur so that corrective action can be taken in time. Network anomalies account for a large fraction of Internet traffic and compromise the performance of network resources [1]. Possible reasons for traffic anomalies are changes in the network topology (e.g., newly connected hosts, routing changes) or in network usage (e.g., changed customer behavior, new applications). Anomalies may also be caused by failures of network devices as well as by malicious worm or attack traffic [2]. The early detection of such events is of particular interest as they may impair the safe and reliable operation of the network.

Network anomaly detection aims to detect patterns in network traffic data that do not conform to an established normal behavior [3], and it has become an important area of both commercial interest and academic research. Applications of anomaly detection typically stem from the perspectives of network monitoring and network security. In network monitoring, a service provider is often interested in capturing network characteristics such as heavy flows that use a link with a given capacity, flow size distributions, and the number of distinct flows. In network security, the interest lies in characterizing known or unknown anomalous patterns of an attack or a virus. Anomalies may waste network resources, degrade the performance of network devices and end hosts, and lead to security issues concerning all Internet users. Although network anomaly detection has been widely studied, it remains a challenging task due to the following factors:

  • In large and complicated networks, the normal behavior can be multi-modal, and the boundary between normal and anomalous events is often not precise.

  • Network attacks usually adapt continuously to evade firewalls and security filters, which makes the anomaly detection problem more difficult.

  • Network traffic data often contain noise that tends to resemble true anomalies, and this noise is difficult to remove.

In recent years, deep learning has advanced rapidly and achieved good results in many scenarios. There is a trend toward using deep learning for anomaly detection [4]. These feature learning approaches and models have been successful to a certain extent, matching or exceeding state-of-the-art techniques.

This paper presents an anomaly detection method using deep learning models, specifically a feedforward neural network (FNN) model and a convolutional neural network (CNN) model. The performance of the models is evaluated in several experiments on the popular NSL-KDD dataset [5]. The experimental results show that the FNN and CNN models not only have a strong modeling ability for network anomaly detection, but also achieve high accuracy. Compared with several traditional machine learning methods, such as J48, Naive Bayes, NB Tree, Random Forest, Random Tree and SVM, the proposed models obtain a higher accuracy and detection rate with a lower false positive rate. The deep learning models can effectively improve both the detection accuracy and the ability to identify anomaly types.

2 The Models

2.1 Feedforward Neural Networks

Feedforward neural networks (FNN) are called networks because they are typically represented by composing together many different functions. The model is associated with a graph describing how the functions are composed together. For example, we might have three functions \(f^{(1)}\), \(f^{(2)}\), and \(f^{(3)}\) connected in a chain, to form \(f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))\). These chain structures are the most commonly used structures of neural networks. In this case, \(f^{(1)}\) is called the first layer of the network, \(f^{(2)}\) is called the second layer, and so on. The overall length of the chain gives the depth of the model. It is from this terminology that the name deep learning arises. The final layer of a feedforward network is called the output layer.
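The chain structure described above can be sketched in a few lines of Python. This is only an illustration of function composition: the three layer functions and their constants are hypothetical and are not the layers used in our experiments.

```python
import math

def f1(x):
    # First layer: a hypothetical affine map followed by a ReLU nonlinearity.
    return [max(0.0, 2.0 * v - 1.0) for v in x]

def f2(x):
    # Second layer: another illustrative elementwise transformation.
    return [math.tanh(v) for v in x]

def f3(x):
    # Output layer: reduce the hidden representation to a single score.
    return sum(x)

def f(x):
    # The full model is the chain f3(f2(f1(x))); its depth is 3.
    return f3(f2(f1(x)))

print(f([0.0, 1.0, 2.0]))
```

Each function only sees the output of the previous one, which is why the intermediate results of f1 and f2 play the role of hidden representations.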

During neural network training, we drive \(f(x)\) to match \(f^*(x)\). The training data provide noisy, approximate examples of \(f^*(x)\) evaluated at different training points. Each example x is accompanied by a label \(y \approx f^*(x)\). The training examples specify directly what the output layer must do at each point x: it must produce a value that is close to y. The behavior of the other layers is not directly specified by the training data. The learning algorithm must decide how to use these layers to produce the desired output, but the training data do not say what each individual layer should do. Because the training data do not show the desired output for each of these layers, they are called hidden layers.

2.2 Convolutional Neural Network

A convolutional neural network (CNN) consists of an input layer, an output layer, and multiple hidden layers; see Fig. 1 for an example. The hidden layers of a CNN typically consist of convolutional layers, pooling layers, fully connected layers and normalization layers. Describing the operation as a convolution is a convention in neural networks; mathematically, it is a cross-correlation rather than a convolution.

Fig. 1. Convolutional Neural Network

Convolutional. Convolutional layers apply a convolution operation to the input data and pass the result to the next layer. The convolution emulates the response of an individual neuron to stimulation. Each convolutional neuron processes data only for its receptive field. Although fully connected feedforward neural networks can be used to learn features and classify data, a very large number of neurons would be necessary when designing a deep architecture. The convolution operation offers a solution to this problem because it reduces the number of free parameters, allowing the network to be deeper with fewer parameters. In this way, it helps alleviate the vanishing or exploding gradient problem encountered when training traditional multi-layer neural networks with many layers by backpropagation.
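To make the distinction between cross-correlation and convolution concrete, the following minimal sketch implements both in plain Python for a single-channel input without padding or stride. The function names and the toy input are our own illustrative choices.

```python
def cross_correlate2d(image, kernel):
    """Valid-mode 2D cross-correlation: slide the kernel without flipping it."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Each output value depends only on a kh x kw receptive field.
            out[i][j] = sum(image[i + a][j + b] * kernel[a][b]
                            for a in range(kh) for b in range(kw))
    return out

def convolve2d(image, kernel):
    """True convolution: cross-correlation with the kernel flipped on both axes."""
    flipped = [row[::-1] for row in kernel[::-1]]
    return cross_correlate2d(image, flipped)

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]

print(cross_correlate2d(image, kernel))  # what a "convolutional" layer computes
print(convolve2d(image, kernel))         # the mathematical convolution
```

Because the kernel weights are learned, the flip makes no practical difference to a trained network, which is why the conventional name is harmless.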

Pooling. Convolutional networks may include local or global pooling layers, which combine the outputs of neuron clusters at one layer into a single neuron in the next layer. Usually, we use the average of these neurons' outputs as the result.
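As a small illustration of local and global average pooling, the sketch below pools a toy 4 x 4 feature map. The window size and the input values are hypothetical.

```python
def average_pool2d(feature_map, size):
    """Local pooling: average over non-overlapping size x size windows."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, h - size + 1, size):
        row = []
        for j in range(0, w - size + 1, size):
            window = [feature_map[i + a][j + b]
                      for a in range(size) for b in range(size)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled

def global_average_pool(feature_map):
    """Global pooling: reduce the whole feature map to a single average."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)

fm = [[1, 3, 2, 4],
      [5, 7, 6, 8],
      [9, 11, 10, 12],
      [13, 15, 14, 16]]

print(average_pool2d(fm, 2))
print(global_average_pool(fm))
```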

Fully Connected. Fully connected layers connect every neuron in one layer to every neuron in another layer. In principle, this is the same as the traditional multi-layer perceptron (MLP) neural network.

Weights. CNNs share weights in convolutional layers, which means that the same filter is applied repeatedly to each receptive field in a layer. This reduces memory use and improves performance.
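The effect of weight sharing on the number of free parameters can be checked with simple arithmetic. In the sketch below, the layer sizes (a 144-value input, 100 dense units, 32 filters of size 3 x 3) are assumptions chosen for illustration only.

```python
def dense_params(n_in, n_out):
    """Fully connected layer: one weight per input-output pair plus biases."""
    return n_in * n_out + n_out

def conv2d_params(kernel_h, kernel_w, in_channels, n_filters):
    """Convolutional layer: each filter is shared across all positions."""
    return kernel_h * kernel_w * in_channels * n_filters + n_filters

# A dense layer mapping a flattened 12 x 12 input to 100 hidden units.
print(dense_params(144, 100))
# A convolutional layer with 32 filters of size 3 x 3 on a 1-channel input.
print(conv2d_params(3, 3, 1, 32))
```

The convolutional layer's parameter count does not depend on the input size, whereas the dense layer's count grows with it.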

3 The Proposed Methods

3.1 The Dataset

We used the NSL-KDD dataset [5] in our work. NSL-KDD was proposed to solve some of the inherent problems of the KDD’99 dataset. The KDD Cup dataset includes normal traffic and different kinds of attack traffic, such as DoS, Probing, user-to-root (U2R), and remote-to-local (R2L). The network traffic for training was collected over seven weeks. The KDD Cup dataset has been widely used for many years as a benchmark in the evaluation of Network Intrusion Detection Systems (NIDS). However, one of its major drawbacks is that it contains an enormous number of redundant records in both the training and testing data. It was observed that almost 78% and 75% of the records are redundant in the training and test datasets, respectively. This redundancy biases learning algorithms towards the frequent attack records and leads to poor classification results for the infrequent but harmful records.

NSL-KDD was proposed to overcome the limitations of the KDD Cup dataset, from which it is derived. It improves on the previous dataset in four ways. First, it does not include redundant records in the training set, so classifiers will not be biased towards more frequent records. Second, there are no duplicate records in the proposed testing sets; therefore, the measured performance of the learners is not biased in favor of methods that have better detection rates on the frequent records. Third, the number of selected records from each difficulty-level group is inversely proportional to the percentage of records in the original KDD dataset. As a result, the classification rates of distinct machine learning methods vary over a wider range, which makes it easier to evaluate different learning techniques accurately. Fourth, the numbers of records in the training and testing sets are reasonable, which makes it affordable to run experiments on the complete set without the need to randomly select a small portion. Consequently, evaluation results of different research works will be consistent and comparable.

The dataset contains a training set, KDDTrain+, and two testing sets, KDDTest+ and KDDTest21. KDDTrain+ is the full NSL-KDD training set with attack-type labels in CSV format, and KDDTest+ is the full NSL-KDD testing set with attack-type labels in CSV format. In addition, the dataset providers analyzed the difficulty level of the records in the KDD dataset. Surprisingly, about 98% of the records in the training set and 86% of the records in the testing set were correctly classified by all 21 learners. They therefore deleted from KDDTest+ the examples that were correctly detected by all 21 learners to form another testing set, KDDTest21.

Each record in the NSL-KDD dataset consists of the 41 features listed in Table 1 and is labeled as either normal or a particular kind of attack. These features include basic features derived directly from a TCP/IP connection, traffic features accumulated over a window (either a time interval, e.g. two seconds, or a number of connections), and content features extracted from the application-layer data of connections. Among the 41 features, three are nominal, four are binary, and the remaining 34 are continuous. The training data contain 23 traffic classes: 22 attack classes and one normal class. The testing data contain 38 traffic classes: 21 attack classes from the training data, 16 novel attacks, and one normal class.

Table 1. Features in NSL-KDD Dataset

The NSL-KDD records also fall into five categories: Normal, DoS, Probe, user-to-root (U2R), and remote-to-local (R2L). Table 2 shows the number of samples in each category. The training set contains 125973 samples in total: 67343 Normal, 45927 DoS, 995 R2L, 52 U2R and 11656 Probe samples. The testing set contains 22544 samples in total: 9711 Normal, 7458 DoS, 2754 R2L, 200 U2R and 2421 Probe samples.
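The class imbalance implied by these counts can be made explicit with a short calculation. The sketch below uses only the sample counts quoted above; the helper name `shares` is our own.

```python
# Sample counts copied from the text above (Table 2).
train = {"Normal": 67343, "DoS": 45927, "R2L": 995, "U2R": 52, "Probe": 11656}
test = {"Normal": 9711, "DoS": 7458, "R2L": 2754, "U2R": 200, "Probe": 2421}

def shares(counts):
    """Return each category's fraction of the total number of samples."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

train_shares = shares(train)
for name in train:
    print(f"{name:>6}: train {train[name]:>6} ({train_shares[name]:.4%}), "
          f"test {test[name]:>5}")

# Ratio of testing to training samples for the rare R2L category.
r2l_ratio = test["R2L"] / train["R2L"]
print(f"R2L test/train ratio: {r2l_ratio:.2f}")
```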

Table 2. Anomaly statistics in the training and testing sets

3.2 Data Preprocessing

Numericalization. There are 38 numeric features and 3 non-numeric features in the NSL-KDD dataset. Because the input to our models must be a numeric matrix, we convert the non-numeric features ‘protocol_type’, ‘service’ and ‘flag’ into numeric form. For instance, the feature ‘protocol_type’ has three attribute values, ‘tcp’, ‘udp’, and ‘icmp’, which we encode as the binary vectors (1,0,0), (0,1,0) and (0,0,1). Similarly, the feature ‘service’ has 70 attribute values, and the feature ‘flag’ has 11 attribute values. With this encoding, the 41-dimensional features map into 122-dimensional features (38 + 3 + 70 + 11). For the convenience of our CNN model, we append 22 zeros to each record, giving 144 features in total, which form a 12 * 12 matrix.
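The encoding and padding steps can be sketched as follows. This is a minimal illustration, not our preprocessing code: the helper names are hypothetical, and the record is a placeholder rather than real NSL-KDD data.

```python
PROTOCOL_TYPES = ["tcp", "udp", "icmp"]  # the three values named in the text

def one_hot(value, categories):
    """Encode a nominal value as a binary indicator vector."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]

def pad_to_square(features, side=12):
    """Append zeros up to side*side values, then split into `side` rows."""
    if len(features) > side * side:
        raise ValueError("too many features for the target matrix")
    padded = list(features) + [0.0] * (side * side - len(features))
    return [padded[r * side:(r + 1) * side] for r in range(side)]

# Dimension bookkeeping: 38 numeric + 3 protocol + 70 service + 11 flag.
encoded_dim = 38 + len(PROTOCOL_TYPES) + 70 + 11

# Placeholder record: real values would come from the preprocessed dataset.
record = [0.5] * encoded_dim
matrix = pad_to_square(record)

print(one_hot("udp", PROTOCOL_TYPES))
print(encoded_dim, len(matrix), len(matrix[0]))
```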

Normalization. First, for features such as ‘duration [0, 58329]’ and ‘src_bytes [0, \(1.3*10^9\)]’, where the difference between the maximum and minimum values is very large, we apply logarithmic scaling to reduce the range. Second, the value of every feature is mapped linearly to the [0,1] range according to the formula below, where Max denotes the maximum value and Min denotes the minimum value of each feature.

$$x_i = \frac{x_i-Min}{Max-Min}$$
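The two normalization steps can be sketched for a single feature column as follows. The exact form of the logarithmic scaling is not specified above, so the use of log(1 + x) here is an assumption, and the sample values are illustrative.

```python
import math

def log_scale(values):
    # Assumption: log(1 + x), which keeps zero at zero for non-negative features.
    return [math.log1p(v) for v in values]

def min_max(values):
    """Map values linearly to [0, 1] using the column minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Illustrative 'duration' values spanning the range quoted in the text.
duration = [0, 2, 120, 58329]
scaled = min_max(log_scale(duration))
print(scaled)
```

Applying the logarithm first spreads out the small values, which would otherwise be compressed near zero by the linear mapping.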

3.3 The Proposed Models

We designed several feedforward neural network models to test their performance, including models with one hidden layer and with two hidden layers. In the one-hidden-layer case, we tried models with 100, 80 and 60 nodes in the hidden layer. The models with two hidden layers were designed with 100/80, 80/60 and 60/40 nodes in the first/second hidden layer. Furthermore, we varied the learning rate to observe its effect on performance. These models are described in detail in Sect. 4.

We also designed a CNN model to detect abnormal network traffic. The training of the CNN model consists of two parts: forward propagation and back propagation. Forward propagation is responsible for calculating the output values. Back propagation is responsible for passing back the accumulated residuals to update the weights, which is not fundamentally different from ordinary neural network training.

Here, we first reshape the 144 preprocessed features into a 12 * 12 matrix as the input to our CNN model. The convolutional layers act as a feature extractor that produces feature maps, and the average of each feature map is then computed by Global Average Pooling. Finally, the resulting vector is fed into a Softmax layer for classification. This method improves the generalization ability of the network. Global Average Pooling has also been found to yield a generic localizable deep representation. A single fully-connected layer is added between the Global Average Pooling layer and the Softmax layer, and Class Activation Maps (CAM) are computed with the weights of this fully-connected layer, as shown in Fig. 2.
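The classification head described above (Global Average Pooling, a fully-connected layer, then Softmax) can be sketched in plain Python. The feature maps and weights below are toy values chosen for illustration; they are not taken from our trained model, and the convolutional layers themselves are omitted.

```python
import math

CLASSES = ["Normal", "DoS", "Probe", "R2L", "U2R"]

def gap_per_map(feature_maps):
    """Global Average Pooling: reduce each 2D feature map to its mean."""
    return [sum(v for row in fm for v in row) / (len(fm) * len(fm[0]))
            for fm in feature_maps]

def dense(x, weights, bias):
    """Fully-connected layer: one weight row and one bias per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two toy 2x2 feature maps standing in for the convolutional output.
feature_maps = [[[1.0, 1.0], [1.0, 1.0]],
                [[0.0, 2.0], [2.0, 4.0]]]
# Hypothetical weights: 5 classes x 2 pooled features.
weights = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2], [0.0, 0.1], [0.2, -0.1]]
bias = [0.0] * len(CLASSES)

pooled = gap_per_map(feature_maps)
probs = softmax(dense(pooled, weights, bias))
predicted = CLASSES[probs.index(max(probs))]
print(pooled, predicted)
```

Because each pooled value summarizes one feature map, the weights of the fully-connected layer indicate how much each map contributes to each class, which is the idea behind Class Activation Maps.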

Fig. 2. Our CNN model

4 Evaluation

4.1 The Metrics

We use accuracy, the most important performance indicator for network anomaly detection, to measure the performance of our deep learning models. In addition, we use the detection rate and the false positive rate. True Positive (TP) is the number of anomalous records that are identified as anomalous. False Positive (FP) is the number of normal records that are identified as anomalous. True Negative (TN) is the number of normal records that are identified as normal. False Negative (FN) is the number of anomalous records that are identified as normal. Table 3 shows the definition of the confusion matrix. We use the following notation:

Table 3. Confusion matrix

Accuracy (AC): the percentage of records that are classified correctly out of the total number of records, as shown below.

$$AC =\frac{TP + TN}{TP + TN + FP + FN}$$

True Positive Rate (TPR): equivalent to the Detection Rate (DR), it is the percentage of anomalous records that are correctly identified out of the total number of anomalous records, as shown below.

$$TPR = \frac{TP}{TP + FN}$$

False Positive Rate (FPR): the percentage of normal records that are incorrectly identified as anomalous out of the total number of normal records, as shown below.

$$FPR = \frac{FP}{FP + TN}$$

Hence, the goal for network anomaly detection models is to obtain a higher accuracy and detection rate with a lower false positive rate.
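The three metrics follow directly from the four confusion-matrix counts. The sketch below implements the formulas above; the counts in the example are hypothetical and are not results from our experiments.

```python
def detection_metrics(tp, tn, fp, fn):
    """Return (accuracy, true positive rate, false positive rate)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn)   # detection rate: detected anomalies / all anomalies
    fpr = fp / (fp + tn)   # false alarms / all normal records
    return accuracy, tpr, fpr

# Hypothetical counts: 100 anomalous and 100 normal records.
ac, tpr, fpr = detection_metrics(tp=80, tn=90, fp=10, fn=20)
print(f"AC={ac:.2f} TPR={tpr:.2f} FPR={fpr:.2f}")
```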

4.2 Experimental Results

In this paper, we used TensorFlow [6], one of the most popular deep learning frameworks. TensorFlow is a machine learning framework developed by Google, and its capabilities make it simpler for researchers to build the models they design. The experiments were performed on a server with an Intel Xeon E5-2630 v3 CPU (15M cache, 2.40 GHz) and 32 GB of memory. The experiments study the performance of our two models for five-category classification, namely Normal, DoS, R2L, U2R and Probe. To demonstrate the effectiveness of the proposed models, we also compare the results with several classical machine learning algorithms, namely J48, Naive Bayes, NB Tree, Random Forest, Random Tree and SVM.

FNN Model. Figure 3 shows the training process on KDDTrain+ and the testing process on KDDTest+ and KDDTest21. The accuracy rises rapidly until about 3000 epochs and gradually flattens out after about 15000 epochs. The accuracy on the training dataset is the highest, close to 1, because the model is trained directly on this dataset and captures its anomaly characteristics. The detection accuracies on KDDTest+ and KDDTest21 are about 80% and 50%, respectively. The accuracy on KDDTest21 is lower than on KDDTest+ because KDDTest21 excludes all the records that can be easily detected.

Fig. 3. The accuracy of FNN

Table 4 shows the accuracy of the different FNN models. In the five-category classification experiments, different kinds of feedforward neural networks were tested. The parameters are the number of hidden layers, the number of nodes in each layer and the learning rate. The best result, 80.34%, is obtained with 100 nodes in a single hidden layer and a learning rate of 0.5. Below 100 nodes, accuracy tends to rise as the number of nodes increases. However, when the number of nodes is increased to 120, the accuracy drops slightly. The results are less clear-cut for feedforward neural networks with two hidden layers. We could not identify an obvious rule in this case, other than slightly lower performance than with a single hidden layer. Nevertheless, a good result (80.3%) is obtained when the first hidden layer has 80 nodes, the second hidden layer has 60 nodes, and the learning rate is 0.8.

Table 4. The accuracy with different FNN models
Table 5. The accuracy with different CNN models

CNN Model. Table 5 shows the detection results for the CNN model. Because of hardware limitations, we could not build a CNN with enough layers. As a result, we tried only a simple model with dropout, which reaches an accuracy of 77.8%. We also find that the accuracy of the CNN model is insensitive to the kernel size, the number of nodes in the fully connected layer and the learning rate. Compared with the FNN model, the accuracy of the CNN is lower. One possible reason is that the relationships among the features in the dataset may not be strong. Another possible reason is that the network is not deep enough.

4.3 Comparison with Other Methods

To put our results in context, we used scikit-learn, a well-known classical machine learning framework, to implement J48, Naive Bayes, NB Tree, Random Forest, Random Tree and SVM on the same benchmark dataset. The results of the experiment are shown in Fig. 4. Compared with these classical machine learning methods, the deep learning models show a clear improvement. The classical machine learning algorithms generally achieve about 75% accuracy, and the best of them, NB Tree, reaches 75.22%. In contrast, the deep learning models achieve at least 77.8% accuracy, and the best one reaches 80.34%.

Fig. 4. Comparison of different machine learning algorithms

Table 6 shows the results for the feedforward neural network on the testing set KDDTest+ in the five-category classification experiments. The model achieves good results in recognizing DoS and Probe, but not in the other two categories. Looking at the dataset statistics in Table 2, we find that U2R has few samples in both the training set and the testing set, so its accuracy is greatly affected. R2L has enough samples in the testing set but few in the training set: the testing set contains about 3 times as many R2L samples as the training set. A small amount of training data cannot produce a model with sufficient generalization capability, resulting in a low detection rate. Furthermore, the FPRs of R2L and U2R are both very low, which suggests that the models identify well the subclasses seen in the training set, and that the subclasses they fail to identify may be those with no or only a few samples in the training set. As a result, deep learning models are practical for network anomaly detection.

Table 6. The detection results for different anomaly types

5 Discussion

Based on the same benchmark, using KDDTrain+ as the training set and KDDTest+ and KDDTest21 as the testing sets, the experimental results show that both network anomaly detection models have higher accuracy than the other machine learning methods. Nevertheless, there is still considerable room for improvement in network anomaly detection using deep learning.

First of all, we should tailor the model to a specific task. As Robin Sommer mentions [7], we should keep the scope narrow. Due to the diversity of network traffic, we should not expect to design a single model that can detect all types of anomalies. Instead, models that detect specific anomalies in a specific environment should be established. These models may not require strong generalization and transfer capabilities, but we can draw on the ideas of boosting and bagging, that is, combine and integrate these models into one model that detects different anomalies in one environment; this can compensate for the weak generalization ability to a certain extent. For example, we can design different models to detect DoS anomalies and U2R anomalies, and then apply these models sequentially to obtain a result for each anomaly. Finally, we combine the two results to obtain the final result. We believe this could achieve a better result than a general model.

Meanwhile, we could not test a sufficiently deep CNN model because of hardware limitations. Extensive practice in different areas has confirmed the effectiveness of CNNs. A CNN avoids explicit feature extraction and learns implicitly from the training data, which allows it to learn the inherent relationships among features. Since anomaly detection is essentially a classification problem, we are confident that the CNN will show better performance once more capable hardware is available.

6 Related Work

Network anomaly detection is an important and dynamic research area. Many network anomaly detection methods and systems have been proposed in the literature.

Duffield et al. [8] proposed rule-based anomaly detection on IP networks that correlates packet-level and flow-level information. Cherkasova et al. [9] presented an integrated framework that uses regression-based transaction models and application performance signatures to detect anomalous application behavior. Sharma et al. [10] used auto-regressive models and an approach based on time-invariant relationships to detect faults. Pannu et al. [11] presented an adaptive anomaly detection framework that can self-adapt by learning from observed anomalies at runtime. Tan et al. presented two anomaly prediction systems, PREPARE [12] and ALERT [13], that integrate online anomaly prediction, learning-based cause inference, and predictive prevention actuation to minimize the performance anomaly penalty without human intervention. They also investigated the anomalous behavior of three datasets [14]. Bronevetsky et al. [15] designed a novel technique that combines classification algorithms with information on the abnormality of application behavior to improve detection. Gu et al. [16] developed an attack detection system, LEAPS, based on supervised statistical learning to classify benign and malicious system events. Tati et al. [17] proposed an algorithm based on a greedy approach to efficiently diagnose large-scale clustered failures in computer networks. In addition, several works focus on anomaly/fault pattern analysis. Birke et al. [18] conducted a failure pattern analysis on 10K virtual and physical machines hosted in five commercial datacenters over an observation period of one year. Their objective is to establish a sound understanding of the differences and similarities between failures of physical and virtual machines. Rosa et al. [19] studied three types of unsuccessful executions in traces of a Google datacenter, namely fail, kill, and eviction. Their objective is to identify the resource waste, impact on application performance, and root causes of these executions.

Most recently, with the fast development of deep learning, there is a trend toward using deep learning for network anomaly detection. Maimó et al. [20] used a deep learning method for anomaly detection in 5G networks. Tang et al. [21] applied a deep learning approach to flow-based anomaly detection in an SDN environment. They use only six basic features (which can be easily obtained in an SDN environment) taken from the forty-one features of the NSL-KDD dataset. Yin et al. [22] proposed a deep learning approach for intrusion detection using recurrent neural networks (RNN). Javaid et al. [23] used Self-taught Learning (STL), a deep learning based technique, to develop a Network Intrusion Detection System. Roy et al. [24] used a Deep Neural Network as a classifier for different types of intrusion attacks and carried out a comparative study with the Support Vector Machine (SVM). Li et al. [25] proposed an image conversion method for NSL-KDD data and evaluated its performance in binary classification experiments.

In comparison, this paper applies a feedforward neural network model and a convolutional neural network model to network anomaly detection and studies the impact of different model parameters. We also perform a detailed comparison of detection accuracy with several traditional machine learning methods and demonstrate that the deep learning based detection models achieve better accuracy.

7 Conclusion

This paper presents a new anomaly detection method based on deep learning models, specifically a feedforward neural network (FNN) model and a convolutional neural network (CNN) model. The performance of the models is evaluated in several experiments on the popular NSL-KDD dataset. The experimental results show that the FNN and CNN models not only have a strong modeling ability for network anomaly detection, but also achieve relatively high accuracy. Compared with several traditional machine learning methods, such as J48, Naive Bayes, NB Tree, Random Forest, Random Tree and SVM, the proposed models obtain a higher accuracy and detection rate with a lower false positive rate. The deep learning models can effectively improve both the detection accuracy and the ability to identify anomaly types. With the continuous advancement of deep learning technology and the development of hardware accelerators, we believe better performance can be achieved in network anomaly detection by using deep learning methods in the future.