1 Introduction

In recent years, more and more entities have stored their valuable information in places reachable through networks, which makes them potential victims of malicious applications (malware). Malware attacks have also increased greatly in both quantity and variety. Malware detectors based on signature databases [1, 6, 17] or static analysis [14] face increasing difficulty because they are often vulnerable to obfuscation methods [11]. Many researchers have therefore turned to dynamic analysis and developed algorithms that identify malicious programs through their behaviors. In dynamic analysis, system API call sequences are most frequently used to represent the behaviors of programs. Data mining and traditional machine learning methods are often employed to handle malware detection tasks based on API call sequences [7, 15, 19]. These methods usually require low-dimensional statistical features as the input, so expertise-based feature engineering is necessary. These requirements lead to a bottleneck in accuracy.

With the development of deep learning, many new models have been proposed to detect malware based on raw API call sequences, and most of them use recurrent neural network (RNN) models [12, 13, 16, 18]. These RNN-based models reach better accuracy than data mining and traditional machine learning methods, but challenges remain. The recurrent architecture inevitably limits parallelization and makes the receptive field size on input sequences uncertain. Furthermore, recently published models have become increasingly complicated and are combined with various other analysis methods. We therefore propose a Temporal Convolutional Network with ATTention (TCN-ATT) architecture, a relatively simple and effective non-recurrent model, to detect malware based on API call sequences. Because the length of API call sequences varies from 100 to 20000, whole sequences are not suitable inputs for either recurrent or convolutional models, so a split-and-combine method is proposed in the TCN-ATT model.

Our contributions are:

  • For the first time, TCN is introduced to malware detection based on API call sequences (Sect. 3.2) bringing considerable accuracy and high efficiency. To further improve the accuracy, a specifically designed attention layer is employed in our architecture (Sect. 3.3).

  • We propose a sequence splitting method together with a corresponding loss function for the detection task, in view of the characteristics of API call sequences (Sects. 3.1 and 3.4). They control the model size and maintain the accuracy no matter how the length of input sequences varies. They can also be applied to other models for sequence-based malware detection tasks.

  • We give a deduplication method to improve the performance of our model as well as other sequence-based models (Sect. 2.2). We define two parameters for the method to control the deduplication intensity. This method helps to reduce redundancy while retaining some repetition behavior information.

2 Data Preprocessing

Before introducing the TCN-ATT model, we briefly describe the procedures applied before the model, including the content of input sequences and the proposed deduplication method. Firstly, API call sequences are extracted from programs by running them in virtual environments. Then the sequence data are preprocessed to make them fit for the model. Finally, the preprocessed data are fed to the TCN-ATT model, which decides whether the program is benign or malicious.

2.1 Malware Behavior Representation

For each executable file, an API call sequence is extracted from a sandbox to represent its behavior, and the function name of each API call is used. Each API function is represented by a specific integer, which is finally transformed to a one-hot vector in the training and testing steps.
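
The encoding described above can be sketched as follows. This is only an illustration: the helper names `build_vocab` and `one_hot` are hypothetical, the API names in the sample trace are examples rather than data from our dataset, and assigning ids in first-seen order is an assumption (any fixed mapping would do).

```python
def build_vocab(sequences):
    """Map each distinct API function name to an integer id (first-seen order, an assumption)."""
    vocab = {}
    for seq in sequences:
        for name in seq:
            if name not in vocab:
                vocab[name] = len(vocab)
    return vocab


def one_hot(seq, vocab):
    """Turn a sequence of API names into an n x m list of one-hot rows."""
    m = len(vocab)
    rows = []
    for name in seq:
        row = [0] * m
        row[vocab[name]] = 1
        rows.append(row)
    return rows


# Illustrative trace only; not taken from the dataset.
trace = ["NtOpenFile", "NtReadFile", "NtClose", "NtOpenFile"]
vocab = build_vocab([trace])
encoded = one_hot(trace, vocab)  # 4 rows, each of width m = 3
```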

2.2 Duplicate API Sequences Processing

After analyzing API call sequences, we find that one API is often called multiple times consecutively, and this kind of repetition also happens to some subsequences consisting of several APIs. This happens in both benign and malicious software, because a program sometimes performs similar tasks consecutively. In order to reduce the length of sequences fed to the model, Kolosnjaji et al. [10] and Xiaofeng et al. [18] mentioned some methods to remove consecutive identical API functions in sequences. However, [10] does not consider the repetition of an API group, and [18] simply removes all the duplicates. In this paper we propose a deduplication method which is similar to theirs but more flexible.

In our consideration, the duplicates of an API call subsequence pattern should be reduced to avoid information redundancy but should not be totally removed because the repetition itself contains program behavior information. So, we define two parameters for the duplicate reducing method: \(l_m\), the max length of a target pattern; k, the max number of consecutive duplicates kept for a pattern. For example, given a sequence \(A_1 A_2 A_3 A_1 A_2 A_3 A_1 A_2 A_3\): when we set \(l_m = 3\) and \(k = 2\), the de-duplicated sequence is \(A_1 A_2 A_3 A_1 A_2 A_3\); when we set \(l_m = 2\) and \(k = 2\), \(A_1 A_2 A_3\) is not regarded as a pattern with \(len(A_1 A_2 A_3 )=3>l_m\) and therefore this sequence stays unchanged after deduplication.

As described above, this design removes less valuable duplicates and keeps some repeating behavior information. This deduplication method is shown to improve the accuracy of models in the experiments of Sect. 4.2.
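
A minimal sketch of such a deduplication pass is given below. The text above fixes only the two parameters \(l_m\) and k, so the scanning strategy here is an assumption: the function `deduplicate` (a hypothetical name) scans greedily from left to right and, at each position, picks the pattern of length at most \(l_m\) whose consecutive repetitions cover the longest run.

```python
def deduplicate(seq, l_m, k):
    """Keep at most k consecutive copies of any repeated pattern of length <= l_m.

    Greedy left-to-right scan; the tie-breaking rule (longest covered run wins)
    is an assumption, not something specified in the text.
    """
    out = []
    i = 0
    n = len(seq)
    while i < n:
        best_len, best_reps = 1, 1
        for l in range(1, l_m + 1):
            pattern = seq[i:i + l]
            if len(pattern) < l:
                break  # not enough elements left for a pattern of this length
            reps = 1
            while seq[i + reps * l:i + (reps + 1) * l] == pattern:
                reps += 1
            if reps > 1 and reps * l > best_len * best_reps:
                best_len, best_reps = l, reps
        kept = min(best_reps, k)
        out.extend(seq[i:i + best_len] * kept)
        i += best_len * best_reps
    return out


seq = ["A1", "A2", "A3"] * 3
print(deduplicate(seq, l_m=3, k=2))  # two copies of A1 A2 A3 remain
print(deduplicate(seq, l_m=2, k=2))  # unchanged: the pattern is longer than l_m
```

On the example sequence from the text, this sketch reproduces both stated outcomes.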

3 TCN-ATT Model

As shown in Fig. 1, the whole preprocessed sequence from each sample is first split into several subsequences of a fixed size. Each subsequence is fed into a network containing the TCN module [2], an attention layer and a fully connected (FC) layer, which produces a sub-prediction for the subsequence. Finally, in the training phase these sub-predictions are fed into a task-specific loss function to train the whole model, and in the prediction phase they are combined by specific rules to give a sample-level prediction.

Fig. 1. Process of each sample’s sequence in the TCN-ATT model

3.1 Sequence Splitting

The length of dynamically extracted API call sequences is usually large and varies considerably among samples, even after deduplication. For both the TCN-ATT model and recurrent models, feeding each whole sequence into the model without splitting does not lead to good results. So we split each sequence into parts of a fixed size, and the input of the TCN model is the subsequences instead of the whole API call sequence. Sequence splitting has the further benefit that the size of the model depends only on the subsequence length, regardless of the whole sequence size. A proper subsequence length n for the model is chosen by experiments, which are described in Sect. 4.4.

Although this method solves the issue of sequence length variation, it makes it harder to combine partial results into a sample-level classification. Therefore, a task-specific loss for the training phase and a result-combining method for the prediction phase are designed to work with the splitting method, as described in Sect. 3.4.
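
The splitting step can be sketched as below. The function name `split_sequence` is hypothetical, and right-padding the last window with a reserved id (`pad_id`) is an assumption; the text does not state how a trailing partial window is handled.

```python
def split_sequence(seq, window, pad_id=0):
    """Cut seq into consecutive, non-overlapping windows of fixed size.

    The last window is right-padded with pad_id (an assumption: the padding
    scheme is not specified in the text).
    """
    chunks = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])
        chunk += [pad_id] * (window - len(chunk))
        chunks.append(chunk)
    return chunks


# Seven API ids split with a window of 3 give three subsequences.
parts = split_sequence([1, 2, 3, 4, 5, 6, 7], window=3)
```

Because every window has the same length, the network that consumes them has a fixed input size regardless of how long the original trace was.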

3.2 Temporal Convolutional Network

RNN models have a bottleneck in accuracy and efficiency when faced with massive data due to their recurrent structure. So we introduce a non-recurrent model to take the place of RNNs.

Temporal Convolutional Network (TCN) is a network architecture proposed by Bai et al. [2]. This fully convolutional network produces an output of the same length as the input, similar to RNNs. With the utilization of dilated convolutions and residual connections, the TCN is able to allow very long effective history with a rather shallow network. And it is worth mentioning that the ability is of great importance to a malware detector based on API call sequences.

A simple convolution only has a receptive field with size linear in the depth of the network. This makes it challenging to apply it to sequence tasks, especially those requiring longer history. So dilated convolutions are employed to enable a larger receptive field with size exponential in the depth of the network. To put it formally, for a 1-D sequence input \( \varvec{x} \in \mathbb {R}^n \) and a convolutional filter \( f: \{0, ..., k-1\} \rightarrow \mathbb {R} \), the dilated convolution operation F on the element at index s of the sequence is defined as:

$$\begin{aligned} F(s)=(\varvec{x} *_d f)(s)=\sum ^{k-1}_{i=0}f(i)\cdot \varvec{x}_{s-d\cdot i} \end{aligned}$$
(1)

where d is the dilation factor, k is the filter size, and each \((s - d \cdot i)\) is the index of an element from the ‘past’ part of the input \( \varvec{x} \). Using this type of convolution, the effective history of one such layer is \((k-1)d\). Furthermore, d is increased exponentially with the depth of the network (i.e., \(d=O(2^i)\) at level i).
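
Equation (1) can be checked on a small input with the sketch below. The function name `dilated_causal_conv` is hypothetical, and treating indices before the start of the sequence as zero (causal zero padding) is an assumption made so that the output has the same length as the input.

```python
def dilated_causal_conv(x, f, d):
    """Compute F(s) = sum_i f[i] * x[s - d*i] for every index s.

    Indices s - d*i that fall before the start of x contribute zero
    (zero padding on the left, an assumption).
    """
    k = len(f)
    out = []
    for s in range(len(x)):
        total = 0.0
        for i in range(k):
            j = s - d * i
            if j >= 0:
                total += f[i] * x[j]
        out.append(total)
    return out


# With f = [1, 1] and d = 2, each output is x[s] + x[s - 2].
y = dilated_causal_conv([1, 2, 3, 4, 5], f=[1, 1], d=2)
```

Each output depends only on the current element and elements up to \((k-1)d\) positions back, which matches the effective history stated above.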

With the above designs, the TCN model can take similar inputs and produce similar outputs to RNNs while remaining efficient thanks to its convolutional architecture. In our architecture, we use the sequence-to-sequence (seq2seq) mode of TCN. To be specific, it works as described below.

For a 1-D sequence input \( \varvec{x} \in \mathbb {R}^n \) representing an API call sequence, \( \varvec{x}' \) is its corresponding one-hot-encoded form with a size of \(n\times m\), where n is the length of the sequence and m is the total number of involved API functions. Then \( \varvec{x}' \) is fed into the TCN and the module finally produces an output \( \varvec{H} \) with a size of \(n\times c\), where c is the size of the output feature that TCN produces for each time step of the sequence. We choose this mode because we expect the network to produce more informative results and to be more interpretable with the attention layer described in the next subsection.

3.3 Attention Layer

Models with attention mechanisms are now the state of the art for multiple tasks [3]. An attention mechanism is usually able to improve the performance as well as the interpretability of various models. In the TCN-ATT model, we also design an attention layer between the TCN and the FC layer. This attention layer helps to reduce the size of the feature matrix produced by the TCN while keeping the important information in it, and therefore improves the model performance. For the aforementioned output \( \varvec{H} \in \mathbb {R}^{n \times c} \), the operation of the attention layer is:

$$\begin{aligned} \varvec{\alpha }= softmax\left( tanh(\varvec{H}) \varvec{\mu }^T \right) \end{aligned}$$
(2)
$$\begin{aligned} \varvec{w} = \varvec{\alpha }^T \varvec{H} \end{aligned}$$
(3)

where \( \varvec{\mu }\in \mathbb {R}^{1 \times c} \) is the attention factor that we expect the model to learn, and \( \varvec{w} \in \mathbb {R}^{1 \times c} \) is the final vector that represents the feature of the input \( \varvec{x} \). We can see \( \varvec{\alpha }\in \mathbb {R}^{n \times 1} \) as a coefficient vector calculated from \( \varvec{H} \) and \( \varvec{\mu }\) representing the importance of feature vectors produced from all the n time steps.

Through this attention layer, the original feature \( \varvec{H} \) with size \(n\times c\) is compressed to the final feature \( \varvec{w} \) with size of \(1 \times c\). Furthermore, we can analyze the importance of subsequences for an input \( \varvec{x} \) via the calculated \( \varvec{\alpha }\).
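
Equations (2) and (3) can be sketched in plain Python as follows. The function name `attention_pool` is hypothetical, the values of H and mu are made up for illustration, and subtracting the maximum score before exponentiating is a standard numerical-stability step that does not change the softmax result.

```python
import math


def attention_pool(H, mu):
    """Return (alpha, w) with alpha = softmax(tanh(H) mu^T) and w = alpha^T H.

    H is a list of n rows of length c; mu is a list of length c.
    """
    scores = [sum(math.tanh(h) * m for h, m in zip(row, mu)) for row in H]
    peak = max(scores)  # stabilises exp(); softmax is shift-invariant
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    c = len(mu)
    w = [sum(alpha[t] * H[t][j] for t in range(len(H))) for j in range(c)]
    return alpha, w


# Illustrative values only: n = 3 time steps, c = 2 features.
H = [[0.0, 0.0], [1.0, -1.0], [2.0, 0.5]]
mu = [1.0, 0.5]
alpha, w = attention_pool(H, mu)  # alpha has length 3 and sums to 1; w has length 2
```

When mu is all zeros, every time step receives the same weight and w reduces to the column-wise mean of H, which is a convenient sanity check.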

3.4 Task-Specific Loss

Each subsequence \( \varvec{x} \) with fixed length n is transformed to a feature vector \( \varvec{w} \in \mathbb {R}^{1 \times c} \) through the TCN model and the attention layer. Then \( \varvec{w}\) is fed into an FC layer and a softmax layer to produce a prediction result \( \varvec{p} \in \mathbb {R}^2 \), and \(\hat{y} =argmax(\varvec{p})\) is the label that the model predicts (0 as benign and 1 as malicious). For each sequence sample, which is split into a set of subsequences \(\varvec{X}=\{\varvec{x}_1,\varvec{x}_2,...,\varvec{x}_k\}\), the model will give a set of predictions \(\varvec{P}=\{\varvec{p}_1,\varvec{p}_2,...,\varvec{p}_k \}\) and \( \hat{Y}=\{\hat{y}_1, \hat{y}_2, ..., \hat{y}_k\} \). While in other classification tasks the model is usually expected to make all sub-predictions of one sample close to the ground truth label, it works differently in this task.

A piece of malware may not behave maliciously all the time, and the extracted API call sequences can also have benign parts. Thus, the part-combining procedure for this task is different. In the prediction phase, we regard a sample with all subsequences predicted as 0 as benign, and a sample with at least one subsequence predicted as 1 as malicious. Under this consideration, we should avoid pushing the model too hard and allow the model to produce some benign sub-predictions for malicious samples. Therefore, we give a task-specific loss function \( L(y, \varvec{P}) \) to calculate the total loss of a prediction set \(\varvec{P}=\{\varvec{p}_1,\varvec{p}_2,...,\varvec{p}_k \}\) from a sample s:

$$\begin{aligned} \begin{aligned} L(y, \varvec{P})\,=\,&\chi _{y=0 | \max \hat{Y} = 0}\sum ^k_{i=1}L(y, \varvec{p}_i) \\&+\left( 1-\chi _{y=0 | \max \hat{Y} = 0} \right) \sum ^k_{i=1} \left( \left( 1-\beta \cdot \chi _{\hat{y}_i = 0} \right) L(y, \varvec{p}_i) \right) \end{aligned} \end{aligned}$$
(4)
$$\begin{aligned} L(y, \varvec{p}_i) = -\left( y\log \varvec{p}_{i, 1} + (1-y)\log \varvec{p}_{i, 0} \right) \end{aligned}$$
(5)
$$\begin{aligned} \hat{y}_i =argmax(\varvec{p}_i) \end{aligned}$$
(6)
$$\begin{aligned} \hat{Y}=\{\hat{y}_1, \hat{y}_2, ..., \hat{y}_k\} \end{aligned}$$
(7)

where y is the ground truth label of the sample, \( \hat{y}_i \) is the predicted label for the i-th subsequence, \( \varvec{p}_{i, 0} \) and \( \varvec{p}_{i, 1} \) are the predicted probabilities of the benign and malicious classes, and \( \beta \in [0, 1] \) is a hyper-parameter designed to reduce the loss from specific subsequences. \( \chi \) is an indicator function: \( \chi _{g} \) equals 1 when g is true and 0 otherwise, and | in its subscript denotes logical or. To put it simply, \(L(y, \varvec{P})\) equals \(\sum ^k_{i=1}L(y, \varvec{p}_i)\) when the sample is labeled benign or when the model predicts no subsequence of a malicious sample to be malicious. When some subsequences of a malicious sample are predicted to be malicious, the hyper-parameter \(\beta \) scales down the penalty on the benign sub-predictions, since they may actually be correct.
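
The loss and the prediction-phase combining rule can be sketched as follows. The function names are hypothetical, and the sketch assumes that each prediction is a pair whose index 0 is the benign probability and index 1 the malicious probability, with the per-subsequence term being the standard cross-entropy on the true class.

```python
import math


def cross_entropy(y, p):
    """Cross-entropy of one sub-prediction; assumes p[0] = benign, p[1] = malicious."""
    return -math.log(p[y])


def sample_loss(y, P, beta):
    """Task-specific loss over all sub-predictions P of one sample with label y."""
    y_hat = [0 if p[0] >= p[1] else 1 for p in P]  # argmax, ties resolved to 0
    if y == 0 or max(y_hat) == 0:
        # Benign sample, or malicious sample with no malicious sub-prediction:
        # plain sum of the per-subsequence losses.
        return sum(cross_entropy(y, p) for p in P)
    total = 0.0
    for p, yh in zip(P, y_hat):
        weight = 1.0 - beta if yh == 0 else 1.0  # soften benign sub-predictions
        total += weight * cross_entropy(y, p)
    return total


def predict_sample(P):
    """Prediction phase: malicious (1) if any subsequence is predicted malicious."""
    return int(any(p[1] > p[0] for p in P))


# Illustrative probabilities only.
P = [[0.9, 0.1], [0.2, 0.8]]
loss = sample_loss(1, P, beta=0.25)
label = predict_sample(P)
```

In this example the first subsequence is predicted benign, so its loss is weighted by \(1-\beta\), while the second keeps its full weight.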

4 Experiments

In this section, we evaluate the effectiveness of TCN-ATT model on the malware detection task as well as the influence of the proposed designs and settings, including API sequence deduplication method, the attention layer itself and hyper-parameters. The efficiency of the proposed model is also evaluated.

4.1 Dataset and Evaluation Metrics

We collect over 6900 malicious PE files from the Malekal website, the CILPKU08 dataset and the Henchiri-Dataset. Also, over 2700 benign PE files are acquired from Windows system files or downloaded from several websites (e.g. completely free software, softonic). We then check these files by uploading them to the VirusTotal website in order to keep mislabeled samples out of the dataset. According to VirusTotal results, the final dataset contains malware from families including Backdoor, Trojan-Downloader, Trojan-Ransom, AdWare and Worm. In total, our dataset contains 2497 malicious samples and 2497 benign samples. We use 5-fold cross validation to evaluate different methods. Thus, in each fold, 80% of the samples are used for training and 20% for testing.

To evaluate the performance of different mechanisms, the following evaluation metrics are used: accuracy (ACC), precision (PR), recall (RC), receiver operating characteristic (ROC) curve and area under curve (AUC).
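
For reference, the three scalar metrics can be computed from sample-level labels as in the sketch below. The function name `binary_metrics` is hypothetical, the labels shown are made up, and returning 0.0 when a denominator is zero is an assumption of this sketch.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) with 1 = malicious as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    acc = (tp + tn) / len(pairs)
    pr = tp / (tp + fp) if tp + fp else 0.0  # 0.0 on empty denominator (assumption)
    rc = tp / (tp + fn) if tp + fn else 0.0
    return acc, pr, rc


# Illustrative labels only.
acc, pr, rc = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
```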

These files are run in the Cuckoo sandbox, which extracts API call sequences while the files are running in a Windows 7 virtual environment. After being preprocessed as described in Sect. 2, these sequences are fed into the TCN-ATT model.

We implement the TCN-ATT model and the other models involved in the experiments in Python 3.6.5 with TensorFlow and scikit-learn. We train and test these models on an Ubuntu system with 8 GTX-1080Ti GPUs.

4.2 Effect of Deduplication

We first conduct an experiment to evaluate the effect of the proposed deduplication method. Original sequences and the shortened sequences converted from them are used to train several models respectively. Then we evaluate the trained models by accuracy. We choose RNN and LSTM from the RNN family, LSTM with the attention layer, as well as TCN and our TCN-ATT model. The setting \(\{l_m = 5, k = 2\}\) was found to work best in preliminary experiments, and thus only results under this setting are shown in Table 1 for simplicity. From Table 1 we can see an evident increase in the ACC of each model fed with shortened sequences, which indicates that the proposed deduplication method is effective in improving the performance of a diverse range of models in this task. So we use shortened sequences as the input in all the following experiments.

Table 1. Accuracy using different input sequences

4.3 Malware Detection

In this part, we compare the TCN-ATT model with traditional machine learning methods and some deep learning models:

  • Decision Tree/Naive Bayes/SVM/Random Forest. Popular traditional machine learning methods. Directly feeding sequences or subsequences into these models leads to poor results. So a transition probability matrix is calculated as the feature vector for each sample.

  • RNN/LSTM/GRU. Widely-used recurrent models in sequence tasks [4, 8] under split-and-combine method without attention layer.

  • TCN. The original TCN model under split-and-combine method without the attention layer.

  • LSTM/GRU with attention. LSTM/GRU model under split-and-combine method with the attention layer.

  • GRU attention and TCN attention without sequence splitting. Models of this group are fed with whole sequences and they make predictions directly with no combining method.

  • CNN+LSTM. A model containing two convolutional layers and one LSTM layer proposed by Kolosnjaji et al. [10].

  • Bi-Residual LSTM. A LSTM-based model containing two bidirectional layers with residual connection proposed by Xiaofeng et al. [18].

  • TCN-ATT. The proposed model in this paper.
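
The transition probability matrix used as the feature for the traditional machine learning baselines can be sketched as follows. The function names are hypothetical, and row normalisation (each row gives the probability of the next API given the current one) and flattening to a vector are assumptions, since the exact construction is not detailed above.

```python
def transition_matrix(seq, m):
    """Build an m x m row-normalised matrix of API-to-next-API transition frequencies.

    seq is a list of integer API ids in [0, m). Rows with no outgoing
    transition are left as zeros (an assumption).
    """
    counts = [[0] * m for _ in range(m)]
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def flatten(matrix):
    """Concatenate the rows into a single fixed-length feature vector."""
    return [v for row in matrix for v in row]


# Illustrative ids only: 3 distinct APIs give a 9-dimensional feature.
T = transition_matrix([0, 1, 0, 2, 0, 1], m=3)
feature = flatten(T)
```

The resulting vector has length \(m^2\) regardless of the trace length, which is what makes it usable by models that expect fixed-size inputs.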

Table 2. Accuracy of different models in malware detection

The dropout rate is 0.5 and the single feature size (c in Sect. 3) is 128 for the RNNs/LSTMs/GRUs/TCNs involved. For models under the split-and-combine method, the input sequences are split with window size 600 and \(\beta \) in the loss function is 0.25. For non-attention deep models, the whole \( \varvec{H} \in \mathbb {R}^{n \times c} \) is fed into the FC layer. The dilations setting is [1, 2, 4, 8, 16, 32] in all TCNs. Hyper-parameter values of other methods are also carefully selected to reach their best accuracy.

Table 2 shows the detection accuracy, precision and recall of the above models. Comparing models 1\(\sim \)4 with models 5\(\sim \)16, we can conclude that deep models perform better than traditional machine learning models when using API call sequences as the input. Results of models 5\(\sim \)8 show the built-in abilities of the original models under the split-and-combine structure. The model containing TCN outperforms the other three recurrent models. From models 6\(\sim \)8 and models 12, 13, 16, it is observed that the attention layer brings a significant improvement to the original models. This layer allows the model to selectively focus on important parts of the whole output feature and thus helps obtain better results. Similarly, results of models 10, 11 and models 13, 16 indicate that our split-and-combine method brings considerable performance improvement to these models. From the above results, we can conclude that the TCN-ATT model outperforms the other models in Table 2, since it reaches the highest accuracy, precision and recall.

Table 3. Effect of different hyper-parameters

4.4 Hyper-parameters

We also conduct experiments to choose hyper-parameter values of the TCN-ATT model. We evaluate different settings by accuracy and AUC. The results are presented in Table 3 and Fig. 2.

The window size (i.e. subsequence length) is an important hyper-parameter of the TCN-ATT model. With a bigger window size, the TCN module is able to take a longer subsequence as its input, which brings less sequence splitting but more training cost. As shown in Table 3, the accuracy and AUC do not always increase when the window size grows and 600 reaches the best result.

The dilations setting defines the parameter d in each dilated convolution layer (see Sect. 3.2) and the number of layers. It also highly affects the proposed model. As the dilations go deeper, the receptive field of the top cell becomes larger, while training becomes more difficult. Similar to the window size hyper-parameter, the best result comes from a middle value and we regard \(dilations = [1, 2, 4, 8, 16, 32]\) as the trade-off between the receptive field and the training cost.
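
How the dilations setting translates into a receptive field can be estimated with the sketch below. The function name `receptive_field` is hypothetical, and the filter size is an assumption because it is not stated in this section; two dilated convolutions per level follows the residual block design of the original TCN [2]. Each dilated causal convolution with filter size k and dilation d extends the visible history by \((k-1)d\) positions.

```python
def receptive_field(kernel_size, dilations, convs_per_level=2):
    """Number of input positions visible to the last output of a stacked TCN.

    Each dilated causal convolution adds (kernel_size - 1) * d positions.
    convs_per_level=2 follows the residual block of the original TCN;
    kernel_size is an assumption, as it is not given in the text.
    """
    return 1 + convs_per_level * (kernel_size - 1) * sum(dilations)


dilations = [1, 2, 4, 8, 16, 32]
for k in (2, 3):  # hypothetical filter sizes
    print(k, receptive_field(k, dilations))
```

Under these assumptions the receptive field stays well below the window size of 600, so the attention layer, rather than a single TCN output, is what aggregates information across the whole subsequence.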

Fig. 2. ROC of some hyper-parameters

According to our experiments, the dropout rate seems to have little influence on the model. The accuracy is over 98.2% whichever value we choose from 0.0 to 0.75. And the best dropout rate is 0.5.

The overlap rate represents how much one subsequence overlaps its neighbors. We did not mention this setting in Sect. 3 because our experiments indicate that the best result is reached when there is no overlap.

For brevity, other hyper-parameters of less importance are not discussed here.

4.5 Efficiency

We also conduct experiments to evaluate the efficiency of the TCN-ATT model. A set of 1000 samples is randomly selected from our dataset and is expanded to 10000 by repeating samples. Experiments are conducted in both the training phase and the testing phase on this dataset, and the time cost of each model is recorded. This is performed on the TCN-ATT model as well as RNN/LSTM/GRU with splitting and the attention mechanism. In these experiments, only one GPU is used, because TCN can run in multi-GPU mode while the recurrent models cannot, which would make the comparison unfair. All models involved use the settings that reach the best accuracy as described in Sect. 4.3, and the batch size is set to 32. Table 4 shows the results.

Table 4. Time cost of some models

As expected, LSTM and GRU cost more time than the original RNN while reaching higher accuracy. However, taking advantage of its convolutional architecture, TCN-ATT outperforms the above three recurrent models in terms of time cost. It reduces the time cost by 62% in training and 60% in testing compared with the GRU+attention model. This indicates that the TCN-ATT model has not only high accuracy but also excellent efficiency.

5 Related Works

API call information is widely used in malware detection. Static analysis extracts API calls from portable executable (PE) files [5], log files [16] and DEX files on mobile platforms [9, 19]. API call sequences can be captured dynamically as well. Based on API call sequences, Ravi et al. [13] use a Markov chain to model the sequences and design a data mining algorithm to generate the classification rules. Some researchers apply machine learning methods for classification. Hansen et al. [7] utilize the random forest algorithm to classify malware based on API call sequences and API call frequency.

In recent years, the development of deep learning has greatly influenced malware detection methods. Pektas et al. [12] construct an API call graph and use graph embedding methods to generate graph embeddings. The normalized graph embeddings are forwarded into a deep neural network for classification. Since recurrent neural networks perform well on sequence data, Tobiyama et al. [16] use an RNN to extract feature vectors from the input API sequence, convert the feature vectors into images and apply a CNN to classify the images. Kolosnjaji et al. [10] process the API sequences via a deep neural network composed of CNN and LSTM. Xiaofeng et al. [18] utilize a Bidirectional Residual LSTM to process the API sequence data and use machine learning methods based on API statistic features. Most of these sequence-based models rely on recurrent structures, which process long inputs sequentially and thus limit their performance.

6 Conclusion

In this paper, we present a convolutional network architecture called TCN-ATT for malware detection based on API call sequences. A temporal convolutional module and an attention layer are employed for stronger feature extraction ability. We also design a sequence splitting method and a task-specific loss to enhance robustness for long sequences while controlling the model size. For sequence preprocessing, a formalized deduplication method with two parameters is proposed. It improves the accuracy of our architecture and of other sequence-based models. With the above techniques, the proposed architecture obtains an accuracy of 98.60% and reduces time cost by over 60% compared with recurrent models. Experimental results indicate that the proposed approach is an effective classifier for the automatic malware detection task. In the future, a sub-prediction combining method using more intelligent techniques can be designed to bring more robustness and adaptability. Furthermore, analyses of attention layer values can be conducted to find out what the model focuses on and help to improve the performance.