Network Attack Detection and Classification Using Machine Learning Models Based on UNSW-NB15 Data-Set

29 min read · Oct 30, 2020

Network intrusion detection and attack categorization is an active field of research, but a major problem faced by researchers is the unavailability of datasets that simulate network traffic generated by modern-day computers. Several datasets are available, such as KDD98 and KDDCUP99 ( KDD Cup 1999 Data ), which were generated with the aim of developing network intrusion detection systems that could tell “bad” network intrusions from “good” network connections. However, these datasets were generated around two decades ago and do not adequately represent network traffic generated by modern-day computers.

The UNSW-NB15 data set was created with the intention of bridging this gap. This article describes multiple machine learning models that were trained ( on the UNSW-NB15 dataset ) and tested by the author to detect and classify network attacks.

The data-set is based on raw packets which were created using the IXIA ( now called Keysight ) PerfectStorm tool ( https://www.keysight.com/in/en/products/network-test/network-test-hardware/perfectstorm.html ).

**Image by Author depicting Framework Architecture to generate the UNSW-NB15 data-set ( based on architecture provided in study link )**

The tcpdump tool was used to capture around 100 GB of raw network packets. Two other tools, Argus ( https://openargus.org ) and the Bro Intrusion Detection System, along with 12 developed algorithms, were used to generate 49 features from the raw network packets. The tools were also used to assign class labels to each row in the data-set.

**Image by Author ( based on** **https://www.ijeat.org/wp-content/uploads/papers/v9i3/C5809029320.pdf** )

The data-set contains 9 types of attacks namely Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms.

A brief explanation taken from the following link gives an intuitive idea about each of the attacks.

Fuzzers: Fuzzing involves sending large amounts of invalid input to servers with the intention of looking for possible vulnerabilities. Invalid input might be an unexpected network protocol, a file of a different format, some random user input etc. Fuzzing is sometimes used by ethical hackers to find security loopholes which could be exploited during an attack.

Analysis: Analysis is a more aggressive and intrusive form of network attack in which the attacker infiltrates the targeted system to find any weaknesses. This type of attack is also sometimes referred to as Active Reconnaissance. In military operations, reconnaissance involves exploring an area to gain information about an enemy’s activities and environment. Similarly, in the Active Reconnaissance form of attack, the intruder gathers information by performing activities such as network scans to discover open ports and access points, finding ways to access the system etc.

In Active Reconnaissance, the intruder usually leaves a digital footprint unlike Passive Reconnaissance which is discussed later.

Backdoors: In a Backdoor attack, the intruder gains unauthorized access to a system via an unsecured point of entry such as a port or application, thus bypassing normal security measures. Malware is usually installed on the system. As a result, the attacker usually gains access to databases and servers, which allows him/her to remotely send commands to exploit the system.

Once installed, the malware is difficult to detect and can be used to steal data, install additional malware or hijack more devices on the network.

Denial of service (DoS): Denial of Service or DoS is a very well known form of cyber attack in which the attacker floods or overwhelms the targeted system with a large number of requests with the intention of making it temporarily or permanently unusable to the intended users. Some indicators of a DoS attack are an unexpected huge spike in network traffic originating from a single IP or IP range, or a suspicious surge in requests to just one endpoint.

Exploits: An exploit is any kind of attack that takes advantage of a bug or system flaw for the attacker’s benefit. An exploit usually takes the form of a few lines of code that the attacker uses to gain access to the system and install some form of malware. Once malware is installed, the system might show some common signs and symptoms such as slow performance, unexplained crashes or changes in settings, etc.

Generic: A generic attack is an attack on ciphers. It is a collision attack on the secret key produced by the cryptographic scheme and can be applied to a cipher without regard to its internal structure. The most obvious example is a cipher that takes an X-bit key: the generic attack takes the ciphertext and attempts to decrypt it with all 2^X possible keys.
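To make the "try all 2^X keys" idea concrete, here is a minimal sketch on a toy cipher. Everything in it is an assumption made for illustration: the XOR "cipher", the 16-bit key and the helper names `toy_encrypt` and `brute_force` are hypothetical and have nothing to do with how the UNSW-NB15 traffic was generated.

```python
from itertools import product


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """Toy cipher: XOR every byte with the repeating key (not secure)."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def brute_force(ciphertext: bytes, known_prefix: bytes, key_bytes: int):
    """Try every possible key and return the first one whose
    decryption starts with the expected plaintext prefix."""
    for candidate in product(range(256), repeat=key_bytes):
        key = bytes(candidate)
        # XOR is its own inverse, so encrypting again decrypts.
        if toy_encrypt(ciphertext, key).startswith(known_prefix):
            return key
    return None


secret_key = bytes([0x3A, 0xC5])  # X = 16 bits, so 2**16 candidate keys
ciphertext = toy_encrypt(b"GET /index.html", secret_key)
recovered = brute_force(ciphertext, b"GET ", key_bytes=2)
print(recovered == secret_key)
```

The search never looks at how the cipher works internally, which is what makes the attack "generic". Its cost doubles with every extra key bit.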

Reconnaissance: Reconnaissance is an attempt to collect preliminary information about any network or target host with the intention of exploiting the system to gain access to the target hosts or networks.

Unlike Analysis or Active Reconnaissance, Passive Reconnaissance does not involve direct interaction with the target machine and hence is far easier to hide. This class uses information which is freely available, such as “Whois” information and ARIN records. Advanced Google searches or social media searches often aid in Reconnaissance attacks.

Shellcode: Shellcode is a subset of the class “Exploits” in which the attacker uses a small piece of code as a payload to take control of the target machine. Shellcode can be written in a higher-level language, but that may not work in some cases, which is why it is often written in assembly language.

Worms: A worm is a common type of malware that spreads by copying itself over and over again from one computer to another. It often uses a computer network to spread and does not require any human intervention. Worms can be used to steal data or install a backdoor, thus allowing the intruder to gain control over the target machine.

Table of Contents

Source of Data
Existing Approaches to the Problem
Improvements To Existing Approaches
Exploratory Data Analysis
Initial approach to problem
Models Explained
Comparison of All Models

Source of Data

The data-set was downloaded from the UNSW ( University of New South Wales, Australia ) portal.

( Image source — https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/ )

The website provides links to all raw pcap files along with the logs generated by the tools used for feature generation.

A total of 2,540,044 records were generated using the 100 GB raw pcap files and were stored in 4 CSV files.

From these 4 CSV files, a training dataset and a test dataset were created, containing 175,341 and 82,332 points respectively. Both datasets are imbalanced: the training set has 56,000 points with class label “Normal” and 119,341 points spread across the 9 attack types mentioned above.

The features are described in another CSV file named UNSW-NB15_features.csv

Existing Approaches to the Problem

1. Analysis of KDD-Cup’99, NSL-KDD and UNSW-NB15 Datasets using Deep Learning in IoT

( https://www.sciencedirect.com/science/article/pii/S1877050920308334 )

In this study, a Deep Neural Network (DNN) with 20 hidden layers was used. 70% of the dataset was used for training, and the remaining 30% was split equally into test and cross validation sets.

From the UNSW data set, only 8 features ( #2, 4, 7, 8, 9, 10, 15, 16 ) were used for training the neural network. The whole experiment was performed in MATLAB 2016b with 8GB RAM on Mac OS X.

The following confusion matrix is based on the predictions of the model thus trained.

**Image by Author ( based on confusion matrix taken from above link )**

A combined ( i.e. with train and test data included ) accuracy of 99.2 % on the UNSW-NB15 dataset was achieved in this study. However, one needs to note that the total number of points used is just 360, which is too few to draw any concrete conclusions.

2. Intelligent methods for intrusion detection in local area networks

( http://ceur-ws.org/Vol-2514/paper84.pdf )

In this study, the authors have used an 8-layer Convolutional Neural Network to detect network intrusion. The following image is taken from the study link and shows the CNN architecture used for the NID model.

**Image by Author ( based on model architecture given in study link )**

A batch size of 256 was used along with a learning rate of 0.001. The network was trained for 200 epochs.

Class balancing was tried in this study which resulted in better model performance.

3. Intrusion detection using Deep learning and big data

(https://www.researchgate.net/publication/332100759_Intrusion_Detection_Using_Big_Data_and_Deep_Learning_Techniques )

In this study, k-means clustering was performed based on the features and the homogeneity score for every feature was calculated. The features were then sorted based on scores. A deep learning model, Random forest and Gradient Boosting Tree were then trained, initially, with all features. Then the feature with lowest homogeneity score was eliminated and the entire training process was repeated again. For every step, the accuracy was calculated and stored.

The pseudo code below, taken from the same link, shows a shorter version of the approach described above

**Pseudo code depicting algorithm for feature ranking and selection ( taken from given link )**

The following tables give the accuracy scores for the NID and attack classification models

**Binary Classification Results ( taken from given link )**

**Multi-Class Classification Results ( taken from given link )**

The results obtained with all features and with the reduced set of features are almost the same. An accuracy of 97%+ was achieved in network attack classification.

However, it must be noted that all normal traffic data points were filtered out while training and testing the NAC models, so the accuracy achieved covers only data points of the “attack” type.

4. Two Stage Classifier

( https://ro.uow.edu.au/cgi/viewcontent.cgi?article=3163&context=eispapers1 )

The study proposes the use of a 2 stage classifier to appropriately handle majority and minority classes.

It can be seen in the EDA that some classes like normal, exploits and generic occur very frequently, whereas other classes like analysis, backdoor, shellcode and worms occur rarely. Model performance suffers significantly as a result, since minority classes are sometimes ignored altogether. The proposed approach aims to rectify this issue.

The image shown below is taken from the study and shows the approach proposed in the study

**Image by Author ( based on model description given in link )**

Thus, 3 separate models are trained. The first model predicts whether a data point belongs to majority class or minority class. Once that is determined, the classifiers in the second stage classify the data point as per the attack category.

5. NID using dimensionality reduction by identifying top 4 features

( https://www.researchgate.net/publication/332265020_UNSW-NB15_dataset_feature_selection_and_network_intrusion_detection_using_deep_learning )

The study proposes use of Random Forest Classifier and Artificial Neural Network to detect network intrusions. Results obtained in this study are as follows.

**Image by Author ( Confusion matrix displaying final results )**

The classifier is able to detect 85% of attacks and has an overall accuracy of 91.5%. Network attack classification was not explored in this study.

6. Performance evaluation for various machine learning techniques for NID

( https://www.sciencedirect.com/science/article/pii/S1877050918301029 )

In this study, 4 machine learning models are trained on the UNSW-NB15 dataset.

Support Vector Machine
Naive Bayes
Decision Tree
Random Forest

The following table, taken from the link, shows the results obtained in this study

**Comparison Of Different Intrusion Detection Systems ( taken from study link )**

Unlike earlier studies, the models were simply trained without the aid of any other techniques like feature selection, multi-stage classification etc. Network attack classification was not explored in this study.

7. Hybrid feature selection method process along with Naive Bayes and Decision trees

https://onlinelibrary.wiley.com/doi/pdf/10.1002/spy2.91

In this approach, a hybrid feature selection process was used along with 2 classifiers: Naive Bayes and the tree-based J48, which is Weka’s implementation of the C4.5 decision tree. To determine an optimum subset of features, 2 feature selection processes were combined: k-means clustering and correlation based feature selection.

All of the work was done in Weka ( https://www.cs.waikato.ac.nz/~ml/weka/ ), which is an open source machine learning software.

The image shown below is based on the study link and shows the process employed.

**Image by Author ( depicting process employed in given link )**

The results are shown in 2 separate tables as follows.

For Naive Bayes classifier

For J48 classifier

Improvements To Existing Approaches

One Vs Rest Classifier

A one vs rest classifier with the best performing algorithm, XGBoost, was used to train individual models for every attack category. These “class-models” were then combined to create a single classifier that used the output of each of these models to make a final attack category prediction.

OVR XGBoost WITH Class Balancing

Log loss for One Vs Rest Model

As can be seen in the snippet above, the class models perform fairly well and are able to distinguish, with a fair degree of accuracy, data points belonging to the class of the model from data points belonging to other classes. Each test data point is given as input to each of these class models to get a probability score for that class. The output of the most “confident” model, i.e. the model which gives the highest probability score, is taken as the final class prediction.
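The combination step can be sketched in a few lines. This is only an illustration and not the code used for the experiments: the probability scores are made-up numbers standing in for the outputs of the trained class models, and `combine_ovr` is a hypothetical helper name.

```python
def combine_ovr(class_scores):
    """Return the class whose one-vs-rest model is the most confident.

    class_scores maps a class name to the probability reported by the
    model trained for that class.
    """
    return max(class_scores, key=class_scores.get)


# Made-up outputs of three class models for two test data points.
test_point_scores = [
    {"Generic": 0.91, "Exploits": 0.12, "Fuzzers": 0.05},
    {"Generic": 0.08, "Exploits": 0.47, "Fuzzers": 0.39},
]

predictions = [combine_ovr(scores) for scores in test_point_scores]
print(predictions)
```

Note that the scores for one data point need not add up to 1, because every class model is trained independently of the others.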

A few other approaches were tried along with the One Vs Rest approach, such as class balancing and eliminating all non-numerical and non-binary features.

Three Stage Classifier

A 2 stage classifier was mentioned earlier in the “existing approaches” section. In the 2 stage classifier, the attack classification task was carried out in 2 stages. In the first stage, the data point was classified as minority or majority class, i.e. whether the data point belonged to the attack classes for which only a few points are available in the dataset ( “Analysis”, “Backdoor”, “Shellcode”, “Worms” ) or to the majority classes ( “Normal”, “Generic”, “Exploits”, “Fuzzers”, “DoS”, “Reconnaissance” ). The final attack classification was done in the 2nd stage.

An improvement was made over this approach by adding another stage at the beginning of the classification task, thus making it a 3 stage classifier. This stage involved using a random forest model to first classify the data points into 2 categories, “attack” or “normal”, with 1 denoting attack and 0 denoting normal. The second and third stage classifications were done in the manner explained above.

2 different approaches were tried. In the first approach, a single classifier was used in the third stage.

Code Snippet depicting the use of 3 Stage Classifier ( First Approach )

Log loss for first approach

Test Accuracy for first approach

In the second approach, 2 different classifiers were used in the third stage: one for minority classes and one for majority classes.

Code Snippet depicting the use of 3 Stage Classifier ( Second Approach )

Both approaches produced similar results. The details are given in the last few sections.
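The routing logic of the second approach might look roughly like the sketch below. It is not the original snippet: the four "models" are toy stand-ins that simply read fields of the data point, and the choice to return “Normal” straight after stage 1 is an assumption, since the text does not spell out how normal points are handled after the first stage.

```python
def three_stage_predict(x, detect, group, minority_clf, majority_clf):
    """Route one data point through the three stages."""
    # Stage 1: attack (1) or normal (0).
    if detect(x) == 0:
        return "Normal"
    # Stage 2: does the point belong to a minority or a majority class?
    # Stage 3: the classifier trained for that group names the category.
    if group(x) == "minority":
        return minority_clf(x)
    return majority_clf(x)


# Toy stand-ins for trained models (hypothetical, for illustration only).
detect = lambda x: x["is_attack"]
group = lambda x: "minority" if x["rare"] else "majority"
minority_clf = lambda x: "Worms"
majority_clf = lambda x: "Generic"

points = [
    {"is_attack": 0, "rare": False},
    {"is_attack": 1, "rare": True},
    {"is_attack": 1, "rare": False},
]
results = [
    three_stage_predict(p, detect, group, minority_clf, majority_clf)
    for p in points
]
print(results)
```

For the first approach, the same function can be used by passing one and the same classifier as both `minority_clf` and `majority_clf`.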

Exploratory Data Analysis

Dataset Features

The dataset contains 49 features of 5 types — Integer, float, binary, nominal and timestamp which were generated using the raw pcap file with the help of a few tools. A brief description of each feature taken from NUSW-NB15_features.csv is as given below.

Transaction Protocol

The top occurring transaction protocols are TCP and UDP, which are among the most commonly used transaction protocols. Some other commonly used protocols for various attack categories can be seen in the heat-map below.

“unas” and “ospf” ( Open Shortest Path First ) tend to appear more in “attack” data points, whereas another protocol, “arp” ( Address Resolution Protocol ), does not occur in them even once.

Service

The top occurring service type is a group of services which are rarely used and hence grouped under “Not-much-used services” or nmu.

**Heat-map displaying top 10 services used in different attacks**

One observation for the “Generic” attack category that can be made on the basis of the heat-map above is that the most common service type for this attack category is “DNS”. Similarly, “exploits” attacks tend to occur over HTTP, SNMP, FTP and POP3 protocols.

Rate

No detail is provided about this feature in the features CSV file. A quick look at the box plot and “describe” table in pandas reveals that the value of “rate” tends to be higher for “generic” attacks with IQR between 111,111 and 250,000.

Source to destination time to live value — sttl

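The time-to-live ( TTL ) value puts an upper bound on the lifetime of a network packet: every router that forwards the packet decrements the value, and the packet is discarded once it reaches 0. It is therefore a hop count ( 0 to 255 ) rather than a duration. The average source to destination TTL value in the dataset is around 179 whereas the median lies around 254.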

**Description and Distribution for Source to Destination TTL**

**Box plot denoting the sttl range for different attack categories ( excluding outliers )**

The box plot above shows that the sttl value for attack categories “analysis” and “exploits” can be as low as 60, excluding outliers.

Destination bits per second

Destination bits per second is yet another feature that distinguishes normal network activities from attacks.

**Violin plot depicting range of download for various attack categories ( including outliers )**

**Destination bits per second for different attack categories ( excluding outliers )**

Median for dload for normal network traffic lies close to 1447.02 bits per second whereas for almost all attack classes, the dload value tends to be close to 0.

Destination TCP window advertisement value

The destination window advertisement value, dwin, indicates the amount of data the destination can safely receive before the sender must wait for an acknowledgement.

**Violin Plot for destination TCP window advertisement values for various attack categories**

The violin plots above show that almost all “Generic” attacks have dwin close to 0.

trans_depth

trans_depth represents the number of requests that are sent in a single connection without waiting for the corresponding responses.

**Box plot for trans_depth values for different attack categories ( without outliers )**

For analysis attacks, the interquartile range is from 0 to 1. For worms, trans_depth values lie at 1.

Source inter packet arrival time ( mSec )

Source inter packet arrival time or sinpkt indicates the amount of time between 2 consecutive packets in milliseconds.

**Box plot for Source inter packet arrival time ( mSec )**

Source inter packet arrival time for generic attack types is close to 0 unlike other categories of attack.

Target Labels

The dataset contains 2 class labels. The first column, “label”, classifies the data points into the categories “attack” ( denoted by 1 ) and “normal” ( denoted by 0 ). This label can be used for training network intrusion detection models and does not delve into the kind of attack.

The second label, “attack_cat”, further classifies all data points with “label” 1, i.e. all attack data points, into the category of attack. As already mentioned above, the attack categories are “Analysis”, “Backdoor”, “DoS”, “Exploits”, “Fuzzers”, “Generic”, “Reconnaissance”, “Shellcode” and “Worms”.

Correlation Between Features

Pearson correlation coefficient was used to determine linear correlation between all features. A value of +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation ( https://en.wikipedia.org/wiki/Pearson_correlation_coefficient ).

The following image taken from Wikipedia gives an intuitive idea about how Pearson correlation coefficient works -

**Image displaying PCC values for different arrangements of X and Y features ( from wikipedia )**

In the image above, a PCC of -1 indicates a perfect negative linear correlation: as X increases, Y decreases along a straight line. As PCC increases from -1 to +1, the correlation shifts towards positive. Thus, when PCC is +1, X and Y rise together along a straight line. The steepness of the line does not matter, only how closely the points follow it. A PCC of 0 indicates that X and Y have no linear correlation, although they may still be related in a non-linear way.
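A small hand-rolled computation shows these three cases on toy numbers. The `pearson` helper is written out here only for illustration; the toy X and Y values are not taken from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


x = [1, 2, 3, 4, 5]
rising = pearson(x, [2 * v + 1 for v in x])    # straight line going up
falling = pearson(x, [10 - 3 * v for v in x])  # straight line going down
curved = pearson(x, [4, 1, 0, 1, 4])           # symmetric parabola
print(rising, falling, curved)
```

The rising and falling lines have different slopes, yet both reach the extreme values of the coefficient. The parabola is a perfectly predictable relationship, but because it has no linear trend its coefficient is 0.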

With this information in mind, the PCC values table below gives a bird’s-eye view of the correlation between all pairs of features. A scale on the right shows the colors used to indicate correlation values. All positive correlations are shown in orange: the stronger the correlation, the darker the tone. Similarly, negative correlations are shown in blue.

**Heatmap depicting correlation between all features among each other**

The following code was used to derive correlation between all dataset features.

There are some expected correlations between some features such as

sbytes — spkts : Source bytes and sources packets count
sloss — spkts : Source packets loss and source packets count
sloss — sbytes : Source packets loss and source packets bytes
dbytes — dpkts : Destination bytes and destination packets count
dloss — dpkts : Destination packets loss and destination packets count
dloss — dbytes : Destination packets loss and destination packets bytes

swin is negatively correlated to rate, sttl and sload, and positively correlated to dttl.

A TCP window advertisement determines the maximum amount of data that can be sent before the sender must wait for an acknowledgement from the receiver. By advertising its window size, the receiver side manages flow control.
In this dataset, a higher source window advertisement tends to go with a higher destination time to live for packets.

The following group of features also appear to be positively correlated with each other

ct_srv_src — No. of connections that contain the same service (14) and source address (1) in 100 connections according to the last time (26)
ct_state_ttl — No. for each state (6) according to specific range of values for source/destination time to live (10) (11).
ct_dst_ltm — No. of connections of the same destination address (3) in 100 connections according to the last time (26)
ct_src_dport_ltm — No of connections of the same source address (1) and the destination port (4) in 100 connections according to the last time (26).
ct_dst_sport_ltm — No of connections of the same destination address (3) and the source port (2) in 100 connections according to the last time (26).
ct_dst_src_ltm — No of connections of the same source (1) and the destination (3) address in 100 connections according to the last time (26).
ct_src_ltm — No. of connections of the same source address (1) in 100 connections according to the last time (26).
ct_srv_dst — No. of connections that contain the same service (14) and destination address (3) in 100 connections according to the last time (26).

Sessions that contain FTP commands will by nature require FTP login. Hence, features ct_ftp_cmd ( No of flows that have a command in ftp session ) and is_ftp_login ( If the ftp session is accessed by user and password then 1 else 0 ) are positively correlated.

The following features are highly correlated to label and hence might play a critical role in network attack detection

sttl
rate
dload
ct_state_ttl

The dataset was then split according to attack class and a correlation matrix was generated again for each class.

The correlation matrix for the “Normal” class is as follows

**Correlation Matrix for “Normal” Class**

The correlation matrices for some other classes stood out. For example, below is the correlation matrix for the same features for the class “Shellcode”. Most features appear to be strongly positively correlated to each other.

**Correlation Matrix for “ShellCode” Class**

Below is another correlation matrix for the class “Analysis”.

**Correlation Matrix for “Analysis” Class**

Further exploratory data analysis can be found in the github link.

During EDA, all features are categorized on the basis of their data type — binary, numerical, categorical and are accordingly stored in 3 different lists. Post EDA, the 3 lists are pickled so as to be used later.

Initial approach to problem

Reading train/test data

Train and test data is read from CSV files using pandas’ native method read_csv.

Loading Dataset into Pandas Dataframe using read_csv Method

Cross validation data

Cross validation data is obtained by splitting the train data into 2 parts — X_train and X_cv. For all models discussed below, X_train data is used to train the models and X_cv is used to fine tune the hyper-parameters. Once training with the optimum hyper-parameters is done, the model is tested on X_test data. The accuracy score and log-loss for test data is captured so as to be used later on to compare various models.

Cross validation data is stratified on the basis of the target feature. For Network intrusion detection, stratification is done using the “label” column. For Network Attack Categorization, the “attack_cat” column is used.
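The idea behind stratification can be sketched in plain Python. The split ratio of 0.2, the seed and the `stratified_split` helper are assumptions made for this example; the toy label counts are not the real class counts.

```python
import random
from collections import defaultdict


def stratified_split(labels, cv_fraction=0.2, seed=42):
    """Return (train_idx, cv_idx) so that every class keeps roughly
    the same share in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, cv_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_cv = round(len(indices) * cv_fraction)
        cv_idx.extend(indices[:n_cv])
        train_idx.extend(indices[n_cv:])
    return sorted(train_idx), sorted(cv_idx)


# Toy target column with a frequent, a medium and a rare class.
labels = ["Normal"] * 50 + ["Generic"] * 30 + ["Worms"] * 10
train_idx, cv_idx = stratified_split(labels)
print(len(train_idx), len(cv_idx))
```

Because the split is done class by class, even the rare “Worms” class is guaranteed to show up in both parts, which a purely random split cannot promise.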

Data pre-processing, standard scaling numerical features, label encoding and one hot encoding categorical features.

Some data pre-processing is required since not all features can be used for training the models. The data set consists of 3 types of data — binary, categorical and numerical.

All binary features are used as they are. Categorical features are first label encoded and then converted into one-hot-vectors using sklearn’s one hot encoder sklearn.preprocessing.OneHotEncoder.

Standard scaling is applied to all numerical features using sklearn’s StandardScaler ( sklearn.preprocessing.StandardScaler ).

All features in the test data are also transformed using the one hot encoders and standard scalers obtained by fitting the train data in the previous step.

Once all data preprocessing is done, the encoded features are stacked side by side using hstack to create the encoded train data ( X_train_encoded ), cross validation data ( X_cv_encoded ) and test data ( X_test_encoded ). The encoded datasets are then used to train the models explained below.
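The important detail in these steps is that the encoder and the scaler are fitted on the train data only and then reused on the other splits. The sketch below imitates that in plain Python instead of calling sklearn, so the helper names ( `fit_scaler`, `fit_one_hot`, `encode` ) and the toy values are all assumptions.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn mean and population standard deviation from TRAIN values."""
    return mean(values), pstdev(values)


def scale(values, params):
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


def fit_one_hot(values):
    """Remember the categories seen in the TRAIN data."""
    return sorted(set(values))


def one_hot(values, categories):
    # A category that was never seen in train becomes an all-zero row.
    return [[1 if v == c else 0 for c in categories] for v in values]


train = {"proto": ["tcp", "udp", "tcp", "arp"], "rate": [1.0, 3.0, 5.0, 7.0]}
test = {"proto": ["udp", "ospf"], "rate": [4.0, 8.0]}

scaler = fit_scaler(train["rate"])        # fitted on train only
categories = fit_one_hot(train["proto"])  # fitted on train only


def encode(split):
    """Stack one-hot columns and the scaled column side by side."""
    hot = one_hot(split["proto"], categories)
    scaled = scale(split["rate"], scaler)
    return [row + [value] for row, value in zip(hot, scaled)]


X_train_encoded = encode(train)
X_test_encoded = encode(test)
print(X_test_encoded)
```

The all-zero row for the unseen “ospf” value is a choice made in this sketch. With sklearn’s OneHotEncoder the behaviour for unseen categories depends on its handle_unknown setting, so it is worth checking how the real pipeline is configured.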

Training and model evaluation

The performance of various classification algorithms for Network Intrusion Detection ( NID ) and Network Attack Categorization was measured. This was done to get an idea of how well different models performed and then to use the best performing model to design a more sophisticated algorithm, if required.

For the tasks of Network Intrusion Detection and attack categorization, the following classifiers were trained and their performance was measured using log-loss and accuracy score.

Network Intrusion Detection

K Nearest Neighbours
Logistic Regression
Random Forest
XGBoost
Decision Tree
Neural Networks

Network Attack Categorization

K Nearest Neighbours
Logistic Regression
Random Forest
XGBoost
Decision Tree
Neural Networks
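Both metrics used to compare the classifiers listed above are easy to compute by hand for the binary case. The sketch below uses made-up probabilities, and the helper names are hypothetical; it is meant to show why log-loss is reported alongside accuracy.

```python
from math import log


def to_labels(probs, threshold=0.5):
    """Turn attack probabilities into 0/1 predictions."""
    return [1 if p >= threshold else 0 for p in probs]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log of the probability given to the true class."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += -log(p) if t == 1 else -log(1 - p)
    return total / len(y_true)


y_true = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.8, 0.7]  # made-up outputs of one model
hesitant = [0.6, 0.4, 0.6, 0.6]   # made-up outputs of another model

for name, probs in (("confident", confident), ("hesitant", hesitant)):
    print(name, accuracy(y_true, to_labels(probs)), log_loss(y_true, probs))
```

Both toy models classify every point correctly, so accuracy cannot tell them apart. Log-loss is lower for the model that assigns higher probabilities to the correct class.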

Models Explained

Network Intrusion Detection

K Nearest Neighbours

K Nearest Neighbours is an algorithm which is used in classification or regression tasks. In KNN classification, the classes of the K closest neighbours are used to determine the class of the query point. In most cases, a majority vote is taken, i.e. the class to which the majority of the K closest points belong is taken as the class of the query point.

For our classification task, the KNN model is trained using different values of K ranging from 20 to 40. The model performance is then determined using cross validation data. The value of K that gives the lowest log-loss is taken as the best hyper-parameter value and the final model is trained using this value of K.

The model is then evaluated using test data.
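The tuning loop described above can be sketched on a toy problem. This is a from-scratch illustration, not the sklearn-based code used for the article: the 2-D points, the candidate values 1, 3 and 5 ( instead of 20 to 40 ) and the helper names are all assumptions.

```python
from math import dist, log


def knn_proba(train_X, train_y, query, k):
    """Share of the k nearest training points that carry label 1."""
    nearest = sorted(
        zip(train_X, train_y), key=lambda pair: dist(pair[0], query)
    )[:k]
    return sum(label for _, label in nearest) / k


def log_loss(y_true, probs, eps=1e-15):
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total -= log(p) if t == 1 else log(1 - p)
    return total / len(y_true)


# Two clusters plus one noisy "attack" point inside the "normal" cluster.
train_X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.15, 0.12),
           (1.0, 1.0), (0.9, 1.2), (1.1, 0.8)]
train_y = [0, 0, 0, 1, 1, 1, 1]
cv_X = [(0.1, 0.1), (1.0, 0.9), (0.9, 1.0)]
cv_y = [0, 1, 1]

cv_loss = {}
for k in (1, 3, 5):
    probs = [knn_proba(train_X, train_y, q, k) for q in cv_X]
    cv_loss[k] = log_loss(cv_y, probs)

best_k = min(cv_loss, key=cv_loss.get)
print(cv_loss, best_k)
```

A very small K follows the noisy point blindly and is punished heavily by log-loss, while a large K starts pulling in neighbours from the other cluster. The value in between gives the lowest cross validation loss here.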

**Cross Validation Log Loss Vs Hyper Parameter K for KNN**

Here, the value of K=34 is taken as the best parameter since the value of log-loss is the lowest at 0.138177.

Logistic Regression

A logistic function is a function with an S-shaped curve, as shown below

**Standard Logistic Sigmoid Function ( Image source:** **https://en.wikipedia.org/wiki/Logistic_function** )

Logistic regression models use a logistic function internally to predict outputs based on input features. A regularization term is used to ensure that the model does not end up overfitting the train data. The hyper-parameter c is the inverse of the regularization strength. Thus, the higher the value of c, the weaker the regularization and the greater the chance of overfitting.

Unlike K in KNN, the hyper-parameter c is varied on a logarithmic scale, with the lowest value of c being 0.001. Each subsequent value is 10 times the previous one.
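As a rough illustration, the logistic function and a logarithmic grid for c can be written as follows. The weights, bias and feature values are made up, and the upper end of the grid ( 1000 ) is an assumption, since only the lowest value is stated above.

```python
from math import exp


def sigmoid(z):
    """Standard logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for one data point."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


# Candidate values of c on a logarithmic scale, starting at 0.001.
c_values = [10 ** exponent for exponent in range(-3, 4)]

print(c_values)
print(predict_proba([2.0, -1.0], weights=[0.5, 1.0], bias=0.0))
print(predict_proba([2.0, -1.0], weights=[3.0, 1.0], bias=0.0))
```

One logistic regression model would be trained per value in `c_values`, and the value with the lowest cross validation log-loss kept, in the same way as K was chosen for KNN.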

**Cross Validation Log Loss Vs Hyper Parameter c for Logistic Regression where c is Inverse of regularization strength**

Random Forest

A random forest uses a large number of decision trees together for training and giving out a final class prediction.

For Random forest, the hyper parameter “n_estimators” is tuned which is the number of trees that will be used in the random forest for training.

A wide range of n_estimators ranging from 10 to 1000 are used to train the Random Forest models and the log-loss is calculated using the cross validation dataset.

As can be seen in the snippet below, an n_estimators value of 1000 yields the lowest log loss.

**Cross Validation Log Loss Vs Hyper Parameter n_estimators which is the number of Estimators in the Random Forest**

The top 30 important features based on the Random Forest classifier are as given below.

**Top 30 Important Features for Random Forest Classifier**

sttl or Source to destination time to live is the most important feature, followed by ct_state_ttl ( No. for each state (6) according to a specific range of values for source/destination time to live ).

XGBoost

XGBoost stands for Extreme Gradient Boosting and is one of the trending machine learning algorithms that has gained a lot of popularity in Kaggle competitions.

The scikit-learn API implementation of XGBoost Classifier has been used for the classification task.

XGBoost gives the lowest test log-loss as compared to all earlier algorithms thus making it one of the best performing algorithms.

**Cross Validation Log Loss Vs Hyper Parameter for XGBoost**

Just like Random forest classifier, XGBoost too provides a feature wise importance matrix. As per the XGBoost algorithm, sttl is the most important feature.

Neural Networks

Artificial neural networks are algorithms that loosely mimic the functioning of neurons in the human brain. A neuron has dendrites to receive signals and an axon to send output signals, along with a cell body to process them. Similarly, an artificial neuron has inputs and an output.

In the case of a biological neuron, the neuron fires depending on the signal strength. An artificial neuron mimics this process with the help of functions called activation functions.

The input values are taken and each value is multiplied with a weight to get a final sum. This final sum is then given to an activation function, which depending upon the input sum gives an output. There are multiple types of activation functions. One of the most commonly used activation functions is “relu” which gives an output which is equal to the input value as long as the input value is positive. Otherwise, it gives an output of 0.

**A Biological Neuron ( source —** **https://commons.wikimedia.org/wiki/File:Neuron.svg** )

**Image by Author — Similarity between a Biological Neuron and Artificial Neuron**

A neural network with 2 hidden layers is implemented. The first visible layer adds a dense layer with 128 units. Dense implements the operation: output = activation(dot(input, kernel) + bias) where activation is the element-wise activation function passed as the activation argument, kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer. The hidden layers contain 64 units each and use “relu” or Rectified Linear Unit. The relu function outputs the input value if it is positive else it simply outputs 0.
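The Dense operation is small enough to write out by hand. The sketch below shrinks the layers to 3 inputs, 2 hidden units and 1 output so the numbers can be followed; the weights are made up, and the sigmoid on the output node is an assumption because the output activation is not stated above.

```python
from math import exp


def relu(value):
    return max(0.0, value)


def sigmoid(value):
    return 1.0 / (1.0 + exp(-value))


def dense(inputs, kernel, bias, activation):
    """output[j] = activation(sum_i inputs[i] * kernel[i][j] + bias[j])"""
    return [
        activation(sum(x * row[j] for x, row in zip(inputs, kernel)) + bias[j])
        for j in range(len(bias))
    ]


x = [1.0, 2.0, 3.0]

# Kernel shape is (number of inputs, number of units).
hidden_kernel = [[0.5, -1.0], [0.0, 1.0], [0.5, -1.0]]
hidden_bias = [0.0, 0.5]
output_kernel = [[1.0], [1.0]]
output_bias = [-2.0]

hidden = dense(x, hidden_kernel, hidden_bias, relu)
output = dense(hidden, output_kernel, output_bias, sigmoid)
print(hidden, output)
```

The second hidden unit receives a negative sum and is switched off by relu, so only the first unit contributes to the output. The real network works the same way, just with 128 and 64 units per layer instead of 2.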

A batch size of 128 is used for training the model. The model is trained for 100 epochs, where one epoch is a full pass over the training data.

The neural network described above can be diagrammatically represented as follows

**Neural Network Architecture used for Training**

As can be seen in the diagram above, the entire network contains 1 input layer, 2 hidden layers and 1 output node. The input layer is the layer where the training data is fed into the network. There are 2 hidden layers with 64 nodes each with “relu” as the activation function. The “relu” function ensures that the neuron is triggered only if the input to this neuron is positive.
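To make the data flow through this architecture concrete, here is a rough NumPy sketch of a single forward pass using the Dense operation quoted earlier. Several things are assumptions made only for illustration: the number of input features ( 40 ), the random untrained weights, the use of “relu” on the 128 unit layer and the sigmoid on the output node. It is not the model from the article, which would normally be built and trained with a deep learning library.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(x, kernel, bias, activation):
    """output = activation(dot(input, kernel) + bias)"""
    return activation(x @ kernel + bias)

n_features = 40                              # assumption: placeholder feature count
layer_sizes = [n_features, 128, 64, 64, 1]   # 128 unit layer, two 64 unit hidden layers, 1 output node
activations = [relu, relu, relu, sigmoid]    # assumption: sigmoid on the output node

# Random (untrained) kernel and zero bias for every layer.
params = [
    (rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

def forward(x):
    """Pass a batch of rows through every dense layer in turn."""
    for (kernel, bias), activation in zip(params, activations):
        x = dense(x, kernel, bias, activation)
    return x

batch = rng.normal(size=(128, n_features))   # one batch of 128 rows
probs = forward(batch)
print(probs.shape)
```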

The network is trained for 100 epochs i.e 100 full passes over the training data. The “adam” optimizer is used, which is an optimization algorithm that updates the weights of the network after every batch.
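For readers curious about what a single “adam” weight update looks like, the sketch below implements the standard Adam update rule in NumPy, using the commonly quoted default hyper-parameters. The toy objective ( a simple quadratic ), the larger learning rate used in the loop and the function name adam_step are assumptions chosen only to keep the example small. In practice the deep learning library performs these updates internally.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem (assumption): minimise f(w) = sum((w - 3)^2), whose minimum is at w = 3.
w = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 2001):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)

print(w)  # should end up close to 3 in both coordinates
```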

Network Attack Categorization

For Network Attack Categorization, a similar approach was taken. Multiple models were tried and their performance was measured using log-loss and accuracy score. The best performing model was used to build 2 more models later on.
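As a reminder of what the two metrics measure, here is a minimal NumPy sketch of multi-class log-loss and accuracy computed from predicted class probabilities. The function names and the tiny example arrays are illustrative assumptions; library implementations may differ in details such as how probabilities are clipped.

```python
import numpy as np

def mean_log_loss(y_true, probs, eps=1e-15):
    """Average negative log of the probability given to the true class."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    y_true = np.asarray(y_true)
    picked = probs[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(picked)))

def accuracy(y_true, probs):
    """Fraction of rows whose most probable class is the true class."""
    return float(np.mean(np.argmax(probs, axis=1) == np.asarray(y_true)))

y_true = np.array([0, 1, 2])
confident = np.array([[0.9, 0.05, 0.05],
                      [0.1, 0.8, 0.1],
                      [0.2, 0.2, 0.6]])
uniform = np.full((3, 3), 1 / 3)   # a model that cannot tell the classes apart

print(mean_log_loss(y_true, confident), accuracy(y_true, confident))
print(mean_log_loss(y_true, uniform))
```

A lower log-loss means the model puts more probability on the correct class, which is why it is useful alongside accuracy when classes are imbalanced.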

The same process is then repeated using “test data sharpening” to improve model performance. The feature “label” ( which indicates whether the data point is a normal point or belongs to an attack category ) is taken as the target column and the remaining columns are taken as features. The best Network Intrusion Detection algorithm ( Random Forest ) is taken along with its best hyper-parameters and trained on the features against this target column. Once the training is complete, the test data features are used to get output predictions, normal or attack ( i.e 0 or 1 ), and these predictions are added to the test data as another feature. In case of training data, the actual “label” column is added as the new feature. The following Network Attack Categorization models are then trained on this dataset with this new feature.

K Nearest Neighbours
Logistic Regression
Random Forest
XGBoost
Decision Tree
Neural Networks
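The step of adding the normal / attack output as an extra feature, described before the list of models, can be sketched as follows. The arrays are tiny made-up examples, and predicted_test_label stands in for the output of the trained intrusion detection model on the test features. The helper name add_label_feature is hypothetical.

```python
import numpy as np

def add_label_feature(features, label_values):
    """Append the 0/1 normal-or-attack values as one extra column."""
    return np.column_stack([features, label_values])

X_train = np.array([[0.1, 5.0], [0.9, 1.0], [0.8, 2.0]])
train_label = np.array([0, 1, 1])            # actual "label" column of the training data

X_test = np.array([[0.2, 4.0], [0.7, 1.5]])
predicted_test_label = np.array([0, 1])      # stand-in for the detector's predictions

X_train_new = add_label_feature(X_train, train_label)
X_test_new = add_label_feature(X_test, predicted_test_label)

print(X_train_new.shape, X_test_new.shape)
```

The attack categorization models then see one more column than the original feature set, with true values at training time and predicted values at test time.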

The training process for all the above models is similar to the training process for the Network Intrusion Detection models, so it is not elaborated again here. For more details, please refer to the GitHub link provided.

Apart from the 6 standard models listed above, a few other models were trained.

One Vs Rest — With balanced data / unbalanced data

The One Vs Rest based model consists of 9 binary models, one trained for each individual class. XGBoost, the best performing model based on earlier results, is taken as the base model.

For training the first model, data points belonging to class 0 ( i.e Analysis ) are taken as positive points and all remaining data points are taken as negative. The model is then trained and stored in a dict so that it can be used later for predicting class labels. Similarly, the second model is trained on points with attack category label 1 ( i.e Backdoor ) by labeling only those points as positive and all remaining points as negative, and is stored in the same dict. The other 7 models are trained and stored in the dict in the same way. The dict itself is stored as a pickle file so that this One Vs Rest model can be used later on.

For predicting the class label of a test data point, each model predicts the probability that the point belongs to its own class ( 1 ) rather than to any other class ( 0 ). The class of the model which gives the highest probability score is given as the final class label.

Balanced and Unbalanced Data

All One Vs Rest models support data balancing, an inbuilt mechanism in the OneVsRest classes used here that internally balances imbalanced data using oversampling or undersampling. Data balancing is controlled by a flag ( True or False ). If False, data balancing is not done and training is carried out on the data as it is. If the data-balancing flag is True, then only 2000 points from each class are taken while training. Classes like “worms” which have fewer than 2000 points are oversampled and classes which have more than 2000 points are undersampled.
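One simple way to get 2000 points per class is random sampling, with replacement for small classes and without replacement for large ones. The sketch below shows that idea in NumPy. The article does not say which sampling method is used internally, so the random sampling, the function name balance_classes and the toy data are all assumptions.

```python
import numpy as np

def balance_classes(X, y, n_per_class=2000, seed=0):
    """Return a shuffled sample with exactly n_per_class rows for every class."""
    rng = np.random.default_rng(seed)
    chosen = []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        # Small classes are oversampled (with replacement),
        # large classes are undersampled (without replacement).
        oversample = len(idx) < n_per_class
        chosen.append(rng.choice(idx, size=n_per_class, replace=oversample))
    chosen = np.concatenate(chosen)
    rng.shuffle(chosen)
    return X[chosen], y[chosen]

# Toy data: 5000 rows of class 0 and only 50 rows of class 1.
X = np.arange(5050).reshape(-1, 1)
y = np.array([0] * 5000 + [1] * 50)

X_bal, y_bal = balance_classes(X, y)
print(np.bincount(y_bal))
```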

OVR — XGBoost

The base model used in this One Vs Rest classifier is XGBoost. 9 XGBoost models are trained, one for each class.

Code to Train Individual OVR Classes

In the fit_class method, each training point that belongs to the current class is given a label of 1 for y, and every other point is given a label of 0.

The pseudo-code for the same is as shown below

The model is trained and then saved in a dict with the key as the class label.
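A minimal sketch of this relabel, train and store flow is given below. To keep the example self-contained, a tiny hypothetical CentroidModel is used as a stand-in for the XGBoost base model, and the data is synthetic. The function bodies are a guess at the general idea and not the author's fit_class code.

```python
import numpy as np

class CentroidModel:
    """Hypothetical stand-in for the base model (the article uses XGBoost)."""
    def fit(self, X, y):
        self.pos_ = X[y == 1].mean(axis=0)   # centre of the current class
        self.neg_ = X[y == 0].mean(axis=0)   # centre of all other points
        return self

    def predict_proba(self, X):
        d_pos = np.linalg.norm(X - self.pos_, axis=1)
        d_neg = np.linalg.norm(X - self.neg_, axis=1)
        p_pos = np.exp(-d_pos) / (np.exp(-d_pos) + np.exp(-d_neg))
        return np.column_stack([1 - p_pos, p_pos])

def fit_class(X, y, cls, make_model=CentroidModel):
    """Train one binary model: 1 for points of class cls, 0 for the rest."""
    binary_y = (y == cls).astype(int)
    return make_model().fit(X, binary_y)

def fit_ovr(X, y, make_model=CentroidModel):
    """Train one model per class and store it in a dict keyed by class label."""
    return {int(cls): fit_class(X, y, cls, make_model) for cls in np.unique(y)}

# Synthetic data: three well separated classes.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
y = np.repeat(np.arange(3), 50)
X = centers[y] + rng.normal(scale=0.3, size=(150, 2))

models = fit_ovr(X, y)
print(sorted(models))
```

Any model class that offers fit and predict_proba could be passed as make_model in place of the stand-in.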

Saving OVR Models

While predicting class labels for query points, the outputs of all 9 OVR models are taken. The class of the model which gives the highest probability value is returned as the final output.
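The prediction step can be sketched as follows. Here FixedScoreModel is a hypothetical stub that returns preset probabilities so the example runs on its own, and only 3 classes are used instead of 9. The sketch assumes each stored model exposes predict_proba with the positive class in the second column.

```python
import numpy as np

class FixedScoreModel:
    """Hypothetical stub: returns a preset positive-class probability per row."""
    def __init__(self, pos_probs):
        self.pos_probs = np.asarray(pos_probs, dtype=float)

    def predict_proba(self, X):
        p = self.pos_probs[: len(X)]
        return np.column_stack([1 - p, p])

def predict_ovr(models, X):
    """Pick, for every row, the class whose model gives the highest probability."""
    classes = sorted(models)
    scores = np.column_stack([models[c].predict_proba(X)[:, 1] for c in classes])
    return np.asarray(classes)[scores.argmax(axis=1)]

# dict of per-class models, keyed by class label
models = {
    0: FixedScoreModel([0.9, 0.2, 0.1]),
    1: FixedScoreModel([0.3, 0.7, 0.2]),
    2: FixedScoreModel([0.1, 0.4, 0.6]),
}
X_query = np.zeros((3, 4))                 # three query points, contents unused by the stub
predictions = predict_ovr(models, X_query)
print(predictions)
```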

OVR — Neural Network

This OVR model uses an ANN as its base model. This is the only difference in this model.

Model training and prediction are handled in the same way as in the OVR model with XGBoost as the base model.

OVR — XGBoost — With K Best Features

This OVR model has a similar architecture to the OVR XGBoost model explained earlier. The only difference in this implementation is that a K best feature selector is applied first, and each OVR XGBoost model is then fitted to the dataset with the reduced number of features.

**Flow Chart Representing Training Strategy for OVR XGBoost Model using K Best Features**

While making predictions on query points, the k best feature selector is called for each model and feature reduction is performed on the points. The query data points are then used for performing classifications as described earlier.
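The idea of keeping one feature selector per class and applying it to query points can be sketched as below. The scoring rule used here ( absolute correlation between each feature and the binary class target ) is only a simple stand-in chosen for illustration, since the article does not state which scoring function its k best selector uses. The function name select_k_best and the synthetic data are also assumptions.

```python
import numpy as np

def select_k_best(X, binary_y, k):
    """Indices of the k features with the largest |correlation| with the target."""
    Xc = X - X.mean(axis=0)
    yc = binary_y - binary_y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    scores = np.abs(Xc.T @ yc) / np.where(denom == 0, 1.0, denom)
    return np.sort(np.argsort(scores)[-k:])

# Synthetic data: feature 1 is informative for class 0, feature 4 for class 2.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=300)
X = rng.normal(size=(300, 6))
X[:, 1] += 3.0 * (y == 0)
X[:, 4] += 3.0 * (y == 2)

# One selector (here just a set of column indices) per One Vs Rest class.
selected = {cls: select_k_best(X, (y == cls).astype(float), k=1) for cls in range(3)}

# At prediction time the same columns are picked from the query points
# before they are passed to that class's model.
X_query = X[:5]
X_query_class_0 = X_query[:, selected[0]]
print(selected[0], selected[2], X_query_class_0.shape)
```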

OVR — XGBoost — With only Numerical / Binary features

As the title suggests, this model is trained only on numerical and binary features. All categorical features are eliminated from the training, cross validation and test dataset before being used for the model.

Three Stage Classifier

Approach 1 ( With single third stage classifier )

Two different implementations of the 3 stage classifier were built. In the first implementation, a simpler approach is taken in which the first stage is trained to separate attack points from normal points. The column “label” is taken as the target and the model is trained with the other features.

**Image by Author — 3 Stage Classifier Architecture**

The “label” feature is then added as a new column to the dataset and this new set of features is used for training the second stage classifier.

As mentioned earlier, the second stage classifier separates the minority classes from the majority classes. Hence, all data points belonging to minority classes are given a target value of 0 and all data points belonging to majority classes are given a target value of 1.

The 2nd stage classifier is then trained and saved as shown below.

The target column used in the 2nd stage is again added as a new feature to the dataset and used for training the 3rd stage classifier.

The feature “attack_cat” is used for training the 3rd stage classifier. The model is trained and saved in the manner shown above.

For predictions, the query data points are given to the first stage classifier, which gives a binary result 0 or 1 ( i.e normal or attack ). This output is added to the query points as a new column, and the result is given to the second stage classifier, which again gives a binary output 0 or 1 ( i.e minority class or majority class ).

This output is again added as a new feature to the query points and then given to the final classifier which gives the final class prediction i.e attack category.
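The chaining of the three stages at prediction time can be sketched as follows. RuleModel is a hypothetical stub that wraps a fixed rule in place of a trained classifier, and the rules and query points are invented purely so that the flow of columns from one stage to the next is visible.

```python
import numpy as np

class RuleModel:
    """Hypothetical stand-in for a trained classifier; wraps a fixed rule."""
    def __init__(self, rule):
        self.rule = rule

    def predict(self, X):
        return self.rule(X)

def predict_three_stage(stage1, stage2, stage3, X):
    is_attack = stage1.predict(X)               # 0 = normal, 1 = attack
    X2 = np.column_stack([X, is_attack])        # stage 1 output becomes a feature
    is_majority = stage2.predict(X2)            # 0 = minority, 1 = majority
    X3 = np.column_stack([X2, is_majority])     # stage 2 output becomes a feature
    return stage3.predict(X3)                   # final attack category

# Invented rules, used only to trace the data flow.
stage1 = RuleModel(lambda X: (X[:, 0] > 0.5).astype(int))
stage2 = RuleModel(lambda X: ((X[:, 1] > 0.5) & (X[:, -1] == 1)).astype(int))
stage3 = RuleModel(lambda X: (X[:, -2] + X[:, -1]).astype(int))

X_query = np.array([[0.9, 0.8], [0.9, 0.1], [0.1, 0.1]])
categories = predict_three_stage(stage1, stage2, stage3, X_query)
print(categories)
```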

Approach 2 ( with 2 Separate Third Stage Classifiers for Majority/Minority Classes )

In this approach, the model works in the same way as approach 1 up to the 2nd stage classification ( i.e minority or majority class ). The final prediction of “attack_category” is then done by 2 different classifiers in the third stage, one for the minority classes and one for the majority classes.

**3 Stage Classifier with 2 Separate Last Stage Classifiers**

While making predictions, depending on the output of the second stage classifier, the query point is given to the minority class model or the majority class model. If multiple query points are given, then a “mask” is generated from the output of the 2nd stage classifier. This mask separates the query points that have been predicted to be minority class from points that have been predicted to be majority class.

Finally, the query points are given to the 3rd stage classifiers which make the final classification.

The results are then merged back in the original sequence and returned as the final output.
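The mask based routing and the merge back into the original order can be sketched with NumPy boolean indexing. OffsetModel is a hypothetical stub whose predictions encode the row id, so that it is easy to check that every result lands back in its original position. Integer encoded attack categories are assumed.

```python
import numpy as np

class OffsetModel:
    """Hypothetical stub: 'predicts' row id + offset so the routing is visible."""
    def __init__(self, offset):
        self.offset = offset

    def predict(self, X):
        return X[:, 0].astype(int) + self.offset

def predict_third_stage(minority_model, majority_model, X, stage2_output):
    mask = np.asarray(stage2_output) == 1      # True where stage 2 said "majority"
    result = np.empty(len(X), dtype=int)
    if (~mask).any():
        result[~mask] = minority_model.predict(X[~mask])
    if mask.any():
        result[mask] = majority_model.predict(X[mask])
    return result                              # same order as the query points

X_query = np.arange(5, dtype=float).reshape(-1, 1)   # row ids 0..4 as the only feature
stage2_output = np.array([1, 0, 0, 1, 0])
final = predict_third_stage(OffsetModel(100), OffsetModel(200), X_query, stage2_output)
print(final)
```

Writing into a pre-allocated array through the mask avoids any explicit re-sorting step, since each prediction is placed directly at the index of its query point.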

Comparison of All Models

Network Intrusion Detection

The results for all Network Intrusion Detection models are given in the summary table below.

Result For Network Intrusion Detection Models

Network Attack Categorization

The results for all Network attack categorization models are given below.

Result For Network Attack Detection Categorization Models

The three stage classifiers were built in a separate module. The results of the same are shown below.

Results For Three Stage Classifier Models

Future work

Although the features in the dataset provide a decent picture of network traffic, they do not capture all the information in the raw packets. These features are “derived” from the raw pcap files using tools and algorithms that have been explained in an earlier section. Performing classification based on these derived features is like identifying the object in an image without looking at the picture, using only a description of the object such as its “height” and “weight”. Such features might provide a good idea about the object in the image, but they would never be as good as looking directly at the image.

The website also provides the raw pcap files which were used to generate the dataset features. These pcap files can be converted to hex dumps or raw binary format and used with neural networks such as LSTMs to predict attack categories.