Mathematics
2024, 12, 571
13 of 26
on categorical variables, which are predominant in network datasets. Variables such as
HTTP methods, DNS query lengths, and MQTT topics were subjected to label encoding.
This step transformed these categorical strings into a numerical format, a prerequisite for
subsequent ML algorithms.
Following the encoding, the numerical representations underwent one-hot encoding.
This transformation is particularly crucial because it converts categorical integer features
into a binary matrix, thereby mitigating any misleading ordinal relationships that traditional
numerical encoding might imply. One-hot encoding expands the feature space, enabling the
model to better understand and differentiate categorical data. Consequently, the number of
features after the one-hot encoding increased to 119.
Given the expanded feature space, the next step involved streamlining the dataset.
•
The dataset was scrutinized for duplicate records, and such instances were removed
to prevent biases in the model’s learning process.
•
The data was examined for null values, ensuring the integrity and consistency of
the dataset.
•
A novel approach was adopted in which a hash function was employed for each
column to identify identical columns. By comparing the hashes, groups of identical
columns were identified, and all but one in each group were removed. This step is cru-
cial for reducing redundancy in the dataset, thereby enhancing the model’s efficiency.
After the reduction and cleaning processes, the feature count decreased to 99. To fur-
ther refine the dataset, a Chi-squared test was applied. This statistical test is instrumental
in feature selection because it evaluates the independence of each feature against the target
variable. The Chi-squared test scored each of the 99 features, allowing us to identify and
select the top 93 features that exhibited the most significant relationships with the target
variable. This selection was influenced by the intrinsic ability of the CNN component in
the ensemble to discern and use the most pertinent features effectively.
The data processing efforts culminated in the distribution of network traffic as follows,
effectively delineating the varied landscape of normal and anomalous activities within the
dataset (Table
2
).