Improved density peak clustering with a flexible manifold distance and natural nearest neighbors for network intrusion detection | Scientific Reports

Mar 12, 2025

Scientific Reports volume 15, Article number: 8510 (2025)


Recently, density peak clustering (DPC) has gained attention for its ability to intuitively determine the number of classes, identify arbitrarily shaped clusters, and automatically detect and exclude anomalies. However, DPC faces challenges as it considers only the global distribution, resulting in difficulties with clusters of uneven density, and its point allocation strategy may lead to a domino effect. To expand the scope of DPC, this paper introduces a density peak clustering algorithm based on the manifold distance and natural nearest neighbors (DPC-MDNN). This approach establishes nearest neighbor relationships based on the manifold distance and introduces representative points using local density for distribution segmentation. In addition, an assignment strategy based on representatives and candidates is adopted, reducing the domino effect through microcluster merging. Extensive comparisons with five competing methods across artificial and real datasets demonstrate that DPC-MDNN can more accurately identify clustering centers and achieve better clustering results. Furthermore, application experiments using three subdatasets confirm that DPC-MDNN enhances the accuracy of network intrusion detection and has high practicality.

Cluster analysis is a widely used unsupervised learning technique designed to classify samples on the basis of their similarity1. This technique presents a challenging problem in data mining and machine learning but has important applications in various fields, such as network intrusion detection2, malicious data tracking3, hierarchical dependency analysis4, model aggregation5, image processing6, large uncertain graphs7, and bioscience8. Numerous clustering algorithms have been proposed for different application scenarios. For example, clustering techniques are used to analyze and classify player behavior in profiling systems using real-world datasets, assisting in personality prediction and classification9. In addition, an unsupervised deep learning model based on an autoencoder was utilized to evaluate feature subsets and applied to cancer prediction10. In the area of recommendation, a novel end-to-end model was introduced that jointly learns the collaborative signal and knowledge graph context11. Traditional clustering algorithms can be categorized into four main types: (i) division-based clustering algorithms, including the K-means algorithm12 and the K-medoids algorithm13; (ii) hierarchy-based clustering algorithms, including the clustering using representatives (CURE) algorithm14 and the balanced iterative reducing and clustering using hierarchies (BIRCH) algorithm15; (iii) density-based clustering algorithms, including the density-based spatial clustering of applications with noise (DBSCAN) algorithm16 and the ordering points to identify clustering structure (OPTICS) algorithm17; and (iv) grid-based clustering algorithms, including the statistical information grid (STING) algorithm18. While each type of algorithm has its own advantages and disadvantages, they can all be limited by factors such as incorrect identification of the shape and distribution of the data being analyzed.

Rodriguez et al.19 recently proposed the density peak clustering (DPC) algorithm, which is designed to first calculate the local density and relative distance of samples and then construct a decision diagram that considers these values. The clustering centers are then selected on the basis of experience, and finally, each noncentral point is assigned to the center of the nearest cluster with higher density. One of the main advantages of DPC is its simplicity and efficiency, since it does not require iterative optimization of the objective function. In addition, it can identify clusters of any shape, making it a useful tool in various applications.

DPC has been applied to image segmentation, capturing the intrinsic structure of images and detecting nonspherical clusters20. Li et al. employed DPC for hyperspectral anomaly detection, leveraging its strengths to overcome the Gaussian distribution assumption and the contamination of background statistics by anomalies21. Xu et al.22 extended DPC to overlapping community detection by automatically selecting centers with linear fitting, yielding better results in both realistic and synthetic network experiments. Xie et al.23 measured the importance of each instance using the local density and relative distance introduced in DPC, constructed a sequence of density peaks, and extracted important instances to determine the optimal undersampling size of a class. Zarikas et al.24 applied DPC to segment computed tomography (CT) images of COVID-19 pneumonia, combining the generalized extreme value distribution with DPC to determine the optimal number of class cluster centers, and achieved better image segmentation results.

In recent years, researchers have proposed several variants of DPC. Du et al.25 proposed the DPC-KNN-PCA algorithm by integrating K-nearest neighbors and principal component analysis into DPC. However, this algorithm is computationally expensive. Xie et al.26 proposed FKNN-DPC, which determines the local density on the basis of an exponential kernel of distances from a point to its K-nearest neighbors. However, FKNN-DPC does not perform well on high-dimensional datasets and does not solve the density gap problem of DPC. Some researchers have proposed combining DPC with other models. For example, Wang et al.27 proposed a new strategy for automatically determining cluster centers, but it increases the time complexity; moreover, the algorithm still requires calculating \(d_c\), and the results are sensitive to the choice of parameters. Seyedi et al.28 incorporated the K-nearest neighbor idea to calculate the global truncation distance and used graph-based label propagation to solve the problem of poor point clustering in the overlapping regions of class clusters; however, their algorithm requires iterative computation to achieve high accuracy. Jiang et al.29 proposed a method to calculate the truncation distance based on the Gini coefficient and identify centroids via KNN. While this method is effective in clustering data with large density disparities, it is ineffective for data with overlapping regions. Guan et al.30 proposed the FHC-LDP algorithm, which employs fast KNN to reduce its complexity and a bottom-up hierarchical clustering strategy instead of DPC’s class cluster center selection strategy. Zhang et al.31 proposed a new algorithm based on density decay graphs to address the drawbacks of DPC, namely, that it requires manual selection of cluster centers and is restricted by parameters. However, their algorithm lacks dynamic adjustment and requires more parameters than the original algorithm does.

Clustering algorithms are often applied in intrusion detection. Pu et al.2 proposed an unsupervised anomaly detection method that combines subspace clustering (SSC) and a one-class support vector machine (OCSVM) to detect attacks without any prior knowledge. This method was evaluated using the NSL-KDD32 dataset. Dai et al.33 proposed a K-modes clustering algorithm based on a weighted overlap distance named WODKM and verified its intrusion detection performance using the KDD Cup 99 dataset. The model detects intrusion by segmenting clustering results into normal and abnormal clusters. Then, the weighted average density of the target x to be detected in each cluster and the weighted overlap distance between x and each center point are analyzed. Yao et al.34 proposed an improved DBSCAN algorithm based on K-prototypes and a dissimilarity matrix to improve the clustering quality. This algorithm was applied to network intrusion detection. While DBSCAN improved the clustering quality using the KDD Cup 99 dataset for network intrusion detection, it may not be able to solve the problem of poor clustering results for data with large differences in data distribution density. Zhang et al.35 proposed a clustering detection method for network intrusion features based on a support vector machine and an LCA block algorithm, which improved the detection rate and time and is an effective network security protection method. However, this method has certain limitations in real-time clustering detection of unknown intrusion behaviors in large-scale data. Zou et al.36 proposed a network intrusion detection method based on a decision tree twin support vector machine and hierarchical clustering, named HC-DTTWSVM, which can effectively detect different categories of network intrusion using the NSL-KDD and UNSW-NB15 intrusion detection benchmark datasets. Wang et al.37 proposed an improved k-means clustering algorithm for detecting intrusions on wireless networks on the basis of federated learning and selected the open wireless network attack dataset (AWID) as the experimental dataset. This method achieved better performance in terms of the detection rate, false detection rate, and unknown attack type detection.

On the basis of the reasons mentioned above, this paper introduces a DPC approach based on the manifold distance and natural nearest neighbors (DPC-MDNN). First, a natural neighbor relationship is established according to the manifold distance. Then, representative points are identified on the basis of the new local density to shape the overall distribution. The key step in this approach is an assignment strategy based on representative and candidate points, which mitigates the domino effect by merging microclusters. We apply the proposed method to intrusion detection to verify its effectiveness.

The remainder of the paper is organized as follows. Section "Related work" reviews the DPC algorithm and its recent research trends. Section "Proposed algorithm" details a search strategy based on the concept of simple and natural neighbors and then describes our proposed algorithm, DPC-MDNN, and its implementation. To confirm the effectiveness of DPC-MDNN, Section "Experiments" presents a comparison of the experimental results of the proposed DPC-MDNN with those of five related variants on different datasets. Section "Application of DPC-MDNN in intrusion detection" introduces an application of our proposed algorithm to network intrusion detection and a comparison with three other clustering algorithms. Finally, Section "Conclusion" concludes this paper and outlines our future research directions.

The DPC algorithm relies on two important assumptions. i) The center point of a cluster is surrounded by many sample points, and the density of these points is smaller than that of the center point. ii) The distance between cluster centers and other points of higher density is relatively large. As a result, the local density \(\rho\) and relative distance \(\delta\) serve as two essential evaluation indicators in DPC.

To simplify local density calculations, DPC offers two methods, namely, the cutoff distance and the Gaussian kernel. These two methods can be applied to datasets of varying sizes.

Given a dataset \(X=\left\{ x_{1},x_{2},...,x_{i},...,x_{n}\right\}\), the local density \(\rho _{x_{i}}\) of any point \(x_{i}\) is calculated using the cutoff distance via Eq. (1):

where \(d_{c}\) is an input parameter and \(d_{i,j}\) is the Euclidean distance between \(x_{i}\) and \(x_{j}\). The truncation distance \(d_{c}\) is typically chosen so that each point has, on average, 1% to 2% of all samples within \(d_{c}\); in practice it is taken near the 1% to 2% quantile of the ascending-sorted pairwise Euclidean distances of the target dataset.

\(\chi (x)\) is the truncation function: \(\chi (x)\) equals 1 when \(x<0\) and 0 otherwise. Therefore, \(\rho _{x_{i}}\) counts the number of points within a radius of \(d_c\) centered at \(x_{i}\). Obviously, the more data points within this range, the higher the local density, and vice versa.

When the Gaussian kernel function is used, the local density of point \(x_{i}\) is calculated using Eq. (2):

where \(d_{c}\) and \(d_{i,j}\) have the same meaning as in Eq. (1). The Gaussian kernel is appropriate for a dataset with a relatively small number of examples.

For any point \(x_i\), its distance \(\delta _i\) can be described as Eq. (3):

For point \(x_i\), \(\delta _i\) denotes the relative distance from \(x_i\) to the nearest point \(x_j\) with a higher density. For the point with the greatest density, \(\delta _i\) is instead defined as its distance to the farthest point.
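To make these definitions concrete, the following sketch computes both local density variants and the relative distance for a small dataset. It follows the standard DPC formulation summarized above; the function name and interface are illustrative rather than taken from this paper.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def dpc_rho_delta(X, dc, kernel="cutoff"):
    """Illustrative sketch of the standard DPC quantities: the cutoff and
    Gaussian local densities (Eqs. (1)/(2)) and the relative distance (Eq. (3))."""
    dist = squareform(pdist(X))              # pairwise Euclidean distances d_ij
    d = dist.copy()
    np.fill_diagonal(d, np.inf)              # exclude each point from its own density

    if kernel == "cutoff":
        # chi(d_ij - dc) = 1 when d_ij - dc < 0, so rho_i counts points within radius dc
        rho = np.sum(d < dc, axis=1).astype(float)
    else:
        # Gaussian kernel: rho_i = sum_{j != i} exp(-(d_ij / dc)^2)
        rho = np.sum(np.exp(-(d / dc) ** 2), axis=1)

    n = len(X)
    delta = np.empty(n)
    order = np.argsort(-rho)                 # points sorted by decreasing density
    delta[order[0]] = dist[order[0]].max()   # highest-density point: farthest distance
    for pos in range(1, n):
        i = order[pos]
        delta[i] = dist[i, order[:pos]].min()  # nearest point with higher density
    return rho, delta
```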

After the values of \(\rho\) and \(\delta\) for each point \(x_i\) are obtained, a decision graph is generated using these two values as its horizontal and vertical coordinates. Then, the cluster centers are manually selected. A center point should meet two conditions: a high local density \(\rho\) and a large relative distance \(\delta\). The data points that meet these conditions are usually located in the upper-right corner of the decision graph, far from the distribution of the other points.

Although DPC can recognize arbitrarily shaped clusters with fewer iterations, there are still some challenges to be solved.

To better understand the DPC process, an example is presented to explain its implementation details. In this study, an analysis is conducted using the Jain dataset38, as shown in Fig. 1.

Clustering process of DPC using the Jain dataset.

DPC selects cluster centers manually using decision graphs, but cluster center selection errors may occur for datasets with uneven density. Figure 1a shows the correct distribution of the dataset, and Fig. 1b shows the density distribution of the data points, where the ordinate represents the density, the abscissa represents the data point index, and the point with the largest density value is the red point at the top. Figure 1c shows the constructed decision graph, with the cluster centers selected in the upper-right corner according to the DPC assumptions. The final clustering results are shown in Fig. 1d, where the cross marks represent the selected cluster centers. Both selected cluster centers are located in the lower category. The number of samples in the blue cluster far exceeds that of the green cluster, and the former is more densely distributed than the latter. The high-density blue cluster samples therefore contribute more to the local density than the green cluster samples do. Because both cluster centers selected by DPC belong to the densely distributed blue cluster, the final clustering result is poor.

The DPC assignment strategy implies that if a reference point is mistakenly allocated to a different class group, subsequent data points may also be inaccurately assigned, leading to a domino effect. For example, Fig. 2a displays the Flame39 dataset, and Fig. 2b shows its clustering result (\(d_c=0.02\)).

Domino effect in the Flame dataset.

Figure 2a shows how DPC identifies the class cluster centers and then assigns the noncentral points in order of decreasing local density. If a point is assigned to the wrong class, all the points that take it as a reference will also be misclassified, leading to a cascading effect in which a substantial number of points are incorrectly grouped on both the left and right sides of the figure. This demonstrates that DPC considers only a single reference point when assigning each sample and can easily lead to misclassification: a misclassified point causes later points to be misclassified as well.

Recently, an advancement in clustering known as natural neighbors, a subfield of the nearest neighbor algorithm40, has been proposed. Unlike the K-neighborhood, the natural neighborhood is adaptive: it does not require the parameter K to be set and adaptively identifies the distribution characteristics of streamlined datasets. This approach has been used extensively in data mining41. The relevant definitions are as follows.

Given a dataset D, let d(i, j) denote the Euclidean distance between points i and j in D, which is used in Definition 1 below.

(K-nearest neighbors, KNN) \(KNN_k(i)\) represents the k nearest points of point i, as defined by Eq. (4).

where argsort(d(i, j)) returns the indices of the other points sorted in ascending order of their Euclidean distance to point i.

(Reverse K-nearest neighbor, RNN) If point j is a member of the K-nearest neighbor set KNN(i) of point i, then point i is the reverse nearest neighbor of point j, which is defined as shown in Eq. (5).

(Stable search state) For each point in the dataset D (of size n), the search is extended by consecutively checking the K-nearest neighbors, starting with \(k=1\) and ending with \(k=n\). The dataset D is said to be in a stable search state when \(k=\lambda\), at which point every point in D has at least one reverse nearest neighbor, or the number of samples with no reverse nearest neighbors no longer changes during the iterations.

(Natural neighbor eigenvalue) When the natural neighbor search algorithm reaches the stable state, the iteration parameter k is the natural neighbor eigenvalue, denoted as \(\lambda\).

(Natural neighbors, NN) Once the stable state is achieved and j is listed as a neighbor of data point i, and vice versa, then i and j become mutual natural neighbors, as denoted by Eq. (6).

Here, an overview of the natural neighbor search algorithm is presented. At the start of each iteration, we first search for the K-nearest neighbors of each data point i and then calculate the number of reverse nearest neighbors of each i, where k denotes the current iteration number. We subsequently check whether either of two termination conditions has been met: (i) no data point is without reverse nearest neighbors, or (ii) the count of data points without reverse nearest neighbors has not changed over two consecutive iterations. Algorithm 1 outlines the natural neighbor search algorithm, named NaN-Searching.

The natural neighbor search algorithm requires the construction of a distance matrix, resulting in high time complexity. However, by utilizing the KD-Tree42 data structure, the computational cost of distance calculations can be reduced, reducing the time complexity from O(\(N^{2}\)) to O(NlogN).

Natural neighbor searching
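As a rough illustration of the procedure, the sketch below implements a simplified NaN-Searching loop with a KD-tree for the neighbor queries. The variable names, the simplified stopping check, and the use of scipy are assumptions, not the authors' reference implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def nan_searching(X):
    """Simplified sketch of NaN-Searching (Algorithm 1)."""
    n = len(X)
    tree = cKDTree(X)                 # KD-tree keeps neighbor queries near O(n log n)
    rnn_count = np.zeros(n, dtype=int)
    knn = [set() for _ in range(n)]
    prev_no_rnn = n
    k = 0
    while True:
        k += 1
        # k-th nearest neighbor of every point (index 0 is the point itself)
        _, idx = tree.query(X, k=k + 1)
        for i in range(n):
            j = idx[i, k]
            knn[i].add(j)
            rnn_count[j] += 1         # j gains one reverse nearest neighbor
        no_rnn = int(np.sum(rnn_count == 0))
        # stable state: every point has a reverse neighbor, or the count stops changing
        # (checked here against the previous iteration only, as a simplification)
        if no_rnn == 0 or no_rnn == prev_no_rnn or k >= n - 1:
            break
        prev_no_rnn = no_rnn
    # mutual natural neighbors: i and j appear in each other's neighborhoods
    nat_neighbors = [{j for j in knn[i] if i in knn[j]} for i in range(n)]
    return k, nat_neighbors           # k is the natural neighbor eigenvalue lambda
```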

The DPC algorithm traditionally relies on the Euclidean distance to measure the distance between points but falls short in accurately evaluating the global structural similarity of data. This study combines the natural neighbor search algorithm and the manifold distance approach, which is based on the nearest neighbor graph43, to better assess the distance between sample points.

First, the natural neighborhood search algorithm is applied to determine the mutual natural neighbors and connect them through a connected graph. Here, the manifold distance between a data point and its natural neighbors remains the Euclidean distance. However, to calculate the manifold distance between a distant point and a data point, the sum of the Euclidean distances between data points along the path traversing the connected graph needs to be evaluated. Specifically, the manifold distance between data points i and j is defined as Eq. (7):

where l denotes the number of hops between data points i and j, \(P_{i,j}\) represents the set of available paths that connect i and j, d(i, j) is the Euclidean distance between i and j, and \(p_k\) denotes the data points contained in a path of \(P_{i,j}\). The algorithm presented in this paper exploits the manifold distance, which considers both the global distribution of the dataset and the local relationships among individual points.
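The following sketch illustrates Eq. (7) under the construction described above: mutual natural neighbors are connected by edges weighted with the Euclidean distance, and the manifold distance between any two points is the shortest-path length over this graph. The helper name and the `nat_neighbors` input (taken from the earlier search sketch) are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.csgraph import dijkstra

def manifold_distances(X, nat_neighbors):
    """Shortest-path (manifold) distances over the natural-neighbor graph."""
    n = len(X)
    graph = lil_matrix((n, n))
    for i in range(n):
        for j in nat_neighbors[i]:
            w = np.linalg.norm(X[i] - X[j])   # local edges keep the Euclidean distance
            graph[i, j] = w
            graph[j, i] = w
    # non-neighboring points are reached only through the graph, so their manifold
    # distance is the sum of edge lengths along the shortest connecting path;
    # disconnected components yield infinite distances, which a full
    # implementation would need to handle separately
    return dijkstra(csgraph=graph.tocsr(), directed=False)
```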

In a three-dimensional space, 1000 random points are given. For any two points, a plot of the manifold and Euclidean distances is created and mapped into a two-dimensional space for distance visualization, as shown in Fig. 3. In spaces of different dimensions, a noticeable discrepancy exists between the two distance metrics.

Visualization of the manifold and Euclidean distances.

With the Flame dataset as an example, points A and B in Fig. 4a represent the clustering centers of two classes. By setting \(d_c\) to 0.01, we compute the distances from point C to A and to B under both the Euclidean and manifold distances. The calculation results are presented in Table 1. The manifold distance allows for more appropriate clustering: as shown in Table 1, the manifold distance from C to A is larger than that from C to B, so C is assigned to the correct class in the subsequent noncenter assignment, as shown in Fig. 4b.

Clustering using Euclidean and manifold distances for the Flame dataset.

Manifold learning methods are renowned for their robustness and stability. When dealing with noise and perturbations, these methods effectively preserve the relative positions of data points within a low-dimensional manifold space. For example, Isomap44 employs a global shortest-path algorithm to construct a geodesic distance matrix, ensuring that perturbations affecting specific data points are confined to their local neighborhoods and do not disrupt the overall dataset structure.

Moreover, manifold distances excel at capturing local data structures during dimensionality reduction while simultaneously preserving global structures. This ability to retain local structures ensures that localized perturbations, such as changes in the labels of specific points, do not propagate widely across the dataset.

Manifold learning and the manifold distance mitigate the domino effect through several foundational concepts:

(1) Local weighting: The manifold distance integrates the Euclidean distance with local density weighting, enhancing the clustering of points in dense regions while isolating noise and sparse points. This ensures that points within the same cluster exhibit stronger cohesion, whereas outliers are effectively excluded.

(2) Density estimation: This method relies on density estimation techniques, such as neighborhood-based approaches (e.g., KNN) or local weighting, to quantify the local density of each point. These estimates enable a nuanced distinction between dense and sparse regions in the dataset.

(3) Robust clustering: By categorizing points into high- and low-density regions, manifold learning creates a reliable clustering framework. Points in dense regions naturally form cohesive clusters, whereas those in sparse regions are treated as isolated, thus minimizing the spread of errors across clusters.

(4) Enhanced robustness: The manifold distance reduces the influence of noise and outliers by assigning lower weights to points in low-density regions. This limits their impact on the clustering results, enhancing the algorithm’s resilience to perturbations and preventing cascading errors.

In the initial stage of the DPC algorithm, the Euclidean distance is computed between the target sample and all other points to determine the point density, as measured by the global parameter \(d_c\). However, the effectiveness of algorithmic clustering is highly dependent on the value of the empirical parameter \(d_c\), which cannot accurately represent the local density. Therefore, this study introduces the concept of natural neighbors and representative points to identify density peaks, which much more accurately reflect the characteristics of local density.

Given a dataset \(X = (x_1, x_2,..., x_n)\), for any \(x_i \in X\), the natural nearest neighbor density of \(x_i\) is defined as Eq. (8):

where \(d(x_i, x_j)\) denotes the Euclidean distance between points \(x_i\) and \(x_j\) and \(NN(x_i)\) denotes the set of natural nearest neighbors of point \(x_i\).

The natural neighbor density is based on the natural neighbor information of the points, in contrast to the DPC algorithm, which considers a global distribution and local density dependent on the truncation distance. The latter method does not ensure an accurate number of points within the truncation distance. As a result, the natural neighbor density provides a more precise reflection of the distribution of the nearest neighbors of the points. However, using the natural neighbor density alone does not solve the problem of density gaps in class clusters. To address this issue, representative points are introduced to determine the density peaks in class clusters45. The points with the highest local density are defined as the representative points, and the class cluster centers are chosen from these points.

Our study introduces the natural neighbor algorithm, which extends the density formula and identifies the density core to address the issues of parameter sensitivity and overly simplistic density formulas. The proposed method solves several problems. First, no manual input of parameters is needed, minimizing the impact of these variables on clustering performance. Second, the definition of local density is optimized to better reflect the distributional characteristics of surrounding data points, increasing the flexibility of the method in handling more complex, multilevel density datasets. Finally, the extracted core points retain a rough outline of the cluster shapes, which serves as a basis for the subsequent clustering process.

For a given dataset D, for any data object \(i\in D\), assume that its local neighborhood is nb(i), which contains i itself. If there exists a data object j in nb(i) and j is the densest object in nb(i), then j is considered a local representative point of i and all the data objects in the local neighborhood of i, denoted as Eq. (9).

According to the above definition, a point may have multiple representative points. The representative point closest to sample i is selected as its definitive representative point, that is, \(Rep(i)=\mathop {\arg \min }\limits _{j\in (j_1,j_2)}R(i,j)\).

If the representative point of i is j and the representative point of j is z, then z becomes the new representative point of i. This rule is defined as the representative point transfer rule. Combining the competitive rule and the transfer rule, each sample has only one representative point. Some samples are their own representative points, and these samples are referred to as candidate points. The candidate points satisfy the condition \(Rep(i) = i\).

Representative points search

Algorithm 2 outlines the specific steps of the representative point search, which establishes an initial set of candidate points that ultimately serve as microcluster centers.
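A minimal sketch of the competitive and transfer rules is given below. The natural-neighbor density of Eq. (8) is not reproduced here and is simply passed in as `rho`, and `nat_neighbors[i]` is assumed to be the local neighborhood nb(i); the interface is therefore an assumption rather than the authors' exact Algorithm 2.

```python
import numpy as np

def find_representatives(X, rho, nat_neighbors):
    """Sketch of the representative point search; rho is an assumed density input."""
    n = len(X)
    cand_reps = [set() for _ in range(n)]          # possible representatives of each point
    for i in range(n):
        nb = list(nat_neighbors[i] | {i})          # nb(i) contains i itself
        j = max(nb, key=lambda p: rho[p])          # densest object in nb(i)
        for p in nb:                               # j represents i and all of nb(i)
            cand_reps[p].add(j)
    # competitive rule: among the possible representatives, keep the closest one
    rep = np.empty(n, dtype=int)
    for i in range(n):
        rep[i] = min(cand_reps[i], key=lambda j: np.linalg.norm(X[i] - X[j]))
    # transfer rule: if Rep(i) = j and Rep(j) = z, then z becomes Rep(i)
    for i in range(n):
        seen = {i}
        while rep[rep[i]] != rep[i] and rep[rep[i]] not in seen:
            seen.add(rep[i])
            rep[i] = rep[rep[i]]
    candidates = np.flatnonzero(rep == np.arange(n))   # Rep(i) = i -> candidate points
    return rep, candidates
```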

Figure 5a shows the chosen candidate points for the Flame dataset. The red points on the plot represent the candidate points, and the remaining points are part of the initial assignment, connected to the candidate points by short lines.

Before and after microcluster merging using the Flame dataset.

To prevent the allocation errors of the DPC algorithm and obtain better quality results, this study employs a two-part strategy. First, the dataset is divided into microclusters, and then these microclusters are merged on the basis of their intercluster correlation.

The definition of similarity \(S_{p_1,p_2}\) between sample \({p_1}\) and sample \({p_2}\) is as follows in Eq. (10).

where \(R(p_1,p_2)\) is the manifold distance between point \(p_1\) and point \(p_2\). As the distance between two samples \(p_1\) and \(p_2\) decreases, their relationship becomes closer, thus increasing their similarity.

The formula for calculating the correlation \(S_{p_i,C_{m}}\) between sample \(p_i\) and microcluster \(C_{m}\) is provided in Eq. (11).

where \(C_m\) is the set of samples in microcluster m and \(|C_{m}|\) denotes the number of samples in microcluster m. The larger the value of \(S_{p_i,C_{m}}\) is, the more attractive this microcluster is to other samples.

Natural shared nearest neighbors (NSNN):

A point z belongs to \(NSNN(p_i,p_j)\), as defined in Eq. (12), if it is a natural neighbor of both \(p_i\) and \(p_j\).

Degree of natural extension overlap (NEO):

The \(NEO(C_m,C_n)\) between the class clusters \(C_m\) and \(C_n\) is calculated as follows in Eq. (13).

The correlation (\(S_{C_{m},C_{n}}\)) of two microclusters is calculated as follows in Eq. (14).

\(S_{C_{m},C_{n}}\) represents the probability that two microclusters belong to the same class of clusters.

To address the issue of chain errors in the allocation strategy of the DPC algorithm, a merging approach is adopted on the basis of the similarity between microclusters. The operation of merging class clusters can be divided into two steps. In step one, the two classes with the highest similarity value of \(S_{C_{m},C_{n}}\) are merged. In step two, a determination is made of whether the number of merged class clusters is the same as the expected number of input class clusters. If the number differs, the above operation is repeated; if it is consistent, the merging process is completed, and the final clustering result is obtained.
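The merging loop itself can be sketched as follows, assuming a callable that implements the microcluster similarity of Eq. (14) (which is not reproduced here); the function and argument names are illustrative.

```python
import numpy as np

def merge_microclusters(micro_labels, cn, cluster_similarity):
    """Two-step merging: repeatedly fuse the most similar pair of microclusters
    until the expected number of clusters cn is reached.

    micro_labels[i]        : microcluster id of sample i
    cluster_similarity(a, b, labels) : assumed implementation of Eq. (14)
    """
    labels = np.array(micro_labels)
    clusters = list(np.unique(labels))
    while len(clusters) > cn:
        # step one: find the pair of microclusters with the highest similarity
        best_pair, best_sim = None, -np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = cluster_similarity(clusters[a], clusters[b], labels)
                if s > best_sim:
                    best_pair, best_sim = (clusters[a], clusters[b]), s
        keep, absorb = best_pair
        labels[labels == absorb] = keep      # merge the most similar pair
        clusters.remove(absorb)
        # step two: repeat until the number of clusters matches cn
    return labels
```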

Using the Flame dataset as an example, Fig. 5b shows the results of the assignment. Through the microcluster merging step, the clusters with the highest similarity are sequentially fused. The image shows the use of short line segments to connect similar clusters, ultimately forming two clusters.

The DPC-MDNN is designed to select representative points from the dataset to ensure effective clustering and subsequently assign other samples to the cluster of their respective representative points. By searching for the natural neighbors of each sample to form local neighborhoods and calculating density, the algorithm identifies the representative points of each sample on the basis of relative distance and density relationships, selecting core points to form initial clusters. Then, a microcluster merging strategy is employed to achieve the final clustering result. The natural neighbor search algorithm is used to obtain the neighborhood set, thus automatically determining the neighborhood of each sample and eliminating the need for distance truncation. Therefore, the proposed DPC-MDNN algorithm does not require manual setting of other parameters.

DPC-MDNN is based on the adaptive nearest neighbor parameter; it takes the dataset D and the number of clusters cn as input and outputs the clustering labels Label. The specific steps are given in Algorithm 3. The clustering process of the DPC-MDNN algorithm consists of five main steps: natural neighbor search (line 2), manifold distance calculation (line 3), core point identification (line 6), initial clustering on the basis of representative relationships (line 7), and microcluster merging (lines 8–20).

The natural neighbor search algorithm can be optimized using KD-Tree, resulting in a time complexity of O(nlogn). For the manifold distance calculation, Dijkstra’s algorithm is used to search for the shortest path, increasing the time complexity of this step to \(O(n^3)\). Identifying core points requires traversing all sample points and their natural neighbors, leading to a time complexity of O(mn), where n is the number of sample points in the dataset and m is the number of points in the natural neighborhood of each sample. The initial clustering based on representative relationships also involves traversing sample points and natural neighbors, resulting in a time complexity of O(mn), with n and m having the same meaning as in the previous step. The time complexity of microcluster merging is \(O(c^2)\), where c is the number of microclusters, which is much smaller than n, the number of sample points in the dataset. Therefore, the overall time complexity of the DPC-MDNN algorithm is \(O(n^3)\), where n is the number of sample points in the dataset.

DPC-MDNN.

The performance of the DPC-MDNN algorithm is validated using both artificial and real datasets and compared with the performance of five different clustering methods: DPC19, DPC-KNN-PCA25, SNN-DPC46, DPC-CE47, and DPC-DLP28. Table 2 lists the code versions and URLs of the software/packages used in this study.

The experiments include both artificial and real datasets, with the corresponding details provided in Tables 3 and 4, including the numbers of instances, dimensions, and clusters. The artificial datasets include data following a Gaussian distribution, streamlined datasets, and clustered datasets with a high degree of overlap. The real datasets include multidimensional standard clustering datasets (UCI datasets48) and image datasets commonly used to evaluate the performance of clustering algorithms, as well as image segmentation and biometric information datasets. The experiments are conducted on a Windows 11 system with an Intel(R) Core(TM) i9-12900KF processor and 32 GB of memory, using Python 3.10 or MATLAB R2023b.

To assess the clustering effect, six commonly used metrics, namely, accuracy (ACC), adjusted mutual information (AMI), adjusted Rand index (ARI), precision (P), recall (R) and F1 score, each with an upper bound of 1, are used. The size of the metric corresponds positively to the clustering effect.

Let us use image classification as an example to demonstrate the definitions of several indicators. TP is the number of images predicted to be positive samples that are truly positive; TN is the number predicted to be negative that are actually negative; FP is the number predicted to be positive that are actually negative, also known as "false detections"; and FN is the number predicted to be negative that are actually positive, also known as "missed detections". On this basis, we can derive the definitions of several indicators, which are given as follows.

Accuracy (ACC):

ACC is the proportion of correctly classified samples (TP+TN) among the total number of samples (TOTAL), defined as follows in Eq. (15):

Adjusted mutual information (AMI):

AMI is an indicator used to assess the similarity between the clustering results and the real classification, defined as follows in Eq. (16), where MI refers to mutual information and EMI(U,V) is the expectation of mutual information MI(U,V).

Adjusted Rand index (ARI):

ARI measures the degree of similarity between two clustering results by counting the number of sample point pairs located in the same cluster and different clusters in Eq. (17), where a represents the number of point pairs belonging to the same cluster in both real and experimental cases, b represents the number of point pairs belonging to the same cluster in real cases but not in experimental cases, c represents the number of point pairs not belonging to the same cluster in real cases but in experimental cases, and d represents the number of point pairs that do not belong to the same cluster in both real and experimental cases.

Precision (P):

Precision is the proportion of images predicted to be positive samples that are actually positive, defined as follows in Eq. (18).

Recall (R):

The proportion of successfully predicted images among all truly positive samples is defined as follows in Eq. (19).

F1 score:

F1 score is an indicator used to measure the accuracy of binary classification (or multitask binary classification) models in statistics and can be regarded as the harmonic mean of precision and recall, defined as follows in Eq. (20).
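Assuming scikit-learn is available, the six metrics can be computed as sketched below. Because predicted cluster ids are arbitrary, they are first aligned to the ground-truth labels with the Hungarian algorithm before ACC, precision, recall, and F1 are computed; the macro averaging used for the last three is an assumption, since the paper does not state its averaging scheme.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn import metrics

def clustering_scores(y_true, y_pred):
    """Compute ACC, AMI, ARI, P, R, and F1 for a clustering result."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    # contingency table between predicted clusters and true classes
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    overlap = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            overlap[i, j] = np.sum((y_pred == c) & (y_true == k))
    # Hungarian matching: map each cluster id to the best-matching true class
    rows, cols = linear_sum_assignment(-overlap)
    mapping = {clusters[r]: classes[c] for r, c in zip(rows, cols)}
    aligned = np.array([mapping.get(p, -1) for p in y_pred])
    return {
        "ACC": metrics.accuracy_score(y_true, aligned),
        "AMI": metrics.adjusted_mutual_info_score(y_true, y_pred),
        "ARI": metrics.adjusted_rand_score(y_true, y_pred),
        "P":   metrics.precision_score(y_true, aligned, average="macro", zero_division=0),
        "R":   metrics.recall_score(y_true, aligned, average="macro", zero_division=0),
        "F1":  metrics.f1_score(y_true, aligned, average="macro", zero_division=0),
    }
```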

Different algorithms require different parameter settings. Table 5 shows the optimal experimental parameters selected after multiple experiments, where \(d_c\) is the truncation distance, k is the number of neighbors, cn is the number of clusters, p is the ratio of neighbors to the total number of samples (that is, \(k=p*n\)), \(T_r\) is the path distance threshold ratio, and \(P_r\) is the distance penalty ratio; the distance threshold and distance penalty are obtained by multiplying these two ratios by the distance \(d_{ij}\). No parameters other than cn are required for DPC-MDNN.

Figure 6 shows the clustering results of the Aggregation50 dataset, a two-dimensional dataset, with six different algorithms for better visualization. There are two instances where the data points of two clusters are connected. In these two places, it is easy for errors in assigning labels to occur. For example, in the clustering results using the DPC algorithm, there are multiple sample point allocation errors, and other algorithms also have a small number of data point allocation errors at the intersection. Overall, the proposed algorithm adopts the natural neighbors and manifold distance methods to more accurately measure the distribution of sample points, so the allocation results of the algorithm are more accurate.

Comparison of the clustering results of the six algorithms using the Aggregation dataset.

The metric scores of the clustering results of DPC, DPC-KNN-PCA, SNN-DPC, DPC-CE, DPC-DLP and the DPC-MDNN proposed in this paper using each two-dimensional dataset are shown in Table 6. The best score is highlighted in bold. Table 6 shows that DPC-MDNN outperforms the DPC algorithm for each dataset, with increases in both the average ACC and the average ARI of 13.22% and 17.64%, respectively. The average AMI score improves by 12.58%. The average P score improves by 47.12%, the average R score improves by 38.38%, and the F1 score improves by 44.26%.

The DPC-MDNN algorithm outperforms the other algorithms across nearly all datasets and evaluation metrics, demonstrating its superior performance. The results from our experiments are as follows. For the Aggregation dataset, DPC-MDNN achieves an ACC of 0.987, an improvement of 1.0% over the second-ranked SNN-DPC, with an ACC of 0.977. In terms of the ARI, DPC-MDNN reaches 0.964, a 12.0% improvement over SNN-DPC, which has an ARI of 0.861. Additionally, DPC-MDNN leads in the AMI, with a score of 0.935, and in the F1 score, with a value of 0.993, representing a 19.6% increase compared with SNN-DPC, which achieves a score of 0.830. For the D31 dataset, the ACC of DPC-MDNN is 0.917, an 8.4% improvement over that of DPC-KNN-PCA, which has an ACC of 0.846. Although the ARI of DPC-MDNN is 0.911, which is slightly lower than the ARI of SNN-DPC at 0.922, DPC-MDNN outperforms in terms of the other metrics, such as the AMI, with a score of 0.952, representing a 1.8% improvement over SNN-DPC, which achieves a score of 0.935, indicating superior clustering consistency. For the Flame dataset, DPC-MDNN achieves an ACC of 0.957, which is 5.7% higher than the ACC of 0.905 for DPC-CE. It also leads in the ARI, with a value of 0.953, representing an 11.7% improvement over SNN-DPC, which has an ARI of 0.853, and in the AMI, with a score of 0.934, an improvement of 9.1% over DPC-CE, which achieves a score of 0.856. The F1 score of DPC-MDNN is 0.978, which is slightly lower than SNN-DPC's score of 0.987. For the Path-based dataset, DPC-MDNN performs best, with an ACC of 0.954, which is 6.5% higher than the ACC of 0.896 for DPC-CE. DPC-MDNN also leads in terms of the ARI, with a score of 0.926, showing a 10.2% improvement over SNN-DPC's ARI of 0.840, and in the AMI, with a score of 0.933, an improvement of 16.3% over DPC-CE, which achieves an AMI of 0.802. For the R15 dataset, the ACC of DPC-MDNN is 0.952, which is 1.8% higher than that of SNN-DPC (0.935). The F1 score is 0.992, which is slightly higher than that of SNN-DPC (0.990), and its AMI is 0.935, which is 0.4% higher than that of DPC-CE (0.931).

Column charts of six metric analyses for six algorithms with five 2D artificial datasets (Aggregation, D31, Flame, Path-based and R15).

In conclusion, DPC-MDNN achieves superior performance compared with DPC and outperforms the other algorithms across most datasets. Note that label alignment code is not provided in the implementations of the other algorithms, which results in lower metric values for some of their individual runs. Although some algorithms outperform DPC-MDNN on some datasets, the overall performance of DPC-MDNN is clearly better. This is because the algorithm considers the overall distribution of the dataset and identifies the natural neighborhood distribution of the sample points before the clustering process begins. In addition, the assignment strategy based on representative relationships and the secondary assignment strategy based on similarity merging reduce the possibility of chain errors in the assignment of sample points, thus enhancing the consistency between the labels assigned after clustering and the true cluster labels. DPC-MDNN is also more convenient to execute because it is self-adaptive, whereas the other algorithms require various parameters to be adjusted manually.

Table 7 shows the parameter settings of each algorithm across the multidimensional datasets. The parameter meanings are the same as those in the previous section, with constant values of \(T_r\) and \(P_r\), \(T_r\)=0.25, and \(P_r\)=0.3.

Figure 7 presents a plot of column charts for six metric analyses using six algorithms with five 2D artificial datasets (Aggregation, D31, Flame, Path-based and R15). As shown in Table 8, the clustering results of DPC-MDNN are generally better than those of the other clustering algorithms using the real datasets. For the Iris dataset, SNN-DPC achieved the best accuracy (0.973) and F1 score (0.953), demonstrating excellent classification ability. DPC-MDNN, with slightly lower accuracy (0.967), outperformed other methods in terms of clustering metrics such as the ARI (0.924) and AMI (0.923), whereas traditional DPC performed the worst (ACC of 0.667). For the Wine dataset, DPC-MDNN achieved the best results, with an accuracy of 0.843 and an F1 score of 0.924, leading all the algorithms. This model also exhibited strong ARI (0.459) and AMI (0.342). SNN-DPC and DPC-CE also achieved high accuracies (0.786 and 0.787, respectively) but lagged in overall performance. For the Ecoli dataset, DPC-MDNN achieved the highest accuracy (0.625) and AMI (0.392), with DPC-CE slightly outperforming the other methods in terms of the F1 score (0.495) by balancing precision and recall. SNN-DPC exhibited the highest precision (0.576) but weaker recall (0.416), whereas DPC and DPC-KNN-PCA performed poorly (F1 scores of 0.188 and 0.470, respectively). For the Seeds dataset, DPC-CE achieved the best overall performance, with an accuracy of 0.919, an ARI of 0.789, and an AMI of 0.751, leading all methods. SNN-DPC excelled in precision (0.895), while DPC-MDNN, although slightly behind, maintained high accuracy (0.905). For the Thyroid dataset, DPC-MDNN performed best, with an accuracy of 0.913, an F1 score of 0.884, an ARI of 0.856 and an AMI of 0.835, outperforming all the other methods. SNN-DPC also performed well, with a recall of 0.811 and an F1 score of 0.824. For the WDBC dataset, DPC-MDNN achieved the best results, leading all the algorithms in accuracy (0.933), F1 score (0.848), and clustering metrics (ARI of 0.748 and AMI of 0.636). Traditional methods, such as DPC and DPC-KNN-PCA, performed poorly (F1 scores of 0.589 and 0.557, respectively). For the Zoo dataset, DPC-MDNN outperformed the other methods in terms of accuracy (0.839), F1 score (0.541), ARI (0.765) and AMI (0.713). Traditional algorithms such as DPC and DPC-KNN-PCA had weaker performance (ACCs of 0.446 and 0.561, respectively). For the Ionosphere dataset, DPC-MDNN led in terms of accuracy (0.819), F1 score (0.654), ARI (0.357) and AMI (0.271). SNN-DPC was strong in terms of precision (0.701) and recall (0.633), whereas traditional methods lagged behind (F1 scores of 0.390 and 0.356). For the Pima dataset, the DPC-MDNN and DPC achieved the highest accuracy (0.668), but the DPC-MDNN achieved the best precision (0.557), while DPC had a slightly higher F1 score (0.505). All clustering metrics were low, with DPC-DLP having the highest ARI (0.102). For the Glass dataset, DPC-MDNN performed the best in terms of accuracy (0.631), ARI (0.235), and AMI (0.286). DPC-KNN-PCA had the highest precision (0.491) but weaker recall and F1 scores. SNN-DPC showed moderate overall performance (F1 score of 0.331).

Column charts of six metric analyses for six algorithms using ten multidimensional subdatasets from UCI (Iris, Wine, Ecoli, Seeds, Thyroid, WDBC, Zoo, Ionosphere, Pima and Glass).

Comparison of clustering results with the DPC and DPC-MDNN algorithms for three datasets (Thyroid, WDBC and Wine).

Figure 8 provides a visualization of the evaluation indicators (ACC, ARI, AMI, precision, recall and F1 score). The results indicate that the DPC-MDNN algorithm performs well overall and can accurately capture the dataset distribution. Figure 9 shows a comparison of the clustering results for the DPC and DPC-MDNN algorithms using different datasets. Moreover, Fig. 8 shows that for the Wine, Ecoli and Ionosphere datasets, DPC-MDNN yields significantly improved performance. For the Pima dataset, there is not much difference in the clustering results among the algorithms, but the results using DPC-MDNN are still improved. However, for the Seeds dataset, the performance is somewhat poor. This finding is related to the dimensionality of the dataset. The dimensionality of the Seeds and Pima datasets is not very high, so even DPC can achieve high accuracy. However, the dimensionality of the WDBC, Zoo and Ionosphere datasets is high, and the original DPC has difficulty analyzing the density distribution of these datasets. Therefore, there are shortcomings in the calculation of local density and relative distance, as well as in the selection of cluster centers. In summary, DPC-MDNN still exhibits good clustering performance for high-dimensional datasets and is suitable for a wider range of application scenarios.

To verify the applicability and effectiveness of the improved algorithm, image datasets, namely, the Olivetti Faces54 and Coil20 object55 datasets, were selected for the experiments. For the former, the first 100 facial images were selected for the experiment with 10 clusters, whereas for the latter, 150 object images were selected with 15 clusters. Both datasets are commonly used for verifying the effectiveness of clustering algorithms. Table 9 shows the parameter settings of each algorithm for the two image datasets, with the same parameter meanings as before.

Figure 10 provides cluster results for DPC-MDNN using two image datasets, namely, Olivetti Faces and Coil20. As shown in Table 10, DPC-MDNN demonstrated outstanding performance across all the metrics. In particular, for the Coil20 dataset, DPC-MDNN achieved perfect scores in accuracy (ACC), precision (P), recall (R), and F1 score, all reaching 1.0, indicating its flawless classification and clustering capabilities. For the Olivetti Faces dataset, DPC-MDNN had only two recognition errors and achieved an accuracy of 0.955 and an F1 score of 0.990, significantly outperforming the other algorithms, especially in analyzing complex, high-dimensional image data. In contrast, traditional DPC exhibited relatively weak performance for both datasets, with an F1 score of only 0.556 for the Coil20 dataset and 0.664 for the Olivetti Faces dataset, reflecting its limited ability to capture complex features.

Cluster results with DPC-MDNN for two image datasets, Olivetti faces and Coil20.

DPC-MDNN achieved the best performance for both the Olivetti Faces and Coil20 datasets, significantly surpassing the other algorithms across all the evaluation metrics. For the Coil20 dataset, a perfect score of 1.0 was achieved for all the metrics, including the accuracy, precision, recall, and F1 score, demonstrating the outstanding clustering and classification capabilities of the proposed algorithm. For complex, high-dimensional image datasets such as Olivetti faces, DPC-MDNN exhibited exceptional proficiency in capturing data feature distributions, selecting cluster centers, and managing nonlinear patterns. In contrast, the traditional DPC algorithm exhibited relatively weak performance, particularly for high-dimensional and complex datasets, with lower accuracy and F1 scores, highlighting its limitations in addressing intricate density distributions. While DPC-KNN-PCA and SNN-DPC showed partial improvements in some metrics, they still fell short of the performance of DPC-MDNN, especially in terms of the recall and F1 score. DPC-MDNN is well-suited for analyzing complex and high-dimensional image datasets, such as facial recognition and object classification tasks, where it significantly enhances accuracy and clustering performance. Figure 11 shows a visualization of the evaluation indicators. The results indicate that the DPC-MDNN algorithm performs well overall and can accurately capture the dataset distribution.

Column charts for the six metrics with six algorithms using two image datasets (Olivetti Faces and Coil20).

To comprehensively evaluate the applicability and effectiveness of the improved algorithm, this study employs two distinct datasets: the PASCAL VOC200756 dataset and the Breast Cancer Wisconsin57 (Diagnostic) dataset. The PASCAL VOC2007 dataset, which is widely known for its applicability in tasks such as object detection and image segmentation, was selected for clustering experiments in this study. From this dataset, 100 images spanning various object categories were chosen and grouped into 2 clusters to evaluate the performance of the algorithm in handling visual data with diverse content. In addition, the Breast Cancer Wisconsin (Diagnostic) dataset, a standard benchmark in medical research, was utilized to validate the algorithm’s ability to process structured numerical data. This dataset consists of 569 samples, representing two clusters for benign and malignant cases. The parameter settings for the algorithms used in these experiments are detailed in Table 11.

The indicators for the six algorithms using the PASCAL VOC2007 and Breast Cancer Wisconsin datasets are shown in Table 12. For the PASCAL VOC2007 dataset, DPC-MDNN achieved an ACC of 0.98, 34% higher than that of the second-place algorithm. For the Breast Cancer Wisconsin dataset, the DPC-MDNN algorithm yielded the best results in terms of all the indices; the ARI was 23.5% higher, the AMI was 25.3% higher, the precision was 8.2% higher, and the F1 score was 7.3% higher than those of the second-place algorithm. It can be concluded from the table that most indices of the DPC-MDNN algorithm are optimal, and some results are slightly inferior to those of the other algorithms. However, overall, the performance of the DPC-MDNN algorithm is better.

On the basis of the experimental results, the DPC-MDNN algorithm outperforms the other algorithms on the basis of the bioinformatics and image segmentation datasets, highlighting its advantages and broader applicability. The ARI and AMI, as key metrics for evaluating clustering performance, provide valuable insights. However, for the image segmentation dataset, both the ARI and AMI values are zero, whereas nonzero values are observed for the other datasets. This finding indicates that the data in the image segmentation task may lack clear grouping structures, making it challenging for any clustering results to align with the true classifications. These findings also suggest that clustering algorithms of this type face inherent limitations when applied to image segmentation datasets.

We performed experiments using three subdatasets (NSL-KDD, IOT-2358 and CIC-IDS-201759) to evaluate the effectiveness of the proposed network intrusion detection model DPC-MDNN.

NSL-KDD contains 125,973 records in the training set and 22,544 in the test set. The data format is consistent with that of KDDCUP99, featuring 32 continuous and 9 discrete attributes (41 total). The last attribute is a label indicating whether the data are normal or represent an attack. This section focuses on a clustering analysis across five classes, including both normal and abnormal categories, with the data distributions shown in Table 13.

The IoT-23 dataset includes 20 types of malicious traffic and 3 types of normal traffic in both raw PCAP and flow-based log files. The dataset used in this experiment consists of log files with 18 features and a label column, which indicates whether the traffic is malicious. Owing to the large size of the dataset, a subset was selected for this experiment. The selection rule was as follows: for each of the 23 device traffic categories, if the number of samples was less than 100,000, all samples were included; otherwise, only 10% of the samples were selected. After filtering, the dataset contains 3,414,435 records with 22 columns. Redundant labels were merged, and categories with too few samples were removed. The processed dataset, totaling 390 MB, is distributed as shown in Table 14.

The CIC-IDS-2017 dataset is well-known for its rich content and diverse attack types, including common network attacks such as DDoS, Botnet, and Web attacks, alongside normal traffic, enabling comprehensive security analysis. Each record is meticulously labeled with the attack type and stage, facilitating precise classification and model training. Owing to the large scale of the dataset, which contains over 2.8 million records, only a subset is used in this section. The selection rules are as follows: categories with fewer than 500 samples are fully included; for relatively small categories such as Web Attack-XSS and Web Attack-Brute Force, approximately 50% of the data are extracted. For medium-sized categories, extraction rates range from 10% to 50%, whereas for larger categories such as BENIGN and DoS Hulk, lower proportions are extracted to balance the data distribution and reduce the computational complexity. The detailed distribution is shown in Table 15.

Real-world data are sometimes inconsistent and lack certain behaviors or trends. Therefore, preprocessing the original data to address this issue is a proven research method. Preprocessing is similar to data mining, where the goal is to transform the data into an understandable format. This step is also important for improving the quality of data, such as removing noise and redundant information. Data cleaning, transformation, integration, and reduction are some of the key methods used in data preprocessing. In this work, we perform data transformation and data normalization preprocessing steps to align the data according to the proposed method.

The values of the features protocol type, service, flag, and class in the NSL-KDD dataset are of the string data type. However, the proposed algorithm supports only numerical values. Therefore, we need to convert the string values in the dataset to numerical values. Taking the labels as an example, the correspondence after digitization is shown in Table 16.

Similarly, in the IOT-23 dataset, proto, service, and conn_state, as well as the labels, are character-type features. The correspondences after the labels are digitized are shown in Table 17.

In the preprocessing of the CIC-IDS-2017 dataset, irrelevant string columns such as Flow ID, Source IP, Destination IP, and Timestamp were removed. Additionally, the label column was converted from string values to numerical values, with the mapping between values detailed in Table 18.
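A minimal preprocessing sketch is shown below, assuming CSV exports of the datasets and pandas; the file path is hypothetical, and integer coding via factorization is used here only for illustration, since the exact label mappings are those listed in Tables 16-18.

```python
import pandas as pd

# hypothetical CSV export of the CIC-IDS-2017 subset used in the experiments
df = pd.read_csv("cicids2017_subset.csv")

# drop identifier columns that carry no clustering information
df = df.drop(columns=["Flow ID", "Source IP", "Destination IP", "Timestamp"],
             errors="ignore")

# convert string-valued features (e.g., protocol type/service/flag in NSL-KDD,
# proto/service/conn_state in IoT-23) and the label column to integer codes
for col in df.select_dtypes(include="object").columns:
    df[col], _ = pd.factorize(df[col])
```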

Normalization is a key step that is applied in preprocessing to prepare the data for machine learning algorithms. The objective is to change the numeric values in the dataset to a common scale. Normalization becomes important when the values of features in the dataset have different ranges. This paper uses min–max normalization technology to convert the attribute value x of the original data to the value \(x^{'}\) in the interval [0,1], as shown in Eq. (21).

In Eq. (21), \(x_{i}\) represents the current data value, min(x) represents the smallest value in the given dataset, and max(x) represents the largest value in the dataset.
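A short sketch of Eq. (21), applied column-wise so that every attribute is rescaled to [0, 1]; the helper name is illustrative.

```python
import numpy as np

def min_max_normalize(X):
    """Min-max normalization of Eq. (21): x' = (x - min(x)) / (max(x) - min(x))."""
    X = np.asarray(X, dtype=float)
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)   # guard constant columns
    return (X - x_min) / span
```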

The NSL-KDD and IOT-23 datasets, with both single-attack and mixed-attack settings, were used to evaluate the performance of the proposed algorithm, DPC-MDNN. In the single-attack type detection experiment, data of each attack type are fused with normal data separately for detection. This experiment verifies the effectiveness of the model in identifying known attack types. The mixed-attack type detection experiment refers to the detection of all attack types after mixing them, which better demonstrates the feasibility of the algorithm. In addition, we compared and analyzed the algorithms using three evaluation indicators: the detection rate (DR, i.e., recall), the false alarm rate (FR), and the accuracy (ACC). In this experiment, three other intrusion detection models based on clustering algorithms, namely, DPC, DPNN21, and RDP60, were used for comparison. The corresponding parameters are listed in Table 19.

Table 20 presents a comparison of the detection accuracy of the different intrusion detection models on NSL-KDD for single attack types. As shown in Table 20, the DPC-based intrusion detection model can effectively detect DoS and PROBE attacks, but its accuracy decreases for U2R and R2L attacks, yielding less reliable results. The DPNN algorithm accurately detects DoS and PROBE attacks but not U2R or R2L attacks. The RDP algorithm has weaker overall intrusion detection capability, although it detects attack types such as PROBE more accurately. The DPC-MDNN algorithm proposed in this paper detects three types of data, R2L, DoS, and PROBE, more accurately, thereby improving the detection effectiveness.

Table 21 shows the performance differences of the intrusion detection models in identifying mixed attack types on the NSL-KDD dataset. For the mixed attack types, DPC-MDNN achieves ACCs of 91.8% and 89.6%, DRs of 87.2% and 85.1%, and FRs of only 6.7% and 6.9%, respectively. Although the DPNN-based intrusion detection model has a high DR, its FR is also high. In comparison, DPC-MDNN achieves a very low FR together with a high ACC.

To confirm the applicability of the DPC-MDNN algorithm to network anomaly detection, we also performed several experiments on IOT-23, a recent network traffic dataset covering 11 common traffic classes, including Benign, DDoS, PartOfAHorizontalPortScan, Okiru, and C&C.

Table 22 presents a comparison of the detection accuracy of the different intrusion detection models on IOT-23 for single attack types. DPC has the lowest detection accuracy among the compared algorithms. DPNN achieves higher detection accuracy than DPC-MDNN for the Benign and DDoS types, but DPC-MDNN achieves the highest detection accuracy for all four remaining attack types. Overall, DPC-MDNN improves on the original DPC and the other algorithms.

Table 23 shows the performance differences of the intrusion detection models in identifying mixed attack types on the IOT-23 dataset. In the mixed attack experiments on IOT-23, the algorithm proposed in this paper is optimal on all three indicators, with DPNN slightly inferior.

Table 24 shows the evaluation indicators and running time of each algorithm on the CIC-IDS-2017 dataset. The data in the table show that the DPC-MDNN algorithm performs best on this dataset: its DR is 3% higher, its FR 1.3% lower, and its ACC 1.2% higher than those of the second-place algorithm, although its running time is slightly longer than that of some algorithms. However, for the other algorithms, parameter tuning is itself a time-consuming process, because appropriate parameters must be found before they can produce the correct number of clusters and reach higher accuracy. In terms of overall time, therefore, our algorithm still has advantages.

In summary, when DPC-MDNN is used for single-attack and mixed-attack detection on the NSL-KDD and IOT-23 datasets, its evaluation index scores are mostly the best. Compared with the original DPC, the clustering effect is improved to a certain extent, and the accuracy is also significantly improved, especially for attack types with relatively few samples. DPC-MDNN achieves excellent performance, with a high DR, a low FR, and a high ACC, verifying the algorithm's good performance in detecting attack behavior.

This paper introduces a new clustering algorithm named DPC-MDNN, which uses the point with the highest density in the natural neighborhood as the core point for clustering. After the natural neighborhood information of the dataset samples is obtained and their local density is computed, representative points are assigned to each point on the basis of density and neighborhood. Candidate points are then selected from these representative points, and the remaining points are assigned to the corresponding candidate points. Finally, a new similarity is defined so that DPC-MDNN can apply a candidate-based hierarchical clustering method. The feasibility of this approach has been demonstrated on various datasets. The experimental results show that DPC-MDNN can identify complex shapes in manifold-structured datasets and clusters of different densities in nonuniformly distributed datasets, in particular when high- and low-density clusters are near each other. The experimental results on well-known datasets show that the proposed DPC-MDNN method can identify anomalies with higher detection accuracy.

The datasets used and analyzed during the current study are publicly available, and their respective access addresses can be found in the references. The implementation code of the proposed new algorithm is available at https://github.com/foreverwhb-ustb/DPC-MDNN.git.

Jain, A. K., Murty, M. N. & Flynn, P. J. Data clustering: a review. ACM Comput. Surv. (CSUR) 31, 264–323 (1999).

Pu, G., Wang, L., Shen, J. & Dong, F. A hybrid unsupervised clustering-based anomaly detection method. Tsinghua Sci. Technol. 26, 146–153. https://doi.org/10.26599/TST.2019.9010051 (2021).

Liu, Y., Li, W., Dong, X. & Ren, Z. Resilient formation tracking for networked swarm systems under malicious data deception attacks. Int. J. Robust Nonlinear Control. https://doi.org/10.1002/rnc.7777.

Zhou, W. et al. Hidim: A novel framework of network intrusion detection for hierarchical dependency and class imbalance. Comput. Secur. 148, 104155. https://doi.org/10.1016/j.cose.2024.104155 (2025).

Li, C. et al. Rfl-apia: A comprehensive framework for mitigating poisoning attacks and promoting model aggregation in iiot federated learning. IEEE Trans. Industr. Inf. 20, 12935–12944. https://doi.org/10.1109/TII.2024.3431020 (2024).

mehdi Cherrat, E., Alaoui, R. & Bouzahir, H. Improving of fingerprint segmentation images based on k-means and dbscan clustering. Int. J. Electr. Comput. Eng. (IJECE) 9, 2425–2432 (2019).

Halim, Z., Waqas, M., Baig, A. R. & Rashid, A. Efficient clustering of large uncertain graphs using neighborhood information. Int. J. Approx. Reason. 90, 274–291 (2017).

Abdolzadegan, D., Moattar, M. H. & Ghoshuni, M. A robust method for early diagnosis of autism spectrum disorder from EEG signals based on feature selection and dbscan method. Biocybern. Biomed. Eng. 40, 482–493 (2020).

Halim, Z., Atif, M., Rashid, A. & Edwin, C. A. Profiling players using real-world datasets: Clustering the data and correlating the results with the big-five personality traits. IEEE Trans. Affect. Comput. 10, 568–584 (2017).

Uzma, Al-Obeidat, F., Tubaishat, A., Shah, B. & Halim, Z. Gene encoder: a feature selection technique through unsupervised deep learning-based clustering for large gene expression data. Neural Comput. Appl. 1–23 (2020).

Elahi, E. & Halim, Z. Graph attention-based collaborative filtering for user-specific recommender system using knowledge graph and deep neural networks. Knowl. Inf. Syst. 64, 2457–2480 (2022).

Hartigan, J. A. & Wong, M. A. Algorithm AS 136: A K-Means clustering algorithm. Journal of the Royal Statistical Society. Series C (Appl. Stat.) 28, 100–108. https://doi.org/10.2307/2346830 (1979).

Park, H.-S. & Jun, C.-H. A simple and fast algorithm for K-medoids clustering. Expert Syst. Appl. 36, 3336–3341. https://doi.org/10.1016/j.eswa.2008.01.039 (2009).

Guha, S., Rastogi, R. & Shim, K. CURE: An efficient clustering algorithm for large databases. ACM SIGMOD Rec. 27, 73–84. https://doi.org/10.1145/276305.276312 (1998).

Zhang, T., Ramakrishnan, R. & Livny, M. BIRCH: An efficient data clustering method for very large databases. SIGMOD Rec. 25, 103–114. https://doi.org/10.1145/235968.233324 (1996).

Ester, M., Kriegel, H.-P., Sander, J. & Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. - 2nd International Conference on Knowledge Discovery and Data Mining, KDD 1996, 226 – 231 (1996).

Ankerst, M., Breunig, M. M., Kriegel, H.-P. & Sander, J. Optics: Ordering points to identify the clustering structure. ACM SIGMOD Rec. 28, 49–60 (1999).

Wang, W., Yang, J. & Muntz, R. R. Sting: A statistical information grid approach to spatial data mining. In Proc. of the 23rd International Conference on Very Large Data Bases, 186–195 (1997).

Rodriguez, A. & Laio, A. Clustering by fast search and find of density peaks. Science 344, 1492–1496. https://doi.org/10.1126/science.1242072 (2014).

Lei, T. et al. Automatic fuzzy clustering framework for image segmentation. IEEE Trans. Fuzzy Syst. 28, 2078–2092. https://doi.org/10.1109/TFUZZ.2019.2930030 (2020).

Li, L., Zhang, H., Peng, H. & Yang, Y. Nearest neighbors based density peaks approach to intrusion detection. Chaos Solitons Fractals 110, 33–40. https://doi.org/10.1016/j.chaos.2018.03.010 (2018).

Xu, M., Li, Y., Li, R., Zou, F. & Gu, X. Eadp: An extended adaptive density peaks clustering for overlapping community detection in social networks. Neurocomputing 337, 287–302. https://doi.org/10.1016/j.neucom.2019.01.074 (2019).

Xie, X., Liu, H., Zeng, S., Lin, L. & Li, W. A novel progressively undersampling method based on the density peaks sequence for imbalanced data. Knowl.-Based Syst. 213, 106689. https://doi.org/10.1016/j.knosys.2020.106689 (2021).

Zarikas, V., Poulopoulos, S. G., Gareiou, Z. & Zervas, E. Clustering analysis of countries using the covid-19 cases dataset. Data Brief 31, 105787. https://doi.org/10.1016/j.dib.2020.105787 (2020).

Du, M., Ding, S. & Jia, H. Study on density peaks clustering based on k-nearest neighbors and principal component analysis. Knowl.-Based Syst. 99, 135–145. https://doi.org/10.1016/j.knosys.2016.02.001 (2016).

Xie, J., Gao, H., Xie, W., Liu, X. & Grant, P. W. Robust clustering by detecting density peaks and assigning points based on fuzzy weighted k-nearest neighbors. Inf. Sci. 354, 19–40. https://doi.org/10.1016/j.ins.2016.03.011 (2016).

Wang, Z., Li, Y., Du, H. & Wei, X. A fast density peak clustering method with autoselect cluster centers. Mob. Inf. Syst. 2022, e4176101. https://doi.org/10.1155/2022/4176101 (2022).

Seyedi, S. A., Lotfi, A., Moradi, P. & Qader, N. N. Dynamic graph-based label propagation for density peaks clustering. Expert Syst. Appl. 115, 314–328. https://doi.org/10.1016/j.eswa.2018.07.075 (2019).

Jiang, D., Zang, W., Sun, R., Wang, Z. & Liu, X. Adaptive density peaks clustering based on k-nearest neighbor and gini coefficient. IEEE Access 8, 113900–113917. https://doi.org/10.1109/ACCESS.2020.3003057 (2020).

Guan, J., Li, S., He, X., Zhu, J. & Chen, J. Fast hierarchical clustering of local density peaks via an association degree transfer method. Neurocomputing 455, 401–418. https://doi.org/10.1016/j.neucom.2021.05.071 (2021).

Zhang, Z. et al. Density decay graph-based density peak clustering. Knowl.-Based Syst. 224, 107075. https://doi.org/10.1016/j.knosys.2021.107075 (2021).

Canadian Institute for Cybersecurity. NSL-KDD dataset. https://www.unb.ca/cic/datasets/nsl.html (2009). Accessed 15 March 2023.

Dai, Y., Yuan, G., Yang, Z. & Wang, B. K-modes clustering algorithm based on weighted overlap distance and its application in intrusion detection. Sci. Program. 2021, 9972589 (2021).

Yao, S., Xu, H., Yan, L. & Su, J. Applying an improved dbscan clustering algorithm to network intrusion detection. In 2021 11th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS), vol. 2, 865–868 (IEEE, 2021).

Zhang, J., Sun, J. & He, H. Clustering detection method of network intrusion feature based on support vector machine and lca block algorithm. Wirel. Pers. Commun. 127, 599–613 (2022).

Zou, L., Luo, X., Zhang, Y., Yang, X. & Wang, X. Hc-dttsvm: a network intrusion detection method based on decision tree twin support vector machine and hierarchical clustering. IEEE Access 11, 21404–21416 (2023).

Xie, B., Dong, X. & Wang, C. An improved k-means clustering intrusion detection algorithm for wireless networks based on federated learning. Wirel. Commun. Mob. Comput. 2021, 9322368 (2021).

Jain, A. K. & Law, M. H. C. Data clustering: A user’s dilemma. In Pal, S. K., Bandyopadhyay, S. & Biswas, S. (eds.) Pattern Recognition and Machine Intelligence, Lecture Notes in Computer Science, 1–10, https://doi.org/10.1007/11590316_1 (Springer, 2005).

Fu, L. & Enzo, M. FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data. BMC Bioinform. 8, 3, 1–15. https://doi.org/10.1186/1471-2105-8-3 (2007).

Zhu, Q., Feng, J. & Huang, J. Natural neighbor: A self-adaptive neighborhood method without parameter k. Pattern Recogn. Lett. 80, 30–36 (2016).

Li, J., Zhu, Q., Wu, Q. & Fan, Z. A novel oversampling technique for class-imbalanced learning based on smote and natural neighbors. Inf. Sci. 565, 438–455 (2021).

Ram, P. & Sinha, K. Revisiting kd-tree for nearest neighbor search. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, 1378-1388, https://doi.org/10.1145/3292500.3330875 (Association for Computing Machinery, 2019).

Wang, R., Shan, S., Chen, X., Dai, Q. & Gao, W. Manifold-manifold distance and its application to face recognition with image sets. IEEE Trans. Image Process. 21, 4466–4479 (2012).

Tenenbaum, J. B., de Silva, V. & Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323. https://doi.org/10.1126/science.290.5500.2319 (2000).

Cheng, D., Zhu, Q., Huang, J., Yang, L. & Wu, Q. Natural neighbor-based clustering algorithm with local representatives. Knowl.-Based Syst. 123, 238–253. https://doi.org/10.1016/j.knosys.2017.02.027 (2017).

Liu, R., Wang, H. & Yu, X. Shared-nearest-neighbor-based clustering by fast search and find of density peaks. Inf. Sci. 450, 200–226. https://doi.org/10.1016/j.ins.2018.03.031 (2018).

Guo, W. et al. Density peak clustering with connectivity estimation. Knowl.-Based Syst. 243, 108501. https://doi.org/10.1016/j.knosys.2022.108501 (2022).

Frank, A. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml (2010).

Chang, H. & Yeung, D.-Y. Robust path-based spectral clustering. Pattern Recogn. 41, 191–203. https://doi.org/10.1016/j.patcog.2007.04.010 (2008).

Gionis, A., Mannila, H. & Tsaparas, P. Clustering aggregation. ACM Trans. Knowl. Discov. Data 1, 4-es. https://doi.org/10.1145/1217299.1217303 (2007).

Fu, L. & Enzo, M. FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data. BMC Bioinform. 8, 3 (2007).

Veenman, C., Reinders, M. & Backer, E. A maximum variance cluster algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 24, 1273–1280. https://doi.org/10.1109/TPAMI.2002.1033218 (2002).

Mehmood, R., Zhang, G., Bie, R., Dawood, H. & Ahmad, H. Clustering by fast search and find of density peaks via heat diffusion. Neurocomputing 208, 210–217. https://doi.org/10.1016/j.neucom.2016.01.102 (2016).

Lai, J. H., Yuen, P. C. & Feng, G. C. Face recognition using holistic Fourier invariant features. Pattern Recogn. 34, 95–109. https://doi.org/10.1016/S0031-3203(99)00200-9 (2001).

Samaria, F. & Harter, A. Parameterisation of a stochastic model for human face identification. In Proc. of 1994 IEEE Workshop on Applications of Computer Vision, 138–142, https://doi.org/10.1109/ACV.1994.341300 (1994).

Everingham, M. & Winn, J. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) development kit. Int. J. Comput. Vis. 111, 98–136 (2006).

Wolberg, W. Breast Cancer Wisconsin (Original). UCI Machine Learning Repository. https://doi.org/10.24432/C5HP4Z (1990).

Parmisano, A., Garcia, S. & Erquiaga, M. Iot-23 dataset: A labeled dataset of malware and benign iot traffic. Avast-AIC laboratory, Stratosphere IPS, Czech Technical University (CTU) (2019).

Goryunov, M. N., Matskevich, A. G. & Rybolovlev, D. A. Synthesis of a machine learning model for detecting computer attacks based on the cicids2017 dataset. Proc. of the Institute for System Programming of the RAS 32, 81–94 (2020).

Hu, X., Yan, M., Chen, Y., Yang, L. & Du, J. Rotation-dpeak: Improving density peaks selection for imbalanced data. In Big Data (eds Mei, H. et al.) 45–58 (Springer Singapore, 2021).

This work was partly supported by the National Natural Science Foundation of China (61572074), the Key Technologies Research and Development Program of China (2020YFB1712104) and the Research and Development of Intelligent Construction and Intelligent Operation Big Data Science Platform System (2024–0459).

Jinyu Zhang, Yu Shen, Siqi Wang, Bo Deng and Wentao Zhao contributed equally to this work.

School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, 100083, China

Hongbo Wang, Jinyu Zhang, Yu Shen, Siqi Wang & Bo Deng

USTB-CERIS Joint Innovation Centre of Big Data Science for Intelligent Construction and Smart Operation, University of Science and Technology Beijing, Beijing, 100083, China

Hongbo Wang

Courant Institute of Mathematical Sciences, New York University, New York, 10012, USA

Wentao Zhao

Hongbo Wang and Yu Shen carried out the studies and drafted the manuscript. Jinyu Zhang carried out the experimental tests and improved the manuscript. Siqi Wang and Bo Deng improved the comparison of reinforcement performance indicators through new experiments and data analysis according to the three reviewers' revision suggestions. Wentao Zhao conducted a comprehensive and in-depth analysis of the experimental data and results. All the authors reviewed the final manuscript.

Correspondence to Hongbo Wang or Wentao Zhao.

All the authors (Hongbo Wang, Yu Shen, Jinyu Zhang, Siqi Wang, Bo Deng and Wentao Zhao) declare that there are no conflicts of interest regarding the publication of this paper.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Wang, H., Zhang, J., Shen, Y. et al. Improved density peak clustering with a flexible manifold distance and natural nearest neighbors for network intrusion detection. Sci Rep 15, 8510 (2025). https://doi.org/10.1038/s41598-025-92509-4

Received: 14 July 2024

Accepted: 27 February 2025

Published: 12 March 2025

DOI: https://doi.org/10.1038/s41598-025-92509-4
