Algebraic Topology Methods for Data Clustering in High Dimensions

Authors

  • Yifan Zhang, Pennon Education Qingdao School, Qingdao, China

Keywords:

high-dimensional clustering, topological data analysis, persistent homology, persistence images, cluster stability

Abstract

High-dimensional clustering is widely used in image analysis, sensor data mining, and exploratory machine learning. Its reliability, however, is often compromised by distance concentration, heterogeneous data density, noise, and nonlinear data geometry. Existing methods such as K-means, DBSCAN, HDBSCAN, and UMAP-based clustering provide useful geometric or density-based partitions, but they do not directly use persistent topological structure as a criterion for evaluating cluster stability. To address this gap, this paper proposes Topology-Guided Stable Clustering (TGSC), a framework designed to make clustering more robust. TGSC combines sparse mutual k-nearest-neighbor filtration, local persistent homology, persistence-image representation, topology-geometry feature fusion, and bootstrap-based stability selection to capture multi-scale structure. Experiments on two public high-dimensional datasets show that TGSC improves clustering performance over the UMAP+HDBSCAN baseline. On Fashion-MNIST, the Adjusted Rand Index (ARI) increases from 0.514 ± 0.027 to 0.608 ± 0.022, and the Normalized Mutual Information (NMI) rises from 0.596 ± 0.023 to 0.667 ± 0.018. On UCI-HAR, the ARI improves from 0.571 ± 0.025 to 0.646 ± 0.024. With 15% Gaussian noise, TGSC raises the ARI by 9.6 percentage points and noise-label precision by 12.6 percentage points, at a runtime ratio of 1.19. These results suggest that persistent topological summaries provide an effective stability signal for high-dimensional clustering, and that persistence diagrams and Betti-curve analysis support structural interpretation of the resulting clusters.
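The abstract names a mutual k-nearest-neighbor filtration followed by persistent homology but gives no implementation details. As a rough illustration only, the sketch below builds a mutual k-NN graph and computes the death scales of its connected components (0-dimensional persistence) with a union-find pass over the edges in order of increasing length. The function names `mutual_knn_edges` and `h0_persistence`, the choice of Euclidean distance, the value of `k`, and the toy points are all assumptions made for this example. It is not the TGSC implementation, and it does not cover local or higher-dimensional homology, persistence images, feature fusion, or bootstrap selection.

```python
import math


def mutual_knn_edges(points, k):
    """Return sorted (distance, i, j) edges, i < j, for pairs that appear in
    each other's k-nearest-neighbor lists (Euclidean distance, assumed here)."""
    n = len(points)
    neighbors = []
    for i in range(n):
        # Rank every other point by distance to point i and keep the k closest.
        ranked = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbors.append({j for _, j in ranked[:k]})
    edges = [
        (math.dist(points[i], points[j]), i, j)
        for i in range(n)
        for j in neighbors[i]
        if i < j and i in neighbors[j]  # keep the edge only if it is mutual
    ]
    return sorted(edges)


def h0_persistence(n, edges):
    """Sweep edges by increasing length and record the scale at which two
    components merge. Every component is born at scale 0, so each recorded
    value is a death time in the 0-dimensional persistence diagram."""
    parent = list(range(n))

    def find(a):
        # Union-find root lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)
    surviving = n - len(deaths)  # components that never merge in this graph
    return deaths, surviving


# Hypothetical toy data: two well-separated groups of three points each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_edges = mutual_knn_edges(toy_points, k=2)
toy_deaths, toy_components = h0_persistence(len(toy_points), toy_edges)
```

In this toy case the mutual k-NN graph has no edge between the two groups, so two components survive the whole sweep. Components that persist across scales in this way are the kind of signal a stability criterion could draw on, although how TGSC scores persistence is not specified in the abstract.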

Published

2026-06-14