CLUEstering
High-performance density-based weighted clustering library developed at CERN
|
CLUEstering is a C++ library designed for high-performance, multi-dimensional clustering. It is based on the CLUE algorithm, a highly-parallel, density-based weighted clustering algorithm developed at CERN inside the CMS experiment. Unlike many popular clustering algorithm, CLUEstering combines the flexibility of density=based clustering, which makes it robust to noisy data and thus particularly suitable for experimental data, with the generality of weighted clustering, which allows to use the algorithm in a wide range of applications where the points are not all equal but they can have different relative importance. Whereas for most common density-based algorithms weights can only be assigned to the points by modifying the distance matrix or the dataset, CLUEstering uses the values directly in the computation of each point's density, thus allowing to use the weights in a more natural way and providing better performance.
The CLUE algorithm is designed to be highly parallel, and thus efficiently executed on modern hardware accelerators such as GPUs and FPGAs. CLUEstering's backend uses alpaka, a C++ performance portability library to make the backend portable and automatically executable on many tipes of accelerators without any code duplicaiton. Finally, the high scalability of the algirithm makes it suitable for large datasets, especially when executed on parallel hardware.
The algorithm proceeds in four main steps, which are also illustrated in the GIF below:
dc
parameter, that represents the radius inside of which the points are included in the computation of the local density for each point.dm
parameter.rhoc
become seeds, which act as cluster centres, while the others are marked as outliers.dc
, rhoc
and dm
.