CLUEstering
High-performance density-based weighted clustering library developed at CERN
Getting started

This section shows how to write minimal code that clusters data using CLUEstering.

Using the C++ interface

Below is a simple C++ code snippet, which can also be found in the examples folder of the repository, along with the CMake file used to build it:

#include <cstddef>

#include <CLUEstering/CLUEstering.hpp>

int main() {
  // Obtain the queue, which is used for allocations and kernel launches.
  auto queue = clue::get_queue(0u);

  // Allocate the points on the host and device.
  clue::PointsHost<2> h_points = clue::read_csv<2>(queue, "path-to-data.csv");
  clue::PointsDevice<2> d_points(queue, h_points.size());

  // Define the parameters for the clustering and construct the clusterer.
  const float dc = 20.f, rhoc = 10.f, outlier = 20.f;
  clue::Clusterer<2> algo(queue, dc, rhoc, outlier);

  // Launch the clustering.
  // The results will be stored in the `clue::PointsHost` object.
  const std::size_t block_size{256};
  algo.make_clusters(h_points, d_points, clue::FlatKernel{.5f}, queue, block_size);

  // Read the results from the host points.
  // Get the cluster index for each point.
  auto clusters_indexes = h_points.clusterIndexes();
  // Obtain a boolean array indicating which points are the seeds,
  // i.e. the cluster centers.
  auto seed_map = h_points.isSeed();
}

The first step is to create the Queue object. A Queue can be thought of as a std::thread or as a CUDA/HIP stream, and represents a queue of operations to be executed on a specific device. The queue is used to allocate memory and to launch the kernels on the device. The clue::get_queue function provides a convenient way to obtain a queue from a specific device, whose index is passed as an argument. Alternatively, the clue::get_device function can be used to obtain a device object, representing an accelerator device, which can then be used to create a queue:

auto device = clue::get_device(0u); // Get the device with index 0
auto queue = clue::get_queue(device); // We then create a queue from the device with the `get_queue` function
auto another_queue = clue::Queue(device); // or we can also call the Queue's constructor directly

The next step is to create the containers for the points. CLUEstering provides the clue::PointsHost and clue::PointsDevice containers, which represent data allocated on the host and on the device respectively. Here the data is read from a CSV file using the clue::read_csv function, which returns a PointsHost object, and then an empty PointsDevice object of the same size is created.

Then, the clue::Clusterer, which is the object that handles the internal allocations and contains the algorithm logic, is created. The Clusterer requires the CLUE algorithm's parameters to be passed. Their meaning is explained in the introduction section, along with a description of the algorithm.

Finally, the algorithm is launched with the make_clusters method, which takes as arguments the host and device points, the kernel to use for the clustering, the queue to use for the operations and the block size. The input data is copied from the host to the device container, the algorithm is executed on the device, and the results are then copied back to the host container.

The results of the clustering can then be read from the host points: the clue::PointsHost::clusterIndexes method returns a span of integers representing the cluster index for each point, while the clue::PointsHost::isSeed method returns a boolean array indicating which points are the seeds of the clusters.

How to compile the code

In order to compile code that uses CLUEstering, three steps are needed:

  1. including the library headers, either by fetching the source code (with wget, git submodule, CMake FetchContent or similar) or by installing the library and getting its path with the CMake find_package command.
  2. including and, if needed, linking the backend-specific libraries and/or compilers
  3. specifying the alpaka backend to use

Here is a full example of a CMake file that compiles the code above:

cmake_minimum_required(VERSION 3.16.0)
project(CLUEsteringExample)

set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)

find_package(CLUEstering)
if(NOT CLUEstering_FOUND)
  message(
    FATAL_ERROR
      "CLUEstering not found. Please install it."
  )
endif()

if(ALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED)
  add_executable(serial.out main.cpp)
  target_compile_definitions(serial.out
                             PRIVATE ALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED)
endif()

if(ALPAKA_ACC_CPU_B_TBB_T_SEQ_ENABLED)
  find_package(TBB REQUIRED)
  if(NOT TARGET TBB::tbb)
    message(FATAL_ERROR "TBB not found. Please install it.")
  endif()
  add_executable(tbb.out main.cpp)
  target_compile_definitions(tbb.out PRIVATE ALPAKA_ACC_CPU_B_TBB_T_SEQ_ENABLED)
  target_link_libraries(tbb.out PRIVATE TBB::tbb)
endif()

if(ALPAKA_ACC_CPU_B_OMP2_T_SEQ_ENABLED)
  find_package(OpenMP REQUIRED)
  if(NOT TARGET OpenMP::OpenMP_CXX)
    message(FATAL_ERROR "OpenMP not found. Please install it.")
  endif()
  add_executable(openmp.out main.cpp)
  target_compile_definitions(openmp.out
                             PRIVATE ALPAKA_ACC_CPU_B_OMP2_T_SEQ_ENABLED)
  target_link_libraries(openmp.out PRIVATE OpenMP::OpenMP_CXX)
endif()

if(ALPAKA_ACC_GPU_CUDA_ENABLED)
  include(CheckLanguage)
  check_language(CUDA)
  if(CMAKE_CUDA_COMPILER)
    set_source_files_properties(main.cpp PROPERTIES LANGUAGE CUDA)
    add_executable(cuda.out main.cpp)
    target_compile_definitions(cuda.out PRIVATE ALPAKA_ACC_GPU_CUDA_ENABLED)
    target_compile_options(cuda.out PRIVATE --expt-relaxed-constexpr)
    set_target_properties(
      cuda.out PROPERTIES CUDA_SEPARABLE_COMPILATION ON CUDA_ARCHITECTURES
                          "50;60;61;62;70;80;90")
  else()
    message(FATAL_ERROR "CUDA not found. Please install it.")
  endif()
endif()

Using the Python interface

Here is a minimal example of Python code using CLUEstering:

import CLUEstering as clue
clust = clue.clusterer(1., 5., 1.5)
clust.read_data(data)
clust.run_clue()
clust.to_csv('./output/', 'data_results.csv')

Just like in the C++ interface, the CLUEstering.clusterer constructor takes the CLUE parameters as arguments. The parameters can also be updated using the CLUEstering.clusterer.set_params method. The data is read using the CLUEstering.clusterer.read_data method, which accepts five data formats: pandas DataFrames, Python lists, Numpy arrays, Python dictionaries and strings containing paths to CSV files. The algorithm is run with the CLUEstering.clusterer.run_clue method. The backend used for running the algorithm can be specified by passing a string to run_clue:

clust.run_clue("cpu serial")
clust.run_clue("cpu tbb")
clust.run_clue("cpu openmp")
clust.run_clue("gpu cuda")
clust.run_clue("gpu hip")

NOTE: the support for the SYCL backends is still in an experimental state and is not included in the Python interface as of version 2.6.5.

The results of the clustering can be accessed with the clusterer's public getters:

clust.n_clusters # returns the number of clusters
clust.n_seeds # returns the number of seeds
clust.clusters # returns a list of the clusters found
clust.cluster_ids # returns an array with the cluster index of each point
clust.is_seed # returns a boolean array specifying which points are seeds
clust.cluster_points # returns the clusters as nested arrays containing the points in each cluster
clust.points_per_cluster # returns an array with the number of points in each cluster
clust.output_df # returns a dataframe with the input data and the results of the clustering

The output dataframe can also be exported to a CSV file with the CLUEstering.clusterer.to_csv method.
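
To make the relationship between these getters more concrete, the sketch below derives the aggregate quantities from a per-point cluster index array and a seed mask, using plain Python lists with made-up values. It is only an illustration: the arrays are hypothetical, and the convention that outliers carry the index -1 is an assumption, not something stated in this section.

```python
from collections import Counter

# Hypothetical per-point results, shaped like cluster_ids and is_seed.
# Assumption: points not assigned to any cluster (outliers) are labelled -1.
cluster_ids = [0, 0, 1, -1, 1, 1]
is_seed = [True, False, False, False, True, False]

# Count the points in each cluster, skipping the assumed outlier label.
counts = Counter(cid for cid in cluster_ids if cid >= 0)

n_clusters = len(counts)                                   # number of distinct clusters
points_per_cluster = [counts[c] for c in sorted(counts)]   # sizes ordered by cluster index
n_seeds = sum(is_seed)                                     # one seed per cluster is expected
```

How the library itself treats outliers when reporting n_clusters is not covered here, so the values returned by the real getters should be taken as authoritative.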

Finally, the clusterer provides the CLUEstering.clusterer.input_plotter method for plotting the input data and the CLUEstering.clusterer.cluster_plotter method for plotting the clustered data.

Data plotted with the input plotter
Data plotted with the cluster plotter

Format of the data required by CLUEstering

To finish this section, we describe the format expected for the data passed to CLUEstering.

CSV files

When the data is passed from a CSV file, each row should contain the data for one point: the coordinates in order, followed by the point's weight. In the header, the coordinates should be labelled as x*, going from 0 to N-1 where N is the number of dimensions, and the final column, holding the weight, should be called weight. Below is an example:

x0,x1,x2,weight
-9.95,5.17,0.15,1.0
-9.43,5.68,0.15,1.0
-11.0,7.29,0.15,1.0
-10.7,-4.37,0.15,1.0
3.5,4.48,0.15,1.0
3.0,2.94,0.15,1.0
-9.97,4.04,0.15,1.0
-10.36,-4.39,0.15,1.0
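
As a minimal sketch of the layout described above, the following Python snippet writes a set of points in this format using only the standard library. The helper name write_clue_csv is hypothetical and is not part of CLUEstering; it only builds the x0 ... x{N-1}, weight header and one row per point.

```python
import csv
import io


def write_clue_csv(stream, coords, weights):
    """Write one row per point: the coordinates in order, then the weight.

    Hypothetical helper, not part of the CLUEstering API.
    `coords` holds one coordinate list per point, `weights` one weight per point.
    """
    if len(coords) != len(weights):
        raise ValueError("each point needs exactly one weight")
    ndim = len(coords[0]) if coords else 0
    writer = csv.writer(stream, lineterminator="\n")
    # Header: x0, x1, ..., x{N-1}, weight
    writer.writerow([f"x{d}" for d in range(ndim)] + ["weight"])
    for point, weight in zip(coords, weights):
        writer.writerow(list(point) + [weight])


# Write two 3D points to an in-memory buffer; a file opened with
# open(path, "w", newline="") can be used in the same way.
buffer = io.StringIO()
write_clue_csv(buffer, [[-9.95, 5.17, 0.15], [3.5, 4.48, 0.15]], [1.0, 1.0])
csv_text = buffer.getvalue()
```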

Passing data to clue::PointsHost and clue::PointsDevice

The host and device containers in the C++ interface expect data to be passed in an SoA format, meaning that the coordinate values of all the points in each dimension should be adjacent in memory. The difference between an SoA layout and a traditional AoS layout is shown in the diagram below.
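
The two layouts can also be illustrated with a small Python sketch, where flat lists stand in for contiguous memory. The variable names are illustrative only and do not correspond to anything in the library.

```python
# Three 2D points: (1, 10), (2, 20), (3, 30)
points = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
ndim = len(points[0])

# AoS (array of structures): the coordinates of each point are adjacent.
aos = [value for point in points for value in point]

# SoA (structure of arrays): all values of one dimension are adjacent.
soa = [point[d] for d in range(ndim) for point in points]

print(aos)  # [1.0, 10.0, 2.0, 20.0, 3.0, 30.0]
print(soa)  # [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
```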

If the data is already contained in external containers, these can be passed to the clue::PointsHost constructor as pointers, std::spans or any container satisfying the std::contiguous_range concept. The data can be passed through either two or four buffers:

  • two buffers: the first one should contain all the coordinates and the weights, and the second one the results of the clustering, i.e. the cluster indexes and the is_seed map.
  • four buffers: the first buffer should contain the coordinates, the second the weights and the last two should contain the results.

Passing data in the Python interface

As mentioned above, the CLUEstering.clusterer.read_data method takes input data in five different formats:

  • for Python dictionaries, pandas DataFrames and CSV files, the same naming conventions for the data members apply
  • for Python lists and Numpy arrays, the coordinates can be passed in either AoS or SoA format; AoS data is automatically converted to SoA before the C++ module is called. In both formats, the weights should be passed as a separate nested array.
    # The same applies to Numpy arrays
    data_aos = [[x0, y0, z0], [x1, y1, z1], [x2, y2, z2], [w0, w1, w2]]
    data_soa = [[[x0, x1, x2], [y0, y1, y2], [z0, z1, z2]], [w0, w1, w2]]
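
To make the two nested-list conventions concrete, here is a minimal sketch of a conversion from the AoS form to the SoA form shown above. The function aos_to_soa is hypothetical and only mirrors the shapes of data_aos and data_soa; it is not the conversion code used inside the library.

```python
def aos_to_soa(data_aos):
    """Convert [[p0 coords], [p1 coords], ..., weights]
    into [[dim0 values, dim1 values, ...], weights].

    Hypothetical helper, not part of the CLUEstering API.
    """
    # The last nested list holds the weights, the others hold one point each.
    *points, weights = data_aos
    if len(points) != len(weights):
        raise ValueError("expected one weight per point")
    # Transpose: group the values of each dimension together.
    coords = [list(dim) for dim in zip(*points)]
    return [coords, list(weights)]


# Two 3D points followed by their weights.
data_aos = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [0.5, 0.7]]
data_soa = aos_to_soa(data_aos)
print(data_soa)  # [[[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]], [0.5, 0.7]]
```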