Clusterer

class CLUEstering.clusterer(dc: float, rhoc: float, dm: float | None = None, seed_dc: float | None = None, ppbin: int = 128)

Bases: object

Wrapper class for performing clustering using the CLUE algorithm.

Parameters:
dc : float

Spatial parameter controlling the region for local density calculation.

rhoc : float

Density threshold separating seeds from outliers.

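dm : float or None

Spatial parameter controlling the region for follower search. Defaults to dc if None.

seed_dc : float or None

Spatial parameter controlling the region for seed search. Defaults to dc if None.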

ppbin : int

Average number of points per tile.

Attributes:
kernel : clue_kernels.Algo.kernel

Kernel used to calculate local density.

clust_data : ClusteringDataSoA

Container for input data.

clust_prop : cluster_properties

Container for clustering results.

elapsed_time : float

Execution time of the algorithm in nanoseconds.
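To make the roles of dc, rhoc and dm concrete, here is a toy pure-Python sketch of a CLUE-style classification of points into seeds, followers and outliers. It is not the library's implementation: the helper name classify_points, the flat density without a kernel factor, the brute-force neighbour search, and the use of dm for both the seed and the outlier decision are assumptions made for illustration.

```python
import math

def classify_points(points, weights, dc, rhoc, dm=None):
    """Toy CLUE-style classification (illustrative only, O(n^2))."""
    dm = dc if dm is None else dm  # dm falls back to dc, as documented
    n = len(points)
    # Local density: sum of the weights of all points within dc (self included).
    rho = [
        sum(w for q, w in zip(points, weights) if math.dist(p, q) <= dc)
        for p in points
    ]
    roles = []
    for i in range(n):
        # Distance to the nearest point with strictly higher density.
        delta = min(
            (math.dist(points[i], points[j]) for j in range(n) if rho[j] > rho[i]),
            default=math.inf,
        )
        if rho[i] >= rhoc and delta > dm:
            roles.append("seed")      # dense and isolated from denser points
        elif rho[i] < rhoc and delta > dm:
            roles.append("outlier")   # sparse and far from anything denser
        else:
            roles.append("follower")  # attaches to its nearest denser point
    return rho, roles

points = [(0.0, 0.0), (0.3, 0.0), (0.6, 0.0), (5.0, 5.0)]
weights = [1.0, 1.0, 1.0, 1.0]
rho, roles = classify_points(points, weights, dc=0.4, rhoc=2.5)
print(rho)    # [2.0, 3.0, 2.0, 1.0]
print(roles)  # ['follower', 'seed', 'follower', 'outlier']
```

In this sketch, raising rhoc turns weak seeds into outliers or followers, while increasing dm lets followers attach from farther away.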

set_params(dc: float, rhoc: float, dm: float | None = None, seed_dc: float | None = None, ppbin: int = 128) → None

Set parameters for the clustering algorithm.

Parameters:
dc : float

Spatial parameter for density calculation.

rhoc : float

Density threshold.

dm : float or None

Follower search region. Defaults to dc if None.

seed_dc : float or None

Seed search region. Defaults to dc if None.

ppbin : int

Average points per tile.
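The fallback rule for dm and seed_dc can be sketched as a small helper. The function name resolve_params and the returned dictionary are hypothetical; only the defaulting behaviour is taken from the parameter descriptions above.

```python
def resolve_params(dc, rhoc, dm=None, seed_dc=None, ppbin=128):
    """Return the effective parameter set (hypothetical helper, not a library function)."""
    return {
        "dc": dc,
        "rhoc": rhoc,
        "dm": dc if dm is None else dm,                 # follower search region
        "seed_dc": dc if seed_dc is None else seed_dc,  # seed search region
        "ppbin": ppbin,
    }

defaults = resolve_params(dc=1.5, rhoc=10.0)
custom = resolve_params(dc=1.5, rhoc=10.0, dm=3.0)
print(defaults["dm"], defaults["seed_dc"])  # 1.5 1.5
print(custom["dm"], custom["seed_dc"])      # 3.0 1.5
```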

read_data(input_data: DataFrame | str | dict | list | ndarray, wrapped_coords: list | ndarray | None = None) → None

Read input data and initialize clustering-related attributes.

Parameters:
input_data : Union[pd.DataFrame, str, dict, list, np.ndarray]

Data to read. Can be one of:

- pandas DataFrame: must contain one column per coordinate plus one column for weight.
- string: path to a CSV file containing the data.
- dict: dictionary with coordinates and weights.
- list or ndarray: list of coordinate lists plus a weight list.

wrapped_coords : list, np.ndarray or None

List or array indicating which dimensions are periodic.

Raises:

ValueError – If the data format is not supported.

Returns:

None
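As an illustration of the dict and list layouts, the sketch below builds the same three 2D points in both forms. The key names x0, x1 and weight, the reading of the list layout as coordinate columns followed by a weight column, and the helper check_layout are all assumptions; check the library for the exact names and shapes it expects.

```python
# Three points in 2D, stored column-wise, each with a weight.
x0 = [0.0, 1.0, 2.0]
x1 = [0.5, 1.5, 2.5]
weight = [1.0, 1.0, 2.0]

# dict layout: one entry per coordinate plus one for the weights.
# The key names are assumed for illustration.
data_dict = {"x0": x0, "x1": x1, "weight": weight}

# list layout: the coordinate lists followed by the weight list.
data_list = [x0, x1, weight]

def check_layout(columns):
    """Check that all columns have equal length; return (n_dim, n_points)."""
    lengths = {len(c) for c in columns}
    if len(lengths) != 1:
        raise ValueError("coordinate and weight columns must have the same length")
    # The last column holds the weights, so it does not count as a dimension.
    return len(columns) - 1, lengths.pop()

n_dim, n_points = check_layout(data_list)
print(n_dim, n_points)  # 2 3
```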

set_wrapped(wrapped_coords: list | ndarray) → None

Set which coordinates are periodic (wrapped).

Parameters:
wrapped_coords : list or np.ndarray

List or array indicating which dimensions are periodic.

Returns:

None
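Periodic coordinates matter because two points on opposite sides of the boundary may in fact be close. The sketch below shows the idea with a per-dimension flag list and a wrapped Euclidean distance. The flag format (1 for periodic, 0 otherwise), the period of 2π, and the helper wrapped_distance are assumptions for illustration, not the library's internals.

```python
import math

def wrapped_distance(p, q, wrapped, period=2 * math.pi):
    """Euclidean distance in which flagged dimensions wrap around with the given period."""
    total = 0.0
    for a, b, is_wrapped in zip(p, q, wrapped):
        d = abs(a - b)
        if is_wrapped:
            d = d % period
            d = min(d, period - d)  # take the shorter way around the circle
        total += d * d
    return math.sqrt(total)

# Second coordinate is an angle in [-pi, pi]; the two points sit on either side of the seam.
p = (0.0, math.pi - 0.1)
q = (0.0, -math.pi + 0.1)

plain = wrapped_distance(p, q, wrapped=[0, 0])     # about 6.08, ignores periodicity
periodic = wrapped_distance(p, q, wrapped=[0, 1])  # about 0.2, wraps the angle
```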

choose_kernel(choice: str, parameters: list | None = None, function: types.LambdaType = <function clusterer.<lambda>>) → None

Set the kernel for local density calculation.

The default kernel is a flat kernel with parameter 0.5.

Parameters:
choice : str

Kernel type to use. Options are: ‘flat’, ‘exp’, ‘gaus’, or ‘custom’.

parameters : list or None

Parameters for the kernel. Required for ‘flat’, ‘exp’, ‘gaus’. Not required for ‘custom’.

function : function, optional

Function to use for a custom kernel.

Raises:

ValueError – If the number of parameters is invalid or the kernel choice is invalid.

Returns:

None
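A kernel maps the distance between two points to the factor by which a neighbour's weight contributes to the local density. The sketch below shows one plausible shape for the four choices. The functional forms, the parameter order, the parameter counts for 'exp' and 'gaus', and the helper name make_kernel are assumptions; only the single parameter of the flat kernel and the ValueError cases come from the text above.

```python
import math

def make_kernel(choice, parameters=None, function=None):
    """Return a callable kernel(dist) -> factor (illustrative, not the library's code)."""
    expected = {"flat": 1, "exp": 2, "gaus": 3}  # assumed parameter counts
    if choice == "custom":
        if function is None:
            raise ValueError("a custom kernel needs a function")
        return function
    if choice not in expected:
        raise ValueError(f"invalid kernel choice: {choice!r}")
    if parameters is None or len(parameters) != expected[choice]:
        raise ValueError(f"kernel {choice!r} expects {expected[choice]} parameter(s)")
    if choice == "flat":
        (value,) = parameters
        return lambda dist: value
    if choice == "exp":
        rate, amplitude = parameters
        return lambda dist: amplitude * math.exp(-rate * dist)
    mean, std, amplitude = parameters
    return lambda dist: amplitude * math.exp(-((dist - mean) ** 2) / (2 * std ** 2))

flat = make_kernel("flat", [0.5])
exp_k = make_kernel("exp", [1.0, 2.0])
gaus = make_kernel("gaus", [0.0, 1.0, 3.0])
custom = make_kernel("custom", function=lambda dist: 1.0 / (1.0 + dist))
```

With a flat kernel every neighbour inside dc contributes the same fraction of its weight, whereas the other shapes let the contribution fall off with distance.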

property coords : ndarray

Return the coordinates of the points used for clustering.

Returns:

Coordinates array.

Return type:

np.ndarray

property weight : ndarray

Return the weights of the points.

Returns:

Weights array.

Return type:

np.ndarray

property n_dim : int

Return the number of dimensions.

Returns:

Number of dimensions.

Return type:

int

property n_points : int

Return the number of points in the dataset.

Returns:

Number of points.

Return type:

int

list_devices(backend: str = 'all') → None

List available devices for a given backend.

Parameters:
backend : str, optional

Backend to list devices for. Options: ‘all’, ‘cpu serial’, ‘cpu tbb’, ‘cpu openmp’, ‘gpu cuda’, ‘gpu hip’. Defaults to ‘all’.

Raises:

ValueError – If the backend is not valid.

Returns:

None

run_clue(backend: str = 'cpu serial', block_size: int = 1024, device_id: int = 0, verbose: bool = False, dimensions: list | None = None) → None

Execute the CLUE clustering algorithm.

Parameters:
backend : str, optional

Backend to use for execution. Defaults to ‘cpu serial’.

block_size : int, optional

Size of blocks for parallel execution. Defaults to 1024.

device_id : int, optional

Device ID to run the algorithm on. Defaults to 0.

verbose : bool, optional

If True, prints execution time and number of clusters found.

dimensions : list[int] or None, optional

Optional list of dimensions to consider. Defaults to None.

Returns:

None

fit(data: DataFrame | str | dict | list | ndarray, backend: str = 'cpu serial', block_size: int = 1024, device_id: int = 0, verbose: bool = False, dimensions: list | None = None) → Clusterer

Run the CLUE clustering algorithm on the input data.

Parameters:
data : Union[pd.DataFrame, str, dict, list, np.ndarray]

Input data. Can be a pandas DataFrame, a CSV file path (string), a dictionary with coordinate keys and weight, or a list/array containing coordinates and weights.

backend : str, optional

Backend to use for the algorithm execution.

block_size : int, optional

Block size for parallel execution.

device_id : int, optional

ID of the device to run the algorithm on.

verbose : bool, optional

If True, prints execution information.

dimensions : list or None, optional

List of dimensions to consider. If None, all are used.

Returns:

Returns the clusterer object itself.

Return type:

Clusterer

Raises:

Various exceptions if input data is invalid or clustering fails.

fit_predict(data: DataFrame | str | dict | list | ndarray, backend: str = 'cpu serial', block_size: int = 1024, device_id: int = 0, verbose: bool = False, dimensions: list | None = None) → ndarray

Run the CLUE clustering algorithm and return the cluster labels.

Parameters:
data : Union[pd.DataFrame, str, dict, list, np.ndarray]

Input data. Can be a pandas DataFrame, a CSV file path (string), a dictionary with coordinate keys and weight, or a list/array containing coordinates and weights.

backend : str, optional

Backend to use for the algorithm execution.

block_size : int, optional

Block size for parallel execution.

device_id : int, optional

ID of the device to run the algorithm on.

verbose : bool, optional

If True, prints execution information.

dimensions : list or None, optional

List of dimensions to consider. If None, all are used.

Returns:

Array containing the cluster index for every point.

Return type:

np.ndarray

Raises:

Various exceptions if input data is invalid or clustering fails.
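The following sketch shows how fit and fit_predict might be used together, relying only on the signatures documented here. The wrapper names cluster_labels and count_clusters, the sample data, and the parameter values are hypothetical. The CLUEstering import is placed inside the functions so that the snippet loads even where the package is not installed.

```python
def cluster_labels(data, dc, rhoc, dm=None, backend="cpu serial"):
    """Cluster the data and return one cluster index per point."""
    import CLUEstering as clue  # lazy import: only needed when the function is called

    c = clue.clusterer(dc, rhoc, dm)
    return c.fit_predict(data, backend=backend)

def count_clusters(data, dc, rhoc):
    """Use the fact that fit returns the clusterer itself to chain the call."""
    import CLUEstering as clue

    return clue.clusterer(dc, rhoc).fit(data).n_clusters

# Sample input in the list layout (coordinate lists followed by a weight list);
# the exact layout the library accepts should be checked against read_data.
data = [
    [0.0, 0.1, 0.2, 5.0],  # first coordinate
    [0.0, 0.1, 0.0, 5.0],  # second coordinate
    [1.0, 1.0, 1.0, 1.0],  # weights
]

# With the package installed, a call could look like:
# labels = cluster_labels(data, dc=0.5, rhoc=2.0)
```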

property n_clusters : int

Return the number of clusters found.

Returns:

Number of clusters reconstructed by CLUE.

Return type:

int

property cluster_ids : ndarray

Index of the cluster to which each point belongs.

Returns:

Array mapping each point to its cluster.

Return type:

np.ndarray

property labels : ndarray

Alias for cluster_ids.

Returns:

Array mapping each point to its cluster.

Return type:

np.ndarray

property cluster_points : ndarray

List of points for each cluster.

Returns:

Array of arrays containing point indices per cluster.

Return type:

np.ndarray

property points_per_cluster : ndarray

Number of points in each cluster.

Returns:

Array containing the number of points in each cluster.

Return type:

np.ndarray
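The relationship between cluster_ids, cluster_points and points_per_cluster can be sketched in plain Python. The helper group_by_cluster, the outlier label -1, and the choice to leave outliers out of the groups are assumptions; the text above does not state how the library labels or counts outliers.

```python
from collections import defaultdict

def group_by_cluster(cluster_ids, outlier_label=-1):
    """Derive per-cluster point indices and sizes from a per-point label array."""
    groups = defaultdict(list)
    for point_index, cid in enumerate(cluster_ids):
        if cid != outlier_label:  # assumed: outliers are not part of any cluster
            groups[cid].append(point_index)
    cluster_points = [groups[cid] for cid in sorted(groups)]
    points_per_cluster = [len(members) for members in cluster_points]
    return cluster_points, points_per_cluster

cluster_ids = [0, 0, 1, -1, 1, 0]
cluster_points, points_per_cluster = group_by_cluster(cluster_ids)
print(cluster_points)      # [[0, 1, 5], [2, 4]]
print(points_per_cluster)  # [3, 2]
```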

property output_df : DataFrame

DataFrame containing cluster_ids.

Returns:

Pandas DataFrame with cluster assignments.

Return type:

pd.DataFrame

cluster_centroid(cluster_index: int) → ndarray

Computes the centroid coordinates of a specified cluster.

Parameters:
cluster_index : int

Index of the cluster.

Returns:

Coordinates of the cluster centroid.

Return type:

np.ndarray

Raises:

ValueError – If cluster_index is invalid.

cluster_centroids() → ndarray

Computes the centroids of all clusters.

Returns:

Array of shape (n_clusters-1, n_dim) containing cluster centroids.

Return type:

np.ndarray
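A centroid computation of this kind can be sketched as follows. The sketch assumes an unweighted mean of the member coordinates and an outlier label of -1; whether the library weights the points or how it treats outliers is not stated here. The helper names centroid and all_centroids are hypothetical.

```python
def centroid(coords, cluster_ids, cluster_index):
    """Unweighted mean position of the points assigned to one cluster."""
    members = [p for p, cid in zip(coords, cluster_ids) if cid == cluster_index]
    if not members:
        raise ValueError(f"no points found for cluster {cluster_index}")
    n_dim = len(members[0])
    return [sum(p[d] for p in members) / len(members) for d in range(n_dim)]

def all_centroids(coords, cluster_ids, outlier_label=-1):
    """One centroid per cluster, in increasing cluster index, skipping outliers."""
    indices = sorted(set(cluster_ids) - {outlier_label})
    return [centroid(coords, cluster_ids, i) for i in indices]

coords = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0), (10.0, 10.0), (50.0, 50.0)]
cluster_ids = [0, 0, 0, 1, -1]

c0 = centroid(coords, cluster_ids, 0)
centroids = all_centroids(coords, cluster_ids)
print(c0)         # [1.0, 1.0]
print(centroids)  # [[1.0, 1.0], [10.0, 10.0]]
```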

input_plotter(filepath: str | None = None, plot_title: str = '', title_size: float = 16, x_label: str = 'x', y_label: str = 'y', z_label: str = 'z', label_size: float = 16, pt_size: float = 1, pt_colour: str = 'b', grid: bool = True, grid_style: str = '--', grid_size: float = 0.2, x_ticks=None, y_ticks=None, z_ticks=None, **kwargs) → None

Plots the input points in 1D, 2D, or 3D space.

Parameters:
filepath : str or None

Path to save the plot. If None, the plot is shown interactively.

plot_title : str

Title of the plot.

title_size : float

Font size of the plot title.

x_label : str

Label for the x-axis.

y_label : str

Label for the y-axis.

z_label : str

Label for the z-axis.

label_size : float

Font size for axis labels.

pt_size : float

Size of the points.

pt_colour : str

Colour of the points.

grid : bool

Whether to display a grid.

grid_style : str

Line style of the grid.

grid_size : float

Line width of the grid.

x_ticks : list or None

Custom tick locations for x-axis.

y_ticks : list or None

Custom tick locations for y-axis.

z_ticks : list or None

Custom tick locations for z-axis (only for 3D plots).

kwargs : dict

Optional functions for converting coordinates.

Returns:

None

Return type:

None

cluster_plotter(filepath: str | None = None, plot_title: str = '', title_size: float = 16, x_label: str = 'x', y_label: str = 'y', z_label: str = 'z', label_size: float = 16, outl_size: float = 10, pt_size: float = 10, grid: bool = True, grid_style: str = '--', grid_size: float = 0.2, x_ticks=None, y_ticks=None, z_ticks=None, **kwargs) → None

Plots clusters with different colors and outliers as gray crosses.

Parameters:
filepath : str or None

Path to save the plot. If None, the plot is shown interactively.

plot_title : str

Title of the plot.

title_size : float

Font size of the plot title.

x_label : str

Label for the x-axis.

y_label : str

Label for the y-axis.

z_label : str

Label for the z-axis.

label_size : float

Font size for axis labels.

outl_size : float

Marker size for outliers.

pt_size : float

Marker size for cluster points.

grid : bool

Whether to display a grid.

grid_style : str

Line style of the grid.

grid_size : float

Line width of the grid.

x_ticks : list or None

Custom tick locations for x-axis.

y_ticks : list or None

Custom tick locations for y-axis.

z_ticks : list or None

Custom tick locations for z-axis (only for 3D plots).

kwargs : dict

Optional functions for converting coordinates.

Returns:

None

Return type:

None

to_csv(output_folder: str, file_name: str) → None

Creates a file containing the coordinates of all the points and their cluster_ids.

Parameters:
output_folder : str

Full path to the desired output folder.

file_name : str

Name of the file, with the ‘.csv’ suffix.

Returns:

None

Return type:

None

import_clusterer(input_folder: str, file_name: str) → None

Imports the results of a previous clustering.

Parameters:
input_folder : str

Full path to the folder containing the CSV file.

file_name : str

Name of the file, with the ‘.csv’ suffix.

Raises:

ValueError – If the file does not exist or cannot be read correctly.

Returns:

None

Return type:

None
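The export and import pair amounts to a CSV round trip of coordinates and cluster indices. The sketch below reproduces that idea with the standard library only. The column names, the use of os.path.join to combine folder and file name, and the helpers write_clusters and read_clusters are assumptions; the actual file layout written by to_csv may differ.

```python
import csv
import os
import tempfile

def write_clusters(output_folder, file_name, coords, cluster_ids):
    """Write one row per point: its coordinates followed by its cluster index."""
    path = os.path.join(output_folder, file_name)
    n_dim = len(coords[0])
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([f"x{d}" for d in range(n_dim)] + ["cluster_ids"])
        for point, cid in zip(coords, cluster_ids):
            writer.writerow(list(point) + [cid])
    return path

def read_clusters(input_folder, file_name):
    """Read the file back; raise ValueError if it is missing or empty."""
    path = os.path.join(input_folder, file_name)
    if not os.path.isfile(path):
        raise ValueError(f"file not found: {path}")
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        raise ValueError(f"no data rows in: {path}")
    coord_keys = [k for k in rows[0] if k != "cluster_ids"]
    coords = [tuple(float(r[k]) for k in coord_keys) for r in rows]
    cluster_ids = [int(r["cluster_ids"]) for r in rows]
    return coords, cluster_ids

with tempfile.TemporaryDirectory() as folder:
    write_clusters(folder, "clusters.csv", [(0.0, 1.0), (2.0, 3.0)], [0, -1])
    coords, cluster_ids = read_clusters(folder, "clusters.csv")
```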