Some Notes on Clustering

Available Clustering Options

  • Any clustering method from the scikit-learn clustering module (sklearn.cluster)

  • Shared Nearest Neighbor clustering (implemented by us in the cfl.clustering_methods submodule)

  • Any other clusterer that implements the same interface as a scikit-learn clusterer (see the sketch below)
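
To make the third option concrete, here is a minimal sketch of a drop-in clusterer. The only requirement shown is the scikit-learn-style fit_predict method; exactly how cfl consumes the object is an assumption here, so check the cfl documentation for the precise interface.

```python
import numpy as np
from sklearn.cluster import KMeans


class ThresholdClusterer:
    """Toy clusterer exposing the scikit-learn fit_predict interface."""

    def fit_predict(self, X, y=None):
        # Assign each sample to cluster 0 or 1 by the sign of its mean.
        return (np.asarray(X).mean(axis=1) > 0).astype(int)


X = np.random.randn(100, 5)
print(KMeans(n_clusters=3, n_init=10).fit_predict(X))  # built-in scikit-learn clusterer
print(ThresholdClusterer().fit_predict(X))             # custom object with the same interface
```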

Recommendations for Choosing a Clusterer

DBSCAN and KMeans are the two clustering methods we’ve worked with the most, so unless you have a specific reason to choose another, we recommend sticking with one of those.

KMeans

  • Advantages: only one parameter to tune, and the meaning of that parameter (the number of clusters) is fairly intuitive.

  • Potential disadvantages: can only detect globular clusters, and forces the user to choose the number of clusters (a goal of CFL is to detect the number of macrovariables without supervision).
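
A minimal KMeans sketch on placeholder data (the data and parameter values here are illustrative, not a cfl pipeline):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.randn(200, 2)                               # placeholder data
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)  # n_clusters is the one key parameter
labels = kmeans.fit_predict(X)                            # one integer cluster label per sample
print(np.bincount(labels))                                # cluster sizes
```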

DBSCAN

  • Advantages: does not force you to pre-define the number of clusters, can detect clusters of any shape, and you can often get away with tuning only one parameter (eps).

  • Disadvantages: has two parameters (eps and min_samples), even though eps is more important to tune; can be tricky to tune well; may not correctly distinguish two clusters if they overlap.
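
The same sketch with DBSCAN, again on placeholder data; note that the number of clusters falls out of eps and min_samples rather than being specified up front:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.randn(200, 2)                              # placeholder data
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)   # label -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, np.sum(labels == -1))                  # clusters found, noise points
```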

Shared Nearest Neighbor

  • A derivative of DBSCAN designed to perform well on high-dimensional data.
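
To illustrate the idea (this is a rough from-scratch sketch of the general technique, not the cfl implementation): the similarity between two points is measured by how many of their k nearest neighbors they share, which tends to be more informative than raw distances in high dimensions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

X = np.random.randn(200, 10)  # placeholder high-dimensional data
k = 15

# k-nearest-neighbor set for every point (each point's set includes itself)
nn = NearestNeighbors(n_neighbors=k).fit(X)
_, neighbor_idx = nn.kneighbors(X)
neighbor_sets = [set(row) for row in neighbor_idx]

# shared-neighbor distance: 1 - (fraction of the k neighbors that are shared)
n = len(X)
dist = np.ones((n, n))
for i in range(n):
    for j in range(i, n):
        shared = len(neighbor_sets[i] & neighbor_sets[j]) / k
        dist[i, j] = dist[j, i] = 1.0 - shared

# run DBSCAN on the precomputed shared-neighbor distances
labels = DBSCAN(eps=0.5, min_samples=5, metric="precomputed").fit_predict(dist)
```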

Optuna is one of several tools you can use to select clustering hyperparameters automatically; a sketch of one way to use it follows.
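
For example, a minimal sketch that tunes DBSCAN’s eps with Optuna, assuming the silhouette score as the selection criterion (any internal clustering metric could be substituted):

```python
import numpy as np
import optuna
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

X = np.random.randn(300, 5)  # placeholder data


def objective(trial):
    # Search eps on a log scale; min_samples could be tuned the same way.
    eps = trial.suggest_float("eps", 0.05, 5.0, log=True)
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    # silhouette_score needs at least 2 clusters (ignoring noise points)
    mask = labels != -1
    if len(set(labels[mask])) < 2:
        return -1.0
    return silhouette_score(X[mask], labels[mask])


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```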