# Some Notes on Clustering

## Available Clustering Options 

- Any clustering method that comes from the [Scikit-learn clustering module](https://scikit-learn.org/stable/modules/clustering.html) 
- Shared Nearest Neighbor clustering (implemented by us inside of the `cfl.clustering_methods` submodule) 
- Any other module that uses the same interface as a Scikit-learn clusterer

## Recommendations for Choosing a Clusterer 

DBSCAN and KMeans are the two clustering methods we've worked with the most, so unless you have a reason to choose another, maybe stick with one of those.

KMeans
    - Advantages: only one parameter to tune, meaning of parameter (# of clusters) is pretty intuitive 
    - Potential Disadvantages: can only detect globular clusters, forces the user to choose the number of clusters (a goal of CFL is to detect number of macrovariables without supervision)

DBSCAN 
    - Advantages: does not force you to pre-define number of clusters, can detect clusters of any shape, you can maybe get away with only tuning one parameter (eps)
    - Disadvantages: has two  parameters (eps and min_samples) (even though eps
    is more important to tune than min_samples), can be tricky to tune well, may
    not correctly distinguish two clusters if there is overlap between the
    clusters 
    
Shared Nearest Neighbor 
    - a derivative of DBSCAN designed to perform well on high dimensional data


[Optuna](https://optuna.org/) is one of various tools that you can use to select
clustering hyperparameters.