cfl.clustering package
Submodules
cfl.clustering.Y_given_Xmacro module
This module approximates the probability of each value of Y given each cause macrostate. Instead of learning the complete density, for each y_i in a dataset it computes the distance from y_i to its closest k neighbors in each cause macrostate. Because all x_j in a given macrostate have the same effect on Y by construction, this approach reduces the number of X values over which the density needs to be computed.
- cfl.clustering.Y_given_Xmacro._categorical_Y(Y_data, x_lbls, precompute_distances=True)
Estimates the conditional probability density P(Y=y|Xmacrostate) for categorical data, where ‘y’ is an observation in Y_data and Xmacrostate is a macrovariable state constructed from X_data, the “causal” data set. This function should only be used when Y_data contains categorical variables. This function normalizes the final probabilities learned for each Xmacrostate.
- Parameters:
Y_data (np.ndarray) – the “effects” data set, the observations in which are to be clustered
x_lbls (np.ndarray) – a 1D array (same length/aligned with Y_data) of the CFL labels predicted for the x (cause) data
precompute_distances (boolean) – when True, distances between all samples will be precomputed. This will significantly speed up this function, but uses considerable space for larger datasets.
- Returns:
an array with a row for each observation in Y_data and a column for each class in x_lbls. The entries of the array contain the conditional probability P(y|x) for the corresponding y value, given that the x is a member of the corresponding class of that column
- Return type:
np.ndarray
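To illustrate the output shape described above, the following sketch estimates P(Y=y|Xmacrostate) for a single categorical variable by empirical frequency. It is a minimal sketch under stated assumptions, not the cfl implementation: the name `categorical_y_given_xmacro` is hypothetical, Y_data is assumed to be a 1D array of category labels, and any further normalization the library applies is not reproduced.

```python
import numpy as np

def categorical_y_given_xmacro(Y_data, x_lbls):
    """Hypothetical helper: empirical P(Y=y_i | Xmacrostate) per observation."""
    Y_data = np.asarray(Y_data)
    x_lbls = np.asarray(x_lbls)
    macrostates = np.unique(x_lbls)           # sorted X macrostate labels
    probs = np.zeros((len(Y_data), len(macrostates)))
    for col, state in enumerate(macrostates):
        y_in_state = Y_data[x_lbls == state]  # Y values observed in this macrostate
        for row, y in enumerate(Y_data):
            # fraction of samples in this macrostate whose Y equals y
            probs[row, col] = np.mean(y_in_state == y)
    return probs

Y_data = np.array([0, 0, 1, 1, 1, 0])
x_lbls = np.array([0, 0, 0, 1, 1, 1])
probs = categorical_y_given_xmacro(Y_data, x_lbls)
```

Each row of `probs` corresponds to one observation in Y_data and each column to one X macrostate, matching the return value documented above.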
- cfl.clustering.Y_given_Xmacro._continuous_Y(Y_data, x_lbls, precompute_distances=True)
Estimates the conditional probability density P(Y=y|Xmacrostate) for every y (observation in Y_data) and Xmacrostate (macrovariable constructed from X_data, the “causal” data set) when Y_data contains variable(s) over a continuous distribution.
This function approximates the probability density P(Y=y_1) by using the density of points around y_1 as a proxy, where density is determined by the average distance from y_1 to its k nearest neighbors (small distance = high density, large distance = low density). This function normalizes the final probabilities learned for each Xmacrostate.
- Pseudocode:
use sklearn’s euclidean_distances function to precompute distances between all pairs of points in Y_data
separate these distances out by X macrostate
sort these distances
for each X macrostate, the steps so far give us a matrix of sorted distances from each point in Y_data to each point in the X macrostate
now we can go through each point in Y_data, pull the first k columns of distances for each X macrostate matrix, and take the average. This gives us the average of the closest k distances in each X macrostate
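The pseudocode steps above can be sketched as follows. This is an illustrative sketch rather than the library's code: the name `avg_knn_distance_by_macrostate` and the parameter `k` are assumptions, pairwise distances are computed with NumPy broadcasting in place of sklearn's euclidean_distances, a point's zero distance to itself is counted as a neighbor, and the function returns the raw average distances (the density proxy) before any conversion to normalized probabilities.

```python
import numpy as np

def avg_knn_distance_by_macrostate(Y_data, x_lbls, k=4):
    """Hypothetical helper: mean distance from each y_i to its k nearest
    neighbors within each X macrostate (small distance = high density)."""
    Y_data = np.asarray(Y_data, dtype=float)
    x_lbls = np.asarray(x_lbls)
    # 1. precompute distances between all pairs of points in Y_data
    diffs = Y_data[:, None, :] - Y_data[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    macrostates = np.unique(x_lbls)
    out = np.zeros((Y_data.shape[0], len(macrostates)))
    for col, state in enumerate(macrostates):
        # 2. keep only the columns for points in this X macrostate
        state_dists = dists[:, x_lbls == state]
        # 3. sort each row so the nearest neighbors come first
        state_dists = np.sort(state_dists, axis=1)
        # 4. average the first k columns (fewer if the macrostate is small)
        out[:, col] = state_dists[:, :k].mean(axis=1)
    return out

Y_data = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
x_lbls = np.array([0, 0, 0, 1, 1, 1])
avg_dists = avg_knn_distance_by_macrostate(Y_data, x_lbls, k=2)
```

In this toy data, points near 0 have a small average distance to the first macrostate and a large one to the second, which is the pattern the density proxy relies on. The all-pairs distance array grows quadratically with the number of samples, which is the memory cost noted for precompute_distances.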
- Parameters:
Y_data (np.ndarray) – the “effects” data set, the observations in which are to be clustered
x_lbls (np.ndarray) – a 1D array (same length/aligned with Y_data) of the CFL labels predicted for the X (cause) data
precompute_distances (boolean) – when True, distances between all samples will be precomputed. This will significantly speed up this function, but uses considerable space for larger datasets.
- Returns:
a 2D array with a row for each observation in Y_data and a column for each macrostate in x_lbls. The entries of the array contain the conditional probability P(y|x) for the corresponding y value, given that the x is a member of the corresponding macrostate of that column.
- Return type:
np.ndarray
Note
Why is P(y|Xmacrostate) calculated, instead of P(y|x) for each individual x? The clusters of x created immediately prior to this step are observational macrostates of X (see “Causal Feature Learning: An Overview” by Chalupka, Eberhardt, and Perona, 2017). Observational macrostates are a type of equivalence class defined by the relationship P(y|x_1)=P(y|x_2) for any x_1, x_2 in the same macrostate. In theory, then, checking each x observation individually is redundant, since every x in the same cluster has the same effect on the conditional probability of y. Working at the macrostate level also significantly reduces the amount of computation required.
- cfl.clustering.Y_given_Xmacro.sample_Y_dist(Y_type, dataset, x_lbls, precompute_distances=True)
Finds (a proxy of) P(Y=y | Xmacrostate) for all Y=y. This function uses the data type of the variable(s) in Y to select the correct method for sampling P(Y=y |X=Xmacrostate). This function is used by EffectClusterer for partitioning the effect space.
- Parameters:
Y_type (str) – type of data provided. Valid values: ‘continuous’, ‘categorical’
dataset (Dataset) – Dataset object containing X and Y data
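x_lbls (np.ndarray) – Cluster assignments for X data
precompute_distances (boolean) – when True, distances between all samples will be precomputed. This will significantly speed up this function, but uses considerable space for larger datasets.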
- Returns:
array with P(Y=y|Xmacrostate) distribution (aligned to the Y dataset)
- Return type:
np.ndarray
cfl.clustering.cause_clusterer module
- class cfl.clustering.cause_clusterer.CauseClusterer(data_info, block_params)
Bases:
Block
This class uses clustering to form the observational partition that CFL is trying to identify over the cause space. It trains a user-defined clustering model to cluster datapoints based on P(Y|X=x) (usually learned by a CondDensityEstimator). Once the model is trained, it can then be used to assign new datapoints to the clusters found.
- block_params
a set of parameters specifying a clusterer. The ‘model’ key must be specified and can either be the name of an sklearn.cluster model, or a clusterer model object that follows the cfl.clustering.ClustererModel interface. If the former, additional keys may be specified as parameters to the sklearn object.
- Type:
dict
- model
clusterer object to partition cause data
- Type:
sklearn.cluster or cfl.clustering.ClustererModel
- data_info
dictionary with the keys ‘X_dims’, ‘Y_dims’, and ‘Y_type’ (whether the y data is categorical or continuous)
- Type:
dict
- name
name of the model so that the model type can be recovered from saved parameters (str)
- trained
boolean tracking whether self.model has been trained yet
- Type:
bool
- _create_model()
given self.block_params, build the clustering model
- get_block_params()
return self.block_params
- _get_default_block_params()
return values for block_params to default to if unspecified
- train()
fit a model with P(Y|X=x) found by CDE
- predict()
assign new datapoints to clusters found in train
- save_block()
save the state of the object
- load_block()
load the state of the object from a specified file path
Example
    from cfl.clustering.cause_clusterer import CauseClusterer
    from cfl.dataset import Dataset

    X = <cause data>
    Y = <effect data>
    prev_results = <put CDE results here>
    data = Dataset(X, Y)

    # syntax 1: name an sklearn.cluster model and pass its hyperparameters
    c = CauseClusterer(data_info={'X_dims': X.shape, 'Y_dims': Y.shape,
                                  'Y_type': 'continuous'},
                       block_params={'model': 'DBSCAN',
                                     'model_params': {'eps': 0.3,
                                                      'min_samples': 10}})

    # syntax 2: pass an instantiated custom clusterer
    # MyClusterer should inherit cfl.clustering.ClustererModel
    my_clusterer = MyClusterer(param1=0.1, param2=0.5)
    c = CauseClusterer(data_info={'X_dims': X.shape, 'Y_dims': Y.shape,
                                  'Y_type': 'continuous'},
                       block_params={'model': my_clusterer})

    results = c.train(data, prev_results)
- __init__(data_info, block_params)
Initialize Clusterer object
- Parameters:
data_info (dict) – dict with information about the dataset shape
block_params (dict) – a set of parameters specifying a clusterer. The ‘model’ key must be specified and can either be the name of an sklearn.cluster model, or a clusterer model object that follows the cfl.clustering.ClustererModel interface. Hyperparameters for the model may be specified through the ‘model_params’ dictionary. ‘tune’ may be set to True if you would like to perform hyperparameter tuning.
Returns: None
- _create_model()
Return a clustering model given self.block_params. If self.block_params[‘model’] is a string, it will try to instantiate the sklearn.cluster model with the same name. Otherwise, it will treat the value of self.block_params[‘model’] as the instantiated model.
Arguments: None
- Returns:
the model to partition the cause space with.
- Return type:
sklearn.cluster model or cfl.clustering.ClustererModel
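The string-or-object dispatch described for _create_model could look roughly like the following. This is a sketch under assumptions, not the actual method body: the function name `create_model` is hypothetical, and it assumes hyperparameters are stored under the 'model_params' key as described for __init__.

```python
import sklearn.cluster

def create_model(block_params):
    """Hypothetical stand-in for the dispatch described above."""
    model = block_params['model']
    if isinstance(model, str):
        # look up the sklearn.cluster class with the same name
        model_class = getattr(sklearn.cluster, model)
        return model_class(**block_params.get('model_params', {}))
    # otherwise, treat the value as an already-instantiated model
    return model

dbscan = create_model({'model': 'DBSCAN',
                       'model_params': {'eps': 0.3, 'min_samples': 10}})
custom = object()  # placeholder for a user-defined clusterer instance
same = create_model({'model': custom})
```

Looking the class up by name keeps block_params serializable as plain strings and numbers, while still allowing a custom object to be passed through untouched.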
- _get_default_block_params()
Private method that specifies default clustering method parameters. Note: clustering method currently defaults to DBSCAN. While DBSCAN is a valid starting method, the choice of clustering method is highly dependent on your dataset. Please do not rely on the defaults without considering your use case.
Arguments: None
- Returns:
dictionary of parameter names (keys) and values (values)
- Return type:
dict
- get_block_params()
Get parameters for this clustering model.
Arguments: None
- Returns:
dictionary of parameter names (keys) and values (values)
- Return type:
dict
- load_block(file_path)
Load clusterer model from path.
- Parameters:
file_path (str) – path to load saved model from
Returns: None
- predict(dataset, prev_results)
Assign new datapoints to clusters found in training.
- Parameters:
dataset (Dataset) – Dataset object containing X, Y and pyx data to assign partition labels to
prev_results (dict) – dictionary that contains a key called ‘pyx’, whose value is an array of probabilities
- Returns:
dictionary of results, containing ‘x_lbls’, a numpy array of class assignments for each sample in dataset.X.
- Return type:
dict
- save_block(file_path)
Save clusterer model to specified path.
- Parameters:
file_path (str) – path to save to
Returns: None
- train(dataset, prev_results)
Train self.model on ‘pyx’ stored in prev_results. Tune hyperparameters if specified.
- Parameters:
dataset (Dataset) – Dataset object containing X, Y to assign partition labels to (not used, here for consistency)
prev_results (dict) – dictionary that contains a key called ‘pyx’, whose value is an array of probabilities
- Returns:
dictionary of results, the most important of which is x_lbls, a numpy array of class assignments for each sample in dataset.X. Also includes ‘tuning_fig’, ‘tuning_errs’, and ‘param_combos’ if self.block_params[‘tune’] is True.
- Return type:
dict
cfl.clustering.cluster_tuning_util module
This module helps tune hyperparameters for CauseClusterer and EffectClusterer Block types. It iterates over combinations of hyperparameter values and computes the error of predicting the values being clustered from the cluster assignments found using the given hyperparameters. It then displays these predictions to the user and prompts for input as to what set of hyperparameter values to move forward with.
- cfl.clustering.cluster_tuning_util._score(true, pred)
Computes the mean squared error between ground truth and prediction.
- Parameters:
true (np.ndarray) – ground truth array of size (n_samples, n_features)
pred (np.ndarray) – predicted array of size (n_samples, n_features)
- Returns:
mean squared error between true and pred
- Return type:
np.float
- cfl.clustering.cluster_tuning_util.compute_predictive_error(Xlr, Ylr, n_iter=100)
Fits a linear model to a randomly selected subset of the data and evaluates this model on the remaining data. This is repeated n_iter times, and the average error over the n_iter runs is returned.
- Parameters:
Xlr (np.ndarray) – array of cluster assignments of size (n_samples,)
Ylr (np.ndarray) – array of original data points used for clustering, of size (n_samples, n_features)
n_iter (int) – number of times to retrain and evaluate model. Defaults to 100.
- Returns:
mean error across n_iter runs
- Return type:
np.float
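A rough sketch of this repeated fit-and-evaluate procedure is shown below. It is not the library's implementation: the name `predictive_error`, the one-hot encoding of cluster labels, the `test_frac` and `seed` parameters, and the use of NumPy least squares as the linear model are all assumptions made for illustration. The error is the mean squared error, as in _score.

```python
import numpy as np

def predictive_error(Xlr, Ylr, n_iter=100, test_frac=0.25, seed=0):
    """Hypothetical sketch: average held-out MSE of a linear model that
    predicts the clustered data from one-hot cluster assignments."""
    rng = np.random.default_rng(seed)
    Xlr = np.asarray(Xlr)
    Ylr = np.asarray(Ylr, dtype=float)
    # one-hot encode cluster labels so each cluster gets its own coefficient
    one_hot = (Xlr[:, None] == np.unique(Xlr)[None, :]).astype(float)
    n_samples = len(Xlr)
    n_test = max(1, int(test_frac * n_samples))
    errs = []
    for _ in range(n_iter):
        perm = rng.permutation(n_samples)
        test_idx, train_idx = perm[:n_test], perm[n_test:]
        # least-squares fit on the training subset
        coef, *_ = np.linalg.lstsq(one_hot[train_idx], Ylr[train_idx], rcond=None)
        pred = one_hot[test_idx] @ coef
        errs.append(np.mean((Ylr[test_idx] - pred) ** 2))
    return float(np.mean(errs))

rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 50)
data = np.where(labels[:, None] == 0, 0.0, 5.0) + 0.1 * rng.standard_normal((100, 2))
good_err = predictive_error(labels, data)                   # labels match the data
bad_err = predictive_error(rng.permutation(labels), data)   # labels shuffled
```

Cluster assignments that track the structure of the data give a much lower error than shuffled assignments, which is why this error can be compared across hyperparameter settings.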
- cfl.clustering.cluster_tuning_util.get_parameter_combinations(param_ranges)
Given a dictionary of parameter ranges, returns a list of all parameter combinations to evaluate.
- Parameters:
param_ranges (dict) – dictionary of parameters, where values are all iterable
- Returns:
list of dictionaries of all parameter combinations
- Return type:
list
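One way to expand a dictionary of parameter ranges into every combination is a Cartesian product, sketched below. The function name `parameter_combinations` is hypothetical and the actual implementation may differ, for example in the order of the returned combinations.

```python
from itertools import product

def parameter_combinations(param_ranges):
    """Hypothetical sketch: expand a dict of iterables into a list of dicts."""
    names = list(param_ranges)
    return [dict(zip(names, values))
            for values in product(*(param_ranges[n] for n in names))]

combos = parameter_combinations({'eps': [0.1, 0.3], 'min_samples': [5, 10]})
```

Two values for each of two parameters yield four dictionaries, each of which can be passed as keyword arguments to a clustering model.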
- cfl.clustering.cluster_tuning_util.get_user_params(suggested_params)
Queries the user for the final hyperparameters to proceed with.
- Parameters:
suggested_params (dict) – parameters to suggest as defaults.
- Returns:
dictionary of hyperparameters specified.
- Return type:
dict
- cfl.clustering.cluster_tuning_util.suggest_elbow_idx(errs)
Uses a heuristic to suggest where an “elbow” occurs in the errors. This currently does not work well and is not used by CFL.
- Parameters:
errs (np.ndarray) – array of error for every parameter combination
- Returns:
index of where elbow occurs in errs list
- Return type:
int
- cfl.clustering.cluster_tuning_util.tune(data_to_cluster, model_name, model_params, user_input)
Manages the tuning process for clustering hyperparameters. This function loops through all parameter combinations as specified by the user, finds the error for predicting the original data clustered from the cluster assignments, shows the user these errors, queries the user for final hyperparameter values to use, and returns these.
- Parameters:
data_to_cluster (np.ndarray) – array of data that is being clustered, of size (n_samples, n_features)
model_name (str) – name of model to instantiate
model_params (dict) – dictionary of hyperparameter values to try, where values are all iterable
user_input (bool) – whether to solicit user input or proceed with automatically identified optimal hyperparameters. This should always be set to True currently, as the automated hyperparameter selection method currently only returns experimental suggestions.
- Returns:
(dict) : chosen parameters to proceed with
(matplotlib.pyplot.Figure) : figure displaying tuning errors
(np.ndarray) : array of errors for each hyperparameter combination
(list) : list of dictionaries of each hyperparameter combination (param_combos)
- Return type:
tuple
- cfl.clustering.cluster_tuning_util.visualize_errors(errs, params_list, params_to_tune)
Visualizes the errors computed for every parameter combination.
- Parameters:
errs (np.ndarray) – array of error for every parameter combination
params_list (list) – list of dicts of all parameter combinations as given by get_parameter_combinations
params_to_tune (dict) – original dict of parameters to iterate over.
- Returns:
figure that is displayed
- Return type:
matplotlib.pyplot.figure
cfl.clustering.clusterer_model module
- class cfl.clustering.clusterer_model.ClustererModel(data_info, model_params)
Bases:
object
This is an abstract class defining the type of model that can be passed into a CauseClusterer or EffectClusterer Block. If you build your own clustering model to pass into CauseClusterer or EffectClusterer, you should inherit ClustererModel to ensure that you have specified all required functionality to properly interface with the CFL pipeline. ClustererModel specifies the following required methods: __init__, fit_predict
Attributes : None
- fit_predict()
fits the clustering model and returns predictions on a set of data.
- abstract __init__(data_info, model_params)
Do any setup required for your model here.
- Parameters:
data_info (dict) – a dictionary containing information about the data that will be passed in. Should contain an ‘X_dims’ key with a tuple value specifying the shape of X, a ‘Y_dims’ key with a tuple value specifying the shape of Y, and a ‘Y_type’ key with a string value specifying whether Y is ‘continuous’ or ‘categorical’.
model_params (dict) – dictionary containing parameters for the model. This is a way for users to specify any modifiable parts of your model.
Returns: None
- abstract fit_predict(pyx)
Assign class labels for all samples by training self.model on pyx. Note that ClustererModels have a fit_predict method instead of separate fit and predict methods because most clustering methods do not handle prediction on new samples without re-fitting the model. TODO: handle both fit, predict and fit_predict in the future.
- Parameters:
pyx (np.ndarray) – an (n_samples, ?) sized array of P(Y|X=x) estimates for all n_samples values of X in our dataset.
- Returns:
an (n_samples,) sized array of class assignments for all samples in dataset.
- Return type:
np.ndarray
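As an illustration of the interface above, here is a toy clusterer exposing the two required methods. It is a sketch only: `RoundingClusterer` and its 'decimals' hyperparameter are invented for this example, the rounding rule is not a recommended clustering method, and in real use the class would inherit from cfl.clustering.ClustererModel (omitted here so the snippet runs without cfl installed).

```python
import numpy as np

class RoundingClusterer:
    """Toy clusterer with the __init__/fit_predict shape described above.
    In real use this would inherit from cfl.clustering.ClustererModel."""

    def __init__(self, data_info, model_params):
        self.data_info = data_info
        # 'decimals' is a made-up hyperparameter for this illustration
        self.decimals = model_params.get('decimals', 1)

    def fit_predict(self, pyx):
        # samples whose rounded P(Y|X=x) rows match share a cluster label
        rounded = np.round(np.asarray(pyx, dtype=float), self.decimals)
        _, labels = np.unique(rounded, axis=0, return_inverse=True)
        return labels.reshape(-1)

pyx = np.array([[0.11, 0.89], [0.12, 0.88], [0.91, 0.09], [0.90, 0.10]])
model = RoundingClusterer(data_info={'X_dims': (4, 3), 'Y_dims': (4, 2),
                                     'Y_type': 'continuous'},
                          model_params={'decimals': 1})
x_lbls = model.fit_predict(pyx)
```

The returned array has one label per row of pyx, which is the (n_samples,) shape the pipeline expects from fit_predict.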
cfl.clustering.effect_clusterer module
- class cfl.clustering.effect_clusterer.EffectClusterer(data_info, block_params)
Bases:
Block
This class uses clustering to form the observational partition that CFL is trying to identify over the effect space. It trains a user-defined clustering model to cluster datapoints based on a proxy for P(Y=y|X) (more information on this proxy can be found in the helper file cfl/clustering/Y_given_Xmacro.py). Once this model is trained, it can then be used to assign new datapoints to the clusters found.
- block_params
a set of parameters specifying a clusterer. The ‘model’ key must be specified and can either be the name of an sklearn.cluster model, or a clusterer model object that follows the scikit-learn interface. If the former, additional keys may be specified as parameters to the sklearn object.
- Type:
dict
- model
clusterer object to partition effect data
- Type:
sklearn.cluster or cfl.clustering.ClustererModel
- data_info
dictionary with the keys ‘X_dims’, ‘Y_dims’, and ‘Y_type’ (whether the y data is categorical or continuous)
- Type:
dict
- name
name of the model so that the model type can be recovered from saved parameters (str)
- trained
boolean tracking whether self.model has been trained yet
- Type:
bool
- _create_model()
given self.block_params, build the clustering model
- get_block_params()
return self.block_params
- _get_default_block_params()
return values for block_params to default to if unspecified
- train()
fit a model with the proxy for P(Y=y|X) computed from the cause labels (x_lbls)
- predict()
assign new datapoints to clusters found in train
- save_block()
save the state of the object
- load_block()
load the state of the object from a specified file path
Example
    from cfl.clustering.effect_clusterer import EffectClusterer
    from cfl.dataset import Dataset

    X = <cause data>
    Y = <effect data>
    prev_results = <put CDE results here>
    data = Dataset(X, Y)

    # syntax 1: name an sklearn.cluster model and pass its hyperparameters
    c = EffectClusterer(data_info={'X_dims': X.shape, 'Y_dims': Y.shape,
                                   'Y_type': 'continuous'},
                        block_params={'model': 'DBSCAN',
                                      'model_params': {'eps': 0.3,
                                                       'min_samples': 10}})

    # syntax 2: pass an instantiated custom clusterer
    # MyClusterer should inherit cfl.clustering.ClustererModel
    my_clusterer = MyClusterer(param1=0.1, param2=0.5)
    c = EffectClusterer(data_info={'X_dims': X.shape, 'Y_dims': Y.shape,
                                   'Y_type': 'continuous'},
                        block_params={'model': my_clusterer})

    results = c.train(data, prev_results)
- __init__(data_info, block_params)
Initialize Clusterer object
- Parameters:
data_info (dict) – dict with information about the dataset shape
block_params (dict) – a set of parameters specifying a clusterer. The ‘model’ key must be specified and can either be the name of an sklearn.cluster model, or a clusterer model object that follows the cfl.clustering.ClustererModel interface. Hyperparameters for the model may be specified through the ‘model_params’ dictionary. ‘tune’ may be set to True if you would like to perform hyperparameter tuning. ‘precompute_distances’ may also be specified. If true, a pre-caching method will be used that reduces runtime but is more memory-intensive. If false, the original compute-on-the-fly method will be used. (defaults to True)
Returns: None
- _create_model()
Return a clustering model given self.block_params. If self.block_params[‘model’] is a string, it will try to instantiate the sklearn.cluster model with the same name. Otherwise, it will treat the value of self.block_params[‘model’] as the instantiated model.
Arguments: None
- Returns:
the model to partition the effect space with.
- Return type:
sklearn.cluster model or cfl.clustering.ClustererModel
- _get_default_block_params()
Private method that specifies default clustering method parameters. Note: clustering method currently defaults to DBSCAN. While DBSCAN is a valid starting method, the choice of clustering method is highly dependent on your dataset. Please do not rely on the defaults without considering your use case.
Arguments: None
- Returns:
dictionary of parameter names (keys) and values (values)
- Return type:
dict
- get_block_params()
Get parameters for this clustering model.
Arguments: None
- Returns:
dictionary of parameter names (keys) and values (values)
- Return type:
dict
- load_block(file_path)
Load clusterer model from path.
- Parameters:
file_path (str) – path to load saved model from
Returns: None
- predict(dataset, prev_results)
Assign new datapoints to clusters found in training.
- Parameters:
dataset (Dataset) – Dataset object containing X, Y and pyx data to assign partition labels to
prev_results (dict) – dictionary that contains a key called ‘x_lbls’, whose value is an array of labels over the dataset samples.
- Returns:
dictionary of results, containing ‘y_lbls’, a numpy array of class assignments for each sample in dataset.Y, as well as ‘y_probs’, the proxy for P(Y=y|X).
- Return type:
dict
- save_block(file_path)
Save clusterer model to specified path.
- Parameters:
file_path (str) – path to save to
Returns: None
- train(dataset, prev_results)
Train self.model on the proxy for P(Y=y|X), computed from ‘x_lbls’ stored in prev_results. Tune hyperparameters if specified.
- Parameters:
dataset (Dataset) – Dataset object containing X, Y data to assign partition labels to (not used, here for consistency)
prev_results (dict) – dictionary that contains a key called ‘x_lbls’, whose value is an array of labels over the dataset samples.
- Returns:
dictionary of results, the most important of which is y_lbls, a numpy array of class assignments for each sample in dataset.Y. ‘y_probs’, the proxy for P(Y=y|X), is also stored (see Y_given_Xmacro.py for computation details). Also includes ‘tuning_fig’, ‘tuning_errs’, and ‘param_combos’ if self.block_params[‘tune’] is True.
- Return type:
dict
cfl.clustering.snn module
This code provides an implementation of Shared Nearest Neighbor (SNN) Clustering for use in the clustering step of CFL.
SNN is a variation of DBSCAN that uses a non-Euclidean distance metric to cluster points. It was developed as an alternative to DBSCAN that performs better at creating clusters across regions with variable densities of points.
We implement it here as a method that may do better in high-dimensional spaces. Clustering methods that use Euclidean distance metrics tend to perform poorly in high-dimensional spaces because the distances between all points become approximately equal as dimensionality increases. Instead of finding nearby points with Euclidean distance, SNN uses an alternative distance metric based on the number of nearest neighbors shared between two points. However, SNN clustering still (in the current implementation) uses Euclidean distance to construct the k-nearest neighbors graph, so this method may also suffer from some of the shortfalls of other clustering methods in high-dimensional space.
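The shared-neighbor idea can be sketched in a few lines of NumPy. This is an illustration, not the code in this module: the name `snn_distance_matrix` is hypothetical, each point is counted among its own nearest neighbors, and the distance is taken to be one minus the fraction of shared neighbors, all of which are assumptions.

```python
import numpy as np

def snn_distance_matrix(X, neighbor_num):
    """Hypothetical sketch: 1 - (shared k-nearest neighbors / k) for all pairs."""
    X = np.asarray(X, dtype=float)
    # Euclidean distances are still used to build the k-nearest-neighbor lists
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # indices of each point's k nearest neighbors (the point itself included)
    knn = np.argsort(dists, axis=1)[:, :neighbor_num]
    n = X.shape[0]
    membership = np.zeros((n, n))
    membership[np.arange(n)[:, None], knn] = 1.0
    # entry (i, j) counts neighbors that appear in both i's and j's lists
    shared = membership @ membership.T
    return 1.0 - shared / neighbor_num

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
snn_dist = snn_distance_matrix(X, neighbor_num=3)
```

Points in the same tight group share all of their neighbors and end up at distance 0, while points in different groups share none and end up at distance 1. A matrix of this kind is what a density-based method such as DBSCAN could then cluster as a precomputed distance matrix.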
This method is also an example of a custom clustering method that can be used for CFL clustering in the exact same way as any other Sklearn clustering method because it follows the same interface.
This code was modified by Jenna Kahn from the implementation in “Shared Nearest Neighbor Clustering Algorithm: Implementation and Evaluation” in the GitHub repository albert-espin/snn-clustering.
Used under the following license:
MIT License
Copyright (c) 2019 Albert Espín
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- class cfl.clustering.snn.SNN(neighbor_num, min_shared_neighbor_proportion, eps)
Bases:
BaseEstimator, ClusterMixin
Class for performing the Shared Nearest Neighbor (SNN) clustering algorithm.
- Parameters:
neighbor_num (int) – K number of neighbors to consider for shared nearest neighbor similarity
min_shared_neighbor_proportion (float [0, 1]) – Proportion of the K nearest neighbors that two data points need to share to be considered part of the same cluster
- self.labels_
[assigned after fitting data] Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1
- self.core_sample_indices_
[assigned after fitting data] Indices of core samples
- self.components_
[assigned after fitting data] Copy of each core sample found by training
Note
Naming conventions for attributes are based on the analogous ones of DBSCAN. Some documentation is copied from the sklearn DBSCAN documentation.
- __init__(neighbor_num, min_shared_neighbor_proportion, eps)
Constructor
- fit(X)
Perform SNN clustering from features or distance matrix.
- Parameters:
X (array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)) – A feature array
- Returns:
the SNN model with self.labels_, self.core_sample_indices_, self.components_ assigned
- Return type:
self
- fit_predict(X, y=None, sample_weight=None)
Performs clustering on X and returns cluster labels.
- Parameters:
X – array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples). A feature array, or array of distances between samples if metric='precomputed'.
sample_weight – array, shape (n_samples,), optional. Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.
y – Ignored. Not used, present here for API consistency by convention.
- Returns:
cluster labels
- Return type:
y (ndarray, shape (n_samples,))
- cfl.clustering.snn.snn(X, neighbor_num, min_shared_neighbor_num, eps)
Perform the Shared Nearest Neighbor (SNN) clustering algorithm.
- Parameters:
X (array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)) – A feature array
neighbor_num (int) – K number of neighbors to consider for shared nearest neighbor similarity
min_shared_neighbor_num (int) – Number of nearest neighbors that two data points need to share to be considered part of the same cluster
eps (float [0, 1]) – parameter for DBSCAN, radius of the neighborhood. Default is the sklearn default
- Returns:
indices of the core points, as determined by DBSCAN
dbscan.labels_ : array of cluster labels for each point
- Return type: