cfl.clustering package

Submodules

cfl.clustering.Y_given_Xmacro module

This module approximates the probability of each value of Y given each cause macrostate. Instead of learning the complete density, for each y_i in a dataset it computes the distance from y_i to its k closest neighbors in each cause macrostate. Because all x_j in a given macrostate have the same effect on Y by construction, this approach reduces the number of X values over which the density must be computed.

cfl.clustering.Y_given_Xmacro._categorical_Y(Y_data, x_lbls, precompute_distances=True)

Estimates the conditional probability density P(Y=y|Xmacrostate) for categorical data, where ‘y’ is an observation in Y_data and Xmacrostate is a macrovariable state constructed from X_data, the “causal” data set. This function should only be used when Y_data contains categorical variables. This function normalizes the final probabilities learned for each Xmacrostate.

Parameters:

Y_data (np.ndarray) – the “effects” data set, the observations in which are to be clustered
x_lbls (np.ndarray) – a 1D array (same length/aligned with Y_data) of the CFL labels predicted for the x (cause) data
precompute_distances (boolean) – when True, distances between all samples will be precomputed. This will significantly speed up this function, but uses considerable space for larger datasets.

Returns:

an array with a row for each observation in Y_data and a column for each class in x_lbls. Each entry contains the conditional probability P(y|x) for the corresponding y value, given that x is a member of the class for that column.

Return type:

np.ndarray
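As a rough illustration of the shape of this output, here is a minimal sketch that assumes the estimate is simply the within-macrostate frequency of each observed Y value. The helper name categorical_y_given_xmacro and the frequency-count approach are assumptions made for this example, not the library's implementation, and the normalization step mentioned above is not reproduced.

```python
import numpy as np

def categorical_y_given_xmacro(Y_data, x_lbls):
    """Frequency-based stand-in for P(Y=y | Xmacrostate) (illustrative only)."""
    Y_data = np.asarray(Y_data)
    x_lbls = np.asarray(x_lbls)
    if Y_data.ndim == 1:
        Y_data = Y_data[:, None]          # treat a 1D Y as a single feature column
    classes = np.unique(x_lbls)
    out = np.zeros((Y_data.shape[0], classes.size))
    for col, c in enumerate(classes):
        Y_in_c = Y_data[x_lbls == c]      # Y rows whose x falls in macrostate c
        # matches[i, j] is True when Y_data[i] equals the j-th row of Y_in_c
        matches = (Y_data[:, None, :] == Y_in_c[None, :, :]).all(axis=2)
        out[:, col] = matches.mean(axis=1)  # fraction of macrostate c equal to y_i
    return out

Y = np.array([0, 0, 1, 1, 1, 0])
x_lbls = np.array([0, 0, 0, 1, 1, 1])
probs = categorical_y_given_xmacro(Y, x_lbls)  # one row per y, one column per macrostate
```

In this toy input, Y=0 occurs in two of the three samples of macrostate 0 and in one of the three samples of macrostate 1, so rows for Y=0 weight the first column more heavily.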

cfl.clustering.Y_given_Xmacro._continuous_Y(Y_data, x_lbls, precompute_distances=True)

Estimates the conditional probability density P(Y=y|Xmacrostate) for every y (observation in Y_data) and Xmacrostate (macrovariable constructed from X_data, the “causal” data set) when Y_data contains variable(s) over a continuous distribution.

This function uses the density of points around y_1 as a proxy for the probability density P(Y=y_1). Density is measured by the average distance from y_1 to its k nearest neighbors (small distance = high density, large distance = low density). This function normalizes the final probabilities learned for each Xmacrostate.

Pseudocode:

use sklearn’s euclidean_distances function to precompute distances between all pairs of points in Y_data
separate these distances out by X macrostate
sort these distances
for each X macrostate, the steps so far give us a matrix of sorted distances from each point in Y_data to each point in the X macrostate
now we can go through each point in Y_data, pull the first k columns of distances for each X macrostate matrix, and take the average. This gives us the average of the closest k distances in each X macrostate
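The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: the function name continuous_y_given_xmacro, the default k, and the choice to keep each point's zero distance to itself are all assumptions, and the final conversion from average distance to normalized probabilities is left out. Pairwise distances are computed with plain NumPy here; sklearn's euclidean_distances would produce the same matrix.

```python
import numpy as np

def continuous_y_given_xmacro(Y_data, x_lbls, k=4):
    """Average distance from each y to its k nearest neighbors in each X macrostate."""
    Y_data = np.asarray(Y_data, dtype=float)
    x_lbls = np.asarray(x_lbls)
    if Y_data.ndim == 1:
        Y_data = Y_data[:, None]
    # Step 1: all pairwise Euclidean distances, shape (n_samples, n_samples)
    diffs = Y_data[:, None, :] - Y_data[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    classes = np.unique(x_lbls)
    avg_knn = np.zeros((Y_data.shape[0], classes.size))
    for col, c in enumerate(classes):
        # Step 2: keep only distances to points in macrostate c
        d_c = dists[:, x_lbls == c]
        # Step 3: sort each row so the nearest neighbors come first
        d_c = np.sort(d_c, axis=1)
        # Steps 4-5: average the first k columns (fewer if the macrostate is small)
        avg_knn[:, col] = d_c[:, :k].mean(axis=1)
    return avg_knn

Y = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
x_lbls = np.array([0, 0, 0, 1, 1, 1])
avg_knn = continuous_y_given_xmacro(Y, x_lbls, k=2)
```

A small entry in avg_knn means the y value sits in a dense region of that macrostate's Y values; a large entry means it sits far from them.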

Parameters:

Y_data (np.ndarray) – the “effects” data set, the observations in which are to be clustered
x_lbls (np.ndarray) – a 1D array (same length/aligned with Y_data) of the CFL labels predicted for the X (cause) data
precompute_distances (boolean) – when True, distances between all samples will be precomputed. This will significantly speed up this function, but uses considerable space for larger datasets.

Returns:

a 2D array with a row for each observation in Y_data and a column for each macrostate in x_lbls. Each entry contains the conditional probability P(y|x) for the corresponding y value, given that x is a member of the macrostate for that column.

Return type:

np.ndarray

Note

Why is P(y|Xmacrostate) calculated, instead of P(y|x) for each individual x? The clusters of x created immediately prior to this step are observational macrostates of X (see “Causal Feature Learning: An Overview” by Chalupka, Eberhardt, and Perona, 2017). Observational macrostates are equivalence classes defined by the relationship P(y|x_1)=P(y|x_2) for any x_1, x_2 in the same macrostate. In theory, then, checking each x observation individually is redundant, since every x in the same cluster has the same effect on the conditional probability of y. Working at the macrostate level also significantly reduces the amount of computation required.

cfl.clustering.Y_given_Xmacro.sample_Y_dist(Y_type, dataset, x_lbls, precompute_distances=True)

Finds (a proxy of) P(Y=y | Xmacrostate) for all Y=y. This function uses the data type of the variable(s) in Y to select the correct method for sampling P(Y=y |X=Xmacrostate). This function is used by EffectClusterer for partitioning the effect space.

Parameters:

Y_type (str) – type of data provided. Valid values: ‘continuous’, ‘categorical’
dataset (Dataset) – Dataset object containing X and Y data
x_lbls (np.ndarray) – Cluster assignments for X data

Returns:

array with P(Y=y |Xmacrostate) distribution (aligned to the Y dataset)

Return type:

np.ndarray

cfl.clustering.cause_clusterer module

class cfl.clustering.cause_clusterer.CauseClusterer(data_info, block_params)

Bases: Block

This class uses clustering to form the observational partition that CFL is trying to identify over the cause space. It trains a user-defined clustering model to cluster datapoints based on P(Y|X=x) (usually learned by a CondDensityEstimator). Once the model is trained, it can then be used to assign new datapoints to the clusters found.

block_params

a set of parameters specifying a clusterer. The ‘model’ key must be specified and can either be the name of an sklearn.cluster model, or a clusterer model object that follows the cfl.clustering.ClustererModel interface. If the former, hyperparameters for the sklearn object may be specified through the ‘model_params’ key.

Type: dict

model

clusterer object to partition cause data

Type: sklearn.cluster or cfl.clustering.ClustererModel

data_info

dictionary with the keys ‘X_dims’, ‘Y_dims’, and ‘Y_type’ (whether the y data is categorical or continuous)

Type: dict

name

name of the model so that the model type can be recovered from saved parameters

Type: str

trained

boolean tracking whether self.model has been trained yet

Type: bool

_create_model(): given self.block_params, build the clustering model

get_block_params(): return self.block_params

_get_default_block_params(): return values for block_params to default to if unspecified

train(): fit a model with P(Y|X=x) found by CDE

predict(): assign new datapoints to clusters found in train

save_block(): save the state of the object

load_block(): load the state of the object from a specified file path

Example

    from cfl.clustering.cause_clusterer import CauseClusterer
    from cfl.dataset import Dataset

    X = <cause data>
    Y = <effect data>
    prev_results = <put CDE results here>
    data = Dataset(X, Y)

    # syntax 1: name an sklearn.cluster model and pass its hyperparameters
    c = CauseClusterer(data_info={'X_dims': X.shape, 'Y_dims': Y.shape,
                                  'Y_type': 'continuous'},
                       block_params={'model': 'DBSCAN',
                                     'model_params': {'eps': 0.3,
                                                      'min_samples': 10}})

    # syntax 2: pass an already-instantiated clusterer
    # MyClusterer should inherit cfl.clustering.ClustererModel
    my_clusterer = MyClusterer(param1=0.1, param2=0.5)
    c = CauseClusterer(data_info={'X_dims': X.shape, 'Y_dims': Y.shape,
                                  'Y_type': 'continuous'},
                       block_params={'model': my_clusterer})

    results = c.train(data, prev_results)

__init__(data_info, block_params)

Initialize Clusterer object

Parameters:

data_info (dict) – dict with information about the dataset shape
block_params (dict) – a set of parameters specifying a clusterer. The ‘model’ key must be specified and can either be the name of an sklearn.cluster model, or a clusterer model object that follows the cfl.clustering.ClustererModel interface. Hyperparameters for the model may be specified through the ‘model_params’ dictionary. ‘tune’ may be set to True if you would like to perform hyperparameter tuning.

Returns: None

_create_model()

Return a clustering model given self.block_params. If self.block_params[‘model’] is a string, it will try to instantiate the sklearn.cluster model with the same name. Otherwise, it will treat the value of self.block_params[‘model’] as the instantiated model.
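The string-or-object dispatch described here can be sketched roughly as below. The function create_model and its namespace argument are hypothetical names introduced for this illustration; the actual method reads its inputs from self.block_params and may handle errors differently.

```python
import importlib

def create_model(model, model_params=None, namespace=None):
    """Return a clusterer: look `model` up by name if it is a string, else pass it through."""
    if not isinstance(model, str):
        return model                      # already an instantiated clusterer
    if namespace is None:
        # imported lazily so the pass-through path does not require scikit-learn
        namespace = importlib.import_module("sklearn.cluster")
    try:
        model_cls = getattr(namespace, model)
    except AttributeError:
        raise ValueError(f"no clustering model named {model!r}") from None
    # hyperparameters are forwarded as keyword arguments to the constructor
    return model_cls(**(model_params or {}))
```

With scikit-learn installed, create_model('DBSCAN', {'eps': 0.3, 'min_samples': 10}) would return an sklearn.cluster.DBSCAN instance configured with those values, while passing an existing clusterer object returns it unchanged.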

Arguments: None

Returns:

the model to partition the cause space with

Return type:

sklearn.cluster model or cfl.clustering.ClustererModel

_get_default_block_params()

Private method that specifies default clustering method parameters. Note: clustering method currently defaults to DBSCAN. While DBSCAN is a valid starting method, the choice of clustering method is highly dependent on your dataset. Please do not rely on the defaults without considering your use case.

Arguments: None

Returns:

dictionary of parameter names (keys) and values (values)

Return type:

dict

get_block_params()

Get parameters for this clustering model.

Arguments: None

Returns:

dictionary of parameter names (keys) and values (values)

Return type:

dict

load_block(file_path)

Load clusterer model from path.

Parameters:

file_path (str) – path to load saved model from

Returns: None

predict(dataset, prev_results)

Assign new datapoints to clusters found in training.

Parameters:

dataset (Dataset) – Dataset object containing X, Y and pyx data to assign partition labels to
prev_results (dict) – dictionary that contains a key called ‘pyx’, whose value is an array of probabilities

Returns:

dictionary of results, containing ‘x_lbls’, a numpy array of class assignments for each sample in dataset.X.

Return type:

dict

save_block(file_path)

Save clusterer model to specified path.

Parameters:

file_path (str) – path to save to

Returns: None

train(dataset, prev_results)

Train self.model on ‘pyx’ stored in prev_results. Tune hyperparameters if specified.

Parameters:

dataset (Dataset) – Dataset object containing X, Y to assign partition labels to (not used, here for consistency)
prev_results (dict) – dictionary that contains a key called ‘pyx’, whose value is an array of probabilities

Returns:

dictionary of results, the most important of which is ‘x_lbls’, a numpy array of class assignments for each sample in dataset.X. Also includes ‘tuning_fig’, ‘tuning_errs’, and ‘param_combos’ if self.block_params[‘tune’] is True.

Return type:

dict

cfl.clustering.cluster_tuning_util module

This module helps tune hyperparameters for CauseClusterer and EffectClusterer Block types. It iterates over combinations of hyperparameter values and computes the error of predicting the values being clustered from the cluster assignments found using the given hyperparameters. It then displays these predictions to the user and prompts for input as to what set of hyperparameter values to move forward with.

cfl.clustering.cluster_tuning_util._score(true, pred)

Computes the mean squared error between ground truth and prediction.

Parameters:

true (np.ndarray) – ground truth array of size (n_samples, n_features)
pred (np.ndarray) – predicted array of size (n_samples, n_features)

Returns:

mean squared error between true and pred

Return type:

np.float

cfl.clustering.cluster_tuning_util.compute_predictive_error(Xlr, Ylr, n_iter=100)

Fits a linear model to a randomly selected subset of the data and evaluates it on the remaining data. This is repeated n_iter times, and the average error over the n_iter runs is returned.

Parameters:

Xlr (np.ndarray) – array of cluster assignments of size (n_samples,)
Ylr (np.ndarray) – array of original data points used for clustering, of size (n_samples, n_features)
n_iter (int) – number of times to retrain and evaluate model. Defaults to 100.

Returns:

mean error across n_iter runs

Return type:

np.float
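A minimal sketch of this repeated split-fit-evaluate procedure, assuming a one-hot encoding of the cluster labels, an ordinary least-squares fit, and a 25% held-out fraction. None of these details are stated above, and the names predictive_error, mse, test_frac, and seed are hypothetical.

```python
import numpy as np

def mse(true, pred):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((np.asarray(true) - np.asarray(pred)) ** 2))

def predictive_error(Xlr, Ylr, n_iter=100, test_frac=0.25, seed=0):
    """Average held-out error of predicting the clustered data from cluster labels."""
    rng = np.random.default_rng(seed)
    Xlr = np.asarray(Xlr)
    Ylr = np.asarray(Ylr, dtype=float)
    # one-hot encode labels so the linear model learns one mean per cluster
    classes = np.unique(Xlr)
    onehot = (Xlr[:, None] == classes[None, :]).astype(float)
    n = Xlr.shape[0]
    n_test = max(1, int(round(test_frac * n)))
    errs = []
    for _ in range(n_iter):
        perm = rng.permutation(n)                 # fresh random split each run
        test, train = perm[:n_test], perm[n_test:]
        coef, *_ = np.linalg.lstsq(onehot[train], Ylr[train], rcond=None)
        errs.append(mse(Ylr[test], onehot[test] @ coef))
    return float(np.mean(errs))

Y = np.vstack([np.zeros((10, 2)), np.ones((10, 2))])
labels_good = np.array([0] * 10 + [1] * 10)   # clusters match the structure in Y
labels_bad = np.array([0, 1] * 10)            # clusters ignore the structure in Y
err_good = predictive_error(labels_good, Y, n_iter=20)
err_bad = predictive_error(labels_bad, Y, n_iter=20)
```

Cluster assignments that capture the structure of the clustered data give a low error, which is what makes this quantity usable for comparing hyperparameter settings.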

cfl.clustering.cluster_tuning_util.get_parameter_combinations(param_ranges)

Given a dictionary of parameter ranges, returns a list of all parameter combinations to evaluate.

Parameters:

param_ranges (dict) – dictionary of parameters, where values are all iterable

Returns:

list of dictionaries of all parameter combinations

Return type:

list
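One way to expand such a dictionary is a Cartesian product over the value lists. The sketch below uses itertools.product; the name parameter_combinations is hypothetical, and the ordering of the returned combinations is an assumption.

```python
from itertools import product

def parameter_combinations(param_ranges):
    """Expand {name: iterable_of_values} into a list of {name: value} dicts."""
    names = list(param_ranges)
    value_lists = [list(param_ranges[name]) for name in names]
    # product yields one tuple per combination, in the same order as `names`
    return [dict(zip(names, values)) for values in product(*value_lists)]

combos = parameter_combinations({"eps": [0.1, 0.3], "min_samples": [5, 10, 20]})
```

Here two values of eps and three values of min_samples give six dictionaries, each of which can be passed to a model constructor as keyword arguments.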

cfl.clustering.cluster_tuning_util.get_user_params(suggested_params)

Queries the user for the final hyperparameters to proceed with.

Parameters:

suggested_params (dict) – parameters to suggest as defaults

Returns:

dictionary of hyperparameters specified

Return type:

dict

cfl.clustering.cluster_tuning_util.suggest_elbow_idx(errs)

Uses a heuristic to suggest where an “elbow” occurs in the errors. This currently does not work well and is not used by CFL.

Parameters:

errs (np.ndarray) – array of error for every parameter combination

Returns:

index of where elbow occurs in errs list

Return type:

int

cfl.clustering.cluster_tuning_util.tune(data_to_cluster, model_name, model_params, user_input)

Manages the tuning process for clustering hyperparameters. This function loops through all parameter combinations as specified by the user, finds the error for predicting the original data clustered from the cluster assignments, shows the user these errors, queries the user for final hyperparameter values to use, and returns these.

Parameters:

data_to_cluster (np.ndarray) – array of data that is being clustered, of size (n_samples, n_features)
model_name (str) – name of model to instantiate
model_params (dict) – dictionary of hyperparameter values to try, where values are all iterable
user_input (bool) – whether to solicit user input or proceed with automatically identified optimal hyperparameters. For now this should always be set to True, as the automated hyperparameter selection method only returns experimental suggestions.

Returns:

(dict) – chosen parameters to proceed with
(matplotlib.pyplot.Figure) – figure displaying tuning errors
(np.ndarray) – array of errors for each hyperparameter combination
(list) – list of dictionaries of each hyperparameter combination

Return type:

tuple

cfl.clustering.cluster_tuning_util.visualize_errors(errs, params_list, params_to_tune)

Visualizes the errors computed for every parameter combination.

Parameters:

errs (np.ndarray) – array of error for every parameter combination
params_list (list) – list of dicts of all parameter combinations as given by get_parameter_combinations
params_to_tune (dict) – original dict of parameters to iterate over.

Returns:

figure that is displayed

Return type:

matplotlib.pyplot.figure

cfl.clustering.clusterer_model module

class cfl.clustering.clusterer_model.ClustererModel(data_info, model_params)

Bases: object

This is an abstract class defining the type of model that can be passed into a CauseClusterer or EffectClusterer Block. If you build your own clustering model to pass into CauseClusterer or EffectClusterer, you should inherit ClustererModel to ensure that you have specified all required functionality to properly interface with the CFL pipeline. ClustererModel specifies the following required methods: __init__, fit_predict

Attributes : None

fit_predict(): fits the clustering model and returns predictions on a set of data.

abstract __init__(data_info, model_params)

Do any setup required for your model here.

Parameters:

data_info (dict) – a dictionary containing information about the data that will be passed in. Should contain an ‘X_dims’ key with a tuple value specifying the shape of X, a ‘Y_dims’ key with a tuple value specifying the shape of Y, and a ‘Y_type’ key with a string value specifying whether Y is ‘continuous’ or ‘categorical’.
model_params (dict) – dictionary containing parameters for the model. This is a way for users to specify any modifiable parts of your model.

Returns: None

abstract fit_predict(pyx)

Assign class labels for all samples by training self.model on pyx. Note that ClustererModels have a fit_predict method instead of separate fit and predict methods because most clustering methods do not handle prediction on new samples without re-fitting the model. TODO: handle both fit, predict and fit_predict in the future.

Parameters:

pyx – an (n_samples, ?) sized array of P(Y|X=x) estimates for all n_samples values of X in our dataset

Returns:

an (n_samples,) sized array of class assignments for all samples in dataset

Return type:

np.ndarray
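To make the interface concrete, here is a minimal sketch of a custom clusterer exposing the two required methods. To stay self-contained it defines a stand-in base class, ClustererModelStandIn, that mirrors the documented signatures; real code would inherit cfl.clustering.ClustererModel instead. ArgmaxClusterer is a hypothetical toy model that labels each sample by the column of pyx with the highest probability, which is not a recommended clustering strategy.

```python
from abc import ABC, abstractmethod
import numpy as np

class ClustererModelStandIn(ABC):
    """Stand-in with the documented method signatures (assumption for this sketch)."""
    @abstractmethod
    def __init__(self, data_info, model_params): ...

    @abstractmethod
    def fit_predict(self, pyx): ...

class ArgmaxClusterer(ClustererModelStandIn):
    """Hypothetical clusterer: label each sample by its most probable Y column."""
    def __init__(self, data_info, model_params):
        self.data_info = data_info          # 'X_dims', 'Y_dims', 'Y_type'
        self.model_params = model_params    # user-tunable settings (unused here)

    def fit_predict(self, pyx):
        pyx = np.asarray(pyx)
        return np.argmax(pyx, axis=1)       # (n_samples,) array of labels

data_info = {'X_dims': (3, 4), 'Y_dims': (3, 2), 'Y_type': 'categorical'}
model = ArgmaxClusterer(data_info, model_params={})
pyx = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
labels = model.fit_predict(pyx)
```

An instance like model could then be supplied as the ‘model’ entry of block_params in place of an sklearn.cluster model name.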

cfl.clustering.effect_clusterer module

class cfl.clustering.effect_clusterer.EffectClusterer(data_info, block_params)

Bases: Block

This class uses clustering to form the observational partition that CFL is trying to identify over the effect space. It trains a user-defined clustering model to cluster datapoints based on a proxy for P(Y=y|X) (more information on this proxy can be found in the helper file cfl/clustering/Y_given_Xmacro.py). Once this model is trained, it can be used to assign new datapoints to the clusters found.

block_params

a set of parameters specifying a clusterer. The ‘model’ key must be specified and can either be the name of an sklearn.cluster model, or a clusterer model object that follows the scikit-learn interface. If the former, hyperparameters for the sklearn object may be specified through the ‘model_params’ key.

Type: dict

model

clusterer object to partition effect data

Type: sklearn.cluster or cfl.clustering.ClustererModel

data_info

dictionary with the keys ‘X_dims’, ‘Y_dims’, and ‘Y_type’ (whether the y data is categorical or continuous)

Type: dict

name

name of the model so that the model type can be recovered from saved parameters

Type: str

trained

boolean tracking whether self.model has been trained yet

Type: bool

_create_model(): given self.block_params, build the clustering model

get_block_params(): return self.block_params

_get_default_block_params(): return values for block_params to default to if unspecified

train(): fit a model with the proxy for P(Y=y|X) computed from the cause cluster labels

predict(): assign new datapoints to clusters found in train

save_block(): save the state of the object

load_block(): load the state of the object from a specified file path

Example

    from cfl.clustering.effect_clusterer import EffectClusterer
    from cfl.dataset import Dataset

    X = <cause data>
    Y = <effect data>
    prev_results = <put CauseClusterer results here>
    data = Dataset(X, Y)

    # syntax 1: name an sklearn.cluster model and pass its hyperparameters
    c = EffectClusterer(data_info={'X_dims': X.shape, 'Y_dims': Y.shape,
                                   'Y_type': 'continuous'},
                        block_params={'model': 'DBSCAN',
                                      'model_params': {'eps': 0.3,
                                                       'min_samples': 10}})

    # syntax 2: pass an already-instantiated clusterer
    # MyClusterer should inherit cfl.clustering.ClustererModel
    my_clusterer = MyClusterer(param1=0.1, param2=0.5)
    c = EffectClusterer(data_info={'X_dims': X.shape, 'Y_dims': Y.shape,
                                   'Y_type': 'continuous'},
                        block_params={'model': my_clusterer})

    results = c.train(data, prev_results)

__init__(data_info, block_params)

Initialize Clusterer object

Parameters:

data_info (dict) – dict with information about the dataset shape
block_params (dict) – a set of parameters specifying a clusterer. The ‘model’ key must be specified and can either be the name of an sklearn.cluster model, or a clusterer model object that follows the cfl.clustering.ClustererModel interface. Hyperparameters for the model may be specified through the ‘model_params’ dictionary. ‘tune’ may be set to True if you would like to perform hyperparameter tuning. ‘precompute_distances’ may also be specified. If true, a pre-caching method will be used that reduces runtime but is more memory-intensive. If false, the original compute-on-the-fly method will be used. (defaults to True)

Returns: None

_create_model()

Return a clustering model given self.block_params. If self.block_params[‘model’] is a string, it will try to instantiate the sklearn.cluster model with the same name. Otherwise, it will treat the value of self.block_params[‘model’] as the instantiated model.

Arguments: None

Returns:

the model to partition the effect space with

Return type:

sklearn.cluster model or cfl.clustering.ClustererModel

_get_default_block_params()

Private method that specifies default clustering method parameters. Note: clustering method currently defaults to DBSCAN. While DBSCAN is a valid starting method, the choice of clustering method is highly dependent on your dataset. Please do not rely on the defaults without considering your use case.

Arguments: None

Returns:

dictionary of parameter names (keys) and values (values)

Return type:

dict

get_block_params()

Get parameters for this clustering model.

Arguments: None

Returns:

dictionary of parameter names (keys) and values (values)

Return type:

dict

load_block(file_path)

Load clusterer model from path.

Parameters:

file_path (str) – path to load saved model from

Returns: None

predict(dataset, prev_results)

Assign new datapoints to clusters found in training.

Parameters:

dataset (Dataset) – Dataset object containing X, Y and pyx data to assign partition labels to
prev_results (dict) – dictionary that contains a key called ‘x_lbls’, whose value is an array of labels over the dataset samples.

Returns:

dictionary of results, containing ‘y_lbls’, a numpy array of class assignments for each sample in dataset.Y, as well as ‘y_probs’, the proxy for P(Y=y|X).

Return type:

dict

save_block(file_path)

Save clusterer model to specified path.

Parameters:

file_path (str) – path to save to

Returns: None

train(dataset, prev_results)

Train self.model on the proxy for P(Y=y|X) computed from the ‘x_lbls’ stored in prev_results. Tune hyperparameters if specified.

Parameters:

dataset (Dataset) – Dataset object containing X, Y data to assign partition labels to (not used, here for consistency)
prev_results (dict) – dictionary that contains a key called ‘x_lbls’, whose value is an array of labels over the dataset samples.

Returns:

dictionary of results, the most important of which is ‘y_lbls’, a numpy array of class assignments for each sample in dataset.Y. ‘y_probs’, the proxy for P(Y=y|X), is also stored (see Y_given_Xmacro.py for computation details). Also includes ‘tuning_fig’, ‘tuning_errs’, and ‘param_combos’ if self.block_params[‘tune’] is True.

Return type:

dict

cfl.clustering.snn module

This code provides an implementation of Shared Nearest Neighbor (SNN) Clustering for use in the clustering step of CFL.

SNN is a variation of DBSCAN that uses a non-Euclidean distance metric to cluster points. It was developed as an alternative to DBSCAN that performs better at creating clusters across regions with variable densities of points.

We implement it here as a method that may do better in high-dimensional spaces. Clustering methods that use Euclidean distance metrics tend to perform poorly in high-dimensional spaces because the distances between all points become approximately equal as dimensionality increases. Instead of finding nearby points with Euclidean distance, SNN uses an alternative distance metric based on the number of nearest neighbors shared between two points. However, SNN clustering (in the current implementation) still uses Euclidean distance to construct the k-nearest neighbors graph, so this method may also suffer from some of the shortfalls of other clustering methods in high-dimensional space.
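The shared-neighbor idea can be illustrated with a short NumPy sketch. It assumes that a point is excluded from its own neighbor set and that the distance is taken as one minus the shared fraction; both are assumptions for this example, as are the function names, and the implementation in this module may differ.

```python
import numpy as np

def shared_neighbor_counts(X, k):
    """counts[i, j] = number of points in both i's and j's k-nearest-neighbor sets."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Euclidean distances are still used to build the k-nearest-neighbor sets
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbor here
    knn_idx = np.argsort(dists, axis=1)[:, :k]
    member = np.zeros((n, n), dtype=int)
    member[np.arange(n)[:, None], knn_idx] = 1   # member[i, j] = 1 if j is in kNN(i)
    return member @ member.T

def snn_distance(X, k):
    """Distance in [0, 1]: 0 when all k neighbors are shared, 1 when none are."""
    return 1.0 - shared_neighbor_counts(X, k) / k

X = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
counts = shared_neighbor_counts(X, k=2)
d = snn_distance(X, k=2)
```

Points in the same tight group share neighbors and end up close under this metric, while points from different groups share none and sit at the maximum distance of 1. A density-based clusterer such as DBSCAN can then be run on the resulting distance matrix.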

This method is also an example of a custom clustering method that can be used for CFL clustering in the exact same way as any other Sklearn clustering method because it follows the same interface.

This code was modified by Jenna Kahn from the implementation in “Shared Nearest Neighbor Clustering Algorithm: Implementation and Evaluation” in the GitHub repository albert-espin/snn-clustering.

Used under the following license:

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

class cfl.clustering.snn.SNN(neighbor_num, min_shared_neighbor_proportion, eps)

Bases: BaseEstimator, ClusterMixin

Class for performing the Shared Nearest Neighbor (SNN) clustering algorithm.

Parameters:

neighbor_num (int) – K number of neighbors to consider for shared nearest neighbor similarity
min_shared_neighbor_proportion (float [0, 1]) – proportion of the K nearest neighbors that two data points need to share to be considered part of the same cluster
eps (float [0, 1]) – parameter for DBSCAN, radius of the neighborhood

self.labels_: [assigned after fitting data] Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1

self.core_sample_indices_: [assigned after fitting data] Indices of core samples

self.components_: [assigned after fitting data] Copy of each core sample found by training

Note

Naming conventions for attributes are based on the analogous ones of DBSCAN. Some documentation is copied from the sklearn DBSCAN documentation.

__init__(neighbor_num, min_shared_neighbor_proportion, eps): Constructor

fit(X)

Perform SNN clustering from features or distance matrix.

Parameters:

X (array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)) – a feature array

Returns:

the SNN model with self.labels_, self.core_sample_indices_, and self.components_ assigned

Return type:

self

fit_predict(X, y=None, sample_weight=None)

Performs clustering on X and returns cluster labels.

Parameters:

X – array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples). A feature array, or array of distances between samples if metric='precomputed'.
sample_weight – array, shape (n_samples,), optional Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.
y – Ignored. Not used, present here for API consistency by convention

Returns:

cluster labels

Return type:

ndarray of shape (n_samples,)

cfl.clustering.snn.snn(X, neighbor_num, min_shared_neighbor_num, eps)

Perform Shared Nearest Neighbor (SNN) clustering.

Parameters:

X (array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)) – a feature array
neighbor_num (int) – K number of neighbors to consider for shared nearest neighbor similarity
min_shared_neighbor_num (int) – number of the K nearest neighbors that two data points need to share to be considered part of the same cluster
eps (float [0, 1]) – parameter for DBSCAN, radius of the neighborhood. Default is the sklearn default

Returns:

dbscan.core_sample_indices_ – indices of the core points, as determined by DBSCAN
dbscan.labels_ – array of cluster labels for each point

Return type:

tuple
