cfl.post_cfl package
Submodules
cfl.post_cfl.intervention_rec module
This module recommends values to intervene on in subsequent experiments, in order to refine the observational partition into a causal partition. It 1) identifies values in high-density regions, where CFL is more certain about its macrostate assignments, and 2) selects a subset of these points that lies far from the macrostate boundaries.
- cfl.post_cfl.intervention_rec._compute_density(pyx)
For each point in pyx, compute a density proxy.
- Parameters:
pyx (np.ndarray) – output of a CDE Block of shape (n_samples, n target features)
- Returns:
- array of density proxies aligned with pyx, of shape (n_samples,)
- Return type:
np.ndarray
- cfl.post_cfl.intervention_rec._discard_boundary_samples(pyx, high_density_mask, cluster_labels, eps=0.5)
Given points of high density, discard points that lie close to a cluster boundary.
- Parameters:
pyx (np.ndarray) – output of a CDE Block of shape (n_samples, n target features)
high_density_mask (np.ndarray) – mask of shape (n_samples,) indicating which samples are considered high-density
cluster_labels (np.ndarray) – array of integer cluster labels aligned with pyx of shape (n_samples,)
eps (float) – a threshold for how close to the macrostate boundary a sample can be. Defaults to 0.5.
- Returns:
- mask of shape (n_samples,) where value of 1 means
that a) a point is considered high-density and b) a point doesn’t lie close to a cluster boundary. 0 otherwise.
- Return type:
np.ndarray
- cfl.post_cfl.intervention_rec._get_high_density_samples(density, cluster_labels, k_samples=None)
Returns the highest density samples per cluster.
- Parameters:
density (np.ndarray) – computed density for each sample in pyx, of shape (n_samples,)
cluster_labels (np.ndarray) – array of integer cluster labels aligned with pyx of shape (n_samples,)
k_samples (int) – number of samples to extract per cluster. If None, returns all cluster members. If greater than number of cluster members, returns all cluster members. Defaults to None. Note: if several points have the same density at the cutoff density value, all will be returned so more than k_samples examples may be returned.
- Returns:
- mask of shape (n_samples,) where value of 1 means
that a point is considered high-density. 0 otherwise.
- Return type:
np.ndarray
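The per-cluster selection and the tie behavior noted for k_samples can be illustrated with a minimal NumPy sketch. The helper name top_k_density_mask is hypothetical and this is not the library's implementation; it assumes that a larger density value means a denser region.

```python
import numpy as np


def top_k_density_mask(density, cluster_labels, k_samples=None):
    """Hypothetical helper: flag the k densest samples in each cluster.

    Assumes larger density values mean denser regions.
    """
    mask = np.zeros(density.shape[0], dtype=int)
    for label in np.unique(cluster_labels):
        members = np.where(cluster_labels == label)[0]
        # Keep the whole cluster if no limit is given or the limit is too large
        if k_samples is None or k_samples >= members.size:
            mask[members] = 1
            continue
        # The k-th largest density in this cluster acts as the cutoff;
        # every member tied at the cutoff is kept as well
        cutoff = np.sort(density[members])[-k_samples]
        mask[members[density[members] >= cutoff]] = 1
    return mask


density = np.array([0.9, 0.5, 0.5, 0.1, 0.8, 0.2])
cluster_labels = np.array([0, 0, 0, 0, 1, 1])
mask = top_k_density_mask(density, cluster_labels, k_samples=2)
print(mask)
```

In this toy input, cluster 0 has two samples tied at the cutoff, so three of its samples are flagged even though k_samples is 2.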
- cfl.post_cfl.intervention_rec._get_recommendations(pyx, cluster_labels, k_samples=100, eps=0.5, visualize=True, exp_path=None, dataset_name=None)
For a set of data points, compute density for each point, extract high density samples, and discard points near cluster boundaries. Plot and return location of resulting subset of points.
- Parameters:
pyx (np.ndarray) – output of a CDE Block of shape (n_samples, n target features)
cluster_labels (np.ndarray) – array of integer cluster labels aligned with pyx of shape (n_samples,)
k_samples (int) – number of samples to extract per cluster. If None, returns all cluster members. If greater than number of cluster members, returns all cluster members. Defaults to 100.
eps (float) – a threshold for how close to the macrostate boundary a sample can be. Defaults to 0.5.
visualize (bool) – whether to visualize samples selected. Defaults to True.
exp_path (str) – path to saved Experiment
dataset_name (str) – name of dataset to load results for. Defaults to None
- Returns:
- mask of shape (n_samples,) where value of 1 means
that a) a point is considered high-density and b) a point doesn’t lie close to a cluster boundary. 0 otherwise.
- Return type:
np.ndarray
- cfl.post_cfl.intervention_rec._plot_results(pyx, hd_mask, final_mask, cluster_labels, exp_path, dataset_name, feature_names=None)
Plot the original distribution of data overlaid with the points recommended for intervention. Will save the figure to: [exp_path]/[dataset_name]/intervention_recs.fig
- Parameters:
pyx (np.ndarray) – output of a CDE Block of shape (n_samples, n target features)
hd_mask (np.ndarray) – mask of shape (n_samples,) where value of 1 means that a point is considered high-density. 0 otherwise.
final_mask (np.ndarray) – mask of shape (n_samples,) where value of 1 means that a) a point is considered high-density and b) a point doesn’t lie close to a cluster boundary. 0 otherwise.
cluster_labels (np.ndarray) – an (n_samples,) array of macrostate assignments
exp_path (str) – path to saved Experiment
dataset_name (str) – name of dataset to load results for.
feature_names (list) – optional list of names of each feature to plot. Defaults to None.
- Returns:
None
- cfl.post_cfl.intervention_rec.get_recommendations(exp, data=None, dataset_name='dataset_train', cause_or_effect='cause', visualize=True, k_samples=100, eps=0.5)
Wrapper that will get recommendations by experiment and dataset name.
- Parameters:
exp (str or cfl.Experiment) – path to experiment or Experiment object
data (None) – not used here, here for consistency
dataset_name (str) – name of dataset to load results for. Defaults to ‘dataset_train’
cause_or_effect (str) – load results for cause or effect partition. Valid values are ‘cause’, ‘effect’. Defaults to ‘cause’.
visualize (bool) – whether to visualize samples selected. Defaults to True.
k_samples (int) – number of samples to extract per cluster. If None, returns all cluster members. If greater than number of cluster members, returns all cluster members. Defaults to 100.
eps (float) – a threshold for how close to the macrostate boundary a sample can be. Defaults to 0.5.
- Returns:
- mask of shape (n_samples,) where value of 1 means
that a) a point is considered high-density and b) a point doesn’t lie close to a cluster boundary. 0 otherwise.
- Return type:
np.ndarray
cfl.post_cfl.macro_cond_prob module
This module computes the conditional probability of each Y macrostate given each X macrostate and visualizes the result.
- cfl.post_cfl.macro_cond_prob._compute_cond_prob(xlbls, ylbls)
Compute the probability of a sample being in Y macrostate j given it being in X macrostate i.
- Parameters:
xlbls (np.ndarray) – an (n_samples,) array of X macrostate assignments.
ylbls (np.ndarray) – an (n_samples,) array of Y macrostate assignments.
- Returns:
- an (n_X_macrostates,n_Y_macrostates) array of probabilities
where the value at index (i,j) is equal to P(Ymacro=j | Xmacro=i)
- Return type:
np.ndarray
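The table described above can be estimated from label counts. The following is a minimal sketch, not the library's code: the helper name cond_prob_table is hypothetical, and it assumes rows and columns follow the sorted unique labels.

```python
import numpy as np


def cond_prob_table(xlbls, ylbls):
    """Hypothetical helper: estimate P(Ymacro=j | Xmacro=i) from label counts."""
    uxlbls = np.unique(xlbls)  # sorted unique X macrostate labels
    uylbls = np.unique(ylbls)  # sorted unique Y macrostate labels
    table = np.zeros((uxlbls.size, uylbls.size))
    for i, xl in enumerate(uxlbls):
        in_x = xlbls == xl
        for j, yl in enumerate(uylbls):
            # Fraction of samples in X macrostate xl that fall in Y macrostate yl
            table[i, j] = np.sum(in_x & (ylbls == yl)) / np.sum(in_x)
    return table


xlbls = np.array([0, 0, 0, 1, 1])
ylbls = np.array([0, 1, 1, 0, 0])
P = cond_prob_table(xlbls, ylbls)
print(P)
```

Each row of the result sums to 1, because every sample in a given X macrostate belongs to exactly one Y macrostate.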
- cfl.post_cfl.macro_cond_prob.compute_macro_cond_prob(exp, data=None, dataset_name='dataset_train', visualize=True)
Wrapper to compute the macro conditional probability given a specific Experiment directory path or object.
- Parameters:
exp (str or cfl.Experiment) – path to experiment or Experiment object
data (None) – not used here, here for consistency
dataset_name (str) – name of dataset to load results for. Defaults to ‘dataset_train’
visualize (bool) – whether to visualize samples selected. Defaults to True.
- Returns:
- an (n_X_macrostates,n_Y_macrostates) array of probabilities
where the value at index (i,j) is equal to P(Ymacro=j | Xmacro=i)
- Return type:
np.ndarray
- cfl.post_cfl.macro_cond_prob.visualize_cond_prob(P_Ym_given_Xm, uxlbls, uylbls, fig_path=None)
Visualize the conditional probabilities.
- Parameters:
P_Ym_given_Xm (np.ndarray) – an (n_X_macrostates,n_Y_macrostates) array of probabilities where the value at index (i,j) is equal to P(Ymacro=j | Xmacro=i)
uxlbls (np.ndarray) – an array of unique X macrostate labels
uylbls (np.ndarray) – an array of unique Y macrostate labels
fig_path (str) – path to save figure to, if not None. Defaults to None.
- Returns:
None
cfl.post_cfl.microvariable_importance module
This module provides a measure of the importance of each microvariable in distinguishing between any two given macrostates that CFL found.
- cfl.post_cfl.microvariable_importance._kl_divergence(p, q)
Helper function for discrimination_KL. Computes the KL divergence in both directions and returns the mean.
- Parameters:
p (np.ndarray) – first distribution to compare
q (np.ndarray) – second distribution to compare
- Returns:
mean of the KL divergences between p and q, taken in both directions
- Return type:
float
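Averaging the KL divergence over both directions makes the measure symmetric in p and q. A minimal sketch for discrete distributions follows; the name mean_bidirectional_kl and the small smoothing constant tiny are assumptions made for illustration, not part of the library.

```python
import numpy as np


def mean_bidirectional_kl(p, q, tiny=1e-12):
    """Hypothetical helper: mean of KL(p||q) and KL(q||p) for discrete p, q.

    `tiny` is an assumed smoothing constant that avoids log(0) and
    division by zero for empty bins.
    """
    p = np.asarray(p, dtype=float) + tiny
    q = np.asarray(q, dtype=float) + tiny
    # Renormalize so both arrays are valid probability distributions
    p = p / p.sum()
    q = q / q.sum()
    kl_pq = np.sum(p * np.log(p / q))
    kl_qp = np.sum(q * np.log(q / p))
    return float(0.5 * (kl_pq + kl_qp))


p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.7])
d_pq = mean_bidirectional_kl(p, q)
d_qp = mean_bidirectional_kl(q, p)
d_pp = mean_bidirectional_kl(p, p)
print(d_pq, d_qp, d_pp)
```

Unlike a one-directional KL divergence, the value does not depend on the argument order, and it is zero when the two distributions are identical.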
- cfl.post_cfl.microvariable_importance.compute_microvariable_importance(exp, data, dataset_name='dataset_train', visualize=True, cause_or_effect='cause')
Wrapper function to compute microvariable importance given an Experiment directory path or object.
- Parameters:
exp (str or cfl.Experiment) – path to experiment or Experiment object
data (np.ndarray) – an (n_samples,n_features) array of microvariable measurements to evaluate
dataset_name (str) – name of dataset to load results for. Defaults to ‘dataset_train’
visualize (bool) – whether to visualize samples selected. If True, will save to [exp_path]/[dataset_name]/microvariable_importance.fig. Defaults to True.
cause_or_effect (str) – load results for cause or effect partition. Valid values are ‘cause’, ‘effect’. Defaults to ‘cause’.
- Returns:
- an (n_clusters, n_clusters, n_features) sized array
where element (i,j,k) specifies the distance between the distribution of feature k in cluster i and cluster j.
- Return type:
np.ndarray
- cfl.post_cfl.microvariable_importance.discriminate_clusters(data, lbls, disc_func=<function discrimination_KL>)
Compute how well each feature in data discriminates each pairwise class boundary using disc_func.
- Parameters:
data (np.ndarray) – (n_samples, n_features) dataset
lbls (np.ndarray) – (n_samples,) partition over data
disc_func (function) – a function that takes two samples of 1D data and returns some distance between them
- Returns:
- an (n_clusters, n_clusters, n_features) sized array
where element (i,j,k) specifies the distance between the distribution of feature k in cluster i and cluster j.
- Return type:
np.ndarray
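The shape and indexing of the returned array can be illustrated with a small sketch. Both function names below are hypothetical, and abs_mean_gap is a deliberately simple stand-in for a disc_func; it is not the KL-based default described in this module.

```python
import numpy as np


def abs_mean_gap(fi_samples, fj_samples):
    """Hypothetical disc_func: absolute difference of the two sample means."""
    return float(abs(np.mean(fi_samples) - np.mean(fj_samples)))


def pairwise_feature_distances(data, lbls, disc_func=abs_mean_gap):
    """Hypothetical helper: fill an (n_clusters, n_clusters, n_features) array."""
    ulbls = np.unique(lbls)
    n_clusters, n_features = ulbls.size, data.shape[1]
    disc_vals = np.zeros((n_clusters, n_clusters, n_features))
    for i, li in enumerate(ulbls):
        for j, lj in enumerate(ulbls):
            for k in range(n_features):
                # Compare feature k between members of cluster li and cluster lj
                disc_vals[i, j, k] = disc_func(data[lbls == li, k],
                                               data[lbls == lj, k])
    return disc_vals


data = np.array([[0.0, 5.0], [0.2, 5.1], [3.0, 5.0], [3.2, 4.9]])
lbls = np.array([0, 0, 1, 1])
disc_vals = pairwise_feature_distances(data, lbls)
print(disc_vals[0, 1])
```

In this toy data, feature 0 separates the two clusters far more than feature 1, so disc_vals[0, 1, 0] is much larger than disc_vals[0, 1, 1].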
- cfl.post_cfl.microvariable_importance.discrimination_KL(fi_samples, fj_samples)
Compute the KL divergence between two samples by estimating the distributions from the two samples and then taking the KL divergence between these two distributions.
- Parameters:
fi_samples (np.ndarray) – samples from distribution 1
fj_samples (np.ndarray) – samples from distribution 2
- Returns:
KL divergence between the distributions estimated from fi_samples and fj_samples
- Return type:
float
- cfl.post_cfl.microvariable_importance.plot_disc_vals(disc_vals, fig_path=None)
Visualize the distances between distributions of each feature between each pair of clusters.
- Parameters:
disc_vals (np.ndarray) – an (n_clusters, n_clusters, n_features) sized array where element (i,j,k) specifies the distance between the distribution of feature k in cluster i and cluster j.
fig_path (str) – path to save figure to, if not None. Defaults to None.
cfl.post_cfl.post_cfl_util module
A set of helper functions to load Experiment results used in post_cfl analyses.
- cfl.post_cfl.post_cfl_util.get_exp_path(exp)
If exp is an object, get the path from the object; otherwise, return the path specified.
- Parameters:
exp (str or cfl.Experiment) – path to experiment or Experiment object
- Returns:
path to saved Experiment directory.
- Return type:
str
- cfl.post_cfl.post_cfl_util.load_macrolbls(exp, dataset_name='dataset_train', cause_or_effect='cause')
Load macrostate labels from experiment directory path or object.
- Parameters:
exp (str or cfl.Experiment) – path to experiment or Experiment object
dataset_name (str) – name of dataset to load results for. Defaults to ‘dataset_train’
cause_or_effect (str) – load results for cause or effect partition. Valid values are ‘cause’, ‘effect’. Defaults to ‘cause’.
- Returns:
an (n_samples,) array of macrostate assignments.
- Return type:
np.ndarray
- cfl.post_cfl.post_cfl_util.load_pyx(exp, dataset_name='dataset_train')
Load P(Y|X) estimate from experiment directory path or object.
- Parameters:
exp (str or cfl.Experiment) – path to experiment or Experiment object
dataset_name (str) – name of dataset to load results for. Defaults to ‘dataset_train’
- Returns:
an array of P(Y|X) estimates.
- Return type:
np.ndarray