cfl.post_cfl package

Submodules

cfl.post_cfl.intervention_rec module

This module recommends values to intervene on in subsequent experimentation, in order to refine the observational partition into a causal partition. It 1) identifies values in high-density regions, where CFL has more certainty about its macrostate assignments, and 2) selects a subset of these points that is far from the macrostate boundaries.

cfl.post_cfl.intervention_rec._compute_density(pyx)

For each point in pyx, compute a density proxy.

Parameters:

pyx (np.ndarray) – output of a CDE Block of shape (n_samples, n target features)

Returns:

array of density proxies aligned with pyx, of shape (n_samples,)

Return type:

np.ndarray
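
The reference above does not say how the density proxy is computed. As a purely illustrative sketch, one common proxy is the inverse of the mean distance from a point to its k nearest neighbors. The function name `knn_density_proxy` and the parameter `k` are hypothetical and are not part of cfl.

```python
import numpy as np

def knn_density_proxy(pyx, k=3):
    """Hypothetical density proxy: inverse mean distance to the k nearest neighbors."""
    pyx = np.asarray(pyx, dtype=float)
    # Pairwise Euclidean distances, shape (n_samples, n_samples)
    diffs = pyx[:, None, :] - pyx[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Sort each row; column 0 is the distance of a point to itself (0)
    nearest = np.sort(dists, axis=1)[:, 1:k + 1]
    # Small constant avoids division by zero for duplicated points
    return 1.0 / (nearest.mean(axis=1) + 1e-12)

pyx = np.array([[0.0], [0.1], [0.2], [5.0]])
density = knn_density_proxy(pyx, k=2)
```

The result has one value per row of pyx, and the isolated point at 5.0 receives the lowest value.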

cfl.post_cfl.intervention_rec._discard_boundary_samples(pyx, high_density_mask, cluster_labels, eps=0.5)

Given points of high density, discard points that lie close to a cluster boundary.

Parameters:
  • pyx (np.ndarray) – output of a CDE Block of shape (n_samples, n target features)

  • high_density_mask (np.ndarray) – mask of shape (n_samples,) indicating which samples are considered high-density

  • cluster_labels (np.ndarray) – array of integer cluster labels aligned with pyx of shape (n_samples,)

  • eps (float) – a threshold for how close to the macrostate boundary a sample can be. Defaults to 0.5.

Returns:

mask of shape (n_samples,) where a value of 1 means that a) a point is considered high-density and b) a point doesn’t lie close to a cluster boundary. 0 otherwise.

Return type:

np.ndarray
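
How closeness to a boundary is measured is not stated above. The sketch below assumes one simple rule: a high-density point is discarded when its nearest sample from a different cluster is less than eps away. The function name `discard_near_other_clusters` and this rule are assumptions made for illustration, not the cfl implementation.

```python
import numpy as np

def discard_near_other_clusters(pyx, high_density_mask, cluster_labels, eps=0.5):
    """Hypothetical rule: drop high-density points within eps of another cluster."""
    pyx = np.asarray(pyx, dtype=float)
    cluster_labels = np.asarray(cluster_labels)
    keep = np.asarray(high_density_mask).astype(bool).copy()
    for i in np.flatnonzero(keep):
        other = pyx[cluster_labels != cluster_labels[i]]
        if other.shape[0] == 0:
            continue  # only one cluster: there is no boundary to be near
        nearest_other = np.linalg.norm(other - pyx[i], axis=1).min()
        if nearest_other < eps:
            keep[i] = False
    return keep.astype(int)

pyx = np.array([[0.0], [0.9], [1.1], [2.0]])
cluster_labels = np.array([0, 0, 1, 1])
high_density_mask = np.array([1, 1, 1, 0])
final_mask = discard_near_other_clusters(pyx, high_density_mask, cluster_labels, eps=0.5)
```

Here the points at 0.9 and 1.1 sit 0.2 apart across the two clusters, so both are dropped, while the point at 2.0 was never high-density to begin with.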

cfl.post_cfl.intervention_rec._get_high_density_samples(density, cluster_labels, k_samples=None)

Returns the highest density samples per cluster.

Parameters:
  • density (np.ndarray) – computed density for each sample in pyx, of shape (n_samples,)

  • cluster_labels (np.ndarray) – array of integer cluster labels aligned with pyx of shape (n_samples,)

  • k_samples (int) – number of samples to extract per cluster. If None, returns all cluster members. If greater than number of cluster members, returns all cluster members. Defaults to None. Note: if several points have the same density at the cutoff density value, all will be returned so more than k_samples examples may be returned.

Returns:

mask of shape (n_samples,) where a value of 1 means that a point is considered high-density. 0 otherwise.

Return type:

np.ndarray
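
The k_samples behavior described above (None or an oversized value returns every member, and ties at the cutoff are all kept) can be sketched as follows. This is a minimal illustration under those stated rules, with the hypothetical name `top_k_density_mask`, and it assumes k_samples is a positive integer when given. It is not the library's code.

```python
import numpy as np

def top_k_density_mask(density, cluster_labels, k_samples=None):
    """Sketch: mark the k highest-density samples in each cluster (ties kept)."""
    density = np.asarray(density, dtype=float)
    cluster_labels = np.asarray(cluster_labels)
    mask = np.zeros(density.shape[0], dtype=int)
    for label in np.unique(cluster_labels):
        members = np.flatnonzero(cluster_labels == label)
        if k_samples is None or k_samples >= members.size:
            mask[members] = 1  # return every member of the cluster
            continue
        # Density of the k-th densest member serves as the cutoff
        cutoff = np.sort(density[members])[-k_samples]
        # >= keeps all ties at the cutoff, so more than k may be selected
        mask[members[density[members] >= cutoff]] = 1
    return mask

density = np.array([0.9, 0.5, 0.5, 0.1, 0.3, 0.8])
cluster_labels = np.array([0, 0, 0, 0, 1, 1])
mask = top_k_density_mask(density, cluster_labels, k_samples=2)
```

In this example cluster 0 contributes three samples rather than two, because two of its members tie at the cutoff density of 0.5.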

cfl.post_cfl.intervention_rec._get_recommendations(pyx, cluster_labels, k_samples=100, eps=0.5, visualize=True, exp_path=None, dataset_name=None)

For a set of data points, compute density for each point, extract high density samples, and discard points near cluster boundaries. Plot and return location of resulting subset of points.

Parameters:
  • pyx (np.ndarray) – output of a CDE Block of shape (n_samples, n target features)

  • cluster_labels (np.ndarray) – array of integer cluster labels aligned with pyx of shape (n_samples,)

  • k_samples (int) – number of samples to extract per cluster. If None, returns all cluster members. If greater than number of cluster members, returns all cluster members. Defaults to 100.

  • eps (float) – a threshold for how close to the macrostate boundary a sample can be. Defaults to 0.5.

  • visualize (bool) – whether to visualize samples selected. Defaults to True.

  • exp_path (str) – path to saved Experiment

  • dataset_name (str) – name of dataset to load results for. Defaults to None

Returns:

mask of shape (n_samples,) where a value of 1 means that a) a point is considered high-density and b) a point doesn’t lie close to a cluster boundary. 0 otherwise.

Return type:

np.ndarray

cfl.post_cfl.intervention_rec._plot_results(pyx, hd_mask, final_mask, cluster_labels, exp_path, dataset_name, feature_names=None)

Plot the original distribution of data overlaid with the points recommended for intervention. Will save the figure to: [exp_path]/[dataset_name]/intervention_recs.fig

Parameters:
  • pyx (np.ndarray) – output of a CDE Block of shape (n_samples, n target features)

  • hd_mask (np.ndarray) – mask of shape (n_samples,) where value of 1 means that a point is considered high-density. 0 otherwise.

  • final_mask (np.ndarray) – mask of shape (n_samples,) where value of 1 means that a) a point is considered high-density and b) a point doesn’t lie close to a cluster boundary. 0 otherwise.

  • cluster_labels (np.ndarray) – an (n_samples,) array of macrostate assignments

  • exp_path (str) – path to saved Experiment

  • dataset_name (str) – name of dataset to load results for.

  • feature_names (list) – optional list of names of each feature to plot. Defaults to None.

Returns:

None

cfl.post_cfl.intervention_rec.get_recommendations(exp, data=None, dataset_name='dataset_train', cause_or_effect='cause', visualize=True, k_samples=100, eps=0.5)

Wrapper that will get recommendations by experiment and dataset name.

Parameters:
  • exp (str or cfl.Experiment) – path to experiment or Experiment object

  • data (None) – not used here; present for consistency

  • dataset_name (str) – name of dataset to load results for. Defaults to ‘dataset_train’.

  • cause_or_effect (str) – load results for cause or effect partition. Valid values are ‘cause’, ‘effect’. Defaults to ‘cause’.

  • visualize (bool) – whether to visualize samples selected. Defaults to True.

  • k_samples (int) – number of samples to extract per cluster. If None, returns all cluster members. If greater than number of cluster members, returns all cluster members. Defaults to 100.

  • eps (float) – a threshold for how close to the macrostate boundary a sample can be. Defaults to 0.5.

Returns:

mask of shape (n_samples,) where a value of 1 means that a) a point is considered high-density and b) a point doesn’t lie close to a cluster boundary. 0 otherwise.

Return type:

np.ndarray
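
Because the return value is a 0/1 mask rather than a list of indices, it usually needs converting before it can select rows. The sketch below uses stand-in arrays (`X` and `mask` are made up here, and the mask is assumed to hold numeric 0/1 values) instead of calling cfl.

```python
import numpy as np

# Stand-in values; in practice `mask` would be the array returned by
# get_recommendations and `X` the samples it is aligned with.
X = np.array([[0.1, 1.0], [0.4, 0.9], [2.0, 3.1], [2.2, 2.9]])
mask = np.array([1, 0, 0, 1])

# The mask holds 0/1 values, so cast to bool before indexing;
# indexing with the raw integers would select rows 1 and 0 instead.
recommended = X[mask.astype(bool)]
recommended_idx = np.flatnonzero(mask)
```

`recommended` then holds the rows suggested for intervention and `recommended_idx` their positions in the dataset.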

cfl.post_cfl.macro_cond_prob module

This module computes the conditional probability of each Y macrostate given each X macrostate and visualizes the result.

cfl.post_cfl.macro_cond_prob._compute_cond_prob(xlbls, ylbls)

Compute the probability of a sample being in Y macrostate j given it being in X macrostate i.

Parameters:
  • xlbls (np.ndarray) – an (n_samples,) array of X macrostate assignments.

  • ylbls (np.ndarray) – an (n_samples,) array of Y macrostate assignments.

Returns:

an (n_X_macrostates, n_Y_macrostates) array of probabilities where the value at index (i,j) is equal to P(Ymacro=j | Xmacro=i)

Return type:

np.ndarray
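
The quantity described above can be estimated directly from label co-occurrence counts. The following is a minimal sketch of that calculation, not necessarily how cfl computes it; the function name `cond_prob_sketch` is hypothetical.

```python
import numpy as np

def cond_prob_sketch(xlbls, ylbls):
    """Estimate P(Ymacro=j | Xmacro=i) from co-occurrence counts."""
    xlbls, ylbls = np.asarray(xlbls), np.asarray(ylbls)
    uxlbls, uylbls = np.unique(xlbls), np.unique(ylbls)
    P = np.zeros((uxlbls.size, uylbls.size))
    for i, xl in enumerate(uxlbls):
        in_x = xlbls == xl
        for j, yl in enumerate(uylbls):
            # Fraction of samples in X macrostate xl that fall in Y macrostate yl
            P[i, j] = np.mean(ylbls[in_x] == yl)
    return P

xlbls = np.array([0, 0, 0, 0, 1, 1])
ylbls = np.array([0, 0, 0, 1, 1, 1])
P = cond_prob_sketch(xlbls, ylbls)
```

Each row of `P` sums to 1, since every sample in a given X macrostate belongs to exactly one Y macrostate.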

cfl.post_cfl.macro_cond_prob.compute_macro_cond_prob(exp, data=None, dataset_name='dataset_train', visualize=True)

Wrapper to compute the macro conditional probability given a specific Experiment directory path or object.

Parameters:
  • exp (str or cfl.Experiment) – path to experiment or Experiment object

  • data (None) – not used here; present for consistency

  • dataset_name (str) – name of dataset to load results for. Defaults to ‘dataset_train’.

  • visualize (bool) – whether to visualize samples selected. Defaults to True.

Returns:

an (n_X_macrostates, n_Y_macrostates) array of probabilities where the value at index (i,j) is equal to P(Ymacro=j | Xmacro=i)

Return type:

np.ndarray

cfl.post_cfl.macro_cond_prob.visualize_cond_prob(P_Ym_given_Xm, uxlbls, uylbls, fig_path=None)

Visualize the conditional probabilities.

Parameters:
  • P_Ym_given_Xm (np.ndarray) – an (n_X_macrostates,n_Y_macrostates) array of probabilities where the value at index (i,j) is equal to P(Ymacro=j | Xmacro=i)

  • uxlbls (np.ndarray) – an array of unique X macrostate labels

  • uylbls (np.ndarray) – an array of unique Y macrostate labels

  • fig_path (str) – path to save figure to, if not None. Defaults to None.

Returns:

None

cfl.post_cfl.microvariable_importance module

This module provides a measure of the importance of each microvariable in distinguishing between any two given macrostates that CFL found.

cfl.post_cfl.microvariable_importance._kl_divergence(p, q)

Helper function for discrimination_KL. Computes the KL divergence in both directions and returns the mean.

Parameters:
  • p (np.ndarray) – first distribution to compare

  • q (np.ndarray) – second distribution to compare

Returns:

KL divergence between p and q

Return type:

float
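
Averaging the KL divergence over both directions makes the measure symmetric in its two arguments. A minimal sketch for discrete distributions follows; the name `symmetric_kl` and the smoothing term `eps` are assumptions added here to keep the logarithm finite, and they are not taken from cfl.

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """Mean of KL(p || q) and KL(q || p) for two discrete distributions."""
    # eps is a hypothetical smoothing term that avoids log(0) and division by zero
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    kl_pq = np.sum(p * np.log(p / q))
    kl_qp = np.sum(q * np.log(q / p))
    return 0.5 * (kl_pq + kl_qp)

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
d = symmetric_kl(p, q)
```

Swapping the arguments leaves the result unchanged, and identical distributions give a value of zero up to floating-point error.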

cfl.post_cfl.microvariable_importance.compute_microvariable_importance(exp, data, dataset_name='dataset_train', visualize=True, cause_or_effect='cause')

Wrapper function to compute microvariable importance given an Experiment directory path or object.

Parameters:
  • exp (str or cfl.Experiment) – path to experiment or Experiment object

  • data (np.ndarray) – an (n_samples, n_features) array of microvariable measurements to evaluate

  • dataset_name (str) – name of dataset to load results for. Defaults to ‘dataset_train’

  • visualize (bool) – whether to visualize samples selected. If True, will save to [exp_path]/[dataset_name]/microvariable_importance.fig. Defaults to True.

  • cause_or_effect (str) – load results for cause or effect partition. Valid values are ‘cause’, ‘effect’. Defaults to ‘cause’.

Returns:

an (n_clusters, n_clusters, n_features) sized array where element (i,j,k) specifies the distance between the distribution of feature k in cluster i and cluster j.

Return type:

np.ndarray

cfl.post_cfl.microvariable_importance.discriminate_clusters(data, lbls, disc_func=<function discrimination_KL>)

Compute how well each feature in data discriminates each pairwise class boundary using disc_func.

Parameters:
  • data (np.ndarray) – (n_samples, n_features) dataset

  • lbls (np.ndarray) – (n_samples,) partition over data

  • disc_func (function) – a function that takes two samples of 1D data and returns some distance between them

Returns:

an (n_clusters, n_clusters, n_features) sized array where element (i,j,k) specifies the distance between the distribution of feature k in cluster i and cluster j.

Return type:

np.ndarray
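
The shape of the output follows from looping over every pair of clusters and every feature. The sketch below illustrates that structure with a deliberately simple stand-in for disc_func; both `pairwise_feature_distances` and `abs_mean_gap` are hypothetical names, and the real default is discrimination_KL.

```python
import numpy as np

def pairwise_feature_distances(data, lbls, disc_func):
    """Sketch: apply disc_func to each feature for every pair of clusters."""
    data, lbls = np.asarray(data, dtype=float), np.asarray(lbls)
    ulbls = np.unique(lbls)
    n_clusters, n_features = ulbls.size, data.shape[1]
    disc_vals = np.zeros((n_clusters, n_clusters, n_features))
    for i, li in enumerate(ulbls):
        for j, lj in enumerate(ulbls):
            for k in range(n_features):
                # Compare feature k between members of cluster li and cluster lj
                disc_vals[i, j, k] = disc_func(data[lbls == li, k], data[lbls == lj, k])
    return disc_vals

def abs_mean_gap(a, b):
    """Hypothetical stand-in for disc_func: absolute difference of means."""
    return abs(a.mean() - b.mean())

data = np.array([[0.0, 1.0], [0.2, 1.0], [3.0, 1.0], [3.2, 1.0]])
lbls = np.array([0, 0, 1, 1])
disc_vals = pairwise_feature_distances(data, lbls, abs_mean_gap)
```

In this toy data only the first feature separates the two clusters, so `disc_vals[0, 1, 0]` is large while `disc_vals[0, 1, 1]` is zero.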

cfl.post_cfl.microvariable_importance.discrimination_KL(fi_samples, fj_samples)

Compute the KL divergence between two samples by estimating the distributions from the two samples and then taking the KL divergence between these two distributions.

Parameters:
  • fi_samples (np.ndarray) – samples from distribution 1

  • fj_samples (np.ndarray) – samples from distribution 2

Returns:

KL divergence between the distributions estimated from fi_samples and fj_samples

Return type:

float
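
The reference does not say which estimator is used to turn samples into distributions. One simple possibility, shown here only as an assumption, is to histogram both samples on a shared set of bins. The name `histogram_kl` and the parameters `n_bins` and `eps` are hypothetical.

```python
import numpy as np

def histogram_kl(fi_samples, fj_samples, n_bins=10, eps=1e-12):
    """Sketch: histogram both samples on shared bins, then average KL both ways."""
    fi = np.asarray(fi_samples, dtype=float)
    fj = np.asarray(fj_samples, dtype=float)
    # Shared bin edges so that the two histograms are directly comparable
    edges = np.histogram_bin_edges(np.concatenate([fi, fj]), bins=n_bins)
    # eps keeps empty bins from producing log(0) or division by zero
    p = np.histogram(fi, bins=edges)[0] + eps
    q = np.histogram(fj, bins=edges)[0] + eps
    p, q = p / p.sum(), q / q.sum()
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

rng = np.random.default_rng(0)
same = histogram_kl(rng.normal(0, 1, 500), rng.normal(0, 1, 500))
shifted = histogram_kl(rng.normal(0, 1, 500), rng.normal(3, 1, 500))
```

Two samples drawn from the same distribution should score much lower than two samples whose means differ, which is what makes the value usable as a per-feature discrimination measure. The result is sensitive to the bin count and to `eps`, so a kernel density estimate may be preferable in practice.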

cfl.post_cfl.microvariable_importance.plot_disc_vals(disc_vals, fig_path=None)

Visualize the distances between distributions of each feature between each pair of clusters.

Parameters:
  • disc_vals (np.ndarray) – an (n_clusters, n_clusters, n_features) sized array where element (i,j,k) specifies the distance between the distribution of feature k in cluster i and cluster j.

  • fig_path (str) – path to save figure to, if not None. Defaults to None.

cfl.post_cfl.post_cfl_util module

A set of helper functions to load in Experiment results used in post_cfl analyses.

cfl.post_cfl.post_cfl_util.get_exp_path(exp)

If exp is an object, get the path from the object; otherwise, return the path specified.

Parameters:

exp (str or cfl.Experiment) – path to experiment or Experiment object

Returns:

path to saved Experiment directory.

Return type:

str

cfl.post_cfl.post_cfl_util.load_macrolbls(exp, dataset_name='dataset_train', cause_or_effect='cause')

Load macrostate labels from experiment directory path or object.

Parameters:
  • exp (str or cfl.Experiment) – path to experiment or Experiment object

  • dataset_name (str) – name of dataset to load results for. Defaults to ‘dataset_train’.

  • cause_or_effect (str) – load results for cause or effect partition. Valid values are ‘cause’, ‘effect’. Defaults to ‘cause’.

Returns:

an (n_samples,) array of macrostate assignments.

Return type:

np.ndarray

cfl.post_cfl.post_cfl_util.load_pyx(exp, dataset_name='dataset_train')

Load P(Y|X) estimate from experiment directory path or object.

Parameters:
  • exp (str or cfl.Experiment) – path to experiment or Experiment object

  • dataset_name (str) – name of dataset to load results for. Defaults to ‘dataset_train’.

Returns:

an array of P(Y|X) estimates.

Return type:

np.ndarray

Module contents