cfl.visualization package

Submodules

cfl.visualization.basic_visualizations module

A set of generic functions to visualize data samples by macrostate found by CFL.

Usage:

    from cfl.visualization_methods import macrostate_vis
    # data: an n_samples x (up to 3D shape for each sample) array
    macrostate_vis(data=data, exp_id=0, cause_or_effect='cause',
                   subtract_global_mean=True)

cfl.visualization.basic_visualizations._plot(data, lbls, feature_names=None, dim_names=None, subtract_global_mean=True, fig_path=None, figsize=None, kwargs={})

Visualizes macrostates, determining whether to delegate to _plot_1D, _plot_2D, or _plot_3D.

Parameters:
  • data – an (n_samples, …) array (with up to three additional dimensions after the n_samples dimension) of data points to visualize by macrostate.

  • lbls (np.ndarray) – an (n_samples,) array of macrostate assignments.

  • feature_names (list) – optional nested list of names of each feature to plot for each dimension. Defaults to None.

  • dim_names (list) – list of str names for each dimension in data after the n_samples dimension.

  • subtract_global_mean (bool) – if True, global mean will be subtracted from each macrostate mean displayed. If False, raw macrostate means will be displayed. Defaults to True.

  • fig_path (str) – path to save figure to. Will not save if value is None. Defaults to None.

  • figsize (tuple) – size of figure to display. If None, sizing will be done automatically. Defaults to None.

  • kwargs (dict) – additional arguments to pass to matplotlib.pyplot.imshow. Defaults to {}.

Returns: None
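
The effect of subtract_global_mean can be illustrated with a minimal NumPy sketch. This is not the library's implementation: the helper name macrostate_means is hypothetical, and the sketch assumes the "global mean" is the mean over all samples.

```python
import numpy as np


def macrostate_means(data, lbls, subtract_global_mean=True):
    """Hypothetical helper: average sample per macrostate.

    data: (n_samples, ...) array; lbls: (n_samples,) macrostate labels.
    Returns (u_lbls, means), where means has shape (n_lbls, ...).
    """
    data = np.asarray(data, dtype=float)
    lbls = np.asarray(lbls)
    u_lbls = np.unique(lbls)
    # One mean sample per unique macrostate label.
    means = np.stack([data[lbls == lbl].mean(axis=0) for lbl in u_lbls])
    if subtract_global_mean:
        # Assumption: the global mean is taken over all samples.
        means = means - data.mean(axis=0)
    return u_lbls, means


data = np.array([[0.0, 2.0], [2.0, 4.0], [4.0, 6.0], [6.0, 8.0]])
lbls = np.array([0, 0, 1, 1])
u_lbls, means = macrostate_means(data, lbls)
_, raw_means = macrostate_means(data, lbls, subtract_global_mean=False)
```

With the global mean subtracted, each row of means shows how a macrostate deviates from the average sample, which tends to make differences between macrostates easier to see than the raw means.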

cfl.visualization.basic_visualizations._plot_1D(n_lbls, u_lbls, n_features, means, vmin, vmax, cmap, feature_names, dim_names, figsize=None, kwargs={})

Plot samples by macrostate for 1D samples.

Parameters:
  • n_lbls (int) – number of unique macrostates.

  • u_lbls (np.ndarray) – labels for each unique macrostate.

  • n_features (list) – list of the number of features for each dimension (in this case, only one element in the list).

  • means (np.ndarray) – average sample per macrostate.

  • vmin (float) – lower bound color value for matplotlib.pyplot.imshow.

  • vmax (float) – upper bound color value for matplotlib.pyplot.imshow.

  • cmap (str) – color map to use for imshow, detailed in matplotlib.pyplot documentation.

  • feature_names (list) – nested list of names of each feature to plot for each dimension.

  • dim_names (list) – list of str names for each dimension in data after the n_samples dimension.

  • figsize (tuple) – size of figure to display. If None, sizing will be done automatically. Defaults to None.

  • kwargs (dict) – additional arguments to pass to matplotlib.pyplot.imshow. Defaults to {}.

Returns: None

cfl.visualization.basic_visualizations._plot_2D(n_lbls, u_lbls, n_features, means, vmin, vmax, cmap, feature_names, dim_names, figsize=None, kwargs={})

Plot samples by macrostate for 2D samples.

Parameters:
  • n_lbls (int) – number of unique macrostates.

  • u_lbls (np.ndarray) – labels for each unique macrostate.

  • n_features (list) – list of the number of features for each dimension (in this case, two elements in the list).

  • means (np.ndarray) – average sample per macrostate.

  • vmin (float) – lower bound color value for matplotlib.pyplot.imshow.

  • vmax (float) – upper bound color value for matplotlib.pyplot.imshow.

  • cmap (str) – color map to use for imshow, detailed in matplotlib.pyplot documentation.

  • feature_names (list) – nested list of names of each feature to plot for each dimension.

  • dim_names (list) – list of str names for each dimension in data after the n_samples dimension.

  • figsize (tuple) – size of figure to display. If None, sizing will be done automatically. Defaults to None.

  • kwargs (dict) – additional arguments to pass to matplotlib.pyplot.imshow. Defaults to {}.

Returns: None

cfl.visualization.basic_visualizations._plot_3D(n_lbls, u_lbls, n_features, means, vmin, vmax, cmap, feature_names, dim_names, figsize=None, kwargs={})

Plot samples by macrostate for 3D samples.

Parameters:
  • n_lbls (int) – number of unique macrostates.

  • u_lbls (np.ndarray) – labels for each unique macrostate.

  • n_features (list) – list of the number of features for each dimension (in this case, three elements in the list).

  • means (np.ndarray) – average sample per macrostate.

  • vmin (float) – lower bound color value for matplotlib.pyplot.imshow.

  • vmax (float) – upper bound color value for matplotlib.pyplot.imshow.

  • cmap (str) – color map to use for imshow, detailed in matplotlib.pyplot documentation.

  • feature_names (list) – nested list of names of each feature to plot for each dimension.

  • dim_names (list) – list of str names for each dimension in data after the n_samples dimension.

  • figsize (tuple) – size of figure to display. If None, sizing will be done automatically. Defaults to None.

  • kwargs (dict) – additional arguments to pass to matplotlib.pyplot.imshow. Defaults to {}.

Returns: None

cfl.visualization.basic_visualizations.visualize_macrostates(exp_path, data, feature_names=None, data_series='dataset_train', cause_or_effect='cause', subtract_global_mean='True', figsize=None, kwargs={})

Main function to visualize macrostates. Given a path to a saved Experiment, it loads the specified macrostate labels and delegates to the _plot helper function.

Parameters:
  • exp_path (str) – path to saved Experiment.

  • data – an (n_samples, …) array (with up to three additional dimensions after the n_samples dimension) of data points to visualize by macrostate.

  • feature_names (list) – optional nested list of names of each feature to plot for each dimension. Defaults to None.

  • data_series (str) – name of dataset to load results for. Defaults to ‘dataset_train’

  • cause_or_effect (str) – load results for cause or effect partition. Valid values are ‘cause’, ‘effect’. Defaults to ‘cause’.

  • subtract_global_mean (bool) – if True, global mean will be subtracted from each macrostate mean displayed. If False, raw macrostate means will be displayed. Defaults to True.

  • figsize (tuple) – size of figure to display. If None, sizing will be done automatically. Defaults to None.

  • kwargs (dict) – additional arguments to pass to matplotlib.pyplot.imshow. Defaults to {}.

Returns: None

cfl.visualization.cde_diagnostic module

Contains two main functions, pyx_scatter() and cde_diagnostic(), and helpers for those functions.

These functions can be used to examine the quality of the CDE’s learning.

cfl.visualization.cde_diagnostic.__for_continuous_Y(Y, pyx)

This method is for a Y that consists of continuous variable(s). If Y contains a single variable, its distribution will be plotted as a single histogram.

If Y contains multiple variables, they will be plotted together as a stacked histogram.

cfl.visualization.cde_diagnostic.__pyx_scatter_gt_legend(ax, pyx, ground_truth_labels)

Plots data from each ground_truth_class as a separate series in the scatter plot. Does this so that each label can be associated with a legend.

cfl.visualization.cde_diagnostic.cde_diagnostic(cfl_experiment)

Creates a figure to help diagnose whether the CDE is predicting the target variable(s) effectively or should be tuned further.

This function creates a figure with two subplots. The first shows the actual distribution of the Y variable(s), according to the data. The second shows the predicted distribution of the Y variable(s), as outputted by the CDE.

If the effect data (Y) has type continuous (as specified in the data_info dictionary), then histograms showing the distribution of the effect variable are created. If Y is continuous and multidimensional, a stacked histogram with each feature is created. If Y has type categorical, bar charts with the mean values for each feature in Y are created.

If the CDE is doing a good job of learning the effect, the two subplots should contain similar or near-identical distributions.

Note

This function may not work for higher dimensional continuous Ys.

Parameters:

cfl_experiment (cfl.experiment.Experiment) – a trained CFL pipeline

Returns:

  • (Fig) - A matplotlib.pyplot Figure object that contains the diagnostic plot.

  • (Axes) - An array of matplotlib.pyplot Axes objects that are the subplots of the Figure object.

Example Usage:

```
import matplotlib.pyplot as plt
from cfl.visualization_methods import cde_diagnostic as cd

fig, axes = cd.cde_diagnostic(cfl_experiment)
plt.show()
```
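
The comparison that the two subplots support visually can also be sketched numerically. The snippet below is an illustration only and does not use cfl: Y and pyx are synthetic stand-ins for the observed effect data and the CDE's predictions, and the overlap score is an assumption of this sketch rather than something the library computes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins (assumptions): Y is observed 1D effect data,
# pyx is what a well-trained CDE might predict for the same samples.
Y = rng.normal(loc=0.0, scale=1.0, size=1000)
pyx = Y + rng.normal(loc=0.0, scale=0.1, size=1000)

# Shared bin edges so both histograms are directly comparable.
edges = np.histogram_bin_edges(np.concatenate([Y, pyx]), bins=20)
actual, _ = np.histogram(Y, bins=edges)
predicted, _ = np.histogram(pyx, bins=edges)

# Fraction of histogram mass that overlaps (1.0 means identical histograms).
overlap = np.minimum(actual, predicted).sum() / actual.sum()
print(f"histogram overlap: {overlap:.2f}")
```

An overlap close to 1.0 corresponds to the "similar or near-identical distributions" case described above; a low value would suggest that the CDE needs further tuning.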

cfl.visualization.cde_diagnostic.pyx_scatter(cfl_experiment, ground_truth=None, colored_by=None)

Creates a scatter plot with a sample of points from the CDE output, colored by ground truth (if given).

Note

This visualization method is only good for 1D effect data.

Example Usage:

```
import matplotlib.pyplot as plt
from cfl.visualization_methods import cde_diagnostic as cd

fig, ax = cd.pyx_scatter(cfl_experiment, ground_truth)
plt.show()
```

Parameters:
  • cfl_experiment (cfl.experiment.Experiment) – a trained CFL pipeline

  • ground_truth (np array) – (Optional) an array, aligned with the CFL training data that contains the ground truth macrovariable labels for the cause data. If provided, the points in the plot will be colored according to their ground truth state. Otherwise, all points will be colored the same.

Returns:

  • (Fig) - A matplotlib.pyplot Figure object that contains the scatter plot.

  • (Axes) - A matplotlib.pyplot Axes object that shows the scatter plot.
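
Coloring by ground truth, with one legend entry per class, can be sketched with plain matplotlib as follows. This is an assumption-laden illustration, not the code used by pyx_scatter: the data are synthetic, and plotting against the sample index is an arbitrary choice made for this sketch.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins (assumptions): 1D CDE outputs and ground-truth labels.
ground_truth = np.repeat([0, 1, 2], 50)
pyx = ground_truth + rng.normal(scale=0.2, size=ground_truth.size)

fig, ax = plt.subplots()
# One scatter series per ground-truth class, so each gets a legend entry.
for cls in np.unique(ground_truth):
    idx = np.flatnonzero(ground_truth == cls)
    ax.scatter(idx, pyx[idx], s=10, label=f"class {cls}")
ax.set_xlabel("sample index")
ax.set_ylabel("CDE output")
ax.legend(title="ground truth")
```

Plotting each class as its own series, rather than passing a single color array to one scatter call, is what lets matplotlib attach a separate legend label to each ground-truth state.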

cfl.visualization.clustering_to_sankey module

Iman Wahle and Jenna Kahn 1/8/20 Sankey diagram code

Create a Sankey diagram to show how samples move between clusters when cluster parameters are varied.

Usage of this function:

```
import plotly.graph_objects as go
from cfl.visualization_methods import clustering_to_sankey as sk

# x_lbls_L = list of x labels from several different rounds of clustering on the same data

link, label = sk.convert_lbls_to_sankey_nodes(x_lbls_L)

# plot
fig = go.Figure(data=[go.Sankey(
    node=dict(pad=15, thickness=20, label=label, color="blue"),
    link=link)])
fig.update_layout(title_text="Sample Sankey", font_size=10)
fig.show()
```

cfl.visualization.clustering_to_sankey.convert_lbls_to_sankey_nodes(x_lbls_L)

Convert cluster labels into source, target, and value information

Parameters:

x_lbls_L – a list of x_lbls, the results of multiple k-means clusterings on the same data

Returns:
  • link (dict) – a representation of nodes and weighted connections between them to make a Sankey diagram.

  • labels (list) – labels for every node in the Sankey diagram.
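
To make the source, target, and value idea concrete, here is a minimal pure-Python sketch of one way such a conversion could work. It is not the library's implementation: the function name lbls_to_sankey and the node label format are hypothetical, and only the source/target/value keys are taken from what plotly's Sankey link expects.

```python
from collections import Counter


def lbls_to_sankey(x_lbls_L):
    """Hypothetical sketch: build a Sankey link dict and node labels.

    x_lbls_L: list of label sequences, one per clustering round,
    all aligned to the same samples.
    """
    # Give every (round, cluster) pair its own node index.
    node_index = {}
    labels = []
    for r, lbls in enumerate(x_lbls_L):
        for c in sorted(set(lbls)):
            node_index[(r, c)] = len(labels)
            labels.append(f"round {r}, cluster {c}")

    # Count samples moving between clusters in consecutive rounds.
    flows = Counter()
    for r in range(len(x_lbls_L) - 1):
        for a, b in zip(x_lbls_L[r], x_lbls_L[r + 1]):
            flows[(node_index[(r, a)], node_index[(r + 1, b)])] += 1

    pairs = sorted(flows)
    link = dict(
        source=[s for s, _ in pairs],
        target=[t for _, t in pairs],
        value=[flows[p] for p in pairs],
    )
    return link, labels


link, labels = lbls_to_sankey([[0, 0, 1, 1], [0, 1, 1, 2]])
```

Each entry in value is the number of samples that moved from the source cluster in one round to the target cluster in the next, which is what determines the width of each band in the diagram.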

cfl.visualization.data_sample_visualizations module

Two functions for viewing images.

cfl.visualization.data_sample_visualizations.view_class_examples(images, im_shape, n_examples, x_lbls)

Shows images in matplotlib with labels displayed at the top of each image. Best for viewing a lot of images at once.

Parameters:
  • images (2D or 3D np array) – Array of images (must be aligned with x_lbls). If 2D, axis 0 = samples, axis 1 = flattened image pixels. If 3D, axis 0 = samples, axis 1 = image rows, axis 2 = image cols.

  • n_rows (int) – Number of rows of images to display.

  • x_lbls (1D np array) – labels to show at the top of each image. Should be aligned with the images input

Returns: None
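
The two accepted layouts for images can be reconciled with a reshape. The sketch below is illustrative only: the helper name to_image_stack is hypothetical, and it assumes im_shape gives the (rows, cols) of each image, which the parameter list above does not state.

```python
import numpy as np


def to_image_stack(images, im_shape):
    """Hypothetical helper: return images as (n_samples, rows, cols)."""
    images = np.asarray(images)
    if images.ndim == 2:
        # Flattened pixels: restore the assumed (rows, cols) shape per sample.
        return images.reshape((images.shape[0],) + tuple(im_shape))
    if images.ndim == 3:
        # Already (n_samples, rows, cols): nothing to do.
        return images
    raise ValueError("images must be 2D or 3D")


flat = np.arange(4 * 6).reshape(4, 6)
stack = to_image_stack(flat, im_shape=(2, 3))
```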

cfl.visualization.data_sample_visualizations.view_random_example(image_array, random_state=None)

Chooses a random image from image_array and displays it. Setting random_state makes the choice reproducible.

Parameters:
  • image_array – array of images of size (n_samples, image_dim1, image_dim2).

  • random_state (int) – random state used for the rng.choice call that selects the random image to display. Defaults to None.

Returns: None
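
The reproducibility that random_state provides can be sketched with NumPy's Generator API. This is an assumption about the mechanism, based only on the mention of rng.choice above; the helper name pick_random_image is hypothetical, and the display step is omitted.

```python
import numpy as np


def pick_random_image(image_array, random_state=None):
    """Hypothetical helper: choose one image, reproducibly if seeded."""
    rng = np.random.default_rng(random_state)
    # Choose a sample index uniformly at random along axis 0.
    idx = rng.choice(image_array.shape[0])
    return idx, image_array[idx]


images = np.arange(5 * 2 * 3).reshape(5, 2, 3)
idx_a, img_a = pick_random_image(images, random_state=42)
idx_b, img_b = pick_random_image(images, random_state=42)
```

Calling the helper twice with the same random_state selects the same image, while leaving it as None draws a fresh index each time.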

Module contents