Some Notes on Conditional Density Estimators (CDEs)

A conditional density estimator (CDE) is a machine learning model that, given samples of the variables x and y, learns to estimate the conditional probability distribution P(y | x). The CDE is the first step of CFL.

How the CDE step of CFL works

In the current implementation of CFL, the CDE step computes a conditional expectation. It takes as input the causal data set X and the effect data set Y and outputs the expectation (mean) of the probability distribution P(Y | X = x) for each sample x in the data set X. These expectations are then clustered in the second step of CFL.
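
To make the idea concrete, here is a minimal sketch of a conditional expectation estimate on synthetic data. It is not CFL's implementation: a linear least-squares fit stands in for the neural network, and all names (true_W, W_hat, cond_exp) are made up for illustration. It relies on the fact that minimizing mean squared error recovers the conditional mean E[Y | X = x] when the model class contains it, which holds here because the synthetic data is generated from a linear model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for this sketch): Y is a linear function of X plus noise
n_samples = 500
X = rng.normal(size=(n_samples, 3))                        # shape (n_samples, 3)
true_W = np.array([[1.0, -2.0], [0.5, 0.0], [0.0, 3.0]])   # shape (3, 2)
Y = X @ true_W + 0.1 * rng.normal(size=(n_samples, 2))     # shape (n_samples, 2)

# Fit by least squares, i.e. by minimizing mean squared error.
# A column of ones is appended so the model can learn an intercept.
X_aug = np.hstack([X, np.ones((n_samples, 1))])
W_hat, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)

# One estimated conditional expectation E[Y | X = x] per sample x
cond_exp = X_aug @ W_hat                                   # shape (n_samples, 2)
```

The result has one row per sample in X, and it is these per-sample expectations that a later clustering step would group.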

Sample CFL Workflow

Input Shape for CDEs

Most CDEs

Most of the CDEs provided expect a 2-D input with the shape (n_samples, n_features) for both the X and the Y data.
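
A frequent stumbling block is a 1-D target array. The following minimal sketch uses only NumPy and made-up data (the variable names are illustrative and not part of CFL) to show one way of bringing a 1-D array into the (n_samples, n_features) shape described above:

```python
import numpy as np

# Hypothetical data: 150 samples with 4 features, and one target value per sample
X = np.zeros((150, 4))
y = np.arange(150)            # 1-D, shape (150,)

# Add a feature axis so that Y is 2-D with a single feature column
Y = np.reshape(y, (-1, 1))    # shape (150, 1)

# Both arrays are now 2-D and agree on the number of samples
assert X.ndim == 2 and Y.ndim == 2
assert X.shape[0] == Y.shape[0]
```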

CNNs

CondExpCNN is an example of a convolutional neural network (CNN). CNNs are well suited to processing image data. A CNN expects a 4-D X input of images with the shape (n_samples, n_rows, n_cols, n_channels) and a 2-D Y input.

    # Here is some example code showing how to reshape your data to the correct dimensionality 
    import numpy as np 
    from sklearn.datasets import load_iris

    # get some data 
    two_D = load_iris().data
    two_D.shape
    >>> (150, 4)

    # make this dataset 3D for demo
    X = np.reshape(two_D, (two_D.shape[0], 2, 2))
    X.shape 
    >>> (150, 2, 2)

    # 3-D to 4-D (for a CNN)
    X_new = np.expand_dims(X, -1) 
    X_new.shape
    >>> (150, 2, 2, 1) 

    # flatten a 3-D dataset 
    X_flat = np.reshape(X, (X.shape[0], np.prod(X.shape[1:])))
    X_flat.shape 
    >>> (150, 4)

Parameter Details

When constructing a new CDE object, you can customize its parameters. This allows you to specify the configuration of your CDE model during instantiation. Here are some of the parameters you can set:

'batch_size'
- What is it: batch size for neural network training
- Valid values: int, >0
- Default: 32
- Applies to: all CondExpBase derivatives
'n_epochs'
- What is it: number of epochs to train for
- Valid values: int, >0
- Default: 20
- Applies to: all CondExpBase derivatives
'optimizer'
- What is it: which optimizer to use in training (https://www.tensorflow.org/api_docs/python/tf/keras/optimizers)
- Valid values: string (e.g. 'adam', 'sgd')
- Default: 'adam'
- Applies to: all CondExpBase derivatives
'opt_config'
- What is it: a dictionary of optimizer parameters
- Valid values: Python dict. Look up valid parameters for your optimizer here: https://www.tensorflow.org/api_docs/python/tf/keras/optimizers
- Default: {}
- Applies to: all CondExpBase derivatives
'verbose'
- What is it: whether to print run updates (currently, updates are printed regardless of this setting)
- Valid values: bool
- Default: True
- Applies to: all CondExpBase derivatives
'dense_units'
- What is it: list of tf.keras.layers.Dense layer sizes
- Valid values: int list
- Default: [50, data_info['Y_dims'][1]]
- Applies to: CondExpMod
'activations'
- What is it: list of activation functions corresponding to layers specified in 'dense_units'
- Valid values: string list. See valid activations here: https://www.tensorflow.org/api_docs/python/tf/keras/activations
- Default: ['relu', 'linear']
- Applies to: CondExpMod
'dropouts'
- What is it: list of dropout rates after each layer specified in 'dense_units'
- Valid values: float list, each value from 0 to 1
- Default: [0, 0]
- Applies to: CondExpMod
'weights_path'
- What is it: path to a saved Keras model checkpoint to load into the model
- Valid values: string
- Default: None
- Applies to: all CondExpBase derivatives
'loss'
- What is it: which loss function to optimize the network with respect to (https://www.tensorflow.org/api_docs/python/tf/keras/losses)
- Valid values: string
- Default: 'mean_squared_error'
- Applies to: all CondExpBase derivatives
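
As a rough illustration, the parameters above could be collected in a plain Python dictionary such as the one below. This is a sketch only: the values are the documented defaults, except 'opt_config', which holds an example optimizer setting. The data_info dictionary is a hypothetical stand-in that assumes 'Y_dims' holds the shape (n_samples, n_features) of Y. How such a dictionary is handed to a CDE object is not shown here.

```python
# Hypothetical stand-in for the data description referenced in the
# 'dense_units' default; Y is assumed to have 150 samples and 3 features.
data_info = {'Y_dims': (150, 3)}

cde_params = {
    'batch_size': 32,
    'n_epochs': 20,
    'optimizer': 'adam',
    'opt_config': {'learning_rate': 1e-3},   # example optimizer parameter (default is {})
    'verbose': True,
    'dense_units': [50, data_info['Y_dims'][1]],
    'activations': ['relu', 'linear'],
    'dropouts': [0, 0],
    'weights_path': None,
    'loss': 'mean_squared_error',
}

# The three per-layer lists should describe the same number of layers
n_layers = len(cde_params['dense_units'])
assert len(cde_params['activations']) == n_layers
assert len(cde_params['dropouts']) == n_layers
```

Checking the list lengths up front is a cheap way to catch a mismatch between 'dense_units', 'activations', and 'dropouts' before any training starts.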