{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f73bb462",
   "metadata": {},
   "source": [
    "# Applying CFL to an Altitude Extraction Problem"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "98565c89",
   "metadata": {},
   "source": [
    "This notebook demonstrates how to apply the CFL algorithm to an altitude extraction problem. The datasets consist of longitude, latitude, and elevation data from Google Earth Engine, with temperature data generated from the elevation data using a simple linear model with Gaussian noise. CFL then learns a model that can generate contours of the given area."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "850c9768",
   "metadata": {},
   "source": [
    "### Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "333c07fd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Imports\n",
    "from cfl.experiment import Experiment\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import re # regular expressions for data cleaning\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "import matplotlib.pyplot as plt\n",
    "from alt_extraction.helpers import * # imports all helper functions from the alt_extraction.helpers module"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11228858",
   "metadata": {},
   "source": [
    "### Get, Clean, and Save Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "120140d1",
   "metadata": {},
   "source": [
    "The elevation data come from this Google Earth Engine dataset (open it in the Earth Engine code editor): https://developers.google.com/earth-engine/datasets/catalog/AU_GA_DEM_1SEC_v10_DEM-H \n",
    "\n",
    "Run the script at the following Drive link to extract the data at the desired resolution (set by the `SCALE` variable in the script): https://drive.google.com/file/d/1hxCDoeWw3POj4p2RA_g2TNkcyDaF6Hky/view?usp=sharing \n",
    "\n",
    "Export the data as a CSV to Google Drive, then move it to a local directory. Note that this get-and-clean process must be repeated for each dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd43bf82",
   "metadata": {},
   "source": [
    "Note that cleaned datasets at 10km, 13km, 40km, and 150km resolutions are available in the GitHub repository under the `data/altitude` folder. Skip this section if you are pulling the data directly from GitHub."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0b5f4457",
   "metadata": {},
   "outputs": [],
   "source": [
    "folder = '../../../data/altitude' # replace with your data folder path\n",
    "resolution = '10km' # replace with your resolution + 'km'\n",
    "path = folder + f'/elevation_{resolution}.csv' # replace with your file path/name\n",
    "df = pd.read_csv(path)\n",
    "\n",
    "# Data cleaning\n",
    "long = []\n",
    "lat = []\n",
    "for i, row in df.iterrows():\n",
    "    coords = re.search(r'\\[(.*?)\\]', row['.geo']).group(1).split(',')\n",
    "    long.append(float(coords[0]))\n",
    "    lat.append(float(coords[1]))\n",
    "\n",
    "df['long'] = long\n",
    "df['lat'] = lat\n",
    "df.drop(columns=['.geo', 'system:index'], inplace=True)"
   ]
  },
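  {
   "cell_type": "markdown",
   "id": "a7c1e0d1",
   "metadata": {},
   "source": [
    "As an alternative to the regular expression above, the coordinates can be read with `json.loads`. The sketch below assumes each `.geo` entry is a GeoJSON `Point` string with `[longitude, latitude]` coordinates, which should be checked against your own export. The two-row `sample` frame and the `split_geo` helper are hypothetical and only illustrate the idea.\n",
    "\n",
    "```python\n",
    "import json\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical two-row sample shaped like the exported CSV (values are made up)\n",
    "sample = pd.DataFrame({\n",
    "    'system:index': ['0', '1'],\n",
    "    'elevation': [120.0, 845.5],\n",
    "    '.geo': [\n",
    "        json.dumps({'type': 'Point', 'coordinates': [146.25, -35.5]}),\n",
    "        json.dumps({'type': 'Point', 'coordinates': [147.0, -36.0]}),\n",
    "    ],\n",
    "})\n",
    "\n",
    "def split_geo(frame):\n",
    "    # Parse each GeoJSON string and pull out its coordinate pair\n",
    "    coords = frame['.geo'].map(lambda s: json.loads(s)['coordinates'])\n",
    "    out = frame.drop(columns=['.geo', 'system:index'])\n",
    "    out['long'] = coords.map(lambda c: float(c[0]))\n",
    "    out['lat'] = coords.map(lambda c: float(c[1]))\n",
    "    return out\n",
    "\n",
    "cleaned = split_geo(sample)\n",
    "print(cleaned)\n",
    "```\n",
    "\n",
    "Parsing the JSON avoids depending on where the first pair of square brackets appears in the string."
   ]
  },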
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "6efe68a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Augment data with linearly generated temperature data\n",
    "lapse_rate = 0.0065 # deg C per m\n",
    "sea_level_temp = 19 # deg C - along east coast, using Sydney as reference\n",
    "err_std_dev = 0.2 # deg C - taking into account errors for lapse rate and sea level temp\n",
    "\n",
    "def linear_elevation_to_temp(elevations, err=True): # elevation in meters\n",
    "    temps = []\n",
    "    for elevation in elevations:\n",
    "        # draw the noise into its own variable so the `err` flag is never overwritten\n",
    "        noise = np.random.normal(0, err_std_dev) if err else 0\n",
    "        temp = sea_level_temp - (lapse_rate * elevation) + noise\n",
    "        temps.append(temp)\n",
    "    return temps\n",
    "\n",
    "test_df = df.copy() # make a copy of the original dataframe for test data\n",
    "\n",
    "# For training data, add Gaussian noise and drop elevation\n",
    "df['generated_temp'] = linear_elevation_to_temp(df['elevation'])\n",
    "df.drop(columns=['elevation'], inplace=True)\n",
    "df.to_csv(folder + f'/{resolution}_data.csv', index=False) # save training data\n",
    "\n",
    "# For test data, add no noise\n",
    "test_df['generated_temp'] = linear_elevation_to_temp(test_df['elevation'], err=False)\n",
    "test_df.to_csv(folder + f'/{resolution}_test.csv', index=False) # save test data"
   ]
  },
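  {
   "cell_type": "markdown",
   "id": "a7c1e0d2",
   "metadata": {},
   "source": [
    "The model in the cell above sets temperature to the sea-level temperature minus the lapse rate times elevation, plus Gaussian noise. A vectorized variant with reproducible noise might look like the sketch below. The function name `elevation_to_temp` and the `noisy` and `seed` parameters are hypothetical; the constants are copied from the cell above.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "LAPSE_RATE = 0.0065    # deg C per m\n",
    "SEA_LEVEL_TEMP = 19.0  # deg C\n",
    "ERR_STD_DEV = 0.2      # deg C\n",
    "\n",
    "def elevation_to_temp(elevations, noisy=True, seed=None):\n",
    "    # Linear lapse-rate model applied to the whole array at once\n",
    "    elevations = np.asarray(elevations, dtype=float)\n",
    "    temps = SEA_LEVEL_TEMP - LAPSE_RATE * elevations\n",
    "    if noisy:\n",
    "        rng = np.random.default_rng(seed)  # a fixed seed makes the noise reproducible\n",
    "        temps = temps + rng.normal(0.0, ERR_STD_DEV, size=elevations.shape)\n",
    "    return temps\n",
    "\n",
    "clean = elevation_to_temp([0.0, 1000.0, 2000.0], noisy=False)\n",
    "noisy = elevation_to_temp([0.0, 1000.0, 2000.0], seed=0)\n",
    "print(clean)\n",
    "print(noisy)\n",
    "```\n",
    "\n",
    "Passing a seed is optional; it only matters when the generated training set needs to be regenerated identically."
   ]
  },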
  {
   "cell_type": "markdown",
   "id": "ccf7a4a3",
   "metadata": {},
   "source": [
    "NOTE: Obtain, clean, augment, and save every dataset you intend to use before proceeding with the rest of the notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f310a78a",
   "metadata": {},
   "source": [
    "### Data Visualization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "54a119b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pick one dataset to visualize (will visualize unperturbed test data)\n",
    "folder = '../../../data/altitude' # replace with your data folder path\n",
    "resolution = '40km'\n",
    "truth_file = folder + f'/{resolution}_test.csv'\n",
    "truth_data = pd.read_csv(truth_file)\n",
    "true_alt, true_temp = get_alt_temp_grids(truth_data, ocean=False)\n",
    "\n",
    "# Plotting using helper functions from alt_extraction.helpers\n",
    "plot_area(true_alt, 'elevation', title=f'{resolution} elevation map', grey_back=True)\n",
    "plot_area(true_temp, 'temperature', title=f'{resolution} temp map', grey_back=True, color='jet')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec301c80",
   "metadata": {},
   "source": [
    "Generate gradients from the true elevation data (breadth-first search for the nearest peak):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0089654b",
   "metadata": {},
   "outputs": [],
   "source": [
    "search_depth = 5 # search depth (number of pixels)\n",
    "nan_val = -100 # deprecated, any arbitrary value works\n",
    "\n",
    "U, V = gen_elevation_grads(true_alt, search_depth, nan_val=nan_val, grey_back=True)\n",
    "angles = grad_angles(U, V, title=f'Angles of Gradients for {resolution}² Resolution (Search Depth={search_depth})', grey_back=True)"
   ]
  },
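  {
   "cell_type": "markdown",
   "id": "a7c1e0d3",
   "metadata": {},
   "source": [
    "The helper `gen_elevation_grads` is not shown in this notebook. For comparison, a plain finite-difference gradient (a different and simpler technique than a breadth-first search toward the nearest peak) can be sketched with `np.gradient`. The grid `elev` below is synthetic.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Synthetic elevation grid that rises by 2 m per pixel from left to right\n",
    "elev = np.tile(np.arange(5, dtype=float) * 2.0, (4, 1))\n",
    "\n",
    "# np.gradient returns the derivative along axis 0 (rows), then axis 1 (columns)\n",
    "d_row, d_col = np.gradient(elev)\n",
    "\n",
    "# Angle of the uphill direction, measured from the positive column direction\n",
    "angles_deg = np.degrees(np.arctan2(d_row, d_col))\n",
    "print(d_col[0])\n",
    "print(angles_deg[0])\n",
    "```\n",
    "\n",
    "A local finite difference reacts to pixel-level noise, which may be one reason to prefer a search over a wider neighborhood such as the `search_depth` used above."
   ]
  },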
  {
   "cell_type": "markdown",
   "id": "f2ee8263",
   "metadata": {},
   "source": [
    "### Preprocessing and Training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "ca9eea17",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pick one dataset to preprocess and train on\n",
    "folder = '../../../data/altitude' # replace with your data folder path\n",
    "resolution = '40km'\n",
    "n_clusters = 10 # number of clusters to run CFL algorithm with\n",
    "train_file = folder + f'/{resolution}_data.csv'\n",
    "\n",
    "train_data = pd.read_csv(train_file)\n",
    "Xraw = np.array(train_data[['lat', 'long']])\n",
    "Yraw = np.array(train_data['generated_temp']).reshape(-1,1)\n",
    "\n",
    "# Standardize data\n",
    "X = StandardScaler().fit_transform(Xraw)\n",
    "Y = StandardScaler().fit_transform(Yraw)"
   ]
  },
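  {
   "cell_type": "markdown",
   "id": "a7c1e0d4",
   "metadata": {},
   "source": [
    "The cell above discards the fitted scalers. If standardized values need to be mapped back to physical units later, keeping the `StandardScaler` object makes that possible, as in this small sketch (the temperatures are made up for illustration).\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "temps_raw = np.array([[19.0], [12.5], [6.0], [15.75]])  # made-up temps in deg C\n",
    "\n",
    "scaler = StandardScaler().fit(temps_raw)          # keep the fitted object\n",
    "temps_std = scaler.transform(temps_raw)           # zero mean, unit variance\n",
    "temps_back = scaler.inverse_transform(temps_std)  # back to deg C\n",
    "\n",
    "print(temps_std.mean(), temps_std.std())\n",
    "print(np.allclose(temps_back, temps_raw))\n",
    "```"
   ]
  },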
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "8b17c324",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create 3 dictionaries: one for data info, one with CDE parameters, and one with cluster parameters\n",
    "\n",
    "# the parameters should be passed in dictionary form\n",
    "data_info = {'X_dims' : X.shape,\n",
    "             'Y_dims' : Y.shape,\n",
    "             'Y_type' : 'continuous' #options: 'categorical' or 'continuous'\n",
    "            }\n",
    "\n",
    "# pass in empty parameter dictionaries to use the default parameter values (not\n",
    "# allowed for data_info)\n",
    "CDE_params = {  'model'        : 'CondExpMod',\n",
    "                'model_params' : {\n",
    "                    # model architecture\n",
    "                    'dense_units' : [50, data_info['Y_dims'][1]],\n",
    "                    'activations' : ['relu', 'linear'],\n",
    "                    'dropouts'    : [0, 0],\n",
    "                    # training parameters\n",
    "                    'batch_size'  : 128,\n",
    "                    'n_epochs'    : 2500,\n",
    "                    'optimizer'   : 'adam',\n",
    "                    'opt_config'  : {'lr' : 2e-4},\n",
    "                    'loss'        : 'mean_squared_error',\n",
    "                    'best'        : True,\n",
    "                    'early_stopping' : True,\n",
    "                    # verbosity\n",
    "                    'verbose'     : 1,\n",
    "                    'show_plot'   : True,\n",
    "                }\n",
    "}\n",
    "\n",
    "# only the cause clusterer is configured here; its parameters go in cause_cluster_params\n",
    "# CFL automatically recognizes the names of all sklearn.cluster models as keywords\n",
    "cause_cluster_params =  {'model' : 'KMeans',\n",
    "                         'model_params' : {'n_clusters' : n_clusters},\n",
    "                         'verbose' : 0\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d09ccd27",
   "metadata": {},
   "outputs": [],
   "source": [
    "# block_names indicates which CDE and clustering models to use\n",
    "block_names = ['CondDensityEstimator', 'CauseClusterer']\n",
    "\n",
    "# block_params is aligned to block_names\n",
    "block_params = [CDE_params, cause_cluster_params]\n",
    "\n",
    "results_path = 'alt_extraction_sample_run_1' # directory to save results to\n",
    "\n",
    "# Create a CFL experiment with specified parameters\n",
    "my_exp = Experiment(X_train=X,\n",
    "                    Y_train=Y,\n",
    "                    data_info=data_info,\n",
    "                    block_names=block_names,\n",
    "                    block_params=block_params,\n",
    "                    results_path=results_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "79189ff3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run training (will take a bit of time for larger datasets)\n",
    "results = my_exp.train()"
   ]
  },
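  {
   "cell_type": "markdown",
   "id": "a7c1e0d5",
   "metadata": {},
   "source": [
    "Conceptually, the two blocks configured above first estimate the conditional expectation of `Y` given `X` and then cluster the samples by that estimate. The sketch below mimics this idea with scikit-learn stand-ins (`KNeighborsRegressor` in place of the neural network, and synthetic data). It is a rough illustration under that reading, not how the `cfl` package is implemented.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.cluster import KMeans\n",
    "from sklearn.neighbors import KNeighborsRegressor\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "\n",
    "# Synthetic stand-in data: 2-D locations, response depends on the first coordinate\n",
    "X_demo = rng.uniform(-1, 1, size=(500, 2))\n",
    "Y_demo = np.abs(X_demo[:, [0]]) + rng.normal(0, 0.05, size=(500, 1))\n",
    "\n",
    "# Step 1: estimate E[Y | X] with a regressor\n",
    "cde = KNeighborsRegressor(n_neighbors=15).fit(X_demo, Y_demo)\n",
    "cond_exp = cde.predict(X_demo)  # shape (500, 1)\n",
    "\n",
    "# Step 2: cluster the samples by their predicted conditional expectation\n",
    "km = KMeans(n_clusters=4, n_init=10, random_state=0)\n",
    "labels = km.fit_predict(cond_exp)\n",
    "print(np.bincount(labels))\n",
    "```\n",
    "\n",
    "Points that land in the same cluster have similar predicted responses even when they are far apart in `X`, which is what lets the clusters trace contour-like bands."
   ]
  },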
  {
   "cell_type": "markdown",
   "id": "d9c9fcde",
   "metadata": {},
   "source": [
    "### Visualizing Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "37c5fae7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of unique xlbls (should match n_clusters): 10\n"
     ]
    }
   ],
   "source": [
    "xlbls = results['CauseClusterer']['x_lbls']\n",
    "print(f'Number of unique xlbls (should match n_clusters): {len(set(xlbls))}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8594f88b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot the clusters as a contour map\n",
    "_ = reconstruct_contour(train_data, xlbls, n_clusters, title=None, grey_back=True)"
   ]
  },
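  {
   "cell_type": "markdown",
   "id": "a7c1e0d6",
   "metadata": {},
   "source": [
    "`reconstruct_contour` comes from the helper module and is not shown here. As a rough idea of how per-point labels can be laid out on a map, the sketch below pivots made-up points and labels into a latitude-by-longitude grid. It assumes the points lie on a regular grid and is not the helper's implementation.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Made-up grid points; the cell at (-37.0, 147.0) is deliberately missing\n",
    "points = pd.DataFrame({\n",
    "    'lat':  [-35.0, -35.0, -36.0, -36.0, -37.0],\n",
    "    'long': [146.0, 147.0, 146.0, 147.0, 146.0],\n",
    "})\n",
    "demo_lbls = np.array([0, 1, 1, 2, 2])  # hypothetical cluster labels\n",
    "\n",
    "grid = (points.assign(cluster=demo_lbls)\n",
    "              .pivot(index='lat', columns='long', values='cluster')\n",
    "              .sort_index(ascending=False))  # northernmost row first\n",
    "print(grid)\n",
    "```\n",
    "\n",
    "Missing grid cells come out as NaN, so the resulting array can be passed to an image or contour plot with those cells left blank."
   ]
  },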
  {
   "cell_type": "markdown",
   "id": "92e3af62",
   "metadata": {},
   "source": [
    "### Finding Optimal Cluster Number with a Test Set"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bee33190",
   "metadata": {},
   "source": [
    "Note that choosing a prime test set resolution (in km), such as 13km, makes the most sense, because it keeps test points from coinciding with training points."
   ]
  },
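  {
   "cell_type": "markdown",
   "id": "a7c1e0d7",
   "metadata": {},
   "source": [
    "To see why, assume (as a simplification) that both grids are axis-aligned and share an origin. Along one axis, sample positions then coincide at every least common multiple of the two spacings. The helper name `coincide_every_km` is hypothetical.\n",
    "\n",
    "```python\n",
    "from math import gcd\n",
    "\n",
    "def coincide_every_km(train_km, test_km):\n",
    "    # Least common multiple: the spacing at which both grids share a point\n",
    "    return train_km * test_km // gcd(train_km, test_km)\n",
    "\n",
    "for test_km in (10, 13):\n",
    "    print(test_km, coincide_every_km(40, test_km))\n",
    "```\n",
    "\n",
    "Under that assumption, a 10km test grid lands on every point of a 40km training grid, while a 13km test grid coincides with it only every 520km."
   ]
  },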
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "da757cb3",
   "metadata": {},
   "outputs": [],
   "source": [
    "folder = '../../../data/altitude' # replace with your data folder path\n",
    "train_resolution = '40km' # replace with desired train set resolution + 'km'\n",
    "test_resolution = '13km' # replace with desired test set resolution + 'km'\n",
    "train_file = folder + f'/{train_resolution}_data.csv'\n",
    "test_file = folder + f'/{test_resolution}_test.csv'\n",
    "\n",
    "train_data = pd.read_csv(train_file)\n",
    "trainX, trainY = np.array(train_data[['lat', 'long']]), np.array(train_data['generated_temp']).reshape(-1,1)\n",
    "# fit the scalers on the training data only, then reuse them for the test data\n",
    "x_scaler = StandardScaler().fit(trainX)\n",
    "y_scaler = StandardScaler().fit(trainY)\n",
    "trainX = x_scaler.transform(trainX)\n",
    "trainY = y_scaler.transform(trainY)\n",
    "\n",
    "test_data = pd.read_csv(test_file)\n",
    "del test_data['elevation']\n",
    "testX, testY = np.array(test_data[['lat', 'long']]), np.array(test_data['generated_temp']).reshape(-1,1)\n",
    "testX = x_scaler.transform(testX)\n",
    "testY = y_scaler.transform(testY)\n",
    "\n",
    "print('Training data shape:', trainX.shape, trainY.shape)\n",
    "print('Test data shape:', testX.shape, testY.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "cd7e1a67",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create 2 dictionaries: one for data info and one with CDE parameters (the cluster parameters are varied in the loop below)\n",
    "\n",
    "data_info = {'X_dims' : trainX.shape,\n",
    "             'Y_dims' : trainY.shape,\n",
    "             'Y_type' : 'continuous' #options: 'categorical' or 'continuous'\n",
    "            }\n",
    "\n",
    "CDE_params = {  'model'        : 'CondExpMod',\n",
    "                'model_params' : {\n",
    "                    # model architecture\n",
    "                    'dense_units' : [50, data_info['Y_dims'][1]],\n",
    "                    'activations' : ['relu', 'linear'],\n",
    "                    'dropouts'    : [0, 0],\n",
    "                    # training parameters\n",
    "                    # smaller batch size since datasets may be smaller when testing\n",
    "                    'batch_size'  : 64,\n",
    "                    'n_epochs'    : 2500,\n",
    "                    'optimizer'   : 'adam',\n",
    "                    # smaller learning rate to compensate for smaller batch size\n",
    "                    'opt_config'  : {'lr' : 1e-4},\n",
    "                    'loss'        : 'mean_squared_error',\n",
    "                    'best'        : True,\n",
    "                    'early_stopping' : True,\n",
    "                    # verbosity\n",
    "                    'verbose'     : 0, # don't log or show plot for checking clusters vs accuracy\n",
    "                    'show_plot'   : False,\n",
    "                }\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9b624479",
   "metadata": {},
   "outputs": [],
   "source": [
    "n_clusters = [1, 2, 3, 5, 7, 10, 15, 20, 30, 40, 50] # number of clusters to run CFL algorithm with\n",
    "train_errs = []\n",
    "test_errs = []\n",
    "\n",
    "# for each number of clusters, run the experiment, predict on the test set, and calculate error\n",
    "# this will take a while, especially for larger datasets\n",
    "for n in n_clusters:\n",
    "    cause_cluster_params =  {'model' : 'KMeans',\n",
    "                         'model_params' : {'n_clusters' : n},\n",
    "                         'verbose' : 0\n",
    "    }\n",
    "\n",
    "    block_names = ['CondDensityEstimator', 'CauseClusterer']\n",
    "    block_params = [CDE_params, cause_cluster_params]\n",
    "    results_path = 'alt_extraction_optim_clusters_runs' # directory to save results to\n",
    "\n",
    "    my_exp = Experiment(X_train=trainX,\n",
    "                        Y_train=trainY,\n",
    "                        data_info=data_info,\n",
    "                        block_names=block_names,\n",
    "                        block_params=block_params,\n",
    "                        results_path=results_path)\n",
    "    results = my_exp.train()\n",
    "    xlbls = results['CauseClusterer']['x_lbls']\n",
    "    train_errs.append(by_point_err(train_data, xlbls, train_data, xlbls))\n",
    "\n",
    "    my_exp.add_dataset(X=testX, Y=testY, dataset_name='test_data')\n",
    "    pred_results = my_exp.predict('test_data')\n",
    "    pred_xlbls = pred_results['CauseClusterer']['x_lbls']\n",
    "    test_errs.append(by_point_err(train_data, xlbls, test_data, pred_xlbls))\n",
    "\n",
    "    print(f'train err for {n} clusters: {train_errs[-1]}')\n",
    "    print(f'test err for {n} clusters: {test_errs[-1]}')"
   ]
  },
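  {
   "cell_type": "markdown",
   "id": "a7c1e0d8",
   "metadata": {},
   "source": [
    "`by_point_err` is defined in the helper module and its exact definition is not shown here. One plausible reading, sketched below as an assumption, predicts each point's value as the mean training value of its cluster and reports the mean squared error. `cluster_mean_mse` is a hypothetical name and the numbers are made up.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def cluster_mean_mse(train_vals, train_lbls, eval_vals, eval_lbls):\n",
    "    train_vals = np.asarray(train_vals, dtype=float)\n",
    "    train_lbls = np.asarray(train_lbls)\n",
    "    eval_vals = np.asarray(eval_vals, dtype=float)\n",
    "    # Mean training value for each cluster label\n",
    "    means = {k: train_vals[train_lbls == k].mean() for k in np.unique(train_lbls)}\n",
    "    # Predict each evaluation point by its cluster's training mean\n",
    "    # (assumes every evaluation label also occurs in the training labels)\n",
    "    preds = np.array([means[k] for k in eval_lbls])\n",
    "    return float(np.mean((eval_vals - preds) ** 2))\n",
    "\n",
    "train_vals = [10.0, 12.0, 20.0, 22.0]\n",
    "train_lbls = [0, 0, 1, 1]\n",
    "train_err = cluster_mean_mse(train_vals, train_lbls, train_vals, train_lbls)\n",
    "test_err = cluster_mean_mse(train_vals, train_lbls, [11.0, 25.0], [0, 1])\n",
    "print(train_err, test_err)\n",
    "```\n",
    "\n",
    "With a measure of this kind, training error can only shrink as clusters are split further, so the test curve is the one to watch when choosing the number of clusters."
   ]
  },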
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "185ac526",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize n_clusters vs train/test errors\n",
    "plt.figure(0)\n",
    "plt.plot(n_clusters, train_errs, label='Train')\n",
    "plt.plot(n_clusters, test_errs, label='Test')\n",
    "plt.xlabel('Number of clusters')\n",
    "plt.ylabel('Mean squared error by cluster')\n",
    "plt.legend()\n",
    "_ = plt.title(f'Mean squared error by cluster vs number of clusters - {train_resolution} training set')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5c0a76dd",
   "metadata": {},
   "source": [
    "NOTE: This section on finding the optimal cluster number with a test set can be run with multiple training sets. The test errors can then be saved and plotted to visualize how the optimal cluster number shifts as the resolution changes."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}