Regression Modeling on Argo using Scikit-learn — Using sklearn with BGC-Argo ocean observations (2024)

Overview

Among machine learning approaches, RFR is a relatively simple algorithm that can be trained on regional datasets too small for deep learning. RFR has been successfully applied to a range of cases in oceanography (e.g. \citet{sharp2022_GOBAIO2, callens2020_Using, tong2019_MultiFeature}), including for oxygen prediction in the Southern Ocean using BGC-Argo float data (\citet{giglio2018_Estimating}). Furthermore, many oceanographic applications highlight RFR as a useful tool for geospatial observations because of its reduced overfitting tendency and ability to handle non-linear relationships between variables \citep{zhou2023_Comparative, sharp2022_Monthly}. Deep learning methods are not necessarily better than simpler algorithms for data that are non-uniformly distributed; where multiple algorithms produce similar regional predictions, simple learners can offer greater stability and interpretability.

The objectives for this notebook are:

Train RFR on data from BGC-Argo ocean autonomous profilers and GO-SHIP bottle data
Validate RFR to select the best model parameters, use cross-validation to assess overfitting
Test RFR for an estimate of prediction error
Predict missing variable on other profiling observations

from IPython.display import ImageImage(filename='../images/rfr_workflow.png')

Prerequisites

This section was inspired by this template of the wonderful The Turing Way Jupyter Book.

Following your overview, tell your reader what concepts, packages, or other background information they’ll need before learning your material. Tie this explicitly with links to other pages here in Foundations or to relevant external resources. Remove this body text, then populate the Markdown table, denoted in this cell with | vertical brackets, below, and fill out the information following. In this table, lay out prerequisite concepts by explicitly linking to other Foundations material or external resources, or describe generally helpful concepts.

Label the importance of each concept explicitly as helpful/necessary.

Concepts	Importance	Notes
Intro to Cartopy	Necessary
Understanding of NetCDF	Helpful	Familiarity with metadata structure
Project management	Helpful

Time to learn: estimate in minutes. For a rough idea, use 5 mins per subsection, 10 if longer; add these up for a total. Safer to round up and overestimate.
System requirements:
- Populate with any system, version, or non-Python software requirements if necessary
- Otherwise use the concepts table above and the Imports section below to describe required packages as necessary
- If no extra requirements, remove the System requirements point altogether

Imports

Begin your body of content with another --- divider before continuing into this section, then remove this body text and populate the following code cell with all necessary Python imports up-front:

# Import packagesimport numpy as npimport pandas as pdfrom scipy import interpolateimport scipyimport xarray as xrimport matplotlib.pyplot as pltimport matplotlib.colors as mcolorsfrom datetime import datetime, timedeltaimport requestsimport timeimport osimport urllib3import shutil# import tqdm# Import argopy functionsimport argopyfrom argopy import DataFetcher # This is the class to work with Argo datafrom argopy import ArgoIndex # This is the class to work with Argo indexfrom argopy import ArgoNVSReferenceTables # This is the class to retrieve data from Argo reference tablesfrom argopy import ArgoColors # This is a class with usefull pre-defined colorsfrom argopy.plot import scatter_map, scatter_plot # This is a function to easily make maps argopy.reset_options()argopy.clear_cache()argopy.set_options(cachedir='cache_bgc')# xr.set_options(display_expand_attrs = False)

<argopy.options.set_options at 0x7f5e8c55f400>

# Scikit-learn packagesfrom scipy.stats import kdefrom scipy.stats import iqrfrom sklearn import linear_modelfrom sklearn import preprocessingfrom sklearn import metricsfrom sklearn.model_selection import train_test_splitfrom sklearn.ensemble import RandomForestRegressorfrom sklearn.metrics import mean_squared_errorfrom sklearn.metrics import r2_scorefrom sklearn.metrics import classification_report, confusion_matrixfrom sklearn.metrics import accuracy_scorefrom sklearn.metrics import precision_recall_fscore_support as score# Plotting Packagesimport matplotlibimport matplotlib.pyplot as pltimport matplotlib.patches as patchimport matplotlib.colors as mpcolorsfrom matplotlib.colors import LogNormfrom matplotlib.ticker import FormatStrFormatterimport matplotlib.patheffects as peimport seaborn as snsimport gswfrom cmocean import cm as cmo

Optional Background: Random Forest Regression

Random Forest (RF) regression is a supervised machine learning method that averages the results over a “forest” of \(n\) decision trees for final prediction. Each decision tree determines a series of conditions at each node based on a set of observed variables, referred to as features. For regression, the trees find conditions that best separate the target observations into branches with the least variance.

Formally, observations are sorted into \(j\) regions of predictor space (\(R_j\)) during training. Following the notation in James et al. (2013), each node in the tree considers a split into two new regions based on a condition \(s\) for a predictor \(X_j\):

(2)\[\begin{equation}\begin{split}R_1(j,s) = \{X|X_j < s\} \text{ and } R_2(j,s) = \{X|X_j \ge s\}, \\\end{split}\end{equation}\]

and each region is assigned a prediction value that is the average of all target variable values (\(\hat{y}_{R_j}\) ) within \(R_j\). The condition \(s\) is chosen to minimize the residual sum of squares in the resulting bins:

(3)\[\begin{equation}\begin{split}\sum_{i: \: x_i \in \: R_1(j,s)}{} (y_i-\hat{y}_{R_1})^2 \; + \sum_{i: \: x_i \in \: R_2(j,s)}{} (y_i-\hat{y}_{R_2})^2,\end{split}\end{equation}\]

where \(y_i\) represents each observed value. During testing, validation, and prediction, RF sorts input observations into the regions and assigns them the value \(\hat{y}_{R_j}\). One advantage of RF is that it limits the decision at each node split to a random subset of features. As a result, the trees in Random Forest are less correlated than those in a family of ``bagged’’ trees which consider all possible features at all nodes. The RF method is therefore less prone to overfitting, especially on spatiotemporally biased data (Stock et.al (2022), Sharp et al. 2022).

Determine spatial and temporal bounds of Argo data

We will use Argopy to set up some data to run RFR.As an example, we access 3 years of North Atlantic data in 2019 through 2022.We adapt this code from the example Argopy notebook are documented here.

Our approach downloads synthetic profiles, which are described in the previous notebook, Accessing Argo .

# In the format [minlon, maxlon, minlat, maxlat, mindepth, maxdepth, start_date, end_date]BOX = [-56, -45, 54, 60, 0, 500, '2019-01', '2023-01']idx = ArgoIndex(index_file='bgc-s').load() # Synthetic profilesidx

<argoindex.pandas>Host: https://data-argo.ifremer.frIndex: argo_synthetic-profile_index.txtConvention: argo_synthetic-profile_index (Synthetic-Profile directory file of the Argo GDAC)Loaded: True (314361 records)Searched: False

# Select profile in a space/time domain:index_BOX = [BOX[ii] for ii in [0, 1, 2, 3, 6, 7]] # We don't want the pressure axis BOX limitsidx.search_lat_lon_tim(index_BOX)

<argoindex.pandas>Host: https://data-argo.ifremer.frIndex: argo_synthetic-profile_index.txtConvention: argo_synthetic-profile_index (Synthetic-Profile directory file of the Argo GDAC)Loaded: True (314361 records)Searched: True (1464 matches, 0.4657%)

# Get the list of all parameters for this region:print('Parameters measured in the region: \n' + str(idx.read_params())); print()print ('There are ' + str(len(idx.read_wmo())) + ' floats in the region.')

Parameters measured in the region: ['BBP700', 'CDOM', 'CHLA', 'CHLA_FLUORESCENCE', 'CP660', 'DOWNWELLING_PAR', 'DOWN_IRRADIANCE380', 'DOWN_IRRADIANCE412', 'DOWN_IRRADIANCE490', 'DOXY', 'NITRATE', 'PH_IN_SITU_TOTAL', 'PRES', 'PSAL', 'TEMP']There are 51 floats in the region.

Fetching data using Argopy

The following code fetches the actual data from the BOX bounds we used earlier. If using the parameters given here, the process takes about 6 minutes.

f = DataFetcher(ds='bgc', mode='expert', params='all', parallel=True, progress=True, cache=False, chunks_maxsize={'time': 30}, )f = f.region(BOX).load()f

Queued

Downloading netcdf from the erddap

Formatting xarray dataset

Callback

Failed or No Data

Success

Succeed in Merging chunks of xarray dataset (100% processed)

Final post-processing of the merged dataset () ...

/home/runner/miniconda3/envs/sklearn-argo-dev/lib/python3.9/site-packages/argopy/xarray.py:70: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`. self._dims = list(xarray_obj.dims.keys())

<datafetcher.erddap>Name: Ifremer erddap Argo BGC data fetcher for a space/time regionAPI: https://erddap.ifremer.fr/erddapDomain: [x=-56.00/-45.00; y=54.00/60.0 ... 00.0; t=2019-01-01/2023-01-01]BGC variables: ['BBP700', 'CDOM', 'CHLA', 'CP660', 'DOWNWELLING_PAR', 'DOWN_IRRADIANCE380', 'DOWN_IRRADIANCE412', 'DOWN_IRRADIANCE490', 'DOXY', 'NITRATE', 'PH_IN_SITU_TOTAL', 'PRES', 'PSAL', 'TEMP']BGC 'must be measured' variables: []Performances: cache=False, parallel=TrueUser mode: expertDataset: bgc

Choose example parameter to visualize

We can plot the adjusted Salinity over pressure, and color by QC flag.

fig, ax = scatter_plot(ds, a_param + '_ADJUSTED_QC', this_x = a_param + '_ADJUSTED', vmin=0, vmax=9, cmap=ArgoColors('qc').cmap, figsize=(5,5)) #ArgoColors('qc').cmapax.set_title("QC for Adjusted %s [%s]\n'%s' mission" % (param_info['prefLabel'], a_param + '_ADJUSTED', f.mission), fontdict={'weight': 'bold', 'size': 14});

Splitting data before training our model.

During RFR development, observations are separated into different datasets for training, validation, and testing steps (see Workflow schematic above). We note that the observations within a single glider or float profile are highly correlated; since the individual profiles are the ‘‘independent’’ units here rather than the pointwise samples, we keep observations from vertical profiles together during data splitting.

# Prepare dataframe before splittingargodat = ds.argo.point2profile()argodat = argodat.to_dataframe().reset_index()argodat.columns# Run QCqcvars = ['DOXY_ADJUSTED_QC', 'PSAL_ADJUSTED_QC', 'TEMP_ADJUSTED_QC', 'PRES_ADJUSTED_QC', 'TIME_QC']print(str(len(argodat)) + ' obs before QC')X = len(argodat)for var in qcvars: argodat = argodat[(argodat[var] == 1) | (argodat[var] == 2)| (argodat[var] == 8)] #qc flag meanings 1: good data, 2: probably good data, 8: interpolated value print('after ' + var) print('\t' + str(len(argodat)) + ' obs left ' + ' \t' +str(X-len(argodat)) + ' dropped'); X = len(argodat)argodat.describe();

/home/runner/miniconda3/envs/sklearn-argo-dev/lib/python3.9/site-packages/argopy/xarray.py:70: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`. self._dims = list(xarray_obj.dims.keys())

/home/runner/miniconda3/envs/sklearn-argo-dev/lib/python3.9/site-packages/argopy/xarray.py:70: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`. self._dims = list(xarray_obj.dims.keys())

1996520 obs before QCafter DOXY_ADJUSTED_QC272748 obs left 1723772 droppedafter PSAL_ADJUSTED_QC255411 obs left 17337 droppedafter TEMP_ADJUSTED_QC255411 obs left 0 droppedafter PRES_ADJUSTED_QC255411 obs left 0 droppedafter TIME_QC255411 obs left 0 dropped

# Add training variables def datetime2ytd(time, year = 2019): """" Return time in YTD format from datetime format.""" return (time - np.datetime64('2019-01-01'))/np.timedelta64(1, 'D')argodat['YD'] = datetime2ytd(argodat['TIME'])argodat['CT'] = gsw.CT_from_t(argodat['PSAL_ADJUSTED'], argodat['TEMP_ADJUSTED'], argodat['PRES_ADJUSTED'])argodat['SA']= gsw.SA_from_SP(argodat['PSAL_ADJUSTED'],argodat['PRES_ADJUSTED'],argodat['LONGITUDE'],argodat['LATITUDE'])argodat['SIGMA0'] = gsw.sigma0(argodat.SA.values, argodat.CT.values)argodat['SPICE'] = gsw.spiciness0(argodat.SA.values, argodat.CT.values)dvars = ['N_PROF','TIME', 'YD', 'LATITUDE', 'LONGITUDE', 'PRES_ADJUSTED','PSAL_ADJUSTED', 'TEMP_ADJUSTED', 'CT', 'SA', 'SIGMA0', 'SPICE', 'DOXY_ADJUSTED']argo_df = argodat[dvars]

argo_df

	N_PROF	TIME	YD	LATITUDE	LONGITUDE	PRES_ADJUSTED	PSAL_ADJUSTED	TEMP_ADJUSTED	CT	SA	SIGMA0	SPICE	DOXY_ADJUSTED
0	65	2019-01-01 06:03:25.000	0.252373	56.933030	-52.521630	4.760000	34.624645	3.108200	3.108622	34.788850	27.576438	-0.005224	313.042755
1	65	2019-01-01 06:03:25.000	0.252373	56.933030	-52.521630	6.280000	34.624847	3.109900	3.110227	34.789055	27.576450	-0.004914	313.051453
2	65	2019-01-01 06:03:25.000	0.252373	56.933030	-52.521630	8.300000	34.624748	3.110500	3.110705	34.788956	27.576328	-0.004934	313.024292
4	65	2019-01-01 06:03:25.000	0.252373	56.933030	-52.521630	12.200000	34.624748	3.110800	3.110767	34.788964	27.576328	-0.004922	313.263916
8	65	2019-01-01 06:03:25.000	0.252373	56.933030	-52.521630	20.259998	34.624855	3.113100	3.112572	34.789096	27.576264	-0.004642	312.554932
...	...	...	...	...	...	...	...	...	...	...	...	...	...
1994410	1250	2022-12-30 14:35:51.036	1459.608230	59.672974	-50.982277	456.649139	34.842865	3.781540	3.747749	35.009959	27.689356	0.219996	292.526306
1994421	1250	2022-12-30 14:35:51.036	1459.608230	59.672974	-50.982277	465.851868	34.843231	3.770672	3.736246	35.010335	27.690821	0.218972	292.605347
1994432	1250	2022-12-30 14:35:51.036	1459.608230	59.672974	-50.982277	475.282532	34.843609	3.759535	3.724458	35.010722	27.692322	0.217925	292.711700
1994442	1250	2022-12-30 14:35:51.036	1459.608230	59.672974	-50.982277	484.937714	34.843998	3.748132	3.712390	35.011121	27.693860	0.216857	292.717194
1994453	1250	2022-12-30 14:35:51.036	1459.608230	59.672974	-50.982277	492.613831	34.843433	3.731520	3.695292	35.010561	27.695144	0.214551	293.101227

255411 rows × 13 columns

Begin preparing data for training

def split_profiles(floatDF): """  @param floatDF: dataframe with all float data @return:  training_data: training data (80% of profiles) with ship added validation_data: validation data (20% of profiles)  This method treats a float profile like the smallest discrete "unit" upon which to train and validate the model. Our goal is to predict a water column (profile) rather than a discrete observation, so profiles are kept together while splitting. Note that test data will be all profiles from the SOGOS float when test_split is set as False by default. """ # Create list of unique profile ID's and select random 80% for training profnum = pd.unique(floatDF.N_PROF) np.random.seed(42) training_pnum = np.random.choice(profnum, int(np.floor(0.8*len(profnum))), replace=False) training_data = floatDF[floatDF['N_PROF'].isin(training_pnum)] # Take HALF of remaining profiles that were not in training set, for validation (10% of total) pnum_vt = [x for x in profnum if x not in training_pnum] # remaining profiles after training data removed validation_pnum= np.random.choice(pnum_vt, int(np.floor(0.5*len(pnum_vt))), replace=False) validation_data = floatDF[floatDF['N_PROF'].isin(validation_pnum)] test_pnum = [x for x in pnum_vt if x not in validation_pnum] test_data = floatDF[floatDF['N_PROF'].isin(test_pnum)] return training_data, validation_data, test_data

[training, validation, test] = split_profiles(argo_df)print('# of total profiles: ' + str(len(argo_df.N_PROF.unique())))print('# training observations: \t' + str(len(training)))print('# validation observations: \t' + str(len(validation)))print('# test observations: \t\t' + str(len(test)))

# of total profiles: 1018# training observations: 203102# validation observations: 25786# test observations: 26523

training.head()

	N_PROF	TIME	YD	LATITUDE	LONGITUDE	PRES_ADJUSTED	PSAL_ADJUSTED	TEMP_ADJUSTED	CT	SA	SIGMA0	SPICE	DOXY_ADJUSTED
0	65	2019-01-01 06:03:25	0.252373	56.93303	-52.52163	4.760000	34.624645	3.1082	3.108622	34.788850	27.576438	-0.005224	313.042755
1	65	2019-01-01 06:03:25	0.252373	56.93303	-52.52163	6.280000	34.624847	3.1099	3.110227	34.789055	27.576450	-0.004914	313.051453
2	65	2019-01-01 06:03:25	0.252373	56.93303	-52.52163	8.300000	34.624748	3.1105	3.110705	34.788956	27.576328	-0.004934	313.024292
4	65	2019-01-01 06:03:25	0.252373	56.93303	-52.52163	12.200000	34.624748	3.1108	3.110767	34.788964	27.576328	-0.004922	313.263916
8	65	2019-01-01 06:03:25	0.252373	56.93303	-52.52163	20.259998	34.624855	3.1131	3.112572	34.789096	27.576264	-0.004642	312.554932

Set up model versions to try.

To demonstrate, we try our model on three different simple versions. Real models should use more thorough methods to determine the optimal feature list.

var_list = { 'Model_A': ['SPICE', 'SIGMA0'], 'Model_B': ['CT', 'SA', 'PRES_ADJUSTED'], 'Model_C': ["CT", "SA", 'PRES_ADJUSTED', 'YD', 'LATITUDE', 'LONGITUDE'], }model_list = list(var_list.keys())# Create dictionary of featuresallvars = []for k in var_list.keys(): allvars = allvars + var_list[str(k)]allvars = list(set(allvars)) # remove duplicates

pal = sns.color_palette('Set2', n_colors = len(model_list))model_palettes = {k:v for k, v in zip(model_list, pal)}

Main RFR training method

var_predict = 'DOXY_ADJUSTED'def train_RFR(var_list, training, validation, test, ntrees=1000, max_feats = 1/3, min_samples_split=5): """  Main method to train the RF model. Update 04.18.24: Redo bootstrapping.  @param:  var_list: list of variables to use in the model training: training data unscaled, i.e. original range of values validation: validation data unscaled test: test data unscaled  ntrees: 1000 trees by default. @return: Mdl: trained RF model Mdl_MAE: Rescaled mean absolute error for training, validation, and test sets Mdl_IQR: Rescaled IQR for training, validation, and test sets DF_with_error: Dataframe with error metrics for the *TEST* set validation_error: Dataframe with error metrics for the *VALIDATION* set """ Mdl = RandomForestRegressor(ntrees, max_features = max_feats, random_state = 0, bootstrap=True, min_samples_split=min_samples_split) # max_features: use at most X features at each split (m~sqrt(total features)) # Drop NaN's without profid or wmoid cols_na = [col for col in training.columns if col not in ['profid', 'wmoid', 'AOU', 'dist_maxb']] training_nona = training.dropna(axis=0, subset=cols_na) # makes same length as training_scaled validation_nona = validation.dropna(axis=0, subset=cols_na) test_nona = test.dropna(axis=0, subset=cols_na) # Create X Variables for each subset of data. Scale down.  X_training = training_nona[var_list].to_numpy() X_validation = validation_nona[var_list].to_numpy() X_test = test_nona[var_list].to_numpy() # Nitrate, the target variable.  Y_training = training_nona[var_predict].to_numpy() Y_validation = validation_nona[var_predict].to_numpy() Y_test = test_nona[var_predict].to_numpy() # Train the model Mdl.fit(X_training, Y_training) # Estimate Y_pred_training = Mdl.predict(X_training) Y_pred_validation = Mdl.predict(X_validation) Y_pred_test = Mdl.predict(X_test) # Create dataframe for the test set with depth -->  DF_with_error = test_nona.copy(); DF_with_error = DF_with_error.reset_index(drop=True) observed_nitrate = DF_with_error[var_predict].to_numpy() # Save new dataframe with test results DF_with_error['test_prediction'] = Y_pred_test DF_with_error['test_error'] = DF_with_error['test_prediction'] - observed_nitrate DF_with_error['test_relative_error'] = DF_with_error['test_error']/observed_nitrate # Error metrics AE_RF_training = np.abs(Y_pred_training - training_nona[var_predict]) IQR_RF_training = iqr(abs(AE_RF_training)) r2_RF_training = r2_score(training_nona[var_predict], Y_pred_training) # Return validation metrics AE_RF_validation = np.abs(Y_pred_validation - validation_nona[var_predict]) IQR_RF_validation = iqr(abs(AE_RF_validation)) r2_RF_validation = r2_score(validation_nona[var_predict], Y_pred_validation) validation_error = validation_nona.copy() #pd.DataFrame() validation_error['val_error'] = Y_pred_validation - validation_nona[var_predict] validation_error['val_relative_error'] = validation_error['val_error']/validation_nona[var_predict] AE_RF_test = np.abs(Y_pred_test - test_nona[var_predict]) # same as DF_with_error['test_error'] IQR_RF_test = iqr(abs(AE_RF_test)) r2_RF_test = r2_score(test_nona[var_predict], Y_pred_test) Mdl_MAE = [np.nanmedian(abs(AE_RF_training)), np.nanmedian(abs(AE_RF_validation)), np.nanmedian(abs(AE_RF_test))] Mdl_IQR = [IQR_RF_training, IQR_RF_validation, IQR_RF_test] Mdl_r2 = [r2_RF_training, r2_RF_validation, r2_RF_test] return [Mdl, Mdl_MAE, Mdl_IQR, Mdl_r2, DF_with_error, validation_error]

Next, we train the model using our main train_RFR() function.Depending on your dataset size and model lists, this may take several minutes (here, about 8 minutes).

Note that use a relatively small number of decision tres (n=500) to speed up our example.

# Train the models with chosen lists.# Set parametersmodels = dict.fromkeys(model_list)models_MAE = dict.fromkeys(model_list)models_IQR = dict.fromkeys(model_list)models_r2 = dict.fromkeys(model_list)models_DF_err = dict.fromkeys(model_list)models_val_err = dict.fromkeys(model_list)ntrees=500max_feats = 1/3min_samples_split = 5description = 'ntrees=' + str(ntrees) + ', max_features=' + str(max_feats)print(description + '\n')for k in model_list: [models[k], models_MAE[k], models_IQR[k], models_r2[k], models_DF_err[k], models_val_err[k]] = train_RFR(var_list[k], training, validation, test, ntrees = ntrees, max_feats = max_feats, min_samples_split=min_samples_split)

ntrees=500, max_features=0.3333333333333333

RFR_results = [models, models_MAE, models_IQR, models_r2, models_DF_err, models_val_err]

Feature Importance

RFR returns a measure for each feature called ``feature importance’’ that reflects its predictive value. For each split where a given feature is used to determine the split criterion, the difference in the variance of the pre-split node compared to the two post-split nodes is calculated (and weighted by the number of samples in the pre-split node). These values are summed across all of the relevant splits in a given decision tree; the average of these sums across all trees represents the feature importance of the given variable. The feature importances are computed for all variables used in the model, and are scaled relative to each other so that the sum of all of the final feature importances is 1. The feature importances are often screened during RFR development to help select a final set of features.

allvars = ['SPICE', 'SIGMA0', 'CT', 'SA', 'PRES_ADJUSTED', 'YD', 'LATITUDE', 'LONGITUDE']

# Make total feature importance dictionary for plottingdict = dict.fromkeys(allvars)feat_imps = pd.DataFrame()# Fill in feature importancefor ind, vkey in enumerate(model_list): for var in var_list[vkey]: feat_imps.at[ind, var] = models[vkey].feature_importances_[var_list[vkey].index(var)]# feat_imps = feat_imps.set_index('Mdl').fillna(np.nan)feat_imps['Mdl'] = model_listfeat_imps = feat_imps.set_index('Mdl').fillna(np.nan)feat_imps

	SPICE	SIGMA0	CT	SA	PRES_ADJUSTED	YD	LATITUDE	LONGITUDE
Mdl
Model_A	0.650909	0.349091	NaN	NaN	NaN	NaN	NaN	NaN
Model_B	NaN	NaN	0.410785	0.344402	0.244813	NaN	NaN	NaN
Model_C	NaN	NaN	0.211573	0.216692	0.115993	0.245552	0.112361	0.09783

We can compare feature importances across models.

pal = sns.color_palette('Set2', n_colors = len(model_list))model_palettes = {k:v for k, v in zip(model_list, pal)}fig = plt.figure(figsize=(5,3), layout='constrained')ax = fig.gca()barwidth=.8modlist = model_listvarslist = allvarsxlabel = ['spice', 'sigma', 'temp', 'salinity', 'pres', 'time', 'lat', 'lon']# Result of plot methodxind = dict.fromkeys(modlist)xind[modlist[0]] = np.arange(len(varslist))*len(modlist)for i in range(len(modlist)-1): xind[modlist[i+1]] = [x + barwidth for x in xind[modlist[i]]]for modtag in modlist: ax.bar(xind[modtag], feat_imps[varslist].loc[modtag].to_numpy(), color = model_palettes[modtag], alpha=0.8, label = modtag, zorder=3)ax.grid(axis='y', zorder=0)ax.legend(loc='upper right')ax.set_title('Feature Importances by Model')ax.set_xticks(xind[modlist[2]])ax.set_xticklabels(varslist)ax.set_xticklabels(xlabel);

Validation Error Analysis

def get_model_metrics(resultlist, mod_list=model_list): """  resultlist: typically list of [models, models_MAE, models_IQR, models_r2, models_DF_err]""" models_MAE=resultlist[1] models_IQR=resultlist[2] models_r2=resultlist[3] model_metrics = pd.DataFrame() model_metrics['validation_MAE'] = [x[1][1].item() for x in models_MAE.items()] model_metrics['validation_IQR'] = [x[1][1].item() for x in models_IQR.items()] model_metrics['test_MAE'] = [x[1][2].item() for x in models_MAE.items()] model_metrics['test_IQR'] = [x[1][2].item() for x in models_IQR.items()] model_metrics['model'] = mod_list model_metrics.set_index('model', inplace=True) return model_metrics# Method to print MAE dictionariesdef print_dict(dict): print("\n".join("{} \t{}".format(k, v) for k, v in dict.items()))def print_sorted(dict): print("\n".join("{} \t{}".format(k, v) for k, v in sorted(dict.items(), key=lambda x:x[1])))

model_metrics = get_model_metrics(RFR_results, mod_list = RFR_results[1].keys()).sort_values(by='validation_MAE')model_metrics.sort_values(by='validation_MAE', ascending=False)[['validation_MAE', 'validation_IQR']]

	validation_MAE	validation_IQR
model
Model_A	4.844838	6.930837
Model_B	4.252438	6.321586
Model_C	1.908496	3.470855

# Error printing methodsdef get_95_bounds(data): mean = np.mean(data); std_dev = np.std(data) low = mean - 2 * std_dev high = mean + 2 * std_dev return [low, high]# Method to print errors restricted to depth def print_error_depths(data, var='test_error', pres_lim = [0,500]): data = data[(data.PRES_ADJUSTED > pres_lim[0]) & (data.PRES_ADJUSTED < pres_lim[1])] err = data[var] print('Errors between depths ' + str(pres_lim[0]) + ' to ' + str(pres_lim[1]) + ':') print('median abs error: \t' + str(np.abs(err).median())) print('mean abs error \t\t' + str(np.abs(err).mean())) # Bounds 95 [low, high] = get_95_bounds(err) print('\n95% of errors fall between:') print(str(low.round(5)) + ' to ' + str(high.round(5)) )

print('Model_C') # leaf size: 5print_error_depths(RFR_results[5]['Model_C'], var='val_relative_error', pres_lim = [0,500])

Model_CErrors between depths 0 to 500:median abs error: 0.0064830623181472685mean abs error 0.01049724417272320795% of errors fall between:-0.03439 to 0.02635

We can visualize the distribution of errors between models.

tag = 'Model Validation Errors'var = 'val_relative_error'dat = RFR_results[5]ymax=60fig = plt.figure(figsize=(7,3))ax = fig.gca()for mod in model_list[-4:]: RF = dat[mod][var] ax.hist(RF, bins=40, alpha=0.5, color= model_palettes[mod], label=mod, density=True)ax.vlines(0, ymin=0, ymax=ymax, colors='k', alpha=0.4, linestyles='dashed')ax.set_ylim([0,ymax])plt.legend()ax.set_xlabel('Relative Error %')ax.grid(alpha=0.5, zorder=0)ax.set_xlim([-.1, .1])ax.set_title('Validation Error Distributions by Model')

Text(0.5, 1.0, 'Validation Error Distributions by Model')

Testing on withheld data

This is where you begin your first section of material, loosely tied to your objectives stated up front. Tie together your notebook as a narrative, with interspersed Markdown text, images, and more as necessary,

# some subsection codenew = "helpful information"

Another content subsection

Keep up the good work! A note, try to avoid using code comments as narrative, and instead let them only exist as brief clarifications where necessary.

Your second content section

Here we can move on to our second objective, and we can demonstrate

Subsection to the second section

a quick demonstration

of further and further

header levels

as well \(m = a * t / h\) text! Similarly, you have access to other \(\LaTeX\) equation functionality via MathJax (demo below from link),

(4)\[\begin{align}\dot{x} & = \sigma(y-x) \\\dot{y} & = \rho x - y - xz \\\dot{z} & = -\beta z + xy\end{align}\]

Check out any number of helpful Markdown resources for further customizing your notebooks and the Jupyter docs for Jupyter-specific formatting information. Don’t hesitate to ask questions if you have problems getting it to look just right.

Last Section

If you’re comfortable, and as we briefly used for our embedded logo up top, you can embed raw html into Jupyter Markdown cells (edit to see):

Info

Your relevant information here!

Feel free to copy this around and edit or play around with yourself. Some other admonitions you can put in:

Success

We got this done after all!

Warning

Be careful!

Danger

Scary stuff be here.

We also suggest checking out Jupyter Book’s brief demonstration on adding cell tags to your cells in Jupyter Notebook, Lab, or manually. Using these cell tags can allow you to customize how your code content is displayed and even demonstrate errors without altogether crashing our loyal army of machines!

Summary

Add one final --- marking the end of your body of content, and then conclude with a brief single paragraph summarizing at a high level the key pieces that were learned and how they tied to your objectives. Look to reiterate what the most important takeaways were.

What’s next?

Let Jupyter book tie this to the next (sequential) piece of content that people could move on to down below and in the sidebar. However, if this page uniquely enables your reader to tackle other nonsequential concepts throughout this book, or even external content, link to it here!

Resources and references

Finally, be rigorous in your citations and references as necessary. Give credit where credit is due. Also, feel free to link to relevant external material, further reading, documentation, etc. Then you’re done! Give yourself a quick review, a high five, and send us a pull request. A few final notes:

Kernel > Restart Kernel and Run All Cells... to confirm that your notebook will cleanly run from start to finish
Kernel > Restart Kernel and Clear All Outputs... before committing your notebook, our machines will do the heavy lifting
Take credit! Provide author contact information if you’d like; if so, consider adding information here at the bottom of your notebook
Give credit! Attribute appropriate authorship for referenced code, information, images, etc.
Only include what you’re legally allowed: no copyright infringement or plagiarism

Thank you for your contribution!