Overview
Among machine learning approaches, RFR is a relatively simple algorithm that can be trained on regional datasets too small for deep learning. RFR has been successfully applied to a range of cases in oceanography (e.g. \citet{sharp2022_GOBAIO2, callens2020_Using, tong2019_MultiFeature}), including for oxygen prediction in the Southern Ocean using BGC-Argo float data (\citet{giglio2018_Estimating}). Furthermore, many oceanographic applications highlight RFR as a useful tool for geospatial observations because of its reduced overfitting tendency and ability to handle non-linear relationships between variables \citep{zhou2023_Comparative, sharp2022_Monthly}. Deep learning methods are not necessarily better than simpler algorithms for data that are non-uniformly distributed; where multiple algorithms produce similar regional predictions, simple learners can offer greater stability and interpretability.
The objectives for this notebook are:
Train RFR on data from BGC-Argo ocean autonomous profilers and GO-SHIP bottle data
Validate RFR to select the best model parameters, use cross-validation to assess overfitting
Test RFR for an estimate of prediction error
Predict missing variable on other profiling observations
from IPython.display import ImageImage(filename='../images/rfr_workflow.png')
Prerequisites
This section was inspired by this template of the wonderful The Turing Way Jupyter Book.
Following your overview, tell your reader what concepts, packages, or other background information they’ll need before learning your material. Tie this explicitly with links to other pages here in Foundations or to relevant external resources. Remove this body text, then populate the Markdown table, denoted in this cell with |
vertical brackets, below, and fill out the information following. In this table, lay out prerequisite concepts by explicitly linking to other Foundations material or external resources, or describe generally helpful concepts.
Label the importance of each concept explicitly as helpful/necessary.
Concepts | Importance | Notes |
---|---|---|
Intro to Cartopy | Necessary | |
Understanding of NetCDF | Helpful | Familiarity with metadata structure |
Project management | Helpful |
Time to learn: estimate in minutes. For a rough idea, use 5 mins per subsection, 10 if longer; add these up for a total. Safer to round up and overestimate.
System requirements:
Populate with any system, version, or non-Python software requirements if necessary
Otherwise use the concepts table above and the Imports section below to describe required packages as necessary
If no extra requirements, remove the System requirements point altogether
Imports
Begin your body of content with another ---
divider before continuing into this section, then remove this body text and populate the following code cell with all necessary Python imports up-front:
# Import packagesimport numpy as npimport pandas as pdfrom scipy import interpolateimport scipyimport xarray as xrimport matplotlib.pyplot as pltimport matplotlib.colors as mcolorsfrom datetime import datetime, timedeltaimport requestsimport timeimport osimport urllib3import shutil# import tqdm# Import argopy functionsimport argopyfrom argopy import DataFetcher # This is the class to work with Argo datafrom argopy import ArgoIndex # This is the class to work with Argo indexfrom argopy import ArgoNVSReferenceTables # This is the class to retrieve data from Argo reference tablesfrom argopy import ArgoColors # This is a class with usefull pre-defined colorsfrom argopy.plot import scatter_map, scatter_plot # This is a function to easily make maps argopy.reset_options()argopy.clear_cache()argopy.set_options(cachedir='cache_bgc')# xr.set_options(display_expand_attrs = False)
<argopy.options.set_options at 0x7f5e8c55f400>
# Scikit-learn packagesfrom scipy.stats import kdefrom scipy.stats import iqrfrom sklearn import linear_modelfrom sklearn import preprocessingfrom sklearn import metricsfrom sklearn.model_selection import train_test_splitfrom sklearn.ensemble import RandomForestRegressorfrom sklearn.metrics import mean_squared_errorfrom sklearn.metrics import r2_scorefrom sklearn.metrics import classification_report, confusion_matrixfrom sklearn.metrics import accuracy_scorefrom sklearn.metrics import precision_recall_fscore_support as score# Plotting Packagesimport matplotlibimport matplotlib.pyplot as pltimport matplotlib.patches as patchimport matplotlib.colors as mpcolorsfrom matplotlib.colors import LogNormfrom matplotlib.ticker import FormatStrFormatterimport matplotlib.patheffects as peimport seaborn as snsimport gswfrom cmocean import cm as cmo
Optional Background: Random Forest Regression
Random Forest (RF) regression is a supervised machine learning method that averages the results over a “forest” of \(n\) decision trees for final prediction. Each decision tree determines a series of conditions at each node based on a set of observed variables, referred to as features. For regression, the trees find conditions that best separate the target observations into branches with the least variance.
Formally, observations are sorted into \(j\) regions of predictor space (\(R_j\)) during training. Following the notation in James et al. (2013), each node in the tree considers a split into two new regions based on a condition \(s\) for a predictor \(X_j\):
(2)\[\begin{equation}\begin{split}R_1(j,s) = \{X|X_j < s\} \text{ and } R_2(j,s) = \{X|X_j \ge s\}, \\\end{split}\end{equation}\]
and each region is assigned a prediction value that is the average of all target variable values (\(\hat{y}_{R_j}\) ) within \(R_j\). The condition \(s\) is chosen to minimize the residual sum of squares in the resulting bins:
(3)\[\begin{equation}\begin{split}\sum_{i: \: x_i \in \: R_1(j,s)}{} (y_i-\hat{y}_{R_1})^2 \; + \sum_{i: \: x_i \in \: R_2(j,s)}{} (y_i-\hat{y}_{R_2})^2,\end{split}\end{equation}\]
where \(y_i\) represents each observed value. During testing, validation, and prediction, RF sorts input observations into the regions and assigns them the value \(\hat{y}_{R_j}\). One advantage of RF is that it limits the decision at each node split to a random subset of features. As a result, the trees in Random Forest are less correlated than those in a family of ``bagged’’ trees which consider all possible features at all nodes. The RF method is therefore less prone to overfitting, especially on spatiotemporally biased data (Stock et.al (2022), Sharp et al. 2022).
Determine spatial and temporal bounds of Argo data
We will use Argopy to set up some data to run RFR.As an example, we access 3 years of North Atlantic data in 2019 through 2022.We adapt this code from the example Argopy notebook are documented here.
Our approach downloads synthetic profiles, which are described in the previous notebook, Accessing Argo .
# In the format [minlon, maxlon, minlat, maxlat, mindepth, maxdepth, start_date, end_date]BOX = [-56, -45, 54, 60, 0, 500, '2019-01', '2023-01']idx = ArgoIndex(index_file='bgc-s').load() # Synthetic profilesidx
<argoindex.pandas>Host: https://data-argo.ifremer.frIndex: argo_synthetic-profile_index.txtConvention: argo_synthetic-profile_index (Synthetic-Profile directory file of the Argo GDAC)Loaded: True (314361 records)Searched: False
# Select profile in a space/time domain:index_BOX = [BOX[ii] for ii in [0, 1, 2, 3, 6, 7]] # We don't want the pressure axis BOX limitsidx.search_lat_lon_tim(index_BOX)
<argoindex.pandas>Host: https://data-argo.ifremer.frIndex: argo_synthetic-profile_index.txtConvention: argo_synthetic-profile_index (Synthetic-Profile directory file of the Argo GDAC)Loaded: True (314361 records)Searched: True (1464 matches, 0.4657%)
# Get the list of all parameters for this region:print('Parameters measured in the region: \n' + str(idx.read_params())); print()print ('There are ' + str(len(idx.read_wmo())) + ' floats in the region.')
Parameters measured in the region: ['BBP700', 'CDOM', 'CHLA', 'CHLA_FLUORESCENCE', 'CP660', 'DOWNWELLING_PAR', 'DOWN_IRRADIANCE380', 'DOWN_IRRADIANCE412', 'DOWN_IRRADIANCE490', 'DOXY', 'NITRATE', 'PH_IN_SITU_TOTAL', 'PRES', 'PSAL', 'TEMP']There are 51 floats in the region.
Fetching data using Argopy
The following code fetches the actual data from the BOX
bounds we used earlier. If using the parameters given here, the process takes about 6 minutes.
f = DataFetcher(ds='bgc', mode='expert', params='all', parallel=True, progress=True, cache=False, chunks_maxsize={'time': 30}, )f = f.region(BOX).load()f
Queued
Downloading netcdf from the erddap
Formatting xarray dataset
Callback
Failed or No Data
Success
Succeed in Merging chunks of xarray dataset (100% processed)
Final post-processing of the merged dataset () ...
/home/runner/miniconda3/envs/sklearn-argo-dev/lib/python3.9/site-packages/argopy/xarray.py:70: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`. self._dims = list(xarray_obj.dims.keys())
<datafetcher.erddap>Name: Ifremer erddap Argo BGC data fetcher for a space/time regionAPI: https://erddap.ifremer.fr/erddapDomain: [x=-56.00/-45.00; y=54.00/60.0 ... 00.0; t=2019-01-01/2023-01-01]BGC variables: ['BBP700', 'CDOM', 'CHLA', 'CP660', 'DOWNWELLING_PAR', 'DOWN_IRRADIANCE380', 'DOWN_IRRADIANCE412', 'DOWN_IRRADIANCE490', 'DOXY', 'NITRATE', 'PH_IN_SITU_TOTAL', 'PRES', 'PSAL', 'TEMP']BGC 'must be measured' variables: []Performances: cache=False, parallel=TrueUser mode: expertDataset: bgc
Again, the data is retrieved as an xarray Dataset. Metadata can be returned using the object index
in a DataFrame.
# Check the data structure (xarray.dataset):ds = f.data # Datasetdf = f.index # Dataframe index
We can visualize where our training profiles are using a function included in Argopy.By setting the parameter set_global=True
, we can see the profiles in a global map context.
scatter_map(df, traj=False, set_global=False, legend=False);ax = plt.gca()ax.set_title('Float Profile Locations (by WMO)');
a_param = 'PSAL'reftbl = ArgoNVSReferenceTables().tbl('R03')param_info = reftbl[reftbl['altLabel']==a_param].iloc[0].to_dict()param_info# To make the scatter map, we need to have the data mode available in one DataFrame column# so we need to add a new column with the DATA_MODE of the PARAMETER:df["variables"] = df["parameters"].apply(lambda x: x.split())df["%s_DM" % a_param] = df.apply(lambda x: x['parameter_data_mode'][x['variables'].index(a_param)] if a_param in x['variables'] else '', axis=1)
# Finally plot the map:fig, ax = scatter_map(df, hue="%s_DM" % a_param, cmap="data_mode", markersize=24, markeredgecolor='w', traj_color='gray', legend_title='Data mode')ax.set_title("Data mode for %s (%s)\n%i profiles from the %s\n%i profiles downloaded" % (param_info['prefLabel'], a_param, idx.N_MATCH, idx.convention_title, df.shape[0]));
Choose example parameter to visualize
We can plot the adjusted Salinity over pressure, and color by QC flag.
fig, ax = scatter_plot(ds, a_param + '_ADJUSTED_QC', this_x = a_param + '_ADJUSTED', vmin=0, vmax=9, cmap=ArgoColors('qc').cmap, figsize=(5,5)) #ArgoColors('qc').cmapax.set_title("QC for Adjusted %s [%s]\n'%s' mission" % (param_info['prefLabel'], a_param + '_ADJUSTED', f.mission), fontdict={'weight': 'bold', 'size': 14});
Splitting data before training our model.
During RFR development, observations are separated into different datasets for training, validation, and testing steps (see Workflow schematic above). We note that the observations within a single glider or float profile are highly correlated; since the individual profiles are the ‘‘independent’’ units here rather than the pointwise samples, we keep observations from vertical profiles together during data splitting.
# Prepare dataframe before splittingargodat = ds.argo.point2profile()argodat = argodat.to_dataframe().reset_index()argodat.columns# Run QCqcvars = ['DOXY_ADJUSTED_QC', 'PSAL_ADJUSTED_QC', 'TEMP_ADJUSTED_QC', 'PRES_ADJUSTED_QC', 'TIME_QC']print(str(len(argodat)) + ' obs before QC')X = len(argodat)for var in qcvars: argodat = argodat[(argodat[var] == 1) | (argodat[var] == 2)| (argodat[var] == 8)] #qc flag meanings 1: good data, 2: probably good data, 8: interpolated value print('after ' + var) print('\t' + str(len(argodat)) + ' obs left ' + ' \t' +str(X-len(argodat)) + ' dropped'); X = len(argodat)argodat.describe();
/home/runner/miniconda3/envs/sklearn-argo-dev/lib/python3.9/site-packages/argopy/xarray.py:70: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`. self._dims = list(xarray_obj.dims.keys())
/home/runner/miniconda3/envs/sklearn-argo-dev/lib/python3.9/site-packages/argopy/xarray.py:70: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`. self._dims = list(xarray_obj.dims.keys())
1996520 obs before QCafter DOXY_ADJUSTED_QC272748 obs left 1723772 droppedafter PSAL_ADJUSTED_QC255411 obs left 17337 droppedafter TEMP_ADJUSTED_QC255411 obs left 0 droppedafter PRES_ADJUSTED_QC255411 obs left 0 droppedafter TIME_QC255411 obs left 0 dropped
# Add training variables def datetime2ytd(time, year = 2019): """" Return time in YTD format from datetime format.""" return (time - np.datetime64('2019-01-01'))/np.timedelta64(1, 'D')argodat['YD'] = datetime2ytd(argodat['TIME'])argodat['CT'] = gsw.CT_from_t(argodat['PSAL_ADJUSTED'], argodat['TEMP_ADJUSTED'], argodat['PRES_ADJUSTED'])argodat['SA']= gsw.SA_from_SP(argodat['PSAL_ADJUSTED'],argodat['PRES_ADJUSTED'],argodat['LONGITUDE'],argodat['LATITUDE'])argodat['SIGMA0'] = gsw.sigma0(argodat.SA.values, argodat.CT.values)argodat['SPICE'] = gsw.spiciness0(argodat.SA.values, argodat.CT.values)dvars = ['N_PROF','TIME', 'YD', 'LATITUDE', 'LONGITUDE', 'PRES_ADJUSTED','PSAL_ADJUSTED', 'TEMP_ADJUSTED', 'CT', 'SA', 'SIGMA0', 'SPICE', 'DOXY_ADJUSTED']argo_df = argodat[dvars]
argo_df
N_PROF | TIME | YD | LATITUDE | LONGITUDE | PRES_ADJUSTED | PSAL_ADJUSTED | TEMP_ADJUSTED | CT | SA | SIGMA0 | SPICE | DOXY_ADJUSTED | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 65 | 2019-01-01 06:03:25.000 | 0.252373 | 56.933030 | -52.521630 | 4.760000 | 34.624645 | 3.108200 | 3.108622 | 34.788850 | 27.576438 | -0.005224 | 313.042755 |
1 | 65 | 2019-01-01 06:03:25.000 | 0.252373 | 56.933030 | -52.521630 | 6.280000 | 34.624847 | 3.109900 | 3.110227 | 34.789055 | 27.576450 | -0.004914 | 313.051453 |
2 | 65 | 2019-01-01 06:03:25.000 | 0.252373 | 56.933030 | -52.521630 | 8.300000 | 34.624748 | 3.110500 | 3.110705 | 34.788956 | 27.576328 | -0.004934 | 313.024292 |
4 | 65 | 2019-01-01 06:03:25.000 | 0.252373 | 56.933030 | -52.521630 | 12.200000 | 34.624748 | 3.110800 | 3.110767 | 34.788964 | 27.576328 | -0.004922 | 313.263916 |
8 | 65 | 2019-01-01 06:03:25.000 | 0.252373 | 56.933030 | -52.521630 | 20.259998 | 34.624855 | 3.113100 | 3.112572 | 34.789096 | 27.576264 | -0.004642 | 312.554932 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1994410 | 1250 | 2022-12-30 14:35:51.036 | 1459.608230 | 59.672974 | -50.982277 | 456.649139 | 34.842865 | 3.781540 | 3.747749 | 35.009959 | 27.689356 | 0.219996 | 292.526306 |
1994421 | 1250 | 2022-12-30 14:35:51.036 | 1459.608230 | 59.672974 | -50.982277 | 465.851868 | 34.843231 | 3.770672 | 3.736246 | 35.010335 | 27.690821 | 0.218972 | 292.605347 |
1994432 | 1250 | 2022-12-30 14:35:51.036 | 1459.608230 | 59.672974 | -50.982277 | 475.282532 | 34.843609 | 3.759535 | 3.724458 | 35.010722 | 27.692322 | 0.217925 | 292.711700 |
1994442 | 1250 | 2022-12-30 14:35:51.036 | 1459.608230 | 59.672974 | -50.982277 | 484.937714 | 34.843998 | 3.748132 | 3.712390 | 35.011121 | 27.693860 | 0.216857 | 292.717194 |
1994453 | 1250 | 2022-12-30 14:35:51.036 | 1459.608230 | 59.672974 | -50.982277 | 492.613831 | 34.843433 | 3.731520 | 3.695292 | 35.010561 | 27.695144 | 0.214551 | 293.101227 |
255411 rows × 13 columns
Begin preparing data for training
def split_profiles(floatDF): """ @param floatDF: dataframe with all float data @return: training_data: training data (80% of profiles) with ship added validation_data: validation data (20% of profiles) This method treats a float profile like the smallest discrete "unit" upon which to train and validate the model. Our goal is to predict a water column (profile) rather than a discrete observation, so profiles are kept together while splitting. Note that test data will be all profiles from the SOGOS float when test_split is set as False by default. """ # Create list of unique profile ID's and select random 80% for training profnum = pd.unique(floatDF.N_PROF) np.random.seed(42) training_pnum = np.random.choice(profnum, int(np.floor(0.8*len(profnum))), replace=False) training_data = floatDF[floatDF['N_PROF'].isin(training_pnum)] # Take HALF of remaining profiles that were not in training set, for validation (10% of total) pnum_vt = [x for x in profnum if x not in training_pnum] # remaining profiles after training data removed validation_pnum= np.random.choice(pnum_vt, int(np.floor(0.5*len(pnum_vt))), replace=False) validation_data = floatDF[floatDF['N_PROF'].isin(validation_pnum)] test_pnum = [x for x in pnum_vt if x not in validation_pnum] test_data = floatDF[floatDF['N_PROF'].isin(test_pnum)] return training_data, validation_data, test_data
[training, validation, test] = split_profiles(argo_df)print('# of total profiles: ' + str(len(argo_df.N_PROF.unique())))print('# training observations: \t' + str(len(training)))print('# validation observations: \t' + str(len(validation)))print('# test observations: \t\t' + str(len(test)))
# of total profiles: 1018# training observations: 203102# validation observations: 25786# test observations: 26523
training.head()
N_PROF | TIME | YD | LATITUDE | LONGITUDE | PRES_ADJUSTED | PSAL_ADJUSTED | TEMP_ADJUSTED | CT | SA | SIGMA0 | SPICE | DOXY_ADJUSTED | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 65 | 2019-01-01 06:03:25 | 0.252373 | 56.93303 | -52.52163 | 4.760000 | 34.624645 | 3.1082 | 3.108622 | 34.788850 | 27.576438 | -0.005224 | 313.042755 |
1 | 65 | 2019-01-01 06:03:25 | 0.252373 | 56.93303 | -52.52163 | 6.280000 | 34.624847 | 3.1099 | 3.110227 | 34.789055 | 27.576450 | -0.004914 | 313.051453 |
2 | 65 | 2019-01-01 06:03:25 | 0.252373 | 56.93303 | -52.52163 | 8.300000 | 34.624748 | 3.1105 | 3.110705 | 34.788956 | 27.576328 | -0.004934 | 313.024292 |
4 | 65 | 2019-01-01 06:03:25 | 0.252373 | 56.93303 | -52.52163 | 12.200000 | 34.624748 | 3.1108 | 3.110767 | 34.788964 | 27.576328 | -0.004922 | 313.263916 |
8 | 65 | 2019-01-01 06:03:25 | 0.252373 | 56.93303 | -52.52163 | 20.259998 | 34.624855 | 3.1131 | 3.112572 | 34.789096 | 27.576264 | -0.004642 | 312.554932 |
Set up model versions to try.
To demonstrate, we try our model on three different simple versions. Real models should use more thorough methods to determine the optimal feature list.
var_list = { 'Model_A': ['SPICE', 'SIGMA0'], 'Model_B': ['CT', 'SA', 'PRES_ADJUSTED'], 'Model_C': ["CT", "SA", 'PRES_ADJUSTED', 'YD', 'LATITUDE', 'LONGITUDE'], }model_list = list(var_list.keys())# Create dictionary of featuresallvars = []for k in var_list.keys(): allvars = allvars + var_list[str(k)]allvars = list(set(allvars)) # remove duplicates
pal = sns.color_palette('Set2', n_colors = len(model_list))model_palettes = {k:v for k, v in zip(model_list, pal)}
Main RFR training method
var_predict = 'DOXY_ADJUSTED'def train_RFR(var_list, training, validation, test, ntrees=1000, max_feats = 1/3, min_samples_split=5): """ Main method to train the RF model. Update 04.18.24: Redo bootstrapping. @param: var_list: list of variables to use in the model training: training data unscaled, i.e. original range of values validation: validation data unscaled test: test data unscaled ntrees: 1000 trees by default. @return: Mdl: trained RF model Mdl_MAE: Rescaled mean absolute error for training, validation, and test sets Mdl_IQR: Rescaled IQR for training, validation, and test sets DF_with_error: Dataframe with error metrics for the *TEST* set validation_error: Dataframe with error metrics for the *VALIDATION* set """ Mdl = RandomForestRegressor(ntrees, max_features = max_feats, random_state = 0, bootstrap=True, min_samples_split=min_samples_split) # max_features: use at most X features at each split (m~sqrt(total features)) # Drop NaN's without profid or wmoid cols_na = [col for col in training.columns if col not in ['profid', 'wmoid', 'AOU', 'dist_maxb']] training_nona = training.dropna(axis=0, subset=cols_na) # makes same length as training_scaled validation_nona = validation.dropna(axis=0, subset=cols_na) test_nona = test.dropna(axis=0, subset=cols_na) # Create X Variables for each subset of data. Scale down. X_training = training_nona[var_list].to_numpy() X_validation = validation_nona[var_list].to_numpy() X_test = test_nona[var_list].to_numpy() # Nitrate, the target variable. Y_training = training_nona[var_predict].to_numpy() Y_validation = validation_nona[var_predict].to_numpy() Y_test = test_nona[var_predict].to_numpy() # Train the model Mdl.fit(X_training, Y_training) # Estimate Y_pred_training = Mdl.predict(X_training) Y_pred_validation = Mdl.predict(X_validation) Y_pred_test = Mdl.predict(X_test) # Create dataframe for the test set with depth --> DF_with_error = test_nona.copy(); DF_with_error = DF_with_error.reset_index(drop=True) observed_nitrate = DF_with_error[var_predict].to_numpy() # Save new dataframe with test results DF_with_error['test_prediction'] = Y_pred_test DF_with_error['test_error'] = DF_with_error['test_prediction'] - observed_nitrate DF_with_error['test_relative_error'] = DF_with_error['test_error']/observed_nitrate # Error metrics AE_RF_training = np.abs(Y_pred_training - training_nona[var_predict]) IQR_RF_training = iqr(abs(AE_RF_training)) r2_RF_training = r2_score(training_nona[var_predict], Y_pred_training) # Return validation metrics AE_RF_validation = np.abs(Y_pred_validation - validation_nona[var_predict]) IQR_RF_validation = iqr(abs(AE_RF_validation)) r2_RF_validation = r2_score(validation_nona[var_predict], Y_pred_validation) validation_error = validation_nona.copy() #pd.DataFrame() validation_error['val_error'] = Y_pred_validation - validation_nona[var_predict] validation_error['val_relative_error'] = validation_error['val_error']/validation_nona[var_predict] AE_RF_test = np.abs(Y_pred_test - test_nona[var_predict]) # same as DF_with_error['test_error'] IQR_RF_test = iqr(abs(AE_RF_test)) r2_RF_test = r2_score(test_nona[var_predict], Y_pred_test) Mdl_MAE = [np.nanmedian(abs(AE_RF_training)), np.nanmedian(abs(AE_RF_validation)), np.nanmedian(abs(AE_RF_test))] Mdl_IQR = [IQR_RF_training, IQR_RF_validation, IQR_RF_test] Mdl_r2 = [r2_RF_training, r2_RF_validation, r2_RF_test] return [Mdl, Mdl_MAE, Mdl_IQR, Mdl_r2, DF_with_error, validation_error]
Next, we train the model using our main train_RFR()
function.Depending on your dataset size and model lists, this may take several minutes (here, about 8 minutes).
Note that use a relatively small number of decision tres (n=500) to speed up our example.
# Train the models with chosen lists.# Set parametersmodels = dict.fromkeys(model_list)models_MAE = dict.fromkeys(model_list)models_IQR = dict.fromkeys(model_list)models_r2 = dict.fromkeys(model_list)models_DF_err = dict.fromkeys(model_list)models_val_err = dict.fromkeys(model_list)ntrees=500max_feats = 1/3min_samples_split = 5description = 'ntrees=' + str(ntrees) + ', max_features=' + str(max_feats)print(description + '\n')for k in model_list: [models[k], models_MAE[k], models_IQR[k], models_r2[k], models_DF_err[k], models_val_err[k]] = train_RFR(var_list[k], training, validation, test, ntrees = ntrees, max_feats = max_feats, min_samples_split=min_samples_split)
ntrees=500, max_features=0.3333333333333333
RFR_results = [models, models_MAE, models_IQR, models_r2, models_DF_err, models_val_err]
Feature Importance
RFR returns a measure for each feature called ``feature importance’’ that reflects its predictive value. For each split where a given feature is used to determine the split criterion, the difference in the variance of the pre-split node compared to the two post-split nodes is calculated (and weighted by the number of samples in the pre-split node). These values are summed across all of the relevant splits in a given decision tree; the average of these sums across all trees represents the feature importance of the given variable. The feature importances are computed for all variables used in the model, and are scaled relative to each other so that the sum of all of the final feature importances is 1. The feature importances are often screened during RFR development to help select a final set of features.
allvars = ['SPICE', 'SIGMA0', 'CT', 'SA', 'PRES_ADJUSTED', 'YD', 'LATITUDE', 'LONGITUDE']
# Make total feature importance dictionary for plottingdict = dict.fromkeys(allvars)feat_imps = pd.DataFrame()# Fill in feature importancefor ind, vkey in enumerate(model_list): for var in var_list[vkey]: feat_imps.at[ind, var] = models[vkey].feature_importances_[var_list[vkey].index(var)]# feat_imps = feat_imps.set_index('Mdl').fillna(np.nan)feat_imps['Mdl'] = model_listfeat_imps = feat_imps.set_index('Mdl').fillna(np.nan)feat_imps
SPICE | SIGMA0 | CT | SA | PRES_ADJUSTED | YD | LATITUDE | LONGITUDE | |
---|---|---|---|---|---|---|---|---|
Mdl | ||||||||
Model_A | 0.650909 | 0.349091 | NaN | NaN | NaN | NaN | NaN | NaN |
Model_B | NaN | NaN | 0.410785 | 0.344402 | 0.244813 | NaN | NaN | NaN |
Model_C | NaN | NaN | 0.211573 | 0.216692 | 0.115993 | 0.245552 | 0.112361 | 0.09783 |
We can compare feature importances across models.
pal = sns.color_palette('Set2', n_colors = len(model_list))model_palettes = {k:v for k, v in zip(model_list, pal)}fig = plt.figure(figsize=(5,3), layout='constrained')ax = fig.gca()barwidth=.8modlist = model_listvarslist = allvarsxlabel = ['spice', 'sigma', 'temp', 'salinity', 'pres', 'time', 'lat', 'lon']# Result of plot methodxind = dict.fromkeys(modlist)xind[modlist[0]] = np.arange(len(varslist))*len(modlist)for i in range(len(modlist)-1): xind[modlist[i+1]] = [x + barwidth for x in xind[modlist[i]]]for modtag in modlist: ax.bar(xind[modtag], feat_imps[varslist].loc[modtag].to_numpy(), color = model_palettes[modtag], alpha=0.8, label = modtag, zorder=3)ax.grid(axis='y', zorder=0)ax.legend(loc='upper right')ax.set_title('Feature Importances by Model')ax.set_xticks(xind[modlist[2]])ax.set_xticklabels(varslist)ax.set_xticklabels(xlabel);
Validation Error Analysis
def get_model_metrics(resultlist, mod_list=model_list): """ resultlist: typically list of [models, models_MAE, models_IQR, models_r2, models_DF_err]""" models_MAE=resultlist[1] models_IQR=resultlist[2] models_r2=resultlist[3] model_metrics = pd.DataFrame() model_metrics['validation_MAE'] = [x[1][1].item() for x in models_MAE.items()] model_metrics['validation_IQR'] = [x[1][1].item() for x in models_IQR.items()] model_metrics['test_MAE'] = [x[1][2].item() for x in models_MAE.items()] model_metrics['test_IQR'] = [x[1][2].item() for x in models_IQR.items()] model_metrics['model'] = mod_list model_metrics.set_index('model', inplace=True) return model_metrics# Method to print MAE dictionariesdef print_dict(dict): print("\n".join("{} \t{}".format(k, v) for k, v in dict.items()))def print_sorted(dict): print("\n".join("{} \t{}".format(k, v) for k, v in sorted(dict.items(), key=lambda x:x[1])))
model_metrics = get_model_metrics(RFR_results, mod_list = RFR_results[1].keys()).sort_values(by='validation_MAE')model_metrics.sort_values(by='validation_MAE', ascending=False)[['validation_MAE', 'validation_IQR']]
validation_MAE | validation_IQR | |
---|---|---|
model | ||
Model_A | 4.844838 | 6.930837 |
Model_B | 4.252438 | 6.321586 |
Model_C | 1.908496 | 3.470855 |
# Error printing methodsdef get_95_bounds(data): mean = np.mean(data); std_dev = np.std(data) low = mean - 2 * std_dev high = mean + 2 * std_dev return [low, high]# Method to print errors restricted to depth def print_error_depths(data, var='test_error', pres_lim = [0,500]): data = data[(data.PRES_ADJUSTED > pres_lim[0]) & (data.PRES_ADJUSTED < pres_lim[1])] err = data[var] print('Errors between depths ' + str(pres_lim[0]) + ' to ' + str(pres_lim[1]) + ':') print('median abs error: \t' + str(np.abs(err).median())) print('mean abs error \t\t' + str(np.abs(err).mean())) # Bounds 95 [low, high] = get_95_bounds(err) print('\n95% of errors fall between:') print(str(low.round(5)) + ' to ' + str(high.round(5)) )
print('Model_C') # leaf size: 5print_error_depths(RFR_results[5]['Model_C'], var='val_relative_error', pres_lim = [0,500])
Model_CErrors between depths 0 to 500:median abs error: 0.0064830623181472685mean abs error 0.01049724417272320795% of errors fall between:-0.03439 to 0.02635
We can visualize the distribution of errors between models.
tag = 'Model Validation Errors'var = 'val_relative_error'dat = RFR_results[5]ymax=60fig = plt.figure(figsize=(7,3))ax = fig.gca()for mod in model_list[-4:]: RF = dat[mod][var] ax.hist(RF, bins=40, alpha=0.5, color= model_palettes[mod], label=mod, density=True)ax.vlines(0, ymin=0, ymax=ymax, colors='k', alpha=0.4, linestyles='dashed')ax.set_ylim([0,ymax])plt.legend()ax.set_xlabel('Relative Error %')ax.grid(alpha=0.5, zorder=0)ax.set_xlim([-.1, .1])ax.set_title('Validation Error Distributions by Model')
Text(0.5, 1.0, 'Validation Error Distributions by Model')
Testing on withheld data
This is where you begin your first section of material, loosely tied to your objectives stated up front. Tie together your notebook as a narrative, with interspersed Markdown text, images, and more as necessary,
# some subsection codenew = "helpful information"
Another content subsection
Keep up the good work! A note, try to avoid using code comments as narrative, and instead let them only exist as brief clarifications where necessary.
Your second content section
Here we can move on to our second objective, and we can demonstrate
Subsection to the second section
a quick demonstration
of further and further
header levels
as well \(m = a * t / h\) text! Similarly, you have access to other \(\LaTeX\) equation functionality via MathJax (demo below from link),
(4)\[\begin{align}\dot{x} & = \sigma(y-x) \\\dot{y} & = \rho x - y - xz \\\dot{z} & = -\beta z + xy\end{align}\]
Check out any number of helpful Markdown resources for further customizing your notebooks and the Jupyter docs for Jupyter-specific formatting information. Don’t hesitate to ask questions if you have problems getting it to look just right.
Last Section
If you’re comfortable, and as we briefly used for our embedded logo up top, you can embed raw html into Jupyter Markdown cells (edit to see):
Info
Your relevant information here!
Feel free to copy this around and edit or play around with yourself. Some other admonitions
you can put in:
Success
We got this done after all!
Warning
Be careful!
Danger
Scary stuff be here.
We also suggest checking out Jupyter Book’s brief demonstration on adding cell tags to your cells in Jupyter Notebook, Lab, or manually. Using these cell tags can allow you to customize how your code content is displayed and even demonstrate errors without altogether crashing our loyal army of machines!
Summary
Add one final ---
marking the end of your body of content, and then conclude with a brief single paragraph summarizing at a high level the key pieces that were learned and how they tied to your objectives. Look to reiterate what the most important takeaways were.
What’s next?
Let Jupyter book tie this to the next (sequential) piece of content that people could move on to down below and in the sidebar. However, if this page uniquely enables your reader to tackle other nonsequential concepts throughout this book, or even external content, link to it here!
Resources and references
Finally, be rigorous in your citations and references as necessary. Give credit where credit is due. Also, feel free to link to relevant external material, further reading, documentation, etc. Then you’re done! Give yourself a quick review, a high five, and send us a pull request. A few final notes:
Kernel > Restart Kernel and Run All Cells...
to confirm that your notebook will cleanly run from start to finishKernel > Restart Kernel and Clear All Outputs...
before committing your notebook, our machines will do the heavy liftingTake credit! Provide author contact information if you’d like; if so, consider adding information here at the bottom of your notebook
Give credit! Attribute appropriate authorship for referenced code, information, images, etc.
Only include what you’re legally allowed: no copyright infringement or plagiarism
Thank you for your contribution!