Regression Modeling on Argo using Scikit-learn — Using sklearn with BGC-Argo ocean observations (2024)

Regression Modeling on Argo using Scikit-learn — Using sklearn with BGC-Argo ocean observations (1)
Regression Modeling on Argo using Scikit-learn — Using sklearn with BGC-Argo ocean observations (2)

Overview

Among machine learning approaches, RFR is a relatively simple algorithm that can be trained on regional datasets too small for deep learning. RFR has been successfully applied to a range of cases in oceanography (e.g. \citet{sharp2022_GOBAIO2, callens2020_Using, tong2019_MultiFeature}), including for oxygen prediction in the Southern Ocean using BGC-Argo float data (\citet{giglio2018_Estimating}). Furthermore, many oceanographic applications highlight RFR as a useful tool for geospatial observations because of its reduced overfitting tendency and ability to handle non-linear relationships between variables \citep{zhou2023_Comparative, sharp2022_Monthly}. Deep learning methods are not necessarily better than simpler algorithms for data that are non-uniformly distributed; where multiple algorithms produce similar regional predictions, simple learners can offer greater stability and interpretability.

The objectives for this notebook are:

  • Train RFR on data from BGC-Argo ocean autonomous profilers and GO-SHIP bottle data

  • Validate RFR to select the best model parameters, use cross-validation to assess overfitting

  • Test RFR for an estimate of prediction error

  • Predict missing variable on other profiling observations

from IPython.display import ImageImage(filename='../images/rfr_workflow.png') 

Regression Modeling on Argo using Scikit-learn — Using sklearn with BGC-Argo ocean observations (3)

Prerequisites

This section was inspired by this template of the wonderful The Turing Way Jupyter Book.

Following your overview, tell your reader what concepts, packages, or other background information they’ll need before learning your material. Tie this explicitly with links to other pages here in Foundations or to relevant external resources. Remove this body text, then populate the Markdown table, denoted in this cell with | vertical brackets, below, and fill out the information following. In this table, lay out prerequisite concepts by explicitly linking to other Foundations material or external resources, or describe generally helpful concepts.

Label the importance of each concept explicitly as helpful/necessary.

Concepts

Importance

Notes

Intro to Cartopy

Necessary

Understanding of NetCDF

Helpful

Familiarity with metadata structure

Project management

Helpful

  • Time to learn: estimate in minutes. For a rough idea, use 5 mins per subsection, 10 if longer; add these up for a total. Safer to round up and overestimate.

  • System requirements:

    • Populate with any system, version, or non-Python software requirements if necessary

    • Otherwise use the concepts table above and the Imports section below to describe required packages as necessary

    • If no extra requirements, remove the System requirements point altogether

Imports

Begin your body of content with another --- divider before continuing into this section, then remove this body text and populate the following code cell with all necessary Python imports up-front:

# Import packagesimport numpy as npimport pandas as pdfrom scipy import interpolateimport scipyimport xarray as xrimport matplotlib.pyplot as pltimport matplotlib.colors as mcolorsfrom datetime import datetime, timedeltaimport requestsimport timeimport osimport urllib3import shutil# import tqdm# Import argopy functionsimport argopyfrom argopy import DataFetcher # This is the class to work with Argo datafrom argopy import ArgoIndex # This is the class to work with Argo indexfrom argopy import ArgoNVSReferenceTables # This is the class to retrieve data from Argo reference tablesfrom argopy import ArgoColors # This is a class with usefull pre-defined colorsfrom argopy.plot import scatter_map, scatter_plot # This is a function to easily make maps argopy.reset_options()argopy.clear_cache()argopy.set_options(cachedir='cache_bgc')# xr.set_options(display_expand_attrs = False)
<argopy.options.set_options at 0x7f5e8c55f400>
# Scikit-learn packagesfrom scipy.stats import kdefrom scipy.stats import iqrfrom sklearn import linear_modelfrom sklearn import preprocessingfrom sklearn import metricsfrom sklearn.model_selection import train_test_splitfrom sklearn.ensemble import RandomForestRegressorfrom sklearn.metrics import mean_squared_errorfrom sklearn.metrics import r2_scorefrom sklearn.metrics import classification_report, confusion_matrixfrom sklearn.metrics import accuracy_scorefrom sklearn.metrics import precision_recall_fscore_support as score# Plotting Packagesimport matplotlibimport matplotlib.pyplot as pltimport matplotlib.patches as patchimport matplotlib.colors as mpcolorsfrom matplotlib.colors import LogNormfrom matplotlib.ticker import FormatStrFormatterimport matplotlib.patheffects as peimport seaborn as snsimport gswfrom cmocean import cm as cmo

Optional Background: Random Forest Regression

Random Forest (RF) regression is a supervised machine learning method that averages the results over a “forest” of \(n\) decision trees for final prediction. Each decision tree determines a series of conditions at each node based on a set of observed variables, referred to as features. For regression, the trees find conditions that best separate the target observations into branches with the least variance.

Formally, observations are sorted into \(j\) regions of predictor space (\(R_j\)) during training. Following the notation in James et al. (2013), each node in the tree considers a split into two new regions based on a condition \(s\) for a predictor \(X_j\):

(2)\[\begin{equation}\begin{split}R_1(j,s) = \{X|X_j < s\} \text{ and } R_2(j,s) = \{X|X_j \ge s\}, \\\end{split}\end{equation}\]

and each region is assigned a prediction value that is the average of all target variable values (\(\hat{y}_{R_j}\) ) within \(R_j\). The condition \(s\) is chosen to minimize the residual sum of squares in the resulting bins:

(3)\[\begin{equation}\begin{split}\sum_{i: \: x_i \in \: R_1(j,s)}{} (y_i-\hat{y}_{R_1})^2 \; + \sum_{i: \: x_i \in \: R_2(j,s)}{} (y_i-\hat{y}_{R_2})^2,\end{split}\end{equation}\]

where \(y_i\) represents each observed value. During testing, validation, and prediction, RF sorts input observations into the regions and assigns them the value \(\hat{y}_{R_j}\). One advantage of RF is that it limits the decision at each node split to a random subset of features. As a result, the trees in Random Forest are less correlated than those in a family of ``bagged’’ trees which consider all possible features at all nodes. The RF method is therefore less prone to overfitting, especially on spatiotemporally biased data (Stock et.al (2022), Sharp et al. 2022).

Determine spatial and temporal bounds of Argo data

We will use Argopy to set up some data to run RFR.As an example, we access 3 years of North Atlantic data in 2019 through 2022.We adapt this code from the example Argopy notebook are documented here.

Our approach downloads synthetic profiles, which are described in the previous notebook, Accessing Argo .

# In the format [minlon, maxlon, minlat, maxlat, mindepth, maxdepth, start_date, end_date]BOX = [-56, -45, 54, 60, 0, 500, '2019-01', '2023-01']idx = ArgoIndex(index_file='bgc-s').load() # Synthetic profilesidx
<argoindex.pandas>Host: https://data-argo.ifremer.frIndex: argo_synthetic-profile_index.txtConvention: argo_synthetic-profile_index (Synthetic-Profile directory file of the Argo GDAC)Loaded: True (314361 records)Searched: False
# Select profile in a space/time domain:index_BOX = [BOX[ii] for ii in [0, 1, 2, 3, 6, 7]] # We don't want the pressure axis BOX limitsidx.search_lat_lon_tim(index_BOX)
<argoindex.pandas>Host: https://data-argo.ifremer.frIndex: argo_synthetic-profile_index.txtConvention: argo_synthetic-profile_index (Synthetic-Profile directory file of the Argo GDAC)Loaded: True (314361 records)Searched: True (1464 matches, 0.4657%)
# Get the list of all parameters for this region:print('Parameters measured in the region: \n' + str(idx.read_params())); print()print ('There are ' + str(len(idx.read_wmo())) + ' floats in the region.')
Parameters measured in the region: ['BBP700', 'CDOM', 'CHLA', 'CHLA_FLUORESCENCE', 'CP660', 'DOWNWELLING_PAR', 'DOWN_IRRADIANCE380', 'DOWN_IRRADIANCE412', 'DOWN_IRRADIANCE490', 'DOXY', 'NITRATE', 'PH_IN_SITU_TOTAL', 'PRES', 'PSAL', 'TEMP']There are 51 floats in the region.

Fetching data using Argopy

The following code fetches the actual data from the BOX bounds we used earlier. If using the parameters given here, the process takes about 6 minutes.

f = DataFetcher(ds='bgc', mode='expert', params='all', parallel=True, progress=True, cache=False, chunks_maxsize={'time': 30}, )f = f.region(BOX).load()f

Queued

Downloading netcdf from the erddap

Formatting xarray dataset

Callback

Failed or No Data

Success

Succeed in Merging chunks of xarray dataset (100% processed)

Final post-processing of the merged dataset () ...
/home/runner/miniconda3/envs/sklearn-argo-dev/lib/python3.9/site-packages/argopy/xarray.py:70: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`. self._dims = list(xarray_obj.dims.keys())
<datafetcher.erddap>Name: Ifremer erddap Argo BGC data fetcher for a space/time regionAPI: https://erddap.ifremer.fr/erddapDomain: [x=-56.00/-45.00; y=54.00/60.0 ... 00.0; t=2019-01-01/2023-01-01]BGC variables: ['BBP700', 'CDOM', 'CHLA', 'CP660', 'DOWNWELLING_PAR', 'DOWN_IRRADIANCE380', 'DOWN_IRRADIANCE412', 'DOWN_IRRADIANCE490', 'DOXY', 'NITRATE', 'PH_IN_SITU_TOTAL', 'PRES', 'PSAL', 'TEMP']BGC 'must be measured' variables: []Performances: cache=False, parallel=TrueUser mode: expertDataset: bgc

Again, the data is retrieved as an xarray Dataset. Metadata can be returned using the object index in a DataFrame.

# Check the data structure (xarray.dataset):ds = f.data # Datasetdf = f.index # Dataframe index

We can visualize where our training profiles are using a function included in Argopy.By setting the parameter set_global=True, we can see the profiles in a global map context.

scatter_map(df, traj=False, set_global=False, legend=False);ax = plt.gca()ax.set_title('Float Profile Locations (by WMO)');

Regression Modeling on Argo using Scikit-learn — Using sklearn with BGC-Argo ocean observations (4)

a_param = 'PSAL'reftbl = ArgoNVSReferenceTables().tbl('R03')param_info = reftbl[reftbl['altLabel']==a_param].iloc[0].to_dict()param_info# To make the scatter map, we need to have the data mode available in one DataFrame column# so we need to add a new column with the DATA_MODE of the PARAMETER:df["variables"] = df["parameters"].apply(lambda x: x.split())df["%s_DM" % a_param] = df.apply(lambda x: x['parameter_data_mode'][x['variables'].index(a_param)] if a_param in x['variables'] else '', axis=1)
# Finally plot the map:fig, ax = scatter_map(df, hue="%s_DM" % a_param, cmap="data_mode", markersize=24, markeredgecolor='w', traj_color='gray', legend_title='Data mode')ax.set_title("Data mode for %s (%s)\n%i profiles from the %s\n%i profiles downloaded" % (param_info['prefLabel'], a_param, idx.N_MATCH, idx.convention_title, df.shape[0]));

Regression Modeling on Argo using Scikit-learn — Using sklearn with BGC-Argo ocean observations (5)

Choose example parameter to visualize

We can plot the adjusted Salinity over pressure, and color by QC flag.

fig, ax = scatter_plot(ds, a_param + '_ADJUSTED_QC', this_x = a_param + '_ADJUSTED', vmin=0, vmax=9, cmap=ArgoColors('qc').cmap, figsize=(5,5)) #ArgoColors('qc').cmapax.set_title("QC for Adjusted %s [%s]\n'%s' mission" % (param_info['prefLabel'], a_param + '_ADJUSTED', f.mission), fontdict={'weight': 'bold', 'size': 14});

Regression Modeling on Argo using Scikit-learn — Using sklearn with BGC-Argo ocean observations (6)

Splitting data before training our model.

During RFR development, observations are separated into different datasets for training, validation, and testing steps (see Workflow schematic above). We note that the observations within a single glider or float profile are highly correlated; since the individual profiles are the ‘‘independent’’ units here rather than the pointwise samples, we keep observations from vertical profiles together during data splitting.

# Prepare dataframe before splittingargodat = ds.argo.point2profile()argodat = argodat.to_dataframe().reset_index()argodat.columns# Run QCqcvars = ['DOXY_ADJUSTED_QC', 'PSAL_ADJUSTED_QC', 'TEMP_ADJUSTED_QC', 'PRES_ADJUSTED_QC', 'TIME_QC']print(str(len(argodat)) + ' obs before QC')X = len(argodat)for var in qcvars: argodat = argodat[(argodat[var] == 1) | (argodat[var] == 2)| (argodat[var] == 8)] #qc flag meanings 1: good data, 2: probably good data, 8: interpolated value print('after ' + var) print('\t' + str(len(argodat)) + ' obs left ' + ' \t' +str(X-len(argodat)) + ' dropped'); X = len(argodat)argodat.describe();
/home/runner/miniconda3/envs/sklearn-argo-dev/lib/python3.9/site-packages/argopy/xarray.py:70: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`. self._dims = list(xarray_obj.dims.keys())
/home/runner/miniconda3/envs/sklearn-argo-dev/lib/python3.9/site-packages/argopy/xarray.py:70: FutureWarning: The return type of `Dataset.dims` will be changed to return a set of dimension names in future, in order to be more consistent with `DataArray.dims`. To access a mapping from dimension names to lengths, please use `Dataset.sizes`. self._dims = list(xarray_obj.dims.keys())
1996520 obs before QCafter DOXY_ADJUSTED_QC272748 obs left 1723772 droppedafter PSAL_ADJUSTED_QC255411 obs left 17337 droppedafter TEMP_ADJUSTED_QC255411 obs left 0 droppedafter PRES_ADJUSTED_QC255411 obs left 0 droppedafter TIME_QC255411 obs left 0 dropped
# Add training variables def datetime2ytd(time, year = 2019): """" Return time in YTD format from datetime format.""" return (time - np.datetime64('2019-01-01'))/np.timedelta64(1, 'D')argodat['YD'] = datetime2ytd(argodat['TIME'])argodat['CT'] = gsw.CT_from_t(argodat['PSAL_ADJUSTED'], argodat['TEMP_ADJUSTED'], argodat['PRES_ADJUSTED'])argodat['SA']= gsw.SA_from_SP(argodat['PSAL_ADJUSTED'],argodat['PRES_ADJUSTED'],argodat['LONGITUDE'],argodat['LATITUDE'])argodat['SIGMA0'] = gsw.sigma0(argodat.SA.values, argodat.CT.values)argodat['SPICE'] = gsw.spiciness0(argodat.SA.values, argodat.CT.values)dvars = ['N_PROF','TIME', 'YD', 'LATITUDE', 'LONGITUDE', 'PRES_ADJUSTED','PSAL_ADJUSTED', 'TEMP_ADJUSTED', 'CT', 'SA', 'SIGMA0', 'SPICE', 'DOXY_ADJUSTED']argo_df = argodat[dvars]
argo_df
N_PROF TIME YD LATITUDE LONGITUDE PRES_ADJUSTED PSAL_ADJUSTED TEMP_ADJUSTED CT SA SIGMA0 SPICE DOXY_ADJUSTED
0 65 2019-01-01 06:03:25.000 0.252373 56.933030 -52.521630 4.760000 34.624645 3.108200 3.108622 34.788850 27.576438 -0.005224 313.042755
1 65 2019-01-01 06:03:25.000 0.252373 56.933030 -52.521630 6.280000 34.624847 3.109900 3.110227 34.789055 27.576450 -0.004914 313.051453
2 65 2019-01-01 06:03:25.000 0.252373 56.933030 -52.521630 8.300000 34.624748 3.110500 3.110705 34.788956 27.576328 -0.004934 313.024292
4 65 2019-01-01 06:03:25.000 0.252373 56.933030 -52.521630 12.200000 34.624748 3.110800 3.110767 34.788964 27.576328 -0.004922 313.263916
8 65 2019-01-01 06:03:25.000 0.252373 56.933030 -52.521630 20.259998 34.624855 3.113100 3.112572 34.789096 27.576264 -0.004642 312.554932
... ... ... ... ... ... ... ... ... ... ... ... ... ...
1994410 1250 2022-12-30 14:35:51.036 1459.608230 59.672974 -50.982277 456.649139 34.842865 3.781540 3.747749 35.009959 27.689356 0.219996 292.526306
1994421 1250 2022-12-30 14:35:51.036 1459.608230 59.672974 -50.982277 465.851868 34.843231 3.770672 3.736246 35.010335 27.690821 0.218972 292.605347
1994432 1250 2022-12-30 14:35:51.036 1459.608230 59.672974 -50.982277 475.282532 34.843609 3.759535 3.724458 35.010722 27.692322 0.217925 292.711700
1994442 1250 2022-12-30 14:35:51.036 1459.608230 59.672974 -50.982277 484.937714 34.843998 3.748132 3.712390 35.011121 27.693860 0.216857 292.717194
1994453 1250 2022-12-30 14:35:51.036 1459.608230 59.672974 -50.982277 492.613831 34.843433 3.731520 3.695292 35.010561 27.695144 0.214551 293.101227

255411 rows × 13 columns

Begin preparing data for training

def split_profiles(floatDF): """  @param floatDF: dataframe with all float data @return:  training_data: training data (80% of profiles) with ship added validation_data: validation data (20% of profiles)  This method treats a float profile like the smallest discrete "unit" upon which to train and validate the model. Our goal is to predict a water column (profile) rather than a discrete observation, so profiles are kept together while splitting. Note that test data will be all profiles from the SOGOS float when test_split is set as False by default. """ # Create list of unique profile ID's and select random 80% for training profnum = pd.unique(floatDF.N_PROF) np.random.seed(42) training_pnum = np.random.choice(profnum, int(np.floor(0.8*len(profnum))), replace=False) training_data = floatDF[floatDF['N_PROF'].isin(training_pnum)] # Take HALF of remaining profiles that were not in training set, for validation (10% of total) pnum_vt = [x for x in profnum if x not in training_pnum] # remaining profiles after training data removed validation_pnum= np.random.choice(pnum_vt, int(np.floor(0.5*len(pnum_vt))), replace=False) validation_data = floatDF[floatDF['N_PROF'].isin(validation_pnum)] test_pnum = [x for x in pnum_vt if x not in validation_pnum] test_data = floatDF[floatDF['N_PROF'].isin(test_pnum)] return training_data, validation_data, test_data
[training, validation, test] = split_profiles(argo_df)print('# of total profiles: ' + str(len(argo_df.N_PROF.unique())))print('# training observations: \t' + str(len(training)))print('# validation observations: \t' + str(len(validation)))print('# test observations: \t\t' + str(len(test)))
# of total profiles: 1018# training observations: 203102# validation observations: 25786# test observations: 26523
training.head()
N_PROF TIME YD LATITUDE LONGITUDE PRES_ADJUSTED PSAL_ADJUSTED TEMP_ADJUSTED CT SA SIGMA0 SPICE DOXY_ADJUSTED
0 65 2019-01-01 06:03:25 0.252373 56.93303 -52.52163 4.760000 34.624645 3.1082 3.108622 34.788850 27.576438 -0.005224 313.042755
1 65 2019-01-01 06:03:25 0.252373 56.93303 -52.52163 6.280000 34.624847 3.1099 3.110227 34.789055 27.576450 -0.004914 313.051453
2 65 2019-01-01 06:03:25 0.252373 56.93303 -52.52163 8.300000 34.624748 3.1105 3.110705 34.788956 27.576328 -0.004934 313.024292
4 65 2019-01-01 06:03:25 0.252373 56.93303 -52.52163 12.200000 34.624748 3.1108 3.110767 34.788964 27.576328 -0.004922 313.263916
8 65 2019-01-01 06:03:25 0.252373 56.93303 -52.52163 20.259998 34.624855 3.1131 3.112572 34.789096 27.576264 -0.004642 312.554932

Set up model versions to try.

To demonstrate, we try our model on three different simple versions. Real models should use more thorough methods to determine the optimal feature list.

var_list = { 'Model_A': ['SPICE', 'SIGMA0'], 'Model_B': ['CT', 'SA', 'PRES_ADJUSTED'], 'Model_C': ["CT", "SA", 'PRES_ADJUSTED', 'YD', 'LATITUDE', 'LONGITUDE'], }model_list = list(var_list.keys())# Create dictionary of featuresallvars = []for k in var_list.keys(): allvars = allvars + var_list[str(k)]allvars = list(set(allvars)) # remove duplicates
pal = sns.color_palette('Set2', n_colors = len(model_list))model_palettes = {k:v for k, v in zip(model_list, pal)}

Main RFR training method

var_predict = 'DOXY_ADJUSTED'def train_RFR(var_list, training, validation, test, ntrees=1000, max_feats = 1/3, min_samples_split=5): """  Main method to train the RF model. Update 04.18.24: Redo bootstrapping.  @param:  var_list: list of variables to use in the model training: training data unscaled, i.e. original range of values validation: validation data unscaled test: test data unscaled  ntrees: 1000 trees by default. @return: Mdl: trained RF model Mdl_MAE: Rescaled mean absolute error for training, validation, and test sets Mdl_IQR: Rescaled IQR for training, validation, and test sets DF_with_error: Dataframe with error metrics for the *TEST* set validation_error: Dataframe with error metrics for the *VALIDATION* set """ Mdl = RandomForestRegressor(ntrees, max_features = max_feats, random_state = 0, bootstrap=True, min_samples_split=min_samples_split) # max_features: use at most X features at each split (m~sqrt(total features)) # Drop NaN's without profid or wmoid cols_na = [col for col in training.columns if col not in ['profid', 'wmoid', 'AOU', 'dist_maxb']] training_nona = training.dropna(axis=0, subset=cols_na) # makes same length as training_scaled validation_nona = validation.dropna(axis=0, subset=cols_na) test_nona = test.dropna(axis=0, subset=cols_na) # Create X Variables for each subset of data. Scale down.  X_training = training_nona[var_list].to_numpy() X_validation = validation_nona[var_list].to_numpy() X_test = test_nona[var_list].to_numpy() # Nitrate, the target variable.  Y_training = training_nona[var_predict].to_numpy() Y_validation = validation_nona[var_predict].to_numpy() Y_test = test_nona[var_predict].to_numpy() # Train the model Mdl.fit(X_training, Y_training) # Estimate Y_pred_training = Mdl.predict(X_training) Y_pred_validation = Mdl.predict(X_validation) Y_pred_test = Mdl.predict(X_test) # Create dataframe for the test set with depth -->  DF_with_error = test_nona.copy(); DF_with_error = DF_with_error.reset_index(drop=True) observed_nitrate = DF_with_error[var_predict].to_numpy() # Save new dataframe with test results DF_with_error['test_prediction'] = Y_pred_test DF_with_error['test_error'] = DF_with_error['test_prediction'] - observed_nitrate DF_with_error['test_relative_error'] = DF_with_error['test_error']/observed_nitrate # Error metrics AE_RF_training = np.abs(Y_pred_training - training_nona[var_predict]) IQR_RF_training = iqr(abs(AE_RF_training)) r2_RF_training = r2_score(training_nona[var_predict], Y_pred_training) # Return validation metrics AE_RF_validation = np.abs(Y_pred_validation - validation_nona[var_predict]) IQR_RF_validation = iqr(abs(AE_RF_validation)) r2_RF_validation = r2_score(validation_nona[var_predict], Y_pred_validation) validation_error = validation_nona.copy() #pd.DataFrame() validation_error['val_error'] = Y_pred_validation - validation_nona[var_predict] validation_error['val_relative_error'] = validation_error['val_error']/validation_nona[var_predict] AE_RF_test = np.abs(Y_pred_test - test_nona[var_predict]) # same as DF_with_error['test_error'] IQR_RF_test = iqr(abs(AE_RF_test)) r2_RF_test = r2_score(test_nona[var_predict], Y_pred_test) Mdl_MAE = [np.nanmedian(abs(AE_RF_training)), np.nanmedian(abs(AE_RF_validation)), np.nanmedian(abs(AE_RF_test))] Mdl_IQR = [IQR_RF_training, IQR_RF_validation, IQR_RF_test] Mdl_r2 = [r2_RF_training, r2_RF_validation, r2_RF_test] return [Mdl, Mdl_MAE, Mdl_IQR, Mdl_r2, DF_with_error, validation_error]

Next, we train the model using our main train_RFR() function.Depending on your dataset size and model lists, this may take several minutes (here, about 8 minutes).

Note that use a relatively small number of decision tres (n=500) to speed up our example.

# Train the models with chosen lists.# Set parametersmodels = dict.fromkeys(model_list)models_MAE = dict.fromkeys(model_list)models_IQR = dict.fromkeys(model_list)models_r2 = dict.fromkeys(model_list)models_DF_err = dict.fromkeys(model_list)models_val_err = dict.fromkeys(model_list)ntrees=500max_feats = 1/3min_samples_split = 5description = 'ntrees=' + str(ntrees) + ', max_features=' + str(max_feats)print(description + '\n')for k in model_list: [models[k], models_MAE[k], models_IQR[k], models_r2[k], models_DF_err[k], models_val_err[k]] = train_RFR(var_list[k], training, validation, test, ntrees = ntrees, max_feats = max_feats, min_samples_split=min_samples_split)
ntrees=500, max_features=0.3333333333333333
RFR_results = [models, models_MAE, models_IQR, models_r2, models_DF_err, models_val_err]

Feature Importance

RFR returns a measure for each feature called ``feature importance’’ that reflects its predictive value. For each split where a given feature is used to determine the split criterion, the difference in the variance of the pre-split node compared to the two post-split nodes is calculated (and weighted by the number of samples in the pre-split node). These values are summed across all of the relevant splits in a given decision tree; the average of these sums across all trees represents the feature importance of the given variable. The feature importances are computed for all variables used in the model, and are scaled relative to each other so that the sum of all of the final feature importances is 1. The feature importances are often screened during RFR development to help select a final set of features.

allvars = ['SPICE', 'SIGMA0', 'CT', 'SA', 'PRES_ADJUSTED', 'YD', 'LATITUDE', 'LONGITUDE']
# Make total feature importance dictionary for plottingdict = dict.fromkeys(allvars)feat_imps = pd.DataFrame()# Fill in feature importancefor ind, vkey in enumerate(model_list): for var in var_list[vkey]: feat_imps.at[ind, var] = models[vkey].feature_importances_[var_list[vkey].index(var)]# feat_imps = feat_imps.set_index('Mdl').fillna(np.nan)feat_imps['Mdl'] = model_listfeat_imps = feat_imps.set_index('Mdl').fillna(np.nan)feat_imps
SPICE SIGMA0 CT SA PRES_ADJUSTED YD LATITUDE LONGITUDE
Mdl
Model_A 0.650909 0.349091 NaN NaN NaN NaN NaN NaN
Model_B NaN NaN 0.410785 0.344402 0.244813 NaN NaN NaN
Model_C NaN NaN 0.211573 0.216692 0.115993 0.245552 0.112361 0.09783

We can compare feature importances across models.

pal = sns.color_palette('Set2', n_colors = len(model_list))model_palettes = {k:v for k, v in zip(model_list, pal)}fig = plt.figure(figsize=(5,3), layout='constrained')ax = fig.gca()barwidth=.8modlist = model_listvarslist = allvarsxlabel = ['spice', 'sigma', 'temp', 'salinity', 'pres', 'time', 'lat', 'lon']# Result of plot methodxind = dict.fromkeys(modlist)xind[modlist[0]] = np.arange(len(varslist))*len(modlist)for i in range(len(modlist)-1): xind[modlist[i+1]] = [x + barwidth for x in xind[modlist[i]]]for modtag in modlist: ax.bar(xind[modtag], feat_imps[varslist].loc[modtag].to_numpy(), color = model_palettes[modtag], alpha=0.8, label = modtag, zorder=3)ax.grid(axis='y', zorder=0)ax.legend(loc='upper right')ax.set_title('Feature Importances by Model')ax.set_xticks(xind[modlist[2]])ax.set_xticklabels(varslist)ax.set_xticklabels(xlabel);

Regression Modeling on Argo using Scikit-learn — Using sklearn with BGC-Argo ocean observations (7)

Validation Error Analysis

def get_model_metrics(resultlist, mod_list=model_list): """  resultlist: typically list of [models, models_MAE, models_IQR, models_r2, models_DF_err]""" models_MAE=resultlist[1] models_IQR=resultlist[2] models_r2=resultlist[3] model_metrics = pd.DataFrame() model_metrics['validation_MAE'] = [x[1][1].item() for x in models_MAE.items()] model_metrics['validation_IQR'] = [x[1][1].item() for x in models_IQR.items()] model_metrics['test_MAE'] = [x[1][2].item() for x in models_MAE.items()] model_metrics['test_IQR'] = [x[1][2].item() for x in models_IQR.items()] model_metrics['model'] = mod_list model_metrics.set_index('model', inplace=True) return model_metrics# Method to print MAE dictionariesdef print_dict(dict): print("\n".join("{} \t{}".format(k, v) for k, v in dict.items()))def print_sorted(dict): print("\n".join("{} \t{}".format(k, v) for k, v in sorted(dict.items(), key=lambda x:x[1]))) 
model_metrics = get_model_metrics(RFR_results, mod_list = RFR_results[1].keys()).sort_values(by='validation_MAE')model_metrics.sort_values(by='validation_MAE', ascending=False)[['validation_MAE', 'validation_IQR']]
validation_MAE validation_IQR
model
Model_A 4.844838 6.930837
Model_B 4.252438 6.321586
Model_C 1.908496 3.470855
# Error printing methodsdef get_95_bounds(data): mean = np.mean(data); std_dev = np.std(data) low = mean - 2 * std_dev high = mean + 2 * std_dev return [low, high]# Method to print errors restricted to depth def print_error_depths(data, var='test_error', pres_lim = [0,500]): data = data[(data.PRES_ADJUSTED > pres_lim[0]) & (data.PRES_ADJUSTED < pres_lim[1])] err = data[var] print('Errors between depths ' + str(pres_lim[0]) + ' to ' + str(pres_lim[1]) + ':') print('median abs error: \t' + str(np.abs(err).median())) print('mean abs error \t\t' + str(np.abs(err).mean())) # Bounds 95 [low, high] = get_95_bounds(err) print('\n95% of errors fall between:') print(str(low.round(5)) + ' to ' + str(high.round(5)) )
print('Model_C') # leaf size: 5print_error_depths(RFR_results[5]['Model_C'], var='val_relative_error', pres_lim = [0,500])
Model_CErrors between depths 0 to 500:median abs error: 0.0064830623181472685mean abs error 0.01049724417272320795% of errors fall between:-0.03439 to 0.02635

We can visualize the distribution of errors between models.

tag = 'Model Validation Errors'var = 'val_relative_error'dat = RFR_results[5]ymax=60fig = plt.figure(figsize=(7,3))ax = fig.gca()for mod in model_list[-4:]: RF = dat[mod][var] ax.hist(RF, bins=40, alpha=0.5, color= model_palettes[mod], label=mod, density=True)ax.vlines(0, ymin=0, ymax=ymax, colors='k', alpha=0.4, linestyles='dashed')ax.set_ylim([0,ymax])plt.legend()ax.set_xlabel('Relative Error %')ax.grid(alpha=0.5, zorder=0)ax.set_xlim([-.1, .1])ax.set_title('Validation Error Distributions by Model')
Text(0.5, 1.0, 'Validation Error Distributions by Model')

Regression Modeling on Argo using Scikit-learn — Using sklearn with BGC-Argo ocean observations (8)

Testing on withheld data

This is where you begin your first section of material, loosely tied to your objectives stated up front. Tie together your notebook as a narrative, with interspersed Markdown text, images, and more as necessary,

# some subsection codenew = "helpful information"

Another content subsection

Keep up the good work! A note, try to avoid using code comments as narrative, and instead let them only exist as brief clarifications where necessary.

Your second content section

Here we can move on to our second objective, and we can demonstrate

Subsection to the second section

a quick demonstration

of further and further
header levels

as well \(m = a * t / h\) text! Similarly, you have access to other \(\LaTeX\) equation functionality via MathJax (demo below from link),

(4)\[\begin{align}\dot{x} & = \sigma(y-x) \\\dot{y} & = \rho x - y - xz \\\dot{z} & = -\beta z + xy\end{align}\]

Check out any number of helpful Markdown resources for further customizing your notebooks and the Jupyter docs for Jupyter-specific formatting information. Don’t hesitate to ask questions if you have problems getting it to look just right.

Last Section

If you’re comfortable, and as we briefly used for our embedded logo up top, you can embed raw html into Jupyter Markdown cells (edit to see):

Info

Your relevant information here!

Feel free to copy this around and edit or play around with yourself. Some other admonitions you can put in:

Success

We got this done after all!

Warning

Be careful!

Danger

Scary stuff be here.

We also suggest checking out Jupyter Book’s brief demonstration on adding cell tags to your cells in Jupyter Notebook, Lab, or manually. Using these cell tags can allow you to customize how your code content is displayed and even demonstrate errors without altogether crashing our loyal army of machines!

Summary

Add one final --- marking the end of your body of content, and then conclude with a brief single paragraph summarizing at a high level the key pieces that were learned and how they tied to your objectives. Look to reiterate what the most important takeaways were.

What’s next?

Let Jupyter book tie this to the next (sequential) piece of content that people could move on to down below and in the sidebar. However, if this page uniquely enables your reader to tackle other nonsequential concepts throughout this book, or even external content, link to it here!

Resources and references

Finally, be rigorous in your citations and references as necessary. Give credit where credit is due. Also, feel free to link to relevant external material, further reading, documentation, etc. Then you’re done! Give yourself a quick review, a high five, and send us a pull request. A few final notes:

  • Kernel > Restart Kernel and Run All Cells... to confirm that your notebook will cleanly run from start to finish

  • Kernel > Restart Kernel and Clear All Outputs... before committing your notebook, our machines will do the heavy lifting

  • Take credit! Provide author contact information if you’d like; if so, consider adding information here at the bottom of your notebook

  • Give credit! Attribute appropriate authorship for referenced code, information, images, etc.

  • Only include what you’re legally allowed: no copyright infringement or plagiarism

Thank you for your contribution!

Regression Modeling on Argo using Scikit-learn — Using sklearn with BGC-Argo ocean observations (2024)
Top Articles
Latest Posts
Article information

Author: Laurine Ryan

Last Updated:

Views: 5887

Rating: 4.7 / 5 (77 voted)

Reviews: 92% of readers found this page helpful

Author information

Name: Laurine Ryan

Birthday: 1994-12-23

Address: Suite 751 871 Lissette Throughway, West Kittie, NH 41603

Phone: +2366831109631

Job: Sales Producer

Hobby: Creative writing, Motor sports, Do it yourself, Skateboarding, Coffee roasting, Calligraphy, Stand-up comedy

Introduction: My name is Laurine Ryan, I am a adorable, fair, graceful, spotless, gorgeous, homely, cooperative person who loves writing and wants to share my knowledge and understanding with you.