Ensembles¶
Overview¶
For each step, read the explanation, then run the code cell(s) right below it.
You will practice how to:
- Load and prepare data for modeling
- Compare a manual voting ensemble vs. VotingClassifier
- Compare a manual stacking ensemble vs. StackingClassifier
- Create a custom scorer using make_scorer
- Tune and evaluate RandomForestClassifier and AdaBoostClassifier with a custom scorer
- Demonstrate how to utilize ColumnTransformer for preprocessing in a Pipeline
- Plot and interpret tree-based feature importances
- Plot scores by probability thresholds
Import libraries¶
import os
import sys
from pathlib import Path
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier, StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold, cross_val_predict
from sklearn.metrics import recall_score, fbeta_score, confusion_matrix, make_scorer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
# Change float format to display 6 decimal places only
pd.options.display.float_format = '{:.6f}'.format
# Set random seed variable for code reproducibility
SEED = 1
# Import local libraries
root_dir = Path.cwd().resolve().parents[0]
sys.path.append(str(root_dir))
# Visualization functions
from src.visuals.make_plots import *
# Helper functions
from src.utils.helpers import *
# Load the "autoreload" extension so that code changes are automatically reloaded
%load_ext autoreload
#%reload_ext autoreload
# Always reload modules so that as you change code in src, it gets loaded
%autoreload 2
Personal Loan Example¶
Create a dataframe for the UniversalBank.csv data
In the next cell, we
- load the dataset,
- drop the ID and ZIP Code columns because they are not useful predictors.
This is a continuation of the Personal Loan Acceptance examples we have been using in the lectures.
bank_df = pd.read_csv(os.path.join('..', 'data', 'UniversalBank.csv'))
bank_df.drop(columns=['ID', 'ZIP Code'], inplace=True)
bank_df.columns = [c.replace(' ', '_') for c in bank_df.columns]
bank_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Age                 5000 non-null   int64
 1   Experience          5000 non-null   int64
 2   Income              5000 non-null   int64
 3   Family              5000 non-null   int64
 4   CCAvg               5000 non-null   float64
 5   Education           5000 non-null   int64
 6   Mortgage            5000 non-null   int64
 7   Personal_Loan       5000 non-null   int64
 8   Securities_Account  5000 non-null   int64
 9   CD_Account          5000 non-null   int64
 10  Online              5000 non-null   int64
 11  CreditCard          5000 non-null   int64
dtypes: float64(1), int64(11)
memory usage: 468.9 KB
Simple vs. Pipeline Feature Engineering¶
As we learned in the Naive Bayes lecture, MultinomialNB works with purely categorical data. This provides a great opportunity to introduce different feature engineering approaches.
🧩 Simple feature engineering includes transformations such as:
- Basic transformations like log, ratios, or arithmetic combinations of columns
- Creating indicator variables (0/1 flags) based on conditions
- Binning numeric variables into categorical buckets
✅ These are safe to run either before or after partitioning the data for training, since they do not learn anything from the data.
🔄 Later, we will introduce pipeline feature engineering, which includes transformations that should be fit on the training data and then applied to validation/test data using the same learned parameters. Some examples include:
- Encoding methods such as creating dummy variables (one-hot encoding)
- Scaling such as standard or min/max transformations
- Imputation using mean or median
⚠️ These should be done inside a pipeline to avoid data leakage, where information from the test set unintentionally influences the model.
In this example, the Naive Bayes model works best when continuous predictors are converted into bins; binning is one example of simple feature engineering.
For simplicity, before introducing pipelines, we will create dummy variables on the entire dataset. To preserve flexibility, we will keep a copy of the original bank_df (before binning and dummy variables) so we can reuse the raw data later.
bank_df2 = bank_df.copy()
bank_df2['Education'] = bank_df2['Education'].astype('category')
bank_df2['Age'] = pd.cut(bank_df2['Age'], 5, labels=range(1, 6)).astype('category')
bank_df2['Experience'] = pd.cut(bank_df2['Experience'], 10, labels=range(1, 11)).astype('category')
bank_df2['Income'] = pd.cut(bank_df2['Income'], 5, labels=range(1, 6)).astype('category')
bank_df2['CCAvg'] = pd.cut(bank_df2['CCAvg'], 6, labels=range(1, 7)).astype('category')
bank_df2['Mortgage'] = pd.cut(bank_df2['Mortgage'], 10, labels=range(1, 11)).astype('category')
bank_df2 = pd.get_dummies(bank_df2, prefix_sep='_')
bank_df2.head()
| Family | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | Age_1 | Age_2 | Age_3 | Age_4 | ... | Mortgage_1 | Mortgage_2 | Mortgage_3 | Mortgage_4 | Mortgage_5 | Mortgage_6 | Mortgage_7 | Mortgage_8 | Mortgage_9 | Mortgage_10 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 0 | 1 | 0 | 0 | 0 | True | False | False | False | ... | True | False | False | False | False | False | False | False | False | False |
| 1 | 3 | 0 | 1 | 0 | 0 | 0 | False | False | True | False | ... | True | False | False | False | False | False | False | False | False | False |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | False | True | False | False | ... | True | False | False | False | False | False | False | False | False | False |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | False | True | False | False | ... | True | False | False | False | False | False | False | False | False | False |
| 4 | 4 | 0 | 0 | 0 | 0 | 1 | False | True | False | False | ... | True | False | False | False | False | False | False | False | False | False |
5 rows × 45 columns
Split the data into training and test sets
Next, we separate the predictors from the target variable Personal_Loan, then create a training set (70%) and a test set (30%).
X = bank_df2.drop(columns=['Personal_Loan'])
y = bank_df2['Personal_Loan']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=SEED)
print('Training set:', X_train.shape)
print('Test set:', X_test.shape)
Training set: (3500, 44) Test set: (1500, 44)
Manual Voting Classifier¶
Fit several base learner models
To demonstrate a manual ensemble, we will create 4 models:
- LogisticRegression
- KNeighborsClassifier
- DecisionTreeClassifier
- MultinomialNB
💡Parameter choices for each base learner were intentionally selected for demonstration purposes.💡
lr = LogisticRegression(penalty='l2', C=0.1, solver='liblinear', random_state=SEED)
lr.fit(X_train, y_train)
title1 = 'Logistic Regression Base Learner'
lr_pred = lr.predict(X_test)
lr_metrics = evaluate_model(y_test, lr_pred, beta=2, model_name=title1)
lr_metrics
| Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|
| Logistic Regression Base Learner | 0.950667 | 0.921348 | 0.550336 | 0.689076 | 0.598540 |
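The evaluate_model function is a helper imported from src.utils.helpers. As a rough sketch only (the actual helper may differ), it likely assembles standard sklearn metrics into a one-row DataFrame; the name evaluate_model_sketch below is a hypothetical stand-in:
# Hypothetical sketch of the evaluate_model helper (the real one lives in src.utils.helpers)
from sklearn.metrics import accuracy_score, precision_score, f1_score
def evaluate_model_sketch(y_true, y_pred, beta=2, model_name='Model'):
    # Assemble standard classification metrics into a one-row DataFrame indexed by model name
    return pd.DataFrame({
        'Accuracy': [accuracy_score(y_true, y_pred)],
        'Precision': [precision_score(y_true, y_pred)],
        'Recall': [recall_score(y_true, y_pred)],
        'F1': [f1_score(y_true, y_pred)],
        f'F{beta}': [fbeta_score(y_true, y_pred, beta=beta)],
    }, index=[model_name])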
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
title2 = 'KNN Base Learner'
knn_pred = knn.predict(X_test)
knn_metrics = evaluate_model(y_test, knn_pred, beta=2, model_name=title2)
knn_metrics
| Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|
| KNN Base Learner | 0.933333 | 0.929825 | 0.355705 | 0.514563 | 0.405819 |
dt = DecisionTreeClassifier(max_depth=3, random_state=SEED)
dt.fit(X_train, y_train)
title3 = 'Decision Tree Base Learner'
dt_pred = dt.predict(X_test)
dt_metrics = evaluate_model(y_test, dt_pred, beta=2, model_name=title3)
dt_metrics
| Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|
| Decision Tree Base Learner | 0.956000 | 1.000000 | 0.557047 | 0.715517 | 0.611193 |
nb = MultinomialNB(alpha=1.0)
nb.fit(X_train, y_train)
title4 = 'Naive Bayes Base Learner'
nb_pred = nb.predict(X_test)
nb_metrics = evaluate_model(y_test, nb_pred, beta=2, model_name=title4)
nb_metrics
| Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|
| Naive Bayes Base Learner | 0.912667 | 0.562500 | 0.543624 | 0.552901 | 0.547297 |
Consolidate evaluation metrics
Let's consolidate the evaluation metrics for all 4 base learners. For our manual ensemble, we will focus on the F2 evaluation metric.
results_df = pd.concat([
lr_metrics,
knn_metrics,
dt_metrics,
nb_metrics
])
results_df
| Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|
| Logistic Regression Base Learner | 0.950667 | 0.921348 | 0.550336 | 0.689076 | 0.598540 |
| KNN Base Learner | 0.933333 | 0.929825 | 0.355705 | 0.514563 | 0.405819 |
| Decision Tree Base Learner | 0.956000 | 1.000000 | 0.557047 | 0.715517 | 0.611193 |
| Naive Bayes Base Learner | 0.912667 | 0.562500 | 0.543624 | 0.552901 | 0.547297 |
Create a results table with voting predictions
This results table will be used to evaluate the manual ensemble for hard and soft voting.
models = {
'lr': lr,
'knn': knn,
'dt': dt,
'nb': nb
}
(pred_cols, prob_cols, result) = build_manual_ensemble_results(models, X_test, y_test)
result.head(10)
| actual | lr_pred | lr_prob | knn_pred | knn_prob | dt_pred | dt_prob | nb_pred | nb_prob | majority | average_prob | average_pred | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.020384 | 0 | 0.000000 | 0 | 0.007675 | 0 | 0.007630 | 0 | 0.008922 | 0 |
| 1 | 0 | 0 | 0.002694 | 0 | 0.000000 | 0 | 0.007675 | 0 | 0.000043 | 0 | 0.002603 | 0 |
| 2 | 0 | 0 | 0.003346 | 0 | 0.000000 | 0 | 0.007675 | 0 | 0.000075 | 0 | 0.002774 | 0 |
| 3 | 0 | 0 | 0.038275 | 0 | 0.000000 | 0 | 0.259740 | 0 | 0.060384 | 0 | 0.089600 | 0 |
| 4 | 0 | 0 | 0.036118 | 0 | 0.000000 | 0 | 0.007675 | 0 | 0.006894 | 0 | 0.012672 | 0 |
| 5 | 0 | 0 | 0.029258 | 0 | 0.000000 | 0 | 0.007675 | 0 | 0.000263 | 0 | 0.009299 | 0 |
| 6 | 0 | 0 | 0.017519 | 0 | 0.000000 | 0 | 0.007675 | 0 | 0.002855 | 0 | 0.007012 | 0 |
| 7 | 0 | 0 | 0.185223 | 0 | 0.000000 | 0 | 0.259740 | 0 | 0.093892 | 0 | 0.134714 | 0 |
| 8 | 0 | 1 | 0.542300 | 0 | 0.000000 | 0 | 0.259740 | 1 | 0.589052 | 0 | 0.347773 | 0 |
| 9 | 0 | 1 | 0.506497 | 0 | 0.333333 | 0 | 0.259740 | 1 | 0.681355 | 0 | 0.445231 | 0 |
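The build_manual_ensemble_results function is another helper from src.utils.helpers. A minimal sketch of the logic it presumably applies is shown below (the _sketch suffix marks it as hypothetical); note the table above is consistent with a strict-majority hard vote (ties go to the negative class) and a 0.50 cutoff on the averaged probabilities:
# Assumed core logic of build_manual_ensemble_results
def build_manual_ensemble_results_sketch(models, X, y):
    result = pd.DataFrame({'actual': np.asarray(y)})
    pred_cols, prob_cols = [], []
    for name, model in models.items():
        result[f'{name}_pred'] = model.predict(X)
        result[f'{name}_prob'] = model.predict_proba(X)[:, 1]
        pred_cols.append(f'{name}_pred')
        prob_cols.append(f'{name}_prob')
    # Hard voting: strict majority of class labels (ties resolve to 0)
    result['majority'] = (result[pred_cols].mean(axis=1) > 0.5).astype(int)
    # Soft voting: average the positive-class probabilities, then apply the 0.50 cutoff
    result['average_prob'] = result[prob_cols].mean(axis=1)
    result['average_pred'] = (result['average_prob'] >= 0.5).astype(int)
    return pred_cols, prob_cols, result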
CP4_Ensembles
For CP4 credit, display a random sample of 5 rows where actual = 1, then calculate the hard-voting majority, average_prob, and soft-voting average_pred on Canvas. The random_state and the missing model predictions/probabilities will be set during class.
cols = ['actual','lr_pred','lr_prob','knn_pred','knn_prob','dt_pred','dt_prob','nb_pred','nb_prob']
Compare the individual models to the manual ensemble predictions
Below we compare each model and the two manual ensemble methods:
- Majority vote using class labels
- Average probability using predicted probabilities
model_map = {
'Logistic Regression': 'lr_pred',
'k-Nearest Neighbors': 'knn_pred',
'Decision Tree': 'dt_pred',
'Naive Bayes': 'nb_pred',
'Manual Majority Vote': 'majority',
'Manual Average Probability': 'average_pred',
}
summary = evaluate_manual_ensemble_models(result, model_map)
summary.sort_values(by='F2', ascending=False)
| Model | Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|---|
| 2 | Decision Tree | 0.956000 | 1.000000 | 0.557047 | 0.715517 | 0.611193 |
| 5 | Manual Average Probability | 0.954000 | 0.965116 | 0.557047 | 0.706383 | 0.608504 |
| 0 | Logistic Regression | 0.950667 | 0.921348 | 0.550336 | 0.689076 | 0.598540 |
| 3 | Naive Bayes | 0.912667 | 0.562500 | 0.543624 | 0.552901 | 0.547297 |
| 4 | Manual Majority Vote | 0.944667 | 0.958333 | 0.463087 | 0.624434 | 0.516467 |
| 1 | k-Nearest Neighbors | 0.933333 | 0.929825 | 0.355705 | 0.514563 | 0.405819 |
Based on the combined results, the Decision Tree, Logistic Regression, and Naive Bayes models are all performing at a similar level, while the k-Nearest Neighbors model is clearly underperforming.
Because ensemble methods combine predictions across models, including a weaker model like k-Nearest Neighbors can negatively impact overall performance. This is especially noticeable in the manual voting approach as the majority vote method performs worse than the top individual models.
👉 Adding more models does not always improve an ensemble—especially when those models are weak or contribute noisy predictions.
Ensemble methods like voting classifiers tend to perform best when:
- ✅ Individual models have similar performance levels
- 🔀 Models make different types of errors (i.e., they are diverse)
Remove k-Nearest Neighbors model from the manual ensemble
Now, we will rerun the manual ensemble results without KNN.
models2 = {
'lr': lr,
'dt': dt,
'nb': nb
}
(pred_cols2, prob_cols2, result2) = build_manual_ensemble_results(models2, X_test, y_test)
result2.head(10)
| actual | lr_pred | lr_prob | dt_pred | dt_prob | nb_pred | nb_prob | majority | average_prob | average_pred | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.020384 | 0 | 0.007675 | 0 | 0.007630 | 0 | 0.011896 | 0 |
| 1 | 0 | 0 | 0.002694 | 0 | 0.007675 | 0 | 0.000043 | 0 | 0.003471 | 0 |
| 2 | 0 | 0 | 0.003346 | 0 | 0.007675 | 0 | 0.000075 | 0 | 0.003698 | 0 |
| 3 | 0 | 0 | 0.038275 | 0 | 0.259740 | 0 | 0.060384 | 0 | 0.119466 | 0 |
| 4 | 0 | 0 | 0.036118 | 0 | 0.007675 | 0 | 0.006894 | 0 | 0.016896 | 0 |
| 5 | 0 | 0 | 0.029258 | 0 | 0.007675 | 0 | 0.000263 | 0 | 0.012398 | 0 |
| 6 | 0 | 0 | 0.017519 | 0 | 0.007675 | 0 | 0.002855 | 0 | 0.009349 | 0 |
| 7 | 0 | 0 | 0.185223 | 0 | 0.259740 | 0 | 0.093892 | 0 | 0.179618 | 0 |
| 8 | 0 | 1 | 0.542300 | 0 | 0.259740 | 1 | 0.589052 | 1 | 0.463698 | 0 |
| 9 | 0 | 1 | 0.506497 | 0 | 0.259740 | 1 | 0.681355 | 1 | 0.482531 | 0 |
Compare the manual ensemble predictions without KNN
Below we compare each model and the two manual ensemble methods:
model_map2 = {
'Logistic Regression': 'lr_pred',
'Decision Tree': 'dt_pred',
'Naive Bayes': 'nb_pred',
'Manual Majority Vote': 'majority',
'Manual Average Probability': 'average_pred',
}
summary2 = evaluate_manual_ensemble_models(result2, model_map2)
summary2.sort_values(by='F2', ascending=False)
| Model | Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|---|
| 4 | Manual Average Probability | 0.959333 | 0.958333 | 0.617450 | 0.751020 | 0.664740 |
| 3 | Manual Majority Vote | 0.952000 | 0.923077 | 0.563758 | 0.700000 | 0.611354 |
| 1 | Decision Tree | 0.956000 | 1.000000 | 0.557047 | 0.715517 | 0.611193 |
| 0 | Logistic Regression | 0.950667 | 0.921348 | 0.550336 | 0.689076 | 0.598540 |
| 2 | Naive Bayes | 0.912667 | 0.562500 | 0.543624 | 0.552901 | 0.547297 |
By combining only the stronger models (Decision Tree, Logistic Regression, and Naive Bayes), the manual ensemble, especially the average probability (soft voting) method, achieved the highest F2 score, outperforming all individual models.
👉 Strong ensembles are built not just by adding more models, but by carefully selecting models that contribute meaningful and complementary predictions.
Recreate Ensemble using VotingClassifier
Creating the voting ensemble manually was mainly for illustration purposes; in practice, you should use sklearn's ensemble method VotingClassifier. Let's recreate the same classifier using the Decision Tree, Logistic Regression, and Naive Bayes base learners we created previously.
estimators = [
('lr', lr),
('dt', dt),
('nb', nb)
]
vc_hard = VotingClassifier(estimators=estimators, voting='hard')
vc_hard.fit(X_train, y_train)
vc_hard_pred = vc_hard.predict(X_test)
vc_soft = VotingClassifier(estimators=estimators, voting='soft')
vc_soft.fit(X_train, y_train)
vc_soft_pred = vc_soft.predict(X_test)
vc_hard_metrics = evaluate_model(y_test, vc_hard_pred, beta=2, model_name='VotingClassifier Hard')
vc_soft_metrics = evaluate_model(y_test, vc_soft_pred, beta=2, model_name='VotingClassifier Soft')
pd.concat([vc_hard_metrics,vc_soft_metrics]).sort_values(by='F2', ascending=False)
| Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|
| VotingClassifier Soft | 0.959333 | 0.958333 | 0.617450 | 0.751020 | 0.664740 |
| VotingClassifier Hard | 0.952000 | 0.923077 | 0.563758 | 0.700000 | 0.611354 |
Key VotingClassifier Parameters¶
Below are the parameters for VotingClassifier in sklearn.
estimators
List of (name, model) tuples for the base estimators.
voting
Determines how the ensemble combines model outputs.
"hard"→ majority vote using predicted class labels"soft"→ average predicted probabilities
weights
Optional list of numeric weights. Models with larger weights have more influence in the final vote.
n_jobs
Number of CPU cores used in parallel when fitting the estimators.
- None → default behavior
- -1 → use all available cores
flatten_transform
Used only when voting='soft' and calling .transform().
- True → flatten output into a 2D array
- False → keep a more structured output format
verbose
If True, displays progress information while fitting.
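As an aside, the weights parameter can tilt the vote toward stronger base learners. A quick illustrative example (weights chosen arbitrarily, not tuned):
# Illustrative only: give the Decision Tree twice the vote of the other base learners
vc_weighted = VotingClassifier(estimators=estimators, voting='soft', weights=[1, 2, 1], n_jobs=-1)
vc_weighted.fit(X_train, y_train)
evaluate_model(y_test, vc_weighted.predict(X_test), beta=2, model_name='VotingClassifier Weighted')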
Manual Stacking Classifier¶
Stacking uses the outputs of the base models as the inputs to a second-level model called a meta-learner.
In this step, we will utilize all 4 of the base learners to generate predictions and then fit a new Logistic Regression model on those predictions.
First, we will evaluate the Logistic Regression coefficients for the meta-learner.
train_result = pd.DataFrame({
'actual': y_train,
'lr_pred': lr.predict(X_train),
'knn_pred': knn.predict(X_train),
'dt_pred': dt.predict(X_train),
'nb_pred': nb.predict(X_train),
})
params = {'penalty': 'l2', 'C': 1e42, 'solver': 'liblinear', 'random_state': SEED}
msc = LogisticRegression(**params)
msc.fit(train_result[pred_cols], y_train)
pd.DataFrame({'Coefficient': [msc.intercept_[0]] + list(msc.coef_[0])}, index=['Intercept'] + pred_cols)
| Coefficient | |
|---|---|
| Intercept | -4.062824 |
| lr_pred | 2.547507 |
| knn_pred | 5.640015 |
| dt_pred | 11.789058 |
| nb_pred | 0.729674 |
As we learned in the Logistic Regression lecture and tutorial, each coefficient reflects the impact of a base model's prediction on the log(odds) of predicting the positive class.
Intercept (-4.063)
Baseline log-odds when all model predictions are 0.
→ Indicates a strong bias toward the negative class unless the models vote otherwise.
dt_pred (11.789) 🚀
The Decision Tree has the strongest influence on the final prediction.
→ When this model predicts 1, it dramatically increases the odds of belonging to the positive class.
knn_pred (5.640)
The KNN model also has a strong positive contribution, but much less than the Decision Tree.
lr_pred (2.548)
The Logistic Regression base model contributes moderately to the final decision.
nb_pred (0.730)
The Naive Bayes model has the weakest influence, suggesting it adds limited value to the ensemble.
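Because these are log-odds coefficients, exponentiating them yields odds ratios, which can be easier to interpret:
# Convert log-odds coefficients to odds ratios
pd.DataFrame({'Odds_Ratio': np.exp([msc.intercept_[0]] + list(msc.coef_[0]))},
             index=['Intercept'] + pred_cols)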
test_result = pd.DataFrame({
'actual': y_test,
'lr_pred': lr.predict(X_test),
'knn_pred': knn.predict(X_test),
'dt_pred': dt.predict(X_test),
'nb_pred': nb.predict(X_test),
})
# Use predicted probabilities so the 0.50 cutoff is meaningful (predict() already returns class labels)
msc_pred = (msc.predict_proba(test_result[pred_cols])[:, 1] >= 0.50).astype(int)
msc_metrics = evaluate_model(y_test, msc_pred, beta=2, model_name='Manual Stacking')
msc_metrics.sort_values(by='F2', ascending=False)
| Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|
| Manual Stacking | 0.961333 | 0.959596 | 0.637584 | 0.766129 | 0.683453 |
Recreate Ensemble using StackingClassifier
Like we did with the manual voting classifier, let's now recreate the same stacking classifier using sklearn's ensemble method StackingClassifier.
estimators = [
('lr', lr),
('knn', knn),
('dt', dt),
('nb', nb)
]
sc = StackingClassifier(estimators=estimators,
final_estimator=LogisticRegression(**params),
stack_method='predict',
cv=5)
sc.fit(X_train, y_train)
sc_pred = sc.predict(X_test)
sc_metrics = evaluate_model(y_test, sc_pred, beta=2, model_name='StackingClassifier')
sc_metrics
| Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|
| StackingClassifier | 0.952667 | 0.923913 | 0.570470 | 0.705394 | 0.617733 |
Why are some metrics lower with StackingClassifier?
When comparing our manual stacking approach to sklearn’s StackingClassifier, we can see that several evaluation metrics are lower when using the built-in implementation.
This happens because StackingClassifier uses a more realistic and less biased training process for the meta-model.
What’s Different? 🧠
🔁 Cross-validated predictions (sklearn) The meta-model is trained using out-of-fold predictions from the base models. This means each prediction is made on data the base model has not seen during training.
⚠️ In-sample predictions (manual approach) In our manual stacking method, the meta-model was trained on predictions generated from the same data used to train the base models. This introduces data leakage, making the model appear stronger than it actually is.
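To remove that leakage from the manual approach, the meta-learner could instead be trained on out-of-fold predictions. A sketch using cross_val_predict (already imported above), which mirrors what StackingClassifier does internally with stack_method='predict':
# Sketch: train the manual meta-learner on out-of-fold predictions to avoid leakage
oof_preds = pd.DataFrame({
    f'{name}_pred': cross_val_predict(model, X_train, y_train, cv=5)
    for name, model in models.items()
})
msc_oof = LogisticRegression(**params).fit(oof_preds, y_train)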
Key StackingClassifier Parameters¶
Below are the parameters for StackingClassifier in sklearn.
estimators
List of (name, model) tuples for the base estimators.
final_estimator
The meta-model that learns how to combine the base estimators.
cv
Cross-validation strategy used to create out-of-fold predictions for training the meta-model.
- Integer (for number of folds)
- cross-validation splitter object
- iterable of train/test splits
- None → default cross-validation behavior
stack_method
How the base estimators generate inputs for the meta-model.
"auto"→ automatically choose an available method"predict_proba"→ use predicted probabilities"decision_function"→ use decision scores"predict"→ use predicted class labels
n_jobs
Number of CPU cores used in parallel.
- None → default behavior
- -1 → use all available cores
passthrough
Controls whether the original predictor variables are included alongside the base-model outputs.
- False → meta-learner uses only the base-model outputs
- True → meta-learner uses both the original features and the base-model outputs
verbose
Controls how much fitting progress is printed.
- 0 → no additional output
- Higher values → more progress information
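For example, a probability-based stack that also passes the original features through to the meta-learner might look like this (illustrative settings only):
# Illustrative: stack on predicted probabilities and pass through the original features
sc_proba = StackingClassifier(estimators=estimators,
                              final_estimator=LogisticRegression(max_iter=1000, random_state=SEED),
                              stack_method='predict_proba',
                              passthrough=True,
                              cv=5,
                              n_jobs=-1)
sc_proba.fit(X_train, y_train)
evaluate_model(y_test, sc_proba.predict(X_test), beta=2, model_name='StackingClassifier Proba')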
Custom scorer evaluation
So far, we have been comparing our models using the F2 score, which makes sense for the Personal Loan example: the goal is to capture as many new customers as possible in the next marketing campaign while still maintaining some precision in our outreach efforts. When using methods like cross_val_score or GridSearchCV, you can easily specify a scoring parameter using predefined scorer names. Check out sklearn's user guide for the list of string-name scorers.
However, not all metrics are available by name (the F2 score, for example), and you may often want to customize your evaluation metric to accomplish specific business goals. To use a custom evaluation metric with these methods, you need sklearn's make_scorer, which wraps your function in a callable object that returns a scalar score. It is possible to pass sklearn's fbeta_score directly into make_scorer, but writing our own function lets us demonstrate a custom scorer with a known calculation that we can compare against prior evaluation results.
The first step is to create our new scoring function.
def custom_f2_score(y_true, y_pred):
cm = confusion_matrix(y_true, y_pred)
fp = cm[0, 1]
fn = cm[1, 0]
tp = cm[1, 1]
# Handle division by zero
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
# F2 calculation (beta = 2)
beta = 2
if precision + recall == 0:
return 0
f2 = (1 + beta**2) * (precision * recall) / ((beta**2 * precision) + recall)
return f2
## Validate the score using the original Decision Tree preds
print(f"Custom F2 score: {custom_f2_score(y_test, dt_pred):.4f}")
Custom F2 score: 0.6112
Custom Scoring Functions: Allowed Inputs
When creating a custom scoring function (instead of using a built-in sklearn metric), your function must follow a specific structure so it can work with tools like cross-validation and Grid Search.
y_true
The actual (ground truth) target values.
- Typically a pandas Series or numpy array
- Contains the true class labels
y_pred
The predicted values from the model.
- By default, this comes from model.predict()
- Contains predicted class labels (e.g., 0/1)
👉 Minimum Required Signature
def custom_metric(y_true, y_pred):
return score
- Must return a single scalar value (not a list or array)
- Higher values should indicate better performance (unless handled with greater_is_better=False)
When Using Probabilities or Scores 📊
If your metric requires probabilities or decision scores, you must configure make_scorer accordingly:
- needs_proba=True → function receives predicted probabilities
- needs_threshold=True → function receives decision scores
In this case, your function will look like:
def custom_metric(y_true, y_pred_proba):
return score
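For instance, a metric such as sklearn's average_precision_score operates on probabilities, so it would be wrapped with needs_proba=True:
# Probability-based scorer: the wrapped function receives predict_proba outputs, not labels
from sklearn.metrics import average_precision_score
ap_scorer = make_scorer(average_precision_score, needs_proba=True)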
Instantiate make_scorer
Next, we will use the custom_f2_score to create our new make_scorer object.
f2_scorer = make_scorer(custom_f2_score, greater_is_better=True)
Build a RandomForestClassifier
In the Ensembles lecture, we mentioned two powerful “out-of-the-box” ensemble methods:
- 🌲 Random Forest → A strong bagging-based model that builds multiple decision trees and aggregates their results
- ⚡ AdaBoost → A popular boosting model that sequentially improves performance by focusing on previously misclassified observations
First, we will build and evaluate a Random Forest model and demonstrate how to use sklearn's cross_val_score with our custom F2 scorer.
rf = RandomForestClassifier(max_depth=None, n_estimators=300, random_state=SEED, n_jobs=-1)
scores = cross_val_score(rf, X_train, y_train, cv=5, scoring=f2_scorer)
print("F2 Scores per fold:", scores)
print(f"Average Cross-Validation F2 Score: {scores.mean():.4f}")
F2 Scores per fold: [0.8 0.80188679 0.80246914 0.80696203 0.82043344] Average Cross-Validation F2 Score: 0.8064
Cross-Validation Reminder 🔁
When using cross_val_score, you should not pass in a fitted model.
- ❌ Do NOT use a model that has already been trained with .fit()
- ✅ Always pass in an unfitted model instance
This is because cross_val_score will:
- Split the data into folds
- Train the model separately on each training fold
- Evaluate performance on each validation fold
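Note that for classifiers, an integer cv already uses stratified folds (without shuffling). If you want shuffled, stratified folds, you can pass a StratifiedKFold splitter (imported above) explicitly, for example:
# Preserve the class ratio in every fold, with shuffling (illustrative; scores may differ slightly)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
scores_skf = cross_val_score(rf, X_train, y_train, cv=skf, scoring=f2_scorer)
print(f"Average Stratified F2 Score: {scores_skf.mean():.4f}")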
title5 = 'Random Forest Model'
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
rf_metrics = evaluate_model(y_test, rf_pred, beta=2, model_name=title5)
rf_metrics
| Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|
| Random Forest Model | 0.968667 | 0.925000 | 0.744966 | 0.825279 | 0.775140 |
Key RandomForestClassifier Parameters¶
n_estimators
Number of trees in the forest.
- Integer (e.g., 100, 200, 500)
- More trees → typically better performance but increased computation time
Larger values reduce variance and improve stability.
max_depth
Maximum depth of each decision tree.
- None → trees grow until pure or until the minimum-samples threshold is reached
- Integer (e.g., max_depth=5)
Controls model complexity where smaller depth = less overfitting.
min_samples_split
Minimum number of samples required to split an internal node.
- Integer → exact number (e.g., 10)
- Float → fraction of dataset (e.g., 0.05)
Larger values = fewer splits → simpler trees.
min_samples_leaf
Minimum number of samples required at a leaf node.
- Integer → exact number (e.g., 5)
- Float → fraction of dataset
Helps prevent very small leaves and reduces overfitting.
max_features
Number of features considered when looking for the best split.
"sqrt"→ square root of total features (default for classification)"log2"→ log base 2 of features- Integer or float → custom number
Introduces randomness and helps reduce correlation between trees.
bootstrap
Whether bootstrap sampling is used.
- True → sample with replacement (default)
- False → use the entire dataset for each tree
Core concept of bagging.
random_state
Controls randomness for reproducibility.
- Integer (e.g., 1, 42)
Ensures consistent results across runs.
n_jobs
Number of CPU cores used for training.
- -1 → use all available cores
- Integer → specific number of cores
Speeds up training for large forests.
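The Random Forest can be tuned the same way we tune AdaBoost below. An illustrative (deliberately small) grid using the custom F2 scorer:
# Illustrative grid only; parameter values are not tuned recommendations
rf_param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10],
    'max_features': ['sqrt', 'log2'],
}
gs_rf = GridSearchCV(RandomForestClassifier(random_state=SEED, n_jobs=-1),
                     param_grid=rf_param_grid, cv=5, scoring=f2_scorer, n_jobs=-1)
# gs_rf.fit(X_train, y_train)  # then inspect gs_rf.best_params_ and gs_rf.best_score_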
Tune an AdaBoostClassifier
Next, we will build and perform some basic tuning of an AdaBoost model and demonstrate how to use sklearn's GridSearchCV with our custom F2 scorer. We will use the original Decision Tree base learner as the estimator for the boosting model.
ada = AdaBoostClassifier(
estimator=dt,
random_state=SEED
)
param_grid = {
'n_estimators': [50, 100, 200],
'learning_rate': [0.01, 0.1, 1.0, 10.0],
'estimator__max_depth': [1, 3, 5]
}
gs_ada = GridSearchCV(estimator=ada, param_grid=param_grid, cv=5, scoring=f2_scorer, n_jobs=-1)
gs_ada.fit(X_train, y_train)
print('Best AdaBoost Parameters:', gs_ada.best_params_)
print(f'Best AdaBoost Cross-Validation F2 Score: {gs_ada.best_score_:.4f}')
Best AdaBoost Parameters: {'estimator__max_depth': 5, 'learning_rate': 1.0, 'n_estimators': 100}
Best AdaBoost Cross-Validation F2 Score: 0.8537
Display all evaluation metrics
For comparison purposes, let's display all of the evaluation metrics we have gathered so far. Keep in mind the earlier base learners were not tuned or designed to be strong models on their own.
title6 = 'AdaBoost Model'
ab_pred = gs_ada.predict(X_test)
ab_metrics = evaluate_model(y_test, ab_pred, beta=2, model_name=title6)
results_df2 = pd.concat([lr_metrics, knn_metrics, dt_metrics, nb_metrics,
vc_hard_metrics, vc_soft_metrics, sc_metrics,
rf_metrics, ab_metrics])
results_df2.sort_values(by='F2', ascending=False)
| Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|
| AdaBoost Model | 0.970000 | 0.912698 | 0.771812 | 0.836364 | 0.796399 |
| Random Forest Model | 0.968667 | 0.925000 | 0.744966 | 0.825279 | 0.775140 |
| VotingClassifier Soft | 0.959333 | 0.958333 | 0.617450 | 0.751020 | 0.664740 |
| StackingClassifier | 0.952667 | 0.923913 | 0.570470 | 0.705394 | 0.617733 |
| VotingClassifier Hard | 0.952000 | 0.923077 | 0.563758 | 0.700000 | 0.611354 |
| Decision Tree Base Learner | 0.956000 | 1.000000 | 0.557047 | 0.715517 | 0.611193 |
| Logistic Regression Base Learner | 0.950667 | 0.921348 | 0.550336 | 0.689076 | 0.598540 |
| Naive Bayes Base Learner | 0.912667 | 0.562500 | 0.543624 | 0.552901 | 0.547297 |
| KNN Base Learner | 0.933333 | 0.929825 | 0.355705 | 0.514563 | 0.405819 |
Feature Engineering Workflow
Previously, we performed all feature engineering steps (including creating dummy variables) directly on the full dataset before splitting into training and test sets.
While this approach is useful for learning and quick experimentation, it is not the recommended workflow for building final models.
To demonstrate how you should prepare your final model for PA4, we will:
- Perform simple feature engineering inside a feature_engineering(df) function
- Use a ColumnTransformer to handle dummy variable creation (one-hot encoding)
- Integrate everything into a more structured and reusable pipeline
Previously, we created bins for the numerical variables and then created dummy variables, but that was primarily for our Naive Bayes model. Tree-based models don't require that preprocessing, so we will instead demonstrate the following:
- Engineer one new variable as the ratio of the Income and Family variables
- Convert Education to a category, since it is numeric and the preprocessor would not otherwise detect it as a data type requiring dummy variables
For the purposes of this tutorial only, we have included a fit parameter to choose whether build_model() returns a fitted model. This lets us reuse the pipeline for cross-validation.
❌ Do NOT do this for PA4 ❌
def feature_engineering(df):
df = df.copy()
df['Income_Per_Family'] = df['Income'] / df['Family'].astype(float)
df['Education'] = df['Education'].astype('category')
education_labels = {1: 'Undergrad', 2: 'Graduate', 3: 'Advanced/Professional'}
df['Education'] = df['Education'].cat.rename_categories(education_labels)
return df
def build_model(df, target, fit=True):
df_fe = feature_engineering(df)
X = df_fe.drop(columns=[target])
y = df_fe[target]
categorical_cols = X.select_dtypes(include=["object", "category", "bool"]).columns.tolist()
numeric_cols = X.select_dtypes(exclude=["object", "category", "bool"]).columns.tolist()
preprocessor = ColumnTransformer(
transformers=[("cat",
OneHotEncoder(drop='first',
handle_unknown="ignore",
sparse_output=False,
dtype='int'),
categorical_cols,
),
("num", "passthrough", numeric_cols),
]
)
base_estimator = DecisionTreeClassifier(max_depth=5, random_state=SEED)
model = Pipeline([
("preprocessor", preprocessor),
("model", AdaBoostClassifier(estimator=base_estimator,
learning_rate=1.0,
n_estimators=100,
random_state=SEED)),
])
if fit:
model.fit(X, y)
return model
Key OneHotEncoder Parameters¶
categories
Specifies the categories for each feature.
"auto"→ Automatically determine categories from the data (default)- List of lists → Manually define category order
drop
Controls whether to drop one category per feature to avoid multicollinearity.
- None → keep all categories (default)
- "first" → drop the first category
- "if_binary" → drop one category only for binary features
sparse_output
Determines the output format.
- True → returns a sparse matrix (default in newer versions)
- False → returns a dense NumPy array
handle_unknown
Specifies how to handle unseen categories during transform.
"error"→ Raise an error (default)"ignore"→ Ignore unknown categories and encode as all zeros
dtype
Controls the data type of the output.
- Default → float64
- Can set to int or float32 if needed
feature_names_out (used with .get_feature_names_out())
Generates readable column names after encoding.
Prepare test data and evaluate AdaBoost Pipeline
In PA4, you will not have access to the hidden test data; the Autograder will handle the feature engineering you define and the model training from your pipeline. However, we do need to apply the feature engineering to a test set so we can evaluate our new AdaBoost Pipeline model.
We will again create a training set (70%) and a test set (30%), but this time we will use the raw bank_df and will not create separate splits for y.
train, test = train_test_split(bank_df, test_size=0.30, random_state=SEED)
print('Training set:', train.shape)
print('Test set:', test.shape)
Training set: (3500, 12) Test set: (1500, 12)
target = 'Personal_Loan'
test_fe = feature_engineering(test)
X_test2 = test_fe.drop(columns=[target])
y_test2 = test_fe[target]
ab_pipe = build_model(train, target)
title7 = 'AdaBoost Pipeline Model'
ab_pipe_pred = ab_pipe.predict(X_test2)
ab_pipe_metrics = evaluate_model(y_test2, ab_pipe_pred, beta=2, model_name=title7)
results_df3 = pd.concat([lr_metrics, knn_metrics, dt_metrics, nb_metrics,
vc_hard_metrics, vc_soft_metrics, sc_metrics,
rf_metrics, ab_metrics, ab_pipe_metrics])
results_df3.sort_values(by='F2', ascending=False)
| Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|
| AdaBoost Pipeline Model | 0.983333 | 0.962687 | 0.865772 | 0.911661 | 0.883562 |
| AdaBoost Model | 0.970000 | 0.912698 | 0.771812 | 0.836364 | 0.796399 |
| Random Forest Model | 0.968667 | 0.925000 | 0.744966 | 0.825279 | 0.775140 |
| VotingClassifier Soft | 0.959333 | 0.958333 | 0.617450 | 0.751020 | 0.664740 |
| StackingClassifier | 0.952667 | 0.923913 | 0.570470 | 0.705394 | 0.617733 |
| VotingClassifier Hard | 0.952000 | 0.923077 | 0.563758 | 0.700000 | 0.611354 |
| Decision Tree Base Learner | 0.956000 | 1.000000 | 0.557047 | 0.715517 | 0.611193 |
| Logistic Regression Base Learner | 0.950667 | 0.921348 | 0.550336 | 0.689076 | 0.598540 |
| Naive Bayes Base Learner | 0.912667 | 0.562500 | 0.543624 | 0.552901 | 0.547297 |
| KNN Base Learner | 0.933333 | 0.929825 | 0.355705 | 0.514563 | 0.405819 |
Display Feature Importances
The AdaBoost Pipeline model performed best across all models we have evaluated, so let's explore the most influential features in the model using its feature importances. We will demonstrate how to pull feature_importances_ from the pipeline, as well as from a standard model object, by training another Random Forest model without the pipeline.
rf2 = RandomForestClassifier(max_depth=None, n_estimators=300, random_state=SEED, n_jobs=-1)
df_fe = feature_engineering(bank_df)
df_fe = pd.get_dummies(df_fe, prefix_sep='_')
X_train2 = df_fe.drop(columns=[target])
y_train2 = df_fe[target]
rf2.fit(X_train2, y_train2)
fig, ax = plt.subplots(figsize=(8, 6))
plot_feature_importances(X_train2.columns, rf2.feature_importances_, ax, 'Random Forest Feature Importances')
pipe_features = ab_pipe.named_steps['preprocessor'].get_feature_names_out()
pipe_model = ab_pipe.named_steps['model']
fig, ax = plt.subplots(figsize=(8, 6))
plot_feature_importances(pipe_features, pipe_model.feature_importances_, ax, 'AdaBoost Pipeline Feature Importances')
Plot Scores by Probability Thresholds
Up to this point, we have relied on the .predict() method to generate predictions, which uses the default 0.50 threshold when the model implements .predict_proba(). However, this default threshold is not always optimal, especially when the business objective prioritizes one type of error over another.
Instead of relying on a single cutoff, we can evaluate model performance across a range of probability thresholds. This allows us to better understand how the model behaves as we become more or less strict in assigning the positive class.
Let's now recreate the AdaBoost Pipeline without fitting the model and plot the F2 Score over different probability thresholds.
df_fe = feature_engineering(bank_df)
X = df_fe.drop(columns=[target])
y = df_fe[target]
pipe = build_model(train, target, fit=False)
results = plot_metric_by_threshold_cv(
pipeline=pipe,
X=X,
y=y,
metric=f2_scorer,
metric_name="F2 Score",
cv=5
)
Best threshold: 0.36 Best F2 Score: 0.9346
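The plot_metric_by_threshold_cv function is a helper from src.visuals.make_plots. A minimal sketch of the computation it presumably performs, leaving out the plotting: score out-of-fold probabilities at a range of cutoffs and report the best one.
# Assumed core logic: out-of-fold probabilities scored across candidate thresholds
proba = cross_val_predict(pipe, X, y, cv=5, method='predict_proba')[:, 1]
thresholds = np.arange(0.05, 0.96, 0.01)
f2_scores = [custom_f2_score(y, (proba >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(f2_scores))]
print(f'Best threshold: {best_t:.2f} Best F2 Score: {max(f2_scores):.4f}')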