Ensembles¶
Overview¶
For each step, read the explanation, then run the code cell(s) right below it.
You will practice how to:
- Load and prepare data for modeling
- Compare a manual voting ensemble vs. VotingClassifier
- Compare a manual stacking ensemble vs. StackingClassifier
- Create a custom scorer using make_scorer
- Tune and evaluate RandomForestClassifier and AdaBoostClassifier with a custom scorer
- Demonstrate how to utilize ColumnTransformer for preprocessing in a Pipeline
- Plot and interpret tree-based feature importances
- Plot scores by probability thresholds
Import libraries¶
import os
import sys
from pathlib import Path
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier, StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold, cross_val_predict
from sklearn.metrics import recall_score, fbeta_score, confusion_matrix, make_scorer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
# Change float format to display 6 decimal places only
pd.options.display.float_format = '{:.6f}'.format
# Set random seed variable for code reproducibility
SEED = 1
# Import local libraries
root_dir = Path.cwd().resolve().parents[0]
sys.path.append(str(root_dir))
# Visualization functions
from src.visuals.make_plots import *
# Helper functions
from src.utils.helpers import *
# Load the "autoreload" extension so that code changes are automatically reloaded
%load_ext autoreload
#%reload_ext autoreload
# Always reload modules so that as you change code in src, it gets loaded
%autoreload 2
Personal Loan Example¶
Create a dataframe for the UniversalBank.csv data
In the next cell, we
- load the dataset,
- drop the ID and ZIP Code columns because they are not useful predictors.
This is a continuation of the Personal Loan Acceptance examples we have been using in the lectures.
bank_df = pd.read_csv(os.path.join('..', 'data', 'UniversalBank.csv'))
bank_df.drop(columns=['ID', 'ZIP Code'], inplace=True)
bank_df.columns = [c.replace(' ', '_') for c in bank_df.columns]
bank_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Age                 5000 non-null   int64
 1   Experience          5000 non-null   int64
 2   Income              5000 non-null   int64
 3   Family              5000 non-null   int64
 4   CCAvg               5000 non-null   float64
 5   Education           5000 non-null   int64
 6   Mortgage            5000 non-null   int64
 7   Personal_Loan       5000 non-null   int64
 8   Securities_Account  5000 non-null   int64
 9   CD_Account          5000 non-null   int64
 10  Online              5000 non-null   int64
 11  CreditCard          5000 non-null   int64
dtypes: float64(1), int64(11)
memory usage: 468.9 KB
Simple vs. Pipeline Feature Engineering¶
As we learned in the Naive Bayes lecture, MultinomialNB works with purely categorical data. This provides a great opportunity to introduce different feature engineering approaches.
🧩 Simple feature engineering includes transformations such as:
- Basic transformations like log, ratios, or arithmetic combinations of columns
- Creating indicator variables (0/1 flags) based on conditions
- Binning numeric variables into categorical buckets
✅ These are safe to run either before or after partitioning the data for training, since they do not learn anything from the data.
🔄 Later, we will introduce pipeline feature engineering, which includes transformations that should be fit on the training data and then applied to validation/test data using the same learned parameters. Some examples include:
- Encoding methods such as creating dummy variables (one-hot encoding)
- Scaling such as standard or min/max transformations
- Imputation using mean or median
⚠️ These should be done inside a pipeline to avoid data leakage, where information from the test set unintentionally influences the model.
In this example, the Naive Bayes model works best when continuous predictors are converted into bins; binning is one example of simple feature engineering.
For simplicity, before introducing pipelines, we will create dummy variables on the entire dataset. To preserve flexibility, we will keep a copy of the original bank_df (before binning and dummy variables) so we can reuse the raw data later.
bank_df2 = bank_df.copy()
bank_df2['Education'] = bank_df2['Education'].astype('category')
bank_df2['Age'] = pd.cut(bank_df2['Age'], 5, labels=range(1, 6)).astype('category')
bank_df2['Experience'] = pd.cut(bank_df2['Experience'], 10, labels=range(1, 11)).astype('category')
bank_df2['Income'] = pd.cut(bank_df2['Income'], 5, labels=range(1, 6)).astype('category')
bank_df2['CCAvg'] = pd.cut(bank_df2['CCAvg'], 6, labels=range(1, 7)).astype('category')
bank_df2['Mortgage'] = pd.cut(bank_df2['Mortgage'], 10, labels=range(1, 11)).astype('category')
bank_df2 = pd.get_dummies(bank_df2, prefix_sep='_')
bank_df2.head()
| Family | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | Age_1 | Age_2 | Age_3 | Age_4 | ... | Mortgage_1 | Mortgage_2 | Mortgage_3 | Mortgage_4 | Mortgage_5 | Mortgage_6 | Mortgage_7 | Mortgage_8 | Mortgage_9 | Mortgage_10 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 0 | 1 | 0 | 0 | 0 | True | False | False | False | ... | True | False | False | False | False | False | False | False | False | False |
| 1 | 3 | 0 | 1 | 0 | 0 | 0 | False | False | True | False | ... | True | False | False | False | False | False | False | False | False | False |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | False | True | False | False | ... | True | False | False | False | False | False | False | False | False | False |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | False | True | False | False | ... | True | False | False | False | False | False | False | False | False | False |
| 4 | 4 | 0 | 0 | 0 | 0 | 1 | False | True | False | False | ... | True | False | False | False | False | False | False | False | False | False |
5 rows × 45 columns
Split the data into training and test sets
Next, we separate the predictors from the target variable Personal_Loan, then create a training set (70%) and a test set (30%).
X = bank_df2.drop(columns=['Personal_Loan'])
y = bank_df2['Personal_Loan']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=SEED)
print('Training set:', X_train.shape)
print('Test set:', X_test.shape)
Training set: (3500, 44) Test set: (1500, 44)
Manual Voting Classifier¶
Fit several base learner models
To demonstrate a manual ensemble, we will create 4 models:
- LogisticRegression
- KNeighborsClassifier
- DecisionTreeClassifier
- MultinomialNB
💡Parameter choices for each base learner were intentionally selected for demonstration purposes.💡
lr = LogisticRegression(penalty='l2', C=0.1, solver='liblinear', random_state=SEED)
lr.fit(X_train, y_train)
title1 = 'Logistic Regression Base Learner'
lr_pred = lr.predict(X_test)
lr_metrics = evaluate_model(y_test, lr_pred, beta=2, model_name=title1)
lr_metrics
| Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|
| Logistic Regression Base Learner | 0.950667 | 0.921348 | 0.550336 | 0.689076 | 0.598540 |
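The evaluate_model function is a helper imported from src.utils.helpers. As a rough sketch only (the actual helper may differ), it likely assembles standard sklearn metrics into a one-row DataFrame; the name evaluate_model_sketch below is a hypothetical stand-in:
# Hypothetical sketch of the evaluate_model helper (the real one lives in src.utils.helpers)
from sklearn.metrics import accuracy_score, precision_score, f1_score
def evaluate_model_sketch(y_true, y_pred, beta=2, model_name='Model'):
    # Assemble standard classification metrics into a one-row DataFrame indexed by model name
    return pd.DataFrame({
        'Accuracy': [accuracy_score(y_true, y_pred)],
        'Precision': [precision_score(y_true, y_pred)],
        'Recall': [recall_score(y_true, y_pred)],
        'F1': [f1_score(y_true, y_pred)],
        f'F{beta}': [fbeta_score(y_true, y_pred, beta=beta)],
    }, index=[model_name])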
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
title2 = 'KNN Base Learner'
knn_pred = knn.predict(X_test)
knn_metrics = evaluate_model(y_test, knn_pred, beta=2, model_name=title2)
knn_metrics
| Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|
| KNN Base Learner | 0.933333 | 0.929825 | 0.355705 | 0.514563 | 0.405819 |
dt = DecisionTreeClassifier(max_depth=3, random_state=SEED)
dt.fit(X_train, y_train)
title3 = 'Decision Tree Base Learner'
dt_pred = dt.predict(X_test)
dt_metrics = evaluate_model(y_test, dt_pred, beta=2, model_name=title3)
dt_metrics
| Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|
| Decision Tree Base Learner | 0.956000 | 1.000000 | 0.557047 | 0.715517 | 0.611193 |
nb = MultinomialNB(alpha=1.0)
nb.fit(X_train, y_train)
title4 = 'Naive Bayes Base Learner'
nb_pred = nb.predict(X_test)
nb_metrics = evaluate_model(y_test, nb_pred, beta=2, model_name=title4)
nb_metrics
| Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|
| Naive Bayes Base Learner | 0.912667 | 0.562500 | 0.543624 | 0.552901 | 0.547297 |
Consolidate evaluation metrics
Let's consolidate the evaluation metrics for all 4 base learners. For our manual ensemble, we will focus on the F2 evaluation metric.
results_df = pd.concat([
lr_metrics,
knn_metrics,
dt_metrics,
nb_metrics
])
results_df
| Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|
| Logistic Regression Base Learner | 0.950667 | 0.921348 | 0.550336 | 0.689076 | 0.598540 |
| KNN Base Learner | 0.933333 | 0.929825 | 0.355705 | 0.514563 | 0.405819 |
| Decision Tree Base Learner | 0.956000 | 1.000000 | 0.557047 | 0.715517 | 0.611193 |
| Naive Bayes Base Learner | 0.912667 | 0.562500 | 0.543624 | 0.552901 | 0.547297 |
Create a results table with voting predictions
This results table will be used to evaluate the manual ensemble for hard and soft voting.
models = {
'lr': lr,
'knn': knn,
'dt': dt,
'nb': nb
}
(pred_cols, prob_cols, result) = build_manual_ensemble_results(models, X_test, y_test)
result.head(10)
| actual | lr_pred | lr_prob | knn_pred | knn_prob | dt_pred | dt_prob | nb_pred | nb_prob | majority | average_prob | average_pred | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.020384 | 0 | 0.000000 | 0 | 0.007675 | 0 | 0.007630 | 0 | 0.008922 | 0 |
| 1 | 0 | 0 | 0.002694 | 0 | 0.000000 | 0 | 0.007675 | 0 | 0.000043 | 0 | 0.002603 | 0 |
| 2 | 0 | 0 | 0.003346 | 0 | 0.000000 | 0 | 0.007675 | 0 | 0.000075 | 0 | 0.002774 | 0 |
| 3 | 0 | 0 | 0.038275 | 0 | 0.000000 | 0 | 0.259740 | 0 | 0.060384 | 0 | 0.089600 | 0 |
| 4 | 0 | 0 | 0.036118 | 0 | 0.000000 | 0 | 0.007675 | 0 | 0.006894 | 0 | 0.012672 | 0 |
| 5 | 0 | 0 | 0.029258 | 0 | 0.000000 | 0 | 0.007675 | 0 | 0.000263 | 0 | 0.009299 | 0 |
| 6 | 0 | 0 | 0.017519 | 0 | 0.000000 | 0 | 0.007675 | 0 | 0.002855 | 0 | 0.007012 | 0 |
| 7 | 0 | 0 | 0.185223 | 0 | 0.000000 | 0 | 0.259740 | 0 | 0.093892 | 0 | 0.134714 | 0 |
| 8 | 0 | 1 | 0.542300 | 0 | 0.000000 | 0 | 0.259740 | 1 | 0.589052 | 0 | 0.347773 | 0 |
| 9 | 0 | 1 | 0.506497 | 0 | 0.333333 | 0 | 0.259740 | 1 | 0.681355 | 0 | 0.445231 | 0 |
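The build_manual_ensemble_results function is another helper from src.utils.helpers. A minimal sketch of the logic it presumably applies is shown below (the _sketch suffix marks it as hypothetical); note the table above is consistent with a strict-majority hard vote (ties go to the negative class) and a 0.50 cutoff on the averaged probabilities:
# Assumed core logic of build_manual_ensemble_results
def build_manual_ensemble_results_sketch(models, X, y):
    result = pd.DataFrame({'actual': np.asarray(y)})
    pred_cols, prob_cols = [], []
    for name, model in models.items():
        result[f'{name}_pred'] = model.predict(X)
        result[f'{name}_prob'] = model.predict_proba(X)[:, 1]
        pred_cols.append(f'{name}_pred')
        prob_cols.append(f'{name}_prob')
    # Hard voting: strict majority of class labels (ties resolve to 0)
    result['majority'] = (result[pred_cols].mean(axis=1) > 0.5).astype(int)
    # Soft voting: average the positive-class probabilities, then apply the 0.50 cutoff
    result['average_prob'] = result[prob_cols].mean(axis=1)
    result['average_pred'] = (result['average_prob'] >= 0.5).astype(int)
    return pred_cols, prob_cols, result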
CP4_Ensembles
For CP4 credit, display a random sample of 5 rows where actual = 1, then calculate the hard-voting majority, average_prob, and soft-voting average_pred on Canvas. The random_state and the missing model predictions/probabilities will be set during class.
cols = ['actual','lr_pred','lr_prob','knn_pred','knn_prob','dt_pred','dt_prob','nb_pred','nb_prob']
Compare the individual models to the manual ensemble predictions
Below we compare each model and the two manual ensemble methods:
- Majority vote using class labels
- Average probability using predicted probabilities
model_map = {
'Logistic Regression': 'lr_pred',
'k-Nearest Neighbors': 'knn_pred',
'Decision Tree': 'dt_pred',
'Naive Bayes': 'nb_pred',
'Manual Majority Vote': 'majority',
'Manual Average Probability': 'average_pred',
}
summary = evaluate_manual_ensemble_models(result, model_map)
summary.sort_values(by='F2', ascending=False)
| Model | Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|---|
| 2 | Decision Tree | 0.956000 | 1.000000 | 0.557047 | 0.715517 | 0.611193 |
| 5 | Manual Average Probability | 0.954000 | 0.965116 | 0.557047 | 0.706383 | 0.608504 |
| 0 | Logistic Regression | 0.950667 | 0.921348 | 0.550336 | 0.689076 | 0.598540 |
| 3 | Naive Bayes | 0.912667 | 0.562500 | 0.543624 | 0.552901 | 0.547297 |
| 4 | Manual Majority Vote | 0.944667 | 0.958333 | 0.463087 | 0.624434 | 0.516467 |
| 1 | k-Nearest Neighbors | 0.933333 | 0.929825 | 0.355705 | 0.514563 | 0.405819 |
Based on the combined results, the Decision Tree, Logistic Regression, and Naive Bayes models are all performing at a similar level, while the k-Nearest Neighbors model is clearly underperforming.
Because ensemble methods combine predictions across models, including a weaker model like k-Nearest Neighbors can negatively impact overall performance. This is especially noticeable in the manual voting approach as the majority vote method performs worse than the top individual models.
👉 Adding more models does not always improve an ensemble—especially when those models are weak or contribute noisy predictions.
Ensemble methods like voting classifiers tend to perform best when:
- ✅ Individual models have similar performance levels
- 🔀 Models make different types of errors (i.e., they are diverse)
Remove k-Nearest Neighbors model from the manual ensemble
Now, we will rerun the manual ensemble results without KNN.
models2 = {
'lr': lr,
'dt': dt,
'nb': nb
}
(pred_cols2, prob_cols2, result2) = build_manual_ensemble_results(models2, X_test, y_test)
result2.head(10)
| actual | lr_pred | lr_prob | dt_pred | dt_prob | nb_pred | nb_prob | majority | average_prob | average_pred | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.020384 | 0 | 0.007675 | 0 | 0.007630 | 0 | 0.011896 | 0 |
| 1 | 0 | 0 | 0.002694 | 0 | 0.007675 | 0 | 0.000043 | 0 | 0.003471 | 0 |
| 2 | 0 | 0 | 0.003346 | 0 | 0.007675 | 0 | 0.000075 | 0 | 0.003698 | 0 |
| 3 | 0 | 0 | 0.038275 | 0 | 0.259740 | 0 | 0.060384 | 0 | 0.119466 | 0 |
| 4 | 0 | 0 | 0.036118 | 0 | 0.007675 | 0 | 0.006894 | 0 | 0.016896 | 0 |
| 5 | 0 | 0 | 0.029258 | 0 | 0.007675 | 0 | 0.000263 | 0 | 0.012398 | 0 |
| 6 | 0 | 0 | 0.017519 | 0 | 0.007675 | 0 | 0.002855 | 0 | 0.009349 | 0 |
| 7 | 0 | 0 | 0.185223 | 0 | 0.259740 | 0 | 0.093892 | 0 | 0.179618 | 0 |
| 8 | 0 | 1 | 0.542300 | 0 | 0.259740 | 1 | 0.589052 | 1 | 0.463698 | 0 |
| 9 | 0 | 1 | 0.506497 | 0 | 0.259740 | 1 | 0.681355 | 1 | 0.482531 | 0 |
Compare the manual ensemble predictions without KNN
Below we compare each model and the two manual ensemble methods:
model_map2 = {
'Logistic Regression': 'lr_pred',
'Decision Tree': 'dt_pred',
'Naive Bayes': 'nb_pred',
'Manual Majority Vote': 'majority',
'Manual Average Probability': 'average_pred',
}
summary2 = evaluate_manual_ensemble_models(result2, model_map2)
summary2.sort_values(by='F2', ascending=False)
| Model | Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|---|
| 4 | Manual Average Probability | 0.959333 | 0.958333 | 0.617450 | 0.751020 | 0.664740 |
| 3 | Manual Majority Vote | 0.952000 | 0.923077 | 0.563758 | 0.700000 | 0.611354 |
| 1 | Decision Tree | 0.956000 | 1.000000 | 0.557047 | 0.715517 | 0.611193 |
| 0 | Logistic Regression | 0.950667 | 0.921348 | 0.550336 | 0.689076 | 0.598540 |
| 2 | Naive Bayes | 0.912667 | 0.562500 | 0.543624 | 0.552901 | 0.547297 |
By combining only the stronger models (Decision Tree, Logistic Regression, and Naive Bayes), the manual ensemble, especially the average probability (soft voting) method, achieved the highest F2 score, outperforming all individual models.
👉 Strong ensembles are built not just by adding more models, but by carefully selecting models that contribute meaningful and complementary predictions.
Recreate Ensemble using VotingClassifier
Creating the voting ensemble manually was mainly for illustration purposes; in practice, you should use sklearn's ensemble method VotingClassifier. Let's recreate the same classifier using the Decision Tree, Logistic Regression, and Naive Bayes base learners we created previously.
estimators = [
('lr', lr),
('dt', dt),
('nb', nb)
]
vc_hard = VotingClassifier(estimators=estimators, voting='hard')
vc_hard.fit(X_train, y_train)
vc_hard_pred = vc_hard.predict(X_test)
vc_soft = VotingClassifier(estimators=estimators, voting='soft')
vc_soft.fit(X_train, y_train)
vc_soft_pred = vc_soft.predict(X_test)
vc_hard_metrics = evaluate_model(y_test, vc_hard_pred, beta=2, model_name='VotingClassifier Hard')
vc_soft_metrics = evaluate_model(y_test, vc_soft_pred, beta=2, model_name='VotingClassifier Soft')
pd.concat([vc_hard_metrics,vc_soft_metrics]).sort_values(by='F2', ascending=False)
| Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|
| VotingClassifier Soft | 0.959333 | 0.958333 | 0.617450 | 0.751020 | 0.664740 |
| VotingClassifier Hard | 0.952000 | 0.923077 | 0.563758 | 0.700000 | 0.611354 |
Key VotingClassifier Parameters¶
Below are the parameters for VotingClassifier in sklearn.
estimators
List of (name, model) tuples for the base estimators.
voting
Determines how the ensemble combines model outputs.
"hard"→ majority vote using predicted class labels"soft"→ average predicted probabilities
weights
Optional list of numeric weights. Models with larger weights have more influence in the final vote.
n_jobs
Number of CPU cores used in parallel when fitting the estimators.
- None → default behavior
- -1 → use all available cores
flatten_transform
Used only when voting='soft' and calling .transform().
- True → flatten output into a 2D array
- False → keep a more structured output format
verbose
If True, displays progress information while fitting.
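As an aside, the weights parameter can tilt the vote toward stronger base learners. A quick illustrative example (weights chosen arbitrarily, not tuned):
# Illustrative only: give the Decision Tree twice the vote of the other base learners
vc_weighted = VotingClassifier(estimators=estimators, voting='soft', weights=[1, 2, 1], n_jobs=-1)
vc_weighted.fit(X_train, y_train)
evaluate_model(y_test, vc_weighted.predict(X_test), beta=2, model_name='VotingClassifier Weighted')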
Manual Stacking Classifier¶
Stacking uses the outputs of the base models as the inputs to a second-level model called a meta-learner.
In this step, we will utilize all 4 of the base learners to generate predictions and then fit a new Logistic Regression model on those predictions.
First, we will evaluate the Logistic Regression coefficients for the meta-learner.
train_result = pd.DataFrame({
'actual': y_train,
'lr_pred': lr.predict(X_train),
'knn_pred': knn.predict(X_train),
'dt_pred': dt.predict(X_train),
'nb_pred': nb.predict(X_train),
})
params = {'penalty': 'l2', 'C': 1e42, 'solver': 'liblinear', 'random_state': SEED}
msc = LogisticRegression(**params)
msc.fit(train_result[pred_cols], y_train)
pd.DataFrame({'Coefficient': [msc.intercept_[0]] + list(msc.coef_[0])}, index=['Intercept'] + pred_cols)
| Coefficient | |
|---|---|
| Intercept | -4.062824 |
| lr_pred | 2.547507 |
| knn_pred | 5.640015 |
| dt_pred | 11.789058 |
| nb_pred | 0.729674 |
As we learned in the Logistic Regression lecture and tutorial, each coefficient reflects the impact of a base model's prediction on the log(odds) of predicting the positive class.
Intercept (-4.063)
Baseline log-odds when all model predictions are 0.
→ Indicates a strong bias toward the negative class unless the models vote otherwise.
dt_pred (11.789) 🚀
The Decision Tree has the strongest influence on the final prediction.
→ When this model predicts 1, it dramatically increases the odds of belonging to the positive class.
knn_pred (5.640)
The KNN model also has a strong positive contribution, but much less than the Decision Tree.
lr_pred (2.548)
The Logistic Regression base model contributes moderately to the final decision.
nb_pred (0.730)
The Naive Bayes model has the weakest influence, suggesting it adds limited value to the ensemble.
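Because these are log-odds coefficients, exponentiating them yields odds ratios, which can be easier to interpret:
# Convert log-odds coefficients to odds ratios
pd.DataFrame({'Odds_Ratio': np.exp([msc.intercept_[0]] + list(msc.coef_[0]))},
             index=['Intercept'] + pred_cols)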
test_result = pd.DataFrame({
'actual': y_test,
'lr_pred': lr.predict(X_test),
'knn_pred': knn.predict(X_test),
'dt_pred': dt.predict(X_test),
'nb_pred': nb.predict(X_test),
})
# Use predicted probabilities so the 0.50 cutoff is meaningful (predict() already returns class labels)
msc_pred = (msc.predict_proba(test_result[pred_cols])[:, 1] >= 0.50).astype(int)
msc_metrics = evaluate_model(y_test, msc_pred, beta=2, model_name='Manual Stacking')
msc_metrics.sort_values(by='F2', ascending=False)
| Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|
| Manual Stacking | 0.961333 | 0.959596 | 0.637584 | 0.766129 | 0.683453 |
Recreate Ensemble using StackingClassifier
Like we did with the manual voting classifier, let's now recreate the same stacking classifier using sklearn's ensemble method StackingClassifier.
estimators = [
('lr', lr),
('knn', knn),
('dt', dt),
('nb', nb)
]
sc = StackingClassifier(estimators=estimators,
final_estimator=LogisticRegression(**params),
stack_method='predict',
cv=5)
sc.fit(X_train, y_train)
sc_pred = sc.predict(X_test)
sc_metrics = evaluate_model(y_test, sc_pred, beta=2, model_name='StackingClassifier')
sc_metrics
| Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|
| StackingClassifier | 0.952667 | 0.923913 | 0.570470 | 0.705394 | 0.617733 |
Why are some metrics lower with StackingClassifier?
When comparing our manual stacking approach to sklearn’s StackingClassifier, we can see that several evaluation metrics are lower when using the built-in implementation.
This happens because StackingClassifier uses a more realistic and less biased training process for the meta-model.
What’s Different? 🧠
🔁 Cross-validated predictions (sklearn) The meta-model is trained using out-of-fold predictions from the base models. This means each prediction is made on data the base model has not seen during training.
⚠️ In-sample predictions (manual approach) In our manual stacking method, the meta-model was trained on predictions generated from the same data used to train the base models. This introduces data leakage, making the model appear stronger than it actually is.
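To remove that leakage from the manual approach, the meta-learner could instead be trained on out-of-fold predictions. A sketch using cross_val_predict (already imported above), which mirrors what StackingClassifier does internally with stack_method='predict':
# Sketch: train the manual meta-learner on out-of-fold predictions to avoid leakage
oof_preds = pd.DataFrame({
    f'{name}_pred': cross_val_predict(model, X_train, y_train, cv=5)
    for name, model in models.items()
})
msc_oof = LogisticRegression(**params).fit(oof_preds, y_train)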
Key StackingClassifier Parameters¶
Below are the parameters for StackingClassifier in sklearn.
estimators
List of (name, model) tuples for the base estimators.
final_estimator
The meta-model that learns how to combine the base estimators.
cv
Cross-validation strategy used to create out-of-fold predictions for training the meta-model.
- Integer (for number of folds)
- cross-validation splitter object
- iterable of train/test splits
- None → default cross-validation behavior
stack_method
How the base estimators generate inputs for the meta-model.
"auto"→ automatically choose an available method"predict_proba"→ use predicted probabilities"decision_function"→ use decision scores"predict"→ use predicted class labels
n_jobs
Number of CPU cores used in parallel.
- None → default behavior
- -1 → use all available cores
passthrough
Controls whether the original predictor variables are included alongside the base-model outputs.
- False → meta-learner uses only the base-model outputs
- True → meta-learner uses both the original features and the base-model outputs
verbose
Controls how much fitting progress is printed.
- 0 → no additional output
- Higher values → more progress information
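For example, a probability-based stack that also passes the original features through to the meta-learner might look like this (illustrative settings only):
# Illustrative: stack on predicted probabilities and pass through the original features
sc_proba = StackingClassifier(estimators=estimators,
                              final_estimator=LogisticRegression(max_iter=1000, random_state=SEED),
                              stack_method='predict_proba',
                              passthrough=True,
                              cv=5,
                              n_jobs=-1)
sc_proba.fit(X_train, y_train)
evaluate_model(y_test, sc_proba.predict(X_test), beta=2, model_name='StackingClassifier Proba')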
Custom scorer evaluation
So far, we have been comparing our models using the F2 score, which makes sense for the Personal Loan example: the goal is to capture as many new customers as possible in the next marketing campaign while still maintaining some precision in our outreach efforts. When using methods like cross_val_score or GridSearchCV, you can easily specify a scoring parameter using predefined scorer names. Check out sklearn's user guide for the list of string-name scorers.
However, not all metrics are available by name (the F2 score, for example), and you may often want to customize your evaluation metric to accomplish specific business goals. To use a custom evaluation metric with these methods, you need sklearn's make_scorer, which wraps your function in a callable object that returns a scalar score. It is possible to pass sklearn's fbeta_score directly into make_scorer, but writing our own function lets us demonstrate a custom scorer with a known calculation that we can compare against prior evaluation results.
The first step is to create our new scoring function.
def custom_f2_score(y_true, y_pred):
cm = confusion_matrix(y_true, y_pred)
fp = cm[0, 1]
fn = cm[1, 0]
tp = cm[1, 1]
# Handle division by zero
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
# F2 calculation (beta = 2)
beta = 2
if precision + recall == 0:
return 0
f2 = (1 + beta**2) * (precision * recall) / ((beta**2 * precision) + recall)
return f2
## Validate the score using the original Decision Tree preds
print(f"Custom F2 score: {custom_f2_score(y_test, dt_pred):.4f}")
Custom F2 score: 0.6112
Custom Scoring Functions: Allowed Inputs
When creating a custom scoring function (instead of using a built-in sklearn metric), your function must follow a specific structure so it can work with tools like cross-validation and Grid Search.
y_true
The actual (ground truth) target values.
- Typically a pandas Series or numpy array
- Contains the true class labels
y_pred
The predicted values from the model.
- By default, this comes from model.predict()
- Contains predicted class labels (e.g., 0/1)
👉 Minimum Required Signature
def custom_metric(y_true, y_pred):
return score
- Must return a single scalar value (not a list or array)
- Higher values should indicate better performance (unless handled with greater_is_better=False)
When Using Probabilities or Scores 📊
If your metric requires probabilities or decision scores, you must configure make_scorer accordingly:
- needs_proba=True → function receives predicted probabilities
- needs_threshold=True → function receives decision scores
In this case, your function will look like:
def custom_metric(y_true, y_pred_proba):
return score
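For instance, a metric such as sklearn's average_precision_score operates on probabilities, so it would be wrapped with needs_proba=True:
# Probability-based scorer: the wrapped function receives predict_proba outputs, not labels
from sklearn.metrics import average_precision_score
ap_scorer = make_scorer(average_precision_score, needs_proba=True)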
Instantiate make_scorer
Next, we will use the custom_f2_score to create our new make_scorer object.
f2_scorer = make_scorer(custom_f2_score, greater_is_better=True)
Build a RandomForestClassifier
In the Ensembles lecture, we mentioned two powerful “out-of-the-box” ensemble methods:
- 🌲 Random Forest → A strong bagging-based model that builds multiple decision trees and aggregates their results
- ⚡ AdaBoost → A popular boosting model that sequentially improves performance by focusing on previously misclassified observations
First, we will build and evaluate a Random Forest model and demonstrate how to use sklearn's cross_val_score with our custom F2 scorer.
rf = RandomForestClassifier(max_depth=None, n_estimators=300, random_state=SEED, n_jobs=-1)
scores = cross_val_score(rf, X_train, y_train, cv=5, scoring=f2_scorer)
print("F2 Scores per fold:", scores)
print(f"Average Cross-Validation F2 Score: {scores.mean():.4f}")
F2 Scores per fold: [0.8 0.80188679 0.80246914 0.80696203 0.82043344] Average Cross-Validation F2 Score: 0.8064
Cross-Validation Reminder 🔁
When using cross_val_score, you should not pass in a fitted model.
- ❌ Do NOT use a model that has already been trained with .fit()
- ✅ Always pass in an unfitted model instance
This is because cross_val_score will:
- Split the data into folds
- Train the model separately on each training fold
- Evaluate performance on each validation fold
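Note that for classifiers, an integer cv already uses stratified folds (without shuffling). If you want shuffled, stratified folds, you can pass a StratifiedKFold splitter (imported above) explicitly, for example:
# Preserve the class ratio in every fold, with shuffling (illustrative; scores may differ slightly)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
scores_skf = cross_val_score(rf, X_train, y_train, cv=skf, scoring=f2_scorer)
print(f"Average Stratified F2 Score: {scores_skf.mean():.4f}")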
title5 = 'Random Forest Model'
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
rf_metrics = evaluate_model(y_test, rf_pred, beta=2, model_name=title5)
rf_metrics
| Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|
| Random Forest Model | 0.968667 | 0.925000 | 0.744966 | 0.825279 | 0.775140 |
Key RandomForestClassifier Parameters¶
n_estimators
Number of trees in the forest.
- Integer (e.g., 100, 200, 500)
- More trees → typically better performance but increased computation time
Larger values reduce variance and improve stability.
max_depth
Maximum depth of each decision tree.
- None → trees grow until pure or until the minimum-samples threshold is reached
- Integer (e.g., max_depth=5)
Controls model complexity where smaller depth = less overfitting.
min_samples_split
Minimum number of samples required to split an internal node.
- Integer → exact number (e.g., 10)
- Float → fraction of dataset (e.g., 0.05)
Larger values = fewer splits → simpler trees.
min_samples_leaf
Minimum number of samples required at a leaf node.
- Integer → exact number (e.g., 5)
- Float → fraction of dataset
Helps prevent very small leaves and reduces overfitting.
max_features
Number of features considered when looking for the best split.
"sqrt"→ square root of total features (default for classification)"log2"→ log base 2 of features- Integer or float → custom number
Introduces randomness and helps reduce correlation between trees.
bootstrap
Whether bootstrap sampling is used.
- True → sample with replacement (default)
- False → use the entire dataset for each tree
Core concept of bagging.
random_state
Controls randomness for reproducibility.
- Integer (e.g., 1, 42)
Ensures consistent results across runs.
n_jobs
Number of CPU cores used for training.
- -1 → use all available cores
- Integer → specific number of cores
Speeds up training for large forests.
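The Random Forest can be tuned the same way we tune AdaBoost below. An illustrative (deliberately small) grid using the custom F2 scorer:
# Illustrative grid only; parameter values are not tuned recommendations
rf_param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10],
    'max_features': ['sqrt', 'log2'],
}
gs_rf = GridSearchCV(RandomForestClassifier(random_state=SEED, n_jobs=-1),
                     param_grid=rf_param_grid, cv=5, scoring=f2_scorer, n_jobs=-1)
# gs_rf.fit(X_train, y_train)  # then inspect gs_rf.best_params_ and gs_rf.best_score_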
Tune an AdaBoostClassifier
Next, we will build and perform some basic tuning of an AdaBoost model and demonstrate how to use sklearn's GridSearchCV with our custom F2 scorer. We will use the original Decision Tree base learner as the estimator for the boosting model.
ada = AdaBoostClassifier(
estimator=dt,
random_state=SEED
)
param_grid = {
'n_estimators': [50, 100, 200],
'learning_rate': [0.01, 0.1, 1.0, 10.0],
'estimator__max_depth': [1, 3, 5]
}
gs_ada = GridSearchCV(estimator=ada, param_grid=param_grid, cv=5, scoring=f2_scorer, n_jobs=-1)
gs_ada.fit(X_train, y_train)
print('Best AdaBoost Parameters:', gs_ada.best_params_)
print(f'Best AdaBoost Cross-Validation F2 Score: {gs_ada.best_score_:.4f}')
Best AdaBoost Parameters: {'estimator__max_depth': 5, 'learning_rate': 1.0, 'n_estimators': 100}
Best AdaBoost Cross-Validation F2 Score: 0.8537
Display all evaluation metrics
For comparison purposes, let's display all of the evaluation metrics we have gathered so far. Keep in mind the earlier base learners were not tuned or designed to be strong models on their own.
title6 = 'AdaBoost Model'
ab_pred = gs_ada.predict(X_test)
ab_metrics = evaluate_model(y_test, ab_pred, beta=2, model_name=title6)
results_df2 = pd.concat([lr_metrics, knn_metrics, dt_metrics, nb_metrics,
vc_hard_metrics, vc_soft_metrics, sc_metrics,
rf_metrics, ab_metrics])
results_df2.sort_values(by='F2', ascending=False)
| Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|
| AdaBoost Model | 0.970000 | 0.912698 | 0.771812 | 0.836364 | 0.796399 |
| Random Forest Model | 0.968667 | 0.925000 | 0.744966 | 0.825279 | 0.775140 |
| VotingClassifier Soft | 0.959333 | 0.958333 | 0.617450 | 0.751020 | 0.664740 |
| StackingClassifier | 0.952667 | 0.923913 | 0.570470 | 0.705394 | 0.617733 |
| VotingClassifier Hard | 0.952000 | 0.923077 | 0.563758 | 0.700000 | 0.611354 |
| Decision Tree Base Learner | 0.956000 | 1.000000 | 0.557047 | 0.715517 | 0.611193 |
| Logistic Regression Base Learner | 0.950667 | 0.921348 | 0.550336 | 0.689076 | 0.598540 |
| Naive Bayes Base Learner | 0.912667 | 0.562500 | 0.543624 | 0.552901 | 0.547297 |
| KNN Base Learner | 0.933333 | 0.929825 | 0.355705 | 0.514563 | 0.405819 |
Feature Engineering Workflow
Previously, we performed all feature engineering steps (including creating dummy variables) directly on the full dataset before splitting into training and test sets.
While this approach is useful for learning and quick experimentation, it is not the recommended workflow for building final models.
To demonstrate how you should prepare your final model for PA4, we will:
- Perform simple feature engineering inside a feature_engineering(df) function
- Use a ColumnTransformer to handle dummy variable creation (one-hot encoding)
- Integrate everything into a more structured and reusable pipeline
Previously, we created bins for the numerical variables and then created dummy variables, but that was primarily for our Naive Bayes model. Tree-based models don't require that preprocessing, so we will instead demonstrate the following:
- Engineer one new variable as the ratio of the Income and Family variables
- Convert Education to a category, since it is numeric and the preprocessor would not otherwise detect it as a data type requiring dummy variables
For the purposes of this tutorial only, we have included a fit parameter to choose whether build_model() returns a fitted model. This lets us reuse the pipeline for cross-validation.
❌ Do NOT do this for PA4 ❌
def feature_engineering(df):
df = df.copy()
df['Income_Per_Family'] = df['Income'] / df['Family'].astype(float)
df['Education'] = df['Education'].astype('category')
education_labels = {1: 'Undergrad', 2: 'Graduate', 3: 'Advanced/Professional'}
df['Education'] = df['Education'].cat.rename_categories(education_labels)
return df
def build_model(df, target, fit=True):
df_fe = feature_engineering(df)
X = df_fe.drop(columns=[target])
y = df_fe[target]
categorical_cols = X.select_dtypes(include=["object", "category", "bool"]).columns.tolist()
numeric_cols = X.select_dtypes(exclude=["object", "category", "bool"]).columns.tolist()
preprocessor = ColumnTransformer(
transformers=[("cat",
OneHotEncoder(drop='first',
handle_unknown="ignore",
sparse_output=False,
dtype='int'),
categorical_cols,
),
("num", "passthrough", numeric_cols),
]
)
base_estimator = DecisionTreeClassifier(max_depth=5, random_state=SEED)
model = Pipeline([
("preprocessor", preprocessor),
("model", AdaBoostClassifier(estimator=base_estimator,
learning_rate=1.0,
n_estimators=100,
random_state=SEED)),
])
if fit:
model.fit(X, y)
return model
Key OneHotEncoder Parameters¶
categories
Specifies the categories for each feature.
"auto"→ Automatically determine categories from the data (default)- List of lists → Manually define category order
drop
Controls whether to drop one category per feature to avoid multicollinearity.
- None → keep all categories (default)
- "first" → drop the first category
- "if_binary" → drop one category only for binary features
sparse_output
Determines the output format.
- True → returns a sparse matrix (default in newer versions)
- False → returns a dense NumPy array
handle_unknown
Specifies how to handle unseen categories during transform.
"error"→ Raise an error (default)"ignore"→ Ignore unknown categories and encode as all zeros
dtype
Controls the data type of the output.
- Default → float64
- Can set to int or float32 if needed
feature_names_out (used with .get_feature_names_out())
Generates readable column names after encoding.
Prepare test data and evaluate AdaBoost Pipeline
In PA4, you will not have access to the hidden test data; the Autograder will handle the feature engineering you define and the model training from your pipeline. However, we do need to apply the feature engineering to a test set so we can evaluate our new AdaBoost Pipeline model.
We will again create a training set (70%) and a test set (30%), but this time we will use the raw bank_df and will not create separate splits for y.
train, test = train_test_split(bank_df, test_size=0.30, random_state=SEED)
print('Training set:', train.shape)
print('Test set:', test.shape)
Training set: (3500, 12) Test set: (1500, 12)
target = 'Personal_Loan'
test_fe = feature_engineering(test)
X_test2 = test_fe.drop(columns=[target])
y_test2 = test_fe[target]
ab_pipe = build_model(train, target)
title7 = 'AdaBoost Pipeline Model'
ab_pipe_pred = ab_pipe.predict(X_test2)
ab_pipe_metrics = evaluate_model(y_test2, ab_pipe_pred, beta=2, model_name=title7)
results_df3 = pd.concat([lr_metrics, knn_metrics, dt_metrics, nb_metrics,
vc_hard_metrics, vc_soft_metrics, sc_metrics,
rf_metrics, ab_metrics, ab_pipe_metrics])
results_df3.sort_values(by='F2', ascending=False)
| Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|
| AdaBoost Pipeline Model | 0.983333 | 0.962687 | 0.865772 | 0.911661 | 0.883562 |
| AdaBoost Model | 0.970000 | 0.912698 | 0.771812 | 0.836364 | 0.796399 |
| Random Forest Model | 0.968667 | 0.925000 | 0.744966 | 0.825279 | 0.775140 |
| VotingClassifier Soft | 0.959333 | 0.958333 | 0.617450 | 0.751020 | 0.664740 |
| StackingClassifier | 0.952667 | 0.923913 | 0.570470 | 0.705394 | 0.617733 |
| VotingClassifier Hard | 0.952000 | 0.923077 | 0.563758 | 0.700000 | 0.611354 |
| Decision Tree Base Learner | 0.956000 | 1.000000 | 0.557047 | 0.715517 | 0.611193 |
| Logistic Regression Base Learner | 0.950667 | 0.921348 | 0.550336 | 0.689076 | 0.598540 |
| Naive Bayes Base Learner | 0.912667 | 0.562500 | 0.543624 | 0.552901 | 0.547297 |
| KNN Base Learner | 0.933333 | 0.929825 | 0.355705 | 0.514563 | 0.405819 |
Display Feature Importances
The AdaBoost Pipeline model performed best across all models we have evaluated, so let's explore the most influential features in the model using its feature importances. We will demonstrate how to pull feature_importances_ from the pipeline, as well as from a standard model object, by training another Random Forest model without the pipeline.
rf2 = RandomForestClassifier(max_depth=None, n_estimators=300, random_state=SEED, n_jobs=-1)
df_fe = feature_engineering(bank_df)
df_fe = pd.get_dummies(df_fe, prefix_sep='_')
X_train2 = df_fe.drop(columns=[target])
y_train2 = df_fe[target]
rf2.fit(X_train2, y_train2)
fig, ax = plt.subplots(figsize=(8, 6))
plot_feature_importances(X_train2.columns, rf2.feature_importances_, ax, 'Random Forest Feature Importances')
pipe_features = ab_pipe.named_steps['preprocessor'].get_feature_names_out()
pipe_model = ab_pipe.named_steps['model']
fig, ax = plt.subplots(figsize=(8, 6))
plot_feature_importances(pipe_features, pipe_model.feature_importances_, ax, 'AdaBoost Pipeline Feature Importances')
Plot Scores by Probability Thresholds
Up to this point, we have relied on the .predict() method to generate predictions, which uses the default 0.50 threshold when the model implements .predict_proba(). However, this default threshold is not always optimal, especially when the business objective prioritizes one type of error over another.
Instead of relying on a single cutoff, we can evaluate model performance across a range of probability thresholds. This allows us to better understand how the model behaves as we become more or less strict in assigning the positive class.
Let's now recreate the AdaBoost Pipeline without fitting the model and plot the F2 Score over different probability thresholds.
df_fe = feature_engineering(bank_df)
X = df_fe.drop(columns=[target])
y = df_fe[target]
pipe = build_model(train, target, fit=False)
results = plot_metric_by_threshold_cv(
pipeline=pipe,
X=X,
y=y,
metric=f2_scorer,
metric_name="F2 Score",
cv=5
)
Best threshold: 0.36 Best F2 Score: 0.9346
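The plot_metric_by_threshold_cv function is a helper from src.visuals.make_plots. A minimal sketch of the computation it presumably performs, leaving out the plotting: score out-of-fold probabilities at a range of cutoffs and report the best one.
# Assumed core logic: out-of-fold probabilities scored across candidate thresholds
proba = cross_val_predict(pipe, X, y, cv=5, method='predict_proba')[:, 1]
thresholds = np.arange(0.05, 0.96, 0.01)
f2_scores = [custom_f2_score(y, (proba >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(f2_scores))]
print(f'Best threshold: {best_t:.2f} Best F2 Score: {max(f2_scores):.4f}')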