Chapter 10
Chapter 10 Logistic Regression
Original Code Credit:: Shmueli, Galit; Bruce, Peter C.; Gedeck, Peter; Patel, Nitin R.. Data Mining for Business Analytics Wiley.
Modifications have been made from the original textbook examples due to version changes in library dependencies and/or for clarity.
Download this notebook and data here.
Import Libraries
import os
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from mord import LogisticIT
import matplotlib.pyplot as plt
import seaborn as sns
from dmba import classificationSummary, gainsChart, liftChart
from dmba.metric import AIC_score
import matplotlib
%matplotlib inline
SEED = 1
10.3 Example: Acceptance of Personal Loan
Logistic regression model for loan acceptance (training data)
bank_df = pd.read_csv(os.path.join('data', 'UniversalBank.csv'))
bank_df.drop(columns=['ID', 'ZIP Code'], inplace=True)
bank_df.columns = [c.replace(' ', '_') for c in bank_df.columns]
# Treat education as categorical, convert to dummy variables
bank_df['Education'] = bank_df['Education'].astype('category')
new_categories = {1: 'Undergrad', 2: 'Graduate', 3: 'Advanced/Professional'}
bank_df['Education'] = bank_df['Education'].cat.rename_categories(new_categories)
bank_df2 = pd.get_dummies(bank_df, prefix_sep='_', drop_first=True, dtype=int)
y = bank_df2['Personal_Loan']
X = bank_df2.drop(columns=['Personal_Loan'])
# partition data
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.4, random_state=SEED)
# fit a logistic regression (set penalty=l2 and C=1e42 to avoid regularization)
logit_reg = LogisticRegression(penalty="l2", C=1e42, solver='liblinear')
logit_reg.fit(train_X, train_y)
print('intercept ', logit_reg.intercept_[0])
pd.DataFrame({'coeff': logit_reg.coef_[0]}, index=X.columns).transpose()
intercept -12.560703103343736
Age | Experience | Income | Family | CCAvg | Mortgage | Securities_Account | CD_Account | Online | CreditCard | Education_Graduate | Education_Advanced/Professional | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
coeff | -0.035686 | 0.037234 | 0.058921 | 0.61268 | 0.240926 | 0.001015 | -1.029088 | 3.662056 | -0.679537 | -0.960725 | 4.21121 | 4.36177 |
print('AIC', AIC_score(valid_y, logit_reg.predict(valid_X), df = len(train_X.columns) + 1))
AIC -733.9975169177105
10.4 Evaluating Classification Performance
Propensities for four customers in validation data
logit_reg_pred = logit_reg.predict(valid_X)
logit_reg_proba = logit_reg.predict_proba(valid_X)
logit_result = pd.DataFrame({'actual': valid_y,
'p(0)': [p[0] for p in logit_reg_proba],
'p(1)': [p[1] for p in logit_reg_proba],
'predicted': logit_reg_pred })
# display four different cases
interestingCases = [2764, 932, 2721, 702]
print(logit_result.loc[interestingCases])
actual p(0) p(1) predicted
2764 0 0.976165 0.023835 0
932 0 0.331940 0.668060 1
2721 1 0.031254 0.968746 1
702 1 0.985943 0.014057 0
Training and Validation Confusion Matrices
# training confusion matrix
classificationSummary(train_y, logit_reg.predict(train_X))
# validation confusion matrix
classificationSummary(valid_y, logit_reg.predict(valid_X))
Confusion Matrix (Accuracy 0.9600)
Prediction
Actual 0 1
0 2683 30
1 90 197
Confusion Matrix (Accuracy 0.9600)
Prediction
Actual 0 1
0 1791 16
1 64 129
Cumulative gains chart and decile-wise lift chart for the validation data for Universal Bank loan offer
df = logit_result.sort_values(by=['p(1)'], ascending=False)
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
gainsChart(df.actual, ax=axes[0])
liftChart(df.actual, title=False, ax=axes[1])
plt.tight_layout()
plt.show()
10.5 Logistic Regression for Multi-class Classification
Ordinal and nominal (multi-class) regression in Python
data = pd.read_csv(os.path.join('data', 'accidentsFull.csv'))
outcome = 'MAX_SEV_IR'
predictors = ['ALCHL_I', 'WEATHER_R']
y = data[outcome]
X = data[predictors]
train_X, train_y = X, y
print('Nominal logistic regression')
logit = LogisticRegression(penalty="l2", solver='lbfgs', C=1e24)
logit.fit(X, y)
print(' intercept', logit.intercept_)
print(' coefficients', logit.coef_)
print()
probs = logit.predict_proba(X)
results = pd.DataFrame({
'actual': y, 'predicted': logit.predict(X),
'P(0)': [p[0] for p in probs],
'P(1)': [p[1] for p in probs],
'P(2)': [p[2] for p in probs],
})
print(results.head())
print()
print('Ordinal logistic regression')
logit = LogisticIT(alpha=0)
logit.fit(X, y)
print(' theta', logit.theta_)
print(' coefficients', logit.coef_)
print()
probs = logit.predict_proba(X)
results = pd.DataFrame({
'actual': y, 'predicted': logit.predict(X),
'P(0)': [p[0] for p in probs],
'P(1)': [p[1] for p in probs],
'P(2)': [p[2] for p in probs],
})
results.head()
Nominal logistic regression
intercept [-0.17870851 0.81673248 -0.63802397]
coefficients [[ 0.52491466 0.40469045]
[ 0.15763357 0.16071816]
[-0.68254823 -0.56540861]]
actual predicted P(0) P(1) P(2)
0 1 1 0.490569 0.498933 0.010498
1 0 0 0.554023 0.441483 0.004494
2 0 0 0.554023 0.441483 0.004494
3 0 1 0.490569 0.498933 0.010498
4 0 1 0.393700 0.578117 0.028183
Ordinal logistic regression
theta [-1.06916285 2.77444326]
coefficients [-0.40112008 -0.25174207]
actual | predicted | P(0) | P(1) | P(2) | |
---|---|---|---|---|---|
0 | 1 | 1 | 0.496205 | 0.482514 | 0.021281 |
1 | 0 | 0 | 0.558866 | 0.424510 | 0.016624 |
2 | 0 | 0 | 0.558866 | 0.424510 | 0.016624 |
3 | 0 | 1 | 0.496205 | 0.482514 | 0.021281 |
4 | 0 | 1 | 0.397402 | 0.571145 | 0.031453 |
10.6 Example of Complete Analysis: Predicting Delayed Flights
delays_df = pd.read_csv(os.path.join('data', 'FlightDelays.csv'))
# Create an indicator variable
delays_df['isDelayed'] = [1 if status == 'delayed' else 0
for status in delays_df['Flight Status']]
def createGraph(group, xlabel, axis):
groupAverage = delays_df.groupby([group])['isDelayed'].mean()
if group == 'DAY_WEEK': # rotate so that display starts on Sunday
groupAverage = groupAverage.reindex(index=np.roll(groupAverage.index,1))
groupAverage.index = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
ax = groupAverage.plot.bar(color='C0', ax=axis)
ax.set_ylabel('Average Delay')
ax.set_xlabel(xlabel)
return ax
def graphDepartureTime(xlabel, axis):
temp_df = pd.DataFrame({'CRS_DEP_TIME': delays_df['CRS_DEP_TIME'] // 100,
'isDelayed': delays_df['isDelayed']})
groupAverage = temp_df.groupby(['CRS_DEP_TIME'])['isDelayed'].mean()
ax = groupAverage.plot.bar(color='C0', ax=axis)
ax.set_xlabel(xlabel); ax.set_ylabel('Average Delay')
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(10, 10))
createGraph('DAY_WEEK', 'Day of week', axis=axes[0][0])
createGraph('DEST', 'Destination', axis=axes[0][1])
graphDepartureTime('Departure time', axis=axes[1][0])
createGraph('CARRIER', 'Carrier', axis=axes[1][1])
createGraph('ORIGIN', 'Origin', axis=axes[2][0])
createGraph('Weather', 'Weather', axis=axes[2][1])
plt.tight_layout()
Proportion of delayed flights by each of the six predictors. Time of day is divided into hourly bins
agg = delays_df.groupby(['ORIGIN', 'DAY_WEEK', 'CARRIER']).isDelayed.mean()
agg = agg.reset_index()
# Define the layout of the graph
height_ratios = []
for i, origin in enumerate(sorted(delays_df.ORIGIN.unique())):
height_ratios.append(len(agg[agg.ORIGIN == origin].CARRIER.unique()))
gridspec_kw = {'height_ratios': height_ratios, 'width_ratios': [15, 1]}
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(10, 6),
gridspec_kw = gridspec_kw)
axes[0, 1].axis('off')
axes[2, 1].axis('off')
maxIsDelay = agg.isDelayed.max()
for i, origin in enumerate(sorted(delays_df.ORIGIN.unique())):
data = pd.pivot_table(agg[agg.ORIGIN == origin], values='isDelayed', aggfunc='sum',
index=['CARRIER'], columns=['DAY_WEEK'])
data = data[[7, 1, 2, 3, 4, 5, 6]] # Shift last columns to first
ax = sns.heatmap(data, ax=axes[i][0], vmin=0, vmax=maxIsDelay,
cbar_ax=axes[1][1], cmap=sns.light_palette("navy"))
ax.set_xticklabels(['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'])
if i != 2:
ax.get_xaxis().set_visible(False)
ax.set_ylabel('Airport ' + origin)
plt.show()
Estimated Logistic regression model for delayed flights (based on the training set)
# convert to categorical
delays_df.DAY_WEEK = delays_df.DAY_WEEK.astype('category')
# create hourly bins departure time
delays_df.CRS_DEP_TIME = [round(t / 100) for t in delays_df.CRS_DEP_TIME]
delays_df.CRS_DEP_TIME = delays_df.CRS_DEP_TIME.astype('category')
predictors = ['DAY_WEEK', 'CRS_DEP_TIME', 'ORIGIN', 'DEST', 'CARRIER', 'Weather']
outcome = 'isDelayed'
X = pd.get_dummies(delays_df[predictors], drop_first=True)
y = delays_df[outcome]
classes = ['ontime', 'delayed']
# split into training and validation
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.4, random_state=SEED)
logit_full = LogisticRegression(penalty="l2", C=1e42, solver='liblinear')
logit_full.fit(train_X, train_y)
print('intercept ', logit_full.intercept_[0])
pd.DataFrame({'coeff': logit_full.coef_[0]}, index=X.columns).transpose()
intercept -1.2191068769073172
Weather | DAY_WEEK_2 | DAY_WEEK_3 | DAY_WEEK_4 | DAY_WEEK_5 | DAY_WEEK_6 | DAY_WEEK_7 | CRS_DEP_TIME_7 | CRS_DEP_TIME_8 | CRS_DEP_TIME_9 | ... | ORIGIN_IAD | DEST_JFK | DEST_LGA | CARRIER_DH | CARRIER_DL | CARRIER_MQ | CARRIER_OH | CARRIER_RU | CARRIER_UA | CARRIER_US | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
coeff | 9.325014 | -0.597928 | -0.704795 | -0.798678 | -0.295774 | -1.129236 | -0.135377 | 0.631496 | 0.382262 | -0.365022 | ... | -0.134055 | -0.523642 | -0.545662 | 0.352302 | -0.684517 | 0.743163 | -0.710518 | -0.194203 | 0.315418 | -0.97143 |
1 rows × 33 columns
print('AIC', AIC_score(valid_y, logit_full.predict(valid_X), df = len(train_X.columns) + 1))
AIC 1004.5346225948085
Evaluating performance of all-predictor model on validation
logit_reg_pred = logit_full.predict_proba(valid_X)
full_result = pd.DataFrame({'actual': valid_y,
'p(0)': [p[0] for p in logit_reg_pred],
'p(1)': [p[1] for p in logit_reg_pred],
'predicted': logit_full.predict(valid_X)})
full_result = full_result.sort_values(by=['p(1)'], ascending=False)
# confusion matrix
classificationSummary(full_result.actual, full_result.predicted, class_names=classes)
gainsChart(full_result.actual, figsize=[5, 5])
plt.show()
Confusion Matrix (Accuracy 0.8309)
Prediction
Actual ontime delayed
ontime 705 9
delayed 140 27
Logistic regression model with fewer predictors
delays_df['CRS_DEP_TIME'] = [round(t / 100) for t in delays_df['CRS_DEP_TIME']]
delays_red_df = pd.DataFrame({
'Sun_Mon' : [1 if d in (1, 7) else 0 for d in delays_df.DAY_WEEK],
'Weather' : delays_df.Weather,
'CARRIER_CO_MQ_DH_RU' : [1 if d in ("CO", "MQ", "DH", "RU") else 0
for d in delays_df.CARRIER],
'MORNING' : [1 if d in (6, 7, 8, 9) else 0 for d in delays_df.CRS_DEP_TIME],
'NOON' : [1 if d in (10, 11, 12, 13) else 0 for d in delays_df.CRS_DEP_TIME],
'AFTER2P' : [1 if d in (14, 15, 16, 17, 18) else 0 for d in delays_df.CRS_DEP_TIME],
'EVENING' : [1 if d in (19, 20) else 0 for d in delays_df.CRS_DEP_TIME],
'isDelayed' : [1 if status == 'delayed' else 0 for status in delays_df['Flight Status']],
})
X = delays_red_df.drop(columns=['isDelayed'])
y = delays_red_df['isDelayed']
classes = ['ontime', 'delayed']
# split into training and validation
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.4,
random_state=SEED)
logit_red = LogisticRegressionCV(penalty="l1", solver='liblinear', cv=5)
logit_red.fit(train_X, train_y)
print('intercept ', logit_red.intercept_[0])
pd.DataFrame({'coeff': logit_red.coef_[0]}, index=X.columns).transpose()
intercept -2.509726593729543
Sun_Mon | Weather | CARRIER_CO_MQ_DH_RU | MORNING | NOON | AFTER2P | EVENING | |
---|---|---|---|---|---|---|---|
coeff | 0.625337 | 4.920328 | 1.281507 | 0.0 | 0.0 | 0.0 | 0.0 |
print('AIC', AIC_score(valid_y, logit_red.predict(valid_X), df=len(train_X.columns) + 1))
# confusion matrix
classificationSummary(valid_y, logit_red.predict(valid_X), class_names=classes)
AIC 940.6290360089256
Confusion Matrix (Accuracy 0.8331)
Prediction
Actual ontime delayed
ontime 714 0
delayed 147 20
logit_reg_proba = logit_red.predict_proba(valid_X)
red_result = pd.DataFrame({'actual': valid_y,
'p(0)': [p[0] for p in logit_reg_proba],
'p(1)': [p[1] for p in logit_reg_proba],
'predicted': logit_red.predict(valid_X),
})
red_result = red_result.sort_values(by=['p(1)'], ascending=False)
ax = gainsChart(full_result.actual, color='C1', figsize=[5, 5])
gainsChart(red_result.actual, color='C0', ax=ax)
plt.show()
Logistic regression model for loan acceptance using Statmodels
# add constant column
y = bank_df2['Personal_Loan']
X = bank_df2.drop(columns=['Personal_Loan'])
X = sm.add_constant(X, prepend=True)
# partition data
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.4, random_state=SEED)
# use GLM (general linear model) with the binomial family to fit a logistic regression
logit_reg = sm.GLM(train_y, train_X, family=sm.families.Binomial())
logit_result = logit_reg.fit()
logit_result.summary()
Dep. Variable: | Personal_Loan | No. Observations: | 3000 |
---|---|---|---|
Model: | GLM | Df Residuals: | 2987 |
Model Family: | Binomial | Df Model: | 12 |
Link Function: | Logit | Scale: | 1.0000 |
Method: | IRLS | Log-Likelihood: | -340.15 |
Date: | Mon, 07 Apr 2025 | Deviance: | 680.30 |
Time: | 07:32:46 | Pearson chi2: | 8.10e+03 |
No. Iterations: | 8 | Pseudo R-squ. (CS): | 0.3325 |
Covariance Type: | nonrobust |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
const | -12.5634 | 2.336 | -5.377 | 0.000 | -17.143 | -7.984 |
Age | -0.0354 | 0.086 | -0.412 | 0.680 | -0.204 | 0.133 |
Experience | 0.0369 | 0.086 | 0.431 | 0.666 | -0.131 | 0.205 |
Income | 0.0589 | 0.004 | 15.044 | 0.000 | 0.051 | 0.067 |
Family | 0.6128 | 0.103 | 5.931 | 0.000 | 0.410 | 0.815 |
CCAvg | 0.2408 | 0.060 | 4.032 | 0.000 | 0.124 | 0.358 |
Mortgage | 0.0010 | 0.001 | 1.301 | 0.193 | -0.001 | 0.003 |
Securities_Account | -1.0305 | 0.422 | -2.443 | 0.015 | -1.857 | -0.204 |
CD_Account | 3.6628 | 0.460 | 7.961 | 0.000 | 2.761 | 4.565 |
Online | -0.6794 | 0.216 | -3.140 | 0.002 | -1.103 | -0.255 |
CreditCard | -0.9609 | 0.274 | -3.507 | 0.000 | -1.498 | -0.424 |
Education_Graduate | 4.2075 | 0.364 | 11.573 | 0.000 | 3.495 | 4.920 |
Education_Advanced/Professional | 4.3580 | 0.365 | 11.937 | 0.000 | 3.642 | 5.074 |