Originally submitted as part of the module:
Machine Learning and Predictive Analytics
This notebook contains cells of Python code which should be run one at a time in order, or all at once with a Run All Cells command such as the one in Jupyter.
Only the very first cell may need changing: it contains a commented-out pip install command for the ucimlrepo package (the UCI Machine Learning Repository client), which may or may not already be present in your environment.
Imports and Data Cleaning
The following cells cover:
- Necessary Python imports
- Data retrieval from the UCI repository
- Checks for NaN or missing values in columns important to the objective
- Removal of duplicate emails (rows)
- Selection of individual Feature columns
In [1]:
#Uncomment below to add the UCI repository
#pip install ucimlrepo
In [2]:
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score, classification_report, roc_curve, roc_auc_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
In [3]:
#Data ingestion from UCI Repository
#http://archive.ics.uci.edu/dataset/94/spambase
spamData = fetch_ucirepo(id=94)
#Assign targets and features from UCI data.
X = spamData.data.features
y = spamData.data.targets
In [4]:
#Data cleaning, create pandas dataframe to examine dataset
spamDataDF = pd.concat([X, y], axis=1)
#Show size of the dataset.
print(f'Dataset shape: {spamDataDF.shape}')
#Features and Target column to check for empty values
importantColumns = ['word_freq_000',
'word_freq_650',
'word_freq_857',
'word_freq_415',
'word_freq_85',
'word_freq_1999',
'char_freq_;',
'char_freq_(',
'char_freq_[',
'char_freq_!',
'char_freq_$',
'char_freq_#',
'Class']
#Print a count of empty values
print('\nFeature Name | Empty Count')
print('-'*30)
for columnName in importantColumns:
    nullCount = spamDataDF[columnName].isnull().sum()
    print(f'{columnName:<15} | {nullCount}')
Dataset shape: (4601, 58)

Feature Name | Empty Count
------------------------------
word_freq_000   | 0
word_freq_650   | 0
word_freq_857   | 0
word_freq_415   | 0
word_freq_85    | 0
word_freq_1999  | 0
char_freq_;     | 0
char_freq_(     | 0
char_freq_[     | 0
char_freq_!     | 0
char_freq_$     | 0
char_freq_#     | 0
Class           | 0
In [5]:
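#Count duplicate emails (rows), comparing the feature columns only
print(f'Duplicate rows: {X.duplicated().sum()}')
#New dataframe to hold de-duplicated values. drop_duplicates compares features and target together,
#so the number of rows removed can differ from the features-only count printed above
dupesCleaned = spamDataDF.drop_duplicates()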
#Last column in dataframe is the Target, split X and y based on location
XCleaned = dupesCleaned.iloc[:, :-1]
yCleaned = dupesCleaned.iloc[:, -1]
print(f'\nTarget counts after dedupe:\n{yCleaned.value_counts()}')
print(f'\nTotal records removed: {len(spamDataDF.index)-len(dupesCleaned.index)}')
Duplicate rows: 394

Target counts after dedupe:
Class
0    2531
1    1679
Name: count, dtype: int64

Total records removed: 391
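The two figures in this output differ because the duplicate count is taken on the feature columns alone, while the de-duplication compares the features together with the target. Rows that share their features with an earlier row but carry a different label are counted by the first check and kept by the second. A minimal sketch in plain Python, using made-up rows (not Spambase values) and a hypothetical helper `count_repeats` that mimics a "keep the first occurrence" rule:

```python
# Hypothetical rows: (feature_a, feature_b, label). Not real Spambase values.
rows = [
    (0.0, 1.5, 0),
    (0.0, 1.5, 0),  # exact repeat of row 0: features and label both match
    (0.0, 1.5, 1),  # same features as row 0, but a different label
    (2.0, 0.3, 1),
]

def count_repeats(items):
    """Count items that already appeared earlier in the sequence."""
    seen = set()
    repeats = 0
    for item in items:
        if item in seen:
            repeats += 1
        else:
            seen.add(item)
    return repeats

# Features only: the label (last element) is dropped before comparing
feature_repeats = count_repeats(row[:-1] for row in rows)
# Full rows: features and label must both match
full_row_repeats = count_repeats(rows)

print(f'Repeats on features only: {feature_repeats}')
print(f'Repeats on full rows:     {full_row_repeats}')
```

In this toy example the third row is a repeat by features but not by full row, which is the same kind of gap as 394 versus 391.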
In [6]:
#Feature selection by isolating the columns with only numbers, or only symbols
scoreColumns = ['word_freq_000',
'word_freq_650',
'word_freq_857',
'word_freq_415',
'word_freq_85',
'word_freq_1999',
'char_freq_;',
'char_freq_(',
'char_freq_[',
'char_freq_!',
'char_freq_$',
'char_freq_#']
#New X to limit only to specified Features
XCleaned = XCleaned[scoreColumns]
Logistic Regression Modelling
The following cells cover the setup of a Logistic Regression (LR) model, including:
- Setting up a Train/Test/Split
- Scaling features
- Fitting a model with various C values and Cross-Validation
- Accuracy calculation
- Coefficient/Weight analysis of impactful features
- An ROC/AUC plot for the LR model
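As background for the steps listed above, a logistic regression model turns a weighted sum of the (scaled) features into a probability with the sigmoid function, then predicts spam when that probability reaches 0.5. A minimal sketch in plain Python; the weights, intercept and email values are hypothetical, chosen only for illustration, and `predict_spam_probability` is not a scikit-learn function:

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_spam_probability(features, weights, intercept):
    """Weighted sum of the scaled features, passed through the sigmoid."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights and one hypothetical scaled email, for illustration only
weights = [1.2, -0.4, 0.8]
intercept = -0.5
email = [1.0, 0.5, 0.0]

probability = predict_spam_probability(email, weights, intercept)
label = 1 if probability >= 0.5 else 0
print(f'P(spam) = {probability:.3f}, predicted class = {label}')
```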
In [7]:
#Train/test/split the data into 80/20 training/testing
XTrainingLR, XTestingLR, yTrainingLR, yTestingLR = train_test_split(XCleaned,
yCleaned,
test_size=0.2,
random_state=[Redacted - Student ID],
stratify=yCleaned)
#Setup a scaler and scale features
LRScaler = StandardScaler()
XTrainingScaledLR = LRScaler.fit_transform(XTrainingLR)
XTestingScaledLR = LRScaler.transform(XTestingLR)
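The scaler is fitted on the training data only, and the statistics it learns are then reused on the testing data, so nothing about the test set leaks into training. A minimal sketch of that idea in plain Python for a single feature with made-up values; `fit_scaler` and `transform` are illustrative names, not scikit-learn's API. It uses the population standard deviation, which is what StandardScaler uses, but does not handle a zero-variance feature:

```python
import statistics

def fit_scaler(train_values):
    """Learn the mean and population standard deviation from training data."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return mean, std

def transform(values, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(v - mean) / std for v in values]

# Made-up values for one feature
train = [0.0, 2.0, 4.0, 6.0]
test = [3.0, 9.0]

mean, std = fit_scaler(train)                # statistics come from training data only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)     # test data reuses the training statistics

print(f'mean={mean:.3f}, std={std:.3f}')
print(f'scaled test values: {[round(v, 3) for v in test_scaled]}')
```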
In [8]:
#Implementation of logistic regression model, trying several C values
#(inverse regularisation strength) with 5-fold cross-validation
lrModel = LogisticRegressionCV(random_state=[Redacted - Student ID], Cs=[0.01, 0.1, 1, 10, 100, 1000], cv=5)
lrModel.fit(XTrainingScaledLR, yTrainingLR)
#Prediction and accuracy calculation
yTestingPredictionLR = lrModel.predict(XTestingScaledLR)
testingAccuracyLR = accuracy_score(yTestingLR, yTestingPredictionLR)
print(f'Best C: {lrModel.C_}')
print(f'Testing %: {testingAccuracyLR:.3f}')
Best C: [100.]
Testing %: 0.817
In [9]:
#Coefficient/weights analysis to see which features influenced the model the most
modelWeights = pd.DataFrame({
    'Feature': XCleaned.columns,
    'Weight': lrModel.coef_[0]
})
#Calculate absolute values for coefficients/weights
modelWeights['Abs_Coefficient'] = np.abs(modelWeights['Weight'])
modelWeights = modelWeights.sort_values('Abs_Coefficient', ascending=True)
#Plot importance of features
plt.figure(figsize=(7, 5))
plotColours = ['deepskyblue' if weight > 0 else 'tomato' for weight in modelWeights['Weight']]
barLabels = plt.barh(range(len(modelWeights)), modelWeights['Weight'], color=plotColours)
plt.bar_label(barLabels, fmt='%.2f', label_type='center', color='black', padding=3)
plt.yticks(range(len(modelWeights)), modelWeights['Feature'])
plt.axvline(x=0, color='black', linewidth=1)
plt.xlabel('Feature Weight')
plt.ylabel('Features')
plt.title('Logistic Regression Feature Weights')
plt.grid(True, alpha=0.2, axis='x')
plt.tight_layout()
plt.show()
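Because the features were standardised before fitting, each weight in the plot can be read as the change in the log-odds of spam for a one standard deviation increase in that feature. Exponentiating a weight gives the factor by which the odds are multiplied. A minimal sketch, using hypothetical weights rather than the fitted values:

```python
import math

# Hypothetical weights, for illustration only (not the fitted coefficients)
example_weights = {'char_freq_$': 0.9, 'char_freq_;': -0.3}

# exp(weight) is the multiplicative change in the odds of spam
odds_ratios = {name: math.exp(w) for name, w in example_weights.items()}

for name, ratio in odds_ratios.items():
    print(f'{name}: one standard deviation more multiplies the odds of spam by {ratio:.2f}')
```

A factor above 1 pushes a prediction towards spam and a factor below 1 pushes it towards not spam, matching the sign-based colouring of the bars.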
In [10]:
#An ROC graph to evaluate LR.
#Calculate LR True/False positive rates, and AUC
LRyTestingProba = lrModel.predict_proba(XTestingScaledLR)
LRFalsePosRate, LRTruePosRate, LRThresholds = roc_curve(yTestingLR, LRyTestingProba[:, 1])
LRAUCScore = roc_auc_score(yTestingLR, LRyTestingProba[:, 1])
#Plot ROC curve
plt.figure(figsize=(6, 4))
plt.plot(LRFalsePosRate, LRTruePosRate, label=f'Logistic Regression (AUC={LRAUCScore:.3f})', color='deepskyblue', lw=2)
plt.plot([0, 1], [0, 1], '--', linewidth=2, label='50/50 Coin Flip (AUC=0.5)')
plt.xlabel('False Positive Rate (Real emails blocked)')
plt.ylabel('True Positive Rate (Spam emails blocked)')
plt.title('ROC Analysis of Logistic Regression Model')
plt.legend(loc='lower right')
plt.grid(alpha=0.2)
plt.show()
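One way to read the AUC figure in the plot legend: it equals the probability that a randomly chosen spam email receives a higher score than a randomly chosen real email. A minimal sketch that computes AUC this way in plain Python, on made-up labels and scores; `auc_by_pairs` is an illustrative helper, not how scikit-learn computes it internally:

```python
def auc_by_pairs(labels, scores):
    """AUC as the share of (spam, non-spam) pairs ranked correctly.

    A tie between the two scores counts as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Made-up labels (1 = spam) and predicted spam probabilities
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = auc_by_pairs(labels, scores)
print(f'AUC = {auc:.2f}')
```

Three of the four spam/real pairs are ranked correctly here, so the AUC is 0.75. A model that scores at random would sit near 0.5, the dashed diagonal in the plot.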
K-Nearest Neighbours
A second implementation, using a k-Nearest Neighbours (kNN) model.
The following cells cover the setup of a kNN model, including:
- Train/test/split
- Feature scaling
- Determining the initial K value, and building a range
- Implementing kNN, including cross-validation, with GridSearchCV
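For reference, kNN labels a new email by finding the k closest training emails and taking a majority vote of their labels. The distance metric decides what "closest" means, which is why both Euclidean and Manhattan distance are worth trying. A minimal sketch in plain Python on made-up two-feature points; `knn_predict` is an illustrative helper, not the scikit-learn implementation:

```python
from collections import Counter

def manhattan(a, b):
    """Sum of absolute differences between coordinates."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn_predict(train_points, train_labels, query, k, distance):
    """Label the query by majority vote among its k nearest training points."""
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Made-up scaled points: three labelled 0 (not spam), two labelled 1 (spam)
train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
train_labels = [0, 0, 0, 1, 1]
query = (0.8, 0.9)

prediction = knn_predict(train_points, train_labels, query, k=3, distance=manhattan)
print(f'Predicted class: {prediction}')
```

With k=3 the two nearby spam points outvote the single non-spam point. An odd k avoids a tied vote between two classes.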
In [11]:
#Keep the same train/test/split values as LR for consistency
XTrainingkNN, XTestingkNN, yTrainingkNN, yTestingkNN = train_test_split(XCleaned,
yCleaned,
test_size=0.2,
random_state=[Redacted - Student ID],
stratify=yCleaned)
kNNScaler = StandardScaler()
XTrainingScaledkNN = kNNScaler.fit_transform(XTrainingkNN)
XTestingScaledkNN = kNNScaler.transform(XTestingkNN)
#Set initial k to the square root of the number of training samples, moved up to the next odd number to avoid tied votes
sampleSize = len(XTrainingkNN)
initialK = math.sqrt(sampleSize)
initialK = round(initialK)
if initialK % 2 == 0:
    initialK += 1
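#Build a range of possible K values around the initial k, keeping them odd.
#range() enumerates every integer between the bounds; a bare tuple would only hold the two endpoints
KRange = range(initialK - 50, initialK + 51)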
KList = [i for i in KRange if i > 0 and i % 2 != 0]
#Fit a KNN model with GridSearch
paramGrid = {'n_neighbors': KList, 'metric': ['euclidean', 'manhattan']}
kNNGrid = GridSearchCV(KNeighborsClassifier(), paramGrid, cv=5, scoring='f1_macro')
kNNGrid.fit(XTrainingScaledkNN, yTrainingkNN)
#Predict using the best KNN model
yTestingPredictionkNN = kNNGrid.predict(XTestingScaledkNN)
testingAccuracykNN = accuracy_score(yTestingkNN, yTestingPredictionkNN)
print(f'Initial k neighbours: {initialK}')
print(f"Best k: {kNNGrid.best_params_['n_neighbors']}")
print(f"Best distance metric: {kNNGrid.best_params_['metric']}")
print(f'Testing %: {testingAccuracykNN:.3f}')
Initial k neighbours: 59
Best k: 9
Best distance metric: manhattan
Testing %: 0.831
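The starting value of 59 follows from the square-root heuristic. A minimal sketch of that calculation, assuming a training set of 3,368 rows (80% of the 4,210 de-duplicated emails) and assuming the intent is to try every odd k within 50 of the starting value; `starting_k` and `odd_candidates` are illustrative helper names:

```python
import math

def starting_k(n_training_samples):
    """Round sqrt(n) to the nearest integer, then move even values up to odd."""
    k = round(math.sqrt(n_training_samples))
    return k + 1 if k % 2 == 0 else k

def odd_candidates(centre, half_width):
    """Every positive odd integer within half_width of centre, inclusive."""
    return [k for k in range(centre - half_width, centre + half_width + 1)
            if k > 0 and k % 2 != 0]

n_train = 3368  # assumed: 80% of the 4,210 de-duplicated rows
k0 = starting_k(n_train)
candidates = odd_candidates(k0, 50)

print(f'Starting k: {k0}')
print(f'Candidates: {candidates[0]} to {candidates[-1]} ({len(candidates)} values)')
```

The square root of 3,368 is roughly 58.03, which rounds to 58 and is then moved up to 59 to keep it odd.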
Comparison between Models
The following cells show comparisons between the LR and kNN models. The cells include;
- A Receiver Operating Characteristic (ROC) plot to examine performance, including Area Under the Curve (AUC)
- The Classification reports for both models
- A Confusion Matrix for each model
In [12]:
#An ROC graph to compare LR vs kNN
#Calculate KNN True/False positive rates, and AUC
kNNyTestingProba = kNNGrid.predict_proba(XTestingScaledkNN)
kNNFalsePosRate, kNNTruePosRate, kNNThresholds = roc_curve(yTestingkNN, kNNyTestingProba[:, 1])
kNNAUCScore = roc_auc_score(yTestingkNN, kNNyTestingProba[:, 1])
#Plot ROC curve
plt.figure(figsize=(6, 4))
plt.plot(LRFalsePosRate, LRTruePosRate, label=f'Logistic Regression (AUC={LRAUCScore:.3f})', color='deepskyblue', lw=2)
plt.plot(kNNFalsePosRate, kNNTruePosRate, label=f'kNN (AUC={kNNAUCScore:.3f})', color='tomato', lw=2)
plt.plot([0, 1], [0, 1], '--', linewidth=2, label='50/50 Coin Flip (AUC=0.5)')
plt.xlabel('False Positive Rate (Real emails blocked)')
plt.ylabel('True Positive Rate (Spam emails blocked)')
plt.title('ROC Comparison: LR vs kNN')
plt.legend(loc='lower right')
plt.grid(alpha=0.2)
plt.show()
In [13]:
#Classification reports
print('Logistic Regression'.center(60)+'\n')
print(classification_report(yTestingLR, yTestingPredictionLR, digits=3))
print("-"*60+'\n')
print('kNN'.center(60)+'\n')
print(classification_report(yTestingkNN, yTestingPredictionkNN, digits=3))
Logistic Regression
              precision    recall  f1-score   support

           0      0.788     0.953     0.862       506
           1      0.896     0.613     0.728       336

    accuracy                          0.817       842
   macro avg      0.842     0.783     0.795       842
weighted avg      0.831     0.817     0.809       842
------------------------------------------------------------
kNN
              precision    recall  f1-score   support

           0      0.824     0.915     0.867       506
           1      0.846     0.705     0.769       336

    accuracy                          0.831       842
   macro avg      0.835     0.810     0.818       842
weighted avg      0.833     0.831     0.828       842
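The figures in each report all derive from four counts: true positives, false positives, false negatives and true negatives, with spam (class 1) treated as the positive class. A minimal sketch of the formulas follows. The counts used are an assumption: they were inferred from the rounded values in the first report (for example, recall 0.613 on 336 spam emails implies 206 caught), not read from the notebook's confusion matrix. `binary_metrics` is an illustrative helper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall and F1 for the positive class, plus overall accuracy."""
    precision = tp / (tp + fp)   # of emails flagged as spam, how many were spam
    recall = tp / (tp + fn)      # of actual spam emails, how many were flagged
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {'precision': precision, 'recall': recall, 'f1': f1, 'accuracy': accuracy}

# Counts inferred from the rounded report values (an assumption, see above)
lr_metrics = binary_metrics(tp=206, fp=24, fn=130, tn=482)

for name, value in lr_metrics.items():
    print(f'{name:<10} {value:.3f}')
```

Read this way, the first model rarely blocks a real email (high precision for spam) but lets a sizeable share of spam through (lower recall), while kNN trades a little precision for better recall.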
In [14]:
#Side-by-side confusion matrix to compare KNN and LR
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
cmkNN = confusion_matrix(yTestingkNN, yTestingPredictionkNN)
cmLR = confusion_matrix(yTestingLR, yTestingPredictionLR)
sns.heatmap(cmkNN, annot=True, fmt='d', cmap='Reds',
xticklabels=['Not Spam', 'Spam'],
yticklabels=['Not Spam', 'Spam'],
ax=axes[0])
axes[0].set_xlabel('kNN Predicted')
axes[0].set_ylabel('Actual Outcome')
axes[0].set_title('kNN Confusion Matrix')
sns.heatmap(cmLR, annot=True, fmt='d', cmap='Blues',
xticklabels=['Not Spam', 'Spam'],
yticklabels=['Not Spam', 'Spam'],
ax=axes[1])
axes[1].set_xlabel('LR Predicted')
axes[1].set_ylabel('Actual Outcome')
axes[1].set_title('Logistic Regression Confusion Matrix')
plt.show()
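When reading the heatmaps, rows are the actual class and columns are the predicted class, so the top-right cell holds real emails wrongly blocked and the bottom-left cell holds spam that got through. A minimal sketch of how such a 2x2 matrix is counted, using made-up labels; `confusion_counts` is an illustrative helper that follows the same row and column convention:

```python
def confusion_counts(actual, predicted):
    """Return [[TN, FP], [FN, TP]]: rows are actual classes, columns are predicted."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

# Made-up labels: 0 = not spam, 1 = spam
actual = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 0, 1]

matrix = confusion_counts(actual, predicted)
print(f'Not spam row: {matrix[0]}')
print(f'Spam row:     {matrix[1]}')
```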