Return to MSc Portfolio.

Originally submitted as part of the module:

Machine Learning and Predictive Analytics

This notebook contains cells of Python code that should be run in order, either one at a time or with a Run All Cells command such as the one in Jupyter.

There is no need to modify any cell other than the very first, which contains a commented-out pip install command for the UCI Machine Learning Repository package (ucimlrepo). Uncomment it if the package is not already present in your environment.

Imports and Data Cleaning

The following cells cover:

  • Necessary python imports
  • Data retrieval from the UCI repository
  • Checks for NaN or missing values in columns important to the objective
  • Removal of duplicate emails (rows)
  • Selection of individual Feature columns
In [1]:
#Uncomment below to add the UCI repository
#pip install ucimlrepo
In [2]:
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score, classification_report, roc_curve, roc_auc_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
In [3]:
#Data ingestion from UCI Repository
#http://archive.ics.uci.edu/dataset/94/spambase
spamData = fetch_ucirepo(id=94)

#Assign targets and features from UCI data.
X = spamData.data.features
y = spamData.data.targets
In [4]:
#Data cleaning, create pandas dataframe to examine dataset
spamDataDF = pd.concat([X, y], axis=1)

#Show size of the dataset.
print(f'Dataset shape: {spamDataDF.shape}')

#Features and Target column to check for empty values
importantColumns = ['word_freq_000', 
                    'word_freq_650', 
                    'word_freq_857',
                    'word_freq_415', 
                    'word_freq_85', 
                    'word_freq_1999',
                    'char_freq_;', 
                    'char_freq_(', 
                    'char_freq_[', 
                    'char_freq_!',
                    'char_freq_$', 
                    'char_freq_#', 
                    'Class']

#Print a count of empty values
print('\nFeature Name    | Empty Count')
print('-'*30)
for columnName in importantColumns:
    nullCount = spamDataDF[columnName].isnull().sum()
    print(f'{columnName:<15} | {nullCount}')
Dataset shape: (4601, 58)

Feature Name    | Empty Count
------------------------------
word_freq_000   | 0
word_freq_650   | 0
word_freq_857   | 0
word_freq_415   | 0
word_freq_85    | 0
word_freq_1999  | 0
char_freq_;     | 0
char_freq_(     | 0
char_freq_[     | 0
char_freq_!     | 0
char_freq_$     | 0
char_freq_#     | 0
Class           | 0
In [5]:
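#Consider duplicate emails (rows)
#Note: this count compares the feature columns only, whereas drop_duplicates below compares
#whole rows including Class, so the two totals differ when identical features carry different labels
print(f'Duplicate rows: {X.duplicated().sum()}')

#New dataframe to hold de-duplicated values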
dupesCleaned = spamDataDF.drop_duplicates()

#Last column in dataframe is the Target, split X and y based on location
XCleaned = dupesCleaned.iloc[:, :-1]
yCleaned = dupesCleaned.iloc[:, -1]

print(f'\nTarget counts after dedupe:\n{yCleaned.value_counts()}')
print(f'\nTotal records removed: {len(spamDataDF.index)-len(dupesCleaned.index)}')
Duplicate rows: 394

Target counts after dedupe:
Class
0    2531
1    1679
Name: count, dtype: int64

Total records removed: 391
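
The two totals above (394 and 391) differ because the first check looks only at the feature columns, while drop_duplicates compares whole rows including the Class column. The following is a minimal plain-Python sketch of that counting logic; the rows are invented for illustration and are not taken from Spambase, and count_duplicates is a hypothetical helper that mimics counting every repeat of an earlier row.

```python
def count_duplicates(rows):
    """Count rows that exactly repeat an earlier row (first occurrence is kept)."""
    seen = set()
    duplicates = 0
    for row in rows:
        if row in seen:
            duplicates += 1
        else:
            seen.add(row)
    return duplicates

# Hypothetical (feature_1, feature_2, label) rows, invented for illustration
rows = [
    (0.0, 1.5, 1),
    (0.0, 1.5, 1),  # same features, same label: a full-row duplicate
    (0.0, 1.5, 0),  # same features, different label: a feature-only duplicate
    (2.0, 0.3, 0),
]

# Compare features only (drop the label) against whole rows
feature_only = count_duplicates([row[:-1] for row in rows])
full_row = count_duplicates(rows)

print(f'Feature-only duplicates: {feature_only}')
print(f'Full-row duplicates: {full_row}')
```

Rows that share features but disagree on the label survive a whole-row de-duplication, which is worth keeping in mind when reading the class counts.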
In [6]:
#Feature selection by isolating the columns with only numbers, or only symbols
scoreColumns = ['word_freq_000', 
                'word_freq_650', 
                'word_freq_857', 
                'word_freq_415', 
                'word_freq_85', 
                'word_freq_1999', 
                'char_freq_;', 
                'char_freq_(', 
                'char_freq_[', 
                'char_freq_!', 
                'char_freq_$', 
                'char_freq_#']

#New X to limit only to specified Features
XCleaned = XCleaned[scoreColumns]
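
The column list above is written out by hand. As a rough sketch, the same "only numbers, or only symbols" rule could be expressed as a name pattern. This assumes that every char_freq_ column is a symbol and that numeric words follow the word_freq_ prefix; is_score_column and the short sample of column names are illustrative, not part of the original notebook.

```python
import re

# Matches word_freq_ columns whose suffix is digits only (e.g. word_freq_000)
NUMERIC_WORD = re.compile(r'^word_freq_\d+$')

def is_score_column(name):
    """True for digit-only word columns and for character-frequency columns."""
    return bool(NUMERIC_WORD.match(name)) or name.startswith('char_freq_')

# Illustrative sample of column names, not the full dataset header
sample_columns = [
    'word_freq_make',
    'word_freq_3d',      # mixed digits and letters, so excluded
    'word_freq_000',
    'word_freq_85',
    'char_freq_!',
    'char_freq_$',
    'capital_run_length_total',
    'Class',
]

selected = [name for name in sample_columns if is_score_column(name)]
print(selected)
```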

Logistic Regression Modelling

The following cells cover the setup of a Logistic Regression (LR) model, including:

  • Setting up a Train/Test/Split
  • Scaling features
  • Fitting a model with various C values and Cross-Validation
  • Accuracy calculation
  • Coefficient/Weight analysis of impactful features
  • A ROC/AUC plot for the LR model
In [7]:
#Train/test/split the data into 80/20 training/testing
XTrainingLR, XTestingLR, yTrainingLR, yTestingLR = train_test_split(XCleaned, 
                                                                    yCleaned, 
                                                                    test_size=0.2, 
                                                                    random_state=[Redacted - Student ID], 
                                                                    stratify=yCleaned)
#Setup a scaler and scale features
LRScaler = StandardScaler()
XTrainingScaledLR = LRScaler.fit_transform(XTrainingLR)
XTestingScaledLR = LRScaler.transform(XTestingLR)
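
The scaler is fitted on the training split only and then reused on the test split, so no information from the test data leaks into the scaling. The following standard-library sketch shows the arithmetic involved: subtract the training mean and divide by the training population standard deviation. The values and the helper names fit_scaler and transform are made up for illustration.

```python
from statistics import fmean, pstdev

def fit_scaler(train_values):
    """Learn the mean and population standard deviation from training data only."""
    return fmean(train_values), pstdev(train_values)

def transform(values, mean, std):
    """Apply z = (x - mean) / std using statistics learned from the training data."""
    return [(value - mean) / std for value in values]

# Hypothetical values of a single feature, invented for illustration
train = [0.0, 0.0, 0.5, 1.5]
test = [0.5, 3.0]

mean, std = fit_scaler(train)
scaled_train = transform(train, mean, std)
scaled_test = transform(test, mean, std)  # reuses the training mean and std

print(f'Training mean: {mean:.3f}, std: {std:.3f}')
print(f'Scaled test values: {[round(value, 3) for value in scaled_test]}')
```

After scaling, the training feature has mean 0 and standard deviation 1; the test feature generally does not, because it is measured against the training statistics.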
In [8]:
#Implementation of logistic regression model, using various C values
#and using k-fold cross-validation of 5
lrModel = LogisticRegressionCV(random_state=[Redacted - Student ID], Cs=[0.01, 0.1, 1, 10, 100, 1000], cv=5)
lrModel.fit(XTrainingScaledLR, yTrainingLR)

#Prediction and accuracy calculation
yTestingPredictionLR = lrModel.predict(XTestingScaledLR)
testingAccuracyLR = accuracy_score(yTestingLR, yTestingPredictionLR)

print(f'Best C: {lrModel.C_}')
print(f'Testing %: {testingAccuracyLR:.3f}')
Best C: [100.]
Testing %: 0.817
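
To put the 0.817 test accuracy in context, it helps to compare it with a classifier that always predicts the majority class. The sketch below uses the class counts printed after de-duplication; because the split is stratified, the test set should have roughly the same class balance, though the exact share may differ slightly.

```python
# Class counts taken from the de-duplication output above (0 = not spam, 1 = spam)
class_counts = {0: 2531, 1: 1679}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy achieved by always predicting the majority class
baseline_accuracy = class_counts[majority_class] / total

print(f'Majority class: {majority_class}')
print(f'Baseline accuracy: {baseline_accuracy:.3f}')
```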
In [9]:
#Coefficient/weights analysis to see which features influenced the model the most
modelWeights = pd.DataFrame({
    'Feature': XCleaned.columns,
    'Weight': lrModel.coef_[0]
})

#Calculate absolute values for coefficients/weights
modelWeights['Abs_Coefficient'] = np.abs(modelWeights['Weight'])
modelWeights = modelWeights.sort_values('Abs_Coefficient', ascending=True)

#Plot importance of features
plt.figure(figsize=(7, 5))
plotColours = ['deepskyblue' if weight > 0 else 'tomato' for weight in modelWeights['Weight']]
barLabels = plt.barh(range(len(modelWeights)), modelWeights['Weight'], color=plotColours)
plt.bar_label(barLabels, fmt='%.2f', label_type='center', color='black', padding=3)
plt.yticks(range(len(modelWeights)), modelWeights['Feature'])
plt.axvline(x=0, color='black', linewidth=1)
plt.xlabel('Feature Weight')
plt.ylabel('Features')
plt.title('Logistic Regression Feature Weights')
plt.grid(True, alpha=0.2, axis='x')
plt.tight_layout()
plt.show()
[Figure: horizontal bar chart titled 'Logistic Regression Feature Weights'; x-axis 'Feature Weight', y-axis 'Features']
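
One way to read a logistic regression weight: because the features were standardised, raising a feature by one standard deviation multiplies the odds of the positive class (spam) by exp(weight). The sketch below checks this numerically. The weights and intercept are hypothetical values chosen for illustration, not the fitted coefficients from this notebook.

```python
import math

def sigmoid(z):
    """Logistic function mapping log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(probability):
    """Convert a probability into odds."""
    return probability / (1.0 - probability)

def odds_multiplier(weight):
    """Factor applied to the odds when the scaled feature increases by 1."""
    return math.exp(weight)

# Hypothetical weights and intercept, invented for illustration
hypothetical_weights = {'feature_a': 1.2, 'feature_b': -0.4}
intercept = -0.5

for name, weight in hypothetical_weights.items():
    before = odds(sigmoid(intercept))
    after = odds(sigmoid(intercept + weight))
    print(f'{name}: odds x{after / before:.2f} '
          f'(exp(weight) = {odds_multiplier(weight):.2f})')
```

A positive weight gives a multiplier above 1 (pushing towards spam) and a negative weight gives a multiplier below 1, which matches the two colours used in the bar chart.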
In [10]:
#An ROC graph to evaluate LR.
#Calculate LR True/False positive rates, and AUC
LRyTestingProba = lrModel.predict_proba(XTestingScaledLR)
LRFalsePosRate, LRTruePosRate, LRThresholds = roc_curve(yTestingLR, LRyTestingProba[:, 1])
LRAUCScore = roc_auc_score(yTestingLR, LRyTestingProba[:, 1])

#Plot ROC curve
plt.figure(figsize=(6, 4))
plt.plot(LRFalsePosRate, LRTruePosRate, label=f'Logistic Regression (AUC={LRAUCScore:.3f})', color='deepskyblue', lw=2)
plt.plot([0, 1], [0, 1], '--', linewidth=2, label='50/50 Coin Flip (AUC=0.5)')
plt.xlabel('False Positive Rate (Real emails blocked)')
plt.ylabel('True Positive Rate (Spam emails blocked)')
plt.title('ROC Analysis of Logistic Regression Model')
plt.legend(loc='lower right')
plt.grid(alpha=0.2)
plt.show()
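[Figure: ROC curve titled 'ROC Analysis of Logistic Regression Model'; x-axis 'False Positive Rate (Real emails blocked)', y-axis 'True Positive Rate (Spam emails blocked)']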
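
The AUC can be read as the probability that a randomly chosen spam email receives a higher predicted score than a randomly chosen real email. The sketch below computes it directly from that definition on a tiny made-up example; pairwise_auc is an illustrative helper, and for real data roc_auc_score is the practical choice.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [score for label, score in zip(labels, scores) if label == 1]
    negatives = [score for label, score in zip(labels, scores) if label == 0]
    wins = 0.0
    for positive_score in positives:
        for negative_score in negatives:
            if positive_score > negative_score:
                wins += 1.0
            elif positive_score == negative_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted spam probabilities, invented for illustration
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(labels, scores)
print(f'AUC: {auc:.2f}')
```

Three of the four positive/negative pairs are ordered correctly here, so the AUC is 0.75. A value of 0.5 corresponds to the coin-flip diagonal on the plot.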

K-Nearest Neighbours

A secondary implementation using a k-Nearest Neighbours (kNN) model.

The following cells cover the setup of a kNN model, including:

  • Train/test/split
  • Feature scaling
  • Determining the initial K value, and building a range
  • Implementing kNN, including cross-validation, with GridSearchCV
In [11]:
#Keep the same train/test/split values as LR for consistency
XTrainingkNN, XTestingkNN, yTrainingkNN, yTestingkNN = train_test_split(XCleaned, 
                                                                        yCleaned, 
                                                                        test_size=0.2, 
                                                                        random_state=[Redacted - Student ID], 
                                                                        stratify=yCleaned)

kNNScaler = StandardScaler()
XTrainingScaledkNN = kNNScaler.fit_transform(XTrainingkNN)
XTestingScaledkNN = kNNScaler.transform(XTestingkNN)

#Set initial k to the rounded square root of the number of training samples, bumped to the next odd number to avoid tied votes
sampleSize = len(XTrainingkNN)
initialK = math.sqrt(sampleSize)
initialK = round(initialK)
if initialK % 2 == 0:
    initialK += 1

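#Candidate K values, keeping them odd
#Note: KRange is a two-element tuple, so only initialK - 50 and initialK + 50 are searched;
#use range(initialK - 50, initialK + 51) instead to search every odd k in between
KRange = (initialK - 50, initialK + 50)
KList = [i for i in KRange if i > 0 and i % 2 != 0]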

#Fit a KNN model with GridSearch
paramGrid = {'n_neighbors': KList, 'metric': ['euclidean', 'manhattan']}
kNNGrid = GridSearchCV(KNeighborsClassifier(), paramGrid, cv=5, scoring='f1_macro')
kNNGrid.fit(XTrainingScaledkNN, yTrainingkNN)

#Predict using the best KNN model
yTestingPredictionkNN = kNNGrid.predict(XTestingScaledkNN)
testingAccuracykNN = accuracy_score(yTestingkNN, yTestingPredictionkNN)

print(f'Initial k neighbours: {initialK}')
print(f"Best k: {kNNGrid.best_params_['n_neighbors']}")
print(f"Best distance metric: {kNNGrid.best_params_['metric']}")
print(f'Testing %: {testingAccuracykNN:.3f}')
Initial k neighbours: 59
Best k: 9
Best distance metric: manhattan
Testing %: 0.831
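
In the cell above, KRange is a two-element tuple, so with an initial k of 59 the grid search compares only k = 9 and k = 109. The sketch below shows one way to build the full list of odd candidates around the starting value. The helper names are hypothetical, and the training size of 3368 assumes an 80% share of the 4210 de-duplicated rows.

```python
import math

def initial_k(n_samples):
    """Rounded square root of the sample count, bumped to the next odd number."""
    k = round(math.sqrt(n_samples))
    return k + 1 if k % 2 == 0 else k

def odd_k_candidates(centre, spread=50):
    """Every odd k from centre - spread to centre + spread, never below 1."""
    low = max(1, centre - spread)
    return [k for k in range(low, centre + spread + 1) if k % 2 == 1]

n_training = 3368  # assumed: 80% of the 4210 de-duplicated rows

start = initial_k(n_training)
candidates = odd_k_candidates(start)

print(f'Initial k: {start}')
print(f'Candidates: {candidates[0]} to {candidates[-1]} ({len(candidates)} values)')
```

Searching all of these candidates with two distance metrics and 5-fold cross-validation takes noticeably longer than the two-candidate search, and the best k it finds may differ from the one reported above.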

Comparison between Models

The following cells show comparisons between the LR and kNN models. The cells include:

  • A Receiver Operating Characteristic (ROC) plot to examine performance, including Area Under the Curve (AUC)
  • The Classification reports for both models
  • A Confusion Matrix for each model
In [12]:
#An ROC graph to compare LR vs kNN
#Calculate KNN True/False positive rates, and AUC
kNNyTestingProba = kNNGrid.predict_proba(XTestingScaledkNN)
kNNFalsePosRate, kNNTruePosRate, kNNThresholds = roc_curve(yTestingkNN, kNNyTestingProba[:, 1])
kNNAUCScore = roc_auc_score(yTestingkNN, kNNyTestingProba[:, 1])

#Plot ROC curve
plt.figure(figsize=(6, 4))
plt.plot(LRFalsePosRate, LRTruePosRate, label=f'Logistic Regression (AUC={LRAUCScore:.3f})', color='deepskyblue', lw=2)
plt.plot(kNNFalsePosRate, kNNTruePosRate, label=f'kNN (AUC={kNNAUCScore:.3f})', color='tomato', lw=2)
plt.plot([0, 1], [0, 1], '--', linewidth=2, label='50/50 Coin Flip (AUC=0.5)')
plt.xlabel('False Positive Rate (Real emails blocked)')
plt.ylabel('True Positive Rate (Spam emails blocked)')
plt.title('ROC Comparison: LR vs kNN')
plt.legend(loc='lower right')
plt.grid(alpha=0.2)
plt.show()
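[Figure: ROC curves titled 'ROC Comparison: LR vs kNN'; x-axis 'False Positive Rate (Real emails blocked)', y-axis 'True Positive Rate (Spam emails blocked)']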
In [13]:
#Classification reports
print('Logistic Regression'.center(60)+'\n')
print(classification_report(yTestingLR, yTestingPredictionLR, digits=3))
print("-"*60+'\n')
print('kNN'.center(60)+'\n')
print(classification_report(yTestingkNN, yTestingPredictionkNN, digits=3))
                    Logistic Regression                     

              precision    recall  f1-score   support

           0      0.788     0.953     0.862       506
           1      0.896     0.613     0.728       336

    accuracy                          0.817       842
   macro avg      0.842     0.783     0.795       842
weighted avg      0.831     0.817     0.809       842

------------------------------------------------------------

                            kNN                             

              precision    recall  f1-score   support

           0      0.824     0.915     0.867       506
           1      0.846     0.705     0.769       336

    accuracy                          0.831       842
   macro avg      0.835     0.810     0.818       842
weighted avg      0.833     0.831     0.828       842
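
The recall and support columns in the reports are enough to recover the approximate confusion-matrix counts behind them, which makes the trade-off between the two models easier to see. The sketch below does that arithmetic. Since the recalls are rounded to three decimals, the counts should be treated as inferred rather than read from the models, and implied_confusion is a hypothetical helper.

```python
def implied_confusion(recall_negative, recall_positive, support_negative, support_positive):
    """Recover confusion-matrix counts from per-class recall and support."""
    true_negatives = round(recall_negative * support_negative)
    true_positives = round(recall_positive * support_positive)
    return {
        'TN': true_negatives,
        'FP': support_negative - true_negatives,  # real emails flagged as spam
        'FN': support_positive - true_positives,  # spam emails let through
        'TP': true_positives,
    }

def accuracy(counts):
    """Share of all predictions that were correct."""
    return (counts['TN'] + counts['TP']) / sum(counts.values())

# Recall and support values copied from the two reports above
lr_counts = implied_confusion(0.953, 0.613, 506, 336)
knn_counts = implied_confusion(0.915, 0.705, 506, 336)

print(f'LR:  {lr_counts}, accuracy {accuracy(lr_counts):.3f}')
print(f'kNN: {knn_counts}, accuracy {accuracy(knn_counts):.3f}')
```

On these inferred counts, LR blocks fewer real emails while kNN lets less spam through, consistent with the precision and recall figures in the reports.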

In [14]:
#Side-by-side confusion matrix to compare KNN and LR
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

cmkNN = confusion_matrix(yTestingkNN, yTestingPredictionkNN)
cmLR = confusion_matrix(yTestingLR, yTestingPredictionLR)

sns.heatmap(cmkNN, annot=True, fmt='d', cmap='Reds',
            xticklabels=['Not Spam', 'Spam'],
            yticklabels=['Not Spam', 'Spam'],
            ax=axes[0])
axes[0].set_xlabel('kNN Predicted')
axes[0].set_ylabel('Actual Outcome')
axes[0].set_title('kNN Confusion Matrix')

sns.heatmap(cmLR, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Not Spam', 'Spam'],
            yticklabels=['Not Spam', 'Spam'],
            ax=axes[1])
axes[1].set_xlabel('LR Predicted')
axes[1].set_ylabel('Actual Outcome')
axes[1].set_title('Logistic Regression Confusion Matrix')

plt.show()
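[Figure: side-by-side heatmaps titled 'kNN Confusion Matrix' and 'Logistic Regression Confusion Matrix'; x-axes 'kNN Predicted' and 'LR Predicted', y-axis 'Actual Outcome']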
