Kevin Burridge - eDiscovery Professional

Machine Learning and Predictive Analytics

For my Machine Learning and Predictive Analytics module, I developed a project investigating whether a model could effectively filter email spam using only language-agnostic features. In a world where 45.6% of all global emails are spam, I wanted to see if symbols and numbers alone (like the frequency of dollar signs or exclamation marks) could identify a malicious message regardless of the language it was written in.

Project Overview

Using the Spambase dataset, I performed a multivariate analysis on over 4,200 instances. I intentionally narrowed my focus to just 12 features, each measuring the frequency of a specific character or number, to test the limits of a "minimalist" approach.
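As a rough illustration of what "language-agnostic" feature selection could look like, the sketch below filters column names by pattern. It assumes the Spambase naming convention of word_freq_<token> and char_freq_<char>. The helper name select_language_agnostic and the sample column list are hypothetical, and this is not necessarily the exact rule used to pick the 12 features in the project.

```python
import re

# Assumption: columns follow the Spambase naming convention,
# e.g. "word_freq_000" or "char_freq_$".
NUMERIC_WORD = re.compile(r"^word_freq_\d+$")


def select_language_agnostic(columns):
    """Keep character-frequency columns and purely numeric word-frequency columns."""
    return [
        name
        for name in columns
        if name.startswith("char_freq_") or NUMERIC_WORD.match(name)
    ]


# Hypothetical sample of column names, for illustration only.
sample_columns = [
    "word_freq_free",
    "word_freq_000",
    "word_freq_3d",
    "char_freq_$",
    "char_freq_!",
    "capital_run_length_total",
]

selected = select_language_agnostic(sample_columns)
print(selected)
```

Matching on the name pattern rather than hard-coding a list keeps the rule explicit: a token such as "3d" is excluded because it contains a letter, while "000" is kept.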

Methodology

Data Cleaning: I identified and removed 391 duplicate instances, which could otherwise appear in both the training and test sets and cause data leakage.
Model Selection: I conducted a comparative analysis between Logistic Regression (LR) and k-Nearest Neighbours (kNN).
Optimization: I used scikit-learn's LogisticRegressionCV and GridSearchCV with 5-fold cross-validation to tune regularization strength and other hyperparameters, with feature scaling applied as part of the same workflow.
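The three steps above can be sketched roughly as follows. This is a minimal example on synthetic data, not the notebook's actual code: the column names, the is_spam label, the StandardScaler pipeline step, the 75/25 split, and the n_neighbors grid are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): four frequency-like features.
rng = np.random.default_rng(0)
features = rng.exponential(scale=1.0, size=(300, 4))
df = pd.DataFrame(features, columns=["f0", "f1", "f2", "f3"])
df["is_spam"] = (features[:, 0] + features[:, 1] > 2.0).astype(int)

# Simulate repeated rows, then drop exact duplicates BEFORE splitting so the
# same row cannot land in both the training and the test set.
df_with_dupes = pd.concat([df, df.iloc[:20]], ignore_index=True)
deduped = df_with_dupes.drop_duplicates().reset_index(drop=True)

X = deduped.drop(columns="is_spam")
y = deduped["is_spam"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Logistic Regression: LogisticRegressionCV picks the regularization
# strength C from 10 candidates using 5-fold cross-validation.
lr_model = Pipeline(
    [
        ("scale", StandardScaler()),
        ("lr", LogisticRegressionCV(Cs=10, cv=5, max_iter=1000)),
    ]
)
lr_model.fit(X_train, y_train)

# kNN: GridSearchCV tunes the number of neighbours; the scaler sits inside
# the pipeline, so it is refitted on each cross-validation training fold.
knn_search = GridSearchCV(
    Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())]),
    param_grid={"knn__n_neighbors": [3, 5, 7, 9]},
    cv=5,
)
knn_search.fit(X_train, y_train)

lr_accuracy = lr_model.score(X_test, y_test)
knn_accuracy = knn_search.score(X_test, y_test)
print(f"LR accuracy:  {lr_accuracy:.3f}")
print(f"kNN accuracy: {knn_accuracy:.3f}, best params: {knn_search.best_params_}")
```

Scaling matters more for kNN than for Logistic Regression because kNN relies directly on distances between feature vectors, which is one reason to keep the scaler inside the tuned pipeline.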

Key Findings

The Interpretability Winner: Logistic Regression achieved an accuracy of 81.7% and a strong AUC of 0.889. Its high interpretability allowed me to confirm that high frequencies of "$", "000", and "!" were the strongest predictors of spam.
The Performance Trade-off: While kNN showed a slightly higher accuracy of 83.1%, it also produced nearly double the false positives, meaning it was much more likely to block legitimate emails.
Real-World Application: I concluded that for a scalable spam filter, Logistic Regression is a valid choice because of its lower computational cost and lower risk of hiding important messages from users.
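To make the accuracy versus false-positive trade-off concrete, here is a small sketch using scikit-learn's metrics on hand-made toy predictions. The labels and predictions are invented for illustration and are not the project's results. The helper summarize and the names lr_pred and knn_pred are hypothetical.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy ground truth: 0 = legitimate email, 1 = spam (invented data).
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Invented predictions: the "LR-like" model misses some spam but rarely
# blocks legitimate mail; the "kNN-like" model catches all spam but blocks
# more legitimate mail.
lr_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
knn_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]


def summarize(y_true, y_pred):
    """Return accuracy plus the false-positive count and rate."""
    # For binary labels, ravel() yields tn, fp, fn, tp in that order.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "false_positives": int(fp),
        "false_positive_rate": fp / (fp + tn),
    }


lr_summary = summarize(y_true, lr_pred)
knn_summary = summarize(y_true, knn_pred)
print("LR-like: ", lr_summary)
print("kNN-like:", knn_summary)
```

In this toy case the second model has the higher accuracy yet twice as many false positives, which is the kind of pattern accuracy alone hides and a confusion matrix exposes.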

Resources

Submission report.

Jupyter notebook of model.

MSc Data Science

University of the West of England, Bristol.
