Machine Learning and Predictive Analytics
For my Machine Learning and Predictive Analytics module, I developed a project investigating whether a model could effectively filter email spam using only language-agnostic features. With 45.6% of all global emails being spam, I wanted to see whether symbols and numbers alone (such as the frequency of dollar signs or exclamation marks) could identify a spam message regardless of the language it was written in.
Project Overview
Using the Spambase dataset, I performed a multivariate analysis on over 4,200 instances. I intentionally narrowed my focus to just 12 features, each the frequency of a specific character or number, to test the limits of a "minimalist" approach.
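As a minimal illustration of what "language-agnostic" could mean in practice, the sketch below filters column names down to character frequencies and digit-only tokens. The column names are a small sample in the style of the Spambase dataset, and the selection rule `is_language_agnostic` is a hypothetical helper, not necessarily the rule used in the project notebook.

```python
import re

# Illustrative column names in the Spambase naming style (a small sample, not the full list).
columns = [
    "word_freq_free",
    "word_freq_000",
    "word_freq_money",
    "word_freq_1999",
    "char_freq_!",
    "char_freq_$",
    "capital_run_length_total",
]


def is_language_agnostic(name: str) -> bool:
    """Hypothetical rule: keep character frequencies and digit-only tokens."""
    if name.startswith("char_freq_"):
        return True
    # Keep word-frequency columns only when the "word" is made of digits.
    return re.fullmatch(r"word_freq_\d+", name) is not None


selected = [c for c in columns if is_language_agnostic(c)]
print(selected)
```

Words such as "free" or "money" are dropped because they only carry signal in English, while symbols and digit strings look the same in any language.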
Methodology
- Data Cleaning: I addressed potential data leakage by identifying and removing 391 duplicate instances.
- Model Selection: I conducted a comparative analysis between Logistic Regression (LR) and k-Nearest Neighbours (kNN).
- Optimization: I used scikit-learn's LogisticRegressionCV and GridSearchCV to tune regularization and other hyperparameters through 5-fold cross-validation, with feature scaling applied as part of the workflow.
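The steps above can be sketched roughly as follows. This is a simplified outline under stated assumptions rather than the notebook itself: the data is a synthetic stand-in generated with `make_classification`, the duplicate rows are injected artificially, and the kNN parameter grid is illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for 12 selected features (assumption: not the real Spambase data).
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, random_state=0
)

# Inject 20 exact duplicate rows, then remove them before splitting so that
# identical rows cannot end up in both the training and test sets.
data = np.column_stack([X, y])
data = np.vstack([data, data[:20]])
data = np.unique(data, axis=0)
X, y = data[:, :-1], data[:, -1].astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Logistic Regression: scaling, then regularization strength chosen by 5-fold CV.
log_reg = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=10, cv=5, max_iter=1000),
)
log_reg.fit(X_train, y_train)

# kNN: scaling sits inside the pipeline, so it is refit on each CV training fold.
knn_search = GridSearchCV(
    make_pipeline(StandardScaler(), KNeighborsClassifier()),
    param_grid={"kneighborsclassifier__n_neighbors": [3, 5, 7, 11]},  # illustrative grid
    cv=5,
)
knn_search.fit(X_train, y_train)

print("LR test accuracy:", log_reg.score(X_test, y_test))
print("kNN best params:", knn_search.best_params_)
print("kNN test accuracy:", knn_search.score(X_test, y_test))
```

Scaling matters for both models here: kNN relies on distances between points, and regularized Logistic Regression penalizes coefficients whose size depends on the feature scale.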
Key Findings
- The Interpretability Winner: Logistic Regression achieved an accuracy of 81.7% and a strong AUC of 0.889. Its high interpretability allowed me to confirm that high frequencies of "$", "000", and "!" were the strongest predictors of spam.
- The Performance Trade-off: While kNN showed a slightly higher accuracy of 83.1%, it also produced nearly double the false positives, meaning it was much more likely to block legitimate, "real" emails.
- Real-World Application: I concluded that for a scalable spam filter, Logistic Regression is a valid choice because of its lower computational cost and lower risk of hiding important messages from users.
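To show how a model can score higher on accuracy while still blocking more legitimate mail, here is a small sketch using scikit-learn's `confusion_matrix`. The labels and predictions are toy values made up for illustration, not the project's results, and `summarize` is a hypothetical helper.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels: 0 = legitimate email, 1 = spam (made-up values, not project data).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
pred_a = [0, 0, 0, 0, 0, 1, 1, 0, 0, 1]  # cautious model: misses spam, rarely blocks real mail
pred_b = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]  # aggressive model: catches all spam, blocks more real mail


def summarize(y_true, y_pred):
    """Return accuracy and the count of legitimate emails flagged as spam."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "false_positives": int(fp),
    }


summary_a = summarize(y_true, pred_a)
summary_b = summarize(y_true, pred_b)
print("Model A:", summary_a)
print("Model B:", summary_b)
```

In this toy case the second model is more accurate overall yet flags twice as many legitimate emails, which is the kind of trade-off that accuracy alone hides.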
Resources
Submission report.
Jupyter notebook of model.