Kevin Burridge - eDiscovery Professional

Originally submitted as assessment material for the module:

Machine Learning and Predictive Analytics

Problem Identification

The problem to investigate is email spam filtering and the adaptability of a Machine Learning (ML) algorithm to correctly classify spam emails. Spam detection is critical to organisations and users across the globe, helping to improve cybersecurity and assist fraud prevention. In 2023 alone, an estimated 45.6% of all emails sent globally were identified as spam (Statista, 2024). Filtering spam from inboxes using ML is a perfect example of bringing theory into practice in the real-world.

In this paper, spam emails will refer to malicious emails (e.g. phishing) or other unsolicited email whose arrival in an inbox is likely to cause annoyance to a user. I set out to build a predictive model which can classify emails using frequency-based features, specifically experimenting with language-agnostic features.

Dataset Description and Analysis

This analysis uses the Spambase dataset, originally released in 1999 by Hopkins et al. It contains 4,601 instances with 57 features (Hopkins et al., 1999). To create the dataset, researchers collected spam emails provided by their internal postmaster at Hewlett-Packard Labs and other individuals, along with non-spam emails collected from personal emails. The dataset features were generated by exploring the frequency that terms appeared in an email. They also created features which measured the run-length of sequences of consecutive capital letters.

The dataset was compiled in 1999. As a result, spam habits and emails are likely to have changed in the decades since. This is known as Concept Drift, which Singhal et al. (2020) described as a scenario “…where the relation between the input data and the target variable changes over time…”. Whilst time has passed, I feel that the variety of features and number of instances available lend well to a simple language-agnostic experiment to detect spam.

After the dataset was imported, checks were performed for missing values. An additional check was conducted for any duplicate instances (emails). This would be possible since the authors noted the emails were sourced “…from our postmaster and individuals who had filed spam.” (Hopkins et al., 1999) and as such there could be overlap. This check resulted in the removal of 391 instances (emails) where a copy was already present, leaving 4,210 instances remaining. Duplicates were removed to prevent data leakage between training and testing sets.

A multivariate analysis will be conducted. This is justified since the dataset has multiple features available for analysis such as dollar signs, exclamation marks, and numbers.

Of the 57 available features I have selected the only 12 features which are numbers or symbols, as these fit my approach of being language agnostic. I rejected the feature word_freq_3d as this is mixed. Spam emails often rely on exaggerated visuals, such as the overuse of dollar signs or large monetary amounts. This approach is grounded in research. In a 2024 article, Longtchi et al. found a direct correlation between Attention Grabbing techniques (bold fonts, uppercase letters, highlighted text) and Visual Deception (replacing “vv” with “w”) in malicious emails (Longtchi et al., 2024).

The 12 features are listed in Table A, including examples of the first and last instances, along with the Target, Class.

Instance #	#0	#1	#2	#3	#4	#4205	#4206	#4207	#4208	#4209
word_freq_000	0	0.43	1.16	0	0	0	0	0	0	0
word_freq_650	0	0	0	0	0	0	0	0	0	0
word_freq_857	0	0	0	0	0	0	0	0	0	0
word_freq_415	0	0	0	0	0	0	0	0	0	0
word_freq_85	0	0	0	0	0	0	0	0	0	0
word_freq_1999	0	0.07	0	0	0	0	0	0	0	0
char_freq_;	0	0	0.01	0	0	0	0	0.102	0	0
char_freq_(	0	0.132	0.143	0.137	0.135	0.232	0	0.718	0.057	0
char_freq_[	0	0	0	0	0	0	0	0	0	0
char_freq_!	0.778	0.372	0.276	0.137	0.135	0	0.353	0	0	0.125
char_freq_$	0	0.18	0.184	0	0	0	0	0	0	0
char_freq_#	0	0.048	0.01	0	0	0	0	0	0	0
Class	1	1	1	1	1	0	0	0	0	0
Table A: Selected Features, showing the first and last instances in Spambase.

Algorithm Selection

The primary choice of algorithm for this task is LR. As a form of supervised learning, LR can be trained with past examples of spam, enabling classification of unseen examples. Pal (2021) writes in Cancer Research, Statistics, and Treatment (CRST) of how multivariate LR is a widely accepted form of ML for cancer diagnosis, which uses the past experience of patients to estimate an outcome by modelling previous data. Whilst this dataset is not predicting the outcome of cancer, it is following the same multivariate approach to create a binary classification (spam/not spam).

As well as classification, probabilities can be produced by LR, enabling custom thresholds to be set if required, as well as providing confidence levels e.g. “a 90% chance this email is spam”. LR is computationally efficient compared to other models, with a fast training and prediction speed, it also provides the ability to understand which features are having the most effect on predictions. There are risks of course, such as overfitting, or non-linearity within features. However, the input features are all numerical with linear relationships between them. For example, the frequency of the dollar sign appearing in an email, and whether it was classed as spam.

The model was implemented using Scikit-learn (Pedregosa et al., 2011). Prior to fitting the model, a Train/Test/Split was performed with 80% training, 20% testing, giving 3,368 and 842, respectively. The LR model was implemented using Scikit-learn’s LogisticRegressionCV function. This essentially allowed for various configurations to be tested, with the best model automatically selected. Features were scaled and a selection of regularisation values were provided. Cross-validation was set to five, to aid with the selection of the best model. The best model produced an accuracy score of 81.7%.

An advantage of LR as previously mentioned, is the high interpretability of the model. Feature coefficients, or weights, can be examined to determine which had more impact. Figure 1 shows the outcome of the 12 features. As might be expected, an email was more likely to be marked spam if it had a high use of dollar signs, multiple zeros, and exclamation marks.

No description has been provided for this image

Figure 1: Feature weighting from the Logistic Regression model.

Evaluation

Various metrics are available to interpret the performance of the model. As Bonaccorso notes (2018, p.184), “The performances of all classifiers must be measured using different approaches…” and as such we will consider a Receiver Operating Characteristic (ROC) plot, a Confusion Matrix, and Classification Metrics. For the purposes of this implementation the default classification score of 50% will remain, this is an experimental implementation based on language agnostic features. The default scoring will also enable easy comparison to an alternative model.

Initially let us explore a ROC plot shown in Figure 2. Broadly, the graph appears to be performing well, with an Area Under the Curve (AUC) score of 0.889 (3.d.p). AUC scores range from 0.5 (no better than a coin flip) to 1.0 (a perfect score) and each graph should be taken in context to its application. Indeed, a study published in 2022 in The Lancet, considered AUC scores published in COVID-19 research material. Our score of 0.889 could have been graded anywhere from Fair to Excellent depending on institution (de Hond et al., 2022).

No description has been provided for this image

Figure 2: ROC Plot of the Logistic Regression model, with added 50/50 comparison.

We can next visualise the model’s performance by implementing a Confusion Matrix, shown in Figure 3. Within our total test set of 842, the model correctly identified 206 and 482 emails as spam and not spam, respectively. There were 130 (15.4%) instances where spam emails evaded the model. 24 Emails (2.9%) were false positives and had this been a real email filter, may have been automatically deleted or moved to a spam folder. Considering that this is based only on the frequency of numbers and symbols, these are promising results.

No description has been provided for this image

Figure 3: Confusion Matrix breaking down the model’s performance over the test set.

Finally, we can consider performance metrics such as precision and recall. Table 2 shows the classification results over the test set of 842. Whilst the model had a high precision and was very accurate when it did detect spam (89.6%), it only managed to correctly locate 61.3% of spam instances. This is in line with the Confusion Matrix metrics of 130 false negatives. When predicting not spam it clearly performed better, isolating 95.3% of all instances, although its confidence had a lower precision at 78.8%, in line with the 24 false positives.

	Precision	Recall	F1
Not Spam	78.8%	95.3%	86.2%
Spam	89.6%	61.3%	72.8%
Table 2: Classification results for Logistic Regression over the test set.

Overall, the LR model has fared well with only 12 features.

Comparative Analysis

An alternative model for classifying spam could be k-Nearest Neighbours (kNN), another form of supervised learning. A kNN model was implemented as a comparative exercise against LR. The same Train/Test/Split methodology was implemented with a common random variable ensuring the same training and test sets. Scikit-learn’s GridSearchCV capability was employed with an initial k-value as the square root of the training set, which was expanded either side to allow a range of values to be tested. Features were scaled with StandardScaler(), to help prevent higher values from skewing the distance calculations. The same scaler was used with the LR model. The grid search also tested Euclidean and Manhattan distances, cross-validation was set to five sets, as with LR.

In comparison to LR, the kNN model achieved an accuracy score of 83.1%. A subtle improvement of 1.4% over LR. Although when calculating the AUC score, there was a slight decrease in score by 0.003, at 0.886. This is illustrated in Figure 4 which shows an ROC comparison between the two models, where LR comes out ahead.

No description has been provided for this image

Figure 4: ROC comparison between the Logistic Regression and kNN models.

Figure 5 illustrates the kNN confusion matrix performance. The kNN model has managed to increase the true-positive spam detection whilst also reducing the number of false negatives. However, the false positive category has a notable increase to 43, whereas LR only marked 24 false positives. In the context of a spam filter where emails may be deleted or filtered from mailboxes, this is a notable decrease in performance.

No description has been provided for this image

Figure 5: Confusion Matrix showing the performance of kNN over the test set.

We can compare the classification metrics for both models side-by-side. Table 3 shows the consolidated F1 scores are higher for kNN which on paper would normally suggest a better performance, given also that the kNN model located 70.5% of the spam emails. However, as shown with the confusion matrix there was a noticeable drop in performance with real emails being misclassified as spam. On balance, a user is more likely to accept seeing the occasional spam email, than have their real mail hidden or deleted.

			Precision	Recall	F1
	Not Spam	LR	78.8%	95.3%	86.2%
	Not Spam	kNN	82.4%	91.5%	86.7%
	Spam	LR	89.6%	61.3%	72.8%
	Spam	kNN	84.6%	70.5%	76.9%
Table 3: Classification comparison between Logistic Regression and kNN over the test set.

As a lazy-learning methodology, kNN models can accept multiple features such as those present in the current dataset, and whilst a kNN model can be trained and deployed to detect spam email it does have some distinct disadvantages over LR. These include a lower interpretability at a feature level; it is difficult to determine which features are having the most influence over the model. There is also the computational cost to consider. Cunningham and Delany (2022) explained that since all the work is done at runtime, a kNN model can have poor runtime performance if a large training set is used. This is important considering that a spam filter may have to consider millions of emails at any given moment.

Based on the results of the two models and a comparison of strengths and limitations, kNN can be rejected as a primary choice for classifying spam emails. Whilst it has some potential, the inability to examine which features are having the most impact, alongside the lower prediction speed, make kNN unsuitable at scale.

Conclusion

An LR model based solely on features containing numbers or symbols, has at least some real-world potential and could be explored further at scale. The implementation of the model allowed 95.3% of real emails through its filter, although with a disappointing catchment of only 61.3% of spam. However, this is before any advanced feature engineering and adjustments to coefficients. A comparative implementation of a kNN model only allowed 91.5% of real emails through.

This is not to say that LR is the very best approach. Recently in an experiment by Zavrak and Yilmaz (2023), analysing the performance of Convolutional Neural Networks (CNN) for spam detection, a CNN approach outperformed existing ML models in several areas. Ultimately, on the balance of accuracy, interpretability, and computational cost, an LR model would be suitable for a real-world implementation of a language agnostic spam-filter.

Word count (excluding references): 2,176

Bibliography

Bonaccorso, G. (2018) Machine learning algorithms: popular algorithms for data science and machine learning. Second edition. Birmingham; Packt Publishing.

Cunningham, P., Delany, S. J. (2022) k-Nearest Neighbour Classifiers - A Tutorial. ACM computing surveys [online] 54(6), pp. 1 [Accessed 18 January 2026]

de Hond, Anne A H et al. (2022) Interpreting area under the receiver operating characteristic curve. The Lancet Digital Health [online] 4(12), pp. 853-855. [Accessed 15 January 2026]

Hopkins, M., Reeber, E., Forman, G., Suermondt, J. (1999). Spambase, UCI Machine Learning Repository [online]. Available from: http://archive.ics.uci.edu/dataset/94/spambase [Accessed 11 January 2026]

Longtchi, T. et al. (2024) Quantifying Psychological Sophistication of Malicious Emails. IEEE access. [Online] 12, pp. 187512-187535. [Accessed 17 January 2026]

Pal, K. (2021) Logistic regression: A simple primer. Cancer Research, Statistics, and Treatment [online]. 4(3), pp. 551-554. [Accessed 18 January 2026]

Pedregosa et al. (2011) Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research [online]. JMLR 12, pp. 2825-2830. [Accessed 11 January 2026]

Singhal, S., Chawla, U., Shorey, R., ed. (2020) Machine Learning & Concept Drift based Approach for Malicious Website Detection, 2020 International Conference on COMmunication Systems & NETworkS (COMSNETS), Bengaluru, India, 7-11 January 2020. IEEE [online]. Available from: https://ieeexplore-ieee-org.uwe.idm.oclc.org/document/9027485 [Accessed 18 January 2026]

Statista (2024) Global spam volume as percentage of total e-mail traffic from 2011 to 2023. Available from: https://www.statista.com/statistics/420400/spam-email-traffic-share-annual/ [Accessed 14 January 2026]

Zavrak, S., Yilmaz, S. (2023) Email spam detection using hierarchical attention hybrid deep learning method. Expert systems with applications [Online] 233(120977) [Accessed 18 January 2026]