Spam Mail Detection Using Machine Learning

20.04.2021

As we are in 2021, you almost certainly know what an e-mail is, and you have most probably received one, sent one, or both. An e-mail is a message, which may contain text, files, images, or other attachments, sent through a network to a specified individual or group of individuals.

Despite the many chat apps such as WhatsApp, Facebook Messenger and Snapchat, e-mail has remained a central part of daily digital life. By 2024, the number of global e-mail users is set to grow to 4.48 billion, up from 3.8 billion in 2018. Among e-mail clients, Apple and Google are in a constant battle for the top spot.

E-mail has gone through many updates over time. These updates allow marketers to send interactive emails to clients that include a call-to-action message, dynamic content such as videos, images and GIFs, pre-header text, and countdown timers. A larger customer network, in turn, raises overall revenue and helps a business last longer in its industry.

Emails offer confidentiality since the message is only visible to the sender and the receiver. Emails allow companies to send out detailed information using attachments such as spreadsheets and Word reports. The added security of customised email portals gives firms control over their messages.

Email is huge! It has become an indispensable part of our lives and our businesses. In fact, a report from Statista estimates that some 281.1 billion emails are sent every day worldwide. That’s 37 emails for every person on the planet. And of all that mail, more than half is spam. It’s annoying, it impacts productivity, and it opens us up to phishing and malware attacks.

You must have heard of spam email, but in case you haven’t: spam is simply junk email, that is, email sent without the explicit consent of the recipient.

The main purpose of spam is to make a profit. It is very cheap to send, and its costs are insignificant compared with conventional marketing techniques, so marketing by spam is very cost-effective despite very low rates of purchases in response. Junk emails are sent to penetrate user inboxes with messages that promote products and services in order to turn a profit.

Nowadays, however, spam is increasingly viewed as a more severe messaging threat, because it is used to deliver worms, viruses, and Trojans as well as scams of a more directly financial nature. Spammers often trick even the sharpest e-mail users into opening these messages.

Beyond this, spam causes a variety of problems, such as:

·       Spam prevents the user from making full and good use of time, storage capacity and network bandwidth.

·       The huge volume of spam flowing through computer networks has destructive effects on the memory space of email servers, communication bandwidth, CPU power and user time.

·       The menace of spam email increases every year and is responsible for over 77% of global email traffic.

·       It has also resulted in untold financial loss for many users who have fallen victim to internet scams and other fraudulent practices of spammers, who send emails pretending to be from reputable companies in order to persuade individuals to disclose sensitive personal information such as passwords, Bank Verification Numbers (BVN) and credit card numbers.

Regular users can spot a spam email from miles away by now: unknown senders, enticing subject lines, weird links, spelling mistakes, unrealistic offers, non-personal salutations, and threatening language are dead giveaways. Still, many spam emails pass as legitimate on the surface, which is why users need a solid security program in place and should exercise extra caution when opening emails.

The two common approaches to filtering spam are knowledge engineering and machine learning. In knowledge engineering, emails are classified as either spam or ham (legitimate mail) using a set of rules. These rules must be created by the person using the filter or by the software company that supplies a specific rule-based spam-filtering tool. This method does not guarantee efficient results, since the rules need to be updated continually. That wastes time and makes the approach unsuitable, especially for naive users.
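To make the rule-based idea concrete, here is a minimal sketch in Python. The patterns, the scores and the threshold are all invented for illustration and are not taken from any real filter; the point is only that every rule has to be written and maintained by hand.

```python
import re

# Hypothetical hand-written rules: (pattern, score). Both the patterns and
# the scores are invented for illustration, not taken from a real filter.
RULES = [
    (re.compile(r"\bfree\b", re.IGNORECASE), 2),
    (re.compile(r"\bwinner\b", re.IGNORECASE), 3),
    (re.compile(r"click here", re.IGNORECASE), 2),
    (re.compile(r"!{3,}"), 1),
]
THRESHOLD = 4  # assumed cut-off; a real filter would have to tune this


def rule_score(text):
    """Add up the scores of every rule whose pattern appears in the text."""
    return sum(score for pattern, score in RULES if pattern.search(text))


def classify_by_rules(text):
    """Label a message 'spam' when its rule score reaches the threshold."""
    return "spam" if rule_score(text) >= THRESHOLD else "ham"


print(classify_by_rules("You are a WINNER! Click here for your free prize"))
print(classify_by_rules("Minutes from Monday's meeting are attached"))
```

As soon as spammers change their wording, someone has to add or adjust a rule in the list, which is the maintenance burden described above.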

Machine learning approaches have proved to be more efficient than knowledge engineering approaches. Machine learning is a subfield of the broad field of artificial intelligence that aims to make machines able to learn like humans. Machine learning is everywhere. From self-driving cars to face recognition technology, it is machine learning behind the scenes that drives all of it. If you’ve ever used Gmail or Yahoo Mail, you must have seen a folder named “Spam” where all unwanted mail goes. Have you ever wondered how that works? That’s machine learning at work, too!

Unlike the knowledge engineering approach, machine learning does not require any rules to be specified. Instead, a set of training samples, which are pre-classified email messages, is provided. A particular machine learning algorithm is then used to learn the classification rules from these email messages. Several studies have been carried out on machine learning techniques, and many of these algorithms are being applied in the field of email spam filtering. Examples of such algorithms include Deep Learning, Naïve Bayes, Support Vector Machines, Neural Networks, K-Nearest Neighbour, Rough sets, and Random Forests.

To handle the threat posed by email spam effectively, leading email providers such as Gmail, Yahoo Mail and Outlook employ a combination of different machine learning (ML) techniques, such as neural networks, in their spam filters. Since machine learning has the capacity to adapt to varying conditions, Gmail and Yahoo Mail spam filters do more than just check junk emails against pre-existing rules. They generate new rules themselves based on what they have learnt as they continue their spam filtering operation.

To address the challenges of spam email, machine learning models are driven mainly by supervised learning. In supervised learning, we teach or train the machine using well-labelled data, which means each training example is already tagged with the correct answer. The algorithm analyses this training data, and the machine is then given a new set of examples for which it should produce the correct outcome based on what it learned from the labelled data.

Unsupervised learning, by contrast, is the training of a machine using information that is neither classified nor labelled, allowing the algorithm to act on that information without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns and differences without any prior training on the data.

Below are some of the most popular machine learning methods:

Naïve Bayes classifier: This is a supervised machine learning algorithm in which word probabilities play the main role. If some words occur often in spam but not in ham, then an incoming e-mail containing them is probably spam. The Naïve Bayes classifier has become a very popular method in mail filtering software. A Bayesian filter must be trained to work effectively. Every word in its database has a certain probability of occurring in spam or ham email. If the combined word probabilities exceed a certain limit, the filter marks the e-mail as spam; otherwise it marks it as ham.
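The word-probability idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the code of any particular filter: the four training mails are invented, add-one (Laplace) smoothing is an assumption made here so that unseen word and class combinations do not produce a zero probability, and instead of a fixed limit the sketch simply picks whichever class has the higher score.

```python
import math
from collections import Counter

# Tiny invented training set; real filters learn from thousands of mails.
TRAIN = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
]


def train_naive_bayes(examples):
    """Count words per class and how many mails each class has."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocabulary


def log_score(text, label, word_counts, class_counts, vocabulary):
    """log P(label) plus the sum of log P(word | label), add-one smoothed."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in text.split():
        if word in vocabulary:  # words never seen in training are skipped
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
    return score


def predict(text, model):
    """Return the class with the higher log score."""
    return max(("spam", "ham"), key=lambda label: log_score(text, label, *model))


model = train_naive_bayes(TRAIN)
print(predict("claim your free money", model))           # spam-like words dominate
print(predict("agenda for the project meeting", model))  # ham-like words dominate
```

Working with logarithms turns the product of many small probabilities into a sum, which avoids numerical underflow on long emails.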

 Artificial Neural Networks classifier: An artificial neural network (ANN), also called simply a "Neural Network" (NN), is a computational model based on biological neural networks. It consists of an interconnected collection of artificial neurons. An artificial neural network is an adaptive system that changes its structure based on information that flows through the artificial network during a learning phase. The ANN is based on the principle of learning by example.

There are two types of training for a neural network:

1.   Supervised: Here, the network is given a set of inputs and matching output patterns, known as the training dataset, to train the network.

2.   Unsupervised: In this instance, the network trains itself by producing groups of patterns. There is no earlier set of training data given to the system.
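As a rough illustration of supervised training, the sketch below trains a single artificial neuron (a perceptron), the simplest building block of a neural network. The two features and the four labelled examples are invented for this example, and a real spam filter would use many neurons and far more data.

```python
# Each input is a made-up feature pair: (count of "free", count of "meeting").
# Labels: 1 = spam, 0 = ham. The data is invented for illustration.
TRAINING_DATA = [
    ((3, 0), 1),
    ((2, 0), 1),
    ((0, 2), 0),
    ((0, 1), 0),
]


def predict(weights, bias, features):
    """Fire (return 1) when the weighted sum of the inputs is positive."""
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0


def train_perceptron(data, epochs=10, learning_rate=0.1):
    """Nudge the weights toward the correct answer after every mistake."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for features, label in data:
            error = label - predict(weights, bias, features)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


weights, bias = train_perceptron(TRAINING_DATA)
print(predict(weights, bias, (4, 0)))  # many "free", no "meeting"
print(predict(weights, bias, (0, 3)))  # no "free", many "meeting"
```

The matching output patterns (the labels) are what make this supervised: the neuron only changes its weights when its answer disagrees with the label.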

Support Vector Machines classifier: A Support Vector Machine (SVM) is a supervised machine learning algorithm that is mostly used in classification problems. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features) with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyperplane that best separates the two classes. SVMs can be trained easily and, according to some researchers, they outperform many of the popular email spam classification methods.
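The hyperplane idea can be sketched with a very small linear SVM trained by sub-gradient descent on the hinge loss. This is a teaching sketch under stated assumptions: the two features, the four data points, the learning rate and the regularisation strength are all invented here, and production systems would normally rely on an optimised library instead.

```python
# Invented features: (count of "free", count of "meeting").
# Labels follow the usual SVM convention: +1 = spam, -1 = ham.
DATA = [
    ((3.0, 0.0), 1),
    ((2.0, 1.0), 1),
    ((0.0, 3.0), -1),
    ((1.0, 2.0), -1),
]


def decision_value(weights, bias, point):
    """Signed score: positive on the spam side of the hyperplane."""
    return sum(w * x for w, x in zip(weights, point)) + bias


def train_linear_svm(data, epochs=100, learning_rate=0.1, reg=0.01):
    """Sub-gradient descent on the regularised hinge loss."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for point, label in data:
            inside_margin = label * decision_value(weights, bias, point) < 1
            # The regularisation term always shrinks the weights slightly.
            weights = [w - learning_rate * reg * w for w in weights]
            if inside_margin:
                # Points inside the margin push the hyperplane away from them.
                weights = [w + learning_rate * label * x
                           for w, x in zip(weights, point)]
                bias += learning_rate * label
    return weights, bias


def classify(weights, bias, point):
    """Report which side of the learned hyperplane the point falls on."""
    return "spam" if decision_value(weights, bias, point) > 0 else "ham"


weights, bias = train_linear_svm(DATA)
print(classify(weights, bias, (5.0, 0.0)))
print(classify(weights, bias, (0.0, 5.0)))
```

With two features the hyperplane is just a line; with a word count vector as input it becomes a hyperplane in a space with one dimension per vocabulary word.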

Decision Tree: Decision tree classification generates its output as a tree-like structure called a decision tree, in which each branch node represents a choice between a number of alternatives, and each leaf node represents a classification or decision.
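A decision tree can be pictured as nested questions. The sketch below hard-codes a tiny tree by hand purely to show how branch nodes and leaf nodes are used at prediction time; the questions "contains_link" and "sender_known" are hypothetical features, and a real learning algorithm would choose the questions from training data.

```python
# A hand-built tree for illustration; a learning algorithm would normally
# choose the questions and their order from training data.
TREE = {
    "question": "contains_link",          # branch node
    "yes": {
        "question": "sender_known",       # branch node
        "yes": {"label": "ham"},          # leaf node
        "no": {"label": "spam"},          # leaf node
    },
    "no": {"label": "ham"},               # leaf node
}


def classify(node, email):
    """Walk from the root to a leaf, answering one question per branch."""
    while "label" not in node:
        answer = "yes" if email[node["question"]] else "no"
        node = node[answer]
    return node["label"]


print(classify(TREE, {"contains_link": True, "sender_known": False}))
print(classify(TREE, {"contains_link": True, "sender_known": True}))
```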

Apart from the methods mentioned above, the Artificial Immune System classifier, Rough sets classifier, ensemble classifiers, Random Forest, and deep learning algorithms are also popular machine learning methods. Deep learning models in particular can achieve very high accuracy in email spam classification. Deep learning is a machine learning technique that allows computers to learn from experience without explicit programming and to mine valuable patterns from raw data.

To test the performance of the methods mentioned above, corpora of spam and legitimate emails have to be compiled; several collections of email are publicly available for researchers to use.

To build a spam filtering system, we generally follow this process:

Exploratory data analysis & data processing: The dataset used here is split into a training set and a test set, each divided equally between spam and ham mails.
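One way to produce such a split is to shuffle each class separately and cut off the same fraction from both, which keeps the spam/ham balance identical in the two sets. The sketch below assumes a small invented list of (text, label) pairs and a 25% test fraction; neither comes from the dataset mentioned above.

```python
import random

# Invented, perfectly balanced toy data: 4 spam and 4 ham mails.
EXAMPLES = [(f"spam mail {i}", "spam") for i in range(4)] + \
           [(f"ham mail {i}", "ham") for i in range(4)]


def stratified_split(examples, test_fraction=0.25, seed=0):
    """Split per class so both sets keep the same spam/ham ratio."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    train, test = [], []
    for items in by_label.values():
        items = items[:]        # copy so the caller's data is not reordered
        rng.shuffle(items)
        cut = int(len(items) * test_fraction)
        test.extend(items[:cut])
        train.extend(items[cut:])
    return train, test


train_set, test_set = stratified_split(EXAMPLES)
print(len(train_set), len(test_set))
```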

Text cleaning is the first step, where we remove from the document those words which may not contribute to the information we want to extract. Emails may contain a lot of undesirable elements such as punctuation marks, stop words and digits, which may not be helpful in detecting spam.
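A minimal cleaning step might look like the following. The stop-word list here is a short hand-picked sample chosen for this example; real projects usually take a much longer list from a text-processing library.

```python
import re

# A short, hand-picked stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of", "for", "you", "your"}


def clean_text(text):
    """Lower-case the text, drop digits and punctuation, remove stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and whitespace
    return [word for word in text.split() if word not in STOP_WORDS]


print(clean_text("Congratulations!!! You have WON $1000, claim your prize today."))
# ['congratulations', 'have', 'won', 'claim', 'prize', 'today']
```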

Feature extraction: Our algorithms always expect the input to be integers or floats, so we need a feature extraction layer in the middle to convert the words to numbers.
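A common choice for this layer is a word count vector (also called a bag of words): every word in the training vocabulary gets a fixed position, and each mail becomes a list of counts. The sketch below is a plain-Python illustration with two invented documents; libraries offer the same idea in ready-made form.

```python
def build_vocabulary(documents):
    """Assign every distinct word a fixed position, in alphabetical order."""
    words = sorted({word for doc in documents for word in doc.split()})
    return {word: index for index, word in enumerate(words)}


def count_vector(document, vocabulary):
    """Turn a document into a list of word counts, one slot per known word."""
    vector = [0] * len(vocabulary)
    for word in document.split():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] += 1
    return vector


docs = ["free prize free", "meeting at noon"]
vocab = build_vocabulary(docs)  # at=0, free=1, meeting=2, noon=3, prize=4
print(count_vector("free prize free", vocab))     # [0, 2, 0, 0, 1]
print(count_vector("free lunch at noon", vocab))  # [1, 1, 0, 1, 0]
```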

Training the classifiers: Here we train our models for classification (say, a Naive Bayes classifier and a Support Vector Machine). Once the classifiers are trained, we can check the performance of the models on the test set. We extract a word count vector for each mail in the test set and predict its class (ham or spam) with the trained model (in this case, the NB classifier and the SVM model).

Finally, we check the results on the test set created earlier.
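Checking the results usually means comparing the predicted labels with the true labels of the test set. The sketch below computes three common measures; the two label lists are invented for illustration and do not come from any real experiment.

```python
def evaluate(true_labels, predicted_labels, positive="spam"):
    """Compute accuracy, precision and recall for the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_pos = sum(1 for t, p in pairs if t != positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    return {
        "accuracy": correct / len(pairs),
        "precision": true_pos / predicted_pos if predicted_pos else 0.0,
        "recall": true_pos / actual_pos if actual_pos else 0.0,
    }


# Invented labels for five test mails.
true_labels = ["spam", "spam", "ham", "ham", "ham"]
predicted = ["spam", "ham", "ham", "ham", "spam"]
print(evaluate(true_labels, predicted))
# {'accuracy': 0.6, 'precision': 0.5, 'recall': 0.5}
```

Precision matters for spam filtering in particular, because a ham mail wrongly sent to the spam folder (a false positive) is usually more costly than a spam mail that slips through.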


Our students at Perfect eLearning have developed an application in Python which uses machine learning for spam email detection. The application uses a Naïve Bayesian model. Here’s the GitHub link that you can use to develop your own Spam Mail Detection project in Python.

Spam Mail Detection Multinomial Naive Bayesian Model



Conclusion: Spam email is one of the most demanding and troublesome internet issues in today’s world of communication and technology. It is almost impossible to think about e-mail without considering the issue of spam. Spammers misuse this communication facility by generating spam mails, thereby affecting organisations and many email users.

The machine learning model used by Google has now advanced to the point that it can detect and filter out spam and phishing emails with about 99.9 percent accuracy. The implication is that about one out of a thousand messages succeeds in evading its spam filter.