Build an email spam detector using Amazon SageMaker

Spam emails, also known as junk mail, are sent to a large number of users at once and often contain scams, phishing content, or cryptic messages. Spam emails are sometimes sent manually by a human, but most often they are sent using a bot. Examples of spam emails include fake ads, chain emails, and impersonation attempts. There is a risk that a particularly well-disguised spam email may land in your inbox, which can be dangerous if clicked on. It’s important to take extra precautions to protect your device and sensitive information.

As technology improves, detecting spam email becomes more challenging because spam keeps changing. Spam is quite different from other types of security threats. It may at first look like an annoying message rather than a threat, but it has an immediate effect. Spammers also often adopt new techniques. Organizations that provide email services want to minimize spam as much as possible to avoid any damage to their end customers.

In this post, we show how straightforward it is to build an email spam detector using Amazon SageMaker. The built-in BlazingText algorithm offers optimized implementations of Word2vec and text classification algorithms. Word2vec is useful for various natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, and machine translation. Text classification is essential for applications like web searches, information retrieval, ranking, and document classification.

Solution overview

This post demonstrates how you can set up an email spam detector and filter spam emails using SageMaker. Let’s see how a spam detector typically works, as shown in the following diagram.

Emails are sent through a spam detector. An email is sent to the spam folder if the spam detector detects it as spam. Otherwise, it’s sent to the customer’s inbox.

We walk you through the following steps to set up our spam detector model:

Download the sample dataset from the GitHub repo.
Load the data in an Amazon SageMaker Studio notebook.
Prepare the data for the model.
Train, deploy, and test the model.

Prerequisites

Before diving into this use case, complete the following prerequisites:

Set up an AWS account.
Set up a SageMaker domain.
Create an Amazon Simple Storage Service (Amazon S3) bucket. For instructions, see Create your first S3 bucket.

Download the dataset

Download email_dataset.csv from GitHub and upload the file to your S3 bucket.

The BlazingText algorithm expects a single preprocessed text file with space-separated tokens. Each line in the file should contain a single sentence. If you need to train on multiple text files, concatenate them into one file and upload the file in the respective channel.
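If your training text is spread across several files, the concatenation described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `concatenate_text_files` is hypothetical and not part of the original notebook.

```python
def concatenate_text_files(input_paths, output_path):
    """Copy the lines of every input file into one output file.

    Blank lines are skipped and every kept line ends with exactly one
    newline, so each line of the result still holds a single sentence.
    Returns the number of lines written.
    """
    line_count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if line:
                        out.write(line + "\n")
                        line_count += 1
    return line_count
```

You would then upload the single output file to the channel that should use it.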

Load the data in SageMaker Studio

To perform the data load, complete the following steps:

Download the spam_detector.ipynb file from GitHub and upload it to SageMaker Studio.
In Studio, open the spam_detector.ipynb notebook.
If you are prompted to choose a kernel, choose the Python 3 (Data Science 3.0) kernel and choose Select. If not, verify that the right kernel has been automatically selected.

Import the required Python libraries and set the role and the S3 bucket. Specify the S3 bucket and prefix where you uploaded email_dataset.csv.
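A setup cell along these lines might look like the following sketch. It assumes version 2 of the SageMaker Python SDK; the bucket name, the prefix, and the helper names `s3_uri` and `get_session_and_role` are placeholders chosen for illustration, not values from the original notebook.

```python
def s3_uri(bucket, *key_parts):
    """Join a bucket name and key parts into an s3:// URI."""
    key = "/".join(part.strip("/") for part in key_parts if part)
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"


def get_session_and_role():
    """Return a SageMaker session and the notebook's execution role.

    The SDK is imported inside the function so that the URI helper above
    can be used on machines where the SageMaker SDK is not installed.
    """
    import sagemaker

    session = sagemaker.Session()
    role = sagemaker.get_execution_role()
    return session, role


# Placeholder values (assumptions): replace with your own bucket and prefix.
BUCKET = "my-spam-detector-bucket"
PREFIX = "spam-detector"
DATASET_URI = s3_uri(BUCKET, PREFIX, "email_dataset.csv")
print(DATASET_URI)
```

Inside a Studio notebook, calling `get_session_and_role()` should return the role attached to your user profile; outside SageMaker you would typically pass a role ARN explicitly instead.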

Run the data load step in the notebook.

Check whether the dataset is balanced based on the Category labels.

We can see our dataset is balanced.
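The load and balance check can be sketched with the standard library as follows. The four rows are an illustrative stand-in for email_dataset.csv, not real data from it, and the lowercase `ham` and `spam` values are an assumption about how the Category column is spelled. The original notebook may use a different library for this step.

```python
import csv
import io
from collections import Counter

# Tiny in-memory stand-in for email_dataset.csv (illustrative rows only).
SAMPLE_CSV = """Category,Message
ham,See you in the office on Friday.
spam,Best summer deal here
ham,Lunch at noon?
spam,Win this award now
"""


def load_rows(handle):
    """Read CSV rows into a list of dicts keyed by the header names."""
    return list(csv.DictReader(handle))


def label_counts(rows):
    """Count how many rows carry each Category value."""
    return Counter(row["Category"] for row in rows)


rows = load_rows(io.StringIO(SAMPLE_CSV))
counts = label_counts(rows)
for label, count in counts.items():
    print(f"{label}: {count} ({count / len(rows):.0%})")
```

For the real file, you would pass `open("email_dataset.csv", newline="", encoding="utf-8")` to `load_rows` instead of the in-memory sample.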

Prepare the data

The BlazingText algorithm expects the data in the following format:

__label__<label> "<features>"

Here’s an example:

__label__0 "This is HAM"
__label__1 "This is SPAM"

For more information, see Training and Validation Data Format for the BlazingText Algorithm.

Now run the data preparation step in the notebook.

First, you need to convert the Category column to an integer. The following cell replaces the SPAM value with 1 and the HAM value with 0.
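A minimal sketch of this conversion, assuming each row is a dict with Category and Message keys, is shown below. The helper `encode_category` is hypothetical, and the case-insensitive matching is an added assumption so that both `SPAM` and `spam` map to the same integer.

```python
LABEL_TO_INT = {"spam": 1, "ham": 0}


def encode_category(value):
    """Map a Category string to 1 (spam) or 0 (ham), ignoring case and spaces."""
    key = value.strip().lower()
    if key not in LABEL_TO_INT:
        raise ValueError(f"Unexpected Category value: {value!r}")
    return LABEL_TO_INT[key]


# Illustrative rows only; the real rows come from email_dataset.csv.
rows = [
    {"Category": "ham", "Message": "See you in the office on Friday."},
    {"Category": "spam", "Message": "Best summer deal here"},
]
for row in rows:
    row["Category"] = encode_category(row["Category"])
```

Raising on unknown values, rather than silently dropping them, makes labeling problems in the source file visible early.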

The next cell adds the prefix __label__ to each Category value and tokenizes the Message column.
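The labeling and tokenization step might look like the following sketch. The regular-expression tokenizer and the lowercasing are stand-ins chosen to keep the example dependency-free; the original notebook may use a different tokenizer. The sketch also writes the tokens without surrounding quotes, which is an assumption, so adjust it if your pipeline expects the quoted form shown earlier.

```python
import re

# Words (letters, digits, underscore) or single punctuation characters.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def to_blazingtext_line(label, message):
    """Build one training line: the prefixed label, then space-separated tokens."""
    return f"__label__{label} {' '.join(tokenize(message))}"


print(to_blazingtext_line(1, "Best summer deal here!"))
```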

The next step is to split the dataset into train and validation datasets and upload the files to the S3 bucket.
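The split and upload can be sketched as follows. The 80/20 ratio, the fixed seed, and the helper names are assumptions for illustration. `upload_file` relies on the boto3 S3 client method of the same name and imports boto3 only when called.

```python
import random


def train_validation_split(lines, validation_fraction=0.2, seed=42):
    """Shuffle a copy of the lines and split it into (train, validation)."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def write_lines(path, lines):
    """Write one BlazingText-formatted line per row of the output file."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")


def upload_file(local_path, bucket, key):
    """Upload a local file to s3://bucket/key (requires AWS credentials)."""
    import boto3

    boto3.client("s3").upload_file(local_path, bucket, key)
```

A typical sequence would be to call `train_validation_split` on the prepared lines, write each part with `write_lines`, and then call `upload_file` once per file with the bucket and prefix you configured earlier.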

Train the model

To train the model, complete the following steps in the notebook:

Set up the BlazingText estimator and create an estimator instance passing the container image.

Set the learning mode hyperparameter to supervised.

BlazingText has both unsupervised and supervised learning modes. Our use case is text classification, which is supervised learning.

Create the train and validation data channels.

Start training the model.

Get the accuracy on the training and validation datasets.
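The training steps above can be sketched as one function, assuming version 2 of the SageMaker Python SDK. The instance type, the hyperparameter values other than `mode`, and the image version string `"1"` are assumptions for illustration, not values taken from the original notebook. The sketch stops after training and does not retrieve the accuracy metrics.

```python
def blazingtext_hyperparameters(epochs=10, learning_rate=0.05, word_ngrams=2):
    """Hyperparameters for text classification; mode must be 'supervised'."""
    return {
        "mode": "supervised",
        "epochs": epochs,
        "learning_rate": learning_rate,
        "word_ngrams": word_ngrams,
    }


def train_blazingtext(role, train_uri, validation_uri, output_uri,
                      instance_type="ml.c5.xlarge"):
    """Create a BlazingText estimator, attach the data channels, and train.

    SDK imports are kept inside the function so the hyperparameter helper
    above works without the SageMaker SDK installed.
    """
    import sagemaker
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator
    from sagemaker.inputs import TrainingInput

    session = sagemaker.Session()
    # Look up the BlazingText container image for the session's Region.
    image_uri = image_uris.retrieve(
        "blazingtext", session.boto_region_name, version="1"
    )
    estimator = Estimator(
        image_uri=image_uri,
        role=role,
        instance_count=1,
        instance_type=instance_type,
        output_path=output_uri,
        sagemaker_session=session,
    )
    estimator.set_hyperparameters(**blazingtext_hyperparameters())
    # One channel per dataset, both plain-text files in Amazon S3.
    channels = {
        "train": TrainingInput(train_uri, content_type="text/plain"),
        "validation": TrainingInput(validation_uri, content_type="text/plain"),
    }
    estimator.fit(inputs=channels)
    return estimator
```

The training job logs printed by `fit` are one place to look for the reported training and validation accuracy.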

Deploy the model

In this step, we deploy the trained model as an endpoint. Choose your preferred instance type.
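A deployment call might look like the following sketch. The wrapper `deploy_model` and the default instance type are assumptions; the underlying `deploy` method and its `initial_instance_count`, `instance_type`, and `serializer` parameters belong to the SageMaker Python SDK (v2) estimator.

```python
def deploy_model(estimator, serializer, instance_type="ml.m5.xlarge",
                 instance_count=1):
    """Deploy a trained estimator to a real-time endpoint and return the predictor.

    The serializer is passed in by the caller, so this function itself
    does not need to import the SageMaker SDK.
    """
    return estimator.deploy(
        initial_instance_count=instance_count,
        instance_type=instance_type,
        serializer=serializer,
    )
```

With the SDK installed, you could pass `sagemaker.serializers.JSONSerializer()` as the serializer so that the predictor sends dictionaries as JSON. The endpoint incurs charges for as long as it is running.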

Test the model

Let’s provide an example of three email messages that we want to get predictions for:

Click on below link, provide your details and win this award
Best summer deal here
See you in the office on Friday.

Tokenize the email message and specify the payload to use when calling the REST API.
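Building the payload can be sketched as follows, using the three example messages above. The `instances` key follows the JSON request format that BlazingText endpoints accept; the regular-expression tokenizer is a dependency-free stand-in and should match whatever tokenization you used to prepare the training data.

```python
import json
import re

TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

EMAILS = [
    "Click on below link, provide your details and win this award",
    "Best summer deal here",
    "See you in the office on Friday.",
]


def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def build_payload(messages):
    """Return the request body: one space-joined token string per message."""
    return {"instances": [" ".join(tokenize(message)) for message in messages]}


payload = build_payload(EMAILS)
print(json.dumps(payload, indent=2))
```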

Now we can predict the email classification for each email. Call the predict method of the text classifier, passing the tokenized sentence instances (payload) into the data argument.
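The prediction call and response handling can be sketched as follows. It assumes the endpoint returns a JSON list with one object per instance, each holding `label` and `prob` lists, which is the response format BlazingText uses for classification. The helper names and the mapping back to spam and ham are illustrative.

```python
import json

# Mirrors the encoding used during data preparation: 1 is spam, 0 is ham.
LABEL_NAMES = {"1": "spam", "0": "ham"}


def parse_predictions(response_body):
    """Turn the endpoint response into a list of (class name, probability)."""
    if isinstance(response_body, (bytes, bytearray)):
        response_body = response_body.decode("utf-8")
    records = (
        json.loads(response_body)
        if isinstance(response_body, str)
        else response_body
    )
    results = []
    for record in records:
        # Keep only the top label and strip the BlazingText prefix.
        raw_label = record["label"][0].replace("__label__", "")
        results.append((LABEL_NAMES.get(raw_label, raw_label), record["prob"][0]))
    return results


def classify(predictor, payload):
    """Send the payload to the endpoint and parse the top prediction per email."""
    return parse_predictions(predictor.predict(payload))
```

Here `predictor` is the object returned when you deployed the model, and `payload` is the dictionary of tokenized instances.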

Clean up

Finally, delete the endpoint to avoid any unexpected costs.

Also, delete the data file from the S3 bucket.
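Both cleanup actions can be sketched in one helper. The function `clean_up` is hypothetical; it relies on the predictor's `delete_endpoint` method from the SageMaker Python SDK and the `delete_object` method of a boto3 S3 client, both supplied by the caller.

```python
def clean_up(predictor, s3_client, bucket, keys):
    """Delete the endpoint, then delete each listed object from the bucket."""
    predictor.delete_endpoint()
    for key in keys:
        s3_client.delete_object(Bucket=bucket, Key=key)
```

For example, you might call it with the deployed predictor, `boto3.client("s3")`, your bucket name, and the keys of the dataset and the train and validation files you uploaded.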

Conclusion

In this post, we walked you through the steps to create an email spam detector using the SageMaker BlazingText algorithm. With the BlazingText algorithm, you can scale to large datasets. BlazingText is used for textual analysis and text classification problems, and has both unsupervised and supervised learning modes. You can use the algorithm for use cases like customer sentiment analysis and text classification.

To learn more about the BlazingText algorithm, check out BlazingText algorithm.

About the Author

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.