Spam/Ham Message Filtering: An Approach with NLTK NaiveBayesClassifier

Sumeet Kumar
5 min read · Dec 11, 2020

Almost everybody has a smartphone today, and every individual expects to get a message or two.

This means you are familiar with the flood of messages offering easy cash, great lottery wins, deals for your car, and so on. We get hundreds of spam messages daily. Beyond being irritating and space-consuming, they may also be phishing attempts. In any event, this is not material we want to deal with, so the demand for good spam filters is always strong.

1. Algorithm - Naïve Bayes:

Let me introduce you to one of the most effective algorithms for spam filtering: Naive Bayes classification.

Naive Bayes classification is a simple probabilistic algorithm based on the assumption that all model features are independent. In the context of a spam filter, we presume that every word in the message is independent of all other words, and we count them while ignoring their meaning.

Given the set of terms in a message, our classification algorithm produces the probability that the message is spam or not spam. The probability estimate is based on the Bayes formula, and the components of the formula are computed from the word frequencies across the whole message collection.
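For reference, here is a minimal sketch of the scoring rule this describes: for a message containing the words w1, …, wn, the naive independence assumption lets us write

```latex
P(\text{spam} \mid w_1, \dots, w_n) \propto P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})
```

The same score is computed for the ham class, with each P(wi | class) estimated from word frequencies in the training messages, and the class with the larger score wins.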

2. Dataset Overview:

For our purpose, we will use a collection of SMS messages put together by Tiago A. Almeida and José María Gómez Hidalgo. It is free and can be downloaded from the UCI Machine Learning Repository.

3. Let's load the data and important libraries in Python:

First of all, we will load all the necessary libraries that we are going to use while training the algorithm for spam/ham detection.
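As a minimal sketch (assuming pandas for loading the data and NLTK for the text processing), the imports could look like this:

```python
import re
import random

import pandas as pd
import nltk
from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# One-time downloads of the NLTK resources used below
nltk.download('punkt')
nltk.download('stopwords')
```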

The layout of the dataset is simple. It includes two columns, one for the "spam/ham" label and another for the message text. It contains 5572 records in total: 747 spam and 4825 ham messages.
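Assuming the raw UCI file is saved locally as SMSSpamCollection (it is tab-separated, label first; the file and column names here are my own choices), loading and inspecting it might look like this:

```python
import pandas as pd

# The UCI file has one record per line: <label>\t<message text>
df = pd.read_csv('SMSSpamCollection', sep='\t', header=None,
                 names=['label', 'message'])

print(df.shape)                    # (5572, 2)
print(df['label'].value_counts())  # ham: 4825, spam: 747
```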

4. Data Preprocessing:

Preprocessing is the task of carrying out preparation steps on the raw text corpus so that a task involving text mining or natural language processing can be completed successfully.

In line with our dataset, text preprocessing includes the following steps, all of which we will wrap into a single function (a sketch of that function follows the list):

· Convert the sentence to lowercase: Converting all text characters into a common case, such as lowercase, avoids treating two words as different just because one is capitalized and the other is not.

· Tokenize into words and stem: Both stemming and lemmatization reduce inflected forms to normalize terms that share the same root. The difference is that lemmatization performs this reduction while considering the meaning of the term, whereas stemming does not.

· Remove stop words: Some terms are needed to form a grammatical sentence in a language (here English) but add little to the meaning of the text. The Natural Language Toolkit (NLTK) library in Python ships with common stop-word lists for several languages.

· Join the words back into a sentence: After the above steps are performed, we join the tokens back together to form a single string.
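Here is a minimal sketch of such a function, using NLTK's PorterStemmer and English stop-word list (the function and column names are my own):

```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess(message):
    # Keep letters only, then convert everything to lowercase
    text = re.sub('[^a-zA-Z]', ' ', message).lower()
    # Tokenize the sentence into individual words
    tokens = word_tokenize(text)
    # Drop stop words and stem the remaining tokens
    tokens = [stemmer.stem(w) for w in tokens if w not in stop_words]
    # Join the cleaned tokens back into a single sentence
    return ' '.join(tokens)

df['processed'] = df['message'].apply(preprocess)
```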

5. Feature creation:

After the preprocessing step, let's create a single list of all the feature words. For that we will write a function. Duplicate words could be removed with a method like the TF-IDF score, but here we have used the frequency-distribution method to eliminate duplicates.

After performing the above steps, we get 8K+ features.
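One way to build that list is with NLTK's FreqDist, whose keys give one entry per unique word (the variable names here are my own):

```python
from nltk import FreqDist

all_words = []
for msg in df['processed']:
    all_words.extend(msg.split())

# FreqDist stores each unique word once, together with its frequency,
# so its keys double as a de-duplicated feature list
freq_dist = FreqDist(all_words)
feature_words = list(freq_dist.keys())
print(len(feature_words))  # 8K+ unique feature words
```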

6. Create Feature Map:

We have created a bag-of-words representation from scratch, without using scikit-learn's CountVectorizer() function, and we use a binary representation instead of word counts: in this bag-of-words table, '1' means the word is present in the document and '0' means it is absent. (The same representation can be obtained from CountVectorizer() by setting its 'binary' parameter to True.)
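A sketch of that binary feature map, in the dictionary form that nltk.NaiveBayesClassifier expects (reusing the assumed names from the sketches above):

```python
def make_feature_map(message, feature_words):
    words = set(message.split())
    # Binary bag-of-words: True (1) if the word occurs in the message, False (0) otherwise
    return {word: (word in words) for word in feature_words}

# Pair each message's feature dict with its spam/ham label
labeled_features = [
    (make_feature_map(msg, feature_words), label)
    for msg, label in zip(df['processed'], df['label'])
]
```

With 8K+ features per message this table gets large; in practice the feature list is often trimmed to the most frequent words to keep it manageable.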

7. Training and Testing classifier:

Selecting the type of classifier is the next step. Usually we pick multiple candidate classifiers and compare their test-set results before settling on the final model for the dataset. Since our representation treats the features as independent of each other, and the Naive Bayes classification algorithm works well on exactly this kind of data, it is a natural choice here.

We made an 80–20 split of train and test data.

We trained our model using nltk.NaiveBayesClassifier, and we then evaluate the model on the test data.
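A sketch of the split and training steps (shuffling first, since the raw file is not guaranteed to be in random order; this is not necessarily the author's exact code):

```python
import random
import nltk

random.seed(42)
random.shuffle(labeled_features)

# 80-20 train/test split
split = int(0.8 * len(labeled_features))
train_set, test_set = labeled_features[:split], labeled_features[split:]

classifier = nltk.NaiveBayesClassifier.train(train_set)

print('Train accuracy:', nltk.classify.accuracy(classifier, train_set))
print('Test accuracy:',  nltk.classify.accuracy(classifier, test_set))
```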

Now it's time to test the model.

We got an accuracy of 99% on the training set and 98% on the test data.

Let's test our model with an out-of-the-box message of our own, one that does not come from the corpus we used for either training or testing.
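For example (the message text below is just an illustration, not from the dataset):

```python
custom = "Congratulations! You have won a free lottery prize. Call now to claim!"
feature_map = make_feature_map(preprocess(custom), feature_words)
print(classifier.classify(feature_map))  # a message like this should come out as 'spam'
```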

Yes!! The model classifies the text correctly.
