Text classification

Feb 2017

I always try to automate things I do more than two times. Recently I noticed that every time I add a new bookmark in my browser I carefully select to which directory it should choose. I thought that it should be done automatically. I started to delve into this topic. Starting from the beginning, there is vast area called natural language processing, which is concerned with the interaction between computers and human language. One of its sub-fields is text classification which aims to assign text documents into one or more predefined classes based on their contents. There are two classes of text classification techniques: supervised and unsupervised. In supervised methods, firstly it trains a model based on manually labeled training data which then can be used to assign a pre-defined class label to new data.


Unsupervised text classification addresses the problem of assigning classes to documents based solely on their content without a training set and pre-defined classes. Using text clustering together with topic modeling, it automatically discovers groups of similar documents. In this article, I’m going to walk through basic concepts of text classification and show simple but effective implementation.

Processing the input

Understanding human language is not straightforward. It is ambiguous and depends vastly on context. It contains neologisms, named entities and idioms. Also to understand human language, a computer should possess a common-sense knowledge to grasp the meaning of many natural language sentences. Therefore text document is represented as a feature vector, which is an n-dimensional vector of numerical features that represent some object. Most common representation is a bag of words, which is a collection of unique words that occur in a document with the number of times it occurs. It ignores the order of the words in the document - only the counts of words matters.

bag of words

Breaking down a text into discrete words is a process named tokenization. To improve accuracy, other processing steps are applied. Stop word removal is one of the most commonly used preprocessing steps. It removes stop words which are particularly common or semantically insignificant. In English, they could be at, which, and, so, a, an and much more. Stemming normalizes word forms into its root form (stem). It reduces the words working, worked, worker the root word - work. Therefore classifier groups verbs working and worked as one word with a single frequency. A stem need not be equal to the morphological root of the word (e.g, computer to comput) Lemmatization goes one step further than stemming. It obtains grammatically correct words and distinguishes words by their word sense with the use of a vocabulary (e.g., type can mean write or category). It is a much more difficult and expensive process than stemming. Case folding converts each word completely into lower case.


Naive Bayes Classifier

Naive Bayes classifier is probabilistic classifier which predicts a probability distribution over a set of classes. It is based on applying Bayes’ theorem which allows to compute the probability of an event given the fact that another event occurred. The formula applied to classification problem is stated as the following equation:

Bayes theorem

where C is the class and D is the document. It computes a probability of a class C given a document D when we know a probability of document D given a class C and independent probabilities of class C and document D. The probability of the document is constant, so it can be disregarded in our calculations. The probability of a class is the number of documents in training set that are assigned to this class divided by a total number of documents in training set. The one last missing piece is a probability of a document D given a class C. Under the naive assumption that words in a dataset are conditionally independent, it can be calculated as the product of the probabilities of the single words.

Bayes theorem

The different naive Bayes classifiers differ by the assumptions they make regarding the distribution of class-conditional probability. The Bernoulli distribution takes into account whether a word occurs or not. The multinomial distribution takes into account not only the presence of word but also its frequency. The class-conditional probability in the multinomial model can be written as:

conditional probability


  • count(w, C): a number of times word w appears in a sample of class C in the training data.
  • count(C): a total count of all words for class C.
  • V: a total count of different words in the training data.
  • α: an additive smoothing parameter which prevents zero probabilities (α=1 for Laplace smoothing).

Let's illustrate this concept with an example.

bayes example

So far, we have seen how to calculate a probability that the document belongs to specified class. Let’s take a brief look how classification works. The classification is divided into two phases: training and classifying. The former processes training data containing texts with pre-defined class. It breaks each text into preprocessed tokens and updates data necessary for explained equations. In a multinomial model, a classifier has to store total count of all words for each class, frequency of classes for each word, total count of documents labeled for each class and count of different words in the training data. The latter computes the probabilities for each trained class and classifies new data with the class of the maximum probability. While seemingly complex, the implementation is easy to grasp. Classifier stores counters needed for probability calculations while training phase simply increments them.

After the training phase, the classifier computes probabilities of each class for new data. The class with the highest probability is chosen as the final prediction.

Naive Bayes classifier is simple yet very efficient. It produces good results even with a naive assumption of independence between words. It is a commonly used in text classification because of its low complexity and reasonable accuracy. As a companion to this article, I created a gist that contains the full implementation of the multinomial naive Bayes classifier.