My goal in this section is to build a sentiment analysis model using a new classification method: Naive Bayes. This technique is particularly useful as it is simple to implement, fast to train, and provides a strong baseline for text classification tasks.

Unlike logistic regression, which finds a separating line between classes, Naive Bayes works by calculating the probability of a tweet belonging to the positive or negative class and then selecting the class with the highest probability.

1. The Foundation: Bayes' Rule

At the heart of this method is Bayes' Rule, which updates our belief based on new evidence. It relies on conditional probabilities.

1.1. Conditional Probability

A conditional probability answers the question: "What is the probability of event A happening, given that event B has already happened?" This is written as $P(A\mid B)$.

For example, what is the probability that a tweet is positive, given that it contains the word "happy"? We can visualize this by looking at the intersection of events. The conditional probability reduces our sample space from all tweets to only the tweets containing the word "happy".

(Figure: conditional probability as the intersection of events, restricted to tweets containing "happy")

The formula for conditional probability is:

$$ P(A\mid B) = \frac{P(A \cap B)}{P(B)} $$
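
As a quick illustration with made-up counts: suppose 35 tweets in a corpus contain the word "happy", and 28 of those are also positive. Restricting the sample space to the "happy" tweets gives

$$ P(\text{pos} \mid \text{"happy"}) = \frac{P(\text{pos} \cap \text{"happy"})}{P(\text{"happy"})} = \frac{28}{35} = 0.8 $$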

1.2. Bayes' Rule

Bayes' Rule is derived from the conditional probability formula and allows us to "flip" the condition: since $P(A \cap B) = P(A\mid B)\,P(B)$ and, symmetrically, $P(A \cap B) = P(B\mid A)\,P(A)$, equating the two and dividing by $P(B)$ gives:

$$ P(A\mid B) = \frac{P(B\mid A)\,P(A)}{P(B)} $$

This is a powerful tool because it's often easier to calculate $P(B\mid A)$ than $P(A\mid B)$.

2. Naive Bayes for Sentiment Analysis

To classify a tweet, we want to determine whether it is more likely to be positive or negative. We can frame this using Bayes' Rule by comparing $P(\text{pos} \mid \text{tweet})$ and $P(\text{neg} \mid \text{tweet})$.

Let's focus on the positive case. According to Bayes' Rule:

$$ P(\text{pos} \mid \text{tweet}) = \frac{P(\text{tweet} \mid \text{pos})\, P(\text{pos})}{P(\text{tweet})} $$

When comparing the positive and negative classes for the same tweet, the denominator $P(\text{tweet})$ is the same for both. Therefore, we can ignore it and just compare the numerators:

$$ P(\text{class} \mid \text{tweet}) \propto P(\text{tweet} \mid \text{class})\, P(\text{class}) $$

2.1. The "Naive" Assumption

Calculating the likelihood $P(\text{tweet} \mid \text{class})$ directly is difficult. A tweet is a sequence of words, $(w_1, w_2, \ldots, w_n)$, and estimating the probability of that exact sequence is impractical: most word sequences never appear in the training data.

Here is where the "naive" assumption comes in:

We assume that every word in the tweet is conditionally independent of every other word, given the class (positive or negative).

This means that for a positive tweet, the presence of the word "happy" does not make the presence of the word "great" any more or less likely. This is not entirely true in language, but it's a simplifying assumption that works surprisingly well in practice.

With this assumption, the likelihood calculation becomes much simpler:

$$ P(\text{tweet} \mid \text{class}) = \prod_{i=1}^{n} P(w_i \mid \text{class}) $$

This means we can just multiply the probabilities of each individual word.
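
For instance, for a hypothetical already-preprocessed tweet consisting of the words "happy", "great", "day", the positive-class likelihood factorizes as

$$ P(\text{"happy great day"} \mid \text{pos}) = P(\text{"happy"} \mid \text{pos}) \cdot P(\text{"great"} \mid \text{pos}) \cdot P(\text{"day"} \mid \text{pos}) $$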

3. Training the Naive Bayes Model

Training the model involves calculating two sets of probabilities from the training data: the prior probabilities and the word likelihoods.

3.1. Calculating Priors

The prior probability, $P(\text{class})$, is the frequency of each class in the training set.

(Figure: example corpus of positive and negative tweets)
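
As a rough sketch of this step, assuming for illustration that the training data is a list of (tokens, label) pairs with labels "pos" and "neg" (this representation is my choice, not something fixed by the text):

```python
from collections import Counter

def compute_priors(labeled_tweets):
    """Estimate P(class) as the relative frequency of each class label."""
    counts = Counter(label for _, label in labeled_tweets)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Tiny, made-up corpus of already tokenized tweets, just for illustration.
corpus = [
    (["i", "am", "happy"], "pos"),
    (["great", "day"], "pos"),
    (["i", "am", "sad"], "neg"),
    (["awful", "day"], "neg"),
]
priors = compute_priors(corpus)   # {'pos': 0.5, 'neg': 0.5}
```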

3.2. Calculating Word Likelihoods

The likelihood, $P(w \mid \text{class})$, is the probability of a word appearing given a class. The most direct way to calculate this is by using word frequencies, similar to the method in logistic regression.

First, we create a frequency dictionary for all words in our vocabulary, counting their occurrences in positive and negative tweets.

(Figure: frequency dictionary of word counts per class)
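
A minimal sketch of such a frequency dictionary, keyed by (word, class) pairs and reusing the toy corpus from the previous snippet:

```python
from collections import defaultdict

def build_freqs(labeled_tweets):
    """Count how many times each (word, class) pair occurs in the training set."""
    freqs = defaultdict(int)
    for tokens, label in labeled_tweets:
        for word in tokens:
            freqs[(word, label)] += 1
    return dict(freqs)

freqs = build_freqs(corpus)       # e.g. freqs[("happy", "pos")] == 1
```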

3.3. The Problem of Zero Probabilities

A major issue arises when a word in a tweet we want to classify was not present in the training data for a given class. For example, if the word "elated" never appeared in a negative tweet in our training set, its frequency freq("elated", neg) would be 0.

When we calculate the conditional probability, this leads to a zero value:

$$ P(w \mid \text{class}) = \frac{\text{freq}(w, \text{class})}{N_{\text{class}}} \quad\Longrightarrow\quad P(\text{"elated"} \mid \text{neg}) = \frac{0}{N_{\text{neg}}} = 0 $$

When we calculate the overall score for the tweet by multiplying the probabilities of all its words, this single zero will cause the entire score for that class to become zero. This incorrectly biases our classification and loses valuable information from other words in the tweet.

3.4. The Solution: Laplacian (Additive) Smoothing

To prevent this, we use Laplacian smoothing. The core idea is to add a small smoothing parameter $\alpha$ (usually 1) to the frequency count of each word. This is why it's also called "add-one smoothing" when $\alpha=1$.

However, just adding to the numerator would mean the probabilities for a class no longer sum to 1. To re-normalize, we must also adjust the denominator. We add $\alpha \times V$, where $V$ is the total number of unique words in our entire vocabulary.

The updated formula for the likelihood becomes:

$$ P(w \mid \text{class}) = \frac{\text{freq}(w, \text{class}) + \alpha}{N_{\text{class}} + \alpha \cdot V} $$

Where:

- $\text{freq}(w, \text{class})$ is the number of times word $w$ appears in tweets of that class,
- $N_{\text{class}}$ is the total number of words in all tweets of that class,
- $V$ is the number of unique words in the vocabulary,
- $\alpha$ is the smoothing parameter (usually 1).

This ensures that no word will have a zero probability, providing a more robust model.
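
Continuing the same toy sketch, the smoothed likelihoods can be computed directly from the frequency dictionary (the helper names are mine, not from the text):

```python
def smoothed_likelihoods(freqs, alpha=1.0):
    """P(w | class) = (freq(w, class) + alpha) / (N_class + alpha * V)."""
    vocab = {word for word, _ in freqs}                    # unique words, size V
    classes = {label for _, label in freqs}
    n_class = {c: sum(n for (_, label), n in freqs.items() if label == c)
               for c in classes}                           # total words per class
    V = len(vocab)
    return {(w, c): (freqs.get((w, c), 0) + alpha) / (n_class[c] + alpha * V)
            for w in vocab for c in classes}

likelihoods = smoothed_likelihoods(freqs)  # every (word, class) pair is now > 0
```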

3.5. Example Calculation

Let's apply this to our dataset. First, we calculate the smoothed probabilities for each word in the positive class:

(Figure: Laplacian-smoothed word probabilities for the positive class)

Next, we do the same for the negative class:

(Figure: Laplacian-smoothed word probabilities for the negative class)

4. Inference: Classifying a New Tweet

To classify a new, unseen tweet, we perform the following steps:

  1. Preprocess the tweet (tokenize, remove stopwords, etc.); a minimal sketch follows this list.
  2. For each class (positive and negative), calculate a "score".
  3. The class with the higher score wins.
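
For step 1, here is a minimal, self-contained preprocessing sketch; the exact tokenizer, stopword list, and cleanup rules are assumptions chosen for illustration, not a prescribed pipeline:

```python
import re

# Deliberately tiny stopword list for illustration; a real pipeline would use a
# fuller list (and typically stemming as well).
STOPWORDS = {"i", "am", "is", "are", "the", "a", "an", "and", "to", "of", "it"}

def preprocess(tweet):
    """Lowercase, strip URLs/handles/hash signs, tokenize, and drop stopwords."""
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+|@\w+|#", "", tweet)   # URLs, @handles, '#' signs
    tokens = re.findall(r"[a-z']+", tweet)
    return [t for t in tokens if t not in STOPWORDS]

preprocess("I am SO happy today! :) http://example.com")  # ['so', 'happy', 'today']
```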

4.1. The Word Sentiment Ratio

Before building the full classification formula, it's useful to look at the sentiment contribution of individual words. We can do this by calculating the ratio of a word's probability in the positive class versus the negative class.

$$ \text{ratio}(w) = \frac{P(w \mid \text{pos})}{P(w \mid \text{neg})} $$

This ratio gives us a single number that represents the sentiment intensity of a word: a ratio greater than 1 indicates a positive-leaning word, a ratio close to 1 a neutral word, and a ratio below 1 a negative-leaning word.

(Table: example word sentiment ratios)

4.2. Full Classification with Log Probabilities

While the ratio is great for interpreting individual words, for classifying a full tweet we use log probabilities to avoid underflow. The classification rule is:

$$ \log P(\text{pos}) + \sum_{i=1}^{n} \log P(w_i \mid \text{pos})\;\;\mathop{\gtrless}_{\text{neg}}^{\text{pos}}\;\; \log P(\text{neg}) + \sum_{i=1}^{n} \log P(w_i \mid \text{neg}) $$

The term $\log P(w_i \mid \text{class})$ is a word's log-likelihood. The final score for a class is the sum of the log prior and all word log-likelihoods.

(Figure: how word frequencies in each class affect the final scores)

By comparing the final scores, we determine the most probable sentiment.
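
Putting the pieces together, here is a sketch of this scoring rule on top of the earlier toy `priors` and `likelihoods`; words outside the training vocabulary are simply skipped, which is one possible choice rather than the only one:

```python
import math

def class_score(tokens, label, priors, likelihoods):
    """log P(class) + sum of log P(w | class) over words known to the model."""
    score = math.log(priors[label])
    for word in tokens:
        if (word, label) in likelihoods:
            score += math.log(likelihoods[(word, label)])
    return score

def classify(tokens, priors, likelihoods):
    """Return the class whose log score is highest."""
    return max(priors, key=lambda c: class_score(tokens, c, priors, likelihoods))

classify(["happy", "day"], priors, likelihoods)   # 'pos' on the toy corpus
```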

4.3. Log-likelihood: Visual Intuition

(Figure: log prior and log likelihood contributions to the score)

4.4. Lambda (Word Log-ratio)

We define the word contribution as the log-ratio:

$$ \lambda(w) = \log \frac{P(w \mid \text{pos})}{P(w \mid \text{neg})} = \log P(w \mid \text{pos}) - \log P(w \mid \text{neg}) $$

The score of a whole tweet is the sum over its words:

$$ \Lambda(\text{tweet}) = \sum_{i=1}^n \lambda(w_i) $$
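
A small sketch of these two quantities, again in terms of the toy `likelihoods` dictionary; the labels "pos"/"neg" and the skip-unknown-words choice are assumptions:

```python
import math

def lambda_word(word, likelihoods):
    """lambda(w) = log P(w | pos) - log P(w | neg)."""
    return math.log(likelihoods[(word, "pos")]) - math.log(likelihoods[(word, "neg")])

def tweet_score(tokens, likelihoods):
    """Capital lambda: sum of lambda(w) over the tweet's in-vocabulary words."""
    vocab = {w for w, _ in likelihoods}
    return sum(lambda_word(w, likelihoods) for w in tokens if w in vocab)

tweet_score(["happy", "day"], likelihoods)   # > 0 on the toy corpus
```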

4.5. Log-likelihood Decision Rule

Using the tweet score, the decision rule is:

$$ \log \frac{P(\text{pos})}{P(\text{neg})} + \Lambda(\text{tweet}) > 0 \;\Rightarrow\; \text{positive}, \quad \text{otherwise negative.} $$
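
Expressed in code on top of the previous sketch (note that the log prior term vanishes when the classes are balanced, as in the toy corpus):

```python
import math

def predict(tokens, priors, likelihoods):
    """Positive if the log prior ratio plus the tweet score clears the 0 threshold."""
    log_prior = math.log(priors["pos"]) - math.log(priors["neg"])
    return "pos" if log_prior + tweet_score(tokens, likelihoods) > 0 else "neg"

predict(["happy", "day"], priors, likelihoods)   # 'pos'
```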

4.6. Recap

With Naive Bayes in log space, the decision threshold is 0. We measure sentiment by the log-likelihood ratio, i.e. the sum of the word lambdas (plus the log prior when the classes are imbalanced). A positive total favors the positive class; a negative total favors the negative class.

(Figure: sentiment scale from negative to positive, centered at 0)

5. Evaluating Naive Bayes

Just like with logistic regression, we need to evaluate our model's performance on a test set. The most common metric is accuracy.

$$ \text{Accuracy} = \frac{\text{Number of correctly classified tweets}}{\text{Total number of tweets}} $$

We can also use a confusion matrix to get a more detailed view of our model's performance, showing us where the model is getting confused (e.g., misclassifying negative tweets as positive).
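
A sketch of both metrics, assuming a labeled test set in the same (tokens, label) format and the `predict` function from the earlier sketch:

```python
from collections import Counter

def evaluate(test_set, priors, likelihoods):
    """Return accuracy and a confusion matrix keyed by (true_label, predicted_label)."""
    confusion = Counter()
    correct = 0
    for tokens, true_label in test_set:
        predicted = predict(tokens, priors, likelihoods)
        confusion[(true_label, predicted)] += 1
        correct += (predicted == true_label)
    accuracy = correct / len(test_set)
    return accuracy, confusion
```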

6. Pros and Cons of Naive Bayes

Pros:

- Simple to implement: training reduces to counting class and word frequencies.
- Fast to train and fast at inference time.
- Provides a strong baseline for text classification tasks.

Cons:

- The conditional independence ("naive") assumption rarely holds in real language.
- Word order and context are ignored, so constructions like negation are easily missed.
- Words unseen during training carry no evidence and must be handled with smoothing.