My goal in this section is to build a sentiment analysis model using Logistic Regression. This model will classify tweets as either positive or negative.
The core idea is to train a model using labeled data. In our case, the features $X$ are the tweets, and the labels $Y$ are their sentiments (1 for positive, 0 for negative).
The training process is an iterative loop:

1. Make predictions $\hat{Y}$ on the training tweets using the current parameters $\theta$.
2. Compare the predictions with the true labels $Y$ via a cost function.
3. Update $\theta$ to reduce the cost, and repeat until the cost converges.
A model can't understand raw text. We need to convert each tweet into a numerical vector.
The most straightforward way is to create a feature vector for each tweet based on a vocabulary $V$ (the set of all unique words across all tweets, our corpus).
For a single tweet, the feature vector $x$ would have the size of $V$. Each element of $x$ is 1 if the corresponding word from $V$ is in the tweet, and 0 otherwise. This can be expressed as:
$$x_i = \begin{cases} 1 & \text{if word } i \in \text{tweet} \\ 0 & \text{otherwise} \end{cases}$$

This is called a sparse representation because, for a large vocabulary, the vector will be mostly zeros.
Problem: If $V$ contains 10,000 words, our model needs to learn 10,001 parameters ($n+1$, where $n$ is the size of $V$ and $+1$ is for the bias term). This can be computationally very expensive.
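As a quick sketch of the sparse representation (with a tiny made-up vocabulary, since a real $V$ would have thousands of entries):

```python
# Tiny made-up vocabulary; a real corpus would yield thousands of words.
vocab = ["i", "am", "happy", "because", "learning", "nlp", "sad", "not"]

def sparse_vector(tweet_tokens, vocab):
    """Return a |V|-sized 0/1 vector: 1 if the vocab word occurs in the tweet."""
    return [1 if word in tweet_tokens else 0 for word in vocab]

print(sparse_vector(["i", "am", "happy"], vocab))
# [1, 1, 1, 0, 0, 0, 0, 0] -- mostly zeros, hence "sparse"
```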
To avoid large vectors, we can engineer more meaningful features. Instead of a huge sparse vector, we can represent each tweet with a dense vector of only 3 features.
First, we pre-calculate a frequency map (`freqs`) for every word in our vocabulary $V$. This map stores how many times each word appears in positive tweets versus negative tweets.
Now, for any given tweet $m$, we can build its feature vector $x_m$ as follows:

1. Bias term: the first element is always 1.
2. Sum of positive frequencies: for every word in the tweet, add up its count in positive tweets from `freqs`.
3. Sum of negative frequencies: for every word in the tweet, add up its count in negative tweets from `freqs`.

This transforms a long, sparse vector into a very small, dense vector like `[1, 8, 11]`, which is much more efficient for training.
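A minimal sketch of this feature extraction, assuming `freqs` is a dictionary mapping `(word, label)` pairs to counts (label 1 for positive, 0 for negative); the counts below are hypothetical, chosen only to reproduce the `[1, 8, 11]` example:

```python
def extract_features(tweet_tokens, freqs):
    """Return [bias, sum of positive counts, sum of negative counts]."""
    x = [1.0, 0.0, 0.0]                      # x[0] is the bias term
    for word in tweet_tokens:
        x[1] += freqs.get((word, 1), 0)      # frequency in positive tweets
        x[2] += freqs.get((word, 0), 0)      # frequency in negative tweets
    return x

# Hypothetical counts for illustration only.
freqs = {("i", 1): 4, ("i", 0): 4,
         ("am", 1): 3, ("am", 0): 3,
         ("sad", 1): 1, ("sad", 0): 4}
print(extract_features(["i", "am", "sad"], freqs))   # [1.0, 8.0, 11.0]
```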
To get meaningful features, we must clean the raw text first. The goal is to reduce noise and standardize the words.
1. Remove handles (`@user`), URLs, and retweet markers (`RT`).
2. Remove stop words (`a`, `the`, `is`) and punctuation that don't add meaning. Note: this step is context-dependent. For sentiment analysis, emoticons like `:)` are valuable and should be kept.
3. Stemming: reduce words to their root form (`learning`, `learned` -> `learn`). This helps group related words, reducing the vocabulary size.
4. Lowercasing: treat `Great` and `great` as the same token.

After these steps, a raw tweet is transformed into a clean list of tokens, ready for feature extraction.
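One possible cleaning pipeline using NLTK; the regexes, the English stop-word list, and the Porter stemmer are reasonable defaults rather than the only choice:

```python
import re
import string
from nltk.corpus import stopwords          # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

def process_tweet(tweet):
    """Return a clean list of stemmed tokens for one raw tweet."""
    tweet = re.sub(r"^RT[\s]+", "", tweet)         # drop retweet marker
    tweet = re.sub(r"https?://\S+", "", tweet)     # drop URLs
    # TweetTokenizer strips @handles and lowercases for us.
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tokens = tokenizer.tokenize(tweet)
    stop = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    # Keep multi-character tokens like :) (they carry sentiment), but
    # drop stop words and single punctuation characters.
    return [stemmer.stem(t) for t in tokens
            if t not in stop and t not in string.punctuation]

print(process_tweet("RT @user I am learning NLP :) https://example.com"))
# ['learn', 'nlp', ':)']
```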
The final step is to apply this process to our entire corpus of tweets. Each tweet is converted into its 3-feature vector. These vectors are then stacked together to form a single matrix $X$.

Each row in the matrix $X$ represents a tweet, and each column represents a feature. This matrix, along with the corresponding label vector $Y$, is what we'll use to train our logistic regression model.
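Putting the pieces together; this sketch assumes the `process_tweet`, `extract_features`, and `freqs` examples above, plus hypothetical lists `tweets` (raw strings) and `labels` (0/1 sentiments):

```python
import numpy as np

# Stack the per-tweet 3-feature vectors into one (m x 3) matrix.
X = np.array([extract_features(process_tweet(t), freqs) for t in tweets])
Y = np.array(labels)                  # length-m vector of 0/1 sentiments
print(X.shape, Y.shape)               # (m, 3) and (m,)
```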
After extracting features, the next step is to build a model that can classify a tweet as positive or negative. For this, we use Logistic Regression, a classification algorithm that predicts a probability.
The core of logistic regression is the hypothesis function, denoted as $h(x)$, which estimates the probability that the output is 1. In our case, it's the probability of a tweet being positive.
The hypothesis is defined using the sigmoid function, $g(z)$:
$$ h(x) = g(\theta^T x) $$

Where:

- $\theta$ is the vector of parameters (weights) the model learns.
- $x$ is the feature vector of the tweet.
- $g$ is the sigmoid function.
The sigmoid function is defined as:
$$ g(z) = \frac{1}{1 + e^{-z}} $$

This function always outputs a value between 0 and 1, which is perfect for representing a probability.
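In code, the sigmoid is a one-liner; note that $g(0) = 0.5$, which is exactly the threshold used below:

```python
import numpy as np

def sigmoid(z):
    """Map any real z (or array of values) into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))      # 0.5 -- the boundary case
print(sigmoid(4.92))   # ~0.993, used in the worked example later on
```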
To make a final classification (0 or 1), we need a threshold. By convention, we use 0.5:
Looking at the shape of the sigmoid, $g(z) \ge 0.5$ when its input $z \ge 0$. Since our input is $z = \theta^T x$, this means:
$$ \theta^T x \ge 0 \implies \text{Predict Positive} $$
$$ \theta^T x < 0 \implies \text{Predict Negative} $$

The line defined by $\theta^T x = 0$ is called the decision boundary. It's the line that separates the two predicted classes.
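A minimal prediction helper built on the `sigmoid` sketch above; checking $h(x) \ge 0.5$ is equivalent to checking which side of the decision boundary $\theta^T x = 0$ the tweet falls on:

```python
import numpy as np

def predict(x, theta):
    """x and theta are flat length-3 arrays; return 1 if h(x) >= 0.5, else 0."""
    z = np.dot(theta, x)                 # theta^T x, a plain scalar
    return 1 if sigmoid(z) >= 0.5 else 0
```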
The goal of training is to find the optimal parameters $\theta$ that minimize the difference between our predictions ($\hat{Y}$) and the actual labels ($Y$). This is achieved by minimizing a cost function using an optimization algorithm like Gradient Descent.
Let's assume we have already trained our model and found the optimal parameters $\theta$. Now, we can predict the sentiment of a new tweet.
Suppose that for this new tweet, the dot product $\theta^T x$ comes out to 4.92. Then:
$$ h(x) = g(4.92) = \frac{1}{1 + e^{-4.92}} \approx 0.993 $$

Since $0.993 \ge 0.5$, the model correctly predicts a positive sentiment.
To train a logistic regression classifier, we need to find the optimal parameters, $\theta$, that minimize the cost function, $J(\theta)$. This iterative process is carried out using the Gradient Descent algorithm.
Here's how the process works:

1. Initialize the parameters $\theta$ (for example, to all zeros).
2. Use the current $\theta$ to compute predictions $h(x)$ and the cost $J(\theta)$.
3. Update every parameter $\theta_j$ in the direction that reduces the cost:
$$ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) $$
where $\alpha$ is the learning rate, which controls the step size.
4. Repeat steps 2 and 3 until the cost converges, as sketched in code below.
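A minimal sketch of this loop, reusing the `sigmoid` function from earlier. The gradient $\frac{1}{m} X^T (h - Y)$ is the standard one for the cross-entropy cost; the learning rate and iteration count here are illustrative placeholders you would tune:

```python
import numpy as np

def gradient_descent(X, Y, alpha=1e-7, num_iters=1500):
    """Batch gradient descent for logistic regression (cross-entropy cost)."""
    m = X.shape[0]
    theta = np.zeros(X.shape[1])         # step 1: initialize parameters
    for _ in range(num_iters):
        h = sigmoid(X @ theta)           # step 2: predictions for all m tweets
        grad = X.T @ (h - Y) / m         # gradient of J(theta)
        theta -= alpha * grad            # step 3: the update rule above
    return theta
```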
Once the model is trained, we need to evaluate its performance on data it has never seen before. This process tells us how well our model generalizes to new, real-world examples. For this, we use a validation set (or test set) composed of a feature matrix $X_{val}$ and a label vector $Y_{val}$.
The first step is to use our trained parameters $\theta$ to make predictions on the validation data. We compute $h(x) = g(\theta^T x)$ for every tweet in $X_{val}$ and apply the 0.5 threshold. This yields a vector `y_pred`, populated with 0s and 1s, representing the model's final prediction for each tweet (0 for negative, 1 for positive).
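In vectorized form, assuming `X_val` is the (m x 3) validation feature matrix and `theta` is the parameter vector returned by `gradient_descent` above:

```python
h_val = sigmoid(X_val @ theta)            # probabilities in (0, 1)
y_pred = (h_val >= 0.5).astype(int)       # threshold at 0.5 -> 0s and 1s
```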
Now that we have a prediction vector `y_pred`, we can compare it to the true labels in $Y_{val}$ to calculate the model's accuracy.
Accuracy is the proportion of predictions that the model got correct.
To compute it, we perform an element-wise comparison between `y_pred` and the true label vector $Y_{val}$. This results in a vector of booleans (True/False) or integers (1/0), where a 1 indicates a correct prediction and a 0 indicates an error.

For example, if `y_pred = [0, 1, 1]` and `Y_val = [0, 0, 1]`, the comparison yields `[1, 0, 1]`.

Summing this vector and dividing by the total number of tweets $m$ gives the accuracy:

$$ \text{Accuracy} = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\left( y_{pred}^{(i)} = y_{val}^{(i)} \right) $$
For example, if we had 4 correct predictions out of 5 total tweets, the accuracy would be:
$$ \text{Accuracy} = \frac{4}{5} = 0.8 $$

This means the model has an 80% accuracy on the validation set.
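The whole computation is a one-liner with NumPy; here it is applied to the text's small 3-tweet example:

```python
import numpy as np

# Reusing the text's 3-tweet example: two of three predictions match.
y_pred = np.array([0, 1, 1])
Y_val = np.array([0, 0, 1])
matches = (y_pred == Y_val).astype(int)   # -> [1, 0, 1]
accuracy = matches.mean()                 # 2/3 here; 4/5 = 0.8 in the text
print(matches, accuracy)
```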
To ensure an unbiased evaluation, the dataset is typically split into three parts before training begins:

- Training set: used to learn the parameters $\theta$.
- Validation set: used to tune the model and measure how well it generalizes during development.
- Test set: held out entirely and used only once, for the final, unbiased performance estimate.

A common split is 80% for training, 10% for validation, and 10% for testing.
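One way to make such an 80/10/10 split, assuming the `X` and `Y` arrays built earlier; the random seed is an arbitrary choice for reproducibility:

```python
import numpy as np

rng = np.random.default_rng(42)           # fixed seed so the split is repeatable
idx = rng.permutation(len(X))             # shuffle before splitting
n_train, n_val = int(0.8 * len(X)), int(0.1 * len(X))
X_train, Y_train = X[idx[:n_train]], Y[idx[:n_train]]
X_val,   Y_val   = X[idx[n_train:n_train + n_val]], Y[idx[n_train:n_train + n_val]]
X_test,  Y_test  = X[idx[n_train + n_val:]], Y[idx[n_train + n_val:]]
```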