BlogPost6: Natural Language Processing with Classification and Vector Spaces
Notes from a course on LinkedIn.
Supervised ML and Sentiment Analysis
Problem: Classify the sentiment of a tweet as positive or negative, e.g. “I am happy because I am learning NLP”.
Solution:
First, perform vocabulary and feature extraction to turn sentences into a vocabulary list:
S1: “I am happy because I am learning NLP”, S2: “I am upset” -> Vocabulary = [I, am, happy, because, learning, NLP, upset]. The vocabulary is a long vector of size V.
Turning a sentence into a sparse vector: “I am happy because I am learning NLP” -> [1, 1, 1, 1, 1, 1, 0], with a 1 for every vocabulary word that appears in the sentence. This vector gets very long as the corpus (and hence V) grows.
Next, apply positive and negative frequencies:
| Vocabulary | Positive Freq | Negative Freq |
| --- | --- | --- |
| I | 3 | 3 |
| am | 3 | 3 |
| happy | 2 | 0 |
| because | 1 | 0 |
| learning | 1 | 1 |
| NLP | 1 | 1 |
| upset | 0 | 2 |
| not | 0 | 1 |

Feature extraction: represent a tweet as X = [1, sum of positive frequencies, sum of negative frequencies] over the unique words it contains, where the leading 1 is a bias term.
- Applying this to sentence S3: “I am upset, I am not learning NLP” -> V3 = [1, 8, 11] (bias 1; positive sum 3+3+0+0+1+1 = 8; negative sum 3+3+2+1+1+1 = 11). A sketch of this extraction follows below.
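A minimal sketch of this feature extraction in Python (names like `extract_features` are illustrative, and the frequency table above is simply hard-coded as a `(word, class)` dictionary):

```python
# Frequencies copied from the table above: (word, class) -> count, with class 1 = positive, 0 = negative.
freqs = {
    ("i", 1): 3, ("i", 0): 3,
    ("am", 1): 3, ("am", 0): 3,
    ("happy", 1): 2, ("happy", 0): 0,
    ("because", 1): 1, ("because", 0): 0,
    ("learning", 1): 1, ("learning", 0): 1,
    ("nlp", 1): 1, ("nlp", 0): 1,
    ("upset", 1): 0, ("upset", 0): 2,
    ("not", 1): 0, ("not", 0): 1,
}

def extract_features(tokens, freqs):
    """Map a tokenized tweet to [bias, sum of positive freqs, sum of negative freqs]."""
    x = [1.0, 0.0, 0.0]                      # bias term
    for word in set(tokens):                 # unique words, as in the V3 example
        x[1] += freqs.get((word, 1), 0)      # positive-class counts
        x[2] += freqs.get((word, 0), 0)      # negative-class counts
    return x

s3 = "i am upset i am not learning nlp".split()
print(extract_features(s3, freqs))           # -> [1.0, 8.0, 11.0]
```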
Preprocessing: use stemming and stop-word removal to preprocess the sentence. When preprocessing, perform the following:
(1) Eliminate handles and URLs;
(2) Tokenize the sentence into words;
(3) Remove stop words like “and”, “is”, “a”, “on”;
(4) Stemming: convert every word to its stem (this can be folded into step 2);
(5) Convert all words to lower case.
After the data is preprocessed, we obtain the vocabulary list. This list of tokenized, stemmed, lower-cased words is used to build the positive and negative frequency dictionary/hash.
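A sketch of these five preprocessing steps, assuming NLTK is available (the tokenizer, stop-word list, and the `process_tweet` name are my choices, not necessarily the course's):

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

nltk.download("stopwords", quiet=True)

def process_tweet(tweet):
    """Return the tokenized, stop-word-free, stemmed, lower-cased words of a tweet."""
    tweet = re.sub(r"@\w+", "", tweet)                # (1) eliminate handles
    tweet = re.sub(r"https?://\S+", "", tweet)        # (1) eliminate URLs
    tokenizer = TweetTokenizer(preserve_case=False)   # (5) lower-case while tokenizing
    tokens = tokenizer.tokenize(tweet)                # (2) tokenize into words
    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    return [stemmer.stem(t)                           # (4) stem every remaining word
            for t in tokens
            if t not in stop_words                    # (3) remove stop words
            and t not in string.punctuation]

print(process_tweet("I am happy because I am learning NLP @someone https://example.com"))
```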
Apply logistic regression to classify the sentence (now a vector of shape (1, 3)) into one of two classes, positive or negative.
- Train a logistic regression model using the cross-entropy (log) loss as the cost function, the sigmoid as the logistic function, and gradient descent as the training algorithm (see the sketch below).
- End of Week 1.
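A minimal NumPy sketch of that training loop, assuming the standard cross-entropy gradient for logistic regression (the learning rate and iteration count are arbitrary placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, alpha=1e-3, num_iters=1000):
    """Batch gradient descent; X is (m, 3) rows of [bias, pos_freq, neg_freq], y is (m, 1) in {0, 1}."""
    m = X.shape[0]
    theta = np.zeros((X.shape[1], 1))
    for _ in range(num_iters):
        h = sigmoid(X @ theta)              # predicted probabilities
        grad = (X.T @ (h - y)) / m          # gradient of the cross-entropy loss
        theta -= alpha * grad               # gradient-descent step
    return theta

def predict_sentiment(x, theta):
    """Return 1 (positive) if sigmoid(x . theta) > 0.5, else 0 (negative)."""
    p = sigmoid(np.dot(x, theta)).item()    # scalar probability
    return 1 if p > 0.5 else 0
```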
Bayes’ Rule
Problem: Review what probabilities and conditional probabilities are and how they operate.
Conditional probabilities help us reduce the sample search space. For example, given that a specific event has already happened, i.e. we know the tweet contains the word “happy”, we can infer the probability that the tweet is positive from the probability of “happy” in the positive corpus.
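In symbols, this is just a standard statement of Bayes' rule applied to the example:

$$P(\text{positive} \mid \text{“happy”}) = \frac{P(\text{“happy”} \mid \text{positive})\, P(\text{positive})}{P(\text{“happy”})}$$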
Naïve Bayes for sentiment analysis is a supervised machine learning approach. It is called naive because it assumes that the features used for classification are all independent.
Example: start with the same tweet-classification task. First, build a word-frequency table per class (like the one above), then construct the conditional probability tables by normalizing each word's frequency by the total word count of its class (see the formula below).
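The entries of those tables are just the class-normalized frequencies, with $N_{\text{class}}$ the total word count of that class:

$$P(w \mid \text{class}) = \frac{\text{freq}(w, \text{class})}{N_{\text{class}}}, \qquad \text{class} \in \{\text{pos}, \text{neg}\}$$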
From the table, words like [I, am, learning, NLP] have the same or similar probabilities across the two classes. Meanwhile, words like [happy, sad, not] have significantly different probabilities; these are your power words, tending to express one sentiment or the other, and they carry a lot of weight in determining a tweet's sentiment. However, the word [because] appears only in the positive corpus. When this happens, its conditional probability in the other class is zero, you have no way of comparing the two corpora, and that becomes a problem for your calculations. To avoid this, you smooth the probability function with Laplacian smoothing.
Laplacian Smoothing:
Problem: For a sentence, a sequence of words one after another, the probability of a word given the previous one is the number of times the pair appears divided by the number of times the first word appeared. What if the two words never show up next to each other in the training corpus? Then the conditional probability is 0.
Applying Laplacian smoothing helps overcome this issue. It works by adding 1 to the numerator, and since there are V words in the vocabulary to normalize over, we add V to the denominator: P(w | class) = (freq(w, class) + 1) / (N_class + V).
Log likelihood: we learn the ratio of the positive and negative probabilities of each word. This is needed because the conditional probabilities of words in the corpus can be very small, which may cause numerical underflow to zero: as the number of words in a tweet grows, the product of their probabilities gets very close to 0. Using the ratio of the two conditional probabilities, the higher the ratio, the more positive the word.
This matters most when the conditional probability of a word is very small. Having the λ dictionary, λ(w) = log(P(w | pos) / P(w | neg)), helps a lot when doing inference.
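As a worked example from the frequency table above (total positive count $N_{\text{pos}} = 11$, total negative count $N_{\text{neg}} = 11$, vocabulary size $V = 8$), the smoothed probabilities and log-ratio for “because” are:

$$P(\text{because} \mid \text{pos}) = \frac{1 + 1}{11 + 8} = \frac{2}{19}, \quad P(\text{because} \mid \text{neg}) = \frac{0 + 1}{11 + 8} = \frac{1}{19}, \quad \lambda(\text{because}) = \log\frac{2/19}{1/19} = \log 2 \approx 0.69$$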
By using the log likelihood, we convert a product of many conditional probabilities into a sum of many log likelihoods. This avoids underflow to zero when performing the prediction. A positive prediction is inferred when the sum of log likelihoods (plus the log prior) is positive, and the other way around for a negative prediction.
Train a Naïve Bayes model, following the 3 preprocessing steps from last week:
Step 4: Compute freq(w, class) as the foundation of the conditional probabilities, then get the conditional probability of each word for each class: P(w|pos), P(w|neg).
Step 5: Apply Laplacian smoothing to those probabilities and compute λ(w).
Step 6: Compute the log_prior, log(P(pos) / P(neg)), of the log likelihood.
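A compact sketch of steps 4 to 6, reusing the illustrative `freqs` dictionary from the feature-extraction sketch above (function names are mine, not the course's):

```python
import numpy as np

def train_naive_bayes(freqs, labels):
    """Return (log_prior, lambda_dict) from a (word, class) -> count dictionary and tweet labels."""
    vocab = {word for word, _ in freqs}
    V = len(vocab)
    n_pos = sum(c for (w, cls), c in freqs.items() if cls == 1)   # total positive word count
    n_neg = sum(c for (w, cls), c in freqs.items() if cls == 0)   # total negative word count

    # Steps 4-5: Laplacian-smoothed conditional probabilities and lambda(w)
    lambda_dict = {}
    for word in vocab:
        p_w_pos = (freqs.get((word, 1), 0) + 1) / (n_pos + V)
        p_w_neg = (freqs.get((word, 0), 0) + 1) / (n_neg + V)
        lambda_dict[word] = np.log(p_w_pos / p_w_neg)

    # Step 6: log prior = log(number of positive tweets / number of negative tweets)
    d_pos = sum(1 for y in labels if y == 1)
    d_neg = sum(1 for y in labels if y == 0)
    log_prior = np.log(d_pos / d_neg)
    return log_prior, lambda_dict

def naive_bayes_predict(tokens, log_prior, lambda_dict):
    """Predict positive when log prior + sum of the words' lambdas is greater than 0."""
    score = log_prior + sum(lambda_dict.get(w, 0.0) for w in tokens)
    return score > 0
```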
Using Confidence Ellipses to interpret Naïve Bayes
A confidence ellipse is a way to visualize a 2D random variable. It is better than plotting the points on a Cartesian plane because, with big datasets, the points can overlap badly and hide the real distribution of the data. Confidence ellipses summarize the information of the dataset with only four parameters:
Center: the numerical mean of the attributes.
Height and width: related to the variance of each attribute; the user must specify the desired number of standard deviations used to plot the ellipse.
Angle: related to the covariance among the attributes.
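A sketch of how such an ellipse can be drawn with matplotlib from a dataset's mean and covariance (`n_std` is the user-specified number of standard deviations; the cluster data is made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

def confidence_ellipse(x, y, ax, n_std=2.0, **kwargs):
    """Add an n_std confidence ellipse for the 2D data (x, y) to the axes ax."""
    cov = np.cov(x, y)                                       # covariance -> width, height, angle
    vals, vecs = np.linalg.eigh(cov)                         # eigen-decomposition (ascending eigenvalues)
    angle = np.degrees(np.arctan2(vecs[1, 1], vecs[0, 1]))   # orientation of the largest eigenvector
    width, height = 2 * n_std * np.sqrt(vals[::-1])          # axis lengths, major axis first
    center = (np.mean(x), np.mean(y))                        # numerical mean of the attributes
    ax.add_patch(Ellipse(center, width, height, angle=angle, fill=False, **kwargs))

# Example: two noisy 2D clusters summarized by their confidence ellipses
rng = np.random.default_rng(0)
pos = rng.multivariate_normal([2, 2], [[1.0, 0.8], [0.8, 1.0]], 500)
neg = rng.multivariate_normal([-2, -2], [[1.0, -0.5], [-0.5, 1.0]], 500)
fig, ax = plt.subplots()
ax.scatter(*pos.T, s=2, alpha=0.3)
ax.scatter(*neg.T, s=2, alpha=0.3)
confidence_ellipse(*pos.T, ax, edgecolor="tab:blue")
confidence_ellipse(*neg.T, ax, edgecolor="tab:orange")
plt.show()
```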
Applications of the Naïve Bayes rule:
Sentiment analysis
Author identification: instead of positive/negative classification, we can classify a sentence as written by author A or author B.
Information retrieval: given a query (a set of keywords), calculate the likelihood of the query given each document.
Word disambiguation: does “bank” refer to a river bank or a financial bank?