BlogPost6: Natural Language Processing with Classification and Vector Spaces
Notes from a course on LinkedIn.
Supervised ML and Sentiment Analysis
Problem: Classify the sentiment of a tweet as positive or negative, e.g. “I am happy because I am learning NLP”.
Solution:
First, perform vocabulary and feature extraction to turn sentences into a vocabulary list:
S1: “I am happy because I am learning NLP”, S2: “I am upset” -> Vocabulary = [I, am, happy, because, learning, NLP, upset]. The vocabulary is a long vector of size V.
Turning a sentence into a sparse vector: “I am happy because I am learning NLP” -> [1, 1, 1, 1, 1, 1, 0], with a 1 for every vocabulary word that appears in the sentence. This vector gets very long as the corpus (and hence V) grows.
Next, apply positive and negative frequencies:
| Vocabulary | Positive Freq | Negative Freq |
| --- | --- | --- |
| I | 3 | 3 |
| am | 3 | 3 |
| happy | 2 | 0 |
| because | 1 | 0 |
| learning | 1 | 1 |
| NLP | 1 | 1 |
| upset | 0 | 2 |
| not | 0 | 1 |

Feature extraction: represent a tweet as X = [1, sum of positive frequencies, sum of negative frequencies] over the unique words it contains, where the leading 1 is a bias term.
- Applying this to sentence S3: “I am upset, I am not learning NLP” -> V3 = [1, 8, 11] (bias 1; positive sum 3+3+0+0+1+1 = 8; negative sum 3+3+2+1+1+1 = 11). A sketch of this extraction follows below.
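A minimal sketch of this feature extraction in Python (names like `extract_features` are illustrative, and the frequency table above is simply hard-coded as a `(word, class)` dictionary):

```python
# Frequencies copied from the table above: (word, class) -> count, with class 1 = positive, 0 = negative.
freqs = {
    ("i", 1): 3, ("i", 0): 3,
    ("am", 1): 3, ("am", 0): 3,
    ("happy", 1): 2, ("happy", 0): 0,
    ("because", 1): 1, ("because", 0): 0,
    ("learning", 1): 1, ("learning", 0): 1,
    ("nlp", 1): 1, ("nlp", 0): 1,
    ("upset", 1): 0, ("upset", 0): 2,
    ("not", 1): 0, ("not", 0): 1,
}

def extract_features(tokens, freqs):
    """Map a tokenized tweet to [bias, sum of positive freqs, sum of negative freqs]."""
    x = [1.0, 0.0, 0.0]                      # bias term
    for word in set(tokens):                 # unique words, as in the V3 example
        x[1] += freqs.get((word, 1), 0)      # positive-class counts
        x[2] += freqs.get((word, 0), 0)      # negative-class counts
    return x

s3 = "i am upset i am not learning nlp".split()
print(extract_features(s3, freqs))           # -> [1.0, 8.0, 11.0]
```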
Preprocessing: use stemming and stop-word removal to preprocess the sentence. When preprocessing, perform the following:
(1) Eliminate handles and URLs;
(2) Tokenize the sentence into words;
(3) Remove stop words like “and”, “is”, “a”, “on”;
(4) Stemming: convert every word to its stem (this can be folded into step 2);
(5) Convert all words to lower case.
After the data is preprocessed, we obtain the vocabulary list. This list of tokenized, stemmed, lower-cased words is used to build the positive and negative frequency dictionary/hash.
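A sketch of these five preprocessing steps, assuming NLTK is available (the tokenizer, stop-word list, and the `process_tweet` name are my choices, not necessarily the course's):

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

nltk.download("stopwords", quiet=True)

def process_tweet(tweet):
    """Return the tokenized, stop-word-free, stemmed, lower-cased words of a tweet."""
    tweet = re.sub(r"@\w+", "", tweet)                # (1) eliminate handles
    tweet = re.sub(r"https?://\S+", "", tweet)        # (1) eliminate URLs
    tokenizer = TweetTokenizer(preserve_case=False)   # (5) lower-case while tokenizing
    tokens = tokenizer.tokenize(tweet)                # (2) tokenize into words
    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    return [stemmer.stem(t)                           # (4) stem every remaining word
            for t in tokens
            if t not in stop_words                    # (3) remove stop words
            and t not in string.punctuation]

print(process_tweet("I am happy because I am learning NLP @someone https://example.com"))
```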
Apply logistic regression to classify the sentence (now a vector of shape (1, 3)) into one of two classes, positive or negative.
- Train a logistic regression model using the cross-entropy (log) loss as the cost function, the sigmoid as the logistic function, and gradient descent as the training algorithm (see the sketch below).
- End of Week 1.
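A minimal NumPy sketch of that training loop, assuming the standard cross-entropy gradient for logistic regression (the learning rate and iteration count are arbitrary placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, alpha=1e-3, num_iters=1000):
    """Batch gradient descent; X is (m, 3) rows of [bias, pos_freq, neg_freq], y is (m, 1) in {0, 1}."""
    m = X.shape[0]
    theta = np.zeros((X.shape[1], 1))
    for _ in range(num_iters):
        h = sigmoid(X @ theta)              # predicted probabilities
        grad = (X.T @ (h - y)) / m          # gradient of the cross-entropy loss
        theta -= alpha * grad               # gradient-descent step
    return theta

def predict_sentiment(x, theta):
    """Return 1 (positive) if sigmoid(x . theta) > 0.5, else 0 (negative)."""
    p = sigmoid(np.dot(x, theta)).item()    # scalar probability
    return 1 if p > 0.5 else 0
```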
Bayes’ Rule
Problem: Review what probabilities and conditional probabilities are and how they operate.
Conditional probabilities help us reduce the sample search space. For example, given that a specific event has already happened, i.e. we know the tweet contains the word “happy”, we can infer the probability that the tweet is positive from the probability of “happy” in the positive corpus.
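In symbols, this is just a standard statement of Bayes' rule applied to the example:

$$P(\text{positive} \mid \text{“happy”}) = \frac{P(\text{“happy”} \mid \text{positive})\, P(\text{positive})}{P(\text{“happy”})}$$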
Naïve Bayes for sentiment analysis is a supervised machine learning approach. It is called naive because it assumes that the features used for classification are all independent.
Example: start with the same tweet-classification task. First, build a word-frequency table per class (like the one above), then construct the conditional probability tables by normalizing each word's frequency by the total word count of its class (see the formula below).
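The entries of those tables are just the class-normalized frequencies, with $N_{\text{class}}$ the total word count of that class:

$$P(w \mid \text{class}) = \frac{\text{freq}(w, \text{class})}{N_{\text{class}}}, \qquad \text{class} \in \{\text{pos}, \text{neg}\}$$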
From the table, words like [I, am, learning, NLP] have the same or similar probabilities across the two classes. Meanwhile, words like [happy, sad, not] have significantly different probabilities; these are your power words, tending to express one sentiment or the other, and they carry a lot of weight in determining a tweet's sentiment. However, the word [because] appears only in the positive corpus. When this happens, its conditional probability in the other class is zero, you have no way of comparing the two corpora, and that becomes a problem for your calculations. To avoid this, you smooth the probability function with Laplacian smoothing.
Laplacian Smoothing:
Problem: For a sentence, a sequence of words one after another, the probability of a word given the previous one is the number of times the pair appears divided by the number of times the first word appeared. What if the two words never show up next to each other in the training corpus? Then the conditional probability is 0.
Applying Laplacian smoothing helps overcome this issue. It works by adding 1 to the numerator, and since there are V words in the vocabulary to normalize over, we add V to the denominator: P(w | class) = (freq(w, class) + 1) / (N_class + V).
Log likelihood: we learn the ratio of the positive and negative probabilities of each word. This is needed because the conditional probabilities of words in the corpus can be very small, which may cause numerical underflow to zero: as the number of words in a tweet grows, the product of their probabilities gets very close to 0. Using the ratio of the two conditional probabilities, the higher the ratio, the more positive the word.
This matters most when the conditional probability of a word is very small. Having the λ dictionary, λ(w) = log(P(w | pos) / P(w | neg)), helps a lot when doing inference.
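As a worked example from the frequency table above (total positive count $N_{\text{pos}} = 11$, total negative count $N_{\text{neg}} = 11$, vocabulary size $V = 8$), the smoothed probabilities and log-ratio for “because” are:

$$P(\text{because} \mid \text{pos}) = \frac{1 + 1}{11 + 8} = \frac{2}{19}, \quad P(\text{because} \mid \text{neg}) = \frac{0 + 1}{11 + 8} = \frac{1}{19}, \quad \lambda(\text{because}) = \log\frac{2/19}{1/19} = \log 2 \approx 0.69$$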
By using the log likelihood, we convert a product of many conditional probabilities into a sum of many log likelihoods. This avoids underflow to zero when performing the prediction. A positive prediction is inferred when the sum of log likelihoods (plus the log prior) is positive, and the other way around for a negative prediction.
Train a Naïve Bayes model, following the 3 preprocessing steps from last week:
Step 4: Compute freq(w, class) as the foundation of the conditional probabilities, then get the conditional probability of each word for each class: P(w|pos), P(w|neg).
Step 5: Apply Laplacian smoothing to those probabilities and compute λ(w).
Step 6: Compute the log_prior, log(P(pos) / P(neg)), of the log likelihood.
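A compact sketch of steps 4 to 6, reusing the illustrative `freqs` dictionary from the feature-extraction sketch above (function names are mine, not the course's):

```python
import numpy as np

def train_naive_bayes(freqs, labels):
    """Return (log_prior, lambda_dict) from a (word, class) -> count dictionary and tweet labels."""
    vocab = {word for word, _ in freqs}
    V = len(vocab)
    n_pos = sum(c for (w, cls), c in freqs.items() if cls == 1)   # total positive word count
    n_neg = sum(c for (w, cls), c in freqs.items() if cls == 0)   # total negative word count

    # Steps 4-5: Laplacian-smoothed conditional probabilities and lambda(w)
    lambda_dict = {}
    for word in vocab:
        p_w_pos = (freqs.get((word, 1), 0) + 1) / (n_pos + V)
        p_w_neg = (freqs.get((word, 0), 0) + 1) / (n_neg + V)
        lambda_dict[word] = np.log(p_w_pos / p_w_neg)

    # Step 6: log prior = log(number of positive tweets / number of negative tweets)
    d_pos = sum(1 for y in labels if y == 1)
    d_neg = sum(1 for y in labels if y == 0)
    log_prior = np.log(d_pos / d_neg)
    return log_prior, lambda_dict

def naive_bayes_predict(tokens, log_prior, lambda_dict):
    """Predict positive when log prior + sum of the words' lambdas is greater than 0."""
    score = log_prior + sum(lambda_dict.get(w, 0.0) for w in tokens)
    return score > 0
```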
Using Confidence Ellipses to interpret Naïve Bayes
A confidence ellipse is a way to visualize a 2D random variable. It is better than plotting the points on a Cartesian plane because, with big datasets, the points can overlap badly and hide the real distribution of the data. Confidence ellipses summarize the information of the dataset with only four parameters:
Center: the numerical mean of the attributes.
Height and width: related to the variance of each attribute; the user must specify the desired number of standard deviations used to plot the ellipse.
Angle: related to the covariance among the attributes.
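A sketch of how such an ellipse can be drawn with matplotlib from a dataset's mean and covariance (`n_std` is the user-specified number of standard deviations; the cluster data is made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

def confidence_ellipse(x, y, ax, n_std=2.0, **kwargs):
    """Add an n_std confidence ellipse for the 2D data (x, y) to the axes ax."""
    cov = np.cov(x, y)                                       # covariance -> width, height, angle
    vals, vecs = np.linalg.eigh(cov)                         # eigen-decomposition (ascending eigenvalues)
    angle = np.degrees(np.arctan2(vecs[1, 1], vecs[0, 1]))   # orientation of the largest eigenvector
    width, height = 2 * n_std * np.sqrt(vals[::-1])          # axis lengths, major axis first
    center = (np.mean(x), np.mean(y))                        # numerical mean of the attributes
    ax.add_patch(Ellipse(center, width, height, angle=angle, fill=False, **kwargs))

# Example: two noisy 2D clusters summarized by their confidence ellipses
rng = np.random.default_rng(0)
pos = rng.multivariate_normal([2, 2], [[1.0, 0.8], [0.8, 1.0]], 500)
neg = rng.multivariate_normal([-2, -2], [[1.0, -0.5], [-0.5, 1.0]], 500)
fig, ax = plt.subplots()
ax.scatter(*pos.T, s=2, alpha=0.3)
ax.scatter(*neg.T, s=2, alpha=0.3)
confidence_ellipse(*pos.T, ax, edgecolor="tab:blue")
confidence_ellipse(*neg.T, ax, edgecolor="tab:orange")
plt.show()
```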
Applications of the Naïve Bayes rule:
Sentiment analysis
Author identification: instead of positive/negative classification, we can classify a sentence as written by author A or author B.
Information retrieval: given a query (a set of keywords), calculate the likelihood of the query given each document.
Word disambiguation: does “bank” refer to a river bank or a financial bank?