Monday, May 11, 2020

Text Classification with TF-IDF, Word Embedding and Naive Bayes

Introduction

This project introduces one of techniques of natural language processing: text classification. We shall explore step by step how to create new features through data analysis and do data cleaning. We also explain the conception of Term Frequency - Inverse Document Frequency and Word Embedding and how to implement them through the real data. 

The article will show you how to apply Naives Bayes on text data after the tokenization and vectorization. We will use one of Selection Best Feature techniques to chose only features that contribute to the performance of the prediction. After that, we will analyse and compare the evaluation metrics including accuracy , confusion matrix and classification report between baseline accuracy, TF-IDF + Bayes and Word Embedding + Bayes.

Bayesian Classifier

Naive Bayes is the most straightforward and fast classification algorithm, which is suitable for a large chunk of data. It uses Bayes theorem of probability for prediction of unknown class. It is simple and effective in answering questions such as "Given a particular term in the document, what is the likely chance (probability) that it belongs to the particular class?" 

The classification has two phases, a learning phase, and the evaluation phase. In the learning phase, classifier trains its model on a given dataset and in the evaluation phase, it tests the classifier performance. Performance is evaluated on the basis of various parameters such as accuracy, error, precision, and recall.

Pros: It is simple, easy and fast to train and predict small dataset.
Cons: It assumes every feature is independent, which isn’t always the case.

Term Frequency–Inverse Document Frequency (TF-IDF)

Machine doesn't understand the text. We have to transform text into sparse matrix or term-document matrix. The term-document matrix then is a two-dimensional matrix whose rows are the terms and columns are the documents/observations/samples, so each entry (i, j) represents the frequency of term i in document/observation/sample j.

For each entry in the matrix, the term frequency measures the number of times that term i appears in document j, and the inverse document frequency measures the number of documents in the corpus which contain term i. 

The tf-idf score is the product of these two metrics (tf*idf). So an entry's tf-idf score increases when term i appears frequently in document j, but decreases as the term appears in other documents. In another word, idf is a cross-document normalization, that puts less weight on common terms, and more weight on rare terms

Word Embedding

There are two main training algorithms for word2vec: Continuous Bag of Word (CBOW) and Skip-gram. The major difference between these two methods is that CBOW is using context to predict a target word while skip-gram is using a word to predict a target context.

Generally, the skip-gram method can have a better performance compared with CBOW method, for it can capture two semantics for a single word.

Source: Internet
The Word2Vec Skip-gram model, for example, takes in pairs of 2 words  generated by moving a window across text data, and trains a 1-hidden-layer neural network based on the synthetic task of given an input word, giving us a predicted probability distribution of nearby words to the input.

Feature selection

When we convert all of the texts in a dataset into word uni+bigram tokens, we may end up with tens of thousands of tokens. Not all of these tokens/features contribute to label prediction. So we can drop certain tokens, for instance those that occur extremely rarely across the dataset. We can also measure feature importance (how much each token contributes to label predictions), and only include the most informative tokens.

There are many statistical functions that take features and the corresponding labels and output the feature importance score. Two commonly used functions are f_classif and chi2.

Evaluation Metrics

Accuracy will yield misleading results if the data set is unbalanced; that is, when the numbers of observations in different classes vary greatly. Confusion Matrix and Classification Report will help us to evaluate the performance for each class.

Confusion Matrix

A confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an supervised machine learning algorithm.

Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class. The diagonal elements show the number of correct classifications for each class meanwhile the off-diagonal elements provides the mis-classifications for each class.

There are four ways to check if the predictions are right or wrong:

1) True Positive (TP): The class was positive and predicted positive.
2) True Negative (TN): The class was negative and predicted negative.
3) False Negative (FN): The class was positive but predicted negative
4) False Positive (FP) : The case was negative but predicted positive




Source: Wikipedia


Classification Report


The report shows the main classification metrics precision, recall and f1-score on a per-class basis. The metrics are calculated by using true and false positives, true and false negatives. Positive and negative in this case are generic names for the predicted classes.

Precision – What percent of your predictions were correct ?
Precision is the ability of a classifier not to label an instance positive that is actually negative. For each class it is defined as the ratio of true positives to the sum of true and false positives.

Precision = Accuracy of positive predictions. 

Precision = TP/(TP + FP)

Recall – What percent of the positive cases did you catch ?

Recall is the ability of a classifier to find all positive instances. For each class it is defined as the ratio of true positives to the sum of true positives (TP) and false negatives (FN).

Recall: Fraction of positives that were correctly identified.

Recall = TP/(TP+FN)

F1 score – What percent of positive predictions were correct?

The F1 score is a weighted harmonic mean of precision and recall such that the best score is 1.0 and the worst is 0.0. Generally speaking, F1 scores are lower than accuracy measures as they embed precision and recall into their computation. As a rule of thumb, the weighted average of F1 should be used to compare classifier models, not global accuracy.

F1 Score = 2*(Recall * Precision) / (Recall + Precision)

Case study: classify spam/ham messages 


Data Exploratory Analysis

Lets get the first look of data structure:















Dataset consists of 5572 records whose two columns are text and label without null data.

















From first 5 lines, we can see that text column is where we will generate vectorizer models and Label column is also called predicted variable.





There 2 unique values in Label column: Ham and Spam that indicate text message in Text column is Not Spam or Spam.

Baseline Accuracy


The baseline accuracy is important but often ignored in machine learning. It sets the benchmark in terms of minimum accuracy which the model should achieve. Baseline is calculated as the number of times majority class (ham) appear in the target variable (label column), divided by the number of total observations.


 In our case, baseline accuracy is 86%

The best way to understand the data and to engineer features is to raise questions and to answer them

1) What percentage of the documents are spam?






2) What is the longest message ?














3) What is the average length of documents (number of characters) for not spam and spam documents?











4) What is the average number of digits per document for not spam and spam documents?









Observation: In average, spam messages contain more digit characters than non-spam messages

5) What is the average number of non-word characters (anything other than a letter, digit or underscore) per document for not spam and spam documents?



Observation In average, spam messages contain more non-alphanumeric characters than non-spam messages

6) What are top 20 highest frequency words of spam and non-spam messages ?


































Data Cleaning


1) Conversion to Lower case: words like 'Go' and 'go' need to be considered as one word. Hence, these are converted to lowercase.
2) Remove non-alphanumeric characters
3) Remove punctuation
4) Remove stop words
5) Lemmatization: Unlike Stemming, reduces the inflected words properly ensuring that the root word belongs to the language. In Lemmatization root word is called Lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words. (nltk package) . Python NLTK provides WordNet Lemmatizer that uses the WordNet Database to lookup lemmas of words.








Data Preparation


1) Split  and shuffle dataset into train and test subsets with the ratio 70:30
2) Select predicted variables: len_text, digits, non_alpha_char, processed_text and target variable: label


test_size=0.2 means that size of testset (x_val, y_val) is 20% proportion of data and 80% data is trainset (x_train, y_train)


TF-IDF Transformation


Step 1 (tokenization): Divide the texts into words or smaller sub-texts, which will enable good generalization of relationship between the texts and the labels. This determines the “vocabulary” of the dataset (set of unique tokens present in the data).

Step 2 (vectorization): Define a good numerical measure to characterize these texts.





Step 4: Add selected features (len_text, digits, non_alpha_char) to trainset (x_train_tfidf)




Step 5: Add selected features (len_text, digits, non_alpha_char) to testset (x_val_tfidf)





Word2Vec Transformation


Similar to TF-IDF transformation, we perform the tokenization and vectorization to convert text into numerical matrix

Step 1 (tokenization): Set the input as the entire corpus, the model will build the set of vocabulary
Step 2: (vectorization): Create multidimensional vector from each document in trainset and testset










Let’s try to understand the parameters used in training the model:

1) Size: The number of dimensions of the embedding and the default is 100.
2) window: The maximum distance between a target word and words around it. The default window is 5.
3) min_count: The minimum count of words to consider when training the model; words with occurrence less than this count will be ignored. The default for min_count is 5.
4) workers: The number of partitions during training and the default workers is 3.
5) sg: The training algorithm, either CBOW(0) or skip gram(1). The default training algorithm is CBOW.

After the transformation, the data looks like:











We can't fit the Machine Algorithm with those data. We have to flatten the array in 'w2vec' column 




















Feature selection


1) Apply feature selection with f_classif and Bayesian on TF-IDF









2) Apply feature selection with _classif and Bayesian on Word2Vec






3) Compare the accuracy score between 2 models and baseline

TF-IDF + Naives Bayes: improve 5% of accuracy of classification from 86% to 91%

Word2Vec + Naive Bayes: improve 8% of accuracy of classification from 86% to 94%

Evaluation Metrics


1) Confusion matrix of Bayes prediction on TF-IDF



The correction of prediction = True Positive (TP) + True Negative (TN) = 878 + 140 = 1018

The mis-classification = False Negative (FN) + False Positive (FP) = 84 + 13 = 97

2) Confusion matrix of Bayes prediction on Word2Vec








The correction of prediction = TP + TN = 896 + 147 = 1043

The mis-classification = FN + FP = 84 + 13 = 72


Classification Report


1) Classification Report of Bayes prediction on TF-IDF





2) Classification Report of Bayes prediction on Word2Vec