Introduction
This project introduces one of the core techniques of natural language processing: text classification. We will explore, step by step, how to create new features through data analysis and how to clean the data. We also explain the concepts of Term Frequency–Inverse Document Frequency (TF-IDF) and word embeddings and how to implement them on real data.
The article will show you how to apply Naive Bayes to text data after tokenization and vectorization. We will use a best-feature selection technique to choose only the features that contribute to the performance of the prediction. After that, we will analyse and compare the evaluation metrics, including accuracy, the confusion matrix and the classification report, across the baseline accuracy, TF-IDF + Naive Bayes and word embedding + Naive Bayes.
Bayesian Classifier
Naive Bayes is one of the most straightforward and fastest classification algorithms, and it is suitable for large amounts of data. It uses Bayes' theorem to predict the class of unseen data. It is simple and effective at answering questions such as "Given a particular term in the document, what is the probability that the document belongs to a particular class?"
The classification has two phases: a learning phase and an evaluation phase. In the learning phase, the classifier trains its model on a given dataset; in the evaluation phase, we test the classifier's performance. Performance is evaluated on the basis of various metrics such as accuracy, error, precision and recall.
Pros: It is simple, and it is easy and fast to train and to make predictions, even on a small dataset.
Cons: It assumes every feature is independent, which isn’t always the case.
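As a minimal sketch of these two phases (the toy term-count matrix and labels below are made up for illustration, and scikit-learn's MultinomialNB is assumed as the Naive Bayes implementation):

```python
# Naive Bayes sketch on a toy term-count matrix (assumes scikit-learn).
from sklearn.naive_bayes import MultinomialNB

# Rows are documents, columns are term counts (e.g. "free", "win", "meeting").
X_train = [[2, 1, 0],
           [0, 0, 3],
           [3, 2, 0],
           [0, 1, 2]]
y_train = ["spam", "ham", "spam", "ham"]

clf = MultinomialNB()
clf.fit(X_train, y_train)                # learning phase
print(clf.predict([[1, 2, 0]]))          # evaluation phase: predicted class
print(clf.predict_proba([[1, 2, 0]]))    # class probabilities from Bayes' theorem
```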
Term Frequency–Inverse Document Frequency (TF-IDF)
Machines don't understand text. We have to transform the text into a sparse matrix, also called a term-document matrix. The term-document matrix is a two-dimensional matrix whose rows are the terms and whose columns are the documents (observations/samples), so each entry (i, j) represents the frequency of term i in document j.
For each entry in the matrix, the term frequency measures the number of times term i appears in document j, while the inverse document frequency down-weights term i according to the number of documents in the corpus that contain it.
The TF-IDF score is the product of these two metrics (tf * idf). So an entry's TF-IDF score increases when term i appears frequently in document j, but decreases as the term appears in more of the other documents. In other words, IDF is a cross-document normalization that puts less weight on common terms and more weight on rare terms.
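A minimal sketch with scikit-learn's TfidfVectorizer (the three toy documents are made up for illustration; note that scikit-learn returns the document-term orientation, i.e. the transpose of the term-document matrix described above):

```python
# TF-IDF sketch: common terms are down-weighted, rare terms up-weighted
# (assumes a recent scikit-learn and pandas).
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    "free prize call now",           # toy documents for illustration
    "call me about the meeting",
    "free entry win a prize",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)   # rows = documents, columns = terms

print(pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out()))
```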
Word Embedding
There are two main training algorithms for word2vec: Continuous Bag of Words (CBOW) and Skip-gram. The major difference between these two methods is that CBOW uses the context to predict a target word, while skip-gram uses a word to predict its surrounding context.
Generally, the skip-gram method can perform better than the CBOW method, because it can capture two semantics for a single word.
The Word2Vec skip-gram model, for example, takes in pairs of words generated by moving a window across the text data and trains a one-hidden-layer neural network on a synthetic task: given an input word, predict the probability distribution of the words near it.
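To make this synthetic task concrete, here is a small sketch in plain Python (the sentence and window size are arbitrary) of how (input word, context word) pairs are generated by sliding a window across the text:

```python
# Sketch: generate skip-gram (input word, context word) training pairs with a window of 2.
sentence = "we will classify spam and ham messages".split()
window = 2

pairs = []
for i, target in enumerate(sentence):
    # Context words are the neighbours within `window` positions of the target.
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

print(pairs[:6])
# The network is then trained to predict the context word from the input word.
```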
Feature selection
When we convert all of the texts in a dataset into word uni+bigram tokens, we may end up with tens of thousands of tokens. Not all of these tokens/features contribute to label prediction. So we can drop certain tokens, for instance those that occur extremely rarely across the dataset. We can also measure feature importance (how much each token contributes to label predictions), and only include the most informative tokens.
There are many statistical functions that take features and the corresponding labels and output the feature importance score. Two commonly used functions are f_classif and chi2.
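A hedged sketch with scikit-learn's SelectKBest on a toy non-negative feature matrix standing in for TF-IDF counts:

```python
# Feature-selection sketch: keep only the k most informative features (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2

# Toy feature matrix; made non-negative because chi2 requires non-negative features.
X, y = make_classification(n_samples=100, n_features=50, random_state=0)
X = abs(X)

selector = SelectKBest(chi2, k=10)        # f_classif can be swapped in the same way
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                   # (100, 10)
print(selector.scores_[:5])               # importance score per feature
```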
Evaluation Metrics
Accuracy will yield misleading results if the dataset is unbalanced, that is, when the numbers of observations in the different classes vary greatly. The confusion matrix and the classification report will help us evaluate the performance for each class.
Confusion Matrix
A confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of a supervised machine learning algorithm.
Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class. The diagonal elements show the number of correct classifications for each class, while the off-diagonal elements give the misclassifications for each class.
There are four ways to check if the predictions are right or wrong:
1) True Positive (TP): The class was positive and predicted positive.
2) True Negative (TN): The class was negative and predicted negative.
3) False Negative (FN): The class was positive but predicted negative.
4) False Positive (FP): The class was negative but predicted positive.
(Confusion matrix layout. Source: Wikipedia)
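A small sketch of computing a confusion matrix with scikit-learn (the labels below are toy values; note that scikit-learn places the true classes on the rows and the predicted classes on the columns):

```python
# Confusion matrix sketch (assumes scikit-learn).
from sklearn.metrics import confusion_matrix

y_true = ["ham", "ham", "spam", "spam", "ham", "spam"]
y_pred = ["ham", "spam", "spam", "spam", "ham", "ham"]

print(confusion_matrix(y_true, y_pred, labels=["ham", "spam"]))
# [[2 1]   2 ham correctly classified, 1 ham predicted as spam
#  [1 2]]  1 spam predicted as ham, 2 spam correctly classified
```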
Classification Report
The report shows the main classification metrics, precision, recall and F1-score, on a per-class basis. The metrics are calculated using true and false positives and true and false negatives. Positive and negative are, in this case, generic names for the predicted classes.
Precision – What percent of your predictions were correct?
Precision is the ability of a classifier not to label an instance positive that is actually negative. For each class it is defined as the ratio of true positives to the sum of true and false positives.
Precision = Accuracy of positive predictions.
Precision = TP/(TP + FP)
Recall – What percent of the positive cases did you catch?
Recall is the ability of a classifier to find all positive instances. For each class it is defined as the ratio of true positives to the sum of true positives (TP) and false negatives (FN).
Recall: Fraction of positives that were correctly identified.
Recall = TP/(TP+FN)
F1 score – How well does the model balance precision and recall?
The F1 score is a weighted harmonic mean of precision and recall such that the best score is 1.0 and the worst is 0.0. Generally speaking, F1 scores are lower than accuracy measures as they embed precision and recall into their computation. As a rule of thumb, the weighted average of F1 should be used to compare classifier models, not global accuracy.
F1 Score = 2*(Recall * Precision) / (Recall + Precision)
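All three metrics can be obtained per class with a single call; a minimal sketch with scikit-learn on the same toy labels as the confusion-matrix sketch above:

```python
# Classification report sketch: per-class precision, recall and F1 (assumes scikit-learn).
from sklearn.metrics import classification_report

y_true = ["ham", "ham", "spam", "spam", "ham", "spam"]
y_pred = ["ham", "spam", "spam", "spam", "ham", "ham"]

print(classification_report(y_true, y_pred))
# For the spam class: precision = TP/(TP+FP) = 2/3, recall = TP/(TP+FN) = 2/3,
# F1 = 2*(2/3 * 2/3)/(2/3 + 2/3) = 2/3.
```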
Case study: classify spam/ham messages
Exploratory Data Analysis
Let's take a first look at the data structure:
The dataset consists of 5572 records with two columns, text and label, and no null data.
From the first 5 rows, we can see that the text column is the one from which we will build the vectorizer models, and the label column is the target (predicted) variable.
There are 2 unique values in the label column, ham and spam, indicating whether the message in the text column is not spam or spam.
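A hedged sketch of this first look (the file name spam.csv and the lower-case column names are assumptions based on the description above):

```python
# First look at the data (assumes pandas and a CSV named 'spam.csv'
# with 'text' and 'label' columns).
import pandas as pd

df = pd.read_csv("spam.csv")

print(df.shape)                  # expected: (5572, 2)
print(df.info())                 # column types and null counts
print(df.head())                 # first 5 rows
print(df["label"].unique())      # expected: ['ham' 'spam']
```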
Baseline Accuracy
The baseline accuracy is important but often ignored in machine learning. It sets the benchmark for the minimum accuracy the model should achieve. The baseline is calculated as the number of times the majority class (ham) appears in the target variable (the label column), divided by the total number of observations.
In our case, the baseline accuracy is 86%.
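A short sketch of the baseline computation, continuing from the loading sketch above:

```python
# Baseline accuracy: share of the majority class (ham) in the label column.
baseline = df["label"].value_counts(normalize=True).max()
print(f"Baseline accuracy: {baseline:.0%}")   # roughly 86% for this dataset
```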
1) What percentage of the documents are spam?
2) What is the longest message?
3) What is the average length of documents (number of characters) for not spam and spam documents?
4) What is the average number of digits per document for not spam and spam documents?
Observation: On average, spam messages contain more digit characters than non-spam messages.
5) What is the average number of non-word characters (anything other than a letter, digit or underscore) per document for not spam and spam documents?
Observation: On average, spam messages contain more non-alphanumeric characters than non-spam messages.
6) What are the 20 highest-frequency words in spam and non-spam messages? (A sketch answering questions 3 to 6 follows this list.)
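A hedged sketch answering questions 3 to 6, continuing from the sketches above (the regular expressions and the whitespace tokenization are simplifying assumptions):

```python
# Engineered features and top-20 words per class (assumes pandas; 'text'/'label' columns as above).
from collections import Counter

df["len_text"] = df["text"].str.len()               # number of characters
df["digits"] = df["text"].str.count(r"\d")          # number of digit characters
df["non_alpha_char"] = df["text"].str.count(r"\W")  # non-word characters

print(df.groupby("label")[["len_text", "digits", "non_alpha_char"]].mean())

top20_spam = Counter(
    " ".join(df.loc[df["label"] == "spam", "text"]).lower().split()
).most_common(20)
print(top20_spam)                                   # repeat with label == 'ham'
```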
Data Cleaning
2) Remove non-alphanumeric characters
3) Remove punctuation
4) Remove stop words
5) Lemmatization: Unlike stemming, lemmatization reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization the root word is called a lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words. Python's NLTK provides a WordNet Lemmatizer that uses the WordNet database to look up the lemmas of words. (A cleaning sketch covering these steps follows this list.)
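A hedged sketch of these cleaning steps (it assumes NLTK with the stopwords and wordnet corpora and approximates, rather than reproduces, the original pipeline):

```python
# Data-cleaning sketch: lowercase, drop non-alphanumerics/punctuation, remove stop words, lemmatize.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean(text):
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text.lower())        # drop punctuation / non-alphanumerics
    tokens = [t for t in text.split() if t not in stop_words]  # drop stop words
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)   # reduce each word to its lemma

df["processed_text"] = df["text"].apply(clean)
```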
Data Preparation
1) Split and shuffle the dataset into train and test subsets with a 70:30 ratio
2) Select the predictor variables (len_text, digits, non_alpha_char, processed_text) and the target variable (label), as sketched below
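A minimal sketch of this preparation with scikit-learn's train_test_split (the random_state is an arbitrary choice; the variable names continue the sketches above):

```python
# Data preparation: shuffled 70:30 split of features and target.
from sklearn.model_selection import train_test_split

features = ["len_text", "digits", "non_alpha_char", "processed_text"]
x_train, x_val, y_train, y_val = train_test_split(
    df[features], df["label"], test_size=0.3, shuffle=True, random_state=42)

print(x_train.shape, x_val.shape)
```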
TF-IDF Transformation
Step 1 (tokenization): Divide the texts into words or smaller sub-texts, which enables good generalization of the relationship between the texts and the labels. This determines the "vocabulary" of the dataset (the set of unique tokens present in the data).
Step 2 (vectorization): Define a good numerical measure to characterize these texts.
Step 4: Add the selected features (len_text, digits, non_alpha_char) to the train set (x_train_tfidf)
Step 5: Add the selected features (len_text, digits, non_alpha_char) to the test set (x_val_tfidf); a sketch of these steps follows
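A hedged sketch of these TF-IDF steps, continuing from the split above (the ngram_range and min_df values are assumptions; scipy's hstack appends the engineered features to the sparse matrices):

```python
# TF-IDF transformation: vectorize the cleaned text, then append the engineered features.
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack

tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)           # word uni+bigram tokens
x_train_tfidf = tfidf.fit_transform(x_train["processed_text"])
x_val_tfidf = tfidf.transform(x_val["processed_text"])

extra = ["len_text", "digits", "non_alpha_char"]
x_train_tfidf = hstack([x_train_tfidf, x_train[extra].values])  # Step 4
x_val_tfidf = hstack([x_val_tfidf, x_val[extra].values])        # Step 5

print(x_train_tfidf.shape, x_val_tfidf.shape)
```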
Word2Vec Transformation
Step 1 (tokenization): Set the input to the entire corpus; the model will build the vocabulary
Step 2 (vectorization): Create a multidimensional vector from each document in the train set and the test set
Let's try to understand the parameters used in training the model (a minimal training sketch follows the list):
1) size: The number of dimensions of the embedding; the default is 100.
2) window: The maximum distance between a target word and the words around it; the default window is 5.
3) min_count: The minimum count for words to be considered when training the model; words with fewer occurrences than this count are ignored. The default min_count is 5.
4) workers: The number of worker threads used during training; the default is 3.
5) sg: The training algorithm, either CBOW (0) or skip-gram (1). The default training algorithm is CBOW.
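A minimal training sketch with gensim (in gensim 4.x the size parameter is called vector_size; min_count=1 and the example query word are assumptions):

```python
# Word2Vec training sketch on the cleaned corpus (assumes gensim 4.x).
from gensim.models import Word2Vec

sentences = df["processed_text"].str.split().tolist()   # the entire tokenized corpus

w2v = Word2Vec(sentences,
               vector_size=100,   # dimensions of the embedding
               window=5,          # max distance between target and context words
               min_count=1,       # ignore words occurring fewer times than this
               workers=3,         # worker threads used in training
               sg=1)              # 1 = skip-gram, 0 = CBOW

print(w2v.wv.most_similar("free", topn=5))   # assumes 'free' is in the vocabulary
```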
After the transformation, the data looks like:
We can't fit the machine learning algorithm on this data as-is. We have to flatten the array in the 'w2vec' column (see the sketch below).
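One common way to obtain a fixed-length vector per document is to average the word vectors of that document and stack the results into a 2-D array; the sketch below assumes this averaging, although the original notebook may flatten the 'w2vec' column differently:

```python
# Build one fixed-length vector per document by averaging its word vectors,
# then stack them into a 2-D array the classifier can consume.
import numpy as np

def document_vector(text, model):
    words = [w for w in text.split() if w in model.wv]
    if not words:                                  # document with no known words
        return np.zeros(model.wv.vector_size)
    return np.mean(model.wv[words], axis=0)

x_train_w2v = np.vstack([document_vector(t, w2v) for t in x_train["processed_text"]])
x_val_w2v = np.vstack([document_vector(t, w2v) for t in x_val["processed_text"]])
print(x_train_w2v.shape)                           # (number of documents, 100)
```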
Feature selection
1) Apply feature selection with f_classif and Naive Bayes on the TF-IDF features
2) Apply feature selection with f_classif and Naive Bayes on the Word2Vec features
3) Compare the accuracy scores of the two models against the baseline
TF-IDF + Naive Bayes: improves classification accuracy by 5 percentage points, from 86% to 91%
Word2Vec + Naive Bayes: improves classification accuracy by 8 percentage points, from 86% to 94%
Evaluation Metrics
1) Confusion matrix of the Naive Bayes predictions on TF-IDF features
The correct predictions = True Positives (TP) + True Negatives (TN) = 878 + 140 = 1018
The misclassifications = False Negatives (FN) + False Positives (FP) = 84 + 13 = 97
2) Confusion matrix of the Naive Bayes predictions on Word2Vec features
The correct predictions = TP + TN = 896 + 147 = 1043
The misclassifications = FN + FP = 72
Classification Report
The F1-scores of the Word2Vec model for both the spam and ham classes are better than those of the TF-IDF model.
View the Python code for Data Analysis on GitHub
View the Python code for TF-IDF + Bayes on GitHub
View the Python code for Word2Vec + Bayes on GitHub