Monday, May 11, 2020

Text Classification with TF-IDF, Word Embedding and Naive Bayes

Introduction

This project introduces one of the techniques of natural language processing: text classification. We shall explore, step by step, how to create new features through data analysis and how to clean the data. We also explain the concepts of Term Frequency–Inverse Document Frequency and Word Embedding and how to implement them on real data.

The article will show you how to apply Naive Bayes to text data after tokenization and vectorization. We will use a feature-selection technique to choose only the features that contribute to the performance of the prediction. After that, we will analyse and compare the evaluation metrics, including accuracy, confusion matrix and classification report, between the baseline accuracy, TF-IDF + Naive Bayes and Word Embedding + Naive Bayes.

Bayesian Classifier

Naive Bayes is one of the most straightforward and fastest classification algorithms, and it is suitable for large chunks of data. It uses Bayes' theorem of probability to predict an unknown class. It is simple and effective in answering questions such as "Given a particular term in the document, what is the probability that the document belongs to a particular class?"

The classification has two phases: a learning phase and an evaluation phase. In the learning phase, the classifier trains its model on a given dataset; in the evaluation phase, we test the classifier's performance. Performance is evaluated on the basis of various parameters such as accuracy, error, precision, and recall.

Pros: It is simple, and fast to train and to predict, even on small datasets.
Cons: It assumes every feature is independent, which isn't always the case.
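
As a minimal sketch of the two phases (the toy corpus and labels below are my own, assuming scikit-learn):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus (hypothetical); labels: 1 = spam, 0 = ham
texts = ["win a free prize now", "meeting at noon tomorrow",
         "free cash prize claim now", "lunch at noon tomorrow?"]
labels = [1, 0, 1, 0]

counts = CountVectorizer().fit(texts)                  # learning phase: build the vocabulary
clf = MultinomialNB().fit(counts.transform(texts), labels)

# "Given the terms in this document, which class is most probable?"
print(clf.predict(counts.transform(["free prize tomorrow"])))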

Term Frequency–Inverse Document Frequency (TF-IDF)

Machines don't understand text. We have to transform text into a sparse matrix, or term-document matrix. The term-document matrix is a two-dimensional matrix whose rows are the terms and whose columns are the documents/observations/samples, so each entry (i, j) represents the frequency of term i in document/observation/sample j.

For each entry in the matrix, the term frequency measures the number of times that term i appears in document j, and the inverse document frequency measures the number of documents in the corpus which contain term i. 

The tf-idf score is the product of these two metrics: tfidf(i, j) = tf(i, j) * idf(i), where idf(i) is usually log(N / df(i)), with N the number of documents in the corpus and df(i) the number of documents containing term i. So an entry's tf-idf score increases when term i appears frequently in document j, but decreases as the term appears in more of the other documents. In other words, idf is a cross-document normalization that puts less weight on common terms and more weight on rare terms.
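
As a minimal sketch with scikit-learn's TfidfVectorizer (the three toy documents are my own):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)   # note: scikit-learn puts documents in rows and terms in columns
print(sorted(vec.vocabulary_, key=vec.vocabulary_.get))   # the learned vocabulary, in column order
print(X.toarray().round(2))                               # tf-idf score of each term in each document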

Word Embedding

There are two main training algorithms for word2vec: Continuous Bag of Words (CBOW) and Skip-gram. The major difference between the two methods is that CBOW uses the context to predict a target word, while skip-gram uses a word to predict a target context.

Generally, the skip-gram method performs better than the CBOW method, because it can capture multiple semantics for a single word.

[Figure: the CBOW and skip-gram model architectures. Source: Internet]
The Word2Vec Skip-gram model, for example, takes in pairs of words generated by moving a window across the text data, and trains a 1-hidden-layer neural network on the synthetic task of predicting, given an input word, a probability distribution over the words near it.
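
A rough sketch of training a skip-gram model with gensim 3.x (the two toy sentences are mine; sg=1 selects skip-gram):

from gensim.models import Word2Vec

sentences = [["free", "prize", "claim", "now"],
             ["meeting", "at", "noon", "tomorrow"]]

model = Word2Vec(sentences, size=100, window=5, min_count=1, sg=1)
print(model.wv["free"])                  # the 100-dimensional vector learned for "free"
print(model.wv.most_similar("free"))     # nearest neighbours in the embedding space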

Feature selection

When we convert all of the texts in a dataset into word uni+bigram tokens, we may end up with tens of thousands of tokens. Not all of these tokens/features contribute to label prediction. So we can drop certain tokens, for instance those that occur extremely rarely across the dataset. We can also measure feature importance (how much each token contributes to label predictions), and only include the most informative tokens.

There are many statistical functions that take features and the corresponding labels and output the feature importance score. Two commonly used functions are f_classif and chi2.
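
In scikit-learn these live in sklearn.feature_selection. A hedged sketch of keeping only the k most informative tokens (the toy data and the choice of k are mine):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

texts = ["win a free prize now", "meeting at noon tomorrow",
         "free cash prize now", "lunch at noon?"]
y = [1, 0, 1, 0]
X = CountVectorizer(ngram_range=(1, 2)).fit_transform(texts)   # uni+bigram tokens

selector = SelectKBest(chi2, k=5)          # keep the 5 highest-scoring tokens
X_sel = selector.fit_transform(X, y)
print(X.shape, "->", X_sel.shape)
# swap chi2 for f_classif to rank tokens by the ANOVA F-value instead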

Evaluation Metrics

Accuracy will yield misleading results if the data set is unbalanced; that is, when the numbers of observations in different classes vary greatly. Confusion Matrix and Classification Report will help us to evaluate the performance for each class.

Confusion Matrix

A confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of a supervised machine learning algorithm.

Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class (scikit-learn uses the opposite convention: rows are actual classes and columns are predicted classes). The diagonal elements show the number of correct classifications for each class, while the off-diagonal elements give the misclassifications for each class.

There are four ways to check if the predictions are right or wrong:

1) True Positive (TP): The class was positive and predicted positive.
2) True Negative (TN): The class was negative and predicted negative.
3) False Negative (FN): The class was positive but predicted negative.
4) False Positive (FP): The class was negative but predicted positive.




[Figure: confusion matrix layout showing TP, FP, FN, TN. Source: Wikipedia]
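
With scikit-learn (the true and predicted labels below are made up for illustration):

from sklearn.metrics import confusion_matrix

y_true = ["ham", "ham", "spam", "spam", "ham", "spam"]
y_pred = ["ham", "spam", "spam", "spam", "ham", "ham"]

# scikit-learn puts actual classes in rows and predicted classes in columns
print(confusion_matrix(y_true, y_pred, labels=["ham", "spam"]))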


Classification Report


The report shows the main classification metrics precision, recall and f1-score on a per-class basis. The metrics are calculated by using true and false positives, true and false negatives. Positive and negative in this case are generic names for the predicted classes.

Precision – What percent of your positive predictions were correct?
Precision is the ability of a classifier not to label an instance positive that is actually negative. For each class it is defined as the ratio of true positives to the sum of true and false positives.

Precision = Accuracy of positive predictions. 

Precision = TP/(TP + FP)

Recall – What percent of the positive cases did you catch?

Recall is the ability of a classifier to find all positive instances. For each class it is defined as the ratio of true positives to the sum of true positives (TP) and false negatives (FN).

Recall: Fraction of positives that were correctly identified.

Recall = TP/(TP+FN)

F1 score – What is the balance between precision and recall?

The F1 score is a weighted harmonic mean of precision and recall such that the best score is 1.0 and the worst is 0.0. Generally speaking, F1 scores are lower than accuracy measures as they embed precision and recall into their computation. As a rule of thumb, the weighted average of F1 should be used to compare classifier models, not global accuracy.

F1 Score = 2*(Recall * Precision) / (Recall + Precision)
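
With scikit-learn, the whole report comes from one call (same made-up labels as above):

from sklearn.metrics import classification_report

y_true = ["ham", "ham", "spam", "spam", "ham", "spam"]
y_pred = ["ham", "spam", "spam", "spam", "ham", "ham"]

# precision, recall and f1-score per class, plus their weighted averages
print(classification_report(y_true, y_pred))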

Case study: classify spam/ham messages 


Data Exploratory Analysis

Let's take a first look at the data structure:
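
A hedged sketch of that first look (the file name spam.csv and the column names are assumptions about how the data is stored):

import pandas as pd

df = pd.read_csv("spam.csv")   # assumed file; two columns: label, text
df.info()                      # row count, dtypes and null counts
print(df.head())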

The dataset consists of 5,572 records with two columns, text and label, and no null data.

From the first 5 rows, we can see that the text column is the one we will build vectorizer models from, and the label column is the target (predicted) variable.

There are 2 unique values in the label column, ham and spam, which indicate whether the text message in the text column is not spam or spam.

Baseline Accuracy


The baseline accuracy is important but often ignored in machine learning. It sets the benchmark in terms of the minimum accuracy which the model should achieve. The baseline is calculated as the number of times the majority class (ham) appears in the target variable (label column), divided by the total number of observations.


In our case, the baseline accuracy is 86%.
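
As a one-line sketch (assuming the dataframe from above):

baseline = df["label"].value_counts(normalize=True).max()
print(round(baseline, 2))   # ~0.86: always predicting "ham" is right 86% of the time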

The best way to understand the data and to engineer features is to raise questions and to answer them.

1) What percentage of the documents are spam?
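
A possible pandas one-liner:

print((df["label"] == "spam").mean() * 100)   # the complement of the baseline accuracy, about 13-14%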

2) What is the longest message ?
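
One way to find it:

longest = df.loc[df["text"].str.len().idxmax(), "text"]
print(len(longest), longest[:80])   # its length and a preview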

3) What is the average length of documents (number of characters) for not spam and spam documents?
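
This also gives us our first engineered feature, len_text:

df["len_text"] = df["text"].str.len()
print(df.groupby("label")["len_text"].mean())   # average characters per class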

4) What is the average number of digits per document for not spam and spam documents?
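
Counting digits with a regex gives the second feature, digits:

df["digits"] = df["text"].str.count(r"\d")
print(df.groupby("label")["digits"].mean())   # average digit count per class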

Observation: On average, spam messages contain more digit characters than non-spam messages.

5) What is the average number of non-word characters (anything other than a letter, digit or underscore) per document for not spam and spam documents?
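
The \W character class matches exactly that, giving the third feature, non_alpha_char:

df["non_alpha_char"] = df["text"].str.count(r"\W")
print(df.groupby("label")["non_alpha_char"].mean())   # average non-word characters per class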

Observation: On average, spam messages contain more non-word characters than non-spam messages.

6) What are the top 20 highest-frequency words in spam and non-spam messages?
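
A quick sketch with collections.Counter (before any cleaning, so stop words will dominate):

from collections import Counter

for label in ["spam", "ham"]:
    words = " ".join(df.loc[df["label"] == label, "text"]).lower().split()
    print(label, Counter(words).most_common(20))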

Data Cleaning


1) Conversion to lower case: words like 'Go' and 'go' need to be considered as one word, so the text is converted to lowercase.
2) Remove non-alphanumeric characters.
3) Remove punctuation.
4) Remove stop words.
5) Lemmatization: unlike stemming, lemmatization reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization the root word is called a lemma: the canonical form, dictionary form, or citation form of a set of words. Python's NLTK provides a WordNet Lemmatizer that uses the WordNet database to look up the lemmas of words. A sketch of the whole pipeline follows.
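
A possible implementation of all five steps (assuming the NLTK stopwords and wordnet corpora have been downloaded):

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean(text):
    text = text.lower()                        # 1) lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # 2-3) drop non-alphanumerics and punctuation
    tokens = [t for t in text.split() if t not in stop_words]    # 4) remove stop words
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)     # 5) lemmatize

df["processed_text"] = df["text"].apply(clean)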

Data Preparation


1) Split and shuffle the dataset into train and test subsets with the ratio 80:20.
2) Select the predictor variables: len_text, digits, non_alpha_char, processed_text, and the target variable: label.


test_size=0.2 means that the test set (x_val, y_val) is 20% of the data and the remaining 80% is the train set (x_train, y_train).
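
A minimal sketch with scikit-learn (random_state is my choice, for reproducibility):

from sklearn.model_selection import train_test_split

features = ["len_text", "digits", "non_alpha_char", "processed_text"]
x_train, x_val, y_train, y_val = train_test_split(
    df[features], df["label"], test_size=0.2, shuffle=True, random_state=42)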


TF-IDF Transformation


Step 1 (tokenization): Divide the texts into words or smaller sub-texts, which will enable good generalization of the relationship between the texts and the labels. This determines the "vocabulary" of the dataset (the set of unique tokens present in the data).

Step 2 (vectorization): Define a good numerical measure to characterize these texts.
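
Scikit-learn's TfidfVectorizer performs both steps in one object; a hedged sketch (the ngram_range is my assumption):

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(ngram_range=(1, 2))                     # word uni+bigram tokens
x_train_tfidf = tfidf.fit_transform(x_train["processed_text"])  # learn the vocabulary on train only
x_val_tfidf = tfidf.transform(x_val["processed_text"])          # reuse it on the test set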

Step 4: Add selected features (len_text, digits, non_alpha_char) to trainset (x_train_tfidf)

Step 5: Add selected features (len_text, digits, non_alpha_char) to testset (x_val_tfidf)
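
For both steps, scipy.sparse.hstack can append the dense numeric columns to the sparse tf-idf matrix; a sketch:

from scipy.sparse import hstack

extra = ["len_text", "digits", "non_alpha_char"]
x_train_tfidf = hstack([x_train_tfidf, x_train[extra].values])   # step 4
x_val_tfidf = hstack([x_val_tfidf, x_val[extra].values])         # step 5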

Word2Vec Transformation


Similar to the TF-IDF transformation, we perform tokenization and vectorization to convert the text into a numerical matrix.

Step 1 (tokenization): Set the input as the entire corpus; the model will build the vocabulary set.
Step 2 (vectorization): Create a multidimensional vector from each document in the trainset and testset.
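
A hedged sketch with gensim 3.x (sg=1 is my choice, following the earlier remark that skip-gram tends to perform better):

from gensim.models import Word2Vec

corpus = [text.split() for text in df["processed_text"]]    # step 1: the tokenized corpus
w2v = Word2Vec(corpus, size=100, window=5, min_count=5, workers=3, sg=1)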

Let’s try to understand the parameters used in training the model:

1) size: The number of dimensions of the embedding; the default is 100.
2) window: The maximum distance between a target word and the words around it. The default window is 5.
3) min_count: The minimum count of words to consider when training the model; words with fewer occurrences than this count will be ignored. The default min_count is 5.
4) workers: The number of worker threads used to train the model; the default is 3.
5) sg: The training algorithm, either CBOW (0) or skip-gram (1). The default training algorithm is CBOW.

After the transformation, the data looks like:

We can't fit the machine learning algorithm on that data directly. We have to flatten the array in the 'w2vec' column.
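
One common way to do this (my assumption about what the 'w2vec' column held) is to average the word vectors of each document and stack the results into a plain 2-D array:

import numpy as np

def doc_vector(tokens):
    # average the vectors of the words the model knows; zeros if none are known
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

x_train_w2v = np.vstack([doc_vector(t.split()) for t in x_train["processed_text"]])
x_val_w2v = np.vstack([doc_vector(t.split()) for t in x_val["processed_text"]])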

Feature selection


1) Apply feature selection with f_classif and Naive Bayes on TF-IDF
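
A sketch (the value of k is my choice; MultinomialNB works here because tf-idf values are non-negative):

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import MultinomialNB

selector = SelectKBest(f_classif, k=min(20000, x_train_tfidf.shape[1]))
x_train_sel = selector.fit_transform(x_train_tfidf, y_train)
x_val_sel = selector.transform(x_val_tfidf)

nb_tfidf = MultinomialNB().fit(x_train_sel, y_train)
print(nb_tfidf.score(x_val_sel, y_val))   # accuracy on the test set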

2) Apply feature selection with f_classif and Naive Bayes on Word2Vec
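
Same idea, but embedding values can be negative, which MultinomialNB rejects, so a Gaussian Naive Bayes is the natural variant here (an assumption about the setup):

from sklearn.naive_bayes import GaussianNB

selector = SelectKBest(f_classif, k=min(64, x_train_w2v.shape[1]))
x_train_sel = selector.fit_transform(x_train_w2v, y_train)
x_val_sel = selector.transform(x_val_w2v)

nb_w2v = GaussianNB().fit(x_train_sel, y_train)
print(nb_w2v.score(x_val_sel, y_val))   # accuracy on the test set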

3) Compare the accuracy scores of the 2 models and the baseline

TF-IDF + Naive Bayes: improves classification accuracy by 5 percentage points, from 86% to 91%.

Word2Vec + Naive Bayes: improves classification accuracy by 8 percentage points, from 86% to 94%.

Evaluation Metrics


1) Confusion matrix of Bayes prediction on TF-IDF



The correct predictions = True Positive (TP) + True Negative (TN) = 878 + 140 = 1018

The misclassifications = False Negative (FN) + False Positive (FP) = 84 + 13 = 97

2) Confusion matrix of Bayes prediction on Word2Vec

The correct predictions = TP + TN = 896 + 147 = 1043

The misclassifications = FN + FP = 72


Classification Report


1) Classification Report of Bayes prediction on TF-IDF

2) Classification Report of Bayes prediction on Word2Vec

Sunday, April 26, 2020

Azure Machine Learning Service

Azure ML Service is an end-to-end machine learning service available on Azure which can help data scientists automate the machine learning life-cycle. A model can be built locally, then deployed, trained and validated on the cloud. Azure ML Service currently supports a Python SDK, a .NET SDK and a Java SDK. In this article, I use Python for the development.

We explore how to train a machine learning model using the Azure ML service. Then, we talk about the purpose and role of experiments, runs, and models. Finally, we practise training ML models with the available Azure resources, which includes creating a workspace, building a compute target, and executing a training run using the Azure ML service.

1) Create a workspace


The workspace is where all references to ML resources are centralized. It keeps the historical logs of training and run versions. It allows us to define different compute targets to run our experiments, where we pipeline the process of building the ML model.

MS Azure offers us several ways to create a workspace. We can use the Azure Portal, the Azure CLI, the REST service or a Resource Template. I highly recommend the Azure Portal because it takes just a few clicks.

The first condition is to have a valid account and a valid subscription. Check out this article to see how to get one.

Check out this part to get the step-by-step workspace creation.

Once we have successfully created a workspace for Machine Learning, we should see it on our home page.
My workspace name is diemai_workspace:



Click on the workspace name to get the details.

My workspace's resource group is cognitiveai, which is also the subscription name; the subscription is free for 30 days.

2)  Create and train Machine Learning model with Azure ML Service


First, we have to create an instance of the workspace in Python by using the azureml.core.Workspace object, filling in the parameters with the workspace name, subscription ID and resource group name, as follows:
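
A hedged sketch with the azureml-core SDK (the subscription ID below is a placeholder):

from azureml.core import Workspace

diemai_ws = Workspace.get(name="diemai_workspace",
                          subscription_id="<subscription-id>",
                          resource_group="cognitiveai")
print(diemai_ws.name, diemai_ws.location)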

3) Set up the computing resources

We need to specify a compute configuration that defines the compute target as well as the virtual machine type and size used in the computation. I set the compute target to amlcompute and its VM size to Standard_F2s_v2. I also indicate the versions of the Python packages used in the run.
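
A sketch of such a configuration (the package list is my assumption):

from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

run_config = RunConfiguration()
run_config.target = "amlcompute"                     # transient, Azure-managed compute
run_config.amlcompute.vm_size = "Standard_F2s_v2"    # VM size valid for my region

# pin the Python packages used by the run
run_config.environment.python.conda_dependencies = CondaDependencies.create(
    pip_packages=["scikit-learn", "pandas", "joblib"])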

The reason I set vm_size to Standard_F2s_v2 is that it is a valid resource provided by Azure for the subscription we created. The available vm_size values change when we change the subscription's region. To know which resources are available for use, run the following command:

AmlCompute.supported_vmsizes(workspace = diemai_ws)

4) Create a python script with selected dataset and ML Algorithm
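
A hedged reconstruction of the training script, written from a notebook cell (the Boston housing dataset and LinearRegression follow the names mentioned below; saving under ./outputs is my assumption, since Azure ML uploads that folder with the run):

%%writefile lg_house_boston.py
import os
import joblib
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_boston(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(x_train, y_train)

# files written under ./outputs are uploaded with the run on Azure ML
os.makedirs("outputs", exist_ok=True)
joblib.dump(model, "outputs/lg_house_boston.pkl")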

After running the above code, we get a Python file lg_house_boston.py in the local directory. We then execute that file and finally obtain the trainset and testset variables as well as an instance of Linear Regression saved in the lg_house_boston.pkl file.

5) Run Script

Now, we create a run script configuration where we specify the Python script and the computing resource:
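
A sketch, reusing run_config from the previous step:

from azureml.core import ScriptRunConfig

script_config = ScriptRunConfig(source_directory=".",
                                script="lg_house_boston.py",
                                run_config=run_config)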

6) Experiment and Task Submit

The last step is to set up an experiment and submit it to the Azure cloud.
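
A sketch (the experiment name is my choice):

from azureml.core import Experiment

experiment = Experiment(workspace=diemai_ws, name="lg-house-boston")
run = experiment.submit(config=script_config)
run.wait_for_completion(show_output=True)   # streams the cloud logs until training finishes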

When the training is completed, all log files are saved in the azureml-logs directory.

We can also view the progress of the run with the Jupyter widget:
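
Assuming the azureml-widgets package is installed:

from azureml.widgets import RunDetails

RunDetails(run).show()   # live status, logs and metrics inside the notebook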

7) Download the model to the local machine


We can download the model trained on the cloud to the local machine, validate it with the testset and compute the mean squared error:
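
A sketch (the outputs/ path assumes the script saved the model there; x_test and y_test come from executing the script locally):

import joblib
from sklearn.metrics import mean_squared_error

run.download_file(name="outputs/lg_house_boston.pkl",
                  output_file_path="lg_house_boston.pkl")
model = joblib.load("lg_house_boston.pkl")
print(mean_squared_error(y_test, model.predict(x_test)))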

Tuesday, April 21, 2020

Sentiment Analysis with Azure Text Analytics API

The Azure natural language API is one of the Microsoft Cognitive Services APIs, which provide a wide range of functionalities from text analytics and Q&A to text translation and chatbots. This article introduces the sentiment analysis capabilities of the Text Analytics API. Calling the API services from the Azure Portal requires a valid subscription, an endpoint, request parameters and JSON-structured data; finally, we explore the JSON responses of the API.

1) Subscription


In order to create a resource on the Azure Portal, the very first step is to have a Microsoft account and a valid subscription. Microsoft offers you a free Azure account with $200 of credit to explore Azure services for 30 days. Check out Part 1 to see the creation steps.

2) Endpoint Request


The endpoint is where we send the request for processing. It consists of a URI and a non-expired subscription key. When we have succeeded in creating an account and a subscription on the Azure Portal, we continue by adding a new resource for text analytics:

Click on "Create a resource"

Select "AI + Machine Learning" and "Text Analytics"

Fill in the required input and click on "Create new". The input required in the Resource group field is the subscription name from step 1.

If no errors occur, the new resource is added on the Home page as follows:

Click on the resource name; this is where we obtain the subscription key and the endpoint for requests.

Mission accomplished !!!

3) Request parameters for Sentiment Analysis


The request has 3 main parts: the header, the URL and the JSON document data.

The header is a dictionary object whose key is "Ocp-Apim-Subscription-Key" and whose value is the API key from the previous step.

A complete URL comprises 2 parts: the endpoint shown in the image above plus the API URI specific to our purpose. At this moment, we use the Azure API to detect the sentiment hidden in the text.

The JSON document is an array of JSON objects, each consisting of an id (key), a language and a text.
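
A hedged sketch of the whole request with the requests library (the key, resource name and example texts are placeholders; the v2.1 API version is an assumption):

import requests
from pprint import pprint

subscription_key = "<subscription-key>"   # from the resource page in the previous step
endpoint = "https://<resource-name>.cognitiveservices.azure.com"

headers = {"Ocp-Apim-Subscription-Key": subscription_key}
url = endpoint + "/text/analytics/v2.1/sentiment"

documents = {"documents": [
    {"id": "1", "language": "en", "text": "I love this movie!"},
    {"id": "2", "language": "en", "text": "It made me shiver and cry."}]}

response = requests.post(url, headers=headers, json=documents)
pprint(response.json())   # one sentiment score in [0, 1] per document; low means negative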

4) JSON response from Sentiment Analysis Request


First, the request returns no errors, so we can continue analyzing the response data. Clearly, the scores of text 3, text 5 and text 6 are very small, which means they are negative, because words like "shiver", "crie" and "die" appear in the text. The results are not bad at all !!!


5) JSON response from Entities Extraction Request


The only thing we have to do is modify the API request and the JSON document in the request body and send them to Azure.
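
Continuing the sketch above, only the URI and the documents change (the sample text mirrors the entities discussed below):

url = endpoint + "/text/analytics/v2.1/entities"

documents = {"documents": [
    {"id": "1", "language": "en",
     "text": "The Statue of Liberty is located in New York city."}]}

response = requests.post(url, headers=headers, json=documents)
pprint(response.json())   # entity names, their positions in the text and confidence scores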

The output is very comprehensive and informative. It found 2 entities in the text: Statue of Liberty and New York city. The confidence scores for the predictions are greater than 0.8. It also provides the locations of the entities.

6) JSON response from Language Identification request


Language detection is one of the subfields of natural language processing; it recognizes the natural languages used in a text. Similar to the preceding part, the URL and JSON document are changed to match the purpose.
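
Again continuing the same sketch; note that no "language" field is sent, since that is what we ask the service to detect:

url = endpoint + "/text/analytics/v2.1/languages"

documents = {"documents": [
    {"id": "1", "text": "Hello, how are you?"},
    {"id": "2", "text": "Xin chào, bạn khỏe không?"},
    {"id": "3", "text": "Bonjour, comment allez-vous ?"}]}

response = requests.post(url, headers=headers, json=documents)
pprint(response.json())   # detected language name, ISO code and confidence per document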

As we can see, 3 different languages are detected in the texts: English, Vietnamese and French, each with a confidence score of 1.0.



Thursday, April 9, 2020

Microsoft Azure Vision API - Face Detection

This post is a continuation of part 1: Microsoft Azure Vision API - Computer Vision.

Let's play around with the Face API, which is a part of Microsoft Azure Vision. As with Computer Vision, we must have a Microsoft account, a valid subscription and a Face resource, which gives us a key and an endpoint, in order to access the Face API services. Check out Part 1 to see the creation steps.

The purpose of the Face API is to extract the parts of a face and its expression, in order to predict attributes such as the age and the emotion of the person in the image.


First, we select a face image. It can be a local file or a URL.

Next, we copy the subscription secret key and the endpoint URL from the Microsoft Azure resource that we created in Part 1.

Then, we build a REST API request which consists of headers, parameters and data, each formed as a dictionary object.

What we want to know about the image is put in the 'returnFaceAttributes' attribute of the parameters object. In the example, I want to know the age, gender, head pose and facial hair of the subject. I also want to check whether that person is wearing glasses and whether he/she is wearing makeup.

We have just finished the request part. Let's send it and hope no errors happen.
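
A hedged reconstruction of the whole request (the key, resource name and image URL are placeholders):

import requests
from pprint import pprint

subscription_key = "<face-api-key>"
face_url = "https://<resource-name>.cognitiveservices.azure.com/face/v1.0/detect"

headers = {"Ocp-Apim-Subscription-Key": subscription_key}
params = {"returnFaceAttributes": "age,gender,headPose,facialHair,glasses,emotion,makeup"}
data = {"url": "https://example.com/face.jpg"}   # placeholder image URL

response = requests.post(face_url, headers=headers, params=params, json=data)
print(response.status_code)   # 200 means the request was processed
pprint(response.json())       # face rectangle plus the requested attributes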


A return code of 200 means the request was completely processed. Let's explore the attributes returned in the response.

The JSON object is pretty-printed with pprint. From the data extracted from the image, it is a man whose age is 37. His emotion is predicted as neutral with 99% confidence. He doesn't wear glasses either. Wanna guess who he is?

It is Emmanuel Macron, the French President. Well, in the picture he looks 5 years younger than he is.