Sunday, February 2, 2020

KMeans - Clustering Method Part 1

1. What is Clustering?

Clustering is a classic machine learning problem in which we don't work with labeled data. It works directly with the features of the data in order to discover patterns or logical groups in the underlying dataset.

Clustering is an example of unsupervised machine learning, where we set up the model to learn the structure of the data. It can also be used as a feature engineering step for supervised machine learning problems such as regression or classification.

2. K-Means

The first clustering technique we'll explore is K-Means Clustering. It is very simple, intuitive, and easy to understand. The objective is to maximize the intra-cluster similarity and minimize the inter-cluster similarity.

How does it actually work?

- Step 1: Initialize K centroids (for example, by picking K points as the initial means)

Repeat:

- Step 2: Assign each point to a cluster

- Step 3: Recalculate the mean for each cluster

- Step 4: Re-assign the points to clusters

- Step 5: Iterate until the points settle into their final clusters

If the centroids have converged, stop the process. If not, repeat steps 2 to 5.
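
To make the loop concrete, here is a minimal NumPy sketch of the steps above (an illustration only: it uses simple random initialization and does not handle empty clusters).

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize K centroids by picking K random points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate the mean of each cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the centroids have converged
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids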

3. Model Evaluation

An easy way to evaluate clustering models is the Silhouette Score, because it does not require labeled data.

It measures how similar an object is to objects in its own cluster and how different it is from objects in other clusters. The overall Silhouette score is the average of the Silhouette coefficient over all samples.
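
For a single sample, the coefficient is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster. A minimal example with scikit-learn's silhouette_score on toy data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # toy data
labels = KMeans(n_clusters=3, random_state=42).fit_predict(X)
print(silhouette_score(X, labels))   # average of the per-sample coefficients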

4. Distance Metrics

4.1 Euclidean Distance

- The Euclidean distance is the square root of the sum of squared differences between paired values of two rows (or columns).

- Data normalization and missing-value processing must be taken into account before using this metric.




4.2. Cosine Distance

- Cosine similarity is the dot product divided by the product of the lengths (where the length is the square root of the dot product of a row/column with itself). Cosine distance is computed by subtracting the similarity from 1.

- In the calculation of dot products, missing values are replaced by means. In the calculation of lengths, the contribution of a missing value equals the square of the mean plus the variance.



Euclidean vs Cosine Distance

- Euclidean distance is useful for coordinate-based measurements and is more sensitive to the curse of dimensionality.

- Cosine distance is better for data such as text, where the location of occurrence is less important.
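
A quick illustration of the difference with scipy (a small sketch with made-up vectors): two vectors that point in the same direction but have different magnitudes are far apart in Euclidean terms yet identical in cosine terms.

from scipy.spatial import distance

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]              # same direction as a, twice the magnitude
print(distance.euclidean(a, b))  # about 3.74 -> sensitive to magnitude
print(distance.cosine(a, b))     # 0.0        -> same orientation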


4.3. Manhattan Distance

- It is the sum of the absolute differences between paired values.

- Normalization and the treatment of missing values are similar to the Euclidean distance, except that medians and the median absolute deviation from the median (MAD) are used instead of means and deviations.

- For discrete values, distances are again 0 or 1; hence the Manhattan distance for discrete columns is the same as the Euclidean one.




4.4 Jaccard Distance

- Jaccard similarity between two sets X and Y is defined as the size of their intersection (X ∩ Y) divided by the size of the union (X ∪ Y). Jaccard distance is computed by subtracting the similarity from 1.

- Applies to sets (like word occurrence):
  • Sentence A: “I like chocolate ice cream.”
  • Set A = {I, like, chocolate, ice, cream}
  • Sentence B: “Do I want chocolate cream or vanilla cream?”
  • Set B = {Do, I, want, chocolate, cream, or, vanilla}
  • Jaccard distance(sentence A, sentence B) = 1 - |A ∩ B| / |A ∪ B| = 1 - 3/9 ≈ 0.67
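
The same calculation in Python, using plain sets (a small sketch of the example above):

A = {"I", "like", "chocolate", "ice", "cream"}
B = {"Do", "I", "want", "chocolate", "cream", "or", "vanilla"}
jaccard_distance = 1 - len(A & B) / len(A | B)
print(jaccard_distance)   # 1 - 3/9 ≈ 0.67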

5. Case Study

We will be using the wine quality data set for our case study. This data set contains various chemical properties of wine, such as acidity, sugar, pH, and alcohol. It also contains a quality metric (3-9, with higher being better) and a color (red or white). The name of the file is Wine_Quality_Data.
We will be using the chemical properties (i.e. everything but quality and color) to cluster the wine. Though this is unsupervised learning, it can be fun to see how our clustering results map onto color and quality.

5.1. Know Your Data


- Statistical Description
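
A sketch of loading and describing the data (assuming the file is saved as Wine_Quality_Data.csv):

import pandas as pd

data = pd.read_csv("Wine_Quality_Data.csv")
print(data.shape)
print(data.describe().T)   # count, mean, std, min, quartiles, max per column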

The dataset has 12 continuous columns. Some columns have a high standard deviation or variance, so we can infer that their distributions are skewed.

The max and min values differ greatly across columns, so it is necessary to transform the data before clustering.

- Plot the Quality column vs. the Color column as a histogram
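
A possible way to draw it (continuing with the DataFrame loaded above and assuming the columns are named 'quality' and 'color'):

import matplotlib.pyplot as plt
import seaborn as sns

sns.countplot(data=data, x="quality", hue="color")
plt.title("Wine quality by color")
plt.show()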

- Pairwise maximal correlations
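
One way to compute this table (a sketch; float_columns is my name for the chemical-property columns):

import numpy as np
import pandas as pd

float_columns = [c for c in data.columns if c not in ("quality", "color")]
corr = data[float_columns].corr().abs()
corr_off_diag = corr.where(~np.eye(len(corr), dtype=bool))   # ignore self-correlation
max_pairs = pd.DataFrame({
    "most_correlated_with": corr_off_diag.idxmax(),
    "correlation": corr_off_diag.max(),
})
print(max_pairs.sort_values("correlation", ascending=False))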

The table above shows, for each feature, the other feature it is most strongly correlated with. The correlation matrix can also be presented as the following heatmap:
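
A sketch of the heatmap with seaborn:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 8))
sns.heatmap(data[float_columns].corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation matrix of the chemical properties")
plt.show()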

5.2. Data Preprocessing

In the previous part, we found that the data is skewed and that the min and max values differ widely among variables. In this step, we will perform normalization and scaling.

Scaling vs. Normalization: What’s the difference?
One of the reasons that it’s easy to get confused between scaling and normalization is because the terms are sometimes used interchangeably and, to make it even more confusing, they are very similar! In both cases, you’re transforming the values of numeric variables so that the transformed data points have specific helpful properties.
The difference is that, in scaling, you’re changing the range of your data while in normalization you’re changing the shape of the distribution of your data. Let’s talk a little more in-depth about each of these options.
By scaling your variables, you can help compare different variables on equal footing.

Data Transformation


Feature Scaling with sklearn.preprocessing.RobustScaler
Scaling means that you’re transforming your data so that it fits within a specific scale, like 0–100 or 0–1. You want to scale data when you’re using methods based on measures of how far apart data points are, like support vector machines (SVM) or k-nearest neighbors (KNN). With these algorithms, a change of “1” in any numeric feature is given the same importance.

As a result, the range of each feature’s distribution changes after scaling.
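
A sketch of the scaling step (applied only to the chemical-property columns defined earlier):

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()                    # centers on the median, scales by the IQR
data_scaled = data.copy()
data_scaled[float_columns] = scaler.fit_transform(data[float_columns])
print(data_scaled[float_columns].describe().T[["mean", "std", "min", "max"]])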

Observation:

After applying the transformation with RobustScaler, the magnitude (standard deviation) of all features now looks similar.
This prepares the data for the next step (K-Means clustering): because K-Means is a distance-based algorithm, large differences in magnitude between features can create a problem.



Data Normalization


Feature Transformation with stats.zscore

Scaling just changes the range of your data. Normalization is a more radical transformation. The point of normalization is to change your observations so that they can be described as a normal distribution.
In general, you’ll only want to normalize your data if you’re going to be using a machine learning or statistics technique that assumes your data is normally distributed. Some examples include t-tests, ANOVA, and linear regression.
The z-score should be used in this case instead of the Box-Cox transformation, because some negative data points lie in the dataset.
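
A sketch of this step with scipy's stats.zscore (applied here to the scaled data from the previous step, which is an assumption about the order used):

from scipy import stats

data_normalized = data_scaled.copy()
data_normalized[float_columns] = data_scaled[float_columns].apply(stats.zscore)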

5.3. KMeans Clustering

- Fit a K-means clustering model with two clusters.
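
A sketch of the fit, using the preprocessed chemical-property columns and storing the cluster label in a new 'kmeans' column (this column is reused in the feature-engineering step later on):

from sklearn.cluster import KMeans

km = KMeans(n_clusters=2, random_state=42)
data["kmeans"] = km.fit_predict(data_normalized[float_columns])
print(data["kmeans"].value_counts())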

- Examine the clusters by wine color.
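
A cross-tabulation of the cluster label against the wine color is one way to do this:

import pandas as pd

print(pd.crosstab(data["kmeans"], data["color"]))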

- Plot the distribution of wine color and clusters on bar chart
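
A sketch of the plot, reusing the cross-tabulation above:

import matplotlib.pyplot as plt
import pandas as pd

pd.crosstab(data["kmeans"], data["color"]).plot(kind="bar")
plt.xlabel("cluster")
plt.ylabel("number of wines")
plt.show()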

Observation


By choosing K=2, we can see that white wine is mainly distributed in cluster 1, while red wine data points are mostly distributed in cluster 2. The number of data points in cluster 2 is only about half that of cluster 1.


How can we select the best K value?


Hyperparameter Tuning


We have to select the best number of clusters with the help of the inertia, the silhouette score, and the distance metrics.


1. Selecting the best K with the Elbow Method (inertia)

  • Pick a range of candidate values of K (e.g. 1 to 20)
  • Calculate the inertia (the sum of squared distances of the points to their nearest centroid) for each value
  • Plot the values and find the elbow

Now fit K-Means models with cluster values ranging from 1 to 20. For each model, store the number of clusters and the inertia value.
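
A sketch of that loop:

from sklearn.cluster import KMeans

cluster_range = range(1, 21)
inertia = []
for k in cluster_range:
    km = KMeans(n_clusters=k, random_state=42).fit(data_normalized[float_columns])
    inertia.append(km.inertia_)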

Plot cluster number vs inertia. Does there appear to be an ideal cluster number?
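
A sketch of the plot:

import matplotlib.pyplot as plt

plt.plot(list(cluster_range), inertia, marker="o")
plt.xlabel("number of clusters (K)")
plt.ylabel("inertia")
plt.title("Elbow plot")
plt.show()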

Observation

When we increase the number of clusters from 1 to 2, the inertia drops very sharply. The decrease in inertia then becomes smaller and eventually levels off as we increase the number of clusters further.

Here, we can choose any number of clusters between 4 and 8. You must also look at the computation cost while deciding the number of clusters: if we increase the number of clusters, the computation cost will also increase. So, if you do not have high computational resources, my advice is to choose a smaller number of clusters.

2. Selecting the best K with the Silhouette Method

  • Pick a range of candidate values of K (e.g. 2 to 10)
  • Plot the silhouette score for each value of K; the ideal value of the silhouette is 1
  • The score runs from -1 to 1: the worst value is -1 and the best value is 1. A value near 0 means the clusters are overlapping, and negative values generally indicate that a sample has been assigned to the wrong cluster, because a different cluster is more similar
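
A sketch of the loop (the silhouette score needs at least two clusters):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = data_normalized[float_columns]
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    print(k, silhouette_score(X, labels))   # Euclidean metric by default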

Observation


- The number of clusters that provides the highest silhouette score is 2.
- In the KMeans run above, Euclidean distance is used by default to calculate the Silhouette Coefficient if we don't specify a distance metric.
- We will experiment with some other distance metrics, such as Manhattan, Cosine, and Jaccard.

How do distance metrics impact the clustering?


3. Distance metrics


The choice of distance metric is extremely important to clustering success.

Each metric has strengths and most appropriate use cases, but sometimes the choice of distance metric is also based on empirical evaluation.
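
One way to run this comparison (a sketch, and only an approximation of the experiment: scikit-learn's KMeans itself always optimizes Euclidean inertia, so only the silhouette evaluation metric is varied here; Jaccard is omitted because it applies to binary/set-valued data rather than continuous features):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = data_normalized[float_columns]
for metric in ["euclidean", "manhattan", "cosine"]:
    for k in range(2, 6):
        labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
        score = silhouette_score(X, labels, metric=metric)
        print(f"metric={metric:<9}  K={k}  silhouette={score:.3f}")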

Observation:
  • The silhouette score is highest with K = 3 and distance metric = cosine
  • The optimal number of clusters is selected differently among the metrics

Feature Engineering with Clustering in Classification


In this section, we are going to explore clustering as a form of feature engineering.
  • Create a binary target variable y, denoting if the quality is greater than 6 or not.
  • Create a variable called X_with_kmeans from data, by dropping the columns "quality", "color" from the dataset.
  • Create X_without_kmeans from that by dropping "kmeans".
  • For both datasets, using StratifiedShuffleSplit with 10 splits, fit 10 Random Forest Classifiers and average out the roc-auc scores.
  • Compare the average roc-auc scores for the models using the kmeans clusters as a feature and the one that doesn't use it.
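
A sketch of the first three bullets, assuming the 'kmeans' column added during the two-cluster fit above:

y = (data["quality"] > 6).astype(int)                    # binary target
X_with_kmeans = data.drop(["quality", "color"], axis=1)  # keeps the kmeans labels
X_without_kmeans = X_with_kmeans.drop("kmeans", axis=1)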

4) For both datasets, using StratifiedShuffleSplit with 10 splits, fit 10 Random Forest Classifiers and average out the roc-auc scores.

Compare the average roc-auc scores for the models using the KMeans clusters as a feature and the one that doesn't use it.
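
A sketch of that comparison (the 30% test size and 100 trees are my assumptions):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

def average_roc_auc(X, y, n_splits=10):
    sss = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.3, random_state=42)
    scores = []
    for train_idx, test_idx in sss.split(X, y):
        clf = RandomForestClassifier(n_estimators=100, random_state=42)
        clf.fit(X.iloc[train_idx], y.iloc[train_idx])
        probs = clf.predict_proba(X.iloc[test_idx])[:, 1]
        scores.append(roc_auc_score(y.iloc[test_idx], probs))
    return np.mean(scores)

print("with kmeans feature:   ", average_roc_auc(X_with_kmeans, y))
print("without kmeans feature:", average_roc_auc(X_without_kmeans, y))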

5) Let's now explore whether the number of clusters has an effect on this improvement.
  • Create the basis training set from data by restricting it to the float (chemical-property) columns.
  • Fit a kmeans algorithm with 20 clusters. One-hot encode it and add it to the basis training set. Don't add it to the previous iteration.
  • Fit 10 Logistic Regression models and compute the average roc-auc score.
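
A sketch of this step (reusing float_columns and y from above):

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

X_basis = data[float_columns]                                   # basis training set
clusters = KMeans(n_clusters=20, random_state=42).fit_predict(X_basis)
X_aug = pd.concat([X_basis.reset_index(drop=True),
                   pd.get_dummies(clusters, prefix="cluster")], axis=1)

sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
scores = []
for train_idx, test_idx in sss.split(X_aug, y):
    lr = LogisticRegression(max_iter=1000)
    lr.fit(X_aug.iloc[train_idx], y.iloc[train_idx])
    probs = lr.predict_proba(X_aug.iloc[test_idx])[:, 1]
    scores.append(roc_auc_score(y.iloc[test_idx], probs))
print("average roc-auc with 20 one-hot clusters:", sum(scores) / len(scores))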

Plot the average roc-auc scores.
