Clustering is an example of unsupervised machine learning, where we set up the model to learn the structure of the data. It can also be used as feature engineering for supervised machine learning problems such as regression or classification.
2. K-Means
How does it actually work?
- Step 1: Choose the number of clusters K and initialize the centroids (e.g. by picking K random points)
- Step 2: Assign each point to the nearest centroid
- Step 3: Recalculate the mean (centroid) for each cluster
- Step 4: Re-assign the points to the updated clusters
- Step 5: Iterate until points are in their final clusters
If the centroids have converged, stop the process; if not, repeat steps 2 to 5. A minimal sketch of this loop follows.
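These steps can be written directly in a few lines of NumPy. This is only an illustrative sketch (the function name, the random initialization scheme, and the lack of empty-cluster handling are assumptions), not the scikit-learn implementation used later in the case study:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means sketch following the steps above (empty clusters not handled)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k random points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate the mean of each cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 4-5: stop if the centroids have converged, otherwise repeat
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```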
3. Model Evaluation
An easy way to evaluate clustering models is the Silhouette Score, because it does not require labeled data.
It measures how similar an object is to the objects in its own cluster and how different it is from objects in other clusters. The overall Silhouette Score is the average of the Silhouette Coefficient over all samples.
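For each sample, the coefficient is s = (b − a) / max(a, b), where a is the mean intra-cluster distance and b is the mean distance to the nearest other cluster. A quick illustration with toy data (the blob dataset below is an assumption for demonstration only):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Toy data with three well-separated blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

coeffs = silhouette_samples(X, labels)  # per-sample silhouette coefficients
print(silhouette_score(X, labels))      # overall score = mean of the coefficients
print(coeffs.mean())                    # same value
```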
4. Distance Metrics
4.1. Euclidean Distance
- The Euclidean distance is the square root of the sum of squared differences between paired values of two rows (or columns).
- Data normalization and missing-value handling must be taken into account before using this metric.
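A tiny sketch with made-up vectors:

```python
import numpy as np
from scipy.spatial.distance import euclidean

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 8.0])

d_manual = np.sqrt(np.sum((x - y) ** 2))  # sqrt of the sum of squared differences
print(d_manual, euclidean(x, y))          # both ~7.07
```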
4.2. Cosine Distance
- Cosine similarity is the dot product divided by the product of the lengths (where the length of a row/column is the square root of its dot product with itself). Cosine distance is computed by subtracting the similarity from 1.
- In the calculation of dot products, missing values are replaced by means. In the calculation of lengths, the contribution of a missing value equals the square of the mean plus the variance.
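A small sketch with made-up, fully observed vectors (so without the missing-value handling described above):

```python
import numpy as np
from scipy.spatial.distance import cosine

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

similarity = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print(1 - similarity)  # cosine distance; 0.0 here because y is a multiple of x
print(cosine(x, y))    # SciPy computes the same value
```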
Euclidean vs. Cosine Distance
- Euclidean distance is useful for coordinate-based measurements and is more sensitive to the curse of dimensionality.
- Cosine distance is better suited to data such as text, where the location of occurrence is less important.
4.3. Manhattan Distance
- It is the sum of absolute pairwise differences between values.
- Normalization and the treatment of missing values are similar to the Euclidean distance, except that medians and the median absolute deviation from the median (MAD) are used instead of means and deviations.
- For discrete values, distances are again 0 or 1, hence the Manhattan distance for discrete columns is the same as the Euclidean.
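A small sketch with made-up vectors:

```python
import numpy as np
from scipy.spatial.distance import cityblock

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(np.sum(np.abs(x - y)))  # sum of absolute pairwise differences: 3 + 2 + 0 = 5
print(cityblock(x, y))        # SciPy's name for the Manhattan (L1) distance
```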
4.4. Jaccard Distance
- Jaccard similarity between two sets X and Y is defined as the size of their intersection (X ∩ Y) divided by the size of the union (X ∪ Y). Jaccard distance is computed by subtracting the similarity from 1.
- Applies to sets (like word occurrence):
- Sentence A: “I like chocolate ice cream.”
- Set A = {I, like, chocolate, ice, cream}
- Sentence B: “Do I want chocolate cream or vanilla cream?”
- Set B = {Do, I, want, chocolate, cream, or, vanilla}
- Jaccard distance(sentence A, sentence B) = 1 − |A ∩ B| / |A ∪ B| = 1 − 3/9 ≈ 0.67
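The same example in code (word sets built by hand rather than by a tokenizer):

```python
set_a = {"I", "like", "chocolate", "ice", "cream"}
set_b = {"Do", "I", "want", "chocolate", "cream", "or", "vanilla"}

intersection = set_a & set_b               # {"I", "chocolate", "cream"}, size 3
union = set_a | set_b                      # 9 distinct words
print(1 - len(intersection) / len(union))  # 1 - 3/9 ≈ 0.667
```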
5. Case Study
We will be using the wine quality data set for our case study. This data set contains various chemical properties of wine, such as acidity, sugar, pH, and alcohol. It also contains a quality metric (3-9, with higher being better) and a color (red or white). The name of the file is Wine_Quality_Data.
We will be using the chemical properties (i.e. everything but quality and color) to cluster the wine. Though this is unsupervised learning, it can be fun to see how our clustering results map onto color and quality.
5.1. Know Your Data
The dataset has 12 columns, all continuous variables. Some columns have a high standard deviation or variance, which suggests that their distributions are skewed.
The minimum and maximum values differ greatly across columns, so it is necessary to perform a transformation before clustering.
- Plot the quality column against the color column as a histogram (see the sketch below)
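A sketch of this exploration, assuming the file is available as a CSV named Wine_Quality_Data.csv with quality and color columns:

```python
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("Wine_Quality_Data.csv")

# Summary statistics: note the very different standard deviations and min/max ranges
print(data.describe().T[["mean", "std", "min", "max"]])

# Histogram of quality, split by wine color
data.groupby("color")["quality"].plot(kind="hist", alpha=0.5, legend=True)
plt.xlabel("quality")
plt.show()
```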
5.2. Data Preprocessing
Scaling vs. Normalization: What’s the difference?
One of the reasons that it’s easy to get confused between scaling and normalization is because the terms are sometimes used interchangeably and, to make it even more confusing, they are very similar! In both cases, you’re transforming the values of numeric variables so that the transformed data points have specific helpful properties.
The difference is that, in scaling, you’re changing the range of your data while in normalization you’re changing the shape of the distribution of your data. Let’s talk a little more in-depth about each of these options.
By scaling your variables, you can help compare different variables on equal footing.
Data Transformation
Feature Scaling with sklearn.preprocessing.RobustScaler
This means that you’re transforming your data so that it fits within a specific scale, like 0–100 or 0–1. You want to scale data when you’re using methods based on measures of how far apart data points are, such as support vector machines (SVM) or k-nearest neighbors (KNN). With these algorithms, a change of “1” in any numeric feature is given the same importance.
As a result, the range of each feature's distribution is changed after scaling.
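A sketch of this step, reusing the data frame loaded above (the float_columns name is an assumption):

```python
from sklearn.preprocessing import RobustScaler

# Chemical-property columns: everything except quality and color
float_columns = [c for c in data.columns if c not in ("quality", "color")]

scaler = RobustScaler()  # centers on the median and scales by the interquartile range
data_scaled = data.copy()
data_scaled[float_columns] = scaler.fit_transform(data[float_columns])

print(data_scaled[float_columns].std())  # magnitudes are now comparable
```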
Observation:
After applying the transformation with RobustScaler, the magnitudes (standard deviations) of all features look similar.
This prepares the data for the next step (K-Means clustering): because K-Means is a distance-based algorithm, large differences in feature magnitude would otherwise create a problem.
Data Normalization
Feature Transformation with scipy.stats.zscore
Scaling just changes the range of your data. Normalization is a more radical transformation. The point of normalization is to change your observations so that they can be described as a normal distribution.
In general, you’ll only want to normalize your data if you’re going to be using a machine learning or statistics technique that assumes your data is normally distributed. Some examples of these include t-tests, ANOVA, and linear regression.
The z-score should be used in this case instead of Box-Cox, because the dataset contains negative data points (the Box-Cox transformation requires strictly positive values).
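A sketch of this step, applied here to the RobustScaler output from the previous sketch (the exact order of the two transformations is an assumption about the intended pipeline):

```python
from scipy import stats

# z-score each chemical property: (x - mean) / std, column by column
data_normalized = data_scaled.copy()
data_normalized[float_columns] = data_scaled[float_columns].apply(stats.zscore)
```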
5.3. KMeans Clustering
- Fit a K-means clustering model with two clusters.
- Examine the clusters by wine color.
- Plot the distribution of wine color and clusters on a bar chart, as sketched below.
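A sketch of these steps, reusing the preprocessed frame from the previous sketches:

```python
import pandas as pd
from sklearn.cluster import KMeans

km = KMeans(n_clusters=2, n_init=10, random_state=42)
data_normalized["kmeans"] = km.fit_predict(data_normalized[float_columns])

# Cross-tabulate cluster membership against wine color and plot it
counts = pd.crosstab(data_normalized["kmeans"], data_normalized["color"])
print(counts)
counts.plot(kind="bar", stacked=True)
```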
Observation
By choosing k=2, we can see that white wine falls mainly in cluster 1, while the red wine data points fall mostly in cluster 2. Cluster 2 contains only about half as many data points as cluster 1.
How can we select the best K value?
Hyperparameter Tuning
We have to select the best number of clusters with the help of inertia, the silhouette score, and the distance metric.
1. Selecting K with the Elbow Method (Inertia)
- Pick a range of candidate values of K (e.g. 1 to 20)
- Calculate the inertia (within-cluster sum of squared distances to the nearest centroid) for each value of K
- Plot inertia against K and find the elbow
Now fit K-Means models with cluster values ranging from 1 to 20. For each model, store the number of clusters and the inertia value.
Plot cluster number vs inertia. Does there appear to be an ideal cluster number?
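A sketch of this loop, reusing the preprocessed frame from the earlier sketches:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
ks = range(1, 21)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(data_normalized[float_columns])
    inertias.append(km.inertia_)  # sum of squared distances to the nearest centroid

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters (K)")
plt.ylabel("inertia")
plt.show()
```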
Observation
When we increase the number of clusters from 1 to 2, the inertia drops very sharply. This decrease becomes smaller and eventually levels off as we increase the number of clusters further.
Here, we could choose any number of clusters between 4 and 8. You must also consider the computational cost when deciding the number of clusters: increasing the number of clusters increases the cost. So, if you do not have large computational resources, the advice is to choose a smaller number of clusters.
2. Selecting the Best K with the Silhouette Method
- Pick a range of candidate values of K (e.g. 2 to 10)
- Plot the silhouette score for each value of K (the ideal value of the silhouette is 1)
- The score ranges from -1 to 1: the worst value is -1 and the best value is 1. Values near 0 mean the clusters are overlapping. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
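A sketch of the silhouette sweep (the silhouette is undefined for a single cluster, so K starts at 2 here):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

sil_scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(
        data_normalized[float_columns])
    sil_scores[k] = silhouette_score(data_normalized[float_columns], labels)

best_k = max(sil_scores, key=sil_scores.get)
print(best_k, sil_scores[best_k])
```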
Observation
- The cluster count that provides the highest silhouette score is 2.
- In the K-Means runs above, the Euclidean distance is used by default to calculate the silhouette coefficient if we don't specify a distance metric.
- We will experiment with some other distance metrics, such as Manhattan, cosine, and Jaccard.
How do distance metrics impact the clustering?
3. Distance metrics
Each metric has strengths and most-appropriate use cases, but sometimes the choice of distance metric is also based on empirical evaluation.
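One caveat before the sketch: scikit-learn's KMeans always clusters with Euclidean distance internally, so the metric argument below only changes how the resulting partition is scored by silhouette_score; clustering with a different distance would require a different estimator (for example, KMedoids from the scikit-learn-extra package). Jaccard is omitted here because it is defined for binary/set data rather than continuous chemical properties:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = data_normalized[float_columns]

for metric in ("euclidean", "manhattan", "cosine"):
    scores = {}
    for k in range(2, 11):
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        scores[k] = silhouette_score(X, labels, metric=metric)
    best_k = max(scores, key=scores.get)
    print(f"{metric}: best K = {best_k}, silhouette = {scores[best_k]:.3f}")
```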
Observation:
- The silhouette score is highest with K = 3 and the cosine distance metric.
- The optimal cluster number differs across metrics.
Feature Engineering with Clustering in Classification
In this section, we are going to explore clustering as a form of feature engineering; a sketch of the workflow follows the list of steps below.
- Create a binary target variable y, denoting if the quality is greater than 6 or not.
- Create a variable called X_with_kmeans from data, by dropping the columns "quality", "color" from the dataset.
- Create X_without_kmeans from that by dropping "kmeans".
- For both datasets, using StratifiedShuffleSplit with 10 splits, fit 10 Random Forest Classifiers and average out the roc-auc scores.
- Compare the average roc-auc scores for the models using the kmeans clusters as a feature and the one that doesn't use it.
- Create the basis training set from data by restricting to float_column.
- Fit a k-means algorithm with 20 clusters. One-hot encode the cluster labels and add them to the basis training set. Don't add it to the previous iteration.
- Fit 10 Logistic Regression models and compute the average roc-auc score.
Plot the average roc-auc scores.
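A sketch of the first comparison in this list, reusing the frames built in the earlier sketches (column names and split settings such as test_size=0.3 are assumptions; the 20-cluster one-hot / logistic-regression part follows the same pattern):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

# Binary target: 1 if quality > 6, else 0
y = (data["quality"] > 6).astype(int)

X_with_kmeans = data_normalized.drop(columns=["quality", "color"])  # keeps "kmeans"
X_without_kmeans = X_with_kmeans.drop(columns=["kmeans"])

def average_roc_auc(X, y, n_splits=10):
    """Average ROC-AUC of Random Forests over stratified shuffle splits."""
    sss = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.3, random_state=42)
    scores = []
    for train_idx, test_idx in sss.split(X, y):
        clf = RandomForestClassifier(n_estimators=100, random_state=42)
        clf.fit(X.iloc[train_idx], y.iloc[train_idx])
        proba = clf.predict_proba(X.iloc[test_idx])[:, 1]
        scores.append(roc_auc_score(y.iloc[test_idx], proba))
    return np.mean(scores)

print("with kmeans:   ", average_roc_auc(X_with_kmeans, y))
print("without kmeans:", average_roc_auc(X_without_kmeans, y))
```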