CLUSTER ANALYSIS

Sabyasachi D Choudhury
Apr 9, 2021

Cluster analysis groups individuals or objects into clusters such that objects in the same cluster are more similar to one another than they are to objects in other clusters. The aim is to maximize the homogeneity of objects within each cluster while also maximizing the heterogeneity between the clusters. Like factor analysis, cluster analysis is an interdependence technique.

A Simple Example

Suppose you have done pilot marketing of a candy on a randomly selected sample of consumers. Each consumer was given a candy and was asked whether they liked it and whether they would buy it. The respondents were then grouped into the following four clusters:

LIKED, WILL BUY
LIKED, WILL NOT BUY
NOT LIKED, WILL BUY
NOT LIKED, WILL NOT BUY

Now the "NOT LIKED, WILL BUY" group is a bit unusual, but people can buy for others. From a strategy point of view, the "LIKED, WILL NOT BUY" group is important, because these respondents are potential customers: a possible change in the pricing policy may change their purchasing decision.

What Exactly Are We Looking for?

From the example, it is very clear that we must have some objective on the basis of which we want to create clusters. The following questions need to be answered:

- What kind of similarity are we looking for? Is it pattern or proximity?

- How do we form the groups?

- How many groups should we form?

- What's the interpretation of each cluster?

- What's the strategy related to each of these clusters?

Problems with Cluster Analysis

1. Cluster analysis does not have a theoretical statistical basis, so no inference can be made from the sample to the population. It's only an exploratory technique, and nothing guarantees a unique solution.

2. Cluster analysis will always create clusters, regardless of whether any actual structure exists in the data. Just because clusters can be found doesn't validate their existence.

3. The cluster solution cannot be generalized, because it is totally dependent on the variables used as the basis for the similarity measure. This criticism can be made against any statistical technique, but since the cluster variate is completely specified by the researcher, the addition of spurious variables or the deletion of relevant variables can have a substantial impact on the resulting solution. As a result, the researcher must be especially cognizant of the variables used in the analysis, ensuring that they have strong conceptual support.

In data mining, hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:

Agglomerative: This is a “bottom up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

Divisive: This is a “top down” approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

Metric and Linkage

In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate metric (a measure of distance between pairs of observations), and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.

Generally, the distance metric is the Euclidean distance. As for linkages, there are single linkage (the shortest distance between the two clusters), complete linkage (the longest distance), and average linkage (the average of all pairwise distances between the two clusters).
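As a quick sketch of how these choices look in code, here is agglomerative clustering with SciPy's scipy.cluster.hierarchy; the toy data and the two-cluster cut are illustrative assumptions, not anything from this article:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data with two loose groups (illustrative values only)
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(20, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(20, 2)),
])

# Agglomerative clustering with Euclidean distance and different linkage criteria:
#   single   -> shortest distance between two clusters
#   complete -> longest distance between two clusters
#   average  -> average of all pairwise distances between two clusters
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method, metric="euclidean")
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 flat clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes
```

Changing `method` swaps the linkage criterion while the metric stays Euclidean, which is exactly the metric/linkage split described above.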

Ward’s Minimum Variance Cluster Analysis

1. Ward’s minimum variance criterion minimizes the total within-cluster variance.

2. At each step, the pair of clusters with the minimum between-cluster distance is merged.

3. To implement this method, at each step find the pair of clusters that leads to the minimum increase in total within-cluster variance after merging.
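Concretely, the increase in total within-cluster variance from merging clusters A and B has a closed form, the standard Ward merging cost (with |A|, |B| the cluster sizes and \mu_A, \mu_B the cluster centroids):

$$\Delta(A, B) = \frac{|A|\,|B|}{|A| + |B|}\,\lVert \mu_A - \mu_B \rVert^2$$

At each step the merge with the smallest \Delta is performed, which is why Ward's method tends to produce compact clusters of similar size.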

Related Statistics

Semi-Partial R-Squared: The semi-partial R-squared (SPR) measures the loss of homogeneity due to merging two clusters to form a new cluster at a given step. If the value is small, it suggests that the cluster solution obtained at a given step is formed by merging two very homogeneous clusters.

R-Square: R-square (RS) measures the heterogeneity of the cluster solution formed at a given step. A large value indicates that the clusters obtained at a given step are quite different (i.e., heterogeneous) from each other, whereas a small value signifies that the clusters formed at that step are not very different from each other.
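In sum-of-squares terms (using the definitions these statistics usually carry, e.g. in SAS's PROC CLUSTER, which is an assumption about the variant meant here), with SST the total sum of squares and SSW_k the within-cluster sum of squares of cluster k:

$$R^2 = 1 - \frac{\sum_k SSW_k}{SST}, \qquad SPR = \frac{SSW_{i \cup j} - SSW_i - SSW_j}{SST}$$

where SPR is evaluated at the step that merges clusters i and j.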

Dendrogram: It's a chart showing which two clusters are merged at which distance.

Icicle: It's a chart showing which case is merged into which cluster at which level.
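A minimal sketch of producing a dendrogram with SciPy and Matplotlib; the data are illustrative, and Ward linkage is assumed here just to tie in with the previous section:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy data of the same kind as before (illustrative values only)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(10, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(10, 2)),
])

Z = linkage(X, method="ward")  # Ward's minimum variance linkage

# The vertical axis is the distance at which each pair of clusters merges --
# exactly the information the dendrogram chart described above conveys.
dendrogram(Z)
plt.xlabel("observation")
plt.ylabel("merge distance")
plt.show()
```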

K-Means Clustering

In data mining, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. A key limitation of k-means is its cluster model.

The model assumes spherical clusters that are separable, so that the mean value converges towards the cluster center. The clusters are also expected to be of similar size, so that assignment to the nearest cluster center is the correct assignment. Researchers generally use hierarchical methods to find the optimal number of clusters and then use the k-means method to determine the actual clusters.
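A minimal k-means sketch using scikit-learn's KMeans; the toy data and the choice of k = 3 are illustrative assumptions, not from the article:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three toy blobs (illustrative values only)
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(30, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(30, 2)),
    rng.normal(loc=[0, 5], scale=0.5, size=(30, 2)),
])

# k = 3 is assumed here; in practice one might first pick the number of
# clusters from a hierarchical solution, as the paragraph above suggests.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.labels_[:10])       # cluster assignment of the first 10 observations
print(km.cluster_centers_)   # the k means each observation is assigned to
```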
