Dunn Index for K-Means Clustering Evaluation: Explained with Python Examples

In this tutorial we will explore the Dunn index and its application to K-Means clustering evaluation in Python.

Table of Contents

Introduction
Dunn Index Explained
Conclusion

Introduction

Dunn Index (DI) is one of the clustering algorithms evaluation measures. It is most commonly used to evaluate the goodness of split by a K-Means clustering algorithm for a given number of clusters.

We have previously discussed the Davies-Bouldin index and Calinski-Harabasz index, and Dunn index is yet another metric to evaluate the performance of clustering.

What is Dunn index?

Dunn index is calculated as a ratio of the smallest inter-cluster distance and the largest intra-cluster distance.

How to interpret Dunn index?

A high DI means better clustering since observations in each cluster are closer together (more dense), while clusters themselves are further away from each other (well separated).

Dunn Index Explained

In this part we will go through each step of the calculations and provide meaningful illustrative examples to help better understand the formulas.

Step 1: Calculate inter-cluster distance

The first step is to calculate the inter-cluster distance and there are several ways to calculate it.

The inter-cluster distance measures the distance between observations belonging to two difference clusters.

Consider two clusters: $C_a$ and $C_b$.

Each cluster has a centroid: $A$ and $B$ respectively, as well as some elements in each cluster $a_{i}$ and $b_{j}$, where:

$a_{i}$ : $i$-th observation of cluster $C_a$
$b_{j}$ : $j$-th observation of cluster $C_b$

Note: The inter-cluster distance is computer for a pair of clusters.

There are 5 ways of calculating inter-cluster distance:

Single linkage distance
Complete linkage distance
Average linkage distance
Centroid linkage distance
Average of centroids linkage distance

Single linkage distance

The single linkage distance is the smallest distance between two observations from two different clusters (closest observations):

$$\delta_1 (C_a, C_b) = \min \Bigg \{ d(a_{i}, b_{j}) \Bigg \}$$

where:

$a_{i} \in C_a$
$b_{j} \in C_b$

Complete linkage distance

The complete linkage distance is the largest distance between two observations from two different clusters (furthest observations):

where:

$a_{i} \in C_a$
$b_{j} \in C_b$

Average linkage distance

The average linkage distance is the average distance between all observations from two different clusters:

$$\delta_3 (C_a, C_b) = \frac{1}{N_{C_a} \times N_{C_b}} \sum_{i=1}^{N_{C_a}} \sum_{j=1}^{N_{C_b}} d(a_i, b_j)$$

where:

$a_{i} \in C_a$
$b_{j} \in C_b$
$N_{C_a}$ is the # of observations in $C_a$
$N_{C_b}$ is the # of observations in $C_b$

Centroid linkage distance

The centroid linkage distance is the distance between the centroids of two clusters:

$$\delta_4 (C_a, C_b) = d(A, B)$$

where:

$A$ is the centroid of $C_a$
$B$ is the centroid of $C_b$

Average of centroids linkage distance

Average of centroids linkage distance is the average distance between a centroid of a cluster and all observations from a different cluster:

$$\delta_5 (C_a, C_b) = \frac{1}{N_{C_a} + N_{C_b}} \Bigg \{ \sum_{i=1}^{N_{C_a}} d(a_i, B) + \sum_{j=1}^{N_{C_b}} d(b_j, A) \Bigg \}$$

where:

$a_{i} \in C_a$
$b_{j} \in C_b$
$N_{C_a}$ is the $\#$ of observations in $C_a$
$N_{C_b}$ is the $\#$ of observations in $C_b$
$A$ is the centroid of $C_a$
$B$ is the centroid of $C_b$

Depending which approach is chosen, following one of the above formulas will calculate the inter-cluster distance.

Step 2: Calculate intra-cluster distance

The second step is to calculate the intra-cluster distance and there are several ways to calculate it.

The intra-cluster distance measures the distance between observations belonging to the same cluster.

Consider one cluster: $C_a$ and its centroid $A$ as well as some elements in this cluster $a_{i}$ and $a_{j}$, where:

$a_{i}$ : $i$-th observation of cluster $C_a$
$a_{j}$ : $j$-th observation of cluster $C_a$

Note: The intra-cluster distance is computed for individual cluster (one distance measure per cluster).

There are 3 ways of calculating inter-cluster distance:

Complete diameter distance
Average diameter distance
Centroid diameter distance

Complete diameter distance

The complete diameter distance measures the distance between two furthest observations within the same cluster:

$$\Delta_1 (C) = \max \Bigg \{d(a_i, a_j) \Bigg \}$$

where:

$a_i \in C$
$a_j \in C$

Average diameter distance

The average diameter distance measures the average distance between all observations within the same cluster:

$$\Delta_2 (C) = \frac{1}{N_C \times (N_C-1)} \sum_{i \neq j}^{N_C} d(a_i, a_j) $$

where:

$a_i \in C$
$a_j \in C$
$N_{C}$ is the $\#$ of observations in $C$

Centroid diameter distance

The centroid diameter distance measures the double average distance between a centroid of a cluster and all observations from the same cluster:

$$\Delta_3 (C) = 2 \times \frac{1}{N_C} \sum_{i=1}^{N_C} d(a_i, A) $$

where:

$a_i \in C$
$N_{C}$ is the $\#$ of observations in $C$
$A$ is the centroid of $C$

Step 3: Calculate Dunn Index

Dunn index is defined as the ratio of the smallest inter-cluster distance and the largest intra-cluster distance.

For $m$ clusters, the Dunn index is calculated as:

$$DI_m = \frac{\min\limits_{1 \leq i \leq j \leq m} \delta (C_i, C_j)}{\max\limits_{1 \leq k \leq m} \Delta_k}$$

This means first that we should minimize the inter-cluster distance function, which is to find the distance between two closest clusters.

And then maximize the intra-cluster distance function, which is to find the cluster with the largest diameter.

Finally, we will take the ratio of the above two values which is referred to as the Dunn index (with range between 0 and 1).

From the above explanation, we can conclude that the large values of Dunn index represent better clustering.

Note: $\delta (C_i, C_j)$ can be calculated using any of the inter-cluster distance functions. And $\Delta_k$ can be calculated using any of the intra-cluster distance functions.

Conclusion

In this article we discussed how to calculate the Dunn index for clustering evaluation.

Feel free to leave comments below if you have any questions or have suggestions for some edits and check out more of my Python Programming articles.

Introduction

What is Dunn index?

How to interpret Dunn index?

Dunn Index Explained

Step 1: Calculate inter-cluster distance

Single linkage distance

Complete linkage distance

Average linkage distance

Centroid linkage distance

Average of centroids linkage distance

Step 2: Calculate intra-cluster distance

Complete diameter distance

Average diameter distance

Centroid diameter distance

Step 3: Calculate Dunn Index

Conclusion

Invite me to unlock the fullpotential of your analytics.

Invite me to unlock the full
potential of your analytics.