Processing math: 100%

Dunn Index for K-Means Clustering Evaluation: Explained with Python Examples

In this tutorial we will explore the Dunn index and its application to K-Means clustering evaluation in Python.

Table of Contents


Introduction

Dunn Index (DI) is one of the clustering algorithms evaluation measures. It is most commonly used to evaluate the goodness of split by a K-Means clustering algorithm for a given number of clusters.

We have previously discussed the Davies-Bouldin index and Calinski-Harabasz index, and Dunn index is yet another metric to evaluate the performance of clustering.

What is Dunn index?

Dunn index is calculated as a ratio of the smallest inter-cluster distance and the largest intra-cluster distance.

How to interpret Dunn index?

A high DI means better clustering since observations in each cluster are closer together (more dense), while clusters themselves are further away from each other (well separated).


Dunn Index Explained

In this part we will go through each step of the calculations and provide meaningful illustrative examples to help better understand the formulas.

Step 1: Calculate inter-cluster distance

The first step is to calculate the inter-cluster distance and there are several ways to calculate it.

The inter-cluster distance measures the distance between observations belonging to two difference clusters.

Consider two clusters: C_a and C_b.

Each cluster has a centroid: A and B respectively, as well as some elements in each cluster a_{i} and b_{j}, where:

  • a_{i} : i-th observation of cluster C_a
  • b_{j} : j-th observation of cluster C_b

Note: The inter-cluster distance is computer for a pair of clusters.


There are 5 ways of calculating inter-cluster distance:

  1. Single linkage distance
  2. Complete linkage distance
  3. Average linkage distance
  4. Centroid linkage distance
  5. Average of centroids linkage distance

Single linkage distance

The single linkage distance is the smallest distance between two observations from two different clusters (closest observations):

\delta_1 (C_a, C_b) = \min \Bigg \{ d(a_{i}, b_{j}) \Bigg \}

where:

  • a_{i} \in C_a
  • b_{j} \in C_b

Complete linkage distance

The complete linkage distance is the largest distance between two observations from two different clusters (furthest observations):

where:

  • a_{i} \in C_a
  • b_{j} \in C_b

Average linkage distance

The average linkage distance is the average distance between all observations from two different clusters:

\delta_3 (C_a, C_b) = \frac{1}{N_{C_a} \times N_{C_b}} \sum_{i=1}^{N_{C_a}} \sum_{j=1}^{N_{C_b}} d(a_i, b_j)

where:

  • a_{i} \in C_a
  • b_{j} \in C_b
  • N_{C_a} is the # of observations in C_a
  • N_{C_b} is the # of observations in C_b

Centroid linkage distance

The centroid linkage distance is the distance between the centroids of two clusters:

\delta_4 (C_a, C_b) = d(A, B)

where:

  • A is the centroid of C_a
  • B is the centroid of C_b

Average of centroids linkage distance

Average of centroids linkage distance is the average distance between a centroid of a cluster and all observations from a different cluster:

\delta_5 (C_a, C_b) = \frac{1}{N_{C_a} + N_{C_b}} \Bigg \{ \sum_{i=1}^{N_{C_a}} d(a_i, B) + \sum_{j=1}^{N_{C_b}} d(b_j, A) \Bigg \}

where:

  • a_{i} \in C_a
  • b_{j} \in C_b
  • N_{C_a} is the \# of observations in C_a
  • N_{C_b} is the \# of observations in C_b
  • A is the centroid of C_a
  • B is the centroid of C_b

Depending which approach is chosen, following one of the above formulas will calculate the inter-cluster distance.


Step 2: Calculate intra-cluster distance

The second step is to calculate the intra-cluster distance and there are several ways to calculate it.

The intra-cluster distance measures the distance between observations belonging to the same cluster.

Consider one cluster: C_a and its centroid A as well as some elements in this cluster a_{i} and a_{j}, where:

  • a_{i} : i-th observation of cluster C_a
  • a_{j} : j-th observation of cluster C_a

Note: The intra-cluster distance is computed for individual cluster (one distance measure per cluster).


There are 3 ways of calculating inter-cluster distance:

  1. Complete diameter distance
  2. Average diameter distance
  3. Centroid diameter distance

Complete diameter distance

The complete diameter distance measures the distance between two furthest observations within the same cluster:

\Delta_1 (C) = \max \Bigg \{d(a_i, a_j) \Bigg \}

where:

  • a_i \in C
  • a_j \in C

Average diameter distance

The average diameter distance measures the average distance between all observations within the same cluster:

\Delta_2 (C) = \frac{1}{N_C \times (N_C-1)} \sum_{i \neq j}^{N_C} d(a_i, a_j)

where:

  • a_i \in C
  • a_j \in C
  • N_{C} is the \# of observations in C

Centroid diameter distance

The centroid diameter distance measures the double average distance between a centroid of a cluster and all observations from the same cluster:

\Delta_3 (C) = 2 \times \frac{1}{N_C} \sum_{i=1}^{N_C} d(a_i, A)

where:

  • a_i \in C
  • N_{C} is the \# of observations in C
  • A is the centroid of C

Step 3: Calculate Dunn Index

Dunn index is defined as the ratio of the smallest inter-cluster distance and the largest intra-cluster distance.

For m clusters, the Dunn index is calculated as:

DI_m = \frac{\min\limits_{1 \leq i \leq j \leq m} \delta (C_i, C_j)}{\max\limits_{1 \leq k \leq m} \Delta_k}

This means first that we should minimize the inter-cluster distance function, which is to find the distance between two closest clusters.

And then maximize the intra-cluster distance function, which is to find the cluster with the largest diameter.

Finally, we will take the ratio of the above two values which is referred to as the Dunn index (with range between 0 and 1).

From the above explanation, we can conclude that the large values of Dunn index represent better clustering.

Note: \delta (C_i, C_j) can be calculated using any of the inter-cluster distance functions. And \Delta_k can be calculated using any of the intra-cluster distance functions.


Conclusion

In this article we discussed how to calculate the Dunn index for clustering evaluation.

Feel free to leave comments below if you have any questions or have suggestions for some edits and check out more of my Python Programming articles.