Fast k-medoids clustering in Python

This package is a wrapper around the fast Rust k-medoids package, implementing the FasterPAM and FastPAM algorithms along with the baseline k-means-style and PAM algorithms.

All algorithms expect a distance matrix and the number of clusters as input. They hence can be used with arbitrary dissimilarity functions.

If you use this code in scientific work, please cite the papers in the References. Thank you.

Example

import kmedoids
c = kmedoids.fasterpam(distmatrix, 5)
print("Loss is:", c.loss)

Implemented Algorithms

  • FasterPAM (Schubert and Rousseeuw, 2020, 2021)

  • FastPAM1 (Schubert and Rousseeuw, 2019, 2021)

  • PAM (Kaufman and Rousseeuw, 1987) with BUILD and SWAP

  • Alternating (k-means-style approach)

  • BUILD (Kaufman and Rousseeuw, 1987)

Note that the k-means style “alternating” algorithm yields rather poor result quality.

FasterPAM

kmedoids.fasterpam(diss, medoids, max_iter=100, init='random', random_state=None)

FasterPAM k-medoids clustering

This is an accelerated version of PAM clustering, that eagerly performs any swap found, and contains the O(k) improvement to find the best swaps faster.

References:

Erich Schubert, Peter J. Rousseeuw
Fast and Eager k-Medoids Clustering:
O(k) Runtime Improvement of the PAM, CLARA, and CLARANS Algorithms
Information Systems (101), 2021, 101804
Erich Schubert, Peter J. Rousseeuw:
Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms
In: 12th International Conference on Similarity Search and Applications (SISAP 2019), 171-187.
Parameters
  • diss (ndarray) – square numpy array of dissimilarities

  • medoids (int or ndarray) – number of clusters to find or existing medoids

  • max_iter (int) – maximum number of iterations

  • init (str, "random", "first" or "build") – initialization method

  • random_state (int, RandomState instance or None) – random seed if no medoids are given

Returns

k-medoids clustering result

Return type

KMedoidsResult

FastPAM1

kmedoids.fastpam1(diss, medoids, max_iter=100, init='random', random_state=None)

FastPAM1 k-medoids clustering

This is an accelerated version of PAM clustering, that performs the same swaps as the original PAM (given the same starting conditions), but finds the best swap O(k) times faster.

References:

Erich Schubert, Peter J. Rousseeuw
Fast and Eager k-Medoids Clustering:
O(k) Runtime Improvement of the PAM, CLARA, and CLARANS Algorithms
Information Systems (101), 2021, 101804
Erich Schubert, Peter J. Rousseeuw:
Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms
In: 12th International Conference on Similarity Search and Applications (SISAP 2019), 171-187.
Parameters
  • diss (ndarray) – square numpy array of dissimilarities

  • medoids (int or ndarray) – number of clusters to find or existing medoids

  • max_iter (int) – maximum number of iterations

  • init (str, "random", "first" or "build") – initialization method

  • random_state (int, RandomState instance or None) – random seed if no medoids are given

Returns

k-medoids clustering result

Return type

KMedoidsResult

PAM

kmedoids.pam(diss, medoids, max_iter=100, init='build', random_state=None)

PAM k-medoids clustering

This is an implementation of the original PAM (Partitioning Around Medoids) clustering algorithm. For improved versions, see the fastpam and fasterpam methods.

References:

Leonard Kaufman, Peter J. Rousseeuw:
Clustering by means of medoids.
In: Dodge Y (ed) Statistical Data Analysis Based on the L 1 Norm and Related Methods, pp 405–416, 1987
Leonard Kaufman, Peter J. Rousseeuw:
Finding Groups in Data: An Introduction to Cluster Analysis.
Parameters
  • diss (ndarray) – square numpy array of dissimilarities

  • medoids (int or ndarray) – number of clusters to find or existing medoids

  • max_iter (int) – maximum number of iterations

  • init (str, "random", "first" or "build") – initialization method

  • random_state (int, RandomState instance or None) – random seed if no medoids are given

Returns

k-medoids clustering result

Return type

KMedoidsResult

Alternating k=medoids (k-means style)

kmedoids.alternating(diss, medoids, max_iter=100, init='random', random_state=None)

Alternating k-medoids clustering (k-means-style algorithm)

Note: this yields substantially worse results than PAM algorithms on difficult data sets.

Parameters
  • diss (ndarray) – square numpy array of dissimilarities

  • medoids (int or ndarray) – number of clusters to find or existing medoids

  • max_iter (int) – maximum number of iterations

  • init (str, "random", "first" or "build") – initialization method

  • random_state (int, RandomState instance or None) – random seed if no medoids are given

Returns

k-medoids clustering result

Return type

KMedoidsResult

PAM BUILD

kmedoids.pam_build(diss, k)

PAM k-medoids clustering – BUILD only

This is an implementation of the original PAM (Partitioning Around Medoids) clustering algorithm. For improved versions, see the fastpam and fasterpam methods.

References:

Leonard Kaufman, Peter J. Rousseeuw:
Clustering by means of medoids.
In: Dodge Y (ed) Statistical Data Analysis Based on the L 1 Norm and Related Methods, 405-416, 1987
Leonard Kaufman, Peter J. Rousseeuw:
Finding Groups in Data: An Introduction to Cluster Analysis.
Parameters
  • diss (ndarray) – square numpy array of dissimilarities

  • k (int) – number of clusters to find

Returns

k-medoids clustering result

Return type

KMedoidsResult

k-Medoids result object

KMedoids clustering result

class kmedoids.KMedoidsResult
loss: float

Loss of this clustering (sum of deviations)

labels: ndarray

Cluster assignment

medoids: ndarray

Chosen medoid indexes

n_iter: int

Number of iterations

n_swap: int

Number of swaps performed

References

For further details on the implemented algorithm FasterPAM, see:

Erich Schubert, Peter J. Rousseeuw
Fast and Eager k-Medoids Clustering:
O(k) Runtime Improvement of the PAM, CLARA, and CLARANS Algorithms
Information Systems (101), 2021, 101804

an earlier (slower, and now obsolete) version was published as:

Erich Schubert, Peter J. Rousseeuw:
Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms
In: 12th International Conference on Similarity Search and Applications (SISAP 2019), 171-187.

This is a port of the original Java code from ELKI to Rust. The Rust version is then wrapped for use with Python.

If you use this code in scientific work, please cite above papers. Thank you.

License: GPL-3 or later

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <https://www.gnu.org/licenses/>.

FAQ: Why GPL and not Apache/MIT/BSD?

Because copyleft software like Linux is what built the open-source community.

Tit for tat: you get to use my code, I get to use your code.