Skip to content

Add KMedoids clusterer#398

Open
chouaibcher wants to merge 2 commits intoRubixML:2.6from
chouaibcher:master
Open

Add KMedoids clusterer#398
chouaibcher wants to merge 2 commits intoRubixML:2.6from
chouaibcher:master

Conversation

@chouaibcher
Copy link

Introduce a new KMedoids clusterer implementation (src/Clusterers/KMedoids.php) + documentation (docs/clusterers/k-medoids.md),
unit tests (tests/Clusterers/KMedoidsTest.php) and
a benchmark (benchmarks/Clusterers/KMedoidsBench.php).

Introduce a new KMedoids clusterer implementation (src/Clusterers/KMedoids.php) implementing Estimator, Learner, Online, Probabilistic, Verbose and Persistable interfaces. Adds full training/partial training, predict/proba, inertia/loss tracking, medoids/sizes accessors, serialization and parameter validation. Also add documentation (docs/clusterers/k-medoids.md), unit tests (tests/Clusterers/KMedoidsTest.php) and a benchmark (benchmarks/Clusterers/KMedoidsBench.php). Tests and code use a seeder, distance kernel, and basic logging; invalid inputs and untrained prediction are guarded by exceptions.
.
@andrewdalpino
Copy link
Member

Hey @chouaibcher this is really cool! I'll take a look ASAP although just to warn you I don't have much time for Rubix ML these days so please be patient with me. Thanks I'm looking forward to the review!

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Introduces a new KMedoids clusterer to the clustering module, along with supporting documentation, tests, and benchmarking artifacts.

Changes:

  • Added Rubix\ML\Clusterers\KMedoids implementation.
  • Added user docs + mkdocs nav entry + composer keywords + changelog entry.
  • Added PHPUnit test coverage and a phpbench benchmark for KMedoids.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/Clusterers/KMedoids.php New K-Medoids clusterer implementation (online + probabilistic API).
tests/Clusterers/KMedoidsTest.php Unit tests mirroring the existing KMeans clusterer tests.
benchmarks/Clusterers/KMedoidsBench.php Benchmark harness for training + predicting with KMedoids.
docs/clusterers/k-medoids.md New documentation page for the KMedoids clusterer.
mkdocs.yml Adds K Medoids to the Clusterers navigation.
composer.json Adds K-Medoids keywords for package discoverability.
CHANGELOG.md Notes the new clusterer under version 2.5.3.
test-kmedoids-minimal.php Adds a standalone minimal script for loading/reflecting KMedoids.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +377 to +381
// Try each point in the cluster as a potential medoid
foreach ($indices as $candidateIndex) {
$totalDistance = 0.0;

foreach ($indices as $pointIndex) {
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The medoid update step iterates over every point in a cluster as a candidate medoid and then sums distances to every other point (quadratic work per cluster, per epoch). With the default epochs=1000, this is likely to be impractically slow on non-trivial datasets. Consider using batchSize/sampling or another optimization (e.g., PAM/CLARA-style approximations) to keep training tractable.

Copilot uses AI. Check for mistakes.

// Update medoids - find the sample that minimizes total distance within each cluster
foreach ($clusters as $cluster => $indices) {
if (empty($indices)) {
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a cluster has no assigned samples, the loop continues without updating $this->sizes[$cluster]. That can leave a stale non-zero size from a previous epoch, making sizes() inaccurate. Consider explicitly setting the size to 0 when $indices is empty (and/or resetting all sizes to 0 at the start of each epoch).

Suggested change
if (empty($indices)) {
if (empty($indices)) {
// Ensure sizes are accurate even for empty clusters
$this->sizes[$cluster] = 0;

Copilot uses AI. Check for mistakes.
Comment on lines +348 to +351
$labels = array_fill(0, count($allSamples), 0);

$fullDataset = Labeled::quick($allSamples, $labels);

Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$labels and $fullDataset are created but never used, which adds noise and suggests an incomplete batching/randomization step. Consider removing these variables (and the Labeled import) or using them to implement the intended mini-batch logic.

Copilot uses AI. Check for mistakes.
Comment on lines +67 to +73
* The size of each mini batch in samples.
*
* @var positive-int
*/
protected int $batchSize;

/**
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

batchSize is exposed as a hyper-parameter (and listed in params()), but it is never used during training. Either incorporate mini-batching into partial() (similar to KMeans) or remove the parameter/property to avoid a misleading public API.

Suggested change
* The size of each mini batch in samples.
*
* @var positive-int
*/
protected int $batchSize;
/**

Copilot uses AI. Check for mistakes.
Comment on lines +15 to +17
protected const TRAINING_SIZE = 10000;

protected const TESTING_SIZE = 10000;
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the current KMedoids implementation doing quadratic medoid updates, a TRAINING_SIZE/TESTING_SIZE of 10,000 is likely to make this benchmark run for an impractically long time (or time out). Consider reducing the sizes/iterations or adjusting the benchmark to a feasible workload for this algorithm.

Suggested change
protected const TRAINING_SIZE = 10000;
protected const TESTING_SIZE = 10000;
protected const TRAINING_SIZE = 2000;
protected const TESTING_SIZE = 2000;

Copilot uses AI. Check for mistakes.
Comment on lines +10 to +14
## Parameters
| # | Name | Default | Type | Description |
|---|---|---|---|---|
| 1 | k | | int | The number of target clusters. |
| 2 | batch size | 128 | int | The size of each mini batch in samples. |
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameters table is malformed (each row starts with ||), which will not render as a proper Markdown table in mkdocs. It should match the format used in other clusterer docs (single leading | per row).

Copilot uses AI. Check for mistakes.
Comment on lines +3 to +46
* Minimal test for K-Medoids without dependencies
*/

// Simple autoloader for testing
spl_autoload_register(function ($class) {
$prefix = 'Rubix\\ML\\';
$base_dir = __DIR__ . '/src/';

$len = strlen($prefix);
if (strncmp($prefix, $class, $len) !== 0) {
return;
}

$relative_class = substr($class, $len);
$file = $base_dir . str_replace('\\', '/', $relative_class) . '.php';

if (file_exists($file)) {
require $file;
}
});

echo "K-Medoids Syntax Validation Test\n";
echo "=================================\n\n";

// Check if class can be loaded
if (class_exists('Rubix\ML\Clusterers\KMedoids')) {
echo "✓ KMedoids class exists\n";

// Check if all required methods exist
$reflection = new ReflectionClass('Rubix\ML\Clusterers\KMedoids');
$methods = ['train', 'predict', 'partial', 'proba', 'medoids', 'sizes', 'steps', 'losses'];

foreach ($methods as $method) {
if ($reflection->hasMethod($method)) {
echo "✓ Method '$method' exists\n";
} else {
echo "✗ Method '$method' missing\n";
}
}

echo "\n✓ All K-Medoids code structure is valid!\n";
} else {
echo "✗ KMedoids class not found\n";
}
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a local ad-hoc validation script (custom autoloader + console output) rather than a project test/benchmark artifact. Since there are already PHPUnit tests and a phpbench benchmark for KMedoids, consider removing this file from the repository to avoid shipping non-standard test helpers in the package root.

Suggested change
* Minimal test for K-Medoids without dependencies
*/
// Simple autoloader for testing
spl_autoload_register(function ($class) {
$prefix = 'Rubix\\ML\\';
$base_dir = __DIR__ . '/src/';
$len = strlen($prefix);
if (strncmp($prefix, $class, $len) !== 0) {
return;
}
$relative_class = substr($class, $len);
$file = $base_dir . str_replace('\\', '/', $relative_class) . '.php';
if (file_exists($file)) {
require $file;
}
});
echo "K-Medoids Syntax Validation Test\n";
echo "=================================\n\n";
// Check if class can be loaded
if (class_exists('Rubix\ML\Clusterers\KMedoids')) {
echo "✓ KMedoids class exists\n";
// Check if all required methods exist
$reflection = new ReflectionClass('Rubix\ML\Clusterers\KMedoids');
$methods = ['train', 'predict', 'partial', 'proba', 'medoids', 'sizes', 'steps', 'losses'];
foreach ($methods as $method) {
if ($reflection->hasMethod($method)) {
echo "✓ Method '$method' exists\n";
} else {
echo "✗ Method '$method' missing\n";
}
}
echo "\n✓ All K-Medoids code structure is valid!\n";
} else {
echo "✗ KMedoids class not found\n";
}
* This file previously contained a minimal, ad-hoc validation script for K-Medoids.
*
* It has been retired in favor of the standard PHPUnit tests and phpbench benchmarks
* and intentionally left as a no-op to avoid shipping non-standard test helpers
* in the package root.
*/

Copilot uses AI. Check for mistakes.
Comment on lines +339 to +342
// Merge new samples with existing samples
$newSamples = $dataset->samples();
$this->samples = array_merge($this->samples, $newSamples);

Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partial() always merges $dataset->samples() into $this->samples. When train() calls partial($dataset) after setting $this->samples = $dataset->samples(), the initial training set gets duplicated (and subsequent partial() calls keep accumulating duplicates). Consider avoiding the merge on the first call from train() (e.g., set $this->samples only in partial(), or pass a flag / clear samples before merging).

Suggested change
// Merge new samples with existing samples
$newSamples = $dataset->samples();
$this->samples = array_merge($this->samples, $newSamples);
// Merge new samples with existing samples, avoiding duplication
$newSamples = $dataset->samples();
if (empty($this->samples)) {
// Initial population of samples
$this->samples = $newSamples;
} elseif ($this->samples !== $newSamples) {
// Only merge if the incoming samples differ from the stored ones
$this->samples = array_merge($this->samples, $newSamples);
}

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,46 @@
<?php
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move any assertions that this test makes into the PHPUnit testing suite located in /tests? I prefer to keep all tests in the same place.

@@ -1,3 +1,6 @@
- 2.5.3
- Added K Medoids clusterer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a new feature, it will not be released as a bug fix. This would be released in 2.6 at minimum.

* @package Rubix/ML
* @author Cheribet Cherif Chouaib
*/
class KMedoids implements Estimator, Learner, Online, Probabilistic, Verbose, Persistable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can tell this is basically the mini-batch K-means class but instead of computing cluster centroids it takes the closest sample to the cluster center as the "centroid" in this context more appropriately termed medoid. Are there any other notable differences that I'm missing?

@andrewdalpino andrewdalpino changed the base branch from master to 2.6 March 2, 2026 19:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants