Conversation
Introduce a new KMedoids clusterer implementation (src/Clusterers/KMedoids.php) implementing Estimator, Learner, Online, Probabilistic, Verbose and Persistable interfaces. Adds full training/partial training, predict/proba, inertia/loss tracking, medoids/sizes accessors, serialization and parameter validation. Also add documentation (docs/clusterers/k-medoids.md), unit tests (tests/Clusterers/KMedoidsTest.php) and a benchmark (benchmarks/Clusterers/KMedoidsBench.php). Tests and code use a seeder, distance kernel, and basic logging; invalid inputs and untrained prediction are guarded by exceptions.
|
Hey @chouaibcher this is really cool! I'll take a look ASAP although just to warn you I don't have much time for Rubix ML these days so please be patient with me. Thanks I'm looking forward to the review! |
There was a problem hiding this comment.
Pull request overview
Introduces a new KMedoids clusterer to the clustering module, along with supporting documentation, tests, and benchmarking artifacts.
Changes:
- Added
Rubix\ML\Clusterers\KMedoidsimplementation. - Added user docs + mkdocs nav entry + composer keywords + changelog entry.
- Added PHPUnit test coverage and a phpbench benchmark for
KMedoids.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
src/Clusterers/KMedoids.php |
New K-Medoids clusterer implementation (online + probabilistic API). |
tests/Clusterers/KMedoidsTest.php |
Unit tests mirroring the existing KMeans clusterer tests. |
benchmarks/Clusterers/KMedoidsBench.php |
Benchmark harness for training + predicting with KMedoids. |
docs/clusterers/k-medoids.md |
New documentation page for the KMedoids clusterer. |
mkdocs.yml |
Adds K Medoids to the Clusterers navigation. |
composer.json |
Adds K-Medoids keywords for package discoverability. |
CHANGELOG.md |
Notes the new clusterer under version 2.5.3. |
test-kmedoids-minimal.php |
Adds a standalone minimal script for loading/reflecting KMedoids. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Try each point in the cluster as a potential medoid | ||
| foreach ($indices as $candidateIndex) { | ||
| $totalDistance = 0.0; | ||
|
|
||
| foreach ($indices as $pointIndex) { |
There was a problem hiding this comment.
The medoid update step iterates over every point in a cluster as a candidate medoid and then sums distances to every other point (quadratic work per cluster, per epoch). With the default epochs=1000, this is likely to be impractically slow on non-trivial datasets. Consider using batchSize/sampling or another optimization (e.g., PAM/CLARA-style approximations) to keep training tractable.
|
|
||
| // Update medoids - find the sample that minimizes total distance within each cluster | ||
| foreach ($clusters as $cluster => $indices) { | ||
| if (empty($indices)) { |
There was a problem hiding this comment.
When a cluster has no assigned samples, the loop continues without updating $this->sizes[$cluster]. That can leave a stale non-zero size from a previous epoch, making sizes() inaccurate. Consider explicitly setting the size to 0 when $indices is empty (and/or resetting all sizes to 0 at the start of each epoch).
| if (empty($indices)) { | |
| if (empty($indices)) { | |
| // Ensure sizes are accurate even for empty clusters | |
| $this->sizes[$cluster] = 0; |
| $labels = array_fill(0, count($allSamples), 0); | ||
|
|
||
| $fullDataset = Labeled::quick($allSamples, $labels); | ||
|
|
There was a problem hiding this comment.
$labels and $fullDataset are created but never used, which adds noise and suggests an incomplete batching/randomization step. Consider removing these variables (and the Labeled import) or using them to implement the intended mini-batch logic.
| * The size of each mini batch in samples. | ||
| * | ||
| * @var positive-int | ||
| */ | ||
| protected int $batchSize; | ||
|
|
||
| /** |
There was a problem hiding this comment.
batchSize is exposed as a hyper-parameter (and listed in params()), but it is never used during training. Either incorporate mini-batching into partial() (similar to KMeans) or remove the parameter/property to avoid a misleading public API.
| * The size of each mini batch in samples. | |
| * | |
| * @var positive-int | |
| */ | |
| protected int $batchSize; | |
| /** |
| protected const TRAINING_SIZE = 10000; | ||
|
|
||
| protected const TESTING_SIZE = 10000; |
There was a problem hiding this comment.
With the current KMedoids implementation doing quadratic medoid updates, a TRAINING_SIZE/TESTING_SIZE of 10,000 is likely to make this benchmark run for an impractically long time (or time out). Consider reducing the sizes/iterations or adjusting the benchmark to a feasible workload for this algorithm.
| protected const TRAINING_SIZE = 10000; | |
| protected const TESTING_SIZE = 10000; | |
| protected const TRAINING_SIZE = 2000; | |
| protected const TESTING_SIZE = 2000; |
| ## Parameters | ||
| | # | Name | Default | Type | Description | | ||
| |---|---|---|---|---| | ||
| | 1 | k | | int | The number of target clusters. | | ||
| | 2 | batch size | 128 | int | The size of each mini batch in samples. | |
There was a problem hiding this comment.
The parameters table is malformed (each row starts with ||), which will not render as a proper Markdown table in mkdocs. It should match the format used in other clusterer docs (single leading | per row).
| * Minimal test for K-Medoids without dependencies | ||
| */ | ||
|
|
||
| // Simple autoloader for testing | ||
| spl_autoload_register(function ($class) { | ||
| $prefix = 'Rubix\\ML\\'; | ||
| $base_dir = __DIR__ . '/src/'; | ||
|
|
||
| $len = strlen($prefix); | ||
| if (strncmp($prefix, $class, $len) !== 0) { | ||
| return; | ||
| } | ||
|
|
||
| $relative_class = substr($class, $len); | ||
| $file = $base_dir . str_replace('\\', '/', $relative_class) . '.php'; | ||
|
|
||
| if (file_exists($file)) { | ||
| require $file; | ||
| } | ||
| }); | ||
|
|
||
| echo "K-Medoids Syntax Validation Test\n"; | ||
| echo "=================================\n\n"; | ||
|
|
||
| // Check if class can be loaded | ||
| if (class_exists('Rubix\ML\Clusterers\KMedoids')) { | ||
| echo "✓ KMedoids class exists\n"; | ||
|
|
||
| // Check if all required methods exist | ||
| $reflection = new ReflectionClass('Rubix\ML\Clusterers\KMedoids'); | ||
| $methods = ['train', 'predict', 'partial', 'proba', 'medoids', 'sizes', 'steps', 'losses']; | ||
|
|
||
| foreach ($methods as $method) { | ||
| if ($reflection->hasMethod($method)) { | ||
| echo "✓ Method '$method' exists\n"; | ||
| } else { | ||
| echo "✗ Method '$method' missing\n"; | ||
| } | ||
| } | ||
|
|
||
| echo "\n✓ All K-Medoids code structure is valid!\n"; | ||
| } else { | ||
| echo "✗ KMedoids class not found\n"; | ||
| } |
There was a problem hiding this comment.
This looks like a local ad-hoc validation script (custom autoloader + console output) rather than a project test/benchmark artifact. Since there are already PHPUnit tests and a phpbench benchmark for KMedoids, consider removing this file from the repository to avoid shipping non-standard test helpers in the package root.
| * Minimal test for K-Medoids without dependencies | |
| */ | |
| // Simple autoloader for testing | |
| spl_autoload_register(function ($class) { | |
| $prefix = 'Rubix\\ML\\'; | |
| $base_dir = __DIR__ . '/src/'; | |
| $len = strlen($prefix); | |
| if (strncmp($prefix, $class, $len) !== 0) { | |
| return; | |
| } | |
| $relative_class = substr($class, $len); | |
| $file = $base_dir . str_replace('\\', '/', $relative_class) . '.php'; | |
| if (file_exists($file)) { | |
| require $file; | |
| } | |
| }); | |
| echo "K-Medoids Syntax Validation Test\n"; | |
| echo "=================================\n\n"; | |
| // Check if class can be loaded | |
| if (class_exists('Rubix\ML\Clusterers\KMedoids')) { | |
| echo "✓ KMedoids class exists\n"; | |
| // Check if all required methods exist | |
| $reflection = new ReflectionClass('Rubix\ML\Clusterers\KMedoids'); | |
| $methods = ['train', 'predict', 'partial', 'proba', 'medoids', 'sizes', 'steps', 'losses']; | |
| foreach ($methods as $method) { | |
| if ($reflection->hasMethod($method)) { | |
| echo "✓ Method '$method' exists\n"; | |
| } else { | |
| echo "✗ Method '$method' missing\n"; | |
| } | |
| } | |
| echo "\n✓ All K-Medoids code structure is valid!\n"; | |
| } else { | |
| echo "✗ KMedoids class not found\n"; | |
| } | |
| * This file previously contained a minimal, ad-hoc validation script for K-Medoids. | |
| * | |
| * It has been retired in favor of the standard PHPUnit tests and phpbench benchmarks | |
| * and intentionally left as a no-op to avoid shipping non-standard test helpers | |
| * in the package root. | |
| */ |
| // Merge new samples with existing samples | ||
| $newSamples = $dataset->samples(); | ||
| $this->samples = array_merge($this->samples, $newSamples); | ||
|
|
There was a problem hiding this comment.
partial() always merges $dataset->samples() into $this->samples. When train() calls partial($dataset) after setting $this->samples = $dataset->samples(), the initial training set gets duplicated (and subsequent partial() calls keep accumulating duplicates). Consider avoiding the merge on the first call from train() (e.g., set $this->samples only in partial(), or pass a flag / clear samples before merging).
| // Merge new samples with existing samples | |
| $newSamples = $dataset->samples(); | |
| $this->samples = array_merge($this->samples, $newSamples); | |
| // Merge new samples with existing samples, avoiding duplication | |
| $newSamples = $dataset->samples(); | |
| if (empty($this->samples)) { | |
| // Initial population of samples | |
| $this->samples = $newSamples; | |
| } elseif ($this->samples !== $newSamples) { | |
| // Only merge if the incoming samples differ from the stored ones | |
| $this->samples = array_merge($this->samples, $newSamples); | |
| } |
| @@ -0,0 +1,46 @@ | |||
| <?php | |||
There was a problem hiding this comment.
Can we move any assertions that this test makes into the PHPUnit testing suite located in /tests? I prefer to keep all tests in the same place.
| @@ -1,3 +1,6 @@ | |||
| - 2.5.3 | |||
| - Added K Medoids clusterer | |||
There was a problem hiding this comment.
Since this is a new feature, it will not be released as a bug fix. This would be released in 2.6 at minimum.
| * @package Rubix/ML | ||
| * @author Cheribet Cherif Chouaib | ||
| */ | ||
| class KMedoids implements Estimator, Learner, Online, Probabilistic, Verbose, Persistable |
There was a problem hiding this comment.
From what I can tell this is basically the mini-batch K-means class but instead of computing cluster centroids it takes the closest sample to the cluster center as the "centroid" in this context more appropriately termed medoid. Are there any other notable differences that I'm missing?
Introduce a new KMedoids clusterer implementation (src/Clusterers/KMedoids.php) + documentation (docs/clusterers/k-medoids.md),
unit tests (tests/Clusterers/KMedoidsTest.php) and
a benchmark (benchmarks/Clusterers/KMedoidsBench.php).