Implementing a faster (possibly parallel) K-Means clustering

Question

I am writing an algorithm that adds noise to the centroids at each iteration of K-Means, so I need a custom K-Means implementation. I looked at the Mathematica implementation of Lloyd's algorithm, which was helpful, but that code runs slowly on my data (6500 2D points). Is it possible to write a faster K-Means, ideally a parallel one? Here are my modified KMeans and KMeans++ implementations. Many thanks!

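(* Lloyd's algorithm: random initial centroids, then alternate assignment and update *)
KMeans[list_, k_, opts : OptionsPattern[{DistanceFunction -> SquaredEuclideanDistance,
    "RandomSeed" -> {}}]] := BlockRandom[SeedRandom[OptionValue["RandomSeed"]];
  Module[{m = RandomSample[list, k], update, partition, clusters},
   (* update step: each centroid becomes the mean of its cluster *)
   update[] := m = Mean /@ clusters;
   (* assignment step: group points by nearest centroid, ties broken at random *)
   partition[_] := (clusters = GroupBy[list, RandomChoice@Nearest[m, #, (# -> OptionValue[#] &@DistanceFunction)] &][#] & /@ m; update[]);
   (* iterate until the centroids stop changing *)
   FixedPoint[partition, list];
   {clusters, m}]]
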
(* k-means++ seeding followed by the same Lloyd iteration as KMeans *)
KMeansPP[list_, k_, opts : OptionsPattern[{DistanceFunction -> SquaredEuclideanDistance,
    "RandomSeed" -> {}}]] := BlockRandom[SeedRandom[OptionValue["RandomSeed"]];
  Module[{m = RandomSample[list, 1], update, partition, clusters, findCentroid},
   (* seeding: draw the next centroid with weight equal to the squared distance to the nearest existing centroid *)
   findCentroid[] := AppendTo[m, RandomChoice[(Min @@@ Transpose[Table[SquaredEuclideanDistance[m[[i]], #] & /@ list, {i, 1, Length@m}]]) -> list]];
   Do[findCentroid[], k - 1];
   update[] := m = Mean /@ clusters;
   partition[_] := (clusters = GroupBy[list, RandomChoice@Nearest[m, #, (# -> OptionValue[#] &@DistanceFunction)] &][#] & /@ m; update[]);
   FixedPoint[partition, list];
   {clusters, m}]]
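
For comparison, here is a minimal sketch of one direction that might reduce the running time. The code above calls Nearest[m, #, ...] once per point, while this sketch builds a single NearestFunction per iteration and queries it with all points at once. The names fastKMeans, noiseFn and maxIter are hypothetical and not from the code above. The sketch relies on the default Euclidean metric of Nearest (which yields the same assignments as SquaredEuclideanDistance), it returns cluster labels rather than grouped points, and the noise model and synthetic data in the usage lines are assumptions for illustration only. It has not been benchmarked against the functions above.

```mathematica
(* Sketch only. fastKMeans, noiseFn and maxIter are hypothetical names. *)
fastKMeans[data_?MatrixQ, k_Integer?Positive, noiseFn_ : Identity, maxIter_ : 100] :=
 Module[{pts = N[data], m, nf, labels, old = {}, iter = 0},
  m = RandomSample[pts, k];          (* random initial centroids *)
  While[iter++ < maxIter,
   nf = Nearest[m -> Range[k]];      (* one NearestFunction per iteration *)
   labels = nf[pts, 1][[All, 1]];    (* index of the closest centroid for every point *)
   If[labels === old, Break[]];      (* stop when assignments no longer change *)
   old = labels;
   (* update step; an empty cluster keeps its previous centroid *)
   m = Table[
     With[{p = Pick[pts, labels, j]}, If[p === {}, m[[j]], Mean[p]]],
     {j, k}];
   m = noiseFn[m]];                  (* per-iteration hook, e.g. centroid noise *)
  {labels, m}]

(* Usage on synthetic data: a stand-in, not the dataset from the post *)
SeedRandom[1];
data = Join[
   RandomVariate[NormalDistribution[0, 1], {3000, 2}],
   RandomVariate[NormalDistribution[10, 1], {3500, 2}]];
(* assumed noise model: small Gaussian jitter on every centroid coordinate *)
jitter = # + RandomVariate[NormalDistribution[0, 0.01], Dimensions[#]] &;
{labels, centers} = fastKMeans[data, 2, jitter];
```

Because noise is injected after every update, the centroids themselves may never reach an exact fixed point, which is why the loop stops on unchanged labels or on maxIter instead of using FixedPoint on the centroids. Independent restarts with different seeds could in principle be distributed with ParallelTable, but whether that helps for 6500 points would need to be measured.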

Including data (or code to generate data) on which this algorithm should work would help greatly. — MarcoB, Mar 31 '22 at 16:54
Thank you, here is the data. http://cs.joensuu.fi/sipu/datasets/unbalance-gt.txt — Charmbracelet, Apr 01 '22 at 02:03
Might be good to check this. https://jeremykun.com/2013/02/04/k-means-clustering-and-birth-rates/#comment-29518 — Charmbracelet, Nov 03 '22 at 06:18
