I am writing an algorithm that adds noise to the centroids at each iteration of KMeans, so I need to implement a custom KMeans function. I looked at the methods in the Mathematica implementation of Lloyd's algorithm, which were helpful, but the code runs slowly on my data (6500 2D points). Is it possible to write a faster KMeans, ideally one that runs in parallel? Here are my modified KMeans and KMeans++ implementations. Many thanks!

KMeans[list_, k_, opts : OptionsPattern[{DistanceFunction -> SquaredEuclideanDistance,
    "RandomSeed" -> {}}]] := BlockRandom[SeedRandom[OptionValue["RandomSeed"]];
  Module[{m = RandomSample[list, k], update, partition, clusters},
   (* recompute each centroid as the mean of its cluster *)
   update[] := m = Mean /@ clusters;
   (* assign every point to its nearest centroid, then update the centroids *)
   partition[_] := (clusters = GroupBy[list, RandomChoice@Nearest[m, #, (# -> OptionValue[#] &@DistanceFunction)] &][#] & /@ m; update[]);
   (* iterate until the centroids stop changing *)
   FixedPoint[partition, list];
   {clusters, m}]]

KMeansPP[list_, k_, opts : OptionsPattern[{DistanceFunction -> SquaredEuclideanDistance,
    "RandomSeed" -> {}}]] := BlockRandom[SeedRandom[OptionValue["RandomSeed"]];
  Module[{m = RandomSample[list, 1], update, partition, clusters, findCentroid},
   (* KMeans++ seeding: pick the next centroid with probability proportional to
      each point's squared distance to its closest existing centroid *)
   findCentroid[] := AppendTo[m, RandomChoice[
      Min @@@ Transpose[Table[SquaredEuclideanDistance[m[[i]], #] & /@ list, {i, 1, Length@m}]] -> list]];
   Do[findCentroid[], k - 1];
   update[] := m = Mean /@ clusters;
   partition[_] := (clusters = GroupBy[list, RandomChoice@Nearest[m, #, (# -> OptionValue[#] &@DistanceFunction)] &][#] & /@ m; update[]);
   FixedPoint[partition, list];
   {clusters, m}]]
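
For reference, here is a minimal sketch of what a single noisy iteration might look like if the assignment step is done for all points in one Nearest call instead of once per point. The name noisyLloydStep, the parameter sigma, and the choice of Gaussian noise are my own assumptions for illustration and are not taken from the code above. It also uses Nearest's default distance for numeric vectors (Euclidean), which gives the same assignments as SquaredEuclideanDistance.

```mathematica
(* Hypothetical helper: one Lloyd iteration followed by centroid noise.
   pts   : list of points (n x d matrix)
   m     : current centroids (k x d matrix)
   sigma : assumed standard deviation of the Gaussian noise *)
noisyLloydStep[pts_, m_, sigma_] := Module[{nf, idx, new},
  (* build the NearestFunction once; results are positions in m *)
  nf = Nearest[m -> Automatic];
  (* nearest-centroid index for every point, computed in a single call *)
  idx = nf[pts, 1][[All, 1]];
  (* mean of each cluster; an empty cluster keeps its old centroid *)
  new = Table[
    With[{members = Pick[pts, idx, j]},
     If[members === {}, m[[j]], Mean[members]]],
    {j, Length[m]}];
  (* perturb the updated centroids *)
  new + sigma*RandomVariate[NormalDistribution[], Dimensions[new]]]

(* Usage: with noise the centroids never reach an exact fixed point,
   so this runs a fixed number of iterations (20 here) instead of FixedPoint *)
(* Nest[noisyLloydStep[list, #, 0.01] &, RandomSample[list, k], 20] *)
```

Whether this is actually faster on the 6500-point dataset would need to be timed. The idea is only that building the NearestFunction once per iteration avoids repeating that setup for every point.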

Charmbracelet