9

For example having data

data = {16, 4, 17, 10, 15, 4, 4, 6, 7, 14, 9, 17, 27, 6, 1, 9, 0, 12, 20, 8, 0, 3, 4, 0, 3, 4}

I want to cluster them int 4 partitions

FindClusters[data, 4]

So that each partition is as close to 55 = (220/4) as possible.

The sequence order should not be changed.


I was also considering a knapsack based solution:

GetSample[n_,mydata_] := #[[Flatten@
     Position[LinearProgramming[-#, -{#}, -{Total[data]/n},
       {0, 1} & /@ #, Integers], 1], 1]] &@ mydata
GetSample[4, data]

after what i create a copy from where i remove sample elements and repeat n times.

Mr.Wizard
  • 271,378
  • 34
  • 587
  • 1,371
Margus
  • 1,987
  • 2
  • 15
  • 19
  • @PlatoManiac I imagine, that each partition would have their elements in the same order as they appear in the full list and, perhaps, the partitions themselves would be ordered in such a way, that their first elements are also in the same order. But Mean[data] returns 110/13, not 55. I guess, that's just the wording of the title slightly failing. – LLlAMnYP Jul 03 '15 at 12:19
  • Ya that might be the case @LLlAMnYP – PlatoManiac Jul 03 '15 at 12:21
  • I take it to mean that the first cluster can only be $\sum_i^n x_{i}$ – Feyre Jul 03 '15 at 12:22
  • http://www.wolframalpha.com/input/?i=total%7B16%2C+4%2C+17%2C+10%2C+15%2C+4%2C+4%2C+6%2C+7%2C+14%2C+9%2C+17%2C+27%2C+6%2C+1%2C+9%2C+0%2C+12%2C+20%2C+8%2C+0%2C+3%2C+4%2C+0%2C+3%2C+4%7D%2F4 – Margus Jul 03 '15 at 12:22
  • Should the total deviation Total(Abs[x1-55],Abs[x2-55],Abs[x3-55],Abs[x4-55]) be as small as possible, or what metric do you want to use? – Feyre Jul 03 '15 at 12:35
  • I want to partition all elements, so that partition sums are approximately equal, in this sample case sum of all elements is 220 and 4 partitions with desired partition sum of 55. – Margus Jul 03 '15 at 12:46
  • Please clarify what exactly you mean by "The sequence order should not be changed." Also add to your question what norm to use for measuring being close to 55 and if you only want one solution or all with the same being-closeness. – Karsten7 Jul 04 '15 at 10:48

5 Answers5

5

For lists of the size in the example, brute-forcing s/b fine:

Module[{d = #1, m = #2, l = Length@#1, dt = (Tr@#1)/#2, dtp, parted},
   parted = 
    Internal`PartitionRagged[d, #] & /@ 
     Join @@ Permutations /@ IntegerPartitions[l, {m}];
   dtp = Total[Abs[Total[parted, {3}] - dt], {2}];
   Pick[parted, dtp, Min@dtp]] &[data, 4]

(*
{{{16, 4, 17, 10}, {15, 4, 4, 6, 7, 14, 9}, {17, 27, 6, 1, 9}, {0, 12, 20, 8, 0, 3, 4, 0, 3, 4}}, 
{{16, 4, 17, 10}, {15, 4, 4, 6, 7, 14, 9}, {17, 27, 6, 1, 9, 0}, {12, 20, 8, 0, 3, 4, 0, 3, 4}}}
*)

I chose minimization of sum of absolute differences from goal as the metric, you can change it (dtp=...) to whatever floats your boat.

ciao
  • 25,774
  • 2
  • 58
  • 139
2

Here is a simple approach that splits the data into 4 sequences, with each sequence being as close to 55 as possible

drop[list_, m_] := 
 Drop[list, (Position[#, Sequence @@ Nearest[#, m]] &@ Accumulate@list)[[1, 1]]]
take[list_, m_] := 
 Take[list, (Position[#, Sequence @@ Nearest[#, m]] &@ Accumulate@list)[[1, 1]]]

NestList[{take[#[[2]], 55], drop[#[[2]], 55]} &, {{}, data}, 4][[2 ;;, 1]]
{{16, 4, 17, 10, 15}, {4, 4, 6, 7, 14, 9, 17}, {27, 6, 1, 9, 0, 12}, 
 {20, 8, 0, 3, 4, 0, 3, 4}}
Total /@ %

{62, 61, 55, 42}

One could use

pos[list_, m_] := (Position[#, Sequence @@ Nearest[#, m]] &@ Accumulate@list)[[1, 1]]

NestList[{pos[#[[2]], 55], drop[#[[2]], 55]} &, {{}, data}, 4][[2 ;;, 1]]

{5, 7, 6, 8}

to get some starting values, that could be used for some code that splits the data into sequences using a global measure for the difference to 55.

Karsten7
  • 27,448
  • 5
  • 73
  • 134
  • Minimizing the maximum difference of each sublist sum to 55 results in the following split positions: {{4, 6, 5, 11}, {4, 7, 4, 11}, {4, 7, 5, 10}, {4, 7, 6, 9}}. All resulting in having 8 as the maximum absolute difference. – Karsten7 Jul 03 '15 at 19:20
2

Here's an overengineered approach using the method of simulated annealing. I apologize for the poor style of coding, this is something I had lying around found somewhere online and modified just now for this task.

data = {16, 4, 17, 10, 15, 4, 4, 6, 7, 14, 9, 17, 27, 6, 1, 9, 0, 12, 20, 8, 0, 3, 4, 0, 3, 4};

Solution = Partition[Range@26, 7, 7, 1, {}];
Optimum = Solution;

Fitness[list_] := Norm@(Total /@ (data[[#]] & /@ list) - 55);

Iterate[Sol_, temp_] :=
 Module[{nSol = Sol, fromto = RandomSample[Range@4, 2], index, val},
  While[Length@nSol[[First@fromto]] == 0, 
   fromto = RandomSample[Range@4, 2]];
  index = RandomInteger[{1, Length@nSol[[First@fromto]]}];
  val = nSol[[First@fromto, index]];
  nSol = Delete[nSol, {First@fromto, index}];
  nSol = Insert[nSol, val, {Last@fromto, 1}];

  If[Fitness[nSol] < Fitness[Optimum], Optimum = nSol];

  Solution = 
   RandomChoice[{Exp[-Fitness[nSol]/temp], 
      Exp[-Fitness[Sol]/temp]} -> {nSol, Sol}]]

ListPlot@(tempSched = (.9 + Tanh[(10000 - #)/15000]) & /@ 
    Range[30000])

ListPlot@(Fitness /@ FoldList[Iterate, Solution, tempSched])

The current solution is a partition of the list of indices. With each step the program randomly takes a number from one partition and sticks it into another. It takes then a random choice between the newly proposed solution and the old one based on the current "temperature" and the fitness of either solution (how close they are to summing to 55). Also, if it finds a better solution than any previous one it saves it for future use.

After running the above, let's check the results.

Solution
data[[#]] & /@ Solution
Total /@ %
{{21, 24, 17, 7, 2, 16, 4, 18, 5}, {14, 9, 12, 13}, {15, 22, 6, 26, 8,
   23, 11, 20, 1}, {25, 10, 3, 19}}
{{0, 0, 0, 4, 4, 9, 10, 12, 15}, {6, 7, 17, 27}, {1, 3, 4, 4, 6, 4, 9,
   8, 16}, {3, 14, 17, 20}}
{54, 57, 55, 54}

Quite good, but we can do better.

Optimum
data[[#]] & /@ Optimum
Total /@ %
{{17, 8, 14, 9, 11, 18, 5}, {25, 20, 12, 13}, {15, 21, 4, 26, 6, 24, 
  7, 22, 23, 16, 1}, {2, 10, 3, 19}}
{{0, 6, 6, 7, 9, 12, 15}, {3, 8, 17, 27}, {1, 0, 10, 4, 4, 0, 4, 3, 4,
   9, 16}, {4, 14, 17, 20}}
{55, 55, 55, 55}

Exactly as desired. Finally OP requests to maintain the order of the data, so

data[[#]] & /@ SortBy[Sort /@ Optimum, First]
{{16, 10, 4, 4, 1, 9, 0, 3, 4, 0, 4}, {4, 17, 14, 20}, {15, 6, 7, 9,  6, 0, 12}, {17, 27, 8, 3}}
LLlAMnYP
  • 11,486
  • 26
  • 65
1

Since the size of the list is relatively small, you could use

clusters = ReplaceList[data, {w__, x__, y__, z__} -> {{w}, {x}, {y}, {z}}]

to split the original list data into 4 sublists while preserving the order. There are about 2300 clusters. Then sort these clusters of four sublists by the difference of their sums from 55.

SortBy[clusters, Total[Abs[Total[#, {2}] - 55]] &]

The first cluster in the sorted list has totals of {47,59,60,54}. This cluster is

(* {{16, 4, 17, 10},
    {15, 4, 4, 6, 7, 14, 9},
    {17, 27, 6, 1, 9},
    {0, 12, 20, 8, 0, 3, 4, 0, 3, 4}}  *)
KennyColnago
  • 15,209
  • 26
  • 62
0
data = {16, 4, 17, 10, 15, 4, 4, 6, 7, 14, 9, 17, 27, 6, 1, 9, 0, 12, 
   20, 8, 0, 3, 4, 0, 3, 4};
data1 = {};
i = 1;
While[i < Length[data] && 
  Total[data1] < 55, {data1 = 
   Append[data1, x = Total[{Nearest[data, 55 - Total[data1]][[1]]}]], 
  data = Delete[data, FirstPosition[data, x]]}; i++]
data2 = {};
i = 1;
While[i < Length[data] && 
  Total[data2] < 55, {data2 = 
   Append[data2, x = Total[{Nearest[data, 55 - Total[data2]][[1]]}]], 
  data = Delete[data, FirstPosition[data, x]]}; i++]
data3 = {};
i = 1;
While[i < Length[data] && 
  Total[data3] < 55, {data3 = 
   Append[data3, x = Total[{Nearest[data, 55 - Total[data3]][[1]]}]], 
  data = Delete[data, FirstPosition[data, x]]}; i++]
enddata={data1,data2,data3,data}

This yields exactly four times 55. In:

Total[data1]
Total[data2]
Total[data3]
Total[data]

Out: 55 55 55 55

Feyre
  • 8,597
  • 2
  • 27
  • 46