
I am looking for a faster way of finding large subsets of lists that obey pairwise conditions. In particular, I'm looking for a large (over 10k) set of quaternary words (i.e. ordered lists) of length 13 with pairwise Hamming distance greater than 5.

Here's an example in code: there are 4^13 = 67,108,864 length-13 words over a 4-letter alphabet:

all = Tuples[Range[4], 13];

A naive way to find a set whose members are pairwise more than Hamming distance 5 apart is to sample one word at a time and compare it to the current set:

set = Internal`Bag[];
Internal`StuffBag[set, RandomChoice[all]];
Dynamic[Internal`BagLength[set], UpdateInterval -> 1]

While[Internal`BagLength[set] < 10000,
 new = RandomChoice[all];
 If[AllTrue[Internal`BagPart[set, All], HammingDistance[new, #] > 5 &],
  Internal`StuffBag[set, new]]]

result = Internal`BagPart[set, All];

I'm looking for a set of size 10,000, but at the moment this code is just too slow: after an hour it had only found about 1,400.

Can this type of random search problem be accelerated using compilation, parallelization, graph theory, heuristics, or anything else?

M.R.
    Do you find the "Bag" functions to be especially faster than the "DynamicArray" data structure? Slightly off topic since in this question appending elements to the list is not the rate limiting step. – Jason B. Jan 18 '22 at 19:25
  • I can get a decent speedup over your code by just creating a NearestFunction at each iteration of the While loop and using that to find the nearest already-selected element. But the scaling of this problem means it still slows to a crawl eventually. – Jason B. Jan 18 '22 at 19:47
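
For reference, here is a rough sketch of the approach from the last comment (a reconstruction, not Jason B.'s actual code), reusing all and the distance threshold from the question: rebuild a NearestFunction over the current set each iteration and test a candidate only against its nearest already-selected element.

set = {RandomChoice[all]};
While[Length[set] < 10000,
 nf = Nearest[set, DistanceFunction -> HammingDistance];
 new = RandomChoice[all];
 If[HammingDistance[new, First[nf[new]]] > 5, AppendTo[set, new]]]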

3 Answers


It might be faster to cull as you go. I compile the selection function (possibly it can be made faster still). I also set the history length to zero, although that does not help too much because the Select entails a big memory hit at least once.

$HistoryLength = 0;

selectHam = Compile[{{ll, _Integer, 2}, {vec, _Integer, 1}, {n, _Integer}},
   Select[ll, (Min[Apply[Plus, Sign[Abs[# - vec]]]] > n) &],
   CompilationOptions -> {"InlineExternalDefinitions" -> True},
   RuntimeOptions -> "Speed", CompilationTarget -> "C"];

To keep my machine from croaking I only go to 12-tuples.

all = Tuples[Range[4], 12];
Length[all]

(* Out[22]= 16777216 *)

This takes around 20 minutes:

j = 0;
Timing[
 res = Reap[recent = all[[1]];
     rest = Rest@all;
     Sow[recent];
     While[rest =!= {},
      PrintTemporary[{j++, Length[rest]}];
      ok = selectHam[rest, recent, 5];
      Sow[recent = First[ok]];
      rest = Rest[ok]
      ];
     ][[2, 1]];]
Length[res]

(* Out[24]= {1170.45, Null}

Out[25]= 4096 *)

I am currently running on a bigger machine for the 13-tuples case. I expect it to take a few hours to run to completion.

One might be able to curtail the memory overhead by using Scan (or Do), Sow and Reap instead of Select in the culling function.
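
As an uncompiled sketch of that idea (structure only, my addition; for speed one would want a compiled Do loop writing into a preallocated buffer):

cullHam[ll_, vec_, n_] :=
 Flatten[Reap[Scan[If[HammingDistance[#, vec] > n, Sow[#]] &, ll]][[2]], 1]

This is a drop-in replacement for selectHam[ll, vec, n] in the loop above; the Flatten[..., 1] yields {} when nothing survives the cull.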

--- edit ---

For tuples of modest length this can be done faster by representing each tuple as an integer. And the memory footprint is far smaller. I use a bitwise operation to gain a little more speed.

selectHamInt = Compile[
   {{ll, _Integer, 1}, {val, _Integer}, {n, _Integer}, {ilen, _Integer}},
   Select[ll, (Apply[Plus, Sign[IntegerDigits[BitXor[#, val], 4, ilen]]] > n) &],
   CompilationOptions -> {"InlineExternalDefinitions" -> True},
   RuntimeOptions -> "Speed", CompilationTarget -> "C"];
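
As a quick sanity check of the encoding (my addition, not part of the original answer): each base-4 digit occupies its own pair of bits, so BitXor never carries across digit boundaries, and the number of nonzero base-4 digits of BitXor[a, b] equals the Hamming distance between the digit lists.

a = FromDigits[{1, 0, 3, 2, 0}, 4];
b = FromDigits[{1, 3, 3, 0, 0}, 4];
Total[Sign[IntegerDigits[BitXor[a, b], 4, 5]]] ==
 HammingDistance[{1, 0, 3, 2, 0}, {1, 3, 3, 0, 0}]

(* True *)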

ilen = 13; rest = Range[0, 4^ilen - 1]; Length[rest]

j = 0;
Timing[
 res = Reap[recent = rest[[1]];
     rest = Rest@rest;
     Sow[recent];
     While[ListQ[rest] && rest =!= {},
      ok = selectHamInt[rest, recent, 5, ilen];
      If[ok =!= {},
       Sow[recent = First[ok]]; rest = Rest[ok],
       rest = {}]
      ];
     ][[2, 1]];]

(* Out[43]= {8831.97, Null} *)

Length[res]

(* Out[44]= 4096 *)

So it takes 2.5 hours to run to completion.

--- end edit ---

Daniel Lichtblau

Correct me if I'm wrong: you're looking for 10,000 lists, each of length 13, where every element is one of {1,2,3,4} and each list is at Hamming distance at least 5 from all the others.

It's probably beyond the realm of this community to use other languages, but since it's substantially faster, I decided to write this answer in Julia (which, as of Mathematica 12.1, can be run inside Mathematica, so it's not much of a deviation).

Code

# package to save the result in CSV
using DelimitedFiles

Raise an error to save the result when Ctrl+C is pressed during the calculation:

Base.exit_on_sigint(false)
const file_path = "C:\\FileName.csv"

@inbounds function simulation(n::Int64 = 4000,similarity::Int64 = 5,column_size::Int64 = 13)::Matrix{Int8}

# hamming_distance(left,right::Vector{Int8}) = count(==(1),left .!= right)

function above_hamming_distance(left,right::Vector{Int8},n::Int64)::Bool
    unique_elements = 0
    for i in 1:length(left)

        if left[i] != right[i]
            unique_elements+=1
        end
        if unique_elements >= n
            return true
        end
    end
    false
end

# version 1
# check_choice(filtered_choices,result::Vector{Int8})=all(>=(similarity),count.(==(1),eachrow(result'.!=filtered_choices)))

# version 2
# check that a solution is valid by comparing it to the given solutions
function check_choice(filtered_choices,result::Vector{Int8})::Bool

    for choice in eachrow(filtered_choices)

        if !above_hamming_distance(choice,result,similarity)
            return false
        end

    end
    true
end

# preallocating space
choices = zeros(Int8,n,column_size)

# add random choice to the solutions
all_values=Int8(1):Int8(4)
choices[1,:]=rand(all_values,column_size)
counter::Int64=2

function generate_choice(choices::Matrix{Int8},counter::Int64)

    result=rand(all_values,column_size)

    # if solution is valid by comparing to previous similar solutions, add to the solutions
    if check_choice(view(choices,1:counter-1,:),result)
        choices[counter,:]=result
        counter +=1
    end

    counter
end

try
    @time while counter <= n
        counter = generate_choice(choices,counter)
        # println(counter)
    end
catch e
    writedlm( file_path,  choices[1:counter-1,:], ',')
end

choices

end

Save the result:

writedlm( file_path, simulation(4000), ',')

Speed

1.061022 seconds (78.39 k allocations: 4.784 MiB)

It produces 4,000 items in around 1.0 seconds!

To test the result we can import the CSV into Mathematica and check that no pair is at Hamming distance between 1 and 4, and that the rows are duplicate-free:

data = Import["C:\\FileName.csv", "Data"];
result1 = Outer[HammingDistance, data, data, 1];

Count[MemberQ[#, 1 | 2 | 3 | 4] & /@ result1, True] (* Out: 0 *)

DuplicateFreeQ[data] (* Out: True *)

I'm not an expert in Julia. I tried to limit the computation (using vectorized operations and preallocation) and the memory usage (using view), but I'm sure that with so many hacks and macros available, it could be improved further.
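
For completeness, here is a hedged sketch of driving this from inside Mathematica via ExternalEvaluate, per the version 12.1 note above. The file name simulation.jl is hypothetical (save the code above under that name), and this assumes ExternalEvaluate can find a working Julia installation:

session = StartExternalSession["Julia"];
ExternalEvaluate[session, File["simulation.jl"]]; (* load the definitions *)
data = ExternalEvaluate[session, "simulation(4000)"];
DeleteObject[session];
Dimensions[data]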

Ben Izd

You could use Reap and Sow and a loop over possible pairs. Here is a small example:

all = Tuples[Range[2], 3];
res = Reap[
    Do[If[HammingDistance[all[[i]], all[[j]]] > 2, 
       Sow[{all[[i]], all[[j]]}]], {i, Length[all]}, {j, 1, i - 1}];
    ][[2]];

res

(* output: the list of pairs at Hamming distance greater than 2 *)
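
Note that a list of qualifying pairs is not yet a mutually-distant set: a set in which every pair qualifies is a clique of the graph whose edges are those pairs, which connects back to the graph-theory idea in the question. A sketch for this toy example (my addition; FindClique will not scale to the 4^13 words of the original problem):

edges = UndirectedEdge @@@ First[res];
g = Graph[all, edges];
FindClique[g]

(* one largest clique, e.g. {{{1, 1, 1}, {2, 2, 2}}} *)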

Daniel Huber