
I am looking for a faster way of finding large subsets of lists that obey pairwise conditions. In particular, I'm looking for a large (over 10k) set of quaternary words (i.e. ordered lists) of length 13 with pairwise Hamming distance greater than 5.

Here's an example in code: there are 4^13 = 67,108,864 length-13 words over a 4-letter alphabet:

all = Tuples[Range[4], 13];

A naive way to find a set whose members are pairwise more than Hamming distance 5 apart is to sample one word at a time and compare it to the current set:

set = Internal`Bag[];
Internal`StuffBag[set, RandomChoice[all]];
Dynamic[Internal`BagLength[set], UpdateInterval -> 1]

While[Internal`BagLength[set] < 10000,
 new = RandomChoice[all];
 If[AllTrue[Internal`BagPart[set, All], HammingDistance[new, #] > 5 &],
  Internal`StuffBag[set, new]]]

result = Internal`BagPart[set, All];

I'm looking for a set of size 10,000, but at the moment this code is just too slow: after an hour it had only found about 1,400.

Can this type of random search problem be accelerated using compilation, parallelization, graph theory, heuristics, or anything else?

M.R.
    Do you find the "Bag" functions to be especially faster than the "DynamicArray" data structure? Slightly off topic since in this question appending elements to the list is not the rate limiting step. – Jason B. Jan 18 '22 at 19:25
  • I can get a decent speedup over your code by just creating a NearestFunction at each iteration of the While loop and using that to find the nearest already-selected element. But the scaling of this problem means it still slows to a crawl eventually. – Jason B. Jan 18 '22 at 19:47
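
For reference, here is a rough sketch of the approach from the last comment (a reconstruction, not Jason B.'s actual code), reusing all and the distance threshold from the question: rebuild a NearestFunction over the current set each iteration and test a candidate only against its nearest already-selected element.

set = {RandomChoice[all]};
While[Length[set] < 10000,
 nf = Nearest[set, DistanceFunction -> HammingDistance];
 new = RandomChoice[all];
 If[HammingDistance[new, First[nf[new]]] > 5, AppendTo[set, new]]]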

3 Answers


It might be faster to cull as you go. I compile the selection function (possibly it can be made faster still). I also set the history length to zero, although that does not help too much because the Select entails a big memory hit at least once.

$HistoryLength = 0;

selectHam = Compile[{{ll, _Integer, 2}, {vec, _Integer, 1}, {n, _Integer}},
   Select[ll, (Min[Apply[Plus, Sign[Abs[# - vec]]]] > n) &],
   CompilationOptions -> {"InlineExternalDefinitions" -> True},
   RuntimeOptions -> "Speed", CompilationTarget -> "C"];

To keep my machine from croaking I only go to 12-tuples.

all = Tuples[Range[4], 12];
Length[all]

(* Out[22]= 16777216 *)

This takes around 20 minutes:

j = 0;
Timing[
 res = Reap[recent = all[[1]];
     rest = Rest@all;
     Sow[recent];
     While[rest =!= {},
      PrintTemporary[{j++, Length[rest]}];
      ok = selectHam[rest, recent, 5];
      Sow[recent = First[ok]];
      rest = Rest[ok]
      ];
     ][[2, 1]];]
Length[res]

(* Out[24]= {1170.45, Null}

Out[25]= 4096 *)

I am currently running on a bigger machine for the 13-tuples case. I expect it to take a few hours to run to completion.

One might be able to curtail the memory overhead by using Scan (or Do), Sow and Reap instead of Select in the culling function.
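
As an uncompiled sketch of that idea (structure only, my addition; for speed one would want a compiled Do loop writing into a preallocated buffer):

cullHam[ll_, vec_, n_] :=
 Flatten[Reap[Scan[If[HammingDistance[#, vec] > n, Sow[#]] &, ll]][[2]], 1]

This is a drop-in replacement for selectHam[ll, vec, n] in the loop above; the Flatten[..., 1] yields {} when nothing survives the cull.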

--- edit ---

For tuples of modest length this can be done faster by representing each tuple as an integer. And the memory footprint is far smaller. I use a bitwise operation to gain a little more speed.

selectHamInt = Compile[
   {{ll, _Integer, 1}, {val, _Integer}, {n, _Integer}, {ilen, _Integer}},
   Select[ll, (Apply[Plus, Sign[IntegerDigits[BitXor[#, val], 4, ilen]]] > n) &],
   CompilationOptions -> {"InlineExternalDefinitions" -> True},
   RuntimeOptions -> "Speed", CompilationTarget -> "C"];
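
As a quick sanity check of the encoding (my addition, not part of the original answer): each base-4 digit occupies its own pair of bits, so BitXor never carries across digit boundaries, and the number of nonzero base-4 digits of BitXor[a, b] equals the Hamming distance between the digit lists.

a = FromDigits[{1, 0, 3, 2, 0}, 4];
b = FromDigits[{1, 3, 3, 0, 0}, 4];
Total[Sign[IntegerDigits[BitXor[a, b], 4, 5]]] ==
 HammingDistance[{1, 0, 3, 2, 0}, {1, 3, 3, 0, 0}]

(* True *)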

ilen = 13; rest = Range[0, 4^ilen - 1]; Length[rest]

j = 0;
Timing[
 res = Reap[recent = rest[[1]];
     rest = Rest@rest;
     Sow[recent];
     While[ListQ[rest] && rest =!= {},
      ok = selectHamInt[rest, recent, 5, ilen];
      If[ok =!= {},
       Sow[recent = First[ok]]; rest = Rest[ok],
       rest = {}]
      ];
     ][[2, 1]];]

(* Out[43]= {8831.97, Null} *)

Length[res]

(* Out[44]= 4096 *)

So it takes 2.5 hours to run to completion.

--- end edit ---

Daniel Lichtblau

Correct me if I'm wrong: you're looking for 10,000 lists, each of length 13, where every element is one of {1,2,3,4} and each list is at Hamming distance at least 5 from all the others.

It's probably beyond the realm of this community to use other languages, but since it's substantially faster, I decided to write this answer in Julia (which, as of Mathematica 12.1, can be run inside Mathematica, so it's not much of a deviation).

Code

# package to save the result in CSV
using DelimitedFiles

Raise an error to save the result when Ctrl+C is pressed during the calculation:

Base.exit_on_sigint(false)
const file_path = "C:\\FileName.csv"

@inbounds function simulation(n::Int64 = 4000,similarity::Int64 = 5,column_size::Int64 = 13)::Matrix{Int8}

# hamming_distance(left,right::Vector{Int8}) = count(==(1),left .!= right)

function above_hamming_distance(left,right::Vector{Int8},n::Int64)::Bool
    unique_elements = 0
    for i in 1:length(left)

        if left[i] != right[i]
            unique_elements+=1
        end
        if unique_elements >= n
            return true
        end
    end
    false
end

# version 1
# check_choice(filtered_choices,result::Vector{Int8})=all(>=(similarity),count.(==(1),eachrow(result'.!=filtered_choices)))

# version 2
# check that a solution is valid by comparing it to the given solutions
function check_choice(filtered_choices,result::Vector{Int8})::Bool

    for choice in eachrow(filtered_choices)

        if !above_hamming_distance(choice,result,similarity)
            return false
        end

    end
    true
end

# preallocating space
choices = zeros(Int8,n,column_size)

# add random choice to the solutions
all_values=Int8(1):Int8(4)
choices[1,:]=rand(all_values,column_size)
counter::Int64=2

function generate_choice(choices::Matrix{Int8},counter::Int64)

    result=rand(all_values,column_size)

    # if solution is valid by comparing to previous similar solutions, add to the solutions
    if check_choice(view(choices,1:counter-1,:),result)
        choices[counter,:]=result
        counter +=1
    end

    counter
end

try
    @time while counter <= n
        counter = generate_choice(choices,counter)
        # println(counter)
    end
catch e
    writedlm( file_path,  choices[1:counter-1,:], ',')
end

choices

end

Save the result:

writedlm( file_path, simulation(4000), ',')

Speed

1.061022 seconds (78.39 k allocations: 4.784 MiB)

It produces 4,000 items in around 1.0 seconds!

To test the result we can import the CSV into Mathematica and check that no pair is at Hamming distance between 1 and 4, and that the rows are duplicate-free:

data = Import["C:\\FileName.csv", "Data"];
result1 = Outer[HammingDistance, data, data, 1];

Count[MemberQ[#, 1 | 2 | 3 | 4] & /@ result1, True] (* Out: 0 *)

DuplicateFreeQ[data] (* Out: True *)

I'm not an expert in Julia. I tried to limit the computation (using vectorized operations and preallocation) and the memory usage (using view), but I'm sure that with so many hacks and macros available, it could be improved further.
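
For completeness, here is a hedged sketch of driving this from inside Mathematica via ExternalEvaluate, per the version 12.1 note above. The file name simulation.jl is hypothetical (save the code above under that name), and this assumes ExternalEvaluate can find a working Julia installation:

session = StartExternalSession["Julia"];
ExternalEvaluate[session, File["simulation.jl"]]; (* load the definitions *)
data = ExternalEvaluate[session, "simulation(4000)"];
DeleteObject[session];
Dimensions[data]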

Ben Izd

You could use Reap and Sow and a loop over possible pairs. Here is a small example:

all = Tuples[Range[2], 3];
res = Reap[
    Do[If[HammingDistance[all[[i]], all[[j]]] > 2, 
       Sow[{all[[i]], all[[j]]}]], {i, Length[all]}, {j, 1, i - 1}];
    ][[2]];

res

(* output: the list of pairs at Hamming distance greater than 2 *)
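
Note that a list of qualifying pairs is not yet a mutually-distant set: a set in which every pair qualifies is a clique of the graph whose edges are those pairs, which connects back to the graph-theory idea in the question. A sketch for this toy example (my addition; FindClique will not scale to the 4^13 words of the original problem):

edges = UndirectedEdge @@@ First[res];
g = Graph[all, edges];
FindClique[g]

(* one largest clique, e.g. {{{1, 1, 1}, {2, 2, 2}}} *)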

Daniel Huber