
I have a set of data (about 20 million datapoints for my real data) of the form {x, y, z, m}. There are 3 spatial coordinates x, y and z and a fourth value, the mass (m) of the element at that point.

I want to bin the data onto an integer-spaced 3D grid, and then bin the masses into a series of specific ranges. I want the number of masses that fall into each mass range for each point on the grid. I guess for n ranges I will need n 3D arrays, or alternatively a single 3D array where each point has a list of n values.

Let's start with some random data

coords = RandomReal[{-100, 100}, {10^5, 3}]; (*these can be negative*)
masses = RandomReal[400, 10^5]; (*these can't be negative*)

and some random ranges to bin the masses into; we'll make 50, as that is about how many I have in my real data:

ranges = Partition[Union[Round[RandomReal[400, 100], 0.001]], 2];

Based on this related question: https://stackoverflow.com/questions/8178714/mathematica-fast-2d-binning-algorithm, I started by rounding the coordinates. Since I want unit-sized bins, I simply used Floor on the coordinates:

indices = Floor[coords];

In my case the masses have more precision than the mass-range specifications, so let's round those now too:

roundedmass = Round[masses, 0.001];

Now, based on the related question, I used GatherBy to collect like coordinates into lists:

gathered = GatherBy[Transpose[{indices, roundedmass}], First];
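
Each element of gathered is now a list of {integerCoordinates, roundedMass} pairs sharing the same coordinate bin; a quick (illustrative) way to inspect the structure:

First @ gathered          (* the pairs that fell into the first coordinate bin *)
Max[Length /@ gathered]   (* population of the fullest bin *)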

However, I'm stuck on how to quickly bin the masses in each of the gathered coordinate bins into the various mass ranges. In the end I guess I will have 50 3D arrays, where each coordinate bin has a value corresponding to the number of masses in the specified mass range. 50 mass ranges = 50 3D arrays (probably sparse, as there will be lots of empty space), or alternatively a 3D array where each point has a list of 50 values.

I've tried to use BinCounts, but this also bins the values falling between the mass ranges, so I had to drop every other bin that was returned, which seems wasteful:

fbinmasses = BinCounts[#, {Flatten@ranges}][[;; ;; 2]] &
binnedmass = ParallelMap[fbinmasses, gathered[[All, All, 2]]]; 
combinedoutput = Transpose[{gathered[[All, 1, 1]], binnedmass}];
sortedoutput = Sort[combinedoutput, OrderedQ[{#1[[1]], #2[[1]]}] &];

Now I would need to make a sparse array, maybe like this (I'm going to have issues with negative indices...?), and it seems like you can't store a list in a SparseArray:

output = SparseArray[gathered[[All, 1, 1]] -> binnedmass]
s0rce

2 Answers


From reading over your post it seems that there are two problems you're having trouble with:

  1. How to bin lists of numbers into bins that may not be consecutive
  2. How to represent the output of your calculation

To address #2 first, I think your idea to use a 3D SparseArray, where each element is a list of mass bin counts, is a good one. As you found, and as the documentation points out, the values of a SparseArray can't be lists, but we could just replace the Head of a list with some other dummy symbol like MassCounts to get this working. So the problem is to create a list of array rules that looks like this:

{ ... , {xIndex,yIndex,zIndex} -> MassCounts[m1, m2, ... , m50] , ...}
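
A tiny illustration of the Head replacement (counts made up): Apply (@@) swaps List for MassCounts, so the counts survive inside a SparseArray instead of being interpreted as a sub-array.

MassCounts @@ {0, 2, 1}
(* -> MassCounts[0, 2, 1] *)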

To address question #1, you already suggest a solution in your post (BinCounts[#, {Flatten@ranges}][[;; ;; 2]]). Another way to do this would be to write a new BinCounts function that can deal with non-consecutive bins, like this:

ClearAll[BinCountsNC];
BinCountsNC[xs_, bins_] :=
 With[{
   lowerEdges = First@Transpose@Sort[bins],
   upperEdges = Last@Transpose@Sort[bins]},

  (Thread[lowerEdges <= # < upperEdges] & /@ xs) //
    Boole //
   Total
  ]
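
For instance, a quick check on a couple of hand-picked, non-consecutive ranges (numbers made up):

BinCountsNC[{1.5, 2.5, 7.}, {{1., 2.}, {5., 10.}}]
(* -> {1, 1}: one value in [1, 2) and one in [5, 10); 2.5 falls in neither *)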

In the timings below you can see that this alternative is significantly faster than your initial solution, which was a surprise to me, but there it is.

To make it a bit easier to construct the sparse array rules, I also just added 101 to all the coordinates. Putting everything together, we've got

coords = RandomReal[{-100, 100}, {10^5, 3}]; (*these can be negative*)
masses = RandomReal[400, 10^5]; (*these can't be negative*)

ranges = Partition[Union[Round[RandomReal[400, 100], 0.001]], 2];

indices = Floor[coords] + 101;

roundedmass = Round[masses, 0.001];

gathered = GatherBy[Transpose[{indices, roundedmass}], First];

In[141]:= 
rules = 
  Map[
   #[[1, 1]] -> MassCounts @@ BinCounts[#[[All, -1]], List@Flatten@ranges][[;; ;; 2]] &, 
   gathered]; // Timing

Out[141]= {140.941, Null}

In[142]:= 
rules2 = 
  Map[
   #[[1, 1]] -> MassCounts @@ BinCountsNC[#[[All, -1]], ranges] &, 
   gathered]; // Timing

Out[142]= {10.3167, Null}

The two methods produce the same results:

rules == rules2 (* -> True *)

And you can construct a SparseArray using either of these sets of rules:

array = SparseArray[rules];
array[[Sequence @@ (array // ArrayRules // First // First)]]
(* -> MassCounts[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, \
0, 0, 0, 0, 0, 0] *)
DGrady

You can use the answer you linked to speed things up considerably. UnitStep and Rescale are used to mark the ranges in which a mass lies, 1 if in, 0 if not. Then SparseArray is used as in the linked answer to add up the marks.
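
As a small sketch of that marking for a single range (numbers made up): the rescaled masses land in [0, 1) exactly when they lie in the range, so the UnitStep difference flags them.

Module[{r},
 r = Rescale[{5., 12., 20., 25.}, {10., 20.}, {0, 1}];
 UnitStep[r] - UnitStep[r - 1]]
(* -> {0, 1, 0, 0}: only 12. lies in [10., 20.) *)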

The setup.

SeedRandom[1];
coords = RandomReal[{-100, 100}, {10^5, 3}]; (*these can be negative*)
masses = RandomReal[400, 10^5]; (*these can't be negative*)
ranges = Partition[Union[Round[RandomReal[400, 100], 0.001]], 2];
roundedmass = Round[masses, 0.001];

The solution.

Not compiling unitMark takes almost 50% longer. The key to efficient speed and memory usage is keeping arrays packed.
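
(If you want to verify the packing on your own data, Developer`PackedArrayQ will report it; the names below are just the ones from the setup.)

Developer`PackedArrayQ /@ {coords, roundedmass}
(* -> {True, True} when both arrays are packed *)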

Some comments to explain the code:

  1. unitMark[x] is 1 when 0 <= x < 1; the masses roundedmass are rescaled for each range so that they lie in this interval exactly when the mass is between the end points of the range, i.e., in $[min, max)$.

  2. The form of nzpos is {{rangesindex, roundedmassindex}, ..}.

  3. The output has the form bincounts[[x, y, z, m]], where x, y, z, m are the indices and the value is the number of masses lying in the bin {x, y, z, m}.

AbsoluteTiming[
 indices = Floor[coords + 101];
 unitMark = Compile[{{masses, _Real, 2}}, UnitStep[masses] - UnitStep[masses - 1], 
   RuntimeOptions -> "Speed"];
 nzpos = SparseArray[unitMark[Rescale[roundedmass, #, {0, 1}] & /@ ranges]][
   "NonzeroPositions"];
 nzindices = Transpose[ Transpose @ indices[[Last /@ nzpos]] ~Append~ (First /@ nzpos) ];
 SetSystemOptions["SparseArrayOptions" -> {"TreatRepeatedEntries" -> 1}];
 bincounts = SparseArray[nzindices -> 1];
 SetSystemOptions["SparseArrayOptions" -> {"TreatRepeatedEntries" -> 0}];
 ]
(* {0.333861, Null} *)

By comparison, the timings I got with DGrady's two methods were 80.025997 and 6.249729 seconds.

Example: Here are the indices for the bins with a "high" count.

Cases[ArrayRules[bincounts], HoldPattern[_ -> _?(# > 1 &)], Infinity]
(* {{ 53,  65, 105, 48} -> 2, { 61, 28, 173, 28} -> 2,
    { 68,  60,  46, 50} -> 2, {165, 30, 129, 50} -> 2,
    {198, 174,  56, 40} -> 2}   *)
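
To read off the full set of mass-range counts for one grid cell (taking {53, 65, 105} from the output above, and remembering the +101 shift), something like this should work:

Normal @ bincounts[[53, 65, 105]]
(* a vector of counts, one entry per mass range; position 48 holds the 2 shown above *)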

Check

It's a little tricky because of the difference in the ways the counts are stored in the two solutions. In array the entries are either 0, representing that all 50 counts are zero, or MassCounts[c1, c2, ...]; sometimes all the c1, c2, etc. are themselves zero, which is why MassCounts[0 ..] is also replaced by 0 in the comparison below.

If[(Normal@bincounts /. {0 ..} -> 0) == (Normal@array /. 
     MassCounts[0 ..] -> 0 /. MassCounts -> List),
 True, False, "huh?"]
(* True *)

There's a difference in memory usage too.

ByteCount /@ {bincounts, array}
(* {1434688, 126416264} *)
Michael E2