
I'm trying to generate a histogram for a large data set with the following lines of code:

MaxSideLength = Ceiling[Max[data]];
Export["Histogram.png",
 Image[ArrayPlot[HistogramList[data, {1, MaxSideLength, 1}][[2]],
   ColorRules -> {0 -> Black, 1 -> Red, 2 -> Yellow}], ImageSize -> MaxSideLength]];

Here data has something like $O(10^6)$ data points, and I want maybe $50,000 \times 50,000$ bins (where each bin corresponds to a pixel in the output PNG image).

Unfortunately, my memory usage goes through the roof with the above script (exhausting over 200 GB of RAM). I can pull off having maybe $20{\rm k} \times 20{\rm k}$ bins, but anything above this fails.

Is there a more efficient way to proceed? If need be, I can properly round all of the values in data.

I know this comparison is unfair, but given that the typical output size for a $20,000 \times 20,000$ pixel PNG is something like 40 MB, it seems a little odd to me that the binning and image-creation process would require over $2500\times$ that amount in RAM.

OK, let's run a little analysis script provided by ssch:

mmu = MaxMemoryUsed[];
n = 20*10^6;
{w, h} = {5000, 5000};
data = RandomVariate[NormalDistribution[], {n, 2}];
bins = BinCounts[data,
 {Min[data], Max[data], (Max[data] - Min[data])/w},
 {Min[data], Max[data], (Max[data] - Min[data])/h}];
i = Image[bins /. {0 -> {0, 0, 0}, 1 -> {1, 0, 0}, 2 -> {1, 1, 0},
                   3 -> {0, 0, 1}, _?IntegerQ -> {1, 1, 1}}];
mmu2 = MaxMemoryUsed[];
mmu2 - mmu

The result for my value of n is $\approx 600*10^6$ bytes. Increasing n by an order of magnitude to $20*10^7$ yields a memory usage of $\approx 12*10^9$. However, if we run the script again, overwriting the data variable, the reported memory usage is again only $\approx 600*10^6$.

However, if we increase the number of bins to {w, h} = {10000, 10000}, the memory usage jumps to $\approx 3.5*10^9$ for the first run, and then, oddly, to $\approx 2.5*10^9$ for subsequent runs (presumably because previous data structures get overwritten). My conclusion is that the increase in the number of bins is responsible for the blowup in RAM usage.
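
A rough back-of-the-envelope check supports this: a packed array of machine integers costs about 8 bytes per element, so the bin array alone grows quadratically with the number of bins per side. The figures below are a sketch of that arithmetic, not measured values from my machine:

ByteCount @ ConstantArray[0, {1000, 1000}]
(* roughly 8*10^6 bytes: 8 bytes per element plus a small header *)
(* hence 10000 x 10000 bins cost ~800 MB packed, and 50000 x 50000 bins
   ~20 GB packed; any unpacked copy would be several times larger still *)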

  • Is it in Export, Image, or ArrayPlot that you see the high memory usage?

    How does something like `n = 1000000; {w, h} = {5000, 5000}; data = RandomVariate[NormalDistribution[], {n, 2}]; bins = BinCounts[data, {Min[data], Max[data], (Max[data] - Min[data])/w}, {Min[data], Max[data], (Max[data] - Min[data])/h}]; i = Image[bins /. {0 -> {0, 0, 0}, 1 -> {1, 0, 0}, 2 -> {1, 1, 0}, 3 -> {0, 0, 1}, _?IntegerQ -> {1, 1, 1}}];` behave memory-wise?

    – ssch Aug 11 '13 at 22:24
  • @ssch That didn't break a GB of RAM. I suspect the problem is ArrayPlot, since I recall it causing a huge memory burden in isolation. However, as soon as I regenerate my data set, I'll provide explicit RAM usage values for everything. – Brownian Aug 11 '13 at 22:28
  • This is quite likely a packing issue... The 50k x 50k array (assuming it is packed, so that it has as small a memory footprint as possible) will unpack inside ArrayPlot, causing it to eat up a LOT of memory. – rm -rf Aug 11 '13 at 22:34
  • @rm-rf Do you see a way forward, or is this problem fundamental? – Brownian Aug 11 '13 at 22:45
  • @rm-rf Why would Mathematica need to unpack the array? Can I tell ArrayPlot to assume all of the array entries are integers or real numbers (no symbolic values, etc.)? – Brownian Aug 11 '13 at 23:29
  • @Brownian I'm not sure... it was only a guess though. Unfortunately, I can't test it out right now. – rm -rf Aug 12 '13 at 00:18
  • @rm-rf, Brownian ArrayPlot does seem to unpack. (For those who don't know, set On["Packing"] to get messages when arrays are unpacked; see the sketch after these comments.) Image[1 - bins/N@Max[bins]] does not unpack when I run it. – Michael E2 Aug 12 '13 at 00:23
  • @MichaelE2 Syntactically, how would I fix my line of code to generate histograms with the On["Packing"] specification? – Brownian Aug 12 '13 at 00:53
  • Have you tried using Raster directly? Likely it is calculating the colors that unpacks it. You'll have to play to get it to do so without unpacking. – rcollyer Aug 12 '13 at 01:32
  • I'm just curious - what do you use the 50000 by 50000 pixel image for? Do you print it? – cormullion Aug 12 '13 at 06:34
  • @cormullion That would be a heck of a printer. I'm just outputting the image to have something I can conveniently search and zoom around on with any of a number of image viewer programs. Outside of the image format, I haven't found a good way of visualizing a data set of this size in a Mathematica notebook. – Brownian Aug 12 '13 at 10:08
  • Yes - 4 meter wide rolls of paper...! – cormullion Aug 12 '13 at 10:22
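
A minimal sketch of the packing diagnostic suggested in the comments above, using a small random matrix as a stand-in for the real bin counts; On["Packing"] prints a message whenever a packed array gets unpacked:

On["Packing"];
testbins = RandomInteger[5, {100, 100}];   (* small stand-in for the real bin counts *)
ArrayPlot[testbins];                       (* emits unpacking messages *)
Image[1 - testbins/N@Max[testbins]];       (* stays packed, per Michael E2's comment *)
Off["Packing"];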

1 Answer


This seems to work -- well, on the small example. I don't have 200 GB to play with.

The OP's setup (without the image):

Quit[]

SeedRandom[2];
n = 20*10^6;
{w, h} = {5000, 5000};
data = RandomVariate[NormalDistribution[], {n, 2}];

mmu = MaxMemoryUsed[];
bins = BinCounts[
    data, {Min[data], 
     Max[data], (Max[data] - Min[data])/w}, {Min[data], 
     Max[data], (Max[data] - Min[data])/h}]; // Timing

mmu2 = MaxMemoryUsed[];
mmu2 - mmu

(* {12.107534, Null} *)
(* 960687112 *)

I adapted the OP's color function to create a packed array for the image data:

cf = Compile[{{bins, _Integer, 2}},
    Map[Piecewise[{
        {{0., 0., 0.}, # == 0},
        {{1., 0., 0.}, # == 1},
        {{1., 1., 0.}, # == 2},
        {{0., 0., 1.}, # == 3}},
       {1., 1., 1.}] &,
     bins, {2}],
    RuntimeOptions -> "Speed", CompilationTarget -> "C"
   ];

Given the colors produced by the color function, one might convert the image to "Byte" to save space. (Note the double Image below: converting an existing Image to "Byte" rescales its 0-1 channel values to 0-255, whereas applying Image[data, "Byte"] to the raw real array would interpret the values directly as bytes.)

(img = Image[Image @ cf @ bins, "Byte"]) // Timing

[image output]

It does save space:

ByteCount @ img
(* 75000400 *)

ByteCount @ Image @ cf @ bins
(* 600000384 *)

Export finishes in a reasonable time (less than half the time it took for the larger Image@cf@bins).

With[{t0 = AbsoluteTime[]},
 Export["1Example.png", img];
 AbsoluteTime[] - t0]
(* 3.000048 *)

Possible alternative

There is another alternative, based on the answers by kguler and ruebenko to these questions, respectively:

  1. Optimising 2D binning code

  2. https://stackoverflow.com/questions/8178714/mathematica-fast-2d-binning-algorithm

It uses SparseArray, and it can be fast. Indeed, when the OP's n was originally 2 * 10^6, it seemed promising; but when it changed to 20 * 10^6, the performance was considerably worse. Perhaps, if n in the actual application is much smaller than w * h, it is still worth considering. It seemed to use a lot of memory in constructing the SparseArray, but the final result can be quite a bit smaller than bins.

See the linked questions about the option "TreatRepeatedEntries". I present three variations depending on how one wants to treat Max[data]. The OP's code tends not to count the maximum, which lands just at the beginning of the bin after the last one, unless due to round-off error it just happens to land in the last one.

This is a direct adaptation of the answers to the linked questions. I assume h = w, as they happen to be equal in the OP's question. The original answers fudge a little with epsilon, as does this first version; the variations after it avoid this.

epsilon = 10^-12;
indexes1 = 1 + Floor[(1 - epsilon) w Rescale[data]];
SetSystemOptions["SparseArrayOptions" -> {"TreatRepeatedEntries" -> 1}];
bins1 = SparseArray[indexes1 -> 1, {w, w}];
SetSystemOptions["SparseArrayOptions" -> {"TreatRepeatedEntries" -> First}];

Here is one that does what the OP's BinCounts code does. This one lops off the maximums with ArrayPad.

indexes2 = 1 + Floor[w Rescale[data]];
SetSystemOptions["SparseArrayOptions" -> {"TreatRepeatedEntries" -> 1}];
bins2 = ArrayPad[SparseArray[indexes2 -> 1, 1 + {w, w}], {0, -1}];
SetSystemOptions["SparseArrayOptions" -> {"TreatRepeatedEntries" -> First}];
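
As a quick sanity check (a sketch, assuming the bins computed by BinCounts above is still in memory), this variation should reproduce that result up to floating-point edge effects at the bin boundaries:

Normal[bins2] == bins
(* expect True, modulo round-off at the bin edges *)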

One can put the maximum in the last bin with this:

indexes2 = # + Unitize[#-w] &@ Floor[w Rescale[data]];
SetSystemOptions["SparseArrayOptions" -> {"TreatRepeatedEntries" -> 1}];
bins2 = SparseArray[indexes2 -> 1, {w, w}];
SetSystemOptions["SparseArrayOptions" -> {"TreatRepeatedEntries" -> First}];

Note that if w != h, these variations would need to be adapted.
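
For completeness, either sparse result can then be run through the compiled color function cf from the first part to get the final image. This is a sketch; note that Compile expects a dense array, so Normal is needed first:

img1 = Image[Image @ cf @ Normal @ bins1, "Byte"];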
