4
<< JLink`;
InstallJava[];
ReinstallJava[JVMArguments -> "-Xmx8192m"]

data = Import["E:\\data2.xlsx"][[1]];

fieldNames = data[[{1}, {1, 2, 4, 5, 6, 7, 8}]];
training = data[[2 ;; 220206, {1, 2, 4, 5, 6, 7, 8}]];
test = data[[220207 ;;, {1, 2, 4, 5, 6, 7, 8}]];

model = Predict[Cases[training, x_ :> (x[[;; 6]] -> x[[7]])]]
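For reference, the same list of rules can be built without `Cases` and pattern matching, which is somewhat leaner; a sketch, assuming `training` is the matrix defined above with the target in the last column:

```mathematica
(* build {features -> target, ...} by threading Rule over two column slices;
   equivalent to the Cases construction but without per-row pattern matching *)
model = Predict[Thread[training[[All, ;; 6]] -> training[[All, 7]]]]
```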

The last line produces the following error:

No more memory available.
Mathematica kernel has shut down.
Try quitting other applications and then retry.

How can I get Mathematica to produce a classifier/predictor function that I can use? Is Mathematica capable of working with large datasets the way SQL Server can, using disk reads rather than keeping everything in memory?

Please help; I have to produce results and really need a workable solution.

J. M.'s missing motivation
user13892
  • I believe you need to use File. See this answer. – Edmund Oct 07 '16 at 23:37
  • mma isn't built to run on large scales, just small toy prototypes – M.R. Oct 08 '16 at 00:27
  • @M.R. Evidence? - For example a 2013 WTC presentation shows a quantum chemistry simulation running on Mathematica for 5.5 hours vs 5 hours on C++. What is toy vs scale? – alancalvitti Oct 08 '16 at 02:37
  • I don't know how to use File function since Predict and Classify input format is like {{x11,x12,x13,...}->y1,{x21,x22,x23,...}->y2,...}. But is there a way to feed data in chunks to Predict or Classify or even run these function on the chunks and then "merge" them? – user13892 Oct 08 '16 at 09:33
  • @alancalvitt anecdotal evidence: no one I know uses mathematica in any sort of production env - also take any ml vision example and multiply the input size by 10000 and mma will crash – M.R. Oct 08 '16 at 19:45
  • @alancalvitti I really think mma is a super powerful tool and an elegant language, but let's be honest about its limitations, largest of which is lack of production-level robustness. – M.R. Oct 08 '16 at 21:32
  • @M.R., this aint 2007, in the cloud you can run kernels in parallel. Take a look at Emerald Cloud Labs - they run all their robots, experiments, and UI on top of Mathematica, they call it SLL = Symbolic Lab Language. The factory is built and they have clients lined up. – alancalvitti Oct 09 '16 at 15:57
  • Thank you @alancalvitti for presenting the use of Mathematica practically. Can you please tell me how to solve my problem? It would be very weird if Mathematica's core machine learning function Classify cannot work on large datasets in chunks. – user13892 Oct 09 '16 at 17:39
  • @user13892, I've seen this issue before, I don't think it's specific to Classify - try searching http://mathematica.stackexchange.com/search?q=out+of+memory, there's a code switch to increase it. I wish MMA upon installation would ask users how much memory they want to use. – alancalvitti Oct 09 '16 at 20:23
  • @alancalvitti Classify will always barf when given too much image data – M.R. Oct 25 '16 at 16:57
  • @alancalvitti I've clocked it consistently crashing at around 1.2 GB of image data – M.R. Oct 25 '16 at 16:59
  • @alancalvitti Mentioning a third party private company in defense of Mathematica's scale and stability issues is truly a non sequitur. I'd love to discuss this further in a chat session offline. – M.R. Oct 25 '16 at 17:04
  • Maybe you can divide the training set into chunks and train on chunks like this? – xslittlegrass Oct 31 '16 at 20:18
  • @M.R., can you email me? We can chat by Skype, joinme or similar. – alancalvitti Nov 08 '16 at 02:13

1 Answer

3

I've run into this issue several times over the past couple of years when testing and building classifiers. On my current 16 GB Windows laptop, I'm confident I can handle problems up to about 1 GB of data. Beyond that, things get a little dicier.

Occasionally the specific nature of the problem supports some simple ways to reduce the size of the dataset, say by some form of random sampling, or by breaking the problem into smaller chunks. But otherwise I've fallen back on the tried-and-true way of solving this problem: toss more horsepower at it. At my current company, we installed MMA on a mid-sized server (8 cores, 256 GB of RAM). I've been able to handle sizable datasets (up to 70 GB) on this machine without any difficulties, using standard MMA commands. Some of these models have successfully run for over 100 CPU hours. At my previous company, we did something similar using a virtual server in AWS running Ubuntu. This works exceedingly well.

The bottom line is that the machine you are using puts a constraint on the size of the problem you can handle in MMA.
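A minimal sketch of the random-sampling idea, using the column layout from the question; the sample size of 20000 is an arbitrary assumption to adjust against available memory:

```mathematica
(* train on a random subsample to keep the working set small;
   assumes `training` holds the six predictor columns followed by the target *)
sample = RandomSample[training, 20000];
model = Predict[Thread[sample[[All, ;; 6]] -> sample[[All, 7]]]];
```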

Now, as a couple of other commenters have noted, one of the challenges of using MMA is that it does not easily integrate with the production workflows that many companies routinely use. But that's a different issue than the one the OP raised.

MSC02476