7

PearsonChiSquareTest[data, dist] computes bins for data as a step towards calculating the test statistic. For continuous distributions, the documentation describes the general approach. If I want to get the actual bins used, particularly for discrete distributions, can it be discovered?

Some code to get started:

SeedRandom[1];
dist = PoissonDistribution[1.5];
n = 100;
data = RandomVariate[dist, n];
testData = PearsonChiSquareTest[data, dist, "TestData"]
(* {3.31031, 0.507301}  *)
nbins = 2 n^(2/5) // Ceiling;
del = DeleteDuplicates[Quantile[dist, Range[0, 1, 1/nbins]]];
observedFrequency = BinCounts[data, {del}];
expectedFrequency = n*(CDF[dist, Most[del]] - CDF[dist, Most[del - 1]]);
chisq = Total[(observedFrequency - expectedFrequency)^2/expectedFrequency]
chisq == testData[[1]]
(* False *)
J. M.'s missing motivation
  • 124,525
  • 11
  • 401
  • 574
Ian
  • 1,285
  • 10
  • 17

1 Answers1

4

Statistics`GoodnessOfFitTestingDump`pearsonChiSquareTest

Key calculations for PearsonChiSquareTest are performed inside the function Statistics`GoodnessOfFitTestingDump`pearsonChiSquareTest.

enter image description here

SeedRandom[1];
mp = MachinePrecision;
dist = PoissonDistribution[1.5];
n = 100;
data = RandomVariate[dist, n];
testData = PearsonChiSquareTest[data, dist, "TestData"]

{3.31031, 0.507301}

With the input (data, null distribution, precision, "SampleToNull") it returns the Pearson test statistic and the observed counts:

Statistics`GoodnessOfFitTestingDump`pearsonChiSquareTest[data, dist,  mp, "SampleToNull"]

{3.31031, {25, 36, 27, 8, 4}}

The following lines are the relevant parts of the code for discrete univariate distributions:

Statistics`Library`DiscreteUnivariateDistributionQ[dist]

True

Bins

nbins = Statistics`GoodnessOfFitTestingDump`iOptimalBinCount[n]

13

bins = Quiet[N[Sort[Quantile[dist, N[Range[0, 1, 1/nbins], mp]], Less], mp]];
bins = bins + N[1/2, mp];
bins[[1]] = bins[[1]] - 1;
bins

{-0.5, 0.5, 0.5, 1.5, 1.5, 1.5, 1.5, 1.5, 2.5, 2.5, 2.5, 3.5, 3.5, 18.5}

Frequencies:

cdfs = Statistics`GoodnessOfFitTestingDump`iCDF[dist, bins];
cdfs = Rest[cdfs] - Most[cdfs];
exp = N[n cdfs, MachinePrecision]

{22.313, 0., 33.4695, 0., 0., 0., 0., 25.1021, 0., 0., 12.5511, 0., 6.56425}

obs = (Length[Cases[data, x_ /; x < #1]] &) /@ bins;
obs = Rest[obs] - Most[obs]

{25, 0, 36, 0, 0, 0, 0, 27, 0, 0, 8, 0, 4}

locs = Position[exp, 0 | 0.];
exp = Delete[exp, locs]

{22.313, 33.4695, 25.1021, 12.5511, 6.56425}

obs = Delete[obs, locs]

{25, 36, 27, 8, 4}

Total[ (obs - exp)^2/exp]

3.31031

kglr
  • 394,356
  • 18
  • 477
  • 896