14

I am studying a two-dimensional dataset, whose mean vector and covariance matrix are the following:

mean = {0.968479, 0.020717}
cov = {{0.0000797131, 0.000069929},{0.0000699293, 0.00174702}}

I want to generate a contour plot of the 95% confidence ellipse.

Alexey Popkov
  • 61,809
  • 7
  • 149
  • 368
Abhijit Saha
  • 449
  • 1
  • 4
  • 10
  • 1
    Is this about Mathematica, a software? related. 51163 – Kuba Apr 28 '15 at 10:03
  • Please look at this http://mathematica.stackexchange.com/questions/2411/showing-the-correlation-of-two-variables-using-a-plot – Abhijit Saha Apr 28 '15 at 10:13
  • 1
    Why not use EllipsoidQuantile on your dataset? – Michael E2 Apr 28 '15 at 11:03
  • Did you look at this? http://mathematica.stackexchange.com/questions/794/plot-ellipse-based-on-eigensystem/867#867 – Cassini Apr 28 '15 at 12:17
  • 1
    @MichaelE2 That's an excellent point about EllipsoidQuantile; thanks for bringing this up. In my understanding, however, that function answers a subtly different question. I expanded my answer below to cover this point as well. – MarcoB Apr 29 '15 at 19:45
  • @AbhijitSaha I wonder if you have had a chance to look over my answer below. Is that what you were looking for? If so, I'd appreciate it if you would consider accepting it. Otherwise, please let me know! – MarcoB Apr 29 '15 at 19:47
  • @MacroB thanks a lot. This is what i exactly wanted to know. – Abhijit Saha May 05 '15 at 18:24
  • I'm glad to hear that my answer works for you. I'd be honored if you could accept it. Thanks! – MarcoB May 11 '15 at 06:29

2 Answers2

40

The executive summary

You can use the built-in Ellipsoid function directly with your calculated mean and covariance. For 95% confidence, use:

Ellipsoid[mean, cov Quantile[ChiSquareDistribution[2], 0.95]]

That expression returns an Ellipsoid object that you can visualize as an Epilog to a ListPlot, or as an argument to Graphics (further formatting below).

Ellipsoids for other common critical values can be obtained in the same way. Note the different multipliers to cov:

90% :   Ellipsoid[mean, cov Quantile[ChiSquareDistribution[2], 0.9]]
95% :   Ellipsoid[mean, cov Quantile[ChiSquareDistribution[2], 0.95]]
99% :   Ellipsoid[mean, cov Quantile[ChiSquareDistribution[2], 0.99]]

A more detailed answer

First let me start by mentioning that a covariance matrix must be symmetric. You have a missing digit in cov[[1,2]] that makes your orignal one non-symmetric; I assume that it's a typo and will use the symmetric version below:

 mean = {0.968479, 0.020717};
 cov  = {{0.0000797131, 0.000069929}, {0.0000699293, 0.00174702}}

The easiest way to generate an ellipsoid with the right location and alignment given your distribution is to feed the mean and covariance directly to the Ellipsoid function, simply as Ellipsoid[mean, cov]. The resulting Ellipsoid is a graphical primitive, so it can be plotted on top of your data using e.g. Epilog or Graphics.

To get a practical example, let us generate and plot some random points from your distribution, assuming that it is normal.

 SeedRandom[1];
 sampledata = RandomVariate[MultinormalDistribution[mean, cov], 2500];

ListPlot[ sampledata, PlotRange -> All, PlotRangePadding -> Scaled[0.05], AspectRatio -> 1, Axes -> None, Frame -> True, Epilog -> {Opacity[0], EdgeForm[{Thick, Red}], Ellipsoid[mean, cov]} ]

Data and single-wide ellipsoid

As you can see, however, an Ellipsoid that is "one covariance wide", as the one we plotted, contains only a small fraction of the sampled points (only roughly 40% of the points, see below). Instead you requested an ellipsoid containing 95% of the points from your distribution. We need a wider Ellipsoid for that: but how wide?

We can figure that out by using the probability distribution function (PDF) for your multivariate distribution. We can integrate the PDF over a parametric ellipsoidal region to calculate what fraction of the samples falls within that region. Let's consider a two-dimensional MultinormalDistribution, with zero average and covariance expressed by {{sigmax^2, rho sigmax sigmay}, {rho sigmax sigmay, sigmay^2}}, where sigmax and sigmay are the standard deviations associated with each of the two independent variables, and rho is the correlation coefficient between the two variables. The standard deviations are positive numbers, and 0 <= rho <= 1.

Here we calculate an expression for the fraction of points found within a two-dimensional ellipse centered around zero (the mean of this distribution) and "n covariances wide" (notice the n factor in the Ellipsoid's descriptor. The integration is carried out over the region defined by that "n-wide" ellipse.

gencovar = {{sigmax^2, rho sigmax sigmay}, {rho sigmax sigmay, sigmay^2}};

Assuming[ {n > 0, sigmax > 0, sigmay > 0, 0 <= rho < 1}, Simplify[ Integrate[ PDF[MultinormalDistribution[{0, 0}, gencovar], {x, y}], {x, y} [Element] Ellipsoid[{0, 0}, n gencovar] ] ] ]

(* 1-E^(-n/2) *)

Now let's tabulate the value of that expression for a few n.

 TableForm[#, TableAlignments -> {Right, Top}] &@
  Table[{ToString@n <> "x cov", ToString@Round[100 %, 1] <> "%"}, {n, 1, 9, 1}]

(* 1x cov 39% 2x cov 63% 3x cov 78% 4x cov 86% 5x cov 92% 6x cov 95% 7x cov 97% 8x cov 98% 9x cov 99% *)

This means that our original "single-wide" Ellipsoid contained only 39% of the samples; to get 95% inclusion we need a 6x wide Ellipsoid.

Let's plot that for your original distribution (notice the all-important 6x factor in the Ellipsoid definition):

 ListPlot[
   sampledata,
   PlotRange -> All, PlotRangePadding -> Scaled[0.05], 
   AspectRatio -> 1, Axes -> None, Frame -> True,
   Epilog -> {Opacity[0], EdgeForm[{Thick, Red}], Ellipsoid[mean, 6 cov]}
 ]

Data with 95% confidence ellipsoid

Finally, we can also confirm this by explicitly counting the samples that fall within this Ellipsoid. The expression Select[sampledata, RegionMember[Ellipsoid[mean, n cov]]] allows us to select those samples in sampledata that lie within the geometric region defined by the "6x wide" Ellipsoid[mean, 6 cov].

 N@Length@Select[sampledata, RegionMember[Ellipsoid[mean, 6 cov]]] / Length@sampledata

(* 0.9544 *)

As expected from our previous calculations, approximately 95% of the points reside within the 6x Ellipsoid we defined.

For clarity, Length@Select[sampledata, RegionMember[Ellipsoid[mean, n cov]]] is the number of points lying within the Ellipsoid region. Length@sampledata is the total number of points in our sample. N is there to obtain an approximate numerical answer, rather than a symbolic one.

Why not use EllipsoidQuantile?

EllipsoidQuantile is a function available in the MultivariateStatistics package that was mentioned by @Michael E2 in a comment as a possible solution to this problem. EllipsoidQuantile[dataset, q] returns an Ellipsoid centered on the mean of dataset and scaled to contain a fraction q of the dataset.

At a glance, this would seem exactly what the OP asked, but this function behaves in a subtly different way. In my understanding, it treats dataset as the entire population, rather than as a sample from a larger population. If the sample available is large and it represents the population well, then the results of the two methods will be essentially indistinguishable. However, the two methods will give noticeably different results for smaller samples and high levels of confidence q.

I also have two more practical reasons not to use the EllipsoidQuantile function. First, it is difficult to apply styles to its output, although I have never been able to pinpoint why exactly. Additionally, some definitions in MultivariateStatistics seem to shadow definitions of functions that have since transitioned to built-in status, e.g. PrincipalComponents and MultinormalDistribution, so I'd rather not load the package unless it's absolutely necessary.

Here is some Manipulate code that allows one to compare the two results on the sampledata I generated above, and an example of when the two approaches differ considerably.

Needs["MultivariateStatistics`"]

Manipulate[ Show[{ ListPlot[ sampledata[[1 ;; ;; every]], Axes -> None, Frame -> True, Epilog -> Inset[Style["alpha = " <> ToString@alpha, FontSize -> 14], Scaled[{0.85, 0.9}]] ], Graphics[{ (* using EllipsoidQuantile ) EllipsoidQuantile[sampledata[[1 ;; ;; every]], alpha], ( Using the Ellipsoid method outlined above *) {Opacity[0], EdgeForm[{Thick, Red}], Ellipsoid[ Mean@sampledata[[1 ;; ;; every]], Evaluate[n /. First@Quiet@Solve[1 - E^(-n/2) == alpha, n]] Covariance@sampledata[[1 ;; ;; every]] ] } }] }, PlotRange -> {{0.93, 1.01}, {-0.17, 0.24}} ],

(* Manipulate variables *) {{alpha, 0.95, "[Alpha] value"}, 0.9, 0.99, 0.01, Appearance -> "Open"}, {{every, 100, "points"}, {1 -> "All", 10 -> "250", 20 -> "125", 50 -> "50", 100 -> "25", 250 -> "10"}, ControlType -> SetterBar} ]

A sample output highlights the difference: in red is the Ellipsoid result, and in black the EllipsoidQuantile output:

Manipulate to compare Ellipsoid and EllipsoidQuantile

MarcoB
  • 67,153
  • 18
  • 91
  • 189
  • 5
    Very nice explanation, +1 – bobthechemist Apr 29 '15 at 02:58
  • @bobthechemist Thank you! I appreciate the support, especially from a fellow chemist! – MarcoB Apr 29 '15 at 15:38
  • I appreciate the difference between the sample and population, but as far as styling, the output is an Ellipsoid. I don't see how that could be more difficult than your constructed Ellipsoid. I, too, am annoyed by the shadowing. They should promote it to the "System`" context like the most of the rest of statistics. (+1) – Michael E2 Apr 29 '15 at 20:01
  • 1
    @MichaelE2 EllipsoidQuantile does not actually return a standard Ellipsoid object. For instance, EllipsoidQuantile[sampledata,0.9] returns Ellipsoid[{0.968, 0.0214}, {0.0867, 0.0185}, {{0.0422, 0.999}, {-0.999, 0.0422}}]: this is a call to Ellipsoid with three arguments, instead of the two expected by the built-in function. This seems linked to an older Ellipsoid function within MultivariateStatistics that used three arguments. – MarcoB Apr 30 '15 at 04:56
  • @MichaelE2 The MultivariateStatistics package loads that older definition as well, so its EllipsoidQuantile function still works apparently seamlessly. However, this older Ellipsoid does not play nice with others: EdgeForm and FaceForm have no effect on it; Opacity applies to its edge, and not to its face as it does for the built-in Ellipsoid; this "three-parameter" Ellipsoid cannot be used as an Epilog to e.g. ListPlot (an attempt returns Ellipsoid is not a graphics primitive or directive). Sneaky! – MarcoB Apr 30 '15 at 04:57
  • @MichaelE2 However, I'm not very adept at looking into the guts of the MMA environment yet, so if I'm on a completely wrong path, I'd appreciate any pointers! – MarcoB Apr 30 '15 at 05:01
  • 3
    @MarcoB No, you're right, thanks. I didn't notice those details. The built-in Ellipsoid[mean, 6 cov] is displayed as GeometricTransformation[Disk[{0, 0}], {{{0.02186958161465372, 0.}, {0.019185360167972444, 0.10056859328450993}}, {0.968479, 0.020717}}] and the older one is converted to a closed, styled Line object inside Graphics. For instance, one can use this with ListPlot: Epilog -> {First@Graphics[{Red, EllipsoidQuantile[sampledata, 0.9]}]}. (The menu command Cell > Show Expression and adding // InputForm are some ways to investigate "What happened?".) – Michael E2 Apr 30 '15 at 11:52
  • @MichaelE2 That is very insightful. I had an inkling that the old Ellipsoid really had to be just a line rather than a disk. I am glad to see that confirmed, and to learn something new in the process! – MarcoB May 01 '15 at 03:01
  • @MichaelE2 Just a side note, Graphics cannot handle multiple MultivariateStatistics`Ellipsoids in v11.2, while it can in v9. Sample code: data = RandomReal[1, {10, 2}]; q = EllipsoidQuartiles[data]; q // Graphics I guess the issue is introduced since v10. – xzczd Nov 28 '17 at 06:45
  • @xzczd Strange. Graphics /@ q // Show works. Graphics[Ellipsoid[..]] works but not Graphics[{Ellipsoid[..]}] does. It's probably a bug, since the example in the docs does not work, either. – Michael E2 Nov 28 '17 at 13:09
4

Working code with Ellipsoid and Eigensystem

<< MultivariateStatistics`
mean = {0, 0};
cov = {{1, 0.7}, {0.7, 1}};
sampledata = RandomVariate[MultinormalDistribution[mean, cov], 2500];
Show[
 ListPlot[sampledata, PlotRange -> All],
 Graphics[
  Ellipsoid[{0, 0}, Sqrt[2*6*Eigensystem[cov][[1]]], 
   Eigensystem[cov][[2]]]],
 PlotRange -> {{-5, 5}, {-5, 5}}, AspectRatio -> 1
 ]

F

Fabio
  • 1,357
  • 6
  • 13