pgfplots Histogram with Curves

Question

Variations on this question have been asked many times before, but I have not been able to find a working solution. The closest I have found are discussions on other Stack Exchange sites about how to produce this with R.

I am trying to produce the following plot, with red and blue lines (the position of the lines is merely an estimate of where I expect them to appear in a solution, which must calculate their position based on the information in the data.csv file):

Part of my difficulty in asking this question is that I am not sure of the correct terminology for the red and blue lines that I have drawn on the plot. I do not think they are Gaussian curves, as they are not symmetrical, but I have nevertheless included Gaussian code in the MWE below in case my implementation of it is simply wrong (there is a very, very small Gaussian plot at the bottom left of the above chart, which I have clearly been unsuccessful in implementing).

EDIT: As discussed in the comments below, I am looking for the curves to demonstrate a smooth estimation of the distribution of the results.

MWE

\documentclass{standalone}
\usepackage{pgfplots}
\usepackage{filecontents}
\begin{document}

\begin{filecontents*}{data.csv}
COLUMNA,COLUMNB
38,22
85,18
104,82
56,20
202,57
64,15
115,22
8,20
120,14
81,24
100,28
39,11
81,29
25,18
122,51
93,10
45,19
103,11
33,24
60,24
50,47
61,24
46,14
45,15
84,72
62,20
50,13
84,38
52,19
108,5
182,34
145,19
117,12
34,59
43,19
42,26
170,18
31,27
86,18
183,24
36,15
,21
,16
,26
\end{filecontents*} 

\pgfmathdeclarefunction{gauss}{2}{\pgfmathparse{1/(#2*sqrt(2*pi))*exp(-((x-#1)^2)/(2*#2^2))}%
}

\begin{tikzpicture}
\begin{axis}[
ybar,
/pgf/number format/.cd,
use comma,
1000 sep={},
title={Title},
xlabel={Bins},
ylabel={Instances},
x label style={at={(axis description cs:0.5,-0.1)},anchor=north},
y label style={at={(axis description cs:0.05,0.5)},anchor=south},
%xticklabel style={rotate=90, anchor=near xticklabel},
xtick distance=50,
ytick distance=2,
width=\textwidth, %10.5cm
height=6cm,
axis y line*=left,
axis x line*=bottom,
ymin=0,
xmin=0,
xticklabel interval boundaries,
]

%%%
\addplot +[blue,
fill opacity=0.5,
hist={bins=22,
data min=0,
data max=220,
}
] table[y=COLUMNA, col sep=comma] {data.csv};
\addlegendentry{Series A}

\addplot +[red,
fill opacity=0.5,
hist={bins=22,
data min=0,
data max=220,
}
] table[y=COLUMNB, col sep=comma] {data.csv};
\addlegendentry{Series B}

\addplot [fill=red!50, draw=none, domain=0:220] {gauss(1.86,2.12)};

\end{axis}
\end{tikzpicture}
\end{document}

What is the meaning of the curves then? Is it a smooth estimation of the distribution? — caverac, Jan 24 '19 at 10:48
I'm voting to close this question as off-topic because you are searching for the right distribution and the corresponding parameters to fit your bins. But that clearly is not a LaTeX question. Once you have found the distribution + parameters you already know how to implement it, because you have already shown this for the Gaussian distribution. — Stefan Pinnow, Jan 24 '19 at 13:13
@Craig Moreover, it requires a lot of numerical computation. It is better to do it outside the TeX environment and then use the data to plot things as you wish ;) — Raaja_is_at_topanswers.xyz, Jan 24 '19 at 13:15
Okay, thanks for your comments. I had hoped that a solution would be available whereby TeX itself would be capable of performing and plotting the necessary calculations (as it already does with a Gaussian curve), but if I understand @StefanPinnow correctly, what I am after is beyond its capabilities. — Craig, Jan 24 '19 at 13:38
@Craig It is not beyond its capabilities. But TeX is not meant to do such tasks ;) (which I myself learned recently). Moreover, doing such a thing will make the entire code more complicated than it should be (which, IMO, could be avoided whenever possible). — Raaja_is_at_topanswers.xyz, Jan 24 '19 at 13:41
@Raaja is right. TeX could do this, but it is much better to use a suitable tool for that task. Here the main task is to find a suitable distribution, which is brain work, and no tool will tell you that unless it has several distributions implemented and solves for the parameters itself. A "degree of fitting" would then be used to determine the best fit/distribution. Once you know which type of distribution you have, you could e.g. use gnuplot to fit the parameters. This can be done within an \addplot command using raw gnuplot ... — Stefan Pinnow, Jan 24 '19 at 14:28
That is why I vote to close this question as off-topic. The main task is a brain task, or at least a mathematical task, and would therefore be better asked elsewhere. — Stefan Pinnow, Jan 24 '19 at 14:30
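For illustration, the raw gnuplot route mentioned in the comments above might be sketched as follows. Note that this swaps in a Gaussian kernel density estimate (gnuplot's `smooth kdensity`) instead of fitting the parameters of a named distribution, since no distribution was settled on in the thread. The bandwidth of 10, the per-point weight of 10 (chosen to equal the bin width so that the curve is on the same scale as the counts), and the id `kdeA` are assumptions made for this sketch only. It has not been run against this data, and it assumes that gnuplot is installed, that the document is compiled with -shell-escape, and that data.csv from the MWE already exists in the working directory.

```latex
% Sketch only: needs gnuplot on the PATH and pdflatex -shell-escape.
\documentclass{standalone}
\usepackage{pgfplots}
\pgfplotsset{compat=1.16}
\begin{document}
\begin{tikzpicture}
\begin{axis}[ymin=0, xmin=0, xmax=220, xlabel={Bins}, ylabel={Instances}]
  % Histogram of COLUMNA: 22 bins of width 10, as in the MWE.
  \addplot [hist={bins=22, data min=0, data max=220}, draw=blue, fill=blue!30]
    table [y=COLUMNA, col sep=comma] {data.csv};
  % Kernel density estimate computed by gnuplot.
  % Weight 10 per point (= bin width) and bandwidth 10 are assumptions.
  \addplot [blue, thick, no markers] gnuplot [raw gnuplot, id=kdeA] {
    set datafile separator ',';
    set samples 200;
    plot 'data.csv' skip 1 using 1:(10) smooth kdensity bandwidth 10;
  };
\end{axis}
\end{tikzpicture}
\end{document}
```

A smaller bandwidth follows the individual bins more closely, while a larger one gives a flatter curve. The second series would be handled the same way with `using 2:(10)` and y=COLUMNB.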
A Gaussian is well-defined by the mean, variance and normalization. These are things that one can let LaTeX find out without too much pain. If you had a prescription that tells us how the envelope is determined precisely, one could gauge more easily if this can be implemented in LaTeX smoothly, too. However, as long as this information is not there, users may be hesitant to write something only to hear in the end that this is not quite what you wanted. — , Jan 24 '19 at 15:13
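To make the role of the normalization concrete, here is a sketch of the Gaussian that would sit on the same scale as a histogram of counts. The symbols N (number of samples), w (bin width), \mu and \sigma are notation introduced here, not taken from the question.

```latex
% Assumed notation: N = number of samples, w = bin width,
% \mu, \sigma = sample mean and standard deviation.
\[
  g(x) = N\,w\,\frac{1}{\sigma\sqrt{2\pi}}
         \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
  \qquad
  \mu = \frac{1}{N}\sum_{i=1}^{N} x_i,
  \qquad
  \sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i-\mu)^2 .
\]
```

For COLUMNA in the MWE, N = 41 and w = 10, so the prefactor would be 410. The curve drawn in the MWE, gauss(1.86,2.12), has unit area and is centred near x = 1.86, which would explain why it shows up only as a tiny bump at the bottom left.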
Your question got closed, so it is futile trying to answer it. You could come up with a well-defined prescription for the smooth plot using convolutions, see e.g. here. At the same time, Fran's answer substantiates what @Raaja is saying: you may be better off using external programs. (I do, however, believe that it will be possible to add a solution using pgfplots only, and it may not even be exceedingly convoluted.) — , Jan 24 '19 at 19:26
@marmot I just numerically checked the distribution (based on the hypotheses I made from OP's question) and, first, it doesn't follow the OP's motivation. The numerical values can be represented using a Gaussian. However, it doesn't represent the distribution of the so-called binned data ;) So, unless OP makes it clear what is of interest, as you said, it is indeed a futile attempt per se. For reference, https://imgur.com/a/DFcwiy0 represents the distribution of the first column of the data that is here. — Raaja_is_at_topanswers.xyz, Jan 24 '19 at 19:31
@Raaja I was only suggesting to smoothen out the distribution by using some sort of discrete convolution of the data. Something like f(x)=0.1*data(x-2)+0.2*data(x-1)+0.4*data(x)+0.2*data(x+1)+0.1*data(x+2) (of course not for the points at the ends of the interval). This would give you a sample of points you can draw a smooth curve through. Obviously, there are much more sophisticated recipes on the market, and I fully agree with you that one should then resort to standard software tailored for that. — , Jan 24 '19 at 19:36
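Written out for bin counts, the discrete convolution suggested above might look like the sketch below. Here h_i denotes the count in bin i and \tilde h_i the smoothed value. Both symbols are introduced for this illustration, and applying the formula to the bin counts rather than to the raw values is an assumption.

```latex
% Smoothed count for interior bins (the two bins at each end are left out).
\[
  \tilde h_i = 0.1\,h_{i-2} + 0.2\,h_{i-1} + 0.4\,h_i
             + 0.2\,h_{i+1} + 0.1\,h_{i+2}
\]
% Worked value for COLUMNA, fifth bin [40,50), assuming left-closed bins
% of width 10, so that the counts in bins 3 to 7 are 1, 6, 5, 4, 4:
\[
  \tilde h_5 = 0.1\cdot 1 + 0.2\cdot 6 + 0.4\cdot 5
             + 0.2\cdot 4 + 0.1\cdot 4 = 4.5
\]
```

The weights sum to 1, so the smoothing leaves the total count unchanged away from the ends of the interval. The resulting points (bin centre, \tilde h_i) could then be joined with a smooth line.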
@Craig have a look here: https://math.stackexchange.com/questions/497878/how-to-convert-a-histogram-to-a-pdf. It might be helpful in your case ;) — Raaja_is_at_topanswers.xyz, Jan 24 '19 at 20:10
Everyone, thanks for your useful and insightful comments on this, they're greatly appreciated. I'm not a statistician (my field is completely unrelated!) so I'm in the deep end with this sort of thing. Your suggestions and comments are well-received. — Craig, Jan 26 '19 at 13:33
