What is the best theory/model to use for prediction in multivariate data?

Question

I use software that simulates pollutant propagation in rivers. It takes a set of parameters ($p_1,p_2,\ldots,p_n$) as input and creates an output file that is essentially a matrix: each row gives the concentration of the pollutant at various locations along the river at a given timestamp.

\begin{array}{|r|r|r|r|r|r|c|r|}
\hline
\text{TIME} & 500\,\text{m} & 1000\,\text{m} & 1500\,\text{m} & 2000\,\text{m} & 2500\,\text{m} & \ldots & 25000\,\text{m} \\
\hline
\text{2015/12/07 4:50:00} & 0.75 & 0.71 & 0.6 & 0.58 & 0.55 & \ldots & 0.12 \\
\hline
\text{2015/12/07 4:55:00} & 0.71 & 0.70 & 0.58 & 0.56 & 0.51 & \ldots & 0.10 \\
\hline
\end{array}

My question is: what models/theories are suitable to be used to predict, based on a set of given INPUT parameters, the OUTPUT that would be closest to what the software will give as a result?

Also, how many INPUT/OUTPUT pairs will I need in order to minimize the error?

It's not very clear what your input parameters represent. Could you clarify this? It seems like they could be initial conditions for the pollution at various points along the river. — Tyler Olsen, Oct 29 '15 at 16:06
If I understand your question it sounds like you have some software that models the pollutants in the river over time. If this is the case, why do you want to find a separate model that approximates what your software gives you already? Are you just trying to make a simplified model? Is this an interpolation problem? If you are trying to make a simplified model then do you know the water velocities at points in the river? — James, Oct 29 '15 at 19:57
@TylerOlsen The input parameters are: startDate, endDate, Pollutant Type, Pollutant Concentration, River Chainage and a set of pairs [Pollutant Quantity, DateTime] representing how much pollutant was thrown in the river and the moment when that happened. — Sorin Ciolofan, Oct 31 '15 at 10:58
@James The software we have runs on a server and we need an internet connection to connect to it and run a simulation. Besides this, the server could be down at the moment we need to run a simulation. So, as a backup solution, we would like to "guess", given an input i, the closest output o to what the software would give if it had the chance to run. We can use a historical archive of the past N real simulations (pairs of input/output). Maybe I can use machine learning? — Sorin Ciolofan, Oct 31 '15 at 11:02
It sounds like you're just trying to solve the transport equation for a scalar field (aka: convection-diffusion-reaction equation). This is a PDE that describes precisely your situation. If I had to guess what your software was doing, I'd say that it's solving a 1D version of this model, taking your input parameters as initial conditions and making assumptions about the river speed at every point. Check out this wiki page for more info:
https://en.wikipedia.org/wiki/Convection%E2%80%93diffusion_equation — Tyler Olsen, Oct 31 '15 at 15:10
@TylerOlsen Thanks for the reply, but that wiki page seems to present a very general form of the equation, which does not fit the pollutant propagation scenario I have. 1) Do you know what the equation will look like for my scenario (the one I described above)? 2) Can the equation from point 1) be solved with numerical methods? If yes, can you point me to an algorithm that does this? Thanks — Sorin Ciolofan, Nov 09 '15 at 12:49
The equation is as general as it needs to be. Take the first equation, make it 1D (since that's part of your assumptions) and you have: $\dot{\phi} + c\frac{\partial \phi}{\partial x} = q$, where $\phi$ is the pollutant concentration and $q$ represents a pollutant source term. This can be solved numerically in a pretty straightforward fashion with a finite difference method, and you can use explicit time integration with small time steps. — Tyler Olsen, Nov 09 '15 at 14:42
@TylerOlsen Trying to understand the equation above. Please correct me if I am wrong. $\phi(t,x)$ is the pollutant concentration at time $t$ at location $x$ on the river. The input we have is $m(t_{i},x^{*})=m_{i}$, i.e. the mass of pollutant which was thrown in the river at location $x^{*}$ on the river. We also know the concentration of the pollutant $\alpha$ and the location $x^{*}$ where the pollutant is thrown. Not sure how this maps onto the above equation and not sure what $c$ and $q$ represent there. Thanks. — Sorin Ciolofan, Nov 11 '15 at 12:01
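
The finite-difference approach with explicit time stepping suggested in the comments above could be sketched roughly as follows. This is only a minimal illustration, not what the simulation software actually does. It assumes that $c$ is a constant, positive flow velocity, that clean water enters at the upstream end, and it picks first-order upwind differencing as one possible finite-difference choice. The function name `advect_upwind`, the `source` callback and all numeric values are hypothetical.

```python
import numpy as np


def advect_upwind(phi0, c, dx, dt, n_steps, source=None):
    """Explicit first-order upwind scheme for d(phi)/dt + c * d(phi)/dx = q.

    Assumes c > 0 (flow in the +x direction) and zero concentration entering
    at the upstream boundary. `source(t)` is a hypothetical callback that
    returns the source term q on the grid at time t.
    Returns an array of shape (n_steps + 1, len(phi0)), one row per time level.
    """
    cfl = c * dt / dx
    if not 0.0 < cfl <= 1.0:
        # The explicit upwind scheme is only stable for 0 < c*dt/dx <= 1.
        raise ValueError(f"unstable step: need 0 < c*dt/dx <= 1, got {cfl}")

    phi = np.asarray(phi0, dtype=float).copy()
    rows = [phi.copy()]
    for n in range(n_steps):
        if source is None:
            q = np.zeros_like(phi)
        else:
            q = np.asarray(source(n * dt), dtype=float)
        # Concentration in the upstream neighbour of each cell (0 at the inflow).
        upstream = np.concatenate(([0.0], phi[:-1]))
        phi = phi - cfl * (phi - upstream) + dt * q
        rows.append(phi.copy())
    return np.array(rows)


# Illustrative setup only: stations every 500 m up to 25000 m, 5-minute steps.
x = np.arange(500.0, 25000.0 + 1.0, 500.0)  # 50 stations
phi0 = np.zeros_like(x)
phi0[0] = 0.75  # made-up initial pulse at the first station
result = advect_upwind(phi0, c=1.0, dx=500.0, dt=300.0, n_steps=10)
print(result.shape)  # (11, 50): one row per timestamp, like the output matrix
```

Each row of `result` plays the role of one row of the output matrix in the question. The upwind scheme smears sharp fronts (numerical diffusion), so a real backup solver would likely need a finer grid or a higher-order scheme, plus a physical diffusion term.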

ocramz · Answer 1 · 2015-10-30T06:55:46.110

The question of the minimal number of data points needed from the second code, the "forward" solver, is a very broad one. The answer depends on the complexity of the underlying phenomenon.

In the simplest case, the flow is laminar and the pollutant particles are effectively "infinitesimal"; I guess you can approximate this with Fickian diffusion plus mass transport (that is, a convection-diffusion law) or something similar. The output sample would then follow a Gaussian law, and there are results from mathematical statistics and optimal learning on how fast the estimators of such distributions converge as a function of sample size.
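
As a hedged illustration of that Gaussian law: assume a constant flow velocity $c$, a constant diffusion coefficient $D$, a uniform river cross-section of area $A$, and a single instantaneous release of mass $M$ at location $x^{*}$ and time $t_0$ (the symbols $D$, $A$, $M$ and $t_0$ are introduced here for illustration and do not come from the question). Under those assumptions the 1D convection-diffusion equation has the textbook solution

$$
\phi(x,t) = \frac{M}{A\sqrt{4\pi D\,(t-t_0)}}\,
\exp\!\left(-\frac{\bigl(x - x^{*} - c\,(t-t_0)\bigr)^{2}}{4D\,(t-t_0)}\right),
\qquad t > t_0 .
$$

The profile is a Gaussian in $x$ whose mean $x^{*} + c\,(t-t_0)$ moves with the flow and whose variance $2D\,(t-t_0)$ grows linearly in time, so in this idealized case predicting the output amounts to estimating a few parameters.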

(well, without going too deep, the Markov and Chebyshev inequalities could be used: https://en.wikipedia.org/wiki/Chebyshev%27s_inequality#Finite_samples)
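
To make the Chebyshev argument concrete, here is a small sketch of the sample-size bound it gives for estimating a mean. It assumes independent samples with a known standard deviation `sigma`, which is a strong simplification of the real problem; the function name `chebyshev_sample_size` and the numbers are hypothetical. Chebyshev's inequality applied to the sample mean gives $P(|\bar{X}-\mu| \ge \varepsilon) \le \sigma^2/(n\varepsilon^2)$, so requiring the right-hand side to be at most $\delta$ yields $n \ge \sigma^2/(\delta\varepsilon^2)$.

```python
import math


def chebyshev_sample_size(sigma, eps, delta):
    """Smallest n with sigma**2 / (n * eps**2) <= delta.

    With n independent samples of standard deviation sigma, Chebyshev's
    inequality then guarantees that the sample mean is within eps of the
    true mean with probability at least 1 - delta.
    """
    if sigma < 0 or eps <= 0 or not 0 < delta <= 1:
        raise ValueError("need sigma >= 0, eps > 0 and 0 < delta <= 1")
    return max(1, math.ceil(sigma ** 2 / (delta * eps ** 2)))


# Made-up numbers: sigma = 2, tolerance 0.5, failure probability 0.25.
n_needed = chebyshev_sample_size(sigma=2.0, eps=0.5, delta=0.25)
print(n_needed)  # 64
```

The bound makes no assumption about the shape of the distribution, so it is conservative; if the output really is Gaussian, far fewer samples would be enough for the same guarantee. Note also that halving the tolerance `eps` quadruples the required sample size.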

In a realistic case, the flow might be turbulent, so the statistics of pollutant transport become much more complex. I can't easily say, but a combination of statistical estimation and numerical PDEs (this sounds like an "uncertainty quantification" type of problem) will help you.
