
I use software for pollutant propagation in rivers that takes a set of parameters ($p_1,p_2,\ldots,p_n$) as input and creates an output file. The output is essentially a matrix in which each row gives the pollutant concentration at various locations along the river at a given timestamp.

\begin{array}{|r|r|r|r|r|r|c|r|} \hline \text{TIME} & 500\,\text{m} & 1000\,\text{m} & 1500\,\text{m} & 2000\,\text{m} & 2500\,\text{m} & \ldots & 25000\,\text{m} \\ \hline \text{2015/12/07 4:50:00} & 0.75 & 0.71 & 0.60 & 0.58 & 0.55 & \ldots & 0.12 \\ \hline \text{2015/12/07 4:55:00} & 0.71 & 0.70 & 0.58 & 0.56 & 0.51 & \ldots & 0.10 \\ \hline \end{array}

My question is: what models/theories are suitable to be used to predict, based on a set of given INPUT parameters, the OUTPUT that would be closest to what the software will give as a result?

Also, how many INPUT/OUTPUT pairs will I need in order to minimize the error?

Anton Menshov
  • It's not very clear what your input parameters represent. Could you clarify this? It seems like they could be initial conditions for the pollution at various points along the river. – Tyler Olsen Oct 29 '15 at 16:06
  • If I understand your question, it sounds like you have some software that models the pollutants in the river over time. If this is the case, why do you want to find a separate model that approximates what your software gives you already? Are you just trying to make a simplified model? Is this an interpolation problem? If you are trying to make a simplified model, do you know the water velocities at points in the river? – James Oct 29 '15 at 19:57
  • @TylerOlsen The input parameters are: startDate, endDate, Pollutant Type, Pollutant Concentration, River Chainage and a set of pairs [Pollutant Quantity, DateTime] representing how much pollutant was thrown into the river and the moment when that happened. – Sorin Ciolofan Oct 31 '15 at 10:58
  • @James The software we have runs on a server, and we need an internet connection to connect to it and run a simulation. Besides this, the server could be down at the moment we need to run a simulation. So, as a backup solution, we would like to "guess", given an input i, the closest output o to what the software would give if it had the chance to run. We can use a historical archive of the past N real simulations (input/output pairs). Maybe I can use machine learning? – Sorin Ciolofan Oct 31 '15 at 11:02
  • It sounds like you're just trying to solve the transport equation for a scalar field (aka: convection-diffusion-reaction equation). This is a PDE that describes precisely your situation. If I had to guess what your software was doing, I'd say that it's solving a 1D version of this model, taking your input parameters as initial conditions and making assumptions about the river speed at every point. Check out this wiki page for more info:

    https://en.wikipedia.org/wiki/Convection%E2%80%93diffusion_equation

    – Tyler Olsen Oct 31 '15 at 15:10
  • @TylerOlsen Thanks for the reply, but that wiki page seems to present a very general form of the equation, which does not fit the pollutant propagation scenario I have. 1) Do you know what the equation will look like for my scenario (the one I described above)? 2) Can the equation from point 1) be solved with numerical methods? If yes, can you point me to an algorithm that does this? Thanks – Sorin Ciolofan Nov 09 '15 at 12:49
  • The equation is as general as it needs to be. Take the first equation, make it 1D (since that's part of your assumptions) and you have: $\dot{\phi} + c\frac{\partial \phi}{\partial x} = q$, where $\phi$ is the pollutant concentration and $q$ represents a pollutant source term. This can be solved numerically in a pretty straightforward fashion with a finite difference method, and you can use explicit time integration with small time steps. – Tyler Olsen Nov 09 '15 at 14:42
  • @TylerOlsen Trying to understand the equation above. Please correct me if I am wrong. $\phi(t,x)$ is the pollutant concentration at time $t$ at location $x$ on the river. The input we have is $m(t_{i},x^{*})=m_{i}$, i.e. the mass of pollutant that was thrown into the river at time $t_i$ at location $x^{*}$. We also know the concentration of the pollutant, $\alpha$, and the location $x^{*}$ where the pollutant is thrown. I'm not sure how this maps onto the above equation, or what $c$ and $q$ represent there. Thanks. – Sorin Ciolofan Nov 11 '15 at 12:01
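The finite-difference approach suggested in the comments can be sketched as follows. This is only a minimal illustration, not what the server software actually does: it assumes a constant downstream flow velocity $c > 0$ (the usual meaning of $c$ in the advection equation), a uniform station spacing `dx`, clean water entering at the upstream end by default, and a source term $q$ given as a rate of concentration change per unit time at each station. The function name `advect_upwind` and all parameter names are made up for the example.

```python
def advect_upwind(phi0, c, dx, dt, n_steps, source=None, inflow=0.0):
    """Explicit first-order upwind scheme for d(phi)/dt + c * d(phi)/dx = q.

    phi0   : initial concentration at each station (upstream to downstream)
    c      : flow velocity, assumed constant and positive (hypothetical input)
    source : optional function t -> list of q values, one per station
    inflow : concentration entering at the upstream boundary
    Returns a list of rows, one per time level, like the solver's output matrix.
    """
    if c <= 0:
        raise ValueError("this sketch assumes downstream flow, c > 0")
    cfl = c * dt / dx
    if cfl > 1.0:
        raise ValueError("explicit upwind is unstable unless c*dt/dx <= 1")

    phi = [float(v) for v in phi0]
    history = [list(phi)]
    for step in range(n_steps):
        q = [0.0] * len(phi) if source is None else source(step * dt)
        # Each station looks at its upstream neighbour; station 0 sees the inflow.
        upstream = [inflow] + phi[:-1]
        phi = [p - cfl * (p - u) + dt * qi for p, u, qi in zip(phi, upstream, q)]
        history.append(list(phi))
    return history


# 50 stations, 500 m apart; a unit pulse at the first station.
pulse = [1.0] + [0.0] * 49
rows = advect_upwind(pulse, c=1.0, dx=500.0, dt=500.0, n_steps=3)
# With c*dt/dx = 1 the pulse moves exactly one station per step,
# so after 3 steps it sits at index 3.
```

In this picture, the pairs [Pollutant Quantity, DateTime] from the question would enter through `source`, as a nonzero $q$ at the station nearest $x^{*}$ around each release time; converting a released mass to a concentration rate requires the river cross-section and flow, which are not given here.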

1 Answer


The question of the minimal number of data points needed from the "forward" solver is a very broad one. It depends on the complexity of the underlying phenomenon.

In the simplest case, the flow is laminar and the pollutant particles are effectively "infinitesimal"; I guess you can approximate this with Fick diffusion (+ mass transport, so a convection-diffusion law) or something similar. The output sample would then follow a Gaussian law, and there are results from mathematical statistics and optimal learning regarding the convergence rate of the estimators of such distributions as a function of sample size.
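To make the Gaussian claim concrete, here is the textbook special case, under assumptions that are not stated in the question: a constant velocity $c$, a constant diffusion coefficient $D$, a uniform cross-sectional area $A$, and a single instantaneous release of mass $M$ at $x = 0$, $t = 0$. The 1D convection-diffusion equation and its solution are then

$$\frac{\partial \phi}{\partial t} + c\,\frac{\partial \phi}{\partial x} = D\,\frac{\partial^2 \phi}{\partial x^2}, \qquad \phi(x,t) = \frac{M}{A\sqrt{4\pi D t}}\,\exp\!\left(-\frac{(x - ct)^2}{4 D t}\right).$$

At any fixed time this is a Gaussian profile in $x$ with mean $ct$ and variance $2Dt$. Because the equation is linear, several releases at different times would superpose as a sum of such profiles.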

(well, without going too deep, the Markov and Chebyshev inequalities could be used: https://en.wikipedia.org/wiki/Chebyshev%27s_inequality#Finite_samples)
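As a rough illustration of how Chebyshev's inequality turns into a sample-size estimate, consider the simplest setting, which is an assumption here and much narrower than the full input/output problem: estimating the mean of a single scalar quantity from $n$ independent samples with known standard deviation $\sigma$. Chebyshev gives $P(|\bar{X} - \mu| \ge \varepsilon) \le \sigma^2 / (n \varepsilon^2)$, so requiring the right-hand side to be at most $\delta$ yields $n \ge \sigma^2 / (\varepsilon^2 \delta)$. The helper name below is made up for the example.

```python
import math


def chebyshev_sample_size(sigma, eps, delta):
    """Smallest n with sigma^2 / (n * eps^2) <= delta.

    Assumes i.i.d. samples with known standard deviation `sigma`;
    `eps` is the tolerated error of the sample mean and `delta`
    the tolerated probability of exceeding it.
    """
    if sigma < 0 or eps <= 0 or not 0 < delta <= 1:
        raise ValueError("need sigma >= 0, eps > 0 and 0 < delta <= 1")
    return math.ceil(sigma ** 2 / (eps ** 2 * delta))


n = chebyshev_sample_size(sigma=1.0, eps=0.25, delta=0.25)  # n == 64
```

The bound is distribution-free and therefore conservative; if the Gaussian assumption really holds, tighter estimates are available.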

In a realistic case, the flow might be turbulent, so the statistics of pollutant transport become much more complex. I can't easily say, but a combination of statistical estimation and numerical PDEs (this sounds like an "uncertainty quantification" type of problem) will help you.

ocramz