
I did search the question database, and although one or two questions came close, they didn't really address my specific question.

In adaptive control based on minimizing tracking error (e.g. between plant and model), the designer is free to choose a cost function. More often than not the cost is selected as a function of the squared error.

But I've found in some practical applications that I can achieve a more robust controller by using absolute error. I understand that absolute error weights errors more uniformly regardless of their size, and I suspect the squared error tends to 'wind up' the adaptive controller when the initial errors are large. But I'm not sure how to show this in a generalized way. So I have two questions regarding this (a simulation sketch follows the questions below):

  1. Is there perhaps a simple analysis that can demonstrate the difference in stability characteristics between the absolute and squared error choices of cost function?

  2. Any references on the matter?
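
As a concrete illustration of the setup described in the comments below (scalar plant, first-order-lag reference model, adjustable gain adapted by an MIT-rule gradient), here is a minimal simulation sketch comparing the two cost choices. The plant constants, adaptation gain, and reference signal are illustrative assumptions, not values from the actual application:

```python
import numpy as np

# Minimal MIT-rule loop: plant y' = -a*y + b*u with u = theta*r, and a
# first-order-lag reference model ym' = -a*ym + bm*r. The adjustable
# gain theta should approach bm/b. All constants are assumptions.
a, b, bm = 1.0, 2.0, 2.0   # plant pole, plant gain, model gain
gamma = 0.5                # adaptation gain
dt, T = 1e-3, 20.0
n = int(T / dt)

def simulate(update):
    """Euler-integrate the loop; `update(e, ym)` returns d(theta)/dt."""
    y = ym = theta = 0.0
    for i in range(n):
        r = 1.0 if (i * dt) % 4.0 < 2.0 else -1.0   # square-wave reference
        e = y - ym                                   # tracking error
        theta += dt * update(e, ym)
        y += dt * (-a * y + b * theta * r)
        ym += dt * (-a * ym + bm * r)
    return theta

# Squared cost J = e^2/2  ->  d(theta)/dt = -gamma * e * (de/dtheta),
# with the usual MIT-rule sensitivity approximation de/dtheta ~ ym.
theta_sq = simulate(lambda e, ym: -gamma * e * ym)

# Absolute cost J = |e|  ->  d(theta)/dt = -gamma * sign(e) * ym, so a
# large initial error no longer inflates the parameter step ("wind-up").
theta_abs = simulate(lambda e, ym: -gamma * np.sign(e) * ym)

print(theta_sq, theta_abs)   # both should approach bm/b = 1.0
```

Under the squared cost the update is proportional to the error itself; under the absolute cost the step size depends only on the error's sign, which is one way to see the different transient behavior for large initial errors.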

docscience
  • I'm not a controls guy so my opinion matters little. However, I wonder if squared error is typically used because of its obvious connection to least-squares based methods. One nice feature of the squared error cost function versus absolute error: the squared error cost function's derivative with respect to the error is continuous everywhere, while the absolute error cost function's derivative is discontinuous at $\text{error} = 0$. I could imagine that this could have implications for stability and/or tracking error. – Jason R Sep 08 '16 at 15:27
  • What form does your cost-function have? The standard least-squares cost-function would typically be $J(\theta) = \int_{0}^{t} \left( y(\tau) - \theta(t) u(\tau) \right)^{2} \textrm{d} \tau$. Are you suggesting $J(\theta) = \int_{0}^{t} \left| y(\tau) - \theta(t) u(\tau) \right| \textrm{d} \tau$ instead? – Arnfinn Sep 14 '16 at 10:40
  • @Arnfinn my question is more general than specific, but the practical work that generated it is an application of model reference adaptive control where the plant is considered a scalar and the model is a first-order lag. In this application the error was the difference between the output of the closed-loop plant, using an integrator with adjustable gain, and the model output. The cost I examined was either the square or the absolute value of this error. So I guess the answer to your question is yes. But I'm using a gradient (MIT-like) minimization rather than least squares. – docscience Sep 14 '16 at 13:36
  • Do you have the book by Ioannou and Sun? https://www.amazon.com/Robust-Adaptive-Control-Electrical-Engineering/dp/0486498174 – Arnfinn Sep 14 '16 at 23:03
  • At any rate, this is a non-linear problem as you probably know, and the general frameworks for stability and convergence analysis would be Lyapunov stability analysis and/or the Grönwall–Bellman lemma... – Arnfinn Sep 14 '16 at 23:06
  • @Arnfinn sorry I missed your comments so far back. Yes, I have Ioannou & Sun's book. Petros was my advisor at USC and I know Jing through the ACC. Both cost functions can be shown to be stable through Lyapunov analysis, but apart from stability, I'm still struggling to learn why one choice might be better than the other for reasons other than stability. Nothing in Petros's book discusses this, unfortunately. – docscience Feb 18 '17 at 15:47
  • @docscience: In statistics, the squared function would be viewed as less robust in the sense that it has a breakdown point of 1 percent. The absolute deviation is more robust and its breakdown point is greater (I think 50 percent, IIRC). The link below will explain it better than I could: www.stat.umn.edu/geyer/5601/notes/break.pdf – mark leeds Nov 30 '17 at 03:35
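
To make the breakdown-point comparison in the last comment concrete, here is a quick sketch with hypothetical data; the mean is the minimizer of the squared-error loss, and the median the minimizer of the absolute-error loss:

```python
import numpy as np

# Hypothetical sample: ten points near 1.0, with one gross outlier.
x = np.array([0.8, 0.9, 0.9, 1.0, 1.0, 1.0, 1.1, 1.1, 1.2, 100.0])

# The mean (squared-error minimizer) is dragged toward the outlier;
# the median (absolute-error minimizer) barely moves.
print(np.mean(x))    # ~10.9
print(np.median(x))  # 1.0
```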

2 Answers


This is an interesting question since both squared error and absolute error are convex functions, so minimizing either will find the optimal solution. My intuition is that the $\ell_2$-norm (sum of squared values of the error) converges to zero more quickly than the $\ell_1$-norm (sum of absolute values of the error) when the search direction is right. For example, let the error be $\epsilon = y - Ax$. If $y$ is close to $Ax$ so that $0 < |\epsilon| \leq 1$, then $\epsilon^2$ is smaller than $|\epsilon|$; conversely, if $|\epsilon| > 1$, then $\epsilon^2 > |\epsilon|$. So the $\ell_2$ norm penalizes large errors more, and small errors less, than the $\ell_1$ norm. This can also be seen from the graphs of the two functions for $\epsilon \in \mathbb{R}$.

[Figure: graphs of $\epsilon^2$ and $|\epsilon|$ as functions of $\epsilon$]

Of course the continuity of the derivative is another factor. The derivative of the absolute value technically does not exist at the minimum.
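
A short sketch tabulating these two points (the error values are chosen purely for illustration): the squared penalty and its gradient shrink near zero error and grow for large errors, while the absolute penalty weights all error sizes uniformly and has a unit-magnitude gradient that jumps at zero:

```python
import numpy as np

# Compare the two penalties and their derivatives over a range of errors.
eps = np.array([-3.0, -1.0, -0.5, 0.5, 1.0, 3.0])

print(eps ** 2)      # squared error: small errors shrunk, large ones amplified
print(np.abs(eps))   # absolute error: uniform weighting of error size
print(2 * eps)       # d(eps^2)/d(eps): vanishes smoothly near the minimum
print(np.sign(eps))  # d|eps|/d(eps): constant magnitude, jumps at eps = 0
```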

orchi_d

You should use squared error rather than absolute error if you expect to have both positive and negative error values in your population. By squaring the error, only the magnitude of each error value matters (although it is squared). If you want to recover the actual magnitude of error in the population, just take the square root at the end; this value is known as the root mean squared error (RMSE).

If, instead, you were to use absolute error values, the positives and negatives in your population would cancel each other out, giving you a false metric.
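
For reference, a minimal sketch of both metrics as conventionally defined (RMSE squares each error term; MAE averages the absolute value of each error term), using the error values that come up in the comment thread below:

```python
import numpy as np

# Error values taken from the comment discussion below this answer.
e = np.array([0, 1, -1, 2, -2, 3, -3, 4, -4], dtype=float)

rmse = np.sqrt(np.mean(e ** 2))  # sqrt(60/9) ~ 2.58
mae = np.mean(np.abs(e))         # 20/9 ~ 2.22; |.| keeps signs from cancelling
print(rmse, mae)
```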

user3877654
  • Hi: They don't cancel each other out using the absolute value, just as they don't cancel each other out using squaring. There's nothing wrong with the absolute value metric. It just arises from a different loss function (one minimized by the median) and has different statistical properties with respect to robustness, breakdown point, bias, efficiency, etc. The squared version is more popular and can provide an unbiased estimate of the variance under the normality assumption. But if this assumption is not true, it is a less robust estimate. – mark leeds Oct 05 '18 at 13:09
  • @mark_leeds I agree that the absolute error values don't literally cancel each other out per se, but consider the following special case: error_vals = [0, 1, -1, 2, -2, 3, -3, 4, -4]. Using RMSE, the loss would be Sqrt(50/9), but a calculation using absolute values results in a loss metric of 0. What am I missing? – user3877654 Oct 05 '18 at 14:04
  • Hi: In a rush so just read quickly, but where are you getting 0 and Sqrt(50/9) from? Maybe you use it differently, but, AFAIK, using the absolute value in the loss function leads to a more robust estimator such as the median, the Huber M-estimator of location, etc., so I'm not clear on your numbers. – mark leeds Oct 06 '18 at 16:21
  • Hi Mark, apologies for the typo. Here is how I got the (correct) numbers: RMSE = Sqrt((0 + 1^2 + (-1)^2 + 2^2 + (-2)^2 + 3^2 + (-3)^2 + 4^2 + (-4)^2)/9) = Sqrt(60/9). Using absolute values: (0 + 1 + (-1) + 2 + (-2) + 3 + (-3) + 4 + (-4))/9 = 0/9. – user3877654 Oct 08 '18 at 13:24
  • I'm sorry but I don't follow. In the first one, you are calculating the distance between each observation and zero, squaring it, and then adding up over all observations. In the second you are calculating the distance between each observation and zero and then adding up the distances? This is not what I was referring to as the squared-error estimate versus some more robust estimate. I can't explain it easily here, but the squared-loss estimator is obtained by minimizing the expectation of the sum of the squared errors. In the case of the normal distribution, this ends up being the mean. – mark leeds Oct 09 '18 at 15:04
  • Continued from above. Now, the absolute-loss estimator is obtained using a different loss function, namely the expectation of the sum of the absolute values of the error terms. If one has a symmetric distribution and wants to minimize the expectation of the sum of the absolute values of the error terms, the estimate ends up being the median. I'm not clear on why you are calculating distances from zero in your example? By robust, I mean the robustness of the estimator that arises from the associated loss function. – mark leeds Oct 09 '18 at 15:08
  • @mark_leeds I am not certain I follow what you are saying. Error is calculated as the distance between the actual value and the predicted value. This can be a positive or negative distance, so squaring the distance just removes the effect of the negative signs. I am using zero as the mean here because it is a simplified example. If you could similarly provide a simplified example, that would go a long way towards clarifying what you are trying to say. – user3877654 Oct 09 '18 at 16:05
  • Hi: I see what you're doing now, but the absolute error is defined as $|\text{estimate} - \text{actual}|$, so you take the absolute value of each term and add them up, and they don't cancel. You're not taking the absolute value, as far as I can tell. On a different note, you are talking about a forecasting error metric and I was talking about an estimation type of metric, but my argument still holds that absolute error doesn't make things cancel out. Take your example, calculate the distance of each term from zero, take the absolute value of each term, then add the terms up. I hope this clarifies. – mark leeds Oct 10 '18 at 16:26
  • Hi @mark_leeds If you take the absolute value of each term and take the average, you should get the same result as the RMSE. I was reading the original question as I have answered it, but on rereading, docscience may well have been referring to the sum of errors (as opposed to the mean). I am unfamiliar with this approach or its relative benefits, so I cannot discuss further, but intuitively it certainly would seem to work as long as your sample size remains stable. – user3877654 Oct 11 '18 at 18:44
  • Hi: No, MAE (mean absolute error) does not result in the same estimator as the RMSE when deriving an estimator. The mean minimizes the RMSE and the median minimizes the absolute value of the error. Think of these various metrics as loss functions where the objective is to minimize them with respect to the density. Thinking of them as calculations given an estimate is using them AFTER the estimate is obtained. That is not wrong, but it really doesn't lead to anything interesting. The metrics (RMSE, MAE, etc.) are used to derive estimators; that's their real use. – mark leeds Oct 12 '18 at 20:45