What reward function results in optimal learning?

Question

Let's think of the following situations:

You are teaching a robot to play ping pong
You are teaching a program to calculate square root
You are teaching math to a kid in school

These situations (i.e. supervised learning) and many others have one thing (among others) in common: the learner gets a reward based on its performance.

My question is, what should the reward function look like? Is there a "best" answer, or does it depend on the situation? If it depends on the situation, how does one determine which reward function to pick?

For example, take the following three reward functions:

[Figure: reward as a function of performance for the three functions A, B and C]

Function A says:
- below a certain point, bad or worse are the same: you get nothing
- there is a clear difference between almost good and perfect
Function B says:
- you get reward linearly proportional to your performance
Function C says:
- if your performance is bad, it's ok, you did your best: you still get some reward
- there is not much difference between perfect and almost good
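
To make the three descriptions concrete, here is a minimal sketch of what such curves could look like. The exact shapes are assumptions chosen only to match the descriptions above (a quadratic rise above a threshold for A, the identity for B, a raised concave curve for C), as are the names `reward_a`, `reward_b`, `reward_c`, `threshold` and `floor`. Performance and reward are both taken to lie in [0, 1].

```python
def reward_a(performance: float, threshold: float = 0.5) -> float:
    """Shape A: nothing below the threshold, then a steep rise towards perfect."""
    if performance <= threshold:
        return 0.0
    scaled = (performance - threshold) / (1.0 - threshold)
    return scaled ** 2


def reward_b(performance: float) -> float:
    """Shape B: reward is linearly proportional to performance."""
    return performance


def reward_c(performance: float, floor: float = 0.3) -> float:
    """Shape C: some reward even for bad performance, flattening near perfect."""
    return floor + (1.0 - floor) * (1.0 - (1.0 - performance) ** 2)


if __name__ == "__main__":
    # Compare how much each shape pays for going from "almost good" to "perfect".
    for name, fn in (("A", reward_a), ("B", reward_b), ("C", reward_c)):
        gain = fn(1.0) - fn(0.9)
        print(f"{name}: reward(0.2)={fn(0.2):.3f}  gain from 0.9 to 1.0={gain:.3f}")
```

With these assumed shapes, the last step from 0.9 to 1.0 is worth the most under A and the least under C, which is the difference the descriptions are pointing at.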

Intuitively, I'd think A would make the robot very focused and learn the exact pattern, but become stupid when dealing with similar patterns, while C would make it more adaptable to change at the cost of losing perfection.

One might also think of more complex functions, to show but a few:

[Figure: examples of more complex reward functions]

So, how does one know which function to pick? Is it known which behavior would emerge from (at least) the basic A, B and C functions?

A side question: would this be fundamentally different for robots and human kids?

I doubt that a robot would become stupid by doing the same or similar thing over and over, unless being cybernetic. — ott--, May 07 '13 at 14:09
@ott, that's not what I meant. What I meant was with a reward function similar to A, the robot could become extremely good at the exact task, but terrible at tasks that are similar but slightly different. That's just my guess though. — Shahbaz, May 07 '13 at 14:14
Perhaps the theory behind this could be complicated, but an answer that says "I have taught different tasks to many robots and often function X gave me the best result", even if not perfectly correct, would give a great rule of thumb. — Shahbaz, May 07 '13 at 16:13

Ian · Answer 1 · 2013-05-09T19:50:57.800

"Optimal learning" is a very vague term, and it is completely dependent on the specific problem you're working on. The term you're looking for is "overfitting":

[Figure: plot illustrating overfitting]

(The green line is the error in predicting the result on the training data, the purple line the quality of the model, and the red line is the error of the learned model being used "in production")

In other words: when it comes to adapting your learned behavior to similar problems, how you rewarded your system is less important than how many times you rewarded it -- you want to reduce errors in the training data, but not keep it in training so long that it loses the ability to work on similar problems.

One method of addressing this problem is to cut your training data in half: use one half to learn on and the other half to validate the training. This helps you identify when you begin to over-fit.
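
As a rough sketch of that idea (not a definitive recipe), the code below splits a data set in half and then looks for the point where the error on the held-out half stops improving. The function names `split_in_half` and `overfit_onset`, the `patience` parameter and the error values are all hypothetical and exist only for illustration.

```python
import random


def split_in_half(samples, seed=0):
    """Shuffle the samples and return (training_half, validation_half)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


def overfit_onset(validation_errors, patience=2):
    """Return the epoch with the lowest validation error seen so far.

    Scanning stops once the error has failed to improve for `patience`
    consecutive epochs, which is taken as the sign that over-fitting began.
    """
    best_epoch = 0
    for epoch, error in enumerate(validation_errors):
        if error < validation_errors[best_epoch]:
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch


if __name__ == "__main__":
    training_half, validation_half = split_in_half(range(10))
    print(len(training_half), len(validation_half))

    # Made-up validation errors per epoch: they fall, then start rising again.
    errors = [0.90, 0.60, 0.40, 0.35, 0.37, 0.41, 0.50]
    print("stop training after epoch", overfit_onset(errors))
```

The training error would typically keep falling past that epoch, which is why it is the held-out half, not the training half, that is used to decide when to stop.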

Non-linear reward functions

Most supervised learning algorithms expect that application of the reward function will produce a convex output. In other words, having local minima in that curve will prevent your system from converging to the proper behavior. This video shows a bit of the math behind cost/reward functions.
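
A small illustration of why local minima matter, under stated assumptions: the cost function `x**4 - 3*x**2 + x` is a made-up non-convex curve with two minima, and `descend` is plain gradient descent with a hypothetical fixed learning rate. It is not taken from any particular learning algorithm.

```python
def cost(x: float) -> float:
    """A made-up non-convex cost with minima near x = -1.30 and x = 1.13."""
    return x ** 4 - 3 * x ** 2 + x


def cost_gradient(x: float) -> float:
    """Derivative of cost with respect to x."""
    return 4 * x ** 3 - 6 * x + 1


def descend(start: float, learning_rate: float = 0.01, steps: int = 1000) -> float:
    """Run plain gradient descent from `start` and return the final x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * cost_gradient(x)
    return x


if __name__ == "__main__":
    for start in (-2.0, 2.0):
        end = descend(start)
        print(f"start={start:+.1f} -> x={end:+.3f}, cost={cost(end):.3f}")
```

Starting from the right-hand side, the search settles in the shallower minimum and never finds the better one on the left. On a convex curve the starting point would not matter.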

score 5 · Accepted Answer · edited Apr 13 '17 at 12:49

Short answer: the strongest reinforcement effect comes from delivering a valuable reward on an intermittent (random) schedule.

Longer version: One aspect of your question is about operant conditioning, at least as it applies to teaching maths to a complex organism. Applying this to machine learning is known as reinforcement learning.

Economics (as per jwpat7's answer) only addresses one part of the story of reinforcement. A utility function tells you what reward has the strongest reinforcement effect (biggest impact on behaviour) in a given context. Is it praise? chocolate? cocaine? direct electrical stimulation to certain areas of the brain? Mostly my answer is about the effect of context, assuming a given reward utility.

For complex organisms/behaviours, reward scheduling is at least as important as reward utility:

- A "fixed-interval reward schedule" is the least effective way to modify behaviour with a given quantity of reward (I'll give you \$10 per week if you keep your bedroom tidy). Think dole bludger.
- "Fixed-ratio reward schedules" (I'll give you \$10 every seven days you have a tidy bedroom) are more effective than fixed intervals, but they have a kind of effectiveness ceiling (the subject will tidy their room seven times when they are hungry for \$10, but not otherwise). Think mercenary.
- The most influential way to deliver a given reward is with a "variable-ratio reinforcement schedule" (e.g. every day you tidy your bedroom you have a 1/7 chance of getting \$10). Think poker machine.
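
To see how the predictable and the probabilistic schedules differ, here is a minimal simulation of the last two examples: \$10 on every seventh tidy day versus a 1/7 chance of \$10 on each tidy day. The function names `fixed_schedule` and `random_schedule` and the seed are assumptions for illustration. The sketch only models when rewards are delivered, not how the learner responds to them.

```python
import random


def fixed_schedule(tidy_days: int, ratio: int = 7, reward: float = 10.0):
    """Pay `reward` on every `ratio`-th tidy day, nothing otherwise."""
    return [reward if (day + 1) % ratio == 0 else 0.0 for day in range(tidy_days)]


def random_schedule(tidy_days: int, probability: float = 1 / 7,
                    reward: float = 10.0, seed: int = 0):
    """Pay `reward` on each tidy day with the given probability."""
    rng = random.Random(seed)
    return [reward if rng.random() < probability else 0.0 for _ in range(tidy_days)]


def gaps_between_rewards(payouts):
    """Return the number of days between consecutive rewarded days."""
    rewarded = [day for day, amount in enumerate(payouts) if amount > 0]
    return [later - earlier for earlier, later in zip(rewarded, rewarded[1:])]


if __name__ == "__main__":
    days = 7000
    for name, payouts in (("fixed", fixed_schedule(days)),
                          ("random", random_schedule(days))):
        gaps = gaps_between_rewards(payouts)
        print(f"{name}: mean payout per day={sum(payouts) / days:.2f}, "
              f"distinct gap lengths={len(set(gaps))}")
```

Both schedules spend roughly the same budget per tidy day. What changes is predictability: the fixed schedule always has a gap of exactly seven days, while the random one does not let the learner know which day will pay.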

If you are a learning supervisor with a fixed reward budget, for a given learning situation, there will be an optimum balance of reward size (utility) and frequency. It's probably not a very small slice of reward at a very high frequency, nor a very large chunk of reward delivered very rarely. It might even be a random size reward at a random schedule - the optimum is usually determined experimentally for a particular situation.

Finally, the "optimum" schedule (random frequency, random quantity {p(reward),p(value)}) will probably vary at different stages in the learning process. For example, a new pupil might be subject to a "primacy" effect (welcome! have a jelly bean) that quickly becomes a fixed-interval reward if you repeat it. There might be a "recency" effect that gets more reinforcement value from a reward delivered on the very last trial ("finishing on a high note"). In between, there may be an accumulative "faith effect" where, as a learner becomes more experienced, the optimum might shift toward lower probability and higher utility over time. Again, more stuff to determine empirically in your situation.

I'm reading this answer again, and I'd again want to say how great this answer is! In fact, let me give you some bounty! — Shahbaz, Nov 08 '13 at 18:12

score 3 · Answer 3 · answered May 07 '13 at 15:38

These issues are addressed, to some extent, by the study of utility functions in economics. A utility function expresses effective or perceived values of one thing in terms of another. (While the curves shown in the question are reward functions and express how much reward will be tendered for various performance levels, similar-looking utility functions could express how much performance results from various reward levels.)

What reward function will work best depends on equilibria between the payer and the performer. The Wikipedia contract curve article illustrates with Edgeworth boxes how to find Pareto efficient allocations. The Von Neumann–Morgenstern utility theorem delineates conditions that ensure that an agent is VNM-rational and can be characterized as having a utility function. The “Behavioral predictions resulting from HARA utility” section of the Hyperbolic absolute risk aversion article in Wikipedia describes behavioral consequences of certain utility functions.

Summary: These topics have been the subject of tremendous amounts of study in economics and microeconomics. Unfortunately, extracting a brief and useful summary that answers your question might also call for a tremendous amount of work, or the attention of someone rather more expert than me.

This is quite complicated, I'm not sure if I understand it. But are you sure utility function of economics applies to robotics too? In supervised learning (of a robot), the payer doesn't actually lose anything. The reward would often be just a number telling the robot how well they did the task. — Shahbaz, May 07 '13 at 16:11

score 1 · Answer 4 · answered Jun 03 '13 at 02:49

The optimal reward function depends on the learning objective, i.e. what is to be learned. For simple problems it may be possible to find a closed form representation for the optimal reward function. In fact, for really simple problems I'm confident it is possible, though I know of no formal methods for doing so (I suspect utility theory would address this question). For more complex problems I would argue that it is not possible to find a closed form solution.

Instead of seeking the optimal function we could look to an expert for a good reward function. One approach to doing so is a technique called Inverse Reinforcement Learning (IRL). It formulates a learning problem as a reinforcement learning problem where the reward function is unknown and is itself the objective of the learning process. The paper Apprenticeship Learning via Inverse Reinforcement Learning by Pieter Abbeel and Andrew Ng is a good place to start learning about IRL.

score 0 · Answer 5 · answered May 07 '13 at 18:13

Any form of supervised learning is a directed search in policy space. You try to find the policy (that is, which action to take) which provides the maximum reward expectation. In your question you give reward as a function of performance. As long as this function is monotonic, any method which converges will ultimately end up giving you maximum performance (to stay with your terminology).
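
The monotonicity argument can be checked with a toy example. The policy names and performance scores below are hypothetical, and the three reward curves are arbitrary strictly increasing functions picked for illustration.

```python
import math

# Hypothetical performance scores in [0, 1] for three candidate policies.
PERFORMANCE = {"random": 0.1, "cautious": 0.6, "aggressive": 0.85}

# Three strictly increasing reward curves with very different shapes.
REWARD_CURVES = {
    "linear": lambda p: p,
    "convex": lambda p: p ** 3,
    "concave": math.sqrt,
}


def best_policy(reward_fn):
    """Return the name of the policy whose performance earns the highest reward."""
    return max(PERFORMANCE, key=lambda name: reward_fn(PERFORMANCE[name]))


if __name__ == "__main__":
    for label, curve in REWARD_CURVES.items():
        print(label, "->", best_policy(curve))
```

Every curve ranks the policies in the same order, so the maximizer is the same. A curve that is only weakly monotonic (for example, flat below some threshold) can still produce ties among the poorly performing policies, which gives a search nothing to distinguish them by in that region.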

How fast the method converges is another matter, and may well depend on the curve. But I think this will differ from method to method.

An entirely different problem is that for more complex scenarios performance is not a simple scalar, and defining it can be pretty difficult. What's the reward function for being good at math?

"How fast the method converges is another matter, and may well depend on the curve." Well, of course. I was trying to understand how the curve affects learning (and not if it does, because I already know that it does). — Shahbaz, May 08 '13 at 07:37
