I'm working on implementing a Q-learning algorithm for a two-player board game.
I encountered what I think may be a problem. When it comes time to update the Q value with the Bellman equation (written out below), the last part says that for the maximum expected future reward, one must take the highest Q value in the new state s' reached after taking action a.
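For reference, this is the standard one-step Q-learning update I understand the Bellman equation to mean here, with learning rate $\alpha$ and discount factor $\gamma$:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$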
However, it seems like I never have Q values for state s'. I suspect s' can only be reached by P2 making a move; it may be impossible for the same board position to arise as the result of a move by P1. Therefore, the board state s' is never one in which P2 chooses an action, and its Q values are never computed.
I will try to paint a picture of what I mean. Assume P1 is a random player, and P2 is the learning agent.
- P1 makes a random move, resulting in state s.
- P2 evaluates board s, finds the best action a and takes it, resulting in state s'. In the process of updating the Q value for the pair (s, a), it finds max Q(s', a') = 0, since the state hasn't been encountered yet.
- From s', P1 again makes a random move.
As you can see, state s' is never encountered by P2 as a state to act in, since it is a board state that appears only as a result of P2 making a move. Thus the last part of the equation always evaluates to 0 − (current Q value), because the max term is always 0.
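To make this concrete, here is a rough sketch of what my training loop does. The `game` object and its methods (`reset`, `over`, `legal_moves`, `play`, `state`, `reward`) are made-up stand-ins for my actual game code, not a real interface:

```python
import random

ALPHA, GAMMA = 0.1, 0.9
Q = {}  # Q-table; unseen (state, action) pairs are treated as 0.0


def best_action(state, actions):
    # Greedy choice over the current Q estimates (all 0.0 for a new state).
    return max(actions, key=lambda a: Q.get((state, a), 0.0))


def train_episode(game):
    game.reset()
    while not game.over():
        # P1 (the random player) moves, producing state s.
        game.play(random.choice(game.legal_moves()))
        if game.over():
            break

        # P2 (the learner) acts in s, producing state s'.
        s = game.state()
        a = best_action(s, game.legal_moves())
        game.play(a)
        s_prime, r = game.state(), game.reward()

        # Bellman update for (s, a). P2 only ever chooses actions in states
        # produced by P1's moves, so no entry of the form (s_prime, ...) is
        # ever written, and this max always comes out as 0.0.
        max_next = max((Q.get((s_prime, b), 0.0) for b in game.legal_moves()),
                       default=0.0)
        q = Q.get((s, a), 0.0)
        Q[(s, a)] = q + ALPHA * (r + GAMMA * max_next - q)
```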
Am I seeing this correctly? Does this affect the learning process? Any input would be appreciated.
Thanks.

P1 starts in state s1 and makes move a1, resulting in s2. P2 evaluates s2 and responds with an action that leads to state s3 (but doesn't update the Q-table yet). From here P1 makes another move, resulting in state s4. It is at this point that P2 updates the Q-table using the Bellman equation. In this equation, the immediate reward is the reward from state s3 (the one that resulted from P2 making a move) and the future reward is the reward that resulted from P1 responding to s3, i.e., the reward for s4. Is this correct? Many thanks! – Pete Apr 18 '19 at 09:56
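In code, the timing described in that comment would look roughly like the sketch below. It reuses the same made-up `game` interface and Q-table as the earlier snippet, so the method names are assumptions rather than a real API, and the future-value term is taken as the max over the Q values of s4, the state P1's reply produced:

```python
import random


def p2_turn_with_delayed_update(game, Q, alpha=0.1, gamma=0.9):
    # s2: the state P1's previous move left the board in (P2 to act).
    s2 = game.state()
    a = max(game.legal_moves(), key=lambda m: Q.get((s2, m), 0.0))
    game.play(a)                  # P2's move -> s3; no Q update yet
    r = game.reward()             # immediate reward for reaching s3

    # P1 replies (terminal-state handling omitted) -> s4.
    game.play(random.choice(game.legal_moves()))
    s4 = game.state()

    # Only now update Q(s2, a): the immediate reward comes from s3, while the
    # future-value term is bootstrapped from s4, the state after P1's reply.
    max_next = max((Q.get((s4, b), 0.0) for b in game.legal_moves()),
                   default=0.0)
    q = Q.get((s2, a), 0.0)
    Q[(s2, a)] = q + alpha * (r + gamma * max_next - q)
```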