• Over-estimation of Q-values : the main problem with vanilla DQNs. Over-estimation is not a problem if it is uniformly distributed, because then the relative action preferences are preserved and the policy is barely affected. Thrun and Schwartz (1993) showed that if the action values contain random errors uniformly distributed in an interval [-e, e], then each target is over-estimated by up to gamma * e * (m-1)/(m+1), where m is the number of actions (a quick numeric check of this bound is sketched after these notes).
    • QUOTE : “Overoptimistic value estimates are not necessarily a problem in and of themselves. If all values would be uniformly higher then the relative action preferences are preserved and we would not expect the resulting policy to be any worse. Furthermore, it is known that sometimes it is good to be optimistic: optimism in the face of uncertainty is a well-known exploration technique (Kaelbling et al., 1996). If, however, the over-estimations are not uniform and not concentrated at states about which we wish to learn more, then they might negatively affect the quality of the resulting policy. Thrun and Schwartz (1993) give specific examples in which this leads to suboptimal policies, even asymptotically.”
    • Idea : decouple the selection of the greedy action from its evaluation. In Double Q-learning, the original idea was to use two separate value functions, one for selection and one for evaluation. Here, since DQN already has a fixed-weight / fixed-target network, we can reuse it: the target network's weights handle the evaluation part, while the online (training) network's weights are used for selecting the action and for the usual learning update of the 'w' weights (a sketch of the resulting target is given after these notes).
    • Experiment : done on Atari 2600 games in the Arcade Learning Environment. Double DQN was trained and tested under conditions identical to Mnih et al. (2015). Plots of value estimates vs. training steps (in millions) show that DQN is consistently more over-optimistic in its value estimates than Double DQN. This over-optimism hurts the performance (score) of the model: comparing the value-estimate vs. training-steps plots (log scale) with the score vs. training-steps plots, the score starts to drop gradually at roughly the same point where DQN's over-estimations rise to high levels. The usual instability of off-policy learning cannot be blamed here, because Double DQN's learning on the same games is stable, which suggests that over-optimism itself is the cause of the problem.
    • For scoring, normalised scores were used, based on the ratio of (agent score − random-agent score) to (human score − random-agent score); see the snippet after these notes.
    • For the robustness test with human starting points, DQN as given in Nair et al. (2015, “Massively Parallel Methods for Deep Reinforcement Learning”) was used.
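
A quick numeric check of the Thrun and Schwartz (1993) bound mentioned in the first bullet; the values of `gamma`, `eps`, and `m` below are made-up examples for illustration, not numbers from the paper.

```python
# Upper bound on the over-estimation of a single target when each action value
# carries independent noise that is uniform in [-eps, eps] (Thrun & Schwartz, 1993):
#   bias <= gamma * eps * (m - 1) / (m + 1),   m = number of actions.

def overestimation_bound(gamma, eps, m):
    return gamma * eps * (m - 1) / (m + 1)

# Example numbers (assumptions, not taken from the paper):
print(overestimation_bound(gamma=0.99, eps=1.0, m=4))   # ~0.594
print(overestimation_bound(gamma=0.99, eps=1.0, m=18))  # ~0.886 (18 = full Atari action set)
```

The bound grows with the number of actions, which is why games with larger action sets are more exposed to the over-estimation issue.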
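A minimal NumPy sketch of the decoupling described in the “Idea” bullet: the online network selects the argmax action, the frozen target network evaluates it. Names like `q_online_next` / `q_target_next` and the batch layout are assumptions for illustration, not the paper's code.

```python
import numpy as np

def dqn_target(rewards, dones, q_target_next, gamma=0.99):
    """Vanilla DQN target: the target network both selects and evaluates
    the next action, which is what lets over-estimation build up."""
    return rewards + gamma * (1.0 - dones) * q_target_next.max(axis=1)

def double_dqn_target(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """Double DQN target: the online (training) network selects the argmax
    action, the fixed-weight target network evaluates its value."""
    best_actions = q_online_next.argmax(axis=1)                              # selection: online weights
    evaluated = q_target_next[np.arange(len(best_actions)), best_actions]    # evaluation: target weights
    return rewards + gamma * (1.0 - dones) * evaluated

# Toy batch of 2 transitions with 3 actions (made-up numbers):
rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])
q_online_next = np.array([[0.5, 2.0, 1.0], [0.1, 0.2, 0.3]])
q_target_next = np.array([[0.4, 1.5, 3.0], [0.0, 0.1, 0.2]])
print(dqn_target(rewards, dones, q_target_next))                         # [3.97, 0.]
print(double_dqn_target(rewards, dones, q_online_next, q_target_next))   # [2.485, 0.]
```

Note how the vanilla target picks the largest (possibly noise-inflated) target-network value, while the double target only evaluates the action the online network prefers, which is what reduces the upward bias.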
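The normalisation from the scoring bullet, written out; the 100% scaling follows the convention that 100% = human-level, and the function name is just illustrative.

```python
def normalized_score(agent_score, human_score, random_score):
    # 0% = random play, 100% = human-level play.
    return 100.0 * (agent_score - random_score) / (human_score - random_score)

# Made-up example: agent scores 800, human 1000, random 200 -> 75.0
print(normalized_score(800, 1000, 200))
```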