Papers Explained : My brief notes on the ideas presented in the papers. More such notes can be found here.
PEARL
19 Jan 2021
Paper Explained - Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables, Rakelly et al., 2019. Link to paper
Central Idea
This paper introduces an off-policy meta-RL method that disentangles task inference from control. It uses a probabilistic encoder to estimate a belief over the current task, which also enables posterior sampling for structured, temporally extended exploration. Integrating this with an off-policy RL method achieves efficiency in both meta-training and adaptation.
Notes
Meta-RL is sample-inefficient during meta-training, because it typically requires on-policy data for both the meta-training and adaptation processes. Off-policy data could be reused to improve sample efficiency, but meta-learning operates on the principle that meta-training conditions should match meta-testing conditions: if the agent adapts from freshly collected, on-policy experience at meta-test time, it should also be trained to adapt from that kind of experience. Off-policy data violates this, since it comes from a distribution of experience that differs from what the agent will collect when it explores a new task at meta-test time.
To incorporate off-policy data, the authors view task adaptation in meta-learning as a POMDP (partially observed MDP), where the hidden state is the identity of the current task (its MDP). Since the agent does not know which task it is in, a belief over tasks can be used to describe this uncertainty. This belief is formed from a latent representation of prior experience, produced by an encoder.
How to learn latent contexts?
Suppose we have prior experience in the form of a history of past transitions, which we call the 'context' \(c\). Let \(c^T_n = (s_n, a_n, r_n, s'_n)\) be one transition in task \(T\), so that \(c^T_{1:N}\) comprises the experience collected so far. To encode the context into a latent variable \(z\), variational inference is used: an inference network \(q_{\phi}(z \mid c)\), parameterized by \(\phi\), is trained to estimate the posterior \(p(z \mid c)\).
The inference network is optimized to model the state-action value function, or to maximize the returns of a policy, over a distribution of tasks. Using the log-likelihood as the objective, the resulting variational lower bound is
$$\mathbb{E}_{T}\Big[\mathbb{E}_{z \sim q_{\phi}(z \mid c^T)}\big[R(T, z) + \beta\, D_{KL}\big(q_{\phi}(z \mid c^T)\,\|\,p(z)\big)\big]\Big]$$
\(p(z)\) is a Gaussian prior over \(z\), and \(R(T, z)\) can be a return or a state-action value (other objectives are possible too). The KL-divergence term acts as an information bottleneck, restricting \(z\) to the information provided by the context \(c\); it can also be seen as constraining the mutual information between \(z\) and \(c\). This bottleneck mitigates overfitting to the training tasks.
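As a rough sketch (not the authors' code; `beta`, the tensor shapes, and the placeholder name `task_loss` are my assumptions), the objective above amounts to adding a β-weighted KL term between the inferred Gaussian posterior and a unit Gaussian prior to whatever task-performance loss is used:

```python
import torch
import torch.distributions as D

def pearl_inference_loss(z_mean, z_std, task_loss, beta=0.1):
    """Hypothetical sketch: task-performance loss (e.g. a critic loss)
    plus a beta-weighted KL(q(z|c) || N(0, I)) information bottleneck."""
    posterior = D.Normal(z_mean, z_std)
    prior = D.Normal(torch.zeros_like(z_mean), torch.ones_like(z_std))
    kl = D.kl_divergence(posterior, prior).sum(dim=-1).mean()
    return task_loss + beta * kl
```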
**Designing \(q_{\phi}(z \mid c)\)**
As the transitions follow the Markov property, the ordering of the \((s, a, r, s')\) tuples is not required to infer the task belief. Hence a permutation-invariant encoder can be used, where \(q_{\phi}(z \mid c_{1:N})\) is computed as a product of independent factors
$$q_{\phi}(z \mid c_{1:N}) \propto \prod^{N}_{n=1} \Psi_{\phi}(z \mid c_n)$$
To keep the method tractable, the factors are taken to be Gaussian, \(\Psi_{\phi}(z \mid c_n) = \mathcal{N}\big(f^{\mu}_{\phi}(c_n), f^{\sigma}_{\phi}(c_n)\big)\), which results in a Gaussian posterior. Here \(f_{\phi}\) is a neural network that outputs the mean \(\mu\) and standard deviation \(\sigma\) for each transition \(c_n\).
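A minimal sketch of such a permutation-invariant encoder (my own illustration; the layer sizes and the log-variance parameterization are assumptions, not the paper's exact architecture): each transition is embedded independently into a Gaussian factor, and the factors are multiplied (precision-weighted) to form the posterior.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Permutation-invariant encoder q_phi(z | c_1:N) as a product of Gaussian factors."""
    def __init__(self, transition_dim, latent_dim, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(transition_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),  # per-transition mean and log-variance
        )

    def forward(self, context):                # context: (N, transition_dim), i.e. stacked (s, a, r, s')
        mu, log_var = self.net(context).chunk(2, dim=-1)
        var = log_var.exp().clamp(min=1e-7)
        # Product of N independent Gaussian factors -> Gaussian with precision-weighted mean.
        post_var = 1.0 / (1.0 / var).sum(dim=0)
        post_mu = post_var * (mu / var).sum(dim=0)
        return post_mu, post_var.sqrt()        # parameters of the Gaussian posterior over z
```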
Posterior sampling for exploration
Also known as Thompson sampling, it works as follows: assume the current hypothesis is true, act optimally under it, and then use the newly acquired experience to update your belief. Repeating this progressively narrows the belief towards the correct hypothesis.
Similarly, here, during training the prior over \(z\) is learnt to represent the task distribution. Then, at meta-test time, a hypothesis \(z\) is sampled from the prior, the policy acts in the environment conditioned on it, and the posterior is updated with the newly collected experience. As more experience is collected, the posterior narrows and the agent can make a better guess about the current task.
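At a high level, meta-test adaptation might look like the sketch below (my own pseudocode-style illustration; `rollout` and `encoder.posterior` are assumed helpers, not names from the paper's codebase):

```python
def adapt_to_new_task(env, policy, encoder, prior, rollout, num_episodes=3):
    """Sketch of PEARL-style meta-test adaptation: sample a task hypothesis z,
    act under it, then update the task belief with the collected experience."""
    context = []                               # accumulated (s, a, r, s') transitions
    belief = prior                             # start from the learned prior over z
    for _ in range(num_episodes):
        z = belief.sample()                    # posterior sampling (Thompson sampling)
        transitions = rollout(env, policy, z)  # act as if the hypothesis z were true
        context.extend(transitions)
        belief = encoder.posterior(context)    # new experience narrows the belief
    return belief
```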
De-coupling data streams for inference and RL policy learning
“Data to train the encoder need not be the same as the data used for training policy”
Here the latent context \(z\) can be treated as part of the state in the main off-policy RL loop, while the stochasticity of the encoder \(q_{\phi}(z \mid c)\) provides structured yet stochastic exploration. This is implemented with a *context sampler* \(S_c\) that samples context batches for the encoder. These batches are drawn from the most recently collected data, refreshed every 1000 meta-training optimization steps. This is not strictly on-policy, but close enough to retain the benefits of recently collected, near on-policy data. The off-policy RL updates, in contrast, use batches drawn uniformly from the entire replay buffer.
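The decoupling can be pictured as two independent samplers (an illustrative sketch; the buffer objects and batch sizes are assumptions, not the paper's hyperparameters):

```python
def sample_training_batches(replay_buffer, recent_buffer, rl_batch_size=256, context_size=100):
    """Sketch of the decoupled data streams: the context batch comes from recently
    collected data, while the RL batch is drawn uniformly from the whole buffer."""
    context_batch = recent_buffer.sample(context_size)  # feeds the encoder q_phi(z | c)
    rl_batch = replay_buffer.sample(rl_batch_size)      # feeds the off-policy actor-critic update
    return context_batch, rl_batch
```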
Implementation & Experimental results
PEARL builds on SAC, an off-policy actor-critic method based on the maximum-entropy RL objective. The inference network is trained using gradients from the critic, so the latent variable effectively models a distribution over Q-functions.
- PEARL uses 20-100x fewer samples during meta-training than previous meta-RL approaches while improving final asymptotic performance by 50-100% in five of the six domains.
- Replacing the encoder with an RNN over de-correlated transitions gives comparable performance at the cost of slower optimization, while an RNN over whole trajectories performs poorly.
- Sampling the context off-policy (from the whole buffer) hurts performance, while using the same data batches for the RL update and the context helps, due to their correlation.
- With a deterministic encoder, all exploration has to come from the RL policy itself (posterior sampling is lost), which degrades performance.
Further reading: BAIR blog post by Kate Rakelly
MAML
16 Jan 2021
Paper Explained - Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, Finn et al., 2017. Link to paper
Central Idea
This paper proposes a meta-learning algorithm that is model-agnostic, i.e. compatible with any model trained with gradient descent. It needs only a small number of gradient steps and a small amount of training data to adapt to a variety of tasks, across learning problem domains including supervised regression, classification, and RL, and achieves SOTA performance.
Notes
Meta-learning, “learning to learn”
It is a learning method in which a model is trained on a variety of tasks in order to acquire general, meta-level (higher-level) knowledge about them. Such training helps the model uncover the basic structure common to the different tasks, which makes it versatile enough to quickly adapt to similar tasks in the future after experiencing them only once (one-shot) or a few times (few-shot).
For instance, a model with a rich, general understanding of moving over a surface can learn to walk, run, hop, crawl and jump over that surface in much less time than learning those tasks from scratch, by using its basic understanding of locomotion effectively.
Here, locomotion (meta-task) -> walk, run, hop, crawl, jump (specific tasks)
Such a learning method can be extended to different machine-learning problem domains, such as supervised regression and classification and reinforcement learning; these variants are referred to as different “species” of meta-learning.
Model-Agnostic Meta Learning
Intuition
To perform meta-learning in such a way that the meta-learned parameters form a good starting point, from which the model can quickly adapt to any specific task in a few gradient steps or with only a few task-specific examples.
Process
The meta-learning model is a parameterized function \(f_\theta\) with parameters \(\theta\); tasks \(\tau_i\) are drawn from a task distribution \(p(\tau)\). In the algorithm, for each task \(\tau_i\) we first take a gradient-descent step on \(K\) samples from \(\tau_i\) to compute the adapted parameters; this is the adaptation step. Then, using the adapted parameters and another batch from \(\tau_i\), we perform the meta-optimization step: a gradient step on the sum of the losses of the adapted models over all tasks sampled from \(p(\tau)\).
$$\theta'_i = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\tau_i}(f_{\theta}) \quad \text{(adaptation step)}$$

$$\theta \leftarrow \theta - \beta \nabla_{\theta} \sum_{\tau_i \sim p(\tau)} \mathcal{L}_{\tau_i}(f_{\theta'_i}) \quad \text{(meta-optimization step)}$$
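The two updates above can be sketched as follows for a supervised task, using higher-order gradients in PyTorch (my own illustration; the tiny functional network and the single inner step are simplifications, not the paper's code):

```python
import torch

def forward(params, x):
    """Tiny two-layer regression network applied functionally (illustrative architecture)."""
    w1, b1, w2, b2 = params
    h = torch.relu(x @ w1 + b1)
    return h @ w2 + b2

def maml_meta_step(params, tasks, alpha=0.01, beta=0.001):
    """One MAML meta-update: per-task inner adaptation, then an outer step
    that differentiates through the adaptation."""
    meta_loss = 0.0
    for x_train, y_train, x_val, y_val in tasks:
        # Adaptation step: one gradient step on the K support examples.
        # create_graph=True keeps the second-order terms needed by the outer update.
        inner_loss = ((forward(params, x_train) - y_train) ** 2).mean()
        grads = torch.autograd.grad(inner_loss, params, create_graph=True)
        adapted = [p - alpha * g for p, g in zip(params, grads)]
        # Outer objective: loss of the adapted parameters on held-out query examples.
        meta_loss = meta_loss + ((forward(adapted, x_val) - y_val) ** 2).mean()
    # Meta-optimization step: gradient of the summed post-adaptation loss w.r.t. theta.
    meta_grads = torch.autograd.grad(meta_loss, params)
    return [(p - beta * g).detach().requires_grad_() for p, g in zip(params, meta_grads)]
```

Setting `create_graph=False` in the inner step would give the first-order approximation (FOMAML) mentioned later in these notes.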
The only difference between the supervised and reinforcement learning instantiations of this algorithm lies in the loss function and in the dataset from which task samples are drawn. In supervised learning, the dataset consists of \((x_i, y_i)\) pairs and the loss is either MSE (for regression) or cross-entropy (for classification).
MSE: $$\mathcal{L}_{\tau_i}(f_{\phi}) = \sum_{x^{(j)}, y^{(j)} \sim \tau_i} \big\| f_{\phi}(x^{(j)}) - y^{(j)} \big\|_2^2$$
Cross-entropy loss: $$\mathcal{L}_{\tau_i}(f_{\phi}) = -\sum_{x^{(j)}, y^{(j)} \sim \tau_i} \Big[ y^{(j)} \log f_{\phi}(x^{(j)}) + \big(1 - y^{(j)}\big) \log\big(1 - f_{\phi}(x^{(j)})\big) \Big]$$
In reinforcement learning, by contrast, the dataset is a collection of trajectories sampled using \(f_\theta\) for task \(\tau_i\), and the loss, given below, is the negative expected return under the adapted parameters \(\theta'_i\).
$$\mathcal{L}_{\tau_i}(f_{\phi}) = -\,\mathbb{E}_{x_t, a_t \sim f_{\phi},\, q_{\tau_i}}\Big[ \sum_{t=1}^{H} R_i(x_t, a_t) \Big]$$
Experiments
In an experiment on regressing a sinusoid, the model trained with MAML was able to infer the amplitude and phase in the half of the input range where no datapoints were given during adaptation, showing that it had learnt the periodic nature of the sine wave.
In the classification domain, the MAML model performed comparably to several SOTA few-shot classification methods, and outperformed memory-augmented networks and the meta-learner LSTM.
Interesting fact: using a first-order approximation in the gradient of the meta-objective (first-order MAML, or FOMAML) exhibited performance similar to the full second-order MAML. This suggests that most of the improvement comes from the gradients at the post-update (adapted) parameters, rather than from the second-order terms of the meta-optimizing outer loop. The approximation speeds up computation by roughly 33%.
In the reinforcement learning domain, the inner gradient updates were computed using the vanilla policy gradient (REINFORCE), while the meta-optimizer used trust-region policy optimization (TRPO).
Interesting fact: the authors noticed that halving the learning rate after the first gradient step gave superior performance, so the step size during adaptation was set to \(\alpha = 0.1\) for the first step and \(\alpha = 0.05\) for all subsequent steps.
Duelling Architecture for DQNs
08 Oct 2020
- Based on advantage functions rather than plain Q-values, the dueling architecture separates the representation of state values and (state-dependent) action advantages.
- This architecture can learn which states are (or are not) valuable without having to learn the effect of each action in those states. It is particularly useful in states where the choice of action does not affect the environment in any relevant way.
- Previous work by Baird and Harmon (1995) introduced the advantage learning and advantage updating algorithms.
- Advantage learning: a form of reinforcement learning similar to Q-learning, except that it uses advantages rather than Q-values. For a state x and action u, the advantage A(x,u) is related to the Q-value Q(x,u) by
$$A(x, u) = \max_{u'} Q(x, u') + \big(Q(x, u) - \max_{u'} Q(x, u')\big) \cdot \frac{k}{dt}$$
- Advantage updating requires storing a value function V(x) in addition to the advantage function. Advantage learning is a more recent algorithm that supersedes advantage updating and requires only that the advantages A(x,u) be stored. The two algorithms have essentially identical behaviour, but the latter requires less information to be stored and is a simpler algorithm, so it is generally recommended. Refer here.
Note: the value function V tells how good it is to be in a particular state, while the Q-value function tells how good it is to take a particular action from that state. The advantage, simply *A = Q - V*, measures the importance of each action.
$$Q(s, a) = V(s) + A(s, a)$$
The above equation is valid but does not allow V and A to be identified separately, as the following observations show:
- Adding a constant to V and subtracting the same constant from A leaves Q unchanged, so V and A cannot be recovered from Q alone.
- One fix is to force the advantage of the chosen (greedy) action to zero by subtracting \(\max_{a'} A(s, a')\) inside the aggregation.
- However, replacing the max operator with the average over actions gives better optimisation stability, even though the V and A estimates are then off-target by a constant. (The advantages only need to change as fast as their mean, rather than having to compensate for every change in the optimal action's advantage.) See the sketch below.
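A minimal sketch of the dueling head with the mean-subtraction aggregation (my own layer sizes, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling network head: Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a'))."""
    def __init__(self, feature_dim, num_actions, hidden=256):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_actions))

    def forward(self, features):
        v = self.value(features)                    # (batch, 1): state value
        a = self.advantage(features)                # (batch, num_actions): action advantages
        return v + a - a.mean(dim=1, keepdim=True)  # identifiable aggregation
```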
Advantages over single-stream architecture:
- Effective learning of the state-value function.
- The value stream gets updated with every update of the Q-values (unlike a single-stream network, where only the value of the chosen action is updated).
- The separate advantage stream is robust to noise in the updates, while a single stream is sensitive to it.
Double DQNs
05 Oct 2020
- Over-estimation of Q-values is the main problem of vanilla DQNs. Over-estimation is not a problem if it is uniform, since then the relative action preferences are preserved and the policy is not much affected. Thrun and Schwartz (1993) showed that if the action values contain random errors uniformly distributed in an interval \([-\epsilon, \epsilon]\), then each target is overestimated by up to \(\gamma \epsilon \frac{m-1}{m+1}\), where \(m\) is the number of actions.
- QUOTE: “Overoptimistic value estimates are not necessarily a problem in and of themselves. If all values would be uniformly higher then the relative action preferences are preserved and we would not expect the resulting policy to be any worse. Furthermore, it is known that sometimes it is good to be optimistic: optimism in the face of uncertainty is a well-known exploration technique (Kaelbling et al., 1996). If, however, the over-estimations are not uniform and not concentrated at states about which we wish to learn more, then they might negatively affect the quality of the resulting policy. Thrun and Schwartz (1993) give specific examples in which this leads to suboptimal policies, even asymptotically.”
- Idea: decouple the selection of the maximizing action from its evaluation. In Double Q-learning, two different value functions are used for selection and evaluation. Here, since DQN already maintains a fixed-weight target network, the online (training) network can be used to select the greedy action while the target network is used to evaluate it, as sketched below.
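A sketch of the resulting target computation (illustrative; `online_net`/`target_net` stand for the training and fixed-target Q-networks, and the tensor shapes are assumptions):

```python
import torch

@torch.no_grad()
def double_dqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN: the online network selects the greedy action,
    the (fixed) target network evaluates it."""
    next_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # selection
    next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # evaluation
    return rewards + gamma * (1.0 - dones) * next_q
```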
- Experiment: done on ATARI 2600 games in the Arcade Learning Environment. Double DQN was trained and tested under conditions identical to Mnih et al. (2015). Plots of value estimates vs. training steps (in millions) show that DQN is generally more over-optimistic in its estimates than Double DQN. This over-optimism hurts actual performance: the score starts to drop at roughly the same time that DQN's over-estimation rises, as seen in the plots of value estimates (log scale) and score vs. training steps. The usual instability of off-policy learning cannot be blamed here, because Double DQN's learning is stable under the same conditions, which implies that the cause of the problem is over-optimism.
- For scoring, normalised scores were used, based on the difference between the agent's and a random agent's scores divided by the difference between a human's and the random agent's scores, written out below.
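Written out, the normalisation described above is:

$$\text{score}_{\text{normalized}} = \frac{\text{score}_{\text{agent}} - \text{score}_{\text{random}}}{\text{score}_{\text{human}} - \text{score}_{\text{random}}}$$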
- To test robustness to human starting points, DQN as given in Nair et al. (2015, massively parallel methods for RL) was used.
Playing Atari with Deep Reinforcement Learning
03 Oct 2020
- Using Q-learning with deep learning techniques (neural networks) on ATARI games.
- The most similar previous work was NFQ (Neural Fitted Q-learning):
- NFQ optimizes the sequence of loss functions in Equation 2 using the RPROP algorithm to update the parameters of the Q-network. However, it uses a batch update whose computational cost per iteration is proportional to the size of the dataset, whereas DQN uses stochastic gradient updates that have a low constant cost per iteration and scale to large datasets.
- NFQ has also been successfully applied to simple real-world control tasks using purely visual input, by first using deep autoencoders to learn a low dimensional representation of the task, and then applying NFQ to this representation.
- Other work applied to Atari games used evolutionary algorithms such as HyperNEAT.
- The metrics used for evaluation were the average total reward and the average of maximized state-action values (Q-values). The Q-value plots showed situations where the agent reacted to the presence/absence of an enemy.
*Techniques used in DQN*
- Reward clipping: rewards from the actual game scores are clipped to +1/-1. This limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games, but it can hurt performance since the agent can no longer differentiate between rewards of different magnitudes.
- Frame skipping: the agent sees and selects actions on every k-th frame instead of every frame, and its last action is repeated on the skipped frames.
- Experience replay: maintain a set of recent experiences from the episodes collected, and randomly sample mini-batches from it to train the Q-network towards its target values (a minimal sketch of a replay buffer follows this list). Its benefits are:
- It breaks the correlation between consecutive states: the network does not simply learn to mirror the sequence of states experienced in an episode, but learns the best action for a given state.
- It prevents the training data distribution from oscillating, which can happen when new episodes (and hence new data) are generated according to the currently best action at a given point in training.
Quote:
“When learning on-policy, the current parameters determine the next data sample that the parameters are trained on. For example, if the maximizing action is to move left then the training samples will be dominated by samples from the left-hand side; if the maximizing action then switches to the right then the training distribution will also switch. It is easy to see how unwanted feedback loops may arise and the parameters could get stuck in a poor local minimum, or even diverge catastrophically”
- Each step of experience is potentially used in many weight updates, which allows for greater data efficiency.
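A minimal replay-buffer sketch along these lines (my own illustration, with an arbitrary capacity):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores recent transitions and serves uniformly sampled mini-batches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive transitions.
        return random.sample(self.buffer, batch_size)
```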
Clarifications and Keywords:
- No-op actions: at the start of the game the agent does nothing for a number of frames. This is used to break the correlation between repeated starting states and the agent's learning. For example, in Pong the agent should be able to move left whenever the ball is coming to the left, and this should hold for the different positions from which the ball can reach the agent's side (e.g. whether the ball is just starting to move leftwards or has already entered the agent's half of the screen).