According to an MIT study, reinforcement learning research has grown multi-fold over the past decade. The turning point came in 2016, when DeepMind’s AlphaGo defeated Lee Sedol, the world Go champion.
The reward function is at the heart of reinforcement learning. However, RL agents often pick up idiosyncratic behaviours and communication protocols during training, rendering them unfit for real-world human-AI cooperation.
To that end, Meta AI has introduced a flexible approach, off-belief learning (OBL), to make an AI model’s actions intelligible to humans. Off-belief learning aims for grounded communication: the goal is to find the optimal way to communicate without relying on arbitrary, pre-agreed conventions.
Off-belief learning
Off-belief learning (OBL) is a multi-agent reinforcement learning (MARL) algorithm that addresses the gap in earlier methods by controlling the depth of cognitive reasoning while still converging to optimal behaviour. OBL can also be applied iteratively to obtain a hierarchy of policies that converges to a unique solution, making it well suited to the zero-shot coordination (ZSC) problem, where the goal is to maximise test-time performance between policies from independent training runs of the same algorithm (cross-play).
OBL has many parallels with the Rational Speech Acts (RSA) framework. RSA assumes a speaker-listener setting with grounded hints and opens with a literal listener (LL) that takes only the grounded information into account. It then introduces a hierarchy of speakers and listeners, each level defined via Bayesian reasoning (i.e., a best response) given the level below. OBL is engineered to deal with high-dimensional, complex decentralised partially observable Markov decision processes (Dec-POMDPs), where agents have to both act and communicate through their actions.
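To make the RSA hierarchy concrete, the sketch below implements a toy version with a hypothetical three-utterance, three-state lexicon (the example, the lexicon and the rationality parameter are illustrative, not from the paper): a literal listener normalises the grounded lexicon, a pragmatic speaker soft-maximises the literal listener’s accuracy, and a pragmatic listener applies Bayes’ rule to that speaker.

import numpy as np

# Grounded lexicon: rows are utterances, columns are world states.
# lexicon[u, s] = 1 if utterance u is literally true of state s.
utterances = ["hat", "glasses", "hat and glasses"]
states = ["face1", "face2", "face3"]
lexicon = np.array([
    [1, 1, 0],   # "hat" is true of face1 and face2
    [0, 1, 1],   # "glasses" is true of face2 and face3
    [0, 1, 0],   # "hat and glasses" is true only of face2
], dtype=float)

def normalize(x, axis):
    return x / x.sum(axis=axis, keepdims=True)

# Literal listener L0(s | u): condition only on grounded truth.
L0 = normalize(lexicon, axis=1)

# Pragmatic speaker S1(u | s): soft-max of the literal listener's accuracy.
alpha = 4.0  # rationality parameter (arbitrary choice for this example)
S1 = normalize(np.exp(alpha * np.log(L0 + 1e-12)).T, axis=1)  # rows: states

# Pragmatic listener L1(s | u): Bayesian best response to S1 (uniform prior).
L1 = normalize(S1.T, axis=1)

print("Literal listener:\n", np.round(L0, 2))
print("Pragmatic listener:\n", np.round(L1, 2))

Hearing ‘hat’, the pragmatic listener infers face1 with high probability, because a speaker seeing face2 would have preferred the more informative ‘hat and glasses’. This is the same kind of level-by-level pragmatic reasoning that OBL controls.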
Zero-Shot Coordination – Self-play (SP) is one of the most common problem settings for learning in Dec-POMDPs, where a team of agents is trained and tested together. Optimal self-play policies typically rely on arbitrary conventions that the entire team can jointly coordinate on during training. However, many real-world problems require agents to coordinate with unknown AI agents and humans at test time. This desideratum was formalised as the zero-shot coordination (ZSC) setting, where the goal is to find algorithms that allow agents to coordinate with independently trained agents at test time, a proxy for the independent reasoning process in humans. ZSC rules out arbitrary conventions as optimal solutions and instead requires learning algorithms that produce robust and, ideally, unique solutions across multiple independent runs.
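Concretely, cross-play can be pictured as pairing policies from independent training runs and averaging their scores; the off-diagonal of the resulting matrix is the ZSC metric. The sketch below is illustrative only (train_independent_run and evaluate_pair are hypothetical placeholders, not functions from the paper’s codebase):

import itertools
import numpy as np

def cross_play_matrix(policies, evaluate_pair, episodes=1000):
    """Average score for every ordered pair of independently trained policies.

    Diagonal entries are self-play; off-diagonal entries measure zero-shot
    coordination between agents that never trained together.
    """
    n = len(policies)
    scores = np.zeros((n, n))
    for i, j in itertools.product(range(n), range(n)):
        scores[i, j] = evaluate_pair(policies[i], policies[j], episodes)
    return scores

# Usage sketch (hypothetical helpers):
# policies = [train_independent_run(seed=s) for s in range(5)]
# xp = cross_play_matrix(policies, evaluate_pair)
# zsc_score = xp[~np.eye(len(policies), dtype=bool)].mean()  # mean off-diagonal score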
One of the big challenges in ZSC under partially observable settings is to determine how to interpret the actions of other agents and how to select actions that will be interpretable to other agents. OBL addresses this issue by learning a hierarchy of policies, with an optimal grounded policy at the lowest level, which does not interpret other agents’ actions at all.
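Iterating the operator is then a short loop; the sketch below is schematic (obl_operator and random_policy are hypothetical stand-ins for the learning procedure, not code from the paper). Applying the operator once to a uniformly random base policy yields the grounded level, which attaches no conventional meaning to partners’ actions; each further application adds one step of reasoning over the beliefs of the level below.

def obl_hierarchy(random_policy, obl_operator, levels=4):
    """Build an OBL hierarchy pi_0, pi_1, ..., pi_levels.

    pi_0 is a uniformly random base policy; pi_1 = OBL(pi_0) is the grounded
    policy, and each subsequent level is a best response to the beliefs
    induced by the level below it.
    """
    policies = [random_policy]
    for _ in range(levels):
        policies.append(obl_operator(policies[-1]))
    return policies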
The researchers introduced the OBL operator, which computes π1 given any π0. If a common-knowledge policy π0 is played by all agents up to τi, then agent i can compute a belief distribution Bπ0(τ | τi) = P(τ | τi, π0) conditioned on its action-observation history (AOH) τi. This belief distribution fully describes the effect of the history on the current state.
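Since the AOH τi is a deterministic function of the full trajectory τ, this belief is simply the Bayesian posterior over trajectories consistent with what agent i has observed, weighted by how likely π0 is to generate them. A sketch of the computation in the notation above (the indicator formulation is added here for clarity):

\[
\mathcal{B}_{\pi_0}(\tau \mid \tau_i)
= P(\tau \mid \tau_i, \pi_0)
= \frac{P(\tau \mid \pi_0)\,\mathbf{1}[\,\tau \text{ consistent with } \tau_i\,]}
       {\sum_{\tau'} P(\tau' \mid \pi_0)\,\mathbf{1}[\,\tau' \text{ consistent with } \tau_i\,]}
\]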
The OBL operator maps an initial policy π0 to a new policy π1 as follows:
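A hedged reconstruction of the definition, following the notation above (the exact form in the paper may differ slightly): the past trajectory is sampled from the belief induced by π0, the future is rolled out under π1, and π1 then acts greedily with respect to the resulting counterfactual value.

\[
Q_{\pi_0 \to \pi_1}(\tau_i, a)
= \mathbb{E}_{\tau_t \sim \mathcal{B}_{\pi_0}(\cdot \mid \tau_i)}\;
  \mathbb{E}_{\tau_{t+1:T} \sim (\tau_t,\, a,\, \pi_1)}
  \Big[\sum_{t' = t}^{T} r_{t'}\Big],
\qquad
\pi_1(\tau_i) = \arg\max_{a} Q_{\pi_0 \to \pi_1}(\tau_i, a)
\]

In words, π1 never assumes that past actions carried conventional meaning, because the past is always explained by π0.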
Source: arxiv.org
The above equation suggests a simple algorithm for computing an off-belief learning policy in small tabular environments: compute Bπ0(τi) for each AOH, and then compute Qπ0→π1(τi) for each AOH in ‘backwards’ topological order. However, such approaches are feasible only for small POMDPs. To apply value-iteration methods, the Bellman equation for Qπ0→π1 for each agent i is given as follows:
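A hedged reconstruction of that recursion, consistent with the operator above (exact notation in the paper may differ): the current state is sampled from the π0-induced belief, the immediate reward is collected, and the bootstrap term evaluates π1 at the AOH of the next agent to act.

\[
Q_{\pi_0 \to \pi_1}(\tau_i, a)
= \mathbb{E}_{\tau \sim \mathcal{B}_{\pi_0}(\cdot \mid \tau_i)}
  \Big[ r(\tau, a)
  + \mathbb{E}_{\tau' \sim \mathcal{T}(\cdot \mid \tau, a)}
    \big[ Q_{\pi_0 \to \pi_1}\big(\tau'_{j},\, \pi_1(\tau'_{j})\big) \big] \Big]
\]

where τ'j is the AOH of the next acting agent j and T denotes the environment dynamics. This recursive form is what allows OBL to be trained with standard value-based methods rather than exact tabular computation.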
Source: arxiv.org
The Hanabi experiment
Researchers at Meta AI tested the methods in the more complex domain of Hanabi, a popular benchmark for MARL, theory-of-mind, and zero-shot coordination research.
Hanabi is a card game for two to five players. The deck consists of 50 cards split across five colours (suits) and five ranks, with each colour having three 1s, two each of 2s, 3s and 4s, and one 5. In the two-player setting, each player holds a five-card hand. Here the game gets tricky: each player can see their partner’s hand but not their own. The team strives to play one card of each rank in each colour, in order from 1 to 5. The team shares eight hint tokens and three life tokens. On each turn, a player must either play or discard a card from their hand, or spend a hint token to give their partner a hint. Playing a card succeeds if it is the lowest-rank card of its colour not yet played; otherwise it fails and costs a life token. A hint consists of choosing a rank or colour present in the partner’s hand and indicating all cards in that hand sharing that colour or rank. Discarding a card or successfully playing a 5 regains one hint token. The team’s score is zero if all life tokens are lost; otherwise it equals the number of cards successfully played, for a maximum possible score of 25.
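The deck composition and scoring rules above are straightforward to encode; the snippet below is a small illustrative sketch (it is not the actual benchmark environment, and the colour names are simply the standard Hanabi suits):

from collections import Counter

COLORS = ["red", "yellow", "green", "white", "blue"]
RANK_COUNTS = {1: 3, 2: 2, 3: 2, 4: 2, 5: 1}  # copies of each rank per colour

# Full 50-card deck: 5 colours x (3 + 2 + 2 + 2 + 1) = 50 cards.
DECK = [(c, r) for c in COLORS for r, n in RANK_COUNTS.items() for _ in range(n)]
assert len(DECK) == 50

def play_succeeds(fireworks, card):
    """A play succeeds iff the card is the next rank needed in its colour."""
    colour, rank = card
    return rank == fireworks[colour] + 1

def score(fireworks, lives_left):
    """Score is the number of cards played, or 0 if all lives are lost (max 25)."""
    return 0 if lives_left == 0 else sum(fireworks.values())

# Usage sketch:
fireworks = Counter({c: 0 for c in COLORS})   # highest rank played per colour
print(play_succeeds(fireworks, ("red", 1)))    # True: red 1 starts the red pile
print(play_succeeds(fireworks, ("red", 2)))    # False: red 1 has not been played yet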
Results table. Source: arxiv.org
The table shows that OBL, evaluated at four different levels of the hierarchy, produces better results than the other methods.
Conclusion
Off-belief learning is a new method that can train optimal grounded policies, preventing agents from exchanging information through arbitrary conventions. When used in a hierarchy, each level adds one step of reasoning over the beliefs induced by the previous level, providing a controlled means of reintroducing conventions and pragmatic reasoning into the learning process. Crucially, OBL removes the ‘weirdness’ of learning in Dec-POMDPs since, given the beliefs induced by the level below, each level has a unique optimal policy. Therefore, OBL at convergence can solve instances of the ZSC problem. Importantly, OBL’s performance gains under ZSC translate directly into state-of-the-art ad-hoc teamplay and human-AI coordination results, validating the “ZSC hypothesis”.