We propose a unified mechanism for achieving coordination and communication in Multi-Agent Reinforcement Learning (MARL) by rewarding agents for having causal influence over other agents’ actions. Causal influence is assessed using counterfactual reasoning. At each timestep, an agent simulates alternate actions that it could have taken, and computes their effect on the behavior of other agents. Actions that lead to bigger changes in other agents’ behavior are considered influential and are rewarded. We show that this is equivalent to rewarding agents for having high mutual information between their actions. Empirical results demonstrate that influence leads to enhanced coordination and communication in challenging social dilemma environments, dramatically improving the learning curves of the deep RL agents, and leading to more meaningful learned communication protocols. The influence rewards for all agents can be computed in a decentralized way by enabling agents to learn a model of other agents using deep neural networks. In contrast, key previous works on emergent communication in the MARL setting were unable to learn diverse policies in a decentralized manner and had to resort to centralized training. Consequently, the influence reward opens up a window of new opportunities for research in this area.
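The counterfactual computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes discrete actions, assumes that "change in behavior" is measured as the KL divergence between the other agent's policy given the action actually taken and its policy averaged over all actions the influencer could have taken, and uses hypothetical names (`influence_reward`, `expected_influence`, `p_k`, `p_j_given_k`). Under those assumptions, the expected reward equals the mutual information between the two agents' actions.

```python
import math


def influence_reward(p_k, p_j_given_k, taken):
    """Influence of agent k's action `taken` on agent j (hypothetical helper).

    p_k[a]            -- probability that agent k takes action a
    p_j_given_k[a][b] -- probability that agent j takes action b given k took a
    Assumes p_k[taken] > 0, so the marginal covers the conditional's support.
    """
    n_j = len(p_j_given_k[0])
    # Counterfactual marginal: j's policy averaged over every action k
    # could have taken, weighted by how likely k was to take it.
    marginal = [
        sum(p_k[a] * p_j_given_k[a][b] for a in range(len(p_k)))
        for b in range(n_j)
    ]
    conditional = p_j_given_k[taken]
    # KL(conditional || marginal); terms with p == 0 contribute nothing.
    return sum(
        p * math.log(p / q) for p, q in zip(conditional, marginal) if p > 0
    )


def expected_influence(p_k, p_j_given_k):
    """Average influence over k's own policy: the mutual information I(A_k; A_j)."""
    return sum(
        p_k[a] * influence_reward(p_k, p_j_given_k, a)
        for a in range(len(p_k))
        if p_k[a] > 0
    )


# Agent j copies agent k exactly: maximal influence for a fair binary choice.
copying = [[1.0, 0.0], [0.0, 1.0]]
# Agent j ignores agent k: no influence.
ignoring = [[0.3, 0.7], [0.3, 0.7]]
uniform = [0.5, 0.5]

print(expected_influence(uniform, copying))
print(expected_influence(uniform, ignoring))
```

In the copying case the expected influence is log 2 nats (one bit), and in the ignoring case it is zero, which matches the intuition that only actions the other agent responds to are rewarded.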
Quantifying the effects of environment and population diversity in multi-agent reinforcement learning
Generalization is a major challenge for multi-agent reinforcement learning. How well does an agent perform when placed in novel environments and in interactions with new co-players? In this paper, we investigate and quantify the relationship between generalization and diversity in the multi-agent domain. Across the range of multi-agent environments considered here, procedurally generating training levels significantly improves agent performance on held-out levels. However, agent performance on the specific levels used in training sometimes declines as a result. To better understand the effects of co-player variation, our experiments introduce a new environment-agnostic measure of behavioral diversity. Results demonstrate that population size and intrinsic motivation are both effective methods of generating greater population diversity. In turn, training with a diverse set of co-players strengthens agent performance in some (but not all) cases.
Kevin R. McKee, Joel Z. Leibo, Charlie Beattie & Richard Everett
Large language models show human-like content biases in transmission chain experiments
As the use of Large Language Models (LLMs) grows, it is important to examine whether they exhibit biases in their output. Research in Cultural Evolution, using transmission chain experiments, demonstrates that humans have biases to attend to, remember, and transmit some types of content over others. Here, in five pre-registered experiments with the same methodology, we find that the LLM ChatGPT-3 shows biases analogous to humans for content that is gender-stereotype consistent, social, negative, threat-related, and biologically counterintuitive, over other content. The presence of these biases in LLM output suggests that such content is widespread in its training data, and could have consequential downstream effects by magnifying pre-existing human tendencies for content that is cognitively appealing but not necessarily informative or valuable.
Two Types of Aggression in Human Evolution
Two major types of aggression, proactive and reactive, are associated with contrasting expression, eliciting factors, neural pathways, development, and function. The distinction is useful for understanding the nature and evolution of human aggression. Compared with many primates, humans have a high propensity for proactive aggression, a trait shared with chimpanzees but not bonobos. By contrast, humans have a low propensity for reactive aggression compared with chimpanzees, and in this respect humans are more bonobo-like. The bimodal classification of human aggression helps solve two important puzzles. First, a long-standing debate about the significance of aggression in human nature is misconceived, because both positions are partly correct. The Hobbes–Huxley position rightly recognizes the high potential for proactive violence, while the Rousseau–Kropotkin position correctly notes the low frequency of reactive aggression. Second, the occurrence of two major types of human aggression solves the execution paradox, concerned with the hypothesized effects of capital punishment on self-domestication in the Pleistocene. The puzzle is that the propensity for aggressive behavior was supposedly reduced as a result of being selected against by capital punishment, but capital punishment is itself an aggressive behavior. Since the aggression used by executioners is proactive, the execution paradox is solved to the extent that the aggressive behavior of which victims were accused was frequently reactive, as has been reported. Both types of killing are important in humans, although proactive killing appears to be typically more frequent in war. The biology of proactive aggression is less well known and merits increased attention.
Property rights and long-run economic growth
Quantitative analysis of the model assigns a major role to changes in property rights in explaining growth over the very long run. As one example, the number of new ideas produced in a year rises by a factor of 110,000 in the simulated economy between 25,000 B.C. and the 20th century. A factor of 108 of this increase is due to the fact that the 20th century has a larger population base from which inventors are drawn; a factor of 4 of this increase is attributed to knowledge spillovers, i.e. to the notion that it is easier to produce ideas today because of discoveries made in the past. The remaining factor of 245 is assigned to an increase in the property rights variable, the fraction of resources used to compensate inventive effort.
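The three factors quoted above can be checked with a few lines of arithmetic. This sketch assumes the factors combine multiplicatively, which the wording implies but does not state outright. The log-share breakdown is an illustrative way to compare the contributions and is not taken from the paper. The variable names are hypothetical.

```python
import math

# Factors as quoted in the abstract.
factors = {
    "population": 108,
    "knowledge_spillovers": 4,
    "property_rights": 245,
}

# Assumption: the overall increase is the product of the three factors.
total = math.prod(factors.values())
print(total)  # 105840, i.e. about 110,000 at two significant figures

# Illustrative only: each factor's share of the total increase in log terms.
shares = {name: math.log(f) / math.log(total) for name, f in factors.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The product, 105,840, rounds to the quoted factor of 110,000, and property rights account for the largest log share of the three.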
Why arms control is so rare
Arming is puzzling for the same reason war is: it produces outcomes that could instead be realized through negotiation, without the costly diversion of resources arming entails. Despite this, arms control is exceedingly rare historically, so that arming is ubiquitous and its costs to humanity are large. We develop and test a theory that explains why arming is so common and its control so rare. The main impediment to arms control is the need for monitoring that renders a state’s arming transparent enough to assure its compliance but not so much as to threaten its security. We present evidence that this trade-off has undermined arms control in three diverse contexts: Iraq’s weapons programs after the Gulf War, great power competition in arms in the interwar period, and superpower military rivalry during the Cold War. These arms races account for almost 40% of all global arming in the past two centuries.