Cumulative Cultural Evolution in Some AI Algorithms

Mesoudi & Thornton (2018) provide the following criteria for when a population exhibits cumulative cultural evolution:

(i) a change in behaviour (or product of behaviour, such as an artefact), typically due to asocial learning, followed by (ii) the transfer via social learning of that novel or modified behaviour to other individuals or groups, where (iii) the learned behaviour causes an improvement in performance, which is a proxy of genetic and/or cultural fitness, with (iv) the previous three steps repeated in a manner that generates sequential improvement over time.

Eureka

Eureka is an algorithm for designing reward functions that allow RL agents to learn various behaviors. It works as follows. First, it runs K instances of an LLM. As context, each instance is given a description of the task the RL agent is supposed to carry out, the source code of the environment in which that task will be carried out, and an initial prompt. Each instance outputs a reward function as executable Python code. Eureka then evaluates these reward functions by training a policy with each one and scoring the trained policies against the environment's fitness function, and selects the best-performing reward function among the K outputs. It then generates a new prompt based on this best-performing reward function. The new prompt includes not only the reward function itself but also automated feedback that summarizes the policy-training dynamics. After this, Eureka runs a new iteration of K LLM instances, which are now also given the new prompt. This process is repeated for N iterations, after which Eureka outputs the best-performing reward function.
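
To make the loop concrete, here is a minimal Python sketch of Eureka's outer loop. The helpers sample_reward_fn, train_and_evaluate, and reward_reflection are hypothetical stand-ins (not Eureka's actual API) for the LLM call, the RL-training-plus-evaluation step, and the automated feedback, respectively.

```python
import random

def sample_reward_fn(prompt: str) -> str:
    # Hypothetical stand-in for one LLM run: in Eureka, each run returns
    # executable Python source for a candidate reward function.
    return f"def reward(obs, action): ...  # candidate for: {prompt[:40]!r}"

def train_and_evaluate(reward_fn: str) -> float:
    # Hypothetical stand-in: train an RL policy using this reward function,
    # then score the trained policy with the environment's fitness function.
    return random.random()

def reward_reflection(reward_fn: str, fit: float) -> str:
    # Hypothetical stand-in for the automated feedback summarizing the
    # policy-training dynamics under the best reward function.
    return f"Best so far (fitness {fit:.2f}):\n{reward_fn}\nTry to improve it."

def eureka(task_description: str, env_source: str, K: int = 4, N: int = 5) -> str:
    prompt = f"{task_description}\n\n{env_source}"
    best_fn, best_fit = "", float("-inf")
    for _ in range(N):
        # Each iteration: K independent LLM runs given the same context.
        candidates = [sample_reward_fn(prompt) for _ in range(K)]
        scored = [(train_and_evaluate(fn), fn) for fn in candidates]
        fit, fn = max(scored)  # best of this iteration's K outputs
        if fit > best_fit:
            best_fn, best_fit = fn, fit
        # The winner, plus feedback on it, is passed to the next iteration.
        prompt = f"{task_description}\n\n{env_source}\n\n{reward_reflection(fn, fit)}"
    return best_fn
```

The detail worth noticing for what follows is that the only information carried from one iteration to the next is the prompt, which contains the previous winner and feedback on it.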

How does this process look in terms of cumulative cultural evolution? The first iteration of LLMs engages in purely asocial learning. The best-performing reward function is then transferred via social learning to the next iteration of LLMs, where it causes an improvement in performance. This second iteration of LLMs engages in further asocial learning, and the best-performing reward function is again transferred via social learning to the next iteration, causing an improvement in performance. Thus the first three steps are repeated in a way that leads to sequential improvement over time, and all four conditions for cumulative cultural evolution are satisfied.

Promptbreeder

Promptbreeder is an algorithm for generating prompts. Starting with an initial task prompt P (e.g. “Solve the math problem”), it generates a mutated task prompt P’ by running an LLM on the initial task prompt P together with a mutation prompt M (e.g. “Make a variant of the prompt”). In this way, it initializes a population of mutated task prompts, which it then evolves by repeatedly sampling two individuals, taking the one with higher fitness, mutating it using a variety of mutation operators, and overwriting the loser with the mutated copy of the winner. In addition to mutating task prompts, Promptbreeder also mutates the mutation prompts themselves. (Note: I’ve skipped over a lot of complexity here, but for present purposes this should be enough.)
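
In outline, this core loop is a binary tournament. Below is a minimal Python sketch under the same simplification; mutate and fitness are hypothetical stand-ins for the LLM-driven mutation operators and the prompt-scoring step, and the mutation of mutation prompts is omitted, as in the description above.

```python
import random

def mutate(task_prompt: str, mutation_prompt: str) -> str:
    # Hypothetical stand-in: run an LLM on the mutation prompt plus the
    # task prompt to produce a mutated task prompt.
    return f"{task_prompt} (variant)"

def fitness(task_prompt: str) -> float:
    # Hypothetical stand-in: score a task prompt, e.g. by the accuracy an
    # LLM achieves with it on a batch of training questions.
    return random.random()

def promptbreeder(task_prompt: str, mutation_prompt: str,
                  pop_size: int = 8, generations: int = 100) -> str:
    # Initialize a population of mutated task prompts.
    population = [mutate(task_prompt, mutation_prompt) for _ in range(pop_size)]
    for _ in range(generations):
        # Binary tournament: sample two individuals and compare fitness.
        i, j = random.sample(range(pop_size), 2)
        if fitness(population[i]) < fitness(population[j]):
            i, j = j, i  # i is now the winner, j the loser
        # Overwrite the loser with a mutated copy of the winner.
        population[j] = mutate(population[i], mutation_prompt)
    return max(population, key=fitness)

# Example: best = promptbreeder("Solve the math problem",
#                               "Make a variant of the prompt")
```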

The initial population of mutated task prompts can be seen as generated by asocial learning. Social learning then happens when losers are overwritten by winners (though this also involves a mutation step). This leads to a performance improvement: experimentally, the authors found that fitness continued to increase throughout the run. As this process is repeated, leading to sequential improvement, all criteria for cumulative cultural evolution are met. (I haven’t touched here on the fact that the mutation prompts are themselves mutating, which is interesting in its own right but not necessary for establishing the presence of cumulative cultural evolution.)


Date
March 28, 2024