Factors in Learning Dynamics Influencing Relative Strengths of Strategies in Poker Simulation


1. Introduction

The analysis of poker can be traced back to Von Neumann, whose great interest in the game led to the development of the field of game theory [1]. With elements of incomplete and unreliable information, risk management, opponent modeling, and deception, the challenges that arise in analyzing poker are distinct from those in complete information games such as chess or checkers. Because full poker is so complex, early research tackled simpler non-cooperative, zero-sum, incomplete information games that had computable optimal strategies [2,3,4,5]. Here, we organize poker research into three approaches with distinct, interrelated goals: understanding human poker play, engineering effective heuristics to increase wins in real (AI or human) players, and analyzing optimal poker using game theory.
Research on human poker play models human decision-making in competitive incomplete information settings. This research focus is part of a strong body of literature on modeling human learning [6,7]. Many studies observed participants in laboratory experiments during which simplified versions of poker with tractable optimal solutions were played [2,3]. Researchers observed that players displayed tendencies to make suboptimal decisions (i.e., bluff, take risks, deceive) based on their emotional states and perceptions of their opponents [4,8,9]. Researchers also examined whether humans could improve after repeated play. Even with consistent feedback over many rounds, most failed to improve their strategies, that is, to play more optimally [5,8,9]. Findings from human decision-making studies help us understand the intricacies of the game of poker. These insights have implications both for the development of optimal poker-playing agents and for theoretical modeling of the game.
A separate endeavor involves creating winning poker agents that can calculate the best response to every move, paralleling the abilities of chess and checkers agents. These agents efficiently traverse an enormous constructed game state space to achieve this [10,11,12]. Because poker is an incomplete information game, constructing its state space is an impressive undertaking requiring: (1) a massive database of poker hands and (2) methods that allow an optimal decision to be made in every situation. The goal of these solutions is to be game-theoretically optimal such that no money is lost in the long run, regardless of the opponent's strategy. Groups have employed machine learning and artificial intelligence methodologies to the point where agents can compete with the best human players [13,14], with some even claiming to have weakly solved the game of Texas Hold'em poker [10]. Successful gameplay, while of great value, fails to expand our understanding of decision-making and the strengths of strategies (e.g., rational play, bluffing). Investigating the relative advantage of different strategies from a game theoretic perspective gives deeper insights into poker play than a myopic state-space parsing can. Additionally, game theory can utilize insights from both the human and AI modeling approaches described above and offers a more principled approach to the analysis of poker.
The game theoretic framework provides a systematic approach to the analysis of optimal strategies and near-Nash-equilibrium play in poker [15,16], and evolutionary game theory provides an approach to learning dynamics [17,18]. Poker involves a significant element of strategic reasoning under uncertainty as well as rational play based on the strength of the player's hand. Game theory can provide insights into stable, balanced strategies that players can adopt and introduces the concept of mixed strategies [19], in which players randomize their actions to make their strategies less exploitable [4,12]. Incorporating learning dynamics in conjunction with the game theoretic framework provides a foundation for understanding how players can learn and adapt their strategies over repeated play, which can lead to the development of AI agents capable of playing poker at an expert level [15]. An advantage of evolutionary game theory is that no expert knowledge of poker is required. Instead, strategies adapt through continuous play until an effective strategy is developed. This hands-off nature can also lead to the development of previously unthought-of strategies [15]. It was shown that, even without the incorporation of expert knowledge, an evolutionary method can result in strategies that are competitive with Poki and PSOpti [15].
Later evolutionary game theoretic efforts established poker as a game of skill by exhibiting the success of rational agents over irrational agents in tournament structure play [20], where agents had fixed strategies. Using simple imitation dynamics was enough for the rational strategy to dominate a population of rational and irrational adaptive poker agents [21]. These analyses did not rely on the use of data from human poker play or the traversal of a constructed state space to uncover optimal actions to take in any state of the game. Instead, the focus is on the performance of strategies, their relative efficacy, and the learning processes that influence which dominant strategies emerge over the course of play. Here, we highlight two distinct but interactive components of modeling: the defined strategies of the agents and the learning dynamics by which the agents revise their strategies. While studying strategies is ultimately the primary concern of game theorists, how learning mechanisms influence the relative strength of strategies is of great, perhaps even equal, importance. More specifically, it is not clear how the design of a learning mechanism impacts the outcome of strategy learning and whether a strategy will remain dominant across all learning dynamics. It is not sufficient to assume that any "reasonable" learning dynamic will lead to the same optimal strategy emerging in the population.
This paper aims to investigate the interplay between learning dynamic design and dominant strategies in poker. We test strategies (i.e., rational and irrational) across a variety of appropriate learning dynamics and track the evolution of strategies in the population. The learning dynamics employed in this paper are designed in the family of Erev and Roth (ER) reinforcement learning (RL) dynamics [22]. Importantly, these dynamics do not require agents to be highly or even boundedly rational. Limiting the agents to low rationality is well established, as humans in interactive settings often do not reason with high rationality [23,24,25,26]. The manner in which the agents maintain their strategy allows them to play and learn mixed strategies, an important aspect of poker play [4,12]. Our analysis uncovers features inherent to poker by carefully studying strategies rather than action sequences. It also provides descriptive qualities somewhat akin to those of modeling human play. The systematic definition and exploration of dynamics and strategies provides a foundation for future exploration of more complicated components of strategies such as bluffing and risk. Our approach can also provide simple, communicable, and easily transferable advice to poker players of any skill level. After establishing simple strategies one can use to decide one's actions, one is able to model real world play as a composition or weighting of the different strategies. Further, various reinforcement learning dynamics are proposed that have descriptive power with regard to the change in a player's strategy based on the outcomes of repeated play.

1.1. Previous Work

Javarone established poker as a game of skill using tournaments between rational and random agents [20] before introducing population dynamics to enable agents to revise their strategy through play. By employing voter-model-like dynamics, it was shown that under various initial population compositions, the rational strategy is ultimately the one that is learned [27]. Acknowledging that a revision rule predicated on knowledge of the opponent's strategy is unrealistic, Javarone most recently introduced a strategy revision rule that depends only on the payoff received. Under this revision rule, when agents play until one has all of the money, the rational strategy becomes the dominant strategy in the population. Meanwhile, in single hand challenges neither the rational nor the irrational strategy emerges as dominant [21].
Others have applied evolutionary game models to real world data of online poker play [28,29]. The learning of the players in the data set was summarized using a handful of strategy descriptors, and the learning of agents over the course of play was analyzed. Evolutionary modeling has been a useful element with regard to the design of some poker AI [15,30,31].

1.2. Evolutionary Game Theory

Classical game theory involves studying optimal strategies for players that do not change their strategy over time. Evolutionary game theory extends this by introducing learning, updating, and population level dynamics. By defining dynamics for learning in an agent population, interactions not predicated upon the assumption of rational actors are possible [23]. Further, the dynamics themselves can be investigated, not just the equilibria of the game. Evolutionary game theory has become a valuable tool in the arsenal of many researchers, with papers in over one hundred research areas applying evolutionary game theory concepts as a part of their research [32,33,34,35].

1.3. Poker

Texas Hold'em is the most popular of many variants of poker. This paper follows Javarone in studying two person (heads-up) Texas Hold'em. When two players sit down at a table to play, they bring with them a stack of chips that they will wager. The goal is to increase this chip stack. On a hand, a player is first dealt two cards that are only visible to them (hidden cards). One player places a small portion of their stack into the pot, the small blind, and the other a quantity twice that size (the big blind). The two bets are added to the pot, which accumulates as chips are bet over the course of the hand. These mandatory bets create an incentive for players with weak hands to play. Then play commences. During a player's turn, there are three possible actions: fold, call, or raise. By folding, a player forfeits the hand, preventing them from winning any money but also preventing them from losing any more money. If the player opts to call, they bet the minimum stake required to stay in the hand, and the turn passes to the next player. The other option is to raise. By doing so, the player bets the minimum stake required to stay in the hand, along with an extra amount of chips. This forces the other player to pay the increased amount or fold. In a betting round, the turn passes back and forth between the two players until both have called or one has folded. Between betting rounds, community cards that are available to both players are revealed, with five community cards ultimately available. After the final betting round, if neither player has folded, they each create their best possible combination of five cards from their two hidden cards and the five community cards. The player with the best combination wins the pot.

1.4. Erev and Roth Learning

The learning dynamics used in this paper follow the principles of Erev and Roth (ER) learning. ER learning is a model of reinforcement learning and decision-making used across a wide range of strategic environments [36,37]. It can characterize patterns of both human decision-making and learning in agent-based simulations, providing a realistic representation of how agents adapt their strategies in response to feedback with no a priori heuristics. In our dynamics, each agent learns using an urn of colored marbles, where the colors correspond to strategies. At the start of a game of poker, an agent picks a marble at random from the urn; this dictates their strategy. The urn for each agent is updated by adding marbles based on the outcome of the game and the payoff received. ER learning has the advantage of low rationality, as the determination of strategy depends only on accumulated payoffs [25]. Hence, the simple learning and agent decision making allow us to more confidently identify the interplay between strategy and the game elements that influence payoffs. To make it clear that the results are tied to the game and not just the learning model employed, additional justification is given for choosing ER learning: it is a widely applied, simple, and well-analyzed learning model. In addition, other papers (including [37,38]) are discussed to show that the learning outcomes obtained under ER learning are what can be expected from other learning models. We also include an exhaustive list of all models developed during the research process in Appendix B. These elements appear as follows. Erev and Roth put forth two basic principles with which one should start a search for a descriptive model of learning: the law of effect and the power law of practice. The law of effect dictates that actions that lead to favorable outcomes are more likely to be repeated than actions that lead to less favorable outcomes. Per the power law of practice, the learning curve for agents is steep early on but quickly flattens. This is accomplished by agents accumulating propensities for each strategy over the course of play: as the agents become more experienced, each individual round of play has less influence on their belief. Human psychology studies have noted this behavior [39,40]. We opted for Roth-Erev learning over dynamics that are not probabilistic and lead to deterministic selection of strategies [38]; these other dynamics are similar to Roth-Erev learning in their myopic nature and perform at least as well [37]. For this first exploration of learning dynamics in poker, the ER learning model was chosen due to its prevalence in studying psychological phenomena such as those present in poker play. This flexible and well-analyzed model enables the results of this paper to be put into context with other results with greater precision. The ER model was not the only one examined in the research process; the other dynamics are listed in Appendix B.

The goal of this paper is to develop an understanding of what drives the success of strategies under different learning dynamics. Poker has been established as a game of skill, but we show that not all feedback mechanisms result in skilled play being learned.

2. Materials and Methods

The experiments involve the simulation of a simplified poker game on top of which agents revise their beliefs about strategies through repeated play.

2.1. Rational and Random Strategies

There are two strategies available to the agents: a rational strategy and a random strategy. The only difference between the two strategies is how they determine the strength of their starting hand when dealt their two hidden cards. Agents playing the rational strategy look up the strength of their hand in a table that maps each starting hand to the probability that it is the best hand. The construction of this table involved running one hundred million hands to calculate the probability of each hand being the best hand. The random agents pick their win probability randomly from a uniform distribution on the range [0.2923, 0.8493], the range of possible hand win probabilities. From there, agents use the same method to determine their actions, regardless of strategy. When considering their action, an agent has a believed win probability o, knows the current size of the pot P, and is deciding on a bet amount b. As they are looking to maximize the expected value of the hand, they use the following equation, as proposed in [41]:

E(b) = o(P + b + b) - b    (1)

The first term is the money that the agent stands to win by betting b and their opponent calling. They will win the pot, win back their own wager, and take the wager of their opponent, which the agent believes will happen with probability o (their win probability). The second term is the money the agent loses by betting. When the agent bets, they are separating themselves from those chips, which are then added to the pot with certainty. With a bit of algebra:

E(b) = (2o - 1)b + oP    (2)

This clearly shows that if o > 0.5, the expected value increases with larger bet sizes, so the agent will go all-in. If o = 0.5, the bet size does not matter, and the agent will go all-in. If o < 0.5, the expected value decreases with larger bets. Thus, the agent checks whether betting the minimum stake required to stay in the hand has non-negative expected winnings. If it does, they place that bet; otherwise, they fold (which has expected winnings of zero). Either way, the agent is maximizing their expected winnings under their belief about their hand strength. This decision making process is visualized in Figure 1. When the agent is considering their decision, they carry with them their believed win probability. The game state consists of the minimum stake required to stay in the hand as well as the current size of the pot. With that information, they use Equation (2) to calculate the expected winnings of a bet b, placing the bet b which maximizes their expected winnings.
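To make the decision rule concrete, the sketch below implements the two steps described above: assigning a believed win probability by strategy and then choosing the bet that maximizes Equation (2). The table, names, and function signatures are illustrative assumptions rather than the exact implementation used in our simulations.

```python
import random

# Illustrative sketch only: win_prob_table stands in for the precomputed
# starting-hand table; names and signatures are assumptions, not the
# simulation's actual code.

WIN_PROB_RANGE = (0.2923, 0.8493)   # bounds of possible hand win probabilities

def believed_win_probability(strategy: str, hidden_cards, win_prob_table: dict) -> float:
    if strategy == "rational":
        return win_prob_table[hidden_cards]      # look up the true win probability
    return random.uniform(*WIN_PROB_RANGE)       # random strategy: arbitrary belief

def expected_winnings(o: float, pot: int, bet: int) -> float:
    """Equation (2): E(b) = (2o - 1) * b + o * P."""
    return (2 * o - 1) * bet + o * pot

def choose_bet(o: float, pot: int, min_stake: int, stack: int) -> int:
    """Return the bet that maximizes expected winnings; -1 encodes a fold."""
    if o >= 0.5:
        return stack                             # expected winnings grow with b: go all-in
    if expected_winnings(o, pot, min_stake) >= 0:
        return min_stake                         # calling still has non-negative expectation
    return -1                                    # fold (expected winnings of zero)
```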

2.2. Relative Strengths of Strategies with No Learning

As a benchmark for comparison, we first consider the efficacy of the two strategies relative to one another by simulating one hundred million hands between a rational and random agent with no learning.

2.3. Learning Dynamics

To follow [42], take q_{nk}(t) to be the propensity of agent n to play strategy k at time t. For each of the n agents and for all strategies k, j, q_{nk}(1) = q_{nj}(1), meaning each agent starts off with the same propensity for each strategy and all agents start off with the same propensities for each strategy. This is the case for all dynamics. Further, the agents calculate their probability of playing strategy k at time t as:

p_{nk}(t) = q_{nk}(t) / Σ_j q_{nj}(t)

This paper considers two possible strategies, rational and random, so agents will maintain propensities for these two strategies. Thus, k and j refer to the rational and random strategies (in either order). When discussing propensities, the strength of the initial belief should be considered. Agents that have large initial propensities relative to the average payoff received on each round will learn slowly (as they already have a strong belief), while those with small initial propensities relative to the average payoff received on each round are at risk of highly varied learning at the start that relies on the outcomes of the first few hands. Below is the specification for each of the four learning dynamics implemented. We are interested in the converged composition of urns for the population.
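The urn mechanism can be summarized in a few lines of code. The sketch below, with variable names chosen for illustration, computes the choice probabilities from the propensities and draws a strategy accordingly.

```python
import random

# Minimal sketch of the urn: propensities act as marble counts, and the choice
# probability of each strategy is its share of the total. Names are illustrative.

def choice_probabilities(propensities: dict[str, float]) -> dict[str, float]:
    """p_nk(t) = q_nk(t) / sum_j q_nj(t)."""
    total = sum(propensities.values())
    return {k: q / total for k, q in propensities.items()}

def draw_strategy(propensities: dict[str, float]) -> str:
    """Draw a 'marble' in proportion to the propensities."""
    strategies = list(propensities)
    weights = [propensities[k] for k in strategies]
    return random.choices(strategies, weights=weights, k=1)[0]

# Equal initial propensities give a 50/50 draw; the value 10.0 is arbitrary here.
urn = {"rational": 10.0, "random": 10.0}
print(choice_probabilities(urn))   # {'rational': 0.5, 'random': 0.5}
```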

2.3.1. Unweighted Learning

This first dynamic is a simple application of the law of effect and power law of practice. After a hand of poker, the winning agent adds one marble for the strategy they used. The losing agent does not reinforce their played strategy, and thus the learning can be specified as follows, given that agent n won playing strategy j at time t:

q_{nk}(t+1) = q_{nk}(t) + 1,    if j = k
q_{nk}(t+1) = q_{nk}(t),        otherwise
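As a sketch (with illustrative names), the unweighted update amounts to incrementing a single entry of the winner's urn:

```python
# Hedged sketch of unweighted learning: only the winner reinforces, by one marble.
def unweighted_update(propensities: dict[str, float], played: str, won: bool) -> None:
    if won:
        propensities[played] += 1.0   # law of effect: reinforce the winning strategy
```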

2.3.2. Win Oriented Learning

To build on the unweighted learning, the reinforcement amount is now weighted by the amount of chips won. It is still the case that only the winning agent learns. For an agent n playing strategy j at time t:

q_{nk}(t+1) = q_{nk}(t) + R(x),    if j = k
q_{nk}(t+1) = q_{nk}(t),           otherwise

In this case, R(x) = max(0, x), where x is the change in stack for the agent on the hand. For the winning agent, x > 0; for the losing agent, x < 0, so no marbles are added.
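A corresponding sketch (names ours) weights the reinforcement by the chips won:

```python
# Win-oriented learning: reinforcement equals chips won, R(x) = max(0, x),
# so a losing agent (x < 0) adds nothing to its urn.
def win_oriented_update(propensities: dict[str, float], played: str, x: float) -> None:
    propensities[played] += max(0.0, x)
```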

2.3.3. Holistic Learning

To go one step further, losing is now considered. Strategies are rewarded not only for maximizing wins but also for minimizing losses. As a result, both the winning and losing agents experience meaningful learning after each hand. To achieve this, the minimum possible payoff is subtracted from the payoff of the hand for each agent, giving:

q_{nk}(t+1) = q_{nk}(t) + R(x),    if j = k
q_{nk}(t+1) = q_{nk}(t),           otherwise

However, R(x) = x - x_min. Note that x_min is the least possible payoff, which occurs when the agent loses their entire chip stack on the hand. Thus, x_min = -10,000, as each agent starts with 10,000 chips.
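A sketch of this shifted reinforcement (illustrative names, with the floor taken from the text):

```python
X_MIN = -10_000.0   # least possible payoff: losing the entire 10,000-chip stack

# Holistic learning: both agents reinforce by the payoff shifted so the worst
# possible outcome contributes zero, R(x) = x - x_min.
def holistic_update(propensities: dict[str, float], played: str, x: float) -> None:
    propensities[played] += x - X_MIN
```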

2.3.4. Holistic Learning with Recency

The final step is to incorporate recency by introducing a simulation parameter ϕ, an extension of basic ER learning [42,43]. When the agent learns, some proportion of their previous belief is discounted. Initially this does not have much of an effect, as the propensities are small. Later on, a small fraction of an agent's propensities will be roughly as large as the amount learned in an epoch, so each time they play a hand, the agent forgets about as much as they learn. The proportion forgotten, ϕ, is kept small to prevent the agents from only considering the most recent outcomes. For a detailed analysis of the value of ϕ, see Appendix A. This modification of holistic learning can be specified as:

q_{nk}(t+1) = (1 - ϕ) q_{nk}(t) + R(x),    if j = k
q_{nk}(t+1) = (1 - ϕ) q_{nk}(t),           otherwise

As in holistic learning, R(x) = x - x_min, where x_min = -10,000.
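The recency term only adds a discount step before reinforcement. In the sketch below, the value of ϕ is a placeholder; the value actually used is analyzed in Appendix A.

```python
X_MIN = -10_000.0

# Holistic learning with recency: discount every propensity by a small phi,
# then reinforce the played strategy by the shifted payoff R(x) = x - x_min.
def holistic_recency_update(propensities: dict[str, float], played: str,
                            x: float, phi: float = 0.01) -> None:  # phi value is illustrative
    for k in propensities:
        propensities[k] *= (1.0 - phi)    # forget a fraction of the old belief
    propensities[played] += x - X_MIN     # reinforce the strategy just played
```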

2.4. Simplified Poker

The agents play a simplified version of heads-up Texas Hold’em poker (only involving two players). To begin, one player pays a fixed amount of chips called the small blind, and the other pays the big blind—a quantity of chips double the value of the small blind. These blinds create an initial pot. The two players are then dealt two cards each that are only visible to the player receiving the cards. Betting ensues. The betting process begins with the player that paid the small blind. They have the choice to fold, call, or raise. If the small blind player does not fold, the big blind has the choice to fold, call, or raise. The turn will then pass back and forth until one player folds or both have called. If a player folds at any time, the other is the winner by default. The winner adds the chips from the pot to their stack and the hand is finished. In the case of both players calling, the five community cards are dealt and each player creates their best five card combination from their two hidden cards and the five community cards. The player with the best combination wins the pot.
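The betting flow of the simplified game can be sketched as a single loop. The agent interface (an act method returning a bet, with -1 encoding a fold) and the injected showdown function are assumptions made for illustration; the blind size is also a placeholder.

```python
# Sketch of one hand of simplified heads-up poker under the assumptions above.
def play_hand(small_blind_player, big_blind_player, showdown, blind: int = 50):
    pot = blind + 2 * blind                                   # blinds seed the pot
    owed = {small_blind_player: blind, big_blind_player: 0}   # chips needed to call
    turn, other = small_blind_player, big_blind_player
    first_action = True
    while True:
        bet = turn.act(min_stake=owed[turn], pot=pot)
        if bet < 0:                                           # fold: opponent wins by default
            return other
        pot += bet
        raise_amount = bet - owed[turn]                       # chips beyond a call
        owed[turn] = 0
        owed[other] += raise_amount
        if raise_amount == 0 and not first_action:
            break                                             # both players have now called
        first_action = False
        turn, other = other, turn
    # Both called: deal the five community cards and compare best five-card hands.
    return showdown(small_blind_player, big_blind_player, pot)
```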

2.5. Simulation Structure

The simulation consists of multiple hands of simplified poker between two agents at a time. On each hand, two agents are randomly selected from the population, they play simplified poker, learn, and are then placed back into the population. Play is organized into epochs. An epoch consists of a number of hands such that on average each agent in the population has played one hand. The simulations involve a population of two hundred agents—one hundred agents of each strategy—playing for ten thousand epochs. Different population sizes and lengths of simulation were tested and our simulation setup was sufficient for analysis.
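Putting the pieces together, the simulation loop can be sketched as follows. Here play_one_hand and apply_learning stand in for the simplified game and whichever dynamic is under test; they are assumptions for illustration, while the population and epoch parameters follow the text.

```python
import random

N_AGENTS = 200
N_EPOCHS = 10_000
HANDS_PER_EPOCH = N_AGENTS // 2   # each agent plays one hand per epoch on average

# Illustrative simulation loop: random pairing, play, learn, repeat.
def run_simulation(agents, play_one_hand, apply_learning):
    assert len(agents) == N_AGENTS
    for _ in range(N_EPOCHS):
        for _ in range(HANDS_PER_EPOCH):
            a, b = random.sample(agents, 2)     # draw two agents from the population
            payoffs = play_one_hand(a, b)       # assumed to return {agent: chip change}
            for agent, x in payoffs.items():
                apply_learning(agent, x)        # reinforce per the chosen dynamic
    return agents
```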

4. Discussion and Conclusions

The study of poker and game theory has been intertwined since the field's inception by Von Neumann in the 1920s. The analysis of poker owes much to game theory and evolutionary learning models. More recent work has established poker as a game of skill and shown the strength of rational play over random play [20,21]. Building on past efforts, in this paper we explore the influence of learning dynamics on the relative strengths of different strategies. The only difference between the rational and random strategies is how they determine the strength of their hidden cards at the beginning of the hand. The results indicate that rational play is not dominant across all learning dynamics. The features of learning dynamics can put rational agents at a disadvantage when loss is not taken into consideration. The dynamics we selected are well studied, with desirable features for modeling poker. The learning dynamics begin with minimal features and build to include more aspects of game play. This allows for a principled investigation into the influence of dynamics and different learning specifications on strategies. Our approach had the added benefit of providing plausible models for sub-optimal human play as well.

The first type of learning dynamic only considers wins, while the second one includes the magnitude of the win. In the third specification, we also include the extent of the loss, and the fourth and final dynamic gives recent events more weight than past events. These dynamics build towards a more complete set of features of poker play. In many contexts, a win may be all that matters, but in gambling the magnitude of a win is also of interest. Given that poker involves chips to gamble with, minimizing losses is also a core element of any effective strategy. The learning dynamics we employed reflect these considerations and show that, in designing dynamics, the context of the game must be taken into account.

When considering rational and random agents, our results highlight the influence of the learning specifications on which strategy will take over the population. The difference in outcomes for each dynamic gives valuable insights as to what makes each strategy effective. The efficacy of the rational strategy does not come from winning more hands. Surprisingly, in the cases where the rational agent wins, they do not receive a greater average payoff than a random agent. It is, in fact, the rational agent's ability to minimize losses that leads to higher payoffs compared to random agents. This is only highlighted when the learning dynamics include loss aversion. As for random agents, the arbitrary hand strength that is selected does not inhibit winning or the magnitude of wins, but rather leads to an increase in losses. Often when the random agent loses, they forfeit a large amount of their stack. In contrast, rational agents lose a smaller portion of their funds when facing a loss. This principled approach of building from simple dynamics to more sophisticated specifications has allowed for a deeper analysis of the strengths and drawbacks of each strategy.

The simple models in which the random strategy takes over the population can provide insight into the sub-optimal play prevalent in real world poker. The first dynamic can point to a bias or heuristic that players may have: more wins will result in higher gains. Similarly, many players may remember the exhilarating hands in which they won a large amount and ignore the times that also led to large losses. The second learning dynamic models such players, for whom wins and the amount won are considered and losses are conveniently forgotten. The benefit of our approach is that it allows future learning dynamics to incorporate other factors in their analysis.

In addition to expanding on learning dynamics, we can explore other important strategies that are yet to be modeled. Bluffing is a key element of poker play that adds considerable complexity to the game. Future research will tackle bluffing in simple poker play by exploring a variety of bluffing techniques. Another area of interest is developing a mechanism for mixed strategies, where agents randomize their choices based on certain probabilities. The game theory approach can also incorporate aspects of optimal play analysis in the form of situational play. Here, agents possess memory of past games, recall the strength of a strategy when faced with a similar hand, and, depending on their past performance, adopt other strategies as a result. The ultimate goal of such research is to simulate a full poker game with a wide range of strategies and learning dynamics.

In this paper, we could not include an exhaustive analysis of all appropriate learning dynamics. While we considered other reinforcement learning mechanisms, the ER learning dynamics provided a sound foundation that later work can build upon.
