
Tic-Tac-Toe

or how to achieve an optimal strategy with game theory, reinforcement learning... and a bunch of matchboxes

Chapters

  1. What?!
  2. Board, groups, symmetries and ternary numbers
  3. Game theory and minimax theorem
  4. MENACE or a pile of matchboxes mastering the game
  5. Reinforcement learning

What?!

Tic-tac-toe, or in British English noughts and crosses, is an ancient game which every seven-year-old child learns how to play, and how not to lose, in roughly nine games. Why would you spend your time playing and studying such a triviality? Well, I got interested in reinforcement learning recently, so that I can use it for nastier problems later, and tic-tac-toe is such a classic example that even Sutton & Barto [3] don't hesitate to put it into their book's introduction.

Like many other trendy machine learning techniques, reinforcement learning was invented back in the early 60s [4]. I was shocked (OK, positively surprised) when I watched one of Hannah Fry's Christmas lectures, where Matt Parker came with a pile of matchboxes which could play noughts and crosses. To my surprise, it wasn't Matt Parker who invented it, nor then-PhD student Matt Scroggs [5], whom he referenced in his previous video. It was a former Bletchley Park researcher, Donald Michie, who in 1960 came up with a smart way to teach a bunch of matchboxes filled with beads to play noughts and crosses [1, 2].

Although every seven-year-old child learns how to play and not to lose, it is also possible to enumerate and check all possible games, and so define a perfect, unbeatable player which won't lose a single game. To do that, we will take a look at the game theory approach. [6]

In this article, I try to explain some valuable mathematical properties of the game and come up with a few ways to teach a computer to play tic-tac-toe.

Board, groups, symmetries and ternary numbers

The tic-tac-toe board is a 3 × 3 square-shaped grid. For simplicity and future reference, let's denote all the fields by numbers 0-8, where the centre position is 0. In each turn, a new "O" (nought) or "X" (cross) is added and the board fills up. Some board configurations are identical up to some transformation. As you can see in the picture below, both boards are equal: the second one is rotated 90° to the left.

In abstract algebra, we call the set of operations under which an object is invariant its symmetry group. In the tic-tac-toe case, the underlying shape of the board is a square; therefore, we are interested in the symmetry group of the square. The figure below shows that we can reflect the square grid about four axes, tx, ty, tAC and tBD, rotate the entire grid by 90° (r), 180° (r2) and 270° (r3), and still get the same square.

This group is called the dihedral group D4 and it has 8 elements. It is generated by $\langle a, b : a^4 = b^2 = e,\ ab = ba^{-1} \rangle$. For more details see ProofWiki.

Permutations

Each of the group elements {e, r, r2, r3, tx, ty, tAC, tBD} has its own permutation, a mapping describing where each field goes. For example, the element r rotates the board by 90° to the right; therefore, field 0 maps back to 0, 1 goes to 7, 2 → 8, 3 → 1, 4 → 2, 5 → 3, 6 → 4, 7 → 5 and 8 → 6. Since such notation is a bit clumsy, we can adopt a widely used one, (0, 7, 8, 1, 2, 3, 4, 5, 6), where the position (index) is the origin and the value is the target. The following table lists all the elements' permutations.

It is worth noticing that the tic-tac-toe permutations do not form the entire permutation group S8 of the 8-element set (the centre is always fixed). Some permutations would deform the square grid; e.g. applying the transposition (1, 2) changes the board so that 1 is no longer a neighbour of 8, etc. For more details see ProofWiki.

element | permutation                  | cycle form 1)
e       | (0, 1, 2, 3, 4, 5, 6, 7, 8) | ()
r       | (0, 7, 8, 1, 2, 3, 4, 5, 6) | (1, 7, 5, 3) (2, 8, 6, 4)
r2      | (0, 5, 6, 7, 8, 1, 2, 3, 4) | (1, 5) (2, 6) (3, 7) (4, 8)
r3      | (0, 3, 4, 5, 6, 7, 8, 1, 2) | (1, 3, 5, 7) (2, 4, 6, 8)
tx      | (0, 7, 6, 5, 4, 3, 2, 1, 8) | (1, 7) (2, 6) (3, 5)
ty      | (0, 3, 2, 1, 8, 7, 6, 5, 4) | (1, 3) (4, 8) (5, 7)
tAC     | (0, 1, 8, 7, 6, 5, 4, 3, 2) | (2, 8) (3, 7) (4, 6)
tBD     | (0, 5, 4, 3, 2, 1, 8, 7, 6) | (1, 5) (2, 4) (6, 8)

As you can see, the centre field 0 remains fixed under all permutations, which is fairly obvious and gives a clue why we chose 0 as the origin and numbered the remaining fields clockwise around it.
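
To make the table concrete, here is a minimal sketch of applying these permutations to boards encoded as 9-character strings (an encoding we will meet in the next section). The permutation tuples are copied from the table above; the helper names are illustrative, not the repo's actual API.

# Permutations of the symmetry group of the square, copied from the table
PERMUTATIONS = {
    "e":   (0, 1, 2, 3, 4, 5, 6, 7, 8),
    "r":   (0, 7, 8, 1, 2, 3, 4, 5, 6),
    "r2":  (0, 5, 6, 7, 8, 1, 2, 3, 4),
    "r3":  (0, 3, 4, 5, 6, 7, 8, 1, 2),
    "tx":  (0, 7, 6, 5, 4, 3, 2, 1, 8),
    "ty":  (0, 3, 2, 1, 8, 7, 6, 5, 4),
    "tAC": (0, 1, 8, 7, 6, 5, 4, 3, 2),
    "tBD": (0, 5, 4, 3, 2, 1, 8, 7, 6),
}

def apply_permutation(board: str, perm: tuple) -> str:
    """Move the symbol at field i to field perm[i] (index = origin, value = target)."""
    cells = [""] * 9
    for origin, target in enumerate(perm):
        cells[target] = board[origin]
    return "".join(cells)

def symmetry_class(board: str) -> set:
    """All boards equivalent to `board` under the symmetry group of the square."""
    return {apply_permutation(board, p) for p in PERMUTATIONS.values()}

print(symmetry_class("201000000"))  # contains "200010000" among others

Running it on the first board from the picture above confirms that the two boards really do fall into the same symmetry class.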

Ternary numbers

If we were interested in the game's evolution, i.e. how "O" and "X" were added turn by turn, we could denote the game as a series of cell numbers, e.g. 032 corresponds to 👇

Another handy piece in this tic-tac-toe algebra puzzle is ternary numbers. All of us are familiar with decimal, hexadecimal and binary numbers, but numbers with base 3 are less common. For the game's purposes they suit perfectly, as they can denote board fields with 0 for empty, 1 for "O" and 2 for "X". For example, the two boards mentioned before have the ternary representations 201000000 and 200010000, respectively.
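
As a quick illustration, and independent of the repo's Ternary class, converting between a board string and its decimal value takes only a few lines of Python (the helper names are made up for this sketch):

# A standalone sketch of the ternary board encoding:
# "0" = empty, "1" = "O", "2" = "X"

def board_to_decimal(board: str) -> int:
    """Interpret a 9-character board string as a base-3 number."""
    return int(board, 3)

def decimal_to_board(number: int) -> str:
    """Convert a decimal number back to a 9-character base-3 board string."""
    digits = ""
    for _ in range(9):
        digits = str(number % 3) + digits
        number //= 3
    return digits

print(board_to_decimal("201000000"))  # 13851
print(decimal_to_board(13851))        # "201000000"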

Number of possible boards

Part of the reason we did all the ☝️ algebra was to come up with the number of possible tic-tac-toe boards. We know that the theoretical upper bound on the number of possible games is 9! = 362 880. The factorial, however, also counts moves played after a game has already ended. Henry Bottomley came up with a nice calculation counting 255 168 possible games.

We know that we can do even better by removing all the symmetric boards. We can count them using some more advanced algebra: Burnside's lemma, without any further calculations, gives us a better upper bound of 2 862 2), or we can write a fairly short piece of code.

from tic_tac_toe.utils.permutations import get_all_boards

# Count reachable boards per number of moves, including finished games
symmetry_classes, all_boards = get_all_boards(exclude_winners=False)
counts = [len(sc) for sc in symmetry_classes]
counts_all = [len(b) for b in all_boards]
print(f"Unique boards: {counts}, sum: {sum(counts)}")
print(f"All boards: {counts_all}, sum: {sum(counts_all)}")

# The same counts with already-decided games excluded (used later by MENACE)
symmetry_classes_menace, all_boards_menace = get_all_boards(exclude_winners=True)
counts_menace = [len(sc) for sc in symmetry_classes_menace]
counts_all_menace = [len(b) for b in all_boards_menace]
print(f"Unique boards: {counts_menace}, sum: {sum(counts_menace)}")
print(f"All boards: {counts_all_menace}, sum: {sum(counts_all_menace)}")

This gives us 765 achievable positions up to symmetry, of which 138 are terminal (won or drawn) boards, regardless of the order of the players' moves and of transformations.

Putting it all together: we saw that some boards are equivalent up to a symmetry, and to find all symmetries of a given board we can use the permutations obtained from the symmetry group of the square. This observation saves us a few thousand boards, so we can concentrate on a few classes of boards. For easier notation, we can use the handy ternary numbers to denote the boards.

Game theory and minimax theorem

Tic-tac-toe is a perfect-information zero-sum game, which means that both players know, or at least can compute, the result of every possible move, and when one player wins the other loses, and vice versa. From the ☝️ computations we know that there are only 765 achievable boards, so it is not that hard to check them all and find the optimal strategy: an unbeatable player which cannot lose, only draw or win.

[Interactive demo] The computer uses the minimax algorithm and plays optimally. If several available moves share the same score, it chooses one action at random.

We would like to find moves which maximize our chances of winning and minimize the opponent's. In other words, we want to maximize reward and minimize harm in the game. Such an algorithm is called Minimax [6], based on the fundamental minimax theorem.

John von Neumann proved the minimax theorem in 1928, which is considered the starting point of game theory.

Minimax algorithm

In the simulation, when the computer player is on turn, it tries to choose the action with the highest value. We assume the opponent is also a perfect, rational player trying to win, i.e. always playing the best available action. Therefore, on the opponent's turns we consider the action with the minimum value.

Evolution of all possible boards from "020102112".

For every possible action on the board we check its value. If the computer wins by that action, we assign the value "10 − depth", where depth is the distance between the current board and the winning board. On the other hand, if the action leads to losing the game, we use "depth − 10" as the value. For all other scenarios we assign 0.

The figure above shows all possible turns from the "020102112" board and the value of each path. Since it is "X"'s turn, we are looking for the maximum possible value, which is +9. We can also clearly see that if player "X" does not play "220102112" (the middle board) and "O" is a perfect player, then "O" maximizes its own value and plays "12x1x2112". For further analysis, check the code below.

from typing import Dict, Tuple

import numpy as np

# Ternary and whois_winner come from the accompanying repo: Ternary wraps
# a base-3 board string (accessible via .number) and whois_winner returns
# the winner (1 or 2), 0 for a draw, or a negative number for an
# unfinished game.


def get_score(board: Ternary, depth: int, player: int = 2) -> Tuple[int, bool]:
    winner = whois_winner(board)
    if winner == player:
        return 10 - depth, True
    elif winner < 1:
        # 0 means a draw (the game ended), negative means the game goes on
        return 0, winner == 0
    else:
        return depth - 10, True


def minimax(board: Ternary, depth: int = 0, player: int = 2) -> Tuple[Ternary, int]:
    score, ended = get_score(board, depth, player)
    if score != 0 or ended:
        return board, score
    depth += 1
    scored_actions: Dict = {}
    # Empty fields are the available actions; whose turn it is follows
    # from the number of empty fields.
    actions = np.where(np.array(list(board.number)) == "0")[0]
    active_player = len(actions) % 2 + 1
    for a in actions:
        action_board = list(board.number)
        action_board[a] = str(active_player)
        new_board = "".join(action_board)
        _, scored_actions[new_board] = minimax(Ternary(new_board), depth, player)
    if player == active_player:
        # Our turn: pick (one of) the highest-valued successor boards
        max_value = max(scored_actions.values())
        max_boards = [b for b, v in scored_actions.items() if v == max_value]
        return Ternary(np.random.choice(max_boards)), max_value
    else:
        # Opponent's turn: assume they pick the lowest-valued board for us
        min_value = min(scored_actions.values())
        min_boards = [b for b, v in scored_actions.items() if v == min_value]
        return Ternary(np.random.choice(min_boards)), min_value

MENACE or a pile of matchboxes mastering the game

In the times when fast computers were not around, the concept of computation already excited many scientists. British researcher Donald Michie invented a very clever way to teach matchboxes filled with colourful beads to play noughts and crosses, which he named the "Matchbox Educable Noughts And Crosses Engine", or MENACE.

The matchbox machine MENACE built by D. Michie [1]

Each position on the board has its own colour, and for every possible board there is a matchbox; bear in mind that we consider only unique symmetry classes. The matchboxes for the first turn initially hold 4 beads for each possible action, each action being represented by the bead's colour. The matchboxes representing the third turn hold 3 beads of each colour, the fifth turn's 2 and the seventh turn's 1. A bead drawn at random decides what action to take at each turn. The opponent's move then determines the next matchbox, and the process is repeated until one of the players wins or the game is a draw. If the chosen beads led to winning the game, we add 3 beads of the chosen colour to each corresponding matchbox. If the game is a draw, we add 1 extra bead to each matchbox. If MENACE lost the game, we remove the chosen beads.

[Interactive demo] The fundamental idea of MENACE is to add beads which led to winning or drawing the game, and remove beads which lost it, hoping that the more games it plays, the closer to unbeatable it becomes.

If MENACE won the game, we add three extra beads of the winning action's colour to each matchbox used during the game. For a draw we add one extra bead, and if MENACE lost, each of the chosen beads is removed. The following chart shows a toy version of an average-reward plot, tracking the change in the number of beads in the first box. For the first few plays you can observe MENACE losing every game; after a while it learns not to lose and the reward goes up.
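
Below is a deliberately simplified sketch of this learning rule: one matchbox per raw board string, no symmetry reduction, and the bead counts described above. The Menace class and its method names are illustrative, not the actual implementation from the repo.

import random

# Initial beads per available action, keyed by MENACE's move number (0-based)
INITIAL_BEADS = {0: 4, 2: 3, 4: 2, 6: 1}


class Menace:
    def __init__(self):
        self.boxes = {}    # board string -> {action index: bead count}
        self.history = []  # (board, action) pairs played in the current game

    def choose(self, board: str, move_number: int) -> int:
        """Draw a bead at random from the matchbox for this board."""
        if board not in self.boxes:
            empty = [i for i, c in enumerate(board) if c == "0"]
            self.boxes[board] = {a: INITIAL_BEADS[move_number] for a in empty}
        beads = self.boxes[board]
        pool = [a for a, n in beads.items() for _ in range(n)]
        action = random.choice(pool)  # fails if the box is empty - see footnote 3
        self.history.append((board, action))
        return action

    def learn(self, result: str) -> None:
        """Reinforce every move of the finished game: win +3, draw +1, loss -1."""
        delta = {"win": 3, "draw": 1, "loss": -1}[result]
        for board, action in self.history:
            self.boxes[board][action] = max(0, self.boxes[board][action] + delta)
        self.history.clear()

After each finished game, the caller reports "win", "draw" or "loss", and every matchbox used in that game is reinforced or punished accordingly.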

Menace's matchbox brain

Each grid below represents one of the 304 matchboxes with colourful beads in it. Every available cell shows the number of beads for a given move. As the game is played, beads are added and removed in real time based on the game's result.

[Interactive visualisation: each 3 × 3 grid shows one of the 304 matchboxes; digits are the current bead counts for the available cells, while "X" and "O" mark the moves already played.]

Reinforcement learning

As in the MENACE case, the fundamental idea of reinforcement learning is to learn from experience by the carrot-and-stick method: moves which lead to a positive outcome are rewarded with a carrot, and negative ones are punished with a stick.

One of MENACE's problems was that it could run out of memory: some matchboxes could lose all their beads, leaving MENACE unable to move 3). Learning from MENACE's discrete idea, we can develop a continuous realisation in probability space which avoids this quite likely scenario of running out of actions.

Reward and value function

Some moves obviously end the game, and we can assign them their final rewards. However, for the majority of turns we don't know which moves tend to win the game and which to lose or draw. Because of that, we will distinguish between the reward, which gives an immediate result, and the value function, which measures the potential of future rewards.

As we play, win and lose, we would like to back-propagate rewards from the last moves to the earlier stages of the game. If an agent wins the game, we reward it with 1; for a loss or a draw we assign no points. To do so, we define a value function V(·) which assigns a potential to every board state St. After each game we retrospectively update the values using this formula. [3]

$V(S_t) \leftarrow V(S_t) + \alpha \left[ V(S_{t+1}) - V(S_t) \right]$

Each update moves the value of the current state towards the value of the next state (or towards the reward, for final states), scaled by a learning rate α. Initially, we assign 1 to all winning states, 0 to all losing ones, and 0.5 to all other intermediate states.
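
As a sketch of how such an update could look in code, assume we store the values in a dictionary keyed by board states; the table V and the helper name are assumptions of this example, not the repo's API.

# Tabular value update for one finished game.
# game_states: the sequence of boards the agent saw, latest last.
def update_values(V: dict, game_states: list, final_reward: float,
                  alpha: float = 0.1) -> None:
    V[game_states[-1]] = final_reward  # 1 for a win, 0 for a loss or draw
    # Walk backwards, pulling each state's value towards its successor's
    for s, s_next in zip(reversed(game_states[:-1]), reversed(game_states[1:])):
        V.setdefault(s, 0.5)           # intermediate states start at 0.5
        V[s] += alpha * (V[s_next] - V[s])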

Exploration vs exploitation

If we played purely based on the value function, we would likely end up playing the same moves over and over. Therefore, we introduce the concept of exploration vs exploitation. In the exploration case, we pick an action at random, regardless of its value. In the exploitation case, we pick the action with the highest value. We alternate the two at random, exploring with probability ε, which is why this method is called an ε-greedy policy.
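
A minimal sketch of ε-greedy action selection over candidate successor boards, reusing the hypothetical value table V from the previous sketch:

import random

def epsilon_greedy(V: dict, candidate_boards: list, epsilon: float = 0.1) -> str:
    """With probability epsilon explore (random move), otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(candidate_boards)                  # exploration
    return max(candidate_boards, key=lambda b: V.get(b, 0.5))   # exploitation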

Conclusion

"A reason for being interested in games is that they provide a microcosm of intellectual activity in general. Those thought processes which we regard as being specifically human accomplishments—learning from experience, inductive reasoning, argument by analogy, the formation and testing of new hypotheses, and so on — are brought into play even by simple games of mental skill.", said Donald Michie [1]. Games are a specific area of intellectual stimulus which could be computationally solved, such as the example of tic-tac-toe. Therefore, it is worth exploring and playing games using colourful approaches for reaching almost unbeatable perfect player.

Additionally, it is spectacular how a simple game, which every seven-year-old child learns how to play and not to lose, can attract so much human interest across different branches of mathematics and computer science.

You can find the Python and JavaScript code for each method in the GitHub repo.

References:

  1. Michie, D. (1963), "Experiments on the mechanization of game-learning Part I. Characterization of the model and its parameters", Link
  2. Michie, D. (1961), "Trial and error", Penguin Science Survey.
  3. Sutton, R. S. & Barto, A. G. (2018), "Reinforcement Learning: An Introduction", The MIT Press.
  4. Mitchell, M. (2019), "Artificial Intelligence: A Guide for Thinking Humans", Pelican Books.
  5. Scroggs, M. (2015), "MENACE: Machine Educable Noughts And Crosses Engine", Blog post & Game.
  6. Luce, R. D. & Raiffa, H. (1957), "Games and Decisions: Introduction and Critical Survey", New York: Wiley.
  1) In cycle form we write each chain of elements that map to one another in sequence, and omit all fixed points (elements which map to themselves).
  2) If we re-frame the lemma as counting the possible colourings of a 3 × 3 grid by three colours, we get: $\frac{1}{|D_4|} \sum_{g \in G} |X^g| = \frac{1}{8}\left(3^9 + 2 \cdot 3^3 + 3^5 + 4 \cdot 3^6\right) = 2862$. (The identity fixes all $3^9$ colourings; the 90° and 270° rotations split the cells into three cycles each, fixing $3^3$ colourings; the 180° rotation yields five cycles, fixing $3^5$; and each of the four reflections yields six cycles, fixing $3^6$.)
    Unfortunately, this says nothing about whether such a board can occur in a real game (e.g. a board where all fields are "X" is still counted), but it gives us a better upper bound.
  3) Actually, there is roughly a 1-in-5 chance that MENACE runs out of beads. One possible solution is to add more beads initially: 8 for the first turn, 4 for the third, 2 for the fifth and 1 for the seventh.