Skill v luck in determining backgammon winners

Getting backgammon data out of XG-Gammon

Backgammon is a game that combines chance and skill, and everyone who comes across it asks “how much is luck, and how much skill?”. The answer of course is “it depends”. Consider that for two exactly equally skilled players, the result appears to be 100% luck and the best forecast for a result is a coin flip. For a complete mismatch - a master playing a beginner - the master will win about 75% of one point matches, and nearly 100% of matches to 11 or more (the longer the match, the more chance for skill to emerge as dominant) - see my earlier post on those odds.

Since the rise of robot players that used machine learning neural networks to identify winning strategies, we’ve had a new way to quantify exactly how much skill and how much chance. Each decision by a player can be compared with an optimal play that the computer would have made, and the computer can estimate exactly how much equity in the final result you just gave up by playing differently to how it would have done. If you give up 1, you just moved from being certain to win to certain to lose; in practice losing 0.080 of your equity in one turn is usually defined as a “blunder”, and 0.020 as an “error”, whereas gaining 0.300 of equity from a lucky dice roll is a common threshold for a “joker”. Here’s a post from ancient history (ie 2005) on the origin of the term “blunder” in backgammon - apparently it only goes back to 1998, who knew?
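The thresholds above can be sketched as a small R helper. This is only an illustration: the function names `classify_decision` and `is_joker` are made up for this example, and I am assuming a loss exactly on a threshold counts as belonging to the more serious category.

```r
# Hypothetical helpers (not part of XG-Gammon) applying the common thresholds:
# 0.020 equity lost = "error", 0.080 = "blunder", 0.300 gained by luck = "joker".
classify_decision <- function(equity_lost, error_at = 0.020, blunder_at = 0.080) {
  # right = FALSE makes the intervals [low, high), so a loss of exactly
  # 0.080 is labelled a blunder (an assumption, see the text above)
  cut(equity_lost,
      breaks = c(-Inf, error_at, blunder_at, Inf),
      labels = c("ok", "error", "blunder"),
      right  = FALSE)
}

is_joker <- function(equity_gained, joker_at = 0.300) {
  equity_gained >= joker_at
}

classify_decision(c(0.005, 0.020, 0.096))
is_joker(c(0.10, 0.35))
```

The cut-offs are arguments rather than constants because different analysis tools and players use slightly different conventions.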

Of course, while you’re playing a human under match conditions, you don’t know that you just gave up (for example) 0.096 of your equity, although sometimes you have a pretty good idea from a nagging feeling of “I really shouldn’t have done that…”. We quantify our errors in post-match analysis (you can play a computer in tutor mode too, and that’s the best way to train, but not the subject of today’s post).

As well as being the best backgammon players on the planet, XG-Gammon and other software such as GNU Backgammon and Snowie are used by many backgammon players to analyse their own matches, to identify areas for improvement and (let’s face it) to find out if the opponent really was as lucky as we thought they were. Online sites like FIBS (the First Internet Backgammon Server - open, free for all, lots of swearing, a bit clunky but a great environment) and grid.gammon (invitation only, more players and more of them are very serious and skilled) let you save games you’ve just played in formats that can be opened and analysed for luck and mistakes by the various backgammon-playing software. If you save the analytical results to player profiles, you build up a database of games and matches. Against humans, I play mostly matches (eg first to X points, where common values of X are 3, 5, 7 and 11) and in this post I’m analysing match-level data; that is, ultimately I am using a spreadsheet generated by XG-Gammon where there is one row for each match I’ve played, with 54 columns of interesting data about that match.

To get this data out of XG-Gammon:

choose “players”
“see profile results”
From the “Results” menu, choose “Copy to clipboard”
“sessions list” (not sessions list (Match Play) - for me, this just freezes my system)
open Excel or equivalent and paste it in.
save it as a CSV, close Excel and load the data into a real analytical environment like R. Actually, this is me being snobbish - any backgammon players reading this who don’t want to use specialist analytical software like R, you can do a lot with Excel and I encourage you to have a go.

I keep my on-line profile, where I play matches against real opponents, separate from any practice I do against XG-Gammon in tutor mode, so when I do the above steps I get the genuine record of how well I’ve played. For this post, I’m analysing approximately 650 matches, which is the number I’ve played and recorded since I started using XG-Gammon as my main backgammon analytical tool (I try to record all, and probably do 99%, of matches I play on FIBS and grid.gammon).

Blunders and jokers

OK, let’s have a first look at some data. One of those columns is the number of blunders I made in each match and here’s a histogram showing the count of matches for which I made different numbers of blunders:

[Figure: histogram of blunders per match]

All the code that created graphics is at the bottom of the post.

My most common number of blunders in a single match is 4, followed by 2 and 5. In one nightmare I committed more than 40 blunders - quite an achievement!

Another item of interest is the number of jokers - dice rolls that materially changed the odds of success. Think of the double six you roll when it was your only chance:

[Figure: histogram of jokers per match]

It seems I most frequently get 4 or 5 really lucky rolls per match.

Backgammon matches are of different numbers of moves, of course, and the longer they are the more opportunities for both blunders and jokers. Apart from the fact that this dataset includes matches for first-to-11 down to first-to-1, sometimes things just whizz by, and sometimes the to-and-fro can go on forever. Better than the total number of blunders and jokers is the number of such events per move. Looking at the density of blunders and jokers per move, I’m pleased to see that on average my lucky jokers are more frequent than my blunders:

[Figure: density of blunders per move and jokers per move]

Elo rating

One thing people sometimes wonder about is the relationship between equity loss per decision and Elo rating. There can be no fundamental permanent relationship. Elo rating is basically a relativistic statement of how good you are (or rather, how well you’ve gone) compared to the people you play against; equity loss per decision is an absolute comparison of your decision making to a robot powered by a neural network.

Imagine a backgammon competition, with the same rules as FIBS, populated only by bots. Under current FIBS conditions, these bots have ratings around 2100 compared to the starting point that all new players get of 1500 (chess players may think 2100 is surprisingly low - that’s because the element of chance in backgammon means that the worst player in the world still has a chance against the best, which puts a virtual ceiling on how high the Elo ratings can get). But in our new world after we reboot FIBS only for expert bot players, the bots will only be able to play each other, and their Elo ratings will all hover around their starting point of 1500. Not exactly 1500, there will be a range, but it will be an illusion created by random chance. See my earlier post on Elo ratings for more discussion of how actual Elo ratings are volatile around their ‘true’ value. So the 2100 ratings the bots have in our current world are arbitrarily linked to the errors of the people they happen to be playing.
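The bots-only thought experiment can be sketched as a small simulation in R. It rests on assumptions that are not from this post: the FIBS rating formula as commonly documented (upset probability 1 / (10^(D * sqrt(n) / 2000) + 1) for rating difference D and match length n, with the winner gaining 4 * sqrt(n) times the probability that the loser would have won, for experienced players), plus an arbitrary pool size, match length and number of matches.

```r
# Sketch only: a pool of identical bots, so every match is a coin flip.
# The rating formula is an assumption based on the commonly documented FIBS rules.
set.seed(123)
n_bots    <- 20      # illustrative pool size
n_matches <- 5000    # illustrative number of matches played in the pool
match_len <- 5       # every match is first to 5 points
ratings   <- rep(1500, n_bots)

for (i in seq_len(n_matches)) {
  pair   <- sample(n_bots, 2)              # two distinct bots meet
  winner <- sample(pair, 1)                # equal skill: pure chance
  loser  <- setdiff(pair, winner)

  d       <- abs(ratings[winner] - ratings[loser])
  p_upset <- 1 / (10^(d * sqrt(match_len) / 2000) + 1)

  # the winner gains more if it was the lower-rated player (an upset)
  upset  <- ratings[winner] < ratings[loser]
  change <- 4 * sqrt(match_len) * if (upset) 1 - p_upset else p_upset

  ratings[winner] <- ratings[winner] + change
  ratings[loser]  <- ratings[loser] - change
}

mean(ratings)          # points only move between bots, so the average stays at 1500
round(range(ratings))  # but individual ratings drift apart through luck alone
```

Because every gain is matched by an equal loss, the pool average cannot move from 1500; any spread in the final ratings is noise, not a difference in skill.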

There’s enough stability in player standards that we know roughly how good a player with a FIBS rating of 1800 is, and XG-Gammon at some point in its life worked out a rule of thumb (ie statistical model) that mapped from the equity given up per decision to the Elo rating you’d expect an online player who consistently played at that level to end up with. From the data we’ve just extracted from XG-Gammon, it’s possible to reverse engineer that relationship. The data we got from our player profile included a column for “Eq per decision” and one for “Elo Level”. The first of these is XG-Gammon’s estimate of how much equity on average I gave up when I had a choice to make (ie excluding times I had only one or no legal moves); the second is its estimate of what my Elo level should be given my decision-making in an individual match. Here’s the relationship:

[Figure: Elo level estimated by XG-Gammon against equity change per decision]

As well as showing the equity given up per decision, I’ve annotated it with XG “Player Rating” (PR), which is the most commonly recognisable assessment of the actual quality of a player’s decisions (as opposed to their Elo rating, which depends on luck and the competition). The lower a PR the better; world class players have PRs consistently below 5.

So we can see that if you consistently play with a PR of 10 - which qualifies you to be ‘advanced’ in most people’s books - you should expect eventually an Elo rating of a bit over 1700. This seems a lowish Elo rating to me compared to what I intuit are standards on FIBS (my average PR hovers around 11 and Elo rating on FIBS currently at 1800 whereas XG thinks I should be only 1663), but it will be different in any different pool of players and without more data there’s no point in arguing. Probably XG’s idea of Elo rating is relative to the pool on grid.gammon, which I think has a higher standard of play than FIBS.

Note - I couldn’t find PR defined in the XG manual (I imagine it’s there but I didn’t spend long looking), but according to this authoritative-sounding forum post the XG Player Rating is just the equity error per decision multiplied by 500. So if you give up 0.01 (or 1%) of your equity each decision on average, you come up with a PR of 5, which is excellent and qualifies you as a ‘Master’. By definition (at least when it is XG that does the rating), XG’s PR is 0.
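That conversion is simple enough to write down as a pair of R functions. The names `equity_to_pr` and `pr_to_equity` are invented for this sketch, and the factor of 500 is taken from the forum post rather than from any official XG documentation.

```r
# Hypothetical helpers for the rule of thumb PR = (equity error per decision) * 500.
# Equity per decision is recorded as a negative number when equity is lost,
# so abs() is used to get a positive PR.
equity_to_pr <- function(eq_per_decision) abs(eq_per_decision) * 500
pr_to_equity <- function(pr) -pr / 500

equity_to_pr(-0.01)  # giving up 0.01 per decision is a PR of 5
pr_to_equity(10)     # a PR of 10 means giving up 0.02 per decision
```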

Skill versus luck

Another column of interest in the data obtained from XG is the “Cost luck” column. This contains XG’s estimate of how much the rolls of the dice cost you in total. It’s all very well if you’re a ‘Master’ to keep your error rate down to only give up 0.01 of your equity each turn, but what about those jokers and anti-jokers that push it 0.3 or more in either direction at a single roll of the dice? An obvious thing to do is to compare the “Cost luck” column with the equity per turn / PR data, and colour code each point on the resulting scatter plot by whether it was a win or a loss. The result is illuminating:

[Figure: equity gained through luck per move against equity change per decision, coloured by match result]

Very few of the red triangles denoting my wins happened when the dice were against me (ie luck < 0), no matter how well I played. And only one blue triangle (denoting a match loss) happened when I had good luck (luck > 0). Yup, the luck alone is all you need to divide matches into two very obvious clusters - those where you were lucky, and those where you weren’t. It’s a cruel game, and a fascinating one - this luck means that you have to play a lot of games before your skill level starts determining your average success rate, and at the end of the game you don’t know (without a bot to help you) whether it was your fault or not…

There are two players in backgammon, so better than looking at just my own skill rating is to examine the difference between my error rate and my opponent’s. Here’s the image of that:

[Figure: equity gained through luck per move against net equity change per decision, coloured by match result]

In fact, doing some statistical modelling we find some results that confirm the visuals:

predicting the result of a match with just luck as an explanatory variable gets it right 97.9% of the time
predicting the result with just net error rate as an explanatory variable gets it right only 65.0% of the time (remember, guessing at random would give you 50%)
predicting the result using both luck and error rate as explanatory variables gets it right 99.7% of the time.
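For readers who want to see what "gets it right X% of the time" means mechanically, here is a minimal base R sketch of cross-validated accuracy for a logistic regression. It uses simulated data, not my XG-Gammon export, so the numbers it produces say nothing about real matches; the effect sizes (4 for luck, 0.6 for skill) and the helper name `cv_accuracy` are assumptions chosen purely for illustration.

```r
# Sketch only: simulated matches where luck matters much more than skill.
set.seed(42)
n   <- 500
sim <- data.frame(luck = rnorm(n), net_skill = rnorm(n))
sim$win <- rbinom(n, 1, plogis(4 * sim$luck + 0.6 * sim$net_skill))

# k-fold cross-validation: fit on k - 1 folds, predict the held-out fold,
# and report the share of held-out matches whose result was predicted correctly.
# Assumes the response column is called `win` and coded 0 / 1.
cv_accuracy <- function(formula, data, k = 10) {
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  hits <- logical(nrow(data))
  for (i in seq_len(k)) {
    fit  <- glm(formula, family = binomial, data = data[fold != i, ])
    prob <- predict(fit, newdata = data[fold == i, ], type = "response")
    hits[fold == i] <- (prob > 0.5) == (data$win[fold == i] == 1)
  }
  mean(hits)
}

acc_luck  <- cv_accuracy(win ~ luck, sim)
acc_skill <- cv_accuracy(win ~ net_skill, sim)
acc_both  <- cv_accuracy(win ~ luck + net_skill, sim)
c(luck = acc_luck, skill = acc_skill, both = acc_both)
```

The caret code at the bottom of the post does the same job with less typing; the hand-rolled loop is only there to show what is being counted.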

Conclusions

Skill matters in backgammon, but when you play people at similar skill levels the games become more and more like a coin flip. Good playing can overcome moderately bad luck in an individual match, but mostly it doesn’t. You have to play a lot of matches for the skill to become important.

Code

All the data management and analysis was done in R.

#---------load up functionality and fonts------------
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(showtext)
library(directlabels) # see http://stackoverflow.com/questions/13627735/stat-contour-with-data-labels-on-lines
library(caret)        # for cross-validation
library(RColorBrewer)

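# font_add_google() and showtext_auto() are the current names for what used to be
# font.add.google() and showtext.auto() in the sysfonts / showtext packages
font_add_google("Poppins", "myfont")
showtext_auto()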
theme_set(theme_light(base_family = "myfont"))

pal <- brewer.pal(7, "Set1")[ 1:2]

#-----------import data----------------
# Open XG-Gammon, choose "players", "see profile results", "Results",
# "Copy to clipboard", "sessions list".  Paste result into Excel and save as a csv.
# xg_orig <- read_csv("../data/xg-export.csv")
# or you can use one I've prepared earlier, just with the column of opponent names deleted:
xg_orig <- read_csv("https://raw.githubusercontent.com/ellisp/ellisp.github.io/source/data/xg-export.csv")
names(xg_orig) <- gsub(" ", "_", names(xg_orig), fixed = TRUE)
xg <- xg_orig %>%
   filter(Match_Length <= 15) # knock out a 99999 outlier

ggplot(xg, aes(x = Blunders)) + 
   geom_histogram(binwidth = 1) +
   labs(x = "Blunders per match",
        y = "Number of matches")

ggplot(xg, aes(x = Jokers)) + 
   geom_histogram(binwidth = 1) +
   labs(x = "Jokers per match",
        y = "Number of matches")

p <- xg %>%
   mutate(bm = Blunders / Moves,
          jm = Jokers / Moves) %>%
   select(bm, jm) %>%
   gather(variable, value) %>%
   mutate(variable = ifelse(variable == "bm", "        Blunders per move", "  Jokers per move")) %>%
   ggplot(aes(x = value, colour = variable)) +
   geom_density() +
   labs(x = "Events per move")

direct.label(p)

#--------equity v Elo-----------
# PR = -(equity error per decision) * 500
# http://www.bgonline.org/forums/webbbs_config.pl?noframes;read=53424
# therefore e =  -PR / 500

# make a data frame I'll use for converting equity per decision to Player Rating
# (PR) on various plots
pr_steps <- data.frame(PR = seq(from = 0, to = 40, by = 5)) %>%
   mutate(Eq_per_Decision = - PR / 500,
          PR = ifelse(PR == 40, "PR:", as.character(PR)))


# this plot shows that the "Elo_Level" is estimated by XG-Gammon based on error rate
ggplot(xg, aes(x = Eq_per_Decision, y = Elo_Level)) +
   geom_point() +
   scale_y_continuous(limits=c(670, 2100), 
                      breaks = seq(from = 800, to = 2000, by = 200)) +
   geom_text(data = pr_steps, y = 700, aes(label = PR), 
             colour = "grey50", family = "myfont") +
   labs(x = "Equity change per decision (negative means lost equity)",
        y = "Equivalent Elo level estimated by XG-Gammon",
        title = "XG-Gammon estimate of how equity per decision\nshould be related to Elo rating")


#-------------------skill v luck v winning------------------------
xg <- xg %>%
   mutate(Cost_luck_per_move = Cost_luck / Moves)

xg %>%
   mutate(Result = factor(ifelse(Result == 0, "Loss", "Win"),
                          levels = c("Win", "Loss"))) %>%
   ggplot(aes(x = Eq_per_Decision, y = Cost_luck_per_move, colour = Result)) +
   geom_point(aes(size = Match_Length), shape = 2) +
   geom_text(data = pr_steps, y = -1.3, aes(label = PR), 
             colour = "grey50", family = "myfont") +
   scale_radius("Match\nlength", breaks = c(1, 3, 5, 7, 9, 11)) +
   scale_colour_manual(values = pal) +
   labs(y = "Match equity gained through luck, per move\n",
        x = "Skill: equity change per decision\nnegative means equity lost through poor choices",
        title = "Luck is more important than skill\nin any single backgammon match")



xg_with_skill <- xg %>%
   mutate(Result = factor(ifelse(Result == 0, "Loss", "Win"),
                          levels = c("Win", "Loss")),
          Opp_Eq_per_Decision = Opp_Eq_per_move * Opp_Roll / Opp_Decisions, 
          net_skill = Eq_per_Decision - Opp_Eq_per_Decision ) 

xg_with_skill %>%
   ggplot(aes(x = net_skill, y = Cost_luck_per_move, colour = Result)) +
   geom_vline(xintercept = 0, colour = "grey45") +
   geom_hline(yintercept = 0, colour = "grey45") +
   geom_point(aes(size = Match_Length), shape = 2, alpha = 0.9) +
   scale_radius("Match\nlength", breaks = c(1, 3, 5, 7, 9, 11)) +
   scale_colour_manual(values = pal) +
   labs(y = "Match equity gained through luck, per move\n",
        x = "Skill: net equity change per decision from both players' decisions\nnegative means net equity loss through poor choices",
        title = "Luck is more important than skill\nin any single backgammon match") +
   annotate("text", x = 0.3, y = 0.04, label = "More skilled\nand luckier", 
            family = "myfont", colour = "grey40") +
   annotate("text", x = -0.05, y = -0.085, label = "Less skilled\nand unluckier", 
            family = "myfont", colour = "grey40") +
   
   annotate("text", x = 0.2, y = -0.01, label = "More skilled and\novercame bad luck",
            family = "myfont", colour = pal[1]) +
   annotate("segment", x = 0.14, xend = 0.09, y = -0.01, yend = -0.01,
            colour = pal[1], arrow = arrow(angle = 20)) +

   annotate("text", x = 0.17, y = -0.085, label = "More skilled\nbut luck won out",
            family = "myfont", colour = pal[2]) +
   annotate("segment", x = 0.12, xend = 0.04, y = -0.085, yend = -0.085,
            colour = pal[2], arrow = arrow(angle = 20))

#---------------------------luck v skill in modelling----------
ctrl <- trainControl(method = "repeatedcv", number = 10, savePredictions = TRUE)

mod1 <- train(Result ~ Cost_luck_per_move, method = "glm", family = "binomial",
              data = xg_with_skill, trControl = ctrl)
mod1 # about 98% accuracy (varies slightly between runs because the folds are random)

mod2 <- train(Result ~ net_skill, method = "glm", family = "binomial",
              data = xg_with_skill, trControl = ctrl)
mod2 # 65.0%


mod3 <- train(Result ~ net_skill + Cost_luck_per_move, method = "glm", family = "binomial",
              data = xg_with_skill, trControl = ctrl)
mod3 # 99.7%