A TWO DIMENSIONAL MEASURE OF PERFORMANCE IN CHESS1
JAMAL MUNSHI
ABSTRACT: A two dimensional measure of chess performance is presented and it is shown to have more
information and a greater power of discrimination than scalar measures because it does not rely on scoring.
The scoring procedure used by scalar measures to reduce chess game outcomes from a trinomial to a
binomial variable causes some chess game outcome information to be irretrievably lost 2.
1. SUMMARY
A two dimensional measure of chess player performance is true to the trinomial nature of chess game
outcomes. It does not suffer from the inadequacies of scalar measures of performance described in a
previous paper (Munshi, 2014). Its use is demonstrated by measuring the performance of forty well
known chess players listed by the World Chess Federation as the world's best in August of 2014 (FIDE,
2014).
The two dimensions of chess performance are identified as the success rate when playing white and the
success rate when playing black. The sample data are taken from publicly available chess game databases.
Numerical methods are used to create a simulated sampling distribution from the sample data so that the
uncertainty in the performance measure may be assessed. Large sample sizes are needed to gain precision
and they are achieved by comparing each chess player against his combined opposition. The usual
procedure of making a pair-wise comparison between two chess players at a time involves the use of
small sample sizes and an unacceptable level of uncertainty3.
The proposed two dimensional measure of performance is compared with a scalar measure in terms of
information content and discrimination power. It is found that the two dimensional measure has more
power in detecting differences in performance than scalar measures. The results imply that the two
dimensional measure of performance contains more information. This finding can be explained in terms
of the loss of information incurred when chess game outcomes are converted from a trinomial to a
binomial variable by the use of scoring (Munshi, 2014). Since the information loss occurs at the point of
scoring, the lost information cannot be recovered downstream no matter how sophisticated the
mathematics.
1 Originally posted in August 2014, revised September 2014, data entry errors corrected November 2014.
2 Date: November 2014
Key words and phrases: chess, performance, rating, Elo, FIDE, playing strength, trinomial, statistics, numerical
methods, Studentized Euclidean distance, probability vector, Monte Carlo simulation, bootstrap, uncertainty
Author affiliation: Professor Emeritus, Sonoma State University, Rohnert Park, CA, 94928
[email protected], http://ssrn.com/author=2220942
3 An example of the high level of uncertainty in small samples is presented in the Appendix.
A TWO DIMENSIONAL MEASURE OF PERFORMANCE IN CHESS, JAMAL MUNSHI, 2014
2. THEORY
Chess game outcomes are determined by a trinomial stochastic process driven by an unobservable
probability vector that can be described in terms of the color of the pieces, or in terms of the identity of
the players as shown in Equations 1 and 2.
Equation 1     Colors of the pieces:        π(w,b) = π[pw, pb, pd]
Equation 2     Identity of the players:     π(x,y) = π[px, py, pd]
In these equations π represents a trinomial probability vector. In Equation 1, pw is the probability that
white will win, pb is the probability that black will win, and pd is the probability that the game will end in
draw. In Equation 2, px is the probability that player X will win, py is the probability that player Y will
win, and pd is the probability that the game will end in draw.
Equation 3     pw + pb + pd = px + py + pd = 1
The three component probabilities in each vector are subject to the constraint that they must add up to
unity because it is assumed that there is a 100% probability that one of the three states will occur4. This
constraint implies that when any two of the component probabilities are assigned values, the third
component probability is determined by subtraction. For example, in Equation 2, if we know that px = 20%
and pd = 45%, the vector is completely specified because py is determined by subtraction as
py = 100% - 20% - 45% = 35%. In other words, the probability vector that determines chess game outcomes has two degrees of
freedom (Munshi, The Relative Playing Strength of Chess Players, 2014). We don't know what the
probability vector is until we have exactly two pieces of information about it.
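The constraint in Equation 3 can be illustrated with a minimal Python sketch. The helper name complete_vector is hypothetical and is not part of the original study; it simply fills in the third component from any two, using the worked example above.

```python
def complete_vector(p_win_x: float, p_draw: float) -> tuple[float, float, float]:
    """Return (px, py, pd) given px and pd, using px + py + pd = 1 (Equation 3)."""
    p_win_y = 1.0 - p_win_x - p_draw
    if min(p_win_x, p_draw, p_win_y) < 0.0:
        raise ValueError("component probabilities must be non-negative and sum to 1")
    return (p_win_x, p_win_y, p_draw)


# Worked example from the text: px = 20% and pd = 45%
px, py, pd = complete_vector(0.20, 0.45)
print(round(py, 2))  # 0.35
```

Two inputs are enough to fix the whole vector, which is what is meant by two degrees of freedom.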
The values of the component probabilities of π are determined by many different factors and these
include the first move advantage enjoyed by white, the opening employed, the aggressiveness of the
playing style of each player, the general level of the move imperfection rate, and the difference in playing
strength between the two players (Munshi, A method for comparing chess openings, 2014). The move
imperfection rate is a significant consideration in chess games played by humans even at very high levels
of play and therefore large sample sizes are necessary for detecting the effects of the other factors
(Munshi, Pairwise comparison of chess opening variations, 2014).
2.1 A two dimensional measure of performance. Chess performance is a relative measure. We
may measure the performance of any player against another player or against a set of players and that
measure would only tell us how good the player is relative to his opponent(s) in the sample. Whatever
measure we use for this purpose, we know that it needs to have two dimensions in order to have two
degrees of freedom. For a measure of chess performance we must choose two independent dimensions
from a set of many possibilities.
Because of white's first move advantage, it is useful to separate games played as white from games played
as black and use these two independent dimensions together to assess a chess player's performance record
(Munshi, Comparing Chess Openings Part 3, 2014). Cartesian coordinates with x = percent of games won
as white (WAW) and y = percent of games won as black (WAB) are used to plot these values for player-P
and opponent(s)-O for comparison. We assume that the number of games played as white and the
number of games played as black are equal, and that if they are not equal, they are close enough to use the
average as the common sample size. Once the two points are placed on the graph, we can measure their
Euclidean distance as dP = the distance of the Player from the Opposition. This distance serves as our
relative measure of performance for the Player against his Opponent(s). Both magnitude and direction of
the distance vector are important in making the comparison.
4 We assume that the game will not be interrupted or changed midstream.
For the hypothesis test, our null hypothesis is that the two outcomes being compared could have been
generated by the same underlying probability vector, that is, πP= πO, and that the observed difference in
the sample was caused by sampling variation. The testable implication of this hypothesis is that the
magnitude of the distance in the population of all possible games from which the sample was taken is
zero. Designating the magnitude of the population distance as δ we write our hypotheses as shown below.
Null hypothesis:        H0: πP = πO     Testable implication: δ = 0
Alternate hypothesis:   Ha: πP ≠ πO     Testable implication: δ > 0
Both π and δ are unobservable. What we observe is a sample distance d, and we determine from the sampling
distribution of d whether a distance ≥ d could be observed by way of sampling variation if samples were
taken from a population in which the null hypothesis is true. If so, we conclude that there is no measurable
difference in performance between player-P and opponents-O. However, if the observed value of d is too
large to be explained by sampling variation we may conclude that δ > 0 and therefore that πP ≠ πO.
It is useful to Studentize5 the distance d between each player and his combined opposition. The division
by the standard deviation changes the distance measured in "games" to a dimensionless number which
may be thought of as the number of standard deviations.
Both the magnitude and direction of the distance vector are considered in making the comparisons. If the
magnitude is too small to reject H0, the direction is not important because we accept the hypothesis that δ
could be zero. However, if we reject H0, we must determine whether the direction of the distance vector
pointing from the Opposition to the Player is positive, negative, or neutral.
Vectors lying in the first quadrant, perhaps below6 but not far from the 45-degree line, may be considered
to be in a positive direction indicating that the Player performed better than the Opposition. Similarly, if
the distance vector is in the third quadrant, perhaps above but not far from the 225-degree line, it
represents a negative direction indicating that the Player performed worse than his combined opposition.
The second and fourth quadrants represent a neutral direction, in which case even when δ > 0, there may be
no real difference in performance between the player and the opposition.
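The quadrant rule described above can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the text leaves "not far from" the 45-degree line unquantified, so the sketch classifies by quadrant alone. The names quadrant, angle_degrees and QUADRANT_MEANING are hypothetical, and the sample differences are invented for the example.

```python
import math

# Interpretation of each quadrant as stated in the text (quadrant-only simplification).
QUADRANT_MEANING = {1: "positive", 2: "neutral", 3: "negative", 4: "neutral"}


def quadrant(dx: float, dy: float) -> int:
    """Return the quadrant (1 to 4) of the vector (dx, dy), or 0 if it lies on an axis."""
    if dx == 0 or dy == 0:
        return 0
    if dx > 0:
        return 1 if dy > 0 else 4
    return 2 if dy > 0 else 3


def angle_degrees(dx: float, dy: float) -> float:
    """Angle of the vector pointing from the Opposition to the Player, in degrees."""
    return math.degrees(math.atan2(dy, dx))


# Hypothetical differences (Player minus Opposition) in the white and black win rates
dx, dy = 0.40, 0.25
q = quadrant(dx, dy)
print(q, QUADRANT_MEANING.get(q, "undefined"), round(angle_degrees(dx, dy), 1))
```

The direction is only interpreted after H0 has been rejected, so a function like this would be applied to the players whose Studentized distance exceeds the critical value.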
5 The Studentized Euclidean distance is the computed distance divided by the standard deviation of the sampling
distribution that was estimated from sample data.
6 The expected value of the ideal angle is less than 45 degrees because of white's first move advantage.
3. METHODOLOGY AND DATA
A sufficiently large number of chess players with relatively low move imperfection rates is required in
our sample to make it possible for us to compare different methods of measuring performance.
Accordingly, the top forty chess players in the world in August of 2014 (FIDE, 2014) are selected for
performance comparison and their playing records from January 2009 to August 2014 are taken from an
online database of chess games (chessgames.com, 2014). The sampling period was chosen so that the data
would be relatively current and yet provide sample sizes that are large enough to detect differences in
performance.
3.1 The data. The selected players and their game statistics are shown in Table 1. The identity of the
players and their FIDE ratings and rankings are irrelevant to the purpose of the study and they have
therefore been removed from the dataset. The intent of this paper is only to compare different methods of
performance measurement rather than to comment on the specific personalities involved. Hypothetical
synthetic data could have been used to make that comparison, but actual game data from the top chess
players offer a greater sense of realism and credibility. Real data are preferred on this basis.
The variables in Table 1 are as follows:
Player  = A code by which this player will be identified throughout this study.
PAW     = Number of games this player played as white
WAW     = Number of games this player won as white
LAW     = Number of games this player lost as white = Number of games his combined opposition won as black
PAB     = Number of games this player played as black
WAB     = Number of games this player won as black
LAB     = Number of games this player lost as black = Number of games his combined opposition won as white
Avg n7  = The average sample size = (PAW+PAB)/2
We use these data to compare a two dimensional measure of performance with a scalar measure in terms
of discrimination power, information content, and uncertainty. The two measures of performance are used
to compare each player with his combined opposition in the sample.
3.2 Two-dimensional measure. The trinomial nature of chess game outcomes contains two degrees
of freedom and requires measures of performance to have two dimensions. In the two dimensional
measure of performance used in this study, we plot the player and his opposition in two dimensional
Cartesian coordinates with x = percentage of games won playing white and y = percentage of games won
playing black and compute the Euclidean distance between these points. We then interpret the magnitude
and direction of the distance vector as a measure of playing performance.
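A minimal Python sketch of this computation is given here, using the Table 1 row for player P01. It rests on one reading of the variable definitions: the opposition's win rate as white is taken as LAB/PAB and its win rate as black as LAW/PAW. The function name performance_vector is hypothetical, and the values reported in Table 2 come from the author's own computations, which this sketch is not guaranteed to reproduce digit for digit.

```python
import math


def performance_vector(paw, waw, law, pab, wab, lab):
    """Return the Player point, the Opposition point, their distance and the angle."""
    player = (waw / paw, wab / pab)        # win rate as white, win rate as black
    opposition = (lab / pab, law / paw)    # opposition's win rate as white, as black
    dx = player[0] - opposition[0]
    dy = player[1] - opposition[1]
    distance = math.hypot(dx, dy)          # Euclidean distance between the two points
    angle = math.degrees(math.atan2(dy, dx))
    return player, opposition, distance, angle


# Table 1 row for P01: PAW, WAW, LAW, PAB, WAB, LAB
player, opposition, d, angle = performance_vector(354, 188, 49, 346, 134, 45)
print(f"distance = {d:.3f}, angle = {angle:.1f} degrees")
```

The vector points from the Opposition to the Player, so a first-quadrant angle means the player won more often than the opposition with both colors.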
The forty chess players in the sample are compared with their combined opposition using this distance as
the relative measure of performance8. The comparison is made at an experiment-wide error rate of
α=0.001 or 0.1%9. Since 40 comparisons are made, a Bonferroni adjustment (Abdi, 2007) is used to set
the error rate for each comparison to α = 0.001/40 = 0.000025. This means that the performance of the player
is considered to be different from that of the opposition if the probability that the observed difference was
caused by sampling variation is less than 0.0025%.
7 Not used in the data analysis.
8 Although it is assumed that the samples are sufficiently large and the opponent base sufficiently randomized to be
used as a common benchmark, the comparison of the uncertainty in the two measures of performance is robust with
respect to this assumption.
Player  PAW   WAW   LAW   PAB   WAB   LAB   Avg n
P01     354   188   49    346   134   45    350
P02     366   161   52    366   105   71    366
P03     361   168   52    358   94    91    359.5
P04     423   177   61    418   133   85    420.5
P05     366   179   54    366   138   74    366
P06     438   181   59    431   135   92    434.5
P07     246   92    30    235   58    29    240.5
P08     192   71    28    185   37    60    188.5
P09     357   156   38    341   115   58    349
P10     278   115   36    278   63    70    278
P11     228   67    32    225   42    59    226.5
P12     173   73    13    180   58    22    176.5
P13     321   94    56    312   67    80    316.5
P14     354   141   43    357   90    79    355.5
P15     269   101   39    261   61    54    265
P16     326   125   54    317   82    71    321.5
P17     449   177   77    449   124   106   449
P18     369   163   73    333   112   93    351
P19     165   74    25    155   51    30    160
P20     253   109   31    247   68    46    250
P21     209   97    13    210   55    23    209.5
P22     222   54    30    225   30    48    223.5
P23     242   109   30    213   68    35    227.5
P24     245   113   48    247   74    85    246
P25     217   93    41    231   53    72    224
P26     178   74    17    173   54    37    175.5
P27     260   83    35    269   53    74    264.5
P28     236   91    30    238   79    61    237
P29     241   97    29    233   65    53    237
P30     284   132   24    289   79    49    286.5
P31     201   69    33    204   39    50    202.5
P32     330   112   51    322   80    81    326
P33     220   102   41    219   68    51    219.5
P34     325   152   53    324   102   70    324.5
P35     199   104   37    183   53    55    191
P36     135   60    27    139   51    22    137
P37     256   110   46    254   104   51    255
P38     311   153   65    303   107   90    307
P39     250   125   23    250   71    49    250
P40     102   45    18    112   36    20    107
Table 1 Game statistics of the top forty players against all opponents during the sample period
9 It has been found in a recent study that higher values of α often lead to spurious and irreproducible results
(Johnson, 2013).
The number of such differences observed in our sample of forty players serves as an indicator of the
discrimination power10 of the measure of performance used. The discrimination power is assumed to be
derived from the information content of the measure of performance. It is assumed that the measurement
method that identifies more differences has greater discrimination power and more information than the
measurement method that identifies fewer differences ceteris paribus.
3.3 Scalar measures of performance. The trinomial nature of chess game outcomes cannot be
represented by a scalar measure of performance. Therefore, scalar measures of chess performance such as
the Elo rating system (Elo, 2008) depend on a procedure called "scoring" to reduce trinomial chess game
outcomes to a binomial variable.
The scoring procedure assigns a value of score=1 for a win, score=0 for a loss, and score = 0.5 for a
draw11. If N chess games are played and the player wins W games, loses L games and D games end in
draw, then the player scores (2W+D)/2 and the opponent scores (2L+D)/2. Note that N = W+L+D and
that (2W+D)/2 + (2L+D)/2= (2W+D+2L+D)/2 = (2W+2L+2D)/2 = W+L+D = N. The two scores add up
to the total number of games played. When the scores are divided by N, the two fractional scores add up
to unity. Therefore, when chess game outcomes are converted into scores, chess loses a dimension and is
reduced from a trinomial process to a binomial process.
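The loss of a dimension can be seen in a small Python illustration. The two game records here are hypothetical and were chosen only because they are very different trinomial outcomes that scoring maps to the same value; the function name fractional_score is likewise an assumption of this sketch.

```python
def fractional_score(wins: int, losses: int, draws: int) -> float:
    """Fractional score (2W + D) / (2N) with N = W + L + D."""
    n = wins + losses + draws
    return (2 * wins + draws) / (2 * n)


record_a = (5, 5, 0)    # hypothetical: five wins, five losses, no draws
record_b = (0, 0, 10)   # hypothetical: ten draws, no decisive games
print(fractional_score(*record_a), fractional_score(*record_b))  # 0.5 0.5
```

Once both records are reduced to 0.5, nothing downstream can tell them apart, although the underlying probability vectors are clearly not the same.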
A binomial process has one degree of freedom and it can therefore be represented with a scalar variable
such as the Elo rating system. This simplicity is achieved at the cost of information. When chess game
outcomes are converted to scores some information becomes irretrievably lost (Munshi, The Relative
Playing Strength of Chess Players, 2014). In this paper we measure the effect of this information loss by
comparing the discrimination power of scores with the discrimination power of a two dimensional
measure of performance that is true to the trinomial nature of chess. The following relationships are used
to compute the fractional score for each player from the data in Table 1.
Equation 4     N = PAW + PAB
Equation 5     W = WAW + WAB
Equation 6     L = LAW + LAB
Equation 7     D = N - W - L
Equation 8     PSCORE = (W + D/2)/N
Since the sample sizes are large, the uncertainty in PSCORE is estimated using the Gaussian
approximation as shown below.
Equation 9     Variance(PSCORE) = PSCORE*(1-PSCORE)/N
Equation 10    Standard Deviation(PSCORE) = σPSCORE = √(Variance(PSCORE))
The value of PSCORE=0.5 represents a neutral position. The performance of each player is therefore
measured relative to the neutral performance as PSCORE - 0.5. This measure becomes standardized and
corresponds with Equation 5 if we divide by the standard deviation as shown below.
Equation 11    Standardized Scalar Performance = (PSCORE - 0.50)/σPSCORE
10 Ability to discriminate between players with different levels of performance.
11 The score values may differ from one tournament to another but the principle of conversion to a binomial variable
is common to all scoring systems.
The performance of the forty players in the sample against their combined opposition is assessed using
Equation 11 at an error rate of α=0.000025 for each comparison in order to hold the experiment-wide
error rate to α=0.001 for all forty comparisons. The number of differences found is counted. This
number corresponds with the discrimination power of the scalar measure of chess performance and is
indicative of the information content of this measure of performance.
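The scoring computation in Equations 4 through 11 can be sketched in Python as follows. The sketch assumes that N, W and L are the sums of the corresponding white and black columns of Table 1 (games played, games won, games lost); the function name scalar_performance is hypothetical. The input is the Table 1 row for player P01.

```python
import math


def scalar_performance(paw, waw, law, pab, wab, lab):
    """Return (PSCORE, standard deviation, standardized scalar performance)."""
    n = paw + pab                               # Equation 4: games played
    w = waw + wab                               # Equation 5: games won
    l = law + lab                               # Equation 6: games lost
    d = n - w - l                               # Equation 7: games drawn
    score = (w + d / 2) / n                     # Equation 8
    sd = math.sqrt(score * (1 - score) / n)     # Equations 9 and 10
    return score, sd, (score - 0.5) / sd        # Equation 11


# Table 1 row for P01: PAW, WAW, LAW, PAB, WAB, LAB
score, sd, std_score = scalar_performance(354, 188, 49, 346, 134, 45)
print(f"{score:.4f} {sd:.4f} {std_score:.2f}")  # 0.6629 0.0179 9.11
```

For P01 the result agrees with the Score, Stdev and StdScore entries reported for that player in Table 4.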
4. DATA ANALYSIS
4.1 Two dimensional measure of performance. The data in Table 1 are used to compute the
Euclidean distance of each player from his combined opposition. These distances and their directions12
are shown in the columns labeled "Distance" and "Angle" in Table 2.
A Monte Carlo procedure is used to create a simulated sampling distribution of distances. The details of
the computations for all forty players are available in the data archive for this paper (Munshi,
Performance paper data analysis, 2014). The simulated replications of the sample data used in these
computations may be viewed graphically in the Appendix13. Each graph contains two color coded
markers. The position of the player is shown in red and that of his opponents is shown in blue. The size
and shape of the markers represent the uncertainty in the location of the marker on the graph. When the
two markers overlap it indicates no measurable difference in performance. When they are separated it
indicates a difference in performance. The greater the separation the greater is this difference as long as
the angle is in a positive direction.
The standard deviation of the sampling distribution of the Euclidean distance is computed from the
simulation results and shown in Table 2 in percentage terms as "Stdev". The Studentized distance of the
player from his combined opposition is computed as Distance/Stdev to serve as a standardized measure of
each player's performance as measured by his track record in the selected sample period.
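The general idea of a simulated sampling distribution can be sketched in Python as follows. This is not the author's procedure, which is documented in the data archive. It is a simple parametric bootstrap under stated assumptions: games with each color are redrawn from the observed win, loss and draw frequencies, the opposition's point is taken as (LAB/PAB, LAW/PAW), and the number of replications and the seed are arbitrary. The function names are hypothetical.

```python
import math
import random


def simulate_counts(rng, n, wins, losses):
    """Redraw n game outcomes from the observed win/loss/draw frequencies."""
    draws = n - wins - losses
    outcomes = rng.choices("WLD", weights=[wins, losses, draws], k=n)
    return outcomes.count("W"), outcomes.count("L")


def studentized_distance(paw, waw, law, pab, wab, lab, reps=2000, seed=1):
    """Return (observed distance, simulated standard deviation, Studentized distance)."""
    rng = random.Random(seed)

    def dist(ww, lw, wb, lb):
        # Player point (ww/paw, wb/pab) versus Opposition point (lb/pab, lw/paw)
        return math.hypot(ww / paw - lb / pab, wb / pab - lw / paw)

    observed = dist(waw, law, wab, lab)
    sims = []
    for _ in range(reps):
        ww, lw = simulate_counts(rng, paw, waw, law)   # games played as white
        wb, lb = simulate_counts(rng, pab, wab, lab)   # games played as black
        sims.append(dist(ww, lw, wb, lb))
    mean = sum(sims) / reps
    stdev = math.sqrt(sum((s - mean) ** 2 for s in sims) / (reps - 1))
    return observed, stdev, observed / stdev


# Table 1 row for P01
observed, stdev, std_dist = studentized_distance(354, 188, 49, 346, 134, 45)
print(f"distance = {observed:.3f}, stdev = {stdev:.3f}, StdDist = {std_dist:.1f}")
```

Because the resampling scheme is an assumption, the output should be read as an order-of-magnitude check against Table 2 and not as a replication of it.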
Table 2 has been sorted according to the distance vector, taking both magnitude and direction into account,
from the highest performance to the lowest. The critical value of StdDist that corresponds with our
experiment-wide error rate of α=0.001 is StdDist = 4.25. Using this criterion we find that the top 34
players out of 40 listed in Table 2 outperformed their opponents on average. At the bottom of this list is
player P27 with StdDist = 5.116 > 4.25. In all of these cases we can reject H0 because StdDist > 4.25 and
because the distance vector lies in a positive direction or well above the neutral direction of -45 degrees.
In each case we conclude that the player in question performed better than his combined opposition.
Six of the 40 players in the list achieved a performance level where StdDist < 4.25 or had negative
directions approaching the neutral angle of -45 degrees. In cases where the observed StdDist is within
the expected sampling variation at the error rate chosen for this study the direction of the distance vector
has no interpretation.
12 The distance vector and its angle are used together to evaluate performance. The white distance and the black
distance could have also been used to provide the same information.
13 For the complete list of all 40 graphs please download the Excel file which is included in the data archive of this
paper (Munshi, Performance paper data analysis, 2014).
Player  Distance  Stdev   StdDist   Angle
P01     0.469     0.036   13.040    29.536
P09     0.370     0.035   10.527    23.846
P21     0.430     0.041   10.485    18.457
P05     0.384     0.038   9.981     24.100
P30     0.394     0.040   9.969     13.572
P39     0.417     0.043   9.608     10.819
P12     0.400     0.047   8.514     26.640
P04     0.297     0.034   8.758     20.197
P06     0.296     0.033   8.931     17.517
P02     0.312     0.037   8.492     15.399
P03     0.321     0.037   8.762     1.328
P37     0.326     0.046   7.081     35.422
P07     0.281     0.039   7.116     23.189
P23     0.361     0.045   7.987     22.568
P34     0.320     0.041   7.839     15.968
P20     0.321     0.043   7.544     14.324
P14     0.279     0.035   7.902     5.645
P26     0.335     0.050   6.739     15.164
P28     0.269     0.043   6.272     14.497
P18     0.250     0.040   6.281     11.704
P17     0.226     0.034   6.633     9.070
P29     0.287     0.043   6.649     9.195
P38     0.288     0.044   6.621     9.969
P35     0.337     0.055   6.132     -1.653
P10     0.285     0.042   6.865     -4.501
P36     0.321     0.063   5.139     35.983
P19     0.326     0.055   5.911     21.799
P33     0.288     0.048   5.975     13.902
P16     0.221     0.038   5.754     8.047
P15     0.232     0.042   5.555     5.900
P32     0.185     0.037   5.039     -0.856
P24     0.269     0.049   5.545     -8.470
P25     0.253     0.050   5.064     -16.839
P27     0.200     0.039   5.116     -20.375
P08     0.256     0.049   5.272     -25.809
P11     0.171     0.042   4.093     -23.294
P31     0.187     0.049   3.851     -14.893
P13     0.125     0.037   3.377     -17.236
P22     0.134     0.040   3.401     -32.446
P40     0.301     0.103   2.916     25.204
Table 2 Discrimination power of the two dimensional measure of performance
Assuming that the top forty players in the world are actually better than their combined opposition and
that the sample of games taken is representative and not biased, then of the two measures of chess
performance being compared, the one with the greater discrimination power will detect this difference in
a greater proportion of the players. In the case of the two dimensional measure of performance we find
that the proportion detected is 34/40 = 85%. We now compute the corresponding proportion for the scalar
measure used by the Elo rating system so that we can compare the discrimination power and, by
inference, the information content of the two measures of performance.
It is interesting to note in Table 2 that the directions of the distance vectors are all well below 45 degrees
and some of the angles have negative values. This pattern seems to indicate that on average winners do
better as white than as black and that losers do better as black than as white. This dichotomy provides
further support for measuring chess performance in these two dimensions.
4.2 Scalar measure of performance. The chess game outcomes in Table 1 are reduced from a
trinomial variable to a binomial variable by the use of the scoring procedure. Since binomial variables
have only one degree of freedom, a scalar measure may be used to compare the players based on their
scoring performance. The results, sorted by standardized scoring performance from highest to lowest, are
shown in Table 4. The column labels in Table 4 are described below.
Player    = Player identity
Games     = The number of games played
Won       = The number of games won by the player
Lost      = The number of games lost by the player
Draw      = The number of games that ended in draw
Score     = The player's score
Stdev     = The standard deviation of the score derived from the Gaussian approximation
StdScore  = Standardized score = (Score - 0.5)/Stdev
The standardized score serves as our scalar measure of performance. At an experiment-wide error rate of
α=0.001 with 40 comparisons, the error rate for each comparison is α=0.000025. The corresponding critical value of
StdScore in a Normal distribution is StdScore = 4.1. The performance of the 19 players at the top of the
list in Table 4 meets this condition. In the case of each of these players we reject H0 and conclude that the
player performed better than the average opponent. The performance of the 21 players in the bottom of
this list does not meet this condition and for them we fail to reject H0 and find no evidence that they
performed better than the opposition.
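The Bonferroni adjustment and the resulting critical value can be checked with a short Python sketch. It assumes a one-sided test against the standard Normal distribution, which is an interpretation and not something the text states explicitly; the variable names are hypothetical.

```python
from statistics import NormalDist

experiment_alpha = 0.001     # experiment-wide error rate
comparisons = 40             # one comparison per player

# Bonferroni adjustment: divide the experiment-wide rate by the number of comparisons
per_comparison_alpha = experiment_alpha / comparisons

# One-sided critical value of the standard Normal distribution (assumed here)
critical_z = NormalDist().inv_cdf(1 - per_comparison_alpha)
print(per_comparison_alpha, round(critical_z, 2))
```

Under this assumption the critical value falls slightly below the rounded figure of 4.1 quoted in the text, so the count of players above the threshold should be checked against whichever cutoff is actually used.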
We are now able to compare the proposed two dimensional measure of performance and the widely used
scalar measure of performance in terms of their ability to detect differences in chess performance.
4.3 Comparison of discrimination power. We found that the two dimensional measure of
performance was able to detect a difference in performance in 34 out of 40 players or in 85% of the
players tested while the scalar measure of performance detected a difference in only 19 out of 40 players
or in 47.5% of the players. To determine whether the observed difference of 37.5 percentage points may
have been caused by sampling variation we carry out a hypothesis test as noted below.
Research question: Does the two dimensional measure of performance have greater discrimination power?
H0: p1 ≤ p2. The higher level of detection by the two dimensional measure in the sample is a result of sampling
variation.
Ha: p1 > p2. The higher level of detection could not have been observed in this sample if the two dimensional measure
of performance did not have greater discrimination power.
Error rate: α = 0.001 (The probability of incorrectly rejecting H0 is held to 0.1% or less.)
              TwoDim   Scalar   Pooled
Comparisons   40       40
Detected      34       19
PctDetected   0.85     0.475
Difference                      0.375
Variance                        0.00942
SE                              0.0971
Diff/SE                         3.86
p-value                         <0.001
Decision                        Reject H0
Table 3 Hypothesis test for discrimination power
The details of the hypothesis test are shown in Table 3. Using the usual Gaussian approximation14 for
proportions we find that the variance of the difference, taken as the sum of the variances of the two sample
proportions, is 0.00942 and the standard error (SE) is 0.0971. In Table 3,
we find that the p-value < α and so we reject H0 and conclude that the observed difference could not have
been caused by sampling variation and that therefore the two dimensional measure of performance has
greater discrimination power than the scalar measure. The implication of this finding is that the scalar
measure contains less information than the two dimensional measure of performance. We ascribe this
difference to the information loss incurred by the use of the scoring procedure to represent chess
outcomes as a binomial variable.
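The arithmetic behind this test can be reproduced with a short Python sketch. It assumes the variance of the difference is the sum of the two binomial variances p(1-p)/n, which is the form that matches the reported figures, and it uses a one-sided Normal tail for the p-value. The variable names are hypothetical.

```python
import math
from statistics import NormalDist

n1, x1 = 40, 34    # two dimensional measure: comparisons, differences detected
n2, x2 = 40, 19    # scalar measure: comparisons, differences detected

p1, p2 = x1 / n1, x2 / n2
variance = p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2   # variance of p1 - p2
se = math.sqrt(variance)
z = (p1 - p2) / se
p_value = 1 - NormalDist().cdf(z)                    # one-sided tail probability

print(f"{variance:.5f} {se:.4f} {z:.2f}")  # 0.00942 0.0971 3.86
print(p_value < 0.001)                     # True
```

The printed values agree with the Variance, SE and Diff/SE entries reported for the hypothesis test.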
14 A Monte Carlo simulation was used to verify the results of this approximation method. The simulation is available
in the online data archive of this paper (Munshi, Check Gaussian Approximation, 2014).
Player  Games  Won   Lost  Draw  Score   Stdev   StdScore
P01     700    322   94    284   0.6629  0.0179  9.11
P05     732    317   128   287   0.6291  0.0179  7.23
P09     698    271   96    331   0.6254  0.0183  6.84
P30     573    211   73    289   0.6204  0.0203  5.94
P21     419    152   36    231   0.6384  0.0235  5.90
P04     841    310   146   385   0.5975  0.0169  5.77
P39     500    196   72    232   0.6240  0.0217  5.72
P06     869    316   151   402   0.5949  0.0167  5.70
P23     455    177   65    213   0.6231  0.0227  5.42
P02     732    266   123   343   0.5977  0.0181  5.39
P37     510    214   97    199   0.6147  0.0215  5.32
P12     353    131   35    187   0.6360  0.0256  5.31
P34     649    254   123   272   0.6009  0.0192  5.25
P20     500    177   77    246   0.6000  0.0219  4.56
P03     719    262   143   314   0.5828  0.0184  4.50
P38     614    260   155   199   0.5855  0.0199  4.30
P07     481    150   59    272   0.5946  0.0224  4.23
P18     702    275   166   261   0.5776  0.0186  4.16
P14     711    231   122   358   0.5767  0.0185  4.14
P26     351    128   54    169   0.6054  0.0261  4.04
P19     320    125   55    140   0.6094  0.0273  4.01
P17     898    301   183   414   0.5657  0.0165  3.97
P36     274    111   49    114   0.6131  0.0294  3.85
P33     439    170   92    177   0.5888  0.0235  3.78
P29     474    162   82    230   0.5844  0.0226  3.73
P28     474    170   91    213   0.5833  0.0226  3.68
P35     382    157   92    133   0.5851  0.0252  3.37
P16     643    207   125   311   0.5638  0.0196  3.26
P10     556    178   106   272   0.5647  0.0210  3.08
P15     530    162   93    275   0.5651  0.0215  3.02
P40     214    81    38    95    0.6005  0.0335  3.00
P24     492    187   133   172   0.5549  0.0224  2.45
P32     652    192   132   328   0.5460  0.0195  2.36
P25     448    146   113   189   0.5368  0.0236  1.56
P31     405    108   83    214   0.5309  0.0248  1.24
P27     529    136   109   284   0.5255  0.0217  1.18
P08     377    108   88    181   0.5265  0.0257  1.03
P13     633    161   136   336   0.5197  0.0199  0.99
P22     447    84    78    285   0.5067  0.0236  0.28
P11     453    99    95    259   0.5044  0.0235  0.19
Table 4 Discrimination power of the scalar measure of performance
4.4 Correlation Check. The comparison of the discrimination power of the two measures of chess
performance carried out in the previous section requires that the two measures should be comparable in
terms of validity and differ only in terms of reliability. In other words, we assume that they are measuring
the same underlying reality but with different degrees of precision. A testable implication of this
assumption is that the two measures should be correlated. To check whether this assumption is
reasonable, we carry out a linear regression test as shown in Figure 1.
[Figure 1: scatter plot of the two dimensional measure (vertical axis) against the scalar measure (horizontal axis) for the forty players, with the fitted line y = 2.2458x - 0.4833 and R² = 0.98.]
Figure 1 Correlation between the two measures of performance
The linear regression in Figure 1 shows that the two measures of performance are highly correlated. The
hypothesis test for the degree of correlation in Table 5 shows that the probability of observing a
correlation this high (or higher) in a sample of forty players taken from a population in which the two
measures of performance are not correlated is less than our acceptable error rate of α = 0.001. The test
validates our assumption that the two measures are comparable in terms of validity but differ only with
respect to reliability.
ANOVA        df    SS        MS        F        p-value
Regression   1     743.891   743.891   1863.4   6.71164E-34
Residual     38    15.170    0.399
Total        39    759.061
Table 5 Hypothesis test for correlation between the two measures of performance
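The internal consistency of the regression summary can be checked with a few lines of Python. The sketch uses only the sums of squares and degrees of freedom reported for the regression and recomputes R² and the F statistic from them; the variable names are hypothetical.

```python
# Sums of squares and degrees of freedom as reported in the ANOVA table
ss_regression, df_regression = 743.891, 1
ss_residual, df_residual = 15.170, 38

ss_total = ss_regression + ss_residual
r_squared = ss_regression / ss_total                    # share of variation explained
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_statistic = ms_regression / ms_residual

print(f"R^2 = {r_squared:.2f}, F = {f_statistic:.1f}")  # R^2 = 0.98, F = 1863.4
```

Both values agree with the R² shown in Figure 1 and the F statistic shown in the ANOVA table.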
5. CONCLUSIONS
The proposed two dimensional measure of chess performance has more precision and discrimination
power than the scalar measure because it contains more information by virtue of the fact that it is true to
the trinomial nature of chess game outcomes. Scalar measures such as the Elo rating system contain less
information because they rely on scoring to reduce chess game outcomes from a trinomial to a binomial
process. This simplification is achieved at a cost because scoring causes some chess game outcome
information to be lost. The information loss occurs at the point of scoring. No amount of mathematical
wizardry downstream can recover the lost information.
The reduction of chess game outcomes to a binomial variable by the use of scoring was likely a necessary
sacrifice of information for the sake of computational simplicity at a time when computational machinery
was costly and scarce. It is likely that this relationship between the value of information and the cost of
computation no longer applies because of advances in computer technology.
6. APPENDIX
6.1 The high level of uncertainty in small samples. The utility of head to head pairwise
comparisons of chess players is limited by the availability of data. These sample sizes tend to be small.
For example, consider the two pairwise comparisons in Table 6. The data were collected from an online
database in July 2014 (chessgames.com, 2014).
The usefulness of this information is limited by the high degree of uncertainty. In both of these cases it is
not possible to distinguish the score from its neutral value of 0.50 at any acceptable level of α. The data
do not contain useful information because the sample size is too small.
player    opponent   games   won   lost   draw   Score    Stdev    z-value   p-value
Aronian   Shirov     19      5     0      14     0.6316   0.1107   1.19      11.7%
Carlsen   Nakamura   43      16    5      22     0.6279   0.0737   1.74      4.1%
Table 6 Uncertainty of scores in small samples
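The figures in Table 6 can be reproduced with the following minimal sketch. The formulas are not stated in this appendix and are inferred here from the reported numbers: the Stdev column appears consistent with the binomial standard error sqrt(s(1 - s)/n) for score s and n games, and the p-value with a one-sided upper tail of the standard normal distribution. The function name `score_test` is an assumption made for illustration.

```python
import math


def score_test(won, lost, draw, neutral=0.5):
    """Return (score, stdev, z, p) for a head to head record.

    Assumes a binomial standard error and a one-sided normal test
    against the neutral score, as inferred from Table 6.
    """
    games = won + lost + draw
    score = (won + 0.5 * draw) / games
    stdev = math.sqrt(score * (1.0 - score) / games)
    z = (score - neutral) / stdev
    # Upper tail probability of the standard normal distribution.
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    return score, stdev, z, p


aronian = score_test(won=5, lost=0, draw=14)    # Aronian vs Shirov
carlsen = score_test(won=16, lost=5, draw=22)   # Carlsen vs Nakamura

for name, (score, stdev, z, p) in (("Aronian", aronian), ("Carlsen", carlsen)):
    print(f"{name}: score={score:.4f} stdev={stdev:.4f} z={z:.2f} p={p:.1%}")
```

Under these assumptions both p-values are far above α = 0.001, so neither score can be distinguished from 0.50 at that level.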
The small sample problem in pairwise comparisons also exists in the two dimensional measure of
performance. The sampling strategy of this study was motivated by the need for large sample sizes.
6.2 Graphical depiction of simulated repetitions of the game data.
[Figures: one scatter plot per player for P01, P02, P03, P04, P05, P06, P07, P08, P09, P10, P11 and P13. Horizontal axis: Won as White; vertical axis: Won as Black.]
7. REFERENCES
Abdi, H. (2007). Bonferroni Sidak. Retrieved February 2014, from utdallas.edu:
http://www.utdallas.edu/~herve/Abdi-Bonferroni2007-pretty.pdf
chessgames.com. (2014). chess game database. Retrieved July 2014, from chessgames.com.
Elo, A. (2008). The rating of chess players past and present. Ishi Press.
FIDE. (2014, August). Top 100. Retrieved August 2014, from fide.com:
http://ratings.fide.com/top.phtml?list=men
Glickman, M. (1999). Rating the chess rating system. Retrieved 2014, from Glicko:
http://www.glicko.net/research/chance.pdf
Good, I. (1955). On the marking of chess players. Mathematical Gazette.
Johnson, V. E. (2013, November). Revised Standards for Statistical Evidence. Retrieved December 2013,
from Proceedings of the National Academy of Sciences:
http://www.pnas.org/content/110/48/19313.full
Munshi, J. (2014, February). A method for comparing chess openings. Retrieved April 2014, from
arxiv.org: http://arxiv.org/ftp/arxiv/papers/1402/1402.6791.pdf
Munshi, J. (2014). A two dimensional measure of chess performance. Retrieved 2014, from Youtube:
http://www.youtube.com/edit?o=U&feature=vm&video_id=wbfrUyB8o8k
Munshi, J. (2014). A two dimensional measure of chess performance. Retrieved 2014, from Youtube:
http://www.youtube.com/watch?v=2MNtGhu9zPo
Munshi, J. (2014). Check Gaussian Approximation. Retrieved 2014, from Dropbox:
https://www.dropbox.com/s/3cr7w9459oxi4by/GaussianApproximationSimulation.xlsx?dl=0
Munshi, J. (2014). Comparing Chess Openings Part 3. Retrieved 2014, from ssrn:
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2441568
Munshi, J. (2014). Pairwise comparison of chess opening variations. Retrieved 2014, from ssrn:
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2472783
Munshi, J. (2014, November). Performance paper data analysis. Retrieved November 2014, from
Dropbox: https://www.dropbox.com/sh/109n73zx8qvmt6f/AABXz8tm6A1TzqaSIRRrS45a?dl=0
Munshi, J. (2014). Simulated sampling distribution. Retrieved 2014, from Dropbox:
https://www.dropbox.com/sh/4yipacz2ujqilwi/AACbLlutdqEfV0Qu2FtSho84a?dl=0
Munshi, J. (2014). The Relative Playing Strength of Chess Players. Retrieved 2014, from SSRN:
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2477868