
A TWO DIMENSIONAL MEASURE OF PERFORMANCE IN CHESS


JAMAL MUNSHI [1]

ABSTRACT: A two dimensional measure of chess performance is presented and it is shown to have more information and a greater power of discrimination than scalar measures because it does not rely on scoring. The scoring procedure used by scalar measures to reduce chess game outcomes from a trinomial to a binomial variable causes some chess game outcome information to be irretrievably lost [2].

1. SUMMARY

A two dimensional measure of chess player performance is true to the trinomial nature of chess game outcomes. It does not suffer from the inadequacies of scalar measures of performance described in a previous paper (Munshi, 2014). Its use is demonstrated by measuring the performance of forty well known chess players listed by the World Chess Federation as the world's best in August of 2014 (FIDE, 2014). The two dimensions of chess performance are identified as the success rate when playing white and the success rate when playing black.

The sample data are taken from publicly available chess game databases. Numerical methods are used to create a simulated sampling distribution from the sample data so that the uncertainty in the performance measure may be assessed. Large sample sizes are needed to gain precision and they are achieved by comparing each chess player against his combined opposition. The usual procedure of making a pair-wise comparison between two chess players at a time involves the use of small sample sizes and an unacceptable level of uncertainty [3].

The proposed two dimensional measure of performance is compared with a scalar measure in terms of information content and discrimination power. It is found that the two dimensional measure has more power in detecting differences in performance than scalar measures. The results imply that the two dimensional measure of performance contains more information.
This finding can be explained in terms of the loss of information incurred when chess game outcomes are converted from a trinomial to a binomial variable by the use of scoring (Munshi, 2014). Since the information loss occurs at the point of scoring, the lost information cannot be recovered downstream no matter how sophisticated the mathematics.

[1] Originally posted in August 2014, revised September 2014, data entry errors corrected November 2014.
[2] Date: November 2014. Key words and phrases: chess, performance, rating, Elo, FIDE, playing strength, trinomial, statistics, numerical methods, Studentized Euclidean distance, probability vector, Monte Carlo simulation, bootstrap, uncertainty. Author affiliation: Professor Emeritus, Sonoma State University, Rohnert Park, CA, 94928, [email protected], http://ssrn.com/author=2220942
[3] An example of the high level of uncertainty in small samples is presented in the Appendix.

2. THEORY

Chess game outcomes are determined by a trinomial stochastic process driven by an unobservable probability vector that can be described in terms of the color of the pieces, or in terms of the identity of the players, as shown in Equations 1 and 2.

Equation 1 (colors of the pieces): $\pi(w,b) = \pi[p_w, p_b, p_d]$
Equation 2 (identity of the players): $\pi(x,y) = \pi[p_x, p_y, p_d]$

In these equations π represents a trinomial probability vector. In Equation 1, pw is the probability that white will win, pb is the probability that black will win, and pd is the probability that the game will end in draw. In Equation 2, px is the probability that player X will win, py is the probability that player Y will win, and pd is the probability that the game will end in draw.
Equation 3: $p_w + p_b + p_d = p_x + p_y + p_d = 1$

The three component probabilities in each vector are subject to the constraint that they must add up to unity because it is assumed that there is a 100% probability that one of the three states will occur [4]. This constraint implies that when any two of the component probabilities are assigned values, the third component probability is determined by subtraction. For example, in Equation 2, if we know that px = 20% and pd = 45%, the vector is completely specified because py is determined by subtraction as py = 100 - 20 - 45 = 35%. In other words, the probability vector that determines chess game outcomes has two degrees of freedom (Munshi, The Relative Playing Strength of Chess Players, 2014). We don't know what the probability vector is until we have exactly two pieces of information about it.

The values of the component probabilities of π are determined by many different factors and these include the first move advantage enjoyed by white, the opening employed, the aggressiveness of the playing style of each player, the general level of the move imperfection rate, and the difference in playing strength between the two players (Munshi, A method for comparing chess openings, 2014). The move imperfection rate is a significant consideration in chess games played by humans even at very high levels of play and therefore large sample sizes are necessary for detecting the effects of the other factors (Munshi, Pairwise comparison of chess opening variations, 2014).

2.1 A two dimensional measure of performance.

Chess performance is a relative measure. We may measure the performance of any player against another player or against a set of players and that measure would only tell us how good the player is relative to his opponent(s) in the sample. Whatever measure we use for this purpose, we know that it needs to have two dimensions in order to have two degrees of freedom.
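The two degrees of freedom implied by Equation 3 can be illustrated with a minimal Python sketch. The function name `complete_vector` is hypothetical and not part of the paper; it only applies the subtraction described in the text to the example px = 20%, pd = 45%.

```python
def complete_vector(p_first, p_draw):
    """Return (p_first, p_second, p_draw) for a trinomial probability vector.

    Two components are supplied; the third follows from the constraint
    that the three probabilities add up to unity (Equation 3).
    """
    p_second = 1.0 - p_first - p_draw
    if min(p_first, p_second, p_draw) < 0.0:
        raise ValueError("the supplied probabilities do not form a valid vector")
    return (p_first, p_second, p_draw)


# Example from the text: px = 20% and pd = 45% leave py = 35%.
px, py, pd = complete_vector(0.20, 0.45)
print(round(py, 2))
```

Because any two components fix the third, a measure of performance needs exactly two numbers to describe the vector.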
For a measure of chess performance we must choose two independent dimensions from a set of many possibilities. Because of white's first move advantage, it is useful to separate games played as white from games played as black and use these two independent dimensions together to assess a chess player's performance record (Munshi, Comparing Chess Openings Part 3, 2014). Cartesian coordinates with x = percent of games won as white (WAW) and y = percent of games won as black (WAB) are used to plot these values for player P and opponent(s) O for comparison. We assume that the number of games played as white and the number of games played as black are equal, and that if they are not equal, they are close enough to use the average as the common sample size. Once the two points are placed on the graph, we can measure their Euclidean distance as dP = the distance of the Player from the Opposition. This distance serves as our relative measure of performance for the Player against his Opponent(s). Both magnitude and direction of the distance vector are important in making the comparison.

[4] We assume that the game will not be interrupted or changed midstream.

For the hypothesis test, our null hypothesis is that the two outcomes being compared could have been generated by the same underlying probability vector, that is, πP = πO, and that the observed difference in the sample was caused by sampling variation. The testable implication of this hypothesis is that the magnitude of the distance in the population of all possible games from which the sample was taken is zero. Designating the magnitude of the population distance as δ we write our hypotheses as shown below.

Null hypothesis H0: πP = πO. Testable implication: δ = 0
Alternate hypothesis Ha: πP ≠ πO. Testable implication: δ > 0

Both π and δ are unobservable.
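The sample counterpart of δ, the distance between the Player point and the Opposition point, can be sketched in Python as follows. The function names and the game counts are hypothetical. Measuring the angle counter-clockwise from the "won as white" axis with `atan2` is an assumption of this sketch, and the angles tabulated in the paper may follow a different convention.

```python
import math


def performance_point(won_white, played_white, won_black, played_black):
    """Return (fraction won as white, fraction won as black)."""
    return (won_white / played_white, won_black / played_black)


def distance_and_angle(player, opposition):
    """Magnitude and direction (in degrees) of the vector from opposition to player."""
    dx = player[0] - opposition[0]
    dy = player[1] - opposition[1]
    return math.hypot(dx, dy), math.degrees(math.atan2(dy, dx))


# Hypothetical record: 100 games with each color for both sides.
player = performance_point(50, 100, 30, 100)       # (0.50, 0.30)
opposition = performance_point(20, 100, 10, 100)   # (0.20, 0.10)
d, angle = distance_and_angle(player, opposition)
print(round(d, 4), round(angle, 2))
```

In this made-up example the vector lies in the first quadrant below the 45-degree line, which the text treats as a positive direction.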
What we observe is a sample distance d. We determine from the sampling distribution of d whether a distance ≥ d could be observed by way of sampling variation if samples were taken from a population in which the null hypothesis is true. If so we conclude that there is no measurable difference in performance between player P and opponents O. However, if the observed value of d is too large to be explained by sampling variation we may conclude that δ > 0 and therefore that πP ≠ πO.

It is useful to Studentize [5] the distance d between each player and his combined opposition. The division by the standard deviation changes the distance measured in "games" to a dimensionless number which may be thought of as the number of standard deviations.

Both the magnitude and direction of the distance vector are considered in making the comparisons. If the magnitude is too small to reject H0, the direction is not important because we accept the hypothesis that δ could be zero. However, if we reject H0, we must determine whether the distance vector pointing from the Opposition to the Player is in a positive direction, a negative direction, or a neutral direction. Vectors lying in the first quadrant, perhaps below [6] but not far from the 45-degree line, may be considered to be in a positive direction indicating that the Player performed better than the Opposition. Similarly, if the distance vector is in the third quadrant, perhaps above but not far from the 135-degree line, it represents a negative direction indicating that the Player performed worse than his combined opposition. The second and fourth quadrants represent a neutral direction, in which case even when δ > 0 there may be no real difference in performance between the player and the opposition.

[5] The Studentized Euclidean distance is the computed distance divided by the standard deviation of the sampling distribution that was estimated from sample data.
[6] The expected value of the ideal angle is less than 45 degrees because of white's first move advantage.

3. METHODOLOGY AND DATA

A sufficiently large number of chess players with relatively low move imperfection rates is required in our sample to make it possible for us to compare different methods of measuring performance. Accordingly, the top forty chess players in the world in August of 2014 (FIDE, 2014) are selected for performance comparison and their playing records from January 2009 to August 2014 are taken from an online database of chess games (chessgames.com, 2014). The sampling period was chosen so that the data would be relatively current and yet provide sample sizes that are large enough to detect differences in performance.

3.1 The data.

The selected players and their game statistics are shown in Table 1. The identity of the players and their FIDE ratings and rankings are irrelevant to the purpose of the study and they have therefore been removed from the dataset. The intent of this paper is only to compare different methods of performance measurement rather than to comment on the specific personalities involved. Hypothetical synthetic data could have been used to make that comparison but actual game data among the top chess players presents a greater sense of realism and credibility. Real data are preferred on this basis. The variables in Table 1 are as follows:

Player = A code by which this player will be identified throughout this study.
PAW = Number of games this player played as white
WAW = Number of games this player won as white
LAW = Number of games this player lost as white = Number of games his combined opposition won as black
PAB = Number of games this player played as black
WAB = Number of games this player won as black
LAB = Number of games this player lost as black = Number of games his combined opposition won as white
Avg n [7] = The average sample size = (PGAW+PGAB)/2, where PGAW and PGAB are the games played as white (PAW) and as black (PAB)

We use these data to compare a two dimensional measure of performance with a scalar measure in terms of discrimination power, information content, and uncertainty. The two measures of performance are used to compare each player with his combined opposition in the sample.

[7] Not used in the data analysis.

3.2 Two-dimensional measure.

The trinomial nature of chess game outcomes contains two degrees of freedom and requires measures of performance to have two dimensions. In the two dimensional measure of performance used in this study, we plot the player and his opposition in two dimensional Cartesian coordinates with x = percentage of games won playing white and y = percentage of games won playing black and compute the Euclidean distance between these points. We then interpret the magnitude and direction of the distance vector as a measure of playing performance. The forty chess players in the sample are compared with their combined opposition using this distance as the relative measure of performance [8].

[8] Although it is assumed that the samples are sufficiently large and the opponent base sufficiently randomized to be used as a common benchmark, the comparison of the uncertainty in the two measures of performance is robust with respect to this assumption.

The comparison is made at an experiment-wide error rate of α = 0.001 or 0.1% [9]. Since 40 comparisons are made, a Bonferroni adjustment (Abdi, 2007) is used to set the error rate for each comparison to α = 0.000025.
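The Bonferroni arithmetic can be checked with a short Python sketch. The function name `bonferroni_alpha` is hypothetical. The one-sided standard Normal critical value computed at the end is an illustration only, since the text does not state whether its critical values are one-sided or two-sided.

```python
from statistics import NormalDist


def bonferroni_alpha(experiment_alpha, comparisons):
    """Per-comparison error rate under a Bonferroni adjustment."""
    return experiment_alpha / comparisons


per_comparison_alpha = bonferroni_alpha(0.001, 40)

# Assumption: a one-sided test against a standard Normal distribution.
z_critical = NormalDist().inv_cdf(1.0 - per_comparison_alpha)
print(per_comparison_alpha, z_critical)
```

The resulting z value can be compared with the critical values quoted in the data analysis section.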
Under this adjustment, the performance of the player is considered to be different from the opposition if the probability that the observed difference was caused by sampling variation is less than 0.0025%.

Player   PAW   WAW   LAW   PAB   WAB   LAB   Avg n
P01      354   188    49   346   134    45   350
P02      366   161    52   366   105    71   366
P03      361   168    52   358    94    91   359.5
P04      423   177    61   418   133    85   420.5
P05      366   179    54   366   138    74   366
P06      438   181    59   431   135    92   434.5
P07      246    92    30   235    58    29   240.5
P08      192    71    28   185    37    60   188.5
P09      357   156    38   341   115    58   349
P10      278   115    36   278    63    70   278
P11      228    67    32   225    42    59   226.5
P12      173    73    13   180    58    22   176.5
P13      321    94    56   312    67    80   316.5
P14      354   141    43   357    90    79   355.5
P15      269   101    39   261    61    54   265
P16      326   125    54   317    82    71   321.5
P17      449   177    77   449   124   106   449
P18      369   163    73   333   112    93   351
P19      165    74    25   155    51    30   160
P20      253   109    31   247    68    46   250
P21      209    97    13   210    55    23   209.5
P22      222    54    30   225    30    48   223.5
P23      242   109    30   213    68    35   227.5
P24      245   113    48   247    74    85   246
P25      217    93    41   231    53    72   224
P26      178    74    17   173    54    37   175.5
P27      260    83    35   269    53    74   264.5
P28      236    91    30   238    79    61   237
P29      241    97    29   233    65    53   237
P30      284   132    24   289    79    49   286.5
P31      201    69    33   204    39    50   202.5
P32      330   112    51   322    80    81   326
P33      220   102    41   219    68    51   219.5
P34      325   152    53   324   102    70   324.5
P35      199   104    37   183    53    55   191
P36      135    60    27   139    51    22   137
P37      256   110    46   254   104    51   255
P38      311   153    65   303   107    90   307
P39      250   125    23   250    71    49   250
P40      102    45    18   112    36    20   107
Table 1 Game statistics of the top forty players against all opponents during the sample period

[9] It has been found in a recent study that higher values of α often lead to spurious and irreproducible results (Johnson, 2013).

The number of such differences observed in our sample of forty players serves as an indicator of the discrimination power [10] of the measure of performance used. The discrimination power is assumed to be derived from the information content of the measure of performance.
It is assumed that the measurement method that identifies more differences has greater discrimination power and more information than the measurement method that identifies fewer differences ceteris paribus.

3.3 Scalar measures of performance.

The trinomial nature of chess game outcomes cannot be represented by a scalar measure of performance. Therefore, scalar measures of chess performance such as the Elo rating system (Elo, 2008) depend on a procedure called "scoring" to reduce trinomial chess game outcomes to a binomial variable. The scoring procedure assigns a value of score = 1 for a win, score = 0 for a loss, and score = 0.5 for a draw [11]. If N chess games are played and the player wins W games, loses L games and D games end in draw, then the player scores (2W+D)/2 and the opponent scores (2L+D)/2. Note that N = W+L+D and that (2W+D)/2 + (2L+D)/2 = (2W+D+2L+D)/2 = (2W+2L+2D)/2 = W+L+D = N. The two scores add up to the total number of games played. When the scores are divided by N, the two fractional scores add up to unity. Therefore, when chess game outcomes are converted into scores, chess loses a dimension and is reduced from a trinomial process to a binomial process.

A binomial process has one degree of freedom and it can therefore be represented with a scalar variable such as the Elo rating system. This simplicity is achieved at the cost of information. When chess game outcomes are converted to scores some information becomes irretrievably lost (Munshi, The Relative Playing Strength of Chess Players, 2014). In this paper we measure the effect of this information loss by comparing the discrimination power of scores with the discrimination power of a two dimensional measure of performance that is true to the trinomial nature of chess. The following relationships are used to compute the fractional score for each player from the data in Table 1.
Equation 4: $N = \mathrm{PGAW} + \mathrm{PGAB}$
Equation 5: $W = \mathrm{PWAW} + \mathrm{PWAB}$
Equation 6: $L = \mathrm{OWAW} + \mathrm{OWAB}$
Equation 7: $D = N - W - L$
Equation 8: $\mathrm{PSCORE} = (W + D/2)/N$

Since the sample sizes are large, the uncertainty in PSCORE is estimated using the Gaussian approximation as shown below.

Equation 9: $\mathrm{Variance}(\mathrm{PSCORE}) = \mathrm{PSCORE}\,(1 - \mathrm{PSCORE})/N$
Equation 10: $\sigma_{\mathrm{PSCORE}} = \sqrt{\mathrm{Variance}(\mathrm{PSCORE})}$

The value of PSCORE = 0.5 represents a neutral position. The performance of each player is therefore measured relative to the neutral performance as PSCORE - 0.5. This measure becomes standardized and corresponds with Equation 5 if we divide by the standard deviation as shown below.

Equation 11: $\text{Standardized Scalar Performance} = (\mathrm{PSCORE} - 0.50)/\sigma_{\mathrm{PSCORE}}$

[10] Ability to discriminate between players with different levels of performance.
[11] The score values may differ from one tournament to another but the principle of conversion to a binomial variable is common to all scoring systems.

The performance of the forty players in the sample against their combined opposition is assessed using Equation 11 at an error rate of α = 0.000025 for each comparison in order to hold the experiment-wide error rate to α = 0.001 for all forty comparisons. The number of differences found is counted. This number corresponds with the discrimination power of the scalar measure of chess performance and is indicative of the information content of this measure of performance.

4. DATA ANALYSIS

4.1 Two dimensional measure of performance.

The data in Table 1 are used to compute the Euclidean distance of each player from his combined opposition. These distances and their directions [12] are shown in the columns labeled "Distance" and "Angle" in Table 2. A Monte Carlo procedure is used to create a simulated sampling distribution of distances.
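A simulated sampling distribution of this kind can be sketched in Python. This is not the author's procedure, which is in the data archive. The sketch assumes a parametric resampling in which each game is redrawn from the observed win, loss and draw proportions, and it assumes the opposition point is (opposition wins as white, opposition wins as black). All function names are hypothetical.

```python
import math
import random
import statistics


def simulate_rates(played, won, lost, rng):
    """Redraw one color's games; return (player win rate, opposition win rate)."""
    outcomes = rng.choices(("win", "loss", "draw"),
                           weights=(won, lost, played - won - lost), k=played)
    return outcomes.count("win") / played, outcomes.count("loss") / played


def simulated_distances(paw, waw, law, pab, wab, lab, replications=1000, seed=1):
    """Simulated distances between the player point and the opposition point."""
    rng = random.Random(seed)
    distances = []
    for _ in range(replications):
        # Games with the player as white: opposition wins here are wins as black.
        p_white, o_black = simulate_rates(paw, waw, law, rng)
        # Games with the player as black: opposition wins here are wins as white.
        p_black, o_white = simulate_rates(pab, wab, lab, rng)
        distances.append(math.hypot(p_white - o_white, p_black - o_black))
    return distances


# Counts for player P01 taken from Table 1.
distances = simulated_distances(354, 188, 49, 346, 134, 45)
stdev = statistics.stdev(distances)
print(len(distances), stdev)
```

The standard deviation of the simulated distances plays the role of the "Stdev" column used to Studentize the observed distance.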
The details of the computations for all forty players are available in the data archive for this paper (Munshi, Performance paper data analysis, 2014). The simulated replications of the sample data used in these computations may be viewed graphically in the Appendix [13]. Each graph contains two color coded markers. The position of the player is shown in red and that of his opponents is shown in blue. The size and shape of the markers represent the uncertainty in the location of the marker on the graph. When the two markers overlap it indicates no measurable difference in performance. When they are separated it indicates a difference in performance. The greater the separation the greater is this difference as long as the angle is in a positive direction.

The standard deviation of the sampling distribution of the Euclidean distance is computed from the simulation results and shown in Table 2 in percentage terms as "Stdev". The Studentized distance of the player from his combined opposition is computed as Distance/Stdev to serve as a standardized measure of each player's performance as measured by his track record in the selected sample period. Table 2 has been sorted according to the distance vector, taking both magnitude and direction into account, from the highest performance to the lowest.

The critical value of StdDist that corresponds with our experiment-wide error rate of α = 0.001 is StdDist = 4.25. Using this criterion we find that the top 34 players out of 40 listed in Table 2 outperformed their opponents on average. At the bottom of this list is player P27 with StdDist = 5.116 > 4.25. In all of these cases we can reject H0 because StdDist > 4.25 and because the distance vector lies in a positive direction or well above the neutral direction of -45 degrees. In each case we conclude that the player in question performed better than his combined opposition.
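The magnitude part of this decision rule can be sketched in Python. The function names are hypothetical, the direction of the vector is not examined here, and the inputs are the rounded Distance and Stdev values listed in Table 2, so the ratios differ slightly from the tabulated StdDist values.

```python
CRITICAL_STD_DIST = 4.25  # critical value quoted in the text


def studentized_distance(distance, stdev):
    """Distance expressed as a number of standard deviations."""
    return distance / stdev


def rejects_null(distance, stdev, critical=CRITICAL_STD_DIST):
    """True when the Studentized distance exceeds the critical value."""
    return studentized_distance(distance, stdev) > critical


# Rounded Table 2 entries for P01 (first row) and P40 (last row).
p01 = studentized_distance(0.469, 0.036)
p40 = studentized_distance(0.301, 0.103)
print(rejects_null(0.469, 0.036), rejects_null(0.301, 0.103))
```

A full implementation of the criterion would also check the angle of the distance vector, as the text requires for players whose StdDist exceeds 4.25.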
Six of the 40 players in the list achieved a performance level where StdDist < 4.25 or had negative directions approaching the neutral angle of -45 degrees. In cases where the observed StdDist is within the expected sampling variation at the error rate chosen for this study the direction of the distance vector has no interpretation.

[12] The distance vector and its angle are used together to evaluate performance. The white distance and the black distance could have also been used to provide the same information.
[13] For the complete list of all 40 graphs please download the Excel file which is included in the data archive of this paper (Munshi, Performance paper data analysis, 2014).

Player   Distance   Stdev   StdDist     Angle
P01      0.469      0.036   13.040     29.536
P09      0.370      0.035   10.527     23.846
P21      0.430      0.041   10.485     18.457
P05      0.384      0.038    9.981     24.100
P30      0.394      0.040    9.969     13.572
P39      0.417      0.043    9.608     10.819
P12      0.400      0.047    8.514     26.640
P04      0.297      0.034    8.758     20.197
P06      0.296      0.033    8.931     17.517
P02      0.312      0.037    8.492     15.399
P03      0.321      0.037    8.762      1.328
P37      0.326      0.046    7.081     35.422
P07      0.281      0.039    7.116     23.189
P23      0.361      0.045    7.987     22.568
P34      0.320      0.041    7.839     15.968
P20      0.321      0.043    7.544     14.324
P14      0.279      0.035    7.902      5.645
P26      0.335      0.050    6.739     15.164
P28      0.269      0.043    6.272     14.497
P18      0.250      0.040    6.281     11.704
P17      0.226      0.034    6.633      9.070
P29      0.287      0.043    6.649      9.195
P38      0.288      0.044    6.621      9.969
P35      0.337      0.055    6.132     -1.653
P10      0.285      0.042    6.865     -4.501
P36      0.321      0.063    5.139     35.983
P19      0.326      0.055    5.911     21.799
P33      0.288      0.048    5.975     13.902
P16      0.221      0.038    5.754      8.047
P15      0.232      0.042    5.555      5.900
P32      0.185      0.037    5.039     -0.856
P24      0.269      0.049    5.545     -8.470
P25      0.253      0.050    5.064    -16.839
P27      0.200      0.039    5.116    -20.375
P08      0.256      0.049    5.272    -25.809
P11      0.171      0.042    4.093    -23.294
P31      0.187      0.049    3.851    -14.893
P13      0.125      0.037    3.377    -17.236
P22      0.134      0.040    3.401    -32.446
P40      0.301      0.103    2.916     25.204
Table 2 Discrimination power of the two dimensional measure of performance

Assuming that the top forty players in the world are actually better than their combined opposition and that the sample of games taken is representative and not biased, then of the two measures of chess performance being compared, the one with the greater discrimination power will detect this difference in a greater proportion of the players. In the case of the two dimensional measure of performance we find that the proportion detected is 34/40 = 85%. We now compute the corresponding proportion for the scalar measure used by the Elo rating system so that we can compare the discrimination power and, by inference, the information content of the two measures of performance.

It is interesting to note in Table 2 that the directions of the distance vectors are all well below 45 degrees and some of the angles have negative values. This pattern seems to indicate that on average winners do better as white than as black and that losers do better as black than as white. This dichotomy provides further support for measuring chess performance in these two dimensions.

4.2 Scalar measure of performance.

The chess game outcomes in Table 1 are reduced from a trinomial variable to a binomial variable by the use of the scoring procedure. Since binomial variables have only one degree of freedom, a scalar measure may be used to compare the players based on their scoring performance. The results, sorted by standardized scoring performance from highest to lowest, are shown in Table 4. The column labels in Table 4 are described below.
Player = Player identity
Games = The number of games played
Won = The number of games won by the player
Lost = The number of games lost by the player
Draw = The number of games that ended in draw
Score = The player's score
ScoreSE = The standard deviation of the score derived from the Gaussian approximation (labeled Stdev in Table 4)
Std.Score = Standardized score = (Score - 0.5)/ScoreSE

The standardized score serves as our scalar measure of performance. At an experiment-wide error rate of α = 0.001 with 40 comparisons, the error rate for each comparison is α = 0.000025. The corresponding critical value of StdScore in a Normal distribution is StdScore = 4.1. The performance of the 19 players at the top of the list in Table 4 meets this condition. In the case of each of these players we reject H0 and conclude that the player performed better than the average opponent. The performance of the 21 players at the bottom of this list does not meet this condition and for them we fail to reject H0 and find no evidence that they performed better than the opposition. We are now able to compare the proposed two dimensional measure of performance and the widely used scalar measure of performance in terms of their ability to detect differences in chess performance.

4.3 Comparison of discrimination power.

We found that the two dimensional measure of performance was able to detect a difference in performance in 34 out of 40 players or in 85% of the players tested while the scalar measure of performance detected a difference in only 19 out of 40 players or in 47.5% of the players. To determine whether the observed difference of 37.5 percentage points may have been caused by sampling variation we carry out a hypothesis test as noted below.

Research question: Does the two dimensional measure of performance have greater discrimination power?
H0: p1 ≤ p2. The higher level of detection by the two dimensional measure in the sample is a result of sampling variation.
Ha: p1 > p2. The higher level of detection could not have been observed in this sample if the two dimensional measure of performance did not have greater discrimination power.
Error rate: α = 0.001 (The probability of incorrectly rejecting H0 is held to 0.1% or less.)

               TwoDim   Scalar   Pooled
Comparisons    40       40
Detected       34       19
PctDetected    0.85     0.475
Difference                       0.375
Variance                         0.00942
SE                               0.0971
Diff/SE                          3.86
p-value                          <0.001
Decision                         Reject H0
Table 3 Hypothesis test for discrimination power

The details of the hypothesis test are shown in Table 3. Using the usual Gaussian approximation [14] for proportions we find that the pooled variance (the sum of the two binomial variances, p(1-p)/n for each measure) is 0.00942 and the standard error (SE) is 0.0971. In Table 3, we find that the p-value < α and so we reject H0 and conclude that the observed difference could not have been caused by sampling variation and that therefore the two dimensional measure of performance has greater discrimination power than the scalar measure. The implication of this finding is that the scalar measure contains less information than the two dimensional measure of performance. We ascribe this difference to the information loss incurred by the use of the scoring procedure to represent chess outcomes as a binomial variable.

[14] A Monte Carlo simulation was used to verify the results of this approximation method. The simulation is available in the online data archive of this paper (Munshi, Check Gaussian Approximation, 2014).
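The arithmetic in Table 3 can be retraced with a short Python sketch. The function name is hypothetical, and treating the test as one-sided against a standard Normal distribution is an assumption consistent with Ha: p1 > p2. The variance is taken as the sum of the two binomial variances, which is the calculation that yields the tabulated 0.00942.

```python
import math
from statistics import NormalDist


def proportion_difference_test(detected_1, detected_2, n_1, n_2):
    """Gaussian approximation for the difference between two proportions."""
    p1, p2 = detected_1 / n_1, detected_2 / n_2
    # Variance of the difference: sum of the two binomial variances.
    variance = p1 * (1 - p1) / n_1 + p2 * (1 - p2) / n_2
    se = math.sqrt(variance)
    z = (p1 - p2) / se
    p_value = 1.0 - NormalDist().cdf(z)  # one-sided, Ha: p1 > p2
    return variance, se, z, p_value


variance, se, z, p_value = proportion_difference_test(34, 19, 40, 40)
print(round(variance, 5), round(se, 4), round(z, 2), p_value < 0.001)
# prints: 0.00942 0.0971 3.86 True
```

The same function can be applied to other detection counts to see how sensitive the conclusion is to the number of players detected by each measure.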
Player   Games   Won   Lost   Draw   Score    Stdev    StdScore
P01      700     322    94    284    0.6629   0.0179   9.11
P05      732     317   128    287    0.6291   0.0179   7.23
P09      698     271    96    331    0.6254   0.0183   6.84
P30      573     211    73    289    0.6204   0.0203   5.94
P21      419     152    36    231    0.6384   0.0235   5.90
P04      841     310   146    385    0.5975   0.0169   5.77
P39      500     196    72    232    0.6240   0.0217   5.72
P06      869     316   151    402    0.5949   0.0167   5.70
P23      455     177    65    213    0.6231   0.0227   5.42
P02      732     266   123    343    0.5977   0.0181   5.39
P37      510     214    97    199    0.6147   0.0215   5.32
P12      353     131    35    187    0.6360   0.0256   5.31
P34      649     254   123    272    0.6009   0.0192   5.25
P20      500     177    77    246    0.6000   0.0219   4.56
P03      719     262   143    314    0.5828   0.0184   4.50
P38      614     260   155    199    0.5855   0.0199   4.30
P07      481     150    59    272    0.5946   0.0224   4.23
P18      702     275   166    261    0.5776   0.0186   4.16
P14      711     231   122    358    0.5767   0.0185   4.14
P26      351     128    54    169    0.6054   0.0261   4.04
P19      320     125    55    140    0.6094   0.0273   4.01
P17      898     301   183    414    0.5657   0.0165   3.97
P36      274     111    49    114    0.6131   0.0294   3.85
P33      439     170    92    177    0.5888   0.0235   3.78
P29      474     162    82    230    0.5844   0.0226   3.73
P28      474     170    91    213    0.5833   0.0226   3.68
P35      382     157    92    133    0.5851   0.0252   3.37
P16      643     207   125    311    0.5638   0.0196   3.26
P10      556     178   106    272    0.5647   0.0210   3.08
P15      530     162    93    275    0.5651   0.0215   3.02
P40      214      81    38     95    0.6005   0.0335   3.00
P24      492     187   133    172    0.5549   0.0224   2.45
P32      652     192   132    328    0.5460   0.0195   2.36
P25      448     146   113    189    0.5368   0.0236   1.56
P31      405     108    83    214    0.5309   0.0248   1.24
P27      529     136   109    284    0.5255   0.0217   1.18
P08      377     108    88    181    0.5265   0.0257   1.03
P13      633     161   136    336    0.5197   0.0199   0.99
P22      447      84    78    285    0.5067   0.0236   0.28
P11      453      99    95    259    0.5044   0.0235   0.19
Table 4 Discrimination power of the scalar measure of performance

4.4 Correlation Check.
The comparison of the discrimination power of the two measures of chess performance carried out in the previous section requires that the two measures should be comparable in terms of validity and differ only in terms of reliability. In other words, we assume that they are measuring the same underlying reality but with different degrees of precision. A testable implication of this assumption is that the two measures should be correlated. To check whether this assumption is reasonable, we carry out a linear regression test as shown in Figure 1.

[Figure 1: scatter plot titled "Correlation Check", with the Scalar Measure on the x-axis and the Two dimensional Measure on the y-axis; fitted line y = 2.2458x - 0.4833, R² = 0.98]
Figure 1 Correlation between the two measures of performance

The linear regression in Figure 1 shows that the two measures of performance are highly correlated. The hypothesis test for the degree of correlation in Table 5 shows that the probability of observing a correlation this high (or higher) in a sample of forty players taken from a population in which the two measures of performance are not correlated is less than our acceptable error rate of α = 0.001. The test validates our assumption that the two measures are comparable in terms of validity but differ only with respect to reliability.

ANOVA        df   SS        MS        F        p-value
Regression    1   743.891   743.891   1863.4   6.71164E-34
Residual     38    15.170     0.399
Total        39   759.061
Table 5 Hypothesis test for correlation between the two measures of performance

5. CONCLUSIONS

The proposed two dimensional measure of chess performance has more precision and discrimination power than the scalar measure because it contains more information by virtue of the fact that it is true to the trinomial nature of chess game outcomes. Scalar measures such as the Elo rating system contain less information because they rely on scoring to reduce chess game outcomes from a trinomial to a binomial process.
This simplification is achieved at a cost because scoring causes some chess game outcome information to be lost. The information loss occurs at the point of scoring. No amount of mathematical wizardry downstream can recover the lost information. The reduction of chess game outcomes to a binomial variable by the use of scoring was likely a necessary sacrifice of information for the sake of computational simplicity at a time when computational machinery was costly and scarce. It is likely that this relationship between the value of information and the cost of computation no longer applies because of advances in computer technology.

6. APPENDIX

6.1 The high level of uncertainty in small samples.

The utility of head to head pairwise comparisons of chess players is limited by the availability of data. These sample sizes tend to be small. For example, consider the two pairwise comparisons in Table 6. They were collected from an online database in July 2014 (chessgames.com, 2014). The usefulness of this information is limited by the high degree of uncertainty. In both of these cases it is not possible to distinguish the score from its neutral value of 0.50 at any acceptable level of α. The data do not contain useful information because the sample size is too small.

player    opponent   games   won   lost   draw   Score    Stdev    z-value   p-value
Aronian   Shirov     19       5    0      14     0.6316   0.1107   1.19      11.7%
Carlsen   Nakamura   43      16    5      22     0.6279   0.0737   1.74      4.1%
Table 6 Uncertainty of scores in small samples

The small sample problem in pairwise comparisons also exists in the two dimensional measure of performance. The sampling strategy of this study was motivated by the need for large sample sizes.

6.2 Graphical depiction of simulated repetitions of the game data.
[Figures: scatter plots of the simulated repetitions for players P01, P02, P03, P04, P05, P06, P07, P08, P09, P10, P11 and P13. Each panel plots Won as Black (vertical axis) against Won as White (horizontal axis).]

7. REFERENCES

Abdi, H. (2007). Bonferroni Sidak. Retrieved February 2014, from utdallas.edu: http://www.utdallas.edu/~herve/Abdi-Bonferroni2007-pretty.pdf
chessgames.com. (2014). chess game database. Retrieved July 2014, from chessgames.com.
Elo, A. (2008). The rating of chess players past and present. Ishi Press.
FIDE. (2014, August). Top 100. Retrieved August 2014, from fide.com: http://ratings.fide.com/top.phtml?list=men
Glickman, M. (1999). Rating the chess rating system. Retrieved 2014, from Glicko: http://www.glicko.net/research/chance.pdf
Good, I. (1955). On the marking of chess players. Mathematical Gazette.
Johnson, V. E. (2013, November). Revised Standards for Statistical Evidence. Retrieved December 2013, from Proceedings of the National Academy of Sciences: http://www.pnas.org/content/110/48/19313.full
Munshi, J. (2014, February). A method for comparing chess openings. Retrieved April 2014, from arxiv.org: http://arxiv.org/ftp/arxiv/papers/1402/1402.6791.pdf
Munshi, J. (2014). A two dimensional measure of chess performance. Retrieved 2014, from Youtube: http://www.youtube.com/edit?o=U&feature=vm&video_id=wbfrUyB8o8k
Munshi, J. (2014). A two dimensional measure of chess performance. Retrieved 2014, from Youtube: http://www.youtube.com/watch?v=2MNtGhu9zPo
Munshi, J. (2014). Check Gaussian Approximation. Retrieved 2014, from Dropbox: https://www.dropbox.com/s/3cr7w9459oxi4by/GaussianApproximationSimulation.xlsx?dl=0
Munshi, J. (2014). Comparing Chess Openings Part 3. Retrieved 2014, from SSRN: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2441568
Munshi, J. (2014). Pairwise comparison of chess opening variations. Retrieved 2014, from SSRN: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2472783
Munshi, J. (2014, November). Performance paper data analysis. Retrieved November 2014, from Dropbox: https://www.dropbox.com/sh/109n73zx8qvmt6f/AABXz8tm6A1TzqaSIRRrS45a?dl=0
Munshi, J. (2014). Simulated sampling distribution. Retrieved 2014, from Dropbox: https://www.dropbox.com/sh/4yipacz2ujqilwi/AACbLlutdqEfV0Qu2FtSho84a?dl=0
Munshi, J. (2014). The Relative Playing Strength of Chess Players. Retrieved 2014, from SSRN: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2477868