Selecting Attributes For Sport Forecasting Using Formal Concept Analysis
Selecting Attributes For Sport Forecasting Using Formal Concept Analysis
Selecting Attributes For Sport Forecasting Using Formal Concept Analysis
supporters. Roughly speaking, three dimensions have been considered for analysing/synthesizing prediction systems: 1)Those which analyse information on teams (endogenous) versus those which analyse results (exogenous); 2)Those which exploit quantitative data versus those which exploit qualitative knowledge, and nally, 3)Statistic-based ones versus other methods. Usually, one can work with hybrid models, and rarely with pure qualitative and exogenous reasoning systems appear in literature, although their use is considered for experiments (for example, frugal methods (Goldstein & Gigerenzer 2009) and based on the recognition heuristic (Goldstein & Gigerenzer 2002)) or as part of hybrid systems (see e.g. (Min et al. 2008)). There are two reasons that may justify this point. On the one hand, transformation from a large quantitative dataset to a qualitative problem is faced with the selection of an acceptable threshold and the discovery of better relations (see e.g. (Imberman et al. 1999)). On the other hand, a qualitative dataset must be accomplished with some amount of information based on condence, trust or probability of these data sets. The aim of this paper is to describe all researching work made for selecting and computing attribute sets related to soccer results, into a specic framework: FCA, and starting from soccer match results, with no previous analysis of any other specic attributes. This task is previous to build an Expert System for advising sport betting which could detects some kind of regularities on data. Concept lattices, which are computed from attribute values, represent a mathematical structure of relationships among the concepts which are involved in selected sport events to study. Since this method is bet-oriented, its performance is evaluated within a condence-based reasoning system. This sistem increases number of hits in soccer matches forecasting, discovering temporal trends by means of data mining and association rules reasoning. The analysis of attributes has been used in (Aranda-Corral et al. 2011) to describe a condence-based (and contextual) reasoning system for forecasting sports betting. In this paper we analyse the attribute selection problem as a problem of selection of features that shape the behaviour
Introduction
Formal Concept Analysis (FCA) (Ganter & Wille 1999) is a mathematical theory for data analysis using formal contexts and concept lattices as key tools. Domains can be formally modelled according to the extent and the intent of each formal concept. In FCA, the basic data structure is a formal context (with a qualitative nature) which represents a set of objects and their properties and it is useful both to detect and to describe regularities and structures of concepts. It also provides a sound formalism for reasoning with such structures, mainly Stem Basis and association rules. Therefore, it is interesting to consider its application for reasoning with temporal qualitative data in order to discover temporal trends (Aranda-Corral et al. 2011). In this paper, FCA application scope is the challenge of sports betting, specically, the forecasting of soccer leagues results. Forecasting sport results is a fast growing research area, because of its economic impact in betting markets as well as for its potential application to problems with similar behaviour (markets) (Inst. Engineering and Technology 2010). Considering sports betting as a complex system, soccer leagues represent a challenging system with a huge amount of knowledge, available through WWW, and its behaviour is weekly exhaustive analysed by journalists, betting companies and
Figure 1: FCA based model for prediction of qualitative features of Complex Systems of the complex system that represents professional soccer leagues. Theoretical framework, on which this model is based on, will be presented at (Aranda-Corral et al. 2011b). Due to a really huge amount of information, attribute selection advised by experts is mandatory. In fact, the system can be considered as a reasoning model based on bounded rationality and recognizition heuristics. and focused on features which were considered as important by human experts. Therefore, the system aims to forecast results, but it is designed based on bounded rationality models, instead of statistic models (although in the future hybrid models will be considered. The system is a rst prototype from a more general system, which are building to analyse qualitative features of Complex Systems (see Fig. 1), using FCA. The idea is to isolate qualitate attributes from (past) local interactions among components of complex system and to apply FCA tools in order to predict properties systems behavior in a near future. denes X := {a A | oIa for all o X} Y := {o O | oIa for all a Y } A (formal) concept is a pair (X, Y ) such that X = Y and Y = X. For example, concepts from living beings formal context (Fig. 2, left) is depicted in Fig. 2, right. Using this Fig. 2, each node is a concept, and its intension (or extension) can be formed by the set of attributes (or objects) included along the path to the top (or bottom). E.g. The node tagged with the attribute Legs represents to the concept ({Legs, M obility, N eedW ater}, {Cat, F rog}). In this paper it works with logical relations on attributes which are valid in the context. Logical expressions in FCA are implications between attributes. An implication is a pair of sets of attributes, written as Y1 Y2 , which is true with respect to M = (O, A, I) according to the following denition. A subset T A respects Y1 Y2 if Y1 T or Y2 T . It says that Y1 Y2 holds in M (M |= Y1 Y2 ) if for all o O, the set {o} respects Y1 Y2 . In that case, it is said that Y1 Y2 is an implication of M . Denition. 1 Let L be a set of implications and L be an implication. 1. L follows from L (L |= L) if each subset of A respecting L also respects L. 2. L is complete if every implication of the context follows from L. 3. L is non-redundant if for each L L, L \ {L} |= L. 4. L is a (implication) basis for M if L is complete and nonredundant. It can obtain a basis from the pseudo-intents (Guigues & Duquenne 1986) called Stem Basis (SB): L = {Y Y : Y is a pseudointent}
Figure 2: Formal context, associated concept lattice and Stem Basis A SB for the formal context on live beings is provided in Fig. 2 (right). It is important to remark that SB is only an example of a basis for a formal context. In this paper any specic property of the SB can be used, and it can be replaced by any implication basis. It is possible to extend |= in relation to any propositional formula with propositional variables in A, by considering each object o M as a valuation vo on A dening vo (A) = 1 (o, A) I Thus M |= F if and only if for any o O it holds that vo |= F . The Armstrong rules (Armstrong 1974) provides a formal basis for implicational reasoning: XY X Y, Y Z W , X X X Z Y X Z W A set of implications is closed if and only if the set is closed by these rules (Armstrong 1974). By dening A as the proof relation by Armstrong rules, it holds that the implicational bases are A -complete: Theorem 2 Let L be an implicational basis for M , and L an implication. Then M |= L if and only if L A L In order to work with formal contexts, stem basis and association rules, the Conexp1 software has been selected. It is used as a library to build the module which provides the implications (and association rules) to the reasoning module of our system. The reasoning module is a production system based on which was designed for (Aranda-Corral & Borrego-Daz 2010). Initially it works with SB, and entailment is based on the following result: Theorem 3 Let L be a basis for M and {A1 , . . . , An } Y A. The following conditions are equivalent: 1. S {A1 , . . . An } production system). 2. S
A p
Y (
A1 , . . . An Y
3. M |= {A1 , . . . An } Y .
1
http://sourceforge.net/projects/conexp/
Figure 3: Context based reasoning system Denition. 4 Let M = (O, A, I) be the monster context, and let O be a set of objects. 1. A context on O is a context M = (O1 , A, I) where O O1 O 2. A contextual selection on O and M is a map s : O P(O1 ) P(A) 3. A contextual KB for an object o O w.r.t. a selection s with condence is a subset of association rules with condence greater or equal to of the formal context associated to s(o) = (s1 (o), s2 (o)), that is, to the context M (s(o)) := (s1 (o), s2 (o), I s1 (o)s2 (o) ) (note that when condifence is 1 the contextual KB is a implicational basis). Contextual KBs is useful for entailing attributes on an object. The reasoning model on M is argumentative, where the argument is based on KBs extracted from subcontexts (Aranda-Corral et al. 2011b): Denition. 5 Let L be an implication and a background knowledge. It is said that L is a possible consequence of M under , M |= L, if there exists M a nonempty subcontext of M such that M |= {L}. Note that by theorem 3, when is a set of implications, it holds that |= is equivalent to which is dened by: M L if there exists M |= a subcontext of M such that S p L (where S is a stem basis for M ).
Figure 4: Concept Lattice for the match M laga-Sevilla (week 31, season 2009-10) a These attributes have to be computed and used to entail the forecasting. This analysis is assisted by Conexp. ConExp software is used to compute and analyze the concept laticces associated to the temporal contexts. In order to select most interesting attributes for the system, starting from an initial conguration, user can compute the associated concept lattice and check it. In this way, attributes goodness (and thresholds) can be evaluated to reconsider current attribute selection. For example, in Fig. 4, the concept lattice associated to contextual selection for M laga-Sevilla match a is shown. This contextual selection is obtained from a given attribute selection and last 38 weeks matches before. In this concept lattice, the attribute ID 1 T 16 is dened by: the budget of team2 is greater than 1 times the budget of team1 , where 1 is the threshold the expert must estimate. In the concept lattice we can observe that the biggest concept containing the attributes team2 wins and ID 1 T 16 covers the about the 10% of the objects owned by the rst attribute, therefore it is suggested to use the second attribute for reasoning with association rules to get a prediction. The system computes the value of an amount of attributes on objects. Experimentally a boolean combination of attributes is possible. Once the temporal context has been computed, the system can build contextual selections by selecting the match and the attribute set. The selection of attributes was made by considering four kinds of factors: those related with the classication, the history of teams matches in the recent past, results of direct matches and other non related results, as for example the difference between team budgets. Seventeen relevant attributes were selected.The attribute set has three special attributes, T eam1 wins (1), T eam2 wins (2) and draws (X). With respect to data source, they are automatically extracted from RSSSF Archive2 . Objects are matches and attributes are a list of features, including temporal stamp (week, year). Data was collected for the past four years. Actually the size of the context is about 300 objects and 18 attributes (although several of them are parametrized, see
2
Attribute selection
We have chosen a small set of attributes with many possibilities through a few customizable parameters. When these parameters are having set up with proper values, the set of attributes will represent teams logical behavior. Recall that formal concept analysis works with qualitative attributes and all teams information which we work with are quantitative data. Thus it is necessary to convert quantitative attributes into qualitative ones. This task is left to users by choosing a proper threshold to each attribute. Before choosing the set of base attributes , we have carried out a analysis on information about soccer results . The aim have been to discover which factors are more inuential in teams behavior and which ones are less inuential. First of all, we have collected any interesting factor found, and after analyzing each one, individually, we have chosen most suitable ones. Examined factors can be classied in four different categories (see Table 1): those related to seasons classication, those related to previous teams results, those related to historical direct matches and any other factors. It is worth to note that to increase possibilities of the attribute set, and considering the Boolean nature of formal context attributes, we have added the option to create new ones by means of logical combinations of these attributes. According to considered factors, the system computes a base set of 18 attributes, which are customizable by some parameters. This will let us to obtain a diverse set of attributes. In Table 2 attributes are specied. Four parameters are used: Threshold: Parameter to be used to translate quantitative attribute values into qualitative ones. Team: Recall that in the formal context considered, objects are matches but attributes belongs to team properties. This parameter will set the team from object (match) on which attribute will be considered. It has two possible values: {HOME, AWAY}. Thus, usually, we will have
http://www.rsssf.com
twice each attribute at context, once for home team and once for away. Number of Matches: sets the number of past matches to be considered when some attributes are computed, e.g. the ones associated to previous team results. Kind of matches: sets past matches type to be taken into account to compute some attributes, considering home/away teams condition at matches. Three possible values: {MATCHES AS HOME TEAM, MATCHES AS AWAY TEAM, ALL MATCHES} With these parameters, and the possibility to compound attributes, it is possible to build a detailed attributes set. Note that experiments show that simplest and most logical attributes give a good team behavior representation. Although we consider that a versatile attributes set, as above described, was necessary because of a huge number of factors can determine the result of a soccer match. Task of customizing the attribute set is left to users, and it is the most important one in forecasting process. Thus, a basic soccer knowledge should be required. The goodness of customization will determine system results.
we need to handle the situation of a team playing in a different division from current season division. Other troubled situation where there are not enough matches for attribute computation is to compute results for directed matches between two teams because of there is only a few of such matches in the data source. For these two related situations we offer two solutions. First is to compute attribute with a null value, but in this way we are giving a fake information to the system. We are setting that attribute is not true but, in fact, we have not information enought to determine it, so a better approach is required. Chosen solution is based on adjusting attributes threshold. The value of this threshold is decreased proportionally to relation between number of required matches and number of available matches. Threshold is revised by number of match results available number of match results needed
new = old
When number of required matches is too high and number of available matches is low, it looks like we are giving fake information to system again, but our experience shows that collateral effects of this approach are worthless compared to compute attributes with a null value.
Computing problems
The way of competition causes to take into account some special situations for computing attributes values. In this section we describe the main problems emerged and how they were xed. Roughly speaking, these main problems concerns to initial matches in season.
Strict attributes
An attribute is strict when only a few objects can satisfy it, because of its threshold is too high. By working with sets of strict attributes, we can assure that they estimate the teams behavior better than other sets. Thus, with strict attributes, we will have very reliable estimates, but just only for very few matches, and non for most of others. In the other hand, using less strict attributes, system will produce less reliable estimations but for a big scope. So it is essential to nd a balance between these two opposite situations: reliability of attribute set against number of matches without information. A good solution could be to build and use different attribute sets, ones more strict and others less. Thus, less strict attribute sets will be used when strict ones fail doing an estimation.
Factors Associated to the classication in the league Team in the rst classication level Team in the last classication level Difference between teams classications Team was in a different league last year Team socred a important number of goals (in the last matches) Associated to previous results of the team Number of consecutive won matches. Number of consecutive lost matches. Number of consecutive draws. Number of non consecutive won matches in previous weeks. Number of non consecutive lost matches in previous weeks. Number of non consecutive draws in previous weeks. Points collected in previous weeks. Factors related with directed matches (incluidas previous years) Number of wins in previous directed matchs number of losts in previous directed matchs number of draws in previous directed matchs Other Factors Number of red cards collected by the teams players. Wheather the day and the city where the match took place Motivation because of the fans support when playing as home team. Team hires a new coach. Some players of the team are selected for their National Team. Difference between teams budgets. One or more important teams players are injured. Cups collected in the lasts years.
medium/high medium/high medium/high low medium high high medium/Low high medium low
yes yes yes no no (hard to parametrize) no (hard to parametrize, subjective) no (only useful when new coach hired) no (relevant for some nationalities) yes no (hard to automatically collect the data) no (only for a few of teams)
Table 1: Factors considered for selecting/building attributes home team and more exigent threshold for attributes related to away team. Therefore, it will be easier for home team to satisfy an attribute than away team. It is possible to imitate this trend based on this approach. Around 50% of played matches nish with victory of home team. This means that the attribute value, corresponding to matches result, will be home team victory around 50% of objects from formal context. As consequence of the former, many rules from the inferred association rules will contain the attribute result = home team victory within their conclusions. Thus when forecasting a match the system will infer, in most of cases, home team victory as consequence of overestimation condence value for this result. It is possible to avoid this effect easily, just applying a decreasing (reduction) factor over condence for home team victory. It is estimated by means of experiments.
Results
Following the process described above, an experiment was run for the Spanish premier soccer league from 2009-10. Attributes were selected according the experience of an expert, and contextual KB is computed (in Fig. 5 a KB fragment for M laga-Sevilla match is shown). From this selection is a computed for each match in each week. 2009-10 season: Experiments with the system show forecasts of about 58.16% by a contextual selection based on the previous 38 matches of each team. Such a percentage of hits for a qualitative reasoning system may be considered as an acceptable result comparable with expectable results of experts (Goldstein & Gigerenzer 2009;
Attribute 1) Number of non consecutive won matches in previous weeks > threshold 2) Number of non consecutive lost matches in previous weeks > threshold 3) Number of non consecutive draws in previous weeks > threshold 4) Points collected in previous matches> threshold 5) Position in the classication based on previous matches> threshold 6) Number of positions over the opponent in the classication based on previous matches> threshold 7) Number of positions under the opponent in the classication based on previous matches> threshold 8) Number of wins in previous directed matchs (included previos leagues) > threshold 9) Number of losts in previous directed matchs (included previos leagues) > threshold 10) Number of drawns in previous directed matchs (included previos leagues) > threshold 11) Position in the classication > threshold 12) Number of positions over the opponent in the classication> threshold 13) Number of positions under the opponent in the classication> threshold 14) Number of consecutive won matches> threshold 15) Number of consecutive lost matches> threshold 16) Number of consecutive draws> threshold 17) Teams budget Y times bigger than opponents budget (Y > threshold) 18) Teams budget Y times smaller than opponents budget (Y > threshold)
Congurable parameters <threshold> <Team> <Number of Matchs> <Matchs> <threshold> <Team> <Number of Matchs> <Matchs> <threshold> <Team> <Number of Matchs> <Matchs> <threshold> <Team> <Number of Matchs> <Matchs> <threshold> <Team> <Number of Matchs> <Matchs> <threshold> <Team> <Number of Matchs> <Matchs> <threshold> <Team> <Number of Matchs> <Matchs> <threshold> <Team> <Number of Matchs> <Matchs> <threshold> <Team> <Number of Matchs> <Matchs> <threshold> <Number of Matchs> <Matchs> <threshold> <Team> <Matchs> <threshold> <Team> <Matchs> <threshold> <Team> <Matchs> <threshold> <Team> <Matchs> <threshold> <Team> <Matchs> <threshold> <Team> <Matchs> <threshold> <Team> <threshold> <Team>
Table 2: Attributes and parameters teams and past matches. 2010-11 season: According to the idea commented above, we have evaluated the system in the second half of 2010-11 soccer season. A way to evaluate how good is this forecasting sistem is comparing number of successes in our pool with the most popular betting selections. This popular selections are collected from the most voted results for each match, published at state agency web that controls soccer pools. In Fig. 6 both results are compared. Our hits are in blue and popular ones in green and last seventeen weeks from 2010-11 season are represented. Note that Spanish soccer pools are over 15 matches.
Figure 7: Comparative of correct predictions on the whole season 2010-11. Percentages namics, a partial understanding of system. In this paper, FCA is applied to this aim with a specic application. The selection attribute problem based on FCA-based reasoning system for sport forecasting is analysed. In fact, the reasoning system is a computational logic model for bounded rationality. The model is concerned with association rule reasoning and it does not use -in its current form- more sophisticated probability tools (as for example (Min et al. 2008)). As is stated in (Goldstein & Gigerenzer 1996), the theory of probabilistic mental models assumes that inferences about unknown states of the world are based on probability cues (Brunswik 1955). It can say that condence of association rules extracted from subcontexts play the role of probability cues. Any statistical approaches have been taken into account, because of it was not the aim of this paper. Although a comparative study of our system against C4.5 classier has been done. For this, two different attribute selections have been considered and used for both, C4.5 classier and our system. The experiment is to forecast all matches (380) in season 2010-11. In order to stimate each match result, considering N (weeks) as timestamp, previous matches are used to build contextual selection (or trainining set in C4.5) from weeks N 1 to N 19 (190 objects). Fig. 7 shows the percentage of correct predictions for our system and C4.5 classier, using both attribute selections. Other cols are also shown: users most voted results, local team always win and two random generated. These random generated cols were built assuming different weigths per result. It means, < 1 : 55%, X : 23%, 2 : 22% > and < 1 : 65%, X : 18%, 2 : 17% > were used, where 1, X, 2 are the probabilities for forecasting a match with the result: local team wins, drawn and away team wins, respectively. It is worth to note that, while classier achieves highest performances (58,68%) when number of matches increase from 190 to 380, our system reaches this highest performance (59,74%) using only 190 instances. This conclusion is based on our system use some fast and frugal (Goldstein & Gigerenzer 1996) methods, and these are designed to achieve aceptable results using as less as possible resources. The relationship of our proposal with Recognition Heuristics (Goldstein & Gigerenzer 2002) (roughly speaking, if one of the possibilities is recognized and the other is not, then infer that the recognized object has the higher value with respect to the criterion) is not clear. We may assert that our model recognises trends in contexts. Trends (represented as association rules) can be considered as a kind of recognizing method, though. The system is based on bounded rationality models instead of statistic models, although in future hybrid models will be considered. In the short term, we carry on extending our system in order to be able to combine the results of two or more attribute sets with different exigency level. Therefore the system will return only one result and more reliable. In the long term, we aim to extend the model in orfer to obtain a general system
to detect emergent concepts in Complex Systems After some real betting experiments during current season (2010-2011) with one customized attribute set, we have observed another intriguing fact. If we take a look to number of successful predictions per week, we are able to distinguish some groups of consecutive weeks in which number of correct predictions is under or over the average. Recall that these predictions are the logical inferred results by one customized attribute set. This suggests that it could be possible to nd another attribute set, with a different parameters customization, which it will accomplish the correct predictions of rst attribute set. It means that when rst attribute set produce bad forecasting, second should produce good ones, and vice versa. The reason of this is that each match there is not only one possible logical result. It means, when one of rsts teams of current ranking plays against one of lasts team, attending to ranking criteria, the logical result of this match would be that rst one wins. But if we attend to others, like rst team lost last week and second team won last 5 weeks, this results would be different. Future works pass through for nding these complementary attribute sets and detecting when their behaviors change during season in order to select the proper attribute set to forecast each week. Finally, we are also analyzing how to nde a weight for matches which allows the system to work with matches from different divisions, simultaneously. Note that a winning match at rst division will have a higher weight than a winning at second. This will be really useful at the beginning of season because of we need to compute attributes related to previous matches results and teams which are involved played at different divisions last season.
G. A. Aranda-Corral, J. Borrego-Daz, J. Gal n-P ez, Condence a a Based Reasoning with Local Temporal Formal Contexts. to appear in IWANN 2011, LNCS (2011). G. A. Aranda-Corral, J. Borrego-Daz, J. Gal n-P ez, Bounded Ra a a tionality for Data Reasoning based on Formal Concept Analysis. To appear in DEXA Workshop DALI (2011). W. Armstrong, Dependency structures of data base relationships. Proc. of IFIP Congress, Geneva, 580-583 (1974). J.L. Balc zar, Redundancy, Deduction Schemes, and Minimuma Size Bases for Association Rules, Logical Methods in Computer Science 6(2):1-23 (2010). E. Brunswik, Representative design and probabilistic theory in a functional psychology. Psychological Review, (62):193-217 (1955). B. Ganter and R. Wille. Formal Concept Analysis - Mathematical Foundations. Springer, 1999. J. C. Giarratano, G.D. Riley, Expert Systems: Principles and Programming. Brooks/Cole Publishing Co ( 2005). D. G. Goldstein, G. Gigerenzer, Reasoning The Fast and Frugal Way: Models of Bounded Rationality, Psychological Review 103(4): 650-669 (1996). D. G. Goldstein, G. Gigerenzer, Models of ecological rationality: the recognition heuristic, Psychological review, 109(1): 7590 (2002). D.G. Goldstein, G. Gigerenzer, Fast and frugal forecasting. International Journal of Forecasting, 25, 760-772 (2009). Guigues, J.-L., Duquenne, V.: Familles minimales d implications informatives resultant dun tableau de donnees binaires. Math. Sci. Humaines 95, 518 (1986). S. P. Imberman, B. Domanski, R. A. Orchard: Using Booleanized Data To Discover Better Relationships Between Metrics. Int. CMG Conference 1999: 530-539 B. Min, J. Kim, C. Choe, H. Eom, R. I. McKay, A compound framework for sports results prediction: A football case study. Know.-Based Syst. 21(7):551-562. 2008
Acknowledgements
Supported by TIN2009-09492 project of Spanish Ministry of Science and Innovation, and Excellence project TIC-6064 of Junta de Andaluca conanced with FEDER founds.
References
Why Spain will win..., Engineering & Technology 5 June - 18 June 2010. J. A. Alonso-Jim nez, G. A. Aranda-Corral, J. Borrego-Daz, and e M. M. Fern ndez-Lebr n, M. J. Hidalgo-Doblado, Extenda o ing Attribute Exploration by Means of Boolean Derivatives, Proc. 6th Int. Conf. Concept Lattices and Their Applications (CLA2008), pp. 121-132 (2008). P. Andersson, M. Ekman, J. Edman, Forecasting the fast and frugal way: A study of performance and information-processing strategies of experts and non-experts when predicting the World Cup 2002 in soccer, Working Paper Series in Business Administration 2003:9, Stockholm School of Economics. G. A. Aranda-Corral, J. Borrego-Daz, Reconciling Knowledge in social tagging web services. Proc. 5th Int. Conf. Hybrid AI Systems (HAIS 2010), LNAI, vol. 6077. Springer-Verlag, Berlin, 383-390 (2010).