Probabilistic Estimation Based Data Mining For Discovering Insurance Risks
Introduction
The Property & Casualty (P&C) insurance business deals with the insuring of tangible assets, e.g., cars, boats, homes, etc. The insuring company evaluates the risk of the asset being insured, taking into account characteristics of the asset as well as of its owner. Based on the level of risk, the company charges a certain fixed, regular premium to the insured. Actuarial analysis of policy and claims data plays a major role in the analysis, identification, and pricing of P&C risks. Actuaries develop insurance risk models by segmenting large populations of policies into predictively accurate risk groups, each with its own distinct risk characteristics. A well-known segment is male drivers under age 25 who drive sports cars. Examples of risk characteristics include mean claim rate, mean claim severity amount, pure premium (i.e., claim rate times severity), and loss ratio (i.e., pure premium over premium charged). Pure premium is perhaps the most important risk characteristic because it represents the minimum amount that policyholders in a risk group must be charged in order to cover the claims generated by that risk group. Actual premiums charged are ultimately determined based on the pure premiums of each risk group, as well as on the cost structure of the insurance company, its marketing strategy, competitive factors, etc.

Ideally, insurance companies would like to develop risk models based on the entire universe of potential policies in order to maximize the accuracy of their risk assessments. Although no insurer possesses complete information, many insurers, particularly ones operating across large territories, have access to vast quantities of information given their very sizable books of business. A book of business corresponds to either a type of policy or to the set of policies of that type in a territory, depending on context. It is common for such firms to have millions of policies in each of their major regions, with many years of accumulated claims data. The actuarial departments of insurance companies make use of this data to develop risk models for the markets served by their companies.

The availability of large quantities of insurance data represents both an opportunity and a challenge for data mining. The first and perhaps most fundamental challenge for standard data mining techniques is that pure premium is the product of two other risk characteristics: claim frequency and claim severity. Claim frequency is the average rate at which individual policyholders from a risk group file claims. Frequency is usually expressed as the number of claims filed per policy per unit time (i.e., quarterly, annually, etc.); however, it can also be expressed as a percentage by multiplying by 100. For example, a frequency of 25% means that the average number of claims filed in a given unit of time is 0.25 times the number of policies. This is not to say that 25% of policyholders file claims; only about 19.5% will file one claim in the given time period and an unlucky 2.6% will file two or more claims. Thus, the 25% refers to a rate, not a probability. Claim severity is more straightforward: it is simply the average dollar amount per claim.

If one were forced to use standard data mining algorithms, such as CHAID [1], CART [2], C4.5 [3], or SPRINT [4], one might try to view frequency modeling as a classification problem and severity modeling as a regression problem. However, further examination suggests that these modeling tasks are unlike standard classification or regression problems.
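The rate-versus-probability distinction can be checked with a short calculation. Under the Poisson model introduced later in the paper, a frequency of 0.25 claims per policy per unit time yields the 19.5% and 2.6% figures quoted above; the sketch below assumes only that model and is purely illustrative.

    from math import exp, factorial

    def poisson_pmf(k: int, rate: float) -> float:
        """Probability of exactly k claims when the expected number of claims is `rate`."""
        return (rate ** k) * exp(-rate) / factorial(k)

    rate = 0.25                          # 25% claim frequency per unit time
    p0 = poisson_pmf(0, rate)            # no claims:          ~0.779
    p1 = poisson_pmf(1, rate)            # exactly one claim:  ~0.195
    p2_plus = 1.0 - p0 - p1              # two or more claims: ~0.026
    print(p0, p1, p2_plus)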
Viewing frequency prediction as a classification problem is misleading. It is certainly not the case that every individual policyholder will file a claim with either 100% certainty or 0% certainty. In actuality, every individual has the potential to file claims; it is just that some do so at much higher rates than others. The predictive modeling task is therefore to discover and describe groups of policyholders, each with its own unique filing rate, rather than to discover groups that are classified as either always filing claims or never filing claims.

From the point of view of standard data mining algorithms, severity prediction appears to be very much a regression problem, given that the data fields that correspond to this variable have continuous values across a wide range. However, the distributional characteristics of claim amounts are quite different from the traditional Gaussian (i.e., least-squares optimality) assumption incorporated into most regression modeling systems. Insurance actuaries have long recognized that the severity distribution is often highly skewed with long, thick tails. Reliance on the Gaussian assumption for modeling individual claims can lead to suboptimal results, which is a well-known problem from the point of view of robust estimation [5].

A more fundamental obstacle for standard data mining algorithms is that specialized, domain-specific equations must be used for estimating frequency and severity. Equations for estimating frequency must incorporate variables that reflect the earned exposure of each data record; i.e., the amount of time that the corresponding policy is actually in force during the stated time interval. Equations for estimating severity must take into account claim status; i.e., whether a claim is fully settled or still open. Some types of claims can take several years to settle, most notably bodily injury claims. To obtain reliable risk models, all claims must be considered when estimating frequency, but only those claims that are fully settled should be used when estimating severity. Standard data mining algorithms are typically not equipped to make these distinctions, nor are they equipped to perform the necessary calculations. A further complication is that insurance actuaries demand statistical rigor and tight confidence bounds on the risk parameters that are obtained; i.e., the risk groups must be actuarially credible. Actuarial credibility, which is discussed in subsequent sections, is a further requirement that standard data mining algorithms are ill-equipped to handle because they are not designed to perform the calculations needed to ensure that only actuarially credible risk groups are identified.

The above challenges have motivated our own research [6, 7] and have led to the development of the IBM ProbE™ (Probabilistic Estimation) predictive modeling kernel. This C++ kernel embodies several innovations that address the challenges posed by insurance data. The algorithms are able to construct rigorous rule-based models of insurance risk, where each rule represents a risk group. The algorithms differ from standard data mining algorithms in that the domain-specific calculations necessary for modeling insurance risks are not only integrated into the algorithms, they are in fact used to help guide the search for risk groups.

The IBM UPA™ (Underwriting Profitability Analysis) application [8] is built around ProbE and provides the infrastructure for using ProbE to construct rule-based risk models. UPA was designed with input from marketing, underwriting, and actuarial end-users. The graphical user interface is tailored to the insurance industry for enhanced ease of use. Innovative features such as sensitivity analysis help in evaluating the business impact of rules. An iterative modeling paradigm permits discovered rules to be edited and the edited rules to be used as seeds for further data mining. In a joint development project with a P&C insurance company, the UPA solution amply demonstrated the value that a discovery-driven approach can bring to the actuarial analysis of insurance data.
Some of the features incorporated into ProbE were strongly influenced by the mathematical rigor with which actuaries approach the problem of modeling insurance risks. Actuarial science is based on the construction and analysis of statistical models that describe the process by which claims are filed by policyholders (see, for example, [9]). Different types of insurance often require the use of different statistical models. The statistical models that are incorporated into the current version of ProbE are geared toward property and casualty insurance in general, and automobile insurance in particular.

For any type of insurance, the choice of statistical model is strongly influenced by the fundamental nature of the claims process. For property and casualty insurance, the claims process consists of claims being filed by policyholders at varying points in time and for varying amounts. In the normal course of events, wherein claims are not the result of natural disasters or other widespread catastrophes, loss events that result in claims (i.e., accidents, fire, theft, etc.) tend to be randomly distributed in time with no significant pattern to the occurrence of those events from the point of view of insurance risk. Policyholders can also file multiple claims for the same type of loss over the life of a policy. As illustrated in Figure 1, these properties are the defining characteristics of Poisson random processes [9]. ProbE thus uses Poisson processes to model claim filings.
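For illustration, claim filing times under this assumption can be simulated by drawing exponentially distributed inter-arrival times. The sketch below is a minimal, purely illustrative simulation (not part of ProbE), using a frequency of 0.25 claims per quarter.

    import random

    def simulate_claim_times(rate: float, horizon: float, seed: int = 0) -> list:
        """Claim filing times for one policy over [0, horizon].

        Inter-arrival times of a Poisson process are exponentially distributed
        with mean 1/rate, and the process is memoryless.
        """
        rng = random.Random(seed)
        times, t = [], 0.0
        while True:
            t += rng.expovariate(rate)   # time until the next claim
            if t > horizon:
                return times
            times.append(t)

    print(simulate_claim_times(rate=0.25, horizon=40.0))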
[Figure 1: Claim filings plotted over time (vertical axis on a logarithmic scale), illustrating the Poisson nature of the claims process.]
The traditional method used by actuaries to construct risk models involves first segmenting the overall population of policyholders into a collection of risk groups based on a set of factors, such as age, gender, driving distance to place of employment, etc. The risk parameters of each group are then estimated from historical policy and claims data. Ideally, the resulting risk groups should be homogeneous with respect to risk; i.e., further subdividing the risk groups by introducing additional factors should yield substantially the same risk parameters. Actuaries typically employ a combination of intuition, guesswork, and trial-and-error hypothesis testing to identify suitable factors. The human effort involved is often quite high, and good risk models can take several years to develop and refine.

ProbE replaces manual exploration of potential risk factors with automated search. Risk groups are identified in a top-down fashion by a method similar to those employed in classification and regression tree algorithms [2, 1, 4, 3]. Starting with an overall population of policyholders, ProbE recursively divides the policyholders into risk groups by identifying a sequence of factors that produce the greatest increase in homogeneity within the subgroups that are produced. The process continues until each of the resulting risk groups is either declared to be homogeneous or is too small to be further subdivided from the point of view of actuarial credibility.

One of the key differences between ProbE as embodied in the UPA and other classification and regression tree algorithms is that splitting factors are selected based on statistical models of insurance risks. In the case of the UPA, a joint Poisson/log-normal model is used to enable the simultaneous modeling of frequency and severity and, hence, pure premium. The joint Poisson/log-normal model explicitly takes into account insurance-specific variables, such as earned exposure and claim status. In addition, it provides feedback to ProbE's search engine on the degree of actuarial credibility of each proposed risk group, so that only those splitting factors that yield actuarially credible risk groups are considered for further exploration. By explicitly taking these aspects of the problem into account, ProbE is able to overcome the major barriers that cause standard data mining algorithms to be suboptimal for this application.
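The top-down search just described can be summarized in compact form. The sketch below is a simplified rendering of the greedy recursive-partitioning idea; the function names and interfaces (score, candidate_splits, is_credible) are illustrative, not ProbE's actual API.

    def grow_risk_groups(records, score, candidate_splits, is_credible, min_gain=0.0):
        """Recursively partition `records` into risk groups.

        score(group)            -> negative log-likelihood of a group (lower is better)
        candidate_splits(group) -> iterable of (left, right) partitions induced by splitting factors
        is_credible(group)      -> True if the group meets the actuarial credibility constraint
        """
        best = None
        for left, right in candidate_splits(records):
            if not (is_credible(left) and is_credible(right)):
                continue  # only actuarially credible subgroups are considered
            gain = score(records) - (score(left) + score(right))
            if gain > min_gain and (best is None or gain > best[0]):
                best = (gain, left, right)
        if best is None:
            return [records]  # homogeneous, or too small to subdivide credibly
        _, left, right = best
        return (grow_risk_groups(left, score, candidate_splits, is_credible, min_gain)
                + grow_risk_groups(right, score, candidate_splits, is_credible, min_gain))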
The optimization criterion used to identify splitting factors is based on the principles of maximum likelihood estimation. Specifically, the negative log-likelihood of each data record is calculated assuming a joint Poisson/log-normal statistical model, and these negative log-likelihoods are then summed to yield the numerical criterion that is to be optimized. Minimizing this negative log-likelihood criterion causes splitting factors to be selected that maximize the likelihood of the observed data given the joint Poisson/log-normal models of each of the resulting risk groups.

Historical data for each policy is divided into distinct time intervals for the purpose of data mining, with one data record constructed per policy per time interval. Time-varying risk characteristics are then assumed to remain constant within each time interval; that is, for all intents and purposes their values are assumed to change only from one time interval to the next. The choice of time scale is dictated by the extent to which this assumption is appropriate given the type of insurance being considered and the business practices of the insurer. For convenience, quarterly intervals will be assumed to make the discussion below more concrete, but it should be noted that monthly or yearly intervals are also possible.

Assuming that data is divided into quarterly intervals, most data records will span entire quarters, but some will not. In particular, data records that span less than a full quarter must be created for policies that were initiated or terminated mid-quarter, or that experienced mid-quarter changes in their risk characteristics. In the latter case, policy-quarters must be divided into shorter time intervals so that separate data records are created for each change in the risk characteristics of a policy. This subdivision must be performed in order to maintain the assumption that risk characteristics remain constant within the time intervals represented by each data record. In particular, subdivision must occur when claims are filed under a policy in a given quarter, because the filing of a claim can itself be an indicator of future risk (i.e., the more claims one files, the more likely one is to file future claims). The actual time period covered by a database record is the earned exposure of that record.

Figure 3 depicts the database records that are constructed as a result of subdivision. In this figure, Q0, Q1, Q2, etc., represent the ending days of a sequence of quarters. T0 represents the day on which a particular policy came into force, while T1 represents the day the first claim was filed under that policy. Though not illustrated, T2, T3, T4, etc., would represent the days on which subsequent claims were filed. For data mining purposes, the policy claims data is divided into a sequence of database records with earned exposures t1, t2, t3, etc. As illustrated, new policies typically come into force in the middle of quarters. Thus, the earned exposure for the first quarter of a policy's existence (e.g., t1) is generally less than a full quarter. The earned exposures for subsequent quarters, on the other hand, correspond to full quarters (e.g., t2, t3, and t4) until such time that a claim is filed, the risk characteristics change mid-quarter, or the policy is terminated. When a claim is filed or a risk characteristic changes, the data for that quarter is divided into two or more records. The earned exposure for the first database record (e.g., t5) indicates the point in the quarter at which the claim was filed. The earned exposure for the second record (e.g., t6) indicates the time remaining in the quarter, assuming only one claim is filed in the quarter, as illustrated in the diagram. If two or more claims are filed in the quarter, then three or more database records are constructed: one record for each claim and one record for the remainder of the quarter (assuming that the policy has not been terminated). The same applies to other changes in risk characteristics, such as adding or removing drivers, cars, etc., from the policy.

For Poisson random processes, the time between claim events follows an exponential distribution. Moreover, no matter at what point one starts observing the process, the time to the next claim event has the same exponential distribution; that is, the process is memoryless.
[Figure 3: Database records constructed for a single policy. Q0 through Q5 mark quarter boundaries, T0 marks the policy inception, and T1 marks the filing of the first claim; t1 through t7 are the earned exposures of the resulting database records.]
Under this model, the probability of the observed records for a risk group is the product of a factor $e^{-\lambda t_i}$ for each non-claim record and $\lambda e^{-\lambda t_i}$ for each claim record, where $\lambda$ is the claim frequency of the group and $t_i$ is the earned exposure of the $i$th record:

$$\prod_{\text{claim records}} \lambda\, e^{-\lambda t_i} \prod_{\text{non-claim records}} e^{-\lambda t_i}. \qquad (2)$$

Maximizing this likelihood yields the estimated claim frequency of the group, which is simply the total number of claims divided by the total earned exposure:

$$\hat{\lambda} \;=\; \frac{k + l}{\sum_{i=1}^{N} t_i}\,, \qquad (3)$$

where $N$ is the total number of database records for the risk group, and $k$ and $l$ are the numbers of settled and open claims, respectively.
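As a concrete illustration of how policy-quarters are subdivided at claim filings and how Equation 3 is then applied, consider the following simplified sketch (it ignores mid-quarter changes in risk characteristics and policy terminations; the record structure is illustrative, not ProbE's data model).

    from dataclasses import dataclass

    @dataclass
    class Record:
        exposure: float      # earned exposure, in quarters
        is_claim: bool       # True if the record ends with a claim filing

    def split_policy_quarter(start, end, claim_times):
        """Split one policy-quarter [start, end) into records at each claim filing."""
        records, t = [], start
        for ct in sorted(c for c in claim_times if start <= c < end):
            records.append(Record(exposure=ct - t, is_claim=True))
            t = ct
        records.append(Record(exposure=end - t, is_claim=False))  # remainder of the quarter
        return records

    def estimated_frequency(records):
        """Equation 3: total number of claims divided by total earned exposure."""
        return sum(r.is_claim for r in records) / sum(r.exposure for r in records)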
In the case of claim amounts, the joint probability density function for the severities $s_1, \ldots, s_k$ of $k$ settled claims is given by:

$$\left[\,\prod_{i=1}^{k} \frac{1}{\sqrt{2\pi}\,\sigma_{\log}\, s_i}\right] \exp\!\left(-\sum_{i=1}^{k} \frac{\bigl(\log(s_i) - \mu_{\log}\bigr)^2}{2\,\sigma_{\log}^2}\right). \qquad (4)$$
The estimates of the mean log severity $\mu_{\log}$ and the variance of the log severity $\sigma_{\log}^2$ are those typically used for log-normal distributions:

$$\hat{\mu}_{\log} = \frac{1}{k}\sum_{i=1}^{k} \log(s_i) \qquad (5)$$

and

$$\hat{\sigma}_{\log}^2 = \frac{1}{k}\sum_{i=1}^{k} \bigl(\log(s_i) - \hat{\mu}_{\log}\bigr)^2. \qquad (6)$$
Equations 5 and 6 are used during training to estimate the parameters of the severity distribution for individual claims. These estimators presume that the individual severity distributions are log-normal. The usual unbiased estimators for the mean and variance of severity are used after data mining has been completed to estimate the parameters of the aggregate severity distribution:
$$\hat{\mu} = \frac{1}{k}\sum_{i=1}^{k} s_i \qquad (7)$$

$$\hat{\sigma}^2 = \frac{1}{k-1}\sum_{i=1}^{k} \bigl(s_i - \hat{\mu}\bigr)^2. \qquad (8)$$
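A direct transcription of Equations 5-8 (a minimal sketch; `settled_amounts` is a hypothetical list of settled claim severities):

    from math import log

    def severity_estimates(settled_amounts):
        """Log-normal parameters (Eqs. 5-6) and aggregate mean/variance (Eqs. 7-8)."""
        k = len(settled_amounts)
        mu_log = sum(log(s) for s in settled_amounts) / k                    # Eq. 5
        var_log = sum((log(s) - mu_log) ** 2 for s in settled_amounts) / k   # Eq. 6
        mu = sum(settled_amounts) / k                                        # Eq. 7
        var = sum((s - mu) ** 2 for s in settled_amounts) / (k - 1)          # Eq. 8
        return mu_log, var_log, mu, var

    print(severity_estimates([1200.0, 850.0, 15000.0, 400.0, 3200.0]))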
Only fully settled claims are considered when applying Equations 5-8. The severity fields of unsettled claims are often used to record reserve amounts; i.e., the money that insurers hold aside to cover pending claims. Reserve amounts are not actual losses and therefore are not used to develop models for predicting actual losses.

As mentioned earlier, negative log-likelihoods are calculated for each database record in a risk group based on Equations 2 and 4. The nonconstant terms in the negative log-likelihoods are then summed and used as the criterion for selecting splitting factors in the top-down identification of risk groups. The constant terms do not contribute to the selection of splitting factors and, hence, are omitted to reduce the amount of computation. With constant terms removed, the negative log-likelihood score for the $i$th database record is:
$$\begin{cases}
\lambda\, t_i & \text{for non-claim records} \\[4pt]
\lambda\, t_i + \log(1/\lambda) + \log(\sigma_{\log}) & \text{for open claim records} \\[4pt]
\lambda\, t_i + \log(1/\lambda) + \tfrac{1}{2}\log(\sigma_{\log}^2) + \dfrac{\bigl(\log(s_i) - \mu_{\log}\bigr)^2}{2\,\sigma_{\log}^2} & \text{for settled claim records,}
\end{cases} \qquad (9)$$
where $t_i$ is the earned exposure for the $i$th record. Note that the Poisson portion of the model contributes an amount $\lambda t_i + \log(1/\lambda)$ to the score of each claim record and an amount $\lambda t_i$ to the score of each non-claim record. The sum of these values equals the negative logarithm of Equation 2. The log-normal portion of the model contributes nothing to the scores of non-claim records, and an amount $\tfrac{1}{2}\log(\sigma_{\log}^2) + \bigl(\log(s_i) - \mu_{\log}\bigr)^2/(2\sigma_{\log}^2)$ to the score of each settled claim record. The sum of these values equals the negative logarithm of Equation 4 with constant terms (i.e., $\sum_{i=1}^{k} \log(\sqrt{2\pi}\, s_i)$) removed. In the case of open claim records, an expected-value estimate of the log-normal score is constructed based on the scores of the settled claim records. After dropping constant terms from this expected-value estimate, open claim records contribute an amount $\log(\sigma_{\log})$ to the log-normal portions of their scores. If the database records for a risk group contain $k$ settled claims and $l$ open claims, then the sum of the above scores is given by:
$$\sum_{i=1}^{N} \lambda\, t_i \;+\; (k + l)\,\log\!\left(\frac{\sigma_{\log}}{\lambda}\right) \;+\; \frac{1}{2\,\sigma_{\log}^2}\sum_{i=1}^{k} \bigl(\log(s_i) - \mu_{\log}\bigr)^2. \qquad (10)$$
In the above equation, $N$ is the total number of database records for the risk group, the first $k$ of which are assumed for convenience to be settled claim records. Equation 10 is then summed over all risk groups to yield the overall score of the risk model. The top-down procedure described in the previous section identifies risk groups by minimizing the overall score in a stepwise fashion, where each step involves dividing a larger risk group into two smaller risk groups so as to reduce the value of the overall score to the maximum extent possible.

From the point of view of data mining technology, the important thing to note about the above equations is that insurance-specific quantities such as earned exposure and claim status enter into both the equations for estimating model parameters and the equations for selecting splitting factors. Earned exposure effectively plays the role of a weighting factor, while claim status plays the role of a correction factor that adjusts for missing data in one of the two data fields to be predicted (i.e., the settled claim amount given that a claim was filed). Equation 10 essentially replaces the entropy calculations used in many standard tree-based data mining algorithms. It should be noted that entropy is, in fact, a special case of negative log-likelihood, and the calculation of negative log-likelihood need not be restricted to categorical or Gaussian (least-squares) distributions. The development of the joint Poisson/log-normal model presented above illustrates the general methodology one can employ to customize the splitting criteria of tree-based data mining algorithms to take into account data characteristics that are peculiar to specific applications.
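The scoring criterion of Equations 9 and 10 is straightforward to express in code. The sketch below is a simplified, illustrative rendering; the `ScoredRecord` fields are hypothetical names, not ProbE's internal data model.

    from dataclasses import dataclass
    from math import log

    @dataclass
    class ScoredRecord:
        exposure: float           # earned exposure t_i
        status: str = "none"      # 'none', 'open', or 'settled'
        severity: float = 0.0     # settled claim amount s_i (ignored otherwise)

    def record_score(r, lam, mu_log, var_log):
        """Per-record negative log-likelihood with constant terms dropped (Equation 9)."""
        score = lam * r.exposure
        if r.status in ("open", "settled"):
            score += log(1.0 / lam) + 0.5 * log(var_log)   # Poisson claim term plus log(sigma_log)
        if r.status == "settled":
            score += (log(r.severity) - mu_log) ** 2 / (2.0 * var_log)
        return score

    def risk_group_score(records, lam, mu_log, var_log):
        """Equation 10: the sum of per-record scores over one risk group."""
        return sum(record_score(r, lam, mu_log, var_log) for r in records)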
Actuarial Credibility
ProbE's top-down modeling procedure is constrained to produce risk groups that are actuarially credible. In actuarial science, credibility [9] has to do with the accuracy of the estimated risk parameters (in this case, frequency, severity, and ultimately pure premium). Accuracy is measured in terms of statistical confidence intervals; that is, how far can the estimated risk parameters deviate from their true values and with what probability. A fully credible estimate is an estimate that has a sufficiently small confidence interval. In particular, estimated parameter values $X$ must be within a certain factor $r$ of their true (i.e., expected) values $E[X]$ with probability at least $p$:

$$\Pr\!\left[\frac{|X - E[X]|}{E[X]} \le r\right] \ge p. \qquad (11)$$

Typical choices of $r$ and $p$ used by actuaries are $r = 0.05$ and $p = 0.9$. In other words, $X$ must be within 5% of $E[X]$ with 90% confidence.

To ensure that actuarially credible risk groups are constructed, ProbE permits a maximum fractional standard error to be imposed on the estimated pure premiums of each risk group. In the process of subdividing larger risk groups into smaller risk groups, ProbE only considers splitting factors that yield smaller risk groups that obey this constraint. Specifically, each risk group must satisfy the following inequality:

$$\frac{\sqrt{\mathrm{Var}[X]}}{E[X]} \le r', \qquad (12)$$

where $X$ is the pure premium estimate of the risk group, $E[X]$ is the expected value of the pure premium, $\mathrm{Var}[X]$ is the variance of the pure premium estimate, and $r'$ is the maximum allowed fractional standard error. If a splitting factor that satisfies Equation 12 cannot be found for a given risk group, that risk group is declared to be too small to be subdivided and no further refinement of the risk group is performed.

Actuarial credibility is ensured by the fact that, for any pair of values of $p$ and $r$ in Equation 11, there exists a corresponding value of $r'$ for Equation 12 such that

$$\Pr\!\left[\frac{|X - E[X]|}{E[X]} \le r\right] \ge p \quad \text{if and only if} \quad \frac{\sqrt{\mathrm{Var}[X]}}{E[X]} \le r'. \qquad (13)$$

In particular, if $X$ is approximately Gaussian and $p = 0.9$, then the corresponding value of $r'$ as a function of $r$ is

$$r' = \frac{r}{1.645}. \qquad (14)$$

For a 5% maximum error with 90% confidence, the corresponding value of $r'$ would thus be 3.04%. When applying the above credibility constraint, the mean and variance of the pure premium estimate are approximated by their empirical estimates. Thus, the fractional standard error for pure premium is approximated by

$$\frac{\sqrt{\mathrm{Var}[X]}}{E[X]} \approx \sqrt{\frac{1}{k + l} + \frac{1}{k}\,\frac{\hat{\sigma}^2}{\hat{\mu}^2}}\,. \qquad (15)$$

Note that this fractional standard error varies as a function of the statistical properties of each risk group. The determination of when a risk group is too small to be subdivided is thus context-dependent. The ability to impose a context-dependent actuarial credibility constraint on the top-down process by which risk groups are constructed is another important feature of ProbE that distinguishes it from other tree-based modeling methods, such as CHAID [1], CART [2], C4.5 [3], and SPRINT [4].

Equation 15 can also be used to obtain a rough estimate of the amount of data needed to justify a given number of risk groups. In general, the standard deviation of claim severity tends to be at least as large as the mean claim severity; hence, $\hat{\sigma}^2/\hat{\mu}^2 \ge 1$ in most cases. To achieve a 5% maximum error with 90% confidence, a risk group must therefore cover at least 2,164 claim records, or about 108,200 quarterly records given that the average quarterly claim rate for automobile insurance tends to be about 2%. Multiply 108,200 by the number of risk groups and it becomes quite evident that a very large number of quarterly data records must be considered in order to achieve fully credible results.
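The numbers quoted above follow directly from Equations 14 and 15. The sketch below reproduces them under the stated assumptions ($\hat{\sigma}^2/\hat{\mu}^2 = 1$, all claims settled, and a 2% quarterly claim rate).

    # Equation 14: maximum fractional standard error for a 5% error with 90% confidence.
    r, z = 0.05, 1.645
    r_prime = r / z                          # ~0.0304

    # Equation 15 with sigma^2/mu^2 = 1 and all claims settled (l = 0):
    #   sqrt(2 / k) <= r_prime   =>   k >= 2 / r_prime**2
    k_min = int(2.0 / r_prime ** 2)          # ~2,164 claim records
    quarterly_records = k_min / 0.02         # ~108,200 quarterly records at a 2% claim rate

    print(r_prime, k_min, quarterly_records)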
Predictive Accuracy
[Figure 4: Model score versus number of risk groups, evaluated on training and validation data; the best model is the one with the smallest score on the validation data.]
The score of a risk model measured on the training data can always be decreased by introducing enough splitting factors. As more splitting factors are introduced, however, a point of overfitting is reached where the value of the score as estimated on the training data no longer reflects the value that would be obtained on new data. Adding splitting factors beyond this point would simply make the model worse.

Overfitting mathematically corresponds to a situation in which the score as estimated on the training data substantially underestimates the expected value of the score that would be obtained if the true statistical properties of the data were already known. Results from statistical learning theory (see, for example, [10]) demonstrate that, although there is always some probability that underestimation will occur for a given model, both the probability and the degree of underestimation are increased by the fact that we explicitly search for the model that minimizes the estimated score. This search biases the difference between the estimated score and the expected value of the score toward the maximum difference among competing models.

To avoid overfitting, the available training data is randomly divided into two subsets: one is used for actual training (i.e., estimation of parameters and selection of splitting factors); the other is used for validation purposes, to estimate the true performance of the model. As splitting factors are introduced by minimizing the score on the actual training data, a sequence of risk models is constructed in which each successive model contains more risk groups than its predecessors. The true score of each model is then estimated by evaluating Equation 10 on the validation data for each risk group in the model and summing the results. The model that minimizes this unbiased estimate of the true score is selected as the most accurate risk model given the available data. As illustrated in Figure 4, the introduction of each successive splitting factor simultaneously increases the number of risk groups and decreases the score of the risk model on the actual training data. The most suitable risk model is the one with the smallest score on the validation data.
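This model-selection step can be summarized as follows; a minimal sketch assuming a list of candidate models, each of which assigns a record to a risk-group label, and a `group_score` callable that evaluates Equation 10 for the records of one risk group (the interfaces are illustrative).

    def validation_score(model, validation_records, group_score):
        """Total score of a risk model on held-out data: Equation 10 summed over risk groups."""
        groups = {}
        for rec in validation_records:
            groups.setdefault(model(rec), []).append(rec)   # model(rec) -> risk-group label
        return sum(group_score(recs) for recs in groups.values())

    def select_best_model(models, validation_records, group_score):
        """Pick the model with the smallest validation score (the 'best model' of Figure 4)."""
        return min(models, key=lambda m: validation_score(m, validation_records, group_score))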
The UPA solution consists of the UPA application and a methodology for processing P&C policy and claims data using the application. The UPA application is a client-server, Java-based application. On the server side, the ProbE C++ data mining kernel is used for actual execution of mining tasks. The client-server implementation is multithreaded, and a process scheduling subsystem on the server manages and synchronizes requests for ProbE runs that may flow in from any of the clients. Results of mining are available in various graphical and tabular formats, some of which may require a business analyst to interpret, while others can be directly interpreted by a business decision maker.

In preparation for mining, company policy and claims data may be combined with exogenous data, such as demographics, and stored as a set of records. Each record is essentially a snapshot of a policy during an interval of time, including any claim information. Trend information is captured in a set of derived fields. The application is geared to predict pure premium, which is the product of claim frequency and claim severity. Though not explicitly present in the raw data, it is readily computed once mean frequency and mean severity have been estimated.

The user has control over three distinct phases in the mining process.

1. Training is the process in which the application discovers the statistically significant subpopulations that exist in the data.
2. Calibration is the process in which the application applies a second data set to the rules discovered in the training phase, and calibrates the statistics associated with each rule, such as claim rate, claim amount, and pure premium.
3. Evaluation permits a user to evaluate the rules on yet another data set to confirm the actuarial credibility of the calibrated rules.
The training, calibration, and test data sets are constructed so as to be disjoint (i.e., they have no records in common). This is necessary to ensure the statistical reliability of the rules and subsequent analysis. Both the calibration and test data sets are obtained by randomly sampling the entire data set that was constructed for analysis. The claim rates and severities measured on the calibration and test data sets therefore reflect the rates and severities of the entire data set. The training data set, on the other hand, is a stratified random sample in which the proportion of claim to non-claim records is greater than in the entire data set.

The distinction between training and calibration does not generally exist in other data mining algorithms. The distinction was made in the UPA in order to satisfy the demand for actuarial credibility while simultaneously keeping the computational requirements to a reasonable level. Because ProbE makes many passes over the training data in the process of identifying risk groups, it is desirable from a computational standpoint to keep the size of the training set to a minimum. However, actuarial credibility demands large quantities of data. Instead of attempting to achieve full credibility on a large amount of training data, the compromise made in the UPA solution is to achieve a weaker level of credibility on a smaller amount of training data, but to then use a much larger quantity of calibration data to re-estimate the model parameters of each risk group identified during training in order to achieve fully credible parameter estimates. The fractional standard error of pure premium that needs to be achieved on the training data in order to achieve a desired fractional standard error of pure premium on the calibration data is given by the following equation:
$$r'_{\text{training}} = r'_{\text{calibration}}\,\sqrt{\frac{\text{Number of Claim Records in the Calibration Set}}{\text{Number of Claim Records in the Training Set}}}\,. \qquad (16)$$
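For example, a direct transcription of Equation 16 with illustrative claim counts:

    from math import sqrt

    def training_error_target(r_calibration, calibration_claims, training_claims):
        """Equation 16: fractional standard error to enforce on the training data."""
        return r_calibration * sqrt(calibration_claims / training_claims)

    # Illustrative numbers: target 3.04% on 20,000 calibration claims, 5,000 training claims.
    print(training_error_target(0.0304, 20000, 5000))   # ~0.0608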
Another compromise made in the UPA solution is to stratify the training data by randomly excluding a large percentage of non-claim records. Stratification can dramatically reduce the size of the training set; for example, in the case of quarterly automobile data, 90% of the quarterly non-claim records can be removed with minimal impact on the predictive accuracy of the resulting risk groups. Stratification is justified by the fact that its effect is to nonlinearly rescale all estimated claim frequencies, and this nonlinear rescaling can be accounted for in Equations 3 and 10 by linearly scaling the earned exposures of the remaining non-claim records in inverse proportion to the fraction of non-claim records that remain. Thus, if 90% of non-claim records were removed, with 10% remaining, then the effect of stratification can be mathematically compensated for by dividing the earned exposures of all remaining non-claim records by 0.10. The earned exposures of the claim records, on the other hand, would remain the same. By scaling the earned exposures in this fashion, the resulting estimated claim frequencies given by Equation 3 and the resulting model scores given by Equation 10 would then be the same, to within estimation error, as those obtained without stratification. Any differences in these values, and hence any differences in the choice of splitting factors, would be entirely due to the sampling noise introduced through stratification. Note that stratification has no effect on the determination of actuarial credibility, because Equation 15 depends only on the claim records that are present.

Mining runs produce risk models that are represented as collections of rules. A typical rule is illustrated below:

    RULE #22
    IF    Field "VANTILCK" "Vehicle Antilock Break Discount?" = "Antilock Brake"
          Field "VEHTYPE"  "Type of Vehicle"                  = "Truck"
    THEN  claim rate        0.0115561
          mean severity     5516.84
          std dev severity  11619.9
          pure premium      63.753
          loss ratio        0.688204
          608 training claims out of 53221 training points

Several statistics are reported for each rule, including claim rate, mean severity, standard deviation of the severity, pure premium (i.e., claim rate times severity), and loss ratio (i.e., pure premium over premium charged). Two additional statistics reported for each rule are the total number of examples that match the rule and the number of those examples that are claim-related. In the case of the rule illustrated here, 53,221 examples matched the rule, of which 608 had incurred claims.

The risk models produced by ProbE can be used as the basis for establishing new price structures for the premiums charged to policyholders. In addition, the models can be analyzed to uncover nuggets; i.e., previously unknown risk factors that, if incorporated into existing price scenarios, could improve overall profitability.
The first step in uncovering nuggets begins with the lift charts that are generated from a mining run. A typical UPA lift chart is displayed in Figure 5. The X-axis is a cumulative percentage count of the policies, sorted in order of decreasing predicted pure premium; the values therefore range from 0 to 100. The Y-axis is the cumulative percentage of actual premiums collected from, or actual claims paid to, the policyholders in the order defined by the X-axis. The Y-axis therefore also ranges from 0 to 100.

The chart displays three plots. The first plot is that of a hypothetical situation in which a uniform premium is collected for each policy. This essentially represents the scenario in which an insurance firm has no insight about its policies and spreads its risk uniformly across the entire pool. The second plot displays the firm's current actual premium pricing; this plot shows the actual cumulative premiums collected for the policies when sorted in descending order by predicted pure premium. The third plot displays the scenario proposed by the UPA, in which the UPA-recommended pricing is plotted (i.e., the actual cumulative claim amounts for the policies when sorted in descending order by predicted pure premium).

In our experience, the relationships among the curves shown in Figure 5 are commonly encountered in practice. Actuaries have identified many distinct risk groups, and their characteristics have been incorporated into the premiums charged. However, as the lift chart illustrates, the UPA solution has a strong likelihood of discovering previously unknown risk groups and is therefore able to suggest more competitive prices in many situations. The lift charts provide a quick visual indication of whether a detailed analysis of mining results will uncover any nuggets. If an actual mining run results in a lift chart very similar to the one illustrated in Figure 5, then the business analyst has a basis for continuing further investigations of the rules. If the lift chart indicates very little or no difference between actual pricing and the UPA-proposed pricing, then further investigation would likely have little business value.

To uncover nuggets, the analyst needs to first understand the statistics for the entire book of business. The UPA application can present these universal statistics to the user:

    for "Accs This Qtr Ult $ BI+PD"
    claim rate        0.00600882
    mean severity     4676.55
    std dev severity  9165.3
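Returning to the lift chart of Figure 5: such a curve can be computed directly from the scored policies. A minimal, illustrative sketch (the representation of each policy as a (predicted pure premium, actual claim amount) pair is an assumption):

    def lift_curve(policies):
        """Cumulative % of actual claim amounts versus cumulative % of policies,
        with policies sorted by decreasing predicted pure premium."""
        ranked = sorted(policies, key=lambda p: p[0], reverse=True)
        total = sum(p[1] for p in ranked) or 1.0
        curve, running = [], 0.0
        for i, (_, actual) in enumerate(ranked, start=1):
            running += actual
            curve.append((100.0 * i / len(ranked), 100.0 * running / total))
        return curve

    # Each policy: (predicted pure premium, actual claim amount) -- illustrative data.
    print(lift_curve([(120.0, 0.0), (95.0, 4500.0), (60.0, 0.0), (40.0, 800.0)]))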
In all, 43 nuggets were identified using the methodology described in this paper. Six of these nuggets were selected by the insurer for a detailed benefits assessment study using the insurer's internal methodology for evaluating proposed changes to their pricing and/or underwriting practices. The benefits assessment study indicated that implementing just these six nuggets in a single state could potentially realize a net profit gain of several million dollars. The benefits that could be realized by scaling up the business implementation of all 43 nuggets across multiple states are clearly appealing.

One of the six nuggets has already been widely publicized in the media. While it is well known among insurers that drivers of high-performance sports cars are more likely to have accidents than are other motorists, the UPA discovered that if the sports car is not the only vehicle in the household, then the accident rate is not much greater than that of a regular car. In one estimate [11], just letting Corvettes and Porsches into [the insurer's] preferred premium plan could bring in an additional $4.5 million in premium revenue over the next two years without a significant rise in claims. Another publicly disclosed nugget relates to experienced drivers, who tend to have relatively low claim frequencies. However, the UPA turned up a particular segment of experienced drivers who are unusually accident prone.
Evaluation
In addition to the data mining runs that were performed for the purpose of uncovering nuggets, runs were also performed to assess the UPA's ability to identify distinct risk groups as a function of the amount of training data provided, as well as to assess the predictive accuracy of the risk models produced by the UPA versus those obtained using other data mining technologies.

Figure 6 shows an example of the relationship among lift curves that was observed as the amount of training data was varied from 43 thousand records to 1.38 million records. As this figure illustrates, increasing the amount of training data increases the accuracy of the resulting model, as indicated by the increase in lift. Accurate risk models are thus obtained only from large training sets. On the surface, these results seem to contradict the results obtained by Oates and Jensen [12] for classification tree algorithms. Their experiments demonstrate that the error rates of decision tree classifiers tend to rapidly reach a plateau as the number of training records increases. In fact, the plateau is often reached with only a few thousand training records. Once reached, further increases in the number of training records have little effect on the accuracy of the resulting classification tree. A similar plateau almost certainly exists in the case of insurance risk modeling because, ultimately, there is always a limit to the degree of accuracy one can attain in any prediction problem. However, the start of the plateau clearly lies beyond the 1.38 million record mark, instead of the several-thousand-record mark observed by Oates and Jensen.

The reason for this difference has to do with the nature of the prediction problem. Decision tree classifiers make yes/no type predictions, and model accuracy is assessed on the basis of whether those predictions are right or wrong. Risk models, on the other hand, make predictions about the values of continuous parameters (i.e., frequency, severity, and pure premium). Model accuracy is assessed not on whether the predictions are right or wrong, but on how well those predictions reflect reality. Such assessments are analogous to drawing distinctions between shades of gray, instead of the black-and-white distinctions made by classifiers. Moreover, insurance data is inherently noisy, so large amounts of data are needed to obtain accurate parameter estimates. Consequently, the accuracy plateau for risk models will be reached only for very large training sets.

The size of the training sets needed to obtain accurate risk models placed severe constraints on the experiments we were able to perform to compare the UPA to other data mining technologies. Except for SPRINT [4], all of the other tree-based modeling programs available to us (i.e., CART [2] and C4.5 [3]) could not handle the data volumes involved (1.38 million records constitutes roughly one gigabyte of data).
[Figure 6: Lift curves (cumulative percentage of claims versus cumulative policy-quarters) for training sets of 43K, 86K, 346K, 691K, and 1,380K records, together with the uniform-pricing baseline.]
Insurance claims data, as previously discussed, are highly skewed. Some methods of robust estimation involve deleting extreme values (i.e., outliers). Such methods are not appropriate from an actuarial standpoint because extremely high (and extremely low) claims do occur, and the regularity with which they occur must be modeled in order to avoid financial ruin. Other methods of robust estimation are based on the use of probability distributions that better reflect the observed skew of the data, as well as the thickness of the tails of the observed distributions. This approach is the one preferred by actuaries, who routinely make use of a wide range of distributional models in their analyses [9]. The same approach likewise guided the development of ProbE. Because ProbE was developed from the point of view of robust estimation, our a priori expectation was that ProbE would be highly robust with respect to the risk models it produces. The lift curves presented above are consistent with this expectation, and we anticipate that extensive quantitative evaluations will further confirm it.

In conclusion, we demonstrate that extra leverage can be obtained in data mining by 1) employing suitable statistical models that accurately reflect the underlying statistical properties of the data, and 2) incorporating relevant domain-specific constraints, e.g., actuarial credibility for insurance risk discovery in the UPA solution. The ProbE data mining framework has enabled this approach, and will continue to serve as a robust kernel for domains where extracting maximal predictive accuracy in the mining process is at a premium.
References
[1] G.V. Kass. An Exploratory Technique for Investigating Large Quantities of Categorical Data. Applied Statistics, 29(2):119-127, 1980.
[2] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Monterey, CA, 1984.
[3] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[4] J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A Scalable Parallel Classifier for Data Mining. In Proceedings of the 22nd International Conference on Very Large Databases, 1996.
[5] R.R. Wilcox. Introduction to Robust Estimation and Hypothesis Testing. Academic Press, 1997.
[6] J. Hosking, E. Pednault, and M. Sudan. A Statistical Perspective on Data Mining. Future Generation Computer Systems, November 1997.
[7] E. Pednault. Statistical Learning Theory. MIT Encyclopedia of the Cognitive Sciences, 1998.
[8] C. Apte, E. Grossman, E. Pednault, B. Rosen, F. Tipu, and B. White. Insurance Risk Modeling Using Data Mining Technology. In Proceedings of PADD99: The Practical Application of Knowledge Discovery and Data Mining, pages 39-47, 1998. IBM Research Division technical report RC-21314.
[9] S.A. Klugman, H.H. Panjer, and G.E. Willmot. Loss Models: From Data to Decisions. John Wiley & Sons, 1998.
[10] V.N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
[11] L. Bransten. Looking for Patterns. The Wall Street Journal, pages R16 and R20, June 21, 1999.
[12] T. Oates and D. Jensen. Large Datasets Lead to Overly Complex Models: An Explanation and a Solution. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages 294-298, 1998.