Loss Data Analytics (Frees)
Preface
Contributor List
Acknowledgements
    Reviewer Acknowledgment
2 Frequency Modeling
    2.1 Frequency Distributions
        2.1.1 How Frequency Augments Severity Information
    2.2 Basic Frequency Distributions
        2.2.1 Foundations
        2.2.2 Moment and Probability Generating Functions
        2.2.3 Important Frequency Distributions
    2.3 The (a, b, 0) Class
    2.4 Estimating Frequency Distributions
        2.4.1 Parameter estimation
        2.4.2 Frequency Distributions MLE
    2.5 Other Frequency Distributions
        2.5.1 Zero Truncation or Modification
    2.6 Mixture Distributions
    2.7 Goodness of Fit
    2.8 Exercises
    2.9 R Code for Plots in this Chapter
    2.10 Further Resources and Contributors
6 Simulation
    6.1 Generating Independent Uniform Observations
    6.2 Inverse Transform
    6.3 How Many Simulated Values?
Book Description
• The online version contains many interactive objects (quizzes, computer demonstrations, interactive
graphs, video, and the like) to promote deeper learning.
• A subset of the book is available for offline reading in pdf and EPUB formats.
• The online text will be available in multiple languages to promote access to a worldwide audience.
The online text will be freely available to a worldwide audience. The online version will contain many
interactive objects (quizzes, computer demonstrations, interactive graphs, video, and the like) to promote
deeper learning. Moreover, a subset of the book will be available in pdf format for low-cost printing. The
online text will be available in multiple languages to promote access to a worldwide audience.
This book will be useful in actuarial curricula worldwide. It will cover the loss data learning objectives of the
major actuarial organizations. Thus, it will be suitable for classroom use at universities as well as for use
by independent learners seeking to pass professional actuarial examinations. Moreover, the text will also
be useful for the continuing professional development of actuaries and other professionals in insurance and
related financial risk management industries.
An online text is a type of open educational resource (OER). One important benefit of an OER is that it
equalizes access to knowledge, thus permitting a broader community to learn about the actuarial profession.
Moreover, it has the capacity to engage viewers through active learning that deepens the learning process,
producing analysts more capable of solid actuarial work. Why is this good for students and teachers and
others involved in the learning process?
Cost is often cited as an important factor for students and teachers in textbook selection (see a recent post
on the $400 textbook). Students will also appreciate the ability to “carry the book around” on their mobile
devices.
Although the intent is that this type of resource will eventually permeate throughout the actuarial curriculum,
one has to start somewhere. Given the dramatic changes in the way that actuaries treat data, loss data seems
like a natural place to start. The idea behind the name loss data analytics is to integrate classical loss data
models from applied probability with modern analytic tools. In particular, we seek to recognize that big data
(including social media and usage based insurance) are here and high speed computation is readily available.
Project Goal
The project goal is to have the actuarial community author our textbooks in a collaborative fashion.
To get involved, please visit our Loss Data Analytics Project Site.
Contributor List
Acknowledgements
Edward Frees acknowledges the John and Anne Oros Distinguished Chair for Inspired Learning in Business
which provided seed money to support the project. Frees and his Wisconsin colleagues also acknowledge
a Society of Actuaries Center of Excellence Grant that provided funding to support work in dependence
modeling and health initiatives.
We acknowledge the Society of Actuaries for permission to use problems from their examinations.
We also wish to acknowledge the support and sponsorship of the International Association of Black Actuaries
in our joint efforts to provide actuarial educational content to all.
Reviewer Acknowledgment
• Hirokazu (Iwahiro) Iwasawa
Chapter 1
Introduction to Loss Data Analytics
Chapter Preview. This book introduces readers to methods of analyzing insurance data. Section 1.1 begins
with a discussion of why the use of data is important in the insurance industry. Although obvious, the
importance of data is critical - it is the whole premise of the book. Next, Section 1.2 gives a general overview
of the purposes of analyzing insurance data which is reinforced in the Section 1.3 case study. Naturally,
there is a huge gap between these broad goals and a case study application; this gap is covered through the
methods and techniques of data analysis covered in the rest of the text.
1.1 Relevance of Analytics
Insurance is a data-driven industry and analytics is a key to deriving information from data. But what
is analytics? Making data-driven business decisions has been described as business analytics, business
intelligence, and data science. These terms, among others, are sometimes used interchangeably and sometimes
used separately, referring to distinct domains of applications. As an example of such distinctions, business
intelligence may focus on processes of collecting data, often through databases and data warehouses, whereas
business analytics utilizes tools and methods for statistical analyses of data. In contrast to these two terms
that emphasize business applications, the term data science can encompass broader applications in many
scientific domains. For our purposes, we use the term analytics to refer to the process of using data to
make decisions. This process involves gathering data, understanding models of uncertainty, making general
inferences, and communicating results.
This text will focus on short-term insurance contracts. By short-term, we mean contracts where the insurance
coverage is typically provided for six months or a year. If you are new to insurance, then it is probably
easiest to think about an insurance policy that covers the contents of an apartment or house that you are
renting (known as renters insurance) or the contents and property of a building that is owned by you or a
friend (known as homeowners insurance). Another easy example is automobile insurance. In the event of an
accident, this policy may cover damage to your vehicle, damage to other vehicles in the accident, as well as
medical expenses of those injured in the accident.
In the US, policies such as renters and homeowners are known as property insurance whereas a policy such as
auto that covers medical damages to people is known as casualty insurance. In the rest of the world, these
are both known as nonlife or general insurance, to distinguish them from life insurance.
Both life and nonlife insurances are important. To illustrate, (Insurance Information Institute, 2015) estimates
that direct insurance premiums in the world for 2013 were 2,608,091 for life and 2,032,850 for nonlife; these figures are in millions of US dollars. As noted earlier, the total represents 6.3% of the world GDP. Put
another way, life accounts for 56.2% of insurance premiums and 3.5% of world GDP, nonlife accounts for
43.8% of insurance premiums and 2.7% of world GDP. Both life and nonlife represent important economic
activities and are worthy of study in their own right.
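These premium shares follow directly from the figures above; a quick arithmetic check in R:

life <- 2608091      # 2013 direct life premiums, in millions of USD
nonlife <- 2032850   # 2013 direct nonlife premiums, in millions of USD
round(100 * life / (life + nonlife), 1)     # 56.2 (life share of premiums, in percent)
round(100 * nonlife / (life + nonlife), 1)  # 43.8 (nonlife share of premiums, in percent)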
Yet, life insurance considerations differ from nonlife. In life insurance, the default is to have a multi-year
contract. For example, if a person 25 years old purchases a whole life policy that pays upon death of the
insured and that person does not die until age 100, then the contract is in force for 75 years. We think of this
as a long-term contract.
Further, in life insurance, the benefit amount is often stipulated in the contract provisions. In contrast, most
short-term contracts provide for reimbursement of insured losses which are unknown before the accident.
(Of course, there are usually limits placed on the reimbursement amounts.) In a multi-year life insurance
contract, the time value of money plays a prominent role. In contrast, in a short-term nonlife contract, the
random amount of reimbursement takes priority.
In both life and nonlife insurances, the frequency of claims is very important. For many life insurance
contracts, the insured event (such as death) happens only once. In contrast, for nonlife insurances such as
automobile, it is common for individuals (especially young male drivers) to get into more than one accident
during a year. So, our models need to reflect this observation; we will introduce different frequency models
than you may have seen when studying life insurance.
For short-term insurance, the framework of the probabilistic model is straightforward. We think of a
one-period model (the period length, e.g., six months, will be specified in the situation).
• At the beginning of the period, the insured pays the insurer a known premium that is agreed upon by
both parties to the contract.
Figure 1.1: Timeline of a Typical Insurance Policy. Arrows mark the occurrences of random events. Each x
marks the time of scheduled events that are typically non-random.
• At the end of the period, the insurer reimburses the insured for a (possibly multivariate) random loss
that we will denote as y.
This framework will be developed as we proceed but we first focus on integrating this framework with concerns
about how the data may arise and what we can accomplish with this framework.
One way to describe the data arising from operations of a company that sells insurance products is to adopt
a granular approach. In this micro oriented view, we can think specifically about what happens to a contract
at various stages of its existence. Consider Figure 1.1 that traces a timeline of a typical insurance contract.
Throughout the existence of the contract, the company regularly processes events such as premium collection
and valuation, described in Section 1.2; these are marked with an x on the timeline. Further, non-regular
and unanticipated events also occur. To illustrate, times t2 and t4 mark the event of an insurance claim
(some contracts, such as life insurance, can have only a single claim). Times t3 and t5 mark the events when
a policyholder wishes to alter certain contract features, such as the choice of a deductible or the amount of
coverage. Moreover, from a company perspective, one can even think about the contract initiation (arrival,
time t1 ) and contract termination (departure, time t6 ) as uncertain events.
1.2 Insurance Company Operations
Armed with insurance data and a method of organizing the data into variable types, the end goal is to use
data to make decisions. Of course, we will need to learn more about methods of analyzing and extrapolating
data but that is the purpose of the remaining chapters in the text. To begin, let us think about why we wish
to do the analysis. To provide motivation, we take the insurer’s viewpoint (not a person) and introduce ways
of bringing money in, paying it out, managing costs, and making sure that we have enough money to meet
obligations.
Specifically, in many insurance companies, it is customary to aggregate detailed insurance processes into
larger operational units; many companies use these functional areas to segregate employee activities and
areas of responsibilities. Actuaries and other financial analysts work within these units and use data for the
following activities:
1. Initiating Insurance. At this stage, the company makes a decision as to whether or not to take on a
risk (the underwriting stage) and assign an appropriate premium (or rate). Insurance analytics has its
actuarial roots in ratemaking, where analysts seek to determine the right price for the right risk.
2. Renewing Insurance. Many contracts, particularly in general insurance, have relatively short
durations such as 6 months or a year. Although there is an implicit expectation that such contracts will
be renewed, the insurer has the opportunity to decline coverage and to adjust the premium. Analytics
is also used at this policy renewal stage where the goal is to retain profitable customers.
3. Claims Management. Analytics has long been used in (1) detecting and preventing claims fraud, (2)
managing claim costs, including identifying the appropriate support for claims handling expenses, as
well as (3) understanding excess layers for reinsurance and retention.
4. Loss Reserving. Analytic tools are used to provide management with an appropriate estimate of
future obligations and to quantify the uncertainty of the estimates.
5. Solvency and Capital Allocation. Deciding on the requisite amount of capital and ways of allocating
capital to alternative investment activities represent other important analytics activities. Companies
must understand how much capital is needed so that they will have sufficient flow of cash available to
meet their obligations. This is an important question that concerns not only company managers but
also customers, company shareholders, regulatory authorities, as well as the public at large. Related
to issues of how much capital is the question of how to allocate capital to differing financial projects,
typically to maximize an investor’s return. Although this question can arise at several levels, insurance
companies are typically concerned with how to allocate capital to different lines of business within a
firm and to different subsidiaries of a parent firm.
Although data is a critical component of solvency and capital allocation, other components including an
economic framework and financial investments environment are also important. Because of the background
needed to address these components, we will not address solvency and capital allocation issues further in this
text.
Nonetheless, for all operating functions, we emphasize that analytics in the insurance industry is not an
exercise that a small group of analysts can do by themselves. It requires an insurer to make significant
investments in their information technology, marketing, underwriting, and actuarial functions. As these areas
represent the primary end goals of the analysis of data, additional background on each operational unit is
provided in the following subsections.
Setting the price of an insurance good can be a perplexing problem. In manufacturing, the cost of a good
is (relatively) known and provides a benchmark for assessing a market demand price. In other areas of
financial services, market prices are available and provide the basis for a market-consistent pricing structure
of products. In contrast, for many lines of insurance, the cost of a good is uncertain and market prices are
unavailable. The expectation of the random cost is a reasonable place to start for a price, as this is the optimal price for a risk-neutral insurer. Thus, it has been traditional in insurance pricing to begin with the expected cost and to add so-called margins to account for the product's riskiness, expenses incurred in servicing
the product, and a profit/surplus allowance for the insurance company.
For some lines of business, especially automobile and homeowners insurance, analytics has served to sharpen
the market by making the calculation of the good’s expectation more precise. The increasing availability
of the internet among consumers has promoted transparency in pricing. Insurers seek to increase their
market share by refining their risk classification systems and employing skimming the cream underwriting
strategies. Recent surveys (e.g., (Earnix, 2013)) indicate that pricing is the most common use of analytics
among insurers.
Underwriting, the process of classifying risks into homogenous categories and assigning policyholders to
these categories, lies at the core of ratemaking. Policyholders within a class have similar risk profiles and
so are charged the same insurance price. This is the concept of an actuarially fair premium; it is fair to
charge different rates to policyholders only if they can be separated by identifiable risk factors. To illustrate,
an early contribution, Two Studies in Automobile Insurance Ratemaking, by (Bailey and Simon, 1960)
provided a catalyst to the acceptance of analytic methods in the insurance industry. This paper addresses
the problem of classification ratemaking. It describes an example of automobile insurance that has five use
classes cross-classified with four merit rating classes. At that time, the contribution to premiums for use and
merit rating classes were determined independently of each other. Thinking about the interacting effects of
different classification variables is a more difficult problem.
Insurance is a type of financial service and, like many service contracts, insurance coverage is often agreed
upon for a limited time period, such as six months or a year, at which time commitments are complete.
Particularly for general insurance, the need for coverage continues and so efforts are made to issue a new
contract providing similar coverage. Renewal issues can also arise in life insurance, e.g., term (temporary) life
insurance, although other contracts, such as life annuities, terminate upon the insured’s death and so issues
of renewability are irrelevant.
In the absence of legal restrictions, at renewal the insurer has the opportunity to:
• accept or decline to underwrite the risk and
• determine a new premium, possibly in conjunction with a new classification of the risk.
Risk classification and rating at renewal is based on two types of information. First, as at the initial stage,
the insurer has available many rating variables upon which decisions can be made. Many variables will
not change, e.g., sex, whereas others are likely to have changed, e.g., age, and still others may or may not
change, e.g., credit score. Second, unlike the initial stage, at renewal the insurer has available a history
of policyholder’s loss experience, and this history can provide insights into the policyholder that are not
available from rating variables. Modifying premiums with claims history is known as experience rating, also
sometimes referred to as merit rating.
Experience rating methods are either applied retrospectively or prospectively. With retrospective methods, a
refund of a portion of the premium is provided to the policyholder in the event of favorable (to the insurer)
experience. Retrospective premiums are common in life insurance arrangements (where policyholders earned
dividends in the U.S. and bonuses in the U.K.). In general insurance, prospective methods are more common,
where favorable insured experience is rewarded through a lower renewal premium.
Claims history can provide information about a policyholder’s risk appetite. For example, in personal lines
it is common to use a variable to indicate whether or not a claim has occurred in the last three years. As
another example, in a commercial line such as worker’s compensation, one may look to a policyholder’s
average claim over the last three years. Claims history can reveal information that is hidden (to the insurer)
about the policyholder.
In some areas of insurance, the process of paying claims for insured events is relatively straightforward. For
example, in life insurance, a simple death certificate is all that is needed as the benefit amount is provided in
the contract terms. However, in non-life areas such as property and casualty insurance, the process is much
more complex. Think about even a relatively simple insured event such as an automobile accident. Here, it is often helpful to determine which party is at fault; one needs to assess damage to all of the vehicles and people involved in the incident, both insured and non-insured, as well as the expenses incurred in assessing the damages, and so
forth. The process of determining coverage, legal liability, and settling claims is known as claims adjustment.
Insurance managers sometimes use the phrase claims leakage to mean dollars lost through claims management
inefficiencies. There are many ways in which analytics can help manage the claims process (Gorman and Swenson, 2013). Historically, the most important has been fraud detection. The claim adjusting process
involves reducing information asymmetry (the claimant knows exactly what happened; the company knows
some of what happened). Mitigating fraud is an important part of the claims management process.
One can think about the management of claims severity as consisting of the following components:
• Claims triaging. Just as in the medical world, early identification and appropriate handling of high
cost claims (patients, in the medical world), can lead to dramatic company savings. For example, in
workers compensation, insurers look to achieve early identification of those claims that run the risk of
high medical costs and a long payout period. Early intervention into those cases could give insurers
more control over the handling of the claim, the medical treatment, and the overall costs with an earlier
return-to-work.
• Claims processing. The goal is to use analytics to identify situations suitable for small claims
handling processes and those for adjuster assignment to complex claims.
• Adjustment decisions. Once a complex claim has been identified and assigned to an adjuster, analytic
driven routines can be established to aid subsequent decision-making processes. Such processes can also
be helpful for adjusters in developing case reserves, an important input to the insurer’s loss reserves,
Section 1.2.4.
In addition to the insured’s reimbursement for insured losses, the insurer also needs to be concerned with
another source of revenue outflow, expenses. Loss adjustment expenses are part of an insurer’s cost of
managing claims. Analytics can be used to reduce expenses directly related to claims handling (allocated) as
well as general staff time for overseeing the claims processes (unallocated). The insurance industry has high
operating costs relative to other portions of the financial services sectors.
In addition to claims payments, there are many other ways in which insurers use data to manage their
products. We have already discussed the need for analytics in underwriting, that is, risk classification at the
initial acquisition stage. Insurers are also interested in which policyholders elect to renew their contract and,
as with other products, monitor customer loyalty.
Analytics can also be used to manage the portfolio, or collection, of risks that an insurer has acquired. When
the risk is initially obtained, the insurer’s risk can be managed by imposing contract parameters that modify
contract payouts. Chapter xx introduces common modifications including coinsurance, deductibles, and
policy upper limits.
After the contract has been agreed upon with an insured, the insurer may still modify its net obligation by
entering into a reinsurance agreement. This type of agreement is with a reinsurer, an insurer of an insurer. It is common for insurance companies to purchase insurance on their portfolios of risks to gain protection from unusual events, just as people and other companies do.
An important feature that distinguishes insurance from other sectors of the economy is the timing of the
exchange of considerations. In manufacturing, payments for goods are typically made at the time of a
transaction. In contrast, for insurance, money received from a customer occurs in advance of benefits or
services; these are rendered at a later date. This leads to the need to hold a reservoir of wealth to meet
future obligations in respect to obligations made. The size of this reservoir of wealth, and the importance of
ensuring its adequacy in regard to liabilities already assumed, is a major concern for the insurance industry.
Setting aside money for unpaid claims is known as loss reserving; in some jurisdictions, reserves are also
known as technical provisions. We saw in Figure 1.1 how future obligations arise naturally at a specific
(valuation) date; a company must estimate these outstanding liabilities when determining its financial strength.
Accurately determining loss reserves is important to insurers for many reasons.
1. Loss reserves represent a loan that the insurer owes its customers. Under-reserving may result in a
failure to meet claim liabilities. Conversely, an insurer with excessive reserves may present a weaker
financial position than it truly has and lose market share.
2. Reserves provide an estimate for the unpaid cost of insurance that can be used for pricing contracts.
3. Loss reserving is required by laws and regulations. The public has a strong interest in the financial
strength of insurers.
4. In addition to the insurance company management and regulators, other stakeholders such as investors
and customers make decisions that depend on company loss reserves.
Loss reserving is a topic where there are substantive differences between life and general (also known as
property and casualty, or non-life), insurance. In life insurance, the severity (amount of loss) is often not a
source of concern as payouts are specified in the contract. The frequency, driven by mortality of the insured,
is a concern. However, because of the length of time for settlement of life insurance contracts, the time value
of money uncertainty as measured from issue to date of death can dominate frequency concerns. For example,
for an insured who purchases a life contract at age 20, it would not be unusual for the contract to still be
open in 60 years time. See, for example, (Bowers et al., 1986) or (Dickson et al., 2013) for introductions to
reserving for life insurance.
1.3 Case Study: Wisconsin Property Fund

In this section, in the context of a real case study, the Wisconsin Property Fund, you learn how to:
• Describe how data generating events can produce data of interest to insurance analysts.
• Identify the type of each variable.
• Produce relevant summary statistics for each variable.
• Describe how these summary statistics can be used in each of the major operational areas of an insurance company.
Let us illustrate the kind of data under consideration and the goals that we wish to achieve by examining
the Local Government Property Insurance Fund (LGPIF), an insurance pool administered by the Wisconsin
Office of the Insurance Commissioner. The LGPIF was established to provide property insurance for local
government entities that include counties, cities, towns, villages, school districts, and library boards. The
fund insures local government property such as government buildings, schools, libraries, and motor vehicles.
The fund covers all property losses except those resulting from flood, earthquake, wear and tear, extremes in
temperature, mold, war, nuclear reactions, and embezzlement or theft by an employee.
The property fund covers over a thousand local government entities who pay approximately $25 million in
premiums each year and receive insurance coverage of about $75 billion. State government buildings are not
covered; the LGPIF is for local government entities that have separate budgetary responsibilities and who
need insurance to moderate the budget effects of uncertain insurable events. Coverage for local government
property has been made available by the State of Wisconsin since 1911.
At a fundamental level, insurance companies accept premiums in exchange for promises to indemnify a
policyholder upon the uncertain occurrence of an insured event. This indemnification is known as a claim. A
positive amount, also known as the severity of the claim, is a key financial expenditure for an insurer. So,
knowing only the claim amount summarizes the reimbursement to the policyholder.
Ignoring expenses, an insurer that examines only amounts paid would be indifferent to two claims of 100
when compared to one claim of 200, even though the number of claims differ. Nonetheless, it is common
for insurers to study how often claims arise, known as the frequency of claims. The frequency is important
for expenses, but it also influences contractual parameters (such as deductibles and policy limits) that are
written on a per occurrence basis, is routinely monitored by insurance regulators, and is often a key driver in
the overall indemnification obligation of the insurer. We shall consider the two claims variables, the severity
and frequency, as the two main outcome variables that we wish to understand, model, and manage.
To illustrate, in 2010 there were 1,110 policyholders in the property fund. Table 1.1 shows the distribution of
the 1,377 claims. Almost two-thirds (0.637) of the policyholders did not have any claims and an additional
18.8% only had one claim. The remaining 17.5% (=1 - 0.637 - 0.188) had more than one claim; the
policyholder with the highest number recorded 239 claims. The average number of claims for this sample was
1.24 (=1377/1110).
Table 1.1: 2010 claim frequency distribution
Number of claims   0      1      2      3      4      5      6      7      8      9 or more   Sum
Count              707    209    86     40     18     12     9      4      6      19          1,110
Proportion         0.637  0.188  0.077  0.036  0.016  0.011  0.008  0.004  0.005  0.017       1.000
Summary statistics of claim severity (in dollars):
Minimum   First Quartile   Median   Mean     Third Quartile   Maximum
167       2,226            4,951    56,330   11,900           12,920,000
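As a quick check of the frequency figures, the counts from Table 1.1 can be summarized directly in R:

counts <- c(707, 209, 86, 40, 18, 12, 9, 4, 6, 19)  # counts from Table 1.1
sum(counts)                      # 1,110 policyholders in total
round(counts / sum(counts), 3)   # proportions 0.637, 0.188, ...
1377 / 1110                      # average number of claims per policyholder, about 1.24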
Developing models to represent and manage the two outcome variables, frequency and severity, is the focus
of the early chapters of this text. However, when actuaries and other financial analysts use those models,
they do so in the context of externally available variables. In general statistical terminology, one might call
these explanatory or predictor variables; there are many other names in statistics, economics, psychology,
and other disciplines. Because of our insurance focus, we call them rating variables as they will be useful in
setting insurance rates and premiums.
We earlier considered a sample of 1,110 observations which may seem like a lot. However, as we will see
in our forthcoming applications, because of the preponderance of zeros and the skewed nature of claims,
actuaries typically yearn for more data. One common approach that we adopt here is to examine outcomes
from multiple years, thus increasing the sample size. We will discuss the strengths and limitations of this
strategy later but, at this juncture, just want to show the reader how it works.
Specifically, Table 1.3 shows that we now consider policies over five years of data, years 2006, . . . , 2010,
inclusive. The data begins in 2006 because there was a shift in claim coding in 2005 so that comparisons with
earlier years are not helpful. To mitigate the effect of open claims, we consider policy years prior to 2011. An
open claim means that all of the obligations are not known at the time of the analysis; for some claims, such
as an injury to a person in an auto accident or in the workplace, it can take years before costs are fully known.
Table 1.3 shows that the average claim varies over time, especially with the high 2010 value due to a single
large claim. The total number of policyholders is steadily declining and, conversely, the coverage is steadily
increasing. The coverage variable is the amount of coverage of the property and contents. Roughly, you can
think of it as the maximum possible payout of the insurer. For our immediate purposes, it is our first rating
variable. Other things being equal, we would expect that policyholders with larger coverage will have larger
claims. We will make this vague idea much more precise as we proceed.
R Code for Summary of Claim Frequency and Severity, Deductibles, and Coverages
library(doBy)  # provides summaryBy(), used throughout this chapter's summaries
Insample <- read.csv("Data/PropertyFundInsample.csv", header = TRUE,
                     na.strings = c("."), stringsAsFactors = FALSE)
# Claim frequency
t1 <- summaryBy(Freq ~ 1, data = Insample,
                FUN = function(x) { c(ma = min(x), m1 = median(x), m = mean(x), mb = max(x)) })
names(t1) <- c("Minimum", "Median", "Average", "Maximum")
# Average claim severity per policyholder
t2 <- summaryBy(yAvg ~ 1, data = Insample,
                FUN = function(x) { c(ma = min(x), m1 = median(x), m = mean(x), mb = max(x)) })
names(t2) <- c("Minimum", "Median", "Average", "Maximum")
# Deductible
t3 <- summaryBy(Deduct ~ 1, data = Insample,
                FUN = function(x) { c(ma = min(x), m1 = median(x), m = mean(x), mb = max(x)) })
names(t3) <- c("Minimum", "Median", "Average", "Maximum")
# Coverage, in thousands of dollars
t4 <- summaryBy(BCcov / 1000 ~ 1, data = Insample,
                FUN = function(x) { c(ma = min(x), m1 = median(x), m = mean(x), mb = max(x)) })
names(t4) <- c("Minimum", "Median", "Average", "Maximum")
Table2 <- rbind(t1, t2, t3, t4)
Table2a <- round(Table2, 3)
Rowlable <- rbind("Claim Frequency", "Claim Severity", "Deductible", "Coverage (000's)")
Table2aa <- cbind(Rowlable, as.matrix(Table2a))
Table2aa
The following display describes the rating variables considered in this chapter. To handle the skewness, we
henceforth focus on logarithmic transformations of coverage and deductibles. To get a sense of the relationship
between the non-continuous rating variables and claims, Table 1.5 relates the claims outcomes to these
categorical variables. Table 1.5 suggests substantial variation in the claim frequency and average severity of
the claims by entity type. It also demonstrates higher frequency and severity for the Fire5 variable and the
reverse for the NoClaimCredit variable. The relationship for the Fire5 variable is counter-intuitive in that
one would expect lower claim amounts for those policyholders in areas with better public protection (when
the protection code is five or less). Naturally, there are other variables that influence this relationship. We
will see that these background variables are accounted for in the subsequent multivariate regression analysis,
which yields an intuitive, appealing (negative) sign for the Fire5 variable.
Description of Rating Variables
Variable        Description
EntityType      Categorical variable that is one of six types: (Village, City, County, Misc, School, or Town)
LnCoverage      Total building and content coverage, in logarithmic millions of dollars
LnDeduct        Deductible, in logarithmic dollars
AlarmCredit     Categorical variable that is one of four types: (0, 5, 10, or 15), for automatic smoke alarms in main rooms
NoClaimCredit   Binary variable to indicate no claims in the past two years
Fire5           Binary variable to indicate the fire class is below 5 (the range of fire class is 0 to 10)
R Code for Claims Summary by Entity Type, Fire Class, and No Claim Credit
ByVarSumm <- function(datasub) {
  # Average claim frequency and number of policies in the subset
  tempA <- summaryBy(Freq ~ 1, data = datasub,
                     FUN = function(x) { c(m = mean(x), num = length(x)) })
  # Average severity, computed over policies with positive claims only
  datasub1 <- subset(datasub, yAvg > 0)
  tempB <- summaryBy(yAvg ~ 1, data = datasub1,
                     FUN = function(x) { c(m = mean(x)) })
  tempC <- merge(tempA, tempB, all.x = TRUE)[c(2, 1, 3)]
  tempC1 <- as.matrix(tempC)
  return(tempC1)
}
# Summaries by entity type
datasub <- subset(Insample, TypeVillage == 1); t1 <- ByVarSumm(datasub)
datasub <- subset(Insample, TypeCity == 1);    t2 <- ByVarSumm(datasub)
datasub <- subset(Insample, TypeCounty == 1);  t3 <- ByVarSumm(datasub)
datasub <- subset(Insample, TypeMisc == 1);    t4 <- ByVarSumm(datasub)
datasub <- subset(Insample, TypeSchool == 1);  t5 <- ByVarSumm(datasub)
datasub <- subset(Insample, TypeTown == 1);    t6 <- ByVarSumm(datasub)
# Summaries by fire class indicator (Fire5)
datasub <- subset(Insample, Fire5 == 0);       t7 <- ByVarSumm(datasub)
datasub <- subset(Insample, Fire5 == 1);       t8 <- ByVarSumm(datasub)
# Summaries by no claims credit indicator
datasub <- subset(Insample, NoClaimCredit == 0); t9  <- ByVarSumm(datasub)
datasub <- subset(Insample, NoClaimCredit == 1); t10 <- ByVarSumm(datasub)
# Summary for all policyholders combined
t11 <- ByVarSumm(Insample)
R Code for Claims Summary by Entity Type and Alarm Credit Category
#Claims Summary by Entity Type and Alarm Credit
ByVarSumm <- function(datasub) {
  tempA <- summaryBy(Freq ~ AC00, data = datasub,
                     FUN = function(x) { c(m = mean(x), num = length(x)) })
  datasub1 <- subset(datasub, yAvg > 0)
  if (nrow(datasub1) == 0) {     # no policies with positive claims in this subset
    n <- nrow(datasub)
    return(c(0, 0, n))
  } else {
    # The remainder of this function was cut off in the source at a page break;
    # the lines below are a plausible completion that mirrors the earlier ByVarSumm.
    tempB <- summaryBy(yAvg ~ AC00, data = datasub1,
                       FUN = function(x) { c(m = mean(x)) })
    tempC <- merge(tempA, tempB, all.x = TRUE)
    return(as.matrix(tempC))
  }
}
We have now seen the Fund’s two outcome variables, a count variable for the number of claims and a
continuous variable for the claims amount. We have also introduced a continuous rating variable, coverage,
a discrete quantitative variable, (logarithmic) deductibles, two binary rating variables, no claims credit and
fire class, as well as two categorical rating variables, entity type and alarm credit. Subsequent chapters
will explain how to analyze and model the distribution of these variables and their relationships. Before
getting into these technical details, let us first think about where we want to go. General insurance company
functional areas are described in Section 1.2; let us now think about how these areas might apply in the
context of the property fund.
Initiating Insurance
Because this is a government sponsored fund, we do not have to worry about selecting good or avoiding poor
risks; the fund is not allowed to deny a coverage application from a qualified local government entity. If we
do not have to underwrite, what about how much to charge?
We might look at the most recent experience in 2010, where the total fund claims were approximately
28.16 million USD (= 1377 claims × 20452 average severity). Dividing that among 1,110 policyholders suggests a rate of 25,370 (≈ 28,160,000/1110). However, 2010 was a bad year; using the same method,
our premium would be much lower based on 2009 data. This swing in premiums would defeat the primary
purpose of the fund, to allow for a steady charge that local property managers could utilize in their budgets.
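The arithmetic behind this rate can be reproduced in a few lines of R:

claims_2010 <- 1377        # number of claims in 2010
avg_severity <- 20452      # average claim severity in 2010
policyholders <- 1110
total_claims <- claims_2010 * avg_severity   # approximately 28.16 million
total_claims / policyholders                 # approximately 25,370 per policyholder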
Having a single price for all policyholders is nice but hardly seems fair. For example, Table 1.5 suggests that
Schools have much higher claims than other entities and so should pay more. However, simply doing the
calculation on an entity by entity basis is not right either. For example, we saw in Table 1.6 that had we
used this strategy, entities with a 15% alarm credit (for good behavior, having top alarm systems) would
actually wind up paying more.
So, we have the data for thinking about the appropriate rates to charge but will need to dig deeper into the
analysis. We will explore this topic further in Chapter 6 on premium calculation fundamentals. Selecting
appropriate risks is introduced in Chapter 7 on risk classification.
Renewing Insurance
Although property insurance is typically a one-year contract, Table 1.3 suggests that policyholders tend to
renew; this is typical of general insurance. For renewing policyholders, in addition to their rating variables
we have their claims history and this claims history can be a good predictor of future claims. For example,
Table 1.3 shows that policyholders without a claim in the last two years had much lower claim frequencies
than those with at least one accident (0.310 compared to 1.501); a lower predicted frequency typically results
in a lower premium. This is why it is common for insurers to use variables such as NoClaimCredit in their
rating. We will explore this topic further in Chapter 8 on experience rating.
Claims Management
Of course, the main story line of 2010 experience was the large claim of over 12 million USD, nearly half
the claims for that year. Are there ways that this could have been prevented or mitigated? Are there ways for the fund to purchase protection against such large unusual events? Another unusual feature of the 2010 experience noted earlier was the very large frequency of claims (239) for one policyholder. Given that there were only 1,377 claims that year, this means that a single policyholder had 17.4% of the claims. This also suggests opportunities for managing claims, the subject of Chapter 9.
Loss Reserving
In our case study, we look only at the one year outcomes of closed claims (the opposite of open). However,
like many lines of insurance, obligations from insured events to buildings such as fire, hail, and the like, are
not known immediately and may develop over time. Other lines of business, including those where there are
injuries to people, take much longer to develop. Chapter 10 introduces this concern and loss reserving, the
discipline of determining how much the insurance company should retain to meet its obligations.
1.4 Further Resources and Contributors

• Edward W. (Jed) Frees, University of Wisconsin-Madison, is the principal author of the initial version of this chapter. Email: [email protected] for chapter comments and suggested improvements.
This book introduces loss data analytic tools that are most relevant to actuaries and other financial risk analysts. Here are a few references cited in the chapter.
Chapter 2
Frequency Modeling
Chapter Preview. A primary focus for insurers is estimating the magnitude of the aggregate claims they must bear under their insurance contracts. Aggregate claims are affected by both the frequency of insured events and the
severity of the insured event. Decomposing aggregate claims into these two components, which each warrant
significant attention, is essential for analysis and pricing. This chapter discusses frequency distributions,
measures, and parameter estimation techniques.
Basic Terminology
We use claim to denote the indemnification upon the occurrence of an insured event. While some authors
use claim and loss interchangeably, others think of loss as the amount suffered by the insured whereas claim
is the amount paid by the insurer. Frequency represents how often an insured event occurs, typically within
a policy contract. Here, we focus on count random variables that represent the number of claims, that is,
how frequently an event occurs. Severity denotes the amount, or size, of each payment for an insured event.
In future chapters, the aggregate model, which combines frequency models with severity models, is examined.
Recall from Chapter 1 that setting the price of an insurance good can be a complex problem. In manufacturing,
the cost of a good is (relatively) known. In other financial service areas, market prices are available. In
insurance, we can generalize the price setting as follows: start with an expected cost. Add “margins” to
account for the product’s riskiness, expenses incurred in servicing the product, and a profit/surplus allowance
for the insurer.
That expected cost for insurance can be defined as the expected number of claims times the expected amount
per claim, that is, expected frequency times severity. The focus on claim count allows the insurer to consider
those factors which directly affect the occurrence of a loss, thereby potentially generating a claim. The
frequency process can then be modeled.
Insurers and other stakeholders, including governmental organizations, have various motivations for gathering
and maintaining frequency datasets.
• Contractual - In insurance contracts, it is common for particular deductibles and policy limits to
be listed and invoked for each occurrence of an insured event. Correspondingly, the claim count data
generated would indicate the number of claims which meet these criteria, offering a unique claim
frequency measure. Extending this, models of total insured losses would need to account for deductibles
and policy limits for each insured event.
• Behavioral - In considering factors that influence loss frequency, the risk-taking and risk-reducing
behavior of individuals and companies should be considered. Explanatory (rating) variables can have
different effects on models of how often an event occurs in contrast to the size of the event.
– In healthcare, the decision by individuals to utilize healthcare, or to minimize such utilization through preventive care and wellness measures, is related primarily to their personal characteristics. The cost per user is determined by those personal characteristics, the medical condition, potential treatment measures, and decisions made by the healthcare provider (such as the physician) and the patient. While there is overlap in those factors and how they affect total healthcare costs, attention can be focused on those separate drivers of healthcare visit frequency and healthcare cost severity.
– In personal lines, prior claims history is an important underwriting factor. It is common to use
an indicator of whether or not the insured had a claim within a certain time period prior to the
contract.
– In homeowners insurance, in modeling potential loss frequency, the insurer could consider loss
prevention measures that the homeowner has adopted, such as visible security systems. Separately,
when modeling loss severity, the insurer would examine those factors that affect repair and
replacement costs.
• Databases. Many insurers keep separate data files that suggest developing separate frequency and
severity models. For example, a policyholder file is established when a policy is written. This file records
much underwriting information about the insured(s), such as age, gender, and prior claims experience,
policy information such as coverage, deductibles and limitations, as well as the insurance claims event.
A separate file, known as the “claims” file, records details of the claim against the insurer, including
the amount. (There may also be a “payments” file that records the timing of the payments although
we shall not deal with that here.) This recording process makes it natural for insurers to model the
frequency and severity as separate processes.
• Regulatory and Administrative - Insurance is a highly regulated and monitored industry, given its importance in providing financial security to individuals and companies facing risk. As part of their duties, regulators routinely require the reporting of claims numbers as well as amounts. This may be due to
the fact that there can be alternative definitions of an “amount,” e.g., paid versus incurred, and there
is less potential error when reporting claim numbers. This continual monitoring helps ensure financial
stability of these insurance companies.
2.2 Basic Frequency Distributions

In this section, we will introduce the distributions that are commonly used in actuarial practice to model count data. The claim count random variable is denoted by $N$; by its very nature it assumes only non-negative integral values. Hence the distributions below are all discrete distributions supported on the set of non-negative integers ($\mathbb{Z}_+$).
2.2.1 Foundations
Since $N$ is a discrete random variable taking values in $\mathbb{Z}_+$, the most natural full description of its distribution is through the specification of the probabilities with which it assumes each of the non-negative integral values. This leads us to the concept of the probability mass function (pmf) of $N$, denoted as $p_N(\cdot)$ and defined as follows:
$$p_N(k) := \Pr(N = k), \qquad k = 0, 1, 2, \ldots \qquad (2.1)$$
We note that there are alternate complete descriptions, or characterizations, of the distribution of $N$; for example, the distribution function of $N$, denoted by $F_N(\cdot)$ and defined below, is another such:
$$F_N(x) := \begin{cases} \displaystyle\sum_{k=0}^{\lfloor x \rfloor} \Pr(N = k), & x \ge 0; \\ 0, & \text{otherwise.} \end{cases} \qquad (2.2)$$
In the above, $\lfloor \cdot \rfloor$ denotes the floor function; $\lfloor x \rfloor$ denotes the greatest integer less than or equal to $x$. We note that the survival function of $N$, denoted by $S_N(\cdot)$, is defined as the one's complement of $F_N(\cdot)$, i.e. $S_N(\cdot) := 1 - F_N(\cdot)$. Clearly, the latter is another characterization of the distribution of $N$.
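As an illustration (a sketch using a Poisson pmf, which is introduced later in this chapter), the distribution function can be computed by summing the pmf up to the floor of the argument:

lambda <- 2
x <- c(1.5, 3, 4.7)
sapply(floor(x), function(m) sum(dpois(0:m, lambda)))  # F_N(x) computed from the pmf
ppois(floor(x), lambda)                                # the same values from R's built-in cdf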
Often one is interested in quantifying a certain aspect of the distribution and not in its complete description.
This is particularly useful when comparing distributions. A center of location of the distribution is one such
aspect, and there are many different measures that are commonly used to quantify it. Of these, the mean is
the most popular; the mean of $N$, denoted by $\mu_N$, is defined as
$$\mu_N = \sum_{k=0}^{\infty} k\, p_N(k). \qquad (2.3)$$
We note that $\mu_N$ is the expected value of the random variable $N$, i.e. $\mu_N = \mathrm{E}\,N$. This leads to a general class of measures, the moments of the distribution; the $r$-th moment of $N$, for $r > 0$, is defined as $\mathrm{E}\,N^r$ and denoted by $\mu'_N(r)$. Hence, for $r > 0$, we have
$$\mu'_N(r) = \mathrm{E}\,N^r = \sum_{k=0}^{\infty} k^r p_N(k). \qquad (2.4)$$
We note that $\mu'_N(\cdot)$ is a well-defined non-decreasing function taking values in $[0, \infty)$, as $\Pr(N \in \mathbb{Z}_+) = 1$; also, note that $\mu_N = \mu'_N(1)$.
Another basic aspect of a distribution is its dispersion, and of the various measures of dispersion studied in the literature, the standard deviation is the most popular. Towards defining it, we first define the variance of $N$, denoted by $\mathrm{Var}\, N$, as $\mathrm{Var}\, N := \mathrm{E}(N - \mu_N)^2$ when $\mu_N$ is finite. By basic properties of the expected value of a random variable, we see that $\mathrm{Var}\, N = \mathrm{E}\, N^2 - (\mathrm{E}\, N)^2$. The standard deviation of $N$, denoted by $\sigma_N$, is defined as the square root of $\mathrm{Var}\, N$. Note that the latter is well-defined as $\mathrm{Var}\, N$, by its definition as the average squared deviation from the mean, is non-negative; $\mathrm{Var}\, N$ is also denoted by $\sigma_N^2$. Note that these two measures take values in $[0, \infty)$.
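A small illustration of these definitions, using a hypothetical pmf on {0, 1, 2, 3}:

k <- 0:3
p <- c(0.5, 0.3, 0.15, 0.05)    # an illustrative pmf (values are assumptions, not from the text)
mu <- sum(k * p)                # mean: sum of k * p_N(k)
sigma2 <- sum(k^2 * p) - mu^2   # variance: E N^2 - (E N)^2
c(mu, sigma2, sqrt(sigma2))     # mean, variance, standard deviation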
2.2.2 Moment and Probability Generating Functions

Now we will introduce two generating functions that are found to be useful when working with count variables. Recall that the moment generating function (mgf) of $N$, denoted as $M_N(\cdot)$, is defined as
$$M_N(t) = \mathrm{E}\, e^{tN} = \sum_{k=0}^{\infty} e^{tk} p_N(k), \qquad t \in \mathbb{R}.$$
We note that $M_N(\cdot)$ is well defined as it is the expectation of a non-negative random variable ($e^{tN}$), though it can assume the value $\infty$. Note that for a count random variable, $M_N(\cdot)$ is finite valued on $(-\infty, 0]$ with $M_N(0) = 1$. The following theorem, whose proof can be found in (Billingsley, 2008) (pages 285-6), encapsulates the reason for its name.
Theorem 2.1. Let $N$ be a count random variable such that $\mathrm{E}\, e^{t^* N}$ is finite for some $t^* > 0$. We have the following:
1. All moments of $N$ are finite, i.e. $\mathrm{E}\, N^r < \infty$, $r \ge 0$.
2. The mgf can be used to generate its moments as follows:
$$\left.\frac{d^m}{dt^m} M_N(t)\right|_{t=0} = \mathrm{E}\, N^m, \qquad m \ge 1.$$
3. The mgf $M_N(\cdot)$ characterizes the distribution; in other words, it uniquely specifies the distribution.
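A minimal sketch of the moment-generating property in R, using the Poisson mgf M(t) = exp(λ(e^t − 1)) (which follows from the Poisson pgf derived later in this chapter) together with the symbolic derivative function D():

lambda <- 2
mgf <- expression(exp(lambda * (exp(t) - 1)))   # Poisson mgf
d1 <- D(mgf, "t")                               # symbolic first derivative
d2 <- D(d1, "t")                                # symbolic second derivative
eval(d1, list(t = 0, lambda = lambda))          # E N   = lambda = 2
eval(d2, list(t = 0, lambda = lambda))          # E N^2 = lambda^2 + lambda = 6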
Another reason that the mgf is very useful as a tool is that for two independent random variables X and Y ,
with their mgfs existing in a neighborhood of 0, the mgf of X + Y is the product of their respective mgfs.
A related generating function to the mgf is called the probability generating function (pgf), and is a useful tool for random variables taking values in $\mathbb{Z}_+$. For a random variable $N$, by $P_N(\cdot)$ we denote its pgf and we define it as follows:
$$P_N(s) := \mathrm{E}\, s^N, \qquad s \ge 0. \qquad (2.5)$$
The pgf $P_N(\cdot)$ characterizes the distribution; in other words, it uniquely specifies the distribution.
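As a quick illustration, the pgf can be approximated by simulation and compared with a closed form; here a Poisson example, whose pgf e^{λ(s−1)} is derived later in this chapter:

set.seed(2019)               # seed chosen arbitrarily, for reproducibility
lambda <- 2
s <- 0.7
N <- rpois(100000, lambda)
mean(s^N)                    # simulation estimate of E[s^N]
exp(lambda * (s - 1))        # closed-form Poisson pgf at s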
2.2.3 Important Frequency Distributions
In this sub-section we will study three important frequency distributions used in Statistics, namely the
Binomial, the Negative Binomial and the Poisson distributions. In the following, a risk denotes a unit covered
by insurance. A risk could be an individual, a building, a company, or some other identifier for which
insurance coverage is provided. For context, imagine an insurance data set containing the number of claims
by risk or stratified in some other manner. The above mentioned distributions also happen to be the most
commonly used in insurance practice for reasons, some of which we mention below.
• These distributions can be motivated by natural random experiments which are good approximations to
real life processes from which many insurance data arise. Hence, not surprisingly, they together offer a
reasonable fit to many insurance data sets of interest. The appropriateness of a particular distribution
for the set of data can be determined using standard statistical methodologies, as we discuss later in
this chapter.
• They provide a rich enough basis for generating other distributions that better approximate or cater to real situations of interest to us.
– The three distributions are either one-parameter or two-parameter distributions. In fitting to data,
a parameter is assigned a particular value. The set of these distributions can be enlarged to their
convex hulls by treating the parameter(s) as a random variable (or vector) with its own probability
distribution, with this larger set of distributions offering greater flexibility. A simple example that
is better addressed by such an enlargement is a portfolio of claims generated by insureds belonging
to many different risk classes.
– In insurance data, we may observe either a marginal or an inordinate number of zeros, i.e. zero claims by risk. When fitting to the data, a frequency distribution in its standard specification often fails to reasonably account for this occurrence. Natural modifications of the above three distributions, however, accommodate this phenomenon well, offering a better fit.
– In insurance we are interested in total claims paid, whose distribution results from compounding
the fitted frequency distribution with a severity distribution. These three distributions have
properties that make it easy to work with the resulting aggregate severity distribution.
Binomial Distribution
We begin with the binomial distribution which arises from any finite sequence of identical and independent
experiments with binary outcomes. The most canonical of such experiments is the (biased or unbiased) coin
tossing experiment with the outcome being heads or tails. So if N denotes the number of heads in a sequence
of m independent coin tossing experiments with an identical coin which turns heads up with probability
q, then the distribution of N is called the binomial distribution with parameters (m, q), with m a positive
integer and q ∈ [0, 1]. Note that when q = 0 (resp., q = 1) then the distribution is degenerate with N = 0
(resp., $N = m$) with probability 1. Clearly, its support when $q \in (0, 1)$ equals $\{0, 1, \ldots, m\}$ with pmf given by
$$p_k := \binom{m}{k} q^k (1-q)^{m-k}, \qquad k = 0, \ldots, m.$$
The reason for its name is that the pmf takes values among the terms that arise from the binomial expansion of $(q + (1-q))^m$. This realization then leads to the following expression for the pgf of the binomial distribution:
$$P(z) := \sum_{k=0}^{m} z^k \binom{m}{k} q^k (1-q)^{m-k} = \sum_{k=0}^{m} \binom{m}{k} (zq)^k (1-q)^{m-k} = (qz + (1-q))^m = (1 + q(z-1))^m.$$
Note that the above expression for the pgf confirms the fact that the binomial distribution is the m-convolution
of the Bernoulli distribution, which is the binomial distribution with m = 1 and pgf (1 + q(z − 1)). Also,
note that the mgf of the binomial distribution is given by (1 + q(et − 1))m .
The central moments of the binomial distribution can be found in a few different ways. To emphasize the key
property that it is an m-convolution of the Bernoulli distribution, we derive below the moments using this
property. We begin by observing that the Bernoulli distribution with parameter q assigns probability of q and
1 − q to 1 and 0, respectively. So its mean equals q (= 0 × (1 − q) + 1 × q); note that its raw second moment equals its mean, as $N^2 = N$ with probability 1. Using these two facts we see that the variance equals q(1 − q).
Moving on to the Binomial distribution with parameters m and q, using the fact that it is the m-convolution
of the Bernoulli distribution, we write N as the sum of N1 , . . . , Nm , where Ni are iid Bernoulli variates. Now
using the moments of Bernoulli and linearity of the expectation, we see that
E N = E\left[\sum_{i=1}^{m} N_i\right] = \sum_{i=1}^{m} E N_i = mq.
Also, using the fact that the variance of the sum of independent random variables is the sum of their
variances, we see that
Var N = Var\left(\sum_{i=1}^{m} N_i\right) = \sum_{i=1}^{m} Var N_i = mq(1 − q).
Alternate derivations of the above moments are suggested in the exercises. One important observation,
especially from the point of view of applications, is that the mean is greater than the variance unless q = 0.
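As a quick numerical illustration (not part of the original text), the following R sketch simulates N as an m-convolution of Bernoulli variates and checks that the sample mean and variance are close to mq and mq(1 − q), with the mean exceeding the variance.

set.seed(2)
m <- 10; q <- 0.3
N <- replicate(10000, sum(rbinom(m, size = 1, prob = q)))   # N as a sum of m iid Bernoulli(q) variates
c(mean = mean(N), variance = var(N))                        # close to mq = 3 and mq(1 - q) = 2.1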
Poisson Distribution
After the binomial distribution, the Poisson distribution (named after the French polymath Siméon Denis Poisson) is probably the best known of the discrete distributions. This is partly because it arises naturally as the distribution of the count of random occurrences of a type of event in a certain time period when the rate of occurrence of such events is constant. Relatedly, it also arises as the asymptotic limit of the binomial distribution with m → ∞ and mq → λ.
The Poisson distribution is parametrized by a single parameter usually denoted by λ which takes values in
(0, ∞). Its pmf is given by
p_k = \frac{e^{−λ} λ^k}{k!}, \quad k = 0, 1, . . .
It is easy to check that the above specifies a pmf as the terms are clearly non-negative, and that they sum to
one follows from the infinite Taylor series expansion of eλ . More generally, we can derive its pgf, P (·), as
follows:
P(z) := \sum_{k=0}^{∞} p_k z^k = \sum_{k=0}^{∞} \frac{e^{−λ} λ^k z^k}{k!} = e^{−λ} e^{λz} = e^{λ(z−1)}, \quad \text{for all } z ∈ R.
Towards deriving its mean, we note that for the Poisson distribution
k p_k = \begin{cases} 0, & k = 0; \\ λ p_{k−1}, & k ≥ 1. \end{cases}
In fact, more generally, using either a generalization of the above or Theorem 2.2, we see that

E\left[\prod_{i=0}^{m−1} (N − i)\right] = \frac{d^m}{ds^m} P_N(s)\Big|_{s=1} = λ^m, \quad m ≥ 1.

In particular, the mean equals E N = λ and

Var N = E N^2 − (E N)^2 = E[N(N − 1)] + E N − (E N)^2 = λ^2 + λ − λ^2 = λ.
The third important count distribution is the negative binomial distribution. Recall that the binomial distribution arose as the distribution of the number of successes in m independent repetitions of an experiment with binary outcomes. If we instead count the number of successes until we observe the r-th failure in independent repetitions of an experiment with binary outcomes, then its distribution is a negative binomial distribution. A particular case, r = 1, is the geometric distribution. In the following we will allow the parameter r to be any positive real number, although when r is not an integer the above random experiment is no longer applicable. To motivate the distribution more generally, and in the process explain its name, we recall the binomial series, i.e.
(1 + x)^s = 1 + sx + \frac{s(s − 1)}{2!} x^2 + \cdots, \quad s ∈ R; \; |x| < 1.
If we define \binom{s}{k}, the generalized binomial coefficient, by

\binom{s}{k} = \frac{s(s − 1) \cdots (s − k + 1)}{k!},
then we have

(1 + x)^s = \sum_{k=0}^{∞} \binom{s}{k} x^k, \quad s ∈ R; \; |x| < 1.

In particular, if we let

p_k = \binom{r + k − 1}{k} \left(\frac{1}{1 + β}\right)^r \left(\frac{β}{1 + β}\right)^k, \quad k = 0, 1, . . . ,

for r > 0 and β ≥ 0, then it defines a valid pmf (the binomial series above shows that these probabilities sum to one). The distribution so defined is called the negative binomial distribution with parameters (r, β), with r > 0 and β ≥ 0. Moreover, the binomial series also implies that the pgf of this distribution is given by
P(z) = (1 − β(z − 1))^{−r}, \quad |z| ≤ 1 + \frac{1}{β}, \; β ≥ 0.
The above implies that the mgf is given by
M(t) = (1 − β(e^t − 1))^{−r}, \quad t ≤ \log\left(1 + \frac{1}{β}\right), \; β ≥ 0.
We derive its moments using Theorem 2.1 as follows:
E N = rβ, \quad E N^2 = rβ(1 + β) + r^2 β^2,

and hence

Var N = E N^2 − (E N)^2 = rβ(1 + β) + r^2 β^2 − r^2 β^2 = rβ(1 + β).
We note that when β > 0, we have Var N > E N . In other words, this distribution is overdispersed
(relative to the Poisson); similarly, when q > 0 the binomial distribution is said to be underdispersed
(relative to the Poisson).
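The following short R illustration (not part of the original text) contrasts the two cases using the moment formulas above; the parameter values are arbitrary.

r <- 3; beta <- 2; m <- 10; q <- 0.3
c(nb_mean = r * beta, nb_var = r * beta * (1 + beta))   # 6 versus 18: overdispersed
c(bin_mean = m * q, bin_var = m * q * (1 - q))          # 3 versus 2.1: underdispersed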
Finally, we observe that the Poisson distribution also emerges as a limit of negative binomial distributions. Towards establishing this, let β_r be such that rβ_r approaches λ > 0 as r approaches infinity. Then the mgfs of the negative binomial distributions with parameters (r, β_r) satisfy

\lim_{r→∞} (1 − β_r(e^t − 1))^{−r} = e^{λ(e^t − 1)},

with the right hand side of the above equation being the mgf of the Poisson distribution with parameter λ.
Recall that the pmf of the Poisson distribution satisfies

k p_k = λ p_{k−1}, \quad k ≥ 1,

or equivalently p_k/p_{k−1} = λ/k. In fact, each of the three distributions introduced above satisfies a recurrence of the form

\frac{p_k}{p_{k−1}} = a + \frac{b}{k}, \quad k ≥ 1; \qquad (2.6)

this raises the question of whether there are any other distributions which satisfy this seemingly general recurrence relation.
To begin with, let a < 0. Since (a + b/k) → a < 0 as k → ∞ and the ratio on the left is non-negative, it follows that if a < 0 then b must satisfy b = −ka for some k ≥ 1. Any such pair (a, b) can be written as

\left(\frac{−q}{1 − q}, \frac{(m + 1)q}{1 − q}\right), \quad q ∈ (0, 1), \; m ≥ 1,

and some algebra shows that these pairs correspond precisely to the binomial distributions with parameters (m, q);
note that the case a < 0 with a + b = 0 yields the degenerate at 0 distribution which is the binomial
distribution with q = 0 and arbitrary m ≥ 1.
In the case of a = 0, again by non-negativity of the ratio pk /pk−1 , we have b ≥ 0. If b = 0 the distribution is
degenerate at 0, which is a binomial with q = 0 or a Poisson distribution with λ = 0 or a negative binomial
distribution with β = 0. If b > 0, then clearly such a distribution is a Poisson distribution with mean (i.e. λ)
equal to b.
In the case of a > 0, again by non-negativity of the ratio pk /pk−1 , we have a + b/k ≥ 0 for all k ≥ 1. The
most stringent of these is the inequality a + b ≥ 0. Note that a + b = 0 again results in degeneracy at 0;
excluding this case we have a + b > 0 or equivalently b = (r − 1)a with r > 0. Some algebra easily yields the
following expression for pk :
k+r−1
pk = p0 ak , k = 1, 2, . . . .
k
The above series converges for 0 < a < 1 when r > 0, with the sum given by p_0 \left((1 − a)^{−r} − 1\right). Hence, equating the latter to 1 − p_0, we get p_0 = (1 − a)^r. So in this case the pair (a, b) is of the form (a, (r − 1)a), for r > 0 and 0 < a < 1; since an equivalent parametrization is (β/(1 + β), (r − 1)β/(1 + β)), for r > 0 and β > 0, we see from the above that such distributions are negative binomial distributions.
From the above development we see that the recurrence (2.6) not only ties these three distributions together but also characterizes them. For this reason these three distributions are collectively referred to in the actuarial literature as the (a, b, 0) class of distributions, with 0 referring to the starting point of the recurrence. Note that the value of p_0 is implied by (a, b) since the probabilities have to sum to one. Of course, (2.6) as a recurrence relation for p_k makes the computation of the pmf efficient by removing redundancies. Later, we will see that it does so even in the case of compound distributions with the frequency distribution belonging to the (a, b, 0) class - this fact is the more important motivating reason to study these three distributions from this viewpoint.
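As a small illustration (not part of the original text), the R sketch below computes a pmf through the recurrence (2.6) and checks it against R's built-in Poisson pmf; the binomial and negative binomial cases can be checked in the same way with their (a, b) values.

ab0_pmf <- function(a, b, p0, kmax) {
  p <- numeric(kmax + 1)
  p[1] <- p0                                          # p[1] stores p_0
  for (k in 1:kmax) p[k + 1] <- (a + b / k) * p[k]    # recurrence (2.6)
  p
}
lambda <- 2                                           # Poisson: a = 0, b = lambda, p_0 = exp(-lambda)
max(abs(ab0_pmf(0, lambda, exp(-lambda), 10) - dpois(0:10, lambda)))   # essentially zero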
Example 2.3.1. A discrete probability distribution has the following properties:

p_k = c\left(1 + \frac{2}{k}\right) p_{k−1}, \quad k = 1, 2, 3, . . .

p_1 = \frac{9}{256}

Determine the expected value of this discrete random variable.
Show Example Solution
Solution: Since the pmf satisfies the (a, b, 0) recurrence relation, we know that the underlying distribution is one among the binomial, Poisson and negative binomial distributions. Since the ratio of the parameters (i.e. b/a) equals 2, we know that it is negative binomial with r = 3. Moreover, since for a negative binomial p_1 = rβ(1 + β)^{−(r+1)}, we have β = 3. Finally, since the mean of a negative binomial is rβ, the mean of the given distribution equals 9.
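A quick R check of this example (not part of the original text):

r <- 3; beta <- 3
r * beta * (1 + beta)^(-(r + 1))              # p_1 = 9/256 = 0.03515625
dnbinom(1, size = r, prob = 1 / (1 + beta))   # the same probability via R's parameterization
r * beta                                      # the mean, 9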
In Section 2.2 we introduced three distributions of importance in modeling various types of count data arising
from insurance. Let us now suppose that we have a set of count data to which we wish to fit a distribution,
and that we have determined that one of these (a, b, 0) distributions is more appropriate than the others.
Since each one of these forms a class of distributions if we allow its parameter(s) to take any permissible value,
there remains the task of determining the best value of the parameter(s) for the data at hand. This is a
statistical point estimation problem, and in parametric inference problems the statistical inference paradigm
of maximum likelihood usually yields efficient estimators. In this section we will describe this paradigm and
derive the maximum likelihood estimators (mles).
Let us suppose that we observe the iid random variables X1 , X2 , . . . , Xn from a distribution with pmf pθ ,
where θ is an unknown value in Θ ⊆ Rd . For example, in the case of the Poisson distribution
p_θ(x) = e^{−θ} \frac{θ^x}{x!}, \quad x = 0, 1, . . . ,
with Θ = (0, ∞). In the case of the binomial distribution we have
p_θ(x) = \binom{m}{x} q^x (1 − q)^{m−x}, \quad x = 0, 1, . . . , m,
with θ := (m, q) ∈ {0, 1, 2, . . .} × (0, 1]. Let us suppose that the observations are x1 , . . . , xn ; in this case the
probability of observing this sample from pθ equals
\prod_{i=1}^{n} p_θ(x_i).
The above, denoted by L(θ), viewed as a function of θ is called the likelihood. Note that we suppressed its
dependence on the data, to emphasize that we are viewing it as a function of the parameter. For example, in
the case of the Poisson distribution we have
L(λ) = e^{−nλ} λ^{\sum_{i=1}^{n} x_i} \left(\prod_{i=1}^{n} x_i!\right)^{−1};
The maximum likelihood estimator (mle) for θ is any maximizer of the likelihood; in a sense, the mle chooses the parameter value that best explains the observed sample. Consider a sample of size 3 from a
Bernoulli distribution (binomial with m = 1) with values 0, 1, 0. The likelihood in this case is easily checked
to equal
L(q) = q(1 − q)2 ,
and the plot of the likelihood is given in Figure 2.1. As shown in the plot, the maximum value of the likelihood
equals 4/27 and is attained at q = 1/3, and hence the mle for q is 1/3 for the given sample. In this case one
can resort to algebra to show that
q(1 − q)^2 = \left(q − \frac{1}{3}\right)^2 \left(q − \frac{4}{3}\right) + \frac{4}{27},
and conclude that the maximum equals 4/27, and is attained at q = 1/3 (using the fact that the first term is
non-positive in the interval [0, 1]). But as is apparent, this way of deriving the mle using algebra does not
generalize. In general, one resorts to calculus to derive the mle - note that for some likelihoods one may
have to resort to other optimization methods, especially when the likelihood has many local extrema. It is
customary to equivalently maximize the logarithm of the likelihood L(·), denoted by l(·), and look at the set of zeros of its first derivative l′(·). In the case of the above likelihood, l(q) = log(q) + 2 log(1 − q), and

l′(q) := \frac{d}{dq} l(q) = \frac{1}{q} − \frac{2}{1 − q}.
The unique zero of l′(·) equals 1/3, and since l″(·) is negative, 1/3 is the unique maximizer of the likelihood and hence the mle.
4. We use the matrix derivative here.
5. A slight benefit of working with l(·) is that constant terms in L(·) do not appear in l′(·) whereas they do in L′(·).
In the following, we derive the mle for the three members of the (a, b, 0) class. We begin by summarizing the
discussion above. In the setting of observing iid random variables X1 , X2 , . . . , Xn from a distribution with
pmf pθ , where θ is an unknown value in Θ ⊆ Rd , the likelihood L(·), a function on Θ is defined as
L(θ) := \prod_{i=1}^{n} p_θ(x_i),
where x_1, . . . , x_n are the observed values. The maximum likelihood estimator (mle) of θ, denoted by θ̂_{MLE}, is a function which maps the observations to an element of the set of maximizers of L(·), namely

\left\{ θ ∈ Θ \;\middle|\; L(θ) ≥ L(θ'), \; \text{for all } θ' ∈ Θ \right\}.
Note the above set is a function of the observations, even though this dependence is not made explicit.
In the case of the three distributions that we will study, and quite generally, the above set is a singleton
with probability tending to one (with increasing sample size). In other words, for many commonly used
distributions and when the sample size is large, the mle is uniquely defined with high probability.
In the following, we will assume that we have observed n iid random variables X1 , X2 , . . . , Xn from the
distribution under consideration, even though the parametric value is unknown. Also, x1 , x2 , . . . , xn will
denote the observed values. We note that in the case of count data, and data from discrete distributions in
general, the likelihood can alternately be represented as
L(θ) := \prod_{k ≥ 0} \left(p_θ(k)\right)^{m_k},
where
m_k := \left|\{ i \mid x_i = k, \; 1 ≤ i ≤ n \}\right| = \sum_{i=1}^{n} I(x_i = k), \quad k ≥ 0.
Note that this is an information loss-less transformation of the data. For large n it leads to compression of
the data in the sense of sufficiency. Below, we present expressions for the mle in terms of {mk }k≥1 as well.
MLE - Poisson Distribution: In this case, as noted above, the likelihood is given by
L(λ) = \left(\prod_{i=1}^{n} x_i!\right)^{−1} e^{−nλ} λ^{\sum_{i=1}^{n} x_i},
and

l′(λ) = −n + \frac{1}{λ} \sum_{i=1}^{n} x_i.
Since l″(λ) < 0 when \sum_{i=1}^{n} x_i > 0, the maximum is attained at the sample mean. Otherwise the likelihood is decreasing in λ and the maximum is attained at the smallest possible parameter value, that is, the mle equals zero. Hence, we have

\hat{λ}_{MLE} = \frac{1}{n} \sum_{i=1}^{n} X_i.
Note that the sample mean can also be computed as

\frac{1}{n} \sum_{k ≥ 1} k\, m_k.
It is noteworthy that in the case of the Poisson, the exact distribution of λ̂MLE is available in closed form - it
is a scaled Poisson - when the underlying distribution is a Poisson. This is so as the sum of independent
Poisson random variables is a Poisson as well. Of course, for large sample size one can use the ordinary
Central Limit Theorem (CLT) to derive a normal approximation. Note that the latter approximation holds
even if the underlying distribution is any distribution with a finite second moment.
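A short R sketch (not part of the original text) showing the two equivalent computations of the Poisson mle; the data vector is hypothetical.

x <- c(0, 2, 1, 0, 3, 1)                               # hypothetical small sample
mean(x)                                                # mle from the raw data
mk <- table(factor(x, levels = 0:max(x)))              # summarized counts m_k
sum(as.numeric(names(mk)) * as.numeric(mk)) / sum(mk)  # the same mle from the counts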
MLE - Binomial Distribution: Unlike the case of the Poisson distribution, the parameter space in the
case of the binomial is 2-dimensional. Hence the optimization problem is a bit more challenging. We begin
by observing that the likelihood is given by
L(m, q) = \left(\prod_{i=1}^{n} \binom{m}{x_i}\right) q^{\sum_{i=1}^{n} x_i} (1 − q)^{nm − \sum_{i=1}^{n} x_i},
Note that since m takes only non-negative integral values, we cannot use multivariate calculus to find the
optimal values. Nevertheless, we can use single variable calculus to show that
\hat{q}_{MLE} × \hat{m}_{MLE} = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad (2.7)
and hence we establish equation (2.7). The above reduces the task to the search for \hat{m}_{MLE}, which is a member of the set of maximizers of
L\left(m, \frac{1}{nm} \sum_{i=1}^{n} x_i\right). \qquad (2.8)
Note that the likelihood is zero for values of m smaller than \max_{1 ≤ i ≤ n} x_i, and hence

\hat{m}_{MLE} ≥ \max_{1 ≤ i ≤ n} x_i.
Towards specifying an algorithm to compute \hat{m}_{MLE}, we first point out that for some data sets \hat{m}_{MLE} could equal ∞, indicating that a Poisson distribution would render a better fit than any binomial distribution. This is so because the binomial distribution with parameters (m, x/m) approaches the Poisson distribution with parameter x as m approaches infinity. The fact that some data sets will prefer a Poisson distribution should not be surprising since, in the above sense, the set of Poisson distributions is on the boundary of the set of binomial distributions.
Interestingly, in (Olkin et al., 1981) they show that if the sample mean is less than or equal to the sample variance then \hat{m}_{MLE} = ∞; otherwise, there exists a finite m that maximizes equation (2.8). In Figure 2.2 below we display the plot of L\left(m, \frac{1}{nm}\sum_{i=1}^{n} x_i\right) for three different samples of size 5; they differ only in the value of the sample maximum. The first sample of (2, 2, 2, 4, 5) has a ratio of sample mean to sample variance greater than 1 (1.875), the second sample of (2, 2, 2, 4, 6) has a ratio equal to 1.25, which is closer to 1, and the third sample of (2, 2, 2, 4, 7) has a ratio less than 1 (0.885). For these three samples, as shown in Figure 2.2, \hat{m}_{MLE} equals 7, 18 and ∞, respectively. Note that the limiting value of L\left(m, \frac{1}{nm}\sum_{i=1}^{n} x_i\right) as m approaches infinity equals

\left(\prod_{i=1}^{n} x_i!\right)^{−1} \exp\left\{−\sum_{i=1}^{n} x_i\right\} \overline{x}^{\,n\overline{x}}. \qquad (2.9)
Also, note that Figure 2.2 shows that the mle of m is non-robust, i.e. changes in a small proportion of the
data set can cause large changes in the estimator.
The above discussion suggests the following simple algorithm:
• Step 1. If the sample mean is less than or equal to the sample variance, \hat{m}_{MLE} = ∞. The mle-suggested distribution is a Poisson distribution with \hat{λ} = \overline{x}.
• Step 2. If the sample mean is greater than the sample variance, then compute L(m, \overline{x}/m) for values of m greater than or equal to the sample maximum until L(m, \overline{x}/m) is close to the value of the Poisson likelihood given in (2.9). The value of m that corresponds to the maximum value of L(m, \overline{x}/m) among those computed equals \hat{m}_{MLE}. (An R sketch of this search is given below.)
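The following R sketch (not part of the original text) implements this two-step search. For simplicity it caps the search at a hypothetical bound m_max rather than monitoring closeness to the Poisson likelihood (2.9), and it returns ∞ in Step 1.

binomial_mle_m <- function(x, m_max = 500) {
  xbar <- mean(x)
  sigma2 <- mean(x^2) - xbar^2                   # biased sample variance, as used in the text
  if (xbar <= sigma2) return(Inf)                # Step 1: a Poisson fit is preferred
  m_vals <- max(x):m_max                         # Step 2: profile likelihood search over m
  loglik <- sapply(m_vals, function(m) sum(dbinom(x, size = m, prob = xbar / m, log = TRUE)))
  m_vals[which.max(loglik)]
}

binomial_mle_m(c(2, 2, 2, 4, 5))                 # returns 7, matching the first sample in Figure 2.2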
We note that if the underlying distribution is the binomial distribution with parameters (m, q) (with q > 0)
then m̂M LE will equal m for large sample sizes. Also, q̂M LE will have an asymptotically normal distribution
and converge with probability one to q.
MLE - Negative Binomial Distribution: The case of the negative binomial distribution is similar to that of the binomial distribution in the sense that we have two parameters and the mles are not available in closed form. A difference between them is that, unlike the binomial parameter m which takes positive integral values, the parameter r of the negative binomial can assume any positive real value. This makes the optimization problem a tad more complex. We begin by observing that the likelihood can be expressed in the following form:
following form:
n !
Y r + xi − 1
L(r, β) = (1 + β)−n(r+x) β nx .
i=1
x i
and hence

\frac{∂}{∂β} l(r, β) = −\frac{n(r + \overline{x})}{1 + β} + \frac{n\overline{x}}{β}.
Equating the above to zero, we get

\hat{r}_{MLE} × \hat{β}_{MLE} = \overline{x}.
The above reduces the two-dimensional optimization problem to a one-dimensional problem - we need to maximize

l(r, \overline{x}/r) = \sum_{i=1}^{n} \log\binom{r + x_i − 1}{x_i} − n(r + \overline{x}) \log(1 + \overline{x}/r) + n\overline{x} \log(\overline{x}/r),
with respect to r, with the maximizing r being its mle and \hat{β}_{MLE} = \overline{x}/\hat{r}_{MLE}. In (Levin et al., 1977) it is shown that if the sample variance is greater than the sample mean then there exists a unique r > 0 that maximizes l(r, \overline{x}/r), and hence a unique mle for r and β. Also, they show that if \hat{σ}^2 ≤ \overline{x}, then the negative binomial likelihood is dominated by the Poisson likelihood with \hat{λ} = \overline{x} - in other words, a Poisson distribution offers a better fit to the data. The guarantee in the case of \hat{σ}^2 > \hat{µ} permits us to use any algorithm to maximize l(r, \overline{x}/r). Towards an alternate method of computing the likelihood, we note that
l(r, \overline{x}/r) = \sum_{i=1}^{n} \sum_{j=1}^{x_i} \log(r − 1 + j) − \sum_{i=1}^{n} \log(x_i!) − n(r + \overline{x}) \log(r + \overline{x}) + nr \log(r) + n\overline{x} \log(\overline{x}),
which yields

\frac{1}{n} \frac{∂}{∂r} l(r, \overline{x}/r) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{x_i} \frac{1}{r − 1 + j} − \log(r + \overline{x}) + \log(r).
We note that, in the above expressions, the inner sum equals zero if xi = 0. The mle for r is a zero of the last
expression, and hence a root finding algorithm can be used to compute it. Also, we have
\frac{1}{n} \frac{∂^2}{∂r^2} l(r, \overline{x}/r) = \frac{\overline{x}}{r(r + \overline{x})} − \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{x_i} \frac{1}{(r − 1 + j)^2}.
A simple but quickly converging iterative root finding method is Newton's method, which incidentally the Babylonians are believed to have used for computing square roots. Applying Newton's method to our problem results in the following algorithm:
Step i. Choose a starting value r_0 and set k = 0.
Step ii. Define

r_{k+1} = r_k − \frac{\frac{∂}{∂r} l(r, \overline{x}/r)\big|_{r = r_k}}{\frac{∂^2}{∂r^2} l(r, \overline{x}/r)\big|_{r = r_k}}.

Step iii. If r_{k+1} ≈ r_k, then report r_{k+1} as the mle; else increment k by 1 and repeat Step ii.
For example, we simulated a sample of size 5, namely 41, 49, 40, 27, 23, from the negative binomial with parameters r = 10 and β = 5. Choosing the starting value of r such that

rβ = \hat{µ} \quad \text{and} \quad rβ(1 + β) = \hat{σ}^2

leads to the starting value of 23.14286. The iterates of r from Newton's method converge within a few steps; such rapid convergence is typical of Newton's method. Hence, in this example, \hat{r}_{MLE} ≈ 21.60647 and \hat{β}_{MLE} = \overline{x}/\hat{r}_{MLE} ≈ 1.6662.
R Implementation of Newton’s Method - Negative Binomial MLE for r
Show R Code
Newton <- function(x, abserr){
  mu <- mean(x)
  sigma2 <- mean(x^2) - mu^2
  r <- mu^2 / (sigma2 - mu)        # method-of-moments starting value
  b <- TRUE
  iter <- 0
  while (b) {
    tr <- r
    # m1 and m2 are the double sums (1/n) sum_i sum_{j=1}^{x_i} 1/(r-1+j) and 1/(r-1+j)^2
    m1 <- mean(c(x[x==0], sapply(x[x>0], function(z){sum(1/(tr:(tr-1+z)))})))
    m2 <- mean(c(x[x==0], sapply(x[x>0], function(z){sum(1/(tr:(tr-1+z))^2)})))
    # Newton step on the derivative of the profile log-likelihood l(r, xbar/r)
    r <- tr - (m1 - log(1 + mu/tr)) / (mu/(tr*(tr + mu)) - m2)
    b <- !(abs(tr - r) < abserr)
    iter <- iter + 1
  }
  c(r, iter)
}
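As a usage illustration (not part of the original text), applying the function to the simulated sample above converges in a handful of iterations:

Newton(c(41, 49, 40, 27, 23), abserr = 1e-6)   # returns roughly 21.60647 together with the iteration count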
To summarize our discussion of mles for the (a, b, 0) class of distributions, in Figure 2.3 below we plot the maximum value of the Poisson likelihood, L(m, \overline{x}/m) for the binomial, and L(r, \overline{x}/r) for the negative binomial, for the three samples of size 5 given in Table 2.1. The data were constructed to cover the three orderings of the sample mean and variance. As shown in Figure 2.3, and supported by theory, if \hat{µ} < \hat{σ}^2 then the negative binomial results in a higher maximized likelihood value; if \hat{µ} = \hat{σ}^2 the Poisson has the highest likelihood value; and finally, in the case that \hat{µ} > \hat{σ}^2, the binomial gives a better fit than the others. So before fitting frequency data with an (a, b, 0) distribution, it is best to start by examining the ordering of \hat{µ} and \hat{σ}^2. We again emphasize that the Poisson is on the boundary of the negative binomial and binomial distributions, so in the case that \hat{µ} ≥ \hat{σ}^2 (\hat{µ} ≤ \hat{σ}^2, resp.) the Poisson will yield a better fit than the negative binomial (binomial, resp.), which will also be indicated by \hat{r} = ∞ (\hat{m} = ∞, resp.).
In the above we discussed three distributions with support contained in the set of non-negative integers, which cater well to many insurance applications. Moreover, by allowing the parameters to be functions of explanatory variables known to the insurer, such as age, sex, geographic location (territory), and so forth, these distributions allow us to explain claim probabilities in terms of these variables. The field of statistics that studies such models is known as regression analysis - it is an important topic of actuarial interest that we will not pursue in this book; see (Frees, 2009a).
There are clearly infinitely many other count distributions, and, more importantly, the above distributions by themselves do not cater to all practical needs. In particular, one feature of some insurance data is that the proportion of zero counts is out of line with the proportions of the other counts, so that the data cannot be explained well by the above distributions. In the following we modify the above distributions to allow an arbitrary probability for the zero count irrespective of the assignment of relative probabilities to the other counts. Another feature of a data set which is naturally comprised of homogeneous subsets is that, while the above distributions may provide good fits to each subset, they may fail to do so for the whole data set. Later we naturally extend the (a, b, 0) distributions to cater to, in particular, such data sets.
Let us suppose that we are looking at auto insurance policies which appear in a database of auto claims
made in a certain period. If one is to study the number of claims that these policies have made during this
period, then clearly the distribution has to assign a probability of zero to the count variable assuming the
value zero. In other words, by restricting attention to count data from policies in the database of claims, we
have in a sense zero-truncated the count data of all policies. In personal lines (like auto), policyholders may
not want to report that first claim because of fear that it may increase future insurance rates - this behavior
will inflate the proportion of zero counts. Examples such as the latter modify the proportion of zero counts.
Interestingly, natural modifications of the three distributions considered above are able to provide good fits
to zero-modified/truncated data sets arising in insurance.
Below we modify the probability assigned to the zero count by the (a, b, 0) class while maintaining the relative probabilities assigned to non-zero counts - zero modification. Note that since the (a, b, 0) class of distributions satisfies the recurrence (2.6), maintaining relative probabilities of non-zero counts implies that the recurrence (2.6) is satisfied for k ≥ 2. This leads to the definition of the following class of distributions.
Definition. A count distribution is a member of the (a, b, 1) class if for some constants a and b the
probabilities pk satisfy
\frac{p_k}{p_{k−1}} = a + \frac{b}{k}, \quad k ≥ 2. \qquad (2.10)
Note that since the recursion starts with p1 , and not p0 , we refer to this super-class of (a, b, 0) distributions
by (a, b, 1). To understand this class, recall that each valid pair of values for a and b of the (a, b, 0) class
corresponds to a unique vector of probabilities {pk }k≥0 . If we now look at the probability vector {p̃k }k≥0
given by
1 − p̃0
p̃k = · pk , k ≥ 1,
1 − p0
where p̃0 ∈ [0, 1) is arbitrarily chosen, then since the relative probabilities for positive values according to
{pk }k≥0 and {p̃k }k≥0 are the same, we have {p̃k }k≥0 satisfies recurrence (2.10). This, in particular, shows
that the class of (a, b, 1) distributions is strictly wider than that of (a, b, 0).
In the above, we started with a pair of values for a and b that led to a valid (a, b, 0) distribution, and then
looked at the (a, b, 1) distributions that corresponded to this (a, b, 0) distribution. We will now argue that
the (a, b, 1) class allows for a larger set of permissible values for a and b than the (a, b, 0) class. Recall from
Section 2.3 that in the case of a < 0 we did not use the fact that the recurrence (2.6) started at k = 1, and
hence the set of pairs (a, b) with a < 0 that are permissible for the (a, b, 0) class is identical to those that
are permissible for the (a, b, 1) class. The same conclusion is easily drawn for pairs with a = 0. In the case
that a > 0, instead of the constraint a + b > 0 for the (a, b, 0) class we now have the weaker constraint of
a + b/2 > 0 for the (a, b, 1) class. With the parametrization b = (r − 1)a as used in Section 2.3, instead of
r > 0 we now have the weaker constraint of r > −1. In particular, we see that while zero modification of an (a, b, 0) distribution leads to a distribution in the (a, b, 1) class, the converse does not hold.
Zero modification of a count distribution F such that it assigns zero probability to zero count is called a zero
truncation of F . Hence, the zero truncated version of probabilities {pk }k≥0 is given by
\tilde{p}_k = \begin{cases} 0, & k = 0; \\ \dfrac{p_k}{1 − p_0}, & k ≥ 1. \end{cases}
In particular, we have that a zero modification of a count distribution {p_k}_{k≥0}, denoted by {p_k^M}_{k≥0}, can be written as a convex combination of the degenerate distribution at 0 and the zero truncation of {p_k}_{k≥0}, denoted by {p_k^T}_{k≥0}. That is, we have

p_k^M = p_0^M \cdot δ_0(k) + (1 − p_0^M) \cdot p_k^T, \quad k ≥ 0.
Example 2.5.1. Zero Truncated/Modified Poisson. Consider a Poisson distribution with parameter λ = 2. Calculate p_k, k = 0, 1, 2, 3, for the usual (unmodified), truncated and a modified version with p_0^M = 0.6.

k | p_k | p_k^T | p_k^M
0 | p_0 = e^{−λ} = 0.135335 | 0 | 0.6
1 | p_1 = p_0 λ = 0.27067 | p_1/(1 − p_0) = 0.313035 | \frac{1 − p_0^M}{1 − p_0} p_1 = 0.125214
2 | p_2 = p_1 \frac{λ}{2} = 0.27067 | p_2^T = p_1^T \frac{λ}{2} = 0.313035 | p_2^M = p_1^M \frac{λ}{2} = 0.125214
3 | p_3 = p_2 \frac{λ}{3} = 0.180447 | p_3^T = p_2^T \frac{λ}{3} = 0.208690 | p_3^M = p_2^M \frac{λ}{3} = 0.083476
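The R snippet below (not part of the original text) reproduces the truncated and modified probabilities in the table.

lambda <- 2; p0M <- 0.6
pk  <- dpois(0:3, lambda)                  # unmodified Poisson probabilities
pkT <- c(0, pk[-1] / (1 - pk[1]))          # zero-truncated version
pkM <- c(p0M, (1 - p0M) * pkT[-1])         # zero-modified version with p_0^M = 0.6
round(cbind(k = 0:3, pk, pkT, pkM), 6)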
Suppose that the population of interest consists of k subgroups, each with its own count distribution, and that an observation is drawn from subgroup i with probability α_i, where the α_i are non-negative and sum to one. If F_i denotes the distribution function for subgroup i, then the distribution function of a randomly drawn observation is the mixture

F(x) = \sum_{i=1}^{k} α_i \cdot F_i(x). \qquad (2.11)

The above expression can be seen as a direct application of the law of total probability. As an example, consider a population of drivers split broadly into two sub-groups, those with less than 5 years of driving experience and those with more experience. Let α denote the proportion of drivers with less than 5 years of experience, and let F_{≤5} and F_{>5} denote the distributions of the count of claims in a year for a driver in each group, respectively. Then the distribution of the claim count of a randomly selected driver is given by

α \cdot F_{≤5} + (1 − α) \cdot F_{>5}.
An alternate definition of a mixture distribution is as follows. Let Ni be a random variable with distribution
Fi , i = 1, . . . , k. Let I be a random variable taking values 1, 2, . . . , k with probabilities α1 , . . . , αk , respectively.
Then the random variable NI has a distribution given by equation (2.11)6 .
In (2.11) we see that the distribution function is a convex combination of the component distribution functions.
This result easily extends to the density function, the survival function, the raw moments, and the expectation
as these are all linear functionals of the distribution function. We note that this is not true for central
6. This, in particular, lays out a way to simulate from a mixture distribution, making use of efficient simulation schemes that may exist for the component distributions.
moments like the variance, and conditional measures like the hazard rate function. In the case of the variance this is easily seen from

Var(N_I) = E[Var(N_I | I)] + Var(E[N_I | I]) = \sum_{i=1}^{k} α_i Var(N_i) + Var(E[N_I | I]),

which is not a convex combination of the component variances unless the group means are all equal.
Example 2.6.1. SOA Exam Question. In a certain town the number of common colds an individual will
get in a year follows a Poisson distribution that depends on the individual’s age and smoking status. The
distribution of the population and the mean number of colds are as follows:
Table 2.3 : The distribution of the population and the mean number of colds

Group | Proportion of population | Mean number of colds
Children | 0.30 | 3
Adult Non-Smokers | 0.60 | 1
Adult Smokers | 0.10 | 4
1. Calculate the probability that a randomly drawn person has 3 common colds in a year.
2. Calculate the conditional probability that a person with exactly 3 common colds in a year is an adult
smoker.
Show Example Solution
Solution.
1. Using the development above, we can write the required probability as Pr(N_I = 3), with I denoting the group of the randomly selected individual, and 1, 2 and 3 signifying the groups Children, Adult Non-Smoker, and Adult Smoker, respectively. Now by conditioning we get

Pr(N_I = 3) = α_1 Pr(N_1 = 3) + α_2 Pr(N_2 = 3) + α_3 Pr(N_3 = 3),

with N_1, N_2 and N_3 following Poisson distributions with means 3, 1 and 4, respectively. Using the above, we get Pr(N_I = 3) ≈ 0.1235.

2. The required probability can be written as Pr(I = 3 | N_I = 3), which by Bayes theorem equals

\frac{α_3 Pr(N_3 = 3)}{Pr(N_I = 3)}.
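A quick R check of this example (not part of the original text), using the proportions and Poisson means in Table 2.3:

alpha  <- c(0.30, 0.60, 0.10)              # children, adult non-smokers, adult smokers
lambda <- c(3, 1, 4)                       # mean number of colds in each group
p3 <- sum(alpha * dpois(3, lambda))        # part 1: Pr(N_I = 3), about 0.1235
p3
alpha[3] * dpois(3, lambda[3]) / p3        # part 2: Pr(I = 3 | N_I = 3), about 0.158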
In the above example, the number of subgroups k was equal to three. In general, k can be any natural number,
but when k is large it is parsimonious from a modeling point of view to take the following infinitely many
subgroup approach. To motivate this approach, let the i-th subgroup be such that its component distribution
F_i is given by G_{\tilde{θ}_i}, where G is a parametric family of distributions with parameter space Θ ⊆ R^d. With this assumption, the distribution function F of a randomly drawn observation from the population is given by

F(x) = \sum_{i=1}^{k} α_i G_{\tilde{θ}_i}(x), \quad \text{for all } x ∈ R.
For example, in automobile insurance we can suppose that each driver has a Poisson distributed number of claims, with their own expected number of claims λ - smaller values for good drivers, and larger values for others. There is a distribution of λ in the population; a popular and convenient choice for modeling this distribution is a gamma distribution with parameters (α, θ). With these specifications it turns out that the resulting distribution of N, the claims of a randomly chosen driver, is a negative binomial with parameters (r = α, β = θ). This can be shown in many ways, but a straightforward argument is as follows:
Pr(N = k) = \int_0^{∞} \frac{e^{−λ} λ^k}{k!} \cdot \frac{λ^{α−1} e^{−λ/θ}}{Γ(α) θ^{α}}\, dλ = \frac{1}{k!\, Γ(α) θ^{α}} \int_0^{∞} λ^{α+k−1} e^{−λ(1 + 1/θ)}\, dλ = \frac{Γ(α + k)}{k!\, Γ(α) θ^{α} (1 + 1/θ)^{α+k}}

= \binom{α + k − 1}{k} \left(\frac{1}{1 + θ}\right)^{α} \left(\frac{θ}{1 + θ}\right)^{k}, \quad k = 0, 1, . . .
It is worth mentioning that by considering mixtures of a parametric class of distributions we increase the richness of the class, so that the mixture class is able to cater well to more applications than the parametric class we started with. In the above case this is seen as follows: we observed earlier that, in a sense, the Poisson distributions are on the boundary of the negative binomial distributions, and by mixing Poissons we obtain the interior distributions as well. Mixture modeling is a very important modeling technique in insurance applications, and later chapters will cover more aspects of this technique.
Example 2.6.2. Suppose that N |Λ ∼ Poisson(Λ) and that Λ ∼ gamma with mean of 1 and variance of 2.
Determine the probability that N = 1.
Show Example Solution
Solution. For a gamma distribution with parameters (α, θ), the mean is αθ and the variance is αθ^2. Using these expressions we have

α = \frac{1}{2} \quad \text{and} \quad θ = 2.

Now, one can directly use the above result to conclude that N is distributed as a negative binomial with r = α = \frac{1}{2} and β = θ = 2. Thus

Pr(N = 1) = \binom{1 + r − 1}{1} \frac{1}{(1 + β)^{r}} \left(\frac{β}{1 + β}\right) = \binom{1 + \frac{1}{2} − 1}{1} \frac{1}{(1 + 2)^{1/2}} \cdot \frac{2}{1 + 2} = \frac{1}{3^{3/2}} = 0.19245.
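A quick R check of this example (not part of the original text), both through R's negative binomial pmf and by numerically integrating the gamma-Poisson mixture directly:

dnbinom(1, size = 0.5, prob = 1/3)         # negative binomial with r = 1/2; prob = 1/(1 + beta) = 1/3
integrate(function(lambda) dpois(1, lambda) * dgamma(lambda, shape = 0.5, scale = 2),
          lower = 0, upper = Inf)$value    # the same value, about 0.19245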
In 1993, a portfolio of n = 7,483 automobile insurance policies from a major Singaporean insurance company had the distribution of auto accidents per policyholder as given in Table 2.4.

Table 2.4 : Observed accident counts per policyholder

Count (k) | 0 | 1 | 2 | 3 | 4 or more
No. of policies (m_k) | 6,996 | 455 | 28 | 4 | 0
Now if we use Poisson(λ̂M LE ) as the fitted distribution, then a tabular comparison of the fitted counts and
observed counts is given by Table 2.5 below, where p̂k represents the estimated probabilities under the fitted
Poisson distribution.
The motivation for Pearson's chi-square statistic derives from the fact that

\sum_{k=1}^{K} \frac{(m_k − n p_k)^2}{n p_k}
has a limiting chi-square distribution with K − 1 degrees of freedom if p_k, k = 1, . . . , K, are the true cell probabilities. Now suppose that only the summarized data represented by m_k, k = 1, . . . , K, are available. Further, if the p_k's are functions of s parameters, replacing the p_k's by any efficiently estimated probabilities \hat{p}_k's results in the statistic continuing to have a limiting chi-square distribution, but with degrees of freedom given by K − 1 − s. Such efficient estimates can be derived, for example, by using the mle method (with a multinomial likelihood) or by choosing the s parameters to minimize Pearson's chi-square statistic above. For example, the R code below calculates an estimate of λ in a similar spirit, by minimizing the squared distance between the observed and fitted cell proportions, and results in the estimate 0.06623153, close to but different from the mle of λ using the full data:
m <- c(6996, 455, 28, 4, 0)                                   # observed cell counts
op <- m / sum(m)                                              # observed cell proportions
# squared distance between observed and fitted cell proportions, as a function of lambda
g <- function(lam){ sum((op - c(dpois(0:3, lam), 1 - ppois(3, lam)))^2) }
optim(sum(op * (0:4)), g, method = "Brent", lower = 0, upper = 10)$par
When one uses the full data to estimate the probabilities, the asymptotic distribution lies between chi-square distributions with K − 1 and K − 1 − s degrees of freedom. In practice it is common to ignore this subtlety and assume that the limiting chi-square has K − 1 − s degrees of freedom. Interestingly, this practical shortcut works quite well in the case of the Poisson distribution.
For the Singaporean auto data the Pearson’s chi-square statistic equals 41.98 using the full data mle for λ.
Using the limiting distribution of chi-square with 5 − 1 − 1 = 3 degrees of freedom, we see that the value of
41.98 is way out in the tail (99-th percentile is below 12). Hence we can conclude that the Poisson distribution
provides an inadequate fit for the data.
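The following R snippet (not part of the original text) reproduces this calculation from the observed counts used earlier in this section.

m <- c(6996, 455, 28, 4, 0)                          # observed counts for 0, 1, 2, 3 and 4 or more accidents
n <- sum(m)
lambda_hat <- sum(m * (0:4)) / n                     # full data mle of lambda
p_hat <- c(dpois(0:3, lambda_hat), 1 - ppois(3, lambda_hat))
sum((m - n * p_hat)^2 / (n * p_hat))                 # Pearson chi-square statistic, about 41.98
qchisq(0.99, df = 3)                                 # 99th percentile of chi-square(3), about 11.34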
In the above we started with the cells as given in the above tabular summary. In practice, a relevant question is how to define the cells so that the chi-square distribution is a good approximation to the finite sample distribution of the statistic. A rule of thumb is to define the cells in such a way that at least 80%, if not all, of the cells have expected counts greater than 5. Also, since a larger number of cells results in higher power of the test, a simple rule of thumb is to maximize the number of cells such that each cell has an expected count of at least 5.
2.8 Exercises
Theoretical Exercises:
Exercise 2.1. Derive an expression for pN (·) in terms of FN (·) and SN (·).
Exercise 2.2. A measure of center of location must be equi-variant with respect to shifts. In other words,
if N1 and N2 are two random variables such that N1 + c has the same distribution as N2 , for some constant
c, then the difference between the measures of the center of location of N2 and N1 must equal c. Show that
the mean satisfies this property.
Exercise 2.3. Measures of dispersion should be invariant w.r.t. shifts and scale equi-variant. Show that
standard deviation satisfies these properties by doing the following:
• Show that for a random variable N , its standard deviation equals that of N + c, for any constant c.
• Show that for a random variable N , its standard deviation equals 1/c times that of cN , for any positive
constant c.
Exercise 2.4. Let N be a random variable with probability mass function given by

p_N(k) := \begin{cases} \dfrac{6}{π^2 k^2}, & k ≥ 1; \\ 0, & \text{otherwise.} \end{cases}
Exercise 2.8. (Non-Uniqueness of the MLE) Consider the following parametric family of densities
indexed by the parameter p taking values in [0, 1]:
fp (x) = p · φ(x + 2) + (1 − p) · φ(x − 2), x ∈ R,
where φ(·) represents the standard normal density.
• Show that for all p ∈ [0, 1], fp (·) above is a valid density function.
• Find an expression in p for the mean and the variance of fp (·).
• Let us consider a sample of size one consisting of x. Show that when x equals 0, the set of MLEs for p
equals [0, 1]; also show that the mle is unique otherwise.
Exercise 2.9. Graph the region of the plane corresponding to values of (a, b) that give rise to valid (a, b, 0)
distributions. Do the same for (a, b, 1) distributions.
Exercise 2.10. (Computational Complexity) For the (a, b, 0) class of distributions, count the number
of basic math operations needed to compute the n probabilities p0 . . . pn−1 using the recurrence relationship.
For the negative binomial distribution with non-integral r, count the number of such operations using the
brute force approach. What do you observe?
Exercises with a Practical Focus:
Exercise 2.11. SOA Exam Question. You are given:
1. pk denotes the probability that the number of claims equals k for k = 0, 1, 2, . . .
2. \frac{p_n}{p_m} = \frac{m!}{n!}, \quad m ≥ 0, \; n ≥ 0
Exercise 2.12. SOA Exam Question. During a one-year period, the number of accidents per day was
distributed as follows:
No. of Accidents 0 1 2 3 4 5
No. of Days 209 111 33 7 5 2
You use a chi-square test to measure the fit of a Poisson distribution with mean 0.60. The minimum expected
number of observations in any group should be 5. The maximum number of groups should be used. Determine
the value of the chi-square statistic.
A discrete probability distribution has the following properties:

Pr(N = k) = \frac{3k + 9}{8k} Pr(N = k − 1), \quad k = 1, 2, 3, . . .

Determine the value of Pr(N = 3). (Ans: 0.1609)
Exercises

Here is a set of exercises that guide the viewer through some of the theoretical foundations of Loss Data Analytics. Each tutorial is based on one or more questions from the professional actuarial examinations - typically the Society of Actuaries Exam C.

Frequency Distribution Guided Tutorials

2.9 R Code for Plots in this Chapter

The R code below produces the likelihood comparison plots in Figure 2.3.
likbinm<-function(m){
prod((dbinom(x,m,mean(x)/m)))
}
liknbinm<-function(r){
prod(dnbinom(x,r,1-mean(x)/(mean(x)+r)))
}
x<-c(2,5,6,8,9)+2;
n<-(9:100);
r<-(1:100);
ll<-unlist(lapply(n,likbinm));
n[ll==max(ll[!is.na(ll)])]
y<-cbind(n,ll);
z<-cbind(rep("$\\hat{\\sigma}^2<\\hat{\\mu}$",length(n)),rep("Binomial - $L(m,\\overline{x}/m)$",length(n)));
ll<-unlist(lapply(r,liknbinm));
ll[is.na(ll)]=0;
r[ll==max(ll[!is.na(ll)])];
y<-rbind(y,cbind(r,ll));
z<-rbind(z,cbind(rep("$\\hat{\\sigma}^2<\\hat{\\mu}$",length(r)),rep("Neg.Binomial - $L(r,\\overline{x}/r)$",length(r))));
y<-rbind(y,cbind(r,rep(prod(dpois(x,mean(x))),length(r))));
z<-rbind(z,cbind(rep("$\\hat{\\sigma}^2<\\hat{\\mu}$",length(r)),rep("Poisson - $L(\\overline{x})$",length(r))));
x<-c(2,5,6,8,9);
ll<-unlist(lapply(n,likbinm));
n[ll==max(ll[!is.na(ll)])]
y<-rbind(y,cbind(n,ll));
z<-rbind(z,cbind(rep("$\\hat{\\sigma}^2=\\hat{\\mu}$",length(n)),rep("Binomial - $L(m,\\overline{x}/m)$",length(n))));
ll<-unlist(lapply(r,liknbinm));
ll[is.na(ll)]=0;
r[ll==max(ll[!is.na(ll)])];
y<-rbind(y,cbind(r,ll));
z<-rbind(z,cbind(rep("$\\hat{\\sigma}^2=\\hat{\\mu}$",length(r)),rep("Neg.Binomial - $L(r,\\overline{x}/r)$",length(r))));
y<-rbind(y,cbind(r,rep(prod(dpois(x,mean(x))),length(r))));
z<-rbind(z,cbind(rep("$\\hat{\\sigma}^2=\\hat{\\mu}$",length(r)),rep("Poisson - $L(\\overline{x})$",length(r))));
x<-c(2,3,6,8,9);
ll<-unlist(lapply(n,likbinm));
n[ll==max(ll[!is.na(ll)])]
y<-rbind(y,cbind(n,ll));
z<-rbind(z,cbind(rep("$\\hat{\\sigma}^2>\\hat{\\mu}$",length(n)),rep("Binomial - $L(m,\\overline{x}/m)$",length(n))));
ll<-unlist(lapply(r,liknbinm));
ll[is.na(ll)]=0;
r[ll==max(ll[!is.na(ll)])];
y<-rbind(y,cbind(r,ll));
z<-rbind(z,cbind(rep("$\\hat{\\sigma}^2>\\hat{\\mu}$",length(r)),rep("Neg.Binomial - $L(r,\\overline{x}/r)$",length(r))));
y<-rbind(y,cbind(r,rep(prod(dpois(x,mean(x))),length(r))));
z<-rbind(z,cbind(rep("$\\hat{\\sigma}^2>\\hat{\\mu}$",length(r)),rep("Poisson - $L(\\overline{x})$",length(r))));
colnames(y)<-c("x","lik");
colnames(z)<-c("dataset","Distribution");
dy<-cbind(data.frame(y),data.frame(z));
library(tikzDevice);
library(ggplot2);
options(tikzMetricPackages = c("\\usepackage[utf8]{inputenc}","\\usepackage[T1]{fontenc}", "\\usetikzlibrary{calc}",
                               "\\usepackage{amssymb}","\\usepackage{amsmath}","\\usepackage[active]{preview}"));
tikz(file = "plot_test_2.tex", width = 6.25, height = 6.25);
ggplot(data=dy,aes(x=x,y=lik,col=Distribution)) + geom_point(size=0.25) + facet_grid(dataset~.)+
labs(x="m/r",y="Likelihood",title="");
dev.off();
• N.D. Shyamalkumar, The University of Iowa, is the principal author of the initial version of this chapter. Email: [email protected] for chapter comments and suggested improvements.
• Krupa Viswanathan, Temple University, [email protected], provided substantial improvements.
Here are a few references cited in the chapter.
Chapter 3

Modeling Loss Severity
Chapter Preview. The traditional loss distribution approach to modeling aggregate losses starts by separately
fitting a frequency distribution to the number of losses and a severity distribution to the size of losses. The
estimated aggregate loss distribution combines the loss frequency distribution and the loss severity distribution
by convolution. Discrete distributions often referred to as counting or frequency distributions were used in
Chapter 2 to describe the number of events such as number of accidents to the driver or number of claims to
the insurer. Lifetimes, asset values, losses and claim sizes are usually modeled as continuous random variables
and as such are modeled using continuous distributions, often referred to as loss or severity distributions.
Mixture distributions are used to model phenomena arising in a heterogeneous population, such as modelling more than one type of claim in liability insurance (small frequent claims and large relatively rare claims). In this chapter we explore the use of continuous as well as mixture distributions to model the
random size of loss. We present key attributes that characterize continuous models and means of creating
new distributions from existing ones. In this chapter we explore the effect of coverage modifications, which
change the conditions that trigger a payment, such as applying deductibles, limits, or adjusting for inflation,
on the distribution of individual loss amounts.
In this section we calculate the basic distributional quantities: moments, percentiles and generating functions.
3.1.1 Moments
Let X be a continuous random variable with probability density function f_X(x). The k-th raw moment of X, denoted by µ'_k, is the expected value of the k-th power of X, provided it exists. The first raw moment µ'_1 is the mean of X, usually denoted by µ. The formula for µ'_k is given as

µ'_k = E[X^k] = \int_0^{∞} x^k f_X(x)\, dx.
The support of the random variable X is assumed to be nonnegative since actuarial phenomena are rarely
negative.
The k-th central moment of X, denoted by µk , is the expected value of the k-th power of the deviation of X
from its mean µ. The formula for µk is given as
µ_k = E[(X − µ)^k] = \int_0^{∞} (x − µ)^k f_X(x)\, dx.
The second central moment µ2 defines the variance of X, denoted by σ 2 . The square root of the variance is
the standard deviation σ. A further characterization of the shape of the distribution includes its degree of
symmetry as well as its flatness compared to the normal distribution. The ratio of the third central moment
to the cube of the standard deviation µ3 /σ 3 defines the coefficient of skewness which is a measure of
symmetry. A positive coefficient of skewness indicates that the distribution is skewed to the right (positively
skewed). The ratio of the fourth central moment to the fourth power of the standard deviation µ4 /σ 4
defines the coefficient of kurtosis. The normal distribution has a coefficient of kurtosis of 3. Distributions
with a coefficient of kurtosis greater than 3 have heavier tails and higher peak than the normal, whereas
distributions with a coefficient of kurtosis less than 3 have lighter tails and are flatter.
Example 3.1.1. SOA Exam Question. Assume that the rv X has a gamma distribution with mean 8
and skewness 1. Find the variance of X.
Show Example Solution
Solution. The probability density function of X is given by

f_X(x) = \frac{(x/θ)^{α}}{x Γ(α)} e^{−x/θ}

for x > 0. For α > 0, the k-th raw moment is

µ'_k = E[X^k] = \frac{1}{Γ(α) θ^{α}} \int_0^{∞} x^{k+α−1} e^{−x/θ}\, dx = \frac{Γ(k + α)}{Γ(α)} θ^{k}.
Given Γ(r + 1) = r Γ(r) and Γ(1) = 1, we have µ'_1 = E(X) = αθ, µ'_2 = E(X^2) = (α + 1)αθ^2, and µ'_3 = E(X^3) = (α + 2)(α + 1)αθ^3. Hence,

Skewness = \frac{E[(X − µ'_1)^3]}{Var(X)^{3/2}} = \frac{µ'_3 − 3µ'_2 µ'_1 + 2{µ'_1}^3}{Var(X)^{3/2}} = \frac{(α + 2)(α + 1)αθ^3 − 3(α + 1)α^2 θ^3 + 2α^3 θ^3}{(αθ^2)^{3/2}} = \frac{2}{α^{1/2}} = 1.

Hence α = 4. Since E(X) = αθ = 8, we have θ = 2 and finally Var(X) = αθ^2 = 16.
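A brief numerical check (not part of the original text) of the moment formulas used in this example:

alpha <- 4; theta <- 2
c(mean = alpha * theta, variance = alpha * theta^2, skewness = 2 / sqrt(alpha))
set.seed(4)
x <- rgamma(1e6, shape = alpha, scale = theta)
mean(((x - mean(x)) / sd(x))^3)                      # empirical skewness, close to 1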
3.1.2 Quantiles
Percentiles can also be used to describe the characteristics of the distribution of X. The 100pth percentile of
the distribution of X, denoted by πp , is the value of X which satisfies
F_X(π_p−) ≤ p ≤ F_X(π_p),

for 0 ≤ p ≤ 1.
The 50-th percentile or the middle point of the distribution, π0.5 , is the median. Unlike discrete random
variables, percentiles of continuous variables are distinct.
Example 3.1.2. SOA Exam Question. Let X be a continuous random variable with density function f_X(x) = θe^{−θx}, for x > 0 and 0 elsewhere. If the median of this distribution is \frac{1}{3}, find θ.

Show Example Solution

Solution. The distribution function is F_X(x) = 1 − e^{−θx}. So F_X(π_{0.5}) = 1 − e^{−θ π_{0.5}} = 0.5. As π_{0.5} = \frac{1}{3}, we have F_X\left(\frac{1}{3}\right) = 1 − e^{−θ/3} = 0.5 and θ = 3 \ln 2.
The moment generating function, denoted by M_X(t), uniquely characterizes the distribution of X. While it is possible for two different distributions to have the same moments and yet still differ, this is not the case with the moment generating function: if two random variables have the same moment generating function, then they have the same distribution. The moment generating function is a real-valued function whose k-th derivative at zero is equal to the k-th raw moment of X. It is given by

M_X(t) = E[e^{tX}] = \int_0^{∞} e^{tx} f_X(x)\, dx.
Then,

M_X(−b^2) = \frac{b}{b + b^2} = \frac{1}{1 + b} = 0.2.

Thus, b = 4.
Example 3.1.4. SOA Exam Question. Let X_1, . . . , X_n be independent Ga(α_i, θ) random variables. Find the distribution of S = \sum_{i=1}^{n} X_i.
Show Example Solution
Solution.
The moment generating function of S is

M_S(t) = E[e^{tS}] = E\left[e^{t \sum_{i=1}^{n} X_i}\right] = E\left[\prod_{i=1}^{n} e^{tX_i}\right] = \prod_{i=1}^{n} E\left[e^{tX_i}\right],

by independence.
The moment generating function of X_i is M_{X_i}(t) = (1 − θt)^{−α_i}. Then,

M_S(t) = \prod_{i=1}^{n} (1 − θt)^{−α_i} = (1 − θt)^{−\sum_{i=1}^{n} α_i},

indicating that S ∼ Ga\left(\sum_{i=1}^{n} α_i, θ\right).
By finding the first and second derivatives of M_S(t) at zero, we can show that

E(S) = \frac{∂ M_S(t)}{∂t}\Big|_{t=0} = αθ, \quad \text{where } α = \sum_{i=1}^{n} α_i,

and

E(S^2) = \frac{∂^2 M_S(t)}{∂t^2}\Big|_{t=0} = (α + 1)αθ^2.
The probability generating function, denoted by P_X(z), also uniquely characterizes the distribution of X. It is defined as

P_X(z) = E[z^X] = \int_0^{∞} z^x f_X(x)\, dx

for all z for which the expected value exists.
We can also use the probability generating function to generate moments of X. By taking the k-th derivative
of PX (z) with respect to z and evaluate it at z = 1, we get E [X (X − 1) . . . (X − k + 1)] .
The probability generating function is more useful for discrete rvs and was introduced in Section 2.2.2.
The gamma distribution is commonly used in modeling claim severity. The traditional approach in modelling
losses is to fit separate models for claim frequency and claim severity. When frequency and severity are
modeled separately it is common for actuaries to use the Poisson distribution for claim count and the
gamma distribution to model severity. An alternative approach for modelling losses that has recently gained
popularity is to create a single model for pure premium (average claim cost) that will be described in Chapter
4.
The continuous variable X is said to have the gamma distribution with shape parameter α and scale parameter
θ if its probability density function is given by
f_X(x) = \frac{(x/θ)^{α}}{x Γ(α)} \exp(−x/θ) \quad \text{for } x > 0.
Note that α > 0, θ > 0.
The two panels in Figure 3.1 demonstrate the effect of the scale and shape parameters on the gamma density function.

Figure 3.1: Gamma Densities. The left-hand panel is with shape = 2 and varying scale. The right-hand panel is with scale = 100 and varying shape.

R Code for Gamma Density Plots

# Sketch: reproduces the left-hand panel of Figure 3.1 (shape = 2, scale = 100, 150, 200, 250);
# the right-hand panel is analogous, with scale = 100 and varying shape.
par(mfrow = c(1, 2), mar = c(4, 4, .1, .1))
x <- seq(0, 1000, by = 1)
scaleparam <- c(100, 150, 200, 250)
plot(x, dgamma(x, shape = 2, scale = scaleparam[1]), type = "l", ylab = "Gamma density")
for (k in 2:4) {
  fgamma <- dgamma(x, shape = 2, scale = scaleparam[k])
  lines(x, fgamma, col = k)
}
legend("topright", c("scale=100", "scale=150", "scale=200", "scale=250"), lty = 1, col = 1:4)
Γ (α + k)
E X k = θk
for k > 0.
Γ (α)
The mean and variance are given by E (X) = αθ and Var (X) = αθ2 , respectively.
Since all moments exist for any positive k, the gamma distribution is considered a light tailed distribution,
which may not be suitable for modeling risky assets as it will not provide a realistic assessment of the
likelihood of severe losses.
The Pareto distribution, named after the Italian economist Vilfredo Pareto (1843-1923), has many economic
and financial applications. It is a positively skewed and heavy-tailed distribution which makes it suitable for
modeling income, high-risk insurance claims and severity of large casualty losses. The survival function of the
Pareto distribution which decays slowly to zero was first used to describe the distribution of income where
a small percentage of the population holds a large proportion of the total wealth. For extreme insurance
claims, the tail of the severity distribution (losses in excess of a threshold) can be modelled using a Pareto
distribution.
The continuous variable X is said to have the Pareto distribution with shape parameter α and scale parameter
θ if its pdf is given by
f_X(x) = \frac{α θ^{α}}{(x + θ)^{α+1}}, \quad x > 0, \; α > 0, \; θ > 0.
The two panels in Figure 3.2 demonstrate the effect of the scale and shape parameters on the Pareto density
function.
R Code for Pareto Density Plots

Figure 3.2: Pareto Densities. The left-hand panel is with scale = 2000 and varying shape. The right-hand panel is with shape = 3 and varying scale.
The Weibull distribution, named after the Swedish physicist Waloddi Weibull (1887-1979), is widely used in reliability, life data analysis, weather forecasting and general insurance claims. Truncated data arise frequently in insurance studies, and the Weibull distribution is particularly useful in modeling left-truncated claim severity distributions. The Weibull distribution has been used to model excess of loss treaty experience for automobile insurance as well as earthquake inter-arrival times.
The continuous variable X is said to have the Weibull distribution with shape parameter α and scale parameter
θ if its probability density function is given by
f_X(x) = \frac{α}{θ} \left(\frac{x}{θ}\right)^{α−1} \exp\left(−\left(\frac{x}{θ}\right)^{α}\right), \quad x > 0, \; α > 0, \; θ > 0.
The two panels in Figure 3.3 demonstrate the effects of the scale and shape parameters on the Weibull density function.
R Code for Weibull Density Plots
par(mfrow=c(1, 2), mar = c(4, 4, .1, .1))
It can be easily seen that the shape parameter α describes the shape of the hazard function of the Weibull
distribution. The hazard function is a decreasing function when α < 1, constant when α = 1 and increasing
when α > 1. This behavior of the hazard function makes the Weibull distribution a suitable model for a wide
variety of phenomena such as weather forecasting, electrical and industrial engineering, insurance modeling
and financial risk analysis.
The k-th moment of a Weibull distributed random variable is given by

E[X^k] = θ^k Γ\left(1 + \frac{k}{α}\right).
Figure 3.3: Weibull Densities. The left-hand panel is with shape=3 and Varying Scale. The right-hand panel
is with scale=100 and Varying Shape.
The mean and the variance are given by

E(X) = θ Γ\left(1 + \frac{1}{α}\right)

and

Var(X) = θ^2 \left(Γ\left(1 + \frac{2}{α}\right) − \left(Γ\left(1 + \frac{1}{α}\right)\right)^2\right),

respectively.
Example 3.2.2. Suppose that the probability distribution of the lifetime of AIDS patients (in months) from
the time of diagnosis is described by the Weibull distribution with shape parameter 1.2 and scale parameter
33.33.
a. Find the probability that a randomly selected person from this population survives at least 12 months.
b. A random sample of 10 patients will be selected from this population. What is the probability that at most two will die within one year of diagnosis?
c. Find the 99-th percentile of this distribution.
Show Example Solution

Solution.

a. Let X ∼ Wei(1.2, 33.33) be the lifetime of AIDS patients (in months). We have

Pr(X ≥ 12) = S_X(12) = e^{−(12/33.33)^{1.2}} = 0.746.

b. Let Y be the number of patients who die within one year of diagnosis. Then Y ∼ Bin(10, 0.254) and Pr(Y ≤ 2) = 0.514.

c. Let π_{0.99} denote the 99-th percentile of this distribution. Then

S_X(π_{0.99}) = \exp\left(−\left(\frac{π_{0.99}}{33.33}\right)^{1.2}\right) = 0.01,

so that π_{0.99} = 33.33 (−\ln 0.01)^{1/1.2} ≈ 119.0.
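The three parts can be checked with base R's Weibull functions (not part of the original text):

pweibull(12, shape = 1.2, scale = 33.33, lower.tail = FALSE)  # part (a): about 0.746
pbinom(2, size = 10, prob = 1 - 0.746)                        # part (b): about 0.514
qweibull(0.99, shape = 1.2, scale = 33.33)                    # part (c): about 119 months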
The Generalized Beta Distribution of the Second Kind (GB2) was introduced by Venter (1983) in the
context of insurance loss modeling and by McDonald (1984) as an income and wealth distribution. It is a
four-parameter very flexible distribution that can model positively as well as negatively skewed distributions.
The continuous variable X is said to have the GB2 distribution with parameters a, b, α and β if its probability density function is given by

f_X(x) = \frac{a x^{aα−1}}{b^{aα} B(α, β) \left[1 + (x/b)^{a}\right]^{α+β}} \quad \text{for } x > 0,

where B(α, β) denotes the beta function.
The GB2 provides a model for heavy as well as light tailed data. It includes the exponential, gamma, Weibull,
Burr, Lomax, F, chi-square, Rayleigh, lognormal and log-logistic as special or limiting cases. For example, by
setting the parameters a = α = β = 1, then the GB2 reduces to the log-logistic distribution. When a = 1
and β → ∞, it reduces to the gamma distribution and when α = 1 and β → ∞, it reduces to the Weibull
distribution.
The k-th moment of the GB2 distributed random variable is given by

E[X^k] = \frac{b^k B\left(α + \frac{k}{a}, β − \frac{k}{a}\right)}{B(α, β)}, \quad k > 0.
Earlier applications of the GB2 were on income data and more recently have been used to model long-tailed
claims data. GB2 was used to model different types of automobile insurance claims, severity of fire losses as
well as medical insurance claim data.
In Section 3.2 we discussed some elementary known distributions. In this section we discuss means of creating
new parametric probability distributions from existing ones. Let X be a continuous random variable with
a known probability density function fX (x) and distribution function FX (x). Consider the transformation
Y = g (X), where g(X) is a one-to-one transformation defining a new random variable Y . We can use the
distribution function technique, the change-of-variable technique or the moment-generating function technique
to find the probability density function of the variable of interest Y . In this section we apply the following
techniques for creating new families of distributions: (a) multiplication by a constant (b) raising to a power,
(c) exponentiation and (d) mixing.
If claim data show changes over time, such a transformation can be useful to adjust for inflation. If the level of inflation is positive then claim costs are rising, and if it is negative then costs are falling. To adjust for inflation we multiply the cost X by 1 plus the inflation rate (negative inflation is deflation). To account for currency impact on claim costs we can also use a transformation to apply a currency conversion from a base to a counter currency.
Consider the transformation Y = cX, where c > 0. Then the distribution function of Y is given by

F_Y(y) = \Pr(Y ≤ y) = \Pr(cX ≤ y) = \Pr\left(X ≤ \frac{y}{c}\right) = F_X\left(\frac{y}{c}\right).

Hence, the probability density function of interest f_Y(y) can be written as

f_Y(y) = \frac{1}{c} f_X\left(\frac{y}{c}\right).
Suppose that X belongs to a certain set of parametric distributions and define a rescaled version Y = cX, c > 0. If Y is in the same set of distributions then the distribution is said to be a scale distribution. When a member of a scale distribution is multiplied by a constant c (c > 0), the scale parameter for this scale distribution meets two conditions:
• The scale parameter is changed by multiplying by c;
• All other parameters remain unchanged.
Example 3.3.1. SOA Exam Question. The aggregate losses of Eiffel Auto Insurance are denoted in Euro
currency and follow a Lognormal distribution with µ = 8 and σ = 2. Given that 1 euro = 1.3 dollars, find
the set of lognormal parameters, which describe the distribution of Eiffel’s losses in dollars?
Show Example Solution
Solution.
Let X and Y denote the aggregate losses of Eiffel Auto Insurance in euro currency and dollars respectively.
As Y = 1.3X, we have,
\[ F_Y(y) = \Pr(Y \le y) = \Pr(1.3X \le y) = \Pr\!\left(X \le \frac{y}{1.3}\right) = F_X\!\left(\frac{y}{1.3}\right). \]
X follows a lognormal distribution with parameters µ = 8 and σ = 2. The probability density function of X is given by
\[ f_X(x) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left\{-\frac{1}{2}\left(\frac{\ln x - \mu}{\sigma}\right)^2\right\} \qquad \text{for } x > 0. \]
As dx/dy = 1/1.3, the probability density function of interest f_Y(y) is
\[ f_Y(y) = \frac{1}{1.3} f_X\!\left(\frac{y}{1.3}\right) = \frac{1}{y\sigma\sqrt{2\pi}} \exp\left\{-\frac{1}{2}\left(\frac{\ln(y/1.3) - \mu}{\sigma}\right)^2\right\} = \frac{1}{y\sigma\sqrt{2\pi}} \exp\left\{-\frac{1}{2}\left(\frac{\ln y - (\ln 1.3 + \mu)}{\sigma}\right)^2\right\}. \]
Then Y follows a lognormal distribution with parameters ln 1.3 + µ = 8.26 and σ = 2.00. If we let µ = ln(m), it is easily seen that m = e^µ is the scale parameter, which was multiplied by 1.3, while σ is the shape parameter, which remained unchanged.
Example 3.3.2. SOA Exam Question. Demonstrate that the gamma distribution is a scale distribution.
Show Example Solution
Solution.
Let X ∼ Ga(α, θ) and Y = cX. As dx/dy = 1/c, then
\[ f_Y(y) = \frac{1}{c}\, f_X\!\left(\frac{y}{c}\right) = \frac{\left(\frac{y}{c\theta}\right)^{\alpha}}{y\,\Gamma(\alpha)} \exp\!\left(-\frac{y}{c\theta}\right). \]
We can see that Y ∼ Ga(α, cθ), indicating that the gamma is a scale distribution and θ is a scale parameter.
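As a quick numerical check of this scale property (a sketch with illustrative parameter values, not part of the original example), the change-of-variable density (1/c) f_X(y/c) can be compared with the Ga(α, cθ) density directly in R:

# check of the gamma scale property: if X ~ Ga(alpha, theta) and Y = k*X,
# then Y ~ Ga(alpha, k*theta); the values below are illustrative only
alpha <- 2; theta <- 500; k <- 1.3; y <- 750
(1/k) * dgamma(y/k, shape = alpha, scale = theta)   # change-of-variable density of Y
dgamma(y, shape = alpha, scale = k * theta)         # Ga(alpha, k*theta) density at y
# the two values agree, confirming that theta is a scale parameter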
In the previous section we have talked about the flexibility of the Weibull distribution in fitting reliability data.
Looking to the origins of the Weibull distribution, we recognize that the Weibull is a power transformation of
the exponential distribution. This is an application of another type of transformation which involves raising
the random variable to a power.
Consider the transformation Y = X^τ, where τ > 0; then the distribution function of Y is given by
\[ F_Y(y) = \Pr(Y \le y) = \Pr\!\left(X^{\tau} \le y\right) = \Pr\!\left(X \le y^{1/\tau}\right) = F_X\!\left(y^{1/\tau}\right), \]
and
\[ f_Y(y) = \frac{1}{\tau}\, y^{1/\tau - 1}\, f_X\!\left(y^{1/\tau}\right). \]
Example 3.3.3. We assume that X follows the exponential distribution with mean θ and consider the
transformed variable Y = X τ . Show that Y follows the Weibull distribution when τ is positive and determine
the parameters of the Weibull distribution.
Show Example Solution
Solution.
As X ∼ Exp(θ), we have
\[ f_X(x) = \frac{1}{\theta} e^{-x/\theta}, \qquad x > 0. \]
Solving y = x^τ for x yields x = y^{1/τ}. Taking the derivative, we have
\[ \frac{dx}{dy} = \frac{1}{\tau}\, y^{1/\tau - 1}. \]
Thus,
\[ f_Y(y) = \frac{1}{\tau}\, y^{1/\tau - 1} f_X\!\left(y^{1/\tau}\right) = \frac{1}{\tau\theta}\, y^{1/\tau - 1} e^{-y^{1/\tau}/\theta} = \frac{\alpha}{\beta}\left(\frac{y}{\beta}\right)^{\alpha - 1} e^{-(y/\beta)^{\alpha}}, \]
where α = 1/τ and β = θ^τ. Then, Y follows the Weibull distribution with shape parameter α and scale parameter β.
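A small simulation sketch in R (illustrative values, not part of the original example) can be used to confirm this power-transformation result by comparing simulated quantiles of Y = X^τ with Weibull quantiles:

# if X ~ Exp(mean theta) and Y = X^tau, then Y ~ Weibull(shape 1/tau, scale theta^tau)
set.seed(2023)
theta <- 2; tau <- 0.5
y <- rexp(10000, rate = 1/theta)^tau
probs <- c(0.25, 0.5, 0.75, 0.9)
cbind(simulated  = quantile(y, probs),
      theoretical = qweibull(probs, shape = 1/tau, scale = theta^tau))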
3.3.4 Exponentiation
The normal distribution is a very popular model for a wide number of applications and when the sample size
is large, it can serve as an approximate distribution for other models. If the random variable X has a normal
distribution with mean µ and variance σ 2 , then Y = eX has lognormal distribution with parameters µ and
σ 2 . The lognormal random variable has a lower bound of zero, is positively skewed and has a long right tail.
A lognormal distribution is commonly used to describe distributions of financial assets such as stock prices.
It is also used in fitting claim amounts for automobile as well as health insurance. This is an example of
another type of transformation which involves exponentiation.
Consider the transformation Y = e^X; then the distribution function of Y is given by
\[ F_Y(y) = \Pr(Y \le y) = \Pr\!\left(e^X \le y\right) = \Pr(X \le \ln y) = F_X(\ln y). \]
Example 3.3.4. SOA Exam Question. X has a uniform distribution on the interval (0, c). Y = eX .
Find the distribution of Y .
Show Example Solution
Solution. As X is uniform on (0, c), F_X(x) = x/c for 0 < x < c. Hence, for 1 < y < e^c,
\[ F_Y(y) = F_X(\ln y) = \frac{\ln y}{c} \qquad \text{and} \qquad f_Y(y) = \frac{1}{cy}. \]
Mixture distributions represent a useful way of modelling data that are drawn from a heterogeneous population.
This parent population can be thought to be divided into multiple subpopulations with distinct distributions.
Two-point Mixture
If the underlying phenomenon is diverse and can actually be described as two phenomena representing
two subpopulations with different modes, we can construct the two point mixture random variable X.
Given random variables X1 and X2, with probability density functions f_{X_1}(x) and f_{X_2}(x) respectively, the probability density function of X is the weighted average of the component probability density functions f_{X_1}(x) and f_{X_2}(x). The probability density function and distribution function of X are given by
\[ f_X(x) = a f_{X_1}(x) + (1 - a) f_{X_2}(x) \]
and
\[ F_X(x) = a F_{X_1}(x) + (1 - a) F_{X_2}(x), \]
for 0 < a < 1, where the mixing parameters a and (1 − a) represent the proportions of data points that fall under each of the two subpopulations, respectively. This weighted average can be applied to a number of other distribution-related quantities. The k-th moment and moment generating function of X are given by
\[ \mathrm{E}\left[X^k\right] = a\,\mathrm{E}\left[X_1^k\right] + (1 - a)\,\mathrm{E}\left[X_2^k\right] \qquad \text{and} \qquad M_X(t) = a M_{X_1}(t) + (1 - a) M_{X_2}(t), \]
respectively.
Example 3.3.5. SOA Exam Question. The distribution of the random variable X is an equally weighted
mixture of two Poisson distributions with parameters λ1 and λ2 respectively. The mean and variance of X
are 4 and 13, respectively. Determine Pr (X > 2).
Show Example Solution
Solution.
The mean and second moment of the mixture are E(X) = 0.5(λ1 + λ2) = 4 and E(X²) = 0.5(λ1 + λ1²) + 0.5(λ2 + λ2²) = 13 + 4² = 29. Substituting the first equation into the second, we get λ1 + λ2 = 8 and λ1² + λ2² = 50. After further substitution, we find that the parameters of the two Poisson distributions are 1 and 7, respectively. So,
\[ \Pr(X > 2) = 0.5 \Pr(X_1 > 2) + 0.5 \Pr(X_2 > 2) = 0.05. \]
k-point Mixture
In case of finite mixture distributions, the random variable of interest X has a probability pi of being
drawn from homogeneous subpopulation i, where i = 1, 2, . . . , k and k is the initially specified number
of subpopulations in our mixture. The mixing parameter pi represents the proportion of observations
from subpopulation i. Consider the random variable X generated from k distinct subpopulations, where
subpopulation i is modeled by the continuous distribution fXi (x). The probability distribution of X is given
by
\[ f_X(x) = \sum_{i=1}^{k} p_i f_{X_i}(x), \]
where 0 < p_i < 1 and \(\sum_{i=1}^{k} p_i = 1\).
This model is often referred to as a finite mixture or a k-point mixture. The distribution function, r-th moment and moment generating function of the k-point mixture are given by
\[ F_X(x) = \sum_{i=1}^{k} p_i F_{X_i}(x), \qquad \mathrm{E}(X^r) = \sum_{i=1}^{k} p_i\,\mathrm{E}(X_i^r), \qquad \text{and} \qquad M_X(t) = \sum_{i=1}^{k} p_i M_{X_i}(t), \]
respectively.
Example 3.3.6. SOA Exam Question. Y1 is a mixture of X1 and X2 with mixing weights a and (1 − a).
Y2 is a mixture of X3 and X4 with mixing weights b and (1 − b). Z is a mixture of Y1 and Y2 with mixing
weights c and (1 − c).
Show that Z is a mixture of X1 , X2 , X3 and X4 , and find the mixing weights.
Show Example Solution
Solution. Applying the formula for a mixture distribution, we get
\[ f_{Y_1}(x) = a f_{X_1}(x) + (1 - a) f_{X_2}(x) \qquad \text{and} \qquad f_{Y_2}(x) = b f_{X_3}(x) + (1 - b) f_{X_4}(x). \]
Then
\[ f_Z(x) = c f_{Y_1}(x) + (1 - c) f_{Y_2}(x) = ca\, f_{X_1}(x) + c(1 - a) f_{X_2}(x) + (1 - c)b\, f_{X_3}(x) + (1 - c)(1 - b) f_{X_4}(x). \]
Hence Z is a four-point mixture of X1, X2, X3 and X4 with mixing weights ca, c(1 − a), (1 − c)b and (1 − c)(1 − b).
A mixture with a very large number of subpopulations (k goes to infinity) is often referred to as a continuous
mixture. In a continuous mixture, subpopulations are not distinguished by a discrete mixing parameter but
by a continuous variable θ, where θ plays the role of pi in the finite mixture. Consider the random variable
X with a distribution depending on a parameter θ, where θ itself is a continuous random variable. This
description yields the following model for X:
\[ f_X(x) = \int_0^{\infty} f_X(x \mid \theta)\, g(\theta)\, d\theta, \]
where fX (x |θ ) is the conditional distribution of X at a particular value of θ and g (θ) is the probability
statement made about the unknown parameter θ, known as the prior distribution of θ (the prior information
or expert opinion to be used in the analysis).
The distribution function, k-th moment and moment generating function of the continuous mixture are given by
\[ F_X(x) = \int_{-\infty}^{\infty} F_X(x \mid \theta)\, g(\theta)\, d\theta, \qquad \mathrm{E}\left[X^k\right] = \int_{-\infty}^{\infty} \mathrm{E}\left[X^k \mid \theta\right] g(\theta)\, d\theta, \qquad M_X(t) = \mathrm{E}\left[e^{tX}\right] = \int_{-\infty}^{\infty} \mathrm{E}\left[e^{tX} \mid \theta\right] g(\theta)\, d\theta, \]
respectively.
The k-th moment of the mixture distribution can be rewritten as
\[ \mathrm{E}\left[X^k\right] = \int_{-\infty}^{\infty} \mathrm{E}\left[X^k \mid \theta\right] g(\theta)\, d\theta = \mathrm{E}\left[\mathrm{E}\left(X^k \mid \theta\right)\right]. \]
In particular, the mean and variance of X are given by
\[ \mathrm{E}(X) = \mathrm{E}\left[\mathrm{E}(X \mid \theta)\right] \]
and
\[ \mathrm{Var}(X) = \mathrm{E}\left[\mathrm{Var}(X \mid \theta)\right] + \mathrm{Var}\left[\mathrm{E}(X \mid \theta)\right]. \]
Example 3.3.7. SOA Exam Question. X has a binomial distribution with a mean of 100q and a variance
of 100q (1 − q) and q has a beta distribution with parameters a = 3 and b = 2. Find the unconditional mean
and variance of X.
Show Example Solution
Solution.
As q ∼ Beta(3, 2), we have E(q) = a/(a + b) = 3/5 and E(q²) = a(a + 1)/[(a + b)(a + b + 1)] = 2/5.
Now, using the formulas for the unconditional mean and variance, we have
\[ \mathrm{E}(X) = \mathrm{E}\left[\mathrm{E}(X \mid q)\right] = \mathrm{E}(100q) = 100 \times \frac{3}{5} = 60 \]
and
\[ \mathrm{Var}(X) = \mathrm{E}\left[\mathrm{Var}(X \mid q)\right] + \mathrm{Var}\left[\mathrm{E}(X \mid q)\right] = \mathrm{E}\left[100q(1 - q)\right] + \mathrm{Var}(100q) = 100(0.6 - 0.4) + 100^2(0.4 - 0.36) = 20 + 400 = 420. \]
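A quick Monte Carlo sketch in R (not part of the original solution) can be used to confirm these unconditional moments:

# q ~ Beta(3, 2) and X | q ~ Binomial(100, q); the unconditional mean and
# variance should be close to 60 and 420, respectively
set.seed(2023)
q <- rbeta(100000, shape1 = 3, shape2 = 2)
x <- rbinom(100000, size = 100, prob = q)
mean(x)   # approximately 60
var(x)    # approximately 420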
Example 3.3.8. SOA Exam Question. Claim sizes, X, are uniform on (θ, θ + 10) for each policyholder. θ varies by policyholder according to an exponential distribution with mean 5. Find the unconditional distribution, mean and variance of X.
Show Example Solution
Solution.
The conditional distribution of X is f_{X|θ}(x | θ) = 1/10 for θ < x < θ + 10.
The prior distribution of θ is g(θ) = (1/5) e^{−θ/5} for 0 < θ < ∞.
The conditional mean and variance of X are given by
\[ \mathrm{E}(X \mid \theta) = \frac{\theta + (\theta + 10)}{2} = \theta + 5 \qquad \text{and} \qquad \mathrm{Var}(X \mid \theta) = \frac{\left[(\theta + 10) - \theta\right]^2}{12} = \frac{100}{12}, \]
respectively.
Hence, the unconditional mean and variance of X are given by
\[ \mathrm{E}(X) = \mathrm{E}\left[\mathrm{E}(X \mid \theta)\right] = \mathrm{E}(\theta + 5) = 5 + 5 = 10 \]
and
\[ \mathrm{Var}(X) = \mathrm{E}\left[\mathrm{Var}(X \mid \theta)\right] + \mathrm{Var}\left[\mathrm{E}(X \mid \theta)\right] = \mathrm{E}\left(\frac{100}{12}\right) + \mathrm{Var}(\theta + 5) = 8.33 + \mathrm{Var}(\theta) = 8.33 + 25 = 33.33. \]
12
The unconditional distribution of X is
\[ f_X(x) = \int f_X(x \mid \theta)\, g(\theta)\, d\theta = \begin{cases} \displaystyle\int_0^{x} \frac{1}{50} e^{-\theta/5}\, d\theta = \frac{1}{10}\left(1 - e^{-x/5}\right) & 0 \le x \le 10, \\[1.5ex] \displaystyle\int_{x-10}^{x} \frac{1}{50} e^{-\theta/5}\, d\theta = \frac{1}{10}\left(e^{-(x-10)/5} - e^{-x/5}\right) & 10 < x < \infty. \end{cases} \]
Under an ordinary deductible policy, the insured (policyholder) agrees to cover a fixed amount of an insurance
claim before the insurer starts to pay. This fixed expense paid out of pocket is called the deductible and often
denoted by d. The insurer is responsible for covering the loss X less the deductible d. Depending on the
agreement, the deductible may apply to each covered loss or to a defined benefit period (month, year, etc.).
Deductibles eliminate a large number of small claims, reduce costs of handling and processing these claims,
reduce premiums for the policyholders and reduce moral hazard. Moral hazard occurs when the insured takes
more risks, increasing the chances of loss due to perils insured against, knowing that the insurer will incur
the cost (e.g. a policyholder with collision insurance may be encouraged to drive recklessly). The larger the
deductible, the less the insured pays in premiums for an insurance policy.
Let X denote the loss incurred to the insured and Y denote the amount of paid claim by the insurer. Speaking
of the benefit paid to the policyholder, we differentiate between two variables: The payment per loss and the
payment per payment. The payment per loss variable, denoted by Y L , includes losses for which a payment is
made as well as losses less than the deductible and hence is defined as
\[ Y^L = (X - d)_+ = \begin{cases} 0 & X \le d, \\ X - d & X > d. \end{cases} \]
Y L is often referred to as left censored and shifted variable because the values below d are not ignored and
all losses are shifted by a value d.
On the other hand, the payment per payment variable, denoted by Y P , is not defined when there is no
payment and only includes losses for which a payment is made. The variable is defined as
\[ Y^P = \begin{cases} \text{undefined} & X \le d, \\ X - d & X > d. \end{cases} \]
Y P is often referred to as left truncated and shifted variable or excess loss variable because the claims smaller
than d are not reported and values above d are shifted by d.
Even when the distribution of X is continuous, the distribution of Y L is partly discrete and partly continuous.
The discrete part of the distribution is concentrated at Y = 0 (when X ≤ d) and the continuous part is
spread over the interval Y > 0 (when X > d). For the discrete part, the probability that no payment is made
is the probability that losses fall below the deductible; that is,
\[ \Pr\left(Y^L = 0\right) = \Pr(X \le d) = F_X(d). \]
Using the transformation Y^L = X − d for the continuous part of the distribution, we can find the probability density function of Y^L, given by
\[ f_{Y^L}(y) = \begin{cases} F_X(d) & y = 0, \\ f_X(y + d) & y > 0, \end{cases} \]
and the distribution function of Y^P, given by
\[ F_{Y^P}(y) = \frac{F_X(y + d) - F_X(d)}{1 - F_X(d)}, \]
for y > 0, respectively.
The raw moments of Y L and Y P can be found directly using the probability density function of X as follows
\[ \mathrm{E}\left[\left(Y^L\right)^k\right] = \int_d^{\infty} (x - d)^k f_X(x)\, dx, \]
and
\[ \mathrm{E}\left[\left(Y^P\right)^k\right] = \frac{\int_d^{\infty} (x - d)^k f_X(x)\, dx}{1 - F_X(d)} = \frac{\mathrm{E}\left[\left(Y^L\right)^k\right]}{1 - F_X(d)}, \]
respectively.
We have seen that the deductible d imposed on an insurance policy is the amount of loss that has to be paid out of pocket before the insurer makes any payment, and that it therefore reduces the insurer's payment. The loss elimination ratio (LER) is the percentage decrease in the expected payment of the insurer as a result of imposing the deductible. The LER is defined as
\[ LER = \frac{\mathrm{E}(X) - \mathrm{E}\left(Y^L\right)}{\mathrm{E}(X)}. \]
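For a concrete illustration, the following R sketch (not from the original text; the exponential loss model and values are assumptions for illustration) computes the LER when losses are exponential with mean θ, using E(Y^L) = E(X − d)_+ = θ e^{−d/θ}:

# loss elimination ratio for an exponential loss with mean theta and deductible d
LER <- function(d, theta) {
  EX  <- theta                      # E(X)
  EYL <- theta * exp(-d / theta)    # E(Y^L) = E[(X - d)_+] for the exponential
  (EX - EYL) / EX
}
LER(d = 100, theta = 1000)   # 1 - exp(-0.1), approximately 0.0952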
A less common type of policy deductible is the franchise deductible. The franchise deductible applies to the policy in the same way as the ordinary deductible, except that when the loss exceeds the deductible d, the full loss is covered by the insurer. The payment per loss and payment per payment variables are defined as
\[ Y^L = \begin{cases} 0 & X \le d, \\ X & X > d, \end{cases} \]
and
\[ Y^P = \begin{cases} \text{undefined} & X \le d, \\ X & X > d, \end{cases} \]
respectively.
Example 3.4.1. SOA Exam Question. A claim severity distribution is exponential with mean 1000. An
insurance company will pay the amount of each claim in excess of a deductible of 100. Calculate the variance
of the amount paid by the insurance company for one claim, including the possibility that the amount paid is
0.
Show Example Solution
Solution.
Let Y L denote the amount paid by the insurance company for one claim.
\[ Y^L = (X - 100)_+ = \begin{cases} 0 & X \le 100, \\ X - 100 & X > 100, \end{cases} \]
so that
\[ \mathrm{E}\left[Y^L\right] = \int_{100}^{\infty} (x - 100) f_X(x)\, dx = 1000\, e^{-100/1000} \]
and
\[ \mathrm{E}\left[\left(Y^L\right)^2\right] = \int_{100}^{\infty} (x - 100)^2 f_X(x)\, dx = 2 \times 1000^2\, e^{-100/1000}. \]
So,
\[ \mathrm{Var}\left(Y^L\right) = 2 \times 1000^2\, e^{-100/1000} - \left(1000\, e^{-100/1000}\right)^2 = 990{,}944. \]
An arguably simpler path to the solution is to make use of the relationship between X and Y^P. If X is exponentially distributed with mean 1000, then Y^P is also exponentially distributed with the same mean, because of the memoryless property of the exponential distribution. Hence, E(Y^P) = 1000 and
\[ \mathrm{E}\left[\left(Y^P\right)^2\right] = 2 \times 1000^2, \]
so that
\[ \mathrm{E}\left[\left(Y^L\right)^2\right] = \mathrm{E}\left[\left(Y^P\right)^2\right] S_X(100) = 2 \times 1000^2\, e^{-100/1000}. \]
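The variance obtained above can also be checked by numerical integration. The short R sketch below (an illustrative check, not part of the original solution) evaluates the first two moments of Y^L directly:

# numerical check of Example 3.4.1: exponential mean 1000, deductible 100
theta <- 1000; d <- 100
f  <- function(x) dexp(x, rate = 1/theta)
m1 <- integrate(function(x) (x - d)   * f(x), lower = d, upper = Inf)$value
m2 <- integrate(function(x) (x - d)^2 * f(x), lower = d, upper = Inf)$value
m2 - m1^2   # approximately 990,944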
Note that we divide by S_X(4) = 1 − F_X(4), as this is the range where the variable Y^P is defined.
With e^{−d/θ} = 0.3, the loss elimination ratio at the increased deductible 4d/3 is
\[ \frac{\theta - \theta e^{-(4d)/(3\theta)}}{\theta} = 1 - e^{-(4d)/(3\theta)} = 1 - \left(e^{-d/\theta}\right)^{4/3} = 1 - 0.3^{4/3} = 0.8. \]
Under a limited policy, the insurer is responsible for covering the actual loss X up to the limit of its coverage.
This fixed limit of coverage is called the policy limit and often denoted by u. If the loss exceeds the policy
limit, the difference X − u has to be paid by the policyholder. While a higher policy limit means a higher
payout to the insured, it is associated with a higher premium.
Let X denote the loss incurred to the insured and Y denote the amount of paid claim by the insurer. Then
Y is defined as
\[ Y = X \wedge u = \begin{cases} X & X \le u, \\ u & X > u. \end{cases} \]
It can be seen that the distinction between Y L and Y P is not needed under limited policy as the insurer will
always make a payment.
Even when the distribution of X is continuous, the distribution of Y is partly discrete and partly continuous.
The discrete part of the distribution is concentrated at Y = u (when X > u), while the continuous part is
spread over the interval Y < u (when X ≤ u). For the discrete part, the probability that the benefit paid is
u, is the probability that the loss exceeds the policy limit u; that is,
\[ \Pr(Y = u) = \Pr(X > u) = 1 - F_X(u). \]
For the continuous part of the distribution Y = X, hence the probability density function of Y is given by
\[ f_Y(y) = \begin{cases} f_X(y) & 0 < y < u, \\ 1 - F_X(u) & y = u. \end{cases} \]
The raw moments of Y can be found directly using the probability density function of X as follows
\[ \mathrm{E}\left[Y^k\right] = \mathrm{E}\left[(X \wedge u)^k\right] = \int_0^{u} x^k f_X(x)\, dx + u^k \int_u^{\infty} f_X(x)\, dx = \int_0^{u} x^k f_X(x)\, dx + u^k \left[1 - F_X(u)\right]. \]
Example 3.4.4. SOA Exam Question. Under a group insurance policy, an insurer agrees to pay 100%
of the medical bills incurred during the year by employees of a small company, up to a maximum total of one
million dollars. The total amount of bills incurred, X, has probability density function
\[ f_X(x) = \begin{cases} \dfrac{x(4 - x)}{9} & 0 < x < 3, \\ 0 & \text{elsewhere}, \end{cases} \]
where x is measured in millions. Calculate the total amount, in millions of dollars, the insurer would expect
to pay under this policy.
Show Example Solution
Solution.
Define the total amount of bills paid by the insurer as
\[ Y = X \wedge 1 = \begin{cases} X & X \le 1, \\ 1 & X > 1. \end{cases} \]
So
\[ \mathrm{E}(Y) = \mathrm{E}(X \wedge 1) = \int_0^{1} \frac{x^2(4 - x)}{9}\, dx + 1 \times \int_1^{3} \frac{x(4 - x)}{9}\, dx = 0.935. \]
3.4.3 Coinsurance
As we have seen in Section 3.4.1, the amount of loss retained by the policyholder can be losses up to the
deductible d. The retained loss can also be a percentage of the claim. The percentage α, often referred to as
the coinsurance factor, is the percentage of claim the insurance company is required to cover. If the policy is
subject to an ordinary deductible and policy limit, coinsurance refers to the percentage of claim the insurer is
required to cover, after imposing the ordinary deductible and policy limit. The payment per loss variable,
Y^L, is defined as
\[ Y^L = \begin{cases} 0 & X \le d, \\ \alpha(X - d) & d < X \le u, \\ \alpha(u - d) & X > u. \end{cases} \]
The policy limit (the maximum amount paid by the insurer) in this case is α (u − d), while u is the maximum
covered loss.
The k-th moment of Y^L is given by
\[ \mathrm{E}\left[\left(Y^L\right)^k\right] = \int_d^{u} \left[\alpha(x - d)\right]^k f_X(x)\, dx + \int_u^{\infty} \left[\alpha(u - d)\right]^k f_X(x)\, dx. \]
A growth factor (1 + r) may be applied to X resulting in an inflated loss random variable (1 + r) X (the
prespecified d and u remain unchanged). The resulting per loss variable can be written as
\[ Y^L = \begin{cases} 0 & X \le \frac{d}{1+r}, \\[0.5ex] \alpha\left[(1 + r)X - d\right] & \frac{d}{1+r} < X \le \frac{u}{1+r}, \\[0.5ex] \alpha(u - d) & X > \frac{u}{1+r}. \end{cases} \]
The first and second moments of Y^L are
\[ \mathrm{E}\left[Y^L\right] = \alpha(1 + r)\left[\mathrm{E}\left(X \wedge \frac{u}{1+r}\right) - \mathrm{E}\left(X \wedge \frac{d}{1+r}\right)\right] \]
and
\[ \mathrm{E}\left[\left(Y^L\right)^2\right] = \alpha^2 (1 + r)^2 \left\{ \mathrm{E}\left[\left(X \wedge \frac{u}{1+r}\right)^2\right] - \mathrm{E}\left[\left(X \wedge \frac{d}{1+r}\right)^2\right] - \frac{2d}{1+r}\left[\mathrm{E}\left(X \wedge \frac{u}{1+r}\right) - \mathrm{E}\left(X \wedge \frac{d}{1+r}\right)\right] \right\}, \]
respectively.
The formulae given for the first and second moments of Y^L are general. Under full coverage, α = 1, r = 0, u = ∞, d = 0 and E(Y^L) reduces to E(X). If only an ordinary deductible is imposed, α = 1, r = 0, u = ∞ and E(Y^L) reduces to E(X) − E(X ∧ d). If only a policy limit is imposed, α = 1, r = 0, d = 0 and E(Y^L) reduces to E(X ∧ u).
Example 3.4.5. SOA Exam Question. The ground up loss random variable for a health insurance policy
in 2006 is modeled with X, an exponential distribution with mean 1000. An insurance policy pays the loss
above an ordinary deductible of 100, with a maximum annual payment of 500. The ground up loss random
variable is expected to be 5% larger in 2007, but the insurance in 2007 has the same deductible and maximum
payment as in 2006. Find the percentage increase in the expected cost per payment from 2006 to 2007.
Show Example Solution
Solution.
We define the amount per loss Y^L in the two years as
\[ Y^L_{2006} = \begin{cases} 0 & X \le 100, \\ X - 100 & 100 < X \le 600, \\ 500 & X > 600, \end{cases} \qquad\qquad Y^L_{2007} = \begin{cases} 0 & X \le 95.24, \\ 1.05X - 100 & 95.24 < X \le 571.43, \\ 500 & X > 571.43. \end{cases} \]
So,
\[ \mathrm{E}\left[Y^L_{2006}\right] = \mathrm{E}(X \wedge 600) - \mathrm{E}(X \wedge 100) = 1000\left(1 - e^{-600/1000}\right) - 1000\left(1 - e^{-100/1000}\right) = 356.026 \]
and
\[ \mathrm{E}\left[Y^L_{2007}\right] = 1.05\left[\mathrm{E}(X \wedge 571.43) - \mathrm{E}(X \wedge 95.24)\right] = 1.05\left[1000\left(1 - e^{-571.43/1000}\right) - 1000\left(1 - e^{-95.24/1000}\right)\right] = 361.659. \]
The expected costs per payment are therefore
\[ \mathrm{E}\left[Y^P_{2006}\right] = \frac{356.026}{e^{-100/1000}} = 393.469 \qquad \text{and} \qquad \mathrm{E}\left[Y^P_{2007}\right] = \frac{361.659}{e^{-95.24/1000}} = 397.797. \]
Because E(Y^P_{2007})/E(Y^P_{2006}) − 1 = 0.011, there is an increase of 1.1% from 2006 to 2007.
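The calculation is easy to reproduce with the limited expected value of the exponential distribution, E(X ∧ m) = θ(1 − e^{−m/θ}). The R sketch below (an illustrative check, not part of the original solution) follows the same steps:

# Example 3.4.5 check: exponential losses with mean 1000, deductible 100, maximum payment 500
theta <- 1000
limev <- function(m) theta * (1 - exp(-m / theta))        # E(min(X, m)) for the exponential
EYL2006 <- limev(600) - limev(100)                         # expected cost per loss, 2006
EYL2007 <- 1.05 * (limev(600/1.05) - limev(100/1.05))      # expected cost per loss, 2007
EYP2006 <- EYL2006 / exp(-100/theta)                       # expected cost per payment, 2006
EYP2007 <- EYL2007 / exp(-(100/1.05)/theta)                # expected cost per payment, 2007
EYP2007 / EYP2006 - 1                                      # approximately 0.011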
3.4.4 Reinsurance
In Section 3.4.1 we introduced the policy deductible, which is a contractual arrangement under which an
insured transfers part of the risk by securing coverage from an insurer in return for an insurance premium.
Under that policy, when the loss exceeds the deductible, the insurer is not required to pay until the insured has
paid the fixed deductible. We now introduce reinsurance, a mechanism of insurance for insurance companies.
Reinsurance is a contractual arrangement under which an insurer transfers part of the underlying insured risk
by securing coverage from another insurer (referred to as a reinsurer) in return for a reinsurance premium.
Although reinsurance involves a relationship between three parties: the original insured, the insurer (often
referred to as cedent or cedant) and the reinsurer, the parties of the reinsurance agreement are only the
primary insurer and the reinsurer. There is no contractual agreement between the original insured and the
reinsurer. The reinsurer is not required to pay under the reinsurance contract until the insurer has paid a
loss to its original insured. The amount retained by the primary insurer in the reinsurance agreement (the
reinsurance deductible) is called retention.
Reinsurance arrangements allow insurers with limited financial resources to increase the capacity to write
insurance and meet client requests for larger insurance coverage while reducing the impact of potential losses
and protecting the insurance company against catastrophic losses. Reinsurance also allows the primary
insurer to benefit from the underwriting skills, expertise and proficient complex claim file handling of the larger reinsurance companies.
Example 3.4.6. SOA Exam Question. In 2005 a risk has a two-parameter Pareto distribution with
α = 2 and θ = 3000. In 2006 losses inflate by 20%. Insurance on the risk has a deductible of 600 in each year.
Pi , the premium in year i, equals 1.2 times expected claims. The risk is reinsured with a deductible that
stays the same in each year. R_i, the reinsurance premium in year i, equals 1.1 times the expected reinsured claims. R_{2005}/P_{2005} = 0.55. Calculate R_{2006}/P_{2006}.
Show Example Solution
Solution.
In 2005, the expected insured claims are
\[ \mathrm{E}\left[(X_{2005} - 600)_+\right] = \mathrm{E}(X_{2005}) - \mathrm{E}(X_{2005} \wedge 600) = 3000 - 3000\left(1 - \frac{3000}{3600}\right) = 2500, \]
so P_{2005} = 1.2 × 2500 = 3000.
Since X_{2006} = 1.2 X_{2005} and the Pareto is a scale distribution with scale parameter θ, we have X_{2006} ∼ Pa(2, 3600) and
\[ \mathrm{E}\left[(X_{2006} - 600)_+\right] = \mathrm{E}(X_{2006}) - \mathrm{E}(X_{2006} \wedge 600) = 3600 - 3600\left(1 - \frac{3600}{4200}\right) = 3085.714, \]
so P_{2006} = 1.2 × 3085.714 = 3702.857.
With reinsurance deductible d_R applied to the insurer's claims, the reinsured claims in year i are
\[ Y_i^R = \begin{cases} 0 & X_i - 600 \le d_R, \\ X_i - 600 - d_R & X_i - 600 > d_R. \end{cases} \]
Since R_{2005}/P_{2005} = 0.55, then R_{2005} = 3000 × 0.55 = 1650, and since R_{2005} = 1.1\,\mathrm{E}(Y^R_{2005}), then E(Y^R_{2005}) = 1650/1.1 = 1500. Now,
\[ \mathrm{E}\left[Y^R_{2005}\right] = \mathrm{E}\left[\left(X_{2005} - 600 - d_R\right)_+\right] = \mathrm{E}(X_{2005}) - \mathrm{E}\left[X_{2005} \wedge (600 + d_R)\right] = 3000 - 3000\left(1 - \frac{3000}{3600 + d_R}\right) = 1500, \]
which gives d_R = 2400. Hence
\[ \mathrm{E}\left[Y^R_{2006}\right] = \mathrm{E}\left[(X_{2006} - 3000)_+\right] = \mathrm{E}(X_{2006}) - \mathrm{E}(X_{2006} \wedge 3000) = 3600 - 3600\left(1 - \frac{3600}{6600}\right) = 1963.636, \]
so R_{2006} = 1.1 × 1963.636 = 2160. Therefore
\[ \frac{R_{2006}}{P_{2006}} = \frac{2160}{3702.857} = 0.583. \]
Pricing of insurance premiums and estimation of claim reserves are among the many actuarial problems that involve modeling the severity of loss (claim size). The principles for using maximum likelihood to estimate
model parameters were introduced in Chapter xxx. In this section, we present a few examples to illustrate
how actuaries fit a parametric distribution model to a set of claim data using maximum likelihood. In these
examples we derive the asymptotic variance of maximum-likelihood estimators of the model parameters. We
use the delta method to derive the asymptotic variances of functions of these parameters.
Example 3.5.1. SOA Exam Question. Consider a random sample of claim amounts: 8,000 10,000 12,000
15,000. You assume that claim amounts follow an inverse exponential distribution, with parameter θ.
a. Calculate the maximum likelihood estimator for θ.
b. Approximate the variance of the maximum likelihood estimator.
c. Determine an approximate 95% confidence interval for θ.
d. Determine an approximate 95% confidence interval for Pr (X ≤ 9, 000) .
Show Example Solution
Solution.
The probability density function is
\[ f_X(x) = \frac{\theta e^{-\theta/x}}{x^2}, \qquad x > 0. \]
a. The likelihood function, L(θ), can be viewed as the probability of the observed data, written as a function of the model's parameter θ:
\[ L(\theta) = \prod_{i=1}^{4} f_{X_i}(x_i) = \frac{\theta^4\, e^{-\theta \sum_{i=1}^{4} 1/x_i}}{\prod_{i=1}^{4} x_i^2}. \]
The loglikelihood has derivative
\[ \frac{d \ln L(\theta)}{d\theta} = \frac{4}{\theta} - \sum_{i=1}^{4} \frac{1}{x_i}. \]
The maximum likelihood estimator of θ, denoted by θ̂, is the solution to the equation
\[ \frac{4}{\hat{\theta}} - \sum_{i=1}^{4} \frac{1}{x_i} = 0. \]
Thus, θ̂ = 4 / Σ_{i=1}^{4}(1/x_i) = 10,667. Since
\[ \frac{d^2 \ln L(\theta)}{d\theta^2} = \frac{-4}{\theta^2} \]
is negative at θ̂ = 10,667, θ̂ is indeed the value that maximizes the loglikelihood function.
b. Taking the reciprocal of the negative expectation of the second derivative of ln L(θ), we obtain an estimate of the variance of θ̂:
\[ \widehat{\mathrm{Var}}\left(\hat{\theta}\right) = \left[-\mathrm{E}\left(\frac{d^2 \ln L(\theta)}{d\theta^2}\right)\Bigg|_{\theta = \hat{\theta}}\right]^{-1} = \frac{\hat{\theta}^2}{4} = 28{,}446{,}222. \]
c. An approximate 95% confidence interval for θ is
\[ 10{,}667 \pm 1.96\sqrt{28{,}446{,}222} = (213,\; 21{,}120). \]
d. The distribution function of the inverse exponential is F_X(x) = e^{-\theta/x}, so g(θ) = Pr(X ≤ 9,000) = e^{-\theta/9{,}000} and g(θ̂) = e^{-10{,}667/9{,}000} = 0.306. By the delta method,
\[ \widehat{\mathrm{Var}}\left[g\left(\hat{\theta}\right)\right] = \left[g'\left(\hat{\theta}\right)\right]^2 \widehat{\mathrm{Var}}\left(\hat{\theta}\right) = \left(\frac{e^{-\hat{\theta}/9{,}000}}{9{,}000}\right)^2 \widehat{\mathrm{Var}}\left(\hat{\theta}\right) = 0.0329, \]
so an approximate 95% confidence interval for Pr(X ≤ 9,000) is 0.306 ± 1.96√0.0329 = (−0.049, 0.661), or (0, 0.661) after truncating the negative lower limit at zero.
Example 3.5.2. SOA Exam Question. A random sample of size 6 is from a lognormal distribution with
parameters µ and σ. The sample values are 200, 3,000, 8,000, 60,000, 60,000, 160,000.
Calculate the maximum likelihood estimates of µ and σ², estimate the covariance matrix of the maximum likelihood estimators, and determine approximate 95% confidence intervals for µ, σ² and the mean of X.
Show Example Solution
Solution.
The probability density function of the lognormal distribution is
\[ f_X(x) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left\{-\frac{1}{2}\left(\frac{\ln x - \mu}{\sigma}\right)^2\right\}, \]
where x > 0.
a. The likelihood function, L(µ, σ), is the product of the pdf for each data point:
\[ L(\mu, \sigma) = \prod_{i=1}^{6} f_{X_i}(x_i) = \frac{1}{\sigma^6 (2\pi)^3 \prod_{i=1}^{6} x_i} \exp\left\{-\frac{1}{2}\sum_{i=1}^{6} \left(\frac{\ln x_i - \mu}{\sigma}\right)^2\right\}. \]
The loglikelihood function, ln L(µ, σ), is the sum of the individual logarithms:
\[ \ln L(\mu, \sigma) = -6\ln\sigma - 3\ln(2\pi) - \sum_{i=1}^{6} \ln x_i - \frac{1}{2}\sum_{i=1}^{6} \left(\frac{\ln x_i - \mu}{\sigma}\right)^2. \]
Setting the first partial derivatives to zero gives
\[ \frac{1}{\hat{\sigma}^2}\sum_{i=1}^{6} (\ln x_i - \hat{\mu}) = 0 \qquad \text{and} \qquad \frac{-6}{\hat{\sigma}} + \frac{1}{\hat{\sigma}^3}\sum_{i=1}^{6} (\ln x_i - \hat{\mu})^2 = 0. \]
These yield the estimates
\[ \hat{\mu} = \frac{\sum_{i=1}^{6} \ln x_i}{6} = 9.38 \qquad \text{and} \qquad \hat{\sigma}^2 = \frac{\sum_{i=1}^{6} (\ln x_i - \hat{\mu})^2}{6} = 5.12. \]
The second partial derivatives are
\[ \frac{\partial^2 \ln L(\mu,\sigma)}{\partial\mu^2} = \frac{-6}{\sigma^2}, \qquad \frac{\partial^2 \ln L(\mu,\sigma)}{\partial\mu\,\partial\sigma} = \frac{-2}{\sigma^3}\sum_{i=1}^{6} (\ln x_i - \mu), \qquad \frac{\partial^2 \ln L(\mu,\sigma)}{\partial\sigma^2} = \frac{6}{\sigma^2} - \frac{3}{\sigma^4}\sum_{i=1}^{6} (\ln x_i - \mu)^2. \]
b. To derive the covariance matrix of the MLEs we need the expectations of the second derivatives. Since the random variable X is from a lognormal distribution with parameters µ and σ, ln X is normally distributed with mean µ and variance σ². Hence
\[ \mathrm{E}\left[\frac{\partial^2 \ln L(\mu,\sigma)}{\partial\mu^2}\right] = \frac{-6}{\sigma^2}, \qquad \mathrm{E}\left[\frac{\partial^2 \ln L(\mu,\sigma)}{\partial\mu\,\partial\sigma}\right] = \frac{-2}{\sigma^3}\sum_{i=1}^{6} \left[\mathrm{E}(\ln x_i) - \mu\right] = 0, \]
and
\[ \mathrm{E}\left[\frac{\partial^2 \ln L(\mu,\sigma)}{\partial\sigma^2}\right] = \frac{6}{\sigma^2} - \frac{3}{\sigma^4}\sum_{i=1}^{6} \mathrm{E}\left[(\ln x_i - \mu)^2\right] = \frac{6}{\sigma^2} - \frac{3}{\sigma^4}\left(6\sigma^2\right) = \frac{-12}{\sigma^2}. \]
Using the negatives of these expectations we obtain the Fisher information matrix
\[ \begin{pmatrix} \dfrac{6}{\sigma^2} & 0 \\[1ex] 0 & \dfrac{12}{\sigma^2} \end{pmatrix}. \]
The covariance matrix, Σ, is the inverse of the Fisher information matrix,
\[ \Sigma = \begin{pmatrix} \dfrac{\sigma^2}{6} & 0 \\[1ex] 0 & \dfrac{\sigma^2}{12} \end{pmatrix}, \]
and the estimated matrix is given by
\[ \hat{\Sigma} = \begin{pmatrix} 0.8533 & 0 \\ 0 & 0.4267 \end{pmatrix}. \]
c. The 95% confidence interval for µ is 9.38 ± 1.96√0.8533 = (7.57, 11.19), and the 95% confidence interval for σ² is 5.12 ± 1.96√0.4267 = (3.84, 6.40).
d. The mean of X is exp(µ + σ²/2). Then, the maximum likelihood estimate of
\[ g(\mu, \sigma) = \exp\left(\mu + \frac{\sigma^2}{2}\right) \]
is
\[ g(\hat{\mu}, \hat{\sigma}) = \exp\left(\hat{\mu} + \frac{\hat{\sigma}^2}{2}\right) = 153{,}277. \]
We use the delta method to approximate the variance of the mle g(µ̂, σ̂). The partial derivatives are
\[ \frac{\partial g(\mu,\sigma)}{\partial\mu} = \exp\left(\mu + \frac{\sigma^2}{2}\right) \qquad \text{and} \qquad \frac{\partial g(\mu,\sigma)}{\partial\sigma} = \sigma \exp\left(\mu + \frac{\sigma^2}{2}\right). \]
Using the delta method, the approximate variance of g(µ̂, σ̂) is given by
\[ \hat{V}\left[g(\hat{\mu},\hat{\sigma})\right] = \left[\frac{\partial g(\mu,\sigma)}{\partial\mu} \;\; \frac{\partial g(\mu,\sigma)}{\partial\sigma}\right] \Sigma \begin{pmatrix} \dfrac{\partial g(\mu,\sigma)}{\partial\mu} \\[1ex] \dfrac{\partial g(\mu,\sigma)}{\partial\sigma} \end{pmatrix}\Bigg|_{\mu=\hat{\mu},\,\sigma=\hat{\sigma}} = \left[153{,}277 \;\; 346{,}826\right] \begin{pmatrix} 0.8533 & 0 \\ 0 & 0.4267 \end{pmatrix} \begin{pmatrix} 153{,}277 \\ 346{,}826 \end{pmatrix} = 71{,}374{,}380{,}000. \]
The 95% confidence interval for exp(µ + σ²/2) is given by
\[ 153{,}277 \pm 1.96\sqrt{71{,}374{,}380{,}000} = (-370{,}356,\; 676{,}910). \]
Since the mean of the lognormal distribution cannot be negative, we should replace the negative lower limit
in the previous interval by a zero.
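The estimates and the delta-method interval can be reproduced in R. The sketch below (an illustrative check, not part of the original solution) uses the sample values from the example:

# Example 3.5.2 check: lognormal MLEs and delta-method interval for the mean
x <- c(200, 3000, 8000, 60000, 60000, 160000)
n <- length(x)
mu.hat   <- mean(log(x))                              # approximately 9.38
sig2.hat <- mean((log(x) - mu.hat)^2)                 # approximately 5.12
Sigma <- diag(c(sig2.hat / n, sig2.hat / (2 * n)))    # estimated covariance of (mu, sigma)
g     <- exp(mu.hat + sig2.hat / 2)                   # estimated mean, approximately 153,277
grad  <- c(g, sqrt(sig2.hat) * g)                     # partial derivatives w.r.t. mu and sigma
vg    <- drop(t(grad) %*% Sigma %*% grad)             # delta-method variance
g + c(-1, 1) * 1.96 * sqrt(vg)                        # 95% CI; lower limit truncated at 0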
In the previous section we considered the maximum likelihood estimation of continuous models from complete
(individual) data. Each individual observation is recorded, and its contribution to the likelihood function is
the density at that value. In this section we consider the problem of obtaining maximum likelihood estimates
of parameters from grouped data. The observations are only available in grouped form, and the contribution
of each observation to the likelihood function is the probability of falling in a specific group (interval). Let n_j represent the number of observations in the interval (c_{j−1}, c_j]. The grouped data likelihood function is thus given by
\[ L(\theta) = \prod_{j=1}^{k} \left[F(c_j \mid \theta) - F(c_{j-1} \mid \theta)\right]^{n_j}, \]
where c0 is the smallest possible observation (often set to zero) and ck is the largest possible observation
(often set to infinity).
Example 3.5.3. SOA Exam Question. For a group of policies, you are given that losses follow the distribution function F(x) = 1 − θ/x, for θ < x < ∞. Further, a sample of 20 losses resulted in the following:

Interval        Number of Losses
(θ, 10]         9
(10, 25]        6
(25, ∞)         5

Calculate the maximum likelihood estimate of θ.
Show Example Solution
Solution.
The grouped data likelihood is
\[ L(\theta) = \left[F(10) - F(\theta)\right]^9 \left[F(25) - F(10)\right]^6 \left[1 - F(25)\right]^5 = \left(1 - \frac{\theta}{10}\right)^9 \left(\frac{3\theta}{50}\right)^6 \left(\frac{\theta}{25}\right)^5, \]
so that
\[ \frac{d \ln L(\theta)}{d\theta} = \frac{-9}{10 - \theta} + \frac{6}{\theta} + \frac{5}{\theta}. \]
Setting this equal to zero,
\[ \frac{-9}{10 - \hat{\theta}} + \frac{11}{\hat{\theta}} = 0, \]
and θ̂ = 5.5.
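The same estimate is obtained by maximizing the grouped-data loglikelihood numerically. The short R sketch below (illustrative only, using the interval counts shown above) does this with a one-dimensional optimizer:

# numerical maximization of the grouped-data loglikelihood in Example 3.5.3
loglik <- function(theta) {
  cdf <- function(x) 1 - theta / x                     # F(x) = 1 - theta/x for x > theta
  9 * log(cdf(10)) + 6 * log(cdf(25) - cdf(10)) + 5 * log(1 - cdf(25))
}
optimize(loglik, interval = c(0.01, 9.99), maximum = TRUE)$maximum   # approximately 5.5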
Another distinguishing feature of data gathering mechanism is censoring. While for some event of interest
(losses, claims, lifetimes, etc.) the complete data may be available, for others only partial information is
available; information that the observation exceeds a specific value. The limited policy introduced in Section
3.4.2 is an example of right censoring. Any loss greater than or equal to the policy limit is recorded at the
limit. The contribution of the censored observation to the likelihood function is the probability of the random
variable exceeding this specific limit. Note that the contributions of both complete and censored data share the survivor function; for a complete point the survivor function is multiplied by the hazard function, but for a censored observation it is not.
Example 3.5.4. SOA Exam Question. The random variable has survival function:
\[ S_X(x) = \frac{\theta^4}{\left(\theta^2 + x^2\right)^2}. \]
Two values of X are observed to be 2 and 4. One other value exceeds 4. Calculate the maximum likelihood
estimate of θ.
Show Example Solution
Solution.
The contributions of the two observations 2 and 4 are f_X(2) and f_X(4), respectively. The contribution of the third observation, which is only known to exceed 4, is S_X(4). Differentiating the survival function gives
\[ f_X(x) = \frac{4x\theta^4}{\left(\theta^2 + x^2\right)^3}. \]
The likelihood function is thus given by
\[ L(\theta) = f_X(2)\, f_X(4)\, S_X(4) = \frac{8\theta^4}{\left(\theta^2 + 4\right)^3} \cdot \frac{16\theta^4}{\left(\theta^2 + 16\right)^3} \cdot \frac{\theta^4}{\left(\theta^2 + 16\right)^2} = \frac{128\,\theta^{12}}{\left(\theta^2 + 4\right)^3 \left(\theta^2 + 16\right)^5}. \]
So,
\[ \ln L(\theta) = \ln 128 + 12\ln\theta - 3\ln\left(\theta^2 + 4\right) - 5\ln\left(\theta^2 + 16\right), \]
and
\[ \frac{d\ln L(\theta)}{d\theta} = \frac{12}{\theta} - \frac{6\theta}{\theta^2 + 4} - \frac{10\theta}{\theta^2 + 16}. \]
Setting the derivative equal to zero,
\[ \frac{12}{\hat{\theta}} - \frac{6\hat{\theta}}{\hat{\theta}^2 + 4} - \frac{10\hat{\theta}}{\hat{\theta}^2 + 16} = 0, \]
or
\[ 12\left(\hat{\theta}^2 + 4\right)\left(\hat{\theta}^2 + 16\right) - 6\hat{\theta}^2\left(\hat{\theta}^2 + 16\right) - 10\hat{\theta}^2\left(\hat{\theta}^2 + 4\right) = -4\hat{\theta}^4 + 104\hat{\theta}^2 + 768 = 0, \]
which yields θ̂² = 32 and θ̂ = √32 = 5.657.
This section is concerned with the maximum likelihood estimation of the continuous distribution of the
random variable X when the data is incomplete due to truncation. If the values of X are truncated at d,
then it should be noted that we would not have been aware of the existence of these values had they not
exceeded d. The policy deductible introduced in Section 3.4.1 is an example of left truncation. Any loss less
than or equal to the deductible is not recorded. The contribution to the likelihood function of an observation x truncated at d will be a conditional probability, and f_X(x) will be replaced by f_X(x)/S_X(d).
Example 3.5.5. SOA Exam Question. For the single parameter Pareto distribution with θ = 2, maximum
likelihood estimation is applied to estimate the parameter α. Find the estimated mean of the ground up loss
distribution based on the maximum likelihood estimate of α for the following data set:
Ordinary policy deductible of 5, maximum covered loss of 25 (policy limit 20)
8 insurance payment amounts: 2, 4, 5, 5, 8, 10, 12, 15
2 limit payments: 20, 20.
Show Example Solution
Solution.
The contributions of the different observations can be summarized as follows:
• For the exact losses: f_X(x).
• For the censored observations: S_X(25).
• For truncated observations: f_X(x)/S_X(5).
Given that ground up losses smaller than 5 are omitted from the data set, the contribution of all observations should be conditional on exceeding 5. The likelihood function becomes
\[ L(\alpha) = \frac{\prod_{i=1}^{8} f_X(x_i)}{\left[S_X(5)\right]^{8}} \cdot \frac{\left[S_X(25)\right]^2}{\left[S_X(5)\right]^{2}}. \]
For the single parameter Pareto the probability density and distribution functions are given by
\[ f_X(x) = \frac{\alpha\theta^{\alpha}}{x^{\alpha+1}} \qquad \text{and} \qquad F_X(x) = 1 - \left(\frac{\theta}{x}\right)^{\alpha}, \]
for x > θ, respectively. Then, the likelihood and loglikelihood functions are given by
\[ L(\alpha) = \frac{\alpha^8\, 5^{10\alpha}}{\prod_{i=1}^{8} x_i^{\alpha+1}\; 25^{2\alpha}} \qquad \text{and} \qquad \ln L(\alpha) = 8\ln\alpha - (\alpha + 1)\sum_{i=1}^{8} \ln x_i + 10\alpha\ln 5 - 2\alpha\ln 25, \]
where the x_i are the ground up losses (payment plus deductible): 7, 9, 10, 10, 13, 15, 17, 20. The derivative is
\[ \frac{d\ln L(\alpha)}{d\alpha} = \frac{8}{\alpha} - \sum_{i=1}^{8} \ln x_i + 10\ln 5 - 2\ln 25. \]
The maximum likelihood estimator, α̂, is the solution to the equation
\[ \frac{8}{\hat{\alpha}} - \sum_{i=1}^{8} \ln x_i + 10\ln 5 - 2\ln 25 = 0, \]
which yields
\[ \hat{\alpha} = \frac{8}{\sum_{i=1}^{8} \ln x_i - 10\ln 5 + 2\ln 25} = \frac{8}{(\ln 7 + \ln 9 + \cdots + \ln 20) - 10\ln 5 + 2\ln 25} = 0.785. \]
The mean of the Pareto only exists for α > 1. Since α̂ = 0.785 < 1, the estimated mean does not exist.
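The closed-form estimate is simple to reproduce in R. The sketch below (an illustrative check, not part of the original solution) recovers the ground-up losses from the payment amounts and evaluates α̂:

# Example 3.5.5 check: single-parameter Pareto with theta = 2, deductible 5, limit 25
payments <- c(2, 4, 5, 5, 8, 10, 12, 15)
x <- payments + 5                                   # ground-up losses for the exact claims
alpha.hat <- 8 / (sum(log(x)) - 10 * log(5) + 2 * log(25))
alpha.hat                                           # approximately 0.785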
• Zeinab Amin, The American University in Cairo, is the principal author of this chapter. Email:
[email protected] for chapter comments and suggested improvements.
• Many helpful comments have been provided by Hirokazu (Iwahiro) Iwasawa, [email protected] .
Exercises
Here is a set of exercises that guide the reader through some of the theoretical foundations of Loss Data Analytics. Each tutorial is based on one or more questions from the professional actuarial examinations, typically the Society of Actuaries Exam C.
Severity Distribution Guided Tutorials
Notable contributions include: Cummins and Derrig (2012), Frees and Valdez (2008), Klugman et al. (2012),
Kreer et al. (2015), McDonald (1984), McDonald and Xu (1995), Tevet (2016), and Venter (1983).
Chapter 4
Model Selection and Estimation
Chapter Preview. Chapters 2 and 3 have described how to fit parametric models to frequency and severity
data, respectively. This chapter describes selection of models. To compare alternative parametric models, it
is helpful to introduce models that summarize data without reference to a specific parametric distribution.
Section 4.1 describes nonparametric estimation, how we can use it for model comparisons and how it can be
used to provide starting values for parametric procedures.
The process of model selection is then summarized in Section 4.2. Although our focus is on continuous data,
the same process can be used for discrete data or data that come from a hybrid combination of discrete
and continuous data. Further, Section 4.3 describes estimation for alternative sampling schemes, included
grouped, censored and truncated data, following the introduction provided in Chapter 3. The chapter closes
with Section 4.4 on Bayesian inference, an alternative procedure where the (typically unknown) parameters
are treated as random variables.
The population distribution F (·) can be summarized in various ways. These include moments, the distribution
function F (·) itself, the quantiles or percentiles associated with the distribution, and the corresponding mass
or density function f (·). Summary statistics based on the sample, X1 , . . . , Xn , are known as nonparametric
estimators of the corresponding summary measures of the distribution. We will examine moment estimators,
distribution function estimators, quantile estimators, and density estimators, as well as their statistical
properties such as expected value and variance. Using our data observations x1 , . . . , xn , we can put numerical
values to these estimators and compute nonparametric estimates.
Moment Estimators
The k-th moment, E[X^k] = µ′_k, is our first example of a population summary measure. It is estimated with the corresponding sample statistic
\[ \frac{1}{n}\sum_{i=1}^{n} X_i^k. \]
In typical applications, k is a positive integer, although it need not be. For the first moment (k = 1), the prime symbol (′) and the 1 subscript are usually dropped, using µ = µ′_1 to denote the mean. The corresponding sample estimator for µ is called the sample mean, denoted with a bar on top of the random variable:
\[ \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i. \]
Sometimes, µ′_k is called the k-th raw moment to distinguish it from the k-th central moment, E[(X − µ)^k] = µ_k, which is estimated as
\[ \frac{1}{n}\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^k. \]
The second central moment (k = 2) is an important case for which we typically assign a new symbol,
σ 2 = E [(X − µ)2 ], known as the variance. The corresponding sample estimator for σ 2 is called the sample
variance.
To estimate the distribution function nonparametrically, we define the empirical distribution function to be
\[ F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I\left(X_i \le x\right). \]
Here, the notation I(·) is the indicator function; it returns 1 if the event (·) is true and 0 otherwise.
Example 4.1.1. Toy Data Set. To illustrate, consider a fictitious, or “toy,” data set of n = 10 observations.
Determine the empirical distribution function.
i 1 2 3 4 5 6 7 8 9 10
Xi 10 15 15 15 20 23 23 23 23 30
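Before looking at the formal solution, the empirical distribution function of this toy data set can be computed directly in R (a quick illustrative sketch; the vector name xExample matches the one used in the quantile code later in this section):

# empirical distribution function of the toy data set
xExample <- c(10, 15, 15, 15, 20, 23, 23, 23, 23, 30)
Fn <- ecdf(xExample)
Fn(c(10, 15, 20, 23, 30))   # heights of the step function: 0.1, 0.4, 0.5, 0.9, 1.0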
Quantiles
We have already seen the median, which is the number such that approximately half of a data set is below
(or above) it. The first quartile is the number such that approximately 25% of the data is below it and the
third quartile is the number such that approximately 75% of the data is below it. A 100p percentile is
the number such that 100 × p percent of the data is below it.
To generalize this concept, consider a distribution function F (·), which may or may not be from a continuous
variable, and let q be a fraction so that 0 < q < 1. We want to define a quantile, say qF , to be a number such
that F (qF ) ≈ q. Notice that when q = 0.5, qF is the median; when q = 0.25, qF is the first quartile, and so
on.
To be precise, for a given 0 < q < 1, define the qth quantile q_F to be any number that satisfies
\[ F(q_F-) \le q \le F(q_F). \qquad (4.1) \]
Here, the notation F (x−) means to evaluate the function F (·) as a left-hand limit.
To get a better understanding of this definition, let us look at a few special cases. First, consider the case
where X is a continuous random variable so that the distribution function F (·) has no jump points, as
illustrated in Figure 4.2. In this figure, a few fractions, q1 , q2 , and q3 are shown with their corresponding
quantiles qF,1 , qF,2 , and qF,3 . In each case, it can be seen that F (qF −) = F (qF ) so that there is a unique
quantile. Because we can find a unique inverse of the distribution function at any 0 < q < 1, we can write
qF = F −1 (q).
Figure 4.3 shows three cases for distribution functions. The left panel corresponds to the continuous case
just discussed. The middle panel displays a jump point similar to those we already saw in the empirical
distribution function of Figure 4.1. For the value of q shown in this panel, we still have a unique value of the
quantile qF . Even though there are many values of q such that F (qF −) ≤ q ≤ F (qF ), for a particular value
of q, there is only one solution to equation (4.1). The right panel depicts a situation in which the quantile
can not be uniquely determined for the q shown as there is a range of qF ’s satisfying equation (4.1).
Example 4.1.2. Toy Data Set: Continued. Determine quantiles corresponding to the 20th, 50th, and
95th percentiles.
Show Example Solution
Solution. Consider Figure 4.1. The case of q = 0.20 corresponds to the middle panel, so the 20th percentile
is 15. The case of q = 0.50 corresponds to the right panel, so the median is any number between 20 and 23
inclusive. Many software packages use the average 21.5 (e.g. R, as seen below). For the 95th percentile, the
solution is 30. We can see from the graph that 30 also corresponds to the 99th and the 99.99th percentiles.
quantile(xExample, probs=c(0.2, 0.5, 0.95), type=6)
By taking a weighted average between data observations, smoothed empirical quantiles can handle cases such as the right panel in Figure 4.3. The qth smoothed empirical quantile is defined as
\[ \hat{\pi}_q = (1 - h) X_{(j)} + h X_{(j+1)}, \]
where j = ⌊(n + 1)q⌋, h = (n + 1)q − j, and X_{(1)}, . . . , X_{(n)} are the ordered values (the order statistics) corresponding to X_1, . . . , X_n. Note that this is a linear interpolation between X_{(j)} and X_{(j+1)}.
Example 4.1.3. Toy Data Set: Continued. Determine the 50th and 20th smoothed percentiles.
Show Example Solution
Solution. Take n = 10 and q = 0.5. Then, j = ⌊(11)(0.5)⌋ = ⌊5.5⌋ = 5 and h = (11)(0.5) − 5 = 0.5. Then the 0.5-th smoothed empirical quantile is
\[ \hat{\pi}_{0.5} = (1 - 0.5) X_{(5)} + 0.5\, X_{(6)} = 0.5(20) + 0.5(23) = 21.5. \]
Now take n = 10 and q = 0.2. In this case, j = ⌊(11)(0.2)⌋ = ⌊2.2⌋ = 2 and h = (11)(0.2) − 2 = 0.2. Then the 0.2-th smoothed empirical quantile is
\[ \hat{\pi}_{0.2} = (1 - 0.2) X_{(2)} + 0.2\, X_{(3)} = 0.8(15) + 0.2(15) = 15. \]
Density Estimators
When the random variable is discrete, estimating the probability mass function f (x) = Pr(X = x) is
straightforward. We simply use the empirical average, defined to be
\[ f_n(x) = \frac{1}{n}\sum_{i=1}^{n} I\left(X_i = x\right). \]
For a continuous random variable, consider a discretized formulation in which the domain of F (·) is partitioned
by constants {c0 < c1 < · · · < ck } into intervals of the form [cj−1 , cj ), for j = 1, . . . , k. The data observations
are thus “grouped” by the intervals into which they fall. Then, we might use the basic definition of the
empirical mass function, or a variation such as
\[ f_n(x) = \frac{n_j}{n \times (c_j - c_{j-1})}, \qquad c_{j-1} \le x < c_j, \]
where nj is the number of observations (Xi ) that fall into the interval [cj−1 , cj ).
Extending this notion to instances where we observe individual data, note that we can always create arbitrary
groupings and use this formula. More formally, let b > 0 be a small positive constant, known as a bandwidth,
and define a density estimator to be
\[ f_n(x) = \frac{1}{2nb}\sum_{i=1}^{n} I\left(x - b < X_i \le x + b\right). \qquad (4.2) \]
To see that this estimator is asymptotically unbiased, use a Taylor expansion of F about x (with remainder coefficients C_1 and C_2) to write
\[ \mathrm{E}\left[\frac{1}{2b} I(x - b < X \le x + b)\right] = \frac{1}{2b}\left(F(x + b) - F(x - b)\right) = \frac{1}{2b}\left[\left(F(x) + bF'(x) + b^2 C_1\right) - \left(F(x) - bF'(x) + b^2 C_2\right)\right] = F'(x) + b\,\frac{C_1 - C_2}{2} \to F'(x) = f(x), \]
as b → 0. That is, fn (x) is an asymptotically unbiased estimator of f (x) (its expectation approaches the true
value as sample size increases to infinity). This development assumes some smoothness of F (·), in particular,
twice differentiability at x, but makes no assumptions on the form of the distribution function F . Because of
this, the density estimator fn is said to be nonparametric.
More generally, define the kernel density estimator as
\[ f_n(x) = \frac{1}{nb}\sum_{i=1}^{n} w\!\left(\frac{x - X_i}{b}\right), \qquad (4.3) \]
where w is a probability density function centered about 0. Note that equation (4.2) simply becomes the kernel density estimator where w(x) = (1/2) I(−1 < x ≤ 1), also known as the uniform kernel. Other popular
choices are shown in Table 4.1.
Here, φ(·) is the standard normal density function. As we will see in the following example, the choice of
bandwidth b comes with a bias-variance tradeoff between matching local distributional features and reducing
the volatility.
Example 4.1.4. Property Fund. Figure 4.4 shows a histogram (with shaded gray rectangles) of logarithmic
property claims from 2010. The (blue) thick curve represents a Gaussian kernel density where the bandwidth
was selected automatically using an ad hoc rule based on the sample size and volatility of the data. For
this dataset, the bandwidth turned out to be b = 0.3255. For comparison, the (red) dashed curve represents
the density estimator with a bandwidth equal to 0.1 and the green smooth curve uses a bandwidth of 1.
As anticipated, the smaller bandwidth (0.1) indicates taking local averages over less data so that we get a
better idea of the local average, but at the price of higher volatility. In contrast, the larger bandwidth (1)
Figure 4.4: Histogram of Logarithmic Property Claims with Superimposed Kernel Density Estimators
smooths out local fluctuations, yielding a smoother curve that may miss perturbations in the local average.
For actuarial applications, we mainly use the kernel density estimator to get a quick visual impression of the
data. From this perspective, you can simply use the default ad hoc rule for bandwidth selection, knowing
that you have the ability to change it depending on the situation at hand.
Show R Code
#Density Comparison
hist(log(ClaimData$Claim), main="", ylim=c(0,.35),xlab="Log Expenditures", freq=FALSE, col="lightgray")
lines(density(log(ClaimData$Claim)), col="blue",lwd=2.5)
lines(density(log(ClaimData$Claim), bw=1), col="green")
lines(density(log(ClaimData$Claim), bw=.1), col="red", lty=3)
legend("topright", c("b=0.3255 (default)", "b=0.1", "b=1.0"), lty=c(1,3,1),
lwd=c(2.5,1,1), col=c("blue", "red", "green"), cex=1)
Nonparametric density estimators, such as the kernel estimator, are regularly used in practice. The concept
can also be extended to give smooth versions of an empirical distribution function. Given the definition of
the kernel density estimator, the kernel estimator of the distribution function can be found as
\[ \hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n} W\!\left(\frac{x - X_i}{b}\right), \]
where W is the distribution function associated with the kernel density w. To illustrate, for the uniform kernel, we have w(y) = (1/2) I(−1 < y ≤ 1), so
\[ W(y) = \begin{cases} 0 & y < -1, \\[0.5ex] \dfrac{y + 1}{2} & -1 \le y < 1, \\[0.5ex] 1 & y \ge 1. \end{cases} \]
Example 4.1.5. SOA Exam Question. You study five lives to estimate the time from the onset of a
disease to death. The times to death are:
2 3 3 3 7
Using a triangular kernel with bandwidth 2, calculate the density function estimate at 2.5.
Show Example Solution
Solution. For the kernel density estimate, we have
\[ f_n(x) = \frac{1}{nb}\sum_{i=1}^{n} w\!\left(\frac{x - X_i}{b}\right), \]
where n = 5, b = 2, and x = 2.5. For the triangular kernel, w(x) = (1 − |x|) × I(|x| ≤ 1). Thus,

X_i     (x − X_i)/b     w((x − X_i)/b)
2        0.25           1 − 0.25 = 0.75
3       −0.25           1 − 0.25 = 0.75
7       −2.25           0

Summing over the five observations (the value 3 appears three times),
\[ f_n(2.5) = \frac{1}{5 \times 2}\left[0.75 + 3(0.75) + 0\right] = 0.3. \]
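The same value can be obtained by evaluating the kernel formula directly in R (a quick illustrative check, not part of the original solution):

# direct evaluation of the triangular kernel density estimate at x = 2.5
x <- c(2, 3, 3, 3, 7); b <- 2
w <- function(u) (1 - abs(u)) * (abs(u) <= 1)      # triangular kernel
sum(w((2.5 - x) / b)) / (length(x) * b)            # 0.3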
The previous section introduced nonparametric estimators in which there was no parametric form assumed
about the underlying distributions. However, in many actuarial applications, analysts seek to employ
a parametric fit of a distribution for ease of explanation and the ability to readily extend it to more
complex situations such as including explanatory variables in a regression setting. When fitting a parametric
distribution, one analyst might try to use a gamma distribution to represent a set of loss data. However,
another analyst may prefer to use a Pareto distribution. How does one know which model to select?
Nonparametric tools can be used to corroborate the selection of parametric models. Essentially, the approach is
to compute selected summary measures under a fitted parametric model and to compare it to the corresponding
quantity under the nonparametric model. As the nonparametric does not assume a specific distribution and is
merely a function of the data, it is used as a benchmark to assess how well the parametric distribution/model
represents the data. This comparison may alert the analyst to deficiencies in the parametric model and
sometimes point ways to improving the parametric specification.
We have already seen the technique of overlaying graphs for comparison purposes. To reinforce the application
of this technique, Figure 4.5 compares the empirical distribution to two parametric fitted distributions. The
left panel shows the distribution functions of claims distributions. The dots forming an “S-shaped” curve
represent the empirical distribution function at each observation. The thick blue curve gives corresponding
values for the fitted gamma distribution and the light purple is for the fitted Pareto distribution. Because the
Pareto is much closer to the empirical distribution function than the gamma, this provides evidence that
Figure 4.5: Nonparametric Versus Fitted Parametric Distribution and Density Functions. The left-hand
panel compares distribution functions, with the dots corresponding to the empirical distribution, the thick
blue curve corresponding to the fitted gamma and the light purple curve corresponding to the fitted Pareto.
The right hand panel compares these three distributions summarized using probability density functions.
the Pareto is the better model for this data set. The right panel gives similar information for the density
function and provides a consistent message. Based on these figures, the Pareto distribution is the clear choice
for the analyst.
For another way to compare the appropriateness of two fitted models, consider the probability-probability
(pp) plot. A pp plot compares cumulative probabilities under two models. For our purposes, these two
models are the nonparametric empirical distribution function and the parametric fitted model. Figure 4.6
shows pp plots for the Property Fund data. The fitted gamma is on the left and the fitted Pareto is on the
right, compared to the same empirical distribution function of the data. The straight line represents equality
between the two distributions being compared, so points close to the line are desirable. As seen in earlier
demonstrations, the Pareto is much closer to the empirical distribution than the gamma, providing additional
evidence that the Pareto is the better model.
A pp plot is useful in part because no artificial scaling is required, such as with the overlaying of densities
in Figure 4.5, in which we switched to the log scale to better visualize the data. Furthermore, pp plots are
available in multivariate settings where more than one outcome variable is available. However, a limitation of
the pp plot is that, because they plot cumulative distribution functions, it can sometimes be difficult to detect
where a fitted parametric distribution is deficient. As an alternative, it is common to use a quantile-quantile
(qq) plot, as demonstrated in Figure 4.7.
The qq plot compares two fitted models through their quantiles. As with pp plots, we compare the non-
parametric to a parametric fitted model. Quantiles may be evaluated at each point of the data set, or on a
grid (e.g., at 0, 0.001, 0.002, . . . , 0.999, 1.000), depending on the application. In Figure 4.7, for each point on
the aforementioned grid, the horizontal axis displays the empirical quantile and the vertical axis displays
the corresponding fitted parametric quantile (gamma for the upper two panels, Pareto for the lower two).
Quantiles are plotted on the original scale in the left panels and on the log scale in the right panels to allow
us to see where a fitted distribution is deficient. The straight line represents equality between the empirical
distribution and fitted distribution. From these plots, we again see that the Pareto is an overall better fit
than the gamma. Furthermore, the lower-right panel suggests that the Pareto distribution does a good job
with large observations, but provides a poorer fit for small observations.
Figure 4.6: Probability-Probability (pp) Plots. The horizontal axes gives the empirical distribution function
at each observation. In the left-hand panel, the corresponding distribution function for the gamma is shown in
the vertical axis. The right-hand panel shows the fitted Pareto distribution. Lines of y = x are superimposed.
Example 4.1.6. SOA Exam Question. The graph below shows a pp plot of a fitted distribution compared
to a sample.
Comment on the two distributions with respect to left tail, right tail, and median probabilities.
Solution. The tail of the fitted distribution is too thick on the left, too thin on the right, and the fitted
distribution has less probability around the median than the sample. To see this, recall that the pp plot
graphs the cumulative distribution of two distributions on its axes (empirical on the x-axis and fitted on the
y-axis in this case). For small values of x, the fitted model assigns greater probability to being below that
value than occurred in the sample (i.e. F (x) > Fn (x)). This indicates that the model has a heavier left tail
than the data. For large values of x, the model again assigns greater probability to being below that value
and thus less probability to being above that value (i.e., S(x) < S_n(x)). This indicates that the model has a
lighter right tail than the data. In addition, as we go from 0.4 to 0.6 on the horizontal axis (thus looking at
the middle 20% of the data), the pp plot increases from about 0.3 to 0.4. This indicates that the model puts
only about 10% of the probability in this range.
When selecting a model, it is helpful to make the graphical displays presented. However, for reporting
results, it can be effective to supplement the graphical displays with selected statistics that summarize model
goodness of fit. Table 4.2 provides three commonly used goodness of fit statistics. Here, Fn is the empirical
distribution and F is the fitted distribution.
Figure 4.7: Quantile-Quantile (qq) Plots. The horizontal axes give the empirical quantiles at each observation; in the right-hand panels they are graphed on a logarithmic basis. The vertical axes give the quantiles from the fitted distributions; gamma quantiles are in the upper panels, Pareto quantiles are in the lower panels.
Table 4.2. Goodness of Fit Statistics

Statistic             Definition                                          Computational Expression
Kolmogorov-Smirnov    max_x |F_n(x) − F(x)|                               max(D+, D−), where D+ = max_{i=1,...,n} |i/n − F_i| and D− = max_{i=1,...,n} |F_i − (i−1)/n|
Cramer-von Mises      n ∫ (F_n(x) − F(x))² f(x) dx                        1/(12n) + Σ_{i=1}^{n} (F_i − (2i−1)/(2n))²
Anderson-Darling      n ∫ (F_n(x) − F(x))² / [F(x)(1 − F(x))] f(x) dx     −n − (1/n) Σ_{i=1}^{n} (2i−1) log(F_i (1 − F_{n+1−i}))

where F_i is defined to be F(x_i), with x_1 ≤ · · · ≤ x_n the ordered sample values.
The Kolmogorov-Smirnov statistic is the maximum absolute difference between the fitted distribution
function and the empirical distribution function. Instead of comparing differences between single points,
the Cramer-von Mises statistic integrates the difference between the empirical and fitted distribution
functions over the entire range of values. The Anderson-Darling statistic also integrates this difference
over the range of values, although weighted by the inverse of the variance. It therefore places greater emphasis
on the tails of the distribution (i.e., when F(x) or 1 − F(x) = S(x) is small).
Example 4.1.7. SOA Exam Question (modified). A sample of claim payments is:
29 64 90 135 182
Compare the empirical claims distribution to an exponential distribution with mean 100 by calculating the
value of the Kolmogorov-Smirnov test statistic.
Show Example Solution
Solution. For an exponential distribution with mean 100, the cumulative distribution function is F(x) = 1 − e^{−x/100}. Thus,

x       F(x)      F_n(x−)   F_n(x)    Maximum difference
29      0.2517    0.0       0.2       0.2517
64      0.4727    0.2       0.4       0.2727
90      0.5934    0.4       0.6       0.1934
135     0.7408    0.6       0.8       0.1408
182     0.8380    0.8       1.0       0.1620

The Kolmogorov-Smirnov test statistic is therefore KS = max(0.2517, 0.2727, 0.1934, 0.1408, 0.1620) = 0.2727.
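The statistic can be computed directly in R as follows (a quick illustrative sketch, not part of the original solution):

# Kolmogorov-Smirnov statistic for the fitted exponential with mean 100
x  <- c(29, 64, 90, 135, 182)
n  <- length(x)
Fx <- pexp(x, rate = 1/100)                   # fitted distribution function at the data
Dplus  <- max((1:n)/n - Fx)                   # largest excess of the empirical df above F
Dminus <- max(Fx - (0:(n-1))/n)               # largest excess of F above the empirical df
max(Dplus, Dminus)                            # 0.2727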
The method of moments and percentile matching are nonparametric estimation methods that provide
alternatives to maximum likelihood. Generally, maximum likelihood is the preferred technique because it
employs data more efficiently. However, methods of moments and percentile matching are useful because they
are easier to interpret and therefore allow the actuary or analyst to explain procedures to others. Additionally,
the numerical estimation procedure (e.g. if performed in R) for the maximum likelihood is iterative and
requires starting values to begin the recursive process. Although many problems are robust to the choice of
the starting values, for some complex situations, it can be important to have a starting value that is close to
the (unknown) optimal value. Method of moments and percentile matching are techniques that can produce
desirable estimates without a serious computational investment and can thus be used as a starting value for
computing maximum likelihood.
Method of Moments
Under the method of moments, we approximate the moments of the parametric distribution using the
empirical (nonparametric) moments described in Section 4.1.1. We can then algebraically solve for the
parameter estimates.
Example 4.1.8. Property Fund. For the 2010 property fund, there are n = 1, 377 individual claims (in
thousands of dollars) with
\[ m_1 = \frac{1}{n}\sum_{i=1}^{n} X_i = 26.62259 \qquad \text{and} \qquad m_2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2 = 136154.6. \]
Fit the parameters of the gamma and Pareto distributions using the method of moments.
Show Example Solution
Solution. To fit a gamma distribution, we have µ′_1 = αθ and µ′_2 = α(α + 1)θ². Equating these to the sample moments, easy algebra shows that the method of moments estimators are
\[ \hat{\alpha} = \frac{m_1^2}{m_2 - m_1^2} = \frac{26.62259^2}{136154.6 - 26.62259^2} = 0.005232809 \qquad \text{and} \qquad \hat{\theta} = \frac{m_2 - m_1^2}{m_1} = \frac{136154.6 - 26.62259^2}{26.62259} = 5{,}087.629. \]
For comparison, the maximum likelihood values turn out to be α̂M LE = 0.2905959 and θ̂M LE = 91.61378, so
there are big discrepancies between the two estimation procedures. This is one indication, as we have seen
before, that the gamma model fits poorly.
In contrast, now assume a Pareto distribution so that µ′_1 = θ/(α − 1) and µ′_2 = 2θ²/[(α − 1)(α − 2)]. Easy algebra shows
\[ \alpha = 1 + \frac{\mu_2'}{\mu_2' - \mu_1'^2} \qquad \text{and} \qquad \theta = (\alpha - 1)\mu_1'. \]
Hence the method of moments estimators are
\[ \hat{\alpha} = 1 + \frac{136154.6}{136154.6 - 26.62259^2} = 2.005233 \qquad \text{and} \qquad \hat{\theta} = (2.005233 - 1) \times 26.62259 = 26.7619. \]
The maximum likelihood values turn out to be α̂_MLE = 0.9990936 and θ̂_MLE = 2.2821147. It is interesting that α̂_MLE < 1; for the Pareto distribution, recall that α < 1 means that the mean is infinite. This is another indication that the distribution of the property claims data has a long tail.
As the above example suggests, there is flexibility with the method of moments. For example, we could
have matched the second and third moments instead of the first and second, yielding different estimators.
Furthermore, there is no guarantee that a solution will exist for each problem. You will also find that
matching moments is possible for a few problems where the data are censored or truncated, but in general,
this is a more difficult scenario. Finally, for distributions where the moments do not exist or are infinite,
method of moments is not available. As an alternative for the infinite moment situation, one can use the
percentile matching technique.
Percentile Matching
Under percentile matching, we approximate the quantiles or percentiles of the parametric distribution using
the empirical (nonparametric) quantiles or percentiles described in Section 4.1.1.
Example 4.1.9. Property Fund. For the 2010 property fund, we illustrate matching on quantiles. In
particular, the Pareto distribution is intuitively pleasing because of the closed-form solution for the quantiles.
Recall that the distribution function for the Pareto distribution is
F(x) = 1 − (θ/(x + θ))^α.
We remark here that a numerical routine is required for these solutions as no analytic solution is available.
Furthermore, recall that the maximum likelihood estimates are α̂_{MLE} = 0.9990936 and θ̂_{MLE} = 2.2821147,
so the percentile matching provides a better approximation for the Pareto distribution than the method of
moments.
For a sample of 11 losses assumed to follow a loglogistic distribution, calculate the estimate of θ by percentile
matching, using the 40th and 80th empirically smoothed percentile estimates.
Solution. With 11 observations, we have j = ⌊(n + 1)q⌋ = ⌊12(0.4)⌋ = ⌊4.8⌋ = 4 and h = (n + 1)q − j =
12(0.4) − 4 = 0.8. By interpolation, the 40th empirically smoothed percentile estimate is π̂_{0.4} = (1 − h)X_{(j)} +
hX_{(j+1)} = 0.2(86) + 0.8(90) = 89.2.
Similarly, for the 80th empirically smoothed percentile estimate, we have 12(0.8) = 9.6 so the estimate is
π̂0.8 = 0.4(200) + 0.6(210) = 206.
Using the loglogistic cumulative distribution function, we need to solve the following two equations for the
parameters θ and γ:
0.4 = (89.2/θ)^γ / (1 + (89.2/θ)^γ)   and   0.8 = (206/θ)^γ / (1 + (206/θ)^γ).
Solving for each parenthetical expression gives 2/3 = (89.2/θ)^γ and 4 = (206/θ)^γ. Taking the ratio of the second
equation to the first gives 6 = (206/89.2)^γ ⇒ γ = ln(6)/ln(206/89.2) = 2.1407. Then 4^{1/2.1407} = 206/θ ⇒ θ = 107.8.
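The percentile matching solution can be checked with a short R sketch (not part of the original solution); the smoothed percentile estimates 89.2 and 206 are taken from the calculation above.

# Percentile matching for the loglogistic using the 40th and 80th percentiles
p40 <- 89.2
p80 <- 206
u <- 0.4 / 0.6    # (p40/theta)^gamma = 2/3
v <- 0.8 / 0.2    # (p80/theta)^gamma = 4

gamma_hat <- log(v / u) / log(p80 / p40)   # the ratio eliminates theta
theta_hat <- p80 / v^(1 / gamma_hat)
c(gamma_hat, theta_hat)                    # approximately 2.1407 and 107.8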
This section underscores the idea that model selection is an iterative process in which models are cyclically
(re)formulated and tested for appropriateness before using them for inference. After summarizing the process
of selecting a model based on the dataset at hand, we describe the model selection process based on:
• an in-sample or training dataset,
• an out-of-sample or test dataset, and
• a method that combines these approaches known as cross-validation.
In our development, we examine the data graphically, hypothesize a model structure, and compare the data
to a candidate model in order to formulate an improved model. Box (1980) describes this as an iterative
process which is shown in Figure 4.8.
This iterative process provides a useful recipe for structuring the task of specifying a model to represent a set
of data. The first step, the model formulation stage, is accomplished by examining the data graphically and
using prior knowledge of relationships, such as from economic theory or industry practice. The second step in
the iteration is based on the assumptions of the specified model. These assumptions must be consistent with
the data to make valid use of the model. The third step is diagnostic checking; the data and model must be
consistent with one another before additional inferences can be made. Diagnostic checking is an important
part of the model formulation; it can reveal mistakes made in previous steps and provide ways to correct
these mistakes.
The iterative process also emphasizes the skills you need to make analytics work. First, you need a willingness
to summarize information numerically and portray this information graphically. Second, it is important to
develop an understanding of model properties. You should understand how a probabilistic model behaves in
order to match a set of data to it. Third, theoretical properties of the model are also important for inferring
general relationships based on the behavior of the data.
It is common to refer to a dataset used for analysis as an in-sample or training dataset. Techniques available
for selecting a model depend upon whether the outcomes X are discrete, continuous, or a hybrid of the two,
although the principles are the same.
Graphical and other Basic Summary Measures. Begin by summarizing the data graphically and with
statistics that do not rely on a specific parametric form, as summarized in Section 4.1. Specifically, you will
want to graph both the empirical distribution and density functions. Particularly for loss data that contain
many zeros and that can be skewed, deciding on the appropriate scale (e.g., logarithmic) may present some
difficulties. For discrete data, tables are often preferred. Determine sample moments, such as the mean and
variance, as well as selected quantiles, including the minimum, maximum, and the median. For discrete data,
the mode (or most frequently occurring value) is usually helpful.
These summaries, as well as your familiarity with industry practice, will suggest one or more candidate parametric
models. Generally, start with the simpler parametric models (for example, one parameter exponential before
a two parameter gamma), gradually introducing more complexity into the modeling process.
Critique the candidate parametric model numerically and graphically. For the graphs, utilize the tools
introduced in Section 4.1.2 such as pp and qq plots. For the numerical assessments, examine the statistical
significance of parameters and try to eliminate parameters that do not provide additional information.
Likelihood Ratio Tests. For comparing model fits, if one model is a subset of another, then a likelihood
ratio test may be employed; see for example Sections 15.4.3 and 17.3.2.
Goodness of Fit Statistics. Generally, models are not proper subsets of one another so overall goodness of
fit statistics are helpful for comparing models. Information criteria are one type of goodness of fit statistic. The
most widely used examples are Akaike’s Information Criterion (AIC) and the Schwarz Bayesian Criterion
(BIC); they are widely cited because they can be readily generalized to multivariate settings. Section
15.4.4 provides a summary of these statistics.
Figure 4.9: Model Validation. A data set of size n is randomly split into two subsamples.
For selecting the appropriate distribution, statistics that compare a parametric fit to a nonparametric
alternative, summarized in Section 4.1.2, are useful for model comparison. For discrete data, a chi-square
goodness of fit statistic (see Section 2.7) is generally preferred as it is more intuitive and simpler to explain.
Model validation is the process of confirming that the proposed model is appropriate, especially in light of
the purposes of the investigation. An important criticism of the model selection process is that it can be
susceptible to data-snooping, that is, fitting a great number of models to a single set of data. By looking at a
large number of models, we may overfit the data and understate the natural variation in our representation.
Model Validation Process. We can respond to this criticism by using a technique sometimes known as
out-of-sample validation. The ideal situation is to have available two sets of data, one for training, or
model development, and one for testing, or model validation. We initially develop one or several models on
the first data set that we call our candidate models. Then, the relative performance of the candidate models
can be measured on the second set of data. In this way, the data used to validate the model is unaffected by
the procedures used to formulate the model.
The model validation process not only addresses the problem of overfitting the data but also supports the
goal of predictive inference. Particularly in actuarial applications, our goal is to make statements about
new experience rather than the dataset at hand. For example, we use claims experience from one year to
develop a model that can be used to price insurance contracts for the following year. As an analogy, we can
think about the training data set as experience from one year that is used to predict the behavior of the next
year’s test data set.
Random Split of the Data. Unfortunately, rarely will two sets of data be available to the investigator.
However, we can implement the validation process by splitting the data set into training and test subsamples,
respectively. Figure 4.9 illustrates this splitting of the data.
Various researchers recommend different proportions for the allocation. Snee (1977) suggests that data-
splitting not be done unless the sample size is moderately large. The guidelines of Picard and Berk (1990)
show that the greater the number of parameters to be estimated, the greater the proportion of observations
needed for the model development subsample. As a rule of thumb, for data sets with 100 or fewer observations,
use about 25-35% of the sample for out-of-sample validation. For data sets with 500 or more observations,
use 50% of the sample for out-of-sample validation.
Model Validation Statistics. Much of the literature supporting the establishment of a model validation
process is based on regression and classification models that you can think of as an input-output problem
(James et al. (2013)). That is, we have several inputs x1 , . . . , xk that are related to an output y through a
function such as
y = g (x1 , . . . , xk ) .
One uses the training sample to develop an estimate of g, say, ĝ, and then calibrate the distance from the
observed outcomes to the predictions using a criterion of the form
∑_i d(y_i, ĝ(x_{i1}, . . . , x_{ik})).    (4.4)
Here, the sum i is over the test data. In many regression applications, it is common to use squared Euclidean
distance of the form d(yi , g) = (yi − g)2 . In actuarial applications, Euclidean distance d(yi , g) = |yi − g| is
often preferred because of the skewed nature of the data (large outlying values of y can have a large effect on
the measure). The Chapter 4 Technical Supplement A describes another measure, the Gini index that is
useful in actuarial applications.
Selecting a Distribution. Still, our focus so far has been to select a distribution for a data set that can be
used for actuarial modeling without additional inputs x1 , . . . , xk . Even in this more fundamental problem,
the model validation approach is valuable. If we base all inference on only in-sample data, then there is a
tendency to select more complicated models than needed. For example, we might select a four parameter GB2,
generalized beta of the second kind, distribution when only a two parameter Pareto is needed. Information
criteria such as AIC and BIC include penalties for model complexity and so provide some protection, but
using a test sample is the best guarantee of achieving parsimonious models. From a quote often attributed to
Einstein, we want to “use the simplest model possible but no simpler.”
Although out-of-sample validation is the gold standard in predictive modeling, it is not always practical to do
so. The main reason is that we have limited sample sizes and the out-of-sample model selection criterion in
equation (4.4) depends on a random split of the data. This means that different analysts, even when working
with the same data set and the same approach to modeling, may select different models. This is likely in actuarial
applications because we work with skewed data sets where there is a large chance of getting some very large
outcomes, and large outcomes may have a great influence on the parameter estimates.
Cross-Validation Procedure. Alternatively, one may use cross-validation, as follows.
• The procedure begins by using a random mechanism to split the data into K subsets known as folds,
where analysts typically use 5 to 10.
• Next, one uses the first K − 1 subsamples to estimate model parameters. Then, “predict” the outcomes
for the Kth subsample and use a measure such as in equation (4.4) to summarize the fit.
• Now, repeat this by holding out each of the K sub-samples, summarizing with a cumulative out-of-sample
statistic.
Repeat these steps for several candidate models and choose the model with the lowest cumulative out-of-sample
statistic.
Cross-validation is widely used because it retains the predictive flavor of the out-of-sample model validation
process but, due to the re-use of the data, is more stable over random samples.
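To make the procedure concrete, here is a minimal R sketch (an illustration, not code from the text) of K-fold cross-validation for choosing between two candidate severity distributions. The simulated data, the candidates (gamma versus lognormal), and the use of out-of-sample log-likelihood in place of the distance criterion in equation (4.4) are all assumptions made for this example.

set.seed(2019)
x <- rgamma(500, shape = 2, scale = 100)   # hypothetical loss data
K <- 5
fold <- sample(rep(1:K, length.out = length(x)))

cv_loglik <- function(x, fold, dens, fit) {
  out <- 0
  for (k in 1:K) {
    train <- x[fold != k]; test <- x[fold == k]
    par <- fit(train)
    out <- out + sum(dens(test, par))      # out-of-sample log-likelihood
  }
  out
}

# Candidate 1: gamma, fit by matching moments (a simple stand-in for MLE)
fit_gamma  <- function(x) { a <- mean(x)^2 / var(x); c(a, mean(x) / a) }
dens_gamma <- function(x, p) dgamma(x, shape = p[1], scale = p[2], log = TRUE)

# Candidate 2: lognormal, fit by MLE (closed form)
fit_lnorm  <- function(x) c(mean(log(x)), sd(log(x)))
dens_lnorm <- function(x, p) dlnorm(x, meanlog = p[1], sdlog = p[2], log = TRUE)

c(gamma = cv_loglik(x, fold, dens_gamma, fit_gamma),
  lnorm = cv_loglik(x, fold, dens_lnorm, fit_lnorm))  # larger is better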
Basic theory and many applications are based on individual observations that are “complete” and “unmodified,”
as we have seen in the previous section. Chapter 3 introduced the concept of observations that are “modified”
due to two common types of limitations: censoring and truncation. For example, it is common to think
about an insurance deductible as producing data that are truncated (from the left) or policy limits as yielding
data that are censored (from the right). This viewpoint is from the primary insurer (the seller of the
insurance). However, as we will see in Chapter 10, a reinsurer (an insurer of an insurance company) may
not observe claims smaller than an amount, only that a claim exists, an example of censoring from the left.
So, in this section, we cover the full gamut of alternatives. Specifically, this section will address parametric
estimation methods for three alternatives to individual, complete, and unmodified data: interval-censored
data available only in groups, data that are limited or censored, and data that may not be observed due to
truncation.
Consider a sample of size n observed from the distribution F (·), but in groups so that we only know the group
into which each observation fell, not the exact value. This is referred to as grouped or interval-censored
data. For example, we may be looking at two successive years of annual employee records. People employed
in the first year but not the second have left sometime during the year. With an exact departure date
(individual data), we could compute the amount of time that they were with the firm. Without the departure
date (grouped data), we only know that they departed sometime during a year-long interval.
Formalizing this idea, suppose there are k groups or intervals delimited by boundaries c0 < c1 < · · · < ck . For
each observation, we only observe the interval into which it fell (e.g. (cj−1 , cj )), not the exact value. Thus,
we only know the number of observations in each interval. The constants {c0 < c1 < · · · < ck } form some
partition of the domain of F(·). Then the probability of an observation X_i falling in the jth interval is
Pr(c_{j−1} < X_i ≤ c_j) = F(c_j) − F(c_{j−1}).
Now, define nj to be the number of observations that fall in the jth interval, (cj−1 , cj ]. Thus, the likelihood
function (with respect to the parameter(s) θ) is
L(θ) = ∏_{j=1}^{k} {F(c_j) − F(c_{j−1})}^{n_j}
Maximizing the likelihood function (or equivalently, maximizing the log-likelihood function) would then
produce the maximum likelihood estimates for grouped data.
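As an illustration (not from the text), the following R sketch maximizes the grouped-data likelihood for an exponential distribution with mean θ; the interval boundaries and counts are hypothetical.

# Hypothetical grouped data: boundaries c0 < ... < ck and counts per interval
bounds <- c(0, 1000, 2000, 5000, Inf)
counts <- c(30, 25, 30, 15)

# Grouped-data negative log-likelihood for an exponential with mean theta
nll <- function(theta) {
  p <- diff(pexp(bounds, rate = 1 / theta))   # F(c_j) - F(c_{j-1})
  -sum(counts * log(p))
}

optimize(nll, interval = c(1, 1e5))$minimum   # maximum likelihood estimate of theta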
Example 4.3.1. SOA Exam Question. You are given:
(i) Losses follow an exponential distribution with mean θ.
(ii) A random sample of 20 losses is distributed as follows:
where p = e^{−1000/θ}. Maximizing this expression with respect to p is equivalent to maximizing the likelihood
with respect to θ. The maximum occurs at p = 20/33 and so θ̂ = −1000/ln(20/33) = 1996.90.
Censored Data
Censoring occurs when we observe only a limited value of an observation. The most common form is
right-censoring, in which we record the smaller of the “true” dependent variable and a censoring variable.
Using notation, let X represent an outcome of interest, such as the loss due to an insured event. Let CU denote
the censoring time, such as CU = 5. With right-censored observations, we observe X if it is below censoring
point CU ; otherwise if X is higher than the censoring point, we only observe the censored CU . Therefore, we
record X_U^* = min(X, C_U). We also observe whether or not censoring has occurred. Let δ_U = I(X ≥ C_U) be a
binary variable that is 1 if censoring occurs, X ≥ C_U, and 0 otherwise.
For example, CU may represent the upper limit of coverage of an insurance policy. The loss may exceed the
amount CU , but the insurer only has CU in its records as the amount paid out and does not have the amount
of the actual loss X in its records.
Similarly, with left-censoring, we only observe X if X is above censoring point (e.g. time or loss amount)
CL ; otherwise we observe CL . Thus, we record XL∗ = max(X, CL ) along with the censoring indicator
δL = I(X ≤ CL ).
For example, suppose a reinsurer will cover insurer losses greater than CL . Let Y = XL∗ − CL represent the
amount that the reinsurer is responsible for. If the policyholder loss X < CL , then the insurer will pay the
entire claim and Y = 0, no loss for the reinsurer. If the loss X ≥ CL , then Y = X − CL represents the
reinsurer’s retained claims. If a loss occurs, the reinsurer knows the actual amount if it exceeds the limit CL ,
otherwise it only knows that it had a loss of 0.
As another example of a left-censored observation, suppose we are conducting a study and interviewing a
person about an event in the past. The subject may recall that the event occurred before CL , but not the
exact date.
Truncated Data
We just saw that censored observations are still available for study, although in a limited form. In contrast,
truncated outcomes are a type of missing data. An outcome is potentially truncated when the availability
of an observation depends on the outcome.
In insurance, it is common for observations to be left-truncated at C_L, in which case the amount recorded is
Y = { we do not observe X if X < C_L;   X − C_L if X ≥ C_L }.
In other words, if X is less than the threshold C_L, then it is not observed. For example, C_L may represent
the deductible associated with an insurance coverage. If the insured loss is less than the deductible, then the
insurer does not observe or record the loss at all. If the loss exceeds the deductible, then the excess X − CL
is the claim that the insurer covers.
Similarly, for right-truncated data, if X exceeds a threshold C_U, then it is not observed. In this case, the
amount recorded is
Y = { X if X < C_U;   we do not observe X if X ≥ C_U }.
Classic examples of truncation from the right include X as a measure of distance to a star. When the distance
exceeds a certain level CU , the star is no longer observable.
Figure 4.10 compares truncated and censored observations. Values of X that are greater than the “upper”
censoring limit C_U are observed, but recorded as C_U rather than the actual value of X (right-censored), while
values of X that are smaller than the “lower” truncation limit C_L are not observed at all (left-truncated).
Example – Mortality Study. Suppose that you are conducting a two-year study of mortality of high-risk
subjects, beginning January 1, 2010 and finishing January 1, 2012. Figure 4.11 graphically portrays the six
types of subjects recruited. For each subject, the beginning of the arrow represents the time that the subject was
recruited and the arrow end represents the event time. Thus, the arrow represents exposure time.
• Type A - Right-censored. This subject is alive at the beginning and the end of the study. Because
the time of death is not known by the end of the study, it is right-censored. Most subjects are Type A.
• Type B - Complete information is available for a type B subject. The subject is alive at the beginning
of the study and the death occurs within the observation period.
• Type C - Right-censored and left-truncated. A type C subject is right-censored, in that death
occurs after the observation period. However, the subject entered after the start of the study and is said
to have a delayed entry time. Because the subject would not have been observed had death occurred
before entry, it is left-truncated.
• Type D - Left-truncated. A type D subject also has delayed entry. Because death occurs within
the observation period, this subject is not right censored.
• Type E - Left-truncated. A type E subject is not included in the study because death occurs prior
to the observation period.
• Type F - Right-truncated. Similarly, a type F subject is not included because the entry time occurs
after the observation period.
For simplicity, we assume fixed censoring times and a continuous outcome X. To begin, consider the case of
right-censored data where we record XU∗ = min(X, CU ) and censoring indicator δU = I(X ≥ CU ). If censoring
occurs so that δU = 1, then X ≥ CU and the likelihood is Pr(X ≥ CU ) = 1 − F (CU ). If censoring does not
occur so that δU = 0, then X < CU and the likelihood is f (x). Summarizing, we have the likelihood of a
single observation as
{ f(x) if δ = 0;   1 − F(C_U) if δ = 1 }  =  (f(x))^{1−δ} (1 − F(C_U))^δ.
The right-hand expression allows us to present the likelihood more compactly. Now, for an iid sample of size
n, {(xU 1 , δ1 ), . . . , (xU n , δn )}, the likelihood is
L(θ) = ∏_{i=1}^{n} (f(x_i))^{1−δ_i} (1 − F(C_{Ui}))^{δ_i} = ∏_{δ_i=0} f(x_i) ∏_{δ_i=1} {1 − F(C_{Ui})},
with potential censoring times {C_{U1}, . . . , C_{Un}}. Here, the notation “∏_{δ_i=0}” means to take the product over
uncensored observations, and similarly for “∏_{δ_i=1}.”
On the other hand, truncated data are handled in likelihood inference via conditional probabilities. Specifically,
we adjust the likelihood contribution by dividing by the probability that the variable was observed. To
summarize, we have the following contributions to the likelihood function for six types of outcomes:
where “∏_E” is the product over observations with Exact values, and similarly for Right-, Left- and Interval-
censoring.
For right-censored and left-truncated data, the likelihood is
L(θ) = ∏_E [f(x_i)/(1 − F(C_{Li}))] ∏_R [(1 − F(C_{Ui}))/(1 − F(C_{Li}))],
and similarly for other combinations. To get further insights, consider the following.
Special Case: Exponential Distribution. Consider data that are right-censored and left-truncated, with
random variables Xi that are exponentially distributed with mean θ. With these specifications, recall that
f (x) = θ−1 exp(−x/θ) and F (x) = 1 − exp(−x/θ).
For this special case, the log-likelihood is
L(θ) = ∑_E {ln f(x_i) − ln(1 − F(C_{Li}))} + ∑_R {ln(1 − F(C_{Ui})) − ln(1 − F(C_{Li}))}
     = ∑_E (−ln θ − (x_i − C_{Li})/θ) − ∑_R (C_{Ui} − C_{Li})/θ.
To simplify the notation, define δi = I(Xi ≥ CU i ) to be a binary variable that indicates right-censoring. Let
Xi∗∗ = min(Xi , CU i ) − CLi be the amount that the observed variable exceeds the lower truncation limit.
With this, the log-likelihood is
L(θ) = −∑_{i=1}^{n} ((1 − δ_i) ln θ + x_i^{**}/θ)    (4.5)
Taking derivatives with respect to the parameter θ and setting it equal to zero yields the maximum likelihood
estimator
θ̂ = (1/n_u) ∑_{i=1}^{n} x_i^{**},
where n_u = ∑_i (1 − δ_i) is the number of uncensored observations.
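A small R sketch (an illustration with hypothetical data) of this estimator for right-censored, left-truncated exponential observations:

# Hypothetical right-censored, left-truncated exponential data
x     <- c(5, 12, 20, 30, 30)      # recorded values: min(X, C_U)
delta <- c(0,  0,  0,  1,  1)      # 1 = right-censored
CL    <- c(0,  0,  5,  0, 10)      # lower truncation limits (0 = none)

xstar <- x - CL                            # amount in excess of the truncation limit
theta_hat <- sum(xstar) / sum(1 - delta)   # MLE from the formula above
theta_hat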
The log-likelihood is
Maximizing this expression by setting the derivative with respect to θ equal to 0, we have
L′(θ) = −3θ^{−1} + 700θ^{−2} = 0  ⇒  θ̂ = 700/3 = 233.33
Example 4.3.3. SOA Exam Question. You are given the following information about a random sample:
(i) The sample size equals five.
(ii) The sample is from a Weibull distribution with τ = 2.
(iii) Two of the sample observations are known to exceed 50, and the remaining three observations are 20,
30, and 45.
Calculate the maximum likelihood estimate of θ.
Solution. The likelihood function is L(θ) = f(20) f(30) f(45) (1 − F(50))^2. Setting the derivative of the
log-likelihood with respect to θ equal to zero gives
−6/θ + 16650/θ^3 = 0  ⇒  θ̂ = √(16650/6) = 52.6783.
Nonparametric estimators provide useful benchmarks, so it is helpful to understand the estimation procedures
for grouped, censored, and truncated data.
Grouped Data
As we have seen in Section 4.3.1, observations may be grouped (also referred to as interval censored) in the
sense that we only observe them as belonging in one of k intervals of the form (cj−1 , cj ], for j = 1, . . . , k. At
the boundaries, the empirical distribution function is defined in the usual way:
F_n(c_j) = (number of observations ≤ c_j) / n.
For other values of x ∈ (cj−1 , cj ), we can estimate the distribution function with the ogive estimator, which
linearly interpolates between F_n(c_{j−1}) and F_n(c_j), i.e. the values at the boundaries F_n(c_{j−1}) and F_n(c_j) are
connected with a straight line. This can formally be expressed as
F_n(x) = [(c_j − x)/(c_j − c_{j−1})] F_n(c_{j−1}) + [(x − c_{j−1})/(c_j − c_{j−1})] F_n(c_j)   for c_{j−1} ≤ x < c_j,
with corresponding density
f_n(x) = F_n′(x) = [F_n(c_j) − F_n(c_{j−1})]/(c_j − c_{j−1})   for c_{j−1} ≤ x < c_j.
Example 4.3.4. SOA Exam Question. You are given the following information regarding claim sizes for
100 claims:
Using the ogive, calculate the estimate of the probability that a randomly chosen claim is between 2000 and
6000.
Solution. At the boundaries, the empirical distribution function is defined in the usual way, so we have
F100 (1000) = 0.16, F100 (3000) = 0.38, F100 (5000) = 0.63, F100 (10000) = 0.81
For other claim sizes, the ogive estimator linearly interpolates between these values. In particular, F_{100}(2000) =
0.5(0.16) + 0.5(0.38) = 0.27 and F_{100}(6000) = 0.8(0.63) + 0.2(0.81) = 0.666, so the estimated probability that a
randomly chosen claim is between 2000 and 6000 is 0.666 − 0.27 = 0.396.
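A short R sketch (an illustration, not from the text) of the ogive calculation using the boundary values above:

# Ogive: linear interpolation of the empirical cdf at the group boundaries
cj <- c(0, 1000, 3000, 5000, 10000)
Fn <- c(0, 0.16, 0.38, 0.63, 0.81)

ogive <- approxfun(cj, Fn)          # linear interpolation between boundaries
ogive(6000) - ogive(2000)           # estimated Pr(2000 < X < 6000) = 0.396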
It can be useful to calibrate parametric likelihood methods with nonparametric methods that do not rely on
a parametric form of the distribution. The product-limit estimator due to (Kaplan and Meier, 1958) is a
well-known estimator of the distribution in the presence of censoring.
To begin, first note that the empirical distribution function Fn (x) is an unbiased estimator of the distribution
function F (x) (in the “usual” case in the absence of censoring). This is because Fn (x) is the average of indicator
variables that are also unbiased, that is, E[I(X ≤ x)] = Pr(X ≤ x) = F(x). Now suppose that the random
outcome is censored on the right by a limiting amount, say, C_U, so that we record the smaller of the two,
X ∗ = min(X, CU ). For values of x that are smaller than CU , the indicator variable still provides an unbiased
estimator of the distribution function before we reach the censoring limit. That is, E I(X ∗ ≤ x) = F (x)
because I(X ∗ ≤ x) = I(X ≤ x) for x < CU . In the same way, E I(X ∗ > x) = 1 − F (x) = S(x).
Now consider two random variables that have different censoring limits. For illustration, suppose that we
observe X1∗ = min(X1 , 5) and X2∗ = min(X2 , 10) where X1 and X2 are independent draws from the same
distribution. For x ≤ 5, the empirical distribution function F2 (x) is an unbiased estimator of F (x). However,
for 5 < x ≤ 10, the first observation cannot be used for the distribution function because of the censoring
limitation. Instead, the strategy developed by (Kaplan and Meier, 1958) is to use Sn (5) as an estimator of S(5)
and then to use the second observation to estimate the conditional survivor function Pr(X > x|X > 5) = S(x) S(5) .
Specifically, for 5 < x ≤ 10, the estimator of the survival function is
Ŝ(x) = S2 (5) × I(X2∗ > x).
Extending this idea, for each observation i, let ui be the upper censoring limit (= ∞ if no censoring). Thus,
the recorded value is xi in the case of no censoring and ui if there is censoring. Let t1 < · · · < tk be k distinct
points at which an uncensored loss occurs, and let sj be the number of uncensored losses xi ’s at tj . The
corresponding risk set is the number of observations that are active (not censored) at a value less than t_j,
denoted as R_j = ∑_{i=1}^{n} I(x_i ≥ t_j) + ∑_{i=1}^{n} I(u_i ≥ t_j).
Kaplan-Meier Product Limit Estimator. With this notation, the product-limit estimator of the
distribution function is
F̂(x) = { 0 for x < t_1;   1 − ∏_{j: t_j ≤ x} (1 − s_j/R_j) for x ≥ t_1 }.    (4.6)
For example, consider a sample of ten observations, 4 4 5+ 5+ 5+ 8 10+ 10+ 12 15, where “+” indicates a
right-censored value. The quantities needed for the product-limit estimator are:
j    tj    sj    Rj
1    4     2     10
2    8     1     5
3    12    1     2
4    15    1     1
In addition to right-censoring, we now extend the framework to allow for left-truncated data. As before, for
each observation i, let ui be the upper censoring limit (= ∞ if no censoring). Further, let di be the lower
truncation limit (0 if no truncation). Thus, the recorded value (if it is greater than di ) is xi in the case of no
censoring and ui if there is censoring. Let t1 < · · · < tk be k distinct points at which an event of interest
occurs, and let sj be the number of recorded events xi ’s at time point tj . The corresponding risk set is
R_j = ∑_{i=1}^{n} I(x_i ≥ t_j) + ∑_{i=1}^{n} I(u_i ≥ t_j) − ∑_{i=1}^{n} I(d_i ≥ t_j).
With this new definition of the risk set, the product-limit estimator of the distribution function is as in
equation (4.6).
Greenwood’s Formula. (Greenwood, 1926) derived the formula for the estimated variance of the product-
limit estimator to be
V̂ar(F̂(x)) = (1 − F̂(x))^2 ∑_{j: t_j ≤ x} s_j / (R_j(R_j − s_j)).
R’s survfit method takes a survival data object and creates a new object containing the Kaplan-Meier estimate
of the survival function along with confidence intervals. The Kaplan-Meier method (type='kaplan-meier')
is used by default to construct an estimate of the survival curve. The resulting discrete survival function
has point masses at the observed event times (discharge dates) tj , where the probability of an event given
survival to that duration is estimated as the number of observed events at the duration sj divided by the
number of subjects exposed or ’at-risk’ just prior to the event duration Rj .
Two alternate types of estimation are also available for the survfit method. The alternative (type='fh2')
handles ties, in essence, by assuming that multiple events at the same duration occur in some arbitrary order.
Another alternative (type='fleming-harrington') uses the Nelson-Aalen (see (Aalen, 1978)) estimate of
the cumulative hazard function to obtain an estimate of the survival function. The estimated cumulative
hazard Ĥ(x) starts at zero and is incremented at each observed event duration tj by the number of events sj
divided by the number at risk R_j. With the same notation as above, the Nelson-Aalen estimator of the
distribution function is
F̂_{NA}(x) = { 0 for x < t_1;   1 − exp(−∑_{j: t_j ≤ x} s_j/R_j) for x ≥ t_1 }.
Note that the above expression is a result of the Nelson-Aalen estimator of the cumulative hazard function
Ĥ(x) = ∑_{j: t_j ≤ x} s_j/R_j
and the relationship between the survival function and cumulative hazard function, ŜN A (x) = e−Ĥ(x) .
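As a brief illustration of these survfit options (a sketch; the data are the ten observations from the product-limit example above, with + denoting right-censoring):

library(survival)

# Data from the product-limit example: 4 4 5+ 5+ 5+ 8 10+ 10+ 12 15
time  <- c(4, 4, 5, 5, 5, 8, 10, 10, 12, 15)
event <- c(1, 1, 0, 0, 0, 1,  0,  0,  1,  1)   # 1 = observed, 0 = censored

km <- survfit(Surv(time, event) ~ 1)                                # Kaplan-Meier
na <- survfit(Surv(time, event) ~ 1, type = "fleming-harrington")   # based on Nelson-Aalen
summary(km)   # survival estimates with Greenwood standard errors
summary(na)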
For example, consider the following ten observations, where d_i is the lower truncation limit (0 if none), x_i is
the observed value, and u_i is the censoring value:
Observation (i)   1    2    3    4    5    6    7    8    9    10
d_i               0    0    0    0    0    0    0    1.3  1.5  1.6
x_i               0.9  −    1.5  −    −    1.7  −    2.1  2.1  −
u_i               −    1.2  −    1.5  1.6  −    1.7  −    −    2.3
The product-limit calculations are:
j    tj    sj    Rj             Ŝ(tj)
1    0.9   1     10 − 3 = 7     1 − 1/7 = 6/7
2    1.5   1     8 − 2 = 6      (6/7)(1 − 1/6) = 5/7
3    1.7   1     5 − 0 = 5      (5/7)(1 − 1/5) = 4/7
4    2.1   2     3              (4/7)(1 − 2/3) = 4/21
Returning to the earlier example with observations 4 4 5+ 5+ 5+ 8 10+ 10+ 12 15, recall:
j    tj    sj    Rj
1    4     2     10
2    8     1     5
3    12    1     2
4    15    1     1
The Nelson-Aalen estimate of S(11) is Ŝ_{NA}(11) = e^{−Ĥ(11)} = e^{−0.4} = 0.67, since
Ĥ(11) = ∑_{j: t_j ≤ 11} s_j/R_j = ∑_{j=1}^{2} s_j/R_j = 2/10 + 1/5 = 0.2 + 0.2 = 0.4.
From earlier work, the Kaplan-Meier estimate of S(11) is Ŝ(11) = 0.64. Then Greenwood’s estimate of the
variance of the product-limit estimate of S(11) is
V̂ar(Ŝ(11)) = (Ŝ(11))^2 ∑_{j: t_j ≤ 11} s_j/(R_j(R_j − s_j)) = (0.64)^2 (2/(10 · 8) + 1/(5 · 4)) = 0.0307.
Up to this point, our inferential methods have focused on the frequentist setting, in which samples are
repeatedly drawn from a population. The vector of parameters θ is fixed yet unknown, whereas the outcomes
X are realizations of random variables.
In contrast, under the Bayesian framework, we view both the model parameters and the data as random
variables. We are uncertain about the parameters θ and use probability tools to reflect this uncertainty.
There are several advantages of the Bayesian approach. First, we can describe the entire distribution of
parameters conditional on the data. This allows us, for example, to provide probability statements regarding
the likelihood of parameters. Second, this approach allows analysts to blend prior information known from
other sources with the data in a coherent manner. This topic is developed in detail in the credibility chapter.
Third, the Bayesian approach provides a unified approach for estimating parameters. Some non-Bayesian
methods, such as least squares, require a separate approach to estimate variance components. In contrast,
in Bayesian methods, all parameters can be treated in a similar fashion. This is convenient for explaining
results to consumers of the data analysis. Fourth, Bayesian analysis is particularly useful for forecasting
future responses.
As stated earlier, under the Bayesian perspective, the model parameters and data are both viewed as random.
Our uncertainty about the parameters of the underlying data generating process is reflected in the use of
probability tools.
Prior Distribution. Specifically, think about θ as a random vector and let π(θ) denote the distribution
of possible outcomes. This is knowledge that we have before outcomes are observed and is called the prior
distribution. Typically, the prior distribution is a regular distribution and so integrates or sums to one,
depending on whether θ is continuous or discrete. However, we may be very uncertain (or have no clue)
about the distribution of θ; the Bayesian machinery allows for priors that are not proper distributions, that is,
∫ π(θ) dθ = ∞. Combining the prior with the model distribution f(x|θ) of the outcomes yields the marginal
distribution of the data, f(x) = ∫ f(x|θ)π(θ) dθ.
Posterior Distribution of Parameters. After outcomes have been observed (hence the terminology
“posterior”), one can use Bayes theorem to write the distribution as
π(θ|x) = f(x, θ)/f(x) = f(x|θ)π(θ)/f(x).
The idea is to update your knowledge of the distribution of θ (π(θ)) with the data x.
We can summarize the distribution using a confidence interval type statement.
Definition. [a, b] is said to be a 100(1 − α)% credibility interval for θ if
Pr(a ≤ θ ≤ b|x) ≥ 1 − α.
To get the exact posterior density, we integrate the above function over its range (0.6, 0.8):
∫_{0.6}^{0.8} (q^4 − q^5) dq = [q^5/5 − q^6/6]_{0.6}^{0.8} = 0.014069   ⇒   π(q|1, 0) = (q^4 − q^5)/0.014069.
Then
P(0.7 < q < 0.8|1, 0) = ∫_{0.7}^{0.8} (q^4 − q^5)/0.014069 dq = 0.5572.
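A quick numerical check of this calculation in R (a sketch; the integrand is the posterior kernel above):

kernel <- function(q) q^4 - q^5                       # posterior is proportional to q^4(1 - q)
norm_const <- integrate(kernel, 0.6, 0.8)$value       # 0.014069
integrate(kernel, 0.7, 0.8)$value / norm_const        # 0.5572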
π(θ|3) = f(3|θ)π(θ) / ∫_1^∞ f(3|θ)π(θ) dθ, where f(3|θ)π(θ) = [2θ^2/(3 + θ)^3](1/θ^2) = 2(3 + θ)^{−3} and
∫_1^∞ 2(3 + θ)^{−3} dθ = [−(3 + θ)^{−2}]_1^∞ = 1/16. Thus,
π(θ|3) = 32(3 + θ)^{−3},   θ > 1.
Then
P(Θ > 2|3) = ∫_2^∞ 32(3 + θ)^{−3} dθ = [−16(3 + θ)^{−2}]_2^∞ = 16/25 = 0.64.
In classical decision analysis, the loss function l(θ̂, θ) determines the penalty paid for using the estimate θ̂
instead of the true θ.
The Bayes estimate is that value that minimizes the expected loss E [l(θ̂, θ)].
Some important special cases include:
E(y|x) = ∫ y f(y|x) dy = ∫ y [∫ f(y|θ) π(θ|x) dθ] dy = ∫ E(y|θ) π(θ|x) dθ.
Example 4.4.3. SOA Exam Question. For a particular policy, the conditional probability of the annual
number of claims given Θ = θ, and the probability distribution of Θ are as follows:
Number of Claims 0 1 2
Probability 2θ θ 1 − 3θ
θ 0.05 0.30
Probability 0.80 0.20
Two claims are observed in Year 1. Calculate the Bayesian estimate (Bühlmann credibility estimate) of the
number of claims in Year 2.
Solution. Note that E(θ) = 0.05(0.8) + 0.3(0.2) = 0.1 and E(θ^2) = 0.05^2(0.8) + 0.3^2(0.2) = 0.02.
We also have µ(θ) = 0(2θ) + 1(θ) + 2(1 − 3θ) = 2 − 5θ and v(θ) = 0^2(2θ) + 1^2(θ) + 2^2(1 − 3θ) − (2 − 5θ)^2 = 9θ − 25θ^2.
Thus
(i) f(x|θ) = θ/(x + θ)^2,   0 < x < ∞
(ii) For half of the company’s policies θ = 1 , while for the other half θ = 3.
For a randomly selected policy, losses in Year 1 were 5. Calculate the posterior probability that losses for this
policy in Year 2 will exceed 8.
Solution. We are given the prior distribution of θ as P (θ = 1) = P (θ = 3) = 12 , the conditional distribution
f (x|θ), and the fact that we observed X1 = 5. The goal is to find the predictive probability P (X2 > 8|X1 = 5).
The posterior probabilities are
P(θ = 1|X_1 = 5) = f(5|θ = 1)P(θ = 1) / [f(5|θ = 1)P(θ = 1) + f(5|θ = 3)P(θ = 3)]
= (1/36)(1/2) / [(1/36)(1/2) + (3/64)(1/2)] = (1/72) / (1/72 + 3/128) = 16/43
P(θ = 3|X_1 = 5) = 1 − P(θ = 1|X_1 = 5) = 27/43.
The conditional probability of exceeding 8 is
P(X_2 > 8|θ) = ∫_8^∞ f(x|θ) dx = ∫_8^∞ θ/(x + θ)^2 dx = [−θ/(x + θ)]_8^∞ = θ/(8 + θ),
so
P(X_2 > 8|X_1 = 5) = P(X_2 > 8|θ = 1)P(θ = 1|X_1 = 5) + P(X_2 > 8|θ = 3)P(θ = 3|X_1 = 5)
= (1/9)(16/43) + (3/11)(27/43) = 0.2126.
(i) The probability that an insured will have at least one loss during any year is p.
(ii) The prior distribution for p is uniform on [0, 0.5].
(iii) An insured is observed for 8 years and has at least one loss every year.
Calculate the posterior probability that the insured will have at least one loss during Year 9.
Thus, the posterior probability that the insured will have at least one loss during Year 9 is
P(X_9 = 1|1, 1, 1, 1, 1, 1, 1, 1) = ∫_0^{0.5} P(X_9 = 1|p) π(p|1, 1, 1, 1, 1, 1, 1, 1) dp
= ∫_0^{0.5} p · 9(0.5^{−9}) p^8 dp = 9(0.5^{−9})(0.5^{10})/10 = 0.45.
One randomly chosen risk has three claims during Years 1-6. Calculate the posterior probability of a claim
for this risk in Year 7.
Solution. The probabilities are from a binomial distribution with 6 trials in which 3 successes were observed.
P(3|I) = (6 choose 3)(0.1^3)(0.9^3) = 0.01458
P(3|II) = (6 choose 3)(0.2^3)(0.8^3) = 0.08192
P(3|III) = (6 choose 3)(0.4^3)(0.6^3) = 0.27648
π(θ|x) = f(x|θ)π(θ)/f(x) ∝ f(x|θ)π(θ); that is, the posterior is proportional to the likelihood × prior.
For conjugate distributions, the posterior and the prior come from the same family of distributions. The
following illustration looks at the Poisson-gamma special case, the most well-known in actuarial applications.
Special Case – Poisson-Gamma. Assume a Poisson(λ) model distribution so that
f(x|λ) = ∏_{i=1}^{n} λ^{x_i} e^{−λ}/x_i!
µ = v = E(λ) = αθ,   a = Var(λ) = αθ^2,   k = v/a = 1/θ
⇒ Z = n/(n + k) = n/(n + 1/θ) = nθ/(nθ + 1)
0.15 = [θ/(θ + 1)](1) + [1/(θ + 1)]µ = (θ + µ)/(θ + 1)
0.20 = [2θ/(2θ + 1)](2) + [1/(2θ + 1)]µ = (4θ + µ)/(2θ + 1)
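To illustrate the Poisson-gamma conjugacy and the credibility weight Z above, here is a minimal R sketch; the prior parameters and claim counts are hypothetical.

alpha <- 2; theta <- 0.5          # prior: lambda ~ gamma(shape = alpha, scale = theta)
x <- c(0, 1, 0, 2, 1)             # hypothetical observed annual claim counts
n <- length(x)

# Conjugacy: the posterior of lambda is gamma(alpha + sum(x), scale = theta/(n*theta + 1))
post_mean <- (alpha + sum(x)) * theta / (n * theta + 1)

# The posterior mean is a credibility-weighted average of the sample mean and the prior mean
Z <- n * theta / (n * theta + 1)
c(post_mean, Z * mean(x) + (1 - Z) * alpha * theta)   # the two values agree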
Here are a set of exercises that guide the viewer through some of the theoretical foundations of Loss Data
Analytics. Each tutorial is based on one or more questions from the professional actuarial examinations,
typically the Society of Actuaries Exam C.
Model Selection Guided Tutorials
Contributors
• Edward W. (Jed) Frees and Lisa Gao, University of Wisconsin-Madison, are the principal authors
of the initial version of this chapter. Email: [email protected] for chapter comments and suggested
improvements.
In welfare economics, it is common to compare distributions via the Lorenz curve, developed by Max Otto
Lorenz (Lorenz, 1905). A Lorenz curve is a graph of the proportion of a population on the horizontal axis and
a distribution function of interest on the vertical axis. It is typically used to represent income distributions.
When the income distribution is perfectly aligned with the population distribution, the Lorenz curve results
in a 45 degree line that is known as the line of equality. The area between the Lorenz curve and the line of
equality is a measure of the discrepancy between the income and population distributions. Two times this
area is known as the Gini index, introduced by Corrado Gini in 1912.
Example – Classic Lorenz Curve. For an insurance example, Figure 4.12 shows a distribution of insurance
losses. This figure is based on a random sample of 2000 losses. The left-hand panel shows a right-skewed
histogram of losses. The right-hand panel provides the corresponding Lorenz curve, showing again a skewed
distribution. For example, the arrow marks the point where 60 percent of the policyholders have 30 percent
of losses. The 45 degree line is the line of equality; if each policyholder has the same loss, then the loss
distribution would be at this line. The Gini index, twice the area between the Lorenz curve and the 45 degree
line, is 37.6 percent for this data set.
We now introduce a modification of the classic Lorenz curve and Gini statistic that is useful in insurance
applications. Specifically, we introduce an ordered Lorenz curve which is a graph of the distribution of losses
versus premiums, where both losses and premiums are ordered by relativities. Intuitively, the relativities
point towards aspects of the comparison where there is a mismatch between losses and premiums. To make
the ideas concrete, we first provide some notation. We will consider i = 1, . . . , n policies. For the ith policy,
let
• yi denote the insurance loss,
• xi be the set of policyholder characteristics known to the analyst,
• Pi = P(xi) be the associated premium that is a function of xi, and
• Ri = R(xi) = S(xi)/P(xi) be the associated relativity, based on an insurance score S(xi) from a competing model.
We now sort the set of policies based on relativities (from smallest to largest) and compute the premium and
loss distributions. Using notation, the premium distribution is
F̂_P(s) = ∑_{i=1}^{n} P(x_i) I(R_i ≤ s) / ∑_{i=1}^{n} P(x_i),    (4.7)
and the loss distribution is
F̂_L(s) = ∑_{i=1}^{n} y_i I(R_i ≤ s) / ∑_{i=1}^{n} y_i,    (4.8)
where I(·) is the indicator function, returning a 1 if the event is true and zero otherwise. The graph
(F̂_P(s), F̂_L(s)) is an ordered Lorenz curve.
The classic Lorenz curve shows the proportion of policyholders on the horizontal axis and the loss distribution
function on the vertical axis. The ordered Lorenz curve extends the classical Lorenz curve in two ways, (1)
through the ordering of risks and prices by relativities and (2) by allowing prices to vary by observation.
We summarize the ordered Lorenz curve in the same way as the classic Lorenz curve using a Gini index,
defined as twice the area between the curve and a 45 degree line. The analyst seeks ordered Lorenz curves
that approach passing through the southeast corner (1,0); these have greater separation between the loss and
premium distributions and therefore larger Gini indices.
Example – Loss Distribution.
Suppose we have n = 5 policyholders with experience as:
Variable i 1 2 3 4 5 Sum
Loss yi 5 5 5 4 6 25
Premium P (xi ) 4 2 6 5 8 25
Relativity R(xi ) 5 4 3 2 1
Figure 4.13 compares the Lorenz curve to the ordered version based on this data. The left-hand panel shows
the Lorenz curve. The horizontal axis is the cumulative proportion of policyholders (0, 0.2, 0.4, and so forth)
and the vertical axis is the cumulative proportion of losses (0, 4/25, 9/25, and so forth). This figure shows
little separation between the distributions of losses and policyholders.
The right-hand panel shows the ordered Lorenz curve. Because observations are sorted by relativities, the
first point after the origin (reading from left to right) is (8/25, 6/25). The second point is (13/25, 10/25),
with the pattern continuing. For the ordered Lorenz curve, the horizontal axis uses premium weights, the
vertical axis uses loss weights, and both axes are ordered by relativities. From the figure, we see that there is
greater separation between losses and premiums when viewed through this relativity.
Gini Index
Specifically, the Gini index can be calculated as follows. Suppose that the empirical ordered Lorenz curve
is given by {(a0 = 0, b0 = 0), (a1 , b1 ), . . . , (an = 1, bn = 1)} for a sample of n observations. Here, we use
aj = F̂P (Rj ) and bj = F̂L (Rj ). Then, the empirical Gini index is
Ĝini = 2 ∑_{j=0}^{n−1} (a_{j+1} − a_j) [(a_{j+1} + a_j)/2 − (b_{j+1} + b_j)/2]
     = 1 − ∑_{j=0}^{n−1} (a_{j+1} − a_j)(b_{j+1} + b_j).    (4.9)
Example – Loss Distribution: Continued. In the figure, the Gini index for the left-hand panel is 5.6%.
In contrast, the Gini index for the right-hand panel is 14.9%.
The Gini statistic based on an ordered Lorenz curve can be used for out-of-sample validation. The procedure,
illustrated in the R sketch following this list, is as follows:
1. Use an in-sample data set to estimate several competing models.
2. Designate an out-of-sample, or validation, data set of the form {(yi , xi ), i = 1, . . . , n}.
3. Establish one of the models as the base model. Use this estimated model and explanatory variables
from the validation sample to form premiums of the form P(xi).
4. Use an estimated competing model and validation sample explanatory variables to form scores of the
form S(xi).
5. From the premiums and scores, develop relativities Ri = S(xi )/P (xi ).
6. Use the validation sample outcomes yi to compute the Gini statistic.
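The following R sketch (an illustration, not code from the text) carries out these steps on simulated data and computes the Gini index via equation (4.9); the data generating assumptions are hypothetical.

set.seed(101)
n <- 1000
P <- rgamma(n, shape = 5, scale = 40)        # premiums from the base model
S <- P * exp(rnorm(n, sd = 0.3))             # scores from a competing model
y <- rgamma(n, shape = 5, scale = S / 5)     # validation sample losses

gini_ordered <- function(y, P, S) {
  R <- S / P                                  # step 5: relativities
  o <- order(R)                               # sort by relativity
  a <- c(0, cumsum(P[o]) / sum(P))            # premium distribution, equation (4.7)
  b <- c(0, cumsum(y[o]) / sum(y))            # loss distribution, equation (4.8)
  1 - sum(diff(a) * (b[-1] + b[-length(b)]))  # step 6: Gini index, equation (4.9)
}
gini_ordered(y, P, S)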
Example – Out-of-Sample Validation.
Suppose that we have experience from 25 states. For each state, we have available 500 observations that can
be used to predict future losses. For this illustration, we have generated losses using a gamma distribution
with common shape parameter equal to 5 and a scale parameter that varies by state, from a low of 20 to 66.
Determine the ordered Lorenz curve and the corresponding Gini statistic to compare the two rate procedures.
For our base premium, we simply use the maximum likelihood estimate assuming a common distribution
among all states. For the gamma distribution, this turns out to be simply the average which for our simulation
is P=219.96. You can think of this common premium as based on a community rating principle. As an
alternative, we use averages that are state-specific. Because this illustration uses means that vary by states,
we anticipate this alternative rating procedure to be preferred to the community rating procedure. (Recall
for the gamma distribution that the mean equals the shape times the scale or, 5 times the scale parameter,
for our example.)
Out of sample claims were generated from the same gamma distribution, with 200 observations for each state.
In the following, we have the ordered Lorenz curve.
For these data, the Gini index is 0.187 with a standard error equal to 0.00381.
Discussion
In insurance claims modeling, standard out-of-sample validation measures are not the most informative due
to the high proportions of zeros (corresponding to no claim) and the skewed fat-tailed distribution of the
positive values. The Gini index can be motivated by the economics of insurance. Intuitively, the Gini index
measures the negative covariance between a policy’s “profit” (P − y, premium minus loss) and the rank of
the relativity (R, score divided by premium). That is, the close approximation
[ ≈ − 2 Cov
Gini d ((P − y), rank(R)) .
n
This observation leads an insurer to seek an ordering that produces to a large Gini index. Thus, the Gini
index and associated ordered Lorenz curve are useful for identifying profitable blocks of insurance business.
Unlike classical measures of association, the Gini index assumes that a premium base P is currently in place
and seeks to assess vulnerabilities of this structure. This approach is more akin to hypothesis testing (when
compared to goodness of fit) where one identifies a “null hypothesis” as the current state of the world and
uses decision-making criteria/statistics to compare this with an “alternative hypothesis.”
The insurance version of the Gini statistic was developed by (Frees et al., 2011) and (Frees et al., 2014) where
you can find formulas for the standard errors and other additional background information.
Chapter 5
Aggregate Loss Models
Chapter Preview. This chapter introduces probability models for describing the aggregate claims that arise
from a portfolio of insurance contracts. We present two standard modeling approaches, the individual risk
model and the collective risk model. Further, we discuss strategies for computing the distribution of the
aggregate claims. Finally, we examine the effects of individual policy modifications on the aggregate loss
distribution.
5.1 Introduction
The objective of this chapter is to build a probability model to describe the aggregate claims by an insurance
system occurring in a fixed time period. The insurance system could be a single policy, a group insurance
contract, a business line, or an insurer's entire book of business. In this chapter, aggregate claims refers
to either the number or the amount of claims from a portfolio of insurance contracts. However, the modeling
framework readily applies in more general setups.
Consider an insurance portfolio of n individual contracts, and let S denote the aggregate losses of the portfolio
in a given time period. There are two approaches to modeling the aggregate losses S, the individual risk
model and the collective risk model. The individual risk model emphasizes the loss from each individual
contract and represents the aggregate losses as:
S = X1 + X2 + · · · + Xn ,
where Xi (i = 1, . . . , n) is interpreted as the loss amount from the ith contract. It is worth stressing that n
denotes the number of contracts in the portfolio and thus is a fixed number rather than a random variable.
For the individual risk model, one usually assumes the Xi's are independent, i.e., Xi ⊥ Xj for i ≠ j. Because of
different contract features such as coverage and exposure, the Xi's are not necessarily identically distributed. A
notable feature of the distribution of each Xi is the probability mass at zero corresponding to the event of no
claims.
The collective risk model represents the aggregate losses in terms of a frequency distribution and a severity
distribution:
S = X1 + X2 + · · · + XN .
Here one thinks of a random number of claims N that may represent either the number of losses or the number
of payments. In contrast, in the individual risk model, we use a fixed number of contracts n. We think of
X1, X2, . . . , XN as representing the amount of each loss. Each loss may or may not correspond to a unique
contract. For instance, there may be multiple claims arising from a single contract. It is natural to think
about Xi > 0 because if Xi = 0 then no claim has occurred. Typically we assume that conditional on N = n,
X1 , X2 , · · · , Xn are iid random variables. The distribution of N is known as the frequency distribution,
and the common distribution of X is known as the severity distribution. We further assume N and X are
independent. With the collective risk model, we may decompose the aggregate losses into the frequency (N )
process and the severity (X) process. This flexibility allows the analyst to comment on these two separate
components. For example, sales growth due to lower underwriting standards could lead to higher frequency
of losses but might not affect severity. Similarly, inflation or other economic forces could have an impact on
severity but not on frequency.
Sn = X1 + X2 + · · · + Xn
to be the aggregate loss from all contracts in a portfolio or group of contracts. Under the independence
assumption on the Xi's, it is straightforward to show
E(Sn) = ∑_{i=1}^{n} E(Xi),    Var(Sn) = ∑_{i=1}^{n} Var(Xi),
P_{Sn}(z) = ∏_{i=1}^{n} P_{Xi}(z),    M_{Sn}(t) = ∏_{i=1}^{n} M_{Xi}(t),
where PS (·) and MS (·) are probability generating function and moment generating function of S, respectively.
The distribution of each Xi contains mass at zero, corresponding to the event of no claim. One strategy to
incorporate the zero mass in the distribution is using the two-part framework:
Xi = Ii × Bi = { 0 if Ii = 0;   Bi if Ii = 1 }
Here Ii is a Bernoulli variable indicating whether or not a loss occurs for the ith contract, and Bi , a r.v.
with nonnegative support, represents the amount of losses of the contract given loss occurrence. Assume
that I1 , . . . , In , B1 , . . . , Bn are mutually independent. Denote Pr(Ii = 1) = qi , µi = E(Bi ), and σi2 = Var(Bi ).
One can show
E(Sn) = ∑_{i=1}^{n} q_i µ_i
Var(Sn) = ∑_{i=1}^{n} [q_i σ_i^2 + q_i(1 − q_i) µ_i^2]
P_{Sn}(z) = ∏_{i=1}^{n} (1 − q_i + q_i P_{Bi}(z))
M_{Sn}(t) = ∏_{i=1}^{n} (1 − q_i + q_i M_{Bi}(t))
A special case of the above model is when Bi follows a degenerate distribution with µi = bi and σi2 = 0. One
example is term life insurance or a pure endowment insurance where bi represents the amount of insurance of
the ith contract.
Another strategy to accommodate zero mass in the distribution of Xi is a collective risk model, i.e. Xi =
Zi1 + · · · + ZiNi where Xi = 0 when Ni = 0. The collective risk model will be discussed in detail in the next
section.
Example 5.2.1. SOA Exam Question. An insurance company sold 300 fire insurance policies, with claim
amounts uniformly distributed between 0 and the policy maximum, as follows:
Class    Number of Policies    Policy Maximum    Probability of Claim per Policy
1        100                   400               0.05
2        200                   300               0.06
Calculate the variance of the aggregate claims S300.
Var(S300) = ∑_{i=1}^{300} [q_i σ_i^2 + q_i(1 − q_i)µ_i^2]
= 100 [0.05 (400^2/12) + 0.05(1 − 0.05)200^2] + 200 [0.06 (300^2/12) + 0.06(1 − 0.06)150^2]
= 600,466.67.
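A quick R check of this calculation (a sketch using the two policy classes above):

n  <- c(100, 200)       # number of policies in each class
q  <- c(0.05, 0.06)     # probability of a claim
mx <- c(400, 300)       # policy maximum

mu <- mx / 2            # uniform claim amount: mean
s2 <- mx^2 / 12         # uniform claim amount: variance

c(mean = sum(n * q * mu),
  var  = sum(n * (q * s2 + q * (1 - q) * mu^2)))   # 2,800 and 600,466.67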
Follow-Up. Now suppose everybody receives the policy maximum if a claim occurs. What is the expected
aggregate loss and variance of the aggregate loss? Each policy claim amount Bi is now fixed at the policy
maximum instead of random, so σ_i^2 = Var(Bi) = 0 and µ_i = PolMax.
E(S300) = ∑_{i=1}^{300} q_i µ_i = 100{0.05(400)} + 200{0.06(300)} = 5,600
Var(S300) = ∑_{i=1}^{300} [q_i σ_i^2 + q_i(1 − q_i)µ_i^2] = ∑_{i=1}^{300} q_i(1 − q_i)µ_i^2
= 100{(0.05)(1 − 0.05)400^2} + 200{(0.06)(1 − 0.06)300^2} = 1,775,200.
The individual risk model can also be used for claim frequency. If Xi denotes the number of claims from the
ith contract, then Sn is interpreted as the total number of claims from the portfolio. In this case, the above
two-part framework still applies. Assume Xi belongs to the (a, b, 0) class with pmf denoted by p_{ik}. Let X_i^T
denote the associated zero-truncated distribution in the (a, b, 1) class with pmf p_{ik}^T = p_{ik}/(1 − p_{i0}) for
k = 1, 2, . . .. Using the relationship between their generating functions:
For the low-risk policies, we have qi = 0.03 and for the high-risk policies, we have qi = 0.05. Further, Bi = NiT ,
the zero-truncated version of Ni . Thus, we have
µ_i = E(B_i) = E(N_i^T) = λ/(1 − e^{−λ})
σ_i^2 = Var(B_i) = Var(N_i^T) = λ[1 − (λ + 1)e^{−λ}]/(1 − e^{−λ})^2
Let the portfolio claim frequency be Sn = ∑_{i=1}^{n} N_i. Using the formulas above, the expected claim frequency
of the portfolio is
E(Sn) = ∑_{i=1}^{100} q_i µ_i = 40 (0.03) [1/(1 − e^{−1})] + 60 (0.05) [2/(1 − e^{−2})]
= 40(0.03)(1.5820) + 60(0.05)(2.3130) = 8.8375.
The variance of the claim frequency of the portfolio is
Var(Sn) = ∑_{i=1}^{100} [q_i σ_i^2 + q_i(1 − q_i)µ_i^2]
= 40 [0.03 (1 − 2e^{−1})/(1 − e^{−1})^2 + 0.03(0.97)(1.5820^2)] + 60 [0.05 · 2(1 − 3e^{−2})/(1 − e^{−2})^2 + 0.05(0.95)(2.3130^2)]
= 23.7214.
Note that equivalently, we could have calculated the mean and variance of an individual policy directly using
the relationship between the zero-modified and zero-truncated Poisson distributions.
To understand the distribution of the aggregate loss, one could use the central limit theorem to approximate the
distribution of Sn. Denoting µ_S = E(Sn) and σ_S^2 = Var(Sn), the cdf of Sn is approximately
F_{Sn}(s) = Pr(Sn ≤ s) ≈ Φ((s − µ_S)/σ_S).
Example 5.2.3. SOA Exam Question - Follow-Up. As in the original example earlier, an insurance
company sold 300 fire insurance policies, with claim amounts uniformly distributed between 0 and the policy
maximum. Using the normal approximation, calculate the probability that the aggregate claim amount
exceeds $3,500.
Solution. We have seen earlier that E(S300) = 2,800 and Var(S300) = 600,466.67. Then
Pr(S300 > 3,500) ≈ Pr(Z > (3,500 − 2,800)/√600,466.67) = Pr(Z > 0.90) ≈ 0.18.
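In R, the normal approximation is a single line (a sketch using the moments computed above):

pnorm(3500, mean = 2800, sd = sqrt(600466.67), lower.tail = FALSE)   # about 0.18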
For small n, the distribution of Sn is likely skewed, and the normal approximation would be a poor choice.
To examine the aggregate loss distribution, we go back to the basics and first principles. Specifically, the
distribution can be derived recursively. Define Sk = X1 + · · · + Xk, k = 1, . . . , n. For k = 1, F_{S1}(s) = Pr(X1 ≤ s) = F_{X1}(s).
For k = 2, . . . , n,
FSk (s) = Pr(X1 + · · · + Xk ≤ s) = Pr(Sk−1 + Xk ≤ s)
= EXk [Pr(Sk−1 ≤ s − Xk |Xk )] = EXk FSk−1 (s − Xk ) .
There are some simple cases where Sn has a closed form; examples include the following.
A special case is when the Xi's are identically distributed. Letting F_X(x) = Pr(X ≤ x) be the common distribution
of the Xi (i = 1, . . . , n), we define
F_X^{*n}(x) = Pr(X1 + · · · + Xn ≤ x),
the n-fold convolution of F_X.
Example 5.2.4. Gamma Distribution. For an easy case, assume that Xi ∼ gamma with parameters
(α, θ). As we know, the moment generating function (mgf) is M_X(t) = (1 − θt)^{−α}. Thus, the mgf of the sum
Sn = X1 + · · · + Xn is M_{Sn}(t) = (M_X(t))^n = (1 − θt)^{−nα}.
Thus, Sn has a gamma distribution with parameters (nα, θ). This makes it easy to compute F^{*n}(x) =
Pr(Sn ≤ x). This property is known as “closed under convolution”.
Thus, Sn has a negative binomial distribution with parameters (β, ∑_{i=1}^{n} r_i).
More generally, we can compute F ∗n recursively. Begin the recursion at n = 1 using F ∗1 (x) = F (x). Next,
for n = 2, we have
Recall F (0) = 0.
Similarly, let Sn = X1 + X2 + · · · + Xn
Example 5.2.6. SOA Exam Question (modified). The annual number of doctor visits for each
individual in a family of 4 has a geometric distribution with mean 1.5. The annual numbers of visits for the
family members are mutually independent. An insurance pays 100 per doctor visit beginning with the 4th
visit per family. Calculate the probability that the family will receive an insurance payment this year.
Solution. Let Xi ∼ Geometric(β = 1.5) be the number of doctor visits for one individual in the family and
S4 = X1 + X2 + X3 + X4 be the number of doctor visits for the family. The sum of 4 independent geometric
distributions each with mean 1.5 follows a negative binomial distribution, i.e. S4 ∼ N egBin(β = 1.5, r = 4).
If the insurance pays 100 per visit beginning with the 4th visit for the family, then the family will not receive
an insurance payment if they have less than 4 claims. This probability is
Pr(S4 < 4) = Pr(S4 = 0) + Pr(S4 = 1) + Pr(S4 = 2) + Pr(S4 = 3)
= (1 + 1.5)^{−4} + 4(1.5)/(1 + 1.5)^5 + 4(5)(1.5^2)/[2(1 + 1.5)^6] + 4(5)(6)(1.5^3)/[3!(1 + 1.5)^7]
= 0.0256 + 0.0614 + 0.0922 + 0.1106 = 0.2898.
Thus, the probability that the family receives an insurance payment this year is 1 − 0.2898 = 0.7102.
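This can be verified directly with R's negative binomial functions (a quick check, not part of the original solution):

# Sum of 4 iid geometric(beta = 1.5) counts is negative binomial with r = 4;
# in R's parameterization, size = r and prob = 1/(1 + beta)
1 - pnbinom(3, size = 4, prob = 1 / 2.5)   # Pr(S4 >= 4) = 0.7102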
Under the collective risk model S = X1 + · · · + XN, the {Xi} are iid and independent of N. Let µ = E(Xi) and
σ^2 = Var(Xi) for all i. Using the law of iterated expectations, the mean is E(S) = E(N) µ, and the variance is
Var(S) = E(N) σ^2 + Var(N) µ^2. In the special case where N ∼ Poisson(λ), we have
E(N) = Var(N) = λ   and   Var(S) = λ(σ^2 + µ^2) = λ E(X^2).
Example 5.3.1. SOA Exam Question. The number of accidents follows a Poisson distribution with
mean 12. Each accident generates 1, 2, or 3 claimants with probabilities 1/2, 1/3, and 1/6 respectively.
Calculate the variance in the total number of claimants.
Show Example Solution
Solution. Using the compound Poisson shortcut,
$$\mathrm{E}\,X^2 = 1^2\left(\tfrac12\right) + 2^2\left(\tfrac13\right) + 3^2\left(\tfrac16\right) = \tfrac{10}{3},$$
$$\mathrm{Var}\,S = \lambda\,\mathrm{E}\,X^2 = 12\left(\tfrac{10}{3}\right) = 40.$$
Alternatively, using the general formula with $\mathrm{E}\,N = \mathrm{Var}\,N = 12$,
$$\mu = \mathrm{E}\,X = 1\left(\tfrac12\right) + 2\left(\tfrac13\right) + 3\left(\tfrac16\right) = \tfrac53, \qquad \sigma^2 = \mathrm{E}\,X^2 - (\mathrm{E}\,X)^2 = \tfrac{10}{3} - \tfrac{25}{9} = \tfrac59,$$
$$\Rightarrow \mathrm{Var}\,S = 12\left(\tfrac59\right) + \left(\tfrac53\right)^2(12) = 40.$$
In general, the moments of $S$ can be derived from its moment generating function (mgf). Because $\{X_i\}$ are iid, we denote the mgf of $X$ as $M_X(t) = \mathrm{E}(e^{tX})$. Using the law of iterated expectations, the mgf of $S$ is
$$M_S(t) = \mathrm{E}\left(e^{tS}\right) = \mathrm{E}_N\left[\mathrm{E}\left(e^{t(X_1 + \cdots + X_N)} \mid N\right)\right] = \mathrm{E}_N\left[(M_X(t))^N\right],$$
where we use the relation $\mathrm{E}[e^{t(X_1 + \cdots + X_n)}] = \mathrm{E}(e^{tX_1})\cdots\mathrm{E}(e^{tX_n}) = (M_X(t))^n$. Now, recall that the probability generating function (pgf) of $N$ is $P_N(z) = \mathrm{E}(z^N)$. Denoting $z = M_X(t)$, it follows that
$$M_S(t) = P_N\left[M_X(t)\right].$$
Special Case. Poisson Frequency. Let $N \sim Poisson(\lambda)$. Then the pgf of $N$ is $P_N(z) = \exp[\lambda(z-1)]$, and the mgf of $S$ is
$$M_S(t) = \exp\left[\lambda\left(M_X(t) - 1\right)\right].$$
Example 5.3.2. SOA Exam Question. You are the producer of a television quiz show that gives cash
prizes. The number of prizes, N , and prize amount, X, have the following distributions:
n    Pr(N = n)
1    0.8
2    0.2

x       Pr(X = x)
0       0.2
100     0.7
1000    0.1
Your budget for prizes equals the expected aggregate cash prizes plus the standard deviation of aggregate
cash prizes. Calculate your budget.
Show Example Solution
Solution. We need to calculate the mean and standard deviation of the aggregate (sum of) cash prizes. The moments of the frequency distribution $N$ are
$$\mathrm{E}\,N = 1(0.8) + 2(0.2) = 1.2, \qquad \mathrm{Var}\,N = \mathrm{E}\,N^2 - (\mathrm{E}\,N)^2 = \left[1(0.8) + 4(0.2)\right] - 1.2^2 = 0.16,$$
and the moments of the severity distribution $X$ are
$$\mu = \mathrm{E}\,X = 0(0.2) + 100(0.7) + 1000(0.1) = 170, \qquad \sigma^2 = \mathrm{E}\,X^2 - (\mathrm{E}\,X)^2 = 107{,}000 - 170^2 = 78{,}100.$$
Thus, the mean and variance of the aggregate cash prize are
$$\mathrm{E}\,S = \mu\,\mathrm{E}\,N = 170(1.2) = 204,$$
$$\mathrm{Var}\,S = \sigma^2\,\mathrm{E}\,N + \mu^2\,\mathrm{Var}\,N = 78{,}100(1.2) + 170^2(0.16) = 98{,}344,$$
so that
$$\text{Budget} = \mathrm{E}\,S + \sqrt{\mathrm{Var}\,S} = 204 + \sqrt{98{,}344} = 517.60.$$
The distribution of S is called a compound distribution, and it can be derived based on the convolution of
FX as follows:
$$F_S(s) = \Pr(X_1 + \cdots + X_N \le s) = \mathrm{E}_N\left[\Pr(X_1 + \cdots + X_N \le s \mid N)\right] = \mathrm{E}_N\left[F_X^{*N}(s)\right] = p_0 + \sum_{n=1}^{\infty} p_n\, F_X^{*n}(s).$$
Example 5.3.3. SOA Exam Question. The number of claims in a period has a geometric distribution
with mean 4. The amount of each claim X follows Pr(X = x) = 0.25, x = 1, 2, 3, 4. The number of claims
and the claim amounts are independent. Let S denote the aggregate claim amount in the period. Calculate
FS (3).
Show Example Solution
Solution. By definition, we have
$$F_S(3) = \Pr\left(\sum_{i=1}^{N} X_i \le 3\right) = \sum_{n=0}^{\infty}\Pr\left(\sum_{i=1}^{n} X_i \le 3 \,\middle|\, N = n\right)\Pr(N = n) = \sum_{n=0}^{3} F^{*n}(3)\,p_n$$
$$= p_0 + F^{*1}(3)\,p_1 + F^{*2}(3)\,p_2 + F^{*3}(3)\,p_3.$$
Notice that we did not need to calculate $F^{*3}(3)$ recursively: since each $X \in \{1,2,3,4\}$, the only way of obtaining $X_1 + X_2 + X_3 \le 3$ is to have $X_1 = X_2 = X_3 = 1$. Additionally, for $n \ge 4$, $F^{*n}(3) = 0$ since it is impossible for the sum of 4 or more $X$'s to be at most 3. For $n = 0$, $F^{*0}(3) = 1$ since the sum of 0 $X$'s is 0, which is always at most 3. Laying out the probabilities systematically, the geometric frequency with $\beta = 4$ gives $p_n = \frac{4^n}{5^{n+1}}$, while $F^{*1}(3) = \frac34$, $F^{*2}(3) = \frac{3}{16}$, and $F^{*3}(3) = \frac{1}{64}$.
Finally,
$$F_S(3) = p_0 + F^{*1}(3)\,p_1 + F^{*2}(3)\,p_2 + F^{*3}(3)\,p_3 = \frac15 + \frac34\cdot\frac{4}{25} + \frac{3}{16}\cdot\frac{16}{125} + \frac{1}{64}\cdot\frac{64}{625} = 0.3456.$$
When $\mathrm{E}(N)$ is large, one may also use the central limit theorem to approximate the distribution of $S$, as in the individual risk model. That is, $(S - \mathrm{E}(S))/\sqrt{\mathrm{Var}(S)}$ approximately follows a standard normal $N(0,1)$ distribution.
Using the normal approximation, determine the probability that the aggregate loss will exceed 150% of the
expected loss.
Show Example Solution
Solution. To use the normal approximation, we must first find the mean and variance of the aggregate loss $S$. Then, under the normal approximation, the aggregate loss $S$ is approximately normal with mean 80,000 and standard deviation 32,000. The probability that $S$ will exceed 150% of the expected aggregate loss is therefore
$$\Pr(S > 1.5\,\mathrm{E}\,S) = \Pr\left(\frac{S - \mathrm{E}\,S}{\sqrt{\mathrm{Var}\,S}} > \frac{1.5\,\mathrm{E}\,S - \mathrm{E}\,S}{\sqrt{\mathrm{Var}\,S}}\right) = \Pr\left(N(0,1) > \frac{0.5\,\mathrm{E}\,S}{\sqrt{\mathrm{Var}\,S}}\right)$$
$$= \Pr\left(N(0,1) > \frac{0.5(80{,}000)}{32{,}000}\right) = \Pr(N(0,1) > 1.25) = 1 - \Phi(1.25) = 0.1056.$$
$$\mathrm{E}\,N = \mathrm{Var}\,N = 25, \qquad \mu = \mathrm{E}\,X = \frac{5 + 95}{2} = 50, \qquad \sigma^2 = \mathrm{Var}\,X = \frac{(95 - 5)^2}{12} = 675.$$
Then for $S$,
$$\mathrm{E}\,S = \mu\,\mathrm{E}\,N = 50(25) = 1{,}250,$$
$$\mathrm{Var}\,S = \sigma^2\,\mathrm{E}\,N + \mu^2\,\mathrm{Var}\,N = 675(25) + 50^2(25) = 79{,}375.$$
Using the normal approximation, $S$ is approximately normal with mean 1,250 and variance 79,375. The probability that $S$ exceeds 2,000 is
$$\Pr(S > 2{,}000) = \Pr\left(\frac{S - \mathrm{E}\,S}{\sqrt{\mathrm{Var}\,S}} > \frac{2{,}000 - \mathrm{E}\,S}{\sqrt{\mathrm{Var}\,S}}\right) = \Pr\left(N(0,1) > \frac{2{,}000 - 1{,}250}{\sqrt{79{,}375}}\right)$$
$$= \Pr(N(0,1) > 2.662) = 1 - \Phi(2.662) = 0.003884.$$
Insurance on the aggregate loss $S$, subject to a deductible $d$, is called net stop-loss insurance. The quantity $\mathrm{E}[(S - d)_+]$ is known as the net stop-loss premium. It can be computed as
$$\mathrm{E}(S - d)_+ = \int_d^{\infty}\left(1 - F_S(s)\right) ds = \begin{cases} \displaystyle\int_d^{\infty}(s - d)\,f_S(s)\,ds & \text{in the continuous case}, \\[6pt] \displaystyle\sum_{s > d}(s - d)\,f_S(s) & \text{in the discrete case}, \end{cases}$$
$$= \mathrm{E}(S) - \mathrm{E}(S \wedge d).$$
Example 5.3.6. SOA Exam Question. In a given week, the number of projects that require you to work
overtime has a geometric distribution with β = 2. For each project, the distribution of the number of overtime
hours in the week is as follows:
x f (x)
5 0.2
10 0.3
20 0.5
The number of projects and the number of overtime hours are independent. You will get paid for overtime
hours in excess of 15 hours in the week. Calculate the expected number of overtime hours for which you will
get paid in the week.
Show Example Solution
Solution. The number of projects in a week requiring overtime work has distribution N ∼ Geometric(β = 2),
while the number of overtime hours worked per project has distribution X as described above. The aggregate
number of overtime hours in a week is S and we are therefore looking for
E (S − 15)+ = E S − E (S ∧ 15).
To find $\mathrm{E}\,S = \mathrm{E}\,X\,\mathrm{E}\,N$, we have $\mathrm{E}\,X = 5(0.2) + 10(0.3) + 20(0.5) = 14$ and $\mathrm{E}\,N = \beta = 2$, so $\mathrm{E}\,S = 14(2) = 28$. To find $\mathrm{E}(S \wedge 15)$, we need the probabilities of $S$ at 0, 5, and 10:
$$\Pr(S = 0) = \Pr(N = 0) = \frac{1}{1 + \beta} = \frac13,$$
$$\Pr(S = 5) = \Pr(X = 5, N = 1) = \frac{2}{9}(0.2) = \frac{0.4}{9},$$
$$\Pr(S = 10) = \Pr(X = 10, N = 1) + \Pr(X_1 = X_2 = 5, N = 2) = \frac{2}{9}(0.3) + \frac{4}{27}(0.2)(0.2) = 0.0726,$$
$$\Pr(S \ge 15) = 1 - \left(\frac13 + \frac{0.4}{9} + 0.0726\right) = 0.5496.$$
$$\Rightarrow \mathrm{E}(S \wedge 15) = 0\,\Pr(S = 0) + 5\,\Pr(S = 5) + 10\,\Pr(S = 10) + 15\,\Pr(S \ge 15)$$
$$= 0\left(\frac13\right) + 5\left(\frac{0.4}{9}\right) + 10(0.0726) + 15(0.5496) = 9.193.$$
Therefore,
$$\mathrm{E}(S - 15)_+ = \mathrm{E}\,S - \mathrm{E}(S \wedge 15) = 28 - 9.193 = 18.807.$$
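The arithmetic of this solution can be reproduced in R; a minimal sketch:

# Example 5.3.6: N ~ Geometric(beta = 2), X in {5,10,20} with probs 0.2, 0.3, 0.5
beta <- 2
ES  <- (5*0.2 + 10*0.3 + 20*0.5) * beta                        # E S = E X * E N = 28
p0  <- 1/(1 + beta)                                            # Pr(S = 0)
p5  <- beta/(1 + beta)^2 * 0.2                                 # Pr(S = 5)
p10 <- beta/(1 + beta)^2 * 0.3 + beta^2/(1 + beta)^3 * 0.2^2   # Pr(S = 10)
ES15 <- 5*p5 + 10*p10 + 15*(1 - p0 - p5 - p10)                 # E(min(S, 15))
ES - ES15                                                      # E(S - 15)_+ = 18.807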
Recursive Net Stop-Loss Premium Calculation. For the discrete case, the net stop-loss premium can be computed recursively as
$$\mathrm{E}\left(S - (j+1)\right)_+ = \mathrm{E}(S - j)_+ - \Pr(S \ge j + 1), \qquad j = 0, 1, 2, \ldots$$
To see this, note that
$$\mathrm{E}\,S \wedge (j+1) = \sum_{x=0}^{j} x\,f_S(x) + (j+1)\Pr(S \ge j+1).$$
Similarly,
$$\mathrm{E}\,S \wedge j = \sum_{x=0}^{j} x\,f_S(x) + j\Pr(S \ge j+1).$$
Subtracting and using $\mathrm{E}(S - d)_+ = \mathrm{E}(S) - \mathrm{E}(S \wedge d)$ gives
$$\mathrm{E}\left(S - (j+1)\right)_+ - \mathrm{E}(S - j)_+ = -\left[\mathrm{E}\,S \wedge (j+1) - \mathrm{E}\,S \wedge j\right] = -\Pr(S \ge j+1),$$
as required.
Exercise 5.3.7. Exam M, Fall 2005, 19 - Continued. Recall that the goal of this question was to
calculate E (S − 15)+ . Note that the support of S is equally spaced over units of 5, so this question can also
be done recursively, using steps of h = 5:
• Step 1: $\mathrm{E}(S - 5)_+ = \mathrm{E}\,S - 5\left[1 - \Pr(S = 0)\right] = 28 - 5\left(1 - \frac13\right) = 24.667$
• Step 2: $\mathrm{E}(S - 10)_+ = \mathrm{E}(S - 5)_+ - 5\left[1 - \Pr(S \le 5)\right] = 24.667 - 5\left(1 - \frac13 - \frac{0.4}{9}\right) = 21.555$
• Step 3:
$$\mathrm{E}(S - 15)_+ = \mathrm{E}(S - 10)_+ - 5\left[1 - \Pr(S \le 10)\right] = \mathrm{E}(S - 10)_+ - 5\Pr(S \ge 15) = 21.555 - 5(0.5496) = 18.807$$
There are few combinations of claim frequency and severity distributions that result in an easy-to-compute
distribution for aggregate losses. This section gives some simple examples. Analysts view these examples as
too simple to be used in practice.
Example 5.3.8. One has a closed-form expression for the aggregate loss distribution by assuming a geometric
frequency distribution and an exponential severity distribution.
Assume that claim count N is geometric with parameter β such that E(N ) = β, and that claim amount X is
exponential with parameter θ such that E(X) = θ. Recall that the pgf of N and the mgf of X are:
$$P_N(z) = \frac{1}{1 - \beta(z - 1)}, \qquad M_X(t) = \frac{1}{1 - \theta t}.$$
Thus, the mgf of the aggregate loss $S$ is
$$M_S(t) = P_N\left[M_X(t)\right] = \left[1 - \beta\left(\frac{1}{1 - \theta t} - 1\right)\right]^{-1}$$
$$= 1 + \frac{\beta}{1+\beta}\left\{\left[1 - \theta(1+\beta)t\right]^{-1} - 1\right\} \qquad (5.1)$$
$$= \frac{1}{1+\beta}\,(1) + \frac{\beta}{1+\beta}\,\frac{1}{1 - \theta(1+\beta)t}. \qquad (5.2)$$
From (5.2), we note that $S$ is equivalent to a 2-point mixture of 0 and $X^*$, where $X^*$ has an exponential distribution with mean $\theta(1+\beta)$. Specifically,
$$\Pr(S = 0) = \frac{1}{1+\beta},$$
$$\Pr(S > s) = \frac{\beta}{1+\beta}\Pr(X^* > s) = \frac{\beta}{1+\beta}\exp\left(-\frac{s}{\theta(1+\beta)}\right),$$
with pdf, for $s > 0$,
$$f_S(s) = \frac{\beta}{\theta(1+\beta)^2}\exp\left(-\frac{s}{\theta(1+\beta)}\right).$$
Example 5.3.9. Consider a collective risk model with an exponential severity and an arbitrary frequency distribution. Recall that if $X_i \sim$ Exponential$(\theta)$, then the sum of iid exponentials, $S_n = X_1 + \cdots + X_n$, has a gamma distribution, i.e. $S_n \sim$ Gamma$(n, \theta)$. This has cdf
$$F_X^{*n}(s) = \Pr(S_n \le s) = \int_0^s \frac{1}{\Gamma(n)\theta^n}\, t^{n-1} e^{-t/\theta}\, dt = 1 - \sum_{j=0}^{n-1}\frac{1}{j!}\left(\frac{s}{\theta}\right)^j e^{-s/\theta}.$$
For the aggregate loss distribution, we can interchange the order of summation to get
$$F_S(s) = p_0 + \sum_{n=1}^{\infty} p_n\, F_X^{*n}(s) = 1 - \sum_{n=1}^{\infty} p_n \sum_{j=0}^{n-1}\frac{1}{j!}\left(\frac{s}{\theta}\right)^j e^{-s/\theta} = 1 - e^{-s/\theta}\sum_{j=0}^{\infty}\bar{P}_j\,\frac{1}{j!}\left(\frac{s}{\theta}\right)^j,$$
where $\bar{P}_j = p_{j+1} + p_{j+2} + \cdots = \Pr(N > j)$, the "survival function" of the claims count distribution.
In this section, we examine a particular compound distribution where the number of claims is a Poisson
distribution and the amount of claims is a Gamma distribution. This specification leads to what is known as
a Tweedie distribution. The Tweedie distribution has a probability mass at zero and a continuous component for positive values. Because of this feature, it is widely used in insurance claims modeling, where the zero mass is interpreted as no claims and the positive component as the amount of claims.
Specifically, consider the collective risk model $S = X_1 + \cdots + X_N$. Suppose that $N$ has a Poisson distribution with mean $\lambda$, and each $X_i$ has a gamma distribution with shape parameter $\alpha$ and scale parameter $\gamma$. The Tweedie distribution is derived as the Poisson sum of gamma variables. To understand the distribution of $S$, we first examine the probability mass at zero. It is straightforward to see that the aggregate loss is zero when no claims occur, thus
$$f_S(0) = \Pr(S = 0) = \Pr(N = 0) = e^{-\lambda}.$$
In addition, note that $S$ conditional on $N = n$, denoted by $S_n = X_1 + \cdots + X_n$, follows a gamma distribution with shape $n\alpha$ and scale $\gamma$. Thus, for $s > 0$, the density of a Tweedie distribution can be calculated as
$$f_S(s) = \sum_{n=1}^{\infty} p_n\, f_{S_n}(s) = \sum_{n=1}^{\infty} e^{-\lambda}\,\frac{\lambda^n}{n!}\,\frac{s^{n\alpha - 1} e^{-s/\gamma}}{\Gamma(n\alpha)\,\gamma^{n\alpha}}.$$
Thus, the Tweedie distribution can be thought of as a mixture of zero and a positive-valued distribution, which makes it a convenient tool for modeling insurance claims and for calculating pure premiums. The mean and variance of the Tweedie compound Poisson model are
$$\mathrm{E}(S) = \lambda\alpha\gamma \quad \text{and} \quad \mathrm{Var}(S) = \lambda\alpha(1+\alpha)\gamma^2.$$
As another important feature, the Tweedie distribution is a special case of exponential dispersion models, a
class of models used to describe the random component in generalized linear models. To see this, we consider
the following reparameterizations:
$$\lambda = \frac{\mu^{2-p}}{\phi(2-p)}, \qquad \alpha = \frac{2-p}{p-1}, \qquad \gamma = \phi(p-1)\mu^{p-1}.$$
With the above relationships, one can show that the distribution of S is
$$f_S(s) = \exp\left\{\frac{1}{\phi}\left(-\frac{s}{(p-1)\mu^{p-1}} - \frac{\mu^{2-p}}{2-p}\right) + C(s;\phi)\right\},$$
where
$$C(s;\phi) = \begin{cases} 0 & \text{if } s = 0, \\[6pt] \ln\displaystyle\sum_{n \ge 1}\left[\frac{(1/\phi)^{1/(p-1)}\, s^{(2-p)/(p-1)}}{(2-p)\,(p-1)^{(2-p)/(p-1)}}\right]^n \frac{1}{n!\,\Gamma\!\left(n\frac{2-p}{p-1}\right) s} & \text{if } s > 0. \end{cases}$$
Hence, the distribution of $S$ belongs to the exponential family with parameters $\mu$, $\phi$, and $p \in (1,2)$, and we have
$$\mathrm{E}(S) = \mu \quad \text{and} \quad \mathrm{Var}(S) = \phi\mu^p.$$
It is also worth mentioning the two limiting cases of the Tweedie model: p → 1 results in the Poisson distribution
and p → 2 results in the gamma distribution. The Tweedie compound Poisson model accommodates the
situations in between.
The recursive method applies to compound models where the frequency component N belongs to either
(a, b, 0) or (a, b, 1) class and the severity component X has a discrete distribution. For continuous X, a
common practice is to first discretize the severity distribution and then apply the recursive method.
Assume that $N$ is in the $(a,b,1)$ class so that $p_k = \left(a + \frac{b}{k}\right)p_{k-1}$, $k = 2, 3, \ldots$. Further assume that the support of $X$ is $\{0, 1, \ldots, m\}$, discrete and finite. Then the probability function of $S$ is
$$f_S(s) = \Pr(S = s) = \frac{1}{1 - a f_X(0)}\left\{\left[p_1 - (a+b)p_0\right] f_X(s) + \sum_{x=1}^{s \wedge m}\left(a + \frac{bx}{s}\right) f_X(x)\, f_S(s - x)\right\}.$$
Example 5.4.1. SOA Exam Question. The number of claims in a period N has a geometric distribution
with mean 4. The amount of each claim X follows Pr(X = x) = 0.25, for x = 1, 2, 3, 4. The number of claims
and the claim amount are independent. S is the aggregate claim amount in the period. Calculate FS (3).
Show Example Solution
Solution. The severity distribution is
$$f_X(x) = \frac14, \qquad x = 1, 2, 3, 4.$$
The frequency distribution $N$ is geometric with mean 4, which is a member of the $(a,b,0)$ class with $b = 0$, $a = \frac{\beta}{1+\beta} = \frac45$, and $p_0 = \frac{1}{1+\beta} = \frac15$. Thus, we can use the recursive method
$$f_S(x) = \frac{1}{1 - a f_X(0)}\sum_{y=1}^{x \wedge m}(a + 0)\, f_X(y)\, f_S(x - y) = \frac45\sum_{y=1}^{x \wedge m} f_X(y)\, f_S(x - y),$$
since $f_X(0) = 0$.
Specifically, we have
$$f_S(0) = \Pr(N = 0) = p_0 = \frac15,$$
$$f_S(1) = \frac45\sum_{y=1}^{1} f_X(y) f_S(1-y) = \frac45 f_X(1) f_S(0) = \frac45\cdot\frac14\cdot\frac15 = \frac{1}{25},$$
$$f_S(2) = \frac45\sum_{y=1}^{2} f_X(y) f_S(2-y) = \frac45\left[f_X(1)f_S(1) + f_X(2)f_S(0)\right] = \frac45\cdot\frac14\left(\frac{1}{25} + \frac15\right) = \frac{6}{125},$$
$$f_S(3) = \frac45\left[f_X(1)f_S(2) + f_X(2)f_S(1) + f_X(3)f_S(0)\right] = \frac45\cdot\frac14\left(\frac{6}{125} + \frac{1}{25} + \frac15\right) = \frac{36}{625} = 0.0576,$$
$$\Rightarrow F_S(3) = f_S(0) + f_S(1) + f_S(2) + f_S(3) = 0.3456.$$
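The recursion is straightforward to code; a minimal R sketch for this example (geometric frequency in the $(a,b,0)$ class, so the $(a,b,1)$ formula reduces to the sum term):

# recursive (Panjer-type) calculation for Example 5.4.1
fx <- c(0, rep(1/4, 4))              # Pr(X = 0), Pr(X = 1), ..., Pr(X = 4)
beta <- 4
a <- beta/(1 + beta); b <- 0
p0 <- 1/(1 + beta)
smax <- 3
fs <- numeric(smax + 1)
fs[1] <- p0                          # fS(0)
for (s in 1:smax) {
  x <- 1:min(s, 4)
  fs[s + 1] <- sum((a + b * x / s) * fx[x + 1] * fs[s - x + 1]) / (1 - a * fx[1])
}
fs          # fS(0), ..., fS(3)
sum(fs)     # FS(3) = 0.3456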
5.4.2 Simulation
The distribution of the aggregate loss can be evaluated using Monte Carlo simulation. The idea is that one can calculate the empirical distribution of $S$ using a random sample. Below we summarize the simulation procedure for the aggregate loss models: for each replication $j = 1, \ldots, m$, generate the number of claims $n_j$ from the frequency model, generate $n_j$ claim amounts from the severity model, and set $S_j$ equal to their sum. The empirical distribution based on the simulated sample $\{S_1, \ldots, S_m\}$ is
$$\hat F_S(s) = \frac{1}{m}\sum_{j=1}^{m} I(S_j \le s),$$
where $I(\cdot)$ is an indicator function. The empirical distribution $\hat F_S(s)$ will converge to $F_S(s)$ almost surely as $m \to \infty$.
The above procedure assumes that the parameters of the frequency and severity distributions are known. In
practice, one would need to estimate these parameters from the data. For instance, the assumptions in the
collective risk model suggest a two-stage estimation where a model is developed for the number of claims N
from the data on claim counts and a model is developed for the severity of claims X from the data on the
amount of claims.
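As an illustration of the procedure, here is a minimal R sketch for a collective risk model with Poisson frequency and gamma severity; all parameter values are assumptions chosen only for the illustration.

# Monte Carlo approximation of the aggregate loss distribution
set.seed(2021)
m <- 10000                          # number of simulated aggregate losses
lambda <- 5                         # assumed frequency parameter
shape <- 2; scale <- 1000           # assumed severity parameters
S <- replicate(m, sum(rgamma(rpois(1, lambda), shape = shape, scale = scale)))
Fhat <- ecdf(S)                     # empirical distribution function
Fhat(20000)                         # estimate of Pr(S <= 20000)
quantile(S, 0.95)                   # estimated 95th percentile of S
mean(pmax(S - 15000, 0))            # estimated stop-loss premium E(S - 15000)_+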
This section focuses on an individual risk model for claim counts. Consider the number of claims from a
group of n policies:
S = X1 + · · · + Xn
where we assume the $X_i$ are iid, representing the number of claims from policy $i$. In this case, the exposure for the portfolio is $n$, using the policy as the exposure base. The pgf of $S$ is
$$P_S(z) = \mathrm{E}(z^S) = \mathrm{E}\left(z^{\sum_{i=1}^{n} X_i}\right) = \prod_{i=1}^{n}\mathrm{E}\left(z^{X_i}\right) = \left[P_X(z)\right]^n.$$
Special Case Poisson. If $X_i \sim Poisson(\lambda)$, its pgf is $P_X(z) = e^{\lambda(z-1)}$. Then the pgf of $S$ is
$$P_S(z) = \left[e^{\lambda(z-1)}\right]^n = e^{n\lambda(z-1)}.$$
So $S \sim Poisson(n\lambda)$.
Special Case Negative binomial. If Xi ∼ N egBin(β, r), its pgf is PX (z) = [1 − β(z − 1)]−r . Then the pgf
of S is
PS (z) = [[1 − β(z − 1)]−r ]n = [1 − β(z − 1)]−nr .
So S ∼ N B(β, nr).
Example 5.5.1. Assume that the number of claims for each vehicle is Poisson with mean λ. Given the
following data on the observed number of claims for each household, calculate the MLE of λ.
If the exposure of the portfolio changes from $n_1$ to $n_2$, we can establish the following relation between the aggregate claim counts:
$$P_{S_2}(z) = \left[P_X(z)\right]^{n_2} = \left[\left[P_X(z)\right]^{n_1}\right]^{n_2/n_1} = \left[P_{S_1}(z)\right]^{n_2/n_1}.$$
This section examines the effect of a deductible on claim frequency. Intuitively, there will be fewer claims filed when a policy deductible is imposed because a loss below the deductible might not result in a claim. Even if an insured does file a claim, this may not result in a payment by the policy, since the claim may be denied or the loss amount may ultimately be determined to be below the deductible. Let $N^L$ denote the number of losses (i.e. the number of claims with no deductible), and $N^P$ denote the number of payments when a deductible $d$ is imposed. Our goal is to identify the distribution of $N^P$ given the distribution of $N^L$. We show below that the relationship between $N^L$ and $N^P$ can be established within an aggregate risk model framework.
Note that sometimes changes in deductible will affect policyholder behavior. We assume that this is not the
case, i.e. the distribution of losses for both frequency and severity remain unchanged when the deductible
changes.
Given that there are $N^L$ losses, let $X_1, X_2, \ldots, X_{N^L}$ be the associated loss amounts. For $j = 1, \ldots, N^L$, define
$$I_j = \begin{cases} 1 & \text{if } X_j > d, \\ 0 & \text{otherwise.} \end{cases}$$
Then we establish
$$N^P = I_1 + I_2 + \cdots + I_{N^L}.$$
Note that, conditional on $N^L$, the distribution of $N^P$ is $Binomial(N^L, v)$, where $v = \Pr(X > d)$. Thus, given $N^L$,
$$\mathrm{E}\left(z^{N^P} \mid N^L\right) = \left[1 + v(z-1)\right]^{N^L}.$$
So the pgf of $N^P$ is
$$P_{N^P}(z) = \mathrm{E}_{N^P}\left(z^{N^P}\right) = \mathrm{E}_{N^L}\left[\mathrm{E}\left(z^{N^P} \mid N^L\right)\right] = \mathrm{E}_{N^L}\left[\left(1 + v(z-1)\right)^{N^L}\right] = P_{N^L}\left(1 + v(z-1)\right).$$
Thus, we can write the pgf of N P as the pgf of N L , evaluated at a new argument z ∗ = 1 + v(z − 1), that is,
PN P (z) = PN L (z ∗ ).
Special Cases:
• $N^L \sim Poisson(\lambda)$. The pgf of $N^L$ is $P_{N^L}(z) = \exp(\lambda(z-1))$. Thus the pgf of $N^P$ is
$$P_{N^P}(z) = \exp\left(\lambda\left[1 + v(z-1) - 1\right]\right) = \exp\left(\lambda v(z-1)\right),$$
so $N^P \sim Poisson(\lambda v)$. The payment number has the same distribution as the loss number, but with the expected number of payments equal to $\lambda v = \lambda\Pr(X > d)$.
• $N^L \sim NegBin(\beta, r)$. The pgf of $N^L$ is $P_{N^L}(z) = \left[1 - \beta(z-1)\right]^{-r}$. Thus
$$P_{N^P}(z) = \left[1 - \beta\left(1 + v(z-1) - 1\right)\right]^{-r} = \left[1 - \beta v(z-1)\right]^{-r},$$
which is the pgf of a $NegBin(\beta v, r)$ distribution. So the payment number has the same distribution as the loss number, but with parameters $\beta v$ and $r$.
Example 5.5.2. Suppose that loss amounts Xi ∼ P areto(α = 4, θ = 150). You are given that the loss
frequency is N L ∼ P oisson(λ) and the payment frequency distribution N P1 ∼ P oisson(0.4) with d1 = 30.
Find the distribution of N P2 with d2 = 100.
Example 5.5.3. Follow-Up. Now suppose instead that the loss frequency is N L ∼ N egBin(β, r) and for
deductible d1 = 30, the payment frequency N P1 is negative binomial with mean 0.4. Find the mean of the
payment frequency N P2 with deductible d2 = 100.
Show Example Solution
Solution. Because the loss frequency N L is negative binomial, we can relate the parameter β of the N L
distribution and the parameter β1 of the first payment distribution N P1 using β1 = βv1 , where
$$v_1 = \Pr(X > 30) = \left(\frac{150}{150 + 30}\right)^4 = \left(\frac56\right)^4.$$
Thus, the mean of $N^{P_1}$ and the mean of $N^L$ are related by
$$0.4 = r\beta_1 = r(\beta v_1) \quad\Rightarrow\quad r\beta = \frac{0.4}{v_1} = 0.4\left(\frac65\right)^4.$$
Note that $v_2 = \Pr(X > 100) = \left(\frac{150}{150 + 100}\right)^4 = \left(\frac35\right)^4$, as in the original question. Then the second payment frequency distribution is $N^{P_2} \sim NegBin(\beta v_2, r)$ with mean
$$r(\beta v_2) = (r\beta)\,v_2 = 0.4\left(\frac65\right)^4\left(\frac35\right)^4 = 0.1075.$$
Next we examine the more general case where N L is a zero-modified distribution. Recall that a modified
distribution is defined in terms of an unmodified one. That is,
$$p_k^M = c\, p_k^0, \quad \text{for } k = 1, 2, 3, \ldots, \quad \text{with} \quad c = \frac{1 - p_0^M}{1 - p_0^0}.$$
In the case that $p_0^M = 0$, we call this a "truncated" distribution at zero, or ZT. For other arbitrary values of $p_0^M$, this is a zero-modified, or ZM, distribution. The pgf of the modified distribution is
$$P^M(z) = 1 - c + c\, P^0(z).$$
When N L follows a zero-modified distribution, the distribution of N P is established using the same relation
PN P (z) = PN L (1 + v(z − 1)).
Special Cases:
• ZM-Poisson. When $N^L$ follows a zero-modified Poisson distribution with parameters $\lambda$ and $p_0^M$, its pgf is
$$P_{N^L}(z) = 1 - \frac{1 - p_0^M}{1 - \exp(-\lambda)} + \frac{1 - p_0^M}{1 - \exp(-\lambda)}\exp\left[\lambda(z-1)\right],$$
and evaluating this pgf at $1 + v(z-1)$ gives
$$P_{N^P}(z) = 1 - \frac{1 - p_0^M}{1 - \exp(-\lambda)} + \frac{1 - p_0^M}{1 - \exp(-\lambda)}\exp\left[\lambda v(z-1)\right].$$
• ZM-NegBin. When $N^L$ follows a zero-modified negative binomial distribution with parameters $\beta$, $r$, and $p_0^M$, its pgf is
$$P_{N^L}(z) = 1 - \frac{1 - p_0^M}{1 - (1+\beta)^{-r}} + \frac{1 - p_0^M}{1 - (1+\beta)^{-r}}\left[1 - \beta(z-1)\right]^{-r},$$
and hence
$$P_{N^P}(z) = 1 - \frac{1 - p_0^M}{1 - (1+\beta)^{-r}} + \frac{1 - p_0^M}{1 - (1+\beta)^{-r}}\left[1 - \beta v(z-1)\right]^{-r}.$$
So the number of payments is also a ZM-NegBin distribution with parameters $\beta v$, $r$, and $p_0^M$. Similarly, the probability at zero can be evaluated using $\Pr(N^P = 0) = P_{N^P}(0)$.
Example 5.5.4. Aggregate losses are modeled as follows:
(i) The number of losses follows a zero-modified Poisson distribution with $\lambda = 3$ and $p_0^M = 0.5$.
(ii) The amount of each loss has a Burr distribution with $\alpha = 3$, $\theta = 50$, $\gamma = 1$.
(iii) There is a deductible of $d = 30$ on each loss.
(iv) The number of losses and the amounts of the losses are mutually independent.
Calculate $\mathrm{E}\,N^P$ and $\mathrm{Var}\,N^P$.
Show Example Solution
Solution. Since $N^L$ follows a ZM-Poisson distribution with parameters $\lambda$ and $p_0^M$, we know that $N^P$ also follows a ZM-Poisson distribution, but with parameters $\lambda^* = \lambda v$ and $p_0^M$, where
$$v = \Pr(X > 30) = \left(\frac{1}{1 + (30/50)}\right)^3 = 0.2441, \qquad \lambda^* = \lambda v = 0.7324.$$
Then
$$\mathrm{E}\,N^P = (1 - p_0^M)\,\frac{\lambda^*}{1 - e^{-\lambda^*}} = 0.5\left(\frac{0.7324}{1 - e^{-0.7324}}\right) = 0.7053,$$
$$\mathrm{Var}\,N^P = (1 - p_0^M)\,\frac{\lambda^*\left[1 - (\lambda^* + 1)e^{-\lambda^*}\right]}{\left(1 - e^{-\lambda^*}\right)^2} + p_0^M(1 - p_0^M)\left(\frac{\lambda^*}{1 - e^{-\lambda^*}}\right)^2$$
$$= 0.5\,\frac{0.7324\left(1 - 1.7324\,e^{-0.7324}\right)}{\left(1 - e^{-0.7324}\right)^2} + 0.5^2\left(\frac{0.7324}{1 - e^{-0.7324}}\right)^2 = 0.7244.$$
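The arithmetic of this solution can be reproduced in R; a minimal sketch that simply evaluates the formulas displayed above:

# Example 5.5.4: ZM-Poisson with lambda = 3, p0M = 0.5; Burr(alpha = 3, theta = 50, gamma = 1)
lambda <- 3; p0M <- 0.5
v <- (1/(1 + 30/50))^3                  # Pr(X > 30) for the Burr severity
lam_star <- lambda * v                  # 0.7324
EN <- (1 - p0M) * lam_star / (1 - exp(-lam_star))
VarN <- (1 - p0M) * lam_star * (1 - (lam_star + 1) * exp(-lam_star)) / (1 - exp(-lam_star))^2 +
        p0M * (1 - p0M) * (lam_star / (1 - exp(-lam_star)))^2
EN; VarN                                # 0.7053 and 0.7244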
In this section, we examine how changes in deductibles affect the aggregate payments from an insurance portfolio. We assume that policy limits, coinsurance, and inflation have no effect on the frequency of payments
made by an insurer. As in the previous section, we further assume that deductible changes do not impact the
distribution of losses for both frequency and severity.
Recall the notation N L for the number of losses. With ground-up loss X and policy deductible d, we use
N P = I(X1 > d) + · · · + I(XN L > d) for the number of payments. Also, define the amount of payment on a
per-loss basis as
d
0 X<
1+r
d u
XL = α[(1 + r)X − d] ≤X< ,
1+r 1+r
u
α(u − d) X≥
1+r
d
Undefined X<
1+r
d u
XP = α[(1 + r)X − d] ≤X< .
1+r 1+r
u
α(u − d) X≥
1+r
In the above, $r$, $u$, and $\alpha$ represent the inflation rate, the policy limit, and the coinsurance factor, respectively. Hence, aggregate costs (payment amounts) can be expressed either on a per-loss or a per-payment basis:
$$S = X_1^L + \cdots + X_{N^L}^L = X_1^P + \cdots + X_{N^P}^P.$$
The fundamentals regarding collective risk models are ready to apply. For instance, we have:
$$\mathrm{E}(S) = \mathrm{E}\,N^L\,\mathrm{E}\,X^L = \mathrm{E}\,N^P\,\mathrm{E}\,X^P,$$
$$\mathrm{Var}(S) = \mathrm{E}\,N^L\,\mathrm{Var}\,X^L + \left(\mathrm{E}\,X^L\right)^2\mathrm{Var}(N^L) = \mathrm{E}\,N^P\,\mathrm{Var}\,X^P + \left(\mathrm{E}\,X^P\right)^2\mathrm{Var}(N^P).$$
Example 5.5.5. SOA Exam Question. A group dental policy has a negative binomial claim count
distribution with mean 300 and variance 800. Ground-up severity is given by the following table:
Severity Probability
40 0.25
80 0.25
120 0.25
200 0.25
You expect severity to increase 50% with no change in frequency. You decide to impose a per claim deductible
of 100. Calculate the expected total claim payment after these changes.
Show Example Solution
Solution. The cost per loss with a 50% increase in severity and a 100 deductible per claim is
$$Y^L = \begin{cases} 0 & 1.5x < 100, \\ 1.5x - 100 & 1.5x \ge 100. \end{cases}$$
The inflated severities $1.5x$ are 60, 120, 180, and 300, each with probability 0.25, so the payments per loss are 0, 20, 80, and 200, and
$$\mathrm{E}\,Y^L = 0.25(0 + 20 + 80 + 200) = 75, \qquad \mathrm{E}\,S = \mathrm{E}\,N^L\,\mathrm{E}\,Y^L = 300(75) = 22{,}500.$$
Alternative Method: Using the Per-Payment Basis. Previously, we calculated the expected total claim payment by multiplying the expected number of losses by the expected payment per loss. Recall that we can also multiply the expected number of payments by the expected payment per payment. In this case, we have
$$S = Y_1^P + \cdots + Y_{N^P}^P, \qquad Y^P = \begin{cases} \text{undefined} & 1.5x < 100, \\ 1.5x - 100 & 1.5x \ge 100. \end{cases}$$
With $v = \Pr(1.5X \ge 100) = 0.75$, we have $\mathrm{E}\,N^P = 300(0.75) = 225$ and $\mathrm{E}\,Y^P = \mathrm{E}\,Y^L/v = 75/0.75 = 100$, so that $\mathrm{E}\,S = 225(100) = 22{,}500$, as before.
Example 5.5.7. SOA Exam Question. A company insures a fleet of vehicles. Aggregate losses have a
compound Poisson distribution. The expected number of losses is 20. Loss amounts, regardless of vehicle
type, have exponential distribution with θ = 200. To reduce the cost of the insurance, two modifications are
to be made:
(i) A certain type of vehicle will not be insured. It is estimated that this will reduce loss frequency by 20%.
(ii) A deductible of 100 per loss will be imposed.
Calculate the expected aggregate amount paid by the insurer after the modifications.
Show Example Solution
Solution. On a per-loss basis, we have a 100 deductible. Thus, the expectation per loss is
$$\mathrm{E}\,Y^L = \mathrm{E}(X - 100)_+ = \theta\,e^{-100/\theta} = 200\,e^{-100/200} = 121.31.$$
Loss frequency has been reduced by 20%, resulting in an expected number of losses
$$\mathrm{E}\,N^L = 0.8(20) = 16, \qquad \mathrm{E}\,S = \mathrm{E}\,Y^L\,\mathrm{E}\,N^L = 121.31(16) = 1{,}941.$$
Alternative Method: Using the Per-Payment Basis. We can also use the per-payment basis to find the expected aggregate amount paid after the modifications. For the per-payment severity,
$$\mathrm{E}\,Y^P = \mathrm{E}(X - 100 \mid X > 100) = 200.$$
This is not surprising: recall that the exponential distribution is memoryless, so the expected claim amount paid in excess of 100 is still exponential with mean 200. Now we look at the payment frequency. With the deductible of 100, the probability that a payment occurs is $\Pr(X > 100) = e^{-100/200}$. Thus,
$$\mathrm{E}\,N^P = 16\,e^{-100/200} = 9.7.$$
Putting this together, we produce the same answer using the per-payment basis as with the per-loss basis from earlier:
$$\mathrm{E}\,S = \mathrm{E}\,Y^P\,\mathrm{E}\,N^P = 200(9.7) = 1{,}941.$$
Here is a set of exercises that guide the reader through some of the theoretical foundations of Loss Data
Analytics. Each tutorial is based on one or more questions from the professional actuarial examinations,
typically the Society of Actuaries Exam C.
Aggregate Loss Guided Tutorials
Contributors
• Peng Shi and Lisa Gao, University of Wisconsin-Madison, are the principal authors of the initial
version of this chapter. Email: [email protected] for chapter comments and suggested improvements.
Chapter 6
Simulation
Starting from a seed value $B_0$ and constants $a$, $c$, and $m$, the sequence is generated by $B_{n+1} = (aB_n + c) \bmod m$. This algorithm is called a linear congruential generator. The case of $c = 0$ is called a multiplicative congruential generator; it is particularly useful for really fast computations.
For illustrative values of $a$ and $m$, Microsoft's Visual Basic uses $m = 2^{24}$, $a = 1{,}140{,}671{,}485$, and $c = 12{,}820{,}163$ (see http://support.microsoft.com/kb/231847). This is the engine underlying the random number generation in Microsoft's Excel program.
The sequence used by the analyst is defined as Un = Bn /m. The analyst may interpret the sequence {Ui } to
be (approximately) identically and independently uniformly distributed on the interval (0,1). To illustrate
the algorithm, consider the following.
Take $a = 3$, $c = 2$, $m = 15$, and seed $B_0 = 1$:

step n    B_n                               U_n
0         B_0 = 1
1         B_1 = (3 x 1 + 2) mod 15 = 5      U_1 = 5/15
2         B_2 = (3 x 5 + 2) mod 15 = 2      U_2 = 2/15
3         B_3 = (3 x 2 + 2) mod 15 = 8      U_3 = 8/15
4         B_4 = (3 x 8 + 2) mod 15 = 11     U_4 = 11/15
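The small example above is easy to reproduce in R; a minimal sketch of a linear congruential generator:

# linear congruential generator: B_{n+1} = (a * B_n + c) mod m, U_n = B_n / m
lcg <- function(n, a, c, m, seed) {
  B <- numeric(n)
  B[1] <- (a * seed + c) %% m
  if (n > 1) for (i in 2:n) B[i] <- (a * B[i - 1] + c) %% m
  B / m
}
lcg(4, a = 3, c = 2, m = 15, seed = 1)   # 5/15, 2/15, 8/15, 11/15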
Sometimes computer generated random results are known as pseudo-random numbers to reflect the fact that
they are machine generated and can be replicated. That is, despite the fact that {Ui } appears to be i.i.d, it
can be reproduced by using the same seed number (and the same algorithm). The ability to replicate results
can be a tremendous tool as you use simulation while trying to uncover patterns in a business process.
The linear congruential generator is just one method of producing pseudo-random outcomes. It is easy to
understand and is (still) widely used. The linear congruential generator does have limitations, including the
fact that it is possible to detect long-run patterns over time in the sequences generated (recall that we can
interpret “independence” to mean a total lack of functional patterns). Not surprisingly, advanced techniques
have been developed that address some of this method’s drawbacks.
Given the uniform sequence $\{U_i\}$, we can generate outcomes from a distribution function $F$ of interest via the inverse transform
$$X_i = F^{-1}(U_i).$$
The result is that the sequence $\{X_i\}$ is approximately iid with distribution function $F$.
To interpret the result, recall that a distribution function $F$ is monotonically increasing, and so the inverse function $F^{-1}$ is well defined. The inverse distribution function (also known as the quantile function) is defined as
$$F^{-1}(y) = \inf_x\,\{x : F(x) \ge y\}.$$
Exponential Distribution Example. Suppose that we would like to generate observations from an exponential distribution with mean $\theta$, so that $F(x) = 1 - e^{-x/\theta}$. Then
$$y = F(x) \iff y = 1 - e^{-x/\theta} \iff -\theta\ln(1-y) = x = F^{-1}(y).$$
Thus, if $U$ has a uniform (0,1) distribution, then $X = -\theta\ln(1-U)$ has an exponential distribution with parameter $\theta$.
Some Numbers. Take θ = 10 and generate three random numbers to get
Pareto Distribution Example. Suppose that we would like to generate observations from a Pareto distribution with parameters $\alpha$ and $\theta$ so that $F(x) = 1 - \left(\frac{\theta}{x+\theta}\right)^{\alpha}$. To compute the inverse transform, we can use the following steps:
$$y = F(x) \iff 1 - y = \left(\frac{\theta}{x+\theta}\right)^{\alpha} \iff (1-y)^{-1/\alpha} = \frac{x+\theta}{\theta} = \frac{x}{\theta} + 1 \iff \theta\left[(1-y)^{-1/\alpha} - 1\right] = x = F^{-1}(y).$$
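In R, the two inverse transforms can be applied directly to uniform draws; a minimal sketch, where $\theta = 10$ for the exponential, and $\alpha = 3$, $\theta = 1000$ for the Pareto, are illustrative assumptions:

set.seed(2021)
U <- runif(3)
# exponential with mean theta
theta_exp <- 10
X_exp <- -theta_exp * log(1 - U)
X_exp
# Pareto with parameters alpha and theta
alpha <- 3; theta_par <- 1000
X_par <- theta_par * ((1 - U)^(-1/alpha) - 1)
X_par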
Inverse Transform Justification. Why does the random variable X = F −1 (U ) have a distribution
function “F ”?
This is easy to establish in the continuous case. Because U is a Uniform random variable on (0,1), we know
that Pr(U ≤ y) = y, for 0 ≤ y ≤ 1. Thus,
$$\Pr(X \le x) = \Pr\left(F^{-1}(U) \le x\right) = \Pr\left(F(F^{-1}(U)) \le F(x)\right) = \Pr(U \le F(x)) = F(x),$$
as required. The key step is that $F(F^{-1}(u)) = u$ for each $u$, which is clearly true when $F$ is strictly increasing.
Bernoulli Distribution Example. Suppose that we wish to simulate random variables from a Bernoulli
distribution with parameter p = 0.85. A graph of the cumulative distribution function shows that the quantile
function can be written as
$$F^{-1}(y) = \begin{cases} 0 & 0 < y \le 0.85, \\ 1 & 0.85 < y \le 1.0. \end{cases}$$
Discrete Distribution Example. Consider the time of a machine failure in the first five years. The
distribution of failure times is given as:
Time (x) 1 2 3 4 5
probability 0.1 0.2 0.1 0.4 0.2
F (x) 0.1 0.3 0.4 0.8 1.0
Using the graph of the distribution function, with the inverse transform we may define
$$X = \begin{cases} 1 & 0 < U \le 0.1, \\ 2 & 0.1 < U \le 0.3, \\ 3 & 0.3 < U \le 0.4, \\ 4 & 0.4 < U \le 0.8, \\ 5 & 0.8 < U \le 1.0. \end{cases}$$
For general discrete random variables there may not be an ordering of outcomes. For example, a person could
own one of five types of life insurance products and we might use the following algorithm to generate random
outcomes:
$$X = \begin{cases} \text{whole life} & 0 < U \le 0.1, \\ \text{endowment} & 0.1 < U \le 0.3, \\ \text{term life} & 0.3 < U \le 0.4, \\ \text{universal life} & 0.4 < U \le 0.8, \\ \text{variable life} & 0.8 < U \le 1.0. \end{cases}$$
Both algorithms produce (in the long run) the same probabilities, e.g., Pr(whole life) = 0.1, and so forth. So, neither is incorrect. You should be aware that there is "more than one way to skin a cat." (What an old expression!) Similarly, you could use an alternative algorithm for ordered outcomes (such as the failure times 1, 2, 3, 4, and 5 presented earlier).
From the graph, we can see that the inverse transform for generating random variables with this distribution
function is
$$X = F^{-1}(U) = \begin{cases} 0 & 0 < U \le 0.7, \\ -1000\,\ln\left(\dfrac{1-U}{0.3}\right) & 0.7 < U < 1. \end{cases}$$
As you have seen, for the discrete and mixed random variables, the key is to draw a graph of the distribution
function that allows you to visualize potential values of the inverse function.
So, $\bar h_R$ is your best estimate of $\mathrm{E}\,h(X)$ and $s^2_{h,R}$ provides an indication of the uncertainty of your estimate. As one criterion for your confidence in the result, suppose that you wish to be within 1% of the mean with 95% certainty. According to the central limit theorem, your estimate should be approximately normally distributed. Thus, you should continue your simulation until
$$\frac{0.01\,\bar h_R}{s_{h,R}/\sqrt{R}} \ge 1.96,$$
or equivalently
$$R \ge 38{,}416\,\frac{s^2_{h,R}}{\bar h_R^2}.$$
This criterion is a direct application of the approximate normality (recall that 1.96 is the 97.5th percentile of
the standard normal curve). Note that hR and sh,R are not known in advance, so you will have to come up
with estimates as you go (sequentially), either by doing a little pilot study in advance or by interrupting your
procedure intermittently to see if the criterion is satisfied.
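A minimal R sketch of this sequential stopping rule, using $h(X) = X$ for an exponential target with mean 1000 (an illustrative assumption):

# keep simulating until the half-width 1.96 * s/sqrt(R) is within 1% of the mean
set.seed(2021)
h <- numeric(0)
repeat {
  h <- c(h, rexp(1000, rate = 1/1000))   # add a batch of simulated h(X) values
  R <- length(h)
  if (0.01 * mean(h) / (sd(h) / sqrt(R)) >= 1.96) break
}
R                                  # number of simulations needed
R >= 38416 * var(h) / mean(h)^2    # equivalent form of the criterion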
Chapter 7
Premium Calculation Fundamentals
Chapter 8
Risk Classification
Chapter Preview. This chapter motivates the use of risk classification in insurance pricing and introduces
readers to the Poisson regression as a prominent example of risk classification. In Section 8.1 we explain
why insurers need to incorporate various risk characteristics, or rating factors, of individual policyholders in
pricing insurance contracts. We then introduce Section 8.2 the Poisson regression as a pricing tool to achieve
such premium differentials. The concept of exposure is also introduced in this section. As most rating factors
are categorical, we show in Section 8.3 how the multiplicative tariff model can be incorporated in the Poisson
regression model in practice, along with numerical examples for illustration.
8.1 Introduction
Through insurance contracts, the policyholders effectively transfer their risks to the insurer in exchange for
premiums. For the insurer to stay in business, the premium income collected from a pool of policyholders
must at least equal to the benefit outgo. Ignoring the frictional expenses associated with the administrative
cost and the profit margin, the net premium charged by the insurer thus should be equal to the expected loss
occurring from the risk that is transferred from the policyholder.
If all policyholders in the insurance pool have identical risk profiles, the insurer simply charges the same
premium for all policyholders because they have the same expected loss. In reality, however, the policyholders
are hardly homogeneous. For example, mortality risk in life insurance depends on the characteristics of
the policyholder, such as age, sex, and lifestyle. In auto insurance, those characteristics may include age,
occupation, the type or use of the car, and the area where the driver resides. The knowledge of these
characteristics or variables can enhance the ability of calculating fair premiums for individual policyholders
as they can be used to estimate or predict the expected losses more accurately.
Indeed, if the insurer does not differentiate the risk characteristics of individual policyholders and simply
charges the same premium to all insureds based on the average loss in the portfolio, the insurer would face
adverse selection, a situation where individuals with a higher chance of loss are attracted to the portfolio and
low-risk individuals are repelled. For example, consider a health insurance industry where smoking status is
an important risk factor for mortality and morbidity. Most health insurers in the market require different
premiums depending on smoking status, so smokers pay higher premiums than non-smokers, with other
characteristics being identical. Now suppose that there is an insurer, we will call EquitabAll, that offers the
same premium to all insureds regardless of smoking status, unlike other competitors. The net premium of
EquitabAll is naturally an average mortality loss accounting for both smokers and non-smokers. That is, the net premium is a weighted average of the losses, with the weights being the proportions of smokers and non-smokers, respectively. Thus it is easy to see that a smoker would have a greater incentive to purchase insurance from EquitabAll than from other insurers, as the premium offered by EquitabAll is relatively lower. At the same time, non-smokers would prefer buying insurance elsewhere, where lower premiums, computed from the non-smoker group only, are offered. As a result, there will be more smokers and fewer non-smokers in EquitabAll's portfolio, which leads to larger-than-expected losses and hence a higher premium for insureds in the next period to cover the higher costs. With the raised premium in the next period, non-smokers at EquitabAll will have an even greater incentive to switch insurers. As this cycle continues over time, EquitabAll would gradually retain more smokers and fewer non-smokers in its portfolio with the premium continually raised, eventually leading to a collapse of the business. In the literature this phenomenon
is known as the adverse selection spiral or death spiral. Therefore, incorporating and differentiating important risk characteristics of individuals in the insurance pricing process is a pertinent component of both the determination of fair premiums for individual policyholders and the long-term sustainability of insurers.
In order to incorporate relevant risk characteristics of policyholders in the pricing process, insurers maintain
some classification system that assigns each policyholder to one of the risk classes based on a relatively small
number of risk characteristics that are deemed most relevant. These characteristics used in the classification
system are called the rating factors, which are a priori variables in the sense that they are known before
the contract begins (e.g., sex, health status, vehicle type, etc, are known during the underwriting). All
policyholders sharing identical risk factors thus are assigned to the same risk class, and are considered
homogeneous from the pricing viewpoint; the insurer consequently charges them the same premium.
An important task in any risk classification is to construct a quantitative model that can determine the
expected loss given various rating factors of a policyholder. The standard approach is to adopt a statistical
regression model which produces the expected loss as the output when the relevant risk factors are given
as the inputs. In this chapter we learn the Poisson regression, which can be used when the loss is a count
variable, as a prominent example of an insurance pricing tool.
The Poisson regression model has been successfully used in a wide range of applications and has an advantage
of allowing closed-form expressions for important quantities, which provides an informative intuition and
interpretation. In this section we introduce the Poisson regression as a natural extension of the Poisson
distribution.
• Formally learn how to formulate the Poisson regression model using indicator variables when the
explanatory variables are categorical.
Poisson Distribution
To introduce the Poisson regression, let us consider a hypothetical health insurance portfolio where all
policyholders are of the same age and only one risk factor, smoking status, is relevant. Smoking status thus
is a categorical variable containing two different types: smoker and non-smoker. In the statistical literature
different types in a given categorical variable are commonly called levels. As there are two levels for the
smoking status, we may denote smoker and non-smoker by level 1 and 2, respectively. Here the numbering is
arbitrary and nominal. Suppose now that we are interested in pricing a health insurance where the premium
for each policyholder is determined by the number of outpatient visits to doctor’s office during a year. The
amount of medical cost for each visit is assumed to be the same regardless of the smoking status for simplicity.
Thus if we believe that smoking status is a valid risk factor in this health insurance, it is natural to consider
the data separately for each smoking status. In Table 8.1 we present the data for this portfolio.
Recall that a Poisson distribution with mean $\mu$ has probability mass function
$$\Pr(Y = y) = \frac{\mu^y e^{-\mu}}{y!}, \qquad y = 0, 1, 2, \ldots \qquad (8.1)$$
and E (Y ) = Var (Y ) = µ. Furthermore, the mle of the Poisson distribution is given by the sample mean.
Thus if we denote the Poisson mean parameter for each level by µ(1) (smoker) and µ(2) (non-smoker), we
see from Table 8.1 that µ̂(1) = 0.0926 and µ̂(2) = 0.0746. This simple example shows the basic idea of risk
classification. Depending on the smoking status a policyholder will have a different risk characteristic and it
can be incorporated through varying Poisson parameter in computing the fair premium. In this example the
ratio of expected loss frequencies is $\hat\mu^{(1)}/\hat\mu^{(2)} = 1.2402$, implying that smokers tend to visit the doctor's office 24.02% more frequently than non-smokers.
It is also informative to note that if the insurer charges the same premium to all policyholders regardless
of the smoking status, based on the average characteristic of the portfolio, as was the case for EquitabAll
described in Introduction, the expected frequency (or the premium) µ̂ is 0.0792, obtained from the last
column of Table 8.1. It is easily verified that
$$\hat\mu = \frac{n_1}{n_1 + n_2}\,\hat\mu^{(1)} + \frac{n_2}{n_1 + n_2}\,\hat\mu^{(2)} = 0.0792, \qquad (8.2)$$
where ni is the number of observations in each level. Clearly, this premium is a weighted average of the
premiums for each level with the weight equal to the proportion of the insureds in that level.
A simple Poisson regression
In the example above, we have fitted a Poisson distribution for each level separately, but we can actually
combine them together in a unified fashion so that a single Poisson model can encompass both smoking and
non-smoking statuses. This can be done by relating the Poisson mean parameter with the risk factor. In
other words, we make the Poisson mean, which is the expected loss frequency, respond to the change in the
smoking status. The conventional approach to deal with a categorical variable is to adopt indicator or dummy
variables that take either 1 or 0, so that we turn the switch on for one level and off for others. Therefore we
may propose to use
$$\mu = \beta_0 + \beta_1 x_1 \qquad (8.3)$$
or
$$\log\mu = \beta_0 + \beta_1 x_1, \qquad (8.4)$$
where $x_1$ is an indicator variable defined by
$$x_1 = \begin{cases} 1 & \text{if smoker (level 1)}, \\ 0 & \text{if non-smoker (level 2)}. \end{cases} \qquad (8.5)$$
We generally prefer the log linear relation (8.4) to the linear one in (8.3) to prevent undesirable events of
producing negative µ values, which may happen when there are many different risk factors and levels. The
setup (8.4) and (8.5) then results in different Poisson frequency parameters depending on the level in the risk
factor:
$$\log\mu = \begin{cases} \beta_0 + \beta_1 & \text{if smoker (level 1)}, \\ \beta_0 & \text{if non-smoker (level 2)}, \end{cases}
\quad\text{or equivalently,}\quad
\mu = \begin{cases} e^{\beta_0 + \beta_1} & \text{if smoker (level 1)}, \\ e^{\beta_0} & \text{if non-smoker (level 2)}, \end{cases} \qquad (8.6)$$
achieving what we aim for. This is the simplest form of the Poisson regression. Note that we require a single
indicator variable to model two levels in this case. Alternatively, it is also possible to use two indicator
variables through a different coding scheme. This scheme requires dropping the intercept term so that (8.4)
is modified to
log µ = β1 x1 + β2 x2 , (8.7)
The numerical result of (8.6) is the same as that of (8.9), as all the coefficients are given as numbers in the actual estimation; the former setup is more common in most texts, and we also stick to the former.
With this Poisson regression model we can easily understand how the coefficients β0 and β1 are linked to the
expected loss frequency in each level. According to (8.6), the Poisson mean of the smokers, $\mu^{(1)}$, is given by
$$\mu^{(1)} = e^{\beta_0 + \beta_1} = e^{\beta_0}\,e^{\beta_1} = \mu^{(2)}\,e^{\beta_1},$$
where $\mu^{(2)} = e^{\beta_0}$ is the Poisson mean for the non-smokers. This relation between the smokers and non-smokers
suggests a useful way to compare the risks embedded in different levels of a given risk factor. That is, the
proportional increase in the expected loss frequency of the smokers compared to that of the non-smokers is
simply given by a multiplicative factor eβ1 . Putting another way, if we set the expected loss frequency of the
non-smokers as the base value, the expected loss frequency of the smokers is obtained by applying eβ1 to the
base value.
Dealing with multi-level case
We can readily extend the two-level case to a multi-level one where $l$ different levels are involved for a single rating factor. For this we generally need $l - 1$ indicator variables to formulate
$$\log\mu = \beta_0 + \beta_1 x_1 + \cdots + \beta_{l-1} x_{l-1}, \qquad (8.11)$$
where $x_k$ is an indicator variable that takes 1 if the policy belongs to level $k$ and 0 otherwise, for $k = 1, 2, \ldots, l-1$. By omitting the indicator variable associated with the last level in (8.11) we effectively chose
level l as the base case, but this choice is arbitrary and does not matter numerically. The resulting Poisson
parameter for policies in level k then becomes, from (8.11),
$$\mu = \begin{cases} e^{\beta_0 + \beta_k} & \text{if the policy belongs to level } k \;(k = 1, 2, \ldots, l-1), \\ e^{\beta_0} & \text{if the policy belongs to level } l. \end{cases}$$
Thus if we denote the Poisson parameter for policies in level k by µ(k) , we can relate the Poisson parameter
for different levels through µ(k) = µ(l) eβk , k = 1, 2, . . . , l − 1. This indicates that, just like the two-level case,
the expected loss frequency of the kth level is obtained from the base value multiplied by the relative factor
eβk . This relative interpretation becomes more powerful when there are many risk factors with multi-levels,
and leads us to a better understanding of the underlying risk and more accurate prediction of future losses.
Finally, we note that the varying Poisson mean is completely driven by the coefficient parameters βk ’s, which
are to be estimated from the dataset; the procedure of the parameter estimation will be discussed later in
this chapter.
We now describe the Poisson regression in a formal and more general setting. Let us assume that there
are $n$ independent policyholders with a set of rating factors characterized by a $k$-variate vector (for example, if there are 3 risk factors with 2, 3 and 4 levels, respectively, then $k = (2-1) + (3-1) + (4-1) = 6$). The $i$th policyholder's rating factor is thus denoted by the vector $\mathbf{x}_i = (1, x_{i1}, \ldots, x_{ik})'$, and the policyholder has recorded
the loss count yi ∈ {0, 1, 2, . . .} from the last period of loss observation, for i = 1, . . . , n. In the regression
literature, the values xi1 , . . . , xik are generally known as the explanatory variables, as these are measurements
providing information about the variable of interest yi . In essence, regression analysis is a method to quantify
the relationship between a variable of interest and explanatory variables.
We also assume, for now, that all policyholders have the same one unit period for loss observation, or equal
exposure of 1, to keep things simple; we will discuss more details on the exposure in the following subsection.
As done before, we describe the Poisson regression through its mean function. For this we first denote $\mu_i$ to be the expected loss count of the $i$th policyholder under the Poisson specification (8.1):
$$\mu_i = \mathrm{E}(y_i \mid \mathbf{x}_i). \qquad (8.12)$$
The condition inside the expectation operation in (8.12) indicates that the loss frequency µi is the model
output responding to the given set of risk factors or explanatory variables. In principle the conditional mean
$\mathrm{E}(y_i \mid \mathbf{x}_i)$ in (8.12) can take different forms depending on how we specify the relationship between $\mathbf{x}$ and $y$. The standard choice for the Poisson regression is to adopt the exponential function, as we mentioned previously, so that
$$\mu_i = \mathrm{E}(y_i \mid \mathbf{x}_i) = e^{\mathbf{x}_i'\boldsymbol\beta}, \qquad y_i \sim Pois(\mu_i), \qquad i = 1, \ldots, n. \qquad (8.13)$$
Here $\boldsymbol\beta = (\beta_0, \ldots, \beta_k)'$ is the vector of coefficients, so that $\mathbf{x}_i'\boldsymbol\beta = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}$. The exponential function in (8.13) ensures that $\mu_i > 0$ for any set of rating factors $\mathbf{x}_i$. Often (8.13) is rewritten in a log linear form
$$\log\mu_i = \mathbf{x}_i'\boldsymbol\beta \qquad (8.14)$$
to reveal the relationship when the right side is set as the linear form $\mathbf{x}_i'\boldsymbol\beta$. Again, we see that the mapping works well, as both sides of (8.14), $\log\mu_i$ and $\mathbf{x}_i'\boldsymbol\beta$, can now cover the entire set of real values. This is the formulation
of the Poisson regression, assuming that all policyholders have the same unit period of exposure. When the
exposures differ among the policyholders, however, as is the case in most practical cases, we need to revise
this formulation by adding exposure component as an additional term in (8.14).
Concept of Exposure
In order to determine the size of potential losses in any type of insurance, one must always know the
corresponding exposure. The concept of exposure is an extremely important ingredient in insurance pricing,
though we usually take it for granted. For example, when we say the expected claim frequency of a health
insurance policy is 0.2, it does not mean much without the specification of the exposure such as, in this case,
per month or per year. In fact, all premiums and losses need the exposure precisely specified and must be
quoted accordingly; otherwise all subsequent statistical analyses and predictions will be distorted.
In the previous section we assumed the same unit of exposure across all policyholders, but this is hardly
realistic in practice. In health insurance, for example, two different policyholders with different lengths of
insurance coverage (e.g., 3 months and 12 months, respectively) could have recorded the same number of
claim counts. As the expected number of claim counts would be proportional to the length of coverage, we
should not treat these two policyholders’ loss experiences identically in the modelling process. This motivates
the need for the concept of exposure in the Poisson regression.
The Poisson distribution in (8.1) is parametrised via its mean. To understand the exposure, we alternatively
parametrize the Poisson pmf in terms of the rate parameter λ, based on the definition of the Poisson process:
$$\Pr(Y = y) = \frac{(\lambda t)^y e^{-\lambda t}}{y!}, \qquad y = 0, 1, 2, \ldots \qquad (8.15)$$
with E (Y ) = Var (Y ) = λt. Here λ is known as the rate or intensity per unit period of the Poisson process
and t represents the length of time or exposure, a known constant value. For given λ the Poisson distribution
(8.15) produces a larger expected loss count as the exposure t gets larger. Clearly, (8.15) reduces to (8.1)
when t = 1, which means that the mean and the rate become the same for the unit exposure, the case we
considered in the previous subsection.
In principle the exposure does not need to be measured in units of time and may represent different things depending on the problem at hand. For example,
1. In health insurance, the rate may be the occurrence of a specific disease per 1,000 people and the
exposure is the number of people considered in the unit of 1,000.
2. In auto insurance, the rate may be the number of accidents per year of a driver and the exposure is the
length of the observed period for the driver in the unit of year.
3. For workers compensation, the rate may be the probability of injury in the course of employment per
dollar and the exposure is the payroll amount in dollar.
4. In marketing, the rate may be the number of customers who enter a store per hour and the exposure is
the number of hours observed.
5. In civil engineering, the rate may be the number of major cracks on the paved road per 10 kms and the
exposure is the length of road considered in the unit of 10 kms.
6. In credit risk modelling, the rate may be the number of default events per 1000 firms and the exposure
is the number of firms under consideration in the unit of 1,000.
Actuaries may be able to use different exposure bases for a given insurable loss. For example, in auto insurance,
both the number of kilometres driven and the number of months covered by insurance can be used as exposure
bases. Here the former is more accurate and useful in modelling the losses from car accidents, but more
difficult to measure and manage for insurers. Thus, a good exposure base may not be the theoretically best
one due to various practical constraints. As a rule, an exposure base must be easy to determine, accurately
measurable, legally and socially acceptable, and free from potential manipulation by policyholders.
Incorporating exposure in Poisson regression
As exposures affect the Poisson mean, constructing Poisson regressions requires us to carefully separate the
rate and exposure in the modelling process. Focusing on the insurance context, let us denote the rate of the
loss event of the ith policyholder by λi , the known exposure (the length of coverage) by mi and the expected
loss count under the given exposure by µi . Then the Poisson regression formulation in (8.13) and (8.14)
should be revised in light of (8.15) as
$$\mu_i = \mathrm{E}(y_i \mid \mathbf{x}_i) = m_i\lambda_i = m_i\, e^{\mathbf{x}_i'\boldsymbol\beta}, \qquad y_i \sim Pois(\mu_i), \qquad i = 1, \ldots, n, \qquad (8.16)$$
which gives
$$\log\mu_i = \log m_i + \mathbf{x}_i'\boldsymbol\beta. \qquad (8.17)$$
Adding log mi in (8.17) does not pose a problem in fitting as we can always specify this as an extra explanatory
variable, as it is a known constant, and fix its coefficient to 1. In the literature the log of exposure, log mi , is
commonly called the offset.
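In R, the offset enters a Poisson regression fit directly; here is a minimal sketch on simulated data, where the data-generating values (exposure range and coefficients -2 and 0.5) are assumptions for illustration only.

# Poisson regression with a log-exposure offset: log mu_i = log m_i + beta0 + beta1 * x_i
set.seed(2021)
n <- 5000
x <- rbinom(n, 1, 0.3)                 # e.g., a smoker indicator
m <- runif(n, 0.25, 1)                 # exposures (fraction of a year)
mu <- m * exp(-2 + 0.5 * x)
y <- rpois(n, mu)
fit <- glm(y ~ x, family = poisson(link = "log"), offset = log(m))
coef(fit)                              # approximately -2 and 0.5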
8.2.4 Exercises
• The multiplicative tariff model when the rating factors are categorical.
• How to construct the Poisson regression model based on the multiplicative tariff structure.
In practice most rating factors in insurance are categorical variables, meaning that they take one of the
pre-determined number of possible values. Examples of categorical variables include sex, type of cars, the
driver’s region of residence and occupation. Continuous variables, such as age or auto mileage, can also be
grouped by bands and treated as categorical variables. Thus we can imagine that, with a small number
of rating factors, there will be many policyholders falling into the same risk class, charged with the same
premium. For the remainder of this chapter we assume that all rating factors are categorical variables.
To illustrate how categorical variables are used in the pricing process, we consider a hypothetical auto
insurance with only two rating factors:
• Type of vehicle: Type A (personally owned) and B (owned by corporations). We use index j = 1 and 2
to respectively represent each level of this rating factor.
• Age band of the driver: Young (age < 25), middle (25 ≤ age < 60) and old age (age ≥ 60). We use
index k = 1, 2 and 3, respectively, for this rating factor.
From this classification rule, we may create an organized table or list, such as the one shown in Table 8.2,
collected from all policyholders. Clearly there are 2 × 3 = 6 different risk classes in total. Each row of
the table shows a combination of different risk characteristics of individual policyholders. Our goal is to
compute six different premiums for each of these combinations. Once the premium for each row has been
determined using the given exposure and claim counts, the insurer can replace the last two columns in Table
8.2 with a single column containing the computed premiums. This new table then can serve as a manual to
determine the premium for a new policyholder given the rating factors during the underwriting process. In
non-life insurance, a table (or a set of tables) or list that contains each set of rating factors and the associated
premium is referred to as a tariff. Each unique combination of the rating factors in a tariff is called a tariff
cell; thus, in Table 8.2 the number of tariff cells is six, same as the number of risk classes.
For each tariff cell $(j,k)$, denote the total exposure by $m_{jk}$ and the observed claim count by $y_{jk}$, as recorded in Table 8.2. A natural summary of each cell is the claim frequency per unit exposure,
$$z_{jk} = \frac{y_{jk}}{m_{jk}}, \qquad j = 1, 2; \; k = 1, 2, 3.$$
For example, z12 = 8/208.5 = 0.03837, meaning that a policyholder in tariff cell (1,2) would have 0.03837
accidents on average if insured for a full year. The set of $z_{jk}$ values then corresponds to the rate parameter in the Poisson distribution (8.15), as they are the event occurrence rates per unit exposure. That is, we have $z_{jk} = \hat\lambda_{jk}$ where $\lambda_{jk}$ is the Poisson rate parameter. Producing $z_{jk}$ values, however, does not do much beyond
comparing the average loss frequencies across risk classes. To fully exploit the dataset, we will construct a
pricing model from Table 8.2 using the Poisson regression, for the remaining part of the chapter.
We comment that actual loss records used by insurers typically include much more risk factors, in which
case the number of cells grows exponentially. The tariff would then consist of a set of tables, instead of one,
separated by some of the basic rating factors, such as sex or territory.
In this subsection, we introduce the multiplicative tariff model, a popular pricing structure that can be
naturally used within the Poisson regression framework. The developments here is based on Table 8.2. Recall
that the loss count of a policyholder is described by the Poisson regression model with rate λ and the exposure
m, so that the expected loss count becomes mλ. As m is a known constant, we are essentially concerned
with modelling λ, so that it responds to the change in the rating factors. Among other possible functional
forms, we commonly choose the multiplicative² relation to model the Poisson rate $\lambda_{jk}$ for rating factor $(j,k)$:
$$\lambda_{jk} = f_0 \times f_{1j} \times f_{2k}. \qquad (8.18)$$
Here {f1j , j = 1, 2} are the parameters associated with the two levels in the first rating factor, car type, and
{f2k , k = 1, 2, 3} associated with the three levels in the age band, the second rating factor. For instance, the
Poisson rate for a mid-aged policyholder with a Type B vehicle is given by λ22 = f0 × f12 × f22 . The first
term f0 is some base value to be discussed shortly. Thus these six parameters are understood as numerical
representations of the levels within each rating factor, and are to be estimated from the dataset.
The multiplicative form (8.18) is easy to understand and use, because it clearly shows how the expected loss
count (per unit exposure) changes as each rating factor varies. For example, if f11 = 1 and f12 = 1.2, then
the expected loss count of a policyholder with a vehicle of type B would be 20% larger than type A, when
the other factors are the same. In non-life insurance, the parameters f1j and f2k are known as relativities as
they determine how much expected loss should change relative to the base value f0 . The idea of relativity is
quite convenient in practice, as we can decide the premium for a policyholder by simply multiplying a series
of corresponding relativities to the base value.
Dropping an existing rating factor or adding a new one is also transparent with this multiplicative structure.
In addition, the insurer may easily adjust the overall premium for all policyholders by controlling the base
value f0 without changing individual relativities. However, by adopting the multiplicative form, we implicitly
assume that there is no serious interaction among the risk factors.
When the multiplicative form is used we need to address an identification issue. That is, for any c > 0, we
can write
$$\lambda_{jk} = f_0 \times \frac{f_{1j}}{c} \times c\,f_{2k}. \qquad (8.19)$$
By comparing with (8.18), we see that the identical rate parameter λjk can be obtained for very different
individual relativities. This over-parametrization, meaning that many different sets of parameters arrive at
the identical model, obviously calls for some restriction on f1j and f2k . The standard practice is to make one
² Preferring the multiplicative form to others (e.g., an additive one) was already hinted at in (8.4).
relativity in each rating factor equal to one. This can be made arbitrarily, so we will assume that f11 = 1 and
f21 = 1 for our purpose. This way all other relativities are uniquely determined. The tariff cell (j, k) = (1, 1)
is then called the base tariff cell, where the rate simply becomes λ11 = f0 , corresponding to the base value
according to (8.18). Thus the base value f0 is generally interpreted as the Poisson rate of the base tariff cell.
Again, (8.18) is log-transformed and rewritten as
$$\log\lambda_{jk} = \log f_0 + \log f_{1j} + \log f_{2k}, \qquad (8.20)$$
as it is easier to work with in the estimation process, similar to (8.14). This log linear form makes the log relativities of the base level in each rating factor equal to zero, i.e., $\log f_{11} = \log f_{21} = 0$, and leads to the following alternative, more explicit expression for (8.20):
log λ =
  log f0 + 0        + 0          for a policy in cell (1, 1),
  log f0 + 0        + log f22    for a policy in cell (1, 2),
  log f0 + 0        + log f23    for a policy in cell (1, 3),
  log f0 + log f12  + 0          for a policy in cell (2, 1),
  log f0 + log f12  + log f22    for a policy in cell (2, 2),
  log f0 + log f12  + log f23    for a policy in cell (2, 3).      (8.21)
This clearly shows that the Poisson rate parameter λ varies across different tariff cells, with the same log
linear form used in the Poisson regression framework. In fact the reader may see that (8.21) is an extended
version of the early expression (8.6) with multiple risk factors and that the log relativities now play the role
of βi parameters. Therefore all the relativities can be readily estimated via fitting a Poisson regression with a
suitably chosen set of indicator variables.
For the first rating factor, car type, we use a single indicator variable x1 that equals one for a Type B vehicle and zero otherwise. For the second rating factor, we employ two indicator variables for the age band, that is,
x2 = 1 for age band 2, and 0 otherwise,   (8.23)

and

x3 = 1 for age band 3, and 0 otherwise.   (8.24)
The triple (x1 , x2 , x3 ) then can effectively and uniquely determine each risk class. By observing that the
indicator variables associated with Type A and Age band 1 are omitted, we see that tariff cell (j, k) = (1, 1)
plays the role of the base cell. We emphasize that our choice of the three indicator variables above has been
carefully made so that it is consistent with the choice of the base levels in the multiplicative tariff model in
the previous subsection (i.e., f11 = 1 and f21 = 1).
With the proposed indicator variables we can rewrite the log rate (8.20) as

log λ = log f0 + x1 log f12 + x2 log f22 + x3 log f23,   (8.25)

which is identical to (8.21) when each triple value is applied. For example, we can verify that the base tariff cell (j, k) = (1, 1) corresponds to (x1, x2, x3) = (0, 0, 0), which in turn produces log λ = log f0, or λ = f0, in (8.25) as required.
Poisson regression for the tariff model
Under this specification, let us consider n policyholders in the portfolio, with the ith policyholder's risk characteristic given by a vector of explanatory variables xi = (xi1, xi2, xi3)′, for i = 1, . . . , n. We then recognize (8.25) as

log λi = β0 + β1 xi1 + β2 xi2 + β3 xi3 = xi′ β,   (8.26)

where β0, . . . , β3 can be mapped to the corresponding log relativities in (8.25). This is exactly the same setup as in (8.17) except for the exposure component. Therefore, by incorporating the exposure in each risk class, the Poisson regression model for this multiplicative tariff model finally becomes

log µi = log λi + log mi = log mi + β0 + β1 xi1 + β2 xi2 + β3 xi3 = log mi + xi′ β,   (8.27)

with f11 = 1 and f21 = 1 from the original construction. For the actual dataset, βi, i = 0, 1, 2, 3, is replaced with the mle bi using the method in the technical supplement at the end of this chapter (Section 8.5).
We present two numerical examples of the Poisson regression. In the first example we construct a Poisson
regression model from Table 8.2, which is a dataset of a hypothetical auto insurer. The second example uses
an actual industry dataset with more risk factors. As our purpose is to show how the Poisson regression
model can be used under a given classification rule, we are not concerned with the quality of the Poisson
model fit in this chapter.
Example 8.1: Poisson regression for the illustrative auto insurer
In the last few subsections we considered a dataset of a hypothetical auto insurer with two risk factors, as
given in Table 8.2. We now apply the Poisson regression model to this dataset. As done before, we have set
(j, k) = (1, 1) as the base tariff cell, so that f11 = f21 = 1. The result of the regression gives the coefficient estimates (b0, b1, b2, b3) = (−2.3359, −0.3004, −0.7837, −1.0655), which in turn produce the corresponding relativities f0 = e^{−2.3359} = 0.0967, f12 = e^{−0.3004} = 0.7405, f22 = e^{−0.7837} = 0.4567, and f23 = e^{−1.0655} = 0.3445, from the relation given in (8.28). The R output is as follows.
Coefficients:
(Intercept) VtypeF2 AgebndF2 AgebndF3
-2.3359 -0.3004 -0.7837 -1.0655
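As a minimal sketch (not the book's script), a Poisson regression of this form can be fit with R's glm() using the exposure as an offset; the data frame below is hypothetical and only mimics the layout of Table 8.2, with one row per tariff cell.

# Hypothetical grouped data: one row per tariff cell (values are illustrative)
ratingdat <- data.frame(
  Vtype    = factor(rep(c("F1", "F2"), each = 3)),        # car type levels
  Agebnd   = factor(rep(c("F1", "F2", "F3"), times = 2)), # age band levels
  exposure = c(4000, 6000, 8000, 3000, 5000, 7000),
  count    = c(390, 180, 110, 220, 140, 80)
)
# Poisson regression with log(exposure) as an offset
Pois_reg <- glm(count ~ Vtype + Agebnd + offset(log(exposure)),
                family = poisson(link = "log"), data = ratingdat)
coef(Pois_reg)        # b0, b1, b2, b3 (log of base value and log relativities)
exp(coef(Pois_reg))   # base value f0 and relativities f12, f22, f23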
For the second example, based on an industry dataset with additional rating factors (policyholder sex, vehicle age, vehicle type, and policyholder age band), the log rate takes the form

log µi = xi′ β + log mi = β0 + β1 I(Sexi = M) + Σ_{t=2}^{6} βt I(Vagei = t + 1) + Σ_{t=7}^{13} βt I(Vtypei = A) × I(Agei = t − 7) + log mi.
³ Corresponding to VAgecat1.
The fitting result is given in Table 8.3, for which we have several comments.
• The claim frequency is 17.3% higher for males, when other rating factors are held fixed. However, this may have been affected by the fact that all records with an unspecified sex were assigned to male.
• Regarding the vehicle age, the claim frequency gradually decreases as the vehicle gets older, when other rating factors are held fixed. The level starts from 2 for this variable but, again, the numbering is nominal and does not affect the numerical result.
• The policyholder age variable applies only to type A (automobile) vehicles, and there is no policy in the first age band. We may speculate that drivers younger than age 21 drive their parents' cars rather than having their own because of high insurance premiums or related regulations. The missing relativity may be estimated by some interpolation or the professional judgement of the actuary. The claim frequency is lowest for age bands 3 and 4, but becomes substantially higher for older age bands, a reasonable pattern seen in many auto insurance loss datasets.
• We also note that there is no base level in the policyholder age variable, in the sense that no relativity is
equal to 1. This is because the variable is only applicable to vehicle type A. This does not cause a problem
numerically, but one may set the base relativity as follows if necessary for other purposes. Since there is no
policy in age band 0, we consider band 1 as the base case. Specifically, we treat its relativity as a product of
0.918 and 1, where the former is the common relativity (that is, the common premium reduction) applied to
all policies with vehicle type A and the latter is the base value for age band 1. Then the relativity of age
band 2 can be seen as 0.917 = 0.918 × 0.999, where 0.999 is understood as the relativity for age band 2. The
remaining age bands can be treated similarly.
As another example consider a female policyholder aged 60 who owns a 3-year-old vehicle of type O. The
expected claim frequency for this policyholder is
Note that for this policy the age band variable is not used as the vehicle type is not A. The R script is given
as follows.
# Read in the Singapore auto dataset
mydat <- read.csv("SingaporeAuto.csv", quote = "", header = TRUE)
attach(mydat)
# Pois_reg2 is the fitted Poisson regression for this example
# (the fitting code is not shown here);
# exponentiating its coefficients gives the relativities
exp(Pois_reg2$coefficients)
detach(mydat)
The Poisson regression is a special member of a more general regression model class known as the generalized linear model (glm). The glm provides a unified regression framework for datasets where the response variables are continuous, binary, or discrete. The classical linear regression model with normal errors is also a member of the glm. There are many standard statistical texts dealing with the glm, including (McCullagh and Nelder, 1989). More accessible texts are (Dobson and Barnett, 2008), (Agresti, 1996) and (Faraway, 2016). For
actuarial and insurance applications of the glm see (Frees, 2009b) and (De Jong and Heller, 2008). Also, (Ohlsson and Johansson, 2010) discusses the glm in the non-life insurance pricing context with tariff analyses.
Contributor
• Joseph H. T. Kim, Yonsei University, is the principal author of the initial version of this chapter.
Email: [email protected] for chapter comments and suggested improvements.
8.5 Technical Supplement – Estimating Poisson Regression Models

For the Poisson regression model of this chapter, with loss counts y1, . . . , yn, exposures m1, . . . , mn, and means µi = mi exp(xi′ β), the log-likelihood is

log L(β) = l(β) = Σ_{i=1}^{n} (−µi + yi log µi − log yi!)
         = Σ_{i=1}^{n} (−mi exp(xi′ β) + yi (log mi + xi′ β) − log yi!).   (8.31)
To obtain the mle of β = (β0, . . . , βk)′, we differentiate⁴ l(β) with respect to the vector β and set the result to zero:
∂l(β)/∂β |β=b = Σ_{i=1}^{n} (yi − mi exp(xi′ b)) xi = 0.   (8.32)

Numerically solving this equation system gives the mle of β, denoted by b = (b0, b1, . . . , bk)′. Note that, as xi = (1, xi1, . . . , xik)′ is a column vector, equation (8.32) is a system of k + 1 equations with both sides written as column vectors of size k + 1. If we denote µ̂i = mi exp(xi′ b), we can rewrite (8.32) as

Σ_{i=1}^{n} (yi − µ̂i) xi = 0.   (8.33)
Since the solution b satisfies this equation, it follows that the first among the array of k + 1 equations,
corresponding to the first constant element of xi , yields
Σ_{i=1}^{n} (yi − µ̂i) × 1 = 0,   (8.34)

which is equivalent to

n⁻¹ Σ_{i=1}^{n} yi = ȳ = n⁻¹ Σ_{i=1}^{n} µ̂i.   (8.35)
⁴ We use the matrix derivative here.
This is an interesting property: the average of the individual losses, ȳ, is the same as the average of the estimated values. That is, the sample mean is preserved under the fitted Poisson regression model.
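A small simulation sketch, with assumed parameter values, illustrates this balance property.

# Simulated Poisson regression data (assumed parameter values)
set.seed(2019)
n  <- 500
x1 <- rbinom(n, 1, 0.5)               # a single indicator rating variable
m  <- runif(n, 0.5, 1.5)              # exposures
y  <- rpois(n, m * exp(-2 + 0.4 * x1))
fit <- glm(y ~ x1 + offset(log(m)), family = poisson)
c(mean(y), mean(fitted(fit)))         # the two averages are identical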
Maximum Likelihood Estimation for Grouped Data
Sometimes the data is not available at the individual policy level. For example, Table 8.2 provides collective
loss information for each risk class after grouping individual policies. When this is the case, yi and mi , the
quantities needed for the mle calculation in (8.32), are unavailable for each i. However this does not pose a
problem as long as we have the total loss counts and total exposure for each risk class.
To elaborate, let us assume that there are K different risk classes, and further that, in the kth risk class, we
have nk policies with the total exposure m(k) and the average loss count ȳ(k) , for k = 1, . . . , K; the total loss
count for the kth risk class is then nk ȳ(k) . We denote the set of indices of the policies belonging to the kth
class by Ck . As all policies in a given risk class share the same risk characteristics, we may denote xi = x(k)
for all i ∈ Ck . With this notation, we can rewrite (8.32) as
Σ_{i=1}^{n} (yi − mi exp(xi′ b)) xi = Σ_{k=1}^{K} { Σ_{i∈Ck} (yi − mi exp(xi′ b)) xi }
  = Σ_{k=1}^{K} { Σ_{i∈Ck} (yi − mi exp(x(k)′ b)) } x(k)
  = Σ_{k=1}^{K} { Σ_{i∈Ck} yi − Σ_{i∈Ck} mi exp(x(k)′ b) } x(k)
  = Σ_{k=1}^{K} ( nk ȳ(k) − m(k) exp(x(k)′ b) ) x(k) = 0.   (8.36)
Since nk ȳ(k) in (8.36) represents the total loss count for the kth risk class and m(k) is its total exposure, we see that for the Poisson regression the mle b is the same whether we use the individual data or the grouped data.
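The following sketch, again with assumed parameter values, illustrates this point numerically: grouping the simulated policies by risk class and refitting with class totals returns the same coefficient estimates.

# Individual-level data (assumed parameters), then the same data grouped by class
set.seed(2019)
n  <- 2000
x1 <- rbinom(n, 1, 0.5)                      # risk class indicator
m  <- runif(n, 0.5, 1.5)                     # individual exposures
y  <- rpois(n, m * exp(-2 + 0.4 * x1))       # individual loss counts
fit_ind <- glm(y ~ x1 + offset(log(m)), family = poisson)

# Total counts and total exposure for each risk class
grp <- aggregate(cbind(y = y, m = m), by = list(x1 = x1), FUN = sum)
fit_grp <- glm(y ~ x1 + offset(log(m)), family = poisson, data = grp)

rbind(individual = coef(fit_ind), grouped = coef(fit_grp))   # identical mles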
Information matrix
Taking second derivatives of (8.31) gives the information matrix of the mle estimators,
I(β) = −E( ∂² l(β) / ∂β ∂β′ ) = Σ_{i=1}^{n} mi exp(xi′ β) xi xi′ = Σ_{i=1}^{n} µi xi xi′.   (8.37)

For actual datasets, µi in (8.37) is replaced with µ̂i = mi exp(xi′ b) to estimate the relevant variances and covariances of the mle b or its functions.
For grouped datasets, we have
I(β) = Σ_{k=1}^{K} { Σ_{i∈Ck} mi exp(x(k)′ β) } x(k) x(k)′ = Σ_{k=1}^{K} m(k) exp(x(k)′ β) x(k) x(k)′.   (8.38)
Chapter 9

Experience Rating Using Credibility Theory
Chapter Preview. This chapter introduces credibility theory which is an important actuarial tool for estimating
pure premiums, frequencies, and severities for individual risks or classes of risks. Credibility theory provides
a convenient framework for combining the experience for an individual risk or class with other data to
produce more stable and accurate estimates. Several models for calculating credibility estimates will be
discussed including limited fluctuation, Bühlmann, Bühlmann-Straub, and nonparametric and semiparametric
credibility methods. The chapter will also show a connection between credibility theory and Bayesian
estimation which was introduced in Chapter 4.
A credibility-weighted estimate takes the form

R̂ = Z X̄ + (1 − Z)M,

where Z is the credibility weight (0 ≤ Z ≤ 1) given to the risk's own average loss experience X̄, and the complement of credibility 1 − Z is given to M, an estimate based on other information such as the average experience of a broader group of risks or the current class rate. For a large risk whose loss experience is stable from year to year, Z might be close to 1. For a smaller risk whose losses vary widely from year to year, Z may be close to 0.
Credibility theory is also used for computing rates for individual classes within a classification rating plan.
When classification plan rates are being determined, some or many of the groups may not have sufficient
data to produce stable and reliable rates. The actual loss experience for a group will be assigned a credibility
weight Z and the complement of credibility 1 − Z may be given to the average experience for risks across
all classes. Or, if a class rating plan is being updated, the complement of credibility may be assigned to
the current class rate. Credibility theory can also be applied to the calculation of expected frequencies and
severities.
Computing numeric values for Z requires analysis and understanding of the data. What are the variances in
the number of losses and sizes of losses for risks? What is the variance between expected values across risks?
Limited fluctuation credibility, also called “classical credibility”, was given this name because the method
explicitly attempts to limit fluctuations in estimates for claim frequencies, severities, or losses. For example,
suppose that you want to estimate the expected number of claims for a group of risks in an insurance rating
class. How many risks are needed in the class to ensure that a specified level of accuracy is attained in the
estimate? First the question will be considered from the perspective of how many claims are needed.
Let N be a random variable representing the number of claims for a group of risks. The observed number of
claims will be used to estimate µN = E[N ], the expected number of claims. How big does µN need to be to
get a good estimate? One way to quantify the accuracy of the estimate would be a statement like: “The observed value of N should be within 5% of µN at least 90% of the time.” Writing this as a mathematical expression would give Pr[0.95µN ≤ N ≤ 1.05µN] ≥ 0.90. Generalizing this statement by letting k replace 5% and probability p replace 0.90 produces a confidence interval

Pr[(1 − k)µN ≤ N ≤ (1 + k)µN] ≥ p.   (9.1)
The expected number of claims required for the probability on the left-hand side of (9.1) to equal p is called
the full credibility standard.
If the expected number of claims is greater than or equal to the full credibility standard then full credibility
can be assigned to the data so Z = 1. Usually the expected value µN is not known so full credibility will be
assigned to the data if the actual observed value of N is greater than or equal to the full credibility standard.
The k and p values must be selected and the actuary may rely on experience, judgment, and other factors in
making the choices.
Subtracting µN from each term in (9.1) and dividing by the standard deviation σN of N gives

Pr[ −kµN/σN ≤ (N − µN)/σN ≤ kµN/σN ] ≥ p.   (9.2)
For large values of µN = E[N ] it may be reasonable to approximate the distribution for Z = (N − µN )/σN
with the standard normal distribution.
Let yp be the value such that Pr[−yp ≤ Z ≤ yp ] = Φ(yp ) − Φ(−yp ) = p where Φ() is the cumulative standard
normal distribution. Because Φ(−yp ) = 1 − Φ(yp ), the equality can be rewritten as 2Φ(yp ) − 1 = p. Solving
for yp gives yp = Φ−1 ((p + 1)/2) where Φ−1 () is the inverse of the cumulative normal.
Equation (9.2) will be satisfied if kµN/σN ≥ yp, assuming the normal approximation. First we will consider this inequality for the case when N has a Poisson distribution: Pr[N = n] = λⁿe^{−λ}/n!. Because λ = µN = σN² for the Poisson, taking square roots yields µN^{1/2} = σN. So kµN/µN^{1/2} ≥ yp, which is equivalent to µN ≥ (yp/k)². Let's define λkp to be the value of µN for which equality holds. Then the full credibility standard for the Poisson distribution is
λkp = (yp/k)²   with   yp = Φ⁻¹((p + 1)/2).   (9.3)
If the expected number of claims µN is greater than or equal to λkp then equation (9.1) is assumed to hold
and full credibility can be assigned to the data. As noted previously, because µN is usually unknown, full
credibility is given if the observed value of N satisfies N ≥ λkp .
Example 9.2.1. The full credibility standard is set so that the observed number of claims is to be within
5% of the expected value with probability p = 0.95. If the number of claims has a Poisson distribution find
the number of claims needed for full credibility.
Solution Referring to a normal table, yp = Φ⁻¹((p + 1)/2) = Φ⁻¹((0.95 + 1)/2) = Φ⁻¹(0.975) = 1.960. Using this value and k = 0.05, λkp = (yp/k)² = (1.960/0.05)² = 1,536.64. After rounding up, the full credibility standard is 1,537.
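In R, this calculation is a one-liner; the sketch below simply reproduces the arithmetic of the example.

p  <- 0.95
k  <- 0.05
yp <- qnorm((p + 1) / 2)     # 1.959964, or 1.960 from a normal table
(yp / k)^2                   # about 1,536.6 before rounding
ceiling((1.960 / k)^2)       # 1,537 using the rounded table value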
If claims are not Poisson distributed then equation (9.2) does not imply (9.3). Setting the upper bound of Z in (9.2) equal to yp gives kµN/σN = yp. Squaring both sides and moving everything to the right side except for one of the µN's gives µN = (yp/k)²(σN²/µN). This is the full credibility standard for frequency and will be denoted by nf,

nf = (yp/k)² (σN²/µN) = λkp (σN²/µN).   (9.4)
This is the same equation as the Poisson full credibility standard except for the (σN²/µN) multiplier. When the claims distribution is Poisson this extra term is one because the variance equals the mean.
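A small helper function (not from the text) implementing equation (9.4) may be useful for the examples that follow; the variance-to-mean ratio defaults to one for the Poisson case.

# Full credibility standard for frequency, equation (9.4)
full_cred_freq <- function(k, p, var_to_mean = 1) {
  yp <- qnorm((p + 1) / 2)
  (yp / k)^2 * var_to_mean
}
full_cred_freq(k = 0.05, p = 0.95)                      # about 1,537 (Poisson)
full_cred_freq(k = 0.05, p = 0.95, var_to_mean = 3/4)   # about 1,152 (binomial, q = 1/4)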
Example 9.2.2. The full credibility standard is set so that the total number of claims is to be within 5% of the observed value with probability p = 0.95. The number of claims has a negative binomial distribution

Pr(N = x) = (x + r − 1 choose x) (1/(1 + β))^r (β/(1 + β))^x.
We see that the negative binomial distribution, with σN²/µN > 1, requires more claims for full credibility than a Poisson distribution for the same k and p values. The next example shows that a binomial distribution, which has σN²/µN < 1, will need fewer claims for full credibility.
Example 9.2.3. The full credibility standard is set so that the total number of claims is to be within 5% of the observed value with probability p = 0.95. The number of claims has a binomial distribution

Pr(N = x) = (m choose x) q^x (1 − q)^{m−x}

with q = 1/4. Find the number of claims needed for full credibility.
Solution From the first example in this section, λkp = 1,536.64. The mean and variance for a binomial are E(N) = mq and Var(N) = mq(1 − q), so σN²/µN = mq(1 − q)/(mq) = 1 − q, which equals 3/4 when q = 1/4. So, nf = λkp(σN²/µN) = 1,536.64(3/4) = 1,152.48, and rounding up gives a full credibility standard of 1,153.
Rather than use expected number of claims to define the full credibility standard, the number of exposures
can be used for the full credibility standard. An exposure is a measure of risk. For example, one car insured
for a full year would be one car-year. Two cars each insured for exactly one-half year would also result in
one car-year. Car-years attempt to quantify exposure to loss. Two car-years would be expected to generate
twice as many claims as one car-year if the vehicles have the same risk of loss. To translate a full credibility
standard denominated in terms of number of claims to a full credibility standard denominated in exposures
one needs a reasonable estimate of the expected number of claims per exposure.
Example 9.2.4. The full credibility standard should be selected so that the observed number of claims
will be within 5% of the expected value with probability p = 0.95. The number of claims has a Poisson
distribution. If one exposure is expected to have about 0.20 claims per year, find the number of exposures
needed for full credibility.
Solution With p = 0.95 and k = .05, λkp = (yp /k)2 = (1.960/0.05)2 = 1, 536.64 claims are required for full
credibility. The claims frequency rate is 0.20 claims/exposures. To convert the full credibility standard to a
standard denominated in exposures the calculation is: (1,536.64 claims)/(0.20 claims/exposures) = 7,683.20
exposures. This can be rounded up to 7,684.
Frequency can be defined as the number of claims per exposure. Letting m represent the number of exposures, the observed claim frequency N/m is used to estimate E(N/m):

Pr[(1 − k)E(N/m) ≤ N/m ≤ (1 + k)E(N/m)] ≥ p.

Because the number of exposures is not a random variable, E(N/m) = E(N)/m = µN/m and the prior equation becomes

Pr[(1 − k)µN/m ≤ N/m ≤ (1 + k)µN/m] ≥ p.
Multiplying through by m results in equation (9.1) at the beginning of the section. The full credibility standards that were developed for estimating the expected number of claims also apply to frequency.
Aggregate losses are the total of all loss amounts for a risk or group of risks. Letting S represent aggregate
losses then
S = X1 + X2 + · · · + XN .
The random variable N represents the number of losses and random variables X1 , X2 , . . . , XN are the
individual loss amounts. In this section it is assumed that N is independent of the loss amounts and that
X1 , X2 , . . . , XN are iid.
The mean and variance of S are µS = E(S) = µN µX and σS² = Var(S) = µN σX² + µX² σN², where µN and σN² are the mean and variance of the claim count N and µX and σX² are the mean and variance of an individual claim. The full credibility standard requires

Pr[ −kµS/σS ≤ Z ≤ kµS/σS ] ≥ p
with Z = (S − µS)/σS. As done in the previous section, the distribution for Z is assumed to be normal and kµS/σS = yp = Φ⁻¹((p + 1)/2). This equation can be rewritten as µS² = (yp/k)² σS². Using the prior formulas for µS and σS² gives (µN µX)² = (yp/k)²(µN σX² + µX² σN²). Dividing both sides by µN µX² and reordering terms on the right side results in a full credibility standard nS for aggregate losses

nS = (yp/k)² [ σN²/µN + (σX/µX)² ] = λkp [ σN²/µN + (σX/µX)² ].   (9.5)
Example 9.2.5. The number of claims has a Poisson distribution. Individual loss amounts are independently
and identically distributed with a Pareto distribution F (x) = 1 − [θ/(x + θ)]α . The number of claims and
loss amounts are independent. If observed aggregate losses should be within 5% of the expected value with
probability p = 0.95, how many losses are required for full credibility?
Solution Because the number of claims is Poisson, σN²/µN = 1. The mean of the Pareto is µX = θ/(α − 1) and the variance is σX² = θ²α/[(α − 1)²(α − 2)], so (σX/µX)² = α/(α − 2). Combining the frequency and severity terms, the full credibility standard is nS = λkp[1 + α/(α − 2)] = 1,536.64[1 + α/(α − 2)].
When the number of claims is Poisson distributed, equation (9.5) can be simplified using σN²/µN = 1. It follows that [σN²/µN + (σX/µX)²] = [1 + (σX/µX)²] = [(µX² + σX²)/µX²] = E(X²)/E(X)², using the relationship µX² + σX² = E(X²). The full credibility standard is nS = λkp E(X²)/E(X)².
The pure premium PP is equal to aggregate losses S divided by exposures m: PP = S/m. The full credibility standard for the pure premium will require

Pr[(1 − k)µPP ≤ PP ≤ (1 + k)µPP] ≥ p.

The number of exposures m is assumed fixed and not a random variable, so µPP = E(S/m) = E(S)/m = µS/m and the prior equation becomes

Pr[(1 − k)µS/m ≤ S/m ≤ (1 + k)µS/m] ≥ p.
This means that the full credibility standard nP P for the pure premium is the same as that for aggregate
losses
" 2 #
2
σN σX
nP P = nS = λkp + .
µn µX
Let X be a random variable representing the size of one claim. Claim severity is µX = E(X). Suppose that
X1 , X2 , . . . , Xn is a random sample of n claims that will be used to estimate claim severity µX . The claims
are assumed to be iid. The average value of the sample is
1
X̄ = (X1 + X2 + · · · + Xn ) .
n
How big does n need to be to get a good estimate? Note that n is not a random variable whereas it is in the
aggregate loss model.
In Section 9.2.1 the accuracy of an estimator was defined in terms of a confidence interval. For severity this confidence interval is

Pr[(1 − k)µX ≤ X̄ ≤ (1 + k)µX] ≥ p,

where k and p need to be specified. Following the steps in Section 9.2.1, the mean claim severity µX is subtracted from each term and the standard deviation of the claim severity estimator σX̄ is divided into each term, yielding

Pr[ −kµX/σX̄ ≤ Z ≤ kµX/σX̄ ] ≥ p
with Z = (X̄ − µX)/σX̄. As in prior sections, it is assumed that Z is approximately normally distributed and the prior equation is satisfied if kµX/σX̄ ≥ yp with yp = Φ⁻¹((p + 1)/2). Because X̄ is the average of individual claims X1, X2, . . . , Xn, its standard deviation is equal to the standard deviation of an individual
claim divided by √n: σX̄ = σX/√n. So kµX/(σX/√n) ≥ yp, and with a little algebra this can be rewritten as n ≥ (yp/k)²(σX/µX)². The full credibility standard for severity is

nX = (yp/k)² (σX/µX)² = λkp (σX/µX)².   (9.6)
Note that the term σX /µX is the coefficient of variation for an individual claim. Even though λkp is the full
credibility standard for frequency given a Poisson distribution, there is no assumption about the distribution
for the number of claims.
Example 9.2.6. Individual loss amounts are independently and identically distributed with a Pareto
distribution F (x) = 1 − [θ/(x + θ)]α . How many claims are required for the average severity of observed
claims to be within 5% of the expected severity with probability p = 0.95?
Solution The mean of the Pareto is µX = θ/(α − 1) and the variance is σX² = θ²α/[(α − 1)²(α − 2)], so (σX/µX)² = α/(α − 2). From a normal table, yp = Φ⁻¹((0.95 + 1)/2) = 1.960. The full credibility standard is nX = (1.96/0.05)²[α/(α − 2)] = 1,536.64α/(α − 2). Suppose α = 3; then nX = 4,609.92 for a full credibility standard of 4,610.
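The same arithmetic can be done in R, using the Pareto coefficient of variation from the example.

alpha     <- 3
cv2       <- alpha / (alpha - 2)   # (sigma_X / mu_X)^2 for the Pareto
lambda_kp <- (1.960 / 0.05)^2      # 1,536.64
lambda_kp * cv2                    # 4,609.92
ceiling(lambda_kp * cv2)           # full credibility standard of 4,610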
In prior sections full credibility standards were calculated for estimating frequency (nf ), pure premium (nP P ),
and severity (nX ) - in this section these full credibility standards will be denoted by n0 . In each case the full
credibility standard was the expected number of claims required to achieve a defined level of accuracy when
using empirical data to estimate an expected value. If the observed number of claims is greater than or equal
to the full credibility standard then a full credibility weight Z = 1 is given to the data.
In limited fluctuation credibility, credibility weights Z assigned to data are
Z = √(n/n0)  if n < n0,  and  Z = 1  for n ≥ n0,
where n0 is the full credibility standard. The quantity n is the number of claims for the data that is used to
estimate the expected frequency, severity, or pure premium.
Example 9.2.7. The number of claims has a Poisson distribution. Individual loss amounts are independently
and identically distributed with a Pareto distribution F (x) = 1 − [θ/(x + θ)]α . Assume that α = 3. The
number of claims and loss amounts are independent. The full credibility standard is that the observed pure
premium should be within 5% of the expected value with probability p = 0.95. What credibility Z is assigned
to a pure premium computed from 1,000 claims?
Solution Because the number of claims is Poisson,

E(X²)/[E(X)]² = σN²/µN + (σX/µX)².
The mean of the Pareto is µX = θ/(α − 1) and the second moment is E(X²) = 2θ²/[(α − 1)(α − 2)], so E(X²)/[E(X)]² = 2(α − 1)/(α − 2). From a normal table, yp = Φ⁻¹((0.95 + 1)/2) = 1.960. The full credibility standard is n0 = 1,536.64 × 2(α − 1)/(α − 2) = 1,536.64(4) = 6,146.56 for α = 3, and the credibility assigned to 1,000 claims is Z = √(1000/6,146.56) = 0.40.
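The corresponding R calculation, using the pure premium standard n0 from the solution:

alpha <- 3
n0    <- 1536.64 * 2 * (alpha - 1) / (alpha - 2)   # 6,146.56 claims for full credibility
Z     <- sqrt(1000 / n0)                           # credibility for 1,000 claims
Z                                                  # about 0.40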
Limited fluctuation credibility uses the formula Z = √(n/n0) to limit the fluctuation in the credibility-weighted estimate to match the fluctuation allowed for data with expected claims at the full credibility standard. Variance or standard deviation is used as the measure of fluctuation. Rather than derive the square-root formula, an example is shown below.
Suppose that average claim severity is being estimated from a sample of size n that is less than the full credibility standard n0 = nX. Applying credibility theory, the estimate µ̂X would be
µ̂X = Z X̄ + (1 − Z)MX
with X̄ = (X1 + X2 + · · · + Xn )/n and independent random variables Xi representing the sizes of individual
claims. The complement of credibility is applied to MX which could be last year’s estimated average severity
adjusted for inflation, the average severity for a much larger pool of risks, or some other relevant quantity
selected by the actuary. It is assumed that the variance of MX is zero or negligible. With this assumption
Var(µ̂X) = Var(Z X̄) = Z² Var(X̄) = (n/n0) Var(X̄).

Because X̄ = (X1 + X2 + · · · + Xn)/n it follows that Var(X̄) = Var(X)/n, where the random variable X is one claim. So,

Var(µ̂X) = (n/n0) Var(X̄) = (n/n0) (Var(X)/n) = Var(X)/n0.
The last term is exactly the variance of a sample mean X̄ when the sample size is equal to the full credibility
standard n0 = nX .
A classification rating plan groups policyholders together into classes based on risk characteristics. Although
policyholders within a class have similarities, they are not identical and their expected losses will not be
exactly the same. An experience rating plan can supplement a class rating plan by credibility weighting
an individual policyholder’s loss experience with the class rate to produce a more accurate rate for the
policyholder.
In the presentation of Bühlmann credibility it is convenient to assign a risk parameter θ to each policyholder.
Losses X for the policyholder will have a common distribution function Fθ (x) with mean µ(θ) = E(X|θ)
and variance σ²(θ) = Var(X|θ). In the prior sentence losses can represent pure premiums, aggregate losses,
number of claims, claim severities, or some other measure of loss. Parameter θ can be continuous, discrete, or
multivariate depending on the model.
If the policyholder had losses x1, . . . , xn during n observation periods then we want to find E(µ(θ)|x1, . . . , xn), the conditional expectation of µ(θ) given x1, . . . , xn. Another way to view
this is to consider random variable Xn+1 which is the observation during period n + 1. Finding
E(Xn+1 |x1 , x2 , . . . , xn ) is equivalent to finding E(µ(θ)|x1 , x2 , . . . , xn ) assuming that X1 , . . . , Xn , Xn+1 are
iid.
The Bühlmann credibility-weighted estimate of E(µ(θ)|X1, . . . , Xn) for the policyholder is

µ̂(θ) = Z X̄ + (1 − Z)µ,   (9.7)

with X̄ = (X1 + · · · + Xn)/n.
Random variables Xj are assumed to be iid for j = 1, . . . , n. The quantity X̄ is the average of n observations
and E(X̄|θ) = E(Xj |θ) = µ(θ).
If a policyholder is randomly chosen from the class and there is no loss information about the risk, then its expected loss is µ = E(µ(θ)), where the expectation is taken over all θ's in the class. In this situation Z = 0 and the expected loss is µ̂(θ) = µ for the risk. The quantity µ can also be written as µ = E(Xj) or µ = E(X̄)
and is often called the overall mean or collective mean. Note that E(Xj ) is evaluated with the “law of total
expectation”: E(Xj )=E(E(Xj |θ)).
Example 9.3.1. The number of claims X for an insured in a class has a Poisson distribution with mean
θ > 0. The risk parameter θ is exponentially distributed within the class with pdf f (θ) = e−θ . What is the
expected number of claims for an insured chosen at random from the class?
Solution Random variable X is Poisson with parameter θ and E(X|θ) = θ. The expected number of claims for a randomly chosen insured is µ = E(µ(θ)) = E(E(X|θ)) = E(θ) = ∫₀^∞ θ e^{−θ} dθ. Integration by parts gives µ = 1.
The prior example has risk parameter θ as a positive real number but the risk parameter can be a categorical
variable as shown in the next example.
Example 9.3.2. For any risk (policyholder) in a population the number of losses N in a year has a Poisson
distribution with parameter λ. Individual loss amounts Xi for a risk are independent of N and are iid with
Pareto distribution F (x) = 1 − [θ/(x + θ)]α . There are three types of risks in the population as follows:
Although formula (9.7) was introduced using experience rating as an example, the Bühlmann credibility
model has wider application. Suppose that a rating plan has multiple classes. Credibility formula (9.7) can
be used to determine individual class rates. The overall mean µ would be the average loss for all classes
combined, X̄ would be the experience for the individual class, and µ̂(θ) would be the estimated loss for the
class.
When computing the credibility estimate µ̂(θ) = Z X̄ + (1 − Z)µ, how much weight Z should go to experience
X̄ and how much weight (1 − Z) to the overall mean µ? In Bühlmann credibility there are three factors that
need to be considered:
• How much variation is there in a single observation Xj for a selected risk? With X̄ = (X1 + · · · + Xn )/n
and assuming that the observations are iid, it follows that Var(X̄|θ)=Var(Xj |θ)/n. For larger Var(X̄|θ)
less credibility weight Z should be given to experience X̄. The Expected Value of the Process Variance,
abbreviated EPV, is the expected value of Var(Xj |θ) across all risks:
EP V = E(Var(Xj |θ)).
• How much variation is there among the mean values of different risks? With iid observations for a risk, the variance of the hypothetical means across risks is Var(E(X̄|θ)) = Var(E(Xj|θ)) = Var(µ(θ)), a quantity called the Variance of the Hypothetical Means and abbreviated VHM. Note that we used E(X̄|θ) = E(Xj|θ) for the second equality. A larger VHM means the risks are more heterogeneous, so more credibility weight Z should be given to the risk's own experience X̄.
• How many observations n were used to compute X̄? More observations imply a larger Z.
Example 9.3.3. The number of claims N in a year for a risk in a population has a Poisson distribution
with mean λ > 0. The risk parameters λ for the population are uniformly distributed over the interval (0,2).
Calculate the EPV and VHM for the population.
The Bühlmann credibility formula includes values for n, EPV, and VHM :
Z = n/(n + K),   K = EPV/VHM.   (9.8)
If n increases then so does Z. If the VHM increases then Z increases. If the EPV increases then Z gets
smaller. Unlike limited fluctuation credibility where Z = 1 when the expected number of claims is greater
than the full credibility standard, Z can approach but not equal 1 as the number of observations n goes to
infinity.
If you multiply the numerator and denominator of the Z formula by (VHM /n) then Z can be rewritten as
Z = VHM / (VHM + EPV/n).
The number of observations n is captured in the term (EPV/n). As shown in bullet (1) at the beginning of the section, E(Var(X̄|θ)) = EPV/n. As the number of observations gets larger, the expected variance of X̄ gets smaller and the credibility Z increases, so that more weight gets assigned to X̄ in the credibility-weighted
estimate µ̂(θ).
Example 9.3.4. Use the “law of total variance" to show that Var(X̄) = VHM + (EPV /n) and derive a
formula for Z in terms of X̄.
Solution The quantity Var(X̄) is called the unconditional variance or the total variance of X̄. The law of total variance says

Var(X̄) = E(Var(X̄|θ)) + Var(E(X̄|θ)).

In bullet (1) at the beginning of this section we showed E(Var(X̄|θ)) = EPV/n. In the second bullet (2), Var(E(X̄|θ)) = VHM. Reordering the right-hand side gives Var(X̄) = VHM + (EPV/n). Another way to write the formula for the credibility Z is Z = Var(E(X̄|θ))/Var(X̄). This implies (1 − Z) = E(Var(X̄|θ))/Var(X̄).
The following long example and solution demonstrates how to compute the credibility-weighted estimate
with frequency and severity data.
Example 9.3.5. For any risk in a population the number of losses N in a year has a Poisson distribution with parameter λ. Individual loss amounts X for a selected risk are independent of N and are iid with exponential distribution F(x) = 1 − e^{−x/β}. There are three types of risks in the population as shown below. A risk was selected at random from the population and all losses were recorded over a five-year period. The total amount of losses over the five-year period was 5,000. Use Bühlmann credibility to estimate the annual expected aggregate loss for the risk.

Risk Type    Proportion of Population    Poisson λ    Exponential β
A            50%                         0.5          1,000
B            30%                         1.0          1,500
C            20%                         2.0          2,000
Solution Because individual loss amounts X are exponentially distributed, E(X) = β and Var(X) = β². For aggregate loss S = X1 + · · · + XN, the mean is E(S) = E(N)E(X) and the process variance is Var(S) = E(N)Var(X) + [E(X)]²Var(N). With Poisson frequency and exponentially distributed loss amounts, E(S) = λβ and Var(S) = λβ² + β²λ = 2λβ².
Population mean µ: Risk means are µ(A) = 0.5(1000) = 500; µ(B) = 1.0(1500) = 1,500; µ(C) = 2.0(2000) = 4,000; and µ = 0.50(500) + 0.30(1500) + 0.20(4000) = 1,500.
VHM: VHM = 0.50(500 − 1500)² + 0.30(1500 − 1500)² + 0.20(4000 − 1500)² = 1,750,000.
EPV: Process variances are σ²(A) = 2(0.5)(1000)² = 1,000,000; σ²(B) = 2(1.0)(1500)² = 4,500,000; σ²(C) = 2(2.0)(2000)² = 16,000,000; and EPV = 0.50(1,000,000) + 0.30(4,500,000) + 0.20(16,000,000) = 5,050,000.
X̄: X̄ = 5,000/5 = 1,000.
K: K = 5,050,000/1,750,000 = 2.89.
Z: There are five years of observations so n = 5. Z = 5/(5 + 2.89) = 0.63.
µ̂(θ): µ̂(θ) = 0.63(1,000) + (1 − 0.63)1,500 = 1,185.00.
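The R sketch below reproduces this calculation from the risk-type values used in the solution; carrying more decimal places for Z gives roughly 1,183 rather than the 1,185 obtained with Z rounded to 0.63.

prob   <- c(A = 0.50, B = 0.30, C = 0.20)    # proportion of each risk type
lambda <- c(A = 0.5,  B = 1.0,  C = 2.0)     # Poisson frequency means
beta   <- c(A = 1000, B = 1500, C = 2000)    # exponential severity means

mu_type <- lambda * beta                     # hypothetical means for aggregate loss
mu      <- sum(prob * mu_type)               # 1,500
VHM     <- sum(prob * (mu_type - mu)^2)      # 1,750,000
EPV     <- sum(prob * 2 * lambda * beta^2)   # 5,050,000
K       <- EPV / VHM                         # 2.89
Z       <- 5 / (5 + K)                       # 0.63
Z * (5000 / 5) + (1 - Z) * mu                # about 1,183 (1,185 with Z rounded to 0.63)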
For a policyholder with risk parameter θ, Bühlmann credibility uses a linear approximation µ̂(θ) = Z X̄ +
(1 − Z)µ to estimate E(µ(θ)|X1 , . . . , Xn ), the expected loss for the policyholder given prior losses X1 , . . . , Xn .
We can rewrite this as µ̂(θ) = a + bX̄ which makes it obvious that the credibility estimate is a linear function
of X̄.
If E(µ(θ)|X1 , . . . , Xn ) is approximated by the linear function a + bX̄ and constants a and b are chosen so
that E[(E(µ(θ)|X1 , . . . , Xn ) − (a + bX̄))2 ] is minimized, what are a and b? The answer is b = n/(n + K) and
a = (1 − b)µ with K = EP V /V HM and µ = E(µ(θ)). More detail can be found in references (Buhlmann,
1967), (Buhlmann and Gisler, 2005), (Klugman et al., 2012), and (Tse, 2009).
Bühlmann credibility is also called least-squares credibility, greatest accuracy credibility, or Bayesian credibility.
• Compute a credibility-weighted estimate for the expected loss for a risk or group of risks using the
Bühlmann-Straub model.
• Determine the credibility Z assigned to observations.
• Calculate required values including the Expected Value of the Process Variance (EPV ), Variance of the
Hypothetical Means (VHM ) and collective mean µ.
• Recognize situations when the Bühlmann-Straub model is appropriate.
With standard Bühlmann or least-squares credibility as described in the prior section, losses X1 , . . . , Xn for a
policyholder are assumed to be iid. If the subscripts indicate year 1, year 2 and so on up to year n, then
the iid assumption means that the policyholder has the same exposure to loss every year. For commercial
insurance this assumption is frequently violated.
Consider a commercial policyholder that uses a fleet of vehicles in its business. In year 1 there are m1 vehicles in the fleet, m2 vehicles in year 2, . . . , and mn vehicles in year n. The exposure to loss from ownership and use of this fleet is not constant from year to year. The annual losses for the fleet are not iid.
Define Yjk to be the loss for the k th vehicle in the fleet for year j. Then, the total losses for the fleet in year j
are Yj1 + · · · + Yjmj where we are adding up the losses for each of the mj vehicles. In the Bühlmann-Straub
model it is assumed that random variables Yjk are iid across all vehicles and years for the policyholder. With
this assumption the means E(Yjk |θ) = µ(θ) and variances Var(Yjk |θ) = σ 2 (θ) are the same for all vehicles
and years. The quantity µ(θ) is the expected loss and σ 2 (θ) is the variance in the loss for one year for one
vehicle for a policyholder with risk parameter θ.
If Xj is the average loss per unit of exposure in year j, Xj = (Yj1 + · · · + Yjmj)/mj, then E(Xj|θ) = µ(θ) and Var(Xj|θ) = σ²(θ)/mj for a policyholder with risk parameter θ. The average loss per vehicle for the entire n-year period is

X̄ = (1/m) Σ_{j=1}^{n} mj Xj,   with   m = Σ_{j=1}^{n} mj.
It follows that E(X̄|θ) = µ(θ) and Var(X̄|θ) = σ 2 (θ)/m where µ(θ) and σ 2 (θ) are the mean and variance for
a single vehicle for one year for the policyholder.
Example 9.4.1. Prove that Var(X̄|θ) = σ 2 (θ)/m for a risk with risk parameter θ.
Solution
Var(X̄|θ) = Var( (1/m) Σ_{j=1}^{n} mj Xj | θ )
          = (1/m²) Σ_{j=1}^{n} Var(mj Xj|θ) = (1/m²) Σ_{j=1}^{n} mj² Var(Xj|θ)
          = (1/m²) Σ_{j=1}^{n} mj² (σ²(θ)/mj) = (σ²(θ)/m²) Σ_{j=1}^{n} mj = σ²(θ)/m.
The Bühlmann-Straub credibility-weighted estimate of the expected loss per exposure for the policyholder is

µ̂(θ) = Z X̄ + (1 − Z)µ,   (9.9)

with µ = E(µ(θ)) the overall mean and the credibility Z given in (9.10) below.
Note that µ̂(θ) is the estimator for the expected loss for one exposure. If the policyholder has mj exposures
then the expected loss is mj µ̂(θ).
In an example in the prior section it was shown that Z=Var(E(X̄|θ))/Var(X̄) where X̄ is the average loss for
n observations. In equation (9.9) the X̄ is the average loss for m exposures and the same Z formula can be
used:
Z = Var(E(X̄|θ)) / Var(X̄) = Var(E(X̄|θ)) / [ E(Var(X̄|θ)) + Var(E(X̄|θ)) ].
The denominator was expanded using “the law of total variance." As noted above E(X̄|θ) = µ(θ) so
Var(E(X̄|θ)) = Var(µ(θ)) = V HM . Because Var(X̄|θ) = σ 2 (θ)/m it follows that E(Var(X̄|θ))=E(σ 2 (θ))/m=EPV /m.
Making these substitutions and a little algebra gives
Z = m/(m + K),   K = EPV/VHM.   (9.10)
This is the same Z as for Bühlmann credibility except number of exposures m replaces number of years or
observations n.
Example 9.4.2.
A commercial automobile policyholder had the following exposures and claims over a three-year period:
• The number of claims in a year for each vehicle in the policyholder’s fleet is Poisson distributed with
the same mean (parameter) λ.
• Parameter λ is distributed among the policyholders in the population with pdf f (λ) = 6λ(1 − λ) with
0 < λ < 1.
The policyholder has 18 vehicles in its fleet in year 4. Use Bühlmann-Straub credibility to estimate the
expected number of policyholder claims in year 4.
Solution The expected number of claims for one vehicle for a randomly chosen policyholder is µ = E(λ) = ∫₀¹ λ[6λ(1 − λ)]dλ = 1/2. The average number of claims per vehicle for the policyholder is X̄ = 13/36. The Expected Value of the Process Variance for a single vehicle is EPV = E(λ) = 1/2. The Variance of the Hypothetical Means across policyholders is VHM = Var(λ) = E(λ²) − (E(λ))² = ∫₀¹ λ²[6λ(1 − λ)]dλ − (1/2)² = (3/10) − (1/4) = (6/20) − (5/20) = 1/20. So, K = EPV/VHM = (1/2)/(1/20) = 10. The number of exposures in the experience period is m = 9 + 12 + 15 = 36. The credibility is Z = 36/(36 + 10) = 18/23. The credibility-weighted estimate for the number of claims for one vehicle is µ̂(θ) = Z X̄ + (1 − Z)µ = (18/23)(13/36) + (5/23)(1/2) = 9/23. With 18 vehicles in the fleet in year 4, the expected number of claims is 18(9/23) = 162/23 = 7.04.
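The R sketch below reproduces the Bühlmann-Straub calculation using the totals given in the solution (exposures of 9, 12, and 15 vehicles and 13 claims over the three years).

exposures    <- c(9, 12, 15)       # insured vehicles in years 1-3
total_claims <- 13                 # total claims over the three years
m    <- sum(exposures)             # 36
xbar <- total_claims / m           # 13/36
mu   <- 1/2                        # E(lambda) for f(lambda) = 6*lambda*(1 - lambda)
EPV  <- 1/2                        # E(lambda) again, by the Poisson assumption
VHM  <- 3/10 - (1/2)^2             # Var(lambda) = 1/20
K    <- EPV / VHM                  # 10
Z    <- m / (m + K)                # 18/23
18 * (Z * xbar + (1 - Z) * mu)     # about 7.04 expected claims in year 4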
Section 4.4 reviews Bayesian inference and it is assumed that the reader is familiar with that material. This
section will compare Bayesian inference and Bühlmann credibility and show connections between the two
models.
A risk with risk parameter θ has expected loss µ(θ) = E(X|θ), with random variable X representing pure premium, aggregate loss, number of claims, claim severity, or some other measure of loss. If the risk had n losses x1, . . . , xn then E(µ(θ)|x1, . . . , xn) is the conditional expectation of µ(θ). The Bühlmann credibility formula µ̂(θ) = Z X̄ + (1 − Z)µ is a linear function of X̄ = (x1 + · · · + xn)/n used to estimate E(µ(θ)|x1, . . . , xn).
The expectation E(µ(θ)|x1, . . . , xn) can be calculated from the conditional density function f(x|θ) and the posterior distribution π(θ|x1, . . . , xn):

E(µ(θ)|x1, . . . , xn) = ∫ µ(θ) π(θ|x1, . . . , xn) dθ,   where   µ(θ) = E(X|θ) = ∫ x f(x|θ) dx,

and, by Bayes theorem,

π(θ|x1, . . . , xn) = [ Π_{j=1}^{n} f(xj|θ) / f(x1, . . . , xn) ] π(θ).
The conditional density function f (x|θ) and the prior distribution π(θ) must be specified. The numerator on
the right-hand side is called the likelihood.
In the Gamma-Poisson model the number of claims X has a Poisson distribution Pr(X = x|λ) = λ^x e^{−λ}/x! for a risk with risk parameter λ. The prior distribution for λ is gamma with π(λ) = β^α λ^{α−1} e^{−βλ}/Γ(α). (Note
that a rate parameter β is being used in the gamma distribution rather than a scale parameter.) The mean of the gamma is E(λ) = α/β and the variance is Var(λ) = α/β². In this section we will assume that λ is the
expected number of claims per year though we could have chosen another time interval.
If a risk is selected at random from the population then the expected number of claims in a year is
E(N )=E(E(N |λ))=E(λ)=α/β. If we had no observations for the selected risk then the expected number of
claims for the risk is α/β.
During n years the following number of claims by year was observed for the randomly selected risk: x1 , . . . , xn .
From Bayes theorem the posterior distribution is
π(λ|x1, . . . , xn) = [ Π_{j=1}^{n} (λ^{xj} e^{−λ}/xj!) / Pr(x1, . . . , xn) ] β^α λ^{α−1} e^{−βλ}/Γ(α).

Combining terms that have a λ and putting all other terms into a constant C gives

π(λ|x1, . . . , xn) = C λ^{(α + Σ_{j=1}^{n} xj) − 1} e^{−(β + n)λ}.
This is a gamma distribution with parameters α′ = α + Σ_{j=1}^{n} xj and β′ = β + n. The constant must be C = (β′)^{α′}/Γ(α′) so that ∫₀^∞ π(λ|x1, . . . , xn)dλ = 1, though we do not need to know C. As explained in chapter four, the gamma distribution is a conjugate prior for the Poisson distribution, so the posterior distribution is also gamma.
Because the posterior distribution is gamma, the expected number of claims for the selected risk is

E(λ|x1, . . . , xn) = (α + Σ_{j=1}^{n} xj)/(β + n) = (α + number of claims)/(β + number of years).

This formula is slightly different from chapter four because β multiplies λ in the exponential of the gamma pdf, whereas in chapter four λ is divided by the parameter θ.
Now we will compute the Bühlmann credibility estimate for the Gamma-Poisson model. The variance for a Poisson distribution with parameter λ is λ, so EPV = E(Var(X|λ)) = E(λ) = α/β. The mean number of claims for the risk is λ, so VHM = Var(E(X|λ)) = Var(λ) = α/β². The credibility parameter is K = EPV/VHM = (α/β)/(α/β²) = β. The overall mean is E(E(X|λ)) = E(λ) = α/β. The sample mean is X̄ = (Σ_{j=1}^{n} xj)/n. The credibility-weighted estimate for the expected number of claims for the risk is

µ̂ = [n/(n + β)] (Σ_{j=1}^{n} xj)/n + [1 − n/(n + β)] (α/β) = (α + Σ_{j=1}^{n} xj)/(β + n).
For the Gamma-Poisson model the Bühlmann credibility estimate equals the Bayesian analysis answer.
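A quick numerical check in R, with assumed prior parameters and claim counts, confirms that the Bühlmann estimate and the gamma posterior mean coincide.

alpha <- 2; beta <- 4            # assumed gamma prior (rate parameterization)
x     <- c(0, 1, 0, 2, 1)        # assumed claim counts over n = 5 years
n     <- length(x)

posterior_mean <- (alpha + sum(x)) / (beta + n)
Z        <- n / (n + beta)       # K = EPV/VHM = beta
buhlmann <- Z * mean(x) + (1 - Z) * alpha / beta
c(posterior_mean, buhlmann)      # both equal 0.667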
For the Gamma-Poisson claims model the Bühlmann credibility estimate for the expected number of claims
exactly matches the Bayesian answer. The term exact credibility is applied in this situation. Exact credibility
may occur if the probability distribution for Xj is in the linear exponential family and the prior distribution
is a conjugate prior. Besides the Gamma-Poisson model other examples include Gamma-Exponential, Normal-
Normal, and Beta-Binomial. More information about exact credibility can be found in (Buhlmann and Gisler,
2005), (Klugman et al., 2012), and (Tse, 2009).
The beta-binomial model is useful for modeling the probability of an event. Assume that random variable X is the number of successes in n trials and that X has a binomial distribution Pr(X = x|p) = (n choose x) p^x (1 − p)^{n−x}. In the beta-binomial model the prior distribution for the probability p is a beta distribution with pdf
π(p) = [Γ(α + β)/(Γ(α)Γ(β))] p^{α−1} (1 − p)^{β−1},   0 < p < 1, α > 0, β > 0.

Given x successes in n trials, Bayes theorem gives the posterior distribution

π(p|x) = [ (n choose x) p^x (1 − p)^{n−x} / Pr(x) ] [Γ(α + β)/(Γ(α)Γ(β))] p^{α−1} (1 − p)^{β−1}.
Combining terms that have a p and putting everything else into the constant C yields

π(p|x) = C p^{α + x − 1} (1 − p)^{β + (n − x) − 1}.

This is a beta distribution with new parameters α′ = α + x and β′ = β + (n − x). The constant must be C = Γ(α + β + n)/[Γ(α + x)Γ(β + n − x)].
The mean of the beta distribution with parameters α and β is E(p) = α/(α + β). Given x successes in n trials in the beta-binomial model, the mean of the posterior distribution is E(p|x) = (α + x)/(α + β + n). As the number of trials n and successes x increase, the expected value of p approaches x/n. The Bühlmann credibility estimate for E(p|x) is exactly the same, as shown in the following example.
Example 9.5.1. The probability that a coin toss will yield heads is p. The prior distribution for the probability
p is beta with parameters α and β. On n tosses of the coin there were exactly x heads. Use Bühlmann
credibility to estimate the expected value of p.
Solution Define random variables Yj such that Yj = 1 if the jth coin toss is heads and Yj = 0 if tails, for j = 1, . . . , n. Random variables Yj are iid with Pr[Y = 1|p] = p and Pr[Y = 0|p] = 1 − p. The number of heads in n tosses can be represented by the random variable X = Y1 + · · · + Yn. We want to estimate p = E[Yj] using Bühlmann credibility: p̂ = Z Ȳ + (1 − Z)µ. The overall mean is µ = E(E(Yj|p)) = E(p) = α/(α + β). The sample mean is ȳ = x/n. The credibility is Z = n/(n + K) and K = EPV/VHM. With Var(Yj|p) = p(1 − p) it follows that EPV = E(Var(Yj|p)) = E(p(1 − p)). Because E(Yj|p) = p, VHM = Var(E(Yj|p)) = Var(p). For the beta distribution,

E(p) = α/(α + β),   E(p²) = α(α + 1)/[(α + β)(α + β + 1)],   and   Var(p) = αβ/[(α + β)²(α + β + 1)].
Parameter K = EPV/VHM = [E(p) − E(p²)]/Var(p). With some algebra this reduces to K = α + β. The Bühlmann credibility-weighted estimate is

p̂ = [n/(n + α + β)] (x/n) + [1 − n/(n + α + β)] [α/(α + β)] = (α + x)/(α + β + n).
The examples in this chapter have provided assumptions for calculating credibility parameters. In actual
practice the actuary must use real world data and judgment to determine credibility parameters.
Limited-fluctuation credibility requires a full credibility standard. The general formula for aggregate losses or
pure premium is
" #
y 2 σ 2 σ 2
p N X
nS = +
k µN µX
with N representing number of claims and X the size of claims. If one assumes σX = 0 then the full credibility
standard for frequency results. If σN = 0 then the full credibility formula for severity follows. Probability p
and k value are often selected using judgment and experience.
In practice it is often assumed that the number of claims is Poisson distributed so that σN²/µN = 1. In this case the formula can be simplified to

nS = (yp/k)² E(X²)/(E(X))².
An empirical mean and second moment for the sizes of individual claim losses can be computed from past
data, if available.
Bayesian analysis as described previously requires assumptions about a prior distribution and likelihood. It is
possible to produce estimates without these assumptions and these methods are often referred to as empirical
Bayes methods. Bühlmann and Bühlmann-Straub credibility with parameters estimated from the data are
included in the category of empirical Bayes methods.
Bühlmann Model First we will address the simpler Bühlmann model. Assume that there are r risks in a population. For risk i with risk parameter θi, the losses for n periods are Xi1, . . . , Xin. The losses for a risk are iid across periods, as assumed in the Bühlmann model. For risk i the sample mean is X̄i = Σ_{j=1}^{n} Xij/n and the unbiased sample process variance is si² = Σ_{j=1}^{n} (Xij − X̄i)²/(n − 1). An unbiased estimator for the EPV can be calculated by taking the average of si² for the r risks in the population:

ÊPV = (1/r) Σ_{i=1}^{r} si² = [1/(r(n − 1))] Σ_{i=1}^{r} Σ_{j=1}^{n} (Xij − X̄i)².   (9.11)
The individual risk means X̄i for i = 1, . . . , r can be used to estimate the VHM. An unbiased estimator of Var(X̄i) is

V̂ar(X̄i) = [1/(r − 1)] Σ_{i=1}^{r} (X̄i − X̄)²   with   X̄ = (1/r) Σ_{i=1}^{r} X̄i.

By the law of total variance, Var(X̄i) = E(Var(X̄i|Θ = θi)) + Var(E(X̄i|Θ = θi)). The VHM is the second term on the right because µ(θi) = E(X̄i|Θ = θi) is the hypothetical mean for risk i.
As discussed previously in Section 9.3.1, EPV/n = E(Var(X̄i|Θ = θi)), and using the above estimators gives an unbiased estimator for the VHM:

V̂HM = [1/(r − 1)] Σ_{i=1}^{r} (X̄i − X̄)² − ÊPV/n.   (9.12)
Although the expected loss for a risk with parameter θi is µ(θi) = E(X̄i|Θ = θi), the variance of the sample mean X̄i is greater than the variance of the hypothetical means: Var(X̄i) ≥ Var(µ(θi)). The variance in the sample means, Var(X̄i), includes both the variance in the hypothetical means and a process variance term, because for individual observations Xij, Var(Xij|Θ = θi) > 0.
Example 9.6.2. Two policyholders had claims over a three-year period as shown in the table below. Calculate the nonparametric estimate for the VHM.
Bühlmann-Straub Model Empirical formulas for EPV and VHM in the Bühlmann-Straub model are
more complicated because a risk’s number of exposures can change from one period to another. Also, the
number of experience periods does not have to be constant across the population because exposure rather
than time measures loss potential. First some definitions:
• Xij is the losses per exposure for risk i in period j. Losses can refer to number of claims or amount of
loss. There are r risks so i = 1, . . . , r.
• ni is the number of observation periods for risk i
• mij is the number of exposures for risk i in period j for j = 1, . . . , ni
Risk i with risk parameter θi has mij exposures in period j which means that the losses per exposure random
variable can be written as Xij = (Yi1 + · · · + Yimij )/mij . Random variable Yik is the loss for one exposure.
For risk i losses Yik are iid with mean E(Yik )=µ(θi ) and process variance Var(Yik )=σ 2 (θi ). It follows that
Var(Xij )=σ 2 (θi )/mi,j .
Two more important definitions are:
• X̄i = (1/mi) Σ_{j=1}^{ni} mij Xij with mi = Σ_{j=1}^{ni} mij. X̄i is the average loss per exposure for risk i for all observation periods combined.
• X̄ = (1/m) Σ_{i=1}^{r} mi X̄i with m = Σ_{i=1}^{r} mi. X̄ is the average loss per exposure for all risks for all observation periods combined.
Random variable X̄i is the average loss for all mi exposures for risk i for all years combined. Random variable
X̄ is the average loss for all exposures for all risks for all years combined.
An unbiased estimator for the process variance σ²(θi) of one exposure for risk i is

si² = Σ_{j=1}^{ni} mij (Xij − X̄i)² / (ni − 1).
The mij weights are applied to the squared differences because the Xij are the averages of mij exposures.
The weighted average of the sample variances si 2 for each risk i in the population with weights proportional
to the number of (ni − 1) observation periods will produce the expected value of the process variance (EPV )
estimate
ÊPV = Σ_{i=1}^{r} (ni − 1)si² / Σ_{i=1}^{r} (ni − 1) = Σ_{i=1}^{r} Σ_{j=1}^{ni} mij (Xij − X̄i)² / Σ_{i=1}^{r} (ni − 1).
An unbiased estimator for the VHM is

V̂HM = [ Σ_{i=1}^{r} mi (X̄i − X̄)² − (r − 1)ÊPV ] / [ m − (1/m) Σ_{i=1}^{r} mi² ].
This complicated formula is necessary because of the varying number of exposures. Proofs that the EPV and
VHM estimators shown above are unbiased can be found in several references mentioned at the end of this
chapter including (Buhlmann and Gisler, 2005), (Klugman et al., 2012), and (Tse, 2009).
Example 9.6.3. Two policyholders had claims as shown in the table below. Estimate the expected number of claims for each policyholder using Bühlmann-Straub credibility, calculating the parameters from the data.
B Number of claims 0 0 1 2
B Insured vehicles 0 2 3 4
In the prior section on nonparametric estimation, there were no assumptions about the distribution of
the losses per exposure random variables Xij . Assuming that the Xij have a particular distribution and
using properties of the distribution along with the data to determine credibility parameters is referred to as
semiparametric estimation.
An example of semiparametric estimation would be the assumption of a Poisson distribution when estimating
claim frequencies. The Poisson distribution has the property that the mean and variance are identical and
this property can simplify calculations. The following simple example comes from the prior section but now
includes a Poisson assumption about claim frequencies.
Example 9.6.4. Two policyholders had claims over a three-year period as shown in the table below. Assume
that the number of claims for each risk has a Poisson distribution. Estimate the expected number of claims
for each policyholder using Bühlmann credibility and calculating the necessary parameters from the data.

Solution With the Poisson assumption, the process variance for each risk equals its mean, so ÊPV = X̄ = (1/3 + 5/3)/2 = 1. Then

V̂HM = [(1/3 − 1)² + (5/3 − 1)²]/(2 − 1) − ÊPV/3 = 8/9 − 1/3 = 5/9,

K = ÊPV/V̂HM = 1/(5/9) = 9/5,   ZA = ZB = 3/(3 + 9/5) = 5/8,

µ̂A = (5/8)(1/3) + (1 − 5/8)(1) = 7/12,   µ̂B = (5/8)(5/3) + (1 − 5/8)(1) = 17/12.
We did not have to make the Poisson assumption in the prior example because there was enough data to use
nonparametric estimation but the following example is commonly used to demonstrate a situation where
semiparametric estimation is needed. There is insufficient data for nonparametric estimation but with the
Poisson assumption estimates can be calculated.
Example 9.6.5. A portfolio of 2,000 policyholders generated the following claims profile during a five-year
period:
Number of Claims in 5 Years    Number of Policies
0                              923
1                              682
2                              249
3                               70
4                               51
5                               25
In your model you assume that the number of claims for each policyholder has a Poisson distribution and that
a policyholder’s expected number of claims is constant through time. Use Bühlmann credibility to estimate
the annual expected number of claims for policyholders with 3 claims during the five-year period.
Solution Let θi be the risk parameter for the ith risk in the portfolio with mean µ(θi ) and variance σ 2 (θi ).
With the Poisson assumption µ(θi ) = σ 2 (θi ). The expected value of the process variance is EPV=E(σ 2 (θi ))
where the expectation is taken across all risks in the population. Because of the Poisson assumption for all
risks it follows that EPV=E(σ 2 (θi ))=E(µ(θi )). An estimate for the annual expected number of claims is
µ̂(θi )= (observed number of claims)/5. This can also serve as the estimate for the process variance for a
risk. Weighting the process variance estimates (or means) by the number of policies in each group gives the
estimators
ÊPV = [923(0) + 682(0.20) + 249(0.40) + 70(0.60) + 51(0.80) + 25(1.00)]/2000 = 0.1719

V̂HM = [1/(2000 − 1)] [923(0 − 0.1719)² + 682(0.20 − 0.1719)² + 249(0.40 − 0.1719)²
        + 70(0.60 − 0.1719)² + 51(0.80 − 0.1719)² + 25(1 − 0.1719)²] − 0.1719/5 = 0.0111

K̂ = ÊPV/V̂HM = 0.1719/0.0111 = 15.49

Ẑ = 5/(5 + 15.49) = 0.2440

µ̂(3 claims) = 0.2440(3/5) + (1 − 0.2440)(0.1719) = 0.2764.
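The R sketch below reproduces this semiparametric calculation; small differences from the values above are due only to rounding of the intermediate quantities.

claims   <- 0:5
policies <- c(923, 682, 249, 70, 51, 25)
r <- sum(policies)                                # 2,000 policyholders
n <- 5                                            # years of observation

xbar_i <- claims / n                              # estimated annual mean per policy
EPV    <- sum(policies * xbar_i) / r              # 0.1719 (Poisson: variance = mean)
VHM    <- sum(policies * (xbar_i - EPV)^2) / (r - 1) - EPV / n   # about 0.0111
K      <- EPV / VHM
Z      <- n / (n + K)                             # about 0.244
Z * (3 / 5) + (1 - Z) * EPV                       # about 0.276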
The estimated loss for risk i in a credibility-weighted model is µ̂(θi) = Zi X̄i + (1 − Zi)X̄, where X̄i is the loss per exposure for risk i and X̄ is the loss per exposure for the population. The overall mean in the Bühlmann-Straub model is X̄ = Σ_{i=1}^{r} (mi/m)X̄i, where mi and m are the numbers of exposures for risk i and the population, respectively. The same formula works for the simpler Bühlmann model by setting mi = 1 and m = r, where r is the number of risks.
For the credibility weighted estimators to be in balance we want
X̄ = Σ_{i=1}^r (mi /m)X̄i = Σ_{i=1}^r (mi /m)µ̂(θi ).
If this equation is satisfied, then the estimated losses for each risk will add up to the population total, an important goal in ratemaking, but this may not happen if X̄ is used for the complement of credibility.
In order to find a complement of credibility that will bring the credibility-weighted estimators into balance, we set µ̂ as the complement of credibility:
Σ_{i=1}^r (mi /m)X̄i = Σ_{i=1}^r (mi /m)(Zi X̄i + (1 − Zi )µ̂),
so that
Σ_{i=1}^r mi X̄i = Σ_{i=1}^r mi Zi X̄i + µ̂ Σ_{i=1}^r mi (1 − Zi ),
and
µ̂ = [Σ_{i=1}^r mi (1 − Zi )X̄i ] / [Σ_{i=1}^r mi (1 − Zi )].
Because Zi = mi /(mi + K),
mi (1 − Zi ) = mi [1 − mi /(mi + K)] = mi [(mi + K) − mi ]/(mi + K) = mi K/(mi + K) = K Zi .
A complement of credibility that will bring the credibility-weighted estimators into balance with the overall mean loss per exposure is therefore
µ̂ = [Σ_{i=1}^r Zi X̄i ] / [Σ_{i=1}^r Zi ].
Example 9.6.6. An example from the nonparametric Bühlmann-Straub section had the following data for two risks. Find the complement of credibility µ̂ that will produce credibility-weighted estimates that are in balance.
Policyholder B: Number of claims 0, 0, 1, 2; Insured vehicles 0, 2, 3, 4.
µ̂ = [0.7703(1) + 0.8118(1/3)] / (0.7703 + 0.8118) = 0.6579.
The updated credibility estimates are µ̂A = 0.7703(1) + (1 − 0.7703)(0.6579) = 0.9214 versus the previous 0.9139, and µ̂B = 0.8118(1/3) + (1 − 0.8118)(0.6579) = 0.3944 versus the previous 0.3882. Checking the balance on the new estimators: (7/16)(0.9214) + (9/16)(0.3944) = 0.6250, which exactly matches X̄ = 10/16 = 0.6250.
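As a quick check, the balance calculation can be reproduced in R; the sketch below is ours and takes the credibility factors, sample means, and exposures from the example above.
# Balanced complement of credibility for Example 9.6.6
Z    <- c(A = 0.7703, B = 0.8118)   # credibility factors
xbar <- c(A = 1, B = 1/3)           # sample means per exposure
m    <- c(A = 7, B = 9)             # exposures (insured vehicles)
muhat <- sum(Z * xbar) / sum(Z)                 # 0.6579
est   <- Z * xbar + (1 - Z) * muhat             # 0.9214 and 0.3944
sum((m / sum(m)) * est)                         # 0.6250 = overall mean 10/16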
Here is a set of exercises that guide the reader through some of the theoretical foundations of Loss Data Analytics. Each tutorial is based on one or more questions from the professional actuarial examinations, typically the Society of Actuaries Exam C.
Credibility Guided Tutorials
Contributors
• Gary Dean, Ball State University is the author of the initial version of this chapter. Email: cgdean@bsu.edu for chapter comments and suggested improvements.
Chapter 10
Portfolio Management Including Reinsurance
Chapter Preview. Define S to be (random) obligations that arise from a collection (portfolio) of insurance
contracts.
• We are particularly interested in probabilities of large outcomes and so formalize the notion of a
heavy-tail distribution in Section 10.1.
• How much in assets does an insurer need to retain to meet obligations arising from the random S? A
study of risk measures in Section 10.2 helps to address this question.
• As with policyholders, insurers also seek mechanisms in order to spread risks. A company that sells
insurance to an insurance company is known as a reinsurer, studied in Section 10.3.
10.1 Tails of Distributions
The (right) tail of a loss distribution governs the occurrence of large losses. Speaking plainly, a rv is said to be heavier-tailed if higher probabilities are assigned to larger values. Unwelcome outcomes are more likely to occur for an insurance portfolio that is described by a loss rv possessing a heavier (right) tail. Tail weight can be an absolute or a relative concept. Specifically, for the former, we may consider a rv to be heavy-tailed if certain mathematical properties of the probability distribution are met. For the latter, we can say the tail of one distribution is heavier than that of another if certain tail measures are larger or smaller.
In the statistics and probability literature, several quantitative approaches have been proposed to classify and compare tail weight. In most of these approaches, the survival function serves as the building block. In what follows, we introduce two simple yet useful tail classification methods, in which the basic idea is to study quantities that are closely related to the behavior of the survival function of X.
One possible way of classifying the tail weight of a distribution is by assessing the existence of raw moments. Since our major interest lies in the right tails of distributions, we henceforth assume the obligation/loss rv X to be positive. At the outset, let us recall that the k-th raw moment of a continuous rv X, for k ≥ 0, can be computed via
µ'_k = k ∫_0^∞ x^(k−1) S(x) dx,
where S(·) denotes the survival function of X. It is a simple matter to see that the existence of the raw moments depends on the asymptotic behavior of the survival function at infinity. Namely, the faster the survival function decays to zero, the higher the order of finite moment the associated rv possesses. Hence the maximal order of finite moment, denoted by k* := sup{k ∈ R+ : µ'_k < ∞}, can be considered an indicator of tail weight. This observation leads to the moment-based tail weight classification method, which is defined formally next.
Definition 10.1. For a positive loss random variable X, if all the positive raw moments exist, namely the
maximal order of finite moment k ∗ = ∞, then X is said to be light-tailed based on the moment method. If
k ∗ = a ∈ (0, ∞), then X is said to be heavy-tailed based on the moment method. Moreover, for two positive
loss random variables X1 and X2 with maximal orders of moment k1∗ and k2∗ respectively, we say X1 has a
heavier (right) tail than X2 if k1∗ ≤ k2∗ .
It is noteworthy that the first part of Definition 10.1 is an absolute concept of tail weight, while the second
part is a relative concept of tail weight which compares the (right) tails between two distributions. Next,
we are going to present a few examples that illustrate the applications of the moment-based method for
comparing tail weight. Some of these examples are borrowed from Klugman et al. (2012).
Example 10.1.1. Finiteness of gamma moments. Let X ∼ Gamma(α, θ), with α > 0 and θ > 0. Show that µ'_k < ∞ for all k > 0.
Solution.
µ'_k = ∫_0^∞ x^k [x^(α−1) e^(−x/θ) / (Γ(α)θ^α)] dx
     = ∫_0^∞ (yθ)^k [(yθ)^(α−1) e^(−y) / (Γ(α)θ^α)] θ dy
     = [θ^k / Γ(α)] Γ(α + k) < ∞.
Since all the positive moments exist, i.e., k ∗ = ∞, in accordance with the moment-based classification method
in Definition 10.1, the gamma distribution is light-tailed.
Example 10.1.2. Finiteness of Weibull moments. Let X ∼ Weibull(θ, τ ), with θ > 0 and τ > 0. Show that µ'_k < ∞ for all k > 0.
Show Example Solution
Solution.
µ'_k = ∫_0^∞ x^k [τ x^(τ−1)/θ^τ ] e^(−(x/θ)^τ) dx
     = ∫_0^∞ [y^(k/τ)/θ^τ ] e^(−y/θ^τ) dy
     = θ^k Γ(1 + k/τ ) < ∞.
Again, because all the positive moments exist, the Weibull distribution is light-tailed.
We note in passing that the gamma and Weibull distributions are used extensively in actuarial practice. Their applications are vast and include, but are not limited to, insurance claim severity modelling, solvency assessment, loss reserving, aggregate risk approximation, reliability engineering, and failure analysis. We have thus far seen two examples of using the moment-based method to analyze light-tailed distributions. We document a heavy-tailed example in what follows.
Example 10.1.3. Heavy tail nature of the Pareto distribution. Let X ∼ Pareto(α, θ), with α > 0 and θ > 0. Then for k > 0,
µ'_k = ∫_0^∞ x^k [αθ^α/(x + θ)^(α+1)] dx = αθ^α ∫_θ^∞ (y − θ)^k y^(−(α+1)) dy.
Define g_k := ∫_θ^∞ y^(k−α−1) dy, which is finite for k < α and infinite for k ≥ α.
Meanwhile,
lim_{y→∞} [(y − θ)^k y^(−(α+1))] / y^(k−α−1) = lim_{y→∞} (1 − θ/y)^k = 1.
Application of the limit comparison theorem for improper integrals shows that µ'_k is finite if and only if g_k is finite. Hence we can conclude that the raw moments of Pareto rv's exist only up to k < α, i.e., k* = α, and thus the distribution is heavy-tailed. What is more, the maximal order of finite moment depends only on the shape parameter α and is an increasing function of α. In other words, based on the moment method, the tail weight of Pareto rv's is determined solely by α: the smaller the value of α, the heavier the tail. Since k* < ∞, the tail of the Pareto distribution is heavier than those of the gamma and Weibull distributions.
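To make the comparison concrete, the following small R sketch (ours; the parameter values are arbitrary) uses the raw-moment functions of the actuar package. The gamma and Weibull moments are finite for every order, while the Pareto moments are finite only for orders below the shape parameter.
# Moment-based tail comparison with illustrative parameters
library(actuar)
sapply(1:5, mgamma,   shape = 2,   scale = 1000)  # all finite: light tailed
sapply(1:5, mweibull, shape = 0.5, scale = 1000)  # all finite: light tailed
sapply(1:2, mpareto,  shape = 3,   scale = 1000)  # finite only for k < alpha = 3: heavy tailed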
We conclude this section with an open discussion of the limitations of the moment-based method. Despite its simple implementation and intuitive interpretation, there are certain circumstances in which the application of the moment-based method is not suitable. First, for more complicated probabilistic models, the k-th raw moment may not be simple to derive, and thus the identification of the maximal order of finite moment can be challenging. Second, the moment-based method does not align well with the main body of the well-established heavy-tail theory in the literature. Specifically, the existence of the moment generating function is arguably the most popular criterion for classifying heavy versus light tails among academic actuaries. However, for some rv's such as lognormal rv's, the moment generating function does not exist even though all the positive moments are finite. In such cases, the two approaches can lead to different tail weight assessments. Third, when we need to compare the tail weight of two light-tailed distributions that both have all positive moments finite, the moment-based method is no longer informative (see, e.g., Examples 10.1.1 and 10.1.2).
In order to resolve the aforementioned issues of the moment-based classification method, an alternative approach for comparing tail weight is to study directly the limiting behavior of the ratio of the survival functions of two rv's X and Y:
γ := lim_{t→∞} S_X(t)/S_Y(t).
We say that X has a heavier (right) tail than Y if γ = ∞, a lighter tail than Y if γ = 0, and a proportionally equivalent tail if γ ∈ (0, ∞). As an illustration, let us compare the tails of the Pareto and Weibull distributions: let X ∼ Pareto(α, θ) and Y ∼ Weibull(θ, τ ). Then
lim_{t→∞} S_X(t)/S_Y(t) = lim_{t→∞} (1 + t/θ)^(−α) / exp{−(t/θ)^τ}
= lim_{t→∞} exp{t/θ^τ} / (1 + t^(1/τ)/θ)^α
= lim_{t→∞} [Σ_{i=0}^∞ (t/θ^τ)^i / i! ] / (1 + t^(1/τ)/θ)^α
= lim_{t→∞} Σ_{i=0}^∞ (t^(−i/α) + t^(1/τ − i/α)/θ)^(−α) / (θ^(τ i) i!)
= ∞.
Therefore, the Pareto distribution has a heavier tail than the Weibull distribution. One may also recall that exponentials grow faster than any polynomial, so the aforementioned limit must be infinite.
For some distributions whose survival functions do not admit explicit expressions, we may find the following alternative formula (an application of L'Hôpital's rule) useful:
lim_{t→∞} S_X(t)/S_Y(t) = lim_{t→∞} S'_X(t)/S'_Y(t) = lim_{t→∞} (−f_X(t))/(−f_Y(t)) = lim_{t→∞} f_X(t)/f_Y(t).
For example, for a Pareto rv X with density f_X(t) proportional to (t + θ)^(−(α+1)) and a gamma rv Y with density f_Y(t) proportional to t^(τ−1) e^(−t/λ), we have
lim_{t→∞} f_X(t)/f_Y(t) ∝ lim_{t→∞} e^(t/λ) / [(t + θ)^(α+1) t^(τ−1)] = ∞,
so the tail of the Pareto distribution is again the heavier one.
10.2 Risk Measures
Knowing that one risk is (asymptotically) more dangerous than another may not provide sufficient information for sophisticated risk management purposes; in addition, one is also interested in quantifying how much more dangerous it is. In fact, the magnitude of risk associated with a given loss distribution is an essential input for many insurance applications, such as actuarial pricing, reserving, hedging, insurance regulatory oversight, and so forth.
The literature on risk measures has been growing rapidly in both popularity and importance. In the two succeeding subsections, we introduce two indices which have recently earned an unprecedented amount of interest among theoreticians, practitioners, and regulators: the Value-at-Risk (VaR) and the Tail Value-at-Risk (TVaR) measures. The economic rationale behind these two popular risk measures is similar to that of the tail classification methods introduced in the previous section, with which we hope to capture the risk of extremal losses represented by the distribution tails. Following this is a broader discussion of desirable properties of risk measures.
10.2.1 Value-at-Risk
The VaR measure outputs the smallest value of X such that the associated cdf first exceeds or equals q. In the fields of probability and statistics, the VaR is also known as a percentile (or quantile).
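A standard formalization consistent with this description (the text's formal definition is not reproduced in this extract) is
V aRq [X] := inf{x ∈ R : FX (x) ≥ q}, for q ∈ (0, 1).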
Here is how to interpret VaR in the lingo of actuarial mathematics. The VaR is a forecast of the ‘maximal’ probable loss for an insurance product/portfolio or a risky investment occurring q × 100% of the time, over a specific time horizon (typically, one year). For instance, let X be the annual loss rv of an insurance product; V aR0.95 [X] = 100 million means that there is a 5% chance that the loss will exceed 100 million over a given year. Owing to this meaningful interpretation, VaR has become the industry standard for measuring financial and insurance risks since the 1990s. Financial conglomerates, regulators, and academics often utilize VaR to price insurance products, measure risk capital, ensure compliance with regulatory rules, and disclose financial positions.
Next, we are going to present a few examples about the computation of VaR.
Example 10.2.1. VaR for the exponential distribution. Consider an insurance loss rv X ∼ Exp(θ) for θ > 0; the cdf of X is given by FX (x) = 1 − e^(−x/θ) for x ≥ 0. Thus
V aRq [X] = FX^(−1)(q) = −θ log(1 − q).
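As a quick numerical check (ours; θ = 1000 and q = 0.95 are arbitrary illustrative values):
# Exponential VaR: closed form versus R's built-in quantile function
theta <- 1000
q <- 0.95
-theta * log(1 - q)          # closed-form VaR: 2995.73
qexp(q, rate = 1 / theta)    # same value from qexp()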
The result reported in Example 10.2.1 can be generalized to any continuous rv having a strictly increasing cdf: the VaR of such a rv is simply the inverse of the corresponding cdf evaluated at q. Let us consider another example of a continuous rv whose support runs from negative infinity to positive infinity.
Example 10.2.2. VaR for the normal distribution. Consider an insurance loss rv X ∼ N ormal(µ, σ 2 )
with µ ∈ R and σ > 0. In this case, one may interpret the negative values of X as profit or revenue. Give a
closed-form expression for the VaR.
Show Example Solution
Solution.
Because the normal distribution is continuous, the VaR of X must satisfy
q = FX (V aRq [X])
= Pr [(X − µ)/σ ≤ (V aRq [X] − µ)/σ]
= Φ((V aRq [X] − µ)/σ).
Therefore, we have
V aRq [X] = Φ−1 (q) σ + µ.
In many insurance applications, we have to deal with transformations of rv's. For instance, in Example 10.2.2, the loss rv X ∼ N ormal(µ, σ 2 ) can be viewed as a linear transformation of a standard normal rv Z ∼ N ormal(0, 1), namely X = Zσ + µ. By setting µ = 0 and σ = 1, it is straightforward to check that V aRq [Z] = Φ^(−1)(q). A useful finding revealed by Example 10.2.2 is that the VaR of a linear transformation of a normal rv equals the same linear transformation of the VaR of the original rv. This finding can be further generalized to any rv as long as the transformation is strictly increasing. The next example highlights the usefulness of this finding.
Example 10.2.3. VaR for transformed variables. Consider an insurance loss rv Y ∼ lognormal(µ, σ 2 ),
for µ ∈ R and σ > 0. Give an expression of the V aR of Y in terms of the standard normal inverse cdf.
Show Example Solution
Solution.
Note that log Y ∼ N ormal(µ, σ 2 ); equivalently, let X ∼ N ormal(µ, σ 2 ), then Y =d e^X , which is a strictly increasing transformation. Here, the notation ‘=d ’ denotes equality in distribution. The VaR of Y is thus given by the exponential transformation of the VaR of X. Precisely, for q ∈ [0, 1],
V aRq [Y ] = e^(V aRq [X]) = e^(µ + Φ^(−1)(q) σ).
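A quick numerical check of this result (ours; µ, σ, and q are arbitrary illustrative values):
# VaR of a lognormal rv equals the exponentiated VaR of the underlying normal rv
mu <- 5; sigma <- 1; q <- 0.95
exp(mu + qnorm(q) * sigma)               # via the transformation result above
qlnorm(q, meanlog = mu, sdlog = sigma)   # same value from R's quantile function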
We have thus far seen a number of examples of the VaR for continuous rv's; let us consider an example concerning the VaR for a discrete rv.
Example 10.2.4. VaR for a discrete random variable. Consider an insurance loss rv with the following probability distribution:
Pr[X = x] = 0.75 for x = 1; 0.20 for x = 3; 0.05 for x = 4.
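The worked solution is not reproduced in this extract, but a short R sketch (ours) makes the behavior of the VaR at the jump points of this distribution easy to see:
# VaR for the three-point distribution above: the smallest x with F(x) >= q
x  <- c(1, 3, 4)
p  <- c(0.75, 0.20, 0.05)
Fx <- cumsum(p)
VaR <- function(q) x[which(Fx >= q)[1]]
VaR(0.75)      # 1
VaR(0.95)      # 3
VaR(0.950001)  # 4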
Let us now conclude this subsection with an open discussion of the VaR measure. Some advantages of utilizing VaR include
• possessing a practically meaningful interpretation;
• relatively simple to compute for many distributions with closed-form distribution functions;
• no additional assumption is required for the computation of VaR.
On the other hand, the limitations of VaR can be particularly pronounced for some risk management practices.
We report some of them herein:
• the selection of the confidence level q ∈ [0, 1] is highly subjective, while the VaR can be very sensitive to the choice of q (e.g., in Example 10.2.4, V aR0.95 [X] = 3 and V aR0.950001 [X] = 4);
• the scenarios/loss information above the q × 100% worst event are completely neglected;
• VaR is not a coherent risk measure (specifically, the VaR measure does not satisfy the subadditivity axiom, meaning that diversification benefits may not be fully reflected).
Recall that the VaR represents the q × 100% chance maximal loss. As we mentioned above, one major drawback of the VaR measure is that it does not reflect the extremal losses occurring beyond the q × 100% chance worst scenario. For illustration purposes, let us consider the following slightly unrealistic yet instructive example.
Example 10.2.5. Consider two loss rv's X ∼ Uniform[0, 100] and Y ∼ Exp(31.71). We use VaR at the 95% confidence level to measure the riskiness of X and Y . A simple calculation yields (see also Example 10.2.1) V aR0.95 [X] = 95 and V aR0.95 [Y ] = −31.71 log(0.05) ≈ 95, and thus these two loss distributions have the same level of risk according to V aR0.95 . However, it is clear that Y is more risky than X if extremal losses are of major concern, since X is bounded above while Y is unbounded. Simply quantifying risk by using VaR at a specific confidence level could be misleading and may not reflect the true nature of risk.
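These values are easily confirmed in R (our sketch):
# VaR at the 95% level for the two risks in Example 10.2.5
qunif(0.95, min = 0, max = 100)   # 95 for X ~ Uniform[0, 100]
qexp(0.95, rate = 1 / 31.71)      # approximately 95 for Y ~ Exp(31.71)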
As a remedy, the Tail Value-at-Risk (TVaR) was proposed to measure the extremal losses that are above
a given level of VaR as an average. We document the definition of TVaR in what follows. For the sake of
simplicity, we are going to confine ourselves to continuous positive rv’s only, which are more frequently used
in the context of insurance risk management. We refer the interested reader to Hardy (2006) for a more
comprehensive discussion of TVaR for both discrete and continuous rv’s.
Definition 10.4. Fix q ∈ [0, 1]. The Tail Value-at-Risk of a (continuous) rv X is formulated as
T V aRq [X] = [1/(1 − q)] ∫_{πq}^∞ x fX (x) dx,    (10.2)
where πq := V aRq [X] denotes the VaR at level q.
Example 10.2.6. TVaR for a normal distribution. Consider an insurance loss rv X ∼ N ormal(µ, σ 2 )
with µ ∈ R and σ > 0. Give an expression for TVaR.
Show Example Solution
Solution.
Let Z be the standard normal rv. For q ∈ [0, 1], the TVaR of X can be computed via
T V aRq [X] = E[X | X > V aRq [X]] =(1) σ T V aRq [Z] + µ,
where ‘=(1)’ holds because of the results reported in Example 10.2.2 (X = Zσ + µ and V aRq [X] = σ V aRq [Z] + µ). Next, we turn to study T V aRq [Z] = E[Z | Z > V aRq [Z]]. Letting ω(q) = (Φ^(−1)(q))²/2, we have
(1 − q) T V aRq [Z] = ∫_{Φ^(−1)(q)}^∞ z (1/√(2π)) e^(−z²/2) dz
= ∫_{ω(q)}^∞ (1/√(2π)) e^(−x) dx
= (1/√(2π)) e^(−ω(q))
= φ(Φ^(−1)(q)).
Thus,
T V aRq [X] = σ φ(Φ^(−1)(q))/(1 − q) + µ.
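A simulation check of this formula (our sketch; µ, σ, and q are arbitrary illustrative values):
# Normal TVaR: empirical tail mean versus the closed form derived above
set.seed(2018)
mu <- 100; sigma <- 25; q <- 0.95
x <- rnorm(1e6, mu, sigma)
mean(x[x > qnorm(q, mu, sigma)])         # empirical TVaR
mu + sigma * dnorm(qnorm(q)) / (1 - q)   # closed form: mu + sigma*phi(Phi^-1(q))/(1-q)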
We mentioned in the previous subsection that the VaR of a strictly increasing function of a rv is equal to the function of the VaR of the original rv. Motivated by the results in Example 10.2.6, one can show that the TVaR of a strictly increasing linear transformation of a rv is equal to the same linear transformation of the TVaR of the original rv; this is due to the linearity of expectations. However, this finding cannot be extended to non-linear functions. The following example of a lognormal rv serves as a counterexample.
Example 10.2.7. TVaR of a lognormal distribution. Consider an insurance loss rv X ∼
lognormal(µ, σ 2 ), with µ ∈ R and σ > 0. Show that
T V aRq [X] = [e^(µ+σ²/2)/(1 − q)] Φ(σ − Φ^(−1)(q)).
T V aRq [X] = [1/(1 − q)] ∫_{πq}^∞ x fX (x) dx
= [1/(1 − q)] ∫_{πq}^∞ [1/(σ√(2π))] exp{−(log x − µ)²/(2σ²)} dx
=(1) [1/(1 − q)] ∫_{ω(q)}^∞ (1/√(2π)) e^(−w²/2 + σw + µ) dw
= [e^(µ+σ²/2)/(1 − q)] ∫_{ω(q)}^∞ (1/√(2π)) e^(−(w−σ)²/2) dw
= [e^(µ+σ²/2)/(1 − q)] Φ(σ − ω(q)),    (10.3)
where ‘=(1)’ holds by applying the change of variable w = (log x − µ)/σ, and ω(q) = (log πq − µ)/σ. Invoking the formula for the VaR of a lognormal rv reported in Example 10.2.3, we have ω(q) = Φ^(−1)(q), so expression (10.3) simplifies to
T V aRq [X] = [e^(µ+σ²/2)/(1 − q)] Φ(σ − Φ^(−1)(q)).
Clearly, the TVaR of the lognormal rv is not the exponential of the TVaR of the normal rv.
For distributions whose distribution functions are more tractable to work with, we may apply the integration by parts technique to rewrite equation (10.2) as
T V aRq [X] = [1/(1 − q)] [ −x SX (x) |_{πq}^∞ + ∫_{πq}^∞ SX (x) dx ]
= πq + [1/(1 − q)] ∫_{πq}^∞ SX (x) dx.
For instance, for the exponential loss rv X ∼ Exp(θ) of Example 10.2.1, πq = −θ log(1 − q) and
T V aRq [X] = πq + [1/(1 − q)] ∫_{πq}^∞ e^(−x/θ) dx
= πq + θ e^(−πq/θ)/(1 − q)
= πq + θ.
It can also be helpful to express the TVaR in terms of limited expected values. Specifically, we have
T V aRq [X] = [1/(1 − q)] ∫_{πq}^∞ (x − πq + πq ) fX (x) dx
= πq + [1/(1 − q)] ∫_{πq}^∞ (x − πq ) fX (x) dx
= πq + eX (πq )
= πq + (E[X] − E[X ∧ πq ])/(1 − q),    (10.4)
where eX (d) := E[X − d|X > d] for d > 0 denotes the mean excess loss function. For many commonly
used parametric distributions, the formulas for calculating E[X] and E[X ∧ πq ] can be found in a table of
distributions.
Example 10.2.9. TVaR of the Pareto distribution. Consider a loss rv X ∼ Pareto(θ, α) with θ > 0 and α > 0. The cdf of X is given by
FX (x) = 1 − (θ/(θ + x))^α, for x > 0,
so that
πq = θ[(1 − q)^(−1/α) − 1].    (10.5)
Since E[X] = θ/(α − 1) for α > 1, formula (10.4) gives
T V aRq [X] = πq + [θ/(α − 1)] (θ/(θ + πq ))^(α−1) / (θ/(θ + πq ))^α
= πq + [θ/(α − 1)] (πq + θ)/θ
= πq + (πq + θ)/(α − 1).
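As a numerical illustration (ours; α, θ, and q are arbitrary), the Pareto TVaR can be computed both from the closed form above and from equation (10.4) using the limited-expected-value function in the actuar package:
# Pareto TVaR: equation (10.4) versus the closed form above
library(actuar)
alpha <- 3; theta <- 1000; q <- 0.95
piq  <- theta * ((1 - q)^(-1/alpha) - 1)                  # VaR, equation (10.5)
EX   <- mpareto(1, shape = alpha, scale = theta)          # theta/(alpha - 1)
EXpi <- levpareto(piq, shape = alpha, scale = theta, order = 1)
piq + (EX - EXpi) / (1 - q)         # TVaR via (10.4)
piq + (piq + theta) / (alpha - 1)   # TVaR via the closed form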
An alternative and more intuitive expression for TVaR is
T V aRq [X] = [1/(1 − q)] ∫_q^1 V aRα [X] dα.    (10.6)
What this alternative formula (10.6) tells us is that TVaR is in fact the average of V aRα [X] over the varying confidence levels α ∈ [q, 1]. Therefore, the TVaR effectively resolves most of the limitations of VaR outlined in the previous subsection. First, due to the averaging effect, the TVaR may be less sensitive to the change of confidence level than the VaR. Second, all the extremal losses that are above the (1 − q) × 100% worst probable event are taken into account.
In this respect, it is a simple matter to see that T V aRq [X] ≥ V aRq [X] for any given q ∈ [0, 1]. Third and perhaps foremost, TVaR is a coherent risk measure and thus is able to capture the diversification effects within an insurance portfolio more accurately. Herein, we do not intend to provide the proof of the coherence of TVaR, which is technically challenging.
To compare the magnitude of risk in a practically convenient manner, we aim to seek a function that maps the loss rv of interest to a numerical value indicating its level of riskiness; such a function is termed a risk measure. Put mathematically, denoting by X a set of insurance loss rv's, a risk measure is a functional map H : X → R+. In principle, risk measures can admit an unlimited number of functional forms. Classical examples of risk measures include the mean E[X], the standard deviation SD(X) := √(Var(X)), the standard deviation principle
HSD (X) := E[X] + α SD(X), for α ≥ 0,    (10.7)
and so forth. Among other things, a good risk measure should be
• interpretable practically;
• able to reflect the most critical information about the risk underpinning the loss distribution.
A vast number of risk measures have been developed in the actuarial literature. Unfortunately, there is no single best risk measure that outperforms the others; the selection of an appropriate risk measure depends mainly on the application at hand. In this respect, it is imperative to emphasize that ‘risk’ is a subjective concept, and thus, even for the same problem, there are multifarious approaches to assessing risk. However, for many risk management applications, there is wide agreement that economically sound risk measures should satisfy four major axioms, which we describe in detail next. Risk measures that satisfy these axioms are termed coherent risk measures.
Consider in what follows a risk measure H(·); H is a coherent risk measure if the following axioms are satisfied.
• Axiom 1. Subadditivity: H(X + Y ) ≤ H(X) + H(Y ). The economic implication of this axiom is that diversification benefits exist if different risks are combined.
• Axiom 2. Monotonicity: if Pr[X ≤ Y ] = 1, then H(X) ≤ H(Y ). Recalling that X and Y are rv's representing losses, the underlying economic implication is that higher losses lead to a higher level of risk.
• Axiom 3. Positive homogeneity: H(cX) = cH(X) for any positive constant c. A potential economic implication of this axiom is that the risk measure should be independent of the monetary units in which the risk is measured. For example, let c be the currency exchange rate between US and Canadian dollars; then the risk of random losses measured in US dollars (i.e., X) and in Canadian dollars (i.e., cX) should differ only by the exchange rate c (i.e., cH(X) = H(cX)).
• Axiom 4. Translation invariance: H(X + c) = H(X) + c for any positive constant c. If the constant c is interpreted as risk-free cash, this axiom tells us that no additional risk is created by adding cash to an insurance portfolio, and injecting risk-free capital of c can only reduce the risk by the same amount.
Verifying the coherent properties for some risk measures can be quite straightforward, but it can be very
challenging sometimes. For example, it is a simple matter to check that the mean is a coherent risk measure
since for any pair of rv’s X and Y having finite means and constant c > 0,
• validation of subadditivity: E[X + Y ] = E[X] + E[Y ];
• validation of monotonicity: if Pr[X ≤ Y ] = 1, then E[X] ≤ E[Y ];
• validation of positive homogeneity: E[cX] = cE[X];
• validation of translation invariance: E[X + c] = E[X] + c
On a different note, the standard deviation is not a coherent risk measure. Specifically, one can check that
the standard deviation satisfies
• validation of subadditivity:
SD(X + Y ) = √(Var(X) + Var(Y ) + 2 Cov(X, Y )) ≤ √(SD(X)² + SD(Y )² + 2 SD(X)SD(Y )) = SD(X) + SD(Y );
• validation of positive homogeneity: SD[cX] = c SD[X].
However, the standard deviation does not comply with translation invariance property as for any positive
constant c,
SD(X + c) = SD(X) < SD(X) + c.
Moreover, the standard deviation also does not satisfy the monotonicity property. To see this, consider the
following two rv's:
X = { 0, with probability 0.25; 4, with probability 0.75 },    (10.8)
and
Pr[Y = 4] = 1.    (10.9)
It is easy to check that Pr[X ≤ Y ] = 1, but SD(X) = √(4² · 0.25 · 0.75) = √3 > SD(Y ) = 0.
We have so far checked that E[·] is a coherent risk measure, but SD(·) is not. Let us now study the coherence of the standard deviation principle (10.7), which is a linear combination of a coherent and an incoherent risk measure. To this end, for a given α > 0, we check the four axioms for HSD one by one:
• validation of subadditivity: HSD (X + Y ) = E[X + Y ] + α SD(X + Y ) ≤ E[X] + E[Y ] + α[SD(X) + SD(Y )] = HSD (X) + HSD (Y ), using the subadditivity of SD shown above;
• validation of positive homogeneity: HSD (cX) = c E[X] + α c SD(X) = c HSD (X);
• validation of translation invariance: HSD (X + c) = E[X] + c + α SD(X) = HSD (X) + c.
It only remains to verify the monotonicity property, which may or may not be satisfied depending on the value of α. To see this, consider again the setup of (10.8) and (10.9), in which Pr[X ≤ Y ] = 1. Let α = 0.1 √3; then HSD (X) = 3 + 0.3 = 3.3 < HSD (Y ) = 4 and the monotonicity condition is met. On the other hand, let α = √3; then HSD (X) = 3 + 3 = 6 > HSD (Y ) = 4 and the monotonicity condition is not satisfied. More precisely, by setting
HSD (X) = 3 + α√3 ≤ 4 = HSD (Y ),
we find that the monotonicity condition is satisfied in this example only for 0 ≤ α ≤ 1/√3, and thus the standard deviation principle HSD is not coherent in general. This result is intuitive since the standard deviation principle HSD is a linear combination of two risk measures, one coherent and the other incoherent. If α ≤ 1/√3, the coherent component dominates the incoherent one and monotonicity holds in this example, and vice versa.
10.3 Reinsurance
Recall that reinsurance is simply insurance purchased by an insurer. Insurance purchased by non-insurers is
sometimes known as primary insurance to distinguish it from reinsurance. Reinsurance differs from personal insurance purchased by individuals, such as auto and homeowners insurance, in contract flexibility. Like insurance purchased by major corporations, reinsurance programs are generally tailored more closely to the buyer. In contrast, in personal insurance buyers typically cannot negotiate the contract terms, although they may have a variety of different options (contracts) from which to choose.
The two broad types are proportional and non-proportional reinsurance. A proportional reinsurance contract
is an agreement between a reinsurer and a ceding company (also known as the reinsured) in which the
reinsurer assumes a given percent of losses and premium. A reinsurance contract is also known as a treaty.
Non-proportional agreements are simply everything else. As examples of non-proportional agreements, this
chapter focuses on stop-loss and excess of loss contracts. For all types of agreements, we split the total risk
S into the portion taken on by the reinsurer, Yreinsurer , and that retained by the insurer, Yinsurer , that is,
S = Yinsurer + Yreinsurer .
The mathematical structure of a basic reinsurance treaty is the same as the coverage modifications of personal
insurance introduced in Chapter 3. For a proportional reinsurance, the transformation Yinsurer = cS is
identical to a coinsurance adjustment in personal insurance. For stop-loss reinsurance, the transformation
Yreinsurer = max(0, S − M ) is the same as an insurer’s payment with a deductible M and Yinsurer =
min(S, M ) = S ∧ M is equivalent to what a policyholder pays with deductible M . For practical applications of the mathematics, in personal insurance the focus is generally upon the expectation, as this is a key ingredient used in pricing. In contrast, for reinsurance the focus is on the entire distribution of the risk, as the extreme events are a primary concern for the financial stability of the insurer and reinsurer.
This chapter describes the foundational and most basic of reinsurance treaties: Section 10.3.1 for proportional
and Section 10.3.2 for non-proportional. Section 10.3.3 gives a flavor of more complex contracts.
• In a quota share treaty, the reinsurer receives a flat percent, say 50%, of the premium for the book of
business reinsured.
• In exchange, the reinsurer pays 50% of losses, including allocated loss adjustment expenses
• The reinsurer also pays the ceding company a ceding commission which is designed to reflect the
differences in underwriting expenses incurred.
The amounts paid by the direct insurer and the reinsurer are summarized as Yinsurer = cS and Yreinsurer = (1 − c)S, where c is the proportion of losses retained by the ceding insurer.
Example 10.3.1. Distribution of losses under quota share. To develop intuition for the effect of
quota-share agreement on the distribution of losses, the following is a short R demonstration using simulation.
Note the relative shapes of the distributions of total losses, the retained portion (of the insurer), and the
reinsurer’s portion.
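The simulated total loss S and the scale parameter theta are created earlier in the book's demonstration and are not included in this extract; the following minimal setup (ours; the Pareto severity and its parameters are assumptions chosen purely so that the plotting code below runs) can stand in for it.
# Hypothetical setup so the quota-share plots below can be reproduced
library(actuar)
set.seed(2018)
theta <- 1000                                   # assumed scale parameter
S <- rpareto(10000, shape = 3, scale = theta)   # assumed total loss distribution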
par(mfrow=c(1,3))
plot(density(S), xlim=c(0,3*theta), main="Total Loss", xlab="Losses")
plot(density(0.75*S), xlim=c(0,3*theta), main="Insurer (75%)", xlab="Losses")
plot(density(0.25*S), xlim=c(0,3*theta), main="Reinsurer (25%)", xlab="Losses")
The quota share contract is particularly desirable for the reinsurer. To see this, suppose that an insurer and reinsurer wish to enter a contract to share total losses S such that Yinsurer = g(S) and Yreinsurer = S − g(S), for some generic function g(·) (known as the retention function). Suppose further that the insurer only cares about the variability of retained claims and is indifferent to the choice of g as long as V ar Yinsurer stays the same and equals, say, Q. Then, the following result shows that the quota share reinsurance treaty minimizes the reinsurer's uncertainty as measured by V ar Yreinsurer .
Proposition. Suppose that V ar Yinsurer = Q. Then, V ar((1 − c)S) ≤ V ar(g(S)) for all g(.).
Show the Justification of the Proposition
Proof of the Proposition. With Yreinsurer = S − Yinsurer and the law of total variation
The proposition is intuitively appealing - with quota share insurance, the reinsurer shares the responsibility
for very large claims in the tail of the distribution. This is in contrast to non-proportional agreements where
reinsurers take responsibility for the very large claims.
Now assume there are n risks in the portfolio, X1 , . . . , Xn , so that the portfolio sum is S = X1 + · · · + Xn . For simplicity, we focus on the case of independent risks. Let us consider a variation of the basic quota share agreement in which the amount retained by the insurer may vary with each risk, say ci . Thus, the insurer's portion of the portfolio risk is Yinsurer = Σ_{i=1}^n ci Xi . What is the best choice of the proportions ci ?
To formalize this question, we seek to find those values of ci that minimize V ar Yinsurer subject to the constraint that E Yinsurer = K. The requirement that E Yinsurer = K means that the insurer wishes to retain revenue of at least the amount K. Subject to this revenue constraint, the insurer wishes to minimize the uncertainty of the retained risks as measured by the variance.
Show the Optimal Retention Proportions
The Optimal Retention Proportions
Minimizing V ar Yinsurer subject to E Yinsurer = K is a constrained optimization problem; we can use the method of Lagrange multipliers, a calculus technique, to solve it. To this end, define the Lagrangian
L = Var(Yinsurer ) − λ(E Yinsurer − K) = Σ_{i=1}^n ci² Var Xi − λ(Σ_{i=1}^n ci E Xi − K).
Taking the partial derivative with respect to λ and setting it equal to zero simply means that the constraint, E Yinsurer = K, is enforced, and we have to choose the proportions ci to satisfy this constraint. Moreover, taking the partial derivative with respect to each proportion ci yields
∂L/∂ci = 2 ci Var Xi − λ E Xi = 0,
so that
ci = λ E Xi / (2 Var Xi ).
From the mathematics, it turns out that the constant for the ith risk, ci , is proportional to E Xi / Var Xi . This is intuitively appealing: other things being equal, a higher revenue as measured by E Xi means a higher value of ci , and a higher level of uncertainty as measured by V ar Xi means a lower value of ci . The proportionality constant is determined by the revenue requirement E Yinsurer = K. The following example helps to develop a feel for this relationship.
Example 10.3.2. Three Pareto risks. Consider three risks that have a Pareto distribution. Provide a
graph, and supporting code, that give values of c1 , c2 , and c3 for a required revenue K. Note that these
values increase linearly with K.
Show an Example with Three Pareto Risks
theta1 = 1000;theta2 = 2000;theta3 = 3000;
alpha1 = 3;alpha2 = 3;alpha3 = 4;
library(actuar)
propnfct <- function(alpha,theta){
  # ratio E(X)/Var(X) for a Pareto(alpha, theta) risk, to which c_i is proportional
  mu  <- mpareto(shape=alpha, scale=theta, order=1)
  var <- mpareto(shape=alpha, scale=theta, order=2) - mu^2
  mu / var
}
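The remainder of the example's code is not included in this extract. Assuming propnfct above returns E(Xi )/Var(Xi ), a minimal sketch (ours) of how the proportions ci could be computed and plotted against the required revenue K is:
# Optimal proportions c_i = (lambda/2) * E(X_i)/Var(X_i), with lambda/2 chosen so
# that the revenue constraint sum_i c_i E(X_i) = K holds; c_i grows linearly in K.
fac   <- c(propnfct(alpha1, theta1), propnfct(alpha2, theta2), propnfct(alpha3, theta3))
means <- c(mpareto(1, shape = alpha1, scale = theta1),
           mpareto(1, shape = alpha2, scale = theta2),
           mpareto(1, shape = alpha3, scale = theta3))
Kgrid <- seq(500, 3000, by = 100)
cmat  <- sapply(Kgrid, function(K) (K / sum(fac * means)) * fac)
matplot(Kgrid, t(cmat), type = "l", lty = 1, xlab = "Required revenue K",
        ylab = expression(c[i]), main = "Retention proportions")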
Under a stop loss arrangement, the insurer sets a retention level M (> 0) and pays in full total claims for
which S ≤ M . Thus, the insurer retains an amount M of the risk. Further, for claims for which S > M , the
direct insurer pays M and the reinsurer pays the remaining amount S − M . Summarizing, the amounts paid
by the direct insurer and the reinsurer are
Yinsurer = { S for S ≤ M ; M for S > M } = min(S, M ) = S ∧ M
and
Yreinsurer = { 0 for S ≤ M ; S − M for S > M } = max(0, S − M ).
because E g(S) = K.
Now, for any retention function, we have g(S) ≤ S, that is, the insurer’s retained claims are less than or
equal to total claims. Using the notation gSL (S) = S ∧ M for stop loss insurance, we have
M − gSL (S) = M − (S ∧ M )
= (M − S) ∧ 0
≤ (M − g(S)) ∧ 0.
Squaring each side yields
Excess of Loss
A closely related form of non-proportional reinsurance is the excess of loss coverage. Under this contract,
we assume that the total risk S can be thought of as composed of n separate risks X1 , . . . , Xn and that each of these risks is subject to an upper limit, say, Mi . So the insurer retains
Yi,insurer = Xi ∧ Mi ,   Yinsurer = Σ_{i=1}^n Yi,insurer ,
and the reinsurer is responsible for the excess, Yreinsurer = S − Yinsurer . The retention limits may vary by
risk or may be the same for all risks, Mi = M , for all i.
What is the best choice of the excess of loss retention limits Mi ? To formalize this question, we seek to find
those values of Mi that minimize V ar Yinsurer subject to the constraint that E Yinsurer = K. Subject to this
revenue constraint, the insurer wishes to minimize uncertainty of the retained risks (as measured by the
variance).
Show the Optimal Retention Limits
The Optimal Retention Limits
Minimizing V ar Yinsurer subject to E Yinsurer = K is a constrained optimization problem; we can use the method of Lagrange multipliers, a calculus technique, to solve it. As before, define the Lagrangian, recalling that
E[S ∧ M ] = ∫_0^M (1 − F (x)) dx
and
E[(S ∧ M )²] = 2 ∫_0^M x (1 − F (x)) dx.
Taking the partial derivative with respect to λ and setting it equal to zero simply means that the constraint, E Yinsurer = K, is enforced, and we have to choose the limits Mi to satisfy this constraint. Moreover, taking the partial derivative with respect to each limit Mi yields
∂L/∂Mi = ∂/∂Mi Var(Xi ∧ Mi ) − λ ∂/∂Mi E(Xi ∧ Mi )
= ∂/∂Mi [E((Xi ∧ Mi )²) − (E(Xi ∧ Mi ))²] − λ(1 − Fi (Mi )).
Setting ∂L/∂Mi = 0 and solving for λ, we get
λ/2 = Mi − E(Xi ∧ Mi ).
From the math, it turns out that the retention limit less the expected insurer’s claims, Mi − E (Xi ∧ Mi ), is
the same for all risks. This is intuitively appealing.
Example 10.3.3. Excess of loss for three Pareto risks. Consider three risks that have a Pareto distribution, each with a different set of parameters (so they are independent but non-identical). Show numerically that at the optimal retention limits M1 , M2 , and M3 , the retention limit minus the expected insurer's claims, Mi − E (Xi ∧ Mi ), is the same for all risks, as we derived theoretically. Further, graphically compare the distribution of total risks to that retained by the insurer and by the reinsurer.
Show an Example with Three Pareto Risks
We first optimize the Lagrangian using the R package alabama for Augmented Lagrangian Adaptive Barrier
Minimization Algorithm.
theta1 = 1000;theta2 = 2000;theta3 = 3000;
alpha1 = 3; alpha2 = 3; alpha3 = 4;
Pmin <- 2000
library(actuar)
VarFct <- function(M){
M1=M[1];M2=M[2];M3=M[3]
mu1 <- levpareto(limit=M1,shape=alpha1, scale=theta1, order=1)
var1 <- levpareto(limit=M1,shape=alpha1, scale=theta1, order=2)-mu1^2
mu2 <- levpareto(limit=M2,shape=alpha2, scale=theta2, order=1)
var2 <- levpareto(limit=M2,shape=alpha2, scale=theta2, order=2)-mu2^2
mu3 <- levpareto(limit=M3,shape=alpha3, scale=theta3, order=1)
var3 <- levpareto(limit=M3,shape=alpha3, scale=theta3, order=2)-mu3^2
varFct <- var1 +var2+var3
meanFct <- mu1+mu2+mu3
c(meanFct,varFct)
}
f <- function(M){VarFct(M)[2]}
h <- function(M){VarFct(M)[1] - Pmin}
library(alabama)
par0=rep(1000,3)
op <- auglag(par=par0,fn=f,hin=h,control.outer=list(trace=FALSE))
At the optimal retention limits M1 , M2 , and M3 , the retention limit minus the expected insurer's claims, Mi − E (Xi ∧ Mi ), is indeed the same for all risks, as we derived theoretically.
M1star = op$par[1];M2star = op$par[2];M3star = op$par[3]
M1star -levpareto(M1star,shape=alpha1, scale=theta1,order=1)
[1] 1344.135
M2star -levpareto(M2star,shape=alpha2, scale=theta2,order=1)
[1] 1344.133
M3star -levpareto(M3star,shape=alpha3, scale=theta3,order=1)
[1] 1344.133
We graphically compare the distribution of total risks to that retained by the insurer and by the reinsurer.
set.seed(2018)
nSim = 10000
library(actuar)
Y1 <- rpareto(nSim, shape = alpha1, scale = theta1)
Y2 <- rpareto(nSim, shape = alpha2, scale = theta2)
Y3 <- rpareto(nSim, shape = alpha3, scale = theta3)
YTotal <- Y1 + Y2 + Y3
Yinsur <- pmin(Y1,M1star)+pmin(Y2,M2star)+pmin(Y3,M3star)
Yreinsur <- YTotal - Yinsur
par(mfrow=c(1,3))
plot(density(YTotal), xlim=c(0,10000), main="Total Loss", xlab="Losses")
plot(density(Yinsur), xlim=c(0,10000), main="Insurer", xlab="Losses")
plot(density(Yreinsur), xlim=c(0,10000), main="Reinsurer", xlab="Losses")
Another proportional treaty is known as surplus share; this type of contract is common in commercial
property insurance.
• A surplus share treaty allows the reinsured to limit its exposure on any one risk to a given amount (the
retained line).
• The reinsurer assumes a part of the risk in proportion to the amount that the insured value exceeds the
retained line, up to a given limit (expressed as a multiple of the retained line, or number of lines).
For example, let the retained line be $100,000 and let the given limit be 4 lines ($400,000). Then, if S is the
loss, the reinsurer’s portion is min(400000, (S − 100000)+ ).
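To make the surplus share split concrete, here is a small R sketch (ours, not the text's) of the reinsurer's and reinsured's portions for the numbers quoted above:
# Surplus share with a retained line of 100,000 and a limit of 4 lines (400,000)
retained_line <- 1e5
limit <- 4 * retained_line
reinsurer <- function(S) pmin(limit, pmax(S - retained_line, 0))
reinsured <- function(S) S - reinsurer(S)
S <- c(50000, 250000, 800000)
cbind(S, reinsurer = reinsurer(S), reinsured = reinsured(S))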
Layers of Coverage
One can also extend non-proportional stop loss treaties by introducing additional parties to the contract.
For example, instead of simply an insurer and reinsurer or an insurer and a policyholder, think about the
situation with all three parties, a policyholder, insurer, and reinsurer, who agree on how to share a risk. More
generally, we consider k parties. If k = 4, it could be an insurer and three different reinsurers.
Example 10.3.4. Layers of coverage for three parties.
• Suppose that there are k = 3 parties. The first party is responsible for the first 100 of claims, the second is responsible for claims from 100 to 3000, and the third is responsible for claims above 3000.
• If there are four claims in the amounts 50, 600, 1800 and 4000, then they would be allocated to the parties as follows:

Claim    Party 1    Party 2    Party 3    Total
50       50         0          0          50
600      100        500        0          600
1800     100        1700       0          1800
4000     100        2900       1000       4000
To handle the general situation with k groups, partition the positive real line into k intervals using the
cut-points
0 = M0 < M1 < · · · < Mk−1 < Mk = ∞.
Note that the jth interval is (Mj−1 , Mj ]. Now let Yj be the amount of risk shared by the jth party. To
illustrate, if a loss x is such that Mj−1 < x ≤ Mj , then
Y1 = M1 − M0 ,   Y2 = M2 − M1 ,   . . . ,   Yj−1 = Mj−1 − Mj−2 ,   Yj = x − Mj−1 ,   Yj+1 = · · · = Yk = 0.
With the expression Yj = min(S, Mj ) − min(S, Mj−1 ), we see that the jth party is responsible for claims in
the interval (Mj−1 , Mj ]. With this, it is easy to check that S = Y1 + Y2 + · · · + Yk . As emphasized in the
following example, we also remark that the parties need not be different.
Example 10.3.5.
• Suppose that a policyholder is responsible for the first 100 of claims and all claims in excess of 100,000. The insurer takes claims between 100 and 100,000.
• Then, we would use M1 = 100 and M2 = 100,000.
• The policyholder is responsible for Y1 = min(S, 100) and Y3 = S − min(S, 100000) = max(0, S − 100000).
For additional reading, see the Wisconsin Property Fund site for more information on layers of reinsurance.
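A short R sketch (ours) of the layer allocation Yj = min(S, Mj ) − min(S, Mj−1 ), applied to the three-party example above:
# Allocate claims to layers defined by cut points 0 = M0 < M1 < ... < Mk = Inf
layer_split <- function(S, cuts) {
  M <- c(0, cuts, Inf)
  sapply(seq_len(length(M) - 1), function(j) pmin(S, M[j + 1]) - pmin(S, M[j]))
}
claims <- c(50, 600, 1800, 4000)
layer_split(claims, cuts = c(100, 3000))   # columns: party 1, party 2, party 3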
Many other variations of the foundational contracts are possible. For one more illustration, consider the
following.
Example 10.3.6. Portfolio management. You are the Chief Risk Officer of a telecommunications firm. Your firm has several property and liability risks. We will consider:
• X1 - buildings, modeled using a gamma distribution with mean 200 and scale parameter 100.
• X2 - motor vehicles, modeled using a gamma distribution with mean 400 and scale parameter 200.
• X3 - directors and executive officers risk, modeled using a Pareto distribution with mean 1000 and scale
parameter 1000.
• X4 - cyber risks, modeled using a Pareto distribution with mean 1000 and scale parameter 2000.
Denote the total risk as
S = X1 + X2 + X3 + X4 .
Suppose that the insurance arrangement is such that your retained risk is Yretained = S − Yinsurer = min(X1 , M1 ) + min(X2 , M2 ), using deductibles M1 = 100 and M2 = 200. With this arrangement:
a. Determine the expected claim amount of (i) that retained, (ii) that accepted by the insurer, and (iii)
the total overall amount.
b. Determine the 80th, 90th, 95th, and 99th percentiles for (i) that retained, (ii) that accepted by the
insurer, and (iii) the total overall amount.
c. Compare the distributions by plotting the densities for (i) that retained, (ii) that accepted by the
insurer, and (iii) the total overall amount.
Show Example Solution with R Code
In preparation, here is the code needed to set the parameters.
# For the gamma distributions, use
alpha1 <- 2; theta1 <- 100
alpha2 <- 2; theta2 <- 200
# For the Pareto distributions, use
alpha3 <- 2; theta3 <- 1000
alpha4 <- 3; theta4 <- 2000
# Limits
M1 <- 100
M2 <- 200
With these parameters, we can now simulate realizations of the portfolio risks.
# Simulate the risks
nSim <- 10000 #number of simulations
set.seed(2017) #set seed to reproduce work
X1 <- rgamma(nSim,alpha1,scale = theta1)
X2 <- rgamma(nSim,alpha2,scale = theta2)
# For the Pareto Distribution, use
library(actuar)
X3 <- rpareto(nSim,scale=theta3,shape=alpha3)
X4 <- rpareto(nSim,scale=theta4,shape=alpha4)
# Portfolio Risks
S <- X1 + X2 + X3 + X4
Yretained <- pmin(X1,M1) + pmin(X2,M2)
Yinsurer <- S - Yretained
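The remainder of the solution is not included in this extract; a minimal sketch (ours) of parts (a)-(c), using the simulated values above, is:
# (a) expected amounts, (b) selected percentiles, (c) density comparison
risks <- cbind(Retained = Yretained, Insurer = Yinsurer, Total = S)
colMeans(risks)
apply(risks, 2, quantile, probs = c(0.80, 0.90, 0.95, 0.99))
par(mfrow = c(1, 3))
for (j in 1:3) plot(density(risks[, j]), main = colnames(risks)[j], xlab = "Losses")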
Chapter 11
Loss Reserving

Chapter 12
Experience Rating Using Bonus-Malus
Chapter 13
Data Systems
Chapter Preview. This chapter covers the learning areas on data and systems outlined in the IAA (International
Actuarial Association) Education Syllabus published in September 2015.
13.1 Data
In terms of how data are collected, data can be divided into two types (Hox and Boeije, 2005): primary
data and secondary data. Primary data are original data that are collected for a specific research problem.
Secondary data are data originally collected for a different purpose and reused for another research problem.
A major advantage of using primary data is that the theoretical constructs, the research design, and the
data collection strategy can be tailored to the underlying research question to ensure that the data collected
indeed help to solve the problem. A disadvantage of using primary data is that data collection can be costly
and time-consuming. Using secondary data has the advantage of lower cost and faster access to relevant
information. However, using secondary data may not be optimal for the research question under consideration.
In terms of the degree of organization of the data, data can be also divided into two types (Inmon and
Linstedt, 2014; O’Leary, 2013; Hashem et al., 2015; Abdullah and Ahmad, 2013; Pries and Dunnigan, 2015):
structured data and unstructured data. Structured data have a predictable and regularly occurring format.
In contrast, unstructured data are unpredictable and have no structure that is recognizable to a computer.
Structured data consists of records, attributes, keys, and indices and are typically managed by a database
management system (DBMS) such as IBM DB2, Oracle, MySQL, and Microsoft SQL Server. As a result,
most units of structured data can be located quickly and easily. Unstructured data have many different forms
and variations. One common form of unstructured data is text. Accessing unstructured data is clumsy: to find a given unit of data in a long text, for example, a sequential search is usually performed.
In terms of how the data are measured, data can be classified as qualitative or quantitative. Qualitative data
is data about qualities, which cannot be actually measured. As a result, qualitative data is extremely varied
in nature and includes interviews, documents, and artifacts (Miles et al., 2014). Quantitative data is data
about quantities, which can be measured numerically with numbers. In terms of the level of measurement,
quantitative data can be further classified as nominal, ordinal, interval, or ratio (Gan, 2011). Nominal data,
also called categorical data, are discrete data without a natural ordering. Ordinal data are discrete data with
a natural order. Interval data are continuous data with a specific order and equal intervals. Ratio data are
interval data with a natural zero.
There exist a number of data sources. First, data can be obtained from university-based researchers who collect primary data. Second, data can be obtained from organizations that are set up for the purpose of releasing secondary data to the general research community. Third, data can be obtained from national and regional statistical institutes that collect data. Finally, companies hold corporate data that can be obtained for research purposes.
While it might be difficult to obtain data to address a specific research problem or answer a business question, it is relatively easy to obtain data to test a model or an algorithm for data analysis. Nowadays, readers can easily obtain datasets from the Internet. The following is a list of some websites that provide real-world data:
• UCI Machine Learning Repository This website (url: http://archive.ics.uci.edu/ml/index.php)
maintains more than 400 datasets that can be used to test machine learning algorithms.
• Kaggle The Kaggle website (url: https://www.kaggle.com/) includes real-world datasets used for data science competitions. Readers can download data from Kaggle by registering an account.
• DrivenData DrivenData aims to bring cutting-edge practices in data science to bear on some of the world's biggest social challenges. On its website (url: https://www.drivendata.org/), readers can participate in data science competitions and download datasets.
• Analytics Vidhya This website (url: https://datahack.analyticsvidhya.com/contest/all/) allows you to participate in and download datasets from practice problems and hackathon problems.
• KDD Cup KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by the ACM Special Interest Group on Knowledge Discovery and Data Mining. This website (url: http://www.kdd.org/kdd-cup) contains the datasets used in past KDD Cup competitions since 1997.
• U.S. Government’s open data This website (url: https://www.data.gov/) contains about 200,000
datasets covering a wide range of areas including climate, education, energy, and finance.
• AWS Public Datasets In this website (url: https://aws.amazon.com/datasets/), Amazon provides a
centralized repository of public datasets, including some huge datasets.
As mentioned in the previous subsection, there are structured data as well as unstructured data. Structured data are highly organized and usually have the following tabular format:

        V1     V2     · · ·    Vd
x1      x11    x12    · · ·    x1d
x2      x21    x22    · · ·    x2d
...     ...    ...    · · ·    ...
xn      xn1    xn2    · · ·    xnd

In other words, structured data can be organized into a table consisting of rows and columns, where typically each row represents a record and each column represents an attribute. A table can be decomposed into several tables that can be stored in a relational database such as Microsoft SQL Server. SQL (Structured Query Language) can then be used to access and modify the data easily and efficiently.
Unstructured data do not follow a regular format (Abdullah and Ahmad, 2013). Examples of unstructured
data include documents, videos, and audio files. Most of the data we encounter are unstructured data. In
fact, the term “big data” was coined to reflect this fact. Traditional relational databases cannot meet the
challenges on the varieties and scales brought by massive unstructured data nowadays. NoSQL databases
have been used to store massive unstructured data.
There are three main NoSQL databases (Chen et al., 2014): key-value databases, column-oriented databases,
and document-oriented databases. Key-value databases use a simple data model and store data according to
key-values. Modern key-value databases have higher expandability and smaller query response time than
relational databases. Examples of key-value databases include Dynamo used by Amazon and Voldemort used
by LinkedIn. Column-oriented databases store and process data according to columns rather than rows. The
columns and rows are segmented in multiple nodes to achieve expandability. Examples of column-oriented
databases include BigTable developed by Google and Cassandra developed by FaceBook. Document databases
are designed to support more complex data forms than those stored in key-value databases. Examples of
document databases include MongoDB, SimpleDB, and CouchDB. MongoDB is an open-source document-
oriented database that stores documents as binary objects. SimpleDB is a distributed NoSQL database used
by Amazon. CouchDB is another open-source document-oriented database.
Accurate data are essential to useful data analysis. The lack of accurate data may lead to significant costs
to organizations in areas such as correction activities, lost customers, missed opportunities, and incorrect
decisions (Olson, 2003).
Data has quality if it satisfies its intended use, that is, the data is accurate, timely, relevant, complete,
understood, and trusted (Olson, 2003). As a result, we first need to know the specification of the intended
uses and then judge the suitability for those uses in order to assess the quality of the data. Unintended uses
of data can arise from a variety of reasons and lead to serious problems.
Accuracy is the single most important component of high-quality data. Accurate data have the following
properties (Olson, 2003):
• The data elements are not missing and have valid values.
• The values of the data elements are in the right ranges and have the right representations.
Inaccurate data arise from different sources. In particular, the following areas are common areas where
inaccurate data occur:
• Initial data entry. Mistakes (including deliberate errors) and system errors can occur during the initial
data entry. Flawed data entry processes can result in inaccurate data.
• Data decay. Data decay, also known as data degradation, refers to the gradual corruption of computer
data due to an accumulation of non-critical failures in a storage device.
• Data moving and restructuring. Inaccurate data can also arise from data extracting, cleaning, trans-
forming, loading, or integrating.
• Data using. Faulty reporting and lack of understanding can lead to inaccurate data.
Reverification and analysis are two approaches to find inaccurate data elements. To ensure that the data
elements are 100% accurate, we must use reverification. However, reverification can be time-consuming
and may not be possible for some data. Analytical techniques can also be used to identify inaccurate data
elements. There are five types of analysis that can be used to identify inaccurate data (Olson, 2003): data element analysis, structural analysis, value correlation, aggregation correlation, and value inspection.
Companies can create a data quality assurance program to create high-quality databases. For more information
about data quality issues management and data profiling techniques, readers are referred to (Olson, 2003).
Raw data usually need to be cleaned before useful analysis can be conducted. In particular, the following
areas need attention when preparing data for analysis (Janert, 2010):
• Missing values It is common to have missing values in raw data. Depending on the situations, we
can discard the record, discard the variable, or impute the missing values.
• Outliers Raw data may contain unusual data points such as outliers. We need to handle outliers
carefully. We cannot just remove outliers without knowing the reason for their existence. Sometimes
the outliers are caused by clerical errors. Sometimes outliers are the effect we are looking for.
• Junk Raw data may contain junk such as nonprintable characters. Junk is typically rare and not easy to notice. However, junk can cause serious problems in downstream applications.
• Format Raw data may be formatted in a way that is inconvenient for subsequent analysis. For example,
components of a record may be split into multiple lines in a text file. In such cases, lines corresponding
to a single record should be merged before loading to a data analysis software such as R.
• Duplicate records Raw data may contain duplicate records. Duplicate records should be recognized
and removed. This task may not be trivial depending on what you consider “duplicate.”
• Merging datasets Raw data may come from different sources. In such cases, we need to merge the
data from different sources to ensure compatibility.
For more information about how to handle data in R, readers are referred to (Forte, 2015) and (Buttrey and
Whitaker, 2017).
Data analysis is part of an overall study. For example, Figure 13.1 shows the process of a typical study in
behavioral and social sciences as described in (Albers, 2017). The data analysis part consists of the following
steps:
• Exploratory analysis The purpose of this step is to get a feel of the relationships with the data and
figure out what type of analysis for the data makes sense.
• Statistical analysis This step performs statistical analysis such as determining statistical significance
and effect size.
• Make sense of the results This step interprets the statistical results in the context of the overall
study.
• Determine implications This step interprets the data by connecting it to the study goals and the
larger field of this study.
Figure 13.1: The process of a typical study in behavioral and social sciences.
The goal of the data analysis as described above focuses on explaining some phenomenon (See Section 13.2.5).
Shmueli (2010) described a general process for statistical modeling, which is shown in Figure 13.2. Depending
on the goal of the analysis, the steps differ in terms of the choice of methods, criteria, data, and information.
There are two phases of data analysis (Good, 1983): exploratory data analysis (EDA) and confirmatory data
analysis (CDA). Table 13.1 summarizes some differences between EDA and CDA. EDA is usually applied to
observational data with the goal of looking for patterns and formulating hypotheses. In contrast, CDA is
often applied to experimental data (i.e., data obtained by means of a formal design of experiments) with the
goal of quantifying the extent to which discrepancies between the model and the data could be expected to
occur by chance (Gelman, 2004).
          EDA                    CDA
Data      Observational data     Experimental data
Table 13.1: Differences between EDA and CDA.
Methods for data analysis can be divided into two types (Abbott, 2014; Igual and Segu, 2017): supervised
learning methods and unsupervised learning methods. Supervised learning methods work with labeled data,
which include a target variable. Mathematically, supervised learning methods try to approximate the following
function:
Y = f (X1 , X2 , . . . , Xp ),
where Y is a target variable and X1 , X2 , . . ., Xp are explanatory variables. Other terms are also used to
mean a target variable. Table 13.2 gives a list of common names for different types of variables (Frees,
2009c). When the target variable is a categorical variable, supervised learning methods are called classification
methods. When the target variable is continuous, supervised learning methods are called regression methods.
Methods for data analysis can be parametric or nonparametric (Abbott, 2014). Parametric methods assume
that the data follow a certain distribution. Nonparametric methods do not assume distributions for the data
and therefore are called distribution-free methods.
Parametric methods have the advantage that if the distribution of the data is known, properties of the data
and properties of the method (e.g., errors, convergence, coefficients) can be derived. A disadvantage of
parametric methods is that analysts need to spend considerable time figuring out the distribution. For
example, analysts may try different transformation methods to transform the data so that they follow a certain
distribution.
Since nonparametric methods make fewer assumptions, nonparametric methods have the advantage that
they are more flexible, more robust, and applicable to non-quantitative data. However, a drawback of
nonparametric methods is that the conclusions drawn from nonparametric methods are not as powerful as
those drawn from parametric methods.
There are two goals in data analysis (Breiman, 2001; Shmueli, 2010): explanation and prediction. In some
scientific areas such as economics, psychology, and environmental science, the focus of data analysis is to
explain the causal relationships between the input variables and the response variable. In other scientific
areas such as natural language processing and bioinformatics, the focus of data analysis is to predict what
the responses are going to be given the input variables.
Shmueli (2010) discussed in detail the distinction between explanatory modeling and predictive modeling,
which reflect the process of using data and methods for explaining or predicting, respectively. Explanatory
modeling is commonly used for theory building and testing. However, predictive modeling is rarely used in
many scientific fields as a tool for developing theory.
Explanatory modeling is typically done as follows:
• State the prevailing theory.
• State causal hypotheses, which are given in terms of theoretical constructs rather than measurable
variables. A causal diagram is usually included to illustrate the hypothesized causal relationship between
the theoretical constructs.
• Operationalize constructs. In this step, previous literature and theoretical justification are used to build
a bridge between theoretical constructs and observable measurements.
• Collect data and build models alongside the statistical hypotheses, which are operationalized from the
research hypotheses.
• Reach research conclusions and recommend policy. The statistical conclusions are converted into
research conclusions, and policy recommendations often accompany them.
Shmueli (2010) defined predictive modeling as the process of applying a statistical model or data mining
algorithm to data for the purpose of predicting new or future observations. Predictions include point
predictions, interval predictions, regions, distributions, and rankings of new observations. A predictive model
can be any method that produces predictions.
Breiman (2001) discussed two cultures in the use of statistical modeling to reach conclusions from data: the
data modeling culture and the algorithmic modeling culture. In the data modeling culture, the data are
assumed to be generated by a given stochastic data model. In the algorithmic modeling culture, the data
mechanism is treated as unknown and algorithmic models are used.
Data modeling has given the statistics field many successes in analyzing data and obtaining information about
the data-generating mechanisms. However, Breiman (2001) argued that the focus on data models in the statistical
community has led to some side effects, such as:
• Producing irrelevant theory and questionable scientific conclusions.
• Keeping statisticians from using algorithmic models that might be more suitable.
• Restricting the ability of statisticians to deal with a wide range of problems.
Algorithmic modeling has long been used by industrial statisticians. However, the development of
algorithmic methods was taken up by a community outside statistics (Breiman, 2001). The goal of algorithmic
modeling is predictive accuracy. For some complex prediction problems, data models are not suitable.
These prediction problems include speech recognition, image recognition, handwriting recognition, nonlinear
time series prediction, and financial market prediction. The theory in algorithmic modeling focuses on the
properties of algorithms, such as convergence and predictive accuracy.
Unlike traditional data analysis, big data analysis employs additional methods and tools that can extract
information rapidly from massive data. In particular, big data analysis uses the following processing methods
(Chen et al., 2014):
• Bloom filter A Bloom filter is a space-efficient probabilistic data structure that is used to determine
whether an element belongs to a set. It has the advantages of high space efficiency and high query
speed. A drawback of using a Bloom filter is that it has a certain false positive (misrecognition) rate.
• Hashing Hashing is a method that transforms data into fixed-length numerical values through a hash
function. It has the advantages of rapid reading and writing. However, sound hash functions are difficult
to find.
• Indexing Indexing refers to a process of partitioning data in order to speed up reading. Hashing is a
special case of indexing.
• Tries A trie, also called a digital tree, improves query efficiency by exploiting common prefixes of
character strings to reduce string comparisons as much as possible.
• Parallel computing Parallel computing uses multiple computing resources to complete a computation
task. Parallel computing tools include MPI (Message Passing Interface), MapReduce, and Dryad.
Big data analysis can be conducted at the following levels (Chen et al., 2014): memory-level, business
intelligence (BI) level, and massive level. Memory-level analysis is conducted when the data can be loaded to
the memory of a cluster of computers. Current hardware can handle hundreds of gigabytes (GB) of data in
memory. BI level analysis can be conducted when the data surpass the memory level. It is common for BI
level analysis products to support data over terabytes (TB). Massive level analysis is conducted when the
data surpass the capabilities of products for BI level analysis. Usually Hadoop and MapReduce are used in
massive level analysis.
As mentioned in Section 13.2.1, a typical data analysis workflow includes collecting data, analyzing data, and
reporting results. The data collected are saved in a database or files. The data are then analyzed by one or
more scripts, which may save some intermediate results or always work on the raw data. Finally a report
is produced to describe the results, which include relevant plots, tables, and summaries of the data. The
workflow may be subject to the following potential issues (Mailund, 2017, Chapter 2): first, the analysis may not
be reproducible automatically, because intermediate results and the order in which the scripts must be run are
easy to lose track of; second, the documentation of the analysis may get out of sync with the analysis itself.
If the analysis is done on the raw data with a single script, then the first issue is not a major problem. If
the analysis consists of multiple scripts and a script saves intermediate results that are read by the next
script, then the scripts describe a workflow of data analysis. To reproduce an analysis, the scripts have to
be executed in the right order. The workflow may cause major problems if the order of the scripts is not
documented or the documentation is not updated or lost. One way to address the first issue is to write the
scripts so that any part of the workflow can be run completely automatically at any time.
If the documentation of the analysis is synchronized with the analysis, then the second issue is not a major
problem. However, the documentation may become completely useless if the scripts are changed but the
documentation is not updated.
Literate programming is an approach that addresses the two issues mentioned above. In literate programming, the
documentation of a program and the code of the program are written together. To do literate programming
in R, one way is to use R Markdown and the knitr package.
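As a minimal sketch, suppose the analysis and its documentation are kept together in a hypothetical R Markdown file analysis.Rmd; rendering the file re-runs the code chunks and rebuilds the report, so the documentation cannot drift from the analysis.
# analysis.Rmd is a hypothetical file mixing narrative text with R code chunks
library(rmarkdown)
render("analysis.Rmd", output_format = "html_document")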
Analysts may face ethical issues and dilemmas during the data analysis process. In some fields, for example,
ethical issues and dilemmas include participant consent, benefits, risk, confidentiality, and data ownership
(Miles et al., 2014). For data analysis in actuarial science and insurance in particular, we face the following
ethical matters and issues (Miles et al., 2014):
• Worthiness of the project Is the project worth doing? Will the project contribute in some significant
way to a domain broader than my career? If a project is only opportunistic and does not have a larger
significance, then it might be pursued with less care. The results may look good but not be right.
• Competence Do I or the whole team have the expertise to carry out the project? Incompetence may
lead to weakness in the analytics such as collecting large amounts of data poorly and drawing superficial
conclusions.
• Benefits, costs, and reciprocity Will each stakeholder gain from the project? Are the benefit and
the cost equitable? A project is likely to fail if the benefit and the cost for a stakeholder do not match.
• Privacy and confidentiality How do we make sure that the information is kept confidential? Where
raw data and analysis results are stored, and who will have access to them, should be documented in
explicit confidentiality agreements.
                    Supervised        Unsupervised
Discrete Label      Classification    Clustering
Continuous Label    Regression        Dimension reduction
Table 13.3: Types of machine learning algorithms.
Originating in engineering, pattern recognition is a field that is closely related to machine learning, which
grew out of computer science. In fact, pattern recognition and machine learning can be considered to be two
facets of the same field (Bishop, 2007). Data mining is a field that concerns collecting, cleaning, processing,
analyzing, and gaining useful insights from data (Aggarwal, 2015).
Exploratory data analysis techniques include descriptive statistics as well as many unsupervised learning
techniques such as data clustering and principal component analysis.
In the mass noun sense, descriptive statistics is an area of statistics that concerns the collection, organization,
summarization, and presentation of data (Bluman, 2012). In the count noun sense, descriptive statistics are
summary statistics that quantitatively describe or summarize data.
Descriptive Statistics
Measures of central tendency Mean, median, mode, midrange
Measures of variation Range, variance, standard deviation
Measures of position Quantile
Principal component analysis (PCA) is a statistical procedure that transforms a dataset described by possibly
correlated variables into a dataset described by linearly uncorrelated variables, which are called principal
components and are ordered according to their variances. PCA is a technique for dimension reduction. If the
original variables are highly correlated, then the first few principal components can account for most of the
variation of the original data.
To describe PCA, let X1 , X2 , . . . , Xd be a set of variables. The first principal component is defined to be
the normalized linear combination of the variables that has the largest variance. For i ≥ 2, the ith principal
component is defined as the normalized linear combination with the largest variance subject to being uncorrelated
with the preceding components, that is,
$$\operatorname{cov}(Z_i, Z_j) = 0, \quad j = 1, 2, \ldots, i - 1.$$
The principal components of the variables are related to the eigenvalues and eigenvectors of the covariance
matrix of the variables. For i = 1, 2, . . . , d, let (λi , ei ) be the ith eigenvalue-eigenvector pair of the covariance
matrix Σ such that λ1 ≥ λ2 ≥ . . . ≥ λd ≥ 0 and the eigenvectors are normalized. Then the ith principal
component is given by
$$Z_i = e_i' X = \sum_{j=1}^{d} e_{ij} X_j,$$
where X = (X1 , X2 , . . . , Xd )0 . It can be shown that Var (Zi ) = λi . As a result, the proportion of variance
explained by the ith principal component is calculated as
$$\frac{\operatorname{Var}(Z_i)}{\sum_{j=1}^{d} \operatorname{Var}(Z_j)} = \frac{\lambda_i}{\lambda_1 + \lambda_2 + \cdots + \lambda_d}.$$
For more information about PCA, readers are referred to (Mirkin, 2011).
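As an illustration (using the built-in USArrests data rather than any dataset from this chapter), the following sketch applies PCA with the prcomp function, which reports the loadings (eigenvectors) and the proportion of variance explained discussed above.
pca <- prcomp(USArrests, scale. = TRUE)  # standardize variables, then extract principal components
summary(pca)                             # proportion of variance explained by each component
pca$rotation                             # loadings, i.e., the eigenvectors e_i
head(pca$x)                              # scores, i.e., the principal components Z_i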
Cluster analysis (aka data clustering) refers to the process of dividing a dataset into homogeneous groups or
clusters such that points in the same cluster are similar and points from different clusters are quite distinct
(Gan et al., 2007; Gan, 2011). Data clustering is one of the most popular tools for exploratory data analysis
and has found applications in many scientific areas.
During the past several decades, many clustering algorithms have been proposed. Among these clustering
algorithms, the k-means algorithm is perhaps the most well-known algorithm due to its simplicity. To describe
the k-means algorithm, let X = {x1 , x2 , . . . , xn } be a dataset containing n points, each of which is described
by d numerical features. Given a desired number of clusters k, the k-means algorithm aims at minimizing the
following objective function:
$$P(U, Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} u_{il} \lVert x_i - z_l \rVert^2,$$
where $U = (u_{il})_{n \times k}$ is an $n \times k$ partition matrix, $Z = \{z_1, z_2, \ldots, z_k\}$ is a set of cluster centers, and $\lVert \cdot \rVert$ is
the $L_2$ norm or Euclidean distance. The partition matrix $U$ satisfies the following conditions: $u_{il} \in \{0, 1\}$ and
$\sum_{l=1}^{k} u_{il} = 1$ for $i = 1, 2, \ldots, n$, so that each point is assigned to exactly one cluster.
The k-means algorithm employs an iterative procedure to minimize the objective function. It repeatedly
updates the partition matrix U and the cluster centers Z alternately until some stop criterion is met. When
the cluster centers Z are fixed, the partition matrix U is updated as follows:
1, if kxi − zl k = min1≤j≤k kxi − zj k;
uil =
0, if otherwise,
When the partition matrix U is fixed, the cluster centers are updated as follows:
$$z_{lj} = \frac{\sum_{i=1}^{n} u_{il} x_{ij}}{\sum_{i=1}^{n} u_{il}}, \quad l = 1, 2, \ldots, k, \; j = 1, 2, \ldots, d,$$
where zlj is the jth component of zl and xij is the jth component of xi .
For more information about k-means, readers are referred to (Gan et al., 2007) and (Mirkin, 2011).
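The following sketch (using the built-in iris data rather than any dataset from this chapter) runs the k-means algorithm described above with the kmeans function.
set.seed(123)
X <- scale(iris[, 1:4])                  # numerical features, standardized
km <- kmeans(X, centers = 3, nstart = 20)
km$centers                               # the cluster centers z_1, ..., z_k
table(km$cluster, iris$Species)          # compare clusters with the known species labels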
Confirmatory data analysis techniques include the traditional statistical tools of inference, significance, and
confidence.
Linear Models
Linear models, also called linear regression models, aim at using a linear function to approximate the
relationship between the dependent variable and independent variables. A linear regression model is called a
simple linear regression model if there is only one independent variable. When more than one independent
variable is involved, a linear regression model is called a multiple linear regression model.
Let X and Y denote the independent and the dependent variables, respectively. For i = 1, 2, . . . , n, let (xi , yi )
be the observed values of (X, Y ) in the ith case. Then the simple linear regression model is specified as
follows (Frees, 2009c):
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n,$$
where $\beta_0$ and $\beta_1$ are parameters and $\varepsilon_i$ is a random variable representing the error for the ith case.
When there are multiple independent variables, the following multiple linear regression model is used:
$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \varepsilon_i,$$
where β0 , β1 , . . ., βk are unknown parameters to be estimated.
Linear regression models usually make the following assumptions:
(a) xi1 , xi2 , . . . , xik are nonstochastic variables.
(b) Var (yi ) = σ 2 , where Var (yi ) denotes the variance of yi .
(c) y1 , y2 , . . . , yn are independent random variables.
For the purpose of obtaining tests and confidence statements with small samples, the following strong
normality assumption is also made:
(d) $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are normally distributed.
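The following is a minimal sketch with simulated data (the variable names and coefficient values are chosen only for illustration) showing how simple and multiple linear regression models are fit in R with lm.
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)     # simulated responses
fit.simple <- lm(y ~ x1)                  # simple linear regression
fit.multiple <- lm(y ~ x1 + x2)           # multiple linear regression
summary(fit.multiple)$coefficients        # estimates, standard errors, t statistics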
The generalized linear model (GLM) is a wide family of regression models that include linear regression
models as special cases. In a GLM, the mean of the response (i.e., the dependent variable) is assumed to be a
function of linear combinations of the explanatory variables, i.e.,
µi = E[yi ],
ηi = x0i β = g(µi ),
where xi = (1, xi1 , xi2 , . . . , xik )0 is a vector of regressor values, µi is the mean response for the ith case, and
ηi is a systematic component of the GLM. The function g(·) is known and is called the link function. The
mean response can vary by observations by allowing some parameters to change. However, the regression
parameters β are assumed to be the same among different observations.
GLMs make the following assumptions:
(a) $x_{i1}, x_{i2}, \ldots, x_{ik}$ are nonstochastic variables.
(b) y1 , y2 , . . . , yn are independent.
(c) The dependent variable is assumed to follow a distribution from the linear exponential family.
(d) The variance of the dependent variable is not assumed to be constant but is a function of the mean, i.e.,
$\operatorname{Var}(y_i) = \phi\, v(\mu_i)$, where $v(\cdot)$ is a variance function and $\phi$ is a dispersion parameter.
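As a sketch of how such a model is fit in R (the Poisson family with a log link is chosen here only as a common example for claim counts; the data are simulated for illustration):
set.seed(2)
n <- 500
x <- rnorm(n)
mu <- exp(0.5 + 0.8 * x)                  # mean linked to the linear predictor via the log link
y <- rpois(n, mu)                         # simulated counts
fit.glm <- glm(y ~ x, family = poisson(link = "log"))
summary(fit.glm)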
Tree-based Models
Decision trees, also known as tree-based models, involve dividing the predictor space (i.e., the space formed
by independent variables) into a number of simple regions and using the mean or the mode of the region
for prediction (Breiman et al., 1984). There are two types of tree-based models: classification trees and
regression trees. When the dependent variable is categorical, the resulting tree models are called classification
trees. When the dependent variable is continuous, the resulting tree models are called regression trees.
The process of building classification trees is similar to that of building regression trees. Here we only briefly
describe how to build a regression tree. To do that, the predictor space is divided into non-overlapping
regions such that the following objective function
$$f(R_1, R_2, \ldots, R_J) = \sum_{j=1}^{J} \sum_{i=1}^{n} I_{R_j}(x_i)\,(y_i - \mu_j)^2$$
is minimized, where I is an indicator function, Rj denotes the set of indices of the observations that belong
to the jth box, µj is the mean response of the observations in the jth box, xi is the vector of predictor values
for the ith observation, and yi is the response value for the ith observation.
In terms of predictive accuracy, decision trees generally do not perform to the level of other regression and
classification models. However, tree-based models may outperform linear models when the relationship
between the response and the predictors is nonlinear. For more information about decision trees, readers are
referred to (Breiman et al., 1984) and (Mitchell, 1997).
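As an illustration (assuming the rpart package and the Boston housing data from the MASS package, neither of which is used elsewhere in this chapter), the following sketch fits a regression tree; predictions are the mean response of the region each observation falls into.
library(rpart)
library(MASS)                             # for the Boston data
tree <- rpart(medv ~ ., data = Boston, method = "anova")  # regression tree
printcp(tree)                             # complexity parameter table
head(predict(tree))                       # predictions: mean response of each terminal region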
13.5 Summary
In this chapter, we gave a high-level overview of data analysis. The overview is divided into three major
parts: data, data analysis, and data analysis techniques. In the first part, we introduced data types, data
structures, data storage, and data sources. In particular, we provided several websites where readers can
obtain real-world datasets to hone their data analysis skills. In the second part, we introduced the process of
data analysis and various aspects of data analysis. In the third part, we introduced some commonly used
techniques for data analysis. In addition, we listed some R packages and functions that can be used to
perform various data analysis tasks.
• Guojun Gan, University of Connecticut, is the principal author of the initial version of this chapter.
Email: [email protected] for chapter comments and suggested improvements.
Chapter 14
Dependence Modeling
Chapter Preview. In practice, there are many types of variables that one may encounter, and the first step
in dependence modeling is identifying the type of variable you are dealing with to help direct you to
the appropriate technique. This chapter introduces readers to variable types and techniques for modeling
dependence or association of multivariate distributions. Section 14.1 provides an overview of the types of
variables. Section 14.2 then elaborates basic measures for modeling the dependence between variables.
Section 14.3 introduces a novel approach to modeling dependence using copulas, which is reinforced with
practical illustrations in Section 14.4. The types of copula families and basic properties of copula functions
are explained in Section 14.5. The chapter concludes by explaining why the study of dependence modeling is
important in Section 14.6.
People, firms, and other entities that we want to understand are described in a dataset by numerical
characteristics. As these characteristics vary by entity, they are commonly known as variables. To manage
insurance systems, it will be critical to understand the distribution of each variable and how they are
associated with one another. It is common for data sets to have many variables (high dimensional), and so it is
useful to begin by classifying them into different types. As will be seen, these classifications are not strict;
there is overlap among the groups. Nonetheless, the grouping summarized in Table 14.1 and explained in the
remainder of this section provides a solid first step in framing a data set.
A qualitative, or categorical, variable is one for which the measurement denotes membership in a set of groups,
or categories. For example, if you were coding which area of the country an insured resides, you might use a
1 for the northern part, 2 for southern, and 3 for everything else. This location variable is an example of a
nominal variable, one for which the levels have no natural ordering. Any analysis of nominal variables should
not depend on the labeling of the categories. For example, instead of using a 1,2,3 for north, south, other, I
should arrive at the same set of summary statistics if I used a 2,1,3 coding instead, interchanging north and
south.
In contrast, an ordinal variable is a type of categorical variable for which an ordering does exist. For example,
with a survey to see how satisfied customers are with our claims servicing department, we might use a five
point scale that ranges from 1 meaning dissatisfied to a 5 meaning satisfied. Ordinal variables provide a clear
ordering of levels of a variable but the amount of separation between levels is unknown.
A binary variable is a special type of categorical variable where there are only two categories commonly taken
to be a 0 and a 1. For example, we might code a variable in a dataset to be a 1 if an insured is female and a
0 if male.
Unlike a qualitative variable, a quantitative variable is one in which numerical level is a realization from some
scale so that the distance between any two levels of the scale takes on meaning. A continuous variable is one
that can take on any value within a finite interval. For example, it is common to represent a policyholder’s
age, weight, or income, as a continuous variable. In contrast, a discrete variable is one that takes on only a
finite number of values in any finite interval. Like an ordinal variable, these represent distinct categories that
are ordered. Unlike an ordinal variable, the numerical difference between levels takes on economic meaning. A
special type of discrete variable is a count variable, one with values on the nonnegative integers. For example,
we will be particularly interested in the number of claims arising from a policy during a given period.
Some variables are inherently a combination of discrete and continuous components. For example, when
we analyze the insured loss of a policyholder, we will encounter a discrete outcome at zero, representing no
insured loss, and a continuous amount for positive outcomes, representing the amount of the insured loss.
Another interesting variation is an interval variable, one that gives a range of possible outcomes.
Circular data represent an interesting category typically not analyzed by insurers. As an example of circular
data, suppose that you monitor calls to your customer service center and would like to know when is the peak
time of the day for calls to arrive. In this context, one can think about the time of the day as a variable with
realizations on a circle, e.g., imagine an analog picture of a clock. For circular data, observations at 00:15 and
00:45 are just as close to each other as observations at 23:45 and 00:15 (here, we use the convention that
HH:MM means hours and minutes).
Insurance data typically are multivariate in the sense that we can take many measurements on a single entity.
For example, when studying losses associated with a firm’s worker’s compensation plan, we might want to
know the location of its manufacturing plants, the industry in which it operates, the number of employees,
and so forth. The usual strategy for analyzing multivariate data is to begin by examining each variable in
isolation of the others. This is known as a univariate approach.
In contrast, for some variables, it makes little sense to only look at one dimensional aspects. For example,
insurers typically organize spatial data by longitude and latitude to analyze the location of weather-related
insurance claims due to hailstorms. Having only a single number, either longitude or latitude, provides little
information in understanding geographical location.
Another special case of a multivariate variable, less obvious, involves coding for missing data. When data
are missing, it is better to think about the variable as two dimensions, one to indicate whether or not the
variable is reported and the second providing the age (if reported). In the same way, insurance data are
commonly censored and truncated. We refer you to Chapter 4 for more on censored and truncated data.
Aggregate claims can also be coded as another special type of multivariate variable. We refer you to Chapter
5 for more on aggregate claims.
Perhaps the most complicated type of multivariate variable is a realization of a stochastic process. You will
recall that a stochastic process is little more than a collection of random variables. For example, in insurance,
we might think about the times that claims arrive to an insurance company in a one year time horizon. This
is a high dimensional variable that theoretically is infinite dimensional. Special techniques are required to
understand realizations of stochastic processes that will not be addressed here.
For this section, consider a pair of random variables (X, Y ) having joint distribution function F (·) and a
random sample (Xi , Yi ), i = 1, . . . , n. For the continuous case, suppose that F (·) is absolutely continuous
with absolutely continuous marginals.
Pearson Correlation
Define the sample covariance function $\operatorname{Cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$, where $\bar{X}$ and $\bar{Y}$ are the sample
means of X and Y , respectively. Then, the product-moment (Pearson) correlation can be written as
$$r = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Cov}(X, X)\,\operatorname{Cov}(Y, Y)}}.$$
The correlation statistic r is widely used to capture association between random variables. It is a (nonparametric)
estimator of the correlation parameter ρ, defined to be the covariance divided by the product of
standard deviations. In this sense, it captures association for any pair of random variables.
This statistic has several important features. Unlike regression estimators, it is symmetric between random
variables, so the correlation between X and Y equals the correlation between Y and X. It is unchanged by
linear transformations of random variables (up to sign changes) so that we can multiply random variables or
add constants as is helpful for interpretation. The range of the statistic is [−1, 1] which does not depend on
the distribution of either X or Y .
Further, in the case of independence, the correlation coefficient r is 0. However, it is well known that zero
correlation does not imply independence, except for normally distributed random variables. The correlation
statistic r is also a (maximum likelihood) estimator of the association parameter for bivariate normal
distribution. So, for normally distributed data, the correlation statistic r can be used to assess independence.
For additional interpretations of this well-known statistic, readers will enjoy (Lee Rodgers and Nicewander,
1998).
You can obtain the correlation statistic r using the cor() function in R and selecting the pearson method.
This is demonstrated below by using the Coverage rating variable in millions of dollars and Claim amount
variable in dollars from the LGPIF data introduced in chapter 1.
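A sketch of the corresponding calls (mirroring the Spearman code later in this section; Claim and Coverage are the LGPIF variables, and the second call uses the logarithm of Coverage) is:
R Code for Pearson Correlation Statistic
### Pearson correlation between Claim and Coverage ###
r <- cor(Claim, Coverage, method = c("pearson"))
round(r, 2)
### Pearson correlation between Claim and Coverage in log millions ###
rlog <- cor(Claim, log(Coverage), method = c("pearson"))
round(rlog, 2)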
Output:
[1] 0.31
Output:
[1] 0.1
From the R output above, r = 0.31, which indicates a positive association between Claim and Coverage. This
means that as the coverage amount of a policy increases, we expect the claim amount to increase.
Spearman’s Rho
The Pearson correlation coefficient does have the drawback that it is not invariant to nonlinear transforms
of the data. For example, the correlation between X and ln Y can be quite different from the correlation
between X and Y . As we see from the R code for Pearson correlation statistic above, the correlation statistic
r between Coverage rating variable in logarithmic millions of dollars and Claim amounts variable in dollars
is 0.1 as compared to 0.31 when we calculate the correlation between Coverage rating variable in millions
of dollars and Claim amounts variable in dollars. This limitation is one reason for considering alternative
statistics.
Alternative measures of correlation are based on ranks of the data. Let R(Xj ) denote the rank of Xj from
the sample X1 , . . . , Xn and similarly for R(Yj ). Let $R(X) = (R(X_1), \ldots, R(X_n))'$ denote the vector of ranks,
and similarly for R(Y ). For example, if n = 3 and X = (24, 13, 109), then R(X) = (2, 1, 3). A comprehensive
introduction of rank statistics can be found in, for example, (Hettmansperger, 1984). Also, ranks can be used
to obtain the empirical distribution function; refer to Section 4.1.1 for more on the empirical distribution
function.
With this, the correlation measure of (Spearman, 1904) is simply the product-moment correlation computed
on the ranks:
$$r_S = \frac{\operatorname{Cov}(R(X), R(Y))}{\sqrt{\operatorname{Cov}(R(X), R(X))\,\operatorname{Cov}(R(Y), R(Y))}}.$$
You can obtain the Spearman correlation statistic rS using the cor() function in R and selecting the spearman
method. From below, the Spearman correlation between the Coverage rating variable in millions of dollars
and Claim amount variable in dollars is 0.41.
R Code for Spearman Correlation Statistic
### Spearman correlation between Claim and Coverage ###
rs<-cor(Claim,Coverage, method = c("spearman"))
round(rs,2)
Output:
[1] 0.41
Output:
[1] 0.41
To show that the Spearman correlation statistic is invariant under strictly increasing transformations, note from
the R code for the Spearman correlation statistic above that rS = 0.41 between the Coverage rating variable in
logarithmic millions of dollars and the Claim amount variable in dollars.
Kendall’s Tau
An alternative measure that uses ranks is based on the concept of concordance. An observation pair (X, Y ) is
said to be concordant (discordant) if the observation with a larger value of X has also the larger (smaller) value
of Y . Then Pr(concordance) = Pr[(X1 − X2 )(Y1 − Y2 ) > 0], Pr(discordance) = Pr[(X1 − X2 )(Y1 − Y2 ) < 0],
and Pr(tie) = Pr[(X1 − X2 )(Y1 − Y2 ) = 0].
To estimate this, the pairs (Xi , Yi ) and (Xj , Yj ) are said to be concordant if the product sgn(Xj −Xi )sgn(Yj −
Yi ) equals 1 and discordant if the product equals -1. Here, sgn(x) = 1, 0, −1 as x > 0, x = 0, x < 0, respectively.
With this, we can express the association measure of (Kendall, 1938), known as Kendall’s tau, as
$$\tau = \frac{2}{n(n-1)} \sum_{i<j} \operatorname{sgn}(X_j - X_i)\operatorname{sgn}(Y_j - Y_i) = \frac{2}{n(n-1)} \sum_{i<j} \operatorname{sgn}(R(X_j) - R(X_i))\operatorname{sgn}(R(Y_j) - R(Y_i)).$$
Interestingly, (Hougaard, 2000), page 137, attributes the original discovery of this statistic to (Fechner, 1897),
noting that Kendall’s discovery was independent and more complete than the original work.
You can obtain the Kendall’s tau, using the cor() function in R and selecting the kendall method. From
below, τ = 0.32 between the Coverage rating variable in millions of dollars and Claim amount variable in
dollars.
R Code for Kendall’s Tau
Output:
[1] 0.32
Output:
[1] 0.32
Also, to show that Kendall's tau is invariant under strictly increasing transformations, τ = 0.32 between
the Coverage rating variable in logarithmic millions of dollars and the Claim amount variable in dollars.
Bernoulli Variables
To see why dependence measures for continuous variables may not be the best for discrete variables, let us
focus on the case of Bernoulli variables that take on simple binary outcomes, 0 and 1. For notation, let
πjk = Pr(X = j, Y = k) for j, k = 0, 1 and let πX = Pr(X = 1) and similarly for πY . Then, the population
version of the product-moment (Pearson) correlation can be easily seen to be
$$\rho = \frac{\pi_{11} - \pi_X \pi_Y}{\sqrt{\pi_X(1 - \pi_X)\,\pi_Y(1 - \pi_Y)}}.$$
Unlike the case for continuous data, it is not possible for this measure to achieve the limiting boundaries of
the interval [−1, 1]. To see this, students of probability may recall the Fréchet-Höeffding bounds for a joint
distribution that turn out to be max{0, πX + πY − 1} ≤ π11 ≤ min{πX , πY } for this joint probability. This
limit on the joint probability imposes an additional restriction on the Pearson correlation. As an illustration,
assume equal probabilities πX = πY = π > 1/2. Then, the lower bound is
$$\frac{2\pi - 1 - \pi^2}{\pi(1 - \pi)} = -\frac{1 - \pi}{\pi}.$$
For example, if π = 0.8, then the smallest that the Pearson correlation could be is -0.25. More generally,
there are bounds on ρ that depend on πX and πY that make it difficult to interpret this measure.
As noted by (Bishop et al., 1975) (page 382), squaring this correlation coefficient yields the Pearson chi-
square statistic. Despite the boundary problems described above, this feature makes the Pearson correlation
coefficient a good choice for describing dependence with binary data. The other is the odds ratio, described
as follows.
As an alternative measure for Bernoulli variables, the odds ratio is given by
$$OR(z) = \frac{z\,(1 - \pi_1 - \pi_2 + z)}{(\pi_1 - z)(\pi_2 - z)}, \quad \text{where } z = \pi_{11}.$$
Pleasant calculations show that OR(z) is 0 at the lower Fréchet-Höeffding bound z = max{0, π1 + π2 − 1}
and is ∞ at the upper bound z = min{π1 , π2 }. Thus, the bounds on this measure do not depend on the
marginal probabilities πX and πY , making it easier to interpret this measure.
As noted by (Yule, 1900), odds ratios are invariant to the labeling of 0 and 1. Further, they are invariant to
the marginals in the sense that one can rescale π1 and π2 by positive constants and the odds ratio remains
unchanged. Specifically, suppose that $a_i$, $b_j$ are sets of positive constants, that $\pi_{ij}^{new} = a_i b_j \pi_{ij}$, and that
$\sum_{ij} \pi_{ij}^{new} = 1$. Then the odds ratio computed from the $\pi_{ij}^{new}$ is the same as that computed from the $\pi_{ij}$.
For additional help with interpretation, Yule proposed two transforms of the odds ratio, the first in (Yule,
1900),
$$\frac{OR - 1}{OR + 1},$$
and the second,
$$\frac{\sqrt{OR} - 1}{\sqrt{OR} + 1}.$$
Although these statistics provide the same information as the original odds ratio OR, they have the
advantage of taking values in the interval [−1, 1], making them easier to interpret.
In a later section, we will also see that the marginal distributions have no effect on the Fréchet-Höeffding bounds of
the tetrachoric correlation, another measure of association; see also (Joe, 2014), page 48.
Fire5
NoClaimCredit 0 1 Total
0 1611 2175 3786
1 897 956 1853
Total 2508 3131 5639
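A sketch of how the odds ratio reported below might be computed from the table above (the matrix construction is ours, not necessarily the authors' code):
tab <- matrix(c(1611, 2175, 897, 956), nrow = 2, byrow = TRUE,
              dimnames = list(NoClaimCredit = c("0", "1"), Fire5 = c("0", "1")))
OR <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])  # odds ratio for the 2x2 table
round(OR, 2)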
Output:
[1] 0.79
Categorical Variables
More generally, let (X, Y ) be a bivariate pair having ncatX and ncatY numbers of categories, respectively.
For a two-way table of counts, let $n_{jk}$ be the number in the jth row and kth column. Let $n_{j\cdot}$ be the row margin
total and $n_{\cdot k}$ be the column margin total. Define the Pearson chi-square statistic as
$$\chi^2 = \sum_{jk} \frac{\left(n_{jk} - n_{j\cdot} n_{\cdot k}/n\right)^2}{n_{j\cdot} n_{\cdot k}/n}$$
and the likelihood ratio test statistic as
$$G^2 = 2 \sum_{jk} n_{jk} \ln \frac{n_{jk}}{n_{j\cdot} n_{\cdot k}/n}.$$
Under the assumption of independence, both $\chi^2$ and $G^2$ have an asymptotic chi-square distribution with
$(ncat_X - 1)(ncat_Y - 1)$ degrees of freedom.
To help see what these statistics are estimating, let πjk = Pr(X = j, Y = k) and let πX,j = Pr(X = j) and
similarly for πY,k . Assuming that njk /n ≈ πjk for large n and similarly for the marginal probabilities, we
have
$$\frac{\chi^2}{n} \approx \sum_{jk} \frac{(\pi_{jk} - \pi_{X,j}\pi_{Y,k})^2}{\pi_{X,j}\pi_{Y,k}}$$
and
$$\frac{G^2}{n} \approx 2 \sum_{jk} \pi_{jk} \ln \frac{\pi_{jk}}{\pi_{X,j}\pi_{Y,k}}.$$
Under the null hypothesis of independence, we have πjk = πX,j πY,k and it is clear from these approximations
that we anticipate that these statistics will be small under this hypothesis.
Classical approaches, as described in (Bishop et al., 1975) (page 374), distinguish between tests of independence
and measures of associations. The former are designed to detect whether a relationship exists whereas the
latter are meant to assess the type and extent of a relationship. We acknowledge these differing purposes but
are less concerned with this distinction for actuarial applications.
NoClaimCredit
EntityType 0 1
City 644 149
County 310 18
Misc 336 273
School 1103 494
Town 492 479
Village 901 440
library(MASS)
table = table(EntityType, NoClaimCredit)
chisq.test(table)
Output:
------------------------------------
Test statistic df P value
---------------- ---- --------------
344.2 5 3.15e-72 * * *
------------------------------------
Output:
-----------------------------------------
Test statistic X-squared df P value
---------------- -------------- ---------
378.7 5 0 * * *
-----------------------------------------
Ordinal Variables
As the analyst moves from the continuous to the nominal scale, there are two main sources of loss of
information (Bishop et al., 1975) (page 343). The first is breaking the precise continuous measurements into
groups. The second is losing the ordering of the groups. So, it is sensible to describe what we can do with
variables that are in discrete groups but where the ordering is known.
As described in Section 14.1.1, ordinal variables provide a clear ordering of levels of a variable but distances
between levels are unknown. Associations have traditionally been quantified parametrically using normal-based
correlations and nonparametrically using Spearman correlations with tied ranks.
Refer to page 60, Section 2.12.7 of (Joe, 2014). Let $(y_1, y_2)$ be a bivariate pair with discrete values on
$m_1, \ldots, m_2$. For a two-way table of ordinal counts, let $n_{st}$ be the number in the sth row and tth column. Let
$(n_{m_1 *}, \ldots, n_{m_2 *})$ be the row margin totals and $(n_{* m_1}, \ldots, n_{* m_2})$ be the column margin totals.
Let $\hat{\xi}_{1s} = \Phi^{-1}\left((n_{m_1 *} + \cdots + n_{s *})/n\right)$ for $s = m_1, \ldots, m_2$ be a cutpoint and similarly for $\hat{\xi}_{2t}$. The polychoric
correlation, based on a two-step estimation procedure, is
$$\hat{\rho}_N = \operatorname{argmax}_{\rho} \sum_{s=m_1}^{m_2} \sum_{t=m_1}^{m_2} n_{st} \log\Big\{ \Phi_2(\hat{\xi}_{1s}, \hat{\xi}_{2t}; \rho) - \Phi_2(\hat{\xi}_{1,s-1}, \hat{\xi}_{2t}; \rho) - \Phi_2(\hat{\xi}_{1s}, \hat{\xi}_{2,t-1}; \rho) + \Phi_2(\hat{\xi}_{1,s-1}, \hat{\xi}_{2,t-1}; \rho) \Big\}$$
NoClaimCredit
AlarmCredit 0 1
1 1669 942
2 121 118
3 195 132
4 1801 661
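A sketch of how the polychoric correlation reported below might be computed (this assumes the polycor package, which is not named in the text; its polychor function accepts a contingency table and implements a two-step estimator):
library(polycor)
tab <- matrix(c(1669, 942, 121, 118, 195, 132, 1801, 661), nrow = 4, byrow = TRUE,
              dimnames = list(AlarmCredit = 1:4, NoClaimCredit = c("0", "1")))
round(polychor(tab), 2)   # two-step polychoric correlation estimate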
Output:
[1] -0.14
Interval Variables
As described in Section 14.1.2, interval variables provide a clear ordering of levels of a variable and the
numerical distance between any two levels of the scale can be readily interpretable. For example, a claims
count variable is an interval variable.
For measuring association, both the continuous variable and ordinal variable approaches make sense. The
former takes advantage of knowledge of the ordering although assumes continuity. The latter does not rely
on the continuity but also does not make use of the information given by the distance between scales.
For applications, one type is a count variable, a random variable on the discrete integers. Another is a
mixture variable, one that has discrete and continuous components.
The polyserial correlation is defined similarly, when one variable (y1 ) is continuous and the other (y2 ) ordinal.
Define z to be the normal score of y1 . The polyserial correlation is
$$\hat{\rho}_N = \operatorname{argmax}_{\rho} \sum_{i=1}^{n} \log\left\{ \phi(z_{i1}) \left[ \Phi\!\left(\frac{\hat{\xi}_{2,y_{i2}} - \rho z_{i1}}{(1-\rho^2)^{1/2}}\right) - \Phi\!\left(\frac{\hat{\xi}_{2,y_{i2}-1} - \rho z_{i1}}{(1-\rho^2)^{1/2}}\right) \right] \right\}$$
The biserial correlation is defined similarly, when one variable is continuous and the other binary.
Output:
[1] -0.04
Copula functions are widely used in statistics and actuarial science literature for dependency modeling. A copula
is a multivariate distribution function with uniform marginals. Specifically, let $U_1, \ldots, U_p$ be uniform random
variables on (0, 1). Their joint distribution function
$$C(u_1, \ldots, u_p) = \Pr(U_1 \leq u_1, \ldots, U_p \leq u_p),$$
is a copula. We seek to use copulas in applications that are based on more than just uniformly distributed
data. Thus, consider arbitrary marginal distribution functions $F_1(y_1), \ldots, F_p(y_p)$. Then, we can define a
multivariate distribution function using the copula such that
$$F(y_1, \ldots, y_p) = C\big(F_1(y_1), \ldots, F_p(y_p)\big).$$
Here, F is a multivariate distribution function in this equation. Sklar (1959) showed that any multivariate
distribution function F , can be written in the form of this equation, that is, using a copula representation.
Sklar also showed that, if the marginal distributions are continuous, then there is a unique copula representation.
In this chapter we focus on copula modeling with continuous variables. For discrete case, readers can see
(Joe, 2014) and (Genest and Nešlohva, 2007).
For bivariate case, p = 2 , the distribution function of two random variables can be written by the bivariate
copula function:
C(u1 , u2 ) = Pr(U1 ≤ u1 , U2 ≤ u2 ),
To give an example for bivariate copula, we can look at Frank’s (1979) copula. The equation is
$$C(u_1, u_2) = \frac{1}{\theta} \ln\left( 1 + \frac{(\exp(\theta u_1) - 1)(\exp(\theta u_2) - 1)}{\exp(\theta) - 1} \right).$$
This is a bivariate distribution function with its domain on the unit square [0, 1]^2. Here θ is the dependence
parameter, and the range of dependence is controlled by θ. Positive association increases as θ
increases and this positive association can be summarized with Spearman’s rho (ρ) and Kendall’s tau (τ ).
Frank’s copula is one of the commonly used copula functions in the copula literature. We will see other
copula functions in Section 14.5.
This section analyzes the insurance losses and expenses data with the statistical programming R. This data
set was introduced in Frees and Valdez (1998) and is now readily available in the copula package. The model
fitting process is started by marginal modeling of two variables (loss and expense). Then we model the joint
distribution of these marginal outcomes.
We start with getting a sample (n = 1500) from the whole data. We consider first two variables of the data;
losses and expenses.
• losses : general liability claims from Insurance Services Office, Inc. (ISO)
• expenses : ALAE, specifically attributable to the settlement of individual claims (e.g. lawyer’s fees,
claims investigation expenses)
To visualize the relationship between losses and expenses (ALAE), scatterplots in Figure 14.2 are created on
the real dollar scale and on the log scale.
R Code for Scatterplots
library(copula)
data(loss) # loss data
Lossdata <- loss
attach(Lossdata)
loss <- Lossdata$loss
par(mfrow=c(1, 2))
plot(loss,alae, cex=.5) # real dollar scale
plot(log(loss),log(alae),cex=.5) # log scale
par(mfrow=c(1, 2))
We first examine the marginal distributions of losses and expenses before going through the joint modeling.
The histograms show that both losses and expenses are right-skewed and fat-tailed.
For marginal distributions of losses and expenses, we consider a Pareto-type distribution, namely a Pareto
type II with distribution function
$$F(y) = 1 - \left(1 + \frac{y}{\theta}\right)^{-\alpha},$$
where θ is the scale parameter and α is the shape parameter.
The marginal distributions of losses and expenses are fitted with maximum likelihood. Specifically, we use
the vglm function from the R VGAM package. Firstly, we fit the marginal distribution of expenses .
R Code for Pareto Fitting
library(VGAM)
fit = vglm(alae ~ 1, paretoII(location=0, lscale="loge", lshape="loge")) # fit the model with the vglm function
coef(fit, matrix=TRUE) # extract fitted coefficients; matrix=TRUE gives the logarithm of the estimated parameters
Coef(fit)
Output:
loge(scale) loge(shape)
(Intercept) 9.624673 0.7988753
scale shape
(Intercept) 15133.603598 2.223039
We repeat this procedure to fit the marginal distribution of the loss variable. Because the loss data also
appear right-skewed and heavy-tailed, we also model the marginal distribution with the Pareto II distribution.
R Code for Pareto Fitting
fitloss = vglm(loss ~ 1, paretoII, trace=TRUE)
Coef(fitloss)
summary(fitloss)
Output:
scale shape
15133.603598 2.223039
To visualize the fitted distribution of expenses and loss variables, we use the estimated parameters and plot
the corresponding distribution function and density function. For more details on marginal model selection,
see Chapter 4.
The probability integral transformation shows that any continuous variable can be mapped to a U (0, 1)
random variable via its distribution function.
Given the fitted Pareto II distribution, the variable expenses is transformed to the variable u1 , which follows
a uniform distribution on [0, 1]:
$$u_1 = 1 - \left(1 + \frac{ALAE}{\hat{\theta}}\right)^{-\hat{\alpha}}.$$
After applying the probability integral transformation to expenses variable, we plot the histogram of
Transformed Alae in Figure 14.3.
After the fitting process, the variable loss is also transformed to the variable u2 , which follows a uniform distribution
on [0, 1]. We plot the histogram of Transformed Loss . As an alternative, the variable loss is transformed to
normal scores with the quantile function of standard normal distribution. As we see in Figure 14.4, normal
scores of the variable loss are approximately marginally standard normal.
Figure 14.4: Histogram of Transformed Loss. The left-hand panel shows the distribution of probability
integral transformed losses. The right-hand panel shows the distribution for the corresponding normal scores.
Figure 14.5: Left: Scatter plot for transformed variables. Right:Scatter plot for normal scores
Before jointly modeling losses and expenses, we draw the scatterplot of transformed variables (u1 , u2 ) and
the scatterplot of normal scores in Figure 14.5.
Then we calculate the Spearman’s rho between these two uniform random variables.
R Code for Scatter Plots and Correlation
par(mfrow = c(1, 2))
plot(u1, u2, cex = 0.5, xlim = c(-0.1,1.1), ylim = c(-0.1,1.1),
xlab = "Transformed Alae", ylab = "Transformed Loss")
plot(qnorm(u1), qnorm(u2))
cor(u1, u2, method = "spearman")
Output:
[1] 0.451872
Scatter plots and Spearman’s rho correlation value (0.451) shows us there is a positive dependency between
these two uniform random variables. It is more clear to see the relationship with normal scores in the second
graph. To learn more details about normal scores and their applications in copula modeling, see (Joe, 2014).
The transformed pair (U1 , U2 ), with U1 = F1 (ALAE) and U2 = F2 (LOSS), is fit to Frank's copula with the maximum likelihood method.
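A sketch of the fitting step (assuming the transformed variables u1 and u2 from above; the object name frank.cop matches the one used in the plotting code below):
library(copula)
fit.frank <- fitCopula(frankCopula(dim = 2), cbind(u1, u2), method = "ml")
theta.hat <- as.numeric(coef(fit.frank))        # estimated dependence parameter
frank.cop <- frankCopula(theta.hat, dim = 2)    # copula object with the fitted parameter
rho(frank.cop)                                  # implied Spearman's rho for the fitted copula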
Output:
[1] 0.4622722
To visualize the fitted Frank’s copula, the distribution function and density function perspective plots are
drawn in Figure 14.6.
R Code for Frank’s Copula Plots
par(mar=c(3.2,3,.2,.2),mfrow=c(1,2))
persp(frank.cop, pCopula, theta=50, zlab="C(u,v)",
xlab ="u", ylab="v", cex.lab=1.3)
persp(frank.cop, dCopula, theta=0, zlab="c(u,v)",
xlab ="u", ylab="v", cex.lab=1.3)
Frank’s copula models positive dependence for this data set, with θ = 3.114. For Frank’s copula, the
dependence is related to values of θ. That is:
• θ = 0: independent copula
• θ > 0: positive dependence
Figure 14.6: Left: Plot of the distribution function for Frank's copula. Right: Plot of the density function for Frank's
copula
Several families of copulas have been described in the literature. Two main families are the Archimedean
and elliptical copulas.
Elliptical copulas are constructed from elliptical distributions. These copulas decompose (multivariate) elliptical
distributions into their univariate elliptical marginal distributions by Sklar's theorem (Hofert et al., 2018).
Properties of elliptical copulas are typically obtained from the properties of the corresponding elliptical distributions
(Hofert et al., 2018).
For example, the normal distribution is a special type of elliptical distribution. To introduce the elliptical
class of copulas, we start with the familiar multivariate normal distribution with probability density function
$$\phi_N(\mathbf{z}) = \frac{1}{(2\pi)^{p/2} \sqrt{\det \Sigma}} \exp\left( -\frac{1}{2} \mathbf{z}' \Sigma^{-1} \mathbf{z} \right).$$
Here, Σ is a correlation matrix, with ones on the diagonal. Let Φ and φ denote the standard normal
distribution and density functions. We define the Gaussian (normal) copula density function as
$$c_N(u_1, \ldots, u_p) = \phi_N\left( \Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_p) \right) \prod_{j=1}^{p} \frac{1}{\phi\left(\Phi^{-1}(u_j)\right)}.$$
As with other copulas, the domain is the unit cube [0, 1]p .
Specifically, a p-dimensional vector z has an elliptical distribution if the density can be written as
$$h_E(\mathbf{z}) = \frac{k_p}{\sqrt{\det \Sigma}}\, g_p\!\left( \frac{1}{2} (\mathbf{z} - \boldsymbol{\mu})' \Sigma^{-1} (\mathbf{z} - \boldsymbol{\mu}) \right).$$
We will use elliptical distributions to generate copulas. Because copulas are concerned primarily with
relationships, we may restrict our considerations to the case where µ = 0 and Σ is a correlation matrix. With
these restrictions, the marginal distributions of the multivariate elliptical copula are identical; we use H to
refer to this marginal distribution function and h is the corresponding density. This marginal density is
$h(z) = k_1 g_1(z^2/2)$.
We are now ready to define the elliptical copula, a function defined on the unit cube [0, 1]p as
$$c_E(u_1, \ldots, u_p) = h_E\left( H^{-1}(u_1), \ldots, H^{-1}(u_p) \right) \prod_{j=1}^{p} \frac{1}{h\left(H^{-1}(u_j)\right)}.$$
In the elliptical copula family, the function gp is known as a generator in that it can be used to generate
alternative distributions.
Distribution                                  Generator gp(x)
Normal distribution                           exp(-x)
t-distribution with r degrees of freedom      (1 + 2x/r)^{-(p+r)/2}
Cauchy                                        (1 + 2x)^{-(p+1)/2}
Logistic                                      exp(-x)/(1 + exp(-x))^2
Exponential power                             exp(-r x^s)
Table 14.6 : Distribution and Generator Functions (gp(x)) for Selected Elliptical Copulas
Most empirical work focuses on the normal copula and the t-copula. In particular, t-copulas are useful for modeling
the dependency in the tails of bivariate distributions, especially in financial risk analysis applications.
t-copulas with the same association parameter but different degrees of freedom exhibit different
tail dependency structures. For more information about t-copulas, readers can see (Joe, 2014) and (Hofert
et al., 2018).
This class of copulas is constructed from a generator function g(·), which is a convex, decreasing function
with domain [0, 1] and range [0, ∞) such that g(1) = 0. Use g−1 for the inverse function of g. Then the
function
$$C_g(u_1, \ldots, u_p) = g^{-1}\big(g(u_1) + \cdots + g(u_p)\big)$$
is said to be an Archimedean copula. The function g is known as the generator of the copula Cg .
For the bivariate case, p = 2, the Archimedean copula function can be written as
$$C_g(u_1, u_2) = g^{-1}\big(g(u_1) + g(u_2)\big).$$
Some important special cases of Archimedean copulas are the Frank copula, the Clayton/Cook-Johnson copula,
and the Gumbel/Hougaard copula. These copula classes are derived from different generator functions.
Recall that Frank's copula was discussed in detail in Sections 14.3 and 14.4. Here we give the
equations for the Clayton copula and the Gumbel/Hougaard copula.
Clayton Copula
The Clayton copula is given by
$$C(u_1, u_2) = \left(u_1^{-\theta} + u_2^{-\theta} - 1\right)^{-1/\theta}.$$
This is a bivariate distribution function of the Clayton copula defined on the unit square [0, 1]^2. The range of
dependence is controlled by the parameter θ, as with the Frank copula.
Gumbel-Hougaard copula
The Gumbel-Hougaard copula is given by
$$C(u_1, u_2) = \exp\left\{-\left[(-\ln u_1)^{\theta} + (-\ln u_2)^{\theta}\right]^{1/\theta}\right\}, \quad \theta \geq 1.$$
Readers seeking deeper background on Archimedean copulas can see Joe (2014), Frees and Valdez (1998),
and Genest and Mackay (1986).
Bounds on Association
Like all multivariate distribution functions, copulas are bounded. The Fréchet-Hoeffding bounds are
$$\max(u_1 + \cdots + u_p + 1 - p,\, 0) \leq C(u_1, \ldots, u_p) \leq \min(u_1, \ldots, u_p).$$
The upper bound is achieved when U1 = · · · = Up . To see the lower bound when p = 2, consider
U2 = 1 − U1 . In this case, if 1 − u2 < u1 then Pr(U1 ≤ u1 , U2 ≤ u2 ) = Pr(1 − u2 ≤ U1 < u1 ) = u1 + u2 − 1
(Nelson, 1997).
The product copula, C(u1 , u2 ) = u1 u2 , is the result of assuming independence between random variables.
The lower bound is achieved when the two random variables are perfectly negatively related (U2 = 1 − U1 )
and the upper bound is achieved when they are perfectly positively related (U2 = U1 ).
The Fréchet-Hoeffding bounds for two random variables are illustrated in Figure 14.7.
R Code for Frechet-Hoeffding Bounds for Two Random Variables
library(copula)
n<-100
set.seed(1980)
U<-runif(n)
par(mfrow=c(1, 2))
plot(cbind(U,1-U), xlab=quote(U[1]), ylab=quote(U[2]),main="Perfect Negative Dependency") # W for p=2
plot (cbind(U,U), xlab=quote(U[1]),ylab=quote(U[2]),main="Perfect Positive Dependency") #M for p=2
Measures of Association
Schweizer and Wolff (1981) established that the copula accounts for all the dependence between two random
variables, Y1 and Y2 , in the following sense. Consider m1 and m2 , strictly increasing functions; then the copula
of (m1 (Y1 ), m2 (Y2 )) is the same as the copula of (Y1 , Y2 ). Thus, the manner in which Y1 and Y2 “move together”
is captured by the copula, regardless of the scale in which each variable is measured.
Schweizer and Wolff also showed the two standard nonparametric measures of association could be expressed
solely in terms of the copula function. Spearman’s correlation coefficient is given by
$$\rho_S = 12 \int\!\!\int \left\{ C(u, v) - uv \right\} du\, dv.$$
For these expressions, we assume that Y1 and Y2 have a jointly continuous distribution function. Further, the
definition of Kendall's tau uses an independent copy of (Y1 , Y2 ), labeled (Y1∗ , Y2∗ ), to define the measure of
“concordance.” In contrast, the widely used Pearson correlation depends on the margins as well as the copula;
because of this, it is affected by non-linear changes of scale.
Tail Dependency
There are some applications in which it is useful to distinguish by the part of the distribution in which the
association is strongest. For example, in insurance it is helpful to understand association among the largest
losses, that is, association in the right tails of the data.
To capture this type of dependency, we use the right-tail concentration function
$$R(z) = \frac{\Pr(U_1 > z, U_2 > z)}{1 - z} = \Pr(U_1 > z \mid U_2 > z) = \frac{1 - 2z + C(z, z)}{1 - z}.$$
Joe (1997) uses the term “upper tail dependence parameter” for $R = \lim_{z \to 1} R(z)$. Similarly, the left-tail
concentration function is
$$L(z) = \frac{\Pr(U_1 \leq z, U_2 \leq z)}{z} = \Pr(U_1 \leq z \mid U_2 \leq z) = \frac{C(z, z)}{z}.$$
Under independence, L(z) = z and R(z) = 1 − z. The tail concentration functions capture the probability
that two random variables simultaneously take on extreme values.
We calculate the left and right tail concentration functions for four different types of copulas: the Normal,
Frank, Gumbel, and t copulas. After getting the tail concentration functions for each copula, we show the
concentration function values for these four copulas in Table 14.7. As in Venter (2002), we show L(z) for z ≤ 0.5 and
R(z) for z > 0.5 in the tail dependence plot in Figure 14.8. We interpret the tail dependence plot, to mean
that both the Frank and Normal copula exhibit no tail dependence whereas the t and the Gumbel may do so.
The t copula is symmetric in its treatment of upper and lower tails.
library(copula)
U1 = seq(0,0.5, by=0.002)
U2 = seq(0.5,1, by=0.002)
U = rbind(U1, U2)
TailFunction <- function(Tailcop) {
  lowertail <- pCopula(cbind(U1,U1), Tailcop)/U1                    # L(z) = C(z,z)/z for z <= 0.5
  uppertail <- (1-2*U2 + pCopula(cbind(U2,U2), Tailcop))/(1-U2)     # R(z) = (1 - 2z + C(z,z))/(1 - z) for z > 0.5
  c(lowertail, uppertail)                                           # return both tails as a single vector
}
Dependence Modeling is important because it enables us to understand the dependence structure by defining
the relationship between variables in a dataset. In insurance, ignoring dependence modeling may not impact
pricing but could lead to misestimation of required capital to cover losses. For instance, from Section 14.4 , it
is seen that there was a positive relationship between Loss and Expense. This means that, if there is a large
loss then we expect expenses to be large as well and ignoring this relationship could lead to misestimation of
reserves.
To illustrate the importance of dependence modeling, we refer back to the portfolio management example
in Chapter 6, which assumed that the property and liability risks are independent. Here, we incorporate
dependence by allowing the four lines of business to depend on one another through a Gaussian copula. In
Table 14.8, we show that dependence affects the portfolio quantiles (V aRq ), although not the expected value.
For instance, the V aR0.99 for the total risk, which is the amount of capital required to ensure, with a 99%
degree of certainty, that the firm does not become technically insolvent, is higher when we incorporate
dependence. Ignoring dependence therefore leads to less capital being allocated and can cause unexpected
solvency problems.
Table 14.8: Results for portfolio expected value and quantiles (V aRq )
R Code for Simulation Using Gaussian Copula
library(VGAM)
# Independence case: the risks are simulated separately. nSim, the ParetoII
# parameters (theta3, alpha3, theta4, alpha4), the first two risks X1 and X2,
# and the retention levels d1 and d2 are defined earlier in the Chapter 6
# portfolio example and are not repeated in this excerpt.
X3 <- rparetoII(nSim, scale = theta3, shape = alpha3)
X4 <- rparetoII(nSim, scale = theta4, shape = alpha4)
# Portfolio risks
S <- X1 + X2 + X3 + X4
Sretained <- pmin(X1, d1) + pmin(X2, d2)
Sinsurer <- S - Sretained
# Quantiles
quantMat <- rbind(
  quantile(Sretained, probs = c(0.80, 0.90, 0.95, 0.99)),
  quantile(Sinsurer,  probs = c(0.80, 0.90, 0.95, 0.99)),
  quantile(S,         probs = c(0.80, 0.90, 0.95, 0.99)))
rownames(quantMat) <- c("Retained", "Insurer", "Total")
round(quantMat, digits = 2)
# Dependence case: X is the nSim x 4 matrix of marginal outcomes obtained from
# the Gaussian copula simulation (see the sketch above).
X1 <- X[, 1]
X2 <- X[, 2]
X3 <- X[, 3]
X4 <- X[, 4]
# Portfolio risks
S <- X1 + X2 + X3 + X4
Sretained <- pmin(X1, d1) + pmin(X2, d2)
Sinsurer <- S - Sretained
# Quantiles
quantMat <- rbind(
  quantile(Sretained, probs = c(0.80, 0.90, 0.95, 0.99)),
  quantile(Sinsurer,  probs = c(0.80, 0.90, 0.95, 0.99)),
  quantile(S,         probs = c(0.80, 0.90, 0.95, 0.99)))
rownames(quantMat) <- c("Retained", "Insurer", "Total")
round(quantMat, digits = 2)
• Edward W. (Jed) Frees and Nii-Armah Okine, University of Wisconsin-Madison, and Emine
Selin Sarıdaş, Mimar Sinan University, are the principal authors of the initial version of this chapter.
Email: [email protected] for chapter comments and suggested improvements.
Blomqvist (1950) developed a measure of dependence now known as Blomqvist’s beta, also called the median
concordance coefficient and the medial correlation coefficient. Using distribution functions, this parameter
can be expressed as

$$\beta = 4F\!\left(F_X^{-1}(1/2), F_Y^{-1}(1/2)\right) - 1.$$

That is, first evaluate each marginal at its median (F_X^{-1}(1/2) and F_Y^{-1}(1/2), respectively). Then, evaluate
the bivariate distribution function at the two medians. After rescaling (multiplying by 4 and subtracting 1),
the coefficient turns out to have a range of [−1, 1], where 0 occurs under independence.
Like Spearman’s rho and Kendall’s tau, an estimator based on ranks is easy to provide. First write
β = 4C(1/2, 1/2) − 1 = 2 Pr((U1 − 1/2)(U2 − 1/2) > 0) − 1, where U1 , U2 are uniform random variables. Then,
define

$$\hat{\beta} = \frac{2}{n}\sum_{i=1}^{n} I\!\left[\left(R(X_i) - \frac{n+1}{2}\right)\left(R(Y_i) - \frac{n+1}{2}\right) \ge 0\right] - 1.$$
See, for example, (Joe, 2014), page 57 or (Hougaard, 2000), page 135, for more details.
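For illustration (a sketch, not code from the text), this rank-based estimator can be computed directly for
any paired sample x and y:

# Sample version of Blomqvist's beta from ranks, following the formula above
blomqvist.beta <- function(x, y) {
  n <- length(x)
  2 * mean((rank(x) - (n + 1)/2) * (rank(y) - (n + 1)/2) >= 0) - 1
}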
Because Blomqvist’s parameter is based on the center of the distribution, it is particularly useful when data
are censored; in this case, information in extreme parts of the distribution is not always reliable. How does
this affect a choice of association measures? First, recall that association measures are based on a bivariate
distribution function. So, if one has knowledge of a good approximation of the distribution function, then
calculation of an association measure is straightforward in principle. Second, for censored data, bivariate
extensions of the univariate Kaplan-Meier distribution function estimator are available. For example, the
version introduced in Dabrowska (1988) is appealing. However, because of instances when large masses of
data appear at the upper range of the data, this and other estimators of the bivariate distribution function are
unreliable. This means that summary measures of the estimated distribution function based on Spearman’s
rho or Kendall’s tau can be unreliable. For this situation, Blomqvist’s beta appears to be a better choice as
it focuses on the center of the distribution. Hougaard (2000), Chapter 14, provides additional discussion.
You can obtain Blomqvist’s beta using the betan() function from the copula library in R. From the output
below, β̂ = 0.3 between the coverage rating variable (in millions of dollars) and the claim amount variable (in
dollars).
Output:
[1] 0.3
In addition, to show that Blomqvist’s beta is invariant under strictly increasing transformations, β̂ = 0.3
between the coverage rating variable in logarithmic millions of dollars and the claim amount variable in dollars.
Output:
[1] 0.3
For the first variable, the average rank of observations in the sth row is

$$r_{1s} = n_{m_1 *} + \cdots + n_{s-1,*} + \frac{1}{2}\left(1 + n_{s*}\right),$$

and similarly r_{2t} = (1/2)[(n_{*m_1} + · · · + n_{*,t−1} + 1) + (n_{*m_1} + · · · + n_{*t})]. With this, Spearman’s rho
with tied ranks is

$$\hat{\rho}_S = \frac{\sum_{s=m_1}^{m_2}\sum_{t=m_1}^{m_2} n_{st}\,(r_{1s}-\bar{r})(r_{2t}-\bar{r})}
{\sqrt{\sum_{s=m_1}^{m_2} n_{s*}(r_{1s}-\bar{r})^2 \;\sum_{t=m_1}^{m_2} n_{*t}(r_{2t}-\bar{r})^2}},$$

where r̄ = (n + 1)/2 is the overall average rank.
Special Case: Binary Data. Here, m1 = 0 and m2 = 1. For the first variable ranks, we have r10 = (1 + n0+ )/2
and r11 = (n0+ + 1 + n)/2. Thus, r10 − r̄ = (n0+ − n)/2 and r11 − r̄ = n0+ /2. This means that we have
$\sum_{s=0}^{1} n_{s+}(r_{1s} - \bar{r})^2 = n(n - n_{0+})n_{0+}/4$ and similarly for the second variable. For the numerator, we have
$$\begin{aligned}
\sum_{s=0}^{1}\sum_{t=0}^{1} n_{st}(r_{1s}-\bar{r})(r_{2t}-\bar{r})
&= n_{00}\,\frac{n_{0+}-n}{2}\cdot\frac{n_{+0}-n}{2} + n_{01}\,\frac{n_{0+}-n}{2}\cdot\frac{n_{+0}}{2}
 + n_{10}\,\frac{n_{0+}}{2}\cdot\frac{n_{+0}-n}{2} + n_{11}\,\frac{n_{0+}}{2}\cdot\frac{n_{+0}}{2}\\
&= \frac{1}{4}\Big( n_{00}(n_{0+}-n)(n_{+0}-n) + (n_{0+}-n_{00})(n_{0+}-n)n_{+0}\\
&\qquad\quad + (n_{+0}-n_{00})\,n_{0+}(n_{+0}-n) + (n-n_{+0}-n_{0+}+n_{00})\,n_{0+}n_{+0}\Big)\\
&= \frac{1}{4}\Big( n_{00}\,n^2 + n_{0+}n_{+0}\big[(n_{0+}-n) + (n_{+0}-n) + (n-n_{+0}-n_{0+})\big]\Big)\\
&= \frac{n}{4}\big(n\,n_{00} - n_{0+}n_{+0}\big).
\end{aligned}$$
This yields

$$\hat{\rho}_S = \frac{n_{00}/n - (1-\hat{\pi}_X)(1-\hat{\pi}_Y)}{\sqrt{\hat{\pi}_X(1-\hat{\pi}_X)\,\hat{\pi}_Y(1-\hat{\pi}_Y)}},$$

where π̂X = (n − n0+ )/n and similarly for π̂Y . Note that this is the same form as the Pearson measure. From
this, we see that the joint count n00 drives this association measure.
You can obtain the ties-corrected Spearman correlation statistic rS using the cor() function in R and selecting
the spearman method. From the output below, ρ̂S = −0.09.
R Code for Ties-corrected Spearman Correlation
rs_ties <- cor(AlarmCredit, NoClaimCredit, method = "spearman")
round(rs_ties, 2)
Output:
[1] -0.09
Chapter 15

Appendix A: Review of Statistical Inference
Chapter preview. The appendix gives an overview of concepts and methods related to statistical inference on
the population of interest, using a random sample of observations from the population. In the appendix,
Section 15.1 introduces the basic concepts related to the population and the sample used for making the
inference. Section 15.2 presents the commonly used methods for point estimation of population characteristics.
Section 15.3 demonstrates interval estimation that takes into consideration the uncertainty in the estimation,
due to use of a random sample from the population. Section 15.4 introduces the concept of hypothesis testing
for the purpose of variable and model selection.
                     Minimum  First Quartile  Median    Mean  Third Quartile     Maximum  Standard Deviation
Claims                     1             788   2,250  26,620           6,171  12,920,000             368,030
Logarithmic Claims         0           6.670   7.719   7.804           8.728      16.370               1.683
In statistics, a sampling error occurs when the sampling frame, the list from which the sample is drawn,
is not an adequate approximation of the population of interest. A sample must be a representative subset of
a population, or universe, of interest. If the sample is not representative, taking a larger sample does not
eliminate bias, as the same mistake is repeated over and over again. Thus, we introduce the concept of
random sampling, which gives rise to a simple random sample that is representative of the population.
We assume that the random variable X represents a draw from a population with a distribution function
F (·) with mean E[X] = µ and variance Var[X] = E[(X − µ)2 ], where E(·) denotes the expectation of a
random variable. In random sampling, we make a total of n such draws represented by X1 , . . . , Xn , each
unrelated to one another (i.e., statistically independent). We refer to X1 , . . . , Xn as a random sample (with
replacement) from F (·), taking either a parametric or nonparametric form. Alternatively, we may say that
X1 , . . . , Xn are identically and independently distributed (iid) with distribution function F (·).
Using the random sample X1 , . . . , Xn , we are interested in drawing a conclusion about a specific attribute of
the population distribution F (·). For example, we may be interested in making an inference on the population
mean, denoted µ. It is natural to think of the sample mean,

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i,$$

as an estimate of the population mean µ. We call the sample mean a statistic calculated from the random
sample X1 , . . . , Xn . Other commonly used summary statistics include the sample standard deviation and
sample quantiles.
When using a statistic (e.g., the sample mean X̄) to make statistical inference on the population attribute
(e.g., population mean µ), the quality of inference is determined by the bias and uncertainty in the estimation,
owing to the use of a sample in place of the population. Hence, it is important to study the distribution of a
statistic that quantifies the bias and variability of the statistic. In particular, the distribution of the sample
mean, X̄ (or any other statistic), is called the sampling distribution. The sampling distribution depends
on the sampling process, the statistic, the sample size n and the population distribution F (·). The central
limit theorem gives the large-sample (sampling) distribution of the sample mean under certain conditions.
In statistics, there are variations of the central limit theorem (CLT) ensuring that, under certain conditions,
the sample mean will approach the population mean with its sampling distribution approaching the normal
distribution as the sample size goes to infinity. We give the Lindeberg–Levy CLT that establishes the
asymptotic sampling distribution of the sample mean X̄ calculated using a random sample from a universe
population having a distribution F (·).
Lindeberg–Levy CLT. Let X1 , . . . , Xn be a random sample from a population distribution F (·) with mean
µ and variance σ 2 < ∞. The difference between the sample mean X̄ and µ, when multiplied by √n, converges
in distribution to a normal distribution as the sample size goes to infinity. That is,

$$\sqrt{n}(\bar{X} - \mu) \xrightarrow{d} N(0, \sigma^2).$$
Note that the CLT does not require a parametric form for F (·). Based on the CLT, we may perform statistical
inference on the population mean (we infer, not deduce). The types of inference we may perform include
estimation of population characteristics, hypothesis testing on whether a null statement is true, and prediction
of future samples from the population.
For obtaining the population characteristics, there are different attributes related to the population distribution
F (·). Such measures include the mean, median, percentiles (e.g., the 95th percentile), and standard deviation.
Because these summary measures do not depend on a specific parametric reference, they are nonparametric
summary measures.
In parametric analysis, on the other hand, we may assume specific families of distributions with specific
parameters. For example, the logarithm of claim amounts is often assumed to be normally distributed
with mean µ and standard deviation σ. That is, we assume that the claims have a lognormal distribution
with parameters µ and σ. Alternatively, insurance companies commonly assume that claim severity follows a
gamma distribution with a shape parameter α and a scale parameter θ. Here, the normal, lognormal, and
gamma distributions are examples of parametric distributions. In the above examples, the quantities of µ, σ,
α, and θ are known as parameters. For a given parametric distribution family, the distribution is uniquely
determined by the values of the parameters.
One often uses θ to denote a summary attribute of the population. In parametric models, θ can be a
parameter or a function of parameters from a distribution such as the normal mean and variance parameters.
In nonparametric analysis, it can take the form of a nonparametric summary such as the population mean or
standard deviation. Let θ̂ = θ̂(X1 , . . . , Xn ) be a function of the sample that provides a proxy, or an estimate,
of θ. It is referred to as a statistic, a function of the sample X1 , . . . , Xn .
Example – Wisconsin Property Fund. The sample mean 7.804 and the sample standard deviation 1.683
can be deemed either nonparametric estimates of the population mean and standard deviation, or parametric
estimates of µ and σ of the normal distribution for the logarithmic claims. Using results from the lognormal
distribution, we may estimate the expected claim, the lognormal mean, as 10,106.8 (= exp(7.804 + 1.683²/2)).
For the Wisconsin Property Fund data, we may denote µ̂ = 7.804 and σ̂ = 1.683, with the hat notation
denoting an estimate of the parameter based on the sample. In particular, such an estimate is referred
to as a point estimate, a single approximation of the corresponding parameter. For point estimation, we
introduce the two commonly used methods called the method of moments estimation and maximum likelihood
estimation.
Before defining the method of moments estimation, we define the concept of moments. Moments are
population attributes that characterize the distribution function F (·). Given a random draw X from F (·),
the expectation µk = E[X^k] is called the kth moment of X, k = 1, 2, 3, · · ·. For example, the population
mean µ is the first moment. Furthermore, the expectation E[(X − µ)^k] is called the kth central moment.
Thus, the variance is the second central moment.
Using the random sample X1 , . . . , Xn , we may construct the corresponding sample moment,
µ̂k = (1/n) ∑_{i=1}^{n} X_i^k , for estimating the population attribute µk . For example, we have used the
sample mean X̄ as an estimator for the population mean µ. Similarly, the second central moment can be
estimated as (1/n) ∑_{i=1}^{n} (X_i − X̄)². Without assuming a parametric form for F (·), the sample moments
constitute nonparametric estimates of the corresponding population attributes. Such an estimator based on
matching of the corresponding sample and population moments is called a method of moments estimator
(MME).
While the MME works naturally in a nonparametric model, it can be used to estimate parameters when
a specific parametric family of distribution is assumed for F (·). Denote by θ = (θ1 , · · · , θm ) the vector of
parameters corresponding to a parametric distribution F (·). Given a distribution family, we commonly know
the relationships between the parameters and the moments. In particular, we know the specific forms of
the functions h1 (·), h2 (·), · · · , hm (·) such that µ1 = h1 (θ), µ2 = h2 (θ), · · · , µm = hm (θ). Given the sample
moments µ̂1 , . . . , µ̂m from the random sample, the MMEs of the parameters, θ̂1 , · · · , θ̂m , can be obtained by
solving the system of equations

$$\begin{aligned}
\hat{\mu}_1 &= h_1(\hat{\theta}_1, \cdots, \hat{\theta}_m);\\
\hat{\mu}_2 &= h_2(\hat{\theta}_1, \cdots, \hat{\theta}_m);\\
&\;\;\vdots\\
\hat{\mu}_m &= h_m(\hat{\theta}_1, \cdots, \hat{\theta}_m).
\end{aligned}$$
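For example (a sketch not taken from the text; the gamma parameter values are assumptions chosen for
illustration), for a gamma distribution with shape α and scale θ we have mean αθ and variance αθ², so
matching sample moments gives closed-form MMEs:

# Method of moments for a gamma(shape = alpha, scale = theta) sample:
# matching the mean alpha*theta and variance alpha*theta^2 yields
# alpha_hat = xbar^2/s2 and theta_hat = s2/xbar.
set.seed(2021)
x <- rgamma(1000, shape = 2, scale = 500)    # simulated "claims"
xbar <- mean(x)
s2 <- mean((x - xbar)^2)                     # second central sample moment
c(alpha.mme = xbar^2 / s2, theta.mme = s2 / xbar)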
When F (·) takes a parametric form, the maximum likelihood method is widely used for estimating the
population parameters θ. Maximum likelihood estimation is based on the likelihood function, a function
of the parameters given the observed sample. Denote by f (xi |θ) the probability function of Xi evaluated
at Xi = xi (i = 1, 2, · · · , n), the probability mass function in the case of a discrete X and the probability
density function in the case of a continuous X. Then the likelihood function of θ associated with the
observation (X1 , X2 , · · · , Xn ) = (x1 , x2 , · · · , xn ) = x can be written as
$$L(\theta|\mathbf{x}) = \prod_{i=1}^{n} f(x_i|\theta).$$
The maximum likelihood estimator (MLE) of θ is the set of values of θ that maximize the likelihood function
(log-likelihood function), given the observed sample. That is, the MLE θ̂ can be written as
θ̂ = argmaxθ∈Θ l(θ|x),
where Θ is the parameter space of θ, and argmaxθ∈Θ l(θ|x) is defined as the value of θ at which the function
l(θ|x) reaches its maximum.
Given the analytical form of the likelihood function, the MLE can be obtained by taking the first derivative
of the log-likelihood function with respect to θ, and setting the values of the partial derivatives to zero. That
is, the MLE is the solution of the system of equations

$$\frac{\partial l(\hat{\theta}|\mathbf{x})}{\partial \hat{\theta}_1} = 0; \quad
\frac{\partial l(\hat{\theta}|\mathbf{x})}{\partial \hat{\theta}_2} = 0; \quad \cdots \quad
\frac{\partial l(\hat{\theta}|\mathbf{x})}{\partial \hat{\theta}_m} = 0,$$
provided that the second partial derivatives are negative.
For parametric models, the MLE of the parameters can be obtained either analytically (e.g., in the case
of normal distributions and linear estimators), or numerically through iterative algorithms such as the
Newton-Raphson method and its adaptive versions (e.g., in the case of generalized linear models with a
non-normal response variable).
Normal distribution. Assume (X1 , X2 , · · · , Xn ) to be a random sample from the normal distribution
N (µ, σ 2 ). With an observed sample (X1 , X2 , · · · , Xn ) = (x1 , x2 , · · · , xn ), we can write the likelihood function
of µ, σ 2 as
$$L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}},$$

with the corresponding log-likelihood function given by

$$l(\mu, \sigma^2) = -\frac{n}{2}\left[\ln(2\pi) + \ln(\sigma^2)\right] - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2.$$
By solving

$$\frac{\partial l(\hat{\mu}, \sigma^2)}{\partial \hat{\mu}} = 0,$$

we obtain µ̂ = x̄ = (1/n) ∑_{i=1}^{n} x_i . It is straightforward to verify that ∂²l(µ̂, σ²)/∂µ̂² |_{µ̂=x̄} < 0. Since this
works for arbitrary x, µ̂ = X̄ is the MLE of µ. Similarly, by solving

$$\frac{\partial l(\mu, \hat{\sigma}^2)}{\partial \hat{\sigma}^2} = 0,$$

we obtain σ̂² = (1/n) ∑_{i=1}^{n} (x_i − µ)². Further replacing µ by µ̂, we derive the MLE of σ² as
σ̂² = (1/n) ∑_{i=1}^{n} (X_i − X̄)².
Hence, the sample mean X̄ and σ̂ 2 are both the MME and MLE for the mean µ and variance σ 2 , under a
normal population distribution F (·). More details regarding the properties of the likelihood function, and the
derivation of MLE under parametric distributions other than the normal distribution are given in Appendix
Chapter 16.
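As a quick numerical check (an illustration not taken from the text; the simulated values are assumptions),
the closed-form estimates agree with a direct numerical maximization of the normal log-likelihood:

# Closed-form normal MLEs versus numerical maximization of the log-likelihood
set.seed(2021)
x <- rnorm(500, mean = 7.8, sd = 1.7)          # simulated "log-claims"
mu.hat <- mean(x)
sigma2.hat <- mean((x - mean(x))^2)            # note the 1/n (not 1/(n-1)) divisor
negloglik <- function(par) {                   # par = c(mu, log(sigma^2))
  -sum(dnorm(x, mean = par[1], sd = sqrt(exp(par[2])), log = TRUE))
}
fit <- optim(c(mean(x), log(var(x))), negloglik)
rbind(closed.form = c(mu.hat, sigma2.hat),
      numerical   = c(fit$par[1], exp(fit$par[2])))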
Due to the additivity property of the normal distribution (i.e., a sum of normal random variables that
follows a multivariate normal distribution still follows a normal distribution) and the fact that the normal
distribution belongs to the location–scale family (i.e., a location and/or scale transformation of a normal
random variable has a normal distribution), the sample mean X̄ of a random sample from a normal F (·) has a
normal sampling distribution for any finite n. Given Xi ∼iid N (µ, σ 2 ), i = 1, . . . , n, the MLE of µ has an exact
distribution

$$\bar{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right).$$
Hence, the sample mean is an unbiased estimator of µ. In addition, the uncertainty in the estimation can be
quantified by its variance σ 2 /n, which decreases with the sample size n. As the sample size goes to infinity,
the sampling distribution of the sample mean concentrates in a single mass at the true value.
For the MLE of the mean parameter and the parameters of other parametric distribution families, however,
we usually cannot derive an exact sampling distribution for finite samples. Fortunately, when the sample size
is sufficiently large, MLEs can be approximated by a normal distribution. By general maximum likelihood
theory, the MLE has some nice large-sample properties.
• The MLE θ̂ of a parameter θ is a consistent estimator. That is, θ̂ converges in probability to the true
value θ, as the sample size n goes to infinity.
• The MLE has the asymptotic normality property, meaning that the estimator will converge in
distribution to a normal distribution centered around the true value, when the sample size goes to
infinity. Namely,

$$\sqrt{n}(\hat{\theta} - \theta) \rightarrow_d N(0, V), \quad \text{as } n \to \infty,$$

where V is the inverse of the Fisher information. Hence, the MLE θ̂ approximately follows a normal
distribution with mean θ and variance V /n, when the sample size is large.
• The MLE is efficient, meaning that it has the smallest asymptotic variance V , commonly referred to
as the Cramer–Rao lower bound. In particular, the Cramer–Rao lower bound is the inverse of the
Fisher information, defined as I(θ) = −E(∂² ln f (X; θ)/∂θ²). Hence, Var(θ̂) can be estimated based on
the observed Fisher information, which can be written as −∑_{i=1}^{n} ∂² ln f (X_i ; θ)/∂θ².
For many parametric distributions, the Fisher information may be derived analytically for the MLE of
parameters. For more sophisticated parametric models, the Fisher information can be evaluated numerically
using numerical integration for continuous distributions, or numerical summation for discrete distributions.
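For instance (a sketch not from the text, with an assumed Poisson example), the observed information can
be approximated by a numerical second difference of the log-likelihood and compared with its closed form
n/x̄:

# Observed Fisher information for a Poisson sample: numerical second difference
# of the log-likelihood at the MLE versus the closed form n/xbar.
set.seed(2021)
x <- rpois(200, lambda = 3)
lambda.hat <- mean(x)                          # MLE of lambda
loglik <- function(l) sum(dpois(x, l, log = TRUE))
h <- 1e-3
obs.info <- -(loglik(lambda.hat + h) - 2 * loglik(lambda.hat) + loglik(lambda.hat - h)) / h^2
c(numerical = obs.info, closed.form = length(x) / mean(x))
# The estimated variance of lambda.hat is then 1/obs.info (about xbar/n).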
Given that the MLE θ̂ has either an exact or an approximate normal distribution with mean θ and variance
Var(θ̂), we may take the square root of the variance and plug in the estimate to define $se(\hat{\theta}) = \sqrt{\widehat{\mathrm{Var}}(\hat{\theta})}$. A
standard error is an estimated standard deviation that quantifies the uncertainty in the estimation resulting
from the use of a finite sample. Under some regularity conditions governing the population distribution, we
may establish that the statistic

$$\frac{\hat{\theta} - \theta}{se(\hat{\theta})}$$

converges in distribution to a Student-t distribution with degrees of freedom (a parameter of the distribution)
n − p, where p is the number of parameters in the model other than the variance. For example, for the normal
distribution case, we have p = 1 for the parameter µ; for a linear regression model with an independent variable,
we have p = 2 for the parameters of the intercept and the independent variable. Denote by tn−p (1 − α/2) the
100 × (1 − α/2)-th percentile of the Student-t distribution that satisfies Pr [t < tn−p (1 − α/2)] = 1 − α/2.
We have

$$\Pr\left[-t_{n-p}\!\left(1 - \frac{\alpha}{2}\right) < \frac{\hat{\theta} - \theta}{se(\hat{\theta})} < t_{n-p}\!\left(1 - \frac{\alpha}{2}\right)\right] = 1 - \alpha,$$
from which we can derive a confidence interval for θ. From the above equation we can derive a pair of
statistics, θ̂1 and θ̂2 , that provide an interval of the form [θ̂1 , θ̂2 ]. This interval is a 1 − α confidence interval
for θ such that Pr(θ̂1 ≤ θ ≤ θ̂2 ) = 1 − α, where the probability 1 − α is referred to as the confidence level.
Note that the above confidence interval is not valid for small samples, except for the case of the normal mean.
Normal distribution. For the normal population mean µ, the MLE has an exact sampling distribution
X̄ ∼ N (µ, σ²/n), in which we can estimate se(µ̂) by σ̂/√n. Based on Cochran’s theorem, the resulting
statistic has an exact Student-t distribution with degrees of freedom n − 1. Hence, we can derive the lower
and upper bounds of the confidence interval as

$$\hat{\mu}_1 = \hat{\mu} - t_{n-1}\!\left(1 - \frac{\alpha}{2}\right)\frac{\hat{\sigma}}{\sqrt{n}}
\quad \text{and} \quad
\hat{\mu}_2 = \hat{\mu} + t_{n-1}\!\left(1 - \frac{\alpha}{2}\right)\frac{\hat{\sigma}}{\sqrt{n}}.$$

When α = 0.05, tn−1 (1 − α/2) ≈ 1.96 for large values of n. Based on Cochran’s theorem, the confidence
interval is valid regardless of the sample size.
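For illustration (a sketch, not the book’s code), the interval can be computed from the summary statistics
of the logarithmic claims quoted earlier (x̄ = 7.804, s = 1.683, with n = 1377 policies as used later in this
section):

# 95% t-based confidence interval for the mean of the logarithmic claims
xbar <- 7.804; s <- 1.683; n <- 1377
alpha <- 0.05
half.width <- qt(1 - alpha/2, df = n - 1) * s / sqrt(n)
c(lower = xbar - half.width, upper = xbar + half.width)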
In a statistical test, we are usually interested in testing whether a statement regarding some parameter(s), a
null hypothesis (denoted H0 ), is true given the observed data. The null hypothesis can take a general form
H0 : θ ∈ Θ0 , where Θ0 is a subset of the parameter space Θ of θ that may contain multiple parameters. For
the case with a single parameter θ, the null hypothesis usually takes either the form H0 : θ = θ0 or H0 : θ ≤ θ0 .
The opposite of the null hypothesis is called the alternative hypothesis, which can be written as Ha : θ ≠ θ0
or Ha : θ > θ0 . The statistical test on H0 : θ = θ0 is called a two-sided test, as the alternative hypothesis
contains two inequalities, Ha : θ < θ0 or θ > θ0 . In contrast, the statistical test on either H0 : θ ≤ θ0 or
H0 : θ ≥ θ0 is called a one-sided test.
A statistical test is usually constructed based on a statistic T and its exact or large-sample distribution. The
test for a two-sided alternative typically rejects the null hypothesis when either T > c1 or T < c2 , where the
two constants c1 and c2 are obtained based on the sampling distribution of T at a probability level α called
the level of significance. In particular, the level of significance α satisfies

$$\alpha = \Pr(\text{reject } H_0 \mid H_0 \text{ is true}),$$

meaning that if the null hypothesis were true, we would reject it only 100α% of the time (e.g., 5% of the time
when α = 0.05), if we were to repeat the sampling process and perform the test over and over again.
Thus, the level of significance is the probability of making a type I error (error of the first kind), the error
of incorrectly rejecting a true null hypothesis. For this reason, the level of significance α is also referred to as
the type I error rate. Another type of error we may make in hypothesis testing is the type II error (error
of the second kind), the error of incorrectly accepting a false null hypothesis. Similarly, we can define the
type II error rate as the probability of not rejecting (accepting) a null hypothesis given that it is not true.
That is, the type II error rate is given by

$$\Pr(\text{do not reject } H_0 \mid H_0 \text{ is false}).$$
Another important quantity concerning the quality of the statistical test is the power of the test, β,
defined as the probability of rejecting a false null hypothesis. The mathematical definition of the power is

$$\beta = \Pr(\text{reject } H_0 \mid H_0 \text{ is false}).$$
Note that the power of the test is typically calculated based on a specific alternative value of θ = θa , given a
specific sampling distribution and a given sample size. In real experimental studies, people usually calculate
the required sample size in order to choose a sample size that will ensure a large chance of obtaining a
statistically significant test (i.e., with a prespecified statistical power such as 85%).
Based on the results from Section 15.3.1, we can define a Student t test for testing H0 : θ = θ0 . In particular,
we define the test statistic as

$$t\text{-stat} = \frac{\hat{\theta} - \theta_0}{se(\hat{\theta})},$$
which has a large-sample distribution of a Student-t distribution with degrees of freedom n − p, when the
null hypothesis is true (i.e., when θ = θ0 ).
For a given level of significance α, say 5%, we reject the null hypothesis if the event t-stat < −tn−p (1 − α/2)
or t-stat > tn−p (1 − α/2) occurs (the rejection region). Under the null hypothesis H0 , we have
$$\Pr\left[t\text{-stat} < -t_{n-p}\!\left(1 - \frac{\alpha}{2}\right)\right] = \Pr\left[t\text{-stat} > t_{n-p}\!\left(1 - \frac{\alpha}{2}\right)\right] = \frac{\alpha}{2}.$$
In addition to the concept of a rejection region, we may reject the null hypothesis based on the p-value,
defined as 2 Pr(T > |t-stat|) for the aforementioned two-sided test, where the random variable T ∼ Tn−p . We
reject the null hypothesis if the p-value is smaller than or equal to α. For a given sample, the p-value is the
smallest significance level for which the null hypothesis would be rejected.
Similarly, we can construct a one-sided test for the null hypothesis H0 : θ ≤ θ0 (or H0 : θ ≥ θ0 ). Using the
same test statistic, we reject the null hypothesis when t-stat > tn−p (1 − α) (or t-stat < −tn−p (1 − α) for
the test on H0 : θ ≥ θ0 ). The corresponding p-value is defined as Pr(T > t-stat) (or Pr(T < t-stat) for the
test on H0 : θ ≥ θ0 ). Note that the test is not valid for small samples, except for the case of the test on the
normal mean.
One-sample t Test for Normal Mean. For the test on the normal mean of the form H0 : µ = µ0 ,
H0 : µ ≤ µ0 , or H0 : µ ≥ µ0 , we can define the test statistic as

$$t\text{-stat} = \frac{\bar{X} - \mu_0}{\hat{\sigma}/\sqrt{n}},$$

for which we have an exact sampling distribution t-stat ∼ Tn−1 from Cochran’s theorem, with Tn−1
denoting a Student-t distribution with degrees of freedom n − 1. According to Cochran’s theorem, the
test is valid for both small and large samples.
Example – Wisconsin Property Fund. Assume that mean logarithmic claims have historically been
approximately µ0 = ln(5000) = 8.517. We might want to use the 2010 data to assess whether the mean of
the distribution has changed (a two-sided test), or whether it has increased (a one-sided test). Given the actual
2010 average µ̂ = 7.804, we may use the one-sample t test to assess whether this is a significant departure
from µ0 = 8.517 (i.e., in testing H0 : µ = 8.517). The test statistic is t-stat = (7.804 − 8.517)/(1.683/√1377) =
−15.72, so |t-stat| > t1376 (0.975). Hence, we reject the two-sided test at α = 5%. Similarly, we reject the
corresponding one-sided test at α = 5%.
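A quick numerical check of this calculation (a sketch; the full data are not reproduced here, so the quoted
summary statistics are plugged in):

# One-sample t test of H0: mu = log(5000) using the quoted summaries
mu0 <- log(5000); xbar <- 7.804; s <- 1.683; n <- 1377
t.stat <- (xbar - mu0) / (s / sqrt(n))
t.stat                               # about -15.7
2 * pt(-abs(t.stat), df = n - 1)     # two-sided p-value, essentially zero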
Example – Wisconsin Property Fund. For numerical stability and extensions to regression applications,
statistical packages often work with transformed versions of parameters. The following estimates are from
the R package VGAM (the vglm() function). More details on the MLE of other distribution families are given
in Appendix Chapter 17.
In the previous subsection, we have introduced the Student-t test on a single parameter, based on the
properties of the MLE. In this section, we define an alternative test called the likelihood ratio test (LRT).
The LRT may be used to test multiple parameters from the same statistical model.
Given the likelihood function L(θ|x) and Θ0 ⊂ Θ, the likelihood ratio test statistic for testing H0 : θ ∈ Θ0
against Ha : θ ∉ Θ0 is given by

$$L = \frac{\sup_{\theta \in \Theta_0} L(\theta|\mathbf{x})}{\sup_{\theta \in \Theta} L(\theta|\mathbf{x})}.$$

For the special case of a simple null hypothesis H0 : θ = θ0 , this reduces to

$$L = \frac{L(\theta_0|\mathbf{x})}{\sup_{\theta \in \Theta} L(\theta|\mathbf{x})}.$$
The LRT rejects the null hypothesis when L < c, with the threshold depending on the level of significance α,
the sample size n, and the number of parameters in θ. Based on the Neyman–Pearson Lemma, the LRT
is the uniformly most powerful (UMP) test for testing H0 : θ = θ0 versus Ha : θ = θa . That is, it provides
the largest power β for a given α and a given alternative value θa .
Based on Wilks’s Theorem, the likelihood ratio test statistic −2 ln(L) converges in distribution to a
Chi-square distribution with the degree of freedom being the difference between the dimensionality of the
parameter spaces Θ and Θ0 , when the sample size goes to infinity and when the null model is nested within
the alternative model. That is, when the null model is a special case of the alternative model containing
a restricted parameter space, we may approximate the rejection threshold by χ²_{p1−p2}(1 − α), the
100 × (1 − α)-th percentile of the Chi-square distribution, with p1 − p2 being the degrees of freedom, and p1
and p2 being the numbers of parameters in the alternative and null models, respectively. Note that the LRT
is also a large-sample test that will not be valid for small samples.
In real-life applications, the LRT has been commonly used for comparing two nested models. The LRT
approach as a model selection tool, however, has two major drawbacks: 1) it typically requires the null
model to be nested within the alternative model; 2) models selected by the LRT tend to over-fit in sample,
leading to poor out-of-sample prediction. In order to overcome these issues, model selection based on
information criteria, which applies to non-nested models while taking model complexity into consideration,
is more widely used. Here, we introduce the two most widely used criteria, the Akaike information criterion
and the Bayesian information criterion.
In particular, the Akaike information criterion (AIC) is defined as

$$AIC = -2\, l(\hat{\theta}|\mathbf{x}) + 2p,$$

where θ̂ denotes the MLE of θ, and p is the number of parameters in the model. The additional term 2p
represents a penalty for the complexity of the model. That is, with the same maximized likelihood function,
the AIC favors models with fewer parameters. We note that the AIC does not consider the impact of the
sample size n.
Alternatively, one may use the Bayesian information criterion (BIC), which takes the sample size into
consideration. The BIC is defined as

$$BIC = -2\, l(\hat{\theta}|\mathbf{x}) + p\,\ln(n).$$

We observe that the BIC generally puts a higher weight on the number of parameters. With the same
maximized likelihood function, the BIC will suggest a more parsimonious model than the AIC (whenever
ln(n) > 2, that is, n ≥ 8).
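For instance (an illustration not taken from the text; the lognormal fit and parameter values are assumptions),
both criteria can be computed directly from a model’s maximized log-likelihood:

# AIC and BIC for a lognormal fit, computed from the maximized log-likelihood
set.seed(2021)
y <- rlnorm(200, meanlog = 8, sdlog = 1.7)    # simulated claims
n <- length(y); p <- 2                        # two parameters: meanlog, sdlog
mu.hat  <- mean(log(y))
sig.hat <- sqrt(mean((log(y) - mu.hat)^2))    # MLEs of the lognormal parameters
loglik  <- sum(dlnorm(y, meanlog = mu.hat, sdlog = sig.hat, log = TRUE))
c(AIC = -2*loglik + 2*p, BIC = -2*loglik + log(n)*p)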
Example – Wisconsin Property Fund. Both the AIC and BIC statistics suggest that the GB2 is the
best fitting model, whereas the gamma is the worst.
In this graph,
• black represents the actual (smoothed) logarithmic claims;
• the claims are best approximated by the green curve, the fitted GB2;
• the Pareto (purple) and lognormal (light blue) fits are also quite good;
• the worst fits are the exponential (red) and gamma (dark blue).
## Sample size: 6258
R Code for Fitted Claims Distributions
# R Code to fit several claims distributions
ClaimLev <- read.csv("Data/CLAIMLEVEL.csv", header=TRUE); nrow(ClaimLev)
ClaimData<-subset(ClaimLev,Year==2010);
#Use "VGAM" library for estimation of parameters
library(VGAM)
fit.LN <- vglm(Claim ~ 1, family=lognormal, data = ClaimData)
fit.gamma <- vglm(Claim ~ 1, family=gamma2, data = ClaimData)
theta.gamma<-exp(coef(fit.gamma)[1])/exp(coef(fit.gamma)[2])
alpha.gamma<-exp(coef(fit.gamma)[2])
fit.exp <- vglm(Claim ~ 1, exponential, data = ClaimData)
fit.pareto <- vglm(Claim ~ 1, paretoII, loc=0, data = ClaimData)
###################################################
# Inference assuming a GB2 Distribution - this is more complicated
# The likelihood function of GB2 distribution (negative for optimization)
Chapter 16

Appendix B: Iterated Expectations
This appendix introduces the laws related to iterated expectations. In particular, Section 16.1 introduces the
concepts of conditional distribution and conditional expectation. Section 16.2 introduces the Law of Iterated
Expectations and the Law of Total Variance.
In some situations, we only observe a single outcome but can conceptualize an outcome as resulting from a
two (or more) stage process. Such types of statistical models are called two-stage, or hierarchical models.
Some special cases of hierarchical models include:
• models where the parameters of the distribution are random variables;
• mixture distributions, where Stage 1 represents the draw of a sub-population and Stage 2 represents a
random variable from a distribution determined by the sub-population drawn in Stage 1;
• aggregate distributions, where Stage 1 represents the draw of the number of events and Stage 2
represents the loss amount incurred per event.
In these situations, the process gives rise to a conditional distribution of a random variable (the Stage 2
outcome) given the other (the Stage 1 outcome). The Law of Iterated Expectations can be useful for obtaining
the unconditional expectation or variance of a random variable in such cases.
Here we introduce the concept of a conditional distribution, separately for discrete and continuous random
variables.
Discrete Case
Suppose that X and Y are both discrete random variables, meaning that they can take a finite or countable
number of possible values with a positive probability. The joint probability (mass) function of (X, Y )
is defined as
$$p(x, y) = \Pr[X = x, Y = y].$$
When X and Y are independent (the value of X does not depend on that of Y ), we have
p(x, y) = p(x)p(y),
with p(x) = Pr[X = x] and p(y) = Pr[Y = y] being the marginal probability function of X and Y ,
respectively.
Given the joint probability function, we may obtain the marginal probability function of Y as

$$p(y) = \sum_{x} p(x, y),$$
where the summation is over all possible values of x, and the marginal probability function of X can be
obtained in a similar manner.
The conditional probability (mass) function of (Y |X) is defined as

$$p(y|x) = \Pr[Y = y|X = x] = \frac{p(x, y)}{\Pr[X = x]},$$
where we may obtain the conditional probability function of (X|Y ) in a similar manner. In particular, the
above conditional probability represents the probability of the event Y = y given the event X = x. Hence,
even in cases where Pr[X = x] = 0, the conditional probability function may still be specified in a particular
form in applications.
Continuous Case
For continuous random variables X and Y , we may define their joint probability (density) function based
on the joint cumulative distribution function. The joint cumulative distribution function of (X, Y ) is
defined as
$$F(x, y) = \Pr[X \le x, Y \le y],$$
with F (x) = Pr[X ≤ x] and F (y) = Pr[Y ≤ y] being the cumulative distribution function (cdf) of X
and Y , respectively. The random variable X is referred to as a continuous random variable if its cdf is
continuous on x.
When the cdf F (x) is continuous on x, then we define f (x) = ∂F (x)/∂x as the (marginal) probability
density function (pdf) of X. Similarly, if the joint cdf F (x, y) is continuous on both x and y, we define
$$f(x, y) = \frac{\partial^2 F(x, y)}{\partial x \partial y}$$
as the joint probability density function of (X, Y ), in which case we refer to the random variables as
jointly continuous.
When X and Y are independent, we have

$$f(x, y) = f(x)f(y).$$
Given the joint density function, we may obtain the marginal density function of Y as

$$f(y) = \int_{x} f(x, y)\, dx,$$

where the integral is over all possible values of x, and the marginal density function of X can be obtained
in a similar manner.
Based on the joint pdf and the marginal pdf, we define the conditional probability density function of
(Y |X) as

$$f(y|x) = \frac{f(x, y)}{f(x)},$$
where we may obtain the conditional probability function of (X|Y ) in a similar manner. Here, the conditional
density function is the density function of y given X = x. Hence, even in cases where Pr[X = x] = 0 or when
f (x) is not defined, the function may be given in a particular form in real applications.
Now we define the conditional expectation and variance based on the conditional distribution defined in the
previous subsection.
Discrete Case
For a discrete random variable Y , its expectation is defined as E[Y ] = ∑_y y p(y) if its value is finite, and
its variance is defined as Var[Y ] = E{(Y − E[Y ])²} = ∑_y y² p(y) − {E[Y ]}² if its value is finite.
For a discrete random variable Y , the conditional expectation of the random variable Y given the event
X = x is defined as

$$E[Y|X = x] = \sum_{y} y\, p(y|x),$$

where X does not have to be a discrete variable, as long as the conditional probability function p(y|x) is given.
Note that the conditional expectation E[Y |X = x] is a fixed number. When we replace x with X on the right-
hand side of the above equation, we can define the expectation of Y given the random variable X as

$$E[Y|X] = \sum_{y} y\, p(y|X).$$

In a similar manner, we can define the conditional variance of the random variable Y given the event
X = x as

$$\mathrm{Var}[Y|X = x] = E[Y^2|X = x] - \{E[Y|X = x]\}^2 = \sum_{y} y^2\, p(y|x) - \{E[Y|X = x]\}^2.$$

The variance of Y given X, Var[Y |X], can be defined by replacing x by X in the above equation; Var[Y |X]
is still a random variable, and the randomness comes from X.
Continuous Case
For a continuous random variable Y , its expectation is defined as E[Y ] = ∫_y y f (y) dy if the integral exists,
and its variance is defined as Var[Y ] = E{(Y − E[Y ])²} = ∫_y y² f (y) dy − {E[Y ]}² if its value is finite.
For jointly continuous random variables X and Y , the conditional expectation of the random variable Y
given X = x is defined as

$$E[Y|X = x] = \int_{y} y\, f(y|x)\, dy,$$

where X does not have to be a continuous variable, as long as the conditional density function f (y|x) is
given.
Similarly, the conditional expectation E[Y |X = x] is a fixed number. When we replace x with X on the
right-hand side of the above equation, we can define the expectation of Y given the random variable X as

$$E[Y|X] = \int_{y} y\, f(y|X)\, dy.$$

The variance of Y given X, Var[Y |X], can then be defined by replacing x by X in the corresponding equation;
similarly, Var[Y |X] is also a random variable, and the randomness comes from X.
Consider two random variables X and Y , and h(X, Y ), a random variable depending on the function h, X
and Y .
Assuming all the expectations exist and are finite, the Law of Iterated Expectations states that

$$E[h(X, Y)] = E\left\{ E\left[h(X, Y)\,|\,X\right] \right\},$$

where the first (inside) expectation is taken with respect to the random variable Y and the second (outside)
expectation is taken with respect to X.
For the Law of Iterated Expectations, the random variables may be discrete, continuous, or a hybrid
combination of the two. We use the example of discrete variables of X and Y to illustrate the calculation of
the unconditional expectation using the Law of Iterated Expectations. For continuous random variables, we
only need to replace the summation with the integral, as illustrated earlier in the appendix.
Given the conditional pmf p(y|x) of Y given X, the conditional expectation of h(X, Y ) given the event X = x
is defined as

$$E\left[h(X, Y)|X = x\right] = \sum_{y} h(x, y)\, p(y|x),$$

and the conditional expectation of h(X, Y ) given X being a random variable can be written as

$$E\left[h(X, Y)|X\right] = \sum_{y} h(X, y)\, p(y|X).$$
The unconditional expectation of h(X, Y ) can then be obtained by taking the expectation of E [h(X, Y )|X]
with respect to the random variable X. That is, we can obtain E[h(X, Y )] as
$$\begin{aligned}
E\left\{E\left[h(X, Y)|X\right]\right\} &= \sum_{x}\left\{\sum_{y} h(x, y)\, p(y|x)\right\} p(x)\\
&= \sum_{x}\sum_{y} h(x, y)\, p(y|x)\, p(x)\\
&= \sum_{x}\sum_{y} h(x, y)\, p(x, y) = E[h(X, Y)].
\end{aligned}$$
The Law of Iterated Expectations for the continuous and hybrid cases can be proved in a similar manner, by
replacing the corresponding summation(s) by integral(s).
Assuming that all the variances exist and are finite, the Law of Total Variance states that
Var[h(X, Y )] = E {Var [h(X, Y )|X]} + Var {E [h(X, Y )|X]} ,
where the first (inside) expectation/variance is taken with respect to the random variable Y and the second
(outside) expectation/variance is taken with respect to X. Thus, the unconditional variance equals the
expectation of the conditional variance plus the variance of the conditional expectation.
To see this, first note that

$$\begin{aligned}
E\left\{\mathrm{Var}\left[h(X, Y)|X\right]\right\} &= E\left\{ E\left[h(X, Y)^2|X\right] - \left(E\left[h(X, Y)|X\right]\right)^2 \right\}\\
&= E\left[h(X, Y)^2\right] - E\left\{\left(E\left[h(X, Y)|X\right]\right)^2\right\}.
\end{aligned} \qquad (16.1)$$
Further, note that the conditional expectation, E [h(X, Y )|X], is a function of X, denoted g(X). Thus, g(X)
is a random variable with mean E[h(X, Y )] and variance

$$\mathrm{Var}\left\{E\left[h(X, Y)|X\right]\right\} = \mathrm{Var}[g(X)] = E\left\{\left(E\left[h(X, Y)|X\right]\right)^2\right\} - \left(E\left[h(X, Y)\right]\right)^2. \qquad (16.2)$$
Thus, adding Equations (16.1) and (16.2) leads to the unconditional variance Var [h(X, Y )].
16.2.3 Application
To apply the Law of Iterated Expectations and the Law of Total Variance, we generally adopt the following
procedure.
1. Identify the random variable that is being conditioned upon, typically a stage 1 outcome (that is not
observed).
2. Conditional on the stage 1 outcome, calculate summary measures such as a mean, variance, and the
like.
3. There will be several results from step 2, one for each stage 1 outcome. Then, combine these results
using the iterated expectations or total variance rules.
Mixtures of Finite Populations. Suppose that the random variable N1 represents a realization of the
number of claims in a policy year from the population of good drivers and N2 represents that from the
population of bad drivers. For a specific driver, there is a probability α that (s)he is a good driver. For a
specific draw N , we have

$$N = \begin{cases} N_1, & \text{if (s)he is a good driver;}\\ N_2, & \text{otherwise.} \end{cases}$$

Let T be the indicator of whether (s)he is a good driver, with T = 1 representing that the driver is a good
driver with Pr[T = 1] = α and T = 2 representing that the driver is a bad driver with Pr[T = 2] = 1 − α.
From the Law of Iterated Expectations, we can obtain the expected number of claims as

$$E[N] = E\left\{E[N|T]\right\} = \alpha\, E[N_1] + (1-\alpha)\, E[N_2].$$

To be more concrete, suppose that Nj follows a Poisson distribution with mean λj , j = 1, 2. Then we have

$$\mathrm{Var}[N|T = j] = E[N|T = j] = \lambda_j, \quad j = 1, 2,$$

so that E{Var[N |T ]} = αλ1 + (1 − α)λ2 and Var{E[N |T ]} = α(1 − α)(λ1 − λ2 )². Note that the latter is the
variance of a random variable taking the outcomes λ1 and λ2 with binomial probability α. Based on the
Law of Total Variance, the unconditional variance of N is given by

$$\mathrm{Var}[N] = E\left\{\mathrm{Var}[N|T]\right\} + \mathrm{Var}\left\{E[N|T]\right\} = \alpha\lambda_1 + (1-\alpha)\lambda_2 + \alpha(1-\alpha)(\lambda_1 - \lambda_2)^2.$$
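These formulas can be verified quickly by simulation (an illustration added here; the parameter values
α = 0.3, λ1 = 0.5, λ2 = 2 are assumptions chosen for the sketch):

# Simulate the two-stage model: draw the driver type, then a Poisson count.
set.seed(2021)
alpha <- 0.3; lambda1 <- 0.5; lambda2 <- 2     # assumed illustrative values
n <- 1e5
good <- rbinom(n, 1, alpha)                     # 1 = good driver
N <- rpois(n, ifelse(good == 1, lambda1, lambda2))
# Compare with the iterated-expectation and total-variance formulas
c(sim.mean = mean(N), formula.mean = alpha*lambda1 + (1 - alpha)*lambda2)
c(sim.var = var(N),
  formula.var = alpha*lambda1 + (1 - alpha)*lambda2 + alpha*(1 - alpha)*(lambda1 - lambda2)^2)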
Chapter 17

Appendix C: Maximum Likelihood Theory

Chapter preview. Appendix Chapter 15 introduced the maximum likelihood theory regarding estimation of
parameters from a parametric family. This appendix gives more specific examples and expands some of the
concepts. Section 17.1 reviews the definition of the likelihood function, and introduces its properties. Section
17.2 reviews the maximum likelihood estimators, and extends their large-sample properties to the case where
there are multiple parameters in the model. Section 17.3 reviews statistical inference based on maximum
likelihood estimators, with specific examples on cases with multiple parameters.
From Appendix 15, the likelihood function is a function of parameters given the observed data. Here, we
review the concepts of the likelihood function and introduce its properties that form the basis for maximum
likelihood inference.
Here, we give a brief review of the likelihood function and the log-likelihood function from Appendix 15. Let
f (·|θ) be the probability function of X, the probability mass function (pmf) if X is discrete or the probability
density function (pdf) if it is continuous. The likelihood is a function of the parameters (θ) given the data
(x). Hence, it is a function of the parameters with the data being fixed, rather than a function of the data
with the parameters being fixed. The vector of data x is usually a realization of a random sample as defined
in Appendix 15.
Given a realization of a random sample x = (x1 , x2 , · · · , xn ) of size n, the likelihood function is defined as

$$L(\theta|\mathbf{x}) = f(\mathbf{x}|\theta) = \prod_{i=1}^{n} f(x_i|\theta),$$
where f (x|θ) denotes the joint probability function of x. The log-likelihood function, l(θ|x) = ln L(θ|x) =
∑_{i=1}^{n} ln f (x_i |θ), leads to an additive structure that is easy to work with.
In Appendix 15, we have used the normal distribution to illustrate concepts of the likelihood function and
the log-likelihood function. Here, we derive the likelihood and corresponding log-likelihood functions when
the population distribution is from the Pareto distribution family.
Example – Pareto Distribution. Suppose that X1 , . . . , Xn represents a random sample from a single-
parameter Pareto distribution with the cumulative distribution function given by
$$F(x) = \Pr(X_i \le x) = 1 - \left(\frac{500}{x}\right)^{\alpha}, \quad x > 500,$$

where the parameter θ = α.
The corresponding probability density function is f (x) = 500^α α x^{−α−1}, and the log-likelihood function can
be derived as

$$l(\alpha|\mathbf{x}) = \sum_{i=1}^{n} \ln f(x_i; \alpha) = n\alpha \ln 500 + n \ln \alpha - (\alpha + 1)\sum_{i=1}^{n} \ln x_i.$$
In mathematical statistics, the first derivative of the log-likelihood function with respect to the parameters,
u(θ) = ∂l(θ|x)/∂θ, is referred to as the score function, or the score vector when there are multiple
parameters in θ. The score function or score vector can be written as

$$u(\theta) = \frac{\partial}{\partial \theta}\, l(\theta|\mathbf{x}) = \frac{\partial}{\partial \theta} \ln \prod_{i=1}^{n} f(x_i; \theta) = \sum_{i=1}^{n} \frac{\partial}{\partial \theta} \ln f(x_i; \theta),$$

where u(θ) = (u1 (θ), u2 (θ), · · · , up (θ)) when θ = (θ1 , · · · , θp ) contains two or more parameters (p ≥ 2), with
the element uk (θ) = ∂l(θ|x)/∂θk being the partial derivative with respect to θk (k = 1, 2, · · · , p).
The likelihood function has the following properties:
• One basic property of the likelihood function is that the expectation of the score function with respect
to x is 0. That is,

$$E[u(\theta)] = E\left[\frac{\partial}{\partial \theta}\, l(\theta|\mathbf{x})\right] = 0.$$

To illustrate this, we have

$$E\left[\frac{\partial}{\partial \theta}\, l(\theta|\mathbf{x})\right]
= E\left[\frac{\frac{\partial}{\partial \theta} f(\mathbf{x}; \theta)}{f(\mathbf{x}; \theta)}\right]
= \int \frac{\partial}{\partial \theta} f(\mathbf{y}; \theta)\, d\mathbf{y}
= \frac{\partial}{\partial \theta}\int f(\mathbf{y}; \theta)\, d\mathbf{y}
= \frac{\partial}{\partial \theta} 1 = 0.$$
• Denote by ∂²l(θ|x)/∂θ∂θ′ = ∂²l(θ|x)/∂θ² the second derivative of the log-likelihood function when
θ is a single parameter, or by ∂²l(θ|x)/∂θ∂θ′ = (h_{jk}) = (∂²l(θ|x)/∂θ_j ∂θ_k ) the Hessian matrix of the
log-likelihood function when it contains multiple parameters. Denote [∂l(θ|x)/∂θ][∂l(θ|x)/∂θ′] = u²(θ)
when θ is a single parameter, or let [∂l(θ|x)/∂θ][∂l(θ|x)/∂θ′] = (uu_{jk}) be a p × p matrix when θ contains
a total of p parameters, with each element uu_{jk} = u_j (θ)u_k (θ) and u_j (θ) being the jth element of the
score vector as defined earlier. Another basic property of the likelihood function is that the sum of the
expectation of the Hessian matrix and the expectation of the Kronecker product of the score vector and
its transpose is 0. That is,

$$E\left[\frac{\partial^2}{\partial \theta \partial \theta'}\, l(\theta|\mathbf{x})\right] + E\left[\frac{\partial l(\theta|\mathbf{x})}{\partial \theta}\, \frac{\partial l(\theta|\mathbf{x})}{\partial \theta'}\right] = 0.$$

• The expectation in the second term defines the Fisher information (matrix),

$$I(\theta) = E\left[\frac{\partial l(\theta|\mathbf{x})}{\partial \theta}\, \frac{\partial l(\theta|\mathbf{x})}{\partial \theta'}\right] = -E\left[\frac{\partial^2}{\partial \theta \partial \theta'}\, l(\theta|\mathbf{x})\right].$$
As the sample size n goes to infinity, the score function (vector) converges in distribution to a normal
distribution (or multivariate normal distribution when θ contains multiple parameters) with mean 0
and variance (or covariance matrix in the multivariate case) given by I(θ).
In statistics, maximum likelihood estimators are values of the parameters θ that are most likely to have been
produced by the data.
Based on the definition given in Appendix 15, the value of θ, say θ̂ M LE , that maximizes the likelihood
function, is called the maximum likelihood estimator (MLE) of θ.
Because the log function ln(·) is a one-to-one function, we can also determine θ̂ M LE by maximizing the
log-likelihood function, l(θ|x). That is, the MLE is defined as
θ̂ M LE = argmaxθ∈Θ l(θ|x).
Given the analytical form of the likelihood function, the MLE can be obtained by taking the first derivative
of the log-likelihood function with respect to θ and setting the values of the partial derivatives to zero. That
is, the MLE is the solution of

$$\frac{\partial l(\hat{\theta}|\mathbf{x})}{\partial \hat{\theta}} = 0.$$
Example. Course C/Exam 4. May 2000, 21. You are given the following five observations: 521, 658,
702, 819, 1217. You use the single-parameter Pareto with cumulative distribution function:
$$F(x) = 1 - \left(\frac{500}{x}\right)^{\alpha}, \quad x > 500.$$
Calculate the maximum likelihood estimate of the parameter α.
Solution. With n = 5, the log-likelihood function is

$$l(\alpha|\mathbf{x}) = \sum_{i=1}^{5} \ln f(x_i; \alpha) = 5\alpha \ln 500 + 5 \ln \alpha - (\alpha + 1)\sum_{i=1}^{5} \ln x_i.$$
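Setting the derivative ∂l/∂α = 5/α + 5 ln 500 − ∑ ln x_i to zero gives the closed-form solution
α̂ = 5/∑ ln(x_i /500), which can be evaluated in R (a short sketch added here for completeness):

# MLE of the single-parameter Pareto index with threshold 500
x <- c(521, 658, 702, 819, 1217)
alpha.hat <- length(x) / sum(log(x / 500))
alpha.hat            # approximately 2.45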
From Appendix 15, the MLE has some nice large-sample properties under certain regularity conditions. We
presented the results for a single parameter in Appendix 15, but they remain true when θ contains multiple
parameters. In particular, we have the following results in the general case where θ = (θ1 , θ2 , · · · , θp ).
• The MLE of a parameter θ, θ̂ M LE , is a consistent estimator. That is, the MLE θ̂ M LE converges in
probability to the true value θ, as the sample size n goes to infinity.
• The MLE has the asymptotic normality property, meaning that the estimator will converge in
distribution to a multivariate normal distribution centered around the true value, when the sample size
goes to infinity. Namely,
$$\sqrt{n}(\hat{\theta}_{MLE} - \theta) \rightarrow N(0, V), \quad \text{as } n \to \infty,$$
where V denotes the asymptotic variance (or covariance matrix) of the estimator. Hence, the MLE
θ̂ M LE has an approximate normal distribution with mean θ and variance (covariance matrix when
p > 1) V /n, when the sample size is large.
• The MLE is efficient, meaning that it has the smallest asymptotic variance V , commonly referred to
as the Cramer–Rao lower bound. In particular, the Cramer–Rao lower bound is the inverse of the
Fisher information (matrix) I(θ) defined earlier in this appendix. Hence, Var(θ̂ M LE ) can be estimated
based on the observed Fisher information.
Based on the above results, we may perform statistical inference based on the procedures defined in Appendix
15.
Show Example
Example. Course C/Exam 4. Nov 2000, 13. A sample of ten observations comes from a parametric
family f (x; θ1 , θ2 ) with log-likelihood function

$$l(\theta_1, \theta_2) = \sum_{i=1}^{10} \ln f(x_i; \theta_1, \theta_2) = -2.5\theta_1^2 - 3\theta_1\theta_2 - \theta_2^2 + 5\theta_1 + 2\theta_2 + k,$$

where k is a constant. Determine the estimated covariance matrix of the maximum likelihood estimator,
(θ̂1 , θ̂2 ).
Solution. Denoting l = l(θ1 , θ2 ), the Hessian matrix of second derivatives is

$$\begin{pmatrix} \frac{\partial^2 l}{\partial \theta_1^2} & \frac{\partial^2 l}{\partial \theta_1 \partial \theta_2}\\[4pt] \frac{\partial^2 l}{\partial \theta_1 \partial \theta_2} & \frac{\partial^2 l}{\partial \theta_2^2} \end{pmatrix} = \begin{pmatrix} -5 & -3\\ -3 & -2 \end{pmatrix}.$$

The Fisher information matrix is the negative of the expectation of the Hessian matrix,

$$I(\theta_1, \theta_2) = -E\left[\frac{\partial^2}{\partial \theta \partial \theta'}\, l(\theta|\mathbf{x})\right] = \begin{pmatrix} 5 & 3\\ 3 & 2 \end{pmatrix},$$

and

$$I^{-1}(\theta_1, \theta_2) = \frac{1}{5(2) - 3(3)}\begin{pmatrix} 2 & -3\\ -3 & 5 \end{pmatrix} = \begin{pmatrix} 2 & -3\\ -3 & 5 \end{pmatrix},$$

which is the estimated covariance matrix of (θ̂1 , θ̂2 ).
The method of maximum likelihood has many advantages over alternative methods such as the method of
moments introduced in Appendix 15.
• It is a general tool that works in many situations. For example, we may be able to write out the
closed-form likelihood function for censored and truncated data. Maximum likelihood estimation can
be used for regression models including covariates, such as survival regression, generalized linear models
and mixed models, that may include covariates that are time-dependent.
• From the efficiency of the MLE, it is optimal (the best) in the sense that it has the smallest variance
among the class of all unbiased estimators for large sample sizes.
• From the results on the asymptotic normality of the MLE, we can obtain a large-sample distribution for
the estimator, allowing users to assess the variability in the estimation and perform statistical inference
on the parameters. The approach is less computationally intensive than re-sampling methods that
require a large number of model fits.
Despite its numerous advantages, maximum likelihood has drawbacks in cases, such as generalized linear
models, where the MLE does not have a closed analytical form. In such cases, maximum likelihood estimators
are computed iteratively using numerical optimization methods. For example, we may use the Newton-Raphson
iterative algorithm or its variations to obtain the MLE. Iterative algorithms require starting values. For some
problems, the choice of a close starting value is critical, particularly in cases where the likelihood function has
local minima or maxima, and there may be convergence issues when the starting value is far from the
maximum. Hence, it is important to start from different values across the parameter space and compare the
maximized likelihood or log-likelihood to make sure the algorithms have converged to a global maximum.
In Appendix 15, we have introduced maximum likelihood-based methods for statistical inference when θ
contains a single parameter. Here, we will extend the results to cases where there are multiple parameters in
θ.
In Appendix 15, we defined hypothesis testing concerning the null hypothesis, a statement on the parameter(s)
of a distribution or model. One important type of inference is to assess whether a parameter estimate is
statistically significant, meaning whether the value of the parameter is zero or not.
We have learned earlier that the MLE θ̂ M LE has a large-sample normal distribution with mean θ and
variance-covariance matrix I −1 (θ). Based on the multivariate normal distribution, the jth element of θ̂ M LE ,
say θ̂M LE,j , has a large-sample univariate normal distribution.
Define se(θ̂M LE,j ), the standard error (estimated standard deviation), to be the square root of the jth diagonal
element of I −1 (θ̂ M LE ). To assess the null hypothesis that θj = θ0 , we define the t-statistic or t-ratio to be
t(θ̂M LE,j ) = (θ̂M LE,j − θ0 )/se(θ̂M LE,j ).
Under the null hypothesis, it has a Student-t distribution with degrees of freedom equal to n − p, with p
being the dimension of θ.
For most actuarial applications, we have a large sample size n, so the t-distribution is very close to the
(standard) normal distribution. In the case when n is very large or when the standard error is known, the
t-statistic can be referred to as a z-statistic or z-score.
Based on the results from Appendix 15, if the t-statistic t(θ̂M LE,j ) exceeds a cut-off (in absolute value), then
the test for the jth parameter θj is said to be statistically significant. If θj is the regression coefficient of the
jth independent variable, then we say that the jth variable is statistically significant.
For example, if we use a 5% significance level, then the cut-off value is 1.96, using a normal distribution
approximation for cases with a large sample size. More generally, using a 100α% significance level, the
cut-off is the 100(1 − α/2)-th percentile of a Student-t distribution with degrees of freedom n − p.
Another useful concept in hypothesis testing is the p-value, shorthand for probability value. From the
mathematical definition in Appendix 15, a p-value is defined as the smallest significance level for which the
null hypothesis would be rejected. Hence, the p-value is a useful summary statistic for the data analyst to
report because it allows the reader to understand the strength of statistical evidence concerning the deviation
from the null hypothesis.
In addition to hypothesis testing and interval estimation introduced in Appendix 15 and the previous
subsection, another important type of inference is selection of a model from two choices, where one choice is
a special case of the other with certain parameters being restricted. For such two models with one being
nested in the other, we have introduced the likelihood ratio test (LRT) in Appendix 15. Here, we will briefly
review the process of performing a LRT based on a specific example of two alternative models.
Suppose that we have a (large) model under which we derive the maximum likelihood estimator, θ̂ M LE . Now
assume that some of the p elements in θ are equal to zero and determine the maximum likelihood estimator
over the remaining set, with the resulting estimator denoted θ̂ Reduced .
Based on the definition in Appendix 15, the statistic LRT = 2 (l(θ̂ M LE ) − l(θ̂ Reduced )) is called the
likelihood ratio statistic. Under the null hypothesis that the reduced model is correct, the likelihood ratio
statistic has a Chi-square distribution with degrees of freedom equal to d, the number of parameters set to zero.
Such a test allows us to judge which of the two models is more likely to be correct, given the observed data.
If the statistic LRT is large relative to the critical value from the chi-square distribution, then we reject the
reduced model in favor of the larger one. Details regarding the critical value and alternative methods based
on information criteria are given in Appendix 15.
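A minimal R sketch of this procedure follows, comparing a full gamma model against a reduced exponential model (shape restricted to one, i.e., log-shape set to zero). The simulated data and the choice of models are hypothetical illustrations rather than an example from the text.

# Minimal sketch of a likelihood ratio test (hypothetical data and models).
set.seed(2021)
y <- rgamma(500, shape = 2, scale = 1000)

# Full model: gamma with free shape and scale (log-scale parameterization)
negloglik_full <- function(par) {
  -sum(dgamma(y, shape = exp(par[1]), scale = exp(par[2]), log = TRUE))
}
fit_full <- optim(par = c(0, log(mean(y))), fn = negloglik_full)

# Reduced model: exponential, i.e., gamma with shape fixed at 1 (log-shape = 0)
negloglik_reduced <- function(par) {
  -sum(dexp(y, rate = exp(-par), log = TRUE))
}
fit_reduced <- optim(par = log(mean(y)), fn = negloglik_reduced,
                     method = "Brent", lower = 0, upper = 20)

# optim minimizes the negative log-likelihood, so l(theta_hat) = -fit$value
LRT <- 2 * (fit_reduced$value - fit_full$value)   # likelihood ratio statistic
LRT
qchisq(0.95, df = 1)                              # 5% critical value, d = 1
1 - pchisq(LRT, df = 1)                           # p-value of the test

Because the reduced model restricts a single parameter, the statistic is compared with a chi-square distribution with one degree of freedom; a value of LRT above the critical value leads us to reject the reduced model in favor of the full one.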