PROBABILITY AND MATHEMATICAL STATISTICS

Prasanna Sahoo
Department of Mathematics
University of Louisville
Louisville, KY 40292 USA

THIS BOOK IS DEDICATED TO
AMIT, SADHNA,
MY PARENTS, TEACHERS AND STUDENTS

Copyright © 2008. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the author.

PREFACE

This book is both a tutorial and a textbook. It presents an introduction to probability and mathematical statistics and is intended for students who already have some elementary mathematical background. It is suitable for a one-year junior or senior level undergraduate course, or a beginning graduate course, in probability theory and mathematical statistics. The book contains more material than would normally be taught in a one-year course; this should give the teacher flexibility in selecting the content and the level at which the book is used. The book is based on over 15 years of lectures in senior level, calculus-based courses in probability theory and mathematical statistics at the University of Louisville.

Probability theory and mathematical statistics are difficult subjects, both for students to comprehend and for teachers to explain. Despite the publication of a great many textbooks in this field, each intended to provide an improvement over the previous ones, the subject remains difficult to grasp. A good set of examples makes these subjects easier to understand. For this reason alone I have included more than 350 completely worked-out examples and over 165 illustrations. I give a rigorous treatment of the fundamentals of probability and statistics using mostly calculus, and I have paid great attention to the clarity of the presentation. In the text, theoretical results are presented as theorems, propositions or lemmas, for which, as a rule, rigorous proofs are given. For the few exceptions to this rule, references are given to indicate where the details can be found. This book contains over 450 problems of varying degrees of difficulty to help students master their problem-solving skills.

In many existing textbooks, the examples following the explanation of a topic are too few in number, or too simple, to give a thorough grasp of the principles involved. Often, examples are presented in an abbreviated form that leaves out much material between steps and requires students to derive the omitted material themselves; as a result, students find the examples difficult to understand. Moreover, in some textbooks the examples are worded in a confusing manner: they do not state the problem and then present the solution, but instead pass through a general discussion, never revealing what is to be solved for. In this book, I give many examples to illustrate each topic, and I often provide illustrations to promote a better understanding of the topic. All examples in this book are formulated as questions, and clear, concise answers are provided in step-by-step detail.

There are several good books on these subjects, and perhaps there is no need to bring a new one to the market. For several years this material circulated as a series of typeset lecture notes among my students, who were preparing for Examination 110 of the Actuarial Society of America. Many of my students encouraged me to write it up formally as a book.
Actuarial students will benefit greatly from this book. The book is written in simple English; this might be an advantage to students whose native language is not English.

I cannot claim that all the material in this book is mine. I have learned the subject from many excellent books, such as Introduction to Mathematical Statistics by Hogg and Craig, and An Introduction to Probability Theory and Its Applications by Feller. In fact, these books have had a profound impact on me, and my explanations are greatly influenced by them. If there are similarities, it is because I could not improve on the original explanations. I am very thankful to the authors of these great textbooks. I am also thankful to the Actuarial Society of America for letting me use their test problems. I thank all the students in my probability theory and mathematical statistics courses from 1988 to 2005, who helped me in many ways to bring this book to its present form. Lastly, were it not for the infinite patience of my wife, Sadhna, this book would never have made it out of the hard drive of my computer.

The author typeset the entire book on a Macintosh computer using TeX, the typesetting system designed by Donald Knuth. The figures were generated by the author using MATHEMATICA, a system for doing mathematics designed by Wolfram Research, and MAPLE, a system for doing mathematics designed by Maplesoft. The author is very thankful to the University of Louisville for providing many internal financial grants while this book was under preparation.

Prasanna Sahoo, Louisville

TABLE OF CONTENTS

1. Probability of Events
   1.1. Introduction
   1.2. Counting Techniques
   1.3. Probability Measure
   1.4. Some Properties of the Probability Measure
   1.5. Review Exercises
2. Conditional Probability and Bayes' Theorem
   2.1. Conditional Probability
   2.2. Bayes' Theorem
   2.3. Review Exercises
3. Random Variables and Distribution Functions
   3.1. Introduction
   3.2. Distribution Functions of Discrete Variables
   3.3. Distribution Functions of Continuous Variables
   3.4. Percentile for Continuous Random Variables
   3.5. Review Exercises
4. Moments of Random Variables and Chebychev Inequality
   4.1. Moments of Random Variables
   4.2. Expected Value of Random Variables
   4.3. Variance of Random Variables
   4.4. Chebychev Inequality
   4.5. Moment Generating Functions
   4.6. Review Exercises
5. Some Special Discrete Distributions
   5.1. Bernoulli Distribution
   5.2. Binomial Distribution
   5.3. Geometric Distribution
   5.4. Negative Binomial Distribution
   5.5. Hypergeometric Distribution
   5.6. Poisson Distribution
   5.7. Riemann Zeta Distribution
   5.8. Review Exercises
6. Some Special Continuous Distributions
   6.1. Uniform Distribution
   6.2. Gamma Distribution
   6.3. Beta Distribution
   6.4. Normal Distribution
   6.5. Lognormal Distribution
   6.6. Inverse Gaussian Distribution
   6.7. Logistic Distribution
   6.8. Review Exercises
7. Two Random Variables
   7.1. Bivariate Discrete Random Variables
   7.2. Bivariate Continuous Random Variables
   7.3. Conditional Distributions
   7.4. Independence of Random Variables
   7.5. Review Exercises
8. Product Moments of Bivariate Random Variables
   8.1. Covariance of Bivariate Random Variables
   8.2. Independence of Random Variables
   8.3. Variance of the Linear Combination of Random Variables
   8.4. Correlation and Independence
   8.5. Moment Generating Functions
   8.6. Review Exercises
9. Conditional Expectations of Bivariate Random Variables
   9.1. Conditional Expected Values
   9.2. Conditional Variance
   9.3. Regression Curve and Scedastic Curves
   9.4. Review Exercises
10. Functions of Random Variables and Their Distribution
   10.1. Distribution Function Method
   10.2. Transformation Method for Univariate Case
   10.3. Transformation Method for Bivariate Case
   10.4. Convolution Method for Sums of Random Variables
   10.5. Moment Method for Sums of Random Variables
   10.6. Review Exercises
11. Some Special Discrete Bivariate Distributions
   11.1. Bivariate Bernoulli Distribution
   11.2. Bivariate Binomial Distribution
   11.3. Bivariate Geometric Distribution
   11.4. Bivariate Negative Binomial Distribution
   11.5. Bivariate Hypergeometric Distribution
   11.6. Bivariate Poisson Distribution
   11.7. Review Exercises
12. Some Special Continuous Bivariate Distributions
   12.1. Bivariate Uniform Distribution
   12.2. Bivariate Cauchy Distribution
   12.3. Bivariate Gamma Distribution
   12.4. Bivariate Beta Distribution
   12.5. Bivariate Normal Distribution
   12.6. Bivariate Logistic Distribution
   12.7. Review Exercises
13. Sequences of Random Variables and Order Statistics
   13.1. Distribution of Sample Mean and Variance
   13.2. Laws of Large Numbers
   13.3. The Central Limit Theorem
   13.4. Order Statistics
   13.5. Sample Percentiles
   13.6. Review Exercises
14. Sampling Distributions Associated with the Normal Population
   14.1. Chi-square Distribution
   14.2. Student's t-Distribution
   14.3. Snedecor's F-Distribution
   14.4. Review Exercises
15. Some Techniques for Finding Point Estimators of Parameters
   15.1. Moment Method
   15.2. Maximum Likelihood Method
   15.3. Bayesian Method
   15.4. Review Exercises
16. Criteria for Evaluating the Goodness of Estimators
   16.1. The Unbiased Estimator
   16.2. The Relatively Efficient Estimator
   16.3. The Minimum Variance Unbiased Estimator
   16.4. Sufficient Estimator
   16.5. Consistent Estimator
   16.6. Review Exercises
17. Some Techniques for Finding Interval Estimators of Parameters
   17.1. Interval Estimators and Confidence Intervals for Parameters
   17.2. Pivotal Quantity Method
   17.3. Confidence Interval for Population Mean
   17.4. Confidence Interval for Population Variance
   17.5. Confidence Interval for Parameter of Some Distributions Not Belonging to the Location-Scale Family
   17.6. Approximate Confidence Interval for Parameter with MLE
   17.7. The Statistical or General Method
   17.8. Criteria for Evaluating Confidence Intervals
   17.9. Review Exercises
18. Test of Statistical Hypotheses
   18.1. Introduction
   18.2. A Method of Finding Tests
   18.3. Methods of Evaluating Tests
   18.4. Some Examples of Likelihood Ratio Tests
   18.5. Review Exercises
19. Simple Linear Regression and Correlation Analysis
   19.1. Least Squares Method
   19.2. Normal Regression Analysis
   19.3. The Correlation Analysis
   19.4. Review Exercises
20. Analysis of Variance
   20.1. One-way Analysis of Variance with Equal Sample Sizes
   20.2. One-way Analysis of Variance with Unequal Sample Sizes
   20.3. Pairwise Comparisons
   20.4. Tests for the Homogeneity of Variances
   20.5. Review Exercises
21. Goodness of Fit Tests
   21.1. Chi-Squared Test
   21.2. Kolmogorov-Smirnov Test
   21.3. Review Exercises
References
Answers to Selected Review Exercises

Chapter 1
PROBABILITY OF EVENTS

1.1. Introduction

During his lecture in 1929, Bertrand Russell said, "Probability is the most important concept in modern science, especially as nobody has the slightest notion what it means." Most people have some vague idea of what the probability of an event means. The interpretation of the word probability involves synonyms such as chance, odds, uncertainty, prevalence, risk, expectancy, etc. "We use probability when we want to make an affirmation, but are not quite sure," writes J. R. Lucas.

There are many distinct interpretations of the word probability. A complete discussion of these interpretations would take us into areas such as philosophy, the theory of algorithms and randomness, religion, etc. Thus, we will focus on only two extreme interpretations: one due to the so-called objective school and the other due to the subjective school.

The subjective school defines probabilities as subjective assignments based on rational thought with the available information. Some subjective probabilists interpret probabilities as degrees of belief. Under this interpretation it is difficult to say what the probability of an event really is.

The objective school defines probabilities to be "long run" relative frequencies. This means that one should compute a probability by taking the number of favorable outcomes of an experiment, dividing it by the total number of possible outcomes of the experiment, and then taking the limit as the number of trials becomes large. Some statisticians object to the phrase "long run"; the economist and philosopher John Maynard Keynes said, "in the long run we are all dead." The objective school uses the theory developed by von Mises (1928) and Kolmogorov (1965). The Russian mathematician Kolmogorov gave probability theory its solid foundation using measure theory. The advantage of Kolmogorov's theory is that one can construct probabilities according to the rules, compute other probabilities using the axioms, and then interpret these probabilities.

In this book, we will study mathematically one interpretation of probability out of many. In fact, we will study probability theory based on the theory developed by the late Kolmogorov. There are many applications of probability theory. We are studying probability theory because we would like to study mathematical statistics. Statistics is concerned with the development of methods, and their applications, for collecting, analyzing and interpreting quantitative data in such a way that the reliability of a conclusion based on the data may be evaluated objectively by means of probability statements. Probability theory is used to evaluate the reliability of conclusions and inferences based on data. Thus, probability theory is fundamental to mathematical statistics.

For an event A of a discrete sample space S, the probability of A can be computed by using the formula

   P(A) = N(A) / N(S),

where N(A) denotes the number of elements of A and N(S) denotes the number of elements in the sample space S. In the discrete case, then, the probability of an event A can be computed by counting the number of elements in A and dividing by the number of elements in the sample space S. In the next section, we develop various counting techniques; before that, the short sketch below illustrates the counting formula on a concrete sample space.
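As a quick sanity check of the formula P(A) = N(A)/N(S), the following minimal sketch enumerates a finite sample space and counts favorable outcomes. Python is used here purely for illustration (the book itself contains no code; its figures were produced with MATHEMATICA and MAPLE), and the variable names S and A simply mirror the notation above. The event chosen, "the sum of two dice is 7," reappears later in Example 1.16.

```python
from itertools import product
from fractions import Fraction

# Discrete sample space for rolling a red die and a green die:
# S = {(x, y) | 1 <= x <= 6, 1 <= y <= 6}, all 36 outcomes equally likely.
S = list(product(range(1, 7), range(1, 7)))

# Event A: the sum of the two numbers rolled is 7.
A = [(x, y) for (x, y) in S if x + y == 7]

# P(A) = N(A) / N(S) for a finite sample space with equally likely outcomes.
prob_A = Fraction(len(A), len(S))

print(len(A), len(S), prob_A)   # 6 36 1/6
```

The same brute-force enumeration works for any finite, equally likely sample space, which is exactly the setting the counting techniques of the next section are designed for.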
The branch of mathematics that deals with the various counting techniques is called combinatorics.

1.2. Counting Techniques

There are three basic counting techniques: the multiplication rule, permutation, and combination.

1.2.1. Multiplication Rule. If E1 is an experiment with n1 possible outcomes and E2 is an experiment with n2 possible outcomes, then the experiment which consists of performing E1 first and then E2 has n1 n2 possible outcomes.

Example 1.1. Find the possible number of outcomes in a sequence of two tosses of a fair coin.

Answer: The number of possible outcomes is 2 · 2 = 4. This is evident from a tree diagram whose branches end in the outcomes HH, HT, TH, TT.

Example 1.2. Find the number of possible outcomes of the rolling of a die and then tossing a coin.

Answer: Here n1 = 6 and n2 = 2. Thus, by the multiplication rule, the number of possible outcomes is 12. (The corresponding tree diagram has the branches 1H, 1T, 2H, 2T, 3H, 3T, 4H, 4T, 5H, 5T, 6H, 6T.)

Example 1.3. How many different license plates are possible if Kentucky uses three letters followed by three digits?

Answer: (26)^3 (10)^3 = (17,576)(1,000) = 17,576,000.

1.2.2. Permutation

Consider a set of 4 objects. Suppose we want to fill 3 positions with objects selected from these 4. Then the number of possible ordered arrangements is 24, and they are

abc abd acb acd adc adb
bac bad bca bcd bda bdc
cab cad cba cbd cdb cda
dab dac dbc dba dca dcb

The number of possible ordered arrangements can be computed as follows. Since there are 3 positions and 4 objects, the first position can be filled in 4 different ways. Once the first position is filled, the remaining 2 positions can be filled from the remaining 3 objects; thus, the second position can be filled in 3 ways. The third position can be filled in 2 ways. The total number of ways 3 positions can be filled out of 4 objects is therefore (4)(3)(2) = 24.

In general, if r positions are to be filled from n objects, then the total number of possible ways they can be filled is given by

   nPr = n(n − 1)(n − 2) · · · (n − r + 1) = n! / (n − r)!.

Thus, nPr represents the number of ways r positions can be filled from n objects.

Definition 1.1. Each of the nPr arrangements is called a permutation of n objects taken r at a time.

Example 1.4. How many permutations are there of all three of the letters a, b, and c?

Answer: 3P3 = n!/(n − r)! = 3!/0! = 6.

Example 1.5. Find the number of permutations of n distinct objects.

Answer: nPn = n!/(n − n)! = n!/0! = n!.

Example 1.6. Four names are drawn from the 24 members of a club for the offices of President, Vice-President, Treasurer, and Secretary. In how many different ways can this be done?

Answer: 24P4 = 24!/20! = (24)(23)(22)(21) = 255,024.

1.2.3. Combination

In a permutation, order is important. But in many problems the order of selection is not important, and interest centers only on the set of r objects. Let c denote the number of subsets of size r that can be selected from n different objects. The r objects in each such set can be ordered in rPr ways. Thus we have

   nPr = c (rPr).

From this, we get

   c = nPr / rPr = n! / ((n − r)! r!).

The number c is the binomial coefficient, denoted here by C(n, r) and read "n choose r." Thus, the above can be written as

   C(n, r) = n! / ((n − r)! r!).

Definition 1.2. Each of the C(n, r) unordered subsets is called a combination of n objects taken r at a time.

Example 1.7. How many committees of two chemists and one physicist can be formed from 4 chemists and 3 physicists?
Probability of Events 6 Answer: # $# $ 3 4 1 2 = (6) (3) = 18. Thus 18 different committees can be formed. 1.2.4. Binomial Theorem We know from lower level mathematics courses that (x + y)2 = x2 + 2 xy + y 2 # $ # $ # $ 2 2 2 2 2 y xy + x + = 2 1 0 2 # $ % 2 2−k k = x y . k k=0 Similarly (x + y)3 = x3 + 3 x2 y + 3xy 2 + y 3 # $ # $ # $ # $ 3 3 3 3 2 3 3 2 y xy + x y+ x + = 3 2 1 0 3 # $ % 3 3−k k = x y . k k=0 In general, using induction arguments, we can show that (x + y)n = n # $ % n k=0 k xn−k y k . ! " This result is called the Binomial Theorem. The coefficient nk is called the binomial coefficient. A combinatorial proof of the Binomial Theorem follows. If we write (x + y)n as the n times the product of the factor (x + y), that is (x + y)n = (x + y) (x + y) (x + y) · · · (x + y), ! " then the coefficient of xn−k y k is nk , that is the number of ways in which we can choose the k factors providing the y’s. Probability and Mathematical Statistics 7 Remark 1.1. In 1665, Newton discovered the Binomial Series. The Binomial Series is given by # $ # $ # $ α n α 2 α (1 + y) = 1 + y + ··· y + ··· + y+ n 2 1 ∞ # $ % α k =1+ y , k α k=1 where α is a real number and # $ α α(α − 1)(α − 2) · · · (α − k + 1) . = k k! This !α" k is called the generalized binomial coefficient. Now, we investigate some properties of the binomial coefficients. Theorem 1.1. Let n ∈ N (the set of natural numbers) and r = 0, 1, 2, ..., n. Then $ # $ # n n . = n−r r Proof: By direct verification, we get # n n−r $ n! (n − n + r)! (n − r)! n! = r! (n − r)! # $ n . = r = This theorem says that the binomial coefficients are symmetrical. !" !" !" Example 1.8. Evaluate 31 + 32 + 30 . Answer: Since the combinations of 3 things taken 1 at a time are 3, we get ! 3" !3" 1 = 3. Similarly, 0 is 1. By Theorem 1, # $ # $ 3 3 = = 3. 1 2 Hence # $ # $ # $ 3 3 3 = 3 + 3 + 1 = 7. + + 0 2 1 Probability of Events 8 Theorem 1.2. For any positive integer n and r = 1, 2, 3, ..., n, we have $ $ # # $ # n−1 n−1 n . + = r−1 r r Proof: (1 + y)n = (1 + y) (1 + y)n−1 = (1 + y)n−1 + y (1 + y)n−1 $ n−1 # n−1 n # $ % % #n − 1$ n r % n−1 r y = y +y yr r r r r=0 r=0 r=0 n−1 n−1 % #n − 1$ % # n − 1$ y r+1 . yr + = r r r=0 r=0 Equating the coefficients of y r from both sides of the above expression, we obtain $ # $ # $ # n n−1 n−1 = + r−1 r r and the proof is now complete. ! " !23" !24" Example 1.9. Evaluate 23 10 + 9 + 11 . Answer: # $ # $ # $ 23 23 24 + + 10 9 11 # $ # $ 24 24 + = 11 10 # $ 25 = 11 25! = (14)! (11)! = 4, 457, 400. # $ n Example 1.10. Use the Binomial Theorem to show that (−1) = 0. r r=0 Answer: Using the Binomial Theorem, we get n (1 + x) = n # $ % n r=0 r xr n % r Probability and Mathematical Statistics 9 for all real numbers x. Letting x = −1 in the above, we get 0= n # $ % n r=0 r (−1)r . Theorem 1.3. Let m and n be positive integers. Then $ # $ k # $# % m n m+n = . r k−r k r=0 Proof: (1 + y)m+n = (1 + y)m (1 + y)n & m # $ '& n # $ ' m+n % #m + n $ % m % n r y = yr yr . r r r r=0 r=0 r=0 Equating the coefficients of y k from the both sides of the above expression, we obtain $ # $ # $# $ # $# $ # $# m+n m n m n m n + ··· + = + k k−k k 0 k 1 k−1 and the conclusion of the theorem follows. Example 1.11. Show that n # $2 % n r=0 r = # $ 2n . n Answer: Let k = n and m = n. Then from Theorem 3, we get $ $ # k # $# % m+n n m = k k−r r r=0 $ $ # $ # # n % n 2n n = n n−r r r=0 $ $ # $ # # n % n 2n n = n r r r=0 # $ # $ n 2 % 2n n = . n r r=0 Probability of Events 10 Theorem 1.4. Let n be a positive integer and k = 1, 2, 3, ..., n. 
Then # $ n−1 % # m $ n = . k−1 k m=k−1 Proof: In order to establish the above identity, we use the Binomial Theorem together with the following result of the elementary algebra xn − y n = (x − y) Note that n # $ % n k=1 k k x = n # $ % n k=0 k = (x + 1 − 1) n−1 m # %% m=0 j=0 = k=0 n−1 % by Binomial Theorem (x + 1)m by above identity m=0 n−1 m # %% m=0 j=0 = xk y n−1−k . xk − 1 = (x + 1)n − 1n =x n−1 % $ m j x j $ m j+1 x j n n−1 % % # m $ xk . k−1 k=1 m=k−1 Hence equating the coefficient of xk , we obtain # $ n−1 % # m $ n . = k−1 k m=k−1 This completes the proof of the theorem. The following result # % n (x1 + x2 + · · · + xm ) = n1 +n2 +···+nm =n $ n xn1 xn2 · · · xnmm n1 , n2 , ..., nm 1 2 is known as the multinomial theorem and it generalizes the binomial theorem. The sum is taken over all positive integers n1 , n2 , ..., nm such that n1 + n2 + · · · + nm = n, and $ # n n! . = n1 ! n2 !, ..., nm ! n1 , n2 , ..., nm Probability and Mathematical Statistics 11 This coefficient is known as the multinomial coefficient. 1.3. Probability Measure A random experiment is an experiment whose outcomes cannot be predicted with certainty. However, in most cases the collection of every possible outcome of a random experiment can be listed. Definition 1.3. A sample space of a random experiment is the collection of all possible outcomes. Example 1.12. What is the sample space for an experiment in which we select a rat at random from a cage and determine its sex? Answer: The sample space of this experiment is S = {M, F } where M denotes the male rat and F denotes the female rat. Example 1.13. What is the sample space for an experiment in which the state of Kentucky picks a three digit integer at random for its daily lottery? Answer: The sample space of this experiment is S = {000, 001, 002, · · · · · · , 998, 999}. Example 1.14. What is the sample space for an experiment in which we roll a pair of dice, one red and one green? Answer: The sample space S for this experiment is given by {(1, 1) (2, 1) (3, 1) S= (4, 1) (5, 1) (6, 1) (1, 2) (2, 2) (3, 2) (4, 2) (5, 2) (6, 2) (1, 3) (2, 3) (3, 3) (4, 3) (5, 3) (6, 3) (1, 4) (2, 4) (3, 4) (4, 4) (5, 4) (6, 4) (1, 5) (2, 5) (3, 5) (4, 5) (5, 5) (6, 5) (1, 6) (2, 6) (3, 6) (4, 6) (5, 6) (6, 6)} This set S can be written as S = {(x, y) | 1 ≤ x ≤ 6, 1 ≤ y ≤ 6} where x represents the number rolled on red die and y denotes the number rolled on green die. Probability of Events 12 Definition 1.4. Each element of the sample space is called a sample point. Definition 1.5. If the sample space consists of a countable number of sample points, then the sample space is said to be a countable sample space. Definition 1.6. If a sample space contains an uncountable number of sample points, then it is called a continuous sample space. An event A is a subset of the sample space S. It seems obvious that if A and B are events in sample space S, then A ∪ B, Ac , A ∩ B are also entitled to be events. Thus precisely we define an event as follows: Definition 1.7. A subset A of the sample space S is said to be an event if it belongs to a collection F of subsets of S satisfying the following three rules: (a) S ∈ F; (b) if A ∈ F then Ac ∈ F; and (c) if Aj ∈ F for j ≥ 1, then (∞ j=1 ∈ F. The collection F is called an event space or a σ-field. If A is the outcome of an experiment, then we say that the event A has occurred. Example 1.15. Describe the sample space of rolling a die and interpret the event {1, 2}. Answer: The sample space of this experiment is S = {1, 2, 3, 4, 5, 6}. 
The event {1, 2} means getting either a 1 or a 2. Example 1.16. First describe the sample space of rolling a pair of dice, then describe the event A that the sum of numbers rolled is 7. Answer: The sample space of this experiment is S = {(x, y) | x, y = 1, 2, 3, 4, 5, 6} and A = {(1, 6), (6, 1), (2, 5), (5, 2), (4, 3), (3, 4)}. Definition 1.8. Let S be the sample space of a random experiment. A probability measure P : F → [0, 1] is a set function which assigns real numbers to the various events of S satisfying (P1) P (A) ≥ 0 for all event A ∈ F, (P2) P (S) = 1, Probability and Mathematical Statistics (P3) P if ) ∞ * Ak + k=1 A1 , A2 , A3 , ..., = ∞ % 13 P (Ak ) k=1 Ak , ..... are mutually disjoint events of S. Any set function with the above three properties is a probability measure for S. For a given sample space S, there may be more than one probability measure. The probability of an event A is the value of the probability measure at A, that is P rob(A) = P (A). Theorem 1.5. If ∅ is a empty set (that is an impossible event), then P (∅) = 0. Proof: Let A1 = S and Ai = ∅ for i = 2, 3, ..., ∞. Then S= ∞ * Ai i=1 where Ai ∩ Aj = ∅ for i += j. By axiom 2 and axiom 3, we get 1 = P (S) (by axiom 2) ) ∞ + * =P Ai i=1 = ∞ % P (Ai ) (by axiom 3) i=1 = P (A1 ) + = P (S) + ∞ % P (Ai ) i=2 ∞ % P (∅) i=2 =1+ ∞ % P (∅). i=2 Therefore ∞ % P (∅) = 0. i=2 Since P (∅) ≥ 0 by axiom 1, we have P (∅) = 0 Probability of Events 14 and the proof of the theorem is complete. This theorem says that the probability of an impossible event is zero. Note that if the probability of an event is zero, that does not mean the event is empty (or impossible). There are random experiments in which there are infinitely many events each with probability 0. Similarly, if A is an event with probability 1, then it does not mean A is the sample space S. In fact there are random experiments in which one can find infinitely many events each with probability 1. Theorem 1.6. Let {A1 , A2 , ..., An } be a finite collection of n events such that Ai ∩ Ej = ∅ for i += j. Then P ) n * Ai i=1 + = n % P (Ai ). i=1 Proof: Consider the collection {A#i }∞ i=1 of the subsets of the sample space S such that A#1 = A1 , A#2 = A2 , ..., A#n = An and A#n+1 = A#n+2 = A#n+3 = · · · = ∅. Hence P ) n * i=1 Ai + =P )∞ * A#i i=1 = ∞ % + P (A#i ) i=1 = = = = n % i=1 n % i=1 n % i=1 n % ∞ % P (A#i ) + P (Ai ) + i=n+1 ∞ % i=n+1 P (Ai ) + 0 P (Ai ) i=1 and the proof of the theorem is now complete. P (A#i ) P (∅) Probability and Mathematical Statistics 15 When n = 2, the above theorem yields P (A1 ∪ A2 ) = P (A1 ) + P (A2 ) where A1 and A2 are disjoint (or mutually exclusive) events. In the following theorem, we give a method for computing probability of an event A by knowing the probabilities of the elementary events of the sample space S. Theorem 1.7. If A is an event of a discrete sample space S, then the probability of A is equal to the sum of the probabilities of its elementary events. Proof: Any set A in S can be written as the union of its singleton sets. Let {Oi }∞ i=1 be the collection of all the singleton sets (or the elementary events) of A. Then ∞ * A= Oi . i=1 By axiom (P3), we get P (A) = P )∞ * i=1 = ∞ % Oi + P (Oi ). i=1 Example 1.17. If a fair coin is tossed twice, what is the probability of getting at least one head? Answer: The sample space of this experiment is S = {HH, HT, T H, T T }. The event A is given by A = { at least one head } = {HH, HT, T H}. 
By Theorem 1.7, the probability of A is the sum of the probabilities of its elementary events. Thus, we get P (A) = P (HH) + P (HT ) + P (T H) 1 1 1 = + + 4 4 4 3 = . 4 Probability of Events 16 Remark 1.2. Notice that here we are not computing the probability of the elementary events by taking the number of points in the elementary event and dividing by the total number of points in the sample space. We are using the randomness to obtain the probability of the elementary events. That is, we are assuming that each outcome is equally likely. This is why the randomness is an integral part of probability theory. Corollary 1.1. If S is a finite sample space with n sample elements and A is an event in S with m elements, then the probability of A is given by m P (A) = . n Proof: By the previous theorem, we get + )m * P (A) = P Oi i=1 = = m % i=1 m % i=1 = P (Oi ) 1 n m . n The proof is now complete. Example 1.18. A die is loaded in such a way that the probability of the face with j dots turning up is proportional to j for j = 1, 2, ..., 6. What is the probability, in one roll of the die, that an odd number of dots will turn up? Answer: P ({j}) ∝ j = kj where k is a constant of proportionality. Next, we determine this constant k by using the axiom (P2). Using Theorem 1.5, we get P (S) = P ({1}) + P ({2}) + P ({3}) + P ({4}) + P ({5}) + P ({6}) = k + 2k + 3k + 4k + 5k + 6k = (1 + 2 + 3 + 4 + 5 + 6) k (6)(6 + 1) = k 2 = 21k. Probability and Mathematical Statistics 17 Using (P2), we get 21k = 1. Thus k = 1 21 . Hence, we have P ({j}) = j . 21 Now, we want to find the probability of the odd number of dots turning up. P (odd numbered dot will turn up) = P ({1}) + P ({3}) + P ({5}) 1 3 5 = + + 21 21 21 9 . = 21 Remark 1.3. Recall that the sum of the first n integers is equal to That is, 1 + 2 + 3 + · · · · · · + (n − 2) + (n − 1) + n = n 2 (n+1). n(n + 1) . 2 This formula was first proven by Gauss (1777-1855) when he was a young school boy. Remark 1.4. Gauss proved that the sum of the first n positive integers is n (n+1) when he was a school boy. Kolmogorov, the father of modern 2 probability theory, proved that the sum of the first n odd positive integers is n2 , when he was five years old. 1.4. Some Properties of the Probability Measure Next, we present some theorems that will illustrate the various intuitive properties of a probability measure. Theorem 1.8. If A be any event of the sample space S, then P (Ac ) = 1 − P (A) where Ac denotes the complement of A with respect to S. Proof: Let A be any subset of S. Then S = A ∪ Ac . Further A and Ac are mutually disjoint. Thus, using (P3), we get 1 = P (S) = P (A ∪ Ac ) = P (A) + P (Ac ). Probability of Events 18 A A c Hence, we see that P (Ac ) = 1 − P (A). This completes the proof. Theorem 1.9. If A ⊆ B ⊆ S, then P (A) ≤ P (B). S A B Proof: Note that B = A ∪ (B \ A) where B \ A denotes all the elements x that are in B but not in A. Further, A ∩ (B \ A) = ∅. Hence by (P3), we get P (B) = P (A ∪ (B \ A)) = P (A) + P (B \ A). By axiom (P1), we know that P (B \ A) ≥ 0. Thus, from the above, we get P (B) ≥ P (A) and the proof is complete. Theorem 1.10. If A is any event in S, then 0 ≤ P (A) ≤ 1. Probability and Mathematical Statistics 19 Proof: Follows from axioms (P1) and (P2) and Theorem 1.8. Theorem 1.10. If A and B are any two events, then P (A ∪ B) = P (A) + P (B) − P (A ∩ B). Proof: It is easy to see that A ∪ B = A ∪ (Ac ∩ B) and A ∩ (Ac ∩ B) = ∅. 
S A B Hence by (P3), we get P (A ∪ B) = P (A) + P (Ac ∩ B) But the set B can also be written as B = (A ∩ B) ∪ (Ac ∩ B) S A B (1.1) Probability of Events 20 Therefore, by (P3), we get P (B) = P (A ∩ B) + P (Ac ∩ B). (1.2) Eliminating P (Ac ∩ B) from (1.1) and (1.2), we get P (A ∪ B) = P (A) + P (B) − P (A ∩ B) and the proof of the theorem is now complete. This above theorem tells us how to calculate the probability that at least one of A and B occurs. Example 1.19. If P (A) = 0.25 and P (B) = 0.8, then show that 0.05 ≤ P (A ∩ B) ≤ 0.25. Answer: Since A ∩ B ⊆ A and A ∩ B ⊆ B, by Theorem 1.8, we get P (A ∩ B) ≤ P (A) and also P (A ∩ B) ≤ P (B). Hence P (A ∩ B) ≤ min{P (A), P (B)}. This shows that P (A ∩ B) ≤ 0.25. (1.3) Since A ∪ B ⊆ S, by Theorem 1.8, we get P (A ∪ B) ≤ P (S) That is, by Theorem 1.10 P (A) + P (B) − P (A ∩ B) ≤ P (S). Hence, we obtain 0.8 + 0.25 − P (A ∩ B) ≤ 1 and this yields 0.8 + 0.25 − 1 ≤ P (A ∩ B). From this, we get 0.05 ≤ P (A ∩ B). (1.4) Probability and Mathematical Statistics 21 From (1.3) and (1.4), we get 0.05 ≤ P (A ∩ B) ≤ 0.25. Example 1.20. Let A and B be events in a sample space S such that P (A) = 12 = P (B) and P (Ac ∩ B c ) = 31 . Find P (A ∪ B c ). Answer: Notice that A ∪ B c = A ∪ (Ac ∩ B c ). Hence, P (A ∪ B c ) = P (A) + P (Ac ∩ B c ) 1 1 = + 2 3 5 = . 6 Theorem 1.11. If A1 and A2 are two events such that A1 ⊆ A2 , then P (A2 \ A1 ) = P (A2 ) − P (A1 ). Proof: The event A2 can be written as A2 = A1 * (A2 \ A1 ) where the sets A1 and A2 \ A1 are disjoint. Hence P (A2 ) = P (A1 ) + P (A2 \ A1 ) which is P (A2 \ A1 ) = P (A2 ) − P (A1 ) and the proof of the theorem is now complete. From calculus we know that a real function f : R I →R I (the set of real numbers) is continuous on R I if and only if, for every convergent sequence {xn }∞ in R, I n=1 , lim f (xn ) = f lim xn . n→∞ n→∞ Probability of Events 22 Theorem 1.12. If A1 , A2 , ..., An , ... is a sequence of events in sample space S such that A1 ⊆ A2 ⊆ · · · ⊆ An ⊆ · · ·, then + )∞ * An = lim P (An ). P n→∞ n=1 Similarly, if B1 , B2 , ..., Bn , ... is a sequence of events in sample space S such that B1 ⊇ B2 ⊇ · · · ⊇ Bn ⊇ · · ·, then )∞ + . Bn = lim P (Bn ). P n→∞ n=1 Proof: Given an increasing sequence of events A1 ⊆ A2 ⊆ · · · ⊆ An ⊆ · · · we define a disjoint collection of events as follows: E1 = A1 En = An \ An−1 ∀n ≥ 2. Then {En }∞ n=1 is a disjoint collection of events such that ∞ * An = ∞ * En . n=1 n=1 Further P ) ∞ * n=1 An + =P ) ∞ * En n=1 = ∞ % + P (En ) n=1 = lim m→∞ = lim m→∞ m % P (En ) n=1 / P (A1 ) + = lim P (Am ) m→∞ = lim P (An ). n→∞ m % n=2 0 [P (An ) − P (An−1 )] Probability and Mathematical Statistics 23 The second part of the theorem can be proved similarly. Note that lim An = n→∞ and lim Bn = n→∞ ∞ * An ∞ . Bn . n=1 n=1 Hence the results above theorem can be written as , P lim An = lim P (An ) n→∞ and P , n→∞ lim Bn = lim P (Bn ) n→∞ n→∞ and the Theorem 1.12 is called the continuity theorem for the probability measure. 1.5. Review Exercises 1. If we randomly pick two television sets in succession from a shipment of 240 television sets of which 15 are defective, what is the probability that they will both be defective? 2. A poll of 500 people determines that 382 like ice cream and 362 like cake. How many people like both if each of them likes at least one of the two? (Hint: Use P (A ∪ B) = P (A) + P (B) − P (A ∩ B) ). 3. The Mathematics Department of the University of Louisville consists of 8 professors, 6 associate professors, 13 assistant professors. 
In how many of all possible samples of size 4, chosen without replacement, will every type of professor be represented? 4. A pair of dice consisting of a six-sided die and a four-sided die is rolled and the sum is determined. Let A be the event that a sum of 5 is rolled and let B be the event that a sum of 5 or a sum of 9 is rolled. Find (a) P (A), (b) P (B), and (c) P (A ∩ B). 5. A faculty leader was meeting two students in Paris, one arriving by train from Amsterdam and the other arriving from Brussels at approximately the same time. Let A and B be the events that the trains are on time, respectively. If P (A) = 0.93, P (B) = 0.89 and P (A ∩ B) = 0.87, then find the probability that at least one train is on time. Probability of Events 24 6. Bill, George, and Ross, in order, roll a die. The first one to roll an even number wins and the game is ended. What is the probability that Bill will win the game? 7. Let A and B be events such that P (A) = Find the probability of the event Ac ∪ B c . 1 2 = P (B) and P (Ac ∩ B c ) = 31 . 8. Suppose a box contains 4 blue, 5 white, 6 red and 7 green balls. In how many of all possible samples of size 5, chosen without replacement, will every color be represented? n # $ % n 9. Using the Binomial Theorem, show that k = n 2n−1 . k k=0 10. A function consists of a domain A, a co-domain B and a rule f . The rule f assigns to each number in the domain A one and only one letter in the co-domain B. If A = {1, 2, 3} and B = {x, y, z, w}, then find all the distinct functions that can be formed from the set A into the set B. 11. Let S be a countable sample space. Let {Oi }∞ i=1 be the collection of all the elementary events in S. What should be the value of the constant c such ! "i that P (Oi ) = c 13 will be a probability measure in S? 12. A box contains five green balls, three black balls, and seven red balls. Two balls are selected at random without replacement from the box. What is the probability that both balls are the same color? 13. Find the sample space of the random experiment which consists of tossing a coin until the first head is obtained. Is this sample space discrete? 14. Find the sample space of the random experiment which consists of tossing a coin infinitely many times. Is this sample space discrete? 15. Five fair dice are thrown. What is the probability that a full house is thrown (that is, where two dice show one number and other three dice show a second number)? 16. If a fair coin is tossed repeatedly, what is the probability that the third head occurs on the nth toss? 17. In a particular softball league each team consists of 5 women and 5 men. In determining a batting order for 10 players, a woman must bat first, and successive batters must be of opposite sex. How many different batting orders are possible for a team? Probability and Mathematical Statistics 25 18. An urn contains 3 red balls, 2 green balls and 1 yellow ball. Three balls are selected at random and without replacement from the urn. What is the probability that at least 1 color is not drawn? 19. A box contains four $10 bills, six $5 bills and two $1 bills. Two bills are taken at random from the box without replacement. What is the probability that both bills will be of the same denomination? 20. An urn contains n white counters numbered 1 through n, n black counters numbered 1 through n, and n red counter numbered 1 through n. 
If two counters are to be drawn at random without replacement, what is the probability that both counters will be of the same color or bear the same number? 21. Two people take turns rolling a fair die. Person X rolls first, then person Y , then X, and so on. The winner is the first to roll a 6. What is the probability that person X wins? 22. Mr. Flowers plants 10 rose bushes in a row. Eight of the bushes are white and two are red, and he plants them in random order. What is the probability that he will consecutively plant seven or more white bushes? 23. Using mathematical induction, show that n # $ k % dn dn−k n d [f (x) · g(x)] = [f (x)] · [g(x)] . dxn dxn−k k dxk k=0 Probability of Events 26 Probability and Mathematical Statistics 27 Chapter 2 CONDITIONAL PROBABILITIES AND BAYES’ THEOREM 2.1. Conditional Probabilities First, we give a heuristic argument for the definition of conditional probability, and then based on our heuristic argument, we define the conditional probability. Consider a random experiment whose sample space is S. Let B ⊂ S. In many situations, we are only concerned with those outcomes that are elements of B. This means that we consider B to be our new sample space. B A For the time being, suppose S is a nonempty finite sample space and B is a nonempty subset of S. Given this new discrete sample space B, how do we define the probability of an event A? Intuitively, one should define the probability of A with respect to the new sample space B as (see the figure above) the number of elements in A ∩ B P (A given B) = . the number of elements in B Conditional Probability and Bayes’ Theorem 28 We denote the conditional probability of A given the new sample space B as P (A/B). Hence with this notation, we say that N (A ∩ B) N (B) P (A ∩ B) = , P (B) P (A/B) = since N (S) += 0. Here N (S) denotes the number of elements in S. Thus, if the sample space is finite, then the above definition of the probability of an event A given that the event B has occurred makes sense intuitively. Now we define the conditional probability for any sample space (discrete or continuous) as follows. Definition 2.1. Let S be a sample space associated with a random experiment. The conditional probability of an event A, given that event B has occurred, is defined by P (A ∩ B) P (A/B) = P (B) provided P (B) > 0. This conditional probability measure P (A/B) satisfies all three axioms of a probability measure. That is, (CP1) P (A/B) ≥ 0 for all event A (CP2) P (B/B) = 1 (CP3) If A1 , A2 , ..., Ak , ... are mutually exclusive events, then ∞ ∞ * % P( Ak /B) = P (Ak /B). k=1 k=1 Thus, it is a probability measure with respect to the new sample space B. Example 2.1. A drawer contains 4 black, 6 brown, and 8 olive socks. Two socks are selected at random from the drawer. (a) What is the probability that both socks are of the same color? (b) What is the probability that both socks are olive if it is known that they are of the same color? Answer: The sample space of this experiment consists of S = {(x, y) | x, y ∈ Bl, Ol, Br}. The cardinality of S is N (S) = # 18 2 $ = 153. Probability and Mathematical Statistics 29 Let A be the event that two socks selected at random are of the same color. Then the cardinality of A is given by # $ # $ # $ 4 6 8 N (A) = + + 2 2 2 = 6 + 15 + 28 = 49. Therefore, the probability of A is given by 49 49 . P (A) = !18" = 153 2 Let B be the event that two socks selected at random are olive. Then the cardinality of B is given by # $ 8 N (B) = 2 and hence ! 
8" 2 "= P (B) = !18 2 Notice that B ⊂ A. Hence, 28 . 153 P (A ∩ B) P (A) P (B) = P (A) # $# $ 153 28 = 153 49 28 4 = = . 49 7 P (B/A) = Let A and B be two mutually disjoint events in a sample space S. We want to find a formula for computing the probability that the event A occurs before the event B in a sequence trials. Let P (A) and P (B) be the probabilities that A and B occur, respectively. Then the probability that neither A nor B occurs is 1 − P (A) − P (B). Let us denote this probability by r, that is r = 1 − P (A) − P (B). In the first trial, either A occurs, or B occurs, or neither A nor B occurs. In the first trial if A occurs, then the probability of A occurs before B is 1. Conditional Probability and Bayes’ Theorem 30 If B occurs in the first trial, then the probability of A occurs before B is 0. If neither A nor B occurs in the first trial, we look at the outcomes of the second trial. In the second trial if A occurs, then the probability of A occurs before B is 1. If B occurs in the second trial, then the probability of A occurs before B is 0. If neither A nor B occurs in the second trial, we look at the outcomes of the third trial, and so on. This argument can be summarized in the following diagram. P(A) A before B 1 A before B 0 P(B) P(A) 1 A before B 0 r P(B) P(A) 1 0 r P(B) P(A) 0 r r = 1-P(A)-P(B) A before B 1 P(B) r Hence the probability that the event A comes before the event B is given by P (A before B) = P (A) + r P (A) + r2 P (A) + r3 P (A) + · · · + rn P (A) + · · · = P (A) [1 + r + r2 + · · · + rn + · · · ] 1 = P (A) 1−r 1 = P (A) 1 − [1 − P (A) − P (B)] P (A) . = P (A) + P (B) The event A before B can also be interpreted as a conditional event. In this interpretation the event A before B means the occurrence of the event A given that A ∪ B has already occurred. Thus we again have P (A ∩ (A ∪ B)) P (A ∪ B) P (A) = . P (A) + P (B) P (A/A ∪ B) = Example 2.2. A pair of four-sided dice is rolled and the sum is determined. What is the probability that a sum of 3 is rolled before a sum of 5 is rolled in a sequence of rolls of the dice? Probability and Mathematical Statistics 31 Answer: The sample space of this random experiment is {(1, 1) (1, 2) (1, 3) (2, 1) (2, 2) (2, 3) S= (3, 1) (3, 2) (3, 3) (4, 1) (4, 2) (4, 3) (1, 4) (2, 4) (3, 4) (4, 4)}. Let A denote the event of getting a sum of 3 and B denote the event of getting a sum of 5. The probability that a sum of 3 is rolled before a sum of 5 is rolled can be thought of as the conditional probability of a sum of 3, given that a sum of 3 or 5 has occurred. That is, P (A/A ∪ B). Hence P (A/A ∪ B) = = = = = P (A ∩ (A ∪ B)) P (A ∪ B) P (A) P (A) + P (B) N (A) N (A) + N (B) 2 2+4 1 . 3 Example 2.3. If we randomly pick two television sets in succession from a shipment of 240 television sets of which 15 are defective, what is the probability that they will be both defective? Answer: Let A denote the event that the first television picked was defective. Let B denote the event that the second television picked was defective. Then A∩B will denote the event that both televisions picked were defective. Using the conditional probability, we can calculate P (A ∩ B) = P (A) P (B/A) # $# $ 15 14 = 240 239 7 . = 1912 In Example 2.3, we assume that we are sampling without replacement. Definition 2.2. If an object is selected and then replaced before the next object is selected, this is known as sampling with replacement. Otherwise, it is called sampling without replacement. 
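Definition 2.2 and Example 2.3 can also be checked empirically. The sketch below is a small Monte Carlo experiment, written in Python only for illustration; the shipment list, the trial count of 200,000, and the fixed seed are arbitrary choices, not anything prescribed by the text. It estimates the probability that two television sets drawn in succession without replacement are both defective and compares the estimate with the exact value (15/240)(14/239) = 7/1912 obtained in Example 2.3 from P(A ∩ B) = P(A) P(B/A).

```python
import random
from fractions import Fraction

# Exact answer from Example 2.3: P(A)P(B/A) = (15/240)(14/239) = 7/1912.
exact = Fraction(15, 240) * Fraction(14, 239)

def estimate_both_defective(trials=200_000, seed=1):
    """Estimate P(both defective) by repeatedly drawing 2 sets without
    replacement from a shipment of 240 sets, 15 of which are defective."""
    shipment = [1] * 15 + [0] * 225      # 1 = defective, 0 = good
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        first, second = rng.sample(shipment, 2)   # sampling without replacement
        hits += first and second                  # counts only the (1, 1) draws
    return hits / trials

print(float(exact), estimate_both_defective())
```

With a few hundred thousand trials the estimate typically lands close to 7/1912 ≈ 0.0037, illustrating the "long run relative frequency" reading of probability from Chapter 1 as well as the multiplication rule for conditional probabilities.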
Conditional Probability and Bayes’ Theorem 32 Rolling a die is equivalent to sampling with replacement, whereas dealing a deck of cards to players is sampling without replacement. Example 2.4. A box of fuses contains 20 fuses, of which 5 are defective. If 3 of the fuses are selected at random and removed from the box in succession without replacement, what is the probability that all three fuses are defective? Answer: Let A be the event that the first fuse selected is defective. Let B be the event that the second fuse selected is defective. Let C be the event that the third fuse selected is defective. The probability that all three fuses selected are defective is P (A ∩ B ∩ C). Hence P (A ∩ B ∩ C) = P (A) P (B/A) P (C/A ∩ B) # $# $# $ 5 4 3 = 20 19 18 1 . = 114 Definition 2.3. Two events A and B of a sample space S are called independent if and only if P (A ∩ B) = P (A) P (B). Example 2.5. The following diagram shows two events A and B in the sample space S. Are the events A and B independent? S B A Answer: There are 10 black dots in S and event A contains 4 of these dots. 4 So the probability of A, is P (A) = 10 . Similarly, event B contains 5 black 5 dots. Hence P (B) = 10 . The conditional probability of A given B is P (A/B) = P (A ∩ B) 2 = . P (B) 5 Probability and Mathematical Statistics 33 This shows that P (A/B) = P (A). Hence A and B are independent. Theorem 2.1. Let A, B ⊆ S. If A and B are independent and P (B) > 0, then P (A/B) = P (A). Proof: P (A ∩ B) P (B) P (A) P (B) = P (B) = P (A). P (A/B) = Theorem 2.2. If A and B are independent events. Then Ac and B are independent. Similarly A and B c are independent. Proof: We know that A and B are independent, that is P (A ∩ B) = P (A) P (B) and we want to show that Ac and B are independent, that is P (Ac ∩ B) = P (Ac ) P (B). Since P (Ac ∩ B) = P (Ac /B) P (B) = [1 − P (A/B)] P (B) = P (B) − P (A/B)P (B) = P (B) − P (A ∩ B) = P (B) − P (A) P (B) = P (B) [1 − P (A)] = P (B)P (Ac ), the events Ac and B are independent. Similarly, it can be shown that A and B c are independent and the proof is now complete. Remark 2.1. The concept of independence is fundamental. In fact, it is this concept that justifies the mathematical development of probability as a separate discipline from measure theory. Mark Kac said, “independence of events is not a purely mathematical concept.” It can, however, be made plausible Conditional Probability and Bayes’ Theorem 34 that it should be interpreted by the rule of multiplication of probabilities and this leads to the mathematical definition of independence. Example 2.6. Flip a coin and then independently cast a die. What is the probability of observing heads on the coin and a 2 or 3 on the die? Answer: Let A denote the event of observing a head on the coin and let B be the event of observing a 2 or 3 on the die. Then P (A ∩ B) = P (A) P (B) # $# $ 1 2 = 2 6 1 = . 6 Example 2.7. An urn contains 3 red, 2 white and 4 yellow balls. An ordered sample of size 3 is drawn from the urn. If the balls are drawn with replacement so that one outcome does not change the probabilities of others, then what is the probability of drawing a sample that has balls of each color? Also, find the probability of drawing a sample that has two yellow balls and a red ball or a red ball and two white balls? Answer: # $# $# $ 3 2 4 8 P (RW Y ) = = 9 9 9 243 and P (Y Y R or RW W ) = # $# $# $ # $# $# $ 20 4 3 3 2 2 4 + = . 9 9 9 9 9 9 243 If the balls are drawn without replacement, then # $# $# $ 3 2 4 1 P (RW Y ) = = . 
9 8 7 21 # $# $# $ # $# $# $ 4 7 3 3 3 2 1 P (Y Y R or RW W ) = + = . 9 8 7 9 8 7 84 There is a tendency to equate the concepts “mutually exclusive” and “independence”. This is a fallacy. Two events A and B are mutually exclusive if A ∩ B = ∅ and they are called possible if P (A) += 0 += P (B). Theorem 2.2. Two possible mutually exclusive events are always dependent (that is not independent). Probability and Mathematical Statistics 35 Proof: Suppose not. Then P (A ∩ B) = P (A) P (B) P (∅) = P (A) P (B) 0 = P (A) P (B). Hence, we get either P (A) = 0 or P (B) = 0. This is a contradiction to the fact that A and B are possible events. This completes the proof. Theorem 2.3. Two possible independent events are not mutually exclusive. Proof: Let A and B be two independent events and suppose A and B are mutually exclusive. Then P (A) P (B) = P (A ∩ B) = P (∅) = 0. Therefore, we get either P (A) = 0 or P (B) = 0. This is a contradiction to the fact that A and B are possible events. The possible events A and B exclusive implies A and B are not independent; and A and B independent implies A and B are not exclusive. 2.2. Bayes’ Theorem There are many situations where the ultimate outcome of an experiment depends on what happens in various intermediate stages. This issue is resolved by the Bayes’ Theorem. Definition 2.4. Let S be a set and let P = {Ai }m i=1 be a collection of subsets of S. The collection P is called a partition of S if m * Ai (a) S = i=1 (b) Ai ∩ Aj = ∅ for i += j. A2 A1 A3 A5 A4 Sample Space Conditional Probability and Bayes’ Theorem 36 Theorem 2.4. If the events {Bi }m i=1 constitute a partition of the sample space S and P (Bi ) += 0 for i = 1, 2, ..., m, then for any event A in S P (A) = m % P (Bi ) P (A/Bi ). i=1 Proof: Let S be a sample space and A be an event in S. Let {Bi }m i=1 be any partition of S. Then A= m * i=1 Thus P (A) = = m % i=1 m % (A ∩ Bi ) . P (A ∩ Bi ) P (Bi ) P (A/Bi ) . i=1 Theorem 2.5. If the events {Bi }m i=1 constitute a partition of the sample space S and P (Bi ) += 0 for i = 1, 2, ..., m, then for any event A in S such that P (A) += 0 P (Bk ) P (A/Bk ) P (Bk /A) = 1m i=1 P (Bi ) P (A/Bi ) k = 1, 2, ..., m. Proof: Using the definition of conditional probability, we get P (Bk /A) = P (A ∩ Bk ) . P (A) Using Theorem 1, we get P (A ∩ Bk ) . i=1 P (Bi ) P (A/Bi ) P (Bk /A) = 1m This completes the proof. This Theorem is called Bayes Theorem. The probability P (Bk ) is called prior probability. The probability P (Bk /A) is called posterior probability. Example 2.8. Two boxes containing marbles are placed on a table. The boxes are labeled B1 and B2 . Box B1 contains 7 green marbles and 4 white Probability and Mathematical Statistics 37 marbles. Box B2 contains 3 green marbles and 10 yellow marbles. The boxes are arranged so that the probability of selecting box B1 is 31 and the probability of selecting box B2 is 23 . Kathy is blindfolded and asked to select a marble. She will win a color TV if she selects a green marble. (a) What is the probability that Kathy will win the TV (that is, she will select a green marble)? (b) If Kathy wins the color TV, what is the probability that the green marble was selected from the first box? Answer: Let A be the event of drawing a green marble. The prior probabilities are P (B1 ) = 31 and P (B2 ) = 23 . (a) The probability that Kathy will win the TV is P (A) = P (A ∩ B1 ) + P (A ∩ B2 ) = P (A/B1 ) P (B1 ) + P (A/B2 ) P (B2 ) # $# $ # $# $ 1 3 2 7 + = 11 3 13 3 7 2 = + 33 13 91 66 = + 429 429 = 157 . 
429 (b) Given that Kathy won the TV, the probability that the green marble was selected from B1 is 7/11 1/3 Green marble Selecting box B1 4/11 Not a green marble Green marble 2/3 3/13 Selecting box B2 10/13 Not a green marble Conditional Probability and Bayes’ Theorem P (B1 /A) = 38 P (A/B1 ) P (B1 ) P (A/B1 ) P (B1 ) + P (A/B2 ) P (B2 ) ! 7 " !1" " !3 3 " ! 2 " = ! 7 " ! 111 11 3 + 13 3 = 91 . 157 Note that P (A/B1 ) is the probability of selecting a green marble from B1 whereas P (B1 /A) is the probability that the green marble was selected from box B1 . Example 2.9. Suppose box A contains 4 red and 5 blue chips and box B contains 6 red and 3 blue chips. A chip is chosen at random from the box A and placed in box B. Finally, a chip is chosen at random from among those now in box B. What is the probability a blue chip was transferred from box A to box B given that the chip chosen from box B is red? Answer: Let E represent the event of moving a blue chip from box A to box B. We want to find the probability of a blue chip which was moved from box A to box B given that the chip chosen from B was red. The probability of choosing a red chip from box A is P (R) = 49 and the probability of choosing a blue chip from box A is P (B) = 95 . If a red chip was moved from box A to box B, then box B has 7 red chips and 3 blue chips. Thus the probability 7 . Similarly, if a blue chip was moved of choosing a red chip from box B is 10 from box A to box B, then the probability of choosing a red chip from box 6 . B is 10 Red chip 7/10 red 4/9 Box B 7 red 3 blue 3/10 Box A Not a red chip Red chip blue 5/9 Box B 6 red 4 blue 6/10 4/10 Not a red chip Probability and Mathematical Statistics 39 Hence, the probability that a blue chip was transferred from box A to box B given that the chip chosen from box B is red is given by P (E/R) = P (R/E) P (E) P (R) =! = ! 6 " !5" 10 " ! " !9 6 " ! 5 " 7 4 10 9 + 10 9 15 . 29 Example 2.10. Sixty percent of new drivers have had driver education. During their first year, new drivers without driver education have probability 0.08 of having an accident, but new drivers with driver education have only a 0.05 probability of an accident. What is the probability a new driver has had driver education, given that the driver has had no accident the first year? Answer: Let A represent the new driver who has had driver education and B represent the new driver who has had an accident in his first year. Let Ac and B c be the complement of A and B, respectively. We want to find the probability that a new driver has had driver education, given that the driver has had no accidents in the first year, that is P (A/B c ). P (A ∩ B c ) P (B c ) P (B c /A) P (A) = c P (B /A) P (A) + P (B c /Ac ) P (Ac ) P (A/B c ) = = [1 − P (B/A)] P (A) [1 − P (B/A)] P (A) + [1 − P (B/Ac )] [1 − P (A)] =! ! 60 " ! 95 " 100" " ! !100 " ! 95 " 40 92 60 100 100 + 100 100 = 0.6077. Example 2.11. One-half percent of the population has AIDS. There is a test to detect AIDS. A positive test result is supposed to mean that you Conditional Probability and Bayes’ Theorem 40 have AIDS but the test is not perfect. For people with AIDS, the test misses the diagnosis 2% of the times. And for the people without AIDS, the test incorrectly tells 3% of them that they have AIDS. (a) What is the probability that a person picked at random will test positive? (b) What is the probability that you have AIDS given that your test comes back positive? 
Answer: Let A denote the event of one who has AIDS and let B denote the event that the test comes out positive. (a) The probability that a person picked at random will test positive is given by P (test positive) = (0.005) (0.98) + (0.995) (0.03) = 0.0049 + 0.0298 = 0.035. (b) The probability that you have AIDS given that your test comes back positive is given by favorable positive branches total positive branches (0.005) (0.98) = (0.005) (0.98) + (0.995) (0.03) 0.0049 = 0.14. = 0.035 P (A/B) = 0.98 Test positive AIDS 0.005 0.02 Test negative Test positive 0.03 0.995 No AIDS 0.97 Test negative Remark 2.2. This example illustrates why Bayes’ theorem is so important. What we would really like to know in this situation is a first-stage result: Do you have AIDS? But we cannot get this information without an autopsy. The first stage is hidden. But the second stage is not hidden. The best we can do is make a prediction about the first stage. This illustrates why backward conditional probabilities are so useful. Probability and Mathematical Statistics 41 2.3. Review Exercises 1. Let P (A) = 0.4 and P (A ∪ B) = 0.6. For what value of P (B) are A and B independent? 2. A die is loaded in such a way that the probability of the face with j dots turning up is proportional to j for j = 1, 2, 3, 4, 5, 6. In 6 independent throws of this die, what is the probability that each face turns up exactly once? 3. A system engineer is interested in assessing the reliability of a rocket composed of three stages. At take off, the engine of the first stage of the rocket must lift the rocket off the ground. If that engine accomplishes its task, the engine of the second stage must now lift the rocket into orbit. Once the engines in both stages 1 and 2 have performed successfully, the engine of the third stage is used to complete the rocket’s mission. The reliability of the rocket is measured by the probability of the completion of the mission. If the probabilities of successful performance of the engines of stages 1, 2 and 3 are 0.99, 0.97 and 0.98, respectively, find the reliability of the rocket. 4. Identical twins come from the same egg and hence are of the same sex. Fraternal twins have a 50-50 chance of being the same sex. Among twins the probability of a fraternal set is 13 and an identical set is 23 . If the next set of twins are of the same sex, what is the probability they are identical? 5. In rolling a pair of fair dice, what is the probability that a sum of 7 is rolled before a sum of 8 is rolled ? 6. A card is drawn at random from an ordinary deck of 52 cards and replaced. This is done a total of 5 independent times. What is the conditional probability of drawing the ace of spades exactly 4 times, given that this ace is drawn at least 4 times? 7. Let A and B be independent events with P (A) = P (B) and P (A ∪ B) = 0.5. What is the probability of the event A? 8. An urn contains 6 red balls and 3 blue balls. One ball is selected at random and is replaced by a ball of the other color. A second ball is then chosen. What is the conditional probability that the first ball selected is red, given that the second ball was red? Conditional Probability and Bayes’ Theorem 42 9. A family has five children. Assuming that the probability of a girl on each birth was 0.5 and that the five births were independent, what is the probability the family has at least one girl, given that they have at least one boy? 10. An urn contains 4 balls numbered 0 through 3. 
One ball is selected at random and removed from the urn and not replaced. All balls with nonzero numbers less than that of the selected ball are also removed from the urn. Then a second ball is selected at random from those remaining in the urn. What is the probability that the second ball selected is numbered 3? 11. English and American spelling are rigour and rigor, respectively. A man staying at Al Rashid hotel writes this word, and a letter taken at random from his spelling is found to be a vowel. If 40 percent of the English-speaking men at the hotel are English and 60 percent are American, what is the probability that the writer is an Englishman? 12. A diagnostic test for a certain disease is said to be 90% accurate in that, if a person has the disease, the test will detect with probability 0.9. Also, if a person does not have the disease, the test will report that he or she doesn’t have it with probability 0.9. Only 1% of the population has the disease in question. If the diagnostic test reports that a person chosen at random from the population has the disease, what is the conditional probability that the person, in fact, has the disease? 13. A small grocery store had 10 cartons of milk, 2 of which were sour. If you are going to buy the 6th carton of milk sold that day at random, find the probability of selecting a carton of sour milk. 14. Suppose Q and S are independent events such that the probability that at least one of them occurs is not occur is 1 9. 1 3 and the probability that Q occurs but S does What is the probability of S? 15. A box contains 2 green and 3 white balls. A ball is selected at random from the box. If the ball is green, a card is drawn from a deck of 52 cards. If the ball is white, a card is drawn from the deck consisting of just the 16 pictures. (a) What is the probability of drawing a king? (b) What is the probability of a white ball was selected given that a king was drawn? Probability and Mathematical Statistics 43 16. Five urns are numbered 3,4,5,6 and 7, respectively. Inside each urn is n2 dollars where n is the number on the urn. The following experiment is performed: An urn is selected at random. If its number is a prime number the experimenter receives the amount in the urn and the experiment is over. If its number is not a prime number, a second urn is selected from the remaining four and the experimenter receives the total amount in the two urns selected. What is the probability that the experimenter ends up with exactly twentyfive dollars? 17. A cookie jar has 3 red marbles and 1 white marble. A shoebox has 1 red marble and 1 white marble. Three marbles are chosen at random without replacement from the cookie jar and placed in the shoebox. Then 2 marbles are chosen at random and without replacement from the shoebox. What is the probability that both marbles chosen from the shoebox are red? 18. A urn contains n black balls and n white balls. Three balls are chosen from the urn at random and without replacement. What is the value of n if the probability is 1 12 that all three balls are white? 19. An urn contains 10 balls numbered 1 through 10. Five balls are drawn at random and without replacement. Let A be the event that “Exactly two odd-numbered balls are drawn and they occur on odd-numbered draws from the urn.” What is the probability of event A? 20. I have five envelopes numbered 3, 4, 5, 6, 7 all hidden in a box. I pick an envelope – if it is prime then I get the square of that number in dollars. 
Otherwise (without replacement) I pick another envelope and then get the sum of squares of the two envelopes I picked (in dollars). What is the probability that I will get $25? Conditional Probability and Bayes’ Theorem 44 Probability and Mathematical Statistics 45 Chapter 3 RANDOM VARIABLES AND DISTRIBUTION FUNCTIONS 3.1. Introduction In many random experiments, the elements of sample space are not necessarily numbers. For example, in a coin tossing experiment the sample space consists of S = {Head, Tail}. Statistical methods involve primarily numerical data. Hence, one has to ‘mathematize’ the outcomes of the sample space. This mathematization, or quantification, is achieved through the notion of random variables. Definition 3.1. Consider a random experiment whose sample space is S. A random variable X is a function from the sample space S into the set of real numbers R I such that for each interval I in R, I the set {s ∈ S | X(s) ∈ I} is an event in S. In a particular experiment a random variable X would be some function that assigns a real number X(s) to each possible outcome s in the sample space. Given a random experiment, there can be many random variables. This is due to the fact that given two (finite) sets A and B, the number of distinct functions one can come up with is |B||A| . Here |A| means the cardinality of the set A. Random variable is not a variable. Also, it is not random. Thus someone named it inappropriately. The following analogy speaks the role of the random variable. Random variable is like the Holy Roman Empire – it was Random Variables and Distribution Functions 46 not holy, it was not Roman, and it was not an empire. A random variable is neither random nor variable, it is simply a function. The values it takes on are both random and variable. Definition 3.2. The set {x ∈ R I | x = X(s), s ∈ S} is called the space of the random variable X. The space of the random variable X will be denoted by RX . The space of the random variable X is actually the range of the function X : S → R. I Example 3.1. Consider the coin tossing experiment. Construct a random variable X for this experiment. What is the space of this random variable X? Answer: The sample space of this experiment is given by S = {Head, Tail}. Let us define a function from S into the set of reals as follows X(Head) = 0 X(T ail) = 1. Then X is a valid map and thus by our definition of random variable, it is a random variable for the coin tossing experiment. The space of this random variable is RX = {0, 1}. Tail X Head 0 1 Real line Sample Space X(head) = 0 and X(tail) = 1 Example 3.2. Consider an experiment in which a coin is tossed ten times. What is the sample space of this experiment? How many elements are in this sample space? Define a random variable for this sample space and then find the space of the random variable. Probability and Mathematical Statistics 47 Answer: The sample space of this experiment is given by S = {s | s is a sequence of 10 heads or tails}. The cardinality of S is |S| = 210 . Let X : S → R I be a function from the sample space S into the set of reals R I defined as follows: X(s) = number of heads in sequence s. Then X is a random variable. This random variable, for example, maps the sequence HHT T T HT T HH to the real number 5, that is X(HHT T T HT T HH) = 5. The space of this random variable is RX = {0, 1, 2, ..., 10}. Now, we introduce some notations. By (X = x) we mean the event {s ∈ S | X(s) = x}. Similarly, (a < X < b) means the event {s ∈ S | a < X < b} of the sample space S. 
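For readers who like to compute, the event notation can be made concrete with a small enumeration. The following Python sketch is an added illustration, not part of the original text (the names S, X and event are ours): it lists the sample space of Example 3.2, treats X as a function on S, and extracts the event (X = 5) as a subset of S.

```python
# Added illustration of Example 3.2 and of the event notation (X = x):
# the sample space S of ten tosses, the random variable X(s) = number of
# heads in s, and the event {s in S : X(s) = 5}.
from itertools import product
from fractions import Fraction

S = list(product("HT", repeat=10))      # all 2^10 = 1024 outcomes
X = lambda s: s.count("H")              # the random variable X : S -> R

event = [s for s in S if X(s) == 5]     # the event (X = 5) as a subset of S
prob = Fraction(len(event), len(S))     # outcomes are equally likely

print(len(S), len(event), prob)         # 1024 252 63/256
```

Since all 1024 outcomes are equally likely for a fair coin, the event (X = 5) has probability 252/1024 = 63/256.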
These are illustrated in the following diagrams. S S A X B x X Real line Sample Space Real line Sample Space (X=x) means the event A a b (a<X<b) means the event B Definition 3.3. If the space of random variable X is countable, then X is called a discrete random variable. Definition 3.4. If the space of random variable X is uncountable, then X is called a continuous random variable. In the case of a continuous random variable, the space is either an interval or a union of intervals. A random variable is characterized through its probability density function. First, we consider the discrete case and then we examine the continuous case. Random Variables and Distribution Functions 48 3.2. Distribution Functions of Discrete Random Variables Definition 3.5. Let RX be the space of the random variable X. The function f : RX → R I defined by f (x) = P (X = x) is called the probability density function (pdf) of X. Example 3.3. In an introductory statistics class of 50 students, there are 11 freshman, 19 sophomores, 14 juniors and 6 seniors. One student is selected at random. What is the sample space of this experiment? Construct a random variable X for this sample space and then find its space. Further, find the probability density function of this random variable X. Answer: The sample space of this random experiment is S = {F r, So, Jr, Sr}. Define a function X : S → R I as follows: X(F r) = 1, X(So) = 2 X(Jr) = 3, X(Sr) = 4. Then clearly X is a random variable in S. The space of X is given by RX = {1, 2, 3, 4}. The probability density function of X is given by 11 50 19 f (2) = P (X = 2) = 50 14 f (3) = P (X = 3) = 50 6 f (4) = P (X = 4) = . 50 f (1) = P (X = 1) = Example 3.4. A box contains 5 colored balls, 2 black and 3 white. Balls are drawn successively without replacement. If the random variable X is the number of draws until the last black ball is obtained, find the probability density function for the random variable X. Probability and Mathematical Statistics 49 Answer: Let ‘B’ denote the black ball, and ‘W’ denote the white ball. Then the sample space S of this experiment is given by (see the figure below) B B B B B W 2B 3W W W B B B W W B B W B W W W B B B S = { BB, BW B, W BB, BW W B, W BW B, W W BB, BW W W B, W W BW B, W W W BB, W BW W B}. Hence the sample space has 10 points, that is |S| = 10. It is easy to see that the space of the random variable X is {2, 3, 4, 5}. BB BWB WBB BWWB WBWB WWBB BWWWB WWBWB WWWBB WBWWB Sample Space S X 2 3 4 5 Real line Therefore, the probability density function of X is given by 1 , 10 3 f (4) = P (X = 4) = , 10 f (2) = P (X = 2) = Thus f (x) = x−1 , 10 2 10 4 f (5) = P (X = 5) = . 10 f (3) = P (X = 3) = x = 2, 3, 4, 5. Random Variables and Distribution Functions 50 Example 3.5. A pair of dice consisting of a six-sided die and a four-sided die is rolled and the sum is determined. Let the random variable X denote this sum. Find the sample space, the space of the random variable, and probability density function of X. Answer: The sample space of this random experiment is given by {(1, 1) (2, 1) S= (3, 1) (4, 1) (1, 2) (2, 2) (3, 2) (4, 2) (1, 3) (2, 3) (3, 3) (4, 3) (1, 4) (2, 4) (3, 4) (4, 4) (1, 5) (2, 5) (3, 5) (4, 5) (1, 6) (2, 6) (3, 6) (4, 6)} The space of the random variable X is given by RX = {2, 3, 4, 5, 6, 7, 8, 9, 10}. 
Therefore, the probability density function of X is given by 1 , f (3) = P (X = 3) = f (2) = P (X = 2) = 24 3 f (4) = P (X = 4) = , f (5) = P (X = 5) = 24 4 f (6) = P (X = 6) = , f (7) = P (X = 7) = 24 3 f (8) = P (X = 8) = , f (9) = P (X = 9) = 24 1 f (10) = P (X = 10) = . 24 2 24 4 24 4 24 2 24 Example 3.6. A fair coin is tossed 3 times. Let the random variable X denote the number of heads in 3 tosses of the coin. Find the sample space, the space of the random variable, and the probability density function of X. Answer: The sample space S of this experiment consists of all binary sequences of length 3, that is S = {T T T, T T H, T HT, HT T, T HH, HT H, HHT, HHH}. TTT TTH THT HTT THH HTH HHT HHH Sample Space S X 0 1 2 3 Real line Probability and Mathematical Statistics 51 The space of this random variable is given by RX = {0, 1, 2, 3}. Therefore, the probability density function of X is given by 1 8 3 f (1) = P (X = 1) = 8 3 f (2) = P (X = 2) = 8 1 f (3) = P (X = 3) = . 8 f (0) = P (X = 0) = This can be written as follows: f (x) = # $ # $x # $3−x 3 1 1 x 2 2 x = 0, 1, 2, 3. The probability density function f (x) of a random variable X completely characterizes it. Some basic properties of a discrete probability density function are summarized below. Theorem 3.1. If X is a discrete random variable with space RX and probability density function f (x), then (a) f (x) ≥ 0 for all x in RX , and (b) % f (x) = 1. x∈RX Example 3.7. If the probability of a random variable X with space RX = {1, 2, 3, ..., 12} is given by f (x) = k (2x − 1), then, what is the value of the constant k? Random Variables and Distribution Functions Answer: 1= % 52 f (x) x∈RX = % x∈RX = 12 % x=1 / k (2x − 1) k (2x − 1) =k 2 12 % x=1 0 x − 12 3 2 (12)(13) − 12 =k 2 2 = k 144. Hence k= 1 . 144 Definition 3.6. The cumulative distribution function F (x) of a random variable X is defined as F (x) = P (X ≤ x) for all real numbers x. Theorem 3.2. If X is a random variable with the space RX , then F (x) = % f (t) t≤x for x ∈ RX . Example 3.8. If the probability density function of the random variable X is given by 1 (2x − 1) for x = 1, 2, 3, ..., 12 144 then find the cumulative distribution function of X. Answer: The space of the random variable X is given by RX = {1, 2, 3, ..., 12}. Probability and Mathematical Statistics 53 Then F (1) = % f (t) = f (1) = % f (t) = f (1) + f (2) = % f (t) = f (1) + f (2) + f (3) = t≤1 F (2) = 1 144 t≤2 F (3) = t≤3 1 3 4 + = 144 144 144 3 5 9 1 + + = 144 144 144 144 .. ........ .. ........ % F (12) = f (t) = f (1) + f (2) + · · · + f (12) = 1. t≤12 F (x) represents the accumulation of f (t) up to t ≤ x. Theorem 3.3. Let X be a random variable with cumulative distribution function F (x). Then the cumulative distribution function satisfies the followings: (a) F (−∞) = 0, (b) F (∞) = 1, and (c) F (x) is an increasing function, that is if x < y, then F (x) ≤ F (y) for all reals x, y. The proof of this theorem is trivial and we leave it to the students. Theorem 3.4. If the space RX of the random variable X is given by RX = {x1 < x2 < x3 < · · · < xn }, then f (x1 ) = F (x1 ) f (x2 ) = F (x2 ) − F (x1 ) f (x3 ) = F (x3 ) − F (x2 ) .. ........ .. ........ f (xn ) = F (xn ) − F (xn−1 ). 
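As a sanity check on these two theorems, the passage from f to F and back can be carried out mechanically. The Python sketch below is an added illustration, not part of the original text: it applies Theorem 3.2 to the density f(x) = (2x − 1)/144 of Example 3.8 and then recovers the density from the distribution function by the successive differences of Theorem 3.4.

```python
# Added illustration of Theorems 3.2 and 3.4 for the density
# f(x) = (2x - 1)/144 on {1, 2, ..., 12} from Example 3.8.
from fractions import Fraction
from itertools import accumulate

xs = list(range(1, 13))
f  = {x: Fraction(2 * x - 1, 144) for x in xs}

# Theorem 3.2: F(x) is the sum of f(t) over all t <= x.
F = dict(zip(xs, accumulate(f[x] for x in xs)))

# Theorem 3.4: recover f from F by successive differences.
f_back = {xs[0]: F[xs[0]]}
f_back.update({x: F[x] - F[prev] for prev, x in zip(xs, xs[1:])})

print(F[3], F[12], f_back == f)   # 1/16  1  True
```

The printed values agree with the hand computation in Example 3.8, and the recovered density coincides with the original one.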
Random Variables and Distribution Functions 54 F(x4) 1 f(x4) F(x3) f(x3) F(x2) f(x2) F(x1) 0 f(x1) x1 x2 x3 x4 x Theorem 3.2 tells us how to find cumulative distribution function from the probability density function, whereas Theorem 3.4 tells us how to find the probability density function given the cumulative distribution function. Example 3.9. Find the probability density function of the random variable X whose cumulative distribution function is  0.00 if x < −1        0.25 if −1 ≤ x < 1      F (x) = 0.50 if 1 ≤ x < 3       0.75 if 3 ≤ x < 5       1.00 if x ≥ 5 . Also, find (a) P (X ≤ 3), (b) P (X = 3), and (c) P (X < 3). Answer: The space of this random variable is given by RX = {−1, 1, 3, 5}. By the previous theorem, the probability density function of X is given by f (−1) = 0.25 f (1) = 0.50 − 0.25 = 0.25 f (3) = 0.75 − 0.50 = 0.25 f (5) = 1.00 − 0.75 = 0.25. The probability P (X ≤ 3) can be computed by using the definition of F . Hence P (X ≤ 3) = F (3) = 0.75. Probability and Mathematical Statistics 55 The probability P (X = 3) can be computed from P (X = 3) = F (5) − F (3) = 1 − 0.75 = 0.25. Finally, we get P (X < 3) from P (X < 3) = P (X ≤ 1) = F (1) = 0.5. We close this section with an example showing that there is no one-toone correspondence between a random variable and its distribution function. Consider a coin tossing experiment with the sample space consisting of a head and a tail, that is S = { head, tail }. Define two random variables X1 and X2 from S as follows: X1 ( head ) = 0 and X1 ( tail ) = 1 X2 ( head ) = 1 and X2 ( tail ) = 0. and It is easy to see that both these random variables have the same distribution function, namely & 0 if x < 0 FXi (x) = 12 if 0 ≤ x < 1 1 if 1 ≤ x for i = 1, 2. Hence there is no one-to-one correspondence between a random variable and its distribution function. 3.3. Distribution Functions of Continuous Random Variables Recall that a random variable X is said to be continuous if its space is either an interval or a union of intervals. Definition 3.7. Let X be a continuous random variable whose space is the set of real numbers R. I A nonnegative real valued function f : R I →R I is said to be the probability density function for the continuous random variable X if it satisfies: 8∞ (a) −∞ f (x) dx = 1, and 8 (b) if A is an event, then P (A) = A f (x) dx. Example 3.10. Is the real valued function f : R I →R I defined by 9 2 x−2 if 1 < x < 2 f (x) = 0 otherwise, Random Variables and Distribution Functions 56 a probability density function for some random variable X? Answer: We have to show that f is nonnegative and the area under f (x) is unity. Since the domain of f is the interval (0, 1), it is clear that f is nonnegative. Next, we calculate : ∞ f (x) dx = : 2 2 x−2 dx 1 −∞ 2 32 1 = −2 x 1 2 3 1 = −2 −1 2 = 1. Thus f is a probability density function. Example 3.11. Is the real valued function f : R I →R I defined by f (x) = 9 1 + |x| if −1 < x < 1 0 otherwise, a probability density function for some random variable X? Probability and Mathematical Statistics 57 Answer: It is easy to see that f is nonnegative, that is f (x) ≥ 0, since f (x) = 1 + |x|. Next we show that the area under f is not unity. For this we compute : : 1 1 f (x) dx = −1 = −1 : 0 −1 (1 + |x|) dx (1 − x) dx + : 1 (1 + x) dx 0 2 2 30 31 1 2 1 2 = x− x + x+ x 2 2 −1 0 1 1 =1+ +1+ 2 2 = 3. Thus f is not a probability density function for some random variable X. Example 3.12. 
For what value of the constant c, the real valued function f :R I →R I given by f (x) = c , 1 + (x − θ)2 −∞ < x < ∞, where θ is a real parameter, is a probability density function for random variable X? Answer: Since f is nonnegative, we see that c ≥ 0. To find the value of c, we use the fact that for pdf the area is unity, that is : ∞ 1= f (x) dx −∞ : ∞ c = dx 2 −∞ 1 + (x − θ) : ∞ c dz = 2 −∞ 1 + z ; <∞ = c tan−1 z −∞ ; < = c tan−1 (∞) − tan−1 (−∞) 2 3 1 1 =c π+ π 2 2 = c π. Hence c = 1 π and the density function becomes f (x) = 1 , π [1 + (x − θ)2 ] −∞ < x < ∞. Random Variables and Distribution Functions 58 This density function is called the Cauchy distribution function with parameter θ. If a random variable X has this pdf then it is called a Cauchy random variable and is denoted by X ∼ CAU (θ). This distribution is symmetrical about θ. Further, it achieves it maximum at x = θ. The following figure illustrates the symmetry of the distribution for θ = 2. Example 3.13. For what value of the constant c, the real valued function f :R I →R I given by & c if a ≤ x ≤ b f (x) = 0 otherwise, where a, b are real constants, is a probability density function for random variable X? Answer: Since f is a pdf, k is nonnegative. Further, since the area under f is unity, we get : ∞ 1= f (x) dx −∞ b = : c dx a b = c [x]a Hence c = 1 b−a , = c [b − a]. and the pdf becomes & 1 f (x) = b−a if a ≤ x ≤ b 0 otherwise. This probability density function is called the uniform distribution on the interval [a, b]. If a random variable X has this pdf then it is called a Probability and Mathematical Statistics 59 uniform random variable and is denoted by X ∼ U N IF (a, b). The following is a graph of the probability density function of a random variable on the interval [2, 5]. Definition 3.8. Let f (x) be the probability density function of a continuous random variable X. The cumulative distribution function F (x) of X is defined as : x F (x) = P (X ≤ x) = f (t) dt. −∞ The cumulative distribution function F (x) represents the area under the probability density function f (x) on the interval (−∞, x) (see figure below). Like the discrete case, the cdf is an increasing function of x, and it takes value 0 at negative infinity and 1 at positive infinity. Theorem 3.5. If F (x) is the cumulative distribution function of a continuous random variable X, the probability density function f (x) of X is the derivative of F (x), that is d F (x) = f (x). dx Random Variables and Distribution Functions 60 Proof: By Fundamental Theorem of Calculus, we get $ #: x d d f (t) dt (F (x)) = dx dx −∞ dx = f (x) dx = f (x). This theorem tells us that if the random variable is continuous, then we can find the pdf given cdf by taking the derivative of the cdf. Recall that for discrete random variable, the pdf at a point in space of the random variable can be obtained from the cdf by taking the difference between the cdf at the point and the cdf immediately below the point. Example 3.14. What is the cumulative distribution function of the Cauchy random variable with parameter θ? Answer: The cdf of X is given by : x F (x) = f (t) dt −∞ : x 1 dt = π [1 + (t − θ)2 ] −∞ : x−θ 1 = dz π [1 + z2] −∞ 1 1 = tan−1 (x − θ) + . π 2 Example 3.15. What is the probability density function of the random variable whose cdf is 1 F (x) = , −∞ < x < ∞? 1 + e−x Answer: The pdf of the random variable is given by d F (x) f (x) = dx # $ d 1 = dx 1 + e−x "−1 d ! 1 + e−x = dx " d ! 1 + e−x = (−1) (1 + e−x )−2 dx e−x = . 
(1 + e−x )2 Probability and Mathematical Statistics 61 Next, we briefly discuss the problem of finding probability when the cdf is given. We summarize our results in the following theorem. Theorem 3.6. Let X be a continuous random variable whose cdf is F (x). Then followings are true: (a) P (X < x) = F (x), (b) P (X > x) = 1 − F (x), (c) P (X = x) = 0 , and (d) P (a < X < b) = F (b) − F (a). 3.4. Percentiles for Continuous Random Variables In this section, we discuss various percentiles of a continuous random variable. If the random variable is discrete, then to discuss percentile, we have to know the order statistics of samples. We shall treat the percentile of discrete random variable in Chapter 13. Definition 3.9. Let p be a real number between 0 and 1. A 100pth percentile of the distribution of a random variable X is any real number q satisfying P (X ≤ q) ≤ p and P (X > q) ≤ 1 − p. A 100pth percentile is a measure of location for the probability distribution in the sense that q divides the distribution of the probability mass into two parts, one having probability mass p and other having probability mass 1 − p (see diagram below). Example 3.16. If the random variable X has the density function   ex−2 for x < 2 f (x) =  0 otherwise, Random Variables and Distribution Functions 62 then what is the 75th percentile of X? Answer: Since 100pth = 75, we get p = 0.75. By definition of percentile, we have : q 0.75 = p = f (x) dx −∞ : q ex−2 dx = −∞ ; <q = ex−2 −∞ = eq−2 . From this solving for q, we get the 75th percentile to be # $ 3 q = 2 + ln . 4 Example 3.17. What is the 87.5 percentile for the distribution with density function 1 f (x) = e−|x| − ∞ < x < ∞? 2 Answer: Note that this density function is symmetric about the y-axis, that is f (x) = f (−x). Hence : 0 −∞ f (x) dx = 1 . 2 Probability and Mathematical Statistics 63 Now we compute the 87.5th percentile q of the above distribution. 87.5 = 100 : q f (x) dx −∞ : 0 : q 1 −|x| 1 −|x| e dx + e dx = 2 0 2 −∞ : 0 : q 1 x 1 −x = e dx + e dx −∞ 2 0 2 : q 1 1 −x = + e dx 2 0 2 1 1 1 = + − e−q . 2 2 2 Therefore solving for q, we get 0.125 = 1 −q e 2 # q = − ln 25 100 $ = ln 4. Hence the 87.5th percentile of the distribution is ln 4. Example 3.18. Let the continuous random variable X have the density function f (x) as shown in the figure below: What is the 25th percentile of the distribution of X? Answer: Since the line passes through the points (0, 0) and (a, 14 ), the function f (x) is equal to 1 x. f (x) = 4a Random Variables and Distribution Functions 64 Since f (x) is a density function the area under f (x) should be unity. Hence : a 1= f (x) dx :0 a 1 x dx = 0 4a 1 2 = a 8a a = . 8 Thus a = 8. Hence the probability density function of X is f (x) = 1 x. 32 Now we want to find the 25th percentile. : q 25 f (x) dx = 100 0 : q 1 = x dx 0 32 1 2 = q . 64 √ Hence q = 16, that is the 25th percentile of the above distribution is 4. Definition 3.10. The 25th and 75th percentiles of any distribution are called the first and the third quartiles, respectively. Definition 3.11. The 50th percentile of any distribution is called the median of the distribution. The median divides the distribution of the probability mass into two equal parts (see the following figure). Probability and Mathematical Statistics 65 If a probability density function f (x) is symmetric about the y-axis, then the median is always 0. Example 3.19. 
A random variable is called standard normal if its probability density function is of the form 1 2 1 f (x) = √ e− 2 x , 2π −∞ < x < ∞. What is the median of X? Answer: Notice that f (x) = f (−x), hence the probability density function is symmetric about the y-axis. Thus the median of X is 0. Definition 3.12. A mode of the distribution of a continuous random variable X is the value of x where the probability density function f (x) attains a relative maximum (see diagram). y Relative Maximum f(x) 0 x mode mode A mode of a random variable X is one of its most probable values. A random variable can have more than one mode. Example 3.20. Let X be a uniform random variable on the interval [0, 1], that is X ∼ U N IF (0, 1). How many modes does X have? Answer: Since X ∼ U N IF (0, 1), the probability density function of X is f (x) = & 1 if 0 ≤ x ≤ 1 0 otherwise. Hence the derivative of f (x) is f # (x) = 0 x ∈ (0, 1). Therefore X has infinitely many modes. Random Variables and Distribution Functions 66 Example 3.21. Let X be a Cauchy random variable with parameter θ = 0, that is X ∼ CAU (0). What is the mode of X? Answer: Since X ∼ CAU (0), the probability density function of f (x) is f (x) = 1 π (1 + x2 ) Hence f # (x) = − ∞ < x < ∞. −2x . π (1 + x2 )2 Setting this derivative to 0, we get x = 0. Thus the mode of X is 0. Example 3.22. Let X be a continuous random variable with density function   x2 e−bx for x ≥ 0 f (x) =  0 otherwise, where b > 0. What is the mode of X? Answer: df dx = 2xe−bx − x2 be−bx 0= Hence = (2 − bx)x = 0. 2 . b Thus the mode of X is 2b . The graph of the f (x) for b = 4 is shown below. x=0 or x= Example 3.23. A continuous random variable has density function  2  3x for 0 ≤ x ≤ θ θ3 f (x) =  0 otherwise, Probability and Mathematical Statistics 67 for some θ > 0. What is the ratio of the mode to the median for this distribution? Answer: For fixed θ > 0, the density function f (x) is an increasing function. Thus, f (x) has maximum at the right end point of the interval [0, θ]. Hence the mode of this distribution is θ. Next we compute the median of this distribution. 1 = 2 : q f (x) dx 0 : q 3x2 dx θ3 0 2 3 3q x = 3 θ 2 3 30 q = 3 . θ = Hence 1 q = 2− 3 θ. Thus the ratio of the mode of this distribution to the median is √ mode θ 3 = − 1 = 2. median 3 2 θ Example 3.24. A continuous random variable has density function f (x) =  2  3x θ3  0 for 0 ≤ x ≤ θ otherwise, for some θ > 0. What is the probability of X less than the ratio of the mode to the median of this distribution? Answer: In the previous example, we have shown that the ratio of the mode to the median of this distribution is given by a := √ mode 3 = 2. median Random Variables and Distribution Functions 68 Hence the probability of X less than the ratio of the mode to the median of this distribution is : a f (x) dx P (X < a) = 0 : a 2 3x dx = θ3 0 2 3 3a x = 3 θ 0 a3 θ3 !√ "3 3 2 = θ3 2 = 3. θ = 3.5. Review Exercises 1. Let the random variable X have the density function  =  k x for 0 ≤ x ≤ 2 k f (x) =  0 elsewhere. If the mode of this distribution is at x = √ 2 4 , then what is the median of X? 2. The random variable X has density function   c xk+1 (1 − x)k for 0 < x < 1 f (x) =  0 otherwise, where c > 0 and 1 < k < 2. What is the mode of X? 3. The random variable X has density function   (k + 1) x2 for 0 < x < 1 f (x) =  0 otherwise, where k is a constant. What is the median of X? 4. What are the median, and mode, respectively, for the density function f (x) = 1 , π (1 + x2 ) −∞ < x < ∞? 
Probability and Mathematical Statistics 69 5. What is the 10th percentile of the random variable X whose probability density function is f (x) = &1 θ x e− θ 0 if x ≥ 0, θ>0 elsewhere? 6. What is the median of the random variable X whose probability density function is & 1 −x 2 if x ≥ 0 2 e f (x) = 0 elsewhere? 7. A continuous random variable X has the density f (x) =  2  3x 8  0 for 0 ≤ x ≤ 2 otherwise. What is the probability that X is greater than its 75th percentile? 8. What is the probability density function of the random variable X if its cumulative distribution function is given by  0.0 if   0.5 if F (x) =   0.7 if 1.0 if x<2 2≤x<3 3≤x<π x ≥ π? 9. Let the distribution of X for x > 0 be F (x) = 1 − 3 % xk e−x . k! k=0 What is the density function of X for x > 0? 10. Let X be a random variable with cumulative distribution function   1 − e−x for x > 0 F (x) =  0 for x ≤ 0. ! " What is the P 0 ≤ eX ≤ 4 ? Random Variables and Distribution Functions 70 11. Let X be a continuous random variable with density function   a x2 e−10 x for x ≥ 0 f (x) =  0 otherwise, where a > 0. What is the probability of X greater than or equal to the mode of X? 12. Let the random variable X have the density function  =  k x for 0 ≤ x ≤ 2 k f (x) =  0 elsewhere. If the mode of this distribution is at x = X less than the median of X? √ 2 4 , then what is the probability of 13. The random variable X has density function   (k + 1) x2 for 0 < x < 1 f (x) =  0 otherwise, where k is a constant. What is the probability of X between the first and third quartiles? 14. Let X be a random variable having continuous cumulative distribution function F (x). What is the cumulative distribution function Y = max(0, −X)? 15. Let X be a random variable with probability density function f (x) = 2 3x for x = 1, 2, 3, .... What is the probability that X is even? 16. An urn contains 5 balls numbered 1 through 5. Two balls are selected at random without replacement from the urn. If the random variable X denotes the sum of the numbers on the 2 balls, then what are the space and the probability density function of X? 17. A pair of six-sided dice is rolled and the sum is determined. If the random variable X denotes the sum of the numbers rolled, then what are the space and the probability density function of X? Probability and Mathematical Statistics 71 18. Five digit codes are selected at random from the set {0, 1, 2, ..., 9} with replacement. If the random variable X denotes the number of zeros in randomly chosen codes, then what are the space and the probability density function of X? 19. A urn contains 10 coins of which 4 are counterfeit. Coins are removed from the urn, one at a time, until all counterfeit coins are found. If the random variable X denotes the number of coins removed to find the first counterfeit one, then what are the space and the probability density function of X? 20. Let X be a random variable with probability density function 2c f (x) = x for x = 1, 2, 3, 4, ..., ∞ 3 for some constant c. What is the value of c? What is the probability that X is even? 21. If the random variable X possesses the density function 9 cx if 0 ≤ x ≤ 2 f (x) = 0 otherwise, then what is the value of c for which f (x) is a probability density function? What is the cumulative distribution function of X. Graph the functions f (x) and F (x). Use F (x) to compute P (1 ≤ X ≤ 2). 22. 
The length of time required by students to complete a 1-hour exam is a random variable with a pdf given by 9 cx2 + x if 0 ≤ x ≤ 1 f (x) = 0 otherwise, then what the probability a student finishes in less than a half hour? 23. What is the probability of, when blindfolded, hitting a circle inscribed on a square wall? 24. Let f (x) be a continuous probability density function. Show that, for " ! is also a probability every −∞ < µ < ∞ and σ > 0, the function σ1 f x−µ σ density function. 25. Let X be a random variable with probability density function f (x) and cumulative distribution function F (x). True or False? (a) f (x) can’t be larger than 1. (b) F (x) can’t be larger than 1. (c) f (x) can’t decrease. (d) F (x) can’t decrease. (e) f (x) can’t be negative. (f) F (x) can’t be negative. (g) Area under f must be 1. (h) Area under F must be 1. (i) f can’t jump. (j) F can’t jump. Random Variables and Distribution Functions 72 Probability and Mathematical Statistics 73 Chapter 4 MOMENTS OF RANDOM VARIABLES AND CHEBYCHEV INEQUALITY 4.1. Moments of Random Variables In this chapter, we introduce the concepts of various moments of a random variable. Further, we examine the expected value and the variance of random variables in detail. We shall conclude this chapter with a discussion of Chebychev’s inequality. Definition 4.1. The nth moment about the origin of a random variable X, as denoted by E(X n ), is defined to be  %  xn f (x) if X is discrete  n E (X ) = x∈RX   8 ∞ xn f (x) dx if X is continuous −∞ for n = 0, 1, 2, 3, ...., provided the right side converges absolutely. If n = 1, then E(X) is called the first moment about the origin. If n = 2, then E(X 2 ) is called the second moment of X about the origin. In general, these moments may or may not exist for a given random variable. If for a random variable, a particular moment does not exist, then we say that the random variable does not have that moment. For these moments to exist one requires absolute convergence of the sum or the integral. Next, we shall define two important characteristics of a random variable, namely the expected value and variance. Occasionally E (X n ) will be written as E [X n ]. Moments of Random Variables and Chebychev Inequality 74 4.2. Expected Value of Random Variables A random variable X is characterized by its probability density function, which defines the relative likelihood of assuming one value over the others. In Chapter 3, we have seen that given a probability density function f of a random variable X, one can construct the distribution function F of it through summation or integration. Conversely, the density function f (x) can be obtained as the marginal value or derivative of F (x). The density function can be used to infer a number of characteristics of the underlying random variable. The two most important attributes are measures of location and dispersion. In this section, we treat the measure of location and treat the other measure in the next section. Definition 4.2. Let X be a random variable with space RX and probability density function f (x). The mean µX of the random variable X is defined as  %  x f (x) if X is discrete  µX = x∈RX   8 ∞ x f (x) dx if X is continuous −∞ if the right hand side exists. The mean of a random variable is a composite of its values weighted by the corresponding probabilities. 
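To see this weighting spelled out numerically, the short Python sketch below is an added illustration, not part of the original text; it computes the mean of the discrete density f(x) = (x − 1)/10 on {2, 3, 4, 5} met earlier in Example 3.4.

```python
# Added illustration of Definition 4.2: the mean as a sum of the values
# of X weighted by their probabilities, for f(x) = (x - 1)/10 on {2,3,4,5}.
from fractions import Fraction

f = {x: Fraction(x - 1, 10) for x in (2, 3, 4, 5)}
assert sum(f.values()) == 1              # f is a genuine density

mean = sum(x * p for x, p in f.items())  # sum of x f(x) over the space of X
print(mean)                              # 4
```

For that density the mean is 4, so on average four draws are needed to obtain the last black ball.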
The mean is a measure of central tendency: the value that the random variable takes “on average.” The mean is also called the expected value of the random variable X and is denoted by E(X). The symbol E is called the expectation operator. The expected value of a random variable may or may not exist. Example 4.1. If X is a uniform random variable on the interval (2, 7), then what is the mean of X? E(X) = (2+7)/2 Probability and Mathematical Statistics 75 Answer: The density function of X is &1 if 2 < x < 7 5 f (x) = 0 otherwise. Thus the mean or the expected value of X is µX = E(X) : ∞ = x f (x) dx −∞ = : 7 2 x 1 dx 5 2 1 2 = x 10 37 2 = 1 (49 − 4) 10 = 45 10 = 9 2 = 2+7 . 2 In general, if X ∼ U N IF (a, b), then E(X) = 12 (a + b). Example 4.2. If X is a Cauchy random variable with parameter θ, that is X ∼ CAU (θ), then what is the expected value of X? Answer: We want to find the expected value of X if it exists. The expected 8 value of X will exist if the integral RI xf (x)dx converges absolutely, that is : |x f (x)| dx < ∞. R I If this integral diverges, then the expected value of X does not exist. Hence, 8 let us find out if RI |x f (x)| dx converges or not. Moments of Random Variables and Chebychev Inequality : R I 76 |x f (x)| dx : ∞ |x f (x)| dx = −∞ = : ∞ −∞ = : ∞ −∞ > > > > 1 >x > > π[1 + (x − θ)2 ] > dx > > > > 1 > dz >(z + θ) > π[1 + z 2 ] > =θ+2 : ∞ 0 z 1 dz π[1 + z 2 ] 2 3∞ 1 2 =θ+ ln(1 + z ) π 0 1 lim ln(1 + b2 ) π b→∞ =θ+∞ =θ+ = ∞. Since, the above integral does not exist, the expected value for the Cauchy distribution also does not exist. Remark 4.1. Indeed, it can be shown that a random variable X with the Cauchy distribution, E (X n ), does not exist for any natural number n. Thus, Cauchy random variables have no moments at all. Example 4.3. If the probability density function of the random variable X is   (1 − p)x−1 p if x = 1, 2, 3, 4, ..., ∞ f (x) =  0 otherwise, then what is the expected value of X? Probability and Mathematical Statistics 77 Answer: The expected value of X is E(X) = = % x f (x) x∈RX ∞ % x=1 x (1 − p)x−1 p d =p dp d =p dp &: / &/ d = −p dp ∞ % x=1 ∞ : % x=1 &∞ % x=1 x−1 x (1 − p) 0 x−1 x (1 − p) x (1 − p) ' 0 dp dp ' ' . 9 ? d 1 = −p (1 − p) dp 1 − (1 − p) 9 ? d 1 = −p dp p # $2 1 =p p 1 = p Hence the expected value of X is the reciprocal of the parameter p. Definition 4.3. If a random variable X whose probability density function is given by & (1 − p)x−1 p if x = 1, 2, 3, 4, ..., ∞ f (x) = 0 otherwise is called a geometric random variable and is denoted by X ∼ GEO(p). Example 4.4. A couple decides to have 3 children. If none of the 3 is a girl, they will try again; and if they still don’t get a girl, they will try once more. If the random variable X denotes the number of children the couple will have following this scheme, then what is the expected value of X? Answer: Since the couple can have 3 or 4 or 5 children, the space of the random variable X is RX = {3, 4, 5}. Moments of Random Variables and Chebychev Inequality 78 The probability density function of X is given by f (3) = P (X = 3) = P (at least one girl) = 1 − P (no girls) = 1 − P (3 boys in 3 tries) 3 = 1 − (P (1 boy in each try)) # $3 1 =1− 2 7 = . 8 f (4) = P (X = 4) = P (3 boys and 1 girl in last try) 3 = (P (1 boy in each try)) P (1 girl in last try) # $3 # $ 1 1 = 2 2 1 = . 16 f (5) = P (X = 5) = P (4 boys and 1 girl in last try) + P (5 boys in 5 tries) = P (1 boy in each try)4 P (1 girl in last try) + P (1 boy in each try)5 # $ 4 # $ # $5 1 1 1 + = 2 2 2 1 = . 
16 Hence, the expected value of the random variable is % E(X) = x f (x) x∈RX = 5 % x f (x) x=3 = 3 f (3) + 4 f (4) + 5 f (5) 1 1 14 +4 +5 =3 16 16 16 42 + 4 + 5 = 16 3 51 =3 . = 16 16 Probability and Mathematical Statistics 79 Remark 4.2. We interpret this physically as meaning that if many couples have children according to this scheme, it is likely that the average family 3 children. size would be near 3 16 Example 4.5. A lot of 8 TV sets includes 3 that are defective. If 4 of the sets are chosen at random for shipment to a hotel, how many defective sets can they expect? Answer: Let X be the random variable representing the number of defective TV sets in a shipment of 4. Then the space of the random variable X is RX = {0, 1, 2, 3}. Then the probability density function of X is given by f (x) = P (X = x) = P (x defective TV sets in a shipment of four) !3" ! 5 " = x " !84−x x = 0, 1, 2, 3. 4 Hence, we have f (0) = f (1) = f (2) = f (3) = !3" ! 5" !8"4 = 0 !3"4!5" !8"3 = 1 !3"4!5" !8"2 = 2 !3"4!5" !8"1 = 3 4 5 70 30 70 30 70 5 . 70 Therefore, the expected value of X is given by % E(X) = x f (x) x∈RX = 3 % x f (x) 0 = f (1) + 2 f (2) + 3 f (3) 30 5 30 = +2 +3 70 70 70 30 + 60 + 15 = 70 105 = 1.5. = 70 Moments of Random Variables and Chebychev Inequality 80 Remark 4.3. Since they cannot possibly get 1.5 defective TV sets, it should be noted that the term “expect” is not used in its colloquial sense. Indeed, it should be interpreted as an average pertaining to repeated shipments made under given conditions. Now we prove a result concerning the expected value operator E. Theorem 4.1. Let X be a random variable with pdf f (x). If a and b are any two real numbers, then E(aX + b) = a E(X) + b. Proof: We will prove only for the continuous case. E(aX + b) = = : ∞ −∞ : ∞ =a (a x + b) f (x) dx : ∞ b f (x) dx a x f (x) dx + −∞ : ∞ −∞ x f (x) dx + b −∞ = aE(X) + b. To prove the discrete case, replace the integral by summation. This completes the proof. 4.3. Variance of Random Variables The spread of the distribution of a random variable X is its variance. Definition 4.4. Let X be a random variable with mean µX . The variance of X, denoted by V ar(X), is defined as ! " V ar(X) = E [ X − µX ]2 . 2 It is also denoted by σX . The positive square root of the variance is called the standard deviation of the random variable X. Like variance, the standard deviation also measures the spread. The following theorem tells us how to compute the variance in an alternative way. 2 Theorem 4.2. If X is a random variable with mean µX and variance σX , then 2 σX = E(X 2 ) − ( µX )2 . Probability and Mathematical Statistics Proof: 81 ! " 2 σX = E [ X − µX ]2 ! " = E X 2 − 2 µX X + µ2X = E(X 2 ) − 2 µX E(X) + ( µX )2 = E(X 2 ) − 2 µX µX + ( µX )2 = E(X 2 ) − ( µX )2 . 2 Theorem 4.3. If X is a random variable with mean µX and variance σX , then V ar(aX + b) = a2 V ar(X), where a and b are arbitrary real constants. Proof: , 2 V ar(a X + b) = E [ (a X + b) − µaX+b ] , 2 = E [ a X + b − E(a X + b) ] , 2 = E [ a X + b − a µX+ − b ] , 2 = E a2 [ X − µX ] , 2 = a2 E [ X − µX ] = a2 V ar(X). Example 4.6. Let X have the density function f (x) = & 2x k2 for 0 ≤ x ≤ k 0 otherwise. For what value of k is the variance of X equal to 2? Answer: The expected value of X is E(X) = : k x f (x) dx 0 = : k x 0 2 = k. 3 2x dx k2 Moments of Random Variables and Chebychev Inequality E(X 2 ) = : 82 k x2 f (x) dx 0 = : k x2 0 2x dx k2 2 = k2 . 4 Hence, the variance is given by V ar(X) = E(X 2 ) − ( µX )2 4 2 = k2 − k2 4 9 1 2 = k . 
18 Since this variance is given to be 2, we get 1 2 k =2 18 and this implies that k = ±6. But k is given to be greater than 0, hence k must be equal to 6. Example 4.7. If the probability density function of the random variable is   1 − |x| for |x| < 1 f (x) =  0 otherwise, then what is the variance of X? Answer: Since V ar(X) = E(X 2 ) − µ2X , we need to find the first and second moments of X. The first moment of X is given by µX = E(X) : ∞ x f (x) dx = −∞ 1 = = = : −1 : 0 −1 : 0 −1 x (1 − |x|) dx x (1 + x) dx + (x + x2 ) dx + 1 1 1 1 = − + − 3 2 2 3 = 0. : : 1 0 1 0 x (1 − x) dx (x − x2 ) dx Probability and Mathematical Statistics 83 The second moment E(X 2 ) of X is given by : ∞ 2 E(X ) = x2 f (x) dx −∞ 1 = = : −1 : 0 x2 (1 − |x|) dx x2 (1 + x) dx + −1 0 = : (x2 + x3 ) dx + : : 1 0 1 (x2 − x3 ) dx 0 −1 1 1 1 1 − + − 3 4 3 4 1 = . 6 Thus, the variance of X is given by x2 (1 − x) dx = V ar(X) = E(X 2 ) − µ2X = 1 1 −0= . 6 6 Example 4.8. Suppose the random variable X has mean µ and variance σ 2 > 0. What are the values of the numbers a and b such that a + bX has mean 0 and variance 1? Answer: The mean of the random variable is 0. Hence 0 = E(a + bX) = a + b E(X) = a + b µ. Thus a = −b µ. Similarly, the variance of a + bX is 1. That is 1 = V ar(a + bX) = b2 V ar(X) = b2 σ 2 . Moments of Random Variables and Chebychev Inequality Hence b= 1 σ or b=− and 1 σ and 84 a=− a= µ σ µ . σ Example 4.9. Suppose X has the density function 9 2 f (x) = 3 x for 0 < x < 1 0 otherwise. What is the expected area of a random isosceles right triangle with hypotenuse X? Answer: Let ABC denote this random isosceles right triangle. Let AC = x. Then x AB = BC = √ 2 1 x x x2 √ √ = 2 2 2 4 The expected area of this random triangle is given by : 1 2 x 3 E(area of random ABC) = 3 x2 dx = . 4 20 0 Area of ABC = The expected area of ABC is 0.15 A x B C Probability and Mathematical Statistics 85 For the next example, we need these following results. For −1 < x < 1, let g(x) = ∞ % k=0 Then g # (x) = ∞ % a xk = a . 1−x a k xk−1 = k=1 and ## g (x) = ∞ % k=2 a , (1 − x)2 a k (k − 1) xk−2 = 2a . (1 − x)3 Example 4.10. If the probability density function of the random variable X is   (1 − p)x−1 p if x = 1, 2, 3, 4, ..., ∞ f (x) =  0 otherwise, then what is the variance of X? Answer: We want to find the variance of X. But variance of X is defined as ! " 2 V ar(X) = E X 2 − [ E(X) ] 2 = E(X(X − 1)) + E(X) − [ E(X) ] . We write the variance in the above manner because E(X 2 ) has no closed form solution. However, one can find the closed form solution of E(X(X − 1)). From Example 4.3, we know that E(X) = p1 . Hence, we now focus on finding the second factorial moment of X, that is E(X(X − 1)). E(X(X − 1)) = = ∞ % x=1 ∞ % x=2 = x (x − 1) (1 − p)x−1 p x (x − 1) (1 − p) (1 − p)x−2 p 2 (1 − p) 2 p (1 − p) = . (1 − (1 − p))3 p2 Hence 2 V ar(X) = E(X(X − 1)) + E(X) − [ E(X) ] = 2 (1 − p) 1 1−p 1 + − 2 = p2 p p p2 Moments of Random Variables and Chebychev Inequality 86 4.4. Chebychev Inequality We have taken it for granted, in section 4.2, that the standard deviation (which is the positive square root of the variance) measures the spread of a distribution of a random variable. The spread is measured by the area between “two values”. The area under the pdf between two values is the probability of X between the two values. If the standard deviation σ measures the spread, then σ should control the area between the “two values”. 
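The next paragraph quotes the familiar percentages for the standard normal distribution; a reader can reproduce them quickly by simulation. The Python sketch below is an added illustration, not part of the original text, and it assumes a standard normal population.

```python
# Added illustration: estimate, by simulation, the probability that a
# standard normal random variable falls within k standard deviations
# of its mean, for k = 1, 2, 3.
import random

random.seed(0)
n = 1_000_000
sample = [random.gauss(0.0, 1.0) for _ in range(n)]

for k in (1, 2, 3):
    inside = sum(1 for x in sample if abs(x) < k)
    print(k, round(inside / n, 3))   # approximately 0.683, 0.954, 0.997
```

The estimates hover around 68%, 95% and 99.7%, in line with the figures quoted below.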
It is well known that if the probability density function is standard normal, that is 1 2 1 −∞ < x < ∞, f (x) = √ e− 2 x , 2π then the mean µ = 0 and the standard deviation σ = 1, and the area between the values µ − σ and µ + σ is 68%. Similarly, the area between the values µ − 2σ and µ + 2σ is 95%. In this way, the standard deviation controls the area between the values µ − kσ and µ + kσ for some k if the distribution is standard normal. If we do not know the probability density function of a random variable, can we find an estimate of the area between the values µ − kσ and µ + kσ for some given k? This problem was solved by Chebychev, a well known Russian mathematician. He proved that the area under f (x) on the interval [µ − kσ, µ + kσ] is at least 1 − k −2 . This is equivalent to saying the probability that a random variable is within k standard deviations of the mean is at least 1 − k −2 . Theorem 4.4 (Chebychev Inequality). Let X be a random variable with probability density function f (x). If µ and σ > 0 are the mean and standard deviation of X, then P (|X − µ| < k σ) ≥ 1 − for any nonzero real positive constant k. 1 k2 Probability and Mathematical Statistics 87 at least 1-k - 2 Mean Mean - k SD Mean + k SD Proof: We assume that the random variable X is continuous. If X is not continuous we replace the integral by summation in the following proof. From the definition of variance, we have the following: σ2 = : ∞ −∞ = : (x − µ)2 f (x) dx µ−k σ (x − µ)2 f (x) dx + −∞ + : ∞ µ+k σ Since, 8 µ+k σ µ−k σ 2 : µ+k σ µ−k σ (x − µ)2 f (x) dx (x − µ)2 f (x) dx. (x − µ)2 f (x) dx is positive, we get from the above σ ≥ : µ−k σ −∞ 2 (x − µ) f (x) dx + If x ∈ (−∞, µ − k σ), then : ∞ µ+k σ (x − µ)2 f (x) dx. x ≤ µ − k σ. Hence kσ ≤ µ−x for k 2 σ 2 ≤ (µ − x)2 . That is (µ − x)2 ≥ k 2 σ 2 . Similarly, if x ∈ (µ + k σ, ∞), then x ≥ µ+kσ (4.1) Moments of Random Variables and Chebychev Inequality 88 Therefore k 2 σ 2 ≤ (µ − x)2 . Thus if x +∈ (µ − k σ, µ + k σ), then (µ − x)2 ≥ k 2 σ 2 . (4.2) Using (4.2) and (4.1), we get 2 2 2 σ ≥k σ Hence 1 ≥ k2 Therefore /: µ−k σ f (x) dx + −∞ /: : ∞ 0 f (x) dx . µ+k σ µ−k σ f (x) dx + −∞ : ∞ 0 f (x) dx . µ+k σ 1 ≥ P (X ≤ µ − k σ) + P (X ≥ µ + k σ). k2 Thus 1 ≥ P (|X − µ| ≥ k σ) k2 which is P (|X − µ| < k σ) ≥ 1 − 1 . k2 This completes the proof of this theorem. The following integration formula : 0 1 xn (1 − x)m dx = n! m! (n + m + 1)! will be used in the next example. In this formula m and n represent any two positive integers. Example 4.11. Let the probability density function of a random variable X be & 630 x4 (1 − x)4 if 0 < x < 1 f (x) = 0 otherwise. What is the exact value of P (|X − µ| ≤ 2 σ)? What is the approximate value of P (|X − µ| ≤ 2 σ) when one uses the Chebychev inequality? Probability and Mathematical Statistics 89 Answer: First, we find the mean and variance of the above distribution. The mean of X is given by E(X) = : 1 x f (x) dx 0 = : 1 630 x5 (1 − x)4 dx 0 5! 4! (5 + 4 + 1)! 5! 4! = 630 10! 2880 = 630 3628800 630 = 1260 1 = . 2 = 630 Similarly, the variance of X can be computed from V ar(X) = : 1 x2 f (x) dx − µ2X 0 = : 1 630 x6 (1 − x)4 dx − 0 1 6! 4! − (6 + 4 + 1)! 4 6! 4! 1 = 630 − 11! 4 1 6 = 630 − 22 4 12 11 = − 44 44 1 . = 44 = 630 Therefore, the standard deviation of X is σ= @ 1 = 0.15. 44 1 4 Moments of Random Variables and Chebychev Inequality Thus 90 P (|X − µ| ≤ 2 σ) = P (|X − 0.5| ≤ 0.3) = P (−0.3 ≤ X − 0.5 ≤ 0.3) = P (0.2 ≤ X ≤ 0.8) = : 0.8 0.2 630 x4 (1 − x)4 dx = 0.96. 
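Before comparing this with Chebychev's bound, the integral can be verified independently. The Python sketch below is an added check, not part of the original text: it integrates the expanded polynomial 630 x^4 (1 − x)^4 exactly with rational arithmetic.

```python
# Added check of Example 4.11: compute P(0.2 <= X <= 0.8) exactly for the
# density f(x) = 630 x^4 (1 - x)^4 on (0, 1), using the antiderivative of
# the binomial expansion of (1 - x)^4.
from fractions import Fraction
from math import comb

def F(x):
    # integral of 630 t^4 (1 - t)^4 from 0 to x, term by term
    x = Fraction(x)
    return 630 * sum(comb(4, k) * (-1) ** k * x ** (5 + k) / (5 + k)
                     for k in range(5))

prob = F(Fraction(4, 5)) - F(Fraction(1, 5))
print(prob, float(prob))   # 375327/390625, approximately 0.9608
```

The exact value is 375327/390625, approximately 0.9608, in agreement with the rounded value 0.96 obtained above.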
If we use the Chebychev inequality, then we get an approximation of the exact value we have. This approximate value is P (|X − µ| ≤ 2 σ) ≥ 1 − 1 = 0.75 4 Hence, Chebychev inequality tells us that if we do not know the distribution of X, then P (|X − µ| ≤ 2 σ) is at least 0.75. Lower the standard deviation, and the smaller is the spread of the distribution. If the standard deviation is zero, then the distribution has no spread. This means that the distribution is concentrated at a single point. In the literature, such distributions are called degenerate distributions. The above figure shows how the spread decreases with the decrease of the standard deviation. 4.5. Moment Generating Functions We have seen in Section 3 that there are some distributions, such as geometric, whose moments are difficult to compute from the definition. A Probability and Mathematical Statistics 91 moment generating function is a real valued function from which one can generate all the moments of a given random variable. In many cases, it is easier to compute various moments of X using the moment generating function. Definition 4.5. Let X be a random variable whose probability density function is f (x). A real valued function M : R I →R I defined by ! " M (t) = E et X is called the moment generating function of X if this expected value exists for all t in the interval −h < t < h for some h > 0. In general, not every random variable has a moment generating function. But if the moment generating function of a random variable exists, then it is unique. At the end of this section, we will give an example of a random variable which does not have a moment generating function. Using the definition of expected value of a random variable, we obtain the explicit representation for M (t) as M (t) =  %  et x f (x)  x∈RX   8 ∞ et x −∞ f (x) dx if X is discrete if X is continuous. Example 4.12. Let X be a random variable whose moment generating function is M (t) and n be any natural number. What is the nth derivative of M (t) at t = 0? Answer: Similarly, " d ! d M (t) = E et X dt dt# $ d tX =E e dt ! " = E X et X . d2 d2 ! t X " M (t) = E e 2 dt2 dt# $ d2 t X e =E dt2 ! 2 t X" =E X e . Moments of Random Variables and Chebychev Inequality 92 Hence, in general we get " dn dn ! M (t) = n E et X n dt dt# $ dn t X e =E dtn ! n t X" =E X e . If we set t = 0 in the nth derivative, we get > > ! "> dn = E X n et X >t=0 = E (X n ) . M (t)>> n dt t=0 Hence the nth derivative of the moment generating function of X evaluated at t = 0 is the nth moment of X about the origin. This example tells us if we know the moment generating function of a random variable; then we can generate all the moments of X by taking derivatives of the moment generating function and then evaluating them at zero. Example 4.13. What is the moment generating function of the random variable X whose probability density function is given by f (x) = & e−x for x > 0 0 otherwise? What are the mean and variance of X? Answer: The moment generating function of X is ! " M (t) = E et X : ∞ et x f (x) dx = 0 : ∞ et x e−x dx = :0 ∞ e−(1−t) x dx = 0 1 A −(1−t) x B∞ −e 1−t 0 1 if 1 − t > 0. = 1−t = Probability and Mathematical Statistics 93 The expected value of X can be computed from M (t) as > > d M (t)>> E(X) = dt t=0 > > d −1 > = (1 − t) > dt > t=0 −2 > = (1 − t) t=0 > > 1 > = 2 (1 − t) > t=0 = 1. Similarly E(X 2 ) = > > d2 > M (t) > 2 dt t=0 > > d2 −1 > = 2 (1 − t) > dt > t=0 = 2 (1 − t)−3 >t=0 > > 2 > = 3 (1 − t) >t=0 = 2. 
Therefore, the variance of X is V ar(X) = E(X 2 ) − (µ)2 = 2 − 1 = 1. Example 4.14. Let X have the probability density function & 1 ! 8 "x for x = 0, 1, 2, ..., ∞ 9 9 f (x) = 0 otherwise. What is the moment generating function of the random variable X? Moments of Random Variables and Chebychev Inequality Answer: 94 ! " M (t) = E et X ∞ % = et x f (x) x=0 ∞ % # $ # $x 8 1 = e 9 9 x=0 # $% $x ∞ # 1 t 8 = e 9 x=0 9 # $ 1 1 = 9 1 − et 98 = tx 1 9 − 8 et if 8 <1 9 # $ 9 t < ln . 8 if et Example 4.15. Let X be a continuous random variable with density function 9 b e−b x for x > 0 f (x) = 0 otherwise , where b > 0. If M (t) is the moment generating function of X, then what is M (−6 b)? Answer: ! " M (t) = E et X : ∞ b et x e−b x dx = 0 : ∞ e−(b−t) x dx =b 0 b A −(b−t) x B∞ = −e b−t 0 b = if b − t > 0. b−t Hence M (−6 b) = b 7b = 71 . Example 4.16. Let the random variable X have moment generating func−2 tion M (t) = (1 − t) for t < 1. What is the third moment of X about the origin? Answer: To compute the third moment E(X 3 ) of X about the origin, we Probability and Mathematical Statistics 95 need to compute the third derivative of M (t) at t = 0. −2 M (t) = (1 − t) −3 M # (t) = 2 (1 − t) −4 M ## (t) = 6 (1 − t) −5 M ### (t) = 24 (1 − t) . Thus the third moment of X is given by ! " E X3 = 24 = 24. (1 − 0)5 Theorem 4.5. Let M (t) be the moment generating function of the random variable X. If M (t) = a0 + a1 t + a2 t2 + · · · + an tn + · · · (4.3) is the Taylor series expansion of M (t), then E (X n ) = (n!) an for all natural number n. Proof: Let M (t) be the moment generating function of the random variable X. The Taylor series expansion of M (t) about 0 is given by M (t) = M (0) + M # (0) M ## (0) 2 M ### (0) 3 M (n) (0) n t+ t + t + ··· + t + ··· 1! 2! 3! n! Since E(X n ) = M (n) (0) for n ≥ 1 and M (0) = 1, we have M (t) = 1 + E(X) E(X 2 ) 2 E(X 3 ) 3 E(X n ) n t+ t + t + ··· + t + · · · (4.4) 1! 2! 3! n! From (4.3) and (4.4), equating the coefficients of the like powers of t, we obtain E (X n ) an = n! which is E (X n ) = (n!) an . This proves the theorem. Moments of Random Variables and Chebychev Inequality 96 Example 4.17. What is the 479th moment of X about the origin, if the 1 moment generating function of X is 1+t ? 1 can be obtained by using Answer The Taylor series expansion of M (t) = 1+t long division (a technique we have learned in high school). 1 1+t 1 = 1 − (−t) M (t) = = 1 + (−t) + (−t)2 + (−t)3 + · · · + (−t)n + · · · = 1 − t + t2 − t3 + t4 + · · · + (−1)n tn + · · · Therefore an = (−1)n and from this we obtain a479 = −1. By Theorem 4.5, ! " E X 479 = (479!) a479 = − 479! Example 4.18. If the moment generating of a random variable X is M (t) = ∞ % e(t j−1) j=0 j! , then what is the probability of the event X = 2? Answer: By examining the given moment generating function of X, it is easy to note that X is a discrete random variable with space RX = {0, 1, 2, · · · , ∞}. Hence by definition, the moment generating function of X is ∞ % M (t) = et j f (j). (4.5) j=0 But we are given that M (t) = = ∞ % e(t j−1) j=0 ∞ % j=0 j! e−1 t j e . j! From (4.5) and the above, equating the coefficients of etj , we get f (j) = e−1 j! for j = 0, 1, 2, ..., ∞. Probability and Mathematical Statistics 97 Thus, the probability of the event X = 2 is given by P (X = 2) = f (2) = 1 e−1 = . 2! 2e Example 4.19. Let X be a random variable with E (X n ) = 0.8 for n = 1, 2, 3, ..., ∞. What are the moment generating function and probability density function of X? Answer: $ tn n! 
n=1 $ # ∞ % tn n E (X ) = M (0) + n! n=1 # $ ∞ % tn = 1 + 0.8 n! n=1 ∞ # n$ % t = 0.2 + 0.8 + 0.8 n! n=1 # $ ∞ % tn = 0.2 + 0.8 n! n=0 M (t) = M (0) + ∞ % M (n) (0) # = 0.2 e0 t + 0.8 e1 t . Therefore, we get f (0) = P (X = 0) = 0.2 and f (1) = P (X = 1) = 0.8. Hence the moment generating function of X is M (t) = 0.2 + 0.8 et , and the probability density function of X is f (x) = & |x − 0.2| for x = 0, 1 0 otherwise. Example 4.20. If the moment generating function of a random variable X is given by M (t) = 5 t 4 2t 3 3t 2 4t 1 5t e + e + e + e + e , 15 15 15 15 15 Moments of Random Variables and Chebychev Inequality 98 then what is the probability density function of X? What is the space of the random variable X? Answer: The moment generating function of X is given to be M (t) = 5 t 4 2t 3 3t 2 4t 1 5t e + e + e + e + e . 15 15 15 15 15 This suggests that X is a discrete random variable. Since X is a discrete random variable, by definition of the moment generating function, we see that M (t) = % et x f (x) x∈RX t x1 =e f (x1 ) + et x2 f (x2 ) + et x3 f (x3 ) + et x4 f (x4 ) + et x5 f (x5 ). Hence we have f (x1 ) = f (1) = f (x2 ) = f (2) = f (x3 ) = f (3) = f (x4 ) = f (4) = f (x5 ) = f (5) = 5 15 4 15 3 15 2 15 1 . 15 Therefore the probability density function of X is given by f (x) = 6−x 15 for x = 1, 2, 3, 4, 5 and the space of the random variable X is RX = {1, 2, 3, 4, 5}. Example 4.21. If the probability density function of a discrete random variable X is f (x) = 6 , x2 π2 for x = 1, 2, 3, ..., ∞, then what is the moment generating function of X? Probability and Mathematical Statistics 99 Answer: If the moment generating function of X exists, then M (t) = ∞ % etx f (x) x=1 ) √ +2 6 e = π x x=1 ∞ # tx $ % e 6 = π 2 x2 x=1 ∞ % = tx ∞ 6 % etx . π 2 x=1 x2 Now we show that the above infinite series diverges if t belongs to the interval (−h, h) for any h > 0. To prove that this series is divergent, we do the ratio test, that is $ # t (n+1) 2 $ # e n an+1 = lim lim 2 n→∞ n→∞ an (n + 1) et n # tn t $ e e n2 = lim n→∞ (n + 1)2 et n ) # $2 + n t = lim e n→∞ n+1 = et . For any h > 0, since et is not always less than 1 for all t in the interval (−h, h), we conclude that the above infinite series diverges and hence for this random variable X the moment generating function does not exist. Notice that for the above random variable, E [X n ] does not exist for any natural number n. Hence the discrete random variable X in Example 4.21 has no moments. Similarly, the continuous random variable X whose Moments of Random Variables and Chebychev Inequality 100 probability density function is f (x) =   x12  0 for 1 ≤ x < ∞ otherwise, has no moment generating function and no moments. In the following theorem we summarize some important properties of the moment generating function of a random variable. Theorem 4.6. Let X be a random variable with the moment generating function MX (t). If a and b are any two real constants, then MX+a (t) = ea t MX (t) (4.6) Mb X (t) = MX (b t) # $ a t t b . M X+a (t) = e MX b b (4.7) Proof: First, we prove (4.6). , MX+a (t) = E et (X+a) ! " = E et X+t a ! " = E et X et a ! " = et a E et X = et a MX (t). Similarly, we prove (4.7). , Mb X (t) = E et (b X) , = E e(t b) X = MX (t b). By using (4.6) and (4.7), we easily get (4.8). M X+a (t) = M X + a (t) b b =e a b t b M X (t) b # $ a t t b = e MX . b (4.8) Probability and Mathematical Statistics 101 This completes the proof of this theorem. Definition 4.6. 
The nth factorial moment of a random variable X is E(X(X − 1)(X − 2) · · · (X − n + 1)). Definition 4.7. The factorial moment generating function (FMGF) of X is denoted by G(t) and defined as ! " G(t) = E tX . It is not difficult to establish a relationship between the moment generating function (MGF) and the factorial moment generating function (FMGF). The relationship between them is the following: , ! " ! X G(t) = E tX = E eln t = E eX ln t " = M (ln t). Thus, if we know the MGF of a random variable, we can determine its FMGF and conversely. Definition 4.8. Let X be a random variable. The characteristic function φ(t) of X is defined as ! " φ(t) = E ei t X = E ( cos(tX) + i sin(tX) ) = E ( cos(tX) ) + i E ( sin(tX) ) . The probability density function can be recovered from the characteristic function by using the following formula 1 f (x) = 2π : ∞ e−i t x φ(t) dt. −∞ Unlike the moment generating function, the characteristic function of a random variable always exists. For example, the Cauchy random variable X 1 with probability density f (x) = π(1+x 2 ) has no moment generating function. However, the characteristic function is ! " φ(t) = E ei t X : ∞ eitx = dx 2 −∞ π(1 + x ) = e−|t| . Moments of Random Variables and Chebychev Inequality 102 To evaluate the above integral one needs the theory of residues from the complex analysis. The characteristic function φ(t) satisfies the same set of properties as the moment generating functions as given in Theorem 4.6. The following integrals : ∞ xm e−x dx = m! if m is a positive integer 0 and : ∞ √ −x xe dx = √ 0 π 2 are needed for some problems in the Review Exercises of this chapter. These formulas will be discussed in Chapter 6 while we describe the properties and usefulness of the gamma distribution. We end this chapter with the following comment about the Taylor’s series. Taylor’s series was discovered to mimic the decimal expansion of real numbers. For example 125 = 1 (10)2 + 2 (10)1 + 5 (10)0 is an expansion of the number 125 with respect to base 10. Similarly, 125 = 1 (9)2 + 4 (9)1 + 8 (9)0 is an expansion of the number 125 in base 9 and it is 148. Since given a function f : R I →R I and x ∈ R, I f (x) is a real number and it can be expanded with respect to the base x. The expansion of f (x) with respect to base x will have a form f (x) = a0 x0 + a1 x1 + a2 x2 + a3 x3 + · · · which is f (x) = ∞ % ak xk . k=0 If we know the coefficients ak for k = 0, 1, 2, 3, ..., then we will have the expansion of f (x) in base x. Taylor found the remarkable fact that the the coefficients ak can be computed if f (x) is sufficiently differentiable. He proved that for k = 1, 2, 3, ... ak = f (k) (0) k! with f (0) = f (0). Probability and Mathematical Statistics 103 4.6. Review Exercises 1. In a state lottery a five-digit integer is selected at random. If a player bets 1 dollar on a particular number, the payoff (if that number is selected) is $500 minus the $1 paid for the ticket. Let X equal the payoff to the better. Find the expected value of X. 2. A discrete random variable X has probability density function of the form & c (8 − x) for x = 0, 1, 2, 3, 4, 5 f (x) = 0 otherwise. (a) Find the constant c. (b) Find P (X > 2). (c) Find the expected value E(X) for the random variable X. 3. A random variable X has a cumulative distribution function 1 if 0 < x ≤ 1  2x F (x) =  x − 21 if 1 < x ≤ 32 . (a) Graph F (x). (b) Graph f (x). (c) Find P (X ≤ 0.5). (d) Find P (X ≥ 0.5). (e) Find P (X ≤ 1.25). (f) Find P (X = 1.25). 4. 
Let X be a random variable with probability density function &1 for x = 1, 2, 5 8x f (x) = 0 otherwise. (a) Find the expected value of X. (b) Find the variance of X. (c) Find the expected value of 2X + 3. (d) Find the variance of 2X + 3. (e) Find the expected value of 3X − 5X 2 + 1. 5. The measured radius of a circle, R, has probability density function & 6 r (1 − r) if 0 < r < 1 f (r) = 0 otherwise. (a) Find the expected value of the radius. (b) Find the expected circumference. (c) Find the expected area. 6. Let X be a continuous random variable with density function  3  θ x + 32 θ 2 x2 for 0 < x < √1 θ f (x) =  0 otherwise, Moments of Random Variables and Chebychev Inequality 104 where θ > 0. What is the expected value of X? 7. Suppose X is a random variable µ and variance σ 2 > 0. For ,; with mean < 1 2 minimized? what value of a, where a > 0 is E a X − a 8. A rectangle is to be constructed having dimension X by 2X, where X is a random variable with probability density function f (x) = &1 2 for 0 < x < 2 0 otherwise. What is the expected area of the rectangle? 9. A box is to be constructed so that the height is 10 inches and its base is X inches by X inches. If X has a uniform distribution over the interval [2, 8], then what is the expected volume of the box in cubic inches? 10. If X is a random variable with density function   1.4 e−2x + 0.9 e−3x for x > 0 f (x) =  0 elsewhere, then what is the expected value of X? 11. A fair coin is tossed. If a head occurs, 1 die is rolled; if a tail occurs, 2 dice are rolled. Let X be the total on the die or dice. What is the expected value of X? 12. If velocities of the molecules of a gas have the probability density (Maxwell’s law)  2 2  a v 2 e−h v for v ≥ 0 f (v) =  0 otherwise, then what are the expectation and the variance of the velocity of the molecules and also the magnitude of a for some given h? 13. A couple decides to have children until they get a girl, but they agree to stop with a maximum of 3 children even if they haven’t gotten a girl. If X and Y denote the number of children and number of girls, respectively, then what are E(X) and E(Y )? 14. In roulette, a wheel stops with equal probability at any of the 38 numbers 0, 00, 1, 2, ..., 36. If you bet $1 on a number, then you win $36 (net gain is Probability and Mathematical Statistics 105 $35) if the number comes up; otherwise, you lose your dollar. What are your expected winnings? 15. If the moment generating function for the random variable X is MX (t) = 1 1+t , what is the third moment of X about the point x = 2? 16. If the mean and the variance of a certain distribution are 2 and 8, what are the first three terms in the series expansion of the moment generating function? 17. Let X be a random variable with density function   a e−ax for x > 0 f (x) =  0 otherwise, where a > 0. If M (t) denotes the moment generating function of X, what is M (−3a)? 18. Suppose the random variable X has moment generating M (t) = 1 , (1 − βt)k for t < 1 . β What is the nth moment of X? 19. Two balls are dropped in such a way that each ball is equally likely to fall into any one of four holes. Both balls may fall into the same hole. Let X denote the number of unoccupied holes at the end of the experiment. What is the moment generating function of X? 20. If the moment generating function of X is M (t) = what is the fourth moment of X? 1 (1−t)2 for t < 1, then 21. Let the random variable X have the moment generating function M (t) = e3t , 1 − t2 −1 < t < 1. 
What are the mean and the variance of X, respectively?

22. Let the random variable X have the moment generating function M(t) = e^{3t + t²}. What is the second moment of X about x = 0?

23. Suppose the random variable X has the cumulative density function F(x). Show that the expected value of the random variable (X − c)² is minimum if c equals the expected value of X.

24. Suppose the continuous random variable X has the cumulative density function F(x). Show that the expected value of the random variable |X − c| is minimum if c equals the median of X (that is, F(c) = 0.5).

25. Let the random variable X have the probability density function
f(x) = (1/2) e^{−|x|},   −∞ < x < ∞.
What are the expected value and the variance of X?

26. If M_X(t) = k (2 + 3 e^t)^4, what is the value of k?

27. Given the moment generating function of X as
M(t) = 1 + t + 4t² + 10t³ + 14t⁴ + · · ·
what is the third moment of X about its mean?

28. A set of measurements X has a mean of 7 and standard deviation of 0.2. For simplicity, a linear transformation Y = aX + b is to be applied to make the mean and variance both equal to 1. What are the values of the constants a and b?

29. A fair coin is to be tossed 3 times. The player receives 10 dollars if all three turn up heads and pays 3 dollars if there is one or no heads. No gain or loss is incurred otherwise. If Y is the gain of the player, what is the expected value of Y?

30. If X has the probability density function
f(x) = e^{−x} for x > 0, and 0 otherwise,
then what is the expected value of the random variable Y = e^{(3/4)X} + 6?

31. If the probability density function of the random variable X is
f(x) = (1 − p)^{x−1} p if x = 1, 2, 3, ..., ∞, and 0 otherwise,
then what is the expected value of the random variable X^{−1}?

Chapter 5
SOME SPECIAL DISCRETE DISTRIBUTIONS

Given a random experiment, we can find the set of all possible outcomes, which is known as the sample space. Objects in a sample space may not be numbers. Thus, we use the notion of a random variable to quantify the qualitative elements of the sample space. A random variable is characterized by either its probability density function or its cumulative distribution function. The other characteristics of a random variable are its mean, variance and moment generating function. In this chapter, we explore some frequently encountered discrete distributions and study their important characteristics.

5.1. Bernoulli Distribution

A Bernoulli trial is a random experiment in which there are precisely two possible outcomes, which we conveniently call 'failure' (F) and 'success' (S). We can define a random variable from the sample space {S, F} into the set of real numbers as follows:
X(F) = 0,   X(S) = 1.

The probability density function of this random variable is
f(0) = P(X = 0) = 1 − p
f(1) = P(X = 1) = p,
where p denotes the probability of success. Hence
f(x) = p^x (1 − p)^{1−x},   x = 0, 1.

Definition 5.1. The random variable X is called the Bernoulli random variable if its probability density function is of the form
f(x) = p^x (1 − p)^{1−x},   x = 0, 1,
where p is the probability of success. We denote the Bernoulli random variable by writing X ∼ BER(p).

Example 5.1. What is the probability of getting a score of not less than 5 in a throw of a six-sided die?

Answer: Although there are six possible scores {1, 2, 3, 4, 5, 6}, we are grouping them into two sets, namely {1, 2, 3, 4} and {5, 6}. Any score in {1, 2, 3, 4} is a failure and any score in {5, 6} is a success. Thus, this is a Bernoulli trial with
P(X = 0) = P(failure) = 4/6   and   P(X = 1) = P(success) = 2/6.
Hence, the probability of getting a score of not less than 5 in a throw of a six-sided die is 2/6.

Theorem 5.1. If X is a Bernoulli random variable with parameter p, then the mean, variance and moment generating function are respectively given by
μ_X = p
σ_X² = p (1 − p)
M_X(t) = (1 − p) + p e^t.

Proof: The mean of the Bernoulli random variable is
μ_X = ∑_{x=0}^{1} x f(x) = ∑_{x=0}^{1} x p^x (1 − p)^{1−x} = p.
Similarly, the variance of X is given by
σ_X² = ∑_{x=0}^{1} (x − μ_X)² f(x)
     = ∑_{x=0}^{1} (x − p)² p^x (1 − p)^{1−x}
     = p² (1 − p) + p (1 − p)²
     = p (1 − p) [p + (1 − p)]
     = p (1 − p).
Next, we find the moment generating function of the Bernoulli random variable
M(t) = E(e^{tX}) = ∑_{x=0}^{1} e^{tx} p^x (1 − p)^{1−x} = (1 − p) + p e^t.
This completes the proof.

The moment generating function of X and all the moments of X are shown below for p = 0.5. Note that for the Bernoulli distribution all its moments about zero are the same and equal to p.

5.2. Binomial Distribution

Consider a fixed number n of mutually independent Bernoulli trials. Suppose these trials have the same probability of success, say p. A random variable X is called a binomial random variable if it represents the total number of successes in n independent Bernoulli trials.

Now we determine the probability density function of a binomial random variable. Recall that the probability density function of X is defined as f(x) = P(X = x). Thus, to find the probability density function of X we have to find the probability of x successes in n independent trials. If we have x successes in n trials, then the probability of each n-tuple with x successes and n − x failures is
p^x (1 − p)^{n−x}.
However, there are \binom{n}{x} tuples with x successes and n − x failures in n trials. Hence
P(X = x) = \binom{n}{x} p^x (1 − p)^{n−x}.
Therefore, the probability density function of X is
f(x) = \binom{n}{x} p^x (1 − p)^{n−x},   x = 0, 1, ..., n.

Definition 5.2. The random variable X is called the binomial random variable with parameters p and n if its probability density function is of the form
f(x) = \binom{n}{x} p^x (1 − p)^{n−x},   x = 0, 1, ..., n,
where 0 < p < 1 is the probability of success. We will denote a binomial random variable with parameters p and n as X ∼ BIN(n, p).

Example 5.2. Is the real valued function f(x) given by
f(x) = \binom{n}{x} p^x (1 − p)^{n−x},   x = 0, 1, ..., n,
where n and p are parameters, a probability density function?

Answer: To answer this question, we have to check that f(x) is nonnegative and that ∑_{x=0}^{n} f(x) is 1. It is easy to see that f(x) ≥ 0. We show that the sum is one.
∑_{x=0}^{n} f(x) = ∑_{x=0}^{n} \binom{n}{x} p^x (1 − p)^{n−x} = (p + 1 − p)^n = 1.
Hence f(x) is really a probability density function.

Example 5.3. On a five-question multiple-choice test there are five possible answers, of which one is correct. If a student guesses randomly and independently, what is the probability that she is correct only on questions 1 and 4?

Answer: Here the probability of success is p = 1/5, and thus 1 − p = 4/5. Therefore, the probability that she is correct on questions 1 and 4 (and wrong on the other three questions) is
P(correct on questions 1 and 4) = p² (1 − p)³ = (1/5)² (4/5)³ = 64/5⁵ = 0.02048.
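Binomial calculations of this kind are easy to reproduce by machine. The following is a minimal Python sketch (standard library only; the helper name binom_pmf is ours, not the book's) that evaluates the probability density function of Definition 5.2 and re-checks Example 5.2 and Example 5.3.

    from math import comb

    def binom_pmf(x, n, p):
        """P(X = x) for X ~ BIN(n, p)."""
        return comb(n, x) * p**x * (1 - p)**(n - x)

    p, n = 1/5, 5

    # Example 5.3: correct exactly on questions 1 and 4 (one specific pattern,
    # so no binomial coefficient is involved).
    print(p**2 * (1 - p)**3)                               # 0.02048

    # Example 5.2: the binomial pmf sums to one over x = 0, 1, ..., n.
    print(sum(binom_pmf(x, n, p) for x in range(n + 1)))   # 1.0 (up to rounding)

Running the sketch prints 0.02048 for the Example 5.3 pattern and a total of 1.0 (up to floating-point rounding) for the pmf, as the binomial theorem guarantees.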
Some Special Discrete Distributions 112 Example 5.4. On a five-question multiple-choice test there are five possible answers, of which one is correct. If a student guesses randomly and independently, what is the probability that she is correct only on two questions? Answer: Here the probability of success is p = 51 , and thus 1 − p = 54 . There !" are 52 different ways she can be correct on two questions. Therefore, the probability that she is correct on two questions is # $ 5 2 p (1 − p)3 P (correct on two questions) = 2 # $ 2 # $3 1 4 = 10 5 5 = 640 = 0.2048. 55 Example 5.5. What is the probability of rolling two sixes and three nonsixes in 5 independent casts of a fair die? Answer: Let the random variable X denote the number of sixes in 5 independent casts of a fair die. Then X is a binomial random variable with probability of success p and n = 5. The probability of getting a six is p = 16 . Hence # $ # $2 # $3 5 1 5 P (X = 2) = f (2) = 2 6 6 = 10 = # 1 36 $# 125 216 $ 1250 = 0.160751. 7776 Example 5.6. What is the probability of rolling at most two sixes in 5 independent casts of a fair die? Answer: Let the random variable X denote number of sixes in 5 independent casts of a fair die. Then X is a binomial random variable with probability of success p and n = 5. The probability of getting a six is p = 16 . Hence, the Probability and Mathematical Statistics 113 probability of rolling at most two sixes is P (X ≤ 2) = F (2) = f (0) + f (1) + f (2) # $ # $ 0 # $ 5 # $ # $1 # $ 4 # $ # $ 2 # $ 3 5 5 5 5 5 1 1 5 1 + + = 6 6 6 6 6 6 1 2 0 $ # $ # $ # 2 k 5−k % 5 5 1 = 6 6 k k=0 = 1 (0.9421 + 0.9734) = 0.9577 2 (from binomial table) Theorem 5.2. If X is binomial random variable with parameters p and n, then the mean, variance and moment generating functions are respectively given by µX = n p 2 σX = n p (1 − p) ; <n MX (t) = (1 − p) + p et . Proof: First, we determine the moment generating function M (t) of X and then we generate mean and variance from M (t). ! " M (t) = E etX # $ n % tx n px (1 − p)n−x e = x x=0 n # $ % n ! t "x = pe (1 − p)n−x x x=0 ! "n = p et + 1 − p . Hence ! "n−1 M # (t) = n p et + 1 − p p et . Some Special Discrete Distributions 114 Therefore µX = M # (0) = n p. Similarly ! "n−1 ! "n−2 ! t "2 M ## (t) = n p et + 1 − p p et + n (n − 1) p et + 1 − p pe . Therefore E(X 2 ) = M ## (0) = n (n − 1) p2 + n p. Hence V ar(X) = E(X 2 ) − µ2X = n (n − 1) p2 + n p − n2 p2 = n p (1 − p). This completes the proof. Example 5.7. Suppose that 2000 points are selected independently and at random from the unit squares S = {(x, y) | 0 ≤ x, y ≤ 1}. Let X equal the number of points that fall in A = {(x, y) | x2 +y 2 < 1}. How is X distributed? What are the mean, variance and standard deviation of X? Answer: If a point falls in A, then it is a success. If a point falls in the complement of A, then it is a failure. The probability of success is p= area of A 1 = π. area of S 4 Since, the random variable represents the number of successes in 2000 independent trials, the random variable X is a binomial with parameters p = π4 and n = 2000, that is X ∼ BIN (2000, π4 ). Probability and Mathematical Statistics 115 Hence by Theorem 5.2, µX = 2000 π = 1570.8, 4 and , π- π 2 σX = 337.1. = 2000 1 − 4 4 The standard deviation of X is σX = √ 337.1 = 18.36. Example 5.8. Let the probability that the birth weight (in grams) of babies in America is less than 2547 grams be 0.1. 
If X equals the number of babies that weigh less than 2547 grams at birth among 20 of these babies selected at random, then what is P (X ≤ 3)? Answer: If a baby weighs less than 2547, then it is a success; otherwise it is a failure. Thus X is a binomial random variable with probability of success p and n = 20. We are given that p = 0.1. Hence $k # $20−k 3 # $ # % 20 1 9 P (X ≤ 3) = k 10 10 k=0 = 0.867 (from table). Example 5.9. Let X1 , X2 , X3 be three independent Bernoulli random variables with the same probability of success p. What is the probability density function of the random variable X = X1 + X2 + X3 ? Answer: The sample space of the three independent Bernoulli trials is S = {F F F, F F S, F SF, SF F, F SS, SF S, SSF, SSS}. Some Special Discrete Distributions 116 The random variable X = X1 + X2 + X3 represents the number of successes in each element of S. The following diagram illustrates this. Sum of three Bernoulli Trials S X FFF FFS FSF SFF FSS SFS SSF SSS 0 1 R x 2 3 Let p be the probability of success. Then f (0) = P (X = 0) = P (F F F ) = (1 − p)3 f (1) = P (X = 1) = P (F F S) + P (F SF ) + P (SF F ) = 3 p (1 − p)2 f (2) = P (X = 2) = P (F SS) + P (SF S) + P (SSF ) = 3 p2 (1 − p) f (3) = P (X = 3) = P (SSS) = p3 . Hence f (x) = # $ 3 x p (1 − p)3−x , x x = 0, 1, 2, 3. Thus X ∼ BIN (3, p). In general, if Xi ∼ BER(p), then E ) n % 1n Xi i=1 and V ar ) n % i=1 Xi i=1 + + Xi ∼ BIN (n, p) and hence = np = n p (1 − p). The binomial distribution can arise whenever we select a random sample of n units with replacement. Each unit in the population is classified into one of two categories according to whether it does or does not possess a certain property. For example, the unit may be a person and the property may be Probability and Mathematical Statistics 117 whether he intends to vote “yes”. If the unit is a machine part, the property may be whether the part is defective and so on. If the proportion of units in the population possessing the property of interest is p, and if Z denotes the number of units in the sample of size n that possess the given property, then Z ∼ BIN (n, p). 5.3. Geometric Distribution If X represents the total number of successes in n independent Bernoulli trials, then the random variable X ∼ BIN (n, p), where p is the probability of success of a single Bernoulli trial and the probability density function of X is given by # $ n x f (x) = p (1 − p)n−x , x = 0, 1, ..., n. x Let X denote the trial number on which the first success occurs. Geometric Random Variable FFFFFFS FFFFFS FFFFS FFFS FFS FS S Sample Space X 1 2 3 4 5 6 7 Space of the random variable Hence the probability that the first success occurs on xth trial is given by f (x) = P (X = x) = (1 − p)x−1 p. Hence, the probability density function of X is f (x) = (1 − p)x−1 p x = 1, 2, 3, ..., ∞, where p denotes the probability of success in a single Bernoulli trial. Some Special Discrete Distributions 118 Definition 5.3. A random variable X has a geometric distribution if its probability density function is given by f (x) = (1 − p)x−1 p x = 1, 2, 3, ..., ∞, where p denotes the probability of success in a single Bernoulli trial. If X has a geometric distribution we denote it as X ∼ GEO(p). Example 5.10. Is the real valued function f (x) defined by f (x) = (1 − p)x−1 p x = 1, 2, 3, ..., ∞ where 0 < p < 1 is a parameter, a probability density function? Answer: It is easy to check that f (x) ≥ 0. Thus we only show that the sum is one. 
∞ ∞ % % (1 − p)x−1 p f (x) = x=1 x=1 ∞ % =p y=0 =p (1 − p)y , where y = x − 1 1 = 1. 1 − (1 − p) Hence f (x) is a probability density function. Example 5.11. The probability that a machine produces a defective item is 0.02. Each item is checked as it is produced. Assuming that these are independent trials, what is the probability that at least 100 items must be checked to find one that is defective? Probability and Mathematical Statistics 119 Answer: Let X denote the trial number on which the first defective item is observed. We want to find P (X ≥ 100) = = ∞ % x=100 ∞ % x=100 f (x) (1 − p)x−1 p = (1 − p)99 99 = (1 − p) ∞ % y=0 (1 − p)y p = (0.98)99 = 0.1353. Hence the probability that at least 100 items must be checked to find one that is defective is 0.1353. Example 5.12. A gambler plays roulette at Monte Carlo and continues gambling, wagering the same amount each time on “Red”, until he wins for the first time. If the probability of “Red” is 18 38 and the gambler has only enough money for 5 trials, then (a) what is the probability that he will win before he exhausts his funds; (b) what is the probability that he wins on the second trial? Answer: 18 . 38 (a) Hence the probability that he will win before he exhausts his funds is given by P (X ≤ 5) = 1 − P (X ≥ 6) p = P (Red) = = 1 − (1 − p)5 $5 # 18 =1− 1− 38 = 1 − (0.5263)5 = 1 − 0.044 = 0.956. (b) Similarly, the probability that he wins on the second trial is given by P (X = 2) = f (2) = (1 − p)2−1 p # $# $ 18 18 = 1− 38 38 360 = = 0.2493. 1444 Some Special Discrete Distributions 120 The following theorem provides us with the mean, variance and moment generating function of a random variable with the geometric distribution. Theorem 5.3. If X is a geometric random variable with parameter p, then the mean, variance and moment generating functions are respectively given by 1 p 1−p 2 σX = p2 p et , MX (t) = 1 − (1 − p) et µX = if t < −ln(1 − p). Proof: First, we compute the moment generating function of X and then we generate all the mean and variance of X from it. M (t) = ∞ % etx (1 − p)x−1 p x=1 ∞ % =p et(y+1) (1 − p)y , y=0 ∞ % ! t t = pe y=0 = where y = x − 1 "y e (1 − p) p et , 1 − (1 − p) et if t < −ln(1 − p). Probability and Mathematical Statistics 121 Differentiating M (t) with respect to t, we obtain M # (t) = = = (1 − (1 − p) et ) p et + p et (1 − p) et 2 [1 − (1 − p)et ] p et [1 − (1 − p) et + (1 − p) et ] 2 [1 − (1 − p)et ] p et 2. [1 − (1 − p)et ] Hence 1 . p µX = E(X) = M # (0) = Similarly, the second derivative of M (t) can be obtained from the first derivative as 2 M ## (t) = [1 − (1 − p) et ] p et + p et 2 [1 − (1 − p) et ] (1 − p) et 4 [1 − (1 − p)et ] Hence M ## (0) = . 2−p p3 + 2 p2 (1 − p) = . p4 p2 Therefore, the variance of X is 2 σX = M ## (0) − ( M # (0) )2 2−p 1 = − 2 p2 p 1−p = . p2 This completes the proof of the theorem. Theorem 5.4. The random variable X is geometric if and only if it satisfies the memoryless property, that is P (X > m + n / X > n) = P (X > m) for all natural numbers n and m. Proof: It is easy to check that the geometric distribution satisfies the lack of memory property P (X > m + n / X > n) = P (X > m) Some Special Discrete Distributions 122 which is P (X > m + n and X > n) = P (X > m) P (X > n). (5.1) If X is geometric, that is X ∼ (1 − p)x−1 p, then P (X > n + m) = ∞ % (1 − p)x−1 p x=n+m+1 n+m = (1 − p) = (1 − p)n (1 − p)m = P (X > n) P (X > m). Hence the geometric distribution has the lack of memory property. 
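Before turning to the converse half of Theorem 5.4, note that the lack of memory property just verified can also be checked numerically straight from the probability density function. Below is a minimal Python sketch (standard library only); the helper names and the particular values of p, m and n are ours and purely illustrative.

    def geom_pmf(x, p):
        """P(X = x) for X ~ GEO(p), x = 1, 2, 3, ..."""
        return (1 - p)**(x - 1) * p

    def tail(k, p, terms=10_000):
        """P(X > k), approximated by summing the pmf over a long truncated range."""
        return sum(geom_pmf(x, p) for x in range(k + 1, k + 1 + terms))

    p, m, n = 0.3, 4, 7
    lhs = tail(m + n, p) / tail(n, p)   # P(X > m + n | X > n)
    rhs = tail(m, p)                    # P(X > m)
    print(lhs, rhs)                     # both approximately (1 - p)**m = 0.2401

Both printed values agree with (1 − p)^m, which is exactly what the identity P(X > m + n) = P(X > n) P(X > m) predicts.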
Let X be a random variable which satisfies the lack of memory property, that is P (X > m + n and X > n) = P (X > m) P (X > n). We want to show that X is geometric. Define g : N → R I by g(n) := P (X > n) (5.2) Using (5.2) in (5.1), we get ∀ m, n ∈ N, g(m + n) = g(m) g(n) (5.3) since P (X > m + n and X > n) = P (X > m + n). Letting m = 1 in (5.3), we see that g(n + 1) = g(n) g(1) = g(n − 1) g(1)2 = g(n − 2) g(1)3 = ··· ··· = g(n − (n − 1)) g(1)n = g(1)n+1 = an+1 , where a is an arbitrary constant. Hence g(n) = an . From (5.2), we get 1 − F (n) = P (X > n) = an Probability and Mathematical Statistics 123 and thus F (n) = 1 − an . Since F (n) is a distribution function 1 = lim F (n) = lim (1 − an ) . n→∞ n→∞ From the above, we conclude that 0 < a < 1. We rename the constant a as (1 − p). Thus, F (n) = 1 − (1 − p)n . The probability density function of X is hence f (1) = F (1) = p f (2) = F (2) − F (1) = 1 − (1 − p)2 − 1 + (1 − p) = (1 − p) p f (3) = F (3) − F (2) = 1 − (1 − p)3 − 1 + (1 − p)2 = (1 − p)2 p ··· ··· f (x) = F (x) − F (x − 1) = (1 − p)x−1 p. Thus X is geometric with parameter p. This completes the proof. The difference between the binomial and the geometric distributions is the following. In binomial distribution, the number of trials was predetermined, whereas in geometric it is the random variable. 5.4. Negative Binomial Distribution Let X denote the trial number on which the rth success occurs. Here r is a positive integer greater than or equal to one. This is equivalent to saying that the random variable X denotes the number of trials needed to observe the rth successes. Suppose we want to find the probability that the fifth head is observed on the 10th independent flip of an unbiased coin. This is a case of finding P (X = 10). Let us find the general case P (X = x). P (X = x) = P (first x − 1 trials contain x − r failures and r − 1 successes) · P (rth success in x th trial) $ x − 1 r−1 p = (1 − p)x−r p r−1 # $ x−1 r = p (1 − p)x−r , x = r, r + 1, ..., ∞. r−1 # Some Special Discrete Distributions 124 Hence the probability density function of the random variable X is given by $ # x−1 r p (1 − p)x−r , f (x) = x = r, r + 1, ..., ∞. r−1 Notice that this probability density function f (x) can also be expressed as $ # x+r−1 r p (1 − p)x , f (x) = x = 0, 1, ..., ∞. r−1 SSSS FSSSS SFSSS SSFSS SSSFS FFSSSS FSFSSS FSSFSS FSSSFS X 4 5 6 7 X is NBIN(4,P) S p a c e o f R V Definition 5.4. A random variable X has the negative binomial (or Pascal) distribution if its probability density function is of the form $ # x−1 r p (1 − p)x−r , f (x) = x = r, r + 1, ..., ∞, r−1 where p is the probability of success in a single Bernoulli trial. We denote the random variable X whose distribution is negative binomial distribution by writing X ∼ N BIN (r, p). We need the following technical result to show that the above function is really a probability density function. The technical result we are going to establish is called the negative binomial theorem. Probability and Mathematical Statistics 125 Theorem 5.5. Let r be a nonzero positive integer. Then −r (1 − y) = $ ∞ # % x−1 r−1 x=r y x−r where |y| < 1. Proof: Define h(y) = (1 − y)−r . Now expanding h(y) by Taylor series method about y = 0, we get −r (1 − y) = ∞ % h(k) (0) k! k=0 yk , where h(k) (y) is the k th derivative of h. This k th derivative of h(y) can be directly computed and direct computation gives h(k) (y) = r (r + 1) (r + 2) · · · (r + k − 1) (1 − y)−(r+k) . Hence, we get h(k) (0) = r (r + 1) (r + 2) · · · (r + k − 1) = (r + k − 1)! . 
(r − 1)! Letting this into the Taylor’s expansion of h(y), we get (1 − y)−r = ∞ % (r + k − 1)! yk (r − 1)! k! k=0 $ ∞ # % r+k−1 k y . = r−1 k=0 Letting x = k + r, we get −r (1 − y) = $ ∞ # % x−1 x=r r−1 y x−r . This completes the proof of the theorem. Theorem 5.5 can also be proved using the geometric series ∞ % n=0 yn = 1 1−y (5.4) Some Special Discrete Distributions 126 where |y| < 1. Differentiating k times both sides of the equality (5.4) and then simplifying we have ∞ # $ % n n=k k y n−k = 1 . (1 − y)k+1 (5.5) Letting n = x − 1 and k = r − 1 in (5.5), we have the asserted result. Example 5.13. Is the real valued function defined by f (x) = # $ x−1 r p (1 − p)x−r , r−1 x = r, r + 1, ..., ∞, where 0 < p < 1 is a parameter, a probability density function? Answer: It is easy to check that f (x) ≥ 0. Now we show that equal to one. ∞ % x=r ∞ % f (x) is x=r $ ∞ # % x−1 pr (1 − p)x−r r − 1 x=r $ ∞ # % x−1 r (1 − p)x−r =p r − 1 x=r f (x) = = pr (1 − (1 − p))−r = pr p−r = 1. Computing the mean and variance of the negative binomial distribution using definition is difficult. However, if we use the moment generating approach, then it is not so difficult. Hence in the next example, we determine the moment generating function of this negative binomial distribution. Example 5.14. What is the moment generating function of the random variable X whose probability density function is f (x) = # $ x−1 r p (1 − p)x−r , r−1 x = r, r + 1, ..., ∞? Answer: The moment generating function of this negative binomial random Probability and Mathematical Statistics 127 variable is M (t) = ∞ % x=r ∞ % etx f (x) # $ x−1 r p (1 − p)x−r r − 1 x=r # $ ∞ % r t(x−r) tr x − 1 =p e e (1 − p)x−r r − 1 x=r $ ∞ # % x − 1 t(x−r) r tr =p e e (1 − p)x−r r − 1 x=r $ ∞ # % <x−r x−1 ; t = pr etr e (1 − p) r−1 x=r <−r ; = pr etr 1 − (1 − p)et $r # p et , if t < −ln(1 − p). = 1 − (1 − p)et = etx The following theorem can easily be proved. Theorem 5.6. If X ∼ N BIN (r, p), then r p r (1 − p) V ar(X) = p2 $r # p et , M (t) = 1 − (1 − p)et E(X) = if t < −ln(1 − p). Example 5.15. What is the probability that the fifth head is observed on the 10th independent flip of a coin? Answer: Let X denote the number of trials needed to observe 5th head. Hence X has a negative binomial distribution with r = 5 and p = 21 . We want to find P (X = 10) = f (10) # $ 9 5 p (1 − p)5 = 4 # $ # $10 1 9 = 4 2 63 . = 512 Some Special Discrete Distributions 128 We close this section with the following comment. In the negative binomial distribution the parameter r is a positive integer. One can generalize the negative binomial distribution to allow noninteger values of the parameter r. To do this let us write the probability density function of the negative binomial distribution as # $ x−1 r f (x) = p (1 − p)x−r r−1 (x − 1)! pr (1 − p)x−r = (r − 1)! (x − r)! Γ(x) pr (1 − p)x−r , = Γ(r) Γ(x − r − 1) where Γ(z) = : ∞ for x = r, r + 1, ..., ∞, tz−1 e−t dt 0 is the well known gamma function. The gamma function generalizes the notion of factorial and it will be treated in the next chapter. 5.5. Hypergeometric Distribution Consider a collection of n objects which can be classified into two classes, say class 1 and class 2. Suppose that there are n1 objects in class 1 and n2 objects in class 2. A collection of r objects is selected from these n objects at random and without replacement. We are interested in finding out the probability that exactly x of these r objects are from class 1. 
If x of these r objects are from class 1, then the remaining r − x objects must be from class ! " 2. We can select x objects from class 1 in any one of nx1 ways. Similarly, ! n2 " the remaining r − x objects can be selected in r−x ways. Thus, the number of ways one can select a subset of r objects from a set of n objects, such that Probability and Mathematical Statistics 129 x number of objects will be from class 1 and r − x number of objects will be ! " ! n2 " from class 2, is given by nx1 r−x . Hence, ! n1 " ! n2 " P (X = x) = x !n"r−x , r where x ≤ r, x ≤ n1 and r − x ≤ n2 . Class I Class II Out of n1 objects x will be selected Out of n2 objects r-x will be chosen From n1+n2 objects select r objects such that x objects are of class I & r-x are of class II Definition 5.5. A random variable X is said to have a hypergeometric distribution if its probability density function is of the form ! n1 " ! n2 " r−x " , f (x) = !xn1 +n 2 x = 0, 1, 2, ..., r r where x ≤ n1 and r − x ≤ n2 with n1 and n2 being two positive integers. We shall denote such a random variable by writing X ∼ HY P (n1 , n2 , r). Example 5.16. Suppose there are 3 defective items in a lot of 50 items. A sample of size 10 is taken at random and without replacement. Let X denote the number of defective items in the sample. What is the probability that the sample contains at most one defective item? Answer: Clearly, X ∼ HY P (3, 47, 10). Hence the probability that the sample contains at most one defective item is P (X ≤ 1) = P (X = 0) + P (X = 1) !3" !47" !3" !47" = " + !5010 0 10 = 0.504 + 0.4 = 0.904. !50"9 1 10 Some Special Discrete Distributions 130 Example 5.17. A random sample of 5 students is drawn without replacement from among 300 seniors, and each of these 5 seniors is asked if she/he has tried a certain drug. Suppose 50% of the seniors actually have tried the drug. What is the probability that two of the students interviewed have tried the drug? Answer: Let X denote the number of students interviewed who have tried the drug. Hence the probability that two of the students interviewed have tried the drug is !150" !150" P (X = 2) = !300"3 2 5 = 0.3146. Example 5.18. A radio supply house has 200 transistor radios, of which 3 are improperly soldered and 197 are properly soldered. The supply house randomly draws 4 radios without replacement and sends them to a customer. What is the probability that the supply house sends 2 improperly soldered radios to its customer? Answer: The probability that the supply house sends 2 improperly soldered Probability and Mathematical Statistics 131 radios to its customer is P (X = 2) = !3" !197" 2 !2002" 4 = 0.000895. Theorem 5.7. If X ∼ HY P (n1 , n2 , r), then n1 n1 + n 2 $# $# $ # n2 n1 + n 2 − r n1 . V ar(X) = r n1 + n 2 n 1 + n2 n 1 + n2 − 1 E(X) = r Proof: Let X ∼ HY P (n1 , n2 , r). We compute the mean and variance of X by computing the first and the second factorial moments of the random variable X. First, we compute the first factorial moment (which is same as the expected value) of X. The expected value of X is given by E(X) = = r % x=0 r % x=0 = n1 =r !n1 " ! n2 " r−x " x !xn1 +n 2 r r % x=1 r % ! n2 " (n1 − 1)! ! r−x 2 " (x − 1)! (n1 − x)! n1 +n r !n1 −1" ! n2 " x−1 ! r−x " n1 +n2 n1 +n2 −1 r−1 r x=1 " ! n2 " ! r−1 n1 −1 % n1 r−1−y y !n1 +n2 −1" , n1 + n2 y=0 r−1 = n1 =r x f (x) where y = x − 1 n1 . n 1 + n2 The last equality is obtained since " ! n2 " ! r−1 n1 −1 % y r−1−y !n1 +n2 −1" = 1. 
y=0 r−1 Similarly, we find the second factorial moment of X to be E(X(X − 1)) = r(r − 1) n1 (n1 − 1) . (n1 + n2 ) (n1 + n2 − 1) Some Special Discrete Distributions 132 Therefore, the variance of X is V ar(X) = E(X 2 ) − E(X)2 = E(X(X − 1)) + E(X) − E(X)2 # $2 n1 n1 r(r − 1) n1 (n1 − 1) +r − r (n1 + n2 ) (n1 + n2 − 1) n 1 + n2 n 1 + n2 # $# $# $ n1 n2 n1 + n 2 − r =r . n1 + n 2 n 1 + n2 n 1 + n2 − 1 = 5.6. Poisson Distribution In this section, we define an important discrete distribution which is widely used for modeling many real life situations. First, we define this distribution and then we present some of its important properties. Definition 5.6. A random variable X is said to have a Poisson distribution if its probability density function is given by f (x) = e−λ λx , x! x = 0, 1, 2, · · · , ∞, where 0 < λ < ∞ is a parameter. We denote such a random variable by X ∼ P OI(λ). The probability density function f is called the Poisson distribution after Simeon D. Poisson (1781-1840). Example 5.19. Is the real valued function defined by f (x) = e−λ λx , x! x = 0, 1, 2, · · · , ∞, where 0 < λ < ∞ is a parameter, a probability density function? Probability and Mathematical Statistics 133 Answer: It is easy to check f (x) ≥ 0. We show that one. ∞ % f (x) = ∞ % e−λ λx f (x) is equal to x=0 x! x=0 x=0 ∞ % ∞ % λx x! x=0 = e−λ = e−λ eλ = 1. Theorem 5.8. If X ∼ P OI(λ), then E(X) = λ V ar(X) = λ M (t) = eλ (e t −1) . Proof: First, we find the moment generating function of X. M (t) = = ∞ % x=0 ∞ % etx f (x) e−λ λx x! etx x=0 = e−λ = e−λ ∞ % etx x=0 ∞ % λx x! x (et λ) x! x=0 t = e−λ eλe = eλ (e t −1) . Thus, M # (t) = λ et eλ (e t −1) , and E(X) = M # (0) = λ. Similarly, M ## (t) = λ et eλ (e Hence t −1) "2 ! t + λ et eλ (e −1) . M ## (0) = E(X 2 ) = λ2 + λ. Some Special Discrete Distributions 134 Therefore V ar(X) = E(X 2 ) − ( E(X) )2 = λ2 + λ − λ2 = λ. This completes the proof. Example 5.20. A random variable X has a Poisson distribution with a mean of 3. What is the probability that X is bounded by 1 and 3, that is, P (1 ≤ X ≤ 3)? Answer: µX = 3 = λ f (x) = Hence f (x) = 3x e−3 , x! λx e−λ x! x = 0, 1, 2, ... Therefore P (1 ≤ X ≤ 3) = f (1) + f (2) + f (3) 9 27 −3 = 3 e−3 + e−3 + e 2 6 = 12 e−3 . Example 5.21. The number of traffic accidents per week in a small city has a Poisson distribution with mean equal to 3. What is the probability of exactly 2 accidents occur in 2 weeks? Answer: The mean traffic accident is 3. Thus, the mean accidents in two weeks are λ = (3) (2) = 6. Probability and Mathematical Statistics Since f (x) = we get f (2) = 135 λx e−λ x! 62 e−6 = 18 e−6 . 2! Example 5.22. Let X have a Poisson distribution with parameter λ = 1. What is the probability that X ≥ 2 given that X ≤ 4? Answer: P (2 ≤ X ≤ 4) . P (X ≤ 4) P (X ≥ 2 / X ≤ 4) = P (2 ≤ X ≤ 4) = = = 4 % λx e−λ x=2 x! 4 1 % 1 e x=2 x! 17 . 24 e Similarly P (X ≤ 4) = = 4 1 % 1 e x=0 x! 65 . 24 e Therefore, we have P (X ≥ 2 / X ≤ 4) = 17 . 65 Example 5.23. If the moment generating function of a random variable X t is M (t) = e4.6 (e −1) , then what are the mean and variance of X? What is the probability that X is between 3 and 6, that is P (3 < X < 6)? Some Special Discrete Distributions 136 Answer: Since the moment generating function of X is given by M (t) = e4.6 (e t −1) we conclude that X ∼ P OI(λ) with λ = 4.6. Thus, by Theorem 5.8, we get E(X) = 4.6 = V ar(X). P (3 < X < 6) = f (4) + f (5) = F (5) − F (3) = 0.686 − 0.326 = 0.36. 5.7. 
Riemann Zeta Distribution The zeta distribution was used by the Italian economist Vilfredo Pareto (1848-1923) to study the distribution of family incomes of a given country. Definition 5.7. A random variable X is said to have Riemann zeta distribution if its probability density function is of the form f (x) = 1 x−(α+1) , ζ(α + 1) x = 1, 2, 3, ..., ∞ where α > 0 is a parameter and ζ(s) = 1 + # $s # $s # $ s # $s 1 1 1 1 + + + ··· + + ··· 2 3 4 x is the well known the Riemann zeta function. A random variable having a Riemann zeta distribution with parameter α will be denoted by X ∼ RIZ(α). The following figures illustrate the Riemann zeta distribution for the case α = 2 and α = 1. Probability and Mathematical Statistics 137 The following theorem is easy to prove and we leave its proof to the reader. Theorem 5.9. If X ∼ RIZ(α), then ζ(α) ζ(α + 1) ζ(α − 1)ζ(α + 1) − (ζ(α))2 . V ar(X) = (ζ(α + 1))2 E(X) = Remark 5.1. If 0 < α ≤ 1, then ζ(α) = ∞. Hence if X ∼ RIZ(α) and the parameter α ≤ 1, then the variance of X is infinite. 5.8. Review Exercises 1. What is the probability of getting exactly 3 heads in 5 flips of a fair coin? 2. On six successive flips of a fair coin, what is the probability of observing 3 heads and 3 tails? 3. What is the probability that in 3 rolls of a pair of six-sided dice, exactly one total of 7 is rolled? 4. What is the probability of getting exactly four “sixes” when a die is rolled 7 times? 5. In a family of 4 children, what is the probability that there will be exactly two boys? 6. If a fair coin is tossed 4 times, what is the probability of getting at least two heads? 7. In Louisville the probability that a thunderstorm will occur on any day during the spring is 0.05. Assuming independence, what is the probability that the first thunderstorm occurs on April 5? (Assume spring begins on March 1.) 8. A ball is drawn from an urn containing 3 white and 3 black balls. After the ball is drawn, it is then replaced and another ball is drawn. This goes on indefinitely. What is the probability that, of the first 4 balls drawn, exactly 2 are white? 9. What is the probability that a person flipping a fair coin requires four tosses to get a head? 10. Assume that hitting oil at one drilling location is independent of another, and that, in a particular region, the probability of success at any individual Some Special Discrete Distributions 138 location is 0.3. Suppose the drilling company believes that a venture will be profitable if the number of wells drilled until the second success occurs is less than or equal to 7. What is the probability that the venture will be profitable? 11. Suppose an experiment consists of tossing a fair coin until three heads occur. What is the probability that the experiment ends after exactly six flips of the coin with a head on the fifth toss as well as on the sixth? 12. Customers at Fred’s Cafe wins a $100 prize if their cash register receipts show a star on each of the five consecutive days Monday, Tuesday, ..., Friday in any one week. The cash register is programmed to print stars on a randomly selected 10% of the receipts. If Mark eats at Fred’s Cafe once each day for four consecutive weeks, and if the appearance of the stars is an independent process, what is the probability that Mark will win at least $100? 13. If a fair coin is tossed repeatedly, what is the probability that the third head occurs on the nth toss? 14. Suppose 30 percent of all electrical fuses manufactured by a certain company fail to meet municipal building standards. 
What is the probability that in a random sample of 10 fuses, exactly 3 will fail to meet municipal building standards? 15. A bin of 10 light bulbs contains 4 that are defective. If 3 bulbs are chosen without replacement from the bin, what is the probability that exactly k of the bulbs in the sample are defective? 16. Let X denote the number of independent rolls of a fair die required to obtain the first “3”. What is P (X ≥ 6)? 17. The number of automobiles crossing a certain intersection during any time interval of length t minutes between 3:00 P.M. and 4:00 P.M. has a Poisson distribution with mean t. Let W be time elapsed after 3:00 P.M. before the first automobile crosses the intersection. What is the probability that W is less than 2 minutes? 18. In rolling one die repeatedly, what is the probability of getting the third six on the xth roll? 19. A coin is tossed 6 times. What is the probability that the number of heads in the first 3 throws is the same as the number in the last 3 throws? Probability and Mathematical Statistics 139 20. One hundred pennies are being distributed independently and at random into 30 boxes, labeled 1, 2, ..., 30. What is the probability that there are exactly 3 pennies in box number 1? 21. The density function of a certain random variable is f (x) = & !22" 4x (0.2)4x (0.8)22−4x 0 if x = 0, 41 , 24 , · · · , 22 4 otherwise. What is the expected value of X 2 ? 100 22. If MX (t) = k (2 + 3et ) , what is the value of k? What is the variance of the random variable X? , t -3 e , what is the value of k? What is the variance of 23. If MX (t) = k 7−5e t the random variable X? 24. If for a Poisson distribution 2f (0) + f (2) = 2f (1), what is the mean of the distribution? 25. The number of hits, X, per baseball game, has a Poisson distribution. If the probability of a no-hit game is 13 , what is the probability of having 2 or more hits in specified game? 26. Suppose X has a Poisson distribution with a standard deviation of 4. What is the conditional probability that X is exactly 1 given that X ≥ 1 ? 27. A die is loaded in such a way that the probability of the face with j dots turning up is proportional to j 2 for j = 1, 2, 3, 4, 5, 6. What is the probability of rolling at most three sixes in 5 independent casts of this die? 28. A die is loaded in such a way that the probability of the face with j dots turning up is proportional to j 2 for j = 1, 2, 3, 4, 5, 6. What is the probability of getting the third six on the 7th roll of this loaded die? Some Special Discrete Distributions 140 Probability and Mathematical Statistics 141 Chapter 6 SOME SPECIAL CONTINUOUS DISTRIBUTIONS In this chapter, we study some well known continuous probability density functions. We want to study them because they arise in many applications. We begin with the simplest probability density function. 6.1. Uniform Distribution Let the random variable X denote the outcome when a point is selected at random from an interval [a, b]. We want to find the probability of the event X ≤ x, that is we would like to determine the probability that the point selected from [a, b] would be less than or equal to x. To compute this probability, we need a probability measure µ that satisfies the three axioms of Kolmogorov (namely nonnegativity, normalization and countable additivity). For continuous variables, the events are interval or union of intervals. 
The length of the interval when normalized satisfies all the three axioms and thus it can be used as a probability measure for one-dimensional random variables. Hence length of [a, x] P (X ≤ x) = . length of [a, b] Thus, the cumulative distribution function F is x−a F (x) = P (X ≤ x) = , a ≤ x ≤ b, b−a where a and b are any two real constants with a < b. To determine the probability density function from cumulative density function, we calculate the derivative of F (x). Hence f (x) = 1 d F (x) = , dx b−a a ≤ x ≤ b. Some Special Continuous Distributions 142 Definition 6.1. A random variable X is said to be uniform on the interval [a, b] if its probability density function is of the form f (x) = 1 , b−a a ≤ x ≤ b, where a and b are constants. We denote a random variable X with the uniform distribution on the interval [a, b] as X ∼ U N IF (a, b). The uniform distribution provides a probability model for selecting points at random from an interval [a, b]. An important application of uniform distribution lies in random number generation. The following theorem gives the mean, variance and moment generating function of a uniform random variable. Theorem 6.1. If X is uniform on the interval [a, b] then the mean, variance and moment generating function of X are given by b+a 2 (b − a)2 V ar(X) =  12 if t = 0 1 M (t) =  etb −eta , if t += 0 t (b−a) E(X) = Proof: E(X) = : b x f (x) dx a : b 1 dx b−a a 2 2 3b 1 x = b−a 2 a 1 = (b + a). 2 = x Probability and Mathematical Statistics E(X 2 ) = : 143 b x2 f (x) dx a : b 1 dx b − a a 2 3 3b 1 x = b−a 3 a = x2 1 b3 − a3 b−a 3 (b − a) (b2 + ba + a2 ) 1 = (b − a) 3 1 2 = (b + ba + a2 ). 3 = Hence, the variance of X is given by V ar(X) = E(X 2 ) − ( E(X) )2 1 2 (b + a)2 (b + ba + a2 ) − 3 4 < 1 ; 2 2 = 4b + 4ba + 4a − 3a2 − 3b2 − 6ba 12 < 1 ; 2 b − 2ba + a2 = 12 1 2 = (b − a) . 12 = Next, we compute the moment generating function of X. First, we handle the case t += 0. Assume t += 0. Hence ! " M (t) = E etX : b 1 dx etx = b − a a 2 tx 3b 1 e = b−a t a = etb − eta . t (b − a) If t = 0, we have know that M (0) = 1, hence we get M (t) =  1 if t = 0  etb −eta , if t += 0 t (b−a) Some Special Continuous Distributions 144 and this completes the proof. Example 6.1. Suppose Y ∼ U N IF (0, 1) and Y = probability density function of X? 1 4 X 2 . What is the Answer: We shall find the probability density function of X through the cumulative distribution function of Y . The cumulative distribution function of X is given by F (x) = P (X ≤ x) ! " = P X 2 ≤ x2 $ # 1 1 2 X ≤ x2 =P 4 4 # $ 2 x =P Y ≤ 4 : x42 = f (y) dy 0 = : x2 4 dy 0 2 = x . 4 Thus f (x) = d x F (x) = . dx 2 Hence the probability density function of X is given by f (x) = &x 2 for 0 ≤ x ≤ 2 0 otherwise. Probability and Mathematical Statistics 145 Example 6.2. If X has a uniform distribution on the interval from 0 to 10, ! " then what is P X + 10 X ≥7 ? Answer: Since X ∼ U N IF (0, 10), the probability density function of X is 1 for 0 ≤ x ≤ 10. Hence f (x) = 10 P # X+ $ ! " 10 ≥ 7 = P X 2 + 10 ≥ 7 X X ! " = P X 2 − 7 X + 10 ≥ 0 = P ((X − 5) (X − 2) ≥ 0) = P (X ≤ 2 or X ≥ 5) = 1 − P (2 ≤ X ≤ 5) : 5 =1− f (x) dx 2 : 5 1 dx 10 2 3 7 =1− = . 10 10 =1− Example 6.3. If X is uniform on the interval from 0 to 3, what is the probability that the quadratic equation 4t2 + 4tX + X + 2 = 0 has real solutions? Answer: Since X ∼ U N IF (0, 3), the probability density function of X is f (x) = &1 3 0≤x≤3 0 otherwise. 
Some Special Continuous Distributions 146 The quadratic equation 4t2 + 4tX + X + 2 = 0 has real solution if the discriminant of this equation is positive. That is 16X 2 − 16(X + 2) ≥ 0, which is X 2 − X − 2 ≥ 0. From this, we get (X − 2) (X + 1) ≥ 0. The probability that the quadratic equation 4t2 + 4tX + X + 2 = 0 has real roots is equivalent to P ( (X − 2) (X + 1) ≥ 0 ) = P (X ≤ −1 or X ≥ 2) = P (X ≤ −1) + P (X ≥ 2) : −1 : 3 = f (x) dx f (x) dx + −∞ =0+ : 2 2 3 1 dx 3 1 = = 0.3333. 3 Theorem 6.2. If X is a continuous random variable with a strictly increasing cumulative distribution function F (x), then the random variable Y , defined by Y = F (X) has the uniform distribution on the interval [0, 1]. Proof: Since F is strictly increasing, the inverse F −1 (x) of F (x) exists. We want to show that the probability density function g(y) of Y is g(y) = 1. First, we find the cumulative distribution G(y) function of Y . G(y) = P (Y ≤ y) = P (F (X) ≤ y) ! " = P X ≤ F −1 (y) ! " = F F −1 (y) = y. Probability and Mathematical Statistics 147 Hence the probability density function of Y is given by d d G(y) = y = 1. dy dy g(y) = The following problem can be solved using this theorem but we solve it without this theorem. Example 6.4. If the probability density function of X is f (x) = e−x , (1 + e−x )2 −∞ < x < ∞, then what is the probability density function of Y = 1 ? 1+e−X Answer: The cumulative distribution function of Y is given by G(y) = P (Y ≤ y) # $ 1 =P ≤y 1 + e−X # $ 1 −X =P 1+e ≥ y $ # 1−y = P e−X ≥ y # $ 1−y = P −X ≥ ln y $ # 1−y = P X ≤ − ln y : − ln 1−y y e−x dx = (1 + e−x )2 −∞ 2 3− ln 1−y y 1 = −x 1+e −∞ 1 = 1 + 1−y y = y. Hence, the probability density function of Y is given by f (y) = & 1 if 0 < y < 1 0 otherwise. Some Special Continuous Distributions 148 Example 6.5. A box to be constructed so that its height is 10 inches and its base is X inches by X inches. If X has a uniform distribution over the interval (2, 8), then what is the expected volume of the box in cubic inches? Answer: Since X ∼ U N IF (2, 8), f (x) = 1 1 = 8−2 6 on (2, 8). The volume V of the box is V = 10 X 2 . Hence ! " E(V ) = E 10X 2 ! " = 10 E X 2 : 8 1 x2 dx = 10 6 2 2 3 38 10 x = 6 3 2 < ; 10 83 − 23 = (5) (8) (7) = 280 cubic inches. = 18 Example 6.6. Two numbers are chosen independently and at random from the interval (0, 1). What is the probability that the two numbers differs by more than 21 ? Answer: See figure below: Choose x from the x-axis between 0 and 1, and choose y from the y-axis between 0 and 1. The probability that the two numbers differ by more than Probability and Mathematical Statistics 1 2 149 is equal to the area of the shaded region. Thus P # 1 |X − Y | > 2 $ = 1 8 + 1 1 8 = 1 . 4 6.2. Gamma Distribution The gamma distribution involves the notion of gamma function. First, we develop the notion of gamma function and study some of its well known properties. The gamma function, Γ(z), is a generalization of the notion of factorial. The gamma function is defined as : ∞ Γ(z) := xz−1 e−x dx, 0 where z is positive real number (that is, z > 0). The condition z > 0 is assumed for the convergence of the integral. Although the integral does not converge for z < 0, it can be shown by using an alternative definition of gamma function that it is defined for all z ∈ R I \ {0, −1, −2, −3, ...}. The integral on the right side of the above expression is called Euler’s second integral, after the Swiss mathematician Leonhard Euler (1707-1783). The graph of the gamma function is shown below. 
Observe that the zero and negative integers correspond to vertical asymptotes of the graph of gamma function. Lemma 6.1. Γ(1) = 1. Proof: Γ(1) = : 0 ∞ ; <∞ x0 e−x dx = −e−x 0 = 1. Lemma 6.2. The gamma function Γ(z) satisfies the functional equation Γ(z) = (z − 1) Γ(z − 1) for all real number z > 1. Some Special Continuous Distributions 150 Proof: Let z be a real number such that z > 1, and consider : ∞ xz−1 e−x dx Γ(z) = 0 : ∞ ; z−1 −x <∞ (z − 1) xz−2 e−x dx = −x e + 0 0 : ∞ xz−2 e−x dx = (z − 1) 0 = (z − 1) Γ(z − 1). Although, we have proved this lemma for all real z > 1, actually this lemma holds also for all real number z ∈ R I \ {1, 0, −1, −2, −3, ...}. !1" √ Lemma 6.3. Γ 2 = π. Proof: We want to show that # $ : ∞ −x 1 e √ dx Γ = 2 x 0 √ √ is equal to π. We substitute y = x, hence the above integral becomes # $ : ∞ −x 1 e √ dx Γ = 2 x 0 : ∞ √ 2 =2 e−y dy, where y = x. 0 Hence and also # $ : ∞ 2 1 Γ e−u du =2 2 0 # $ : ∞ 2 1 e−v dv. =2 2 0 Multiplying the above two expressions, we get # # $$2 : ∞: ∞ 2 2 1 Γ e−(u +v ) du dv. =4 2 0 0 Γ Now we change the integral into polar form by the transformation u = r cos(θ) and v = r sin(θ). The Jacobian of the transformation is  ∂u ∂u  J(r, θ) = det  = det # ∂r ∂θ ∂v ∂r ∂v ∂θ  cos(θ) −r sin(θ) sin(θ) r cos(θ) = r cos2 (θ) + r sin2 (θ) = r. $ Probability and Mathematical Statistics 151 Hence, we get # # $$2 : π2 : ∞ 2 1 e−r J(r, θ) dr dθ =4 Γ 2 0 0 : π2 : ∞ 2 =4 e−r r dr dθ 0 =2 : 0 π 2 0 =2 : π 2 0 =2 : : ∞ 2 e−r 2r dr dθ 0 3 2: ∞ 2 e−r dr2 dθ 0 π 2 Γ(1) dθ 0 = π. Therefore, we get # $ √ 1 Γ = π. 2 ! " √ Lemma 6.4. Γ − 21 = −2 π. Proof: By Lemma 6.2, we get Γ (z) = (z − 1) Γ (z − 1) for all z ∈ R I \ {1, 0, −1, −2, −3, ...}. Letting z = 21 , we get # $ # $ # $ 1 1 1 Γ = −1 Γ −1 2 2 2 which is # 1 Γ − 2 Example 6.7. Evaluate Γ Answer: Γ $ !5" 2 # $ √ 1 = −2 Γ = −2 π. 2 . # $ # $ 5 1 3 1 3 √ Γ = = π. 2 2 2 2 4 ! " Example 6.8. Evaluate Γ − 72 . Some Special Continuous Distributions 152 Answer: Consider # $ # $ 1 3 3 Γ − =− Γ − 2 2 2 $# $ # $ # 5 5 3 − Γ − = − 2 2 2 $# $# $ # $ # 5 7 7 3 − − Γ − . = − 2 2 2 2 Hence # 7 Γ − 2 $ = # 2 − 3 $# 2 − 5 $# 2 − 7 $ # $ 1 16 √ Γ − = π. 2 105 Example 6.9. Evaluate Γ (7.8). Answer: Γ (7.8) = (6.8) (5.8) (4.8) (3.8) (2.8) (1.8) Γ (1.8) = (3625.7) Γ (1.8) = (3625.7) (0.9314) = 3376.9. Here we have used the gamma table to find Γ (1.8) to be 0.9314. Example 6.10. If n is a natural number, then Γ(n + 1) = n!. Answer: Γ(n + 1) = n Γ(n) = n (n − 1) Γ(n − 1) = n (n − 1) (n − 2) Γ(n − 2) = ··· ··· = n (n − 1) (n − 2) · · · (1) Γ(1) = n! Now we are ready to define the gamma distribution. Definition 6.2. A continuous random variable X is said to have a gamma distribution if its probability density function is given by  x  Γ(α)1 θα xα−1 e− θ if 0 < x < ∞ f (x) =  0 otherwise, where α > 0 and θ > 0. We denote a random variable with gamma distribution as X ∼ GAM (θ, α). The following diagram shows the graph of the gamma density for various values of values of the parameters θ and α. Probability and Mathematical Statistics 153 The following theorem gives the expected value, the variance, and the moment generating function of the gamma random variable Theorem 6.3. If X ∼ GAM (θ, α), then E(X) = θ α V ar(X) = θ2 α M (t) = # 1 1−θt $α , if t< 1 . θ Proof: First, we derive the moment generating function of X and then we compute the mean and variance of it. The moment generating function ! 
" M (t) = E etX : ∞ x 1 xα−1 e− θ etx dx = α Γ(α) θ :0 ∞ 1 1 xα−1 e− θ (1−θt)x dx = Γ(α) θα 0 : ∞ θα 1 1 y α−1 e−y dy, where y = (1 − θt)x = α α Γ(α) θ (1 − θt) θ 0 : ∞ 1 1 y α−1 e−y dy = (1 − θt)α 0 Γ(α) 1 = , since the integrand is GAM (1, α). (1 − θt)α Some Special Continuous Distributions 154 The first derivative of the moment generating function is d (1 − θt)−α dt = (−α) (1 − θt)−α−1 (−θ) M # (t) = = α θ (1 − θt)−(α+1) . Hence from above, we find the expected value of X to be E(X) = M # (0) = α θ. Similarly, d , α θ (1 − θt)−(α+1) dt = α θ (α + 1) θ (1 − θt)−(α+2) M ## (t) = Thus, the variance of X is = α (α + 1) θ2 (1 − θt)−(α+2) . 2 V ar(X) = M ## (0) − (M # (0)) = α (α + 1) θ2 − α2 θ2 = α θ2 and proof of the theorem is now complete In figure below the graphs of moment generating function for various values of the parameters are illustrated. Example 6.11. Let X have the density function  x  Γ(α)1 θα xα−1 e− θ if 0 < x < ∞ f (x) =  0 otherwise, Probability and Mathematical Statistics where α > 0 and θ > 0. If α = 4, what is the mean of 155 1 X3 ? Answer: ! " E X −3 = : ∞ 1 f (x) dx 3 x 0 : ∞ x 1 1 x3 e− θ dx = 3 4 x Γ(4) θ 0 : ∞ x 1 e− θ dx = 4 3! θ 0 : ∞ 1 1 −x = e θ dx 3! θ3 0 θ 1 = since the integrand is GAM(θ, 1). 3! θ3 Definition 6.3. A continuous random variable is said to be an exponential random variable with parameter θ if its probability density function is of the form  x  θ1 e− θ if x > 0 f (x) =  0 otherwise, where θ > 0. If a random variable X has an exponential density function with parameter θ, then we denote it by writing X ∼ EXP (θ). An exponential distribution is a special case of the gamma distribution. If the parameter α = 1, then the gamma distribution reduces to the exponential distribution. Hence most of the information about an exponential distribution can be obtained from the gamma distribution. Example 6.12. What is the cumulative density function of a random variable which has an exponential distribution with variance 25? Some Special Continuous Distributions 156 Answer: Since an exponential distribution is a special case of the gamma distribution with α = 1, from Theorem 6.3, we get V ar(X) = θ2 . But this is given to be 25. Thus, θ2 = 25 or θ = 5. Hence, the probability density function of X is : x f (t) dt F (x) = 0 : x 1 −t e 5 dt 0 5 B t x 1 A = −5 e− 5 5 0 −x 5 =1−e . = Example 6.13. If the random variable X has a gamma distribution with parameters α = 1 and θ = 1, then what is the probability that X is between its mean and median? Answer: Since X ∼ GAM (1, 1), the probability density function of X is f (x) = & e−x if x > 0 0 otherwise. Hence, the median q of X can be calculated from : q 1 = e−x dx 2 ;0 <q = −e−x 0 = 1 − e−q . Hence 1 = 1 − e−q 2 Probability and Mathematical Statistics 157 and from this, we get q = ln 2. The mean of X can be found from the Theorem 6.3. E(X) = α θ = 1. Hence the mean of X is 1 and the median of X is ln 2. Thus P (ln 2 ≤ X ≤ 1) = : 1 e−x dx ln 2 ; <1 = −e−x ln 2 1 = e− ln 2 − e 1 1 = − 2 e e−2 = . 2e Example 6.14. If the random variable X has a gamma distribution with parameters α = 1 and θ = 2, then what is the probability density function of the random variable Y = eX ? Answer: First, we calculate the cumulative distribution function G(y) of Y . G(y) = P ( Y ≤ y ) ! " = P eX ≤ y = P ( X ≤ ln y ) : ln y 1 −x = e 2 dx 2 0 x <ln y 1 ; −2 e− 2 0 = 2 1 = 1 − 1 ln y e2 1 =1− √ . y Hence, the probability density function of Y is given by g(y) = d d G(y) = dy dy # 1 1− √ y $ = 1 √ . 
2y y Some Special Continuous Distributions 158 Thus, if X ∼ GAM (1, 2), then probability density function of eX is f (x) = & 1√ 2x x if 0 otherwise. 1≤x<∞ Definition 6.4. A continuous random variable X is said to have a chi-square distribution with r degrees of freedom if its probability density function is of the form  r x  r1 r x 2 −1 e− 2 if 0 < x < ∞ Γ( 2 ) 2 2 f (x) =  0 otherwise, where r > 0. If X has a chi-square distribution, then we denote it by writing X ∼ χ2 (r). The gamma distribution reduces to the chi-square distribution if α = 2r and θ = 2. Thus, the chi-square distribution is a special case of the gamma distribution. Further, if r → ∞, then the chi-square distribution tends to the normal distribution. Probability and Mathematical Statistics 159 The chi-square distribution was originated in the works of British Statistician Karl Pearson (1857-1936) but it was originally discovered by German physicist F. R. Helmert (1843-1917). Example 6.15. If X ∼ GAM (1, 1), then what is the probability density function of the random variable 2X? Answer: We will use the moment generating method to find the distribution of 2X. The moment generating function of a gamma random variable is given by (see Theorem 6.3) 1 . θ Since X ∼ GAM (1, 1), the moment generating function of X is given by −α M (t) = (1 − θ t) , if t< 1 , t < 1. 1−t Hence, the moment generating function of 2X is MX (t) = M2X (t) = MX (2t) 1 = 1 − 2t 1 = 2 (1 − 2t) 2 = MGF of χ2 (2). Hence, if X is an exponential with parameter 1, then 2X is chi-square with 2 degrees of freedom. Example 6.16. If X ∼ χ2 (5), then what is the probability that X is between 1.145 and 12.83? Answer: The probability of X between 1.145 and 12.83 can be calculated from the following: P (1.145 ≤ X ≤ 12.83) = P (X ≤ 12.83) − P (X ≤ 1.145) : 12.83 : 1.145 = f (x) dx f (x) dx − 0 0 = : 12.83 0 Γ 1 !5" 2 = 0.975 − 0.050 = 0.925. 2 5 2 x 5 2 −1 x e− 2 dx − : (from χ2 table) 0 1.145 Γ 1 !5" 2 5 2 5 2 x x 2 −1 e− 2 dx Some Special Continuous Distributions 160 These integrals are hard to evaluate and so their values are taken from the chi-square table. Example 6.17. If X ∼ χ2 (7), then what are values of the constants a and b such that P (a < X < b) = 0.95? Answer: Since 0.95 = P (a < X < b) = P (X < b) − P (X < a), we get P (X < b) = 0.95 + P (X < a). We choose a = 1.690, so that P (X < 1.690) = 0.025. From this, we get P (X < b) = 0.95 + 0.025 = 0.975 Thus, from the chi-square table, we get b = 16.01. Definition 6.5. A continuous random variable X is said to have a n-Erlang distribution if its probability density function is of the form  n−1  λe−λx (λx) , if 0 < x < ∞ (n−1)! f (x) =  0 otherwise, where λ > 0 is a parameter. The gamma distribution reduces to n-Erlang distribution if α = n, where n is a positive integer, and θ = λ1 . The gamma distribution can be generalized to include the Weibull distribution. We call this generalized distribution the unified distribution. The form of this distribution is the following:  ψ −x−(α −α−1)   ψ α θ xα−1 e , if 0 < x < ∞ α ψ f (x) = θ Γ(α +1)   0 otherwise, where θ > 0, α > 0, and ψ ∈ {0, 1} are parameters. If ψ = 0, the unified distribution reduces  α  α xα−1 e− xθ , if 0 < x < ∞ θ f (x) =  0 otherwise Probability and Mathematical Statistics 161 which is known as the Weibull distribution. For α = 1, the Weibull distribution becomes an exponential distribution. The Weibull distribution provides probabilistic models for life-length data of components or systems. 
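Because the Weibull density above arises as the ψ = 0 case of the unified distribution, it is instructive to verify numerically that it integrates to one and that α = 1 recovers the exponential density. The following is a minimal sketch in Python, assuming scipy is available; the parameter values θ = 2 and α = 1.5 and the function name weibull_pdf are illustrative choices, not taken from the text.

    # Numerical check of the Weibull density f(x) = (alpha/theta) x^(alpha-1) exp(-x^alpha/theta)
    # in the parameterization used in this section.
    from math import exp
    from scipy.integrate import quad

    def weibull_pdf(x, theta, alpha):
        return (alpha / theta) * x ** (alpha - 1) * exp(-x ** alpha / theta)

    theta, alpha = 2.0, 1.5                          # illustrative parameter values
    total, _ = quad(weibull_pdf, 0, float("inf"), args=(theta, alpha))
    print(total)                                     # approximately 1.0

    # For alpha = 1 the density reduces to the exponential EXP(theta):
    print(weibull_pdf(0.7, 2.0, 1.0))                # (1/2) exp(-0.7/2) = 0.3523...
    print(0.5 * exp(-0.7 / 2))                       # the exponential density at the same point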
The mean and variance of the Weibull distribution are given by # $ 1 1 , E(X) = θ α Γ 1 + α & # $ 2# $32 ' 2 2 1 . − 1+ V ar(X) = θ α Γ 1 + α α From this Weibull distribution, one can get the Rayleigh distribution by taking θ = 2σ 2 and α = 2. The Rayleigh distribution is given by  x2  x2 e− 2σ 2 , if 0 < x < ∞ σ f (x) =  0 otherwise. If ψ = 1, the unified distribution reduces to the gamma distribution. 6.3. Beta Distribution The beta distribution is one of the basic distributions in statistics. It has many applications in classical as well as Bayesian statistics. It is a versatile distribution and as such it is used in modeling the behavior of random variables that are positive but bounded in possible values. Proportions and percentages fall in this category. The beta distribution involves the notion of beta function. First we explain the notion of the beta integral and some of its simple properties. Let α and β be any two positive real numbers. The beta function B(α, β) is defined as : 1 xα−1 (1 − x)β−1 dx. B(α, β) = 0 First, we prove a theorem that establishes the connection between the beta function and the gamma function. Theorem 6.4. Let α and β be any two positive real numbers. Then B(α, β) = where Γ(z) = : 0 ∞ Γ(α)Γ(β) , Γ(α + β) xz−1 e−x dx Some Special Continuous Distributions 162 is the gamma function. Proof: We prove this theorem by computing #: ∞ $ $ #: ∞ β−1 −y α−1 −x y e dy x e dx Γ(α) Γ(β) = #:0 ∞ $0 #: ∞ $ 2 2 = u2α−2 e−u 2udu v 2β−2 e−v 2vdv 0 : 0∞ : ∞ 2α−1 2β−1 −(u2 +v 2 ) =4 u v e dudv 0 =4 : π 2 = 0 ∞ 0 0 #: : ∞ 2 r2α+2β−2 (cos θ)2α−1 (sin θ)2β−1 e−r rdrdθ $) : π 2 α+β−1 −r 2 (r ) 0 e ) : = Γ(α + β) 2 = Γ(α + β) : 0 dr 2 2 2 2α−1 (cos θ) 0 π 2 (cos θ)2α−1 (sin θ)2β−1 dθ 0 2β−1 (sin θ) + dθ + 1 tα−1 (1 − t)β−1 dt = Γ(α + β) B(α, β). The second line in the above integral is obtained by substituting x = u2 and y = v 2 . Similarly, the fourth and seventh lines are obtained by substituting u = r cos θ, v = r sin θ, and t = cos2 θ, respectively. This proves the theorem. The following two corollaries are consequences of the last theorem. Corollary 6.1. For every positive α and β, the beta function is symmetric, that is B(α, β) = B(β, α). Corollary 6.2. For every positive α and β, the beta function can be written as : π 2 B(α, β) = 2 (cos θ)2α−1 (sin θ)2β−1 dθ. 0 The following corollary is obtained substituting s = of the beta function. t 1−t in the definition Corollary 6.3. For every positive α and β, the beta function can be expressed as : ∞ sα−1 ds. B(α, β) = (1 + s)α+β 0 Probability and Mathematical Statistics 163 Using Theorem 6.4 and the property of gamma function, we have the following corollary. Corollary 6.4. For every positive real number β and every positive integer α, the beta function reduces to B(α, β) = (α − 1)! . (α − 1 + β)(α − 2 + β) · · · (1 + β)β Corollary 6.5. For every pair of positive integers α and β, the beta function satisfies the following recursive relation B(α, β) = (α − 1)(β − 1) B(α − 1, β − 1). (α + β − 1)(α + β − 2) Definition 6.6. A random variable X is said to have the beta density function if its probability density function is of the form & 1 α−1 (1 − x)β−1 , if 0 < x < 1 B(α,β) x f (x) = 0 otherwise for every positive α and β. If X has a beta distribution, then we symbolically denote this by writing X ∼ BET A(α, β). The following figure illustrates the graph of the beta distribution for various values of α and β. The beta distribution reduces to the uniform distribution over (0, 1) if α = 1 = β. 
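Theorem 6.4 is easy to confirm numerically. The sketch below, which assumes Python with scipy available, compares the integral definition of B(α, β) with the gamma-function expression and checks that the resulting beta density integrates to one; the values α = 3, β = 10 are an arbitrary illustration.

    # Numerical check of B(alpha, beta) = Gamma(alpha) Gamma(beta) / Gamma(alpha + beta).
    from math import gamma
    from scipy.integrate import quad

    alpha, beta = 3, 10                              # illustrative parameter values

    # B(alpha, beta) from its integral definition
    B_integral, _ = quad(lambda x: x ** (alpha - 1) * (1 - x) ** (beta - 1), 0, 1)

    # B(alpha, beta) from the gamma-function identity of Theorem 6.4
    B_gamma = gamma(alpha) * gamma(beta) / gamma(alpha + beta)
    print(B_integral, B_gamma)                       # both approximately 0.0015151...

    # the corresponding beta density integrates to one
    total, _ = quad(lambda x: x ** (alpha - 1) * (1 - x) ** (beta - 1) / B_gamma, 0, 1)
    print(total)                                     # approximately 1.0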
The following theorem gives the mean and variance of the beta distribution. Some Special Continuous Distributions 164 Theorem 6.5. If X ∼ BET A(α, β), E(X) = V ar(X) = α α+β αβ . (α + β)2 (α + β + 1) Proof: The expected value of X is given by E(X) = : 1 x f (x) dx 0 = = = = = : 1 1 xα (1 − x)β−1 dx B(α, β) 0 B(α + 1, β) B(α, β) Γ(α + 1) Γ(β) Γ(α + β) Γ(α + β + 1) Γ(α) Γ(β) Γ(α + β) α Γ(α) Γ(β) (α + β) Γ(α + β) Γ(α) Γ(β) α . α+β Similarly, we can show that ! " E X2 = α (α + 1) . (α + β + 1) (α + β) Therefore ! " V ar(X) = E X 2 − E(X) = (α + β)2 αβ (α + β + 1) and the proof of the theorem is now complete. Example 6.18. The percentage of impurities per batch in a certain chemical product is a random variable X that follows the beta distribution given by f (x) = & 60 x3 (1 − x)2 for 0 < x < 1 0 otherwise. What is the probability that a randomly selected batch will have more than 25% impurities? Probability and Mathematical Statistics 165 Proof: The probability that a randomly selected batch will have more than 25% impurities is given by P (X ≥ 0.25) = : 1 0.25 = 60 : 2 60 x3 (1 − x)2 dx 1 0.25 ! " x3 − 2x4 + x5 dx 2x5 x6 x4 − + = 60 4 5 6 657 = 0.9624. = 60 40960 31 0.25 Example 6.19. The proportion of time per day that all checkout counters in a supermarket are busy follows a distribution f (x) = & k x2 (1 − x)9 for 0 < x < 1 0 otherwise. What is the value of the constant k so that f (x) is a valid probability density function? Proof: Using the definition of the beta function, we get that : 0 1 x2 (1 − x)9 dx = B(3, 10). Hence by Theorem 6.4, we obtain B(3, 10) = Γ(3) Γ(10) 1 = . Γ(13) 660 Hence k should be equal to 660. The beta distribution can be generalized to any bounded interval [a, b]. This generalized distribution is called the generalized beta distribution. If a random variable X has this generalized beta distribution we denote it by writing X ∼ GBET A(α, β, a, b). The probability density of the generalized beta distribution is given by f (x) =    (x−a)α−1 (b−x)β−1 1 B(α,β) (b−a)α+β−1 if a < x < b 0 otherwise Some Special Continuous Distributions 166 where α, β, a > 0. If X ∼ GBET A(α, β, a, b), then E(X) = (b − a) V ar(X) = (b − a)2 α +a α+β (α + β)2 αβ . (α + β + 1) It can be shown that if X = (b − a)Y + a and Y ∼ BET A(α, β), then X ∼ GBET A(α, β, a, b). Thus using Theorem 6.5, we get E(X) = E((b − a)Y + a) = (b − a)E(Y ) + a = (b − a) α +a α+β and V ar(X) = V ar((b−a)Y +a) = (b−a)2 V ar(Y ) = (b−a)2 αβ (α + β)2 (α + β + 1) . 6.4. Normal Distribution Among continuous probability distributions, the normal distribution is very well known since it arises in many applications. Normal distribution was discovered by a French mathematician Abraham DeMoivre (1667-1754). DeMoivre wrote two important books. One is called the Annuities Upon Lives, the first book on actuarial sciences and the second book is called the Doctrine of Chances, one of the early books on the probability theory. PierreSimon Laplace (1749-1827) applied normal distribution to astronomy. Carl Friedrich Gauss (1777-1855) used normal distribution in his studies of problems in physics and astronomy. Adolphe Quetelet (1796-1874) demonstrated that man’s physical traits (such as height, chest expansion, weight etc.) as well as social traits follow normal distribution. The main importance of normal distribution lies on the central limit theorem which says that the sample mean has a normal distribution if the sample size is large. Definition 6.7. 
A random variable X is said to have a normal distribution if its probability density function is given by f (x) = x−µ 2 1 1 √ e− 2 ( σ ) , σ 2π −∞ < x < ∞, where −∞ < µ < ∞ and 0 < σ 2 < ∞ are arbitrary parameters. If X has a normal distribution with parameters µ and σ 2 , then we write X ∼ N (µ, σ 2 ). Probability and Mathematical Statistics 167 Example 6.20. Is the real valued function defined by f (x) = x−µ 2 1 1 √ e− 2 ( σ ) , σ 2π −∞ < x < ∞ a probability density function of some random variable X? Answer: To answer this question, we must check that f is nonnegative and it integrates to 1. The nonnegative part is trivial since the exponential function is always positive. Hence using property of the gamma function, we show that f integrates to 1 on R. I : ∞ : ∞ x−µ 2 1 1 √ e− 2 ( σ ) dx f (x) dx = −∞ σ 2 π −∞ =2 : ∞ x−µ 2 1 1 √ e− 2 ( σ ) dx σ 2π µ 2 = √ σ 2π 1 =√ π : 1 =√ Γ π : 0 ∞ 0 ∞ −z e σ √ dz, 2z where 1 z= 2 # x−µ σ $2 1 √ e−z dz z # $ 1 √ 1 =√ π = 1. 2 π The following theorem tells us that the parameter µ is the mean and the parameter σ 2 is the variance of the normal distribution. Some Special Continuous Distributions 168 Theorem 6.6. If X ∼ N (µ, σ 2 ), then E(X) = µ V ar(X) = σ 2 1 M (t) = eµt+ 2 σ 2 2 t . Proof: We prove this theorem by first computing the moment generating function and finding out the mean and variance of X from it. ! " M (t) = E etX : ∞ etx f (x) dx = −∞ : ∞ x−µ 2 1 1 e− 2 ( σ ) dx etx √ = σ 2π −∞ = : ∞ etx −∞ = : ∞ −∞ = : ∞ −∞ 2 2 1 1 √ e− 2σ2 (x −2µx+µ ) dx σ 2π 2 2 2 1 1 √ e− 2σ2 (x −2µx+µ −2σ tx) dx σ 2π 2 2 1 1 2 2 1 √ e− 2σ2 (x−µ−σ t) eµt+ 2 σ t dx σ 2π : µt+ 21 σ 2 t2 =e ∞ −∞ 1 = eµt+ 2 σ 2 2 t 2 2 1 1 √ e− 2σ2 (x−µ−σ t) dx σ 2π . The last integral integrates to 1 because the integrand is the probability density function of a normal random variable whose mean is µ + σ 2 t and variance σ 2 , that is N (µ + σ 2 t, σ 2 ). Finally, from the moment generating function one determines the mean and variance of the normal distribution. We leave this part to the reader. Example 6.21. If X is any random variable with mean µ and variance σ 2 > 0, then what are the mean and variance of the random variable Y = X−µ σ ? Probability and Mathematical Statistics 169 Answer: The mean of the random variable Y is $ # X −µ E(Y ) = E σ 1 = E (X − µ) σ 1 = (E(X) − µ) σ 1 = (µ − µ) σ = 0. The variance of Y is given by V ar(Y ) = V ar # X −µ σ $ 1 V ar (X − µ) σ2 1 = V ar(X) σ 1 = 2 σ2 σ = 1. = Hence, if we define a new random variable by taking a random variable and subtracting its mean from it and then dividing the resulting by its standard deviation, then this new random variable will have zero mean and unit variance. Definition 6.8. A normal random variable is said to be standard normal, if its mean is zero and variance is one. We denote a standard normal random variable X by X ∼ N (0, 1). The probability density function of standard normal distribution is the following: x2 1 −∞ < x < ∞. e− 2 , f (x) = √ 2π Example 6.22. If X ∼ N (0, 1), what is the probability of the random variable X less than or equal to −1.72? Answer: P (X ≤ −1.72) = 1 − P (X ≤ 1.72) = 1 − 0.9573 = 0.0427. (from table) Some Special Continuous Distributions 170 Example 6.23. If Z ∼ N (0, 1), what is the value of the constant c such that P (|Z| ≤ c) = 0.95? Answer: 0.95 = P (|Z| ≤ c) = P (−c ≤ Z ≤ c) = P (Z ≤ c) − P (Z ≤ −c) = 2 P (Z ≤ c) − 1. Hence P (Z ≤ c) = 0.975, and from this using the table we get c = 1.96. 
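The table lookups in Examples 6.22 and 6.23 can be reproduced with any statistical library. Here is a minimal sketch using Python's scipy.stats (assuming scipy is available): norm.cdf plays the role of the standard normal table and norm.ppf inverts it.

    from scipy.stats import norm

    # Example 6.22: P(X <= -1.72) for X ~ N(0, 1)
    print(norm.cdf(-1.72))       # approximately 0.0427

    # Example 6.23: the constant c with P(|Z| <= c) = 0.95, i.e. P(Z <= c) = 0.975
    print(norm.ppf(0.975))       # approximately 1.96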
The following theorem is very important and allows us to find probabilities by using the standard normal table. Theorem 6.7. If X ∼ N (µ, σ 2 ), then the random variable Z = N (0, 1). X−µ σ ∼ Proof: We will show that Z is standard normal by finding the probability density function of Z. We compute the probability density of Z by cumulative distribution function method. F (z) = P (Z ≤ z) # $ X −µ =P ≤z σ = P (X ≤ σ z + µ) : σ z+µ x−µ 2 1 1 √ = e− 2 ( σ ) dx σ 2π −∞ : z 2 1 1 √ = σ e− 2 w dw, −∞ σ 2 π Hence where w = x−µ . σ 1 2 1 f (z) = F # (z) = √ e− 2 z . 2π The following example illustrates how to use standard normal table to find probability for normal random variables. Example 6.24. If X ∼ N (3, 16), then what is P (4 ≤ X ≤ 8)? Probability and Mathematical Statistics Answer: 171 # X −3 8−3 4−3 P (4 ≤ X ≤ 8) = P ≤ ≤ 4 4 4 $ # 5 1 ≤Z≤ =P 4 4 $ = P (Z ≤ 1.25) − P (Z ≤ 0.25) = 0.8944 − 0.5987 = 0.2957. Example 6.25. If X ∼ N (25, 36), then what is the value of the constant c such that P (|X − 25| ≤ c) = 0.9544? Answer: 0.9544 = P (|X − 25| ≤ c) = P (−c ≤ X − 25 ≤ c) # $ c X − 25 c =P − ≤ ≤ 6 6 6 , c c =P − ≤Z≤ 6, , 6 cc=P Z≤ −P Z ≤− 6 6 , c − 1. = 2P Z ≤ 6 Hence , cP Z≤ = 0.9772 6 and from this, using the normal table, we get c =2 6 or c = 12. The following theorem can be proved similar to Theorem 6.7. -2 , ∼ χ2 (1). Theorem 6.8. If X ∼ N (µ, σ 2 ), then the random variable X−µ σ , -2 Proof: Let W = X−µ and Z = X−µ σ σ . We will show that the random variable W is chi-square with 1 degree of freedom. This amounts to showing that the probability density function of W to be g(w) =  √1 2π w  0 1 e− 2 w if 0 < w < ∞ otherwise . Some Special Continuous Distributions 172 We compute the probability density function of W by distribution function method. Let G(w) be the cumulative distribution function W , which is G(w) = P (W ≤ w) + )# $2 X −µ =P ≤w σ $ # √ X −µ √ =P − w≤ ≤ w σ ! √ √ " =P − w≤Z≤ w : √w = √ f (z) dz, − w where f (z) denotes the probability density function of the standard normal random variable Z. Thus, the probability density function of W is given by d G(w) dw √ : w d f (z) dz = dw −√w √ √ !√ " d w ! √ " d (− w) =f −f − w w dw dw 1 −1w 1 1 −1w 1 2 2 √ +√ e √ =√ e 2 w 2 w 2π 2π 1 1 e− 2 w . =√ 2π w g(w) = Thus, we have shown that W is chi-square with one degree of freedom and the proof is now complete. ! " Example 6.26. If X ∼ N (7, 4), what is P 15.364 ≤ (X − 7)2 ≤ 20.095 ? Answer: Since X ∼ N (7, 4), we get µ = 7 and σ = 2. Thus ! " P 15.364 ≤ (X − 7)2 ≤ 20.095 + ) # $2 15.364 X −7 20.095 ≤ ≤ =P 4 2 4 ! " 2 = P 3.841 ≤ Z ≤ 5.024 ! " ! " = P 0 ≤ Z 2 ≤ 5.024 − P 0 ≤ Z 2 ≤ 3.841 = 0.975 − 0.949 = 0.026. Probability and Mathematical Statistics 173 A generalization of the normal distribution is the following: "ν ! ϕ(ν) ν ϕ(ν) |x−µ| − σ g(x) = e 2 σ Γ(1/ν) where ϕ(ν) = G Γ(3/ν) Γ(1/ν) and ν and σ are real positive constants and −∞ < µ < ∞ is a real constant. The constant µ represents the mean and the constant σ represents the standard deviation of the generalized normal distribution. If ν = 2, then generalized normal distribution reduces to the normal distribution. If ν = 1, then the generalized normal distribution reduces to the Laplace distribution whose density function is given by f (x) = 1 − |x−µ| e θ 2θ where θ = √σ2 . The generalized normal distribution is very useful in signal processing and in particular modeling of the discrete cosine transform (DCT) coefficients of a digital image. 6.5. 
Lognormal Distribution The study lognormal distribution was initiated by Galton and McAlister in 1879. They came across this distribution while studying the use of the geometric mean as an estimate of location. Later, Kapteyn (1903) discussed the genesis of this distribution. This distribution can be defined as the distribution of a random variable whose logarithm is normally distributed. Often the size distribution of organisms, the distribution of species, the distribution of the number of persons in a census occupation class, the distribution of stars in the universe, and the distribution of the size of incomes are modeled by lognormal distributions. The lognormal distribution is used in biology, astronomy, economics, pharmacology and engineering. This distribution is sometimes known as the Galton-McAlister distribution. In economics, the lognormal distribution is called the Cobb-Douglas distribution. Definition 6.10. A random variable X is said to have a lognormal distribution if its probability density function is given by ! ln(x)−µ "2  − 21 1  σ √ , if 0 < x < ∞ e f (x) = x σ 2 π  0 otherwise , Some Special Continuous Distributions 174 where −∞ < µ < ∞ and 0 < σ 2 < ∞ are arbitrary parameters. If X has a lognormal distribution with parameters µ and σ 2 , then we H write X ∼ \(µ, σ 2 ). H Example 6.27. If X ∼ \(µ, σ 2 ), what is the 100pth percentile of X? Answer: Let q be the 100pth percentile of X. Then by definition of percentile, we get : q ! ln(x)−µ "2 1 −1 σ √ e 2 dx. p= 0 xσ 2π Substituting z = ln(x)−µ σ in the above integral, we have p= : ln(q)−µ σ −∞ zp = : −∞ 1 2 1 √ e− 2 z dz 2π 1 2 1 √ e− 2 z dz, 2π is the 100pth of the standard normal random variable. where zp = ln(q)−µ σ Hence 100pth percentile of X is q = eσ zp +µ , where zp is the 100pth percentile of the standard normal random variable Z. H Theorem 6.9. If X ∼ \(µ, σ 2 ), then 1 2 E(X) = eµ+ 2 σ A 2 B 2 V ar(X) = eσ − 1 e2µ+σ . Probability and Mathematical Statistics 175 Proof: Let t be a positive integer. We compute the tth moment of X. : ∞ ! t" xt f (x) dx E X = 0 : ∞ ! ln(x)−µ "2 1 − 12 t σ √ e dx. x = xσ 2π 0 Substituting z = ln(x) in the last integral, we get : ∞ ! " z−µ 2 1 1 e− 2 ( σ ) dz = MZ (t), etz √ E Xt = σ 2π −∞ where MZ (t) denotes the moment generating function of the random variable Z ∼ N (µ, σ 2 ). Therefore, 1 2 2 1 2 MZ (t) = eµt+ 2 σ t . Thus letting t = 1, we get E(X) = eµ+ 2 σ . Similarly, taking t = 2, we have 2 E(X 2 ) = e2µ+2σ . Thus, we have A 2 B 2 V ar(X) = E(X 2 ) − E(X)2 = eσ − 1 e2µ+σ and now the proof of the theorem is complete. H Example 6.28. If X ∼ \(0, 4), then what is the probability that X is between 1 and 12.1825? H Answer: Since X ∼ \(0, 4), the random variable Y = ln(X) ∼ N (0, 4). Hence P (1 ≤ X ≤ 12.1825) = P (ln(1) ≤ ln(X) ≤ ln(12.1825)) = P (0 ≤ Y ≤ 2.50) = P (0 ≤ Z ≤ 1.25) = P (Z ≤ 1.25) − P (Z ≤ 0) = 0.8944 − 0.5000 = 0.4944. Some Special Continuous Distributions 176 Example 6.29. If the amount of time needed to solve a problem by a group of students follows the lognormal distribution with parameters µ and σ 2 , then what is the value of µ so that the probability of solving a problem in 10 minutes or less by any randomly picked student is 95% when σ 2 = 4? Answer: Let the random variable X denote the amount of time needed H to a solve a problem. Then X ∼ \(µ, 4). We want to find µ so that P (X ≤ 10) = 0.95. Hence 0.95 = P (X ≤ 10) = P (ln(X) ≤ ln(10)) = P (ln(X) − µ ≤ ln(10) − µ) $ # ln(10) − µ ln(X) − µ ≤ =P 2 2 # $ ln(10) − µ =P Z≤ , 2 where Z ∼ N (0, 1). 
Using the table for standard normal distribution, we get ln(10) − µ = 1.65. 2 Hence µ = ln(10) − 2(1.65) = 2.3025 − 3.300 = −0.9975. 6.6. Inverse Gaussian Distribution If a sufficiently small macroscopic particle is suspended in a fluid that is in thermal equilibrium, the particle will move about erratically in response to natural collisional bombardments by the individual molecules of the fluid. This erratic motion is called “Brownian motion” after the botanist Robert Brown (1773-1858) who first observed this erratic motion in 1828. Independently, Einstein (1905) and Smoluchowski (1906) gave the mathematical description of Brownian motion. The distribution of the first passage time in Brownian motion is the inverse Gaussian distribution. This distribution was systematically studied by Tweedie in 1945. The interpurchase times of toothpaste of a family, the duration of labor strikes in a geographical region, word frequency in a language, conversion time for convertible bonds, length of employee service, and crop field size follow inverse Gaussian distribution. Inverse Gaussian distribution is very useful for analysis of certain skewed data. Probability and Mathematical Statistics 177 Definition 6.10. A random variable X is said to have an inverse Gaussian distribution if its probability density function is given by = λ(x−µ)2   λ x− 32 e− 2µ2 x , if 0 < x < ∞ 2π f (x) =   0 otherwise, where 0 < µ < ∞ and 0 < λ < ∞ are arbitrary parameters. If X has an inverse Gaussian distribution with parameters µ and λ, then we write X ∼ IG(µ, λ). The characteristic function φ(t) of X ∼ IG(µ, λ) is ! " φ(t) = E eitX A I B 2 =e λ µ 1− 1− 2iµ λ t . Some Special Continuous Distributions 178 Using this, we have the following theorem. Theorem 6.10. If X ∼ IG(µ, λ), then E(X) = µ V ar(X) = µ3 . λ ! " ! " Proof: Since φ(t) = E eitX , the derivative φ# (t) = i E XeitX . Therefore φ# (0) = i E (X). We know the characteristic function φ(t) of X ∼ IG(µ, λ) is A I B φ(t) = e λ µ 1− 2t 1− 2iµ λ . Differentiating φ(t) with respect to t, we have B0 / A I 2iµ2 t λ 1− 1− d µ λ φ# (t) = e dt A I B ) / 0+ @ 2iµ2 t λ 1− 1− d λ 2iµ2 t µ λ 1− 1− =e dt µ λ A I B # 1 $ 2t −2 λ 1− 2iµ 2iµ2 t µ 1− λ = iµ e . 1− λ Hence φ# (0) = i µ. Therefore, E(X) = µ. Similarly, one can show that V ar(X) = µ3 . λ This completes the proof of the theorem. The distribution function F (x) of the inverse Gaussian random variable X with parameters µ and λ was computed by Shuster (1968) as F (x) = Φ )G ) G 2 2 3+ 3+ 2λ x λ x λ −1 +1 , +eµ Φ − µ µ µ µ where Φ is the distribution function of the standard normal distribution function. 6.7. Logistics Distribution The logistic distribution is often considered as an alternative to the univariate normal distribution. The logistic distribution has a shape very close Probability and Mathematical Statistics 179 to that of a normal distribution but has heavier tails than the normal. The logistic distribution is used in modeling demographic data. It is also used as an alternative to the Weibull distribution in life-testing. Definition 6.11. A random variable X is said to have a logistic distribution if its probability density function is given by f (x) = σ π √ 3 A − √π e 3 ( x−µ σ ) − √π 1+e 3 ( x−µ σ ) − ∞ < x < ∞, B2 where −∞ < µ < ∞ and σ > 0 are parameters. If X has a logistic distribution with parameters µ and σ, then we write X ∼ LOG(µ, σ). Theorem 6.11. If X ∼ LOG(µ, λ), then E(X) = µ V ar(X) = σ 2 µt M (t) = e ) √ 3 σt Γ 1+ π + ) + 3 Γ 1− σt , π √ |t| < π √ . 
σ 3 Proof: First, we derive the moment generating function of X and then we Some Special Continuous Distributions 180 compute the mean and variance of it. The moment generating function is M (t) = : ∞ etx f (x) dx −∞ = : ∞ tx e −∞ µt =e = eµt = eµt = eµt : ∞ σ π √ 3 A − √π e 3 ( x−µ σ ) − √π 1+e 3 ( x−µ σ ) e−w B2 dx π(x − µ) √ e and s = 2 dw, where w = −w 3σ (1 + e ) −∞ : ∞ ! −w "−s e−w e 2 dw (1 + e−w ) −∞ : 1 ! −1 "−s 1 z −1 dz, where z = 1 + e−w 0 : 1 z s (1 − z)−s dz sw √ 3σ t π 0 = eµt B(1 + s, 1 − s) Γ(1 + s) Γ(1 − s) = eµt Γ(1 + s + 1 − s) Γ(1 + s) Γ(1 − s) = eµt Γ(2) µt = e Γ(1 + s) Γ(1 − s) ) + ) + √ √ 3 3 µt =e Γ 1+ σt Γ 1 − σt π π )√ + )√ + 3 σ 3 σ = eµt t cosec t . π π We leave the rest of the proof to the reader. 6.8. Review Exercises 1. If Y ∼ U N IF (0, 1), then what is the probability density function of X = − ln Y ? 2. Let the probability density function of X be f (x) = J e−x 0 if x > 0 otherwise . Let Y = 1 − e−X . Find the distribution of Y . Probability and Mathematical Statistics 181 3. After a certain time the weight W of crystals formed is given approximately by W = eX where X ∼ N (µ, σ 2 ). What is the probability density function of W for 0 < w < ∞ ? 4. What is the probability that a normal random variable with mean 6 and standard deviation 3 will fall between 5.7 and 7.5 ? 5. Let X have a distribution with the 75th percentile equal to bility density function equal to & −λx λe for 0 < x < ∞ f (x) = 0 otherwise. 1 3 and proba- What is the value of the parameter λ ? 6. If a normal distribution with mean µ and variance σ 2 > 0 has 46th percentile equal to 20σ, then what is µ in term of standard deviation? 7. Let X be a random variable with cumulative distribution function & 0 if x ≤ 0 F (x) = 1 − e−x if x > 0. ! " What is P 0 ≤ eX ≤ 4 ? 8. Let X have the density function  Γ(α+β) α−1  Γ(α) (1 − x)β−1 Γ(β) x f (x) =  0 for 0 < x < 1 otherwise, where α > 0 and β > 0. If β = 6 and α = 5, what is the mean of the random −1 variable (1 − X) ? 9. R.A. Fisher proved that when n ≥ 30 and Y has a chi-square distribution √ √ with n degrees freedom, then 2Y − 2n − 1 has an approximate standard normal distribution. Under this approximation, what is the 90th percentile of Y when n = 41 ? 10. Let Y have a chi-square distribution with 32 degrees of freedom so that its variance is 64. If P (Y > c) = 0.0668, then what is the approximate value of the constant c? 11. If in a certain normal distribution of X, the probability is 0.5 that X is less than 500 and 0.0227 that X is greater than 650. What is the standard deviation of X? Some Special Continuous Distributions 182 12. If X ∼ N (5, 4), then what is the probability that 8 < Y < 13 where Y = 2X + 1? 13. Given the probability density function of a random variable X as   θ e−θx if x > 0 f (x) =  0 otherwise, what is the nth moment of X about the origin? 14. If the random variable X is normal with mean 1 and standard deviation ! " 2, then what is P X 2 − 2X ≤ 8 ? 15. Suppose X has a standard normal distribution and Y = eX . What is the k th moment of Y ? 16. If the random variable X has uniform distribution on the interval [0, a], what is the probability that the random variable greater than its square, that ! " is P X > X 2 ? 17. If the random variable Y has a chi-square distribution with 54 degrees of freedom, then what is the approximate 84th percentile of Y ? 18. Let X be a continuous random variable with density function & 2 for 1 < x < 2 x2 f (x) = 0 elsewhere. √ If Y = X, what is the density function for Y where nonzero? 
19. If X is normal with mean 0 and variance 4, then what is the probability ! " 4 4 ≥ 0, that is P X − X ≥0 ? of the event X − X 20. If the waiting time at Rally’s drive-in-window is normally distributed with mean 13 minutes and standard deviation 2 minutes, then what percentage of customers wait longer than 10 minutes but less than 15 minutes? 21. If X is uniform on the interval from −5 to 5, what is the probability that the quadratic equation 100t2 + 20tX + 2X + 3 = 0 has complex solutions? 22. If the random variable X ∼ Exp(θ), then what is the probability density √ function of the random variable Y = X X? 23. If the random variable X ∼ N (0, 1), then what is the probability density I function of the random variable Y = |X|? Probability and Mathematical Statistics 183 H 24. If the random variable X ∼ \(µ, σ 2 ), then what is the probability density function of the random variable ln(X)? H 25. If the random variable X ∼ \(µ, σ 2 ), then what is the mode of X? H 26. If the random variable X ∼ \(µ, σ 2 ), then what is the median of X? H 27. If the random variable X ∼ \(µ, σ 2 ), then what is the probability that the quadratic equation 4t2 + 4tX + X + 2 = 0 has real solutions? dy + q(x) y = 0 28. Consider the Karl Pearson’s differential equation p(x) dx 2 where p(x) = a + bx + cx and q(x) = x − d. Show that if a = c = 0, d b > 0, d > −b, then y(x) is gamma; and if a = 0, b = −c, d−1 b < 1, b > −1, then y(x) is beta. 29. Let a, b, α, β be any four real numbers with a < b and α, β positive. If X ∼ BET A(α, β), then what is the probability density function of the random variable Y = (b − a)X + a? 30. A nonnegative continuous random variable X is said to be memoryless if P (X > s + t/X > t) = P (X > s) for all s, t ≥ 0. Show that the exponential random variable is memoryless. 31. Show that every nonnegative continuous memoryless random variable is an exponential random variable. 32. Using gamma function evaluate the following integrals: 8∞ 8∞ 8∞ 8∞ 2 2 2 2 (i) 0 e−x dx; (ii) 0 x e−x dx; (iii) 0 x2 e−x dx; (iv) 0 x3 e−x dx. 33. Using beta function evaluate the following integrals: 81 8 100 81 (i) 0 x2 (1 − x)2 dx; (ii) 0 x5 (100 − x)7 dx; (iii) 0 x11 (1 − x3 )7 dx. 34. If Γ(z) denotes the gamma function, then prove that Γ(1 + t) Γ(1 − t) = tcosec(t). 35. Let α and β be given positive real numbers, with α < β. If two points are selected at random from a straight line segment of length β, what is the probability that the distance between them is at least α ? 36. If the random variable X ∼ GAM (θ, α), then what is the nth moment of X about the origin? Some Special Continuous Distributions 184 Probability and Mathematical Statistics 185 Chapter 7 TWO RANDOM VARIABLES There are many random experiments that involve more than one random variable. For example, an educator may study the joint behavior of grades and time devoted to study; a physician may study the joint behavior of blood pressure and weight. Similarly an economist may study the joint behavior of business volume and profit. In fact, most real problems we come across will have more than one underlying random variable of interest. 7.1. Bivariate Discrete Random Variables In this section, we develop all the necessary terminologies for studying bivariate discrete random variables. Definition 7.1. A discrete bivariate random variable (X, Y ) is an ordered pair of discrete random variables. Definition 7.2. Let (X, Y ) be a bivariate random variable and let RX and RY be the range spaces of X and Y , respectively. 
A real-valued function f : RX × RY → R I is called a joint probability density function for X and Y if and only if f (x, y) = P (X = x, Y = y) for all (x, y) ∈ RX × RY . Here, the event (X = x, Y = y) means the intersection of the events (X = x) and (Y = y), that is (X = x) . (Y = y). Example 7.1. Roll a pair of unbiased dice. If X denotes the smaller and Y denotes the larger outcome on the dice, then what is the joint probability density function of X and Y ? Two Random Variables 186 Answer: The sample space S of rolling two dice consists of {(1, 1) (2, 1) (3, 1) S= (4, 1) (5, 1) (6, 1) (1, 2) (2, 2) (3, 2) (4, 2) (5, 2) (6, 2) (1, 3) (2, 3) (3, 3) (4, 3) (5, 3) (6, 3) (1, 4) (2, 4) (3, 4) (4, 4) (5, 4) (6, 4) (1, 5) (2, 5) (3, 5) (4, 5) (5, 5) (6, 5) (1, 6) (2, 6) (3, 6) (4, 6) (5, 6) (6, 6)} The probability density function f (x, y) can be computed for X = 2 and Y = 3 as follows: There are two outcomes namely (2, 3) and (3, 2) in the sample S of 36 outcomes which contribute to the joint event (X = 2, Y = 3). Hence 2 f (2, 3) = P (X = 2, Y = 3) = . 36 Similarly, we can compute the rest of the probabilities. The following table shows these probabilities: 6 2 36 2 36 2 36 2 36 2 36 1 36 5 2 36 2 36 2 36 2 36 1 36 0 4 2 36 2 36 2 36 1 36 0 0 3 2 36 2 36 1 36 0 0 0 2 2 36 1 36 0 0 0 0 1 1 36 0 0 0 0 0 1 2 3 These tabulated values can be written  1  36    2 f (x, y) = 36     0 4 5 6 as if 1 ≤ x = y ≤ 6 if 1 ≤ x < y ≤ 6 otherwise. Example 7.2. A group of 9 executives of a certain firm include 4 who are married, 3 who never married, and 2 who are divorced. Three of the Probability and Mathematical Statistics 187 executives are to be selected for promotion. Let X denote the number of married executives and Y the number of never married executives among the 3 selected for promotion. Assuming that the three are randomly selected from the nine available, what is the joint probability density function of the random variables X and Y ? !" Answer: The number of ways we can choose 3 out of 9 is 93 which is 84. Thus 0 =0 f (0, 0) = P (X = 0, Y = 0) = 84 f (1, 0) = P (X = 1, Y = 0) = !4" ! 3 " ! 2" = 4 84 f (2, 0) = P (X = 2, Y = 0) = !4" ! 3 " ! 2" = 12 84 f (3, 0) = P (X = 3, Y = 0) = !4" ! 3 " ! 2" = 4 . 84 1 0 2 84 2 1 0 84 3 0 0 84 Similarly, we can find the rest of the probabilities. The following table gives the complete information about these probabilities. 3 1 84 0 2 6 84 12 84 0 0 1 3 84 24 84 18 84 0 0 0 4 84 12 84 4 84 0 1 0 2 0 3 Definition 7.3. Let (X, Y ) be a discrete bivariate random variable. Let RX and RY be the range spaces of X and Y , respectively. Let f (x, y) be the joint probability density function of X and Y . The function f1 (x) = % y∈RY f (x, y) Two Random Variables 188 is called the marginal probability density function of X. Similarly, the function % f2 (y) = f (x, y) x∈RX is called the marginal probability density function of Y . Marginal Density of Y The following diagram illustrates the concept of marginal graphically. Joint Density of (X, Y) Marginal Density of X Example 7.3. If the joint probability density function of the discrete random variables X and Y is given by  1 if 1 ≤ x = y ≤ 6  36    2 f (x, y) = 36 if 1 ≤ x < y ≤ 6     0 otherwise, then what are marginals of X and Y ? Answer: The marginal of X can be obtained by summing the joint probability density function f (x, y) for all y values in the range space RY of the random variable Y . 
That is % f1 (x) = f (x, y) y∈RY = 6 % f (x, y) y=1 = f (x, x) + % y>x f (x, y) + % f (x, y) y<x 1 2 + (6 − x) +0 36 36 1 [13 − 2 x] , x = 1, 2, ..., 6. = 36 = Probability and Mathematical Statistics 189 Similarly, one can obtain the marginal probability density of Y by summing over for all x values in the range space RX of the random variable X. Hence % f2 (y) = f (x, y) x∈RX = 6 % f (x, y) x=1 = f (y, y) + % f (x, y) + x<y % f (x, y) x>y 2 1 + (y − 1) +0 36 36 1 = [2y − 1] , y = 1, 2, ..., 6. 36 = Example 7.4. Let X and Y be discrete random variables with joint probability density function & 1 if x = 1, 2; y = 1, 2, 3 21 (x + y) f (x, y) = 0 otherwise. What are the marginal probability density functions of X and Y ? Answer: The marginal of X is given by f1 (x) = 3 % 1 (x + y) 21 y=1 1 1 3x + [1 + 2 + 3] 21 21 x+2 = , x = 1, 2. 7 = Similarly, the marginal of Y is given by f2 (y) = 2 % 1 (x + y) 21 x=1 3 2y + 21 21 3 + 2y = , 21 = y = 1, 2, 3. From the above examples, note that the marginal f1 (x) is obtained by summing across the columns. Similarly, the marginal f2 (y) is obtained by summing across the rows. Two Random Variables 190 The following theorem follows from the definition of the joint probability density function. Theorem 7.1. A real valued function f of two variables is a joint probability density function of a pair of discrete random variables X and Y (with range spaces RX and RY , respectively) if and only if (a) f (x, y) ≥ 0 % % (b) for all (x, y) ∈ RX × RY ; f (x, y) = 1. x∈RX y∈RY Example 7.5. For what value of the constant k the function given by f (x, y) = & k xy if x = 1, 2, 3; y = 1, 2, 3 0 otherwise is a joint probability density function of some random variables X and Y ? Answer: Since 1= 3 3 % % f (x, y) x=1 y=1 = 3 3 % % kxy x=1 y=1 = k [1 + 2 + 3 + 2 + 4 + 6 + 3 + 6 + 9] = 36 k. Hence 1 36 and the corresponding density function is given by k= f (x, y) = & 1 36 0 xy if x = 1, 2, 3; y = 1, 2, 3 otherwise . As in the case of one random variable, there are many situations where one wants to know the probability that the values of two random variables are less than or equal to some real numbers x and y. Probability and Mathematical Statistics 191 Definition 7.4. Let X and Y be any two discrete random variables. The real valued function F : R I2 → R I is called the joint cumulative probability distribution function of X and Y if and only if F (x, y) = P (X ≤ x, Y ≤ y) K for all (x, y) ∈ R I 2 . Here, the event (X ≤ x, Y ≤ y) means (X ≤ x) (Y ≤ y). From this definition it can be shown that for any real numbers a and b F (a ≤ X ≤ b, c ≤ Y ≤ d) = F (b, d) + F (a, c) − F (a, d) − F (b, c). Further, one can also show that F (x, y) = %% f (s, t) s≤x t≤y where (s, t) is any pair of nonnegative numbers. 7.2. Bivariate Continuous Random Variables In this section, we shall extend the idea of probability density functions of one random variable to that of two random variables. Definition 7.5. The joint probability density function of the random variables X and Y is an integrable function f (x, y) such that (a) f (x, y) ≥ 0 for all (x, y) ∈ R I 2 ; and 8∞ 8∞ (b) −∞ −∞ f (x, y) dx dy = 1. Example 7.6. Let the joint density function of X and Y be given by f (x, y) = & k xy 2 if 0 < x < y < 1 0 otherwise. What is the value of the constant k ? Two Random Variables 192 Answer: Since f is a joint probability density function, we have : ∞: ∞ 1= f (x, y) dx dy −∞ −∞ = : 1 : = = k 2 k x y 2 dx dy 1 k y2 : y x dx dy 0 0 = y 0 0 = : : 1 y 4 dy 0 k ; 5 <1 y 0 10 k . 10 Hence k = 10. 
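Normalization computations such as those in Examples 7.5 and 7.6 are easy to double-check numerically. The following is a minimal sketch in Python (assuming scipy is available); it is only a sanity check of the constants found above.

    from scipy.integrate import dblquad

    # Example 7.5 (discrete): the sum of k*x*y over x, y = 1, 2, 3 must equal 1.
    # The sum of x*y over the grid is 36, hence k = 1/36.
    grid_sum = sum(x * y for x in (1, 2, 3) for y in (1, 2, 3))
    print(grid_sum)              # 36

    # Example 7.6 (continuous): with k = 10 the density 10*x*y**2 on 0 < x < y < 1
    # integrates to one.  dblquad integrates its first argument (here x) on the
    # inner integral: for each fixed y in (0, 1), x runs from 0 to y.
    total, _ = dblquad(lambda x, y: 10 * x * y ** 2, 0, 1, lambda y: 0, lambda y: y)
    print(total)                 # approximately 1.0, confirming k = 10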
If we know the joint probability density function f of the random variables X and Y , then we can compute the probability of the event A from : : P (A) = f (x, y) dx dy. A Example 7.7. Let the joint density of and Y be " &6 ! 2 5 x + 2 xy f (x, y) = 0 What is the probability of the event (X the continuous random variables X if 0 ≤ x ≤ 1; 0 ≤ y ≤ 1 elsewhere. ≤Y) ? Probability and Mathematical Statistics 193 Answer: Let A = (X ≤ Y ). we want to find : : P (A) = f (x, y) dx dy A = : 1 0 = = = = 2: y 0 6 5 : 6 5 : 1 0 0 1 2 3 " 6 ! 2 x + 2 x y dx dy 5 x3 + x2 y 3 3x=y dy x=0 4 3 y dy 3 2 ; 4 <1 y 0 5 2 . 5 Definition 7.6. Let (X, Y ) be a continuous bivariate random variable. Let f (x, y) be the joint probability density function of X and Y . The function : ∞ f1 (x) = f (x, y) dy −∞ is called the marginal probability density function of X. Similarly, the function : ∞ f (x, y) dx f2 (y) = −∞ is called the marginal probability density function of Y . Example 7.8. If the joint density function for X and Y is given by   43 for 0 < y 2 < x < 1 f (x, y) =  0 otherwise, then what is the marginal density function of X, for 0 < x < 1? Answer: The domain of the f consists of the region bounded by the curve x = y 2 and the vertical line x = 1. (See the figure on the next page.) Two Random Variables 194 Hence f1 (x) = : √ x 2 3 = y 4 = 3 dy 4 √ − x 3√x √ − x 3√ x. 2 Example 7.9. Let X and Y have joint density function f (x, y) = & 2 e−x−y for 0 < x ≤ y < ∞ 0 otherwise. What is the marginal density of X where nonzero? Probability and Mathematical Statistics 195 Answer: The marginal density of X is given by f1 (x) = : ∞ f (x, y) dy −∞ : ∞ 2 e−x−y dy : ∞ = 2 e−x e−y dy x ; <∞ = 2 e−x −e−y x = x = 2 e−x e−x = 2 e−2x 0 < x < ∞. Example 7.10. Let (X, Y ) be distributed uniformly on the circular disk centered at (0, 0) with radius √2π . What is the marginal density function of X where nonzero? Answer: The equation of a circle with radius x2 + y 2 = 4 . π Hence, solving this equation for y, we get @ y=± 4 − x2 . π Thus, the marginal density of X is given by √2 π and center at the origin is Two Random Variables 196 f1 (x) = : √ π4 −x2 √4 = : √ π4 −x2 √4 − = f (x, y) dy 2 π −x − π : √ π4 −x2 √4 − π 2 1 = y 4 1 = 2 @ 1 dy area of the circle −x2 −x2 1 dy 4 3√ π4 −x2 √4 − 2 π −x 4 − x2 . π Definition 7.7. Let X and Y be the continuous random variables with joint probability density function f (x, y). The joint cumulative distribution function F (x, y) of X and Y is defined as F (x, y) = P (X ≤ x, Y ≤ y) = : y −∞ : x f (u, v) du dv −∞ for all (x, y) ∈ R I 2. From the fundamental theorem of calculus, we again obtain f (x, y) = ∂2F . ∂x ∂y Example 7.11. If the joint cumulative distribution function of X and Y is given by F (x, y) =  ! "  15 2 x3 y + 3 x2 y 2  0 then what is the joint density of X and Y ? for 0 < x, y < 1 elsewhere, Probability and Mathematical Statistics Answer: 197 " 1 ∂ ∂ ! 3 2 x y + 3 x2 y 2 5 ∂x ∂y " 1 ∂ ! 3 = 2 x + 6 x2 y 5 ∂x " 1 ! 2 = 6 x + 12 x y 5 6 = (x2 + 2 x y). 5 f (x, y) = Hence, the joint density of X and Y is given by f (x, y) = &6 ! 5 x2 + 2 x y 0 " for 0 < x, y < 1 elsewhere. Example 7.12. Let X and Y have the joint density function f (x, y) = & 2x for 0 < x < 1; 0 < y < 1 0 elsewhere. ! " What is P X + Y ≤ 1 / X ≤ 21 ? Answer: (See the diagram below.) Two Random Variables P # 1 X + Y ≤ 1/X ≤ 2 198 $ "< ; K! P (X + Y ≤ 1) X ≤ 12 " ! = P X ≤ 21 = 8 12 A8 0 = 1 6 1 4 = 2 . 3 1 2 0 B B 8 1 A8 1−y 2 x dx dy + 1 0 2 x dx dy 2 B 8 1 A8 21 2 x dx dy 0 0 Example 7.13. 
Let X and Y have the joint density function & x+y for 0 ≤ x ≤ 1; 0 ≤ y ≤ 1 f (x, y) = 0 elsewhere. What is P (2X ≤ 1 / X + Y ≤ 1) ? Answer: We know that ;! < "K X ≤ 21 (X + Y ≤ 1) P (2X ≤ 1 / X + Y ≤ 1) = . P (X + Y ≤ 1) 3 : 1 2 : 1−x P [X + Y ≤ 1] = (x + y) dy dx P 0 = = 2 0 x2 x3 (1 − x)3 − − 2 3 6 1 2 = . 6 3 31 0 Probability and Mathematical Statistics 199 Similarly P $ 3 : 1 : 1−x 2# 2 1 . (X + Y ≤ 1) = (x + y) dy dx X≤ 2 0 0 2 x3 (1 − x)3 x2 = − − 2 3 6 = Thus, P (2X ≤ 1 / X + Y ≤ 1) = 3 21 0 11 . 48 # 11 48 $ # $ 3 11 = . 1 16 7.3. Conditional Distributions First, we motivate the definition of conditional distribution using discrete random variables and then based on this motivation we give a general definition of the conditional distribution. Let X and Y be two discrete random variables with joint probability density f (x, y). Then by definition of the joint probability density, we have f (x, y) = P (X = x, Y = y). If A = {X = x}, B = {Y = y} and f2 (y) = P (Y = y), then from the above equation we have P ({X = x} / {Y = y}) = P (A / B) K P (A B) = P (B) P ({X = x} and {Y = y}) = P (Y = y) f (x, y) = . f2 (y) If we write the P ({X = x} / {Y = y}) as g(x / y), then we have g(x / y) = f (x, y) . f2 (y) Two Random Variables 200 For the discrete bivariate random variables, we can write the conditional probability of the event {X = x} given the event {Y = y} as the ratio of the K probability of the event {X = x} {Y = y} to the probability of the event {Y = y} which is f (x, y) g(x / y) = . f2 (y) We use this fact to define the conditional probability density function given two random variables X and Y . Definition 7.8. Let X and Y be any two random variables with joint density f (x, y) and marginals f1 (x) and f2 (y). The conditional probability density function g of X, given (the event) Y = y, is defined as g(x / y) = f (x, y) f2 (y) f2 (y) > 0. Similarly, the conditional probability density function h of Y , given (the event) X = x, is defined as h(y / x) = f (x, y) f1 (x) f1 (x) > 0. Example 7.14. Let X and Y be discrete random variables with joint probability function f (x, y) = & 1 21 (x + y) for x = 1, 2, 3; y = 1, 2. 0 elsewhere. What is the conditional probability density function of X, given Y = 2 ? Answer: We want to find g(x/2). Since g(x / 2) = f (x, 2) f2 (2) we should first compute the marginal of Y , that is f2 (2). The marginal of Y is given by 3 % 1 f2 (y) = (x + y) 21 x=1 = 1 (6 + 3 y). 21 Probability and Mathematical Statistics 201 Hence f2 (2) = 12 21 . Thus, the conditional probability density function of X, given Y = 2, is g(x/2) = = = f (x, 2) f2 (2) 1 21 (x + 2) 12 21 1 (x + 2), 12 x = 1, 2, 3. Example 7.15. Let X and Y be discrete random variables with joint probability density function & x+y for x = 1, 2; y = 1, 2, 3, 4 32 f (x, y) = 0 otherwise. What is the conditional probability of Y given X = x ? Answer: f1 (x) = 4 % f (x, y) y=1 4 1 % (x + y) = 32 y=1 = 1 (4 x + 10). 32 Therefore f (x, y) f1 (x) 1 (x + y) = 132 (4 x + 10) 32 x+y . = 4x + 10 Thus, the conditional probability Y given X = x is & x+y for x = 1, 2; y = 1, 2, 3, 4 4x+10 h(y/x) = 0 otherwise. h(y/x) = Example 7.16. Let X and Y be continuous random variables with joint pdf & 12 x for 0 < y < 2x < 1 f (x, y) = 0 otherwise . Two Random Variables 202 What is the conditional density function of Y given X = x ? Answer: First, we have to find the marginal of X. f1 (x) = = : ∞ −∞ : 2x f (x, y) dy 12 x dy 0 = 24 x2 . 
Thus, the conditional density of Y given X = x is f (x, y) f1 (x) 12 x = 24 x2 1 , for = 2x h(y/x) = 0 < y < 2x < 1 and zero elsewhere. Example 7.17. Let X and Y be random variables such that X has density function & 24 x2 for 0 < x < 21 f1 (x) = 0 elsewhere Probability and Mathematical Statistics 203 and the conditional density of Y given X = x is & h(y/x) = y 2 x2 for 0 < y < 2x 0 elsewhere . What is the conditional density of X given Y = y over the appropriate domain? Answer: The joint density f (x, y) of X and Y is given by f (x, y) = h(y/x) f1 (x) y = 24 x2 2 x2 = 12y for 0 < y < 2x < 1. The marginal density of Y is given by f2 (y) = : ∞ f (x, y) dx −∞ = : 1 2 y 2 12 y dx = 6 y (1 − y), for 0 < y < 1. Hence, the conditional density of X given Y = y is f (x, y) f2 (y) 12y = 6 y (1 − y) 2 = . 1−y g(x/y) = Thus, the conditional density of X given Y = y is given by g(x/y) = & 2 1−y for 0 < y < 2x < 1 0 otherwise. Note that for a specific x, the function f (x, y) is the intersection (profile) of the surface z = f (x, y) by the plane x = constant. The conditional density f (y/x), is the profile of f (x, y) normalized by the factor f11(x) . Two Random Variables 204 7.4. Independence of Random Variables In this section, we define the concept of stochastic independence of two random variables X and Y . The conditional probability density function g of X given Y = y usually depends on y. If g is independent of y, then the random variables X and Y are said to be independent. This motivates the following definition. Definition 7.8. Let X and Y be any two random variables with joint density f (x, y) and marginals f1 (x) and f2 (y). The random variables X and Y are (stochastically) independent if and only if f (x, y) = f1 (x) f2 (y) for all (x, y) ∈ RX × RY . Example 7.18. Let X and Y be discrete random variables with joint density  1 for 1 ≤ x = y ≤ 6  36 f (x, y) =  2 for 1 ≤ x < y ≤ 6. 36 Are X and Y stochastically independent? Answer: The marginals of X and Y are given by f1 (x) = 6 % f (x, y) y=1 = f (x, x) + % f (x, y) + y>x % f (x, y) % f (x, y) y<x 1 2 + (6 − x) +0 36 36 13 − 2x = , for x = 1, 2, ..., 6 36 = and f2 (y) = 6 % f (x, y) x=1 = f (y, y) + % x<y f (x, y) + x>y 1 2 + (y − 1) +0 36 36 2y − 1 , for y = 1, 2, ..., 6. = 36 = Probability and Mathematical Statistics 205 Since 11 1 1 += = f1 (1) f2 (1), 36 36 36 we conclude that f (x, y) += f1 (x) f2 (y), and X and Y are not independent. f (1, 1) = This example also illustrates that the marginals of X and Y can be determined if one knows the joint density f (x, y). However, if one knows the marginals of X and Y , then it is not possible to find the joint density of X and Y unless the random variables are independent. Example 7.19. Let X and Y have the joint density & −(x+y) e for 0 < x, y < ∞ f (x, y) = 0 otherwise. Are X and Y stochastically independent? Answer: The marginals of X and Y are given by : ∞ : ∞ f1 (x) = f (x, y) dy = e−(x+y) dy = e−x 0 and f2 (y) = : 0 Hence ∞ 0 f (x, y) dx = : ∞ e−(x+y) dx = e−y . 0 f (x, y) = e−(x+y) = e−x e−y = f1 (x) f2 (y). Thus, X and Y are stochastically independent. Notice that if the joint density f (x, y) of X and Y can be factored into two nonnegative functions, one solely depending on x and the other solely depending on y, then X and Y are independent. We can use this factorization approach to predict when X and Y are not independent. Example 7.20. Let X and Y have the joint density & x+y for 0 < x < 1; 0 < y < 1 f (x, y) = 0 otherwise. Are X and Y stochastically independent? 
Answer: Notice that f (x, y) = x + y , y=x 1+ . x Two Random Variables 206 Thus, the joint density cannot be factored into two nonnegative functions one depending on x and the other depending on y; and therefore X and Y are not independent. If X and Y are independent, then the random variables U = φ(X) and V = ψ(Y ) are also independent. Here φ, ψ : R I →R I are some real valued functions. From this comment, one can conclude that if X and Y are independent, then the random variables eX and Y 3 +Y 2 +1 are also independent. Definition 7.9. The random variables X and Y are said to be independent and identically distributed (IID) if and only if they are independent and have the same distribution. Example 7.21. Let X and Y be two independent random variables with identical probability density function given by f (x) = & e−x for x > 0 0 elsewhere. What is the probability density function of W = min{X, Y } ? Answer: Let G(w) be the cumulative distribution function of W . Then G(w) = P (W ≤ w) = 1 − P (W > w) = 1 − P (min{X, Y } > w) = 1 − P (X > w and Y > w) = 1 − P (X > w) P (Y > w) (since X and Y are independent) #: ∞ $ $ #: ∞ =1− e−y dy e−x dx w w ! "2 = 1 − e−w = 1 − e−2w . Thus, the probability density function of W is g(w) = Hence " d d ! 1 − e−2w = 2 e−2w . G(w) = dw dw g(w) = & 2 e−2w for w > 0 0 elsewhere. Probability and Mathematical Statistics 207 7.5. Review Exercises 1. Let X and Y be discrete random variables with joint probability density function & 1 for x = 1, 2, 3; y = 1, 2 21 (x + y) f (x, y) = 0 otherwise. What are the marginals of X and Y ? 2. Roll a pair of unbiased dice. Let X be the maximum of the two faces and Y be the sum of the two faces. What is the joint density of X and Y ? 3. For what value of c is the real valued function & c (x + 2y) for x = 1, 2; y = 1, 2 f (x, y) = 0 otherwise a joint density for some random variables X and Y ? 4. Let X and Y have the joint density f (x, y) = & e−(x+y) for 0 ≤ x, y < ∞ 0 otherwise. What is P (X ≥ Y ≥ 2) ? 5. If the random variable X is uniform on the interval from −1 to 1, and the random variable Y is uniform on the interval from 0 to 1, what is the probability that the the quadratic equation t2 + 2Xt + Y = 0 has real solutions? Assume X and Y are independent. 6. Let Y have a uniform distribution on the interval (0, 1), and let the conditional density of X given Y = y be uniform on the interval from 0 to √ y. What is the marginal density of X for 0 < x < 1? Two Random Variables 208 7. If the joint cumulative distribution of the random variables X and Y is F (x, y) =   (1 − e−x )(1 − e−y )  0 for x > 0, y > 0 otherwise, what is the joint probability density function of the random variables X and Y , and the P (1 < X < 3, 1 < Y < 2)? 8. If the random variables X and Y have the joint density f (x, y) =   67 x for 1 ≤ x + y ≤ 2, x ≥ 0, y ≥ 0  0 otherwise, what is the probability P (Y ≥ X 2 ) ? 9. If the random variables X and Y have the joint density f (x, y) =   67 x for 1 ≤ x + y ≤ 2, x ≥ 0, y ≥ 0  0 otherwise, what is the probability P [max(X, Y ) > 1] ? 10. Let X and Y have the joint probability density function f (x, y) = & 5 16 xy 2 0 for 0 < x < y < 2 elsewhere. What is the marginal density function of X where it is nonzero? 11. Let X and Y have the joint probability density function f (x, y) = & 4x for 0 < x < 0 elsewhere. √ y<1 What is the marginal density function of Y , where nonzero? 12. A point (X, Y ) is chosen at random from a uniform distribution on the circular disk of radius centered at the point (1, 1). 
For a given value of X = x between 0 and 2 and for y in the appropriate domain, what is the conditional density function for Y ? Probability and Mathematical Statistics 209 13. Let X and Y be continuous random variables with joint density function f (x, y) = &3 4 (2 − x − y) for 0 < x, y < 2; 0 < x + y < 2 0 otherwise. What is the conditional probability P (X < 1 | Y < 1) ? 14. Let X and Y be continuous random variables with joint density function f (x, y) = & 12x for 0 < y < 2x < 1 0 otherwise. What is the conditional density function of Y given X = x ? 15. Let X and Y be continuous random variables with joint density function f (x, y) = & 24xy for x > 0, y > 0, 0 < x + y < 1 0 otherwise. ! What is the conditional probability P X < 1 2 |Y = 1 4 " ? 16. Let X and Y be two independent random variables with identical probability density function given by f (x) = & e−x for x > 0 0 elsewhere. What is the probability density function of W = max{X, Y } ? 17. Let X and Y be two independent random variables with identical probability density function given by  2  3θx3 for 0 ≤ x ≤ θ f (x) =  0 elsewhere, for some θ > 0. What is the probability density function of W = min{X, Y }? 18. Ron and Glenna agree to meet between 5 P.M. and 6 P.M. Suppose that each of them arrive at a time distributed uniformly at random in this time interval, independent of the other. Each will wait for the other at most 10 minutes (and if other does not show up they will leave). What is the probability that they actually go out? Two Random Variables 210 19. Let X and Y be two independent random variables distributed uniformly on the interval [0, 1]. What is the probability of the event Y ≥ 12 given that Y ≥ 1 − 2X? 20. Let X and Y have the joint density & 8xy for 0 < y < x < 1 f (x, y) = 0 otherwise. What is P (X + Y > 1) ? 21. Let X and Y be continuous random variables with joint density function & 2 for 0 ≤ y ≤ x < 1 f (x, y) = 0 otherwise. Are X and Y stochastically independent? 22. Let X and Y be continuous random variables with joint density function & 2x for 0 < x, y < 1 f (x, y) = 0 otherwise. Are X and Y stochastically independent? 23. A bus and a passenger arrive at a bus stop at a uniformly distributed time over the interval 0 to 1 hour. Assume the arrival times of the bus and passenger are independent of one another and that the passenger will wait up to 15 minutes for the bus to arrive. What is the probability that the passenger will catch the bus? 24. Let X and Y be continuous random variables with joint density function & 4xy for 0 ≤ x, y ≤ 1 f (x, y) = 0 otherwise. What is the probability of the event X ≤ 1 2 given that Y ≥ 43 ? 25. Let X and Y be continuous random variables with joint density function f (x, y) = &1 2 for 0 ≤ x ≤ y ≤ 2 0 otherwise. What is the probability of the event X ≤ 1 2 given that Y = 1? Probability and Mathematical Statistics 26. If the joint density of the random variables X and Y is f (x, y) = & 1 1 2 0 if 0 ≤ x ≤ y ≤ 1 if 1 ≤ x ≤ 2, 0 ≤ y ≤ 1 otherwise, ! " what is the probability of the event X ≤ 23 , Y ≤ 21 ? 27. If the joint density of the random variables X and Y is ; <  emin{x,y} − 1 e−(x+y) if 0 < x, y < ∞ f (x, y) =  0 otherwise, then what is the marginal density function of X, where nonzero? 211 Two Random Variables 212 Probability and Mathematical Statistics 213 Chapter 8 PRODUCT MOMENTS OF BIVARIATE RANDOM VARIABLES In this chapter, we define various product moments of a bivariate random variable. 
The main concept we introduce in this chapter is the notion of covariance between two random variables. Using this notion, we study the statistical dependence of two random variables. 8.1. Covariance of Bivariate Random Variables First, we define the notion of product moment of two random variables and then using this product moment, we give the definition of covariance between two random variables. Definition 8.1. Let X and Y be any two random variables with joint density function f (x, y). The product moment of X and Y , denoted by E(XY ), is defined as  % %  xy f (x, y) if X and Y are discrete  y∈R x∈R Y X E(XY ) =   8 ∞ 8 ∞ xy f (x, y) dx dy if X and Y are continuous. −∞ −∞ Here, RX and RY represent the range spaces of X and Y respectively. Definition 8.2. Let X and Y be any two random variables with joint density function f (x, y). The covariance between X and Y , denoted by Cov(X, Y ) (or σXY ), is defined as Cov(X, Y ) = E( (X − µX ) (Y − µY ) ), Product Moments of Bivariate Random Variables 214 where µX and µY are mean of X and Y , respectively. Notice that the covariance of X and Y is really the product moment of X − µX and Y − µY . Further, the mean of µX is given by : ∞: ∞ : ∞ µX = E(X) = x f (x, y) dx dy, x f1 (x) dx = −∞ −∞ −∞ and similarly the mean of Y is given by : : ∞ µY = E(Y ) = y f2 (y) dy = ∞ −∞ −∞ : ∞ y f (x, y) dy dx. −∞ Theorem 8.1. Let X and Y be any two random variables. Then Cov(X, Y ) = E(XY ) − E(X) E(Y ). Proof: Cov(X, Y ) = E((X − µX ) (Y − µY )) = E(XY − µX Y − µY X + µX µY ) = E(XY ) − µX E(Y ) − µY E(X) + µX µY = E(XY ) − µX µY − µY µX + µX µY = E(XY ) − µX µY = E(XY ) − E(X) E(Y ). 2 Corollary 8.1. Cov(X, X) = σX . Proof: Cov(X, X) = E(XX) − E(X) E(X) = E(X 2 ) − µ2X = V ar(X) 2 = σX . Example 8.1. Let X and Y be discrete random variables with joint density f (x, y) = & x+2y 18 0 for x = 1, 2; y = 1, 2 elsewhere. What is the covariance σXY between X and Y . Probability and Mathematical Statistics 215 Answer: The marginal of X is f1 (x) = 2 % x + 2y y=1 18 = 1 (2x + 6). 18 Hence the expected value of X is E(X) = 2 % x f1 (x) x=1 = 1 f1 (1) + 2f1 (2) 8 10 = +2 18 18 28 = . 18 Similarly, the marginal of Y is f2 (y) = 2 % x + 2y x=1 18 = 1 (3 + 4y). 18 Hence the expected value of Y is E(Y ) = 2 % y f2 (y) y=1 = 1 f2 (1) + 2f2 (2) 7 11 +2 = 18 18 29 = . 18 Further, the product moment of X and Y is given by E(XY ) = 2 2 % % x y f (x, y) x=1 y=1 = f (1, 1) + 2 f (1, 2) + 2 f (2, 1) + 4 f (2, 2) 3 5 4 6 = +2 +2 +4 18 18 18 18 3 + 10 + 8 + 24 = 18 45 . = 18 Product Moments of Bivariate Random Variables 216 Hence, the covariance between X and Y is given by Cov(X, Y ) = E(XY ) − E(X) E(Y ) # $# $ 45 28 29 = − 18 18 18 (45) (18) − (28) (29) = (18) (18) 810 − 812 = 324 2 = −0.00617. =− 324 Remark 8.1. For an arbitrary random variable, the product moment and covariance may or may not exist. Further, note that unlike variance, the covariance between two random variables may be negative. Example 8.2. Let X and Y have the joint density function & x+y if 0 < x, y < 1 f (x, y) = 0 elsewhere . What is the covariance between X and Y ? Answer: The marginal density of X is f1 (x) = : 1 (x + y) dy 0 2 3y=1 y2 = xy + 2 y=0 1 =x+ . 2 Thus, the expected value of X is given by E(X) = : 1 x f1 (x) dx 0 : 1 1 x (x + ) dx 2 0 2 3 3 2 1 x x = + 3 4 0 7 . = 12 = Probability and Mathematical Statistics 217 Similarly (or using the fact that the density is symmetric in x and y), we get E(Y ) = 7 . 12 Now, we compute the product moment of X and Y . 
: 1: 1 E(XY ) = x y(x + y) dx dy 0 0 = : 1 0 : 1 : 2 1 (x2 y + x y 2 ) dx dy 0 x3 y x2 y 2 = + 3 2 0 $ # : 1 2 y y + dy = 3 2 0 31 2 2 y3 y + dy = 6 6 0 1 1 = + 6 6 4 = . 12 3x=1 dy x=0 Hence the covariance between X and Y is given by Cov(X, Y ) = E(XY ) − E(X) E(Y ) # $# $ 4 7 7 − = 12 12 12 48 − 49 = 144 1 =− . 144 Example 8.3. Let X and Y be continuous random variables with joint density function & 2 if 0 < y < 1 − x; 0 < x < 1 f (x, y) = 0 elsewhere . What is the covariance between X and Y ? Answer: The marginal density of X is given by : 1−x f1 (x) = 2 dy = 2 (1 − x). 0 Product Moments of Bivariate Random Variables 218 Hence the expected value of X is : 1 : µX = E(X) = x f1 (x) dx = 0 Similarly, the marginal of Y is f2 (y) = 1 o : 2 (1 − x) dx = 1−y 2 dx = 2 (1 − y). 0 Hence the expected value of Y is : : 1 µY = E(Y ) = y f2 (y) dy = 1 o 0 2 (1 − y) dy = The product moment of X and Y is given by : 1 : 1−x x y f (x, y) dy dx E(XY ) = 0 0 = : 1 0 =2 : : 0 1−x x y 2 dy dx 0 1 x 2 y2 2 31−x dx 0 : 1 1 =2 x (1 − x)2 dx 2 0 : 1 ! " x − 2x2 + x3 dx = 0 31 1 2 2 3 1 4 x − x + x 2 3 4 0 1 = . 12 Therefore, the covariance between X and Y is given by Cov(X, Y ) = E(XY ) − E(X) E(Y ) 1 1 = − 12 9 4 1 3 = − =− . 36 36 36 = 2 1 . 3 1 . 3 Probability and Mathematical Statistics 219 Theorem 8.2. If X and Y are any two random variables and a, b, c, and d are real constants, then Cov(a X + b, c Y + d) = a c Cov(X, Y ). Proof: Cov(a X + b, c Y + d) = E ((aX + b)(cY + d)) − E(aX + b) E(cY + d) = E (acXY + adX + bcY + bd) − (aE(X) + b) (cE(Y ) + d) = ac E(XY ) + ad E(X) + bc E(Y ) + bd − [ac E(X) E(Y ) + ad E(X) + bc E(Y ) + bd] = ac [E(XY ) − E(X) E(Y )] = ac Cov(X, Y ). Example 8.4. If the product moment of X and Y is 3 and the mean of X and Y are both equal to 2, then what is the covariance of the random variables 2X + 10 and − 52 Y + 3 ? Answer: Since E(XY ) = 3 and E(X) = 2 = E(Y ), the covariance of X and Y is given by Cov(X, Y ) = E(XY ) − E(X) E(Y ) = 3 − 4 = −1. Then the covariance of 2X + 10 and − 52 Y + 3 is given by # $ # $ 5 5 Cov 2X + 10, − Y + 3 = 2 − Cov(X, Y ) 2 2 = (−5) (−1) = 5. Remark 8.2. Notice that the Theorem 8.2 can be furthered improved. That is, if X, Y , Z are three random variables, then Cov(X + Y, Z) = Cov(X, Z) + Cov(Y, Z) and Cov(X, Y + Z) = Cov(X, Y ) + Cov(X, Z). Product Moments of Bivariate Random Variables 220 The first formula can be established as follows. Consider Cov(X + Y, Z) = E((X + Y )Z) − E(X + Y ) E(Z) = E(XZ + Y Z) − E(X)E(Z) − E(Y )E(Z) = E(XZ) − E(X)E(Z) + E(Y Z) − E(Y )E(Z) = Cov(X, Z) + Cov(Y, Z). 8.2. Independence of Random Variables In this section, we study the effect of independence on the product moment (and hence on the covariance). We begin with a simple theorem. Theorem 8.3. If X and Y are independent random variables, then E(XY ) = E(X) E(Y ). Proof: Recall that X and Y are independent if and only if f (x, y) = f1 (x) f2 (y). Let us assume that X and Y are continuous. Therefore : ∞: ∞ E(XY ) = x y f (x, y) dx dy −∞ −∞ : ∞: ∞ = x y f1 (x) f2 (y) dx dy −∞ −∞ $ $ #: ∞ #: ∞ = y f2 (y) dy x f1 (x) dx −∞ = E(X) E(Y ). −∞ If X and Y are discrete, then replace the integrals by appropriate sums to prove the same result. Example 8.5. Let X and Y be two independent random variables with respective density functions: & 2 3x if 0 < x < 1 f (x) = 0 otherwise and g(y) = & 4 y3 if 0 < y < 1 0 otherwise . Probability and Mathematical Statistics What is E !X " Y 221 ? 
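Before the analytic answer, the expected value of X/Y asked for in Example 8.5 can be anticipated by a quick Monte Carlo cross-check (a sketch only; the sample size N and the seed are arbitrary choices). Since X has distribution function x^3 on (0, 1) and Y has distribution function y^4 on (0, 1), inverse-transform sampling gives X = U^(1/3) and Y = V^(1/4) for independent uniform variates U and V.

    import numpy as np

    # Monte Carlo sketch for E(X/Y); N and the seed are arbitrary choices.
    rng = np.random.default_rng(0)
    N = 1_000_000
    X = rng.random(N) ** (1 / 3)            # inverse transform: F_X(x) = x^3 on (0, 1)
    Y = (1.0 - rng.random(N)) ** (1 / 4)    # inverse transform for F_Y(y) = y^4; 1 - U keeps Y > 0
    print(np.mean(X / Y))                   # approximately 1, the value derived below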
Answer: Since X and Y are independent, the joint density of X and Y is given by h(x, y) = f (x) g(y). Therefore E # X Y $ = = : : ∞ ∞ −∞ −∞ 1: 1 : 0 0 : 1 : x h(x, y) dx dy y x f (x) g(y) dx dy y 1 x 2 3 3x 4y dx dy 0 y 0 #: 1 $ $ #: 1 = 4y 2 dy 3x3 dx 0 # 0$ # $ 3 4 = = 1. 4 3 = ! " E(X) = Remark 8.3. The independence of X and Y does not imply E X !X " ! −1 " ! Y −1 " E(Y ) but only implies E Y = E(X) E Y . Further, note that E Y is not 1 equal to E(Y . ) Theorem 8.4. If X and Y are independent random variables, then the covariance between X and Y is always zero, that is Cov(X, Y ) = 0. Proof: Suppose X and Y are independent, then by Theorem 8.3, we have E(XY ) = E(X) E(Y ). Consider Cov(X, Y ) = E(XY ) − E(X) E(Y ) = E(X) E(Y ) − E(X) E(Y ) = 0. Example 8.6. Let the random variables X and Y have the joint density &1 if (x, y) ∈ { (0, 1), (0, −1), (1, 0), (−1, 0) } 4 f (x, y) = 0 otherwise. What is the covariance of X and Y ? Are the random variables X and Y independent? Product Moments of Bivariate Random Variables 222 Answer: The joint density of X and Y are shown in the following table with the marginals f1 (x) and f2 (y). (x, y) −1 0 −1 0 1 4 0 1 4 0 1 4 0 1 4 2 4 1 0 1 4 0 1 4 f1 (x) 1 4 2 4 1 f2 (y) 1 4 From this table, we see that # $ # $ 2 2 1 0 = f (0, 0) += f1 (0) f2 (0) = = 4 4 4 and thus f (x, y) += f1 (x) f2 (y) for all (x, y) is the range space of the joint variable (X, Y ). Therefore X and Y are not independent. Next, we compute the covariance between X and Y . For this we need Probability and Mathematical Statistics 223 E(X), E(Y ) and E(XY ). The expected value of X is E(X) = 1 % xf1 (x) x=−1 = (−1) f1 (−1) + (0)f1 (0) + (1) f1 (1) 1 1 =− +0+ 4 4 = 0. Similarly, the expected value of Y is E(Y ) = 1 % yf2 (y) y=−1 = (−1) f2 (−1) + (0)f2 (0) + (1) f2 (1) 1 1 =− +0+ 4 4 = 0. The product moment of X and Y is given by E(XY ) = 1 1 % % x y f (x, y) x=−1 y=−1 = (1) f (−1, −1) + (0) f (−1, 0) + (−1) f (−1, 1) + (0) f (0, −1) + (0) f (0, 0) + (0) f (0, 1) + (−1) f (1, −1) + (0) f (1, 0) + (1) f (1, 1) = 0. Hence, the covariance between X and Y is given by Cov(X, Y ) = E(XY ) − E(X) E(Y ) = 0. Remark 8.4. This example shows that if the covariance of X and Y is zero that does not mean the random variables are independent. However, we know from Theorem 8.4 that if X and Y are independent, then the Cov(X, Y ) is always zero. Product Moments of Bivariate Random Variables 224 8.3. Variance of the Linear Combination of Random Variables Given two random variables, X and Y , we determine the variance of their linear combination, that is aX + bY . Theorem 8.5. Let X and Y be any two random variables and let a and b be any two real numbers. Then V ar(aX + bY ) = a2 V ar(X) + b2 V ar(Y ) + 2 a b Cov(X, Y ). Proof: V ar(aX + bY ) , 2 = E [aX + bY − E(aX + bY )] , 2 = E [aX + bY − a E(X) − b E(Y )] , 2 = E [a (X − µX ) + b (Y − µY )] ! " = E a2 (X − µX )2 + b2 (Y − µY )2 + 2 a b (X − µX ) (Y − µY ) " " ! ! = a2 E (X − µX )2 + b2 E (X − µX )2 + 2 a b E((X − µX ) (Y − µY )) = a2 V ar(X) + b2 V ar(Y ) + 2 a b Cov(X, Y ). Example 8.7. If V ar(X + Y ) = 3, V ar(X − Y ) = 1, E(X) = 1 and E(Y ) = 2, then what is E(XY ) ? Answer: Hence, we get 2 V ar(X + Y ) = σX + σY2 + 2 Cov(X, Y ), 2 V ar(X − Y ) = σX + σY2 − 2 Cov(X, Y ). 1 [ V ar(X + Y ) − V ar(X − Y ) ] 4 1 = [3 − 1] 4 1 = . 2 Cov(X, Y ) = Therefore, the product moment of X and Y is given by E(XY ) = Cov(X, Y ) + E(X) E(Y ) 1 = + (1) (2) 2 5 = . 2 Probability and Mathematical Statistics 225 Example 8.8. 
Let X and Y be random variables with V ar(X) = 4, V ar(Y ) = 9 and V ar(X − Y ) = 16. What is Cov(X, Y ) ? Answer: V ar(X − Y ) = V ar(X) + V ar(Y ) − 2 Cov(X, Y ) 16 = 4 + 9 − 2 Cov(X, Y ). Hence 3 Cov(X, Y ) = − . 2 Remark 8.5. The Theorem 8.5 can be extended to three or more random variables. In case of three random variables X, Y, Z, we have V ar(X + Y + Z) = V ar(X) + V ar(Y ) + V ar(Z) + 2Cov(X, Y ) + 2Cov(Y, Z) + 2Cov(Z, X). To see this consider V ar(X + Y + Z) = V ar((X + Y ) + Z) = V ar(X + Y ) + V ar(Z) + 2Cov(X + Y, Z) = V ar(X + Y ) + V ar(Z) + 2Cov(X, Z) + 2Cov(Y, Z) = V ar(X) + V ar(Y ) + 2Cov(X, Y ) + V ar(Z) + 2Cov(X, Z) + 2Cov(Y, Z) = V ar(X) + V ar(Y ) + V ar(Z) + 2Cov(X, Y ) + 2Cov(Y, Z) + 2Cov(Z, X). Theorem 8.6. If X and Y are independent random variables with E(X) = 0 = E(Y ), then V ar(XY ) = V ar(X) V ar(Y ). Proof: ! " 2 V ar(XY ) = E (XY )2 − (E(X) E(Y )) ! " = E (XY )2 ! " = E X2 Y 2 ! " ! " = E X2 E Y 2 (by independence of X and Y ) = V ar(X) V ar(Y ). Product Moments of Bivariate Random Variables 226 Example 8.9. Let X and Y be independent random variables, each with density & 1 for −θ < x < θ 2θ f (x) = 0 otherwise. If the V ar(XY ) = 64 9 , then what is the value of θ ? Answer: E(X) = : θ −θ 1 1 x dx = 2θ 2θ 2 x2 2 3θ = 0. −θ Since Y has the same density, we conclude that E(Y ) = 0. Hence 64 = V ar(XY ) 9 = V ar(X) V ar(Y ) + ): + ): θ θ 1 2 1 2 = x dx y dy −θ 2θ −θ 2θ # 2$ # 2$ θ θ = 3 3 4 θ = . 9 Hence, we obtain θ4 = 64 or √ θ = 2 2. 8.4. Correlation and Independence The functional dependency of the random variable Y on the random variable X can be obtained by examining the correlation coefficient. The definition of the correlation coefficient ρ between X and Y is given below. 2 Definition 8.3. Let X and Y be two random variables with variances σX and σY2 , respectively. Let the covariance of X and Y be Cov(X, Y ). Then the correlation coefficient ρ between X and Y is given by ρ= Cov(X, Y ) . σX σY Theorem 8.7. If X and Y are independent, the correlation coefficient between X and Y is zero. Probability and Mathematical Statistics Proof: 227 Cov(X, Y ) σX σY 0 = σX σY = 0. ρ= Remark 8.4. The converse of this theorem is not true. If the correlation coefficient of X and Y is zero, then X and Y are said to be uncorrelated. Lemma 8.1. If X ( and Y ( are the standardizations of the random variables X and Y , respectively, the correlation coefficient between X ( and Y ( is equal to the correlation coefficient between X and Y . Proof: Let ρ( be the correlation coefficient between X ( and Y ( . Further, let ρ denote the correlation coefficient between X and Y . We will show that ρ( = ρ. Consider Cov (X ( , Y ( ) σX * σY * = Cov (X ( , Y ( ) $ # X − µX Y − µY , = Cov σX σY 1 = Cov (X − µX , Y − µY ) σX σY Cov (X, Y ) = σX σY = ρ. ρ( = This lemma states that the value of the correlation coefficient between two random variables does not change by standardization of them. Theorem 8.8. For any random variables X and Y , the correlation coefficient ρ satisfies −1 ≤ ρ ≤ 1, and ρ = 1 or ρ = −1 implies that the random variable Y = a X + b, where a and b are arbitrary real constants with a += 0. 2 Proof: Let µX be the mean of X and µY be the mean of Y , and σX and σY2 be the variances of X and Y , respectively. Further, let X∗ = X − µX σX and Y∗ = Y − µY σY Product Moments of Bivariate Random Variables 228 be the standardization of X and Y , respectively. Then 2 µX ∗ = 0 and σX ∗ = 1, and µY ∗ = 0 and σY2 ∗ = 1. 
Thus V ar(X ∗ − Y ∗ ) = V ar(X ∗ ) + V ar(Y ∗ ) − 2Cov(X ∗ , Y ∗ ) ∗ 2 2 = σX ∗ + σY ∗ − 2 ρ σX ∗ σY ∗ = 1 + 1 − 2ρ∗ = 1 + 1 − 2ρ (by Lemma 8.1) = 2(1 − ρ). Since the variance of a random variable is always positive, we get 2 (1 − ρ) ≥ 0 which is ρ ≤ 1. By a similar argument, using V ar(X ∗ + Y ∗ ), one can show that −1 ≤ ρ. Hence, we have −1 ≤ ρ ≤ 1. Now, we show that if ρ = 1 or ρ = −1, then Y and X are related through an affine transformation. Consider the case ρ = 1, then V ar(X ∗ − Y ∗ ) = 0. But if the variance of a random variable is 0, then all the probability mass is concentrated at a point (that is, the distribution of the corresponding random variable is degenerate). Thus V ar(X ∗ − Y ∗ ) = 0 implies X ∗ − Y ∗ takes only one value. But E [X ∗ − Y ∗ ] = 0. Thus, we get X∗ − Y ∗ ≡ 0 or X ∗ ≡ Y ∗. Hence X − µX Y − µY = . σX σY Solving this for Y in terms of X, we get Y = aX + b Probability and Mathematical Statistics where a= σY σX and 229 b = µY − a µX . Thus if ρ = 1, then Y is a linear in X. Similarly, we can show for the case ρ = −1, the random variables X and Y are linearly related. This completes the proof of the theorem. 8.5. Moment Generating Functions Similar to the moment generating function for the univariate case, one can define the moment generating function for the bivariate case to compute the various product moments. The moment generating function for the bivariate case is defined as follows: Definition 8.4. Let X and Y be two random variables with joint density function f (x, y). A real valued function M : R I 2 →R I defined by ! " M (s, t) = E esX+tY is called the joint moment generating function of X and Y if this expected value exists for all s is some interval −h < s < h and for all t is some interval −k < t < k for some positive h and k. It is easy to see from this definition that ! " M (s, 0) = E esX and From this we see that E(X k ) = ! " M (0, t) = E etY . > ∂ k M (s, t) >> , > ∂sk (0,0) E(Y k ) = for k = 1, 2, 3, 4, ...; and > ∂ 2 M (s, t) >> E(XY ) = . ∂s ∂t >(0,0) > ∂ k M (s, t) >> , > ∂tk (0,0) Example 8.10. Let the random variables X and Y have the joint density & −y e for 0 < x < y < ∞ f (x, y) = 0 otherwise. Product Moments of Bivariate Random Variables 230 What is the joint moment generating function for X and Y ? Answer: The joint moment generating function of X and Y is given by ! " M (s, t) = E esX+tY : ∞: ∞ = esx+ty f (x, y) dy dx :0 ∞ :0 ∞ = esx+ty e−y dy dx 0 x 3 : ∞ 2: ∞ esx+ty−y dy dx = 0 = x 1 , (1 − s − t) (1 − t) provided s + t < 1 and t < 1. Example 8.11. If the joint moment generating function of the random variables X and Y is M (s, t) = e(s+3t+2s what is the covariance of X and Y ? Answer: 2 +18t2 +12st) Probability and Mathematical Statistics 231 2 2 M (s, t) = e(s+3t+2s +18t +12st) ∂M = (1 + 4s + 12t) M (s, t) > ∂s ∂M >> = 1 M (0, 0) ∂s >(0,0) = 1. ∂M = (3 + 36t + 12s) M (s, t) > ∂t ∂M >> = 3 M (0, 0) ∂t > (0,0) = 3. Hence µX = 1 and µY = 3. Now we compute the product moment of X and Y . # $ ∂ 2 M (s, t) ∂ ∂M = ∂s ∂t ∂t ∂s ∂ = (M (s, t) (1 + 4s + 12t)) ∂t ∂M + M (s, t) (12). = (1 + 4s + 12t) ∂t Therefore Thus > ∂ 2 M (s, t) >> = 1 (3) + 1 (12). ∂s ∂t >(0,0) E(XY ) = 15 and the covariance of X and Y is given by Cov(X, Y ) = E(XY ) − E(X) E(Y ) = 15 − (3) (1) = 12. Theorem 8.9. If X and Y are independent then MaX+bY (t) = MX (at) MY (bt), Product Moments of Bivariate Random Variables 232 where a and b real parameters. Proof: Let W = aX + bY . Hence MaX+bY (t) = MW (t) ! " = E etW , = E et(aX+bY ) ! " = E etaX etbY ! " ! 
" = E etaX E etbY (by Theorem 8.3) = MX (at) MY (bt). This theorem is very powerful. It helps us to find the distribution of a linear combination of independent random variables. The following examples illustrate how one can use this theorem to determine distribution of a linear combination. Example 8.12. Suppose the random variable X is normal with mean 2 and standard deviation 3 and the random variable Y is also normal with mean 0 and standard deviation 4. If X and Y are independent, then what is the probability distribution of the random variable X + Y ? Answer: Since X ∼ N (2, 9), the moment generating function of X is given by 9 2 1 2 2 MX (t) = eµt+ 2 σ t = e2t+ 2 t . Similarly, since Y ∼ N (0, 16), 1 MY (t) = eµt+ 2 σ 2 2 t 16 2 =e2t . Since X and Y are independent, the moment generating function of X + Y is given by MX+Y (t) = MX (t) MY (t) 9 2 16 2 = e2t+ 2 t e 2 t 25 2 = e2t+ 2 t . Hence X + Y ∼ N (2, 25). Thus, X + Y has a normal distribution with mean 2 and variance 25. From this information we can find the probability density function of W = X + Y as f (w) = √ 1 w−2 2 1 e− 2 ( 5 ) , 50π −∞ < w < ∞. Probability and Mathematical Statistics 233 Remark 8.6. In fact if X and Y are independent normal random variables 2 with means µX and µY and variances σX and σY2 , respectively, then aX + bY 2 is also normal with mean aµX + bµY and variance a2 σX + b2 σY2 . Example 8.13. Let X and Y be two independent and identically distributed random variables. If their common distribution is chi-square with one degree of freedom, then what is the distribution of X + Y ? What is the moment generating function of X − Y ? Answer: Since X and Y are both χ2 (1), the moment generating functions are 1 MX (t) = √ 1 − 2t and MY (t) = √ 1 . 1 − 2t Since, the random variables X and Y are independent, the moment generating function of X + Y is given by MX+Y (t) = MX (t) MY (t) 1 1 √ =√ 1 − 2t 1 − 2t 1 = 2 . (1 − 2t) 2 Hence X + Y ∼ χ2 (2). Thus, if X and Y are independent chi-square random variables, then their sum is also a chi-square random variable. Next, we show that X − Y is not a chi-square random variable, even if X and Y are both chi-square. MX−Y (t) = MX (t) MY (−t) 1 1 √ =√ 1 − 2t 1 + 2t 1 =√ . 1 − 4t2 This moment generating function does not correspond to the moment generating function of a chi-square random variable with any degree of freedoms. Further, it is surprising that this moment generating function does not correspond to that of any known distributions. Remark 8.7. If X and Y are chi-square and independent random variables, then their linear combination is not necessarily a chi-square random variable. Product Moments of Bivariate Random Variables 234 Example 8.14. Let X and Y be two independent Bernoulli random variables with parameter p. What is the distribution of X + Y ? Answer: Since X and Y are Bernoulli with parameter p, their moment generating functions are MX (t) = (1 − p) + pet MY (t) = (1 − p) + pet . Since, X and Y are independent, the moment generating function of their sum is the product of their moment generating functions, that is MX+Y (t) = MX (t) MY (t) ! " ! " = 1 − p + pet 1 − p + pet ! "2 = 1 − p + pet . Hence X + Y ∼ BIN (2, p). Thus the sum of two independent Bernoulli random variable is a binomial random variable with parameter 2 and p. 8.6. Review Exercises 1. Suppose that X1 and X2 are random variables with zero mean and unit variance. If the correlation coefficient of X1 and X2 is −0.5, then what is the 12 variance of Y = k=1 k 2 Xk ? 2. 
If the joint density of the random variables X and Y is f (x, y) =   81  0 if (x, y) ∈ { (x, 0), (0, −y) | x, y = −2, −1, 1, 2 } otherwise, what is the covariance of X and Y ? Are X and Y independent? 3. Suppose the random variables X and Y are independent and identically distributed. Let Z = aX + Y . If the correlation coefficient between X and Z is 31 , then what is the value of the constant a ? 4. Let X and Y be two independent random variables with chi-square distribution with 2 degrees of freedom. What is the moment generating function of the random variable 2X + 3Y ? If possible, what is the distribution of 2X + 3Y ? 5. Let X and Y be two independent random variables. If X ∼ BIN (n, p) and Y ∼ BIN (m, p), then what is the distribution of X + Y ? Probability and Mathematical Statistics 235 6. Let X and Y be two independent random variables. If X and Y are both standard normal, then what is the distribution of the random variable " ! 2 1 2 ? 2 X −Y 7. If the joint probability density function of X and Y is f (x, y) = & 1 if 0 < x < 1; 0 < y < 1 0 elsewhere, then what is the joint moment generating function of X and Y ? 8. Let the joint density function of X and Y be  1 if 1 ≤ x = y ≤ 6  36 f (x, y) =  2 if 1 ≤ x < y ≤ 6. 36 What is the correlation coefficient of X and Y ? 9. Suppose that X and Y are random variables with joint moment generating function # $10 1 s 3 t 3 M (s, t) = , e + e + 4 8 8 for all real s and t. What is the covariance of X and Y ? 10. Suppose that X and Y are random variables with joint density function f (x, y) =  1  6π  0 for x2 4 + y2 9 for x2 4 + y2 9 ≤1 > 1. What is the covariance of X and Y ? Are X and Y independent? 11. Let X and Y be two random variables. Suppose E(X) = 1, E(Y ) = 2, V ar(X) = 1, V ar(Y ) = 2, and Cov(X, Y ) = 21 . For what values of the constants a and b, the random variable aX + bY , whose expected value is 3, has minimum variance? 12. A box contains 5 white balls and 3 black balls. Draw 2 balls without replacement. If X represents the number of white balls and Y represents the number of black balls drawn, what is the covariance of X and Y ? 13. If X represents the number of 1’s and Y represents the number of 5’s in three tosses of a fair six-sided die, what is the correlation between X and Y ? Product Moments of Bivariate Random Variables 236 14. Let Y and Z be two random variables. If V ar(Y ) = 4, V ar(Z) = 16, and Cov(Y, Z) = 2, then what is V ar(3Z − 2Y )? 15. Three random variables X1 , X2 , X3 , have equal variances σ 2 and coefficient of correlation between X1 and X2 of ρ and between X1 and X3 and between X2 and X3 of zero. What is the correlation between Y and Z where Y = X1 + X2 and Z = X2 + X3 ? 16. If X and Y are two independent Bernoulli random variables with parameter p, then what is the joint moment generating function of X − Y ? 17. If X1 , X2 , ..., Xn are normal random variables with variance σ 2 and covariance between any pair of random variables ρσ 2 , what is the variance of n1 (X1 + X2 + · · · + Xn ) ? 2 = a, 18. The coefficient of correlation between X and Y is 31 and σX 2 2 σY = 4a, and σZ = 114 where Z = 3X − 4Y . What is the value of the constant a ? 19. Let X and Y be independent random variables with E(X) = 1, E(Y ) = 2, and V ar(X) = V ar(Y ) = σ 2 . For what value of the constant k is the expected value of the random variable k(X 2 − Y 2 ) + Y 2 equals σ 2 ? 20. Let X be a random variable with finite variance. 
If Y = 15 − X, then what is the coefficient of correlation between the random variables X and (X + Y )X ? 21. The mean of a normal random variable X is 10 and the variance is 12. The mean of a normal random variable Y is −5 and the variance is 5. If the covariance of X and Y is 4, then what is the probability of the event X +Y >5 ? Probability and Mathematical Statistics 237 Chapter 9 CONDITIONAL EXPECTATION OF BIVARIATE RANDOM VARIABLES This chapter examines the conditional mean and conditional variance associated with two random variables. The conditional mean is very useful in Bayesian estimation of parameters with a square loss function. Further, the notion of conditional mean sets the path for regression analysis in statistics. 9.1. Conditional Expected Values Let X and Y be any two random variables with joint density f (x, y). Recall that the conditional probability density of X, given the event Y = y, is defined as f (x, y) g(x/y) = , f2 (y) > 0 f2 (y) where f2 (y) is the marginal probability density of Y . Similarly, the conditional probability density of Y , given the event X = x, is defined as h(y/x) = f (x, y) , f1 (x) f1 (x) > 0 where f1 (x) is the marginal probability density of X. Definition 9.1. The conditional mean of X given Y = y is defined as µX|y = E (X | y) , Conditional Expectations of Bivariate Random Variables where E (X | y) =  %  x g(x/y)    238 if X is discrete x∈RX     8 ∞ x g(x/y) dx −∞ if X is continuous. Similarly, the conditional mean of Y given X = x is defined as µY |x = E (Y | x) , where E (Y | x) =  %  y h(y/x)    y∈R if Y is discrete Y     8 ∞ y h(y/x) dy −∞ if Y is continuous. Example 9.1. Let X and Y be discrete random variables with joint probability density function f (x, y) = & 1 21 (x + y) for x = 1, 2, 3; y = 1, 2 0 otherwise. What is the conditional mean of X given Y = y, that is E(X|y)? Answer: To compute the conditional mean of X given Y = y, we need the conditional density g(x/y) of X given Y = y. However, to find g(x/y), we need to know the marginal of Y , that is f2 (y). Thus, we begin with f2 (y) = = 3 % 1 (x + y) 21 x=1 1 (6 + 3y). 21 Therefore, the conditional density of X given Y = y is given by f (x, y) f2 (y) x+y = , 6 + 3y g(x/y) = x = 1, 2, 3. Probability and Mathematical Statistics 239 The conditional expected value of X given the event Y = y % E (X | y) = x g(x/y) x∈RX 3 % x+y 6 + 3y x=1 0 / 3 3 % % 1 2 x x +y = 6 + 3y x=1 x=1 = = x 14 + 6y , 6 + 3y y = 1, 2. Remark 9.1. Note that the conditional mean of X given Y = y is dependent only on y, that is E(X|y) is a function φ of y. In the above example, this function φ is a rational function, namely φ(y) = 14+6y 6+3y . Example 9.2. Let X and Y have the joint density function & x+y for 0 < x, y < 1 f (x, y) = 0 otherwise. " ! What is the conditional mean E Y | X = 13 ? Answer: f1 (x) = : 1 (x + y) dy 0 2 31 1 = xy + y 2 2 0 1 =x+ . 2 Conditional Expectations of Bivariate Random Variables h(y/x) = 240 x+y f (x, y) = . f1 (x) x + 21 # $ : 1 1 E Y |X = y h(y/x) dy = 3 0 : 1 x+y y = dy x + 21 0 : 1 1 +y = y 3 5 dy 0 6 = 5 6 5 6 = 5 6 = 5 3 = . 5 = 6 $ 1 y + y 2 dy 3 0 31 2 1 2 1 3 y + y 6 3 0 2 3 1 2 + 6 6 # $ 3 6 : 1 # The mean of the random variable Y is a deterministic number. The conditional mean of Y given X = x, that is E(Y |x), is a function φ(x) of the variable x. Using this function, we form φ(X). This function φ(X) is a random variable. Thus starting from the deterministic function E(Y |x), we have formed the random variable E(Y |X) = φ(X). 
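The following sketch (plain Python; the grid size used for the numerical integration is an arbitrary choice) makes this concrete for Example 9.2, where h(y/x) = (x + y)/(x + 1/2). It evaluates the deterministic function φ(x) = E(Y | X = x) at a couple of points; composing φ with the random variable X is precisely the random variable E(Y | X) just described.

    # phi(x) = E(Y | X = x) for Example 9.2, by a midpoint Riemann sum in y.
    def phi(x, n=20_000):
        h = 1.0 / n
        total = 0.0
        for j in range(n):
            y = (j + 0.5) * h
            total += y * (x + y) / (x + 0.5) * h   # y times the conditional density h(y/x)
        return total

    print(phi(1 / 3))    # approximately 0.6 = 3/5, agreeing with the value found above
    print(phi(0.9))      # phi is an ordinary function of x; phi(X) is a random variable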
An important property of conditional expectation is given by the following theorem. Theorem 9.1. The expected value of the random variable E(Y |X) is equal to the expected value of Y , that is ! " Ex Ey|x (Y |X) = Ey (Y ), Probability and Mathematical Statistics 241 where Ex (X) stands for the expectation of X with respect to the distribution of X and Ey|x (Y |X) stands for the expected value of Y with respect to the conditional density h(y/X). Proof: We prove this theorem for continuous variables and leave the discrete case to the reader. 2: ∞ 3 ! " y h(y/X) dy Ex Ey|x (Y |X) = Ex −∞ $ : ∞ #: ∞ y h(y/x) dy f1 (x) dx = −∞ −∞ $ : ∞ #: ∞ y h(y/x)f1 (x) dy dx = −∞ −∞ $ : ∞ #: ∞ h(y/x)f1 (x) dx y dy = −∞ −∞ $ : ∞ #: ∞ f (x, y) dx y dy = −∞ −∞ : ∞ y f2 (y) dy = −∞ = Ey (Y ). Example 9.3. An insect lays Y number of eggs, where Y has a Poisson distribution with parameter λ. If the probability of each egg surviving is p, then on the average how many eggs will survive? Answer: Let X denote the number of surviving eggs. Then, given that Y = y (that is given that the insect has laid y eggs) the random variable X has a binomial distribution with parameters y and p. Thus X|Y ∼ BIN (Y, p) Y ∼ P OI(λ). Therefore, the expected number of survivors is given by ! " Ex (X) = Ey Ex|y (X|Y ) = Ey (p Y ) = p Ey (Y ) = p λ. (since X|Y ∼ BIN(Y, p)) (since Y ∼ POI(λ)) Definition 9.2. A random variable X is said to have a mixture distribution if the distribution of X depends on a quantity which also has a distribution. Conditional Expectations of Bivariate Random Variables 242 Example 9.4. A fair coin is tossed. If a head occurs, 1 die is rolled; if a tail occurs, 2 dice are rolled. Let Y be the total on the die or dice. What is the expected value of Y ? Answer: Let X denote the outcome of tossing a coin. Then X ∼ BER(p), where the probability of success is p = 12 . Ey (Y ) = Ex ( Ey|x (Y |X) ) 1 1 = Ey|x (Y |X = 0) + Ey|x (Y |X = 1) 2# 2 $ 1 1+2+3+4+5+6 = 2 6 # $ 1 2 + 6 + 12 + 20 + 30 + 42 + 40 + 36 + 30 + 22 + 12 + 2 36 # $ 1 126 252 + = 2 36 36 378 = 72 = 5.25. Note that the expected number of dots that show when 1 die is rolled is 126 36 , 252 and the expected number of dots that show when 2 dice are rolled is 36 . Theorem 9.2. Let X and Y be two random variables with mean µX and µY , and standard deviation σX and σY , respectively. If the conditional expectation of Y given X = x is linear in x, then E(Y |X = x) = µY + ρ σY (x − µX ), σX where ρ denotes the correlation coefficient of X and Y . Proof: We assume that the random variables X and Y are continuous. If they are discrete, the proof of the theorem follows exactly the same way by replacing the integrals with summations. We are given that E(Y |X = x) is linear in x, that is E(Y |X = x) = a x + b, (9.0) where a and b are two constants. Hence, from above we get : ∞ −∞ y h(y/x) dy = a x + b Probability and Mathematical Statistics which implies : ∞ y −∞ 243 f (x, y) dy = a x + b. f1 (x) Multiplying both sides by f1 (x), we get : ∞ y f (x, y) dy = (a x + b) f1 (x) (9.1) −∞ Now integrating with respect to x, we get : ∞: ∞ : y f (x, y) dy dx = −∞ −∞ ∞ (a x + b) f1 (x) dx −∞ This yields µY = a µX + b. (9.2) Now, we multiply (9.1) with x and then integrate the resulting expression with respect to x to get : ∞ : ∞: ∞ (a x2 + bx) f1 (x) dx. xy f (x, y) dy dx = −∞ From this we get −∞ −∞ ! " E(XY ) = a E X 2 + b µX . (9.3) Solving (9.2) and (9.3) for the unknown a and b, we get E(XY ) − µX µY 2 σX σXY = 2 σX σXY σY = σX σY σX σY . 
=ρ σX a= Similarly, we get b = µY + ρ σY µX . σX Letting a and b into (9.0) we obtain the asserted result and the proof of the theorem is now complete. Example 9.5. Suppose X and Y are random variables with E(Y |X = x) = −x + 3 and E(X|Y = y) = − 14 y + 5. What is the correlation coefficient of X and Y ? Conditional Expectations of Bivariate Random Variables 244 Answer: From the Theorem 9.2, we get µY + ρ σY (x − µX ) = −x + 3. σX Therefore, equating the coefficients of x terms, we get ρ Similarly, since µX + ρ σY = −1. σX (9.4) 1 σX (y − µY ) = − y + 5 σY 4 we have ρ σX 1 =− . σY 4 (9.5) Multiplying (9.4) with (9.5), we get ρ # $ σX 1 σY ρ = (−1) − σX σY 4 which is ρ2 = Solving this, we get Y Since ρ σσX = −1 and σY σX 1 . 4 1 ρ=± . 2 > 0, we get 1 ρ=− . 2 9.2. Conditional Variance The variance of the probability density function f (y/x) is called the conditional variance of Y given that X = x. This conditional variance is defined as follows: Definition 9.3. Let X and Y be two random variables with joint density f (x, y) and f (y/x) be the conditional density of Y given X = x. The conditional variance of Y given X = x, denoted by V ar(Y |x), is defined as ! " 2 V ar(Y |x) = E Y 2 | x − (E(Y |x)) , where E(Y |x) denotes the conditional mean of Y given X = x. Probability and Mathematical Statistics 245 Example 9.6. Let X and Y be continuous random variables with joint probability density function & f (x, y) = e−y for 0 < x < y < ∞ 0 otherwise. What is the conditional variance of Y given the knowledge that X = x? Answer: The marginal density of f1 (x) is given by f1 (x) = : ∞ f (x, y) dy −∞ : ∞ e−y dy ;x <∞ = −e−y x = = e−x . Thus, the conditional density of Y given X = x is f (x, y) f1 (x) e−y = −x e = e−(y−x) h(y/x) = for y > x. Thus, given X = x, Y has an exponential distribution with parameter θ = 1 and location parameter x. The conditional mean of Y given X = x is E(Y |x) = = = : ∞ y h(y/x) dy −∞ ∞ : :x∞ y e−(y−x) dy (z + x) e−z dz where z = y − x 0 =x : ∞ e−z dz + 0 = x Γ(1) + Γ(2) = x + 1. : 0 ∞ z e−z dz Conditional Expectations of Bivariate Random Variables 246 Similarly, we compute the second moment of the distribution h(y/x). 2 E(Y |x) = = = : ∞ y 2 h(y/x) dy −∞ : ∞ :x∞ y 2 e−(y−x) dy (z + x)2 e−z dz where z = y − x 0 = x2 : ∞ e−z dz + 0 : ∞ z 2 e−z dz + 2 x 0 = x2 Γ(1) + Γ(3) + 2 x Γ(2) : ∞ z e−z dz 0 = x2 + 2 + 2x = (1 + x)2 + 1. Therefore ! " V ar(Y |x) = E Y 2 |x − [ E(Y |x) ]2 = (1 + x)2 + 1 − (1 + x)2 = 1. Remark 9.2. The variance of Y is 2. This can be seen as follows: Since, the 8y marginal of Y is given by f2 (y) = 0 e−y dx = y e−y , the expected value of Y ! " 8∞ 8 ∞ 2 −y is E(Y ) = 0 y e dy = Γ(3) = 2, and E Y 2 = 0 y 3 e−y dy = Γ(4) = 6. Thus, the variance of Y is V ar(Y ) = 6−4 = 2. However, given the knowledge X = x, the variance of Y is 1. Thus, in a way the prior knowledge reduces the variability (or the variance) of a random variable. Next, we simply state the following theorem concerning the conditional variance without proof. Probability and Mathematical Statistics 247 Theorem 9.3. Let X and Y be two random variables with mean µX and µY , and standard deviation σX and σY , respectively. If the conditional expectation of Y given X = x is linear in x, then Ex (V ar(Y |X)) = (1 − ρ2 ) V ar(Y ), where ρ denotes the correlation coefficient of X and Y . Example 9.7. Let E(Y |X = x) = 2x and V ar(Y |X = x) = 4x2 , and let X have a uniform distribution on the interval from 0 to 1. What is the variance of Y ? 
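Before applying Theorem 9.3 in the answer below, the result can be anticipated by simulation. The example specifies only the conditional mean and variance of Y given X = x, so the sketch below has to pick some conditional family; it uses a normal distribution with mean 2x and standard deviation 2x (an assumption of the sketch only, since Var(Y) depends just on these two conditional moments together with the distribution of X, the choice of family does not affect the answer). The sample size and seed are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    N = 1_000_000
    X = rng.random(N)                        # X ~ UNIF(0, 1)
    # Assumed family for this sketch: Y | X = x is normal with the stated
    # conditional mean 2x and conditional standard deviation 2x.
    Y = rng.normal(loc=2 * X, scale=2 * X)
    print(np.var(Y))                         # approximately 1.67 = 5/3, as derived below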
Answer: If E(Y |X = x) is linear function of x, then E(Y |X = x) = µY + ρ σY (x − µX ) σX and Ex ( V ar(Y |X) ) = σY2 (1 − ρ2 ). We are given that µY + ρ σY (x − µX ) = 2x. σX Hence, equating the coefficient of x terms, we get σY =2 σX ρ which is ρ=2 σX . σY (9.6) Further, we are given that V ar(Y |X = x) = 4x2 Since X ∼ U N IF (0, 1), we get the density of X to be f (x) = 1 on the interval (0, 1) Therefore, : ∞ Ex ( V ar(Y |X) ) = V ar(Y |X = x) f (x) dx −∞ 1 = : 4 x2 dx 0 =4 2 4 = . 3 x3 3 31 0 Conditional Expectations of Bivariate Random Variables By Theorem 9.3, 248 4 = Ex ( V ar(Y |X) ) 3 ! " = σY2 1 − ρ2 # $ σ2 = σY2 1 − 4 X σY2 2 = σY2 − 4 σX Hence σY2 = 4 2 + 4 σX . 3 2 Since X ∼ U N IF (0, 1), the variance of X is given by σX = the variance of Y is given by σY2 = 1 12 . Therefore, 4 4 16 4 20 5 + = + = = . 3 12 12 12 12 3 Example 9.8. Let E(X|Y = y) = 3y and V ar(X|Y = y) = 2, and let Y have density function f (y) = & e−y if y > 0 0 otherwise. What is the variance of X? Answer: By Theorem 9.3, we get and ! " 2 V ar(X|Y = y) = σX 1 − ρ2 = 2 µX + ρ σX (y − µY ) = 3y. σY Thus ρ=3 σY . σX Hence from (9.7), we get Ey (V ar(X|Y )) = 2 and thus 2 σX # σ2 1 − 9 Y2 σX $ =2 which is 2 σX = 9 σY2 + 2. (9.7) Probability and Mathematical Statistics 249 ! " Now, we compute the variance of Y . For this, we need E(Y ) and E Y 2 . E(Y ) = : ∞ y f (y) dy 0 = : ∞ y e−y dy 0 = Γ(2) = 1. Similarly ! E Y 2 " = : ∞ y 2 f (y) dy 0 = : ∞ y 2 e−y dy 0 = Γ(3) = 2. Therefore ! " V ar(Y ) = E Y 2 − [ E(Y ) ]2 = 2 − 1 = 1. Hence, the variance of X can be calculated as 2 σX = 9 σY2 + 2 = 9 (1) + 2 = 11. Remark 9.3. Notice that, in Example 9.8, we calculated the variance of Y directly using the form of f (y). It is easy to note that f (y) has the form of an exponential density with parameter θ = 1, and therefore its variance is the square of the parameter. This straightforward gives σY2 = 1. 9.3. Regression Curve and Scedastic Curve One of the major goals in most statistical studies is to establish relationships between two or more random variables. For example, a company would like to know the relationship between the potential sales of a new product in terms of its price. Historically, regression analysis was originated in the works of Sir Francis Galton (1822-1911) but most of the theory of regression analysis was developed by his student Sir Ronald Fisher (1890-1962). Conditional Expectations of Bivariate Random Variables 250 Definition 9.4. Let X and Y be two random variables with joint probability density function f (x, y) and let h(y/x) is the conditional density of Y given X = x. Then the conditional mean E(Y |X = x) = : ∞ y h(y/x) dy −∞ is called the regression function of Y on X. The graph of this regression function of Y on X is known as the regression curve of Y on X. Example 9.9. Let X and Y be two random variables with joint density f (x, y) = & x e−x(1+y) if x > 0; y > 0 0 otherwise. What is the regression function of Y on X? Answer: The marginal density f1 (x) of X is f1 (x) = = : ∞ −∞ : ∞ f (x, y) dy x e−x(1+y) dy 0 : ∞ x e−x e−xy dy : ∞ −x e−xy dy = xe 0 3∞ 2 1 −xy −x = xe − e x 0 = 0 = e−x . The conditional density of Y given X = x is h(y/x) = f (x, y) f1 (x) x e−x(1+y) e−x −xy = xe . = Probability and Mathematical Statistics 251 The conditional mean of Y given that X = x is E(Y |X = x) = = : ∞ −∞ : ∞ y h(y/x) dy y x e−xy dy 0 : 1 ∞ −z ze dz x 0 1 = Γ(2) x 1 = . x = (where z = xy) Thus, the regression function (or equation) of Y on X is given by E(Y |x) = 1 x for 0 < x < ∞. Definition 9.4. 
Let X and Y be two random variables with joint probability density function f (x, y) and let E(Y |X = x) be the regression function of Y on X. If this regression function is linear, then E(Y |X = x) is called a linear regression of Y on X. Otherwise, it is called nonlinear regression of Y on X. Example 9.10. Given the regression lines E(Y |X = x) = x + 2 and E(X|Y = y) = 1 + 12 y, what is the expected value of X? Answer: Since the conditional expectation E(Y |X = x) is linear in x, we get σY µY + ρ (x − µX ) = x + 2. σX Hence, equating the coefficients of x and constant terms, we get ρ σY =1 σX (9.8) Conditional Expectations of Bivariate Random Variables and µY − ρ σY µX = 2, σX 252 (9.9) respectively. Now, using (9.8) in (9.9), we get µY − µX = 2. (9.10) Similarly, since E(X|Y = y) is linear in y, we get ρ 1 σX = σY 2 and µX − ρ σX µY = 1, σY (9.11) (9.12) Hence, letting (9.10) into (9.11) and simplifying, we get 2µX − µY = 2. (9.13) Now adding (9.13) to (9.10), we see that µX = 4. Remark 9.4. In statistics, a linear regression usually means the conditional expectation E (Y /x) is linear in the parameters, but not in x. Therefore, E (Y /x) = α + θx2 will be a linear model, where as E (Y /x) = α xθ is not a linear regression model. Definition 9.5. Let X and Y be two random variables with joint probability density function f (x, y) and let h(y/x) is the conditional density of Y given X = x. Then the conditional variance : ∞ V ar(Y |X = x) = y 2 h(y/x) dy −∞ is called the scedastic function of Y on X. The graph of this scedastic function of Y on X is known as the scedastic curve of Y on X. Scedastic curves and regression curves are used for constructing families of bivariate probability density functions with specified marginals. Probability and Mathematical Statistics 253 9.4. Review Exercises 1. Given the regression lines E(Y |X = x) = x+2 and E(X|Y = y) = 1+ 12 y, what is expected value of Y ? 2. If the joint density of X and Y is  k if −1 < x < 1; x2 < y < 1 f (x, y) =  0 elsewhere , where k is a constant, what is E(Y |X = x) ? 3. Suppose the joint density of X and Y is defined by f (x, y) = & 10xy 2 if 0 elsewhere. 0<x<y<1 ! " What is E X 2 |Y = y ? 4. Let X and Y joint density function f (x, y) = & 2e−2(x+y) if 0 elsewhere. 0<x<y<∞ What is the expected value of Y , given X = x, for x > 0 ? 5. Let X and Y joint density function f (x, y) = & 8xy if 0 < x < 1; 0 < y < x 0 elsewhere. What is the regression curve y on x, that is, E (Y /X = x)? 6. Suppose X and Y are random variables with means µX and µY , respectively; and E(Y |X = x) = − 31 x + 10 and E(X|Y = y) = − 43 y + 2. What are the values of µX and µY ? 7. Let X and Y have joint density f (x, y) = & 24 5 0 (x + y) for 0 ≤ 2y ≤ x ≤ 1 otherwise. What is the conditional expectation of X given Y = y ? Conditional Expectations of Bivariate Random Variables 254 8. Let X and Y have joint density f (x, y) = & c xy 2 for 0 ≤ y ≤ 2x; 1 ≤ x ≤ 5 0 otherwise. What is the conditional expectation of Y given X = x ? 9. Let X and Y have joint density f (x, y) = & e−y for y ≥ x ≥ 0 0 otherwise. What is the conditional expectation of X given Y = y ? 10. Let X and Y have joint density & 2 xy f (x, y) = 0 for 0 ≤ y ≤ 2x ≤ 2 otherwise. What is the conditional expectation of Y given X = x ? 11. Let E(Y |X = x) = 2 + 5x, V ar(Y |X = x) = 3, and let X have the density function f (x) = &1 4 0 x x e− 2 if 0 < x < ∞ otherwise. What is the mean and variance of random variable Y ? 12. 
Let E(Y |X = x) = 2x and V ar(Y |X = x) = 4x2 + 3, and let X have the density function   √4 x2 e−x2 for 0 ≤ x < ∞ π f (x) =  0 elsewhere. What is the variance of Y ? 13. Let X and Y have joint density & 2 for 0 < y < 1 − x; and 0 < x < 1 f (x, y) = 0 otherwise. What is the conditional variance of Y given X = x ? Probability and Mathematical Statistics 14. Let X and Y have joint density & 4x f (x, y) = 0 255 for 0 < x < √ y<1 elsewhere. What is the conditional variance of Y given X = x ? 15. Let X and Y have joint density &6 for 1 ≤ x + y ≤ 2; x ≥ 0, y ≥ 0 7x f (x, y) = 0 elsewhere. What is the conditional variance of X given Y = 16. Let X and Y have joint density & 12x f (x, y) = 0 3 2 ? for 0 < y < 2x < 1 elsewhere. What is the conditional variance of Y given X = 0.5 ? 17. Let the random variable W denote the number of students who take business calculus each semester at the University of Louisville. If the random variable W has a Poisson distribution with parameter λ equal to 300 and the probability of each student passing the course is 35 , then on an average how many students will pass the business calculus? 18. If the conditional density of Y given X = x is given by ! "  y5 xy (1 − x)5−y if y = 0, 1, 2, ..., 5 f (y/x) =  0 otherwise, and the marginal density of X is   4x3 f1 (x) =  0 if 0 < x < 1 otherwise, then what is the conditional expectation of Y given the event X = x? 19. If the joint density of the random variables X and Y is   2+(2x−1)(2y−1) if 0 < x, y < 1 2 f (x, y) =  0 otherwise, Conditional Expectations of Bivariate Random Variables then what is the regression function of Y on X? 20. If the joint density of the random variables X and Y is ; <  emin{x,y} − 1 e−(x+y) if 0 < x, y < ∞ f (x, y) =  0 otherwise, then what is the conditional expectation of Y given X = x? 256 Probability and Mathematical Statistics 257 Chapter 10 TRANSFORMATION OF RANDOM VARIABLES AND THEIR DISTRIBUTIONS In many statistical applications, given the probability distribution of a univariate random variable X, one would like to know the probability distribution of another univariate random variable Y = φ(X), where φ is some known function. For example, if we know the probability distribution of the random variable X, we would like know the distribution of Y = ln(X). For univariate random variable X, some commonly used transformed random I variable Y of X are: Y = X 2 , Y = |X|, Y = |X|, Y = ln(X), Y = -2 , X−µ X−µ . Similarly for a bivariate random variable (X, Y ), , and Y = σ σ some of the most common transformations of X and Y are X + Y , XY , X Y , √ min{X, Y }, max{X, Y } or X 2 + Y 2 . In this chapter, we examine various methods for finding the distribution of a transformed univariate or bivariate random variable, when transformation and distribution of the variable are known. First, we treat the univariate case. Then we treat the bivariate case. We begin with an example for univariate discrete random variable. Example 10.1. The probability density function of the random variable X is shown in the table below. x −2 −1 0 f (x) 1 10 2 10 1 10 1 1 10 2 3 4 1 10 2 10 2 10 Transformation of Random Variables and their Distributions 258 What is the probability density function of the random variable Y = X 2 ? Answer: The space of the random variable X is RX = {−2, −1, 0, 1, 2, 3, 4}. Then the space of the random variable Y is RY = {x2 | x ∈ RX }. Thus, RY = {0, 1, 4, 9, 16}. Now we compute the probability density function g(y) for y in RY . 
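This bookkeeping can also be done mechanically, as in the short sketch below (plain Python): each mass f(x) is pushed onto the point y = x^2, and masses landing on the same y are added. The table it produces is exactly the one derived step by step next.

    from fractions import Fraction as F
    from collections import defaultdict

    f = {-2: F(1, 10), -1: F(2, 10), 0: F(1, 10), 1: F(1, 10),
          2: F(1, 10),  3: F(2, 10), 4: F(2, 10)}

    g = defaultdict(F)              # F() is Fraction(0), so new keys start with zero mass
    for x, p in f.items():
        g[x * x] += p               # push the mass at x onto y = x^2

    for y in sorted(g):
        print(y, g[y])              # prints 0 1/10, 1 3/10, 4 1/5, 9 1/5, 16 1/5  (1/5 = 2/10)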
g(0) = P (Y = 0) = P (X 2 = 0) = P (X = 0)) = 1 10 3 10 2 g(4) = P (Y = 4) = P (X 2 = 4) = P (X = −2) + P (X = 2) = 10 2 g(9) = P (Y = 9) = P (X 2 = 9) = P (X = 3) = 10 2 g(16) = P (Y = 16) = P (X 2 = 16) = P (X = 4) = . 10 g(1) = P (Y = 1) = P (X 2 = 1) = P (X = −1) + P (X = 1) = We summarize the distribution of Y in the following table. y 0 1 4 9 16 g(y) 1 10 3 10 2 10 2 10 2 10 3/10 2/10 2/10 1/10 1/10 -2 -1 0 1 2 3 4 0 1 Density Function of X 4 9 Density Function of Y = X 16 2 Example 10.2. The probability density function of the random variable X is shown in the table below. x 1 2 3 f (x) 1 6 1 6 1 6 4 1 6 5 6 1 6 1 6 What is the probability density function of the random variable Y = 2X + 1? Probability and Mathematical Statistics 259 Answer: The space of the random variable X is RX = {1, 2, 3, 4, 5, 6}. Then the space of the random variable Y is RY = {2x + 1 | x ∈ RX }. Thus, RY = {3, 5, 7, 9, 11, 13}. Next we compute the probability density function g(y) for y in RY . The pdf g(y) is given by 1 6 1 g(5) = P (Y = 5) = P (2X + 1 = 5) = P (X = 2)) = 6 1 g(7) = P (Y = 7) = P (2X + 1 = 7) = P (X = 3)) = 6 1 g(9) = P (Y = 9) = P (2X + 1 = 9) = P (X = 4)) = 6 g(3) = P (Y = 3) = P (2X + 1 = 3) = P (X = 1)) = 1 6 1 g(13) = P (Y = 13) = P (2X + 1 = 13) = P (X = 6)) = . 6 g(11) = P (Y = 11) = P (2X + 1 = 11) = P (X = 5)) = We summarize the distribution of Y in the following table. y 3 5 7 g(y) 1 6 1 6 1 6 9 11 13 1 6 1 6 1 6 The distribution of X and 2X + 1 are illustrated below. 1/6 1/6 1 2 3 4 5 Density Function of X 6 3 5 7 9 11 13 Density Function of Y = 2X+1 In Example 10.1, we computed the distribution (that is, the probability density function) of transformed random variable Y = φ(X), where φ(x) = x2 . This transformation is not either increasing or decreasing (that is, monotonic) in the space, RX , of the random variable X. Therefore, the distribution of Y turn out to be quite different from that of X. In Example 10.2, the form of distribution of the transform random variable Y = φ(X), where φ(x) = 2x + 1, is essentially same. This is mainly due to the fact that φ(x) = 2x + 1 is monotonic in RX . Transformation of Random Variables and their Distributions 260 In this chapter, we shall examine the probability density function of transformed random variables by knowing the density functions of the original random variables. There are several methods for finding the probability density function of a transformed random variable. Some of these methods are: (1) distribution function method (2) transformation method (3) convolution method, and (4) moment generating function method. Among these four methods, the transformation method is the most useful one. The convolution method is a special case of this method. The transformation method is derived using the distribution function method. 10.1. Distribution Function Method We have seen in chapter six that an easy way to find the probability density function of a transformation of continuous random variables is to determine its distribution function and then its density function by differentiation. Example 10.3. A box is to be constructed so that the height is 4 inches and its base is X inches by X inches. If X has a standard normal distribution, what is the distribution of the volume of the box? Answer: The volume of the box is a random variable, since X is a random variable. This random variable V is given by V = 4X 2 . 
To find the density function of V , we first determine the form of the distribution function G(v) of V and then we differentiate G(v) to find the density function of V . The distribution function of V is given by G(v) = P (V ≤ v) ! " = P 4X 2 ≤ v $ # 1√ 1√ v≤X≤ v =P − 2 2 √ : 12 v 1 2 1 √ e− 2 x dx = 1 √ 2π −2 v : 12 √v 1 2 1 √ e− 2 x dx =2 (since the integrand is even). 2π 0 Probability and Mathematical Statistics 261 Hence, by the Fundamental Theorem of Calculus, we get dG(v) dv) + : 12 √v 1 − 1 x2 d √ e 2 dx = 2 dv 2π 0 # $ √ 2 1 1√ 1 1 d v = 2 √ e− 2 ( 2 v) 2 dv 2π 1 1 1 =√ e− 8 v √ 2 v 2π 1 v 1 = ! 1 " √ v 2 −1 e− 8 Γ 2 8 # $ 1 = V ∼ GAM 8, . 2 g(v) = Example 10.4. If the density function of X is   21 for −1 < x < 1 f (x) =  0 otherwise, what is the probability density function of Y = X 2 ? Answer: We first find the cumulative distribution function of Y and then by differentiation, we obtain the density of Y . The distribution function G(y) of Y is given by G(y) = P (Y ≤ y) ! " = P X2 ≤ y √ √ = P (− y ≤ X ≤ y) : √y 1 = √ dx − y 2 √ = y. Transformation of Random Variables and their Distributions 262 Hence, the density function of Y is given by dG(y) dy √ d y = dy 1 = √ 2 y g(y) = for 0 < y < 1. 10.2. Transformation Method for Univariate Case The following theorem is the backbone of the transformation method. Theorem 10.1. Let X be a continuous random variable with probability density function f (x). Let y = T (x) be an increasing (or decreasing) function. Then the density function of the random variable Y = T (X) is given by > > > dx > g(y) = >> >> f (W (y)) dy where x = W (y) is the inverse function of T (x). Proof: Suppose y = T (x) is an increasing function. The distribution function G(y) of Y is given by G(y) = P (Y ≤ y) = P (T (X) ≤ y) = P (X ≤ W (y)) : W (y) = f (x) dx. −∞ Probability and Mathematical Statistics 263 Then, differentiating we get the density function of Y , which is dG(y) dy ): + W (y) d = f (x) dx dy −∞ g(y) = dW (y) dy dx = f (W (y)) dy = f (W (y)) (since x = W (y)). On the other hand, if y = T (x) is a decreasing function, then the distribution function of Y is given by G(y) = P (Y ≤ y) = P (T (X) ≤ y) = P (X ≥ W (y)) (since T (x ) is decreasing) = 1 − P (X ≤ W (y)) : W (y) =1− f (x) dx. −∞ As before, differentiating we get the density function of Y , which is dG(y) dy ) + : W (y) d = f (x) dx 1− dy −∞ g(y) = dW (y) dy dx = −f (W (y)) dy = −f (W (y)) (since Hence, combining both the cases, we get > > > dx > g(y) = >> >> f (W (y)) dy x = W (y)). and the proof of the theorem is now complete. ! " 2 Example 10.5. Let Z = X−µ σ . If X ∼ N µ, σ , what is the probability density function of Z? Transformation of Random Variables and their Distributions Answer: z = U (x) = 264 x−µ . σ Hence, the inverse of U is given by W (z) = x = σ z + µ. Therefore dx = σ. dz Hence, by Theorem 10.1, the density of Z is given by > > > dx > g(z) = >> >> f (W (y)) dz ! W (z)−µ "2 1 −1 σ =σ√ e 2 2πσ 2 zσ+µ−µ 2 1 1 e− 2 ( σ ) =√ 2π 1 − 1 z2 =√ e 2 . 2π " ! 2 2 Example 10.6. Let Z = X−µ σ . If X ∼ N µ, σ , then show that Z is 2 2 chi-square with one degree of freedom, that Z ∼ χ (1). Answer: y = T (x) = # √ x = µ + σ y. √ W (y) = µ + σ y, σ dx = √ . dy 2 y x−µ σ $2 . y > 0. Probability and Mathematical Statistics 265 The density of Y is Hence Y ∼ χ2 (1). > > > dx > g(y) = >> >> f (W (y)) dy 1 = σ √ f (W (y)) 2 y ! W (y)−µ "2 1 1 −1 σ e 2 =σ √ √ 2 y 2πσ 2 ! √yσ+µ−µ "2 1 −1 σ = √ e 2 2 2πy 1 1 = √ e− 2 y 2 2πy 1 1 1 = √ √ y − 2 e− 2 y 2 π 2 1 1 1 ! 1 " √ y − 2 e− 2 y . 
= 2Γ 2 2 Example 10.7. Let Y = − ln X. If X ∼ U N IF (0, 1), then what is the density function of Y where nonzero? Answer: We are given that y = T (x) = − ln x. Hence, the inverse of y = T (x) is given by W (y) = x = e−y . Therefore dx = − e−y . dy Transformation of Random Variables and their Distributions 266 Hence, by Theorem 10.1, the probability density of Y is given by > > > dx > g(y) = >> >> f (W (y)) dy = e−y f (W (y)) = e−y . Thus Y ∼ EXP (1). Hence, if X ∼ U N IF (0, 1), then the random variable − ln X ∼ EXP (1). Although all the examples we have in this section involve continuous random variables, the transformation method also works for the discrete random variables. 10.3. Transformation Method for Bivariate Case In this section, we extend the Theorem 10.2 to the bivariate case and present some examples to illustrate the importance of this extension. We state this theorem without a proof. Theorem 10.2. Let X and Y be two continuous random variables with joint density f (x, y). Let U = P (X, Y ) and V = Q(X, Y ) be functions of X and Y . If the functions P (x, y) and Q(x, y) have single valued inverses, say X = R(U, V ) and Y = S(U, V ), then the joint density g(u, v) of U and V is given by g(u, v) = |J| f (R(u, v), S(u, v)), where J denotes the Jacobian and given by # ∂x ∂x $ ∂v J = det ∂u ∂y ∂y ∂u = ∂v ∂x ∂y ∂x ∂y − . ∂u ∂v ∂v ∂u Probability and Mathematical Statistics 267 Example 10.8. Let X and Y have the joint probability density function f (x, y) = & 8 xy for 0 < x < y < 1 0 otherwise. What is the joint density of U = X Y Answer: Since  X Y  V =Y and V = Y ? U= we get by solving for X and Y X=UY =UV Y = V. ' Hence, the Jacobian of the transformation is given by ∂x ∂y ∂x ∂y − ∂u ∂v ∂v ∂u =v·1−u·0 J= = v. The joint density function of U and V is g(u, v) = |J| f (R(u, v), S(u, v)) = |v| f (uv, v) = v 8 (uv) v = 8 uv 3 . Note that, since 0<x<y<1 we have 0 < uv < v < 1. The last inequalities yield 0 < uv < v 0 < v < 1. ' Transformation of Random Variables and their Distributions Therefore, we get 0<u<1 0 < v < 1. 268 ' Thus, the joint density of U and V is given by & 8 uv 3 for 0 < u < 1; 0 < v < 1 g(u, v) = 0 otherwise. Example 10.9. Let each of the independent random variables X and Y have the density function & −x e for 0 < x < ∞ f (x) = 0 otherwise. What is the joint density of U = X and V = 2X + 3Y and the domain on which this density is positive? Answer: Since U =X V = 2X + 3Y, ' we get by solving for X and Y  X=U  1 2 Y = V − U.  3 3 Hence, the Jacobian of the transformation is given by ∂x ∂y ∂x ∂y − ∂u#∂v$ ∂v ∂u # $ 2 1 −0· − =1· 3 3 1 = . 3 J= Probability and Mathematical Statistics 269 The joint density function of U and V is g(u, v) = |J| f (R(u, v), S(u, v)) > > # $ >1> 2 1 > > = > > f u, v − u 3 3 3 1 −u − 1 v+ 2 u e e 3 3 3 u+v 1 = e−( 3 ) . 3 = Since we get 0<x<∞ 0 < y < ∞, 0<u<∞ 0 < v < ∞, Further, since v = 2u + 3y and 3y > 0, we have v > 2u. Hence, the domain of g(u, v) where nonzero is given by 0 < 2u < v < ∞. The joint density g(u, v) of the random variables U and V is given by   1 e−( u+v 3 ) for 0 < 2u < v < ∞ 3 g(u, v) =  0 otherwise. Example 10.10. Let X and Y be independent random variables, each with density function   λ e−λx for 0 < x < ∞ f (x) =  0 otherwise, where λ > 0. Let U = X + 2Y and V = 2X + Y . What is the joint density of U and V ? Answer: Since U = X + 2Y V = 2X + Y, ' Transformation of Random Variables and their Distributions 270 we get by solving for X and Y 2 1 X=− U+ V 3 3 2 1 Y = U − V. 
3 3      Hence, the Jacobian of the transformation is given by ∂x ∂y ∂x ∂y − ∂u $ # ∂v $∂u # $ # $ # ∂v 2 2 1 1 − − = − 3 3 3 3 1 4 = − 9 9 1 =− . 3 J= The joint density function of U and V is g(u, v) = |J| f (R(u, v), S(u, v)) > > > 1> > = >− >> f (R(u, v)) f (S(u, v)) 3 1 λ eλR(u,v) λ eλS(u,v) 3 1 = λ2 eλ[R(u,v)+S(u,v)] 3 1 2 −λ( u+v 3 ). = λ e 3 = Hence, the joint density g(u, v) of the random variables U and V is given by   1 λ2 e−λ( u+v 3 ) for 0 < u < ∞; 0 < v < ∞ 3 g(u, v) =  0 otherwise. Example 10.11. Let X and Y be independent random variables, each with density function 1 2 1 f (x) = √ e− 2 x , 2π −∞ < x < ∞. Let U = X Y and V = Y . What is the joint density of U and V ? Also, what is the density of U ? Probability and Mathematical Statistics Answer: Since 271  X Y  V = Y, U= we get by solving for X and Y X = UV Y = V. ' Hence, the Jacobian of the transformation is given by ∂x ∂y ∂x ∂y − ∂u ∂v ∂v ∂u = v · (1) − u · (0) J= = v. The joint density function of U and V is g(u, v) = |J| f (R(u, v), S(u, v)) = |v| f (R(u, v)) f (S(u, v)) 1 2 1 2 1 1 = |v| √ e− 2 R (u,v) √ e− 2 S (u,v) 2π 2π = |v| 1 − 12 [R2 (u,v)+S 2 (u,v)] e 2π = |v| 1 − 12 [u2 v2 +v2 ] e 2π = |v| 1 − 12 v2 (u2 +1) e . 2π Hence, the joint density g(u, v) of the random variables U and V is given by g(u, v) = |v| 1 − 12 v2 (u2 +1) e , 2π where −∞ < u < ∞ and −∞ < v < ∞. Transformation of Random Variables and their Distributions 272 Next, we want to find the density of U . We can obtain this by finding the marginal of U from the joint density of U and V . Hence, the marginal g1 (u) of U is given by : ∞ g(u, v) dv g1 (u) = −∞ : ∞ 1 − 12 v2 (u2 +1) e dv = |v| 2π −∞ : ∞ : 0 1 − 12 v2 (u2 +1) 1 − 21 v2 (u2 +1) v dv + dv −v = e e 2π 2π 0 −∞ 30 # $2 1 1 2 − 21 v 2 (u2 +1) = e 2π 2 u2 + 1 −∞ 3∞ # $2 1 1 −2 − 12 v2 (u2 +1) + e 2π 2 u2 + 1 0 1 1 1 1 = + 2π u2 + 1 2π u2 + 1 1 = . π (u2 + 1) Thus U ∼ CAU (1). Remark 10.1. If X and Y are independent and standard normal random variables, then the quotient X Y is always a Cauchy random variable. However, the converse of this is not true. For example, if X and Y are independent and each have the same density function f (x) = √ 2 x2 , π 1 + x4 −∞ < x < ∞, then it can be shown that the random variable X Y is a Cauchy random variable. Laha (1959) and Kotlarski (1960) have given a complete description of the family of all probability density function f such that the quotient X Y Probability and Mathematical Statistics 273 follows the standard Cauchy distribution whenever X and Y are independent and identically distributed random variables with common density f . Example 10.12. Let X have a Poisson distribution with mean λ. Find a transformation T (x) so that V ar ( T (X) ) is free of λ, for large values of λ. Answer: We expand the function T (x) by Taylor’s series about λ. Then, neglecting the higher orders terms for large values of λ, we get T (x) = T (λ) + (x − λ) T # (λ) + · · · · · · where T # (λ) represents derivative of T (x) at x = λ. Now, we compute the variance of T (X). V ar ( T (X) ) = V ar ( T (λ) + (X − λ) T # (λ) + · · · ) = V ar ( T (λ) ) + V ar ( (X − λ) T # (λ) ) 2 = 0 + [T # (λ)] V ar(X − λ) 2 = [T # (λ)] V ar(X) 2 = [T # (λ)] λ. We want V ar( T (X) ) to be free of λ for large λ. Therefore, we have 2 [T # (λ)] λ = k, where k is a constant. From this, we get c T # (λ) = √ , λ where c = √ k. Solving this differential equation, we get : 1 √ dλ T (λ) = c λ √ = 2c λ. 
√ Hence, the transformation T (x) = 2c x will free V ar( T (X) ) of λ if the random variable X ∼ P OI(λ). Example 10.13. Let X ∼ P OI(λ1 ) and Y ∼ P OI(λ2 ). What is the probability density function of X + Y if X and Y are independent? Answer: Let us denote U = X + Y and V = X. First of all, we find the joint density of U and V and then summing the joint density we determine Transformation of Random Variables and their Distributions 274 the marginal of U which is the density function of X + Y ? Now writing X and Y in terms of U and V , we get ' X=V Y = U − X = U − V. Hence, the Jacobian of the transformation is given by ∂x ∂y ∂x ∂y − ∂u ∂v ∂v ∂u = (0)(−1) − (1)(1) J= = −1. The joint density function of U and V is g(u, v) = |J| f (R(u, v), S(u, v)) = |−1| f (v, u − v) = f (v) f (u − v) # −λ1 v $ # −λ2 u−v $ e e λ2 λ1 = v! (u − v)! = e−(λ1 +λ2 ) λv1 λu−v 2 , (v)! (u − v)! where v = 0, 1, 2, ..., u and u = 0, 1, 2, ..., ∞. Hence, the marginal density of U is given by u % e−(λ1 +λ2 ) λv1 λu−v 2 g1 (u) = (v)! (u − v)! v=0 = e−(λ1 +λ2 ) = e−(λ1 +λ2 ) u % λv1 λu−v 2 (v)! (u − v)! v=0 # $ u % 1 u λv1 λu−v 2 u! v v=0 e−(λ1 +λ2 ) u (λ1 + λ2 ) . u! Thus, the density function of U = X + Y is given by  −(λ +λ )  e 1 2 (λ1 + λ2 )u for u = 0, 1, 2, ..., ∞ u! g1 (u) =  0 otherwise. = This example tells us that if X ∼ P OI(λ1 ) and Y ∼ P OI(λ2 ) and they are independent, then X + Y ∼ P OI(λ1 + λ2 ). Probability and Mathematical Statistics 275 Theorem 10.3. Let the joint density of the random variables X and Y be Y f (x, y). Then probability density functions of X + Y , XY , and X are given by : ∞ hX+Y (v) = −∞ : ∞ hXY (v) = −∞ ∞ : h X (v) = Y −∞ f (u, v − u) du 1 , vf u, du |u| u |u| f (u, vu) du, respectively. Proof: Let U = X and V = X + Y . So that X = R(U, V ) = U , and Y = S(U, V ) = V − U . Hence, the Jacobian of the transformation is given by ∂x ∂y ∂x ∂y J= − = 1. ∂u ∂v ∂v ∂u The joint density function of U and V is g(u, v) = |J| f (R(u, v), S(u, v)) = f (R(u, v), S(u, v)) = f (u, v − u). Hence, the marginal density of V = X + Y is given by : hX+Y (v) = ∞ −∞ f (u, v − u) du. Similarly, one can obtain the other two density functions. This completes the proof. In addition, if the random variables X and Y in Theorem 10.3 are independent and have the probability density functions f (x) and g(y) respectively, then we have : ∞ g(y) f (z − y) dy # $ z 1 g(y) f dy hXY (z) = |y| y −∞ : ∞ h X (z) = |y| g(y) f (zy) dy. hX+Y (z) = −∞ : ∞ Y −∞ Transformation of Random Variables and their Distributions 276 Each of the following figures shows how the distribution of the random variable X + Y is obtained from the joint distribution of (X, Y ). (X, Y) 3 6 5 4 3 2 X X+Y 2 1 3 2 Marginal Density of X Distribution of X+Y 5 4 Distribution of 3 1 Marginal Density of Y 3 2 1 3 2 Marginal Density of (X, Y) Joint Density of 1 Marginal Density of Y Joint Density of Example 10.14. Roll an unbiased die twice. If X denotes the outcome in the first roll and Y denotes the outcome in the second roll, what is the distribution of the random variable Z = max{X, Y }? Answer: The space of X is RX = {1, 2, 3, 4, 5, 6}. Similarly, the space of Y is RY = {1, 2, 3, 4, 5, 6}. Hence the space of the random variable (X, Y ) is RX × RY . The following table shows the distribution of (X, Y ). 
1 1 36 1 36 1 36 1 36 1 36 1 36 2 1 36 1 36 1 36 1 36 1 36 1 36 3 1 36 1 36 1 36 1 36 1 36 1 36 4 1 36 1 36 1 36 1 36 1 36 1 36 5 1 36 1 36 1 36 1 36 1 36 1 36 6 1 36 1 36 1 36 1 36 1 36 1 36 1 2 4 5 3 6 The space of the random variable Z = max{X, Y } is RZ = {1, 2, 3, 4, 5, 6}. 1 . Similarly, Z = 2 Thus Z = 1 only if (X, Y ) = (1, 1). Hence P (Z = 1) = 36 3 . Proceeding in only if (X, Y ) = (1, 2), (2, 2) or (2, 1). Hence, P (Z = 2) = 36 a similar manner, we get the distribution of Z which is summarized in the table below. Probability and Mathematical Statistics 277 z 1 2 3 h(z) 1 36 3 36 5 36 4 7 36 5 6 9 36 11 36 In this example, the random variable Z may be described as the best out of two rolls. Note that the probability density of Z can also be stated as h(z) = 2z − 1 , 36 for z ∈ {1, 2, 3, 4, 5, 6}. 10.4. Convolution Method for Sums of Random Variables In this section, we illustrate how convolution technique can be used in finding the distribution of the sum of random variables when they are independent. This convolution technique does not work if the random variables are not independent. Definition 10.1. Let f and g be two real valued functions. The convolution of f and g, denoted by f / g, is defined as : ∞ (f / g)(z) = f (z − y) g(y) dy −∞ : ∞ = g(z − x) f (x) dx. −∞ Hence from this definition it is clear that f / g = g / f . Let X and Y be two independent random variables with probability density functions f (x) and g(y). Then by Theorem 10.3, we get : ∞ f (z − y) g(y) dy. h(z) = −∞ Thus, this result shows that the density of the random variable Z = X + Y is the convolution of the density of X with the density of Y . Example 10.15. What is the probability density of the sum of two independent random variables, each of which is uniformly distributed over the interval [0, 1]? Answer: Let Z = X + Y , where X ∼ U N IF (0, 1) and Y ∼ U N IF (0, 1). Hence, the density function f (x) of the random variable X is given by & 1 for 0 ≤ x ≤ 1 f (x) = 0 otherwise. Transformation of Random Variables and their Distributions 278 Similarly, the density function g(y) of Y is given by g(y) = & 1 for 0 ≤ y ≤ 1 0 otherwise. Since X and Y are independent, the density function of Z can be obtained by the method of convolution. Since, the sum z = x + y is between 0 and 2, we consider two cases. First, suppose 0 ≤ z ≤ 1, then h(z) = (f / g) (z) : ∞ = f (z − x) g(x) dx −∞ 1 = : f (z − x) g(x) dx 0 = = : z :o z o = : f (z − x) g(x) dx + : 1 f (z − x) g(x) dx z f (z − x) g(x) dx + 0 (since f (z − x) = 0 between z and 1) z dx 0 = z. Similarly, if 1 ≤ z ≤ 2, then h(z) = (f / g) (z) : ∞ = f (z − x) g(x) dx −∞ 1 = : f (z − x) g(x) dx 0 = : z−1 0 =0+ = : : f (z − x) g(x) dx + 1 z−1 1 dx z−1 = 2 − z. : 1 z−1 f (z − x) g(x) dx f (z − x) g(x) dx (since f (z − x) = 0 between 0 and z − 1) Probability and Mathematical Statistics 279 Thus, the density function of Z = X + Y is given by h(z) =  0        z    2−z      0 for −∞ < z ≤ 0 for 0 ≤ z ≤ 1 for 1 ≤ z ≤ 2 for 2 < z < ∞ . The graph of this density function looks like a tent and it is called a tent function. However, in literature, this density function is known as the Simpson’s distribution. Example 10.16. What is the probability density of the sum of two independent random variables, each of which is gamma with parameter α = 1 and θ = 1 ? Answer: Let Z = X + Y , where X ∼ GAM (1, 1) and Y ∼ GAM (1, 1). Hence, the density function f (x) of the random variable X is given by f (x) = & e−x for 0 < x < ∞ 0 otherwise. 
Similarly, the density function g(y) of Y is given by g(y) = & e−y for 0 < y < ∞ 0 otherwise. Since X and Y are independent, the density function of Z can be obtained by the method of convolution. Notice that the sum z = x + y is between 0 Transformation of Random Variables and their Distributions 280 and ∞, and 0 < x < z. Hence, the density function of Z is given by h(z) = (f / g) (z) : ∞ f (z − x) g(x) dx = −∞ : ∞ f (z − x) g(x) dx = :0 z = e−(z−x) e−x dx 0 : z = e−z+x e−x dx :0 z = e−z dx 0 = z e−z z 1 z 2−1 e− 1 . = Γ(2) 12 Hence Z ∼ GAM (1, 2). Thus, if X ∼ GAM (1, 1) and Y ∼ GAM (1, 1), then X + Y ∼ GAM (1, 2), that X + Y is a gamma with α = 2 and θ = 1. Recall that a gamma random variable with α = 1 is known as an exponential random variable with parameter θ. Thus, in view of the above example, we see that the sum of two independent exponential random variables is not necessarily an exponential variable. Example 10.17. What is the probability density of the sum of two independent random variables, each of which is standard normal? Answer: Let Z = X + Y , where X ∼ N (0, 1) and Y ∼ N (0, 1). Hence, the density function f (x) of the random variable X is given by x2 1 f (x) = √ e− 2 2π Similarly, the density function g(y) of Y is given by y2 1 g(y) = √ e− 2 2π Since X and Y are independent, the density function of Z can be obtained by the method of convolution. Notice that the sum z = x + y is between −∞ Probability and Mathematical Statistics 281 and ∞. Hence, the density function of Z is given by h(z) = (f / g) (z) : ∞ = f (z − x) g(x) dx −∞ : ∞ −(z−x)2 x2 1 = e− 2 dx e 2 2π −∞ : ∞ z 2 1 − z2 e−(x− 2 ) dx = e 4 2π −∞ 2: ∞ 3 z 2 1 − z2 √ 1 √ e−(x− 2 ) dx e 4 π = 2π π −∞ 2: ∞ 3 1 −w2 1 − z2 z √ e = e 4 dw , where w = x − 2π 2 π −∞ 1 − z2 =√ e 4 4π ! "2 1 − 12 z−0 √ 2 . =√ e 4π The integral in the brackets equals to one, since the integrand is the normal density function with mean µ = 0 and variance σ 2 = 21 . Hence sum of two standard normal random variables is again a normal random variable with mean zero and variance 2. Example 10.18. What is the probability density of the sum of two independent random variables, each of which is Cauchy? Answer: Let Z = X + Y , where X ∼ N (0, 1) and Y ∼ N (0, 1). Hence, the density function f (x) of the random variable X and Y are is given by f (x) = 1 π (1 + x2 ) and g(y) = 1 , π (1 + y 2 ) respectively. Since X and Y are independent, the density function of Z can be obtained by the method of convolution. Notice that the sum z = x + y is between −∞ and ∞. Hence, the density function of Z is given by h(z) = (f / g) (z) : ∞ = f (z − x) g(x) dx −∞ : ∞ 1 1 dx = 2 ) π (1 + x2 ) π (1 + (z − x) −∞ : ∞ 1 1 1 = 2 dx. π −∞ 1 + (z − x)2 1 + x2 Transformation of Random Variables and their Distributions 282 To integrate the above integral, we decompose the integrand using partial fraction decomposition. Hence 1 2Ax + B 2 C (z − x) + D 1 = + 1 + (z − x)2 1 + x2 1 + x2 1 + (z − x)2 where A= 1 =C z (4 + z 2 ) and B = 1 = D. 4 + z2 Now integration yields : ∞ 1 1 1 dx π 2 −∞ 1 + (z − x)2 1 + x2 $ 3∞ 2 # 1 1 + x2 2 −1 2 −1 + z tan x − z tan (z − x) z ln = 2 2 π z (4 + z 2 ) 1 + (z − x)2 −∞ ; < 1 0 + z2 π + z2 π = 2 2 π z (4 + z 2 ) 2 = . π (4 + z 2 ) Hence the sum of two independent Cauchy random variables is not a Cauchy random variable. If X ∼ CAU (0) and Y ∼ CAU (0), then it can be easily shown using Example 10.18 that the random variable Z = X+Y is again Cauchy, that is 2 Z ∼ CAU (0). This is a remarkable property of the Cauchy distribution. 
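The convolution results of Examples 10.15, 10.17 and 10.18 can also be checked numerically. The following Python sketch is only an informal verification and not part of the derivations above; it assumes the NumPy library, and the seed, sample size and bin count are arbitrary choices.

# A quick Monte Carlo check of Examples 10.15, 10.17 and 10.18.
import numpy as np

rng = np.random.default_rng(seed=10)
n = 200_000

# Example 10.15: X, Y ~ UNIF(0, 1); X + Y should follow the tent density
#   h(z) = z on [0, 1] and h(z) = 2 - z on [1, 2].
z = rng.uniform(0, 1, n) + rng.uniform(0, 1, n)
hist, edges = np.histogram(z, bins=40, range=(0, 2), density=True)
mid = 0.5 * (edges[:-1] + edges[1:])
tent = np.where(mid <= 1, mid, 2 - mid)
print("max |empirical - tent| :", np.max(np.abs(hist - tent)))   # small

# Example 10.17: X, Y ~ N(0, 1); X + Y should be N(0, 2).
s = rng.standard_normal(n) + rng.standard_normal(n)
print("variance of X + Y      :", s.var())                       # close to 2

# Example 10.18: X, Y ~ CAU(0); the average (X + Y)/2 should again be CAU(0).
c = 0.5 * (rng.standard_cauchy(n) + rng.standard_cauchy(n))
q = [0.25, 0.5, 0.75]
print("quantiles of (X+Y)/2   :", np.quantile(c, q))
print("quantiles of Cauchy    :", np.quantile(rng.standard_cauchy(n), q))

The empirical histogram in the uniform case should track the tent density closely, the sample variance in the normal case should be near 2, and the quartiles of the Cauchy average should agree with those of a single standard Cauchy sample.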
So far we have considered the convolution of two continuous independent random variables. However, the concept can be modified to the case when the random variables are discrete. Let X and Y be two discrete random variables both taking on values that are integers. Let Z = X + Y be the sum of the two random variables. Hence Z takes values on the set of integers. Suppose that X = n where n is some integer. Then Z = z if and only if Y = z − n. Thus the events (Z = z) is the union of the pair wise disjoint events (X = n) and (Y = z − n) where n runs over the integers. The cdf H(z) of Z can be obtained as follows: P (Z = z) = ∞ % n=−∞ which is h(z) = P (X = n) P (Y = z − n) ∞ % n=−∞ f (n) g(z − n), Probability and Mathematical Statistics 283 where F (x) and G(y) are the cdf of X and Y , respectively. Definition 10.2. Let X and Y be two independent integer-valued discrete random variables, with pdfs f (x) and g(y) respectively. Then the convolution of f (x) and g(y) is the cdf h = f / g given by h(m) = ∞ % n=−∞ f (n) g(m − n), for m = −∞, ..., −2, −1, 0, 1, 2, ....∞. The function h(z) is the pdf of the discrete random variable Z = X + Y . Example 10.19. Let each of the random variable X and Y represents the outcomes of a six-sided die. What is the cumulative density function of the sum of X and Y ? Answer: Since the range of X as well as Y is {1, 2, 3, 4, 5, 6}, the range of Z = X + Y is RZ = {2, 3, 4, ..., 11, 12}. The pdf of Z is given by h(2) = f (1) g(1) = 1 1 1 · = 6 6 36 1 1 1 1 2 · + · = 6 6 6 6 36 1 1 1 1 1 1 3 h(4) = f (1) g(3) + h(2) g(2) + f (3) g(1) = · + · + · = . 6 6 6 6 6 6 36 h(3) = f (1) g(2) + f (2) g(1) = 4 5 6 Continuing in this manner we obtain h(5) = 36 , h(6) = 36 , h(7) = 36 , 5 4 3 2 1 h(8) = 36 , h(9) = 36 , h(10) = 36 , h(11) = 36 , and h(12) = 36 . Putting these into one expression we have h(z) = z−1 % n=1 = f (n)g(z − n) 6 − |z − 7| , 36 z = 2, 3, 4, ..., 12. It is easy to note that the convolution operation is commutative as well as associative. Using the associativity of the convolution operation one can compute the pdf of the random variable Sn = X1 + X2 + · · · + Xn , where X1 , X2 , ..., Xn are random variables each having the same pdf f (x). Then the pdf of S1 is f (x). Since Sn = Sn−1 + Xn and the pdf of Xn is f (x), the pdf of Sn can be obtained by induction. Transformation of Random Variables and their Distributions 284 10.5. Moment Generating Function Method We know that if X and Y are independent random variables, then MX+Y (t) = MX (t) MY (t). This result can be used to find the distribution of the sum X + Y . Like the convolution method, this method can be used in finding the distribution of X + Y if X and Y are independent random variables. We briefly illustrate the method using the following example. Example 10.20. Let X ∼ P OI(λ1 ) and Y ∼ P OI(λ2 ). What is the probability density function of X + Y if X and Y are independent? Answer: Since, X ∼ P OI(λ1 ) and Y ∼ P OI(λ2 ), we get MX (t) = eλ1 (e t −1) and MY (t) = eλ2 (e t −1) . Further, since X and Y are independent, we have MX+Y (t) = MX (t) MY (t) = eλ1 (e t −1) λ2 (et −1) e λ1 (et −1)+λ2 (et −1) =e t = e(λ1 +λ2 )(e −1) , that is, X +Y ∼ P OI(λ1 +λ2 ). Hence the density function h(z) of Z = X +Y is given by h(z) =  −(λ +λ )  e 1 2 (λ1 + λ2 )z z!  0 for z = 0, 1, 2, 3, ... otherwise. Compare this example to Example 10.13. You will see that moment method has a definite advantage over the convolution method. 
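As an informal cross-check of Examples 10.13 and 10.20, the convolution of two Poisson probability mass functions can be computed numerically and compared with the pmf of POI(λ1 + λ2). The short Python sketch below assumes NumPy and SciPy; the parameter values λ1 = 2.0, λ2 = 3.5 and the truncation point of the support are arbitrary choices.

# Numerical cross-check: convolution of two Poisson pmfs versus POI(lam1 + lam2).
import numpy as np
from scipy.stats import poisson

lam1, lam2 = 2.0, 3.5
support = np.arange(0, 40)            # truncation point chosen large enough

f = poisson.pmf(support, lam1)        # pmf of X ~ POI(lam1)
g = poisson.pmf(support, lam2)        # pmf of Y ~ POI(lam2)

# Discrete convolution: h(u) = sum_{v=0}^{u} f(v) g(u - v).
h = np.convolve(f, g)[:len(support)]

# Moment generating function method says X + Y ~ POI(lam1 + lam2).
h_mgf = poisson.pmf(support, lam1 + lam2)

print("max |convolution - POI(lam1+lam2)| :", np.max(np.abs(h - h_mgf)))

The printed difference is negligible, confirming that both methods lead to the same density for X + Y.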
However, if you use the moment method in Example 10.15, then you will have problem identifying the form of the density function of the random variable X + Y . Thus, it is difficult to say which method always works. Most of the time we pick a particular method based on the type of problem at hand. Probability and Mathematical Statistics 285 Example 10.21. What is the probability density function of the sum of two independent random variable, each of which is gamma with parameters θ and α? Answer: Let X and Y be two independent gamma random variables with parameters θ and α, that is X ∼ GAM (θ, α) and Y ∼ GAM (θ, α). From Theorem 6.3, the moment generating functions of X and Y are obtained as MX (t) = (1 − θ)−α and MY (t) = (1 − θ)−α , respectively. Since, X and Y are independent, we have MX+Y (t) = MX (t) MY (t) = (1 − θ)−α (1 − θ)−α = (1 − θ)−2α . Thus X + Y has a moment generating function of a gamma random variable with parameters θ and 2α. Therefore X + Y ∼ GAM (θ, 2α). 10.6. Review Exercises 1. Let X be a continuous random variable with density function f (x) = & e−2x + 1 2 e−x 0 for 0 < x < ∞ otherwise. If Y = e−2X , then what is the density function of Y where nonzero? 2. Suppose that X is a random variable with density function f (x) = &3 8 x2 0 for 0 < x < 2 otherwise. Let Y = mX 2 , where m is a fixed positive number. What is the density function of Y where nonzero? 3. Let X be a continuous random variable with density function f (x) = & 2 e−2x for x > 0 0 otherwise and let Y = e−X . What is the density function g(y) of Y where nonzero? Transformation of Random Variables and their Distributions 286 4. What is the probability density of the sum of two independent random variables, each of which is uniformly distributed over the interval [−2, 2]? 5. Let X and Y be random variables with joint density function f (x, y) = & e−x for 0 < x < ∞; 0 < y < 1 0 elsewhere . If Z = X + 2Y , then what is the joint density of X and Z where nonzero? 6. Let X be a continuous random variable with density function f (x) = If Y = √ & 2 x2 for 1 < x < 2 0 elsewhere. X, then what is the density function of Y for 1 < y < √ 2? 7. What is the probability density of the sum of two independent random variables, each of which has the density function given by f (x) = & 10−x 50 0 for 0 < x < 10 elsewhere? 8. What is the probability density of the sum of two independent random variables, each of which has the density function given by & a for a ≤ x < ∞ x2 f (x) = 0 elsewhere? 9. Roll an unbiased die 3 times. If U denotes the outcome in the first roll, V denotes the outcome in the second roll, and W denotes the outcome of the third roll, what is the distribution of the random variable Z = max{U, V, W }? 10. The probability density of V , the velocity of a gas molecule, by MaxwellBoltzmann law is given by  3 2 2  4√hπ v 2 e−h v for 0 ≤ v < ∞ f (v) =  0 otherwise, where h is the Plank’s constant. If m represents the mass of a gas molecule, then what is the probability density of the kinetic energy Z = 21 mV 2 ? Probability and Mathematical Statistics 287 11. If the random variables X and Y have the joint density f (x, y) =   67 x for 1 ≤ x + y ≤ 2, x ≥ 0, y ≥ 0  0 otherwise, what is the joint density of U = 2X + 3Y and V = 4X + Y ? 12. If the random variables X and Y have the joint density f (x, y) = what is the density of   67 x for 1 ≤ x + y ≤ 2, x ≥ 0, y ≥ 0  0 otherwise, X Y ? 13. Let X and Y have the joint probability density function f (x, y) = & 5 16 xy 2 0 for 0 < x < y < 2 elsewhere. 
What is the joint density function of U = 3X − 2Y and V = X + 2Y where it is nonzero? 14. Let X and Y have the joint probability density function f (x, y) = & 4x for 0 < x < 0 elsewhere. √ y<1 What is the joint density function of U = 5X − 2Y and V = 3X + 2Y where it is nonzero? 15. Let X and Y have the joint probability density function f (x, y) = & 4x for 0 < x < 0 elsewhere. √ y<1 What is the density function of X − Y ? 16. Let X and Y have the joint probability density function f (x, y) = & 4x for 0 < x < 0 elsewhere. √ y<1 Transformation of Random Variables and their Distributions What is the density function of X Y 288 ? 17. Let X and Y have the joint probability density function √ & 4x for 0 < x < y < 1 f (x, y) = 0 elsewhere. What is the density function of XY ? 18. Let X and Y have the joint probability density function & 5 2 for 0 < x < y < 2 16 xy f (x, y) = 0 elsewhere. What is the density function of Y X? 19. If X an uniform random variable on the interval [0, 2] and Y is an uniform random variable on the interval [0, 3], then what is the joint probability density function of X + Y if they are independent? 20. What is the probability density function of the sum of two independent random variable, each of which is binomial with parameters n and p? 21. What is the probability density function of the sum of two independent random variable, each of which is exponential with mean θ? 22. What is the probability density function of the average of two independent random variable, each of which is Cauchy with parameter θ = 0? 23. What is the probability density function of the average of two independent random variable, each of which is normal with mean µ and variance σ2 ? 24. Both roots of the quadratic equation x2 + αx + β = 0 can take all values from −1 to +1 with equal probabilities. What are the probability density functions of the coefficients α and β? 25. If A, B, C are independent random variables uniformly distributed on the interval from zero to one, then what is the probability that the quadratic equation Ax2 + Bx + C = 0 has real solutions? 26. The price of a stock on a given trading day changes according to the distribution f (−1) = 14 , f (0) = 12 , f (1) = 81 , and f (2) = 18 . Find the distribution for the change in stock price after two (independent) trading days. Probability and Mathematical Statistics 289 Chapter 11 SOME SPECIAL DISCRETE BIVARIATE DISTRIBUTIONS In this chapter, we shall examine some bivariate discrete probability density functions. Ever since the first statistical use of the bivariate normal distribution (which will be treated in Chapter 12) by Galton and Dickson in 1886, attempts have been made to develop families of bivariate distributions to describe non-normal variations. In many textbooks, only the bivariate normal distribution is treated. This is partly due to the dominant role the bivariate normal distribution has played in statistical theory. Recently, however, other bivariate distributions have started appearing in probability models and statistical sampling problems. This chapter will focus on some well known bivariate discrete distributions whose marginal distributions are wellknown univariate distributions. The book of K.V. Mardia gives an excellent exposition on various bivariate distributions. 11.1. Bivariate Bernoulli Distribution We define a bivariate Bernoulli random variable by specifying the form of the joint probability distribution. Definition 11.1. 
A discrete bivariate random variable (X, Y ) is said to have the bivariate Bernoulli distribution if its joint probability density is of the form  1−x−y 1  x! y! (1−x−y)! , if x, y = 0, 1 px1 py2 (1 − p1 − p2 ) f (x, y) =  0 otherwise, Some Special Discrete Bivariate Distributions 290 where 0 < p1 , p2 , p1 + p2 < 1 and x + y ≤ 1. We denote a bivariate Bernoulli random variable by writing (X, Y ) ∼ BER (p1 , p2 ). In the following theorem, we present the expected values and the variances of X and Y , the covariance between X and Y , and their joint moment generating function. Recall that the joint moment generating function of X ! " and Y is defined as M (s, t) := E esX+tY . Theorem 11.1. Let (X, Y ) ∼ BER (p1 , p2 ), where p1 and p2 are parameters. Then E(X) = p1 E(Y ) = p2 V ar(X) = p1 (1 − p1 ) V ar(Y ) = p2 (1 − p2 ) Cov(X, Y ) = −p1 p2 M (s, t) = 1 − p1 − p2 + p1 es + p2 et . Proof: First, we derive the joint moment generating function of X and Y and then establish the rest of the results from it. The joint moment generating function of X and Y is given by ! " M (s, t) = E esX+tY = 1 1 % % f (x, y) esx+ty x=0 y=0 = f (0, 0) + f (1, 0) es + f (0, 1) et + f (1, 1) et+s = 1 − p1 − p2 + p1 es + p2 et + 0 et+s = 1 − p1 − p2 + p1 es + p2 et . The expected value of X is given by E(X) = > ∂M >> ∂s >(0,0) > "> ∂ ! s t > 1 − p 1 − p2 + p 1 e + p2 e > = ∂s (0,0) = p1 es |(0,0) = p1 . Probability and Mathematical Statistics 291 Similarly, the expected value of Y is given by > ∂M >> E(Y ) = ∂t >(0,0) > "> ∂ ! 1 − p1 − p2 + p1 es + p2 et >> ∂t (0,0) > = p2 et >(0,0) = = p2 . The product moment of X and Y is E(XY ) = > ∂ 2 M >> ∂t ∂s >(0,0) > "> ∂2 ! s t > = 1 − p1 − p2 + p1 e + p 2 e > ∂t ∂s (0,0) > > ∂ = ( p1 es )>> ∂t (0,0) = 0. Therefore the covariance of X and Y is Cov(X, Y ) = E(XY ) − E(X) E(Y ) = −p1 p2 Similarly, it can be shown that E(X 2 ) = p1 and E(Y 2 ) = p2 . Thus, we have V ar(X) = E(X 2 ) − E(X)2 = p1 − p21 = p1 (1 − p1 ) and V ar(Y ) = E(Y 2 ) − E(Y )2 = p2 − p22 = p2 (1 − p2 ). This completes the proof of the theorem. The next theorem presents some information regarding the conditional distributions f (x/y) and f (y/x). Some Special Discrete Bivariate Distributions 292 Theorem 11.2. Let (X, Y ) ∼ BER (p1 , p2 ), where p1 and p2 are parameters. Then the conditional distributions f (y/x) and f (x/y) are also Bernoulli and p2 (1 − x) E(Y /x) = 1 − p1 p1 (1 − y) E(X/y) = 1 − p2 p2 (1 − p1 − p2 ) (1 − x) V ar(Y /x) = (1 − p1 )2 p1 (1 − p1 − p2 ) (1 − y) . V ar(X/y) = (1 − p2 )2 Proof: Notice that f (x, y) f1 (x) f (x, y) = 1 % f (x, y) f (y/x) = y=0 = Hence f (x, y) f (x, 0) + f (x, 1) x = 0, 1; y = 0, 1; 0 ≤ x + y ≤ 1. f (0, 1) f (0, 0) + f (0, 1) p2 = 1 − p 1 − p2 + p 2 p2 = 1 − p1 f (1/0) = and f (1, 1) f (1, 0) + f (1, 1) 0 = = 0. p1 + 0 f (1/1) = Now we compute the conditional expectation E(Y /x) for x = 0, 1. Hence E(Y /x = 0) = 1 % y f (y/0) y=0 = f (1/0) p2 = 1 − p1 Probability and Mathematical Statistics 293 and E(Y /x = 1) = f (1/1) = 0. Merging these together, we have E(Y /x) = p2 (1 − x) 1 − p1 x = 0, 1. Similarly, we compute E(Y 2 /x = 0) = 1 % y 2 f (y/0) y=0 = f (1/0) p2 = 1 − p1 and E(Y 2 /x = 1) = f (1/1) = 0. Therefore V ar(Y /x = 0) = E(Y 2 /x = 0) − E(Y /x = 0)2 # $2 p2 p2 − = 1 − p1 1 − p1 p2 (1 − p1 ) − p22 = (1 − p1 )2 p2 (1 − p1 − p2 ) = (1 − p1 )2 and V ar(Y /x = 1) = 0. Merging these together, we have V ar(Y /x) = p2 (1 − p1 − p2 ) (1 − x) (1 − p1 )2 x = 0, 1. 
The conditional expectation E(X/y) and the conditional variance V ar(X/y) can be obtained in a similar manner. We leave their derivations to the reader. 11.2. Bivariate Binomial Distribution The bivariate binomial random variable is defined by specifying the form of the joint probability distribution. Some Special Discrete Bivariate Distributions 294 Definition 11.2. A discrete bivariate random variable (X, Y ) is said to have the bivariate binomial distribution with parameters n, p1 , p2 if its joint probability density is of the form  n−x−y n!  x! y! (n−x−y)! , if x, y = 0, 1, ..., n px1 py2 (1 − p1 − p2 ) f (x, y) =  0 otherwise, where 0 < p1 , p2 , p1 +p2 < 1, x+y ≤ n and n is a positive integer. We denote a bivariate binomial random variable by writing (X, Y ) ∼ BIN (n, p1 , p2 ). Bivariate binomial distribution is also known as trinomial distribution. It will be shown in the proof of Theorem 11.4 that the marginal distributions of X and Y are BIN (n, p1 ) and BIN (n, p2 ), respectively. The following two examples illustrate the applicability of bivariate binomial distribution. Example 11.1. In the city of Louisville on a Friday night, radio station A has 50 percent listeners, radio station B has 30 percent listeners, and radio station C has 20 percent listeners. What is the probability that among 8 listeners in the city of Louisville, randomly chosen on a Friday night, 5 will be listening to station A, 2 will be listening to station B, and 1 will be listening to station C? Answer: Let X denote the number listeners that listen to station A, and Y denote the listeners that listen to station B. Then the joint distribution 3 5 , and p2 = 10 . The of X and Y is bivariate binomial with n = 8, p1 = 10 probability that among 8 listeners in the city of Louisville, randomly chosen on a Friday night, 5 will be listening to station A, 2 will be listening to station B, and 1 will be listening to station C is given by P (X = 5, Y = 2) = f (5, 2) n! n−x−y px py (1 − p1 − p2 ) x! y! (n − x − y)! 1 2 # $ 5 # $2 # $ 5 3 2 8! = 5! 2! 1! 10 10 10 = = 0.0945. Example 11.2. A certain game involves rolling a fair die and watching the numbers of rolls of 4 and 5. What is the probability that in 10 rolls of the die one 4 and three 5 will be observed? Probability and Mathematical Statistics 295 Answer: Let X denote the number of 4 and Y denote the number of 5. Then the joint distribution of X and Y is bivariate binomial with n = 10, p1 = 16 , p2 = 61 and 1 − p1 − p2 = 46 . Hence the probability that in 10 rolls of the die one 4 and three 5 will be observed is P (X = 5, Y = 2) = f (1, 3) n! n−x−y px py (1 − p1 − p2 ) x! y! (n − x − y)! 1 2 # $1 # $ 3 # $10−1−3 10! 1 1 1 1 = 1− − 1! 3! (10 − 1 − 3)! 6 6 6 6 # $1 # $ 3 # $6 1 1 4 10! = 1! 3! (10 − 1 − 3)! 6 6 6 573440 = 10077696 = 0.0569. = Using transformation method discussed in chapter 10, it can be shown that if X1 , X2 and X3 are independent binomial random variables, then the joint distribution of the random variables X = X1 + X2 and Y = X1 + X3 is bivariate binomial. This approach is known as trivariate reduction technique for constructing bivariate distribution. To establish the next theorem, we need a generalization of the binomial theorem which was treated in Chapter 1. The following result generalizes the binomial theorem and can be called trinomial theorem. Similar to the proof of binomial theorem, one can establish $ n # n % % n (a + b + c) = ax by cn−x−y , x, y x=0 y=0 n where 0 ≤ x + y ≤ n and # n x, y $ = n! . x! y! (n − x − y)! 
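Before stating the theorem, it may help to see the trinomial probabilities of Examples 11.1 and 11.2 computed directly from the density of Definition 11.2. The following short Python sketch uses only the standard library; the function name trinomial_pmf is ours and is introduced purely for illustration.

# Evaluates f(x, y) = n!/(x! y! (n-x-y)!) p1^x p2^y (1-p1-p2)^(n-x-y)
# and reproduces the numbers quoted in Examples 11.1 and 11.2.
from math import factorial

def trinomial_pmf(x, y, n, p1, p2):
    """Joint pmf of (X, Y) ~ BIN(n, p1, p2) at the point (x, y)."""
    if x < 0 or y < 0 or x + y > n:
        return 0.0
    coeff = factorial(n) // (factorial(x) * factorial(y) * factorial(n - x - y))
    return coeff * p1**x * p2**y * (1 - p1 - p2)**(n - x - y)

# Example 11.1: n = 8 listeners, p1 = 0.5 (station A), p2 = 0.3 (station B).
print(round(trinomial_pmf(5, 2, 8, 0.5, 0.3), 4))    # 0.0945

# Example 11.2: n = 10 rolls, p1 = p2 = 1/6 (one 4 and three 5's).
print(round(trinomial_pmf(1, 3, 10, 1/6, 1/6), 4))   # 0.0569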
In the following theorem, we present the expected values of X and Y , their variances, the covariance between X and Y , and the joint moment generating function. Some Special Discrete Bivariate Distributions 296 Theorem 11.3. Let (X, Y ) ∼ BIN (n, p1 , p2 ), where n, p1 and p2 are parameters. Then E(X) = n p1 E(Y ) = n p2 V ar(X) = n p1 (1 − p1 ) V ar(Y ) = n p2 (1 − p2 ) Cov(X, Y ) = −n p1 p2 "n ! M (s, t) = 1 − p1 − p2 + p1 es + p2 et . Proof: First, we find the joint moment generating function of X and Y . The moment generating function M (s, t) is given by ! " M (s, t) = E esX+tY n n % % = esx+ty f (x, y) = x=0 y=0 n n % % esx+ty x=0 y=0 n % n % n! n−x−y px py (1 − p1 − p2 ) x! y! (n − x − y)! 1 2 "y n! x ! n−x−y (es p1 ) et p2 (1 − p1 − p2 ) x! y! (n − x − y)! x=0 y=0 ! "n = 1 − p1 − p2 + p1 es + p2 et (by trinomial theorem). = The expected value of X is given by > ∂M >> E(X) = ∂s >(0,0) > " > ∂ ! s t n> = 1 − p 1 − p2 + p1 e + p2 e > ∂s (0,0) > ! " n−1 > = n 1 − p1 − p2 + p1 es + p2 et p1 es > (0,0) = n p1 . Similarly, the expected value of Y is given by > ∂M >> E(Y ) = ∂t >(0,0) > " > ∂ ! s t n> = 1 − p1 − p2 + p 1 e + p 2 e > ∂t (0,0) > ! " n−1 > = n 1 − p1 − p2 + p1 es + p2 et p2 et > = n p2 . (0,0) Probability and Mathematical Statistics 297 The product moment of X and Y is E(XY ) = > ∂ 2 M >> ∂t ∂s >(0,0) > " > ∂2 ! s t n> = 1 − p 1 − p 2 + p 1 e + p2 e > ∂t ∂s (0,0) ->> " ∂ , ! n−1 s t s > n 1 − p 1 − p2 + p1 e + p 2 e p1 e > = ∂t (0,0) = n (n − 1)p1 p2 . Therefore the covariance of X and Y is Cov(X, Y ) = E(XY ) − E(X) E(Y ) = n(n − 1)p1 p2 − n2 p1 p2 = −np1 p2 . Similarly, it can be shown that E(X 2 ) = n(n − 1)p21 + np1 Thus, we have and E(Y 2 ) = n(n − 1)p22 + np2 . V ar(X) = E(X 2 ) − E(X)2 = n(n − 1)p22 + np2 − n2 p21 = n p1 (1 − p1 ) and similarly V ar(Y ) = E(Y 2 ) − E(Y )2 = n p2 (1 − p2 ). This completes the proof of the theorem. The following results are needed for the next theorem and they can be established using binomial theorem discussed in chapter 1. For any real numbers a and b, we have and # $ m % m y m−y y a b = m a (a + b)m−1 y y=0 (11.1) # $ m y m−y a b = m a (ma + b) (a + b)m−2 y y y=0 (11.2) m % 2 where m is a positive integer. Some Special Discrete Bivariate Distributions 298 Example 11.3. If X equals the number of ones and Y equals the number of twos and threes when a pair of fair dice are rolled, then what is the correlation coefficient of X and Y ? Answer: The joint density of X and Y is bivariate binomial and is given by 2! f (x, y) = x! y! (2 − x − y)! # $x # $y # $2−x−y 2 3 1 , 6 6 6 0 ≤ x + y ≤ 2, where x and y are nonnegative integers. By Theorem 11.3, we have 1 6 # 2 V ar(Y ) = n p2 (1 − p2 ) = 2 6 # V ar(X) = n p1 (1 − p1 ) = 2 and Cov(X, Y ) = −n p1 p2 = −2 1 6 $ = 10 , 36 2 1− 6 $ = 16 , 36 1− 4 1 2 =− . 6 6 36 Therefore Corr(X, Y ) = I Cov(X, Y ) V ar(X) V ar(Y ) 4 =− √ 4 10 = −0.3162. The next theorem presents some information regarding the conditional distributions f (x/y) and f (y/x). Theorem 11.4. Let (X, Y ) ∼ BIN (n, p1 , p2 ), where n, p1 and p2 are parameters. Then the conditional distributions f (y/x) and f (x/y) are also binomial and p2 (n − x) E(Y /x) = 1 − p1 p1 (n − y) E(X/y) = 1 − p2 p2 (1 − p1 − p2 ) (n − x) V ar(Y /x) = (1 − p1 )2 p1 (1 − p1 − p2 ) (n − y) V ar(X/y) = . (1 − p2 )2 Probability and Mathematical Statistics 299 , first we find the marginal density of X. The Proof: Since f (y/x) = ff(x,y) 1 (x) marginal density f1 (x) of X is given by f1 (x) = n−x % y=0 n! n−x−y px py (1 − p1 − p2 ) x! y! (n − x − y)! 1 2 n−x % n! 
px1 (n − x)! py (1 − p1 − p2 )n−x−y x! (n − x)! y=0 y! (n − x − y)! 2 # $ n x n−x p (1 − p1 − p2 + p2 ) (by binomial theorem) = x 1 # $ n x n−x p (1 − p1 ) = . x 1 = In order to compute the conditional expectations, we need the conditional densities of f (x, y). The conditional density of Y given X = x is f (x, y) f1 (x) f (x, y) = !n" x n−x x p1 (1 − p1 ) f (y/x) = (n − x)! py (1 − p1 − p2 )n−x−y (1 − p1 )x−n (n − x − y)! y! 2 # $ n−x y n−x−y = (1 − p1 )x−n p2 (1 − p1 − p2 ) . y = Hence the conditional expectation of Y given the event X = x is $ n−x y n−x−y p2 (1 − p1 − p2 ) y (1 − p1 ) E (Y /x) = y y=0 n−x % #n − x$ y n−x−y x−n p2 (1 − p1 − p2 ) y = (1 − p1 ) y y=0 n−x % x−n # = (1 − p1 )x−n p2 (n − x) (1 − p1 )n−x−1 p2 (n − x) . = 1 − p1 Next, we find the conditional variance of Y given event X = x. For this Some Special Discrete Bivariate Distributions 300 ! " we need the conditional expectation E Y 2 /x , which is given by % ! " n−x y 2 f (x, y) E Y 2 /x = y=0 # $ n−x y n−x−y y (1 − p1 ) = p2 (1 − p1 − p2 ) y y=0 n−x % #n − x$ y n−x−y x−n p2 (1 − p1 − p2 ) y2 = (1 − p1 ) y y=0 n−x % 2 x−n = (1 − p1 )x−n p2 (n − x) (1 − p1 )n−x−2 [(n − x)p2 + 1 − p1 − p2 ] p2 (n − x)[(n − x)p2 + 1 − p1 − p2 ] . = (1 − p1 )2 Hence, the conditional variance of Y given X = x is ! " V ar(Y /x) = E Y 2 /x − E(Y /x)2 p2 (n − x) [(n − x)p2 + 1 − p1 − p2 ] − = (1 − p1 )2 p2 (1 − p1 − p2 ) (n − x) . = (1 − p1 )2 # p2 (n − x) 1 − p1 $2 Similarly, one can establish E(X/y) = p1 (n − y) 1 − p2 and V ar(X/y) = p1 (1 − p1 − p2 ) (n − y) . (1 − p2 )2 This completes the proof of the theorem. Note that f (y/x) in the above theorem is a univariate binomial probability density function. To see this observe that $ # n−x−y x−n n − x (1 − p1 ) py2 (1 − p1 − p2 ) y # $# $y # $n−x−y n−x p2 p2 = . 1− y 1 − p1 1 − p1 Hence, f (y/x) is a probability density function of a binomial random variable p2 with parameters n − x and 1−p . 1 The marginal density f2 (y) of Y can be obtained similarly as # $ n y n−y f2 (y) = p (1 − p2 ) , y 2 Probability and Mathematical Statistics 301 where y = 0, 1, ..., n. The form of these densities show that the marginals of bivariate binomial distribution are again binomial. Example 11.4. Let W equal the weight of soap in a 1-kilogram box that is distributed in India. Suppose P (W < 1) = 0.02 and P (W > 1.072) = 0.08. Call a box of soap light, good, or heavy depending on whether W < 1, 1 ≤ W ≤ 1.072, or W > 1.072, respectively. In a random sample of 50 boxes, let X equal the number of light boxes and Y the number of good boxes. What are the regression and scedastic curves of Y on X? Answer: The joint probability density function of X and Y is given by f (x, y) = 50! px py (1 − p1 − p2 )50−x−y , x! y! (50 − x − y)! 1 2 0 ≤ x + y ≤ 50, where x and y are nonnegative integers. Hence, (X, Y ) ∼ BIN (n, p1 , p2 ), where n = 50, p1 = 0.02 and p2 = 0.90. The regression curve of Y on X is given by p2 (n − x) E(Y /x) = 1 − p1 0.9 (50 − x) = 1 − 0.02 45 = (50 − x). 49 The scedastic curve of Y on X is the conditional variance of Y given X = x and it equal to p2 (1 − p1 − p2 ) (n − x) (1 − p1 )2 0.9 0.08 (50 − x) = (1 − 0.02)2 180 (50 − x). = 2401 V ar(Y /x) = Note that if n = 1, then bivariate binomial distribution reduces to bivariate Bernoulli distribution. 11.3. Bivariate Geometric Distribution Recall that if the random variable X denotes the trial number on which first success occurs, then X is univariate geometric. 
The probability density function of an univariate geometric variable is f (x) = px−1 (1 − p), x = 1, 2, 3, ..., ∞, Some Special Discrete Bivariate Distributions 302 where p is the probability of failure in a single Bernoulli trial. This univariate geometric distribution can be generalized to the bivariate case. Guldberg (1934) introduced the bivariate geometric distribution and Lundberg (1940) first used it in connection with problems of accident proneness. This distribution has found many applications in various statistical methods. Definition 11.3. A discrete bivariate random variable (X, Y ) is said to have the bivariate geometric distribution with parameters p1 and p2 if its joint probability density is of the form f (x, y) =  x y  (x+y)! x! y! p1 p2 (1 − p1 − p2 ) ,  0 if x, y = 0, 1, ..., ∞ otherwise, where 0 < p1 , p2 , p1 + p2 < 1. We denote a bivariate geometric random variable by writing (X, Y ) ∼ GEO (p1 , p2 ). Example 11.5. Motor vehicles arriving at an intersection can turn right or left or continue straight ahead. In a study of traffic patterns at this intersection over a long period of time, engineers have noted that 40 percents of the motor vehicles turn left, 25 percents turn right, and the remainder continue straight ahead. For the next ten cars entering the intersection, what is the probability that 5 cars will turn left, 4 cars will turn right, and the last car will go straight ahead? Answer: Let X denote the number of cars turning left and Y denote the number of cars turning right. Since, the last car will go straight ahead, the joint distribution of X and Y is geometric with parameters p1 = 0.4, p2 = 0.25 and p3 = 1 − p1 − p2 = 0.35. For the next ten cars entering the intersection, the probability that 5 cars will turn left, 4 cars will turn right, and the last car will go straight ahead is given by P (X = 5, Y = 4) = f (5, 4) (x + y)! x y p1 p2 (1 − p1 − p2 ) x! y! (5 + 4)! = (0.4)5 (0.25)4 (1 − 0.4 − 0.25) 5! 4! 9! = (0.4)5 (0.25)4 (0.35) 5! 4! = 0.00677. = Probability and Mathematical Statistics 303 The following technical result is essential for proving the following theorem. If a and b are positive real numbers with 0 < a + b < 1, then ∞ ∞ % % 1 (x + y)! x y a b = . x! y! 1 − a−b x=0 y=0 (11.3) In the following theorem, we present the expected values and the variances of X and Y , the covariance between X and Y , and the moment generating function. Theorem 11.5. Let (X, Y ) ∼ GEO (p1 , p2 ), where p1 and p2 are parameters. Then p1 E(X) = 1 − p1 − p 2 p2 E(Y ) = 1 − p1 − p 2 p1 (1 − p2 ) V ar(X) = (1 − p1 − p2 )2 p2 (1 − p1 ) V ar(Y ) = (1 − p1 − p2 )2 p 1 p2 Cov(X, Y ) = (1 − p1 − p2 )2 1 − p1 − p 2 M (s, t) = . 1 − p1 es − p2 et Proof: We only find the joint moment generating function M (s, t) of X and Y and leave proof of the rests to the reader of this book. The joint moment generating function M (s, t) is given by ! " M (s, t) = E esX+tY n n % % = esx+ty f (x, y) = x=0 y=0 n % n % esx+ty x=0 y=0 = (1 − p1 − p2 ) = (x + y)! x y p1 p2 (1 − p1 − p2 ) x! y! n % n % "y (x + y)! x! (p1 es ) p2 et x! y! x=0 y=0 (1 − p1 − p2 ) 1 − p1 es − p2 et (by (11.3) ). Some Special Discrete Bivariate Distributions 304 The following results are needed for the next theorem. Let a be a positive real number less than one. Then ∞ % 1 (x + y)! y a = , x! y! (1 − a)x+1 y=0 ∞ % (x + y)! a(1 + x) y ay = , x! y! (1 − a)x+2 y=0 and ∞ % (x + y)! 2 y a(1 + x) y a = [a(x + 1) + 1]. x! y! 
(1 − a)x+3 y=0 (11.4) (11.5) (11.6) The next theorem presents some information regarding the conditional densities f (x/y) and f (y/x). Theorem 11.6. Let (X, Y ) ∼ GEO (p1 , p2 ), where p1 and p2 are parameters. Then the conditional distributions f (y/x) and f (x/y) are also geometrical and p2 (1 + x) E(Y /x) = 1 − p2 p1 (1 + y) E(X/y) = 1 − p1 p2 (1 + x) V ar(Y /x) = (1 − p2 )2 p1 (1 + y) . V ar(X/y) = (1 − p1 )2 Proof: Again, as before, we first find the conditional probability density of Y given the event X = x. The marginal density f1 (x) is given by f1 (x) = = ∞ % y=0 ∞ % y=0 f (x, y) (x + y)! x y p1 p2 (1 − p1 − p2 ) x! y! = (1 − p1 − p2 ) px1 = (1 − p1 − p2 ) px1 (1 − p2 )x+1 ∞ % (x + y)! y p2 x! y! y=0 (by (11.4) ). Probability and Mathematical Statistics 305 Therefore the conditional density of Y given the event X = x is f (y/x) = (x + y)! y f (x, y) = p2 (1 − p2 )x+1 . f1 (x) x! y! The conditional expectation of Y given X = x is E(Y /x) = = ∞ % y=0 ∞ % y=0 = y f (y/x) y (x + y)! y p2 (1 − p2 )x+1 x! y! p2 (1 + x) (1 − p2 ) (by (11.5) ). Similarly, one can show that E(X/y) = p1 (1 + y) . (1 − p1 ) To compute the conditional variance of Y given the event that X = x, first ! " we have to find E Y 2 /x , which is given by ∞ " % E Y /x = y 2 f (y/x) ! 2 y=0 = ∞ % y=0 = y2 (x + y)! y p2 (1 − p2 )x+1 x! y! p2 (1 + x) [p2 (1 + x) + 1] (1 − p2 )2 (by (11.6) ). Therefore ! " ! " V ar Y 2 /x = E Y 2 /x − E(Y /x)2 p2 (1 + x) = [(p2 (1 + x) + 1] − (1 − p2 )2 p2 (1 + x) = . (1 − p2 )2 # p2 (1 + x) 1 − p2 $2 The rest of the moments can be determined in a similar manner. The proof of the theorem is now complete. Some Special Discrete Bivariate Distributions 306 11.4. Bivariate Negative Binomial Distribution The univariate negative binomial distribution can be generalized to the bivariate case. Guldberg (1934) introduced this distribution and Lundberg (1940) first used it in connection with problems of accident proneness. Arbous and Kerrich (1951) arrived at this distribution by mixing parameters of the bivariate Poisson distribution. Definition 11.4. A discrete bivariate random variable (X, Y ) is said to have the bivariate negative binomial distribution with parameters k, p1 and p2 if its joint probability density is of the form  k x y  (x+y+k−1)! if x, y = 0, 1, ..., ∞ x! y! (k−1)! p1 p2 (1 − p1 − p2 ) , f (x, y) =  0 otherwise, where 0 < p1 , p2 , p1 + p2 < 1 and k is a nonzero positive integer. We denote a bivariate negative binomial random variable by writing (X, Y ) ∼ N BIN (k, p1 , p2 ). Example 11.6. An experiment consists of selecting a marble at random and with replacement from a box containing 10 white marbles, 15 black marbles and 5 green marbles. What is the probability that it takes exactly 11 trials to get 5 white, 3 black and the third green marbles at the 11th trial? Answer: Let X denote the number of white marbles and Y denote the number of black marbles. The joint distribution of X and Y is bivariate negative binomial with parameters p1 = 31 , p2 = 21 , and k = 3. Hence the probability that it takes exactly 11 trials to get 5 white, 3 black and the third green marbles at the 11th trial is P (X = 5, Y = 3) = f (5, 3) (x + y + k − 1)! x y k = p1 p2 (1 − p1 − p2 ) x! y! (k − 1)! (5 + 3 + 3 − 1)! 3 = (0.33)5 (0.5)3 (1 − 0.33 − 0.5) 5! 3! (3 − 1)! 10! = (0.33)5 (0.5)3 (0.17)3 5! 3! 2! = 0.0000503. The negative binomial theorem which was treated in chapter 5 can be generalized to ∞ ∞ % % 1 (x + y + k − 1)! x y . (11.7) p1 p 2 = k x! y! (k − 1)! 
(1 − p1 − p2 ) x=0 y=0 Probability and Mathematical Statistics 307 In the following theorem, we present the expected values and the variances of X and Y , the covariance between X and Y , and the moment generating function. Theorem 11.7. Let (X, Y ) ∼ N BIN (k, p1 , p2 ), where k, p1 and p2 are parameters. Then E(X) = E(Y ) = V ar(X) = V ar(Y ) = Cov(X, Y ) = M (s, t) = k p1 1 − p 1 − p2 k p2 1 − p 1 − p2 k p1 (1 − p2 ) (1 − p1 − p2 )2 k p2 (1 − p1 ) (1 − p1 − p2 )2 k p1 p2 (1 − p1 − p2 )2 k (1 − p1 − p2 ) k (1 − p1 es − p2 et ) . Proof: We only find the joint moment generating function M (s, t) of the random variables X and Y and leave the rests to the reader. The joint moment generating function is given by ! " M (s, t) = E esX+tY ∞ % ∞ % = esx+ty f (x, y) = x=0 y=0 ∞ ∞ % % esx+ty x=0 y=0 k = (1 − p1 − p2 ) (x + y + k − 1)! x y k p1 p2 (1 − p1 − p2 ) x! y! (k − 1)! ∞ ∞ % % "y (x + y + k − 1)! s x ! (e p1 ) et p2 x! y! (k − 1)! x=0 y=0 k = (1 − p1 − p2 ) k (1 − p1 es − p2 et ) (by (11.7)). This completes the proof of the theorem. To establish the next theorem, we need the following two results. If a is a positive real constant in the interval (0, 1), then ∞ % 1 (x + k) (x + y + k − 1)! y , a = x! y! (k − 1)! (1 − a)x+k y=0 (11.8) Some Special Discrete Bivariate Distributions ∞ % y y=0 a (x + k) (x + y + k − 1)! y a = , x! y! (k − 1)! (1 − a)x+k+1 308 (11.9) and ∞ % y=0 y2 a (x + k) (x + y + k − 1)! y a = [1 + (x + k)a] . x! y! (k − 1)! (1 − a)x+k+2 (11.10) The next theorem presents some information regarding the conditional densities f (x/y) and f (y/x). Theorem 11.8. Let (X, Y ) ∼ N BIN (k, p1 , p2 ), where p1 and p2 are parameters. Then the conditional densities f (y/x) and f (x/y) are also negative binomial and p2 (k + x) E(Y /x) = 1 − p2 p1 (k + y) E(X/y) = 1 − p1 p2 (k + x) V ar(Y /x) = (1 − p2 )2 p1 (k + y) V ar(X/y) = . (1 − p1 )2 Proof: First, we find the marginal density of X. The marginal density f1 (x) is given by f1 (x) = = ∞ % y=0 ∞ % y=0 f (x, y) (x + y + k − 1)! x y p 1 p2 x! y! (k − 1)! (x + y + k − 1)! y p2 x! y! (k − 1)! 1 k = (1 − p1 − p2 ) px1 (by (11.8)). (1 − p2 )x+k k = (1 − p1 − p2 ) px1 The conditional density of Y given the event X = x is f (x, y) f1 (x) (x + y + k − 1)! y = p2 (1 − p2 )x+k . x! y! (k − 1)! f (y/x) = Probability and Mathematical Statistics 309 The conditional expectation E(Y /x) is given by E (Y /x) = ∞ ∞ % % x=0 y=0 y (x + y + k − 1)! y p2 (1 − p2 )x+k x! y! (k − 1)! = (1 − p2 )x+k = (1 − p2 )x+k = p2 (x + k) . (1 − p2 ) ∞ % ∞ % y x=0 y=0 (x + y + k − 1)! y p2 x! y! (k − 1)! p2 (x + k) (1 − p2 )x+k+1 (by (11.9)) ! " The conditional expectation E Y 2 /x can be computed as follows ∞ ∞ % ! " % (x + y + k − 1)! y E Y 2 /x = y2 p2 (1 − p2 )x+k x! y! (k − 1)! x=0 y=0 = (1 − p2 )x+k = (1 − p2 )x+k = ∞ % ∞ % x=0 y=0 y2 (x + y + k − 1)! y p2 x! y! (k − 1)! p2 (x + k) [1 + (x + k) p2 ] (1 − p2 )x+k+2 (by (11.10)) p2 (x + k) [1 + (x + k) p2 ]. (1 − p2 )2 The conditional variance of Y given X = x is ! " 2 V ar (Y /x) = E Y 2 /x − E (Y /x) p2 (x + k) [1 + (x + k) p2 ] − (1 − p2 )2 p2 (x + k) . = (1 − p2 )2 = # p2 (x + k) (1 − p2 ) $2 The conditional expected value E(X/y) and conditional variance V ar(X/y) can be computed in a similar way. This completes the proof. Note that if k = 1, then bivariate negative binomial distribution reduces to bivariate geometric distribution. 11.5. Bivariate Hypergeometric Distribution The univariate hypergeometric distribution can be generalized to the bivariate case. 
Isserlis (1914) introduced this distribution and Pearson (1924) Some Special Discrete Bivariate Distributions 310 gave various properties of this distribution. Pearson also fitted this distribution to an observed data of the number of cards of a certain suit in two hands at whist. Definition 11.5. A discrete bivariate random variable (X, Y ) is said to have the bivariate hypergeometric distribution with parameters r, n1 , n2 , n3 if its joint probability distribution is of the form  n1 n2 n3 y ) (r−x−y )  ( x )n( +n , if x, y = 0, 1, ..., r +n ( 1 r2 3 ) f (x, y) =  0 otherwise, where x ≤ n1 , y ≤ n2 , r − x − y ≤ n3 and r is a positive integer less than or equal to n1 + n2 + n3 . We denote a bivariate hypergeometric random variable by writing (X, Y ) ∼ HY P (r, n1 , n2 , n3 ). Example 11.7. A panel of prospective jurors includes 6 african american, 4 asian american and 9 white american. If the selection is random, what is the probability that a jury will consists of 4 african american, 3 asian american and 5 white american? Answer: Here n1 = 7, n2 = 3 and n3 = 9 so that n = 19. A total of 12 jurors will be selected so that r = 12. In this example x = 4, y = 3 and r − x − y = 5. Hence the probability that a jury will consists of 4 african american, 3 asian american and 5 white american is !7" ! 3 " ! 9" 4410 3 " 5 = f (4, 3) = 4 !19 = 0.0875. 50388 12 Example 11.8. Among 25 silver dollars struck in 1903 there are 15 from the Philadelphia mint, 7 from the New Orleans mint, and 3 from the San Francisco. If 5 of these silver dollars are picked at random, what is the probability of getting 4 from the Philadelphia mint and 1 from the New Orleans? Answer: Here n = 25, r = 5 and n1 = 15, n2 = 7, n3 = 3. The the probability of getting 4 from the Philadelphia mint and 1 from the New Orleans is !15" !7" !3" 9555 f (4, 1) = 4 !251" 0 = = 0.1798. 53130 5 In the following theorem, we present the expected values and the variances of X and Y , and the covariance between X and Y . Probability and Mathematical Statistics 311 Theorem 11.9. Let (X, Y ) ∼ HY P (r, n1 , n2 , n3 ), where r, n1 , n2 and n3 are parameters. Then r n1 n1 + n 2 + n3 r n2 E(Y ) = n1 + n 2 + n3 # $ n1 + n2 + n 3 − r r n1 (n2 + n3 ) V ar(X) = (n1 + n2 + n3 )2 n1 + n2 + n3 − 1 # $ r n2 (n1 + n3 ) n1 + n2 + n 3 − r V ar(Y ) = (n1 + n2 + n3 )2 n1 + n2 + n3 − 1 $ # n1 + n 2 + n3 − r r n 1 n2 . Cov(X, Y ) = − (n1 + n2 + n3 )2 n1 + n2 + n3 − 1 E(X) = Proof: We find only the mean and variance of X. The mean and variance of Y can be found in a similar manner. The covariance of X and Y will be left to the reader as an exercise. To find the expected value of X, we need the marginal density f1 (x) of X. The marginal of X is given by f1 (x) = r−x % y=0 = r−x % y=0 f (x, y) ! n 1 " ! n2 " ! x n3 r−x−y y " !n1 +n 2 +n3 r ! n1 " = !n1 +nx2 +n3 " r = ! n1 " !n1 +nx2 +n3 " r r−x % y=0 # # " n2 y $# n2 + n3 r−x $ n3 r−x−y $ (by Theorem 1.3) This shows that X ∼ HY P (n1 , n2 + n3 , r). Hence, by Theorem 5.7, we get E(X) = and r n1 , n 1 + n2 + n3 r n1 (n2 + n3 ) V ar(X) = (n1 + n2 + n3 )2 # n1 + n2 + n 3 − r n1 + n 2 + n3 − 1 $ . Similarly, the random variable Y ∼ HY P (n2 , n1 + n3 , r). Hence, again by Theorem 5.7, we get r n2 , E(Y ) = n1 + n 2 + n3 Some Special Discrete Bivariate Distributions and 312 r n2 (n1 + n3 ) V ar(Y ) = (n1 + n2 + n3 )2 # n 1 + n2 + n 3 − r n1 + n2 + n3 − 1 $ . The next theorem presents some information regarding the conditional densities f (x/y) and f (y/x). Theorem 11.10. 
Let (X, Y ) ∼ HY P (r, n1 , n2 , n3 ), where r, n1 , n2 and n3 are parameters. Then the conditional distributions f (y/x) and f (x/y) are also hypergeometric and n2 (r − x) n 2 + n3 n1 (r − y) E(X/y) = n 1 + n3 $# $ # n2 n3 x − n1 n1 + n 2 + n3 − x V ar(Y /x) = n2 + n3 − 1 n2 + n 3 n 2 + n3 # $# $ y − n2 n1 n 3 n1 + n 2 + n3 − y . V ar(X/y) = n1 + n3 − 1 n1 + n 3 n1 + n3 E(Y /x) = Proof: To find E(Y /x), we need the conditional density f (y/x) of Y given the event X = x. The conditional density f (y/x) is given by f (y/x) = = f (x, y) f1 (x) !n2 " ! n3 r−x−y " !n2 +n 3 r−x y " . Hence, the random variable Y given X = x is a hypergeometric random variable with parameters n2 , n3 , and r − x, that is Y /x ∼ HY P (n2 , n3 , r − x). Hence, by Theorem 5.7, we get E(Y /x) = n2 (r − x) n2 + n 3 and n 2 n3 V ar(Y /x) = n 2 + n3 − 1 # n1 + n 2 + n3 − x n2 + n 3 $# x − n1 n 2 + n3 $ . Probability and Mathematical Statistics 313 Similarly, one can find E(X/y) and V ar(X/y). The proof of the theorem is now complete. 11.6. Bivariate Poisson Distribution The univariate Poisson distribution can be generalized to the bivariate case. In 1934, Campbell, first derived this distribution. However, in 1944, Aitken gave the explicit formula for the bivariate Poisson distribution function. In 1964, Holgate also arrived at the bivariate Poisson distribution by deriving the joint distribution of X = X1 + X3 and Y = X2 + X3 , where X1 , X2 , X3 are independent Poisson random variables. Unlike the previous bivariate distributions, the conditional distributions of bivariate Poisson distribution are not Poisson. In fact, Seshadri and Patil (1964), indicated that no bivariate distribution exists having both marginal and conditional distributions of Poisson form. Definition 11.6. A discrete bivariate random variable (X, Y ) is said to have the bivariate Poisson distribution with parameters λ1 , λ2 , λ3 if its joint probability density is of the form f (x, y) = where  (−λ1 −λ2 +λ3 ) (λ1 −λ3 )x (λ2 −λ3 )y e ψ(x, y) x! y!  0 min(x,y) ψ(x, y) := % r=0 for x, y = 0, 1, ..., ∞ otherwise, x(r) y (r) λr3 (λ1 − λ3 )r (λ2 − λ3 )r r! with x(r) := x(x − 1) · · · (x − r + 1), and λ1 > λ3 > 0, λ2 > λ3 > 0 are parameters. We denote a bivariate Poisson random variable by writing (X, Y ) ∼ P OI (λ1 , λ2 , λ3 ). In the following theorem, we present the expected values and the variances of X and Y , the covariance between X and Y and the joint moment generating function. Theorem 11.11. Let (X, Y ) ∼ P OI (λ1 , λ2 , λ3 ), where λ1 , λ2 and λ3 are Some Special Discrete Bivariate Distributions 314 parameters. Then E(X) = λ1 E(Y ) = λ2 V ar(X) = λ1 V ar(Y ) = λ2 Cov(X, Y ) = λ3 s M (s, t) = e−λ1 −λ2 −λ3 +λ1 e +λ2 et +λ3 es+t . The next theorem presents some special characteristics of the conditional densities f (x/y) and f (y/x). Theorem 11.12. Let (X, Y ) ∼ P OI (λ1 , λ2 , λ3 ), where λ1 , λ2 and λ3 are parameters. Then $ λ3 x λ1 # $ λ3 E(X/y) = λ1 − λ3 + y λ2 # $ λ3 (λ1 − λ3 ) V ar(Y /x) = λ2 − λ3 + x λ21 $ # λ3 (λ2 − λ3 ) y. V ar(X/y) = λ1 − λ3 + λ22 E(Y /x) = λ2 − λ3 + # 11.7. Review Exercises 1. A box contains 10 white marbles, 15 black marbles and 5 green marbles. If 10 marbles are selected at random and without replacement, what is the probability that 5 are white, 3 are black and 2 are green? 2. An urn contains 3 red balls, 2 green balls and 1 yellow ball. Three balls are selected at random and without replacement from the urn. What is the probability that at least 1 color is not drawn? 3. 
An urn contains 4 red balls, 8 green balls and 2 yellow balls. Five balls are randomly selected, without replacement, from the urn. What is the probability that 1 red ball, 2 green balls, and 2 yellow balls will be selected? 4. From a group of three Republicans, two Democrats, and one Independent, a committee of two people is to be randomly selected. If X denotes the Probability and Mathematical Statistics 315 number of Republicans and Y the number of Democrats on the committee, then what is the variance of Y given that X = x? 5. If X equals the number of ones and Y the number of twos and threes when a four fair dice are rolled, then what is the conditional variance of X and Y = 1? 6. Motor vehicles arriving at an intersection can turn right or left or continue straight ahead. In a study of traffic patterns at this intersection over a long period of time, engineers have noted that 40 percents of the motor vehicles turn left, 25 percents turn right, and the remainder continue straight ahead. For the next five cars entering the intersection, what is the probability that at least one turn right? (Answer: 0.7627) 7. Among a large number of applicants for a certain position, 60 percents have only a high school education, 30 percents have some college training, and 10 percents have completed a college degree. If 5 applicants are randomly selected to be interviewed, what is the probability that at least one will have completed a college degree? 8. In a population of 200 students who have just completed a first course in calculus, 50 have earned A’s, 80 B’s and remaining earned F ’s. A sample of size 25 is taken at random and without replacement from this population. What is the probability that 10 students have A’s, 12 students have B’s and 3 students have F ’s ? 9. If X equals the number of ones and Y the number of twos and threes when a four fair dice are rolled, then what is the correlation coefficient of X and Y ? 10. If the joint moment generating function of X and Y is M (s, t) = , -5 k 7−es4−2et , then what is the value of the constant k? What is the correlation coefficient between X and Y ? 11. A die with 1 painted on three sides, 2 painted on two sides, and 3 painted on one side is rolled 15 times. What is the probability that we will get eight 1’s, six 2’s and a 3 on the last roll? 12. The output of a machine is graded excellent 80 percents of time, good 15 percents of time, and defective 5 percents of time. What is the probability that a random sample of size 15 has 10 excellent, 3 good, and 2 defective items? Some Special Discrete Bivariate Distributions 316 13. An industrial product is graded by a machine excellent 80 percents of time, good 15 percents of time, and defective 5 percents of time. A random sample of 15 items is graded. What is the probability that machine will grade 10 excellent, 3 good, and 2 defective of which one being the last one graded? 14. If (X, Y ) ∼ HY P (n1 , n2 , n3 , r), then what is the covariance of the random variables X and Y ? Probability and Mathematical Statistics 317 Chapter 12 SOME SPECIAL CONTINUOUS BIVARIATE DISTRIBUTIONS In this chapter, we study some well known continuous bivariate probability density functions. First, we present the natural extensions of univariate probability density functions that were treated in chapter 6. Then we present some other bivariate distributions that have been reported in the literature. The bivariate normal distribution has been treated in most textbooks because of its dominant role in the statistical theory. 
The other continuous bivariate distributions rarely treated in any textbooks. It is in this textbook, well known bivariate distributions have been treated for the first time. The monograph of K.V. Mardia gives an excellent exposition on various bivariate distributions. We begin this chapter with the bivariate uniform distribution. 12.1. Bivariate Uniform Distribution In this section, we study Morgenstern bivariate uniform distribution in detail. The marginals of Morgenstern bivariate uniform distribution are uniform. In this sense, it is an extension of univariate uniform distribution. Other bivariate uniform distributions will be pointed out without any in depth treatment. In 1956, Morgenstern introduced a one-parameter family of bivariate distributions whose univariate marginal are uniform distributions by the following formula f (x, y) = f1 (x) f2 (y) ( 1 + α [2F1 (x) − 1] [2F2 (y) − 1] ) , Some Special Continuous Bivariate Distributions 318 where α ∈ [−1, 1] is a parameter. If one assumes The cdf Fi (x) = x and the pdf fi (x) = 1 (i = 1, 2), then we arrive at the Morgenstern uniform distribution on the unit square. The joint probability density function f (x, y) of the Morgenstern uniform distribution on the unit square is given by f (x, y) = 1 + α (2x − 1) (2y − 1), 0 < x, y ≤ 1, −1 ≤ α ≤ 1. Next, we define the Morgenstern uniform distribution on an arbitrary rectangle [a, b] × [c, d]. Definition 12.1. A continuous bivariate random variable (X, Y ) is said to have the bivariate uniform distribution on the rectangle [a, b] × [c, d] if its joint probability density function is of the form f (x, y) =  2y−2c 2x−2a  1+α ( b−a −1)( d−c −1) (b−a) (d−c)  0 for x ∈ [a, b] y ∈ [c, d] otherwise , where α is an apriori chosen parameter in [−1, 1]. We denote a Morgenstern bivariate uniform random variable on a rectangle [a, b] × [c, d] by writing (X, Y ) ∼ U N IF (a, b, c, d, α). The following figures show the graph and the equi-density curves of f (x, y) on unit square with α = 0.5. In the following theorem, we present the expected values, the variances of the random variables X and Y , and the covariance between X and Y . Theorem 12.1. Let (X, Y ) ∼ U N IF M (a, b, c, d, α), where a, b, c, d and α Probability and Mathematical Statistics 319 are parameters. Then E(X) = E(Y ) = V ar(X) = V ar(Y ) = Cov(X, Y ) = b+a 2 d+c 2 (b − a)2 12 (d − c)2 12 1 α (b − a) (d − c). 36 Proof: First, we determine the marginal density of X which is given by : d f1 (x) = f (x, y) dy c -, , : d 1 + α 2x−2a − 1 2y−2c − 1 b−a d−c dy = (b − a) (d − c) c = 1 . b−a Thus, the marginal density of X is uniform on the interval from a to b. That is X ∼ U N IF (a, b). Hence by Theorem 6.1, we have E(X) = b+a 2 and V ar(X) = (b − a)2 . 12 Similarly, one can show that Y ∼ U N IF (c, d) and therefore by Theorem 6.1 E(Y ) = d+c 2 and V ar(Y ) = (d − c)2 . 12 The product moment of X and Y is : b: d xy f (x, y) dx dy E(XY ) = c a = : a = b : c 1+α d xy , 2x−2a b−a -, − 1 2y−2c − 1 d−c (b − a) (d − c) 1 1 α (b − a) (d − c) + (b + a) (d + c). 36 4 dx dy Some Special Continuous Bivariate Distributions 320 Thus, the covariance of X and Y is Cov(X, Y ) = E(XY ) − E(X) E(Y ) = 1 1 1 α (b − a) (d − c) + (b + a) (d + c) − (b + a) (d + c) 36 4 4 = 1 α (b − a) (d − c). 36 This completes the proof of the theorem. In the next theorem, we states some information related to the conditional densities f (y/x) and f (x/y). Theorem 12.2. Let (X, Y ) ∼ U N IF (a, b, c, d, α), where a, b, c, d and α are parameters. Then " ! 
2 d+c α E(Y /x) = + c + 4cd + d2 2 6 (b − a) ! 2 " α b+a + a + 4ab + b2 E(X/y) = 2 6 (b − a) 1 36 # d−c b−a $2 1 V ar(X/y) = 36 # b−a d−c $2 V ar(Y /x) = # $ 2x − 2a −1 b−a # $ 2y − 2c −1 d−c ; 2 < α (a + b) (4x − a − b) + 3(b − a)2 − 4α2 x2 ; 2 < α (c + d) (4y − c − d) + 3(d − c)2 − 4α2 y 2 . Proof: First, we determine the conditional density function f (y/x). Recall 1 that f1 (x) = b−a . Hence, $# $3 2 # 1 2y − 2c 2x − 2a f (y/x) = −1 −1 . 1+α d−c b−a d−c Probability and Mathematical Statistics 321 The conditional expectation E(Y /x) is given by E(Y /x) = : d y f (y/x) dy c = 1 d−c : d c $# $3 2 # 2y − 2c 2x − 2a −1 −1 dy y 1+α b−a d−c = d+c α + 2 6 (d − c)2 = d+c α + 2 6 (d − c) # # 2x − 2a −1 b−a $ ; 3 < d − c3 + 3dc2 − 3cd2 $ ; < 2x − 2a − 1 d2 + 4dc + c2 . b−a ! " Similarly, the conditional expectation E Y 2 /x is given by ! " E Y 2 /x = : d y 2 f (y/x) dy c 1 d−c 1 = d−c d+c = 2 d+c = 2 = $# $3 2 # 2y − 2c 2x − 2a −1 −1 dy y2 1 + α b−a d−c c 3 2 2 # $ " α d − c2 2x − 2a 1 ! 2 + −1 d − c2 (d − c)2 2 d−c b−a 6 $ # " ! 2x − 2a 1 + α d2 − c2 −1 6 b−a # $3 2 2x − 2a α −1 . 1 + (d − c) 3 b−a : d Therefore, the conditional variance of Y given the event X = x is ! " V ar(Y /x) = E Y 2 /x − E(Y /x)2 # $2 ; 2 < 1 d−c = α (a + b)(4x − a − b) + 3(b − a)2 − 4α2 x2 . 36 b − a The conditional expectation E(X/y) and the conditional variance V ar(X/y) can be found in a similar manner. This completes the proof of the theorem. The following figure illustrate the regression and scedastic curves of the Morgenstern uniform distribution function on unit square with α = 0.5. Some Special Continuous Bivariate Distributions 322 Next, we give a definition of another generalized bivariate uniform distribution. Definition 12.2. Let S ⊂ R I 2 be a region in the Euclidean plane R I 2 with area A. The random variables X and Y is said to be bivariate uniform over S if the joint density of X and Y is of the form &1 for (x, y) ∈ S A f (x, y) = 0 otherwise . In 1965, Plackett constructed a class of bivariate distribution F (x, y) for given marginals F1 (x) and F2 (y) as the square root of the equation (α − 1) F (x, y)2 − {1 + (α − 1) [F1 (x) + F2 (y)] } F (x, y) + α F1 (x) F2 (y) = 0 (where 0 < α < ∞) which satisfies the Fréchet inequalities max {F1 (x) + F2 (y) − 1, 0} ≤ F (x, y) ≤ min {F1 (x), F2 (y)} . The class of bivariate joint density function constructed by Plackett is the following f (x, y) = α f1 (x) f2 (y) [(α − 1) {F1 (x) + F2 (y) − 2F1 (x)F2 (y)} + 1] 3 [S(x, y)2 − 4 α (α − 1) F1 (x) F2 (y)] 2 , where S(x, y) = 1 + (α − 1) (F1 (x) + F2 (y)) . If one takes Fi (x) = x and fi (x) = 1 (for i = 1, 2), then the joint density function constructed by Plackett reduces to f (x, y) = α [(α − 1) {x + y − 2xy} + 1] 3 [{1 + (α − 1)(x + y)}2 − 4 α (α − 1) xy] 2 , Probability and Mathematical Statistics 323 where 0 ≤ x, y ≤ 1, and α > 0. But unfortunately this is not a bivariate density function since this bivariate density does not integrate to one. This fact was missed by both Plackett (1965) and Mardia (1967). 12.2. Bivariate Cauchy Distribution Recall that univariate Cauchy probability distribution was defined in Chapter 3 as f (x) = A θ 2 π θ + (x − α) B, −∞ < x < ∞, where α > 0 and θ are real parameters. The parameter α is called the location parameter. In Chapter 4, we have pointed out that any random variable whose probability density function is Cauchy has no moments. This random variable is further, has no moment generating function. 
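The absence of moments can be illustrated numerically. The following Python sketch, which assumes NumPy and uses an arbitrary seed and sample size, tracks the running sample mean of standard Cauchy draws alongside that of standard normal draws; since the Cauchy distribution has no mean, its running average never settles down, in line with the remark above.

# Illustration: the running mean of Cauchy draws does not converge.
import numpy as np

rng = np.random.default_rng(seed=3)
n = 100_000

cauchy_draws = rng.standard_cauchy(n)
normal_draws = rng.standard_normal(n)

running_mean_cauchy = np.cumsum(cauchy_draws) / np.arange(1, n + 1)
running_mean_normal = np.cumsum(normal_draws) / np.arange(1, n + 1)

for k in (100, 1_000, 10_000, 100_000):
    print(k, running_mean_cauchy[k - 1], running_mean_normal[k - 1])
# The normal column approaches 0; the Cauchy column keeps jumping around.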
The Cauchy distribution is widely used for instructional purposes besides its statistical use. The main purpose of this section is to generalize univariate Cauchy distribution to bivariate case and study its various intrinsic properties. We define the bivariate Cauchy random variables by using the form of their joint probability density function. Definition 12.3. A continuous bivariate random variable (X, Y ) is said to have the bivariate Cauchy distribution if its joint probability density function is of the form f (x, y) = θ 3 2π [ θ2 + (x − α)2 + (y − β)2 ] 2 , −∞ < x, y < ∞, where θ is a positive parameter and α and β are location parameters. We denote a bivariate Cauchy random variable by writing (X, Y ) ∼ CAU (θ, α, β). The following figures show the graph and the equi-density curves of the Cauchy density function f (x, y) with parameters α = 0 = β and θ = 0.5. Some Special Continuous Bivariate Distributions 324 The bivariate Cauchy distribution can be derived by considering the distribution of radio active particles emanating from a source that hit a two-dimensional screen. This distribution is a special case of the bivariate t-distribution which was first constructed by Karl Pearson in 1923. The following theorem shows that if a bivariate random variable (X, Y ) is Cauchy, then it has no moments like the univariate Cauchy random variable. Further, for a bivariate Cauchy random variable (X, Y ), the covariance (and hence the correlation) between X and Y does not exist. Theorem 12.3. Let (X, Y ) ∼ CAU (θ, α, β), where θ > 0, α and β are parameters. Then the moments E(X), E(Y ), V ar(X), V ar(Y ), and Cov(X, Y ) do not exist. Proof: In order to find the moments of X and Y , we need their marginal distributions. First, we find the marginal of X which is given by f1 (x) = = : ∞ f (x, y) dy −∞ ∞ : −∞ θ 3 2π [ θ2 + (x − α)2 + (y − β)2 ] 2 dy. To evaluate the above integral, we make a trigonometric substitution y=β+ I [θ2 + (x − α)2 ] tan ψ. Hence dy = I [θ2 + (x − α)2 ] sec2 ψ dψ and ; <3 θ2 + (x − α)2 + (y − β)2 2 ; <3 ! "3 = θ2 + (x − α)2 2 1 + tan2 ψ 2 ; <3 = θ2 + (x − α)2 2 sec3 ψ. Probability and Mathematical Statistics 325 Using these in the above integral, we get : ∞ θ −∞ dy 3 2π [ θ2 + (x − α)2 + (y − β)2 ] 2 θ = 2π = = : π 2 −π 2 I [θ2 + (x − α)2 ] sec2 ψ dψ 3 [θ2 + (x − α)2 ] 2 sec3 ψ θ 2π [θ2 + (x − α)2 ] : π 2 cos ψ dψ −π 2 θ . π [θ2 + (x − α)2 ] Hence, the marginal of X is a Cauchy distribution with parameters θ and α. Thus, for the random variable X, the expected value E(X) and the variance V ar(X) do not exist (see Example 4.2). In a similar manner, it can be shown that the marginal distribution of Y is also Cauchy with parameters θ and β and hence E(Y ) and V ar(Y ) do not exist. Since Cov(X, Y ) = E(XY ) − E(X) E(Y ), it easy to note that Cov(X, Y ) also does not exist. This completes the proof of the theorem. The conditional distribution of Y given the event X = x is given by f (y/x) = f (x, y) 1 θ2 + (x − α)2 = . f1 (x) 2 [ θ2 + (x − α)2 + (y − β)2 ] 23 Similarly, the conditional distribution of X given the event Y = y is f (y/x) = 1 θ2 + (y − β)2 . 2 [ θ2 + (x − α)2 + (y − β)2 ] 32 Next theorem states some properties of the conditional densities f (y/x) and f (x/y). Theorem 12.4. Let (X, Y ) ∼ CAU (θ, α, β), where θ > 0, α and β are parameters. Then the conditional expectations E(Y /x) = β E(X/y) = α, Some Special Continuous Bivariate Distributions 326 and the conditional variances V ar(Y /x) and V ar(X/y) do not exist. Proof: First, we show that E(Y /x) is β. 
The conditional expectation of Y given the event X = x can be computed as E(Y /x) = : ∞ −∞ : ∞ y f (y/x) dy θ2 + (x − α)2 1 dy 2 [ θ2 + (x − α)2 + (y − β)2 ] 32 −∞ ! " : < ∞ d θ2 + (x − α)2 + (y − β)2 1 ; 2 2 θ + (x − α) = 3 4 −∞ [ θ 2 + (x − α)2 + (y − β)2 ] 2 : ∞ < β ; 2 dy + θ + (x − α)2 3 2 2 −∞ [ θ + (x − α)2 + (y − β)2 ] 2 0∞ / < 2 1 ; 2 2 −I = θ + (x − α) 4 θ2 + (x − α)2 + (y − β)2 −∞ : π2 < β ; 2 cos ψ dψ θ + (x − α)2 + π [θ 2 + (x − α)2 ] 2 −2 = y =0+β = β. Similarly, it can be shown that E(X/y) = α. Next, we show that the conditional variance of Y given X = x does not exist. To show this, we need ! " E Y 2 /x , which is given by ! " E Y /x = 2 : ∞ −∞ y2 1 θ2 + (x − α)2 dy. 2 [ θ2 + (x − α)2 + (y − β)2 ] 32 The above integral does not exist and hence the conditional second moment of Y given X = x does not exist. As a consequence, the V ar(Y /x) also does not exist. Similarly, the variance of X given the event Y = y also does not exist. This completes the proof of the theorem. 12.3. Bivariate Gamma Distributions In this section, we present three different bivariate gamma probability density functions and study some of their intrinsic properties. Definition 12.4. A continuous bivariate random variable (X, Y ) is said to have the bivariate gamma distribution if its joint probability density function Probability and Mathematical Statistics 327 is of the form  # √ $ 1  2 θ xy (xy) 2 (α−1) − x+y  1−θ Iα−1 e 1 1−θ f (x, y) = (1−θ) Γ(α) θ 2 (α−1)   0 if 0 ≤ x, y < ∞ otherwise, where θ ∈ [0, 1) and α > 0 are parameters, and Ik (z) := ∞ % r=0 ! 1 "k+2r 2 z . r! Γ(k + r + 1) As usual, we denote this bivariate gamma random variable by writing (X, Y ) ∼ GAM K(α, θ). The function Ik (z) is called the modified Bessel function of the first kind of order k. In explicit form f (x, y) is given by f (x, y) =      1 θ α−1 Γ(α) x+y e− 1−θ ∞ % k=0 0 α+k−1 (θ x y) k! Γ(α + k) (1 − θ)α+2k for 0 ≤ x, y < ∞ otherwise. The following figures show the graph of the joint density function f (x, y) of a bivariate gamma random variable with parameters α = 1 and θ = 0.5 and the equi-density curves of f (x, y). In 1941, Kibble found this bivariate gamma density function. However, Wicksell in 1933 had constructed the characteristic function of this bivariate gamma density function without knowing the explicit form of this density function. If { (Xi , Yi ) | i = 1, 2, ..., n} is a random sample from a bivariate normal distribution with zero means, then the bivariate random variable n n % % (X, Y ), where X = n1 Xi2 and Y = n1 Yi2 , has bivariate gamma distrii=1 i=1 bution. This fact was established by Wicksell by finding the characteristic Some Special Continuous Bivariate Distributions 328 function of (X, Y ). This bivariate gamma distribution has found applications in noise theory (see Rice (1944, 1945)). The following theorem provides us some important characteristic of the bivariate gamma distribution of Kibble. Theorem 12.5. Let the random variable (X, Y ) ∼ GAM K(α, θ), where 0 < α < ∞ and 0 ≤ θ < 1 are parameters. Then the marginals of X and Y are univariate gamma and E(X) = α E(Y ) = α V ar(X) = α V ar(Y ) = α Cov(X, Y ) = α θ M (s, t) = 1 . [(1 − s) (1 − t) − θ s t]α Proof: First, we show that the marginal distribution of X is univariate gamma with parameter α (and θ = 1). The marginal density of X is given by : ∞ f1 (x) = f (x, y) dy 0 : ∞ α+k−1 % (θ x y) 1 − x+y 1−θ dy e θα−1 Γ(α) k! Γ(α + k) (1 − θ)α+2k 0 k=0 : ∞ ∞ α+k−1 % y x (θ x) 1 − 1−θ e y α+k−1 e− 1−θ dy = α−1 α+2k θ Γ(α) k! 
Γ(α + k) (1 − θ) 0 = ∞ k=0 ∞ % α+k−1 x (θ x) 1 e− 1−θ (1 − θ)α+k Γ(α + k) Γ(α) k! Γ(α + k) (1 − θ)α+2k k=0 $k ∞ # % x θ 1 = xα+k−1 e− 1−θ 1−θ k! Γ(α) k=0 # $k ∞ % x xθ 1 1 xα−1 e− 1−θ = Γ(α) k! 1 − θ = θα−1 k=0 xθ x 1 = xα−1 e− 1−θ e 1−θ Γ(α) 1 = xα−1 e−x . Γ(α) Probability and Mathematical Statistics 329 Thus, the marginal distribution of X is gamma with parameters α and θ = 1. Therefore, by Theorem 6.3, we obtain E(X) = α, V ar(X) = α. Similarly, we can show that the marginal density of Y is gamma with parameters α and θ = 1. Hence, we have E(Y ) = α, V ar(Y ) = α. The moment generating function can be computed in a similar manner and we leave it to the reader. This completes the proof of the theorem. The following results are needed for the next theorem. From calculus we know that ∞ % zk , (12.1) ez = k! k=0 and the infinite series on the right converges for all z ∈ R. I Differentiating both sides of (12.1) and then multiplying the resulting expression by z, one obtains ∞ % zk zez = (12.2) k . k! k=0 If one differentiates (12.2) again with respect to z and multiply the resulting expression by z, then he/she will get z 2 z ze + z e = ∞ % k2 k=0 zk . k! (12.3) Theorem 12.6. Let the random variable (X, Y ) ∼ GAM K(α, θ), where 0 < α < ∞ and 0 ≤ θ < 1 are parameters. Then E(Y /x) = θ x + (1 − θ) α E(X/y) = θ y + (1 − θ) α V ar(Y /x) = (1 − θ) [ 2θ x + (1 − θ) α ] V ar(X/y) = (1 − θ) [ 2θ y + (1 − θ) α ]. Some Special Continuous Bivariate Distributions 330 Proof: First, we will find the conditional probability density function Y given X = x, which is given by f (y/x) = = f (x, y) f1 (x) 1 x+y θα−1 xα−1 e−x x = ex− 1−θ ∞ % k=0 e− 1−θ ∞ % k=0 α+k−1 (θ x y) k! Γ(α + k) (1 − θ)α+2k y 1 (θ x)k α+k−1 − 1−θ . y e Γ(α + k) (1 − θ)α+2k k! Next, we compute the conditional expectation of Y given the event X = x. The conditional expectation E(Y /x) is given by E(Y /x) : ∞ = y f (y/x) dy 0 = : ∞ 0 x = ex− 1−θ ∞ % y (θ x)k α+k−1 − 1−θ 1 y e dy α+2k Γ(α + k) (1 − θ) k! k=0 : ∞ ∞ % y (θ x)k 1 y α+k e− 1−θ dy α+2k Γ(α + k) (1 − θ) k! 0 x y ex− 1−θ k=0 ∞ % 1 (θ x)k (1 − θ)α+k+1 Γ(α + k) Γ(α + k) (1 − θ)α+2k k! k=0 # $k ∞ % x 1 θx (α + k) = (1 − θ) ex− 1−θ k! 1 − θ k=0 2 3 θx θx x θ x 1−θ e (by (12.1) and (12.2)) = (1 − θ) ex− 1−θ α e 1−θ + 1−θ x = ex− 1−θ = (1 − θ) α + θ x. In order to determine the conditional variance of Y given the event X = x, we need the conditional expectation of Y 2 given the event X = x. This Probability and Mathematical Statistics 331 conditional expectation can be evaluated as follows: E(Y 2 /x) : ∞ y 2 f (y/x) dy = 0 = : ∞ 0 x = ex− 1−θ ∞ % y (θ x)k α+k−1 − 1−θ 1 y e dy α+2k Γ(α + k) (1 − θ) k! k=0 : ∞ ∞ % y 1 (θ x)k y α+k+1 e− 1−θ dy Γ(α + k) (1 − θ)α+2k k! 0 x y 2 ex− 1−θ k=0 ∞ % (θ x)k 1 (1 − θ)α+k+2 Γ(α + k + 2) Γ(α + k) (1 − θ)α+2k k! k=0 # $k ∞ % x 1 θx (α + k + 1) (α + k) = (1 − θ)2 ex− 1−θ k! 1 − θ k=0 # $k ∞ % x 1 θx 2 2 2 x− 1−θ (α + 2αk + k + α + k) = (1 − θ) e k! 1 − θ k=0 / # $k 0 ∞ % x θx θx k2 θx x− 1−θ 2 2 + +e = (1 − θ) α + α + (2α + 1) 1−θ 1−θ k! 1 − θ k=0 / # $2 0 θx θx θx = (1 − θ)2 α2 + α + (2α + 1) + + 1−θ 1−θ 1−θ x = ex− 1−θ = (α2 + α) (1 − θ)2 + 2(α + 1) θ (1 − θ) x + θ2 x2 . The conditional variance of Y given X = x is V ar(Y /x) = E(Y 2 /x) − E(Y /x)2 = (α2 + α) (1 − θ)2 + 2(α + 1) θ (1 − θ) x + θ2 x2 ; < − (1 − θ)2 α2 + θ2 x2 + 2 α θ (1 − θ) x = (1 − θ) [α (1 − θ) + 2 θ x] . 
Since the density function f (x, y) is symmetric, that is f (x, y) = f (y, x), the conditional expectation E(X/y) and the conditional variance V ar(X/y) can be obtained by interchanging x with y in the formulae of E(Y /x) and V ar(Y /x). This completes the proof of the theorem. In 1941, Cherian constructed a bivariate gamma distribution whose probability density function is given by  −(x+y) 8 min{x,y} α3 (x−z)α1 (y−z)α2 z z  Oe3 e dz if 0 < x, y < ∞ z (x−z) (y−z) 0 Γ(αi ) i=1 f (x, y) =  0 otherwise, Some Special Continuous Bivariate Distributions 332 where α1 , α2 , α3 ∈ (0, ∞) are parameters. If a bivariate random variable (X, Y ) has a Cherian bivariate gamma probability density function with parameters α1 , α2 and α3 , then we denote this by writing (X, Y ) ∼ GAM C(α1 , α2 , α3 ). It can be shown that the marginals of f (x, y) are given by f1 (x) = and f2 (x) = & & 1 Γ(α1 +α3 ) xα1 +α3 −1 e−x 0 if 0 < x < ∞ otherwise 1 Γ(α2 +α3 ) xα2 +α3 −1 e−y 0 Hence, we have the following theorem. if 0 < y < ∞ otherwise. Theorem 12.7. If (X, Y ) ∼ GAM C(α, β, γ), then E(X) = α + γ E(Y ) = β + γ V ar(X) = α + γ V ar(Y ) = β + γ E(XY ) = γ + (α + γ)(β + γ). The following theorem can be established by first computing the conditional probability density functions. We leave the proof of the following theorem to the reader. Theorem 12.8. If (X, Y ) ∼ GAM C(α, β, γ), then E(Y /x) = β + γ x α+γ and E(X/y) = α + γ y. β+γ David and Fix (1961) have studied the rank correlation and regression for samples from this distribution. For an account of this bivariate gamma distribution the interested reader should refer to Moran (1967). In 1934, McKay gave another bivariate gamma distribution whose probability density function is of the form  α+β θ α−1  Γ(α) (y − x)β−1 e−θ y if 0 < x < y < ∞ Γ(β) x f (x, y) =  0 otherwise, Probability and Mathematical Statistics 333 where θ, α, β ∈ (0, ∞) are parameters. If the form of the joint density of the random variable (X, Y ) is similar to the density function of the bivariate gamma distribution of McKay, then we write (X, Y ) ∼ GAM M (θ, α, β). The graph of probability density function f (x, y) of the bivariate gamma distribution of McKay for θ = α = β = 2 is shown below. The other figure illustrates the equi-density curves of this joint density function f (x, y). It can shown that if (X, Y ) ∼ GAM M (θ, α, β), then the marginal f1 (x) of X and the marginal f2 (y) of Y are given by f1 (x) = and f2 (y) = ! Hence X ∼ GAM α, following theorem.    1 θ " & θα Γ(α) xα−1 e−θ x 0 θ α+β Γ(α+β) 0 if 0 ≤ x < ∞ otherwise xα+β−1 e−θ x if 0 ≤ x < ∞ otherwise. " and Y ∼ GAM α + β, θ1 . Therefore, we have the ! Theorem 12.9. If (X, Y ) ∼ GAM M (θ, α, β), then α θ α+β E(Y ) = θ α V ar(X) = 2 θ α+β V ar(Y ) = θ2 $α # $β # θ θ . M (s, t) = θ−s−t θ−t E(X) = Some Special Continuous Bivariate Distributions 334 We state the various properties of the conditional densities of f (x, y), without proof, in the following theorem. Theorem 12.10. If (X, Y ) ∼ GAM M (θ, α, β), then β θ αy E(X/y) = α+β β V ar(Y /x) = 2 θ E(Y /x) = x + V ar(X/y) = αβ y2 . (α + β)2 (α + β + 1) We know that the univariate exponential distribution is a special case of the univariate gamma distribution. Similarly, the bivariate exponential distribution is a special case of bivariate gamma distribution. On taking the index parameters to be unity in the Kibble and Cherian bivariate gamma distribution given above, we obtain the corresponding bivariate exponential distributions. 
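Before writing these bivariate exponentials down, it is worth noting that the moment formulas of Theorem 12.7 are easy to spot-check by simulation using the trivariate-reduction representation usually associated with Cherian's construction: take independent Z1 ~ GAM(alpha), Z2 ~ GAM(beta), Z3 ~ GAM(gamma) with unit scale and set X = Z1 + Z3 and Y = Z2 + Z3. The following Python sketch (an informal aside; the representation itself is assumed here rather than derived in the text, and NumPy is required) compares the empirical moments with E(X) = alpha + gamma, Var(X) = alpha + gamma and E(XY) = gamma + (alpha + gamma)(beta + gamma).

import numpy as np

rng = np.random.default_rng(0)
alpha, beta, gamma = 2.0, 3.0, 1.5       # shape parameters, unit scale assumed
n = 200_000

# Trivariate reduction: X and Y share the common gamma component z3.
z1 = rng.gamma(alpha, 1.0, n)
z2 = rng.gamma(beta, 1.0, n)
z3 = rng.gamma(gamma, 1.0, n)
x, y = z1 + z3, z2 + z3

print("E(X):   simulated", x.mean(), "  theory", alpha + gamma)
print("Var(X): simulated", x.var(), "  theory", alpha + gamma)
print("E(Y):   simulated", y.mean(), "  theory", beta + gamma)
print("E(XY):  simulated", (x * y).mean(),
      "  theory", gamma + (alpha + gamma) * (beta + gamma))
print("Cov:    simulated", np.cov(x, y)[0, 1], "  theory", gamma)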
The bivariate exponential probability density function corresponding to bivariate gamma distribution of Kibble is given by f (x, y) =  ∞ %   e−( x+y 1−θ )   k=0 (θ x y)k k! Γ(k + 1) (1 − θ)2k+1 0 if 0 < x, y < ∞ otherwise, where θ ∈ (0, 1) is a parameter. The bivariate exponential distribution corresponding to the Cherian bivariate distribution is the following: f (x, y) = < & ; min{x,y} e − 1 e−(x+y) 0 if 0 < x, y < ∞ otherwise. In 1960, Gumble has studied the following bivariate exponential distribution whose density function is given by f (x, y) =   [(1 + θx) (1 + θy) − θ] e−(x+y+θ x y)  0 where θ > 0 is a parameter. if 0 < x, y < ∞ otherwise, Probability and Mathematical Statistics 335 In 1967, Marshall and Olkin introduced the following bivariate exponential distribution   1 − e−(α+γ)x − e−(β+γ)y + e−(αx+βy+γ max{x,y}) if x, y > 0 F (x, y) =  0 otherwise, where α, β, γ > 0 are parameters. The exponential distribution function of Marshall and Olkin satisfies the lack of memory property P (X > x + t, Y > y + t / X > t, Y > t) = P (X > x, Y > y). 12.4. Bivariate Beta Distribution The bivariate beta distribution (also known as Dirichlet distribution ) is one of the basic distributions in statistics. The bivariate beta distribution is used in geology, biology, and chemistry for handling compositional data which are subject to nonnegativity and constant-sum constraints. It is also used nowadays with increasing frequency in statistical modeling, distribution theory and Bayesian statistics. For example, it is used to model the distribution of brand shares of certain consumer products, and in describing the joint distribution of two soil strength parameters. Further, it is used in modeling the proportions of the electorates who vote for a candidates in a two-candidate election. In Bayesian statistics, the beta distribution is very popular as a prior since it yields a beta distribution as posterior. In this section, we give some basic facts about the bivariate beta distribution. Definition 12.5. A continuous bivariate random variable (X, Y ) is said to have the bivariate beta distribution if its joint probability density function is of the form  Γ(θ1 +θ2 +θ3 )  Γ(θ xθ1 −1 y θ2 −1 (1 − x − y)θ3 −1 if 0 < x, y, x + y < 1 1 )Γ(θ2 )Γ(θ3 ) f (x, y) =  0 otherwise, Some Special Continuous Bivariate Distributions 336 where θ1 , θ2 , θ3 are positive parameters. We will denote a bivariate beta random variable (X, Y ) with positive parameters θ1 , θ2 and θ3 by writing (X, Y ) ∼ Beta(θ1 , θ2 , θ3 ). The following figures show the graph and the equi-density curves of f (x, y) on the domain of its definition. In the following theorem, we present the expected values, the variances of the random variables X and Y , and the correlation between X and Y . Theorem 12.11. Let (X, Y ) ∼ Beta(θ1 , θ2 , θ3 ), where θ1 , θ2 and θ3 are positive apriori chosen parameters. Then X ∼ Beta(θ1 , θ2 + θ3 ) and Y ∼ Beta(θ2 , θ1 + θ3 ) and E(X) = θ1 , θ V ar(X) = θ1 (θ − θ1 ) θ2 (θ + 1) E(Y ) = θ2 , θ V ar(Y ) = θ2 (θ − θ2 ) θ2 (θ + 1) Cov(X, Y ) = − θ1 θ2 θ2 (θ + 1) where θ = θ1 + θ2 + θ3 . Proof: First, we show that X ∼ Beta(θ1 , θ2 + θ3 ) and Y ∼ Beta(θ2 , θ1 + θ3 ). Since (X, Y ) ∼ Beta(θ2 , θ1 , θ3 ), the joint density of (X, Y ) is given by f (x, y) = Γ(θ) xθ1 −1 y θ2 −1 (1 − x − y)θ3 −1 , Γ(θ1 )Γ(θ2 )Γ(θ3 ) Probability and Mathematical Statistics 337 where θ = θ1 + θ2 + θ3 . 
Thus the marginal density of X is given by : 1 f1 (x) = f (x, y) dy 0 = Γ(θ) xθ1 −1 Γ(θ1 )Γ(θ2 )Γ(θ3 ) : 1−x y θ2 −1 (1 − x − y)θ3 −1 dy 0 Γ(θ) = xθ1 −1 (1 − x)θ3 −1 Γ(θ1 )Γ(θ2 )Γ(θ3 ) Now we substitute u = 1 − y 1−x : 1−x y θ2 −1 0 # y 1− 1−x $θ3 −1 dy in the above integral. Then we have : 1 uθ2 −1 (1 − u)θ3 −1 du xθ1 −1 (1 − x)θ2 +θ3 −1 Γ(θ) Γ(θ1 )Γ(θ2 )Γ(θ3 ) 0 Γ(θ) xθ1 −1 (1 − x)θ2 +θ3 −1 B(θ2 , θ3 ) = Γ(θ1 )Γ(θ2 )Γ(θ3 ) Γ(θ) = xθ1 −1 (1 − x)θ2 +θ3 −1 Γ(θ1 )Γ(θ2 + θ3 ) f1 (x) = since : 1 0 uθ2 −1 (1 − u)θ3 −1 du = B(θ2 , θ3 ) = Γ(θ2 )Γ(θ3 ) . Γ(θ2 + θ3 ) This proves that the random variable X ∼ Beta(θ1 , θ2 + θ3 ). Similarly, one can shows that the random variable Y ∼ Beta(θ2 , θ1 + θ3 ). Now using Theorem 6.5, we see that E(X) = θ1 , θ V ar(X) = θ1 (θ − θ1 ) θ2 (θ + 1) E(Y ) = θ2 , θ V ar(X) = θ2 (θ − θ2 ) , θ2 (θ + 1) where θ = θ1 + θ2 + θ3 . Next, we compute the product moment of X and Y . Consider E(XY ) : 1 2: = 0 1−x 3 xy f (x, y) dy dx 0 Γ(θ) = Γ(θ1 )Γ(θ2 )Γ(θ3 ) : 1 2: 1−x θ1 −1 θ2 −1 xy x 0 0 y θ3 −1 (1 − x − y) 3 dy dx 3 xθ1 y θ2 (1 − x − y)θ3 −1 dy dx 0 0 /: # $θ3 −1 0 : 1 1−x y Γ(θ) = y θ2 1 − xθ1 (1 − x)θ3 −1 dy dx. Γ(θ1 )Γ(θ2 )Γ(θ3 ) 0 1−x 0 = Γ(θ) Γ(θ1 )Γ(θ2 )Γ(θ3 ) : 1 2: 1−x Some Special Continuous Bivariate Distributions 338 y 1−x in the above integral to obtain 2: 1 3 : 1 Γ(θ) θ2 θ3 −1 θ1 θ2 +θ3 E(XY ) = u (1 − u) du dx x (1 − x) Γ(θ1 )Γ(θ2 )Γ(θ3 ) 0 0 Now we substitute u = Since : 0 and : 0 we have 1 uθ2 (1 − u)θ3 −1 du = B(θ2 + 1, θ3 ) 1 xθ1 (1 − x)θ2 +θ3 dx = B(θ1 + 1, θ2 + θ3 + 1) Γ(θ) B(θ2 + 1, θ3 ) B(θ1 + 1, θ2 + θ3 + 1) Γ(θ1 )Γ(θ2 )Γ(θ3 ) Γ(θ) θ1 Γ(θ1 )(θ2 + θ3 )Γ(θ2 + θ3 ) θ2 Γ(θ2 )Γ(θ3 ) = Γ(θ1 )Γ(θ2 )Γ(θ3 ) (θ)(θ + 1)Γ(θ) (θ2 + θ3 )Γ(θ2 + θ3 ) θ1 θ2 = where θ = θ1 + θ2 + θ3 . θ (θ + 1) E(XY ) = Now it is easy to compute the covariance of X and Y since Cov(X, Y ) = E(XY ) − E(X)E(Y ) θ1 θ2 θ1 θ 2 − = θ (θ + 1) θ θ θ1 θ2 =− 2 . θ (θ + 1) The proof of the theorem is now complete. The correlation coefficient of X and Y can be computed using the covariance as G Cov(X, Y ) θ1 θ2 . ρ= I =− (θ + θ V ar(X) V ar(Y ) 1 3 )(θ2 + θ3 ) Next theorem states some properties of the conditional density functions f (x/y) and f (y/x). Theorem 12.12. Let (X, Y ) ∼ Beta(θ1 , θ2 , θ3 ) where θ1 , θ2 and θ3 are positive parameters. Then θ2 (1 − x) , θ2 + θ3 θ1 (1 − y) E(X/y) = , θ1 + θ3 E(Y /x) = θ2 θ3 (1 − x)2 (θ2 + θ3 )2 (θ2 + θ3 + 1) θ1 θ3 (1 − y)2 V ar(X/y) = . (θ1 + θ3 )2 (θ1 + θ3 + 1) V ar(Y /x) = Probability and Mathematical Statistics 339 Proof: We know that if (X, Y ) ∼ Beta(θ1 , θ2 , θ3 ), the random variable X ∼ Beta(θ1 , θ2 + θ3 ). Therefore f (y/x) = = f (x, y) f1 (x) 1 Γ(θ2 + θ3 ) 1 − x Γ(θ2 )Γ(θ3 ) # y 1−x $θ2 −1 # 1− y 1−x $θ3 −1 > Y > for all 0 < y < 1 − x. Thus the random variable 1−x is a beta random > X=x variable with parameters θ2 and θ3 . Now we compute the conditional expectation of Y /x. Consider : 1−x E(Y /x) = y f (y/x) dy 0 Γ(θ2 + θ3 ) 1 = 1 − x Γ(θ2 )Γ(θ3 ) Now we substitute u = E(Y /x) = = = = y 1−x : 1−x y 0 # y 1−x $θ2 −1 # y 1− 1−x $θ3 −1 in the above integral to obtain : 1 Γ(θ2 + θ3 ) (1 − x) uθ2 (1 − u)θ3 −1 du Γ(θ2 )Γ(θ3 ) 0 Γ(θ2 + θ3 ) (1 − x) B(θ2 + 1, θ3 ) Γ(θ2 )Γ(θ3 ) Γ(θ2 + θ3 ) θ2 Γ(θ2 )Γ(θ3 ) (1 − x) Γ(θ2 )Γ(θ3 ) (θ2 + θ3 ) Γ(θ2 + θ3 ) θ2 (1 − x). θ2 + θ3 Next, we compute E(Y 2 /x). 
Consider : 1−x y 2 f (y/x) dy E(Y 2 /x) = 0 Γ(θ2 + θ3 ) 1 = 1 − x Γ(θ2 )Γ(θ3 ) : 0 1−x y 2 # y 1−x $θ2 −1 # y 1− 1−x $θ3 −1 : 1 Γ(θ2 + θ3 ) (1 − x)2 uθ2 +1 (1 − u)θ3 −1 du Γ(θ2 )Γ(θ3 ) 0 Γ(θ2 + θ3 ) = (1 − x)2 B(θ2 + 2, θ3 ) Γ(θ2 )Γ(θ3 ) Γ(θ2 + θ3 ) (θ2 + 1) θ2 Γ(θ2 )Γ(θ3 ) = (1 − x)2 Γ(θ2 )Γ(θ3 ) (θ2 + θ3 + 1) (θ2 + θ3 ) Γ(θ2 + θ3 ) (θ2 + 1) θ2 = (1 − x)2 . (θ2 + θ3 + 1) (θ2 + θ3 = dy. dy Some Special Continuous Bivariate Distributions 340 Therefore V ar(Y /x) = E(Y 2 /x) − E(Y /x)2 = θ2 θ3 (1 − x)2 . (θ2 + θ3 )2 (θ2 + θ3 + 1) Similarly, one can compute E(X/y) and V ar(X/y). We leave this computation to the reader. Now the proof of the theorem is now complete. The Dirichlet distribution can be extended from the unit square (0, 1)2 to an arbitrary rectangle (a1 , b1 ) × (a2 , b2 ). Definition 12.6. A continuous bivariate random variable (X1 , X2 ) is said to have the generalized bivariate beta distribution if its joint probability density function is of the form $θ −1 # $θ −1 2 # Γ(θ1 + θ2 + θ3 ) P xk − ak k xk − ak 3 f (x1 , x2 ) = 1− Γ(θ1 )Γ(θ2 )Γ(θ3 ) bk − ak bk − ak k=1 where 0 < x1 , x2 , x1 + x2 < 1 and θ1 , θ2 , θ3 , a1 , b1 , a2 , b2 are parameters. We will denote a bivariate generalized beta random variable (X, Y ) with positive parameters θ1 , θ2 and θ3 by writing (X, Y ) ∼ GBeta(θ1 , θ2 , θ3 , a1 , b1 , a2 , b2 ). It can be shown that if Xk = (bk − ak )Yk + ak (for k = 1, 2) and each (Y1 , Y2 ) ∼ Beta(θ1 , θ2 , θ3 ), then (X1 , X2 ) ∼ GBeta(θ1 , θ2 , θ3 , a1 , b1 , a2 , b2 ). Therefore, by Theorem 12.11 Theorem 12.13. Let (X, Y ) ∼ GBeta(θ1 , θ2 , θ3 , a1 , b1 , a2 , b2 ), where θ1 , θ2 and θ3 are positive apriori chosen parameters. Then X ∼ Beta(θ1 , θ2 + θ3 ) and Y ∼ Beta(θ2 , θ1 + θ3 ) and θ1 (θ − θ1 ) θ2 (θ + 1) θ2 (θ − θ2 ) V ar(Y ) = (b2 − a2 )2 2 θ (θ + 1) θ1 θ2 Cov(X, Y ) = −(b1 − a1 )(b2 − a2 ) 2 θ (θ + 1) θ1 + a1 , θ θ2 E(Y ) = (b2 − a2 ) + a2 , θ E(X) = (b1 − a1 ) V ar(X) = (b1 − a1 )2 where θ = θ1 + θ2 + θ3 . Another generalization of the bivariate beta distribution is the following: Definition 12.7. A continuous bivariate random variable (X1 , X2 ) is said to have the generalized bivariate beta distribution if its joint probability density function is of the form 1 2 −1 f (x1 , x2 ) = (1 − x1 x2 )β2 −1 xα1 −1 (1 − x1 )β1 −α2 −β2 xα 2 B(α1 , β1 )B(α2 , β2 ) Probability and Mathematical Statistics 341 where 0 < x1 , x2 , x1 + x2 < 1 and α1 , α2 , β1 , β2 are parameters. It is not difficult to see that X ∼ Beta(α1 , β1 ) and Y ∼ Beta(α2 , β2 ). 12.5. Bivariate Normal Distribution The bivariate normal distribution is a generalization of the univariate normal distribution. The first statistical treatment of the bivariate normal distribution was given by Galton and Dickson in 1886. Although there are several other bivariate distributions as discussed above, the bivariate normal distribution still plays a dominant role. The development of normal theory has been intensive and most thinking has centered upon bivariate normal distribution because of the relative simplicity of mathematical treatment of it. In this section, we give an in depth treatment of the bivariate normal distribution. Definition 12.8. A continuous bivariate random variable (X, Y ) is said to have the bivariate normal distribution if its joint probability density function is of the form f (x, y) = 1 I 2 π σ1 σ2 1 1− ρ2 e− 2 Q(x,y) , −∞ < x, y < ∞, I σ1 , σ2 ∈ (0, ∞) and ρ ∈ (−1, 1) are parameters, and where µ1 , µ2 ∈ R, /# # $2 $# $ # $2 0 y − µ2 y − µ2 x − µ1 x − µ1 1 . 
− 2ρ + Q(x, y) := 1 − ρ2 σ1 σ1 σ2 σ2 As usual, we denote this bivariate normal random variable by writing (X, Y ) ∼ N (µ1 , µ2 , σ1 , σ2 , ρ). The graph of f (x, y) has a shape of a “mountain”. The pair (µ1 , µ2 ) tells us where the center of the mountain is located in the (x, y)-plane, while σ12 and σ22 measure the spread of this mountain in the x-direction and y-direction, respectively. The parameter ρ determines the shape and orientation on the (x, y)-plane of the mountain. The following figures show the graphs of the bivariate normal distributions with different values of correlation coefficient ρ. The first two figures illustrate the graph of the bivariate normal distribution with ρ = 0, µ1 = µ2 = 0, and σ1 = σ2 = 1 and the equi-density plots. The next two figures illustrate the graph of the bivariate normal distribution with ρ = 0.5, µ1 = µ2 = 0, and σ1 = σ2 = 0.5 and the equi-density plots. The last two figures illustrate the graph of the bivariate normal distribution with ρ = −0.5, µ1 = µ2 = 0, and σ1 = σ2 = 0.5 and the equi-density plots. Some Special Continuous Bivariate Distributions 342 One of the remarkable features of the bivariate normal distribution is that if we vertically slice the graph of f (x, y) along any direction, we obtain a univariate normal distribution. In particular, if we vertically slice the graph of the f (x, y) along the x-axis, we obtain a univariate normal distribution. That is the marginal of f (x, y) is again normal. One can show that the marginals of f (x, y) are given by f1 (x) = σ1 1 √ σ2 1 √ 2π − 12 ! x−µ1 "2 − 21 ! x−µ2 "2 e and f2 (y) = 2π e σ1 σ2 In view of these, the following theorem is obvious. . Probability and Mathematical Statistics 343 Theorem 12.14. If (X, Y ) ∼ N (µ1 , µ2 , σ1 , σ2 , ρ), then E(X) = µ1 E(Y ) = µ2 V ar(X) = σ12 V ar(Y ) = σ22 Corr(X, Y ) = ρ 1 2 2 M (s, t) = eµ1 s+µ2 t+ 2 (σ1 s +2ρσ1 σ2 st+σ22 t2 ) . Proof: It is easy to establish the formulae for E(X), E(Y ), V ar(X) and V ar(Y ). Here we only establish the moment generating function. Since " " ! ! (X, Y ) ∼ N (µ1 , µ2 , σ1 , σ2 , ρ), we have X ∼ N µ1 , σ12 and Y ∼ N µ2 , σ22 . Further, for any s and t, the random variable W = sX + tY is again normal with µW = sµ1 + tµ2 2 σW = s2 σ12 + 2stρσ1 σ2 + t2 σ2 . and Since W is a normal random variable, its moment generating function is given by 1 2 2 M (τ ) = eµW τ + 2 τ σW . The joint moment generating function of (X, Y ) is ! " M (s, t) = E esX+tY 1 2 = eµW + 2 σW 1 2 2 = eµ1 s+µ2 t+ 2 (σ1 s +2ρσ1 σ2 st+σ22 t2 ) . This completes the proof of the theorem. It can be shown that the conditional density of Y given X = x is , -2 y−b √ − 12 1 2 σ2 1−ρ I e f (y/x) = σ2 2π (1 − ρ2 ) where b = µ2 + ρ σ2 (x − µ1 ). σ1 Similarly, the conditional density f (x/y) is f (x/y) = 1 σ1 − 21 I e 2π (1 − ρ2 ) , σ1 x−c √ 1−ρ2 -2 , Some Special Continuous Bivariate Distributions where c = µ1 + ρ 344 σ1 (y − µ2 ). σ2 In view of the form of f (y/x) and f (x/y), the following theorem is transparent. Theorem 12.15. If (X, Y ) ∼ N (µ1 , µ2 , σ1 , σ2 , ρ), then σ2 (x − µ1 ) σ1 σ1 (y − µ2 ) E(X/y) = µ1 + ρ σ2 V ar(Y /x) = σ22 (1 − ρ2 ) E(Y /x) = µ2 + ρ V ar(X/y) = σ12 (1 − ρ2 ). We have seen that if (X, Y ) has a bivariate normal distribution, then the distributions of X and Y are also normal. However, the converse of this is not true. That is if X and Y have normal distributions as their marginals, then their joint distribution is not necessarily bivariate normal. Now we present some characterization theorems concerning the bivariate normal distribution. 
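Before stating these characterization theorems, it may help to see Theorem 12.15 numerically: if a large bivariate normal sample is sliced near a fixed value x, the Y-values in that slice should have mean close to mu2 + rho (sigma2/sigma1)(x - mu1) and variance close to sigma2^2 (1 - rho^2). The following Python sketch (an informal aside, assuming NumPy; the parameter values and the slice half-width 0.05 are arbitrary choices) carries out this check.

import numpy as np

rng = np.random.default_rng(1)
mu1, mu2, s1, s2, rho = 1.0, -2.0, 2.0, 1.5, 0.6
cov = [[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]]

xy = rng.multivariate_normal([mu1, mu2], cov, size=1_000_000)
x, y = xy[:, 0], xy[:, 1]

x0 = 2.0                               # condition (approximately) on X = x0
y_slice = y[np.abs(x - x0) < 0.05]     # Y-values whose X falls near x0

print("E(Y/x):   slice", y_slice.mean(),
      "  theory", mu2 + rho * (s2 / s1) * (x0 - mu1))
print("Var(Y/x): slice", y_slice.var(),
      "  theory", s2**2 * (1 - rho**2))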
The first theorem is due to Cramer (1941). Theorem 12.16. The random variables X and Y have a joint bivariate normal distribution if and only if every linear combination of X and Y has a univariate normal distribution. Theorem 12.17. The random variables X and Y with unit variances and correlation coefficient ρ have a joint bivariate normal distribution if and only if 2 3 ∂ ∂2 E[g(X, Y )] = E g(X, Y ) ∂ρ ∂X ∂Y holds for an arbitrary function g(x, y) of two variable. Many interesting characterizations of bivariate normal distribution can be found in the survey paper of Hamedani (1992). 12.6. Bivariate Logistic Distributions In this section, we study two bivariate logistic distributions. A univariate logistic distribution is often considered as an alternative to the univariate normal distribution. The univariate logistic distribution has a shape very close to that of a univariate normal distribution but has heavier tails than Probability and Mathematical Statistics 345 the normal. This distribution is also used as an alternative to the univariate Weibull distribution in life-testing. The univariate logistic distribution has the following probability density function f (x) = σ π √ A 3 − √π e 3 ( x−µ σ ) − √π 1+e 3 ( x−µ σ ) − ∞ < x < ∞, B2 where −∞ < µ < ∞ and σ > 0 are parameters. The parameter µ is the mean and the parameter σ is the standard deviation of the distribution. A random variable X with the above logistic distribution will be denoted by X ∼ LOG(µ, σ). It is well known that the moment generating function of univariate logistic distribution is given by ) + ) + √ √ 3 3 µt M (t) = e Γ 1 + σt Γ 1 − σt π π for |t| < σ π√3 . We give brief proof of the above result for µ = 0 and σ = Then with these assumptions, the logistic density function reduces to f (x) = e−x 2. (1 + e−x ) The moment generating function with respect to this density function is : ∞ etx f (x) dx M (t) = = −∞ ∞ : etx e−x 2 dx (1 + e−1 ) ! −x "−t e−x = e 2 dx (1 + e−1 ) −∞ : 1 ! −1 "−t z −1 dz where z = = −∞ : ∞ 0 = : 0 1 z t (1 − z)−t dz = B(1 + t, 1 − t) Γ(1 + t) Γ(1 − t) = Γ(1 + t + 1 − t) Γ(1 + t) Γ(1 − t) = Γ(2) = Γ(1 + t) Γ(1 − t) = t cosec(t). 1 1 + e−x π √ . 3 Some Special Continuous Bivariate Distributions 346 Recall that the marginals and conditionals of the bivariate normal distribution are univariate normal. This beautiful property enjoyed by the bivariate normal distribution are apparently lacking from other bivariate distributions we have discussed so far. If we can not define a bivariate logistic distribution so that the conditionals and marginals are univariate logistic, then we would like to have at least one of the marginal distributions logistic and the conditional distribution of the other variable logistic. The following bivariate logistic distribution is due to Gumble (1961). Definition 12.9. A continuous bivariate random variable (X, Y ) is said to have the bivariate logistic distribution of first kind if its joint probability density function is of the form f (x, y) = − √π 2 π2 e 3 ! x−µ1 σ1 + y−µ2 σ2 " 2 ! x−µ1 " ! y−µ2 " 33 − √π − √π σ1 σ2 3 3 3 σ1 σ2 1 + e +e −∞ < x, y < ∞, where −∞ < µ1 , µ2 < ∞, and 0 < σ1 , σ2 < ∞ are parameters. If a random variable (X, Y ) has a bivariate logistic distribution of first kind, then we express this by writing (X, Y ) ∼ LOGF (µ1 , µ2 , σ1 , σ2 ). The following figures show the graph of f (x, y) with µ1 = 0 = µ2 and σ1 = 1 = σ2 and the equidensity plots. It can be shown that marginally, X is a logistic random variable. That is, X ∼ LOG (µ1 , σ1 ). 
Similarly, Y ∼ LOG (µ2 , σ2 ). These facts lead us to the following theorem. Theorem 12.18. If the random variable (X, Y ) ∼ LOGF (µ1 , µ2 , σ1 , σ2 ), Probability and Mathematical Statistics 347 then E(X) = µ1 E(Y ) = µ2 V ar(X) = σ12 V ar(Y ) = σ22 1 E(XY ) = σ1 σ2 + µ1 µ2 , 2 and the moment generating function is given by µ1 s+µ2 t M (s, t) = e for |s| < π√ σ1 3 ) √ + ) √ + ) √ + (σ1 s + σ2 t) 3 σ1 s 3 σ2 t 3 Γ 1+ Γ 1− Γ 1− π π π and |t| < π√ . σ2 3 It is an easy exercise to see that if the random variables X and Y have a joint bivariate logistic distribution, then the correlation between X and Y is 21 . This can be considered as one of the drawbacks of this distribution in the sense that it limits the dependence between the random variables X and Y. The conditional density of Y given X = x is f (y/x) = 2 π − √π √ e 3 σ2 3 ! y−µ " σ2 2 2 ! x−µ1 " 32 − √π 1 + e 3 σ1 2 ! x−µ1 " ! y−µ2 " 33 . − √π − √π σ 1 + e 3 σ2 1+e 3 Similarly the conditional density of X given Y = y is f (x/y) = 2 π − √π √ e 3 σ1 3 ! x−µ " 1 σ1 2 ! y−µ2 " 32 − √π 1 + e 3 σ2 2 ! x−µ1 " ! y−µ2 " 33 . π √ − − √π 1 + e 3 σ 1 + e 3 σ2 Using these densities, the next theorem offers various conditional properties of the bivariate logistic distribution. Theorem 12.19. If the random variable (X, Y ) ∼ LOGF (µ1 , µ2 , σ1 , σ2 ), Some Special Continuous Bivariate Distributions then 348 # − √π ! x−µ1 " $ σ1 E(Y /x) = 1 − ln 1 + e # ! y−µ2 " $ − √π σ2 3 E(X/y) = 1 − ln 1 + e 3 π3 −1 3 π3 − 1. V ar(X/y) = 3 V ar(Y /x) = It was pointed out earlier that one of the drawbacks of this bivariate logistic distribution of first kind is that it limits the dependence of the random variables. The following bivariate logistic distribution was suggested to rectify this drawback. Definition 12.10. A continuous bivariate random variable (X, Y ) is said to have the bivariate logistic distribution of second kind if its joint probability density function is of the form 1−2α f (x, y) = [φα (x, y)] 2 [1 + φα (x, y)] # φα (x, y) − 1 +α φα (x, y) + 1 $ e−α(x+y) , −∞ < x, y < ∞, 1 where α > 0 is a parameter, and φα (x, y) := (e−αx + e−αy ) α . As before, we denote a bivariate logistic random variable of second kind (X, Y ) by writing (X, Y ) ∼ LOGS(α). The marginal densities of X and Y are again logistic and they given by f1 (x) = and f2 (y) = e−x 2, −∞ < x < ∞ 2, −∞ < y < ∞. (1 + e−x ) e−y (1 + e−y ) It was shown by Oliveira (1961) that if (X, Y ) ∼ LOGS(α), then the correlation between X and Y is ρ(X, Y ) = 1 − 1 . 2 α2 Probability and Mathematical Statistics 349 12.7. Review Exercises 1. If (X, Y ) ∼ N (µ1 , µ2 , σ1 , σ2 , ρ) with Q(x, y) = x2 +2y 2 −2xy +2x−2y +1, then what is the value of the conditional variance of Y given the event X = x? 2. If (X, Y ) ∼ N (µ1 , µ2 , σ1 , σ2 , ρ) with Q(x, y) = − < 1 ; (x + 3)2 − 16(x + 3)(y − 2) + 4(y − 2)2 , 102 then what is the value of the conditional expectation of Y given X = x? 3. If (X, Y ) ∼ N (µ1 , µ2 , σ1 , σ2 , ρ), then what is the correlation coefficient of the random variables U and V , where U = 2X + 3Y and V = 2X − 3Y ? 4. Let the random variables X and Y denote the height and weight of wild turkeys. If the random variables X and Y have a bivariate normal distribution with µ1 = 18 inches, µ2 = 15 pounds, σ1 = 3 inches, σ2 = 2 pounds, and ρ = 0.75, then what is the expected weight of one of these wild turkeys that is 17 inches tall? 5. If (X, Y ) ∼ N (µ1 , µ2 , σ1 , σ2 , ρ), then what is the moment generating function of the random variables U and V , where U = 7X + 3Y and V = 7X − 3Y ? 6. 
Let (X, Y ) have a bivariate normal distribution. The mean of X is 10 and the variance of X is 12. The mean of Y is −5 and the variance of Y is 5. If the covariance of X and Y is 4, then what is the probability that X + Y is greater than 10? 7. Let X and Y have a bivariate normal distribution with means µX = 5 and µY = 6, standard deviations σX = 3 and σY = 2, and covariance σXY = 2. Let Φ denote the cumulative distribution function of a normal random variable with mean 0 and variance 1. What is P (2 ≤ X − Y ≤ 5) in terms of Φ ? 8. If (X, Y ) ∼ N (µ1 , µ2 , σ1 , σ2 , ρ) with Q(x, y) = −x2 + xy − 2y 2 , then what is the conditional distributions of X given the event Y = y? 9. If (X, Y ) ∼ GAM K(α, θ), where 0 < α < ∞ and 0 ≤ θ < 1 are parameters, then show that the moment generating function is given by M (s, t) = # 1 (1 − s) (1 − t) − θ s t $α . Some Special Continuous Bivariate Distributions 350 10. Let X and Y have a bivariate gamma distribution of Kibble with parameters α = 1 and 0 ≤ θ < 0. What is the probability that the random variable 7X is less than 12 ? 11. If (X, Y ) ∼ GAM C(α, β, γ), then what are the regression and scedestic curves of Y on X? 12. The position of a random point (X, Y ) is equally probable anywhere on a circle of radius R and whose center is at the origin. What is the probability density function of each of the random variables X and Y ? Are the random variables X and Y independent? 13. If (X, Y ) ∼ GAM C(α, β, γ), what is the correlation coefficient of the random variables X and Y ? 14. Let X and Y have a bivariate exponential distribution of Gumble with parameter θ > 0. What is the regression curve of Y on X? 15. A screen of a navigational radar station represents a circle of radius 12 inches. As a result of noise, a spot may appear with its center at any point of the circle. Find the expected value and variance of the distance between the center of the spot and the center of the circle. 16. Let X and Y have a bivariate normal distribution. Which of the following statements must be true? (I) Any nonzero linear combination of X and Y has a normal distribution. (II) E(Y /X = x) is a linear function of x. (III) V ar(Y /X = x) ≤ V ar(Y ). 17. If (X, Y ) ∼ LOGS(α), then what is the correlation between X and Y ? 18. If (X, Y ) ∼ LOGF (µ1 , µ2 , σ1 , σ2 ), then what is the correlation between the random variables X and Y ? 19. If (X, Y ) ∼ LOGF (µ1 , µ2 , σ1 , σ2 ), then show that marginally X and Y are univariate logistic. 20. If (X, Y ) ∼ LOGF (µ1 , µ2 , σ1 , σ2 ), then what is the scedastic curve of the random variable Y and X? Probability and Mathematical Statistics 351 Chapter 13 SEQUENCES OF RANDOM VARIABLES AND ORDER STASTISTICS In this chapter, we generalize some of the results we have studied in the previous chapters. We do these generalizations because the generalizations are needed in the subsequent chapters relating mathematical statistics. In this chapter, we also examine the weak law of large numbers, Bernoulli’s law of large numbers, the strong law of large numbers, and the central limit theorem. Further, in this chapter, we treat the order statistics and percentiles. 13.1. Distribution of sample mean and variance Consider a random experiment. Let X be the random variable associated with this experiment. Let f (x) be the probability density function of X. Let us repeat this experiment n times. Let Xk be the random variable associated with the k th repetition. Then the collection of the random variables { X1 , X2 , ..., Xn } is a random sample of size n. 
From here after, we simply denote X1 , X2 , ..., Xn as a random sample of size n. The random variables X1 , X2 , ..., Xn are independent and identically distributed with the common probability density function f (x). For a random sample, functions such as the sample mean X, the sample variance S 2 are called statistics. In a particular sample, say x1 , x2 , ..., xn , we observed x and s2 . We may consider n 1 % X= Xi n i=1 Sequences of Random Variables and Order Statistics and S2 = 352 n "2 1 %! Xi − X n − 1 i=1 as random variables and x and s2 are the realizations from a particular sample. In this section, we are mainly interested in finding the probability distributions of the sample mean X and sample variance S 2 , that is the distribution of the statistics of samples. Example 13.1. Let X1 and X2 be a random sample of size 2 from a distribution with probability density function f (x) = J 6x(1 − x) 0 if 0 < x < 1 otherwise. What are the mean and variance of sample sum Y = X1 + X2 ? Answer: The population mean µX = E (X) : 1 x 6x(1 − x) dx = 0 =6 : 0 1 x2 (1 − x) dx = 6 B(3, 2) (here B denotes the beta function) Γ(3) Γ(2) =6 Γ(5) # $ 1 =6 12 1 = . 2 Since X1 and X2 have the same distribution, we obtain µX1 = Hence the mean of Y is given by E(Y ) = E(X1 + X2 ) = E(X1 ) + E(X2 ) 1 1 = + 2 2 = 1. 1 2 = µX2 . Probability and Mathematical Statistics 353 Next, we compute the variance of the population X. The variance of X is given by ! " V ar(X) = E X 2 − E(X)2 # $2 : 1 1 3 6x (1 − x) dx − = 2 0 # $ : 1 1 x3 (1 − x) dx − =6 4 0 # $ 1 = 6 B(4, 2) − 4 # $ 1 Γ(4) Γ(2) − =6 Γ(6) 4 # $ # $ 1 1 =6 − 20 4 5 6 − = 20 20 1 = . 20 Since X1 and X2 have the same distribution as the population X, we get V ar(X1 ) = 1 = V ar(X2 ). 20 Hence, the variance of the sample sum Y is given by V ar(Y ) = V ar (X1 + X2 ) = V ar (X1 ) + V ar (X2 ) + 2 Cov (X1 , X2 ) = V ar (X1 ) + V ar (X2 ) 1 1 = + 20 20 1 = . 10 Example 13.2. Let X1 and X2 be a random sample of size 2 from a distribution with density f (x) = &1 4 for x = 1, 2, 3, 4 0 otherwise. What is the distribution of the sample sum Y = X1 + X2 ? Sequences of Random Variables and Order Statistics 354 Answer: Since the range space of X1 as well as X2 is {1, 2, 3, 4}, the range space of Y = X1 + X2 is RY = {2, 3, 4, 5, 6, 7, 8}. Let g(y) be the density function of Y . We want to find this density function. First, we find g(2), g(3) and so on. g(2) = P (Y = 2) = P (X1 + X2 = 2) = P (X1 = 1 and X2 = 1) = P (X1 = 1) P (X2 = 1) (by independence of X1 and X2 ) = f (1) f (1) = # $# $ 1 1 1 . = 4 4 16 g(3) = P (Y = 3) = P (X1 + X2 = 3) = P (X1 = 1 and X2 = 2) + P (X1 = 2 and X2 = 1) = P (X1 = 1) P (X2 = 2) + P (X1 = 2) P (X2 = 1) = f (1) f (2) + f (2) f (1) # $# $ # $# $ 1 1 1 2 1 . + = = 4 4 4 4 16 (by independence of X1 and X2 ) Probability and Mathematical Statistics 355 g(4) = P (Y = 4) = P (X1 + X2 = 4) = P (X1 = 1 and X2 = 3) + P (X1 = 3 and X2 = 1) + P (X1 = 2 and X2 = 2) = P (X1 = 3) P (X2 = 1) + P (X1 = 1) P (X2 = 3) + P (X1 = 2) P (X2 = 2) (by independence of X1 and X2 ) = f (1) f (3) + f (3) f (1) + f (2) f (2) # $# $ # $# $ # $# $ 1 1 1 1 1 1 = + + 4 4 4 4 4 4 3 = . 16 Similarly, we get g(5) = 4 , 16 g(6) = 3 , 16 g(7) = 2 , 16 g(8) = 1 . 16 Thus, putting these into one expression, we get g(y) = P (Y = y) = y−1 % k=1 = f (k) f (y − k) 4 − |y − 5| , 16 Remark 13.1. Note that g(y) = y−1 % k=1 y = 2, 3, 4, ..., 8. f (k) f (y − k) is the discrete convolution of f with itself. The concept of convolution was introduced in chapter 10. 
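The discrete convolution in Remark 13.1 can be carried out mechanically. The following Python sketch (an informal aside, using only the standard library) recomputes the density of Y = X1 + X2 in Example 13.2 by convolving f with itself and confirms the closed form g(y) = (4 - |y - 5|)/16.

from fractions import Fraction

# Common density of X1 and X2: uniform on {1, 2, 3, 4}.
f = {x: Fraction(1, 4) for x in (1, 2, 3, 4)}

# Discrete convolution: g(y) = sum over k of f(k) * f(y - k).
g = {}
for x1, p1 in f.items():
    for x2, p2 in f.items():
        g[x1 + x2] = g.get(x1 + x2, Fraction(0)) + p1 * p2

for y in sorted(g):
    closed_form = Fraction(4 - abs(y - 5), 16)
    print(y, g[y], closed_form, g[y] == closed_form)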
The above example can also be done using the moment generating func- Sequences of Random Variables and Order Statistics 356 tion method as follows: MY (t) = MX1 +X2 (t) = MX1 (t) MX2 (t) $# t $ # t e + e2t + e3t + e4t e + e2t + e3t + e4t = 4 4 # t $ 2t 3t 4t 2 e +e +e +e = 4 e2t + 2e3t + 3e4t + 4e5t + 3e6t + 2e7t + e8t = . 16 Hence, the density of Y is given by g(y) = 4 − |y − 5| , 16 y = 2, 3, 4, ..., 8. Theorem 13.1. If X1 , X2 , ..., Xn are mutually independent random variables with densities f1 (x1 ), f2 (x2 ), ..., fn (xn ) and E[ui (Xi )], i = 1, 2, ..., n exist, then / n 0 n P P E ui (Xi ) = E[ui (Xi )], i=1 i=1 where ui (i = 1, 2, ..., n) are arbitrary functions. Proof: We prove the theorem assuming that the random variables X1 , X2 , ..., Xn are continuous. If the random variables are not continuous, then the proof follows exactly in the same manner if one replaces the integrals by summations. Since ) n + P E ui (Xi ) i=1 = E(u1 (X1 ) · · · un (Xn )) : ∞ : ∞ u1 (x1 ) · · · un (xn )f (x1 , ..., xn )dx1 · · · dxn ··· = −∞ −∞ : ∞ : ∞ = ··· u1 (x1 ) · · · un (xn )f1 (x1 ) · · · fn (xn )dx1 · · · dxn −∞ −∞ : ∞ : ∞ un (xn )fn (xn )dxn u1 (x1 )f1 (x1 )dx1 · · · = −∞ = E (u1 (X1 )) · · · E (un (Xn )) n P E (ui (Xi )) , = i=1 −∞ Probability and Mathematical Statistics 357 the proof of the theorem is now complete. Example 13.3. Let X and Y be two random variables with the joint density & −(x+y) e for 0 < x, y < ∞ f (x, y) = 0 otherwise. What is the expected value of the continuous random variable Z = X 2 Y 2 + XY 2 + X 2 + X? Answer: Since f (x, y) = e−(x+y) = e−x e−y = f1 (x) f2 (y), the random variables X and Y are mutually independent. Hence, the expected value of X is : ∞ E(X) = x f1 (x) dx :0 ∞ = xe−x dx 0 = Γ(2) = 1. Similarly, the expected value of X 2 is given by : ∞ ! " E X2 = x2 f1 (x) dx 0 : ∞ = x2 e−x dx 0 = Γ(3) = 2. Since the marginals of X and Y are same, we also get E(Y ) = 1 and E(Y 2 ) = 2. Further, by Theorem 13.1, we get ; < E [Z] = E X 2 Y 2 + XY 2 + X 2 + X ;! "! "< = E X2 + X Y 2 + 1 ; < ; < = E X2 + X E Y 2 + 1 (by Theorem 13.1) ! ; 2< " ! ; 2< " = E X + E [X] E Y + 1 = (2 + 1) (2 + 1) = 9. Sequences of Random Variables and Order Statistics 358 Theorem 13.2. If X1 , X2 , ..., Xn are mutually independent random variables with respective means µ1 , µ2 , ..., µn and variances σ12 , σ22 , ..., σn2 , then 1n the mean and variance of Y = i=1 ai Xi , where a1 , a2 , ..., an are real constants, are given by µY = n % ai µi σY2 = and n % a2i σi2 . i=1 i=1 Proof: First we show that µY = 1n i=1 ai µi . Since µY = E(Y ) ) n + % =E ai Xi i=1 = = n % i=1 n % ai E(Xi ) ai µi i=1 we have asserted result. Next we show σY2 Cov(Xi , Xj ) = 0 for i += j, we have = 1n i=1 a2i σi2 . Since σY2 = V ar(Y ) = V ar (ai Xi ) n % = a2i V ar (Xi ) i=1 = n % a2i σi2 . i=1 This completes the proof of the theorem. Example 13.4. Let the independent random variables X1 and X2 have means µ1 = −4 and µ2 = 3, respectively and variances σ12 = 4 and σ22 = 9. What are the mean and variance of Y = 3X1 − 2X2 ? Answer: The mean of Y is µY = 3µ1 − 2µ2 = 3(−4) − 2(3) = −18. Probability and Mathematical Statistics 359 Similarly, the variance of Y is σY2 = (3)2 σ12 + (−2)2 σ22 = 9 σ12 + 4 σ22 = 9(4) + 4(9) = 72. Example 13.5. Let X1 , X2 , ..., X50 be a random sample of size 50 from a distribution with density & 1 −x θ for 0 ≤ x < ∞ θ e f (x) = 0 otherwise. What are the mean and variance of the sample mean X? 
Answer: Since the distribution of the population X is exponential, the mean and variance of X are given by µX = θ, and 2 σX = θ2 . Thus, the mean of the sample mean is # $ ! " X1 + X2 + · · · + X50 E X =E 50 = = 50 1 % E (Xi ) 50 i=1 50 1 % θ 50 i=1 1 50 θ = θ. 50 The variance of the sample mean is given by + ) 50 % 1 ! " V ar X = V ar Xi 50 i=1 $2 50 # % 1 2 = σX i 50 i=1 $2 50 # % 1 θ2 = 50 i=1 # $2 1 = 50 θ2 50 θ2 . = 50 = Sequences of Random Variables and Order Statistics 360 Theorem 13.3. If X1 , X2 , ..., Xn are independent random variables with respective moment generating functions MXi (t), i = 1, 2, ..., n, then the mo1n ment generating function of Y = i=1 ai Xi is given by MY (t) = n P MXi (ai t) . i=1 Proof: Since MY (t) = M1n i=1 = = n P i=1 n P ai Xi (t) Mai Xi (t) MXi (ai t) i=1 we have the asserted result and the proof of the theorem is now complete. Example 13.6. Let X1 , X2 , ..., X10 be the observations from a random sample of size 10 from a distribution with density 1 2 1 e− 2 x , f (x) = √ 2π −∞ < x < ∞. What is the moment generating function of the sample mean? Answer: The density of the population X is a standard normal. Hence, the moment generating function of each Xi is 1 2 MXi (t) = e 2 t , i = 1, 2, ..., 10. The moment generating function of the sample mean is MX (t) = M110 1 Xi i=1 10 = 10 P MXi i=1 = 10 P 1 10 " . $ t2 t2 = e 200 ! Hence X ∼ N 0, 1 t 10 e 200 i=1 A # (t) B10 !1 = e 10 t2 2 " . Probability and Mathematical Statistics 361 The last example tells us that if we take a sample of any size from a standard normal population, then the sample mean also has a normal distribution. The following theorem says that a linear combination of random variables with normal distributions is again normal. Theorem 13.4. If X1 , X2 , ..., Xn are mutually independent random variables such that ! " Xi ∼ N µi , σi2 , i = 1, 2, ..., n. Then the random variable Y = n % ai Xi is a normal random variable with i=1 mean µY = n % ai µi σY2 = and i=1 that is Y ∼ N !1n i=1 ai µi , n % a2i σi2 , i=1 1n i=1 " a2i σi2 . " ! Proof: Since each Xi ∼ N µi , σi2 , the moment generating function of each Xi is given by 1 2 2 MXi (t) = eµi t+ 2 σi t . Hence using Theorem 13.3, we have MY (t) = = n P i=1 n P MXi (ai t) 1 2 2 2 eai µi t+ 2 ai σi t i=1 1n 2 2 2 1n 1 = e i=1 ai µi t+ 2 i=1 ai σi t . Thus the random variable Y ∼ N theorem is now complete. ) n % i=1 ai µi , n % i=1 + a2i σi2 . The proof of the Example 13.7. Let X1 , X2 , ..., Xn be the observations from a random sample of size n from a normal distribution with mean µ and variance σ 2 > 0. What are the mean and variance of the sample mean X? Sequences of Random Variables and Order Statistics 362 Answer: The expected value (or mean) of the sample mean is given by n ! " 1 % E (Xi ) E X = n i=1 = n 1 % µ n i=1 = µ. Similarly, the variance of the sample mean is n ! " % V ar X = V ar i=1 # Xi n $ = n # $2 % 1 i=1 n σ2 = σ2 . n This example along with the previous theorem says that if we take a random sample of size n from a normal population with mean µ and variance σ 2 , σ2 then the, sample - mean is also normal with mean µ and variance n , that is X ∼ N µ, σ2 n . Example 13.8. Let X1 , X2 , ..., X64 be a random sample of size 64 from a normal distribution with µ = 50 and σ 2 = 16. What are P (49 < X8 < 51) ! " and P 49 < X < 51 ? 
Answer: Since X8 ∼ N (50, 16), we get P (49 < X8 < 51) = P (49 − 50 < X8 − 50 < 51 − 50) # $ X8 − 50 51 − 50 49 − 50 < < =P 4 4 4 $ # 1 X8 − 50 1 < =P − < 4 4 4 # $ 1 1 =P − <Z< 4 4 $ # 1 −1 = 2P Z < 4 = 0.1974 (from normal table). Probability and Mathematical Statistics 363 " ! 16 . Hence By the previous theorem, we see that X ∼ N 50, 64 ! " ! " P 49 < X < 51 = P 49 − 50 < X − 50 < 51 − 50   51 − 50 49 − 50 X − 50  =P = < = < = 16 64  16 64 16 64  X − 50 < 2 = P −2 < = 16 64 = P (−2 < Z < 2) = 2P (Z < 2) − 1 = 0.9544 (from normal table). This example tells us that X has a greater probability of falling in an interval containing µ, than a single observation, say X8 (or in general any Xi ). Theorem 13.5. Let the distributions of the random variables X1 , X2 , ..., Xn be χ2 (r1 ), χ2 (r2 ), ..., χ2 (rn ), respectively. If X1 , X2 , ..., Xn are mutually in1n dependent, then Y = X1 + X2 + · · · + Xn ∼ χ2 ( i=1 ri ). Proof: Since each Xi ∼ χ2 (ri ), the moment generating function of each Xi is given by ri MXi (t) = (1 − 2t)− 2 . By Theorem 13.3, we have MY (t) = n P i=1 MXi (t) = n P i=1 ri 1 (1 − 2t)− 2 = (1 − 2t)− 2 1n i=1 ri . 1n Hence Y ∼ χ2 ( i=1 ri ) and the proof of the theorem is now complete. The proof of the following theorem is an easy consequence of Theorem 13.5 and we leave the proof to the reader. Theorem 13.6. If Z1 , Z2 , ..., Zn are mutually independent and each one is standard normal, then Z12 + Z22 + · · · + Zn2 ∼ χ2 (n), that is the sum is chi-square with n degrees of freedom. The following theorem is very useful in mathematical statistics and its proof is beyond the scope of this introductory book. Theorem 13.7. If X1 , X2 , ..., Xn are observations of a random sample of ! " size n from the normal distribution N µ, σ 2 , then the sample mean X = Sequences of Random Variables and Order Statistics 1n 1 Xi and the sample variance S 2 = n−1 following properties: (A) X and S 2 are independent, and 2 (B) (n − 1) Sσ2 ∼ χ2 (n − 1). 1 n i=1 364 1n i=1 (Xi − X)2 have the Remark 13.2. At first sight the statement (A) might seem odd since the sample mean X occurs explicitly in the definition of the sample variance S 2 . This remarkable independence of X and S 2 is a unique property that distinguishes normal distribution from all other probability distributions. Example 13.9. Let X1 , X2 , ..., Xn denote a random sample from a normal distribution with variance σ 2 > 0. If the first percentile of the statistics 2 1n is 1.24, where X denotes the sample mean, what is the W = i=1 (Xiσ−X) 2 sample size n? Answer: 1 = P (W ≤ 1.24) 100 ) n + % (Xi − X)2 =P ≤ 1.24 σ2 i=1 # $ S2 = P (n − 1) 2 ≤ 1.24 σ ! 2 " = P χ (n − 1) ≤ 1.24 . Thus from χ2 -table, we get n−1=7 and hence the sample size n is 8. Example 13.10. Let X1 , X2 , ..., X4 be a random sample from a normal distribution with unknown mean and variance equal to 9. Let S 2 = " ! 2 " 14 ! 1 i=1 Xi − X . If P S ≤ k = 0.05, then what is k? 3 Answer: ! " 0.05 = P S 2 ≤ k $ # 2 3S 3 ≤ k =P 9 9 $ # 3 = P χ2 (3) ≤ k . 9 From χ2 -table with 3 degrees of freedom, we get 3 k = 0.35 9 Probability and Mathematical Statistics 365 and thus the constant k is given by k = 3(0.35) = 1.05. 13.2. Laws of Large Numbers In this section, we mainly examine the weak law of large numbers. The weak law of large numbers states that if X1 , X2 , ..., Xn is a random sample of size n from a population X with mean µ, then the sample mean X rarely deviates from the population mean µ when the sample size n is very large. 
In other words, the sample mean X converges in probability to the population mean µ. We begin this section with a result known as Markov inequality which is needed to establish the weak law of large numbers. Theorem 13.8 (Markov Inequality). Suppose X is a nonnegative random variable with mean E(X). Then E(X) t P (X ≥ t) ≤ for all t > 0. Proof: We assume the random variable X is continuous. If X is not continuous, then a proof can be obtained for this case by replacing the integrals with summations in the following proof. Since : ∞ E(X) = xf (x)dx −∞ t = ≥ ≥ : xf (x)dx + −∞ : ∞ ∞ xf (x)dx t xf (x)dx t : =t : ∞ t : tf (x)dx ∞ because x ∈ [t, ∞) f (x)dx t = t P (X ≥ t), we see that P (X ≥ t) ≤ This completes the proof of the theorem. E(X) . t Sequences of Random Variables and Order Statistics 366 In Theorem 4.4 of the chapter 4, Chebychev inequality was treated. Let X be a random variable with mean µ and standard deviation σ. Then Chebychev inequality says that P (|X − µ| < kσ) ≥ 1 − 1 k2 for any nonzero positive constant k. This result can be obtained easily using Theorem 13.8 as follows. By Markov inequality, we have P ((X − µ)2 ≥ t2 ) ≤ E((X − µ)2 ) t2 for all t > 0. Since the events (X − µ)2 ≥ t2 and |X − µ| ≥ t are same, we get E((X − µ)2 ) P ((X − µ)2 ≥ t2 ) = P (|X − µ| ≥ t) ≤ t2 for all t > 0. Hence σ2 P (|X − µ| ≥ t) ≤ 2 . t Letting t = kσ in the above equality, we see that P (|X − µ| ≥ kσ) ≤ 1 . k2 Hence 1 . k2 The last inequality yields the Chebychev inequality 1 − P (|X − µ| < kσ) ≤ P (|X − µ| < kσ) ≥ 1 − 1 . k2 Now we are ready to treat the weak law of large numbers. Theorem 13.9. Let X1 , X2 , ... be a sequence of independent and identically distributed random variables with µ = E(Xi ) and σ 2 = V ar(Xi ) < ∞ for i = 1, 2, ..., ∞. Then lim P (|S n − µ| ≥ ε) = 0 n→∞ for every ε. Here S n denotes X1 +X2 +···+Xn . n Proof: By Theorem 13.2 (or Example 13.7) we have E(S n ) = µ and V ar(S n ) = σ2 . n Probability and Mathematical Statistics 367 By Chebychev’s inequality V ar(S n ) ε2 P (|S n − E(S n )| ≥ ε) ≤ for ε > 0. Hence σ2 . n ε2 Taking the limit as n tends to infinity, we get P (|S n − µ| ≥ ε) ≤ σ2 n→∞ n ε2 lim P (|S n − µ| ≥ ε) ≤ lim n→∞ which yields lim P (|S n − µ| ≥ ε) = 0 n→∞ and the proof of the theorem is now complete. It is possible to prove the weak law of large numbers assuming only E(X) to exist and finite but the proof is more involved. The weak law of large numbers says that the sequence of sample means R∞ S n n=1 from a population X stays close to the population mean E(X) most of the time. Let us consider an experiment that consists of tossing a coin infinitely many times. Let Xi be 1 if the ith toss results in a Head, and 0 otherwise. The weak law of large numbers says that Q Sn = 1 X1 + X2 + · · · + Xn → n 2 as n→∞ (13.0) but it is easy to come up with sequences of tosses for which (13.0) is false: H H H H H H H H H H H H H H T H H T H H T H H T ··· ··· ··· ··· The strong law of large numbers (Theorem 13.11) states that the set of “bad sequences” like the ones given above has probability zero. Note that the assertion of Theorem 13.9 for any ε > 0 can also be written as lim P (|S n − µ| < ε) = 1. n→∞ The type of convergence we saw in the weak law of large numbers is not the type of convergence discussed in calculus. This type of convergence is called convergence in probability and defined as follows. Sequences of Random Variables and Order Statistics 368 Definition 13.1. Suppose X1 , X2 , ... is a sequence of random variables defined on a sample space S. 
The sequence converges in probability to the random variable $X$ if, for any $\varepsilon > 0$,
$$ \lim_{n \to \infty} P\left( |X_n - X| < \varepsilon \right) = 1. $$
In view of the above definition, the weak law of large numbers states that the sample mean $\overline{X}$ converges in probability to the population mean $\mu$.

The following theorem is known as the Bernoulli law of large numbers and is a special case of the weak law of large numbers.

Theorem 13.10. Let $X_1, X_2, ...$ be a sequence of independent and identically distributed Bernoulli random variables with probability of success $p$. Then, for any $\varepsilon > 0$,
$$ \lim_{n \to \infty} P\left( |\overline{S}_n - p| < \varepsilon \right) = 1, $$
where $\overline{S}_n$ denotes $\frac{X_1 + X_2 + \cdots + X_n}{n}$.

The fact that the relative frequency of occurrence of an event $E$ is very likely to be close to its probability $P(E)$ for large $n$ can be derived from the weak law of large numbers. Consider a repeatable random experiment repeated a large number of times independently. Let $X_i = 1$ if $E$ occurs on the $i^{th}$ repetition and $X_i = 0$ if $E$ does not occur on the $i^{th}$ repetition. Then
$$ \mu = E(X_i) = 1 \cdot P(E) + 0 \cdot [1 - P(E)] = P(E) \qquad \text{for } i = 1, 2, 3, ... $$
and
$$ X_1 + X_2 + \cdots + X_n = N(E), $$
where $N(E)$ denotes the number of times $E$ occurs. Hence by the weak law of large numbers, we have
$$ \lim_{n \to \infty} P\left( \left| \frac{N(E)}{n} - P(E) \right| \ge \varepsilon \right) = \lim_{n \to \infty} P\left( \left| \frac{X_1 + X_2 + \cdots + X_n}{n} - \mu \right| \ge \varepsilon \right) = \lim_{n \to \infty} P\left( \left| \overline{S}_n - \mu \right| \ge \varepsilon \right) = 0. $$
Hence, for large $n$, the relative frequency of occurrence of the event $E$ is very likely to be close to its probability $P(E)$.

Now we present the strong law of large numbers without a proof.

Theorem 13.11. Let $X_1, X_2, ...$ be a sequence of independent and identically distributed random variables with $\mu = E(X_i)$ and $\sigma^2 = Var(X_i) < \infty$ for $i = 1, 2, ..., \infty$. Then
$$ P\left( \lim_{n \to \infty} \overline{S}_n = \mu \right) = 1. $$
Here $\overline{S}_n$ denotes $\frac{X_1 + X_2 + \cdots + X_n}{n}$.

The type of convergence in Theorem 13.11 is called almost sure convergence. The notion of almost sure convergence is defined as follows.

Definition 13.2. Suppose the random variable $X$ and the sequence $X_1, X_2, ...$ of random variables are defined on a sample space $S$. The sequence $X_n(w)$ converges almost surely to $X(w)$ if
$$ P\left( \left\{ w \in S \;\Big|\; \lim_{n \to \infty} X_n(w) = X(w) \right\} \right) = 1. $$
It can be shown that almost sure convergence implies convergence in probability, but the converse does not hold.

13.3. The Central Limit Theorem

Consider a random sample of measurements $\{X_i\}_{i=1}^n$. The $X_i$'s are identically distributed and their common distribution is the distribution of the population. We have seen that if the population distribution is normal, then the sample mean $\overline{X}$ is also normal. More precisely, if $X_1, X_2, ..., X_n$ is a random sample from a normal distribution with density
$$ f(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}, $$
then
$$ \overline{X} \sim N\left( \mu, \frac{\sigma^2}{n} \right). $$
The central limit theorem (also known as the Lindeberg-Levy Theorem) states that even though the population distribution may be far from being normal, still for large sample size $n$ the distribution of the standardized sample mean is approximately standard normal, with better approximations obtained with the larger sample size. Mathematically this can be stated as follows.

Theorem 13.12 (Central Limit Theorem). Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a distribution with mean $\mu$ and variance $\sigma^2 < \infty$. Then the limiting distribution of
$$ Z_n = \frac{\overline{X} - \mu}{\frac{\sigma}{\sqrt{n}}} $$
is standard normal, that is, $Z_n$ converges in distribution to $Z$, where $Z$ denotes a standard normal random variable.
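A quick simulation makes Theorem 13.12 concrete. In the following Python sketch (an informal aside, assuming NumPy; the exponential population, the sample size and the cut-off points are arbitrary choices), many samples of size n are drawn from an exponential population with mean 1, the standardized sample mean Zn is formed for each sample, and the observed proportion of Zn values below several cut-offs is compared with the standard normal cdf.

import math
import numpy as np

rng = np.random.default_rng(3)
n, reps = 40, 100_000
mu = sigma = 1.0          # an exponential(1) population has mean 1, variance 1

# reps independent samples of size n from the exponential population.
samples = rng.exponential(1.0, size=(reps, n))
z = (samples.mean(axis=1) - mu) / (sigma / math.sqrt(n))

def Phi(t):
    # Standard normal cdf via the error function.
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

for t in (-1.645, 0.0, 1.0, 1.96):
    print("P(Zn <=", t, "):  simulated", round(float(np.mean(z <= t)), 4),
          "  Phi =", round(Phi(t), 4))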
The type of convergence used in the central limit theorem is called the convergence in distribution and is defined as follows. Definition 13.3. Suppose X is a random variable with cumulative density function F (x) and the sequence X1 , X2 , ... of random variables with cumulative density functions F1 (x), F2 (x), ..., respectively. The sequence Xn converges in distribution to X if lim Fn (x) = F (x) n→∞ for all values x at which F (x) is continuous. The distribution of X is called the limiting distribution of Xn . Whenever a sequence of random variables X1 , X2 , ... converges in distrid bution to the random variable X, it will be denoted by Xn → X. Example 13.11. Let Y = X1 + X2 + · · · + X15 be the sum of a random sample of size 15 from the distribution whose density function is f (x) = &3 2 x2 if −1 < x < 1 0 otherwise. What is the approximate value of P (−0.3 ≤ Y ≤ 1.5) when one uses the central limit theorem? Answer: First, we find the mean µ and variance σ 2 for the density function f (x). The mean for this distribution is given by : 1 3 3 x dx 2 −1 2 31 3 x4 = 2 4 −1 µ= = 0. Probability and Mathematical Statistics 371 Hence the variance of this distribution is given by 2 V ar(X) = E(X 2 ) − [ E(X) ] : 1 3 4 x dx = −1 2 2 31 3 x5 = 2 5 −1 3 = 5 = 0.6. P (−0.3 ≤ Y ≤ 1.5) = P (−0.3 − 0 ≤ Y − 0 ≤ 1.5 − 0) ) + −0.3 Y −0 1.5 =P I ≤I ≤I 15(0.6) 15(0.6) 15(0.6) = P (−0.10 ≤ Z ≤ 0.50) = P (Z ≤ 0.50) + P (Z ≤ 0.10) − 1 = 0.6915 + 0.5398 − 1 = 0.2313. Example 13.12. Let X1 , X2 , ..., Xn be a random sample of size n = 25 from a population that has a mean µ = 71.43 and variance σ 2 = 56.25. Let X be the sample mean. What is the probability that the sample mean is between 68.91 and 71.97? ! " Answer: The mean of X is given by E X = 71.43. The variance of X is given by ! " σ2 56.25 V ar X = = = 2.25. n 25 In order to find the probability that the sample mean is between 68.91 and 71.97, we need the distribution of the population. However, the population distribution is unknown. Therefore, we use the central limit theorem. The ∼ N (0, 1) as n approaches infinity. central limit theorem says that X−µ σ √ n Therefore " ! P 68.91 ≤ X ≤ 71.97 # $ 68.91 − 71.43 71.97 − 71.43 X − 71.43 √ √ = ≤ √ ≤ 2.25 2.25 2.25 = P (−0.68 ≤ W ≤ 0.36) = P (W ≤ 0.36) + P (W ≤ 0.68) − 1 = 0.5941. Sequences of Random Variables and Order Statistics 372 Example 13.13. Light bulbs are installed successively into a socket. If we assume that each light bulb has a mean life of 2 months with a standard deviation of 0.25 months, what is the probability that 40 bulbs last at least 7 years? Answer: Let Xi denote the life time of the ith bulb installed. The 40 light bulbs last a total time of S40 = X1 + X2 + · · · + X40 . By the central limit theorem 140 i=1 Xi − nµ √ ∼ N (0, 1) nσ 2 Thus That is Therefore as n → ∞. S − (40)(2) I40 ∼ N (0, 1). (40)(0.25)2 S40 − 80 ∼ N (0, 1). 1.581 P (S40 ≥ 7(12)) $ # S40 − 80 84 − 80 ≥ =P 1.581 1.581 = P (Z ≥ 2.530) = 0.0057. Example 13.14. Light bulbs are installed into a socket. Assume that each has a mean life of 2 months with standard deviation of 0.25 month. How many bulbs n should be bought so that one can be 95% sure that the supply of n bulbs will last 5 years? Answer: Let Xi denote the life time of the ith bulb installed. The n light bulbs last a total time of Sn = X 1 + X 2 + · · · + X n . The total average life span Sn has E (Sn ) = 2n and V ar(Sn ) = n . 16 Probability and Mathematical Statistics 373 By the central limit theorem, we get Sn − E (Sn ) √ n 4 ∼ N (0, 1). 
Thus, we seek n such that 0.95 = P (Sn ≥ 60) + ) 60 − 2n Sn − 2n √ ≥ √n =P n $4 240 − 8n √ =P Z≥ n # $ 240 − 8n √ =1−P Z ≤ . n # 4 From the standard normal table, we get 240 − 8n √ = −1.645 n which implies √ 1.645 n + 8n − 240 = 0. √ Solving this quadratic equation for n, we get √ n = −5.375 or 5.581. Thus n = 31.15. So we should buy 32 bulbs. Example 13.15. American Airlines claims that the average number of people who pay for in-flight movies, when the plane is fully loaded, is 42 with a standard deviation of 8. A sample of 36 fully loaded planes is taken. What is the probability that fewer than 38 people paid for the in-flight movies? Answer: Here, we like to find P (X < 38). Since, we do not know the distribution of X, we will use the central limit theorem. We are given that the population mean is µ = 42 and population standard deviation is σ = 8. Moreover, we are dealing with sample of size n = 36. Thus # $ X − 42 38 − 42 P (X < 38) = P < 8 8 6 = P (Z < −3) = 1 − P (Z < 3) = 1 − 0.9987 = 0.0013. 6 Sequences of Random Variables and Order Statistics 374 Since we have not yet seen the proof of the central limit theorem, first let us go through some examples to see the main idea behind the proof of the central limit theorem. Later, at the end of this section a proof of the central limit theorem will be given. We know from the central limit theorem that if X1 , X2 , ..., Xn is a random sample of size n from a distribution with mean µ and variance σ 2 , then X −µ √σ n d → Z ∼ N (0, 1) as n → ∞. However, the above expression is not equivalent to # $ σ2 d X → Z ∼ N µ, as n→∞ n as the following example shows. Example 13.16. Let X1 , X2 , ..., Xn be a random sample of size n from a gamma distribution with parameters θ = 1 and α = 1. What is the distribution of the sample mean X? Also, what is the limiting distribution of X as n → ∞? Answer: Since, each Xi ∼ GAM (1, 1), the probability density function of each Xi is given by & −x e if x ≥ 0 f (x) = 0 otherwise and hence the moment generating function of each Xi is MXi (t) = 1 . 1−t First we determine the moment generating function of the sample mean X, and then examine this moment generating function to find the probability distribution of X. Since (t) M (t) = M 1 1n X n i=1 Xi # $ t = MXi n i=1 n P = n P i=1 =! ! 1 " 1 − nt 1 "n , 1 − nt Probability and Mathematical Statistics therefore X ∼ GAM !1 n, 375 " n . Next, we find the limiting distribution of X as n → ∞. This can be done again by finding the limiting moment generating function of X and identifying the distribution of X. Consider 1 "n 1 − nt 1 "n ! = limn→∞ 1 − nt 1 = −t e = et . lim MX (t) = lim ! n→∞ n→∞ Thus, the sample mean X has a degenerate distribution, that is all the probability mass is concentrated at one point of the space of X. Example 13.17. Let X1 , X2 , ..., Xn be a random sample of size n from a gamma distribution with parameters θ = 1 and α = 1. What is the distribution of X −µ as n→∞ σ √ n where µ and σ are the population mean and variance, respectively? Answer: From Example 13.7, we know that MX (t) = ! 1 "n . 1 − nt Since the population distribution is gamma with θ = 1 and α = 1, the population mean µ is 1 and population variance σ 2 is also 1. Therefore M X−1 (t) = M√nX−√n (t) √1 n √ nt MX √ nt , = e− = e− = √ e nt , !√ nt 1 1− √ nt n 1 1− √t n " -n -n . Sequences of Random Variables and Order Statistics 376 The limiting moment generating function can be obtained by taking the limit of the above expression as n tends to infinity. 
That is, lim M X−1 (t) = lim n→∞ n→∞ √1 n √ e 1 2 = e2t = X −µ √σ n nt , 1 1− √t n -n (using MAPLE) ∼ N (0, 1). The following theorem is used to prove the central limit theorem. Theorem 13.13 (Lévy Continuity Theorem). Let X1 , X2 , ... be a sequence of random variables with distribution functions F1 (x), F2 (x), ... and moment generating functions MX1 (t), MX2 (t), ..., respectively. Let X be a random variable with distribution function F (x) and moment generating function MX (t). If for all t in the open interval (−h, h) for some h > 0 lim MXn (t) = MX (t), n→∞ then at the points of continuity of F (x) lim Fn (x) = F (x). n→∞ The proof of this theorem is beyond the scope of this book. The following limit 2 3n t d(n) lim 1 + + = et , n→∞ n n if lim d(n) = 0, n→∞ (13.1) whose proof we leave it to the reader, can be established using advanced calculus. Here t is independent of n. Now we proceed to prove the central limit theorem assuming that the moment generating function of the population X exists. Let MX−µ (t) be the moment generating function of the random variable X − µ. We denote MX−µ (t) as M (t) when there is no danger of confusion. Then M (0) = 1,    M # (0) = E(X − µ) = E(X) − µ = µ − µ = 0,  ! "  M ## (0) = E (X − µ)2 = σ 2 . (13.2) Probability and Mathematical Statistics 377 By Taylor series expansion of M (t) about 0, we get M (t) = M (0) + M # (0) t + 1 ## M (η) t2 2 where η ∈ (0, t). Hence using (13.2), we have 1 ## M (η) t2 2 1 1 1 = 1 + σ 2 t2 + M ## (η) t2 − σ 2 t2 2 2 2 < 1 2 2 1 ; ## =1+ σ t + M (η) − σ 2 t2 . 2 2 M (t) = 1 + Now using M (t) we compute the moment generating function of Zn . Note that n 1 % X −µ (Xi − µ). = √ Zn = σ √ σ n i=1 n Hence MZn (t) = = n P i=1 n P i=1 2 for 0 < |η| < lim n→∞ σ σ 1 √ MXi −µ MX−µ # # σ # t √ $ n $ t √ σ n $ 3n t √ = M σ n 3n 2 (M ## (η) − σ 2 ) t2 t2 + = 1+ 2n 2 n σ2 n |t|. Note that since 0 < |η| < t √ = 0, n lim η = 0, n→∞ and σ 1 √ n |t|, we have lim M ## (η) − σ 2 = 0. n→∞ (13.3) Letting (M ## (η) − σ 2 ) t2 2 σ2 and using (13.3), we see that lim d(n) = 0, and d(n) = n→∞ 2 3n d(n) t2 . MZn (t) = 1 + + 2n n (13.4) Using (13.1) we have lim MZn (t) = lim n→∞ n→∞ 2 d(n) t2 + 1+ 2n n 3n 1 2 = e2 t . Sequences of Random Variables and Order Statistics 378 Hence by the Lévy continuity theorem, we obtain lim Fn (x) = Φ(x) n→∞ where Φ(x) is the cumulative density function of the standard normal distrid bution. Thus Zn → Z and the proof of the theorem is now complete. Now we give another proof of the central limit theorem using L’Hospital rule. This proof is essentially due to Tardiff (1981). A , - Bn t √ where M (t) is (t) = M . Then M As before, let Zn = X−µ σ Z n √ σ n n the moment generating function of the random variable X − µ. Hence from (13.2), we have M (0) = 1, M # (0) = 0, and M ## (0) = σ 2 . Now applying the L’Hospital rule twice we compute lim MZn (t) 2 # $ 3n t √ = lim M n→∞ σ n # # # $$$ t √ = lim exp n ln M n→∞ σ n --  , ,  t ln M σ √ n  = lim exp  1 n→∞ # $ 0 form since M(0) = 1 n→∞ 0 n  , , -−1 -, - t t 1 # √ √ √ M M − σ n σ n n n   t # = lim exp   (L Hospital rule) n→∞ 2σ − n12   t M = lim exp  n→∞ 2σ  2  t = lim exp  2 n→∞ 2σ ) 2 , M σ t √ n -−1 M# √1 n , σ ## t √ n - M ## , , σ M # σ t √ t √ , n - t √ σ 2 t M (0) M (0) − {M (0)} 2 2 σ2 M (0) # 2 $ < t ; = exp 1 · σ 2 − 02 2 σ2 $ # 1 2 t . 
= exp 2 = exp n -   # $ 0 form since M# (0) = 0 0 J -S2  , t − M# σ √ n   -2 n + Probability and Mathematical Statistics 379 Hence by the Lévy continuity theorem, we obtain lim Fn (x) = Φ(x) n→∞ where Φ(x) is the cumulative density function of the standard normal distrid bution. Thus as n → ∞, the random variable Zn → Z, where Z ∼ N (0, 1). Remark 13.3. In contrast to the moment generating function, since the characteristic function of a random variable always exists, the original proof of the central limit theorem involved the characteristic function (see for example An Introduction to Probability Theory and Its Applications, Volume II by Feller). In 1988, Brown gave an elementary proof using very clever Taylor series expansions, where the use of the characteristic function has been avoided. 13.4. Order Statistics Often, sample values such as the smallest, largest, or middle observation from a random sample provide important information. For example, the highest flood water or lowest winter temperature recorded during the last 50 years might be useful when planning for future emergencies. The median price of houses sold during the previous month might be useful for estimating the cost of living. The statistics highest, lowest or median are examples of order statistics. Definition 13.4. Let X1 , X2 , ..., Xn be observations from a random sample of size n from a distribution f (x). Let X(1) denote the smallest of {X1 , X2 , ..., Xn }, X(2) denote the second smallest of {X1 , X2 , ..., Xn }, and similarly X(r) denote the rth smallest of {X1 , X2 , ..., Xn }. Then the random variables X(1) , X(2) , ..., X(n) are called the order statistics of the sample X1 , X2 , ..., Xn . In particular, X(r) is called the rth -order statistic of X1 , X2 , ..., Xn . The sample range, R, is the distance between the smallest and the largest observation. That is, R = X(n) − X(1) . This is an important statistic which is defined using order statistics. The distribution of the order statistics are very important when one uses these in any statistical investigation. The next theorem gives the distribution of an order statistic. Sequences of Random Variables and Order Statistics 380 Theorem 13.14. Let X1 , X2 , ..., Xn be a random sample of size n from a distribution with density function f (x). Then the probability density function of the rth order statistic, X(r) , is g(x) = n! r−1 n−r f (x) [1 − F (x)] , [F (x)] (r − 1)! (n − r)! where F (x) denotes the cdf of f (x). Proof: We prove the theorem assuming f (x) continuous. In the case f (x) is discrete the proof has to be modified appropriately. Let h be a positive real number and x be an arbitrary point in the domain of f . Let us divide the real line into three segments, namely R I = (−∞, x) * [x, x + h) * [x + h, ∞). The probability, say p1 , of a sample value falls into the first interval (−∞, x] and is given by : x p1 = f (t) dt = F (x). −∞ Similarly, the probability p2 of a sample value falls into the second interval [x, x + h) is : x+h f (t) dt = F (x + h) − F (x). p2 = x In the same token, we can compute the probability p3 of a sample value which falls into the third interval p3 = : ∞ x+h f (t) dt = 1 − F (x + h). Then the probability, Ph (x), that (r−1) sample values fall in the first interval, one falls in the second interval, and (n − r) fall in the third interval is Ph (x) = # $ n n! pr−1 p2 pn−r . pr−1 p12 p3n−r = 3 (r − 1)! (n − r)! 
1 r − 1, 1, n − r 1 Probability and Mathematical Statistics 381 Hence the probability density function g(x) of the rth statistics is given by g(x) Ph (x) h→0 h 2 3 n! p2 n−r = lim pr−1 p 1 h→0 (r − 1)! (n − r)! h 3 F (x + h) − F (x) n! r−1 n−r = [F (x)] lim lim [1 − F (x + h)] h→0 h→0 (r − 1)! (n − r)! h n! r−1 n−r = [F (x)] F # (x) [1 − F (x)] (r − 1)! (n − r)! n! r−1 n−r = [F (x)] f (x) [1 − F (x)] . (r − 1)! (n − r)! = lim Example 13.18. Let X1 , X2 be a random sample from a distribution with density function & −x e for 0 ≤ x < ∞ f (x) = 0 otherwise. What is the density function of Y = min{X1 , X2 } where nonzero? Answer: The cumulative distribution function of f (x) is : x e−t dt F (x) = 0 = 1 − e−x In this example, n = 2 and r = 1. Hence, the density of Y is 2! 0 [F (y)] f (y) [1 − F (y)] 0! 1! = 2f (y) [1 − F (y)] ! " = 2 e−y 1 − 1 + e−y g(y) = = 2 e−2y . Example 13.19. Let Y1 < Y2 < · · · < Y6 be the order statistics from a random sample of size 6 from a distribution with density function & 2x for 0 < x < 1 f (x) = 0 otherwise. Sequences of Random Variables and Order Statistics 382 What is the expected value of Y6 ? Answer: f (x) = 2x : x F (x) = 2t dt 0 = x2 . The density function of Y6 is given by 6! 5 [F (y)] f (y) 5! 0! ! "5 = 6 y 2 2y g(y) = = 12y 11 . Hence, the expected value of Y6 is E (Y6 ) = : 1 y g(y) dy 0 = : 1 y 12y 11 dy 0 12 ; 13 <1 y 0 13 12 . = 13 = Example 13.20. Let X, Y and Z be independent uniform random variables on the interval (0, a). Let W = min{X, Y, Z}. What is the expected value of "2 ! 1− W ? a Answer: The probability distribution of X (or Y or Z) is f (x) = & 1 a if 0 < x < a 0 otherwise. Thus the cumulative distribution of function of f (x) is given by F (x) =  0     x a     1 if x ≤ 0 if 0 < x < a if x ≥ a. Probability and Mathematical Statistics 383 Since W = min{X, Y, Z}, W is the first order statistic of the random sample X, Y, Z. Thus, the density function of W is given by 3! 0 2 [F (w)] f (w) [1 − F (w)] 0! 1! 2! 2 = 3f (w) [1 − F (w)] $ # , w -2 1 =3 1− a a , 2 3 w = . 1− a a g(w) = Thus, the pdf of W is given by g(w) =    3 a 0 ! 1− " w 2 a if 0 < w < a otherwise. The expected value of W is E /# $2 0 W 1− a : a, w -2 = 1− g(w) dw a :0 a , w -2 w -2 3 , = dw 1− 1− a a a 0 : a , w -4 3 = dw 1− a 0 a 3a 2 3 , w -5 =− 1− 5 a 0 3 = . 5 Example 13.21. Let X1 , X2 , ..., Xn be a random sample from a population X with uniform distribution on the interval [0, 1]. What is the probability distribution of the sample range W := X(n) − X(1) ? Answer: To find the distribution of W , we need the joint distribution of the ! " ! " random variable X(n) , X(1) . The joint distribution of X(n) , X(1) is given by n−2 h(x1 , xn ) = n(n − 1)f (x1 )f (xn ) [F (xn ) − F (x1 )] , Sequences of Random Variables and Order Statistics 384 where xn ≥ x1 and f (x) is the probability density function of X. To determine the probability distribution of the sample range W , we consider the transformation ' U = X(1) W = X(n) − X(1) which has an inverse X(1) = U X(n) = U + W. ' The Jacobian of this transformation is # $ 1 0 J = det = 1. 1 1 Hence the joint density of (U, W ) is given by g(u, w) = |J| h(x1 , xn ) = n(n − 1)f (u)f (u + w)[F (u + w) − F (u)]n−2 where w ≥ 0. Since f (u) and f (u+w) are simultaneously nonzero if 0 ≤ u ≤ 1 and 0 ≤ u + w ≤ 1. Hence f (u) and f (u + w) are simultaneously nonzero if 0 ≤ u ≤ 1 − w. 
Thus, the probability of W is given by : ∞ g(u, w) du j(w) = −∞ : ∞ = n(n − 1)f (u)f (u + w)[F (u + w) − F (u)]n−2 du −∞ = n(n − 1) w n−2 : 1−w du 0 = n(n − 1) (1 − w) wn−2 where 0 ≤ w ≤ 1. 13.5. Sample Percentiles The sample median, M , is a number such that approximately one-half of the observations are less than M and one-half are greater than M . Definition 13.5. Let X1 , X2 , ..., Xn be a random sample. The sample median M is defined as  if n is odd  X( n+1 2 ) A B M=  1 X n + X n+2 if n is even. 2 (2) ( 2 ) Probability and Mathematical Statistics 385 The median is a measure of location like sample mean. Recall that for continuous distribution, 100pth percentile, πp , is a number such that : πp f (x) dx. p= −∞ Definition 13.6. The 100pth sample percentile is defined as X ([np])     πp = M     X(n+1−[n(1−p)]) if p < 0.5 if p = 0.5 if p > 0.5. where [b] denote the number b rounded to the nearest integer. Example 13.22. Let X1 , X2 , ..., X12 be a random sample of size 12. What is the 65th percentile of this sample? Answer: 100p = 65 p = 0.65 n(1 − p) = (12)(1 − 0.65) = 4.2 [n(1 − p)] = [4.2] = 4 Hence by definition of 65th percentile is π0.65 = X(n+1−[n(1−p)]) = X(13−4) = X(9) . Thus, the 65th percentile of the random sample X1 , X2 , ..., X12 is the 9th order statistic. For any number p between 0 and 1, the 100pth sample percentile is an observation such that approximately np observations are less than this observation and n(1 − p) observations are greater than this. Definition 13.7. The 25th percentile is called the lower quartile while the 75th percentile is called the upper quartile. The distance between these two quartiles is called the interquartile range. Sequences of Random Variables and Order Statistics 386 Example 13.23. If a sample of size 3 from a uniform distribution over [0, 1] is observed, what is the probability that the sample median is between 14 and 3 4? Answer: When a sample of (2n + 1) random variables are observed, the (n + 1)th smallest random variable is called the sample median. For our problem, the sample median is given by X(2) = 2nd smallest {X1 , X2 , X3 }. Let Y = X(2) . The density function of each Xi is given by f (x) = & 1 if 0 ≤ x ≤ 1 0 otherwise. Hence, the cumulative density function of f (x) is F (x) = x. Thus the density function of Y is given by 3! 2−1 3−2 [F (y)] f (y) [1 − F (y)] 1! 1! = 6 F (y) f (y) [1 − F (y)] g(y) = = 6y (1 − y). Therefore P # 1 3 <Y < 4 4 $ = = : : 3 4 g(y) dy 1 4 3 4 1 4 2 6 y (1 − y) dy y2 y3 =6 − 2 3 11 = . 16 3 34 1 4 13.6. Review Exercises 1. Suppose we roll a die 1000 times. What is the probability that the sum of the numbers obtained lies between 3000 and 4000? Probability and Mathematical Statistics 387 2. Suppose Kathy flip a coin 1000 times. What is the probability she will get at least 600 heads? 3. At a certain large university the weight of the male students and female students are approximately normally distributed with means and standard deviations of 180, and 20, and 130 and 15, respectively. If a male and female are selected at random, what is the probability that the sum of their weights is less than 280? 4. Seven observations are drawn from a population with an unknown continuous distribution. What is the probability that the least and the greatest observations bracket the median? 5. If the random variable X has the density function f (x) =   2 (1 − x)  for 0 ≤ x ≤ 1 0 otherwise, what is the probability that the larger of 2 independent observations of X will exceed 12 ? 6. 
Let X1 , X2 , X3 be a random sample from the uniform distribution on the interval (0, 1). What is the probability that the sample median is less than 0.4? 7. Let X1 , X2 , X3 , X4 , X5 be a random sample from the uniform distribution on the interval (0, θ), where θ is unknown, and let Xmax denote the largest observation. For what value of the constant k, the expected value of the random variable kXmax is equal to θ? 8. A random sample of size 16 is to be taken from a normal population having mean 100 and variance 4. What is the 90th percentile of the distribution of the sample mean? 9. If the density function of a random variable X is given by f (x) =  1  2x  0 for 1 e <x<e otherwise, what is the probability that one of the two independent observations of X is less than 2 and the other is greater than 1? Sequences of Random Variables and Order Statistics 388 10. Five observations have been drawn independently and at random from a continuous distribution. What is the probability that the next observation will be less than all of the first 5? 11. Let the random variable X denote the length of time it takes to complete a mathematics assignment. Suppose the density function of X is given by f (x) =   e−(x−θ)  0 for θ < x < ∞ otherwise, where θ is a positive constant that represents the minimum time to complete a mathematics assignment. If X1 , X2 , ..., X5 is a random sample from this distribution. What is the expected value of X(1) ? 12. Let X and Y be two independent random variables with identical probability density function given by f (x) = & e−x for x > 0 0 elsewhere. What is the probability density function of W = max{X, Y } ? 13. Let X and Y be two independent random variables with identical probability density function given by f (x) =  2  3θx3  0 for 0 ≤ x ≤ θ elsewhere, for some θ > 0. What is the probability density function of W = min{X, Y }? 14. Let X1 , X2 , ..., Xn be a random sample from a uniform distribution on the interval from 0 to 5. What is the limiting moment generating function of X−µ as n → ∞? σ √ n 15. Let X1 , X2 , ..., Xn be a random sample of size n from a normal distribution with mean µ and variance 1. If the 75th percentile of the statistic "2 1n ! W = i=1 Xi − X is 28.24, what is the sample size n ? 16. Let X1 , X2 , ..., Xn be a random sample of size n from a Bernoulli distribution with probability of success p = 21 . What is the limiting distribution the sample mean X ? Probability and Mathematical Statistics 389 17. Let X1 , X2 , ..., X1995 be a random sample of size 1995 from a distribution with probability density function f (x) = e−λ λx x! x = 0, 1, 2, 3, ..., ∞. What is the distribution of 1995X ? 18. Suppose X1 , X2 , ..., Xn is a random sample from the uniform distribution on (0, 1) and Z be the sample range. What is the probability that Z is less than or equal to 0.5? 19. Let X1 , X2 , ..., X9 be a random sample from a uniform distribution on the interval [1, 12]. Find the probability that the next to smallest is greater than or equal to 4? 20. A machine needs 4 out of its 6 independent components to operate. Let X1 , X2 , ..., X6 be the lifetime of the respective components. Suppose each is exponentially distributed with parameter θ. What is the probability density function of the machine lifetime? 21. Suppose X1 , X2 , ..., X2n+1 is a random sample from the uniform distribution on (0, 1). What is the probability density function of the sample median X(n+1) ? 22. 
Let X and Y be two random variables with joint density J 12x if 0 < y < 2x < 1 f (x, y) = 0 otherwise. What is the expected value of the random variable Z = X 2 Y 3 +X 2 −X −Y 3 ? 23. Let X1 , X2 , ..., X50 be a random sample of size 50 from a distribution with density & 1 α−1 − x for 0 < x < ∞ e θ Γ(α) θ α x f (x) = 0 otherwise. What are the mean and variance of the sample mean X? 24. Let X1 , X2 , ..., X100 be a random sample of size 100 from a distribution with density 9 −λ x e λ for x = 0, 1, 2, ..., ∞ f (x) = x! 0 otherwise. What is the probability that X greater than or equal to 1? Sequences of Random Variables and Order Statistics 390 Probability and Mathematical Statistics 391 Chapter 14 SAMPLING DISTRIBUTIONS ASSOCIATED WITH THE NORMAL POPULATIONS Given a random sample X1 , X2 , ..., Xn from a population X with probability distribution f (x; θ), where θ is a parameter, a statistic is a function T of X1 , X2 , ..., Xn , that is T = T (X1 , X2 , ..., Xn ) which is free of the parameter θ. If the distribution of the population is known, then sometimes it is possible to find the probability distribution of the statistic T . The probability distribution of the statistic T is called the sampling distribution of T . The joint distribution of the random variables X1 , X2 , ..., Xn is called the distribution of the sample. The distribution of the sample is the joint density f (x1 , x2 , ..., xn ; θ) = f (x1 ; θ)f (x2 ; θ) · · · f (xn ; θ) = n P f (xi ; θ) i=1 since the random variables X1 , X2 , ..., Xn are independent and identically distributed. Since the normal population is very important in statistics, the sampling distributions associated with the normal population are very important. The most important sampling distributions which are associated with the normal Sampling Distributions Associated with the Normal Population 392 population are the followings: the chi-square distribution, the student’s tdistribution, the F-distribution, and the beta distribution. In this chapter, we only consider the first three distributions, since the last distribution was considered earlier. 14.1. Chi-square distribution In this section, we treat the Chi-square distribution, which is one of the very useful sampling distributions. Definition 14.1. A continuous random variable X is said to have a chisquare distribution with r degrees of freedom if its probability density function is of the form  x r  r1 r x 2 −1 e− 2 if 0 ≤ x < ∞ Γ( 2 ) 2 2 f (x; r) =  0 otherwise, where r > 0. If X has chi-square distribution, then we denote it by writing X ∼ χ2 (r). Recall that a gamma distribution reduces to chi-square distribution if α = 2r and θ = 2. The mean and variance of X are r and 2r, respectively. Thus, chi-square distribution is also a special case of gamma distribution. Further, if r → ∞, then chi-square distribution tends to normal distribution. Example 14.1. If X ∼ GAM (1, 1), then what is the probability density function of the random variable 2X? Answer: We will use the moment generating method to find the distribution of 2X. The moment generating function of a gamma random variable is given by 1 −α M (t) = (1 − θ t) , if t< . θ Probability and Mathematical Statistics 393 Since X ∼ GAM (1, 1), the moment generating function of X is given by MX (t) = 1 , 1−t t < 1. Hence, the moment generating function of 2X is M2X (t) = MX (2t) 1 = 1 − 2t 1 = 2 (1 − 2t) 2 = MGF of χ2 (2). Hence, if X is GAM (1, 1) or is an exponential with parameter 1, then 2X is chi-square with 2 degrees of freedom. Example 14.2. 
If X ∼ χ2 (5), then what is the probability that X is between 1.145 and 12.83? Answer: The probability of X between 1.145 and 12.83 can be calculated from the following: P (1.145 ≤ X ≤ 12.83) = P (X ≤ 12.83) − P (X ≤ 1.145) : 12.83 : 1.145 = f (x) dx f (x) dx − 0 0 = : 12.83 0 Γ 1 !5" 2 = 0.975 − 0.050 2 5 2 x 5 2 −1 x e− 2 dx − 2 : 0 1.145 Γ (from χ table) 1 !5" 2 5 2 5 2 x x 2 −1 e− 2 dx = 0.925. The above integrals are hard to evaluate and thus their values are taken from the chi-square table. Example 14.3. If X ∼ χ2 (7), then what are values of the constants a and b such that P (a < X < b) = 0.95? Answer: Since 0.95 = P (a < X < b) = P (X < b) − P (X < a), we get P (X < b) = 0.95 + P (X < a). Sampling Distributions Associated with the Normal Population 394 We choose a = 1.690, so that P (X < 1.690) = 0.025. From this, we get P (X < b) = 0.95 + 0.025 = 0.975 Thus, from chi-square table, we get b = 16.01. The following theorems were studied earlier in Chapters 6 and 13 and they are very useful in finding the sampling distributions of many statistics. We state these theorems here for the convenience of the reader. -2 , Theorem 14.1. If X ∼ N (µ, σ 2 ), then X−µ ∼ χ2 (1). σ Theorem 14.2. If X ∼ N (µ, σ 2 ) and X1 , X2 , ..., Xn is a random sample from the population X, then $2 n # % Xi − µ σ i=1 ∼ χ2 (n). Theorem 14.3. If X ∼ N (µ, σ 2 ) and X1 , X2 , ..., Xn is a random sample from the population X, then (n − 1) S 2 ∼ χ2 (n − 1). σ2 Theorem 14.4. If X ∼ GAM (θ, α), then 2 X ∼ χ2 (2α). θ Example 14.4. A new component is placed in service and n spares are available. If the times to failure in days are independent exponential variables, that is Xi ∼ EXP (100), how many spares would be needed to be 95% sure of successful operation for at least two years ? Answer: Since Xi ∼ EXP (100), n % i=1 Xi ∼ GAM (100, n). Probability and Mathematical Statistics 395 Hence, by Theorem 14.4, the random variable Y = n 2 % Xi ∼ χ2 (2n). 100 i=1 We have to find the number of spares n such that ) n + % 0.95 = P Xi ≥ 2 years i=1 =P ) n % ) i=1 + Xi ≥ 730 days + n 2 % 2 =P Xi ≥ 730 days 100 i=1 100 ) + n 2 % 730 =P Xi ≥ 100 i=1 50 ! 2 " = P χ (2n) ≥ 14.6 . (from χ2 table) 2n = 25 Hence n = 13 (after rounding up to the next integer). Thus, 13 spares are needed to be 95% sure of successful operation for at least two years. Example 14.5. If X ∼ N (10, 25) and X1 , X2 , ..., X501 is a random sample of size 501 from the population X, then what is the expected value of the sample variance S 2 ? Answer: We will use the Theorem 14.3, to do this problem. By Theorem 14.3, we see that (501 − 1) S 2 ∼ χ2 (500). σ2 Hence, the expected value of S 2 is given by $# $ 3 2# ; 2< 500 25 S2 E S =E 500 25 $ 2# $ 3 # 500 25 E S2 = 500 25 # $ ; < 1 = E χ2 (500) 20 # $ 1 500 = 20 = 25. Sampling Distributions Associated with the Normal Population 396 14.2. Student’s t-distribution Here we treat the Student’s t-distribution, which is also one of the very useful sampling distributions. Definition 14.2. A continuous random variable X is said to have a tdistribution with ν degrees of freedom if its probability density function is of the form " ! Γ ν+1 2 f (x; ν) = −∞ < x < ∞ ν+1 , ! "! √ 2 "( 2 ) π ν Γ ν2 1 + xν where ν > 0. If X has a t-distribution with ν degrees of freedom, then we denote it by writing X ∼ t(ν). The t-distribution was discovered by W.S. Gosset (1876-1936) of England who published his work under the pseudonym of student. Therefore, this distribution is known as Student’s t-distribution. 
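The chi-square probabilities in Examples 14.2, 14.3 and 14.4 above were read from a chi-square table; they can also be checked numerically. A minimal sketch, assuming Python with the scipy library is available (cdf, ppf and sf are scipy's standard distribution methods; everything else is an illustrative choice):

    from scipy.stats import chi2

    # Example 14.2: P(1.145 <= X <= 12.83) for X ~ chi-square(5)
    print(chi2.cdf(12.83, df=5) - chi2.cdf(1.145, df=5))    # roughly 0.925

    # Example 14.3: a and b with P(X < a) = 0.025 and P(X < b) = 0.975 for X ~ chi-square(7)
    print(chi2.ppf(0.025, df=7), chi2.ppf(0.975, df=7))     # roughly 1.690 and 16.01

    # Example 14.4: smallest n with P(chi-square(2n) >= 14.6) >= 0.95
    n = 1
    while chi2.sf(14.6, df=2 * n) < 0.95:
        n += 1
    print(n)                                                # 13 spares

The same cdf/ppf pattern applies to the t and F probabilities used later in this chapter.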
This distribution is a generalization of the Cauchy distribution and the normal distribution. That is, if ν = 1, then the probability density function of X becomes f (x; 1) = 1 π (1 + x2 ) − ∞ < x < ∞, which is the Cauchy distribution. Further, if ν → ∞, then 1 2 1 lim f (x; ν) = √ e− 2 x 2π ν→∞ − ∞ < x < ∞, which is the probability density function of the standard normal distribution. The following figure shows the graph of t-distributions with various degrees of freedom. Example 14.6. If T ∼ t(10), then what is the probability that T is at least 2.228 ? Probability and Mathematical Statistics 397 Answer: The probability that T is at least 2.228 is given by P (T ≥ 2.228) = 1 − P (T < 2.228) = 1 − 0.975 (from t − table) = 0.025. Example 14.7. If T ∼ t(19), then what is the value of the constant c such that P (|T | ≤ c) = 0.95 ? Answer: 0.95 = P (|T | ≤ c) = P (−c ≤ T ≤ c) = P (T ≤ c) − 1 + P (T ≤ c) = 2 P (T ≤ c) − 1. Hence P (T ≤ c) = 0.975. Thus, using the t-table, we get for 19 degrees of freedom c = 2.093. Theorem 14.5. If the random variable X has a t-distribution with ν degrees of freedom, then & 0 if ν ≥ 2 E[X] = DN E if ν = 1 and V ar[X] = & ν ν−2 if ν≥3 DN E if ν = 1, 2 where DNE means does not exist. Theorem 14.6. If Z ∼ N (0, 1) and U ∼ χ2 (ν) and in addition, Z and U are independent, then the random variable W defined by Z W == U ν has a t-distribution with ν degrees of freedom. Sampling Distributions Associated with the Normal Population 398 Theorem 14.7. If X ∼ N (µ, σ 2 ) and X1 , X2 , ..., Xn be a random sample from the population X, then X −µ √S n ∼ t(n − 1). Proof: Since each Xi ∼ N (µ, σ 2 ), # X∼N σ2 µ, n $ . Thus, X −µ √σ n ∼ N (0, 1). Further, from Theorem 14.3 we know that (n − 1) S2 ∼ χ2 (n − 1). σ2 Hence X −µ √S n X−µ == σ √ n (n−1) S 2 (n−1) σ 2 ∼ t(n − 1) (by Theorem 14.6). This completes the proof of the theorem. Example 14.8. Let X1 , X2 , X3 , X4 be a random sample of size 4 from a standard normal distribution. If the statistic W is given by W =I X1 − X2 + X3 X12 + X22 + X32 + X42 then what is the expected value of W ? Answer: Since Xi ∼ N (0, 1), we get X1 − X2 + X3 ∼ N (0, 3) and X1 − X2 + X3 √ ∼ N (0, 1). 3 Further, since Xi ∼ N (0, 1), we have Xi2 ∼ χ2 (1) , Probability and Mathematical Statistics 399 and hence X12 + X22 + X32 + X42 ∼ χ2 (4) Thus, = X1 −X √2 +X3 3 X12 +X22 +X32 +X42 4 = # 2 √ 3 $ W ∼ t(4). Now using the distribution of W , we find the expected value of W . )√ + 2 3 3 2 E √ W E [W ] = 2 3 )√ + 3 = E [t(4)] 2 )√ + 3 0 = 2 = 0. Example 14.9. If X ∼ N (0, 1) and X1 , X2 is random sample of size two from the population X, then what is the 75th percentile of the statistic W = √X1 2 ? X2 Answer: Since each Xi ∼ N (0, 1), we have X1 ∼ N (0, 1) X22 ∼ χ2 (1). Hence X1 W = I 2 ∼ t(1). X2 The 75th percentile a of W is then given by 0.75 = P (W ≤ a) Hence, from the t-table, we get a = 1.0 Hence the 75th percentile of W is 1.0. Example 14.10. Suppose X1 , X2 , ...., Xn is a random sample from a normal 1n distribution with mean µ and variance σ 2 . If X = n1 i=1 Xi and V 2 = Sampling Distributions Associated with the Normal Population 1 n 1n i=1 ! "2 Xi − X , and Xn+1 is an additional observation, what is the value of m so that the statistics m(X−Xn+1 ) V has a t-distribution. Answer: Since Xi ∼ N (µ, σ 2 ) $ # σ2 ⇒ X ∼ N µ, n $ # σ2 ⇒ X − Xn+1 ∼ N µ − µ, + σ2 n # # $ $ n+1 ⇒ X − Xn+1 ∼ N 0, σ2 n ⇒ X − Xn+1 = ∼ N (0, 1) σ n+1 n Now, we establish a relationship between V 2 and S 2 . 
We know that n (n − 1) S 2 = (n − 1) = n % i=1 =n % 1 (Xi − X)2 (n − 1) i=1 (Xi − X)2 ) n 1 % (Xi − X)2 n i=1 + = n V 2. Hence, by Theorem 14.3 nV 2 (n − 1) S 2 = ∼ χ2 (n − 1). 2 σ σ2 Thus 400 )@ n−1 n+1 + X−Xn+1 √ n+1 X − Xn+1 σ = @ n ∼ t(n − 1). nV2 V 2 σ (n−1) Thus by comparison, we get m= @ n−1 . n+1 Probability and Mathematical Statistics 401 14.3. Snedecor’s F -distribution The next sampling distribution to be discussed in this chapter is Snedecor’s F -distribution. This distribution has many applications in mathematical statistics. In the analysis of variance, this distribution is used to develop the technique for testing the equalities of sample means. Definition 14.3. A continuous random variable X is said to have a F distribution with ν1 and ν2 degrees of freedom if its probability density function is of the form  ! ν " ν21 ν1 −1 ν1 +ν2 1  x 2   Γ( 2 ) ν2 if 0 ≤ x < ∞ ν1 +ν2 ! " f (x; ν1 , ν2 ) = Γ( ν21 ) Γ( ν22 ) 1+ νν1 x ( 2 ) 2    0 otherwise, where ν1 , ν2 > 0. If X has a F -distribution with ν1 and ν2 degrees of freedom, then we denote it by writing X ∼ F (ν1 , ν2 ). The F -distribution was named in honor of Sir Ronald Fisher by George Snedecor. F -distribution arises as the distribution of a ratio of variances. Like, the other two distributions this distribution also tends to normal distribution as ν1 and ν2 become very large. The following figure illustrates the shape of the graph of this distribution for various degrees of freedom. The following theorem gives us the mean and variance of Snedecor’s F distribution. Theorem 14.8. If the random variable X ∼ F (ν1 , ν2 ), then & ν2 if ν2 ≥ 3 ν2 −2 E[X] = DN E if ν2 = 1, 2 and V ar[X] =    2 ν22 (ν1 +ν2 −2) ν1 (ν2 −2)2 (ν2 −4) if ν2 ≥ 5 DN E if ν2 = 1, 2, 3, 4. Sampling Distributions Associated with the Normal Population 402 Here DNE means does not exist. Example 14.11. If X ∼ F (9, 10), what P (X ≥ 3.02) ? Also, find the mean and variance of X. Answer: P (X ≥ 3.02) = 1 − P (X ≤ 3.02) = 1 − P (F (9, 10) ≤ 3.02) = 1 − 0.95 (from F − table) = 0.05. Next, we determine the mean and variance of X using the Theorem 14.8. Hence, ν2 10 10 E(X) = = = = 1.25 ν2 − 2 10 − 2 8 and V ar(X) = 2 ν22 (ν1 + ν2 − 2) ν1 (ν2 − 2)2 (ν2 − 4) = 2 (10)2 (19 − 2) 9 (8)2 (6) = (25) (17) (27) (16) = 425 = 0.9838. 432 Theorem 14.9. If X ∼ F (ν1 , ν2 ), then the random variable 1 X ∼ F (ν2 , ν1 ). This theorem is very useful for computing probabilities like P (X ≤ 0.2439). If you look at a F -table, you will notice that the table start with values bigger than 1. Our next example illustrates how to find such probabilities using Theorem 14.9. Example 14.12. If X ∼ F (6, 9), what is the probability that X is less than or equal to 0.2439 ? Probability and Mathematical Statistics 403 Answer: We use the above theorem to compute $ # 1 1 ≥ P (X ≤ 0.2439) = P X 0.2439 # $ 1 = P F (9, 6) ≥ (by Theorem 14.9) 0.2439 $ # 1 = 1 − P F (9, 6) ≤ 0.2439 = 1 − P (F (9, 6) ≤ 4.10) = 1 − 0.95 = 0.05. The following theorem says that F -distribution arises as the distribution of a random variable which is the quotient of two independently distributed chi-square random variables, each of which is divided by its degrees of freedom. Theorem 14.10. If U ∼ χ2 (ν1 ) and V ∼ χ2 (ν2 ), and the random variables U and V are independent, then the random variable U ν1 V ν2 ∼ F (ν1 , ν2 ) . Example 14.13. Let X1 , X2 , ..., X4 and Y1 , Y2 , ..., Y5 be two random samples of size 4 and 5 respectively, from a standard normal population. What is the ! 
" X12 +X22 +X32 +X42 variance of the statistic T = 45 Y 2 +Y 2 +Y 2 +Y 2 +Y 2 ? 1 2 3 4 5 Answer: Since the population is standard normal, we get X12 + X22 + X32 + X42 ∼ χ2 (4). Similarly, Y12 + Y22 + Y32 + Y42 + Y52 ∼ χ2 (5). Thus # $ X12 + X22 + X32 + X42 5 T = 2 4 Y1 + Y22 + Y32 + Y42 + Y52 = X12 +X22 +X32 +X42 4 Y12 +Y22 +Y32 +Y42 +Y52 5 = T ∼ F (4, 5). Sampling Distributions Associated with the Normal Population Therefore 404 V ar(T ) = V ar[ F (4, 5) ] 2 (5)2 (7) 4 (3)2 (1) 350 = 36 = 9.72. = Theorem 14.11. Let X ∼ N (µ1 , σ12 ) and X1 , X2 , ..., Xn be a random sample of size n from the population X. Let Y ∼ N (µ2 , σ22 ) and Y1 , Y2 , ..., Ym be a random sample of size m from the population Y . Then the statistic S12 σ12 S22 σ22 ∼ F (n − 1, m − 1), where S12 and S22 denote the sample variances of the first and the second sample, respectively. Proof: Since, Xi ∼ N (µ1 , σ12 ) we have by Theorem 14.3, we get (n − 1) S12 ∼ χ2 (n − 1). σ12 Similarly, since Yi ∼ N (µ2 , σ22 ) we have by Theorem 14.3, we get (m − 1) Therefore S12 σ12 S22 σ22 = S22 ∼ χ2 (m − 1). σ22 (n−1) S12 (n−1) σ12 (m−1) S22 (m−1) σ22 ∼ F (n − 1, m − 1). This completes the proof of the theorem. Because of this theorem, the F -distribution is also known as the varianceratio distribution. Probability and Mathematical Statistics 405 14.4. Review Exercises 1. Let X1 , X2 , ..., X5 be a random sample of size 5 from a normal distribution with mean zero and standard deviation 2. Find the sampling distribution of the statistic X1 + 2X2 − X3 + X4 + X5 . 2. Let X1 , X2 , X3 be a random sample of size 3 from a standard normal distribution. Find the distribution of X12 + X22 + X32 . If possible, find the sampling distribution of X12 − X22 . If not, justify why you can not determine it’s distribution. 3. Let X1 , X2 , X3 be a random sample of size 3 from a standard normal 3 distribution. Find the sampling distribution of the statistics √X1 2+X2 +X 2 2 X1 +X2 +X3 2 −X3 and √X1 −X . 2 2 2 X1 +X2 +X3 4. Let X1 , X2 , X3 be a random sample of size 3 from an exponential distribution with a parameter θ > 0. Find the distribution of the sample (that is the joint distribution of the random variables X1 , X2 , X3 ). 5. Let X1 , X2 , ..., Xn be a random sample of size n from a normal population with mean µ and variance σ 2 > 0. What is the expected value of the sample "2 1n ! 1 variance S 2 = n−1 i=1 Xi − X̄ ? 6. Let X1 , X2 , X3 , X4 be a random sample of size 4 from a standard normal 4 . population. Find the distribution of the statistic √X1 +X 2 2 X2 +X3 7. Let X1 , X2 , X3 , X4 be a random sample of size 4 from a standard normal population. Find the sampling distribution (if possible) and moment generating function of the statistic 2X12 +3X22 +X32 +4X42 . What is the probability distribution of the sample? 8. Let X equal the maximal oxygen intake of a human on a treadmill, where the measurement are in milliliters of oxygen per minute per kilogram of weight. Assume that for a particular population the mean of X is µ = 54.03 and the standard deviation is σ = 5.8. Let X̄ be the sample mean of a random sample X1 , X2 , ..., X47 of size 47 drawn from X. Find the probability that the sample mean is between 52.761 and 54.453. 9. Let X1 , X2 , ..., Xn be a random sample from a normal distribution with "2 1n ! mean µ and variance σ 2 . What is the variance of V 2 = n1 i=1 Xi − X ? 10. If X is a random variable with mean µ and variance σ 2 , then µ − 2σ is called the lower 2σ point of X. 
Suppose a random sample X1 , X2 , X3 , X4 is Sampling Distributions Associated with the Normal Population 406 drawn from a chi-square distribution with two degrees of freedom. What is the lower 2σ point of X1 + X2 + X3 + X4 ? 11. Let X and Y be independent normal random variables such that the mean and variance of X are 2 and 4, respectively, while the mean and variance of Y are 6 and k, respectively. A sample of size 4 is taken from the X-distribution and a sample of size 9 is taken from the Y -distribution. If " ! P Y − X > 8 = 0.0228, then what is the value of the constant k ? 12. Let X1 , X2 , ..., Xn be a random sample of size n from a distribution with density function & −λx λe if 0 < x < ∞ f (x; λ) = 0 otherwise. 1n What is the distribution of the statistic Y = 2λ i=1 Xi ? 13. Suppose X has a normal distribution with mean 0 and variance 1, Y has a chi-square distribution with n degrees of freedom, W has a chi-square distribution with p degrees of freedom, and W, X, and Y are independent. X ? What is the sampling distribution of the statistic V = I W +Y p+n 14. A random sample X1 , X2 , ..., Xn of size n is selected from a normal population with mean µ and standard deviation 1. Later an additional independent observation Xn+1 is obtained from the same population. What 1n is the distribution of the statistic (Xn+1 − µ)2 + i=1 (Xi − X)2 , where X denote the sample mean? ) , where X, Y , Z, and W are independent normal 15. Let T = √k(X+Y Z 2 +W 2 random variables with mean 0 and variance σ 2 > 0. For exactly one value of k, T has a t-distribution. If r denotes the degrees of freedom of that distribution, then what is the value of the pair (k, r)? 16. Let X and Y be joint normal random variables with common mean 0, common variance 1, and covariance 21 . What is the probability of the event √ " √ " ! ! X + Y ≤ 3 , that is P X + Y ≤ 3 ? 17. Suppose Xj = Zj − Zj−1 , where j = 1, 2, ..., n and Z0 , Z1 , ..., Zn are independent and identically distributed with common variance σ 2 . What is 1n the variance of the random variable n1 j=1 Xj ? 18. A random sample of size 5 is taken from a normal distribution with mean 0 and standard deviation 2. Find the constant k such that 0.05 is equal to the Probability and Mathematical Statistics 407 probability that the sum of the squares of the sample observations exceeds the constant k. 19. Let X1 , X2 , ..., Xn and Y1 , Y2 , ..., Yn be two random sample from the independent normal distributions with V ar[Xi ] = σ 2 and V ar[Yi ] = 2σ 2 , for "2 "2 1n ! 1n ! i = 1, 2, ..., n and σ 2 > 0. If U = i=1 Xi − X and V = i=1 Yi − Y , then what is the sampling distribution of the statistic 2U2σ+V 2 ? 20. Suppose X1 , X2 , ..., X6 and Y1 , Y2 , ..., Y9 are independent, identically distributed normal random variables, each with mean zero and variance σ2 >  0  9 / 6 % % 0. What is the 95th percentile of the statistics W = Xi2 /  Yj2 ? i=1 j=1 21. Let X1 , X2 , ..., X6 and Y1 , Y2 , ..., Y8 be independent random samples from a normal distribution with mean 0 and variance 1, and Z =  / 6 0 8 % % 4 Xi2 / 3 Yj2 ? i=1 j=1 Sampling Distributions Associated with the Normal Population 408 Probability and Mathematical Statistics 409 Chapter 15 SOME TECHNIQUES FOR FINDING POINT ESTIMATORS OF PARAMETERS A statistical population consists of all the measurements of interest in a statistical investigation. Usually a population is described by a random variable X. 
If we can gain some knowledge about the probability density function f (x; θ) of X, then we also gain some knowledge about the population under investigation. A sample is a portion of the population usually chosen by method of random sampling and as such it is a set of random variables X1 , X2 , ..., Xn with the same probability density function f (x; θ) as the population. Once the sampling is done, we get X1 = x1 , X2 = x2 , · · · , Xn = xn where x1 , x2 , ..., xn are the sample data. Every statistical method employs a random sample to gain information about the population. Since the population is characterized by the probability density function f (x; θ), in statistics one makes statistical inferences about the population distribution f (x; θ) based on sample information. A statistical inference is a statement based on sample information about the population. There are three types of statistical inferences (1) estimation (2) Some Techniques for finding Point Estimators of Parameters 410 hypothesis testing and (3) prediction. The goal of this chapter is to examine some well known point estimation methods. In point estimation, we try to find the parameter θ of the population distribution f (x; θ) from the sample information. Thus, in the parametric point estimation one assumes the functional form of the pdf f (x; θ) to be known and only estimate the unknown parameter θ of the population using information available from the sample. Definition 15.1. Let X be a population with the density function f (x; θ), where θ is an unknown parameter. The set of all admissible values of θ is called a parameter space and it is denoted by Ω, that is Ω = {θ ∈ R I n | f (x; θ) is a pdf } for some natural number m. Example 15.1. If X ∼ EXP (θ), then what is the parameter space of θ ? Answer: Since X ∼ EXP (θ), the density function of X is given by f (x; θ) = 1 −x e θ. θ If θ is zero or negative then f (x; θ) is not a density function. Thus, the admissible values of θ are all the positive real numbers. Hence Ω = {θ ∈ R I | 0 < θ < ∞} =R I +. ! " Example 15.2. If X ∼ N µ, σ 2 , what is the parameter space? Answer: The parameter space Ω is given by "R Q ! Ω = θ ∈R I 2 | f (x; θ) ∼ N µ, σ 2 Q R = (µ, σ) ∈ R I 2 | − ∞ < µ < ∞, 0 < σ < ∞ =R I ×R I+ = upper half plane. In general, a parameter space is a subset of R I m . Statistics concerns with the estimation of the unknown parameter θ from a random sample X1 , X2 , ..., Xn . Recall that a statistic is a function of X1 , X2 , ..., Xn and free of the population parameter θ. Probability and Mathematical Statistics 411 Definition 15.2. Let X ∼ f (x; θ) and X1 , X2 , ..., Xn be a random sample from the population X. Any statistic that can be used to guess the parameter θ is called an estimator of θ. The numerical value of this statistic is called Z an estimate of θ. The estimator of the parameter θ is denoted by θ. One of the basic problems is how to find an estimator of population parameter θ. There are several methods for finding an estimator of θ. Some of these methods are: (1) Moment Method (2) Maximum Likelihood Method (3) Bayes Method (4) Least Squares Method (5) Minimum Chi-Squares Method (6) Minimum Distance Method In this chapter, we only discuss the first three methods of estimating a population parameter. 15.1. Moment Method Let X1 , X2 , ..., Xn be a random sample from a population X with probability density function f (x; θ1 , θ2 , ..., θm ), where θ1 , θ2 , ..., θm are m unknown parameters. Let : ∞ ! 
" E Xk = xk f (x; θ1 , θ2 , ..., θm ) dx −∞ be the k th population moment about 0. Further, let Mk = n 1 % k X n i=1 i be the k th sample moment about 0. In moment method, we find the estimator for the parameters θ1 , θ2 , ..., θm by equating the first m population moments (if they exist) to the first m sample moments, that is E (X) = M1 ! " E X 2 = M2 ! " E X 3 = M3 .. . E (X m ) = Mm Some Techniques for finding Point Estimators of Parameters 412 The moment method is one of the classical methods for estimating parameters and motivation comes from the fact that the sample moments are in some sense estimates for the population moments. The moment method was first discovered by British statistician Karl Pearson in 1902. Now we provide some examples to illustrate this method. ! " Example 15.3. Let X ∼ N µ, σ 2 and X1 , X2 , ..., Xn be a random sample of size n from the population X. What are the estimators of the population parameters µ and σ 2 if we use the moment method? Answer: Since the population is normal, that is we know that Hence ! " X ∼ N µ, σ 2 E (X) = µ ! " E X 2 = σ 2 + µ2 . µ = E (X) = M1 n 1 % Xi = n i=1 = X. Therefore, the estimator of the parameter µ is X, that is µ Z = X. Next, we find the estimator of σ 2 equating E(X 2 ) to M2 . Note that σ 2 = σ 2 + µ2 − µ2 ! " = E X 2 − µ2 = M2 − µ2 n 1 % 2 2 X −X = n i=1 i = n "2 1 %! Xi − X . n i=1 The last line follows from the fact that Probability and Mathematical Statistics 413 n n "2 1 %! 1 %, 2 2 Xi − 2 Xi X + X Xi − X = n i=1 n i=1 = = = n n n 1 % 2 1 % 2 1 % X Xi − 2 Xi X + n i=1 n i=1 n i=1 n n 1 % 2 1 % 2 Xi − 2 X Xi + X n i=1 n i=1 n 1 % 2 2 X − 2X X + X n i=1 i n 1 % 2 2 = X −X . n i=1 i Thus, the estimator of σ 2 is 1 n n % ! i=1 "2 Xi − X , that is n % ! "2 [2 = 1 σ Xi − X . n i=1 Example 15.4. Let X1 , X2 , ..., Xn be a random sample of size n from a population X with probability density function   θ xθ−1 if 0 < x < 1 f (x; θ) =  0 otherwise, where 0 < θ < ∞ is an unknown parameter. Using the method of moment find an estimator of θ ? If x1 = 0.2, x2 = 0.6, x3 = 0.5, x4 = 0.3 is a random sample of size 4, then what is the estimate of θ ? Answer: To find an estimator, we shall equate the population moment to the sample moment. The population moment E(X) is given by : 1 E(X) = x f (x; θ) dx 0 = : 1 x θ xθ−1 dx 0 =θ : 1 xθ dx 0 θ ; θ+1 <1 x 0 θ+1 θ . = θ+1 = Some Techniques for finding Point Estimators of Parameters 414 We know that M1 = X. Now setting M1 equal to E(X) and solving for θ, we get θ X= θ+1 that is X θ= , 1−X X is an estimator of the where X is the sample mean. Thus, the statistic 1−X parameter θ. Hence X θZ = . 1−X Since x1 = 0.2, x2 = 0.6, x3 = 0.5, x4 = 0.3, we have X = 0.4 and 2 0.4 = θZ = 1 − 0.4 3 is an estimate of the θ. Example 15.5. What is the basic principle of the moment method? Answer: To choose a value for the unknown population parameter for which the observed data have the same moments as the population. Example 15.6. Suppose X1 , X2 , ..., X7 is a random sample from a population X with density function  −x  x6 e β7 if 0 < x < ∞ Γ(7) β f (x; β) =  0 otherwise. Find an estimator of β by the moment method. Answer: Since, we have only one parameter, we need to compute only the first population moment E(X) about 0. Thus, : ∞ E(X) = x f (x; β) dx 0 : x x6 e− β x = dx Γ(7) β 7 0 : ∞ # $7 x x 1 e− β dx = Γ(7) 0 β : ∞ 1 =β y 7 e−y dy Γ(7) 0 1 =β Γ(8) Γ(7) = 7 β. ∞ Probability and Mathematical Statistics 415 Since M1 = X, equating E(X) to M1 , we get 7β = X that is 1 X. 
7 Therefore, the estimator of β by the moment method is given by β= 1 βZ = X. 7 Example 15.7. Suppose X1 , X2 , ..., Xn is a random sample from a population X with density function f (x; θ) = &1 θ if 0 < x < θ 0 otherwise. Find an estimator of θ by the moment method. Answer: Examining the density function of the population X, we see that X ∼ U N IF (0, θ). Therefore E(X) = θ . 2 Now, equating this population moment to the sample moment, we obtain θ = E(X) = M1 = X. 2 Therefore, the estimator of θ is θZ = 2 X. Example 15.8. Suppose X1 , X2 , ..., Xn is a random sample from a population X with density function f (x; α, β) = & 1 β−α if α < x < β 0 otherwise. Find the estimators of α and β by the moment method. Some Techniques for finding Point Estimators of Parameters 416 Answer: Examining the density function of the population X, we see that X ∼ U N IF (α, β). Since, the distribution has two unknown parameters, we need the first two population moments. Therefore α+β (β − α)2 and E(X 2 ) = + E(X)2 . 2 12 Equating these moments to the corresponding sample moments, we obtain E(X) = α+β = E(X) = M1 = X 2 that is α + β = 2X and which is n 1 % 2 (β − α)2 X + E(X)2 = E(X 2 ) = M2 = 12 n i=1 i / 0 n 1% 2 2 (β − α) = 12 X − E(X) n i=1 i / n 0 1% 2 2 = 12 X −X n i=1 i 0 / n "2 1 %! 2 . Xi − X = 12 n i=1 2 Hence, we get \ ] n ] 12 % ! 2 "2 β−α=^ Xi − X . n i=1 Adding equation (1) to equation (2), we obtain \ ] n ]3 % "2 ! 2 2β = 2X ± 2 ^ Xi − X n i=1 that is (1) \ ] n ]3 % ! 2 "2 β =X ±^ Xi − X . n i=1 Similarly, subtracting (2) from (1), we get \ ] n ]3 % ! 2 "2 Xi − X . α=X ∓^ n i=1 (2) Probability and Mathematical Statistics 417 Since, α < β, we see that the estimators of α and β are \ ] n ]3 % "2 ! 2 Xi − X α Z =X −^ n i=1 and \ ] n ]3 % "2 ! 2 βZ = X + ^ Xi − X . n i=1 15.2. Maximum Likelihood Method The maximum likelihood method was first used by Sir Ronald Fisher in 1912 for finding estimator of a unknown parameter. However, the method originated in the works of Gauss and Bernoulli. Next, we describe the method in detail. Definition 15.3. Let X1 , X2 , ..., Xn be a random sample from a population X with probability density function f (x; θ), where θ is an unknown parameter. The likelihood function, L(θ), is the distribution of the sample. That is n P L(θ) = f (xi ; θ). i=1 This definition says that the likelihood function of a random sample X1 , X2 , ..., Xn is the joint density of the random variables X1 , X2 , ..., Xn . The θ that maximizes the likelihood function L(θ) is called the maximum Z Hence likelihood estimator of θ, and it is denoted by θ. θZ = Arg sup L(θ), θ∈Ω where Ω is the parameter space of θ so that L(θ) is the joint density of the sample. The method of maximum likelihood in a sense picks out of all the possible values of θ the one most likely to have produced the given observations x1 , x2 , ..., xn . The method is summarized below: (1) Obtain a random sample x1 , x2 , ..., xn from the distribution of a population X with probability density function f (x; θ); (2) define the likelihood function for the sample x1 , x2 , ..., xn by L(θ) = f (x1 ; θ)f (x2 ; θ) · · · f (xn ; θ); (3) find the expression for θ that maximizes L(θ). This can be done directly or by maximizing ln L(θ); (4) replace θ by θZ to obtain an expression for the maximum likelihood estimator for θ; (5) find the observed value of this estimator for a given sample. Some Techniques for finding Point Estimators of Parameters 418 Example 15.9. 
If X1 , X2 , ..., Xn is a random sample from a distribution with density function f (x; θ) =   (1 − θ) x−θ  if 0 < x < 1 0 elsewhere, what is the maximum likelihood estimator of θ ? Answer: The likelihood function of the sample is given by L(θ) = n P f (xi ; θ). i=1 Therefore ln L(θ) = ln ) n P + f (xi ; θ) i=1 = = n % i=1 n % i=1 ln f (xi ; θ) ; < ln (1 − θ) xi −θ = n ln(1 − θ) − θ n % ln xi . i=1 Now we maximize ln L(θ) with respect to θ. d d ln L(θ) = dθ dθ ) n ln(1 − θ) − θ n =− Setting this derivative d ln L(θ) dθ % n − ln xi . 1 − θ i=1 n % ln xi i=1 to 0, we get n % n d ln L(θ) ln xi = 0 =− − dθ 1 − θ i=1 that is n 1 1 % ln xi =− 1−θ n i=1 + Probability and Mathematical Statistics or 419 n 1 % 1 ln xi = −ln x. =− 1−θ n i=1 or 1 . ln x This θ can be shown to be maximum by the second derivative test and we leave this verification to the reader. Therefore, the estimator of θ is θ =1+ θZ = 1 + 1 . ln X Example 15.10. If X1 , X2 , ..., Xn is a random sample from a distribution with density function  6 −x  x e β7 if 0 < x < ∞ Γ(7) β f (x; β) =  0 otherwise, then what is the maximum likelihood estimator of β ? Answer: The likelihood function of the sample is given by L(β) = n P f (xi ; β). i=1 Thus, ln L(β) = n % ln f (xi , β) i=1 n % =6 i=1 Therefore Setting this derivative ln xi − n 1 % xi − n ln(6!) − 7n ln(β). β i=1 n d 7n 1 % xi − ln L(β) = 2 . dβ β i=1 β d dβ ln L(β) to zero, we get n 1 % 7n xi − =0 β 2 i=1 β which yields β= n 1 % xi . 7n i=1 Some Techniques for finding Point Estimators of Parameters 420 This β can be shown to be maximum by the second derivative test and again we leave this verification to the reader. Hence, the estimator of β is given by 1 βZ = X. 7 Remark 15.1. Note that this maximum likelihood estimator of β is same as the one found for β using the moment method in Example 15.6. However, in general the estimators by different methods are different as the following example illustrates. Example 15.11. If X1 , X2 , ..., Xn is a random sample from a distribution with density function   θ1 if 0 < x < θ f (x; θ) =  0 otherwise, then what is the maximum likelihood estimator of θ ? Answer: The likelihood function of the sample is given by L(θ) = = n P f (xi ; θ) i=1 n # P i=1 1 θ # $n 1 = θ $ θ > xi (i = 1, 2, 3, ..., n) θ > max{x1 , x2 , ..., xn }. Hence the parameter space of θ with respect to L(θ) is given by Ω = {θ ∈ R I | xmax < θ < ∞} = (xmax , ∞). Now we maximize L(θ) on Ω. First, we compute ln L(θ) and then differentiate it to get ln L(θ) = −n ln(θ) and d n ln L(θ) = − < 0. dθ θ Therefore ln L(θ) is a decreasing function of θ and as such the maximum of ln L(θ) occurs at the left end point of the interval (xmax , ∞). Therefore, at Probability and Mathematical Statistics 421 θ = xmax the likelihood function achieve maximum. Hence the likelihood estimator of θ is given by θZ = X(n) where X(n) denotes the nth order statistic of the given sample. Thus, Example 15.7 and Example 15.11 say that the if we estimate the parameter θ of a distribution with uniform density on the interval (0, θ), then the maximum likelihood estimator is given by θZ = X(n) where as θZ = 2 X is the estimator obtained by the method of moment. Hence, in general these two methods do not provide the same estimator of an unknown parameter. Example 15.12. Let X1 , X2 , ..., Xn be a random sample from a distribution with density function =  2 e− 21 (x−θ)2 if x ≥ θ π f (x; θ) =  0 elsewhere. What is the maximum likelihood estimator of θ ? 
Answer: The likelihood function L(θ) is given by )@ +n n P 1 2 2 L(θ) = xi ≥ θ (i = 1, 2, 3, ..., n). e− 2 (xi −θ) π i=1 Hence the parameter space of θ is given by Ω = {θ ∈ R I | 0 ≤ θ ≤ xmin } = [0, xmin ], , where xmin = min{x1 , x2 , ..., xn }. Now we evaluate the logarithm of the likelihood function. # $ n 2 n 1 % (xi − θ)2 , ln L(θ) = ln − 2 π 2 i=1 where θ is on the interval [0, xmin ]. Now we maximize ln L(θ) subject to the condition 0 ≤ θ ≤ xmin . Taking the derivative, we get n n % d 1% (xi − θ). (xi − θ) 2(−1) = ln L(θ) = − dθ 2 i=1 i=1 Some Techniques for finding Point Estimators of Parameters 422 In this example, if we equate the derivative to zero, then we get θ = x. But this value of θ is not on the parameter space Ω. Thus, θ = x is not the solution. Hence to find the solution of this optimization process, we examine the behavior of the ln L(θ) on the interval [0, xmin ]. Note that n n % 1% d (xi − θ) 2(−1) = (xi − θ) > 0 ln L(θ) = − dθ 2 i=1 i=1 since each xi is bigger than θ. Therefore, the function ln L(θ) is an increasing function on the interval [0, xmin ] and as such it will achieve maximum at the right end point of the interval [0, xmin ]. Therefore, the maximum likelihood estimator of θ is given by Z = X(1) X where X(1) denotes the smallest observation in the random sample X1 , X2 , ..., Xn . Example 15.13. Let X1 , X2 , ..., Xn be a random sample from a normal population with mean µ and variance σ 2 . What are the maximum likelihood estimators of µ and σ 2 ? Answer: Since X ∼ N (µ, σ 2 ), the probability density function of X is given by 1 x−µ 2 1 f (x; µ, σ) = √ e− 2 ( σ ) . σ 2π The likelihood function of the sample is given by L(µ, σ) = n P i=1 σ 1 √ e− 2 ( 1 2π xi −µ σ 2 ) . Hence, the logarithm of this likelihood function is given by ln L(µ, σ) = − n n 1 % ln(2π) − n ln(σ) − 2 (xi − µ)2 . 2 2σ i=1 Taking the partial derivatives of ln L(µ, σ) with respect to µ and σ, we get n n 1 % 1 % ∂ ln L(µ, σ) = − 2 (xi − µ) (−2) = 2 (xi − µ). ∂µ 2σ i=1 σ i=1 and n ∂ n 1 % (xi − µ)2 . ln L(µ, σ) = − + 3 ∂σ σ σ i=1 Probability and Mathematical Statistics ∂ ln L(µ, σ) = 0 and Setting ∂µ µ and σ, we get ∂ ∂σ 423 ln L(µ, σ) = 0, and solving for the unknown µ= n 1 % xi = x. n i=1 Thus the maximum likelihood estimator of µ is µ Z = X. Similarly, we get − n 1 % n (xi − µ)2 = 0 + 3 σ σ i=1 implies n σ2 = 1% (xi − µ)2 . n i=1 Again µ and σ 2 found by the first derivative test can be shown to be maximum using the second derivative test for the functions of two variables. Hence, using the estimator of µ in the above expression, we get the estimator of σ 2 to be n % [2 = 1 (Xi − X)2 . σ n i=1 Example 15.14. Suppose X1 , X2 , ..., Xn is a random sample from a distribution with density function f (x; α, β) = & 1 β−α if α < x < β 0 otherwise. Find the estimators of α and β by the method of maximum likelihood. Answer: The likelihood function of the sample is given by L(α, β) = n P 1 = β − α i=1 # 1 β−α $n for all α ≤ xi for (i = 1, 2, ..., n) and for all β ≥ xi for (i = 1, 2, ..., n). Hence, the domain of the likelihood function is Ω = {(α, β) | 0 < α ≤ x(1) and x(n) ≤ β < ∞}. Some Techniques for finding Point Estimators of Parameters 424 It is easy to see that L(α, β) is maximum if α = x(1) and β = x(n) . Therefore, the maximum likelihood estimator of α and β are α Z = X(1) and βZ = X(n) . The maximum likelihood estimator θZ of a parameter θ has a remarkable property known as the invariance property. 
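Before turning to that property, the closed-form maximum likelihood estimators derived in Examples 15.13 and 15.14 are easy to check numerically. The following sketch is only illustrative: it assumes NumPy and SciPy are available, uses simulated data rather than data from the text, and compares the closed-form formulas with a direct numerical maximization of the log-likelihood.

    # Numerical check of Examples 15.13 and 15.14 (assumes NumPy/SciPy;
    # the simulated data below are illustrative, not from the text).
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)

    # Example 15.13: N(mu, sigma^2). Closed-form MLEs are the sample mean
    # and the average squared deviation (divisor n).
    x = rng.normal(loc=3.0, scale=2.0, size=500)
    mu_hat = x.mean()
    sigma2_hat = np.mean((x - mu_hat) ** 2)

    # Cross-check by maximizing ln L(mu, sigma) numerically.
    def neg_log_lik(params):
        mu, sigma = params
        if sigma <= 0:
            return np.inf
        return -np.sum(-np.log(sigma) - 0.5 * np.log(2 * np.pi)
                       - (x - mu) ** 2 / (2 * sigma ** 2))

    res = minimize(neg_log_lik, x0=[0.0, 1.0], method="Nelder-Mead")
    print(mu_hat, sigma2_hat)          # closed-form values
    print(res.x[0], res.x[1] ** 2)     # numerical maximizer, should agree closely

    # Example 15.14: UNIF(alpha, beta). The likelihood is maximized at the
    # sample extremes, so the MLEs are the smallest and largest observations.
    u = rng.uniform(low=1.0, high=4.0, size=500)
    alpha_hat, beta_hat = u.min(), u.max()
    print(alpha_hat, beta_hat)

The numerical maximizer should agree with the closed-form answers to several decimal places; such a check is a convenient safeguard whenever a likelihood equation has been solved by hand.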
This invariance property says Z is the maximum that if θZ is a maximum likelihood estimator of θ, then g(θ) k likelihood estimator of g(θ), where g is a function from R I to a subset of R I m. This result was proved by Zehna in 1966. We state this result as a theorem without a proof. Theorem 15.1. Let θZ be a maximum likelihood estimator of a parameter θ and let g(θ) be a , function of θ. Then the maximum likelihood estimator of Z g(θ) is given by g θ . Now we give two examples to illustrate the importance of this theorem. Example 15.15. Let X1 , X2 , ..., Xn be a random sample from a normal population with mean µ and variance σ 2 . What are the maximum likelihood estimators of σ and µ − σ? Answer: From Example 15.13, we have the maximum likelihood estimator of µ and σ 2 to be µ Z=X and n % [2 = 1 σ (Xi − X)2 =: Σ2 (say). n i=1 Now using the invariance property of the maximum likelihood estimator we have σ Z=Σ and µ_ − σ = X − Σ. Example 15.16. Suppose X1 , X2 , ..., Xn is a random sample from a distribution with density function & 1 if α < x < β β−α f (x; α, β) = 0 otherwise. I 2 2 Find the estimator of α + β by the method of maximum likelihood. Probability and Mathematical Statistics 425 Answer: From Example 15.14, we have the maximum likelihood estimator of α and β to be α Z = X(1) and βZ = X(n) , respectively. Now using the invariance property of the maximum likelihood I α2 + β 2 is estimator we see that the maximum likelihood estimator of = 2 + X2 . X(1) (n) The concept of information in statistics was introduced by Sir Ronald Fisher, and it is known as Fisher information. Definition 15.4. Let X be an observation from a population with probability density function f (x; θ). Suppose f (x; θ) is continuous, twice differentiable and it’s support does not depend on θ. Then the Fisher information, I(θ), in a single observation X about θ is given by I(θ) = : 2 ∞ −∞ d ln f (x; θ) dθ 32 f (x; θ) dx. Thus I(θ) is the expected value of the square of the random variable d ln f (X;θ) . That is, dθ )2 I(θ) = E d ln f (X; θ) dθ 32 + . In the following lemma, we give an alternative formula for the Fisher information. Lemma 15.1. The Fisher information contained in a single observation about the unknown parameter θ can be given alternatively as 3 : ∞2 2 d ln f (x; θ) I(θ) = − f (x; θ) dx. dθ2 −∞ Proof: Since f (x; θ) is a probability density function, : ∞ f (x; θ) dx = 1. −∞ Differentiating (3) with respect to θ, we get d dθ : ∞ −∞ f (x; θ) dx = 0. (3) Some Techniques for finding Point Estimators of Parameters 426 Rewriting the last equality, we obtain : ∞ 1 df (x; θ) f (x; θ) dx = 0 dθ f (x; θ) −∞ which is : ∞ −∞ d ln f (x; θ) f (x; θ) dx = 0. dθ (4) Now differentiating (4) with respect to θ, we see that 3 : ∞2 2 d ln f (x; θ) d ln f (x; θ) df (x; θ) f (x; θ) + dx = 0. dθ2 dθ dθ −∞ Rewriting the last equality, we have 3 : ∞2 2 d ln f (x; θ) d ln f (x; θ) df (x; θ) 1 f (x; θ) + f (x; θ) dx = 0 dθ2 dθ dθ f (x; θ) −∞ which is : ∞ −∞ ) 2 32 + d ln f (x; θ) d2 ln f (x; θ) + f (x; θ) dx = 0. dθ2 dθ The last equality implies that : ∞ −∞ 2 d ln f (x; θ) dθ 32 f (x; θ) dx = − : ∞ −∞ 2 3 d2 ln f (x; θ) f (x; θ) dx. dθ2 Hence using the definition of Fisher information, we have 3 : ∞2 2 d ln f (x; θ) I(θ) = − f (x; θ) dx dθ2 −∞ and the proof of the lemma is now complete. The following two examples illustrate how one can determine Fisher information. Example 15.17. Let X be a single observation taken from a normal population with unknown mean µ and known variance σ 2 . 
Find the Fisher information in a single observation X about µ. Answer: Since X ∼ N (µ, σ 2 ), the probability density of X is given by f (x; µ) = √ 1 2πσ 2 1 2 e− 2σ2 (x−µ) . Probability and Mathematical Statistics Hence 427 (x − µ)2 1 . ln f (x; µ) = − ln(2πσ 2 ) − 2 2σ 2 Therefore x−µ d ln f (x; µ) = dµ σ2 and d2 ln f (x; µ) 1 = − 2. dµ2 σ Hence I(µ) = − : ∞ −∞ # − 1 σ2 $ f (x; µ) dx = 1 . σ2 Example 15.18. Let X1 , X2 , ..., Xn be a random sample from a normal population with unknown mean µ and known variance σ 2 . Find the Fisher information in this sample of size n about µ. Answer: Let In (µ) be the required Fisher information. Then from the definition, we have $ d2 ln f (X1 , X2 , ..., Xn ; µ dµ2 # 2 $ d = −E {ln f (X ; µ) + · · · + ln f (X ; µ)} 1 n dµ2 # 2 $ # 2 $ d ln f (X1 ; µ) d ln f (Xn ; µ) = −E − · · · − E dµ2 dµ2 In (µ) = − E # = I(µ) + · · · + I(µ) = n I(µ) 1 =n 2 σ (using Example 15.17). This example shows that if X1 , X2 , ..., Xn is a random sample from a population X ∼ f (x; θ), then the Fisher information, In (θ), in a sample of size n about the parameter θ is equal to n times the Fisher information in X about θ. Thus In (θ) = n I(θ). If X is a random variable with probability density function f (x; θ), where θ = (θ1 , ..., θn ) is an unknown parameter vector then the Fisher information, Some Techniques for finding Point Estimators of Parameters 428 I(θ), is a n × n matrix given by I(θ) = (Iij (θ)) # # 2 $$ ∂ ln f (X; θ) = −E . ∂θi ∂θj Example 15.19. Let X1 , X2 , ..., Xn be a random sample from a normal population with mean µ and variance σ 2 . What is the Fisher information matrix, In (µ, σ 2 ), of the sample of size n about the parameters µ and σ 2 ? Answer: Let us write θ1 = µ and θ2 = σ 2 . The Fisher information, In (θ), in a sample of size n about the parameter (θ1 , θ2 ) is equal to n times the Fisher information in the population about (θ1 , θ2 ), that is In (θ1 , θ2 ) = n I(θ1 , θ2 ). (5) Since there are two parameters θ1 and θ2 , the Fisher information matrix I(θ1 , θ2 ) is a 2 × 2 matrix given by   I11 (θ1 , θ2 ) I12 (θ1 , θ2 )  I(θ1 , θ2 ) =  (6) I21 (θ1 , θ2 ) I22 (θ1 , θ2 ) where $ ∂ 2 ln f (X; θ1 , θ2 ) ∂θi ∂θj for i = 1, 2 and j = 1, 2. Now we proceed to compute Iij . Since Iij (θ1 , θ2 ) = −E # f (x; θ1 , θ2 ) = √ (x−θ1 )2 1 e− 2 θ2 2 π θ2 we have (x − θ1 )2 1 ln(2 π θ2 ) − . 2 2 θ2 Taking partials of ln f (x; θ1 , θ2 ), we have ln f (x; θ1 , θ2 ) = − ∂ ln f (x; θ1 , θ2 ) ∂θ1 ∂ ln f (x; θ1 , θ2 ) ∂θ2 2 ∂ ln f (x; θ1 , θ2 ) ∂θ12 ∂ 2 ln f (x; θ1 , θ2 ) ∂θ22 ∂ 2 ln f (x; θ1 , θ2 ) ∂θ1 ∂θ2 x − θ1 , θ2 1 (x − θ1 )2 =− + , 2 θ2 2 θ22 1 =− , θ2 1 (x − θ1 )2 = − , 2 2 θ2 θ23 x − θ1 =− . θ22 = Probability and Mathematical Statistics Hence 429 # 1 I11 (θ1 , θ2 ) = −E − θ2 $ = 1 1 = 2. θ2 σ Similarly, # $ θ1 θ1 X − θ1 E(X) θ1 I21 (θ1 , θ2 ) = I12 (θ1 , θ2 ) = −E − − 2 = 2 − 2 =0 = 2 2 θ2 θ2 θ2 θ2 θ2 and # $ (X − θ1 )2 1 I22 (θ1 , θ2 ) = −E − + θ23 2θ22 " ! E (X − θ1 )2 1 θ2 1 1 1 . − 2 = 3− 2 = 2 = = θ23 2θ2 θ2 2θ2 2θ2 2σ 4 Thus from (5), (6) and the above calculations, the Fisher information matrix is given by  1   n  0 0 σ2 σ2 = . In (θ1 , θ2 ) = n  0 2σ1 4 0 2σn4 Now we present an important theorem about the maximum likelihood estimator without a proof. Theorem 15.2. Under certain regularity conditions on the f (x; θ) the maximum likelihood estimator θZ of θ based on a random sample of size n from a population X with probability density f (x; θ) is asymptotically normally 1 distributed with mean θ and variance n I(θ) . 
That is θZM L ∼ N # 1 θ, n I(θ) $ as n → ∞. The following example shows that the maximum likelihood estimator of a parameter is not necessarily unique. Example 15.20. If X1 , X2 , ..., Xn is a random sample from a distribution with density function   12 if θ − 1 ≤ x ≤ θ + 1 f (x; θ) =  0 otherwise, then what is the maximum likelihood estimator of θ? Some Techniques for finding Point Estimators of Parameters 430 Answer: The likelihood function of this sample is given by L(θ) = & ! 1 "n 2 0 if max{x1 , ..., xn } − 1 ≤ θ ≤ min{x1 , ..., xn } + 1 otherwise. Since the likelihood function is a constant, any value in the interval [max{x1 , ..., xn } − 1, min{x1 , ..., xn } + 1] is a maximum likelihood estimate of θ. Example 15.21. What is the basic principle of maximum likelihood estimation? Answer: To choose a value of the parameter for which the observed data have as high a probability or density as possible. In other words a maximum likelihood estimate is a parameter value under which the sample data have the highest probability. 15.3. Bayesian Method In the classical approach, the parameter θ is assumed to be an unknown, but fixed quantity. A random sample X1 , X2 , ..., Xn is drawn from a population with probability density function f (x; θ) and based on the observed values in the sample, knowledge about the value of θ is obtained. In Bayesian approach θ is considered to be a quantity whose variation can be described by a probability distribution (known as the prior distribution). This is a subjective distribution, based on the experimenter’s belief, and is formulated before the data are seen (and hence the name prior distribution). A sample is then taken from a population where θ is a parameter and the prior distribution is updated with this sample information. This updated prior is called the posterior distribution. The updating is done with the help of Bayes’ theorem and hence the name Bayesian method. In this section, we shall denote the population density f (x; θ) as f (x/θ), that is the density of the population X given the parameter θ. Definition 15.5. Let X1 , X2 , ..., Xn be a random sample from a distribution with density f (x/θ), where θ is the unknown parameter to be estimated. The probability density function of the random variable θ is called the prior distribution of θ and usually denoted by h(θ). Definition 15.6. Let X1 , X2 , ..., Xn be a random sample from a distribution with density f (x/θ), where θ is the unknown parameter to be estimated. The Probability and Mathematical Statistics 431 conditional density, k(θ/x1 , x2 , ..., xn ), of θ given the sample x1 , x2 , ..., xn is called the posterior distribution of θ. Example 15.22. Let X1 = 1, X2 = 2 be a random sample of size 2 from a distribution with probability density function f (x/θ) = # $ 3 x θ (1 − θ)3−x , x x = 0, 1, 2, 3. If the prior density of θ is h(θ) =  k  if 0 1 2 <θ<1 otherwise, what is the posterior distribution of θ ? Answer: Since h(θ) is the probability density of θ, we should get : 1 h(θ) dθ = 1 1 2 which implies : 1 k dθ = 1. 1 2 Therefore k = 2. The joint density of the sample and the parameter is given by u(x1 , x2 , θ) = f (x1 /θ)f (x2 /θ)h(θ) # $ # $ 3 x2 3 x1 θ (1 − θ)3−x2 2 θ (1 − θ)3−x1 = x2 x1 # $# $ 3 3 x1 +x2 θ (1 − θ)6−x1 −x2 . =2 x2 x1 Hence, u(1, 2, θ) = 2 # $# $ 3 3 3 θ (1 − θ)3 1 2 = 18 θ3 (1 − θ)3 . 
Some Techniques for finding Point Estimators of Parameters 432 The marginal distribution of the sample g(1, 2) = = : : 1 u(1, 2, θ) dθ 1 2 1 1 2 = 18 = 18 = 18 θ3 (1 − θ)3 dθ : : 1 1 2 1 1 2 9 . 140 ! " θ3 1 + 3θ2 − 3θ − θ3 dθ ! " θ3 + 3θ5 − 3θ4 − θ6 dθ The conditional distribution of the parameter θ given the sample X1 = 1 and X2 = 2 is given by k(θ/x1 = 1, x2 = 2) = = u(1, 2, θ) g(1, 2) 18 θ3 (1 − θ)3 9 140 3 = 280 θ (1 − θ)3 . Therefore, the posterior distribution of θ is k(θ/x1 = 1, x2 = 2) = & 1 2 280 θ3 (1 − θ)3 if 0 otherwise. <θ<1 Remark 15.2. If X1 , X2 , ..., Xn is a random sample from a population with density f (x/θ), then the joint density of the sample and the parameter is given by n P f (xi /θ). u(x1 , x2 , ..., xn , θ) = h(θ) i=1 Given this joint density, the marginal density of the sample can be computed using the formula g(x1 , x2 , ..., xn ) = : ∞ −∞ h(θ) n P i=1 f (xi /θ) dθ. Probability and Mathematical Statistics 433 Now using the Bayes rule, the posterior distribution of θ can be computed as follows: On h(θ) i=1 f (xi /θ) On . k(θ/x1 , x2 , ..., xn ) = 8 ∞ h(θ) i=1 f (xi /θ) dθ −∞ In Bayesian method, we use two types of loss functions. Definition 15.7. Let X1 , X2 , ..., Xn be a random sample from a distribution with density f (x/θ), where θ is the unknown parameter to be estimated. Let θZ be an estimator of θ. The function , - , -2 Z θ = θZ − θ L2 θ, is called the squared error loss. The function > , - > Z θ = >>θZ − θ>> L1 θ, is called the absolute error loss. The loss function L represents the ‘loss’ incurred when θZ is used in place of the parameter θ. Definition 15.8. Let X1 , X2 , ..., Xn be a random sample from a distribution with density f (x/θ), where θ is the parameter to be estimated. Let , unknown Z Z θ be an estimator of θ and let L θ, θ be a given loss function. The expected value of this loss function with respect to the population distribution f (x/θ), that is : , Z θ f (x/θ) dx RL (θ) = L θ, is called the risk. The posterior density of the parameter θ given the sample x1 , x2 , ..., xn , that is k(θ/x1 , x2 , ..., xn ) contains all information about θ. In Bayesian estimation of parameter one chooses an estimate θZ for θ such that Z 1 , x2 , ..., xn ) k(θ/x is maximum subject to a loss function. Mathematically, this is equivalent to minimizing the integral : , Z θ k(θ/x1 , x2 , ..., xn ) dθ L θ, Ω Some Techniques for finding Point Estimators of Parameters 434 Z where Ω denotes the support of the prior density h(θ) of with respect to θ, the parameter θ. Example 15.23. Suppose one observation was taken of a random variable X which yielded the value 2. The density function for X is   θ1 if 0 < x < θ f (x/θ) =  0 otherwise, and prior distribution for parameter θ is & 3 h(θ) = θ4 if 1 < θ < ∞ 0 otherwise. If the loss function is L(z, θ) = (z − θ)2 , then what is the Bayes’ estimate for θ? Answer: The prior density of the random variable θ is & 3 if 1 < θ < ∞ θ4 h(θ) = 0 otherwise. The probability density function of the population is &1 if 0 < x < θ θ f (x/θ) = 0 otherwise. Hence, the joint probability density function of the sample and the parameter is given by u(x, θ) = h(θ) f (x/θ) 3 1 = 4 θ θ & −5 3θ if 0 < x < θ and = 0 otherwise. The marginal density of the sample is given by : ∞ g(x) = u(x, θ) dθ :x∞ 3 θ−5 dθ = x 3 = x−4 4 3 . = 4 x4 1<θ<∞ Probability and Mathematical Statistics Thus, if x = 2, then g(2) = given by 3 64 . 435 The posterior density of θ when x = 2 is u(2, θ) g(2) 64 −5 3θ = 3 & 64 θ−5 if 2 < θ < ∞ = 0 otherwise . 
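This posterior, and the one found in Example 15.22, can be verified numerically: each should integrate to one over its support, and the mean of the posterior above anticipates the Bayes estimate derived next. A minimal sketch, assuming SciPy's quadrature routine quad is available; it is purely illustrative:

    # Numerical checks of the posteriors in Examples 15.22 and 15.23.
    import numpy as np
    from scipy.integrate import quad

    # Example 15.22: k(theta | x1=1, x2=2) = 280 * theta^3 * (1 - theta)^3 on (1/2, 1).
    total_22, _ = quad(lambda t: 280 * t**3 * (1 - t)**3, 0.5, 1)

    # Example 15.23: k(theta | x=2) = 64 * theta^(-5) on (2, infinity).
    post = lambda t: 64.0 * t**(-5)
    total_23, _ = quad(post, 2, np.inf)
    mean_23, _ = quad(lambda t: t * post(t), 2, np.inf)

    print(total_22, total_23)   # both should be close to 1.0
    print(mean_23)              # about 2.667, that is 8/3, the Bayes estimate found below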
Now, we find the Bayes estimator by minimizing the expression E [L(θ, z)/x = 2]. That is : θZ = Arg max L(θ, z) k(θ/x = 2) dθ. k(θ/x = 2) = z∈Ω Ω Let us call this integral ψ(z). Then : ψ(z) = L(θ, z) k(θ/x = 2) dθ :Ω∞ (z − θ)2 k(θ/x = 2) dθ = :2 ∞ = (z − θ)2 64θ−5 dθ. 2 We want to find the value of z which yields a minimum of ψ(z). This can be done by taking the derivative of ψ(z) and evaluating where the derivative is zero. : ∞ d d (z − θ)2 64θ−5 dθ ψ(z) = dz dz 2 : ∞ (z − θ) 64θ−5 dθ =2 2 : ∞ : ∞ θ 64θ−5 dθ z 64θ−5 dθ − 2 =2 2 2 16 = 2z − . 3 Setting this derivative of ψ(z) to zero and solving for z, we get 16 =0 3 8 ⇒z= . 3 2z − 2 ψ(z) = 2, the function ψ(z) has a minimum at z = Since d dz 2 Bayes’ estimate of θ is 83 . 8 3. Hence, the Some Techniques for finding Point Estimators of Parameters 436 In Example 15.23, we ,have- found the Bayes’ estimate of θ by di8 Z θ k(θ/x1 , x2 , ..., xn ) dθ with respect to θ. Z rectly minimizing the Ω L θ, The next result is very useful while finding the Bayes’ estimate using Z θ) = (θ − θ) Z 2 , then a quadratic Notice, that if L(θ, , - loss function. 8 2 Z θ k(θ/x1 , x2 , ..., xn ) dθ is E (θ − θ) Z /x1 , x2 , ..., xn . The followL θ, Ω ing theorem is based on the fact that the function φ defined by φ(c) = ; < E (X − c)2 attains minimum if c = E[X]. Theorem 15.3. Let X1 , X2 , ..., Xn be a random sample from a distribution with density f (x/θ), where θ is the unknown parameter to be estimated. If the loss function is squared error, then the Bayes’ estimator θZ of parameter θ is given by θZ = E(θ/x1 , x2 , ..., xn ), where the expectation is taken with respect to density k(θ/x1 , x2 , ..., xn ). Now we give several examples to illustrate the use of this theorem. Example 15.24. Suppose the prior distribution of θ is uniform over the interval (0, 1). Given θ, the population X is uniform over the interval (0, θ). If the squared error loss function is used, find the Bayes’ estimator of θ based on a sample of size one. Answer: The prior density of θ is given by h(θ) = & 1 if 0 < θ < 1 0 otherwise . The density of population is given by f (x/θ) = &1 θ if 0 < x < θ 0 otherwise. The joint density of the sample and the parameter is given by u(x, θ) = h(θ) f (x/θ) # $ 1 =1 θ &1 if 0 < x < θ < 1 θ = 0 otherwise . Probability and Mathematical Statistics 437 The marginal density of the sample is : 1 g(x) = u(x, θ) dθ x 1 : 1 dθ θ Jx − ln x = 0 = if 0 < x < 1 otherwise. The conditional density of θ given the sample is & 1 if 0 < x < θ < 1 − θ ln x u(x, θ) k(θ/x) = = g(x) 0 elsewhere . Since the loss function is quadratic error, therefore the Bayes’ estimator of θ is θZ = E[θ/x] : 1 = θ k(θ/x) dθ x 1 : −1 dθ θ ln x x : 1 1 =− dθ ln x x x−1 . = ln x Thus, the Bayes’ estimator of θ based on one observation X is = θ X −1 . θZ = ln X Example 15.25. Given θ, the random variable X has a binomial distribution with n = 2 and probability of success θ. If the prior density of θ is  k if 21 < θ < 1 h(θ) =  0 otherwise, what is the Bayes’ estimate of θ for a squared error loss if X = 1 ? ! " Answer: Note that θ is uniform on the interval 12 , 1 , hence k = 2. Therefore, the prior density of θ is & 2 if 12 < θ < 1 h(θ) = 0 otherwise. Some Techniques for finding Point Estimators of Parameters 438 The population density is given by # $ # $ n x 2 x n−x f (x/θ) = θ (1 − θ) = θ (1 − θ)2−x , x x x = 0, 1, 2. The joint density of the sample and the parameter θ is u(x, θ) = h(θ) f (x/θ) # $ 2 x =2 θ (1 − θ)2−x x where by 1 2 < θ < 1 and x = 0, 1, 2. 
The marginal density of the sample is given g(x) = : 1 u(x, θ) dθ. 1 2 This integral is easy to evaluate if we substitute X = 1 now. Hence g(1) = = : : =4 1 1 2 1 1 2 2 2 3 2 = 3 1 = . 3 = # $ 2 2 θ (1 − θ) dθ 1 ! " 4θ − 4θ2 dθ 31 θ2 θ3 − 2 3 1 2 ; 2 <1 3θ − 2θ3 1 2 #2 $3 3 2 − (3 − 2) − 4 8 Therefore, the posterior density of θ given x = 1, is k(θ/x = 1) = where 1 2 u(1, θ) = 12 (θ − θ2 ), g(1) < θ < 1. Since the loss function is quadratic error, therefore the Probability and Mathematical Statistics 439 Bayes’ estimate of θ is θZ = E[θ/x = 1] : 1 θ k(θ/x = 1) dθ = = : 1 2 1 1 2 12 θ (θ − θ2 ) dθ ; <1 = 4 θ3 − 3 θ4 1 2 5 =1− 16 11 = . 16 Hence, based on the sample of size one with X = 1, the Bayes’ estimate of θ is 11 16 , that is 11 θZ = . 16 The following theorem help us to evaluate the Bayes estimate of a sample if the loss function is absolute error loss. This theorem is based the fact that a function φ(c) = E [ |X − c| ] is minimum if c is the median of X. Theorem 15.4. Let X1 , X2 , ..., Xn be a random sample from a distribution with density f (x/θ), where θ is the unknown parameter to be estimated. If the loss function is absolute error, then the Bayes estimator θZ of the parameter θ is given by θZ = median of k(θ/x1 , x2 , ..., xn ) where k(θ/x1 , x2 , ..., xn ) is the posterior distribution of θ. The followings are some examples to illustrate the above theorem. Example 15.26. Given θ, the random variable X has a binomial distribution with n = 3 and probability of success θ. If the prior density of θ is h(θ) =  k  0 if 1 2 <θ<1 otherwise, what is the Bayes’ estimate of θ for an absolute difference error loss if the sample consists of one observation x = 3? Some Techniques for finding Point Estimators of Parameters 440 Answer: Since, the prior density of θ is h(θ) =  2  if 1 2 <θ<1 0 otherwise , f (x/θ) = # $ 3 x θ (1 − θ)3−x , x and the population density is the joint density of the sample and the parameter is given by u(3, θ) = h(θ) f (3/θ) = 2 θ3 , where 1 2 < θ < 1. The marginal density of the sample (at x = 3) is given by g(3) = = = = : : 2 1 u(3, θ) dθ 1 2 1 2 θ3 dθ 1 2 θ4 2 15 . 32 31 1 2 Therefore, the conditional density of θ given X = 3 is u(3, θ) k(θ/x = 3) = = g(3) & 64 15 θ3 0 if 1 2 <θ<1 elsewhere. Since, the loss function is absolute error, the Bayes’ estimator is the median of the probability density function k(θ/x = 3). That is : Z θ 64 3 θ dθ 1 15 2 64 ; 4 <Z θ θ 1 = 60 2 2 3 1 64 , Z-4 θ − . = 60 16 1 = 2 Probability and Mathematical Statistics 441 Z we get Solving the above equation for θ, @ 4 17 θZ = = 0.8537. 32 Example 15.27. Suppose the prior distribution of θ is uniform over the interval (2, 5). Given θ, X is uniform over the interval (0, θ). What is the Bayes’ estimator of θ for absolute error loss if X = 1 ? Answer: Since, the prior density of θ is   13 if 2 < θ < 5 h(θ) =  0 otherwise , and the population density is f (x/θ) =   θ1  if 0 < x < θ 0 elsewhere, the joint density of the sample and the parameter is given by u(x, θ) = h(θ) f (x/θ) = 1 , 3θ where 2 < θ < 5 and 0 < x < θ. The marginal density of the sample (at x = 1) is given by g(1) = : 5 u(1, θ) dθ 1 = : 2 u(1, θ) dθ + 1 : 5 u(1, θ) dθ 2 5 1 dθ 3θ 2 # $ 1 5 = ln . 3 2 = : Therefore, the conditional density of θ given the sample x = 1, is u(1, θ) g(1) 1 ! ". = θ ln 52 k(θ/x = 1) = Some Techniques for finding Point Estimators of Parameters 442 Since, the loss function is absolute error, the Bayes estimate of θ is the median of k(θ/x = 1). Hence : Z θ 1 1 ! 
5 " dθ = 2 2 θ ln 2 = Z we get Solving for θ, ln θZ = 1 ! 5 " ln √ 2 ) + θZ 2 . 10 = 3.16. Example 15.28. What is the basic principle of Bayesian estimation? Answer: The basic principle behind the Bayesian estimation method consists of choosing a value of the parameter θ for which the observed data have as high a posterior probability k(θ/x1 , x2 , ..., xn ) of θ as possible subject to a loss function. 15.4. Review Exercises 1. Let X1 , X2 , ..., Xn be a random sample of size n from a distribution with a probability density function  1  2θ if −θ < x < θ f (x; θ) =  0 otherwise, where 0 < θ is a parameter. Using the moment method find an estimator for the parameter θ. 2. Let X1 , X2 , ..., Xn be a random sample of size n from a distribution with a probability density function   (θ + 1) x−θ−2 if 1 < x < ∞ f (x; θ) =  0 otherwise, where 0 < θ is a parameter. Using the moment method find an estimator for the parameter θ. 3. Let X1 , X2 , ..., Xn be a random sample of size n from a distribution with a probability density function   θ2 x e−θ x if 0 < x < ∞ f (x; θ) =  0 otherwise, Probability and Mathematical Statistics 443 where 0 < θ is a parameter. Using the moment method find an estimator for the parameter θ. 4. Let X1 , X2 , ..., Xn be a random sample of size n from a distribution with a probability density function   θ xθ−1 if 0 < x < 1 f (x; θ) =  0 otherwise, where 0 < θ is a parameter. Using the maximum likelihood method find an estimator for the parameter θ. 5. Let X1 , X2 , ..., Xn be a random sample of size n from a distribution with a probability density function   (θ + 1) x−θ−2 if 1 < x < ∞ f (x; θ) =  0 otherwise, where 0 < θ is a parameter. Using the maximum likelihood method find an estimator for the parameter θ. 6. Let X1 , X2 , ..., Xn be a random sample of size n from a distribution with a probability density function   θ2 x e−θ x if 0 < x < ∞ f (x; θ) =  0 otherwise, where 0 < θ is a parameter. Using the maximum likelihood method find an estimator for the parameter θ. 7. Let X1 , X2 , X3 , X4 be a random sample from a distribution with density function  −(x−4) 1e β for x > 4 β f (x; β) =  0 otherwise, where β > 0. If the data from this random sample are 8.2, 9.1, 10.6 and 4.9, respectively, what is the maximum likelihood estimate of β? 8. Given θ, the random variable X has a binomial distribution with n = 2 and probability of success θ. If the prior density of θ is   k if 12 < θ < 1 h(θ) =  0 otherwise, Some Techniques for finding Point Estimators of Parameters 444 what is the Bayes’ estimate of θ for a squared error loss if the sample consists of x1 = 1 and x2 = 2. 9. Suppose two observations were taken of a random variable X which yielded the values 2 and 3. The density function for X is   θ1 if 0 < x < θ f (x/θ) =  0 otherwise, and prior distribution for the parameter θ is & −4 3θ if θ > 1 h(θ) = 0 otherwise. If the loss function is quadratic, then what is the Bayes’ estimate for θ? 10. The Pareto distribution is often used in study of incomes and has the cumulative density function  ! "  1 − αx θ if α ≤ x F (x; α, θ) =  0 otherwise, where 0 < α < ∞ and 1 < θ < ∞ are parameters. Find the maximum likelihood estimates of α and θ based on a sample of size 5 for value 3, 5, 2, 7, 8. 11. The Pareto distribution is often used in study of incomes and has the cumulative density function  ! "  1 − αx θ if α ≤ x F (x; α, θ) =  0 otherwise, where 0 < α < ∞ and 1 < θ < ∞ are parameters. 
Using moment methods find estimates of α and θ based on a sample of size 5 for value 3, 5, 2, 7, 8. 12. Suppose one observation was taken of a random variable X which yielded the value 2. The density function for X is 2 1 1 f (x/µ) = √ e− 2 (x−µ) 2π − ∞ < x < ∞, and prior distribution of µ is 1 2 1 h(µ) = √ e− 2 µ 2π − ∞ < µ < ∞. Probability and Mathematical Statistics 445 If the loss function is quadratic, then what is the Bayes’ estimate for µ? 13. Let X1 , X2 , ..., Xn be a random sample of size n from a distribution with probability density   θ1 if 2θ ≤ x ≤ 3θ f (x) =  0 otherwise, where θ > 0. What is the maximum likelihood estimator of θ? 14. Let X1 , X2 , ..., Xn be a random sample of size n from a distribution with probability density f (x) =   1 − θ2  0 if 0 ≤ x ≤ 1 1−θ 2 otherwise, where θ > 0. What is the maximum likelihood estimator of θ? 15. Given θ, the random variable X has a binomial distribution with n = 3 and probability of success θ. If the prior density of θ is h(θ) =  k  0 if 1 2 <θ<1 otherwise, what is the Bayes’ estimate of θ for a absolute difference error loss if the sample consists of one observation x = 1? 16. Suppose the random variable X has the cumulative density function F (x). Show that the expected value of the random variable (X − c)2 is minimum if c equals the expected value of X. 17. Suppose the continuous random variable X has the cumulative density function F (x). Show that the expected value of the random variable |X − c| is minimum if c equals the median of X (that is, F (c) = 0.5). 18. Eight independent trials are conducted of a given system with the following results: S, F, S, F, S, S, S, S where S denotes the success and F denotes the failure. What is the maximum likelihood estimate of the probability of successful operation p ? 4 2 5, 3, 5 ! "β β) x2 19. What is the maximum likelihood estimate of β if the 5 values 1, 3 5 2, 4 ? were drawn from the population for which f (x; β) = 1 2 (1 + Some Techniques for finding Point Estimators of Parameters 446 20. If a sample of five values of X is taken from the population for which f (x; t) = 2(t − 1)tx , what is the maximum likelihood estimator of t ? 21. A sample of size n is drawn from a gamma distribution f (x; β) =  −x  x3 e 4 β 6β  0 if 0 < x < ∞ otherwise. What is the maximum likelihood estimator of β ? 22. The probability density function of the random variable X is defined by f (x; λ) = & √ 1 − 32 λ + λ x if 0 ≤ x ≤ 1 0 otherwise. What is the maximum likelihood estimate of the parameter λ based on two 9 ? independent observations x1 = 41 and x2 = 16 23. Let X1 , X2 , ..., Xn be a random sample from a distribution with density function f (x; σ) = σ2 e−σ|x−µ| . What is the maximum likelihood estimator of σ? 24. Suppose X1 , X2 , ... are independent random variables, each with probability of success p and probability of failure 1 − p, where 0 ≤ p ≤ 1. Let N be the number of observation needed to obtain the first success. What is the maximum likelihood estimator of p in term of N ? 25. Let X1 , X2 , X3 and X4 be a random sample from the discrete distribution X such that  2x −θ2 θ e for x = 0, 1, 2, ..., ∞ x! P (X = x) =  0 otherwise, where θ > 0. If the data are 17, 10, 32, 5, what is the maximum likelihood estimate of θ ? 26. Let X1 , X2 , ..., Xn be a random sample of size n from a population with a probability density function f (x; α, λ) =  α λ  Γ(α) xα−1 e−λ x  0 if 0<x<∞ otherwise, Probability and Mathematical Statistics 447 where α and λ are parameters. 
Using the moment method find the estimators for the parameters α and λ. 27. Let X1 , X2 , ..., Xn be a random sample of size n from a population distribution with the probability density function # $ 10 x f (x; p) = p (1 − p)10−x x for x = 0, 1, ..., 10, where p is a parameter. Find the Fisher information in the sample about the parameter p. 28. Let X1 , X2 , ..., Xn be a random sample of size n from a population distribution with the probability density function   θ2 x e−θ x if 0 < x < ∞ f (x; θ) =  0 otherwise, where 0 < θ is a parameter. Find the Fisher information in the sample about the parameter θ. 29. Let X1 , X2 , ..., Xn be a random sample of size n from a population distribution with the probability density function ! ln(x)−µ "2  − 12 1  σ √ e , if 0 < x < ∞ f (x; µ, σ 2 ) = x σ 2 π  0 otherwise , where −∞ < µ < ∞ and 0 < σ 2 < ∞ are unknown parameters. Find the Fisher information matrix in the sample about the parameters µ and σ 2 . 30. Let X1 , X2 , ..., Xn be a random sample of size n from a population distribution with the probability density function = λ(x−µ)2   λ x− 23 e− 2µ2 x , if 0 < x < ∞ 2π f (x; µ, λ) =   0 otherwise , where 0 < µ < ∞ and 0 < λ < ∞ are unknown parameters. Find the Fisher information matrix in the sample about the parameters µ and λ. 31. Let X1 , X2 , ..., Xn be a random sample of size n from a distribution with a probability density function  x  Γ(α)1 θα xα−1 e− θ if 0 < x < ∞ f (x) =  0 otherwise, Some Techniques for finding Point Estimators of Parameters 448 where α > 0 and θ > 0 are parameters. Using the moment method find estimators for parameters α and β. 32. Let X1 , X2 , ..., Xn be a random sample of sizen from a distribution with a probability density function f (x; θ) = 1 , π [1 + (x − θ)2 ] −∞ < x < ∞, where 0 < θ is a parameter. Using the maximum likelihood method find an estimator for the parameter θ. 33. Let X1 , X2 , ..., Xn be a random sample of sizen from a distribution with a probability density function f (x; θ) = 1 −|x−θ| e , 2 −∞ < x < ∞, where 0 < θ is a parameter. Using the maximum likelihood method find an estimator for the parameter θ. 34. Let X1 , X2 , ..., Xn be a random sample of size n from a population distribution with the probability density function f (x; λ) =  x −λ e  λ x!  0 if x = 0, 1, ..., ∞ otherwise, where λ > 0 is an unknown parameter. Find the Fisher information matrix in the sample about the parameter λ. 35. Let X1 , X2 , ..., Xn be a random sample of size n from a population distribution with the probability density function   (1 − p)x−1 p if x = 1, ..., ∞ f (x; p) =  0 otherwise, where 0 < p < 1 is an unknown parameter. Find the Fisher information matrix in the sample about the parameter p. Probability and Mathematical Statistics 449 Chapter 16 CRITERIA FOR EVALUATING THE GOODNESS OF ESTIMATORS We have seen in Chapter 15 that, in general, different parameter estimation methods yield different estimators. For example, if X ∼ U N IF (0, θ) and X1 , X2 , ..., Xn is a random sample from the population X, then the estimator of θ obtained by moment method is θZM M = 2X where as the estimator obtained by the maximum likelihood method is θZM L = X(n) where X and X(n) are the sample average and the nth order statistic, respectively. Now the question arises: which of the two estimators is better? Thus, we need some criteria to evaluate the goodness of an estimator. 
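A small simulation makes the comparison concrete before any formal criteria are introduced. The sketch below is illustrative only: it assumes NumPy, and the true value of θ, the sample size and the replication count are arbitrary choices.

    # Monte Carlo comparison of the two estimators of theta for X ~ UNIF(0, theta):
    # the moment estimator 2 * Xbar and the maximum likelihood estimator X_(n).
    import numpy as np

    rng = np.random.default_rng(1)
    theta, n, reps = 5.0, 20, 100_000

    samples = rng.uniform(0.0, theta, size=(reps, n))
    mm = 2.0 * samples.mean(axis=1)     # method of moments: twice the sample mean
    ml = samples.max(axis=1)            # maximum likelihood: largest observation

    for name, est in [("moment", mm), ("maximum likelihood", ml)]:
        print(name, "mean:", est.mean(), "variance:", est.var())

In such an experiment the moment estimator 2X̄ is centered at θ but noticeably more variable, while X(n) sits slightly below θ (its expected value is nθ/(n+1)) with a much smaller variance. Deciding which behaviour is preferable is precisely what the following criteria address.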
Some well known criteria for evaluating the goodness of an estimator are: (1) Unbiasedness, (2) Efficiency and Relative Efficiency, (3) Uniform Minimum Variance Unbiasedness, (4) Sufficiency, and (5) Consistency. In this chapter, we shall examine only the first four criteria in details. The concepts of unbiasedness, efficiency and sufficiency were introduced by Sir Ronald Fisher. Criteria for Evaluating the Goodness of Estimators 450 16.1. The Unbiased Estimator Let X1 , X2 , ..., Xn be a random sample of size n from a population with probability density function f (x; θ). An estimator θZ of θ is a function of the random variables X1 , X2 , ..., Xn which is free of the parameter θ. An estimate is a realized value of an estimator that is obtained when a sample is actually taken. Definition 16.1. An estimator θZ of θ is said to be an unbiased estimator of θ if and only if , E θZ = θ. If θZ is not unbiased, then it is called a biased estimator of θ. An estimator of a parameter may not equal to the actual value of the parameter for every realization of the sample X1 , X2 , ..., Xn , but if it is unbiased then on an average it will equal to the parameter. Example 16.1. Let X1 , X2 , ..., Xn be a random sample from a normal population with mean µ and variance σ 2 > 0. Is the sample mean X an unbiased estimator of the parameter µ ? Answer: Since, each Xi ∼ N (µ, σ 2 ), we have X∼N # σ2 µ, n $ . That is, the sample mean is normal with mean µ and variance σ2 n . Thus ! " E X = µ. Therefore, the sample mean X is an unbiased estimator of µ. Example 16.2. Let X1 , X2 , ..., Xn be a random sample from a normal population with mean µ and variance σ 2 > 0. What is the maximum likelihood estimator of σ 2 ? Is this maximum likelihood estimator an unbiased estimator of the parameter σ 2 ? Answer: In Example 15.13, we have shown that the maximum likelihood estimator of σ 2 is n % "2 ! [2 = 1 σ Xi − X . n i=1 Probability and Mathematical Statistics 451 Now, we examine the unbiasedness of this estimator 0 / n B "2 1 %! [ 2 Xi − X E σ =E n i=1 / 0 n "2 n − 1 1 %! Xi − X =E n n − 1 i=1 0 / n "2 n−1 1 %! = Xi − X E n n − 1 i=1 A n − 1 ; 2< E S n 2 3 n−1 2 σ2 E S = n σ2 < σ2 ; 2 = E χ (n − 1) n σ2 = (n − 1) n n−1 2 = σ n += σ 2 . = (since n−1 2 S ∼ χ2 (n − 1)) σ2 Therefore, the maximum likelihood estimator of σ 2 is a biased estimator. Next, in the following example, we show that the sample variance S 2 given by the expression n S2 = "2 1 %! Xi − X n − 1 i=1 is an unbiased estimator of the population variance σ 2 irrespective of the population distribution. Example 16.3. Let X1 , X2 , ..., Xn be a random sample from a population with mean µ and variance σ 2 > 0. Is the sample variance S 2 an unbiased estimator of the population variance σ 2 ? Answer: Note that the distribution of the population is not given. However, ! " we are given E(Xi ) = ,µ and E[(Xi − µ)2 ] = σ 2 . In order to find E S 2 , ! " 2 we need E X and E X . Thus we proceed to find these two expected Criteria for Evaluating the Goodness of Estimators 452 values. Consider $ X1 + X2 + · · · + Xn n n n % 1 % 1 E(Xi ) = µ=µ = n i=1 n i=1 ! " E X =E Similarly, $ X1 + X2 + · · · + Xn n n n 1 % 1 % 2 σ2 = 2 . V ar(Xi ) = 2 σ = n i=1 n i=1 n ! " V ar X = V ar Therefore Consider # # , 2! " ! "2 σ2 = V ar X + E X = + µ2 . E X n ! E S " 2 0 n "2 1 %! 
=E Xi − X n − 1 i=1 0 / n %, 1 2 2 = Xi − 2XXi + X E n−1 i=1 / n 0 % 1 2 2 = E Xi − n X n−1 i=1 ' & n B A % ; < 1 2 2 = E Xi − E n X n − 1 i=1 2 # $3 1 σ2 = n(σ 2 + µ2 ) − n µ2 + n−1 n < 1 ; = (n − 1) σ 2 n−1 = σ2 . / Therefore, the sample variance S 2 is an unbiased estimator of the population variance σ 2 . Example 16.4. Let X be a random variable with mean 2. Let θZ1 and θZ2 be unbiased estimators of the second and third moments, respectively, of X about the origin. Find an unbiased estimator of the third moment of X about its mean in terms of θZ1 and θZ2 . Probability and Mathematical Statistics 453 Answer: Since, θZ1 and θZ2 are the unbiased estimators of the second and third moments of X about origin, we get , , ! " E θZ1 = E(X 2 ) and E θZ2 = E X 3 . The unbiased estimator of the third moment of X about its mean is A B ; < 3 E (X − 2) = E X 3 − 6X 2 + 12X − 8 ; < ; < = E X 3 − 6E X 2 + 12E [X] − 8 = θZ2 − 6θZ1 + 24 − 8 = θZ2 − 6θZ1 + 16. Thus, the unbiased estimator of the third moment of X about its mean is θZ2 − 6θZ1 + 16. Example 16.5. Let X1 , X2 , ..., X5 be a sample of size 5 from distribution on the interval (0, θ), where θ is unknown. Let the θ be k Xmax , where k is some constant and Xmax is the largest In order k Xmax to be an unbiased estimator, what should be the constant k ? Answer: The probability density function of Xmax is given by 5! 4 [F (x)] f (x) 4! 0! , x -4 1 =5 θ θ 5 4 = 5x . θ g(x) = If k Xmax is an unbiased estimator of θ, then θ = E (k Xmax ) = k E (Xmax ) : θ x g(x) dx =k 0 =k : θ 0 5 5 x dx θ5 5 = k θ. 6 Hence, k= 6 . 5 the uniform estimator of observation. the value of Criteria for Evaluating the Goodness of Estimators 454 Example 16.6. Let X1 , X2 , ..., Xn be a sample of size n from a distribution with unknown mean −∞ < µ < ∞, and unknown variance σ 2 > 0. Show 2 +···+nXn are both unbiased estimators that the statistic X and Y = X1 +2X n (n+1) 2 ! " of µ. Further, show that V ar X < V ar(Y ). Answer: First, we show that X is an unbiased estimator of µ # $ ! " X1 + X2 + · · · + Xn E X =E n n 1 % = E (Xi ) n i=1 = n 1 % µ = µ. n i=1 Hence, the sample mean X is an unbiased estimator of the population mean irrespective of the distribution of X. Next, we show that Y is also an unbiased estimator of µ. ) + X1 + 2X2 + · · · + nXn E (Y ) = E n (n+1) 2 = = 2 n (n + 1) 2 n (n + 1) n % i=1 n % i E (Xi ) iµ i=1 n (n + 1) 2 µ n (n + 1) 2 = µ. = Hence, X and Y are both unbiased estimator of the population mean irrespective of the distribution of the population. The variance of X is given by 2 3 ; < X1 + X2 + · · · + Xn V ar X = V ar n 1 = 2 V ar [X1 + X2 + · · · + Xn ] n n 1 % = 2 V ar [Xi ] n i=1 = σ2 . n Probability and Mathematical Statistics 455 Similarly, the variance of Y can be calculated as follows: 0 / X1 + 2X2 + · · · + nXn V ar [Y ] = V ar n (n+1) 2 4 = 2 V ar [1 X1 + 2 X2 + · · · + n Xn ] n (n + 1)2 n % 4 V ar [i Xi ] = 2 n (n + 1)2 i=1 = n % 4 i2 V ar [Xi ] n2 (n + 1)2 i=1 n % 4 2 = 2 i2 σ n (n + 1)2 i=1 = σ2 2 3 2 = 3 = Since 2 2n+1 3 (n+1) > 1 for n 4 n (n + 1) (2n + 1) n2 (n + 1)2 6 2 2n + 1 σ (n + 1) n ; < 2n + 1 V ar X . (n + 1) ; < ≥ 2, we see that V ar X < V ar[ Y ]. This shows that although the estimators X and Y are both unbiased estimator of µ, yet the variance of the sample mean X is smaller than the variance of Y . In statistics, between two unbiased estimators one prefers the estimator which has the minimum variance. This leads to our next topic. 
However, before we move to the next topic we complete this section with some known disadvantages with the notion of unbiasedness. The first disadvantage is that an unbiased estimator for a parameter may not exist. The second disadvantage is that the property of unbiasedness is not invariant under functional transformation, that is, if θZ is an unbiased estimator of θ and g is a function, Z may not be an unbiased estimator of g(θ). then g(θ) 16.2. The Relatively Efficient Estimator We have seen that in Example 16.6 that the sample mean X= and the statistic Y = X1 + X2 + · · · + Xn n X1 + 2X2 + · · · + nXn 1 + 2 + ··· + n Criteria for Evaluating the Goodness of Estimators 456 are both unbiased estimators of the population mean. However, we also seen that ! " V ar X < V ar(Y ). The following figure graphically illustrates the shape of the distributions of both the unbiased estimators. m m If an unbiased estimator has a smaller variance or dispersion, then it has a greater chance of being close to true parameter θ. Therefore when two estimators of θ are both unbiased, then one should pick the one with the smaller variance. Definition 16.2. Let θZ1 and θZ2 be two unbiased estimators of θ. The estimator θZ1 is said to be more efficient than θZ2 if The ratio η given by , , V ar θZ1 < V ar θZ2 . , η θZ1 , θZ2 - , V ar θZ2 , = V ar θZ1 is called the relative efficiency of θZ1 with respect to θZ2 . Example 16.7. Let X1 , X2 , X3 be a random sample of size 3 from a population with mean µ and variance σ 2 > 0. If the statistics X and Y given by X1 + 2X2 + 3X3 Y = 6 are two unbiased estimators of the population mean µ, then which one is more efficient? Probability and Mathematical Statistics 457 Answer: Since E (Xi ) = µ and V ar (Xi ) = σ 2 , we get ! " E X =E # X1 + X2 + X3 3 $ 1 (E (X1 ) + E (X2 ) + E (X3 )) 3 1 = 3µ 3 =µ = and E (Y ) = E # X1 + 2X2 + 3X3 6 $ 1 (E (X1 ) + 2E (X2 ) + 3E (X3 )) 6 1 = 6µ 6 = µ. = Therefore both X and Y are unbiased. Next we determine the variance of both the estimators. The variances of these estimators are given by ! " V ar X = V ar # X1 + X2 + X3 3 $ 1 [V ar (X1 ) + V ar (X2 ) + V ar (X3 )] 9 1 = 3σ 2 9 12 2 σ = 36 = and V ar (Y ) = V ar # X1 + 2X2 + 3X3 6 $ 1 [V ar (X1 ) + 4V ar (X2 ) + 9V ar (X3 )] 36 1 = 14σ 2 36 14 2 = σ . 36 = Therefore ! " 12 2 14 2 σ = V ar X < V ar (Y ) = σ . 36 36 Criteria for Evaluating the Goodness of Estimators 458 Hence, X is more efficient than the estimator Y . Further, the relative efficiency of X with respect to Y is given by ! " 14 7 η X, Y = = . 12 6 Example 16.8. Let X1 , X2 , ..., Xn be a random sample of size n from a population with density  x  θ1 e− θ if 0 ≤ x < ∞ f (x; θ) =  0 otherwise, where θ > 0 is a parameter. Are the estimators X1 and X unbiased? Given, X1 and X, which one is more efficient estimator of θ ? Answer: Since the population X is exponential with parameter θ, that is X ∼ EXP (θ), the mean and variance of it are given by E(X) = θ and V ar(X) = θ2 . Since X1 , X2 , ..., Xn is a random sample from X, we see that the statistic X1 ∼ EXP (θ). Hence, the expected value of X1 is θ and thus it is an unbiased estimator of the parameter θ. Also, the sample mean is an unbiased estimator of θ since n ! " 1% E (Xi ) E X = n i=1 1 nθ n = θ. = Next, we compute the variances of the unbiased estimators X1 and X. It is easy to see that V ar (X1 ) = θ2 and ! " V ar X = V ar # X1 + X2 + · · · + Xn n n % 1 = 2 V ar (Xi ) n i=1 1 nθ2 n2 θ2 = . n = $ Probability and Mathematical Statistics Hence 459 ! 
" θ2 = V ar X < V ar (X1 ) = θ2 . n Thus X is more efficient than X1 and the relative efficiency of X with respect to X1 is θ2 η(X, X1 ) = θ2 = n. n Example 16.9. Let X1 , X2 , X3 be a random sample of size 3 from a population with density f (x; λ) =  x −λ e  λ x!  if x = 0, 1, 2, ..., ∞ 0 otherwise, where λ is a parameter. Are the estimators given by [1 = 1 (X1 + 2X2 + X3 ) λ 4 and [2 = 1 (4X1 + 3X2 + 2X3 ) λ 9 [1 and λ [2 , which one is more efficient estimator of λ ? unbiased? Given, λ Find an unbiased estimator of λ whose variance is smaller than the variances [1 and λ [2 . of λ Answer: Since each Xi ∼ P OI(λ), we get E (Xi ) = λ and V ar (Xi ) = λ. It is easy to see that , - 1 [1 = (E (X1 ) + 2E (X2 ) + E (X3 )) E λ 4 1 = 4λ 4 = λ, and , - 1 [2 = (4E (X1 ) + 3E (X2 ) + 2E (X3 )) E λ 9 1 = 9λ 9 = λ. Criteria for Evaluating the Goodness of Estimators 460 [1 and λ [2 are unbiased estimators of λ. Now we compute their Thus, both λ variances to find out which one is more efficient. It is easy to note that , [1 = 1 (V ar (X1 ) + 4V ar (X2 ) + V ar (X3 )) V ar λ 16 1 = 6λ 16 6 = λ 16 486 = λ, 1296 and , [2 = 1 (16V ar (X1 ) + 9V ar (X2 ) + 4V ar (X3 )) V ar λ 81 1 = 29λ 81 29 = λ 81 464 λ, = 1296 Since, , , [2 < V ar λ [1 , V ar λ [2 is efficient than the estimator λ [1 . We have seen in section the estimator λ 16.1 that the sample mean is always an unbiased estimator of the population mean irrespective of the population distribution. The variance of the sample mean is always equals to n1 times the population variance, where n denotes the sample size. Hence, we get Therefore, we get ! " λ 432 V ar X = = λ. 3 1296 , , ! " [2 < V ar λ [1 . V ar X < V ar λ Thus, the sample mean has even smaller variance than the two unbiased estimators given in this example. In view of this example, now we have encountered a new problem. That is how to find an unbiased estimator which has the smallest variance among all unbiased estimators of a given parameter. We resolve this issue in the next section. Probability and Mathematical Statistics 461 16.3. The Uniform Minimum Variance Unbiased Estimator Let X1 , X2 , ..., Xn be a random sample of size n from a population with probability density function f (x; θ). Recall that an estimator θZ of θ is a function of the random variables X1 , X2 , ..., Xn which does depend on θ. Definition 16.3. An unbiased estimator θZ of θ is said to be a uniform minimum variance unbiased estimator of θ if and only if , , V ar θZ ≤ V ar TZ for any unbiased estimator TZ of θ. If an estimator θZ is unbiased then the mean of this estimator is equal to the parameter θ, that is , E θZ = θ and the variance of θZ is 2, , --2 3 , V ar θZ = E θZ − E θZ 2, -2 3 = E θZ − θ . This variance, if exists, is a function of the unbiased estimator θZ and it has a minimum in the class of all unbiased estimators of θ. Therefore we have an alternative definition of the uniform minimum variance unbiased estimator. Definition 16.4. An unbiased estimator θZ of θ is said to be a uniform minimum unbiased estimator of θ if it minimizes the variance 2, 3 -variance 2 Z E θ−θ . Z Z Example , - 16.10. Let , -θ1 and θ2 be unbiased , -estimators of θ. Suppose V ar θZ1 = 1, V ar θZ2 = 2 and Cov θZ1 , θZ2 = 12 . What are the values of c1 and c2 for which c1 θZ1 + c2 θZ2 is an unbiased estimator of θ with minimum variance among unbiased estimators of this type? Answer: We want c1 θZ1 + c2 θZ2 to be a minimum variance unbiased estimator of θ. 
Then A B E c1 θZ1 + c2 θZ2 = θ A B A B ⇒ c1 E θZ1 + c2 E θZ2 = θ ⇒ c1 θ + c2 θ = θ ⇒ ⇒ c1 + c2 = 1 c2 = 1 − c1 . Criteria for Evaluating the Goodness of Estimators 462 Therefore A B A B A B , V ar c1 θZ1 + c2 θZ2 = c21 V ar θZ1 + c22 V ar θZ2 + 2 c1 c2 Cov θZ1 , θZ1 = c21 + 2c22 + c1 c2 = c21 + 2(1 − c1 )2 + c1 (1 − c1 ) = 2(1 − c1 )2 + c1 = 2 + 2c21 − 3c1 . A B Hence, the variance V ar c1 θZ1 + c2 θZ2 is a function of c1 . Let us denote this function by φ(c1 ), that is A B φ(c1 ) := V ar c1 θZ1 + c2 θZ2 = 2 + 2c21 − 3c1 . Taking the derivative of φ(c1 ) with respect to c1 , we get d φ(c1 ) = 4c1 − 3. dc1 Setting this derivative to zero and solving for c1 , we obtain 4c1 − 3 = 0 ⇒ Therefore c2 = 1 − c1 = 1 − c1 = 3 . 4 3 1 = . 4 4 In Example 16.10, we saw that if θZ1 and θZ2 are any two unbiased estimators of θ, then c θZ1 + (1 − c) θZ2 is also an unbiased estimator of θ for any c ∈ R. I Hence given two estimators θZ1 and θZ2 , J S C = θZ | θZ = c θZ1 + (1 − c) θZ2 , c ∈ R I forms an uncountable class of unbiased estimators of θ. When the variances of θZ1 and θZ2 are known along with the their covariance, then in Example 16.10 we were able to determine the minimum variance unbiased estimator in the class C. If the variances of the estimators θZ1 and θZ2 are not known, then it is very difficult to find the minimum variance estimator even in the class of estimators C. Notice that C is a subset of the class of all unbiased estimators and finding a minimum variance unbiased estimator in this class is a difficult task. Probability and Mathematical Statistics 463 One way to find a uniform minimum variance unbiased estimator for a parameter is to use the Cramér-Rao lower bound or the Fisher information inequality. Theorem 16.1. Let X1 , X2 , ..., Xn be a random sample of size n from a population X with probability density f (x; θ), where θ is a scalar parameter. Let θZ be any unbiased estimator of θ. Suppose the likelihood function L(θ) is a differentiable function of θ and satisfies : ∞ : ∞ d h(x1 , ..., xn ) L(θ) dx1 · · · dxn ··· dθ −∞ −∞ (1) : ∞ : ∞ d = ··· h(x1 , ..., xn ) L(θ) dx1 · · · dxn dθ −∞ −∞ for any h(x1 , ..., xn ) with E(h(X1 , ..., Xn )) < ∞. Then , V ar θZ ≥ E 2, 1 ∂ ln L(θ) ∂θ -2 3 . (CR1) Proof: Since L(θ) is the joint probability density function of the sample X1 , X2 , ..., Xn , : ∞ : ∞ L(θ) dx1 · · · dxn = 1. (2) ··· −∞ −∞ Differentiating (2) with respect to θ we have d dθ : ∞ −∞ ··· : ∞ −∞ L(θ) dx1 · · · dxn = 0 and use of (1) with h(x1 , ..., xn ) = 1 yields : ∞ −∞ ··· : ∞ −∞ d L(θ) dx1 · · · dxn = 0. dθ Rewriting (3) as : ∞ −∞ we see that : ··· ∞ −∞ : ··· ∞ −∞ : dL(θ) 1 L(θ) dx1 · · · dxn = 0 dθ L(θ) ∞ −∞ d ln L(θ) L(θ) dx1 · · · dxn = 0. dθ (3) Criteria for Evaluating the Goodness of Estimators Hence : ∞ −∞ : ··· ∞ d ln L(θ) L(θ) dx1 · · · dxn = 0. dθ θ −∞ Since θZ is an unbiased estimator of θ, we see that , - : E θZ = ∞ −∞ ··· : ∞ −∞ θZ L(θ) dx1 · · · dxn = θ. 464 (4) (5) Differentiating (5) with respect to θ, we have d dθ : ∞ −∞ ··· : ∞ −∞ θZ L(θ) dx1 · · · dxn = 1. Z we have Again using (1) with h(X1 , ..., Xn ) = θ, : ∞ −∞ : Rewriting (6) as : ∞ −∞ we have : ··· ∞ : −∞ ∞ −∞ ··· : ∞ d L(θ) dx1 · · · dxn = 1. θZ dθ −∞ ··· (6) dL(θ) 1 θZ L(θ) dx1 · · · dxn = 1 dθ L(θ) ∞ d ln L(θ) L(θ) dx1 · · · dxn = 1. θZ dθ −∞ (7) From (4) and (7), we obtain : ∞ −∞ ··· : ∞ −∞ , θZ − θ - d ln L(θ) L(θ) dx1 · · · dxn = 1. 
dθ By the Cauchy-Schwarz inequality, $2 - d ln L(θ) L(θ) dx1 · · · dxn dθ −∞ −∞ $ #: ∞ : ∞, -2 L(θ) dx1 · · · dxn θZ − θ ··· ≤ −∞ −∞ + ): $2 : ∞# ∞ d ln L(θ) · L(θ) dx1 · · · dxn ··· dθ −∞ −∞ /# $2 0 , ∂ ln L(θ) = V ar θZ E . ∂θ 1= #: ∞ ··· : ∞ , θZ − θ (8) Probability and Mathematical Statistics Therefore 465 , V ar θZ ≥ E 2, 1 ∂ ln L(θ) ∂θ and the proof of theorem is now complete. -2 3 If L(θ) is twice differentiable with respect to θ, the inequality (CR1) can be stated equivalently as , V ar θZ ≥ E A −1 ∂ 2 ln L(θ) ∂θ 2 B. (CR2) The inequalities (CR1) and (CR2) are known as Cramér-Rao lower bound for the variance of θZ or the Fisher information inequality. The condition (1) interchanges the order on integration and differentiation. Therefore any distribution whose range depend on the value of the parameter is not covered by this theorem. Hence distribution like the uniform distribution may not be analyzed using the Cramér-Rao lower bound. If the estimator θZ is minimum variance in addition to being unbiased, then equality holds. We state this as a theorem without giving a proof. Theorem 16.2. Let X1 , X2 , ..., Xn be a random sample of size n from a population X with probability density f (x; θ), where θ is a parameter. If θZ is an unbiased estimator of θ and , V ar θZ = E 2, 1 ∂ ln L(θ) ∂θ -2 3 , then θZ is a uniform minimum variance unbiased estimator of θ. The converse of this is not true. Definition 16.5. An unbiased estimator θZ is called an efficient estimator if it satisfies Cramér-Rao lower bound, that is , V ar θZ = E 2, 1 ∂ ln L(θ) ∂θ -2 3 . In view of the above theorem it is easy to note that an efficient estimator of a parameter is always a uniform minimum variance unbiased estimator of Criteria for Evaluating the Goodness of Estimators 466 a parameter. However, not every uniform minimum variance unbiased estimator of a parameter is efficient. In other words not every uniform minimum variance unbiased estimators of a parameter satisfy the Cramér-Rao lower bound , 1 V ar θZ ≥ 2, -2 3 . E ∂ ln∂θL(θ) Example 16.11. Let X1 , X2 , ..., Xn be a random sample of size n from a distribution with density function   3θ x2 e−θx3 if 0 < x < ∞ f (x; θ) =  0 otherwise. What is the Cramér-Rao lower bound for the variance of unbiased estimator of the parameter θ ? Answer: Let θZ be an unbiased estimator of θ. Cramér-Rao lower bound for the variance of θZ is given by , V ar θZ ≥ E A −1 ∂ 2 ln L(θ) ∂θ 2 B, where L(θ) denotes the likelihood function of the given random sample X1 , X2 , ..., Xn . Since, the likelihood function of the sample is L(θ) = n P 3 3θ x2i e−θxi i=1 we get ln L(θ) = n ln θ + n % i=1 n % " ! x3i . ln 3x2i − θ i=1 n % ∂ ln L(θ) n x3 , = − ∂θ θ i=1 i and ∂ 2 ln L(θ) n = − 2. 2 ∂θ θ Hence, using this in the Cramér-Rao inequality, we get , - θ2 V ar θZ ≥ . n Probability and Mathematical Statistics 467 Thus the Cramér-Rao lower bound for the variance of the unbiased estimator 2 of θ is θn . Example 16.12. Let X1 , X2 , ..., Xn be a random sample from a normal population with unknown mean µ and known variance σ 2 > 0. What is the maximum likelihood estimator of µ? Is this maximum likelihood estimator an efficient estimator of µ? Answer: The probability density function of the population is f (x; µ) = √ Thus and hence 1 2π σ 2 1 e− 2σ2 (x−µ)2 . 1 1 ln f (x; µ) = − ln(2πσ 2 ) − 2 (x − µ)2 2 2σ n n 1 % 2 ln L(µ) = − ln(2πσ ) − 2 (xi − µ)2 . 2 2σ i=1 Taking the derivative of ln L(µ) with respect to µ, we get n 1 % d ln L(µ) = 2 (xi − µ). 
dµ σ i=1 Setting this derivative to zero and solving for µ, we see that µ Z = X. The variance of X is given by # $ ! " X1 + X2 + · · · + Xn V ar X = V ar n σ2 . = n Next we determine the Cramér-Rao lower bound for the estimator X. We already know that n d ln L(µ) 1 % (xi − µ) = 2 dµ σ i=1 and hence n d2 ln L(µ) = − 2. 2 dµ σ Therefore E # d2 ln L(µ) dµ2 $ =− n σ2 Criteria for Evaluating the Goodness of Estimators and − Thus E , 1 d2 ln L(µ) dµ2 ! " V ar X = − E 468 -= , σ2 . n 1 d2 ln L(µ) dµ2 - and X is an efficient estimator of µ. Since every efficient estimator is a uniform minimum variance unbiased estimator, therefore X is a uniform minimum variance unbiased estimator of µ. Example 16.13. Let X1 , X2 , ..., Xn be a random sample from a normal population with known mean µ and unknown variance σ 2 > 0. What is the maximum likelihood estimator of σ 2 ? Is this maximum likelihood estimator a uniform minimum variance unbiased estimator of σ 2 ? Answer: Let us write θ = σ 2 . Then f (x; θ) = √ and ln L(θ) = − 2 1 1 e− 2θ (x−µ) 2πθ n n n 1 % ln(2π) − ln(θ) − (xi − µ)2 . 2 2 2θ i=1 Differentiating ln L(θ) with respect to θ, we have n n 1 1 % d (xi − µ)2 ln L(θ) = − + 2 dθ 2 θ 2θ i=1 Setting this derivative to zero and solving for θ, we see that n 1 % θZ = (Xi − µ)2 . n i=1 Next we show that this estimator is unbiased. For this we consider + ) n , 1 % 2 Z E θ =E (Xi − µ) n i=1 ) n # + % X i − µ $2 σ2 E = n σ i=1 θ E( χ2 (n) ) n θ = n = θ. n = Probability and Mathematical Statistics 469 Hence θZ is an unbiased estimator of θ. The variance of θZ can be obtained as follows: + ) n , 1 % 2 Z (Xi − µ) V ar θ = V ar n i=1 + ) n # % X i − µ $2 σ4 = V ar n σ i=1 θ2 V ar( χ2 (n) ) n2 n θ2 = 2 4 n 2 2θ2 2σ 4 = = . n n = Z The Finally we determine the Cramér-Rao lower bound for the variance of θ. second derivative of ln L(θ) with respect to θ is n d2 ln L(θ) n 1 % = − (xi − µ)2 . dθ2 2θ2 θ3 i=1 Hence E # d2 ln L(θ) dθ2 $ 1 n = 2 − 3E 2θ θ n − 2θ2 n = 2− 2θ n =− 2 2θ = Thus − Therefore E , 1 d2 ln L(θ) dθ 2 E i=1 2 (Xi − µ) + " θ ! 2 E χ (n) 3 θ n θ2 -= , V ar θZ = − ) n % , 2σ 4 θ2 = . n n 1 d2 ln L(θ) dθ 2 -. Hence θZ is an efficient estimator of θ. Since every efficient estimator is a 1n 2 uniform minimum variance unbiased estimator, therefore n1 i=1 (Xi − µ) is a uniform minimum variance unbiased estimator of σ 2 . Example 16.14. Let X1 , X2 , ..., Xn be a random sample of size n from a normal population known mean µ and variance σ 2 > 0. Show that S 2 = Criteria for Evaluating the Goodness of Estimators 1 n−1 470 1n − X)2 is an unbiased estimator of σ 2 . Further, show that S 2 can not attain the Cramér-Rao lower bound. i=1 (Xi Answer: From Example 16.2, we know that S 2 is an unbiased estimator of σ 2 . The variance of S 2 can be computed as follows: + ) n ! 2" 1 % 2 (Xi − X) V ar S = V ar n − 1 i=1 + ) n # % X i − X $2 σ4 V ar = (n − 1)2 σ i=1 σ4 V ar( χ2 (n − 1) ) (n − 1)2 σ4 = 2 (n − 1) (n − 1)2 2σ 4 . = n−1 = Next we let θ = σ 2 and determine the Cramér-Rao lower bound for the variance of S 2 . The second derivative of ln L(θ) with respect to θ is n n 1 % d2 ln L(θ) (xi − µ)2 . = 2− 3 dθ2 2θ θ i=1 Hence E # d2 ln L(θ) dθ2 $ 1 n = 2 − 3E 2θ θ n − 2θ2 n = 2− 2θ n =− 2 2θ = Thus − Hence E , 1 d2 ln L(θ) dθ 2 ) n % i=1 2 (Xi − µ) + " θ ! 2 E χ (n) θ3 n θ2 -= θ2 2σ 4 = . n n ! " 1 2σ 4 2σ 4 -= = V ar S 2 > − , 2 . n−1 n E d ln L(θ) 2 dθ This shows that S 2 can not attain the Cramér-Rao lower bound. 
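The variance formulas in Examples 16.13 and 16.14, and the Cramér-Rao bound itself, can also be checked empirically. The sketch below assumes NumPy; the values of µ, σ, n and the replication count are arbitrary illustrative choices.

    # Empirical check of Examples 16.13 and 16.14: with mu known, the estimator
    # (1/n) * sum (X_i - mu)^2 attains the Cramer-Rao bound 2*sigma^4/n, while
    # S^2 has the larger variance 2*sigma^4/(n-1).
    import numpy as np

    rng = np.random.default_rng(3)
    mu, sigma, n, reps = 0.0, 2.0, 25, 200_000

    x = rng.normal(mu, sigma, size=(reps, n))
    known_mu_est = np.mean((x - mu) ** 2, axis=1)   # uses the known mean
    s2 = x.var(axis=1, ddof=1)                      # uses the sample mean

    cr_bound = 2 * sigma ** 4 / n
    print("Cramer-Rao bound:", cr_bound)                          # 1.28
    print("variance of known-mean estimator:", known_mu_est.var())  # about 1.28
    print("variance of S^2:", s2.var())                           # about 2*sigma^4/(n-1) = 1.33

The known-mean estimator's empirical variance sits at the bound 2σ⁴/n = 1/(n I(σ²)), consistent with the Fisher information entry I₂₂ = 1/(2σ⁴) computed in Example 15.19, while the variance of S² is close to the larger value 2σ⁴/(n−1).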
Probability and Mathematical Statistics 471 The disadvantages of Cramér-Rao lower bound approach are the followings: (1) Not every density function f (x; θ) satisfies the assumptions of Cramér-Rao theorem and (2) not every allowable estimator attains the Cramér-Rao lower bound. Hence in any one of these situations, one does not know whether an estimator is a uniform minimum variance unbiased estimator or not. 16.4. Sufficient Estimator In many situations, we can not easily find the distribution of the estimator θZ of a parameter θ even though we know the distribution of the population. Therefore, we have no way to know whether our estimator θZ is unbiased or biased. Hence, we need some other criteria to judge the quality of an estimator. Sufficiency is one such criteria for judging the quality of an estimator. Recall that an estimator of a population parameter is a function of the sample values that does not contain the parameter. An estimator summarizes the information found in the sample about the parameter. If an estimator summarizes just as much information about the parameter being estimated as the sample does, then the estimator is called a sufficient estimator. Definition 16.6. Let X ∼ f (x; θ) be a population and let X1 , X2 , ..., Xn be a random sample of size n from this population X. An estimator θZ of the parameter θ is said to be a sufficient estimator of θ if the conditional distribution of the sample given the estimator θZ does not depend on the parameter θ. Example 16.15. If X1 , X2 , ..., Xn is a random sample from the distribution with probability density function   θx (1 − θ)1−x if x = 0, 1 f (x; θ) =  0 elsewhere , 1n where 0 < θ < 1. Show that Y = i=1 Xi is a sufficient statistic of θ. Answer: First, we find the distribution of the sample. This is given by f (x1 , x2 , ..., xn ) = n P θxi (1 − θ)1−xi = θy (1 − θ)n−y . n % Xi ∼ BIN (n, θ). i=1 Since, each Xi ∼ BER(θ), we have Y = i=1 Criteria for Evaluating the Goodness of Estimators 472 Therefore, the probability density function of Y is given by # $ n y θ (1 − θ)n−y . g(y) = y Further, since each Xi ∼ BER(θ), the space of each Xi is given by RXi = { 0, 1 }. Therefore, the space of the random variable Y = 1n i=1 Xi is given by RY = { 0, 1, 2, 3, 4, ..., n }. Let A be the event (X1 = x1 , X2 = x2 , ..., Xn = xn ) and B denotes the event K (Y = y). Then A ⊂ B and therefore A B = A. Now, we find the conditional density of the sample given the estimator Y , that is f (x1 , x2 , ..., xn /Y = y) = P (X1 = x1 , X2 = x2 , ..., Xn = xn /Y = y) = P (A/B) K P (A B) = P (B) P (A) = P (B) f (x1 , x2 , ..., xn ) = g(y) y θ (1 − θ)n−y = ! n" y n−y y θ (1 − θ) 1 = ! n" . y Hence, the conditional density of the sample given the statistic Y is independent of the parameter θ. Therefore, by definition Y is a sufficient statistic. Example 16.16. If X1 , X2 , ..., Xn is a random sample from the distribution with probability density function   e−(x−θ) if θ < x < ∞ f (x; θ) =  0 elsewhere , where −∞ < θ < ∞. What is the maximum likelihood estimator of θ ? Is this maximum likelihood estimator sufficient estimator of θ ? Probability and Mathematical Statistics 473 Answer: We have seen in Chapter 15 that the maximum likelihood estimator of θ is Y = X(1) , that is the first order statistic of the sample. Let us find the probability density of this statistic, which is given by g(y) = n! 0 n−1 [F (y)] f (y) [1 − F (y)] (n − 1)! n−1 = n f (y) [1 − F (y)] A J SBn−1 = n e−(y−θ) 1 − 1 − e−(y−θ) = n enθ e−ny . 
The probability density of the random sample is f (x1 , x2 , ..., xn ) = n P i=1 nθ =e e−(xi −θ) e−n x , n % xi . Let A be the event (X1 = x1 , X2 = x2 , ..., Xn = xn ) and K B denotes the event (Y = y). Then A ⊂ B and therefore A B = A. Now, we find the conditional density of the sample given the estimator Y , that is where nx = i=1 f (x1 , x2 , ..., xn /Y = y) = P (X1 = x1 , X2 = x2 , ..., Xn = xn /Y = y) = P (A/B) K P (A B) = P (B) P (A) = P (B) f (x1 , x2 , ..., xn ) = g(y) enθ e−n x n enθ e−n y e−n x = . n e−n y = Hence, the conditional density of the sample given the statistic Y is independent of the parameter θ. Therefore, by definition Y is a sufficient statistic. Criteria for Evaluating the Goodness of Estimators 474 We have seen that to verify whether an estimator is sufficient or not one has to examine the conditional density of the sample given the estimator. To compute this conditional density one has to use the density of the estimator. The density of the estimator is not always easy to find. Therefore, verifying the sufficiency of an estimator using this definition is not always easy. The following factorization theorem of Fisher and Neyman helps to decide when an estimator is sufficient. Theorem 16.3. Let X1 , X2 , ..., Xn denote a random sample with probability density function f (x1 , x2 , ..., xn ; θ), which depends on the population parameter θ. The estimator θZ is sufficient for θ if and only if Z θ) h(x1 , x2 , ..., xn ) f (x1 , x2 , ..., xn ; θ) = φ(θ, where φ depends on x1 , x2 , ..., xn only through θZ and h(x1 , x2 , ..., xn ) does not depend on θ. Now we give two examples to illustrate the factorization theorem. Example 16.17. Let X1 , X2 , ..., Xn be a random sample from a distribution with density function f (x; λ) =  x −λ e  λ x!  if x = 0, 1, 2, ..., ∞ 0 elsewhere, where λ > 0 is a parameter. Find the maximum likelihood estimator of λ and show that the maximum likelihood estimator of λ is sufficient estimator of the parameter λ. Answer: First, we find the density of the sample or the likelihood function of the sample. The likelihood function of the sample is given by L(λ) = = n P i=1 n P i=1 = f (xi ; λ) λxi e−λ xi ! λnX e−nλ . n P (xi !) i=1 Probability and Mathematical Statistics 475 Taking the logarithm of the likelihood function, we get ln L(λ) = nx ln λ − nλ − ln Therefore n P (xi !). i=1 d 1 ln L(λ) = nx − n. dλ λ Setting this derivative to zero and solving for λ, we get λ = x. The second derivative test assures us that the above λ is a maximum. Hence, the maximum likelihood estimator of λ is the sample mean X. Next, we show that X is sufficient, by using the Factorization Theorem of Fisher and Neyman. We factor the joint density of the sample as L(λ) = λnx e−nλ n P (xi !) i=1 < ; = λnx e−nλ 1 n P (xi !) i=1 = φ(X, λ) h (x1 , x2 , ..., xn ) . Therefore, the estimator X is a sufficient estimator of λ. Example 16.18. Let X1 , X2 , ..., Xn be a random sample from a normal distribution with density function 2 1 1 f (x; µ) = √ e− 2 (x−µ) , 2π where −∞ < µ < ∞ is a parameter. Find the maximum likelihood estimator of µ and show that the maximum likelihood estimator of µ is a sufficient estimator. Answer: We know that the maximum likelihood estimator of µ is the sample mean X. Next, we show that this maximum likelihood estimator X is a Criteria for Evaluating the Goodness of Estimators 476 sufficient estimator of µ. 
The joint density of the sample is given by f (x1 , x2 , ...,xn ; µ) n P f (xi ; µ) = i=1 n P = 2 1 1 √ e− 2 (xi −µ) 2π i=1 = # = # = # = # = # 1 √ 2π $n 1 √ 2π $n 1 √ 2π $n 1 √ 2π $n 1 √ 2π $n − 21 n % (xi − µ)2 − 21 n % [(xi − x) + (x − µ)] − 21 n % < ; (xi − x)2 + 2(xi − x)(x − µ) + (x − µ)2 − 21 n % ; < (xi − x)2 + (x − µ)2 e e e e i=1 i=1 2 i=1 i=1 2 −n 2 (x−µ) e − 12 e n % i=1 (xi − x)2 Hence, by the Factorization Theorem, X is a sufficient estimator of the population mean. Note that the probability density function of the Example 16.17 which is f (x; λ) = can be written as  x −λ e  λ x!  0 if x = 0, 1, 2, ..., ∞ elsewhere , f (x; λ) = e{x ln λ−ln x!−λ} for x = 0, 1, 2, ... This density function is of the form f (x; λ) = e{K(x)A(λ)+S(x)+B(λ)} . Similarly, the probability density function of the Example 16.12, which is 2 1 1 f (x; µ) = √ e− 2 (x−µ) 2π Probability and Mathematical Statistics 477 can also be written as f (x; µ) = e{xµ− x2 2 2 − µ2 − 21 ln(2π)} . This probability density function is of the form f (x; µ) = e{K(x)A(µ)+S(x)+B(µ)} . We have also seen that in both the examples, the sufficient estimators were n % the sample mean X, which can be written as n1 Xi . i=1 Our next theorem gives a general result in this direction. The following theorem is known as the Pitman-Koopman theorem. Theorem 16.4. Let X1 , X2 , ..., Xn be a random sample from a distribution with probability density function of the exponential form f (x; θ) = e{K(x)A(θ)+S(x)+B(θ)} on a support free of θ. Then the statistic n % K(Xi ) is a sufficient statistic i=1 for the parameter θ. Proof: The joint density of the sample is f (x1 , x2 , ..., xn ; θ) = = n P i=1 n P f (xi ; θ) e{K(xi )A(θ)+S(xi )+B(θ)} i=1 & n % i=1 =e & n % =e K(xi )A(θ) + n % ' S(xi ) + n B(θ) i=1 ' / K(xi )A(θ) + n B(θ) i=1 Hence by the Factorization Theorem the estimator e n % i=1 statistic for the parameter θ. This completes the proof. n % i=1 S(xi ) 0 . K(Xi ) is a sufficient Criteria for Evaluating the Goodness of Estimators 478 Example 16.19. Let X1 , X2 , ..., Xn be a random sample from a distribution with density function   θ xθ−1 for 0 < x < 1 f (x; θ) =  0 otherwise, where θ > 0 is a parameter. Using the Pitman-Koopman Theorem find a sufficient estimator of θ. Answer: The Pitman-Koopman Theorem says that if the probability density function can be expressed in the form of f (x; θ) = e{K(x)A(θ)+S(x)+B(θ)} 1n then i=1 K(Xi ) is a sufficient statistic for θ. The given population density can be written as f (x; θ) = θ xθ−1 = e{ln[θ x θ−1 ] = e{ln θ+(θ−1) ln x} . Thus, K(x) = ln x A(θ) = θ S(x) = − ln x B(θ) = ln θ. Hence by Pitman-Koopman Theorem, n % K(Xi ) = n % ln Xi i=1 i=1 = ln n P Xi . i=1 Thus ln On i=1 Xi is a sufficient statistic for θ. n P Xi is also a sufficient statistic of θ, since Remark 16.1. Notice that i=1 + ) n n P P Xi . Xi , we also know knowing ln i=1 i=1 Example 16.20. Let X1 , X2 , ..., Xn be a random sample from a distribution with density function  x  θ1 e− θ for 0 < x < ∞ f (x; θ) =  0 otherwise, Probability and Mathematical Statistics 479 where 0 < θ < ∞ is a parameter. Find a sufficient estimator of θ. Answer: First, we rewrite the population density in the exponential form. That is 1 x f (x; θ) = e− θ θ ; < −x ln 1 e θ =e θ x = e− ln θ− θ . Hence 1 θ B(θ) = − ln θ. A(θ) = − K(x) = x S(x) = 0 Hence by Pitman-Koopman Theorem, n % i=1 K(Xi ) = n % Xi = n X. i=1 Thus, nX is a sufficient statistic for θ. Since knowing nX, we also know X, the estimator X is also a sufficient estimator of θ. 
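The defining property of sufficiency in Definition 16.6 can also be checked by simulation. The following sketch revisits Example 16.15: conditional on Y = X1 + · · · + Xn = y, every arrangement of the Bernoulli sample occurs with probability 1/C(n, y), whatever the value of θ. The choices n = 3, y = 2 and the two values of θ below are arbitrary.

```python
from itertools import product
from math import comb
import numpy as np

# Simulation sketch of Definition 16.6 for Example 16.15: given Y = y,
# the sample pattern is uniform over the C(n, y) arrangements and does
# not depend on theta.  (n = 3, y = 2, theta in {0.3, 0.8} are arbitrary.)
n, y, reps = 3, 2, 200_000
rng = np.random.default_rng(seed=1)
patterns = [p for p in product((0, 1), repeat=n) if sum(p) == y]

for theta in (0.3, 0.8):
    draws = (rng.random((reps, n)) < theta).astype(int)
    given_y = draws[draws.sum(axis=1) == y]      # keep samples with Y = y
    counts = {p: 0 for p in patterns}
    for row in map(tuple, given_y):
        counts[row] += 1
    total = sum(counts.values())
    print(f"theta = {theta}:",
          {p: round(c / total, 3) for p, c in counts.items()})

print("common theoretical value 1/C(n, y) =", round(1 / comb(n, y), 3))
```

For both values of θ the three conditional frequencies come out near 1/3, which is the point of the factorization argument.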
Example 16.21. Let X1 , X2 , ..., Xn be a random sample from a distribution with density function   e−(x−θ) for θ < x < ∞ f (x; θ) =  0 otherwise, where −∞ < θ < ∞ is a parameter. Can Pitman-Koopman Theorem be used to find a sufficient statistic for θ? Answer: No. We can not use Pitman-Koopman Theorem to find a sufficient statistic for θ since the domain where the population density is nonzero is not free of θ. Next, we present the connection between the maximum likelihood estimator and the sufficient estimator. If there is a sufficient estimator for the parameter θ and if the maximum likelihood estimator of this θ is unique, then the maximum likelihood estimator is a function of the sufficient estimator. That is θZML = ψ(θZS ), where ψ is a real valued function, θZML is the maximum likelihood estimator of θ, and θZS is the sufficient estimator of θ. Criteria for Evaluating the Goodness of Estimators 480 Similarly, a connection can be established between the uniform minimum variance unbiased estimator and the sufficient estimator of a parameter θ. If there is a sufficient estimator for the parameter θ and if the uniform minimum variance unbiased estimator of this θ is unique, then the uniform minimum variance unbiased estimator is a function of the sufficient estimator. That is θZMVUE = η(θZS ), where η is a real valued function, θZMVUE is the uniform minimum variance unbiased estimator of θ, and θZS is the sufficient estimator of θ. Finally, we may ask “If there are sufficient estimators, why are not there necessary estimators?” In fact, there are. Dynkin (1951) gave the following definition. Definition 16.7. An estimator is said to be a necessary estimator if it can be written as a function of every sufficient estimators. 16.5. Consistent Estimator Let X1 , X2 , ..., Xn be a random sample from a population X with density f (x; θ). Let θZ be an estimator of θ based on the sample of size n. Obviously the estimator depends on the sample size n. In order to reflect the dependency of θZ on n, we denote θZ as θZn . Definition 16.7. Let X1 , X2 , ..., Xn be a random sample from a population X with density f (x; θ). A sequence of estimators {θZn } of θ is said to be consistent for θ if and only if the sequence {θZn } converges in probability to θ, that is, for any ǫ > 0 > ,> > > lim P >θZn − θ> ≥ ǫ = 0. n→∞ Note that consistency is actually a concept relating to a sequence of Z estimators {θZn }∞ n=no but we usually say “consistency of θn ” for simplicity. Further, consistency is a large sample property of an estimator. The following theorem states that if the mean squared error goes to zero as n goes to infinity, then {θZn } converges in probability to θ. Theorem 16.5. Let X1 , X2 , ..., Xn be a random sample from a population X with density f (x; θ) and {θZn } be a sequence of estimators of θ based on the sample. If the variance of θZn exists for each n and is finite and #, -2 $ Z θn − θ =0 lim E n→∞ Probability and Mathematical Statistics then, for any ǫ > 0, 481 > ,> > > lim P >θZn − θ> ≥ ǫ = 0. n→∞ Proof: By Markov Inequality (see Theorem 13.8) we have #, -2 $ [ #, $ θ − θ E n -2 θ[ P ≥ ǫ2 ≤ n−θ ǫ2 for all ǫ > 0. Since the events -2 , ≥ ǫ2 θ[ n−θ and are same, we see that P #, θ[ n−θ -2 for all n ∈ N. I Hence if ≥ ǫ2 $ , - E = P |θ[ − θ| ≥ ǫ ≤ n lim E n→∞ then |θ[ n − θ| ≥ ǫ #, θ[ n−θ -2 $ #, θ[ n−θ ǫ2 -2 $ =0 , lim P |θ[ n − θ| ≥ ǫ = 0 n→∞ and the proof of the theorem is complete. Let , , Z θ = E θZ − θ B θ, , Z θ = 0. Next we show be the biased. 
If an estimator is unbiased, then B θ, that #, -2 $ , - A , -B2 Z Zθ = V ar θZ + B θ, E θ−θ . (1) To see this consider #, #, -2 $ -2 $ =E E θZ2 − 2 θZ θ + θ2 θZ − θ , , = E θZ2 − 2E θZ θ + θ2 , -2 , , -2 , = E θZ2 − E θZ + E θZ − 2E θZ θ + θ2 , , -2 , = V ar θZ + E θZ − 2E θZ θ + θ2 , - A , B2 = V ar θZ + E θZ − θ , - A , -B2 Zθ . = V ar θZ + B θ, Criteria for Evaluating the Goodness of Estimators 482 In view of (1), we can say that if , lim V ar θZn = 0 n→∞ and , lim B θZn , θ = 0 n→∞ then lim E n→∞ #, θZn − θ -2 $ (2) (3) = 0. In other words, to show a sequence of estimators is consistent we have to verify the limits (2) and (3). Example 16.22. Let X1 , X2 , ..., Xn be a random sample from a normal population X with mean µ and variance σ 2 > 0. Is the likelihood estimator n %! "2 [2 = 1 Xi − X . σ n i=1 of σ 2 a consistent estimator of σ 2 ? [2 depends on the sample size n, we denote σ [2 as σ [2 n . Hence Answer: Since σ n %! "2 [2 n = 1 σ Xi − X . n i=1 [2 n is given by The variance of σ , [2 n V ar σ - ) + n "2 1 %! = V ar Xi − X n i=1 # $ 2 1 2 (n − 1)S = 2 V ar σ n σ2 $ # 4 σ (n − 1)S 2 = 2 V ar n σ2 4 ! " σ = 2 V ar χ2 (n − 1) n 2(n − 1)σ 4 = 2 2 n 3 1 1 − 2 σ4 . = n n2 Probability and Mathematical Statistics 483 Hence 2 3 , 1 1 lim V ar θZn = lim − 2 2 σ 4 = 0. n→∞ n→∞ n n , The biased B θZn , θ is given by , , B θZn , θ = E θZn − σ 2 + ) n "2 1 %! − σ2 Xi − X =E n i=1 # $ 2 1 2 (n − 1)S − σ2 = E σ n σ2 " σ2 ! 2 = E χ (n − 1) − σ 2 n (n − 1)σ 2 = − σ2 n σ2 =− . n Thus Hence 1 n n % ! i=1 Xi − X , σ2 lim B θZn , θ = − lim = 0. n→∞ n→∞ n "2 is a consistent estimator of σ 2 . In the last example we saw that the likelihood estimator of variance is a consistent estimator. In general, if the density function f (x; θ) of a population satisfies some mild conditions, then the maximum likelihood estimator of θ is consistent. Similarly, if the density function f (x; θ) of a population satisfies some mild conditions, then the estimator obtained by moment method is also consistent. Since consistency is a large sample property of an estimator, some statisticians suggest that consistency should not be used alone for judging the goodness of an estimator; rather it should be used along with other criteria. 16.6. Review Exercises 1. Let T1 and T2 be estimators of a population parameter θ based upon the " ! same random sample. If Ti ∼ N θ, σi2 i = 1, 2 and if T = bT1 + (1 − b)T2 , then for what value of b, T is a minimum variance unbiased estimator of θ ? Criteria for Evaluating the Goodness of Estimators 484 2. Let X1 , X2 , ..., Xn be a random sample from a distribution with density function 1 − |x| − ∞ < x < ∞, f (x; θ) = e θ 2θ where 0 < θ is a parameter. What is the expected value of the maximum likelihood estimator of θ ? Is this estimator unbiased? 3. Let X1 , X2 , ..., Xn be a random sample from a distribution with density function 1 − |x| − ∞ < x < ∞, f (x; θ) = e θ 2θ where 0 < θ is a parameter. Is the maximum likelihood estimator an efficient estimator of θ? 4. A random sample X1 , X2 , ..., Xn of size n is selected from a normal distribution with variance σ 2 . Let S 2 be the unbiased estimator of σ 2 , and T be the maximum likelihood estimator of σ 2 . If 20T − 19S 2 = 0, then what is the sample size? 5. Suppose X and Y are independent random variables each with density function & 2 x θ2 for 0 < x < θ1 f (x) = 0 otherwise. If k (X + 2Y ) is an unbiased estimator of θ−1 , then what is the value of k? 6. An object of length c is measured by two persons using the same instrument. 
The instrument error has a normal distribution with mean 0 and variance 1. The first person measures the object 25 times, and the average of the measurements is X̄ = 12. The second person measures the objects 36 times, and the average of the measurements is Ȳ = 12.8. To estimate c we use the weighted average a X̄ + b Ȳ as an estimator. Determine the constants a and b such that a X̄ + b Ȳ is the minimum variance unbiased estimator of c and then calculate the minimum variance unbiased estimate of c. 7. Let X1 , X2 , ..., Xn be a random sample from a distribution with probability density function f (x) =  3  3 θ x2 e−θ x  0 for 0 < x < ∞ otherwise, where θ > 0 is an unknown parameter. Find a sufficient statistics for θ. Probability and Mathematical Statistics 485 8. Let X1 , X2 , ..., Xn be a random sample from a Weibull distribution with probability density function f (x) =  x β  θββ xβ−1 e−( θ )  0 if x > 0 otherwise , where θ > 0 and β > 0 are parameters. Find a sufficient statistics for θ with β known, say β = 2. If β is unknown, can you find a single sufficient statistics for θ? 9. Let X1 , X2 be a random sample of size 2 from population with probability density  x  θ1 e− θ if 0 < x < ∞ f (x; θ) =  0 otherwise, √ where θ > 0 is an unknown parameter. If Y = X1 X2 , then what should be the value of the constant k such that kY is an unbiased estimator of the parameter θ ? 10. Let X1 , X2 , ..., Xn be a random sample from a population with probability density function f (x; θ) =   θ1  0 if 0 < x < θ otherwise , where θ > 0 is an unknown parameter. If X denotes the sample mean, then what should be value of the constant k such that kX is an unbiased estimator of θ ? 11. Let X1 , X2 , ..., Xn be a random sample from a population with probability density function f (x; θ) =   θ1  0 if 0 < x < θ otherwise , where θ > 0 is an unknown parameter. If Xmed denotes the sample median, then what should be value of the constant k such that kXmed is an unbiased estimator of θ ? Criteria for Evaluating the Goodness of Estimators 486 12. What do you understand by an unbiased estimator of a parameter θ? What is the basic principle of the maximum likelihood estimation of a parameter θ? What is the basic principle of the Bayesian estimation of a parameter θ? What is the main difference between Bayesian method and likelihood method. 13. Let X1 , X2 , ..., Xn be a random sample from a population X with density function  θ  (1+x) for 0 ≤ x < ∞ θ+1 f (x; θ) =  0 otherwise, where θ > 0 is an unknown parameter. What is a sufficient statistic for the parameter θ? 14. Let X1 , X2 , ..., Xn be a random sample from a population X with density function  x2  x2 e− 2θ2 for 0 ≤ x < ∞ θ f (x; θ) =  0 otherwise, where θ is an unknown parameter. What is a sufficient statistic for the parameter θ? 15. Let X1 , X2 , ..., Xn be a random sample from a distribution with density function   e−(x−θ) for θ < x < ∞ f (x; θ) =  0 otherwise, where −∞ < θ < ∞ is a parameter. What is the maximum likelihood estimator of θ? Find a sufficient statistics of the parameter θ. 16. Let X1 , X2 , ..., Xn be a random sample from a distribution with density function   e−(x−θ) for θ < x < ∞ f (x; θ) =  0 otherwise, where −∞ < θ < ∞ is a parameter. Are the estimators X(1) and X − 1 are unbiased estimators of θ? Which one is more efficient than the other? 17. 
Let X1 , X2 , ..., Xn be a random sample from a population X with density function   θ xθ−1 for 0 ≤ x < 1 f (x; θ) =  0 otherwise, Probability and Mathematical Statistics 487 where θ > 1 is an unknown parameter. What is a sufficient statistic for the parameter θ? 18. Let X1 , X2 , ..., Xn be a random sample from a population X with density function  α  θ α xα−1 e−θx for 0 ≤ x < ∞ f (x; θ) =  0 otherwise, where θ > 0 and α > 0 are parameters. What is a sufficient statistic for the parameter θ for a fixed α? 19. Let X1 , X2 , ..., Xn be a random sample from a population X with density function  αθ  θ(θ+1) for α < x < ∞ x f (x; θ) =  0 otherwise, where θ > 0 and α > 0 are parameters. What is a sufficient statistic for the parameter θ for a fixed α? 20. Let X1 , X2 , ..., Xn be a random sample from a population X with density function  !m"  x θx (1 − θ)m−x for x = 0, 1, 2, ..., m f (x; θ) =  0 otherwise, where 0 < θ < 1 is parameter. Show that unbiased estimator of θ for a fixed m. X m is a uniform minimum variance 21. Let X1 , X2 , ..., Xn be a random sample from a population X with density function   θ xθ−1 for 0 < x < 1 f (x; θ) =  0 otherwise, 1 n where θ > 1 is parameter. Show that − n1 i=1 ln(Xi ) is a uniform minimum 1 variance unbiased estimator of θ . 22. Let X1 , X2 , ..., Xn be a random sample from a uniform population X on the interval [0, θ], where θ > 0 is a parameter. Is the likelihood estimator θZ = X(n) of θ a consistent estimator of θ? 23. Let X1 , X2 , ..., Xn be a random sample from a population X ∼ P OI(λ), where λ > 0 is a parameter. Is the estimator X of λ a consistent estimator of λ? Criteria for Evaluating the Goodness of Estimators 488 24. Let X1 , X2 , ..., Xn be a random sample from a population X having the probability density function f (x; θ) = 9 θ xθ−1 , if 0 < x < 1 0 otherwise, where θ > 0 is a parameter. Is the estimator θZ = moment method, a consistent estimator of θ? X 1−X of θ, obtained by the 25. Let X1 , X2 , ..., Xn be a random sample from a population X having the probability density function  ! n"  x px (1 − p)n−x , if x = 0, 1, 2, ..., n f (x; p) =  0 otherwise, where 0 < p < 1 is a parameter and m is a fixed positive integer. What is the maximum likelihood estimator for p. Is this maximum likelihood estimator for p is an efficient estimator? 26. Let X1 , X2 , ..., Xn be a random sample from a population X having the probability density function f (x; θ) = 9 2 θ2 0 θ − x, if 0 ≤ x ≤ θ otherwise, where θ > 0 is a parameter. Find an estimator for θ using the moment method. 27. A box contains 50 red and blue balls out of which θ are red. A sample of 30 balls is to be selected without replacement. If X denotes the number of red balls in the sample, then find an estimator for θ using the moment method. Probability and Mathematical Statistics 489 Chapter 17 SOME TECHNIQUES FOR FINDING INTERVAL ESTIMATORS FOR PARAMETERS In point estimation we find a value for the parameter θ given a sample data. For example, if X1 , X2 , ..., Xn is a random sample of size n from a population with probability density function =  2 e− 21 (x−θ)2 for x ≥ θ π f (x; θ) =  0 otherwise, then the likelihood function of θ is L(θ) = n P i=1 @ 2 − 1 (xi −θ)2 e 2 , π where x1 ≥ θ, x2 ≥ θ, ..., xn ≥ θ. This likelihood function simplifies to n % 2 3 n2 − 12 (xi − θ)2 2 , e i=1 L(θ) = π where min{x1 , x2 , ..., xn } ≥ θ. 
Taking the natural logarithm of L(θ) and maximizing, we obtain the maximum likelihood estimator of θ as the first order statistic of the sample X1, X2, ..., Xn, that is
\[
\widehat{\theta} = X_{(1)},
\]
where X(1) = min{X1, X2, ..., Xn}.

Suppose the true value of θ is 1. Using the maximum likelihood estimator of θ, we are trying to guess this value of θ based on a random sample. Suppose

X1 = 1.5, X2 = 1.1, X3 = 1.7, X4 = 2.1, X5 = 3.1

is a set of sample data from the above population. Then based on this random sample, we get
\[
\widehat{\theta}_{ML} = X_{(1)} = \min\{1.5,\, 1.1,\, 1.7,\, 2.1,\, 3.1\} = 1.1.
\]
If we take another random sample, say

X1 = 1.8, X2 = 2.1, X3 = 2.5, X4 = 3.1, X5 = 2.6,

then the maximum likelihood estimate of θ will be 1.8 based on this sample. The graph of the density function f(x; θ) for θ = 1 shows that a number close to 1 has a higher chance of being picked by the sampling process than a number that is substantially bigger than 1. Hence, it makes sense that θ should be estimated by the smallest sample value. However, from this example we see that the point estimate of θ is not equal to the true value of θ. Even if we take many random samples, the estimate of θ will rarely equal the actual value of the parameter (a small simulation sketch illustrating this point appears after Definition 17.2 below). Hence, instead of finding a single value for θ, we should report a range of probable values for the parameter θ with a certain degree of confidence. This brings us to the notion of a confidence interval for a parameter.

17.1. Interval Estimators and Confidence Intervals for Parameters

The interval estimation problem can be stated as follows: given a random sample X1, X2, ..., Xn and a probability value 1 − α, find a pair of statistics L = L(X1, X2, ..., Xn) and U = U(X1, X2, ..., Xn) with L ≤ U such that the probability of θ being in the random interval [L, U] is 1 − α. That is,
\[
P(L \le \theta \le U) = 1 - \alpha.
\]
Recall that a sample is a portion of the population usually chosen by the method of random sampling and as such it is a set of random variables X1, X2, ..., Xn with the same probability density function f(x; θ) as the population. Once the sampling is done, we get

X1 = x1, X2 = x2, ..., Xn = xn,

where x1, x2, ..., xn are the sample data.

Definition 17.1. Let X1, X2, ..., Xn be a random sample of size n from a population X with density f(x; θ), where θ is an unknown parameter. The interval estimator of θ is a pair of statistics L = L(X1, X2, ..., Xn) and U = U(X1, X2, ..., Xn) with L ≤ U such that if x1, x2, ..., xn is a set of sample data, then θ belongs to the interval [L(x1, x2, ..., xn), U(x1, x2, ..., xn)].

The interval [l, u] will be called an interval estimate of θ, whereas the random interval [L, U] will denote the interval estimator of θ. Notice that the interval estimator of θ is the random interval [L, U]. Next, we define the 100(1 − α)% confidence interval for the unknown parameter θ.

Definition 17.2. Let X1, X2, ..., Xn be a random sample of size n from a population X with density f(x; θ), where θ is an unknown parameter. The interval estimator of θ is called a 100(1 − α)% confidence interval for θ if
\[
P(L \le \theta \le U) = 1 - \alpha.
\]
The random variable L is called the lower confidence limit and U is called the upper confidence limit. The number (1 − α) is called the confidence coefficient or degree of confidence.
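As a small numerical illustration of the discussion at the beginning of this chapter, the following sketch draws several samples of size 5 from the density f(x; θ) with the true value θ = 1 and records the maximum likelihood estimate X(1) each time. It uses the fact that this density is that of θ + |Z| with Z standard normal, which is how the samples are generated; the seed and the number of trials are arbitrary.

```python
import numpy as np

# Sketch: for f(x; theta) = sqrt(2/pi) * exp(-(x - theta)^2 / 2), x >= theta,
# the MLE of theta is the sample minimum X_(1).  With the true theta = 1 and
# n = 5, the point estimate changes from sample to sample and essentially
# never equals 1 exactly, which motivates interval estimation.
rng = np.random.default_rng(seed=2)
theta, n = 1.0, 5

for trial in range(1, 6):
    sample = theta + np.abs(rng.normal(size=n))   # theta + |Z| has density f
    print(f"sample {trial}: theta_hat = X_(1) = {sample.min():.3f}")
```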
There are several methods for constructing confidence intervals for an unknown parameter θ. Some well known methods are: (1) Pivotal Quantity Method, (2) Maximum Likelihood Estimator (MLE) Method, (3) Bayesian Method, (4) Invariant Methods, (5) Inversion of Test Statistic Method, and (6) The Statistical or General Method. In this chapter, we only focus on the pivotal quantity method and the MLE method. We also briefly examine the statistical or general method. The pivotal quantity method is mainly due to George Barnard and David Fraser of the University of Waterloo, and this method is perhaps one of the most elegant methods of constructing confidence intervals for unknown parameters.

17.2. Pivotal Quantity Method

In this section, we explain how the notion of a pivotal quantity can be used to construct a confidence interval for an unknown parameter. We will also examine how to find pivotal quantities for parameters associated with certain probability density functions. We begin with the formal definition of a pivotal quantity.

Definition 17.3. Let X1, X2, ..., Xn be a random sample of size n from a population X with probability density function f(x; θ), where θ is an unknown parameter. A pivotal quantity Q is a function of X1, X2, ..., Xn and θ whose probability distribution is independent of the parameter θ.

Notice that the pivotal quantity Q(X1, X2, ..., Xn, θ) will usually contain both the parameter θ and an estimator (that is, a statistic) of θ. Now we give an example of a pivotal quantity.

Example 17.1. Let X1, X2, ..., Xn be a random sample from a normal population X with mean µ and a known variance σ². Find a pivotal quantity for the unknown parameter µ.

Answer: Since each Xi ∼ N(µ, σ²),
\[
\overline{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right).
\]
Standardizing $\overline{X}$, we see that
\[
\frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).
\]
The statistic Q given by
\[
Q(X_1, X_2, ..., X_n, \mu) = \frac{\overline{X} - \mu}{\sigma/\sqrt{n}}
\]
is a pivotal quantity since it is a function of X1, X2, ..., Xn and µ and its probability density function is free of the parameter µ.

There is no general rule for finding a pivotal quantity (or pivot) for a parameter θ of an arbitrarily given density function f(x; θ). Hence, to some extent, finding pivots relies on guesswork. However, if the probability density function f(x; θ) belongs to the location-scale family, then there is a systematic way to find pivots.

Definition 17.4. Let g : ℝ → ℝ be a probability density function. Then for any µ and any σ > 0, the family of functions
\[
\mathcal{F} = \left\{ f(x; \mu, \sigma) = \frac{1}{\sigma}\, g\!\left(\frac{x - \mu}{\sigma}\right) \;\Big|\; \mu \in (-\infty, \infty),\ \sigma \in (0, \infty) \right\}
\]
is called the location-scale family with standard probability density g(x). The parameter µ is called the location parameter and the parameter σ is called the scale parameter. If σ = 1, then F is called the location family. If µ = 0, then F is called the scale family.

It should be noted that each member f(x; µ, σ) of the location-scale family is a probability density function. If we take $g(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}$, then the normal density function
\[
f(x; \mu, \sigma) = \frac{1}{\sigma}\, g\!\left(\frac{x - \mu}{\sigma}\right) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}, \qquad -\infty < x < \infty,
\]
belongs to the location-scale family. The density function
\[
f(x; \theta) =
\begin{cases}
\frac{1}{\theta}\, e^{-x/\theta} & \text{if } 0 < x < \infty \\
0 & \text{otherwise}
\end{cases}
\]
belongs to the scale family. However, the density function
\[
f(x; \theta) =
\begin{cases}
\theta\, x^{\theta - 1} & \text{if } 0 < x < 1 \\
0 & \text{otherwise}
\end{cases}
\]
does not belong to the location-scale family.
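The exponential density just mentioned belongs to the scale family, and that membership is exactly what produces a pivot, as described next. The following sketch checks this numerically: the maximum likelihood estimator of θ for this density is the sample mean, and the ratio of the estimator to θ has the same distribution whatever θ is. The values θ = 2, θ = 7 and n = 8 are arbitrary choices for the illustration.

```python
import numpy as np

# Numerical sketch of the scale-family case: for f(x; theta) = (1/theta) e^{-x/theta},
# the MLE of theta is X-bar, and X-bar/theta has a distribution free of theta,
# so it can serve as a pivot.  (theta in {2, 7} and n = 8 are arbitrary.)
rng = np.random.default_rng(seed=3)
n, reps = 8, 100_000

for theta in (2.0, 7.0):
    xbar = rng.exponential(scale=theta, size=(reps, n)).mean(axis=1)
    pivot = xbar / theta
    qs = np.quantile(pivot, [0.05, 0.50, 0.95])
    print(f"theta = {theta}: 5%, 50%, 95% quantiles of X-bar/theta =",
          np.round(qs, 3))
```

The printed quantiles agree across the two values of θ, up to simulation error, confirming that the distribution of the ratio does not involve θ.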
It is relatively easy to find pivotal quantities for location or scale parameters when the density function of the population belongs to the location-scale family F. When the density function belongs to the location family, the pivot for the location parameter µ is $\widehat{\mu} - \mu$, where $\widehat{\mu}$ is the maximum likelihood estimator of µ. If $\widehat{\sigma}$ is the maximum likelihood estimator of σ, then the pivot for the scale parameter σ is $\widehat{\sigma}/\sigma$ when the density function belongs to the scale family. When the density function belongs to the location-scale family, the pivot for the location parameter µ is $(\widehat{\mu} - \mu)/\widehat{\sigma}$ and the pivot for the scale parameter σ is $\widehat{\sigma}/\sigma$. Sometimes it is appropriate to make a minor modification to the pivot obtained in this way, such as multiplying by a constant, so that the modified pivot will have a known distribution.

Remark 17.1. A pivotal quantity can also be constructed using a sufficient statistic for the parameter. Suppose T = T(X1, X2, ..., Xn) is a sufficient statistic based on a random sample X1, X2, ..., Xn from a population X with probability density function f(x; θ). Let the probability density function of T be g(t; θ). If g(t; θ) belongs to the location family, then an appropriate constant multiple of T − a(θ) is a pivotal quantity for the location parameter θ for some suitable expression a(θ). If g(t; θ) belongs to the scale family, then an appropriate constant multiple of T/b(θ) is a pivotal quantity for the scale parameter θ for some suitable expression b(θ). Similarly, if g(t; θ) belongs to the location-scale family, then an appropriate constant multiple of (T − a(θ))/b(θ) is a pivotal quantity for the location parameter θ for some suitable expressions a(θ) and b(θ).

Algebraic manipulations of pivots are key factors in finding confidence intervals. If Q = Q(X1, X2, ..., Xn, θ) is a pivot, then a 100(1 − α)% confidence interval for θ may be constructed as follows: first, find two values a and b such that P(a ≤ Q ≤ b) = 1 − α, then convert the inequality a ≤ Q ≤ b into the form L ≤ θ ≤ U.

For example, if X is a normal population with unknown mean µ and known variance σ², then its pdf belongs to the location-scale family. A pivot for µ is $(\overline{X} - \mu)/S$. However, since the variance σ² is known, there is no need to use S. So we consider the pivot based on $(\overline{X} - \mu)/\sigma$ to construct the 100(1 − 2α)% confidence interval for µ. Since our population X ∼ N(µ, σ²), the sample mean $\overline{X}$ is also normal with the same mean µ and variance σ²/n. Hence
\[
\begin{aligned}
1 - 2\alpha &= P\left( -z_{\alpha} \le \frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha} \right) \\
&= P\left( \mu - z_{\alpha}\, \frac{\sigma}{\sqrt{n}} \le \overline{X} \le \mu + z_{\alpha}\, \frac{\sigma}{\sqrt{n}} \right) \\
&= P\left( \overline{X} - z_{\alpha}\, \frac{\sigma}{\sqrt{n}} \le \mu \le \overline{X} + z_{\alpha}\, \frac{\sigma}{\sqrt{n}} \right).
\end{aligned}
\]
Therefore, the 100(1 − 2α)% confidence interval for µ is
\[
\left[ \overline{X} - z_{\alpha}\, \frac{\sigma}{\sqrt{n}},\; \overline{X} + z_{\alpha}\, \frac{\sigma}{\sqrt{n}} \right].
\]
Here zα denotes the 100(1 − α)-percentile (that is, the (1 − α)-quantile) of a standard normal random variable Z, that is, P(Z ≤ zα) = 1 − α, where α ≤ 0.5. Note that α = P(Z ≤ −zα) if α ≤ 0.5.

(Figure: the standard normal density, with area 1 − α to the left of zα and area α to the right.)

A 100(1 − α)% confidence interval for a parameter θ has the following interpretation. If X1 = x1, X2 = x2, ..., Xn = xn is a sample of size n, then based on this sample we construct a 100(1 − α)% confidence interval [l, u] which is a subinterval of the real line ℝ. Suppose we take a large number of samples from the underlying population and construct all the corresponding 100(1 − α)% confidence intervals; then approximately 100(1 − α)% of these intervals would include the unknown value of the parameter θ.
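This long-run coverage interpretation is easy to verify by simulation. The following sketch constructs the interval with half-width z_{α/2} σ/√n around the sample mean for many samples from a normal population with known σ and counts how often the interval contains µ. The values µ = 3, σ = 2, n = 25 and α = 0.05 are arbitrary choices for the illustration.

```python
import numpy as np
from scipy.stats import norm

# Coverage sketch: build [X-bar - z*sigma/sqrt(n), X-bar + z*sigma/sqrt(n)]
# for many independent samples from N(mu, sigma^2) with sigma known, and
# count how often the interval contains mu.
# (mu = 3, sigma = 2, n = 25, alpha = 0.05 are arbitrary choices.)
rng = np.random.default_rng(seed=4)
mu, sigma, n, alpha, reps = 3.0, 2.0, 25, 0.05, 50_000
z = norm.ppf(1 - alpha / 2)                 # z_{alpha/2}, about 1.96 here

xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
half_width = z * sigma / np.sqrt(n)
covered = (xbar - half_width <= mu) & (mu <= xbar + half_width)
print("empirical coverage:", covered.mean())   # close to 1 - alpha = 0.95
```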
In the next several sections, we illustrate how the pivotal quantity method can be used to determine confidence intervals for various parameters.

17.3. Confidence Interval for Population Mean

At the outset, we use the pivotal quantity method to construct a confidence interval for the mean of a normal population. We first assume that the population variance is known, and then treat the case in which the variance is unknown. Next, we construct the confidence interval for the mean of a population with a continuous, symmetric and unimodal probability distribution by applying the central limit theorem.

Let X1, X2, ..., Xn be a random sample from a population X ∼ N(µ, σ²), where µ is an unknown parameter and σ² is a known parameter. First of all, we need a pivotal quantity Q(X1, X2, ..., Xn, µ). To construct this pivotal quantity, we find the maximum likelihood estimator of the parameter µ. We know that $\widehat{\mu} = \overline{X}$. Since each Xi ∼ N(µ, σ²), the distribution of the sample mean is given by
\[
\overline{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right).
\]
It is easy to see that the distribution of this estimator of µ is not independent of the parameter µ. If we standardize $\overline{X}$, then we get
\[
\frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).
\]
The distribution of the standardized $\overline{X}$ is independent of the parameter µ. This standardized $\overline{X}$ is the pivotal quantity, since it is a function of the sample X1, X2, ..., Xn and the parameter µ, and its probability distribution is independent of the parameter µ. Using this pivotal quantity, we construct the confidence interval as follows:
\[
\begin{aligned}
1 - \alpha &= P\left( -z_{\frac{\alpha}{2}} \le \frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \le z_{\frac{\alpha}{2}} \right) \\
&= P\left( \overline{X} - \left(\frac{\sigma}{\sqrt{n}}\right) z_{\frac{\alpha}{2}} \le \mu \le \overline{X} + \left(\frac{\sigma}{\sqrt{n}}\right) z_{\frac{\alpha}{2}} \right).
\end{aligned}
\]
Hence, the 100(1 − α)% confidence interval for µ when the population X is normal with known variance σ² is given by
\[
\left[ \overline{X} - \left(\frac{\sigma}{\sqrt{n}}\right) z_{\frac{\alpha}{2}},\; \overline{X} + \left(\frac{\sigma}{\sqrt{n}}\right) z_{\frac{\alpha}{2}} \right].
\]
This says that if samples of size n are taken from a normal population with mean µ and known variance σ², and if this interval is constructed for every sample, then in the long run 100(1 − α)% of the intervals will cover the unknown parameter µ; hence with a confidence of 100(1 − α)% we can say that µ lies in the interval
\[
\left[ \overline{X} - \left(\frac{\sigma}{\sqrt{n}}\right) z_{\frac{\alpha}{2}},\; \overline{X} + \left(\frac{\sigma}{\sqrt{n}}\right) z_{\frac{\alpha}{2}} \right].
\]
The interval estimate of µ is found by taking a good (here maximum likelihood) estimator $\overline{X}$ of µ and adding and subtracting $z_{\frac{\alpha}{2}}$ times the standard deviation of $\overline{X}$.

Remark 17.2. By definition a 100(1 − α)% confidence interval for a parameter θ is an interval [L, U] such that the probability of θ being in the interval [L, U] is 1 − α. That is,
\[
1 - \alpha = P(L \le \theta \le U).
\]
One can find infinitely many pairs L, U such that 1 − α = P(L ≤ θ ≤ U). Hence, there are infinitely many confidence intervals for a given parameter. However, we only consider the confidence interval of shortest length. If a confidence interval is constructed by omitting equal tail areas, then we obtain what is known as the central confidence interval. In a symmetric distribution, it can be shown that the central confidence interval is of the shortest length.

Example 17.2. Let X1, X2, ..., X11 be a random sample of size 11 from a normal distribution with unknown mean µ and variance σ² = 9.9. If $\sum_{i=1}^{11} x_i = 132$, then what is the 95% confidence interval for µ?

Answer: Since each Xi ∼ N(µ, 9.9), the confidence interval for µ is given by
\[
\left[ \overline{X} - \left(\frac{\sigma}{\sqrt{n}}\right) z_{\frac{\alpha}{2}},\; \overline{X} + \left(\frac{\sigma}{\sqrt{n}}\right) z_{\frac{\alpha}{2}} \right].
\]
Since $\sum_{i=1}^{11} x_i = 132$, the sample mean is $\overline{x} = \frac{132}{11} = 12$. Also, we see that
\[
\sqrt{\frac{\sigma^2}{n}} = \sqrt{\frac{9.9}{11}} = \sqrt{0.9}.
\]
Further, since 1 − α = 0.95, α = 0.05.
Thus z α2 = z0.025 = 1.96 (from normal table). Using these information in the expression of the confidence interval for µ, we get A √ √ B 12 − 1.96 0.9, 12 + 1.96 0.9 that is [10.141, 13.859]. Techniques for finding Interval Estimators of Parameters 498 Example 17.3. Let X1 , X2 , ..., X11 be a random sample of size 11 from a normal distribution with unknown mean µ and variance σ 2 = 9.9. If 111 i=1 xi = 132, then for what value of the constant k is A 12 − k √ 0.9, 12 + k √ B 0.9 a 90% confidence interval for µ ? Answer: The 90% confidence interval for µ when the variance is given is 2 x− Thus we need to find x, # = x= σ √ n σ2 n $ z α2 , x + # σ √ n $ 3 z α2 . and z α2 corresponding to 1 − α = 0.9. Hence 111 i=1 xi 11 132 = 11 = 12. @ @ σ2 9.9 = n 11 √ = 0.9. z0.05 = 1.64 (from normal table). Hence, the confidence interval for µ at 90% confidence level is A √ √ B 12 − (1.64) 0.9, 12 + (1.64) 0.9 . Comparing this interval with the given interval, we get k = 1.64. and the corresponding 90% confidence interval is [10.444, 13.556]. Remark 17.3. Notice that the length of the 90% confidence interval for µ is 3.112. However, the length of the 95% confidence interval is 3.718. Thus higher the confidence level bigger is the length of the confidence interval. Hence, the confidence level is directly proportional to the length of the confidence interval. In view of this fact, we see that if the confidence level is zero, Probability and Mathematical Statistics 499 then the length is also zero. That is when the confidence level is zero, the confidence interval of µ degenerates into a point X. Until now we have considered the case when the population is normal with unknown mean µ and known variance σ 2 . Now we consider the case when the population is non-normal but its probability density function is continuous, symmetric and unimodal. If the sample size is large, then by the central limit theorem X −µ √σ n ∼ N (0, 1) n → ∞. as Thus, in this case we can take the pivotal quantity to be Q(X1 , X2 , ..., Xn , µ) = X −µ √σ n , if the sample size is large (generally n ≥ 32). Since the pivotal quantity is same as before, we get the sample expression for the (1 − α)100% confidence interval, that is 2 X− # σ √ n $ z , X+ α 2 # σ √ n $ z α 2 3 . Example 17.4. Let X1 , X2 , ..., X40 be a random sample of size 40 from 140 a distribution with known variance and unknown mean µ. If i=1 xi = 2 286.56 and σ = 10, then what is the 90 percent confidence interval for the population mean µ ? Answer: Since 1 − α = 0.90, we get α2 = 0.05. Hence, z0.05 = 1.64 (from the standard normal table). Next, we find the sample mean x= 286.56 = 7.164. 40 Hence, the confidence interval for µ is given by / 7.164 − (1.64) )@ 10 40 + , 7.164 + (1.64) that is [6.344, 7.984]. )@ 10 40 +0 Techniques for finding Interval Estimators of Parameters 500 Example 17.5. In sampling from a nonnormal distribution with a variance of 25, how large must the sample size be so that the length of a 95% confidence interval for the mean is 1.96 ? Answer: The confidence interval when the sample is taken from a normal population with a variance of 25 is # # $ $ 3 2 σ σ α z2 , x+ √ z α2 . x− √ n n Thus the length of the confidence interval is @ σ2 ℓ = 2 z α2 n @ 25 = 2 z0.025 n @ 25 . = 2 (1.96) n But we are given that the length of the confidence interval is ℓ = 1.96. Thus @ 25 1.96 = 2 (1.96) n √ n = 10 n = 100. Hence, the sample size must be 100 so that the length of the 95% confidence interval will be 1.96. 
So far, we have discussed the method of construction of confidence interval for the parameter population mean when the variance is known. It is very unlikely that one will know the variance without knowing the population mean, and thus what we have treated so far in this section is not very realistic. Now we treat case of constructing the confidence interval for population mean when the population variance is also unknown. First of all, we begin with the construction of confidence interval assuming the population X is normal. Suppose X1 , X2 , ..., Xn is random sample from a normal population X with mean µ and variance σ 2 > 0. Let the sample mean and sample variances be X and S 2 respectively. Then (n − 1)S 2 ∼ χ2 (n − 1) σ2 Probability and Mathematical Statistics and 501 X −µ = ∼ N (0, 1). σ2 n Therefore, the random variable defined by the ratio of (n−1)S 2 σ2 X−µ to I has σ2 n a t-distribution with (n − 1) degrees of freedom, that is Q(X1 , X2 , ..., Xn , µ) = = X−µ I 2 σ n (n−1)S 2 (n−1)σ 2 X −µ ∼ t(n − 1), = = S2 n where Q is the pivotal quantity to be used for the construction of the confidence interval for µ. Using this pivotal quantity, we construct the confidence interval as follows: ) + X −µ 1 − α = P −t α2 (n − 1) ≤ S ≤ t α2 (n − 1) # =P X− # S √ n $ √ n t α2 (n − 1) ≤ µ ≤ X + # S √ n $ $ t α2 (n − 1) Hence, the 100(1 − α)% confidence interval for µ when the population X is normal with the unknown variance σ 2 is given by 2 3 # # $ $ S S t α2 (n − 1) , X + √ t α2 (n − 1) . X− √ n n Example 17.6. A random sample of 9 observations from a normal popula19 tion yields the observed statistics x = 5 and 81 i=1 (xi − x)2 = 36. What is the 95% confidence interval for µ ? Answer: Since n=9 x=5 s2 = 36 and 1 − α = 0.95, the 95% confidence interval for µ is given by 2 $ $ 3 # # s s t α2 (n − 1) , x + √ t α2 (n − 1) , x− √ n n that is 2 5− # 6 √ 9 $ t0.025 (8) , 5 + # 6 √ 9 $ 3 t0.025 (8) , Techniques for finding Interval Estimators of Parameters which is 2 5− # 6 √ 9 $ (2.306) , 5 + # 6 √ 9 502 $ 3 (2.306) . Hence, the 95% confidence interval for µ is given by [0.388, 9.612]. Example 17.7. Which of the following is true of a 95% confidence interval for the mean of a population? (a) The interval includes 95% of the population values on the average. (b) The interval includes 95% of the sample values on the average. (c) The interval has 95% chance of including the sample mean. Answer: None of the statements is correct since the 95% confidence interval for the population mean µ means that the interval has 95% chance of including the population mean µ. Finally, we consider the case when the population is non-normal but it probability density function is continuous, symmetric and unimodal. If some weak conditions are satisfied, then the sample variance S 2 of a random sample of size n ≥ 2, converges stochastically to σ 2 . Therefore, in = X−µ I 2 σ n (n−1)S 2 (n−1)σ 2 X −µ = = S2 n the numerator of the left-hand member converges to N (0, 1) and the denominator of that member converges to 1. Hence X −µ = ∼ N (0, 1) S2 n as n → ∞. This fact can be used for the construction of a confidence interval for population mean when variance is unknown and the population distribution is nonnormal. We let the pivotal quantity to be X −µ Q(X1 , X2 , ..., Xn , µ) = = S2 n and obtain the following confidence interval $ $ 3 # # 2 S S α √ √ z2 , X + z α2 . X− n n Probability and Mathematical Statistics 503 We summarize the results of this section by the following table. 
Population normal Variance σ 2 known normal not known not normal known not normal not normal known not known not normal not known Sample Size n n≥2 Confidence Limits x ∓ z α2 √σn x ∓ t α2 (n − 1) √sn n≥2 n ≥ 32 x ∓ z α2 n < 32 n ≥ 32 √σ n no formula exists x ∓ t α2 (n − 1) √sn n < 32 no formula exists 17.4. Confidence Interval for Population Variance In this section, we will first describe the method for constructing the confidence interval for variance when the population is normal with a known population mean µ. Then we treat the case when the population mean is also unknown. Let X1 , X2 , ..., Xn be a random sample from a normal population X with known mean µ and unknown variance σ 2 . We would like to construct a 100(1 − α)% confidence interval for the variance σ 2 , that is, we would like to find the estimate of L and U such that ! " P L ≤ σ 2 ≤ U = 1 − α. To find these estimate of L and U , we first construct a pivotal quantity. Thus ! " Xi ∼ N µ, σ 2 , $ # Xi − µ ∼ N (0, 1), σ # $2 Xi − µ ∼ χ2 (1). σ $2 n # % Xi − µ ∼ χ2 (n). σ i=1 We define the pivotal quantity Q(X1 , X2 , ..., Xn , σ 2 ) as 2 Q(X1 , X2 , ..., Xn , σ ) = $2 n # % Xi − µ i=1 σ Techniques for finding Interval Estimators of Parameters 504 which has a chi-square distribution with n degrees of freedom. Hence 1 − α = P (a ≤ Q ≤ b) + ) $2 n # % Xi − µ =P a≤ ≤b σ i=1 ) + n σ2 1 % 1 ≥ ≥ =P a (Xi − µ)2 b i=1 1n # 1n $ 2 2 2 i=1 (Xi − µ) i=1 (Xi − µ) =P ≥σ ≥ a b 1n 1 $ # n 2 2 (X (X − µ) i − µ) i i=1 ≤ σ 2 ≤ i=1 =P b a )1 + 1n n 2 2 2 i=1 (Xi − µ) i=1 (Xi − µ) =P ≤σ ≤ χ21− α (n) χ2α (n) 2 2 2 Therefore, the (1 − α)% confidence interval for σ when mean is known is given by /1 0 1n n 2 2 i=1 (Xi − µ) i=1 (Xi − µ) . , χ21− α (n) χ2α (n) 2 2 Example 17.8. A random sample of 9 observations from a normal pop19 ulation with µ = 5 yields the observed statistics 18 i=1 x2i = 39.125 and 19 2 i=1 xi = 45. What is the 95% confidence interval for σ ? Answer: We have been given that n=9 and µ = 5. Further we know that 9 % 9 xi = 45 and i=1 Hence 9 % 1% 2 x = 39.125. 8 i=1 i x2i = 313, i=1 and 9 % i=1 (xi − µ)2 = 9 % i=1 x2i − 2µ 9 % xi + 9µ2 i=1 = 313 − 450 + 225 = 88. Probability and Mathematical Statistics Since 1 − α = 0.95, we get table we have α 2 505 = 0.025 and 1 − χ20.025 (9) = 2.700 α 2 = 0.975. Using chi-square χ20.975 (9) = 19.02. and Hence, the 95% confidence interval for σ 2 is given by /1 0 1n n 2 2 i=1 (Xi − µ) i=1 (Xi − µ) , , χ21− α (n) χ2α (n) 2 that is 2 2 88 88 , 19.02 2.7 3 which is [4.63, 32.59]. Remark 17.4. Since the χ2 distribution is not symmetric, the above confidence interval is not necessarily the shortest. Later, in the next section, we describe how one construct a confidence interval of shortest length. Consider a random sample X1 , X2 , ..., Xn from a normal population X ∼ N (µ, σ 2 ), where the population mean µ and population variance σ 2 are unknown. We want to construct a 100(1 − α)% confidence interval for the population variance. We know that (n − 1)S 2 ∼ χ2 (n − 1) σ2 ⇒ 1n i=1 ! Xi − X σ2 "2 ∼ χ2 (n − 1). 1n 2 (Xi −X ) as the pivotal quantity Q to construct the confidence We take i=1 σ2 2 interval for σ . Hence, we have ) + 1 1 1−α=P ≤Q≤ 2 χ2α (n − 1) χ1− α (n − 1) 2 2 ) + "2 1n ! 1 1 i=1 Xi − X ≤ 2 =P ≤ χ2α (n − 1) σ2 χ1− α (n − 1) 2 2 ) 1n ! "2 "2 + 1n ! X − X − X X i i i=1 =P . ≤ σ 2 ≤ i=12 χ21− α (n − 1) χ α (n − 1) 2 2 Techniques for finding Interval Estimators of Parameters 506 Hence, the 100(1 − α)% confidence interval for variance σ 2 when the population mean is unknown is given by / 1n "2 ! 
i=1 Xi − X χ21− α (n − 1) 2 , 1n ! Xi − X α (n − 1) i=1 χ2 2 "2 0 Example 17.9. Let X1 , X2 , ..., Xn be a random sample of size 13 from a 113 113 normal distribution N (µ, σ 2 ). If i=1 xi = 246.61 and i=1 x2i = 4806.61. Find the 90% confidence interval for σ 2 ? Answer: x = 18.97 13 s2 = 1 % (xi − x)2 n − 1 i=1 13 = <2 1 %; 2 xi − nx2 n − 1 i=1 1 [4806.61 − 4678.2] 12 1 128.41. = 12 = Hence, 12s2 = 128.41. Further, since 1 − α = 0.90, we get 1 − α2 = 0.95. Therefore, from chi-square table, we get χ20.95 (12) = 21.03, α 2 = 0.05 and χ20.05 (12) = 5.23. Hence, the 95% confidence interval for σ 2 is 2 3 128.41 128.41 , , 21.03 5.23 that is [6.11, 24.55]. Example 17.10. Let X1 , X2 , ..., Xn be a random sample of size n from a " ! distribution N µ, σ 2 , where µ and σ 2 are unknown parameters. What is the shortest 90% confidence interval for the standard deviation σ ? Answer: Let S 2 be the sample variance. Then (n − 1)S 2 ∼ χ2 (n − 1). σ2 Probability and Mathematical Statistics 507 Using this random variable as a pivot, we can construct a 100(1 − α)% confidence interval for σ from $ # (n − 1)S 2 ≤ b 1−α=P a≤ σ2 by suitably choosing the constants a and b. Hence, the confidence interval for σ is given by /@ 0 @ (n − 1)S 2 (n − 1)S 2 , . b a The length of this confidence interval is given by 2 3 √ 1 1 L(a, b) = S n − 1 √ − √ . a b In order to find the shortest confidence interval, we should find a pair of constants a and b such that L(a, b) is minimum. Thus, we have a constraint minimization problem. That is      Minimize L(a, b) Subject to the condition : b     f (u)du = 1 − α, (MP) a where f (x) = Γ 1 ! n−1 " 2 2 n−1 2 x n−1 2 −1 x e− 2 . Differentiating L with respect to a, we get $ # √ 1 3 db 1 3 dL = S n − 1 − a− 2 + b− 2 . da 2 2 da From : b a f (u) du = 1 − α, we find the derivative of b with respect to a as follows: d da : b f (u) du = a that is f (b) d (1 − α) da db − f (a) = 0. da Techniques for finding Interval Estimators of Parameters 508 Thus, we have f (a) db = . da f (b) Letting this into the expression for the derivative of L, we get # √ dL =S n−1 da 1 3 1 3 f (a) − a− 2 + b− 2 2 2 f (b) $ . Setting this derivative to zero, we get √ S n−1 # 1 3 1 3 f (a) − a− 2 + b− 2 2 2 f (b) $ =0 which yields 3 3 a 2 f (a) = b 2 f (b). Using the form of f , we get from the above expression 3 a2 a n−3 2 3 a e− 2 = b 2 b n−3 2 b e− 2 that is n n a b a 2 e− 2 = b 2 e− 2 . From this we get ln ,ab = # a−b n $ . Hence to obtain the pair of constants a and b that will produce the shortest confidence interval for σ, we have to solve the following system of nonlinear equations  : b   f (u) du = 1 − α  a (/) ,a- a − b   ln = . b n If ao and bo are solutions of (/), then the shortest confidence interval for σ is given by  G G 2 2 (n − 1)S   (n − 1)S , . bo ao Since this system of nonlinear equations is hard to solve analytically, numerical solutions are given in statistical literature in the form of a table for finding the shortest interval for the variance. Probability and Mathematical Statistics 509 17.5. Confidence Interval for Parameter of some Distributions not belonging to the Location-Scale Family In this section, we illustrate the pivotal quantity method for finding confidence intervals for a parameter θ when the density function does not belong to the location-scale family. The following density functions does not belong to the location-scale family:   θxθ−1 if 0 < x < 1 f (x; θ) =  0 otherwise, or f (x; θ) = &1 θ if 0 < x < θ 0 otherwise. 
We will construct interval estimators for the parameters in these density functions. The same idea for finding the interval estimators can be used to find interval estimators for parameters of density functions that belong to the location-scale family such as & 1 −x θ if 0 < x < ∞ θe f (x; θ) = 0 otherwise. To find the pivotal quantities for the above mentioned distributions and others we need the following three results. The first result is Theorem 6.2 while the proof of the second result is easy and we leave it to the reader. Theorem 17.1. Let F (x; θ) be the cumulative distribution function of a continuous random variable X. Then F (X; θ) ∼ U N IF (0, 1). Theorem 17.2. If X ∼ U N IF (0, 1), then − ln X ∼ EXP (1). Theorem 17.3. Let X1 , X2 , ..., Xn be a random sample from a distribution with density function  x  θ1 e− θ if 0 < x < ∞ f (x; θ) =  0 otherwise, Techniques for finding Interval Estimators of Parameters 510 where θ > 0 is a parameter. Then the random variable n 2 % Xi ∼ χ2 (2n) θ i=1 1n Proof: Let Y = θ2 i=1 Xi . Now we show that the sampling distribution of Y is chi-square with 2n degrees of freedom. We use the moment generating method to show this. The moment generating function of Y is given by MY (t) = M % n 2 θ (t) Xi i=1 = = n P i=1 n # P i=1 # MXi 2 t θ $ 2 1−θ t θ $−1 −n = (1 − 2t) − 2n 2 = (1 − 2t) . − 2n Since (1 − 2t) 2 corresponds to the moment generating function of a chisquare random variable with 2n degrees of freedom, we conclude that n 2 % Xi ∼ χ2 (2n). θ i=1 Theorem 17.4. Let X1 , X2 , ..., Xn be a random sample from a distribution with density function f (x; θ) =   θxθ−1  0 if 0 ≤ x ≤ 1 otherwise, where θ > 0 is a parameter. Then the random variable −2θ a chi-square distribution with 2n degree of freedoms. Proof: We are given that Xi ∼ θ xθ−1 , 0 < x < 1. 1n i=1 ln Xi has Probability and Mathematical Statistics 511 Hence, the cdf of f is F (x; θ) = : x θ xθ−1 dx = xθ . 0 Thus by Theorem 17.1, each F (Xi ; θ) ∼ U N IF (0, 1), that is Xiθ ∼ U N IF (0, 1). By Theorem 17.2, each − ln Xiθ ∼ EXP (1), that is −θ ln Xi ∼ EXP (1). By Theorem 17.3 (with θ = 1), we obtain −2 θ n % i=1 ln Xi ∼ χ2 (2n). Hence, the sampling distribution of −2 θ degree of freedoms. 1n i=1 ln Xi is chi-square with 2n The following theorem whose proof follows from Theorems 17.1, 17.2 and 17.3 is the key to finding pivotal quantity of many distributions that do not belong to the location-scale family. Further, this theorem can also be used for finding the pivotal quantities for parameters of some distributions that belong the location-scale family. Theorem 17.5. Let X1 , X2 , ..., Xn be a random sample from a continuous population X with a distribution function F (x; θ). If F (x; θ) is monotone in 1n θ, then the statistic Q = −2 i=1 ln F (Xi ; θ) is a pivotal quantity and has a chi-square distribution with 2n degrees of freedom (that is, Q ∼ χ2 (2n)). It should be noted that the condition F (x; θ) is monotone in θ is needed to ensure an interval. Otherwise we may get a confidence region instead of a 1n confidence interval. Further note that the statistic −2 i=1 ln (1 − F (Xi ; θ)) is also has a chi-square distribution with 2n degrees of freedom, that is −2 n % i=1 ln (1 − F (Xi ; θ)) ∼ χ2 (2n). Techniques for finding Interval Estimators of Parameters 512 Example 17.11. If X1 , X2 , ..., Xn is a random sample from a population with density   θxθ−1 if 0 < x < 1 f (x; θ) =  0 otherwise, where θ > 0 is an unknown parameter, what is a 100(1 − α)% confidence interval for θ? 
Answer: To construct a confidence interval for θ, we need a pivotal quantity. That is, we need a random variable which is a function of the sample and the parameter, and whose probability distribution is known but does not involve θ. We use the random variable Q = −2 θ n % i=1 ln Xi ∼ χ2 (2n) as the pivotal quantity. The 100(1 − α)% confidence interval for θ can be constructed from , 1 − α = P χ2α2 (2n) ≤ Q ≤ χ21− α2 (2n) + ) n % 2 2 ln Xi ≤ χ1− α2 (2n) = P χ α2 (2n) ≤ −2 θ i=1    =P   χ2α (2n) 2 −2 n % ln Xi i=1 ≤θ≤   χ21− α (2n)  2 . n  %  −2 ln Xi i=1 Hence, 100(1 − α)% confidence interval for θ is given by        χ2α (2n) 2 −2 n % ln Xi i=1 ! ,  χ21− α (2n)  2 . n  %  −2 ln Xi i=1 " α Here χ21− α (2n) denotes the 1 − 2 -quantile of a chi-square random variable 2 Y , that is α P (Y ≤ χ21− α2 (2n)) = 1 − 2 α 2 and χ α (2n) similarly denotes 2 -quantile of Y , that is 2 , - α P Y ≤ χ2α2 (2n) = 2 Probability and Mathematical Statistics 513 for α ≤ 0.5 (see figure below). 1"! !/ 2 !/2 #2 !/2 #2 1"!/2 Example 17.12. If X1 , X2 , ..., Xn is a random sample from a distribution with density function f (x; θ) =   θ1  if 0 < x < θ 0 otherwise, where θ > 0 is a parameter, then what is the 100(1 − α)% confidence interval for θ? Answer: The cumulation density function of f (x; θ) is F (x; θ) = Since −2 n % i=1 &x θ if 0 < x < θ 0 otherwise. ln F (Xi ; θ) = −2 n % i=1 ln # = 2n ln θ − 2 Xi θ $ n % ln Xi i=1 1n by Theorem 17.5, the quantity 2n ln θ − 2 i=1 ln Xi ∼ χ2 (2n). Since 1n 2n ln θ − 2 i=1 ln Xi is a function of the sample and the parameter and its distribution is independent of θ, it is a pivot for θ. Hence, we take Q(X1 , X2 , ..., Xn , θ) = 2n ln θ − 2 n % i=1 ln Xi . Techniques for finding Interval Estimators of Parameters 514 The 100(1 − α)% confidence interval for θ can be constructed from , 1 − α = P χ2α2 (2n) ≤ Q ≤ χ21− α2 (2n) + ) n % 2 2 ln Xi ≤ χ1− α2 (2n) = P χ α2 (2n) ≤ 2n ln θ − 2 =P =P i=1 ) n n % % χ2α2 (2n) + 2 ln Xi ≤ 2n ln θ ≤ χ21− α2 (2n) + 2 ln Xi ) e 1 2n J i=1 χ2α (2n)+2 2 1n i=1 ln Xi S ≤θ≤e 1 2n J i=1 χ21− α (2n)+2 2 1n i=1 Hence, 100(1 − α)% confidence interval for θ is given by & & ' '  n n % % 2 2 1 1 χ1− α (2n)+2 ln Xi ln Xi 2n  2n χ α2 (2n)+2 2  i=1 i=1 , e e  + ln Xi S+ .    .  The density function of the following example belongs to the scale family. However, one can use Theorem 17.5 to find a pivot for the parameter and determine the interval estimators for the parameter. Example 17.13. If X1 , X2 , ..., Xn is a random sample from a distribution with density function  x  θ1 e− θ if 0 < x < ∞ f (x; θ) =  0 otherwise, where θ > 0 is a parameter, then what is the 100(1 − α)% confidence interval for θ? Answer: The cumulative density function F (x; θ) of the density function & 1 −x θ if 0 < x < ∞ θe f (x; θ) = 0 otherwise is given by x F (x; θ) = 1 − e− θ . Hence −2 n % i=1 n ln (1 − F (Xi ; θ)) = 2% Xi . θ i=1 Probability and Mathematical Statistics Thus We take Q = 515 n 2 θ n % i=1 2% Xi ∼ χ2 (2n). θ i=1 Xi as the pivotal quantity. The 100(1 − α)% confidence interval for θ can be constructed using , 1 − α = P χ2α2 (2n) ≤ Q ≤ χ21− α2 (2n) ) + n 2 % 2 2 Xi ≤ χ1− α2 (2n) = P χ α2 (2n) ≤ θ i=1   n n % % 2 Xi   2 Xi   i=1  . ≤ θ ≤ 2i=1  =P 2 χ α (2n)    χ1− α2 (2n) 2 Hence, 100(1 − α)% confidence interval  n % Xi 2   i=1   χ2 α (2n) ,  1− 2 for θ is given by  n % 2 Xi   i=1 . 
2 χ α (2n)   2 In this section, we have seen that 100(1 − α)% confidence interval for the parameter θ can be constructed by taking the pivotal quantity Q to be either Q = −2 or Q = −2 n % i=1 n % ln F (Xi ; θ) i=1 ln (1 − F (Xi ; θ)) . In either case, the distribution of Q is chi-squared with 2n degrees of freedom, that is Q ∼ χ2 (2n). Since chi-squared distribution is not symmetric about the y-axis, the confidence intervals constructed in this section do not have the shortest length. In order to have a shortest confidence interval one has to solve the following minimization problem:  Minimize L(a, b)   : b (MP) Subject to the condition f (u)du = 1 − α,   a Techniques for finding Interval Estimators of Parameters where f (x) = Γ 1 ! n−1 " 2 2 n−1 2 x 516 n−1 2 −1 x e− 2 . In the case of Example 17.13, the minimization process leads to the following system of nonlinear equations : a     b f (u) du = 1 − α ln ,ab a−b   = . 2(n + 1) (NE) If ao and bo are solutions of (NE), then the shortest confidence interval for θ is given by 1n 2 1n 3 2 i=1 Xi 2 i=1 Xi . , bo ao 17.6. Approximate Confidence Interval for Parameter with MLE In this section, we discuss how to construct an approximate (1 − α)100% confidence interval for a population parameter θ using its maximum likelihood Z Let X1 , X2 , ..., Xn be a random sample from a population X estimator θ. with density f (x; θ). Let θZ be the maximum likelihood estimator of θ. If the sample size n is large, then using asymptotic property of the maximum likelihood estimator, we have , θZ − E θZ @ as n → ∞, , - ∼ N (0, 1) V ar θZ , Z Since, for large n, where V ar θZ denotes the variance of the estimator θ. the maximum likelihood estimator of θ is unbiased, we get θZ − θ @ , - ∼ N (0, 1) V ar θZ as n → ∞. , The variance V ar θZ can be computed directly whenever possible or using the Cramér-Rao lower bound , −1 B. V ar θZ ≥ A 2 d ln L(θ) E dθ 2 Probability and Mathematical Statistics 517 θ −θ Now using Q = = Z ! " as the pivotal quantity, we construct an approxiθ V ar Z mate (1 − α)100% confidence interval for θ as " ! 1 − α = P −z α2 ≤ Q ≤ z α2     θZ − θ  α =P , - ≤ z α2  . −z 2 ≤ @ Z V ar θ , If V ar θZ is free of θ, then have 1−α=P ) @ θZ − z α2 + @ , , Z Z Z . V ar θ ≤ θ ≤ θ + z α2 V ar θ Thus 100(1 − α)% approximate confidence interval for θ is / θZ − z α2 0 @ , , V ar θZ , θZ + z α2 V ar θZ @ , provided V ar θZ is free of θ. , Remark 17.5. In many situations V ar θZ is not free of the parameter θ. In those situations we still use the above form of the confidence interval by , replacing the parameter θ by θZ in the expression of V ar θZ . Next, we give some examples to illustrate this method. Example 17.14. Let X1 , X2 , ..., Xn be a random sample from a population X with probability density function f (x; p) = & px (1 − p)(1−x) if x = 0, 1 0 otherwise. What is a 100(1 − α)% approximate confidence interval for the parameter p? Answer: The likelihood function of the sample is given by L(p) = n P i=1 pxi (1 − p)(1−xi ) . Techniques for finding Interval Estimators of Parameters 518 Taking the logarithm of the likelihood function, we get ln L(p) = n % i=1 [xi ln p + (1 − xi ) ln(1 − p)] . Differentiating, the above expression, we get n n 1 % 1 % d ln L(p) = xi − (1 − xi ). dp p i=1 1 − p i=1 Setting this equals to zero and solving for p, we get nx n − nx − = 0, p 1−p that is (1 − p) n x = p (n − n x), which is n x − p n x = p n − p n x. Hence p = x. 
Therefore, the maximum likelihood estimator of p is given by The variance of X is pZ = X. ! " σ2 V ar X = . n Since X ∼ Ber(p), the variance σ 2 = p(1 − p), and ! " p(1 − p) V ar (Z p) = V ar X = . n Since V ar (Z p) is not free of the parameter p, we replave p by pZ in the expression of V ar (Z p) to get pZ (1 − pZ) V ar (Z p) 7 . n The 100(1−α)% approximate confidence interval for the parameter p is given by / 0 @ @ pZ (1 − pZ) pZ (1 − pZ) pZ − z α2 , pZ + z α2 n n Probability and Mathematical Statistics which is  X − z α 2 G 519 X (1 − X) , X + z α2 n G  X (1 − X)  . n The above confidence interval is a 100(1 − α)% approximate confidence interval for proportion. Example 17.15. A poll was taken of university students before a student election. Of 78 students contacted, 33 said they would vote for Mr. Smith. The population may be taken as 2200. Obtain 95% confidence limits for the proportion of voters in the population in favor of Mr. Smith. Answer: The sample proportion pZ is given by Hence @ pZ = pZ (1 − pZ) = n @ 33 = 0.4231. 78 (0.4231) (0.5769) = 0.0559. 78 The 2.5th percentile of normal distribution is given by z0.025 = 1.96 (From table). Hence, the lower confidence limit of 95% confidence interval is @ pZ (1 − pZ) pZ − z α2 n = 0.4231 − (1.96) (0.0559) = 0.4231 − 0.1096 = 0.3135. Similarly, the upper confidence limit of 95% confidence interval is @ pZ (1 − pZ) pZ + z α2 n = 0.4231 + (1.96) (0.0559) = 0.4231 + 0.1096 = 0.5327. Hence, the 95% confidence limits for the proportion of voters in the population in favor of Smith are 0.3135 and 0.5327. Techniques for finding Interval Estimators of Parameters 520 Remark 17.6. In Example 17.15, the 95% percent approximate confidence interval for the parameter p was [0.3135, 0.5327]. This confidence interval can be improved to a shorter interval by means of a quadratic inequality. Now we explain how the interval can be improved. First note that in Example 17.14, which we are using for Example 17.15, the approximate value of the = variance of the ML estimator pZ was obtained to be p(1−p) . However, n Z p √ −p is the exact variance of pZ. Now the pivotal quantity Q = pZ − p . Q= = p(1−p) n V ar(Z p) this becomes Using this pivotal quantity, we can construct a 95% confidence interval as   p Z − p 0.05 = P − z0.025 ≤ = ≤ z0.025  p(1−p) n >  > > > > pZ − p > > ≤ 1.96  . = P  >> = > > p(1−p) > n Using pZ = 0.4231 and n = 78, we solve the inequality > > > > > pZ − p > >= > > p(1−p) > ≤ 1.96 > > n which is > > > > > 0.4231 − p > > = > ≤ 1.96. > > p(1−p) > > 78 Squaring both sides of the above inequality and simplifying, we get 78 (0.4231 − p)2 ≤ (1.96)2 (p − p2 ). The last inequality is equivalent to 13.96306158 − 69.84520000 p + 81.84160000 p2 ≤ 0. Solving this quadratic inequality, we obtain [0.3196, 0.5338] as a 95% confidence interval for p. This interval is an improvement since its length is 0.2142 where as the length of the interval [0.3135, 0.5327] is 0.2192. Probability and Mathematical Statistics 521 Example 17.16. If X1 , X2 , ..., Xn is a random sample from a population with density   θ xθ−1 if 0 < x < 1 f (x; θ) =  0 otherwise, where θ > 0 is an unknown parameter, what is a 100(1 − α)% approximate confidence interval for θ if the sample size is large? Answer: The likelihood function L(θ) of the sample is L(θ) = n P θ xθ−1 . i i=1 Hence ln L(θ) = n ln θ + (θ − 1) n % ln xi . i=1 The first derivative of the logarithm of the likelihood function is n n % d ln xi . 
ln L(θ) = + dθ θ i=1 Setting this derivative to zero and solving for θ, we obtain θ = − 1n n i=1 ln xi . Hence, the maximum likelihood estimator of θ is given by n θZ = − 1n . i=1 ln Xi Finding the variance of this estimator is difficult. We compute its variance by computing the Cramér-Rao bound for this estimator. The second derivative of the logarithm of the likelihood function is given by ) + n d2 d n % ln xi ln L(θ) = + dθ2 dθ θ i=1 n = − 2. θ Hence E # $ n d2 ln L(θ) = − 2 . dθ2 θ Techniques for finding Interval Estimators of Parameters Therefore Thus we take 522 , θ V ar θZ ≥ . n , θ V ar θZ 7 . n , Since V ar θZ has θ in its expression, we replace the unknown θ by its estimate θZ so that , - θZ2 V ar θZ 7 . n The 100(1 − α)% approximate confidence interval for θ is given by 0 / Z Z θ θ θZ − z α2 √ , θZ + z α2 √ , n n which is 2 $ $3 # # √ √ n n n n α α − 1n , − 1n . + z 2 1n − z 2 1n i=1 ln Xi i=1 ln Xi i=1 ln Xi i=1 ln Xi Remark 17.7. In the next section 17.2, we derived the exact confidence interval for θ when the population distribution in exponential. The exact 100(1 − α)% confidence interval for θ was given by 0 / χ21− α (2n) χ2α (2n) 2 2 . , − 1n − 1n 2 i=1 ln Xi 2 i=1 ln Xi Note that this exact confidence interval is not the shortest confidence interval for the parameter θ. Example 17.17. If X1 , X2 , ..., X49 is a random sample from a population with density   θ xθ−1 if 0 < x < 1 f (x; θ) =  0 otherwise, where θ > 0 is an unknown parameter, what are 90% approximate and exact 149 confidence intervals for θ if i=1 ln Xi = −0.7567? Answer: We are given the followings: n = 49 49 % i=1 ln Xi = −0.7576 1 − α = 0.90. Probability and Mathematical Statistics 523 Hence, we get z0.05 = 1.64, n 49 = −64.75 = −0.7567 ln X i i=1 and 1n √ n 7 = −9.25. = −0.7567 ln X i i=1 1n Hence, the approximate confidence interval is given by [64.75 − (1.64)(9.25), 64.75 + (1.64)(9.25)] that is [49.58, 79.92]. Next, we compute the exact 90% confidence interval for θ using the formula / 0 χ2α (2n) χ21− α (2n) 2 2 − 1n . , − 1n 2 i=1 ln Xi 2 i=1 ln Xi From chi-square table, we get χ20.05 (98) = 77.93 χ20.95 (98) = 124.34. and Hence, the exact 90% confidence interval is 2 77.93 124.34 , (2)(0.7567) (2)(0.7567) 3 that is [51.49, 82.16]. Example 17.18. If X1 , X2 , ..., Xn is a random sample from a population with density  if x = 0, 1, 2, ..., ∞  (1 − θ) θx f (x; θ) =  0 otherwise, where 0 < θ < 1 is an unknown parameter, what is a 100(1−α)% approximate confidence interval for θ if the sample size is large? Answer: The logarithm of the likelihood function of the sample is ln L(θ) = ln θ n % i=1 xi + n ln(1 − θ). Techniques for finding Interval Estimators of Parameters 524 Differentiating we see obtain d ln L(θ) = dθ 1n i=1 xi θ − n . 1−θ Equating this derivative to zero and solving for θ, we get θ = maximum likelihood estimator of θ is given by θZ = x 1+x . Thus, the X . 1+X Next, we find the variance of this estimator using the Cramér-Rao lower bound. For this, we need the second derivative of ln L(θ). Hence nx n d2 ln L(θ) = − 2 − . 2 dθ θ (1 − θ)2 Therefore # 2 $ d ln L(θ) E dθ2 # $ nX n =E − 2 − θ (1 − θ)2 " ! n n = 2E X − θ (1 − θ)2 n n 1 = 2 − θ (1 − θ) (1 − θ)2 2 3 θ 1 n + =− θ(1 − θ) θ 1 − θ n (1 − θ + θ2 ) =− 2 . θ (1 − θ)2 Therefore , V ar θZ 7 (since each Xi ∼ GEO(1 − θ)) , -2 θZ2 1 − θZ , -. 
n 1 − θZ + θZ2 The 100(1 − α)% approximate confidence interval for θ is given by   θZ − z α  2  , , θZ 1 − θZ θZ 1 − θZ  Z  @ , - , θ + z α2 @ , - , n 1 − θZ + θZ2 n 1 − θZ + θZ2 Probability and Mathematical Statistics 525 where θZ = X . 1+X 17.7. The Statistical or General Method Now we briefly describe the statistical or general method for constructing a confidence interval. Let X1 , X2 , ..., Xn be a random sample from a population with density f (x; θ), where θ is a unknown parameter. We want to determine an interval estimator for θ. Let T (X1 , X2 , ..., Xn ) be some statistics having the density function g(t; θ). Let p1 and p2 be two fixed positive number in the open interval (0, 1) with p1 + p2 < 1. Now we define two functions h1 (θ) and h2 (θ) as follows: p1 = such that : h1 (θ) g(t; θ) dt and p2 = : h2 (θ) g(t; θ) dt −∞ −∞ P (h1 (θ) < T (X1 , X2 , ..., Xn ) < h2 (θ)) = 1 − p1 − p2 . If h1 (θ) and h2 (θ) are monotone functions in θ, then we can find a confidence interval P (u1 < θ < u2 ) = 1 − p1 − p2 where u1 = u1 (t) and u2 = u2 (t). The statistics T (X1 , X2 , ..., Xn ) may be a sufficient statistics, or a maximum likelihood estimator. If we minimize the length u2 −u1 of the confidence interval, subject to the condition 1−p1 −p2 = 1 − α for 0 < α < 1, we obtain the shortest confidence interval based on the statistics T . 17.8. Criteria for Evaluating Confidence Intervals In many situations, one can have more than one confidence intervals for the same parameter θ. Thus it necessary to have a set of criteria to decide whether a particular interval is better than the other intervals. Some well known criteria are: (1) Shortest Length and (2) Unbiasedness. Now we only briefly describe these criteria. The criterion of shortest length demands that a good 100(1 − α)% confidence interval [L, U ] of a parameter θ should have the shortest length ℓ = U −L. In the pivotal quantity method one finds a pivot Q for a parameter θ and then converting the probability statement P (a < Q < b) = 1 − α Techniques for finding Interval Estimators of Parameters 526 to P (L < θ < U ) = 1 − α obtains a 100(1−α)% confidence interval for θ. If the constants a and b can be found such that the difference U − L depending on the sample X1 , X2 , ..., Xn is minimum for every realization of the sample, then the random interval [L, U ] is said to be the shortest confidence interval based on Q. If the pivotal quantity Q has certain type of density functions, then one can easily construct confidence interval of shortest length. The following result is important in this regard. Theorem 17.6. Let the density function of the pivot Q ∼ h(q; θ) be continuous and unimodal. If in some interval [a, b] the density function h has a mode, 8b and satisfies conditions (i) a h(q; θ)dq = 1 − α and (ii) h(a) = h(b) > 0, then the interval [a, b] is of the shortest length among all intervals that satisfy condition (i). If the density function is not unimodal, then minimization of ℓ is necessary to construct a shortest confidence interval. One of the weakness of this shortest length criterion is that in some cases, ℓ could be a random variable. Often, the expected length of the interval E(ℓ) = E(U − L) is also used as a criterion for evaluating the goodness of an interval. However, this too has weaknesses. A weakness of this criterion is that minimization of E(ℓ) depends on the unknown true value of the parameter θ. 
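Theorem 17.6 translates directly into a small numerical routine. The sketch below (with illustrative choices n = 10 and α = 0.05, not from the text) finds the shortest 1 − α interval for a χ²(2n) pivot by imposing the equal-height condition h(a) = h(b), the right endpoint being forced by the coverage requirement, and compares it with the equal-tail quantiles used earlier in this chapter. Note that this shortens the interval for the pivot Q itself; as the discussion of the minimization problems (MP) and (NE) indicates, the shortest interval for the parameter θ generally requires a different minimization.

```python
from scipy import stats, optimize

# Shortest 1 - alpha interval for a chi-square(2n) pivot, per Theorem 17.6:
# choose a and b with P(a <= Q <= b) = 1 - alpha and h(a) = h(b).
n, alpha = 10, 0.05                      # illustrative values only
h = stats.chi2(2 * n)

def b_of_a(a):
    # right endpoint forced by the coverage condition
    return h.ppf(h.cdf(a) + 1.0 - alpha)

def height_gap(a):
    return h.pdf(a) - h.pdf(b_of_a(a))

# the left endpoint must lie below the alpha-quantile; bracket and solve
a_star = optimize.brentq(height_gap, 1e-9, h.ppf(alpha) * 0.999)
b_star = b_of_a(a_star)

a_eq, b_eq = h.ppf(alpha / 2), h.ppf(1 - alpha / 2)
print("equal-height:", (a_star, b_star), " length:", b_star - a_star)
print("equal-tail  :", (a_eq, b_eq),     " length:", b_eq - a_eq)
```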
If the sample size is very large, then every approximate confidence interval constructed using MLE method has minimum expected length. A confidence interval is only shortest based on a particular pivot Q. It is possible to find another pivot Q( which may yield even a shorter interval than the shortest interval found based on Q. The question naturally arises is how to find the pivot that gives the shortest confidence interval among all other pivots. It has been pointed out that a pivotal quantity Q which is a some function of the complete and sufficient statistics gives shortest confidence interval. Unbiasedness, is yet another criterion for judging the goodness of an interval estimator. The unbiasedness is defined as follow. A 100(1 − α)% confidence interval [L, U ] of the parameter θ is said to be unbiased if & ≥ 1 − α if θ( = θ ( P (L ≤ θ ≤ U ) ≤ 1 − α if θ( += θ. Probability and Mathematical Statistics 527 17.9. Review Exercises 1. Let X1 , X2 , ..., Xn be a random sample from a population with gamma density function f (x; θ, β) =  x  Γ(β)1 θβ xβ−1 e− θ  for 0 < x < ∞ 0 otherwise, where θ is an unknown parameter and β > 0 is a known parameter. Show that 0 / 1 1n n 2 i=1 Xi 2 i=1 Xi , χ21− α (2nβ) χ2α (2nβ) 2 2 is a 100(1 − α)% confidence interval for the parameter θ. 2. Let X1 , X2 , ..., Xn be a random sample from a population with Weibull density function f (x; θ, β) =  β  β xβ−1 e− xθ θ  for 0 < x < ∞ 0 otherwise, where θ is an unknown parameter and β > 0 is a known parameter. Show that / 1 0 1n n 2 i=1 Xiβ 2 i=1 Xiβ , χ21− α (2n) χ2α (2n) 2 2 is a 100(1 − α)% confidence interval for the parameter θ. 3. Let X1 , X2 , ..., Xn be a random sample from a population with Pareto density function f (x; θ, β) =   θ β θ x−(θ+1)  0 for β ≤ x < ∞ otherwise, where θ is an unknown parameter and β > 0 is a known parameter. Show that , - ,  1n 1n 2 i=1 ln Xβi 2 i=1 ln Xβi   , χ21− α (2n) χ2α (2n) 2 is a 100(1 − α)% confidence interval for θ1 . 2 Techniques for finding Interval Estimators of Parameters 528 4. Let X1 , X2 , ..., Xn be a random sample from a population with Laplace density function f (x; θ) = 1 − |x| e θ , 2θ −∞ < x < ∞ where θ is an unknown parameter. Show that / 1 n 2 i=1 |Xi | , 2 χ1− α (2n) 2 0 1n 2 i=1 |Xi | χ2α (2n) 2 is a 100(1 − α)% confidence interval for θ. 5. Let X1 , X2 , ..., Xn be a random sample from a population with density function  2  12 x3 e− x2θ for 0 < x < ∞ 2θ f (x; θ) =  0 otherwise, where θ is an unknown parameter. Show that /1 n 2 i=1 Xi χ21− α (4n) 2 , 1n 2 i=1 Xi 2 χ (4n) α 2 0 is a 100(1 − α)% confidence interval for θ. 6. Let X1 , X2 , ..., Xn be a random sample from a population with density function  xβ−1  β θ (1+x for 0 < x < ∞ β )θ+1 f (x; θ, β) =  0 otherwise, where θ is an unknown parameter and β > 0 is a known parameter. Show that   χ2α (2n) χ21− α (2n) 2 2  -, - , , 1n 1n 2 i=1 ln 1 + Xiβ 2 i=1 ln 1 + Xiβ is a 100(1 − α)% confidence interval for θ. 7. Let X1 , X2 , ..., Xn be a random sample from a population with density function   e−(x−θ) if θ < x < ∞ f (x; θ) =  0 otherwise, Probability and Mathematical Statistics 529 where θ ∈ R I is an unknown parameter. Then show that Q = X(1) − θ is a pivotal quantity. Using this pivotal quantity find a 100(1 − α)% confidence interval for θ. 8. Let X1 , X2 , ..., Xn be a random sample from a population with density function   e−(x−θ) if θ < x < ∞ f (x; θ) =  0 otherwise, ! " where θ ∈ R I is an unknown parameter. Then show that Q = 2n X(1) − θ is a pivotal quantity. 
Using this pivotal quantity find a 100(1 − α)% confidence interval for θ. 9. Let X1 , X2 , ..., Xn be a random sample from a population with density function   e−(x−θ) if θ < x < ∞ f (x; θ) =  0 otherwise, where θ ∈ R I is an unknown parameter. Then show that Q = e−(X(1) −θ) is a pivotal quantity. Using this pivotal quantity find a 100(1 − α)% confidence interval for θ. 10. Let X1 , X2 , ..., Xn be a random sample from a population with uniform density function   θ1 if 0 ≤ x ≤ θ f (x; θ) =  0 otherwise, X where 0 < θ is an unknown parameter. Then show that Q = θ(n) is a pivotal quantity. Using this pivotal quantity find a 100(1 − α)% confidence interval for θ. 11. Let X1 , X2 , ..., Xn be a random sample from a population with uniform density function   θ1 if 0 ≤ x ≤ θ f (x; θ) =  0 otherwise, X −X where 0 < θ is an unknown parameter. Then show that Q = (n) θ (1) is a pivotal quantity. Using this pivotal quantity find a 100(1 − α)% confidence interval for θ. Techniques for finding Interval Estimators of Parameters 530 12. If X1 , X2 , ..., Xn is a random sample from a population with density =  2 e− 12 (x−θ)2 if θ ≤ x < ∞ π f (x; θ) =  0 otherwise, where θ is an unknown parameter, what is a 100(1 − α)% approximate confidence interval for θ if the sample size is large? 13. Let X1 , X2 , ..., Xn be a random sample of size n from a distribution with a probability density function   (θ + 1) x−θ−2 if 1 < x < ∞ f (x; θ) =  0 otherwise, where 0 < θ is a parameter. What is a 100(1 − α)% approximate confidence interval for θ if the sample size is large? 14. Let X1 , X2 , ..., Xn be a random sample of size n from a distribution with a probability density function   θ2 x e−θ x if 0 < x < ∞ f (x; θ) =  0 otherwise, where 0 < θ is a parameter. What is a 100(1 − α)% approximate confidence interval for θ if the sample size is large? 15. Let X1 , X2 , ..., Xn be a random sample from a distribution with density function  −(x−4) 1e β for x > 4 β f (x) =  0 otherwise, where β > 0. What is a 100(1 − α)% approximate confidence interval for θ if the sample size is large? 16. Let X1 , X2 , ..., Xn be a random sample from a distribution with density function   θ1 for 0 ≤ x ≤ θ f (x; θ) =  0 otherwise, where 0 < θ. What is a 100(1 − α)% approximate confidence interval for θ if the sample size is large? Probability and Mathematical Statistics 531 17. A sample X1 , X2 , ..., Xn of size n is drawn from a gamma distribution f (x; β) =  −x  x3 e 4 β 6β  0 if 0 < x < ∞ otherwise. What is a 100(1 − α)% approximate confidence interval for θ if the sample size is large? 18. Let X1 , X2 , ..., Xn be a random sample from a continuous population X with a distribution function F (x; θ). Show that the statistic 1n Q = −2 i=1 ln F (Xi ; θ) is a pivotal quantity and has a chi-square distribution with 2n degrees of freedom. 19. Let X1 , X2 , ..., Xn be a random sample from a continuous population X with a distribution function F (x; θ). Show that the statistic 1n Q = −2 i=1 ln (1 − F (Xi ; θ)) is a pivotal quantity and has a chi-square distribution with 2n degrees of freedom. Techniques for finding Interval Estimators of Parameters 532 Probability and Mathematical Statistics 533 Chapter 18 TEST OF STATISTICAL HYPOTHESES FOR PARAMETERS 18.1. Introduction Inferential statistics consists of estimation and hypothesis testing. We have already discussed various methods of finding point and interval estimators of parameters. We have also examined the goodness of an estimator. 
Suppose X1 , X2 , ..., Xn is a random sample from a population with probability density function given by   (1 + θ) xθ for 0 < x < 1 f (x; θ) =  0 otherwise, where θ > 0 is an unknown parameter. Further, let n = 4 and suppose x1 = 0.92, x2 = 0.75, x3 = 0.85, x4 = 0.8 is a set of random sample data from the above distribution. If we apply the maximum likelihood method, then we will find that the estimator θZ of θ is θZ = −1 − 4 . ln(X1 ) + ln(X2 ) + ln(X3 ) + ln(X2 ) Hence, the maximum likelihood estimate of θ is 4 ln(0.92) + ln(0.75) + ln(0.85) + ln(0.80) 4 = 4.2861 = −1 + 0.7567 θZ = −1 − Test of Statistical Hypotheses for Parameters 534 Therefore, the corresponding probability density function of the population is given by & 5.2861 x4.2861 for 0 < x < 1 f (x) = 0 otherwise. Since, the point estimate will rarely equal to the true value of θ, we would like to report a range of values with some degree of confidence. If we want to report an interval of values for θ with a confidence level of 90%, then we need a 90% confidence interval for θ. If we use the pivotal quantity method, then we will find that the confidence interval for θ is 0 / χ21− α (8) χ2α (8) 2 2 . , −1 − 14 −1 − 14 2 i=1 ln Xi 2 i=1 ln Xi 14 Since χ20.05 (8) = 2.73, χ20.95 (8) = 15.51, and i=1 ln(xi ) = −0.7567, we obtain 3 2 2.73 15.51 −1 + , −1 + 2(0.7567) 2(0.7567) which is [ 0.803, 9.249 ] . Thus we may draw inference, at a 90% confidence level, that the population X has the distribution   (1 + θ) xθ for 0 < x < 1 f (x; θ) = (/)  0 otherwise, where θ ∈ [0.803, 9.249]. If we think carefully, we will notice that we have made one assumption. The assumption is that the observable quantity X can be modeled by a density function as shown in (/). Since, we are concerned with the parametric statistics, our assumption is in fact about θ. Based on the sample data, we found that an interval estimate of θ at a 90% confidence level is [0.803, 9.249]. But, we assumed that θ ∈ [0.803, 9.249]. However, we can not be sure that our assumption regarding the parameter is real and is not due to the chance in the random sampling process. The validation of this assumption can be done by the hypothesis test. In this chapter, we discuss testing of statistical hypotheses. Most of the ideas regarding the hypothesis test came from Jerry Neyman and Karl Pearson during 1928-1938. Definition 18.1. A statistical hypothesis H is a conjecture about the distribution f (x; θ) of a population X. This conjecture is usually about the Probability and Mathematical Statistics 535 parameter θ if one is dealing with a parametric statistics; otherwise it is about the form of the distribution of X. Definition 18.2. A hypothesis H is said to be a simple hypothesis if H completely specifies the density f (x; θ) of the population; otherwise it is called a composite hypothesis. Definition 18.3. The hypothesis to be tested is called the null hypothesis. The negation of the null hypothesis is called the alternative hypothesis. The null and alternative hypotheses are denoted by Ho and Ha , respectively. If θ denotes a population parameter, then the general format of the null hypothesis and alternative hypothesis is Ho : θ ∈ Ω o and Ha : θ ∈ Ωa (/) where Ωo and Ωa are subsets of the parameter space Ω with Ωo ∩ Ωa = ∅ and Ωo ∪ Ωa ⊆ Ω. Remark 18.1. If Ωo ∪ Ωa = Ω, then (/) becomes H o : θ ∈ Ωo and Ha : θ +∈ Ωo . If Ωo is a singleton set, then Ho reduces to a simple hypothesis. 
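The point estimate 4.2861 and the 90% interval [0.803, 9.249] quoted in the opening illustration of this chapter are easy to reproduce. The short sketch below recomputes them, using chi-square quantiles in place of the rounded table values 2.73 and 15.51, so the final digits differ slightly.

```python
import numpy as np
from scipy.stats import chi2

# Reproducing the numbers in this chapter's opening illustration:
# data from the density (1 + theta) * x^theta on (0, 1).
x = np.array([0.92, 0.75, 0.85, 0.80])
s = np.log(x).sum()                              # about -0.7567

theta_hat = -1 - len(x) / s                      # MLE, about 4.29
lower = -1 + chi2.ppf(0.05, df=8) / (-2 * s)     # about 0.81
upper = -1 + chi2.ppf(0.95, df=8) / (-2 * s)     # about 9.25
print(theta_hat, (lower, upper))
```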
For example, Ωo = {4.2861}, the null hypothesis becomes Ho : θ = 4.2861 and the alternative hypothesis becomes Ha : θ += 4.2861. Hence, the null hypothesis Ho : θ = 4.2861 is a simple hypothesis and the alternative Ha : θ += 4.2861 is a composite hypothesis. Definition 18.4. A hypothesis test is an ordered sequence (X1 , X2 , ..., Xn ; Ho , Ha ; C) where X1 , X2 , ..., Xn is a random sample from a population X with the probability density function f (x; θ), Ho and Ha are hypotheses concerning the parameter θ in f (x; θ), and C is a Borel set in R I n. Remark 18.2. Borel sets are defined using the notion of σ-algebra. A collection of subsets A of a set S is called a σ-algebra if (i) S ∈ A, (ii) Ac ∈ A, (∞ whenever A ∈ A, and (iii) k=1 Ak ∈ A, whenever A1 , A2 , ..., An , ... ∈ A. The Borel sets are the member of the smallest σ-algebra containing all open sets Test of Statistical Hypotheses for Parameters 536 of R I n . Two examples of Borel sets in R I n are the sets that arise by countable n union of closed intervals in R I , and countable intersection of open sets in R I n. The set C is called the critical region in the hypothesis test. The critical region is obtained using a test statistics W (X1 , X2 , ..., Xn ). If the outcome of (X1 , X2 , ..., Xn ) turns out to be an element of C, then we decide to accept Ha ; otherwise we accept Ho . Broadly speaking, a hypothesis test is a rule that tells us for which sample values we should decide to accept Ho as true and for which sample values we should decide to reject Ho and accept Ha as true. Typically, a hypothesis test is specified in terms of a test statistics W . For example, a test might specify 1n that Ho is to be rejected if the sample total k=1 Xk is less than 8. In this case the critical region C is the set {(x1 , x2 , ..., xn ) | x1 + x2 + · · · + xn < 8}. 18.2. A Method of Finding Tests There are several methods to find test procedures and they are: (1) Likelihood Ratio Tests, (2) Invariant Tests, (3) Bayesian Tests, and (4) UnionIntersection and Intersection-Union Tests. In this section, we only examine likelihood ratio tests. Definition 18.5. The likelihood ratio test statistic for testing the simple null hypothesis Ho : θ ∈ Ωo against the composite alternative hypothesis Ha : θ +∈ Ωo based on a set of random sample data x1 , x2 , ..., xn is defined as maxL(θ, x1 , x2 , ..., xn ) W (x1 , x2 , ..., xn ) = θ∈Ωo maxL(θ, x1 , x2 , ..., xn ) , θ∈Ω where Ω denotes the parameter space, and L(θ, x1 , x2 , ..., xn ) denotes the likelihood function of the random sample, that is L(θ, x1 , x2 , ..., xn ) = n P f (xi ; θ). i=1 A likelihood ratio test (LRT) is any test that has a critical region C (that is, rejection region) of the form C = {(x1 , x2 , ..., xn ) | W (x1 , x2 , ..., xn ) ≤ k} , where k is a number in the unit interval [0, 1]. Probability and Mathematical Statistics 537 If Ho : θ = θ0 and Ha : θ = θa are both simple hypotheses, then the likelihood ratio test statistic is defined as W (x1 , x2 , ..., xn ) = L(θo , x1 , x2 , ..., xn ) . L(θa , x1 , x2 , ..., xn ) Now we give some examples to illustrate this definition. Example 18.1. Let X1 , X2 , X3 denote three independent observations from a distribution with density f (x; θ) = & (1 + θ) xθ if 0 ≤ x ≤ 1 0 otherwise. What is the form of the LRT critical region for testing Ho : θ = 1 versus Ha : θ = 2? Answer: In this example, θo = 1 and θa = 2. By the above definition, the form of the critical region is given by > ? 
> L (θo , x1 , x2 , x3 ) (x1 , x2 , x3 ) ∈ R I 3 >> ≤k L (θa , x1 , x2 , x3 ) > ' & > (1 + θ )3 O3 xθo o 3 > i=1 i ≤k = (x1 , x2 , x3 ) ∈ R I > O > (1 + θa )3 3i=1 xθi a > 9 ? > 3 > 8x1 x2 x3 = (x1 , x2 , x3 ) ∈ R ≤k I > 27x21 x22 x23 > ? 9 > 1 27 3 > ≤ k = (x1 , x2 , x3 ) ∈ R I > x1 x2 x3 8 Q R = (x1 , x2 , x3 ) ∈ R I 3 | x1 x2 x3 ≥ a, C= 9 where a is some constant. Hence the likelihood ratio test is of the form: 3 P Xi ≥ a.” “Reject Ho if i=1 Example 18.2. Let X1 , X2 , ..., X12 be a random sample from a normal population with mean zero and variance σ 2 . What is the form of the LRT critical region for testing the null hypothesis Ho : σ 2 = 10 versus Ha : σ 2 = 5? Answer: Here σo2 = 10 and σa2 = 5. By the above definition, the form of the Test of Statistical Hypotheses for Parameters 538 critical region is given by (with σo 2 = 10 and σa 2 = 5) > ! & ' " > 2 12 > L σo , x1 , x2 , ..., x12 C = (x1 , x2 , ..., x12 ) ∈ R I ≤k > > L (σa 2 , x1 , x2 , ..., x12 )  >  2 x > − 12 ( σoi ) 1   12 √ e   >P 2πσo2 12 > ≤ k = (x1 , x2 , ..., x12 ) ∈ R I > 2 x −1( i )   > 1   > i=1 √2πσ2 e 2 σa a ># $ & ' > 1 6 1 112 2 x 12 > i = (x1 , x2 , ..., x12 ) ∈ R I e 20 i=1 ≤ k > > 2 > & ' 12 >% 12 > 2 = (x1 , x2 , ..., x12 ) ∈ R I xi ≤ a , > > i=1 where a is some constant. Hence the likelihood ratio test is of the form: 12 % “Reject Ho if Xi2 ≤ a.” i=1 Example 18.3. Suppose that X is a random variable about which the hypothesis Ho : X ∼ U N IF (0, 1) against Ha : X ∼ N (0, 1) is to be tested. What is the form of the LRT critical region based on one observation of X? 1 2 Answer: In this example, Lo (x) = 1 and La (x) = √12π e− 2 x . By the above definition, the form of the critical region is given by > 9 ? > Lo (x) C = x ∈R I >> ≤k , where k ∈ [0, ∞) La (x) >√ J S 1 2 > = x ∈R I > 2π e 2 x ≤ k > 9 # $? > 2 k > = x ∈R I > x ≤ 2 ln √ 2π = {x ∈ R I | x ≤ a, } where a is some constant. Hence the likelihood ratio test is of the form: “Reject Ho if X ≤ a.” In the above three examples, we have dealt with the case when null as well as alternative were simple. If the null hypothesis is simple (for example, Ho : θ = θo ) and the alternative is a composite hypothesis (for example, Ha : θ += θo ), then the following algorithm can be used to construct the likelihood ratio critical region: (1) Find the likelihood function L(θ, x1 , x2 , ..., xn ) for the given sample. Probability and Mathematical Statistics 539 (2) Find L(θo , x1 , x2 , ..., xn ). (3) Find maxL(θ, x1 , x2 , ..., xn ). θ∈Ω (4) Rewrite L(θo ,x1 ,x2 ,...,xn ) maxL(θ, x1 , x2 , ..., xn ) in a “suitable form”. θ∈Ω (5) Use step (4) to construct the critical region. Now we give an example to illustrate these steps. Example 18.4. Let X be a single observation from a population with probability density f (x; θ) =  x −θ e  θ x!  0 for x = 0, 1, 2, ..., ∞ otherwise, where θ ≥ 0. Find the likelihood ratio critical region for testing the null hypothesis Ho : θ = 2 against the composite alternative Ha : θ += 2. Answer: The likelihood function based on one observation x is L(θ, x) = θx e−θ . x! Next, we find L(θo , x) which is given by L(2, x) = 2x e−2 . x! Our next step is to evaluate maxL(θ, x). For this we differentiate L(θ, x) θ≥0 with respect to θ, and then set the derivative to 0 and solve for θ. Hence and dL(θ,x) dθ < 1 ; −θ x−1 dL(θ, x) = e xθ − θx e−θ dθ x! = 0 gives θ = x. Hence maxL(θ, x) = θ≥0 xx e−x . x! To do the step (4), we consider L(2, x) = maxL(θ, x) θ∈Ω 2x e−2 x! xx e−x x! 
Test of Statistical Hypotheses for Parameters which simplifies to L(2, x) = maxL(θ, x) θ∈Ω 540 # 2e x $x e−2 . Thus, the likelihood ratio critical region is given by C= 9 > # $x ? 9 > 2e e−2 ≤ k = x ∈ R I x ∈R I >> x > # $x ? > 2e > ≤ a > x where a is some constant. The likelihood ratio test is of the form: “Reject ! "X Ho if 2e ≤ a.” X So far, we have learned how to find tests for testing the null hypothesis against the alternative hypothesis. However, we have not considered the goodness of these tests. In the next section, we consider various criteria for evaluating the goodness of an hypothesis test. 18.3. Methods of Evaluating Tests There are several criteria to evaluate the goodness of a test procedure. Some well known criteria are: (1) Powerfulness, (2) Unbiasedness and Invariancy, and (3) Local Powerfulness. In order to examine some of these criteria, we need some terminologies such as error probabilities, power functions, type I error, and type II error. First, we develop these terminologies. A statistical hypothesis is a conjecture about the distribution f (x; θ) of the population X. This conjecture is usually about the parameter θ if one is dealing with a parametric statistics; otherwise it is about the form of the distribution of X. If the hypothesis completely specifies the density f (x; θ) of the population, then it is said to be a simple hypothesis; otherwise it is called a composite hypothesis. The hypothesis to be tested is called the null hypothesis. We often hope to reject the null hypothesis based on the sample information. The negation of the null hypothesis is called the alternative hypothesis. The null and alternative hypotheses are denoted by Ho and Ha , respectively. In hypothesis test, the basic problem is to decide, based on the sample information, whether the null hypothesis is true. There are four possible situations that determines our decision is correct or in error. These four situations are summarized below: Probability and Mathematical Statistics 541 Ho is true Correct Decision Type I Error Accept Ho Reject Ho Ho is false Type II Error Correct Decision Definition 18.6. Let Ho : θ ∈ Ωo and Ha : θ +∈ Ωo be the null and alternative hypothesis to be tested based on a random sample X1 , X2 , ..., Xn from a population X with density f (x; θ), where θ is a parameter. The significance level of the hypothesis test H o : θ ∈ Ωo and Ha : θ +∈ Ωo , denoted by α, is defined as α = P (Type I Error) . Thus, the significance level of a hypothesis test we mean the probability of rejecting a true null hypothesis, that is α = P (Reject Ho / Ho is true) . This is also equivalent to α = P (Accept Ha / Ho is true) . Definition 18.7. Let Ho : θ ∈ Ωo and Ha : θ +∈ Ωo be the null and alternative hypothesis to be tested based on a random sample X1 , X2 , ..., Xn from a population X with density f (x; θ), where θ is a parameter. The probability of type II error of the hypothesis test H o : θ ∈ Ωo and Ha : θ +∈ Ωo , denoted by β, is defined as β = P (Accept Ho / Ho is false) . Similarly, this is also equivalent to β = P (Accept Ho / Ha is true) . Remark 18.3. Note that α can be numerically evaluated if the null hypothesis is a simple hypothesis and rejection rule is given. Similarly, β can be Test of Statistical Hypotheses for Parameters 542 evaluated if the alternative hypothesis is simple and rejection rule is known. If null and the alternatives are composite hypotheses, then α and β become functions of θ. Example 18.5. 
Let X1 , X2 , ..., X20 be a random sample from a distribution with probability density function f (x; p) =   px (1 − p)1−x  0 if x = 0, 1 otherwise, where 0 < p ≤ 21 is a parameter. The hypothesis Ho : p = 21 to be tested 120 against Ha : p < 12 . If Ho is rejected when i=1 Xi ≤ 6, then what is the probability of type I error? Answer: Since each observation Xi ∼ BER(p), the sum the observations 20 % Xi ∼ BIN (20, p). The probability of type I error is given by i=1 α = P (Type I Error) = P (Reject Ho / Ho is true) b + ) 20 % Xi ≤ 6 Ho is true =P ) i=1 20 % b 1 =P Xi ≤ 6 Ho : p = 2 i=1 $ # $ # $ # 6 k 20−k % 1 1 20 1− = 2 2 k + k=0 = 0.0577 (from binomial table). Hence the probability of type I error is 0.0577. Example 18.6. Let p represent the proportion of defectives in a manufacturing process. To test Ho : p ≤ 41 versus Ha : p > 14 , a random sample of size 5 is taken from the process. If the number of defectives is 4 or more, the null hypothesis is rejected. What is the probability of rejecting Ho if p = 51 ? Answer: Let X denote the number of defectives out of a random sample of size 5. Then X is a binomial random variable with n = 5 and p = 51 . Hence, Probability and Mathematical Statistics 543 the probability of rejecting Ho is given by α = P (Reject Ho / Ho is true) = P (X ≥ 4 / Ho is true) $ # c 1 =P X≥4 p= 5 # c $ # c $ 1 1 =P X=4 p= +P X =5 p= 5 5 # $ # $ 5 4 5 = p (1 − p)1 + p5 (1 − p)0 4 5 # $4 # $ # $5 4 1 1 + =5 5 5 5 # $5 1 = [20 + 1] 5 21 . = 3125 Hence the probability of rejecting the null hypothesis Ho is 21 3125 . Example 18.7. A random sample of size 4 is taken from a normal distribution with unknown mean µ and variance σ 2 > 0. To test Ho : µ = 0 against Ha : µ < 0 the following test is used: “Reject Ho if and only if X1 + X2 + X3 + X4 < −20.” Find the value of σ so that the significance level of this test will be closed to 0.14. Answer: Since 0.14 = α (significance level) = P (Type I Error) = P (Reject Ho / Ho is true) = P (X1 + X2 + X3 + X4 < −20 /Ho : µ = 0) ! " = P X < −5 /Ho : µ = 0 $ # X −0 −5 − 0 < =P σ σ 2 # $ 2 10 =P Z<− , σ we get from the standard normal table 1.08 = 10 . σ Test of Statistical Hypotheses for Parameters 544 Therefore σ= 10 = 9.26. 1.08 Hence, the standard deviation has to be 9.26 so that the significance level will be closed to 0.14. Example 18.8. A normal population has a standard deviation of 16. The critical region for testing Ho : µ = 5 versus the alternative Ha : µ = k is X̄ > k − 2. What would be the value of the constant k and the sample size n which would allow the probability of Type I error to be 0.0228 and the probability of Type II error to be 0.1587. ! " Answer: It is given that the population X ∼ N µ, 162 . Since 0.0228 = α = P (Type I Error) = P (Reject Ho / Ho is true) " ! = P X > k − 2 /Ho : µ = 5   X − 5 k − 7  =P = >=  256 n 256 n  k−7 = P Z > =  256 n  k − 7  = 1 − P Z ≤ = 256 n Hence, from standard normal table, we have √ (k − 7) n =2 16 which gives √ (k − 7) n = 32. Probability and Mathematical Statistics 545 Similarly 0.1587 = P (Type II Error) = P (Accept Ho / Ha is true) " ! = P X ≤ k − 2 /Ha : µ = k   b X − µ k − 2 − µ =P = ≤ = Ha : µ = k   256 n 256 n  X −k k − 2 − k =P= ≤ =  256 n 256 n 2 = P Z ≤ − = # 256 n   √ $ 2 n =1−P Z ≤ . 16 , , √ Hence 0.1587 = 1 − P Z ≤ 216n or P Z ≤ the standard normal table, we have √ 2 n =1 16 √ 2 n 16 = 0.8413. Thus, from which yields n = 64. Letting this value of n in √ (k − 7) n = 32, we see that k = 11. 
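The constants obtained in Example 18.8 are easy to confirm. The short sketch below recomputes α and β for the rejection rule X̄ > k − 2 with the values k = 11 and n = 64 found above.

```python
from scipy.stats import norm

# Check of Example 18.8: X ~ N(mu, 16^2); reject Ho: mu = 5 when the
# sample mean exceeds k - 2.  With k = 11 and n = 64 we should recover
# alpha = 0.0228 and beta = 0.1587.
sigma, k, n = 16.0, 11.0, 64
se = sigma / n ** 0.5                      # standard error of the sample mean

alpha = 1 - norm.cdf((k - 2 - 5) / se)     # P(Xbar > k - 2 | mu = 5)
beta  = norm.cdf((k - 2 - k) / se)         # P(Xbar <= k - 2 | mu = k)
print(f"alpha = {alpha:.4f}, beta = {beta:.4f}")
```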
While deciding to accept Ho or Ha, we may make a wrong decision. The probability γ of a wrong decision can be computed as follows:

γ = P(Ha accepted and Ho is true) + P(Ho accepted and Ha is true)
  = P(Ha accepted / Ho is true) P(Ho is true) + P(Ho accepted / Ha is true) P(Ha is true)
  = α P(Ho is true) + β P(Ha is true).

In most cases, the probabilities P(Ho is true) and P(Ha is true) are not known. Therefore, it is, in general, not possible to determine the exact numerical value of the probability γ of making a wrong decision. However, since γ is a weighted sum of α and β, and P(Ho is true) + P(Ha is true) = 1, we have

γ ≤ max{α, β}.

A good decision rule (or a good test) is one which yields the smallest γ. In view of the above inequality, γ will be small if the probability of type I error as well as the probability of type II error is small.

The alternative hypothesis is usually a composite hypothesis. Thus, it is not possible to find a single value for the probability of type II error, β. For a composite alternative, β is a function of θ, that is, β : Ωo^c → [0, 1], where Ωo^c denotes the complement of the set Ωo in the parameter space Ω. In hypothesis testing, instead of β, one usually considers the power of the test, 1 − β(θ); a small probability of type II error is equivalent to a large power of the test.

Definition 18.8. Let Ho : θ ∈ Ωo and Ha : θ ∉ Ωo be the null and alternative hypotheses to be tested based on a random sample X1, X2, ..., Xn from a population X with density f(x; θ), where θ is a parameter. The power function of a hypothesis test Ho : θ ∈ Ωo versus Ha : θ ∉ Ωo is the function π : Ω → [0, 1] defined by

π(θ) = P(Type I Error) if Ho is true, and π(θ) = 1 − P(Type II Error) if Ha is true.

Example 18.9. A manufacturing firm needs to test the null hypothesis Ho that the probability p of a defective item is 0.1 or less, against the alternative hypothesis Ha : p > 0.1. The procedure is to select two items at random. If both are defective, Ho is rejected; otherwise, a third is selected. If the third item is defective, Ho is rejected. In all other cases, Ho is accepted. What is the power of the test in terms of p (if Ho is true)?

Answer: Let p be the probability of a defective item. We want to calculate the power of the test at the null hypothesis. The power function of the test is given by

π(p) = P(Type I Error) if p ≤ 0.1, and π(p) = 1 − P(Type II Error) if p > 0.1.

Hence, we have

π(p) = P(Reject Ho / Ho is true) = P(Reject Ho / p)
     = P(first two items are both defective / p) + P(at least one of the first two items is not defective and the third is defective / p)
     = p² + [(1 − p)² + 2 p(1 − p)] p
     = p + p² − p³.

[Figure: graph of the power function π(p) = p + p² − p³.]

Remark 18.4. If X denotes the number of independent trials needed to obtain the first success, then X ∼ GEO(p), and

P(X = k) = (1 − p)^{k−1} p, where k = 1, 2, 3, ..., ∞.

Further, P(X ≤ n) = 1 − (1 − p)^n, since

∑_{k=1}^n (1 − p)^{k−1} p = p ∑_{k=1}^n (1 − p)^{k−1} = p · (1 − (1 − p)^n) / (1 − (1 − p)) = 1 − (1 − p)^n.

Example 18.10. Let X be the number of independent trials required to obtain a success, where p is the probability of success on each trial. The hypothesis Ho : p = 0.1 is to be tested against the alternative Ha : p = 0.3. The hypothesis is rejected if X ≤ 4. What is the power of the test if Ha is true?
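Before turning to the answer of Example 18.10, note that the power function π(p) = p + p² − p³ of Example 18.9 can be confirmed by a short Monte Carlo sketch that simply plays out the two-or-three-item sampling rule; the probed values of p and the replication count below are arbitrary.

```python
import numpy as np

# Monte Carlo check of Example 18.9: reject Ho if the first two items are
# both defective, or if they are not both defective and the third item is.
rng = np.random.default_rng(1)

def rejection_rate(p, reps=200_000):
    d = rng.random((reps, 3)) < p                      # defective indicators
    both = d[:, 0] & d[:, 1]
    reject = both | (~both & d[:, 2])
    return reject.mean()

for p in (0.05, 0.10, 0.30):
    print(p, rejection_rate(p), p + p**2 - p**3)       # simulated vs. formula
```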
Answer: The power function is given by π(p) = Hence, we have   P (Type I Error)  if p = 0.1 1 − P (Type II Error) if p = 0.3. α = 1 − P (Accept Ho / Ho is false) = P (Reject Ho / Ha is true) = P (X ≤ 4 / Ha is true) = P (X ≤ 4 / p = 0.3) = 4 % P (X = k /p = 0.3) k=1 = 4 % k=1 = 4 % (1 − p)k−1 p (where p = 0.3) (0.7)k−1 (0.3) k=1 = 0.3 4 % (0.7)k−1 k=1 = 1 − (0.7)4 = 0.7599. Hence, the power of the test at the alternative is 0.7599. Example 18.11. Let X1 , X2 , ..., X25 be a random sample of size 25 drawn from a normal distribution with unknown mean µ and variance σ 2 = 100. It is desired to test the null hypothesis µ = 4 against the alternative µ = 6. What is the power at µ = 6 of the test with rejection rule: reject µ = 4 if 125 i=1 Xi ≥ 125? Probability and Mathematical Statistics 549 Answer: The power of the test at the alternative is π(6) = 1 − P (Type II Error) = 1 − P (Accept Ho / Ho is false) = P (Reject Ho / Ha is true) ) 25 + % =P Xi ≥ 125 / Ha : µ = 6 i=1 " = P X ≥ 5 / Ha µ = 6 ) + X −6 5−6 =P ≥ 10 10 ! =P # √ 25 1 Z≥− 2 = 0.6915. $ √ 25 Example 18.12. A urn contains 7 balls, θ of which are red. A sample of size 2 is drawn without replacement to test Ho : θ ≤ 1 against Ha : θ > 1. If the null hypothesis is rejected if one or more red balls are drawn, find the power of the test when θ = 2. Answer: The power of the test at θ = 2 is given by π(2) = 1 − P (Type II Error) = 1 − P (Accept Ho / Ho is false) = 1 − P (zero red balls are drawn /2 balls were red) ! 5" = 1 − !72" 2 10 =1− 21 11 = 21 = 0.524. In all of these examples, we have seen that if the rule for rejection of the null hypothesis Ho is given, then one can compute the significance level or power function of the hypothesis test. The rejection rule is given in terms of a statistic W (X1 , X2 , ..., Xn ) of the sample X1 , X2 , ..., Xn . For instance, in Example 18.5, the rejection rule was: “Reject the null hypothesis Ho if 120 i=1 Xi ≤ 6.” Similarly, in Example 18.7, the rejection rule was: “Reject Ho Test of Statistical Hypotheses for Parameters 550 if and only if X1 + X2 + X3 + X4 < −20”, and so on. The statistic W , used in the statement of the rejection rule, partitioned the set S n into two subsets, where S denotes the support of the density function of the population X. One subset is called the rejection or critical region and other subset is called the acceptance region. The rejection rule is obtained in such a way that the probability of the type I error is as small as possible and the power of the test at the alternative is as large as possible. Next, we give two definitions that will lead us to the definition of uniformly most powerful test. Definition 18.9. Given 0 ≤ δ ≤ 1, a test (or test procedure) T for testing the null hypothesis Ho : θ ∈ Ωo against the alternative Ha : θ ∈ Ωa is said to be a test of level δ if max π(θ) ≤ δ, θ∈Ωo where π(θ) denotes the power function of the test T . Definition 18.10. Given 0 ≤ δ ≤ 1, a test (or test procedure) for testing the null hypothesis Ho : θ ∈ Ωo against the alternative Ha : θ ∈ Ωa is said to be a test of size δ if max π(θ) = δ. θ∈Ωo Definition 18.11. Let T be a test procedure for testing the null hypothesis Ho : θ ∈ Ωo against the alternative Ha : θ ∈ Ωa . The test (or test procedure) T is said to be the uniformly most powerful (UMP) test of level δ if T is of level δ and for any other test W of level δ, πT (θ) ≥ πW (θ) for all θ ∈ Ωa . Here πT (θ) and πW (θ) denote the power functions of tests T and W , respectively. Remark 18.5. 
If T is a test procedure for testing Ho : θ = θo against Ha : θ = θa based on a sample data x1 , ..., xn from a population X with a continuous probability density function f (x; θ), then there is a critical region C associated with the the test procedure T , and power function of T can be computed as : πT = C L(θa , x1 , ..., xn ) dx1 · · · dxn . Probability and Mathematical Statistics 551 Similarly, the size of a critical region C, say α, can be given by : α= L(θo , x1 , ..., xn ) dx1 · · · dxn . C The following famous result tells us which tests are uniformly most powerful if the null hypothesis and the alternative hypothesis are both simple. Theorem 18.1 (Neyman-Pearson). Let X1 , X2 , ..., Xn be a random sample from a population with probability density function f (x; θ). Let L(θ, x1 , ..., xn ) = n P f (xi ; θ) i=1 be the likelihood function of the sample. Then any critical region C of the form > ? 9 > L (θo , x1 , ..., xn ) > ≤k C = (x1 , x2 , ..., xn ) > L (θa , x1 , ..., xn ) for some constant 0 ≤ k < ∞ is best (or uniformly most powerful) of its size for testing Ho : θ = θo against Ha : θ = θa . Proof: We assume that the population has a continuous probability density function. If the population has a discrete distribution, the proof can be appropriately modified by replacing integration by summation. Let C be the critical region of size α as described in the statement of the theorem. Let B be any other critical region of size α. We want to show that the power of C is greater than or equal to that of B. In view of Remark 18.5, we would like to show that : : L(θa , x1 , ..., xn ) dx1 · · · dxn ≥ L(θa , x1 , ..., xn ) dx1 · · · dxn . (1) C B Since C and B are both critical regions of size α, we have : : L(θo , x1 , ..., xn ) dx1 · · · dxn . L(θo , x1 , ..., xn ) dx1 · · · dxn = C (2) B The last equality (2) can be written as : : L(θo , x1 , ..., xn ) dx1 · · · dxn + L(θo , x1 , ..., xn ) dx1 · · · dxn C∩B C∩B c : : L(θo , x1 , ..., xn ) dx1 · · · dxn L(θo , x1 , ..., xn ) dx1 · · · dxn + = C∩B C c ∩B Test of Statistical Hypotheses for Parameters 552 since C = (C ∩ B) ∪ (C ∩ B c ) and B = (C ∩ B) ∪ (C c ∩ B). Therefore from the last equality, we have : : L(θo , x1 , ..., xn ) dx1 · · · dxn = C c ∩B C∩B c Since C= we have 9 L(θo , x1 , ..., xn ) dx1 · · · dxn . (4) > ? > L (θo , x1 , ..., xn ) > ≤k (x1 , x2 , ..., xn ) > L (θa , x1 , ..., xn ) L(θa , x1 , ..., xn ) ≥ (3) L(θo , x1 , ..., xn ) k (5) (6) on C, and L(θo , x1 , ..., xn ) k on C c . Therefore from (4), (6) and (7), we have : L(θa , x1 ,..., xn ) dx1 · · · dxn C∩B c : L(θo , x1 , ..., xn ) dx1 · · · dxn ≥ k c :C∩B L(θo , x1 , ..., xn ) dx1 · · · dxn = k c :C ∩B L(θa , x1 , ..., xn ) dx1 · · · dxn . ≥ L(θa , x1 , ..., xn ) < (7) C c ∩B Thus, we obtain : : L(θa , x1 , ..., xn ) dx1 · · · dxn ≥ C∩B c C c ∩B L(θa , x1 , ..., xn ) dx1 · · · dxn . From (3) and the last inequality, we see that : L(θa , x1 , ..., xn ) dx1 · · · dxn C : : = L(θa , x1 , ..., xn ) dx1 · · · dxn + L(θa , x1 , ..., xn ) dx1 · · · dxn c :C∩B :C∩B L(θa , x1 , ..., xn ) dx1 · · · dxn L(θa , x1 , ..., xn ) dx1 · · · dxn + ≥ C c ∩B C∩B : ≥ L(θa , x1 , ..., xn ) dx1 · · · dxn B and hence the theorem is proved. Probability and Mathematical Statistics 553 Now we give several examples to illustrate the use of this theorem. Example 18.13. Let X be a random variable with a density function f (x). 
What is the critical region for the best test of   21 if −1 < x < 1 Ho : f (x) =  0 elsewhere, against Ha : f (x) =   1 − |x|  0 if −1 < x < 1 elsewhere, at the significance size α = 0.10? Answer: We assume that the test is performed with a sample of size 1. Using Neyman-Pearson Theorem, the best critical region for the best test at the significance size α is given by 9 ? Lo (x) C = x ∈R I | ≤k La (x) 9 ? 1 2 = x ∈R I | ≤k 1 − |x| ? 9 1 = x ∈R I | |x| ≤ 1 − 2k ? 9 1 1 −1≤x≤1− . = x ∈R I | 2k 2k Since 0.1 = P ( C ) # $ Lo (X) =P ≤ k / Ho is true La (X) # $ 1 2 =P ≤ k / Ho is true 1 − |X| $ # , 1 1 −1≤X ≤1− / Ho is true =P 2k 2k 1 : 1− 2k 1 dx = 1 2 2k −1 1 , 2k we get the critical region C to be =1− C = {x ∈ R I | − 0.1 ≤ x ≤ 0.1}. Test of Statistical Hypotheses for Parameters 554 Thus the best critical region is C = [−0.1, 0.1] and the best test is: “Reject Ho if −0.1 ≤ X ≤ 0.1”. Example 18.14. Suppose X has the density function f (x; θ) = & (1 + θ) xθ if 0 ≤ x ≤ 1 0 otherwise. Based on a single observed value of X, find the most powerful critical region of size α = 0.1 for testing Ho : θ = 1 against Ha : θ = 2. Answer: By Neyman-Pearson Theorem, the form of the critical region is given by ? 9 L (θo , x) C = x ∈R I | ≤k L (θa , x) ? 9 (1 + θo ) xθo ≤k = x ∈R I | (1 + θa ) xθa 9 ? 2x = x ∈R I | ≤k 3x2 ? 9 1 3 = x ∈R I | ≤ k x 2 = {x ∈ R I | x ≥ a, } where a is some constant. Hence the most powerful or best test is of the form: “Reject Ho if X ≥ a.” Since, the significance level of the test is given to be α = 0.1, the constant a can be determined. Now we proceed to find a. Since 0.1 = α = P (Reject Ho / Ho is true} = P (X ≥ a / θ = 1) : 1 2x dx = a = 1 − a2 , hence a2 = 1 − 0.1 = 0.9. Therefore a= √ 0.9, Probability and Mathematical Statistics 555 since k in Neyman-Pearson Theorem is positive. Hence, the most powerful √ test is given by “Reject Ho if X ≥ 0.9”. Example 18.15. Suppose that X is a random variable about which the hypothesis Ho : X ∼ U N IF (0, 1) against Ha : X ∼ N (0, 1) is to be tested. What is the most powerful test with a significance level α = 0.05 based on one observation of X? Answer: By Neyman-Pearson Theorem, the form of the critical region is given by 9 ? Lo (x) C = x ∈R I | ≤k La (x) J S √ 1 2 = x ∈R I | 2π e 2 x ≤ k $? 9 # k 2 = x ∈R I | x ≤ 2 ln √ 2π = {x ∈ R I | x ≤ a, } where a is some constant. Hence the most powerful or best test is of the form: “Reject Ho if X ≤ a.” Since, the significance level of the test is given to be α = 0.05, the constant a can be determined. Now we proceed to find a. Since 0.05 = α = P (Reject Ho / Ho is true} = P (X ≤ a / X ∼ U N IF (0, 1)) : a dx = 0 = a, hence a = 0.05. Thus, the most powerful critical region is given by C = {x ∈ R I | 0 < x ≤ 0.05} based on the support of the uniform distribution on the open interval (0, 1). Since the support of this uniform distribution is the interval (0, 1), the acceptance region (or the complement of C in (0, 1)) is C c = {x ∈ R I | 0.05 < x < 1}. Test of Statistical Hypotheses for Parameters 556 However, since the support of the standard normal distribution is R, I the actual critical region should be the complement of C c in R. I Therefore, the critical region of this hypothesis test is the set {x ∈ R I | x ≤ 0.05 or x ≥ 1}. The most powerful test for α = 0.05 is: “Reject Ho if X ≤ 0.05 or X ≥ 1.” Example 18.16. Let X1 , X2 , X3 denote three independent observations from a distribution with density f (x; θ) = & (1 + θ) xθ if 0 ≤ x ≤ 1 0 otherwise. 
What is the form of the best critical region of size 0.034 for testing Ho : θ = 1 versus Ha : θ = 2? Answer: By Neyman-Pearson Theorem, the form of the critical region is given by (with θo = 1 and θa = 2) 9 ? L (θo , x1 , x2 , x3 ) 3 C = (x1 , x2 , x3 ) ∈ R ≤k I | L (θa , x1 , x2 , x3 ) & ' O3 θo 3 (1 + θ ) x o i ≤k = (x1 , x2 , x3 ) ∈ R I3 | Oi=1 3 (1 + θa )3 i=1 xθi a ? 9 8x1 x2 x3 ≤ k = (x1 , x2 , x3 ) ∈ R I3 | 27x21 x22 x23 ? 9 27 1 ≤ k = (x1 , x2 , x3 ) ∈ R I3 | x1 x2 x3 8 Q R 3 = (x1 , x2 , x3 ) ∈ R I | x1 x2 x3 ≥ a, where a is some constant. Hence the most powerful or best test is of the 3 P form: “Reject Ho if Xi ≥ a.” i=1 Since, the significance level of the test is given to be α = 0.034, the constant a can be determined. To evaluate the constant a, we need the probability distribution of X1 X2 X3 . The distribution of X1 X2 X3 is not easy to get. Hence, we will use Theorem 17.5. There, we have shown that Probability and Mathematical Statistics −2(1 + θ) 13 i=1 557 ln Xi ∼ χ2 (6). Now we proceed to find a. Since 0.034 = α = P (Reject Ho / Ho is true} = P (X1 X2 X3 ≥ a / θ = 1) = P (ln(X1 X2 X3 ) ≥ ln a / θ = 1) = P (−2(1 + θ) ln(X1 X2 X3 ) ≤ −2(1 + θ) ln a / θ = 1) = P (−4 ln(X1 X2 X3 ) ≤ −4 ln a) ! " = P χ2 (6) ≤ −4 ln a hence from chi-square table, we get −4 ln a = 1.4. Therefore a = e−0.35 = 0.7047. Hence, the most powerful test is given by “Reject Ho if X1 X2 X3 ≥ 0.7047”. The critical region C is the region above the surface x1 x2 x3 = 0.7047 of the unit cube [0, 1]3 . The following figure illustrates this region. Critical region is to the right of the shaded surface Example 18.17. Let X1 , X2 , ..., X12 be a random sample from a normal population with mean zero and variance σ 2 . What is the most powerful test of size 0.025 for testing the null hypothesis Ho : σ 2 = 10 versus Ha : σ 2 = 5? Test of Statistical Hypotheses for Parameters 558 Answer: By Neyman-Pearson Theorem, the form of the critical region is given by (with σo 2 = 10 and σa 2 = 5) C= & (x1 , x2 , ..., x12 ) ∈ R I 12    = (x1 , x2 , ..., x12 ) ∈ R I 12   & = = (x1 , x2 , ..., x12 ) ∈ R I 12 & (x1 , x2 , ..., x12 ) ∈ R I 12 > ! ' > L σ 2 , x , x , ..., x " 1 2 12 o > ≤k > > L (σa 2 , x1 , x2 , ..., x12 ) >  2 x > − 12 ( σoi ) 1  12 √ e >P  2πσo2 > ≤ k > 2 x −1( i )  > 1  > i=1 √2πσ2 e 2 σa a ># $ ' > 1 6 1 112 2 > x e 20 i=1 i ≤ k > > 2 > ' 12 >% > 2 xi ≤ a , > > i=1 where a is some constant. Hence the most powerful or best test is of the 12 % form: “Reject Ho if Xi2 ≤ a.” i=1 Since, the significance level of the test is given to be α = 0.025, the constant a can be determined. To evaluate the constant a, we need the 2 probability distribution of X12 + X22 + · · · + X12 . It can be shown that the 112 ! Xi "2 2 ∼ χ (12). Now we proceed to find a. Since distribution of i=1 σ 0.025 = α = P (Reject Ho / Ho is true} ) 12 # $ + % Xi 2 2 =P ≤ a / σ = 10 σ i=1 ) 12 # + % X i $2 √ =P ≤ a / σ 2 = 10 10 i=1 , a= P χ2 (12) ≤ , 10 hence from chi-square table, we get a = 4.4. 10 Therefore a = 44. Probability and Mathematical Statistics 559 Hence, the most powerful test is given by “Reject Ho if best critical region of size 0.025 is given by C= & (x1 , x2 , ..., x12 ) ∈ R I 12 | 12 % i=1 x2i 112 ≤ 44 i=1 ' Xi2 ≤ 44.” The . In last five examples, we have found the most powerful tests and corresponding critical regions when the both Ho and Ha are simple hypotheses. If either Ho or Ha is not simple, then it is not always possible to find the most powerful test and corresponding critical region. 
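The table constants used in the last two examples can be reproduced with chi-square quantiles, as the following sketch shows: Example 18.16 needs the 0.034-quantile of χ²(6), and Example 18.17 the 0.025-quantile of χ²(12).

```python
from math import exp
from scipy.stats import chi2

# Example 18.16:  0.034 = P(chi2(6) <= -4 ln a),  so a = exp(-quantile / 4).
q6 = chi2.ppf(0.034, df=6)
print(q6, exp(-q6 / 4))        # about 1.40 and 0.705

# Example 18.17:  0.025 = P(chi2(12) <= a / 10),  so a = 10 * quantile.
q12 = chi2.ppf(0.025, df=12)
print(q12, 10 * q12)           # about 4.40 and 44
```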
In this situation, hypothesis test is found by using the likelihood ratio. A test obtained by using likelihood ratio is called the likelihood ratio test and the corresponding critical region is called the likelihood ratio critical region. 18.4. Some Examples of Likelihood Ratio Tests In this section, we illustrate, using likelihood ratio, how one can construct hypothesis test when one of the hypotheses is not simple. As pointed out earlier, the test we will construct using the likelihood ratio is not the most powerful test. However, such a test has all the desirable properties of a hypothesis test. To construct the test one has to follow a sequence of steps. These steps are outlined below: (1) Find the likelihood function L(θ, x1 , x2 , ..., xn ) for the given sample. (2) Evaluate maxL(θ, x1 , x2 , ..., xn ). θ∈Ωo (3) Find the maximum likelihood estimator θZ of θ. , Z x1 , x2 , ..., xn . (4) Compute maxL(θ, x1 , x2 , ..., xn ) using L θ, θ∈Ω maxL(θ, x1 , x2 , ..., xn ) (5) Using steps (2) and (4), find W (x1 , ..., xn ) = θ∈Ωo maxL(θ, x1 , x2 , ..., xn ) . θ∈Ω (6) Using step (5) determine C = {(x1 , x2 , ..., xn ) | W (x1 , ..., xn ) ≤ k}, where k ∈ [0, 1]. [ (x1 , ..., xn ) ≤ A. (7) Reduce W (x1 , ..., xn ) ≤ k to an equivalent inequality W [ (x1 , ..., xn ). (8) Determine the distribution of W , [ (x1 , ..., xn ) ≤ A | Ho is true . (9) Find A such that given α equals P W Test of Statistical Hypotheses for Parameters 560 In the remaining examples, for notational simplicity, we will denote the likelihood function L(θ, x1 , x2 , ..., xn ) simply as L(θ). Example 18.19. Let X1 , X2 , ..., Xn be a random sample from a normal population with mean µ and known variance σ 2 . What is the likelihood ratio test of size α for testing the null hypothesis Ho : µ = µo versus the alternative hypothesis Ha : µ += µo ? Answer: The likelihood function of the sample is given by L(µ) = n # P i=1 = # 1 √ σ 2π 1 √ σ 2π $n $ 1 e− 2σ2 (xi −µ) − 1 2σ 2 e n % i=1 2 (xi − µ)2 . Since Ωo = {µo }, we obtain max L(µ) = L(µo ) µ∈Ωo = # 1 √ σ 2π $n − n % 1 2σ 2 e i=1 (xi − µo )2 . We have seen in Example 15.13 that if X ∼ N (µ, σ 2 ), then the maximum likelihood estimator of µ is X, that is µ Z = X. Hence max L(µ) = L(Z µ) = µ∈Ω # 1 √ σ 2π $n − 1 2σ 2 e n % i=1 (xi − x)2 Now the likelihood ratio statistics W (x1 , x2 , ..., xn ) is given by W (x1 , x2 , ..., xn ) = , √1 σ 2π , -n √1 σ 2π − 1 2σ 2 e -n − e n % i=1 1 2σ 2 (xi − µo )2 n % i=1 (xi − x)2 . Probability and Mathematical Statistics 561 which simplifies to n 2 W (x1 , x2 , ..., xn ) = e− 2σ2 (x−µo ) . Now the inequality W (x1 , x2 , ..., xn ) ≤ k becomes 2 n e− 2σ2 (x−µo ) ≤ k and which can be rewritten as (x − µo )2 ≥ − 2σ 2 ln(k) n or |x − µo | ≥ K = 2 where K = − 2σn ln(k). In view of the above inequality, the critical region can be described as C = {(x1 , x2 , ..., xn ) | |x − µo | ≥ K }. Since we are given the size of the critical region to be α, we can determine the constant K. Since the size of the critical region is α, we have > !> " α = P >X − µo > ≥ K . For finding K, we need the probability density function of the statistic X −µo when the population X is N (µ, σ 2 ) and the null hypothesis Ho : µ = µo is true. 
Since σ 2 is known and Xi ∼ N (µ, σ 2 ), X − µo √σ n and ∼ N (0, 1) > !> " α = P >X − µo > ≥ K > )> √ + >X − µ > n > o> =P > σ >≥K > √n > σ # √ $ n X − µo = P |Z| ≥ K where Z= √σ σ n # √ $ √ n n ≤Z≤K = 1 − P −K σ σ Test of Statistical Hypotheses for Parameters 562 we get z α2 = K which is √ n σ σ K = z α2 √ , n where z α2 is a real number such that the integral of the standard normal density from z α2 to ∞ equals α2 . Hence, the likelihood ratio test is given by “Reject Ho if > > >X − µo > ≥ z α √σ .” 2 n If we denote z= x − µo √σ n then the above inequality becomes |Z| ≥ z α2 . Thus critical region is given by Q C = (x1 , x2 , ..., xn ) | |z| ≥ z α2 }. This tells us that the null hypothesis must be rejected when the absolute value of z takes on a value greater than or equal to z α2 . - Z!/2 Reject Ho Accept Ho Z!/2 Reject H o Remark 18.6. The hypothesis Ha : µ += µo is called a two-sided alternative hypothesis. An alternative hypothesis of the form Ha : µ > µo is called a right-sided alternative. Similarly, Ha : µ < µo is called the a left-sided Probability and Mathematical Statistics 563 alternative. In the above example, if we had a right-sided alternative, that is Ha : µ > µo , then the critical region would have been C = {(x1 , x2 , ..., xn ) | z ≥ zα }. Similarly, if the alternative would have been left-sided, that is Ha : µ < µo , then the critical region would have been C = {(x1 , x2 , ..., xn ) | z ≤ −zα }. We summarize the three cases of hypotheses test of the mean (of the normal population with known variance) in the following table. Ho Ha µ = µo µ > µo µ = µo µ < µo µ = µo µ += µo Critical Region (or Test) z= z= x−µo σ √ n x−µo σ √ n ≥ zα ≤ −zα > > > x−µ > o> > |z| = > √σ > ≥ z α2 n Example 18.20. Let X1 , X2 , ..., Xn be a random sample from a normal population with mean µ and unknown variance σ 2 . What is the likelihood ratio test of size α for testing the null hypothesis Ho : µ = µo versus the alternative hypothesis Ha : µ += µo ? Answer: In this example, " R µ, σ 2 ∈ R I 2 | − ∞ < µ < ∞, σ 2 > 0 , Q! " R Ωo = µo , σ 2 ∈ R I 2 | σ2 > 0 , Q! " R Ωa = µ, σ 2 ∈ R I 2 | µ += µo , σ 2 > 0 . Ω= Q! These sets are illustrated below. Test of Statistical Hypotheses for Parameters 564 ' %& % µ µ& Graphs of % and %& The likelihood function is given by ! L µ, σ 2 " = n # P i=1 = # √ √ 1 2πσ 2 1 2πσ 2 $n $ e− 2 ( 1 1 e− 2σ2 xi −µ σ 1n i=1 2 ) (xi −µ)2 . ! " Next, we find the maximum of L µ, σ 2 on the set Ωo . Since the set Ωo is " R Q! I 2 | 0 < σ < ∞ , we have equal to µo , σ 2 ∈ R max 2 (µ,σ )∈Ωo ! " ! " 2 L µ, σ 2 = max L µ , σ . o 2 σ >0 " " ! ! Since L µo , σ 2 and ln L µo , σ 2 achieve the maximum at the same σ value, " ! we determine the value of σ where ln L µo , σ 2 achieves the maximum. Taking the natural logarithm of the likelihood function, we get n ! ! "" n n 1 % ln L µ, σ 2 = − ln(σ 2 ) − ln(2π) − 2 (xi − µo )2 . 2 2 2σ i=1 " ! Differentiating ln L µo , σ 2 with respect to σ 2 , we get from the last equality n ! ! "" d n 1 % 2 (xi − µo )2 . ln L µ, σ = − + dσ 2 2σ 2 2σ 4 i=1 Setting this derivative to zero and solving for σ, we obtain \ ] n ]1 % (xi − µo )2 . σ=^ n i=1 Probability and Mathematical Statistics 565 \ ] n ]1 % "" ! ! 2 (xi − µo )2 . Since this Thus ln L µ, σ attains maximum at σ = ^ n i=1 " ! value of σ is also yield maximum value of L µ, σ 2 , we have ) +− n2 n % ! " n 1 e− 2 . max L µo , σ 2 = 2π (xi − µo )2 2 n i=1 σ >0 ! " Next, we determine the maximum of L µ, σ 2 on the set Ω. As before, " " ! ! 
we consider ln L µ, σ 2 to determine where L µ, σ 2 achieves maximum. ! " Taking the natural logarithm of L µ, σ 2 , we obtain n ! ! "" n n 1 % ln L µ, σ 2 = − ln(σ 2 ) − ln(2π) − 2 (xi − µ)2 . 2 2 2σ i=1 ! " Taking the partial derivatives of ln L µ, σ 2 first with respect to µ and then with respect to σ 2 , we get n ! " 1 % ∂ (xi − µ), ln L µ, σ 2 = 2 ∂µ σ i=1 and n ! " ∂ n 1 % 2 ln L µ, σ = − + (xi − µ)2 , ∂σ 2 2σ 2 2σ 4 i=1 respectively. Setting these partial derivatives to zero and solving for µ and σ, we obtain n−1 2 s , and σ2 = µ=x n n % 1 where s2 = n−1 (xi − x)2 is the sample variance. i=1 ! " Letting these optimal values of µ and σ into L µ, σ 2 , we obtain ) +− n2 n % ! " n 1 (xi − x)2 e− 2 . max L µ, σ 2 = 2π n i=1 (µ,σ 2 )∈Ω Hence ! " 2 ) 2π n1 n % 2 +− n2 −n 2  n % 2 (xi − µo ) e (xi − µo )  max L µ, σ  i=1 i=1  ! " = ) = n +− n2 n max L µ, σ 2 % % 2 n (µ,σ )∈Ω 2 1 − (xi − x)2 2 (xi − x) e 2π n (µ,σ 2 )∈Ωo i=1 i=1 − n2      . Test of Statistical Hypotheses for Parameters Since n % i=1 and n % i=1 we get (xi − x)2 = (n − 1) s2 (xi − µ)2 = W (x1 , x2 , ..., xn ) = 566 n % i=1 2 (xi − x)2 + n (x − µo ) , ! " ) + n max L µ, σ 2 2 −2 n (x − µo ) " = 1+ ! . n−1 s2 max L µ, σ 2 2 (µ,σ 2 )∈Ωo (µ,σ )∈Ω Now the inequality W (x1 , x2 , ..., xn ) ≤ k becomes ) + n 2 −2 n (x − µo ) 1+ ≤k n−1 s2 and which can be rewritten as $2 # x − µo n − 1 , −2 k n −1 ≥ s n or > > >x − µ > > o> > s >≥K > √n > @ A 2 B where K = (n − 1) k − n − 1 . In view of the above inequality, the critical region can be described as & C= > > ' >x − µ > > o> (x1 , x2 , ..., xn ) | > s > ≥ K > √n > > > > x−µ > o> > and the best likelihood ratio test is: “Reject Ho if > √s > ≥ K”. Since we n are given the size of the critical region to be α, we can find the constant K. o For finding K, we need the probability density function of the statistic x−µ √s n when the population X is N (µ, σ 2 ) and the null hypothesis Ho : µ = µo is true. Since the population is normal with mean µ and variance σ 2 , X − µo √S n ∼ t(n − 1), Probability and Mathematical Statistics 567 2 where S is the sample variance and equals to 1 n−1 n % ! i=1 s K = t α2 (n − 1) √ , n "2 Xi − X . Hence where t α2 (n − 1) is a real number such that the integral of the t-distribution with n − 1 degrees of freedom from t α2 (n − 1) to ∞ equals α2 . Therefore, the likelihood ratio test is given by “Reject Ho : µ = µo if If we denote > > >X − µo > ≥ t α (n − 1) √S .” 2 n t= x − µo √s n then the above inequality becomes |T | ≥ t α2 (n − 1). Thus critical region is given by Q C = (x1 , x2 , ..., xn ) | |t| ≥ t α2 (n − 1) }. This tells us that the null hypothesis must be rejected when the absolute value of t takes on a value greater than or equal to t α2 (n − 1). - t!/2(n-1) Reject H o Accept Ho t!/2 (n-1) Reject Ho Remark 18.7. In the above example, if we had a right-sided alternative, that is Ha : µ > µo , then the critical region would have been C = {(x1 , x2 , ..., xn ) | t ≥ tα (n − 1) }. Test of Statistical Hypotheses for Parameters 568 Similarly, if the alternative would have been left-sided, that is Ha : µ < µo , then the critical region would have been C = {(x1 , x2 , ..., xn ) | t ≤ −tα (n − 1) }. We summarize the three cases of hypotheses test of the mean (of the normal population with unknown variance) in the following table. Ho Ha Critical Region (or Test) µ = µo µ > µo µ = µo µ < µo µ = µo µ += µo t= t= x−µo √s n x−µo √s n ≥ tα (n − 1) ≤ −tα (n − 1) > > > > o> |t| = >> x−µ √s > ≥ t α2 (n − 1) n Example 18.21. 
Let X1 , X2 , ..., Xn be a random sample from a normal population with mean µ and variance σ 2 . What is the likelihood ratio test of significance of size α for testing the null hypothesis Ho : σ 2 = σo2 versus Ha : σ 2 += σo2 ? Answer: In this example, Q! " R Ω = µ, σ 2 ∈ R I 2 | − ∞ < µ < ∞, σ 2 > 0 , " R Q! I2 | −∞<µ<∞ , Ωo = µ, σo2 ∈ R Q! " R Ωa = µ, σ 2 ∈ R I 2 | − ∞ < µ < ∞, σ += σo . These sets are illustrated below. ' '& %& % µ 0 Graphs of % and %& Probability and Mathematical Statistics 569 The likelihood function is given by ! L µ, σ 2 " = n # P i=1 = # √ √ 1 2πσ 2 1 2πσ 2 $n $ e− 2 ( 1 xi −µ σ 1n 1 e− 2σ2 i=1 2 ) (xi −µ)2 . ! " Next, we find the maximum of L µ, σ 2 on the set Ωo . Since the set Ωo is Q! " R equal to µ, σo2 ∈ R I 2 | − ∞ < µ < ∞ , we have max (µ,σ 2 )∈Ωo " ! L µ, σ 2 = max −∞<µ<∞ " ! L µ, σo2 . ! " ! " Since L µ, σo2 and ln L µ, σo2 achieve the maximum at the same µ value, we ! " determine the value of µ where ln L µ, σo2 achieves the maximum. Taking the natural logarithm of the likelihood function, we get n ! ! "" n 1 % n (xi − µ)2 . ln L µ, σo2 = − ln(σo2 ) − ln(2π) − 2 2 2 2σo i=1 ! " Differentiating ln L µ, σo2 with respect to µ, we get from the last equality n ! ! "" d 1 % ln L µ, σ 2 = 2 (xi − µ). dµ σo i=1 Setting this derivative to zero and solving for µ, we obtain µ = x. Hence, we obtain max −∞<µ<∞ ! " L µ, σ 2 = # 1 2πσo2 $ n2 − e 1 2 2σo 1n i=1 (xi −x)2 ! " Next, we determine the maximum of L µ, σ 2 on the set Ω. As before, " ! " ! we consider ln L µ, σ 2 to determine where L µ, σ 2 achieves maximum. ! " Taking the natural logarithm of L µ, σ 2 , we obtain n ! ! "" 1 % n ln L µ, σ 2 = −n ln(σ) − ln(2π) − 2 (xi − µ)2 . 2 2σ i=1 Test of Statistical Hypotheses for Parameters 570 ! " Taking the partial derivatives of ln L µ, σ 2 first with respect to µ and then with respect to σ 2 , we get n " ! 1 % ∂ 2 (xi − µ), ln L µ, σ = 2 ∂µ σ i=1 and n ! " n 1 % ∂ 2 ln L µ, σ = − + (xi − µ)2 , ∂σ 2 2σ 2 2σ 4 i=1 respectively. Setting these partial derivatives to zero and solving for µ and σ, we obtain n−1 2 µ=x and σ2 = s , n n % 1 (xi − x)2 is the sample variance. where s2 = n−1 i=1 ! " Letting these optimal values of µ and σ into L µ, σ 2 , we obtain ! max L µ, σ (µ,σ 2 )∈Ω Therefore " 2 = # n 2π(n − 1)s2 $ n2 − n 2(n−1)s2 e " ! 2 max L µ, σ (µ,σ 2 )∈Ωo ! " W (x1 , x2 , ..., xn ) = max L µ, σ 2 2 (µ,σ )∈Ω =, =n , 1 2πσo2 n 2π(n−1)s2 −n 2 e n 2 # - n2 - n2 − e 1 2 2σo − e i=1 1n i=1 $ n2 (xi − x)2 (xi −x)2 1n n 2(n−1)s2 (n − 1)s2 σo2 n % i=1 − e (xi −x)2 (n−1)s2 2 2σo Now the inequality W (x1 , x2 , ..., xn ) ≤ k becomes n −n 2 e n 2 # (n − 1)s2 σo2 $ n2 − e (n−1)s2 2 2σo ≤k which is equivalent to $n (n−1)s2 # , - n $2 # − n 2 (n − 1)s2 2 σo e ≤ k := Ko , σo2 e . . Probability and Mathematical Statistics 571 where Ko is a constant. Let H be a function defined by H(w) = wn e−w . Using this, we see that the above inequality becomes H # (n − 1)s2 σo2 $ ≤ Ko . The figure below illustrates this inequality. H(w) Graph of H(w) Ko w K1 K2 From this it follows that (n − 1)s2 ≤ K1 σo2 or (n − 1)s2 ≥ K2 . σo2 In view of these inequalities, the critical region can be described as C= 9 > ? > (n − 1)s2 (n − 1)s2 > (x1 , x2 , ..., xn ) > ≤ K1 or ≥ K2 , σo2 σo2 and the best likelihood ratio test is: “Reject Ho if (n − 1)S 2 (n − 1)S 2 ≤ K or ≥ K2 .” 1 σo2 σo2 Since we are given the size of the critical region to be α, we can determine the constants K1 and K2 . 
As the sample X1 , X2 , ..., Xn is taken from a normal distribution with mean µ and variance σ 2 , we get (n − 1)S 2 ∼ χ2 (n − 1) σo2 Test of Statistical Hypotheses for Parameters 572 when the null hypothesis Ho : σ 2 = σo2 is true. Therefore, the likelihood ratio critical region C becomes > ? 9 > (n − 1)s2 (n − 1)s2 2 2 α (n − 1) or α (n − 1) ≤ χ ≥ χ (x1 , x2 , ..., xn ) >> 1− 2 2 σo2 σo2 and the likelihood ratio test is: “Reject Ho : σ 2 = σo2 if (n − 1)S 2 (n − 1)S 2 2 α (n − 1) or ≤ χ ≥ χ21− α2 (n − 1)” 2 σo2 σo2 where χ2α (n − 1) is a real number such that the integral of the chi-square 2 density function with (n − 1) degrees of freedom from 0 to χ2α (n − 1) is α2 . 2 Further, χ21− α (n − 1) denotes the real number such that the integral of the 2 chi-square density function with (n − 1) degrees of freedom from χ21− α (n − 1) 2 to ∞ is α2 . Remark 18.8. We summarize the three cases of hypotheses test of the variance (of the normal population with unknown mean) in the following table. Ho Ha σ 2 = σo2 σ 2 > σo2 σ 2 = σo2 σ 2 < σo2 σ 2 = σo2 σ 2 += σo2 Critical Region (or Test) χ2 = (n−1)s2 σo2 χ2 = ≥ χ21−α (n − 1) (n−1)s2 σo2 χ2 = (n−1)s2 σo2 χ2 = (n−1)s2 σo2 ≤ χ2α (n − 1) ≥ χ21−α/2 (n − 1) or ≤ χ2α/2 (n − 1) 18.5. Review Exercises 1. Five trials X1 , X2 , ..., X5 of a Bernoulli experiment were conducted to test Ho : p = 12 against Ha : p = 43 . The null hypothesis Ho will be rejected if 15 i=1 Xi = 5. Find the probability of Type I and Type II errors. 2. A manufacturer of car batteries claims that the life of his batteries is normally distributed with a standard deviation equal to 0.9 year. If a random Probability and Mathematical Statistics 573 sample of 10 of these batteries has a standard deviation of 1.2 years, do you think that σ > 0.9 year? Use a 0.05 level of significance. 3. Let X1 , X2 , ..., X8 be a random sample of size 8 from a Poisson distribution with parameter λ. Reject the null hypothesis Ho : λ = 0.5 if the observed 18 sum i=1 xi ≥ 8. First, compute the significance level α of the test. Second, find the power function β(λ) of the test as a sum of Poisson probabilities when Ha is true. 4. Suppose X has the density function f (x) = &1 θ for 0 < x < θ 0 otherwise. If one observation of X is taken, what are the probabilities of Type I and Type II errors in testing the null hypothesis Ho : θ = 1 against the alternative hypothesis Ha : θ = 2, if Ho is rejected for X > 0.92. 5. Let X have the density function f (x) = & (θ + 1) xθ for 0 < x < 1 where θ > 0 0 otherwise. The hypothesis Ho : θ = 1 is to be rejected in favor of H1 : θ = 2 if X > 0.90. What is the probability of Type I error? 6. Let X1 , X2 , ..., X6 be a random sample from a distribution with density function & θ−1 θx for 0 < x < 1 where θ > 0 f (x) = 0 otherwise. The null hypothesis Ho : θ = 1 is to be rejected in favor of the alternative Ha : θ > 1 if and only if at least 5 of the sample observations are larger than 0.7. What is the significance level of the test? 7. A researcher wants to test Ho : θ = 0 versus Ha : θ = 1, where θ is a parameter of a population of interest. The statistic W , based on a random sample of the population, is used to test the hypothesis. Suppose that under Ho , W has a normal distribution with mean 0 and variance 1, and under Ha , W has a normal distribution with mean 4 and variance 1. If Ho is rejected when W > 1.50, then what are the probabilities of a Type I or Type II error respectively? Test of Statistical Hypotheses for Parameters 574 8. 
Let X1 and X2 be a random sample of size 2 from a normal distribution N (µ, 1). Find the likelihood ratio critical region of size 0.005 for testing the null hypothesis Ho : µ = 0 against the composite alternative Ha : µ += 0? 9. Let X1 , X2 , ..., X10 be a random sample from a Poisson distribution with mean θ. What is the most powerful (or best) critical region of size 0.08 for testing the null hypothesis H0 : θ = 0.1 against Ha : θ = 0.5? 10. Let X be a random sample of size 1 from a distribution with probability density function & (1 − θ2 ) + θ x if 0 ≤ x ≤ 1 f (x, θ) = 0 otherwise. For a significance level α = 0.1, what is the best (or uniformly most powerful) critical region for testing the null hypothesis Ho : θ = −1 against Ha : θ = 1? 11. Let X1 , X2 be a random sample of size 2 from a distribution with probability density function  x −θ e  θ x! if x = 0, 1, 2, 3, .... f (x, θ) =  0 otherwise, where θ ≥ 0. For a significance level α = 0.053, what is the best critical region for testing the null hypothesis Ho : θ = 1 against Ha : θ = 2? Sketch the graph of the best critical region. 12. Let X1 , X2 , ..., X8 be a random sample of size 8 from a distribution with probability density function  x −θ e  θ x! if x = 0, 1, 2, 3, .... f (x, θ) =  0 otherwise, where θ ≥ 0. What is the likelihood ratio critical region for testing the null hypothesis Ho : θ = 1 against Ha : θ += 1? If α = 0.1 can you determine the best likelihood ratio critical region? 13. Let X1 , X2 , ..., Xn be a random sample of size n from a distribution with probability density function  6 −x  x e β7 , if x > 0 Γ(7)β f (x, θ) =  0 otherwise, Probability and Mathematical Statistics 575 where β ≥ 0. What is the likelihood ratio critical region for testing the null hypothesis Ho : β = 5 against Ha : β += 5? What is the most powerful test ? 14. Let X1 , X2 , ..., X5 denote a random sample of size 5 from a population X with probability density function f (x; θ) =   (1 − θ)x−1 θ  0 if x = 1, 2, 3, ..., ∞ otherwise, where 0 < θ < 1 is a parameter. What is the likelihood ratio critical region of size 0.05 for testing Ho : θ = 0.5 versus Ha : θ += 0.5? 15. Let X1 , X2 , X3 denote a random sample of size 3 from a population X with probability density function (x−µ)2 1 f (x; µ) = √ e− 2 2π − ∞ < x < ∞, where −∞ < µ < ∞ is a parameter. What is the likelihood ratio critical region of size 0.05 for testing Ho : µ = 3 versus Ha : µ += 3? 16. Let X1 , X2 , X3 denote a random sample of size 3 from a population X with probability density function f (x; θ) =  x  θ1 e− θ  0 if 0 < x < ∞ otherwise, where 0 < θ < ∞ is a parameter. What is the likelihood ratio critical region for testing Ho : θ = 3 versus Ha : θ += 3? 17. Let X1 , X2 , X3 denote a random sample of size 3 from a population X with probability density function f (x; θ) =  −θ x  e x!θ  0 if x = 0, 1, 2, 3, ..., ∞ otherwise, where 0 < θ < ∞ is a parameter. What is the likelihood ratio critical region for testing Ho : θ = 0.1 versus Ha : θ += 0.1? 18. A box contains 4 marbles, θ of which are white and the rest are black. A sample of size 2 is drawn to test Ho : θ = 2 versus Ha : θ += 2. If the null Test of Statistical Hypotheses for Parameters 576 hypothesis is rejected if both marbles are the same color, find the significance level of the test. 19. Let X1 , X2 , X3 denote a random sample of size 3 from a population X with probability density function   θ1 for 0 ≤ x ≤ θ f (x; θ) =  0 otherwise, where 0 < θ < ∞ is a parameter. 
What is the likelihood ratio critical region of size 117 125 for testing Ho : θ = 5 versus Ha : θ += 5? 20. Let X1 , X2 and X3 denote three independent observations from a distribution with density  x  β1 e− β for 0 < x < ∞ f (x; β) =  0 otherwise, where 0 < β < ∞ is a parameter. What is the best (or uniformly most powerful critical region for testing Ho : β = 5 versus Ha : β = 10? 21. Suppose X has the density function f (x) = &1 θ for 0 < x < θ 0 otherwise. If X1 , X2 , X3 , X4 is a random sample of size 4 taken from X, what are the probabilities of Type I and Type II errors in testing the null hypothesis Ho : θ = 1 against the alternative hypothesis Ha : θ = 2, if Ho is rejected for max{X1 , X2 , X3 , X4 } ≤ 21 . 22. Let X1 , X2 , X3 denote a random sample of size 3 from a population X with probability density function  x  θ1 e− θ if 0 < x < ∞ f (x; θ) =  0 otherwise, where 0 < θ < ∞ is a parameter. The null hypothesis Ho : θ = 3 is to be rejected in favor of the alternative Ha : θ += 3 if and only if X > 6.296. What is the significance level of the test? Probability and Mathematical Statistics 577 Chapter 19 SIMPLE LINEAR REGRESSION AND CORRELATION ANALYSIS Let X and Y be two random variables with joint probability density function f (x, y). Then the conditional density of Y given that X = x is f (y/x) = where g(x) = : ∞ f (x, y) g(x) f (x, y) dy −∞ is the marginal density of X. The conditional mean of Y : ∞ E (Y |X = x) = yf (y/x) dy −∞ is called the regression equation of Y on X. Example 19.1. Let X and Y be two random variables with the joint probability density function f (x, y) = & xe−x(1+y) if x > 0, y > 0 0 otherwise. Find the regression equation of Y on X and then sketch the regression curve. Simple Linear Regression and Correlation Analysis 578 Answer: The marginal density of X is given by : ∞ g(x) = xe−x(1+y) dy −∞ : ∞ = xe−x e−xy dy −∞ : ∞ −x = xe e−xy dy −∞ 3∞ 2 1 −xy −x = xe − e x 0 = e−x . The conditional density of Y given X = x is f (y/x) = f (x, y) xe−x(1+y) = = xe−xy , g(x) e−x y > 0. The conditional mean of Y given X = x is given by : ∞ : ∞ 1 y x e−xy dy = . yf (y/x) dy = E(Y /x) = x −∞ −∞ Thus the regression equation of Y on X is 1 , x > 0. x The graph of this equation of Y on X is shown below. E(Y /x) = Graph of the regression equation E(Y/x) = 1/ x Probability and Mathematical Statistics 579 From this example it is clear that the conditional mean E(Y /x) is a function of x. If this function is of the form α + βx, then the corresponding regression equation is called a linear regression equation; otherwise it is called a nonlinear regression equation. The term linear regression refers to a specification that is linear in the parameters. Thus E(Y /x) = α + βx2 is also a linear regression equation. The regression equation E(Y /x) = αxβ is an example of a nonlinear regression equation. The main purpose of regression analysis is to predict Yi from the knowledge of xi using the relationship like E(Yi /xi ) = α + βxi . The Yi is called the response or dependent variable where as xi is called the predictor or independent variable. The term regression has an interesting history, dating back to Francis Galton (1822-1911). Galton studied the heights of fathers and sons, in which he observed a regression (a “turning back”) from the heights of sons to the heights of their fathers. That is tall fathers tend to have tall sons and short fathers tend to have short sons. 
However, he also found that very tall fathers tend to have shorter sons and very short fathers tend to have taller sons. Galton called this phenomenon regression towards the mean. In regression analysis, that is when investigating the relationship between a predictor and response variable, there are two steps to the analysis. The first step is totally data oriented. This step is always performed. The second step is the statistical one, in which we draw conclusions about the (population) regression equation E(Yi /xi ). Normally the regression equation contains several parameters. There are two well known methods for finding the estimates of the parameters of the regression equation. These two methods are: (1) The least square method and (2) the normal regression method. 19.1. The Least Squares Method Let {(xi , yi ) | i = 1, 2, ..., n} be a set of data. Assume that E(Yi /xi ) = α + βxi , that is yi = α + βxi , i = 1, 2, ..., n. (1) Simple Linear Regression and Correlation Analysis 580 Then the sum of the squares of the error is given by E(α, β) = n % i=1 2 (yi − α − βxi ) . (2) The least squares estimates of α and β are defined to be those values which minimize E(α, β). That is, , α Z, βZ = arg min E(α, β). (α,β) This least squares method is due to Adrien M. Legendre (1752-1833). Note that the least squares method also works even if the regression equation is nonlinear (that is, not of the form (1)). Next, we give several examples to illustrate the method of least squares. Example 19.2. Given the five pairs of points (x, y) shown in table below x y 4 5 −2 0 0 0 3 6 1 3 what is the line of the form y = x + b best fits the data by method of least squares? Answer: Suppose the best fit line is y = x + b. Then for each xi , xi + b is the estimated value of yi . The difference between yi and the estimated value of yi is the error or the residual corresponding to the ith measurement. That is, the error corresponding to the ith measurement is given by ǫi = yi − xi − b. Hence the sum of the squares of the errors is E(b) = = 5 % ǫ2i i=1 5 % i=1 2 (yi − xi − b) . Differentiating E(b) with respect to b, we get 5 % d (yi − xi − b) (−1). E(b) = 2 db i=1 Probability and Mathematical Statistics Setting d db E(b) 581 equal to 0, we get 5 % i=1 (yi − xi − b) = 0 which is 5b = 5 % i=1 yi − 5 % xi . i=1 Using the data, we see that 5b = 14 − 6 which yields b = 85 . Hence the best fitted line is 8 y =x+ . 5 Example 19.3. Suppose the line y = bx + 1 is fit by the method of least squares to the 3 data points x y 1 2 2 2 4 0 What is the value of the constant b? Answer: The error corresponding to the ith measurement is given by ǫi = yi − bxi − 1. Hence the sum of the squares of the errors is E(b) = = 3 % ǫ2i i=1 3 % i=1 2 (yi − bxi − 1) . Differentiating E(b) with respect to b, we get 3 % d (yi − bxi − 1) (−xi ). E(b) = 2 db i=1 Simple Linear Regression and Correlation Analysis Setting d db E(b) 582 equal to 0, we get 3 % (yi − bxi − 1) xi = 0 i=1 which in turn yields b= n % i=1 xi yi − n % n % xi i=1 x2i i=1 Using the given data we see that b= 1 6−7 =− , 21 21 and the best fitted line is y=− 1 x + 1. 21 Example 19.4. Observations y1 , y2 , ..., yn are assumed to come from a model with E(Yi /xi ) = θ + 2 ln xi where θ is an unknown parameter and x1 , x2 , ..., xn are given constants. What is the least square estimate of the parameter θ? Answer: The sum of the squares of errors is E(θ) = n % ǫ2i = n % i=1 i=1 2 (yi − θ − 2 ln xi ) . Differentiating E(θ) with respect to θ, we get n % d (yi − θ − 2 ln xi ) (−1). 
E(θ) = 2 dθ i=1 Setting d dθ E(θ) equal to 0, we get n % i=1 which is 1 θ= n (yi − θ − 2 ln xi ) = 0 ) n % i=1 yi − 2 n % i=1 ln xi + . Probability and Mathematical Statistics 583 Hence the least squares estimate of θ is θZ = y − 2 n n % ln xi . i=1 Example 19.5. Given the three pairs of points (x, y) shown below: x y 4 2 1 1 2 0 What is the curve of the form y = xβ best fits the data by method of least squares? Answer: The sum of the squares of the errors is given by E(β) = = n % ǫ2i i=1 n , % i=1 yi − xβi -2 . Differentiating E(β) with respect to β, we get n , % d yi − xβi (− xβi ln xi ) E(β) = 2 dβ i=1 Setting this derivative d dβ E(β) n % to 0, we get yi xβi ln xi = n % xβi xβi ln xi . i=1 i=1 Using the given data we obtain (2) 4β ln 4 = 42β ln 4 + 22β ln 2 which simplifies to 4 = (2) 4β + 1 or 3 . 2 Taking the natural logarithm of both sides of the above expression, we get 4β = β= ln 3 − ln 2 = 0.2925 ln 4 Simple Linear Regression and Correlation Analysis 584 Thus the least squares best fit model is y = x0.2925 . Example 19.6. Observations y1 , y2 , ..., yn are assumed to come from a model with E(Yi /xi ) = α + βxi , where α and β are unknown parameters, and x1 , x2 , ..., xn are given constants. What are the least squares estimate of the parameters α and β? Answer: The sum of the squares of the errors is given by E(α, β) = = n % ǫ2i i=1 n % i=1 2 (yi − −α − βxi ) . Differentiating E(α, β) with respect to α and β respectively, we get n % ∂ (yi − α − βxi ) (−1) E(α, β) = 2 ∂α i=1 and n % ∂ (yi − α − βxi ) (−xi ). E(α, β) = 2 ∂β i=1 Setting these partial derivatives n % n % i=1 From (3), we obtain ∂ ∂β E(α, β) (yi − α − βxi ) xi = 0. (4) yi = nα + β i=1 n % xi i=1 y = α + β x. (5) Similarly, from (4), we have n % i=1 to 0, we get (3) n % which is and (yi − α − βxi ) = 0 i=1 and ∂ ∂α E(α, β) xi yi = α n % i=1 xi + β n % i=1 x2i Probability and Mathematical Statistics 585 which can be rewritten as follows n % i=1 (xi − x)(yi − y) + nx y = n α x + β Defining Sxy := n % i=1 n % i=1 (xi − x)(xi − x) + nβ x2 (6) (xi − x)(yi − y) we see that (6) reduces to < ; Sxy + nx y = α n x + β Sxx + nx2 (7) Substituting (5) into (7), we have ; < Sxy + nx y = [y − β x] n x + β Sxx + nx2 . Simplifying the last equation, we get Sxy = β Sxx which is β= Sxy . Sxx (8) In view of (8) and (5), we get α=y− Sxy x. Sxx (9) Thus the least squares estimates of α and β are respectively. α Z=y− Sxy Sxy , x and βZ = Sxx Sxx We need some notations. The random variable Y given X = x will be denoted by Yx . Note that this is the variable appears in the model E(Y /x) = α + βx. When one chooses in succession values x1 , x2 , ..., xn for x, a sequence Yx1 , Yx2 , ..., Yxn of random variable is obtained. For the sake of convenience, we denote the random variables Yx1 , Yx2 , ..., Yxn simply as Y1 , Y2 , ..., Yn . To do some statistical analysis, we make following three assumptions: (1) E(Yx ) = α + β x so that µi = E(Yi ) = α + β xi ; Simple Linear Regression and Correlation Analysis 586 (2) Y1 , Y2 , ..., Yn are independent; (3) Each of the random variables Y1 , Y2 , ..., Yn has the same variance σ 2 . Theorem 19.1. Under the above three assumptions, the least squares estimators α Z and βZ of a linear model E(Y /x) = α + β x are unbiased. Proof: From the previous example, we know that the least squares estimators of α and β are SxY SxY X and βZ = α Z=Y − , Sxx Sxx where n % (xi − x)(Yi − Y ). SxY := i=1 First, we show βZ is unbiased. 
Consider # $ , SxY 1 E βZ = E E (SxY ) = Sxx Sxx ) n + % 1 = (xi − x)(Yi − Y ) E Sxx i=1 = n ! " 1 % (xi − x) E Yi − Y Sxx i=1 n n ! " 1 % 1 % = (xi − x) E (Yi ) − (xi − x) E Y Sxx i=1 Sxx i=1 = = n n ! " % 1 % 1 (xi − x) E (Yi ) − E Y (xi − x) Sxx i=1 Sxx i=1 n n 1 % 1 % (xi − x) E (Yi ) = (xi − x) (α + βxi ) Sxx i=1 Sxx i=1 =α =β =β =β =β n n 1 % 1 % (xi − x) + β (xi − x) xi Sxx i=1 Sxx i=1 n 1 % (xi − x) xi Sxx i=1 n n 1 % 1 % (xi − x) xi − β (xi − x) x Sxx i=1 Sxx i=1 n 1 % (xi − x) (xi − x) Sxx i=1 1 Sxx = β. Sxx Probability and Mathematical Statistics 587 Thus the estimator βZ is unbiased estimator of the parameter β. Next, we show that α Z is also an unbiased estimator of α. Consider $ # # $ ! " SxY SxY x =E Y −x E E (Z α) = E Y − Sxx Sxx , ! " ! " = E Y − x E βZ = E Y − x β + ) n 1 % = E (Yi ) − x β n i=1 ) n + 1 % = E (α + βxi ) − x β n i=1 + ) n % 1 = xi − x β nα + β n i=1 =α+βx−x β =α This proves that α Z is an unbiased estimator of α and the proof of the theorem is now complete. 19.2. The Normal Regression Analysis In a regression analysis, we assume that the xi ’s are constants while yi ’s are values of the random variables Yi ’s. A regression analysis is called a normal regression analysis if the conditional density of Yi given Xi = xi is of the form ! yi −α−βxi "2 1 −1 σ , e 2 f (yi /xi ) = √ 2πσ 2 where σ 2 denotes the variance, and α and β are the regression coefficients. That is Y |xi ∼ N (α + βx, σ 2 ). If there is no danger of confusion, then we will write Yi for Y |xi . The figure on the next page shows the regression model of Y with equal variances, and with means falling on the straight line µy = α + β x. Normal regression analysis concerns with the estimation of σ, α, and β. We use maximum likelihood method to estimate these parameters. The maximum likelihood function of the sample is given by L(σ, α, β) = n P i=1 f (yi /xi ) Simple Linear Regression and Correlation Analysis 588 and ln L(σ, α, β) = n % ln f (yi /xi ) i=1 = −n ln σ − z n 1 % n 2 ln(2π) − 2 (yi − α − β xi ) . 2 2σ i=1 y ' ' µ1 ' µ2 µ3 y= 0 !+(x x1 x2 x3 Normal Regression x Taking the partial derivatives of ln L(σ, α, β) with respect to α, β and σ respectively, we get n ∂ 1 % ln L(σ, α, β) = 2 (yi − α − β xi ) ∂α σ i=1 n ∂ 1 % ln L(σ, α, β) = 2 (yi − α − β xi ) xi ∂β σ i=1 n n 1 % ∂ 2 (yi − α − β xi ) . ln L(σ, α, β) = − + 3 ∂σ σ σ i=1 Equating each of these partial derivatives to zero and solving the system of three equations, we obtain the maximum likelihood estimator of β, α, σ as G 2 3 S SxY 1 S xY xY Z β= , α Z=Y − SxY , SY Y − x, and σ Z= Sxx Sxx n Sxx Probability and Mathematical Statistics where SxY = n % i=1 589 ! " (xi − x) Yi − Y . Theorem 19.2. In the normal regression analysis, the likelihood estimators βZ and α Z are unbiased estimators of β and α, respectively. Proof: Recall that SxY βZ = Sxx n ! " 1 % (xi − x) Yi − Y Sxx i=1 $ n # % xi − x = Yi , Sxx i=1 = 1n 2 where Sxx = i=1 (xi − x) . Thus βZ is a linear combination of Yi ’s. Since " ! Yi ∼ N α + βxi , σ 2 , we see that βZ is also a normal random variable. First we show βZ is an unbiased estimator of β. Since + ) n # , % xi − x $ Z E β =E Yi Sxx i=1 $ n # % xi − x = E (Yi ) Sxx i=1 $ n # % xi − x (α + βxi ) = β, = Sxx i=1 the maximum likelihood estimator of β is unbiased. Next, we show that α Z is also an unbiased estimator of α. Consider # $ # $ ! " SxY SxY E (Z α) = E Y − x =E Y −x E Sxx Sxx , ! " ! 
" = E Y − x E βZ = E Y − x β + ) n 1 % = E (Yi ) − x β n i=1 ) n + 1 % = E (α + βxi ) − x β n i=1 ) + n % 1 = nα + β xi − x β n i=1 = α + β x − x β = α. Simple Linear Regression and Correlation Analysis 590 This proves that α Z is an unbiased estimator of α and the proof of the theorem is now complete. Theorem 19.3. In normal regression analysis, the distributions of the estimators βZ and α Z are given by $ # $ # σ2 x2 σ 2 σ2 Z and α Z ∼ N α, β ∼ N β, + Sxx n Sxx where Sxx = n % i=1 Proof: Since SxY βZ = Sxx 2 (xi − x) . n ! " 1 % (xi − x) Yi − Y Sxx i=1 $ n # % xi − x = Yi , Sxx i=1 ! " the βZ is a linear combination of Yi ’s. As Yi ∼ N α + βxi , σ 2 , we see that βZ is also a normal random variable. By Theorem 19.2, βZ is an unbiased estimator of β. The variance of βZ is given by $2 n # , - % xi − x V ar βZ = V ar (Yi /xi ) Sxx i=1 $2 n # % xi − x σ2 = S xx i=1 = = = n 1 % 2 (xi − x) σ 2 2 Sxx i=1 σ2 . Sxx Hence βZ is a normal random ,variable-with mean (or expected value) β and 2 2 variance Sσxx . That is βZ ∼ N β, Sσxx . Now determine the distribution of α Z. Since each Yi ∼ N (α + βxi , σ 2 ), the distribution of Y is given by $ # σ2 . Y ∼ N α + βx, n Probability and Mathematical Statistics 591 Since βZ ∼ N the distribution of x βZ is given by x βZ ∼ N # # σ2 β, Sxx $ σ2 x β, x Sxx 2 $ . Since α Z = Y − x βZ and Y and x βZ being two normal random variables, α Z is also a normal random variable with mean equal to α + β x − β x = α and 2 2 2 σ variance variance equal to σn + xSxx . That is $ # σ2 x2 σ 2 + α Z ∼ N α, n Sxx and the proof of the theorem is now complete. It should be noted that in the proof of the last theorem, we have assumed the fact that Y and x βZ are statistically independent. In the next theorem, we give an unbiased estimator of the variance σ 2 . For this we need the distribution of the statistic U given by nσ Z2 . σ2 It can be shown (we will omit the proof, for a proof see Graybill (1961)) that the distribution of the statistic nσ Z2 U = 2 ∼ χ2 (n − 2). σ U= Theorem 19.4. An unbiased estimator S 2 of σ 2 is given by S2 = where σ Z= @ Proof: Since 1 n A SY Y − SxY Sxx B SxY . nσ Z2 , n−2 $ nσ Z2 n−2 # 2$ 2 nσ Z σ = E n−2 σ2 σ2 = E(χ2 (n − 2)) n−2 σ2 (n − 2) = σ 2 . = n−2 E(S 2 ) = E # Simple Linear Regression and Correlation Analysis 592 The proof of the theorem is now complete. Note that the estimator S 2 can be written as S 2 = SSE = SY Y = βZ SxY = 2 % i=1 SSE n−2 , where [yi − α Z − βZ xi ] the estimator S 2 is unbiased estimator of σ 2 . The proof of the theorem is now complete. In the next theorem we give the distribution of two statistics that can be used for testing hypothesis and constructing confidence interval for the regression parameters α and β. Theorem 19.5. The statistics and βZ − β Qβ = σ Z α Z−α Qα = σ Z @ G (n − 2) Sxx n (n − 2) Sxx 2 n (x) + Sxx have both a t-distribution with n − 2 degrees of freedom. Proof: From Theorem 19.3, we know that βZ ∼ N # β, σ2 Sxx $ . Hence by standardizing, we get βZ − β Z== ∼ N (0, 1). σ2 Sxx Further, we know that the likelihood estimator of σ is σ Z= G 3 2 1 SxY SxY SY Y − n Sxx σ2 is chi-square with n − 2 degrees and the distribution of the statistic U = nσZ 2 of freedom. Probability and Mathematical Statistics 593 Z−β β σ2 ∼ N (0, 1) and U = nσZ ∼ χ2 (n − 2), by Theorem 14.6, Since Z = I 2 σ2 Sxx the statistic IZU ∼ t(n − 2). Hence n−2 βZ − β Qβ = σ Z @ βZ − β (n − 2) Sxx == n nZ σ2 (n−2) Sxx Z−β β I 2 == σ Sxx nZ σ2 (n−2) σ 2 ∼ t(n − 2). Similarly, it can be shown that G (n − 2) Sxx α Z−α ∼ t(n − 2). 
Qα = 2 σ Z n (x) + Sxx This completes the proof of the theorem. In the normal regression model, if β = 0, then E(Yx ) = α. This implies that E(Yx ) does not depend on x. Therefore if β += 0, then E(Yx ) is dependent on x. Thus the null hypothesis Ho : β = 0 should be tested against Z Theorem 19.3 says Ha : β += 0. To devise a test we need the distribution of β. 2 that βZ is normally distributed with mean β and variance Sσx x . Therefore, we have βZ − β Z== ∼ N (0, 1). σ2 Sxx In practice the variance V ar(Yi /xi ) which is σ 2 is usually unknown. Hence the above statistic Z is not very useful. However, using the statistic Qβ , we can devise a hypothesis test to test the hypothesis Ho : β = βo against Ha : β += βo at a significance level γ. For this one has to evaluate the quantity > > > > > βZ − β > > > |t| = > = > nZ σ2 > > (n−2) Sxx > > > βZ − β @ (n − 2) S > > xx > => > > σ > Z n and compare it to quantile tγ/2 (n − 2). The hypothesis test, at significance level γ, is then “Reject Ho : β = βo if |t| > tγ/2 (n − 2)”. The statistic βZ − β Qβ = σ Z @ (n − 2) Sxx n Simple Linear Regression and Correlation Analysis 594 is a pivotal quantity for the parameter β since the distribution of this quantity Qβ is a t-distribution with n − 2 degrees of freedom. Thus it can be used for the construction of a (1 − γ)100% confidence interval for the parameter β as follows: 1−γ + @ βZ − β (n − 2)Sxx ≤ t γ2 (n − 2) = P −t γ2 (n − 2) ≤ σ Z n $ # @ @ n n Z Z σ σ . = P β − t γ2 (n − 2)Z ≤ β ≤ β + t γ2 (n − 2)Z (n − 2)Sxx (n − 2) Sxx ) Hence, the (1 − γ)% confidence interval for β is given by 2 3 @ @ n n Z Z γ γ β − t 2 (n − 2) σ Z Z , β + t 2 (n − 2) σ . (n − 2) Sxx (n − 2) Sxx In a similar manner one can devise hypothesis test for α and construct confidence interval for α using the statistic Qα . We leave these to the reader. Now we give two examples to illustrate how to find the normal regression line and related things. Example 19.7. Let the following data on the number of hours, x which ten persons studied for a French test and their scores, y on the test is shown below: x y 4 31 9 58 10 65 14 73 4 37 7 44 12 60 22 91 1 21 17 84 Find the normal regression line that approximates the regression of test scores on the number of hours studied. Further test the hypothesis Ho : β = 3 versus Ha : β += 3 at the significance level 0.02. Answer: From the above data, we have 10 % xi = 100, x2i = 1376 i=1 i=1 10 % 10 % yi = 564, i=1 10 % yi2 = i=1 10 % i=1 xi yi = 6945 Probability and Mathematical Statistics Sxx = 376, 595 Sxy = 1305, Syy = 4752.4. Hence sxy = 3.471 and α Z = y − βZ x = 21.690. βZ = sxx Thus the normal regression line is y = 21.690 + 3.471x. This regression line is shown below. 100 y 80 60 40 20 0 5 10 15 20 25 x Regression line y = 21.690 + 3.471 x Now we test the hypothesis Ho : β = 3 against Ha : β += 3 at 0.02 level of significance. From the data, the maximum likelihood estimate of σ is G 2 3 Sxy 1 σ Z= Syy − Sxy n Sxx = @ = @ B 1 A Syy − βZ Sxy n 1 [4752.4 − (3.471)(1305)] 10 = 4.720 Simple Linear Regression and Correlation Analysis and 596 > > > 3.471 − 3 @ (8) (376) > > > |t| = > > = 1.73. > 4.720 10 > Hence 1.73 = |t| < t0.01 (8) = 2.896. Thus we do not reject the null hypothesis that Ho : β = 3 at the significance level 0.02. This means that we can not conclude that on the average an extra hour of study will increase the score by more than 3 points. Example 19.8. The frequency of chirping of a cricket is thought to be related to temperature. 
This suggests the possibility that temperature can be estimated from the chirp frequency. Let the following data on the number chirps per second, x by the striped ground cricket and the temperature, y in Fahrenheit is shown below: x y 20 89 16 72 20 93 18 84 17 81 16 75 15 70 17 82 15 69 16 83 Find the normal regression line that approximates the regression of temperature on the number chirps per second by the striped ground cricket. Further test the hypothesis Ho : β = 4 versus Ha : β += 4 at the significance level 0.1. Answer: From the above data, we have 10 % xi = 170, i=1 10 % 10 % x2i = 2920 i=1 yi = 789, 10 % yi2 = 64270 i=1 i=1 10 % xi yi = 13688 i=1 Sxx = 376, Hence Sxy = 1305, Syy = 4752.4. sxy βZ = = 4.067 and α Z = y − βZ x = 9.761. sxx Thus the normal regression line is y = 9.761 + 4.067x. Probability and Mathematical Statistics 597 This regression line is shown below. 95 90 y 85 80 75 70 65 14 15 16 17 18 19 20 21 x Regression line y = 9.761 + 4.067x Now we test the hypothesis Ho : β = 4 against Ha : β += 4 at 0.1 level of significance. From the data, the maximum likelihood estimate of σ is G 2 3 1 Sxy Syy − Sxy n Sxx = @ = @ B 1 A Syy − βZ Sxy n σ Z= 1 [589 − (4.067)(122)] 10 = 3.047 and Hence > > > 4.067 − 4 @ (8) (30) > > > |t| = > > = 0.528. > 3.047 10 > 0.528 = |t| < t0.05 (8) = 1.860. Simple Linear Regression and Correlation Analysis 598 Thus we do not reject the null hypothesis that Ho : β = 4 at a significance level 0.1. Let µx = α + β x and write YZx = α Z + βZ x for an arbitrary but fixed x. Z Then Yx is an estimator of µx . The following theorem gives various properties of this estimator. Theorem 19.6. Let x be an arbitrary but fixed real number. Then (i) YZx is a linear estimator of Y1 , Y2 , ..., Yn , (ii) YZx is ,an unbiased S of µx , and - J estimator (x−x)2 1 Z σ2 . (iii) V ar Yx = n + Sxx Proof: First we show that YZx is a linear estimator of Y1 , Y2 , ..., Yn . Since YZx = α Z + βZ x Z + βZ x = Y − βx = Y + βZ (x − x) n % (xk − x) (x − x) =Y + Yk Sxx k=1 = = n % Yk n k=1 n # % k=1 + n % (xk − x) (x − x) k=1 Sxx 1 (xk − x) (x − x) + n Sxx $ Yk Yk YZx is a linear estimator of Y1 , Y2 , ..., Yn . Next, we show that YZx is an unbiased estimator of µx . Since , , E YZx = E α Z + βZ x , = E (Z α) + E βZ x =α+βx = µx YZx is an unbiased estimator of µx . Finally, we calculate the variance of YZx using Theorem 19.3. The variance Probability and Mathematical Statistics 599 of YZx is given by , , V ar YZx = V ar α Z + βZ x , , = V ar (Z α) + V ar βZ x + 2Cov α Z, βZ x # $ , σ2 x2 1 = + + 2 x Cov α Z, βZ + x2 n Sxx Sxx $ # x2 x σ2 1 + − 2x = n Sxx Sxx $ # 1 (x − x)2 = σ2 . + n Sxx In this computation we have used the fact that , x σ2 Cov α Z, βZ = − Sxx whose proof is left to the reader as an exercise. The proof of the theorem is now complete. By Theorem 19.3, we see that # $ σ2 βZ ∼ N β, and Sxx α Z∼N # α, σ2 x2 σ 2 + n Sxx $ . Since YZx = α Z + βZ x, the random variable YZx is also a normal random variable with mean µx and variance $ , - #1 (x − x)2 Z + σ2 . V ar Yx = n Sxx Hence standardizing YZx , we have YZ − µ @ x , x - ∼ N (0, 1). V ar YZx If σ 2 is known, then one can take the statistic Q = =YZx −µ!x " as a pivotal Zx V ar Y quantity to construct a confidence interval for µx . The (1−γ)100% confidence interval for µx when σ 2 is known is given by 2 3 = = Z Z Z Z γ γ Yx − z 2 V ar(Yx ), Yx + z 2 V ar(Yx ) . Simple Linear Regression and Correlation Analysis 600 Example 19.9. 
Let the following data on the number chirps per second, x by the striped ground cricket and the temperature, y in Fahrenheit is shown below: x y 20 89 16 72 20 93 18 84 17 81 16 75 15 70 17 82 15 69 16 83 What is the 95% confidence interval for β? What is the 95% confidence interval for µx when x = 14 and σ = 3.047? Answer: From Example 19.8, we have n = 10, βZ = 4.067, σ Z = 3.047 and Sxx = 376. The (1 − γ)% confidence interval for β is given by 3 2 @ @ n n Z Z βZ − t γ2 (n − 2) σ , βZ + t γ2 (n − 2) σ . (n − 2) Sxx (n − 2) Sxx Therefore the 90% confidence interval for β is G G / 0 10 10 4.067 − t0.025 (8) (3.047) , 4.067 + t0.025 (8) (3.047) (8) (376) (8) (376) which is [ 4.067 − t0.025 (8) (0.1755) , 4.067 + t0.025 (8) (0.1755)] . Since from the t-table, we have t0.025 (8) = 2.306, the 90% confidence interval for β becomes [ 4.067 − (2.306) (0.1755) , 4.067 + (2.306) (0.1755)] which is [3.6623, 4.4717]. If variance σ 2 is not known, then we can use the fact that the statistic σ2 is chi-squares with n − 2 degrees of freedom to obtain a pivotal U = nσZ 2 quantity for µx . This can be done as follows: G (n − 2) Sxx YZx − µx Q= σ Z Sxx + n (x − x)2 = =! YZx −µx " (x−x)2 1 n+ = Sxx nZ σ2 (n−2) σ 2 σ2 ∼ t(n − 2). Probability and Mathematical Statistics 601 Using this pivotal quantity one can construct a (1 − γ)100% confidence interval for mean µ as / G YZx − t γ2 (n − 2) G 0 Sxx + n(x − x)2 Z Sxx + n(x − x)2 , Yx + t γ2 (n − 2) . (n − 2) Sxx (n − 2) Sxx Next we determine the 90% confidence interval for µx when x = 14 and σ = 3.047. The (1 − γ)100% confidence interval for µx when σ 2 is known is given by 2 3 = = Z Z Z Z Yx − z γ2 V ar(Yx ), Yx + z γ2 V ar(Yx ) . From the data, we have YZx = α Z + βZ x = 9.761 + (4.067) (14) = 66.699 and $ - #1 (14 − 17)2 Z V ar Yx = + σ 2 = (0.124) (3.047)2 = 1.1512. 10 376 , The 90% confidence interval for µx is given by A 66.699 − z0.025 √ 1.1512, 66.699 + z0.025 √ 1.1512 B and since z0.025 = 1.96 (from the normal table), we have [66.699 − (1.96) (1.073), 66.699 + (1.96) (1.073)] which is [64.596, 68.802]. We now consider the predictions made by the normal regression equation Z YZx = α Z + βx. The quantity YZx gives an estimate of µx = α + βx. Each time we compute a regression line from a random sample we are observing one possible linear equation in a population consisting all possible linear equations. Further, the actual value of Yx that will be observed for given value of x is normal with mean α + βx and variance σ 2 . So the actual observed value will be different from µx . Thus, the predicted value for YZx will be in error from two different sources, namely (1) α Z and βZ are randomly distributed about α and β, and (2) Yx is randomly distributed about µx . Simple Linear Regression and Correlation Analysis 602 Let yx denote the actual value of Yx that will be observed for the value x and consider the random variable D = Yx − α Z − βZ x. Since D is a linear combination of normal random variables, D is also a normal random variable. The mean of D is given by Z E(D) = E(Yx ) − E(Z α) − x E(β) = α + β x − α − xβ = 0. The variance of D is given by V ar(D) = V ar(Yx − α Z − βZ x) Z + 2 x Cov(Z Z = V ar(Yx ) + V ar(Z α) + x2 V ar(β) α, β) σ2 x2 σ 2 x σ2 + x2 − 2x + n Sxx Sxx Sxx (x − x)2 σ 2 σ2 = σ2 + + n Sxx (n + 1) Sxx + n 2 σ . = n Sxx = σ2 + Therefore D∼N # 0, $ (n + 1) Sxx + n 2 σ . n Sxx We standardize D to get Z== D−0 (n+1) Sxx +n n Sxx σ2 ∼ N (0, 1). 
Since in practice the variance of Yx which is σ 2 is unknown, we can not use Z to construct a confidence interval for a predicted value yx . σ2 We know that U = n Z ∼ χ2 (n − 2). By Theorem 14.6, the statistic σ2 Probability and Mathematical Statistics IZU ∼ t(n − 2). Hence 603 n−2 yx − α Z − βZ x Q= σ Z yx −α Z−βZ x I (n+1) S +n xx = = n Sxx nZ σ2 G (n − 2) Sxx (n + 1) Sxx + n σ2 (n−2) σ 2 √ D−0 V ar(D) == == nZ σ2 (n−2) σ 2 Z U n−2 ∼ t(n − 2). The statistic Q is a pivotal quantity for the predicted value yx and one can use it to construct a (1 − γ)100% confidence interval for yx . The (1 − γ)100% confidence interval, [a, b], for yx is given by , 1 − γ = P −t γ2 (n − 2) ≤ Q ≤ t γ2 (n − 2) = P (a ≤ yx ≤ b), where and a=α Z + βZ x − t γ2 (n − 2) σ Z b=α Z + βZ x + t γ2 (n − 2) σ Z G G (n + 1) Sxx + n (n − 2) Sxx (n + 1) Sxx + n . (n − 2) Sxx This confidence interval for yx is usually known as the prediction interval for predicted value yx based on the given x. The prediction interval represents an interval that has a probability equal to 1−γ of containing not a parameter but a future value yx of the random variable Yx . In many instances the prediction interval is more relevant to a scientist or engineer than the confidence interval on the mean µx . Example 19.10. Let the following data on the number chirps per second, x by the striped ground cricket and the temperature, y in Fahrenheit is shown below: Simple Linear Regression and Correlation Analysis x y 20 89 16 72 20 93 18 84 17 81 604 16 75 15 70 17 82 15 69 16 83 What is the 95% prediction interval for yx when x = 14? Answer: From Example 19.8, we have n = 10, βZ = 4.067, α Z = 9.761, Thus the normal regression line is σ Z = 3.047 and Sxx = 376. yx = 9.761 + 4.067x. Since x = 14, the corresponding predicted value yx is given by yx = 9.761 + (4.067) (14) = 66.699. Therefore G (n + 1) Sxx + n (n − 2) Sxx G (11) (376) + 10 = 66.699 − t0.025 (8) (3.047) (8) (376) a=α Z + βZ x − t γ2 (n − 2) σ Z = 66.699 − (2.306) (3.047) (1.1740) = 58.4501. Similarly G (n + 1) Sxx + n (n − 2) Sxx G (11) (376) + 10 = 66.699 + t0.025 (8) (3.047) (8) (376) b=α Z + βZ x + t γ2 (n − 2) σ Z = 66.699 + (2.306) (3.047) (1.1740) = 74.9479. Hence the 95% prediction interval for yx when x = 14 is [58.4501, 74.9479]. 19.3. The Correlation Analysis In the first two sections of this chapter, we examine the regression problem and have done an in-depth study of the least squares and the normal regression analysis. In the regression analysis, we assumed that the values of X are not random variables, but are fixed. However, the values of Yx for Probability and Mathematical Statistics 605 a given value of x are randomly distributed about E(Yx ) = µx = α + βx. Further, letting ε to be a random variable with E(ε) = 0 and V ar(ε) = σ 2 , one can model the so called regression problem by Yx = α + β x + ε. In this section, we examine the correlation problem. Unlike the regression problem, here both X and Y are random variables and the correlation problem can be modeled by E(Y ) = α + β E(X). From an experimental point of view this means that we are observing random vector (X, Y ) drawn from some bivariate population. Recall that if (X, Y ) is a bivariate random variable then the correlation coefficient ρ is defined as E ((X − µX ) (Y − µY )) ρ= I E ((X − µX )2 ) E ((Y − µY )2 ) where µX and µY are the mean of the random variables X and Y , respectively. Definition 19.1. 
If (X1 , Y1 ), (X2 , Y2 ), ..., (Xn , Yn ) is a random sample from a bivariate population, then the sample correlation coefficient is defined as n % i=1 (Xi − X) (Yi − Y ) R= \ ]% ] n ^ (Xi − X)2 i=1 \ . ]% ] n ^ (Yi − Y )2 i=1 The corresponding quantity computed from data (x1 , y1 ), (x2 , y2 ), ..., (xn , yn ) will be denoted by r and it is an estimate of the correlation coefficient ρ. Now we give a geometrical interpretation of the sample correlation coefficient based on a paired data set {(x1 , y1 ), (x2 , y2 ), ..., (xn , yn )}. We can associate this data set with two vectors 7x = (x1 , x2 , ..., xn ) and 7y = (y1 , y2 , ..., yn ) in R I n . Let L be the subset {λ 7e | λ ∈ R} I of R I n , where 7e = (1, 1, ..., 1) ∈ R I n. n n Consider the linear space V given by R I modulo L, that is V = R I /L. The linear space V is illustrated in a figure on next page when n = 2. Simple Linear Regression and Correlation Analysis 606 y L V x [x] Illustration of the linear space V for n=2 We denote the equivalence class associated with the vector 7x by [7x]. In the linear space V it can be shown that the points (x1 , y1 ), (x2 , y2 ), ..., (xn , yn ) are collinear if and only if the the vectors [7x] and [7y ] in V are proportional. We define an inner product on this linear space V by 8[7x], [7y ]9 = n % i=1 (xi − x) (yi − y). Then the angle θ between the vectors [7x] and [7y ] is given by which is cos(θ) = I n % i=1 8[7x], [7y ]9 I 8[7x], [7x]9 8[7y ], [7y ]9 (xi − x) (yi − y) cos(θ) = \ ]% ] n ^ (xi − x)2 i=1 \ = r. ]% ] n ^ (yi − y)2 i=1 Thus the sample correlation coefficient r can be interpreted geometrically as the cosine of the angle between the vectors [7x] and [7y ]. From this view point the following theorem is obvious. Probability and Mathematical Statistics 607 Theorem 19.7. The sample correlation coefficient r satisfies the inequality −1 ≤ r ≤ 1. The sample correlation coefficient r = ±1 if and only if the set of points {(x1 , y1 ), (x2 , y2 ), ..., (xn , yn )} for n ≥ 3 are collinear. To do some statistical analysis, we assume that the paired data is a random sample of size n from a bivariate normal population (X, Y ) ∼ BV N (µ1 , µ2 , σ12 , σ22 , ρ). Then the conditional distribution of the random variable Y given X = x is normal, that is Y |x ∼ N # $ σ2 2 2 µ2 + ρ (x − µ1 ), σ2 (1 − ρ ) . σ1 This can be viewed as a normal regression model E(Y |x ) = α + β x where α = µ − ρ σσ21 µ1 , β = ρ σσ21 , and V ar(Y |x ) = σ22 (1 − ρ2 ). Since β = ρ σσ21 , if ρ = 0, then β = 0. Hence the null hypothesis Ho : ρ = 0 is equivalent to Ho : β = 0. In the previous section, we devised a hypothesis test for testing Ho : β = βo against Ha : β += βo . This hypothesis test, at significance level γ, is “Reject Ho : β = βo if |t| ≥ t γ2 (n − 2)”, where If β = 0, then we have βZ − β t= σ Z βZ t= σ Z @ @ (n − 2) Sxx . n (n − 2) Sxx . n (10) Now we express t in term of the sample correlation coefficient r. Recall that Sxy βZ = , Sxx and σ Z2 = (11) 2 3 Sxy 1 Sxy , Syy − n Sxx (12) Sxy . Sxx Syy (13) r= I Simple Linear Regression and Correlation Analysis 608 Now using (11), (12), and (13), we compute @ βZ (n − 2) Sxx t= σ Z n @ √ n (n − 2) Sxx Sxy @ = B Sxx A n Sxy Syy − Sxx Sxy =I √ = Sxy @ Sxx Syy A n−2 √ 1 1− Sxy Sxy Sxx Syy r . 1 − r2 B √ n−2 Hence to test the null hypothesis Ho : ρ = 0 against Ha : ρ += 0, at significance level γ, is “Reject Ho : ρ = 0 if |t| ≥ t γ2 (n − 2)”, where t = √ r n − 2 1−r 2. This above test does not extend to test other values of ρ except ρ = 0. 
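As a quick numerical illustration of this test of Ho : ρ = 0, the statistic t = r√(n − 2)/√(1 − r²) can be computed directly from a paired data set and compared with the t quantile. The sketch below uses hypothetical data values chosen only for illustration and assumes Python with numpy and scipy is available; it is not part of the text.

```python
# Sketch (assuming numpy and scipy): the t statistic for testing Ho: rho = 0,
# computed from a small hypothetical paired data set.
import numpy as np
from scipy.stats import t as t_dist

x = np.array([2.0, 3.0, 5.0, 7.0, 9.0, 10.0])   # hypothetical x values
y = np.array([1.5, 2.1, 4.8, 6.9, 8.2, 10.5])   # hypothetical y values
n = len(x)

sxy = np.sum((x - x.mean()) * (y - y.mean()))
r = sxy / np.sqrt(np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)

gamma = 0.05
critical = t_dist.ppf(1 - gamma / 2, df=n - 2)   # t_{gamma/2}(n - 2)
print(r, t_stat, critical, abs(t_stat) >= critical)  # True means reject Ho: rho = 0
```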
However, tests for the nonzero values of ρ can be achieved by the following result. Theorem 19.8. Let (X1 , Y1 ), (X2 , Y2 ), ..., (Xn , Yn ) be a random sample from a bivariate normal population (X, Y ) ∼ BV N (µ1 , µ2 , σ12 , σ22 , ρ). If V = 1 ln 2 then Z= √ # 1+R 1−R $ and m = n − 3 (V − m) → N (0, 1) 1 ln 2 # 1+ρ 1−ρ $ , as n → ∞. This theorem says that the statistic V is approximately normal with 1 mean m and variance n−3 when n is large. This statistic can be used to devise a hypothesis test for the nonzero values of ρ. Hence to test the null hypothesis Ho : ρ = ρo against Ha : ρ += ρo , at significance level γ, is,“Reject √ 1+ρo . Ho : ρ = ρo if |z| ≥ z γ2 ”, where z = n − 3 (V − mo ) and mo = 12 ln 1−ρ o Example 19.11. The following data were obtained in a study of the relationship between the weight and chest size of infants at birth: x, weight in kg y, chest size in cm 2.76 29.5 2.17 26.3 5.53 36.6 4.31 27.8 2.30 28.3 3.70 28.6 Probability and Mathematical Statistics 609 Determine the sample correlation coefficient r and then test the null hypothesis Ho : ρ = 0 against the alternative hypothesis Ha : ρ += 0 at a significance level 0.01. Answer: From the above data we find that x = 3.46 and y = 29.51. Next, we compute Sxx , Syy and Sxy using a tabular representation. x−x −0.70 −1.29 2.07 0.85 −1.16 0.24 y−y −0.01 −3.21 7.09 −1.71 −1.21 −0.91 (x − x)(y − y) 0.007 4.141 14.676 −1.453 1.404 −0.218 Sxy = 18.557 (x − x)2 0.490 1.664 4.285 0.722 1.346 0.058 Sxx = 8.565 (y − y)2 0.000 10.304 50.268 2.924 1.464 0.828 Syy = 65.788 Hence, the correlation coefficient r is given by r= I Sxy 18.557 = 0.782. =I Sxx Syy (8.565) (65.788) The computed t value is give by t= √ n−2√ I r 0.782 = 2.509. = (6 − 2) I 2 1−r 1 − (0.782)2 From the t-table we have t0.005 (4) = 4.604. Since 2.509 = |t| +≥ t0.005 (4) = 4.604 we do not reject the null hypothesis Ho : ρ = 0. 19.4. Review Exercises 1. Let Y1 , Y2 , ..., Yn be n independent random variables such that each Yi ∼ N (βxi , σ 2 ), where both β and σ 2 are unknown parameters. If {(x1 , y1 ), (x2 , y2 ), ..., (xn , yn )} is a data set where y1 , y2 , ..., yn are the observed values based on x1 , x2 , ..., xn , then find the maximum likelihood estimators of βZ and σ Z2 of β and σ 2 . Simple Linear Regression and Correlation Analysis 610 2. Let Y1 , Y2 , ..., Yn be n independent random variables such that each Yi ∼ N (βxi , σ 2 ), where both β and σ 2 are unknown parameters. If {(x1 , y1 ), (x2 , y2 ), ..., (xn , yn )} is a data set where y1 , y2 , ..., yn are the observed values based on x1 , x2 , ..., xn , then show that the maximum likelihood estimator of βZ is normally distributed. What are the mean and variance of Z β? 3. Let Y1 , Y2 , ..., Yn be n independent random variables such that each Yi ∼ N (βxi , σ 2 ), where both β and σ 2 are unknown parameters. If {(x1 , y1 ), (x2 , y2 ), ..., (xn , yn )} is a data set where y1 , y2 , ..., yn are the observed values based on x1 , x2 , ..., xn , then find an unbiased estimator σ Z2 of σ 2 and then find a constant k such that k σ Z2 ∼ χ2 (2n). 4. Let Y1 , Y2 , ..., Yn be n independent random variables such that each Yi ∼ N (βxi , σ 2 ), where both β and σ 2 are unknown parameters. If {(x1 , y1 ), (x2 , y2 ), ..., (xn , yn )} is a data set where y1 , y2 , ..., yn are the observed values based on x1 , x2 , ..., xn , then find a pivotal quantity for β and using this pivotal quantity construct a (1 − γ)100% confidence interval for β. 5. 
Let Y1 , Y2 , ..., Yn be n independent random variables such that each Yi ∼ N (βxi , σ 2 ), where both β and σ 2 are unknown parameters. If {(x1 , y1 ), (x2 , y2 ), ..., (xn , yn )} is a data set where y1 , y2 , ..., yn are the observed values based on x1 , x2 , ..., xn , then find a pivotal quantity for σ 2 and using this pivotal quantity construct a (1 − γ)100% confidence interval for σ2 . 6. Let Y1 , Y2 , ..., Yn be n independent random variables such that each Yi ∼ EXP (βxi ), where β is an unknown parameter. If {(x1 , y1 ), (x2 , y2 ), ..., (xn , yn )} is a data set where y1 , y2 , ..., yn are the observed values based on x1 , x2 , ..., xn , then find the maximum likelihood estimator of βZ of β. 7. Let Y1 , Y2 , ..., Yn be n independent random variables such that each Yi ∼ EXP (βxi ), where β is an unknown parameter. If {(x1 , y1 ), (x2 , y2 ), ..., (xn , yn )} is a data set where y1 , y2 , ..., yn are the observed values based on x1 , x2 , ..., xn , then find the least squares estimator of βZ of β. 8. Let Y1 , Y2 , ..., Yn be n independent random variables such that each Yi ∼ P OI(βxi ), where β is an unknown parameter. If {(x1 , y1 ), (x2 , y2 ), ..., (xn , yn )} is a data set where y1 , y2 , ..., yn are the ob- Probability and Mathematical Statistics 611 served values based on x1 , x2 , ..., xn , then find the maximum likelihood estimator of βZ of β. 9. Let Y1 , Y2 , ..., Yn be n independent random variables such that each Yi ∼ P OI(βxi ), where β is an unknown parameter. If {(x1 , y1 ), (x2 , y2 ), ..., (xn , yn )} is a data set where y1 , y2 , ..., yn are the observed values based on x1 , x2 , ..., xn , then find the least squares estimator of βZ of β. 10. Let Y1 , Y2 , ..., Yn be n independent random variables such that each Yi ∼ P OI(βxi ), where β is an unknown parameter. If {(x1 , y1 ), (x2 , y2 ), ..., (xn , yn )} is a data set where y1 , y2 , ..., yn are the observed values based on x1 , x2 , ..., xn , show that the least squares estimator and the maximum likelihood estimator of β are both unbiased estimator of β. 11. Let Y1 , Y2 , ..., Yn be n independent random variables such that each Yi ∼ P OI(βxi ), where β is an unknown parameter. If {(x1 , y1 ), (x2 , y2 ), ..., (xn , yn )} is a data set where y1 , y2 , ..., yn are the observed values based on x1 , x2 , ..., xn , the find the variances of both the least squares estimator and the maximum likelihood estimator of β. 12. Given the five pairs of points (x, y) shown below: x y 10 50.071 20 0.078 30 0.112 40 0.120 50 0.131 What is the curve of the form y = a + bx + cx2 best fits the data by method of least squares? 13. Given the five pairs of points (x, y) shown below: x y 4 10 7 16 9 22 10 20 11 25 What is the curve of the form y = a + b x best fits the data by method of least squares? 14. The following data were obtained from the grades of six students selected at random: Mathematics Grade, x English Grade, y 72 76 94 86 82 65 74 89 65 80 85 92 Simple Linear Regression and Correlation Analysis 612 Find the sample correlation coefficient r and then test the null hypothesis Ho : ρ = 0 against the alternative hypothesis Ha : ρ += 0 at a significance level 0.01. 15. Given a set of data {(x1 , y2 ), (x2 , y2 ), ..., (xn , yn )} what is the least square estimate of α if y = α is fitted to this data set. 16. Given a set of data points {(2, 3), (4, 6), (5, 7)} what is the curve of the form y = α + β x2 best fits the data by method of least squares? 17. 
Given a data set {(1, 1), (2, 1), (2, 3), (3, 2), (4, 3)} and Yx ∼ N (α + β x, σ 2 ), find the point estimate of σ 2 and then construct a 90% confidence interval for σ. 18. For the data set {(1, 1), (2, 1), (2, 3), (3, 2), (4, 3)} determine the correlation coefficient r. Test the null hypothesis H0 : ρ = 0 versus Ha : ρ += 0 at a significance level 0.01. Probability and Mathematical Statistics 613 Chapter 20 ANALYSIS OF VARIANCE In Chapter 19, we examine how a quantitative independent variable x can be used for predicting the value of a quantitative dependent variable y. In this chapter we would like to examine whether one or more independent (or predictor) variable affects a dependent (or response) variable y. This chapter differs from the last chapter because the independent variable may now be either quantitative or qualitative. It also differs from the last chapter in assuming that the response measurements were obtained for specific settings of the independent variables. Selecting the settings of the independent variables is another aspect of experimental design. It enables us to tell whether changes in the independent variables cause changes in the mean response and it permits us to analyze the data using a method known as analysis of variance (or ANOVA). Sir Ronald Aylmer Fisher (1890-1962) developed the analysis of variance in 1920’s and used it to analyze data from agricultural experiments. The ANOVA investigates independent measurements from several treatments or levels of one or more than one factors (that is, the predictor variables). The technique of ANOVA consists of partitioning the total sum of squares into component sum of squares due to different factors and the error. For instance, suppose there are Q factors. Then the total sum of squares (SST ) is partitioned as SST = SSA + SSB + · · · + SSQ + SSError , where SSA , SSB , ..., and SSQ represent the sum of squares associated with the factors A, B, ..., and Q, respectively. If the ANOVA involves only one factor, then it is called one-way analysis of variance. Similarly if it involves two factors, then it is called the two-way analysis of variance. If it involves Analysis of Variance 614 more then two factors, then the corresponding ANOVA is called the higher order analysis of variance. In this chapter we only treat the one-way analysis of variance. The analysis of variance is a special case of the linear models that represent the relationship between a continuous response variable y and one or more predictor variables (either continuous or categorical) in the form y =Xβ+ǫ (1) where y is an m × 1 vector of observations of response variable, X is the m × n design matrix determined by the predictor variables, β is n × 1 vector of parameters, and ǫ is an m × 1 vector of random error (or disturbances) independent of each other and having distribution. 20.1. One-Way Analysis of Variance with Equal Sample Sizes The standard model of one-way ANOVA is given by Yij = µi + ǫij for i = 1, 2, ..., m, j = 1, 2, ..., n, (2) where m ≥ 2 and n ≥ 2. In this model, we assume that each random variable Yij ∼ N (µi , σ 2 ) for i = 1, 2, ..., m, j = 1, 2, ..., n. (3) Note that because of (3), each ǫij in model (2) is normally distributed with mean zero and variance σ 2 . Given m independent samples, each of size n, where the members of the i sample, Yi1 , Yi2 , ..., Yin , are normal random variables with mean µi and unknown variance σ 2 . That is, th ! " Yij ∼ N µi , σ 2 , i = 1, 2, ..., m, j = 1, 2, ..., n. 
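To make model (2)-(3) concrete, here is a minimal simulation sketch in Python (this is only an illustration, not part of the formal development; it assumes the NumPy library is available, and the values of m, n, the means µi and the standard deviation σ are hypothetical choices):

import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical settings: m = 3 treatments with n = 5 observations each.
m, n = 3, 5
mu = np.array([10.0, 12.0, 11.0])   # illustrative treatment means mu_1, ..., mu_m
sigma = 2.0                         # common standard deviation

# Model (2): Y_ij = mu_i + eps_ij with eps_ij ~ N(0, sigma^2).
eps = rng.normal(loc=0.0, scale=sigma, size=(m, n))
Y = mu[:, None] + eps

# Row i of Y is the i-th sample; its sample mean estimates mu_i.
print(Y.mean(axis=1))

Each row of the array Y plays the role of one of the m independent samples of size n described above.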
We will be interested in testing the null hypothesis Ho : µ1 = µ2 = · · · = µm = µ against the alternative hypothesis Ha : not all the means are equal. Probability and Mathematical Statistics 615 In the following theorem we present the maximum likelihood estimators of the parameters µ1 , µ2 , ..., µm and σ 2 . Theorem 20.1. Suppose the one-way ANOVA model is given by the equation (2) where the ǫij ’s are independent and normally distributed random variables with mean zero and variance σ 2 for i = 1, 2, ..., m and j = 1, 2, ..., n. Then the MLE’s of the parameters µi (i = 1, 2, ..., m) and σ 2 of the model are given by µZi = Y i• i = 1, 2, ..., m, [2 = 1 SSW , σ nm n m % n % % "2 ! 1 Yij and SSW = Yij − Y i• is the within samples where Y i• = n j=1 i=1 j=1 sum of squares. Proof: The likelihood function is given by ? n 9 m P (Y −µ )2 P 1 − ij 2 i 2 2σ √ L(µ1 , µ2 , ..., µm , σ ) = e 2πσ 2 i=1 j=1 = # √ 1 2πσ 2 $nm − e 1 2σ 2 m % n % i=1 j=1 (Yij − µi )2 . Taking the natural logarithm of the likelihood function L, we obtain ln L(µ1 , µ2 , ..., µm , σ 2 ) = − m n 1 %% nm (Yij − µi )2 . ln(2 π σ 2 ) − 2 2 2σ i=1 j=1 (4) Now taking the partial derivative of (4) with respect to µ1 , µ2 , ..., µm and σ 2 , we get n ∂lnL 1% (Yij − µi ) (5) = 2 ∂µi σ j=1 and m n ∂lnL nm 1 %% (Yij − µi )2 . = − + ∂σ 2 2σ 2 2σ 4 i=1 j=1 (6) Equating these partial derivatives to zero and solving for µi and σ 2 , respectively, we have i = 1, 2, ..., m, µi = Y i• m n "2 1 %% ! σ2 = Yij − Y i• , nm i=1 j=1 Analysis of Variance 616 where n Y i• = 1% Yij . n j=1 It can be checked that these solutions yield the maximum of the likelihood function and we leave this verification to the reader. Thus the maximum likelihood estimators of the model parameters are given by where SSW = n % n % ! i=1 j=1 i = 1, 2, ..., m, µZi = Y i• 1 [2 = σ SSW , nm "2 Yij − Y i• . The proof of the theorem is now complete. Define m Y •• = n 1 %% Yij . nm i=1 j=1 (7) Further, define n m % % ! SST = i=1 j=1 SSW = Yij − Y •• n m % % ! i=1 j=1 "2 Yij − Y i• "2 (8) (9) and SSB = n m % % ! i=1 j=1 Y i• − Y •• "2 (10) Here SST is the total sum of square, SSW is the within sum of square, and SSB is the between sum of square. Next we consider the partitioning of the total sum of squares. The following lemma gives us such a partition. Lemma 20.1. The total sum of squares is equal to the sum of within and between sum of squares, that is SST = SSW + SSB . (11) Probability and Mathematical Statistics 617 Proof: Rewriting (8) we have SST = m % n % ! i=1 j=1 Yij − Y •• "2 n m % % ; <2 (Yij − Y i• ) + (Yi• − Y •• ) = i=1 j=1 = m % n % i=1 j=1 (Yij − Y i• )2 + +2 = SSW + SSB + 2 m % n % i=1 j=1 m n %% i=1 j=1 n m % % i=1 j=1 (Y i• − Y •• )2 (Yij − Y i• ) (Y i• − Y •• ) (Yij − Y i• ) (Y i• − Y •• ). The cross-product term vanishes, that is n m % % i=1 j=1 (Yij − Y i• ) (Y i• − Y •• ) = m % i=1 n % (Yi• − Y•• ) (Yij − Y i• ) = 0. j=1 Hence we obtain the asserted result SST = SSW + SSB and the proof of the lemma is complete. The following theorem is a technical result and is needed for testing the null hypothesis against the alternative hypothesis. Theorem 20.2. Consider the ANOVA model Yij = µi + ǫij i = 1, 2, ..., m, ! " where Yij ∼ N µi , σ 2 . Then (a) the random variable SSW σ2 j = 1, 2, ..., n, ∼ χ2 (m(n − 1)), and (b) the statistics SSW and SSB are independent. 
Further, if the null hypothesis Ho : µ1 = µ2 = · · · = µm = µ is true, then (c) the random variable (d) the statistics SSB σ2 SSB m(n−1) SSW (m−1) (e) the random variable SST σ2 ∼ χ2 (m − 1), ∼ F (m − 1, m(n − 1)), and ∼ χ2 (nm − 1). Analysis of Variance 618 Proof: In Chapter 13, we have seen in Theorem 13.7 that if X1 , X2 , ..., Xn are independent random variables each one having the distribution N (µ, σ 2 ), n % (Xi − X)2 have the following properties: then their mean X and (i) X and (ii) 1 σ2 n % i=1 n % i=1 i=1 (Xi − X)2 are independent, and (Xi − X)2 ∼ χ2 (n − 1). Now using (i) and (ii), we establish this theorem. (a) Using (ii), we see that n "2 1 %! Yij − Y i• ∼ χ2 (n − 1) 2 σ j=1 for each i = 1, 2, ..., m. Since n % ! j=1 Yij − Y i• "2 and n % ! j=1 Yi$ j − Y i$ • "2 are independent for i# += i, we obtain m n % "2 1 %! Yij − Y i• ∼ χ2 (m(n − 1)). 2 σ j=1 i=1 Hence m n "2 SSW 1 %%! = Yij − Y i• 2 2 σ σ i=1 j=1 = m n % "2 1 %! Yij − Y i• ∼ χ2 (m(n − 1)). 2 σ j=1 i=1 (b) Since for each i = 1, 2, ..., m, the random variables Yi1 , Yi2 , ..., Yin are independent and ! " Yi1 , Yi2 , ..., Yin ∼ N µi , σ 2 we conclude by (i) that n % ! j=1 Yij − Y i• "2 and Y i• Probability and Mathematical Statistics 619 are independent. Further n % ! j=1 Yij − Y i• "2 and Y i$ • are independent for i# += i. Therefore, each of the statistics n % ! j=1 Yij − Y i• "2 i = 1, 2, ..., m is independent of the statistics Y 1• , Y 2• , ..., Y m• , and the statistics n % ! j=1 Yij − Y i• "2 i = 1, 2, ..., m are independent. Thus it follows that the sets n % ! j=1 Yij − Y i• "2 i = 1, 2, ..., m and Y i• i = 1, 2, ..., m are independent. Thus m % n % ! i=1 j=1 Yij − Y i• "2 and m % n % ! i=1 j=1 Y i• − Y •• "2 are independent. Hence by definition, the statistics SSW and SSB are independent. Suppose the null hypothesis Ho : µ1 = µ2 = · · · = µm = µ is true. (c) Under Ho , the random variables -, Y 2• , ..., Y m• are independent and , Y 1• σ2 identically distributed with N µ, n . Therefore by (ii) m "2 n %! Y i• − Y •• ∼ χ2 (m − 1). σ 2 i=1 Hence m n "2 1 %%! SSB = Y i• − Y •• 2 2 σ σ i=1 j=1 = m "2 n %! Y i• − Y •• ∼ χ2 (m − 1). σ 2 i=1 Analysis of Variance 620 (d) Since SSW ∼ χ2 (m(n − 1)) σ2 and SSB ∼ χ2 (m − 1) σ2 therefore SSB (m−1) σ 2 SSW (n(m−1) σ 2 That is SSB (m−1) SSW (n(m−1) ∼ F (m − 1, m(n − 1)). ∼ F (m − 1, m(n − 1)). (e) Under Ho , the random variables Yij , i = 1, 2, ..., m, j = 1, 2, ..., n are independent and each has the distribution N (µ, σ 2 ). By (ii) we see that m n "2 1 %% ! Yij − Y •• ∼ χ2 (nm − 1). 2 σ i=1 j=1 Hence we have SST ∼ χ2 (nm − 1) σ2 and the proof of the theorem is now complete. From Theorem 20.1, we see that the maximum likelihood estimator of each µi (i = 1, 2, ..., m) is given by , 2 and since Y i• ∼ N µi , σn , µZi = Y i• , ! " E (µZi ) = E Y i• = µi . Thus the maximum likelihood estimators are unbiased estimator of µi for i = 1, 2, ..., m. Since [2 = SSW σ mn and by Theorem 20.2, σ12 SSW ∼ χ2 (m(n − 1)), we have # $ $ # , 1 1 2 SSW 1 2 [ 2 σ E SSW = σ m(n − 1) += σ 2 . = E σ =E mn mn σ2 mn Probability and Mathematical Statistics 621 [2 of σ 2 is biased. However, the Thus the maximum likelihood estimator σ SSW SST estimator m(n−1) is an unbiased estimator. Similarly, the estimator mn−1 is an unbiased estimator where as SST mn is a biased estimator of σ 2 . Theorem 20.3. Suppose the one-way ANOVA model is given by the equation (2) where the ǫij ’s are independent and normally distributed random variables with mean zero and variance σ 2 for i = 1, 2, ..., m and j = 1, 2, ..., n. 
The null hypothesis Ho : µ1 = µ2 = · · · = µm = µ is rejected whenever the test statistics F satisfies F= SSB /(m − 1) > Fα (m − 1, m(n − 1)), SSW /(m(n − 1)) (12) where α is the significance level of the hypothesis test and Fα (m−1, m(n−1)) denotes the 100(1 − α) percentile of the F -distribution with m − 1 numerator and nm − m denominator degrees of freedom. Proof: Under the null hypothesis Ho : µ1 = µ2 = · · · = µm = µ, the likelihood function takes the form ? m P n 9 (Y −µ)2 P 1 − ij 2 2 2σ √ L(µ, σ ) = e 2πσ 2 i=1 j=1 = # √ 1 2πσ 2 $nm 1 2σ 2 − m % n % i=1 j=1 e (Yij − µ)2 . Taking the natural logarithm of the likelihood function and then maximizing it, we obtain 1 SST and σ_ µ Z = Y •• Ho = mn as the maximum likelihood estimators of µ and σ 2 , respectively. Inserting these estimators into the likelihood function, we have the maximum of the likelihood function, that is  max L(µ, σ 2 ) =  = 1 2 2π σ_ Ho nm  − e 1 [ 2σ 2 Ho m % n % i=1 j=1 (Yij − Y •• )2 Simplifying the above expression, we see that nm  1 − mn  e 2 SST max L(µ, σ 2 ) =  = 2 2π σ_ Ho SST . Analysis of Variance 622 which is  max L(µ, σ 2 ) =  = 1 2 2π σ_ Ho nm  e− mn 2 . (13) When no restrictions imposed, we get the maximum of the likelihood function from Theorem 20.1 as 2 max L(µ1 , µ2 , ..., µm , σ ) = ) 1 I [2 2π σ +nm − 1 Z 2σ 2 e n m % % i=1 j=1 (Yij − Y i• )2 . Simplifying the above expression, we see that 2 max L(µ1 , µ2 , ..., µm , σ ) = ) which is 2 max L(µ1 , µ2 , ..., µm , σ ) = I 1 [2 2π σ ) I +nm 1 [2 2π σ − 2 mn SSW SS e +nm W e− mn 2 . (14) Next we find the likelihood ratio statistic W for testing the null hypothesis Ho : µ1 = µ2 = · · · = µm = µ. Recall that the likelihood ratio statistic W can be found by evaluating W = max L(µ, σ 2 ) . max L(µ1 , µ2 , ..., µm , σ 2 ) Using (13) and (14), we see that W = ) [2 σ 2 σ_ Ho + mn 2 . (15) Hence the likelihood ratio test to reject the null hypothesis Ho is given by the inequality W < k0 where k0 is a constant. Using (15) and simplifying, we get 2 σ_ Ho > k1 [2 σ Probability and Mathematical Statistics where k1 = , 1 k0 2 - mn 623 . Hence 2 σ_ SST /mn = Ho > k1 . [2 SSW /mn σ Using Lemma 20.1 we have SSW + SSB > k1 . SSW Therefore SSB >k SSW (16) where k = k1 − 1. In order to find the cutoff point k in (16), we use Theorem 20.2 (d). Therefore F= m(n − 1) SSB /(m − 1) > k SSW /(m(n − 1)) m−1 Since F has F distribution, we obtain m(n − 1) k = Fα (m − 1, m(n − 1)). m−1 Thus, at a significance level α, reject the null hypothesis Ho if F= SSB /(m − 1) > Fα (m − 1, m(n − 1)) SSW /(m(n − 1)) and the proof of the theorem is complete. The various quantities used in carrying out the test described in Theorem 20.3 are presented in a tabular form known as the ANOVA table. Source of variation Sums of squares Degree of freedom Between SSB m−1 Within SSW m(n − 1) Total SST mn − 1 Mean squares MSB = MSW = SSB m−1 SSW m(n−1) Table 20.1. One-Way ANOVA Table F-statistics F F= MSB MSW Analysis of Variance 624 At a significance level α, the likelihood ratio test is: “Reject the null hypothesis Ho : µ1 = µ2 = · · · = µm = µ if F > Fα (m − 1, m(n − 1)).” One can also use the notion of p−value to perform this hypothesis test. If the value of the test statistics is F = γ, then the p-value is defined as p − value = P (F (m − 1, m(n − 1)) ≥ γ). 
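The quantities in Table 20.1 are straightforward to compute. The following Python sketch (again only an illustration; NumPy and SciPy are assumed to be available, and the balanced data matrix below is hypothetical) computes SSB, SSW, the F statistic and the p-value P(F(m-1, m(n-1)) >= F), and cross-checks the result against SciPy's built-in one-way ANOVA routine:

import numpy as np
from scipy import stats

# Hypothetical balanced data: m = 3 treatments, n = 4 observations per treatment.
Y = np.array([[5.0, 4.0, 8.0, 6.0],
              [9.0, 7.0, 8.0, 6.0],
              [3.0, 5.0, 2.0, 3.0]])
m, n = Y.shape

grand_mean = Y.mean()                                # grand mean
row_means = Y.mean(axis=1)                           # treatment sample means

SSW = ((Y - row_means[:, None]) ** 2).sum()          # within sum of squares
SSB = n * ((row_means - grand_mean) ** 2).sum()      # between sum of squares

MSB = SSB / (m - 1)
MSW = SSW / (m * (n - 1))
F = MSB / MSW
p_value = stats.f.sf(F, m - 1, m * (n - 1))          # P( F(m-1, m(n-1)) >= F )
print(F, p_value)

# Cross-check: SciPy's one-way ANOVA applied to the rows of Y.
print(stats.f_oneway(*Y))

The null hypothesis is rejected at significance level α whenever the printed p-value is smaller than α.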
Alternatively, at a significance level α, the likelihood ratio test is: “Reject the null hypothesis Ho : µ1 = µ2 = · · · = µm = µ if p − value < α.” The following figure illustrates the notions of between sample variation and within sample variation. Between sample variation Grand Mean Within sample variation Key concepts in ANOVA The ANOVA model described in (2), that is Yij = µi + ǫij for i = 1, 2, ..., m, j = 1, 2, ..., n, can be rewritten as Yij = µ + αi + ǫij for i = 1, 2, ..., m, where µ is the mean of the m values of µi , and m % i=1 j = 1, 2, ..., n, αi = 0. The quantity αi is called the effect of the ith treatment. Thus any observed value is the sum of Probability and Mathematical Statistics 625 an overall mean µ, a treatment or class deviation αi , and a random element from a normally distributed random variable ǫij with mean zero and variance σ 2 . This model is called model I, the fixed effects model. The effects of the treatments or classes, measured by the parameters αi , are regarded as fixed but unknown quantities to be estimated. In this fixed effect model the null hypothesis H0 is now Ho : α1 = α2 = · · · = αm = 0 and the alternative hypothesis is Ha : not all the αi are zero. The random effects model, also known as model II, is given by Yij = µ + Ai + ǫij for i = 1, 2, ..., m, j = 1, 2, ..., n, where µ is the overall mean and 2 Ai ∼ N (0, σA ) and ǫij ∼ N (0, σ 2 ). 2 In this model, the variances σA and σ 2 are unknown quantities to be esti2 mated. The null hypothesis of the random effect model is Ho : σA = 0 and 2 the alternative hypothesis is Ha : σA > 0. In this chapter we do not consider the random effect model. Before we present some examples, we point out the assumptions on which the ANOVA is based on. The ANOVA is based on the following three assumptions: (1) Independent Samples: The samples taken from the population under consideration should be independent of one another. (2) Normal Population: For each population, the variable under consideration should be normally distributed. (3) Equal Variance: The variances of the variables under consideration should be the same for all the populations. Example 20.1. The data in the following table gives the number of hours of relief provided by 5 different brands of headache tablets administered to 25 subjects experiencing fevers of 38o C or more. Perform the analysis of variance Analysis of Variance 626 and test the hypothesis at the 0.05 level of significance that the mean number of hours of relief provided by the tablets is same for all 5 brands. A B Tablets C D F 5 4 8 6 3 9 7 8 6 9 3 5 2 3 7 2 3 4 1 4 7 6 9 4 7 Answer: Using the formulas (8), (9) and (10), we compute the sum of squares SSW , SSB and SST as SSW = 57.60, SSB = 79.94, and SST = 137.04. The ANOVA table for this problem is shown below. Source of variation Sums of squares Degree of freedom Mean squares F-statistics F Between 79.94 4 19.86 6.90 Within 57.60 20 2.88 Total 137.04 24 At the significance level α = 0.05, we find the F-table that F0.05 (4, 20) = 2.8661. Since 6.90 = F > F0.05 (4, 20) = 2.8661 we reject the null hypothesis that the mean number of hours of relief provided by the tablets is same for all 5 brands. Note that using a statistical package like MINITAB, SAS or SPSS we can compute the p-value to be p − value = P (F (4, 20) ≥ 6.90) = 0.001. Hence again we reach the same conclusion since p-value is less then the given α for this problem. Probability and Mathematical Statistics 627 Example 20.2. 
Perform the analysis of variance and test the null hypothesis at the 0.05 level of significance for the following two data sets. Data Set 1 A 8.1 4.2 14.7 9.9 12.1 6.2 Sample B 8.0 15.1 4.7 10.4 9.0 9.8 Data Set 2 C A Sample B C 14.8 5.3 11.1 7.9 9.3 7.4 9.2 9.1 9.2 9.2 9.3 9.2 9.5 9.5 9.5 9.6 9.5 9.4 9.4 9.3 9.3 9.3 9.2 9.3 Answer: Computing the sum of squares SSW , SSB and SST , we have the following two ANOVA tables: Source of variation Sums of squares Degree of freedom Mean squares F-statistics F Between 0.3 2 0.1 0.01 Within 187.2 15 12.5 Total 187.5 17 Source of variation Sums of squares Degree of freedom Mean squares F-statistics F Between 0.280 2 0.140 35.0 Within 0.600 15 0.004 Total 0.340 17 and Analysis of Variance 628 At the significance level α = 0.05, we find from the F-table that F0.05 (2, 15) = 3.68. For the first data set, since 0.01 = F < F0.05 (2, 15) = 3.68 we do not reject the null hypothesis whereas for the second data set, 35.0 = F > F0.05 (2, 15) = 3.68 we reject the null hypothesis. Remark 20.1. Note that the sample means are same in both the data sets. However, there is a less variation among the sample points in samples of the second data set. The ANOVA finds a more significant differences among the means in the second data set. This example suggests that the larger the variation among sample means compared with the variation of the measurements within samples, the greater is the evidence to indicate a difference among population means. 20.2. One-Way Analysis of Variance with Unequal Sample Sizes In the previous section, we examined the theory of ANOVA when samples are same sizes. When the samples are same sizes we say that the ANOVA is in the balanced case. In this section we examine the theory of ANOVA for unbalanced case, that is when the samples are of different sizes. In experimental work, one often encounters unbalance case due to the death of experimental animals in a study or drop out of the human subjects from a study or due to damage of experimental materials used in a study. Our analysis of the last section for the equal sample size will be valid but have to be modified to accommodate the different sample size. Consider m independent samples of respective sizes n1 , n2 , ..., nm , where the members of the ith sample, Yi1 , Yi2 , ..., Yini , are normal random variables with mean µi and unknown variance σ 2 . That is, ! " Yij ∼ N µi , σ 2 , i = 1, 2, ..., m, j = 1, 2, ..., ni . Let us denote N = n1 + n2 + · · · + nm . Again, we will be interested in testing the null hypothesis Ho : µ1 = µ2 = · · · = µm = µ Probability and Mathematical Statistics 629 against the alternative hypothesis Ha : not all the means are equal. Now we defining Y i• n 1 % Yij , = ni j=1 (17) m n Y •• = SST = i 1 %% Yij , N i=1 j=1 ni m % % ! i=1 j=1 SSW = Yij − Y •• ni m % % ! i=1 j=1 and SSB = ni m % % ! i=1 j=1 (18) "2 Yij − Y i• "2 Y i• − Y •• , (19) , (20) "2 (21) we have the following results analogous to the results in the previous section. Theorem 20.4. Suppose the one-way ANOVA model is given by the equation (2) where the ǫij ’s are independent and normally distributed random variables with mean zero and variance σ 2 for i = 1, 2, ..., m and j = 1, 2, ..., ni . Then the MLE’s of the parameters µi (i = 1, 2, ..., m) and σ 2 of the model are given by µZi = Y i• i = 1, 2, ..., m, [2 = 1 SSW , σ N where Y i• = 1 ni ni % Yij and SSW = j=1 sum of squares. ni m % % ! i=1 j=1 Yij − Y i• "2 is the within samples Lemma 20.2. 
The total sum of squares is equal to the sum of within and between sum of squares, that is SST = SSW + SSB . Theorem 20.5. Consider the ANOVA model Yij = µi + ǫij i = 1, 2, ..., m, j = 1, 2, ..., ni , Analysis of Variance 630 ! " where Yij ∼ N µi , σ 2 . Then (a) the random variable SSW σ2 ∼ χ2 (N − m), and (b) the statistics SSW and SSB are independent. Further, if the null hypothesis Ho : µ1 = µ2 = · · · = µm = µ is true, then (c) the random variable (d) the statistics SSB σ2 SSB m(n−1) SSW (m−1) (e) the random variable SST σ2 ∼ χ2 (m − 1), ∼ F (m − 1, N − m), and ∼ χ2 (N − 1). Theorem 20.6. Suppose the one-way ANOVA model is given by the equation (2) where the ǫij ’s are independent and normally distributed random variables with mean zero and variance σ 2 for i = 1, 2, ..., m and j = 1, 2, ..., ni . The null hypothesis Ho : µ1 = µ2 = · · · = µm = µ is rejected whenever the test statistics F satisfies F= SSB /(m − 1) > Fα (m − 1, N − m), SSW /(N − m) where α is the significance level of the hypothesis test and Fα (m − 1, N − m) denotes the 100(1 − α) percentile of the F -distribution with m − 1 numerator and N − m denominator degrees of freedom. The corresponding ANOVA table for this case is Source of variation Sums of squares Degree of freedom Mean squares Between SSB m−1 MSB = SSB m−1 Within SSW N −m MSW = SSW N −m Total SST N −1 F-statistics F F= MSB MSW Table 20.2. One-Way ANOVA Table with unequal sample size Example 20.3. Three sections of elementary statistics were taught by different instructors. A common final examination was given. The test scores are given in the table below. Perform the analysis of variance and test the hypothesis at the 0.05 level of significance that there is a difference in the average grades given by the three instructors. Probability and Mathematical Statistics Instructor A 75 91 83 45 82 75 68 47 38 631 Elementary Statistics Instructor B Instructor C 90 80 50 93 53 87 76 82 78 80 33 79 17 81 55 70 61 43 89 73 58 70 Answer: Using the formulas (17) - (21), we compute the sum of squares SSW , SSB and SST as SSW = 10362, SSB = 755, and SST = 11117. The ANOVA table for this problem is shown below. Source of variation Sums of squares Degree of freedom Mean squares F-statistics F Between 755 2 377 1.02 Within 10362 28 370 Total 11117 30 At the significance level α = 0.05, we find the F-table that F0.05 (2, 28) = 3.34. Since 1.02 = F < F0.05 (2, 28) = 3.34 we accept the null hypothesis that there is no difference in the average grades given by the three instructors. Note that using a statistical package like MINITAB, SAS or SPSS we can compute the p-value to be p − value = P (F (2, 28) ≥ 1.02) = 0.374. Analysis of Variance 632 Hence again we reach the same conclusion since p-value is less then the given α for this problem. We conclude this section pointing out the advantages of choosing equal sample sizes (balance case) over the choice of unequal sample sizes (unbalance case). The first advantage is that the F-statistics is insensitive to slight departures from the assumption of equal variances when the sample sizes are equal. The second advantage is that the choice of equal sample size minimizes the probability of committing a type II error. 20.3. Pair wise Comparisons When the null hypothesis is rejected using the F -test in ANOVA, one may still wants to know where the difference among the means is. There are several methods to find out where the significant differences in the means lie after the ANOVA procedure is performed. 
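Before describing these pairwise methods, we note that the unbalanced F test of Theorem 20.6 is easily carried out numerically. The following Python sketch (an illustration only; NumPy and SciPy are assumed, and the three samples of unequal sizes are hypothetical) computes SSB, SSW and the F statistic with N - m within degrees of freedom:

import numpy as np
from scipy import stats

# Hypothetical samples of unequal sizes (unbalanced case).
samples = [np.array([75.0, 91.0, 83.0, 45.0, 82.0]),
           np.array([90.0, 80.0, 50.0, 93.0, 53.0, 87.0]),
           np.array([70.0, 72.0, 81.0, 74.0])]

m = len(samples)
N = sum(len(s) for s in samples)
grand_mean = np.concatenate(samples).mean()          # grand mean of all N observations

SSW = sum(((s - s.mean()) ** 2).sum() for s in samples)
SSB = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)

F = (SSB / (m - 1)) / (SSW / (N - m))
p_value = stats.f.sf(F, m - 1, N - m)                # P( F(m-1, N-m) >= F )
print(F, p_value)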
Among the most commonly used tests are Scheffé test and Tuckey test. In this section, we give a brief description of these tests. In order to perform the Scheffé test, we have to compare the means two at a time using all possible combinations of means. Since we have m means, ! " we need m 2 pair wise comparisons. A pair wise comparison can be viewed as a test of the null hypothesis H0 : µi = µk against the alternative Ha : µi += µk for all i += k. To conduct this test we compute the statistics "2 Y i• − Y k• -, , Fs = M SW n1i + n1k ! where Y i• and Y k• are the means of the samples being compared, ni and nk are the respective sample sizes, and M SW is the mean sum of squared of within group. We reject the null hypothesis at a significance level of α if Fs > (m − 1)Fα (m − 1, N − m) where N = n1 + n2 + · · · + nm . Example 20.4. Perform the analysis of variance and test the null hypothesis at the 0.05 level of significance for the following data given in the table below. Further perform a Scheffé test to determine where the significant differences in the means lie. Probability and Mathematical Statistics 1 9.2 9.1 9.2 9.2 9.3 9.2 633 Sample 2 9.5 9.5 9.5 9.6 9.5 9.4 3 9.4 9.3 9.3 9.3 9.2 9.3 Answer: The ANOVA table for this data is given by Source of variation Sums of squares Degree of freedom Mean squares F-statistics F Between 0.280 2 0.140 35.0 Within 0.600 15 0.004 Total 0.340 17 At the significance level α = 0.05, we find the F-table that F0.05 (2, 15) = 3.68. Since 35.0 = F > F0.05 (2, 15) = 3.68 we reject the null hypothesis. Now we perform the Scheffé test to determine where the significant differences in the means lie. From given data, we obtain Y 1• = 9.2, Y 2• = 9.5 and Y 3• = 9.3. Since m = 3, we have to make 3 pair wise comparisons, namely µ1 with µ2 , µ1 with µ3 , and µ2 with µ3 . First we consider the comparison of µ1 with µ2 . For this case, we find ! "2 2 Y 1• − Y 2• (9.2 − 9.5) , ! 1 1 " = 67.5. -= Fs = 0.004 6 + 6 M SW n11 + n12 Since 67.5 = Fs > 2 F0.05 (2, 15) = 7.36 we reject the null hypothesis H0 : µ1 = µ2 in favor of the alternative Ha : µ1 += µ2 . Analysis of Variance 634 Next we consider the comparison of µ1 with µ3 . For this case, we find "2 ! 2 Y 1• − Y 3• (9.2 − 9.3) , -= ! 1 1 " = 7.5. Fs = 0.004 6 + 6 M SW n11 + n13 Since 7.5 = Fs > 2 F0.05 (2, 15) = 7.36 we reject the null hypothesis H0 : µ1 = µ3 in favor of the alternative Ha : µ1 += µ3 . Finally we consider the comparison of µ2 with µ3 . For this case, we find "2 ! 2 Y 2• − Y 3• (9.5 − 9.3) -= , ! 1 1 " = 30.0. Fs = 0.004 6 + 6 M SW n12 + n13 Since 30.0 = Fs > 2 F0.05 (2, 15) = 7.36 we reject the null hypothesis H0 : µ2 = µ3 in favor of the alternative Ha : µ2 += µ3 . Next consider the Tukey test. Tuckey test is applicable when we have a balanced case, that is when the sample sizes are equal. For Tukey test we compute the statistics Y i• − Y k• , Q= = M SW n where Y i• and Y k• are the means of the samples being compared, n is the size of the samples, and M SW is the mean sum of squared of within group. At a significance level α, we reject the null hypothesis H0 if |Q| > Qα (m, ν) where ν represents the degrees of freedom for the error mean square. Example 20.5. For the data given in Example 20.4 perform a Tukey test to determine where the significant differences in the means lie. Answer: We have seen that Y 1• = 9.2, Y 2• = 9.5 and Y 3• = 9.3. First we compare µ1 with µ2 . For this we compute Y 1• − Y 2• 9.2 − 9.3 = −11.6189. 
= = Q= = M SW n 0.004 6 Probability and Mathematical Statistics 635 Since 11.6189 = |Q| > Q0.05 (2, 15) = 3.01 we reject the null hypothesis H0 : µ1 = µ2 in favor of the alternative Ha : µ1 += µ2 . Next we compare µ1 with µ3 . For this we compute Y 1• − Y 3• 9.2 − 9.5 = = = −3.8729. Q= = 0.004 6 M SW n Since 3.8729 = |Q| > Q0.05 (2, 15) = 3.01 we reject the null hypothesis H0 : µ1 = µ3 in favor of the alternative Ha : µ1 += µ3 . Finally we compare µ2 with µ3 . For this we compute Y 2• − Y 3• 9.5 − 9.3 = 7.7459. Q= = = = M SW n Since 0.004 6 7.7459 = |Q| > Q0.05 (2, 15) = 3.01 we reject the null hypothesis H0 : µ2 = µ3 in favor of the alternative Ha : µ2 += µ3 . Often in scientific and engineering problems, the experiment dictates the need for comparing simultaneously each treatment with a control. Now we describe a test developed by C. W. Dunnett for determining significant differences between each treatment mean and the control. Suppose we wish to test the m hypotheses H0 : µ0 = µi versus Ha : µ0 += µi for i = 1, 2, ..., m, where µ0 represents the mean yield for the population of measurements in which the control is used. To test the null hypotheses specified by H0 against two-sided alternatives for an experimental situation in which there are m treatments, excluding the control, and n observation per treatment, we first calculate Y i• − Y 0• , i = 1, 2, ..., m. Di = = 2 M SW n Analysis of Variance 636 At a significance level α, we reject the null hypothesis H0 if |Di | > D α2 (m, ν) where ν represents the degrees of freedom for the error mean square. The values of the quantity D α2 (m, ν) are tabulated for various α, m and ν. Example 20.6. For the data given in the table below perform a Dunnett test to determine any significant differences between each treatment mean and the control. Control Sample 1 Sample 2 9.2 9.1 9.2 9.2 9.3 9.2 9.5 9.5 9.5 9.6 9.5 9.4 9.4 9.3 9.3 9.3 9.2 9.3 Answer: The ANOVA table for this data is given by Source of variation Sums of squares Degree of freedom Mean squares F-statistics F Between 0.280 2 0.140 35.0 Within 0.600 15 0.004 Total 0.340 17 At the significance level α = 0.05, we find that D0.025 (2, 15) = 2.44. Since 35.0 = D > D0.025 (2, 15) = 2.44 we reject the null hypothesis. Now we perform the Dunnett test to determine if there is any significant differences between each treatment mean and the control. From given data, we obtain Y 0• = 9.2, Y 1• = 9.5 and Y 2• = 9.3. Since m = 2, we have to make 2 pair wise comparisons, namely µ0 with µ1 , and µ0 with µ2 . First we consider the comparison of µ0 with µ1 . For this case, we find Y 1• − Y 0• 9.5 − 9.2 == D1 = = = 8.2158. 2 M SW n 2 (0.004) 6 Probability and Mathematical Statistics 637 Since 8.2158 = D1 > D0.025 (2, 15) = 2.44 we reject the null hypothesis H0 : µ1 = µ0 in favor of the alternative Ha : µ1 += µ0 . Next we find 9.3 − 9.2 Y 2• − Y 0• = 2.7386. D2 = = == 2 M SW n 2 (0.004) 6 Since 2.7386 = D2 > D0.025 (2, 15) = 2.44 we reject the null hypothesis H0 : µ2 = µ0 in favor of the alternative Ha : µ2 += µ0 . 20.4. Tests for the Homogeneity of Variances One of the assumptions behind the ANOVA is the equal variance, that is the variances of the variables under consideration should be the same for all population. Earlier we have pointed out that the F-statistics is insensitive to slight departures from the assumption of equal variances when the sample sizes are equal. Nevertheless it is advisable to run a preliminary test for homogeneity of variances. 
Such a test would certainly be advisable in the case of unequal sample sizes if there is a doubt concerning the homogeneity of population variances. Suppose we want to test the null hypothesis 2 H0 : σ12 = σ22 = · · · σm versus the alternative hypothesis Ha : not all variances are equal. A frequently used test for the homogeneity of population variances is the Bartlett test. Bartlett (1937) proposed a test for equal variances that was modification of the normal-theory likelihood ratio test. We will use this test to test the above null hypothesis H0 against Ha . 2 First, we compute the m sample variances S12 , S22 , ..., Sm from the samples of Analysis of Variance 638 size n1 , n2 , ..., nm , with n1 + n2 + · · · + nm = N . The test statistics Bc is given by m % (ni − 1) ln Si2 (N − m) ln Sp2 − Bc = 1+ 1 3 (m−1) ) i=1 m % 1 1 − n − 1 N − m i=1 i + where the pooled variance Sp2 is given by Sp2 = m % i=1 (ni − 1) Si2 = MSW . N −m It is known that the sampling distribution of Bc is approximately chi-square with m − 1 degrees of freedom, that is Bc ∼ χ2 (m − 1) when (ni − 1) ≥ 3. Thus the Bartlett test rejects the null hypothesis H0 : 2 at a significance level α if σ12 = σ22 = · · · σm Bc > χ21−α (m − 1), where χ21−α (m − 1) denotes the upper (1 − α)100 percentile of the chi-square distribution with m − 1 degrees of freedom. Example 20.7. For the following data perform an ANOVA and then apply Bartlett test to examine if the homogeneity of variances condition is met for a significance level 0.05. Sample 1 Sample 2 34 28 29 37 42 27 29 35 25 29 41 40 29 32 31 43 31 29 28 30 37 44 29 31 Data Sample 3 32 34 30 42 32 33 29 27 37 26 29 31 Sample 4 34 29 32 28 32 34 29 31 30 37 43 42 Probability and Mathematical Statistics 639 Answer: The ANOVA table for this data is given by Source of variation Sums of squares Degree of freedom Mean squares F-statistics F Between 16.2 3 5.4 0.20 Within 1202.2 44 27.3 Total 1218.5 47 At the significance level α = 0.05, we find the F-table that F0.05 (2, 44) = 3.23. Since 0.20 = F < F0.05 (2, 44) = 3.23 we do not reject the null hypothesis. Now we compute Bartlett test statistic Bc . From the data the variances of each group can be found to be S12 = 35.2836, S22 = 30.1401, S32 = 19.4481, S42 = 24.4036. Further, the pooled variance is Sp2 = MSW = 27.3. The statistics Bc is Bc = (N − m) ln Sp2 − 1+ = = 1 3 (m−1) ) m % m % i=1 (ni − 1) ln Si2 1 1 − n −1 N −m i=1 i + 44 ln 27.3 − 11 [ ln 35.2836 − ln 30.1401 − ln 19.4481 − ln 24.4036 ] , 1 1 4 1 + 3 (4−1) 12−1 − 48−4 1.0537 = 1.0153. 1.0378 From chi-square table we find that χ20.95 (3) = 7.815. Hence, since 1.0153 = Bc < χ20.95 (3) = 7.815, Analysis of Variance 640 we do not reject the null hypothesis that the variances are equal. Hence Bartlett test suggests that the homogeneity of variances condition is met. The Bartlett test assumes that the m samples should be taken from m normal populations. Thus Bartlett test is sensitive to departures from normality. The Levene test is an alternative to the Bartlett test that is less sensitive to departures from normality. Levene (1960) proposed a test for the homogeneity of population variances that considers the random variables "2 ! Wij = Yij − Y i• and apply a one-way analysis of variance to these variables. If the F -test is significant, the homogeneity of variances is rejected. Levene (1960) also proposed using F -tests based on the variables = Wij = |Yij − Y i• |, Wij = ln |Yij − Y i• |, and Wij = |Yij − Y i• |. 
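As a computational illustration (not part of the formal development), the Python sketch below evaluates the Bartlett statistic Bc exactly as defined above and compares it with the chi-square critical value; it assumes NumPy and SciPy are available and uses three hypothetical groups. SciPy's built-in bartlett and levene routines are included as a cross-check, the latter using the absolute-deviation variant |Yij - Y i•| mentioned above.

import numpy as np
from scipy import stats

# Hypothetical groups; equal sample sizes are not required.
groups = [np.array([34.0, 28.0, 29.0, 37.0, 42.0, 27.0]),
          np.array([29.0, 32.0, 31.0, 43.0, 31.0, 29.0]),
          np.array([32.0, 34.0, 30.0, 42.0, 32.0, 33.0])]

m = len(groups)
n_i = np.array([len(g) for g in groups])
N = n_i.sum()
S2_i = np.array([g.var(ddof=1) for g in groups])     # sample variances S_i^2

# Pooled variance S_p^2 = sum (n_i - 1) S_i^2 / (N - m), which equals MSW.
Sp2 = ((n_i - 1) * S2_i).sum() / (N - m)

numer = (N - m) * np.log(Sp2) - ((n_i - 1) * np.log(S2_i)).sum()
denom = 1.0 + (1.0 / (3 * (m - 1))) * ((1.0 / (n_i - 1)).sum() - 1.0 / (N - m))
Bc = numer / denom

# Reject H0 of equal variances at level 0.05 if Bc exceeds the 95th percentile
# of the chi-square distribution with m - 1 degrees of freedom.
print(Bc, stats.chi2.ppf(0.95, m - 1))

# Cross-checks: SciPy's Bartlett test, and Levene's test with W_ij = |Y_ij - mean of group i|.
print(stats.bartlett(*groups))
print(stats.levene(*groups, center='mean'))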
Brown and Forsythe (1974c) proposed using the transformed variables based on the absolute deviations from the median, that is Wij = |Yij − M ed(Yi• )|, where M ed(Yi• ) denotes the median of group i. Again if the F -test is significant, the homogeneity of variances is rejected. Example 20.8. For the data in Example 20.7 do a Levene test to examine if the homogeneity of variances condition is met for a significance level 0.05. Answer: From data we find that Y 1• = 33.00, Y 2• = 32.83, Y 3• = 31.83, ! "2 and Y 4• = 33.42. Next we compute Wij = Yij − Y i• . The resulting values are given in the table below. Sample 1 1 25 16 16 81 36 16 4 64 16 64 49 Transformed Data Sample 2 Sample 3 14.7 0.7 3.4 103.4 3.4 14.7 23.4 8.0 17.4 124.7 14.7 3.4 0.0 4.7 3.4 103.4 0.0 1.4 8.0 23.4 26.7 34.0 0.0 0.7 Sample 4 0.3 19.5 2.0 29.3 2.0 0.3 19.5 5.8 11.7 12.8 91.8 73.7 Probability and Mathematical Statistics 641 Now we perform an ANOVA to the data given in the table above. The ANOVA table for this data is given by Source of variation Sums of squares Degree of freedom Mean squares F-statistics F Between 1430 3 477 0.46 Within 45491 44 1034 Total 46922 47 At the significance level α = 0.05, we find the F-table that F0.05 (3, 44) = 2.84. Since 0.46 = F < F0.05 (3, 44) = 2.84 we do not reject the null hypothesis that the variances are equal. Hence Bartlett test suggests that the homogeneity of variances condition is met. Although Bartlet test is most widely used test for homogeneity of variances a test due to Cochran provides a computationally simple procedure. Cochran test is one of the best method for detecting cases where the variance of one of the groups is much larger than that of the other groups. The test statistics of Cochran test is give by max Si2 C= 1≤i≤m m % . Si2 i=1 2 The Cochran test rejects the null hypothesis H0 : σ12 = σ22 = · · · σm at a significance level α if C > Cα . The critical values of Cα were originally published by Eisenhart et al (1947) for some combinations of degrees of freedom ν and the number of groups m. Here the degrees of freedom ν are ν = max (ni − 1). 1≤i≤m Example 20.9. For the data in Example 20.7 perform a Cochran test to examine if the homogeneity of variances condition is met for a significance level 0.05. Analysis of Variance 642 Answer: From the data the variances of each group can be found to be S12 = 35.2836, S22 = 30.1401, S32 = 19.4481, S42 = 24.4036. Hence the test statistic for Cochran test is C= 35.2836 35.2836 = = 0.3328. 35.2836 + 30.1401 + 19.4481 + 24.4036 109.2754 The critical value C0.5 (3, 11) is given by 0.4884. Since 0.3328 = C < C0.5 (3, 11) = 0.4884. At a significance level α = 0.05, we do not reject the null hypothesis that the variances are equal. Hence Cochran test suggests that the homogeneity of variances condition is met. 20.5. Exercises 1. A consumer organization wants to compare the prices charged for a particular brand of refrigerator in three types of stores in Louisville: discount stores, department stores and appliance stores. Random samples of 6 stores of each type were selected. The results were shown below. Discount Department Appliance 1200 1300 1100 1400 1250 1150 1700 1500 1450 1300 1300 1500 1600 1500 1300 1500 1700 1400 At the 0.05 level of significance, is there any evidence of a difference in the average price between the types of stores? 2. It is conjectured that a certain gene might be linked to ovarian cancer. The ovarian cancer is sub-classified into three categories: stage I, stage II and stage III-IV. 
There are three random samples available; one from each stage. The samples are labelled with three colors dyes and hybridized on a four channel cDNA microarray (one channel remains unused). The experiment is repeated 5 times and the following data were obtained. Probability and Mathematical Statistics Array 643 Microarray Data mRNA 1 mRNA 2 1 2 3 4 5 100 90 105 83 78 mRNA 3 95 93 79 85 90 70 72 81 74 75 Is there any difference between the averages of the three mRNA samples at 0.05 significance level? 3. A stock market analyst thinks 4 stock of mutual funds generate about the same return. He collected the accompaning rate-of-return data on 4 different mutual funds during the last 7 years. The data is given in table below. Year Mutual Funds A B C D 2000 2001 2002 2004 2005 2006 2007 12 12 13 18 17 18 12 15 11 12 11 10 10 12 11 17 18 20 19 12 15 13 19 15 25 19 17 20 Do a one-way ANOVA to decide whether the funds give different performance at 0.05 significance level. 4. Give a proof of the Theorem 20.4. 5. Give a proof of the Lemma 20.2. 6. Give a proof of the Theorem 20.5. 7. Give a proof of the Theorem 20.6. 8. An automobile company produces and sells its cars under 3 different brand names. An autoanalyst wants to see whether different brand of cars have same performance. He tested 20 cars from 3 different brands and recorded the mileage per gallon. Analysis of Variance 644 Brand 1 Brand 2 Brand 3 32 29 32 25 35 33 34 31 31 28 30 34 39 36 38 34 25 31 37 32 Do the data suggest a rejection of the null hypothesis at a significance level 0.05 that the mileage per gallon generated by three different brands are same. Probability and Mathematical Statistics 645 Chapter 21 GOODNESS OF FITS TESTS In point estimation, interval estimation or hypothesis test we always started with a random sample X1 , X2 , ..., Xn of size n from a known distribution. In order to apply the theory to data analysis one has to know the distribution of the sample. Quite often the experimenter (or data analyst) assumes the nature of the sample distribution based on his subjective knowledge. Goodness of fit tests are performed to validate experimenter opinion about the distribution of the population from where the sample is drawn. The most commonly known and most frequently used goodness of fit tests are the Kolmogorov-Smirnov (KS) test and the Pearson chi-square (χ2 ) test. There is a controversy over which test is the most powerful, but the general feeling seems to be that the Kolmogorov-Smirnov test is probably more powerful than the chi-square test in most situations. The KS test measures the distance between distribution functions, while the χ2 test measures the distance between density functions. Usually, if the population distribution is continuous, then one uses the Kolmogorov-Smirnov where as if the population distribution is discrete, then one performs the Pearson’s chi-square goodness of fit test. 21.1. Kolmogorov-Smirnov Test Let X1 , X2 , ..., Xn be a random sample from a population X. We hypothesized that the distribution of X is F (x). Further, we wish to test our hypothesis. Thus our null hypothesis is Ho : X ∼ F (x). Goodness of Fit Tests 646 We would like to design a test of this null hypothesis against the alternative Ha : X +∼ F (x). In order to design a test, first of all we need a statistic which will unbiasedly estimate the unknown distribution F (x) of the population X using the random sample X1 , X2 , ..., Xn . 
Let x(1) < x(2) < · · · < x(n) be the observed values of the ordered statistics X(1) , X(2) , ..., X(n) . The empirical distribution of the random sample is defined as  0 if x < x , (1)     for k = 1, 2, ..., n − 1, Fn (x) = nk if x(k) ≤ x < x(k+1) ,     1 if x(n) ≤ x. The graph of the empirical distribution function F4 (x) is shown below. F4(x) 1.00 0.75 0.50 0.25 0 x(1) x(2) x(3) x (4) Empirical Distribution Function For a fixed value of x, the empirical distribution function can be considered as a random variable that takes on the values 0, 1 2 n−1 n , , ..., , . n n n n First we show that Fn (x) is an unbiased estimator of the population distribution F (x). That is, E(Fn (x)) = F (x) (1) Probability and Mathematical Statistics 647 for a fixed value of x. To establish (1), we need the probability density function of the random variable Fn (x). From the definition of the empirical distribution we see that if exactly k observations are less than or equal to x, then k Fn (x) = n which is n Fn (x) = k. The probability that an observation is less than or equal to x is given by F (x). x ()) x Threre are k sample observations each with probability F(x) x()+1) There are n-k sample observations each with probability 1- F(x) Distribution of the Empirical Distribution Function Hence (see figure above) # $ k P (n Fn (x) = k) = P Fn (x) = n # $ n k n−k [F (x)] [1 − F (x)] = k for k = 0, 1, ..., n. Thus n Fn (x) ∼ BIN (n, F (x)). Goodness of Fit Tests 648 Thus the expected value of the random variable n Fn (x) is given by E(n Fn (x)) = n F (x) n E(Fn (x)) = n F (x) E(Fn (x)) = F (x). This shows that, for a fixed x, Fn (x), on an average, equals to the population distribution function F (x). Hence the empirical distribution function Fn (x) is an unbiased estimator of F (x). Since n Fn (x) ∼ BIN (n, F (x)), the variance of n Fn (x) is given by V ar(n Fn (x)) = n F (x) [1 − F (x)]. Hence the variance of Fn (x) is V ar(Fn (x)) = F (x) [1 − F (x)] . n It is easy to see that V ar(Fn (x)) → 0 as n → ∞ for all values of x. Thus the empirical distribution function Fn (x) and F (x) tend to be closer to each other with large n. As a matter of fact, Glivenkno, a Russian mathematician, proved that Fn (x) converges to F (x) uniformly in x as n → ∞ with probability one. Because of the convergence of the empirical distribution function to the theoretical distribution function, it makes sense to construct a goodness of fit test based on the closeness of Fn (x) and hypothesized distribution F (x). Let Dn = max|Fn (x) − F (x)|. x∈ R I That is Dn is the maximum of all pointwise differences |Fn (x) − F (x)|. The distribution of the Kolmogorov-Smirnov statistic, Dn can be derived. However, we shall not do that here as the derivation is quite involved. In stead, we give a closed form formula for P (Dn ≤ d). If X1 , X2 , ..., Xn is a sample from a population with continuous distribution function F (x), then  0    P n : 2 i− 1 +d n P (Dn ≤ d) = n! du    i=1 2 i−d 1 if d ≤ if 1 2n 1 2n <d<1 if d ≥ 1 Probability and Mathematical Statistics 649 where du = du1 du2 · · · dun with 0 < u1 < u2 < · · · < un < 1. Further, ∞ % √ 2 2 (−1)k−1 e−2 k d . lim P ( n Dn ≤ d) = 1 − 2 n→∞ k=1 These formulas show that the distribution of the Kolmogorov-Smirnov statistic Dn is distribution free, that is, it does not depend on the distribution F of the population. For most situations, it is sufficient to use the following approximations due to Kolmogorov: √ 2 P ( n Dn ≤ d) ≈ 1 − 2e−2nd 1 for d > √ . 
n If the null hypothesis Ho : X ∼ F (x) is true, the statistic Dn is small. It is therefore reasonable to reject Ho if and only if the observed value of Dn is larger than some constant dn . If the level of significance is given to be α, then the constant dn can be found from 2 α = P (Dn > dn / Ho is true) ≈ 2e−2ndn . This yields the following hypothesis test: Reject Ho if Dn ≥ dn where @ ,α1 ln dn = − 2n 2 is obtained from the above Kolmogorov’s approximation. Note that the approximate value of d12 obtained by the above formula is equal to 0.3533 when α = 0.1, however more accurate value of d12 is 0.34. Next we address the issue of the computation of the statistics Dn . Let us define Dn+ = max{Fn (x) − F (x)} x∈ R I and Dn− = max{F (x) − Fn (x)}. x∈ R I Then it is easy to see that − Dn = max{Dn+ , DN }. Further, since Fn (x(i) ) = i n. it can be shown that 3 ? 9 2 i + − F (x(i) ) , 0 Dn = max max 1≤i≤n n Goodness of Fit Tests 650 and Dn− 2 3 ? i−1 = max max F (x(i) ) − ,0 . 1≤i≤n n 9 Therefore it can also be shown that 9 2 3? i i−1 Dn = max max . − F (x(i) ), F (x(i) ) − 1≤i≤n n n The following figure illustrates the Kolmogorov-Smirnov statistics Dn when n = 4. 1.00 D4 F(x) 0.75 0.50 0.25 0 x(1) x(2) x(3) x(4) Kolmogorov-Smirnov Statistic Example 21.1. The data on the heights of 12 infants are given below: 18.2, 21.4, 22.6, 17.4, 17.6, 16.7, 17.1, 21.4, 20.1, 17.9, 16.8, 23.1. Test the hypothesis that the data came from some normal population at a significance level α = 0.1. Answer: Here, the null hypothesis is Ho : X ∼ N (µ, σ 2 ). First we estimate µ and σ 2 from the data. Thus, we get x= 230.3 = 19.2. 12 Probability and Mathematical Statistics and s2 = 651 1 (230.3)2 4482.01 − 12 62.17 = = 5.65. 12 − 1 11 Hence s = 2.38. Then by the null hypothesis F (x(i) ) = P # Z≤ x(i) − 19.2 2.38 $ where Z ∼ N (0, 1) and i = 1, 2, ..., n. Next we compute the KolmogorovSmirnov statistic Dn the given sample of size 12 using the following tabular form. i 1 2 3 4 5 6 7 8 9 10 11 12 x(i) 16.7 16.8 17.1 17.4 17.6 17.9 18.2 20.1 21.4 21.4 22.6 23.1 i 12 − F (x(i) ) −0.0636 0.0105 0.0606 0.1097 0.1653 0.2088 0.2461 0.0187 0.0121 F (x(i) ) 0.1469 0.1562 0.1894 0.2236 0.2514 0.2912 0.3372 0.6480 0.8212 −0.0069 0.0505 0.9236 0.9495 F (x(i) ) − i−1 12 0.1469 0.0729 0.0227 −0.0264 −0.0819 −0.1255 −0.1628 0.0647 0.0712 0.0903 0.0328 Thus D12 = 0.2461. From the tabulated value, we see that d12 = 0.34 for significance level α = 0.1. Since D12 is smaller than d12 we accept the null hypothesis Ho : X ∼ N (µ, σ 2 ). Hence the data came from a normal population. Example 21.2. Let X1 , X2 , ..., X10 be a random sample from a distribution whose probability density function is f (x) = & 1 if 0 < x < 1 0 otherwise. Based on the observed values 0.62, 0.36, 0.23, 0.76, 0.65, 0.09, 0.55, 0.26, 0.38, 0.24, test the hypothesis Ho : X ∼ U N IF (0, 1) against Ha : X +∼ U N IF (0, 1) at a significance level α = 0.1. Goodness of Fit Tests 652 Answer: The null hypothesis is Ho : X ∼ U N IF (0, 1). Thus F (x) = & 0 if x < 0 x if 0 ≤ x < 1 1 if x ≥ 1. Hence F (x(i) ) = x(i) for i = 1, 2, ..., n. Next we compute the Kolmogorov-Smirnov statistic Dn the given sample of size 10 using the following tabular form. i 1 2 3 4 5 6 7 8 9 10 x(i) 0.09 0.23 0.24 0.26 0.36 0.38 0.55 0.62 0.65 0.76 F (x(i) ) 0.09 0.23 0.24 0.26 0.36 0.38 0.55 0.62 0.65 0.76 i 10 − F (x(i) ) 0.01 −0.03 0.06 0.14 0.14 0.22 0.15 0.18 0.25 0.24 F (x(i) ) − 0.09 0.13 0.04 −0.04 −0.04 −0.12 −0.05 −0.08 −0.15 −0.14 i−1 10 Thus D10 = 0.25. 
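The tabular computation of D10 can be verified in a few lines of Python (an illustrative sketch; NumPy and SciPy are assumed). It applies the formula for Dn given earlier to the ten observations of this example, where F(x(i)) = x(i) under the uniform null hypothesis, and cross-checks the value with SciPy's Kolmogorov-Smirnov routine.

import numpy as np
from scipy import stats

# The ten observations of this example; under H0, F(x) = x on (0, 1).
x = np.array([0.62, 0.36, 0.23, 0.76, 0.65, 0.09, 0.55, 0.26, 0.38, 0.24])
n = len(x)

x_ord = np.sort(x)                   # order statistics x_(1) <= ... <= x_(n)
F = x_ord                            # F(x_(i)) = x_(i) for UNIF(0, 1)
i = np.arange(1, n + 1)

D_plus = np.max(i / n - F)           # largest value of i/n - F(x_(i))
D_minus = np.max(F - (i - 1) / n)    # largest value of F(x_(i)) - (i-1)/n
Dn = max(D_plus, D_minus)
print(Dn)                            # prints 0.25, agreeing with the table

# Cross-check against SciPy's Kolmogorov-Smirnov test for U(0, 1).
print(stats.kstest(x, 'uniform'))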
From the tabulated value, we see that d10 = 0.37 for significance level α = 0.1. Since D10 is smaller than d10 we accept the null hypothesis Ho : X ∼ U N IF (0, 1). 21.2 Chi-square Test The chi-square goodness of fit test was introduced by Karl Pearson in 1900. Recall that the Kolmogorov-Smirnov test is only for testing a specific continuous distribution. Thus if we wish to test the null hypothesis Ho : X ∼ BIN (n, p) against the alternative Ha : X +∼ BIN (n, p), then we can not use the Kolmogorov-Smirnov test. Pearson chi-square goodness of fit test can be used for testing of null hypothesis involving discrete as well as continuous Probability and Mathematical Statistics 653 distribution. Unlike Kolmogorov-Smirnov test, the Pearson chi-square test uses the density function the population X. Let X1 , X2 , ..., Xn be a random sample from a population X with probability density function f (x). We wish to test the null hypothesis Ho : X ∼ f (x) against Ha : X +∼ f (x). If the probability density function f (x) is continuous, then we divide up the abscissa of the probability density function f (x) and calculate the probability pi for each of the interval by using : xi f (x) dx, pi = xi−1 where {x0 , x1 , ..., xn } is a partition of the domain of the f (x). y f(x) p2 p3 p4 p1 0 x1 x2 x3 p5 x4 x5 xn Discretization of continuous density function Let Y1 , Y2 , ..., Ym denote the number of observations (from the random sample X1 , X2 , ..., Xn ) is 1st , 2nd , 3rd , ..., mth interval, respectively. Since the sample size is n, the number of observations expected to fall in the ith interval is equal to npi . Then m % (Yi − npi )2 Q= npi i=1 Goodness of Fit Tests 654 measures the closeness of observed Yi to expected number npi . The distribution of Q is chi-square with m − 1 degrees of freedom. The derivation of this fact is quite involved and beyond the scope of this introductory level book. Although the distribution of Q for m > 2 is hard to derive, yet for m = 2 it not very difficult. Thus we give a derivation to convince the reader that Q has χ2 distribution. Notice that Y1 ∼ BIN (n, p1 ). Hence for large n by the central limit theorem, we have Thus Since I Y1 − n p1 n p1 (1 − p1 ) ∼ N (0, 1). (Y1 − n p1 )2 ∼ χ2 (1). n p1 (1 − p1 ) (Y1 − n p1 )2 (Y1 − n p1 )2 (Y1 − n p1 )2 + = , n p1 (1 − p1 ) n p1 n (1 − p1 ) we have This implies that (Y1 − n p1 )2 (Y1 − n p1 )2 + ∼ χ2 (1) n p1 n (1 − p1 ) which is (Y1 − n p1 )2 (n − Y2 − n + n p2 )2 + ∼ χ2 (1) n p1 n p2 due to the facts that Y1 + Y2 = n and p1 + p2 = 1. Hence 2 % (Yi − n pi )2 i=1 n pi ∼ χ2 (1), that is, the chi-square statistic Q has approximate chi-square distribution. Now the simple null hypothesis H0 : p1 = p10 , p2 = p20 , · · · pm = pm0 is to be tested against the composite alternative Ha : at least one pi is not equal to pi0 for some i. Here p10 , p20 , ..., pm0 are fixed probability values. If the null hypothesis is true, then the statistic m % (Yi − n pi0 )2 Q= n pi0 i=1 Probability and Mathematical Statistics 655 has an approximate chi-square distribution with m − 1 degrees of freedom. If the significance level α of the hypothesis test is given, then ! " α = P Q ≥ χ21−α (m − 1) and the test is “Reject Ho if Q ≥ χ21−α (m − 1).” Here χ21−α (m − 1) denotes a real number such that the integral of the chi-square density function with m − 1 degrees of freedom from zero to this real number χ21−α (m − 1) is 1 − α. Now we give several examples to illustrate the chi-square goodness-of-fit test. Example 21.3. 
A die was rolled 30 times with the results shown below: Number of spots Frequency (xi ) 1 1 2 4 3 9 4 9 5 2 6 5 If a chi-square goodness of fit test is used to test the hypothesis that the die is fair at a significance level α = 0.05, then what is the value of the chi-square statistic and decision reached? Answer: In this problem, the null hypothesis is H o : p1 = p 2 = · · · = p 6 = 1 . 6 The alternative hypothesis is that not all pi ’s are equal to be based on 30 trials, so n = 30. The test statistic Q= 6 % (xi − n pi )2 n pi i=1 , where p1 = p2 = · · · = p6 = 16 . Thus n pi = (30) and Q= 6 % (xi − n pi )2 i=1 = 1 =5 6 n pi 6 % (xi − 5)2 i=1 5 1 [16 + 1 + 16 + 16 + 9] 5 58 = 11.6. = 5 = 1 6. The test will Goodness of Fit Tests 656 The tabulated χ2 value for χ20.95 (5) is given by χ20.95 (5) = 11.07. Since 11.6 = Q > χ20.95 (5) = 11.07 the null hypothesis Ho : p1 = p2 = · · · = p6 = 1 6 should be rejected. Example 21.4. It is hypothesized that an experiment results in outcomes 3 1 K, L, M and N with probabilities 51 , 10 , 10 and 25 , respectively. Forty independent repetitions of the experiment have results as follows: Outcome Frequency K 11 L 14 M 5 N 10 If a chi-square goodness of fit test is used to test the above hypothesis at the significance level α = 0.01, then what is the value of the chi-square statistic and the decision reached? Answer: Here the null hypothesis to be tested is Ho : p(K) = 1 3 1 2 , p(L) = , p(M ) = , p(N ) = . 5 10 10 5 The test will be based on n = 40 trials. The test statistic Q= 4 % (xk − npk )2 k=1 n pk (x2 − 12)2 (x3 − 4)2 (x4 − 16)2 (x1 − 8)2 + + + 8 12 4 16 (11 − 8)2 (14 − 12)2 (5 − 4)2 (10 − 16)2 = + + + 8 12 4 16 4 1 36 9 = + + + 8 12 4 16 95 = = 3.958. 24 = From chi-square table, we have χ20.99 (3) = 11.35. Thus 3.958 = Q < χ20.99 (3) = 11.35. Probability and Mathematical Statistics 657 Therefore we accept the null hypothesis. Example 21.5. Test at the 10% significance level the hypothesis that the following data 06.88 69.82 15.74 05.79 07.99 06.92 04.80 09.85 07.05 19.06 06.97 04.34 13.45 05.74 10.07 00.32 04.14 05.19 18.69 02.45 03.02 09.87 02.44 18.99 18.90 05.38 02.36 09.66 00.97 04.82 06.54 16.91 23.69 05.42 10.43 03.67 07.47 44.10 01.54 15.06 02.94 05.04 01.70 01.55 00.49 04.89 07.97 02.14 20.99 02.81 give the values of a random sample of size 50 from an exponential distribution with probability density function  x  θ1 e− θ if 0 < x < ∞ f (x; θ) =  0 elsewhere, where θ > 0. Answer: From the data x = 9.74 and s = 11.71. Notice that Ho : X ∼ EXP (θ). Hence we have to partition the domain of the experimental distribution into m parts. There is no rule to determine what should be the value of m. We assume m = 10 (an arbitrary choice for the sake of convenience). We partition the domain of the given probability density function into 10 mutually disjoint sets of equal probability. This partition can be found as follow. Note that x estimate θ. Thus θZ = x = 9.74. Now we compute the points x1 , x2 , ..., x10 which will be used to partition the domain of f (x) : x1 1 1 −x = e θ 10 xo θ ; x <x1 = − e− θ 0 Hence = 1 − e− # x1 θ . $ 10 x1 = θ ln 9 # $ 10 = 9.74 ln 9 = 1.026. Goodness of Fit Tests 658 Using the value of x1 , we can find the value of x2 . That is 1 = 10 : x2 x1 = e− Hence 1 −x e θ θ x1 θ # − e− − x2 = −θ ln e In general x1 θ x2 θ . 1 − 10 $ . $ # xk−1 1 xk = −θ ln e− θ − 10 for k = 1, 2, ..., 9, and x10 = ∞. 
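Since each interval is required to carry probability 1/10, the recursion telescopes to xk = -θ ln(1 - k/10), so the cut points are simply the deciles of the fitted exponential distribution. The Python sketch below (illustrative only; NumPy and SciPy are assumed, and the helper function chi_square_statistic is a hypothetical name introduced here) computes these cut points for the estimate 9.74 of θ and shows how the observed interval counts and the statistic Q can be obtained from a data array such as the 50 observations listed above.

import numpy as np
from scipy import stats

theta_hat = 9.74                       # theta estimated by the sample mean
k = np.arange(1, 10)

# Deciles of EXP(theta): x_k = -theta * ln(1 - k/10), k = 1, ..., 9.
cuts = -theta_hat * np.log(1 - k / 10.0)
print(cuts)                            # first cut point is about 1.026, as above

def chi_square_statistic(data, cut_points):
    # Observed counts in the ten intervals versus the expected count n/10 each.
    edges = np.concatenate(([0.0], cut_points, [np.inf]))
    observed, _ = np.histogram(np.asarray(data), bins=edges)
    expected = np.full(len(observed), len(data) / len(observed))
    return ((observed - expected) ** 2 / expected).sum()

# The observed Q is then compared with the 90th percentile of chi-square
# with 9 degrees of freedom.
print(stats.chi2.ppf(0.90, 9))         # about 14.68, the critical value used below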
Using these xk ’s we find the intervals Ak = [xk , xk+1 ) which are tabulates in the table below along with the number of data points in each each interval. Interval Ai [0, 1.026) [1.026, 2.173) [2.173, 3.474) [3.474, 4.975) [4.975, 6.751) [6.751, 8.925) [8.925, 11.727) [11.727, 15.676) [15.676, 22.437) [22.437, ∞) Total Frequency (oi ) 3 4 6 6 7 7 5 2 7 3 50 Expected value (ei ) 5 5 5 5 5 5 5 5 5 5 50 From this table, we compute the statistics Q= 10 % (oi − ei )2 i=1 ei = 6.4. and from the chi-square table, we obtain χ20.9 (9) = 14.68. Since 6.4 = Q < χ20.9 (9) = 14.68 Probability and Mathematical Statistics 659 we accept the null hypothesis that the sample was taken from a population with exponential distribution. 21.3. Review Exercises 1. The data on the heights of 4 infants are: 18.2, 21.4, 16.7 and 23.1. For a significance level α = 0.1, use Kolmogorov-Smirnov Test to test the hypothesis that the data came from some uniform population on the interval (15, 25). (Use d4 = 0.56 at α = 0.1.) 2. A four-sided die was rolled 40 times with the following results Number of spots Frequency 1 5 2 9 3 10 4 16 If a chi-square goodness of fit test is used to test the hypothesis that the die is fair at a significance level α = 0.05, then what is the value of the chi-square statistic? 3. A coin is tossed 500 times and k heads are observed. If the chi-squares distribution is used to test the hypothesis that the coin is unbiased, this hypothesis will be accepted at 5 percents level of significance if and only if k lies between what values? (Use χ20.05 (1) = 3.84.) 4. It is hypothesized that an experiment results in outcomes A, C, T and G 5 1 , 16 , 18 and 83 , respectively. Eighty independent repetiwith probabilities 16 tions of the experiment have results as follows: Outcome Frequency A 3 G 28 C 15 T 34 If a chi-square goodness of fit test is used to test the above hypothesis at the significance level α = 0.1, then what is the value of the chi-square statistic and the decision reached? 5. A die was rolled 50 times with the results shown below: Number of spots Frequency (xi ) 1 8 2 7 3 12 4 13 5 4 6 6 If a chi-square goodness of fit test is used to test the hypothesis that the die is fair at a significance level α = 0.1, then what is the value of the chi-square statistic and decision reached? Goodness of Fit Tests 660 6. Test at the 10% significance level the hypothesis that the following data 05.88 70.82 16.74 04.79 06.99 05.92 03.80 08.85 07.97 05.34 14.45 01.32 03.14 06.19 02.02 08.87 03.44 05.38 03.36 08.66 06.05 06.74 19.69 17.99 01.97 18.06 11.07 03.45 17.90 03.82 05.54 17.91 24.69 04.42 11.43 02.67 08.47 45.10 01.54 14.06 01.94 06.04 02.70 01.55 01.49 03.89 08.97 03.14 19.99 01.81 give the values of a random sample of size 50 from an exponential distribution with probability density function  x  θ1 e− θ if 0 < x < ∞ f (x; θ) =  0 elsewhere, where θ > 0. 7. Test at the 10% significance level the hypothesis that the following data 0.88 0.82 0.74 0.79 0.94 0.92 0.97 0.32 0.02 0.38 0.80 0.34 0.14 0.87 0.36 0.85 0.05 0.06 0.54 0.67 0.94 0.45 0.74 0.07 0.91 0.47 0.04 0.19 0.69 0.45 0.69 0.10 0.70 0.44 0.99 0.90 0.42 0.54 0.55 0.66 0.97 0.82 0.43 0.06 0.49 0.89 0.97 0.14 0.99 0.81 give the values of a random sample of size 50 from an exponential distribution with probability density function   (1 + θ) xθ if 0 ≤ x ≤ 1 f (x; θ) =  0 elsewhere, where θ > 0. 8. 
8. Test at the 10% significance level the hypothesis that the following data

  06.88  29.82  15.74  05.79  07.99  06.92  04.80  09.85  07.05  06.97
  04.34  13.45  05.74  00.32  04.14  05.19  18.69  03.02  09.87  02.44
  18.99  05.38  02.36  09.66  00.97  19.06  10.07  02.45  18.90  04.82
  06.54  16.91  23.69  05.42  10.43  03.67  07.47  24.10  01.54  15.06
  02.94  05.04  01.70  01.55  00.49  04.89  07.97  02.14  20.99  02.81

give the values of a random sample of size 50 from a distribution with probability density function

        f(x; θ) = 1/θ   if 0 ≤ x ≤ θ
        f(x; θ) = 0     elsewhere.

9. Suppose that in 60 rolls of a die the outcomes 1, 2, 3, 4, 5, and 6 occur with frequencies n_1, n_2, 14, 8, 10, and 8, respectively. What is the least value of Σ_{i=1}^{2} (n_i − 10)² for which the chi-square test rejects the hypothesis that the die is fair at the 1% level of significance? (Answer: Σ_{i=1}^{2} (n_i − 10)² ≥ 63.43.)

10. It is hypothesized that of all marathon runners 70% are adult men, 25% are adult women, and 5% are youths. To test this hypothesis, the following data from a recent marathon are used:

Adult Men      Adult Women      Youths      Total
   630             300             70        1000

A chi-square goodness-of-fit test is used. What is the value of the statistic? (Answer: 25)
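For the discrete goodness of fit tests in this chapter, the entire computation reduces to forming Q from the observed counts and the hypothesized class probabilities. The short helper below is a sketch in Python (not part of the text; the function name and layout are choices made here). It is checked against two values already stated above: Q = 95/24 ≈ 3.958 from Example 21.4 and Q = 25 from Review Exercise 10.

```python
# A small helper for the discrete chi-square goodness-of-fit statistic
# Q = sum over classes of (observed - n p)^2 / (n p).  Sketch only; not part of the text.
def chi_square_statistic(observed, probabilities):
    n = sum(observed)                         # total number of trials
    return sum((o - n * p) ** 2 / (n * p)
               for o, p in zip(observed, probabilities))

# Example 21.4: outcomes K, L, M, N with hypothesized probabilities 1/5, 3/10, 1/10, 2/5.
print(chi_square_statistic([11, 14, 5, 10], [1/5, 3/10, 1/10, 2/5]))   # 95/24 = 3.958...

# Review Exercise 10: adult men, adult women, youths with probabilities 0.70, 0.25, 0.05.
print(chi_square_statistic([630, 300, 70], [0.70, 0.25, 0.05]))        # 25 (up to rounding)
```

Review Exercises 2, 4 and 5 above can be checked in the same way by supplying the observed frequencies and the hypothesized probabilities.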
ANSWERS TO SELECTED REVIEW EXERCISES

CHAPTER 1

1. 7/1912.
2. 244.
3. 7488.
4. (a) 4/24, (b) 6/24 and (c) 4/24.
5. 0.95.
6. 4/7.
7. 2/3.
8. 7560.
10. 3/4.
11. 2.
12. 0.3238.
13. S has a countable number of elements.
14. S has an uncountable number of elements.
15. 25/648.
16. (n − 1)(n − 2)(1/2)^{n+1}.
2 17. (5!) . 18. 19. 20. 21. 22. 7 10 . 1 3. n+1 3n−1 . 6 11 . 1 5. 669 Answers to Selected Problems 670 CHAPTER 2 1. 2. 1 3. (6!)2 (21)6 . 3. 0.941. 4. 5. 6. 4 5. 6 11 . 255 256 . 7. 0.2929. 10 17 . 9. 30 31 . 7 . 10. 24 8. 11. 4 ( 10 )( 36 ) . 3 6 ( ) ( 6 )+( 10 ) ( 52 ) 12. (0.01) (0.9) (0.01) (0.9)+(0.99) (0.1) . 13. 1 5. 2 9. 14. 4 10 15. (a) 16. 1 4. 17. 3 8. 18. 5. 19. 5 42 . 20. 1 4. !2" ! 5 4 52 " + !3" ! 5 4 16 " and (b) 4 ) ( 35 )( 16 . 4 4 + ) ( ) ( 52 ) ( 35 ) ( 16 2 5 Probability and Mathematical Statistics 671 CHAPTER 3 1. 2. 3. 1 4. k+1 2k+1 . 1 √ 3 . 2 4. Mode of X = 0 and median of X = 0. ! " 5. θ ln 10 9 . 6. 2 ln 2. 7. 0.25. 8. f (2) = 0.5, f (3) = 0.2, f (π) = 0.3. 9. f (x) = 61 x3 e−x . 10. 3 4. 11. a = 500, mode = 0.2, and P (X ≥ 0.2) = 0.6766. 12. 0.5. 13. 0.5. 14. 1 − F (−y). 15. 1 4. 16. RX = {3, 4, 5, 6, 7, 8, 9}; f (3) = f (4) = 2 20 , f (5) = f (6) = f (7) = 17. RX = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}; f (2) = 5 36 , 1 36 , f (9) = f (3) = 4 36 , 2 36 , f (10) = f (4) = 3 36 , 18. RX = {0, 1, 2, 3, 4, 5}; f (0) = 59049 105 , f (1) = 32805 105 , 3 36 , 4 20 , f (5) = f (11) = 2 36 , f (2) = 7290 105 , 19. RX = {1, 2, 3, 4, 5, 6, 7}; f (8) = f (9) = 4 36 , f (6) = f (12) = f (3) = 5 36 , 2 20 . f (7) = 6 36 , f (8) = 1 36 . 810 105 , f (4) = 45 105 , f (5) = 1 105 . f (1) = 0.4, f (2) = 0.2666, f (3) = 0.1666, f (4) = 0.0952, f (5) = 0.0476, f (6) = 0.0190, f (7) = 0.0048. 20. c = 1 and P (X = even) = 41 . 21. c = 12 , P (1 ≤ X ≤ 2) = 43 . " ! 3 . 22. c = 32 and P X ≤ 12 = 16 Answers to Selected Problems 672 CHAPTER 4 1. −0.995. 1 33 , 2. (a) (b) 12 33 , 65 33 . (c) 3. (c) 0.25, (d) 0.75, (e) 0.75, (f) 0. 4. (a) 3.75, (b) 2.6875, (c) 10.5, (d) 10.75, (e) −71.5. 5. (a) 0.5, (b) π, (c) 6. 7. 8. 3 10 π. 17 √1 . 24 θ = 1 4 E(x2 ) . 8 3. 9. 280. 10. 9 20 . 11. 5.25. 12. a = 3 4h √ , π E(X) = 13. E(X) = 74 , E(Y ) 2 √ , h π 7 = 8. V ar(X) = 1 h2 ;3 2 − 2 . 14. − 38 15. −38. 16. M (t) = 1 + 2t + 6t2 + · · ·. 17. 1 4. n 18. β 19. 1 4 On−1 i=1 (k + i). ; 2t < 3e + e3t . 20. 120. 21. E(X) = 3, V ar(X) = 2. 22. 11. 23. c = E(X). 24. F (c) = 0.5. 25. E(X) = 0, V ar(X) = 2. 26. 1 625 . 27. 38. 28. a = 5 and b = −34 or a = −5 and b = 36. 29. −0.25. 30. 10. 1 31. − 1−p p ln p. 4 π < . Probability and Mathematical Statistics CHAPTER 5 1. 5 16 . 2. 5 16 . 3. 25 72 . 4. 4375 279936 . 5. 3 8. 6. 11 16 . 7. 0.008304. 8. 3 8. 9. 1 4. 10. 0.671. 11. 1 16 . 12. 0.0000399994. 13. n2 −3n+2 2n+1 . 14. 0.2668. ( 6 ) (4) 15. 3−k10 k , (3) 16. 0.4019. 0 ≤ k ≤ 3. 17. 1 − e12 . ! " ! 1 "3 ! 5 "x−3 18. x−1 . 2 6 6 19. 5 16 . 20. 0.22345. 21. 1.43. 22. 24. 23. 26.25. 24. 2. 25. 0.3005. 26. 4 e4 −1 . 27. 0.9130. 28. 0.1239. 673 Answers to Selected Problems 674 CHAPTER 6 1. f (x) = e−x 0 < x < ∞. 2. Y ∼ U N IF (0, 1). 1 ln w−µ 2 1 e− 2 ( σ ) . 3. f (w) = w√2πσ 2 4. 0.2313. 5. 3 ln 4. 6. 20.1σ. 7. 34 . 8. 2.0. 9. 53.04. 10. 44.5314. 11. 75. 12. 0.4649. 13. n! θn . 14. 0.8664. 1 2 15. e 2 k . 16. a1 . 17. 64.3441.& 18. g(y) = 19. 0.5. 20. 0.7745. 21. 0.4. 22. 4 y3 if 0 < y < 0 otherwise. 2 3 2 − 13 − yθ e 3θ y y4 √4 y e− 2 . 2π H √ 2 . 23. 24. ln(X) ∼ \(µ, σ 2 ). 2 25. eµ−σ . 26. eµ . 27. 0.3669. 29. Y ∼ GBET A(α, β, a, b). √ √ 32. (i) 12 π, (ii) 21 , (iii) 14 π, (iv) 12 . 1 1 33. (i) 180 , (ii) (100)13 5!13!7! , (iii) 360 . , -2 α 35. 1 − β . 36. E(X n ) = θn Γ(n+α) Γ(α) . Probability and Mathematical Statistics CHAPTER 7 1. f1 (x) = 2x+3 21 , 2. f (x, y) = 3. 4. 5. 1 18 . 1 2e4 . 1 3. & and f2 (y) = 1 36 2 36 3y+6 21 . if 1 < x < y = 2x < 12 if 1 < x < y < 2x < 12 otherwise. 
0 J 2(1 − x) if 0 < x < 1 0 otherwise. 2 . 7. (e −1)(e−1) 5 e 6. f1 (x) = 8. 0.2922. 9. 5 7. 10. f1 (x) = 11. f2 (y) = 9 J 12. f (y/x) = 13. 6 7. 14. f (y/x) = 15. 4 9. 5 48 x(8 0 2y 0 9 1+ 0 9 1 2x 0 − x3 ) if 0 < y < 1 otherwise. √ 1 if (x − 1)2 + (y − 1)2 ≤ 1 2 1−(x−1) otherwise. if 0 < y < 2x < 1 otherwise. 16. g(w) = 2e−w − 2e−2w . - 2 , 3 17. g(w) = 1 − wθ3 6w θ3 . 18. 19. 20. 11 36 . 7 12 . 5 6. 21. No. 22. Yes. 23. 24. 25. 7 32 . 1 4. 1 2. −x 26. x e . if 0 < x < 2 otherwise. 675 Answers to Selected Problems CHAPTER 8 1. 13. 2. Cov(X, Y ) = 0. Since 0 = f (0, 0) += f1 (0)f2 (0) = 41 , X and Y are not independent. 3. √1 . 8 4. 1 (1−4t)(1−6t) . 5. X + Y ∼ BIN (n + m, p). " ! 6. 12 X 2 − Y 2 ∼ EXP (1). 7. M (s, t) = es −1 s + et −1 t . 15 9. − 16 . 10. Cov(X, Y ) = 0. No. 11. a = 6 8 and b = 98 . 45 12. Cov = − 112 . 13. Corr(X, Y ) = − 15 . 14. 136. √ 15. 12 1 + ρ . 16. (1 − p + pet )(1 − p + pe−t ). 17. σ2 n 18. 2. 19. 4 3. 20. 1. 21. 1 2. [1 + (n − 1)ρ]. 676 Probability and Mathematical Statistics CHAPTER 9 1. 6. 2. 1 2 (1 3. 1 2 y2 . 4. 1 2 + x. + x2 ). 5. 2x. 6. µX = − 22 3 and µY = 7. 3 1 2+3y−28y 3 1+2y−8y 2 . 8. 3 2 x. 9. 1 2 y. 10. 4 3 x. 11. 203. 12. 15 − π1 . 13. 1 12 15. 5 192 . 16. 1 12 . (1 − x)2 . "2 ! 1 1 − x2 . 14. 12 17. 180. 19. x 6 + 20. x 2 + 1. 5 12 . 112 9 . 677 Answers to Selected Problems 678 CHAPTER 10 &1 1 √ for 0 ≤ y ≤ 1 2 + 4 y 1. g(y) = 0 otherwise.  √ y 3  √ for 0 ≤ y ≤ 4m 16 m m 2. g(y) =  0 otherwise. & 2y for 0 ≤ y ≤ 1 3. g(y) = 0 otherwise.  1 (z   16 + 4) for −4 ≤ z ≤ 0   1 4. g(z) = 16 (4 − z) for 0 ≤ z ≤ 4     0 otherwise. & 1 −x for 0 < x < z < 2 + x < ∞ 2 e 5. g(z, x) = 0 otherwise. √ & 4 for 0 < y < 2 y3 6. g(y) = 0 otherwise.  z3 z2 z − 250 + 25 for 0 ≤ z ≤ 10  15000    z2 z3 8 7. g(z) = − 2z − 250 − 15000 for 10 ≤ z ≤ 20 15 25     0 otherwise.  2 " ! 2a (u−2a)  4a3 ln u−a + 2 for 2a ≤ u < ∞ u a u (u−a) 8. g(u) =  0 otherwise. 9. h(y) = 3z 2 −2z+1 , 216 10. g(z) =    11. g(u, v) = 12. g1 (u) = 3 4h √ m π 0 & & z = 1, 2, 3, 4, 5, 6. = 2 2z − 2hm z for 0 ≤ z < ∞ m e otherwise. 3u − 350 0 + 9v 350 for 10 ≤ 3u + v ≤ 20, u ≥ 0, v ≥ 0 otherwise. 2u (1+u)3 if 0 ≤ u < ∞ 0 otherwise. Probability and Mathematical Statistics 13. g(u, v) = 14. g(u, v) = 15. 16. 17. 18. 19. 20.  3 2 2 3  5 [9v −5u v+3uv +u ] 32768  0 & u+v 32 679 for 0 < 2v + 2u < 3v − u < 16 otherwise. √ for 0 < u + v < 2 5v − 3u < 8 0 otherwise.  2 + 4u + 2u2 if −1 ≤ u ≤ 0     √ g1 (u) = 2 1 − 4u if 0 ≤ u ≤ 14     0 otherwise. 4 if 0 ≤ u ≤ 1  3 u    g1 (u) = 4 u−5 if 1 ≤ u < ∞ 3     0 otherwise.  1  4u 3 − 4u if 0 ≤ u ≤ 1 g1 (u) =  0 otherwise. & −3 2u if 1 ≤ u < ∞ g1 (u) = 0 otherwise. w if 0 ≤ w ≤ 2  6        26 if 2 ≤ w ≤ 3 f (w) =  5−w   if 3 ≤ w ≤ 5  6     0 otherwise. BIN (2n, p) 21. GAM (θ, 2) 22. CAU (0) 23. N (2µ, 2σ 2 )  & 1  14 (2 − |α|) if |α| ≤ 2 − 2 ln(|β|) if |β| ≤ 1 24. f1 (α) = f2 (β) =  0 otherwise, 0 otherwise. Answers to Selected Problems CHAPTER 11 2. 7 10 . 3. 960 75 . 6. 0.7627. 680 Probability and Mathematical Statistics CHAPTER 12 681 Answers to Selected Problems 682 CHAPTER 13 3. 0.115. 4. 1.0. 5. 7 16 . 6. 0.352. 7. 6 5. 8. 100.64. 9. 1+ln(2) . 2 10. [1 − F (x6 )]5 . 11. θ + 15 . 12. 2 e−w [1 − e−w ]. , 3 2 13. 6 wθ3 1 − wθ3 . 14. N (0, 1). 15. 25. 1 16. X has a degenerate distribution with MGF M (t) = e 2 t . 17. P OI(1995λ). ! "n 18. 21 (n + 1). 19. 88 119 35. 20. f (x) = 60 θ ! x 1 − e− θ "3 e− 3x θ for 0 < x < ∞. 21. X(n+1) ∼ Beta(n + 1, n + 1). 
Probability and Mathematical Statistics 683 CHAPTER 14 1. N (0, 32). 2. χ2 (3); the MGF of X12 − X22 is M (t) = 3. t(3). 4. f (x1 , x2 , x3 ) = 1 − θ3 e (x1 +x2 +x3 ) θ . 5. σ 2 6. t(2). 7. M (t) = √ 1 . (1−2t)(1−4t)(1−6t)(1−8t) 8. 0.625. 9. σ4 n2 2(n − 1). 10. 0. 11. 27. 12. χ2 (2n). 13. t(n + p). 14. χ2 (n). 15. (1, 2). 16. 0.84. 17. 2σ 2 n2 . 18. 11.07. 19. χ2 (2n − 2). 20. 2.25. 21. 6.37. √ 1 . 1−4t2 Answers to Selected Problems CHAPTER 15 \ ] % n ] 1. ^ n3 Xi2 . i=1 2. 1 . X̄−1 3. 2 . X̄ 4. − n % n . ln Xi i=1 5. n % n ln Xi − 1. i=1 6. 2 . X̄ 7. 4.2 8. 19 26 . 9. 15 4 . 10. 2. 11. α̂ = 3.534 and β̂ = 3.409. 12. 1. 13. 1 3 14. = 1− max{x1 , x2 , ..., xn }. 1 max{x1 ,x2 ,...,xn } . 15. 0.6207. 18. 0.75. 19. −1 + 20. X̄ . 1+X̄ 21. X̄ 4. 5 ln(2) . 22. 8. 23. n % i=1 n |Xi − µ| . 684 Probability and Mathematical Statistics 24. 25. 1 N. √ X̄. 26. λ̂ = nX̄ (n−1)S 2 27. 10 n p (1−p) . 28. 2n θ2 . 29. 30. 685 # n σ2 0 # nλ n 2σ 4 0 µ3 n 2λ2 0 31. α Z= $ 0 X , Z β and α̂ = nX̄ 2 (n−1)S 2 . . $ . βZ = 1 X ; 1 1n n i=1 < Xi2 − X . 1n 32. θZ is obtained by solving numerically the equation i=1 33. θZ is the median of the sample. 34. n λ. 35. n (1−p) p2 . 2(xi −θ) 1+(xi −θ)2 = 0. Answers to Selected Problems CHAPTER 16 σ22 −cov(T1 ,T2 ) . σ12 +σ22 −2cov(T1 ,T2 1. b = 2. θZ = |X| , E( |X| ) = θ, unbiased. 4. n = 20. 5. k = 12 . 6. a = 7. n % 25 61 , b= 36 61 , Xi3 . Z c = 12.47. i=1 8. n % Xi2 , no. i=1 9. k = 4 π. 10. k = 2. 11. k = 2. 13. ln n P (1 + Xi ). i=1 14. n % Xi2 . i=1 15. X(1) , and sufficient. 16. X(1) is biased and X − 1 is unbiased. X(1) is efficient then X − 1. 17. n % ln Xi . i=1 18. n % Xi . i=1 19. n % ln Xi . i=1 22. Yes. 23. Yes. 686 Probability and Mathematical Statistics 24. Yes. 25. Yes. 26. θZ = 3 X. 27. θZ = 50 30 X. 687 Answers to Selected Problems 688 CHAPTER 17 J n e−n q 0 A The confidence interval is X(1) − 7. The pdf of Q is g(q) = 8. The pdf of Q is g(q) = 9 1 2 1 n 1 e− 2 q 0 A The confidence interval is X(1) − 1 n 9 n q n−1 0 A The confidence interval is X(1) − 9. The pdf of Q is g(q) = 1 n if 0 < q < ∞ otherwise. ! " ln α2 , X(1) − if 0 < q < ∞ otherwise. ! " ln α2 , X(1) − if 0 < q < 1 otherwise. ! " ln α2 , X(1) − 10. The pdf g(q) of Q is given by g(q) = The confidence interval is 2 ! 2 " n1 X(n) , α 11. The pdf of Q is given by g(q) = A B 12. X(1) − z α2 √1n , X(1) + z α2 √1n . 9 9 n q n−1 0 , - n1 2 2−α 1 n ln , 2 2−α -B . 1 n ln , 2 2−α -B . 1 n ln , 2 2−α -B . if 0 ≤ q ≤ 1 otherwise. 3 X(n) . n (n − 1) q n−2 (1 − q) 0 B A Z θ√ +1 Z θ +1 α √ , θ + z , where θZ = −1 + 1n n ln x . 13. θZ − z α2 Z n n 2 i=1 14. A 2 X − z α2 = 2 2 2, X nX A 15. X − 4 − z α2 X−4 √ , n + z α2 = 2 2 nX X − 4 + z α2 B . X−4 √ n B . A B X X(n) √ √ α 16. X(n) − z α2 (n+1)(n) , X + z . (n) 2 (n+1) n+2 n+2 17. A 1 4 X − z α2 8 X √ , n 1 4 X + z α2 8 X √ n B . i if 0 ≤ q ≤ 1 otherwise. Probability and Mathematical Statistics 689 CHAPTER 18 1. α = 0.03125 and β = 0.763. 2. Do not reject Ho . 3. α = 0.0511 and β(λ) = 1 − 4. α = 0.08 and β = 0.46. 7 % (8λ)x e−8λ x=0 x! 5. α = 0.19. 6. α = 0.0109. 7. α = 0.0668 and β = 0.0062. 8. C = {(x1 , x2 ) | x2 ≥ 3.9395}. 9. C = {(x1 , ..., x10 ) | x ≥ 0.3}. 10. C = {x ∈ [0, 1] | x ≥ 0.829}. 11. C = {(x1 , x2 ) | x1 + x2 ≥ 5}. 12. C = {(x1 , ..., x8 ) | x − x ln x ≤ a}. 13. C = {(x1 , ..., xn ) | 35 ln x − x ≤ a}. 9 ? -5x−5 , x 5 14. C = (x1 , ..., x5 ) | 2x−2 x ≤a . 15. C = {(x1 , x2 , x3 ) | |x − 3| ≥ 1.96}. J S 1 16. C = (x1 , x2 , x3 ) | x e− 3 x ≤ a . S J ! e "3x ≤a . 17. C = (x1 , x2 , ..., xn ) | 10 x 18. 1 3. √ Q R 19. 
C = (x1 , x2 , x3 ) | x(3) ≤ 3 117 . 20. C = {(x1 , x2 , x3 ) | x ≥ 12.04}. 21. α = 1 16 and β = 22. α = 0.05. 255 256 . , λ += 0.5.