Biometry Lecture 2

Chapter 2 - Descriptive Statistics (Exploratory Data Analysis)

All the data sets used in this chapter will be regarded as samples drawn from some population. One of the main purposes of studying a sample is to get information about the population. The main focus here is on summarizing and describing some features of the data.

2.1 Graphs and diagrams

Line graph

A line graph is a graph used to present some characteristic recorded over time.

Example

[Figure: line graph of a person's weight against year, 1991 to 1995]

The graph above shows how a person's weight varied from the beginning of 1991 to the beginning of 1995.

Bar charts

A bar chart or bar graph is a chart consisting of rectangular bars with heights proportional to the values that they represent. Bar charts are used for comparing two or more values that are taken over time or under different conditions.

Simple Bar Chart

In a simple bar chart the figures used to make comparisons are represented by bars. These are drawn either vertically or horizontally. Only totals are represented. The height or length of the bar is drawn in proportion to the size of the figure being presented. An example is shown below.

[Figure: simple bar chart, Total UK Resident Population 1959-99]

Component Bar Chart

When you want to draw a bar chart to illustrate your data, it is often the case that the totals of the figures can be broken down into parts or components.

Year    Total         Male          Female
1959    51 956 000    25 043 000    26 913 000
1969    55 461 000    26 908 000    28 553 000
1979    56 240 000    27 373 000    28 867 000
1989    57 365 000    27 988 000    29 377 000
1999    59 501 000    29 299 000    30 202 000

You start by drawing a simple bar chart with the total figures as shown above. The columns or bars (depending on whether you draw the chart vertically or horizontally) are then divided into the component parts.
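The division of each total bar into its components can be sketched in a few lines of Python. This is only an illustration using the population figures from the table above. The function name `component_bar` and the scale of one character per million people are assumptions made for the sketch, not part of the original notes.

```python
# UK resident population figures from the table above: (total, male, female).
population = {
    1959: (51_956_000, 25_043_000, 26_913_000),
    1969: (55_461_000, 26_908_000, 28_553_000),
    1979: (56_240_000, 27_373_000, 28_867_000),
    1989: (57_365_000, 27_988_000, 29_377_000),
    1999: (59_501_000, 29_299_000, 30_202_000),
}


def component_bar(total, male, female, scale=1_000_000):
    """Return one text bar: 'M' marks the male part, 'F' the female part.

    `scale` (people per character) is an arbitrary choice for this sketch.
    """
    if male + female != total:
        raise ValueError("components must add up to the total")
    total_len = round(total / scale)   # length of the whole bar
    male_len = round(male / scale)     # length of the male component
    return "M" * male_len + "F" * (total_len - male_len)


# One bar per year; the whole bar is the total, split into its two components.
bars = {year: component_bar(*row) for year, row in population.items()}
for year, bar in bars.items():
    print(year, bar)
```

Each printed bar has a length proportional to the total, so the bars can be compared with each other as in a simple bar chart, while the split inside each bar shows the components.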
[Figure: component bar chart, Total UK Resident Population 1959-99, each bar divided into Male and Female]

Multiple (compound) Bar Chart

You may find that your data allows you to make comparisons of the component figures themselves. If so, you will want to create a multiple (compound) bar chart. This type of chart enables you to trace the trends of each individual component, as well as making comparisons between the components.

[Figure: compound bar chart, Total UK Resident Population 1959-99, with separate bars for Male, Female and Total in each year]

Pareto chart

A Pareto chart is a special type of bar chart where the values being plotted are arranged in descending order. The graph is accompanied by a line graph which shows the cumulative totals of each category, left to right.

The graph below is a Pareto chart that shows the percentage of late arrivals at a place of work organized according to cause of late arrival (from the most common to the least common cause). The line shows the accumulated percentages.

[Figure: Pareto chart of late arrivals by cause, with cumulative percentage line]

Dot Plot

This is a diagram where a line is drawn according to a scale that is appropriate for the data set and the values (in the data set) are plotted at their positions on the scale. If the same value occurs more than once, the multiple values are plotted on top of each other at the same point on the scale. For small data sets (few values) this plot can provide useful information regarding data patterns.

Example

Imagine that a medium-sized retailer, thinking of expanding into a new region, identifies a business that it considers as being ready for takeover. It finds the following annual profit figures (in tens of thousands of pounds) for the target retailer's last ten years' trading:

[profit figures illegible in the source]
To draw a dot plot we can begin by drawing a horizontal line across the page to represent the range of values of all the numbers; then we can mark an 'x' above the appropriate value along the line as follows:

[Figure: dot plot of the ten annual profit figures]

Pie Chart

A pie chart is a diagram that shows the subdivision of some entity/total into subgroups. The diagram is in the form of a circle which is divided into slices, with each slice having an area according to the proportion that it makes up of the total.

Example

The pie chart below shows the ingredients used to make a sausage and mushroom pizza.

[Figure: pie chart with slices for Sausage, Cheese, Crust, Tomato Sauce and Mushrooms]

The number of degrees needed for each slice is found by calculating the appropriate percentage of 360, e.g. for tomato sauce the degrees are 0.125 x 360 = 45 and for cheese 0.25 x 360 = 90, etc. The complete calculations are shown in the table below.

Ingredient      Percentage    Degrees
Sausage          7.5          0.075 x 360 = 27
Cheese          25            0.250 x 360 = 90
Crust           50            0.500 x 360 = 180
Tomato sauce    12.5          0.125 x 360 = 45
Mushrooms        5            0.050 x 360 = 18

Stem-and-leaf plot

A stem-and-leaf plot is a device used for summarizing quantitative data in a table/graphical format to assist in visualizing the shape of a data set.

Examples

1) To construct a stem-and-leaf plot, the values must first be sorted in ascending order. Here is the sorted set of data values that will be used in the example:

44 46 47 49 63 64 66 68 68 72 72 75 76 81 84 88 106

Next, it must be determined what the stems will represent and what the leaves will represent. Typically, the leaf contains the last digit of the number and the stem contains all of the other digits. In the case of very large or very small numbers, the data values may be rounded to a particular place value (such as the hundredths place) that will be used for the leaves. The remaining digits to the left of the rounded place value are used as the stems.
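The splitting rule just described (last digit as leaf, remaining digits as stem) can be sketched in Python for whole-number data. The function name `stem_and_leaf` is a hypothetical helper chosen for this illustration; it assumes non-negative integers and a leaf unit of 1.

```python
from collections import defaultdict

# Sorted data values listed above.
values = [44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106]


def stem_and_leaf(data):
    """Map each stem (tens and higher) to its sorted leaves (ones digit)."""
    leaves = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)  # e.g. 106 -> stem 10, leaf 6
        leaves[stem].append(leaf)
    # List every stem between the smallest and largest, even if it has no leaves.
    return {stem: leaves.get(stem, []) for stem in range(min(leaves), max(leaves) + 1)}


plot = stem_and_leaf(values)
for stem, row in plot.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in row)}")
```

Stems with no data (here 5 and 9) are kept with an empty row, so that gaps in the data remain visible in the plot.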
In this example, the leaf represents the "ones" place and the stem the rest of the number ("tens" place or higher).

The stem-and-leaf plot is drawn with two columns separated by a vertical line. The stems are listed to the left of the vertical line. It is important that each stem is listed only once and that no numbers are skipped, even if it means that some stems have no leaves. The leaves are listed in increasing order in a row to the right of each stem.

 4 | 4 6 7 9
 5 |
 6 | 3 4 6 8 8
 7 | 2 2 5 6
 8 | 1 4 8
 9 |
10 | 6

leaf unit: 1.0    stem unit: 10.0

Conclusion: 12 of the 17 values are greater than or equal to 63 and less than or equal to 88.

2) Two data sets can be compared by drawing a back-to-back stem-and-leaf plot. As an example, suppose the fat contents (in grams) of English breakfasts and cold meat sandwiches are to be compared. The fat contents are shown below.

Sandwiches: 6, 7, 12, 13, 17, 18, 20, 21, 21, 24, 26, 28, 30, 34
Breakfasts: 12, 14, 15, 16, 18, 23, 25, 25, 36, 36, 38, 41, 44, 45

A back-to-back stem-and-leaf plot is shown below.

Breakfasts | Stem | Sandwiches
           |  0   | 6 7
 2 4 5 6 8 |  1   | 2 3 7 8
     3 5 5 |  2   | 0 1 1 4 6 8
     6 6 8 |  3   | 0 4
     1 4 5 |  4   |

key: 2|4 = 24 for sandwiches and 2|4 = 42 for breakfasts
leaf unit: 1.0    stem unit: 10.0

Conclusion: The fat content in English breakfasts appears to be higher than that in sandwiches.

2.2 Sigma and subscript notation

The symbol $\Sigma$ (sigma, capital S in the Greek alphabet) is used to denote "the sum of" values. Suppose the symbol x is used to denote some variable of interest in a study. In order to distinguish between values of this variable, subscripts are used:

$x_1$: first value in the data set, which has subscript 1
$x_2$: second value in the data set, which has subscript 2
...
$x_n$: nth value in the data set, which has subscript n.

The sum of these values is written in shorthand notation as

$\sum_{i=1}^{n} x_i = x_1 + x_2 + \dots + x_n$

Example

Consider the six pairs of values $(x_i, y_i)$ = (11, 8), (13, 5), (7, 7), (12, 6), (10, 9), (8, 11). For this data

$\sum x_i y_i = (11 \times 8) + (13 \times 5) + (7 \times 7) + (12 \times 6) + (10 \times 9) + (8 \times 11) = 88 + 65 + 49 + 72 + 90 + 88 = 452$.
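The sum of products can be checked with a short Python sketch. The lists `x` and `y` hold the paired values implied by the products in the worked sum above; the variable names are chosen for this illustration only. The product of the two separate sums is printed as well, for comparison.

```python
# Paired values implied by the products in the worked sum above.
x = [11, 13, 7, 12, 10, 8]
y = [8, 5, 7, 6, 9, 11]

sum_x = sum(x)                                  # sigma x_i
sum_y = sum(y)                                  # sigma y_i
sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # sigma x_i * y_i, term by term

print("sum of x:", sum_x)
print("sum of y:", sum_y)
print("sum of products:", sum_xy)
print("product of sums:", sum_x * sum_y)
```

The sum of products multiplies each pair first and then adds, whereas the product of sums adds each list first and then multiplies, so the two quantities are generally different.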
Note that $\sum x_i y_i \neq (\sum x_i)(\sum y_i)$, e.g. for the abovementioned data $\sum x_i = 61$ and $\sum y_i = 46$, so that $(\sum x_i)(\sum y_i) = 61 \times 46 = 2806 \neq \sum x_i y_i = 452$.

The summation notation is used extensively in specifying calculations in statistical formulae.

2.3 Frequency distributions and related graphs

Frequency distribution

A frequency distribution is a table in which data are grouped into classes and the number of values (frequencies) which fall in each class is recorded.

The main purpose of constructing a frequency distribution is to get insight into the distribution pattern of the frequencies over the classes. Hence, the name frequency distribution is used to refer to this pattern.

Example 1

In a survey of 40 families in a village, the number of children per family was recorded and the following data obtained.

[The 40 recorded values and the frequency counts are illegible in the source. The frequency table lists the number of children (0, 1, 2, 3, 4, 5, 6) against the number of families, with a total row.]

Note: The sum of the frequencies = sample size, i.e. $\sum f = n$.

Example 2

Consider the following data of low temperatures (in degrees Fahrenheit to the nearest degree) for 50 days. The highest temperature is 64 and the lowest temperature is 39.

[Data set: Low Temperatures for 50 Days; the individual values are illegible in the source]

Constructing a frequency distribution

The classes into which the above values can be sorted can be found by following the steps shown below.

1. Find the maximum (= 64) and minimum (= 39) values and calculate the range = maximum - minimum = 64 - 39 = 25.
2. Decide on the number of classes. Use Sturges' rule, which states that the number of classes k is the rounded-up value of $1 + 1.44 \ln n$. Here $1 + 1.44 \times \ln(50) = 6.63$, i.e. k = 7.
3. Calculate the class width such that no. of classes x class width > range, i.e. 7 x class width > 25. This suggests a class width of 4.
4. Find the lower value that defines the first class. This is usually a value just below the minimum value in the data set. Since the minimum value for this data set is 39, the lowest class can have a minimum value one below this, i.e. 38.
5. Find the lower values that define each of the classes that follow by successively adding the class width to the lower value of the previous class: lower value of the second class = 38 + 4 = 42, lower value of the third class = 42 + 4 = 46, etc.

The frequency distribution below shows the data values sorted into the classes 38-41, 42-45, 46-49, 50-53, 54-57, 58-61, 62-65.

The table below shows the classes, their frequencies, relative frequencies and cumulative frequencies for the temperature data set.

class limits    f     relative frequency    cumulative frequency
38-41            4    0.08                   4
42-45           10    0.20                  14
46-49            8    0.16                  22
50-53           15    0.30                  37
54-57            9    0.18                  46
58-61            3    0.06                  49
62-65            1    0.02                  50
Total           50

The values in the above example that define the classes of the frequency distribution are called class limits. Classes of the type 38-41, 42-45, ..., in which both the upper and lower limits are included, are called "inclusive classes". For example, the class 38-41 includes all the values from 38 to 41.

In spite of the great importance of classification in statistical analysis, no hard and fast rules can be laid down for it. The following points must be kept in mind for classification:

1) The classes should be clearly defined and should not lead to any ambiguity.
2) Each of the given values in the data set should be included in one of the classes.
3) The classes should be of equal width, otherwise the different class frequencies will not be comparable. If the class widths are unequal, then comparable figures can be obtained by dividing the frequencies by the corresponding widths of the class intervals. The ratios thus obtained are called "frequency densities".
4) The number of classes should be neither too large nor too small.

Continuous Frequency Distribution

If we deal with a continuous variable, it is not possible to arrange the data in class intervals of the above type. Let us consider the distribution of age in years.
If the class intervals are 15-19, 20-24, ..., then persons with ages between 19 and 20 years are not taken into consideration. In such a case we form the class intervals as 0-5, 5-10, 10-15, 15-20, ... Here all the persons with any fraction of age are included in one group or the other. In the above classes, the upper limit of each class is excluded from that class and included in the immediately following class; such classes are known as "exclusive classes".

The upper and lower class limits of the new exclusive type classes are known as class boundaries. If d is the gap between the upper limit of any class and the lower limit of the succeeding class, the class boundaries for any class are then given by

Upper class boundary = upper class limit + (d/2)
Lower class boundary = lower class limit - (d/2)

Example 2 continued

The frequency distribution below includes the class boundaries.

class limits    class boundaries    f     relative frequency    cumulative frequency
38-41           37.5-41.5            4    0.08                   4
42-45           41.5-45.5           10    0.20                  14
46-49           45.5-49.5            8    0.16                  22
50-53           49.5-53.5           15    0.30                  37
54-57           53.5-57.5            9    0.18                  46
58-61           57.5-61.5            3    0.06                  49
62-65           61.5-65.5            1    0.02                  50
Total                               50

Example 3

The monthly expenditures (thousands of rands) of 60 households are shown below. The values of this data set were accurately recorded (not rounded). Entries that are illegible in the source are marked.

[illegible]    7.8080     6.85461   10.31167    8.48253    6.17060
5.09063        9.16412    5.67094    7.7904     7.97420    5.41634
9.37265       10.14436    7.15675   10.31107    8.86571   10.1734
5.90076        6.5798     7.06965   [?].82430   7.47467    9.50018
4.90014        6.50273    6.12516    5.51933    7.49641   10.95599
5.87188        9.96036    9.89773   10.1883     5.12028    9.60018
8.56534        9.27719    8.7107     7.03318   10.78344    9.08941
6.85749        7.7887     9.68150    6.75009    8.0521     8.19898
[?].17312      7.51527   11.31383    8.5765     7.48021    8.30881
7.37565        7.28159    8.81773    5.53182    5.98515    7.71778

The frequency distribution shown below is a summary of this data set.
classes       f
4.5-5.5        5
5.5-6.5        7
6.5-7.5       13
7.5-8.5       13
8.5-9.5        9
9.5-10.5      10
10.5-11.5      3
Total         60

For this distribution, lower (upper) class limit = lower (upper) class boundary for each of the classes. A value that falls on the boundary of 2 classes is allocated to the higher of the two classes, e.g. 5.50000 is allocated to the class 5.5-6.5 (not 4.5-5.5).

Class midpoints

The midpoint of a class ($x_{mid}$) can be calculated from

$x_{mid} = \frac{\text{lower class limit (boundary)} + \text{upper class limit (boundary)}}{2}$

Examples

1) For the frequency distribution in example 2 (temperature data), the class midpoints are given below.

class limits    class boundaries    midpoints
38-41           37.5-41.5           39.5
42-45           41.5-45.5           43.5
46-49           45.5-49.5           47.5
50-53           49.5-53.5           51.5
54-57           53.5-57.5           55.5
58-61           57.5-61.5           59.5
62-65           61.5-65.5           63.5

2) For the frequency distribution in example 3, the class midpoints are given below.

classes       midpoints
4.5-5.5        5
5.5-6.5        6
6.5-7.5        7
7.5-8.5        8
8.5-9.5        9
9.5-10.5      10
10.5-11.5     11

Cumulative frequencies

The "less than" cumulative frequency of a class is the number of values in the sample that are less than or equal to the upper class boundary of the class.

Examples

1) See the frequency distribution in example 2 (temperature data).

2) For the frequency distribution in example 3 (expenditure data) the cumulative frequencies are calculated as shown below.

classes       upper class boundary    f     cumulative frequency    calculation
4.5-5.5        5.5                     5     5                      5
5.5-6.5        6.5                     7    12                      5+7
6.5-7.5        7.5                    13    25                      5+7+13
7.5-8.5        8.5                    13    38                      5+7+13+13
8.5-9.5        9.5                     9    47                      5+7+13+13+9
9.5-10.5      10.5                    10    57                      5+7+13+13+9+10
10.5-11.5     11.5                     3    60                      5+7+13+13+9+10+3
Total                                 60

Relative and percentage frequencies

- Relative frequency = frequency/sample size, i.e. $Rf = f/n$.
- The percentage frequency of a class is calculated from relative frequency x 100.

Examples

1) See the frequency distribution in example 2 (temperature data).

2) For the frequency distribution in example 3 (expenditure data) the relative and percentage frequencies are calculated as shown below.
classes       f     relative frequency    percentage frequency
4.5-5.5        5    0.083                   8.3
5.5-6.5        7    0.117                  11.7
6.5-7.5       13    0.217                  21.7
7.5-8.5       13    0.217                  21.7
8.5-9.5        9    0.150                  15.0
9.5-10.5      10    0.167                  16.7
10.5-11.5      3    0.050                   5.0
Total         60    1                     100

Histogram

A histogram is the graphical representation of a frequency distribution. The frequency for each class is represented by a rectangular bar with the class boundaries as base and the frequency as height.

Example

A histogram of the frequency distribution in example 2 (temperature data) is shown below.

[Figure: histogram of frequency against temperature, with bars on the class boundaries 37.5-41.5, 41.5-45.5, 45.5-49.5, 49.5-53.5, 53.5-57.5, 57.5-61.5, 61.5-65.5]

Frequency polygon

This is also a graphical representation of a frequency distribution. For each class the class midpoint is plotted against the frequency and the plotted points are joined by means of straight lines.

Example

For the temperature data the following values are plotted.

midpoint    35.5    39.5    43.5    47.5    51.5    55.5    59.5    63.5    67.5
f              0       4      10       8      15       9       3       1       0

The plot is shown below.

[Figure: frequency polygon of frequency against class midpoint for the temperature data]

Note: The two plotted values at the lower and upper ends were added to anchor the graph to the horizontal axis. The lower end value is a plot of 0 versus the midpoint of the class below the first (lowest) class (35.5). This midpoint is obtained by subtracting the class width (4) from the midpoint of the lowest class (39.5). The upper end value is a plot of 0 versus the midpoint of the class above the last class (67.5). This midpoint is obtained by adding the class width (4) to the midpoint of the last (highest) class (63.5).

The histogram and frequency polygon are equivalent graphical representations of the pattern of the frequencies shown in the frequency distribution. It can be shown that the areas under the histogram and frequency polygon are the same. The total area under the histogram (frequency polygon) represents the total number of observations in the data set (n).
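The temperature example (example 2) can be retraced numerically with a short Python sketch. It starts from the summary figures quoted in the text (n = 50, minimum 39, maximum 64) and the class frequencies from the table, then recomputes the number of classes, class limits, boundaries, midpoints, relative and cumulative frequencies, and finally compares the area under the histogram with the area under the frequency polygon. All variable names are chosen for this illustration, and the integer-width rule in step 3 is one simple way to satisfy the inequality rather than the only one.

```python
import math

# Temperature data (example 2): summary figures quoted in the text.
n = 50
minimum, maximum = 39, 64
freqs = [4, 10, 8, 15, 9, 3, 1]  # class frequencies from the frequency table

# Step 2: number of classes, rounding up 1 + 1.44 ln(n).
k = math.ceil(1 + 1.44 * math.log(n))

# Step 3: smallest whole-number class width with k * width > range.
data_range = maximum - minimum
width = data_range // k + 1

# Steps 4 and 5: inclusive class limits, starting one below the minimum.
first_lower = minimum - 1
limits = [(first_lower + i * width, first_lower + (i + 1) * width - 1)
          for i in range(k)]

# Class boundaries: the gap d between neighbouring classes is 1 here.
d = 1
boundaries = [(lower - d / 2, upper + d / 2) for lower, upper in limits]
midpoints = [(lower + upper) / 2 for lower, upper in limits]

# Relative and "less than" cumulative frequencies.
relative = [f / n for f in freqs]
cumulative = []
running_total = 0
for f in freqs:
    running_total += f
    cumulative.append(running_total)

# Area under the histogram: frequency times bar width, summed over classes.
hist_area = sum(f * (upper - lower)
                for f, (lower, upper) in zip(freqs, boundaries))

# Area under the frequency polygon (trapezium rule), including the two
# anchor points one class width below and above the plotted midpoints.
xs = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
ys = [0] + freqs + [0]
poly_area = sum((x2 - x1) * (y1 + y2) / 2
                for x1, x2, y1, y2 in zip(xs, xs[1:], ys, ys[1:]))

print("classes:", k, "width:", width)
print("limits:", limits)
print("midpoints:", midpoints)
print("cumulative:", cumulative)
print("areas:", hist_area, poly_area)
```

With equal class widths both areas equal n x class width, so they are proportional to the number of observations; dividing by the class width recovers n.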
The ratio

(area under the histogram (frequency polygon) between 2 values) / sample size = (sum of frequencies between the 2 values) / sample size

is an estimate of the probability (chance) that a value drawn at random from the data set will lie between these two values.

Examples

1) For the frequency distribution in example 2 the estimated chance that a randomly drawn value will be between 45.5 and 57.5 is (8 + 15 + 9)/50 = 32/50 = 0.64.

2) For the frequency distribution in example 3 the estimated chance that a randomly drawn value will be greater than 7.5 is (13 + 9 + 10 + 3)/60 = 35/60 = 0.583.

"Less than" ogive

This is the graph of the cumulative frequencies versus the upper class boundaries.

Example

For the "less than" ogive of the frequency distribution in example 2 (temperature data) the following values are plotted.

class boundary          37.5    41.5    45.5    49.5    53.5    57.5    61.5    65.5
cumulative frequency       0       4      14      22      37      46      49      50

[Figure: "less than" ogive of cumulative frequency against class boundary]

Note: The plotted value at the lower end was added to anchor the graph to the horizontal axis. The lower end value is a plot of 0 versus the upper class boundary of the class below the first (lowest) class (37.5). This upper class boundary is obtained by subtracting the class width (4) from the upper class boundary of the lowest class (41.5).

A percentage "less than" ogive can be plotted by just changing the vertical scale. In this example the frequencies add up to 50. In order to convert these frequencies to percentages, each frequency is multiplied by 2. To draw the percentage ogive, each cumulative frequency in the above table will have to be multiplied by 2. The resulting graph is shown below. Values that have a given percentage of the observations in the data set less than them can be read off from the ogive.

[Figure: percentage "less than" ogive of % cumulative frequency against class boundaries]

The shape of a distribution

The main purpose of drawing a histogram is to describe the clustering pattern of the values in the data set.
For a large sample size, the histogram (frequency polygon) can be fairly well approximated by a smooth curve (called a frequency curve) that is fitted to the frequencies. The following patterns of the shape of the frequency curve appear regularly in data sets.

Symmetric bell shape

[Figure: symmetric bell-shaped frequency curve]

This shape is for data sets where the majority of values are in the central portion of the scale, with fewer and fewer values the further away from the center (in both directions). Many data sets have this shape. Examples are

1) Marks obtained in an examination.
2) Heights of a large group of adult males.
3) IQ scores in a large population.

Uniform (rectangular) shape

[Figure: uniform (rectangular) frequency curve]

This shape occurs when all the values in the data set occur approximately the same number of times. Examples are

1) Frequencies of winning numbers in a large number of Lotto draws.
2) Frequencies of winning numbers in a large number of roulette games.
3) Frequencies obtained when tossing an unbiased coin and recording 0 if tails comes up and 1 if heads comes up.

Bimodal shape

[Figure: bimodal frequency curve of frequency against body length (mm)]

This pattern, which shows two distinct peaks (hence the name bimodal), appears when there are two subgroups with different sets of values in the same data set.

Examples

1) Measuring the body lengths of ants when there are adults and juveniles together in the same data set. The two peaks in the curve reflect the fact that juvenile ants have shorter body lengths than adult ants.
2) Heights of a population of males and females. Since the females are shorter than the males, the frequency curve will have two peaks. One peak will be located where most of the female heights are concentrated and one where most of the male heights are concentrated.

Positive skew shape

[Figure: positively skewed frequency curve]

This shape shows a high clustering of values at the lower end of the scale and less and less clustering further away from the lower end towards the upper end.
Example

The time it takes to serve a customer at a supermarket. For most customers the service time is quite short. The longer the service time, the fewer the customers.

Negative skew shape

[Figure: negatively skewed frequency curve]

This shape shows a high clustering of values at the upper end of the scale and less and less clustering further away from the upper end towards the lower end.
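When no graphing software is at hand, the shape of a frequency distribution can still be inspected with a rough text histogram. The sketch below is only an illustration: the helper name `text_histogram` is hypothetical, and the class frequencies are those of examples 2 and 3 in section 2.3.

```python
def text_histogram(labels, freqs, symbol="*"):
    """Return one line per class: label, a bar of symbols, and the frequency."""
    label_width = max(len(label) for label in labels)
    return [f"{label:>{label_width}} | {symbol * f} ({f})"
            for label, f in zip(labels, freqs)]


# Example 2 (temperature data): class boundaries and frequencies.
temperature_labels = ["37.5-41.5", "41.5-45.5", "45.5-49.5", "49.5-53.5",
                      "53.5-57.5", "57.5-61.5", "61.5-65.5"]
temperature_freqs = [4, 10, 8, 15, 9, 3, 1]

# Example 3 (expenditure data): classes and frequencies.
expenditure_labels = ["4.5-5.5", "5.5-6.5", "6.5-7.5", "7.5-8.5",
                      "8.5-9.5", "9.5-10.5", "10.5-11.5"]
expenditure_freqs = [5, 7, 13, 13, 9, 10, 3]

temperature_lines = text_histogram(temperature_labels, temperature_freqs)
expenditure_lines = text_histogram(expenditure_labels, expenditure_freqs)

for line in temperature_lines:
    print(line)
print()
for line in expenditure_lines:
    print(line)
```

Reading the bars from top to bottom corresponds to reading a histogram from left to right, so a long tail towards the bottom of the printout corresponds to a tail towards the upper end of the scale.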
