Manual Minitab
University of Toronto
Contents
Preface

Part I

Part II

1 Looking at Data: Distributions
    1.1 Tabulating and Summarizing Data
        1.1.1 Tallying Data
        1.1.2 Describing Data
    1.2 Plotting Data in a Graph Window
        1.2.1 Dotplots
        1.2.2 Stem-and-Leaf Plots
        1.2.3 Histograms
        1.2.4 Boxplots
        1.2.5 Time Series Plots
        1.2.6 Bar Charts
        1.2.7 Pie Charts
    1.3 The Normal Distribution
        1.3.1 Calculating the Density
        1.3.2 Calculating the Distribution Function
        1.3.3 Calculating the Inverse Distribution Function
        1.3.4 Normal Probability Plots
    1.4 Exercises
2 Looking at Data: Relationships
    2.1 Scatterplots
    2.2 Correlations
    2.3 Regression
    2.4 Transformations
    2.5 Exercises
3 Producing Data
    3.1 Generating a Random Sample
    3.2 Sampling from Distributions
    3.3 Exercises
4 Probability: The Study of Randomness
    4.1 Basic Probability Calculations
    4.2 More on Sampling from Distributions
    4.3 Simulation for Approximating Probabilities
    4.4 Simulation for Approximating Means
    4.5 Exercises
5 Sampling Distributions
    5.1 The Binomial Distribution
    5.2 Simulating Sampling Distributions
    5.3 Exercises
6 Introduction to Inference
    6.1 z-Confidence Intervals
    6.2 z-Tests
    6.3 Simulations for Confidence Intervals
    6.4 Simulations for Power Calculations
    6.5 The Chi-Square Distribution
    6.6 Exercises
7 Inference for Distributions
    7.1 The Student Distribution
    7.2 t-Confidence Intervals
    7.3 t-Tests
    7.4 The Sign Test
    7.5 Comparing Two Samples
    7.6 The F-Distribution
    7.7 Exercises
8 Inference for Proportions
    8.1 Inference for a Single Proportion
    8.2 Inference for Two Proportions
    8.3 Exercises
9 Inference for Two-Way Tables
    9.1 Tabulating and Plotting
    9.2 The Chi-square Test
    9.3 Analyzing Tables of Counts
    9.4 Exercises
10 Inference for Regression
    10.1 Simple Regression Analysis
    10.2 Exercises
11 Multiple Regression
    11.1 Example
    11.2 Exercises
12 One-Way Analysis of Variance
    12.1 A Categorical Variable and a Quantitative Variable
    12.2 One-Way Analysis of Variance
    12.3 Exercises
13 Two-Way Analysis of Variance
    13.1 The Two-Way ANOVA Command
    13.2 Exercises
14 Nonparametric Tests
    14.1 The Wilcoxon Rank Sum Procedures
    14.2 The Wilcoxon Signed Rank Procedures
    14.3 The Kruskal-Wallis Test
    14.4 Exercises
15 Logistic Regression
    15.1 The Logistic Regression Model
    15.2 Example
    15.3 Exercises

Appendices

A Projects
B Mathematical and Statistical Functions in Minitab
    B.1 Mathematical Functions
    B.2 Column Statistics
    B.3 Row Statistics
C Macros and Execs
    C.1 Global Macros
        C.1.1 Control Statements
        C.1.2 Startup Macro
        C.1.3 Interactive Macros
    C.2 Local Macros
    C.3 Execs
        C.3.1 Creating and Using an Exec
        C.3.2 The CK Capability for Looping
        C.3.3 Interactive Execs
        C.3.4 Startup Execs
D Matrix Algebra in Minitab
    D.1 Creating Matrices
    D.2 Commands for Matrix Operations
E Advanced Statistical Methods in Minitab
F References
Index
Preface
This Minitab manual is to be used as an accompaniment to Introduction to the Practice of Statistics, Fourth Edition, by David S. Moore and George P. McCabe, and to the CD-ROM that accompanies this text. We abbreviate the textbook title as IPS.

Minitab is a statistical software package that was designed especially for the teaching of introductory statistics courses. It is our view that an easy-to-use statistical software package is a vital and significant component of such a course. This permits the student to focus on statistical concepts and thinking rather than computations or the learning of a statistical package. The main aim of any introductory statistics course should always be the "why" of statistics rather than technical details that do little to stimulate the majority of students or, in our opinion, do little to reinforce the key concepts. IPS succeeds admirably in communicating the important basic foundations of statistical thinking, and it is hoped that this manual serves as a useful adjunct to the text.

It is natural to ask why Minitab is advocated for the course. In the authors' experience, ease of learning and use are the salient features of the package, with obvious benefits to the student and to the instructor, who can relegate many details to the software. While more sophisticated packages are necessary for higher-level professional work, it is our experience that attempting to teach one of these in a course forces too much attention on technical aspects. The time students need to spend to learn Minitab is relatively small, and that is a great virtue. Further, Minitab will serve as a perfectly adequate tool for many of the statistical problems students will encounter in their undergraduate education.

This manual is divided into two parts. Part I is an introduction that provides the necessary details to start using Minitab and in particular how to use worksheets. Not all the material in Part I needs to be absorbed on first reading. We recommend reading I.1-I.10 before starting to use Minitab. The material in I.11 is more for reference and for later reading. References are made to these sections later in the manual and can provide the stimulus to read them. Overall, the introductory Part I also serves as a reference for most of the nonstatistical commands in Minitab.

Part II follows the structure of the textbook. Each chapter is titled and numbered as in IPS. The last two chapters are not in IPS but correspond to optional material included on the CD-ROM. The Minitab commands relevant to doing the problems in each IPS chapter are introduced and their use illustrated. Each chapter concludes with a set of exercises, some of which are modifications of or related to problems in IPS and many of which are new and specifically designed to ensure that the relevant Minitab material has been understood. There are also appendices dealing with some more advanced features of Minitab, such as programming in Minitab and matrix algebra.

Minitab is available in a variety of versions and for different types of computing systems. In writing the manual, we have used Version 13 for Windows, as discussed in the references in Appendix F, but have tried to make the contents of the manual compatible with earlier versions and with versions running under other operating systems. The core of the manual is a discussion of the menu commands, while not neglecting to refer to the session commands. Overall, we feel that the manual can be successfully used with most versions of Minitab.

This manual does not attempt a complete coverage of Minitab. Rather, we introduce and discuss those concepts in Minitab that we feel are most relevant for a student studying introductory statistics with IPS. We do introduce some concepts that are, strictly speaking, not necessary for solving the problems in IPS where we feel that they are likely to prove useful in a large number of data analysis problems encountered outside the classroom. While the manual's primary goal is to teach Minitab, more generally we want to help develop strong data analytic skills in conjunction with the text and the CD-ROM.

Thanks to Patrick Farace and Chris Spavins of W. H. Freeman and Company for their help and consideration. Also thanks to Rosemary and Heather.
For further information on Minitab software, contact:

Minitab Inc.
3081 Enterprise Drive
State College, PA 16801 USA
ph: 814.328.3280
fax: 814.238.4383
email: [email protected]
URL: http://www.minitab.com
Part I
New Minitab commands discussed in this part

Calc ▶ Calculator
Calc ▶ Column Statistics
Calc ▶ Make Patterned Data
Calc ▶ Row Statistics
Edit ▶ Copy Cells
Edit ▶ Cut Cells
Edit ▶ Paste Cells
Edit ▶ Select All Cells
Edit ▶ Undo Cut
Edit ▶ Undo Paste
Editor ▶ Enable Command Language
Editor ▶ Insert Cells
Editor ▶ Insert Columns
Editor ▶ Insert Rows
Editor ▶ Make Output Editable
File ▶ Exit
File ▶ New
File ▶ Open Worksheet
File ▶ Other Files ▶ Export Special Text
File ▶ Other Files ▶ Import Special Text
File ▶ Print Session Window
File ▶ Print Worksheet
File ▶ Save Current Worksheet
File ▶ Save Current Worksheet As
File ▶ Save Session Window As
Help
Manip ▶ Code
Manip ▶ Concatenate
Manip ▶ Copy Columns
Manip ▶ Display Data
Manip ▶ Erase Variables
Manip ▶ Rank
Manip ▶ Sort
Manip ▶ Stack
Manip ▶ Unstack
Window ▶ Project Manager
Before you start on Chapter II.1, however, you should read I.1-I.10 and leave I.11 for later reading.

Minitab is a software package that runs on a variety of different types of computers and comes in a number of versions. This manual does not try to describe all the possible implementations or the full extent of the package. We limit our discussion to those features common to the most recent versions of Minitab and, in particular, Versions 12 and 13. Also, we present only those aspects of Minitab relevant to carrying out the statistical analyses discussed in IPS. Of course, this is a fairly wide range of analyses, but the full power of Minitab is not necessary. Depending on the version of Minitab you are using, there may be many more useful features, and we encourage you to learn and use them. Throughout the manual, we point out some of the additional useful features of Minitab and how you can go about learning to use them. Version 13 refers to the most current version of Minitab at the time of writing this manual.

In this manual, special statistical or Minitab concepts will be highlighted in italic font. You should be sure that you understand these concepts. We will provide a brief explanation for any terms not defined in IPS. When a reference is made to a Minitab session command or subcommand, its name will be in bold font. Primarily, we will be discussing the menu commands that are available in Minitab. Menu commands are accessed by clicking the left button of the mouse on items in lists. We use a special notation for menu commands. For example, A ▶ B ▶ C is to be interpreted as "left click the command A on the menu bar, then in the list that drops down, left click the command B, and, finally, left click C." The menu commands will be denoted in ordinary font (the actual appearance may vary slightly depending on the version of Windows you use). Any commands that we type and the output obtained will be denoted in typewriter font, as will the names of any files used by Minitab, variables, constants, and worksheets.

At the end of each chapter, we provide a few exercises that can be used to make sure you have understood the material. We recommend, however, that whenever possible you use Minitab to do the problems in IPS. While many problems can be done by hand, you will save a considerable amount of time and avoid errors by learning to use Minitab effectively. We also recommend that you try out the Minitab commands as you read about them, as this will ensure full understanding.
In some cases, this may mean you type a command such as minitab at a computer system prompt and then hit the Enter or Return key on the keyboard after you have logged on, i.e., provided a login name and password to the computer system being used in your course. Typically, you will see the prompt MTB > on your screen, and this indicates that you have started a Minitab session. In most cases, you will double click an icon, such as that shown in Display I.1, that corresponds to the Minitab program.
Alternatively, you can use the Start button and click on Minitab in the Programs list. In this case, the program opens with a Minitab window, such as the one shown in Display I.2. The Minitab window is divided into two sub-windows, with the upper window called the Session window and the lower one called the Data window.
Left clicking the mouse anywhere on a particular window brings that window to the foreground, i.e., makes it the active window, and the border at the top of the window turns dark blue. For example, clicking in the Session window will make the window containing the MTB > prompt active. Alternatively, you can use the command Window ▶ Session in the menu bar at the top of the Minitab window to make this window active.

You may not see the MTB > prompt in your Session window, and for this manual it is important that you do so. You can ensure that this prompt always appears in your Session window by using Edit ▶ Preferences, double clicking on Session Window in the Preferences list that comes up, clicking on the Enable radio button under Command Language in the Session Window Preferences, clicking on OK, and clicking on Save. Without the MTB > prompt, you cannot type commands to be executed in the Session window.

In the Session window, Minitab commands are typed after the MTB > prompt and executed when you hit the Enter or Return key. For example, the first command you should learn is exit, as this takes you out of your Minitab session and returns you to the system prompt or operating system. Otherwise, you can access commands using the menu bar (Display I.3) that resides at the top of the Minitab window. For example, you can access the exit command using File ▶ Exit. In many circumstances, using the menu commands to do your analyses is easy and convenient, although there are certain circumstances where typing the session commands is necessary. You can also exit by clicking on the × (close) symbol in the upper right-hand corner of the Minitab window.

When you exit, you are prompted by Minitab in a dialog window with the question, "Save changes to this Project before closing?" You can safely answer no to this question unless you are in fact using the Projects feature in Minitab as described in Appendix A. In I.8, we will discuss how to save the contents of a Data window before exiting. This is something you will commonly want to do.
Immediately below the menu bar in the Minitab window is the taskbar. The taskbar consists of various icons that provide a shortcut method for carrying out various operations by clicking on them. These operations can be identified by holding the cursor over each icon in turn, and it is a good idea to familiarize yourself with them. Of particular importance are the Cut Cells, Copy Cells, and Paste Cells icons, which are available when a Data window is active. When the operation associated with an icon is not available, the icon is faded.

Minitab is an interactive program. By this we mean that you supply Minitab with input data, or tell it where your input data is, and then Minitab responds instantaneously to any commands you give telling it to do something with that data. You are then ready to give another command. It is also possible to run a collection of Minitab commands in a batch program; i.e., several Minitab commands are executed sequentially before the output is returned to the user. The batch version is useful when there is an extensive number of computations to be carried out. You are referred to Appendix C for more discussion of the batch version.
4 Getting Help
At times, you may want more information about a command or some other aspect of Minitab than this manual provides, or you may wish to remind yourself of some detail that you have partially forgotten. Minitab contains an online manual that is very convenient. You can access this information directly by clicking on Help in the menu bar and using the table of Contents or doing a Search of the manual for a particular concept. From the MTB > prompt, you can use the help command for this purpose. Typing help followed by the name of the command of interest and hitting Enter will cause Minitab to produce relevant output. For example, asking for help on the command help itself via the command

    MTB > help help

will give you an overview of what help information can be accessed on your system. The help command should be used to find out about session commands.
5 The Worksheet
The basic structural component of Minitab is the worksheet. Basically, the worksheet can be thought of as a big rectangular array, or matrix, of cells organized into rows and columns as in the Data window of Display I.2. Each cell holds one piece of data. This piece of data could be a number, i.e., numeric data, or it could be a sequence of characters, such as a word or an arbitrary sequence of letters and numbers, i.e., text data. Data often comes as numbers, such as 1.7, 2.3, . . . but sometimes it comes in the form of a sequence of characters, such as black, brown, red, etc. Typically, sequences of characters are used as identifiers in classifications for some variable of interest, e.g., color, gender. A piece of text data can be up to 80 characters in length in Minitab. Version 13 also allows for date data, which is data especially formatted to indicate a date, for example, 3/4/97. We will not discuss date data.

If possible, try to avoid using text data with Minitab, i.e., make sure all the values of a variable are numbers, as dealing with text data in Minitab is more difficult. For example, denote colors by numbers rather than by names. Still, there will be applications where data comes to you as text data, e.g., in a computer file, and it is too extensive to convert to numeric data by hand. So we will discuss how to input text data into a Minitab worksheet, but we recommend that in such cases you convert this to numeric data, using the methods of I.11.3, once it has been input. In Version 13 of Minitab it is somewhat easier to deal with text data than in earlier versions, and this proviso is not as necessary.

Display I.4 provides an example of a worksheet. Notice that the columns are labeled C1, C2, etc. and the rows are labeled 1, 2, 3, etc. We will refer to the worksheet depicted in Display I.4 as the marks worksheet hereafter and will use it throughout Part I to illustrate various Minitab commands and operations.
Data arises from the process of taking measurements of variables in some real-world context. For example, in a population of students, suppose that we are conducting a study of academic performance in a Statistics course. Specifically, suppose that we want to examine the relationship between grades in Statistics, grades in a Calculus course, grades in a Physics course, and gender. So we collect the following information for each student in the study: student number, grade in Statistics, grade in Calculus, grade in Physics, and gender. Therefore, we have 5 variables: student number and the grades in the three subjects are numeric variables, and gender is a text variable. Let us further suppose that there are 10 students in the study.

Display I.4 gives a possible outcome from collecting the data in such a study. Column C1 contains the student number (note that this is a categorical variable even though it is a number). The student number primarily serves as an identifier so that we can check that the data has been entered correctly. This is something you should always do as a first step in your analysis. Columns C2-C4 contain the student grades in their Statistics, Calculus, and Physics courses, and column C5 contains the gender data. Notice that a column contains the values collected for a single variable, and a row contains the values of all the variables for a single student. Sometimes, a row is referred to as an observation or case. Observe that the data for this study occupies a 10 × 5 subtable of the full worksheet. All of the other blank entries of the worksheet can be ignored, as they are undefined.
There will be limitations on the number of columns and rows you can have in your worksheet, and this depends on the particular implementation of Minitab you are using. So if you plan to use Minitab for a large problem, you should check with the system person or further documentation to see what these are. For example, in some versions of Minitab there is a limitation of 5000 cells. So there can be one variable with 5000 values in it, or 50 variables with 100 values each, etc.

Associated with a worksheet is a table of constants. Typically, these are numbers that you want to use in some arithmetical operation applied to every value in a column. For example, you may have recorded heights of people in inches and want to convert these to heights in centimeters. You must multiply every height by the value 2.54. The Minitab constants are labeled K1, K2, etc. Again, there are limitations on the number of constants you can associate with a worksheet. For example, in many versions there can be at most 1000 constants. So to continue with the above problem, we might assign the value 2.54 to K1. In I.7.4, we show how to make such an assignment, and in I.10.1 we show how to multiply every entry in a column by this value.
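To make the roles of a column and a constant concrete outside Minitab, the following is a minimal Python sketch of the inches-to-centimeters conversion and of the cell-limit arithmetic described above. The height values and the names heights_in, K1, CELL_LIMIT, and fits are illustrative assumptions of ours and are not part of Minitab; the Minitab way of doing this is the subject of I.7.4 and I.10.1.

```python
# Hypothetical stand-ins for a worksheet column and the constant K1.
heights_in = [60.0, 65.5, 72.0]  # illustrative heights in inches (not from the manual)
K1 = 2.54                        # centimeters per inch

# Multiply every entry of the column by the constant.
heights_cm = [K1 * h for h in heights_in]

# A cell limit constrains the product (number of columns) x (number of rows).
CELL_LIMIT = 5000  # the example limit quoted in the text


def fits(n_columns, n_rows, limit=CELL_LIMIT):
    """Return True if a table of this shape stays within the cell limit."""
    return n_columns * n_rows <= limit


print(heights_cm)
print(fits(1, 5000), fits(50, 100), fits(51, 100))
```

The last line reports that 1 × 5000 and 50 × 100 both fit within 5000 cells, while 51 × 100 does not.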
In Version 13 of Minitab, there is an additional structure beyond the worksheet called the project. A project can have multiple worksheets associated with it. Also, a project can have associated with it various graphs and records of the commands you have typed and the output obtained while working on the worksheets. Projects, which are discussed in Appendix A, can be saved and retrieved for later work.
6 Minitab Commands
We will now begin to introduce various Minitab commands to get data into a worksheet, edit a worksheet, perform various operations on the elements of a worksheet, and save and access a saved worksheet. Before we do, however, it is useful to know something about the basic structure of all Minitab commands. Associated with every command is of course its name, as in File ▶ Exit and Help. Most commands also take arguments, and these arguments are column names, constants, and sometimes file names.

Commands can be accessed by making use of the File, Edit, Manip, Calc, Stat, Graph and Editor entries in the menu bar. Clicking any of these brings up a list of commands that you can use to operate on your worksheet. The lists that appear may depend on which window is active, e.g., either a Data window or the Session window. Unless otherwise specified, we will always assume that the Session window is active when discussing menu commands. If a command name in a list is faded, then it is not available.

Typically, using a command from the menu bar requires the use of a dialog box or dialog window that opens when you click on a command in the list. These are used to provide the arguments and subcommands to the command and specify where the output is to go. Dialog boxes have various boxes that must be filled in to correctly execute a command. Clicking in a box that needs to be filled in typically causes a variable list to appear in the left-most box, showing all items in the active worksheet that can be placed in that box. Double clicking on items in the variable list places them in the box, or, alternatively, you can type them in directly. When you have filled in the dialog box and clicked OK, the command is printed in the Session window and executed. Any output is also printed in the Session window. Dialog boxes have a Help button that can be used to learn how to make the entries.

For example, suppose that we want to calculate the mean of column C2 in the worksheet marks. Then the command Calc ▶ Column Statistics brings up the dialog box shown in Display I.5. Notice that the radio button Sum is filled in. Clicking the radio button labelled Mean results in this button being filled in and the Sum button becoming empty. Whichever button is filled in will result in that statistic being calculated for the relevant columns when we finally implement the command by clicking OK. Currently, there are no columns selected, but clicking in the Input variable box brings up a list of possible columns in the display window on the left. The results of these operations are shown in Display I.6. We double click on C2 in the variable list, which places this entry in the Input variable box as shown in Display I.7. Alternatively, we could have simply typed this entry into the box. After clicking the OK button, we obtain the output

    Mean of C2 = 69.900

in the Session window.
Display I.5: Initial view of the dialog box for Column Statistics.
Display I.6: View of the dialog box for Column Statistics after selecting Mean and bringing up the variable list.
Display I.7: Final view of the dialog box for Column Statistics.
Quite often, it is faster and more convenient to simply type your commands directly into the Session window. Sometimes, it is necessary to use the Session window approach, but for many commands the menu bar is available. So we now describe the use of commands in the Session window.

The basic structure of such a command with n arguments is

    command name E_1, E_2, ..., E_n

where E_i is the ith argument. Alternatively, we can write

    command name E_1 E_2 ... E_n

if we don't want to type commas. Conveniently, if the arguments E_1, E_2, ..., E_n are consecutive columns in the worksheet, we have the following short form

    command name E_1-E_n

which saves even more typing and accordingly decreases our chance of making a typing mistake. If you are going to type a long list of arguments and you don't want them all on the same line, then you can type the continuation symbol & where you want to break the line and then hit Enter. Minitab responds with the prompt CONT> and you continue to type argument names. The command is executed when you hit Enter after an argument name without a continuation character following it.

Many commands can, in addition, be supplied with various subcommands that alter the behavior of the command. The structure for commands with subcommands is

    command name E_1 ... E_{n_1};
    subcommand name E_{n_1+1} ... E_{n_2};
    ...
    subcommand name E_{n_{k-1}+1} ... E_{n_k}.

Notice that when there are subcommands each line ends with a semicolon until the last subcommand, which ends with a period. Also, subcommands may have arguments. When Minitab encounters a line ending in a semicolon, it expects a subcommand on the next line and changes the prompt to SUBC > until it encounters a period, whereupon it executes the command. If while typing in one of your subcommands you suddenly decide that you would rather not execute the subcommand (perhaps you realize something was wrong on a previous line), then type abort after the SUBC > prompt and hit Enter. As a further convenience, it is worth noting that you need only type the first four letters of any Minitab command or subcommand.

For example, to calculate the mean of column C2 in the worksheet marks we can use the mean command in the Session window, as in

    MTB > mean c2

and we obtain the same output in the Session window as before.

There are two additional ways in which you can input commands to Minitab. Instead of typing the commands directly into the Session window, you can also type them into the Command Line Editor, which is available via Edit ▶ Command Line Editor. Multiple commands can then be typed directly into a box that pops up and executed when the Submit Commands button is clicked. Output appears in the Session window. Also, many commands are available on a toolbar that lies just below the menu bar at the top of the Minitab window. There is a different toolbar depending upon which window is active. We give a brief discussion of some of the features available in the toolbar in later sections.
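The syntax rules just described (commas or blanks between arguments, the short form for consecutive columns, a semicolon or period at the end of each line, and the first four letters of a name being sufficient) can be sketched in a few lines of Python. This is only an illustration of the rules under our own assumptions, not a description of how Minitab itself reads commands; the helper names expand_columns and parse_command are hypothetical, and the words command and subcommand in the usage line are placeholders rather than real Minitab names.

```python
import re


def expand_columns(arg):
    """Expand a short form such as 'c1-c4' into ['C1', 'C2', 'C3', 'C4']."""
    match = re.fullmatch(r"[cC](\d+)-[cC](\d+)", arg)
    if not match:
        return [arg.upper()]  # a single argument, e.g. 'c2'
    first, last = int(match.group(1)), int(match.group(2))
    return [f"C{i}" for i in range(first, last + 1)]


def parse_command(text):
    """Split a command and its subcommands into (name, arguments) pairs.

    Each line ends with ';' while more subcommands follow and with '.'
    after the last one; a plain command has no terminator at all.
    """
    parts = []
    for line in text.strip().splitlines():
        line = line.strip().rstrip(";.").strip()
        tokens = re.split(r"[,\s]+", line)  # commas or blanks separate arguments
        name, args = tokens[0], tokens[1:]
        columns = [c for a in args for c in expand_columns(a)]
        parts.append((name[:4].lower(), columns))  # first four letters suffice
    return parts


print(parse_command("mean c2"))
print(parse_command("command c1-c3;\n  subcommand c5."))
```

The first call yields a single pair for the mean command with the argument C2; the second yields one pair for the command line, with C1, C2, and C3 as arguments, and one pair for the subcommand line.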
after you hit Enter. Clicking on it alternates between row-wise and column-wise data entry. Certainly, this is an easy way to enter data when it is suitable. Remember, columns are variables and rows are observations! Also, you can have multiple data windows open and move data between them. Use the command File ▶ New to open a new worksheet.
Each row corresponds to an observation, with the student number being the rst entry, followed by the marks in the students Statistics, Calculus, and Physics courses. These entries are separated by blanks. Notice the * in the sixth row of this data le. In Minitab, a * signies a missing numeric value, i.e., a data value that for some reason is not available. Alternatively, we could have just left this entry blank. A missing text value is simply denoted by a blank. Special attention should be paid to missing values. In general, Minitab statistical analyses ignore any cases that contain missing data except that the output of the command will tell you how many cases were ignored because of missing data. It is important to pay attention to this information. If your data is riddled with a large number of missing values, your analysis may be based on very few observations even if you have a large data set! When data in such a le is blank-delimited like this it is very easy to read in. After the command File I Other Files I Import Special Text, we see the dialog box shown in Display I.8 minus C1C4 in the Store data in column(s): box. We typed C1-C4 into this window to indicate that we want the data read in to be stored in these columns. Note that it doesnt matter if we use lower or upper case for the column names, as Minitab is not case sensitive. After clicking OK, we see the dialog box depicted in Display I.9, which we use to indicate from which le we want to read the data. Note that if your data is in .txt les rather than .dat les, you will have to indicate that you want to see these in
the Files of type box by selecting Text Files or perhaps All Files. Clicking on marks.txt results in the data being read into the worksheet.
Display I.8: Dialog box for importing data from external file.
Display I.9: Dialog box for selecting file from which data is to be read in.
Of course, this data set does not contain the text variable denoting the student's gender. Suppose that the file marksgend.txt contains the following data exactly as typed.
12389 81 85 78 m
97658 75 72 62 m
53546 77 83 81 f
55542 63 42 55 m
11223 71 82 67 f
77788 87 56  * f
44567 23 45 35 m
32156 67 72 81 m
33456 81 77 88 f
67945 74 91 92 f
As this file contains text data in the fifth column, we must tell Minitab how the data is formatted in the file. To access this feature we click on the Format button in the dialog box shown in Display I.8. This brings up the dialog box shown in Display I.10.
To indicate that we will specify the format, we click the radio button User-specified format and fill the particular format into the box as shown in Display I.11. The format statement says that we are going to read in the data according to the following rule: a numeric variable occupying 5 spaces and with no decimals, followed by a space, a numeric variable occupying 2 spaces with no decimals, a space, a numeric variable occupying 2 spaces with no decimals, a space, a numeric variable occupying 2 spaces with no decimals, a space, and a text variable occupying 1 space. This rule must be rigorously adhered to or errors will occur. So the rules you need to remember if you use formatted input are that ak indicates a text variable occupying k spaces, kx indicates k spaces, and fk.l indicates a numeric variable occupying k spaces, of which l are to the right of the decimal point. Note that if a data value does not fill up the full number of spaces allotted to it in the format statement, it must be right justified in its field. Also, if a decimal point is included in the number, this occupies one of the spaces allocated to the variable, and similarly for a negative or plus
sign. There are many other features to formatted input that we will not discuss here. Use the Help button in the dialog box for information on these features. Finally, clicking on the OK button reads this data into a worksheet as depicted in Display I.4. Typically, we try to avoid the use of formatted input because it is somewhat cumbersome, but sometimes we must use it.
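To make the formatted-input rules concrete, here is a minimal Python sketch (not Minitab code) of how a fixed-width record could be split according to a format built from fk.l, kx, and ak items. The format string below is our reading of the rule described above and is an assumption, as are the function names; the sketch also ignores the implied-decimal part of fk.l.

```python
import re

def parse_format(fmt):
    """Turn a format such as 'f5.0,1x,a1' into (kind, width) pairs."""
    fields = []
    for token in fmt.replace(" ", "").split(","):
        if m := re.fullmatch(r"f(\d+)\.(\d+)", token):
            fields.append(("num", int(m.group(1))))   # numeric field, width k
        elif m := re.fullmatch(r"a(\d+)", token):
            fields.append(("text", int(m.group(1))))  # text field, width k
        elif m := re.fullmatch(r"(\d+)x", token):
            fields.append(("skip", int(m.group(1))))  # k blank spaces
        else:
            raise ValueError(f"unrecognised format item: {token}")
    return fields

def read_record(line, fields):
    """Cut one line into values; '*' or blank numeric fields become None."""
    values, pos = [], 0
    for kind, width in fields:
        chunk = line[pos:pos + width]
        pos += width
        if kind == "skip":
            continue
        if kind == "text":
            values.append(chunk)
        else:
            text = chunk.strip()
            values.append(None if text in ("", "*") else float(text))
    return values

# Assumed format for marksgend.txt, following the rule described in the text.
FMT = "f5.0,1x,f2.0,1x,f2.0,1x,f2.0,1x,a1"
fields = parse_format(FMT)
row1 = read_record("12389 81 85 78 m", fields)
row6 = read_record("77788 87 56  * f", fields)  # '*' right justified in its field
print(row1)
print(row6)
```

The sixth record shows why right justification matters: the missing-value symbol must sit in the second position of its two-space field, otherwise every later field would be read from the wrong columns.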
Display I.11: Dialog box for formatted input with the format filled in.
In the session environment, the read command is available for inputting data into a worksheet with capabilities similar to what we have described. For example, the commands
    MTB > read c1-c4
    DATA> 12389 81 85 78
    DATA> 97658 75 72 62
    DATA> 53546 77 83 81
    DATA> 55542 63 42 55
    DATA> 11223 71 82 67
    DATA> 77788 87 56 *
    DATA> 44567 23 45 35
    DATA> 32156 67 72 81
    DATA> 33456 81 77 88
    DATA> 67945 74 91 92
    DATA> end
    10 rows read.
place the first four columns into the marks worksheet. After typing read c1-c4 after the MTB > prompt and hitting Enter, Minitab responds with the DATA> prompt, and we type each row of the worksheet in as shown. To indicate that there is no more data, we type end and hit Enter. Similarly, we can enter text data in this way but can't combine the two unless we use a format subcommand. We refer the reader to help for more description of how this command works.
Display I.12: Dialog box for making patterned data with some entries filled in.
There is some shorthand associated with patterned data that can be very convenient. For example, typing m : n in a Minitab command is equivalent to typing the values m, m + 1, ..., n when m < n, the values m, m - 1, ..., n when m > n, and m when m = n. The expression m : n/d, where d > 0, expands to a list as above but with the increment d or -d, whichever is relevant, replacing 1 or -1. If m < n, then d is added to m until the next addition would exceed n, and if m > n, then d is subtracted from m until the next subtraction would be lower than n. The expression k(m : n/d) repeats m : n/d k times, while (m : n/d)l repeats each element in m : n/d l times. The expression k(m : n/d)l repeats (m : n/d)l k times. The set command is available in the Session window to input patterned data. For example, suppose we want C6 to contain the 10 entries 1, 2, 3, 4, 5, 5, 4, 3, 2, 1. The command
does this. Also, we can add elements in parentheses. For example, the command
    MTB > set c6
    DATA> (1:2/.5 4:3/.2)
    DATA> end
creates the column with entries 1.0, 1.5, 2.0, 4.0, 3.8, 3.6, 3.4, 3.2, 3.0. The multiplicative factors k and l can also be used in such a context. Obviously, there is a great deal of scope for entering patterned data with set. The general syntax of the set command is
    set E1
where E1 is a column.
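The expansion rules for the patterned-data shorthand can be checked with a short Python sketch. This is only an illustration of the rules as stated in the text, not Minitab's implementation; the function name expand and its keyword arguments are our own. Exact fractions are used so that increments such as .2 do not accumulate rounding error.

```python
from fractions import Fraction

def expand(m, n, d=1, k=1, l=1):
    """Expand the shorthand k(m:n/d)l into an explicit list of values."""
    m, n, d = Fraction(str(m)), Fraction(str(n)), Fraction(str(d))
    if d <= 0:
        raise ValueError("the increment d must be positive")
    step = d if m <= n else -d           # count up when m < n, down when m > n
    seq, x = [], m
    while (x <= n) if step > 0 else (x >= n):
        seq.append(float(x))             # stop before overshooting n
        x += step
    each_repeated = [v for v in seq for _ in range(l)]   # (m:n/d)l
    return each_repeated * k                             # k(...)

# Under these rules, the pieces 1:5 followed by 5:1 give the ten entries
# wanted for C6 in the text.
up_and_down = expand(1, 5) + expand(5, 1)
# The parenthesised example from the text: (1:2/.5 4:3/.2)
mixed = expand(1, 2, 0.5) + expand(4, 3, 0.2)
print(up_and_down)
print(mixed)
print(expand(1, 3, k=2, l=2))
```

The last line shows the two multiplicative factors together: each element is repeated l times first, and the resulting block is then repeated k times.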
Display I.13: Dialog box for printing worksheet in the Session window.
The print command is available in the Session window and is often convenient to use. The general syntax for the print command is
    print E1 ... Em
where E1, ..., Em are columns and constants.
Display I.14: Filled in dialog box for assigning the constant k1 the value .5.
The let command is available in the Session window and is quite convenient. The following commands make this assignment and then we check, using the print command, that we have entered the constants correctly.
    MTB > let k1=.5
    MTB > let k2=.25
    MTB > let k3=.25
    MTB > print k1-k3
    K1 0.500000
    K2 0.250000
    K3 0.250000
Also, we can assign constants text values. For example,
    MTB > let k4="result"
assigns K4 the value result. Note the use of double quotes.
In the Session window, the name command is available for naming variables and constants. For example, the commands
    MTB > name c1 'studid' c2 'stats' c3 'calculus' &
    CONT> c4 'physics' c5 'gender' &
    CONT> k1 'weight1' k2 'weight2' k3 'weight3'
give the names studid to C1, stats to C2, calculus to C3, physics to C4, gender to C5, weight1 to K1, weight2 to K2, and weight3 to K3. Notice that we have made use of the continuation character & for convenience in typing in the full input to name. When using the variables as arguments, just enclose the names in single quotes. For example,
    MTB > print 'studid' 'calculus'
prints out the contents of these variables in the Session window. Variable and constant names can be at most 31 characters in length, cannot include the characters # and ', and cannot start with a leading blank or *. Recall that Minitab is not case sensitive, so it does not matter if we use lower or upper case letters when specifying the names.
Name studid stats calculus physics gender Name weight1 weight2 weight3
Notice that the info command tells us how many missing values there are and in what columns they occur and also the values of the constants. This information can also be accessed directly from the Project Manager window via Window > Project Manager.
Display I.16: Dialog box that determines how a block of copied cells is used, whether being inserted into a worksheet or replacing a block of cells of the same size.
An alternative approach is available for copying operations using Manip > Copy Columns and filling in the dialog box appropriately. For example, suppose we want to copy all the entries in the marks worksheet in rows 5 and 8 of columns C2 and C4 and place these in columns C7 and C8. The dialog box shown in Display I.17 would result in all the entries in columns C2 and C4 being copied to C7 and C8. To prevent this, we click on the Use Rows button, which brings up the dialog box shown in Display I.18. Clicking on the Use rows radio button and filling in the associated box with the entries 5 and 8 specifies that only entries in the fifth and eighth rows will be copied. Clicking on the OK buttons in these dialog boxes then completes the operation.
Display I.17: Dialog box for copying entries in columns and pasting them.
One can also delete selected rows from specified columns using Manip > Delete Rows and filling in the dialog box appropriately. Notice, however, that whenever we delete a cell, the contents of the cells beneath the deleted one in that column simply move up to fill the cell. The cell entry does not become missing; rather, cells at the bottom of the column become undefined! If you delete an entire row, this is not a problem because the rows below just shift up. For example, if we delete the third row then in the new worksheet, after the deletion, the third row is now occupied by what was formerly the fourth row. Therefore, you should be very careful, when you are not deleting whole rows, to ensure that you get the result you intended. Note that if you delete all the entries from a column, this variable is still in the worksheet, but it is now empty. If you wish to delete a variable and all its entries, this can be accomplished using Manip > Erase Variables and filling in the dialog box appropriately. This is a good idea if you have a lot of variables and no longer need some of them.
There are various commands in the Session window available for carrying out these editing operations. For example, the restart command in the Session window can be used to remove all entries from a worksheet. The let command allows you to replace individual entries. For example,
    MTB > let c2(2)=3
assigns the value 3 to the second entry in the column C2. The copy command can be used to copy a block of cells from one place to another. The insert command allows you to insert rows or observations anywhere in the worksheet. The delete command allows you to delete rows. The erase command is available for the deletion of columns or variables from the worksheet. As it is more convenient to edit a worksheet by directly working on the worksheet and using the menu commands, we do not discuss these commands further here.
To retrieve a worksheet, use File > Open Worksheet and fill in the dialog box as depicted in Display I.20 appropriately. The various windows and buttons in this dialog box work as described for the File > Save Current Worksheet As command, with the exception that we now type the name of the file we want to open in the File name box and click on the Open button.
To print a worksheet, use the command File > Print Worksheet. The dialog box that subsequently pops up allows you to control the output in a number of ways. It may be that you would prefer to write out the contents of a worksheet to an external file that can be edited by an editor or perhaps used by some other program. This will not be the case if we save the worksheet as an .mtw file, as only Minitab can read these. To do this, use the command File > Other Files > Export Special Text, filling in the dialog box and specifying the destination file when prompted. For example, if we want to save the contents of the marks worksheet, this command results in the dialog box of Display I.21 appearing. We have entered all five columns into the Columns to export box and have not specified a format, so the columns will be stored in the file with single blanks separating the columns. Clicking the OK button results in the dialog box of Display I.22 appearing. Here, we have typed in the name marks.dat to hold the contents. Note that while we have chosen a .dat type file, we also could have chosen a .txt type file. Clicking on the Save button results in a file marks.dat being created in the folder data with contents as displayed in Display I.23.
Display I.21: Dialog box for saving the contents of a worksheet to an external (non-Minitab) file.
Display I.22: Dialog box for selecting external file to hold contents of a worksheet.
In the Session window, the commands save and retrieve are available for saving and retrieving a worksheet in the .mtw format and the command write is available for saving a worksheet in an external le. We refer the reader to help for a description of how these commands work.
10 Mathematical Operations
When carrying out a data analysis, a statistician is often called upon to transform the data in some way. This may involve anything from applying some simple transformation to a variable to create a new variable (e.g., taking the natural logarithm of every grade in the marks worksheet) to combining several variables to form a new variable (e.g., calculating the average grade for each student in the marks worksheet). In this section, we present some of the ways of doing this.
grade and placing the result in C6. Filling in the dialog box, corresponding to Calc > Calculator, as shown in Display I.24 accomplishes this when we click on the OK button.
Note that we can either type the relevant expression into the Expression box or build it using the buttons and double clicking on the relevant columns. Further, we type the column where we wish to store the results of our calculation in the Store result in variable box. These operations are done on the corresponding entries in each column; that is, corresponding entries in the columns are operated on according to the formula we have specified, and a new column of the same length containing all the outcomes is created. Note that the sixth entry in C6 will be * (missing) because this entry was missing for C4. These kinds of operations can also be carried out directly in the Session window using the let command, and in some ways this is a simpler approach. For example, the session command
    MTB > let c6=c4-(c2+c3)/2
accomplishes this. We can also use these arithmetical operations on the constants K1, K2, etc., and numbers to create new constants, or use the constants as scalars in operations with columns. For example, suppose that we want to compute the weighted average of the Statistics, Calculus, and Physics grades where Statistics gets twice the weight of the other grades. Recall that we created, as part of the marks worksheet, the constants weight1 = .5, weight2 = .25, and weight3 = .25 in K1, K2, and K3, respectively. So this weighted average is computed via the command
    MTB > let c7='weight1'*'stats'+'weight2'*'calculus'&
    CONT> +'weight3'*'physics'
and the result is placed in C7. We have used the continuation character & for convenience in this computation. Alternatively, we could have used the Calc > Calculator command as above for this.
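As a rough check on what these two let commands compute, the following Python sketch applies the same entry-by-entry arithmetic to the marks data, with None standing in for Minitab's * missing value. The helper names are ours, and the propagation of missing values is modelled only as described in the text.

```python
# Marks worksheet columns C2-C4; None plays the role of the missing value *.
stats    = [81, 75, 77, 63, 71, 87, 23, 67, 81, 74]
calculus = [85, 72, 83, 42, 82, 56, 45, 72, 77, 91]
physics  = [78, 62, 81, 55, 67, None, 35, 81, 88, 92]

# Constants K1-K3 (weight1, weight2, weight3).
w1, w2, w3 = 0.5, 0.25, 0.25

def elementwise(func, *columns):
    """Apply func to corresponding entries; any missing input gives missing."""
    result = []
    for row in zip(*columns):
        result.append(None if None in row else func(*row))
    return result

# let c6 = c4 - (c2 + c3)/2
c6 = elementwise(lambda s, c, p: p - (s + c) / 2, stats, calculus, physics)
# let c7 = weight1*stats + weight2*calculus + weight3*physics
c7 = elementwise(lambda s, c, p: w1 * s + w2 * c + w3 * p,
                 stats, calculus, physics)

print(c6)
print(c7)
```

The sixth entry of both results is None, mirroring the * that appears in the worksheet when any input to the calculation is missing.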
Display I.25: Dialog box for mathematical calculations illustrating the use of the natural logarithm function.
A complete list of such functions is given in the Functions window when All functions is in the window directly above the list. The same result can be obtained using the session command let and the natural logarithm function loge. For example,
    MTB > let c8=loge(c2)
calculates the natural log of every entry in C2 and places the results in C8. There are a number of such functions and a complete list is provided in Appendix B.1. These functions can be applied to numbers as well as constants. If you want to know the sine of the number 3.4, then
    MTB > let k4=sin(3.4)
    MTB > print k4
    K4 -0.255541
gives the value.
If we want to, we can store this result in a constant or column by making an appropriate entry in the Store result in box. We see from the dialog box that there are a number of possible statistics that can be computed. We can also compute statistics row-wise. One difference from column statistics is that these must be stored. For example, suppose we want to compute the average of the Statistics, Calculus, and Physics marks. The command Calc > Row Statistics produces the dialog box shown in Display I.27, where we have placed C2, C3, and C4 into the Input variables box and C6 into the Store result in box.
It is also possible to compute column statistics using session commands. For example,
    MTB > mean(c2)
    MEAN = 69.900
computes the mean of C2. If we want to save the value for subsequent use, then the command
    MTB > let k1=mean(c2)
does this. The general syntax for column statistic commands is
    column statistic name(E1)
where the operation is carried out on the entries in column E1, and output is written to the screen unless it is assigned to a constant using the let command. See Appendix B.2 for a list of all the column statistics available. Also, for most column statistics there are versions that compute row statistics, and these are obtained by placing r in front of the column statistic name. For example,
    MTB > rmean(c2 c3 c4 c6)
computes the mean of the corresponding entries in C2, C3, and C4 and places the result in C6. The general syntax for row statistic commands is
    row statistic name(E1 ... Em Em+1)
where the operations are carried out on the rows in columns E1, ..., Em, and the output is placed in column Em+1. See Appendix B.3 for a list of all the row statistics available.
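The difference between a column statistic and a row statistic can be illustrated with a small Python sketch on the marks data. The function names are ours. Skipping missing entries in the row mean is an assumption made for this sketch; consult Help for how rmean treats a row with a missing value.

```python
stats    = [81, 75, 77, 63, 71, 87, 23, 67, 81, 74]
calculus = [85, 72, 83, 42, 82, 56, 45, 72, 77, 91]
physics  = [78, 62, 81, 55, 67, None, 35, 81, 88, 92]   # None = missing (*)

def col_mean(column):
    """One number for the whole column, computed from the non-missing entries."""
    present = [v for v in column if v is not None]
    return sum(present) / len(present)

def row_mean(*columns):
    """One number per row (a new column), skipping missing entries."""
    means = []
    for row in zip(*columns):
        present = [v for v in row if v is not None]
        means.append(sum(present) / len(present) if present else None)
    return means

k1 = col_mean(stats)                       # analogue of let k1=mean(c2)
c6 = row_mean(stats, calculus, physics)    # analogue of rmean(c2 c3 c4 c6)
print(k1)
print(c6)
```

The column mean is a single constant, which is why it can simply be displayed, whereas the row means form a column of the same length as the inputs and therefore have to be stored somewhere.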
Notice that there are two choices for these operators; for example, use either the symbol >= or the mnemonic ge. The comparison and logical operators are useful when we have simple questions about the worksheet that would be tedious to answer by inspection. This
feature is particularly useful when we are dealing with large data sets. For example, suppose that we want to count the number of times the Statistics grade was greater than the corresponding Calculus grade in the marks worksheet. The command Calc > Calculator gives the dialog box shown in Display I.28, where we have put c6 in the Store result in variable box and c2 > c3 in the Expression box. Clicking on the OK button results in the ith entry in C6 containing a 1 if the ith entry in C2 is greater than the ith entry in C3, i.e., the comparison is true, and a 0 otherwise. In this case, C6 contains the entries 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, which the worksheet in Display I.4 verifies as appropriate. If we use Calc > Calculator to calculate the sum of the entries in C6, we will have computed the number of times the Statistics grade is greater than the Calculus grade. These operations can also be simply carried out using session commands. For example,
    MTB > let c6=c2>c3
    MTB > let k4=sum(c6)
    MTB > print k4
    K4 4.00000
accomplishes this.
The logical operators combine with the comparison operators to allow more complicated questions to be asked. For example, suppose we wanted to calculate the number of students whose Statistics mark was greater than their Calculus mark and less than or equal to their Physics mark. The commands
    MTB > let c6=c2>c3 and c2<=c4
    MTB > let k4=sum(c6)
    MTB > print k4
    K4 1.00000
accomplish this. In this case, both conditions c2>c3 and c2<=c4 have to be true for a 1 to be recorded in C6. Note that the observation with the missing Physics mark is excluded. Of course, we can also implement this using Calc > Calculator and filling in the dialog box appropriately. Text variables can be used in comparisons where the ordering is alphabetical. For example,
    MTB > let c6=c5<"m"
puts a 1 in C6 whenever the corresponding entry in C5 is alphabetically smaller than m.
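The two counts obtained above can be reproduced with the following Python sketch, which treats a comparison as a column of 1s and 0s. The helper names are ours, None stands for the missing value *, and the handling of a missing operand (the result is missing and is left out of the sum) follows the description in the text.

```python
stats    = [81, 75, 77, 63, 71, 87, 23, 67, 81, 74]
calculus = [85, 72, 83, 42, 82, 56, 45, 72, 77, 91]
physics  = [78, 62, 81, 55, 67, None, 35, 81, 88, 92]   # None = missing (*)

def compare(op, left, right):
    """Entry-by-entry comparison giving 1 (true), 0 (false) or None (missing)."""
    return [None if a is None or b is None else int(op(a, b))
            for a, b in zip(left, right)]

def logical_and(left, right):
    """Entry-by-entry 'and'; a missing operand gives a missing result."""
    return [None if a is None or b is None else int(bool(a) and bool(b))
            for a, b in zip(left, right)]

def column_sum(column):
    """Sum of the non-missing entries."""
    return sum(v for v in column if v is not None)

greater = compare(lambda a, b: a > b, stats, calculus)        # c2 > c3
both = logical_and(greater,
                   compare(lambda a, b: a <= b, stats, physics))  # and c2 <= c4

print(greater, column_sum(greater))
print(both, column_sum(both))
```

Summing the indicator column is what turns a yes/no question about each row into a count for the whole worksheet.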
11.1 Coding
The Manip > Code command is used to recode columns. By this we mean that data entries in columns are replaced by new values according to a coding scheme that we must specify. You can recode numeric into numeric, numeric into text, text into numeric, or text into text by choosing an appropriate subcommand. For example, suppose in the marks worksheet we want to recode the grades in C2, C3, and C4 so that any mark in the range 0–39 becomes an F, every mark in the range 40–49 becomes an E, every mark in the range 50–59 becomes a D, every mark in the range 60–69 becomes a C, every mark in the range 70–79 becomes a B, every mark in the range 80–100 becomes an A, and the results are placed in columns C6, C7, and C8, respectively. Then the command Manip > Code > Numeric to Text brings up the dialog box shown in Display I.29. The ranges for the numeric values to be recoded to a common text value are typed in the Original values box, and the new values are typed in the New box. Note that we have used the shorthand for describing a range of data values discussed in Section 7.2. Because the sixth entry of C4 is *, i.e., it is missing, this value is simply recoded as a blank. You can also recode missing values by including * in one of the Original values boxes. If a value in a column is not covered by one of the values in the Original values boxes, then it is simply left the same in the new column.
Display I.29: Dialog box for recoding numeric values to text values.
Note that this menu command restricts the number of new code values to 8. The session command code allows up to 50 new codes. For example, suppose in the marks worksheet we want to recode the grades in C2, C3, and C4 so that any mark in the range 0–9 becomes a 0, every mark in the range 10–19 becomes 10, etc., and the results are placed in columns C6, C7, and C8. The following command
    MTB > code(0:9) to 0 (10:19) to 10 (20:29) to 20 (30:39) to 30 &
    CONT> (40:49) to 40 (50:59) to 50 (60:69) to 60 (70:79) to 70 &
    CONT> (80:89) to 80 (90:99) to 90 for C2-C4 put in C6-C8
accomplishes this. Note the use of the continuation symbol &, as this is a long command. The general syntax for the code command is
    code (V1) to code1 ... (Vn) to coden for E1 ... Em put in Em+1 ... E2m
where Vi denotes a set of possible values and ranges for the values in columns E1 ... Em that are all coded as the number codei, and the results of this coding are placed in the columns Em+1 ... E2m, i.e., the recoded E1 is placed in Em+1, etc.
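Both coding schemes discussed in this section follow the same pattern: look up the range containing a value and substitute the new code. A minimal Python sketch of that pattern is given below; the function recode and the rule lists are our own names, None stands for a missing value, and leaving uncovered values unchanged follows the description in the text.

```python
def recode(column, rules):
    """Replace each value by the code of the first range that contains it.

    rules is a list of ((low, high), new_value) pairs with inclusive ranges.
    Missing values (None) stay missing; values covered by no range are kept.
    """
    recoded = []
    for value in column:
        if value is None:
            recoded.append(None)
            continue
        for (low, high), new_value in rules:
            if low <= value <= high:
                recoded.append(new_value)
                break
        else:
            recoded.append(value)   # not covered by any range: left the same
    return recoded

# Numeric to text: the letter-grade scheme.
letter_rules = [((0, 39), "F"), ((40, 49), "E"), ((50, 59), "D"),
                ((60, 69), "C"), ((70, 79), "B"), ((80, 100), "A")]
# Numeric to numeric: (0:9) to 0, (10:19) to 10, ..., (90:99) to 90.
decade_rules = [((low, low + 9), low) for low in range(0, 100, 10)]

stats   = [81, 75, 77, 63, 71, 87, 23, 67, 81, 74]
physics = [78, 62, 81, 55, 67, None, 35, 81, 88, 92]

print(recode(stats, letter_rules))
print(recode(physics, letter_rules))
print(recode(stats, decade_rules))
```

Note that a mark of exactly 100 is not covered by the decade rules, so under this scheme it would be carried over unchanged.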
In the session environment, the concatenate command is available for this operation. The general syntax of the concatenate command is
    concatenate E1 ... Em in Em+1
where E1, ..., Em are text columns, and Em+1 is the target text column.
Display I.32: Dialog box for converting text column c5 of the marks worksheet into a numeric column with the conversion table given in columns c6 and c7.
The general syntax for the corresponding session command convert is
    convert E1 E2 E3 E4
where E1, E2 are the columns containing the conversion table, E3 is the column to be converted, and E4 is the column containing the converted column.
11.4 History
Minitab keeps a record of the commands you have used and the data you have input in a session. This information can be obtained in the History folder of the Project Manager window. The commands can be copied from wherever they are listed and pasted into the Session window to be reexecuted, so that a number of commands can be executed at once without retyping. These commands can be edited before being executed again. This is very helpful when you have implemented a long sequence of commands and realize that you made an error early on. Note that even if you use the menu commands, a record is kept only of the corresponding session commands. The journal command is available in the Session window if you want to keep a record of the commands in an external file. For example,
    MTB > journal 'comm1'
    Collecting keyboard input (commands and data) in file: comm1.MTJ
    MTB > read c1 c2 c3
    DATA> 1 2 3
    DATA> end
    1 rows read.
    MTB > nojournal
puts
    read c1 c2 c3
    1 2 3
    end
    nojournal
into the file comm1.mtj. The history is turned off as soon as the nojournal command is typed.
    rank E1 E2
where E1 is the column whose ranks we want to compute, and E2 is the column that will hold the computed ranks.
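To show what a column of ranks looks like, here is a short Python sketch applied to the Statistics marks. The function name is ours. Giving tied values the average of the ranks they occupy is an assumption about how ties are treated; check Help for the rule the rank command actually uses.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) to largest, averaging ranks for ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # average of the ranks i+1, ..., j+1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks

stats = [81, 75, 77, 63, 71, 87, 23, 67, 81, 74]
print(average_ranks(stats))
```

The two marks of 81 occupy the eighth and ninth positions in sorted order, so under this rule each receives the rank 8.5.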
The general syntax of the corresponding session command sort is
    sort E1 E2 ... Em Em+1 ... E2m
where E1 is the column to be sorted, and E2, ..., Em are carried along with the results placed in columns Em+1, ..., E2m. Note that this sort can also be accomplished using the by subcommand, where the general syntax is
    sort E1 E2 ... Em Em+1 ... E2m;
    by E2m+1 ... En.
where now we sort by columns E2m+1, ..., En, sorting first by E2m+1, then E2m+2, etc., carrying along E1, ..., Em and placing the result in Em+1, ..., E2m. The descending subcommand can also be used to indicate which sorting variables we want to use in descending order rather than ascending order.
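The idea of sorting one column while carrying others along can be sketched in Python as follows. The function name and its descending flag are ours, the sketch does not handle missing values, and the order in which tied rows appear is a property of this sketch rather than a statement about Minitab.

```python
def sort_carry(key_column, *carried, descending=False):
    """Sort key_column and reorder the carried columns in the same way."""
    order = sorted(range(len(key_column)), key=lambda i: key_column[i],
                   reverse=descending)
    return [[column[i] for i in order] for column in (key_column, *carried)]

studid = [12389, 97658, 53546, 55542, 11223, 77788, 44567, 32156, 33456, 67945]
stats  = [81, 75, 77, 63, 71, 87, 23, 67, 81, 74]

# Sort the Statistics marks and carry the student numbers along.
sorted_stats, sorted_ids = sort_carry(stats, studid)
print(sorted_stats)
print(sorted_ids)

# The same sort in descending order.
top_stats, top_ids = sort_carry(stats, studid, descending=True)
print(top_stats[0], top_ids[0])
```

Reordering every column by the same permutation is what keeps each student number attached to its own mark after the sort.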
In the Session window, this same result can be obtained using the stack command. The general syntax for the stack command is given by
    stack E1 E2 ... Em into Em+1
where E1, E2, ..., Em denote the columns or constants to be stacked one on top of the other, starting with E1, and with the result placed in column Em+1. If we
want to keep an index of where the values came from, then use the subcommand
    subscripts Em+2
which results in index values being stored in column Em+2. To unstack values in a column by the values in an index column, we use the Manip > Unstack command. For example, given the columns C6 and C7 of the marks worksheet as described above, the dialog box shown in Display I.36 unstacks C6 into three columns by the values in C7. The three columns are C8, C9, and C10. Note that they are identical to columns C2, C3, and C4, respectively. We must always specify a column containing the subscripts when unstacking a column.
The general syntax for the corresponding session command unstack is
    unstack E1 into E2 ... Em;
    subscripts Em+1.
where E1 is the column to be unstacked, E2, ..., Em are the columns and constants to contain the unstacked column, and Em+1 gives the subscripts 1, 2, ... that indicate how E1 is to be unstacked. Note that it is also possible to simultaneously unstack blocks of columns. We refer the reader to help or Help for information on this.
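Stacking and unstacking are inverse operations as long as the subscripts are kept, which the following Python sketch illustrates. The function names are ours, and None stands for the missing value *.

```python
def stack(*columns):
    """Place the columns one on top of the other and record where each came from."""
    values, subscripts = [], []
    for index, column in enumerate(columns, start=1):
        values.extend(column)
        subscripts.extend([index] * len(column))
    return values, subscripts

def unstack(values, subscripts):
    """Split a stacked column back into one column per subscript value."""
    groups = {}
    for value, subscript in zip(values, subscripts):
        groups.setdefault(subscript, []).append(value)
    return [groups[s] for s in sorted(groups)]

stats    = [81, 75, 77, 63, 71, 87, 23, 67, 81, 74]
calculus = [85, 72, 83, 42, 82, 56, 45, 72, 77, 91]
physics  = [78, 62, 81, 55, 67, None, 35, 81, 88, 92]

c6, c7 = stack(stats, calculus, physics)   # stacked values and their subscripts
c8, c9, c10 = unstack(c6, c7)              # recover the three original columns
print(len(c6), c7[:3], c7[-3:])
print(c8 == stats, c9 == calculus, c10 == physics)
```

Without the subscript column there would be no way to tell where one original column ends and the next begins, which is why it must always be specified when unstacking.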
12 Exercises
1. The following data give the Hi and Low trading prices in Canadian dollars for various stocks on a given day on the Toronto Stock Exchange. Create a worksheet, giving the columns the same variable names, using any of the methods discussed in I.7. Be careful to ensure that the value of the variable stock starts with a letter. Print the worksheet to check that you have successfully entered it. Save the worksheet giving it the name stocks.
Stock  Hi      Low
ACR    7.95    7.80
MGI    4.75    4.00
BLD    112.25  109.75
CFP    9.65    9.25
MAL    8.25    8.10
CM     45.90   45.30
AZC    1.99    1.93
CMW    20.00   19.00
AMZ    2.70    2.30
GAC    52.00   50.25
2. Retrieve the worksheet stocks created in Exercise 1. Change the Low value in the stock MGI to 3.95. Calculate the average of the Hi and Low prices for all the stocks, and save this in a column called average. Calculate the average of all the Hi prices, and save this in a constant called avhi. Similarly, do this for all the Low prices, and save this in a constant called avlo. Save the worksheet using the same name. Write all the columns out to a file called stocks.dat. Print the file stocks.dat on your system printer.
3. Retrieve the worksheet created in Exercise 2. Using the Minitab commands discussed in I.10, calculate the number of stocks in the worksheet whose average is greater than $5.00 and less than or equal to $45.00.
4. Using the worksheet created in Exercise 2, insert the following stocks at the beginning of the worksheet.
Stock  Hi     Low
CLV    1.85   1.78
SIL    34.00  34.00
AC     14.45  14.05
5. Using the worksheet created in Exercise 4, sort the stocks into alphabetical order. Calculate the ranks of the individual stocks based on their Hi price, and save the ranking in a new column. Save the worksheet.
6. Using the worksheet created in Exercise 5, calculate the average Hi price of all the stocks beginning in A.
7. Using the worksheet created in Exercise 5, recode all the Low prices in the range $0–9.99 as 1, in the range $10–39.99 as 2, and greater than or equal to $40 as 3, and save the recoded variable in a new column.
8. Using patterned data input, place the values from -10 to 10 in increments of .1 in C1. For each of the values in C1, calculate the value of the quadratic polynomial \(2x^2 + 4x - 3\) (i.e., substitute the value in each entry in C1 into this expression) and place these values in C2. Using Minitab commands and the values in C1 and C2, estimate the point in the range from -10 to 10 where this polynomial takes its smallest value and what this smallest value is. Using Minitab commands and the values in C1 and C2, estimate the points in the range from -10 to 10 where this polynomial is closest to 0.
9. Using patterned data input, place values in the range from 0 to 5 using an increment of .01 in C1. Calculate the value of \(1 - e^{-x}\) for each value in C1, and place the result in C2. Using Minitab commands, find the largest value in C1 where the corresponding entry in C2 is less than or equal to .5. Note that \(e^{x}\) corresponds to the exponentiate command (see Appendix B.1) evaluated at x.
10. Using patterned data input, place values in the range from -4 to 4 using an increment of .01 in C1. Calculate the value of
\[ \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \]
for each value in C1, and place the result in C2, where \(\pi = 3.1415927\). Using parsums (see Appendix B.1), calculate the partial sums for C2, and place the result in C3. Multiply C3 times .01. Find the largest value in C1 such that the corresponding entry in C3 is less than or equal to .25.
Part II
Chapter 1
Looking at Data - Distributions
New Minitab commands discussed in this chapter
Calc > Probability Distributions > Normal
File > Open Graph
File > Save Graph As
Graph > Boxplot
Graph > Chart
Graph > Dotplot
Graph > Histogram
Graph > Pie Chart
Graph > Probability Plot
Graph > Stem-and-Leaf
Graph > Time Series Plot
Manip > Code
Stat > Basic Statistics > Display Descriptive Statistics
Stat > Basic Statistics > Store Descriptive Statistics
Stat > Tables > Tally
This chapter of IPS is concerned with the various ways of presenting and summarizing a data set. By presenting data, we mean convenient and informative methods of conveying the information contained in a data set. There are two basic methods for presenting data, namely graphically and through tabulations. Still, it can be hard to summarize exactly what these presentations are saying about the data. So the chapter also introduces various summary statistics that are commonly used to convey meaningful information in a concise way. All of these topics can involve much tedious, error-prone calculation if we were to insist on doing them by hand. An important point is that you should
almost never rely on hand calculation in carrying out a data analysis. Not only are there many far more important things for you to be thinking about, as the text discusses, but you are also likely to make an error. On the other hand, never blindly trust the computer! Check your results and make sure that they make sense in light of the application. For this, a few simple hand calculations can prove valuable. In working through the problems in IPS, you should try to use Minitab as much as possible, as this will increase your skill with the package and inevitably make your data analyses easier and more effective.
1.1 Tabulating and Summarizing Data
If a variable is categorical, we construct a table using the values of the variable and record the frequency (count) of each value in the data and perhaps the relative frequency (proportion) of each value in the data as well. These relative frequencies then serve as a convenient summarization of the data. If the variable is quantitative, we typically group the data in some way, i.e., divide the range of the data into nonoverlapping intervals and record the frequency and proportion of values in each interval. Grouping is accomplished using the Manip > Code command discussed in I.11.1. If the values of a variable are ordered, we can record the cumulative distribution, namely the proportion of values less than or equal to each value. Quantitative variables are always ordered but sometimes categorical variables are as well, e.g., when a categorical variable arises from grouping a quantitative variable. Often, it is convenient with quantitative variables to record the empirical distribution function, which for data values \(x_1, \ldots, x_n\) and at a value \(x\) is given by
\[ \hat{F}(x) = \frac{\#\text{ of } x_i \le x}{n}, \]
i.e., \(\hat{F}(x)\) is the proportion of data values less than or equal to \(x\). We can summarize such a presentation via the calculation of a few quantities such as the first quartile, the median, and the third quartile, or present the mean and the standard deviation. We introduce some new commands to carry out the necessary computations using the data shown in Table 1.1. This is data collected by A.A. Michelson and Simon Newcomb in 1882 concerning the speed of light. We will refer to this hereafter as Newcomb's data and place these in the column C1 with the name time in the worksheet called newcomb.
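The definition of the empirical distribution function translates directly into a few lines of Python. The sketch below is only an illustration of the formula; the function name is ours and the sample is a small made-up data set, not Newcomb's data.

```python
def empirical_cdf(data):
    """Return the function x -> proportion of data values <= x."""
    n = len(data)

    def f_hat(x):
        return sum(1 for xi in data if xi <= x) / n

    return f_hat

# A small made-up sample, used only to illustrate the definition.
sample = [3, 1, 4, 1, 5, 9, 2, 6]
f_hat = empirical_cdf(sample)

for x in (0, 1, 4, 9):
    print(x, f_hat(x))
```

The function is 0 below the smallest data value, jumps by (number of ties)/n at each distinct data value, and equals 1 from the largest data value onward.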
1.1.1
Tallying Data
The Stat > Tables > Tally command tabulates categorical data. Consider Newcomb's measurements in Table 1.1. These data range from -44 to 40 (use minimum and maximum in Calc > Calculator to calculate these values). Suppose we decide to group these into the intervals (-50, 0], (0, 20], (20, 25], (25, 30], (30, 35], (35, 40]. Next we want to record the frequencies, relative frequencies, cumulative frequencies, and cumulative distribution of this grouped variable. First, we used the Manip > Code > Numeric to Numeric command, as described in I.11.1, to recode the data so that every value in (-50, 0] is given the value 1, every value in (0, 20] is given the value 2, etc., and these values are placed in C2. The dialog box for doing this is shown in Display 1.1.
Display 1.2: Dialog box for tallying the variable C2 in the newcomb worksheet.

Next we used the Stat > Tables > Tally command, with the dialog box shown in Display 1.2, to produce the output

    C2  Count  Percent  CumCnt  CumPct
     1      2     3.03       2    3.03
     2      4     6.06       6    9.09
     3     17    25.76      23   34.85
     4     26    39.39      49   74.24
     5     10    15.15      59   89.39
     6      7    10.61      66  100.00
    N=     66

in the Session window.

We can also use the Stat > Tables > Tally command to compute the empirical distribution function of C1 in the newcomb worksheet. First, we must sort the values in C1, from smallest to largest, using the Manip > Sort command described in I.11.6, and then we apply the Stat > Tables > Tally command to this sorted variable.

The general syntax of the corresponding session command tally is

    tally E1 ... Em

where E1, ..., Em are columns of categorical variables, and the command is applied to each column. If no subcommands are given, then only frequencies are computed, while the subcommand percents computes relative frequencies, cumcnts computes the cumulative frequency function, and cumpcts computes the cumulative distribution. Any of the subcommands can be dropped. For example, the commands

    MTB >sort c1 c3
    MTB >tally c3;
    SUBC>cumpcts;
    SUBC>store c4 c5.

first use the sort command to sort the data in C1 from smallest to largest and place the results in C3. The cumulative distribution is computed for the values in C3 with the unique values in C3 stored in C4 and the cumulative distribution at each of the unique values stored in C5 via the store subcommand to tally.
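The Percent, CumCnt, and CumPct columns of the tally output can be recomputed from the Count column alone. The following Python sketch is offered only as an independent check on the arithmetic; it is not Minitab code, and the variable names are our own.

```python
from itertools import accumulate

# Count column of the tally output for the recoded variable C2.
counts = {1: 2, 2: 4, 3: 17, 4: 26, 5: 10, 6: 7}

n = sum(counts.values())                          # total number of observations
cum_counts = list(accumulate(counts.values()))    # running totals (CumCnt)
percents = [round(100 * c / n, 2) for c in counts.values()]
cum_percents = [round(100 * c / n, 2) for c in cum_counts]

for code, cnt, pct, cc, cp in zip(counts, counts.values(),
                                  percents, cum_counts, cum_percents):
    print(f"{code:2d} {cnt:5d} {pct:7.2f} {cc:6d} {cp:7.2f}")
```

Each cumulative percent is the cumulative count divided by N = 66, so the last entry is always 100.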
1.1.2
Describing Data
The Stat > Basic Statistics > Display Descriptive Statistics command is used with quantitative variables to present a numerical summary of the variable values. These values are in a sense a summarization of the empirical distribution of the variable. For example, in the newcomb worksheet the dialog box shown in Display 1.3 leads to the output

    Variable   N   Mean  Median  TrMean  StDev  SE Mean
    time      66  26.21   27.00   27.40  10.75     1.32

    Variable  Minimum  Maximum     Q1     Q3
    time       -44.00    40.00  24.00  31.00

in the Session window. This provides the count N, the mean, median, trimmed mean TrMean (removes lower 5% and upper 5% of the data and averages the rest), standard deviation, standard error of the mean, minimum, maximum, first quartile Q1, and third quartile Q3 of the variable C1.

If we want such a summary of a variable by the values of another variable, we check the By variable box and indicate the by variable in the box to the right of this. For example, we might want such a summary for each of the groups we created in II.1.1, and so we would place C2 in this box. Note that a number of summary statistics can also be computed using the Column Statistics discussed in I.10.3.
Display 1.3: Dialog box for computing basic descriptive statistics of a quantitative variable.
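Two of the quantities connected with this output follow directly from the others, which makes for a quick sanity check. The sketch below is plain Python rather than Minitab, and it starts from the rounded values printed above, so agreement is only to the printed precision.

```python
import math

# Values copied from the descriptive statistics output for the variable time.
n = 66
stdev = 10.75
q1, q3 = 24.00, 31.00

# Standard error of the mean: standard deviation divided by the square root of N.
se_mean = stdev / math.sqrt(n)

# Interquartile range: third quartile minus first quartile.
iqr = q3 - q1

print(round(se_mean, 2), iqr)
```

The first value agrees with the SE Mean column of the output; the second is the interquartile range used later for boxplots.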
If we wish to compute some basic statistics and store these values for later use, then the Stat > Basic Statistics > Store Descriptive Statistics command is available for this. For example, with the newcomb worksheet this command leads to the dialog box shown in Display 1.4. Clicking on the Statistics button results in the dialog box of Display 1.5, where we have checked First quartile, Median, Third quartile, Interquartile range, and N nonmissing as the statistics we want to compute. The result of these choices is that the next available variables in the worksheet contain these values. So in this case, the values of C3-C7 are as depicted in Display 1.6. Note that these variables are now named as well, and that many more statistics are available using this command.
Display 1.4: Dialog box for computing and storing various descriptive statistics.
Display 1.5: Dialog box for choosing the descriptive statistics to compute and store.
Display 1.6: Values obtained for descriptive statistics using the dialog boxes in Displays 1.4 and 1.5.
The general syntax of the Session command describe, corresponding to Stat > Basic Statistics > Display Descriptive Statistics, is

    describe E1 ... Em

where E1, ..., Em are columns of quantitative variables and the command is applied to each column. A by subcommand can also be used. The stats command is available in the Session window if we want to store the values of statistics. We refer the reader to help for a description of this command.
1.2
Plotting Data in a Graph Window
One of the most informative ways of presenting data is via a plot. There are many different types of plots within Minitab, and which one to use depends on the type of variable you have and what you are trying to learn. In this section we describe how to use the plotting features in Minitab. There are, however, many features of plotting that we will not describe. For example, there are many graphical editing capabilities that allow you to add features, such as titles or legends. Some of these features are accessed via Graph > Layout. We refer the reader to Help for more details on these features.

Each plot in Minitab is made in a Graph window. You can make multiple plots and retain each Graph window until you want to delete it simply by clicking the × symbol in the upper right-hand corner. You make any particular Graph window active by clicking in it or by using the Window command. A plot can be saved in an external file in a variety of formats, such as Minitab graph .mgf, bitmap .bmp, JPEG .jpg, etc., using the File > Save Graph As command. If a graph has been saved in the .mgf format, it can be reopened using the File > Open Graph command.
1.2.1
Dotplots
The Graph > Dotplot command is used with quantitative variables and produces a plot of each data value as a dot along the x-axis so that you get a general idea of the location of the data and how much scatter there is. Actually, the data is grouped before plotting and multiple observations in a group are stacked over the x-axis. The interval between successive tick (+) marks on the x-axis is divided into 10 equal-length subintervals for the grouping. Typically, one also looks for points that are far from the main scatter of points as these may be identified as outliers and, as such, deleted from the data set for subsequent analysis. For example, for the newcomb worksheet the dialog box in Display 1.7 results in the plot of Display 1.8.

The general syntax of the corresponding Session command dotplot is

    dotplot E1 ... Em

where E1, ..., Em are columns, and a dotplot is produced for each. There are a number of subcommands available. The same subcommand ensures the scales of the dotplots are the same for each column. The by subcommand allows plotting of a variable by the values of another variable with all plots having the same scale. The increment subcommand allows for control of the distance between the tick marks, and start and end allow you to specify where the dotplot should begin and end. For example,

    MTB >dotplot c1;
    SUBC>increment=5;
    SUBC>start=20 end=35.

puts the tick marks 5 units apart, starts the plot at 20, and ends it at 35, so some points are not plotted in this case.
1.2.2
Stem-and-Leaf Plots
Stem-and-leaf plots are similar to histograms and are produced by the Graph > Stem-and-Leaf command. These plots are also referred to as stemplots as in IPS. For example, using this command with the newcomb worksheet produces the output in the Session window

    Stem-and-leaf of time  N = 66
    Leaf Unit = 1.0
       1  -4 4
       1  -3
       1  -2
       1  -1
       2  -0 2
       2   0
       5   1 669
     (41)  2 01122333444445555566666777777888888899999
      20   3 0001122222334666679
       1   4 0

which is a stem-and-leaf plot of the values in time. The first column gives the depths for a given stem, i.e., the number of observations on that line and below it or above it, depending on whether or not the observation is below or above the median. The row containing the median is enclosed in parentheses ( ), and the depth is only the observations on that line. If the number of observations is even and the median is the average of values on different rows, then parentheses do not appear. The second column gives the stems, as determined by Minitab, and the remaining columns give the ordered leaves, where each digit represents one observation. The Leaf Unit determines where the decimal place goes after each leaf. So in this example, the first observation is -44.0, while it would be -4.4 if the Leaf Unit were .1. Multiple stem-and-leaf plots can be carried out for a number of columns simultaneously and also for a single variable by the values of another variable.
1.2.3
Histograms
A histogram is a plot where the data are grouped into intervals, and over each such interval a bar is drawn of height equal to the frequency of data values in that interval or of height equal to the relative frequency (proportion) of data values in that interval or of height equal to the density of points in that interval, i.e., the proportion of points in the interval divided by the length of the interval. The Graph > Histogram command is used to obtain these plots. For example, using this command with the newcomb worksheet produces the dialog box shown in Display 1.9. We have placed the variable time in the first x box to indicate we want a histogram of this variable. We can produce multiple histograms by placing more variables in the x boxes.

To select the type of histogram to plot, we next click on the Options button, which produces the dialog box of Display 1.10. Here, we have selected a density histogram and have specified the intervals to use for grouping the data by specifying the cutpoints -45, -30, -15, 0, 15, 30, 45, which prescribe the intervals [-45, -30), [-30, -15), etc., for the grouping. Alternatively, we could have specified the midpoints of the grouping intervals. The advantage with cutpoints is that subintervals of unequal lengths can be specified. Clicking on the OK buttons in these boxes produces the histogram shown in Display 1.11. As can be seen from the dialog box of Display 1.9, there are a variety of methods for controlling the appearance of the histogram produced, and we refer the reader to the Help button for a description of these.
Display 1.9: Dialog box for creating a histogram of the time variable in the newcomb worksheet.
Display 1.10: Dialog box for selecting the type of histogram to plot.
Display 1.11: Density histogram of the time variable in the newcomb worksheet.
An important consideration when plotting multiple histograms is to ensure that all the histograms have the same x and y scales so that the plots are visually comparable. This can be accomplished from the dialog box shown in Display 1.9 by Frame > Multiple Graphs and then selecting Same X and same Y.

The session command histogram is also available. This has the general syntax

    histogram E1 ... Em

where E1, ..., Em correspond to columns. For example, the commands

    MTB >histogram c1;
    SUBC>cutpoints -45 -30 -15 0 15 30 45;
    SUBC>density.

produce the histogram in Display 1.11 using the cutpoints and density subcommands. There are also the subcommands midpoints; nintervals, which specifies the number of subintervals; and frequency or percent, which respectively ensure that the heights of the bars equal the frequency and relative frequency of the data values in the interval. Also, the cumulative subcommand is available so that the bars represent all the values less than or equal to the endpoint of an interval. The subcommand same ensures that multiple histograms all have the same scale.
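To make the definition of a density histogram concrete, here is a small Python sketch (not Minitab code) that computes the bar heights for a set of cutpoints. The function name density_heights and the ten data values are hypothetical; only the cutpoints are taken from the example above. Intervals are treated as closed on the left and open on the right, as in the text.

```python
from bisect import bisect_right


def density_heights(values, cutpoints):
    """Bar heights of a density histogram: proportion in [lo, hi) divided by hi - lo."""
    counts = [0] * (len(cutpoints) - 1)
    for v in values:
        k = bisect_right(cutpoints, v) - 1   # index of the interval containing v
        if 0 <= k < len(counts):             # values outside the cutpoints are skipped
            counts[k] += 1
    n = len(values)
    return [c / n / (hi - lo)
            for c, lo, hi in zip(counts, cutpoints, cutpoints[1:])]


# Hypothetical data values; the cutpoints are those used for Display 1.11.
values = [-44, -2, 16, 21, 24, 27, 29, 32, 36, 40]
cutpoints = [-45, -30, -15, 0, 15, 30, 45]

heights = density_heights(values, cutpoints)
print(heights)
```

Because each height is a proportion divided by an interval length, the areas of the bars add to 1 whenever every data value falls inside the cutpoints. This is what allows bars over intervals of unequal length to be compared fairly.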
1.2.4
Boxplots
Boxplots are useful summaries of a quantitative variable and are obtained using the Graph > Boxplot command. Boxplots are used to provide a graphical notion of the location of the data and its scatter in a concise and evocative way. For example, in the newcomb worksheet this command produces the dialog box shown in Display 1.12 and the plot in Display 1.13. The line in the center of the box is the median. The line below the median is the first quartile, also called the lower hinge, and the line above is the third quartile, also called the upper hinge. The difference between the third and first quartiles is called the interquartile range or IQR. The vertical lines from the hinges are called whiskers, and these run from the hinges to the adjacent values. The adjacent values are given by the greatest value less than or equal to the upper limit (the third quartile plus 1.5 times the IQR) and by the least value greater than or equal to the lower limit (the first quartile minus 1.5 times the IQR). The upper and lower limits are also referred to as the inner fences. The outer fences are defined by replacing the multiple 1.5 in the definition of the inner fences by 3.0. Values beyond the outer fences are plotted with a * and are called outliers. As with the plotting of histograms, multiple boxplots can be plotted for comparison purposes, and again, it is important to make sure that they all have the same scale.
Display 1.12: Dialog box for producing a boxplot of the time variable in the newcomb worksheet.
There is a corresponding session command called boxplot. We refer the reader to help for more discussion of this command.
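The fences described in this section are simple functions of the quartiles. The following Python sketch (not Minitab code) evaluates them using the values Q1 = 24 and Q3 = 31 reported for the time variable in II.1.1.2. It assumes that the hinges Minitab draws coincide with these reported quartiles, which we have not verified.

```python
# Quartiles, minimum, and maximum as reported by the descriptive statistics output.
q1, q3 = 24.0, 31.0
minimum, maximum = -44.0, 40.0

iqr = q3 - q1                                   # interquartile range

# Inner fences: quartiles minus/plus 1.5 times the IQR.
inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Outer fences: the same construction with the multiple 3.0.
outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)

print("inner fences:", inner)
print("outer fences:", outer)
print("minimum beyond lower outer fence:", minimum < outer[0])
print("maximum beyond upper outer fence:", maximum > outer[1])
```

Under that assumption, the smallest observation lies below the lower outer fence, while the largest observation lies inside the upper inner fence.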
1.2.5
Time Series Plots
Often, data are collected sequentially in time. In such a context, it is instructive to plot the values of quantitative variables against time in a time series plot. For this we use the Graph > Time Series Plot command. If we suppose that the data values in time of the newcomb worksheet were obtained in the order they are listed, then applying this command to that data with the dialog box as in Display 1.14 produces the time plot shown in Display 1.15. Notice that in the Data display box we have specified that the graph should plot a symbol for each point and that the symbols plotted should connect via lines. If we had left out connect, only the points would have been plotted. The lines help to visualize the form of the graph. The symbol plotted is a solid circle, but other choices could have been made using the Edit Attributes button. Also, for the Time Scale we have chosen Index, which is just the order in which the observations are listed. If these observations were made at periodic time intervals, there are other possible choices that could be more meaningful.
Display 1.14: Dialog box for a time series plot of the variable time from the newcomb worksheet.
Display 1.15: Time series plot of the variable time from the newcomb worksheet.
There is also a corresponding session command tsplot. We refer the reader to help for more discussion of this.
1.2.6
Bar Charts
It is also possible to produce various charts using the Graph > Chart command. For example, the dialog box shown in Display 1.16 plots a bar chart of the variable C2 in the newcomb worksheet. Each distinct value of C2 is plotted along the x-axis simply as a categorical value, not as a quantitative value, and a bar of height equal to the number of times that value occurs in the variable is drawn. A bar chart is a good way to plot categorical variables. There are many possibilities for the types of bar charts drawn, and we refer the reader to the Help button for a discussion of these.

The corresponding session command is

    chart E1

which produces a bar chart for the values in column E1.
1.2.7
Pie Charts
A pie chart is a disk divided up into wedges where each wedge corresponds to a unique value of a variable, and the area of the wedge is proportional to the relative frequency of the value with which it corresponds. Pie charts can be obtained via Graph > Pie Chart, and there are various features available in the dialog box that can be used to enhance these plots. Pie charts are a common method for plotting categorical variables.
1.3
The Normal Distribution
It is important in statistics to be able to do computations with the normal distribution. The equation of the density curve for the normal distribution with mean $\mu$ and standard deviation $\sigma$ is given by

$$\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{z-\mu}{\sigma}\right)^{2}}$$

where $z$ is a number. We refer to this as the $N(\mu, \sigma)$ density curve. Also of interest is the area under the density curve from $-\infty$ to a number $x$, i.e., the area between the graph of the $N(\mu, \sigma)$ density curve and the interval $(-\infty, x]$. As noted in IPS, this is a value between 0 and 1. Sometimes, we specify a value $p$ between 0 and 1 and then want to find the point $x_p$ such that $p$ of the area under the $N(\mu, \sigma)$ density curve lies over $(-\infty, x_p]$. The point $x_p$ is called the $p$th percentile of the $N(\mu, \sigma)$ density curve. Often, we are given a mean $\mu$ and a standard deviation $\sigma$ and asked to standardize a variable $x$ whose values are in some column, i.e., produce the new variable $z = (x - \mu)/\sigma$. These arithmetical operations can be carried out using the let command as described in I.10.1.
1.3.1
Calculating the Density
Suppose that we want to evaluate the $N(\mu, \sigma)$ density curve at a value $x$. For this, we use the Calc > Probability Distributions > Normal command. For example, the dialog box in Display 1.17 indicates that we want to evaluate the $N(10, 1)$ density curve at the value $x = 11.0$.

After clicking on the OK button the output

    Normal with mean = 10.0000 and standard deviation = 1.00000
          x   f( x )
    11.0000   0.2420

is printed in the Session window, which gives the value as .2420. Sometimes, we will want to evaluate the density curve at every value in a column of values, e.g., when we are plotting this curve. For this we simply click on the radio button Input column and type the relevant column in the associated box.

The general syntax of the corresponding session command pdf with the normal subcommand is

    pdf E1 ... Em into Em+1 ... E2m;
    normal mu = V1 sigma = V2.

where E1, ..., Em are columns or constants containing numbers, Em+1, ..., E2m are the columns or constants that store the values of the $N(\mu, \sigma)$ density curve at these numbers, and V1 = $\mu$ and V2 = $\sigma$. If no storage is specified, then the values are printed. For example, if we want to compute the $N(-.5, 1.2)$ density curve at every value between -3 and 3 in increments of .01, the commands

    MTB >set c1
    DATA>-3:3/.01
    DATA>end
    MTB >pdf c1 c2;
    SUBC>normal mu=-.5 sigma=1.2.

put the values between -3 and 3 in increments of .01 in C1 using the set command. The pdf command with the normal subcommand calculates the $N(-.5, 1.2)$ density curve at each of these values and puts the outcomes in the corresponding entries of C2. If we plot C2 against C1, we will have a plot of the density curve of this distribution. For this, we use the scatterplot facilities in Minitab as discussed in II.3. Note that with the normal subcommand we must also specify the mean and the standard deviation via mu and sigma.
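The density values above can be cross-checked by coding the formula of II.1.3 directly. The Python sketch below is not Minitab code; the function name normal_pdf and the lists grid and density are our own, with grid and density playing the roles of C1 and C2 in the session commands.

```python
import math


def normal_pdf(x, mu, sigma):
    """Height of the N(mu, sigma) density curve at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# Single value: the N(10, 1) density at x = 11, as in Display 1.17.
print(round(normal_pdf(11.0, 10.0, 1.0), 4))

# Grid from -3 to 3 in increments of .01 (the role of C1) and the
# N(-.5, 1.2) density at each grid point (the role of C2).
grid = [round(-3 + 0.01 * k, 2) for k in range(601)]
density = [normal_pdf(x, -0.5, 1.2) for x in grid]
print(len(grid), grid[0], grid[-1])
```

The single value agrees with the .2420 printed in the Session window, and the grid has 601 points, with the largest density occurring at the mean -.5.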
1.3.2
Calculating the Distribution Function
Suppose that we want to evaluate the area under the $N(\mu, \sigma)$ density curve over the interval $(-\infty, x]$. This is the value of the cumulative distribution function of the $N(\mu, \sigma)$ distribution at the value $x$. For this, we use the Calc > Probability Distributions > Normal command as well, but in this case in the dialog box of Display 1.17 we select Cumulative probability instead. Making this change in the dialog box of Display 1.17, we get the output

          x   P( X <= x )
    11.0000        0.8413

in the Session window. Again, we can evaluate this function at a single point or at every value in a variable.

The general syntax of the corresponding session command cdf with the normal subcommand is

    cdf E1 ... Em into Em+1 ... E2m;
    normal mu = V1 sigma = V2.

where E1, ..., Em are columns or constants containing numbers, Em+1, ..., E2m are the columns or constants that store the values of the area under the $N(\mu, \sigma)$ density curve over the interval from $-\infty$ to these numbers, and V1 = $\mu$ and V2 = $\sigma$. If no storage is specified, the values are printed.
1.3.3
Calculating the Inverse Distribution Function
Suppose that we want to evaluate percentiles for the $N(\mu, \sigma)$ density curve. Again, we use the Calc > Probability Distributions > Normal command, but in this case, in the dialog box of Display 1.17 we select Inverse cumulative probability instead. Making this change in the dialog box of Display 1.17 and replacing 11 by .75 (recall that the argument to this function must be between 0 and 1), we get the output

    P( X <= x )         x
         0.7500   10.6745

in the Session window. This indicates that the area to the left of 10.6745 underneath the $N(10, 1)$ density curve is .75.

The general syntax of the corresponding session command invcdf with the normal subcommand is

    invcdf E1 ... Em into Em+1 ... E2m;
    normal mu = V1 sigma = V2.

where E1, ..., Em are columns or constants containing numbers between 0 and 1, Em+1, ..., E2m are the columns or constants that store the values of the percentiles of the $N(\mu, \sigma)$ density curve at these numbers, and V1 = $\mu$ and V2 = $\sigma$. If no storage is specified, then the values are printed.
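The cumulative probability and the percentile printed in II.1.3.2 and II.1.3.3 can be checked independently with the NormalDist class in Python's standard statistics module. This is a cross-check under the assumption that the dialog box of Display 1.17 specifies the N(10, 1) distribution; it is not a substitute for the Minitab commands.

```python
from statistics import NormalDist

# The N(10, 1) distribution used in Display 1.17.
dist = NormalDist(mu=10.0, sigma=1.0)

# Area under the density curve over (-infinity, 11]; compare with cdf.
area = dist.cdf(11.0)

# Point with .75 of the area to its left; compare with invcdf.
quantile = dist.inv_cdf(0.75)

# Standardizing x = 11 gives z = (x - mu) / sigma.
z = (11.0 - dist.mean) / dist.stdev

print(round(area, 4), round(quantile, 4), z)
```

Applying the cumulative distribution function to the percentile returns .75, which is the sense in which invcdf inverts cdf.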
1.3.4
Normal Probability Plots
Some statistical procedures require that we assume that values for some variables are a sample from a normal distribution. A normal probability plot is a diagnostic that checks for the reasonableness of this assumption. To create such a plot, we use the Graph > Probability Plot command. For example, using this command on the newcomb worksheet we get the dialog box in Display 1.18, where we have placed time in the Variables box. Clicking on the OK button produces the plot in Display 1.19. The normal probability plot is given by the dark dotted curve. The plot also contains other information and further output is printed in the Session window. If the normality assumption is reasonable, the plot should look like a straight line, and it does not in this case.
Display 1.19: Normal probability plot of the time variable in the newcomb worksheet.
The session commands

    MTB >nscores c1 c3
    MTB >plot c3*c1

produce a normal probability plot like that shown in Display 1.19. The plot command will be discussed much more extensively in II.3. The nscores (normal scores) command relies on some concepts that are beyond the level of this course, so we do not discuss this further.
1.4
Exercises
When the data for an exercise come from an exercise in IPS, the IPS exercise number is given in parentheses ( ). All computations in these exercises are to be carried out using Minitab, and the exercises are designed to ensure that you have a reasonable understanding of the Minitab material in this chapter. Generally, you should be using Minitab to do all the computations and plotting required for the problems in IPS.

1. Using Newcomb's measurements in Table 1.1, create a new variable by grouping these values into three subintervals [-50, 0), [0, 20), [20, 50). Calculate the frequency distribution, the relative frequency distribution, and the cumulative distribution of this ordered categorical variable.
2. (1.21) Use Minitab to print the empirical distribution function. From this, determine the first quartile, median, and third quartile. Also, use the empirical distribution function to compute the 10th and 90th percentiles.
3. Use Minitab to produce the stemplot of Example 1.4 of IPS.
4. Use Minitab to produce the time plot of Example 1.5 of IPS.
5. (1.29) Use Minitab commands for the stemplot and the time plot. Use Minitab commands to compute a numerical summary of this data, and justify your choices.
6. (1.30) Transform the data in this problem by subtracting 5 from each value and multiplying by 10. Calculate the means and standard deviations, using any Minitab commands, of both the original and transformed data. Compute the ratio of the standard deviation of the transformed data to the standard deviation of the original data. Comment on this value.
7. (1.30) Transform this data by multiplying each value by 3. Compute the ratio of the standard deviation to the mean (called the coefficient of variation) for the original data and for the transformed data. Justify the outcome.
8. For the N(6, 1.1) density curve, compute the area between the interval (3, 5) and the density curve. What number has 53% of the area to the left of it for this density curve?
9. Use Minitab commands to verify the 68-95-99.7 rule for the N(2, 3) density curve.
10. Calculate and store the values of the N(0, 1) density curve at each value in [-3, 3] using an increment of .01. Put the values in the interval [-3, 3] in C1 and the values of the density curve in C2. Using the command plot C2*C1, plot the density curve. Comment on the shape of this curve.
11. Use Minitab commands to make the normal quantile plots presented in Figures 1.31 and 1.32 of IPS.
Chapter 2
Looking at Data - Relationships
New Minitab commands discussed in this chapter

    Graph > Plot
    Stat > Basic Statistics > Correlation
    Stat > Regression > Fitted Line Plot
    Stat > Regression > Regression

In this chapter, Minitab commands are described that permit the analysis of relationships between two variables. The methods are different depending on whether both variables are quantitative, both variables are categorical, or one is quantitative and the other is categorical. This chapter considers relationships between two quantitative variables with the remaining cases discussed in later chapters. Graphical methods are very useful in looking for relationships among variables, and we examine various plots for this.
2.1
Scatterplots
A scatterplot of two quantitative variables is a useful technique when looking for a relationship between two variables. By a scatterplot we mean a plot of one variable on the y-axis against the other variable on the x-axis. For example, consider Example 2.4 in IPS, where we are concerned with the relationship between the length of the femur and the length of the humerus in an extinct species. Suppose that we have input the data so that the length of the femur measurements are in C1, which has been named femur, and the length of the humerus measurements are in C2, which has been named humerus, of the worksheet archaeopteryx. The command Graph > Plot produces the dialog box of Display 2.1, where we have placed femur into the first box for the y variable and humerus in the first box for the x variable. This produces the plot shown in Display 2.2.

Note that we could alter the plotting symbol using the dialog box that appears when we click on the Edit Attributes box. Using the dialog box that appears when you click on the Annotation button, it is possible to give the plot a title, label plotted points, etc. Using the dialog box that appears when you click on the Frame button, you can change the labels on the axes. Rather than just plotting the points in a scatterplot, you can add connection lines (join the points with lines), add projection lines (drop a line from each point to the x-axis), and add areas (fill in the area under a polygon joining the points). Also, you can employ the scatterplot smoother lowess to plot a piecewise linear continuous curve through the scatter of points. These features are available via Graph > Plot > Display. There are a number of other features that allow you to control the appearance of the plot.
Display 2.2: Scatter plot of femur length (C1) versus humerus length (C2) of Example 2.4 in IPS.
It is also possible to have multiple scatterplots on the same plot. For example, suppose that C3 in the archaeopteryx worksheet contains the natural log of the femur variable. We obtained the plot of Display 2.3 by adding another pair of variables to the second Graph variables box as in Display 2.1 with C3 as the y variable and humerus as the x variable. To put these scatterplots on the same plot use Frame > Multiple Graphs and click on the Overlay graphs on the same page radio button.
[Display 2.3: overlaid scatterplots with femur on the y-axis and humerus on the x-axis.]
The technique of brushing is available after obtaining the plot to see which observations (rows) the points correspond to. This is helpful in identifying the points that correspond to outliers. Brushing is accessed from the toolbar just below the menu bar by clicking on the brush when the Graph window is active.

The corresponding session command is plot. For example,

    MTB > plot femur*humerus

produces the plot of Display 2.2. Note that the first variable is plotted along the y-axis, and the second variable is plotted along the x-axis. There are various subcommands that can be used with plot, and we refer the reader to Help for a description of these.

There are a number of additional plots available in Minitab that are related to the scatterplot. For example, a marginal plot of two variables is a scatterplot of one variable against the other where in addition histograms, dotplots or boxplots are plotted along the sides of the scatterplot for each variable. These are available via the menu command Graph > Marginal Plot. Draftsman plots allow you to produce a number of scatterplots in a rectangular array so that they can be compared. For example, you may want to plot C1 against C3, C2 against C3, C1 against C4, and C2 against C4 and see all of these in a common plot. This capability is available via the menu command Graph > Draftsman Plot and filling in the dialog box. Matrix plots provide a mechanism for placing a number of scatterplots in a rectangular array or matrix so that they can be directly compared or examined for relationships. Matrix plots are available via the command Graph > Matrix Plot. Also, three-dimensional scatterplots are available via Graph > 3D Plot and contour plots via Graph > Contour Plot.
2.2
Correlations
While a scatterplot is a convenient graphical method for assessing whether or not there is any relationship between two variables, we would also like to assess this numerically. The correlation coefficient provides a numerical summarization of the degree to which a linear relationship exists between two quantitative variables, and this can be calculated using the Stat > Basic Statistics > Correlation command. For example, applying this command to the femur and humerus variables of the worksheet archaeopteryx, i.e., the data of Example 2.4 in IPS and depicted in Display 2.2, we obtain the output

    Pearson correlation of femur and humerus = 0.994
    P-Value = 0.001

in the Session window. For now, we ignore the number recorded as P-Value.

The general syntax of the corresponding session command correlate is given by

    correlate E1 ... Em

where E1, ..., Em are columns corresponding to numerical variables, and a correlation coefficient is computed between each pair. This gives m(m - 1)/2 correlation coefficients. The subcommand nopvalues is available if you want to suppress the printing of P-values.
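As a hand check on the correlation output, the Python sketch below computes the Pearson correlation from its definition. The five femur and humerus pairs are the values as we recall them from Example 2.4 of IPS; they are not listed in this manual, so treat them as an assumption and confirm them against the text. The function name pearson_r is our own.

```python
import math

# Assumed data (recalled from IPS Example 2.4; verify against the text).
femur = [38, 56, 59, 64, 74]
humerus = [41, 63, 70, 72, 84]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(humerus, femur)
print(round(r, 3))

# Number of distinct pairs among m columns, e.g., m = 4.
m = 4
pairs = m * (m - 1) // 2
print(pairs)
```

With these values the correlation rounds to the 0.994 printed in the Session window. The correlation is symmetric, so the order of the two variables does not matter.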
2.3
Regression
Regression is another technique for assessing the strength of a linear relationship existing between two variables and it is closely related to correlation. For this, we use the Stat > Regression command. As noted in IPS, the regression analysis of two quantitative variables involves computing the least-squares line y = a + bx, where one variable is taken to be the response variable y and the other is taken to be the explanatory variable x. Note that the least-squares line is different depending upon which choice is made. For example, for the data of Example 2.4 in IPS and plotted in Display 2.2, letting femur be the response and humerus be the predictor or explanatory variable, the Stat > Regression > Regression command leads to the dialog box of Display 2.4, where we have made the appropriate entries in the Response and Predictors boxes. Clicking on the OK button leads to the output of Display 2.5 being printed in the Session window. This gives the least-squares line as y = 3.70 + .826x, i.e., a = 3.70 and b = .826, which we also see under the Coef column in the first table. In addition, we obtain the value of the square of the correlation coefficient, also known as the coefficient of determination, as R-Sq = 98.8%. We will discuss the remaining output from this command in II.10.
It is very convenient to have a scatterplot of the points together with the least-squares line. This can be accomplished using the Stat ▸ Regression ▸ Fitted Line Plot command. Filling in the dialog box for this command as in Display 2.4 produces the output of Display 2.5 in the Session window together with the plot of Display 2.6. There are some additional quantities that are often of interest in a regression analysis. For example, you may wish to have the fitted values $\hat{y} = a + bx$ at each x value printed, as well as the residuals $y - \hat{y}$. Clicking on the Results button in the dialog box of Display 2.4 and filling in the ensuing dialog box as in Display 2.7 results in these quantities being printed in the Session window along with the output of Display 2.5.
Display 2.6: Scatterplot of femur versus humerus in the archaeopteryx worksheet together with the least-squares line.
Display 2.7: Dialog box for controlling output for a regression analysis.
You will probably want to keep these values for later work. In this case, clicking on the Storage button of Display 2.4 and filling in the ensuing dialog box as in Display 2.8 results in these quantities being saved in the next two available columns (in this case, C3 and C4) with the names resi1 and fits1 for the residuals and fits, respectively.
Display 2.8: Dialog box for storing various quantities computed as part of a regression analysis.
Even more likely is that you will want to plot the residuals as part of assessing whether or not the assumptions that underlie a regression analysis make sense in the particular application. For this, click on the Graphs button in the dialog box of Display 2.4. The dialog box of Display 2.9 becomes available. Notice that we have requested that the standardized residuals (each residual divided by its standard error) be plotted, and this plot appears in Display 2.10. All the standardized residuals should be in the interval (−3, 3), and no pattern should be discernible. In this case, the residual plot looks fine. From the dialog box of Display 2.9, we see that there are many other possibilities for residual plots.
Display 2.9: Dialog box for selecting various residual plots as part of a regression analysis.
Display 2.10: Plot of the standardized residuals versus humerus after regressing femur against humerus in the archaeopteryx worksheet.
The corresponding session command is given by regress, and by using the subcommands fits, residuals, and sresiduals we can calculate and store fitted values, residuals, and standardized residuals, respectively. For example,

MTB > regress c1 1 c2;
SUBC> fits c3;
SUBC> residuals c4;
SUBC> sresiduals c5.

gives the output of Display 2.5 and also stores the fitted values in C3, stores the residuals $y - \hat{y}$ in C4, and stores the standardized residuals in C5. Note that the 1 in regress c1 1 c2 refers to the number of predictors we are using to predict the response variable. To plot the standardized residuals against humerus, we use

MTB > plot c5*c2

which results in a plot like Display 2.10 but with different labels on the x axis.
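To make the three stored quantities concrete, here is a minimal Python sketch of how fitted values, residuals, and standardized residuals can be computed for a single predictor. It rests on two assumptions: the data are the assumed archaeopteryx values (not listed in this manual), and the standard error of residual i is taken to be s·sqrt(1 − h_i), where h_i is the usual leverage; this is one common definition of "residual divided by its standard error" and should be checked against Minitab's help before relying on it.

```python
import math

# Assumed worksheet values (not listed in this manual).
humerus = [41, 63, 70, 72, 84]   # predictor, playing the role of C2
femur = [38, 56, 59, 64, 74]     # response, playing the role of C1

n = len(humerus)
mx = sum(humerus) / n
my = sum(femur) / n
sxx = sum((x - mx) ** 2 for x in humerus)
b = sum((x - mx) * (y - my) for x, y in zip(humerus, femur)) / sxx
a = my - b * mx

# Fitted values and residuals, analogous to the fits and residuals subcommands.
fits = [a + b * x for x in humerus]
resid = [y - f for y, f in zip(femur, fits)]

# Residual standard deviation with n - 2 degrees of freedom.
s = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))

# Leverage of each point, then residual / (its standard error).
lever = [1 / n + (x - mx) ** 2 / sxx for x in humerus]
sresid = [e / (s * math.sqrt(1 - h)) for e, h in zip(resid, lever)]

for x, f, e, z in zip(humerus, fits, resid, sresid):
    print(f"{x:4d} {f:8.3f} {e:8.3f} {z:8.3f}")
```

The residuals from a least-squares fit with an intercept sum to zero, which gives a quick sanity check on the fitted line.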
2.4
Transformations
Sometimes, transformations of the variables are appropriate before we carry out a regression analysis. This is accomplished in Minitab using the Calc ▸ Calculator command and the arithmetical and mathematical operations discussed in I.10.1 and I.10.2. In particular, when a residual plot looks bad, this can sometimes be fixed by transforming one or more of the variables using a simple transformation, such as replacing the response variable by its logarithm. For example, if we want to calculate the cube root (i.e., x^(1/3)) of every value in C1 and place these in C2, we use the Calc ▸ Calculator command and the dialog box as depicted in Display 2.11. Alternatively, we could use the session command let, as in

MTB > let c2=c1**(1/3)

which produces the same result.
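The same column-wise transformation can be sketched in Python. The values in c1 are hypothetical, and the helper cube_root is not a Minitab function. One caveat applies to this Python version only: raising a negative number to the power 1/3 does not return a real number in Python, so the sketch takes the root of the absolute value and restores the sign.

```python
import math


def cube_root(x):
    """Real cube root of x, valid for negative x as well."""
    return math.copysign(abs(x) ** (1 / 3), x)


c1 = [1.0, 8.0, 27.0, 64.0]        # hypothetical contents of column C1
c2 = [cube_root(x) for x in c1]    # analogue of: let c2=c1**(1/3)

# A logarithm of the response is handled the same way, element by element.
log_c1 = [math.log(x) for x in c1]

print([round(v, 6) for v in c2])
```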
2.5
Exercises
When the data for an exercise come from an exercise in IPS, the IPS exercise number is given in parentheses ( ). All computations in these exercises are to be carried out using Minitab, and the exercises are designed to ensure that you have a reasonable understanding of the Minitab material in this chapter. Generally, you should be using Minitab to do all the computations and plotting required for the problems in IPS.

1. (2.10) Calculate the least-squares line and make a scatterplot of Fuel used against Speed together with the least-squares line. Plot the standardized residuals against Speed. What is the squared correlation coefficient between these variables?

2. (2.11) Make a scatterplot of Rate against Mass where the points for different Sexes are labeled differently (use Minitab for the labeling, too) and with the least-squares line on it. Hint: Make use of the stack command discussed in I.11.7.

3. Place the values 1 through 100 with an increment of .1 in C1 and the square of these values in C2. Calculate the correlation coefficient between C1 and C2. Multiply each value in C1 by 10, add 5, and place the results in C3. Calculate the correlation coefficient between C2 and C3. Why are these correlation coefficients the same?

4. Place the values 1 through 100 with an increment of .1 in C1 and the square of these values in C2. Calculate the least-squares line with C2 as response and C1 as explanatory variable. Plot the standardized residuals. If you see such a pattern of residuals, what transformation might you use to remedy the problem?

5. (2.54) For the data in this problem, numerically verify the algebraic relationship that exists between the correlation coefficient and the slope of the least-squares line.

6. For Example 2.17 in IPS, calculate the least-squares line and reproduce Display 2.21. Calculate the sum of the residuals and the sum of the squared residuals, and divide the latter by the number of data points minus 2. Is there anything you can say about what these quantities are equal to in general?

7. (2.62) Use Minitab to do all the calculations in this problem.

8. Place the values 1 through 10 with an increment of .1 in C1, and place exp(1 + 2x) of these values in C2. Calculate the least-squares line using C2 as the response variable, and plot the standardized residuals against C1. What transformation would you use to remedy this residual plot? What is the least-squares line when you carry out this transformation?
Chapter 3
Producing Data
New Minitab commands discussed in this chapter
Calc ▸ Set Base
Calc ▸ Random Data
This chapter is concerned with the collection of data, perhaps the most important step in a statistical problem, as this determines the quality of whatever conclusions are subsequently drawn. A poor analysis can be fixed, if the data collected are good, by simply redoing the analysis. But if the data have not been appropriately collected, then no amount of analysis can rescue the study. We discuss Minitab commands that enable you to generate samples from populations and also to randomly allocate treatments to experimental units. Minitab uses computer algorithms to mimic randomness. Still, the results are not truly random. In fact, any simulation in Minitab can be repeated, with exactly the same results being obtained, using the Calc ▸ Set Base command. For example, in the dialog box of Display 3.1 we have specified the base, or seed, random number as 1111089. The base can be any integer. When you want to repeat the simulation, you give this command with the same integer. Provided you use the same simulation commands, you will get the same results. This can also be accomplished using the session command base V, where V is an integer.
Display 3.1: Dialog box for setting base or seed random number.
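The role of the base can be illustrated with any pseudorandom generator. The Python sketch below is only an analogy: Python's generator is not Minitab's, so the numbers themselves will differ from anything Minitab produces, and simulate is a hypothetical helper. What carries over is the behavior: the same seed followed by the same commands gives the same results.

```python
import random


def simulate(seed, n=5):
    """Generate n pseudorandom numbers after fixing the seed (the 'base')."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


first = simulate(1111089)
second = simulate(1111089)   # same base, same commands
other = simulate(1111090)    # a different base

print(first == second)   # the repeated simulation matches the first one
print(first == other)    # a different base gives a different stream
```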
3.1
Generating a Random Sample
Suppose that we have a large population of size N and we want to select a sample of n < N from the population. Further, we suppose that the elements of the population are ordered, i.e., we have been able to assign a unique number 1, . . . , N to each element of the population. To avoid selection biases, we want this to be a random sample; as discussed in IPS, this means that every subset of size n from the population has the same chance of being selected. We can do this physically by using some simple random system, such as chips in a bowl or coin tossing. We could also use a table of random numbers, or, more conveniently, we can use computer algorithms that mimic the behavior of random systems. For example, suppose there are 1000 elements in a population, and we want to generate a sample of 50 from this population without replacement. We can use the Calc ▸ Random Data ▸ Sample from Columns command to do this. Suppose we have labeled each element of the population with a unique number in 1, 2, . . . , 1000 and have put these numbers in C1 of a worksheet. The dialog box of Display 3.2 results in a random sample of 50 being generated without replacement from C1 and stored in C2.
Display 3.2: Dialog box for generating a random sample without replacement.
Printing this sample gives the output

MTB > print c2
C2
441 956 87 736 438 205 760 246 538 348 70 54
277 112 610 890 764 584 566 495 414 613 618 685

in the Session window. So now we go to the population and select the elements labeled 441, 956, 87, etc. The algorithm that underlies this command is such that we can be confident that this sample of 50 is like a random sample.
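As an illustration of what sampling without replacement does, the following Python sketch draws 50 distinct labels from 1, ..., 1000. It is an analogy to the menu command, not a reproduction of it: the generator and the seed (chosen arbitrarily here) are Python's, so the labels drawn will not match the Minitab output above.

```python
import random

rng = random.Random(1111089)          # arbitrary seed, for reproducibility only
population = list(range(1, 1001))     # labels 1..1000, as placed in C1
sample = rng.sample(population, 50)   # 50 labels drawn without replacement

print(sample)
# Because the draw is without replacement, no label can repeat.
print(len(set(sample)) == len(sample))
```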
The general syntax of the corresponding session command sample is

sample V E1 . . . Em put into Em+1 . . . E2m

where V is the sample size n and V rows are sampled from the columns E1, ..., Em and stored in columns Em+1, ..., E2m. If we wanted to sample with replacement (i.e., after a unit is sampled, it is placed back in the population so that it can possibly be sampled again), we use the replace subcommand. Of course, for simple random sampling, we do not use the replace subcommand. Note that the columns can be numeric or text.

Sometimes we want to generate random permutations, i.e., n = N, and we are simply reordering the elements of the population. For example, in experimental design, suppose we have $N = n_1 + \cdots + n_k$ experimental units and k treatments, and we want to allocate $n_i$ applications of treatment i. Suppose further that we want all possible such allocations to be equally likely. Then we generate a random permutation $(l_1, \ldots, l_N)$ of (1, . . . , N) and allocate treatment 1 to those experimental units labeled $l_1, \ldots, l_{n_1}$, allocate treatment 2 to those experimental units labeled $l_{n_1+1}, \ldots, l_{n_1+n_2}$, etc. For example, if we have 30 experimental units and 3 treatments and we want to allocate 10 experimental units to each treatment, placing the numbers 1, 2, . . . , 30 in C1 and using the Calc ▸ Random Data ▸ Sample from Columns command as in the dialog box of Display 3.2, but with 30 in the Sample box, generates a random permutation of 1, 2, . . . , 30 in C2. Implementing this gives us the random permutation

MTB > print c2
C2
13 7 26 8 22 23 28 17 3 25 9 2 14 29 15
18 6 11 16 5 12 27 4 30 20 24 1 19 21 10

and for the treatment allocation you can read the numbers row-wise or column-wise, as long as you are consistent. Row-wise is probably best, as this is how the numbers are stored in C2, and so you can always refer back to C2 (presuming you save your worksheet) if you get mixed up.

The above examples show how to directly generate a sample from a population of modest size.
But what happens if the population is huge or it is not convenient to label each unit with a number? For example, suppose we have a population of size 100,000 for which we have an ordered list and we want a sample of size 100. In this case more sophisticated techniques need to be used, but simple random sampling can still typically be accomplished (see Exercise 3.3 for a simple method that works in some contexts). Simple random sampling corresponds to sampling without replacement, i.e., after we randomly select an element from the population, we do not return it to the population before selecting the next sample element. Sampling with replacement corresponds to replacing each sample element in the population after selecting it and recording only the element that was obtained. So at each selection, every element has the same chance of being selected, and an element may appear more than once in the sample. Notice that we can also sample with
replacement if we check the Sample with replacement box in the dialog box of Display 3.2.
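The treatment allocation by random permutation described earlier in this section, and the contrast with sampling with replacement, can be sketched in Python as follows. The seed and the variable names are arbitrary choices made for this illustration, and the permutation produced will not match any Minitab output.

```python
import random

rng = random.Random(2024)              # arbitrary seed
units = list(range(1, 31))             # experimental units 1..30

# A random permutation is a sample of size N = 30 drawn without replacement.
perm = rng.sample(units, len(units))

# Read the permutation in order: the first 10 units get treatment 1, etc.
sizes = [10, 10, 10]
allocation = {}
start = 0
for treatment, size in enumerate(sizes, start=1):
    allocation[treatment] = perm[start:start + size]
    start += size

for treatment, assigned in allocation.items():
    print(treatment, assigned)

# Sampling with replacement: a unit may appear more than once.
with_repl = rng.choices(units, k=30)
print(len(set(with_repl)), "distinct units among 30 draws with replacement")
```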
3.2
Sampling from Distributions
Once we have generated a sample from a population, we measure various attributes of the sampled elements. For example, if we were sampling from a population of humans, we might measure each sampled unit's height. The height for the sample unit is now a random variable that follows the height distribution in the population from which we are sampling. For example, if 80% of the people in the population are between 4.5 feet and 6 feet, then under repeated sampling of an element from the population (with replacement), in the long run 80% of the sampled units will have their heights in this range.

Sometimes, we want to sample directly from this population distribution, i.e., generate a number in such a way that under repeated sampling in the long run the proportion of values falling in any range agrees with that prescribed by the population distribution. Of course, we typically don't know the population distribution, as this is what we want to find out about in a statistical investigation. Still, there are many instances where we want to pretend that we do know it and simulate from this distribution, e.g., perhaps we want to consider the effect of various choices of population distribution on the sampling distribution of some statistic of interest. There are computer algorithms that allow us to do this for a variety of distributions. In Minitab, this is accomplished using the Calc ▸ Random Data command.

For example, suppose that we want to simulate the tossing of a fair coin (a coin where head and tail are equally likely as outcomes). The Calc ▸ Random Data ▸ Bernoulli command together with the dialog box of Display 3.3 generates a sample of 100 from the Bernoulli(.5) distribution and places these values in C1. A random variable has a Bernoulli(p) distribution if the probability the variable equals 1 (success) is p and the probability the variable equals 0 (failure) is 1 − p. So to generate a sample of n from the Bernoulli(p) distribution, we put n in the Generate box and p in the Probability of success box. In such a case, we are simulating the tossing of a coin that produces a head on a single toss with probability p, i.e., the long-run proportion of heads that we observe in repeated tossing is p. Note that we can generate m samples of size n by putting m distinct columns in the Store in column(s) box.

Often, a normal distribution with some particular mean and standard deviation is considered a reasonable assumption for the distribution of a measurement in a population. For example, the Calc ▸ Random Data ▸ Normal command together with the dialog box of Display 3.4 generates a sample of 200 from the N(5.2, 1.3) distribution and places this sample in C1. To generate a sample of n from the $N(\mu, \sigma)$ distribution, we put n in the Generate box, $\mu$ in the Mean box, and $\sigma$ in the Standard deviation box.
Display 3.3: Dialog box for generating a sample of 100 from the Bernoulli(.5) distribution.
Display 3.4: Dialog box for generating a sample of 200 from a N (5.2, 1.3) distribution.
The general syntax of the corresponding session command random is

random V into E1 . . . Em

and this puts a sample of size V into each of the columns E1, ..., Em, according to the distribution specified by the subcommand. For example,

MTB > random 100 c1;
SUBC> bernoulli .5.

simulates the tossing of a fair coin 100 times and places the results in C1 using the bernoulli subcommand. If no subcommand is provided, the distribution is taken to be the N(0, 1) distribution. Similarly, the random command with sample size 200 and the subcommand normal 2.1 3.3 generates a sample of 200 from the N(2.1, 3.3) distribution. There are a number of other subcommands specifying distributions, and we refer the reader to help for a description of these.
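For comparison, the two kinds of samples discussed in this section can be generated with Python's standard library. This is a sketch under stated assumptions: the seed is arbitrary, the variable names are hypothetical, and, as in the Minitab dialog box, the second parameter of the normal distribution is taken to be the standard deviation, not the variance.

```python
import random
import statistics

rng = random.Random(12345)   # arbitrary seed

# 100 tosses of a fair coin: Bernoulli(.5), 1 = head (success), 0 = tail.
p = 0.5
tosses = [1 if rng.random() < p else 0 for _ in range(100)]

# 200 values from N(5.2, 1.3); gauss takes the mean and the standard deviation.
measurements = [rng.gauss(5.2, 1.3) for _ in range(200)]

print("proportion of heads:", sum(tosses) / len(tosses))
print("sample mean:", statistics.mean(measurements))
print("sample standard deviation:", statistics.stdev(measurements))
```

The proportion of heads and the sample mean should be near .5 and 5.2, respectively, but will vary from seed to seed.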
3.3
Exercises
When the data for an exercise come from an exercise in IPS, the IPS exercise number is given in parentheses ( ). All computations in these exercises are to be carried out using Minitab, and the exercises are designed to ensure that you have a reasonable understanding of the Minitab material in this chapter. Generally, you should be using Minitab to do all the computations and plotting required for the problems in IPS. If your version of Minitab places restrictions such that the value of the simulation sample size N requested in these problems is not feasible, then substitute a more appropriate value. Be aware, however, that the accuracy of your results is dependent on how large N is.

1. (3.13) Generate a random permutation of the names using Minitab.

2. (3.32) Use the Manip ▸ Sort command described in I.11.6 to order the subjects by weight. Use the values 1–5 to indicate five blocks of equal length in a separate column, and then use the Manip ▸ Unstack command described in I.11.7 to put the blocks in separate columns. Generate a random permutation of each block.

3. Use the following methodology to generate a sample of 20 from a population of 100,000. First, put the values 0–9 in each of C1–C5. Next, use sampling with replacement to generate 50 values from C1, and put the results in C6. Do the same for each of C2–C5 and put the results in C7–C10 (don't generate from these columns simultaneously). Create a single column of numbers using the digits in C6–C10 as the digits in the numbers. Pick out the first 20 unique entries as labels for the sample. If you do not obtain 20 unique values, repeat the process until you do. Why does this work?

4. Suppose you wanted to carry out stratified sampling where there are 3 strata, with the first stratum containing 500 elements, the second stratum containing 400 elements, and the third stratum containing 100 elements. Generate a stratified sample with 50 elements from the first stratum, 40 elements from the second stratum, and 10 elements from the third stratum. When the strata sample sizes are the same proportion of the total sample size as the strata population sizes are of the total population size, this is called proportional sampling.

5. Suppose we have an urn containing 100 balls with 20 labeled 1, 50 labeled 2, and 30 labeled 3. Using sampling with replacement, generate a sample of size 1000 from this distribution, employing the Calc ▸ Random Data command to generate the sample directly from the relevant population distribution. Use the Stat ▸ Tables ▸ Cross Tabulation command to record the proportion of each label in the sample.

6. Carry out a simulation study with N = 1000 of the sampling distribution of $\hat{p}$ for n = 5, 10, 20 and for p = .5, .75, .95. In particular, calculate the empirical distribution functions and plot the histograms. Comment on your findings.

7. Carry out a simulation study with N = 2000 of the sampling distribution of the sample standard deviation when sampling from the N(0, 1) distribution based on a sample of size n = 5. In particular, plot the histogram using cutpoints 0, 1.5, 2.0, 2.5, 3.0, 5.0. Repeat this for the sample coefficient of variation (sample standard deviation divided by the sample mean) using the cutpoints −10, −9, ..., 0, ..., 9, 10. Comment on the shapes of the histograms relative to an N(0, 1) density curve.
Chapter 4
Probability: The Study of Randomness
In this chapter the concept of probability is introduced more formally than previously in the book. Probability theory underlies the powerful computational methodology known as simulation, which we introduced in Chapter 3. Simulation has many applications in probability and statistics and also in many other fields, such as engineering, chemistry, physics, and economics.
4.1
Basic Probability Calculations
The calculation of probabilities for random variables can often be simplified by tabulating the cumulative distribution function. Also, means and variances are easily calculated using component-wise column operations in Minitab. For example, suppose we have the probability distribution

x    probability
1    .1
2    .2
3    .3
4    .4

in columns C1 and C2, with the values in C1 and the probabilities in C2. The Calc ▸ Calculator command with the dialog box as in Display 4.1 computes the cumulative distribution function in C3 using Partial Sums.
Display 4.1: Dialog box for computing partial sums of entries in C2 and placing these sums in C3.
We can also easily compute the mean and variance of this distribution. For example, the session commands

MTB > let c4=c1*c2
MTB > let c5=c1*c1*c2
MTB > let k1=sum(c4)
MTB > let k2=sum(c5)-k1*k1
MTB > print k1 k2
K1 3.00000
K2 1.00000

calculate the mean and variance and store these in K1 and K2, respectively. The mean is 3 and the variance is 1. Of course, we can also use Calc ▸ Calculator to do these calculations. In presenting more extensive computations, it is somewhat easier to list the appropriate session commands, as we will do subsequently. However, this is not to be interpreted as the required way to do these computations, as the menu commands can be used as well. Use whatever you find most convenient.
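The partial sums, mean, and variance for this small distribution are easy to verify by hand or with a few lines of Python. The sketch below mirrors the column operations used in this section; the variable names are arbitrary.

```python
from itertools import accumulate

values = [1, 2, 3, 4]            # column C1
probs = [0.1, 0.2, 0.3, 0.4]     # column C2

# Cumulative distribution function as partial sums of the probabilities.
cdf = list(accumulate(probs))

# Mean: sum of x*p. Variance: sum of x*x*p minus the squared mean.
mean = sum(x * p for x, p in zip(values, probs))
variance = sum(x * x * p for x, p in zip(values, probs)) - mean ** 2

for x, p, c in zip(values, probs, cdf):
    print(x, p, round(c, 10))
print("mean", round(mean, 5), "variance", round(variance, 5))
```

The last partial sum must equal 1 for the column of probabilities to define a probability distribution, which is a useful check on data entry.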
4.2
More on Sampling from Distributions
As we saw in II.3.2, Minitab includes algorithms for generating from many probability distributions using Calc ▸ Random Data. This menu command produces a drop-down list that includes the normal, binomial, Chi-square, F, t, uniform, and many other distributions that the text, and this manual, will discuss. Clicking on one of these names results in a dialog box with entries to be filled in, further specifying the distribution and the size of the sample. For example, we can generate from one particularly important class of probability distributions using Calc ▸ Random Data ▸ Discrete. These probability distributions are concentrated on a finite number of values. To illustrate this, suppose we have the following values in C1 and C2.

Row   C1    C2
1     -1    0.3
2      2    0.2
3      3    0.4
4     10    0.1

Here, C1 contains the possible values of an outcome, and C2 contains the probabilities that each of these values is obtained, so, for example, P({−1}) = .3, P({2}) = .2, etc. The dialog box of Display 4.2 generates a sample of 50 from this discrete distribution and stores the sample in C3.
Display 4.2: Dialog box for generating a sample from a discrete distribution with values in C1 and probabilities in C2 and storing the sample in C3.
It is an interesting exercise to check that the algorithms Minitab is using are in fact producing samples appropriately. There are a variety of things one could check, but perhaps the simplest is to check that the long-run relative frequencies are correct. So in the example of this section, we want to make sure that, as we increase the size of the sample, the relative frequencies of −1, 2, 3, 10 in the sample are getting closer to .3, .2, .4, and .1, respectively. Note that the relative frequencies are not guaranteed to get closer monotonically to the corresponding probabilities as we increase the sample size, but in the long run they must converge to them. First, we generated a sample of size 100 from this distribution and stored the values in C3 as in Display 4.2. Next, we recorded a 1 in C4 whenever the corresponding entry in C3 was −1 and recorded a 0 in C4 otherwise. To do this, we used the Calc ▸ Calculator command with dialog box as shown in Display 4.3.
It is clear that the mean of C4 is the relative frequency of −1 in the sample. We calculated this mean using Calc ▸ Column Statistics, as discussed in I.10.3, which gave the output

Mean of C4 = 0.33000

in the Session window. Repeating this with a sample of size 1000, we obtained

Mean of C4 = 0.28100

which we can see is a bit closer to the true value of .3. Repeating this with a sample of size 10,000 from this distribution, we obtained

Mean of C4 = 0.29300

which is closer still. It would appear that the relative frequency of −1 is indeed converging to .3.

We can generate a randomly chosen point from the line interval (a, b), where a < b, using Calc ▸ Random Data ▸ Uniform. For example, the dialog box of Display 4.4 generates a sample of 1500 from the uniform distribution on the interval (3.0, 6.3). With this distribution, the probability of any subinterval (c, d) of (a, b) is given by (d − c)/(b − a), i.e., the length of (c, d) over the length of (a, b). Of course, we can estimate this probability by just counting the number of times the generated response falls in the interval (c, d) and dividing this by the total sample size. For example, using the outcomes from the dialog box of Display 4.4 and estimating the probability of the interval (4, 5), we get the relative frequency 0.30867, which is close to the true value of (5 − 4)/(6.3 − 3) = 0.30303.
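The relative-frequency check for the discrete distribution can be repeated in any environment with a weighted sampler. The Python sketch below is an illustration only: the seed is arbitrary, rel_freq is a hypothetical helper, and the frequencies it prints will differ from the Minitab values quoted above, although they should also settle near .3 as the sample size grows.

```python
import random

rng = random.Random(4242)          # arbitrary seed
values = [-1, 2, 3, 10]            # column C1
probs = [0.3, 0.2, 0.4, 0.1]       # column C2


def rel_freq(n):
    """Relative frequency of -1 in a sample of size n from the distribution."""
    sample = rng.choices(values, weights=probs, k=n)     # like C3
    indicator = [1 if v == -1 else 0 for v in sample]    # like C4
    return sum(indicator) / n                            # mean of C4


for n in (100, 1000, 10000):
    print(n, rel_freq(n))
```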
Display 4.4: Dialog box for generating a sample of 1500 from the uniform distribution on the interval (3.0, 6.3).
We can generalize this to generate a point randomly chosen from a rectangle (a, b) × (c, d), i.e., the set of all points (x, y) such that a < x < b, c < y < d. If we want a sample of n from this distribution, we generate a sample $x_1, \ldots, x_n$ from the uniform distribution on (a, b) and also generate a sample $y_1, \ldots, y_n$ from the uniform distribution on (c, d). Then $(x_1, y_1), \ldots, (x_n, y_n)$ is a sample of n from the uniform distribution on (a, b) × (c, d). We can approximate the probability of a random pair (x, y) falling in any subset A ⊂ (a, b) × (c, d) by computing the relative frequency of A in the sample.

The random command is the session command for carrying out simulations in Minitab. For example, the subcommand

uniform V1 V2

specifies the continuous uniform distribution on the interval (V1, V2); i.e., subintervals of the same length have the same probability of occurring. If we have placed a discrete probability distribution in column E2, on the values in column E1, the subcommand

discrete E1 E2

generates a sample from this distribution.
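Both uniform calculations in this section, the subinterval of a line interval and a subset of a rectangle, can be sketched in Python. The rectangle (0, 2) × (0, 1) and the subset A = {(x, y) : y < x/2} are hypothetical choices made for this illustration; A has area 1 out of a total area of 2, so its probability is .5. The seed is arbitrary.

```python
import random

rng = random.Random(7)   # arbitrary seed

# Uniform on (3.0, 6.3): estimate the probability of the interval (4, 5).
n = 1500
xs = [rng.uniform(3.0, 6.3) for _ in range(n)]
p_interval = sum(1 for x in xs if 4 < x < 5) / n
true_interval = (5 - 4) / (6.3 - 3)

# Uniform on a rectangle (a, b) x (c, d): pair two independent uniforms.
a, b, c, d = 0.0, 2.0, 0.0, 1.0    # hypothetical rectangle
m = 10000
points = [(rng.uniform(a, b), rng.uniform(c, d)) for _ in range(m)]

# Relative frequency of the hypothetical subset A = {y < x/2}.
p_A = sum(1 for x, y in points if y < x / 2) / m

print("interval (4, 5):", p_interval, "true value:", round(true_interval, 5))
print("subset A:", p_A, "true value: 0.5")
```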
4.3
Simulation for Approximating Probabilities
As previously noted, simulation can be used to approximate probabilities. For a variety of reasons, these simulations are most easily presented using session commands, but it is clear that we can replace each step by the appropriate menu command. For example, suppose we are asked to calculate

$P(.1 \le X_1 + X_2 \le .3)$

when $X_1$, $X_2$ are independent and both follow the uniform distribution on the interval (0, 1). The session commands

MTB > random 1000 c1 c2;
SUBC> uniform 0 1.
MTB > let c3=c1+c2
MTB > let c4 = .1<=c3 and c3<=.3
MTB > let k1=sum(c4)/n(c4)
MTB > print k1
K1 0.0400000
MTB > let k2=sqrt(k1*(1-k1)/n(c4))
MTB > print k2
K2 0.00619677
MTB > let k3=k1-3*k2
MTB > let k4=k1+3*k2
MTB > print k3 k4
K3 0.0214097
K4 0.0585903

generate N = 1000 independent values of $X_1$, $X_2$ and place these values in C1 and C2, respectively, then calculate the sum $X_1 + X_2$ and put these values in C3. Using the comparison operators discussed in I.10.4, a 1 is recorded in C4 every time $.1 \le X_1 + X_2 \le .3$ is true and a 0 is recorded there otherwise. We then calculate the proportion of 1's in the sample as K1, and this is our estimate $\hat{p}$ of the probability. We will see later that a good measure of the accuracy of this estimate is the standard error of the estimate, which in this case is given by

$\sqrt{\hat{p}(1 - \hat{p})/N}$

and this is computed in K2. Actually, we can feel fairly confident that the true value of the probability is in the interval

$\hat{p} \pm 3\sqrt{\hat{p}(1 - \hat{p})/N}$

which in this case equals the interval (0.0214097, 0.0585903). So we know the true value of the probability with reasonable accuracy. As the simulation size N increases, the Law of Large Numbers says that $\hat{p}$ converges to the true value of the probability.
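The same simulation can be written in Python as a cross-check of the logic. This is a sketch, not a reproduction: the seed is arbitrary and the generator is Python's, so the estimate will differ somewhat from the Minitab value. For this particular event the exact probability can be found geometrically as the area of a strip in the unit square, (.3² − .1²)/2 = .04, which is used here only for comparison.

```python
import math
import random

rng = random.Random(31415)   # arbitrary seed
N = 1000

# Count how often .1 <= X1 + X2 <= .3 for independent uniforms on (0, 1).
hits = 0
for _ in range(N):
    total = rng.random() + rng.random()
    if 0.1 <= total <= 0.3:
        hits += 1

p_hat = hits / N                               # estimate, like K1
se = math.sqrt(p_hat * (1 - p_hat) / N)        # standard error, like K2
lower, upper = p_hat - 3 * se, p_hat + 3 * se  # like K3 and K4

exact = (0.3 ** 2 - 0.1 ** 2) / 2              # area of the strip in the unit square
print("estimate:", p_hat, "standard error:", se)
print("interval:", (lower, upper), "exact:", exact)
```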
4.4
Simulation for Approximating Means
The means of distributions can also be approximated using simulations in Minitab. For example, suppose $X_1$, $X_2$ are independent and both follow the uniform distribution on the interval (0, 1), and that we want to calculate the mean of $Y = 1/(1 + X_1 + X_2)$. We can approximate this in a simulation. The session commands

MTB > random 1000 c1 c2;
SUBC> uniform 0 1.
MTB > let c3=1/(1+c1+c2)
MTB > let k1=mean(c3)
MTB > let k2=stdev(c3)/sqrt(n(c3))
MTB > print k1 k2
K1 0.521532
K2 0.00375769
MTB > let k3=k1-3*k2
MTB > let k4=k1+3*k2
MTB > print k3 k4
K3 0.510259
K4 0.532805

generate N = 1000 independent values of $X_1$, $X_2$ and place these values in C1, C2, then calculate $Y = 1/(1 + X_1 + X_2)$ and put these values in C3. The mean of C3 is stored in K1, and this is our estimate of the mean value of Y. As a measure of how accurate this estimate is, we compute the standard error of the estimate, which is given by the standard deviation divided by the square root of the simulation sample size N. Again, we can feel fairly confident that the interval given by the estimate plus or minus 3 times the standard error of the estimate contains the true value of the mean. In this case, this interval is given by (0.510259, 0.532805), and so we know this mean with reasonable accuracy. As the simulation size N increases, the Law of Large Numbers says that the approximation converges to the true value of the mean.
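A Python version of this simulation is sketched below, again with an arbitrary seed and Python's own generator, so the estimate will not match the Minitab output digit for digit. For this particular Y the mean can also be obtained by direct integration as 3 ln 3 − 4 ln 2 ≈ 0.5232; that value is not part of the manual's discussion and is included only as a comparison.

```python
import math
import random
import statistics

rng = random.Random(27182)   # arbitrary seed
N = 1000

# Simulate Y = 1/(1 + X1 + X2) for independent uniforms on (0, 1).
y = [1 / (1 + rng.random() + rng.random()) for _ in range(N)]

est = statistics.mean(y)                     # estimate of the mean, like K1
se = statistics.stdev(y) / math.sqrt(N)      # standard error, like K2
lower, upper = est - 3 * se, est + 3 * se    # like K3 and K4

exact = 3 * math.log(3) - 4 * math.log(2)    # from integrating 1/(1+x+y) over the unit square
print("estimate:", est, "standard error:", se)
print("interval:", (lower, upper), "exact:", exact)
```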
4.5
Exercises
When the data for an exercise come from an exercise in IPS, the IPS exercise number is given in parentheses ( ). All computations in these exercises are to be carried out using Minitab, and the exercises are designed to ensure that you have a reasonable understanding of the Minitab material in this chapter. Generally, you should be using Minitab to do all the computations and plotting required for the problems in IPS. If your version of Minitab places restrictions such that the value of the simulation sample size N requested in these problems is not feasible, then substitute a more appropriate value. Be aware, however, that the accuracy of your results is dependent on how large N is. 1. Suppose we have the probability distribution x probability 1 .15 2 .05 3 .33 4 .37 5 .10
on the values 1, 2, 3, 4, and 5. Calculate the mean and variance of this distribution. Suppose that three independent outcomes (X1 , X2 , X3 ) are
92
Chapter 4 generated from this distribution. Compute the probability that 1 < X1 4, 2 X2 and 3 < X3 5. 2. Suppose we have the probability distribution x probability 1 .15 2 .05 3 .33 4 .37 5 .10
on the values 1, 2, 3, 4, and 5. Using Minitab, verify that this is a probability distribution. Make a bar chart (probability histogram) of this distribution. Generate a sample of size 1000 from this distribution and plot a relative frequency histogram for the sample. 3. (4.23) Indicate how you would simulate the game of roulette using Minitab. Based on a simulation of N = 1000, estimate the probability of getting red and a multiple of 3. 4. A probability distribution is placed on the integers 1, 2, ..., 100, where the probability of integer i is c/i2 . Determine c so that this is a probability distribution. What is the 90th percentile? Generate a sample of 20 from the distribution. 5. Suppose an outcome is random on the square (0, 1) (0, 1). Using simulation, approximate the probability that the rst coordinate plus the second coordinate is less than .75 but greater than .25. 6. Generate from the uniform distribution on the unit disk a sample of 1000 D = (x, y ) : x2 + y2 1 .
7. The expression ex for x > 0 is the density curve for what is called the Exponential (1) distribution. Plot this density curve in the interval from 0 to 10 using an increment of .1. The Calc I Random Data I Exponential command can be used to generate from this distribution by specifying the Mean as 1 in the ensuing dialog box. Generate a sample of 1000 from this distribution and estimate its mean. Approximate the probability that a value generated from this distribution is in the interval (1,2). The general Exponential () has a density curve given by 1 ex/ for x > 0 and where > 0 is the mean. Repeat the simulation with mean = 3. Comment on the values of the estimated means. 8. Suppose you carry out a simulation to approximate the mean of a random variable X and you report the value 1.23 with a standard error of .025. If you are asked to approximate the mean of Y = 3 + 5X, do you have to carry out another simulation? If not, what is your approximation, and what is the standard error of this approximation? 9. Suppose that a random variable X follows an N (3, 2.3) distribution. Subsequently, conditions change and no values smaller than 1 or bigger than 9.5 can occur, i.e., the distribution is conditioned to the interval (1, 9.5).
Generate a sample of 1000 from the truncated distribution, and use the sample to approximate its mean.
10. Suppose that X is a random variable and follows an N(0, 1) distribution. Simulate N = 1000 values from the distribution of $Y = X^2$, and plot these values in a histogram with cutpoints 0, .5, 1, 1.5, ..., 15. Approximate the mean of this distribution. Generate Y directly from its distribution, which is known to be a Chisquare(1) distribution. In general, the Chisquare(k) distribution can be sampled via the command Calc > Random Data > Chi-Square, where k is specified as the Degrees of freedom in the dialog box. Plot the Y values in a histogram using the same cutpoints. Comment on the two histograms. Note that you can plot the density curve of these distributions using Calc > Probability Distributions > Chi-Square and evaluating the probability density at a range of points as we discussed in II.2 for the normal distribution.
11. If $X_1$ and $X_2$ are independent random variables with $X_1$ following a Chisquare($k_1$) distribution and $X_2$ following a Chisquare($k_2$) distribution, then it is known that $Y = X_1 + X_2$ follows a Chisquare($k_1 + k_2$) distribution. For $k_1 = 1$, $k_2 = 1$, verify this empirically by plotting histograms with cutpoints 0, .5, 1, 1.5, ..., 15, based on simulations of size N = 1000.
12. If $X_1$ and $X_2$ are independent random variables with $X_1$ following an N(0, 1) distribution and $X_2$ following a Chisquare(k) distribution, then it is known that $Y = X_1 / \sqrt{X_2/k}$ follows a Student(k) distribution. The Student(k) distribution can be sampled using the command Calc > Random Data > t, where k is the Degrees of freedom and must be specified in the dialog box. For k = 3, verify this result empirically by plotting histograms with cutpoints -10, -9, ..., 9, 10, based on simulations of size N = 1000.
13. If $X_1$ and $X_2$ are independent random variables with $X_1$ following a Chisquare($k_1$) distribution and $X_2$ following a Chisquare($k_2$) distribution, then it is known that
$$Y = \frac{X_1/k_1}{X_2/k_2}$$
follows an F($k_1$, $k_2$) distribution. The F($k_1$, $k_2$) distribution can be sampled using the command Calc > Random Data > F, where $k_1$ is the Numerator degrees of freedom and $k_2$ is the Denominator degrees of freedom, both of which must be specified in the dialog box. For $k_1 = 1$, $k_2 = 1$, verify this empirically by plotting histograms with cutpoints 0, .5, 1, 1.5, ..., 15, based on simulations of size N = 1000.
Chapter 5
Sampling Distributions
New Minitab command discussed in this chapter
Calc > Probability Distributions > Binomial
Once data have been collected, they are analyzed using a variety of statistical techniques. Virtually all of these involve computing statistics that measure some aspect of the data concerning questions we wish to answer. The answers determined by these statistics are subject to the uncertainty caused by the fact that we typically do not have the full population but only a sample from the population. As such, we have to be concerned with the variability in the answers when different samples are obtained. This leads to a concern with the sampling distribution of a statistic. Sometimes, the sampling distribution of a statistic can be worked out exactly through various mathematical techniques, e.g., in Chapter 5 of IPS it is seen that the number of 1s in a sample of n from a Bernoulli(p) distribution is Binomial(n, p). Often, however, this is not possible, and we must resort to approximations. One approximation technique is to use simulation. Sometimes, however, the statistics we are concerned with are averages, and, in such cases, we can typically approximate their sampling distribution via an appropriate normal distribution.
5.1
Suppose that $X_1, \ldots, X_n$ is a sample from the Bernoulli(p) distribution, i.e., $X_1, \ldots, X_n$ are independent realizations, where each $X_i$ takes the value 1 or 0 with probabilities $p$ and $1 - p$, respectively. The random variable $Y = X_1 + \cdots + X_n$ equals the number of 1s in the sample and follows, as discussed in IPS, a Binomial(n, p) distribution. Therefore, Y can take on any of the values 0, 1, ..., n with positive probability. In fact, an exact formula can be derived for these probabilities:
$$P(Y = k) = \binom{n}{k} p^k (1 - p)^{n-k}$$
is the probability that Y takes the value k for $0 \le k \le n$. When n and k are small, this formula could be used to evaluate this probability, but it is almost always better to use software like Minitab to do it, and when these values are not small, it is necessary. Also, we can use Minitab to compute the Binomial(n, p) cumulative probability distribution (the probability contents of intervals $(-\infty, x]$) and the inverse cumulative distribution (percentiles of the distribution).
For individual probabilities, we use the Calc > Probability Distributions > Binomial command. For example, suppose we have a Binomial(30, .2) distribution and want to compute the probability P(Y = 10). This command, with the dialog box as in Display 5.1, produces the output
Binomial with n = 30 and p = 0.200000
    x    P( X = x )
10.00        0.0355
in the Session window, i.e., P(Y = 10) = .0355.
If we want to compute the probability of getting 10 or fewer successes, this is the probability of the interval $(-\infty, 10]$, and we can use the Calc > Probability Distributions > Binomial command with the dialog box as in Display 5.2. This produces the output
Binomial with n = 30 and p = 0.200000
    x    P( X <= x )
10.00         0.9744
in the Session window, i.e., $P(Y \le 10) = .9744$.
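As a cross-check outside Minitab, the two probabilities above can be recomputed directly from the binomial formula. The following is a minimal Python sketch using only the standard library; the function names `binomial_pmf` and `binomial_cdf` are illustrative choices and do not correspond to any Minitab command.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(Y = k) for Y ~ Binomial(n, p), from the exact formula."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binomial_cdf(k, n, p):
    """P(Y <= k), obtained by summing the probability function from 0 to k."""
    return sum(binomial_pmf(j, n, p) for j in range(k + 1))

# Binomial(30, .2): compare with the Session window output quoted above.
print(round(binomial_pmf(10, 30, 0.2), 4))
print(round(binomial_cdf(10, 30, 0.2), 4))
```

The printed values should agree, to four decimal places, with the Minitab output for P(Y = 10) and P(Y ≤ 10).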
Display 5.2: Dialog box for computing cumulative probabilities for the Binomial(n, p) distribution.
Suppose we want to compute the first quartile of this distribution. The Calc > Probability Distributions > Binomial command, with the dialog box as in Display 5.3, produces the output
Binomial with n = 30 and p = 0.200000
x    P( X <= x )    x    P( X <= x )
3         0.1227    4         0.2552
in the Session window. This gives the values x that have cumulative probabilities just smaller and just larger than the value requested. Recall that with a discrete distribution, such as the Binomial(n, p), we will not in general be able to obtain an exact percentile.
Display 5.3: Dialog box for computing percentiles of the Binomial(n, p) distribution.
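The bracketing behaviour of the inverse cumulative distribution for a discrete distribution can be sketched as follows. This is an illustrative Python version under the assumption that we want the two neighbouring values whose cumulative probabilities straddle the requested level; `percentile_bracket` is a hypothetical helper, not a description of how Minitab computes its output.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(Y = k) for Y ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def percentile_bracket(q, n, p):
    """Bracket the level-q percentile of Binomial(n, p).

    Returns ((x_lo, F(x_lo)), (x_hi, F(x_hi))), where x_hi is the smallest
    x with F(x) >= q and x_lo = x_hi - 1 (taking F(-1) = 0).
    """
    below = 0.0
    for x in range(n + 1):
        at = below + binomial_pmf(x, n, p)
        if at >= q:
            return (x - 1, below), (x, at)
        below = at
    # Reached only if rounding keeps the running sum just under q near 1.
    return (n - 1, below - binomial_pmf(n, n, p)), (n, 1.0)

lower, upper = percentile_bracket(0.25, 30, 0.2)
print(lower, upper)
```

For the first quartile of the Binomial(30, .2) distribution, the bracket should land on x = 3 and x = 4, matching the two columns of the Minitab output.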
These commands can operate on all the values in a column simultaneously. This is very convenient if you should want to tabulate or graph the probability function, cumulative distribution function, or inverse distribution function. The general syntax of the pdf, cdf, and invcdf session commands is given in II.1.3, and here we use them with the binomial subcommand as in
MTB > pdf 10;
SUBC> binomial 30 .2.
which outputs P(Y = 10) when Y has the Binomial(30, .2) distribution. Actually, when n is very large even software will not be useful to compute these probabilities, and you will have to use normal approximations to binomial probabilities via the central limit theorem. The pdf and cdf commands with the normal subcommand can be used for this.
We might also want to simulate from the Binomial(n, p) distribution. For this we use the Calc > Random Data > Binomial command or the session command random with the binomial subcommand. For example,
MTB > random 10 c1;
SUBC> binomial 30 .2.
MTB > print c1
C1
2 2 4 2 11 5 7 8 5 2
generates a sample of 10 from the Binomial(30, .2) distribution.
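To give a feel for the normal approximation mentioned above, here is a small Python sketch comparing an exact binomial cumulative probability with its normal approximation. The continuity correction (evaluating the normal distribution function at k + 0.5) is an assumption added here for illustration and is not prescribed by the text; the function names are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist

def binomial_cdf(k, n, p):
    """Exact P(Y <= k) for Y ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def normal_approx_cdf(k, n, p):
    """Approximate P(Y <= k) using N(np, sqrt(np(1-p))).

    The +0.5 is a continuity correction (an assumption of this sketch).
    """
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    return NormalDist(mean, sd).cdf(k + 0.5)

exact = binomial_cdf(10, 30, 0.2)
approx = normal_approx_cdf(10, 30, 0.2)
print(round(exact, 4), round(approx, 4))
```

With n = 30 the two values differ noticeably in the third decimal place; the approximation is intended for much larger n, where exact computation becomes impractical.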
5.2
First, we consider an example where we know the exact sampling distribution. Suppose we flip a possibly biased coin n times and want to estimate the unknown probability p of getting a head. The natural estimate is $\hat{p}$, the proportion of heads in the sample. We would like to assess the sampling behavior of this statistic in a simulation. To do this, we choose a value for p, generate N samples of size n from the Bernoulli(p) distribution, compute $\hat{p}$ for each of these, and look at the empirical distribution of these N values, perhaps plotting a histogram as well. The larger N is, the closer the empirical distribution and histogram will be to the true sampling distribution of $\hat{p}$. Note that there are two sample sizes here: the sample size n of the original sample the statistic is based on, which is fixed, and the simulation sample size N, which we can control. This is characteristic of all simulations. Sometimes, using more advanced analytical techniques, we can determine N so that the sampling distribution of the statistic is estimated with some prescribed accuracy. Some techniques for doing this are discussed in later chapters of IPS. Another method is to repeat the simulation a number of times, slowly increasing N until we see the results stabilize. This is sometimes the only way available, but caution should be shown, as it is easy for simulation results to be very misleading if the final N is too small.
We illustrate a simulation to determine the sampling distribution of $\hat{p}$ when sampling from a Bernoulli(.75) distribution. For this, we use the commands Calc > Random Data > Bernoulli, Calc > Row Statistics, and Stat > Tables > Tally, with the dialog boxes given by Displays 5.4, 5.5, and 5.6, respectively, to produce the output
Summary Statistics for Discrete Variables
C11   CumPct
0.3     0.40
0.4     2.20
0.5     7.60
0.6    23.10
0.7    47.70
0.8    78.00
0.9    94.70
1.0   100.00
in the Session window. Here we have generated N = 1000 samples of size n = 10 from the Bernoulli(.75) distribution, i.e., we simulated the tossing of this coin 10,000 times, and we placed the results in the rows of columns C1-C10 using Calc > Random Data > Bernoulli. The proportion of heads $\hat{p}$ in each sample is computed and placed in C11 using Calc > Row Statistics. Note that a mean of values equal to 0 or 1 is just the proportion of 1s in the sample. Finally, we used Stat > Tables > Tally to compute the empirical distribution function of these 1000 values of $\hat{p}$. For example, this says 78% of these values were .8 or smaller and there were no instances smaller than .3.
Display 5.4: Dialog box for generating 10 columns of 1000 Bernoulli(.75) values.
Display 5.5: Dialog box for computing the proportion of 1s in each of the 1000 samples of size 10.
Display 5.6: Dialog box for computing the empirical distribution function of $\hat{p}$.
In Display 5.7, we have plotted a histogram of the 1000 values of $\hat{p}$. Based on N = 800, the following empirical distribution was obtained:
C11   CumPct
0.4     1.20
0.5     7.20
0.6    22.20
0.7    47.80
0.8    78.20
0.9    95.00
1.0   100.00
Because these values are reasonably close to those obtained with N = 1000, we stopped at N = 1000.
[Histogram of C11 (horizontal axis) against Frequency (vertical axis).]
Display 5.7: Histogram of simulation of N = 1000 values of $\hat{p}$ based on a sample of size n = 10 from the Bernoulli(.75) distribution.
The corresponding session commands for this simulation are
MTB > random 1000 c1-c10;
SUBC> bernoulli .75.
MTB > rmean c1-c10 c11
MTB > tally c11;
SUBC> cumpcts.
and these might seem like an easier way to implement the simulation. In Chapter 5 of IPS we saw that the sampling distribution of $\hat{p}$ can be determined exactly, i.e., there are formulas to determine this, and we can simulate directly from the sampling distribution, so this simulation can be made much more efficient. In effect, this entails using the Calc > Random Data > Binomial command with dialog box as in Display 5.8 and dividing each entry in C1 by 10. This generates N = 1000 values of $\hat{p}$ but uses a much smaller number of cells. Still, there are many statistics for which this kind of efficiency improvement is not available, and, to get some idea of what their sampling distribution is like, we must resort to the more brute force form of simulation of generating directly from the population distribution. Sometimes, more sophisticated simulation techniques are needed to get an accurate assessment of a sampling distribution. Within Minitab, there are programming techniques, which we do not discuss in this manual, that can be applied in such cases. For example, it is clear that if our simulation required the generation of $10^6$ cells (and this is not at all uncommon for some harder problems), the simulation approach we have described would not work within Minitab, as the worksheet would be too large.
Display 5.8: Dialog box for generating 1000 values from the sampling distribution of $10\hat{p}$ using the Binomial(10, .75) distribution.
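The brute force simulation described in this section can also be sketched outside Minitab. The Python code below is a minimal illustration, assuming N = 1000 samples of size n = 10 from the Bernoulli(.75) distribution; the function names and the seed are arbitrary choices made here, and the simulated percentages will differ from run to run and from the Minitab output quoted earlier.

```python
import random
from collections import Counter

def simulate_phat(N, n, p, rng):
    """Generate N samples of n Bernoulli(p) values; return the N proportions of 1s."""
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(N)]

def cumulative_percents(values):
    """Empirical distribution function as (value, cumulative percent) pairs."""
    counts = Counter(values)
    total = len(values)
    running = 0
    table = []
    for v in sorted(counts):
        running += counts[v]
        table.append((v, 100.0 * running / total))
    return table

rng = random.Random(2024)          # fixed seed only so the run is repeatable
phat = simulate_phat(1000, 10, 0.75, rng)
for value, cum_pct in cumulative_percents(phat):
    print(f"{value:.1f} {cum_pct:7.2f}")
```

Rerunning with a larger N and comparing the tables is one way to see the stabilization discussed above.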
5.3
Exercises
When the data for an exercise come from an exercise in IPS, the IPS exercise number is given in parentheses ( ). All computations in these exercises are to be carried out using Minitab, and the exercises are designed to ensure that you have a reasonable understanding of the Minitab material in this chapter. Generally, you should be using Minitab to do all the computations and plotting required for the problems in IPS. If your version of Minitab places restrictions such that the value of the simulation sample size N requested in these problems is not feasible, then substitute a more appropriate value. Be aware, however, that the accuracy of your results is dependent on how large N is.
1. Calculate all the probabilities for the Binomial(5, .4) distribution and the Binomial(5, .6) distribution. What relationship do you observe? Can you explain this and state a general rule?
2. Compute all the probabilities for a Binomial(5, .8) distribution and use these to directly calculate the mean and variance. Verify your answers using the formulas provided in IPS.
3. Compute and plot the probability and cumulative distribution functions of the Binomial(10, .2) and the Binomial(10, .5) distributions. Comment on the shapes of these distributions.
4. Generate 1000 samples of size 10 from the Bernoulli(.3) distribution. Compute the proportion of 1s in each sample and compute the proportion of samples having no 1s, one 1, two 1s, etc. Compute what these proportions would be in the long run and compare.
5. Carry out a simulation study with N = 1000 of the sampling distribution of $\hat{p}$ for n = 5, 10, 20 and for p = .5, .75, .95. In particular, calculate the empirical distribution functions and plot the histograms. Comment on your findings.
6. Suppose that $X_1, X_2, X_3, \ldots$ are independent realizations from the Bernoulli(p) distribution, i.e., each $X_i$ takes the value 1 or 0 with probabilities $p$ and $1 - p$, respectively. If the random variable Y counts the number of tosses until we obtain the first head in a sequence of independent tosses $X_1, X_2, X_3, \ldots$, then Y has a Geometric(p) distribution. Minitab does not have built-in algorithms for computing the probability function, distribution function, and inverse distribution function of this distribution, or for generating from it. The probability function for this distribution is given by
$$P(Y = y) = (1 - p)^{y-1} p$$
for y = 1, 2, .... Plot the probability function for the Geometric(.5) distribution for the values y = 1, ..., 10. Do the same for the Geometric(.1) distribution. What do you notice?
7. Using methods for summing geometric sums, the cumulative distribution function of the Geometric(p) distribution (see Exercise II.5.6) is given by $P(Y \le y) = 1 - (1 - p)^y$. Plot the cumulative distribution function for the Geometric(.5) and Geometric(.1) distributions for the values y = 1, ..., 10. What do you notice?
8. To randomly generate from the Geometric(p) distribution (see Exercise II.5.6), we can repeatedly generate from a Bernoulli(p) and count how many times we did this until the first 1 appeared. A simple way to do this in Minitab is to generate N values from the Bernoulli(p) into a column. Count the number of entries until the first 1, count the number of subsequent entries until the next 1, etc. These counts are identically and independently distributed according to the Geometric(p) distribution. This is a very inefficient method when p is small, and much better algorithms exist. Generate a sample of 10 from the Geometric(.5) distribution.
9. Carry out a simulation study, with N = 2000, of the sampling distribution of the sample standard deviation when sampling from the N(0, 1) distribution, based on a sample of size n = 5. In particular, plot the histogram using cutpoints 0, 1.5, 2.0, 2.5, 3.0, 5.0. Repeat this for the sample coefficient of variation (sample standard deviation divided by the sample mean) using the cutpoints -10, -9, ..., 0, ..., 9, 10. Comment on the shapes of the histograms relative to a N(0, 1) density curve.
10. Generate N = 1000 samples of size n = 5 from the N(0, 1) distribution. Record a histogram for $\bar{x}$ using the cutpoints -3, -2.5, -2, ..., 2.5, 3.0. Generate a sample of size N = 1000 from the N(0, $1/\sqrt{5}$) distribution. Plot the histogram using the same cutpoints and compare the histograms. What will happen to these histograms as we increase N?
11. Generate N = 1000 values of $X_1$, $X_2$, where $X_1$ follows a N(3, 2) distribution and $X_2$ follows a N(1, 3) distribution. Compute $Y = X_1 - 2X_2$ for each of these pairs and plot a histogram for Y using the cutpoints -20, -15, ..., 25, 30. Generate a sample of N = 1000 from the appropriate distribution of Y and plot a histogram using the same cutpoints.
12. Plot the density curve for the Exponential(3) distribution (see Exercise II.4.7) between 0 and 15 with an increment of .1. Generate N = 1000 samples of size n = 2 from the Exponential(3) distribution and record the sample means. Standardize the sample of $\bar{x}$ using $\mu = 3$ and $\sigma = 3$. Plot a histogram of the standardized values using the cutpoints -5, -4, ..., 4, 5. Repeat this for n = 5, 10. Comment on the shapes of these histograms.
13. Plot the density of the uniform distribution on (0, 1). Generate N = 1000 samples of size n = 2 from this distribution. Standardize the sample of $\bar{x}$ using $\mu = .5$ and $\sigma = \sqrt{1/12}$. Plot a histogram of the standardized values using the cutpoints -5, -4, ..., 4, 5. Repeat this for n = 5, 10. Comment on the shapes of these histograms.
14. The Weibull($\beta$) has density curve given by $\beta x^{\beta - 1} e^{-x^{\beta}}$ for x > 0, where $\beta > 0$ is a fixed constant. Plot the Weibull(2) density in the range 0 to 10 with an increment of .1 using the Calc > Probability Distributions > Weibull command. Generate a sample of N = 1000 from this distribution using the command Calc > Random Data > Weibull, where $\beta$ is the Shape parameter and the Scale parameter is 1. Plot a probability histogram and compare with the density curve.
Chapter 6
Introduction to Inference
New Minitab commands discussed in this chapter
Stat > Basic Statistics > 1-Sample Z
Power and Sample Size > 1-Sample Z
In this chapter, the basic tools of statistical inference are discussed. There are a number of Minitab commands that aid in computing confidence intervals and carrying out tests of significance.
6.1
z-Confidence Intervals
The command Stat > Basic Statistics > 1-Sample Z computes confidence intervals for the mean $\mu$ using a sample $x_1, \ldots, x_n$ from a distribution where we know the standard deviation $\sigma$. There are three situations when this is appropriate:
(1) We know that we are sampling from a normal distribution with unknown mean $\mu$ and known standard deviation $\sigma$, and thus
$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$
is distributed N(0, 1).
(2) We have a large sample from a distribution with unknown mean $\mu$ and known standard deviation $\sigma$, and the central limit theorem approximation to the distribution of $\bar{x}$ is appropriate, i.e.,
$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$
is approximately distributed N(0, 1).
(3) We have a large sample from a distribution with unknown mean $\mu$ and unknown standard deviation $\sigma$, and the sample size is large enough so that
$$z = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
is approximately N(0, 1), where s is the sample standard deviation.
The confidence interval takes the form $\bar{x} \pm z^* \sigma/\sqrt{n}$, where s is substituted for $\sigma$ in case (3), and $z^*$ is determined from the N(0, 1) distribution by the confidence level desired, as described in IPS. Of course, situation (3) is probably the most realistic, but note that the confidence intervals constructed for (1) are exact, while those constructed under (2) and (3) are only approximate, and a larger sample size is required in (3) for the approximation to be reasonable than for (2).
Consider the sample given by 0.8403, 0.8363, 0.8447, which are stored in C1, and suppose that it makes sense to take $\sigma = .0068$. The command Stat > Basic Statistics > 1-Sample Z with the dialog boxes as in Displays 6.1 and 6.2 produces the output
Variable   N      Mean     StDev   SE Mean             99.0% CI
C1         3   0.84043   0.00420   0.00393   (0.83032, 0.85055)
in the Session window. This specifies the 99% confidence interval (0.83032, 0.85055) for $\mu$. Note that in the dialog box of Display 6.1, we specify where the data reside in the Variables box and the value of $\sigma$ in the Sigma box, and click on the Options button to bring up the dialog box in Display 6.2. In this dialog box we have specified the 99% confidence level in the Confidence level box.
Display 6.1: First dialog box for producing the z-confidence interval for $\mu$.
Display 6.2: Second dialog box for producing the z-confidence interval. Here we specify the confidence level.
The general syntax of the corresponding session command zinterval is
zinterval V1 sigma = V2 E1 ... Em
where V1 is the confidence level and is any value between 1 and 99.99, V2 is the assumed value of $\sigma$, and E1, ..., Em are columns of data. A V1% confidence interval is produced for each column specified. If no value is specified for V1, the default value is 95%.
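The interval reported above can be reproduced by hand from the formula $\bar{x} \pm z^* \sigma/\sqrt{n}$. The following Python sketch does this with the standard library; `z_interval` is a hypothetical helper written for this illustration, not a Minitab command.

```python
from math import sqrt
from statistics import NormalDist, mean

def z_interval(data, sigma, level=0.95):
    """Return (lower, upper) for the z-confidence interval with known sigma."""
    n = len(data)
    xbar = mean(data)
    # z* is the N(0, 1) percentile leaving (1 - level)/2 in each tail.
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z_star * sigma / sqrt(n)
    return xbar - margin, xbar + margin

lower, upper = z_interval([0.8403, 0.8363, 0.8447], sigma=0.0068, level=0.99)
print(round(lower, 5), round(upper, 5))
```

The endpoints should agree with the 99% interval in the Minitab output to the displayed precision.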
6.2
z-Tests
The Stat > Basic Statistics > 1-Sample Z command is used when we want to test the hypothesis that the unknown mean $\mu$ equals a value $\mu_0$ and one of the situations (1), (2), or (3) as discussed in II.6.1 is appropriate. The test is based on computing a P-value using the observed value of
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
and the N(0, 1) distribution as described in IPS.
Suppose the sample 2.0, 0.4, 0.7, 2.0, -0.4, 2.2, -1.3, 1.2, 1.1, 2.3 is stored in C1, and we are asked to test the null hypothesis $H_0: \mu = 0$ against the alternative $H_a: \mu > 0$, and it makes sense to take $\sigma = 1$. The Stat > Basic Statistics > 1-Sample Z command together with the dialog boxes of Displays 6.3 and 6.4 produces the output
Variable   99.0% Lower Bound      Z       P
C1                     0.284   3.23   0.001
in the Session window. This specifies the P-value for this test as .001, and so we reject the null hypothesis in favor of the alternative. In the first dialog box, we specified where the data are located, the value of $\sigma$ as before, and that we want to test $H_0: \mu = 0$ by entering 0 in the Test mean box. We brought up the second dialog box by clicking on the Options button. In the second dialog box, we specified that we want to test this null hypothesis against the alternative $H_a: \mu > 0$ by selecting greater than in the Alternative box. The other choices are not equal, which selects the alternative $H_a: \mu \ne 0$, and less than, which selects the alternative $H_a: \mu < 0$.
Display 6.3: First dialog box for testing a hypothesis concerning the mean using a z-test.
Display 6.4: Second dialog box for testing a hypothesis using the z-test.
The general syntax of the corresponding session command ztest is
ztest V1 sigma = V2 E1 ... Em
where V1 is the hypothesized value to be tested, V2 is the assumed value of $\sigma$, and E1, ..., Em are columns of data. If no value is specified for V1, the default is 0. A test of the hypothesis is carried out for each column. If no alternative subcommand is specified, a two-sided test is conducted, i.e., $H_0: \mu = V1$ against the alternative $H_a: \mu \ne V1$. If the subcommand
SUBC> alternative 1.
is used, a test of $H_0: \mu = V1$ against the alternative $H_a: \mu > V1$ is conducted. If the subcommand
SUBC> alternative -1.
is used, a test of $H_0: \mu = V1$ against the alternative $H_a: \mu < V1$ is conducted.
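The z statistic and P-value in the example above can be checked directly from the definition. Below is a minimal Python sketch; the function `z_test` and its `alternative` strings are illustrative names chosen to mirror the dialog box choices, not Minitab syntax, and the data include the two negative observations implied by the reported value of Z.

```python
from math import sqrt
from statistics import NormalDist, mean

def z_test(data, mu0, sigma, alternative="not equal"):
    """Return (z, P-value) for the z-test of H0: mu = mu0 with known sigma."""
    z = (mean(data) - mu0) / (sigma / sqrt(len(data)))
    std_normal = NormalDist()
    if alternative == "greater than":
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less than":
        p_value = std_normal.cdf(z)
    elif alternative == "not equal":
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("unknown alternative: " + alternative)
    return z, p_value

sample = [2.0, 0.4, 0.7, 2.0, -0.4, 2.2, -1.3, 1.2, 1.1, 2.3]
z, p = z_test(sample, mu0=0, sigma=1, alternative="greater than")
print(round(z, 2), round(p, 3))
```

Rounded to the precision Minitab displays, the result should match the Z and P columns of the output.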
6.3
When we are sampling from a N($\mu$, $\sigma$) distribution and know the value of $\sigma$, the confidence intervals constructed in II.6.1 are exact, i.e., in the long run a proportion 95% of the 95% confidence intervals constructed for an unknown mean $\mu$ will contain the true value of this quantity. Of course, any given confidence interval may or may not contain the true value of $\mu$, and, in any finite number of such intervals so constructed, some proportion other than 95% will contain the true value of $\mu$. As the number of intervals increases, however, the proportion covering $\mu$ will go to 95%. We illustrate this via a simulation study based on computing 90% confidence intervals. The session commands
MTB > random 100 c1-c5;
SUBC> normal 1 2.
MTB > rmean c1-c5 c6
MTB > invcdf .95;
SUBC> normal 0 1.
Normal with mean = 0 and standard deviation = 1.00000
P( X <= x)        x
    0.9500   1.6449
MTB > let k1=1.6449*2/sqrt(5)
MTB > let c7=c6-k1
MTB > let c8=c6+k1
MTB > let c9=c7<1 and c8>1
MTB > mean c9
Mean of C9 = 0.94000
MTB > set c10
DATA> 1:25
DATA> end
MTB > delete 26:100 c7 c8
MTB > mplot c7 versus c10 c8 versus c10;
SUBC> xstart=1 end=25;
SUBC> xincrement=1.
generate 100 random samples of size 5 from the N(1, 2) distribution, place the means in C6, the lower end-point of a 90% confidence interval in C7, and the upper end-point in C8, and record whether or not a confidence interval covers the true value $\mu = 1$ by placing a 1 or 0 in C9, respectively. The mean of C9 is the proportion of intervals that cover, and this is 94%, which is 4 percentage points above the nominal 90%. Finally, we plotted the first 25 of these intervals in a plot shown in Figure 6.1. Drawing a solid horizontal line at 1 on the y-axis indicates that most of these intervals do indeed cover the true value $\mu = 1$.
[Plot of the interval end-points (vertical axis, labeled C7) against C10 (horizontal axis).]
Figure 6.1: Plot of 90% confidence intervals for the mean when sampling from the N(1, 2) distribution with n = 5. The lower end-point is open and the upper end-point is closed.
The simulation just carried out simply verifies a theoretical fact. On the other hand, when we are computing approximate confidence intervals (i.e., we are not necessarily sampling from a normal distribution), it is good to do some simulations from various distributions to see how much reliance we can place in the approximation at a given sample size. The true coverage probability of the interval, i.e., the long-run proportion of times that the interval covers the true mean, will not in general be equal to the nominal confidence level. Small deviations are not serious, but large ones are.
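A coverage study of this kind can be sketched in a few lines of Python. The version below is an illustration only, assuming normal sampling with known standard deviation as in the session commands above; the function name `coverage` and the seed are arbitrary, and the estimated proportion will vary with the random numbers used.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def coverage(N, n, mu, sigma, level, rng):
    """Proportion of N z-confidence intervals (known sigma) that cover mu."""
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z_star * sigma / sqrt(n)
    hits = 0
    for _ in range(N):
        xbar = mean([rng.gauss(mu, sigma) for _ in range(n)])
        if xbar - margin < mu < xbar + margin:
            hits += 1
    return hits / N

rng = random.Random(12345)   # fixed seed only so the run is repeatable
print(coverage(2000, 5, 1.0, 2.0, 0.90, rng))
```

Replacing `rng.gauss` with a generator for a non-normal distribution (and using that distribution's mean and standard deviation) gives the kind of check on approximate intervals recommended above.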
6.4
It is also useful to know in a given context how sensitive a particular test of significance is. By this we mean how likely it is that the test will lead us to reject the null hypothesis when the null hypothesis is false. This is measured by the concept of the power of a test. Typically, a level $\alpha$ is chosen for the P-value at which we would definitely reject the null hypothesis if the P-value is smaller than $\alpha$. For example, $\alpha = .05$ is a common choice for this level. Suppose that we have chosen the level of .05 for the two-sided z-test and we want to evaluate the power of the test when the true value of the mean is $\mu = \mu_1$, i.e., evaluate the probability of getting a P-value smaller than .05 when the mean is $\mu_1$. The two-sided z-test with level $\alpha$ rejects $H_0: \mu = \mu_0$ whenever
$$P\left(|Z| > \left|\frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}\right|\right) \le \alpha$$
where Z is a N(0, 1) random variable. This is equivalent to saying that the null hypothesis is rejected whenever
$$\left|\frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}\right|$$
is greater than or equal to the $1 - \alpha/2$ percentile for the N(0, 1) distribution. For example, if $\alpha = .05$, then $1 - \alpha/2 = .975$ and this percentile can be obtained using the command Calc > Probability Distributions > Normal and the inverse distribution function, which gives the output
Normal with mean = 0 and standard deviation = 1.00000
P( X <= x)        x
    0.9750   1.9600
in the Session window, i.e., the .975 percentile of the N(0, 1) distribution is 1.96. Denote this percentile by $z^*$. If $\mu = \mu_1$, then
$$\frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
is a realized value from the distribution of $Y = (\bar{X} - \mu_0)/(\sigma/\sqrt{n})$ when $\bar{X}$ is distributed N($\mu_1$, $\sigma/\sqrt{n}$). Therefore, Y follows a N($(\mu_1 - \mu_0)/(\sigma/\sqrt{n})$, 1) distribution. The power of the two-sided test at $\mu = \mu_1$ is
$$P(|Y| > z^*)$$
and this can be evaluated exactly using the command Calc > Probability Distributions > Normal and the distribution function, after writing
$$P(|Y| > z^*) = P(Y > z^*) + P(Y < -z^*) = P\left(Z > -\frac{\mu_1 - \mu_0}{\sigma/\sqrt{n}} + z^*\right) + P\left(Z < -\frac{\mu_1 - \mu_0}{\sigma/\sqrt{n}} - z^*\right)$$
with Z following an N(0, 1) distribution.
Alternatively, exact power calculations can be carried out under the assumption of sampling from a normal distribution using the Power and Sample Size > 1-Sample Z command and filling in the dialog box appropriately. Also, the minimum sample size required to guarantee a given power at a prescribed difference $|\mu_1 - \mu_0|$ can be obtained using this command. For example, filling in the dialog box for this command as in Display 6.5 creates the output
Testing mean = null (versus not = null)
Calculating power for mean = null + difference
Alpha = 0.05  Sigma = 1.3
              Sample
Difference      Size    Power
       0.1        10   0.0568
       0.2        10   0.0775
in the Session window. This gives the power for testing $H_0: \mu = \mu_0$ versus $H_a: \mu \ne \mu_0$ at $|\mu_1 - \mu_0| = .1$ and $|\mu_1 - \mu_0| = .2$ when n = 10, $\sigma = 1.3$, and $\alpha = .05$. These powers are given by .0568 and .0775, respectively. Clicking on the Options button allows you to choose other alternatives and specify other values of $\alpha$ in the Significance level box.
Display 6.5: Dialog box for calculating powers and minimum sample sizes.
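The exact power formula for the two-sided z-test, and the search for a minimum sample size, can be sketched in Python as follows. This is an illustration under the same assumptions as the text (normal sampling, known standard deviation); `power_two_sided` and `min_sample_size` are hypothetical helpers, and the simple upward search over n is a choice made here, not a description of Minitab's algorithm.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sided(difference, n, sigma, alpha=0.05):
    """Power of the two-sided z-test at |mu1 - mu0| = difference."""
    std_normal = NormalDist()
    z_star = std_normal.inv_cdf(1 - alpha / 2)
    shift = difference / (sigma / sqrt(n))      # mean of Y when mu = mu1
    upper = 1 - std_normal.cdf(z_star - shift)  # P(Y > z*)
    lower = std_normal.cdf(-z_star - shift)     # P(Y < -z*)
    return upper + lower

def min_sample_size(difference, target_power, sigma, alpha=0.05):
    """Smallest n whose power at the given nonzero difference reaches target_power."""
    n = 1
    while power_two_sided(difference, n, sigma, alpha) < target_power:
        n += 1
    return n

print(round(power_two_sided(0.1, 10, 1.3), 4))
print(round(power_two_sided(0.2, 10, 1.3), 4))
print(min_sample_size(0.1, 0.8, 1.3))
```

The first two values correspond to the Power column above; the third uses a target power of .8 at the difference .1.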
If we had instead filled in the Power values box in the dialog box of Display 6.5 with, say, .8 and .9, kept the differences .1 and .2, and left the Sample sizes box empty, we would have obtained the output
Testing mean = null (versus not = null)
Calculating power for mean = null + difference
Alpha = 0.05  Sigma = 1.3
              Sample   Target   Actual
Difference      Size    Power    Power
       0.1      1327   0.8000   0.8002
       0.1      1776   0.9000   0.9000
       0.2       332   0.8000   0.8005
       0.2       444   0.9000   0.9000
in the Session window. This prescribes the minimum sample sizes n = 1327 and n = 1776 to obtain the powers .8 and .9, respectively, at the difference .1 and the sample sizes n = 332 and n = 444 to obtain the powers .8 and .9, respectively, at the difference .2.
This derivation of the power of the two-sided test depended on the sample coming from a normal distribution, as this leads to $\bar{X}$ having an exact normal distribution. In general, however, $\bar{X}$ will be only approximately normal, and so the normal calculation is not exact. To assess the effect of the nonnormality, however, we can often simulate sampling from a variety of distributions and estimate the probability $P(|Y| > z^*)$. For example, suppose that we want to test $H_0: \mu = 0$ in a two-sided z-test based on a sample of 10, where we estimate $\sigma$ by the sample standard deviation and we want to evaluate the power at $\mu = 1$. Let us further suppose that we are actually sampling from a uniform distribution on the interval (-10, 12), which indeed has its mean at 1. The simulation given by the session commands
MTB > random 1000 c1-c10;
SUBC> uniform -10 12.
MTB > rmean c1-c10 c11
MTB > rstdev c1-c10 c12
MTB > let c13=absolute(c11/(c12/sqrt(10)))
MTB > let c14=c13>1.96
MTB > let k1=mean(c14)
MTB > let k2=sqrt(k1*(1-k1)/n(c14))
MTB > print k1 k2
K1 0.112000
K2 0.00997276
estimates the power to be .112, and the standard error of this estimate, as given in K2, is approximately .01. The application determines whether the assumption of a uniform distribution makes sense and whether this power is indicative of a sensitive test.
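The simulated power calculation above can be mirrored in Python. The sketch below follows the same steps as the session commands (sample from the uniform distribution, standardize the sample mean using the sample standard deviation, compare with 1.96); the function name `simulated_power` and the seed are assumptions of this illustration, and the estimate will differ somewhat from the .112 reported by Minitab because different random numbers are used.

```python
import random
from math import sqrt
from statistics import mean, stdev

def simulated_power(N, n, low, high, mu0, cutoff, rng):
    """Estimate P(|Y| > cutoff) and its standard error from N simulated samples."""
    rejections = 0
    for _ in range(N):
        sample = [rng.uniform(low, high) for _ in range(n)]
        # sigma is estimated by the sample standard deviation, as in the text
        y = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
        if abs(y) > cutoff:
            rejections += 1
    estimate = rejections / N
    std_error = sqrt(estimate * (1 - estimate) / N)
    return estimate, std_error

rng = random.Random(2024)    # fixed seed only so the run is repeatable
print(simulated_power(1000, 10, -10, 12, 0, 1.96, rng))
```

The standard error returned alongside the estimate plays the role of K2 in the session commands.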
6.5
If Z is distributed according to the N(0, 1) distribution, then $Y = Z^2$ is distributed according to the Chisquare(1) distribution. If $X_1$ is distributed Chisquare($k_1$) independent of $X_2$ distributed Chisquare($k_2$), then $Y = X_1 + X_2$ is distributed according to the Chisquare($k_1 + k_2$) distribution. There are Minitab commands that assist in carrying out computations for the Chisquare(k) distribution. Note that k is any positive value and is referred to as the degrees of freedom.
The values of the density curve for the Chisquare(k) distribution can be obtained using the Calc > Probability Distributions > Chi-Square command, with k as the Degrees of freedom in the dialog box, or the session command pdf with the subcommand chisquare. For example, the command
MTB > pdf c1 c2;
SUBC> chisquare 4.
calculates the value of the Chisquare(4) density curve at each value in C1 and stores these values in C2. This is useful for plotting the density curve. The Calc > Probability Distributions > Chi-Square command, or the session commands cdf and invcdf, can also be used to obtain values of the Chisquare(k) cumulative distribution function and inverse distribution function, respectively. We use the Calc > Random Data > Chi-Square command, or the session command random, to obtain random samples from these distributions.
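For readers who want to see what the pdf command with the chisquare subcommand is evaluating, the density can be written out explicitly. The Python sketch below uses the standard formula for the Chisquare(k) density, $x^{k/2 - 1} e^{-x/2} / (2^{k/2}\,\Gamma(k/2))$ for x > 0, which is stated here as background rather than taken from the text; `chisquare_pdf` is a hypothetical name.

```python
from math import exp, gamma

def chisquare_pdf(x, k):
    """Density of the Chisquare(k) distribution at x (taken as 0 for x <= 0)."""
    if x <= 0:
        return 0.0
    return x ** (k / 2 - 1) * exp(-x / 2) / (2 ** (k / 2) * gamma(k / 2))

# Evaluate the Chisquare(4) density on a grid, as one would for plotting.
grid = [0.5 * i for i in range(1, 21)]
for x in grid:
    print(f"{x:5.1f} {chisquare_pdf(x, 4):.4f}")
```

Plotting the second column against the first gives the density curve over (0, 10].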
We will see applications of the chi-square distribution later in the book but we mention one here. In particular, if $x_1, \ldots, x_n$ is a sample from a $N(\mu, \sigma)$ distribution, then $(n-1)s^2/\sigma^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2/\sigma^2$ is known to follow a Chisquare($n-1$) distribution, and this fact is used as a basis for inference about $\sigma$ (confidence intervals and tests of significance). Because of the nonrobustness of these inferences to small deviations from normality, these inferences are not recommended.
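The distributional fact above can be checked informally by simulation. The sketch below generates 1000 samples of size $n = 5$ from the $N(0, 2)$ distribution, computes $(n-1)s^2/\sigma^2$ for each sample, and then records the average of these values and the proportion exceeding 9.488, the approximate .95 quantile of the Chisquare(4) distribution. The sample size, the choice of $N(0, 2)$, and the columns and constants used are assumptions made here purely for illustration.

```
MTB > # 1000 samples of size 5 from N(0, 2), one sample per row (illustrative choice)
MTB > random 1000 c1-c5;
SUBC> normal 0 2.
MTB > # sample standard deviation of each row, stored in C6
MTB > rstdev c1-c5 c6
MTB > # (n-1)s^2/sigma^2 with n = 5 and sigma = 2
MTB > let c7=(5-1)*c6**2/2**2
MTB > # average value; the Chisquare(4) distribution has mean 4
MTB > let k1=mean(c7)
MTB > # proportion above 9.488; the theoretical value is about .05
MTB > let c8=c7>9.488
MTB > let k2=mean(c8)
MTB > print k1 k2
```

If the simulated values behave like a sample from the Chisquare(4) distribution, K1 should be near 4 and K2 near .05, with the usual simulation error for 1000 replications.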
6.6
Exercises
When the data for an exercise come from an exercise in IPS, the IPS exercise number is given in parentheses ( ). All computations in these exercises are to be carried out using Minitab, and the exercises are designed to ensure that you have a reasonable understanding of the Minitab material in this chapter. Generally, you should be using Minitab to do all the computations and plotting required for the problems in IPS. If your version of Minitab places restrictions such that the value of the simulation sample size $N$ requested in these problems is not feasible, then substitute a more appropriate value. Be aware, however, that the accuracy of your results is dependent on how large $N$ is.
1. (6.6) Use the Stat > Basic Statistics > 1-Sample Z command to compute 90%, 95%, and 99% confidence intervals for $\mu$.
2. (6.49) Use the Stat > Basic Statistics > 1-Sample Z command to test the null hypothesis against the appropriate alternative. Evaluate the power of the test with level $\alpha = .05$ at $\mu = 225$.
3. Simulate $N = 1000$ samples of size 5 from the $N(1, 2)$ distribution, and calculate the proportion of .90 $z$-confidence intervals for the mean that cover the true value $\mu = 1$.
4. Simulate $N = 1000$ samples of size 10 from the uniform distribution on (0, 1), and calculate the proportion of .90 $z$-confidence intervals for the mean that cover the true value $\mu = .5$. Use $\sigma = 1/\sqrt{12}$.
5. Simulate $N = 1000$ samples of size 10 from the Exponential(1) distribution (see Exercise II.4.7), and calculate the proportion of .95 $z$-confidence intervals for the mean that cover the true value $\mu = 1$. Use $\sigma = 1$.
6. The density curve for the Student(1) distribution takes the form $\frac{1}{\pi} \cdot \frac{1}{1 + x^2}$ for $-\infty < x < \infty$. This special case is called the Cauchy distribution. Plot this density curve in the range $(-20, 20)$ using an increment of .1. Simulate $N = 1000$ samples of size 5 from the Student(1) distribution (see Exercise II.4.12), and calculate the proportion of .90 confidence intervals for the mean, using the sample standard deviation for $\sigma$, that cover the value $\mu = 0$. It is possible to obtain very bad approximations in this example because the central limit theorem does not apply to this distribution. In fact, it does not have a mean.
7. Suppose we are testing $H_0 : \mu = 3$ versus $H_a : \mu \neq 3$ when we are sampling from a $N(\mu, \sigma)$ distribution with $\sigma = 2.1$ and the sample size is $n = 20$. If we use the critical value $\alpha = .01$, determine the power of this test at $\mu = 4$.
8. Suppose we are testing $H_0 : \mu = 3$ versus $H_a : \mu > 3$ when we are sampling from a $N(\mu, \sigma)$ distribution with $\sigma = 2.1$. If we use the critical value $\alpha = .01$, determine the minimum sample size so that the power of this test at $\mu = 4$ is .99.
9. The uniform distribution on the interval $(a, b)$ has mean $\mu = (a + b)/2$ and standard deviation $\sigma = \sqrt{(b - a)^2/12}$. Calculate the power at $\mu = 1$ of the two-sided $z$-test at level $\alpha = .95$ for testing $H_0 : \mu = 0$ when the sample size is $n = 10$, $\sigma$ is the standard deviation of a uniform distribution on $(-10, 12)$, and we are sampling from a normal distribution.
10. Suppose that we are testing $H_0 : \mu = 0$ in a two-sided test based on a sample of 3. Approximate the power of the $z$-test at level $\alpha = .1$ at $\mu = 5$ when we are sampling from the distribution of $Y = 5 + W$, where $W$ follows a Student(6) distribution (see Exercise II.4.12) and we use the sample standard deviation to estimate $\sigma$. Note that the mean of the distribution of $Y$ is 5.