Very Basics Pss
This online seminar is to help you get started with basic data management and analysis in SPSS. It is for people who:

- Are new to statistical data work and want to learn to use SPSS to manage data and perform common analyses.
- Took an undergraduate (or perhaps graduate) introductory statistics course using SPSS long ago and want to refresh their memory.

SPSS is one of the most user-friendly commercial statistical packages. Even beginners in statistical analysis will find its point-and-click interface and dialogue boxes approachable and easy to use. In the long run, however, you will benefit far more from learning to use SPSS through SPSS syntax. The advantages of the syntax approach are:

- It is an efficient way to document your work and make it reproducible (this is the reason I strongly discourage you from continuing to rely on the point-and-click approach).
- It is much quicker and more efficient once you learn how to write and run syntax commands.
- It can do things that are unavailable or inaccessible from the point-and-click menus.
This online seminar therefore aims to prepare you to write syntax commands yourself, so that you can perform simple data tasks and run basic procedures. Aside from this seminar, you have free access to many great instructional SPSS resources online. We strongly recommend that you make full use of them.

This online seminar assumes that you are using SPSS version 16 or later. Be aware that there are some significant changes between version 15 and earlier and version 16 and later.

Contents:

1. Getting Started: Let's Open SPSS and Bring in Data
2. How to Get Descriptive Statistics and Graphs
3. How to Define Variable Properties
4. How to Create and Recode Variables
5. How to Subset (Select) Data
6. How to Sort and Split Data
7. Simple Regression Example
The Data Editor window shows you the working (= currently open) dataset in a spreadsheet format. At this point it is, of course, new and empty. There are two sheets in this window, the Data View and the Variable View. Currently, the Data View is active (in yellow). SPSS shows a status message at the bottom of the window. Currently, it says "SPSS Processor is ready", meaning SPSS is ready for your work.
Before getting started with our work, let's change the output settings. From the menu bar at the top (of whichever window), go: Edit > Options. This brings up the Options dialogue box. Here, you can control what is displayed in your output. Click the Viewer tab. Here is one setting I strongly recommend that you choose: Display commands in log. You will see why in a moment. For now, just check the box and click OK.
Now, let's first bring in some data. We'll use these files for this practice:

- xls_gss93.xls
- csv_gss93subset.csv
- fix_gss93subset.dat
- GSS93 subset.sav
* The second and third files are subsets of the last one, the GSS93 subset data, (7 variables, 97 observations) saved in different formats for practice purposes.
Reading Data from Excel Files

We will start by importing the Excel file xls_gss93.xls into SPSS. First, open the Excel file and check how it is formatted. The first row has variable names, and the data run from the second row down. Close the Excel file, and let's start reading it into SPSS.
Start SPSS by clicking on the SPSS icon or from the Windows Start menu. From the SPSS menu bar at the top, go: File > Open > Data
This brings up the Open Data dialogue box, as shown below. Go to your working directory verybasicSPSS, then select the Excel (*.xls) format from Files of type. Select the Excel file xls_gss93.xls you saved there, then click Open (see below for a visual guide).
(1) Go to your working directory where you saved the Excel file.
(2) Set Files of type to Excel. This brings up our Excel file in the window above.
(3) Select this file.
Now you should be seeing another dialogue box Opening Excel Data Source.
As we checked at the start, the Excel file has variable names in the first row, so check the Read variable names from the first row of the data box. Click OK.

Now you have a new, unsaved dataset in another SPSS Data Editor window. To save the data in the SPSS format, go from the pull-down menu: File > Save. Let's save it in your working directory with the name xls_gss93. Let's keep this data open for a moment.
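The save step can also be written as syntax. The line below is a minimal sketch, assuming the dataset you just imported is the active one and that you substitute your own working directory for the placeholder; the .sav extension is the usual one for SPSS data files.

```spss
* Save the active dataset as an SPSS data file .
SAVE OUTFILE='[specify your working directory]\xls_gss93.sav'.
```

Keeping this line in your syntax file records where the saved copy went, which the File > Save menu does not do for you.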
Reading Data from Text Files (comma-separated values)

Next, let's try importing csv_gss93subset.csv, a text file in the comma-separated values format. The first line contains variable names. From the menu bar at the top, go: File > Open > Data

This again brings up the Open Data dialogue box. Make sure you are looking in your working directory verybasicSPSS. Since our file extension is .csv, we need to select All Files (*.*) from Files of type. Then csv_gss93subset.csv shows up in the window. Select it and click Open.
The Text Import Wizard dialogue box then shows up. It has six steps. Click Next, and in Step 2 of 6, select the Yes radio button for the question "Are variable names included at the top of your file?", because we do have variable names in the first row. Otherwise, accept the default settings and keep clicking Next. In Step 6 of 6, click Finish.
You should see another SPSS Data Editor window, [DataSet2], pop up. As this shows, multiple data files can be open in SPSS at the same time (we'll say a bit more about this later). Browse the data you just read in, and then let's close it without saving.
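For reference, the Text Import Wizard can also paste syntax instead of running the import directly. The block below is a sketch of what such syntax typically looks like, not the exact commands the wizard produced here: the variable order and the formats (F4.0, F1.0, and so on) are assumptions taken from the codebook used elsewhere in this seminar, so check them against your own file.

```spss
* Read a comma-separated file whose first line holds variable names .
* Variable order and formats below are assumed from the codebook .
GET DATA
  /TYPE=TXT
  /FILE='[specify your working directory]\csv_gss93subset.csv'
  /DELCASE=LINE
  /DELIMITERS=","
  /ARRANGEMENT=DELIMITED
  /FIRSTCASE=2
  /IMPORTCASE=ALL
  /VARIABLES=
    id F4.0
    wrkstat F1.0
    marital F1.0
    agewed F2.0
    sibs F2.0
    childs F1.0
    age F2.0.
EXECUTE.
```

FIRSTCASE=2 tells SPSS that the data start on the second line, which is the syntax counterpart of answering Yes in Step 2 of the wizard.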
Reading Data from Text Files (ASCII fixed format)

Finally, let's practice reading the data fix_gss93subset.dat, which is an ASCII fixed-format file. This type of data always comes with a codebook that specifies which columns correspond to which variable. Take a look at this text file (first below; notice there is no variable name header) and its codebook (second below).
   11320 3143
   215 0 2044
   31325 2043
   425 0 4045
   555 0 1078
   65125 2283
   71122 2255
   85124 3275
   91322 1231
  1025 0 1054

Variable Name   Column Number
id              1-4
wrkstat         5
marital         6
agewed          7-8
sibs            9-10
childs          11
age             12-13
Now, unlike the previous examples, there is no easy point-and-click method to read this type of data. What do we do then? The best way is to write syntax commands ourselves to bring in this data. Let's open a new syntax file for this work. From the menu bar at the top, go: File > New > Syntax

You should now be seeing the SPSS Syntax Editor, which is another important window in the SPSS environment. (As I emphasized in the introduction, you should eventually learn to write and use syntax files for your work. You are now getting a little glimpse of it.) Let's save it as verybasicspss in your working directory. Now, let's type the following commands (be sure to specify the location where you saved the file fix_gss93subset.dat).
data list fixed file='[specify your working directory]\fix_gss93subset.dat'
  / id 1-4 wrkstat 5 marital 6 agewed 7-8 sibs 9-10 childs 11 age 12-13.
Always end your command with a period.
The command DATA LIST reads a text-format data file by assigning names and formats to each variable in the file. The keyword FIXED follows to tell SPSS that our data are in fixed format (this is actually the SPSS default, so you can skip it). The subcommand FILE = 'file location/name' specifies your fixed-format file and its location. After the slash (/), we give SPSS the variable definitions (the variable names and column numbers) from the codebook.

Two syntax rules you must remember here:

1. Notice that the whole command ended with a period (.). In SPSS, each command must end with a period.
2. SPSS syntax is NOT case-sensitive.
Now, let's execute our commands. First, highlight them. Then, to run the highlighted part, hit the Run Current button or press Ctrl + R. When you run the above command, another Data Editor window should open for this new data.

But what did you get there? You should be seeing a blank spreadsheet under the *Untitled4 [] heading, although the Variable View shows that SPSS does have the variable information. Why aren't we seeing the data itself? SPSS has not actually read the data yet. It reads the data only when a command needs them, so we must run another command that uses this data. Go back to your Syntax Editor, and first make sure we are working on this dataset.
This pull-down menu indicates your active data source.
When you have multiple data files open at the same time, you need to tell SPSS which data file you are working on (this is called the active file). You can make a file active by simply clicking anywhere in the Data Editor window of the data you want to use (in this case, *Untitled4 []). Alternatively, when you have your Syntax Editor open, you can use the pull-down menu (in this case, it should be set to Unnamed, since *Untitled4 [] is neither saved nor named). Once you have made sure *Untitled4 [] is active, type in the following command (don't forget the period), then highlight and run it.
list.
Now, what do you have in your Data Editor and Output Viewer? You should now be seeing the data content in the Data Editor, and the command LIST has been executed with its result shown in the Output Viewer. The point is this: SPSS keeps the data definition in memory and does not read the data until it needs to, because that's more efficient in terms of processing. In this example, SPSS encounters the procedure command LIST, realizes it needs the data *Untitled4 [] to execute LIST and produce results, and only at that moment reads in the data.
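As a side note on the active file discussed above, you can also name and activate datasets from syntax rather than by clicking. The sketch below uses DATASET NAME and DATASET ACTIVATE; the name fixdata is a hypothetical label chosen for illustration, not something defined elsewhere in this seminar.

```spss
* Give the currently active (unnamed) dataset a name .
DATASET NAME fixdata.
* Later, make that dataset the active one again before running commands on it .
DATASET ACTIVATE fixdata.
LIST.
```

With a named dataset, the pull-down menu in the Syntax Editor shows that name instead of Unnamed, which makes it harder to run a command on the wrong file.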
But suppose you want to explicitly force a data pass so that you can immediately see the read-in data in the Data Editor. The command EXECUTE does that for you. If you run the following,
data list fixed file='[specify your working directory]\fix_gss93subset.dat'
  / id 1-4 wrkstat 5 marital 6 agewed 7-8 sibs 9-10 childs 11 age 12-13.
execute.
then you would immediately see the result in your Data Editor without running any procedure command. EXECUTE forces a data pass, which also carries out any pending data transformations (for example, when you create or recode variables and need the data updated with those new variables before you can use them). It does nothing else to the session.

As I said, however, SPSS reads the data whenever it needs to anyway, so EXECUTE is rarely if ever necessary. In fact, putting EXECUTE after every single data transformation command slows processing down, because SPSS is forced to read the data at every EXECUTE even when no data pass is needed at that moment. So you should use EXECUTE sparingly. We will come back to this command later and discuss a couple of situations where you absolutely must run EXECUTE. For now, let's take a look at our output.
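Here is the point about EXECUTE in miniature, as a small sketch. The variable agedecade is hypothetical and used only for illustration. The transformation is held pending until a procedure asks for the data, so no EXECUTE is needed in between.

```spss
* The transformation is only queued at this point .
COMPUTE agedecade = age / 10.
* DESCRIPTIVES needs the data, so it triggers the data pass itself .
DESCRIPTIVES VARIABLES=agedecade.
```

Adding EXECUTE between the two commands would give the same result but would read the data twice instead of once.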
We checked Display commands in log in the Options menu, so SPSS displays the syntax it ran on the output.
You see the data content listed. You can save your output from the drop-down menu: File > Save As. The file extension for SPSS output is .spv. (Note: SPSS versions older than 16 use the extension .spo for output files. To open and view .spo files in SPSS version 16 or later, you need to install the SPSS Legacy Viewer. For more information, see the SPSS technical support website.)

The left-side pane is SPSS's outline view of your output. It serves as a table of contents and lets you navigate to different parts of your output by clicking on the small output icons.

Notice that the output shows all the syntax we have run so far, even the commands you didn't write yourself. This is why I strongly recommended that you set Display commands in log in the Options menu. Having the actual commands you ran alongside the corresponding output helps you greatly with documentation. You can always see which command and options generated the output you have, and you can keep track of exactly what you did with the data. This is very important.

Further, be aware that SPSS syntax like this is running beneath the point-and-click interface even when you simply use the pull-down menus and never see the actual commands SPSS runs. As mentioned earlier, you should eventually learn to write commands in the Syntax Editor and run them from there instead of pointing and clicking. This is also very important for documentation and reproducibility.

Had you written and run the following syntax commands yourself, you would have gotten the same results. They are the syntax that ran beneath your pointing and clicking.

Read an Excel file
get data
  /type = xls
  /file = '[specify your working directory]\verybasicSPSS\xls_gss93.xls'
  /sheet = name "xls_gss93subset"
  /readnames = on.
execute.
The command GET DATA reads external files into SPSS. For further syntax help, you can always go from the menu bar at the top: Help > Command Syntax Reference. We will learn some additional basics of command writing throughout this workshop.

We can also enter data directly in the Syntax Editor. Type the following lines.
* Read dataline from syntax file .
data list / id 1-3 sex 5 (A) age 7-8 treat 10.
begin data
001 f 43 0
002 f 25 1
003 m 36 0
end data.
We use two commands. One is DATA LIST (we already learned it), and the other is a pair of BEGIN DATA and END DATA. Again notice each command is finalized by a period at the end. BEGIN DATA and END DATA are used when data are entered within the command sequence, and data records are placed in between. One important thing you need to remember from this example is this part:
sex 5 (A)
By default, SPSS treats variables as numeric. The variable sex here is a character variable (f/m). By putting (A) after the variable name and the column number, you tell SPSS that this is a character variable.
Open SPSS files

Now, let's open an SPSS system file, GSS93 subset.sav. This is actually the easiest part. From the menu bar at the top of the SPSS Data Editor window, go: File > Open > Data
Find and open GSS93 subset.sav by double-clicking on it or choosing it and hitting OK.
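The syntax counterpart of this step is GET FILE. The sketch below assumes the file sits in your working directory (replace the placeholder with your own path). DISPLAY DICTIONARY is added as an optional extra that prints the variable information to the Output Viewer.

```spss
* Open an SPSS system file .
GET FILE='[specify your working directory]\GSS93 subset.sav'.
* Optional: list variable names, labels, formats and missing values in the output .
DISPLAY DICTIONARY.
```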
There you go. Let's click on the tab of the Variable View sheet and see what you have there.
The Data View sheet and the Variable View sheet look very similar, but the latter holds information about the variables shown in the Data View sheet, including variable names, data type (numeric, string, etc.), variable and value labels, how missing values are coded, and so on. Most of those information cells have hidden dialogue boxes or pull-down menus, which you can call up by selecting the cell and then clicking on the gray button that appears on the right side of the cell. For example, let's try activating the dialogue box for the values of the variable marital.
(2) Then value labels dialogue box for the variable marital shows up.
Now, technically this box allows you to define or modify the value labels of the variable. However, I brought it up only to warn you in case you happen to find it and want to use it. DON'T USE THIS BOX for data management purposes. Although it looks easy, using this dialogue box is dangerous. It makes it extremely difficult to keep track of the changes you made to the data, because it leaves no record of your actions. You should use the Syntax Editor instead. Let's click Cancel to close the Value Labels dialogue box.

Let's close all the data sources other than the one we just read in, GSS93 subset.sav.
(2) Options brings up the Descriptives: Options dialogue box. (3) Click Continue after selecting options you want.
Okay, we are ready to get descriptive statistics for this variable. Now, let's click Paste. What did you get? You should now have an SPSS Syntax Editor window like the one below.
What we just did is paste the syntax command that SPSS writes to obtain descriptive statistics. As I emphasized, always be aware that SPSS syntax commands like this are running beneath the point-and-click interface, even when you simply use the pull-down menus and click OK. You should eventually learn to write SPSS syntax yourself.

Now, let's take a look at the pasted command. The SPSS command to get descriptive statistics is DESCRIPTIVES, followed by its subcommand VARIABLES = varname. The most basic structure of the SPSS syntax command language is:

COMMAND <options if any> / [SUBCOMMAND <options if any>] .

The slash (/) separates subcommands. This basic form can vary slightly from command to command. In DESCRIPTIVES, for example, the subcommand VARIABLES immediately follows the command name DESCRIPTIVES, before the first slash (/).
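To see the same command and subcommand pattern in a different command, here is a small illustrative sketch using FREQUENCIES on the variable marital. The particular subcommands are my choice for the example, not something pasted from a dialogue box.

```spss
* One command, three subcommands, one closing period .
FREQUENCIES VARIABLES=marital
  /FORMAT=DFREQ
  /BARCHART PERCENT.
```

Here VARIABLES follows the command name directly, each later subcommand starts with a slash, and the single period at the very end closes the whole command.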
It is always a good idea to add comments to your syntax file for documentation purposes. Use an asterisk (*) or the command COMMENT to start your comment text. Remember, all SPSS commands must end with a period, and that rule applies to comments as well. It is imperative to mark the end of your comment with a period. Let me show you why. Run the following block of commands. What did you get in your Output Viewer?
* Descriptives for years of education
DESCRIPTIVES VARIABLES=educ
  /STATISTICS=MEAN STDDEV MIN MAX SKEWNESS .
You got nothing except the log of the syntax you just ran. Why? Because SPSS treats everything between * (or COMMENT) and the next period as your comment. In this case, everything from "* Descriptives for" through "SKEWNESS ." is treated as one block of comment, so DESCRIPTIVES was never executed as a command (and you are left dumbfounded to find no results in your Output Viewer). So you must always end your comment with a period.

The flip side is that you can start with * or COMMENT and keep commenting over multiple lines until you end the comment with a period. This may be helpful if you need to add extensive comments to your syntax. So let's fix our syntax.
* Descriptives for years of education
  We can comment over multiple lines,
  just do not forget to end it with a period .
DESCRIPTIVES VARIABLES=educ
  /STATISTICS=MEAN STDDEV /*standard deviation*/ MIN MAX SKEWNESS .

(A block of comment runs from the * to the period (.), over multiple lines.)
Notice that your comment now runs over multiple lines. As you can see, you can alternatively use /* COMMENT HERE */. In this case, */ rather than a period marks the end of your comment, so comments can be inserted within your command lines. Let's execute this syntax command, including the comment. Select (highlight) the whole syntax command and hit the Run Current button at the top of the Syntax Editor, or press Ctrl + R.
SPSS has in the left pane the output index table. Click any listed index, and SPSS navigates you to the corresponding result objects in the right pane (feel free to try). I highlight the descriptives to bring the corresponding output to my view.
You can copy and paste those output items. As an example, try right-clicking on the descriptives table, selecting Copy, and then pasting it into your word processor document.

The average number of years of school completed is 13.04. Surprisingly, there are people with zero years of education. There seems to be no real concern about skewness.

How different is the mean number of years of school completed between males and females? Let's compare their mean values: Analyze > Compare Means > Means

This brings up the Means dialogue box. Select educ (Highest year of school completed) for the Dependent List and respondent's sex for the Independent List. Click Options, add median and skewness to your statistics, and then click Continue. Then click Paste. Highlight and run the command.
* Comparing years of education by sex .
MEANS TABLES=educ BY sex
  /CELLS MEAN COUNT STDDEV MEDIAN SKEW.
You should see that, on average, the highest year of education completed is 13.19 for male respondents and 12.92 for female respondents. The median is close to the mean for males, so we would expect the variable to be roughly normally distributed. Let's visualize it: Graphs > Legacy Dialogs > Histogram
The Histogram box pops up. Select the education variable for the Variable, check the display normal curve box and choose Respondents Sex to panel our histogram by column.
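If you click Paste instead of OK in this dialogue box, SPSS writes a GRAPH command. The following is a sketch of what that syntax looks like for the choices described above (normal curve overlaid, panels in columns by sex); treat it as an approximation of the pasted command rather than a verbatim copy.

```spss
* Histogram of education with a normal curve, paneled by sex in columns .
GRAPH
  /HISTOGRAM(NORMAL)=educ
  /PANEL COLVAR=sex COLOP=CROSS.
```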
The distributions do not look bad, but (not surprisingly) the highest year of education is 12 for a great many people, especially female respondents. Stem-and-leaf plots and box plots are also often used to check the distributions of variables and their extreme values. Here, we use the command EXAMINE to get the descriptive information all at once: Analyze > Descriptive Statistics > Explore

The Explore dialogue box shows up. Select educ (Highest year of school completed) for the Dependent List and the sex variable (Respondent's sex) for the Factor List. Then click Statistics, check the Descriptives and Percentiles boxes, and click Continue.
Then click Plots, check the Factor levels together for Boxplot and the Stem-and-leaf boxes under the Descriptive heading.
[Descriptives and percentiles output omitted] Check the legends (highlighted) to see what the stem and leaf represent in your output.
Highest Year of School Completed Stem-and-Leaf Plot for sex= Male

 Frequency    Stem &  Leaf

    11.00 Extremes    (=<5.0)
    10.00        6 .  00000
    13.00        7 .  000000
    30.00        8 .  000000000000000
    17.00        9 .  00000000
    21.00       10 .  0000000000
    33.00       11 .  0000000000000000
   172.00       12 .  00000000000000000000000000000000000000000000000000000000000000000000000000000000000000
    55.00       13 .  000000000000000000000000000
    57.00       14 .  0000000000000000000000000000
    34.00       15 .  00000000000000000
    95.00       16 .  00000000000000000000000000000000000000000000000
    25.00       17 .  000000000000
    32.00       18 .  0000000000000000
    16.00       19 .  00000000
    18.00       20 .  000000000

 Stem width:  1
 Each leaf:   2 case(s)

Highest Year of School Completed Stem-and-Leaf Plot for sex= Female

 Frequency    Stem &  Leaf

    32.00 Extremes    (=<7.0)
    29.00        8 .  0000000000
    28.00        9 .  000000000
    34.00       10 .  00000000000
    48.00       11 .  0000000000000000
   273.00       12 .  0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
    80.00       13 .  000000000000000000000000000
   109.00       14 .  000000000000000000000000000000000000
    36.00       15 .  000000000000
   113.00       16 .  00000000000000000000000000000000000000
    21.00       17 .  0000000
    39.00       18 .  0000000000000
     8.00       19 .  000
     7.00 Extremes    (>=20)

 Stem width:  1
 Each leaf:   3 case(s)
The top of the box represents the 75th percentile, the bottom of the box represents the 25th percentile, and the line in the middle represents the 50th percentile (= median). There is no visible middle line in the box for female cases, though. That is because the 50th and 25th percentiles of the female sample have the same value (= 12; check the percentile table in your output yourself).

The lines that extend from the top and bottom of the box are called whiskers. They reach to the highest and lowest values that are not outliers or extreme values. Outliers are values that lie between 1.5 and 3 box lengths beyond the 75th or 25th percentile (one box length = the interquartile range), and extreme values lie more than 3 box lengths beyond them. They are shown beyond the whiskers as circles and asterisks, respectively.
EXAMINE is a very useful data exploration command. With the same command you can also get a histogram (add HISTOGRAM to the /PLOT subcommand of the EXAMINE syntax and run it) and a normal Q-Q plot (in the same way, add NPPLOT to the /PLOT subcommand and run it).
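For reference, here is a sketch of EXAMINE syntax that corresponds to the Explore choices described above, with HISTOGRAM and NPPLOT already added to the /PLOT subcommand. It approximates what Paste would produce; the percentile list and the remaining subcommands are the usual defaults, which I am assuming were left unchanged.

```spss
* Explore education by sex: descriptives, percentiles and plots .
EXAMINE VARIABLES=educ BY sex
  /PLOT BOXPLOT STEMLEAF HISTOGRAM NPPLOT
  /COMPARE GROUPS
  /PERCENTILES(5,10,25,50,75,90,95) HAVERAGE
  /STATISTICS DESCRIPTIVES
  /CINTERVAL 95
  /MISSING LISTWISE
  /NOTOTAL.
```

Removing HISTOGRAM and NPPLOT from the /PLOT line gives back just the box plot and the stem-and-leaf plot.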
We can get a good idea about our data by exploring it like this. Let's continue and get frequency tables for respondents' work status and marital status. How many people are working full-time or unemployed? How many are married or divorced? Analyze > Descriptive Statistics > Frequencies
After selecting the variables wrkstat and marital, click the Charts button. You should get the Frequencies: Charts dialogue box, like the one below. Select the Pie charts radio button under the Chart Type heading and the Percentages button under the Chart Values heading. Click Continue, and then paste the syntax. As you can see, those charts can be obtained through subcommands of the FREQUENCIES command.
* Frequencies with pie charts .
FREQUENCIES VARIABLES=wrkstat marital
  /PIECHART PERCENT
  /ORDER=ANALYSIS.
You should now be seeing frequency tables and nice big pie charts for these two variables. Approximately half of the respondents are working full-time (49.8%), and about half are currently married (53.0%).
Labor Force Status

                            Frequency   Percent   Valid Percent   Cumulative Percent
Valid   Working fulltime          747      49.8            49.8                 49.8
        Working parttime          161      10.7            10.7                 60.5
        Temp not working           32       2.1             2.1                 62.7
        Unempl, laid off           51       3.4             3.4                 66.1
        Retired                   231      15.4            15.4                 81.5
        School                     42       2.8             2.8                 84.3
        Keeping house             200      13.3            13.3                 97.6
        Other                      36       2.4             2.4                100.0
        Total                    1500     100.0           100.0
Marital Status

                            Frequency   Percent   Valid Percent   Cumulative Percent
Valid   married                   795      53.0            53.0                 53.0
        widowed                   165      11.0            11.0                 64.0
        divorced                  213      14.2            14.2                 78.3
        separated                  40       2.7             2.7                 80.9
        never married             286      19.1            19.1                100.0
        Total                    1499      99.9           100.0
Missing NA                          1        .1
Total                            1500     100.0
Let's get a crosstab to see if there may be any relationship between political views and opinions about life-prolonging measures: Analyze > Descriptive Statistics > Crosstabs

Choose the variable letdie1 (Allow incurable patients to die) for the rows and polviews (Think of Self as Liberal or Conservative) for the columns. Then click Statistics and check the Chi-square box in the Crosstabs: Statistics dialogue box. Click Continue.
Then click Cells button to get to the Crosstabs: Cell Display dialogue box. We want to know the expected value for each cell to compare with the corresponding observed value, so check both the Observed and Expected checkboxes under the Count heading.
Click Continue. Then once back to the main Crosstabs dialogue box, click Paste.
* Crosstab between letdie1 and polviews .
CROSSTABS
  /TABLES=letdie1 BY polviews
  /FORMAT=AVALUE TABLES
  /STATISTICS=CHISQ
  /CELLS=COUNT EXPECTED
  /COUNT ROUND CELL.
Your Chi-square test result is displayed at the bottom of the Crosstabs output.
Chi-Square Tests

                                 Value       df
Pearson Chi-Square               33.155(a)    6
Likelihood Ratio                              6
Linear-by-Linear Association                  1
N of Valid Cases

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 6.94.
As the note at the bottom of the table says, the test condition is met (i.e., no cell has an expected count below 5), so the test is valid. The result is statistically significant, indicating that political views are associated with opinions about life-prolonging measures. More politically liberal people are more open to the idea of allowing incurable patients to die.

Okay, let's now see how closely years of education and age at first marriage are associated: Graphs > Legacy Dialogs > Scatter/Dot

The Scatter/Dot dialogue box comes up. Click on Simple Scatter and click Define.
You should reach the Simple Scatterplot dialogue box. Choose agewed (Age when first married) for the Y-Axis and educ (Highest year of school completed) for the X-Axis.
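Pasting from this dialogue box produces a GRAPH command as well. The sketch below shows the general form for a simple scatterplot with educ on the X-axis and agewed on the Y-axis; it is an approximation of the pasted syntax, with the MISSING setting assumed to be the default.

```spss
* Simple scatterplot: the variable before WITH goes on the X-axis .
GRAPH
  /SCATTERPLOT(BIVAR)=educ WITH agewed
  /MISSING=LISTWISE.
```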
Now, we have 1500 observations in this data, but there do not seem to be nearly as many dots. This is because multiple observations share the same data points. We want to show how dense each data point is. To do so, we use the Chart Editor. Double-click anywhere in the scatterplot output area to invoke the Chart Editor. Then go to: Options > Bin Element
This brings up the Properties dialogue box (below). In the Binning tab, select the Color Intensity radio button under the Count Indicators heading. Click Apply.
Then you have a scatterplot that includes information about number count for each data point.
From the densest area of the scatterplot above, you can see that many people graduated from high school and then got married soon after, around the age of 20.

I urge you to closely review the syntax we used in this subsection. So far, we have had SPSS write the code for us, but again, you should eventually be able to write your syntax yourself. You can then extend it to perform tasks that the point-and-click interface cannot.
Define Variable Properties

Here is part of the codebook for this data file (just part of it, for our practice purposes).

Variable   Variable Label                   Values               Missing Values   Width
Name
id         Respondent ID Number                                                   4
wrkstat    Labor Force Status               1 Working fulltime                    1
                                            2 Working parttime
                                            3 Temp not working
                                            4 Unempl, laid off
                                            5 Retired
                                            6 School
                                            7 Keeping house
                                            8 Other
marital    Marital Status                   1 Married            9 NA
                                            2 Widowed
                                            3 Divorced
                                            4 Separated
                                            5 Never married
agewed     Age When First Married                                0 nap, 99 na     2
sibs       Number of Brothers and Sisters                        98 dk, 99 na     2
childs     Number of Children               8 Eight or more      9 NA             1
age        Age of Respondent                                     99 NA            2
We will add the information above to the data file. We will first use the point-and-click approach and then look at the syntax code beneath it. From the menu bar at the top, go to: Data > Define Variable Properties

This brings up the Define Variable Properties dialogue box. Let's select all seven variables.
Click Continue.
You'll reach the next dialogue box, where you can define variable properties. Select the variables one by one and define their properties according to the codebook. The Changed checkbox is checked automatically once you make changes to a value label. An example using the variable wrkstat is shown below.
(1) Highlight and select a variable. Then the property items will show up in the right side. (2) Variable label
SPSS automatically checks this for you when you make changes to value labels.
Some variables have missing codes. To define missing values, check the Missing checkbox. For example, the variable marital has the value 9 for missing values. To tell SPSS that 9 represents missing values for this variable, you do the following.
Once you finish defining properties for all the variables, click Paste and see what syntax commands SPSS wrote. For each variable, several commands are used to define its properties:

- To define the measurement level, use VARIABLE LEVEL varname (LEVEL).
- To define the variable label, use VARIABLE LABELS varname label.
- To define the format, use FORMATS varname (format).
- To define value labels, use VALUE LABELS varname labels.
- To define missing values, use MISSING VALUES varname (values).

Remember, each command must end with a period (.).
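To make these command forms concrete, here is a sketch for the variable marital, using the labels and missing code from the codebook. The NOMINAL measurement level and the F1.0 format are my assumptions for illustration; the syntax SPSS pastes for you may be arranged differently.

```spss
* Define properties for marital (level and format are assumed) .
VARIABLE LABELS marital 'Marital Status'.
VARIABLE LEVEL marital (NOMINAL).
FORMATS marital (F1.0).
VALUE LABELS marital
  1 'Married'
  2 'Widowed'
  3 'Divorced'
  4 'Separated'
  5 'Never married'
  9 'NA'.
MISSING VALUES marital (9).
```

Because these lines live in your syntax file, you can rerun them on a fresh copy of the data and get exactly the same variable definitions.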
Let's run the commands. Then go to the Variable View of your SPSS Data Editor and see the results.
Save your data (Ctrl + S, File > Save, or the Save button on the toolbar).
As I noted before, this is how you should change variable properties, because it lets you keep all the work and decisions you made for future reference. Further, you can repeat the same task later if necessary. Don't do this with the hidden dialogue boxes of the Variable View. They will not let you keep any systematic record of your work, and you will most certainly lose track of your research work if you keep using them.
The mean value of the variable age is 46.23, with a standard deviation of 17.42. It ranges from 18 to 89. The minimum value is 18 (18 years old) and there are no zero or negative values (no below-zero ages, of course!), so we can take its log as is, without adding anything.

We'll start with the point-and-click approach and then go over the syntax command. From the menu bar at the top, go: Transform > Compute Variable

A new dialogue box, Compute Variable, pops up. The box under the Target Variable heading at the upper left corner is where you type in the name of the newly created variable. To define and compute the new variable, enter the expression in the Numeric Expression box.
33
Lets start by creating a logged age variable. We call our new variable lnage, so type in lnage in the Target Variable box. This is a numeric variable, so click on the Type & Label button right below and make sure it is specified as numeric. Also label this variable Logged Age. Click Continue to be back to the Compute Variable box. Then type in your expression in the blank of Numeric Expression. We use the function LN(numexpr) which returns a base-e log of numexpr (i.e., number or expression). You also can find functions in the boxes under the Function group heading and the Functions & Special Variables heading in the right side. For LN, select Arithmetic in the former, and then find and double-click on LN in the latter. The function is automatically entered in the Numeric Expression box. Plug in the variable age in the parenthesis.
(1) Enter the new variable name.
(2) Click this button to open the Type & Label box, specify the type, and label the new variable Logged age. Click Continue to return to this dialogue box.
(3) Type in the expression. You can directly type in LN(age), but if you cannot recall functions or variable names, you can select them from the two boxes on the lower right side and the variable list on the left side.
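The pasted syntax itself is not reproduced in this text. Assuming the default settings, clicking Paste in the Compute Variable box produces something along these lines (a sketch, with the label as entered above):

```spss
* Create a logged age variable, roughly as Paste would write it.
COMPUTE lnage = LN(age).
VARIABLE LABELS lnage 'Logged Age'.
EXECUTE.
```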
Now, this process involves the transformation command COMPUTE. It creates new variables and hence updates your data. As I mentioned above, SPSS won't perform the data transformation (that is, read the data) until it needs to, which conversely means it will when it needs to. To explicitly force a data pass, you can run EXECUTE. By default, SPSS automatically adds EXECUTE when transformation commands are pasted from a dialogue box. To get an idea of how this works, try running your command without EXECUTE first, and look at your Data Editor. A new column is created for the lnage variable in the Data View, but its values are not filled in yet, because we haven't had any data pass, whether from EXECUTE or from a procedure command. Now highlight and run EXECUTE, and see what happens in the Data View. SPSS carries out the pending transformation it has been holding in memory, and the newly created variable lnage now has its values. Let's also create a quadratic version of the age variable and call it sqage. This time, we just use the Syntax Editor. The squared term of a can simply be expressed as a*a.
* Squared term of the age variable.
compute sqage = age*age.
variable labels sqage 'Quadratic Age'.
descriptives variables = sqage.
Now, first, highlight and run only the compute and variable labels commands, and look at the Data View. Again, a new column is created for the variable sqage, but no data transformation has been executed yet. Then, this time, instead of EXECUTE, we run the procedure command descriptives to obtain descriptive statistics for this new variable. You get the result below, and if you check the Data View you will see that the new values are now filled in.
Descriptive Statistics
                      N      Minimum   Maximum   Mean        Std. Deviation
Quadratic Age         1495   324.00    7921.00   2440.0957   1789.00139
Valid N (listwise)    1495
In this example, there is no EXECUTE, yet SPSS still performed the data transformation because it needs to read in sqage to execute the procedure DESCRIPTIVES for this variable. The point is, SPSS waits to make data changes until it absolutely needs to do so. This way, the number of data passes decreases and SPSS's processing speeds up. Let me emphasize this again: in most cases you don't need to run EXECUTE every time, because SPSS reads the data when it has to. An unnecessary EXECUTE makes SPSS read the data again and again even when it doesn't have to, and as a result slows down the processing. So use this command sparingly.
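To illustrate the point, here is a minimal sketch that queues up several transformations and lets one procedure trigger a single data pass for all of them. It reuses the lnage and sqage variables from this section; nothing else is assumed.

```spss
* Several transformations, then one procedure that triggers a single data pass.
compute lnage = ln(age).
compute sqage = age*age.
variable labels lnage 'Logged Age' sqage 'Quadratic Age'.
descriptives variables = lnage sqage.
```

No EXECUTE is needed here, because DESCRIPTIVES has to read the data anyway.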
This means that most of the time you can remove the EXECUTE that SPSS automatically generates when you paste transformation commands from a dialogue box. But it's annoying to have SPSS paste EXECUTE and then delete it again and again. We don't want SPSS to be so eager to insert EXECUTE every time. So, let's make SPSS a little lazy. Go to Edit > Options, and you will see the Options dialogue box. Click on the Data tab, and select the Calculate values before used radio button under the Transformation and Merge Options heading.
Click OK. Let's try pasting the same syntax to create the lnage variable from the dialogue box and see what this option change does for us. Click on the Dialogue Recall button on the toolbar (every SPSS window has the same toolbar), and from the drop-down list, recall the command we just ran from the dialogue box, which is COMPUTE.
A pull-down menu shows up.
You should now be seeing the same dialogue box as before, and our last work is still there. Let's click Paste (don't worry about the Change existing variable? message; we are just pasting the command into the syntax file), and see what is pasted into your syntax file. Can you see that SPSS is lazy now, i.e., it does not paste EXECUTE this time? This means that SPSS won't perform data transformations after every transformation command. So, you don't have to force a data pass every single time; let SPSS read the updated data when it needs to. There are, however, some specific situations where you absolutely need to run EXECUTE explicitly and force a data pass. Let's take a look at the following example.
* Must-use EXECUTE example (1) Rule 1: Lag functions and EXECUTE .
* First create a mini data set.
data list free / var1.
begin data
1 2 3 4 5
end data.
compute var2 = var1.
list.
* (1)-(a): lag() function w/o intervening EXECUTE.
compute lagvar1 = lag(var1) .
compute var1 = var1*var1 .
* (1)-(b): lag() function w/ intervening EXECUTE.
compute lagvar2 = lag(var2) .
* Here is the difference: an intervening EXECUTE.
execute.
compute var2 = var2*var2 .
list.
What we did above is first to create a simple data set containing var1 and var2, which are identical, with five observations whose values are 1, 2, 3, 4, 5, and then to lag those variables by using the function LAG(). The only difference between (1)-(a) and (1)-(b) is whether EXECUTE is placed after computing the lag variable. Now, see what you got in your output (or the Data View). What difference did EXECUTE make to the new lagged variables?
var1     var2     lagvar1   lagvar2
1.00     1.00     .         .
4.00     4.00     1.00      1.00
9.00     9.00     4.00      2.00
16.00    16.00    9.00      3.00
25.00    25.00    16.00     4.00
Look at lagvar1 and lagvar2. You might have been assuming you were creating a set of the same lagged variables, but you got different results.
The key, of course, is the presence or absence of EXECUTE after compute lagvar# = lag(var#). The difference arises because the function LAG() is calculated after all other transformations are performed, regardless of command order. So, in example (1)-(a), without an intervening EXECUTE, the new variable lagvar1 was created from the transformed values of var1 (i.e., var1*var1). SPSS executed compute var1 = var1*var1. first, and only then calculated compute lagvar1 = lag(var1). In contrast, in example (1)-(b), we explicitly placed an intervening EXECUTE after compute lagvar2 = lag(var2), meaning that we forced SPSS to transform the data and create lagvar2 at that point, before moving on to the transformation of var2. Thus, lagvar2 was created from the original values of var2: 1, 2, 3, 4, 5. So, depending on what you mean to do, you need to explicitly force a data pass when you use LAG(). This is Rule 1 about the placement of the EXECUTE command between transformation commands. Let's take a look at another example. Examine the syntax below.
* Must-use EXECUTE example (2) Rule 2: System variable $CASENUM, SELECT IF and EXECUTE.
* (2)-(a): $casenum and SELECT IF, w/o intervening EXECUTE.
compute var3 = $casenum.
select if (mod(var3,2) = 0).
descriptives var3.
* (2)-(b): $casenum and SELECT IF, w/ intervening EXECUTE.
* First re-run Must-use EXECUTE example (1) syntax to bring back the data .
compute var3 = $casenum.
* Here is the difference: an intervening EXECUTE.
execute.
select if (mod(var3,2) = 0).
list variables = var1 var2 var3.
$CASENUM is a system variable that contains the current case sequence number (i.e., 1, 2, 3, 4, 5, ..., n). SELECT IF (expression) is a command that selects cases based on the criteria specified after IF. MOD(a, b), with non-zero b, is a function that returns the remainder when a is divided by b. So, in the above syntax we are telling SPSS to select cases where var3 is an even number. We will continue to use the mini data set we created (be sure to have this data set active). Now, let's highlight and run example (2)-(a). What did you get?
Warnings No cases were input to this procedure. Either there are none in the working data file or all of them have been filtered out. This command is not executed.
And indeed, you don't have any observations left in your Data View.
Why did this happen? It's a combination of the following two things. First, the value of $CASENUM keeps changing dynamically. For example, if you delete the first case with the value 1, what was formerly the second case with the value 2 moves up and becomes the first case with the value 1. Second, SELECT IF sequentially deletes each unselected case. So, in the example above, SPSS goes through the following sequence: (1) sees compute var3 = $casenum., (2) creates var3 and gives the first case a value of 1, (3) evaluates it against the selection criterion select if (mod(var3,2) = 0)., (4) decides the first case does not meet it, and (5) deletes the case. Now SPSS comes back to the top of this loop: the formerly second case now has a value of 1 for var3, SPSS sees it, decides it does not meet the selection criterion, and deletes it. SPSS keeps going through the loop until it reaches the last observation (in this case the 5th observation). Notice that no data pass happens throughout this sequence. It is only then that SPSS sees the procedure command descriptives var3. and tries to read the data to execute descriptives. But of course, at this point, all the cases are gone and there is no data left to read in. That is not what we wanted, of course. What should we do? We need to force SPSS to read the data after the transformation compute var3 = $casenum. to finalize the data before it starts selecting cases. Let's re-create the same mini data set (because it's gone!) and then highlight and run example (2)-(b).
var1     var2     var3
4.00     4.00     2.00
16.00    16.00    4.00
Yes, this is exactly what we meant to do. We first created var3 (1, 2, 3, 4, 5), finalized it, selected the even-numbered cases (i.e., 2 and 4), and then printed them. OK, here's one last example about EXECUTE in this workshop. Examine the following.
* Must-use EXECUTE example (3) Rule 3: Transformation command, MISSING VALUES and EXECUTE.
* First, create a mini data set.
data list list / var1 var2 var3 var4.
begin data
1 0 1 4
2 1 2 9
3 0 5 6
4 2 1 9
5 0 2 5
6 2 6 13
7 5 7 1
8 0 2 2
end data.
list.
* (3)-(a): Transformation followed by MISSING VALUES involving that var, w/o intervening EXECUTE.
compute var5 = 0.
if var2 = 0 var5 = 1 .
missing values var2 (0).
list.
* Clear missing values.
missing values var2 ().
* (3)-(b): Transformation followed by MISSING VALUES involving that var, w/ intervening EXECUTE.
compute var6 = 0.
if var2 = 0 var6 = 1 .
* Again, here is the difference: an intervening EXECUTE.
execute.
missing values var2 (0).
list.
After creating a small data set, we first create a new variable var5 in example (3)-(a). We set all the observations to 0 first, then replace them with 1 for those cases where var2 has the value 0, so that var5 becomes a 0/1 dummy variable. There should be four cases coded as 1 in var5 because there are that many 0s in var2. Then we use the command MISSING VALUES variable (value) to declare the 0s of var2 as user-defined missing values. Now, with the small data set, let's first highlight and run (3)-(a). What did you get? Your var5 has a value of 0 for all the observations, although the value 0 of var2 is now defined as missing, as you can see from the Variable View of the SPSS Data Editor. It's not what we wanted; var5 should have the value 1 when var2 is 0. Why does this happen? This is yet another situation where you must use EXECUTE explicitly. Be careful when you have transformation commands followed by a MISSING VALUES command that works on the same variables as the transformations, because MISSING VALUES changes the dictionary (i.e., the variable information in the Variable View) before the transformations are executed. In this example, the value 0 of var2 is defined as missing before var5 is created and modified on the condition of var2, and hence the transformation of var5 where var2 = 0 does not occur. So what we need to do is to complete the transformation and force a data pass (i.e., finalize var5) before MISSING VALUES defines the value 0 of var2 as missing. That's where EXECUTE comes in. Place it before MISSING VALUES so that the data transformation is executed before the missing value command is run. Let's keep this mini data set active, and after resetting the missing value definition for var2 (i.e., the * Clear missing values. part of the above syntax), let's highlight and run (3)-(b) to create var6, which is 1 when var2 = 0, and then define the value 0 of var2 as missing. Now, did you get something different this time?
See the difference between var5 and var6? Yes, this is what we wanted! The three situations above are often-encountered cases where you need to explicitly force a data pass, and you should keep them in mind.
Rule 1: Lag functions and EXECUTE.
Rule 2: System variable $CASENUM, SELECT IF and EXECUTE.
Rule 3: Transformation command, MISSING VALUES and EXECUTE.
Two other situations are when you run WRITE or XSAVE, both of which are treated as transformation commands. Ending your program with WRITE or XSAVE without any procedure command that forces a data pass leads to an empty data file (because, simply, nothing is written or saved). In such cases, you often need EXECUTE after you run those commands. For more about WRITE and XSAVE, see Help > Command Syntax Reference. You can now close the mini data file without saving it.
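A minimal sketch of the XSAVE case, assuming the mini data set from the examples above is still active. The output file name mini_copy.sav is hypothetical; put in whatever path you actually want.

```spss
* XSAVE is a transformation, so nothing is written until the data are read.
* The file name below is hypothetical.
xsave outfile = 'mini_copy.sav' /keep = var1 var2.
execute.
```

Without the closing EXECUTE (or some procedure that reads the data), the file would not be written.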
Recoding Variables
OK, let's next learn how to recode existing variables. We have the variable Marital status in the GSS data. With the data file GSS93 subset.sav active, we start by running descriptive statistics to get a good idea of what the variable looks like.
descriptives variables = marital
  /statistics=mean stddev min max skewness.
frequencies variables=marital.
Descriptive Statistics
                      N      Minimum   Maximum   Mean   Std. Deviation   Skewness   Std. Error (Skewness)
Marital Status        1499   1         5         2.24   1.563            .847       .063
Valid N (listwise)    1499

Marital Status
                          Frequency   Percent   Valid Percent   Cumulative Percent
Valid     married         795         53.0      53.0            53.0
          widowed         165         11.0      11.0            64.0
          divorced        213         14.2      14.2            78.3
          separated       40          2.7       2.7             80.9
          never married   286         19.1      19.1            100.0
          Total           1499        99.9      100.0
Missing   NA              1           .1
Total                     1500        100.0
This variable has five categories, coded as 1 to 5. The largest category is married, and the next largest is never married. Separated is by far the smallest category. Substantively, the middle three categories may be collapsed to create a new variable with three groups: (1) currently married, (2) previously married (with the assumption that separation is effectively marital dissolution), and (3) never married:
1. Married        --> 1. Currently married
2. Widowed        --> 2. Previously married
3. Divorced       --> 2. Previously married
4. Separated      --> 2. Previously married
5. Never married  --> 3. Never married
To perform this recode, let's go from the pull-down menu bar at the top: Transform > Recode into Different Variables.
We select Into Different Variables rather than Into Same Variables because we want to create a new 3-category variable while keeping the original 5-category variable intact (if we used Into Same Variables, the original variable would be overwritten with the new one). Select the variable marital. Under the Output Variable heading on the right side, name our output (new) variable marital3cat and add a label (Marital Status 3 Category). Click Change. The middle pane (Numeric Variable -> Output Variable) should now show marital --> marital3cat.
(1) Select marital into the Numeric Variable -> Output Variable pane.
(2) We recode marital into a new, different variable. Decide on a new name for your new variable, then label it. Click Change.
Next, we define this new variable marital3cat based on the old variable marital. Click Old and New Values, and you will get another dialogue box, Recode into Different Variables: Old and New Values (below). We want to keep category 1 (married) as is, collapse categories 2, 3, and 4 of the old variable (marital) into a new category and call that 2, and change category 5 of the old variable (never married) to a new category 3. Here is how to do it.
(1) Use Range, as it's 2 through 4 we want to recode into a new category.
(4) Recode 1 to 1 and 5 to 3 into a new variable too. Use Value instead of Range. For everything else follow the same steps as the above.
(5) Once you are done, click Continue to go back to the previous Recode into Different Variables dialogue box. Then click Paste to paste SPSS syntax onto a Syntax Editor.
Once you are done with the point-and-click recoding work, click Continue and go back to the Recode into Different Variables dialogue box. Click Paste and see what syntax commands SPSS writes for you.
RECODE marital (1=1) (5=3) (2 thru 4=2) INTO marital3cat.
VARIABLE LABELS marital3cat 'Marital status 3 categories'.
The command RECODE oldvarname (recode arguments) INTO newvarname recodes variables into new ones. SPSS adds the command VARIABLE LABELS with the label we specified. As you can see, the syntax for this task is quite simple compared with the amount of pointing and clicking we did. This is why you should learn to write your own syntax! We don't have value labels for the three new categories, but we know how to add them by using syntax, so let's add another command to this syntax file to label the values. Also, we want to display the variable without any decimals, so we format it the same way as marital. Also remember, RECODE is a transformation command (as it entails a transformation of data), and SPSS does not execute it and read the data until it has to. Let's get the frequency distribution of the new variable. This forces a data pass and at the same time lets us check the new variable. Our modified syntax looks like this.
RECODE marital (1=1) (5=3) (2 thru 4=2) INTO marital3cat.
VARIABLE LABELS marital3cat 'Marital status 3 categories'.
VALUE LABELS marital3cat 1 'Married' 2 'Was Married' 3 'Never Married' .
FORMATS marital3cat (F4.0).
frequencies variables = marital3cat.
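One common way to double-check a recode, offered here as a suggestion rather than a step from the original walkthrough, is to cross-tabulate the old variable against the new one. Every case should fall in the cell you intended.

```spss
* Cross-tabulate old by new categories to confirm the recode.
crosstabs /tables = marital by marital3cat.
```

Like FREQUENCIES, CROSSTABS is a procedure, so it also forces the data pass that executes the RECODE.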
Subsetting Variables
Suppose we want to limit our GSS93 subset.sav data to only those variables we are interested in for our research project.
Before we drop the other variables, let's double-check the data information to make sure we do not leave out any important variables. In your new syntax file, type in and run:
* Get data information.
display dictionary .
SPSS prints the dictionary in the SPSS Output Viewer, with the same variable and value label information you can get from the Variable View tab. Suppose that, looking through the variable information, you decide that the variables below are the ones you will need for your analysis.
id wrkstat marital agewed sibs childs age educ degree padeg madeg sex race relig
Let's create a file that includes only the variables above. Again, we first do the subsetting by the point-and-click approach, and then see how you can do the same by writing your own syntax command. Have your SPSS Data Editor active (i.e., bring it to the top). From the menu bar, go File > Save As, and you get the Save Data As box. Make sure to choose the directory you want to save your subset data in. Here, we are going to save it in our working directory verybasicSPSS. Next, we need to enter the new file's name in the File name: box. Let's call it GSSvarsub.
(1) Select the location where you want to save your new subset file.
Now, click Variables and you will get another dialogue box called Save Data As: Variables. By default, SPSS keeps all the variables (all the variables are marked with an X). We will select the 14 variables listed above. You could de-select the variables that you don't need by clicking on their check boxes. Or, in this case, I would first click Drop All and then select what I want to keep by clicking on their check boxes, as the number of variables I want to keep is rather small (the former way you need to uncheck 45 boxes, the latter way you check only 14, so it is more efficient). When you finish selecting what you need, click Continue.
By default, all the variables are selected. You can de-select the ones you don't need.
In this case I would deselect all first by clicking Drop All, and then select the 14 variables, which takes fewer clicks.
Click Paste and see the syntax commands SPSS generates for this task.
SAVE OUTFILE='D:\[Your working directory here]\verybasicSPSS\GSSvarsub.sav'
  /DROP=birthmo zodiac income91 rincom91 region xnorcsiz size partyid vote92 polviews
  cappun gunlaw grass life chldidel pillok sexeduc spanking letdie1 news tvhours
  bigband blugrass country blues musicals classicl folk jazz opera rap hvymetal
  attsprts visitart tvshows tvnews tvpbs scitest4 partners sexfreq dwelown sei
  cohort income4 degree2 agecat4 politics region4 married classic3 jazz3 rap3 blues3
  /COMPRESSED.
SAVE OUTFILE='file path' is what you use to save your SPSS data file. Then, to select some variables and create a subset of the original, we use either one of these subcommands:
/DROP = list of variables to drop.
/KEEP = list of variables to keep.
SPSS uses /DROP there, but if you write your own syntax commands, you can of course list the 14 variables by using /KEEP. In this case, that would actually be simpler. /COMPRESSED just saves the file in compressed form (this is the default, meaning you don't have to specify it when you write syntax yourself to save a file). Again, the pasted syntax looks messy with its long array of variable names, but it doesn't have to be like that when you write the syntax yourself to subset a file, because consecutive variables such as A, B, C, D, E can be written A to E.
save outfile ='D:\[Your working directory here]\verybasicSPSS\GSSvarsub.sav'
  /keep= id to age educ to race relig
  /compressed.
Another reason you should learn to write your own syntax! Let's highlight and run the syntax, and then check your working directory; your new data file should be saved there. Let's open the new file and first see if everything looks okay. Now that we have a new file for our research project, let's attach a brief comment to the data file, which helps you stay organized. We can do this by using the command DOCUMENT. With this data file active, let's write the following commands and run them.
* Document this work in the new subset file.
document Subset of "GSS93 subset" inc the necessary variables for project A.
* Display the document we just created.
display documents.
You should get the output below in your Output Viewer. Your data comments are stored with the date information. This way, you won't lose track of what each data file in your working directory is about.
Document
1(a)  document Subset of "GSS93 subset" inc the necessary variables for project A.
a. Entered 11-Mar-2009
Now, suppose your study focuses on the African American population and you want to limit your sample to African American cases only. Suppose also that you want to present a graphic of the distribution of education levels among this demographic. Let's first get the breakdown of the variable race.
frequencies variables=race.
Race of Respondent
                  Frequency   Percent   Valid Percent   Cumulative Percent
Valid   white     1257        83.8      83.8            83.8
        black     168         11.2      11.2            95.0
        other     75          5.0       5.0             100.0
        Total     1500        100.0     100.0
So, we will select those 168 African American cases. Go to Data > Select Cases, and you will have the Select Cases dialogue box below. Select the radio button If condition is satisfied and click on the If button.
(1) Check this radio button, and click on the button If right below
Then another dialogue box, Select Cases: If, shows up. Select the variable race (Respondent's Race) from the left pane, and move it to the right pane by clicking the right-pointing arrow. We want to select the 168 African American cases, which are coded as 2 in the data, as you can see from the Variable View or your dictionary. So, complete the expression accordingly; that is, we are selecting cases if race = 2. Click Continue.
(2) Select Respondent's race from the left pane. black is coded as 2, so complete the expression accordingly.
Once you are back to the Select Cases dialogue box (the first one), click Paste and see the commands SPSS writes and acts on (below).
USE ALL.
COMPUTE filter_$=(race=2).
VARIABLE LABEL filter_$ 'race=2 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMAT filter_$ (f1.0).
FILTER BY filter_$.
As you can see, SPSS creates a variable filter_$ based on the race variable, with 0 = Not Selected and 1 = Selected. Thus, the African American cases should have this variable coded as 1 (because the African American cases will be selected). Let's obtain a frequency table for the variable race. This way the newly selected data are read in while we check the selection.
frequencies variables = race.
Race of Respondent
                  Frequency   Percent   Valid Percent   Cumulative Percent
Valid   black     168         100.0     100.0           100.0
Now, bring SPSS Data Editor to the front and see what happened to our original data.
Can you see what is going on here? The observation numbers in the row header for those cases where race = 1 or 3 are simply crossed out, but SPSS still holds all the information for the 1500 observations. What SPSS does here is just to filter out the non-black observations by using the filter variable (FILTER BY filter_$). Scroll to the right, and you will see the variable filter_$ that SPSS created for this selection task. As we noted above, the selected cases (i.e., African Americans) are coded as 1, and the unselected cases (white and other) are coded as 0. Let's see the distribution of education levels among the African American respondents.
frequencies variables = educ.
See the output. SPSS gives you a frequency distribution table for the black cases only. There are 168 observations, but one case is missing for the education variable.
Now, suppose you want to restore the whole data set. We will write and run the commands below (very simple!).
filter off. use all.
With this we turn off the filter SPSS used to select cases and tell SPSS to restore the whole file (use all.). Look at the Data Editor and see what happened. All the observations are now back in. What is nice about using a filter to subset observations is, as you just saw, that it is temporary. When you are conducting analysis, you may want to subset observations in many different ways. It is flexible to create a filter and turn it on and off to select observations, and you can keep the whole original data set intact. We can also subset observations permanently. We already used SELECT IF (expression). The difference between SELECT IF (expression) and FILTER BY is that the former selects observations permanently while the latter is temporary. So suppose you subset observations on the race variable by using SELECT IF (expression) and run frequencies:
select if (race=2). frequencies variables = educ.
In the Output Viewer, you should then get exactly the same frequency distribution table. However, check the Data View. How many observations do you have now in the data file?
We now have only 168 observations, all of them African American. SPSS did not cross out the unselected observations. Instead, it deleted them permanently. This method is good if you want to create and save a subset file which only includes cases that meet certain conditions (e.g., females only, those with higher education only, etc.), but unlike the filter, you cannot restore the deleted cases unless you go back to the original file, so it may be inconvenient when you are conducting analysis and frequently select and re-select cases. You should choose which way to go depending on your purpose.
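If you do use SELECT IF to build a subset file, one cautious pattern is to save the result under a new name so the original file on disk stays untouched. This is only a sketch: the file name GSSblack.sav is hypothetical, and the directory placeholder follows the one used earlier in this seminar.

```spss
* Select cases, then save under a new (hypothetical) file name.
select if (race = 2).
save outfile = 'D:\[Your working directory here]\verybasicSPSS\GSSblack.sav'.
```

SAVE reads the data, so the pending SELECT IF is carried out at that point; the active data set then contains only the selected cases, while GSSvarsub.sav itself is unchanged.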
Sorting Data
Let's open the data file GSSvarsub.sav if it's not already open. Sorting data is simple and easy. Suppose we want to sort this data set by sex (male = 1, female = 2). From the pull-down menu bar at the top, go: Data > Sort Cases, and the Sort Cases dialogue box appears.
Too simple a command, isn't it? The (A) following the variable name sex means that observations will be sorted in ascending order. That is the default, so you don't have to specify it when writing your own syntax. Let's highlight and run the command, then list the variable.
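The pasted command is not reproduced in this text. Assuming you moved sex into the Sort by box, kept the default ascending order, and clicked Paste, it should look roughly like this:

```spss
* Sort in ascending order of sex (the default order).
SORT CASES BY sex (A).
```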
list variables = id sex.
If you need to sort observations in descending order, you need to explicitly specify that with (D) in the syntax (instead of (A)). Try it yourself.
sort cases by sex (d).
You can of course sort by more than one variable. For example, if you want to sort observations by sex, and then within each sex category sort observations by marital status, simply place the by variables in that order (see below).
sort cases by sex marital.
list id sex marital.
Split Observations
Suppose you want to obtain group-by-group numbers, such as average years of education by sex. As we already saw, this can be done fairly easily; the command MEANS has the option BY. You can write and run this simple syntax command yourself to achieve the goal.
means educ by sex.
Report
Highest Year of School Completed
Respondent's Sex   Mean    N      Std. Deviation
Male               13.19   639    3.349
Female             12.92   857    2.849
Total              13.04   1496   3.074
But how can we obtain separate analyses when the command you want to use does not have this BY option? Suppose, for example, you suspect that years of education and the number of children are correlated differently across the sexes: say, having children often makes people interrupt or give up on education early, but perhaps females are more adversely affected than males, and group-by-group correlations may give us some clue about this argument. The problem, however, is that the command CORRELATIONS does not have a BY option and does not let you obtain this statistic by sex and make a comparison in one step. In cases such as this, here is what you do. From the pull-down menu, go: Data > Split File. The Split File dialogue box shows up.
Select the radio button Compare groups and select the variable Respondent's Sex. We already sorted the data, but select the Sort the file by grouping variables radio button just in case. Click Paste.
SORT CASES BY sex.
SPLIT FILE LAYERED BY sex.
LAYERED is the default way that SPSS organizes the output, so if you write the syntax yourself, you don't have to add it (i.e., just split file by sex. will do). Now we can run correlations between years of education and the number of children, by sex. Let's add the following command.
correlations educ childs.
Now, highlight all and run it. You should get the correlation matrix organized by sex (below).
Correlations
Respondent's Sex                                                    Highest Year of School Completed   Number of Children
Male     Highest Year of School Completed   Pearson Correlation    1                                  -.182
                                            Sig. (2-tailed)                                           .000
                                            N                      639                                636
         Number of Children                 Pearson Correlation    -.182                              1
                                            Sig. (2-tailed)        .000
                                            N                      636                                638
Female   Highest Year of School Completed   Pearson Correlation    1                                  -.282
                                            Sig. (2-tailed)                                           .000
                                            N                      857                                855
         Number of Children                 Pearson Correlation    -.282                              1
                                            Sig. (2-tailed)        .000
                                            N                      855                                857
Now the analysis is conducted for each group of the sex variable, which we specified as the split variable. The number of children is negatively correlated with the highest year of school completed for both sexes, but the association is stronger for females, which is in line with our expectation. Note that SPLIT FILE stays in effect until you explicitly turn it off. Turn it off by running SPLIT FILE OFF. Let's then run the same correlation and see what we get.
You can see that SPSS now ran the analysis on the whole data set without creating groups.
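The two steps just described can be written as follows (a short sketch using only the commands named above):

```spss
* Turn off the split and re-run the correlation on the whole file.
split file off.
correlations educ childs.
```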
(4) Include dummy variables for sex and race as our control variables, where female = 1 and black = 1, respectively. Both groups are expected to have a lower SEI score on average.
So, first of all, let's create the new variables needed for the planned analysis. Because we suspect that the length of work experience has diminishing returns for SEI, we want to create a quadratic term of age.
compute sqage = age*age .
We also want to create a dummy variable that indicates whether the respondent's mother has an education of 2-year college or higher. As we can see from the Variables button, we need to recode the madeg variable and create a new variable, macol, as follows:
madeg (old value)    macol (new value)
0 and 1              0
2 through 4          1
7 through 9          9
Then code 9 as this variable's missing value.
RECODE madeg (0, 1=0) (2 thru 4=1) (7 thru 9=9) INTO macol.
VARIABLE LABELS macol 'mother college degree = 1'.
MISSING VALUES macol(9).
VALUE LABELS macol 0 'College -' 1 'College +'.
We also recode the variable sex (Respondent's sex) to create a dummy variable female, and race (Respondent's race) to create a dummy variable black.
RECODE sex (1=0) (2=1) INTO female.
VARIABLE LABELS female 'female = 1'.
RECODE race (1=0) (3=0) (2=1) INTO black.
VARIABLE LABELS black 'black = 1'.
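Before using recoded variables, it is worth confirming that every original value landed where intended. One possible check, offered as a sketch rather than as part of the original seminar, is to cross-tabulate each source variable against its new dummy. It uses only the variable names that appear in the commands above.

```spss
* Cross-tabulate each original variable against its recoded version.
* MISSING=INCLUDE keeps user-missing codes such as macol = 9 in the table.
CROSSTABS /TABLES=madeg BY macol /MISSING=INCLUDE.
CROSSTABS /TABLES=sex BY female.
CROSSTABS /TABLES=race BY black.
```

Each row of the original variable should have all of its cases in exactly one column of the new variable. Original values that no RECODE specification covers become system-missing in the new variable and do not appear in these tables, so also compare the table totals with the number of cases in the file.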
Now, we have all the variables ready for the analysis. From the pull-down menu, go: Analyze > Regression > Linear. Select SEI as the dependent variable, and select sqage, age (age), mother's college education (macol), number of brothers and sisters, female (female), and race (black) under the "Block 1 of 1" heading. Then click Statistics to bring up the Linear Regression: Statistics dialogue box. Check the "Collinearity diagnostics" box. Click Continue.
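For readers following along without the screenshots, the syntax that these dialogue choices would typically paste looks roughly like the sketch below. This is a reconstruction, not the seminar's original listing. The variable names sei (socioeconomic index) and sibs (number of brothers and sisters) are assumptions, so check the actual names in your data file. The siblings variable is included because the output lists Number of Brothers and Sisters among the predictors.

```spss
* Hypothetical reconstruction of the pasted syntax, not the original listing.
* The names sei and sibs are assumed, so check them in Variable View first.
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA COLLIN TOL
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT sei
  /METHOD=ENTER sqage age macol sibs female black.
```

When writing the command by hand, the /DEPENDENT and /METHOD lines are the essential ones. The /STATISTICS line is needed here only because COLLIN and TOL are not part of the default output.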
REGRESSION is SPSS's command to run an ordinary linear regression model. The lines you need when you write your own syntax are the ones highlighted in gray. The /STATISTICS subcommand line would be unnecessary if you simply wanted the default statistics (i.e., coefficients, ANOVA, multiple R [model summary], and excluded variables [which are not relevant here]). We need this line in this case because we ask for COLLIN and TOL, which are both collinearity diagnostics. Highlight and run those command lines. Here are some of our results.
Model Summary

R Square                      .085
Std. Error of the Estimate    17.7582

a. Predictors: (Constant), black = 1, mother college degree = 1, female = 1, sqage, Number of Brothers and Sisters, Age of Respondent
b. Dependent Variable: Respondent Socioeconomic Index
Coefficients

                                 Unstandardized          Standardized                       Collinearity
                                 Coefficients            Coefficients                       Statistics
Model 1                          B          Std. Error   Beta           t         Sig.     Tolerance   VIF
(Constant)                       32.337     5.364                       6.029     .000
sqage                            -.007      .003         -.490          -2.693    .007     .036        27.759
Age of Respondent                .865       .240         .656           3.604     .000     .036        27.760
mother college degree = 1        6.931      1.615        .149           4.290     .000     .990        1.011
Number of Brothers and Sisters   -.587      .283         -.072          -2.074    .038     .982        1.018
female = 1                       -3.955     1.286        -.107          -3.075    .002     .991        1.009
black = 1                        -6.586     2.557        -.090          -2.576    .010     .977        1.023
As expected, the coefficient on the quadratic age term is negative and statistically significant. Mother's education, a measure of respondents' cultural capital in their family of orientation, shows a significant positive effect on respondents' socioeconomic status. Having more siblings, on the other hand, seems to reduce the resources available to people and lead to lower socioeconomic status. Finally, females and African Americans have, on average, a lower socioeconomic status than males and the other race groups, respectively. The overall explanatory power of the model is not very strong, as indicated by the R Square (.085) under Model Summary.
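To make the diminishing-returns reading concrete, here is a rough worked calculation of the age at which the fitted curve peaks, using the rounded coefficients from the Coefficients table. This calculation is not part of the original seminar. Because the sqage coefficient is reported to only three decimals, the result is approximate.

```latex
\begin{align*}
\widehat{\mathrm{SEI}} &= b_0 + b_1\,\mathrm{age} + b_2\,\mathrm{age}^2 + \cdots,
  \qquad b_1 \approx .865,\quad b_2 \approx -.007 \\
\frac{\partial\,\widehat{\mathrm{SEI}}}{\partial\,\mathrm{age}}
  &= b_1 + 2\,b_2\,\mathrm{age} = 0
  \;\Longrightarrow\;
  \mathrm{age}^{*} = -\frac{b_1}{2\,b_2}
  \approx \frac{.865}{2 \times .007} \approx 62
\end{align*}
```

So the predicted SEI rises with age at a decreasing rate and levels off at around age 62, holding the other variables constant. Given the rounding of the sqage coefficient, the peak could plausibly lie anywhere from the late 50s to the mid 60s.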
As for the collinearity diagnostics we requested, tolerance and VIF are inversely related (i.e., 1/tolerance = VIF) and thus give you the same information. Although there is no definite cut-off, a rule of thumb is that VIF > 10 (or tolerance < 0.1) merits further investigation. In our example, the VIF is high for the two age variables (about 27.8), but this is fully expected since one is the square of the other and hence they are highly collinear by construction. Otherwise, the tolerance/VIF values all look okay. Another way to check for collinearity problems is to use the Collinearity Diagnostics table below. The general rule of thumb is that a condition index larger than 30 indicates strong collinearity. Dimension 7 has a condition index over 30 (41.477), but this is again due to the age variables, as the variance proportions for sqage and Age of Respondent on that dimension show. Otherwise, the result adds support to the absence of a collinearity problem.
Collinearity Diagnostics

                                            Variance Proportions
                                Condition                      Age of       mother college   Number of Brothers
Model  Dimension  Eigenvalue    Index       (Constant)  sqage  Respondent   degree = 1       and Sisters          female = 1   black = 1
1      1          4.317         1.000       .00         .00    .00          .01              .01                  .02          .00
       2          .938          2.145       .00         .00    .00          .00              .00                  .00          .91
       3          .788          2.341       .00         .00    .00          .93              .01                  .00          .00
       4          .434          3.155       .00         .00    .00          .01              .05                  .87          .00
       5          .397          3.299       .00         .01    .00          .00              .61                  .00          .07
       6          .125          5.886       .07         .02    .00          .04              .32                  .11          .00
       7          .003          41.477      .93         .97    1.00         .01              .00                  .00          .00
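For readers who want to see where these diagnostics come from, the standard textbook definitions can be written out and checked against the reported numbers. The definitions are not stated in the seminar itself. Here R_j^2 denotes the R Square from regressing predictor j on all the other predictors, and lambda_k is the eigenvalue for dimension k. The block assumes the amsmath package.

```latex
\begin{align*}
\mathrm{Tolerance}_j &= 1 - R_j^2,
  \qquad \mathrm{VIF}_j = \frac{1}{\mathrm{Tolerance}_j} \\
\text{age variables:}\quad \mathrm{VIF} &\approx \frac{1}{.036} \approx 27.8 \\
\text{Condition index}_k &= \sqrt{\frac{\lambda_{\max}}{\lambda_k}},
  \qquad \text{dimension 6:}\quad \sqrt{\frac{4.317}{.125}} \approx 5.9
\end{align*}
```

The hand-computed values differ slightly from the SPSS output (27.760 and 5.886) because the tolerances and eigenvalues in the tables are rounded to three decimals. The same rounding explains why the tiny eigenvalue for dimension 7 (.003) reproduces the reported condition index of 41.477 only roughly.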
This is the end of The Very Basic SPSS. Thanks for playing!