SAS Procs


STAT GLOSSARY:

Hypothesis: A concept that is not yet verified but that, if true, would explain certain facts or phenomena; a proposal intended to explain certain facts or observations. A scientific hypothesis that survives experimental testing becomes a scientific theory.

Alpha and Beta: In statistics, the alpha level is the probability of wrongly rejecting the null hypothesis when it is true (a false positive, or Type I error). The beta level is the probability of wrongly retaining the null hypothesis when it is false (a false negative, or Type II error).

Hypothesis Testing: Hypothesis testing (also called "significance testing") is a statistical procedure for discriminating between two statistical hypotheses: the null hypothesis (H0) and the alternative hypothesis (Ha, often denoted H1). Hypothesis testing rests on the presumption of validity of the null hypothesis; that is, the null hypothesis is accepted unless the data at hand testify strongly enough against it. The null hypothesis embodies the presumption that nothing has changed, or that there is no difference. The alternative hypothesis typically describes some change or effect that you expect or hope to see confirmed by the data; for example, that new drug A works better than standard drug B. "Strong enough" means that the probability of obtaining a result as extreme as the observed result, given that the null hypothesis is true, is small enough (usually < 0.05).

Inferential Statistics: Inferential statistics is the body of statistical techniques that deal with the question "How reliable is the conclusion or estimate that we derive from a set of data?" The two main techniques are confidence intervals and hypothesis tests.
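A minimal SAS sketch of both techniques on one variable; the dataset name and values are invented for illustration. CLM requests a 95% confidence interval for the mean, and STDERR the standard error of the estimate:

```sas
/* Hypothetical data: number of clinic visits for 8 patients */
data visits;
   input n_visits @@;
   datalines;
12 9 15 11 8 14 10 13
;

/* CLM = confidence limits for the mean (ALPHA=0.05 gives 95%);
   STDERR = standard error of the mean */
proc means data=visits n mean stderr clm alpha=0.05;
   var n_visits;
run;
```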

Contingency Table: A contingency table is a tabular representation of categorical data. A contingency table usually shows frequencies for particular combinations of values of two discrete random variables X and Y. Each cell in the table represents a mutually exclusive combination of X-Y values. For example, consider a sample of N=200 beer drinkers. For each drinker we have information on sex (variable X, taking 2 possible values: "Male" and "Female") and preferred category of beer (variable Y, taking 3 possible values: "Light", "Regular", "Dark"). A contingency table for these data might look like the following:

          Light   Regular   Dark   Total
Male        20       40       50     110
Female      50       20       20      90
Total       70       60       70     200

This is a two-way 2x3 contingency table (i.e., two rows and three columns). The major classes of questions addressed by contingency table analysis are: i) a hypothesis-testing question of whether there is association among the variables, or whether the variables are independent. For example, in a drug trial, dependence between "Outcome" (e.g. "Improvement", "No change", "Worsening") and treatment received (e.g. "No treatment", "Drug A") would signify that the drug has an effect on patient outcome.

ii) a model-selection question of which model provides the best explanation for the data at hand.

Statistic: A measure calculated from a sample of data. Contrast "statistic" (drawn from a sample) with "parameter," which is a characteristic of a population. For example, the sample mean is a statistic; the population mean is a parameter.

Covariate: A covariate is an independent variable not manipulated by the experimenter but still affecting the response.

Sampling: Sampling is the process of selecting a proper subset of elements from the full population so that the subset can be used to make inferences about the population as a whole. A probability sample is one in which each element has a known and positive chance (probability) of selection. A simple random sample is one in which each member has the same chance of selection.

Confidence Interval: A "confidence interval" (CI) is an interval estimate, that is, a range of values around a point estimate that takes sampling error into account. Ninety-five percent is an accepted standard of confidence. Technically, a 95% CI means that if repeated samples were drawn from the same population using the same sampling and data collection procedures, the true population value would fall within the confidence interval 95% of the time. Practically, a 95% CI summarizes both the estimate and its margin of error in a straightforward way with a reasonable degree of confidence.

Standard Error (SE): A measure of the sampling variability or precision of an estimate. The SE of an estimate is expressed in the same units as the estimate itself. For example, an estimate of 10,000 visits with an SE of 500 indicates that the SE is 500 visits.

p-value: A measure of the probability (p) that a difference as large as the one observed between two estimates could have occurred by chance, if the estimates being compared were really the same. The larger the p-value, the more likely it is that the difference could have occurred by chance.
For example, if the difference between two estimates has a p-value of 0.01, it means that there is a 1% probability that the difference observed could be due to chance alone.

Degrees of Freedom: For a set of data points, degrees of freedom is the minimal number of values that must be specified to determine all the data points. For example, if you have a sample of N random values, there are N degrees of freedom (you cannot determine the Nth random value even if you know the other N-1 values). Another example is a 2x2 table; it generally has 4 degrees of freedom, since each of the 4 cells can contain any number. If the row and column marginal totals are specified, there is only 1 degree of freedom: if you know the number in any one cell, you can calculate the remaining 3 numbers from that number and the marginal totals. Degrees of freedom are often used to characterize various distributions; see, for example, the chi-square distribution, t-distribution, and F distribution.

Central Tendency (Measures):

Any measure of central tendency provides a typical value of a set of N values. For example, the two samples (8, 9, 10, 11, 12) and (18, 19, 20, 21, 22) have central locations differing by 10 units, and most measures of central location would give values of 10 and 20 for the two samples, respectively.

Normal Distribution: The normal distribution is a probability density which is bell-shaped, symmetrical, and single-peaked. The mean, median, and mode coincide and lie at the center of the distribution. The two tails extend indefinitely and never touch the x-axis (they are asymptotic to the x-axis). A normal distribution is fully specified by two parameters: the mean and the standard deviation.
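Central-tendency measures and a check against the normal distribution can be sketched with PROC UNIVARIATE; the sample values here are invented for illustration:

```sas
/* Hypothetical sample; values chosen only for illustration */
data scores;
   input score @@;
   datalines;
8 9 10 11 12 18 19 20 21 22
;

/* NORMAL requests tests of normality (e.g. Shapiro-Wilk);
   the HISTOGRAM statement overlays a fitted normal curve */
proc univariate data=scores normal;
   var score;
   histogram score / normal;
run;
```

The output includes the mean, median, and mode side by side, so you can see directly how far they diverge from the coincidence expected under normality.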

Geometric Mean: The geometric mean of n values is determined by multiplying all n values together, then taking the nth root of the product. It is useful for taking averages of ratios.

Correlation Coefficient: The (Pearson) correlation coefficient indicates the degree of linear relationship between two variables. The correlation coefficient always lies between -1 and +1: -1 indicates a perfect negative linear relationship between the two variables, +1 indicates a perfect positive linear relationship, and 0 indicates the lack of any linear relationship. Pearson correlation coefficients are useful for continuous variables, while Spearman correlation coefficients are useful for ordinal variables.

Analysis of Variance (ANOVA): A statistical technique which helps in making inferences about whether three or more samples might come from populations having the same mean; specifically, whether the differences among the samples might be caused by chance variation.
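Both flavors of correlation coefficient can be requested in one PROC CORR call; the height/weight data below are made up for illustration:

```sas
/* Hypothetical height (cm) / weight (kg) pairs */
data hw;
   input height weight @@;
   datalines;
160 55 165 60 170 68 175 74 180 80
;

/* PEARSON is the default; SPEARMAN adds the rank-based
   coefficient, appropriate for ordinal variables */
proc corr data=hw pearson spearman;
   var height weight;
run;
```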

Hazard Function: In medical statistics, the hazard function is a relationship between a proportion and time. The proportion (also called the hazard rate) is the proportion of subjects who die at time "t" from among those who have survived up to that time. Used in SAS PROC PHREG (Cox proportional hazards regression).

Survival Function: In medical statistics, the survival function is a relationship between a proportion and time. The proportion is the proportion of subjects who are still surviving at time "t".

Kaplan-Meier Estimator: The Kaplan-Meier estimator is aimed at estimating the survival function from censored lifetime data. The value of the survival function between successive distinct uncensored observations is taken as constant, and the graph of the Kaplan-Meier estimate of the survival function is a series of horizontal steps of declining magnitude.

Regression Analysis: Regression analysis provides a "best-fit" mathematical equation for the relationship between the dependent variable (response) and independent variable(s) (covariates). There are two major classes of regression: parametric and non-parametric. Parametric regression requires choice of a regression equation with one or more unknown parameters; linear regression, in which a linear relationship between the dependent variable and independent variables is posited, is an example. The aim of parametric regression is to find the values of these parameters which provide the best fit to the data. The number of parameters is usually much smaller than the number of data points. In contrast, non-parametric regression requires no such choice of regression equation.
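The Kaplan-Meier estimate described above is what PROC LIFETEST produces; a minimal sketch with invented survival times, where status=0 marks a censored observation:

```sas
/* Hypothetical survival times in months; status=0 means the
   observation is censored (subject still alive at last follow-up) */
data surv;
   input months status @@;
   datalines;
5 1 8 0 12 1 15 1 20 0 24 1 30 0
;

/* METHOD=KM requests the Kaplan-Meier (product-limit) estimate;
   months*status(0) tells SAS that status=0 means censored */
proc lifetest data=surv method=km plots=survival;
   time months*status(0);
run;
```

The survival plot shows the characteristic declining horizontal steps, with censored observations usually marked by tick marks on the curve.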

Chi-Square Distribution: The square of a random variable having a standard normal distribution is distributed as chi-square with 1 degree of freedom. The sum of squares of n independently distributed standard normal variables has a chi-square distribution with n degrees of freedom. The distribution is typically used to compare observed count data in contingency tables to the values expected under a null hypothesis.

t-test: A t-test is a statistical hypothesis test based on a test statistic (the t-statistic) whose sampling distribution is a t-distribution. Various t-tests, strictly speaking, are aimed at testing hypotheses about populations with a normal probability distribution. However, statistical research has shown that t-tests often provide quite adequate results for non-normally distributed populations too. The term "t-test" is often used in a narrower sense: it refers to a popular test aimed at testing the hypothesis that the population mean is equal to some value (Mu).
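The narrower, one-sample sense of the t-test maps directly onto PROC TTEST; the measurements below and the hypothesized mean of 10 are invented for illustration:

```sas
/* Hypothetical measurements; H0: population mean = 10 */
data sample;
   input x @@;
   datalines;
9.8 10.4 10.1 9.6 10.9 10.2 9.7 10.5
;

/* H0= sets the hypothesized mean (Mu); the output includes
   the t statistic, degrees of freedom, and p-value */
proc ttest data=sample h0=10;
   var x;
run;
```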

Frequently used SAS procs:

General procs: FREQ, MEANS, SORT, TRANSPOSE, SQL, PRINT, PRINTTO, TABULATE, REPORT
Stat procs: UNIVARIATE, ANOVA, GLM, MIXED, LOGISTIC, LIFETEST, CORR, COMPARE, TTEST, PHREG

1) FREQ -> Frequency counts and chi-square tests
2) MIXED -> Mixed linear models (fixed and random effects)
3) LIFETEST -> Survival analysis of censored lifetime data; censored observations are subjects still being followed (or lost to follow-up), uncensored observations are subjects for whom the event occurred
4) GLM -> General Linear Model; regression and ANOVA, including unbalanced data
5) UNIVARIATE -> Mean, skewness, kurtosis, quantiles, percentiles, histograms
6) ANOVA -> Analysis of variance for balanced data
7) LOGISTIC -> Logistic regression; relationship between discrete responses and covariates
8) TABULATE -> Tabular formats
9) REPORT -> Reporting
10) CORR -> Pearson correlation coefficients; linear relation between two variables, sums of squares
11) TTEST -> One-sample, two-sample, and paired observation t-tests
12) COMPARE -> Compares metadata and data of two datasets
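Of the stat procs listed, LOGISTIC gets no example later in these notes; a minimal sketch with invented trial data, modeling the probability of improvement by treatment group:

```sas
/* Hypothetical trial data: response is 1 (improved) or 0 (not) */
data trial;
   input treat $ response @@;
   datalines;
A 1 A 1 A 0 A 1 B 0 B 0 B 1 B 0
;

/* EVENT='1' models the probability that response=1;
   CLASS makes treat a categorical predictor with reference coding */
proc logistic data=trial;
   class treat / param=ref;
   model response(event='1') = treat;
run;
```

The output includes the fitted odds ratio for treatment A versus B, which ties back to the odds-ratio statistics PROC FREQ reports for 2x2 tables.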

PROC FREQ: Several SAS procedures produce frequency counts; only PROC FREQ computes chi-square tests for one-way to n-way tables, to determine whether an association exists between variables, as well as measures of agreement for crosstabulation (contingency) tables. To estimate the strength of an association, PROC FREQ computes measures of association that tend to be close to zero when there is no association and close to the maximum (or minimum) value when there is perfect association. The statistics for contingency tables include:

- chi-square tests and measures
- measures of association
- risks (binomial proportions) and risk differences for 2x2 tables
- odds ratios and relative risks for 2x2 tables
- tests for trend
- tests and measures of agreement
- Cochran-Mantel-Haenszel statistics

PROC FREQ does stratified analysis, computing statistics within, as well as across, strata using the Cochran-Mantel-Haenszel chi-square test. Frequencies and statistics can also be output to SAS data sets. Other procedures to consider for counting are TABULATE, CHART, and UNIVARIATE.

PROC FREQ Example:
data SummerSchool;
   input Gender $ Internship $ Enrollment $ Count @@;
   datalines;
boys yes yes 35   boys yes no 29   boys no yes 14   boys no no 27
girls yes yes 32  girls yes no 10  girls no yes 53  girls no no 23
;

/* Check the association between Internship and Enrollment */
proc freq data=SummerSchool order=data;
   weight Count;
   tables Internship*Enrollment / chisq;
run;

/* Check the association between Internship and Enrollment within each
   Gender stratum - note the use of the Cochran-Mantel-Haenszel test */
proc freq data=SummerSchool order=data;
   weight Count;
   tables Gender*Internship*Enrollment / chisq cmh;
run;

PROC MEANS:

The MEANS procedure provides data summarization tools to compute descriptive statistics for variables across all observations and within groups of observations. For example, PROC MEANS

- calculates descriptive statistics based on moments
- estimates quantiles, which include the median
- calculates confidence limits for the mean
- identifies extreme values
- performs a t test

By default, PROC MEANS displays output. You can also use the OUTPUT statement to store the statistics in a SAS data set. PROC MEANS and PROC SUMMARY are very similar. Example:
data pets;
   input Pet $ Gender $;
   datalines;
dog m
dog f
dog f
dog f
cat m
cat m
cat f
;

proc means data=pets order=freq;
   class Pet Gender;
   ways 2;
run;
