Advanced PROC SQL Programming Techniques
Advanced PROC SQL Programming Techniques
Advanced PROC SQL Programming Techniques
In the following section, five typical clinical trial programming tasks are select max(reccount) into :&varmax
discussed. The PROC SQL code that performs each task is contrasted from reccount;
with the code for other methods. quit;
%mend getmaxcnt;
1. Create macro variable(s) in compact code
We can call the macros using:
Case A:
%getmaxcnt(inset=offtrt, varmax=offmax);
Sometimes during data checking and cleaning, we need to obtain the
aggregate value of a field and store it in a macro variable for use in later We can see that the advantages of using PROC SQL to get the
processing. For example, we want to calculate the maximum number of maximum number of records (or other aggregate values) in comparison
records where contact number is not missing for all patients in several with the other two methods are:
data sets. This can be accomplished using a DATA step or PROC SQL.
The input data set is: 1. We do not have to presort the input data as is necessary with the
DATA step method.
data offtrt; 2. We do not create any intermediate data set as in the PROC FREQ
input pat_num 1-7 contact 9-10 value 12-13; and the DATA step methods.
cards;
1111135 20 13 Case B:
1111135 40 80
1111166 20 20 Not only can PROC SQL be used for the creation of a single macro
1111166 . 13 variable, it can also create several macro variables in one statement.
1111166 . 15
run; For example, suppose we have a data set called quantile, and every row
contains q1, median and q3, for 25th percentile, median, and 75th
The code for the DATA Step is: percentile values for several groups of patients. The input data set is:
proc sql inobs=1; We also have a lab data set that has a patient number, time point, and
select trim(left(put(q1,6.2))), the test value:
trim(left(put(median,6.2))),
trim(left(put(q3,6.2))) data lab;
into :q1,:q2,:q3 input pat_num 1-7 point 9-10 value 12-13;
from quantile; cards;
quit; 1111135 20 13
1111135 40 80
The DATA step method requires a data _null_ step and three function 1111135 80 10
calls to create the three macro variables that PROC SQL can do in one 1111166 20 20
statement. Here, we begin to see PROC SQL’s power, which will be 1111166 40 13
appreciated when we want to build a series of “grouped” macro 1111166 60 15
variables. For example, we can modify the preceding example to run;
illustrate this use. Assume there are 10 records in data set quantile. If
we want to assign macro variables to all of the q1, median, and q3
values, we can use the following PROC SQL statements to create macro As we can see, there is a missing time point, 60, for patient 1111135,
variables for each group. and a missing time point, 80, for patient 1111166. We need to add
records for the missing time points so that every patient has all lab data
proc sql; time points. One way to accomplish this is to use PROC TRANSPOSE
select count(*) into :n on the lab data set. After we transpose, we will get variable names like:
from quantile; _20, _40, _60, _80 in the output data set and we want to store this list of
values in a macro variable for later processing.
select trim(left(put(q1,6.2))),
trim(left(put(median,6.2))), PROC SQL is the easiest way:
trim(left(put(q3,6.2)))
into :p1 - :p%left(&n), proc sql noprint;
:m1 - :m%left(&n), select '_'||compress(put(point,2.))
:r1 - :r%left(&n) into :pt separated by ‘ ‘
from quantile; from timept;
quit; quit;
The code can be even shorter if we take advantage of the fact that SAS If we use a DATA Step, the code will be:
will assign only as many macro variables as needed from the specified
range of 1 to 999: data _null_;
set timept end=eof;
proc sql; retain pt ;
select trim(left(put(q1,6.2))), length pt $ 100;
trim(left(put(median,6.2))),
trim(left(put(q3,6.2))) if (_n_=1) then
into :p1 - :p999, pt='_'||compress(put(point,2.));
:m1 - :m999, else
:r1 - :r999 pt =left(trim(pt)) || ' _'||put(point,2.);
from quantile; if eof then
quit; call symput('pt', pt);
run;
By using a simple – (dash), we are able to build groups of macro
variables p1, p2, p3…, m1, m2, m3…, r1, r2, r3… As we can see, a DATA _null_ step has to be used, the code is longer,
and the concatenation code is trickier than the corresponding PROC
The macro variables created above are all global variables that can be SQL code.
used later.
3. Append one data set to another one, excluding
2. Build a macro variable with a list of values duplicate records
dynamically
Appending one data set to another one and eliminating duplicate records
PROC SQL is also useful when we want to store a list of values in a requires considerable code in a DATA step. PROC SQL, on the other
macro variable. hand, can do it conveniently in one step using the UNION function.
In this example we have two data sets: the phone data set and the As illustrated in the above example, we see that the advantages of using
phonnew data set. We want to append the phonenew data set to the PROC SQL with the UNION function to append data compared to using
phone data set. However, we do not want to keep the duplicates. The SET statements are:
easiest way we have found is to use PROC SQL. We can see the
comparison of PROC SQL and the DATA step in the following a. Duplicate records are implicitly excluded.
example: b. The sorting order is maintained so no explicit sort is needed.
data phone; Of course, if we want to keep all records, including the duplicate records
input pat_num 1-7 contact 9-10 value 12-13; for some reason using PROC SQL, we can just replace UNION with
cards; UNION ALL. UNION ALL is equivalent to the SET statement in a
1111135 20 13 DATA step.
1111135 40 80
1111135 60 10
1111166 20 20 4. PROC SQL and table joins
1111166 40 13
run; In clinical trials, we usually obtain our original input data from a
relational data base like Oracle, Sybase or SQL Server. Since an
data phonenew; objective of relational data base design is to minimize data duplication.
input pat_num 1-7 contact 9-10 value 12-13; We often have to join two or more tables in order to obtain all the
cards; necessary variables for a data set.
1111135 60 10
1111135 80 80 Both data step and PROC SQL can be used to join two or more
1111166 60 15 tables.The following table compares these two methods.
run;
quit; As we can see, in the original lab data: time point 60 is missing for
patient 1111135, and time point 80 is missing for patient 1111166.
The result table for the
difference between the case We want to build a data set (alllab data set) with all missing time points
and control group is: filled in. i.e.
Chart 1: Many-Many Merge of two Data Sets * PROC TRANSPOSE PROC SQL
proc sort data=lab; proc sort
by pat_num point; data=lab;
run; by pat_num
point;
proc sql noprint; run;
select
'_'||compress(put(point,2.)) proc sql;
into :pt separated by ‘ ‘ create table
from timept; uniqan as
select
select count(*) distinct
into :num_obs pat_num
from timept; from lab;
quit;
create table
proc transpose data=lab lab_ as
out=lab2; select
by pat_num; pat_num,
id point; point
var value; from uniqan,
run; timept
order by
5. A powerful way to fill in missing data data alllab(keep=pat_num pat_num,
point value); point;
set lab2; quit;
PROC SQL can be very useful for filling in missing data. In clinical
trials, we sometimes have the following situation with lab data: there is array t{&num_obs} &pt;
PROC TRANSPOSE PROC SQL
do i=1 to &num_obs; data alllab;
point=substr(vname(t{i}),2); merge lab
value=t{i}; lab_;
output; by pat_num
end; point;
run; run;
Comparing the code, we can see that the PROC SQL code is more
compact and the logic is easier to understand. In PROC TRANSPOSE,
we first need to store the number of time points and the time point
values in macro variables since we need them for the final output of
alllab data set. The logic using PROC TRANSPOSE is more obscure
and PROC TRANSPOSE is very memory demanding. Therefore, for
large data sets, the PROC SQL method has more advantages when used
to fill in the missing data points than the PROC TRANSPOSE method.
CONCLUSION
REFERENCE
1. SAS Procedures Guide, Version 8. SAS Institute
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the
author at:
Changhong Shi
Merck & Co., Inc.
RY34-A320
126 East Lincoln Ave.
Rahway, NJ 07065
Work Phone: (732) 594-1383
Fax: (732) 594-6075
Email: [email protected]
Sylvianne B. Roberge
Merck & Co., Inc.
RY34-A320
126 East Lincoln Ave.
Rahway, NJ 07065
Work Phone: (732) 594-7853
Fax: (732) 594-6075
Email: [email protected]