Advanced PROC SQL Programming Techniques

Download as pdf or txt
Download as pdf or txt
You are on page 1of 6

Application of Some Advanced PROC SQL

Features in Clinical Trial Programming


Changhong Shi, Merck & Co., Inc.
Sylvianne B. Roberge, Merck & Co., Inc.

ABSTRACT if (first.pat_num) then


count=0;
The DATA step, as the foundation of SAS, provides many programming if (contact ne .) then
options. However, SAS programs using DATA steps sometimes get long count=count+1;
and the logic gets complicated. PROC SQL can be used as an
alternative. For example, PROC SQL can provide very elegant code if (last.pat_num) then
when a data set needs to be appended to another data set, excluding maxcnt=max(maxcnt, count);
duplicate records. In this paper, we will review some useful, usually
overlooked, features of PROC SQL and their applications in clinical if eof then
trials. Five examples are discussed in this paper. They include: creating call
macro variable(s) with compact code, building a macro variable with a symput("&varmax",trim(left(put(maxcnt,3.))));
list of values dynamically, appending one data set to another one and run;
excluding duplicate records, joining tables, especially many to many %mend getmaxcnt;
joins and self join, and filling in missing data. All of the examples
contrast the use of PROC SQL with other SAS techniques, especially The code for PROC SQL is:
DATA step coding.
%macro getmaxcnt(inset=,varmax=);
Even though we realize that we do not illustrate all the possible ways to %global &varmax;
handle a specific situation, all of the examples given in this paper show proc sql noprint;
the power of some advanced features of PROC SQL to solve clinical create view reccount as
trial programming problems. select pat_num, count(contact) as
reccount
from &inset
DISCUSSION group by pat_num;

In the following section, five typical clinical trial programming tasks are select max(reccount) into :&varmax
discussed. The PROC SQL code that performs each task is contrasted from reccount;
with the code for other methods. quit;
%mend getmaxcnt;
1. Create macro variable(s) in compact code
We can call the macros using:
Case A:
%getmaxcnt(inset=offtrt, varmax=offmax);
Sometimes during data checking and cleaning, we need to obtain the
aggregate value of a field and store it in a macro variable for use in later We can see that the advantages of using PROC SQL to get the
processing. For example, we want to calculate the maximum number of maximum number of records (or other aggregate values) in comparison
records where contact number is not missing for all patients in several with the other two methods are:
data sets. This can be accomplished using a DATA step or PROC SQL.
The input data set is: 1. We do not have to presort the input data as is necessary with the
DATA step method.
data offtrt; 2. We do not create any intermediate data set as in the PROC FREQ
input pat_num 1-7 contact 9-10 value 12-13; and the DATA step methods.
cards;
1111135 20 13 Case B:
1111135 40 80
1111166 20 20 Not only can PROC SQL be used for the creation of a single macro
1111166 . 13 variable, it can also create several macro variables in one statement.
1111166 . 15
run; For example, suppose we have a data set called quantile, and every row
contains q1, median and q3, for 25th percentile, median, and 75th
The code for the DATA Step is: percentile values for several groups of patients. The input data set is:

%macro getmaxcnt(inset=,varmax=); data quantile;


%global &varmax; input q1 1-4 median 6-10 q3 12-16;
proc sort data=&inset; cards;
by pat_num; 9.11 20.02 76.89
run; 8.32 40.32 80.23
6.16 60.91 20.15
data _null_; 7.34 40.45 60.03
set &inset end=eof; 9.57 60.65 56.78
by pat_num; run;
retain count maxcnt;
If we want to store these three variables for only the first row, we can
use DATA Step as follows:
For example, we have a data set that contains time points in our study
such as 20, 40, 60, 80. It is saved in the timept data set:
data _null_;
set quantile (obs=1); data timept;
call symput('q1',trim(left(put(q1,6.2)))); input point 1-2;
call cards;
symput('q2',trim(left(put(median,6.2)))); 20
call symput('q3',trim(left(put(q3,6.2)))); 40
run; 60
80
The corresponding PROC SQL code is: run;

proc sql inobs=1; We also have a lab data set that has a patient number, time point, and
select trim(left(put(q1,6.2))), the test value:
trim(left(put(median,6.2))),
trim(left(put(q3,6.2))) data lab;
into :q1,:q2,:q3 input pat_num 1-7 point 9-10 value 12-13;
from quantile; cards;
quit; 1111135 20 13
1111135 40 80
The DATA step method requires a data _null_ step and three function 1111135 80 10
calls to create the three macro variables that PROC SQL can do in one 1111166 20 20
statement. Here, we begin to see PROC SQL’s power, which will be 1111166 40 13
appreciated when we want to build a series of “grouped” macro 1111166 60 15
variables. For example, we can modify the preceding example to run;
illustrate this use. Assume there are 10 records in data set quantile. If
we want to assign macro variables to all of the q1, median, and q3
values, we can use the following PROC SQL statements to create macro As we can see, there is a missing time point, 60, for patient 1111135,
variables for each group. and a missing time point, 80, for patient 1111166. We need to add
records for the missing time points so that every patient has all lab data
proc sql; time points. One way to accomplish this is to use PROC TRANSPOSE
select count(*) into :n on the lab data set. After we transpose, we will get variable names like:
from quantile; _20, _40, _60, _80 in the output data set and we want to store this list of
values in a macro variable for later processing.
select trim(left(put(q1,6.2))),
trim(left(put(median,6.2))), PROC SQL is the easiest way:
trim(left(put(q3,6.2)))
into :p1 - :p%left(&n), proc sql noprint;
:m1 - :m%left(&n), select '_'||compress(put(point,2.))
:r1 - :r%left(&n) into :pt separated by ‘ ‘
from quantile; from timept;
quit; quit;

The code can be even shorter if we take advantage of the fact that SAS If we use a DATA Step, the code will be:
will assign only as many macro variables as needed from the specified
range of 1 to 999: data _null_;
set timept end=eof;
proc sql; retain pt ;
select trim(left(put(q1,6.2))), length pt $ 100;
trim(left(put(median,6.2))),
trim(left(put(q3,6.2))) if (_n_=1) then
into :p1 - :p999, pt='_'||compress(put(point,2.));
:m1 - :m999, else
:r1 - :r999 pt =left(trim(pt)) || ' _'||put(point,2.);
from quantile; if eof then
quit; call symput('pt', pt);
run;
By using a simple – (dash), we are able to build groups of macro
variables p1, p2, p3…, m1, m2, m3…, r1, r2, r3… As we can see, a DATA _null_ step has to be used, the code is longer,
and the concatenation code is trickier than the corresponding PROC
The macro variables created above are all global variables that can be SQL code.
used later.
3. Append one data set to another one, excluding
2. Build a macro variable with a list of values duplicate records
dynamically
Appending one data set to another one and eliminating duplicate records
PROC SQL is also useful when we want to store a list of values in a requires considerable code in a DATA step. PROC SQL, on the other
macro variable. hand, can do it conveniently in one step using the UNION function.
In this example we have two data sets: the phone data set and the As illustrated in the above example, we see that the advantages of using
phonnew data set. We want to append the phonenew data set to the PROC SQL with the UNION function to append data compared to using
phone data set. However, we do not want to keep the duplicates. The SET statements are:
easiest way we have found is to use PROC SQL. We can see the
comparison of PROC SQL and the DATA step in the following a. Duplicate records are implicitly excluded.
example: b. The sorting order is maintained so no explicit sort is needed.

data phone; Of course, if we want to keep all records, including the duplicate records
input pat_num 1-7 contact 9-10 value 12-13; for some reason using PROC SQL, we can just replace UNION with
cards; UNION ALL. UNION ALL is equivalent to the SET statement in a
1111135 20 13 DATA step.
1111135 40 80
1111135 60 10
1111166 20 20 4. PROC SQL and table joins
1111166 40 13
run; In clinical trials, we usually obtain our original input data from a
relational data base like Oracle, Sybase or SQL Server. Since an
data phonenew; objective of relational data base design is to minimize data duplication.
input pat_num 1-7 contact 9-10 value 12-13; We often have to join two or more tables in order to obtain all the
cards; necessary variables for a data set.
1111135 60 10
1111135 80 80 Both data step and PROC SQL can be used to join two or more
1111166 60 15 tables.The following table compares these two methods.
run;

proc sql; DATA Step Technique Proc SQL Technique


create table phonerec as 1. Simple Merge: 1. Inner Join:
select *
from phone data adher; data adher;
union input pat_num 1-7 input pat_num 1-7
select * contact 9-10 adhdt contact 9-10 adhdt
mmddyy8.; mmddyy8.;
from phonenew; cards; cards;
quit; 1111135 20 03/16/95 1111135 20 03/16/95
1111135 40 09/10/95 1111135 40 09/10/95
1111166 20 04/07/96 1111166 20 04/07/96
The result table is:
1111166 40 10/02/96 1111166 40 10/02/96
1111166 60 05/01/97 1111166 60 05/01/97
pat_num contact value run; run;
1111135 20 13
1111135 40 80 data phone; data phone;
1111135 60 10 input pat_num 1-7 input pat_num 1-7
1111135 80 80 contact 9-10 visdt contact 9-10 visdt
mmddyy8.; mmddyy8.;
1111166 20 20 cards; cards;
1111166 40 13 1111135 20 03/26/95 1111135 20 03/26/95
1111166 60 15 1111135 40 09/10/95 1111135 40 09/10/95
1111135 60 02/10/96 1111135 60 02/10/96
As we can see, the duplicate record for patient 1111135 at contact 60 is 1111166 20 04/07/96 1111166 20 04/07/96
excluded from the final data set. 1111166 40 10/20/96 1111166 40 10/20/96
run; run;
If we replace the PROC SQL code with the SET statement in DATA data proc sql;
step: duprec(keep=pat_num create table
contact visdt); duprec as
data phonerec; merge adher(in=a) select
set phone phonenew; phone(in=p); p.pat_num,
run; by pat_num p.contact,
contact; p.visdt
The result data set will be: if a and p then from adher a,
output; phone p
run; where
pat_num contact value
(a.pat_num=
1111135 20 13 p.pat_num)
1111135 40 80 and
1111135 60 10 (a.contact=
1111166 20 20 p.contact);
1111166 40 13 quit;
1111135 60 10
1111135 80 80 2. Selected Merge 2. Left or Right Join:
1111166 60 15 1:
data dre;
As we can see, the duplicate record for patient 1111135 at contact 60 is data dre; input pat_num 1-7 month
input pat_num 1-7 9-10 region1 12 region2
not excluded from the final data set.
month 9-10 region1 14;
12 region2 14; cards;
DATA Step Technique Proc SQL Technique DATA Step Technique Proc SQL Technique
cards; 1111135 1 1 2 set with three variables: cards;
1111135 1 1 2 1111135 6 2 3 patient number, visit number 1111135 16 03/16/95
1111135 6 2 3 1111166 12 1 4 and off-treatment date. The 1111135 36 05/10/97
1111166 12 1 4 1111166 6 2 4 1111135 56 04/07/99
second data set is a
1111166 6 2 4 1111168 12 3 4 1111166 40 10/02/96
1111168 12 3 4 run; reactivation data set with run;
run; Patient number, visit number
and reactivation date. We
data demo; want to merge these two data data reactiv;
data demo; input pat_num 1-7 sets together, keeping every input pat_num 1-7
input pat_num 1-7 height 9-11 weight 13- possible pair of off-treatment contact 9-10 reacdt
height 9-11 weight 15 smoker 17; date and reactivation date mmddyy8.;
13-15 smoker 17; cards; records for later processing. cards;
cards; 1111135 187 180 1 1111135 26 03/06/96
1111135 187 180 1 1111166 176 136 0 1111135 46 03/10/98
1111166 176 136 0 1111167 170 196 1 In order to do this, we need 3 1111166 60 08/07/98
1111167 170 196 1 run; sorts, 5 intermediate data run;
run; sets, and 3 merge processes
proc sql; to get the final result. Please proc sql;
data create table refer to a similar example in create table
dre(drop=height dre as BASE SAS help manual. offreac as
weight); select select
merge dre o.pat_num,
(in=d)
d.*,m.smoker Using a DATA step to do a o.offdt,
from dre as d many-many merge is
demo left join demo o.contact,
(in=m); cumbersome. r.contact as
as m
by pat_num; on r_contact,
if d; d.pat_num= r.reacdt
run; m.pat_num; from offtrt o,
quit; reactiv r
where
(o.pat_num=
r.pat_num);
3. Selected Merge 2: 3. Subquery or quit;
correlated
subquery:
5. Use some functions such 5. Correlated queries, join to
data violator; [1]
input pat_num 1-7; data violator; as Lag to “remember” itself
cards; input pat_num 1-7; specific records
1111135 cards;
1111166 1111135 In clinical trial
run; 1111166 programming, we often need
run; to compute a difference
data phone; between values in the case
input pat_num 1-7 data phone;
input pat_num 1-7 and control groups.
contact 9-10 phn13dsp
contact 9-10 phn13dsp In the following example for
12-14;
cards; 12-14; a case-control study, low is 0
1111135 20 100 cards; for the controls and 1 for the
1111135 40 150 1111135 20 100 cases. Other variables in data Match_11;
1111135 60 134 1111135 40 150 each row contain related input pair low age lwt
1111166 20 160 1111135 60 134 information.[1] race smoke ptd ht UI;
1111166 40 171 1111166 20 160 Cards;
1111167 70 165 1111166 40 171 1 0 14 135 1 0 0 0 0
run; 1111167 70 165 2 0 15 98 2 0 0 0 0
run; data Match_11; 3 0 16 95 3 0 0 0 0
data phone2; input pair low age 4 0 17 103 3 0 0 0 0
merge phone (in=p) proc sql; lwt race smoke ptd ht 5 0 17 122 1 1 0 0 0
violator create table UI; 1 1 14 101 3 1 1 0 0
(in=v); phone2 as Cards; 2 1 15 115 3 0 0 0 1
by pat_num; select p.* 1 0 14 135 1 0 0 0 0 3 1 16 130 3 0 0 0 0
if p and not v; from phone p 2 0 15 98 2 0 0 0 0 4 1 17 130 3 1 1 0 1
run; where p.pat_num 3 0 16 95 3 0 0 0 0 5 1 17 110 1 1 0 0 0
not in 4 0 17 103 3 0 0 0 0 run;
(select 5 0 17 122 1 1 0 0 0
distinct 1 1 14 101 3 1 1 0 0
pat_num 2 1 15 115 3 0 0 0 1
from violator 3 1 16 130 3 0 0 0 0
); 4 1 17 130 3 1 1 0 1 proc sql;
quit; 5 1 17 110 1 1 0 0 0 create table match as
run; select
one.low,
4. Many-Many Merge * 4. Cartesian Product proc sort
one.pair,
(one.lwt-two.lwt) as
data=Match_11; lwt_d,
Please Refer to Chart 1 data offtrt; by pair low;
below this table. We have input pat_num 1-7 (one.smoke-two.smoke)
run; as smoke_d,
two data sets. The first data contact 9-10 offdt
(one.ptd-two.ptd) as
set is an off-treatment data mmddyy8.;
DATA Step Technique Proc SQL Technique missing data for some time points and we need to add records for the
data match(keep=pair ptd_d, missing time points for each patient so that every patient has all lab data
low lwt_d smoke_d (one.ht-two.ht) as time points. For example:
ptd_d ht_d ui_d); ht_d,
(one.ui-two.ui) as UI_d We have an original lab data set like:
set match_11; from Match_11 one,
by pair low; Match_11 two
where(one.pair=two.pair pat_num point value
lpi=lag( lwt ); ) 1111135 20 13
spi=lag( smoke ); and (one.low>two.low); 1111135 40 80
pti=lag( ptd ); 1111135 80 10
hti=lag( ht ); quit;
1111166 20 20
uti=lag( ui );
1111166 40 13
if (_n_>1 and 1111166 60 15
last.pair) then
do; And we know that all patients should have four time points, 20, 40, 60,
lwt_d=lwt-lpi; and 80 in our particular study, which is saved in a timept data set.
smoke_d=smoke-spi;
ptd_d=ptd-pti;
ht_d=ht-hti; point
ui_d=ui-uti; 20
output; 40
end; 60
run; 80

quit; As we can see, in the original lab data: time point 60 is missing for
patient 1111135, and time point 80 is missing for patient 1111166.
The result table for the
difference between the case We want to build a data set (alllab data set) with all missing time points
and control group is: filled in. i.e.

pair low lwt_d smoke_d ptd_d pat_num point value


ht_d UI_d 1111135 20 13
-------------------------------------- 1111135 40 80
1 1 -34 1 1
1111135 60 .
0 0
1111135 80 10
2 1 17 0 0
1111166 20 20
0 1
3 1 35 0 0 1111166 40 13
0 0 1111166 60 15
4 1 27 1 1 1111166 80 .
0 1
5 1 -12 0 0 The two simple ways to do this task are: PROC SQL and PROC
0 0 TRANSPOSE. We list the code for comparison between these two
methods below:

Chart 1: Many-Many Merge of two Data Sets * PROC TRANSPOSE PROC SQL
proc sort data=lab; proc sort
by pat_num point; data=lab;
run; by pat_num
point;
proc sql noprint; run;
select
'_'||compress(put(point,2.)) proc sql;
into :pt separated by ‘ ‘ create table
from timept; uniqan as
select
select count(*) distinct
into :num_obs pat_num
from timept; from lab;
quit;
create table
proc transpose data=lab lab_ as
out=lab2; select
by pat_num; pat_num,
id point; point
var value; from uniqan,
run; timept
order by
5. A powerful way to fill in missing data data alllab(keep=pat_num pat_num,
point value); point;
set lab2; quit;
PROC SQL can be very useful for filling in missing data. In clinical
trials, we sometimes have the following situation with lab data: there is array t{&num_obs} &pt;
PROC TRANSPOSE PROC SQL
do i=1 to &num_obs; data alllab;
point=substr(vname(t{i}),2); merge lab
value=t{i}; lab_;
output; by pat_num
end; point;
run; run;

Comparing the code, we can see that the PROC SQL code is more
compact and the logic is easier to understand. In PROC TRANSPOSE,
we first need to store the number of time points and the time point
values in macro variables since we need them for the final output of
alllab data set. The logic using PROC TRANSPOSE is more obscure
and PROC TRANSPOSE is very memory demanding. Therefore, for
large data sets, the PROC SQL method has more advantages when used
to fill in the missing data points than the PROC TRANSPOSE method.

CONCLUSION

PROC SQL is a useful tool for SAS programmers to employ to


minimize program maintenance and simplify programming. It
complements the DATA step by reducing the need for presorting input
data sets and for creating temporary data sets. As we have shown in this
paper, it can make SAS programs more compact and the logic easier to
understand.

REFERENCE
1. SAS Procedures Guide, Version 8. SAS Institute

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the
author at:

Changhong Shi
Merck & Co., Inc.
RY34-A320
126 East Lincoln Ave.
Rahway, NJ 07065
Work Phone: (732) 594-1383
Fax: (732) 594-6075
Email: [email protected]

Sylvianne B. Roberge
Merck & Co., Inc.
RY34-A320
126 East Lincoln Ave.
Rahway, NJ 07065
Work Phone: (732) 594-7853
Fax: (732) 594-6075
Email: [email protected]

Special Thanks to Elaine B. Czarnecki for getting us started and


continuous support.

You might also like