Some Statistical Programming Considerations For E-Submission
INTRODUCTION
Most pharmaceutical companies now file NDAs with the regulatory agencies electronically. In addition to generating the derived analysis datasets and the SAS outputs of TLGs (tables, listings and graphs), statistical programmers usually need to prepare the following main components for the e-submission: SAS data transport files, the data definition document (define.pdf), and statistical review aids. The format the agency requires for archiving depends on the type of file submitted.

SAS datasets: The SAS datasets used to perform the analysis can be submitted in different formats. SAS transport (XPORT) files are currently accepted by the FDA; other formats (e.g. XML) are also acceptable. An important part of preparing the XPORT files is addressing the data size limitations set by the FDA.

SAS output files: These are the second type of file found in the e-submission. Text and image files, such as the tables, listings and graphs, are usually submitted in PDF format.

Data Definition Document: The third type of file is the data definition document, otherwise known as the define.pdf. There are two comparable structures for this file. In 1999 the FDA proposed a define.pdf format that consists of five columns (Variable Name, Variable Label, Type, Code and Comment). CDISC (Clinical Data Interchange Standards Consortium) proposed a define.pdf format with seven columns (the two additional columns are Origin and Role). The CDISC structure is based on the SDTM (Study Data Tabulation Model)/SDS (Submission Data Standards) V3.1 (2004) models and the Analysis Dataset Model (ADaM) V1.0. The FDA accepts both the FDA and the CDISC format.

Statistical review aids: In addition to the data, the outputs and the define.pdf, statistical review aids are requested for some e-submissions. The purpose of the review aids is to allow the reviewer to easily regenerate the SAS outputs for the primary endpoint and/or the major secondary efficacy endpoints. Currently, no standard guideline is available for the statistical review aids.
Existing FDA guidance documents discourage the use of format catalogs for user-defined formats and suggest that character variables with meaningful values be used instead. Analysis datasets should use such character variables where appropriate. However, there are cases where a numeric version of a categorical variable is required for statistical purposes, e.g. a statistical model may require numeric 0/1 indicator variables. For every formatted variable, a decode variable containing the formatted value can be added before the XPORT file is created.

Since the FDA uses SAS version 5 transport files for all datasets, variable names must not be longer than 8 characters, and all character variables must be 200 characters or less.

Correcting data issues through hardcoding within the analysis dataset program should be avoided; data should be cleaned at the source. If hardcoding is nevertheless required, its details need to be documented in the data specifications document (define.pdf) submitted to the agency.
Example of decodes: F = Female, M = Male
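The decode-variable step described above can be sketched in SAS as follows. This is only an illustration under stated assumptions: the paths, the format name TRTF, and the decode variable name TRTRC are hypothetical, and the treatment codes are borrowed from the review-aid header later in this paper. The XPORT LIBNAME engine is the standard way to write a SAS V5 transport file.

```sas
/* Hypothetical paths and names: TRTF and TRTRC are illustrative only */
libname ads "c:\temp\ads";

/* User-defined format used only to build the decode (not submitted) */
proc format;
  value trtf
    0 = 'Placebo'
    1 = 'Active 10 ug/day'
    2 = 'Active 20 ug/day'
    3 = 'Active 30 ug/day';
run;

/* Add a character decode variable next to the numeric code */
data work.eff;
  set ads.eff;
  length trtrc $20;                       /* name <= 8 chars, length <= 200 */
  trtrc = put(trtr, trtf.);
  label trtrc = 'Treatment (decode of TRTR)';
  format trtr;                            /* drop any format association so no catalog is needed */
run;

/* Write the dataset to a SAS V5 transport file with the XPORT engine */
libname xptout xport "c:\temp\eff.xpt";
proc copy in=work out=xptout memtype=data;
  select eff;
run;
libname xptout clear;
```

Keeping both TRTR and TRTRC lets a statistical model use the numeric code while the reviewer reads the character value without needing the format catalog.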
CDISC divides the datasets into two classes:
Study Data Tabulations (SDT): Datasets containing data collected during the study and organized by clinical domain. These datasets are described by CDISC SDTM/SDS V3.1 (June 2004).
Analysis Datasets (AD): Datasets used for statistical analysis and reporting by the sponsor. These datasets are described by the CDISC Analysis Dataset Model (ADaM) standards V1.0 (Dec. 2004).
Example of Table of Contents (TOC) based on CDISC format:

Datasets For Study 3001

Dataset  Description             Structure                                             Purpose     Key variables           Location
AE       Adverse events          One record per subject, per event                     Tabulation  USUBJID, AETERM, AESEQ  ..\3001\data\ae.xpt
CM       Concomitant medication  One record per subject, per medication, per instance  Tabulation  USUBJID, ODCD, REPNO    ..\3001\data\cm.xpt
DM       Demographics            One record per subject                                Tabulation  USUBJID                 ..\3001\data\dm.xpt
Example of Data Definition Table (DDT) based on CDISC format:

Study 3001 - AE.xpt, Adverse Event, Feb. 10, 2005, One record per subject, per event

Variable Name  Variable Label             Type  Controlled Terms or Format  Origin            Role        Comments
USUBJID        Unique subject identifier  Char                              Sponsor Defined   Identifier  STUDYID || INVID || SUBJID
STUDYID        Study number               Char                              (CRF) ae.studyid  Identifier
INVID          Investigator identifier    Char                              (CRF) ae.invid    Qualifier
SEX            Sex                        Char  M,F,U                       (CRF) dm.sex      Qualifier
Variables in the dataset need to be ordered. Variables whose role is Identifier or Selection are listed first, followed by project common variables; all remaining variables are then listed in alphabetical order. The variable list in the define.pdf should be in the same order as the variables appear in the dataset.

The Role column indicates the type of information conveyed by each variable. SDTM/SDS V3.1 (June 2004) classifies variables into the following major roles:
Identifier variables: identify the study, the subject, the domain, and the sequence number of the record.
Topic variables: specify the focus of the observation (e.g. name of a lab test).
Timing variables: describe the timing of the observation (e.g. start date and end date).
Qualifier variables: include additional illustrative text or numeric values that describe the results or additional traits of the observation (e.g. units). The list of Qualifier variables varies depending on the type of observation and the specific domain.

The CDISC SDTM/SDS V3.1 roles defined above should also be used where appropriate in Analysis Datasets (AD). Additional AD roles defined in ADaM V1.0 (Dec. 2004) include:
Selection variables: variables that are frequently used to subset, sort, or group data for reporting purposes, e.g. Treatment, Age, Gender, and Race.
Analysis variables: variables tabulated and/or summarized for analysis purposes.
Support variables: variables that provide useful background or reference information.

There are some macros available to generate the define.pdf. The following are suggestions for creating or choosing an appropriate macro:
Checks for data structure: The macro should check the data structure and make sure all datasets and variables are labeled. The length of each variable name should be <= 8; otherwise the XPT file will not be created.
Columns for both a short and a long variable name: In the new version of CDISC SDS V3.1, there are two columns for the variable name: short name and long name. The short name is a field name of <= 8 characters. This limit is imposed by the SAS V5 transport format currently required by the FDA; a more flexible format is expected to replace it in the near future. The long name may correspond to the variable name from the operational database or to a long name generated with SAS version 6 or higher.
Simultaneous file creation: The macro can generate the define.pdf and the XPT files for all datasets at the same time. This helps to ensure that the data files created are current with the specifications defined, and it allows for more efficient processing.
Flexibility in creation of define.pdf: The macro should support both the FDA and the CDISC format and let the user choose one or the other.
Addition of comments from external files: The Comments column should be easy to add. A text file or .CSV file with a special character as the delimiter is recommended. The comma (,) should not be used as the delimiter, because it causes problems if the text itself includes a comma.
Proper pagination: Caution is required when using ODS RTF to generate the define.pdf, because ODS RTF in SAS V8.2 has a bug: if the Code or Comment text is too long, it automatically flows to the next page and the text is truncated. SAS V9 may fix this bug. Using ODS HTML avoids the truncation problem: first generate the define.htm file with ODS HTML, then open it in MS Word and convert it to a Word document or to the define.pdf file.
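Several of the suggestions above can be sketched in a few SAS steps. The following is a minimal illustration, not one of the macros mentioned: the library path, the comments file c:\temp\comments.txt, and its assumed layout (DATASET|VARIABLE|COMMENT, with "|" as the special delimiter) are hypothetical. It checks names, labels and character lengths, merges in external comments, and writes define.htm through ODS HTML. The Code column of the FDA format is omitted for brevity.

```sas
/* Hypothetical locations; the comments file layout is an assumption: */
/*   DATASET|VARIABLE|COMMENT                                          */
libname ads "c:\temp\ads";
filename cmts "c:\temp\comments.txt";

/* 1. Structure checks: names > 8 chars, missing labels, char length > 200 */
proc sql;
  create table work.varchk as
  select memname, name, type, length, label
  from dictionary.columns
  where libname = 'ADS' and memtype = 'DATA'
    and (length(name) > 8
         or label = ' '
         or (type = 'char' and length > 200));
quit;

/* 2. Variable-level metadata taken from the datasets themselves */
proc sql;
  create table work.meta as
  select memname, varnum, upcase(name) as name length=32, label, type
  from dictionary.columns
  where libname = 'ADS' and memtype = 'DATA';
quit;

/* 3. Comments from the external file; "|" avoids clashes with commas in text */
data work.cmts;
  infile cmts dlm='|' dsd truncover;
  length memname $32 name $32 comment $200;
  input memname name comment;
  memname = upcase(memname);
  name    = upcase(name);
run;

/* 4. Attach comments, keeping variables in their dataset order */
proc sql;
  create table work.define as
  select m.memname, m.varnum, m.name, m.label, m.type, c.comment
  from work.meta as m
       left join work.cmts as c
       on m.memname = c.memname and m.name = c.name
  order by m.memname, m.varnum;
quit;

/* 5. Write define.htm with ODS HTML (to be converted to define.pdf later) */
ods html file="c:\temp\define.htm";
proc report data=work.define nowd;
  column memname name label type comment;
  define memname / order   "Dataset";
  define name    / display "Variable Name";
  define label   / display "Variable Label";
  define type    / display "Type";
  define comment / display "Comment";
run;
ods html close;
```

Any rows in WORK.VARCHK should be resolved before the transport files are created. Driving the document from DICTIONARY.COLUMNS keeps it in step with the datasets actually submitted.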
/************************************************************************
* Decodes/Formats
*
* TRTR:    0 = Placebo
*          1 = Active 10 ug/day
*          2 = Active 20 ug/day
*          3 = Active 30 ug/day
*
* ITT:     1=True, 0=False for the ITT population
* PP:      1=True, 0=False for the per-protocol population
*
* EFFTYPE: I = Imputed data for last observation carried forward (LOCF)
*          R = Raw data
*
* TIMEPT:  0   = Baseline
*          4   = Week 1 / Visit 4
*          5   = Week 2 / Visit 5
*          6   = Week 4 / Visit 6
*          7   = Week 8 / Visit 7
*          8   = Week 12 / Visit 8 or Early Termination Visit (LOCF)
*          888 = Early Termination
*          999 = Final Visit Completer
*
* STRATA:  1 = Previous therapy 1
*          2 = Previous therapy 2
************************************************************************/

/***********************************************************
* Beginning of Code
***********************************************************/

** Allow program to run without format catalog **;
options nofmterr;

** Assign libname for SAS data sets **;
libname ads "c:\temp\ads";

** Load data from the SAS V5 transport file into the ADS library **;
** XPORT engine files are read with the XPORT libname engine and PROC COPY **;
** PROC CIMPORT reads only files created by PROC CPORT **;
libname indat xport "c:\temp\eff.xpt";
proc copy in=indat out=ads;
run;

** Subset data to support week 12 and LOCF data **;
proc sort data=ads.eff out=final;
  by trtr usubjid;
  ** ITT population and the time point **;
  where itt eq 1 and timept eq 8 and efftype eq 'I';
run;

title "Proj1234/3001: SAS outputs for the primary efficacy endpoint";

** Produce baseline mean for table ef_0001t **;
proc means data=final(where=(chbl>.z)) n mean;
  var blva;
  class trtr;
  output n=N mean=BLMean;
run;

** Produce inferential statistics & p-value **;
proc mixed data=final;
  class trtr pool strata sex;
  model chbl = trtr pool strata sex blva age;

  ** LS Means Differences, 95% CI, and p-value for Comparisons **;
  estimate "Active 10 ug/day - Placebo"          trtr -1  1  0  0 / cl;
  estimate "Active 20 ug/day - Placebo"          trtr -1  0  1  0 / cl;
  estimate "Active 30 ug/day - Placebo"          trtr -1  0  0  1 / cl;
  estimate "Active 20 ug/day - Active 10 ug/day" trtr  0 -1  1  0 / cl;
  estimate "Active 30 ug/day - Active 10 ug/day" trtr  0 -1  0  1 / cl;
  estimate "Active 30 ug/day - Active 20 ug/day" trtr  0  0 -1  1 / cl;

  ** LS Means & 95% CI for Treatment **;
  lsmeans trtr / cl alpha=0.05;
run;

/***********************************************************
* End of Code
***********************************************************/
CONCLUSION
The analysis datasets submitted as SAS transport files should be directly usable by reviewing statisticians with little or no data manipulation and programming. The datasets should be compatible with currently available statistical software and with future standard software tools that may be developed. The FDA and CDISC standards and models for the submitted datasets are extremely important for an efficient e-submission.

When submitting clinical data to the FDA, the data definition document (define.pdf) is probably the most important documentation of the datasets. It provides clear descriptions of the data contents, derivations and usage. Creating this document manually can be very labor intensive and error prone. Since most of the necessary information comes from the datasets themselves, automating the task seems reasonable. Some macros are available for this purpose, and this paper has highlighted the areas where particular attention must be paid when generating the define.pdf.

For the e-submission, following the FDA or CDISC guidelines based on the industry standards is paramount. Statistical review aids that allow the FDA reviewer to work more easily and quickly are also important. Document templates based on previous submission examples may help new project teams create the define.pdf and the statistical review aids.
REFERENCES
FDA (2001). Example NDA Submission (08MAR2001), http://www.fda.gov/cder/guidance/NDA_Example.htm
FDA (2003). Electronic Regulatory Submissions and Review, http://www.fda.gov/cder/regulatory/ersr/default.htm
CDISC (2004). SDTM/SDS V3.1 (25JUN2004), http://www.cdisc.org/models/sds/v3.1/index.html
CDISC (2004). ADaM statistical model V1.0 (DEC2004), http://www.cdisc.org/models/adam/V1.0/index.html
CDISC (2005). Case Report Tabulation Data Definition Specification (CRT-DDS, define.xml V1.0, 10FEB2005), http://www.cdisc.org/models/def/v1.0/index.html
Dave Christiansen and Stephen Wilson (2004). Submission of analysis datasets and documentation: scientific and regulatory perspectives. PharmSUG2004, FC04
ACKNOWLEDGMENTS
The author would like to thank Mary McKenna (Sanofi-Aventis Pharma. Inc.) for her review and comments.
CONTACT INFORMATION
Hang Pang
Sanofi-Aventis Pharma. Inc.
BX2-300E, 200 Crossing Boulevard
Bridgewater, NJ 08807-0890
Work Phone: 908-231-4187
Email: [email protected]

SAS and all SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.