SAS Interview Questions - Clinical
2. Describe the validation procedure? How would you perform the validation for TLGs as well as analysis data
sets?
Ans:- The validation procedure is used to check the output of the SAS program written by the source programmer.
In this process the validator writes an independent program and generates the output. If this output matches the
output generated by the source programmer, the program is considered valid. We can perform this validation for
TLGs by checking the output manually, and for analysis data sets it can be done using PROC COMPARE.
3. How would you perform the validation for a listing which has 400 pages?
Ans:- It is not practical to validate a listing of 400 pages manually. To do this, we read the listing output back into
SAS data sets (for example, with ODS or a DATA step that parses the file) and then compare them using PROC COMPARE.
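As a minimal sketch of that PROC COMPARE step (the data set and variable names here are hypothetical: LISTING_PROD is the production-side data set, LISTING_QC is the validator's independently derived version):

```sas
/* Both data sets must be sorted by the ID variable before comparing */
proc sort data=listing_prod; by subjid; run;
proc sort data=listing_qc;   by subjid; run;

proc compare base=listing_prod compare=listing_qc listall;
   id subjid;   /* match observations by subject rather than by position */
run;
```

A clean run reports that no unequal values were found; any discrepancies are listed variable by variable.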
Ans:- It actually depends on the complexity of the tables; if they are of the same type, then we can create one to
three tables in a day.
7. What are all the PROCs you have used in your experience?
Ans:- I have used many procedures, such as PROC REPORT, PROC SORT, and PROC FORMAT. I have used PROC REPORT to
generate list reports; in this procedure I used SUBJID as an order variable and TRT_GRP, SBD, and DBD as display
variables.
8. Describe the data sets you have come across in your work?
Ans:- I have worked with demographic, adverse event, laboratory, analysis, and other data sets.
9. How would you submit the docs to the FDA? Who will submit the docs?
Ans:- We can submit the docs to the FDA by electronic submission. Docs can be submitted to the FDA using the
define.pdf or define.xml formats. This doc contains the documentation about the macros, the programs, and the
electronic records. The statistician or project manager will submit this doc to the FDA.
11. Can you share your CDISC experience? What version of CDISC SDTM have you used?
Ans: I have used version 3.1.1 of the CDISC SDTM.
13. Tell me about your project group? To whom would you report/contact?
My project group consists of six members: a project manager, two statisticians, a lead programmer, and two
programmers.
I usually report to the lead programmer. If I have any problem regarding the programming, I contact the lead
programmer.
If I have any doubt about the values of variables in a raw data set, I contact the statistician. For example, in a
data set related to menopause symptoms in women, if the variable SEX has values like F and M, I would consider that
wrong; in that type of situation I would contact the statistician.
15. How would you know whether the program has been modified or not?
I would know whether the program has been modified by looking at the modification history in the program header.
Clintrial is the market's leading Clinical Data Management System (CDMS). Oracle Clinical (OC) is a database
management system designed by Oracle to provide data management, data entry, and data validation
functionality to the clinical trials process.
18. Tell me about MedDRA and what version of MedDRA did you use in your project?
MedDRA: the Medical Dictionary for Regulatory Activities. Version 10.
24. What are the contents of the lab data? What is the purpose of the data set?
The lab data set contains the SUBJID, week number, category of lab test, standard units, and the low and high ends
of the normal range of values. The purpose of the lab data set is to obtain the difference in the values of key
variables after the administration of the drug.
25. How did you do data cleaning? How do you change the values in the data on your own?
I used PROC FREQ and PROC UNIVARIATE to find discrepancies in the data, which I reported to my manager.
26. Have you created CRTs? If you have, tell me what you have done in that?
Yes, I have created patient profile tabulations at the request of my manager and the statistician. I used
PROC CONTENTS and PROC SQL to create a simple patient listing which had all the information for a particular
patient, including age, sex, race, etc.
29. Definitions?
CDISC: Clinical Data Interchange Standards Consortium. It maintains different data models which define clinical
data standards for the pharmaceutical industry.
SDTM: defines the data tabulation data sets that are to be sent to the FDA for regulatory submissions.
ADaM (Analysis Data Model): defines data set definition guidance for creating analysis data sets.
ODM: an XML-based data model that allows the transfer of XML-based data.
Define.xml: a machine-readable version of the data definition file (define.pdf).
ICH E3: Guideline, Structure and Content of Clinical Study Reports
ICH E6: Guideline, Good Clinical Practice
ICH E9: Guideline, Statistical Principles for Clinical Trials
Title 21 CFR 312.32: IND Safety Reports
30. Have you ever done any edit check programs in your project? If you have, tell me what you know about
edit check programs?
Yes, I have done edit check programs. Edit check programs are data validation programs.
1. Data validation: PROC MEANS, PROC UNIVARIATE, PROC FREQ. Data cleaning: finding errors.
2. Checking for invalid character values:
proc freq data = patients;
   tables gender dx ae / nocum nopercent;
run;
which gives frequency counts of the unique character values.
3. PROC PRINT with a WHERE statement to list invalid data values [systolic blood pressure: 80 to 100]
[diastolic blood pressure: 60 to 120].
4. PROC MEANS, PROC UNIVARIATE, and PROC TABULATE to look for outliers. PROC MEANS: min, max, n, and mean.
PROC UNIVARIATE: the five highest and five lowest values [stem-and-leaf plots and box plots].
5. PROC FORMAT: range checking.
6. Data analysis: SET, MERGE, UPDATE, KEEP, and DROP in the DATA step.
7. Creating data sets: PROC IMPORT and DATA steps reading flat files.
8. Extracting data: LIBNAME.
9. SAS/STAT: PROC ANOVA, PROC REG.
10. Duplicate data: PROC SORT with NODUPKEY or NODUPLICATE. NODUPKEY checks for duplicates only in the BY
variables; NODUPLICATE checks the entire observation (matches all variables). To get the duplicate observations,
first sort with NODUPKEY into a separate data set and then compare it back against the original, keeping only the
records that occur more than once.
11. For creating analysis data sets from the raw data sets, I used PROC FORMAT along with RENAME and LENGTH
statements to make changes and finally create an analysis data set.
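For point 10, a hedged sketch of isolating the duplicates: since SAS 9, PROC SORT's DUPOUT= option writes the records that NODUPKEY removes to a separate data set (RAW, SUBJID, and VISIT are hypothetical names):

```sas
/* Unique key records go to UNIQUE; the removed duplicates go to DUPS */
proc sort data=raw out=unique nodupkey dupout=dups;
   by subjid visit;
run;

/* Review the duplicate-key records */
proc print data=dups;
   title "Records with duplicate SUBJID/VISIT keys";
run;
```

This avoids the merge-back step entirely when DUPOUT= is available.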
33. What do you know about ISS and ISE? Have you ever produced these reports?
ISS (Integrated Summary of Safety): integrates safety information from all sources (animal, clinical pharmacology,
controlled and uncontrolled studies, epidemiologic data). "The ISS is, in part, simply a summation of data from
individual studies and, in part, a new analysis that goes beyond what can be done with individual studies."
ISE (Integrated Summary of Efficacy).
The ISS and ISE are critical components of the safety and effectiveness submission and are expected to be submitted
in the application in accordance with regulation. FDA's guidance Format and Content of Clinical and Statistical
Sections of an Application gives advice on how to construct these summaries. Note that, despite the name, these are
integrated analyses of all relevant data, not summaries.
34. Explain the process and how to do data validation?
I have done data validation and data cleaning to check whether the data values are correct and whether they conform
to the standard set of rules. A very simple approach to identifying invalid character values in this file is to use
PROC FREQ to list all the unique values of these variables. This gives us the total number of invalid observations.
After identifying the invalid data, we have to locate the observations so that we can report the particular patient
numbers to the manager. Invalid data can be located using DATA _NULL_ programming.
For example:
DATA _NULL_;
   INFILE "C:\PATIENTS.TXT" PAD;
   FILE PRINT; ***SEND OUTPUT TO THE OUTPUT WINDOW;
   TITLE "LISTING OF INVALID DATA";
   ***NOTE: WE WILL ONLY INPUT THOSE VARIABLES OF INTEREST;
   INPUT @1 PATNO $3. @4 GENDER $1. @24 DX $3. @27 AE $1.;
   ***CHECK GENDER;
   IF GENDER NOT IN ('F','M',' ') THEN PUT PATNO= GENDER=;
   ***CHECK DX;
   IF VERIFY(DX,' 0123456789') NE 0 THEN PUT PATNO= DX=;
   ***CHECK AE;
   IF AE NOT IN ('0','1',' ') THEN PUT PATNO= AE=;
RUN;
For data validation of numeric values, such as out-of-range or missing values, I used PROC PRINT with a WHERE
statement. If we have a range of character values '001' - '999', then we can first define a user-defined format and
then use PROC FREQ to determine the invalid values.
PROC FORMAT;
   VALUE $GENDER 'F','M' = 'VALID'
                 ' '     = 'MISSING'
                 OTHER   = 'MISCODED';
   VALUE $DX '001' - '999' = 'VALID'
             ' '           = 'MISSING'
             OTHER         = 'MISCODED';
   VALUE $AE '0','1' = 'VALID'
             ' '     = 'MISSING'
             OTHER   = 'MISCODED';
RUN;
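The formats above can then be applied in PROC FREQ so that invalid values surface directly as MISCODED or MISSING counts (PATIENTS and its variables follow the earlier example and are otherwise hypothetical):

```sas
proc freq data=patients;
   tables gender dx ae / nocum nopercent;
   /* collapse the raw values into VALID / MISSING / MISCODED categories */
   format gender $gender. dx $dx. ae $ae.;
run;
```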
One of the simplest ways to check for invalid numeric values is to run either PROC MEANS or PROC UNIVARIATE. We can
use the N and NMISS options in PROC MEANS to check for missing and invalid data (the default statistics are n, mean,
std dev, min, and max). The main advantage of using PROC UNIVARIATE (default: n, mean, std, skewness, kurtosis) is
that we get the extreme values, i.e., the five lowest and five highest values, which we can inspect for data errors.
If you want to see the patient IDs for those particular observations, add an ID PATNO statement to the PROC
UNIVARIATE step.
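A sketch of that PROC UNIVARIATE step, assuming a hypothetical data set PATIENTS with a numeric variable HR and patient identifier PATNO:

```sas
proc univariate data=patients;
   id patno;   /* label the extreme observations with the patient ID */
   var hr;
run;
```

The Extreme Observations table then shows PATNO next to each of the five lowest and five highest values.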
35. Roles and responsibilities?
Programmer:
Develop programming for the report formats (ISS & ISE shells) required by the regulatory authorities. Update the
ISS/ISE shells when required.
Clinical Study Team:
Provide information on safety and efficacy findings when required. Provide updates on safety and efficacy findings
for periodic reporting.
Study Statistician:
Draft the ISS and ISE shells. Update the shells when appropriate. Analyze and report data in the approved format to
meet periodic reporting requirements.
37. What are the domains/datasets you have used in your studies?
Demog
Adverse Events
Vitals
ECG
Labs
Medical History
PhysicalExam etc
Adverse Events: Protocol no, Investigator no, Patient Id, Preferred Term, Investigator Term, (Abdominal dis, Freq
urination, headache, dizziness, hand-foot syndrome, rash, Leukopenia, Neutropenia) Severity, Seriousness (y/n),
Seriousness Type (death, life threatening, permanently disabling), Visit number, Start time, Stop time, Related to
study drug?
Vitals: Subject number, Study date, Procedure time, Sitting blood pressure, Sitting Cardiac Rate, Visit number,
Change from baseline, Dose of treatment at time of vital sign, Abnormal (yes/no), BMI, Systolic blood pressure,
Diastolic blood pressure.
ECG: Subject no, Study Date, Study Time, Visit no, PR interval (msec), QRS duration (msec), QT interval (msec), QTc
interval (msec), Ventricular Rate (bpm), Change from baseline, Abnormal.
Labs: Subject no, Study day, Lab parameter (Lparm), lab units, ULN (upper limit of normal), LLN (lower limit of
normal), visit number, change from baseline, Greater than ULN (yes/no), lab related serious adverse event (yes/no).
Medical History: Medical Condition, Date of Diagnosis (yes/no), Years of onset or occurrence, Past condition
(yes/no), Current condition (yes/no).
PhysicalExam: Subject no, Exam date, Exam time, Visit number, Reason for exam, Body system, Abnormal (yes/no),
Findings, Change from baseline (improvement, worsening, no change), Comments
39. Give me examples of edit checks you made in your programs? Examples of edit checks:
Labs
Result is within the normal range but the abnormal flag is not blank or 'N'. Result is outside the normal range but
the abnormal flag is blank.
Vitals
Diastolic BP > Systolic BP.
Medical History
Visit date prior to screen date.
Physical Exam
Physical exam is normal but a comment is included.
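The vitals check above can be sketched as a simple DATA _NULL_ edit check that writes offending records to the log (VITALS, SUBJID, SBP, and DBP are hypothetical names):

```sas
data _null_;
   set vitals;
   /* flag physiologically impossible readings */
   if not missing(sbp) and not missing(dbp) and dbp > sbp then
      put "CHECK: diastolic exceeds systolic " subjid= sbp= dbp=;
run;
```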
40. What are the advantages of using SAS in clinical data management? Why should we not use other software
products in managing clinical data?
ADVANTAGES OF USING A SAS®-BASED SYSTEM
Less hardware is required.
A typical SAS®-based system can utilize a standard file server to store its databases and does not require one or
more dedicated servers to handle the application load. PC SAS® can easily be used to handle processing, while data
access is left to the file server. Additionally, it is possible to use the SAS® product SAS/SHARE® to provide a
dedicated server to handle data transactions.
Fewer personnel are required.
Systems that use complicated database software often require the hiring of one or more DBAs (database
administrators) who make sure the database software is running, make changes to the structure of the database,
etc. These individuals often require special training or background experience in the particular database
application being used, typically Oracle. Additionally, consultants are often required to set up the system and/or
studies, since dedicated servers and specific expertise requirements often complicate the process. Users with even
casual SAS® experience can set up studies. Novice programmers can build the structure of the database and design
screens. Organizations that are involved in data management almost always have at least one SAS® programmer
already on staff. SAS® programmers will have an understanding of how the system actually works, which would
allow them to extend the functionality of the system by directly accessing SAS® data from outside of the system.
Speed of setup is dramatically reduced.
By keeping studies on a local file server and making the database and screen design processes extremely simple and
intuitive, setup time is reduced from weeks to days.
All phases of the data management process become homogeneous.
From entry to analysis, data reside in SAS® data sets, often the end goal of every data management group.
Additionally, SAS® users are involved in each step, instead of having specialists from different areas hand off
pieces of studies during the project life cycle.
No data conversion is required.
Since the data reside in SAS® data sets natively, no conversion programs need to be written.
Data review can happen during the data entry process, on the master database.
As long as records are marked as being double-keyed, data review personnel can run edit check programs and build
queries on some patients while others are still being entered.
Tables and listings can be generated on live data.
This helps speed up the development of table and listing programs and allows programmers to avoid having to make
continual copies or extracts of the data during testing.
44. Describe the types of SAS programming tasks that you performed: tables? Listings? Graphics? Ad hoc
reports? Other?
Prepared programs required for the ISS and ISE analysis reports. Developed and validated programs for preparing
ad hoc statistical reports for the preparation of the clinical study report. Wrote analysis programs in line with
the specifications defined by the study statistician. Base SAS (MEANS, FREQ, SUMMARY, TABULATE, REPORT, etc.) and
SAS/STAT procedures (REG, GLM, ANOVA, UNIVARIATE, etc.) were used for summarization, cross-tabulations, and
statistical analysis purposes. Created statistical reports using PROC REPORT, DATA _NULL_, and SAS macros. Created,
derived, merged, and pooled data sets, listings, and summary tables for Phase I and Phase II clinical trials.
45. Have you been involved in editing the data or writing data queries? If your interviewer asks this question,
you should ask him what he means by editing the data and by data queries.
41. Are you involved in writing the inferential analysis plan? Tables' specifications?
45. What other SAS features do you use for error trapping and data validation?
Conditional statements (IF-THEN-ELSE).
The PUT statement.
The DEBUG option.
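As an illustrative sketch of PUT-based error trapping during a character-to-numeric conversion (the ?? modifier suppresses SAS's own conversion notes so the program can report the problem itself; RAW, AGE_C, and SUBJID are hypothetical names):

```sas
data checked;
   set raw;
   age = input(age_c, ?? 8.);   /* ?? suppresses invalid-data notes */
   /* trap values that failed to convert and write them to the log */
   if missing(age) and not missing(age_c) then
      put "CHECK: non-numeric AGE value " subjid= age_c=;
run;
```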
Transform: The transform stage applies a series of rules or functions to the data extracted from the source to
derive the data to be loaded into the end target. Some data sources will require very little or even no manipulation
of data. In other cases, one or more of the following transformation types may be required to meet the business and
technical needs of the end target:
· Selecting only certain columns to load (or selecting null columns not to load).
· Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores
M for male and F for female); this is called automated data cleansing; no manual cleansing occurs during ETL.
· Encoding free-form values (e.g., mapping "Male" to "1" and "Mr" to M).
· Joining together data from multiple sources (e.g., lookup, merge, etc.).
· Generating surrogate key values.
· Transposing or pivoting (turning multiple columns into multiple rows or vice versa).
· Splitting a column into multiple columns (e.g., putting a comma-separated list specified as a string in one column
as individual values in different columns).
· Applying any form of simple or complex data validation; if it fails, a full, partial, or no rejection of the data
occurs, and thus none, part, or all of the data is handed over to the next step, depending on the rule design and
exception handling. Most of the above transformations themselves might result in an exception, e.g., when a code
translation parses an unknown code in the extracted data.
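In SAS, the coded-value translation described above might be sketched with a user-defined format (SRC, SEX_CD, and TARGET are hypothetical names):

```sas
/* Map the source codes to the warehouse codes */
proc format;
   value sexfmt 1 = 'M' 2 = 'F' other = ' ';
run;

data target;
   set src;
   length sex $1;
   sex = put(sex_cd, sexfmt.);   /* translate 1/2 to M/F during the transform step */
run;
```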
Load: The load phase loads the data into the end target, usually the data warehouse (DW).
Depending on the requirements of the organization, this process varies widely. Some data warehouses might weekly
overwrite existing information with cumulative, updated data, while other DWs (or even other parts of the same DW)
might add new data in a historized form, e.g., hourly. The timing and scope of replacing or appending are strategic
design choices that depend on the time available and the business needs. More complex systems can maintain a history
and audit trail of all changes to the data loaded into the DW.
As the load phase interacts with a database, the constraints defined in the database schema, as well as in triggers
activated upon data load, apply (e.g., uniqueness, referential integrity, mandatory fields), which also contributes
to the overall data quality performance of the ETL process.