0% found this document useful (0 votes)
80 views

Dirty Data. Clean It Using SAS

This document provides an overview of principles for cleaning dirty data using SAS. It discusses including and excluding cases based on criteria, visualizing data distributions to identify outliers, dealing with invalid or inconsistent character variables, and approaches for handling missing data. The presentation also reviews SAS procedures like PROC FREQ, MEANS, and UNIVARIATE that can be used to understand data distributions and identify issues requiring data cleaning.

Uploaded by

melu-34533
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
80 views

Dirty Data. Clean It Using SAS

This document provides an overview of principles for cleaning dirty data using SAS. It discusses including and excluding cases based on criteria, visualizing data distributions to identify outliers, dealing with invalid or inconsistent character variables, and approaches for handling missing data. The presentation also reviews SAS procedures like PROC FREQ, MEANS, and UNIVARIATE that can be used to understand data distributions and identify issues requiring data cleaning.

Uploaded by

melu-34533
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 40

DIRTY DATA?

CLEAN IT USING SAS


AN INTRODUCTION TO DATA CLEANING PRINCIPLES

CYP-C Research Champion Webinar


August 11, 2017
Giancarlo Di Giuseppe, MPH
Pediatric Oncology Group of Ontario

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
Outline
• SAS overview and procedures – revisited
• Fundamental principles to build a clean dataset
• Inclusion / exclusion criteria
• Visualizing data distributions
• Outliers
• Invalid or inconsistent character variables
• Dealing with missing data
• Creating data checkpoints
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
SAS Overview - Revisited
• For our purposes only two major things you can do in SAS
– DATA step - Manipulate the data in some way
• Reading in Data
• Creating and Redefining Variables
• Sub-Setting Data
• Working with Dates
• Working with Formats

– PROCedure step
• Analyze the data
• Produce frequency tables
• Estimate a regression model

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
SAS Procedures – Revisited
• SAS Procedures
– PROC FREQ
– PROC PRINT
– PROC MEANS
– PROC UNIVARIATE
– PROC SORT
– PROC CONTENTS

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
PRINCIPLES FOR CLEANLINESS

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
Understanding Your Dirty Data Source
• No database is initially ever
“clean”

• Databases are not constructed


with our own specific research
questions in mind

• Researchers must be familiar with


the purpose, how variables are
captured and defined, and the http://3rdsectorlabs.com/wp-content/uploads/2014/06/TSL-data-

structure of the database shower.png

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
Having an Analysis Plan
• Having clean data requires a sound analysis plan
– Envision what the analysis dataset will look like with all
variables and formats before performing data cleaning

• Determine what your study population denominator


is before you begin cleaning
– Is it patient population? Is it number of total diagnoses
(therefore, multiple dx’s per patient is possible)? Or is it
person-time? Etc.
– Based on the research question!

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
Data Manipulation and Data Cleaning:
A Simultaneous Process
• Data manipulation and data cleaning are not
mutually exclusive, rather they go hand-in-hand!

• Both can (and should) be


performed within a single
DATA step
http://i.telegraph.co.uk/multimedia/archive/03219/handshake1_3219777k.jpg

• Ensures efficient and easy to follow SAS


programming

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
SUB-SETTING YOUR DATA

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
Receiving Your Data Cut
• Typically data is requested with slightly more
information than needed
– Allows for wiggle room if hypothesis change slightly

• No data cut is ever perfect


– Data still needs to be cleaned

• Initial data cuts are never ready to be analyzed,


they must first be cleaned

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
Cleaning Using Inclusion & Exclusion Criteria
PROC SORT DATA = T7 OUT=T7_SORT; BY CYPCID DX_DATE; RUN;

DATA T8; SET T7_SORT; BY CYPCID; First cancers


/* INTERESTED IN PRIMARY DIAGNOSIS ONLY */
IF FIRST.CYPCID;
IF ORDINAL_PRIMARY IN (1);

/* AGE INCLUSION CRITERIA – 0 TO 14 */


IF 0 <= DX_AGE < 15;
Children aged 0 to 14
IF 0 <= DX_AGE < 1 THEN DX_AGE_GR=1;
ELSE IF DX_AGE < 7 THEN DX_AGE_GR=2;
ELSE IF DX_AGE < 11 THEN DX_AGE_GR=3; Note: Data cleaning
ELSE DX_AGE_GR=4; and data manipulation
LABEL DX_AGE_GR = "AGE AT FIRST DIAGNOSIS - GROUPED";
FORMAT DX_AGE_GR DX_AGE_GR.; done simultaneously!
/* SELECTS THOSE WITH A DIAGNOSIS BETWEEN 2002 & 2012 */
IF 2002 <= YEAR(DX_DATE) <= 2012; Diagnosed between 2002-
DX1_YEAR = YEAR(DX_DATE);
2012
/* LEUKEMIA CASES */
IF ICCC_MAIN = 1010 OR ICDO_M_CODE IN (9826, 9835, 9836,
9837);
Only concerned with
leukemia cases
RUN; *N=2,492; Keep logs of sample size in your DATA steps!!

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
DATA DISTRIBUTION

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
Recall From Last Session
• PROC FREQ produces frequency outputs
– Can be used for numeric or character variables
– Useful for counts and proportions

• PROC MEANS and UNIVARIATE produce outputs


describing the data distribution for numeric variables
– Checkpoint for data distributions and normality

• PROC FREQ and PROC MEANS/UNIVARIATE are


used in the first step of data cleaning to understand
the data
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Duplicate Entries
PROC FREQ DATA=T8 ORDER=FREQ;
TABLE CYPCID /MISSING;
RUN;

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
Distribution of Continuous Data
ODS GRAPHICS ON;
PROC UNIVARIATE DATA = T8 NORMAL;
ID CYPCID;
VAR WBC_COUNT;
HISTOGRAM WBC_COUNT / NORMAL;
RUN;

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
Distribution of Continuous Data II

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
Distribution of Continuous Data III

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
Normality of Continuous Data

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
OUTLIERS

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
Dealing With Outliers
• If there are many outliers, these will introduce
bias in your study
• Many options to handle these skewed data:
– Median + IQR instead of mean
– Use a logical range of values and assign any
outlier the upper bound of the range
– Categorize your data based on the distribution or
clinically meaningful ranges
• Whichever approach used should be justified!

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
Dealing With Outliers II
DATA T8; SET T8;
/* UPPER LIMIT TO OUTLIERS */
IF WBC_COUNT >= 500 THEN WBC_COUNT_CLEAN = 500;
ELSE WBC_COUNT_CLEAN = WBC_COUNT;

/* CREATING CLINICAL CATEGORIES */


IF WBC_COUNT ^= . THEN DO;
IF WBC_COUNT < 50 THEN WBC_GROUP = 1;
ELSE IF WBC_COUNT < 100 THEN WBC_GROUP = 2;
ELSE IF WBC_COUNT < 200 THEN WBC_GROUP = 3;
DO loop
ELSE IF WBC_COUNT < 300 THEN WBC_GROUP = 4;
ELSE IF WBC_COUNT < 400 THEN WBC_GROUP = 5;
ELSE WBC_GROUP = 6; END;
RUN;

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
Dealing With Outliers III
/* MEAN VS MEDIAN + IQR */
PROC MEANS DATA=T8 MEAN MIN MAX Q1 MEDIAN Q3;
VAR WBC_COUNT_CLEAN;
RUN;

/* DATA CATEGORIZATION */
PROC FREQ DATA=T8;
TABLES WBC_GROUP /MISSING;
RUN;

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
Dealing With Outliers III

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
Dealing With Outliers III
/* MEAN VS MEDIAN + IQR */
PROC MEANS DATA=T8 MEAN MIN MAX Q1 MEDIAN Q3;
VAR WBC_COUNT_CLEAN;
RUN;

/* DATA CATEGORIZATION */
PROC FREQ DATA=T8;
TABLES WBC_GROUP /MISSING;
RUN;

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
CLEANING CHARACTER VARIABLES

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
CApITOLIzATioN Matters!
PROC FREQ DATA=T8;
TABLES PROTOCOL_NAME;
RUN;

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
CAPITOLIZATION Matters! Use UPCASE
DATA T8; SET T8;
PROTOCOL_NAME = UPCASE(PROTOCOL_NAME);
RUN;
PROC FREQ DATA=T8; TABLES PROTOCOL_NAME; RUN;

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
FINDing, Cleaning, and Manipulating
DATA T9; SET T8;
PROTOCOL_NAME = UPCASE(PROTOCOL_NAME);

IF FIND(PROTOCOL_NAME,"ALL PROTOCOL C") THEN DO;


PROTOCOL_NAME = "ALL PROTOCOL C";
DO loop
ALL_RISK = "HIGH RISK";
END;
RUN;
PROC FREQ DATA=T9; TABLES PROTOCOL_NAME; RUN;

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
Use Caution When Searching Text
• When performing character search functions in SAS,
be wary of the phrase being used
• Can lead to errors in data cleaning
• Searched term should be unique enough to prevent
unwanted matches
• If “ALL PROTOCOL B” was searched using FIND(),
then the BFM-90 protocol would have been
misclassified as Protocol B

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
MISSING DATA

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
Recall: Viewing Missing Data
PROC FREQ DATA = T8;
TABLES STAGE_CODE /MISSING;
RUN;

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
Understanding Your Missing Data
PROC FREQ DATA = T8; • Staging not done for the
WHERE DX1_GRP = 2; leukemia’s which represent a high
TABLES STAGE_CODE /MISSING;
RUN; % of childhood cancers
• Staging important for lymphomas
• Know your data!

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
DATA CHECKPOINTS

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
Date Checkpoints I
DATA FLAGS; SET T8 (KEEP=PATIENT_ID DOB DX_DATE1 DOD);
IF DOD < DX_DATE1 AND DOD ^=. THEN DEATH_FLAG = 1;
ELSE DEATH_FLAG=0;
IF DX_DATE1 < DOB THEN DX_FLAG = 1;
ELSE DX_FLAG = 0;
RUN;
PROC FREQ DATA=FLAGS; TABLES DEATH_FLAG DX_FLAG; RUN;

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
Date Checkpoints II
PROC PRINT DATA=T8 NOOBS;
WHERE DOD < DX_DATE1 AND DOD ^=. ;
VAR PATIENT_ID DX_DATE1 DOD;
RUN;

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
Treatment Checkpoints
DATA TX_FLAGS;
MERGE T8 (IN=MASTER) CHEMO (IN=A) SURG (IN=B)
BMT (IN=C) RAD (IN=D);

BY CYPCID;

IF A THEN CHEMO = 1; ELSE CHEMO = 0;


IF B THEN SURGERY = 1; ELSE SURGERY = 0;
IF C THEN BMT = 1; ELSE BMT = 0; Treatment flags
IF D THEN RAD = 1; ELSE RAD = 0;

NUM_TX_MODALITIES = SUM(CHEMO,SURGERY,BMT,RAD);

IF FIRST.CYPCID;
IF MASTER THEN OUTPUT;
RUN;
REMEMBER: All datasets involved in a merge must be sorted by the common identifier (ie.CYPCID)
Healthcare innovation | Survivor care | Family assistance
Population data | Policy development | Education | Research
Treatment Checkpoints II
PROC FREQ DATA=TX_FLAGS;
TABLES DX1_GRP * (CHEMO SURGERY BMT RAD);
TABLES DX1_GRP * NUM_TX_MODALITIES;
RUN;

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
Treatment Checkpoints II
PROC FREQ DATA=TX_FLAGS;
TABLES DX1_GRP * (CHEMO SURGERY BMT RAD);
TABLES DX1_GRP * NUM_TX_MODALITIES;
RUN;

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
Topics Covered
• Key principles to build a clean dataset
• Using Inclusion / exclusion criteria
• Visualizing data distributions
• Handling data outliers
• Cleaning character variables
• Dealing with missing data
• Creating data checkpoints

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research
THANK YOU!

Healthcare innovation | Survivor care | Family assistance


Population data | Policy development | Education | Research

You might also like