Paper 109-25 Merges and Joins: Timothy J Harrington, Trilogy Consulting Corporation
Vertical Joining
A good example of vertical joining is adding to a data set in time sequence, for example, adding February's sales data to January's sales data to give a year-to-date data set. Provided both data sets have the same variables and all the variables have the same attributes, such as data type, length, and label, there is no problem. However, once the data sets are combined, at least one of the variables should, in practice, identify which source data set any given observation came from. In this sales data example a date or month name should be present to indicate whether a given observation came from January's data or February's data. Another issue may be the sort order. In this example there is no need to sort the resulting data set if the source data sets are in date order, but if, say, the data sets were sorted by product code or sales representative, the resulting data set would need to be re-sorted by date. The most important issue when vertically joining data sets is vertical compatibility: whether the corresponding variables in each data set have the same attributes, and whether any variables are present in one data set but not in the other.
Coders' Corner
Generally this method is less efficient than using PROC DATASETS with APPEND.
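As a minimal sketch of the two approaches, assuming "this method" refers to concatenating data sets with a SET statement. The data set names JAN_SALES, FEB_SALES, YTD_SET, and YTD_BASE are hypothetical and do not appear in the paper.

```sas
/* Hypothetical inputs: JAN_SALES and FEB_SALES with identical variables. */

/* Approach 1: DATA step concatenation. Every observation from both  */
/* inputs is read and written to the new data set YTD_SET.           */
data ytd_set;
  set jan_sales feb_sales;
run;

/* Approach 2: PROC DATASETS with APPEND. YTD_BASE is assumed to     */
/* already hold January's data; only FEB_SALES is read, and its      */
/* observations are added to the end of YTD_BASE in place.           */
proc datasets library=work nolist;
  append base=ytd_base data=feb_sales;
quit;
```

The APPEND approach avoids re-reading the base data set, which is the likely source of the efficiency difference, but it modifies the base data set rather than creating a new one.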
Horizontal Joining
There are four basic types of horizontal join: the inner join, left join, right join, and full join. All such joins are Cartesian products made on specified key variables. If there are duplicate matches in either or both tables, all of the matching observations are selected; for example, if there are two equal key values in each input data set, four output observations are created.

The following example data sets are used to demonstrate horizontal joins. These data sets, called DOSING and EFFICACY, are hypothetical clinical trials data sets. In the DOSING data set PATIENT is the patient id number, MEDCODE is the test medication (A or B), DOSE_ID is an observation id number, DOSEAMT is the amount of dose in mg, and DOSEFRQ is the dose frequency in doses per day. The EFFICACY data set contains an observation id number, EFFIC_ID, a VISIT number, and an efficacy SCORE (1 to 5). The variables DOSE_ID and EFFIC_ID are included so that the data set and input observation which contributed to each output observation can be identified easily.
is therefore commutative, that is, the tables can be joined in either order. The following PROC SQL segment creates a table named INNER1 as the inner join between DOSING and EFFICACY on PATIENT. A point to note is that where there are duplicate key values a complete Cartesian product is produced; in this example this happens with Patient 1004. The A and B prefixes on the variable names are aliases for the data set names, and the ORDER BY clause sorts the resulting data set in ascending order of PATIENT.

Table INNER1: An INNER JOIN on PATIENT between DOSING and EFFICACY.

    PROC SQL;
      CREATE TABLE INNER1 AS
      SELECT A.*, B.EFFIC_ID, B.VISIT, B.SCORE
      FROM DOSING A, EFFICACY B
      WHERE A.PATIENT=B.PATIENT
      ORDER BY PATIENT;
    QUIT;

    OBS  PATIENT  MEDCODE  DOSEAMT  DOSEFRQ  VISIT  SCORE  DOSE_ID  EFFIC_ID
      1     1001     A           2        2      1      4        1         1
      2     1004     A           1        2      1      2        3         3
      3     1004     B           4        2      1      2        4         3
      4     1004     B           4        2      2      1        4         4
      5     1004     A           1        2      2      1        3         4
      6     1009     A           2        2      1      5        8         6
This resulting table, INNER1, contains only observations with Patient Numbers common to both data sets. There are four observations for Patient 1004 because of the Cartesian product of the two observations with this Patient Number in each data set. If the WHERE clause were omitted, the complete Cartesian product of every observation would be selected, producing 48 (6*8) observations; hence at least one key variable must be specified when performing any type of horizontal join on data sets of more than a few observations. Another point to note is that instead of using the WHERE clause, the join could be written in the FROM clause as FROM DOSING A INNER JOIN EFFICACY B ON A.PATIENT=B.PATIENT.
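The explicit join syntax mentioned above can be sketched as follows. The table name INNER1B is hypothetical, chosen only to avoid overwriting INNER1; with the INNER JOIN keyword the join condition moves from the WHERE clause to an ON clause.

```sas
/* Same inner join as INNER1, written with INNER JOIN ... ON.        */
/* INNER1B is a hypothetical table name used for illustration.       */
proc sql;
  create table inner1b as
  select a.*, b.effic_id, b.visit, b.score
  from dosing a inner join efficacy b
    on a.patient = b.patient
  order by patient;
quit;
```

Both forms should select the same observations; the ON form keeps the join condition separate from any additional WHERE subsetting.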
The EFFICACY data set

    OBS  PATIENT  EFFIC_ID  VISIT  SCORE
      1     1001         1      1      4
      2     1002         2      1      5
      3     1004         3      2      2
      4     1004         4      2      1
      5     1005         5      1      2
      6     1009         6      1      5
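To experiment with the joins in this paper, the two example data sets can be recreated with DATA steps. This is a sketch: the EFFICACY values are taken from the listing above, while the DOSING values are inferred from the DOSING columns of the join and merge output listings in this paper, and the variable order is an assumption.

```sas
/* DOSING: eight observations, values inferred from the output listings. */
data dosing;
  input patient medcode $ doseamt dosefrq dose_id;
  datalines;
1001 A 2 2 1
1003 A 1 2 2
1004 A 1 2 3
1004 B 4 2 4
1006 B 2 2 5
1007 A 2 1 6
1008 A 1 2 7
1009 A 2 2 8
;
run;

/* EFFICACY: six observations, as listed in the paper. */
data efficacy;
  input patient effic_id visit score;
  datalines;
1001 1 1 4
1002 2 1 5
1004 3 1 2
1004 4 2 1
1005 5 1 2
1009 6 1 5
;
run;
```

Both data sets are already in ascending PATIENT order, so they can be used directly in a DATA step MERGE with BY PATIENT as well as in PROC SQL.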
order of the source data sets would produce a totally different result.

Table LEFT1: A LEFT JOIN on PATIENT between DOSING ('left' data set) and EFFICACY ('right' data set).

    PROC SQL;
      CREATE TABLE LEFT1 AS
      SELECT A.*, B.EFFIC_ID, B.VISIT, B.SCORE
      FROM DOSING A LEFT JOIN EFFICACY B
      ON A.PATIENT = B.PATIENT
      ORDER BY PATIENT;
    QUIT;

    OBS  PATIENT  MEDCODE  DOSEAMT  DOSEFRQ  VISIT  SCORE  DOSE_ID  EFFIC_ID
      1     1001     A           2        2      1      4        1         1
      2     1003     A           1        2      .      .        2         .
      3     1004     A           1        2      1      2        3         3
      4     1004     B           4        2      1      2        4         3
      5     1004     A           1        2      2      1        3         4
      6     1004     B           4        2      2      1        4         4
      7     1006     B           2        2      .      .        5         .
      8     1007     A           2        1      .      .        6         .
      9     1008     A           1        2      .      .        7         .
     10     1009     A           2        2      1      5        8         6
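Observations 5 to 8 of a join output listing (PATIENT column not shown):

    OBS  MEDCODE  DOSEAMT  DOSEFRQ  VISIT  SCORE  DOSE_ID  EFFIC_ID
      5     B           4        2      1      2        4         3
      6     A           1        2      2      1        3         4
      7     A           1        2      1      2        3         3
      8     A           2        2      1      5        8         6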
    OBS  PATIENT  MEDCODE  DOSEAMT  DOSEFRQ  VISIT  SCORE  DOSE_ID  EFFIC_ID
      1        .                 .        .      1      5        .         2
      2        .                 .        .      1      2        .         5
      3     1001     A           2        2      1      4        1         1
      4     1003     A           1        2      .      .        2         .
      5     1004     B           4        2      1      2        4         3
      6     1004     B           4        2      2      1        4         4
      7     1004     A           1        2      1      2        3         3
      8     1004     A           1        2      2      1        3         4
      9     1006     B           2        2      .      .        5         .
     10     1007     A           2        1      .      .        6         .
     11     1008     A           1        2      .      .        7         .
     12     1009     A           2        2      1      5        8         6
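The PROC SQL code for the right and full joins is not reproduced in this section. As a sketch by analogy with the LEFT1 example, and assuming the table names RIGHT1 and FULL1 (which are not confirmed by the text), the two joins could be written as:

```sas
proc sql;
  /* Right join: every EFFICACY observation is kept; DOSING values, */
  /* including A.PATIENT, are missing where there is no match.      */
  create table right1 as
  select a.*, b.effic_id, b.visit, b.score
  from dosing a right join efficacy b
    on a.patient = b.patient
  order by patient;

  /* Full join: every observation from both data sets is kept. */
  create table full1 as
  select a.*, b.effic_id, b.visit, b.score
  from dosing a full join efficacy b
    on a.patient = b.patient
  order by patient;
quit;
```

Because PATIENT is selected from DOSING only (through A.*), observations that exist only in EFFICACY have a missing PATIENT and sort to the top, which matches the two leading rows with missing PATIENT in the twelve-observation listing above.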
Using the COALESCE function and determining the source data set
When performing left, right, or full joins where observations do not have a key variable match, non-key variables are assigned missing values. Sometimes there is a need to substitute missing values with other data, either hard-coded values or different items from either data set. One way to do this is with a CASE construct, but the COALESCE function is provided specifically for this purpose. In this example the variable ADJSCORE (Adjusted Score) contains the value of SCORE in the observations that match, but where there is no match and SCORE is missing, ADJSCORE is assigned the value zero. If a value of SCORE were missing in a matching observation, the value of ADJSCORE would also be set to zero. COALESCE may be used with either a character or numeric data type, but the second argument must be of that same data type. The CASE expression in the example assigns the value 'Match' or 'Miss' to the character variable INVAR depending on whether the values of PATIENT match or not.

COALESCE function example

    PROC SQL;
      CREATE TABLE LEFT1A(DROP=DOSE_ID) AS
      SELECT A.*, B.VISIT, B.SCORE,
             COALESCE(B.SCORE,0) AS ADJSCORE,
             CASE (A.PATIENT=B.PATIENT)
               WHEN 1 THEN 'Match'
               WHEN 0 THEN 'Miss'
               ELSE ' '
             END AS INVAR LENGTH=5
      FROM DOSING A LEFT JOIN EFFICACY B
      ON A.PATIENT = B.PATIENT
      ORDER BY PATIENT, MEDCODE;
    QUIT;
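COALESCE can also substitute an item from the other data set rather than a hard-coded value. A minimal sketch, assuming a hypothetical table name FULL1A: in a full join, taking the key from whichever data set supplies it avoids missing PATIENT values for observations found only in EFFICACY.

```sas
/* FULL1A is a hypothetical table name used for illustration.        */
/* COALESCE returns the first non-missing argument, so PATIENT is    */
/* taken from DOSING when present and from EFFICACY otherwise.       */
proc sql;
  create table full1a as
  select coalesce(a.patient, b.patient) as patient,
         a.medcode, a.doseamt, a.dosefrq,
         b.visit, b.score
  from dosing a full join efficacy b
    on a.patient = b.patient
  order by 1;   /* sort by the first selected column, PATIENT */
quit;
```

Both arguments are numeric here, which satisfies the requirement that the arguments share a data type.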
The DATA step performing the merge must contain an applicable BY statement, matching the BY statement in the preceding PROC SORT or the PROC SQL ORDER BY clause for each corresponding data set. If a key variable has been sorted in descending order, that variable must be specified as DESCENDING in the BY statement of the merge.

If the BY values are unique in both data sets the merge is a one-to-one merge. If there are observations with duplicate BY values in one data set and only one matching observation in the other data set, that single observation is joined with all of the matching observations in the first data set because of the implied retain. This is a one-to-many merge. This aspect of merging is commutative, in that performing a many-to-one merge with the order of the data sets reversed produces the same result. A many-to-many merge is where there are corresponding duplicate BY values in both data sets. Such a merge does not result in a Cartesian product because the observations are joined in sequence where they match. (See what happens to Patient 1004 in the examples listed below.) When many-to-many situations are encountered the following message is written to the log file:

    NOTE: MERGE statement has more than one data set with repeats of BY values

Merges and IN Variables
An IN variable is a Boolean flag which applies to an input data set. The IN variable is set to true (1) or false (0) depending on whether or not that data set contributed data to the current PDV contents. When two data sets are being merged at least one of the IN variables must be true; both IN variables are true if the BY values match. IN variables are most useful for testing for such matches, as shown in the examples listed below.
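The sorting requirement and the behaviour of IN variables can be sketched together. The output data set name FLAGGED and the variable SOURCE are hypothetical names introduced for illustration; IN variables are temporary and are not written to the output data set, so their state is copied into an ordinary variable here.

```sas
/* Both inputs must be sorted by the BY variable before merging. */
proc sort data=dosing;   by patient; run;
proc sort data=efficacy; by patient; run;

data flagged;
  merge dosing(in=a) efficacy(in=b);
  by patient;
  length source $8;                 /* hypothetical tracking variable   */
  if a and b then source = 'BOTH';  /* BY value found in both data sets */
  else if a  then source = 'DOSING';
  else            source = 'EFFICACY';
run;
```

Subsetting on SOURCE, or directly on A and B with an IF statement, gives the merge equivalents of the inner, left, and right joins.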
Table LEFT1A (output of the COALESCE function example):

    OBS  PATIENT  MEDCODE  DOSEAMT  DOSEFRQ  VISIT  SCORE  ADJSCORE  INVAR
      1     1001     A           2        2      1      4         4  Match
      2     1003     A           1        2      .      .         0  Miss
      3     1004     A           1        2      1      2         2  Match
      4     1004     B           4        2      1      2         2  Match
      5     1004     A           1        2      2      1         1  Match
      6     1004     B           4        2      2      1         1  Match
      7     1006     B           2        2      .      .         0  Miss
      8     1007     A           2        1      .      .         0  Miss
      9     1008     A           1        2      .      .         0  Miss
     10     1009     A           2        2      1      5         5  Match
This first example performs a merge using the key variable PATIENT and outputs an observation when the patient numbers are equal, and hence the IN variables are both true. This merge corresponds to the PROC SQL inner join, but with one important difference: no Cartesian products are generated because the merging process is sequential by observation. Hence the data set INNER2 has only two observations for Patient 1004 instead of the four in INNER1. (For this reason many-to-many merges should be avoided in practice.) Another point to note is that when testing IN variables an IF statement must be used; a WHERE clause will not work because the IN variables are calculated within the DATA step and are not from the source data.

Table INNER2: A MERGE between DOSING and EFFICACY where equal values of PATIENT occur in both input data sets.

    DATA INNER2;
      MERGE DOSING(IN=A) EFFICACY(IN=B);
      BY PATIENT;
      IF (A=B);
    RUN;

    OBS  PATIENT  MEDCODE  DOSEAMT  DOSEFRQ  VISIT  SCORE  DOSE_ID  EFFIC_ID
      1     1001     A           2        2      1      4        1         1
      2     1004     A           1        2      1      2        3         3
      3     1004     B           4        2      2      1        4         4
      4     1009     A           2        2      1      5        8         6
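Since many-to-many merges are best avoided, it can help to check each input for repeated BY values before merging. The following is a sketch of one way to do this with PROC SQL; the column name N_OBS is a hypothetical name used for illustration.

```sas
/* List PATIENT values that occur more than once in each input.      */
/* A PATIENT value reported by both queries would produce a          */
/* many-to-many merge on BY PATIENT.                                 */
proc sql;
  title 'Repeated PATIENT values in DOSING';
  select patient, count(*) as n_obs
  from dosing
  group by patient
  having count(*) > 1;

  title 'Repeated PATIENT values in EFFICACY';
  select patient, count(*) as n_obs
  from efficacy
  group by patient
  having count(*) > 1;
quit;
title;
```

With the example data, Patient 1004 has two observations in each data set, which is why INNER2 and INNER1 differ for that patient.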
This second example performs the same merge as the first example but only outputs an observation whenever the DOSING data set contributes a value to the PDV. Observations with values of PATIENT which are present only in EFFICACY and not in DOSING are not output, and the non-key variables from EFFICACY are set to missing where there is no match. This merge corresponds to the PROC SQL left join, but without the Cartesian product of duplicate key values (Patient 1004).

Table LEFT2: All values of PATIENT are taken from DOSING and only matching values from EFFICACY.

    DATA LEFT2;
      MERGE DOSING(IN=A) EFFICACY;
      BY PATIENT;
      IF A;
    RUN;

    OBS  PATIENT  MEDCODE  DOSEAMT  DOSEFRQ  VISIT  SCORE  DOSE_ID  EFFIC_ID
      1     1001     A           2        2      1      4        1         1
      2     1003     A           1        2      .      .        2         .
      3     1004     A           1        2      1      2        3         3
      4     1004     B           4        2      2      1        4         4
      5     1006     B           2        2      .      .        5         .
      6     1007     A           2        1      .      .        6         .
      7     1008     A           1        2      .      .        7         .
      8     1009     A           2        2      1      5        8         6

In this third example only observations in the PDV from the EFFICACY data set are output. Non-key variables from DOSING are set to missing where there is no match; the key variable PATIENT is still populated, because the BY variable is shared by both data sets in a DATA step merge. This corresponds to a right join in PROC SQL. Note that both the order in which the data sets are specified and which IN variable is tested are important. In this example MERGE DOSING EFFICACY(IN=B) is not the same as MERGE EFFICACY(IN=B) DOSING or MERGE DOSING(IN=B) EFFICACY.

Table RIGHT2: A MERGE between DOSING and EFFICACY where all values of PATIENT are taken from EFFICACY and only matching values from DOSING.

    DATA RIGHT2;
      MERGE DOSING EFFICACY(IN=B);
      BY PATIENT;
      IF B;
    RUN;

    OBS  PATIENT  MEDCODE  DOSEAMT  DOSEFRQ  VISIT  SCORE  DOSE_ID  EFFIC_ID
      1     1001     A           2        2      1      4        1         1
      2     1002                 .        .      1      5        .         2
      3     1004     A           1        2      1      2        3         3
      4     1004     B           4        2      2      1        4         4
      5     1005                 .        .      1      2        .         5
      6     1009     A           2        2      1      5        8         6
In this next and final example all observations in the PDV are output regardless of a match or not, hence IN variables are not needed. This corresponds to the PROC SQL full join.

Table FULL2: A MERGE between DOSING and EFFICACY where all values of PATIENT are taken from both data sets regardless of whether they match or not.

    DATA FULL2;
      MERGE DOSING EFFICACY;
      BY PATIENT;
    RUN;

    OBS  PATIENT  MEDCODE  DOSEAMT  DOSEFRQ  VISIT  SCORE  DOSE_ID  EFFIC_ID
      1     1001     A           2        2      1      4        1         1
      2     1002                 .        .      1      5        .         2
      3     1003     A           1        2      .      .        2         .
      4     1004     A           1        2      1      2        3         3
      5     1004     B           4        2      2      1        4         4
      6     1005                 .        .      1      2        .         5
      7     1006     B           2        2      .      .        5         .
      8     1007     A           2        1      .      .        6         .
      9     1008     A           1        2      .      .        7         .
     10     1009     A           2        2      1      5        8         6
occurrences of key values. When there are more matching key observations in one data set than in the other (one-to-many or many-to-one merges), the contents of the last matching observation from the data set with fewer matches are retained in the PDV. This is the implied retain. The result is that the remaining non-key values from the data set with fewer matches appear in the corresponding excess output observations. The implied retain does not occur in the above examples with Patient 1004 because there are the same number of observations for this patient in both data sets. If one of the Patient 1004 observations were deleted from either DOSING or EFFICACY, an implied retain would then take place with the second observation.

A MERGE does not produce a complete Cartesian product when there are duplicate key values in both data sets (many-to-many). A note indicating repeating BY values is written to the log file. Many-to-many merges are also expensive in terms of processing time and resources. Hence many-to-many merges should be avoided.

When using a SQL join the observations in either data set do not have to be sorted, whereas a DATA step MERGE requires the key variables (BY variables) to have been sorted in a corresponding order, either using PROC SORT or an ORDER BY in a prior PROC SQL.

A PROC SQL join can use aliases to identify the data set which is contributing a particular variable. A DATA step MERGE can use a logical IN variable to identify which data set contributed the key values to the current PDV contents. In a MERGE, non-key variable names must be unique to each data set, since their source data set cannot be identified with an alias.

Data set options such as KEEP, DROP, and RENAME may be used in either PROC SQL or a DATA step. To subset data from either input data set, a WHERE= data set option may be used in parentheses after the data set name. (This is more efficient than using a WHERE clause in a CREATE TABLE block or a WHERE statement in a DATA step.)
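As a sketch of subsetting an input with a data set option in parentheses after the data set name, the following restricts DOSING to medication A in both a PROC SQL join and a DATA step MERGE. The output names INNER_A and LEFT_A are hypothetical, and the data sets are assumed to be sorted by PATIENT for the MERGE.

```sas
/* PROC SQL: WHERE= and KEEP= applied as each data set is read. */
proc sql;
  create table inner_a as
  select a.*, b.visit, b.score
  from dosing(where=(medcode='A')) a,
       efficacy(keep=patient visit score) b
  where a.patient = b.patient
  order by patient;
quit;

/* DATA step: the same options combined with an IN variable. */
data left_a;
  merge dosing(in=a where=(medcode='A'))
        efficacy(keep=patient visit score);
  by patient;
  if a;
run;
```

Applying the subset at the point of input means that observations for medication B never enter the join or the PDV, which is the efficiency gain described above.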
Runtime benchmark tests show that a PROC SQL join is faster than a DATA Step MERGE. Using indexes improves performance still further, but PROC SQL indexes and DATA Step indexes are implemented differently internally by the SAS system and may conflict with each other curtailing performance.
References
Timothy J. Harrington, Trilogy Consulting Corporation, 5148 Lovers Lane, Kalamazoo, MI 49002. (616) 344 1996.

SAS is a registered trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.