ShaunKaufmann IntroHash Sept2013
Shaun Kaufmann
Farm Credit Canada
Agenda
This presentation will address four main questions:
• When the program runs, the program data vector contains the
observation currently being processed.
• Objects consist of things that they can do, called methods, and
information about their current state, called properties or attributes.
• Stated more simply, it doesn’t take any longer to look up a value in a hash table
with 10 million entries than it does to look up a value in a hash table with 1000
entries.
• Just to be completely clear, lookup time is effectively constant and does not grow as
the number of elements grows. This is a big deal. This is why we use hash tables.
Uses for Hash Tables
• You can use hashes as a lookup table.
• You can use them to merge datasets. Gregg Snell presented a paper at SUGI 31 comparing
different merge methods and found hash tables to be the fastest.
• You can use hashes to access data outside of SAS. Because data stored in an RDBMS cannot
be sorted and indexed, it is often not possible to use a DATA step with a MERGE statement.
• John Blackwell’s NESUG 2010 paper cites an example where an extract, sort and merge
took 30 minutes. In comparison, an implementation using a hash table took about
3 minutes.
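As a rough illustration of the lookup and merge uses above, the sketch below loads a small table into a hash object and uses it to add names to a second table without sorting either one. The dataset names (work.customers, work.orders, work.orders_named) and the sample rows are hypothetical and exist only for this example.

```sas
/* Hypothetical lookup table */
data work.customers;
   input Customer_Number First_Name $ Last_Name $;
   datalines;
101 Ann Lee
102 Raj Patel
;
run;

/* Hypothetical transaction table, deliberately not sorted by key */
data work.orders;
   input Order_ID Customer_Number;
   datalines;
1 102
2 101
3 999
;
run;

data work.orders_named;
   /* Create the key and data variables in the PDV */
   length Customer_Number 8 First_Name Last_Name $ 8;
   if _n_ = 1 then do;
      /* Load the lookup table into memory once */
      declare hash customer_hash(dataset: 'work.customers');
      customer_hash.defineKey('Customer_Number');
      customer_hash.defineData('First_Name', 'Last_Name');
      customer_hash.defineDone();
      call missing(First_Name, Last_Name);
   end;
   set work.orders;
   /* rc is 0 when the key is found; otherwise the names stay missing */
   rc = customer_hash.find();
run;
```

In this sketch the order for customer 999 has no match, so it is kept with missing names and a non-zero rc, much like an unmatched row in a left join.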
Uses for Hash Tables (cont.)
• For complex manipulations requiring continuous updating of intermediate results.
• Note: The non-hash based implementation used database select and update statements.
SAS Hash Table Implementation
• Hashes are implemented in SAS as Objects
• They have methods and properties and they use dot notation.
• For example, MyHash.find(); or This_Hash.definekey('empid');
• Hashes can only be used in a DATA step or a DS2 program.
•You don’t need to know anything about the hash function used or collision
resolution approaches.
Setting up hashes in SAS
• customer_hash.defineKey('Customer_Number');
• customer_hash.defineData('First_Name', 'Last_Name');
• customer_hash.defineDone();
Setting up hashes in SAS (cont.)
• The key and data items defined on the previous slide are not automatically added to the
PDV when defined as part of the hash definition. You must use a LENGTH statement to
define the key and data variables in the PDV; a CALL MISSING call is commonly added to
initialize them and avoid "uninitialized variable" notes in the log.
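A minimal sketch of the full setup, assuming an empty hash object named customer_hash as in the earlier slide. The variable lengths are illustrative choices, not values from the presentation.

```sas
data _null_;
   /* LENGTH creates the key and data variables in the PDV */
   length Customer_Number 8 First_Name Last_Name $ 20;

   /* Declare the hash object, then describe its key and data items */
   declare hash customer_hash();
   customer_hash.defineKey('Customer_Number');
   customer_hash.defineData('First_Name', 'Last_Name');
   customer_hash.defineDone();

   /* Assign initial missing values so the log has no "uninitialized" notes */
   call missing(Customer_Number, First_Name, Last_Name);
run;
```

If the LENGTH statement is left out, the DATA step typically fails with an error about an undeclared key or data symbol, which is the situation the bullet above describes.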
• rc = obj.defineData('data_var1', …, 'data_varN');
• rc = obj.defineDone();
• The add method adds the specified data associated with the given key to the hash object.
• rc = obj.find();
• rc = obj.find(key: key_val1, …, key: key_valN);
•Determines whether the given key has been stored in the hash object. If it has, the
data variables are updated and the return code is set to zero. If the key is not found,
the return code is non-zero.
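The add and find methods might be exercised as in the sketch below. The variable names empid and name and the sample values are hypothetical.

```sas
data _null_;
   length empid 8 name $ 12;
   declare hash h();
   rc = h.defineKey('empid');
   rc = h.defineData('name');
   rc = h.defineDone();

   /* add() with no arguments stores the current PDV values */
   empid = 1; name = 'Alice';
   rc = h.add();

   /* add() with explicit key and data arguments */
   rc = h.add(key: 2, data: 'Bob');

   /* find() on an existing key: rc is 0 and name is updated from the hash */
   call missing(name);
   rc = h.find(key: 2);
   put 'Existing key: ' rc= name=;

   /* find() on an absent key: rc is non-zero and name is left unchanged */
   rc = h.find(key: 99);
   put 'Absent key: ' rc= name=;
run;
```

Testing rc after each find is what lets a program tell a real match from stale values left in the PDV by an earlier lookup.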
SAS Hash Methods (cont.)
• rc = obj.replace();
• rc = obj.replace(key: key_val1,…, key: key_valN, data: data_val1, …, data: data_valN);
• Replaces the data associated with the given key with new data as specified in
data_val1…data_valN.
• rc = obj.check();
• rc = obj.check(key: key_val1, …, key: key_valN);
• Checks whether the given key has been stored in the hash object. The data variables
are not updated. Return codes are the same as for find.
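A short sketch contrasting check and replace, using hypothetical variables empid and name:

```sas
data _null_;
   length empid 8 name $ 12;
   declare hash h();
   h.defineKey('empid');
   h.defineData('name');
   h.defineDone();
   call missing(empid, name);
   h.add(key: 1, data: 'Alice');

   /* check() only tests for the key; name stays missing */
   rc = h.check(key: 1);
   put 'After check: ' rc= name=;

   /* replace() overwrites the data stored for key 1 */
   rc = h.replace(key: 1, data: 'Alicia');

   /* find() then copies the new data into name */
   rc = h.find(key: 1);
   put 'After replace: ' rc= name=;
run;
```

check is the lighter choice when only existence matters, because it leaves the data variables in the PDV untouched.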
SAS Hash Methods (cont.)
• rc = obj.remove();
• rc = obj.remove(key: key_val1, …, key: key_valN);
• rc = obj.clear();
• Removes all entries from a hash object without deleting the hash object.
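The difference between remove and clear can be seen by watching the num_items attribute, as in this sketch with hypothetical sample data:

```sas
data _null_;
   length empid 8 name $ 12;
   declare hash h();
   h.defineKey('empid');
   h.defineData('name');
   h.defineDone();
   call missing(empid, name);
   h.add(key: 1, data: 'Alice');
   h.add(key: 2, data: 'Bob');
   h.add(key: 3, data: 'Carol');

   /* remove() deletes only the entry for the given key */
   rc = h.remove(key: 2);
   n = h.num_items;
   put 'After remove: ' n=;   /* two entries remain */

   /* clear() empties the hash object but keeps it usable */
   rc = h.clear();
   n = h.num_items;
   put 'After clear: ' n=;    /* no entries remain */
run;
```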
SAS Hash Methods (cont.)
• rc = obj.output(dataset: 'dataset_name');
• Creates dataset dataset_name which will contain the data in the hash object.
• rc = obj.sum(sum: sum_var);
• rc = obj.sum(key: key_val1, …, key: key_valN, sum: sum_var);
• Gets the key summary for the given key and stores it in the DATA Step variable
sum_var. Key summaries are incremented when a key is accessed.
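The sketch below shows output and sum together. It assumes key summaries are enabled through the suminc argument on the declare statement, which the slides do not show. The variable hits, the dataset name work.hash_contents, and the sample rows are hypothetical.

```sas
data _null_;
   length empid 8 name $ 12 total 8;

   /* suminc names the variable whose value is added to a key's summary */
   declare hash h(suminc: 'hits');
   h.defineKey('empid');
   /* empid is also listed as data so that it appears in the output dataset */
   h.defineData('empid', 'name');
   h.defineDone();

   hits = 0;
   empid = 1; name = 'Alice'; h.add();
   empid = 2; name = 'Bob';   h.add();

   /* Each find() adds the current value of hits to the summary for key 1 */
   hits = 1;
   rc = h.find(key: 1);
   rc = h.find(key: 1);

   /* sum() copies the key summary for key 1 into total */
   rc = h.sum(key: 1, sum: total);
   put total=;

   /* output() writes the data items of every entry to a dataset */
   rc = h.output(dataset: 'work.hash_contents');
run;
```

Only data items are written by output, so a key variable that should appear in the dataset needs to be listed in defineData as well.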
SAS Hash Methods (cont.)
• rc = obj.ref();
• rc = obj.ref(key: key_val1, …, key: key_valN);
• Performs a find operation for the current key. If the key is not in the hash object, it
will be added.
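A minimal sketch of ref, assuming a SAS release that supports the method and using hypothetical variables empid and name:

```sas
data _null_;
   length empid 8 name $ 12;
   declare hash h();
   h.defineKey('empid');
   h.defineData('name');
   h.defineDone();

   empid = 7; name = 'Grace';

   /* Key 7 is absent, so ref() adds it with the current PDV values */
   rc = h.ref();
   n = h.num_items;
   put 'After first ref: ' n=;    /* one entry */

   /* Key 7 is now present, so ref() behaves like find() and adds nothing */
   rc = h.ref();
   n = h.num_items;
   put 'After second ref: ' n=;   /* still one entry */
run;
```

ref saves the usual pattern of calling find, testing the return code, and then calling add when the key is missing.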
• The equals method determines whether two hash objects are equal. If they are equal,
res_var is set to 1, otherwise it is set to zero.
•rc = obj.delete();
• i = obj.num_items;
• sz = obj.item_size;
• Obtains the item size, in bytes, for an item in the hash object.
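The equals and delete methods and the num_items and item_size attributes can be tried together, as in the sketch below. The hash names h1 and h2 and the variables k and v are hypothetical; res_var follows the name used in the bullet above.

```sas
data _null_;
   length k 8 v $ 8 res_var 8;

   declare hash h1();
   h1.defineKey('k');
   h1.defineData('v');
   h1.defineDone();

   declare hash h2();
   h2.defineKey('k');
   h2.defineData('v');
   h2.defineDone();

   call missing(k, v);
   h1.add(key: 1, data: 'a');
   h2.add(key: 1, data: 'a');

   /* Both objects hold the same single entry */
   rc = h1.equals(hash: 'h2', result: res_var);
   put 'Same contents: ' res_var=;

   /* After an extra entry in h2 the objects differ */
   h2.add(key: 2, data: 'b');
   rc = h1.equals(hash: 'h2', result: res_var);
   put 'After extra add: ' res_var=;

   /* Attributes are read without parentheses */
   i  = h2.num_items;
   sz = h2.item_size;
   put i= sz=;

   /* delete() frees the hash object itself, not just its entries */
   rc = h2.delete();
run;
```

Multiplying num_items by item_size gives only a rough estimate of memory use, since the hash object carries some overhead of its own.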
The Hash Iterator
• Gives you the ability to traverse your memory table from start to end, or
vice versa.
•A hash iterator is associated with a specific hash object and operates only
on that hash object.
•Before you declare your iterator you must declare your hash object.
Hash Iterator Methods
• declare hiter iterobj('hash_obj');
• Creates a hash iterator to retrieve items from the hash object named hash_obj.
• rc = iterobj.first();
• Copies the data for the first item in the hash object into the data variables for the
hash object.
• rc = iterobj.last();
• Copies the data for the last item in the hash object into the data variables for the
hash object.
Hash Iterator Methods (cont.)
• rc = iterobj.next();
• Copies the data for the next item in the hash object into the data variables for the
hash object. A non-zero value is returned if the next item cannot be retrieved.
• Use iteratively to traverse the hash object and return the data items in key order. If
first has not been called, next begins with the first item.
• rc = iterobj.prev();
• Copies the data for the previous item in the hash object into the data variables for
the hash object. A non-zero value is returned if the previous item cannot be retrieved.
• Use iteratively to traverse the hash object and return the data items in reverse key
order. If last has not been called, prev begins with the last item.
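The iterator methods above can be combined into forward and reverse traversals, as in this sketch. It assumes the hash object is declared with ordered: 'a' so that traversal follows ascending key order. The names h, it, empid and name are hypothetical.

```sas
data _null_;
   length empid 8 name $ 12;

   /* The hash object must be declared before its iterator */
   declare hash h(ordered: 'a');
   h.defineKey('empid');
   h.defineData('empid', 'name');
   h.defineDone();
   declare hiter it('h');

   /* Entries are added out of key order on purpose */
   empid = 3; name = 'Carol'; h.add();
   empid = 1; name = 'Alice'; h.add();
   empid = 2; name = 'Bob';   h.add();

   /* Forward traversal: first(), then next() until rc is non-zero */
   rc = it.first();
   do while (rc = 0);
      put 'Forward: ' empid= name=;
      rc = it.next();
   end;

   /* Reverse traversal: last(), then prev() until rc is non-zero */
   rc = it.last();
   do while (rc = 0);
      put 'Reverse: ' empid= name=;
      rc = it.prev();
   end;
run;
```

Because the iterator copies only data items into the PDV, empid is listed in defineData here so that the key value is visible during traversal.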
Summary
• Hash tables are in-memory data structures.
•Hash Tables can be used for lookups, sorting, merging and to facilitate complex data
manipulations by removing the disk I/O associated with frequent query and update
statements.
•Hash tables are implemented in SAS as objects and provide a wide range of
functionality through their methods and properties.
References
Think FAST! Use Memory Tables (Hashing) for Faster Merging. Gregg P. Snell, Data Savant
Consulting, Shawnee, KS. SUGI 31, Paper 244-31.
Find() the power of Hash - How, Why and When to use the SAS® Hash Object. John Blackwell.
NESUG 2010.
SAS Hash Object Programming Made Easy. Michele M. Burlew. SAS Press 2012.
Better Hashing in SAS 9.2. Robert Ray and Jason Secosky. SAS Global Forum 2008.