What Exactly Are Hash Files? (Courtesy – Ken Bland)
I pasted this nice article by Ken Bland on my blog for future reference. The original article can be found HERE. All credits to Ken Bland.
DataStage® Server Edition utilizes several different types of hash files. The default and most versatile type is a dynamic (more precisely, a type 30) hash file. A hash file can simply be described as a pre-sized, space-allocated file that distributes rows throughout itself. Every row must have a key that determines where the row will reside within the file. This key is run through an algorithm (technically a hashing algorithm, hence the name) to determine the location in the file for the row. One and only one row can exist for a given key.
A hash file is usually composed of two physical files: the data file and the overflow file. Since the file is pre-sized and its space allocated, rows that cannot fit within the data file spill over to the overflow file. The hash file is at its optimal sizing when all of the rows are within the data file and none are in the overflow. This is because during a row's retrieval, the row's address is first computed and checked in the data file using the highly efficient hashing algorithm. If the row is not there, the overflow file is scanned sequentially. The more rows in the overflow, the longer it takes to determine whether a row exists in a hash file.
Code: |
$ ls -al /var/opt/work/hash/tmpExample
total 2032
drwxr-xr-x   2 radnet   radnet        96 May 14 01:08 .
drwxr-x---  65 radnet   radnet      8192 May 14 01:53 ..
-rw-r--r--   1 radnet   radnet         0 May 14 01:08 .Type30
-rw-r--r--   1 radnet   radnet    774144 May 14 01:52 DATA.30
-rw-r--r--   1 radnet   radnet    249856 May 14 01:08 OVER.30
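As a rough mental model of the placement logic described above, the following sketch shows how a key is hashed to a group and how rows that no longer fit spill into an overflow area. This is a conceptual illustration only; the modulus, group capacity, and hash function here are invented and are not the engine's actual algorithm.

```python
# Conceptual sketch only: a key hashes to a group; a full group spills to overflow.
from zlib import crc32

MODULUS = 4            # number of groups (illustrative)
GROUP_CAPACITY = 2     # rows per group before spilling (illustrative)

data_groups = {g: [] for g in range(MODULUS)}   # stand-in for "DATA.30"
overflow = []                                   # stand-in for "OVER.30"

def write_row(key, row):
    group = crc32(key.encode()) % MODULUS       # key -> group address
    if len(data_groups[group]) < GROUP_CAPACITY:
        data_groups[group].append((key, row))
    else:
        overflow.append((key, row))             # group full: spill over

def read_row(key):
    group = crc32(key.encode()) % MODULUS
    for k, row in data_groups[group]:           # fast: go straight to one group
        if k == key:
            return row
    for k, row in overflow:                     # slow: sequential scan
        if k == key:
            return row
    return None

for i in range(10):
    write_row(f"CUST{i:03d}", f"row for customer {i}")
print(read_row("CUST007"))
```

The more rows that end up in the overflow list, the more of every unsuccessful probe is spent in the sequential scan, which is exactly why an overflow-heavy hash file slows lookups down.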
The beauty of the dynamic hash file, as compared to the static hash file, is that it grows on demand. At a predetermined threshold, the file automatically doubles its size and redistributes the data within it. This pulls the data out of the overflow file and back into the data file. The dynamic hash file defaults to an initial sizing (minimum modulus) of 1, which means the file will constantly resize by doubling itself to meet the incoming data set. Clearing the hash file shrinks it back to its initial sizing.
To find a row in a hash file, you must know the key that was used to deposit the row somewhere within the file. The data is stored as a continuous string of characters with a column separator (the @AM system variable). Each column is positional in nature and relative to the columns preceding it. It should become obvious that a hash file is almost the same as a sequential file, meaning columns are positional and referenced by position, not name. In fact, column names are irrelevant; you only need to know the key construct and what data is located in which positional column. Because data is stored as a text string, there is no explicit data typing enforced within a hash file. The only rule in a hash file is not to use the restricted ASCII characters within your data, as they are internal data separators. Furthermore, each row within a hash file is allowed to have its own structure, because again the data is stored as a text string. Therefore, hash files are non-structured, free-form scratch files!
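To picture a row held as a single delimited text string with positional columns, here is a minimal sketch. The column values are invented, and the attribute mark is shown as character 254 per the usual UniVerse convention; treat the exact byte value as an assumption rather than gospel.

```python
# Conceptual sketch: a hash file row is one text string whose columns are
# positional, separated by the attribute mark (@AM, conventionally char 254).
AM = chr(254)  # attribute mark stand-in

# Write side: column names do not matter, only position does.
columns = ["WIDGET-42", "Blue widget", "12.50", "2005-05-14"]
row_string = AM.join(columns)

# Read side: the reader must already know that position 3 holds the price.
# There is no data typing and no column-name lookup.
fields = row_string.split(AM)
price_text = fields[2]          # still a string, not a number
print(price_text, float(price_text))
```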
How Are They Meant To Be Used by DataStage®?
Hash files should be limited to three main uses:
1. Small reference lookups by primary key
2. Small staging files usually when a reference capability has to exist
3. Non-persistent scratchpad work files.
Small reference lookups by primary key – DataStage® has the ability to cache lookups in memory, as long as they are hash files. Lookups are not joins; they are specific searches in a hash file for a row using that row's key value. Since you supply the key value, the algorithm quickly converts it to the location of the row within the hash file and returns the contents at that address. The same activity through ODBC/OCI stages is unacceptable for performance reasons because of the lack of memory caching capability. In addition, as more instantiated jobs perform the same lookup, the database becomes saturated with lookup queries. A sketch of the in-memory lookup idea follows this list.
Small staging files usually when a reference capability has to exist – A transformation job may prepare data that will be needed for reference in another job. This is accomplished quite easily by staging the data into a hash file. Making this data available for reference without a hash file would require loading it into either a target table or some sort of ETL scratch table.
Non-persistent scratchpad work files – Hash files are wonderful mechanisms for staging data when it is needed for a reference lookup. For data that is not needed for a reference lookup, the better choice is a sequential file, because writing to a sequential file is a simple append. Hash files carry the overhead of figuring out where each row should go, checking the space available within the file, and then writing the row to either the data section or the overflow.
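Conceptually, a memory-preloaded hash file lookup behaves like an in-memory dictionary keyed on the lookup key, which is why it outruns issuing one database query per input row. A minimal sketch, with invented keys and columns standing in for the reference data:

```python
# Conceptual sketch: "preloading" a keyed reference into memory once, then
# probing it per input row -- versus issuing one database query per row.
AM = chr(254)

# Stand-in for the hash file's contents (the key is the first column).
reference_rows = [
    AM.join(["CUST001", "Acme Corp", "NET30"]),
    AM.join(["CUST002", "Widgets Ltd", "NET60"]),
]

# Preload: read every reference row once and index it by key.
cache = {}
for raw in reference_rows:
    fields = raw.split(AM)
    cache[fields[0]] = fields

# Per input row: a constant-time in-memory probe, no database round trip.
for input_key in ("CUST001", "CUST999"):
    row = cache.get(input_key)
    print(input_key, "->", row if row else "NOT FOUND")
```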
How Can A Hash File Be Abused in a DataStage® Design?
- Using a SQL interface to a hash file (UV/ODBC stage, UV/SQL commands). A hash file has no secondary indexes to support any type of relational queries. While the underlying engine has the technology in it to support this type of activity, you do not want to go down the path of building a database within your ETL tool! This defeats the entire purpose of exposing data transformation to the simple graphical metaphor. The only exception would be using the UV/ODBC stage to perform aggregations or sorts on the data, but even then you should question the method.
- Cramming obscene volumes into a hash file. If you are creating distributed hash files or 64BIT hash files, you should question what you are trying to use the hash file to accomplish. These file options are meant for very large hash files (files exceeding 2.2 gigabytes of data). The knowledge required to create and maintain them is not everyday ETL and is quite specialized. There are alternatives, such as partitioning your data or using smaller sets of source data, that keep you within the optimal usage of the hash file.
- You are storing persistent, production data inside your ETL tool. Hash files should be used as throwaway or disposable scratchpad work files. Anything else requires archival and recovery procedures and processes to return your environment to its pre-corruption state.
- Preloading to memory when each row will only be looked up once. If every row will be referenced only once, then preloading the file to memory is a waste of time and memory. The preloading activity reads each row and then puts it into memory; during processing the row is referenced again from memory. Without preloading, the row would still be read only once from the file, but no memory overhead is incurred. Accessing the hash file will still move a portion of it into the operating system's disk cache, potentially giving some benefit to direct reads.
- Preloading to memory when the majority of rows will never be referenced. If the number of input processing rows does not exceed the number of rows in a reference lookup, preloading to memory is a waste of time and memory. This is especially true when the reference lookup vastly exceeds the number of input processing rows, because the maximum number of reads against the reference lookup is the number of input processing rows; for example, processing 10,000 input rows against a 1,000,000-row lookup can read at most 10,000 of those reference rows. Preloading the rest of the rows in the reference lookup into memory simply wastes time and memory.
- Many parallel jobs writing to the same hash file. Careful consideration must be given to how DataStage® handles parallel writes to a hash file, especially with write-delay caching. A job writing to a hash file with write-delay caching buffers its writes and then, at the flush point, grabs the hash file and flushes its data to disk. Instantiated jobs writing to the same hash file will encounter contention at the flush points. There is a point of diminishing returns where N+1 instantiations have a greater overall runtime than N instantiations. That point is unique to each server configuration and one you have to find out for yourself. In some cases 4 instances may have a better overall runtime than 8 instances because of the disk congestion and contention.
- Using locking mechanisms for rows in the hash file. To help use DataStage® in a more transactional mode, locking features have been added. These features should NEVER be used for normal ETL; the overhead involved degrades performance tremendously. DataStage® uses dirty reads and writes to hash files. The time-tested approach is to run processes in serial and isolate the hash file that way, rather than attempt to run several processes in parallel and rely on locking to keep from clobbering data.
- In DataStage® Designer, make sure that all of your jobs use the exact same metadata for the hash file wherever it is used. The system of record for any metadata should be the job that creates the file, be it hashed or sequential. Using the correct metadata should make reference lookups easy, as you must always specify the key in the hash file metadata. If you are twiddling the primary key column metadata in Designer to "join" to a hash file in a transformer stage, then you did not understand the preceding explanation of the row address hashing algorithm.
- DataStage® can be deceiving in that you can change the metadata for the hash file stage and expect a different set of columns to be used to find a row. This is a relational perspective, not a hash file perspective. DataStage® will read the entire contents of that row into memory, so therefore you should have the complete metadata that accurately describes the contents of the file. You should not alter definitions from one job to another expecting results to meet your changed metadata.
Tricks, Features, and Unintended (but Cool) Results
There are a few tricks to making your hash files more efficient. The best option is to try to stay with dynamic hash files and use initial sizing as your first tuning approach. Using a simple job that reads a sequential text file and creates/writes to your hash file, you can easily test some of the tuning options.
Setting the minimum modulus
Start with a hash file with a minimum modulus (kind of like an initial extent size in a relational database) of 1. Run your job limiting it to 1 row so that you just create the hash file and populate it with a single row. Then take a look at the contents of the subdirectory bearing your hash file's name.
Code: |
$ ls -al /var/opt/work/hash/tmpExample
total 2032
drwxr-xr-x   2 radnet   radnet        96 May 14 01:08 .
drwxr-x---  65 radnet   radnet      8192 May 14 01:53 ..
-rw-r--r--   1 radnet   radnet         0 May 14 01:08 .Type30
-rw-r--r--   1 radnet   radnet      4096 May 14 01:08 DATA.30
-rw-r--r--   1 radnet   radnet      2048 May 14 01:08 OVER.30
You can see that the DATA.30 file has an initial size of 4096 and the OVER.30 file has a size of 2048. Now run your job again, this time for all of your rows to see what happens to the hash file.
Code: |
$ ls -al /var/opt/work/hash/tmpExample
total 2032
drwxr-xr-x   2 radnet   radnet        96 May 14 01:08 .
drwxr-x---  65 radnet   radnet      8192 May 14 01:53 ..
-rw-r--r--   1 radnet   radnet         0 May 14 01:08 .Type30
-rw-r--r--   1 radnet   radnet    774144 May 14 01:12 DATA.30
-rw-r--r--   1 radnet   radnet    249856 May 14 01:12 OVER.30
The data file has now grown by doubling in size again and again to accommodate the incoming data. The job ended with data in the overflow file. This hash file is not tuned as an optimal reference lookup, as a referenced row must be checked in the data file and then potentially checked in the less efficient overflow file.
At this point we could do some research and compute a modulus for this hash file (a rough estimate is sketched after the listing below). However, through a few rounds of experimentation we can arrive at a sufficiently sized file. In your Designer client, set the minimum modulus upwards. Since the modulus is computed from the average bytes per row and the number of rows, each hash file will be tuned according to the nature of the data that will go into it. Re-run the job, increasing the minimum modulus each time, until all of the data lands in the data file and the overflow is back to its original 2048 bytes. Watch the file as the job runs: you should see it created at the initial size, and after so many rows the job monitor will show rows/sec stalling as the hash file dynamically grows, then picking up again.
Code: |
$ ls -al /var/opt/work/hash/tmpExample
total 2032
drwxr-xr-x   2 radnet   radnet        96 May 14 01:08 .
drwxr-x---  65 radnet   radnet      8192 May 14 01:53 ..
-rw-r--r--   1 radnet   radnet         0 May 14 01:08 .Type30
-rw-r--r--   1 radnet   radnet   1343488 May 14 01:23 DATA.30
-rw-r--r--   1 radnet   radnet      2048 May 14 01:21 OVER.30
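If you would rather estimate a starting minimum modulus than iterate, a rough rule of thumb is row count times average row bytes divided by the usable bytes per group. The sketch below assumes the common dynamic-file defaults of a 2048-byte group size and an 80% split load; check your own file's settings, and treat the row counts as illustrative.

```python
# Rough sizing sketch, assuming default dynamic-file settings:
# 2048-byte groups and an 80% split load. Inputs are illustrative.
GROUP_SIZE = 2048
SPLIT_LOAD = 0.80

def estimate_minimum_modulus(row_count, avg_row_bytes):
    usable_bytes_per_group = GROUP_SIZE * SPLIT_LOAD
    return int(row_count * avg_row_bytes / usable_bytes_per_group) + 1

# e.g. 10,000 rows averaging 120 bytes each
print(estimate_minimum_modulus(10_000, 120))   # a starting point, not gospel
```

Use the estimate as the first minimum modulus you try, then confirm with the overflow-file check described above.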
Keep in mind that the minimum modulus means this hash file will always start at this size and grow by doubling. The only way to reclaim the disk space is to remove the hash file; clearing it will only return it to its minimum modulus.
Externally pathed hash files
Were DataStage® to always require a directory path for hash files, many people would have never encountered the frustrating issues relating to the legacy hash file technology from the prior versions of the engine. DataStage® by default will put hash files into the project. When it does this, it puts something similar to a database synonym into an internal project table (the dreaded VOC).
If you erase the hash file at the OS level using the "rm" command, you will find that you can never recreate a hash file under that name again. This is because the synonym is still in the VOC. Using the DELETE.FILE command in the DSEngine shell will fail because the physical file parts are missing, and therefore it does not remove the VOC entry. You have to remove the VOC entry manually by typing "DELETE VOC hashfilename".
The way to avoid this is to use externally pathed hash files, for which the DSEngine will not create a VOC entry. By directly qualifying the location of the hash file, you can put hash files anywhere you want. Having work files created, modified, and deleted in an area as sensitive as the project directory is playing with fire. Not only does this separate the critical pieces of the ETL tool from the scratch work files, it also helps you manage the hash files by exposing them for tuning and for easy removal.
Non-structured data
A hash file has no internal form other than the key structure. Since the key is the address of where a row of data resides, it is the only absolute form and structure. For example, a hash file holding customer information has a primary key of the customer number, but there are different types of customer records with different columns and definitions. For each record type, you could have a distinct hash file. Or, you could add the record type as part of the primary key and have one hash file: many different column definitions, one for each customer record type, but a single hash file holding all of the data. I don't, however, recommend this type of data model.
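A sketch of the "record type as part of the key" idea follows; the record types and column layouts are invented for illustration, and a plain dictionary stands in for the hash file.

```python
# Conceptual sketch: one hash-file-like store holding differently shaped
# customer records, distinguished by a record type folded into the key.
AM = chr(254)
store = {}

def put(cust_no, rec_type, columns):
    store[f"{cust_no}|{rec_type}"] = AM.join(columns)   # key = number + type

# Same customer, three record types, three different column layouts.
put("1001", "NAME",    ["Acme Corp", "Industrial"])
put("1001", "ADDRESS", ["12 Main St", "Springfield", "IL", "62701"])
put("1001", "CREDIT",  ["50000.00", "NET30"])

# The reader must know which column layout goes with which record type.
print(store["1001|ADDRESS"].split(AM))
```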
De-duplicating data
One and only one row can exist in a hash file with a given key. When writing to a hash file, DataStage® will not check to see if a row is already in the hash file under that key. It will write or overwrite new data into that location without blinking an eye. It is incumbent upon a developer to check if a row exists via a reference lookup and then decide how to handle the write.
There are times when sorted or ordered data has to be de-duplicated on the primary key. In this situation you may wish to keep only the last occurrence of each primary key. Writing rows with duplicate primary keys to a hash file will leave you with the last occurrence of each key.
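Because a write simply replaces whatever is already stored under that key, pushing ordered data through a hash-file-style store is itself a de-duplicator that keeps the last occurrence. A minimal sketch of that behaviour, with invented keys:

```python
# Conceptual sketch: rows keyed on the primary key; a later row with the
# same key silently overwrites the earlier one, so the last one wins.
rows = [
    ("CUST001", "first version"),
    ("CUST002", "only version"),
    ("CUST001", "last version"),   # duplicate key, arrives later
]

hash_file = {}
for key, payload in rows:
    hash_file[key] = payload       # no existence check, just overwrite

print(hash_file["CUST001"])        # -> "last version"
```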
Sorted going in, randomized coming out
If you write sorted data to a hash file, it will not load into the hash file any quicker. This is because the hashing algorithm has a randomizing effect on where the data is located within the file. You can see this by writing sorted data into a hash file, then using a job to scan the hash file and write it out to a sequential text file. There is no easy way to order data coming out of an externally pathed hash file. You could use the UV/ODBC (Universe on release 6) stage to order the data, because this stage treats the hash file like a relational data source and gives you SQL command capability, but at a maintenance and performance cost. The best way to sort hashed data is to sort it with a third-party sort, or even the Unix sort command, after extracting the data from the hash file.
Handy Tips
Make sure your hash file has an identified primary key column; in your case, PRODUCT_CODE sounds like it should be set to YES. Make sure you SAVE YOUR HASHFILE METADATA TO THE LIBRARY.
In another job that has to look up against this hash file, LOAD THAT SAVED METADATA into your hash file reference stage. Press View Data; you should see your data displayed.
Whenever you do reference lookups, JUST AS IN SQL, you must make sure you have trimmed data if there is a chance of leading or trailing spaces breaking the equivalency in the reference. If you are comparing numbers, you have to make sure you are not tripped up by leading or trailing zeroes (0.00 is NOT EQUAL to 0, and 0123 is NOT EQUAL to 123). There is no data typing; it is all string comparison in DataStage. Verify that you are not getting caught out by this. An easy way is to VIEW DATA, double-click on the displayed field contents, and look for highlighted leading or trailing spaces.
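A quick illustration of the string-comparison traps called out above; the values are made up, and in a real job you would do the trimming and normalising inside the transformer before the lookup rather than in a script.

```python
# Conceptual sketch: hash file lookups compare raw strings, so padding and
# leading/trailing zeroes break the match even when the "numbers" agree.
print("0.00" == "0")          # False -- not equal as strings
print("0123" == "123")        # False
print("ABC  " == "ABC")       # False -- trailing spaces matter

# Normalising both sides first makes the comparison behave as intended.
def normalise(value):
    value = value.strip()
    try:
        return str(float(value))   # "0123" -> "123.0", "0.00" -> "0.0"
    except ValueError:
        return value

print(normalise("0123") == normalise("123"))   # True
```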