It happens far too often: Once an ETL process has been tested and executes successfully, there are no further checks to ensure that the operation actually did what it was supposed to do. Sometimes it takes a day, other times it takes a year, but eventually that call comes from a client, coworker, or boss: “What’s wrong with this data?” In that moment, the need for ETL auditing – which sadly is often seen as an optional feature – becomes very clear.
In this ETL Best Practice tip, I’m going to discuss ETL auditing and its importance in a well-designed data load strategy.
ETL Auditing
Let’s start by defining ETL auditing. Auditing in an extract, transform, and load process is intended to satisfy the following objectives:
- Check for data anomalies beyond simply checking for hard errors
- Capture and store an electronic trail of any material changes made to the data during transformation
If the ETL process is an automobile, then auditing is the insurance policy. ETL auditing helps to confirm that there are no abnormalities in the data even in the absence of errors. A well-designed auditing mechanism also adds to the integrity of the ETL process, eliminating ambiguity in transformation logic by trapping and tracing each change made to the data along the way. There are some shared behaviors between auditing and data lineage, the latter of which will be explored in detail in a future post.
Checking for Data Anomalies
Let’s say you are loading data from OLTP to a data warehouse using your favorite ETL tool. This particular load retrieves 1.5 million rows from the OLTP source. It goes through a few transformation steps along the way, and completes all of its tasks successfully as it loads the destination. However, a quick check of the destination reveals that only 1,499,990 rows were loaded. What happened to those missing 10 rows of data?
Here’s a dirty little secret that exists in ETL software: it is relatively easy to simply lose data. Even in my favorite ETL tool, SSIS, it’s not terribly difficult to configure a package that allows data to fall into the bit bucket without so much as a warning or an error. And updating data – perhaps incorrectly – is even easier than outright losing data. This is not a dysfunction of the software as much as it is a function of its flexibility. An ETL tool isn’t going to last long in the market if it won’t allow developers to bend, reshape, and transform data in unorthodox ways. That flexibility comes with the responsibility to use the tools properly; when poorly configured, data can simply disappear or end up transformed improperly. Even the most seasoned and well-meaning ETL engineers have occasionally discovered that their data took a wrong turn somewhere.
The antidote to this is to check for data irregularities, even when no errors are found. Even in the most rudimentary of ETL architectures, it is possible to check a few high-level metrics to confirm that the data loaded is what was expected. At a minimum, critical ETL processes should check the following to confirm that the inputs match the outputs:
- Overall row count
- Aggregate totals (which might include financial amounts or other summary data)
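As a minimal sketch of what those two checks might look like in T-SQL, a post-load audit step could compare counts and a key total between the staging table and the warehouse table, and raise an error when they disagree. The object names here (stg.SalesExtract, dw.FactSales, SaleAmount, LoadBatchID) are hypothetical placeholders, not part of any particular framework:

```sql
-- Hypothetical post-load validation: compare the row count and a key aggregate
-- between the staging (source) table and the warehouse (destination) table.
DECLARE @LoadBatchID INT = 12345;  -- in practice, supplied by the calling package
DECLARE @SourceRows BIGINT, @TargetRows BIGINT;
DECLARE @SourceAmount DECIMAL(18, 2), @TargetAmount DECIMAL(18, 2);

SELECT @SourceRows = COUNT_BIG(*), @SourceAmount = SUM(SaleAmount)
FROM stg.SalesExtract;

SELECT @TargetRows = COUNT_BIG(*), @TargetAmount = SUM(SaleAmount)
FROM dw.FactSales
WHERE LoadBatchID = @LoadBatchID;   -- scope the check to the rows from this load

IF (@SourceRows <> @TargetRows) OR (@SourceAmount <> @TargetAmount)
    THROW 50001, 'Audit failure: source and destination row counts or totals do not match.', 1;
```

In SSIS, a check like this fits naturally in an Execute SQL Task that runs after the data flow completes, so a mismatch fails the package rather than passing silently.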
Some processes will require more exhaustive auditing. Anyone who, like me, has spent time moving and transforming general ledger data will testify that even row counts and financial sums are not enough; ensuring that the ending balance matches the beginning balance plus all of the transactions for that period will help you to keep your job for another day. In other cases, checking to make sure that data is within reasonable limits may be required. Before my consulting days I worked in healthcare, and one of our processes that imported data with known deficiencies had to be configured to scrutinize the data to make sure we didn’t end up with patients who were hundreds of years old – even though the hard limits of the ETL process would have supported such a value.
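For the general ledger case, a reconciliation query is one way to express the "beginning balance plus period transactions equals ending balance" rule as an audit step. The schema and column names below (gl.AccountBalance, gl.LedgerTransaction, and so on) are illustrative assumptions only:

```sql
-- Hypothetical general ledger reconciliation: for the period just loaded,
-- beginning balance + the period's transactions should equal the ending balance.
DECLARE @PeriodID INT = 202401;  -- the fiscal period being audited

SELECT b.AccountID,
       b.BeginningBalance,
       b.EndingBalance,
       t.PeriodTotal
FROM gl.AccountBalance AS b
JOIN (SELECT AccountID, SUM(TransactionAmount) AS PeriodTotal
      FROM gl.LedgerTransaction
      WHERE PeriodID = @PeriodID
      GROUP BY AccountID) AS t
    ON t.AccountID = b.AccountID
WHERE b.PeriodID = @PeriodID
  AND b.EndingBalance <> b.BeginningBalance + t.PeriodTotal;
-- Any rows returned are accounts that fail the audit and should block or flag the load.
```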
One final note on checking for data anomalies: Don’t forget to check for cases where no data is loaded. This happens more often than you might think. A source data file containing no data, an improperly configured query that returns no rows, or an empty source directory intended to contain one or more files could all lead to the ETL process running successfully but loading exactly zero rows of data. There are cases when this is appropriate – in particular, when running an incremental load on data that may or may not have changed. However, if a given process should always result in a nonzero number of rows loaded, be sure to add an audit step to verify this.
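A zero-row guard can be a one-statement audit step. Again, the table and column names here are placeholders for illustration:

```sql
-- Hypothetical zero-row guard: if this process should always load at least one row,
-- treat an empty result as an audit failure even though the package itself succeeded.
DECLARE @LoadBatchID INT = 12345;  -- in practice, supplied by the calling package
DECLARE @RowsLoaded BIGINT;

SELECT @RowsLoaded = COUNT_BIG(*)
FROM dw.FactSales
WHERE LoadBatchID = @LoadBatchID;

IF @RowsLoaded = 0
    THROW 50002, 'Audit failure: the load completed successfully but wrote zero rows.', 1;
```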
Transformation Change Log
ETL developers have a significant amount of power to change data. The “T” part of ETL implies that the processes we build will update data from its original state. When problems arise in ETL loads, they are very often rooted in these transformations: business rules, cleansing processes, deduplication, and other tweaks collide, leading to unexpected output in the data. Too often ETL processes become black boxes, understood only by their creator with little if any transparency into what goes on inside. Trust me when I say that you don’t want to be on the business end of an executive scolding session after apparently good data became suspect because of a process that is opaque to everyone else. Even worse, you don’t want to be caught empty-handed while facing an auditor asking questions about how and why critical data was updated.
When data is changed by an ETL process, there should be a record of that change. This may be as generic as publishing to the data consumers a list of business rules and other transformations that are applied to the data as it moves through the ETL chain. In some cases, this involves meticulously logging every change that occurs within the boundaries of ETL, capturing the before-and-after picture of the data for each change. The depth of this logging depends on many factors:
- How critical is this process?
- How much transformation is done to the data?
- How difficult is it for the layperson to understand what changes occur by design in the ETL process?
- To what extent is this data governed by HIPAA, SOX, or other regulations?
Capturing transformative changes in the ETL process makes for a more transparent and trustworthy process, and can protect the ETL developer when questions arise about what was changed, when, and why.
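When full before-and-after logging is warranted, a simple change-log table is one possible shape for storing it. This is a sketch only; the audit schema and column names are assumptions, and the granularity (one row per changed column value) is just one of several workable designs:

```sql
-- Hypothetical row-level change log: one row per column value changed by the ETL
-- process, capturing the before-and-after picture and the rule responsible.
CREATE TABLE audit.TransformationChangeLog (
    ChangeLogID    BIGINT IDENTITY(1, 1) PRIMARY KEY,
    LoadBatchID    INT           NOT NULL,  -- ties the change to a specific ETL execution
    TableName      SYSNAME       NOT NULL,
    BusinessKey    NVARCHAR(200) NOT NULL,  -- natural key of the affected row
    ColumnName     SYSNAME       NOT NULL,
    ValueBefore    NVARCHAR(MAX) NULL,
    ValueAfter     NVARCHAR(MAX) NULL,
    TransformRule  NVARCHAR(200) NOT NULL,  -- name of the business rule or cleansing step applied
    ChangeDateTime DATETIME2(3)  NOT NULL DEFAULT SYSUTCDATETIME()
);
```

The ETL process would write one row here for each material change it makes, which goes a long way toward keeping the transformation logic out of black-box territory.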
Building an ETL Audit System
How one goes about adding auditing to an existing ETL process will vary greatly depending on the platform used and the scope of auditing required. Most ETL software packages have tools that can assist in the logging process, and some third-party software vendors offer auditing components and frameworks. Sometimes, the database engine itself can help with this as well: both SQL Server and Oracle offer change data capture functionality to track all of the insert, update, and delete operations against monitored tables.
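For reference, SQL Server’s change data capture is enabled with two system stored procedures; the source table name below is illustrative:

```sql
-- Enable change data capture on the database (one time, requires sysadmin),
-- then track inserts, updates, and deletes on a source table of interest.
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Customer',  -- illustrative table name
    @role_name     = NULL;         -- NULL means no gating role is required to read the change data
```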
Whether your ETL architecture is as simple as a collection of batch files executing SQL statements or as complex as hundreds or thousands of nested SSIS packages, adding auditing logic is achievable. Even with the availability of software specifically built for ETL auditing, I find that the approach that has worked best for me is to customize the audit logic for each process. Auditing is difficult to tackle with a cookie-cutter approach (even more so than the logging discussed in the last post), and what is gained in convenience is often lost in flexibility.
While I have not yet found an auditing framework with enough customization to suit me, I have learned that it is possible to audit ETL processes in a customized way without reinventing the wheel every time. I keep on hand a common (but never set in stone) set of table definitions and supporting logic to shorten the path to the following auditing objectives:
- Simple aggregate auditing of row counts, financial data, and other key metrics
- Documentation templates for source-to-target mappings and transformation logic
- Lightweight auditing of batch-level changes
- Full auditing of every change made in the transformation process
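To make the first of those objectives concrete, here is the general shape of a batch-level audit table I might start from – a sketch with hypothetical names, not a finished framework:

```sql
-- Hypothetical batch-level audit table: one row per package execution, storing the
-- aggregate metrics used to reconcile what came in against what was written out.
CREATE TABLE audit.LoadBatch (
    LoadBatchID       INT IDENTITY(1, 1) PRIMARY KEY,
    PackageName       NVARCHAR(200)  NOT NULL,
    LoadStartTime     DATETIME2(3)   NOT NULL DEFAULT SYSUTCDATETIME(),
    LoadEndTime       DATETIME2(3)   NULL,
    SourceRowCount    BIGINT         NULL,
    InsertedRowCount  BIGINT         NULL,
    UpdatedRowCount   BIGINT         NULL,
    SourceAmountTotal DECIMAL(18, 2) NULL,  -- e.g., sum of a key financial column
    TargetAmountTotal DECIMAL(18, 2) NULL,
    AuditStatus       VARCHAR(20)    NOT NULL DEFAULT 'In Progress'
);
```

Each package execution inserts a row at the start of the load and updates the metrics and status at the end, and the anomaly checks described earlier compare the stored source metrics against what actually landed in the destination.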
Depending on the shape and type of data, the business needs, and the policies and regulations around data changes, I have used one or more – and occasionally all four – of the above designs in ETL processes. Where you can, find reproducible patterns that can be used in multiple processes, but don’t fall into the habit of copying and pasting the logic. Nearly every process will require a different level of auditing.
Conclusion
Auditing ETL processes is often considered a luxury, a non-essential that can be added if time permits. However, skipping this critical element of a well-designed ETL architecture will almost always lead to friction or confusion over how and why the data looks like it does. ETL auditing is rarely the most visible element in the architecture, but it is a necessary insurance policy to protect the integrity of the data and the process.
Very interesting. I hadn’t thought of the importance of logging aggregates of the load (count, totals of amounts). Would it be overkill in most situations to run, store, and analyze the results from the Profiler task?
Hi Russ! For some loads, yes that would be overkill. However, for others – I’m thinking of loads where balancing is critical, such as general ledger and financial transaction data – it makes a lot of sense to do aggregate validation and auditing.
Great information. I am curious about how you handle the restartability of child packages when auditing – skipping ones that have completed and restarting ones that failed. Thanks.
Hi Juan – great question about restartability. I’m planning to address that comprehensively in a future post in the ETL Best Practices series. The short answer for here would be that SSIS checkpoints are the built-in way to manage restartability, but checkpoints come with a few gotchas.
Hi Tim, excellent blog posts you do.
Would you consider posting some SSIS examples / snippets of how you would handle some of the more complex auditing scenarios?
Understanding what you write does not necessarily translate to knowing how to go about it technically with, for example, SSIS. So, like you have with your Excel/table examples in the data lineage blog post, it would be nice to have a couple of technical images/screenshots to support this post and the points you are making – again, like with the data lineage post, it does not have to be exhaustive.
Jonas, thanks for the comment. How to handle auditing and data lineage in SSIS is one of the things on my list of topics to blog about.
Hi Tim
Really great article.
Could you post some examples of how you would audit aggregates? I think it is a great idea that I would like to take up; however, I’m not quite sure how I would go about doing it in an SSIS package.