If you were to poll data professionals on which tasks they enjoy working on the most, ETL logging would probably not make the list. However, it is essential to the success of any ETL architecture to establish an appropriate logging strategy. I like to compare a good logging infrastructure to the plumbing of a house: it is not outwardly visible, not terribly exciting, but you’ll certainly know if it is installed improperly (or not at all).
In this ETL Best Practice tip, I’ll share the benefits of an effective logging strategy and cover some of the key variables, including what should be logged and how long those logs should be retained.
ETL Logging
Before I start describing best practices, I’ll be clear on what ETL logging is. I consider ETL logging to be an activity log of relevant events that occur before, during, and after the execution of an extract-transform-load process. This logging usually occurs within the ETL software itself, but might also include other logs (those in job scheduling tools, for example) to supplement the integrated ETL logging. These logs will vary in granularity and scope, with some describing load-level metrics (how long the overall process took) and others specific to one small segment of the load (how many rows were loaded in step 7).
There is no one-size-fits-all approach to logging. Ideally, every process will be evaluated to determine the logging strategy based on criticality of the data, frequency of execution, likelihood of failure, and other factors important in your environment.
What Should Be Logged?
As noted above, logging requires more than just a cookie-cutter approach, because every process can have slightly different logging needs. In the real world, however, my years of consulting have taught me that most organizations take an all-or-nothing approach to all of their ETL processes: either the processes are not logged at all, or every possible metric is captured and stored forever. While there are edge cases where each of these extremes makes sense, the answer almost always lies somewhere in between (though hopefully closer to the “log everything” end).
A minimalistic approach might involve simply logging the start and stop times of the overall process. A trimmed-down approach like this works best when the scope of possible errors is small, and when the process can be rerun without causing harm. I see this used successfully for some stage loads, where the data is loaded into volatile staging tables. Most stage loads involve first deleting old data from the target tables, meaning that they can be run repeatedly without causing inadvertent data duplication. Therefore, a minimalistic logging approach can work well in these types of loads, because the failure points are fairly minimal and the process can easily be rerun if further diagnosis is required.
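To make this concrete, here’s a minimal sketch of process-level logging in T-SQL. The table, column, and process names are illustrative only, not a prescribed standard.

```sql
-- Minimal sketch: one row per execution of the overall process.
-- Table, column, and process names are illustrative only.
CREATE TABLE dbo.EtlProcessLog (
    ProcessLogID int IDENTITY(1,1) PRIMARY KEY,
    ProcessName  varchar(100) NOT NULL,
    StartTime    datetime2(0) NOT NULL,
    EndTime      datetime2(0) NULL
);
GO

-- At the start of the load, capture the new log row's ID for later use.
DECLARE @ProcessLogID int;

INSERT INTO dbo.EtlProcessLog (ProcessName, StartTime)
VALUES ('StageCustomerLoad', SYSDATETIME());

SET @ProcessLogID = SCOPE_IDENTITY();

-- At the end of the load, close out the same row.
UPDATE dbo.EtlProcessLog
SET EndTime = SYSDATETIME()
WHERE ProcessLogID = @ProcessLogID;
```

In an SSIS package, the INSERT and UPDATE would typically sit in separate Execute SQL Tasks, with the new ProcessLogID carried in a package variable between them.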
Most processes require more intensive logging, though. To truly capture what’s going on inside each load process, rather than just measuring the process as a whole, there should be a step-by-step log for every task. If a particular ETL job has, for example, ten steps, an ideal design would log the start and stop of each task, any task-level exceptions, and any necessary auditing information.
The following can be used as a rough guide to the types of events and metrics to capture during logging (a schema sketch follows the list):
- Start and stop events. The beginning and ending timestamps for the ETL process as a whole, as well as for its individual steps, should be stored.
- Status. Process steps can succeed or fail individually, and as such, their status (not started, running, succeeded, or failed) should be logged for each step.
- Errors and other exceptions. Logging failures and anomalies is often the most time-consuming part of building a logging infrastructure. It’s also the part that yields the most value during testing and troubleshooting.
- Audit information. This can vary from simply capturing the number of rows loaded in each process execution, to a full analysis of row count and dollar value from source to destination.
- Testing and debugging information. This is particularly useful during the development and testing phase, most notably for processes that are heavy on the transformation part of ETL.
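Building on the minimal table sketched earlier, here’s what capturing these items might look like at the step level. Again, the names and columns are illustrative; any real environment would tailor them to its own needs.

```sql
-- Expanded sketch: add a status to the process-level table and capture
-- step-level detail. All names are illustrative only.
ALTER TABLE dbo.EtlProcessLog
    ADD Status varchar(20) NOT NULL DEFAULT ('Running');  -- not started / running / succeeded / failed
GO

CREATE TABLE dbo.EtlStepLog (
    StepLogID     int IDENTITY(1,1) PRIMARY KEY,
    ProcessLogID  int NOT NULL REFERENCES dbo.EtlProcessLog (ProcessLogID),
    StepName      varchar(100)   NOT NULL,
    StartTime     datetime2(0)   NOT NULL,       -- start and stop events
    EndTime       datetime2(0)   NULL,
    Status        varchar(20)    NOT NULL DEFAULT ('Running'),
    ErrorNumber   int            NULL,           -- errors and other exceptions
    ErrorMessage  nvarchar(4000) NULL,
    RowsRead      bigint         NULL,           -- audit information
    RowsWritten   bigint         NULL,
    DebugInfo     nvarchar(max)  NULL            -- testing/debugging details
);
```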
When planning a logging strategy, it is tempting to simply log everything. Some ETL tools, including SQL Server Integration Services, make it relatively easy to go into “capture everything” mode. I don’t necessarily want to discourage this; having too much information logged is better than not having enough, all other things being equal. However, there is always a cost associated with logging: processing overhead, network bandwidth, and storage requirements all have to be considered when mapping out how ETL processes will be logged. Logging every metric and event available might be easier, but it doesn’t come for free. Remember: balance what you would like to capture in logging with the costs (hard and soft) of going overboard on your logging strategy.
How Should the Logs Be Consumed?
Let’s assume for a moment that you’ve got your logging infrastructure built and it is working as expected. How do you deliver this information to the right people?
As a consultant I’ve seen this dozens of times: the ETL processes are fully logged but nobody really knows what is logged or where the information is stored. If processes are logged into a black hole, the logging information is just data; it must be accessible and understandable before it can be classified as information.

Most data geeks like me would actually prefer having access to the raw, unfiltered log data so we can write our own queries to get exactly the data we want. Other folks who aren’t as fond of writing SQL would probably do better with a more streamlined delivery of logging information.

Most of us who develop ETL solutions will build logging tools primarily for ourselves and our peers, to assist in troubleshooting and error notification. Increasingly, though, others – often semitechnical or nontechnical folks – also need access to ETL logs. DBAs, helpdesk staff, business analysts, and auditors might all need to see what is in the logs, and each group likely has a very different expectation of how to get to that data. I’ve even worked with hands-on C-level executives who routinely reviewed ETL log data.
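For that first group, raw access may be all that’s needed. Using the hypothetical tables sketched earlier, an ad-hoc troubleshooting query might look something like this:

```sql
-- Ad-hoc troubleshooting: failed steps in the last 24 hours, newest first.
-- Assumes the illustrative dbo.EtlProcessLog / dbo.EtlStepLog tables above.
SELECT p.ProcessName,
       s.StepName,
       s.StartTime,
       s.EndTime,
       s.ErrorNumber,
       s.ErrorMessage
FROM dbo.EtlStepLog AS s
INNER JOIN dbo.EtlProcessLog AS p
    ON p.ProcessLogID = s.ProcessLogID
WHERE s.Status = 'Failed'
  AND s.StartTime >= DATEADD(DAY, -1, SYSDATETIME())
ORDER BY s.StartTime DESC;
```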
When considering how the logging information will be consumed, consider the following:
- Who is the primary audience for these logs?
- What other audiences might need access to this information?
- Will the data be reviewed in an ad-hoc fashion, or does it need to be formalized through push delivery or a dashboard?
- What format delivers the information in the clearest way possible?
- Are there any security concerns that require user-level data filtering?
The answers to these questions will drive the strategy for exposing log data. It may be as simple as pointing fellow DBAs to the location of the log data, or as formal as building a SQL Server Reporting Services report or Power BI dashboard. Regardless of which approach is used, the goal is to get the necessary information to the target audience in the clearest way possible.
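As one example of the more formal route, the dataset behind a Reporting Services report or Power BI dashboard could be as simple as a summary query over the same hypothetical tables:

```sql
-- Daily execution summary per process over the last 30 days, suitable as the
-- source query for a report or dashboard (names are illustrative only).
SELECT p.ProcessName,
       CAST(p.StartTime AS date)                            AS RunDate,
       COUNT(*)                                             AS Executions,
       SUM(CASE WHEN p.Status = 'Failed' THEN 1 ELSE 0 END) AS Failures,
       AVG(DATEDIFF(SECOND, p.StartTime, p.EndTime))        AS AvgDurationSeconds
FROM dbo.EtlProcessLog AS p
WHERE p.StartTime >= DATEADD(DAY, -30, SYSDATETIME())
GROUP BY p.ProcessName, CAST(p.StartTime AS date)
ORDER BY RunDate DESC, p.ProcessName;
```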
Retention Policy
My clients ask this a lot: “How long do we keep ETL logs?” My answer is always this: “Keep them as long as needed, but no longer.” Yes, it’s a stereotypical consultant answer, but in reality, the answer wholly depends on the expectations of how those logs will be used. Here’s a common dialog I share when discussing short- and long-term retention strategy for ETL logs:
Short retention period: “A short retention period ensures that you won’t fill up your drives with a bunch of old log data you’ll never use. Keeping just a few months or a year’s worth of logs provides enough history to troubleshoot any recent logical or performance issues in your ETL. Short retention periods can work well for you if you are mostly concerned about operational support and do not use your logs for long-term performance analysis or auditing purposes.”
Long retention period: “Yes, it costs you more in storage space to keep logs for a longer period of time. But who cares? Storage is relatively inexpensive, and you can archive the old stuff to slower and cheaper drives. By keeping your logs for a longer period, you can analyze ETL workloads and performance over time to see how you are trending year-over-year. And if you are in a heavily audited environment, long-term retention of logs can satisfy most if not all of your ETL auditing requirements.”
As with most of the other concepts I discuss here, there is no single approach to satisfy every situation. In most cases, though, I will steer clients toward a longer-term retention period for ETL logs, unless there’s a clear reason to avoid it. It’s better to have data that you’ll never use than to need data you don’t have. There are degrees of log retention as well; I have successfully built logging structures with a comprehensive short-term detail log strategy and a purge process to remove the detail while retaining summary information for years.
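As a sketch of that last pattern, a scheduled purge job might roll aging detail into a summary table before deleting it. The names and the 90-day window below are illustrative only:

```sql
-- Retention sketch: keep 90 days of step-level detail, roll older rows into
-- a summary table kept indefinitely. Names and window are illustrative.
CREATE TABLE dbo.EtlStepLogSummary (
    ProcessName        varchar(100) NOT NULL,
    StepName           varchar(100) NOT NULL,
    RunMonth           date         NOT NULL,   -- first day of the month
    Executions         int          NOT NULL,
    Failures           int          NOT NULL,
    AvgDurationSeconds int          NULL,
    TotalRowsWritten   bigint       NULL
);
GO

DECLARE @Cutoff datetime2(0) = DATEADD(DAY, -90, SYSDATETIME());

-- Summarize the detail rows that are about to age out...
INSERT INTO dbo.EtlStepLogSummary
    (ProcessName, StepName, RunMonth, Executions, Failures, AvgDurationSeconds, TotalRowsWritten)
SELECT p.ProcessName,
       s.StepName,
       DATEFROMPARTS(YEAR(s.StartTime), MONTH(s.StartTime), 1),
       COUNT(*),
       SUM(CASE WHEN s.Status = 'Failed' THEN 1 ELSE 0 END),
       AVG(DATEDIFF(SECOND, s.StartTime, s.EndTime)),
       SUM(s.RowsWritten)
FROM dbo.EtlStepLog AS s
INNER JOIN dbo.EtlProcessLog AS p
    ON p.ProcessLogID = s.ProcessLogID
WHERE s.StartTime < @Cutoff
GROUP BY p.ProcessName, s.StepName,
         DATEFROMPARTS(YEAR(s.StartTime), MONTH(s.StartTime), 1);

-- ...then purge the detail.
DELETE FROM dbo.EtlStepLog
WHERE StartTime < @Cutoff;
```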
When creating an ETL logging retention strategy, the key questions to ask are these:
- Will we use ETL log data for tactical troubleshooting, or trend analysis and auditing?
- How much log data are we generating each day/month/year?
- What does it cost us to store log data?
- How difficult and costly would it be to archive cold logs to cheaper and slower storage?
- What are the regulatory and/or auditing requirements for ETL logs?
Conclusion
Building a proper ETL logging architecture is one of the most basic components of an enterprise data management strategy. Too often, logging is an afterthought, bolted on after everything else is complete or nearly so. Just as your household plumbing is designed in parallel with the rest of the house, your logging strategy must be a central part of your ETL architecture.
The design of the ETL logging will vary from one organization to another, and might even differ from one project to another within the same company. Considering all of the variables – what level of logging is required, who will be using it, and how long it should be kept – will help shape the design of your ETL logging architecture.
Hi Tim
Good post. To expand on the technical details of logging, particularly when it comes to error capture, I have created a small error logging SSIS framework which you may find interesting. The details are as per the links below:
1. http://bicortex.com/how-to-create-a-simple-etlstored-procedure-error-capturing-database-schema-model-sql-code-and-implementation-part-1/
2. http://bicortex.com/how-to-create-a-simple-etlstored-procedure-error-capturing-database-schema-model-sql-code-and-implementation-part-2/
Cheers,
Martin
Also, it would be interesting to get your opinion on the following:
http://multithreaded.stitchfix.com/blog/2016/03/16/engineers-shouldnt-write-etl/
Cheers,
Martin
Thanks Tim – too many data warehouses skip this critical component at their own peril. It should be considered foundational. Continuing with your plumbing metaphor – if you’re not using good logging, I’d argue that your plumbing is very likely leaching lead into the water.
In addition to custom logging, I also HIGHLY value BiXpress, as it records and gives visibility into real-time and historical insight that you cannot get on your own. It’s not a replacement for custom logging, but it is an incredible enhancement.
BiXpress has saved my bacon too many times to count due to its increased visibility into local variable and parameter values – it is not until you’ve had visibility into these that you realize how blind you’re flying without it. I watch variable values in real time showing me date ranges, counts, etc. as they run. Once you’ve done that, there’s no going back. There are caveats to using it, but it adds SO much value I can’t imagine not having it. The caveat is that there are costs you have to accept. Here’s a list: you have to apply it to your packages, then make sure you uninstall it during debugging/maintenance, then re-apply it when done, or it will cause some nasty annoyances (such as hanging your package). That makes sense, since it installs events that look for component names, and if they’re not there it hangs waiting for them. Another is that if you want to upgrade to a new version, you should regression test your entire system. And yet another is that it “may” be problematic with BIML (but I’m not yet using BIML). It also includes a very good HTML notification component, so you don’t have to custom-code error handling or text/email notification. I would even go so far as to say that using BiXpress should be a best practice for any serious SSIS development.
Tim, fantastic post – this is a great reference for implementing a logging approach.
Gary, I really appreciate the comments on BI xPress (but I’m biased, I run the software arm at Pragmatic Works). On your list of costs associated with it – have you opened a case with Support (support@pragmaticworks.com) on the “hanging” issue?
On the BIML front – it should work fine with that – generate your packages with BIML, then apply the audit framework to them.
Great to hear that it’s been valuable to you – we love hearing that.
Although the SSIS Catalog has matured to the point where building views off it can be effective, I still like The Kimball Group’s strategy from the toolkit books, and I hook it into The Kimball Group’s architecture for data quality when necessary. We should be able to reference our totals, and a breakdown of those totals for each ETL branch within a data flow task, at the snap of a finger – especially these days, when so many projects are under-resourced.