I usually avoid talking about technology in absolutes, but here’s one that I can share without reservation: On a long enough timeline, every single ETL process will eventually fail. Is your ETL design built to handle a failure?
I see far too many SSIS packages, Azure Data Factory pipelines, and other data movement applications built with the assumption of 100% success, with no provision for failure. In this ETL Antipatterns post, I’ll discuss the folly of skipping error handling logic in ETL processes.
ETL Antipattern: no error handling logic
Imagine a scenario in which you are building a data load process for a company to track telemetry on its corporate vehicles. In this scenario, you import each vehicle’s telemetry data into a common table in your reporting database, from which the company’s users can report on where a vehicle was last seen, how fast it was going, and, if stationary, how long it has been idle.
Now let’s assume that, during this load, a network hiccup causes a temporary loss of connectivity in the middle of moving the data to the reporting table. Some of the data was loaded, and some wasn’t. The person responsible for checking the status of this load finds that it failed, runs it again, and it loads successfully.
Because in the previous load some (but not all) of the data was loaded, it’s likely that the destination table now has duplicates: the partial set of data loaded during the failure, and the full set of data from the subsequent successful load. The result is that the data consumers will now see multiple telemetry records for each affected vehicle. At a minimum, this is going to be an annoyance.
Now imagine for a second that the data isn’t vehicle telemetry but bank transactions, or possibly general ledger entries. Customers, investors, and regulators are going to be very unforgiving if your data is wrong due to an ETL error.
Designing ETL for failure
Even the most robust extract-transform-load process will fail at some point. Even if there are no flaws in the ETL code, there are factors beyond the control of that process – network, authentication, and DNS, to name a few – that could break a load. When building ETL load logic, one must think not just about what the successful load should look like, but also about what will happen if any component of that load fails.
In my training classes, I coach people to think about error handling in ETL processes as a core part of the functionality rather than an afterthought. Because of the essential nature of data movement and transformation, the handling of errors must be a primary concern and not just something bolted on at the end of the development cycle.
Having an ETL designed to address failures has two distinct benefits:
- It provides a clearer path to resolution when an error occurs
- It helps prevent leaving the destination endpoint in an inconsistent, partially-loaded state in the event of a failure
Error handling patterns
ETL error handling usually falls into one of these categories:
No error handling (just let the process fail)
This is the most common pattern, as it is the default behavior. Do keep in mind that this is a valid design pattern for some loads – for example, if you are truncating and loading a staging table, there’s usually little harm in just letting the load fail in the event of an error. You can simply rerun the load, since it will truncate and reload the staging table each time.
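As a rough illustration, here is a minimal sketch of a rerunnable truncate-and-load staging step written in Python. The connection strings, the stg.VehicleTelemetry and dbo.VehicleTelemetry table names, and the function name are hypothetical placeholders; the point is that no explicit error handling is needed because any failed run is wiped out by the truncate on the next run.

```python
import pyodbc

def load_staging(source_conn_str: str, dest_conn_str: str) -> None:
    src = pyodbc.connect(source_conn_str)
    dst = pyodbc.connect(dest_conn_str, autocommit=False)
    try:
        dst_cur = dst.cursor()

        # Clear out whatever the previous run left behind, successful or not.
        dst_cur.execute("TRUNCATE TABLE stg.VehicleTelemetry;")

        # Pull the full extract and load it into the staging table.
        rows = src.cursor().execute(
            "SELECT VehicleId, ReadingTime, Latitude, Longitude, Speed "
            "FROM dbo.VehicleTelemetry;"
        ).fetchall()
        dst_cur.executemany(
            "INSERT INTO stg.VehicleTelemetry "
            "(VehicleId, ReadingTime, Latitude, Longitude, Speed) "
            "VALUES (?, ?, ?, ?, ?);",
            rows,
        )
        dst.commit()
    finally:
        src.close()
        dst.close()
```

If this load fails partway through, no cleanup is required: the next execution truncates the staging table and reloads it in full.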
Undo the changes made by the load
With this design pattern, you build the ETL logic in such a way that it will undo any changes made if a failure occurs. This is usually implemented through an explicit transaction (in the case of a relational database endpoint) or a script that will delete or revert the changes made during the failed load.
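A minimal sketch of this pattern for a relational destination, assuming a pyodbc connection and a hypothetical dbo.VehicleTelemetry table: the load runs inside an explicit transaction, so a failure anywhere in the batch leaves the destination exactly as it was before the load started.

```python
import pyodbc

def load_with_rollback(dest_conn_str: str, rows) -> None:
    conn = pyodbc.connect(dest_conn_str, autocommit=False)
    cur = conn.cursor()
    try:
        cur.executemany(
            "INSERT INTO dbo.VehicleTelemetry "
            "(VehicleId, ReadingTime, Latitude, Longitude, Speed) "
            "VALUES (?, ?, ?, ?, ?);",
            rows,
        )
        conn.commit()    # rows become visible only if every insert succeeded
    except Exception:
        conn.rollback()  # a mid-load failure undoes the partial load
        raise            # re-raise so the failure is still surfaced and logged
    finally:
        conn.close()
```

The same idea applies to non-relational destinations, except that the "rollback" is typically a compensating script that deletes or reverts whatever the failed load managed to write.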
Continue the load after an error
In some cases, you may find that there is more value in letting the load run to completion even if it encounters one or more errors. This is most often configured at the row or source level, to allow a single row or source to fail while allowing the remaining rows or steps to complete.
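Here is a minimal sketch of row-level "continue after error" handling in Python. The destination table and function name are hypothetical; each failing row is captured and reported rather than stopping the whole load.

```python
import logging

def load_row_by_row(cur, rows) -> list:
    """Insert rows one at a time, collecting failures instead of aborting."""
    failed = []
    for row in rows:
        try:
            cur.execute(
                "INSERT INTO dbo.VehicleTelemetry "
                "(VehicleId, ReadingTime, Latitude, Longitude, Speed) "
                "VALUES (?, ?, ?, ?, ?);",
                row,
            )
        except Exception as exc:
            # Record the offending row and keep going with the rest of the load.
            logging.warning("Row failed: %r (%s)", row, exc)
            failed.append((row, str(exc)))
    return failed
```

Tools such as SSIS and ADF offer equivalent settings (error row redirection, retry and skip options) so that a single bad row or source doesn’t sink the entire load.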
For any of these design patterns, you should ensure that any errors or anomalies are properly logged.
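One way to do that logging, sketched below under the assumption of a hypothetical audit.EtlLoadError table, is to persist each error to a durable log so the person investigating a failed or partially failed load has something concrete to work from.

```python
from datetime import datetime, timezone

def log_load_errors(cur, load_name: str, failures) -> None:
    """Write each (row, error message) pair to a hypothetical audit table."""
    for row, error_message in failures:
        cur.execute(
            "INSERT INTO audit.EtlLoadError "
            "(LoadName, ErrorTime, RowData, ErrorMessage) "
            "VALUES (?, ?, ?, ?);",
            (load_name, datetime.now(timezone.utc), repr(row), error_message),
        )
```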
Admittedly, building ETL processes with failure in mind is a pessimistic approach. However, since the first job of a data professional is to protect the integrity of the data, one should always approach ETL design with the understanding that every one of these processes will, at some point, fail.