In my continuing series on ETL Antipatterns, I’ll discuss the problem of loading too much data in extract-transform-load processes.
ETL Antipattern: processing too much data
A common design flaw in enterprise ETL processes is that they process too much data. Having access to a great breadth and depth of data opens up lots of options for historical reporting and data analytics, but it is often mistakenly assumed that all of the available data must be pushed through the ETL process.
Although it may sound counterintuitive, there are many cases where purposefully leaving some data out of the ETL process leads to a better outcome. Among the factors to consider when deciding which data should be included:
- Does the data’s utility degrade with its age? Is there a point in time at which a set of data no longer provides business value due to its age?
- Is the data we are keeping at a more precise grain than we will ever need for reporting?
- Are there extra components of the data (individual fields or perhaps entire tables) that we don’t use for any reporting or analytics?
- Are we including the entire history of data in the ETL process when we could be doing a smaller, temporal-based incremental load? (A sketch of this approach follows the list below.)
- Are we continuing to reprocess data that is slowly changing or completely static (such as data for terminated employees or inactive products)?
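To make the incremental-load point concrete, here is a minimal sketch of a watermark-driven load in Python. It uses the standard library’s sqlite3 module as a stand-in for whatever source and target systems you actually have; the orders, stg_orders, and etl_watermark tables and their columns are hypothetical names used purely for illustration, not part of any specific product or framework.

```python
import sqlite3  # stand-in for any DB-API connection; a real load would point at your source/target
from datetime import datetime, timezone

# Hypothetical schema for illustration: a source table "orders" with a last_modified
# column, a staging table "stg_orders", and an "etl_watermark" table that records
# the high-water mark of the previous load.
def incremental_load(conn: sqlite3.Connection) -> int:
    """Pull only rows changed since the previous run instead of the full history."""
    cur = conn.cursor()

    # 1. Read the high-water mark left by the previous load (epoch start if none yet).
    cur.execute("SELECT COALESCE(MAX(loaded_through), '1900-01-01') FROM etl_watermark")
    last_loaded = cur.fetchone()[0]

    # 2. Extract only the incremental slice, not the entire table.
    cur.execute(
        "SELECT order_id, customer_id, amount, last_modified "
        "FROM orders WHERE last_modified > ?",
        (last_loaded,),
    )
    rows = cur.fetchall()

    # 3. Land the slice in staging (transform/merge steps would follow here).
    cur.executemany(
        "INSERT INTO stg_orders (order_id, customer_id, amount, last_modified) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )

    # 4. Advance the watermark so the next run starts where this one stopped.
    new_mark = max((r[3] for r in rows), default=last_loaded)
    cur.execute(
        "INSERT INTO etl_watermark (loaded_through, run_at) VALUES (?, ?)",
        (new_mark, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return len(rows)
```

The same pattern addresses the last bullet as well: a predicate on the extract (a modified-date watermark, an active flag, or a change-tracking mechanism) keeps static rows such as terminated employees or inactive products from being reprocessed on every run.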
Well-designed ETL processes touch as much data as needs to be touched, but no more. As the amount of data processed in ETL rises, a number of risks also surface:
- It slows down your ETL processes. The amount of time required to process data scales with the volume of data being processed.
- Aggressive data loads can cause contention on the sources from which the data is retrieved, particularly if the ETL processes get their data directly from online transactional processing (OLTP) sources.
- Processing large volumes of data requires more system resources. This can slow down other processes running on the same server as the ETL, and may increase your costs for cloud-based resources.
- A larger footprint of data processing means a bigger liability in the event of a data breach or other security exposure.
Data volume and the discovery process
When we talk about how much data we should be processing, it’s important to separate the needs of the initial data discovery phase from the needs of the day-to-day data loads. Any time I start an ETL project, I want to include as much data as is practical in the initial analysis. The more data we touch during the discovery phase of an ETL project, the more likely we are to learn about the structure of that data:
- Incremental data volume
- Data types
- Structural anomalies (flat file inconsistencies, table name variations, etc.)
- Seasonal fluctuations in the volume or values in the data
- Other anomalies such as missing time periods in the data
A large volume of representative data improves the data discovery process by allowing data architects and ETL developers to find those patterns and build the code to prevent unexpected failures or adverse data outcomes. For projects in which the volume of data isn’t prohibitive, I prefer to include as much data as is available for discovery. A short sketch of what that profiling can look like follows below.
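As a rough illustration of that kind of discovery profiling, here is a short Python sketch using pandas. The orders_extract.csv file and its column names are hypothetical; the point is the questions being asked of the data (types, nulls, volume over time, missing periods, unexpected columns), not the specific code.

```python
import pandas as pd

# Hypothetical extract file and column names, used only for illustration.
df = pd.read_csv("orders_extract.csv", parse_dates=["order_date"])

# Inferred data types and per-column null counts
print(df.dtypes)
print(df.isna().sum())

# Incremental data volume and seasonal fluctuations: row counts per month
monthly = df.set_index("order_date").resample("MS").size()
print(monthly)

# Other anomalies: months with no rows at all (missing time periods)
print(monthly[monthly == 0])

# Structural anomalies: columns that differ from what we expected to receive
expected = {"order_id", "customer_id", "amount", "order_date"}
print("unexpected columns:", set(df.columns) - expected)
print("missing columns:", expected - set(df.columns))
```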
Again, there’s a significant difference between what is required of data discovery and what the business needs dictate for ongoing data loads. Aim for a deep data set for discovery and testing, and you can trim that back to what is actually needed for business insights when you start running scheduled data loads.
How much is enough, and when is it too much?
There’s no single yardstick that can be used to determine how much data should be processed in your ETL operations. Data volume, data value over time, hardware configuration, and laws/regulations should all be considered when evaluating what to include in your loads.
If you take away one truism from here, I hope it’s this: Don’t assume that every bit of data that is available must be included in your scheduled data loads. For data that is unlikely to add business value or may introduce unnecessary risk, consider leaving it out of your ETL processing.