In this first post in my series on ETL Antipatterns, I’m going to discuss one of the most common missteps when building an extract-transform-load (ETL) process: jumping straight into writing code as a first step.
ETL Antipattern: start with writing code
Most data architects and developers are intensely curious folks. When we see a set of data, we want to immediately step into a data whisperer role. Where others may see a jumbled mess, we see an opportunity to discover patterns and answers. The best data architects crave those data discovery finds the same way a baseball player craves a bottom-of-the-9th game-winning home run.
That kind of intellectual curiosity is a necessary trait for data architects, but it can lead to a rush straight into writing ETL code. I’ve seen this a lot, and have done it myself (and admittedly still do it on occasion): skipping past the business-value analysis and diving straight into the haystack looking for needles. Getting raw data into a format that can easily be analyzed and validated is a critical part of the ETL development life cycle, but rarely is it the first step.
ETL is a business function
Any project including the movement or transformation of data is ultimately a business function. Yes, it’s made up of highly technical parts. Yes, it requires deep and specialized expertise to build properly. But the outcomes and answers resulting from the data and ETL processes must address the underlying business needs. The ETL is the vehicle, not the destination.
To ensure that the needs of the business are addressed, most ETL initiatives are preceded by several of the following:
- Answering the big question: Why are we doing this?
- A review of the source(s) of the data for a consensus on the value and trustworthiness of each source. Why waste time with garbage data?
- An inventory of the types of insights expected of the data (not the actual outcomes, of course – just the types of questions that are expected to be answered).
- A review of what cannot or should not be included in the ETL process. This should include a review of any legal, regulatory, and data volume concerns.
- A list of the types of data consumers that will ultimately use this data. This will help define the amount of transformation, cleansing, aggregation, and anonymization of the data that will be required of the ETL process.
Should you ever write ETL code first?
As always, there are exceptions to the “Don’t Start with Code” rule around ETL projects. Occasionally, there will be a need to put on one’s detective hat and jump straight into writing code to see what can be found in the data. Data sources of unknown origin or those with no documentation may require some preliminary discovery. This is also true for projects of a forensic nature, in which you have no idea what types of answers you’ll find until you analyze the data. But for the majority of initiatives requiring data movement and transformation, writing ETL code should not be the first step in the process.
Business needs should drive the ETL, not the other way around.
Leave a Reply