“Learning to trust is one of life’s most difficult tasks.” – Isaac Watts
As data professionals, there are times when our jobs are relatively easy. Back up the databases. Create the dashboard report. Move the data from flat files to the database. Create documentation. There are lots of cogs in those machines, but an experienced technologist will have little trouble ticking off those boxes. However, those whom we support – clients, end users, executives, coworkers – generally don’t care whether we’ve worked through our technical to-do list. Those folks want exactly one thing from us: data that they can trust. And building that trust is a very hard thing to do – much more difficult than any technical task in front of us.
We Don’t Trust This Data
In my time in consulting – and even before that, when I was a corporate employee – I have heard this phrase all too many times: “We don’t trust this data.”
A lack of trust in data is a cancer in an organization. Folks are rarely shy about sharing when they don’t trust the data on which their organization bases its critical decisions. Once the seed of distrust has been planted, it rarely goes away on its own. Worse, it tends to spread to other arms of the company, even to people who have no direct reason for distrust.
The fruits of data distrust are plentiful, and rarely positive:
- Extended hesitation to make decisions based on suspicion of the data
- Business folks taking matters into their own hands and creating data silos (hello, Excel Hell!)
- Reports, cubes, and other structures falling into disuse (and eventually, a lack of further development or support)
- Executives and subordinates reverting to manually compiled data for decision-making
It goes without saying that distrust in an organization’s data is very bad. But why does it happen in the first place? And more importantly, what can we technical professionals do to prevent or remedy the situation? The short answer is that there is no short answer. However, to build a plan to reverse (or simply prevent) a pattern of distrust, we must first examine the reasons why trust might have been lost in the first place.
Garbage in, garbage out.
This one is likely the most difficult for data professionals to deal with. When you start with bad data, you rarely end up with perfect data; the best you can do is to end up with a not-as-bad set of data. There are some really great data quality and data cleansing tools on the market, but even the best of breed may not eliminate all of the data issues. I ran into this quite frequently during my healthcare days. The hospital I worked for exchanged data with dozens if not hundreds of vendors, each one with their own standards and practices (many of them involving manual data entry). Needless to say, data quality was a constant challenge, and we had to balance a need for having data as clean as possible with the amount of time it would take to build the logic to cleanse the data. When dealing with sets of data like this, it’s critical to set expectations (more on that shortly).
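To make that trade-off concrete, here’s a minimal sketch (in Python, with field names I’ve invented for illustration) of the kind of lightweight validation pass I’m describing: check each incoming vendor record against a few inexpensive rules, and quarantine the failures rather than silently dropping them. It’s a sketch of the approach, not a drop-in solution for any particular feed.

```python
from datetime import datetime

# Hypothetical field names for an incoming vendor feed; adjust to your own data.
REQUIRED_FIELDS = ["member_id", "service_date", "charge_amount"]

def validate_record(record):
    """Return a list of problems found in one record (an empty list means it is clean)."""
    problems = [f"missing {field}" for field in REQUIRED_FIELDS if not record.get(field)]
    if record.get("service_date"):
        try:
            datetime.strptime(record["service_date"], "%Y-%m-%d")
        except ValueError:
            problems.append("unparseable service_date")
    if record.get("charge_amount"):
        try:
            if float(record["charge_amount"]) < 0:
                problems.append("negative charge_amount")
        except ValueError:
            problems.append("non-numeric charge_amount")
    return problems

def split_clean_and_quarantine(records):
    """Route clean rows onward and quarantine the rest, instead of silently discarding them."""
    clean, quarantined = [], []
    for record in records:
        problems = validate_record(record)
        if problems:
            quarantined.append((record, problems))
        else:
            clean.append(record)
    return clean, quarantined
```

The specific rules matter less than the pattern: the cleansing logic is explicit, cheap to extend, and leaves an audit trail of what was rejected and why.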
It’s always late.
Unlike some of the other topics, this one falls squarely in our laps. As the processors of data – or more specifically, the architects of the processes – it is up to us to ensure that users have the data they need, when they need it. As data sets grow over time, the time required to process the data (from ETL and cleansing to cube processing and report generation) will continue to increase without intervention. It is not enough to simply throw hardware at the problem – we have to be active participants in making those load processes as efficient as possible. Even if the data is correct, it will suffer some level of distrust if it is not delivered in a timely manner.
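One pattern that helps keep load windows from growing in lockstep with the data is incremental, watermark-based extraction: pull only what changed since the last run instead of reloading everything. Below is a minimal sketch of the idea, assuming a hypothetical sales_orders table with a last_modified column; the change-detection mechanism you actually have available (CDC, rowversion, audit columns) will vary.

```python
import sqlite3  # stand-in for whichever database driver you actually use

def incremental_extract(conn, last_watermark):
    """Pull only the rows changed since the previous load, rather than the full table."""
    rows = conn.execute(
        "SELECT order_id, amount, last_modified "
        "FROM sales_orders "
        "WHERE last_modified > ? "
        "ORDER BY last_modified",
        (last_watermark,),
    ).fetchall()
    # Advance the watermark to the newest change we just picked up,
    # so the next run starts where this one left off.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark
```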
It’s not clear where these numbers came from.
Data lineage is critical. In most modern analytical systems, data consumers are rarely looking directly at the original transactional data. Instead, they are looking at a copy (or a copy of a copy of a copy…) that has been massaged to fit into the analytical data model. Along with that transformation comes a need to trace back to where that reshaped data originally came from, for auditing and validation purposes. The absence of data lineage is one of the chief deficiencies I find in data warehouse systems. It takes effort to get this right, and it’s especially hard to “bolt it on” after the system is already live. Data lineage is one of those things that is easy to set aside until later, but this technical debt has a high interest rate.
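Lineage doesn’t have to start life as an elaborate metadata platform. Even stamping each row with where it came from and which load run carried it pays dividends. Here’s a rough sketch of that idea in Python; the underscore-prefixed column names are just ones I made up for illustration.

```python
import hashlib
from datetime import datetime, timezone

def stamp_lineage(rows, source_file, batch_id):
    """Attach basic lineage metadata to every row as it passes through the load."""
    loaded_at = datetime.now(timezone.utc).isoformat()
    for row in rows:
        stamped = dict(row)
        stamped["_source_file"] = source_file   # which file or extract the row came from
        stamped["_batch_id"] = batch_id         # which load run brought it in
        stamped["_loaded_at"] = loaded_at       # when it landed
        # A stable hash of the raw source values makes it possible to trace a
        # warehouse row back to the exact record it was derived from during an audit.
        stamped["_source_hash"] = hashlib.sha256(
            "|".join(f"{key}={row[key]}" for key in sorted(row)).encode("utf-8")
        ).hexdigest()
        yield stamped
```

With columns like these in place, “where did this number come from?” becomes a query instead of an archaeology project.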
It’s inconsistent.
Data inconsistency often appears in organizations that allow self-service reporting. Don’t read this as my saying that self-service data is bad – it’s not. Allowing subject matter experts to access data directly (rather than simply handing them structured reports) will continue to grow as a means of discovering new patterns in data. That said, when a company makes the strategic decision to expose analytical structures directly to users, the risk of inconsistent results multiplies. When anyone with the proper access can connect to reporting tables and create their own reports, it’s entirely possible for two reports to give two different answers to the same question. To overcome this, proper documentation and training are critical. Those with access to underlying tables, views, and cubes must understand the meaning, granularity, and limitations of those structures.
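One practical countermeasure is to give each shared metric exactly one certified definition – in a semantic layer, a shared view, or code – with its grain and exclusions documented right next to the logic. A tiny, hypothetical example of what I mean (the field names and business rule are invented):

```python
def net_sales(order_lines):
    """Net sales = gross amount minus returns and discounts.

    Grain: one input row per order line. Cancelled orders are excluded.
    Every report should use this one definition rather than re-deriving it.
    """
    return sum(
        line["gross_amount"] - line["returns"] - line["discounts"]
        for line in order_lines
        if line["status"] != "cancelled"
    )
```

When two reports disagree, the argument then becomes about the inputs rather than about whose formula is right.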
The goalposts are constantly moving.
Although this is not entirely the fault of the technical side of the house, the problem can be magnified if there are no controls over what can be changed. Let’s say you’ve got a structured report that shows P&L by department. One of your department heads complains that her department’s data is being unfairly skewed because of the format of the report. Too often, if that department head makes enough noise, the report will be updated just to satisfy that request. The problem is that now the resulting report indicates something different, not just for that department but for all departments. This is not really a technical problem, but more of a political one. It takes a steady demeanor to know when to push back against unreasonable requests for change.
Nobody owns it.
Some organizations treat data as if it were a fake plastic tree in a dentist’s office – just stick it in the corner and it’ll be good for years. It’s not like that at all. Data, and the processes that support it, is more like a fickle house plant. It requires constant attention: proper sunlight, daily watering, and occasional pruning. If nobody is paying attention to the data or the plumbing that drives it, it’s going to be as useful as an unwatered fern. Each set of data must have a clear owner, both on the technical side as well as in the business unit.
Setting expectations
One of the overused phrases I’m trying to banish from my vocabulary is, “It is what it is.” However, that phrase seems applicable here. Often, when dealing with data from outside vendors, closed software systems, or other sources over which we have limited control, there are constraints on the data that can’t easily be overcome. If you want daily sales information but your vendor refuses to provide anything more granular than a weekly summary, you’ll have to find a way to deal with what you have.
When those limitations arise, be clear – both in your communications and your documentation – about the shortcomings of that set of data. And be clear about the boundaries, too. When communicating with business SMEs or executives about the deficiencies of one particular set of data, emphasize that the limitation doesn’t necessarily affect the rest of the information available to them. Set expectations early and often to avoid distrust issues later.
It’s just wrong.
This one is the big one, and I purposefully saved it for last. Sometimes the data you receive is simply wrong (see “Garbage in, garbage out” above), in which case you’ll want to fully document and explain that limitation.
All too often, though, the mechanisms that process the data can muck it up, turning good data into suspect data. The possible causes are numerous: an incorrect source-to-target mapping, an unhandled exception in the data, incorrect or inconsistent business rules, or data simply being lost during ETL processing (yes, it can happen). This is the most critical piece to get right, because those who depend on the data rarely have insight into the internal plumbing that magically transforms flat files into analytical dashboards. It can be very easy for data consumers to mistrust this process – and by extension, the data that comes out of it – simply because it’s a black box from their perspective.
When the data is deemed to be wrong due to ETL or other processing, it’s essential to get out in front of the problem. Communicate what you found and how it was fixed (you’ll have to tailor this message to the technical aptitude of the audience), and demonstrate that the resulting data truly is correct after the process change. Follow up to ensure that the issue does not recur, and communicate that you are doing so.
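Part of that follow-up can be automated. A simple reconciliation step – comparing row counts and control totals between source and target after every load, and refusing to publish when they don’t balance – catches a lot of the silent breakage described above. Here’s a minimal sketch, assuming in-memory row sets and a hypothetical amount field:

```python
def reconcile(source_rows, target_rows, amount_field="amount"):
    """Compare row counts and a control total between source and target after a load."""
    checks = {
        "row_count": (len(source_rows), len(target_rows)),
        "amount_total": (
            round(sum(r[amount_field] for r in source_rows), 2),
            round(sum(r[amount_field] for r in target_rows), 2),
        ),
    }
    # Anything that does not balance is a reason to stop and investigate,
    # not to publish and hope nobody notices.
    return {name: pair for name, pair in checks.items() if pair[0] != pair[1]}

# Example usage: fail the load loudly rather than release suspect numbers.
# failures = reconcile(extracted_rows, loaded_rows)
# if failures:
#     raise RuntimeError(f"Reconciliation failed: {failures}")
```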
The Fickleness of Trust
As noted in the quote at the beginning of this post, granting trust is difficult. It’s even more difficult to regain it after it has been violated. As the curators and protectors of data, those of us tasked with delivering tactical and analytical data must preserve – and occasionally, rebuild – trust in the data we provide. A lack of trust is a tripwire for any organization, and we data professionals must do everything we can to maintain data fidelity for our data consumers.
One of the most ironic experiences in my 42-year career was when I determined that a group of ten selectable online reports was producing invalid sales statistics for dealers. All ten shared the same SQL logic that gave wrong numbers, making the whole series invalid. I submitted the corrected, tested, and validated code to the ‘project managers’, who declined to implement the fixes due to the ‘risk of making changes’. It’s been about ten years, and remaining co-workers tell me the bugs were never fixed. I guess they do in fact trust the data, in spite of having the invalidity proved.
Rick, that doesn’t surprise me. It sounds like “this is the way we’ve always done it” syndrome. That could be a whole other blog post in itself.
Another hurdle might be the exclusion of data, not due to the ETL process but at the request of the client. Managers do not want to see all of the values; they want to leave out certain categories to make the report look ‘better’. The rules that determine what is included get more complex over the years. Furthermore, these rules are often applied only to custom reports, not to standard reports coming straight from the application. Hence a double truth exists within the organization, making it very hard to validate reports built from this mangled data. Another source of small but very hard-to-track differences is the steps that filter out invalid data, because this data would otherwise give rise to ‘strange’ entries in the reports. Finally, m-to-n relations can be a great source of confusion – I’m sure you’ve dealt with this yourself and know what I mean – not to mention imperfect and changing attribute hierarchies. Giving direct access to data containing these kinds of relationships is asking for trouble, because the technical skills to interpret and use that data are often beyond those of the department-level data analyst.
Dony hit the nail right on the head.
The biggest issue I have is the “Well, I don’t really need to see this client/category/technician on this report…” request. This leads to inconsistencies across reports, and often to arbitrary {WHERE [THING] NOT LIKE ‘PARTICULAR THING’} clauses tacked onto datasets in the BI tools.
It’s absolutely impossible to keep track of all the one-offs.
You’re absolutely right, and it’s a fine line to walk between giving folks what they want and having reports that always agree with each other. Add in self-service reporting and analytics, and the waters really get murky.
I’m always surprised how rarely data is compared to other sources of the same data.
Nice summary of the problems surrounding data and trust. Although you touch on it briefly, I think that bad technical design is also a common culprit. I also think there are vested interests in bad data. One of the reasons that DW implementations fail to be adopted is that they don’t match the existing numbers that hundreds of analysts maintain in spreadsheets of ambiguous data lineage. Unfortunately, I think the solutions are not quite as easy as defining the issues.
Awesome article, Tim. Thanks for taking the time to write it.
All of what you said gets much worse when people have an unreasonable schedule imposed on them due to poor planning. As I’ve been known to say, “If you want it real bad, that’s the way you’ll likely get it.”
Jeff, you’re absolutely right that these issues are often schedule-related. The phrase I keep coming back to is, “Do you want it right, or do you want it right now?”
There is an opposite side to that coin, where the client’s trust is absolute even when those of us providing the data know it’s built from garbage. Despite warnings and caveats, they still treat it like gospel.
We have started to think in terms of a traffic-light system of confidence, since that is something people are more likely to understand, but there is still the question of how to categorise the data even on that simple scale.