The answer: It Depends.
One of the more common problems I encounter when managing data quality, especially in an ETL process, is the proper handling of null, empty string, or zero values. When I put on my preaching shoes to talk about bad data, this is one the areas I have to spend a lot of time covering because it is so often overlooked as a stumbling block for data quality. Improperly interpreting what is represented by these values can lead to inconsistent results, dirty data, and incorrect answers to business questions.
Null, empty string, or zero?
When managing text data, more often than not an empty string is treated the same way as a null value. For example, if a record in a Person table has a valid, non-blank value in the FirstName and LastName field but a non-null, empty string value in the MiddleName field, it’s usually assumed that the person has no middle name, or that the middle name is unknown. However, what if the data in question is the database for the official birth registrar for the municipality? A record with an empty string for the middle name could show that the person’s official birth record indicates that they legally do not have a middle name, while a null value in that field reflects an unknown value for the middle name (possibly to accommodate old records transcribed from incomplete handwritten birth certificates). In this case, an empty string value for MiddleName represents a completely different situation than a null in the same field.
It’s even trickier with numeric data. Zeroes and nulls are often treated as one and the same, but in many business cases, a null value is intended to be treated differently than a zero. Let’s say I’m working for a sales organization, and I want to calculate some metrics on some of our sales leads. If I’m arbitrarily interpreting a null value as a zero in the HouseholdIncome field, I’m going to get some incorrect results when calculating, for example, the average household income of our potential leads (since zeros are factored into averages, while nulls typically are not). In this case, a zero value indicates that the reported household income for that lead is zero, while a null value could indicate that the person refused to provide any answer for that field.
The ETL Factor
Complicating both of these is the exchange of data in ETL. Assuming the data is correct in the source system, it is possible – in fact, I would say it is likely – that some part of the ETL process can misinterpret what nulls, empty strings, and zeroes are supposed to represent. Part of this is a lack of standard nomenclature. How does one represent a null in a character field? If an empty string is used do indicate a null, then how does one indicate an empty string when that value should be separate and distinct from a null? On occasion, I’ve seen the literal word NULL used in text data to represent a null value. Find yourself a phone book (do they still print those?) of sufficient size, and you’ll find that there are individuals with the last name of Null. It can’t always be assumed that the string literal Null can be interpreted as a null value.
There’s no silver bullet for the null/empty string/zero quandary. But there are ways to mitigate the possibility of a misinterpretation causing pains in your database or ETL process:
- Know your data. I don’t mean that you just need to know its technical structure – you need to really know what it represents to your employer or client.
- Know how your data is being used. Talk to your users and find out how they create, curate, and report on the data. Ask them about any known outliers in the data. This will reveal a great deal about what the data means (see the previous bullet).
- Pay special attention to ETL, reporting, and analytics. These areas are hotspots where a misinterpretation of a potentially null value can be especially problematic.
- Have a solid, documented data mapping strategy. Given that many of the problems in this realm are rooted in moving data, having a well-documented and visible source-to-target mapping can help to clarify how these special values should be interpreted during data movement and transformation.
- Know where the quicksand is. Sometimes, an empty string and a null are synonymous by design. Solve the problems that need to be solved, and don’t get stuck working on those that don’t.
The null/empty string/zero problem isn’t going away anytime soon, but simply being aware that it’s out there can help avoid misinterpretations leading to bad data.
Good article. While you’ve done a good job of summarizing the issue, I’m afraid that it really goes much deeper. The NULL issue has created almost a religious debate between proponents and opponents with each giving a rather polarized viewpoint instead of one that is more pragmatic. At least your brief article indicates the pros and cons of using NULL covering both sides of the divide. It would be interesting to see a deeper dive along the same vein.
Aaron, you’re absolutely right – this is quite a deep rabbit hole. This post could easily develop into a series on the topic.
Thank you very much for this article, I’ve always wondered about this. At my old job I always created my dbs with NOT NULL fields and added an identity value = 0 (using identity_insert) to hold values that were not known. I was new so I didn’t fully understand the ramifications of that, but it always worked for me. At my new job I see NULLs all over the place and I see what it takes to contend with that, which is what I was trying to avoid at my old job. But in both cases we’re not really doing reporting that would be affected by it either way so its good to see there are even more ramifications to know about.
Good to know, good to know. Thanks.
Thanks Kyle for the kind words. It’s a stealthy problem, certainly, and is a very common issue when it comes to data quality and accuracy in reporting/analytics.
Fpdf while genrating PDF from mysql, I can’t avoid emplty strings which annoyingly take up space innmy PDF output. J am looking for professional help