T-SQL Tuesday #179 Roundup: The Data Detective Toolkit

Earlier this month, I hosted the monthly T-SQL Tuesday invitation in which I asked, “What’s in your data detective toolkit?” We got some great responses, which I’ll recap here, and I’ll share a few thoughts of my own at the end.

First up, Rob Farley shared his approach, which focuses on the bigger picture of how the business uses the data. Rather than starting with specific tools for data analysis, he suggests reviewing the abstractions built on top of the data – artifacts such as report definitions, views, stored procedures, and other structures – to get the lay of the land for an unknown data set. I like this approach, as it focuses on how the data is being used rather than trying to infer the intent and purpose of each data structure. On the tactical side, Rob also pointed out a tool named SQLFrontline, which I wasn’t familiar with but plan to review soon.
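If you’re working in SQL Server, one quick way to start that kind of review is to search the definitions of views, procedures, and functions for references to a table of interest. This is my own rough sketch rather than anything from Rob’s post, and the search term Customer is just a placeholder:

-- My own rough sketch: find views, procedures, and functions whose definitions
-- reference a given table. 'Customer' is a placeholder search term.
SELECT
    OBJECT_SCHEMA_NAME(m.object_id) + N'.' + OBJECT_NAME(m.object_id) AS module_name,
    o.type_desc
FROM sys.sql_modules AS m
JOIN sys.objects AS o
    ON o.object_id = m.object_id
WHERE m.definition LIKE N'%Customer%'
ORDER BY module_name;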

Next, we heard from Andy Yun about his approach to data detective work, which is more people-centric than data-centric. After all, it’s the consumers of the data who will be the most knowledgeable about its value and use cases. Rather than immediately digging into the data, Andy suggests that asking intelligent questions of the data users will reveal much, particularly with a skilled data interviewer who understands both the technical and business concerns.

Lastly, we got some excellent data detective suggestions from Deborah Melkin. Deborah suggested a couple of very useful tools – Query Store and Extended Events – that are built into SQL Server and can be used to analyze historical uses of data. Other tools she mentioned included the free Plan Explorer, along with the open-source tools sp_WhoIsActive and sp_HumanEvents, for analyzing interactions with the database server. She also wisely suggested human approaches, including curiosity and a willingness to set aside assumptions, as useful attributes in solving data mysteries.
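To make the Query Store angle a bit more concrete (my own illustrative sketch, not something from Deborah’s post): once Query Store is enabled on a database, its catalog views can tell you which queries have touched a particular table and how recently they ran, which is a quick way to see how the data is actually being used. The table name Orders here is a placeholder:

-- My own illustrative sketch: find Query Store queries whose text references
-- a given table, with execution counts and recency.
-- Assumes Query Store is enabled on the current database; 'Orders' is a placeholder.
SELECT
    qt.query_sql_text,
    q.query_id,
    rs.count_executions,
    rs.last_execution_time
FROM sys.query_store_query_text AS qt
JOIN sys.query_store_query AS q
    ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan AS p
    ON p.query_id = q.query_id
JOIN sys.query_store_runtime_stats AS rs
    ON rs.plan_id = p.plan_id
WHERE qt.query_sql_text LIKE N'%Orders%'
ORDER BY rs.last_execution_time DESC;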

For my part, I like a two-front approach – people and technology – as suggested by my friends above. Making sense of a poorly documented set of data requires both technical tools and some human intuition, along with a great deal of patience and persistence.

On the technical side, I’ve built several SQL procedures to allow for structured or ad-hoc analysis of both data and metadata. If the data is in a relational database, I’ll interrogate the metadata to identify formal relationships in the data, uniqueness constraints, and other revealing facts. If the metadata doesn’t fully tell the story, I’ll ask the underlying data for clues. What are the largest tables? Can I glean anything from the insert or update dates? Are there tables that share the same column names, and if so, does that imply a relationship?
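Here’s a minimal sketch of the kind of catalog queries I’m describing – not the procedures themselves, just the general idea – against SQL Server’s metadata views (the table-size query reads sys.dm_db_partition_stats, so it requires VIEW DATABASE STATE permission):

-- Declared foreign key relationships: who references whom?
SELECT
    fk.name AS foreign_key_name,
    OBJECT_SCHEMA_NAME(fk.parent_object_id) + N'.' + OBJECT_NAME(fk.parent_object_id) AS referencing_table,
    OBJECT_SCHEMA_NAME(fk.referenced_object_id) + N'.' + OBJECT_NAME(fk.referenced_object_id) AS referenced_table
FROM sys.foreign_keys AS fk;

-- Approximate row counts to find the largest tables.
SELECT
    OBJECT_SCHEMA_NAME(ps.object_id) + N'.' + OBJECT_NAME(ps.object_id) AS table_name,
    SUM(ps.row_count) AS approx_row_count
FROM sys.dm_db_partition_stats AS ps
WHERE ps.index_id IN (0, 1)   -- heap or clustered index only
GROUP BY ps.object_id
ORDER BY approx_row_count DESC;

-- Column names that appear in more than one table: possible undeclared relationships.
SELECT
    c.name AS column_name,
    COUNT(DISTINCT c.object_id) AS table_count
FROM sys.columns AS c
JOIN sys.tables AS t
    ON t.object_id = c.object_id
GROUP BY c.name
HAVING COUNT(DISTINCT c.object_id) > 1
ORDER BY table_count DESC;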

As a data architect, I’m often challenged to analyze unstructured or semi-structured data, including flat files, JSON, or data retrieved from APIs. This kind of data can be some of the hardest to make sense of, so I use a set of scripts (mostly C#, though I’m working on migrating them to Python) to parse and analyze it in a structured way. I’ll often default back to my relational database ways and flatten the data just enough to squeeze it into a SQL Server or Postgres table where I can work SQL magic on it, but some of the gnarly data sets I’ve worked with require so much prep work that I just script it out and do the analysis in C# or Python.
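When I do take the flatten-it-into-a-table route with JSON, something like SQL Server’s OPENJSON usually gets the data far enough into rows and columns to query. A rough sketch, with a made-up staging table (dbo.RawOrders, holding the raw documents in a payload column) standing in for the real thing:

-- Flatten a JSON payload into rows and columns with OPENJSON.
-- dbo.RawOrders and its payload column are hypothetical; adjust the WITH clause
-- to match the actual shape of the documents.
SELECT
    j.order_id,
    j.customer_name,
    j.order_date
FROM dbo.RawOrders AS r
CROSS APPLY OPENJSON(r.payload)
    WITH (
        order_id      INT            '$.orderId',
        customer_name NVARCHAR(100)  '$.customer.name',
        order_date    DATETIME2      '$.orderDate'
    ) AS j;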

Technical tools aside, there’s no substitute for a comprehensive understanding of how the data is being used. If I’m getting data from elsewhere (a vendor, a new client, etc.), I’ll visit with the person who has the most knowledge of how that data is used. Even if that person is nontechnical, they can often shed light on what the data actually means. If the origin of the data set is a mystery (and we’ve all inherited data sets like this), I’ll try to find out what it should ultimately be used for, and I’ll tailor my analysis to meet the needs of the consumer rather than the source.

And finally, as Deborah suggested, a curious and open-minded approach can be the best way to solve data mysteries. Set aside any assumptions, keep asking questions and testing, and never stop working the problem.
