Get to “Know Your Data” fundamentally

28 Jul

Data has been termed the new oil, diamond, or even oxygen. Let’s go with Gollum and call data “precious”.

Whatever the term, it appears important to “Know Your Data” (KYD). The big promise is that data lead to actionable insight that creates knowledge, value and better lives. We want a story with insight, a story about some entities of interest and their relationships, hopefully exciting and surprising.

As a fundamental requirement of any data story, we must first be able to uniquely identify the entities of interest in the data, before attempting to reveal any relationships between them. Such entities could be people, places, and - very specifically - things ;oP

Recalling data management principles, the number one tool to uniquely identify entities in any kind of data set (relational, nested, tree, graph or whatever model of data) is … drum roll, please ... the key.

Keys are combinations of fields (which may be column names in a table, features in your feature store, or properties in your graph). Given some specific context, a person may naturally be identified by their email address, their social security number, national health identifier, or by combinations of some fields such as name, date of birth, address, phone number, etc.
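To make the idea concrete, here is a minimal sketch of checking whether a combination of fields behaves as a key on a given set of records. The function name `is_key` and the sample records are made up for illustration, and the check ignores missing values. Note that a combination that is unique on a sample is only a candidate: whether it is a key in the real world is a question about the domain, not just about the data at hand.

```python
from typing import Iterable, Mapping, Sequence


def is_key(rows: Iterable[Mapping[str, object]], fields: Sequence[str]) -> bool:
    """Return True if no two rows agree on all of the given fields."""
    seen = set()
    for row in rows:
        value = tuple(row[f] for f in fields)
        if value in seen:
            return False  # two records share this combination of values
        seen.add(value)
    return True


# Hypothetical sample records, for illustration only.
people = [
    {"name": "Ana Silva", "dob": "1990-04-02", "email": "ana@example.com"},
    {"name": "Ana Silva", "dob": "1985-11-17", "email": "a.silva@example.com"},
    {"name": "Ben Okoro", "dob": "1990-04-02", "email": "ben@example.com"},
]

print(is_key(people, ["name"]))         # False: two records share a name
print(is_key(people, ["name", "dob"]))  # True on this sample
print(is_key(people, ["email"]))        # True on this sample
```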

Internally - inside your data repository - you may be using some artificial identifier (surrogate). Such surrogates do not carry any real-world meaning, and may be generated as auto-increments by your database management system or at the application level. Surrogates solve some problems, but they create overheads for updates and queries, and are of limited use for joining data sources. Surrogates do not address entity integrity. They can be used internally, after entity integrity has been safely secured.

Excessive use of surrogates transforms query answers into a maze of surrogates that need to be linked to real-world entities to make the answer meaningful for the consumer of data. The well-known expression “You can’t see the forest for the trees” becomes “You can’t see the insight for the surrogates.”

Worse yet, the same real-world entity may be represented by multiple surrogates. If you fail to combine the query results for the individual surrogates, you have only gained partial insight and are likely to report incomplete or even inaccurate results. In this case, you have come full circle and still need a natural key to figure out which surrogates represent your entity of interest.
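As a small illustration of this problem, the following sketch groups records by a natural key to find entities that hide behind more than one surrogate. The field names (`id`, `email`, `total_spend`) and the data are assumptions made for the example, and it takes for granted that `email` really is a natural key in this context.

```python
from collections import defaultdict


def surrogates_per_entity(rows, natural_key, surrogate="id"):
    """Map each natural-key value to its surrogates, keeping only duplicates."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[f] for f in natural_key)].append(row[surrogate])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical customer records: surrogates 101 and 245 are the same person.
customers = [
    {"id": 101, "email": "ana@example.com", "total_spend": 40},
    {"id": 102, "email": "ben@example.com", "total_spend": 15},
    {"id": 245, "email": "ana@example.com", "total_spend": 60},
]

duplicates = surrogates_per_entity(customers, ["email"])
print(duplicates)  # {('ana@example.com',): [101, 245]}

# Reporting per surrogate understates this customer's spend (40 and 60);
# combining the records per natural key gives the complete figure.
spend_by_email = defaultdict(int)
for row in customers:
    spend_by_email[row["email"]] += row["total_spend"]
print(spend_by_email["ana@example.com"])  # 100
```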

You may only rely on surrogates within the scope in which they are applied. If you want to link external data sources to existing ones, you cannot rely on surrogates. External data sources will not use the same surrogates, so you need some natural key that can link entities across data sets.
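The sketch below shows what such linking might look like for two sources that each use their own surrogates. The source names, field names, and the helper `link_on` are hypothetical, and the sketch assumes that the natural key is unique in the right-hand source.

```python
# Two hypothetical sources with unrelated surrogates (crm_id vs. account_no).
crm = [
    {"crm_id": 1, "email": "ana@example.com", "name": "Ana Silva"},
    {"crm_id": 2, "email": "ben@example.com", "name": "Ben Okoro"},
]
billing = [
    {"account_no": 9001, "email": "ben@example.com", "balance": 120.0},
    {"account_no": 9002, "email": "ana@example.com", "balance": 0.0},
]


def link_on(left, right, key):
    """Join two record lists on a shared natural key (unique in `right`)."""
    index = {tuple(r[f] for f in key): r for r in right}
    linked = []
    for record in left:
        match = index.get(tuple(record[f] for f in key))
        if match is not None:
            linked.append({**record, **match})
    return linked


linked = link_on(crm, billing, ["email"])
for record in linked:
    print(record["name"], record["crm_id"], record["account_no"])
```

Matching `crm_id` against `account_no` would be meaningless here; only the shared natural key connects the two views of the same person.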

Correctly specifying business keys is paramount to a basic level of data quality. Specifying a key that is meaningless will prevent valuable data from entering your database, resulting in potential incompleteness, loss of information, outdated data, inaccurate reporting and less value. Not specifying a business key means you are actively permitting duplicates in your data, resulting in potential inconsistency, outdated data, and inaccurate reporting. Both precision and recall therefore matter when specifying business keys: the keys you specify should be meaningful, and the meaningful keys should all be specified. An investment in high quality data pays the best return, and the investment ought to start with entity integrity.
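Both failure modes can be reproduced in a few lines. The following sketch uses Python's built-in `sqlite3` module with three hypothetical tables: one with only a surrogate, one with a sensible business key on `email`, and one with a key wrongly declared on `name`. The table names, columns, and data are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Surrogate only: nothing stops the same person from being stored twice.
conn.execute("CREATE TABLE no_key (id INTEGER PRIMARY KEY, "
             "email TEXT NOT NULL, name TEXT NOT NULL)")
# Business key on email, in addition to the surrogate.
conn.execute("CREATE TABLE email_key (id INTEGER PRIMARY KEY, "
             "email TEXT NOT NULL UNIQUE, name TEXT NOT NULL)")
# Wrongly specified key: names are not unique in the real world.
conn.execute("CREATE TABLE name_key (id INTEGER PRIMARY KEY, "
             "email TEXT NOT NULL, name TEXT NOT NULL UNIQUE)")


def insert_all(table, rows):
    """Insert rows one by one; return how many the table accepted."""
    accepted = 0
    for row in rows:
        try:
            conn.execute(f"INSERT INTO {table} (email, name) VALUES (?, ?)", row)
            accepted += 1
        except sqlite3.IntegrityError:
            pass  # the declared key rejected this row
    return accepted


same_person_twice = [("ana@example.com", "Ana Silva"),
                     ("ana@example.com", "Ana Silva")]
two_people_same_name = [("ana@example.com", "Ana Silva"),
                        ("a.silva@example.com", "Ana Silva")]

duplicates_stored = insert_all("no_key", same_person_twice)
duplicates_blocked = insert_all("email_key", same_person_twice)
valid_rows_kept = insert_all("name_key", two_people_same_name)

print(duplicates_stored)   # 2: the duplicate slipped in
print(duplicates_blocked)  # 1: the business key rejected the duplicate
print(valid_rows_kept)     # 1: a second, distinct person was turned away
```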

Indeed, if understood and used correctly, keys have the following benefits:

1) Better data access, by providing

- logical and physical pathways to critical data elements
- points of reference for foreign keys and joins with other tables

2) Improved data quality, by

- enforcing entity integrity
- finding all surrogates used for the same entity
- imputing missing data values by combining records for the same entity

3) Higher performance and trust, through

- faster physical access using UNIQUE indexes
- more transparency and trust, with business keys used for updates and queries
- actual insight from accurate and meaningful reports

Discovering which keys hold on your data sources is a computationally challenging task, because the number of candidate field combinations grows exponentially with the number of fields. It is therefore not surprising that business intelligence tools typically lack capabilities for giving data-driven recommendations on which keys to specify.
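To give a feel for why this is hard, here is a naive sketch that enumerates field combinations by increasing size and reports the minimal ones that are unique on the data. It is a brute-force illustration only, not a description of how Key Finder or any other tool works; the function names and sample data are made up. With n fields there are 2^n - 1 non-empty combinations to consider, so this approach stops being practical quickly.

```python
from itertools import combinations


def is_unique(rows, fields):
    """True if no two rows share the same values on the given fields."""
    values = [tuple(row[f] for f in fields) for row in rows]
    return len(set(values)) == len(values)


def minimal_keys(rows, columns):
    """Brute-force search for minimal unique field combinations."""
    found = []
    for size in range(1, len(columns) + 1):
        for candidate in combinations(columns, size):
            # Skip supersets of keys already found: they are not minimal.
            if any(set(key) <= set(candidate) for key in found):
                continue
            if is_unique(rows, candidate):
                found.append(candidate)
    return found


# Hypothetical sample data, for illustration only.
rows = [
    {"name": "Ana Silva", "dob": "1990-04-02", "city": "Lisbon", "phone": "555-0101"},
    {"name": "Ana Silva", "dob": "1985-11-17", "city": "Porto", "phone": "555-0102"},
    {"name": "Ben Okoro", "dob": "1990-04-02", "city": "Lisbon", "phone": "555-0103"},
    {"name": "Ben Okoro", "dob": "1985-11-17", "city": "Porto", "phone": "555-0104"},
]

keys = minimal_keys(rows, ["name", "dob", "city", "phone"])
print(keys)  # [('phone',), ('name', 'dob'), ('name', 'city')]
```

Note that `('name', 'city')` is unique on this tiny sample purely by accident. Telling such accidental keys apart from meaningful ones is part of what makes key discovery more than a counting exercise.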

If you would like to discover what keys (should) hold on your data, contact us and give Key Finder a trial.

Sebastian Link
