Why data quality problems cost your organization much more than you think


We are living in a truly unique and remarkable time. Organizations all around the world are collecting exponentially increasing amounts of data. We have solved many of the technical problems needed to make that happen – faster networks, cheaper and more performant hardware, and so on. Yet even though we are collecting unprecedented amounts of data, industry experts agree on one thing: we have yet to figure out how to fully utilize its potential, in large part because of persistent data quality problems.

All but the very smallest organizations today experience significant financial losses from data quality problems. Yet because the costs come in multiple forms, it is often poorly understood just how large they can be once you add them all up. The costs vary considerably between organizations and no precise answers exist, but the aim of this blog post is to give you an overview of the direct and indirect costs that may be hurting your business's bottom line.

Without further ado, let's dive right in. We'll start with the widely agreed-upon estimates of the direct costs of data quality problems and then shift our attention to the more subtle indirect costs.

The first direct cost to consider is lost revenue. Estimates differ, but most reputable industry researchers put it somewhere between 15% and 25% of total revenue. The figure is so high for several reasons: bad business decisions, poor matching of supply and demand, fewer sales closed, and weak customer acquisition and retention. It's easy to see why all of these happen time and time again when trustworthy data isn't available when it's needed.

The second direct cost is the increase in operating expenses caused by bad data. The consensus estimate across multiple studies is that 20-30% of operating costs may be directly related to bad data. Knowledge workers waste up to 50% of their time dealing with mundane data quality issues, and for data scientists this figure may be as high as 80%. Collecting and storing data that is never used because it has too many quality issues to be considered trustworthy can also be quite costly. While the cost of collecting and storing a megabyte of data has fallen consistently over the last decade, that trend is slowing, and the volume of data organizations collect and store is growing far faster than per-megabyte costs are falling.

The reality is that well over 50% of the data assets at an average organization do not meet its own data quality standards, and most data is never used for analysis at all – it simply drowns in the organization's data lake after it is collected. This happens because today's software generally does not provide enough automation for a limited number of data professionals to cope with the growing volume of data. Because data preparation still requires so much manual effort, data professionals find themselves perpetually overwhelmed and unable to make all the incoming data assets usable in a timely manner. If the degree of automation at this step of the data lifecycle does not increase quickly and significantly, the problem will only get worse.

Before we continue to the indirect costs, which are often not well understood, let's pause to appreciate how much a typical enterprise is probably losing from direct costs alone. Take Microsoft's 2020 figures as an example: revenue of $143B and operating expenses of roughly $100B. Let's also assume that, as a tech giant, Microsoft handles its data significantly better than the average organization. Even if only half of the global lower-bound figures applied – 7.5% of revenue plus 10% of operating expenses – Microsoft would still have lost around $20B to bad data in 2020, excluding indirect costs. Shocking, isn't it? Somehow, talking about percentages of revenue and operating expenses doesn't sound as bad as the actual dollar figures. Yet even though addressing data quality problems as effectively as possible should be a top priority for most companies, plenty of organizations still do not treat this as a board-level issue, naively believing that it has little impact on the bottom line and is something only data geeks should care about.
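For readers who like to check the math, here is a minimal back-of-envelope sketch in Python. The revenue and operating-expense figures come from the example above; the percentage rates are the lower bounds of the industry-wide estimates quoted earlier, and the 50% discount is purely an illustrative assumption about Microsoft handling data better than average, not a measured number.

```python
# Back-of-envelope estimate of direct data quality costs.
# Figures and rates are the ones quoted in the post; the 50% discount
# is an illustrative assumption, not a Microsoft-specific statistic.

revenue = 143e9             # Microsoft FY2020 revenue (USD)
operating_expenses = 100e9  # approximate operating expenses (USD)

revenue_loss_rate = 0.15    # lower bound of the 15-25% revenue-loss estimate
opex_overhead_rate = 0.20   # lower bound of the 20-30% operating-cost estimate
discount = 0.50             # assume Microsoft does twice as well as average

estimated_loss = discount * (revenue * revenue_loss_rate
                             + operating_expenses * opex_overhead_rate)

print(f"Estimated direct cost of bad data: ${estimated_loss / 1e9:.1f}B")
# -> Estimated direct cost of bad data: $20.7B
```

Even with these deliberately conservative assumptions, the estimate lands at roughly $20B per year, which is the point of the example: the percentages sound abstract, the dollar figure does not.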

Now that we've covered the direct costs, let's step back and look at the indirect costs, which managers at all levels often overlook. When high-quality data is frequently unavailable at the moment it is needed, the organization's culture of data-driven decision making, and of trusting the data at all, suffers. The impact of that is obviously hard to quantify with any precision. But some of the most prominent business leaders of our generation, such as Andrew Grove, the legendary CEO of Intel, have held that an organization's culture is the most effective predictor of its long-term success, so its significance is hard to overstate. A culture of over-relying on gut feeling to make business decisions will keep causing losses year after year.

Other indirect costs include reputational damage, low employee satisfaction and performance, and heightened compliance risk, which often results in substantial fines. Again, these costs are difficult to quantify. It's hard to know exactly how many of the best data scientists you spent so much time and effort hiring will leave this year because, instead of doing what they love, they are forced to spend most of their time manually fixing problems in data just to make it usable. The exact cost of reputational damage is hard to pin down too, but it isn't hard to see that, taken together, these costs can be very significant.

The challenges presented by the exponentially increasing volume of collected data are truly unprecedented. Dealing with them will require leaders to create a comprehensive data strategy, rethink their organizational processes, adopt the most advanced technologies available, and much more. Because data comes in many shapes and sizes (structured, semi-structured, and unstructured) and there are quite a few different kinds of data quality problems, organizations will need multiple tools from specialized vendors, each of which excels at a specific subproblem within the overall "bad data" problem. Only cutting-edge software built for the new challenges of our time can help you succeed, by providing the automation your data professionals need and helping ensure that managers can trust the data behind important decisions.

While our company does not offer products for every one of these subproblems (no other company does either; don't believe anyone who claims otherwise), we deeply believe that our software can become an important element of your data tech stack. We rely on cutting-edge academic research to offer unique solutions to some of the most persistent and fundamental data problems.

References:

https://www.splunk.com/en_us/blog/leadership/dark-data-has-huge-potential-but-not-if-we-keep-ignoring-it.html

https://www.invensis.net/blog/cost-of-bad-data-quality/

https://hbr.org/2017/09/only-3-of-companies-data-meets-basic-quality-standards

https://sloanreview.mit.edu/article/seizing-opportunity-in-data-quality/
