Certified Data Mining and Warehousing Professional

Need for data quality
 


It takes just one piece of corrupt information to create monumental issues. Bad data multiplies at an astonishing rate, polluting not only the system in which it originates, but also the many other information sources it touches as it moves across a business. Therefore, the longer a company waits to detect and correct a bad record, the more damage it can do.

This is why taking a reactive approach to data quality, instead of a proactive one, can be an expensive decision. In a recent study, independent analyst firm SiriusDecisions notes what it calls the “1-10-100 rule,” which demonstrates the benefits of proactive data quality. The rule states that it costs only $1 to verify a record upon entry and $10 to cleanse and dedupe it after it has been entered, but $100 in potential lost productivity or revenue if nothing is done.
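As a back-of-the-envelope illustration of the rule, the sketch below compares the three per-record costs for a hypothetical batch of defective records. Only the $1/$10/$100 figures come from the rule itself; the number of records is an assumption made for the example.

```python
# Back-of-the-envelope illustration of the 1-10-100 rule.
# Only the $1/$10/$100 per-record costs come from the rule;
# the count of defective records is an assumption for the example.

bad_records = 10_000

prevention = bad_records * 1     # verified and fixed at entry ($1 each)
correction = bad_records * 10    # cleansed and deduped after entry ($10 each)
failure    = bad_records * 100   # left uncorrected ($100 each in lost value)

print(f"Prevention (verify at entry): ${prevention:>12,}")
print(f"Correction (cleanse later):   ${correction:>12,}")
print(f"Failure (do nothing):         ${failure:>12,}")
```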

Data quality issues impact businesses of all types. Regardless of their cause, these problems cost billions of dollars each year. The longer they go undetected, the more damage they can do. Companies must leverage real-time quality control to not only correct existing records of subpar quality, but to also stop bad data from entering the environment in the first place.

 

Data quality is a complex measure of data properties from various dimensions. It gives us a picture of the extent to which the data are appropriate for their purpose.

The main dimensions of data quality are:

  • Completeness – the extent to which the expected attributes of the data are provided. Data do not have to be 100% complete; the dimension is measured against the user’s expectations and the availability of the data. It can be measured in an automated way (see the sketch after this list).

  • Accuracy – the extent to which the data reflect the real-world state. For example: the company name is a real company name, and the company identifier exists in the official register of companies. It can be measured in an automated way using various lists and mappings. (Note that data can be complete but not accurate.)

  • Credibility – the extent to which the data are regarded as true and credible. It can vary from source to source, and even a single source can contain both automated and manually entered data. This dimension is not readily measurable in an automated way.

  • Timeliness (age of data) – the extent to which the data are sufficiently up-to-date for the task at hand. For example, data scraped from an unstructured PDF that was published today but contains contracts from three months ago would not be timely. Timeliness can be measured by comparing the publishing date (or scraping date) with the dates inside the data source.
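To make automated measurement concrete, here is a minimal sketch (in Python, using pandas) that scores completeness, accuracy and timeliness on a toy company dataset. The column names, the reference register and the 90-day freshness threshold are assumptions made for the example.

```python
# Minimal sketch of automated checks for completeness, accuracy and
# timeliness on a toy company dataset. The column names, the reference
# register and the 90-day threshold are assumptions for the example.
import pandas as pd

df = pd.DataFrame({
    "company_id":   ["C001", "C002", None, "C004"],
    "company_name": ["Acme Ltd", None, "Globex", "Initech"],
    "contract_date": pd.to_datetime(
        ["2024-01-10", "2023-09-01", "2024-02-20", None]),
})

# Completeness: share of non-missing values per expected attribute.
completeness = df.notna().mean()

# Accuracy: does each identifier appear in an (assumed) official register?
official_register = {"C001", "C002", "C003", "C004"}
accuracy = df["company_id"].dropna().isin(official_register).mean()

# Timeliness: share of records no older than 90 days at publishing time.
publishing_date = pd.Timestamp("2024-03-01")   # assumed scraping/publishing date
age_days = (publishing_date - df["contract_date"]).dt.days
timeliness = (age_days <= 90).mean()

print(completeness, accuracy, timeliness, sep="\n")
```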

Some other dimensions can also be measured, but require that one has multiple datasets describing the same things:

  • Consistency – do the facts in multiple datasets match? (partly measurable)

  • Integrity – can multiple datasets be correctly joined together? Are all references valid? (measurable in an automated way; see the sketch below)
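A minimal sketch of how consistency and integrity can be checked across two datasets that share a key; the datasets, column names and values are invented for the example.

```python
# Minimal sketch of cross-dataset checks on two toy datasets that
# share a company_id key. All names and values are made up.
import pandas as pd

companies = pd.DataFrame({
    "company_id": ["C001", "C002", "C003"],
    "company_name": ["Acme Ltd", "Globex", "Initech"],
})
contracts = pd.DataFrame({
    "contract_id": [1, 2, 3],
    "company_id": ["C001", "C004", "C002"],   # C004 has no matching company
    "company_name": ["Acme Ltd", "Umbrella", "Globex Inc"],
})

# Integrity: every reference in contracts should resolve to a company.
unresolved = ~contracts["company_id"].isin(companies["company_id"])
print("Unresolved references:", contracts.loc[unresolved, "company_id"].tolist())

# Consistency: where the key resolves, do the stated names agree?
joined = contracts.merge(companies, on="company_id",
                         suffixes=("_contract", "_register"))
mismatched = joined[joined["company_name_contract"] != joined["company_name_register"]]
print("Name mismatches:\n", mismatched[["company_id", "company_name_contract",
                                        "company_name_register"]])
```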

 

Initially sold as a service, data quality moved inside the walls of corporations, as low-cost and powerful server technology became available.

Companies with an emphasis on marketing often focus their quality efforts on name and address information, but data quality is recognized as an important property of all types of data. Principles of data quality can be applied to supply chain data, transactional data, and nearly every other category of data found in the enterprise. For example, making supply chain data conform to a certain standard has value to an organization by: 1) avoiding overstocking of similar but slightly different stock; 2) improving the understanding of vendor purchases to negotiate volume discounts; and 3) avoiding logistics costs in stocking and shipping parts across a large organization.
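As an illustration of the first point, even a simple normalisation pass over free-text part descriptions can surface similar-but-slightly-different stock items. The part names and the similarity threshold below are assumptions made for the example.

```python
# Illustrative sketch only: normalising free-text part descriptions to
# surface similar-but-slightly-different stock items. The part names and
# the 0.85 similarity threshold are assumptions for the example.
from difflib import SequenceMatcher
from itertools import combinations

parts = [
    "Hex Bolt M8 x 40mm, zinc plated",
    "HEX BOLT M8X40MM ZINC-PLATED",
    "Washer, flat, 8 mm",
    "Flat washer 8mm",
]

def normalise(text: str) -> str:
    """Lowercase, strip punctuation and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())

# Flag pairs whose normalised descriptions are highly similar.
for a, b in combinations(parts, 2):
    score = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
    if score >= 0.85:
        print(f"Possible duplicate stock item ({score:.2f}): {a!r} <-> {b!r}")
```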

While name and address data has a clear standard as defined by local postal authorities, other types of data have few recognized standards. There is a movement in the industry today to standardize certain non-address data. The non-profit group GS1 is among the groups spearheading this movement.

For companies with significant research efforts, data quality can include developing protocols for research methods, reducing measurement error, bounds checking of the data, cross tabulation, modeling and outlier detection, verifying data integrity, etc.
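For instance, bounds checking and simple outlier detection can be sketched as follows; the valid range and the 2-sigma threshold are assumptions chosen for the toy measurements.

```python
# Minimal sketch of two of the checks mentioned above: bounds checking
# and simple outlier detection on a toy set of measurements. The valid
# range and the 2-standard-deviation rule are assumptions for the example.
import statistics

measurements = [12.1, 11.8, 12.4, 11.9, 55.0, 12.2, 12.0, -3.0]
VALID_RANGE = (0.0, 50.0)        # assumed physically plausible bounds

# Bounds checking: values outside the plausible range are flagged.
out_of_bounds = [x for x in measurements
                 if not VALID_RANGE[0] <= x <= VALID_RANGE[1]]

# Outlier detection: values more than 2 standard deviations from the mean.
mean = statistics.mean(measurements)
stdev = statistics.stdev(measurements)
outliers = [x for x in measurements if abs(x - mean) > 2 * stdev]

print("Out of bounds:", out_of_bounds)       # [55.0, -3.0]
print("Statistical outliers:", outliers)     # [55.0]
```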

 
