Data Quality

Since the invention of computers, the term data has mostly been used to refer to information that a computer transmits or stores. But that is not the only definition; there are other kinds of data as well. Data can be text or numbers written on paper, bits and bytes held in the memory of an electronic device, or facts stored in a person’s mind. Whatever form it takes, the question is how to verify its quality.

What is data?

In science, the answer to “what is data” is that data is information of different types, usually formatted in a particular way. According to Cahn and Cahn (2013), from a computing point of view all software falls into two broad categories: programs and data. Programs are sets of instructions used to manipulate data. With that definition in place, the following sections look at the types of data and how they are used.

Types and uses of data

Growth in the field of technology, specifically in smartphones, has led to text, video and audio being included in data, in addition to the web and activity logs. Most of this data is unstructured.

The term Big Data describes data in the range of petabytes or more. Big Data is also characterized by the five Vs: variety, volume, value, veracity and velocity. As web-based e-commerce has spread, business models built on Big Data have evolved that treat data as an asset in itself. Big Data also brings many benefits, such as cost reduction, improved efficiency and increased sales.

The meaning of data extends beyond data processing in computer applications. In science, a set of facts is called data. Finance, demographics, health and marketing each give the term a somewhat different meaning, which is why there are different answers to the question of what data is.

How is the data analyzed?

Broadly, there are two ways to analyze data: qualitatively and quantitatively.

Data analysis in qualitative research

Data analysis in qualitative research works somewhat differently from the analysis of numerical data, because qualitative data is made up of words, descriptions, photographs, objects and other images. Extracting insight from such tangled information is a complicated process; it is therefore typically used for exploratory research and data analysis.

Finding patterns in qualitative data

Although there are several ways to find patterns in textual data, the word-based method is the most relied-upon and widely used technique for research and data analysis. Data analysis in qualitative research is mostly a manual process: researchers typically read the available data and look for repeated or frequently used words.
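The word-based approach can be partly automated. The following is a minimal sketch in Python, assuming free-text survey responses and a tiny, hypothetical stop-word list; the function name `frequent_words` and the sample responses are illustrative only.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real study would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "to", "of", "but", "too", "it"}

def frequent_words(responses, top_n=3):
    """Count non-stop-words across free-text responses and return the most common."""
    counts = Counter()
    for text in responses:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)

# Illustrative responses, not real survey data.
responses = [
    "The delivery was slow and the packaging was damaged",
    "Slow delivery, but support was helpful",
    "Support was helpful and the delivery was slow too",
]
print(frequent_words(responses))
```

A count like this only points the researcher to recurring words; interpreting what they mean still requires reading the responses.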

Data analysis in quantitative research

The first stage in quantitative research and data analysis is to prepare the data for analysis, so that raw data can be turned into something meaningful. Data preparation includes the following steps:

Data validation

Editing the data

Data encoding
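The three preparation steps above can be sketched as a small pipeline. This is an illustration under stated assumptions: the survey rows, the plausible age range and the `CODES` mapping are all hypothetical.

```python
# Hypothetical survey rows: (respondent_id, age, satisfaction answer)
raw_rows = [
    ("r1", "34", "Satisfied"),
    ("r2", "", "satisfied "),
    ("r3", "210", "Neutral"),
    ("r4", "41", "DISSATISFIED"),
]

# Assumed coding scheme that maps answer labels to numeric codes.
CODES = {"dissatisfied": 1, "neutral": 2, "satisfied": 3}

def validate(row):
    """Validation: keep rows whose age is a number in a plausible range."""
    _, age, _ = row
    return age.isdigit() and 0 < int(age) < 120

def edit(row):
    """Editing: convert the age to an integer, trim whitespace and normalize case."""
    rid, age, answer = row
    return rid, int(age), answer.strip().lower()

def code(row):
    """Encoding: replace the answer label with its numeric code."""
    rid, age, answer = row
    return rid, age, CODES[answer]

prepared = [code(edit(r)) for r in raw_rows if validate(r)]
print(prepared)
```

Rows that fail validation are simply dropped here; in practice they might instead be flagged for follow-up.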

In quantitative research, descriptive analysis usually yields absolute numbers. However, such analysis is never sufficient to explain the reasoning behind those numbers. It is therefore important to choose the research and data analysis technique that best fits your survey and the story the researchers want to tell.

Companies that want to succeed in a hyper-competitive world therefore need a strong ability to analyze complex data, derive actionable insights from it and adapt to new market needs.

Top Reasons to Become a Data Scientist: Jobs in Data

Below are some uses of data science that show why becoming a data scientist can be a good choice.

Data science is used to detect risks and fraud. Initially, data science was used in the finance sector and it remains the most significant application of data science.

Next up is the health sector. Here, data science is used to analyze medical imaging, genetics and genomics. It also applies to drug development. Finally, it is a major asset in building virtual assistants for patients.

Another application of data science is internet search. All search engines make use of data science algorithms to display the desired result.

Other applications of data science or artificial intelligence include targeted advertising, advanced image recognition, speech recognition, airline route planning, augmented reality and games.

What is data quality?

We can think of data as the basis of a hierarchy in which data is the lowest level. Above the data we have the information, which is the data in its context. Higher up we have knowledge seen as actionable information and at the top level wisdom as applied knowledge.

If the quality of the data is poor, the quality of the information will be poor too. With poor information you will lack actionable knowledge in business operations, and you will either be unable to apply that knowledge or apply it incorrectly, with risky business outcomes as a result.

How do you know if the data is of quality?

There are many definitions of data quality. The two that predominate are:

The data is of high quality if it is fit for its intended use.

The data is of high quality if it correctly represents the real-world construct that it describes.

These two definitions may contradict each other. If, for example, a customer master data record is fit for issuing an invoice and receiving payment, it may be fit for that purpose. But if the same record is incomplete or incorrect for customer service, because the data does not fully describe, or wrongly describes, the who, what and where of the real-world entity that holds the customer role in that business operation, we have a business problem.

Inaccuracy of data

Often, master data must be fit for multiple purposes. This can be achieved by ensuring alignment with the real world. On the other hand, striving for perfect real-world alignment may not be cost-effective or proportionate when the goal is to make the data fit for purpose within the business objective that funds a data quality initiative. In practice, therefore, it is a matter of striking a balance between the two definitions.

Research commissioned by Experian Data Quality in 2013 found that the main cause of data inaccuracy was human error, cited in 59% of cases. Avoiding, or eventually correcting, low-quality data caused by human error requires a comprehensive effort with the right combination of remedies involving people, processes and technology.

Other main reasons for the inaccuracy of the data found in the aforementioned research are the lack of communication between departments (31%) and an inadequate data strategy (24%). Solving these problems requires passionate involvement of senior management.

Importance of data quality

It’s usually not hard to get everyone in a company, including senior management, to agree that having good data quality is good for the business. In today’s era of digital transformation, support for focusing on data quality is even greater than before.

However, when it comes to the essential questions about who is responsible for the quality of the data, who should do something about it and who will fund the necessary activities, then things get tough.

Data quality resembles human health. Testing exactly how any one element of diet and exercise affects our health is fiendishly difficult. Similarly, testing exactly how any one element of our data affects our business is also very difficult.

Example of data quality

In marketing, you overspend your budget and annoy potential customers by sending the same material more than once to the same person, with the name and address spelled slightly differently. The problem is duplicates, both within the same database and across multiple internal and external sources.

In online sales, you cannot present enough product data to support a self-service purchase decision. The issues here are the completeness of product data within your databases and how product data is syndicated between trading partners.

In the supply chain, you cannot automate processes that depend on reliable location information. The challenges here are using the same standards and achieving the necessary precision in the location data.

In the financial reports you get different answers to the same question. This is because the data is inconsistent, the freshness of the data varies, and the definitions of the data are unclear.

Data quality at the corporate level

At the corporate level, data quality issues have a dramatic impact on meeting core business objectives, such as:

Inability to react in time to new market opportunities, which hinders profit and growth. Often, this is because the organization is not ready to repurpose existing data that was only fit for yesterday’s requirements.

Obstacles in the implementation of cost reduction programs, as the data that must support ongoing business processes require too much manual inspection and correction. Automation will only work with complete and consistent data.

Deficiencies in meeting growing compliance requirements. These requirements range from privacy and data protection regulations such as the GDPR, through health and safety requirements in various industries, to financial restrictions, requirements and guidelines. Better data quality is, in most cases, a necessity for meeting these compliance objectives.

Difficulties in exploiting predictive analytics on corporate data assets, which results in more risk than necessary when making both short- and long-term decisions. These challenges arise from problems related to data duplication, data incompleteness, data inconsistency and data inaccuracy.

How to improve data quality

Improving data quality requires a balanced mix of remedies that encompasses people, processes and technology, as well as a good deal of senior management involvement.

Dimensions of data quality

When improving data quality, the goal is to measure and improve a number of data quality dimensions. Oliver (2013) establishes the following dimensions:

Uniqueness is the most frequently addressed data quality dimension when it comes to customer master data. Customer master data is often affected by duplicates, that is, two or more rows in the database that describe the same real-world entity. There are several remedies for this issue, from intercepting duplicates at the point of entry to bulk deduplication of records already stored in one or more databases.

In the case of product master data, uniqueness is a less common problem. However, completeness is often a big problem. One of the reasons is that completeness implies different requirements for different categories of products.

When working with location master data, consistency can be a challenge. Addressing, so to speak, the different postal address formats around the world is certainly no walk in the park.

At the intersection of the location domain and the customer domain, the data quality dimension called precision can be difficult to manage, as different use cases require different precision for a location, whether it is a postal address, a geographic position or both.

What to know about dimensions of data quality

Relevance, that is, what is relevant to know about your customers and what is relevant to tell about your products, is an essential issue at the intersection of the customer and product master data domains.

Product data conformity is related to locations. Take units of measurement as an example. In the United States, the length of a small item will be given in inches, but in most of the rest of the world it will be given in centimeters.

Timeliness, that is, whether data is available at the time it is needed, is the data quality dimension that matters everywhere, across all data domains.

Other dimensions of data quality that need to be measured and improved are data accuracy, which refers to alignment with the real world or a verifiable source; data validity, which refers to whether the data conforms to specified business requirements; and data integrity, which refers to whether the relationships between entities and attributes are technically consistent.
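Several of these dimensions can be expressed as simple ratios. The sketch below measures completeness, uniqueness and validity on a handful of hypothetical customer records; the field names and the deliberately simple `EMAIL_RULE` pattern are assumptions standing in for real business rules.

```python
import re

# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "email": "ann@example.com", "country": "DK"},
    {"id": 2, "email": "bob@example", "country": "US"},
    {"id": 3, "email": None, "country": "US"},
    {"id": 4, "email": "ann@example.com", "country": "DK"},
]

# Deliberately simple pattern, assumed here as the business rule for validity.
EMAIL_RULE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def completeness(rows, field):
    """Share of rows where the field is populated."""
    return sum(r[field] is not None for r in rows) / len(rows)

def uniqueness(rows, field):
    """Share of populated values that are distinct."""
    values = [r[field] for r in rows if r[field] is not None]
    return len(set(values)) / len(values)

def validity(rows, field, rule):
    """Share of populated values that satisfy the rule."""
    values = [r[field] for r in rows if r[field] is not None]
    return sum(bool(rule.match(v)) for v in values) / len(values)

print(completeness(records, "email"))
print(uniqueness(records, "email"))
print(validity(records, "email", EMAIL_RULE))
```

Accuracy is harder to compute this way, because it requires comparing the stored values against the real world or a verifiable reference source.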

Data quality management

In data quality management, the goal is to exploit a balanced set of solutions to prevent future data quality problems and cleanse (or ultimately purge) data that does not meet the data quality Key Performance Indicators (KPIs) needed to achieve current and future business goals.

Data quality KPIs will typically be measured on core business data assets within data quality dimensions such as data uniqueness, data completeness, data consistency, data conformity, data precision, data relevance, data timeliness, data accuracy, data validity and data integrity.

Data quality KPIs should be related to the KPIs used to measure the performance of the overall business.

The remedies used to prevent data quality problems and eventually cleanse the data include these disciplines:

Data governance

Data profiling

Data matching

Data quality reporting

Master Data Management (MDM)

Customer Data Integration (CDI)

Product Information Management (PIM)

Digital Asset Management (DAM)

Data governance

A data governance framework should set the data policies and standards that set the bar for the data quality KPIs that are needed and the data elements that need to be processed. This includes business rules that must be respected and supported by data quality measures.

In addition, the data governance framework should cover the organizational structures necessary to achieve the required level of data quality. This includes forums such as a data governance committee or similar, roles such as data owners, data stewards, data custodians or the like in balance with what makes sense in a given organization.

A business glossary is another valuable result of data governance used in data quality management. The business glossary is a manual for establishing the metadata used to achieve common data definitions within an organization and ultimately in the business ecosystem in which the organization operates.

Data profiling

It is essential that the people assigned to data quality, who are responsible for preventing data quality problems and for cleansing data, have a thorough knowledge of the data in question.

Data profiling is a method, often supported by dedicated technology, that is used to understand the data assets involved in data quality management. These data assets have often been populated over the years by different people operating under different business rules and brought together for bespoke business objectives.

In data profiling, the frequency and distribution of data values are counted at the relevant structural levels. Data profiling can also be used to discover the keys that relate data entities across different databases, to the extent that this has not already been done within the individual databases.

This can be used to directly measure the integrity of the data and can be used as an input to establish the measurement of other dimensions of data quality.
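As a rough illustration of profiling, the sketch below counts nulls, distinct values and value frequencies per column and flags columns that could serve as keys. The table contents and the `profile` function are hypothetical, and dedicated profiling tools do considerably more.

```python
from collections import Counter

# Hypothetical rows from a customer table.
rows = [
    {"customer_id": "C1", "country": "DK", "segment": "retail"},
    {"customer_id": "C2", "country": "DK", "segment": None},
    {"customer_id": "C3", "country": "US", "segment": "retail"},
    {"customer_id": "C4", "country": "dk", "segment": "wholesale"},
]

def profile(rows):
    """Return per-column null count, distinct count and value frequencies."""
    result = {}
    for column in rows[0]:
        values = [r[column] for r in rows]
        present = [v for v in values if v is not None]
        result[column] = {
            "nulls": len(values) - len(present),
            "distinct": len(set(present)),
            "frequencies": Counter(present),
            # Candidate key: every row has a distinct, non-null value.
            "candidate_key": len(set(present)) == len(values),
        }
    return result

for column, stats in profile(rows).items():
    print(column, stats)
```

In this sample the frequencies for `country` show both `DK` and `dk`, the kind of inconsistency that profiling is meant to surface.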

Data matching

When it comes to real-world alignment, it’s not enough to use exact keys in databases.

The classic example is how a person’s name gets written differently because of misunderstandings, typos, use of nicknames and more. With company names, the problems pile up with funny mnemonics and the inclusion of legal forms. When we place these people and organizations at locations using a postal address, there are also numerous ways of writing that address.

Data matching is a technology based on match codes such as Soundex, on fuzzy logic and, increasingly, on machine learning. It is used to determine whether two or more data records describe the same real-world entity (typically a person, household or organization).

This method can be used to deduplicate a single database and to find matching entities across multiple data sources.

Data matching often relies on data parsing, in which names, addresses and other data elements are split into discrete elements. For example, an envelope-style address is split into building name, unit, house number, street, postal code, city, state/province and country. This can be complemented by data standardization, for example using the same value for Street, Str and St.
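To make the idea concrete, here is a minimal sketch that combines a compact American Soundex match code with a very small address standardization step. The synonym table and the `same_entity` rule are assumptions for illustration; production matching usually adds fuzzy scoring and full address parsing.

```python
def soundex(name):
    """Return the four-character American Soundex code for a name."""
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        digit = codes.get(ch, "")
        if digit and digit != previous:
            result += digit
        previous = digit  # a vowel resets the previous code
    return (result + "000")[:4]

# Hypothetical synonym table for street-type words.
STREET_SYNONYMS = {"street": "st", "str": "st", "st": "st"}

def normalize_address(address):
    """Lowercase, strip punctuation and standardize street-type words."""
    tokens = address.lower().replace(".", "").replace(",", " ").split()
    return " ".join(STREET_SYNONYMS.get(t, t) for t in tokens)

def same_entity(a, b):
    """Treat two (surname, address) records as a match when both keys agree."""
    return (soundex(a[0]) == soundex(b[0])
            and normalize_address(a[1]) == normalize_address(b[1]))

print(same_entity(("Smith", "12 Main Street"), ("Smyth", "12 Main St.")))
```

Requiring both keys to agree keeps false positives down at the cost of missing some true duplicates; where to set that trade-off depends on the use case.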

Digital Asset Management (DAM)

Digital assets are images, text documents, videos and other files that are often used in conjunction with product data. From a data quality point of view, the challenges for this type of data revolve around correct tagging (metadata) as well as the quality of the assets themselves, for example whether a product image clearly shows the product and not a lot of other things.

Data quality best practices

Below, based on the reasoning set out in this post, is a collection of 10 important data quality best practices. According to Rudestam (2014), these are:

Ensure the involvement of senior management. Many of the data quality problems are only solved with an interdepartmental view.

Manage data quality activities as part of a data governance framework. This framework should establish data policies and standards, the necessary functions, and provide a business glossary.

Fill the roles of data owners and data stewards from the business side of the organization, and fill the role of data custodians from business or IT, wherever it makes the most sense.

Use a business glossary as the basis for metadata management. Metadata is data about data. Metadata management should be used to have common data definitions. In turn, they should be linked to current and future business applications.

Keep track of data quality issues with an entry for each issue. Information about the assigned data owner and the data stewards involved must be included. The impact of the problem, the resolution and the timing of the necessary procedures must also be highlighted.

Analysis

For each data quality issue that arises, start with a root cause analysis. Data quality problems will only disappear if the solution addresses the root cause.

When looking for solutions, aim to implement processes and technology that prevent the issues from occurring as close as possible to the point where the data is onboarded, rather than relying on downstream data cleansing.

Define data quality KPIs that are linked to general business performance KPIs. Data quality KPIs, sometimes also called data quality indicators (DQIs), can be related to data quality dimensions such as data uniqueness, data integrity and data consistency.

Use anecdotes about data quality train wrecks to raise awareness of the importance of data quality. However, use fact-based impact and risk analysis to justify the solutions and the funding needed.

Today, a lot of data is already digitized. Therefore, avoid typing data in by hand whenever possible. Instead, look for cost-effective data onboarding solutions that use third-party sources for publicly available data, for example locations in general and the names, addresses and identifiers of companies and, in some cases, of individuals. For product data, use third-party data from trading partners whenever possible.

Data quality reporting

Data profiling results can be used to measure data quality KPIs based on the data quality dimensions relevant to a given organization. The results of data matching are especially useful for measuring the uniqueness of the data.

In addition, it is useful to keep a log of data quality issues, in which known issues are documented and the preventive and data cleansing activities are tracked.

Organizations that focus on data quality find it useful to operate a data quality dashboard that highlights data quality KPIs and the trend in their measurements, as well as the trend in the issues passing through the data quality issue log.
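A dashboard of this kind needs two inputs: a log of issues and a series of KPI measurements. The sketch below shows one possible shape for both, assuming a hypothetical `DataQualityIssue` record with the owner, steward, impact and timing fields mentioned earlier; all names and sample entries are illustrative.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class DataQualityIssue:
    """One entry in a hypothetical data quality issue log."""
    title: str
    data_owner: str
    data_steward: str
    impact: str
    opened: date
    resolved: Optional[date] = None

def open_issues(log, as_of):
    """Count issues opened on or before as_of and not yet resolved by then."""
    return sum(
        1 for issue in log
        if issue.opened <= as_of
        and (issue.resolved is None or issue.resolved > as_of)
    )

def kpi_trend(measurements):
    """Change between consecutive KPI measurements (positive means improving)."""
    return [round(b - a, 4) for a, b in zip(measurements, measurements[1:])]

# Illustrative log entries, not real issues.
log = [
    DataQualityIssue("Duplicate customers", "Sales", "A. Steward",
                     "Double mailings", date(2024, 1, 10), date(2024, 2, 15)),
    DataQualityIssue("Missing product weights", "Logistics", "B. Steward",
                     "Manual freight quotes", date(2024, 1, 20)),
]

print(open_issues(log, date(2024, 2, 1)))
print(kpi_trend([0.90, 0.92, 0.95]))
```

Plotting the open-issue count and the KPI changes over time gives the two trend lines such a dashboard would show.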


Bibliographic References

Cahn, Steven M. and Victor Cahn. Polishing Your Prose: How to Turn First Drafts Into Finished Work. New York: Columbia University Press, 2013.

Oliver, Paul. Writing Your Thesis. 3rd edition. London: Sage, 2013.

Rudestam, Kjell Erik and Rae R. Newton. Surviving Your Dissertation: A Comprehensive Guide to Content and Process. 4th edition. Thousand Oaks, CA: Sage Publications, 2014.

Data quality. Photo: Unsplash. Credits: Shalom de León @sakgraphy
