When building a Business Intelligence system, it's common to retrieve data from many sources during the extract, transform, load (ETL) process. Often, the data is duplicated across sources, and we need to recognize which records represent the same entity. Let's take a closer look at some error handling tips for Business Intelligence.
A classic example is master data. Even in the best case, where the company maintains a unique register of people (suppliers, employees, co-workers, and so on) in a single archive, there will always be duplicated data within the other systems connected to it. And in the case of very old information, the data might even need to be recovered from files. (Find out how the retail industry benefits from Business Intelligence in our ebook: Business Intelligence in Retail guide)
The data sourced from different systems and information sources needs to be unified, and we need a way to recognize when records refer to the same person.
In the best-case scenario, a serial number identifies the matching records housed within the various systems. However, this ideal situation rarely happens. More likely, you will have to establish the link through other unique identifiers. When dealing with personal data, the most appropriate identifier could be the national ID number or another unique code; one example is the Italian tax code.
In some cases there are no unique identifiers and records must be compared field by field; we can limit the comparison to a subset of fields that together ensure the uniqueness of the record. In the case of people, you can consider fields such as name, surname, date of birth, place of birth, and gender, the same fields used to derive the Italian tax code.
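As a minimal sketch of this field-by-field comparison (the field names are illustrative, not taken from any specific system), records can be normalized first so that formatting noise doesn't break the match:

```python
def normalize(value: str) -> str:
    """Uppercase and collapse whitespace so case/spacing differences don't matter."""
    return " ".join(value.upper().split())

# Subset of fields assumed to ensure uniqueness of a person record.
MATCH_FIELDS = ("name", "surname", "birth_date", "birth_place", "gender")

def same_person(a: dict, b: dict) -> bool:
    """True when every matching field is equal after normalization."""
    return all(normalize(a[f]) == normalize(b[f]) for f in MATCH_FIELDS)

r1 = {"name": "Mario", "surname": "Rossi", "birth_date": "1980-05-21",
      "birth_place": "Roma", "gender": "M"}
r2 = {"name": " mario ", "surname": "ROSSI", "birth_date": "1980-05-21",
      "birth_place": "roma", "gender": "m"}
print(same_person(r1, r2))  # True: the differences are only case and spacing
```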
Comparing these fields isn't always possible since the values contained in them are often subject to errors and format differences.
Let's take the case of the Italian tax code. Keep in mind that this example can also be applied to any string (sequence of characters), such as the name, surname, or address. If the system is modern enough, the tax code is highly reliable because cross-checks for accuracy are run during insertion to guarantee the absence of errors.
However, in older systems, the tax code was inserted as a simple string, and the controls were absent or superficial. The same goes for information extracted from electronic documents, such as spreadsheets or text-formatted files (CSV). In these cases, even the tax code may not be a reliable identifier.
To overcome possible errors, you should compare values not for exact matches, but for "similarity." Let's take the Italian tax code as an example:
The three tax codes illustrated above differ by a single letter (one missing in the second and one extra in the third), but they still represent the same code. If the system compares them solely on exact matches, they would be treated as three different values. The system should therefore be able to measure how similar the values are. For strings, there are several criteria that measure the "distance" between two values. One of the best known is the Levenshtein distance (LvnD), which counts the number of substitutions, deletions, and insertions of characters needed to transform one string into another.
To consider two values equal, you can set a tolerance threshold on the distance. By setting the threshold at a distance less than or equal to 2 (LvnD ≤ 2), the three values above would be equivalent because:
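To make the criterion concrete, here is a minimal Python implementation of the Levenshtein distance together with a threshold check. The tax codes in the example are made-up placeholders, not real codes:

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, and substitutions
    needed to transform string a into string b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similar(a: str, b: str, threshold: int = 2) -> bool:
    """Treat two values as equivalent when their distance is within the threshold."""
    return levenshtein(a, b) <= threshold

print(levenshtein("kitten", "sitting"))  # 3
# A fabricated tax code with one letter missing still matches within LvnD <= 2:
print(similar("RSSMRA80E21H501X", "RSSMRA80E21H501"))  # True
```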
Another issue to consider is how to determine which record holds the correct value. Where there's a centralized register, you can treat its values as the reference values: for each group of similar records, keep the values of the record matching the central system. If it isn't possible to identify one of the sources as trusted, you can apply a frequency criterion: the value that appears most often among the similar records is taken as the reference value.
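The frequency criterion can be sketched with Python's `collections.Counter`; the values below are fabricated for illustration:

```python
from collections import Counter

def reference_value(values: list[str]) -> str:
    """Return the most common value in a group of similar records
    (ties are resolved by first occurrence)."""
    return Counter(values).most_common(1)[0][0]

# Three similar records; two agree, one contains an error.
group = ["RSSMRA80E21H501X", "RSSMRA80E21H501X", "RSSMRA80E21H50X"]
print(reference_value(group))  # 'RSSMRA80E21H501X'
```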
Managing Differences in Format
Aside from the problem of errors, loading data from multiple sources should take into account the diversity of formats used to represent the same information.
In the case of strings, you must pay attention to the character encoding (ASCII, UTF-8, UTF-16, etc.), and use the most comprehensive encoding as the final format. This will help prevent information loss or misrepresentation.
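A small Python sketch of this normalization: decode each source's bytes with its declared encoding, then re-encode everything in one comprehensive final encoding such as UTF-8 (the sample encodings here are assumptions):

```python
# Bytes as they might arrive from two different source systems.
raw_latin1 = "Nicolò".encode("latin-1")  # a legacy system using Latin-1
raw_utf16 = "Nicolò".encode("utf-16")    # another system using UTF-16

# Decoding each with the right source encoding recovers the same text.
unified = [raw_latin1.decode("latin-1"), raw_utf16.decode("utf-16")]

# Re-encode everything in the final, most comprehensive format.
utf8_bytes = [s.encode("utf-8") for s in unified]
print(unified[0] == unified[1])  # True: identical once decoded correctly
```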
When dealing with dates (the same applies to currencies or decimal numbers), there are other aspects to consider, such as the different formats used across multiple systems. For example, the same date can be represented in different ways:
In this case, you should choose a reference format and make comparisons only after converting all values to that format.
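A minimal sketch of this conversion in Python, using ISO 8601 as the reference format; the list of source formats is an assumption and should be extended with the formats your sources actually use:

```python
from datetime import datetime

# Formats we assume the source systems might use.
SOURCE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y", "%Y%m%d")

def to_iso(value: str) -> str:
    """Try each known source format and return the date in ISO 8601."""
    for fmt in SOURCE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")

# The same date in two formats compares equal after conversion.
print(to_iso("21/05/1980") == to_iso("1980-05-21"))  # True
```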
Interdependent Values and Error Correction
The value of a particular field may depend on the values of other fields, so you can verify the quality of a record's content by confirming that this dependency is satisfied. We can use the Italian tax code again as an example: by construction, it depends on a person's personal details, such as the date of birth. In this illustration of the tax code (1st column) and the date of birth (2nd column), we can easily determine which of the three records can be considered correct:
To illustrate, the bold part of the tax code is derived from the date of birth as follows:
- The last two numbers for the year of birth
- The month, represented by a letter (for example: A=January, E=May)
- The day
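This consistency check can be sketched in Python, assuming the standard 16-character layout of the tax code and the official month-letter table; the sample codes and dates are fabricated:

```python
# Official month letters of the Italian tax code: A=January ... T=December.
MONTH_CODES = "ABCDEHLMPRST"

def birth_date_matches(tax_code: str, birth_date: str) -> bool:
    """Check that the birth-date characters embedded in the tax code
    (positions 7-11) agree with an ISO-formatted date, e.g. '1980-05-21'."""
    year, month, day = birth_date.split("-")
    code_year = tax_code[6:8]
    code_month = tax_code[8]
    code_day = int(tax_code[9:11]) % 40  # for women, 40 is added to the day
    return (code_year == year[2:]
            and code_month == MONTH_CODES[int(month) - 1]
            and code_day == int(day))

print(birth_date_matches("RSSMRA80E21H501X", "1980-05-21"))  # True
print(birth_date_matches("RSSMRA80E21H501X", "1980-06-21"))  # False: month mismatch
```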
Comparing the tax code with the date of birth, we can immediately see that the second record has an inconsistency and can be marked as incorrect. If you are sure about the accuracy of one of the fields, you can correct the other. Otherwise, the record can be marked as "questionable".
When selecting reference values for a group of similar records, questionable records should carry a lower weight than the others. By combining one or more of these techniques, you can reduce the number of records incorrectly recognized as unique, allowing you to successfully unify the information from multiple sources.
In conclusion, it's not always possible to detect erroneous data and correct it automatically, but you can drastically reduce the number of such cases during the ETL phase of your Business Intelligence process.
If you want to have a better understanding of Business Intelligence and how it is used, how to adopt it and the strategic advantages for your organization, then click the button below and download our ebook: Business Intelligence in Retail guide.