1 The Dangers of Dirty Data
What is dirty data?
Letâs start on some common ground. What exactly is dirty data?
Well, the truth is that it can mean different things to different people working with different types of data. For the purposes of this book, it will be based on information used in a business context.
At its most basic level, dirty data is anything incorrect. It could be things such as:
Misspelt names
This happens more than you think. If itâs supplier names, it could be a simple switch of letters from ABC Printing to ABC Printign, a missing letter such as T Shoesmit instead of T Shoesmith, or something much more subtle like AT Jones, instead of TA Jones, which may not be easily picked up.
If youâre dealing with personal information, itâs doubly important to get the name right because of data protection regulations, such as the General Data Protection Regulation (GDPR). Very recently, I received a piece of mail for my new limited company âThe Classification Guru Ltdâ. The address was correct and my first and middle names were correct, but I had someone elseâs surname and a business name that wasnât mine.
When I checked on Companies House, I could see that the surname and the business name were related to one person â everything else was my information. What I suspect happened in this instance is one of several things: firstly that the list of names for mailings was in Excel and someone possibly hadnât filtered all columns and the information was therefore mixed up. Secondly, it could have been that some lines of data were removed, which caused some of the information in certain columns to shift up or down and misalign. It could have been something as simple as a cut and paste error that caused the problem. This could have easily been rectified by applying some spot checks to the data before it was used as a mailing list. Iâll cover this further later.
Incorrect or misleading descriptions
In the work I do, I see this a lot in invoice or Purchase Order (PO) descriptions. It could be something as simple as âservicesâ in the description and the personâs name as the supplier. Well, who are they? The copywriter, the lawyer, another consultant of some sort? It can be very tricky to find this out and so, more often than not, it will end up being classified under âProfessional Servicesâ. What if itâs actually plumbing or electrical services and should therefore sit under âFacilitiesâ? It might be a small value, but what if itâs not? It could be a large amount of spend that is not being accounted for correctly.
Misleading descriptions in spend data can happen easily if the data is not viewed in context. For example, if you only look at the information in the invoice or PO line description column, but not the supplier name, this can lead to misclassification. You might have âcleaningâ as a description, but the supplier is Dell or IBM. This completely changes the context of the information from janitorial services to data or computer or data cleaning services.
I once trained up a new team member and they classified LinkedIn, that well known business networking platform, as a restaurant because the description said restaurant. They had not viewed the data in context. The true nature of the spend would be more likely a job advertisement for a restaurant related position, or perhaps some advertising for something restaurant related, or even training in something restaurant related.
The spend was large. Had this not been picked up, there could have been thousands of pounds of spend against a restaurant in âTravelâ instead of sitting under âMarketingâ. Thatâs where value can also help guide the classification process, which Iâll discuss further in Chapter 4 â Spend Data Classification.
Attention to detail is also key. A large part of the spend data classification process is keyword searches and Iâll cover this later, but when you have descriptions such as âhotel cabâ or âtaxi from restaurant to hotelâ you have to be careful to read the data correctly and not just assume that because a word is in the description, that is the correct option to classify.
It could also be a completely incorrect description. Images are tagged to appear in searches, but what if a cat is tagged as a dog or vice versa? If you are a business and itâs your products that are being tagged incorrectly, then how will your clients or consumers find your products?
If youâre a retailer and you sell your products online, what if theyâre not categorised correctly? Iâve experienced this myself where Iâve seen a product Iâve liked, but when Iâve tried to find it again by keyword searches, I havenât been able to find it. Iâve had to literally scroll through ALL the products and who can be bothered with that?
Then there are the GL (General Ledger) codes. These are used by finance departments to account for financial transactions, but unlike spend data classification where the spend is classified exactly as what it is, GL codes can be used to track an item or a project.
A great example is a client of mine who had their data classified for the first time. Once finished, they had visibility on their spend across the whole business globally, including car leasing. When we looked at one particular supplier, the whole classification was car leasing, but there were four different GL codes assigned, ranging from office supplies to employee benefits. From an accounting perspective, this could potentially be correct, however, from a procurement perspective, this information could be misleading during the decision-making process and lead to bad business decisions. It flagged an issue in the finance department where multiple people had been assigning different codes to the same supplier based on their perception of what it should be. I have a solution for this, which Iâll get on to later.
Missing or incorrect codes
This can be a real issue in the manufacturing and supply chain industries. There are several reasons why a product code might be missing. If itâs an older product then historically it might never have been assigned a code. Or perhaps the code wasnât available when the product was set up, but no one followed up to add it in once the code had been created.
Then thereâs the âcanât be botheredâ aspect. We donât like to think about it, but some people just canât be bothered to find out the information they need and if theyâre not being monitored and know they can get away with it, theyâll continue to set up products with missing information. It could be wider than just the product code, it could be dimensions and weights, which are critical to many different areas of the business.
Just as harmful to the business is an incorrect code. It could be that a code has been mistyped with some numbers mixed up or perhaps one number or letter missing, or it could be something more subtle like a zero being replaced with the letter O. This can all result in duplicate records for these products, the wrong items being ordered, shipped, manufactured or the number inaccurately reported in inventory, etc., resulting in unnecessary expenditure for the business.
No standard formats for addresses
I see this a lot in both supplier and personal data. There are multiple ways that an address can be recorded; sometimes itâs all in one cell, sometimes split over a number of columns, and Iâve seen cities in the county or state column, or the postal or zip code in the city or county column. Itâs a mess and is there to some degree in nearly every data set.
Abbreviations are also problematic: Terrace could be Terr, Place â Plc, Road â Rd, Street â St. This can lead to near-duplicates, multiple records and information split between these multiple records, causing incorrect information and reporting to be used in the business.
No standard units of measure
This can cause a lot of issues, especially if you are trying to analyse or report on a specific product.
The list below shows common variants found in the data:
⢠Litre: litres, liter, liters, 1l, 1 l, 1ltr, 1 ltr, 2ltrs, 2 ltrs
⢠Metre: metres, meter, meters, 1m, 1 m, 1mtr, 1 mtr, 2mtrs, 2 mtrs
⢠Centimetre: centimetres, centimeter, centimeters, cm, cms, 1cm, 1 cm, 2cms, 2 cms
⢠Kilometre: kilometres, kilometer, kilometers, km, kms, 1km, 1 km, 2kms, 2 kms
⢠Gram: grams, gramme, grammes, gm, gms, 1gm, 1 gm, 2gms, 2 gms
⢠Kilogram: kilograms, kilo, kilos, 1kg, 1 kg, 2kgs, 2kgs
⢠Pound: pounds, 1lb, 1 lb, 2lbs, 2 lbs.
Even the little things, like whether you decide to have a space between the number and the unit of measure, can cause near duplications. Itâs much better to be clear and specific with your team to avoid multiple versions of the same items. If you have a global data set, are you working in the same units of measure or do you need to create a conversion to measure or compare like for like? This also leads us on nicely to the next section â currency.
Currency issues
If you are not aware that the values you are working with are in multiple currencies, you could spend hours trying to get figures to match up. Trust me, Iâve been there! In particular, when working with something like Swedish Krona versus GBP or USD, the values are significantly higher, so it could end up looking like youâve spent ÂŁ500 on a taxi.
Incorrect/partially classified spend data
For me, this is worse than not having classified data at all. When clients come to me for help, if they already have classified data, I immediately disregard everything and start again from scratch. Firstly, they wouldnât be using my services if there wasnât an issue with their classified data, and, secondly, in terms of time and being efficient, itâs far easier to start again with a clean slate to which you can apply standards that will be consistent and accurate.
Duplicates
The dreaded duplicates. These can appear in many forms, from duplicate invoices to customer/supplier records, to orders, to products and much more. They cause multiple records, which could mean the information is split, resulting in you only seeing part of the picture. Then there are the near-duplicates; donât even get me started on those. In business, this could be PWC/P.W.C., or with personal information, Robert Smith and Bob Smith. I cover this in more detail in the next section.
The consequences of dirty data
Why does it matter if data isnât quite right? Whoâs going to notice? Like any problem, itâs manageable when itâs small, but gone unnoticed or left to fester, it can become a really big issue. What if your car started making a rattling noise? You wouldnât leave that to deteriorate, would you? You definitely wouldnât go on a long road trip and risk being stranded in the middle of nowhere. Likewise, you shouldnât be making big business decisions based on unclassified, poor quality data.
Letâs go back to the examples weâve just covered. Whatâs the worst that can happen?
Misspelt names
If names are misspelt, you could have multiple accounts for the same supplier and you could, for example, have some orders being placed under one account and some under the other.
When it comes to running reports, it means you may only be seeing half the picture if youâve only selected one of the customers/suppliers. This could have a knock-on effect on areas like forecasting (not planning enough), discounts applied to orders (not giving the customer the correct discount), account management (not being given the right level of service for the amount purchased), sales (forecasting, analytics) and finance (the wrong credit limit, restrictions on purchases).
Itâs the same for procurement. If youâre monitoring supplier spend and you have multiple versions, you donât get a full picture of what youâre buying from that supplier and that could impact negotiations, contract compliance and planning. It also means you have fewer suppliers on your system than you think. If youâre working with global data sets, this can be a significant difference. A project I recently worked on had 41,000 suppliers in the file, but when I normalised them (more on that in Chapter 2), the true number of su...