1 The Dangers of Dirty Data
What is dirty data?
Letās start on some common ground. What exactly is dirty data?
Well, the truth is that it can mean different things to different people working with different types of data. For the purposes of this book, it will be based on information used in a business context.
At its most basic level, dirty data is anything incorrect. It could be things such as:
Misspelt names
This happens more than you think. If itās supplier names, it could be a simple switch of letters from ABC Printing to ABC Printign, a missing letter such as T Shoesmit instead of T Shoesmith, or something much more subtle like AT Jones, instead of TA Jones, which may not be easily picked up.
If youāre dealing with personal information, itās doubly important to get the name right because of data protection regulations, such as the General Data Protection Regulation (GDPR). Very recently, I received a piece of mail for my new limited company āThe Classification Guru Ltdā. The address was correct and my first and middle names were correct, but I had someone elseās surname and a business name that wasnāt mine.
When I checked on Companies House, I could see that the surname and the business name were related to one person ā everything else was my information. What I suspect happened in this instance is one of several things: firstly that the list of names for mailings was in Excel and someone possibly hadnāt filtered all columns and the information was therefore mixed up. Secondly, it could have been that some lines of data were removed, which caused some of the information in certain columns to shift up or down and misalign. It could have been something as simple as a cut and paste error that caused the problem. This could have easily been rectified by applying some spot checks to the data before it was used as a mailing list. Iāll cover this further later.
Incorrect or misleading descriptions
In the work I do, I see this a lot in invoice or Purchase Order (PO) descriptions. It could be something as simple as āservicesā in the description and the personās name as the supplier. Well, who are they? The copywriter, the lawyer, another consultant of some sort? It can be very tricky to find this out and so, more often than not, it will end up being classified under āProfessional Servicesā. What if itās actually plumbing or electrical services and should therefore sit under āFacilitiesā? It might be a small value, but what if itās not? It could be a large amount of spend that is not being accounted for correctly.
Misleading descriptions in spend data can happen easily if the data is not viewed in context. For example, if you only look at the information in the invoice or PO line description column, but not the supplier name, this can lead to misclassification. You might have ācleaningā as a description, but the supplier is Dell or IBM. This completely changes the context of the information from janitorial services to data or computer or data cleaning services.
I once trained up a new team member and they classified LinkedIn, that well known business networking platform, as a restaurant because the description said restaurant. They had not viewed the data in context. The true nature of the spend would be more likely a job advertisement for a restaurant related position, or perhaps some advertising for something restaurant related, or even training in something restaurant related.
The spend was large. Had this not been picked up, there could have been thousands of pounds of spend against a restaurant in āTravelā instead of sitting under āMarketingā. Thatās where value can also help guide the classification process, which Iāll discuss further in Chapter 4 ā Spend Data Classification.
Attention to detail is also key. A large part of the spend data classification process is keyword searches and Iāll cover this later, but when you have descriptions such as āhotel cabā or ātaxi from restaurant to hotelā you have to be careful to read the data correctly and not just assume that because a word is in the description, that is the correct option to classify.
It could also be a completely incorrect description. Images are tagged to appear in searches, but what if a cat is tagged as a dog or vice versa? If you are a business and itās your products that are being tagged incorrectly, then how will your clients or consumers find your products?
If youāre a retailer and you sell your products online, what if theyāre not categorised correctly? Iāve experienced this myself where Iāve seen a product Iāve liked, but when Iāve tried to find it again by keyword searches, I havenāt been able to find it. Iāve had to literally scroll through ALL the products and who can be bothered with that?
Then there are the GL (General Ledger) codes. These are used by finance departments to account for financial transactions, but unlike spend data classification where the spend is classified exactly as what it is, GL codes can be used to track an item or a project.
A great example is a client of mine who had their data classified for the first time. Once finished, they had visibility on their spend across the whole business globally, including car leasing. When we looked at one particular supplier, the whole classification was car leasing, but there were four different GL codes assigned, ranging from office supplies to employee benefits. From an accounting perspective, this could potentially be correct, however, from a procurement perspective, this information could be misleading during the decision-making process and lead to bad business decisions. It flagged an issue in the finance department where multiple people had been assigning different codes to the same supplier based on their perception of what it should be. I have a solution for this, which Iāll get on to later.
Missing or incorrect codes
This can be a real issue in the manufacturing and supply chain industries. There are several reasons why a product code might be missing. If itās an older product then historically it might never have been assigned a code. Or perhaps the code wasnāt available when the product was set up, but no one followed up to add it in once the code had been created.
Then thereās the ācanāt be botheredā aspect. We donāt like to think about it, but some people just canāt be bothered to find out the information they need and if theyāre not being monitored and know they can get away with it, theyāll continue to set up products with missing information. It could be wider than just the product code, it could be dimensions and weights, which are critical to many different areas of the business.
Just as harmful to the business is an incorrect code. It could be that a code has been mistyped with some numbers mixed up or perhaps one number or letter missing, or it could be something more subtle like a zero being replaced with the letter O. This can all result in duplicate records for these products, the wrong items being ordered, shipped, manufactured or the number inaccurately reported in inventory, etc., resulting in unnecessary expenditure for the business.
No standard formats for addresses
I see this a lot in both supplier and personal data. There are multiple ways that an address can be recorded; sometimes itās all in one cell, sometimes split over a number of columns, and Iāve seen cities in the county or state column, or the postal or zip code in the city or county column. Itās a mess and is there to some degree in nearly every data set.
Abbreviations are also problematic: Terrace could be Terr, Place ā Plc, Road ā Rd, Street ā St. This can lead to near-duplicates, multiple records and information split between these multiple records, causing incorrect information and reporting to be used in the business.
No standard units of measure
This can cause a lot of issues, especially if you are trying to analyse or report on a specific product.
The list below shows common variants found in the data:
ā¢ Litre: litres, liter, liters, 1l, 1 l, 1ltr, 1 ltr, 2ltrs, 2 ltrs
ā¢ Metre: metres, meter, meters, 1m, 1 m, 1mtr, 1 mtr, 2mtrs, 2 mtrs
ā¢ Centimetre: centimetres, centimeter, centimeters, cm, cms, 1cm, 1 cm, 2cms, 2 cms
ā¢ Kilometre: kilometres, kilometer, kilometers, km, kms, 1km, 1 km, 2kms, 2 kms
ā¢ Gram: grams, gramme, grammes, gm, gms, 1gm, 1 gm, 2gms, 2 gms
ā¢ Kilogram: kilograms, kilo, kilos, 1kg, 1 kg, 2kgs, 2kgs
ā¢ Pound: pounds, 1lb, 1 lb, 2lbs, 2 lbs.
Even the little things, like whether you decide to have a space between the number and the unit of measure, can cause near duplications. Itās much better to be clear and specific with your team to avoid multiple versions of the same items. If you have a global data set, are you working in the same units of measure or do you need to create a conversion to measure or compare like for like? This also leads us on nicely to the next section ā currency.
Currency issues
If you are not aware that the values you are working with are in multiple currencies, you could spend hours trying to get figures to match up. Trust me, Iāve been there! In particular, when working with something like Swedish Krona versus GBP or USD, the values are significantly higher, so it could end up looking like youāve spent Ā£500 on a taxi.
Incorrect/partially classified spend data
For me, this is worse than not having classified data at all. When clients come to me for help, if they already have classified data, I immediately disregard everything and start again from scratch. Firstly, they wouldnāt be using my services if there wasnāt an issue with their classified data, and, secondly, in terms of time and being efficient, itās far easier to start again with a clean slate to which you can apply standards that will be consistent and accurate.
Duplicates
The dreaded duplicates. These can appear in many forms, from duplicate invoices to customer/supplier records, to orders, to products and much more. They cause multiple records, which could mean the information is split, resulting in you only seeing part of the picture. Then there are the near-duplicates; donāt even get me started on those. In business, this could be PWC/P.W.C., or with personal information, Robert Smith and Bob Smith. I cover this in more detail in the next section.
The consequences of dirty data
Why does it matter if data isnāt quite right? Whoās going to notice? Like any problem, itās manageable when itās small, but gone unnoticed or left to fester, it can become a really big issue. What if your car started making a rattling noise? You wouldnāt leave that to deteriorate, would you? You definitely wouldnāt go on a long road trip and risk being stranded in the middle of nowhere. Likewise, you shouldnāt be making big business decisions based on unclassified, poor quality data.
Letās go back to the examples weāve just covered. Whatās the worst that can happen?
Misspelt names
If names are misspelt, you could have multiple accounts for the same supplier and you could, for example, have some orders being placed under one account and some under the other.
When it comes to running reports, it means you may only be seeing half the picture if youāve only selected one of the customers/suppliers. This could have a knock-on effect on areas like forecasting (not planning enough), discounts applied to orders (not giving the customer the correct discount), account management (not being given the right level of service for the amount purchased), sales (forecasting, analytics) and finance (the wrong credit limit, restrictions on purchases).
Itās the same for procurement. If youāre monitoring supplier spend and you have multiple versions, you donāt get a full picture of what youāre buying from that supplier and that could impact negotiations, contract compliance and planning. It also means you have fewer suppliers on your system than you think. If youāre working with global data sets, this can be a significant difference. A project I recently worked on had 41,000 suppliers in the file, but when I normalised them (more on that in Chapter 2), the true number of su...