eBook - ePub

Between the Spreadsheets

Name: Between the Spreadsheets
ISBN: 9781783305230

Classifying and Fixing Dirty Data

Susan Walsh,

English
ePUB (mobile friendly)
Available on iOS & Android

eBook - ePub

Between the Spreadsheets

Classifying and Fixing Dirty Data

Susan Walsh,

About this book

Dirty data is a problem that costs businesses thousands, if not millions, every year. In organisations large and small across the globe you will hear talk of data quality issues. What you will rarely hear about is the consequences or how to fix it.

Between the Spreadsheets: Classifying and Fixing Dirty Data draws on classification expert Susan Walsh's decade of experience in data classification to present a fool-proof method for cleaning and classifying your data. The book covers everything from the very basics of data classification to normalisation and taxonomies, and presents the author's proven COAT methodology, helping ensure an organisation's data is Consistent, Organised, Accurate and Trustworthy. A series of data horror stories outlines what can go wrong in managing data, and if it does, how it can be fixed.

After reading this book, regardless of your level of experience, not only will you be able to work with your data more efficiently, but you will also understand the impact the work you do with it has, and how it affects the rest of the organisation.

Written in an engaging and highly practical manner, Between the Spreadsheets gives readers of all levels a deep understanding of the dangers of dirty data and the confidence and skills to work more efficiently and effectively with it.

Tools to learn more effectively

Saving Books

Keyword Search

Annotating Text

Listen to it instead

Information

Publisher

Year

Print ISBN

eBook ISBN

Topic

Subtopic

Information Management

Index

Business

1 The Dangers of Dirty Data

What is dirty data?

Let’s start on some common ground. What exactly is dirty data?

Well, the truth is that it can mean different things to different people working with different types of data. For the purposes of this book, it will be based on information used in a business context.

At its most basic level, dirty data is anything incorrect. It could be things such as:

Misspelt names

This happens more than you think. If it’s supplier names, it could be a simple switch of letters from ABC Printing to ABC Printign, a missing letter such as T Shoesmit instead of T Shoesmith, or something much more subtle like AT Jones, instead of TA Jones, which may not be easily picked up.

If you’re dealing with personal information, it’s doubly important to get the name right because of data protection regulations, such as the General Data Protection Regulation (GDPR). Very recently, I received a piece of mail for my new limited company ‘The Classification Guru Ltd’. The address was correct and my first and middle names were correct, but I had someone else’s surname and a business name that wasn’t mine.

When I checked on Companies House, I could see that the surname and the business name were related to one person – everything else was my information. What I suspect happened in this instance is one of several things: firstly that the list of names for mailings was in Excel and someone possibly hadn’t filtered all columns and the information was therefore mixed up. Secondly, it could have been that some lines of data were removed, which caused some of the information in certain columns to shift up or down and misalign. It could have been something as simple as a cut and paste error that caused the problem. This could have easily been rectified by applying some spot checks to the data before it was used as a mailing list. I’ll cover this further later.

Incorrect or misleading descriptions

In the work I do, I see this a lot in invoice or Purchase Order (PO) descriptions. It could be something as simple as ‘services’ in the description and the person’s name as the supplier. Well, who are they? The copywriter, the lawyer, another consultant of some sort? It can be very tricky to find this out and so, more often than not, it will end up being classified under ‘Professional Services’. What if it’s actually plumbing or electrical services and should therefore sit under ‘Facilities’? It might be a small value, but what if it’s not? It could be a large amount of spend that is not being accounted for correctly.

Misleading descriptions in spend data can happen easily if the data is not viewed in context. For example, if you only look at the information in the invoice or PO line description column, but not the supplier name, this can lead to misclassification. You might have ‘cleaning’ as a description, but the supplier is Dell or IBM. This completely changes the context of the information from janitorial services to data or computer or data cleaning services.

I once trained up a new team member and they classified LinkedIn, that well known business networking platform, as a restaurant because the description said restaurant. They had not viewed the data in context. The true nature of the spend would be more likely a job advertisement for a restaurant related position, or perhaps some advertising for something restaurant related, or even training in something restaurant related.

The spend was large. Had this not been picked up, there could have been thousands of pounds of spend against a restaurant in ‘Travel’ instead of sitting under ‘Marketing’. That’s where value can also help guide the classification process, which I’ll discuss further in Chapter 4 – Spend Data Classification.

Attention to detail is also key. A large part of the spend data classification process is keyword searches and I’ll cover this later, but when you have descriptions such as ‘hotel cab’ or ‘taxi from restaurant to hotel’ you have to be careful to read the data correctly and not just assume that because a word is in the description, that is the correct option to classify.

It could also be a completely incorrect description. Images are tagged to appear in searches, but what if a cat is tagged as a dog or vice versa? If you are a business and it’s your products that are being tagged incorrectly, then how will your clients or consumers find your products?

If you’re a retailer and you sell your products online, what if they’re not categorised correctly? I’ve experienced this myself where I’ve seen a product I’ve liked, but when I’ve tried to find it again by keyword searches, I haven’t been able to find it. I’ve had to literally scroll through ALL the products and who can be bothered with that?

Then there are the GL (General Ledger) codes. These are used by finance departments to account for financial transactions, but unlike spend data classification where the spend is classified exactly as what it is, GL codes can be used to track an item or a project.

A great example is a client of mine who had their data classified for the first time. Once finished, they had visibility on their spend across the whole business globally, including car leasing. When we looked at one particular supplier, the whole classification was car leasing, but there were four different GL codes assigned, ranging from office supplies to employee benefits. From an accounting perspective, this could potentially be correct, however, from a procurement perspective, this information could be misleading during the decision-making process and lead to bad business decisions. It flagged an issue in the finance department where multiple people had been assigning different codes to the same supplier based on their perception of what it should be. I have a solution for this, which I’ll get on to later.

Missing or incorrect codes

This can be a real issue in the manufacturing and supply chain industries. There are several reasons why a product code might be missing. If it’s an older product then historically it might never have been assigned a code. Or perhaps the code wasn’t available when the product was set up, but no one followed up to add it in once the code had been created.

Then there’s the ‘can’t be bothered’ aspect. We don’t like to think about it, but some people just can’t be bothered to find out the information they need and if they’re not being monitored and know they can get away with it, they’ll continue to set up products with missing information. It could be wider than just the product code, it could be dimensions and weights, which are critical to many different areas of the business.

Just as harmful to the business is an incorrect code. It could be that a code has been mistyped with some numbers mixed up or perhaps one number or letter missing, or it could be something more subtle like a zero being replaced with the letter O. This can all result in duplicate records for these products, the wrong items being ordered, shipped, manufactured or the number inaccurately reported in inventory, etc., resulting in unnecessary expenditure for the business.

No standard formats for addresses

I see this a lot in both supplier and personal data. There are multiple ways that an address can be recorded; sometimes it’s all in one cell, sometimes split over a number of columns, and I’ve seen cities in the county or state column, or the postal or zip code in the city or county column. It’s a mess and is there to some degree in nearly every data set.

Abbreviations are also problematic: Terrace could be Terr, Place – Plc, Road – Rd, Street – St. This can lead to near-duplicates, multiple records and information split between these multiple records, causing incorrect information and reporting to be used in the business.

No standard units of measure

This can cause a lot of issues, especially if you are trying to analyse or report on a specific product.

The list below shows common variants found in the data:

• Litre: litres, liter, liters, 1l, 1 l, 1ltr, 1 ltr, 2ltrs, 2 ltrs

• Metre: metres, meter, meters, 1m, 1 m, 1mtr, 1 mtr, 2mtrs, 2 mtrs

• Centimetre: centimetres, centimeter, centimeters, cm, cms, 1cm, 1 cm, 2cms, 2 cms

• Kilometre: kilometres, kilometer, kilometers, km, kms, 1km, 1 km, 2kms, 2 kms

• Gram: grams, gramme, grammes, gm, gms, 1gm, 1 gm, 2gms, 2 gms

• Kilogram: kilograms, kilo, kilos, 1kg, 1 kg, 2kgs, 2kgs

• Pound: pounds, 1lb, 1 lb, 2lbs, 2 lbs.

Even the little things, like whether you decide to have a space between the number and the unit of measure, can cause near duplications. It’s much better to be clear and specific with your team to avoid multiple versions of the same items. If you have a global data set, are you working in the same units of measure or do you need to create a conversion to measure or compare like for like? This also leads us on nicely to the next section – currency.

Currency issues

If you are not aware that the values you are working with are in multiple currencies, you could spend hours trying to get figures to match up. Trust me, I’ve been there! In particular, when working with something like Swedish Krona versus GBP or USD, the values are significantly higher, so it could end up looking like you’ve spent £500 on a taxi.

Incorrect/partially classified spend data

For me, this is worse than not having classified data at all. When clients come to me for help, if they already have classified data, I immediately disregard everything and start again from scratch. Firstly, they wouldn’t be using my services if there wasn’t an issue with their classified data, and, secondly, in terms of time and being efficient, it’s far easier to start again with a clean slate to which you can apply standards that will be consistent and accurate.

Duplicates

The dreaded duplicates. These can appear in many forms, from duplicate invoices to customer/supplier records, to orders, to products and much more. They cause multiple records, which could mean the information is split, resulting in you only seeing part of the picture. Then there are the near-duplicates; don’t even get me started on those. In business, this could be PWC/P.W.C., or with personal information, Robert Smith and Bob Smith. I cover this in more detail in the next section.

The consequences of dirty data

Why does it matter if data isn’t quite right? Who’s going to notice? Like any problem, it’s manageable when it’s small, but gone unnoticed or left to fester, it can become a really big issue. What if your car started making a rattling noise? You wouldn’t leave that to deteriorate, would you? You definitely wouldn’t go on a long road trip and risk being stranded in the middle of nowhere. Likewise, you shouldn’t be making big business decisions based on unclassified, poor quality data.

Let’s go back to the examples we’ve just covered. What’s the worst that can happen?

Misspelt names

If names are misspelt, you could have multiple accounts for the same supplier and you could, for example, have some orders being placed under one account and some under the other.

When it comes to running reports, it means you may only be seeing half the picture if you’ve only selected one of the customers/suppliers. This could have a knock-on effect on areas like forecasting (not planning enough), discounts applied to orders (not giving the customer the correct discount), account management (not being given the right level of service for the amount purchased), sales (forecasting, analytics) and finance (the wrong credit limit, restrictions on purchases).

It’s the same for procurement. If you’re monitoring supplier spend and you have multiple versions, you don’t get a full picture of what you’re buying from that supplier and that could impact negotiations, contract compliance and planning. It also means you have fewer suppliers on your system than you think. If you’re working with global data sets, this can be a significant difference. A project I recently worked on had 41,000 suppliers in the file, but when I normalised them (more on that in Chapter 2), the true number of su...

Cover
Praise for Between the Spreadsheets
Title Page
Copyright
Contents
Figures
Tables
Acknowledgements
Abbreviations
Introduction
1 The Dangers of Dirty Data
2 Supplier Normalisation
3 Taxonomies
4 Spend Data Classification
5 Basic Data Cleansing
6 Other Methodologies
7 The Dirty Data Maturity Model
8 Data Horror Stories
Summary
References
Index

Frequently asked questions

Yes, you can cancel anytime from the Subscription tab in your account settings on the Perlego website. Your subscription will stay active until the end of your current billing period. Learn how to cancel your subscription

No, books cannot be downloaded as external files, such as PDFs, for use outside of Perlego. However, you can download books within the Perlego app for offline reading on mobile or tablet. Learn how to download books offline

Perlego offers two plans: Essential and Complete

Essential is ideal for learners and professionals who enjoy exploring a wide range of subjects. Access the Essential Library with 800,000+ trusted titles and best-sellers across business, personal growth, and the humanities. Includes unlimited reading time and Standard Read Aloud voice.
Complete: Perfect for advanced learners and researchers needing full, unrestricted access. Unlock 1.4M+ books across hundreds of subjects, including academic and specialized titles. The Complete Plan also includes advanced features like Premium Read Aloud and Research Assistant.

Both plans are available with monthly, semester, or annual billing cycles.

We are an online textbook subscription service, where you can get access to an entire online library for less than the price of a single book per month. With over 1 million books across 990+ topics, we’ve got you covered! Learn about our mission

Look out for the read-aloud symbol on your next book to see if you can listen to it. The read-aloud tool reads text aloud for you, highlighting the text as it is being read. You can pause it, speed it up and slow it down. Learn more about Read Aloud

Yes! You can use the Perlego app on both iOS and Android devices to read anytime, anywhere — even offline. Perfect for commutes or when you’re on the go.
Please note we cannot support devices running on iOS 13 and Android 7 or earlier. Learn more about using the app

Yes, you can access Between the Spreadsheets by Susan Walsh in PDF and/or ePUB format, as well as other popular books in Business & Information Management. We have over one million books available in our catalogue for you to explore.

About this book

Tools to learn more effectively

Information

Table of contents

Frequently asked questions