eBook - ePub

Between the Spreadsheets

Name: Between the Spreadsheets
Author: Walsh

Classifying and Fixing Dirty Data

Walsh

English
ePUB (disponibile sull'app)
Disponibile su iOS e Android

eBook - ePub

Between the Spreadsheets

Classifying and Fixing Dirty Data

Walsh

Dettagli del libro

Anteprima del libro

Indice dei contenuti

Citazioni

Informazioni sul libro

Dirty data is a problem that costs businesses thousands, if not millions, every year. In organisations large and small across the globe you will hear talk of data quality issues. What you will rarely hear about is the consequences or how to fix it.

Between the Spreadsheets: Classifying and Fixing Dirty Data draws on classification expert Susan Walsh's decade of experience in data classification to present a fool-proof method for cleaning and classifying your data. The book covers everything from the very basics of data classification to normalisation and taxonomies, and presents the author's proven COAT methodology, helping ensure an organisation's data is Consistent, Organised, Accurate and Trustworthy. A series of data horror stories outlines what can go wrong in managing data, and if it does, how it can be fixed.

After reading this book, regardless of your level of experience, not only will you be able to work with your data more efficiently, but you will also understand the impact the work you do with it has, and how it affects the rest of the organisation.

Written in an engaging and highly practical manner, Between the Spreadsheets gives readers of all levels a deep understanding of the dangers of dirty data and the confidence and skills to work more efficiently and effectively with it.

Domande frequenti

Come faccio ad annullare l'abbonamento?

È semplicissimo: basta accedere alla sezione Account nelle Impostazioni e cliccare su "Annulla abbonamento". Dopo la cancellazione, l'abbonamento rimarrà attivo per il periodo rimanente già pagato. Per maggiori informazioni, clicca qui

È possibile scaricare libri? Se sì, come?

Al momento è possibile scaricare tramite l'app tutti i nostri libri ePub mobile-friendly. Anche la maggior parte dei nostri PDF è scaricabile e stiamo lavorando per rendere disponibile quanto prima il download di tutti gli altri file. Per maggiori informazioni, clicca qui

Che differenza c'è tra i piani?

Entrambi i piani ti danno accesso illimitato alla libreria e a tutte le funzionalità di Perlego. Le uniche differenze sono il prezzo e il periodo di abbonamento: con il piano annuale risparmierai circa il 30% rispetto a 12 rate con quello mensile.

Cos'è Perlego?

Perlego è un servizio di abbonamento a testi accademici, che ti permette di accedere a un'intera libreria online a un prezzo inferiore rispetto a quello che pagheresti per acquistare un singolo libro al mese. Con oltre 1 milione di testi suddivisi in più di 1.000 categorie, troverai sicuramente ciò che fa per te! Per maggiori informazioni, clicca qui.

Perlego supporta la sintesi vocale?

Cerca l'icona Sintesi vocale nel prossimo libro che leggerai per verificare se è possibile riprodurre l'audio. Questo strumento permette di leggere il testo a voce alta, evidenziandolo man mano che la lettura procede. Puoi aumentare o diminuire la velocità della sintesi vocale, oppure sospendere la riproduzione. Per maggiori informazioni, clicca qui.

Between the Spreadsheets è disponibile online in formato PDF/ePub?

Sì, puoi accedere a Between the Spreadsheets di Walsh in formato PDF e/o ePub, così come ad altri libri molto apprezzati nelle sezioni relative a Betriebswirtschaft e Informationsmanagement. Scopri oltre 1 milione di libri disponibili nel nostro catalogo.

Informazioni

Editore

Facet Publishing

Anno

2021

ISBN

9781783305230

Argomento

Betriebswirtschaft

Categoria

Informationsmanagement

1 The Dangers of Dirty Data

What is dirty data?

Let’s start on some common ground. What exactly is dirty data?

Well, the truth is that it can mean different things to different people working with different types of data. For the purposes of this book, it will be based on information used in a business context.

At its most basic level, dirty data is anything incorrect. It could be things such as:

Misspelt names

This happens more than you think. If it’s supplier names, it could be a simple switch of letters from ABC Printing to ABC Printign, a missing letter such as T Shoesmit instead of T Shoesmith, or something much more subtle like AT Jones, instead of TA Jones, which may not be easily picked up.

If you’re dealing with personal information, it’s doubly important to get the name right because of data protection regulations, such as the General Data Protection Regulation (GDPR). Very recently, I received a piece of mail for my new limited company ‘The Classification Guru Ltd’. The address was correct and my first and middle names were correct, but I had someone else’s surname and a business name that wasn’t mine.

When I checked on Companies House, I could see that the surname and the business name were related to one person – everything else was my information. What I suspect happened in this instance is one of several things: firstly that the list of names for mailings was in Excel and someone possibly hadn’t filtered all columns and the information was therefore mixed up. Secondly, it could have been that some lines of data were removed, which caused some of the information in certain columns to shift up or down and misalign. It could have been something as simple as a cut and paste error that caused the problem. This could have easily been rectified by applying some spot checks to the data before it was used as a mailing list. I’ll cover this further later.

Incorrect or misleading descriptions

In the work I do, I see this a lot in invoice or Purchase Order (PO) descriptions. It could be something as simple as ‘services’ in the description and the person’s name as the supplier. Well, who are they? The copywriter, the lawyer, another consultant of some sort? It can be very tricky to find this out and so, more often than not, it will end up being classified under ‘Professional Services’. What if it’s actually plumbing or electrical services and should therefore sit under ‘Facilities’? It might be a small value, but what if it’s not? It could be a large amount of spend that is not being accounted for correctly.

Misleading descriptions in spend data can happen easily if the data is not viewed in context. For example, if you only look at the information in the invoice or PO line description column, but not the supplier name, this can lead to misclassification. You might have ‘cleaning’ as a description, but the supplier is Dell or IBM. This completely changes the context of the information from janitorial services to data or computer or data cleaning services.

I once trained up a new team member and they classified LinkedIn, that well known business networking platform, as a restaurant because the description said restaurant. They had not viewed the data in context. The true nature of the spend would be more likely a job advertisement for a restaurant related position, or perhaps some advertising for something restaurant related, or even training in something restaurant related.

The spend was large. Had this not been picked up, there could have been thousands of pounds of spend against a restaurant in ‘Travel’ instead of sitting under ‘Marketing’. That’s where value can also help guide the classification process, which I’ll discuss further in Chapter 4 – Spend Data Classification.

Attention to detail is also key. A large part of the spend data classification process is keyword searches and I’ll cover this later, but when you have descriptions such as ‘hotel cab’ or ‘taxi from restaurant to hotel’ you have to be careful to read the data correctly and not just assume that because a word is in the description, that is the correct option to classify.

It could also be a completely incorrect description. Images are tagged to appear in searches, but what if a cat is tagged as a dog or vice versa? If you are a business and it’s your products that are being tagged incorrectly, then how will your clients or consumers find your products?

If you’re a retailer and you sell your products online, what if they’re not categorised correctly? I’ve experienced this myself where I’ve seen a product I’ve liked, but when I’ve tried to find it again by keyword searches, I haven’t been able to find it. I’ve had to literally scroll through ALL the products and who can be bothered with that?

Then there are the GL (General Ledger) codes. These are used by finance departments to account for financial transactions, but unlike spend data classification where the spend is classified exactly as what it is, GL codes can be used to track an item or a project.

A great example is a client of mine who had their data classified for the first time. Once finished, they had visibility on their spend across the whole business globally, including car leasing. When we looked at one particular supplier, the whole classification was car leasing, but there were four different GL codes assigned, ranging from office supplies to employee benefits. From an accounting perspective, this could potentially be correct, however, from a procurement perspective, this information could be misleading during the decision-making process and lead to bad business decisions. It flagged an issue in the finance department where multiple people had been assigning different codes to the same supplier based on their perception of what it should be. I have a solution for this, which I’ll get on to later.

Missing or incorrect codes

This can be a real issue in the manufacturing and supply chain industries. There are several reasons why a product code might be missing. If it’s an older product then historically it might never have been assigned a code. Or perhaps the code wasn’t available when the product was set up, but no one followed up to add it in once the code had been created.

Then there’s the ‘can’t be bothered’ aspect. We don’t like to think about it, but some people just can’t be bothered to find out the information they need and if they’re not being monitored and know they can get away with it, they’ll continue to set up products with missing information. It could be wider than just the product code, it could be dimensions and weights, which are critical to many different areas of the business.

Just as harmful to the business is an incorrect code. It could be that a code has been mistyped with some numbers mixed up or perhaps one number or letter missing, or it could be something more subtle like a zero being replaced with the letter O. This can all result in duplicate records for these products, the wrong items being ordered, shipped, manufactured or the number inaccurately reported in inventory, etc., resulting in unnecessary expenditure for the business.

No standard formats for addresses

I see this a lot in both supplier and personal data. There are multiple ways that an address can be recorded; sometimes it’s all in one cell, sometimes split over a number of columns, and I’ve seen cities in the county or state column, or the postal or zip code in the city or county column. It’s a mess and is there to some degree in nearly every data set.

Abbreviations are also problematic: Terrace could be Terr, Place – Plc, Road – Rd, Street – St. This can lead to near-duplicates, multiple records and information split between these multiple records, causing incorrect information and reporting to be used in the business.

No standard units of measure

This can cause a lot of issues, especially if you are trying to analyse or report on a specific product.

The list below shows common variants found in the data:

• Litre: litres, liter, liters, 1l, 1 l, 1ltr, 1 ltr, 2ltrs, 2 ltrs

• Metre: metres, meter, meters, 1m, 1 m, 1mtr, 1 mtr, 2mtrs, 2 mtrs

• Centimetre: centimetres, centimeter, centimeters, cm, cms, 1cm, 1 cm, 2cms, 2 cms

• Kilometre: kilometres, kilometer, kilometers, km, kms, 1km, 1 km, 2kms, 2 kms

• Gram: grams, gramme, grammes, gm, gms, 1gm, 1 gm, 2gms, 2 gms

• Kilogram: kilograms, kilo, kilos, 1kg, 1 kg, 2kgs, 2kgs

• Pound: pounds, 1lb, 1 lb, 2lbs, 2 lbs.

Even the little things, like whether you decide to have a space between the number and the unit of measure, can cause near duplications. It’s much better to be clear and specific with your team to avoid multiple versions of the same items. If you have a global data set, are you working in the same units of measure or do you need to create a conversion to measure or compare like for like? This also leads us on nicely to the next section – currency.

Currency issues

If you are not aware that the values you are working with are in multiple currencies, you could spend hours trying to get figures to match up. Trust me, I’ve been there! In particular, when working with something like Swedish Krona versus GBP or USD, the values are significantly higher, so it could end up looking like you’ve spent £500 on a taxi.

Incorrect/partially classified spend data

For me, this is worse than not having classified data at all. When clients come to me for help, if they already have classified data, I immediately disregard everything and start again from scratch. Firstly, they wouldn’t be using my services if there wasn’t an issue with their classified data, and, secondly, in terms of time and being efficient, it’s far easier to start again with a clean slate to which you can apply standards that will be consistent and accurate.

Duplicates

The dreaded duplicates. These can appear in many forms, from duplicate invoices to customer/supplier records, to orders, to products and much more. They cause multiple records, which could mean the information is split, resulting in you only seeing part of the picture. Then there are the near-duplicates; don’t even get me started on those. In business, this could be PWC/P.W.C., or with personal information, Robert Smith and Bob Smith. I cover this in more detail in the next section.

The consequences of dirty data

Why does it matter if data isn’t quite right? Who’s going to notice? Like any problem, it’s manageable when it’s small, but gone unnoticed or left to fester, it can become a really big issue. What if your car started making a rattling noise? You wouldn’t leave that to deteriorate, would you? You definitely wouldn’t go on a long road trip and risk being stranded in the middle of nowhere. Likewise, you shouldn’t be making big business decisions based on unclassified, poor quality data.

Let’s go back to the examples we’ve just covered. What’s the worst that can happen?

Misspelt names

If names are misspelt, you could have multiple accounts for the same supplier and you could, for example, have some orders being placed under one account and some under the other.

When it comes to running reports, it means you may only be seeing half the picture if you’ve only selected one of the customers/suppliers. This could have a knock-on effect on areas like forecasting (not planning enough), discounts applied to orders (not giving the customer the correct discount), account management (not being given the right level of service for the amount purchased), sales (forecasting, analytics) and finance (the wrong credit limit, restrictions on purchases).

It’s the same for procurement. If you’re monitoring supplier spend and you have multiple versions, you don’t get a full picture of what you’re buying from that supplier and that could impact negotiations, contract compliance and planning. It also means you have fewer suppliers on your system than you think. If you’re working with global data sets, this can be a significant difference. A project I recently worked on had 41,000 suppliers in the file, but when I normalised them (more on that in Chapter 2), the true number of su...

Indice dei contenuti

Cover
Praise for Between the Spreadsheets
Title Page
Copyright
Contents
Figures
Tables
Acknowledgements
Abbreviations
Introduction
1 The Dangers of Dirty Data
2 Supplier Normalisation
3 Taxonomies
4 Spend Data Classification
5 Basic Data Cleansing
6 Other Methodologies
7 The Dirty Data Maturity Model
8 Data Horror Stories
Summary
References
Index

Stili delle citazioni per Between the Spreadsheets

APA 6 Citation

Walsh. (2021). Between the Spreadsheets ([edition unavailable]). Facet Publishing. Retrieved from https://www.perlego.com/book/3259740/between-the-spreadsheets-classifying-and-fixing-dirty-data-pdf (Original work published 2021)

Chicago Citation

Walsh. (2021) 2021. Between the Spreadsheets. [Edition unavailable]. Facet Publishing. https://www.perlego.com/book/3259740/between-the-spreadsheets-classifying-and-fixing-dirty-data-pdf.

Harvard Citation

Walsh (2021) Between the Spreadsheets. [edition unavailable]. Facet Publishing. Available at: https://www.perlego.com/book/3259740/between-the-spreadsheets-classifying-and-fixing-dirty-data-pdf (Accessed: 15 October 2022).

MLA 7 Citation

Walsh. Between the Spreadsheets. [edition unavailable]. Facet Publishing, 2021. Web. 15 Oct. 2022.