Between the Spreadsheets
eBook - ePub

Between the Spreadsheets

Classifying and Fixing Dirty Data

Walsh

  1. English
  2. ePUB (apto para móviles)
  3. Disponible en iOS y Android
eBook - ePub

Between the Spreadsheets

Classifying and Fixing Dirty Data

Walsh

Detalles del libro
Vista previa del libro
Índice
Citas

Información del libro

Dirty data is a problem that costs businesses thousands, if not millions, every year. In organisations large and small across the globe you will hear talk of data quality issues. What you will rarely hear about is the consequences or how to fix it.

Between the Spreadsheets: Classifying and Fixing Dirty Data draws on classification expert Susan Walsh's decade of experience in data classification to present a fool-proof method for cleaning and classifying your data. The book covers everything from the very basics of data classification to normalisation and taxonomies, and presents the author's proven COAT methodology, helping ensure an organisation's data is Consistent, Organised, Accurate and Trustworthy. A series of data horror stories outlines what can go wrong in managing data, and if it does, how it can be fixed.

After reading this book, regardless of your level of experience, not only will you be able to work with your data more efficiently, but you will also understand the impact the work you do with it has, and how it affects the rest of the organisation.

Written in an engaging and highly practical manner, Between the Spreadsheets gives readers of all levels a deep understanding of the dangers of dirty data and the confidence and skills to work more efficiently and effectively with it.

Preguntas frecuentes

¿Cómo cancelo mi suscripción?
Simplemente, dirígete a la sección ajustes de la cuenta y haz clic en «Cancelar suscripción». Así de sencillo. Después de cancelar tu suscripción, esta permanecerá activa el tiempo restante que hayas pagado. Obtén más información aquí.
¿Cómo descargo los libros?
Por el momento, todos nuestros libros ePub adaptables a dispositivos móviles se pueden descargar a través de la aplicación. La mayor parte de nuestros PDF también se puede descargar y ya estamos trabajando para que el resto también sea descargable. Obtén más información aquí.
¿En qué se diferencian los planes de precios?
Ambos planes te permiten acceder por completo a la biblioteca y a todas las funciones de Perlego. Las únicas diferencias son el precio y el período de suscripción: con el plan anual ahorrarás en torno a un 30 % en comparación con 12 meses de un plan mensual.
¿Qué es Perlego?
Somos un servicio de suscripción de libros de texto en línea que te permite acceder a toda una biblioteca en línea por menos de lo que cuesta un libro al mes. Con más de un millón de libros sobre más de 1000 categorías, ¡tenemos todo lo que necesitas! Obtén más información aquí.
¿Perlego ofrece la función de texto a voz?
Busca el símbolo de lectura en voz alta en tu próximo libro para ver si puedes escucharlo. La herramienta de lectura en voz alta lee el texto en voz alta por ti, resaltando el texto a medida que se lee. Puedes pausarla, acelerarla y ralentizarla. Obtén más información aquí.
¿Es Between the Spreadsheets un PDF/ePUB en línea?
Sí, puedes acceder a Between the Spreadsheets de Walsh en formato PDF o ePUB, así como a otros libros populares de Betriebswirtschaft y Informationsmanagement. Tenemos más de un millón de libros disponibles en nuestro catálogo para que explores.

Información

Año
2021
ISBN
9781783305230
1 The Dangers of Dirty Data
What is dirty data?
Let’s start on some common ground. What exactly is dirty data?
Well, the truth is that it can mean different things to different people working with different types of data. For the purposes of this book, it will be based on information used in a business context.
At its most basic level, dirty data is anything incorrect. It could be things such as:
Misspelt names
This happens more than you think. If it’s supplier names, it could be a simple switch of letters from ABC Printing to ABC Printign, a missing letter such as T Shoesmit instead of T Shoesmith, or something much more subtle like AT Jones, instead of TA Jones, which may not be easily picked up.
If you’re dealing with personal information, it’s doubly important to get the name right because of data protection regulations, such as the General Data Protection Regulation (GDPR). Very recently, I received a piece of mail for my new limited company ‘The Classification Guru Ltd’. The address was correct and my first and middle names were correct, but I had someone else’s surname and a business name that wasn’t mine.
When I checked on Companies House, I could see that the surname and the business name were related to one person – everything else was my information. What I suspect happened in this instance is one of several things: firstly that the list of names for mailings was in Excel and someone possibly hadn’t filtered all columns and the information was therefore mixed up. Secondly, it could have been that some lines of data were removed, which caused some of the information in certain columns to shift up or down and misalign. It could have been something as simple as a cut and paste error that caused the problem. This could have easily been rectified by applying some spot checks to the data before it was used as a mailing list. I’ll cover this further later.
Incorrect or misleading descriptions
In the work I do, I see this a lot in invoice or Purchase Order (PO) descriptions. It could be something as simple as ‘services’ in the description and the person’s name as the supplier. Well, who are they? The copywriter, the lawyer, another consultant of some sort? It can be very tricky to find this out and so, more often than not, it will end up being classified under ‘Professional Services’. What if it’s actually plumbing or electrical services and should therefore sit under ‘Facilities’? It might be a small value, but what if it’s not? It could be a large amount of spend that is not being accounted for correctly.
Misleading descriptions in spend data can happen easily if the data is not viewed in context. For example, if you only look at the information in the invoice or PO line description column, but not the supplier name, this can lead to misclassification. You might have ‘cleaning’ as a description, but the supplier is Dell or IBM. This completely changes the context of the information from janitorial services to data or computer or data cleaning services.
I once trained up a new team member and they classified LinkedIn, that well known business networking platform, as a restaurant because the description said restaurant. They had not viewed the data in context. The true nature of the spend would be more likely a job advertisement for a restaurant related position, or perhaps some advertising for something restaurant related, or even training in something restaurant related.
The spend was large. Had this not been picked up, there could have been thousands of pounds of spend against a restaurant in ‘Travel’ instead of sitting under ‘Marketing’. That’s where value can also help guide the classification process, which I’ll discuss further in Chapter 4 – Spend Data Classification.
Attention to detail is also key. A large part of the spend data classification process is keyword searches and I’ll cover this later, but when you have descriptions such as ‘hotel cab’ or ‘taxi from restaurant to hotel’ you have to be careful to read the data correctly and not just assume that because a word is in the description, that is the correct option to classify.
It could also be a completely incorrect description. Images are tagged to appear in searches, but what if a cat is tagged as a dog or vice versa? If you are a business and it’s your products that are being tagged incorrectly, then how will your clients or consumers find your products?
If you’re a retailer and you sell your products online, what if they’re not categorised correctly? I’ve experienced this myself where I’ve seen a product I’ve liked, but when I’ve tried to find it again by keyword searches, I haven’t been able to find it. I’ve had to literally scroll through ALL the products and who can be bothered with that?
Then there are the GL (General Ledger) codes. These are used by finance departments to account for financial transactions, but unlike spend data classification where the spend is classified exactly as what it is, GL codes can be used to track an item or a project.
A great example is a client of mine who had their data classified for the first time. Once finished, they had visibility on their spend across the whole business globally, including car leasing. When we looked at one particular supplier, the whole classification was car leasing, but there were four different GL codes assigned, ranging from office supplies to employee benefits. From an accounting perspective, this could potentially be correct, however, from a procurement perspective, this information could be misleading during the decision-making process and lead to bad business decisions. It flagged an issue in the finance department where multiple people had been assigning different codes to the same supplier based on their perception of what it should be. I have a solution for this, which I’ll get on to later.
Missing or incorrect codes
This can be a real issue in the manufacturing and supply chain industries. There are several reasons why a product code might be missing. If it’s an older product then historically it might never have been assigned a code. Or perhaps the code wasn’t available when the product was set up, but no one followed up to add it in once the code had been created.
Then there’s the ‘can’t be bothered’ aspect. We don’t like to think about it, but some people just can’t be bothered to find out the information they need and if they’re not being monitored and know they can get away with it, they’ll continue to set up products with missing information. It could be wider than just the product code, it could be dimensions and weights, which are critical to many different areas of the business.
Just as harmful to the business is an incorrect code. It could be that a code has been mistyped with some numbers mixed up or perhaps one number or letter missing, or it could be something more subtle like a zero being replaced with the letter O. This can all result in duplicate records for these products, the wrong items being ordered, shipped, manufactured or the number inaccurately reported in inventory, etc., resulting in unnecessary expenditure for the business.
No standard formats for addresses
I see this a lot in both supplier and personal data. There are multiple ways that an address can be recorded; sometimes it’s all in one cell, sometimes split over a number of columns, and I’ve seen cities in the county or state column, or the postal or zip code in the city or county column. It’s a mess and is there to some degree in nearly every data set.
Abbreviations are also problematic: Terrace could be Terr, Place – Plc, Road – Rd, Street – St. This can lead to near-duplicates, multiple records and information split between these multiple records, causing incorrect information and reporting to be used in the business.
No standard units of measure
This can cause a lot of issues, especially if you are trying to analyse or report on a specific product.
The list below shows common variants found in the data:
Litre: litres, liter, liters, 1l, 1 l, 1ltr, 1 ltr, 2ltrs, 2 ltrs
Metre: metres, meter, meters, 1m, 1 m, 1mtr, 1 mtr, 2mtrs, 2 mtrs
Centimetre: centimetres, centimeter, centimeters, cm, cms, 1cm, 1 cm, 2cms, 2 cms
Kilometre: kilometres, kilometer, kilometers, km, kms, 1km, 1 km, 2kms, 2 kms
Gram: grams, gramme, grammes, gm, gms, 1gm, 1 gm, 2gms, 2 gms
Kilogram: kilograms, kilo, kilos, 1kg, 1 kg, 2kgs, 2kgs
Pound: pounds, 1lb, 1 lb, 2lbs, 2 lbs.
Even the little things, like whether you decide to have a space between the number and the unit of measure, can cause near duplications. It’s much better to be clear and specific with your team to avoid multiple versions of the same items. If you have a global data set, are you working in the same units of measure or do you need to create a conversion to measure or compare like for like? This also leads us on nicely to the next section – currency.
Currency issues
If you are not aware that the values you are working with are in multiple currencies, you could spend hours trying to get figures to match up. Trust me, I’ve been there! In particular, when working with something like Swedish Krona versus GBP or USD, the values are significantly higher, so it could end up looking like you’ve spent £500 on a taxi.
Incorrect/partially classified spend data
For me, this is worse than not having classified data at all. When clients come to me for help, if they already have classified data, I immediately disregard everything and start again from scratch. Firstly, they wouldn’t be using my services if there wasn’t an issue with their classified data, and, secondly, in terms of time and being efficient, it’s far easier to start again with a clean slate to which you can apply standards that will be consistent and accurate.
Duplicates
The dreaded duplicates. These can appear in many forms, from duplicate invoices to customer/supplier records, to orders, to products and much more. They cause multiple records, which could mean the information is split, resulting in you only seeing part of the picture. Then there are the near-duplicates; don’t even get me started on those. In business, this could be PWC/P.W.C., or with personal information, Robert Smith and Bob Smith. I cover this in more detail in the next section.
The consequences of dirty data
Why does it matter if data isn’t quite right? Who’s going to notice? Like any problem, it’s manageable when it’s small, but gone unnoticed or left to fester, it can become a really big issue. What if your car started making a rattling noise? You wouldn’t leave that to deteriorate, would you? You definitely wouldn’t go on a long road trip and risk being stranded in the middle of nowhere. Likewise, you shouldn’t be making big business decisions based on unclassified, poor quality data.
Let’s go back to the examples we’ve just covered. What’s the worst that can happen?
Misspelt names
If names are misspelt, you could have multiple accounts for the same supplier and you could, for example, have some orders being placed under one account and some under the other.
When it comes to running reports, it means you may only be seeing half the picture if you’ve only selected one of the customers/suppliers. This could have a knock-on effect on areas like forecasting (not planning enough), discounts applied to orders (not giving the customer the correct discount), account management (not being given the right level of service for the amount purchased), sales (forecasting, analytics) and finance (the wrong credit limit, restrictions on purchases).
It’s the same for procurement. If you’re monitoring supplier spend and you have multiple versions, you don’t get a full picture of what you’re buying from that supplier and that could impact negotiations, contract compliance and planning. It also means you have fewer suppliers on your system than you think. If you’re working with global data sets, this can be a significant difference. A project I recently worked on had 41,000 suppliers in the file, but when I normalised them (more on that in Chapter 2), the true number of su...

Índice

  1. Cover
  2. Praise for Between the Spreadsheets
  3. Title Page
  4. Copyright
  5. Contents
  6. Figures
  7. Tables
  8. Acknowledgements
  9. Abbreviations
  10. Introduction
  11. 1 The Dangers of Dirty Data
  12. 2 Supplier Normalisation
  13. 3 Taxonomies
  14. 4 Spend Data Classification
  15. 5 Basic Data Cleansing
  16. 6 Other Methodologies
  17. 7 The Dirty Data Maturity Model
  18. 8 Data Horror Stories
  19. Summary
  20. References
  21. Index
Estilos de citas para Between the Spreadsheets

APA 6 Citation

Walsh. (2021). Between the Spreadsheets ([edition unavailable]). Facet Publishing. Retrieved from https://www.perlego.com/book/3259740/between-the-spreadsheets-classifying-and-fixing-dirty-data-pdf (Original work published 2021)

Chicago Citation

Walsh. (2021) 2021. Between the Spreadsheets. [Edition unavailable]. Facet Publishing. https://www.perlego.com/book/3259740/between-the-spreadsheets-classifying-and-fixing-dirty-data-pdf.

Harvard Citation

Walsh (2021) Between the Spreadsheets. [edition unavailable]. Facet Publishing. Available at: https://www.perlego.com/book/3259740/between-the-spreadsheets-classifying-and-fixing-dirty-data-pdf (Accessed: 15 October 2022).

MLA 7 Citation

Walsh. Between the Spreadsheets. [edition unavailable]. Facet Publishing, 2021. Web. 15 Oct. 2022.