Visual Data Mining

Name: Visual Data Mining
Author: Russell K. Anderson

The VisMiner Approach

About This Book

A visual approach to data mining.

Data mining has been defined as the search for useful and previously unknown patterns in large datasets, yet when faced with the task of mining a large dataset, it is not always obvious where to start and how to proceed.

This book introduces a visual methodology for data mining and demonstrates its application through a sequence of exercises using VisMiner. Developed by the author, VisMiner is a powerful visual data mining tool that lets readers see the data they are working on and visually evaluate the models created from that data.

Key features:

Presents visual support for all phases of data mining including dataset preparation.
Provides a comprehensive set of non-trivial datasets and problems with accompanying software.
Features 3-D visualizations of multi-dimensional datasets.
Gives support for spatial data analysis with GIS-like features.
Describes data mining algorithms with guidance on when and how to use them.
Accompanied by VisMiner, a visual software tool for data mining, developed specifically to bridge the gap between theory and practice.

Visual Data Mining: The VisMiner Approach is designed as a hands-on workbook that introduces the methodologies to students in data mining, advanced statistics, and business intelligence courses. It provides a set of tutorials, exercises, and case studies that support students in learning data mining processes.

In praise of the VisMiner approach:

"What we discovered among students was that the visualization concepts and tools brought the analysis alive in a way that was broadly understood and could be used to make sound decisions with greater certainty about the outcomes"
— Dr. James V. Hansen, J. Owen Cherrington Professor, Marriott School, Brigham Young University, USA

"Students learn best when they are able to visualize relationships between data and results during the data mining process. VisMiner is easy to learn and yet offers great visualization capabilities throughout the data mining process. My students liked it very much and so did I."
— Dr. Douglas Dean, Assoc. Professor of Information Systems, Marriott School, Brigham Young University, USA

Information

Publisher: Wiley
Year: 2012
ISBN: 9781118439258
Topic: Mathematics
Subtopic: Probability and Statistics

Introduction

Data mining has been defined as the search for useful and previously unknown patterns in large datasets. Yet when faced with the task of mining a large dataset, it is not always obvious where to start and how to proceed. The purpose of this book is to introduce a methodology for data mining and to guide you in the application of that methodology using software specifically designed to support the methodology. In this chapter, we provide an overview of the methodology. The chapters that follow add detail to that methodology and contain a sequence of exercises that guide you in its application. The exercises use VisMiner, a powerful visual data mining tool which was designed around the methodology.

Data Mining Objectives

Normally in data mining a mathematical model is constructed for the purpose of prediction or description. A model can be thought of as a virtual box that accepts a set of inputs, then uses those inputs to generate output.

Prediction modeling algorithms use selected input attributes and a single selected output attribute from your dataset to build a model. The model, once built, is used to predict an output value based on input attribute values. The dataset used to build the model is assumed to contain historical data from past events in which the values of both the input and output attributes are known. The data mining methodology uses those values to construct a model that best fits the data. The process of model construction is sometimes referred to as training. The primary objective of model construction is to use the model for predictions in the future using known input attribute values when the value of the output attribute is not yet known. Prediction models that have a categorical output are known as classification models. For example, an insurance company may want to build a classification model to predict if an insurance claim is likely to be fraudulent or legitimate.
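The train-then-predict cycle described above can be sketched in a few lines of Python. This is only an illustration, not VisMiner's method: it uses a nearest-centroid classifier, and the attribute names (claim amount, number of prior claims) and all data values are hypothetical.

```python
from collections import defaultdict


def train_nearest_centroid(rows):
    """Build a model from historical rows of (inputs, known_label).

    The "model" is simply the mean input vector (centroid) of each label.
    """
    sums = {}
    counts = defaultdict(int)
    for inputs, label in rows:
        if label not in sums:
            sums[label] = [0.0] * len(inputs)
        for i, value in enumerate(inputs):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: tuple(total / counts[label] for total in totals)
        for label, totals in sums.items()
    }


def predict(model, inputs):
    """Predict the label whose centroid is closest to the given inputs."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(inputs, centroid))

    return min(model, key=lambda label: squared_distance(model[label]))


# Hypothetical historical claims: (claim amount, prior claims) -> known outcome.
history = [
    ((1200.0, 1.0), "legitimate"),
    ((900.0, 0.0), "legitimate"),
    ((1500.0, 1.0), "legitimate"),
    ((9800.0, 4.0), "fraudulent"),
    ((8700.0, 5.0), "fraudulent"),
]

model = train_nearest_centroid(history)  # training
new_claim_label = predict(model, (9000.0, 3.0))  # prediction on unseen input
print(new_claim_label)
```

In practice the inputs would be rescaled first so that one large-valued attribute does not dominate the distance; that step is omitted here to keep the sketch short.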

Prediction models that have numeric output are called regression models. For example, a retailer may use a regression model to predict sales for a proposed new store based on the demographics of the store. The model would be built using data from previously opened stores.
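As a minimal sketch of a regression model, the following fits a straight line by ordinary least squares to a single input attribute. The attribute (nearby population in thousands), the sales figures, and the function name are assumptions made up for illustration; a real store-sales model would use several demographic inputs.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data from previously opened stores.
population_thousands = [10.0, 20.0, 30.0, 40.0]
annual_sales = [25.0, 45.0, 65.0, 85.0]

slope, intercept = fit_line(population_thousands, annual_sales)

# Predict sales for a proposed store serving 50 thousand people.
predicted_sales = slope * 50.0 + intercept
print(predicted_sales)
```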

One special type of regression modeling is forecasting. Forecasting models use time series data to predict future values. They look at trends and cycles in previous periods in making the predictions for future time periods.
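One very simple way to combine a cycle with a trend is sketched below: repeat the most recent cycle and shift it by the average change between the last two cycles. This is an illustrative assumption, not the forecasting technique used in the book, and the quarterly figures are invented.

```python
def seasonal_trend_forecast(series, period):
    """Forecast one full cycle ahead from the last two cycles of history."""
    if len(series) < 2 * period:
        raise ValueError("need at least two full cycles of history")
    last_cycle = series[-period:]
    prev_cycle = series[-2 * period:-period]
    # Trend: change in the average level from one cycle to the next.
    trend = sum(last_cycle) / period - sum(prev_cycle) / period
    # Cycle: reuse the shape of the most recent cycle, shifted by the trend.
    return [value + trend for value in last_cycle]


# Hypothetical quarterly values for two years.
quarterly = [10, 20, 30, 40, 14, 24, 34, 44]
next_year = seasonal_trend_forecast(quarterly, period=4)
print(next_year)
```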

Description models built by data mining algorithms include: cluster, association, and sequence analyses.

Cluster analysis forms groupings of similar observations. The clusterings generated are not normally an end in themselves. They are frequently used to extract subsets from the dataset to which other data mining methodologies may be applied. Because the behavioral characteristics of sub-populations within a dataset may differ greatly, models built using the subsets are frequently more accurate than those built using the entire dataset. For example, the attitude toward, and use of, mass transit by the urban population is quite different from that of the rural population.
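The idea of clustering and then extracting subsets can be sketched with a small k-means routine on a single numeric attribute. The choice of k-means, the naive initialization from the first k values, and the data (hypothetical transit trips per month) are all assumptions for illustration.

```python
def kmeans_1d(values, k, max_iter=100):
    """Cluster numbers into k groups; returns (labels, centroids)."""
    centroids = list(values[:k])  # assumption: naive deterministic start
    labels = None
    for _ in range(max_iter):
        # Assign each value to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: abs(v - centroids[j])) for v in values
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels, centroids


# Hypothetical transit trips per month for six people.
trips = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
labels, centroids = kmeans_1d(trips, k=2)

# Extract one subset per cluster for separate modeling.
subsets = {}
for value, label in zip(trips, labels):
    subsets.setdefault(label, []).append(value)
print(subsets)
```

Each subset could then be passed to a prediction algorithm on its own, which is the use of clustering the paragraph above describes.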

Association analysis looks for sets of items that occur together. Association analysis is also known as market basket analysis due to its application in studies of what consumers buy together. For example, a grocery retailer may find that bread, milk, and eggs are frequently purchased together. Note, however, that this would not be considered a real data mining discovery, since data mining is more concerned with finding the unexpected patterns rather than the expected.
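A minimal sketch of market basket analysis counts how often each pair of items appears in the same basket and keeps the pairs whose support (share of baskets containing both items) meets a threshold. The baskets and the threshold are hypothetical, and real association algorithms handle larger item sets and far more data.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return {(item_a, item_b): support} for pairs meeting min_support."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) gives each pair one canonical order per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(baskets)
    return {
        pair: count / n for pair, count in counts.items() if count / n >= min_support
    }


# Hypothetical grocery baskets.
baskets = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs", "bread"],
    ["apples", "milk"],
]

pairs = frequent_pairs(baskets, min_support=0.5)
print(pairs)
```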

Sequence analysis is similar to association analysis, except that it looks for groupings over time. For example, a women's clothing retailer may find that within two weeks of purchasing a pair of shoes, the customer may return to purchase a handbag. In bioinformatics, DNA studies frequently make use of sequence analysis.
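The time-ordered nature of sequence analysis can be illustrated by counting how many customers who bought one item went on to buy another within a given number of days. The function name, the record layout, and the purchase data below are hypothetical.

```python
from datetime import date


def followed_within(purchases, first, then, max_days):
    """Count customers who bought `then` within max_days after `first`.

    purchases: iterable of (customer, purchase_date, item).
    Returns (customers_with_follow_up, customers_who_bought_first).
    """
    by_customer = {}
    for customer, when, item in purchases:
        by_customer.setdefault(customer, []).append((when, item))

    bought_first = 0
    followed = 0
    for events in by_customer.values():
        first_dates = [d for d, item in events if item == first]
        if not first_dates:
            continue
        bought_first += 1
        then_dates = [d for d, item in events if item == then]
        if any(0 < (t - f).days <= max_days for f in first_dates for t in then_dates):
            followed += 1
    return followed, bought_first


# Hypothetical purchase history.
purchases = [
    ("ann", date(2012, 3, 1), "shoes"),
    ("ann", date(2012, 3, 10), "handbag"),
    ("bea", date(2012, 3, 5), "shoes"),
    ("bea", date(2012, 4, 20), "handbag"),
    ("cat", date(2012, 3, 7), "handbag"),
]

result = followed_within(purchases, "shoes", "handbag", max_days=14)
print(result)
```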

Introduction to VisMiner

VisMiner is a software tool designed to visually support the entire data mining process. It is intended to be used in a course setting both for individual student use and classroom lectures when the processes of data mining are presented. During lectures, students using VisMiner installed on desktop, laptop, tablet computers, and smart phones are able to actively participate with the instructor as datasets are analyzed and the methodology is examined.

The architecture of VisMiner is represented in Figure 1.1. It consists of four main components:

the Control Center, which manages the datasets, starts and stops the modelers and viewers, and coordinates synchronization between viewers
VisSlave and ModelSlave, which establish the connections between a slave computer and the Control Center
the modelers that execute the sophisticated data mining algorithms
the viewers that present interactive visualizations of the datasets and the models generated using the datasets.

Figure 1.1 VisMiner Architecture

As evidenced by Figure 1.1, VisMiner may run on one or more computers. The primary computer runs the Control Center. Computers that will present visualizations should run VisSlave; computers that will be used for back-end processing should run ModelSlave. In the full configuration of VisMiner, there should be just one instance of the Control Center executing, and as many instances of VisSlave and ModelSlave as there are computers available for their respective purposes. If there is only one computer, use it to run all three applications.

The Data Mining Process

Successful data mining requires a potentially time-consuming and methodical process. That's why they call it “mining”. Gold prospectors don't buy their gear, head out and discover gold on the first day. For them it takes months or even years of search. The same is true with data mining. It takes work, but hopefully not months or years.

In this book, we present a methodology. VisMiner is designed to support and streamline the methodology. The methodology consists of four steps:

Initial data exploration – conduct an initial exploration of the data to gain an overall understanding of its size and characteristics, looking for clues that should be explored in more depth.
Dataset preparation – prepare the data for analysis.
Algorithm application – select and apply data mining algorithms to the dataset.
Results evaluation – evaluate the results of the algorithm applications, assessing the “goodness of fit” of the data to the algorithm results and assessing the nature and strengths of inputs to the algorithm outputs.

These steps are not necessarily sequential in nature, but should be considered as an iterative process progressing towards the end result – a complete and thorough analysis. Some of the steps may even be completed in parallel. This is true for “Initial data exploration” and “dataset preparation”. In VisMiner for example, interactive visualizations designed primarily for the initial data exploration also support some of the dataset preparation tasks.

In the sections that follow, we elaborate on the tasks to be completed in each of the steps. In later chapters, problems and exercises are presented that guide you through completion of these tasks using VisMiner. Throughout the book, reference is made back to the task descriptions introduced here. It is suggested that as you work through the problems and exercises, you refer back to this list. Use it as a reminder of what has and has not been completed.

Initial Data Exploration

The primary objective of initial data exploration is to help the analyst gain an overall understanding of the dataset. This includes:

Dataset size and format – Determine the number of observations in the dataset. How much space does it occupy? In what format is it stored? Possible formats include tab or comma delimited text files, fixed field text files, tables in a relational database, and pages in a spreadsheet. Since most datasets stored in a relational database are encoded in the proprietary format of the database management system used to store the data, check that you have access to software that can retrieve and manipulate the content. Look also at the number of tables containing data of interest. If found in multiple tables, determine how they are linked and how they might be joined.
Attribute enumeration – Begin by browsing the list of attributes contained in the dataset and the corresponding types of each attribute. Understand what each attribute represents or measures and the units in which it is encoded. Look for identifier or key attributes – those that uniquely identify observations in the dataset.
Attribute distributions – For numeric types, determine the range of values in the dataset, then look at the shape and symmetry or skew of the distribution. Does it appear to approximate a normal distribution or some other distribution? For nominal (categorical) data, look at the number of unique values (categories) and the proportion of observations belonging to each category. For example, suppose that you have an attribute called CustomerType. The first thing that you want to determine is the number of different CustomerTypes in the dataset and the proportions of each.
Identification of sub-populations – Look for attribute distributions that are multimodal – that is, distributions that have multiple peaks. Such distributions indicate that the observations in the dataset are drawn from multiple sub-populations with potentially different distributions. It is possible that these sub-populations could generate very different models when submitted in isolation to the data mining algorithms as compared to the model generated when submitting the entire dataset. For example, in some situations the purchasing behavior of risk-taking individuals may be quite different from that of risk-averse individuals.
Pattern search – Look for potentially interesting and significant relationships (or patterns) between attributes. If your data mining objective is the generation of a prediction model, focus on relationships between your selected output attribute and attributes that may be considered for input. Note the type of the relationship – linear or non-linear, direct or inverse. Ask the question, “Does this relationship seem reasonable?” Also look at relationships between potential input attributes. If they are highly correlated, then you probably want to eliminate all but one as you conduct in-depth analyses.
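VisMiner supports these exploration tasks visually; the same checks can also be sketched numerically. The following assumes a dataset already loaded into plain Python lists and shows three of the tasks above: category proportions for a nominal attribute, range and skew for a numeric attribute, and the Pearson correlation between two potential inputs. All function names and data values are hypothetical.

```python
import math
from collections import Counter


def category_proportions(values):
    """Share of observations in each category of a nominal attribute."""
    counts = Counter(values)
    n = len(values)
    return {category: count / n for category, count in counts.items()}


def numeric_summary(values):
    """Range, mean, and skewness of a numeric attribute."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    # Skewness near 0 suggests a symmetric distribution.
    skew = 0.0 if std == 0 else sum(((v - mean) / std) ** 3 for v in values) / n
    return {"min": min(values), "max": max(values), "mean": mean, "skew": skew}


def pearson(xs, ys):
    """Pearson correlation; values near +1 or -1 flag redundant inputs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical attribute values.
customer_type = ["retail", "retail", "wholesale", "retail"]
proportions = category_proportions(customer_type)
summary = numeric_summary([1.0, 2.0, 3.0, 4.0, 5.0])
r = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(proportions, summary, r)
```

A correlation close to 1 between two candidate inputs, as in the last call, is the situation in which the text suggests keeping only one of them.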

Dataset Preparation

The objective of dataset preparation is to change or morph the dataset into a form that allows the dataset to be submitted to a data mining algorithm for analysis. Tasks include:

Observation reduction – Frequently there is no need to analyze the full dat...

Table of contents

Cover
Title Page
Copyright
Preface
Acknowledgments
Chapter 1: Introduction
Chapter 2: Initial Data Exploration and Dataset Preparation Using VisMiner
Chapter 3: Advanced Topics in Initial Exploration and Dataset Preparation Using VisMiner
Chapter 4: Prediction Algorithms for Data Mining
Chapter 5: Classification Models in VisMiner
Chapter 6: Regression Analysis
Chapter 7: Cluster Analysis
Appendix A: VisMiner Reference by Task
Appendix B: VisMiner Task/Tool Matrix
Appendix C: IP Address Look-up
Index