eBook - ePub

Applied Univariate, Bivariate, and Multivariate Statistics Using Python

Name: Applied Univariate, Bivariate, and Multivariate Statistics Using Python
Author: Daniel J. Denis

A Beginner's Guide to Advanced Data Analysis

English

About This Book

Applied Univariate, Bivariate, and Multivariate Statistics Using Python

A practical, "how-to" reference for anyone performing essential statistical analyses and data management tasks in Python

Applied Univariate, Bivariate, and Multivariate Statistics Using Python delivers a comprehensive introduction to a wide range of statistical methods performed using Python in a single, one-stop reference. The book contains user-friendly guidance and instructions on using Python to run a variety of statistical procedures without getting bogged down in unnecessary theory. Throughout, the author emphasizes a set of computational tools used in the discovery of empirical patterns, as well as several popular statistical analyses and data management tasks that can be immediately applied.

Most of the datasets used in the book are small enough to be easily entered into Python manually, though they can also be downloaded for free from www.datapsyc.com. Only minimal knowledge of statistics is assumed, making the book perfect for those seeking an easily accessible toolkit for statistical analysis with Python. Applied Univariate, Bivariate, and Multivariate Statistics Using Python represents the fastest way to learn how to analyze data with Python.

Readers will also benefit from the inclusion of:

A review of essential statistical principles, including types of data, measurement, significance tests, significance levels, and type I and type II errors
An introduction to Python, exploring how to communicate with Python
A treatment of exploratory data analysis, basic statistics, and visual displays, including frequencies and descriptives, q-q plots, box-and-whisker plots, and data management
An introduction to topics such as ANOVA, MANOVA and discriminant analysis, regression, principal components analysis, factor analysis, cluster analysis, among others, exploring the nature of what these techniques can vs. cannot do on a methodological level

Perfect for undergraduate and graduate students in the social, behavioral, and natural sciences, Applied Univariate, Bivariate, and Multivariate Statistics Using Python will also earn a place in the libraries of researchers and data analysts seeking a quick go-to resource for univariate, bivariate, and multivariate analysis in Python.

Information

Publisher: Wiley
Year: 2021
ISBN: 9781119578185
Topic: Mathematics
Subtopic: Probability and Statistics

1 A Brief Introduction and Overview of Applied Statistics

CHAPTER OBJECTIVES

How probability is the basis of statistical and scientific thinking.
Examples of statistical inference and thinking in the COVID-19 pandemic.
Overview of how null hypothesis significance testing (NHST) works.
The relationship between statistical inference and decision-making.
Error rates in statistical thinking and how to minimize them.
The difference between a point estimator and an interval estimator.
The difference between continuous and discrete variables.
Appreciating a few of the more salient philosophical underpinnings of applied statistics and science.
Understanding scales of measurement: nominal, ordinal, interval, and ratio.
Data analysis, data science, and “big data” distinctions.

The goal of this first chapter is to provide a global overview of the logic behind statistical inference and how it is the basis for analyzing data and addressing scientific problems. Statistical inference, in one form or another, has existed at least going back to the Greeks, even if it was only relatively recently formalized into a complete system. What unifies virtually all of statistical inference is probability. Without probability, statistical inference could not exist, and thus much of modern-day statistics would not exist either (Stigler, 1986).

When we speak of the probability of an event occurring, we are seeking to know the likelihood of that event. Of course, that explanation is not useful, since all we have done is replace the word probability with the word likelihood. What we need is a more precise definition. Kolmogorov (1903–1987) established basic axioms of probability and was thus influential in the mathematics of modern-day probability theory. An axiom in mathematics is a statement that is assumed to be true without requiring any proof or justification. This is unlike a theorem, which is considered true only if it can be rigorously justified, usually by appeal to other established mathematical results. Though the axioms help establish the mathematics of probability, they surprisingly do not help us define exactly what probability actually is. Some statisticians, scientists, and philosophers hold that probability is a relative frequency, while others find it more useful to consider probability as a degree of belief. An example of a relative frequency would be flipping a coin 100 times and observing the number of heads that result. If that number is 40, then we might estimate the probability of heads on the coin to be 0.40, that is, 40/100. However, this number can also reflect our degree of belief in the probability of heads, in which case we have based our belief on a relative frequency. There are cases, however, in which relative frequencies are not so easily obtained or are virtually impossible to estimate, such as the probability that COVID-19 will become a seasonal disease. Often, experts in the area have to provide good guesstimates based on prior knowledge and their clinical opinion. These probabilities are best considered subjective probabilities, as they reflect a degree of belief or disbelief in a theory rather than a strict relative frequency.
Historically, scholars who maintain that probability can be nothing more than a relative frequency are often called frequentists, while those who believe it is a degree of belief are usually called Bayesians, because Bayesian statistics regularly employs subjective probabilities in its development and operations. A discussion of Bayesian statistics is well beyond the scope of this chapter and book. For an excellent introduction, as well as a general introduction to the rudiments of statistical theory, see Savage (1972).
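
The relative-frequency reading of the coin example can be sketched in a few lines of Python. This is only an illustration: the function name `relative_frequency`, the simulated fair coin, and the seed are assumptions introduced here, not material from the book.

```python
import random


def relative_frequency(outcomes, event="H"):
    """Return the proportion of outcomes that equal `event`."""
    return sum(1 for o in outcomes if o == event) / len(outcomes)


# The example from the text: 40 heads observed in 100 flips.
observed = ["H"] * 40 + ["T"] * 60
estimate = relative_frequency(observed)
print(estimate)  # 40/100 = 0.4

# Simulated flips of a hypothetical fair coin (seed chosen arbitrarily).
# With more flips, the relative frequency tends to settle near 0.5.
rng = random.Random(0)
for n in (100, 10_000, 100_000):
    flips = [rng.choice("HT") for _ in range(n)]
    print(n, relative_frequency(flips))
```

A frequentist would treat the long-run value of this ratio as the probability itself, whereas the estimate of 0.40 from 100 flips can equally be read as a degree of belief informed by data.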

When you think about it for a moment, virtually all things in the world are probabilistic. As a recent example, consider the COVID-19 pandemic of 2020. Since the start of the outbreak, questions involving probability were front and center in virtually all media discussions. That is, the undertones of probability, science, and statistical inference were present virtually wherever the pandemic was discussed. Concepts of probability could not be avoided. The following are just a few of the questions asked during the pandemic:

What is the probability of contracting the virus, and does this probability vary as a function of factors such as pre-existing conditions or age? In this latter case, we might be interested in the conditional probability of contracting COVID-19 given a pre-existing condition or advanced age. For example, if someone suffers from heart disease, is that person at greater risk of acquiring the infection? That is, what is the probability of COVID-19 infection conditional on someone already suffering from heart disease or other ailments?
What proportion of the general population has the virus? Ideally, researchers wanted to know how many people world-wide had contracted the virus. This constituted a case of parameter estimation, where the parameter of interest was the proportion of cases world-wide having the virus. Since this number was unknown, it was typically estimated based on sample data by computing a statistic (i.e. in this case, a proportion) and using that number to infer the true population proportion. It is important to understand that the statistic in this case was a proportion, but it could have also been a different function of the data. For example, a percentage increase or decrease in COVID-19 cases was also a parameter of interest to be estimated via sample data across a particular period of time. In all such cases, we wish to estimate a parameter based on a statistic.
What proportion of those who contracted the virus will die of it? That is, what is the estimated total death count from the pandemic, from beginning to end? Statistics such as these involved projections of death counts over a specific period of time and relied on already established model curves from similar pandemics. Scientists who study infectious diseases have historically documented the likely (i.e. read: “probabilistic”) trajectories of death rates over a period of time, which incorporates estimates of how quickly and easily the virus spreads from one individual to the next. These estimates were all statistical in nature. Estimates often included confidence limits and bands around projected trajectories as a means of estimating the degree of uncertainty in the prediction. Hence, projected estimates were in the opinion of many media types “wrong,” but this was usually due to not understanding or appreciating the limits of uncertainty provided in the original estimates. Of course, uncertainty limits were sometimes quite wide, because predicting death rates was very difficult to begin with. When one models relatively wide margins of error, one is protected, in a sense, from getting the projection truly wrong. But of course, one needs to understand what these limits represent, otherwise they can be easily misunderstood. Were the point estimates wrong? Of course they were! We knew far before the data came in that the point projections would be off. Virtually all point predictions will always be wrong. The issue is whether the data fell in line with the prediction bands that were modeled (e.g. see Figure 1.1). If a modeler sets them too wide, then the model is essentially quite useless. For instance, had we said the projected number of deaths would be between 1,000 and 5,000,000 in the USA, that does not really tell us much more than we could have guessed by our own estimates not using data at all! 
Be wary of “sophisticated models” that tell you about the same thing (or even less!) than you could have guessed on your own (e.g. a weather model that predicts cold temperatures in Montana in December, how insightful!).
Measurement issues were also at the heart of the pandemic (though rarely addressed by the media). What exactly constituted a COVID-19 case? Differentiating between individuals who died “of” COVID-19 vs. died “with” COVID-19 was paramount, yet was often ignored in early reports. However, the question was central to everything! “Another individual died of COVID-19” does not mean anything if we do not know the mechanism or etiology of the death. Quite possibly, COVID-19 was a correlate of death in many cases, not a cause. That is, within a typical COVID-19 death could lie a virtually infinite number of possibilities that “contributed,” in a sense, to the death. Perhaps one person died primarily from the virus, whereas another person died because they already suffered from severe heart disease, and the addition of the virus simply complicated the overall health issue and overwhelmed them, which essentially caused the death.
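
The first two questions above, conditional probability and estimating a population proportion from a sample, can be illustrated with a small sketch. All counts below are hypothetical and chosen only for easy arithmetic; they are not real COVID-19 data. The interval shown is the simple normal-approximation (Wald) interval, used here as one common choice rather than as the method the book prescribes.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical sample counts (illustration only, not real COVID-19 data).
counts = {
    ("heart_disease", "infected"): 30,
    ("heart_disease", "not_infected"): 170,
    ("no_heart_disease", "infected"): 40,
    ("no_heart_disease", "not_infected"): 760,
}

n = sum(counts.values())
n_heart_disease = (counts[("heart_disease", "infected")]
                   + counts[("heart_disease", "not_infected")])
n_infected = (counts[("heart_disease", "infected")]
              + counts[("no_heart_disease", "infected")])

# Conditional probability: among those with heart disease, what share is infected?
p_infected_given_hd = counts[("heart_disease", "infected")] / n_heart_disease

# Sample proportion: a statistic used to estimate the population parameter.
p_hat = n_infected / n

# Normal-approximation (Wald) 95% interval around the point estimate.
z = NormalDist().inv_cdf(0.975)
margin = z * sqrt(p_hat * (1 - p_hat) / n)
interval = (p_hat - margin, p_hat + margin)

print(f"P(infected | heart disease) = {p_infected_given_hd:.2f}")  # 30/200 = 0.15
print(f"Estimated proportion infected = {p_hat:.2f}")              # 70/1000 = 0.07
print(f"Approximate 95% interval = ({interval[0]:.3f}, {interval[1]:.3f})")
```

In this made-up sample, the conditional probability (0.15) is higher than the overall proportion (0.07), which is the kind of difference the heart disease question is asking about. The interval conveys the uncertainty in the point estimate.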

**Figure 1.1** Sample death predictions in the United States during the COVID-19 pandemic in 2020 (a scatterplot of the combined forecast of deaths). The connected dots toward the right of the plot (beyond the break in the line) represent a **point prediction** for the given period (the dots toward the left are actual deaths based on prior time periods), while the shaded area represents a **band of uncertainty**. From the current date in the period of October 2020 forward (the time in which the image was published), the shaded area increases in order to reflect greater uncertainty in the estimate. Source: CDC (Centers for Disease Control and Prevention); Materials Developed by CDC. Used with Permission. Available at CDC (www.cdc.gov) free of charge.
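
The distinction between a point prediction and a band of uncertainty can be made concrete with a short sketch. The `Forecast` class and every number below are hypothetical, invented purely for illustration; they are not taken from the CDC figure.

```python
from dataclasses import dataclass


@dataclass
class Forecast:
    """A point prediction with a lower and upper uncertainty limit."""
    point: float
    lower: float
    upper: float

    def covers(self, observed: float) -> bool:
        return self.lower <= observed <= self.upper

    def width(self) -> float:
        return self.upper - self.lower


# Hypothetical weekly forecasts; the band widens further into the future.
forecasts = [
    Forecast(point=5000, lower=4200, upper=5900),
    Forecast(point=5200, lower=4000, upper=6600),
    Forecast(point=5400, lower=3700, upper=7400),
]
observed = [5150, 4800, 6900]  # hypothetical values observed later

point_errors = [obs - f.point for f, obs in zip(forecasts, observed)]
coverage = sum(f.covers(obs) for f, obs in zip(forecasts, observed)) / len(forecasts)
widths = [f.width() for f in forecasts]

print("Point errors:", point_errors)    # every point prediction is off
print("Share of bands covering the data:", coverage)
print("Band widths:", widths)

# A band that is too wide covers almost anything and says very little.
useless = Forecast(point=2_500_000, lower=1_000, upper=5_000_000)
print("Very wide band covers 6900:", useless.covers(6900), "width:", useless.width())
```

Every point prediction here misses, yet each observed value falls inside its band. That is the sense in which a forecast should be judged against its limits of uncertainty, while the last example shows why limits that are too wide carry little information.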

To elaborate on the above point somewhat, measurement issues abound in scientific research and are extremely important, even when what is being measured is seemingly, at least at first glance, relatively simple and direct. If there are issues with how best to measure something like “COVID death,” just imagine where else they surface. In psychological research, for instance, measurement is even more challenging, and in many cases adequate measurement is simply not possible. This is why some natural scientists do not give much psychological research its due (at least in particular subdivisions of psychology), because they are doubtful that the measurement of such characteristics as anxiety, intelligence, and many other things is even possible. Self-reports are also usually fraught with difficulty. Hence, assessing the degree of depression present may seem pointless to someone who believes that a self-report of such symptoms is meaningless. “But I did a complex statistical analysis using my self-report data.” It doesn’t matter if you haven’t convinced the reader that what you’re analyzing was successfully measured. The most important component of a house is its foundation. Some scientists would require a more definite “marker,” such as a biological gene or other more physical characteristic or behavioral observation, before they take your ensuing statistical analysis seriously. Statistical complexity usually does not advance a science on its own. Resolution of measurement issues is more often the paramount problem to be solved.

The key point from the above discussion is that with any research, with any scientific investigation, scientists are typically interested in estimating population parameters based on information in samples. This occurs by way of probability, and hence one can say that virtually the entire edifice of statistical and scientific inference is based on the theory of probability. Even when probability is not explicitly invoked, for instance in the case of the easy result in an experiment (e.g. 100 rats live who received COVID-19 treatment and 100 control rats die who did not receive treatment), the elements of probability are still present, as we will now discuss in surveying at a very intuitive level how classical hypothesis testing works in the sciences.
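
To see why probability is still present in the "easy" rat result, one can ask how likely that outcome would be if the treatment made no difference at all. The sketch below uses a simple randomization (hypergeometric) argument, which is an assumption chosen here for illustration and not necessarily the approach the book develops in the next section. The variable names are likewise hypothetical.

```python
from math import comb

n_treated = 100    # rats that received the treatment
n_control = 100    # rats that did not
n_survivors = 100  # rats that lived (all of them treated, in the example)

total = n_treated + n_control

# If treatment were irrelevant, every set of 100 survivors among the
# 200 rats would be equally likely.
ways_total = comb(total, n_survivors)

# Only one such set consists of exactly the 100 treated rats.
ways_extreme = comb(n_treated, n_survivors) * comb(n_control, 0)

p_extreme = ways_extreme / ways_total
print(f"Probability of this split by chance alone: {p_extreme:.3e}")
```

The resulting probability is astronomically small, which is why nobody bothers to compute it for such an obvious result. The inference that the treatment worked still rests on that probability being tiny.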

1.1 How Statistical Inference Works

Armed with some examples of the COVID-19 pandemic, we can quite easily illustrate the process of statistical inferen...

Cover
Title page
Copyright
Table of Contents
Preface
Chapter 1: A Brief Introduction and Overview of Applied Statistics
Chapter 2: Introduction to Python and the Field of Computational Statistics
Chapter 3: Visualization in Python: Introduction to Graphs and Plots
Chapter 4: Simple Statistical Techniques for Univariate and Bivariate Analyses
Chapter 5: Power, Effect Size, P-Values, and Estimating Required Sample Size Using Python
Chapter 6: Analysis of Variance
Chapter 7: Simple and Multiple Linear Regression
Chapter 8: Logistic Regression and the Generalized Linear Model
Chapter 9: Multivariate Analysis of Variance (MANOVA) and Discriminant Analysis
Chapter 10: Principal Components Analysis
Chapter 11: Exploratory Factor Analysis
Chapter 12: Cluster Analysis
References
Index
End User License Agreement