Pocket Primer
eBook - ePub


  1. 250 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About This Book

As part of the best-selling Pocket Primer series, this book is designed to introduce the reader to the basic concepts of managing data using a variety of computer languages and applications. It is a fast-paced introduction to basic features of data management, covering statistical concepts, data-related techniques, features of Pandas, RDBMS, SQL, NLP topics, Matplotlib, and data visualization. Companion files with source code and color figures are available.

FEATURES:

  • Covers Pandas, RDBMS, NLP, data cleaning, SQL, and data visualization
  • Introduces probability and statistical concepts
  • Features numerous code samples throughout
  • Includes companion files with source code and figures

Pocket Primer by Oswald Campesato is available in PDF and ePUB format, in the category Computer Science & Programming in Python.

CHAPTER 1

INTRODUCTION TO PROBABILITY AND STATISTICS

This chapter introduces you to concepts in probability as well as to an assortment of statistical terms and algorithms.
The first section of this chapter starts with a discussion of probability, shows how to calculate the expected value of a set of numbers (with associated probabilities), introduces the concept of a random variable (discrete and continuous), and provides a short list of some well-known probability distributions.
The second section of this chapter introduces basic statistical concepts, such as mean, median, mode, variance, and standard deviation, along with simple examples that illustrate how to calculate these terms. You will also learn about the terms RSS, TSS, R^2, and F1 score.
The third section of this chapter introduces Gini Impurity, entropy, perplexity, cross-entropy, and KL divergence. You will also learn about skewness and kurtosis.
The fourth section explains covariance and correlation matrices and how to calculate eigenvalues and eigenvectors.
The fifth section explains principal component analysis (PCA), which is a well-known dimensionality reduction technique. The final section introduces you to Bayes’ Theorem.

What Is a Probability?

If you have ever performed a science experiment in one of your classes, you might remember that measurements have some uncertainty. In general, we assume that there is a correct value, and we endeavor to find the best estimate of that value.
When we work with an event that can have multiple outcomes, we try to define the probability of an outcome as the chance that it will occur, which is calculated as follows:
 p(outcome) = (# of times the outcome occurs)/(total number of outcomes) 
For example, in the case of a single balanced coin, the probability of tossing a head “H” equals the probability of tossing a tail “T”:
 p(H) = 1/2 = p(T) 
The set of probabilities associated with the outcomes {H, T} is shown in the set P:
 P = {1/2, 1/2} 
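The frequency interpretation above can be checked empirically. The following is a minimal Python sketch (an illustration, not code from the book's companion files) that tosses a simulated balanced coin 100,000 times and reports the relative frequency of heads:

```python
import random

# Toss a simulated balanced coin many times and estimate p(H)
# as (# of heads) / (total number of tosses).
random.seed(0)  # fixed seed so the run is reproducible
tosses = [random.choice("HT") for _ in range(100_000)]
freq_H = tosses.count("H") / len(tosses)
print(round(freq_H, 3))
```

The relative frequency converges toward the true probability 1/2 as the number of tosses grows (the law of large numbers).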
Some experiments involve replacement while others involve nonreplacement. For example, suppose that an urn contains 10 red balls and 10 green balls. What is the probability that a randomly selected ball is red? The answer is 10/(10+10) = 1/2. What is the probability that the second ball is also red?
There are two scenarios with two different answers. If each ball is selected with replacement, then each ball is returned to the urn after selection, which means that the urn always contains 10 red balls and 10 green balls. In this case, the answer is 1/2 * 1/2 = 1/4. In fact, with replacement, each selection is independent of all previous selections.
On the other hand, if balls are selected without replacement, then the answer is 10/20 * 9/19. As you undoubtedly know, card games are also examples of selecting cards without replacement.
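The urn calculations can be verified with exact rational arithmetic; the following sketch uses Python's fractions module, with the same counts of 10 red and 10 green balls as above:

```python
from fractions import Fraction

red, green = 10, 10
total = red + green

# With replacement: the second draw is independent of the first.
p_with = Fraction(red, total) * Fraction(red, total)

# Without replacement: one red ball is gone before the second draw.
p_without = Fraction(red, total) * Fraction(red - 1, total - 1)

print(p_with)     # 1/4
print(p_without)  # 9/38
```

Note that 10/20 * 9/19 reduces to 9/38, slightly less than 1/4: once a red ball has been removed, a second red ball is less likely.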
One other concept is called conditional probability, which refers to the likelihood of the occurrence of event E1 given that event E2 has occurred. A simple example is the following statement:
 "If it rains (E2), then I will carry an umbrella (E1)." 

Calculating the Expected Value

Consider the following scenario involving a well-balanced coin: whenever a head appears, you earn $1, and whenever a tail appears, you also earn $1. If you toss the coin 100 times, how much money do you expect to earn? Since you earn $1 regardless of the outcome, the expected value (in fact, the guaranteed value) is $100.
Now consider this scenario: whenever a head appears, you earn $1, and whenever a tail appears, you earn nothing. If you toss the coin 100 times, how much money do you expect to earn? You probably determined the value 50 (which is the correct answer) by making a quick mental calculation. A more formal derivation of E (the expected earnings) is shown here:
 E = 100 *[1 * 0.5 + 0 * 0.5] = 100 * 0.5 = 50 
The quantity 1 * 0.5 + 0 * 0.5 is the amount of money you expect to earn during each coin toss (half the time you earn $1 and half the time you earn nothing), and multiplying this number by 100 gives the expected earnings after 100 coin tosses. Also note that you might never earn exactly $50: the actual amount that you earn can be any integer between 0 and 100 inclusive.
As another example, suppose that you earn $3 whenever a head appears, and you lose $1.50 whenever a tail appears. Then the expected earnings E after 100 coin tosses are shown here:
 E = 100 * [3 * 0.5 - 1.5 * 0.5] = 100 * 0.75 = 75 
We can generalize the preceding calculations as follows. Let P = {p1, . . . ,pn} be a probability distribution, which means that the values in P are nonnegative and their sum equals 1. In addition, let R = {R1, . . . ,Rn} be a set of rewards, where reward Ri is received with probability pi. Then the expected value E after N trials is shown here:
 E = N * [SUM pi*Ri] 
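The formula E = N * [SUM pi*Ri] translates directly into a few lines of Python. The helper function below is an illustrative sketch (the function name is my own, not from the book), applied to the two coin scenarios described above:

```python
def expected_value(probs, rewards, trials):
    """Return trials * sum(p_i * R_i) for a probability distribution."""
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
    return trials * sum(p * r for p, r in zip(probs, rewards))

# The two coin scenarios described above:
print(expected_value([0.5, 0.5], [1, 0], 100))     # 50.0
print(expected_value([0.5, 0.5], [3, -1.5], 100))  # 75.0
```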
In the case of a single balanced die, we have the following probabilities:
 p(1) = 1/6, p(2) = 1/6, p(3) = 1/6, p(4) = 1/6, p(5) = 1/6, p(6) = 1/6 
 P = {1/6, 1/6, 1/6, 1/6, 1/6, 1/6} 
As a simple example, suppose that the earnings are {3, 0, −1, 2, 4, −1} when the values 1, 2, 3, 4, 5, 6, respectively, appear when tossing the single die. Then after 100 trials our expected earnings are calculated as follows:
 E = 100 * [3 + 0 - 1 + 2 + 4 - 1]/6 = 100 * 7/6 ≈ 116.67 
In the case of two balanced dice, we have the following probabilities of rolling 2, 3, . . . , or 12:
 p(2) = 1/36, p(3) = 2/36, ..., p(12) = 1/36 
 P = {1/36, 2/36, 3/36, 4/36, 5/36, 6/36, 5/36, 4/36, 3/36, 2/36, 1/36} 
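The two-dice probabilities can be derived by enumerating the 36 equally likely (die1, die2) pairs and counting how many pairs produce each sum; a short sketch:

```python
from collections import Counter
from fractions import Fraction

# Enumerate all 36 equally likely outcomes of two balanced dice
# and count how many pairs produce each sum.
counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
P = {s: Fraction(counts[s], 36) for s in range(2, 13)}

for s in sorted(P):
    print(s, P[s])  # e.g., p(2) = 1/36, p(7) = 1/6, p(12) = 1/36
```

The counts rise from 1 (for the sum 2) to 6 (for the sum 7) and fall back to 1 (for the sum 12), matching the distribution P listed above.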

Random Variables

A random variable is a variable that can have multiple values, where each value has an associated probability of occurrence. For example, if we let X be a random variable whose values are the outcomes of tossing a well-balanced die, then the values of X are the numbers in the set {1, 2, 3, 4, 5, 6}. Moreover, each of those values can occur with equal probability (which is 1/6).
In the case of two well-balanced dice, let X be a random variable whose values can be any of the numbers in the set {2, 3, 4, . . . , 12}. Then the associated probabilities for the different values for X are listed in the previous section.

Discrete versus Continuous Random Variables

The preceding section contains examples of discrete random variables because the list of possible values is either finite or countably infinite (such as the set of integers). As an aside, the set of rational numbers is also countably infinite, but the sets of irrational numbers and of real numbers are both uncountably infinite (proofs are available online). As pointed out earlier, the associated set of probabilities must form a probability distribution, which means that the probability values are nonnegative and their sum equals 1.
A continuous random variable is a random variable whose values can be any number in an interval, which can be an uncountably infinite number of distinct values. For example, the amount of time required to perform a task is represented by a continuous random variable.
A continuous random variable also has a probability distribution that is represented as a continuous function. The constraint for such a variable is that the area under the curve (which is sometimes calculated via a mathematical integral) equals 1.
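The area-under-the-curve constraint can be checked numerically. The sketch below approximates the integral of the standard normal density with a Riemann sum; the interval [-10, 10] is wide enough that the truncated tails are negligible:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the Gaussian distribution with mean mu and std dev sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Riemann sum: the total area under the curve should be (approximately) 1.
dx = 0.001
area = sum(normal_pdf(-10 + i * dx) * dx for i in range(int(20 / dx)))
print(round(area, 4))  # 1.0
```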

Well-Known Probability Distributions

There are many probability distributions, and some of the well-known probability distributions are listed here:
  • Gaussian distribution
  • Poisson distribution
  • chi-squared distribution
  • binomial distribution
The Gaussian distribution is named after Carl Friedrich Gauss, and it is sometimes called the normal distribution or the bell curve. The Gaussian distribution is symmetric: the shape of the curve to the left of the mean is identical to the shape of the curve to the right of the mean. As an example, the distribution of IQ scores follows a curve that is similar to a Gaussian distribution.
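A quick empirical illustration of this symmetry, using IQ-style scores with an assumed mean of 100 and standard deviation of 15 (illustrative values, not a claim about real IQ data):

```python
import random

random.seed(42)  # reproducible run
samples = [random.gauss(100, 15) for _ in range(100_000)]

mean = sum(samples) / len(samples)
below = sum(s < 100 for s in samples) / len(samples)

print(round(mean, 1))   # close to 100
print(round(below, 2))  # close to 0.5 -- about half the mass lies on each side of the mean
```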
Furthermore, the frequency of traffic at a given point in a road follows a Poisson distribution (which is not symmetric). Interestingly, if you count the number of people who go to a public pool based on 5˚ (Fa...

Table of contents

  1. Cover
  2. Title Page
  3. Copyright Page
  4. Dedication
  5. Contents
  6. Preface
  7. Chapter 1: Introduction to Probability and Statistics
  8. Chapter 2: Working with Data
  9. Chapter 3: Introduction to Pandas
  10. Chapter 4: Introduction to RDBMS and SQL
  11. Chapter 5: Working with SQL and MySQL
  12. Chapter 6: NLP and Data Cleaning
  13. Chapter 7: Data Visualization
  14. Index