Chapter 1
Role of Statistics and Data Analysis
1.1 Introduction
The purpose of this chapter is to provide an overview of important concepts in data analysis and statistics. Types of data, data evaluation, and an introduction to modeling and estimation are presented. Random variation, sampling, and different statistical paradigms are also introduced. These concepts are investigated in detail in subsequent chapters. An important distinguishing feature in many earth and environmental science analyses is the need for spatial sampling. Problems are described in the context of case studies, which use real data from earth science applications.
1.2 Case Studies
Wherever possible, case studies are used to illustrate methods. Two studies that are used extensively in this and subsequent chapters are water-well yield data and observations from an ice core.
1.2.1 Water-Well Yield Case Study
A concern in many parts of the world is the availability of an adequate supply of fresh water. Planners and managers want to know how much water is available. Scientists want to gain a greater understanding of transport systems and the relationship of water to other geologic phenomena. Homeowners who do not have access to municipal water want to know where to drill for water on their property. A subset of 754 water-well yield observations (water-well yield case study, Appendix I; see the book's Web site) from the Blue Ridge Geological Province, Loudoun County, Virginia (Sutphin et al., 2001) is used to illustrate graphical procedures. The variables are water-well yield in gallons per minute (gpm) for rock type Yg (Yg is a Middle Proterozoic Leucocratic Metagranite) and corresponding coordinates called easting (x-axis) and northing (y-axis). In Chapter 6 spatial applications are discussed.
1.2.2 Ice Core Case Study
Ice core data help scientists understand how Earth's climate works. The U.S. Geological Survey National Ice Core Laboratory (2004) states that “Over the past decade, research on the climate record frozen in ice cores from the Polar Regions has changed our basic understanding of how the climate system works. Changes in temperature and precipitation, which previously we believed, would require many thousands of years to happen were revealed, through the study of ice cores, to have happened in fewer than twenty years. These discoveries have challenged our beliefs about how the climate system works.”
A record that can extend back many thousands of years may include temperature, precipitation, and chemical composition. An example of ice core data (ice core case study, Appendix II; see the book's Web site) submitted to the National Geophysical Data Center (2004) by Arkhipov et al. (1987) has been chosen. The data submitted by Arkhipov et al. are from 1987, come from the Austfonna Ice Cap of the Svalbard Archipelago, and extend to a depth of 566 m. Melting of ice masses is thought to be contributing to sea-level rise. Only data from the first 50 m are presented. In addition to depth, the variables are pH and the concentrations of HCO3 (hydrogen carbonate) and Cl (chlorine), both in milligrams per liter of water.
1.3 Data
Sir Arthur Conan Doyle, physician and writer (1859–1930), noted: “It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.” Data are fundamental to statistics. Most data are obtained from measurements. Increasingly, these measurements are obtained from automated processes such as ground weather stations and satellites. However, field studies are still an important way to collect data. Another important source of data is expert judgment. In areas where few hard data (measurements) are available, such as in the Arctic, experts are called upon to express their opinions.
Data may be rock type, wind speed, orientation of a fault, temperature, and a host of other variables. There are several ways to classify data. Two of the most useful classifications are continuous versus discrete and ratio–interval–ordinal–nominal (Table 1.1). A continuous process generates continuous data. Discrete data typically result from counting or from assigning observations to categories. Continuous data can be ratio or interval. Discrete data are typically ordinal or nominal. Data classification systems help the analyst select appropriate data analytic techniques and models.
Table 1.1 Data Classification Systems.
Continuous vs. Discrete Data
Type and definition | Examples
Continuous: measurements can be made as fine as needed | Temperature, depth, sulfur content, water-well yield
Discrete: data that can be categorized into a classification where only a finite number of values are possible, typically count data | Number of days above freezing, number of water wells producing among a sample of 50 holes

Ratio, Interval, Ordinal, and Nominal Data
Type and definition | Examples
Ratio: continuous data where an interval and ratio are meaningful | Depth, sulfur content
Interval: continuous data with no natural zero | Temperature measured in degrees Celsius
Ordinal: data that are rank ordered | Survey responses such as good, fair, poor; water yields as high, medium, low
Nominal: data that fit into categories; cannot be rank ordered | Location name, rock type
To distinguish between ratio and interval data, consider the meaning of zero. With a ratio scale, zero means an absence of something, such as rainfall. With an interval scale, zero is arbitrary: zero degrees Celsius is not an absence of temperature and has a different meaning than zero degrees Fahrenheit. The terms quantitative and qualitative are also used. Sometimes qualitative data are considered synonymous with nominal data, and sometimes the term simply refers to something subjective or not precisely defined. Categorical data are data classified into categories. The terms categorical and nominal are sometimes used interchangeably.
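The difference between ratio and interval scales can be checked numerically. The following is a minimal sketch in Python using made-up values (not taken from the case studies); the function and variable names are illustrative assumptions. A ratio of two depths is unchanged by a change of units, whereas a ratio of two Celsius temperatures is not preserved when the same temperatures are expressed in Fahrenheit. Differences between temperatures, however, remain comparable up to a constant factor.

```python
def meters_to_feet(m):
    """Convert a depth in meters to feet (ratio scale: zero maps to zero)."""
    return m / 0.3048


def celsius_to_fahrenheit(c):
    """Convert Celsius to Fahrenheit (interval scale: the zero point shifts)."""
    return c * 9.0 / 5.0 + 32.0


# Hypothetical values chosen only for illustration.
depth_a, depth_b = 10.0, 20.0   # meters
temp_a, temp_b = 10.0, 20.0     # degrees Celsius

# Ratio scale: the ratio is the same in meters and in feet.
depth_ratio_m = depth_b / depth_a
depth_ratio_ft = meters_to_feet(depth_b) / meters_to_feet(depth_a)

# Interval scale: the ratio depends on the arbitrary zero point.
temp_ratio_c = temp_b / temp_a
temp_ratio_f = celsius_to_fahrenheit(temp_b) / celsius_to_fahrenheit(temp_a)

# Differences (intervals) differ only by the constant factor 9/5.
diff_c = temp_b - temp_a
diff_f = celsius_to_fahrenheit(temp_b) - celsius_to_fahrenheit(temp_a)

print("depth ratio (m, ft):", depth_ratio_m, depth_ratio_ft)
print("temperature ratio (C, F):", temp_ratio_c, temp_ratio_f)
print("temperature difference (C, F):", diff_c, diff_f)
```

In this sketch, saying that 20 m is twice as deep as 10 m is meaningful in any unit, but saying that 20 degrees Celsius is twice as warm as 10 degrees Celsius is not, because the statement fails once the zero point changes.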
Another way to view data is as primary or secondary. Primary data are collected to answer questions related to a particular study, such as sampling a site to ascertain the level of coal bed methane seepage. Secondary data are collected for some other purpose and may be used as supportive data. Typically, secondary data are historical data. Numerous government agencies routinely collect and publish both types of data on the earth sciences.
In the beginning chapters of this book, properties of a single variable are discussed. This variable may be temperature, water-well yield, or mercury level in fish. A single variable may change over time or space. In later chapters, multivariate data are examined, that is, data where multiple attributes are recorded at each sample point. Most data are multivariate. For example, in a study of climate, the relationships among temperature, atmospheric pressure, and precipitation can be analyzed. Geochemical data often contain dozens of variables.
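One way to picture the difference between a single variable and multivariate data is as a data structure. The sketch below, in Python, is an illustration only: the class name, field names, units, and values are assumptions made for this example and are not data from the book. Each record holds several attributes measured at one sample point, and a single variable is obtained by extracting one attribute from every record.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ClimateObservation:
    """One sample point with several attributes (a multivariate observation)."""
    temperature_c: float      # temperature, degrees Celsius (assumed unit)
    pressure_hpa: float       # atmospheric pressure, hectopascals (assumed unit)
    precipitation_mm: float   # precipitation, millimeters (assumed unit)


# Made-up values for illustration only; not measurements from any study.
observations = [
    ClimateObservation(12.0, 1013.0, 0.0),
    ClimateObservation(14.0, 1009.0, 2.5),
    ClimateObservation(10.0, 1018.0, 0.0),
]

# Univariate view: one variable extracted from the multivariate records.
temperatures = [obs.temperature_c for obs in observations]
mean_temperature = mean(temperatures)

# Multivariate view: every record carries all attributes together.
n_variables = len(ClimateObservation.__dataclass_fields__)

print("temperatures:", temperatures)
print("mean temperature:", mean_temperature)
print("variables per sample point:", n_variables)
```

Keeping the attributes together in one record preserves the link between them at each sample point, which is what allows relationships among the variables to be analyzed.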
1.4 Samples Versus the Population: Some Notation
A critical distinction for the analyst to make is that between a sample and a population. A population comprises all the data of interest in a study. In most earth science applications, the population is large to infinite. In air quality studies, it may be the troposphere. A sample is a subset of a population. A statistic is a number derived from a sample. The method used to obtain a sample (the sampling plan) determines the type of inferences that can be made. Generally, in earth science applications, the sample size will be small with respect to the population size. The notations that are used in this book to represent populations and samples are those commonly used in the statistics literature. Statistics involves the use of random variables. A random variable is a function that maps events into numbers. Each number or range of numbers is assigned a probability. There are two types of random variables: continuous and discrete. For example, a discrete random variab...