CHAPTER 1
R FOR DATA ANALYSIS
1.1 Introduction
1.1.1 Data Analysis
We ask many questions to which we seek answers. Some of these questions involve the way things work in our world, including our social processes and relationships, and our psychological selves. This book describes analyses based on observations that facilitate answering these types of questions. On average, do male managers earn more than female managers at a particular company? Which of two pain medications is more effective? What College GPA do we forecast for an applicant who has an SAT score of 1130 and a High School GPA of 3.8? Do people generally have trust in others?
To answer questions such as these we seek empirical answers, that is, answers based on our observations of the world around us: what we see, hear, touch, taste, and smell, and, in particular, the measurements of what we observe. Our concern in this book is with observations in the form of data, the varying measurements of different people, organizations, places, things, and events.
empirical: Information based on observations acquired from our five senses.
data: Measurements of different people, or of whatever serves as the unit of analysis.
Different measurements generally vary. For example, two different people have different heights, place differing amounts of trust in others, have different blood pressures, and earn different salaries. Height and blood pressure are two of the many variables that can be measured for anyone. For college students, College GPA, incoming High School GPA, and SAT score are measured variables.
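As a minimal sketch of how such measurements are commonly organized in R, the following base R code builds a small data table with one row per student and one column per variable. The data frame name d, the variable names, and all of the values are invented for illustration; they are not data from this book.

```r
# Hypothetical measurements for three college students (values invented)
d <- data.frame(
  HSGPA = c(3.8, 3.2, 3.5),     # incoming High School GPA
  SAT   = c(1130, 1010, 1240),  # SAT score
  GPA   = c(3.4, 2.9, 3.6)      # College GPA
)

# Each column is a variable; each row is one unit of analysis (here, a student)
names(d)        # the variable names
nrow(d)         # the number of students measured
sapply(d, sd)   # standard deviation of each variable, a summary of how much it varies
```

Because the measurements differ from student to student, each variable has a standard deviation greater than zero.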
Data analysis is the application of statistical concepts and methods to transform data, sometimes vast amounts of data, into usable information. This information is then used to form a conclusion regarding the people, organizations, places, or whatever else is the topic of interest. In the modern world, data analysis is done exclusively on the computer. This book is about doing data analysis with one such computer software system, R, enhanced with lessR.
data analysis: Application of statistical methods to transform data into usable information.
1.1.2 R with lessR
Our journey into data analysis with R begins with some good news, some bad news, and fortunately, some more good news. The first good news, as announced in an article in the New York Times (Jan 7, 2009), is that the world of data analysis is rapidly changing. At the heart of this change is the computer application R, extensively used, for example, in the New York Times graphics department. From the 1970s, when software for data analysis on the computer became widely available, until the early 21st century, data analysis was typically accomplished with expensive, proprietary statistical applications. Originally these applications ran only on IBM mainframe computers, but they eventually migrated to PCs as the technology developed.
The cost and exclusivity of competent statistical applications for data analysis are becoming less relevant as the capability and popularity of the R system continue to grow. In terms of pure statistical power to analyze data, R compares favorably to the most expensive commercial applications available, providing all that virtually anyone could desire. The cost is exactly $0.00 USD to use R for the rest of your life on the computer or computers of your choice, for whatever purpose you wish to use the software. Feel free to choose any computer you wish, because R runs identically on Windows, Macintosh, and Linux/Unix. Use R wherever and on whatever you wish, because R is free for you and for the world.
The problem is that although R’s capabilities and price are great, so is the effort required to learn what is generally considered a rather complex system with a steep learning curve. To get much done in R you need to write code. Sometimes you write a little code and sometimes a lot of code. The standard R environment has all the power you need, but is mostly for those who like, or at least tolerate, reading manuals, programming, and then debugging the resulting code. Get your code working right and you have harnessed the power of the program. Or, you might be staring at some cryptic error message you have no idea how to resolve.
Fortunately, not only is R open to the world, but its creators also designed the system so that anyone can contribute by adding extra functionality. Taking advantage of this opportunity, your author developed an extension to R called lessR. Compared to standard R, lessR requires much less R code to accomplish basic data analysis. The addition of lessR to the R environment does not diminish the standard R environment. On the contrary, the lessR enhancements are simply added to what already exists.
R is a true programming language, so the flexibility of R allows almost anything that can be done with data. The reality, however, at least for the vast majority of the standard data analysis topics discussed in this book, is that certain specific steps must always be accomplished. There is no need for everyone to have to figure out and then repeat the same programming to accomplish those steps.
Instead, let lessR do the extra programming. For example, a comprehensive regression analysis with standard R begins with a dozen or so separate R statements, followed by multiple lines of R programming code to organize the results. With lessR, as explained in Chapters 9 and 10, one instruction calls the Regression function to accomplish more than is accomplished with the dozen R statements and the extra programming. The lessR regression procedure taps directly into R’s capabilities, and then organizes the output and delivers several graphs. The appendix illustrates the core R functions upon which lessR depends.
appendix with equivalent R functions, Section 11.6, p. 279
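To give a rough sense of this contrast, the following sketch uses only standard R functions on invented data. It shows a handful of the separate statements that a standard R regression analysis might string together; it is not the full set of a dozen statements, nor the author's own code. The data frame d and its values are hypothetical. The commented lines at the end assume that the lessR package is installed and that Regression accepts a model formula with a data argument; the actual usage is the subject of Chapters 9 and 10.

```r
# Invented data for illustration only
d <- data.frame(
  HSGPA = c(3.8, 3.2, 3.5, 2.9, 3.9, 3.1),  # High School GPA
  GPA   = c(3.4, 2.9, 3.6, 2.7, 3.7, 3.0)   # College GPA
)

# Standard R: a separate statement for each piece of the analysis
fit   <- lm(GPA ~ HSGPA, data = d)  # estimate the regression model
summary(fit)                        # coefficients, R-squared, hypothesis tests
anova(fit)                          # analysis of variance table
confint(fit)                        # confidence intervals for the coefficients
res   <- residuals(fit)             # residuals, one per row of data
cooks <- cooks.distance(fit)        # influence diagnostic, one per row of data
pred  <- predict(fit, newdata = data.frame(HSGPA = 3.8),
                 interval = "prediction")  # forecast with a prediction interval

# With lessR (assumed installed), one call is intended to replace these steps:
# library(lessR)
# Regression(GPA ~ HSGPA, data = d)
```

Each standard R statement returns its own piece of output, which the analyst must then assemble and interpret together.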
Two primary objectives underlie the lessR project, which aims to minimize the programming needed to use R for data analysis.
◦ A data analysis procedure should typically produce desirable output without any extra instructions or information other than the name of the procedure and the relevant variable name or names.
◦ If changes to the default output are desired, such as choosing a new background color for a graph, then simply scan a list of the available options to understand how to provide all the information needed to proceed without writing code.
Let’s get started.
1.2 Access R
1.2.1 Download R
The best way to learn R is to start using R, which is available on many Internet servers around the world. These servers and the information on them comprise CRAN, the Comprehensive R Archive Network. Obtain the latest vers...