Automated Data Collection with R

A Practical Guide to Web Scraping and Text Mining

Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis

About This Book

A hands-on guide to web scraping and text mining for both beginners and experienced users of R

  • Introduces fundamental concepts of the architecture of the web and of databases, and covers HTTP, HTML, XML, JSON, and SQL.
  • Provides basic techniques to query web documents and data sets (XPath and regular expressions).
  • An extensive set of exercises is presented to guide the reader through each technique.
  • Explores both supervised and unsupervised techniques as well as advanced techniques such as data scraping and text management.
  • Case studies are featured throughout, along with examples for each technique presented.
  • R code and solutions to exercises featured in the book are provided on a supporting website.


Information

Publisher: Wiley
Year: 2014
ISBN: 9781118834800
Edition: 1

1
Introduction

Are you ready for your first encounter with web scraping? Let us start with a small example that you can recreate directly on your machine, provided you have R installed. The case study gives a first impression of the book's central themes.

1.1 Case study: World Heritage Sites in Danger

The United Nations Educational, Scientific and Cultural Organization (UNESCO) is an organization of the United Nations which, among other things, fights for the preservation of the world's natural and cultural heritage. As of today (November 2013), there are 981 heritage sites, most of which are man-made, like the Pyramids of Giza, but natural phenomena like the Great Barrier Reef are listed as well. Unfortunately, some of the awarded places are threatened by human intervention. Which sites are threatened, and where are they located? Are there regions of the world where sites are more endangered than in others? What are the reasons that put a site at risk? These are the questions we want to examine in this first case study.
What do scientists always do first when they want to get up to speed on a topic? They look it up on Wikipedia! Checking out the page on World Heritage Sites, we stumble across a list of currently and previously endangered sites at http://en.wikipedia.org/wiki/List_of_World_Heritage_in_Danger. When you access the link, you find a table listing the current sites. It contains the name, the location (city, country, and geographic coordinates), the type of danger facing the site, the year the site was added to the World Heritage list, and the year it was put on the list of endangered sites. Let us investigate how the sites are distributed around the world.
Wikipedia—information source of choice
While the table holds information on the places, it is not immediately clear where they are located and whether they are regionally clustered. Rather than trying to eyeball the table, it could be very useful to plot the locations of the places on a map. As humans deal well with visual information, we will try to visualize results whenever possible throughout this book. But how to get the information from the table to a map? This sounds like a difficult task, but with the techniques that we are going to discuss extensively in the next pages, it is in fact not. For now, we simply provide you with a first impression of how to tackle such a task with R. Detailed explanations of the commands in the code snippets are provided later and more systematically throughout the book.
To start, we have to load a couple of packages. While R only comes with a set of basic, mostly math- and statistics-related functions, it can easily be extended by user-written packages. For this example, we load the following packages using the library() function:1
 R> library(stringr)
 R> library(XML)
 R> library(maps)
In the next step, we load the data from the webpage into R. This can be done easily using the readHTMLTable() function from the XML package:
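The snippet below is a minimal sketch of this step, not necessarily the exact code from the book. The URL, the functions htmlParse() and readHTMLTable(), and the object names heritage_parsed and tables are taken from the text; the encoding and stringsAsFactors arguments are assumptions:

 R> url <- "http://en.wikipedia.org/wiki/List_of_World_Heritage_in_Danger"
 R> # parse the HTML document into an R representation (UTF-8 encoding is an assumption)
 R> heritage_parsed <- htmlParse(url, encoding = "UTF-8")
 R> # extract every HTML table found in the parsed document into a list of data frames
 R> tables <- readHTMLTable(heritage_parsed, stringsAsFactors = FALSE)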
We are going to explain the mechanics of this step and all other major web scraping techniques in more detail in Chapter 9. For now, all you need to know is that we are telling R that the imported data come in the form of an HTML document. R is capable of interpreting HTML, that is, it knows how tables, headlines, or other objects are structured in this file format. This works via a so-called parser, which is called with the function htmlParse(). In the next step, we tell R to extract all HTML tables it can find in the parsed object heritage_parsed and store them in a new object tables. If you are not already familiar with HTML, you will learn in Chapter 2 how HTML tables are constructed from a common set of code components. The readHTMLTable() function helps in identifying and reading out these tables.
All the information we need is now contained in the tables object. This object is a list of all the tables the function could find in the HTML document. After eyeballing all the tables, we identify and select the one we are interested in (the second table) and write it into a new object named danger_table. Some of the variables in our table are of no further interest, so we select only those that contain information about the site's name, location, criterion of heritage (cultural or natural), year of inscription, and year of endangerment. The variables in our table have been assigned unhandy names, so we relabel them. Finally, we have a look at the names of the first few sites:
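A sketch of this selection and relabeling step is given below. The object name danger_table and the variable names are those used in the text; the column positions are illustrative assumptions, so inspect the imported table before relying on them:

 R> danger_table <- tables[[2]]   # the second table holds the sites currently in danger
 R> # keep only name, location, criterion, year of inscription, and year of endangerment
 R> # (the column positions are assumptions; check names(danger_table) first)
 R> danger_table <- danger_table[, c(1, 3, 4, 6, 7)]
 R> colnames(danger_table) <- c("name", "locn", "crit", "y_ins", "y_end")
 R> head(danger_table$name)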
This seems to have worked. Additionally, we perform some simple data cleaning, a step often necessary when importing web-based content into R. The variable crit, which indicates whether the site is of cultural or natural character, is recoded, and the two variables y_ins and y_end are turned into numeric ones.2 Some of the entries in the y_end variable are ambiguous as they contain several years. We select the last given year in the cell. To do so, we specify a so-called regular expression, which goes [[:digit:]]{4}$—we explain what this means in the next paragraph:
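The lines below sketch one way to carry out this cleaning. The variable names and the regular expression follow the text; the raw labels assumed for crit and the exact recoding are assumptions:

 R> # recode the criterion; the raw labels ("Cultural"/"Natural") are an assumption
 R> danger_table$crit <- ifelse(str_detect(danger_table$crit, "Natural"), "nat", "cult")
 R> # turn the year of inscription into a numeric variable
 R> danger_table$y_ins <- as.numeric(danger_table$y_ins)
 R> # for ambiguous entries, keep only the last four-digit year in the cell
 R> danger_table$y_end <- as.numeric(str_extract(danger_table$y_end, "[[:digit:]]{4}$"))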
The locn variable is a bit of a mess, exemplified by three cases drawn from the dataset:
[three example entries of the locn variable]
The variable contains the name of the site's location, the country, and the geographic coordinates in several varieties. What we need for the map are the coordinates, given by the latitude (e.g., 30.84167N) and longitude (e.g., 29.66389E) values. To extract this information, we have to use some more advanced text manipulation tools called “regular expressions”, which are discussed extensively in Chapter 8. In short, we have to give R an exact description of what the information we are interested in looks like, and then let R search for and extract it. To do so, we use functions from the stringr package, which we will also discuss in detail in Chapter 8. In order to get the latitude and longitude values, we write the following:
The first regular expression
 R> reg_y <- "[/][ -]*[[:digit:]]*[.]*[[:digit:]]*[;]"
 R> reg_x <- "[;][ -]*[[:digit:]]*[.]*[[:digit:]]*"
 R> y_coords <- str_extract(danger_table$locn, reg_y)
 R> y_coords <- as.numeric(str_sub(y_coords, 3, -2))
 R> danger_table$y_coords <- y_coords
 R> x_coords <- str_extract(danger_table$locn, reg_x)
 R> x_coords <- as.numeric(str_sub(x_coords, 3, -1))
 R> danger_table$x_coords <- x_coords
 R> danger_table$locn <- NULL
Do not be confused by the first two lines of code. What looks like the result of a monkey typing on a keyboard is in fact a precise description of the coordinates in the locn variable. The information is contained in the locn variable as decimal degrees as well as in degrees, minutes, and seconds. As the decimal degrees are easier to describe with a regular expression, we try to extract those. Writing regular expressions means finding a general pattern for strings that we want to extract. We observe that latitudes and longitudes always appear after a slash and are a sequence of several digits, separated by a dot. Some values start with a minus sign. Both values are separated by a semicolon, which is cut off along with the empty spaces and the slash. When we apply this pattern to the locn variable with the str_extract(...
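To see what the two patterns pick up, here is a quick check on an invented string that mimics the decimal-degrees part of a locn entry (the string is not drawn from the actual dataset):

 R> example_locn <- "Abu Mena, Egypt / 30.84167; 29.66389"        # invented example
 R> str_extract(example_locn, reg_y)                              # "/ 30.84167;"
 R> as.numeric(str_sub(str_extract(example_locn, reg_y), 3, -2))  # 30.84167
 R> str_extract(example_locn, reg_x)                              # "; 29.66389"
 R> as.numeric(str_sub(str_extract(example_locn, reg_x), 3, -1))  # 29.66389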

Table of contents

  1. Cover
  2. Title Page
  3. Copyright
  4. Dedication
  5. Preface
  6. Chapter 1: Introduction
  7. Part One: A Primer on Web and Data Technologies
  8. Part Two: A Practical Toolbox for Web Scraping and Text Mining
  9. Part Three: A Bag of Case Studies
  10. References
  11. General index
  12. Package index
  13. Function index
  14. End User License Agreement